Hello Internet and Archivists!
I'm back for another blog post and I'm excited to share some overhauls, updates, and research that I have been working on for the Memento Damage service, previously developed by Dr. Justin Brunelle, Erika Siregar, and Grant Atkins. The project page is still running the original project build, but I wanted to share some behind-the-scenes updates before we roll out the new build soon!
Under the Hood
Fig. 1: Homepage for the Memento Damage web service
When I took on this project, many components needed updating due to age; the code base was still on Python 2, having sat untouched over the years as Dr. Brunelle and previous students graduated and moved on to other endeavors. Updating the code base to Python 3 was one of the top to-do items! After a lot of time spent learning the code base and refactoring, I was able to clean up and modernize the code a bit, thanks to the syntax and language updates that come with Python 3.
Memento Damage had previously used the popular web automation library PhantomJS to do the heavy lifting for page crawls. Unfortunately, development of this library was suspended back in 2018, so there was a need to migrate to a new crawling framework. The most suitable replacements I found were Selenium, Puppeteer, and Playwright.
- Selenium is designed a bit more broadly than the other two libraries. It supports multiple browsers, including Chrome and Firefox, and is suited for both general-purpose and specialized web automation tasks.
- Puppeteer is maintained by the Chrome DevTools team and provides a high-level API for controlling the Chrome or Chromium browser specifically.
- Playwright is a popular library maintained by Microsoft. It is used for end-to-end web automation across all major browsers and shares a similar syntax with Puppeteer, with some differences in debugging tools, HTML handling, and page interactions.
Ultimately, I chose to go with Puppeteer, as the project operates on a single browser anyway, and it offered the type of control I was looking for. Playwright is a promising library, though, and holds potential as a future replacement or upgrade. The syntactical operations of PhantomJS and Puppeteer differ quite a bit, with Puppeteer taking a more asynchronous approach to web automation and requiring a more involved setup to install and initialize a browser client and page. Puppeteer's robust nature has many benefits, though, such as a development team working close to the browser itself and tight integration with Chrome DevTools and the associated Chrome DevTools Protocol. Its major flaw is its tight coupling to specific Chrome/Chromium binary versions, with Puppeteer kept updated in lockstep with the browser.
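For a rough sense of the shift in style, here is a minimal sketch of the asynchronous setup Puppeteer requires (the target URL is just a placeholder, and the actual crawler code is more involved):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launching a browser and opening a page are explicit, awaited steps
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate and wait for network activity to settle before any analysis
  await page.goto('https://example.com/', { waitUntil: 'networkidle2' });

  // ... page analysis would happen here ...

  await browser.close();
})();
```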
The Memento Damage project was already using Docker to package itself for distribution, and I have found that the best way to handle this coupling is to use Puppeteer's official Docker image. The image is based on a Node.js Docker image, so modifying it to fit was not too difficult. The more challenging aspect of web automation in this environment is handling the security and sandboxing trade-offs within Docker. Many of the flags commonly used to make Puppeteer and Docker work together are not an option for every IT environment or use case. One of the best methods I have come across for working with Puppeteer in such restrictive environments is to use a JSON-formatted file called a seccomp profile to explicitly allow the system calls that Chrome requires.
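As a sketch of the idea (the image and file names here are placeholders, not the project's actual names), the profile is handed to Docker at run time, so the container can keep Chrome's sandbox enabled without loosening its overall security settings:

```bash
# Hypothetical invocation: chrome-seccomp.json enumerates the syscalls
# Chrome needs, so the container does not have to disable the sandbox
docker run --security-opt seccomp=./chrome-seccomp.json memento-damage
```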
Learning Web Automation
Prior to working on the Memento Damage project, I had no practical experience in web automation and had only really begun to get into web coding about a year before joining the Web Science & Digital Libraries group. To say that learning about the nuances of the Web is a little overwhelming is a bit of an understatement 😅
Fig. 2: Cartoon illustration by Peter Schrank, originally found on The Economist.
In addition to the 'system' updates, I've also updated the way the project parses and dereferences URLs, and added the ability to pass multiple URLs to the server's command-line interface as a CSV file. The Web interface is geared more toward presentation and still accepts only a single URL to be analyzed.
WARC files are a huge staple of web archiving. I have updated the Memento Damage service to handle WARCs by utilizing the third-party service ReplayWeb.page, by the Webrecorder project. I am still working out some bugs in this area, but I hope to have it completely good to go soon ;) A nice feature of ReplayWeb.page is that it can be utilized from the public gateway or self-hosted locally for security and redundancy.
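As a rough sketch of how this can work, the crawler can point the browser at a ReplayWeb.page instance and hand it the WARC's location via the `source` query parameter. The exact URL shape (especially the fragment used to select the replayed page) is an assumption here, not the service's final implementation:

```javascript
// (inside an async context, with `page` from the earlier Puppeteer sketch)
// gateway could be https://replayweb.page/ or a self-hosted instance
const gateway = 'https://replayweb.page/';
const warcUrl = 'https://example.org/crawl.warc.gz';  // placeholder WARC location
const targetUrl = 'https://example.com/';             // page inside the WARC

// ReplayWeb.page accepts the WARC location as a "source" query parameter;
// the fragment used to select the replayed page is an assumption
const replayUrl = `${gateway}?source=${encodeURIComponent(warcUrl)}` +
                  `#view=pages&url=${encodeURIComponent(targetUrl)}`;
await page.goto(replayUrl, { waitUntil: 'networkidle2' });
```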
Trace and Template
In taking on the Memento Damage project, I have been collaborating with a wide-ranging team on the Collaborative Software Archiving for Institutions (CoSAI) project, funded by the Sloan Foundation. We are led by Vicky Rampin from NYU Libraries and Dr. Martin Klein from Los Alamos National Laboratory, and work with many others, including Wilkie from the OCCAM project, Talya Cooper, Lyudmila Balakireva, and four members of ODU's Web Science & Digital Libraries group: Dr. Michael Nelson, Dr. Michele Weigle, Emily Escamilla, and myself.
Working with such a diverse group, and building on conclusions from prior work on the Memento Damage project, has reinforced that damage, in general, is a difficult thing to approximate. Web pages come in all manners of layout and style, and how much something should be considered damaged varies from page to page and person to person, such that no universal solution will suffice. Assigning weights to every individual page on the Web is hardly feasible either. I have been experimenting with page templates as a middle ground for this problem, in which custom weights can be assigned to various domains and large groups of homogeneously arranged pages (a sketch of the idea follows below). As the CoSAI project's interests lie in the archival of software projects, a bit of extra attention has naturally been given to damage to pages for software projects across multiple repository hosting sites, such as GitHub, GitLab, Bitbucket, and SourceForge. For further reading, check out some of Emily Escamilla's publications on this topic, such as "It's Not Just GitHub: Identifying Data and Software Sources Included in Publications" and "The Rise of GitHub in Scholarly Publications".
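To illustrate what such a template might look like (the structure and field names here are entirely hypothetical, for illustration only), a template could map a URL pattern for a repository hosting site to weights for the element classes that matter most on those pages:

```javascript
// Hypothetical template: custom element weights for GitHub repository pages
const githubRepoTemplate = {
  match: /^https?:\/\/(www\.)?github\.com\/[^/]+\/[^/]+/,
  weights: {
    readme: 0.40,       // the rendered README is the core content of a repo page
    codeListing: 0.30,
    images: 0.15,       // e.g. example figures referenced by the README
    navigation: 0.10,
    badges: 0.05
  }
};
```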
As part of the CoSAI project, I have also been working on a proposal for a more specialized version of templates that integrates with the CoSAI project's Memento Tracer browser extension. The Memento Tracer extension allows users to capture and archive web pages based on their interactions with a page. These "traces" can then be utilized for the efficient, large-scale capture of similar web pages. The proposed integration would enable traces to provide contextual information in their outputs that the Memento Damage service can use to measure and return calculated damage values for a traced page.
Fig. 3: Memento Tracer UI with proposed weighted element input
Updated Damage Scoring
With respect to the calculation of page damage, I have tried to match the overall damage calculations with those of Dr. Brunelle's work on the project where I can, with some adjustments to page element weighting and normalization. A flaw in the original damage code was that pages could potentially be scored over 100% when accounting for the various weights and importance scores applied to different elements. I have been working to rebalance these calculations such that all page damage is properly normalized between 0 and 100%. To this end, I have rewritten many parts of the code pertaining to calculating damage in an attempt to make them more accurate and account for more variables and context, while aligning with the related work from the CoSAI group.
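As a simple sketch of the normalization idea (not the service's exact algorithm, and the names are mine), dividing the weighted damage actually observed by the maximum weighted damage the page could sustain keeps the final score between 0 and 100%:

```javascript
// elements: [{ damage, weight }], where damage is in [0, 1]
// and weight reflects the element's relative importance
function normalizedDamage(elements) {
  let observed = 0;
  let potential = 0;
  for (const { damage, weight } of elements) {
    observed += damage * weight;
    potential += weight;  // maximum damage for any element is 1.0
  }
  // Guard against empty pages; report as a percentage
  return potential > 0 ? (observed / potential) * 100 : 0;
}
```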
Stylesheets do not have a visual size or placement on a page, but they do carry other useful information, such as the number of utilized bytes, the directives they contain, and the number of rules referenced by various page elements, all of which can be calculated manually. JavaScript is the more difficult of the two coded component groups to access and evaluate, as JavaScript code often tends to be minified and purposely obfuscated, making it difficult to read in its final, production form. JavaScript code also has a tendency to call other, subsequent JavaScript code, making it difficult to measure the true impact of a given JavaScript file as it opaquely loads resources into a page. Luckily, Puppeteer's tight coupling with the Chrome browser and DevTools comes in handy here: it can monitor and extract useful quantifiers for assessing the extent to which each of these "non-visual" files is utilized by the Web page, so that they can be assigned an appropriate damage score alongside their visual counterparts. Despite this, what we are able to see and measure might still only be a partial glimpse of the complete picture. I am still learning a lot in working with Web development diagnostic tools and hope to improve my results in this area in the future.
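Puppeteer's coverage API, for example, can report how many bytes of each stylesheet and script a page actually exercised. A minimal sketch (again assuming the `page` object from the earlier sketch, and with the damage bookkeeping elided):

```javascript
// Start collecting CSS and JS coverage before navigation
await Promise.all([
  page.coverage.startCSSCoverage(),
  page.coverage.startJSCoverage()
]);

await page.goto('https://example.com/', { waitUntil: 'networkidle2' });

const [cssEntries, jsEntries] = await Promise.all([
  page.coverage.stopCSSCoverage(),
  page.coverage.stopJSCoverage()
]);

for (const entry of [...cssEntries, ...jsEntries]) {
  // Sum the byte ranges the page actually used from this file
  const usedBytes = entry.ranges.reduce((n, r) => n + (r.end - r.start), 0);
  const usedRatio = usedBytes / entry.text.length;
  console.log(`${entry.url}: ${(usedRatio * 100).toFixed(1)}% utilized`);
}
```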
The evaluation of CSS and JavaScript components is handled first in the overall calculation algorithm, as this data is used to inform the subsequent damage values for visual elements on the page. Visual page components present indications of damage more directly, unlike coded components, which can often be opaque as to their damage potential. For visual elements, presence, size, placement, rendered appearance, interactivity, and other factors, such as density with regard to more compounded or logically grouped components, can all be used to derive a damage estimation. Ultimately, each page element has a base value synthesized from its various precursor quantifiers. These base values can then be combined with a series of multipliers to augment their value, as seen below in Figure 4. Some of these multipliers are implicitly calculated, while others can be set or overridden using templates and then applied to matching elements. A location multiplier, for instance, can be set on an element if its position is determined to have some importance based on visual page analysis or information provided by a template.
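As a sketch of how a final element score might be composed (the multiplier names here are illustrative; Figure 4 shows the actual composition):

```javascript
// Hypothetical composition: an element's final damage contribution is its
// base value scaled by a set of context multipliers
function elementDamage(base, multipliers) {
  return Object.values(multipliers).reduce((value, m) => value * m, base);
}

// e.g. an element with base value 0.6, positioned somewhere important
const score = elementDamage(0.6, { location: 1.5, size: 1.2, interactivity: 1.0 });
```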
Fig. 4: Composition of element base and various multipliers
I have also been working on mechanisms to better present and visualize the page damage assessed by the Memento Damage service. The service already grabs a screenshot of the page during its crawl process; now, as it calculates page damage, it can generate annotations showing the damaged areas of the page. The service can optionally generate a full-page screenshot highlighting the damaged areas, as well as individual highlights, cropped only to the damaged region, with their individual damage values. Figure 5 below shows an example of highlights on a GitHub repository page, here marking missing reference images that were meant to serve as examples and provide extra context for the repository.
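In Puppeteer, the cropped highlights fall out of an element's bounding box fairly naturally. A minimal sketch (the selector is a placeholder, and the real service does more bookkeeping than this):

```javascript
// Full-page screenshot for the annotated overview
await page.screenshot({ path: 'page-full.png', fullPage: true });

// Crop a screenshot to a single damaged element's bounding box
const element = await page.$('img.example-figure');  // placeholder selector
if (element) {
  const box = await element.boundingBox();
  if (box) {
    await page.screenshot({ path: 'damaged-element.png', clip: box });
  }
}
```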
Fig. 5: Damaged GitHub page with missing example images (http://web.archive.org/web/20211201080913/https://github.com/uwasystemhealth/time_domain_beamforming)
In Summary
As a TL;DR of these updates:
- The underlying base container image has been migrated to Puppeteer's official image, pending further revision.
- The project code base has been rewritten in modern Python 3 for its Pythonic components, and the crawler has been rewritten for Puppeteer, moving away from the deprecated PhantomJS library. I have also been looking into Crawlee, as it partly uses Puppeteer under the hood and might provide some useful features, and Playwright, as alternative frameworks to reduce the dependency on Puppeteer.
- The page analysis code has been updated to improve damage scoring accuracy and to give service consumers new ways to visualize the estimated page damage.
- New features include replay and damage estimation for offline pages archived to WARC files, the ability to process multiple URLs (over the command line or API; the Web portal will remain presentational and handle individual URLs), specialized processing for hosted Git repository pages, and integration with the Memento Tracer project.
- Integration between the Memento Tracer and Memento Damage services is still a work in progress, and updates are coming along from both sides.

I hope to have more updates for everyone soon as the build gets closer to release, and that the Web archiving ecosystem might find them beneficial!
- David