Audit Mode
When given a URI-R, audit mode (Figure 1) can be used to replay the mementos of the URI-R from two public web archives. Figure 2 shows the current process used for audit mode. Before starting audit mode, a configuration file needs to be created that includes the URI-R and a list of web archives that we want to use for audit mode.
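As a rough illustration of the configuration step, the sketch below loads a small JSON configuration and checks that it names a URI-R and two web archives. The JSON format, the key names `uri_r` and `archives`, and the helper `load_audit_config` are assumptions made for this example; the actual configuration file used by audit mode may look different.

```python
import json

# Hypothetical configuration; the key names are assumptions, not the real format.
EXAMPLE_CONFIG = """
{
  "uri_r": "https://cs.odu.edu",
  "archives": ["Wayback Machine", "Archive-It"]
}
"""

def load_audit_config(text):
    """Parse the configuration text and return (uri_r, archives)."""
    config = json.loads(text)
    uri_r = config.get("uri_r")
    if not uri_r:
        raise ValueError("configuration must include a URI-R")
    archives = config.get("archives", [])
    # Audit mode replays mementos from two web archives side by side.
    if len(archives) != 2:
        raise ValueError("audit mode compares exactly two web archives")
    return uri_r, archives

uri_r, archives = load_audit_config(EXAMPLE_CONFIG)
```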
At the beginning of audit mode, Selenium is used to set up the web browsers and display the context information needed for audit mode. After the web browsers are set up, MemGator is used to retrieve URI-Ms based on the URI-R and web archives specified in the configuration file. (A URI-M is a URI for a memento.) Currently this mode supports viewing archived web pages from the Wayback Machine, archive.today, and Archive-It. We will add support for other web archives in the future.
After retrieving the URI-Ms, we pair URI-Ms from the different archives based on Memento-Datetime, so that we are comparing mementos that were archived around the same time. Currently the default threshold is 30 days, so if the difference between Memento-Datetimes for two archived web pages is over 30 days, then the archived web pages will not be replayed at the same time during the livestream.
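The pairing step can be sketched as follows. This is a minimal illustration, not the code used in the livestream: the function name `pair_mementos`, the greedy closest-match strategy, and the input shape (lists of `(Memento-Datetime, URI-M)` tuples) are all assumptions. Only the 30-day default threshold comes from the description above.

```python
from datetime import datetime, timedelta

def pair_mementos(archive_a, archive_b, threshold=timedelta(days=30)):
    """Pair mementos from two archives that were captured close in time.

    Each argument is a list of (memento_datetime, uri_m) tuples.
    A pair is kept only if the datetimes differ by at most `threshold`.
    """
    remaining = sorted(archive_b)
    pairs = []
    for memento_a in sorted(archive_a):
        if not remaining:
            break
        # Pick the not-yet-paired memento from the second archive
        # whose Memento-Datetime is closest to this one.
        closest = min(remaining, key=lambda m: abs(m[0] - memento_a[0]))
        if abs(closest[0] - memento_a[0]) <= threshold:
            pairs.append((memento_a, closest))
            remaining.remove(closest)
    return pairs

# Placeholder labels stand in for real URI-Ms.
first_archive = [(datetime(2023, 1, 1), "a1"), (datetime(2023, 6, 1), "a2")]
second_archive = [(datetime(2023, 1, 20), "b1"), (datetime(2023, 9, 1), "b2")]
pairs = pair_mementos(first_archive, second_archive)
```

Here only `a1` and `b1` are paired, because they are 19 days apart. The mementos `a2` and `b2` are 92 days apart, so they would not be replayed together.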
The next step is replaying the unique URI-Ms. To decide whether a URI-M is unique, we compute the SimHash of the web page’s HTML source and then take the Hamming distance between the previous unique URI-M’s SimHash and the current URI-M’s SimHash. I decided to use SimHash and Hamming distance because TMVis (paper: “Visualizing Webpage Changes Over Time”) uses these two algorithms when determining unique mementos. TMVis shows how a web page has changed over time by displaying screenshots of the unique mementos of a web page. Figure 3 shows an example of two SimHashes whose Hamming distance is large enough for the default threshold of 4. Since the archived web pages are shown beside each other during the audit, as in Figure 1, we can see the similarities and differences between two mementos of a web page that were archived around the same time by different web archives.
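To make the uniqueness check concrete, here is a minimal sketch of SimHash and Hamming distance in Python. It is not the implementation used by audit mode or TMVis. The word-level tokenization, the 64-bit fingerprint built from SHA-256, the function names, and the choice to treat a distance of at least the threshold as "unique" are all assumptions made for illustration.

```python
import hashlib
import re

def simhash(text, bits=64):
    """Compute a simple SimHash fingerprint over the word tokens of `text`."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def unique_mementos(mementos, threshold=4):
    """Return the URI-Ms whose HTML differs enough from the previous unique one.

    `mementos` is a chronological list of (uri_m, html) tuples.
    Assumption: a distance of at least `threshold` counts as a change.
    """
    unique = []
    previous_hash = None
    for uri_m, html in mementos:
        current_hash = simhash(html)
        if previous_hash is None or hamming_distance(previous_hash, current_hash) >= threshold:
            unique.append(uri_m)
            previous_hash = current_hash
    return unique
```

Because each memento is compared against the previous unique memento rather than its immediate predecessor, a series of small changes can still add up to a new unique memento once the accumulated distance reaches the threshold.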
After replaying all of the unique mementos of the web page, a final message is displayed (Figure 4) that lists the number of unique mementos of the web page and the total number of mementos of the web page for each web archive.
Listed below are some demos of audit mode:
Auditing Wayback Machine and Archive-It for the URI-R: https://cs.odu.edu
Future Work
Some future updates we could make for audit mode are adding support for more web archives, adding an option for comparing visual similarity between archived web pages, and measuring replay performance metrics. In the future, audit mode could use any active web archive that is supported by MemGator, since we use MemGator to get the URI-Ms for the audit. During the audit, we could determine the image similarity between screenshots of the two archived web pages that were archived around the same time, so that we can automatically detect when the content on the archived web pages differs between the two web archives. Also, during the audit, we could use tools like Memento Damage and Web Archiving Screenshot Compare to measure replay performance metrics, which could be shown during results mode and used for an automated gaming livestream.
Summary
We have added a new mode, named audit mode, to our web archiving livestreams. This mode audits web archives by replaying unique mementos of a URI-R from two different web archives, which is different from replay mode, where only recently archived mementos are replayed. Viewing the replay of two mementos from different web archives is useful for comparing which content is missing from each memento and for checking whether the content of a web page differs based on personalization or location, or changes each time the web page is loaded. Currently audit mode works with the Wayback Machine, Archive-It, and archive.today. In the future we plan on integrating audit mode with more web archives.
-Travis Reid (@TReid803)