Channel: Web Science and Digital Libraries Research Group

2023-11-22: Auditing Web Archiving Livestreams


Figure 1: Using audit mode to replay mementos of https://oduwsdl.github.io/ from the Wayback Machine and archive.today

While working on the Game Walkthroughs and Web Archiving project, we created web archiving livestreams in which viewers can watch two web crawlers archive a set of seed URIs and then watch the replay of the archived web pages. We recently created a new mode that audits web archives, so that we can view archived web pages, or mementos, from different web archives at the same time. Viewing two mementos from different web archives is useful when the content of the original web page could vary based on personalization or location, or could differ each time the web page was loaded. Audit mode lets viewers watch an audit of two web archives for the same URI-R. (A URI-R identifies the live web version of a web resource.) In audit mode, we replay all of the unique mementos associated with a given URI-R for two web archives. Viewing the replay of the mementos allows the user and viewers to see which resources are missing from a memento and to compare the content of two mementos from different web archives.

Previously, we created a mode similar to audit mode, called replay mode. The main difference is that replay mode replays mementos that were recently archived by two web archiving tools during archive mode, whereas audit mode shows all unique mementos for one URI-R from two different web archives. Another difference is that multiple URI-Rs can be archived during archive mode, so replay mode replays mementos from multiple URI-Rs.

Audit Mode

When given a URI-R, audit mode (Figure 1) can be used to replay the mementos of the URI-R from two public web archives. Figure 2 shows the current process used for audit mode. Before starting audit mode, a configuration file needs to be created that includes the URI-R and a list of web archives that we want to use for audit mode.

Figure 2: The current process for audit mode

At the beginning of audit mode, Selenium is used to set up the web browsers and display the context information needed for audit mode. After the web browsers are set up, MemGator is used to retrieve URI-Ms based on the URI-R and the web archives specified in the configuration file. (A URI-M is a URI for a memento.) Currently this mode supports viewing archived web pages from the Wayback Machine, archive.today, and Archive-It. We will add support for other web archives in the future.
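To make the URI-M retrieval step more concrete, here is a minimal sketch of how a link-format TimeMap (the format defined by the Memento protocol, which MemGator can return) might be parsed and grouped by web archive. This is not the code used in the livestream; the function name `parse_timemap`, the hostnames, and the sample TimeMap entries are all assumptions made up for illustration.

```python
import re
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# Matches one memento entry in a link-format TimeMap, for example:
# <URI-M>; rel="memento"; datetime="Thu, 05 Jan 2023 12:00:00 GMT",
MEMENTO_ENTRY = re.compile(
    r'<(?P<uri>[^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="(?P<dt>[^"]+)"'
)


def parse_timemap(link_text, archive_hosts):
    """Return {host: [(memento_datetime, uri_m), ...]} for the requested hosts.

    Entries whose URI-M host is not in archive_hosts are ignored, and each
    archive's list is sorted by Memento-Datetime.
    """
    by_archive = {host: [] for host in archive_hosts}
    for match in MEMENTO_ENTRY.finditer(link_text):
        uri_m = match.group("uri")
        host = urlsplit(uri_m).netloc
        if host in by_archive:
            memento_datetime = parsedate_to_datetime(match.group("dt"))
            by_archive[host].append((memento_datetime, uri_m))
    for mementos in by_archive.values():
        mementos.sort()
    return by_archive


# Made-up sample TimeMap; the timestamps and hosts are illustrative only.
SAMPLE_TIMEMAP = """\
<https://oduwsdl.github.io/>; rel="original",
<http://web.archive.org/web/20230105120000/https://oduwsdl.github.io/>; rel="first memento"; datetime="Thu, 05 Jan 2023 12:00:00 GMT",
<http://archive.md/20230110080000/https://oduwsdl.github.io/>; rel="memento"; datetime="Tue, 10 Jan 2023 08:00:00 GMT",
<http://web.archive.org/web/20230301000000/https://oduwsdl.github.io/>; rel="last memento"; datetime="Wed, 01 Mar 2023 00:00:00 GMT"
"""

if __name__ == "__main__":
    grouped = parse_timemap(SAMPLE_TIMEMAP, ["web.archive.org", "archive.md"])
    for host, mementos in grouped.items():
        print(host, len(mementos))
```

Grouping by host is only one possible way to separate the archives named in the configuration file; a real deployment might instead ask MemGator for each archive separately.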

After retrieving the URI-Ms, we pair URI-Ms from the different archives based on Memento-Datetime, so that we are comparing mementos that were archived around the same time. Currently the default threshold is 30 days, so if the difference between Memento-Datetimes for two archived web pages is over 30 days, then the archived web pages will not be replayed at the same time during the livestream.
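The pairing step can be sketched as follows. This is a simple greedy approach written for illustration, assuming each archive's mementos are available as `(Memento-Datetime, URI-M)` tuples; the function name `pair_mementos`, the greedy strategy, and the placeholder URI-M strings are assumptions, and the livestream's actual pairing logic may differ.

```python
from datetime import datetime, timedelta, timezone


def pair_mementos(archive_a, archive_b, threshold=timedelta(days=30)):
    """Pair each memento in archive_a with the closest unused memento in archive_b.

    Both arguments are lists of (memento_datetime, uri_m) tuples. A pair is
    only formed when the two Memento-Datetimes differ by at most `threshold`.
    """
    pairs = []
    unused = list(archive_b)
    for dt_a, uri_a in sorted(archive_a):
        if not unused:
            break
        # Find the not-yet-paired memento in archive_b closest in time.
        closest = min(unused, key=lambda memento: abs(memento[0] - dt_a))
        if abs(closest[0] - dt_a) <= threshold:
            pairs.append((uri_a, closest[1]))
            unused.remove(closest)
    return pairs


if __name__ == "__main__":
    utc = timezone.utc
    # Placeholder URI-Ms and datetimes, made up for this example.
    wayback = [
        (datetime(2023, 1, 5, tzinfo=utc), "wayback-memento-1"),
        (datetime(2023, 6, 1, tzinfo=utc), "wayback-memento-2"),
    ]
    archive_today = [
        (datetime(2023, 1, 20, tzinfo=utc), "archive-today-memento-1"),
        (datetime(2023, 9, 15, tzinfo=utc), "archive-today-memento-2"),
    ]
    print(pair_mementos(wayback, archive_today))
```

In this example only the two January mementos are paired (15 days apart); the June and September mementos are more than 30 days apart, so they would not be replayed side by side.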

Figure 3: Two mementos of a URI-R from the same web archive that have SimHashes with a Hamming distance of 27, which is more than the default Hamming distance threshold of 4. These two mementos would not be shown beside each other during audit mode since they are from the same web archive. Both mementos of the web page are considered unique and would be replayed at different times during audit mode.


The next step is replaying the unique URI-Ms. Uniqueness is determined using the SimHash of the web page’s HTML source and the Hamming distance between the previous unique URI-M’s SimHash and the current URI-M’s SimHash. I decided to use SimHash and Hamming distance because TMVis (paper: “Visualizing Webpage Changes Over Time”) uses these two algorithms when determining unique mementos. TMVis shows how a web page has changed over time by displaying screenshots of the unique mementos of a web page. Figure 3 shows an example where the Hamming distance between two SimHashes exceeds the default threshold of 4, so both mementos are treated as unique. Since the archived web pages are shown beside each other during the audit, as in Figure 1, we can see the similarities and differences between two mementos of a web page that were archived around the same time by different web archives.
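The uniqueness check can be illustrated with a small, self-contained sketch. It uses a simplified 64-bit SimHash over word tokens (hashed with SHA-256) and is not the implementation used by the livestream or by TMVis; the tokenization, the hash function, and the names `simhash`, `hamming_distance`, and `unique_mementos` are assumptions chosen for illustration. Only the comparison against the previous unique memento and the default threshold of 4 come from the description above.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a simplified SimHash fingerprint of `text` over word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def unique_mementos(mementos, threshold=4):
    """Return the URI-Ms considered unique.

    `mementos` is a chronologically ordered list of (uri_m, html) tuples.
    A memento is unique when its SimHash differs from the SimHash of the
    previous unique memento by more than `threshold` bits.
    """
    unique = []
    previous_hash = None
    for uri_m, html in mementos:
        current_hash = simhash(html)
        if previous_hash is None or hamming_distance(previous_hash, current_hash) > threshold:
            unique.append(uri_m)
            previous_hash = current_hash
    return unique


if __name__ == "__main__":
    page = "<html><body><h1>WS-DL</h1><p>Web Science and Digital Libraries</p></body></html>"
    # Placeholder URI-Ms; two identical captures collapse to one unique memento.
    print(unique_mementos([("memento-1", page), ("memento-2", page)]))
```

Because the comparison is always against the previous unique memento rather than the immediately preceding capture, a page that drifts slowly over many captures will still eventually register as changed.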

After replaying all of the unique mementos of the web page, a final message is displayed (Figure 4) that lists the number of unique mementos of the web page and the total number of mementos of the web page for each web archive.

Figure 4: Final message for audit mode that displays the number of unique mementos of a web page and the total number of mementos for a web page

Listed below are some demos of audit mode:

Auditing Wayback Machine and Archive-It for the URI-R: https://cs.odu.edu

Auditing Wayback Machine and archive.today for the URI-R: https://www.odu.edu

Part 1

Part 2

Auditing Wayback Machine and archive.today for the URI-R: https://oduwsdl.github.io/

Future Work

Some future updates we could make to audit mode are adding support for more web archives, adding an option for comparing the visual similarity of archived web pages, and measuring replay performance metrics. In the future, audit mode could use any active web archive that is supported by MemGator, since we use MemGator to get the URI-Ms for the audit. During the audit, we could compute the image similarity between screenshots of two archived web pages that were archived around the same time, so that we can automatically detect when the content of the archived web pages differs between the two web archives. Also during the audit, we could use tools like Memento Damage and Web Archiving Screenshot Compare to measure replay performance metrics that can be shown during results mode and used for an automated gaming livestream.

Summary

We have added a new mode, named audit mode, to our web archiving livestreams. This mode audits web archives by replaying unique mementos of a URI-R from two different web archives, which is different from replay mode, where only recently archived mementos are replayed. Viewing the replay of two mementos from different web archives is useful for comparing the content missing from each memento and for checking whether the content of a web page differs based on personalization or location, or changes each time the web page is loaded. Currently audit mode works with the Wayback Machine, Archive-It, and archive.today. In the future we plan on integrating audit mode with more web archives.


-Travis Reid (@TReid803)
