
2017-11-20: Dodging the Memory Hole 2017 Trip Report

It was rainy in San Francisco, but that did not deter those of us attending Dodging the Memory Hole 2017 at the Internet Archive. We engaged in discussions about a very important topic: the preservation of online news content.


Keynote: Brewster Kahle, founder and digital librarian for the Internet Archive

Brewster Kahle is well known in digital preservation and especially web archiving circles. He founded the Internet Archive in May 1996. The WS-DL and LANL's Prototyping Team collaborate heavily with those from the Internet Archive, so hearing his talk was quite inspirational.




We are familiar with the Internet Archive's efforts to archive the Web, visible mostly through the Wayback Machine, but the goal of the Internet Archive is "Universal Access to All Knowledge", something that Kahle equates to the original Library of Alexandria or putting humans on the moon. To that end, he highlighted many initiatives by the Internet Archive in pursuit of this goal. He mentioned that the contents of a book take up roughly a megabyte, so the works of the Library of Congress can be stored digitally in about 28 terabytes. Digitizing them is another matter, but it is completely doable, and by digitizing them we remove restrictions on access due to distance and other factors. Why stop with documents? There are many other types of content. Kahle highlighted the Internet Archive's efforts to make television content, video games, audio, and more available. They also have a lending program whereby users can borrow books, which are digitized using book scanners. He stressed that, because of its mission to provide content to all, the Internet Archive is indeed a library.



As a library, the Internet Archive also becomes a target for governments seeking information on the activities of their citizens. Kahle highlighted one incident in which the FBI sent a letter demanding information from the Internet Archive. Thanks to help from the Electronic Frontier Foundation, the Internet Archive sued the United States government and won, defending the rights of those using their services.



Kahle emphasized that we can all help preserve the web by helping the Internet Archive build its holdings of web content. The Internet Archive provides a form with a simple "save page now" button, but they also support other methods of submitting content.



Contributions from Los Alamos National Laboratory (LANL) and Old Dominion University (ODU)


Martin Klein from LANL and Mark Graham from the Internet Archive




Martin Klein presented work on Robust Links. Martin briefly reviewed motivating work he had done with Herbert Van de Sompel at Los Alamos National Laboratory, mentioning the problems of link rot and content drift, the latter of which I have also worked on.
He covered how one can create links that are robust by:
  1. submitting a URI to a web archive
  2. decorating the link HTML so that future users can reach archived versions of the linked content
For the first item, he talked about how one can use tools like the Internet Archive's "Save Page Now" button as well as WS-DL's own ArchiveNow. The second item is covered by the Robust Links specification. Mark Graham, Director of the Wayback Machine at the Internet Archive, further expanded upon Martin's talk by describing how the Wayback Extension also provides the capability to save pages, navigate the archive, and more. It is available for Chrome, Safari, and Firefox, and is shown in the screenshots below.
A screenshot of the Wayback Extension in Chrome.
A screenshot of the Wayback Extension in Safari. Note the availability of the option "Site Map", which is not present in the Chrome version
A screenshot of the Wayback Extension in Firefox. Note how there is less functionality.


Of course, the WS-DL efforts of ArchiveNow and Mink augment these preservation efforts by submitting content to multiple web archives, including the Internet Archive.



One of the most profound revelations from Martin and Mark's talk was that URIs are addresses, not the content that was on the page at the moment you read it. I realize that efforts like IPFS are trying to use hashes to address this dichotomy, but the web has not yet migrated to them.

Shawn M. Jones from ODU




I presented a lightning talk highlighting a blog post from earlier this year where I try to answer the question: where can we post stories summarizing web archive collections? I talked about why storytelling works as a visualization method for summarizing collections and then evaluated a number of storytelling and curation tools with the goal of finding those that best support this visualization method.


Selected Presentations


I tried to cover elements of all presentations while live tweeting during the event, and wish I could go into more detail here, but, as usual, I will only cover a subset.

Mark Graham discussed the Internet Archive's relationship with online news content. He highlighted a report by Rachel Maddow in which she used the Internet Archive to recover tweets posted by former US National Security Advisor Michael Flynn, thus incriminating him. He talked about other efforts, such as NewsGrabber, Archive-It, and the GDELT project, which all further archive online news or provide analysis of archived content. Most importantly, he covered "News At Risk": content that has been removed from the web by repressive regimes, further emphasizing the importance of archiving it for future generations. In that vein, he discussed the Environmental Data & Governance Initiative, set up to archive environmental data from government agencies after Donald Trump's election.

Ilya Kreymer and Anna Perricci presented their work on Webrecorder, web preservation software hosted at webrecorder.io. An impressive tool for "high fidelity" web archiving, Webrecorder allows one to record a web browsing session and save it to a WARC. Kreymer demonstrated its use on a CNN news web site with an embedded video, showing how the video was captured along with the rest of the content on the page. The webrecorder.io platform allows users to record using their native browser or to choose from a few other browsers and configurations in case the user agent plays a role in the quality of recording or playback. For offline use, they have also developed Webrecorder Player, with which one can play back WARCs without requiring an Internet connection. Anna Perricci said that it is perfect for browsing a recorded web session on an airplane. Contributors to this blog have written about Webrecorder before.

Katherine Boss, Meredith Broussard, Fernando Chirigati, and Rémi Rampin discussed the problems surrounding the preservation of news apps: interactive content on news sites that allows readers to explore data collected by journalists on a particular issue. Because of their dynamic nature, news apps are difficult to archive. Unlike static documents, they cannot be printed or merely copied. They often consist of client- and server-side code developed without a focus on reproducibility. Preserving news apps often requires the assistance of the organization that created the news app, which is not always available. Rémi Rampin noted that, for those organizations that were willing to help, their group has had success using the research reproducibility tool ReproZip to preserve and play back news apps.

Roger Macdonald and Will Crichton provided an overview of the Internet Archive's efforts to provide information from TV news. They have employed the Esper video search tool as a way to explore their collection. Because it is difficult for machines to derive meaning from the pixels within videos, they used captioning to provide for effective searching and analysis of the TV news content at the Internet Archive. Their goal is to allow search engines to connect fact checking to TV media. To this end, they employed facial recognition on hours of video to find content where certain US politicians were present. From there one can search for a politician and see where they have given interviews on news channels such as CNN, BBC, and Fox News. They are also exploring identifying the body position of each person in a frame, which might make it possible to answer queries such as "find every video where a man is standing over a woman". The goal is to make video as easy as text to search for meaning.

Maria Praetzellis highlighted a project named Community Webs that uses Archive-It. Community Webs provides libraries the tools necessary to preserve news and other content relevant to their local communities. Through Community Webs, local public libraries receive education and training, help with collection development, and archiving services and infrastructure.

Kathryn Stine and Stephen Abrams presented the work done on the Cobweb Project. Cobweb provides an environment where many users can collaborate to produce seeds that can then be captured by web archiving initiatives. If an event is unfolding and news stories are being written, the documents containing these stories may change quickly, thus it is imperative for our cultural memory that these stories be captured as close to publication as possible. Cobweb provides an environment for the community to create a collection of seeds and metadata related to one of these events.
Matthew Weber shared some results from the News Measures Research Project. This project started as an attempt to "create an archive of local news content in order to assess the breadth and depth of local news coverage in the United States". The researchers were surprised to discover that local news in the United States covers a much larger area than expected: 546 miles on average. Most areas are "woefully underserved". Consolidation of corporate news ownership has led to fewer news outlets in many areas and the focus of these outlets is becoming less local and more regional. These changes are of concern because the press is important to the democratic processes within the United States.

Social



As usual, I met quite a few people during our meals and breaks. I appreciated talks over lunch with Sativa Peterson of the Arizona State Library and Carolina Hernandez of the University of Oregon. It was nice to discuss the talks and their implications for journalism with Eva Tucker of Centered Media and Barrett Golding of Hearing Voices. I also appreciated feedback and ideas from Ana Krahmer of the University of North Texas, Kenneth Haggerty of the University of Missouri, Matthew Collins of the University of San Francisco Gleeson Library, Kathleen A. Hansen of the University of Minnesota, and Nora Paul, retired director of the Minnesota Journalism Center. I was especially intrigued by discussions with Mark Graham on using storytelling with web archives, Rob Brackett of Brackett Development, who is interested in content drift, and James Heilman, who works on WikiProject Medicine with Wikipedia.


Summary


Like last year, Dodging the Memory Hole was an inspirational conference highlighting current efforts to save online news. Having it at the Internet Archive further provided expertise and stimulated additional discussion on the techniques and capabilities afforded by web archives. Pictures of the event are available on Facebook. Video coverage is broken up into several YouTube videos: Day 1 before lunch, Day 1 after lunch, Day 2 before lunch, Day 2 after lunch, and lightning talks. DTMH highlights the importance of news in an era of a changing media presence in the United States, further emphasizing that web archiving can help us fact-check statements so we can hold onto a record of not only how we got here, but also guide where we might go next. -- Shawn M. Jones

2017-11-22: Deploying the Memento-Damage Service





Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, provide archived web pages, or mementos, for us to use. Nowadays, web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. This raises the question of how to objectively measure the damage to a memento in a way that correctly emulates user (human) perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes how the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, while a second page could be missing a single embedded video (e.g., a YouTube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page has lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has proven capable of measuring damage, it is not user-ready. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype was built in the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color values (RGB) of each pixel in the screenshot into a file, which the prototype then loads for further analysis. Dumping and reading the pixel colors of the screenshot takes a significant amount of time and is repeated once for each stylesheet the web page has. Therefore, if a web page has 5 stylesheets, the analysis is repeated 5 times even though it uses the same screenshot as the basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided; users must discover it themselves by identifying the errors that appear during execution. Furthermore, we needed to 'package' and deploy this prototype into a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service: the website, the REST API, the Docker image, and the Python library, all described below.
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code = 301, 302, or 307).
  2. Providing some insights and information.
    The user not only gets the final damage value but is also informed about the details of the crawling and calculation process, as well as the components that make up the final damage value. If an error happens, the error information is also provided.
  3. Dealing with overlapping resources and iframes.

Measuring the Damage

Crawling the Resources
When a user inputs a URI-M into the Memento-Damage service, the tool will check the content-type of the URI-M and crawl all resources. The properties of the resources, such as size and position of an image, will be written into a log file. Figure 2 summarizes the crawling process conducted in Memento-Damage. Along with this process, a screenshot of the website will also be created.

Figure 2. The crawling process in Memento-Damage

Calculating the Damage
After crawling the resources, Memento-Damage will start calculating the damage by reading the log files that were previously generated (Figure 3). Memento-Damage will first read the network log and examine the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it will chase down the final URI by following the URI in the Location header, as depicted in Figure 4. Each resource will be processed according to its type (image, css, javascript, text, iframe) to obtain its actual and potential damage value. Then, the total damage is computed using the formula:
$$ T_D = \frac{A_D}{P_D} $$
where:
$$ T_D = \text{Total Damage}, \quad A_D = \text{Actual Damage}, \quad P_D = \text{Potential Damage} $$

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where the subscripts denote image (i), css (c), javascript (j), multimedia (m), text (t), and iframe (f), respectively.
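As a rough illustration of this aggregation, the sketch below accumulates actual and potential damage over a set of resources (a simplified, hypothetical rendering; the resource representation is an assumption, not the actual Memento-Damage code):
# Simplified sketch of the total damage ratio described above.
def total_damage(resources):
    # resources: iterable of dicts like {"missing": bool, "weight": float},
    # where "weight" is the resource's contribution to potential damage
    actual = 0.0
    potential = 0.0
    for r in resources:
        potential += r["weight"]       # P_D: damage if this resource were missing
        if r["missing"]:
            actual += r["weight"]      # A_D: damage actually incurred
    return actual / potential if potential else 0.0

# Two missing icons matter far less than one large embedded video:
print(total_damage([
    {"missing": True, "weight": 0.01},    # icon
    {"missing": True, "weight": 0.01},    # icon
    {"missing": False, "weight": 0.60},   # embedded video, still present
]))  # roughly 0.03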
For image analysis, we use Pillow, a Python imaging library that performs better and faster than PerlMagick. Pillow can read the pixels in an image without dumping them to a file, which speeds up the analysis process. Furthermore, we modified the algorithm so that we only need to run the analysis script once for all stylesheets.
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources, such as the one illustrated in Figure 5, need to be treated differently to prevent the damage value from being double-counted. To solve this problem, we introduced the concept of a rectangle (Rectangle = xmin, ymin, xmax, ymax). We treat the overlapping resources as rectangles and calculate the size of their intersection area. The area of one of the overlapping resources is reduced by the intersection size, while the other is counted in full. Figure 6 and Listing 1 illustrate the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
def rectangle_intersection_area(a, b):
    # width and height of the overlap between rectangles a and b
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0):
        return dx * dy
    return 0  # no overlap
Listing 1. Measuring the image rectangle intersection in Memento-Damage
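For illustration, the rectangles can be represented with a simple named tuple and the correction applied as follows (the Rectangle type and field names here are assumptions for the sketch, not the tool's actual data structures):
from collections import namedtuple

Rectangle = namedtuple("Rectangle", "xmin ymin xmax ymax")

a = Rectangle(0, 0, 100, 100)    # first resource, counted in full
b = Rectangle(50, 50, 150, 150)  # second resource, overlaps with a

overlap = rectangle_intersection_area(a, b)               # 2500
area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin)            # 10000
area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin) - overlap  # 7500, the overlap is subtracted once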

Dealing with Iframes

Dealing with iframes is quite tricky and requires some customization. First, by default, the crawling process cannot access content inside an iframe using native JavaScript or jQuery selectors due to cross-domain restrictions. This problem becomes more complicated when the iframe is nested inside other iframe(s). Therefore, we need a way to switch from the main frame to the iframe; to handle this, we utilize the API provided by PhantomJS that facilitates switching from one frame to another. Second, the location properties of the resources inside an iframe are calculated relative to that particular iframe's position, not to the main frame's position, which could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation that takes into account the position of its parent frame(s).
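As a simplified illustration of that nested calculation (the objects and attribute names below are hypothetical, not Memento-Damage's internal API):
def absolute_position(resource, enclosing_frames):
    # resource: has .left/.top relative to its own iframe
    # enclosing_frames: chain of parent iframes, innermost first,
    #                   each with .left/.top relative to its own parent
    left, top = resource.left, resource.top
    for frame in enclosing_frames:
        left += frame.left
        top += frame.top
    return left, top  # position relative to the main frame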

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website gives the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating the damage of a large number of URI-Ms. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API is part of the web service; it facilitates damage calculation from any HTTP client (e.g., a web browser, cURL, etc.) and returns its output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using the REST API, a user can create a script and calculate the damage of a small number of URIs (e.g., 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage Docker image uses Ubuntu with LXDE as the desktop environment. A fixed desktop environment is used to avoid inconsistent damage values for the same URI when run on machines with different operating systems. We found that PhantomJS, the headless browser used for generating the screenshot, renders the web page in accordance with the machine's desktop environment. Hence, the same URI could produce a slightly different screenshot, and thus a different damage value, when run on different machines (Figure 8).


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these steps:
  1. Pull the docker image:
    docker pull erikaris/memento-damage
  2. Run the docker image:
    docker run -it -p <host-port>:80 --name <container-name> erikaris/memento-damage

    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    If the user wants to work from inside the Docker container's terminal, use the following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring Memento-Damage using the various options (Figure 9), which can be obtained by typing
    docker exec -it memdamage memento-damage --help
    or, if the user is already inside the Docker container, simply type:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage are available at http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is similar to that of the Docker version. It is aimed at people who already have all the dependencies (PhantomJS 2.x and Python 2.7) installed on their machine and do not want to bother with installing Docker. The latest library version can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run Memento-Damage as a web service:
    memento-damage-server
  3. Run Memento-Damage via the CLI:
    memento-damage <URI>
  4. Explore the available options using --help:
    memento-damage-server --help    (for the web service)
    or
    memento-damage --help           (for the CLI)

Testing on a Large Number of URIs

To prove our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The test used the Docker version of Memento-Damage, run on a machine with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz and 4 GiB of memory. The testing results are shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

Of the 108,511 input URI-Ms tested, 80,580 were successfully processed while the remaining 27,931 failed. The failures on those 27,931 URI-Ms happened because of the Internet Archive's limits on concurrent access. On average, one URI-M needs 32.5 seconds of processing time. This is roughly 110 times faster than the prototype version, which takes an average of 1 hour to process one URI-M; in some cases, the prototype even took almost 2 hours to process a single URI-M.
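Although our test drove the Docker CLI, a similar batch run could also be scripted against the REST API described earlier; the sketch below is a hypothetical driver, not the exact testing script we used:
import requests

API = 'http://memento-damage.cs.odu.edu/api/damage/'

def damage_report(urims):
    results = {}
    for urim in urims:
        try:
            resp = requests.get(API + urim, timeout=300)
            results[urim] = resp.json()   # keep the full JSON response
        except Exception:                 # timeouts, rate limiting, bad JSON, etc.
            results[urim] = None
    return results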

From the successfully processed URI-Ms, we created some visualizations to help us better understand the results, as seen below. The first graph (Figure 11) shows the average number of missing embedded resources per memento per year, according to the damage value (Dm) and the proportion of missing resources (Mm). The most interesting case appeared in 2011, where the Dm value was significantly higher than Mm. This means that although, on average, the URI-Ms from 2011 only lost 4% of their resources, those losses caused four times more damage than the Mm number suggests. On the other hand, in 2008, 2010, 2013, and 2017, the Dm value is lower than Mm, which implies that those missing resources are less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the total number of resources in each URI-M and how many of them are missing. The x-axis represents each URI-M, sorted in descending order by the number of resources, while the y-axis represents the number of resources in each URI-M. This graph shows that almost every URI-M lost at least one of its embedded resources.

Summary

In this research, we have improved the method for measuring the damage to a memento (URI-M) based on the earlier prototype. The improvements include reducing calculation time, fixing various bugs, handling redirection, and supporting new types of resources. We developed Memento-Damage into a comprehensive tool that can show the details of every resource contributing to the damage. Furthermore, it provides several ways of using the tool, such as the Python library and the Docker version. The testing results show that Memento-Damage works faster than the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements we made in Memento-Damage compared to the initial prototype.

No | Subject                  | Prototype                               | Memento-Damage
1  | Programming Language     | Javascript + Perl                       | Javascript + Python
2  | Interface                | CLI                                     | CLI, Website, REST API
3  | Distribution             | Source Code                             | Source Code, Python library, Docker
4  | Output                   | Plain Text                              | Plain Text, JSON
5  | Processing time          | Very slow                               | Fast
6  | Includes IFrame          | NA                                      | Available
7  | Redirection Handling     | NA                                      | Available
8  | Resolve Overlap          | NA                                      | Available
9  | Blacklisted URIs         | Only 1 blacklisted URI, added manually  | Several new blacklisted URIs, identified based on a certain pattern
10 | Batch execution          | Not supported                           | Supported
11 | DOM selector capability  | Only supports simple selection queries  | Supports complex selection queries
12 | Input filtering          | NA                                      | Only processes input in HTML format
Table 1. Improvements in Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs many improvements to increase its functionality and provide a better user experience. We strongly encourage everyone, especially people who work in the web archiving field, to try this tool and give us feedback. Please read the FAQ and HELP before starting to use Memento-Damage. Help us improve by telling us what we can do better: post any bugs, errors, issues, or difficulties that you find in this tool on our GitHub.

- Erika Siregar -

2017-12-03: Introducing Docker - Application Containerization & Service Orchestration


For the last few years, Docker, the application containerization technology, has been gaining a lot of traction in the DevOps community, and lately it has made its way into the academic and research communities as well. I have been following it since its inception in 2013, and for the last couple of years it has been a daily driver for me. At the same time, I have been encouraging my colleagues to use Docker in their research projects. As a result, we are gradually moving away from one virtual machine (VM) per project to a swarm of nodes running containers of various projects and services. If you have accessed MemGator, CarbonDate, Memento Damage, Story Graph, or some other WS-DL services lately, you have been served from our Docker deployment. We even have an on-demand PHP/MySQL application deployment system using Docker for the CS418 - Web Programming course.



Last summer, Docker Inc. selected me as the Docker Campus Ambassador for Old Dominion University. While I had already given some Docker talks to more focused groups, with the campus ambassador hat on I decided to organize an event from which grads and undergrads of the Computer Science department at large could benefit.


The CS department accepted it as a colloquium, scheduled for Nov 29, 2017. We were anticipating about 50 participants, but many more showed up. The increasing interest of students in containerization technology can be taken as an indicator of its usefulness, and perhaps it should be included in some courses offered in the future.


The session lasted for a little over an hour. It started with some slides motivating Docker with a Dockerization story and a set of problems that Docker can potentially solve. The slides then introduced some basics of Docker and illustrated how a simple script can be packaged into an image and distributed using DockerHub. The presentation was followed by a live demo of the step-by-step evolution of a simple script into a multi-container application using a micro-service architecture, demonstrating various aspects of Docker in each step. Finally, the session was opened for questions and answers.


For the purpose of illustration I prepared an application that scrapes a given web page to extract links from it. The demo code has folders for various steps as it progresses from a simple script to a multi-service application stack. Each folder has a README file to explain changes from the previous step and instructions to run the application. The code is made available on GitHub. Following is a brief summary of the demo.

Step 0

Step 0 has a simple linkextractor.py Python script (as shown below) that accepts a URL as an argument and prints all the hyperlinks on the page.
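The embedded gist is not reproduced here; a minimal sketch of such a script, using the requests and beautifulsoup4 libraries mentioned in the checklist below (an approximation, not necessarily the exact code in the repository), is:
#!/usr/bin/env python

import sys
import requests
from bs4 import BeautifulSoup

def main(url):
    res = requests.get(url)                        # fetch the page
    soup = BeautifulSoup(res.text, "html.parser")  # parse the HTML
    for link in soup.find_all("a", href=True):     # print every hyperlink
        print(link["href"])

if __name__ == "__main__":
    main(sys.argv[1])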


However, running this rather simple script might raise some of the following issues:

  • Is the script executable? (chmod a+x linkextractor.py)
  • Is Python installed on the machine?
  • Can you install software on the machine?
  • Is "pip" installed?
  • Are "requests" and "beautifulsoup4" Python libraries installed?

Step 1

Step 1 adds a simple Dockerfile to automate the installation of all the requirements and build an isolated, self-contained image.


Inclusion of this Dockerfile ensures that the script will run without any hiccups in a Docker container as a one-off command.

Step 2

Step 2 makes some changes to the Python script: 1) converting extracted paths to full URLs, 2) extracting both links and anchor texts, and 3) moving the main logic into a function that returns an object, so that the script can be used as a module in other scripts.
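A hypothetical sketch of the refactored module (the function name and returned fields are assumptions for illustration):
import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def extract_links(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    return [{
        "href": urljoin(url, a["href"]),           # resolve relative paths to full URLs
        "text": a.get_text(strip=True) or "[IMG]"  # anchor text (or a placeholder)
    } for a in soup.find_all("a", href=True)]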

This step illustrates that new changes in the code will not affect any running containers and will not impact an image that was built already (unless overridden). Building a new image with a different tag allows both versions to co-exist and be run as desired.

Step 3

Step 3 adds another Python file, main.py, that utilizes the module written in the previous step to expose link extraction as a web service API that returns a JSON response. The required libraries are listed in the requirements.txt file. The Dockerfile is updated to accommodate these changes and to run the server by default rather than the script as a one-off command.
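A minimal sketch of what such an API server might look like (the use of Flask and the route shown here are assumptions for illustration, not necessarily the demo's exact code):
from flask import Flask, jsonify
from linkextractor import extract_links   # module from Step 2 (name assumed)

app = Flask(__name__)

@app.route("/api/<path:url>")
def api(url):
    return jsonify(links=extract_links(url))   # respond with the extracted links as JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)         # bind to all interfaces inside the container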

This step demonstrates how host and container ports are mapped to expose the service running inside a container.

Step 4

Step 4 moves all the code written so far for the JSON API into a separate folder to build an independent image. In addition, it adds a PHP file, index.php, in a separate folder that serves as a front-end application which internally communicates with the Python API for link extraction. To glue these services together, a docker-compose.yml file is added.


This step demonstrates how multiple services can be orchestrated using Docker Compose. We did not create a custom image for the PHP application; instead, we demonstrated how the code can be mounted inside a container (in this case, a container based on the official php:7-apache image). This allows any modifications of the code to be reflected immediately inside the running container, which can be very handy in development mode.

Step 5

Step 5 adds another Dockerfile to build a custom image of the front-end PHP application. The Python API server is updated to utilize Redis for caching. Additionally, the docker-compose.yml file is updated to reflect the changes in the front-end application (the "web" service block) and to include a Redis service based on its official Docker image.
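A rough illustration of the caching change in the Python API (key naming and the hostname are assumptions; in Compose, the Redis service is typically reachable by its service name):
import json
import redis
from linkextractor import extract_links    # as in the earlier steps (name assumed)

cache = redis.StrictRedis(host="redis", port=6379)  # "redis" = Compose service name

def cached_extract(url):
    hit = cache.get(url)
    if hit is not None:
        return json.loads(hit)              # cache hit: reuse the stored result
    links = extract_links(url)
    cache.set(url, json.dumps(links))       # cache miss: store for next time
    return links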

This step illustrates how easy it is to progressively add components to compose a multi-container service stack. At this stage, the demo application architecture reflects what is illustrated in the title image of this post (the first figure).

Step 6

Step 6 completely replaces the Python API service component with an equivalent Ruby implementation. Slight modifications are made to the docker-compose.yml file to reflect these changes. Additionally, a "logs" directory is mounted in the Ruby API service as a volume for persistent storage.

This step illustrates how easily any component of a micro-service architecture application stack can be swapped out with an equivalent service. Additionally, it demonstrates volumes for persistent storage so that containers can remain stateless.


The video recording of the session is available on YouTube as well as on the department's colloquium recordings page (the latter has more background noise). The slides and demo code are available under permissive licenses to allow modification and reuse.

Resources



--
Sawood Alam

2017-12-11: Difficulties in timestamping archived web pages

Figure 1: A web page from nasa.gov is archived
 by Michael's Evil Wayback in July 2017.
Figure 2: When visiting the same archived page in October 2017,
we found that the content of the page has been tampered with. 
The 2016 Survey of Web Archiving in the United States shows an increasing trend of using public and private web archives in addition to the Internet Archive (IA). Because of this trend, we should consider the question of the validity of archived web pages delivered by these archives.
Let us look at an example where the important web page https://climate.nasa.gov/vital-signs/carbon-dioxide/, which keeps a record of the carbon dioxide (CO2) level in the Earth's atmosphere, is captured by the private web archive "Michael's Evil Wayback" on July 17, 2017 at 18:51 GMT. At that time, as Figure 1 shows, the CO2 level was 406.31 ppm.
When revisiting the same archived page in October 2017, we should be presented with the same content. Surprisingly, the CO2 level changed to 270.31 ppm, as Figure 2 shows. So which one is the "real" archived page?
We can detect that the content of an archived web page has been modified by generating a cryptographic hash value of the returned HTML code. For example, the following command will download the web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ and generate a SHA-256 hash value of its HTML content:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
b87320c612905c17d1f05ffb2f9401ef45a6727ed6c80703b00240a209c3e828  -
The next figure illustrates how this simple approach of generating hashes can detect tampering with the content of archived pages. In this example, the "black hat" in the figure (i.e., Michael's Evil Wayback) has changed the CO2 level to a lower value (i.e., in favor of individuals or organizations who deny that CO2 is one of the main causes of global warming).
Another possible solution for validating archived web pages is timestamping. If a trusted timestamp is issued on an archived web page, anyone should be able to verify that a particular representation of the web page existed at a specific time in the past.
As of today, many systems, such as OriginStamp and OpenTimestamps, offer a free-of-charge service to generate trusted timestamps of digital documents in blockchains such as Bitcoin. These tools perform multiple steps to create a timestamp. One of these steps requires computing a hash value that represents the content of the resource (e.g., with the cURL command above). Next, this hash value is converted to a Bitcoin address, and then a Bitcoin transaction is made in which one of the two sides of the transaction (i.e., the source or the destination) is the newly generated address. Once the transaction is accepted in the blockchain, its creation datetime is considered to be a trusted timestamp. Shawn Jones describes in "Trusted Timestamping of Mementos" how to create trusted timestamps of archived web pages using blockchain networks.
In our technical report, "Difficulties of Timestamping Archived Web Pages", we show that trusted timestamping of archived web pages is not an easy task, for several reasons. The main reason is that a hash value calculated on the content of an archived web page (i.e., a memento) should be repeatable; that is, we should always obtain the same hash value each time we retrieve the memento. In addition to describing those difficulties, we introduce some requirements that must be fulfilled in order to generate repeatable hash values for mementos.

--Mohamed Aturban


Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, "Difficulties of Timestamping Archived Web Pages." 2017. Technical Report. arXiv:1712.03140.

2017-12-14: Storify Will Be Gone Soon, So How Do We Preserve The Stories?

The popular storytelling service Storify will be shut down on May 16, 2018. Storify has been used by journalists and researchers to create stories about events and topics of interest. It has a wonderful interface, shown below, that allows one to insert text as well as social cards and other content from a variety of services, including Twitter, Instagram, Facebook, YouTube, Getty Images, and of course regular HTTP URIs.
This screenshot displays the Storify editing Interface.
Storify is used by news sources to build and publish stories about unfolding events, as seen below for the Boston NPR station WBUR.
Storify is used by WBUR in Boston to convey news stories.
It is also the visualization platform used for summarizing Archive-It collections in the Dark and Stormy Archives (DSA) Framework, developed by WS-DL members Yasmin AlNoamany, Michele Weigle, and Michael Nelson. In a previous blog post, I covered why this visualization technique works and why many other tools fail to deliver it effectively. An example story produced by the DSA is shown below.
This Storify story summarizes Archive-It Collection 2823 about a Russian plane crash on September 7, 2011.

Ian Milligan provides an excellent overview of the importance of Storify and the issues surrounding its use. Storify stories have been painstakingly curated and the aggregation of content is valuable in and of itself, so before Storify disappears, how do we save these stories?

Saving the Content from Storify



Manually


Storify does allow a user to save their own content, one story at a time. Once you've logged in, you can perform the following steps:
1. Click on My Stories
2. Select the story you wish to save
3. Choose the ellipsis menu from the upper right corner
4. Select Export
5. Choose the output format: HTML, XML, or JSON

Depending on your browser and its settings, the resulting content may display in your browser or a download dialog may appear. URIs for each file format do match a pattern. In our example story above, the slug for the story is 2823spst0s and our account name is ait_stories. The different formats for our example story reside at the following URIs.
  • JSON file format: https://api.storify.com/v1/stories/ait_stories/2823spst0s
  • XML file format: https://storify.com/ait_stories/2823spst0s.xml
  • Static HTML file format: https://storify.com/ait_stories/2823spst0s.html
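Given these patterns, a short script can fetch all three formats of any public story; the sketch below simply applies the URL templates above and is not part of any official tool:
import requests

def save_story(account, slug):
    urls = {
        "json": "https://api.storify.com/v1/stories/%s/%s" % (account, slug),
        "xml":  "https://storify.com/%s/%s.xml" % (account, slug),
        "html": "https://storify.com/%s/%s.html" % (account, slug),
    }
    for fmt, url in urls.items():
        resp = requests.get(url)
        with open("%s-%s.%s" % (account, slug, fmt), "wb") as out:
            out.write(resp.content)

save_story("ait_stories", "2823spst0s")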
If one already has the slugs and the account names, they can save any public story. Private stories, however, can only be saved by the owner of the story. What if we do not know the slugs of all of our stories? What if we want to save someone else's stories?

Using Storified From DocNow


For saving the HTML, XML, and JSON formats of Storify stories, Ed Summers, creator of twarc, has created the storified utility as part of the DocNow project. Using this utility, one can save public stories from any Storify account in the 3 available formats. I used the utility to save the stories from the DSA's own ait_stories account. After ensuring I had installed python and pip, I was able to install and use the utility as follows:
  1. git clone https://github.com/DocNow/storified.git
  2. pip install requests
  3. cd storified
  4. python ./storified.py ait_stories # replace ait_stories with the name of the account you wish to save
Update: Ed Summers mentions that one can now run pip install storified, replacing these steps. One only needs to then run storified.py ait_stories, again replacing ait_stories with the account name you wish to save.

Storified creates a directory with the given account name containing sub-directories named after each story's slug. For our Russia Plane crash example, I have the following:
~/storified/ait_stories/2823spst0s % ls -al
total 416
drwxr-xr-x 5 smj staff 160 Dec 13 16:46 .
drwxr-xr-x 48 smj staff 1536 Dec 13 16:47 ..
-rw-r--r-- 1 smj staff 58107 Dec 13 16:46 index.html
-rw-r--r-- 1 smj staff 48440 Dec 13 16:46 index.json
-rw-r--r-- 1 smj staff 98756 Dec 13 16:46 index.xml
I compared the content produced by the manual process above with the output from storified and there are slight differences in metadata between the authenticated manual export and the anonymous export generated by storified. Last seen dates and view counts are different in the JSON export, but there are no other differences. The XML and HTML exports of each process have small differences, such as <canEdit>false</canEdit> in the storified version versus <canEdit>true</canEdit> in the manual export. These small differences are likely due to the fact that I had to authenticate to manually export the story content whereas storified works anonymously. The content of the actual stories, however, is the same. I have created a GitHub gist showing the different exported content.

Using storified, I was able to extract and save our DSA content to Figshare for posterity. Figshare provides persistence as part of its work with the Digital Preservation Network, and used CLOCKSS prior to March 2015.

That covers extracting the base story text and structured data, but what about the images and the rest of the experience? Can we use web archives instead?

Using Web Archiving on Storify Stories



Storify Stories are web resources, so how well can they be archived by web archives? Using our example Russia Plane Crash story, with a screenshot shown below, I submitted its URI to several web archiving services and then used the WS-DL memento damage application to compute the memento damage of the resulting memento.
A screenshot of our example Storify story, served from storify.com.

A screenshot of our Storify story served from the Internet Archive, after submission via the Save Page Now Utility.
A screenshot of our Storify story served from archive.is.

A screenshot of our Storify story served from webrecorder.io.


A screenshot of our Storify story served via WAIL version 1.2.0-beta3.
Platform (Memento Damage Score), with visual inspection comments:

Original Page at Storify (0.002)
  • All social cards complete
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is present
  • Interactive Share Widget contains all images
  • No visible pagination animation
Internet Archive with Save Page Now (0.053)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • Pagination animation runs on click and terminates with errors
Archive.is (0.000)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget does not work
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing
  • Pagination animation is replaced by "Next Page" which goes nowhere
Webrecorder.io (0.051*)
  • Missing the last 5 social cards, but can capture all with user interaction while recording
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • No visible pagination animation
WAIL (0.025)
  • All social cards complete
  • Views Widget works, but is missing downward arrow
  • Embed Widget is missing images, but otherwise works
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing images
  • Pagination animation runs and does not terminate


Out of these platforms, Archive.is has the lowest memento damage score, but in this case the memento damage tool has been misled by how Archive.is produces its content. Because Archive.is takes a snapshot of the DOM at the time of capture and does not preserve the JavaScript on the page, it may score low on Memento Damage, but also has no functional interactive widgets and is also missing 5 social cards at the end of the page. The memento damage tool crashed while trying to provide a damage score for Webrecorder.io; its score has been extracted from logging information.

I visually evaluated each platform for the authenticity of its reproduction of the interactivity of the original page. I did not expect functions that relied on external resources to work, but I did expect menus to appear and images to be present when interacting with widgets. In this case, Webrecorder.io produces the most authentic reproduction, only missing the Livefyre comments widget. Storify stories, however, do not completely display the entire story at load time. Once a user scrolls down, JavaScript retrieves the additional content. Webrecorder.io will not acquire this additional paged content unless the user scrolls the page manually while recording.

WAIL, on the other hand, retrieved all of the social cards. Even though it failed to capture some of the interactive widgets, it did capture all social cards and, unlike Webrecorder.io, does not require any user interaction once seeds are inserted. On playback, however, it still displays the animated pagination widget, as seen below, misleading the user into believing that more content is loading.
A zoomed in screenshot from WAIL's playback engine with the pagination animation outlined in a red box.


WAIL also has the capability of crawling the web resources linked from the social cards themselves, making it a suitable choice if linked content is more important than complete authentic reproduction.

The most value comes from the social cards and the text of the story, not the interactive widgets. Rather than using the story URIs themselves, one can avoid the page-load pagination problems by archiving the static HTML version of the story mentioned above: use https://storify.com/ait_stories/2823spst0s.html rather than https://storify.com/ait_stories/2823spst0s. I tested the static HTML URIs in all tools and found that all social cards were preserved.
The static HTML page version of the same story, missing interactive widgets, but containing all story content.

Unfortunately, other archived content probably did not link to the static HTML version. Because of this, if one were trying to browse a web archive's collection and followed a link intended to reach a Storify story, they would not see it, even though the static HTML version may have been archived. In other words, web archives would not know to canonicalize https://storify.com/ait_stories/2823spst0s.html and https://storify.com/ait_stories/2823spst0s.

Summary



As with most preservation, the goal of the archivist needs to be clear before attempting to preserve Storify stories. Using the manual method or DocNow's storified, we can save the information needed to reconstruct the text of the social cards and other text of the story, but with missing images and interactive content. Aiming web archiving platforms at the Storify URIs, we can archive some of the interactive functionality of Storify, with some degree of success, but also with loss of story content due to automated pagination.

For the purposes of preserving the visualization that is the story, I recommend using a web archiving tool to archive the static HTML version, which will preserve the images and text as well as the visual flow of the story so necessary for successful storytelling. I also recommend performing a crawl to preserve not only the story, but the items linked from the social cards. Keep in mind that web pages likely link to the Storify story URI and not its static HTML URI, hampering discovery within large web archives.

Even though we can't save Storify the organization, we can save the content of Storify the web site.

-- Shawn M. Jones


Updated on 2017/12/14 at 3:30 PM EST with note about pip install storified thanks to Ed Summers' feedback.

2017-12-19: CNI Fall 2017 Trip Report

The Coalition for Networked Information (CNI) Fall 2017 Membership Meeting was held in Washington, DC on December 11-12, 2017. University Librarian George Fowler and I represented ODU, which was recognized as a new member this year.

CNI runs several parallel sessions of project briefings, so I will focus on those sessions that I was able to attend. The attendees were active on Twitter, using the hashtag #cni17f, and I'll embed some of the tweets below.  CNI has the full schedule (pdf) available and will have some of the talks on the CNI YouTube channel. (I'll note if any sessions I attended were scheduled to be recorded and add the link when published.) The project briefings page has additional information on each briefing and links to presentations that have been submitted.

Dale Askey (McMaster University) has published his CNI Fall 2017 Membership Meeting notes, which covers several of the sessions that I was unable to attend.

DAY 1 - December 11

Plenary - recorded

CNI Executive Director (and newly-named ACM Fellow) Clifford Lynch opened the Fall meeting with a plenary talk.

Cliff gave a wide-ranging talk that touched on several timely issues including the DataRefuge movement, net neutrality, generative adversarial networks, provenance, Memento, the Digital Preservation Statement of Shared Values, annotation, and blockchain.
Our recent work investigating the challenges of timestamping archived webpages (available as a tech report at arXiv) is relevant here, given Cliff's comments about DataRefuge, provenance, Memento, and blockchain.


Archival Collections, Open Linked Data, and Multi-modal Storytelling
Andrew White (Rensselaer Polytechnic Institute)

The focus was on taking campus historical archives and telling a story, with links between students, faculty, buildings, and other historical relationships on campus. They developed a system using the Unity game engine to power visualizations and the interactive environment. The system is currently displayed on 3 side-by-side monitors:
  1. Google map of the campus with building nodes overlaid
  2. Location / Character / Event timeline
  3. Images from the archives for the selected node
The goal was to take the photos and relationships from their archives and build a narrative that could be explored in this interactive environment.


Always Already Computational: Collections as Data - slides
Thomas Padilla (UNLV), Hannah Frost (Stanford), Laurie Allen (Univ of Pennsylvania)

Always Already Computational is an IMLS-funded project with the following goals:
  1. creation of a collections as data framework to support collection transformation
  2. development of computationally amenable collection use cases and personas
  3. functional requirements that support development of technological solutions
Much of their current work is focused on talking with libraries and researchers to determine what the needs are and how data can be distributed to researchers. The bottom line is how to make the university collections more useful. There was a lot of interest and interaction with the audience about how to use library collections and make them available for researchers.


Web Archiving Systems APIs (WASAPI) for Systems Interoperability and Collaborative Technical Development - slides
Jefferson Bailey (Internet Archive), Nicholas Taylor (Stanford)
Jefferson and Nicholas reported on WASAPI, an IMLS-funded project to facilitate the transfer of web archive data (WARCs) or derivative data from WARCs.

One of the motivations for the work was a survey finding that local web archive preservation is still uncommon: only about 20% of institutions surveyed download their web archive data for preservation locally.

WASAPI's goal is to help foster and facilitate greater local data preservation and data transfer. There's currently an  Archive-It Data Transfer API that allows Archive-It partners to download WARCs and derivative data (WAT, CDX, etc.) from their Archive-It collections.



Creating Topical Collections: Web Archives vs. the Live Web
Martin Klein (Los Alamos National Laboratory)

Martin and colleagues compared creating topical collections from live web resources (URIs, Twitter hashtags, etc.) with creating topical collections from web archives. The work was inspired by Gossen et al.'s "Extracting Event-Centric Document Collections from Large-Scale Web Archives" (published in TPDL 2017, preprint available at arXiv) and uses WS-DL's Carbondate tool to help with extracting datetimes from webpages.

Through this investigation, they found:
  • Collections about recent events benefit more from the live web resources
  • Collections about events from the distant past benefit more from archived resources
  • Collections about less recent events can still benefit from the live web and from the archived web 


Creating Topical Collections: Web Archives vs. Live Web from Martin Klein


DAY 2 - December 12

From First Seeds to Now: Researching, Building, and Piloting a Harvesting Tool
Ann Connolly, bepress

bepress has developed a harvesting tool for faculty publications in their Expert Gallery Suite and ran a pilot study to gain feedback from potential users. The tool harvests data from MS Academic, which has been shown to have journal coverage on par with Web of Science and Scopus. In addition, MS Academic pulls in working papers, conference proceedings, patents, books, and book chapters. The harvesting tool allows university libraries to harvest metadata from published works of their faculty, including works published while the faculty member was at another institution.

Being unfamiliar with bepress, I didn't realize at first that this was essentially a product pitch. But I learned that bepress is the company behind Digital Commons, the platform that powers ODU's institutional repository, so I was at least a little familiar with the technology being discussed.

bepress was recently acquired by Elsevier, and this was the topic of much discussion during CNI. The acquisition was addressed in a briefing, "bepress and Elsevier: Let’s Go There", given by Jean-Gabriel Bankier, the Managing Director of bepress, on Day 1.


Value of Preserving and Disseminating Student Research Through Institutional Repositories - slides
Adriana Popescu and Radu Popescu (Cal Poly)

This study investigated the impact of hosting student research in an institutional repository (IR) on faculty research impact (citations). They looked at faculty publications indexed in the Web of Science from six departments at Cal Poly and undergraduate senior projects from those same departments deposited in the university's Digital Commons. For their dataset, they found that the citation impact increased as the student project downloads increased. One surprising finding was that the correlation between faculty repository activity and research impact was weaker than the correlation between student repository activity and research impact. The work will be published in Evidence-Based Library and Information Practice.


Annotation and Publishing Standards Work at the W3C - recorded
Timothy Cole (Illinois - Urbana-Champaign)

Tim presented an overview of the W3C Recommendations for Web Annotation and highlighted a few implementations:
Tim also talked about web publications and the challenges in how they can be accommodated on the web.  "A web publication needs to operate on the web as a single resource, even as its components are also web resources."

Tim also gave a pitch for those interested to join a W3C Community Group and noted that membership in W3C is not required for participation there.


Beprexit: Rethinking Repository Services in a Changing Scholarly Communication Landscape - slides
Sarah Wipperman, Laurie Allen, Kenny Whitebloom (UPenn Libraries)

Since I had learned a bit about bepress earlier in the day, I decided to attend this session to hear thoughts from those using Digital Commons and other bepress tools.

The University of Pennsylvania has been using bepress since 2004, but with its acquisition by Elsevier, they are now exploring open source options for hosting Penn's IR, ScholarlyCommons.  Penn released a public statement on their decision to leave bepress.

The presenters gave an overview of researcher services provided by the library and an outline of how they are carefully considering their role and future options.  As they said, Penn is "leaving, but not rushing." They are documenting their exploration of open repository systems at https://beprexit.wordpress.com/.

There was much interest from those representing other university libraries in the audience regarding joining Penn in this effort.


Paul Evan Peters Award & Lecture  - recorded

Scholarly Communication: Deconstruct and Decentralize?
Herbert Van de Sompel, Los Alamos National Laboratory

The final talk at the Fall 2017 CNI Meeting was the Paul Evan Peters Award Lecture.  This year's honoree was our friend and colleague, Herbert Van de Sompel. Herbert's slides are below, and the video will be posted soon.
Herbert discussed applying the principles of the decentralized web to scholarly communication. He proposed a Personal Scholarly Web Observatory that would automatically track the researcher's web activities, including created artifacts, in a variety of portals.
Herbert referenced several interesting projects that have inspired his thinking:
  • MIT's Solid Architecture - proposed set of conventions and tools for building decentralized social applications based on Linked Data principles
  • Sarven Capadisli's dokie.li - a decentralised article authoring, annotation, and social notification tool
  • Amy Guy's "Personal Web Observatory" - tracks daily activities, categorized and arranged visually with icons
These ideas could be used to develop a "Researcher Pod", which could combine an artifact tracker, an Event Store, and a communication platform that could be run on an institutional hosting platform along with an institutional archiving process.  These pods could be mobile and persistent so that researchers moving from one institution to another could take their pods with them.


Paul Evan Peters Lecture from Herbert Van de Sompel


Final Thoughts 

I greatly enjoyed attending my first CNI membership meeting. The talks were all high-quality, and I learned a great deal about some of the issues facing libraries and other institutional repositories.  Once the videos are posted, I encourage everyone to watch Cliff Lynch's plenary and Herbert Van de Sompel's closing talk. Both were excellent.

Because of the parallel sessions, I wasn't able to attend all of the briefings that I was interested in. After seeing some of the discussion on Twitter, I was particularly disappointed to have missed "Facing Slavery, Memory, and Reconciliation: The Research Library’s Role and Georgetown University’s Experience" presented by K. Matthew Dames (Georgetown) and Melissa Levine (Michigan).
Finally, I want to thank and acknowledge our funders, NEH, IMLS, and the Mellon Foundation.  Program officers from these organizations gave talks at CNI:
-Michele

2017-12-31: ACM Workshop on Reproducibility in Publication


On December 7 and 8 I attended the ACM Workshop on Reproducibility in Publication in NYC as part of my role as a member of the ACM Publications Board and co-chair (with Alex Wade) of the Digital Library Committee.  The purpose of this workshop was to gather input from the various ACM SIGs about the approach to reproducibility and "artifacts", objects supplementary to the conventional publication process.  The workshop was attended by 50+ people, mostly from the ACM SIGs, but also included representatives from other professional societies, repositories, and hosting services.  A collection of the slides presented at the workshop and a summary report are being worked on now, and as such this trip report is mostly my personal perspective on the workshop; I'll update with slides, summary, and other materials as they become available.

This was the third such workshop that had been held, but it was the first for me since I joined the Publications Board in September of 2017.  I have a copy of a draft report, entitled "Best Practices Guidelines for Data, Software, and Reproducibility in Publication" from the second workshop, but I don't believe that report is public so I won't share it here.

I believe it was from these earlier workshops that the ACM adopted its policy of including "artifacts" (i.e., software, data, videos, and other supporting materials) in the digital library.  At the time of this meeting the ACM DL had 600-700 artifacts.  To illustrate the ACM's approach to reproducibility and artifacts in the DL, below I show an example from ACM SIGMOD (each ACM SIG is implementing different approaches to reproducibility as appropriate within their community).

The first image below is a paper from ACM SIGMOD 2016, "ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks", which has the DOI URI of https://doi.org/10.1145/2882903.2882964.  This page also links to the SIGMOD guidelines for reproducibility.


Included under the "Source Materials" tab is a link to a zip file of the software and a separate README file in unprocessed markdown format.  What this page doesn't link to is the software page in the ACM DL, which has a separate DOI, https://doi.org/10.1145/3159287.  The software DOI does link back to the SIGMOD paper, but the SIGMOD paper does not appear to explicitly link to the software DOI (again, it links to just the zip and README).



On that page I've also clicked on the "artifacts" button to produce a pop up that explains the various "badges" that the ACM provides; a full description is also available at a separate page.  More tellingly, on this page there is a link to the software as it exists in GitHub.

In slight contrast to the SIGMOD example, the Graphics Replicability Stamp Initiative (GRSI) embraces GitHub completely, linking both to the repositories of the individuals (or groups) that wrote the code and to forks of the code within the GRSI account.  Of course, existing in GitHub is not the same as being archived (reminder: the fading of SourceForge and the closing of Google Code), and a DL has a long-term responsibility for hosting bits and not just linking to them (though to be fair, Git is bigger than GitHub and ACM could commit to git.acm.org).  On the other hand, as GRSI implicitly acknowledges, decontextualizing the code from the community and functions that the hosting service (in this case, GitHub) provides is not a realistic short- or mid-term approach either.  Resolving the tension between memory organizations (like ACM) and non-archival hosting services (like GitHub) is one of the goals of the ODU/LANL AMF funded project ("To the Rescue of the Orphans of Scholarly Communication": slides, video, DSHR summary) and I hope to apply the lessons learned from the research project to the ACM DL.

One of the common themes was "who evaluates the artifacts?"  Initially, most artifacts are considered only for publications otherwise already accepted, and in most cases the evaluation is done non-anonymously by a different set of reviewers.  That adapts best to the current publishing process, but it is unresolved whether or not this is the ideal process -- if artifacts are to become true first-class citizens in scholarly discourse (and thus the DL), perhaps they should be reviewed simultaneously with the paper submission.  Of course, the workload would be immense and anonymity (in both directions) would be difficult if not impossible.  Setting aside the issue of whether or not that is desirable, it would still represent a significant change to how most conferences and journals are administered.  Furthermore, while some SIGs have successfully implemented initial approaches to artifact evaluation with grad students and post-docs, it is not clear to me that this is scalable, nor am I sure it sends the right message about the importance of the artifacts.

Some other resources of note:
The discussion of identifiers, and especially DOIs, is of interest to me because one of the points I made in the meeting and continued on Twitter can roughly be described as "DOIs have no magical properties".  No one actually claimed this, of course, but I did feel the discussion edging toward "just give it a DOI" (cf. getting DOIs for GitHub repositories).  I'm not against DOIs; rather, the short version of my caution is that there is currently a correlation between "archival properties" and "things we give DOIs to", but DOIs do not cause archival properties.

There was a fair amount of back channel discussion on Twitter with "#acmrepro"; I've captured the tweets during and immediately after the workshop in the Twitter moment embedded below.

I'll update this post as slides and the summary report become available.

--Michael








2018-01-02: Link to Web Archives, not Search Engine Caches

Fig.1 Link TheFoundingSon Web Cache
Fig.2 TheFoundingSon Archived Post
In a recent article in Wired, "Yup, the Russian propagandists were blogging lies on Medium too," Matt Burgess makes reference to three now-suspended Twitter accounts: @TheFoundingSon (archived), @WadeHarriot (archived), and @jenn_abrams (archived), and their activity on the blogging service Medium.

Fig.3 TheFoundingSon Suspended Medium Account
Burgess reports that these accounts were suspended on Twitter and Medium, and quotes a Medium spokesperson as saying:
 With regards to the recent reporting around Russian accounts specifically, we’re paying close attention and working to ensure that our trust and safety processes continue to evolve and identify any accounts that violate our rules.
Unfortunately, to provide evidence of the pages' former content, Burgess links to Google caches instead of web archives.  At the time of this writing, two of the three links for @TheFoundingSon's blog posts, which were included in Wired's article, produced a 404 response code from Google (the search engine containing the cached page) when clicking on the link (see Fig.1). 

Only one link (Fig. 4), related to science and politics, was still available a few days after the article was written.

Fig.4 TheFoundingSon Medium Post Related to Science and Politics
Why is only one out of three web cache links still available? Search Engine (SE) caches are useful for covering transient errors in the live web, but they are not archives and thus not suitable for long-term access. In previous work our group has studied SE caches ("Characterization of Search Engine Caches") and the rate at which SE caches are purged ("Observed Web Robot Behavior on Decaying Web Subsites"). SE caches used to play a larger role in providing access to the past web (e.g., "How much of the web is archived?"), but improvements in the Internet Archive (i.e., no longer has a quarantine period, has a "save page now" function) and restrictions on SE APIs (e.g., monetization of the Yahoo BOSS API) have greatly reduced the role of SE caches in providing access to the past web.
To answer our original question of why only one of the three links is still useful: Burgess used SE caches to provide evidence of web pages that were removed from Medium's servers, and research has shown that SEs purge the index and cache of resources that are no longer available, so we can expect that all of the links in the Wired article pointing to SE caches will eventually decay.
If I were going to inquire about the type of blog @TheFoundingSon was writing, I could query https://medium.com/@TheFoundingSon in the IA's Wayback Machine at web.archive.org (Fig.5).
Fig.5 TheFoundingSon Web Archived Pages

Doing so provides a list of ten archived URIs:
  1. https://web.archive.org/web/20170223230217/https://medium.com/@TheFoundingSon/5-things-hillary-is-going-to-do-on-debate-night-81412f6878ab
  2. https://web.archive.org/web/20170223012115/https://medium.com/@TheFoundingSon/blindfolded-election-2016-bc269463dc7
  3. https://web.archive.org/web/20170626021233/https://medium.com/@TheFoundingSon/catholicism-is-evil-and-islam-is-religion-of-peace-74f2d7947162
  4. https://web.archive.org/web/20170222145442/https://medium.com/@TheFoundingSon/gun-control-absurdity-34cabd52f0e4
  5. https://web.archive.org/web/20170120073029/https://medium.com/@TheFoundingSon/hillarys-actions-vs-trump-s-words-what-is-louder-92798789eaf6
  6. https://web.archive.org/web/20170119183629/https://medium.com/@TheFoundingSon/lessons-huffpost-wants-us-to-learn-from-orlando-ac74f2a27922
  7. https://web.archive.org/web/20170222224659/https://medium.com/@TheFoundingSon/making-america-deplorable-37b9cea48b4b
  8. https://web.archive.org/web/20170223094335/https://medium.com/@TheFoundingSon/one-missed-wake-up-call-6cb87200cc2a
  9. https://web.archive.org/web/20170120021905/https://medium.com/@TheFoundingSon/see-something-say-nothing-b144aa5d4d39
  10. https://web.archive.org/web/20170807072351/https://medium.com/@TheFoundingSon/votes-that-count-7766810f0809
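For readers who want to gather such a list programmatically rather than through the Wayback Machine's web interface, a minimal Python sketch using the Internet Archive's public CDX API might look like the following (the choice of fields and filters here is an illustrative assumption, not the exact query behind Fig.5):

import json
import urllib.request

# Query the Internet Archive CDX API for successful captures of
# @TheFoundingSon's Medium posts (the trailing * requests a prefix match).
cdx_uri = ("http://web.archive.org/cdx/search/cdx"
           "?url=medium.com/@TheFoundingSon*"
           "&output=json&fl=timestamp,original&filter=statuscode:200")

with urllib.request.urlopen(cdx_uri) as response:
    rows = json.loads(response.read().decode("utf-8"))

for timestamp, original in rows[1:]:  # the first row is the field header
    print("https://web.archive.org/web/%s/%s" % (timestamp, original))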
The archived web pages are in a time capsule, preserved for generations to come, in contrast to SE caches, which decay in a very short period of time. It is interesting to see that for @WadeHarriot, the account with the smallest number of Twitter followers before its suspension, Wired resorted to the IA for the posting about Hillary Clinton's 'lies'; the other link was a web search engine cache. Both web pages are available in the IA.

Another advantage of web archives over search engine caches is that web archives allow us to analyze changes of a web page through time.  For example, @TheFoundingSon on 2016-06-16 had 14,253 followers, and on 2017-09-01 it had 41,942 followers.


The data to plot @JennAbrams and @TheFoundingSon Twitter follower counts over time were obtained by utilizing a tool created by Orkun Krand while working at the ODU Web Science Digital Library Group (@WebSciDL). Our tool, which will be released in the near future, makes use of the IA and Mementos. Ideally, we would like to capture as many copies (mementos) as possible of available resources, not only in the IA, but in all the web archives around the world. However, our Follower-Count-History tool only uses the IA, because some random Twitter pages most likely will not be found in the other web archives, and since our tool uses HTML scraping to extract the data, other archives may store their web pages in a different format than the IA.


The IA allows us to analyze our Twitter accounts in greater detail. We could not graph the follower count over time for @WadeHarriot because only one memento was available in the web archives. However, multiple mementos were found for the other two accounts. The Followers-Count-Over-Time tool provided the data to plot the two graphs shown above. We notice by looking at the graph of @TheFoundingSon that its Twitter followers doubled from around 15K to around 30K in only six months, and it continued an accelerated ascent, reaching over 40K followers before its suspension. A similar analysis can be made with the @jenn_abrams account. Before October of 2015, @jenn_abrams had around 30K followers; a year later that number had almost doubled to around 55K, topping 70K followers before its suspension. We could question whether the followers of these accounts are real people, or whether the rate at which they accumulated followers is normal for Twitter, but we will leave these questions for another post.

SE caches are an important part of the web infrastructure, but linking to them is a bad idea since they are expected to decay. Instead, we should link to web archives. They are more stable, and as shown in the Twitter-Followers-Count-Over-Time graphs, they allow time series analysis when we can find multiple mementos for the same URI.

- Plinio Vargas

HTTP responses for some links found in the Wired article.



2017-12-31: Digital Blackness in the Archive - DocNow Symposium Trip Report


From December 11-12, 2017, I attended the second Documenting the Now Symposium in St. Louis, MO.  The meeting presentations were recorded and are available along with an annotated agenda; for further background about the Documenting the Now project and my involvement via the advisory board, I suggest my 2016 trip report, as well as DocNow activity on github, slack, and Twitter.  In addition, the meeting itself was extensively live-tweeted with #BlackDigArchive (see also the data set of Tweet ids collected by Bergis Jules).



The symposium began at the Ferguson Public Library, first with a welcome from Vernon Mitchell of DocNow and Scott Bonner of the Ferguson Public Library.  This venue was chosen for its role in the events of Ferguson 2014 (ALA interview, CNN story).  The engaging opening keynote was by Marisa Parham of Amherst College, entitled "Sample, Signal, Strobe", and I urge you to take the time to watch it and not rely on my inevitably incomplete and inaccurate summary.  With those caveats, what I took away from Parham's talk can be summarized as addressing "the confluence of social media and the agency it gives some people" and "twitter as a dataset vs. twitter as an experience", and how losing the context of a tweet removes the "performance" part.  Watching hashtags emerge, watching the repetition of RTs, and the sense of contemporary community and shared experience (which she called "the chorus of again").  I can't remember if she made this analogy directly or if it is just what I put in my notes, but a movie in a theater is a different experience than at home, even though home theaters can be quite high-fidelity, in part because of the shared, real-time experience.  To this point I also tweeted a link to our Katrina web archive slides, because we find that replay of contemporary web pages makes a more powerful argument for web archives than, say, wikipedia or other summary pages.



Parham had a presentation online that provided some of the quotes that she used, but I did not catch the URI.  Here are some of the resources that I was able to track down while she talked (I'm sure I missed several):

Next up was the panel "The Ferguson Effect on Local Activism and Community Memory". Two of the panelists, Alexis Templeton and Kayla Reed, were repeat panelists from the 2016 meeting, which brought up a point they made during their presentations: while archives document specific points in time, the people involved should be allowed to evolve and live their lives without the expectations and weight of those moments.  There was a lot conveyed by the panelists, and I feel I would be doing them a disservice to further summarize their life experiences. Instead, at the risk of interrupting the flow of the post, I will include more tweets from others than I normally would and redirect you to the video for the full presentations and the pointed discussion that followed.















After this panel, we adjourned to the local institution of Drake's Place for lunch, and in the evening saw a screening of "Whose Streets?" at WUSTL.

The next morning we resumed the meeting on the campus of WUSTL and began with tool/technology overviews then breakout demos from Ed Summers, Alexandra Dolan-Mescal, Justin Littman, and Francis Kayiwa.



I'm not sure how much longer demo.docnow.io will be up, but I highly recommend that you interact with the service while you can and provide feedback (sample screen shots above).  The top screen shows trending hashtags for your geographic area, and the bottom screen shows the multi-panel display for the hashtag: tweets, users, co-occurring hashtags, and embedded media.

The second panel, "Supporting Research: Digital Black Culture Archives for the Humanities and Social Sciences", began after the tool demo sessions.


Meridith Clark began with the observation about the day of Ferguson, "some of my colleagues will see this just as data."  Unfortunately, this panel does not appear to have been recorded.  Catherine Knight Steele made the point that while social media are "public spaces", like a church they still require respect.






Clark also solicited feedback from the panel about what tools and functionality they would like to see.  Melissa Brown talked about Instagram (with which our group has done almost nothing to date) and Picodash (with extended features like geographic bounding of searches).  Someone (not clear in my notes) also discussed the need to maintain not just, for example, the text of a blog, but also the entire contemporary UI (this is clearly an application for web archiving, but social media is often not easy to archive).  Clark also discussed the need for more advanced visualization tools, and the panel ended with a discussion about IRBs and social media.

Unfortunately I had to leave for the airport right after lunch and had to miss the third panel, "Digital Blackness in the Archive: Collecting for the Culture".  Fortunately that panel was recorded and is linked from the symposium page.

Another successful meeting, and I'm grateful to the organizers (Vernon Mitchell, Bergis Jules, Tim Cole).  The DocNow project is coming to an end in early 2018, and although I'm not sure what happens next I hope to continue my relationship with this team.

--Michael



2018-01-06: Two WSDL Classes Offered for Spring 2018


Two Web Science & Digital Library (WS-DL) courses will be offered in Spring 2018:
Although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching two classes: CS 395 "Data Wrangling" and CS 395 "Data Analysis".

--Michael

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the last four years (2016--2013) we have graduated five PhD and three MS students, so the focus for this year was "recruiting", and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley,  published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for archival collection building based on material from local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the javascript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which allows easily running Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!)
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new javascript library that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I had to unexpectedly present this paper since, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had both a poster (and separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, hackathons:

















WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main coverage areas:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing and release new software and datasets via our GitHub account. Given the evolving nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year our significant releases and updates include:
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time, and we're excited to have a separate project dedicated to visualizing archived web pages. 

Another point you can probably infer from the discussion above but I decided to make explicit is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above in 2017 we were fortunate to continue to work closely with Martin (2011) now at LANL, Yasmin (2016) now at Berkeley, and Justin (2016) now at Mitre.  

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.

--Michael











2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module


Web pages are generally composed of many resources such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as relative path, absolute path, or a full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended and may result in a damaged memento. For example, a memento (an archived copy) of the web page https://www.odu.edu/ can be seen at https://web.archive.org/web/20180107155037/https://www.odu.edu/. Note that the domain name has changed from www.odu.edu to web.archive.org and some extra path segments were added. In order for this page to render properly, various resource references in it are rewritten; for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/http://www.odu.edu/etc/designs/odu/images/logo-university.png.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival version. Failure to do so would result in a broken rendering of archived pages (composite mementos) as the embedded resource references might resolve to their live version or an invalid location. With the growing use of JavaScript in web applications, often resources are injected dynamically, hence rewriting such references is not possible from the server side. To mitigate this issue, some JavaScript is injected in the page that overrides the global namespace to modify the DOM and monitor all network activity. In JCDL17 and WADL17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, but catches every network request, even those that were initiated due to dynamic resource injection. Read our paper for more details.
Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.


URL Rewriting


There are primarily three ways to reference a resource from another resource: relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In the case of archival replay, both the origin and base paths are changed from the original, while the original origin and paths usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs, on the other hand, are resolved incorrectly or live-leaked when a primary resource is served from an archive; neither of these conditions is desired in archival replay. There is a fourth way of referencing a resource, called schemeless (or protocol-relative), that starts with two forward slashes followed by a domain name and paths. However, web archives usually ignore the scheme part of the URI when canonicalizing URLs, so we can focus on the three main ways. The following table illustrates examples of each with their resolution issues.


Reference type    Example                                      Resolution after relocation
Relative path     images/logo.png                              Potentially correct
Absolute path     /public/images/logo.png                      Potentially incorrect
Absolute URL      http://example.com/public/images/logo.png    Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving them to the client so that various resource references point to their corresponding archival pages. Suppose a page, originally located at http://example.com/public/index.html, has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive at http://archive.example.org/<datetime>/http://example.com/public/index.html, the image reference needs to be rewritten as <img src="/<datetime>/http://example.com/public/images/logo.png"> in order for it to work as desired. However, URLs constructed dynamically by JavaScript on the client-side are difficult to rewrite through static analysis of the code at the server end. With the rising usage of JavaScript in web pages, it is becoming more challenging for archival replay systems to correctly replay archived web pages.

ServiceWorker


ServiceWorker is a new web API that can be used to intercept all the network requests within its scope or originated from its scope (with a few exceptions such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, where it is registered to watch for all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests within its scope. An active ServiceWorker sits in the middle of the client and the server as a proxy (which is built in to the browser). It can change both requests and responses as necessary. The primary use-case of the API is to provide a better offline experience in web apps by serving pages from a client-side cache when there is no network or by populating/synchronizing the cache. However, we found it useful for solving an archival replay problem.

Reconstructive


We created Reconstructive, a ServiceWorker module for archival replay that sits on the client-side and intercepts every potential archival request to properly reroute it. This approach requires no rewrites from the server side. It is being used successfully in our IPFS-based archival replay system called InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing any live-leaks (also known as zombie resources) or wrong URL resolutions.



The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live-web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.


In order to reroute requests to the URI of a potential archived copy (also known as Memento URI or URI-M) Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) of the referrer then combines them with the request URL as necessary to construct a potential URI-M for the request to be rerouted to. If the request URL is already a URI-M, it simply adds a custom request header X-ServiceWorker and fetches the response from the server. When necessary, the response is rewritten on the client-side to fix some quirks to make sure that the replay works as expected or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.


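To make the rerouting logic concrete, here is a minimal Python sketch of that URI-M construction step (the actual module is written in JavaScript and handles many more cases; the regular expression and fallbacks below are simplifications, not the module's real code):

import re
from urllib.parse import urljoin

# A URI-M is assumed to look like <archive prefix>/<14-digit datetime>/<URI-R>.
URIM_PATTERN = re.compile(r"^(?P<prefix>.*/)(?P<datetime>\d{14})/(?P<urir>.+)$")

def construct_potential_urim(request_url, referrer_urim):
    """Derive a potential URI-M for request_url using the referrer's URI-M."""
    if URIM_PATTERN.match(request_url):
        return request_url  # already a URI-M, no rerouting needed
    referrer = URIM_PATTERN.match(referrer_urim)
    if not referrer:
        return request_url  # referrer is not a recognizable URI-M
    # Resolve the request against the original URI-R, then wrap it in the
    # same archive prefix and datetime as the referrer.
    urir = urljoin(referrer.group("urir"), request_url)
    return referrer.group("prefix") + referrer.group("datetime") + "/" + urir

print(construct_potential_urim(
    "http://www.odu.edu/etc/designs/odu/images/logo-university.png",
    "https://web.archive.org/web/20180107155037/https://www.odu.edu/"))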
We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activities. The test suite comes with a pre-captured WARC file of a live test page. The captured resources are all green, while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system is replaying archived resources. In the test suite, a green box means proper rerouting, a red box means a live-leakage, and white/gray means an incorrectly resolved reference.


Module Usage


The module is intended to be used by archival replay systems backed by a Memento endpoint. It can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order to use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import the reconstructive.js script (locally or externally), which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed Reconstructive.reroute function.

importScripts('https://oduwsdl.github.io/reconstructive/reconstructive.js');
self.addEventListener('fetch', Reconstructive.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding some requests that match a default set of exclusion rules. However, URI-M pattern, exclusion rules, and many other configuration options can be customized. It even allows customization of the default response rewriting function and archival banner. The module can also be configured to only reroute a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details read the user documentation, example usage (registration process and sample ServiceWorker), or heavily documented module code.

Archival Banner


The Reconstructive module has implemented a custom element named <reconstructive-banner> to provide archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner leaking into the page or the other way around. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing the Reconstructive module, after which it will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but goes out of the way when not needed. The banner element is currently in an early stage with very limited information and interactivity, but it is intended to evolve into a more functional component.

<script src="https://oduwsdl.github.io/reconstructive/reconstructive-banner.js"></script>
<reconstructive-banner urir="http://example.com/" datetime="20180106175435"></reconstructive-banner>


Limitations


It is worth noting that we rely on some fairly new web APIs that might not have very good and consistent support across all browsers and may potentially change in the future. At the time of writing this post, ServiceWorker support is available in about 74% of active browsers globally. To help the server identify whether a request is coming from Reconstructive (to provide a fallback of server-side rewriting), we add a custom request header X-ServiceWorker.

As per current specifications, there can be only one ServiceWorker active on a given scope. This means that if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in typical web apps ServiceWorkers are generally used to improve the user experience, and such apps gracefully degrade to remain functional without one (though this is not guaranteed). The best we can do in this case is to rewrite any ServiceWorker registration code (on the client-side) in an archived page before serving the response, disabling it so that Reconstructive continues to work.

Conclusions


We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite ACRTS that can be useful by itself. This work is supported in part by NSF grant III 1526700.

Resources




--
Sawood Alam

2018-02-27: Summary of Gathering Alumni Information from a Web Social Network

While researching my dissertation topic (slides 2--28) on social media profile discovery, I encountered a related paper titled Gathering Alumni Information from a Web Social Network written by Gabriel Resende Gonçalves, Anderson Almeida Ferreira, and Guilherme Tavares de Assis, which was published in the proceedings of the 9th IEEE Latin American Web Congress (LA-WEB). In this paper, the authors detailed their approach to define a semi-automated method to gather information regarding alumni of a given undergraduate program at Brazilian higher education institutions. Specifically, they use the Google Custom Search Engine (CSE) to identify candidate LinkedIn pages based on a comparative evaluation of similar pages in their training set. The authors contend alumni are efficiently found through their process, which is facilitated by focused crawling of data publicly available on social networks posted by the alumni themselves. The proposed methodology consists of three main modules and two data repositories, which are depicted in Figure 1. Using this functional architecture, the authors constructed a tool that gathers professional data on the alumni in undergraduate programs of interest, then proceeds to classify the associated HTML page to determine relevance. A summary of their methodology is presented here.

Figure 1 - Functional architecture of the proposed method

Repositories

The first repository, Pages Repository, stores the web pages from the initial set of data samples which are used to start the classification process. This set is comprised of alumni lists obtained from five universities across Brazil. The lists contain the names of students enrolled between 2000 and 2010 in undergraduate programs, namely Computer Science at three institutions, Metallurgical Engineering at one institution, and Chemistry at one institution. The total number of alumni available on all lists is 6,093. For the purpose of validation, a random set of 15 alumni are extracted from each list as training examples during each run of their classifier. The second repository, Final Database, is the database where academic data on each alumnus is stored for further analysis.

Modules

The first module, Searcher, determines the candidate pages from a Google result set that might belong to the alumni group. LinkedIn is the social network of choice, from which the authors leverage public pages on the web that have been indexed by a search engine. The search is initiated using a combination of the first, middle, and last names of a given alumnus, and then relevant data concerning the undergraduate program, program degree, and institution are extracted from the candidate pages. The authors chose not to search using LinkedIn's Application Programming Interface (API) due to its inherent limitations. Specifically, the API requires authentication by a registered LinkedIn user, and searches are restricted to the first-degree connections of the user conducting the search. As an alternative, the authors use the Google Custom Search Engine, which provides access to Google's massive repository of indexed pages, but is limited to 100 daily free searches returning 100 results per query.
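As an illustration of the kind of query the Searcher module issues (the endpoint is Google's public Custom Search JSON API; the API key, search engine id, alumnus name, and query composition are placeholders, not values from the paper):

import requests  # third-party HTTP library, assumed to be installed

# Hypothetical CSE query combining an alumnus name with program terms,
# restricted to public LinkedIn pages.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

params = {
    "key": "YOUR_API_KEY",          # placeholder credential
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder custom search engine id
    "q": '"Maria Souza Lima" "Computer Science" site:linkedin.com',
}

response = requests.get(CSE_ENDPOINT, params=params)
response.raise_for_status()

# Each result is a candidate page to be stored in the Pages Repository.
for item in response.json().get("items", []):
    print(item.get("title"), item.get("link"))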

We should note that in the years since this paper was published in 2014, LinkedIn has instituted a number of security measures to impede data harvesting of public profiles. They employ a series of automated tools, FUSE, Quicksand, Sentinel, and Org Block, that are used to monitor suspicious activity and block web scraping. Requests are throttled based on the requester's IP address (see hiQ Labs v. LinkedIn Corporation).  Anonymous viewing of a large number of public LinkedIn profile pages, even if retrieved using Google's boolean search criteria, is not always possible. After an undisclosed number of public profile views, LinkedIn forces the user to either sign up or log in as a way to thwart scraping by 3rd party applications (Figure 2).


Figure 2 - LinkedIn Anonymous Search Limit Reached
The second module, Filter, determines the significance of the candidate pages provided by the Searcher module via the Pages Repository. The classification process determines the similarity among pages using the academic information on the LinkedIn page as terms, which are then separated into categories that describe the undergraduate program, institution, and degree. The authors use cosine similarity over term frequencies to relate candidate pages from the Searcher module to the initial training set, and they specify a 30% threshold for the minimum percentage of pages on which a term must appear.
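As a generic illustration of that similarity computation (this is not the authors' code, and the sample strings are made up), cosine similarity over raw term-frequency vectors can be computed along these lines:

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents using raw term frequencies."""
    tf_a, tf_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot_product = sum(tf_a[term] * tf_b[term] for term in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    return dot_product / (norm_a * norm_b) if norm_a and norm_b else 0.0

candidate = "bachelor computer science ouro preto 2008"
training = "computer science bachelor degree ouro preto"
print(round(cosine_similarity(candidate, training), 3))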

The third module, Extraction, extracts the demographic and academic information from the HTML pages returned by the Filter module using regular expressions as shown in Figure 3. The extracted information is stored in the Final Database for further analysis using the Naive Bayes bag-of-words model to identify specific alumni of the desired undergraduate program.


Figure 3 - Regular Expressions Used by Extraction Module

Results and Takeaways

The authors acknowledge that obtaining an initial list of alumni names is not a major obstacle. However, collecting the initial set of sample pages from a social network, such as LinkedIn, may be time consuming and labor intensive even with small data sets. Their evaluation, as shown in Figure 4, indicates satisfactory precision, and the methodology proposed in their paper is able to find an average of 7.5% to 12.2% of alumni for undergraduate programs with more than 1,000 alumni.

Figure 4 - Pages Retrieved and Precision Results For Proposed Method and Baseline
Given the highly structured design of LinkedIn HTML pages, we would expect the Filter and Extraction modules to identify and successfully retrieve a higher percentage of alumni, even without applying a machine learning technique. The bulk of this paper's research is predicated upon access to public data on the web. If social media networks choose to present barriers that impede the collection of this public information, continued research by these authors and others will be significantly impacted. With regard to LinkedIn public profiles, we can only anticipate the imminent outcome of pending litigation, which will determine who controls publicly available data.

--Corren McCoy (@correnmccoy)


Gonçalves, G. R., Ferreira, A. A., de Assis, G. T., & Tavares, A. I. (2014, October). Gathering alumni information from a web social network. In Web Congress (LA-WEB), 2014 9th Latin American (pp. 100-108). IEEE.

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.
The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working due to different environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.
How to run the CoreNLP server on localhost port 9000 from a Docker container
  1. Install Docker if not already available
  2. Pull the image from the repository and run the container:
Using the server
The server can be used from the browser, from the command line, or from custom scripts:
  1. Browser: To use the CoreNLP server from the browser, open your browser and visit http://localhost:9000/. This presents the user interface (Fig. 1) of the CoreNLP server.
  2. Command line (NER example):
    Fig. 2: Sample request URL sent to the Named Entity Annotator 
    To use the CoreNLP server from the terminal, learn how to send requests to a particular annotator from the CoreNLP usage webpage, or learn from the request URL the browser (1.) sends to the server. For example, this request URL was sent to the server by the browser (Fig. 2), and corresponds to the following command that uses the Named-Entity Recognition system to label the supplied text:
  3. Custom script (NER example): I created a Python function nlpGetEntities() that uses the NER annotator to label a user-supplied text.
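As a rough sketch of such a custom script (this is not the exact nlpGetEntities() implementation; it assumes the server is listening on localhost port 9000 and that the third-party requests library is installed):

import json
import requests

def ner_entities(text, server="http://localhost:9000"):
    """Send text to the CoreNLP server and return (token, NER label) pairs."""
    properties = {"annotators": "ner", "outputFormat": "json"}
    response = requests.post(server,
                             params={"properties": json.dumps(properties)},
                             data=text.encode("utf-8"))
    response.raise_for_status()
    document = response.json()
    return [(token["word"], token["ner"])
            for sentence in document["sentences"]
            for token in sentence["tokens"]]

print(ner_entities("Old Dominion University is located in Norfolk, Virginia."))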
To stop the server, issue the following command: 
The Dockerfile I created targets CoreNLP version 3.8.0 (2017-06-09). There is a newer version of the service (3.9.1). I believe it should be easy to adapt the Dockerfile to install the latest version by replacing all occurrences of "2017-06-09" with "2018-02-27" in the Dockerfile.  However, I have not tested this operation since version 3.9.1 is marginally different from version 3.8.0 for my use case, and I have not tested version 3.9.1 with my application benchmark. 

--Nwala

2018-03-12: NEH ODH Project Directors' Meeting



Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC.  We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC.

The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.



After the keynote, each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 45 projects highlighted and short descriptions of each are available through the award announcements (awarded in August 2017, awarded in December 2017).  Remember, video is coming soon for all of the 3-minute lightning talks.



Here are my 3 slides, previewing our grid, animation/slider, and timeline views for visualizing significant webpage changes over time.


Visualizing Webpage Changes Over Time from Michele Weigle


Following the lightning talks, the ODH at Ten celebration began with a keynote by Honorable John Unsworth, NEH National Council Member and University Librarian and Dean of Libraries at the University of Virginia.

I was honored to be invited to participate in the closing panel highlighting the impact that ODH support had on our individual careers and looking ahead to future research directions in digital humanities. 
Panel: Amanda French (George Washington), Jesse Casana (Dartmouth College), Greg Crane (Tufts), Julia Flanders (Northeastern), Dan Cohen (Northeastern),  Michele Weigle (Old Dominion), Matt Kirschenbaum (University of Maryland)



Thanks to the ODH staff, especially ODH Director Brett Bobley and our current Program Officer Jen Serventi, for organizing a great meeting.  It was also great to be able to catch up with our first ODH Program Officer, Perry Collins. We are so appreciative of the support for our research from NEH ODH.

Here are more tweets from our day at ODH:

-Michele


2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Due to the limitations of Twitter's API, we have only a restricted ability to collect historical data about a user's followers. The information for when one account started following another is unavailable, and without it, tracking the popularity of an account and how it grew cannot be done. Another pitfall is that when an account is deleted, Twitter does not provide data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to have been archived, then a follower count for a specific date can be collected. 

The previous method to determine followers over time is to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of a follower is the lower bound for when they could have started following the account under observation. Its correctness depends on new accounts immediately following the account under observation so that the lower bound is accurate. The order in which Twitter returns followers is subject to unannounced change, so it can't be depended on to work long term. It also will not show when an account starts losing followers, because the API only returns users still following the account. This tool will help accurately gather and plot the follower count based on mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to increase follower numbers.

I improved on a Python script, created by Orkun Krand, that collects the follower counts for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on GitHub. Through the historical pages kept in the Internet Archive, the number of followers can be observed for the specific date of each collected memento. The script collects the follower count by identifying the various CSS selectors associated with the follower count in most of the major layouts Twitter has implemented. If a Twitter page isn't popular enough to warrant being archived, or is too new, then no data can be collected on that user.
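As a rough sketch of that extraction step (the selector below matches only one of the Twitter layouts from that era; the real script tries a list of selectors, and the example URI-M at the end is hypothetical):

import urllib.request
from bs4 import BeautifulSoup

def follower_count(urim):
    """Extract the follower count from one memento of a Twitter profile page."""
    with urllib.request.urlopen(urim) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    # One of several selectors needed in practice; older and newer Twitter
    # layouts use different markup, so a real implementation falls back
    # through a list of selectors until one matches.
    node = soup.select_one("li.ProfileNav-item--followers .ProfileNav-value")
    if node is None:
        return None  # unrecognized page layout
    count = node.get("data-count") or node.get_text()
    return int(str(count).replace(",", ""))

# Hypothetical URI-M for illustration:
print(follower_count("http://web.archive.org/web/20170101000000/https://twitter.com/USAGym"))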

This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll @Ten_GOP, impersonating the Tennessee GOP was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived. 
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers due to Twitter suspending suspected bot accounts. This script enables investigating users who have lost followers, and for how long they lost them. It is important to note that the default flag to collect one memento a month is not expected to have the granularity to view behaviors that typically happen on a small time frame. To correct that, the flag [-e] to collect all mementos for an account should be used. The republican political commentator @mitchellvii lost followers in two recorded incidences. In January 2017 from the 1st to the 4th, @mitchellvii lost 1270 followers. In April 2017 from the 15th to the 17th, @mitchellvii lost 1602 followers. Using only the Twitter API to collect follower growth would not show this phenomenon.



Dependencies:


  • Python 3

  • R* (to create graph)

  • bs4

  • urllib

  • archivenow* (push to archive)

  • datetime* (push to archive)



*optional

How to run the script:

$ git clone https://github.com/oduwsdl/FollowerCountHistory.git
$ cd FollowerCountHistory
$ ./FollowerHist.py [-h] [-g] [-e] [-p | -P] <twitter-username-without-@>

Output: 

The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for that date, and the URL for that memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain all the dates of mementos where the follower count was not collected and will list the reason why. All file and folder names are named after the Twitter username provided, after being cleaned to ensure system safety.

If the [-g] flag is used, the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.

Example Output:

./FollowerHist.py -g -p USAGym
USAGym
http://web.archive.org/web/timemap/link/http://twitter.com/USAGym
242 archive points found
20120509183245
24185
20120612190007
...
20171221040304
250242
20180111020613
250741
Not Pushing to Archive. Last Memento Within Current Month.
null device
1


cd usagym/; ls
usagym.csv usagym-Error.csv usagym-line.png

How it works:

$ ./FollowerHist.py --help

usage: FollowerHist.py [-h] [-g] [-p | -P] [-e] uname

Follower Count History. Given a Twitter
username, collect follower counts from
the Internet Archive.

positional arguments:

uname       Twitter username without @

optional arguments:

-h, --help show this help message and exit
-g Generate a graph with data points
-p Push to Internet Archive
-P Push to all archives available through ArchiveNow
-e Collect every memento, not just one per month

First, the timemap, the list of all mementos for that URI, is collected for http://twitter.com/username. Then, the script collects the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all the following apply:
    1. A previously created .csv of the name the script would generate does not contain the date.
    2. The memento is not in the same month as a previously collected memento, unless [-e] is used.
    3. The page format can be interpreted to find the follower count.
    4. The follower count number can be converted to an Arabic numeral.
A .csv is created, or appended to, containing the date, number of followers, and memento URI for each collected data point.
An error .csv is created, or appended to, with the date and memento URI for each data point that was not collected, along with the reason why. This file will contain repeated entries if the script is run multiple times, because old entries are not deleted when new errors are written.
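
To make these steps concrete, below is a minimal Python sketch, not the actual FollowerHist.py source. The TimeMap regex and the CSS selector are assumptions: the selector matches only one of Twitter's historical layouts, while the real script checks several.

# A simplified sketch of the steps above, not the actual FollowerHist.py source.
import re
import urllib.request
from bs4 import BeautifulSoup

def follower_counts(username, one_per_month=True):
    timemap_uri = ("http://web.archive.org/web/timemap/link/"
                   "http://twitter.com/" + username)
    with urllib.request.urlopen(timemap_uri) as response:
        timemap = response.read().decode("utf-8", errors="ignore")

    # Pull each memento's URI-M and its 14-digit datetime out of the TimeMap.
    mementos = re.findall(
        r'<(http://web\.archive\.org/web/(\d{14})/[^>]*)>;\s*rel="[^"]*memento',
        timemap)

    seen_months = set()
    for uri_m, datetime14 in mementos:
        month = datetime14[:6]                       # YYYYMM
        if one_per_month and month in seen_months:   # the [-e] flag disables this
            continue
        seen_months.add(month)
        try:
            with urllib.request.urlopen(uri_m) as response:
                soup = BeautifulSoup(response.read(), "html.parser")
        except Exception:
            continue                                 # unreachable memento: error CSV
        # Assumed selector for one historical Twitter layout.
        node = soup.select_one('a[data-nav="followers"] .ProfileNav-value')
        if node is None:
            continue                                 # unrecognized layout: error CSV
        count = int(re.sub(r"[^\d]", "", node.get_text()) or 0)
        yield datetime14, count, uri_m

for date, count, uri_m in follower_counts("USAGym"):
    print("{},{},{}".format(date, count, uri_m))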

If the [-g] flag is used, a .png of the line chart will be created "<twitter-username-without-@>-line.png".
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.

As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.

Special thanks to Orkun Krand, whose work I am continuing.
--Miranda Smith (@mir_smi)


2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscription-based sites, such as the Financial Times or the Wall Street Journal, because these sites only provide snippets of an article before confronting users with a "Subscribe Now" prompt to view the remaining content. The New York Times, like some other news sites, also has subscriber-only content, but access is limited only after a user has exceeded a set number of free stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took these URI-Ms from the homepages of their respective news sites and examined how the Internet Archive captured them over a period of a month.



The image above shows requests sent to the Internet Archive's Memento API, with the initial request at 0 days and then 1, 7, and 30 days added to the initial request date, to see whether the retrieved URI-M resolved to something other than 404. The initial requests to these mementos all returned a 404 status code. Adding a day to the memento datetime and requesting a new copy from the Internet Archive resulted in some of the URI-Ms resolving with a 200 response code, showing that these articles became available. Adding 7 days to the initial request datetime shows that by this time the Internet Archive has found copies for all but one URI-M; the same result is repeated at 30 days. The response code "0" indicates no response, caused by an infinite redirect loop. The chart is consistent with the idea that content is released for free once a period of time has passed.
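
The sketch below illustrates the kind of requests behind this chart (not the exact code we used): it asks the Wayback Machine for the capture closest to the original memento datetime and again at +1, +7, and +30 days, recording the final HTTP status after redirects. The example article URI is hypothetical.

# A sketch of the experiment described above (not the exact code we used).
from datetime import datetime, timedelta
import requests

def status_over_time(original_uri, first_memento_datetime):
    start = datetime.strptime(first_memento_datetime, "%Y%m%d%H%M%S")
    results = {}
    for days in (0, 1, 7, 30):
        timestamp = (start + timedelta(days=days)).strftime("%Y%m%d%H%M%S")
        uri_m = "http://web.archive.org/web/{}/{}".format(timestamp, original_uri)
        try:
            # The Wayback Machine redirects to the closest capture it holds.
            response = requests.get(uri_m, allow_redirects=True, timeout=30)
            results[days] = response.status_code
        except requests.TooManyRedirects:
            results[days] = 0   # "0" marks an infinite redirect loop, as in the chart
    return results

print(status_over_time("https://www.wsj.com/articles/example-article",
                       "20161101120000"))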

The New York Times articles end up redirecting to a different part of the New York Times website: https://web.archive.org/web/20100726195833/http://www.nytimes.com/glogin. Although each of these URIs resolves with a 404 status code, an earlier capture shows that it was a login page asking for signup or subscription:

Paywalls in Academia

Paywalls restrict not just news content but also academic content. When users follow a DOI assigned to a paper, they are often redirected to a splash page showing a short description of the paper but not the actual PDF document. An example of this is: http://www.springerlink.com/index/rw3572714v41q507.pdf. This URI seemingly should yield a PDF but actually resolves to a splash page:

In order to actually access the content, a user is first redirected to the splash page:
https://link.springer.com/article/10.1023%2FA%3A1022602019183?LI=true

This splash page then contains a link to the desired content:
https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183.pdf

An archived copy of the desired content is:
http://web.archive.org/web/*/https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183.pdf

An interesting find is that the archived URI without the document type ".pdf" at the end of the URI also contains mementos of the content and not the splash page:
http://web.archive.org/web/*/https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183
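
As a quick illustration (a sketch using the same DOI as in the URIs above), following the DOI's redirect chain confirms that it lands on the HTML splash page rather than the PDF itself:

# A sketch of following a DOI's redirect chain (same DOI as the example above).
import requests

response = requests.get("https://doi.org/10.1023/A:1022602019183",
                        allow_redirects=True, timeout=30)
for hop in response.history:
    print(hop.status_code, hop.url)           # each 30x redirect along the way
print(response.status_code, response.url)     # the final landing (splash) page
print(response.headers.get("Content-Type"))   # text/html, not application/pdf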

Organizations that are willing to pay for a subscription to an association that hosts academic papers will have access to the content. A popular example is the ACM Digital Library. When non-subscribed users visit pages like SpringerLink, they may not get the blue "Download PDF" button but rather a grey button signifying that the download is disabled.

Van de Sompel et al. investigated 1.6 million URI references from arXiv and PubMed Central and found that over 500,000 of the URIs were locating URIs, pointing at the current document location. These URIs can expire over time, and referencing them instead of DOIs forfeits the persistence that DOIs are meant to provide.

Searching for Similarity

When considering hard paywall sites like the Financial Times (FT) and the Wall Street Journal (WSJ), it is intuitive that most of the paywall pages a non-subscribed user sees will look largely the same. We experimented with 10 of the top WSJ articles on 11/01/2016, each scraped from the WSJ homepage. We compared each pair of articles by taking the SimHash of each article's HTML representation and computing the Hamming distance between each unique pair of SimHash bit strings.
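
A minimal 64-bit SimHash and Hamming distance computation is sketched below; the real experiment ran over the full WSJ article HTML, and the page strings here are placeholders.

# A minimal SimHash/Hamming-distance sketch of the comparison described above.
import hashlib
import re
from itertools import combinations

def simhash(text, bits=64):
    vector = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (digest >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Pairwise distances between (hypothetical) downloaded article HTML strings.
pages = {"article1": "<html>...article one...</html>",
         "article2": "<html>...article two...</html>"}
fingerprints = {name: simhash(html) for name, html in pages.items()}
for (name1, f1), (name2, f2) in combinations(fingerprints.items(), 2):
    print(name1, name2, hamming_distance(f1, f2))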

We found that pages with completely different representations stood out with a Hamming distance of 40+ bits, while articles with the same styled representation had at most a 3-bit Hamming distance, regardless of whether the article was a snippet or a full-length article. This showed that SimHash was not well suited for discovering differences in content but rather differences in content representation, such as changes in CSS, HTML, or JavaScript. It did not help our observations that WSJ was including entire font-family data text inside its HTML at the time. In reference to Maciej Ceglowski's post on "The Website Obesity Crisis," WSJ injecting a CSS font-family data string does not aid in a healthy "web pyramid":



From here, I decided to explore using a binary image classifier on a thumbnail of a news site, labeling an image as a "paywall_page" or a "content_page." To accomplish this I used TensorFlow and the very approachable examples provided by the "TensorFlow for Poets" tutorial. Utilizing the MobileNet model, I trained on 122 paywall images and 119 content-page images, mainly news homepages and articles. The images were collected using Google Images and manually classified as content or paywall pages.


I trained the model on the new images for 4,000 iterations, which produced an accuracy of 80-88%. I then built a simple web application named paywall-classify, which can be found on GitHub. It uses Puppeteer to take screenshots of a given list of URIs (maximum 10) at a resolution of 1920x1080 and then uses TensorFlow to classify the images. More instructions on how to use the application can be found in the repository readme.
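
For reference, the sketch below shows a comparable transfer-learning setup in modern tf.keras rather than the original "TensorFlow for Poets" retraining script; the screenshots/ directory layout with paywall_page/ and content_page/ subfolders is an assumption.

# Not the original "TensorFlow for Poets" retraining script; a comparable
# transfer-learning sketch with tf.keras and MobileNetV2. The directory layout
# screenshots/paywall_page/ and screenshots/content_page/ is assumed.
import tensorflow as tf

IMG_SIZE = (224, 224)

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "screenshots", image_size=IMG_SIZE, batch_size=16)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # only train the new classification head

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # paywall vs. content page
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)
model.save("paywall_classifier.h5")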

There are many other techniques that could be considered for image classification of web pages, for example, slicing a full-page image of a news website into sections. However, this approach would more than likely bias toward the content class, since the "subscribe now" banner is almost always at the top of an article, meaning only 1 of n slices would contain it. For this application I also did not consider scrolling down a page to trigger a JavaScript popup of a paywall message.

Other approaches might utilize textual analysis, such as performing Naive Bayes classification on terms collected from a paywall page and then building a classifier from there. 

What to take away

It is actually difficult to determine why some of the URI-Ms listed result in 404 responses while other articles from those sites return a 200 response for their first memento. The New York Times has a limit of 10 "free" articles per user, so perhaps at crawl time the Internet Archive hit its quota. Mat Kelly et al., in Impact of URI Canonicalization on Memento Count, describe "archived 302s", where a live web site returns an HTTP 302 redirect at crawl time; these New York Times articles may actually have been redirecting to a login page when they were crawled.

-Grant Atkins (@grantcatkins)

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English


Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages whose default HTML language was expected to be English, but whose template appears in a foreign language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. Notice that some of the interface text, such as "followers", "following", and "log in", is not displayed in English but instead in Urdu. A similar observation was expressed by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications when looked at from a digital archivist's perspective.

The problem became more evident when Miranda Smith (a member of the WSDL lab) was finalizing the implementation of a Twitter Follower-History-Count tool. The application uses mementos extracted from the Internet Archive (IA) to find the number of followers a particular Twitter account had acquired over time. The tool expects the web page retrieved from the IA to be rendered in English in order to scrape the number of followers the account had at a particular time. Since it was now evident that Twitter pages were not archived only in English, we had to decide whether to account for all possible language settings or to discard non-English mementos. We asked ourselves: why are some Twitter pages archived in non-English languages when we generally expect them to be in English? Note that we are referring to the interface/template language and not the language of the tweet content.

We later found that this issue is more prevalent than we initially thought. We selected former US President Barack Obama as the personality to explore how many languages, and how often, his Twitter page was archived in. We downloaded the TimeMap of his page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in their widgets) across five different web archives: the Internet Archive (IA), Archive-It (A-It), the Library of Congress (LoC), the UK Web Archive (UKWA), and the Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. Of the remaining 47% of mementos, 22% were archived in Kannada and 25% in the 45 other languages combined. We excluded mementos from our dataset that were not "200 OK" or did not have language information.
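
The sketch below shows how such a language census can be run (it is not our exact analysis code): download a TimeMap through a MemGator aggregator, dereference each memento, and tally the lang attribute of the <html> element. The public MemGator endpoint used here is an assumption about deployment.

# A sketch of the language census described above (not our exact analysis code).
import re
from collections import Counter
import requests

def memento_language_census(uri_r, limit=None):
    timemap_uri = "https://memgator.cs.odu.edu/timemap/link/" + uri_r
    timemap = requests.get(timemap_uri, timeout=60).text
    uri_ms = re.findall(r'<([^>]+)>;\s*rel="[^"]*memento', timemap)
    census = Counter()
    for uri_m in uri_ms[:limit]:
        try:
            response = requests.get(uri_m, timeout=60)
        except requests.RequestException:
            continue
        if response.status_code != 200:
            continue  # we excluded non-200 mementos from the dataset
        match = re.search(r'<html[^>]*\slang="([^"]+)"', response.text)
        if match:
            census[match.group(1)] += 1
    return census

print(memento_language_census("https://twitter.com/BarackObama", limit=100))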

Fig. 2 shows that in the UKWA, English accounts for only 5% of the languages in which Barack Obama's Twitter pages were archived. Conversely, in the IA, about as many of Barack Obama's Twitter pages are archived in English as in all the remaining languages combined. It is worth noting that A-It is a subset of the IA. On the one hand, it is good to have more language diversity in archives (for example, the archival record is more complete for English-language web pages than for other languages). On the other hand, it is very disconcerting when a page is captured in a language that was not anticipated. We also noted that Twitter pages in the Kannada language are archived more often than all other non-English languages combined, although Kannada ranks 32nd globally by number of native speakers, who make up 0.58% of the global population. We tried to find out why some Twitter pages were archived in non-English languages when they belong to accounts that generally tweet in English, and why Kannada is so prevalent among the non-English languages. Our findings follow.

Fig. 2 Barack Obama Twitter Page Language Distribution in Web Archives

We started investigating the reason why web archives sometimes capture pages in non-English languages, and we came up with the following potential reasons:
  • Some JavaScript in the archived page is changing the template text in another language at the replay time
  • A cached page on a shared proxy is serving content in other languages
  • "Save Page Now"-like features are utilizing users' browsers' language preferences to capture pages
  • Geo-location-based language setting
  • Crawler jobs are intentionally or unintentionally configured to send a different "Accept-Language" header
The actual reason turned out to have nothing to do with any of these; instead, it was related to cookies. However, describing our thought process and how we arrived at the root of the issue offers some lessons worth sharing.

Evil JavaScript


Since JavaScript is known to cause issues in web archiving (a previous blog post by John Berlin expands on this problem), both at capture and replay time, we first thought this had to do with some client-side localization in which the wrong translation file was leaking in at replay time. However, when we looked at the page source in a browser as well as on the terminal using curl (as illustrated below), it was clear that the translated markup is generated on the server side. Hence, this possibility was ruled out.

$ curl --silent https://twitter.com/?lang=ar | grep "<meta name=\"description\""
<meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">

Caching


We thought Twitter might be doing content negotiation using the "Accept-Language" request header, so we changed the language preference in our web browser and opened Twitter in an incognito window, which confirmed our hypothesis: Twitter did indeed consider the language preference sent by the browser and responded with a page in that language. However, when we investigated the HTTP response headers, we found that twitter.com does not return the "Vary" header when it should. This behavior can be dangerous because content negotiation is happening on the "Accept-Language" header but is not advertised as a factor of content negotiation. This means a proxy can cache a response to a URI in some language and serve it back to someone else who requests the same URI, even with a different language in their "Accept-Language" setting. We considered this a potential way an undesired response could get archived.

On further investigation we found that Twitter tries very hard (sometimes in wrong ways) to make sure their pages are not cached. This can be seen in their response headers illustrated below. The Cache-Control and obsolete Pragma headers explicitly ask proxies and clients not to cache the response itself or anything about the response by setting values to "no-cache" and "no-store". The Date (the date/time at which the response was originated) and Last-Modified headers are set to the same value to ensure that the cache (if stored) becomes invalid immediately. Additionally, the Expires header (the date/time after which the response is considered stale) is set to March 31, 1981, a date far in the past, long before Twitter even existed, to further enforce cache invalidation.


$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
pragma: no-cache
date: Sun, 18 Mar 2018 17:43:25 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
...

Hence, the possibility of a cache returning pages in different languages due to the missing "Vary" header was also not sufficient to justify the number of mementos in non-English languages.

Geo-location


We thought about the possibility that Twitter identifies a potential language for guest visitors based on their IP address (to guess the geo-location). However, the languages seen in mementos do not align with the places where archival crawlers are located. For example, the Kannada language that is dominating in the UK Web Archive is spoken in the State of Karnataka in India, and it is unlikely that the UK Web Archive is crawling from machines located in Karnataka.

On-demand Archiving


The Internet Archive recently introduced the "Save Page Now" feature, which acts as a proxy and forwards the user's request headers to the upstream web server rather than its own. This behavior can be observed in a memento that we requested for an HTTP echo service, HTTPBin, from our browser. The service echoes back in the response the data it receives from the client in the request, so by archiving it we expect to see the headers that the service saw from the requesting client. The headers shown below are those of our browser, not of the IA's crawler, especially the "Accept-Language" header (which we customized in our browser) and the "User-Agent" header, which confirms our hypothesis that IA's Save Page Now feature acts as a proxy.

$ curl http://web.archive.org/web/20180227154007/http://httpbin.org/anything
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate",
"Accept-Language": "es",
"Connection": "close",
"Host": "httpbin.org",
"Referer": "https://web.archive.org/",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
},
"json": null,
"method": "GET",
"origin": "207.241.225.235",
"url": "https://httpbin.org/anything"
}


This behavior made us consider that people from different regions of the world, with different language settings in their browsers, would end up preserving Twitter pages in their preferred language when using the "Save Page Now" feature (since Twitter does honor the "Accept-Language" header in some cases). However, we were unable to replicate this in our browser. Also, not every archive offers on-demand archiving, so archives that never replay users' request headers cannot be explained this way.

We also repeated this experiment with Archive.is, another on-demand web archive. Unlike the IA, it does not replay users' headers like a proxy; instead, it sends its own custom request headers. Archive.is does not show the original markup, instead modifying the page heavily before serving it, so curl output is not very useful here. However, the content of our archived HTTP echo service page looks like this:

{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip",
"Accept-Language": "tt,en;q=0.5",
"Connection": "close",
"Host": "httpbin.org",
"Referer": "https://www.google.co.uk/",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2704.79 Safari/537.36"
},
"json": null,
"method": "GET",
"origin": "128.82.7.11, 128.82.7.11",
"url": "https://httpbin.org/anything"
}

Note that it sends its own custom "Accept-Language" and "User-Agent" headers (different from those of the browser from which we requested the capture), and it includes a custom "Referer" header. However, unlike the IA, it replayed our IP address as the origin. We then captured https://twitter.com/?lang=ar (http://archive.is/cioM5) followed by https://twitter.com/phonedude_mln/ (http://archive.is/IbHgB) to see if the language session sticks across two successive Twitter requests, but that was not the case, as the second page was archived in English (not in Arabic). This does not necessarily prove that their crawler does not have the issue: it is possible that two different instances of their crawler handled the two requests, or that some other Twitter links (with "?lang=en") were archived by someone else between our two requests. We do not have sufficient information to be certain.

Misconfigured Crawler


Some of the early mementos in which we observed this behavior were from Archive-It, so we thought that some collection maintainers might have misconfigured their crawl jobs to send a non-default "Accept-Language" header, resulting in such mementos. Since we did not have access to their crawling configuration, there was very little we could do to test this hypothesis. Many of the leading web archives, including Archive-It, use Heritrix as their crawler, and we happen to have some WARC files from Archive-It, so we started looking into those. We examined the request records in those WARC files for any Twitter links to see what "Accept-Language" header was sent. We were quite surprised to see that Heritrix never sent an "Accept-Language" header to any server, so this could not be the reason at all. However, when looking into those WARC files, we saw "Cookie" headers sent to the servers in the request records for Twitter and many other sites. This led us to uncover the actual cause of the issue.

Cookies, the Real Culprit


So far, we had been considering Heritrix to be a stateless crawler, but when we looked into the Archive-It WARC files, we observed cookies being sent to servers. This means Heritrix does have cookie management built in (which is often necessary to meaningfully capture some sites). With this discovery, we started investigating Twitter's behavior from a different perspective. Twitter's page source has a list of alternate links for each language they provide localization for (currently 47 languages), and this list can get added to the crawler's frontier queue. Although these links have different URIs (each with a query parameter "?lang=<lang-code>"), once any of them is loaded, the session is set to that language until the language is explicitly changed or the session expires or is cleared. In the past, Twitter's interface had options to manually select a language, which also got set for the session. It is understandable that general-purpose web sites cannot rely completely on "Accept-Language" for localization-related content negotiation, as browsers have made it difficult to customize language preferences, especially on a per-site basis.

We experimented with Twitter's language related behavior in our web browser by navigating to https://twitter.com/?lang=ar, which yields the page in the Arabic language. Then navigating to any Twitter page such as https://twitter.com/ or https://twitter.com/ibnesayeed (without the explicit "lang" query parameter) continues to serve Arabic pages (if a Twitter account is not logged in). Here is how Twitter's server behaves for language negotiation:

  • If a "lang" query parameter (with a supported language) is present in any Twitter link, that page is served in the corresponding language.
  • If the user is a guest, value from the "lang" parameter is set for the session (this gets set each time an explicit language parameter is passed) and remains sticky until changed/cleared.
  • If the user is logged in (using Twitter's credentials), the default language preference is taken from their profile preferences, so the page will only show in a different language if an explicit "lang" parameter is present in the URI. However, it is worth noting that crawlers generally behave like guests.
  • If the user is a guest and no "lang" parameter is passed, Twitter falls back to the language supplied in the "Accept-Language" header.
  • If the user is a guest, no "lang" parameter is passed, and no "Accept-Language" header is provided, then responses are in English (though, this could be affected by Geo-IP, which we did not test).

In the example below we illustrate some of this behavior using curl. First, we fetch Twitter's home page in Arabic using the explicit "lang" query parameter and show that the response was indeed in Arabic, as it contains the lang="ar" attribute in the <html> element. We also saved any cookies the server set to the file "/tmp/twitter.cookie", and then show that this file does indeed have a "lang" cookie with the value "ar" (there are some other cookies in it, but they are not relevant here). Next, we fetched Twitter's home page without any explicit "lang" query parameter and received a response in the default English language. Then we fetched the home page with the "Accept-Language: ur" header and got the response in Urdu. Finally, we fetched the home page again, this time supplying the saved cookies (which include the "lang=ar" cookie), and received the response in Arabic again.
$ curl --silent -c /tmp/twitter.cookie https://twitter.com/?lang=ar | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

$ cat /tmp/twitter.cookie | grep lang
twitter.com FALSE / FALSE 0 lang ar

$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">

$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">

$ curl --silent -b /tmp/twitter.cookie https://twitter.com/ | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">


Twitter Cookies and Heritrix


Now that we understood the reason, we wanted to replicate what happens in a real archival crawler. We used Heritrix to simulate the effect that Twitter cookies have when a Twitter page gets archived in the IA. We seeded the following URIs, in this order, in Heritrix's configuration file; the order was carefully chosen so we could see whether the first link sets the language to Arabic and the second one then gets captured in Arabic:
  1. https://twitter.com/?lang=ar
  2. https://twitter.com/phonedude_mln/
We had already shown that the first URI, which includes the language identifier for Arabic (lang=ar), will place the language identifier inside the cookie. The question now becomes: what effect will this cookie have on subsequent requests for future Twitter pages? Will the language identifier stay the same as the one already set in the cookie, or will it revert to a default language preference? The naive expectation for our seeded URIs is that the first Twitter page will be archived in Arabic and the second page in English, since a request without an explicit "lang" parameter usually defaults to English. However, since we have observed that Twitter's cookies contain the language identifier when this parameter is passed in the URI, it is plausible that the language identifier will be maintained if subsequent Twitter requests reuse the same cookie.

After running the crawling job in Heritrix for the seeded URIs, we inspected the WARC file generated by Heritrix. The results were as we expected. Heritrix was indeed saving and replaying "Cookie" headers, resulting in the second page being captured in Arabic. Relevant portions of the resulting WARC file are shown below:

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Concurrent-To: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
WARC-Record-ID: <urn:uuid:473273f6-48fa-4dd3-a5f0-81caf9786e07>
Content-Type: application/http; msgtype=request
Content-Length: 301

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC shown above is the request record for the URI https://twitter.com/?lang=ar. It shows the GET request made to the host "twitter.com" with the path and query parameter "/?lang=ar". This request yielded a response from Twitter that contains a "set-cookie" header with the language identifier from the URI, "lang=ar", as shown in the portion of the WARC below. The HTML was rendered in Arabic (notice the <html> element with the lang attribute in the response payload below).

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Payload-Digest: sha1:FCOPDBN2U5LXU7FEUUGQ4WXYGR7OP5JI
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
Content-Type: application/http; msgtype=response
Content-Length: 151985

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 150665
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:44 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:44 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:34 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: lang=ar; Path=/
set-cookie: ct0=10558ec97ee83fe0f2bc6de552ed4b0e; Expires=Sat, 17 Mar 2018 03:58:44 UTC; Path=/; Domain=.twitter.com; Secure
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: 2a2fc89f51b930202ab24be79b305312
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 100
x-transaction: 001495f800dc517f
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

The subsequent request for the second seeded URI (https://twitter.com/phonedude_mln/) generated an additional request record, shown in the WARC portion below. It shows the GET request made to the host "twitter.com" with the path "/phonedude_mln/". Notice that a "Cookie" header with the value lang=ar, set as a result of the first seeded URI, was included in the request.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Concurrent-To: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
WARC-Record-ID: <urn:uuid:eef134ed-f3dc-459b-95e7-624b4d747bc1>
Content-Type: application/http; msgtype=request
Content-Length: 655

GET /phonedude_mln/ HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: lang=ar; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; ct0=10558ec97ee83fe0f2bc6de552ed4b0e; guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown below shows the effect of Heritrix saving and replaying the "Cookie" headers. The <html> element proves that the HTML language was set to Arabic for the second seeded URI (https://twitter.com/phonedude_mln/), even though this URI did not include the language identifier.

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Payload-Digest: sha1:5LI3DGWO6NGK4LWSIHFZZHW43H2Z2IWA
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
Content-Type: application/http; msgtype=response
Content-Length: 518086

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 516921
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:48 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:48 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:38 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: ef102c969c74f3abf92966e5ffddb6ba
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 335
x-transaction: 0014986c00687fa3
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">

We used PyWb to replay pages from the captured WARC file. Fig. 3 is the page rendered after retrieving the first seeded URI of our collection (https://twitter.com/?lang=ar). For those not familiar with Arabic, this is indeed Twitter's home page in Arabic.

Fig.3  https://twitter.com/?lang=ar

Fig. 4 is the representation given by PyWb after requesting the second seeded URI (https://twitter.com/phonedude_mln). The page was rendered using Arabic as the default language, although we did not include this setting in the URI, nor did our browser language settings include Arabic.

Fig.4  https://twitter.com/phonedude_mln/ in Arabic

Why is Kannada More Prominent?


As we noted before, Twitter's page source now includes a list of alternate links for 47 supported languages. These links look something like this:

<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="en" href="https://twitter.com/?lang=en">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
...
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">

The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives. While each language-specific link overwrites the session set by its predecessor, the session set by the last one persists and affects many more Twitter links in the frontier queue. Twitter started supporting Kannada, along with three other Indian languages, in July 2015 and placed it at the very end of the language-related alternate links. Since then, it has been captured in various archives more often than any other non-English language. Before these new languages were added, Bengali was the last link in the alternate-language list for about a year, and our dataset shows dense archival activity for Bengali between July 2014 and July 2015, after which Kannada took over. This confirms our hypothesis that the placement of the last language-related link keeps the session stuck in that language for a long time, affecting all upcoming links from the same domain in the crawler's frontier queue until another language-specific link overwrites the session.

What Should We Do About It?


Disabling cookies does not seem to be a good option for crawlers, as some sites try hard to set a cookie by repeatedly returning redirect responses until their desired "Cookie" header is included in the request. However, explicitly reducing the cookie expiration duration in crawlers could mitigate the long-lasting impact of such sticky cookies: garbage-collecting any cookie that was set more than a few seconds ago would ensure that no cookie is reused for more than a few successive requests. Sandboxing crawl jobs into many isolated sessions is another potential way to minimize the impact. Alternatively, filtering policies can be set so that URLs that set session cookies are downloaded in a separate, short-lived session, isolating them from the rest of the crawl frontier queue.
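
As a rough illustration of the cookie garbage-collection idea (a sketch in Python with requests, not a Heritrix configuration), the session below drops any cookie older than a configurable number of seconds, so a sticky "lang" cookie set by one seed cannot affect the rest of a crawl:

# A sketch of cookie garbage collection for a crawler-like HTTP session.
import time
import requests

MAX_COOKIE_AGE = 10  # seconds; tune per crawl

class ShortLivedCookieSession(requests.Session):
    def __init__(self):
        super().__init__()
        self._cookie_birth = {}  # cookie name -> time it was first seen

    def request(self, method, url, **kwargs):
        now = time.time()
        # Forget cookies that are older than the allowed age.
        for name in list(self.cookies.keys()):
            if now - self._cookie_birth.get(name, now) > MAX_COOKIE_AGE:
                del self.cookies[name]
                self._cookie_birth.pop(name, None)
        response = super().request(method, url, **kwargs)
        # Record when newly set cookies first appeared.
        for name in self.cookies.keys():
            self._cookie_birth.setdefault(name, now)
        return response

# The second request carries the "lang=ar" cookie only if it happens within
# MAX_COOKIE_AGE seconds of the first one.
session = ShortLivedCookieSession()
session.get("https://twitter.com/?lang=ar")
time.sleep(MAX_COOKIE_AGE + 1)
session.get("https://twitter.com/")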

Conclusions


The problem of Twitter pages unintentionally being archived in non-English languages is quite significant. We found that 47% of the mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case it is disconcerting and counter-intuitive. We found that the root cause is Twitter's sticky language sessions, maintained using cookies, which the Heritrix crawler honors.

Because Kannada is the last language in the list of language-specific alternate links on Twitter's pages, its cookie overwrites the language cookies set by the URLs listed above it, causing more Twitter pages in the frontier queue to be archived in Kannada than in any other non-English language. Crawlers are generally considered stateless, but honoring cookies makes them somewhat stateful. This behavior may not be specific to Twitter; many other sites that use cookies for content negotiation could have similar consequences in web archives. The issue can potentially be mitigated by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling of URLs from the same domain across many small sandboxed instances.

--
Sawood Alam
and
Plinio Vargas

2018-04-09: Trip Report for the National Forum on Ethics and Archiving the Web (EAW)


On March 22-24, 2018 I attended the National Forum on Ethics and Archiving the Web (EAW), hosted at the New Museum and organized by Rhizome and the members of the Documenting the Now project.  The nor'easter "Toby" frustrated the travel plans of many, causing my friend Martin Klein to cancel completely and me to not arrive at the New Museum until after the start of the second session at 2pm on Thursday.  Fortunately, all the sessions were recorded and I link to them below.

Day 1 -- March 22, 2018


Session 1 (recording) began with a welcome, and then a keynote by Marisa Parham, entitled "The Internet of Affects: Haunting Down Data".  I did have the privilege of seeing her keynote at the last DocNow meeting in December, and looking at the tweets ("#eaw18") she addressed some of the same themes, including the issues of the process of archiving social media (e.g., tweets) and the resulting decontextualization, including "Twitter as dataset vs. Twitter as experience", and "how do we reproduce the feeling of community and enhance our understanding of how to read sources and how people in the past and present are engaged with each other?"  She also made reference to the Twitter heat map for showing interaction with the Ferguson grand jury verdict ("How a nation exploded over grand jury verdict: Twitter heat map shows how 3.5 million #Ferguson tweets were sent as news broke that Darren Wilson would not face trial").



After Marisa's keynote was the panel on "Archiving Trauma", with Michael Connor (moderator), Chido Muchemwa, Nick Ruest (slides), Coral Salomón, Tonia Sutherland, and Lauren Work.  There are too many important topics here and I did not experience the presentations directly, so I will refer you to the recording for further information and a handful of selected tweets below. 


The next session after lunch was "Documenting Hate" (recording), with Aria Dean (moderator), Patrick Davison, Joan Donovan, Renee Saucier, and Caroline Sinders.  I arrived at the New Museum about 10 minutes into this panel.  Caroline spoke about the Pepe the Frog meme, its appropriation by Neo-Nazis, and the attempt by its creator to wrest it back -- "How do you balance the creator’s intentions with how culture has remixed their work?"

Joan spoke about a range of topics, including archiving the Daily Stormer forum, archiving the disinformation regarding the attacks in Charlottesville this summer (including false information originating on 4chan about who drove the car), and an algorithmic image collection technique for visualizing trending images in the collection.


Renee Saucier talked about experiences collecting sites for the "Canadian Political Parties and Political Interest Groups" (Archive-It collection 227), which includes Neo-Nazi and affiliated political parties.


The next panel was "Web Archiving as Civic Duty", with Amelia Acker (co-moderator), Natalie Baur, Adam Kriesberg (co-moderator) (transcript), Muira McCammon, and Hanna E. Morris.  My own notes on this session are sparse (in part because most of the presenters did not use slides), so I'll include a handful of tweets I noted that I feel succinctly capture the essence of the presentations.  I did find a link to Muira's MS thesis "Reimagining and Rewriting the Guantánamo Bay Detainee Library: Translation, Ideology, and Power", but it is currently under embargo.  I did find an interview with her that is available and relevant.  Relevant to Muira's work with deleted US Govt accounts is Justin Littman's recent description of a disinformation attack with re-registering deleted accounts ("Vulnerabilities in the U.S. Digital Registry, Twitter, and the Internet Archive").


The third session, "Curation and Power" (recording) began with a panel with Jess Ogden (moderator), Morehshin Allahyari, Anisa Hawes, Margaret Hedstrom, and Lozana Rossenova.  Again, I'll borrow heavily from tweets. 


The final session for Thursday was the keynote by Safiya Noble, based on her recent book "Algorithms of Oppression" (recording).  I really enjoyed Safiya's keynote; I had heard of some of the buzz and controversy (see my thread (1, 2, 3) about archiving some of the controversy) around the book but I had not yet given it a careful review (if you're not familiar with it, read this five minute summary Safiya wrote for Time).  I include several insightful tweets from others below, but I'll also summarize some of the points that I took away from her presentation (and they should be read as such and not as a faithful or complete transcription of her talk).

First, as a computer scientist I understand and am sympathetic to the idea that the ranking algorithms that Google et al. use should be neutral.  It's an ultimately naive and untenable position, but I'd be lying if I said I did not understand the appeal.  The algorithms that help us differentiate quality pages from spam pages about everyday topics like artists, restaurants, and cat pictures do what they do well.  In one of the examples I use in my lecture (slides 55-58), it's the reason why for the query "DJ Shadow", the wikipedia.org and last.fm links appear on Google's page 1, and djshadow.rpod.ru appears on page 15: in this case the ranking of the sites based on their popularity in terms of links, searches, clicks, and other user-oriented metrics makes sense.  But what happens when the query is, as Safiya provides in her first example, "black girls"?  The result (ca. 2011) is almost entirely porn (cf. the in-conference result for "asian girls"), and the algorithms that served us so well in finding quality DJ Shadow pages in this case produce a socially undesirable result.  Sure, this undesirable result is from having indexed the global corpus (and our interactions with it) and is thus a mirror of the society that created those pages, but given the centrality in our lives that Google enjoys and the fact that people consider it an oracle rather than just a tool that gives undesirable results when indexing undesirable content, it is irresponsible for Google to ignore the feedback loop that they provide; they no longer just reflect the bias, they hegemonically reinforce the bias, as well as give attack vectors for those who would defend the bias.

Furthermore, there is already precedent for adjusting search results to eliminate bias in other dimensions: for example, PageRank by itself is biased against late-arriving pages/sites (e.g., "Impact of Web Search Engines on Page Popularity"), so search engines (SEs) adjust the rankings to accommodate these pages.  Similarly, Google has a history of intervening to remove "Google Bombs" (e.g., "miserable failure"), punish attempts to modify ranking, and even replacing results pages with jokes -- if these modifications are possible, then Google can no longer pretend the algorithm results are inviolable. 

She did not confine her criticism to Google, she also examined query results in digital libraries like ArtStor.  The metadata describing the contents in the DL originate from a point-of-view, and queries with a different POV will not return the expected results.  I use similar examples in my DL lecture on metadata (my favorite is reminding the students that the Vietnamese refer to the Vietnam War as the "American War"), stressing that even actions as seemingly basic as assigning DNS country codes (e.g., ".ps") are fraught with geopolitics, and that neutrality is an illusion even in a discipline like computer science. 

There's a lot more to her talk than I have presented, and I encourage you to take the time to view it.  We can no longer pretend Google is just the "backrub" crawler and google.stanford.edu interface; it is a lens that both shows and shapes who we are.  That's an awesome responsibility and has to be treated as such.


Day 2 -- March 23, 2018


The second day began with the panel "Web as Witness - Archiving & Human Rights" (recording), with Pamela Graham (moderator), Anna Banchik, Jeff Deutch, Natalia Krapiva, and Dalila Mujagic. Anna and Natalia presented the activities of the UC Berkeley Human Rights Investigations Lab, where they do open-source investigations (discovering, verifying, geo-locating, and more) of publicly available data about human rights violations.  Next was Jeff talking about the Syrian Archive, and the challenges they faced with YouTube algorithmically removing what it believed to be "extremist content".  He also had a nice demo of how they used image analysis to identify munitions videos uploaded by Syrians.  Dalila presented the work of WITNESS, an organization promoting the use of video to document human rights violations and how such videos can be used as evidence.  The final presentation was about airwars.org (a documentation project about civilian casualties in air strikes), but I missed a good part of this presentation as I focused on my upcoming panel.


My session, "Fidelity, Integrity, & Compromise", was Ada Lerner (moderator) (site), Ashley Blewer (slides, transcript), Michael L. Nelson (me) (slides), and Shawn Walker (slides).  I had the luxury of going last, but that meant that I was so focused on reviewing my own material that I could not closely follow their presentations.  I and my students have read Ada's paper and it is definitely worth reviewing.  They review a series of attacks (and fixes) that all center around "abandoned" live web resources (what we called "zombies") that can be (re-)registered and then included in historical pages.  That sounds like a far-fetched attack vector, except when you remember that modern pages include 100s of resources from many different sites via JavaScript, and there is a good chance that any page is likely to include a zombie whose live web domain is available for purchase.  Shawn's presentation dealt with research issues surrounding the use of social media, and Ashley's talk dealt with the role of fixity information (e.g., "There's a lot "oh I should be doing that" or "I do that" but without being integrated holistically into preservation systems in a way that brings value or a clear understand as to the "why"").  As for my talk, I asserted that Brian Williams first performed "Gin and Juice" in 1992, a full year before Snoop Dogg, and I have a video of a page in the Internet Archive to "prove" it.  The actual URI in which it is indexed in the Internet Archive is obfuscated, but this video is 1) of an actual page in the IA, that 2) pulls live web content into the archive, despite the fixes that Ada provided, and 3) the page rewrites the URL in the address bar to pretend to be at a different URL and time (in this case, dj-jay-requests.surge.sh, and 19920531014618 (May 31, 1992)).






The last panel before lunch was "Archives for Change", with Hannah Mandel (moderator), Lara Baladi, Natalie Cadranel, Lae’l Hughes-Watkins, and Mehdi Yahyanejad.  My notes for this session are sparse, so again I'll just highlight a handful of useful tweets.




After lunch, the next session (recording) was a conversation between Jarrett Drake and Stacie Williams on their experiences developing the People's Archive of Police Violence in Cleveland, which "collects, preserves, and shares the stories, memories, and accounts of police violence as experienced or observed by Cleveland citizens."  This was the only panel with the format of two people having a conversation (effectively interviewing each other) about their personal transformation and lessons learned.


The next session was "Stewardship & Usage", with Jefferson Bailey, Monique Lassere, Justin Littman, Allan Martell, and Anthony Sanchez.  Jefferson's excellent talk was entitled "Lets put our money where our ethics are", and was an eye-opening discussion about the state of funding (or lack thereof) for web archiving. The tweets below capture the essence of the presentation, but this is definitely one you should take the time to watch.  Allan's presentation addressed the issues of building "community archives" and being aware of tensions that exist between different marginalized groups. Justin's presentation was excellent, detailing both GWU's collection activities and the associated ethical challenges (including who and what to collect) and the gap between collecting via APIs and archiving web representations.  I believe Anthony and Monique jointly gave their presentation about how ethical web archiving requires proper representation from marginalized communities.



The next panel "The Right to be Forgotten", was in Session 7 (recording), and featured Joyce Gabiola (moderator), Dorothy Howard, and Katrina Windon.  The right to be forgotten is a significant issue facing search engines in the EU, but has yet to arrive as a legal issue in the US.  Again, my notes on this session are sparse, so I'm relying on tweets. 


The final regular panel was "The Ethics of Digital Folklore", and featured Dragan Espenschied (moderator) (notes), Frances Corry, Ruth Gebreyesus, Ian Milligan (slides), and Ari Spool.  At this point my laptop's battery died so I have absolutely no notes on this session. 


The final session was with Elizabeth Castle, Marcella Gilbert, Madonna Thunder Hawk, with an approximately 10 minute rough cut preview of "Warrior Women", a documentary about Madonna Thunder Hawk, her daughter Marcella Gilbert, Standing Rock, and the DAPL protests.


Day 3 -- March 24, 2018


Unfortunately, I had to leave on Saturday and was unable to attend any of the nine workshop sessions: "Ethical Collecting with Webrecorder", "Distributed Web of Care", "Open Source Forensics", "Ethically Designing Social Media from Scratch", "Monitoring Government Websites with EDGI", "Community-Based Participatory Research", "Data Sharing", "Webrecorder - Sneak Preview", "Artists’ Studio Archives", and unconference slots.   There are three additional recorded sessions corresponding to the workshops that I'll link here (session 8, session 9, session 10) because they'll eventually scroll off the main page.

This was a great event and the enthusiasm with which it was greeted is an indication of the importance of the topic.  There were so many great presentations that I'm left with the unenviable task of writing a trip report that's simultaneously too long and does not do justice to any of the presentations.  I'd like to thank the other members of my panel (Ada, Shawn, and Ashley), all who live-tweeted the event, the organizers at Rhizome (esp. Michael Connor), Documenting the Now (esp. Bergis Jules), the New Museum, and the funders: IMLS and the Knight Foundation.   I hope they will find a way to do this again soon.

--Michael

See also: Ashley Blewer wrote a short summary of EAW, with a focus on the keynotes and  three different presentations.  Please let me know if there are other summaries / trip reports to add.

Also, please feel free to contact me with additions / corrections for the information and links above.  





2018-04-13: Web Archives are Used for Link Stability, Censorship Avoidance, and Traffic Siphoning

ISIS members immolating captured Jordanian pilot
Web archives have been used for purposes other than digital preservation and browsing historical data. These purposes can be divided into three categories:

  1. Uploading content to web archives to ensure continuous availability of the data.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of direct links, when referring to news sites with opposing ideologies, to avoid increasing their web traffic and to deprive them of ad revenue.

1. Uploading content to web archives to ensure continuous availability of the data


Web archives, by design, are intended to solve the problem of digital data preservation so people can access data when it is no longer available on the live web. In the paper Who and What Links to the Internet Archive (Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson, 2013), the authors show that 65% of the requested archived pages no longer exist on the live web. The paper also determines where Internet Archive's Wayback Machine users come from. The following table, from the paper, contains the top 10 referrers that link to IA’s Wayback Machine; these top 10 referrers account for 51.9% of all referrers, and en.wikipedia.org outnumbers all other sites, including search engines and the home page of the Internet Archive (archive.org).
The top 10 referrers that link to IA’s Wayback Machine
Who and What Links to the Internet Archive, (AlNoamany et al. 2013) Table 5
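
The paper derives this breakdown from the Wayback Machine's own access logs. As a rough illustration of how such a referrer tally can be computed from ordinary web server logs, here is a minimal Python sketch; it is not the analysis pipeline from the paper, and the log file name and Combined Log Format assumption are mine.

    from collections import Counter
    from urllib.parse import urlparse
    import re

    # Combined Log Format: ... "request" status bytes "referer" "user-agent"
    LOG_LINE = re.compile(r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referer>[^"]*)"')

    def top_referrer_hosts(log_path, n=10):
        """Tally which sites send visitors, by the host in the Referer header."""
        hosts = Counter()
        with open(log_path) as log:
            for line in log:
                match = LOG_LINE.search(line)
                if not match:
                    continue
                referer = match.group("referer")
                if referer and referer != "-":
                    hosts[urlparse(referer).netloc.lower()] += 1
        total = sum(hosts.values()) or 1
        return [(host, count, 100.0 * count / total) for host, count in hosts.most_common(n)]

    # Hypothetical usage; "access.log" is an assumed file name.
    for host, count, pct in top_referrer_hosts("access.log"):
        print(f"{host:30s} {count:10d} {pct:5.1f}%")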

Sometimes the archived data is controversial, and users want to make sure that they can refer back to it later in case it is removed from the live web. A clear example is the deleted tweets of U.S. President Donald Trump.
Mr. Trump's deleted tweets on politwoops.eu
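
A user who fears a page (a tweet, an article, anything reachable by URL) might disappear can push it into the Wayback Machine through the public "Save Page Now" endpoint at https://web.archive.org/save/. The sketch below is one minimal way to do that with the requests library; the example URL is hypothetical, and the redirect behavior noted in the comment is typical rather than guaranteed.

    import requests

    SAVE_ENDPOINT = "https://web.archive.org/save/"

    def save_page_now(url):
        """Ask the Wayback Machine to capture a live URL right now."""
        response = requests.get(SAVE_ENDPOINT + url, timeout=120)
        response.raise_for_status()
        # The Wayback Machine typically redirects to the fresh capture,
        # so response.url then points at the archived snapshot.
        return response.url

    # Hypothetical example URL.
    print(save_page_now("https://example.com/a-page-that-might-vanish"))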


2. Avoiding governments' censorship or websites' terms of service


Using the Internet Archive to work around the terms of service of file sharing sites was addressed by Justin Littman in a blog post, Islamic State Extremists Are Using the Internet Archive to Deliver Propaganda. He stated that ISIS sympathizers are using the Internet Archive as a web delivery platform for extremist propaganda, posing a threat to the archival mission of the Internet Archive. Mr. Littman did not evaluate the content to determine if it is extremist in nature since much of it is in Arabic. This behavior is not new; it was noted with data uploaded by Al-Qaeda sympathizers long before ISIS was created. Al-Qaeda uploaded the file https://archive.org/details/osamatoobama to the Internet Archive on February 16, 2010 to circumvent the content removal policies of file sharing sites. ISIS sympathizers upload clips documenting battles, executions, and even video announcements by ISIS leaders to the Internet Archive because that type of content is automatically removed from video sharing sites like YouTube to prevent extremist propaganda.

On February 4, 2015, ISIS uploaded a video to the Internet Archive featuring the execution by immolation of captured Jordanian pilot Muath Al-Kasasbeh; that was only one day after the execution! The video violates YouTube's terms of service and is no longer on YouTube.
https://archive.org/details/YouTube_201502
ISIS members immolating captured Jordanian pilot (graphic video)
In fact, YouTube's algorithm is so aggressive that it removed thousands of videos documenting the Syrian revolution. Activists argued that the removed videos had been uploaded to document atrocities during the Syrian government's crackdown, and that YouTube killed any possible hope for future war crimes prosecutions.

Hani Al-Sibai, a lawyer, Islamic scholar, Al-Qaeda sympathizer, and former member of the Egyptian Islamic Jihad Group who lives in London as a political refugee, also uploads his content to the Internet Archive. Although he is anti-ISIS, his content more often than not does not encourage violence, and he has had only a few issues with YouTube, he still pushes his content to multiple sites on the web, including web archives, to ensure its continuous availability.

For example, this is an audio recording by Hani Al-Sibai condemning the immolation of the Jordanian pilot, Muath Al-Kasasbeh. Mr. Al-Sibai uploaded the recording to the Internet Archive a day after the execution.
https://archive.org/details/7arqTayyar
An audio recording by Hani Al-Sibai condemning the execution by burning (uploaded to IA a day after the execution)

These are some examples where the Internet Archive is used as a file sharing service: clips are simultaneously uploaded to YouTube, Vimeo, and the Internet Archive for the purpose of sharing.
Screenshot from justpaste.it where links to videos uploaded to IA are used for sharing purposes
Both videos shown in the screenshot were removed from YouTube for violating its terms of service, but they are not lost because they were also uploaded to the Internet Archive.

https://www.youtube.com/watch?v=Cznm0L5X9LE
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (removed from Youtube)

https://archive.org/details/Fajr3_201407
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (uploaded to IA)

https://www.youtube.com/watch?v=VuSgxhBtoic
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (removed from Youtube)

https://archive.org/details/Ta3liq_Hadi
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to IA)
The same video was not removed from Vimeo
https://vimeo.com/111975796
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to Vimeo)
I am not sure whether web archiving sites have content moderation policies, but even among sharing sites that do, enforcement is inconsistent! YouTube is a perfect example: no one knows what YouTube's rules even are anymore.

A less popular use of the Internet Archive is browsing archived versions of live web pages through Internet Archive links to bypass government censorship. Sometimes governments block sites with opposing ideologies, but the archived versions of those sites remain accessible. When these governments realize that their censorship is being evaded, they block the Internet Archive entirely to prevent access to the same content they blocked on the live web. In 2017, the IA’s Wayback Machine was blocked in India, and in 2015, Russia blocked the Internet Archive over a single page!
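
A reader looking for an archived copy of a blocked page (or anyone checking whether a snapshot exists before linking) can query the Wayback Machine's public Availability API at https://archive.org/wayback/available. The sketch below is a minimal example; the blocked-site URL and timestamp are hypothetical.

    import requests

    AVAILABILITY_API = "https://archive.org/wayback/available"

    def closest_snapshot(url, timestamp=None):
        """Return the URL of the closest archived snapshot, or None if there is none."""
        params = {"url": url}
        if timestamp:                       # e.g., "20170801" for around August 2017
            params["timestamp"] = timestamp
        data = requests.get(AVAILABILITY_API, params=params, timeout=30).json()
        snapshot = data.get("archived_snapshots", {}).get("closest", {})
        return snapshot.get("url") if snapshot.get("available") else None

    # Hypothetical example: look for a capture of a blocked news article.
    print(closest_snapshot("http://example-blocked-site.com/article", "20170801"))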

3. Using URLs from web archives instead of direct links for news sites with opposing ideologies to deprive them of ad revenue

Even when the live web version is not blocked, there are situations where readers want to deny traffic, and the resulting ad revenue, to web sites with opposing ideologies. In a recent paper, Understanding Web Archiving Services and Their (Mis)Use on Social Media (Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, 2018), the authors presented a large-scale analysis of web archiving services and their use on social networks: what content gets archived and how it is shared and used. They found that contentious news and social media posts are the most common types of archived content. URLs from web archiving sites are also widely posted in “fringe” communities on Reddit and 4chan to preserve controversial data that might disappear (this also falls under the first category). Furthermore, the authors found evidence of group admins forcing members to use URLs from web archives, instead of direct links, when referring to sites with opposing ideologies, so those sites receive no additional traffic or ad revenue. For instance, the The_Donald subreddit systematically targets the ad revenue of news sources with adverse ideologies using moderation bots that block URLs from those sites and prompt users to post archive URLs instead.

The authors also found that web archives are used to evade censorship policies in some communities: for example, /pol/ users post archive.is URLs to share content from 8chan and Facebook, which are banned on the platform, or to dodge word-filters (e.g., 4chan's filter rewrites ‘smh’ as ‘baka’, so a direct link to smh.com.au would be mangled into baka.com.au; an archive URL avoids the filter).

According to the authors, bots are responsible for posting a huge portion of the archive URLs on Reddit, as moderators try to ensure the availability of the data; this practice, however, reduces the amount of traffic the source sites would otherwise have received from Reddit.
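
The paper does not publish the bots' code, but the core idea, rewriting outbound links so they point at an archive instead of the source site, is simple. The sketch below shows one possible approach; the blocked-domain list is invented, and the choice of the Wayback Machine as the target archive is my assumption (the communities studied often use archive.is instead).

    import re

    # Hypothetical list of domains a moderation bot might refuse to link directly.
    BLOCKED_DOMAINS = {"example-news-site.com", "another-outlet.example"}

    URL_PATTERN = re.compile(r"https?://[^\s)\]]+")

    def rewrite_links(comment_text):
        """Replace direct links to blocked domains with Wayback Machine links."""
        def to_archive(match):
            url = match.group(0)
            host = re.sub(r"^https?://(www\.)?", "", url).split("/")[0].lower()
            if host in BLOCKED_DOMAINS:
                # web.archive.org/web/<url> redirects to the most recent capture.
                return "https://web.archive.org/web/" + url
            return url
        return URL_PATTERN.sub(to_archive, comment_text)

    print(rewrite_links("Read https://example-news-site.com/story for details."))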

I went on 4chan to collect a few examples similar to those examined in the paper. Despite not knowing what 4chan was prior to reading the paper, I was able to find a couple of examples of archived links being shared on 4chan in just under two minutes. I took screenshots of both examples; the threads themselves have since been deleted, because 4chan removes threads after they reach page 10.

Pages are archived on archive.is then shared on 4chan
Sharing links to archive.org in a comment on 4chan

The takeaway message is that web archives are used for purposes other than digital preservation and browsing historical data. These purposes include:
  1. Uploading content to web archives to mitigate the risk of data loss.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of original source links, when referring to news sites with opposing ideologies, to deprive them of ad revenue.
--
Hussam Hallak
