2016-12-20: Archiving Pages with Embedded Tweets

December 20, 2016, 7:12 am

≫ Next: 2017-01-07: Two WS-DL Classes Offered for Spring 2017

≪ Previous: 2016-11-21: WS-DL Celebration of #IA20

I'm from Louisiana and used Archive-It to build a collection of webpages about the September flood there (https://www.archive-it.org/collections/7760/).

One of the pages I came across, Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy', highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page (click for full page):

Live page, screenshot generated on Sep 9, 2016

To me, the most important resources here were the tweets and their pictures, so I'll focus here on how well they were archived.

First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."

Here's the first embedded tweet (https://twitter.com/vernonernst/status/765398679649943552), with a picture of a long line of trucks pulling their boats to join the Cajun Navy.

First embedded tweet - live web

Here's the source for this embedded tweet:
<blockquote class="twitter-tweet" data-width="500">
<a target="_blank" href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a>— Vernon Ernst (@vernonernst) <a href="https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

When the widgets.js script executes in the browser, it transforms the <blockquote class="twitter-tweet"> element into a <twitterwidget>:
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0"style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px;margin-bottom: 10px;" data-tweet-id="765398679649943552">

Now, let's consider how the various archives handle this.

Archive-It

Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

http://wayback.archive-it.org/7760/20160818180453/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/

Archive-It, captured on Aug 18, 2016

Here's how the first embedded tweet displayed in Archive-It:


Embedded tweet as displayed in Archive-It

Here's the source (as rendered in the DOM) upon playback in Archive-It's Wayback Machine:
<blockquote class="twitter-tweet twitter-tweet-error" data-conversation="none" data-width="500" data-twitter-extracted-i1479916001246582557="true">
<a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/CajunNavy?src=hash" target="_blank" rel="external nofollow">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/LouisianaFlood?src=hash" target="_blank" rel="external nofollow">#LouisianaFlood</a> <a href="http://wayback.archive-it.org/7760/20160818180453/https://t.co/HaugQ7Jvgg" target="_blank" rel="external nofollow">pic.twitter.com/HaugQ7Jvgg</a>— Vernon Ernst (@vernonernst)<a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/vernonernst/status/765398679649943552" target="_blank" rel="external nofollow">August 16, 2016</a></blockquote>
<script async=""
src="//wayback.archive-it.org/7760/20160818180453js_/http://platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script>

Except for the links being re-written to point to the archive, this is the same as the original embed source, rather than the transformed version. Upon playback, although widgets.js was archived (http://wayback.archive-it.org/7760/20160818180456js_/http://platform.twitter.com/widgets.js?4fad35), it is not able to modify the DOM as it does on the live web (widgets.js loads additional JavaScript that was not archived).

webrecorder.io

Next up is the on-demand service, webrecorder.io. Webrecorder.io is able to replay the embedded tweets as on the live web.

https://webrecorder.io/mweigle/south-louisiana-flood---2016/20160909144135/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/

Webrecorder.io, viewed Sep 29, 2016

The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for re-written links):
<blockquote class="twitter-tweet" data-width="500"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a>— Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
<script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Upon playback, we see that webrecorder.io is able to fully execute the widgets.js script, so the transformed HTML looks like the live web (with the inserted <twitterwidget> element):
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552"></twitterwidget>
<script async="" src="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Note that widgets.js is archived and is loaded from webrecorder.io, not the live web.

archive.is

archive.is is another on-demand archiving service. As with webrecorder.io, the embedded tweets are shown as on the live web.

http://archive.is/5JcKx

archive.is, captured Sep 9, 2016

archive.is executes and then flattens JavaScript, so although the embedded tweet looks similar to how it's rendered in webrecorder.io and on the live web, the source is completely different:
<article style="direction:ltr;display:block;">
...
<a href="https://archive.is/o/5JcKx/twitter.com/vernonernst/status/765398679649943552/photo/1" style="color:rgb(43, 123, 185);text-decoration:none;display:block;position:absolute;top:0px;left:0px;width:100%;height:328px;line-height:0;background-color: rgb(255, 255, 255); outline: invert none 0px; "><img alt="View image on Twitter" src="http://archive.is/5JcKx/fc15e4b873d8a1977fbd6b959c166d7b4ea75d9d" title="View image on Twitter" style="width:438px;max-width:100%;max-height:100%;line-height:0;height:auto;border-width: 0px; border-style: none; border-color: white; "></a>
...
</article>
...
<blockquote cite="https://twitter.com/vernonernst/status/765398679649943552"style="list-style: none outside none; border-width: medium; border-style: none; margin: 0px; padding: 0px; border-color: white; ">
...
#CajunNavy</a>
on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://archive.is/o/5JcKx/https://twitter.com/hashtag/LouisianaFlood?src=hash" style="direction:ltr;background-color: transparent; color:rgb(43, 123, 185);text-decoration:none;outline: invert none 0px; ">#LouisianaFlood</a>

...
</blockquote>

WARCreate

WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser. It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer.

WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer

Upon replay, WARCreate is not able to display the tweet at all. Here's the close-up of where the tweets should be:

WARCreate capture replayed in webarchiveplayer, with tweets missing

Examining both the WARC file and the source of the archived page helps to explain what's happening.

Inside the WARC, we see:
<h4>In stepped a group known as the <E2><80><9C>Cajun Navy<E2><80><9D>:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<script async="" src="//platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script>

This is the same markup that's in the DOM upon replay in webarchiveplayer, except for the script source being written to localhost:
<h4>In stepped a group known as the “Cajun Navy”:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script>

WARCreate captures the HTML after the page has fully loaded. So what's happening here is that the page loads, widgets.js is executed, the DOM is changed (thus the <twitterwidget> tag), and then WARCreate saves the transformed HTML. But, what we don't get is the widgets.js script in order to be able to properly display <twitterwidget>. Our expectation is that with fixes to allow WARCreate to archive the loaded JavaScript, the embedded tweet would be displayed as on the live web.

Discussion

Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.

Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML
Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser
Archive.is - fully loads the webpage, executes JavaScript, rewrites the resulting HTML, and archives the rewritten HTML
WARCreate - fully loads the webpage, executes JavaScript, and archives the transformed HTML

It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media. Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.

-Michele

↧

2017-01-07: Two WS-DL Classes Offered for Spring 2017

January 7, 2017, 10:30 am

≫ Next: 2017-01-08: Review of WS-DL's 2016

≪ Previous: 2016-12-20: Archiving Pages with Embedded Tweets

"One of the primary reasons I got hired was because of the [@WebSciDL] courses I took at ODU." -@prasanna_sajjan (@ODUnow, @oducs, MS '16) pic.twitter.com/hGJtRIQ8JY
— ODU Computer Science (@oducs) December 5, 2016

Two WS-DL classes are offered for Spring 2017:

CS 725/825 - Information Visualization, Dr. Weigle
CS 432/532 - Introduction to Web Science, Dr. Nelson

Information Visualization is being offered both online (CRNs 26614/26617 (HR), 26615/26618 (VA), 26616/26619 (US)) and on-campus (CRN 24698/24699). Web Science is offered on-campus only (CRNs 25728/25729). Although it's not a WS-DL course per se, WS-DL member Corren McCoy is also teaching CS 462/562 Cybersecurity Fundamentals this semester (see this F15 offering from Dr. Weigle for an idea about its contents).

--Michael

↧

2017-01-08: Review of WS-DL's 2016

January 8, 2017, 2:15 pm

≫ Next: 2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

≪ Previous: 2017-01-07: Two WS-DL Classes Offered for Spring 2017

Sawood and Mat show off the InterPlanetary Wayback poster at JCDL 2016

The Web Science and Digital Libraries Research Group had a productive 2016, with two Ph.D. and one M.S. students graduating, one large research grant awarded ($830k), 16 publications, and 15 trips to conferences, workshops, hackathons, etc.

For student graduations, we had:

Justin Brunelle defended his Ph.D. dissertation on February 5, 2016. Justin already had a full-time position at MITRE, but not coincidentally he had his choice of significant promotions at the conclusion of his Ph.D.
Yasmin AlNoamany defended her Ph.D.dissertation on June 16, 2016. Yasmin had several post-doctoral opportunities, and eventually decided on a position with Dr. Erik Mitchell, at the UC Berkeley Institute for Data Science (BIDS).
Greg Szalkowski completed his M.S. in 2016 as well. We had hoped to keep him on for a Ph.D., but he's having too much fun traveling the world setting up military communications solutions.

Other student advancements:

Shawn Jones passed his breadth exam.
Alexander Nwala passed his breadth exam.
Lulwah Alkwai passed her research ability exam.
Erika Siregar, a new M.S. student, joined us via the Fulbright Scholar Program.

We had 16 publications in 2016:

Shawn's paper "Avoiding spoilers: wiki time travel with Sheldon Cooper", based on his MS Thesis research, was finally published in IJDL.
Also in IJDL were expanded versions of three papers from TPDL 2015: Sawood Alam's paper "Web Archive Profiling Through CDX Summarization", and two from Yasmin "Characteristics of social media stories: What makes a good story?" and "Detecting Off-Topic Pages in Web Archives".
Mat Kelly and Sawood each had a paper at TPDL 2016.
Alexander, Sawood, and Mat had three posters at JCDL 2016 (we did not have full paper submissions since Michele Weigle was a PC co-chair). Alexander's JCDL poster was also released as an expanded tech report.
Justin had a paper in D-Lib Magazine about archiving a corporate intranet.
Shawn's position at LANL was productive, with a paper in PLoS ONE about "Scholarly Context Adrift", a poster at WWW 2016 about linking (or not) to DOIs, and a tech report about extracting HTML from archived web pages, which informed several blog post proposals about formal definitions for getting at raw archived content.
Michele had a poster at IEEE Vis 2016 based on a project from her CS 725/825 class.
I had a tech report with David Rosenthal and Herbert Van de Sompel about various techniques for archiving journals, including Signposting.

In late April, we had Herbert, Harish Shankar, and Shawn Jones visit from LANL. Herbert has been here many times, but this was the first visit to Norfolk for Harish. It was also on the visit that Shawn did his breadth exam.

.@WebSciDL on our way to #jcdl2016 @machawk1 @weiglemc @phonedude_mln @acnwala @ibnesayeed @jcdl2016 pic.twitter.com/xI7zeDQeAd
— Michael L. Nelson (@phonedude_mln) June 18, 2016

@elunca @WebSciDL @machawk1 @weiglemc @acnwala @ibnesayeed @jcdl2016 yes! Took the ferry to make it longer! ;-) pic.twitter.com/wbJENY334O
— Michael L. Nelson (@phonedude_mln) June 18, 2016

In addition to the fun road trip to JCDL 2016 in New Jersey (which included beers on the Cape May-Lewes Ferry!), our group traveled to:

Justin attended the Federal Cloud Computing Summit in January.
Sawood, Mat, and Alexander attended the Archives Unleashed Hackathon in March at the University of Toronto.
In April I went to both the CNI Spring Meeting in San Antonio and the IIPC General Assembly in Reykjavik.
April also saw Shawn attend the WWW Conference in Montreal.
In May, Erika attended the Fulbright Enrichment Seminar in Pittsburgh.
From June through August, Alexander was at Harvard University for a summer fellowship at the Library Innovation Lab; where he worked on the Local Memory Project.
As mentioned above, in June Mat, Sawood, Alexander, Michele, and I all went to JCDL 2016 (also the Web Archiving & Digital Libraries Workshop, and the JCDL Doctoral Consortium).
Right before JCDL 2016 Mat, Sawood, Alexander, Shawn, John Berlin, and Mohamed Aturban attended the second Archives Unleashed Hackathon in Washington DC. While the rest of the people left for JCDL in New Jersey, Shawn stuck around for the Saving the Web Symposium at the Library of Congress right after the Hackathon.
In August I attended the Documenting the Now Advisory Board meeting in St. Louis.
Mat attended the IIPC Building Better Crawlers Hackathon in London in September.
In October, Mat, Shawn, John, and I attended the Dodging the Memory Hole meeting at UCLA.
Sawood went to TPDL 2016 in Hannover in October, and also visited with Dr. Michael Herzog after the conference.
Michele attended part of IEEE Vis 2016 in Baltimore in October.
And somehow, I don't think anyone traveled in November or December!

WS-DL at JCDL 2016 Reception in Newark, NJ

.@weiglemc giving program overview #jcdl2016 @WebSciDL pic.twitter.com/dJKqdhZ0Z1
— Michael L. Nelson (@phonedude_mln) June 20, 2016

Although we did not travel to San Francisco for the 20th Anniversary of the Internet Archive, we did celebrate locally with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving. This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). We write plenty of papers, blog posts, etc. about technical issues and the mechanics of web archiving, but I'm especially proud of how we were able to assemble a wide array of personal stories about the social impact of web archiving. I encourage you to take the time to go through these posts:

View the story "WS-DL Celebration of #IA20" on Storify

.@WebSciDL celebrates 20 years of #webarchiving& @internetarchive w tacos and @djspooky CDs! #IA20 pic.twitter.com/AFb3qUiuzz
— Michael L. Nelson (@phonedude_mln) October 26, 2016

We had only one popular press story about our research this year, with Tech.Co's "You Can’t Trust the Internet to Continue Existing" citing Hany SalahEldeen's 2012 TPDL paper about the rate of loss of resources shared via Twitter.

We released several software packages and data sets in 2016:

Mat and Sawood continue to develop IPWB, an implementation of the Wayback Machine and the InterPlanetary Filesystem.
Yasmin released her Off-Topic Detection code, as well the story data sets used in her dissertation.
Zetan Li provided a much needed update for the Carbon Date software.
Alexander released the first version of code for the Local Memory Project (from his time @ Harvard).
Sawood released MemGator, which allows one to setup their own Memento Aggregator.

In April we were extremely fortunate to receive a major research award, along with Herbert Van de Sompel at LANL, from the Andrew Mellon Foundation:

"Old Dominion Computer Scientists Awarded $830,000 Mellon Grant"

This project will address a number of areas, including: Signposting, automated assessment of web archiving quality, verification of archival integrity, and automating the archiving of non-journal scholarly output. We will soon be releasing several research outputs as a result of this grant.

WS-DL reviews are also available for 2015, 2014, and 2013. We're happy to have graduated Greg, Yasmin, and Justin; and we're hoping that we can get Erika back for a PhD after her MS is completed. I'll close with celebratory images of me (one dignified, one less so...) with Dr. AlNoamany and Dr. Brunelle; may 2017 bring similarly joyous and proud moments.

--Michael

.@yasmina_anwar returns from @UCBIDS for grad ceremony & avatar advancement @WebSciDL @oducs https://t.co/Vo2NH3iTpI pic.twitter.com/o02e6nlDc7
— Michael L. Nelson (@phonedude_mln) December 17, 2016

↧

2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

January 15, 2017, 6:14 pm

≫ Next: 2017-01-20: CNN.com has been unarchivable since November 1st, 2016

≪ Previous: 2017-01-08: Review of WS-DL's 2016

Based on the paper:

Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. Proceedings of European Semantic Web Conference (ESWC) pp. 395–410 (2014).

A trusty URI is a URI that contains a cryptographic hash value of the content it identifies. The authors introduced this technique of using trusty URIs to make digital artifacts, specially those related to scholarly publications, immutable, verifiable, and permanent. With the assumption that a trusty URI, once created, is linked from other resources or stored by a third party, it becomes possible to detect if the content that the trusty URI identifies has been tampered with or manipulated on the way (e.g., trusty URIs to prevent man-in-the-middle attacks). In addition, trusty URIs can verify the content even if it is no longer found at the original URI but still can be retrieved from other locations, such as Google's cache, and web archives (e.g., Internet Archive).

The core contribution of this paper is the ability of creating trusty URIs on different kind of content. Two modules are proposed: in the module F, the hash is calculated on the byte-level file content while in the second module R the hash is calculated on RDF graphs. The paper introduced an algorithm to generate the hash value on RDF graphs independent of any serialization syntax (e.g., N-Quads or TriX). Moreover, they investigated how trusty URIs work on the structured documents (nanopublications). Nanopublications are small RDF graphs (named graphs -- one of the main concepts of Semantic Web) to describe information about scientific statements. The nanopublication as a Named Graph itself consists of multiple Named Graphs: the "assertion" has the actual scientific statement like "malaria is transmitted by mosquitos" in the example below; the "provenance" has information about how the statement in the "assertion" was originally derived; and the "publication information" has information like who created the nanopublication and when.

A nanopublication: basic elements from http://nanopub.org

Nanopublications may cite other nanopublications resulting in having complex citation tree. Trusty URIs are designed not only to validate nanopublications individually but also to validate the whole citation tree. The nanopublication example shown below, which is about the statement "malaria is transmitted by mosquitos", is from the paper ("The anatomy of a nanopublication") and it is in TRIG format:

@prefix swan: < http://swan.mindinformatics.org/ontologies/1.2/pav.owl> .
@prefix cw: < http://conceptwiki.org/index.php/Concept>.
@prefix swp: <http://www.w3.org/2004/03/trix/swp-1/>.
@prefix : <http://www.example.org/thisDocument#> .

:G1 = { cw:malaria cw:isTransmittedBy cw:mosquitoes }
:G2 = { :G1 swan:importedBy cw:TextExtractor,
:G1 swan:createdOn "2009-09-03"^^xsd:date,
:G1 swan:authoredBy cw:BobSmith }
:G3 = { :G2 ann:assertedBy cw:SomeOrganization }

In addition to the two modules, they are planning to define new modules for more types of content (e.g., hypertext/HTML) in the future.

The example below illustrates the general structure of trusty URIs:

The artifact code, everything after r1, is the part that make this URI a trusty URI. The first character in this code (R) is to identify the module. In the example, R indicates that this trusty URI was generated on a RDF graph. The second character (A) is to specify any version of this module. The remaining characters (5..0) represents the hash value on the content. All hash values are generated by SHA-256 algorithm. I think it would be more useful to allow users to select any preferred cryptographic hash function instead of enforcing a single hash function. This might result in adding more characters to the artifact code to represent the selected hash function. InterPlanetary File System (IPFS), for example, uses Multihash as mechanism to prefix the resulting hash value with an id that maps to a particular hash function. Similar to trusty URIs, resources in the IPFS network are addressed based on hash values calculated on the content. For instance, the first two characters "Qm" in the IPFS address "/ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V" indicates that SHA256 is the hash function used to generate the hash value "ZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V".

Here are some differences between the approach of using trusty URIs and other related ideas as mentioned in the paper:

Trusty URIs can be used to identify and verify resources on the web while in systems like Git version control system, hash values are there to verify "commits" in Git repositories only. The same applies to IPFS where hashes in addresses (e.g., /ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V) are used to verify files within the IPFS network only.
Hashes in trusty URIs can be generated on different kinds of content while in Git or ni-URI, hash values are computed based on the byte level of the content.
Trusty URIs support self-references (i.e., when trusty URIs are included in the content).

The same authors published a follow-up version to their ESWC paper ("Making digital artifacts on the web verifiable and reliable") in which they described in some detail how to generate trusty URIs on content of type RA for multiple RDF graphs and RB for a single RDF graph (RB was not included in the original paper). In addition, in this recent version, they graphically described the structure of the trusty URIs.

While calculating the hash value on the content of type F (byte-level file content) is a straightforward task, multiple steps are required to calculate the hash value on content of type R (RDF graphs), such as converting any serialization (e.g, N-Quads or TriG) into RDF triples, sorting of RDF triples lexicographically, serializing the graph into a single string, replacing newline characters with "\n", and dealing with self-references and empty nodes.

To evaluate their approach, the authors used the Java implementation to create trusty URIs for 156,026 of small structured data files (nanopublications) which are in different serialization format (N-Quads and TriX). By testing these files, again using the Java implementation, they all were successfully verified as they matched to their trusty URIs. In addition, they tested modified copies of these nanopublications. Results are shown in the figure below:

Examples of using trusty URIs:

[1] Trusty URI for byte-level content

Let say that I have published my paper on the web at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.pdf, and somebody links to it or saves the link somewhere. Now, if I intentionally (or not) change the content of the paper, for example, by modifying some statistics, adding a chart, correcting a typo, or even replacing the PDF with something completely different (read about content drift), anyone downloads the paper after these changes by dereferencing the original URI will not be able to realize that the original content has been tampered with. Trusty URIs may solve this problem. For testing, I used Trustyuri-python, the Python implementation, to first generate the artifact code on the PDF file "tpdl-2015.pdf":

%python ProcessFile.py tpdl-2015.pdf

The file (tpdl-2015.pdf) is renamed to (tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf) containing the artifact code (FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao) as a part of its name -- in the paper, they call this file a trusty file. Finally, I published this trusty file on the web at the trusty URI (http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf). Anyone with this trusty URI can verify the original content using the the library Trustyuri-python, for example:

%python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao

As you can see, the output "Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao" indicates that the hash value in the
trusty URI is identical to the hash value of the content which means that this resource contains the correct and the desired content.

To see how the library detects any changes in the original content available at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf, I replaced all occurrence of the number "61" with the number "71" in the content. Here is the commands I used to apply these changes:

%pdftk tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf output tmp.pdf uncompress
%sed -i 's/61/71/g' tmp.pdf
%pdftk tmp.pdf output tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf compress

The figures below show the document before and after the changes:

Before changes

After changes

The library detected that the original resource has been changed:

$python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
*** INCORRECT HASH ***

[2] Trusty URI for RDF content

I downloaded this nanopublication serialized in XML from "https://github.com/trustyuri/trustyuri-java/blob/master/src/main/resources/examples/nanopub1-pre.xml":

<?xml version='1.0' encoding='UTF-8'?> <trix xmlns="http://www.w3.org/2004/03/trix/trix-1/"> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1Head</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasAssertion</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1assertion</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasProvenance</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1provenance</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasPublicationInfo</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1pubinfo</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#Nanopublication</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1assertion</uri> OSPOSP<triple> OSPOSPOSP<uri>http://example.com/mosquito</uri> OSPOSPOSP<uri>http://example.com/transmits</uri> OSPOSPOSP<uri>http://example.com/malaria</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1provenance</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1assertion</uri> OSPOSPOSP<uri>http://www.w3.org/ns/prov#wasDerivedFrom</uri> OSPOSPOSP<uri>http://dx.doi.org/10.3233/ISU-2010-0613</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1pubinfo</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://purl.org/dc/terms/created</uri> OSPOSPOSP<typedliteral datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-02-25T14:33:21+01:00</typedliteral> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1</uri> OSPOSPOSP<uri>http://purl.org/pav/createdBy</uri> OSPOSPOSP<uri>http://orcid.org/0000-0002-1267-0234</uri> OSPOSP</triple> OSP</graph> </trix>

This nanopublication (RDF file) can be transformed into a trusty file using:

$python TransformRdf.py nanopub1-pre.xml http://trustyuri.net/examples/nanopub1

The Python script "TransformRdf.py" performed multiple steps to transform this RDF file into the trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml". The steps as mentioned above include generating RDF triples, sorting those triples, handling self-references, etc. The Python library used the second argument "http://trustyuri.net/examples/nanopub1", considered as the original URI, to manage self-references by replacing all occurrences of "http://trustyuri.net/examples/nanopub1" with "http://trustyuri.net/examples/nanopub1. " in the original XML file. You may have noticed that this ends with '.' and blank space. Once the artifact code is generated, the new trusty file is created "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml". In this trusty file all occurrences of "http://trustyuri.net/examples/nanopub1. " are replaced with "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo". The trusty file is shown below:

<?xml version="1.0" encoding="utf-8"?> <trix ospxmlns:xml="http://www.w3.org/XML/1998/namespace" ospxmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" ospxmlns:xsd="http://www.w3.org/2001/XMLSchema#" ospxmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" ospxmlns="http://www.w3.org/2004/03/trix/trix-1/"> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://purl.org/dc/terms/created</uri> OSPOSPOSP<typedliteral datatype="http://www.w3.org/2001/XMLSchema#dateTime">2013-02-25T14:33:21+01:00</typedliteral> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://purl.org/pav/createdBy</uri> OSPOSPOSP<uri>http://orcid.org/0000-0002-1267-0234</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#provenance</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#assertion</uri> OSPOSPOSP<uri>http://www.w3.org/ns/prov#wasDerivedFrom</uri> OSPOSPOSP<uri>http://dx.doi.org/10.3233/ISU-2010-0613</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#Head</uri> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasProvenance</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#provenance</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasPublicationInfo</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#hasAssertion</uri> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#assertion</uri> OSPOSP</triple> OSPOSP<triple> OSPOSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg</uri> OSPOSPOSP<uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri> OSPOSPOSP<uri>http://www.nanopub.org/nschema#Nanopublication</uri> OSPOSP</triple> OSP</graph> OSP<graph> OSPOSP<uri>http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#assertion</uri> OSPOSP<triple> OSPOSPOSP<uri>http://example.com/mosquito</uri> OSPOSPOSP<uri>http://example.com/transmits</uri> OSPOSPOSP<uri>http://example.com/malaria</uri> OSPOSP</triple> OSP</graph> </trix>

To verify this trusty file we can use the following command which resulting in having "Correct hash" --the content is verified to be correct. Again, to handle self-references, the Python library replaces all occurrences of "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo" with "http://trustyuri.net/examples/nanopub1. " before recomputing the hash.

%python CheckFile.py nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

Or by the following command if the trusty file is published on the web:

%python CheckFile.py http://www.cs.odu.edu/~maturban/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

What we are trying to do with trusty URI:

We are working on a project, funded by the Andrew W. Mellon Foundation, to automatically capture and archive the scholarly record on the web. One part of this project is to come up with a mechanism through which we can verify the fixity of archived resources to ensure that these resources have not been tampered with or corrupted. In general, we try to collect information about the archived resources and generate manifest file. This file will then be pushed into multiple archives, so it can be used later. Herbert Van de Sompel, from Los Alamos National Laboratory, pointed to this idea of using trusty URIs to identify and verify web resources. In this way, we have the manifest files to verify archived resources, and trusty URIs to verify these manifests.

Resources:

Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. Proceedings of European Semantic Web Conference (ESWC) pp. 395–410 (2014).

video

slides

Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data from Tobias Kuhn

Kuhn, Dumontier. "Making Digital Artifacts on the Web Verifiable and Reliable." arXiv preprint arXiv:1507.01697 (2015)
Kuhn, Tobias. "nanopub-java: A Java Library for Nanopublications." arXiv preprint arXiv:1508.04977 (2015)
Kurian, S.P., Vishnu S.S.,: Distributed digital artifacts on the semantic web. International Journal of Computer Applications Technology and Research pp. 99–103 (2015)
IPFS and Multihash
Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva, L., Zhou, K., Tobin, R.: Scholarly context not found: one in five articles suffers from reference rot. PLoS One 9(12), e115,253 (2014)

--Mohamed Aturban

↧

2017-01-20: CNN.com has been unarchivable since November 1st, 2016

January 20, 2017, 6:38 pm

≫ Next: 2017-01-23: Finding URLs on Twitter - A simple recommendation

≪ Previous: 2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

CNN.com has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive, archive.is, and webcitation.org. The last known correctly archived page in the Internet Archive's Wayback Machine is 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's; 2017-01-20T09:16:50). This means that the most popular web archives have no record of the time immediately before the presidential election through at least today's presidential inauguration.

Given the political controversy surrounding the election, one might conclude this is a part of some grand conspiracy equivalent to those found in the TV series The X-Files. But rest assured, this is not the case; the page was archived as is, and the reasons behind the archival failure are not as fantastical as those found in the show. As we will explain below, other archival systems have successfully archived CNN.com during this period (e.g, Perma.cc).

To begin the explanation of this anomaly, let's consider the raw HTML of the memento on 2016-11-01T15:01:31. At first glance, the HTML appears normal with few apparent differences (disregarding the Wayback injected tags) from the live web when comparing the two using only the browser's view-source feature. Only by looking closely at the body tag will you notice something out of place: the body tag has several CSS classes applied to it one of which seems oddly suspicious.

<bodyclass="pg pg-hidden pg-homepage pg-section domestic t-light">

The class that should jump out is pg-hidden which is defined in the external style sheet page.css. Its definition seen below can be found on lines 28625-28631.

.pg-hidden { display: none }

As the definition is extremely simple a quick fix would be to remove it. So let's remove it.

What is revealed after removing the pg-hidden class is a skeleton page i.e. a template page sent by the server that relies on the client-side JavaScript to do the bulk of the rendering. A hint to confirm this can be found in the number of errors thrown when loading the archived page.

The first error occurs when JavaScript attempts the change the domain property of the document.

Uncaught DOMException: Failed to set the 'domain' property on 'Document'
'cnn.com' is not a suffix of 'web.archive.org'. at (anonymous) @ (index):8

This is commonly done to allow a page on a subdomain to load resources from another page on the superdomain (or vice versa) in order to avoid cross-origin restrictions. In the case of cnn.com, it is apparent that this is done in order to communicate with their CDN (content delivery network) and several embedded iframes in the page (more on this later). To better understand this consider the following excerpt about Same-origin policy from the MDN (Mozilla Developer Network):

A page may change its own origin with some limitations. A script can set the value of document.domain to a suffix of the current domain. If it does so, the shorter domain is used for subsequent origin checks. For example, assume a script in the document at http://store.company.com/dir/other.html executes the following statement:

document.domain = "company.com";

After that statement executes, the page would pass the origin check with http://company.com/dir/page.html. However, by the same reasoning, company.com could not set document.domain to othercompany.com.

There are four other exceptions displayed in the console from three JavaScript files (brought in from the CDN)

cnn-header.e4a512e…-first-bundle.js
cnn-header-second.min.js
cnn-footer-lib.min.js

that further indicate that JavaScript is loading and rendering the remaining portions of the page.

Seen below is the relevant portion of JavaScript that does not get executed after the document.domain exception.

This portion of code sets up the global CNN object with the necessary information on how to load the assets for each section (zone) of the page and the manner by which to load them. What was not shown is the configurations for the sections, i.e the explicit definition of the content contained in them. This is important because these definitions are not added to the global CNN object due to the exception being thrown above (at window.document.domain). This causes the execution of the remaining portions of the script tag to halt before reaching them. Shown below is another inline script that is further in the document which does a similar setup.

In this tag the definitions that tell how the content model (news stories contained in the sections) are to be loaded along with further assets to be loaded. This code block does get executed in its entirety, which is important to note because the "lazy loading" definitions seen in the previous code block are added here. By defining that the content is to be lazily loaded (loadAllZonesLazy) the portion of Javascript responsible for revealing the page will not execute because the previous code blocks definitions are not added to the global CNN object. The section of code (from cnn-footer-lib.min.js) that does the reveal is seen below

As you can see the reveal code depends on two things: zone configuration defined in the section of code not executed and information added to the global CNN object in the cnn-header files responsible for the construction of the page. These files (along with the other cnn-*.js files) were un-minified and assignments to the global CNN object reconstructed to make this determination. For those interested, the results of this process can be viewed in this gist.

At this point, you must be wondering what changed between the time when the CNN archives could be viewed via the WaybackMachine and now. These changes can be summarized by considering the relevant code sections from the last correctly archived memento on 2016-11-01T13:15:40 seen below

When considering the non-whiteout archives, CNN did not require all zones to be lazy loaded and intelligent loading was not enabled. From this, we can assume they did not wait for the start of the more dynamic sections of the page to begin loading or to be loaded before showing the page.

As you can see in the above image of the memento on 2016-11-01T13:15:40, the headline of the page and the first image from the top stories section of the page are visible. The remaining sections of the page are missing as they are the lazily loaded content. Now compare this to the first not correctly archived memento on 2016-11-01T15:01:3. The headline and the first image from the top stories are a part of the sections lazily loaded (loadAllZonesLazy); thus, they contain dynamic content. This is confirmed when the pg-hidden CSS class is removed from the body tag to reveal that only the footer of the page is rendered but without any of the contents.

Even today the archival failure is happening as seen in the memento on 2017-01-20T16:00:45 seen below

In short, the archival failure is caused by changes CNN made to their CDN; these changes are reflected in the JavaScript used to render the homepage. The Internet Archive is not the only archive experiencing the failure, archive.is and webcitation.org are also affected. Viewing a capture on 2016-11-29T23:09:40 from archive.is, the preserved page once again appears to be an about:blank page.

Removing the pg-hidden definition reveals that only the footer is visible which is the same result as the memento from the Internet Archive's on 2016-11-01T15:01:31.

But unlike the Internet Archive's capture the archive.is capture is only the body of CNN's homepage with the CSS styles inlined (style="...") on each tag. This happens because archive.is does not preserve any of the JavaScript associated with the page and performs the transformation previously described to the page in order to archive it. This means that cnn.com's JavaScript will never be executed when replayed thus the dynamic contents will not be displayed.

WebCitation, on the other hand, does preserve some of the page's JavaScript, but it is not immediately apparent due to how pages are replayed. When viewing a capture from WebCitation on 2016-11-13T33:51:09 the page appears to be rendered "properly" albeit without any CSS styling.

This happens because WebCitation replays the page using PHP and a frame. The replay frame's PHP script loads the preserved page into the browser; then, any of the preserved CSS and JavaScript is served from another PHP script. However, using this process of serving the preserved contents may not work successfully as seen below.

WebCitation sent the CSS style sheets with the MIME type of text/html instead of text/css. This would explain why the page looks as it does. But cnn.com's JavaScript was executed with the same errors occurring that were present when replaying the Internet Archive's capture. This begs the question, "How can we preserve cnn.com as cnn.com is unarchivable, at least by the most commonly used means?".

The solution is not as simple as one may hope, but a preliminary solution (albeit band-aid) would be to archive the page using tools such as WARCreate, Webrecorder or Perma.cc. These tools are effective since they preserve a fully rendered page along with all network requests made when rendering the page. This ensures that the JavaScript requested content and rendered sections of the page are replayable. Replaying of the page without the effects of that line of code is possible but requires the page to be replayed in an iframe. This method of replay is employed by Ilya Kreymer's PyWb (Python implementation of the Wayback Machine) and is used by Webrecorder and Perma.cc.

This is a fairly old hack used to avoid the avoid cross-origin restrictions. The guest page, brought in through the iframe, is free set the document.domain thus allowing the offending line code to execute without issue. A more detailed explanation can be found in this blog post but the proof is in the pudding by preservation and replay. I have created an example collection through Webrecorder that contains two captures of cnn.com.

The first is named "Using WARCreate" which used WARCreate for preservation on 2017-01-18T22:59:43,

and the second is named "Using Webrecorder" which used Webrecorders recording feature as the preservation means on 2017-01-13T04:12:34.

A capture of cnn.com on 2017-01-19T16:57:05 using Perma.cc for preservation is also available for replay here.

All three captures will be replayed using PyWb and when bringing up the console, the document.domain exception will no longer be seen.

The CNN archival failure highlights some of the issues faced when preserving online news and was a topic addressed at Dodging The Memory 2016. The whiteout, a function of the page itself not the archives, raises two questions "Is using web browsers for archiving the only viable option?" and "How much modification of the page is required in order to make replay feasible?".

- John Berlin

↧

2017-01-23: Finding URLs on Twitter - A simple recommendation

January 23, 2017, 6:58 am

≫ Next: 2017-02-13: Electric WAILs and Ham

≪ Previous: 2017-01-20: CNN.com has been unarchivable since November 1st, 2016

A prompt from Twitter indicating no search results

As part of a research experiment, I had the need to find URLs embedded in tweets from Twitter's web search service. Most of the URLs where much older than 7 days, so using the Twitter search API was not an option since the search is performed on a sample of tweets published in the past 7 days, so I used the web search service.

I began the experiment by pasting URLs from tweets into the search box on twitter.com:

Searching Twitter for a URL by pasting the URL into the search box

I noticed I was able to find some URLs embedded in tweets, but this was not always the case. Based on my observations, finding the URLs was not correlated with the age of the tweet. I discussed this observation with Ed Summers and he recommended adding a "url:" prefix to the URL before searching. For example, if the search URL is:

"http://www.cnn.com",

he recommended searching for

"url:http://www.cnn.com"

I observed that prepending search URLs with the "url:" prefix improved my search success rate. For example, the search URL: "http://www.motherjones.com/environment/2016/09/dakota-access-pipeline-protest-timeline-sioux-standing-rock-jill-stein" was not found except with the "url:" prefix.

Example of a URL that was not found except with the "url:" parameter

Example of a URL that was not found with the "url:" parameter, but found without

Based on these observations, and considering that there was no apparent protocol switching, or URL canonicalization, I scaled the experiment to gain a better insight about this search behavior. I wanted to know the proportion of URLs that are:

found exclusively with the "url:" prefix
found exclusively without the "url:" prefix
found with and without the "url:" prefix (both 1 and 2).

I issued 3,923 URL queries to Twitter and observed the following proportions:

Count of URLs found exclusively with the "url:" prefix: 1,519
Count of URLs found exclusively without the "url:" prefix: 129
Count of URLs found with and without the "url:" prefix (both 1 and 2): 853
Count of URLs not found: 1,422

My initial non-automated tests gave the false impression that the "url:" prefix was the only consistent method to find all URLs embedded in tweets, but these tests result show that even though the "url:" prefix search method exhibits a higher hit rate, it is not self sufficient.
Consequently, to find a URL "U" via twitter web search, I recommend beginning the search with "url:U". If "U" is not found, search for U, because this promises a higher hit ratio.

--Nwala

↧

2017-02-13: Electric WAILs and Ham

February 13, 2017, 12:54 pm

≫ Next: 2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

≪ Previous: 2017-01-23: Finding URLs on Twitter - A simple recommendation

Mat Kelly recently posted Lipstick or Ham: Next Steps For WAIL in which he spoke about the past, present, and potential future for WAIL. Web Archiving Integration Layer (WAIL) is a tool that seeks to address the disparity between institutional and individual archiving tools by providing one-click configuration and utilization of both Heritrix and Wayback from a user's personal computer. I am here to speak on the realization of WAIL's future by introducing WAIL-Electron.

WAIL-Electron

WAIL has been completely revised from a Python application using modern Web technologies into an Electron application. Electron combines a Chromium (Chrome) browser with Node.js allowing for native desktop applications to be created using only HTML, CSS, and JavaScript.

The move to Electron has brought with it many improvements most importantly, of which is the ability to update and package WAIL for the three major operating systems: Linux, MacOS, and Windows. Support for these operating systems is easily achieved by packing utility used (electron-packager) which allows one to produce the binary for a specific system. Also thanks to this move, the directory structure issue mentioned by Mat in his post has been resolved. Electron applications have their own directory structure inside the OS-specific application directory path accessible via their API. Here the packager will place the tools WAIL makes available for use.

Electric Ham

The meat of this revision is adding new functionality to WAIL in addition to the tools already made available through WAIL, namely Heritrix and Wayback. This new functionality comes in two parts. First, WAIL is now collection-centric. The previous revision, WAIL Python, added the WARC files created through WAIL to a single archive. This archive was an ambiguous collection of sorts where users had to create their own means of associating the WARCs to each other. Initially, this beneficial feature allowed users to archive what they saw at any given instance and replay the preserved page immediately. But updates to WAIL could not be justified if they did not build upon the existing functionality; which is why the concept of personal collection-based archiving was introduced.

Collections

WAIL now provides users with the ability for the curation of personalized web archive collections akin to the subscription service Archive-It except on their local machines. By default, WAIL comes with an initial collection and allows for the creation of additional collections.

The Collections screen displays the collections created through WAIL. This view displays the collection name along with some summary information about it.

Seeds: How many seeds are associated with the collection
Last Updated: the last time (date and time) the collection was updated
Size: How large is the collection on the file system

Creation of a collection is as simple as clicking the New Collection button available on the Collections (home) screen of WAIL. After doing so, a dialog will appear from which users can specify the name, title, and description for the collection. Once these fields have been filled in, WAIL will create the collection that users can access from the Collections View.

The Collection View displays the information about each seed contained in the collection

Seed URL: The URL
Added: The date time it was added to the Collection
Last Archived: The last time it was archived through WAIL
Mementos: The number of Mementos for the seed in the collection

along with a link for viewing the seed in Wayback.

Seeds can be added to a collection from either the live web or from WARC files present on the filesystem. To aid in the process of adding a seed from the live web, WAIL provides the user will the ability to "check" the seed before archiving.

The check provides summary information about the seed that includes the HTTP status code and a report on the embedded resources contained in the page. This lets users choose an archive configuration before starting WAIL's archival process to configure and launch a Heritrix crawl.

To add a seed from the filesystem all the user has to do is drag and drop the (W)ARC file into the corresponding interface for that functionality. WAIL will process the (W)ARC file and display a list of potential seeds discovered.

WAIL can not automatically determine the seed due to the nature of (W)ARC files. Rather WAIL uses heuristics on the contents of the (W)ARC file to determine which entries are valid candidates for the seed URL. From this display, the user chooses the correct one. WAIL will then add the seed to the collection, and it will be available for replay from the Collection View.

Twitter Archiving

The second added functionality is the ability to monitor and archive Twitter content automatically. This was made possible thanks to the scholarship I received for preserving online news. There are two options for the Twitter archival feature implemented in WAIL. The first is monitoring a user’s timeline for tweets which were tweeted after the monitoring has started with the option of selecting only the tweets containing hashtags specified during configuration. The second, a slight variation of the first, will only archive tweets that have specific keywords in the tweet’s body as specified during configuration.

What makes this unique is how WAIL preserves this content. Before this addition, WAIL utilized Heritrix as the primary preservation means. Heritrix executes HTTP GET requests to retrieve the target web page and archives the HTTP response headers and the content returned from the server. The embedded Javascript of the web page is not executed potentially decreasing the fidelity of the capture. This is problematic when archiving Twitter content since the rendering of tweets is done exclusively through client-side Javascript.

To address this WAIL utilizes the native Chromium browser provided by Electron in conjunction with WARCreate. Modifications were made to WARCreate in order to integrate it with WAIL to eliminate the need for human intervention to decide when to generate the WARC and to work inside of Electron. By integrating WARCreate into WAIL the archival process of Twitter content has been simplified to loading the URL of the tweet into the browser and waiting until the browser indicates that the page has been rendered in its entirety. Then the archival process through WARCreate is initiated. Once the WARC has been generated, it is added to the collection specified by the user.

Putting on Lipstick

As mentioned in Mat's blog post, the UI for WAIL-Python needed an update not only for its maintainability but also for a cohesive user experience across supported platforms. At the time of starting this revision of WAIL, the choices available for the front-end framework as seen on Github were plentiful. It simply boiled down to choosing the one that had the "least" painful setup and deployment process with a learning curve such that any person taking over the project could be brought up to speed with minimal effort.

With this in mind, React was chosen for WAIL's UI library; it is unopinionated about other technologies which may be used alongside it and features a large production tested ecosystem with an active developer community. React is only a view library, which is why WAIL uses Redux and Immutable.js to complete the traditional MVC package. This React, Redux, and Immutable.js stack provide WAIL a consistent user experience across supported platforms and a much more manageable codebase. On the tools side of making WAIL look and perform beautifully, WAIL is now using Ilya Kreymer's pywb. Pywb is used by WAIL for both replay and to aid in the heavy lifting of managing the collections.

WAIL is now available from the project's release page on Github available. For more information about how to use WAIL be sure to visit the wiki.

- John Berlin

↧

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

February 22, 2017, 4:09 pm

≫ Next: 2017-03-02: National Symposium on Web Archiving Interoperability Trip Report

≪ Previous: 2017-02-13: Electric WAILs and Ham

Examples: Archive Now (archivenow) CLI

A small part of my research is to ensure that certain web pages are preserved in public web archives to hopefully be available and retrievable whenever needed at any time in the future. As archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as The Internet Archive, WebCite, Perma.cc, and Archive.is. For any reason, one archive stops serving temporarily or permanently, it is likely that copies can be fetched from other archives. By Archive Now, one command like:

$ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all configured archives in this Python library.

Archive Now allows you to accomplish the following major tasks:

A web page can be pushed into one archive
A web page can be pushed into multiple archives
A web page can be pushed into all archives
Adding new archives
Removing existing archives

Install Archive Now from PyPI:
$ pip install archivenow

To install from the source code:
$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./

"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

The CLI (or A Docker Container)
A Web Service
Python code

1. The CLI

Usage of sub-commands in archivenow can be accessed through providing the -h or --help flag:
$ archivenow -h
usage: archivenow [-h][--cc][--cc_api_key [CC_API_KEY]]
[--ia][--is][--wc][-v][--all][--server]
[--host [HOST]][--port [PORT]][URI]
positional arguments:
URI URI of a web resource
optional arguments:
-h, --help show this help message and exit
--cc Use The Perma.cc Archive
--cc_api_key [CC_API_KEY]
An API KEY is required by The Perma.cc
Archive
--ia Use The Internet Archive
--is Use The Archive.is
--wc Use The WebCite Archive
-v, --version Report the version of archivenow
--all Use all possible archives
--server Run archiveNow as a Web Service
--host [HOST] A server address
--port [PORT] A port number to run a Web Service

Examples:

To archive the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com

By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com

To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:

$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com http://archive.is/fPVyc

To save the web page (www.foxnews.com) in all configured web archives:

$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP

Run it as a Docker Container (you need to do "docker pull" first)

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...

2. A Web Service

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 11111)

$ archivenow --server
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:29:23 GMT

{
"results": [
"https://web.archive.org/web/20170209142922/http://www.foxnews.com"
]
}

To save the web page (www.foxnews.com) in all configured archives though the web service:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 172
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Thu, 09 Feb 2017 14:33:47 GMT

{
"results": [
"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
"http://archive.is/H2Yfg",
"http://www.webcitation.org/6o9Jubykh",
"Error (The Perma.cc Archive): An API KEY is required"
]
}

you may use the Perma.cc API_Key as following:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key

3. Python Usage

>>> from archivenow import archivenow

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']

To save the web page (www.foxnews.com) in all configured archives:

>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required]

To save the web page (www.foxnews.com) in The Perma.cc:

>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']

To start the server from Python do the following. The server/port number can be passed (e.g,

start(port=1111, host='localhost')):
>>> archivenow.start()
* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

Configuring a new archive or removing existing one

Adding a new archive is as simple as adding a handler file in the folder "handlers". For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write ">>>archivenow.push("www.cnn.com","ma")". In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push" which has one argument. It might be helpful to see how other "*_handler.py" organized.

Removing an archive can be done by one of the following options:

Removing the archive handler file from the folder "handlers"
Rename the archive handler file to other name that does not end with "_handler.py"
Simply, inside the handler file, set the variable "enabled" to "False"

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.

Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow

--Mohamed Aturban

↧

2017-03-02: National Symposium on Web Archiving Interoperability Trip Report

March 2, 2017, 10:09 am

≫ Next: 2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

≪ Previous: 2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

The National Symposium on Web Archiving Interoperability was held February 21-22, 2017 at The Internet Archive in San Francisco, CA. The symposium was held as part of the IMLS-funded"WASAPI" project, which is researching "web archiving systems APIs". The participants are Internet Archive’s Archive-It, Stanford University Libraries (DLSS and LOCKSS), University of North Texas, and Rutgers University. There were nearly 50 attendees from a variety of international institutions.

Jefferson Bailey and Nicholas Taylor began the day with a review of the WASAPI project: "Building API-Based Web Archiving Systems and Services". They also lead a discussion about soliciting usage scenarios and feedback from potential users (see the results from their 2016 survey). You can track the WASAPI developments at their github repo, where they have the WASAPI Data Transfer API General Specification (for the transfer of WARC files, WAT files, etc.), reference implementations, and other items.

.@nullhandle presenting the use cases that motivate WASAPI initiative. #webarchiving pic.twitter.com/FixnLqEnnB
— Justin Littman (@justin_littman) February 21, 2017

After a break, we had a series of short presentations:

Debbie Kempe, NYARC, "NYARC’s Opensearch API Integration"
Greg Wiedeman, SUNY Albany, "Automating Web Archives Records in ASpace" (see also his blog post from 2016-10-18).
Stephen Abrams, CDL, "Cobweb"
Michael Nelson, ODU, "Web Archiving Activities of ODU’s Web Science and Digital Library Research Group" (slides also embedded below)

Web Archiving Activities of ODU’s Web Science and Digital Library Research Group (@WebSciDL) from Michael Nelson

I knew our group had been busy, but I could not help but be impressed with my recent, albeit extremely brief, catalog of our activities. I put the focus on tools and services we had created in support of our research, which lead to interesting questions from Tom Cramer and others about the role of tool production for PhD students. It's something Michele and I struggle with frequently: everyone enjoys when their tools are popular and useful to others, but student success should not be predicated on the popularity and uptake of the tools. Some tools are simply more applicable to a wider audience than others, which does not mean they are more or less suitable for the research purposes for which they were created.

The day closed with a social and a lot of informal meetings in the lobby of the Internet Archive.

.@brewster_kahle kicks off day two of the #webArchiving interoperability symposium - in the beautiful sunny @internetarchive great hall. pic.twitter.com/ndlKt9CekF
— Ian Milligan (@ianmilligan1) February 22, 2017

The second day began with a keynote from Brewster Kahle and a tour of the Internet Archive itself. It was my second tour of the IA, but it is always enjoyable. After a break we had three presentations:

Tom Cramer, Stanford University, (a talk about collaboration, but I can't find the slides)
Dallas Pillen, Bentley Historical Library, "ArchivesSpace-Archivematica-DSpace Workflow Integration"
Nicholas Taylor, Stanford University, (a talk about new developments in LOCKSS, but I can't find the slides)

We then had breakout sessions about collaboration goals, API expectations, and the impact of interoperability. The breakout session I attended was only moderately successful, producing two concurrent discussions that were informative but did not produce much in the way of tangible outputs. The other sessions were more productive and had materials to report back to the symposium at large.

#WebArchiving in action! pic.twitter.com/7dvsRl95Ut
— Archive-It (@archiveitorg) February 23, 2017

After lunch, the day resumed with some WASAPI transfer demos. My notes show only David Rosenthal (Stanford) giving a live LOCKSS demo of using the WASAPI API, but there may have been more. That lead to three more presentations:

Justin Littman, GWU, Social Feed Manager (I can't find the slides)
Ilya Kreymer and Mark Beasley, Rhizome, "Webrecorder Interoperability"
Ian Milligan (Waterloo) and Nick Ruest (York), "Warcbase: Using Scalable Web Analytics to Analyze Canadian Collections En Masse"

I've seen demos and presentations about Social Feed Manager several times and although our group has yet to use it, it looks like a great tool. Justin has also done a good job providing several pre-built collections (contact him for details). Ilya's presentation was tremendous, highlighting such interesting features as mixed archive integration (including localhost and otherwise "private" archives), import of collections from other archives (e.g., Archive-It) and augmenting missing resources from the live web (I noted it should check other archives for the desired datetime to avoid zombies), and "curated archives" which appear to be similar to twitter moments or storify stories, but for archived pages (see Yasmin AlNoamany's recent dissertation for similar research in this area). Ilya and his group are doing really great stuff with webrecorder.io. Ian's and Nick's presentation was excellent as always, and highlighted the work they're doing with Warcbase.

Then we had another round of breakouts, although I don't have good notes about their contents. I spent a lot of this time talking with Mark Graham and other folks. The final round of presentations included:

Matt Weber, Rutgers - Archives Unleashed
Martin Klein, "Web Archive Interoperability with Memento"

.@mart1nkle1n presenting #memento #webarchiving pic.twitter.com/GUREfN7mJ5
— Michael L. Nelson (@phonedude_mln) February 23, 2017

Matt's presentation was about the two Archives Unleashed hackathons (see the @WebSciDL trip reports for the first and second Archives Unleashed hackathons). The third hackathon immediately followed this symposium (on Thursday & Friday, February 23-24) and we will have a separate trip report for that, and the fourth hackathon has been announced for this summer at the British Library.

Martin's presentation touched on familiar topics such as the Time Travel Memento service, the Memento for Chrome extension, and Robust Links (demoable in our December 2015 D-Lib paper "Reminiscing About 15 Years of Interoperability Efforts"). This was a good presentation to end the day with, since Memento is the first and de facto standard for web archive interoperability (see the quick intro or RFC 7089). Memento does not address bulk upload or download of WARCs, WATs, etc., but it does define linkage between mementos (i.e., archived pages), their live web counterparts ("original resources"), lists of available mementos for an original resource ("TimeMaps"), and resources that do content negotiation in the dimension of datetime in order to direct you to the best available memento ("TimeGates").

The hashtag was "#webarchiving", but that's a general hashtag so the tweets from that event will quickly become lost. Some are embedded above but I've put the bulk of the symposiums tweets in this twitter moment. There was also a slack channel.

Overall this was an important and welcome event. There was less focus than I expected on the WASAPI APIs themselves, but perhaps I'm in the minority in enjoying digging through APIs. The WASAPI effort still seems to be in data reception mode, actively soliciting requirements and use cases. I thought there would be more demos from the WASAPI team, but I understand these things are difficult to bootstrap. The symposium was useful in getting many of the main players in the web archiving community together for technical interchange, especially since I'll miss the postponed IIPC General Assembly this year. Thanks to Jefferson, Lori, and everyone at the Internet Archive that helped host us, and thanks to the IMLS for funding this critical activity.

--Michael

"No one knows what to do with WARCs, so we got WAIL"#wail @WebSciDL @phonedude_mln #webarchiving pic.twitter.com/3qi2AKrntI
— Martin Klein (@mart1nkle1n) February 22, 2017

↧

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

March 7, 2017, 10:15 am

≫ Next: 2017-03-09: A State Of Replay or Location, Location, Location

≪ Previous: 2017-03-02: National Symposium on Web Archiving Interoperability Trip Report

Archive Unleashed 3.0 took place in the Internet Archive, San Francisco, CA. The workshop was two days long, February 23-24, 2017. This workshop took place in conjunction with a National Web Symposium, hosted at the Internet Archive, February 23 – 24. Four members of Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend. The members are: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third follow-up of the Archives Unleashed Web Archive Hackathon 1.0, and Web Archive Hackathon 2.0.

@WebSciDL at @internetarchive after Archives Unleashed 3.0 wrap up. We have a winner of #HackArchives pic.twitter.com/vYLi89yap0
— Sawood Alam (@ibnesayeed) February 25, 2017

This workshop, was supported by the Internet Archive, Rutgers University, and the University of Waterloo. The workshop brought together a small group of around 20 researchers that worked together to develop new open source tools to web archives. The three organizers of this workshop were: Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan, (Assistant Professor, Department of History, University of Waterloo), and Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo).

It was a big moment for me as I first saw the Internet Archive building, it had an Internet Archive truck parked outside. Since 2009, the IA headquarters have been at 300 Funston Avenue in San Fransisco, a former Christian Science Church. Inside the building in the main hall there were multiple mini statues for every archivist who worked in the IA for over three years.

On Wednesday night, we had a welcome dinner and a small introduction of the members that have arrived.

Day 1 (February 23, 2017)

On Thursday, we started with a breakfast and headed to the main hall where several presentations occurred. Matthew Weber presented “Stating the Problem, Logistical Comments”. Dr. Weber started by stating the goals which include developing a common vision of web archiving development and tool development, and to learn to work with born digital resources for humanities and social science research.

Next, Ian Milligan presented “History Crawling with Warcbase”. Dr. Milligan gave an overview of Warcbase. Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The tool is used to analyze web archives using Spark, and to take advantage of HBase to provide random access as well as analytics capabilities.

Archives Unleashed 3 underway w an overview of Warcbase @ianmilligan1 #hackarchives thx @NSF @RutgersCommInfo @internetarchive for support pic.twitter.com/AaJ8Nemvd8
— Matthew Weber (@docmattweber) February 23, 2017

Next, Jefferson Bailey (Internet Archive) presented “Data Mining Web Archives”. He talked about the conceptual issues in Access to WA which include: Provenance (much data, but not all as expected), Acquisition (highly technical; crawl configs; soft 404s), Border issues (the web never really ends), Lure of evermore data (more data is not better data), and Attestation (higher sensitivity to elision than in traditional archives?). He also explained the different formats that the Internet Archive can save its data, which include CDX, Web Archive Transformation dataset (WAT), Longitudinal Graph Analysis dataset (LGA), and Web Archive Named Entities dataset (WANE). In addition, he presented an overview of some research projects based on the IA collaboration. Some of the researches he mentioned were: The ALEXANDRIA project, Web Archives for Longitudinal Knowledge, Global Event and Trend Archive Research & Integrated Digital Event Archiving, and many more.

Next, Vinay Goel (Internet Archive) presented “API Overview”. He presented the Beta WayBack Machine, which searches the IA based on a URL or a word related to a sites home page. He mentioned that search results are presented based on the anchor text search.

Justin Littman (George Washington University Libraries), presented “Social Media Collecting with Social Feed Manager”. SFM is an open source software that collects social media from APIs of Twitter, Tumblr, Flickr, and Sina Weibo.

The final talk was by “Ilya Kreymer” (Rhizome), where he presents an overview of the tool “Webrecorder”. The tool provides an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content.

After that, we had a short coffee break and started to form three groups. In order to form the groups, all participants were encouraged to write a few words on the topic they would like to work on, some words that appeared were: fake news, news, twitter, etc. Similar notes are grouped together and associating members. The resulting groups were Local News, Fake News, and End of Term Transition.

Group Name	Group Members
Local News: Good News/Bad News	Sawood Alam, Old Dominion University Lulwah Alkwai, Old Dominion University Mark Beasley, Rhizome Brenda Berkelaar, University of Texas at Austin Frances Corry, University of Southern California Ilya Kreymer, Rhizome Nathalie Casemajor, INRS Lauren Ko, University of North Texas
Fake News	Erika Siregar, Old Dominion University Allison Hegel, University of California, Los Angeles Liuqing Li, Virginia Tech Dallas Pillen, University of Michigan Melanie Walsh, Washington University
End of Term Transition	Mohamed Aturban, Old Dominion University Justin Littman, George Washington University Jessica Ogden, University of Southampton Yu Xu, University of Southern California Shawn Walker, University of Washington

Every group started to work on its dataset and brain storm different research questions to answer, and formed a plan of work. Then we basically worked all through the day, and ended the night with a working dinner at the IA.

Archives Unleashed day 1: Team EOT-Transition using @eotarchive to try to detect changes in web presence of fed agencies. #hackarchives
— Justin Littman (@justin_littman) February 24, 2017

Day 2 (February 24, 2017)

On Friday we started by eating breakfast, and then each team continued to work on their projects.

Every Friday the IA has free lunches where hundreds of people join together; some were artists, activists, engineers, librarians and many more. After that, a public tour of the IA takes place.

We had some light talks after lunch. The first talk was by Justin Littman, were he presented an overview of his new tool called “Fbarc”. This tool archive webpages from Facebook using the Graph API.

.@justin_littman gives a lightning intro to the Facebook Graph API and his new tool f(b)arc (https://t.co/7UQtNOiPjZ). #hackarchives pic.twitter.com/uJn17UFGmh
— Ian Milligan (@ianmilligan1) February 24, 2017

Nick Ruest (Digital Assets Librarian at York University), gave a talk on “Twitter”. Next, Shawn Walker (University of Washington), presented “We are doing it wrong!”. He explained how the current collecting process of social media is not how people view social media.

Now @walkeroh is showing us how we are collecting social media wrong - users see websites, researchers see scrolling JSON. #hackarchives pic.twitter.com/SWCMT3qq0Z
— Ian Milligan (@ianmilligan1) February 24, 2017

After that all the teams presented their projects. Starting with our team, we called our project "Good News/Bad News". We utilized historical captures (mementos) of various local news sites' homepages from Archive-It to prepare our seed dataset. In order to transform the data for our usage we utilized the Webrecorder, WAT converter, and some custom scripts. We extracted various headlines featured on the homepages of the each site for each day. With the extracted headlines we analyzed the sentiments on various levels including individual headlines, individual sites, and over the whole nation using the VADER-Sentiment-Analysis Python library. To leverage more machine learning capabilities for clustering and classification, we built a Latent Semantic Indexer (LSI) model using a Ruby library called Classifier Reborn. Our LSI model helped us convey the overlap of discourse across the country. We also explored the possibility of building Word2Vec model using TensorFlow for advanced machine learning, but due to limited amount of available time, despite the great potential, we could not pursue it. To distinguish between the local and the national discourse we planned on utilizing Term Frequency-Inverse Document Frequency, but could not put it together on time. For the visualization we planned on showing the interactive US map along with the heat map of the newspaper location with the newspaper ranking as the size of the spot and the color indicating if it is good news (green) or bad news (red). Also, when a newspaper is selected a list of associated headlines is revealed (color coded as Good/Bad), a Pie chart showing the overall percentage Good/Bad/Neutral, related headlines from various other news sites across the country, and a word cloud of the top 200 most frequently used words. This visualization could also have a time slider that show the change of the sentiment for the newspapers over time. We had many more interesting visualization ideas to express our findings, but the limited amount of time only allowed us to go this far. We have made all of our code and necessary data available in a GitHub repo and trying to make a live installation available for exploration at soon.

Final presentations at #hackarchives are underway @internetarchive wrapping up archives unleashed 3!! pic.twitter.com/KbV4oFfurh
— Matthew Weber (@docmattweber) February 25, 2017

Next, the team “Fake News” presented their work. The team started with the research questions: “Is it fake news to misquote a presidential candidate by just one word? What about two? Three? When exactly does fake news become fake?”. Based on these question, they hypothesis that “Fake news doesn’t only happen from the top down, but also happens at the very first moment of interpretation, especially when shared on social media networks". With this in mind, they want to determine how Twitter users were recording, interpreting, and sharing the words spoken by Donald Trump and Hillary Clinton in real time. Furthermore, they also want to find out how the “facts” (the accurate transcription of the words) began to evolve into counter-facts or alternate versions of their words. They analyzed the twitter data from the second presidential debate and focused on the most famous keywords such as "locker room", "respect for women", and "jail". The analysis result is visualized using word tree and bar chart. They also conducted a sentiment analysis which outputs a surprising result: most twitter result has positive sentiments towards the locker-room talk. Further analysis showed that apparently sarcastic/insincere comments skewed the sentiment analysis, hence the positive sentiments.

Next team used Twitter to track "fake news" - how were Trump, Clinton debate quotes shared. How did alternatives appear? #hackarchives pic.twitter.com/NsOtS5eMnt
— Ian Milligan (@ianmilligan1) February 25, 2017

After that, the team “End of Term Transition” presented their project. The group were trying to use public archives to estimate change in the main government domains at the time of each US presidential administration transition. For each of these official websites, they planned to identify the kind and the rate of change using multiple techniques including the Simhash, TF–IDF, edit distance, and efficient thumbnail generation. They investigated each of these techniques in terms of its performance and accuracy. The datasets were collected from the Internet Archive Wayback Machine around the 2001, 2005, 2009, 2013, and 2017 transitions. The team made their work available on Github.

Last up! Analysis of changes in web content across end of term crawls @internetarchive #hackarchives pic.twitter.com/dWrfChnTFF
— Matthew Weber (@docmattweber) February 25, 2017

Finally, a surprise new team joined, it was team “Nick”. It was presented by Nick Ruest, (Digital Assets Librarian at York University). Nick has been exploring Twitter API mysteries, he showed some visualizations showing some odd peaks that occurred.

The final mystery 'team' is @ruebot, who has been exploring Twitter API mysteries. Odd plateaus and peaks in data. #hackarchives pic.twitter.com/Q4dCGBvcTT
— Ian Milligan (@ianmilligan1) February 25, 2017

After the teams presented their work, the judges announced the team with the most points, and the winner team was “End of Term Transition”.

And the winner is Team Transition @internetarchive #hackarchives !!! That's a wrap folks. pic.twitter.com/bGili0d7WT
— Matthew Weber (@docmattweber) February 25, 2017

This workshop was extremely interesting and I enjoyed it fully. The fourth Datathon Archives Unleashed 4.0: Web Archive Datathon was announced, and will occur at the British Library, London, UK, at June 11 – 13, 2017. Thanks to Matthew Weber, Ian Milligan, and Jimmy Lin for organizing this event, and for Jefferson Bailey, and Vinay Goel, and everyone at the Internet Archive.

And we're proud to announce archives unleashed 4 @britishlibrary jun 11-13. Details at https://t.co/OCbha8LDmk #hackarchives
— Matthew Weber (@docmattweber) February 25, 2017

-Lulwah M. Alkwai

↧

2017-03-09: A State Of Replay or Location, Location, Location

March 9, 2017, 10:30 am

≫ Next: 2017-03-20: A survey of 5 boilerplate removal methods

≪ Previous: 2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

We have written blog posts about the time traveling zombie apocalypse in web archives and how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive. A more recent post about CNN's utilization of JavaScript to load and render the contents of its homepage have made it unarchivable since November 1st, 2016. The CNN post detailed how some "tricks" were utilized to circumvent CORS restrictions of HTTP requests made by JavaScript to talk to their CDN were the root cause of why the page is unarchivable / unreplayable. I will now present to you a variation of this which is more insidious and less obvious than what was occurring in the CNN archives.

TL;DR

In this blog post, I will be showing in detail what caused a particular web page to fail on replay. In particular, the replay failure occurred due to the lack of necessary authentication and HTTP methods made for the custom resources this page requires for viewing. Thus the pages JavaScript thought the current page being viewed required the viewer to sign in and will always cause redirection to happen before the page has loaded. Also depending on the replay systems rewrite mechanisms, the JavaScript of the page could collide with the replays systems causing undesired effects. The biggest issue highlighted in the blog post is that certain archives replay systems are employing unbounded JavaScript rewrites that, albeit in certain situations, fundamentally destroy the original page's JavaScript. Putting its execution into states its creators could not have prepared for or thought possible when viewing the page on the live web. It must be noted that this blog post is the result of my research into the modifications made to a web page in order to archive and replay it faithfully as it was on the live web.

Background

Consider the following URI https://www.mendeley.com/profiles/helen-palmer which when viewed on the live web behaves as you would expect any page not requiring a login to behave.

But before I continue, some background about mendely.com since you may not have known about this website as I did not before it was brought to my attention. mendely.com is a LinkedIn of sorts for researchers which provides additional services geared towards them specifically. Like LinkedIn, mendely.com has publicly accessible profile pages listing a researcher's interests, their publications, educational history, professional experience, and following/follower network. All of this is accessible without a login and the only features you would expect to require a login such as follow the user or read one of their listed publications take you to a login page. But the behavior of the live web page is not maintained when replayed after being archived.

A State of Replay

Now consider the memento of https://www.mendeley.com/profiles/helen-palmer from Archive-It on 2016-12-15T23:19:00. When the page starts to load and becomes partially rendered, an abrupt redirection occurs taking you to

www.mendeley.com/sign-in/?routeTo=https%3A%2F%2Fwww.mendeley.com%2Fprofiles%2Fhelen-palmer

which is 404 in the archive.

Obviously, this should not be happening since this is not the behavior of the page on the live web. It is likely that the pages JavaScript is misbehaving when running on the host wayback.archive-it.org. Before we investigate what is causing this to happen let us see if the redirection occurs when replaying a memento from the Internet Archive on 2017-01-26T21:48:31 and a memento from Webrecorder on 2017-02-12T23:27:56.

Webrcorder

Internet Archives

The video below shows this occurring in all three archives

and as seen in the video below this happens on other pages on mendeley.com

Comparing The Page On The Live Web To Replay On Archive-It

Unfortunately, both are unable to replay the page due to the redirection occurring which points to the credibility of the original assumption that the pages JavaScript is causing the redirection. Before diving into JavaScript detective mode, let us see if the output from the developer console can give us any clues. Seen below is the browser console with XMLHttpRequests (XHR) logging enabled when viewing

https://www.mendeley.com/profiles/helen-palmer

on the live web seen below

Besides the optimizely (user experience/analytics platform) XHR requests the page's own JavaScript makes several requests to the sites backend at

https://api.mendeley.com

and a single GET request for

https://www.mendeley.com/profiles/helen-palmer/co-authors

A breakdown of the request to api.mendely listed below:

GET api.mendeley.com/catalog (x8)
GET api.mendeley.com/documents (x1)
GET api.mendeley.com/scopus/article_authors (x8)
POST api.mendely.com/events/_batch (x1)

From these network requests, we can infer that the live web page is dynamically populating the publications list of its profile pages and perhaps some other elements of the page. Now let's check the browser console from the Archive-It memento on 2016-12-15T23:19:00.

Many errors are occurring as seen in the browser console from the Archive-It memento but it is the XHR request errors and lack of XHR requests made that are significant. The first significant XHR error is a 404 that occurred when trying to execute a GET request for

http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors/

This is a rewrite error (URI-R -> URI-M). The live web pages JavaScript requested

https://www.mendeley.com/profiles/helen-palmer/co-authors

but when replayed the archived JavaScript made the request for

https://www.mendeley.com/profiles/helen-palmerco-authors

Stranger yet is that the XHR finished loading console entry indicates it was made to

http://wayback.archive-it.org/profiles/helen-palmerco-authors

not, the URI-M that received the 404. Thankfully we can consult the developer tools included in our web browsers to see request/response headers for each request. The corresponding headers for

http://wayback.archive-it.org/profiles/helen-palmerco-authors

are seen below

The request was really 302 and was indeed made to

http://wayback.archive-it.org/profiles/helen-palmerco-authors

but the actual location indicated in the response is to the "correct" URI-M

http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors

The other significant difference from the live webs XHR requests is that the archived pages JavaScript is no longer requesting the resources from api.mendely.com. We now have a single request for

http://wayback.archive-it.org/profiles/refreshToken

This request suffered the same fate as the previous request, 302 with location of

http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/refreshToken

and then the redirection happens. Now we have a better understanding of what is happening with the Archive-It memento. The question about the Internet Archives and Webrecorders memento remains.

Does This Occur In Other Archives

The console output from the Internet Archives memento on 2017-01-26T21:48:31 seen below shows that the requests to api.mendeley.com are not made. The request for the refresh token is made, but unlike the Archive-It memento the request to co-authors is rewritten successfully and does not receive a 404 but still redirects seen below:

Likewise with the memento from Webrecorder on 2017-02-12T23:27:56 seen below, the request made to co-authors is rewritten successfully, we have the request for refresh token but still redirect to the sign-in page like the others.

As the redirection occurs for the Internet Archives and Webrecorders memento we can now finally ask the question what happened to the api.mendeley.com requests and what in the pages JavaScript is making replay fail.

Location, Location, Location

The mendeley website defines a global object that contains definitions for URLs to be used by the pages JavaScript when talking to the backend. That global object seen below (from the Archive-It memento) is untouched by the archives rewriting mechanisms. Now there is another inline script tag that adds some preloaded state for use by the pages JavaScript seen below (also from Archive-It). But here we find that our first instance of erroneous JavaScript rewriting. As you can see the __PRELOADED_STATE__ object has a key of WB_wombat_self_location which is a rewrite targeting window.location or self.location. Clearly, this is not correct when you consider the contents of this object which describe a physical location. When comparing the live web key for this entry seen below, the degree of error in this rewrite becomes apparent. Some quick background on the WB_wombat prefix before continuing on. The WB_wombat prefix normally indicates that the replay system is using the wombat.js library from PyWb and conversely Webrecorder. They are not, rather they are using their own rewrite library called ait-client-rewrite.js. The only similarity between the two is the usage of the name wombat.

Finding the refresh token code in the pages JavaScript was not so difficult seen below is the section of code that likely causes the redirect. You will notice that the redirect occurs when it is determined the viewer is not authorized to be seeing this page. This becomes clearer when seeing the code that executes retrieval of the refresh token. Here, we see two things: mendeley.com has a maximum number of retries for actions they require some form of authentication for (this is the same for the majority of its resources the pages JavaScript makes requests for) and the second instance of erroneous JavaScript rewriting:

e = t.headers.WB_wombat_self_location;

It is clear to see that Archive-It is using regular expressions to rewrite any <pojo>.location to WB_wombat_self_location as on inspection of that code section you can see that the pages JavaScript is clearly looking for the location sent in the headers commonly for 3xx or 201 responses (RFC 7231#6.3.2). This is further confirmed by the following line from the code seen above

e && this.settings.followLocation && 201 === t.status

The same can be seen in this code section from the Webrecorder memento which leaves the Internet Archives memento but the Internet Archive does not do such rewrites making this a non-issue for them. These files can be found in a gist I created if you desire to inspect them for yourself. Now at this point, you must be thinking case closed we have found out what went wrong and so did I but was not so sure as the redirection occurs in the Internet Archives memento as well.

Digging Deeper

I downloaded the Webrecorder memento loaded into my own instance of PyWb, and used its fuzzy match rewrite rules (via regexes) to insert print statements at locations in the code I believed would surface additional errors. The fruit of this labor can be seen below.

As seen above the requests to

api.mendely.com/documents and api.mendely.com/events/_batch

are actually being made but are shown as even going through by the developer tools which is extremely odd. However, the effects of this can be seen by the two errors shown after the console entries for

/profiles/helen-palmer/co-authors

and

anchor_setter herf https://www.mendely.com/profiles/helen-palmer/co-authors

which are store:publications:set.error and data:co-authors:list.error. These are the errors which I believe to be the root cause of the redirection. Before I address why that is and what the anchor_setter console entry means, we need to return to considering the HTTP requests made by the browser when viewing the live web page and not just those the browsers built in developer tools show us.

Understanding A Problem By Proxy

To achieve this I used an open-source alternative to Charles called James. James is an HTTP Proxy and Monitor that will allow us to intercept and view the requests made from the browser when viewing

https://www.mendeley.com/profiles/helen-palmer

on the live web. The image below displays the HTTP requests made by the browser starting at the time when the request for co-authors was made.

The blue rectangle highlights the requests made when replayed via Archive-It, Internet Archive and Webrecorder which include the request for co-authors (data:co-authors:list.error). The red rectangle highlights the request made for retrieving the publications (store:publications:set.error). The pinkish purple rectangle highlights a block of HTTP Options (RFC 7231#4.3.7) requests made when requesting resources from api.mendely. The request made in the red rectangle also have the Options request made before the

GET request for api.mendeley.com/catalog?=[query string]

This is happening because and to quote from the MDN entry for HTTP OPTIONS request:

Preflighted requests in CORS
In CORS, a preflight request with the OPTIONS method is sent, so that the server can respond whether it is acceptable to send the request with these parameters. The Access-Control-Request-Method header notifies the server as part of a preflight request that when the actual request is sent, it will be sent with a POST request method. The Access-Control-Request-Headers header notifies the server that when the actual request is sent, it will be sent with a X-PINGOTHER and Content-Type custom headers. The server now has an opportunity to determine whether it wishes to accept a request under these circumstances.

What they mean by preflighted is that this request is made implicitly by the browser and the reason it is being sent before the actual JavaScript made request is because the content type they are requesting is

application/vnd.mendeley-document.1+json

A full list of the content-types the mendeley pages request are enumerated in a gist likewise with the JavaScript that makes the requests for each content-type Again let's compare the browser requests as seen by James from the live web to the archived versions to see if what our browser was not showing us for the live web version is happening in the archive. Seen below are the browser-made HTTP requests as seen by James for the Archive-It memento on 2016-12-15T23:19:00.

The

helen-palmer/co-authors -> helen-palmerco-authors

rewrite issue is indeed occurring with the requests which are not made for the URI-M but hitting wayback.archive-it.org first same with profile/refreshToken. We do not see any of the requests for api.mendely as you would expect. Another strange thing is both of the requests for refreshToken get 302 status until a 200 response comes back but now from a memento on 2016-12-15T23:19:01. The memento from the Internet Archive on 2017-01-26T21:48:31 suffers similarly as seen below, but the request for helen-palmer/co-authors remains intact. The biggest difference here is that the memento from the Internet Archive is bouncing through time much more than the Archive-It memento.

The memento from Webrecorder on 2017-02-12T23:27:56 suffers similarly as did the memento from Archive-It, but this time something new happens as seen below.

The request for refreshToken goes through the first time and resolves to a 200 but we have the

helen-palmer/co-authors -> helen-palmerco-authors

rewrite error occurring. Only this time the request stays a memento request but promptly resolves to a 404 due to the rewrite error. Both the Archive-It memento and the Webrecorder memento share this rewrite error, and both use wombat to some extent so what gives. The explanation for this is likely to lie with the use of wombat (at least for the Webrecorder memento) as the library overrides a number of the global dom elements and friends at the prototype level (enumerated for clarity via this link). This is to bring the URL rewrites to the JavaScript level and to ensure the requests made become rewritten at request time. In order to better understand the totality of this, recall the image seen below (this time with sections highlighted) which I took after inserting print statements into the archived JavaScript via PyWbs fuzzy match rewrite rules.

The console entry anchor_setter href represents an instance when the archived JavaScript for mendeley.com/profiles/helen-palmer is about to make an XHR request and is logged from the wombat override of the a tags href setter method. I added this print statement to my instance of PyWb's wombat because the mendeley JavaScript uses a promise based XHR request library called axios. The axios library utilizes an anchor tag to determine if the URL for the request being made is same origin and does its own processing of the URL to be tested after using the anchor tag to normalize it. As you can see from the image above, the URL being set is relative to the page but becomes a normalized URL after being set on the anchor tag (I logged the before and after of just the set method). It must be noted that the version of wombat I used likely differs from the versions being employed by Webrecorder and maybe Archive-It. But from the evidence presented it appears to be a collision between the rewriting code and the axios libraries own code.

HTTP Options Request Replay Test

Now I can image that the heads of the readers of this blog post maybe heads are hurting, or I may have lost a few along the way. I apologize for that by the way. However, I have one more outstanding issue I brought before you to clear up. What happened to the api.mendely requests especially the options requests. The options requests were not executed for one of two reasons. The first is the pages JavaScript could not receive the expected responses due to the Authflow requests failed when replayed from an archive. Second one of the requests for content-type

application/vnd.mendeley-document.1+json

failed due to the lack of replaying HTTP Options methods or it did not return what you thought it would when replayed. To test this out I created a page hosted using GitHub pages called replay test. This page's goal is to throw some gotchas at archival and replay systems. Of those gotchas is an HTTP Options request (using axios) to https://n0tan3rd.github.io/replay_test which is promptly replied to by GitHub with a 405 not allowed. An interesting property of the response by GitHub for the request that the body is HTML which the live web displays once the request is complete. We may assume a service like Webrecorder would be able to replay this. Wrong it does not nor does the Internet Archive. What does happen is the following as seen when replayed via Webrecoder which created the capture.

The same can be seen in the capture from the Internet Archive below

What you are seeing is the response to my Options request which is to respond as if I my browser made an GET request to view the capture. This means the headers and status code I was expecting to find were never sent but saw a 200 response for viewing the capture not the request for the resource I made. This implies that the mendeleys JavaScript will never be able to make the requests for its resources that are content-type

application/vnd.mendeley-document.1+json

when replayed from an archive. Few, this now concludes this investigation and I leave what else my replay_test pages does as an exercise for the reader.

Conclusions

So what is the solution for this but first we must consider.... I'm joking. I can see only two solutions for this. The first is that replay systems used by archives that use regular expressions for JavaScript rewrites need to start thinking like JavaScript compilers such as Babel when doing the rewriting. Regular expressions can not understand the context of the statement being rewritten whereas compilers like Babel can. This would ensure the validity of the rewrite and avoid rewriting JavaScript code that has nothing to do with the windows location. The second is to archive and replay the full server client HTTP request-response chain.

- John Berlin

↧

2017-03-20: A survey of 5 boilerplate removal methods

March 20, 2017, 7:21 pm

≫ Next: 2017-03-24: The Impact of URI Canonicalization on Memento Count

≪ Previous: 2017-03-09: A State Of Replay or Location, Location, Location

Boilerplate removal result from BeautifulSoup's get_text() method for news website. Extracted text includes extraneous text, HTML and Javascript text.

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (Junk text), HTML, Javascript, comments, and CSS text.

Boilerplate removal result from NLTK's (OLD) clean_html() method for news website. Extraneous text included, but does not include Javascript and HTML text.

Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include Javascript, HTML, comments or CSS text.

Boilerplate removal result from Justext method for news website. Smaller extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html() method, but title missing.

Fig. 3: Boilerplate removal result for Justext method for a news website. Extracted text includes smaller extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html() method, but the page title is absent.

Boilerplate removal result from Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext but missing title text and first paragraph.

Fig. 4: Boilerplate removal result for Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but page title and first paragraph are absent.

Boilerplate removal result from Python-boilerpipe (ArticleExtractor) method for this news website. Smaller extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext.

Fig. 5: Boilerplate removal result for Python-boilerpipe (ArticleExtractor) method for a news website. Extracted text includes smaller extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext.

Boilerplate removal refers to the task of extracting the main text content of webpages. This is done through the removal of content such as navigation links, header and footer sections, etc. Even though this task is a common prerequisite for most text processing tasks, I have not found an authoritative versatile solution. In other to better understand how some common options for boilerplate removal perform against one another, I developed a simple experiment to measure how well the methods perform when compared to a gold standard text extraction method (myself). Python-boilerpipe (ArticleExtractor mode) performed best on my small sample of 10 news documents with an average Jaccard Index score of 0.7530, and median Jaccard Index score of 0.8964. The Jaccard scores for each document for a given boilerplate removal method was calculated over the sets (bag of words) created from the news documents and the gold standard text.

Some common boilerplate removal methods

BeautifulSoup's get_text()

Description: BeautifulSoup is a very (if not the most) popular python library used to parse HTML. It offers a boilerplate removal method - get_text() - which can be invoked with a tag element such as the body element of a webpage. Empirically, the get_text() method does not do a good job removing all the Javascript, HTML markups, comments, and CSS text of webpages, and includes extraneous text along with the extracted text.
Recommendation: I don't recommend exclusive use of get_text() for boilerplate removal.

NLTK's (OLD) clean_html()

Description: Natural Language processing Toolkit (NLTK) used to provide a method called clean_html() for boilerplate removal. This method used regular expressions to parse and subsequently remove HTML, Javascript, CSS, comments, and white spaces. However, presently, NLTK deprecated this implementation and suggests the use of BeautifulSoup's get_text() method, which as we have already seen does not do a good job.
Recommendation: This method does a good job removing HTML, Javascript, CSS, comments, and white spaces. However, it includes the boilerplate text such as the navigation link text, as well as header and footer sections text. Therefore, if your application is not sensitive to extraneous text, and you just care about including all text from a page, this method is sufficient.

Justext

Description: According to Mišo Belica, the creator of Justext, it was designed to preserve mainly text containing full sentences, thus, well suited for creating linguistic resources. Justext also provides an online demo.
Recommendation: Justext is a decent boilerplate removal method that performed almost as well as the best boilerplate removal method from our experiment (Python-boilerpipe). But note that Justext may omit page titles.

Python-goose

Description: Python-goose is a python rewrite of an application originally written in Java and subsequently Scala. According to the author, the goal of Goose is to process news article or article-type pages, extract the main body text, metadata, and most probable image candidate.
Recommendation: Python-goose is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Python-goose may omit page titles just like Justext.

Python-boilerpipe

Description: Python-boilerpipe is a python wrapper of the original Java library for boilerplate removal and text extraction from HTML pages.
Recommendation: Python-boilerpipe outperformed all the other boilerplate removal methods in my small test sample. I currently use this method as the boilerplate removal method for my applications.

Experiment to compare boilerplate removal methods

First, I arbitrarily selected 10 news document:

With the following corresponding gold standard text documents:

The HTML extracted from the 10 news documents was extracted by dereferencing each of the 10 URLs with curl. This means the boilerplate removal methods operated on just HTML (without running Javascript). I also ran the boilerplate removal methods on archived copies from archive.is for the 10 documents. This was based on the rationale that since archive.is runs Javascript and transforms the original page, this might impact the results. My experiment showed that boilerplate removal run on archived copies reduced the similarity between the gold standard texts and the output texts of all the boilerplate removal methods except BeautifulSoup's get_text() method (Table 2).

Second, for each document, I manually copied text I considered to be the main body of text for the document to create a total of 10 gold standard texts. Third, I removed the boilerplate from the 10 documents using the 8 methods outlined in Table 1. This led to a total of 80 extracted text documents (10 for each boilerplate removal method). Fourth, for each of the 80 documents, I computed the Jaccard Index (intersection divided by union of both set) over each document and it's respective gold standard. Fifth, for each of the 8 boilerplate removal methods outlined in Table 1, I computed the average of the Jaccard scores for the 10 news documents (Table 1).

Result

Table 1: Boilerplate removal results for live web news documents

Index	Methods	Averages of Jaccard Indices for 10 documents	Median of Jaccard Indices for 10 documents
1	BeautifulSoup's get_text()	0.1959	0.2201
2	NLTK's (OLD) clean_html()	0.3847	0.3479
3	Justext	0.7134	0.8339
4	Python-goose	0.7009	0.6822
5	Python-boilerpipe.ArticleExtractor	0.7530	0.8964
6	Python-boilerpipe.DefaultExtractor	0.6706	0.7073
7	Python-boilerpipe.CanolaExtractor	0.6227	0.6472
8	Python-boilerpipe.LargestContentExtractor	0.6188	0.6444

Table 2: Boilerplate removal results for archived news documents showing lower similarity compared to live web version (Table 1)

Index	Methods	Averages of Jaccard Indices for 10 documents	Median of Jaccard Indices for 10 documents
1	BeautifulSoup's get_text()	0.2630	0.2687
2	NLTK's (OLD) clean_html()	0.3365	0.3232
3	Justext	0.5956	0.6414
4	Python-goose	0.4209	0.4289
5	Python-boilerpipe.ArticleExtractor	0.6240	0.7121
6	Python-boilerpipe.DefaultExtractor	0.5534	0.7010
7	Python-boilerpipe.CanolaExtractor	0.5028	0.5274
8	Python-boilerpipe.LargestContentExtractor	0.4961	0.4669

Python-boilerpipe (ArticleExtractor mode) outperformed all the other methods. I acknowledge that this experiment is by no means rigorous for important reasons which include:

The test sample is very small.
Only news documents were considered.
The use of the Jaccard similarity measure forces documents to be represented as sets. This eliminates order (the permutation of words) and duplicate words. Consequently, if a boilerplate removal method omits some occurrences of a word, this information will be lost in the Jaccard similarity calculation.

Nevertheless, I believe this small experiment sheds some light about the different behaviors of the different boilerplate removal methods. For example, BeautifulSoup get_text() does not do a good job removing HTML, Javascript, CSS, and comments unlike NLTK's clean_html(), which does a good job removing these, but includes extraneous text. Also, Justext and Python-goose do not include a large body of extraneous text, even though they may omit a news article's title. Finally, based on these experiment results, Python-boilerpipe is best boilerplate removal method.

--Nwala

↧

2017-03-24: The Impact of URI Canonicalization on Memento Count

March 24, 2017, 11:30 am

≫ Next: 2017-04-17: CNI Spring 2017 Trip Report

≪ Previous: 2017-03-20: A survey of 5 boilerplate removal methods

Mat reports that relying solely on a Memento TimeMap to evaluate how well a URI is archived is not a sufficient method.

We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report.

Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible when dereferencing the URI-M (resolving the URI-M to an archived representation of a resource).

Variations in the "original URI" are canonicalized (coalescing https://google.com and http://www.google.com:80/, for instance) with the original URI (URI-R in Memento terminology) also included with a literal "original" relationship value.


<http://ws-dl.blogspot.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/>; rel="self"; type="application/link-format"; from="Wed, 29 Sep 2010 00:03:40 GMT"; until="Mon, 20 Mar 2017 19:09:10 GMT",
<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate",
<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT",
<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT",
<http://web.archive.org/web/20110902171049/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:10:49 GMT",
<http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:12:56 GMT",
...
<http://web.archive.org/web/20151205080546/http://www.ws-dl.blogspot.com/>; rel="memento"; datetime="Sat, 05 Dec 2015 08:05:46 GMT",
<http://web.archive.org/web/20161104143102/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 04 Nov 2016 14:31:02 GMT",
<http://web.archive.org/web/20161109005749/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 09 Nov 2016 00:57:49 GMT",
<http://web.archive.org/web/20170119233646/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Thu, 19 Jan 2017 23:36:46 GMT",
<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"

Figure 1. An abbreviated TimeMap for
http://ws-dl.blogspot.com from Internet Archive

For instance, to view the TimeMap for this very blog from Internet Archive, a user may request http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/ (Figure 1). Each URI-M (e.g., http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/) is listed with a corresponding relationship (rel) and datetime value. Note the www.ws-dl.blogspot.com and ws-dl.blogspot.com subdomain variants are both included in the same TimeMap, an product of the canonicalization procedure. The TimeMap for this URI-R currently contains 60 URI-Ms. Internet Archive's Web interface reports 58 captures -- a subtle yet differing "count". This difference get much more extreme with other URI-Rs.

The quality of each memento (e.g., in terms of completeness of capture of embedded resources) cannot be determined using the TimeMap alone. This fact is inherent in a URI-M needing to be dereferenced and each embedded resource requested upon rending the base URI-M. Comprehensively evaluating the quality over time is something we have already covered (see our TPDL2013, JCDL2014, and IJDL2015 papers/article).

In performing some studies and developing web archiving tools, we required knowing how many captures existed for a particular URI using both a Memento aggregator and the TimeMap from an archive's Memento endpoint. For http://google.com, counting the number of URIs in a TimeMap with a rel value of "memento" produces a count of 695,525 (as of May 2017). The number obtained from Internet Archive's calendar interface and CDX endpoint currently show much smaller count values (e.g., calendar interface currently states 62,339 captures for google.com).

Dereferencing these URI-Ms would take a very long time due to network latency in accessing the archive as well as limits on pipelining (though the latter can be mitigated with distributing the task). We did exactly this for google.com and found that the large majority of the URI-Ms produced a redirect to another URI-M in the TimeMap. This lead us to know that counting mementos in an archive's holdings is not sufficient with this procedure.

Figure 2. Dereferencing URI-Ms may produce a representation, a redirect, or an archived error.

For google.com we found that nearly 85% of the URI-Ms resulted in a redirect when dereferenced. We repeated this procedure for seven other TimeMaps for large web sites (e.g., yahoo.com, instagram.com, wikipedia.org) and found a wide array of trends in this naïve counting method (88.2%, 67.3%, and 44.6% are redirects, respectively). We also repeated this procedure with thirteen academic institutions' URI-Rs to observe if this trend persisted.

We have posted an extensive report of our findings as a tech report available on arXiv (linked below).

— Mat (@machawk1)

Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel. "Impact of URI Canonicalization on Memento Count," Technical Report arXiv:1703.03302, 2017.

↧

2017-04-17: CNI Spring 2017 Trip Report

April 16, 2017, 11:49 pm

≫ Next: 2017-04-17: Personal Digital Archiving 2017

≪ Previous: 2017-03-24: The Impact of URI Canonicalization on Memento Count

The Coalition for Networked Information (CNI) Spring 2017 Membership Meeting was held April 3-4, 2017 in Albuquerque, NM. As before, the presentations were of very high quality but the eight-way (!) split of presentations means that you're going to miss some good presentations. The full schedule is available, but this trip report will focus on the sessions that I was able to attend. Fortunately, the attendees did well-covered in Twitter (#cni17s), and the tweets are collected by both CNI (Day 1, Day 2) and Michael Collins (Day 1, Day 2). The presentation slides are being collected at OSF.

The first day began with a plenary by Alison J. Head, representing Project Information Literacy (PIL). Alison's talk was entitled "What today's university students have taught us", and these slides from not quite a year ago were similar to what she presented at CNI. Alison has done extensive research about how undergraduates use Wikipedia, the Web in general, and life-long learning after graduation, as well as the relationship with university libraries. A full list of publications is available on their site, but she provided five take aways at the end of her presentation: 1) students say research is more difficult than ever before (as compared to high-school), 2) students have the most difficulty with getting started on their assignments, 3) contextualizing research is difficult and frustrating for students, 4) students use a search strategy driven by familiarity and efficiency (mainly using the tools shown below), and 5) evaluating research resources (e.g., for quality, timeliness) is the primary skill students carry with them after graduation.

This is where students go most often. Interesting how many are free resources. None are for-profits (well, maybe course readings). #cni17s pic.twitter.com/xHJGgA3gq2
— Roger C. Schonfeld (@rschon) April 3, 2017

Two additional points I found relevant to my own experiences with undergraduates were 1) employers make hiring decisions based on students' technical knowledge, but are then surprised when they did not ask neighbors/colleagues when they got stuck (and instead they use Google to find answers instead), and 2) students liked that instructional videos which illustrated common failures/traps/gotchas, whereas in professors' class notes everything works fine -- not unlike TV home or car repair shows! (edit: the video of Alison's keynote is now available)

Great "direct from the swamp" public policy update from @ARLpolicy& ALA public policy #cni17s pic.twitter.com/vQfN8FiEJw
— Dave Hansen (@DigLibCopyright) April 3, 2017

The first session I attended was "Direct from the Swamp: Developments of the 45th President and 115th Congress", by Krista L. Cox (ARL) and Alan S. Inouye (ALA). Krista and Alan gave summaries and commentary of the situation in DC, starting with the "who wins and who loses" in the so-called "skinny budget". The federal hiring freeze (now over) has had the unintended side-effect of slowing the rate in which the new policies could put into place. They also discussed ALA collecting "#SaveIMLS" tweets:

Your calls are paying off! So far we have 108 signers for LSTA and 136 for IAL. There's still time to act https://t.co/lF5OI2QpZ0 #SaveIMLS pic.twitter.com/abLKrUV2ts
— Amer. Library Assn. (@ALALibrary) April 3, 2017

They also discussed the current bill to make the Register of Copyrights a presidential appointment instead of the Librarian of Congress. You can imagine how "popular" that was with the audience, right up there with not being able to read Georgia State Law without paying a company (see also: Carl Malamud's Public.Resource.Org), and the FCC chairman who "wants to take a 'weed whacker' to net neutrality." Personally, I was disappointed to learn about David Gelernter, since Linda was a big influence on some of my early system designs. Krista and Alan discussed many other issues but they did not have slides and I wasn't able to take a complete set of notes.

The next session was Herbert, Martin, and me presenting "To the Rescue of the Orphans of Scholarly Communication". The slides we presented are below, as well as a video Mohamed created to help illustrate some of the concepts, and some "action" shots. David Rosenthal has written a really strong summary of our session and I encourage you to read that.

To the Rescue of the Orphans of Scholarly Communication from Martin Klein

Interesting study and findings on rescuing scholarly orphans by @hvdsomp @mart1nkle1n @phonedude_mln #cni17s pic.twitter.com/YFdejE0KZE
— Yasmina Anwar (@yasmina_anwar) April 3, 2017

The last session on Monday was by Jeff Spies, entitled "Data Integrity for Librarians, Archivists, and Criminals: What We Can Steal from Bitcoin, BitTorrent, and Usenet". The talk was pretty true to the title, and Jeff gave a high-level review of blockchain, erasure codes, NNTP, BitTorrent, and other related technologies relevant to archiving. The talk reminded me of Frank McCown's 2008 JCDL paper about encoding server-side components in HTML comments and using erasure codes from archived web pages to reconstruct an eprints server. And before anyone gets too enthusiastic about blockchain, I suggest you read some of David Rosenthal's blog posts on blockchain.

Day 2 began with Geoffrey Bilder presenting "Open Persistent Identifier Infrastructures: The Key to Scaling Mandate Auditing and Assessment Exercises". Geoffrey argued for the need for identifiers, for publications (e.g., DOIs), people (e.g., ORCIDs), and the newly proposed Organizational Identifier Project (blogs from CrossRef, DataCite, and ORCID). The need for identifiers was not controversial, but there was lively discussion about the various forces amplifying the need for identifiers, such as the increasing volume of publications and the number of people who start an academic career but reroute along the way (and whether is acceptable, even desirable, or a real problem). Regarding the potential for identifiers to accelerate a metrics-based approach to science, he also quoted from an article by Cliff Lynch who said "I am deeply concerned about the potential quantification of scholarly impact" -- like all of Cliff's work, the full article is worth your time. David Rosenthal wrote a great review of this session as well.

The next session I attended was "Building Distinctive Collections through International Collaborations: Lessons from UCLA's International Digital Ephemera Project" by T-Kay Sangwand and Todd Grapone. This was the second or third time I've seen the International Digital Ephemera Project presented at they've got a great collection of material (e.g., the Green Movement in Iran). T-Kay and Todd showed some videos in their presentation but I can't find them online. This project is a bit outside of the typical web archiving work that we do, but their IDEP Partners Toolkit is worth checking out.

The next session was by Cliff Lynch, "Institutional Repository Strategies: What We Learned at the Executive Roundtables", where he summarized the two IR roundtable sessions from Sunday. This session was standing room only and followed with great interest. There was an audio recording of this session that I'll link to when it's available, as well as a written summary that will be available in about 2 months. I hesitate to even attempt to summarize Cliff's summary, but I did manage to write down a few points. First, universities are struggling with scope of their IR and how to disentangle set of demands for digital collection management platforms, such as: newspapers, photographs, special collections, OJS, university presses, etc. This approach is different from the other model of taking contributions from the university community at large, in a variety of formats and granularities. One quote that Cliff relayed was (more or less) "we have 5-6 platforms that have aspects of IRs... it is hard to explain what is to be found in one vs. the other".

There was also a brief detour in the realm of discipline-specific repositories vs. IRs, as well as any requirements that arise from specialized formats. This made me think of Richard Poynder's recent interview with Cliff and the various responses to it.

Cliff also addressed another tension with IRs: do they collect material created by faculty, with an emphasis on what is at risk of being lost, or do they capture a record of the institution's output (with a further emphasis on journal literature)? The latter does not work out well because of access mandates from publishers. What is the incentive for the library to make an investment to implement open access policy (esp. if it comes from the faculty)? Cliff's observation was institutions were more willing to chase material down 5 years ago, but now they recognize the significant cost associated with such an approach.

Cliff finished with four "nuanced points": 1) IRs have been around long enough for migration issues to arise (i.e., people are already having to migrate between IRs), 2) how development is being handled on open source platforms: are develop strategies driven by needs of institutions or by the developers themselves?, 3) have we been too insular? The library may be the tip of the spear, but this is no longer a library problem, it is a university problem, and we should we be looking at Blackboard, distance learning systems, and DAMs, 4) what is the IRs' position relative OERs? Some systems, like ETDs, generated quick wins (and less so with journals) but OERs would have immediate impact on students.

The next session was "Social Networks and Archival Context: In Transition from Project to Program", by Daniel V. Pitti and Jerry Simmons. I did not take good notes during this session, but there is an extensive video available as well as the project web site for more information.

The closing session was "Fresh Perspectives on the Future of University-Based Publishing" by Amy Brand of MIT Press. She began with a quote from Paul Courant that while university presses provide a "warm glow", they are not "essential elements for excellent universities".

@amy_brand Cliff Lynch thanks Amy for great presentation. I agree, very informative! #cni17s pic.twitter.com/GsiesTncAp
— Robert Cartolano (@rob_cartolano) April 4, 2017

Amy gave an overview of all the things they're doing at MIT Press to "future-proof" the university press:

@amy_brand Strategic roadmap #cni17s pic.twitter.com/Q89bAS3BR2
— Robert Cartolano (@rob_cartolano) April 4, 2017

She discussed a wide range of things and it was difficult to keep up; an incomplete list included: scanning their back catalog with the Internet Archive / Open Library, a partnership with the NYPL, implementing AltMetrics, setting up a "futures lab", investigating hypothes.is, assigning DOIs to individual book chapters for greater citation granularity, uniformly providing both soft- and hard-copies for books with a single purchase, using watermarking instead of DRM wherever possible, and inhousing technology development as much as possible:

Brand: Why MIT Press would build its own platforms/subscriptions? #cni17s pic.twitter.com/siJfRoGcYE
— Mary Ellen Davis (@med744) April 4, 2017

There was a lot more to Amy's excellent presentation, but you should probably wait for the video. Not all presentations were recorded but many were and I'll update this post with links to videos as CNI releases them (edit: the video of Amy's keynote is now available).

Again, another great CNI membership meeting and thanks to all at CNI for putting it together. See you in DC in December for the winter meeting -- hopefully this time with ODU as a full CNI member!

--Michael

PS -- David Rosenthal blogged about two other sessions that I did not attend -- you should check them out.

PPS -- More "action" shots!

Institutional & Web Archiving perspectives to rescue scholarly orphans! #cni17s pic.twitter.com/Xt59pkNUu8
— Yasmina Anwar (@yasmina_anwar) April 3, 2017

ORCID @ORCID_Org is dominated by life sciences! #cni17s @mart1nkle1n pic.twitter.com/sJhYxJ80kO
— Yasmina Anwar (@yasmina_anwar) April 3, 2017

.@phonedude_mln on legal challenges re capturing scholarly artifacts - it's a mess! w/ @hvdsomp and myself #cni17s pic.twitter.com/YEg3xnMcDJ
— Martin Klein (@mart1nkle1n) April 3, 2017

↧

2017-04-17: Personal Digital Archiving 2017

April 17, 2017, 12:32 pm

≫ Next: 2017-04-18: Local Memory Project - going global

≪ Previous: 2017-04-17: CNI Spring 2017 Trip Report

On March 29-30, 2017 I attended Personal Digital Archiving Conference 2017 (#pda2017) held at Stanford University in sunny Palo Alto, California. Other members of the Web Science and Digital Libraries Research Group (WS-DL) had previously attended this conference (see their 2013, 2012, and 2011 trip reports) and from their rave reviews of previous year's conferences, I was looking forward to it. I also just happened to be presenting and demoing the Web Archiving Integration Layer (WAIL) there as an added bonus.

Day 1

Day one started off at 9am with Gary Wolf giving the first keynote on Quantified Self Archives. Quantified Self Archives are comprised of data generated from health monitoring tools such as the FitBit or life blogging data which is used to gain in sites into your own life through data visualization.

Gary Wolf kicking off #pda2017. pic.twitter.com/X3pUOEddNs
— John Berlin (@johnaberlin) March 29, 2017

@agaricus Gary Wolf talking abt the quantified self. groups meet to discuss #lifelogging& self tracking https://t.co/MnNPDJcYiA… #pda2017
— Melody Condron (@MTbeekeeper) March 29, 2017

After the keynote was the first session Research Horizons moderated by WS-DL alumni, Yasmina Anwar.

@yasmina_anwar moderating Research Horizons #pda2017 @WebSciDL pic.twitter.com/OJfBrsfx2S
— John Berlin (@johnaberlin) March 29, 2017

The first talk of this session was Whose Life Is It, Anyway? Photos, Algorithms, and Memory (Nancy Van House, UC Berkeley). In the talk, Van House spoke on the effects of "faceless" algorithms on images and how they can distort the memory of the images they are applied to in many personal archives. Van House also spoke about how machine learning techniques when done in aggregate on images without context can have unintended consequences, especially when attempting to detect emotion. To demonstrate this, Van House showed a set of images tagged with the emotion of Joy one of which was a picture of an avatar from the online life simulator Second Life.

Nacy Van House speaking on algorithms their use on images #pda2017 pic.twitter.com/QNRPtljJDD
— John Berlin (@johnaberlin) March 29, 2017

Van House presentation on search algorithms is fascinating: showing the flaws in context, tagging, computer matching & recognition #pda2017
— Melody Condron (@MTbeekeeper) March 29, 2017

Fb manipulates what we see, algorithms -how do they work, who developed them, controlled vocab. Can be a serious problem #pda2017
— Chaitra Powell (@ChaitraPeezy) March 29, 2017

The second talk was Digital Workflow and Archiving in the Humanities and Social Sciences (Smiljana Antonijevic Ubois, Penn State University). Ubois spoke on the many ways scholars use non-traditional archives such as Dropbox or photos taken by their smartphones to preserve their work. One of the biggest points brought up in the talk by Ubois was that humanities and social sciences scholars still see the web as a resource rather than home to a digital archive.

Smiljana Antonijevic Ubois: Digital Workflow and Archiving in the Humanities and Social Sciences #pda2017 pic.twitter.com/hdgmL5XWrk
— John Berlin (@johnaberlin) March 29, 2017

#pda2017 @Smiljana_A scholars opt for documenting research materials/ archival documents through smart phone photos.
— robin margolis (@poeticdoxa) March 29, 2017

The third talk was Mementos Mori: Saving the Legacy of Older Performers (Joan Jeffri, Research Center for Arts & Culture/The Actors Fund). In the talk, Jeffri spoke on the efforts being made to document and preserve the works of artists by the performing arts legacy project. The project found that one in five living artists in New York had no documentation of their work especially the older artists.

Joan Jeffri Mementos Mori: Saving the Legacy of Older Performers #pda2017 pic.twitter.com/PBABb9Wt5n
— John Berlin (@johnaberlin) March 29, 2017

just heard joan jeffri speak on https://t.co/MakGTuAXuD at #pda2017 @ecolleary
— Abby Adams (@digarchivist) March 29, 2017

The final talk in the session was Exploring Personal Financial Information Management Among Young Adults (Robert Douglas Ferguson, McGill School of Information Studies). Douglas spoke on the passive preservation i.e usage of web portal and tools provided by financial services, done by young adults when it comes to managing their money and the need to consider long-term preservation of these materials.

Robert Douglas: Exploring Personal Financial Information Management Among Young Adult #pda2017 pic.twitter.com/RsFy6Vze2V
— John Berlin (@johnaberlin) March 29, 2017

#pda2017 checks and paper-based financial records can have sentimental and social values, especially to older adults - marking time, events
— Lotus Norton-Wisla (@lnortonwisla) March 29, 2017

Session two was Preserving & Serving PDA at Memory Institutions moderated by Glynn Edwards.

Glynn Edwards moderating: Preserving & Serving PDA at Memory Institutions #pda2017 pic.twitter.com/px1ikiqnfL
— John Berlin (@johnaberlin) March 29, 2017

This session started off with Second-Generation Digital Archives: What We Learned from the Salman Rushdie Project (Dorothy Waugh and Elizabeth Russey Roke, Emory University). In 2010, Emory University announced the launch of the Salman Rushdie Digital Archives. This reading room kiosk offered researchers at the Manuscript, Archives, and Rare Book Library the opportunity to explore born-digital material from one of four of Rushdie’s personal computers through dual access systems. One of the biggest lessons learned noted by Waugh was the need to document everything the software engineers do as their work is just as ephemeral as the born digital information they wished to preserve.

Dorothy Waugh & Elizabeth Russey Roke,
Second-Generation Digital Archives: What We Learned from the Salman Rushdie Project #pda2017 pic.twitter.com/d2mfv6OQUN
— John Berlin (@johnaberlin) March 29, 2017

#pda2017 Emory opted to provide two tiers of access in reading room for Rushdie CPUs. 1) PDFs of content 2) emulation of CPU environment
— robin margolis (@poeticdoxa) March 29, 2017

After Waugh was Composing an Archive: the personal digital archives of contemporary composers in New Zealand (Jessica Moran, National Library of New Zealand). In recent years the Library has acquired the digital archives of a number of prominent contemporary composers. Moran discussed the personal digital archiving practices of the composer, the composition of the archive, and the work of the digital archivists, in collaboration with curators, arrangement and description librarians, and audio-visual conservators, to collect, describe, and preserve this collection.

Jessica Moran Composing an Archive: the personal digital archives of contemporary composers in New Zealand #pda2017 pic.twitter.com/kbMnaWztFY
— John Berlin (@johnaberlin) March 29, 2017

#pda2017 @jessicammoran shares on composer personal archive featuring musician stand-bys of floppy disks and drafts of composition on Logic!
— robin margolis (@poeticdoxa) March 29, 2017

The final talk in session two was Learning from users of personal digital archives at the British Library (Rachel Foss, The British Library). Foss discussed the efforts made by the British Library to provide access to their digital collections that require emulation to viewed. Foss disscused that arhiving professionals also need to consider how we assist and educate our researchers to make use of born-digital collections implying understanding more about how they want to interrogate these collections as a resource.

Rachel Foss Learning from users of personal digital archives at the British Library #pda2017 pic.twitter.com/JgvyFBQKgk
— John Berlin (@johnaberlin) March 29, 2017

#pda2017 Rachel Foss shares choice to allow user access to unreadable files, offering teaching moments around digital preservation
— robin margolis (@poeticdoxa) March 29, 2017

Lunch happened. Session 3 Teaching PDA moderated by Charles Ransom.

Charles Ransom moderating Teaching PDA #pda2017 pic.twitter.com/jcbQCwsOSO
— John Berlin (@johnaberlin) March 29, 2017

Journalism Archive Management (JAM): Preparing journalism students to manage their personal digital assets and diffuse JAM best practices into the media industry (Dorothy Carner & Edward McCain, University of Missouri). In collaboration with MU Libraries and the school’s Donald W. Reynolds Journalism Institute, a personal digital archive learning model was developed and deployed in order to prepare journalism-school students, faculty and staff for their ongoing information storage and access needs. The MU J-School has created a set of PDA best practices for journalists and branded it: Journalism Archive Management (JAM).

Model for Journal Archive Management #JAM #pda2017 #Journals pic.twitter.com/4acegUqGqc
— Yasmina Anwar (@yasmina_anwar) March 29, 2017

Journalism Archive Management (JAM) program @Mizzou teaches communications students to create, label, store, find +reuse digi files #pda2017
— Robin M. Katz (@robinmkatz) March 29, 2017

An archivist in the lab with a codebook: Using archival theory and “classic” detective skills to encourage reuse of personal data (Carly Dearborn, Purdue University Libraries). Dearborn designed a workshop inspired by the Society of Georgia Archivists’ personal digital archiving activities to introduced attendees to archival concepts and techniques which can be applied to familiarize researchers with new data structures.

Carly Dearborn: An archivist in the lab with a codebook #pda2017 pic.twitter.com/dfHFauB8y4
— John Berlin (@johnaberlin) March 29, 2017

W her campus population, @carlydearborn found data producers just wanted step-by-step instructions, not to think critically abt it #pda2017
— Robin M. Katz (@robinmkatz) March 29, 2017

Session 4: Emergent Technologies & PDA 1 moderated by Nicholas Taylor

Nicholas Taylor moderating Emergent Technologies & PDA 1#pda2017 pic.twitter.com/BjsIcP1je9
— John Berlin (@johnaberlin) March 29, 2017

Cogifo Ergo Sum: GifCities & Personal Archives on the Web (Maria Praetzellis & Jefferson Bailey, Internet Archive). In the talk Praetzellis and Bailey spoke on the gif archive GifCities created for the Internet Archives 20th anniversary which included a search interface. The GeoCities Animated GIF Search Engine, comprising over 4.6 million animated GIFs from the GeoCities web archive. Each GIF links back to the archived GeoCities web page upon which it was originally embedded. The search engine offers a novel, flabbergasting window into what is likely one of the largest aggregations of publicly-accessible archival personal documentary collections. It also provokes a reassessment of how we conceptualize personal archives as being both from the web (as historical encapsulations) and of the web (as networked recontextualization).

Maria Praetzellis & Jefferson Bailey
Cogifo Ergo Sum: GifCities & Personal Archives on the Web #pda2017 pic.twitter.com/Sufvb56EHE
— John Berlin (@johnaberlin) March 29, 2017

@jefferson_bail& maria praetzellis present on geocities and personal archives on the web #pda2017 pic.twitter.com/ZswptEB5nV
— ePADD (@e_padd) March 29, 2017

Comparison of Aggregate Tools for Archiving Social Media (Melody Condron). In the talk Condron spoke about many tools which could make archiving social media easier: Frostbox, If This Then That and digi.me. Of all the tools mentioned If This Then That provided the easiest way for its users to push social media into archives such Internet Archive or Webrecorder.

Melody Condron Comparison of Aggregate Tools for Archiving Social Media #pda2017 pic.twitter.com/1mBxFtdfrS
— John Berlin (@johnaberlin) March 29, 2017

"You can make Twitter to turn into yr coffeemaker", @MTbeekeeper! Gr8 tools for aggregating social media #pda2017 pic.twitter.com/i2zGcprOlG
— Yasmina Anwar (@yasmina_anwar) March 29, 2017

Video games collectors and archivists: how might private archives influence archival practices (Adam Lefloic Lebel, University of Montreal)

Adam Lefloic Lebel, Video games collectors and archivists: how might private archives influence archival practices #pda2017 pic.twitter.com/5BXVGz0sNm
— John Berlin (@johnaberlin) March 29, 2017

Collectors help with preserving games - Research from the University of Montreal @melswal #pda2017 pic.twitter.com/iVO6MCxRBy
— Sarah Slade (@sladiladi) March 29, 2017

Demonstrations:
There were two different demonstration sessions the first was between session 4&5 and the second was at the end after session 6.

.@johnaberlin from @WebSciDL @oducs is demoing #WAIL @StanfordLibs #pda2017 pic.twitter.com/htNqLae4Ib
— Yasmina Anwar (@yasmina_anwar) March 29, 2017

The demo for the Web Archiving Integration Layer (WAIL) consisted of two videos and myself talking to those who stopped by about the particular use cases of WAIL or answering any questoins they had about WAIL. The first is viewable below which is detailed feature walkthrough of WAIL and the second was showing off WAIL in action.

Session 5: Emergent Technologies & PDA 2 moderated by Henry Lowood

CiteTool: Leveraging Software Collections for Historical Research (Eric Kaltman, UC Santa Cruz) Kaltman spoke about how the tool is currently being used in a historical exploration of the computer game DOOM as a way to compare conditions across versions and to save key locations for future historical work. Since the tool provides links to saved locations, it is also possible to share states amongst researchers in collaborative environments. The links also function as an executable citation in cases where an argument about a program’s functionality is under discussion and would benefit from first-hand execution.

Applying technology of Scientific Open Data to Personal Closed Data (Jean-Yves Le Meur, CERN) Le Meur explained the methodology and technologies developed (partly at CERN) to preserve scientific data (like High Energy Physics) could be re-used for Personal restricted data. Existing initiatives to collect and preserve for very long term the personal data from individuals will first be reviewed, as well as a few examples of well established collective memory portals. Solutions implemented for Open data in HEP will then be compared, looking at the guiding principles and underlying technologies. Finally, a proposal to foster a solid shared platform for closed Personal Data Archive will be drafted on the model of Open Scientific Data Archives.

Personal Data and the Personal Archive (Chelsea Gunn, University of Pittsburgh) Gunn questioned if quantified self and lifelogging application are forms of personal data as a part of our personal archives, or do they constitute a form of ephemera, useful for the purposes of tracking progress toward a goal, but not of long-term interest?

Using Markdown for PDA Interoperability (Jay Datema, Stony Brook University). The only thing you can count on with born-digital projects is that you will have to migrate the content at some point. But having done digital library development for over a decade, I'd like to talk about simple text, and a problem that has a proven solution. Markdown is an intermediate step between text and HTML. If you're writing anything that requires an HTML link, its shortcuts are worth learning. Most web applications rely on the humble submit button. Once text goes in, it becomes part of a database backend. To extract it, it may require a set of database calls, or parsing a SQL file, or hoping that someone wrote a module to let you download what you entered.

Session 6 PDA The Arts moderated by Kate Tasker

From Virtual to Reality: Dissecting Jennifer Steinkamp’s Software-Based Installation (Shu-Wen Lin, New York University) Lin spoke about time-based and digital art combines media and technology that challenges traditional conservation practices while requiring dedicated care from working with Steinkamp’s animated installation Botanic that was exhibited in Times Square Arts: Midnight Moment. Lin's talk focused on the internal structure and relationship between the software used which was Maya, After Effects, scripts, and final deliverables. Lin also spoke about provide a risk assessment that will enable museum professionals as well as the artist herself to identify sustainability and compatibility of digital elements in order to build a documentation that can collect and preserve the whole spectrum of digital objects related to the piece.

The PDAs of Others: Completeness, Confidentiality, and Creepiness in the Archives of Living Subjects (Glen Worthey, Stanford University) The title and inspiration for Worthey's presentation came from the 2006 German film Das Leben der Anderen, which dramatized the covert monitoring of East Germans. Although the biography was "authorized", Worthy spoke on how the process of gathering and documenting materials often reveals tensions between completeness and a respect for privacy; between on-the-record and off-the-record conversations; between the personal and the professional; between the probing of important questions and voyeuristic-seeming observation of the subject's complex inner life.

RuschaView 2.0 (Stace Maple, Stanford University) In 1964, LA Painter, Ed Ruscha put a Nikon Camera in the back of his truck, drove up and down Sunset Strip and shot what would become a continuous panorama of "Every Building on the Sunset Strip" (1966). Maples talk highlighted both Ruscha's multi-decade project, as well as Maple's multi-month attempt to create the metadata required to reproduce something like Ruscha's "Every Building..." publication, in a digital context.

(Pete Schreiner, NCSU) Between 2003-2013 an associated group of independent rocks bands from Bloomington, Indiana shared a tour van. When the owner, a librarian, was preparing to move across the country in 2014, Pete Schreiner, band member and proto-librarian decided to preserve this esoteric collection of local music-related history. Subsequently, as time allowed, he created an online collection of the photographs using Omeka. This case study presents a guerrilla archiving project, issues encountered throughout the process, and attempts to find the balance between professional archiving principles and getting it done.

Day 2

Due to request of a presenter(s) who did not want their slides material recorded/show too others beyound the attendies no photos were taken

Session 7 Documenting Cultures Communities moderated by Michael Olson

(Anna Trammell, University of Illinois) Trammell's talk discussed the experience gained from forming relationships and building trust with the student organizations at the University of Illinois, capturing and processing their digital content, and utilizing these records in instruction and outreach.

(Jennifer Douglas, University of British Columbia) Online grieving and intimate archives: a cyberethnographic approach (Jennifer Douglas, University of British Columbia) Douglas presented a short paper discussing the archiving practices of the community of parents grieving stillborn children. In that paper, Douglas demonstrated how these communities functioned as aspirational archives, not only preserving the past, but creating a space in the world for their deceased children. Regarding the ethics of online research and archiving, Douglas' paper introduced the methodology of cyberethnography and explored its potential connections to the work of digital archivists.

(Barbara Jenkins, University of Oregon) In the talk Jenkins spoke on the development of an Afghanistan personal archives project which was created in 2012 and was able to expand its scope through a short sabbatical supported by the University of Oregon in 2016. The Afghanistan collection Jenkins was able to build combines over 4,000 slides, prints, negatives, letters, maps, oral histories, and primary documents.

Session 8 Narratives Biases Pda Social Justice moderated by Kim Christen

Andrea Pritchett, co-founder of Berkeley Copwatch, Robin Margolis, UCLA MLIS in Media Archives, and Ina Kelleher presented a proposed design for a digital archive aggregating different sources of documentation toward the goal of tracking individual officers. Copwatch chapters operate from a framework of citizen documentation of the police as a practice of community-driven accountability and de-escalation.

Stacy Wood, PhD candidate in Information Studies at UCLA, discussed the ways in which personal records and citizen documentation are embedded within techno-socio-political infrastructural arrangements and how society can reframe these technologies as mechanisms and narratives of resistance.

Session 9 PDA And Memory moderated by Wendy Hagenmaier

Interconnectedness: personal memory-making on YouTube (Leisa Gibbons, Kent State University) Gibbons spoke about the use of YouTube as a personal memory-making space and research questions concerning what conceptual, practical and ethical role institutions of memory have in online participatory spaces and how personal use of online technologies can be preserved as evidence.

(Sudheendra Hangal& Abhilasha Kumar, Ashoka University) This talk was about Cognitive Experiments with Life-Logs (CELL) and how it is a scalable new approach to measure recall of personally familiar names using computerized text-based analysis of email archives. Regression analyses revealed that accuracy in familiar name recall declined with the age of the email, but increased with greater frequency of interaction with the person. Based on those findings, Hangal and Kumar believe that CELL can be applied as an ecologically valid web-based measure to study name retrieval using existing digital life-logs among large populations.

(Frances Corry, University of Southern California) Corry spoke about the built-in feature on most smartphones, tablets, and computers today, and how these tool enables users to “photograph” what rests on the surface of their screens. These “photographs” rather screenshots were presented as a valuable tool worthy of further attention in digital archival contexts.

Session 10 Engaging Communities In PDA 1 moderated by Martin Gengenbach

Introducing a Mobile App for Uploading Family Treasures to Public Library Collections (Natalie Milbrodt, Queens Public Library) The Queens Public Library in New York City has developed a free mobile application for uploading scanned items, digital photos, oral history interviews and “wild sound” recordings of Queens neighborhoods for permanent safekeeping in the library’s archival collections. It allows families to add their personal histories to the larger historical narrative of their city and their country. The tool is part of the programmatic and technological offerings of the library’s Queens Memory program, whose mission is to capture contemporary history in Queens.

(Russell Martin, District of Columbia Public Library) The Memory Lab (Russell Martin, District of Columbia Public Library) The Memory Lab at District of Columbia Public Library is a do-it-yourself personal archiving space where members of the public can digitize outdated forms of media, such as VHS, VHS-C, mini DVs, audio cassettes, photos, slides, negatives and floppy disks. Martin's presentation consists of how the Memory Lab was developed by a fellow from the Library of Congress' National Digital Stewardship Residency, budget for the lab, equipment used and how it is put together, training for staff and the public, as well as success stories and lessons learned.

(Wendy Hagenmaier, Georgia Tech) Hagenmaier's presentation outlined the user research process of the retroTECH team to inform the design of the carts, offer an overview of the carts’ features and use cases, and reflected on where retroTECH’s personal digital archiving services are headed. retroTECH aims to inspire a cultural mindset that emphasizes the importance of personal archives, open access to digital heritage, and long-term thinking.

The Great Migration (Jasmyn Castro, Smithsonian NMAAHC) Castro presented the ongoing film preservation efforts at the Smithsonian for the African American community and how the museum invite visitors to bring their home movies into the museum and have them inspected and digitally scanned by NMAAHC staff.

Session 11 Engaging Communities In Pda 2 moderated by Mary Kidd

Citizen archive and extended MyData principles (Mikko Lampi, Mikkeli University of Applied Sciences) Lampi spoke about how Digitalia – Research Center on Digital Information Management – is developing a professional-quality digital archiving solution available for common people. The Citizen archive relies on an open-source platform allowing users to manage their personal data and ensure access to it on a long-term basis. MyData paradigm is connected with personal archiving by managing coherent descriptive metadata and access rights, while also ensuring privacy and usefulness.

Born Digital 2016: Collecting for the Future (Sarah Slade, State Library Victoria) Slade presented Born Digital 2016: collecting for the future a week-long national media and communications campaign to raise public awareness of digital archiving and preservation and why it matters to individuals, communities and organizations. The campaign successfully engaged traditional television and print media, and online news outlets, to increase public awareness of what digital archiving and preservation is and why it is important.

Whose History? (Katrina Vandeven, MLIS Candidate, University of Denver) Vandeven discussed the macro appraisal and documenting intersectionality within the Women's March on Washington Archives Project, where it went wrong, possible solutions to documenting intersectionality in activism, and introduced the Documenting Denver Activism Archives Project.

Bring Personal Digital Archiving 2017 to a close was Session 12 PDA Retrospect And Prospect Panel moderated by Cathy Marshall

Howard Besser, Clifford Lynch and Jeff Ubois discussed how early observers and practitioners of personal digital archiving will look back on the last decade, and forward to the next, covering changing social norms about what is saved, why, who can view it, and how; legal structures, intellectual property rights, and digital executorships; institutional practices, particularly in library and academic settings, but also in the form of new services to the public; market offerings from both established and emerging companies; and technological developments that will allow (or limit) the practice of personal archiving.

- John

↧

2017-04-18: Local Memory Project - going global

April 17, 2017, 10:59 pm

≫ Next: 2017-04-20: Trusted Timestamping of Mementos

≪ Previous: 2017-04-17: Personal Digital Archiving 2017

Screenshots of world local newspapers from the Local Memory Project's local news repository. Top: newspapers from Iraq, Nigeria, and France. Bottom: Chile, US (Alaska), and Australia.

Soon after the introduction of the Local Memory Project (LMP) and the local news repository of:

5,992 US Newspapers
1,061 US TV stations, and
2,539 US Radio stations

I considered extending the local news collection beyond US local media to include newspapers from around the world.

Finding and generating the world local newspaper dataset

After a sustained search, I narrowed my list of potential sources of world local news media to the following in order of my perceived usefulness:

From this list, I chose Paperboy as my world local news source because it was fairly structured (makes web scraping easier), and contained the cities in which the various newspaper organizations are located. Following scraping and data cleanup, I extracted local newspaper information for:

6,638 Newspapers from
3,151 Cities in
183 Countries

The dataset is publicly available.

Integrating the world local newspaper dataset into LMP

For a seamless transition from US to a world-centric Local Memory Project, it was pertinent to ensure the world local media was represented with exactly the same data schema as the US local media. This guarantees that the architecture of LMP remains the same. For example, the following response excerpt represents a single US college newspaper (Harvard Crimson).

{
"city": "Cambridge", 
"city-latitude": 42.379146, 
"city-longitude": -71.12803, 
"collection": [
   {
"city-county-lat": 42.377, 
"city-county-long": -71.1167, 
"city-county-name": "Harvard", 
"country": "USA", 
"facebook": "http://www.facebook.com/TheHarvardCrimson", 
"media-class": "newspaper", 
"media-subclass": "college", 
"miles": 0.6, 
"name": "Harvard Crimson", 
"open-search": [], 
"rss": [], 
"state": "MA", 
"twitter": "http://www.twitter.com/thecrimson", 
"video": "https://www.youtube.com/user/TheHarvardCrimson/videos", 
"website": "http://www.thecrimson.com/"
   }
  ], 
"country": "USA", 
"self": "http://www.localmemory.org/api/countries/USA/02138/10/?off=tv%20radio%20", 
"state": "MA", 
"timestamp": "2017-04-17T18:56:10Z"
 }

Similarly, world local media use this same schema for seamless integration into the existing LMP framework. However, different countries have different administrative subdivisions. From an implementation standpoint, it would have been ideal if all countries had the US-style administrative subdivision of: Country - State - City, but this is not the case. Also, currently, LMP's Geo and LMP's Local Stories Collection Generator are accessed using a zip code. Consequently, the addition of world local news media meant finding the various databases which mapped zip codes to their respective geographical locations. To overcome the obstacles of multiple administrative subdivisions, and the difficulty of finding comprehensive databases that mapped zip codes to geographical locations, while maintaining the pre-existing LMP data schema, I created a new access method for Non-US local media. Specifically, US local news media are accessed with a zip code (which maps to a City in a State), while Non-US local news media are accessed with the name of the City. For example, here is a list of 100 local newspapers that serve Toronto, Canada: http://www.localmemory.org/geo/#Canada/Toronto/100/

The addition of 6,638 Non-US newspapers from 183 countries makes it possible not only to see local news media from different countries, but also to build collections of stories about events from the perspectives of local media around the world.

--Nwala

↧

2017-04-20: Trusted Timestamping of Mementos

April 20, 2017, 10:14 am

≫ Next: 2017-04-23: Remembering Professor Hussein Abdel-Wahab

≪ Previous: 2017-04-18: Local Memory Project - going global

The Memento Protocol provides a Memento-Datetime header indicating at what datetime a memento was captured by a given web archive. In most cases, this metadata sufficiently informs the user of when the given web resource existed. Even though it has been established in US courts that web archives can be used to legally establish that a given web resource existed at a given time, there is still potential to doubt this timestamp because the same web archive that provides the memento also provides its Memento-Datetime. Though not a replacement for Memento-Datetime, trusted timestamping is the process that provides certainty of timestamps for content and can be used to provide additional data to alleviate this doubt.

In this post, I examine different trusted timestamping methods. I start with some of the more traditional methods before discussing OriginStamp, a solution by Gipp, Meuschke, and Gernandt that uses the Bitcoin blockchain for timestamping.

Brief Cryptography Background

Trusted timestamping systems use some concepts from cryptography for confidentiality and integrity. I will provide a brief overview of these concepts here.

Throughout this document I will use the verb hashing to refer to the use of a one-way collision-resistant hash function. Users supply content, such as a document, as input and the hash function provides a digest as output. Hash functions are one-way, meaning that no one can take that digest and reconstitute the document. Hash functions are collision-resistant, meaning that there is a very low probability that another input will produce the same digest, referred to as a collision. As shown in the figure below, small changes to the input of a hash function produce completely different digests. Thus, hash digests provide a way to identify content without revealing it. The output of the hash function is also referred to as a hash.

This diagram shows the digests produced with the same cryptographic hash function over several inputs. A small change in the input produces a completely different hash digest as output. Source: Wikipedia

The timestamping solutions in this post use the SHA-256 and RIPEMD-160 hash functions. SHA-256 is a version of the SHA-2 algorithm with 256 bit keys. Its predecessor, SHA-1, has been under scrutiny for some time. In 2005, cryptographers showed mathematically that SHA-1 was not collision-free, prompting many to start moving to SHA-2. In February of 2017, Google researchers were able to create a collision with SHA-1, showing that SHA-1 is no longer reliably trustworthy. Because collision attacks are theoretically possible, though technically infeasible, for SHA-2, SHA-3 has been developed as a future replacement. I mention this to show how the world of hash functions is dynamic, resulting in continued research of better functions. For this post, however, it is most important to just understand the purpose of hash functions.

In addition to hash functions, this post discusses solutions that utilize pubic-key cryptography, consisting of private keys and public keys. Users typically generate a private key using random information generated by their computer and an algorithm such as 3DES. Users then use this private key with an algorithm such as RSA or ECC to generate a public key. Users are expected to secure their private key, but share the public key.

A diagram showing an example of encryption using public and private keys. Source: Wikipedia

Users use public keys to encrypt content and private keys to decrypt it. In the figure above, everyone has access to Alice's pubic key. Bob encrypts a message using Alice's public key, but only Alice can decrypt it because she is the only one with access to her private key.

This process can be used in reverse to digitally sign content. The private key can be used to encrypt content and the public key can be used to decrypt it. This digital signature allows anyone with access to the public key to verify that the content was encrypted by the owner of the private key because only the owner of the private key should have access to the private key.

Certificates are documents containing a public key and a digital signature. A user typically requests a certificate on behalf of themselves or a server. A trusted certificate authority verifies the user's identity and issues the certificate with a digital signature. Other users can verify the identity of the owner of the certificate by verifying the digital signature of the certificate with the certificate authority. Certificates expire after some time and must be renewed. If a user's private key is compromised, then the certificate authority can revoke the associated certificate.

A nonce is a single-use value that is added to data prior to encryption or hashing. Systems often insert it to ensure that transmitted encrypted data can not be reused by an attacker in the future. In this article nonces are used with hashing as part of Bitcoin's proof-of-work function, to be explained later.

Finally, there is the related concept of binary-to-text encoding. Encoding allows a system to convert data to printable text. Unlike hash digests, encoded text can be converted back into its original input data. Cryptographic systems typically use encoding to create human-readable versions of public/private keys and hash digests. Base64 is a popular encoding scheme used on the Internet. Bitcoin also uses the lesser known Base58 scheme.

Brief Bitcoin Background

Bitcoin is a cryptocurrency. It is not issued by an institution or backed by quantities of physical objects, like gold or silver. It is software that was released with an open source license to the world by an anonymous individual using the pseudonym Satoshi Nakamoto. Using a complex peer-to-peer network protocol it ensures that funds (Bitcoins) are securely transferred from one account to another.

Bitcoin accounts are identified by addresses. Addresses are used to indicate where Bitcoins should be sent (paid). The end user’s Bitcoin software uses public and private keys to generate an address. Users often have many addresses to ensure their privacy. Users have special purpose software, called Wallets, that generates and keeps track of addresses. There is no central authority to issue addresses, meaning that addresses must be generated individually by all participants.

Wallets generate Bitcoin addresses using the following process:

Generate an ECC public-private key pair
Perform SHA-256 hash of public key
Perform RIPEMD-160 hash of that result
Add version byte (0x00) to the front of that result
Perform a SHA-256 hash of that result, twice
Append the first 4 bytes of that result to the value from #4
Convert that result into Base58, which eliminates confusing characters 0 and O as well as 1 and l

The last step uses Base58 so that users can write the address on a piece of paper or speak it aloud over the phone. The ECC algorithms are used by Bitcoin to make the public-private key pair "somewhat resistant" to quantum computers. SHA-256 is used twice in step 5 to reduce the chance of success for any as yet unknown attacks against the SHA-2 hash function. Because all Bitcoin users generate addresses themselves, without a central addressing authority, this long process exists to reduce the probability of a collision between addresses to 0.01%. Even so, for improved security, the community suggests generating new addresses for each transaction. Note that only public-private keys and hashing are involved. There are no certificates to revoke or expire.

Transactions contain the following types of entries:

Transaction inputs contain a list of addresses and amount of Bitcoins to transfer from those addresses. Also included is a digital signature corresponding to each address. This digital signature is used by the Bitcoin software to verify that the transaction is legitimate and thus these Bitcoins can be spent. There is also a user-generated script used to specify how to access the bitcoins, but the workings of these scripts are outside the scope of this post.
Transaction outputs contain a list of addresses and amount of Bitcoins to transfer to those addresses. As with transaction inputs, a user-generated script is included to specify how to spend the bitcoins, but I will not go into further detail here.
Another field exists to enter the amount of transaction fees paid to the miners for processing the transaction.

The Bitcoin system broadcasts new transactions to all nodes. Miners select transactions and group them into blocks. A block contains the transactions, a timestamp, a nonce, and a hash of the previous block.

Within each block, Bitcoin stores transactions in a Merkle tree, an example diagram of which is shown below. Transactions reside in the leaves of the tree. Each non-leaf node contains a hash of its children. This data structure is used to prevent corrupt or illicit transactions from being shared, and thus included in the block chain.

A diagram showing an example of a Merkle tree. Each non-leaf node contains a hash of its children. For Bitcoin, transactions reside in the leaves. Source: Wikipedia

A conceptual diagram shows the Bitcoin blockchain. Each block contains: a hash of the previous block, a timestamp, a nonce, and the root of a tree of transactions. Source: Wikipedia

Miners only see Bitcoin addresses and amounts in each transaction, providing some privacy to those submitting transactions. To add a block to the blockchain, miners must solve a proof-of-work function. Once a block has been assembled by a miner, the Bitcoin software generates a nonce, adds it to the content of the block, and then hashes the contents of the block with the nonce using SHA-256, twice, to produce a hash. The system does not share the nonce with the miners. To add their block to the Bitcoin blockchain, the miner must guess nonces, combine them with the block content, and hash this content until they produce the correct hash digest value. This proof-of-work function is designed to be fast for the system to verify, but time-consuming for the miners to execute. The length of the nonce is increased every 14 days to maintain the level of difficulty in solving the proof-of-work function. This value was chosen to ensure that that miners continue to take 10 minutes to process each block. For each block completed, miners are rewarded any user-provided transaction fees included in the transactions as well as newly minted Bitcoins -- a block reward. The block reward is currently set at 12.5 bitcoins, worth $15,939 as of March 2, 2017. Miners run software and dedicated hardware around the globe to solve the proof-of-work function. Currently the local cost of electricity is the limiting factor in profiting from mining bitcoins.

To alter previous transactions, an attacker would need to select the block of the transaction they wished to alter. They would then need to insert their illicit transaction into the block and create a new block. After this they would then need to solve the block containing that transaction and all subsequent blocks faster than more than 50% of all the other miners, thus it is considered to be extremely hard to alter the blockchain.

Bitcoins do not really exist, even on an individual's hard drive. The blockchain contains the record of every bitcoin spent and indicates the current balance at each Bitcoin address. Full Bitcoin nodes have a copy of the blockchain, currently at 105GB, which can create problems for users running full nodes. Satoshi Nakamoto recommended periodically pruning the blockchain of old transactions, but so far this has not been done.

Technology exists to create blockchains outside of Bitcoin, but Bitcoin provides incentives for participation, in terms of monetary rewards. Any system attempting to use a blockchain outside of Bitcoin would need to produce similar incentives for participants. The participation by the miners also secures the blockchain by preventing malicious users from spamming it with transactions.

How accurate are the timestamps in the blockchain? According to the Bitcoin wiki:

A timestamp is accepted as valid if it is greater than the median timestamp of previous 11 blocks, and less than the network-adjusted time + 2 hours. "Network-adjusted time" is the median of the timestamps returned by all nodes connected to you. As a result, block timestamps are not exactly accurate, and they do not even need to be in order. Block times are accurate only to within an hour or two.

Bitcoins come in several denominations. The largest is the Bitcoin. The smallest is the satoshi. One satoshi equals 0.00000001 (1 x 10^-8) Bitcoins.

Trusted Timestamping

Trusted Timestamping allows a verifier to determine that the given content existed during the time of the timestamp. It does not indicate time of creation. In many ways, it is like Memento-Datetime because it is still an observation of the document at a given point in time.

Timestamping can be performed by anyone with access to a document. For a timestamp to be defensible, however, it must be produced by a reliable and trusted source. For example, timestamps can be generated for a document by a user's personal computer and then signed with digital signatures. At some point in the future, a verifier can check that the digital signature is correct and verify the timestamp. This timestamp is not trustworthy because the clock on the personal computer may be altered or set incorrectly, thus providing doubt in the accuracy of the timestamp. Some trustworthy party must exist to that not only sets their time correctly, but ensures that timestamps are verifiable in the future.

Trusted Timestamping relies upon a trustworthy authority to accept data, typically a document, from a requestor and issue timestamps for future verification. The process then allows a verifier that has access to the timestamp and the original data to verify that the data existed at that point in time. Thus, two basic overall processes exist: (1) timestamp issue, and (2) timestamp verification.

In addition, privacy is a concern for documents. A document being transmitted can be intercepted and if the document is held by some third party for the purposes of verifying a timestamp in the future, then it is possible that the document can be stolen from the third party. It is also possible for such a document to become corrupted. To address privacy concerns, trusted timestamping focuses on providing a timestamp for the hash of the content instead. Because such hashes cannot be reversed, the document cannot be reconstructed. Owners of the document, however, can generate the hashes from the document to verify it with the timestamping system.

Finally, verifying the timestamps should not depend on some ephemeral service. If such a service is nonexistent in the future, then the timestamps cannot be verified. Any timestamping solution will need to ensure that verification can be done for the foreseeable future.

Trusted Timestamping with RFC 3161 and ANSI X9.95

ANSI X9.95 extends RFC 3161 to provide standards for trusted timestamping in the form of a third party service called a Time Stamping Authority (TSA). Both standards discuss the formatting of request and response messages used to communicate with a TSA as well as indicating what requirements a TSA should meet.

The TSA issues time-stamp tokens (TST) as supporting evidence that the given content existed prior to a specific datetime. The following process allows the requestor to acquire a given timestamp:

The requestor creates a hash of the content.
The requestor submits this hash to the TSA.
The TSA ensures that its clock is synchronized with an authoritative time source.
The TSA ensures that the hash is the correct length, but, to ensure privacy, does not examine the hash in any other way.
The TSA generates a TST containing the hash of the document, the timestamp, and a digital signature of these two pieces of data. The digital signature is signed with a private key whose sole purpose is timestamping. RFC 3161 requires that the requestor not be identified in the TST. The TST may also include additional metadata, such as the security policy used.
The TST is sent back to the requestor, who should then store it along with the original document for future verification.

Simplified diagram showing the process of using a Time Stamp Authority (TSA) to issue and verify timestamps. Source: Wikipedia

To verify a timestamp, a verifier does not need the TSA. The verifier only needs:

the hash of the original document
the TST
the TSA's certificate

They use the original data and the TST in the following process:

The verifier verifies the digital signature of the TST against the TSA’s certificate. If this is correct, then they know that the TST was issued by the TSA.
The verifier then checks that the hash in the TST matches the hash of the document. If they match, then the verifier knows that the document hash was used to generate that TST.
The timestamp contained in the TST and the hash were used in the generation of the digital signature, hence the TSA observed the document at the given time.

Hauber and Stornetta mentioned that the TSA can be compromised in their 1991 paper "How to Time-Stamp a Digital Document" and prescribed a few solutions, such as linked timestamping, which is implemented by ANSI X9.95. With Linked Timestamping, each TST includes a hash digest of the previous TST. Users can then additionally verify that a timestamp is legitimate by comparing this hash digest with the previously granted TST.

ANSI X9.95 also supports the use of transient-key cryptography. In this case, the system generates a distinct public-private key pair for each timestamp issued. Once a timestamp is issued and digitally signed, the system deletes the private key so that it cannot be compromised. The verifier uses the public key to verify the digital signature.

Services using these standards exist with companies like DigiStamp, eMudhra, Tecxoft, and Safe Stamper TSA. Up to 5 free timestamps can be generated per day per IP at Safe Creative's TSA.

The solutions above have different issues.

ANSI X9.95 and RFC 3161 provide additional guidance on certificate management and security to ensure that the TSA is not easily compromised, but the TSA is still the single point of failure in this scheme. If the TSA relies on an incorrect time source or is otherwise compromised, then all timestamps generated are invalid. If the TSA’s certificate expires or is revoked, then verifying past timestamps becomes difficult if not impossible, depending on the availability of the now invalid certificate. If the revoked certificate is still available, the datetime of revocation can be used as an upper bound for the validity of any timestamps. Unfortunately, a certificate is usually revoked because its private key was compromised. A compromised key creates doubt in any timestamps issued using it. If transient-key cryptographic is used, doubt exists with any generated public-private keys as well as their associated timestamps.

Linked timestamping helps ensure that the TSA's tokens are not easily faked, but require that the verifier meet with other verifiers to review the tokens. This requirement violates the need for privacy.

Hauber and Stornetta developed the idea of distributed trust for providing timestamps. The system relies on many clients being ready, available, and synchronized to a time source. Requestors would submit a document hash digest to a random set of k timestamping clients. These clients would in turn each digitally sign their timestamp response. Because the choice in clients is random, there is a low probability of malicious clients issuing bad timestamps. The requestor would then store all timestamps from the k clients who responded. Unfortunately, this system requires participation without direct incentives.

Trusted Timestamping with OriginStamp

Gipp, Meuschke, and Gernandt recognized that the cryptocurrency Bitcoin provides timestamping as part of maintaining the blockchain. Each block contains a hash of the previous block, implementing something similar to the linking concept developed by Hauber and Stornetta and used in ANSI X9.95. The blockchain is distributed among all full Bitcoin clients and updated by miners, who only see transactions and cannot modify them. In some ways, the distributed nature of Bitcoin resembles parts of Hauber and Stornetta's distributed trust. Finally, the blockchain, because it is distributed to all clients, is an independent authority able to verify timestamps of transactions, much like a TSA, but without the certificate and compromise issues.

They created the OriginStamp system for timestamping user-submitted documents with the Bitcoin blockchain. They chose Bitcoin because it is the most widely used cryptocurrency and thus is perceived to last for a long time. This longevity is a requirement for verification of timestamps in the future.

OriginStamp Process to convert a document content into a Bitcoin address for use in a Bitcoin transaction that can be later verified against the blockchain.

The figure above displays the OriginStamp process for creating a Bitcoin address from a document:

A user submits a document to the system which is then hashed, or just submits the hash of a document.
The submitted document hash is placed into a list of hashes -- seed text -- from other submissions during that day.
Once per day, this seed text is itself hashed using SHA-256 to produce an aggregate hash.
This aggregate hash is used as the Bitcoin private key which is used to generate a public key.
That public key is used to generate a Bitcoin address which can be used in a timestamped Bitcoin transaction of 1 satoshi.

OriginStamp could submit each document hash to the blockchain as an individual transaction, but the hashes are aggregated together to keep operating costs low. Because fees are taken out of every Bitcoin transaction, each transaction costs $ 0.03, allowing Gipp and his team to offer this low cost service for free. They estimate that the system costs $10/year to operate.

Their paper was published in March of 2015. According to coindesk.com, for the March 2015 time period 1 Bitcoin was equal to $268.32. As of March of 2017, 1 Bitcoin is now equal to $960.36. The average transaction fee now sits at approximately 45,200 satoshis, resulting in a transaction fee of $0.43, as of March 26, 2017.

A screenshot of the memento that I timestamp throughout this section.

OriginStamp allows one to submit documents for timestamping using the Bitcoin blockchain. In this case, I submitted the HTML content of the memento shown in the figure above.

OriginStamp responds to the submission by indicating that it will submit a group of hashes to the Bitcoin blockchain in a few hours.

With the OriginStamp service, the requestor acquires a timestamp using the following process:

Submit the document -- or just its hash -- to the OriginStamp website as seen in the screenshots above. If a document is submitted, its hash is calculated and the document is not retained.
OriginStamp sends an email once the system has submitted the concatenated hashes to the Bitcoin blockchain. This email will contain information about the seed text used and this seed must be used for verification.
In addition, the @originstamp Twitter account will tweet that the given hash was submitted to the blockchain.

A screenshot showing how OriginStamp displays verification information for a given hash. In this case, the document hash is da5328049647343c31e0e62d3886d6a21edb28406ede08a845adeb96d5e8bf50 and it was submitted to the blockchain on 4/10/2017 10:18:29 AM GMT-0600 (MDT) as part of seed text whose hash, and hence private key is c634bcafba86df8313332abc0ae854eea9083b279cdd4d9cde1d516ee6fb70d9.

Because the blockchain is expected to last for the forseeable future and is tamper-proof, it can be used at any time to verify the timestamp. There are two methods available: with the OriginStamp service, or directly against the Bitcoin blockchain using the seed text.

To do so with the OriginStamp service, the verifier can follow this process:

Using the OriginStamp web site, the verifier can submit the hash of the original document and will receive a response as shown in the screenshot above. The response contains the timestamp under the heading "Submitted to Bitcoin network".
If the verifier wishes to find the timestamp in the blockchain, they can expand the "Show transaction details" section of this page, shown below. This section reveals a button allowing one to download the list of hashes (seed text) used in the transaction, the private and public keys used in the transaction, the recipient Bitcoin address, and a link to blockchain.info also allowing verification of the transaction at a specific time.
Using the link "Verify the generated secret on http://blockchain.info", they can see the details of the transaction and verify the timestamp, shown in the figure below.

A screenshot showing that more information is available once the user clicks "show transaction details". The recipient Bitcoin address is outlined in red. From this screen, the user can download the seed text containing the list of hashes submitted to the Bitcoin blockchain. A SHA-256 hash of this seed text is the Bitcoin private key. From this private key, a user can generate the public key and eventually the Bitcoin address for verification. In this case the generated Bitcoin address is 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy.

A screenshot of the blockchain.info web site showing the details of a Bitcoin transaction, complete with its timestamp. The Bitcoin address and timestamp have been enlarged and outlined in red. For Bitcoin address 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy, 1 Satoshi was transferred on 2017-02-28 00:01:15, thus that transaction date is the timestamp for the corresponding document.

To verify that a document was timestamped by directly checking the Bitcoin blockchain, one only needs:

The hash of the original document.
The seed text containing the list of hashes submitted to the bitcoin network.
Tools necessary to generate a Bitcoin address from a private key and also search the contents of the blockchain.

If OriginStamp is not available for verification, the verifier would then follow this process:

Generate the hash of the document.
Verify that this hash is in the seed text. This seed text should have been saved as a result of the email or tweet from OriginStamp.
Hash the seed text with SHA256 to produce the Bitcoin private key.
Generate the Bitcoin address using a tool such as bitaddress.org. The figure below shows the use of bitaddress.org to generate a Bitcoin address using a private key.
Search the Bitcoin blockchain for this address, using a service such as blockchain.info. The block timestamp is the timestamp of the document.

A screenshot of bitaddress.org's "Wallet Details" tab with the Bitcoin address enlarged and outlined in red. One can insert a Bitcoin private key and it will generate the pubic key and associated Bitcoin address. This example uses the private key of c634bcafba86df8313332abc0ae854eea9083b279cdd4d9cde1d516ee6fb70d9 shown in previous screenshots which corresponds to a Bitcoin address of 1EcnftDQwawHQWhE67zxEHSLUEoXKZbasy, also shown in previous figures.

OriginStamp also supplies an API that developers can use to submit documents for timestamping as well as verify timestamps and download the seed text for a given document hash.

Others are using blockchain technology for building a decentralized Internet and creating applications. Some Bitcoin users are not happy with using the Bitcoin blockchain for non-currency related transactions for fear that the miners' time will be taken up with transactions that are not related to commerce. Other Bitcoin users are quite happy with using it for timestamping. IBM is creating their own blockchain technologies for use in applications.

Comparison of Timestamping Solutions

In the table below I provide a summary comparison between the TSA, Origin Stamp, and submitting directly to the blockchain without OriginStamp.

	TSA	OriginStamp	Directly To Blockchain
Financial Cost per Timestamp	Dependent on Service and subscription, ranges from $3 to $0.024	Dependent on size of seed text, but less than Bitcoin transaction fee	Bitcoin transaction fee, optimally $0.56
Accuracy of Timestamp	Within seconds of time of request, but dependent on number of requests in queue if linked timestamping used	Within 1 Day + 2 hours	Within 2 hours
Items Needed For Verification	Original Document TST Certificate of server to verify signature	Original Document The seed text of hashes submitted at the same time	Original Document
Tools Needed For Verification	Certificate verification tools	Software to generate Bitcoin Address Software to search blockchain
Timestamp Storage	In TST saved by requestor	Blockchain
Privacy	Only hash of document is submitted, but TSA knows of the requestor's IP address	Miners only see the Bitcoin address, not who submitted the document or even its hash
Targets of Compromise	TSA time server TSA certificate private key	Blockchain
Requirement for Compromise	Server is configured insecurely Server has open software vulnerabilities	>50% of Bitcoin miners colluding
Dependency of Longevity	Life of Organization Offering Timestamping Service	Continued Interest In Preserving the Blockchain

In the first row, we compare the cost of timestamps from each service. At the TSA service run by Digistamp, an archivist can obtain a cost of $0.024 per timestamp for 600,000 timestamps, but would need to commit to an 1 year fee of $14,400. They would also need to acquire all 600,000 timestamps within a year or lose them. If they pay $10, they are only allocated 30 timestamps and need to use them within 3 months, resulting in a cost of $3 per timestamp. Tecxoft's pricing is similar. OriginStamp attempts to keep costs down by bundling many hashes into a seed text file, but is still at the mercy of Bitcoin's transaction fees. The price of Bitcoin is currently very volatile. The transaction fee mentioned in Gipp's work from 2015 was $0.03. Miners tend to process a block faster if it has a higher transaction fee. The optimal fee has gone up due to Bitcoin's rise in value and an increase in the number of transactions waiting to be processed. The lowest price for the least delay is currently 200 satoshis per byte and the median transaction size is 226 bytes for a total cost of 45,200 satoshis. This was equivalent to $0.43 when I started writing this blog post and is now $0.56.

In the second row, we compare timestamp accuracy. The TSA should be able to issue a timestamp to the requestor within seconds of the request. This can be delayed by a long queue if the TSA uses linked timestamping because every request must be satisfied in order. OriginStamp, however, tries to keep costs down by submitting its seed list to the blockchain at once-a-day intervals, according to the paper. On top of this, the timestamp in the blockchain is accurate to within two hours of submission of the Bitcoin transaction. This means that an OriginStamp timestamp may be as much as 24 hours + 2 hours = 26 hours off from the time of submission of the document hash. In practice, I do not know the schedule used by OriginStamp, as I submitted a document on February 28, 2017 and it was not submitted to the Bitcoin network until March 4, 2017. Then again, a document submitted on March 19, 2017 was submitted to the Bitcoin network by OriginStamp almost 18 hours later.

If the cost is deemed necessary, this lack of precision can be alleviated by not using OriginStamp but submitting to the blockchain directly. One could generate a Bitcoin address from a single document hash and then submit it to the blockchain immediately. The timestamp precision would still be within 2 hours of transaction submission.

For the TSA, timestamps are stored in the TST, which must be saved by the requestor for future verification. In contrast, OriginStamp saves timestamps in the Blockchain. OriginStamp users still need to save the seed list, so both solutions require the requestor to retain something along with the original document for future verification.

All solutions offer privacy through the use of document hashes. The Bitcoin miners receiving OriginStamp transactions only see the Bitcoin address generated from the hash of the seed list and do not even know it came from OriginStamp, hiding the original document submission in additional layers. The TSA, on the other hand, is aware of the requestor's IP address and potentially other identifying information.

To verify the timestamp, TSA users must have access to the original document, the TST, and the certificate of the TSA to verify the digital signature. OriginStamp only requires the original document and the seed list of hashes submitted to the blockchain. This means that OriginStamp requires slightly fewer items to be retained.

If using the blockchain directly, without OriginStamp, a single document hash could be used as the private key. There would be no seed list in this case. For verification, one would merely need the original document, which would be retained anyway.

To compromise the timestamps, the TSA time server must be attacked. This can be done by taking advantage of software vulnerabilities or insecure configurations. TSAs are usually audited to prevent insecure configurations, but vulnerabilities are frequently discovered in software. OriginStamp, on the other hand, requires that the blockchain be attacked directly, which is only possible if more than 50% of Bitcoin miners collude to manipulate blocks.

Finally, each service has different vulnerabilities when it comes to longevity. Mementos belong to web archives, and as such, are intended to exist far, far longer than 20 years. This makes longevity a key concern in any decision of a timestamping solution. The average lifespan of a company is now less than 20 years and expected to decrease. The certificates for verifying a timestamp running a TSA may last for little more than 3 years. This means that the verifier will need someone to have saved the TSA's certificate prior to verification of the timestamp. If the organization with the document and the TST is also the same organization providing the TSA's certificate, then there is cause for doubt in its validity because that organization can potentially forge any or all of these verification components.

The Bitcoin blockchain, on the other hand, is not tied to any single organization and is expected to last as long as there is interest in investing in the cryptocurrency. In addition, there are many copies of the blockchain available in the world. If Bitcoin goes away, there is still an interest in maintaining the blockchain for verification of transactions, and thus retaining copies of the blockchain by many parties. If someone wishes to forge a timestamp, they would need to construct their own illicit blockchain. Even if they went that far, a copy of their blockchain can be compared to other existing copies to evaluate its validity. Thus, even if the blockchain is no longer updated, it is still an independent source of information that can be used for future verification. If the blockchain is ever pruned, then prior copies will still need to be archived somewhere for verification of prior transactions. The combined interests of all of these parties support the concept of the Bitcoin blockchain lasting longer than a single server certificate or company.

So, with trusted timestamping available, what options do we have to make it easy to verify memento timestamps?

Trusted Timestamping of Mementos

It is worth noting that, due to delays between timestamp request and response in each of these solutions, trusted timestamping is not a replacement for Memento-Datetime. The Memento-Datetime header provides a timestamp of when the web archive captured a given resource. Trusted timestamping, on the other hand, can provide an additional dimension of certainty that a resource existed at a given datetime. Just as Memento-Datetime applies to a resource at a specific URI-M, so would a trusted timestamp.

A crawler can capture a memento as part of its normal operations, compute the hash of its content, and then submit this hash for timestamping to one of these services. The original memento content encountered during the crawl, the raw memento, must be preserved by the archive indefinitely. The memento can include a link relation in the response headers, such as the unregistered trusted-timestamp-info relation shown in the example headers below, indicating where one can find additional information to verify the timestamp.

HTTP/1.1 200 OK Server: Tengine/2.1.0 Date: Thu, 21 Jul 2016 17:34:15 GMT Content-Type: text/html;charset=utf-8 Content-Length: 109672 Connection: keep-alive Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMTLink: <http://www.cnn.com/>; rel="original", <http://a.example.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://a.example.org/web/http://www.cnn.com/>; rel="timegate", <http://a.example.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://a.example.org/web/20160120080735/http://www.cnn.com/>; rel="first memento"; datetime="Wed, 20 Jan 2016 08:07:35 GMT", <http://a.example.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT", <http://a.example.org/timestamping/20160722191106/http://www.cnn.com/>; rel="trusted-timestamp-info"

The URI linked by the timestamp-info relation could be a JSON-formatted resource providing information for verification. For example, if OriginStamp is used for timestamping, then the resource might look like this:

{ "timestamping-method": "OriginStamp", "seed-text": "http://a.example.org/timestamping/20160722191106/seedtext.txt",
"hash-algorithm": "SHA-256" }

In this case, a verifier already knows the URI-M of the memento. They only need to locate the raw memento, calculate its hash, and use the seed text as described above to generate the Bitcoin address and find the timestamp in the Bitcoin blockchain.

Or, if an RFC 3161 solution is used, the timestamp-info resource could look like this:

{ "timestamping-method": "RFC 3161", "timestamp-token": "http://a.example.org/timestamping/tst/20160721174431/http://www.cnn.com", "tsa-certificate": "http://a.example.org/timestamping/tsacert/cert.pem",
"hash-algorithm": "SHA-256" }

In this case, a verifier can locate the raw memento, calculate its hash, and use verify it using the timestamp token (TST) and the TSA certificate as described above for RFC 3161.

If it is known that the crawler creates the hash of the raw memento and uses it as a private key for generating a Bitcoin address, thus submitting it directly to the blockchain for timestamping, then no additional headers would be needed. Verifiers only need the content of the raw memento to generate the hash. In addition, perhaps a separate timestamping service could exist for mementos, using the URI-M (e.g., https://timestamping-service.example.org/{URI-M}).

If one specific timestamping scheme is used, then perhaps specific link relations can be created to convey the resources from each of these fields.

Of course, this assumes that we only want to timestamp raw mementos. Conceivably, one might wish to timestamp a screenshot of a web page, or its WARC. We will need to perform additional analysis of the other potential use cases needed.

Summary

In this post, I have discussed different timestamping options. These options have different levels of complexity and security. All of them support privacy by a permitting the submission of a document hash to a timestamping service. OriginStamp attempts to address some of the concerns of the existing ANSI X9.95/RFC 3161 standard by using the timestamping features of the Bitcoin blockchain.

Of these options, the Bitcoin blockchain offers a decentralized, secure solution that supports privacy and does not depend on a central server that can fail or be compromised. Because copies of the blockchain are distributed to all full Bitcoin clients, it remains present for verification in the future. Bitcoin has been around for 8 years and continues to increase in value. Because all participants have incentives to keep the blockchain distributed and up to date, it is expected to outlast most companies, who have a median age of 20 years. In addition, if Bitcoin is no longer used, copies of the blockchain will still need to be maintained indefinitely for verification. It does, however, suffer from issues with timestamp accuracy inherent in the Bitcoin protocol. These can be alleviated by submitting a document hash directly against the blockchain.

Companies offering trusted timestamping using TSAs, on the other hand, may not have the longevity and require subscription fees for a limited number of timestamps. Though Bitcoin is currently volatile, it has stabilized before, and the subscription fees from these companies are still more expensive on average than the Bitcoin transaction fee.

Even though timestamping options exist, use cases must be identified for the verification of such timestamps in the future. These use cases will inform requestors of which content to be timestamped and will also affect which timestamping solution is selected. It would also be beneficial for verifiers to have links to additional resources for verification.

Trusted timestamping of mementos is possible, but will require some additional decisions and technology to become a reality.

Additional References Used For This Post:

Bela Gipp, Norman Meuschke, and André Gernandt. 2015. Trusted Timestamping using the Crypto Currency Bitcoin. In iConference 2015. Newport Beach, CA: iSchools. http://hdl.handle.net/2142/73770
Pedro Franco. 2015. Understanding Bitcoin: Cryptography, Engineering and Economics, Chichester, UK: John Wiley & Sons Ltd. http://www.wiley.com/WileyCDA/WileyTitle/productCd-1119019168.html
Satoshi Nakamoto. 2008. Bitcoin: A Peer-to-Peer Electronic Cash System. https://bitcoin.org/bitcoin.pdf
Stuart Haber and W. Scott Stornetta. 1991. How to Time-Stamp a Digital Document. Journal of Cryptology 3, 2 (1991), 99–111. http://dx.doi.org/10.1007/bf00196791

-- Shawn M. Jones

↧

2017-04-23: Remembering Professor Hussein Abdel-Wahab

April 23, 2017, 8:39 pm

≫ Next: 2017-04-24: Pushing Boundaries

≪ Previous: 2017-04-20: Trusted Timestamping of Mementos


Hussein (blue shirt) at the post-defense feast for Dr. SalahEldeen.

As we head into exam week, I can't help but reflect that this is the first exam week at ODU since 1980 that does not involve Professor Hussein Abdel-Wahab. The department itself was established in 1979, so Hussein has been here nearly since the beginning. For comparison, in 1980 I was in middle school.

I had the privilege of knowing Hussein both as my instructor for three classes in 1996 & 1997, and as a colleague since 2002. None who knew Hussein would dispute that he was an excellent instructor with unrivaled concern for students' learning and general well-being. It is fitting that ODU is establishing the Dr. Abdel-Wahab Memorial Scholarship (http://bit.ly/HusseinAbdelWahabODU) which will support graduate students. As of April 11, the scholarship is 58% of the way to its goal of $25k. I've donated, and I call on all former students and colleagues to continue Hussein's legacy and ensure this scholarship is fully funded.

--Michael

↧

2017-04-24: Pushing Boundaries

April 24, 2017, 9:17 am

≫ Next: 2017-04-26: Discovering Scholars Everywhere They Tread

≪ Previous: 2017-04-23: Remembering Professor Hussein Abdel-Wahab

Since the advent of the web, more elements of scholarly communication are occurring online. A world that once consisted mostly of conference proceedings, books, and journal articles now includes blog posts, project websites, datasets, software projects, and more. Efforts like LOCKSS, CLOCKSS, and Portico preserve the existing journal system, but there is no similar dedicated effort for the web presence of scholarly communication. Because web-based scholarly communication is born on the web, it can benefit from web archiving.

This is complicated by the complexity of scholarly objects. Consider a dataset on the website Figshare, whose landing page is shown in Fig. 1. Each dataset on Figshare has a landing page consisting of a title, owner name, brief description, licensing information, and links to bibliographic metadata in various forms. If an archivist merely downloads the dataset and ignores the rest, then a future scholar using their holdings is denied context and additional metadata. The landing page, dataset, and bibliographic metadata are all objects making up this artifact. Thus, in order to preserve context, a crawler will need to acquire all of these linked resources belonging to this artifact on Figshare.

Fig. 1: A screenshot of the landing page of an artifact on Figshare.

Green boxes outline links to URIs that belong to this artifact.

Red boxes outline links to that do not belong to this artifact.

Interestingly, this artifact links to another artifact, a master's thesis, that does not link back to this artifact, complicating discovery of the dataset associated with the research paper. Both artifacts are fully archived in the Internet Archive. In contrast, Fig. 2 below shows a different incomplete Figshare artifact -- as of April 12, 2017 -- at the Internet Archive. Through incidental crawling, the Internet Archive discovered the landing page for this artifact, but has not acquired the actual dataset or the bibliographic metadata. Such cases show that incidental crawling is insufficient for archiving scholarly artifacts.

Fig 2: A screenshot of the web pages from an incompletely archived artifact. The Internet Archive has archived the landing page of this Figshare artifact, but did not acquire the dataset or the bibliographic metadata about the artifact.

What qualifies as an artifact? An artifact is a set of interconnected objects belonging to a portal that represent some unit of scholarly discourse. Example artifacts include datasets, blog posts, software projects, presentations, discussion, and preprints. Artifacts like blog posts and presentations may only consist of a single object. As seen in Fig. 1, datasets can consist of landing pages, metadata, and additional documentation that are all part of the artifact. Software projects hosted online may consist of source code, project documentation, discussion pages, released binaries, and more. For example, the Python Memento Client library on the GitHub portal consists of source code, documentation, and issues. All of these items would become part of the software project artifact. An artifact is usually a citable object, often referenced by a DOI.

Artifacts are attributed to a scholar or scholars. Portals provide methods like search engines, APIs, and user profile pages to discover artifacts. Outside of portals, web search engine results and focused crawling can also be used to discover artifacts. From experience, I have observed that each result from these search efforts contains a URI pointing to an entry page. To acquire the Figshare example above, a Figshare search engine result contained a link to the entry page, not links to the dataset or bibliographic data. A screenshot showing entry pages as local search engine results is seen in Fig. 3 below. Poursardar and Shipman have studied this problem for complex objects on the web and have concluded that "there is no simple answer to what is related to a resource" thus making it difficult to discover artifacts on the general web. Artifacts stored on scholarly portals, however, appear to have some structure that can be exploited. For the purposes of this post, I will discuss capturing all objects in an artifact starting from its entry page because the entry page is designed to be used by humans to reach the rest of the objects in the artifact and because entry pages are the results returned by these search methods.

Fig. 3: A screenshot showing search engine results in Figshare that lead to entry pages for artifacts.

HTML documents contain embedded resources, like JavaScript, CSS, and images. Web archiving technology is still evolving in acquiring embedded resources. For the sake of this post, I assume that any archiving solution will capture these embedded resources for any HTML document, this post will focus on discovering the base URIs of the linked resources making up the artifact, referred to in the rest of this article as artifact URIs.

For simplicity of discovery, I want to restrict artifact URIs to a specific portal. Thus, the set of domain names possible for each artifact URI in an artifact is restricted to the set of domain names used by the portal. I consider linked items on other portals to be separate artifacts. As mentioned with the example in Fig. 1, a dataset page links to its associated thesis, thus we have two interlinked artifacts: a dataset and a thesis. A discussion of interlinking artifacts is outside of the scope of this post and is being investigated by projects such as Research Objects by Bechhofer, De Roure, Gamble, Goble, and Buchan, as well as already being supported by efforts such as OAI-ORE.

Fig. 4: This diagram demonstrates an artifact and its boundary. Artifacts often have links to content elsewhere in the scholarly portal,but only some of these links are to items that belong to the artifact.

How does a crawler know which URIs belong to the artifact and which should be ignored? Fig. 4 shows a diagram containing an entry page that links to several resources. Only some of these resources, the artifact URIs, belong to the artifact. How do we know which URIs linked from an entry page are artifact URIs? Collection synthesis and focused crawling will acquire pages matching a specific topic, but we want as close to the complete artifact as possible with no missed and minimal extra objects. OAI-ORE provides a standard for aggregations of web resources using a special vocabulary as well as resource maps in RDF and other formats. Signposting is a machine-friendly solution that informs crawlers of this boundary by using link relations in the HTTP Link header to indicate which URIs belong to an artifact. The W3C work on "Packaging on the Web" and "Portable Web Publications" require that the content be formatted to help machines find related resources. LOCKSS boxes use site-specific plugins to intelligently crawl publisher web sites for preservation. How can a crawler determine this boundary without signposting, OAI-ORE, these W3C drafts, or site-specific heuristics? Can we infer the boundary from the structures used in each site?

Fortunately, portals have predictable behaviors that automated tools can use. In this post I assume an automated system will use heuristics to advise a crawler that is attempting to discover all artifact URIs within the boundary of an artifact. The resulting artifact URIs can then be supplied to a high resolution capture system, like Webrecorder.io. The goal is to develop a limited set of general, rather than site-specific, heuristics. Once an archivist is provided these artifact URIs, they can then create high-resolution captures of their content. In addition to defining these heuristics, I also correlate these heuristics with similar settings in Archive-It to demonstrate that the problem of capturing many of these artifacts is largely addressed. I then indicate which heuristics apply to which artifacts on some known scholarly portals. I make the assumption that all authentication, access (e.g., robots.txt exclusions), and licensing issues have been resolved and therefore all content is available.

Artifact Classes and Crawling Heuristics

To discover crawling heuristics in scholarly portals, I reviewed portals from Kramer and Boseman's Crowdsourced database of 400+ tools, part of their Innovations in Scholarly Communication. I filtered the list to only include entries from the categories of publication, outreach, and assessment. To find the most popular portals, I sorted the list by Twitter followers as a proxy for popularity. I then selected the top 36 portals from this list that were not journal articles and that contained scholarly artifacts. After that I manually reviewed artifacts on each portal to find common crawling heuristics shared across portals.

Single Artifact URI and Single Hop

In Figures below, I have drawn three different classes of web-based scholarly artifacts. The simplest class, in Fig. 5a, is an artifact consisting of a single artifact URI. This blog post, for example, is an artifact consisting of a single artifact URI.

Archiving single artifact URIs can be done easily in one shot with Webrecorder.io, Archive-It's One Page setting, and "save the page" functionality offered from web archives like archive.is. I will refer to this heuristic as Single Page.

Fig 5a: A diagram of the Single Artifact URI artifact class.

Fig 5b: A diagram showing an example Single Hop artifact class.

Fig 5c: A diagram showing an example of a Multi-Hop artifact class.

Fig. 5b shows an example artifact consisting of one entry page artifact URI and those artifact URIs linked to it. Our Figshare example above matches this second form. I will refer to this artifact class as Single Hop because all artifact URIs are available within one hop from the entry page. To capture all artifact URIs for this class, an archiving solution merely captures the entry page and any linked pages, stopping at one hop away from the entry page. Archive-It has a setting that addresses this named One Page+. Inspired by Archive-It's terminology, I will use the + to indicate "including all linked URIs within one hop". Thus, I will refer to the heuristic for capturing this artifact class as Single Page+.

Because one hop away will acquire menu items and other site content, Single Page+ will acquire more URIs than needed. As an optimization, our automated system can first create an Ignored URIs List, inspired in part by a dead page detection algorithm by Bar-Yossef, Broder, Kumar, and Thomkins. The automated tool would fill this list using the following method:

Construct an invalid URI (i.e., one that produces a 404 HTTP status) for the portal.
Capture the content at that URI and place all links from that content into the ignored URIs list.
Capture the content from the homepage of the portal and place all links from that content into the ignored URIs list.
Remove the entry page URI from the ignored URIs list, if present.

The ignored URIs list should now contain URIs that are outside of the boundary, like those that refer to site menu items and licensing information. This method captures content both from the invalid URI and a homepage because homepages may not contain all menu items. As part of a promotion effort, the entry page URI may be featured on the homepage, hence we remove it from the list in the final step. Our system would then advise the crawler to ignore any URIs on this list, reducing the number of URIs crawled.

I will refer to this modified heuristic as Single Page+ with Ignored URIs.

Multi-Hop Artifacts

Fig. 5c shows an example artifact of high complexity. It consists of many interlinked artifact URIs. Examples of scholarly sites fitting into this category include GitHub, Open Science Framework, and Dryad. Because multiple hops are required to reach all artifact URIs, I will refer to this artifact class as Multi-Hop. Due to its complexity, Mulit-Hop breaks down into additional types that require special heuristics to acquire completely.

Software projects are stored on portals like GitHub and BitBucket. These portals host source code in repositories using a software version control system, typically Git or Mercurial. Each of these version control systems provide archivists with the ability to create a complete copy of the version control system repository. The portals provide more than just hosting for these repositories. They also provide issue tracking, documentation services, released binaries, and other content that provides additional context for the source code itself. The content from these additional services is not present in the downloaded copy of the version control system repository.

Fig. 6: Entry page belonging to the artifact representing the Memento Damage software project.

For these portals, the entry page URI is a substring of all artifact URIs. Consider the example GitHub source code page shown in Fig. 6. This entry page belongs to the artifact representing the Memento Damage software project. The entry page artifact URI is https://github.com/erikaris/web-memento-damage/. Artifact URIs belonging to this artifact will contain the entry page URI as a substring; here are some examples with the entry page URI substrings shown in italics:

https://github.com/erikaris/web-memento-damage/issues
https://github.com/erikaris/web-memento-damage/graphs/contributors
https://github.com/erikaris/web-memento-damage/blob/master/memento_damage/phantomjs/text-coverage.js
https://github.com/erikaris/web-memento-damage/commit/afcdf74cc31178166f917e79bbad8f0285ae7831

Because all artifact URIs are based on the entry page URI, I have named this heuristic Path-Based.

In the case of GitHub some item URIs reside in a different domain: raw.githubusercontent.com. Because these URIs are in a different domain and hence do not contain our significant string, they will be skipped by the path-based heuristic. We can amend the directory-based heuristic to capture these additional resources by allowing the crawler to capture all linked URIs that belong to a domain different from the domain of the entry page. I refer to this heuristic as Path-Based with Externals.

Silk allows a user to create an entire web site devoted to the data visualization and interaction of a single dataset. When a user creates a new Silk project, a subdomain is created to host that project (e.g., http://dashboard101innovations.silk.co/). Because the data and visualization are intertwined, the entire subdomain site is itself an artifact. Crawling this artifact still relies upon the path (i.e., a single slash), and hence, its related heuristic is Path-Based as well, but without the need to acquire content external to the portal.

For some portals, like Dryad, a significant string exists in the content of each object that is part of the artifact. An automated tool can acquire this significant string from the <title> element of the HTML of the entry page and a crawler can search for the significant string in the content -- not just the title, but the complete content -- of each resource discovered during the crawl. If the resource's content does not contain this significant string, then it is discarded. I refer to this heuristic as Significant String from Title.

Fig. 7: A diagram of a Dryad artifact consisting of a single dataset, but multiple metadata pages. Red boxes outline the significant string, Data from: Cytokine responses in birds challenged with the human food-borne pathogen Campylobacter jejuni implies a Th17 response., which is found in the title of the entry page and is present in almost all objects within the artifact. Only the dataset does not contain this significant string, hence a crawler must crawl URIs one hop out from those matching the Significant String from Title heuristic and also ignore menu items and other common links, hence the Significant String from Title+ with Ignored URIs is the prescribed heuristic in this case.

In reality, this heuristic misses the datasets linked from each Dryad page. To solve this our automated tool can create an ignored URI list using the techniques mentioned above. Then the crawler can crawl one hop out from each captured page, ignoring URIs in this list. I refer to this heuristic as Significant String from Title+ with Ignored URIs. Fig.7 shows an example Dryad artifact that can make use of this heuristic.

A crawler can use this heuristic for Open Science Framework (OSF) with one additional modification. OSF includes the string "OSF | " in all page titles, but not in the content of resources belonging to the artifact, hence an automated system needs to remove it before the title can be compared with the content of linked pages. Fig. 8 shows an example of this.

Fig. 8: A screenshot of an OSF artifact entry page showing the source in the lower pane. The title element of the page contains the string "OSF | Role of UvrD/Mfd in TCR Supplementary Material". The string "Role of UvrD/Mfd in TCR Supplementary Material" is present in all objects related to this artifact. To use this significant string, the substring
"OSF | " must be removed.

Here are the steps for removing the matching text:

Capture the content of the entry page.
Save the text from the <title> tag of the entry page.
Capture the content of the portal homepage.
Save the text from the <title> tag of the homepage.
Starting from the leftmost character of each string, compare the characters of the entry page title text with the homepage title text.

If the characters match, remove the character in the same position from the entry page title.
Stop comparing when characters no longer match.

I will refer to the heuristic with this modification as Significant String from Filtered Title.

Entry pages for the Global Biodiversity Information Facility (GBIF) contain an identification string in the path part of each URI that is present in all linked URIs belonging to the same artifact. For example, the entry page at URI http://www.gbif.org/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba contains the string 98333cb6-6c15-4add-aa0e-b322bf1500ba and its page content links to the following artifact URIs:

http://www.gbif.org/occurrence/search?datasetKey=98333cb6-6c15-4add-aa0e-b322bf1500ba
http://api.gbif.org/v1/dataset/98333cb6-6c15-4add-aa0e-b322bf1500ba/document

An automated system can compare the entry page URI to each of the URIs of its links to extract this significant string. Informed by this system, a crawler will then ignore URIs that do not contain this string. I will refer to the heuristic for artifacts on this portal as Significant String from URI.

Discovering the significant string in the URI may also require site-specific heuristics. Discovering the longest common substring between the path elements of the entry page URI and any linked URIs may work for the GBIF portal, but it may not work for other portals, hence this heuristic may need further development, if applicable to other portals.

The table below lists the artifact classes and associated heuristics that have been covered. As noted above, even though an artifact fits into a particular class, its structure on the portal is ultimately what determines its applicable crawling heuristic.

Artifact Class	Potential Heuristics	Potentially Adds Extra URIs Outside Artifact	Potentially Misses Artifact URIs
Single Artifact URI	Single Page	No	No
Single Hop	Single Page+	Yes	No
Single Hop	Single Page+ with Ignored URIs	Yes (but reduced amount compared to Single Page+)	No
Multi-Hop	Path-Based	Depends on Portal/Artifact	Depends on Portal/Artifact
	Path-Based with Externals	Yes	No
	Significant String from Title	No	Yes
	Significant String from Title+ with Ignored URIs	Yes	No
	Significant String from Filtered Title	No	Yes
	Significant String from URI	No	Yes

Comparison to Archive-It Settings

Even though the focus of this post has been to find artifact URIs with the goal of feeding them into a high resolution crawler, like Webrecorder.io, it is important to note that Archive-It has settings that match or approximate many of these heuristics. This would allow an archivist to use an entry page as a seed URI and capture all artifact URIs. The table below provides a listing of similar settings between the heuristics mentioned here and a setting in Archive-It that functions similarly.

Crawling Heuristic from this Post	Similar Setting in Archive-It
Single Page	Seed Type: One Page
Single Page+	Seed Type: One Page+
Single Page+ with Ignored URIs	Seed Type: Standard Host Scope Rule: Block URL if it contains the text <string>
Path-Based	Seed Type: Standard
Path-Based with Externals	Seed Type: Standard+
Significant String from Title	None
Significant String from Title+ with Ignored URIs	None
Significant String from Filtered Title	None
Significant String from URI	Seed Type: Standard Expand Scope to Include URL if it contains the text <string>

Fig. 9: This is a screenshot of part of the Archive-It configuration allowing a user to control the crawling strategy for each seed.

Archive-It's seed type settings allow one to change how a seed is crawled. As shown in Fig. 9, four settings are available. One Page, One Page+, and Standard all map exactly to our heuristics of Single Page, Single Page+, and Path-Based. For Path-Based, one merely needs to supply the entry page URI as a seed -- including the ending slash -- and Archive-It's scoping rules will ensure that all links include the entry page URI. Depending on the portal, Standard+ may crawl more URIs than Path-Based with Externals, but is otherwise successful in acquiring all artifact URIs.

Fig. 10: This is a screenshot of the Archive-It configuration allowing the user to expand the crawl scope to include URIs that contain a given string.

Fig. 11: This screenshot displays the portion of the Archive-It configuration allowing the user to block URIs that contain a given string.

To address our other heuristics, Archive-It's scoping rules must be altered, with screenshots of these settings shown in Figs 10 and 11. To mimic our Significant String from URI heuristic, a user would first need to know the significant string, and then can supply it as an argument to the setting "Expand Scope to include URL if it contains the text:". Likewise, to mimic Single Page+ with Ignored URIs, a user would need to know which URIs to ignore, and can use them as arguments to the setting "Block URL if...".

Archive-It does not have a setting for analyzing page content during a crawl, and hence I have not found settings that can address any of the members of the Significant String in Title heuristics family.

In addition to these settings, a user will need to experiment with crawl times to capture some of the Multi-Hop artifacts due to the number of artifact URIs that must be visited.

Heuristics Used In Review of Artifacts on Scholarly Portals

While manually reviewing one or two artifacts from each of the 36 portals from the dataset, I documented the crawling heuristic I used for each artifact, shown in the table below. I focused on a single type of artifact for each portal. It is possible that different artifact types (e.g., blog post vs. forum) may require different heuristics even though they reside on the same portal.

Portal	Artifact Type Reviewed	Artifact Class	Applicable Crawling Heuristic
Academic Room	Blog Post	Single Artifact URI	Single Page
AskforEvidence	Blog Post	Single Artifact URI	Single Page
Benchfly	Video	Single Artifact URI	Single Page
BioRxiv	Preprint	Single Hop	Single Page+ with Ignores
BitBucket	Software Project	Multi-Hop	Path-Based
Dataverse*	Dataset	Multi-Hop	Significant String From Filtered Title (starting from title end instead of beginning like OSF)
Dryad	Dataset	Multi-Hop	Significant String From Title+ with Ignored URI List
ExternalDiffusion	Blog Post	Single Artifact URI	Single Page
Figshare	Dataset	Single Hop	Single Page+ with Ignores
GitHub	Software Project	Multi-Hop	Path-Based with Externals
GitLab.com	Software Project	Multi-Hop	Path-Based
Global Biodiversity Information Facility*	Dataset	Multi-Hop	Significant String From URI
HASTAC	Blog Post	Single Artifact URI	Single Page
Hypotheses	Blog Post	Single Artifact URI	Single Page
JoVe	Videos	Single Hop	Single Page+ with Ignores
JSTOR daily	Article	Single Artifact URI	Single Page
Kaggle Datasets	Dataset with Code and Discussion	Multi-Hop	Path-Based with Externals
MethodSpace	Blog Post	Single Artifact URI	Single Page
Nautilus	Article	Single Artifact URI	Single Page
Omeka.net*	Collection Item	Single Artifact URI	Single Page (but Depends on Installation)
Open Science Framework*	Non-web content, e.g., datasets and PDFs	Multi-Hop	Significant String From Filtered Title
PubMed Commons	Discussion	Single Artifact URI	Single Page
PubPeer	Discussion	Single Artifact URI	Single Page
ScienceBlogs	Blog Post	Single Artifact URI	Single Page
Scientopia	Blog Post	Single Artifact URI	Single Page
SciLogs	Blog Post	Single Artifact URI	Single Page
Silk*	Data Visualization and Interaction	Multi-Hop	Path-Based
Slideshare	Slideshow	Multi-Hop	Path-Based
SocialScienceSpace	Blog Post	Single Artifact URI	Single Page
SSRN	Preprint	Single Hop	Single Page+ with Ignores
Story Collider	Audio	Single Artifact URI	Single Page
The Conversation	Article	Single Artifact URI	Single Page
The Open Notebook	Blog Post	Single Artifact URI	Single Page
United Academics	Article	Single Artifact URI	Single Page
Wikipedia	Encyclopedia Article	Single Artifact URI	Single Page
Zenodo	Non-web content	Single Hop	Single Page+ with Ignores

Five entries are marked with an asterisk (*) because they may offer additional challenges.

Omeka.net provides hosting for the Omeka software suite, allowing organizations to feature collections of artifacts and their metadata on the web. Because each organization can customize their Omeka installation, they may add features that make the Single Page heuristic no longer function. Dataverse is similar in this regard. I only reviewed artifacts from Harvard's Dataverse.

Global Biodiversity Information Facility (GBIF) contains datasets submitted by various institutions throughout the world. A crawler can acquire some metadata about these datasets, but the dataset itself cannot be downloaded from these pages. Instead, an authenticated user must request the dataset. Once the request has been processed, the portal then sends an email to the user with a URI indicating where the dataset may be downloaded. Because of this extra step, this additional dataset URI will need to be archived by a human separately. In addition, it will not be linked from content of the other captured artifact URIs.

Dataverse, Open Science Framework, and Silk offer additional challenges. A crawler cannot just use anchor tags to find artifact URIs because some content is only reachable via user interaction with page elements (e.g., buttons, dropdowns, specific <div> tags). Webrecorder.io can handle these interactive elements because a human performs the crawling. The automated system that we are proposing to aide a crawler will not be as successful unless it can detect these elements and mimic the human's actions. CLOCKSS has been working on this problem since 2009 and has developed an AJAX collector to address some of these issues.

Further Thoughts

There may be additional types of artifacts that I did not see on these portals. Those artifacts may require different heuristics. Also, there are many more scholarly portals that have not yet been reviewed, and it is likely that additional heuristics will need to be developed to address some of them. A larger study analyzing the feasibility and accuracy of these heuristics is needed.

From these 36 portals, most artifacts fall into the class of Single Artifact URI. A larger study on the distribution of classes of artifacts could indicate how well existing crawling technology can discover artifact URIs and hence archive complete artifacts.

Currently, a system would need to know which of these heuristics to use based on the portal and type of artifact. Without any prior knowledge, is there a way our system can use the entry page -- including its URI, response headers, and content -- to determine to which artifact type the entry page belongs? From there, can the system determine which heuristic can be used? Further work may be able to develop a more complex heuristic or even an algorithm applicable to most artifacts.

These solutions rely on the entry page for initial information (e.g., URI strings, content). Given any other artifact URI in the artifact, is it possible to discover the rest of the artifact URIs? If a given artifact URI references content that does not contain other URIs -- either through links or text -- then the system will not be able to discover other artifact URIs. If the content of a given artifact URI does contain other URIs, a system would need to determine which heuristic is might apply in order to find the other artifact URIs.

What about artifacts that link to other artifacts? Consider again our example in Fig. 1 where a dataset links to a thesis. A crawler can save those artifact URIs to its frontier and pursue the crawl of those additional artifacts separately, if so desired. The crawler would need to determine when it had encountered a new artifact and pursue its crawl separately with the heuristics appropriate to the new artifact and portal.

Conclusion

I have outlined several heuristics for discovering artifact URIs belonging to an artifact. I also demonstrated that many of those heuristics can already be used with Archive-It. The heuristics offered here require that one know the entry page URI of the artifact and they expect that any system analyzing pages can work with interactive elements. Because portals provide predictable patterns, finding the boundary appears to be a tractable problem for anyone looking to archive a scholarly object.

--Shawn M. Jones

Acknowledgements: Special thanks to Mary Haberle for helping to explain Archive-It scoping rules.

↧

2017-04-26: Discovering Scholars Everywhere They Tread

April 26, 2017, 10:24 am

≫ Next: 2017-06-09: InfoVis Spring 2016 Class Projects

≪ Previous: 2017-04-24: Pushing Boundaries

Though scholars write articles and papers, they also post a lot of content on the web. Datasets, blog posts (like this one), presentations, and more are posted by scholars as part of scholarly communications. What if we could aggregate the content by scholar, instead of by web site?

Why would we want to do this? We can create stories, or collections of a scholar's work in an interface, much like Storify. We can also index this information and create a search engine that allows a user to search by scholar and find all of their work, not just their published papers, as is offered by Scopus or Web of Science, but their web-based content as well. Finally we can archive their work before the ocean of link rot washes it away.

To accomplish our goal, two main questions must be answered: (1) For a given scholar, how do we create a global scholar profile describing the scholar and constructed from multiple sources? (2) How do we locate the scholar's work on the web and use this global scholarly profile to confirm that we have found their work?

Throughout this post I attempt to determine what resources could be used by a hypothetical automated system to build our global scholar profile and then use it to discover user information on scholarly portals. I also review some scholarly portals to determine what resources they provide that can be used with the global scholar profile. Note: our hypothetical system is currently just attempting to find the websites to which scholars post their content; discovering and processing the content itself is a separate issue.

Building a global scholar profile

Abdel-Hafez and Xu provide "A Survey of User Modeling in Social Media Websites". In that paper, they describe that "modeling users will have different methods between different websites". They discuss the work that has been done on constructing a user model from different social media sites, using a rather broad definition of social media that includes blogs and collaborative portals like wikis. They discuss the problems associated with building a user profile from social media, which inspires the term global scholar profile in this post.

They also provide an overview of the "cold start problem" where insufficient starting information is available to begin using a useful user profile. Existing solutions to the cold start problem in recommender systems, such as those by Lika, Kolomvatsos, and Hadjiefthymiades rely on the use of demographic data to create user profiles, which will not be useful for identifying scholars. Instead, we can use some existing sources containing information about scholars.

The EgoSystem project, by Powell, Shankar, Rodriguez, and Van de Sompel, concerned itself with building a global scholarly profile from several sources of scholarly information. It accepts a scholar's name, the university where they earned their PhD, their fields of study, their current affiliation, their current title, and some keywords noting their field of work. Using this information, the system starts with a web search using Yahoo BOSS search API (now defunct) with these input terms and the names of portals, such as LinkedIn, Twitter, and Slideshare. After visiting each page in the search results, the system awards points to a page for each string of data that matches. If the points reach a certain threshold, then the page is considered to be a good match for the scholar and additional data is then acquired via a site API -- or scraped from the web page, if necessary -- and added to the system's knowledge of the scholar for future iterations. This scoring system was insipred by Northern and Nelson's work on disambiguating university students' social media profiles. EgoSystem's data is stored in a graph database for future retrieval and updating, much like the semantic network profiles discussed by Gauch, Speretta, Chandramouli, and Micarelli.

Kramer and Boseman created the Innovations in Scholarly Communication project. As part of that project, they developed a list of 400+ Tools and Innovations in Scholarly Communication. Many of the tools on this list are scholarly portals, places where scholars post content.

Our hypothetical system must first build a global scholar profile that can be tested against content from various scholarly portals. To do so our automated system needs is data about a scholar. Many services exist which index and analyze scholar's published works from journals and conference proceedings. All of this can provide information to be used for disambiguation.

If we have access to all of this information, then we should be able to use EgoSystem's scoring method of disambiguation against scholarly portals. What if we do not yet have this information? Given just a name and an affiliation, from what sources can we construct a global scholar profile?

In the table below, I reviewed the documentation for several sources of information about scholars, based on their published works. In the access restrictions section I document which restrictions I have found for each source. Included in this table is the name of the web service, which data it provides that is useful to identify a scholar, and the access restrictions of the service. I reviewed each service, to determine which fields were available in the output. I did not sign up for any authentication keys, so the data useful for scholar identification comes from each service's documentation. I also only included services that allow one to query by author name.

Service	Data Useful for Scholar Identification	Access Restrictions
arXiv API	Authors and Co-authors Terms from titles Terms from abstracts Terms from documents Affiliations Keywords	None
Clarivate's Web of Science API	Authors and Co-authors Terms from titles Terms from abstracts Terms from documents Affiliations Keywords	Institution Must Be Licensed Additional Restrictions on Data Usage
CrossRef REST API	Authors and Co-authors Terms from titles Affiliations Keywords	None
Elsevier's Scopus API	Authors and Co-authors Terms from titles Terms from abstracts Terms from documents Affiliations Keywords	Institution Must Be Licensed Additional Restrictions on Data Usage
Europe PMC database	Authors and Co-authors Terms from titles Terms from abstracts Terms from documents Affiliations Keywords	None
IEEE Xplore Search Gateway	Authors and Co-authors Terms from titles Terms from abstracts Affiliations Keywords	None
Microsoft Academic Knowledge API	Authors and Co-authors Terms from titles Terms from abstracts Journal/Proceedings Information Affiliations Keywords	Free for 10,000 queries/month, otherwise $0.25 per 1,000 calls
Nature.com OpenSearch API	Authors and Co-authors Terms from titles Links to landing pages	Non-Commercial Use Only All downloaded content must be deleted within 24 hour period Application requires a "Powered by nature.com" logo Requires signing up for authentication key
OCLC WorldCat Identities API	Authors Terms from titles	Non-commercial use only
ORCID API	ORCID Other Identifiers Authors and Co-authors Terms from titles Journal/Proceedings Information Links to landing pages Employment Education Links to additional websites Keywords Biography	None
PLOS API	Authors and Co-authors Terms from titles Terms from abstracts Terms from documents Affiliations Keywords	Rate limited to 10 requests per minute Data must be attributed to PLOS Requires signing up for authentication key
Springer API Service	Terms from Titles Journal/Proceedings Information Keywords Links to landing pages	Requires signing up for authentication key

Some of these services are not free. Microsoft Academic Search API, Elsevier's Scopus, and Web of Science all provide information about scholars and their works, but with limitations and often for a fee. Microsoft Academic Search API has become Microsoft Academic Knowledge API and now limits the user to 10,000 calls per month unless they pay. Scopus API is free of charge, but "full API access is only granted to clients that run within the networks of organizations" with a Scopus subscription. Clarivate's Web of Science API provides access with similar restrictions, "using your institution's subscription entitlements".

There are also restrictions on how a system is permitted to use the data from Web of Science, including which fields can be displayed to the public. Scopus has similar restrictions on text and data mining, which may affect our system's ability to use these sources at all. Furthermore, the Nature.com OpenSearch API requires that any data acquired is refreshed or deleted within a 24-hour period, also making it unlikely to be useful to our system because the data cannot be retained.

Some organizations, such as PubMed Central, offer an OAI-PMH interface that can be used to harvest metadata. Our system can harvest this metadata and provide its own searches. Similarly, other organizations, such as the Hathi Trust Digital Library, offer downloadable datasets of their holdings. Data from API queries is more desirable because it will be more current than data obtained via datasets.

Not all of these sources are equally reliable for discovering information about scholars. For example, a recent study by Klein and Van de Sompel indicates that, in spite of the information scholars can provide about themselves on ORCID, many do not fill in data that would be useful for identification.

Because the global scholar profile is supposed to be the known good information for future disambiguation, the data gathered for the global scholar profile at this stage may need to be reviewed by a human before we trust it. For example, the screenshot below is from Scopus, and shows multiple entries for Herbert Van de Sompel which refer to the same person.

Scopus has multiple entries for "Herbert Van de Sompel".

Discovering Where Scholars Post Their Work

Once we have a global scholar profile for a scholar, we can search for their content on known scholarly portals. Several methods exist to discover hints as to which scholarly portals contain a scholar's content.

Homepages

If we know a scholar's homepage, it might be another potential source of links to additional content produced by that scholar. I decided to see if scholars acted this way. In August-September of 2016, I used Microsoft Academic API to find the homepages of the top 99 researchers from 13 different knowledge domains. From these 1287 scholarly records, they broke down as shown in the table below until I had 733 homepages with a 200 status. For those 733 homepages, I downloaded each homepage and extracted its links.

Total Records	1287
Records without a homepage	133
Homepages Resulting in soft-404s	369
Homepages with connection errors	61
Homepages has too many redirects	1
Homepages with a 200 status	723
Homepages containing one or more URIs from the list of scholarly tools	204

Each link was then compared with the domain name of the tools listed in Kramer and Boseman's 400+ Tools and innovations. Out of 723 homepages 204 (28.2%) contained 1 or more URIs matching a tool from that list. This does indicate that homepages could be used as a source of additional sites that may contain the work of the scholar in question.

Now that Microsoft Academic API has changed its terms, alternatives to finding homepages will be useful. Fang, Si, and Mathur tested several methods of detecting faculty homepages in web search engine results. Their study provides some direction on locating a scholar's work on the web. They used the Yahoo BOSS API to acquire search results. These search results were then evaluated for accuracy using site specific heuristics, logistic regression, SVM, and a joint prediction model. They discovered that the joint prediction model outperformed the other methods.

Social Media Profiles

In addition to scholarly databases, social media profiles may offer additional sources for us to find information about scholars. The social graph in services like Twitter and Facebook provides additional dimensions that can be analyzed.

For example, if we know an institution's Twitter account, how likely is it that a scholar follows this account? If we cannot find a scholar's Twitter account using their institution's Twitter account, can we discover them using link prediction techniques like Schall's triadic closeness.

In addition, there is ample work in discovering researchers on Twitter. For example, Hadgu and Jäschke used Twitter to determine the relationships between computer scientists, reviewing several machine learning algorithms to discover demographic information, topics, and the most influential computer scientists. Instead of looking at the institution's Twitter account as a base for finding computer scientists, they used the Twitter accounts of scientific conferences. Perhaps our hypothetical system can use conference information from a scholar's publication list in this way.

It is also possible that a scholar's social media posts contain links to websites where they post their data. We can use their social media feeds to discover links to scholarly portals and then disambiguate them.

Querying Portals Directly

On top of hints, we can query the portals directly using their native capabilities. Unfortunately, the same capabilities for finding scholars are not available at all portals. To discover these capabilities, I started with Kramer and Boseman's 400+ Tools and Innovations in Scholarly Communication. I sorted the list by number of Twitter followers as a proxy for popularity. I then filtered the list for those tools categorized as Publication, Outreach, or Assessment. Finally, I selected the first 36 non-journal portals for which I could find scholarly output hosted on the portal. I then reviewed the different ways of discovering scholars on these sites.

The table below contains the list of portals used in my review. In order to describe the nature of each site, I have classified them according to the categories used in Kaplan and Haenlein's "Users of the world, unite! The challenges and opportunities of Social Media". The categories used in the table below are:

Social Networking applies to portals that allow users to create connections to other users, usually via a "friend" network or by "following". Examples: MethodSpace, Twitter
Blogs encompasses everything from blog posts to magazine articles to forums. Examples: HASTAC, PubPeer, The Conversation
Content Communities involves portals where users share media, such as datasets, videos, and documents, including preprints. Examples: Figshare, BioRxiv, Slideshare, JoVe
Collaborative Works is reserved for portals where users collaboratively change a single produce, like a wiki page. Examples: Wikipedia

Portal	Kaplan and Haenlein Social Media Classification
Academic Room	Blogs, Content Communities
AskforEvidence	Blogs
Benchfly	Content Communities
BioRxiv	Content Communities
BitBucket	Content Communities
Dataverse	Content Communities
Dryad	Content Communities
ExternalDiffusion	Blogs
Figshare	Content Communities
GitHub	Content Communities, Social Networking
GitLab.com	Content Communities
Global biodiversity information facility Data	Content Communities
HASTAC	Blogs
Hypotheses	Blogs
JoVe	Content Communities
JSTOR daily	Blogs
Kaggle Datasets	Content Communities
Methodspace	Social Networking, Blogs
Nautilus	Blogs
Omeka.net	Content Communities
Open Science Framework	Content Communities
PubMed Commons	Blogs
PubPeer	Blogs
ScienceBlogs	Blogs
Scientopia	Blogs
SciLogs	Blogs
Silk	Content Communities
SlideShare	Content Communities
SocialScienceSpace	Blogs
SSRN	Blogs
Story Collider	Content Communities
The Conversation	Blogs
The Open Notebook	Blogs
United Academics	Blogs
Wikipedia (& Wikimedia Commons)	Collaborative Works
Zenodo	Content Communities

Local Portal Search Engines

I wanted to know if I could find a scholar, by name, in this set of portals using local portal search engines. If such services are present on each portal, then our automated system could submit a scholar's name to the search engine and then scrape the results.

I reviewed whether or not the portal contained profile pages for its users. Profile pages are special web pages that contain user information that can be used to identify the scholar. Contained within a profile page might be the additional information necessary to identify that it belongs to the scholar we are interested in. This is important, because a profile page provides a single resource where we might be able verify that the user has an account on the portal. Without it, our system would need to go through the actual contributions to each portal.

For our 36 portals, 24 contained profile pages. This indicates that 24 portals associate some concept of identity with their content. With the exception of Academic Room, profiles in the portals also provide links to the scholar's contributions to the portal.

This screenshot shows a common case of a search engine that provides profiles in its search results. I have outlined one of the links to a profile in a red box and shown a separate screenshot of the linked profile.

Next, I reviewed each portal to discover if its local search engine, if present, provided profiles as search results. For 13 portals, the local search engine provided profile pages in their search results. This means that I was able to type a scholar's name into the portal's search bar and find their profile page directly linked from the result. In this case, an automated system would only need to scrape the search results pages to find the profile pages. Once the profile pages are acquired, the system can then compare them against what we know about the scholar to determine if the scholar has an identity on that portal. In some cases, a scraper can use pattern matching to eliminate the non-profile URIs from the list of results.

An example of a search engine providing profiles in its results is shown above with Figshare. In this case, searching for "Ana Maria Aguilera-Luque" on Figshare leads to a list of landing pages for uploaded content. Content on Figshare is associated with a user, and a clickable link to that user's profile shows up in the search results under the name of the uploaded content.

This screenshot shows an example of a search engine that does not provide profiles in its search results, even though the portal has profiles. The screenshots are of the search results, following the link to the document, and then following the link from the document to the profile page. Each followed link is outlined in a red box.

Unfortunately, this is not the case for all results. For 4 portals, the profile page is only available if one clicks on a search result link, and then clicks on the profile link from that search result. This increases the complexity of our automated system because now it must crawl through more pages before finding a candidate set of profiles to review.

The figure above shows an example of this case, where searching for "Chiara Civardi" on the magazine web site UA Magazine leads one to a list of articles. Each article contains a link to the profile of its author, thus allowing one to reach the scholar's profile.

This screenshot shows an example of a site that does not provide user profiles at all, but does provide search results if a scholar's name shows up in a document.

For 9 portals, the search results are the only source of information we have for a given scholar on the portal. Because the search results may be based on the search terms in the scholar's name, our automated system must crawl through some subset, possibly all, of the results to determine if the scholar has content on the given portal.

The figure above shows a search for "Heather Cucolo" on the the audio site "The Story Collider" which leads a user to the list of documents containing that string. Our automated system would need to review the content of the linked pages to determine if the Heather Cucolo we were searching for had content posted on this site.

And for 10 portals, the local search engine was not successful or did not exist. In these cases I had to resort to use a web search engine -- I used Google -- to find a profile page or content belonging to the scholar. I did so using the site search operator and the name of the scholar.

The table below shows the results of my attempt to manually find a scholar's work on each of the 36 portals.

Portal	Profiles Exist?	How did I find portal content based on actual scholar's name?	How did I get from local search results to profile page?
Academic Room	Yes	Web Search
AskforEvidence	No	Local Search	No profile, only search results
Benchfly	Yes	Web Search
BioRxiv	No	Local Search	No profile, only search results
BitBucket	Yes	Web Search
Dataverse	No	Local Search	No profile, only search results
Dryad	No	Local Search	No profile, only search results
ExternalDiffusion	No	Local Search	No profile, only search results
Figshare	Yes	Local Search	Profile pages in results
GitHub	Yes	Local Search w/ Special Settings	Profile pages in results, if correct search used
GitLab.com	Yes	Web Search
Global biodiversity information facility Data	Yes	Local Search	Click on result, Profile linked from result page
HASTAC	Yes	Local Search	Profile pages in results
Hypotheses	Yes	Web Search
JoVe	Yes	Local Search	Profile pages in results
JSTOR daily	Yes	Local Search	Click on result, Profile linked from result page
Kaggle Datasets	Yes	Local Search	Profile pages in results
Methodspace	Yes	Local Search	Profile pages in results
Nautilus 3 sentence science	No	Local Search	No profile, only search results
Omeka.net	No	Web Search
Open Science Framework	Yes	Local Search	Profile pages in results
PubMed Commons	No	Web Search
PubPeer	No	Local Search	No profile, only search results
ScienceBlogs	Yes	Local Search	Profile pages in results
Scientopia	Yes	Local Search	Profile pages in results
SciLogs	Yes	Web Search
Silk	No	Web Search
SlideShare	Yes	Local Search	Click on result, Profile linked from result page
SocialScienceSpace	Yes	Local Search	Profile pages in results
SSRN	Yes	Local Search	Profile pages in results
Story Collider	No	Local Search	No profile, only search results
The Conversation	Yes	Local Search	Profile pages in results
The Open Notebook	Yes	Web Search
United Academics	Yes	Local Search	Click on result, Profile linked from result page
Wikipedia (& Wikimedia Commons)	Yes	Local Search w/ Special Settings	Profile pages in results, if correct search used
Zenodo	No	Local Search	No profile, only search results

Portal Web APIs

The result pages of local search engines must be scraped. A web API might provide structured data that can be used to effectively find the work of the scholar.

To search for web APIs for each portal, I used the following method:

Look for the terms "developers", "API", "FAQ" on the main page of each portal. If present, follow those links to determine if the resulting resource contained further information on an API.
Use the local search engine to search for these terms
Use Google search with the following queries

site:<hostname> rest api
site:<hostname> soap api
site:<hostname> api
site:<hostname> developer
<hostname> api
<hostname> developer

Using this method, I could only find evidence of web APIs for 14 of the 36 portals. PubPeer's FAQ states that they have an API, but they request that API users contact them for more information, and I could not find their documentation online. I included PubPeer in this count, but was unable to review its documentation.

By reviewing the public API documentation, I was able to confirm that a search for scholars by name on 5 of the portals allowed one to match names to strings in multiple API fields. For example, the Dataverse API allows one to search for a string in multiple fields. The example response in the documentation is for the search term "finch", which does return a result containing an author name of "Finch, Fiona".

Like most software, some of these APIs were continuing to add functionality. For example, the current version of Zenodo's REST API allows users to deposit data and metadata. The beta version of this API provides the ability to "search published records", but this functionality is not yet documented. This functionality is expected to be available "in the autumn". Zenodo also provides an OAI-PMH interface, so a system could conceivably harvest metadata about all Zenodo records and perform its own searches for scholars.

Other APIs were did not provide the ability to search for users based on identity. Much like its local search engine, BitBucket's API requires that one know the id of the user before querying, which does not help us find scholars on their site. Omeka.net has an API, but Omeka.net contains many sites running the Omeka software. The users of these sites do not necessarily enable their API. Regardless, Omeka's API documentation states that "users cannot be browsed". I was uncertain if this applied to search queries as well, but found no evidence in the documentation that they supported search of users, even as keywords.

Below are the results of my review of all 36 portals. It is possible that some of the portals marked "No" actually contain an API, but I was unable to find its documentation or evidence of it using the method above.

Portal	Evidence of API Found
Academic Room	No
AskforEvidence	No
Benchfly	No
BioRxiv	No
BitBucket	Yes
Dataverse	Yes
Dryad	Yes
ExternalDiffusion	No
Figshare	Yes
GitHub	Yes
GitLab.com	Yes
Global biodiversity information facility Data	Yes
HASTAC	No
Hypotheses	No
JoVe	No
JSTOR daily	No
Kaggle Datasets	No
Methodspace	No
Nautilus 3 sentence science	No
Omeka.net	Yes
Open Science Framework	Yes
PubMed Commons	No
PubPeer	Yes
ScienceBlogs	No
Scientopia	No
SciLogs	No
Silk	Yes
SlideShare	Yes
SocialScienceSpace	No
SSRN	No
Story Collider	No
The Conversation	No
The Open Notebook	No
United Academics	No
Wikipedia (& Wikimedia Commons)	Yes
Zenodo	Yes

Web Search Engines

If local portal search engines and web APIs are ineffective, we can use web search engines, much like EgoSystem and Yi Fang's work. As noted above, I did need to use web search engines to find profiles for some users when the local portal search engine was either unsuccessful or nonexistent. Depending on the effectiveness of these site-specific services, web search engines may also be useful in lieu of the local search engine or API.

The table below shows four popular search engines, what data is available via their API, and what restrictions any system will encounter with each. As noted before, the Yahoo Boss API no longer exists, but is included because Yahoo! is a well known search engine. DuckDuckGo's Instant Answers API does not provide full search results due to digital rights issues, focusing on areas of topics, categories, and disambiguation. It is focused on topics, so "most deep queries (non topic names) will be blank". This leaves Bing and Google as the offerings that may help us, but they have restrictions on the number of times they can be accessed before limiting occurs.

Search Engine	Data available via API	Restrictions
Bing	Links to Search Results Date Published	Free for 1K calls per month up to 3 months
DuckDuckGo	Topic Summaries Links to Some Search Results No full search results	Rate limited
Google	Search Results	100 queries per day for free $5 / 1000 queries up to 10K queries per day
Yahoo!	Search Results	Defunct as of March 31, 2016

Queries would likely be of a form like that use with EgoSystem, e.g., "LinkedIn+Marko+Rodriguez".

Because web search engines can return a large number of results, our hypothetical system would need to have limits on the number of results that it reviews. It would also need to determine the best queries to use for generating results for a given portal.

Crawling the Portal and Building Our Own Search Engine

If using web search is cost prohibitive or ineffective, we can potentially crawl the sites ourselves and produce our own search engine.

I evaluated each portal to determine if the website served a robots.txt file from its root directory in compliance with the Robots Exclusion Protocol. Using this file, the portal indicates to a search engine which URI paths it does not wish to have crawled using the keyword "disallow". Because the disallow applies only to certain paths or even certain crawlers, it may not apply to our hypothetical system. I discovered that 29 out of 36 portals have a robots.txt.

Portals may also have a sitemap exposing information about which URIs are available to the crawler. A link to a sitemap can be stored in the robots.txt. Sitemaps are also located at different paths on the portal. For example, http://www.example.com/path1/sitemap.xml is a sitemap that applies to the path /path1/ and will not contain information for URIs containing the string http://www.example.com/path2. I only examined if sitemaps were listed in the robots.txt or existed at the root directory for each portal. I discovered that 11 portals listed a sitemap in their robots.txt and 12 portals had a sitemap.xml or sitemap.xml.gz in their root directory.

The results of my review of these portals is shown below.

Portal	Robots.txt present	Sitemap in robots.txt	Sitemap in root level directory
Academic Room	Yes		Yes
AskforEvidence	Yes
Benchfly	Yes	Yes	Yes
BioRxiv	Yes	Yes	Yes
BitBucket	Yes
Dataverse
Dryad	Yes	Yes
ExternalDiffusion
Figshare
GitHub	Yes
GitLab.com	Yes
Global biodiversity information facility Data	Yes
HASTAC	Yes		Yes
Hypotheses	Yes		Yes
JoVe	Yes	Yes
JSTOR daily			Yes
Kaggle Datasets
Methodspace	Yes	Yes	Yes
Nautilus 3 sentence science	Yes	Yes	Yes
Omeka.net	Yes
Open Science Framework	Yes
PubMed Commons	Yes	Yes
PubPeer	Yes
ScienceBlogs	Yes		Yes
Scientopia	Yes	Yes	Yes
SciLogs	Yes
Silk	Yes
SlideShare	Yes	Yes
SocialScienceSpace	Yes
SSRN	Yes
Story Collider	Yes	Yes	Yes
The Conversation	Yes	Yes	Yes
The Open Notebook	Yes
United Academics
Wikipedia (& Wikimedia Commons)	Yes
Zenodo

It is likely that portals with such technology in place will already be well indexed by search engines.

Next Steps

In searching for information sources to feed our hypothetical system, I discovered some sources of information that can be used as a scholarly footprint. I conducted an evaluation of the documentation for these systems, with an eye on what information they provide, but a more extensive evaluation of many of these systems is needed. Other portals, such as ResearchGate and Academia.edu, were not evaluated, but may be useful data sources as well. How often do scholars put useful data in their Twitter, Facebook, or other social media profiles? Also, what can be done to remove human review from the process of generating and verifying a global scholar profile?

Some portals have multiple options when it comes to determining if a scholar has posted work there. Many have local portal search engines that we can use, but I anecdotally noticed that some local search engines are more precise than others when it comes to their results. Within the context of finding the work of a given scholar, a review of the precision and recall of the search engines on these portals might help determine if a web search engine is a better choice than the local search engine for a given portal.

Open access journals, such as PLOS, have requirements that authors have posted data online, in sites such as Figshare and Open Science Framework. If we know that a scholar published in an open access journal, can we search one of their journal articles for links to these datasets and hence find the scholar's profile pages on sites such as Figshare and Open Science Framework?

Much like the APIs used for the global scholar profile, the APIs for each portal will need to be evaluated for precision and usefulness to our system. Some provide the ability to search for users, but others only provide the ability to find the work of a user if the scholar's user ID is already known.

Preliminary work using web search engines shows promise, but may also require a study to determine how to most effectively build queries that discover scholars on given portals. Such a study would also need to determine the ideal number of search engine results to review before our system stops trying to find the scholar on a portal using this method.

I evaluated 36 portals to determine if they contained robots.txt and sitemaps to help search engines crawl them. If a portal utilizes these items, do they have a good ranking for our search engine queries when trying to find a scholar by name and portal? How many of the portals lacking these items have poor web search engine ranking?

Ahmed, Low, Aly, and Josifovski studied the dynamic nature of user profiles, modeling how user interests change over time. Gueye, Abdessalem, and Naacke attempted to account for these changes when building recommendation systems. With this evidence that user information changes, how often does information about a scholar change? Scholars publish new work and upload new data. In some cases, such as Figshare, a scholar may post a dataset and never return, but other sites, like United Academics, may feature frequent updates. How often should our hypothetical system refresh its information about scholars?

Our case of disambiguating scholars is a subset of the larger problem of entity resolution as a whole: are the records for two items referring to the same item? Lise Getoor and Ashwin Machanavajjhala provide an excellent tutorial on the larger problem of entity resolution. They note that the problem has become even more important to solve with the web providing multiple information from heterogeneous sources. Their summary mentions that different matching techniques work better for comparing certain fields, such as the idea that similarity measures like Jaccard and Cosine similarity work well for text and keywords, but not necessarily for names, where Jaro-Winkler has better performance. In addition, they cover the use of machine learning and crowdsourcing as ways to augment the simple matching of field contents to one another. Which parts of the global scholar profile are useful for disambiguation/entity resolution? In addition to the scholar, will other entities in the profile need to be resolved as well? What matching techniques are most accurate for each part of the global scholar profile?

Buccafurri, Lax, Nocera, and Ursino attempted to solve the problem of connecting users across social networks, a concept they referred to as "social internetworking". Recognizing that the detection of the same account on different networks is related to the concept of link prediction, they offer an algorithm that takes into account the similarity between user names and the similarity of common neighbors. How many scholarly portals can make use of a social graph information?

Lops, Gemmis, Semeraro, Musto, Narducci, and Bux build profiles for use in recommender systems with specific focus on solving two problems. The first is polysemy, where a term can have multiple meanings. Next is synonymy, where many terms have the same meaning. Their work focused on associating tags with users and constructing recommendations based on the terms encountered. For our system, polysemy will need to be investigated because the same term will mean different things in different disciplines (e.g., port means something different to a computer hardware engineer than a network engineer). Our system may become even more confused if presented terms from interdisciplinary scholars. However, unlike recommender systems, the system will use more than just terms for disambiguation, relying on other, less ambiguous data like affiliations and email addresses. Thus are polysemy and synonymy issues that our system needs to resolve? With which parts of the global scholar profile (i.e., fields) are they needed?

For the fields we have chosen to identify scholars, which matching techniques are most accurate? As noted before, some algorithms work better for names and userids than for keywords. What algorithms, including machine learning, network analysis, and probabilistic soft logic might best match some of our fields? Do they vary between scholarly portals?

Not all scholars may want to have their online work discovered in this way. What techniques can be employed to allow some scholars to opt-out?

Summary

Searching for scholars is slightly easier than searching for other online identities because, by their very nature, scholars produce content that can be useful for disambiguation. In this article, I reviewed sources of data that can be used by a hypothetical automated system seeking to identify the websites to which scholars have posted content. I provided a listing of different services that might help one build a global scholar profile that can be further used to disambiguate scholars online.

In order to discover where scholars post items online, I looked at scholar homepages, social media profiles, and the services offered by the portals themselves. After sampling homepages from Microsoft Academic Search API I found that 28.2% of the homepages contained links to websites on Kramer and Boseman's list of 400+ Tools and innovations.

I reviewed capabilities of 38 scholarly portals themselves, discovering that, in some cases, local portal search engines can be used to locate content for a scholar. Any automated tool would need to scrape these results, and getting to a scholar's information on the site falls into one of three patterns. To find an alternative to scraping, I also discovered APIs for a number of portals, but could only determine if scholars can be searched for on 5 of them.

To augment or replace the searching capabilities of scholarly portals, I examined the capabilities of search engine APIs and discovered the cost associated with each. As a few existing research projects looking for scholarly information (e.g., EgoSystem) make use of search engine APIs, I wanted to shows that this was still a viable option.

So, sources do exist for building the global scholar profile and methods exist at known scholarly portals to find the works of scholars at each portal. Evaluating solutions for disambiguation will be the next key step to finding their footprints in the portals in which they tread.

--Shawn

↧