
2017-11-20: Dodging the Memory Hole 2017 Trip Report

It was rainy in San Francisco, but that did not deter those of us attending Dodging the Memory Hole 2017 at the Internet Archive. We engaged in discussions about a very important topic: the preservation of online news content.


Keynote: Brewster Kahle, founder and digital librarian for the Internet Archive

Brewster Kahle is well known in digital preservation and especially web archiving circles. He founded the Internet Archive in May 1996. The WS-DL and LANL's Prototyping Team collaborate heavily with those from the Internet Archive, so hearing his talk was quite inspirational.




We are familiar with the Internet Archive's efforts to archive the Web, visible mostly through the Wayback Machine, but the goal of the Internet Archive is "Universal Access to All Knowledge", something that Kahle equates to the original Library of Alexandria or putting humans on the moon. To that end, he highlighted many initiatives by the Internet Archive in pursuit of this goal. He mentioned that the contents of a book take up roughly a megabyte, so the works of the Library of Congress can be stored digitally in about 28 terabytes. Digitizing them is another matter, but it is completely doable, and by digitizing them we remove restrictions on access due to distance and other factors. Why stop with documents? There are many other types of content. Kahle highlighted the Internet Archive's efforts to make television content, video games, audio, and more available. They also have a lending program whereby users can borrow books, which are digitized using book scanners. He stressed that, because of its mission to provide content to all, the Internet Archive is indeed a library.



As a library, the Internet Archive also becomes a target for governments seeking information on the activities of their citizens. Kahle highlighted one incident in which the FBI sent a letter demanding information from the Internet Archive. Thanks to help from the Electronic Frontier Foundation, the Internet Archive sued the United States government and won, defending the rights of those using their services.



Kahle emphasized that we can all help preserve the web by helping the Internet Archive build its holdings of web content. The Internet Archive provides a form with a simple "save page now" button, but they also support other methods of submitting content.



Contributions from Los Alamos National Laboratory (LANL) and Old Dominion University (ODU)


Martin Klein from LANL and Mark Graham from the Internet Archive




Martin Klein presented work on Robust Links. Martin briefly reviewed motivating work he had done with Herbert Van de Sompel at Los Alamos National Laboratory, mentioning the problems of link rot and content drift, the latter of which I have also worked on.
He covered how one can create links that are robust by:
  1. submitting a URI to a web archive
  2. decorating the link HTML so that future users can reach archived versions of the linked content
For the first item, he talked about how one can use tools like the Internet Archive's "Save Page Now" button as well as WS-DL's own ArchiveNow. The second item is covered by the Robust Links specification. Mark Graham, Director of the Wayback Machine at the Internet Archive, further expanded upon Martin's talk by describing how the Wayback Extension also provides the capability to save pages, navigate the archive, and more. It is available for Chrome, Safari, and Firefox, and is shown in the screenshots below.
A screenshot of the Wayback Extension in Chrome.
A screenshot of the Wayback Extension in Safari. Note the availability of the option "Site Map", which is not present in the Chrome version
A screenshot of the Wayback Extension in Firefox. Note how there is less functionality.


Of course, the WS-DL efforts of ArchiveNow and Mink augment these preservation efforts by submitting content to multiple web archives, including the Internet Archive.



One of the most profound revelations from Martin and Mark's talk was that URIs are addresses, not the content that was on the page at the moment you read it. I realize that efforts like IPFS are trying to use hashes to address this dichotomy, but the web has not yet migrated to them.

Shawn M. Jones from ODU




I presented a lightning talk highlighting a blog post from earlier this year where I try to answer the question: where can we post stories summarizing web archive collections? I talked about why storytelling works as a visualization method for summarizing collections and then evaluated a number of storytelling and curation tools with the goal of finding those that best support this visualization method.


Selected Presentations


I tried to cover elements of all presentations while live tweeting during the event, and wish I could go into more detail here, but, as usual, I will only cover a subset.

Mark Graham discussed the Internet Archive's relationship with online news content. He highlighted a report by Rachel Maddow in which she used the Internet Archive to recover tweets posted by former US National Security Advisor Michael Flynn, thus incriminating him. He talked about other efforts, such as NewsGrabber, Archive-It, and the GDELT project, which all further archive online news or provide analysis of archived content. Most importantly, he covered "News At Risk": content that has been removed from the web by repressive regimes, further emphasizing the importance of archiving it for future generations. In that vein, he discussed the Environmental Data & Governance Initiative, set up to archive environmental data from government agencies after Donald Trump's election.

Ilya Kreymer and Anna Perricci presented their work on Webrecorder, web preservation software hosted at webrecorder.io. An impressive tool for "high fidelity" web archiving, Webrecorder allows one to record a web browsing session and save it to a WARC. Kreymer demonstrated its use on a CNN news web site with an embedded video, showing how the video was captured along with the rest of the content on the page. The webrecorder.io platform allows users to record using their native browser or to choose from a few other browsers and configurations in case the user agent plays a role in the quality of recording or playback. For offline use, they have also developed Webrecorder Player, with which one can play back WARCs without requiring an Internet connection. Anna Perricci said that it is perfect for browsing a recorded web session on an airplane. Contributors to this blog have written about Webrecorder before.

Katherine Boss, Meredith Broussard, Fernando Chirigati, and Rémi Rampin discussed the problems surrounding the preservation of news apps: interactive content on news sites that allows readers to explore data collected by journalists on a particular issue. Because of their dynamic nature, news apps are difficult to archive. Unlike static documents, they cannot be printed or merely copied. They often consist of client- and server-side code developed without a focus on reproducibility. Preserving news apps often requires the assistance of the organization that created the news app, which is not always available. Rémi Rampin noted that, for those organizations that were willing to help, their group has had success using the research reproducibility tool ReproZip to preserve and play back news apps.

Roger Macdonald and Will Crichton provided an overview of the Internet Archive's efforts to provide information from TV news. They have employed the Esper video search tool as a way to explore their collection. Because it is difficult for machines to derive meaning from the pixels within videos, they used captioning to provide for effective searching and analysis of the TV news content at the Internet Archive. Their goal is to allow search engines to connect fact checking to TV media. To this end, they employed facial recognition on hours of video to find content where certain US politicians were present. From there one can search for a politician and see where they have given interviews on news channels such as CNN, BBC, and Fox News. They are also exploring identifying the body position of each person in a frame, which might make it possible to answer queries such as "find every video where a man is standing over a woman". The goal is to make video as easy as text to search for meaning.

Maria Praetzellis highlighted a project named Community Webs that uses Archive-It. Community Webs provides libraries the tools necessary to preserve news and other content relevant to their local communities. Through Community Webs, local public libraries receive education and training, help with collection development, and archiving services and infrastructure.

Kathryn Stine and Stephen Abrams presented the work done on the Cobweb Project. Cobweb provides an environment where many users can collaborate to produce seeds that can then be captured by web archiving initiatives. If an event is unfolding and news stories are being written, the documents containing these stories may change quickly, thus it is imperative for our cultural memory that these stories be captured as close to publication as possible. Cobweb provides an environment for the community to create a collection of seeds and metadata related to one of these events.
Matthew Weber shared some results from the News Measures Research Project. This project started as an attempt to "create an archive of local news content in order to assess the breadth and depth of local news coverage in the United States". The researchers were surprised to discover that local news in the United States covers a much larger area than expected: 546 miles on average. Most areas are "woefully underserved". Consolidation of corporate news ownership has led to fewer news outlets in many areas and the focus of these outlets is becoming less local and more regional. These changes are of concern because the press is important to the democratic processes within the United States.

Social



As usual, I met quite a few people during our meals and breaks. I appreciated talks over lunch with Sativa Peterson of the Arizona State Library and Carolina Hernandez of the University of Oregon. It was nice to discuss the talks and their implications for journalism with Eva Tucker of Centered Media and Barrett Golding of Hearing Voices. I also appreciated feedback and ideas from Ana Krahmer of the University of North Texas, Kenneth Haggerty of the University of Missouri, Matthew Collins of the University of San Francisco Gleeson Library, Kathleen A. Hansen of the University of Minnesota, and Nora Paul, retired director of the Minnesota Journalism Center. I was especially intrigued by discussions with Mark Graham on using storytelling with web archives, Rob Brackett of Brackett Development, who is interested in content drift, and James Heilman, who works on WikiProject Medicine with Wikipedia.


Summary


Like last year, Dodging the Memory Hole was an inspirational conference highlighting current efforts to save online news. Having it at the Internet Archive further provided expertise and stimulated additional discussion on the techniques and capabilities afforded by web archives. Pictures of the event are available on Facebook. Video coverage is broken up into several YouTube videos: Day 1 before lunch, Day 1 after lunch, Day 2 before lunch, Day 2 after lunch, and lightning talks. DTMH highlights the importance of news in an era of a changing media presence in the United States, further emphasizing that web archiving can help us fact-check statements so we can hold onto a record of not only how we got here, but also guide where we might go next. -- Shawn M. Jones

2017-11-22: Deploying the Memento-Damage Service





Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, provide archived web pages, or mementos, for us to use. Nowadays, web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. This raises the question of how to objectively measure the damage to a memento in a way that correctly emulates user (human) perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes how the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, while a second page could be missing a single embedded video (e.g., a YouTube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page has lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has proven capable of measuring damage, it is not user-ready. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype was built in the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color values (RGB) of each pixel in the screenshot into a file, which the prototype then loads for further analysis. Dumping and reading the pixel colors of the screenshot takes a significant amount of time and is repeated once for each stylesheet the web page has. Therefore, if a web page has 5 stylesheets, the analysis is repeated 5 times even though it uses the same screenshot as the basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided; users must discover it themselves by identifying the errors that appear during execution. Furthermore, we needed to 'package' and deploy this prototype into a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service: the website, the REST API, the Docker image, and the Python library, all described below.
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code = 301, 302, or 307).
  2. Providing some insights and information.
    The user not only gets the final damage value but is also informed about the details of the crawling and calculation process, as well as the components that make up the final damage value. If an error happens, the error information is also provided.
  3. Dealing with overlapping resources and iframes.

Measuring the Damage

Crawling the Resources
When a user inputs a URI-M into the Memento-Damage service, the tool will check the content-type of the URI-M and crawl all resources. The properties of the resources, such as size and position of an image, will be written into a log file. Figure 2 summarizes the crawling process conducted in Memento-Damage. Along with this process, a screenshot of the website will also be created.

Figure 2. The crawling process in Memento-Damage

Calculating the Damage
After crawling the resources, Memento-Damage will start calculating the damage by reading the log files that were previously generated (Figure 3). Memento-Damage will first read the network log and examine the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it will chase down the final URI by following the URI in the Location header, as depicted in Figure 4. Each resource will be processed according to its type (image, css, javascript, text, iframe) to obtain its actual and potential damage value. Then, the total damage is computed using the formula:
$$ T_D = \frac{A_D}{P_D} $$
where:
$$ T_D = \text{Total Damage}, \quad A_D = \text{Actual Damage}, \quad P_D = \text{Potential Damage} $$

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where the subscripts denote image (i), css (c), javascript (j), multimedia (m), text (t), and iframe (f), respectively.
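As a rough illustration of this aggregation, the sketch below accumulates actual and potential damage over a set of resources (a simplified, hypothetical rendering; the resource representation is an assumption, not the actual Memento-Damage code):
# Simplified sketch of the total damage ratio described above.
def total_damage(resources):
    # resources: iterable of dicts like {"missing": bool, "weight": float},
    # where "weight" is the resource's contribution to potential damage
    actual = 0.0
    potential = 0.0
    for r in resources:
        potential += r["weight"]       # P_D: damage if this resource were missing
        if r["missing"]:
            actual += r["weight"]      # A_D: damage actually incurred
    return actual / potential if potential else 0.0

# Two missing icons matter far less than one large embedded video:
print(total_damage([
    {"missing": True, "weight": 0.01},    # icon
    {"missing": True, "weight": 0.01},    # icon
    {"missing": False, "weight": 0.60},   # embedded video, still present
]))  # roughly 0.03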
For image analysis, we use Pillow, a Python imaging library that performs better and faster than PerlMagick. Pillow can read the pixels in an image without dumping them to a file, which speeds up the analysis process. Furthermore, we modified the algorithm so that we only need to run the analysis script once for all stylesheets.
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources, such as the one illustrated in Figure 5, need to be treated differently to prevent the damage value from being double-counted. To solve this problem, we introduced the concept of a rectangle (Rectangle = xmin, ymin, xmax, ymax). We treat the overlapping resources as rectangles and calculate the size of their intersection area. The area of one of the overlapping resources is reduced by the intersection size, while the other is counted in full. Figure 6 and Listing 1 illustrate the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
def rectangle_intersection_area(a, b):
    # width and height of the overlap between rectangles a and b
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0):
        return dx * dy
    return 0  # no overlap
Listing 1. Measuring the image rectangle intersection in Memento-Damage
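For illustration, the rectangles can be represented with a simple named tuple and the correction applied as follows (the Rectangle type and field names here are assumptions for the sketch, not the tool's actual data structures):
from collections import namedtuple

Rectangle = namedtuple("Rectangle", "xmin ymin xmax ymax")

a = Rectangle(0, 0, 100, 100)    # first resource, counted in full
b = Rectangle(50, 50, 150, 150)  # second resource, overlaps with a

overlap = rectangle_intersection_area(a, b)               # 2500
area_a = (a.xmax - a.xmin) * (a.ymax - a.ymin)            # 10000
area_b = (b.xmax - b.xmin) * (b.ymax - b.ymin) - overlap  # 7500, the overlap is subtracted once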

Dealing with Iframes

Dealing with iframes is quite tricky and requires some customization. First, by default, the crawling process cannot access content inside an iframe using native JavaScript or jQuery selectors due to cross-domain restrictions. This problem becomes more complicated when the iframe is nested inside other iframe(s). Therefore, we need a way to switch from the main frame to the iframe; to handle this, we utilize the API provided by PhantomJS that facilitates switching from one frame to another. Second, the location properties of the resources inside an iframe are calculated relative to that particular iframe's position, not to the main frame's position, which could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation that takes into account the position of its parent frame(s).
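As a simplified illustration of that nested calculation (the objects and attribute names below are hypothetical, not Memento-Damage's internal API):
def absolute_position(resource, enclosing_frames):
    # resource: has .left/.top relative to its own iframe
    # enclosing_frames: chain of parent iframes, innermost first,
    #                   each with .left/.top relative to its own parent
    left, top = resource.left, resource.top
    for frame in enclosing_frames:
        left += frame.left
        top += frame.top
    return left, top  # position relative to the main frame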

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website gives the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating the damage of a large number of URI-Ms. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API is part of the web service; it facilitates damage calculation from any HTTP client (e.g., a web browser, cURL, etc.) and returns its output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using the REST API, a user can create a script and calculate the damage of a small number of URIs (e.g., 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage Docker image uses Ubuntu with LXDE as the desktop environment. A fixed desktop environment is used to avoid inconsistent damage values for the same URI when run on machines with different operating systems. We found that PhantomJS, the headless browser used for generating the screenshot, renders the web page in accordance with the machine's desktop environment. Hence, the same URI could produce a slightly different screenshot, and thus a different damage value, when run on different machines (Figure 8).


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these steps:
  1. Pull the docker image:
    docker pull erikaris/memento-damage
  2. Run the docker image:
    docker run -it -p <host-port>:80 --name <container-name> erikaris/memento-damage

    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    If the user wants to work from inside the Docker container's terminal, use the following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring Memento-Damage using the various options (Figure 9), which can be obtained by typing
    docker exec -it memdamage memento-damage --help
    or, if the user is already inside the Docker container, simply type:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage are available at http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is similar to that of the Docker version. It is aimed at people who already have all the dependencies (PhantomJS 2.x and Python 2.7) installed on their machine and do not want to bother with installing Docker. The latest library version can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run Memento-Damage as a web service:
    memento-damage-server
  3. Run Memento-Damage via the CLI:
    memento-damage <URI>
  4. Explore the available options using --help:
    memento-damage-server --help    (for the web service)
    or
    memento-damage --help           (for the CLI)

Testing on a Large Number of URIs

To prove our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The test used the Docker version of Memento-Damage, run on a machine with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz and 4 GiB of memory. The testing results are shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

Of the 108,511 input URI-Ms tested, 80,580 were successfully processed while the remaining 27,931 failed. The failures on those 27,931 URI-Ms happened because of the Internet Archive's limits on concurrent access. On average, one URI-M needs 32.5 seconds of processing time. This is roughly 110 times faster than the prototype version, which takes an average of 1 hour to process one URI-M; in some cases, the prototype even took almost 2 hours to process a single URI-M.
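Although our test drove the Docker CLI, a similar batch run could also be scripted against the REST API described earlier; the sketch below is a hypothetical driver, not the exact testing script we used:
import requests

API = 'http://memento-damage.cs.odu.edu/api/damage/'

def damage_report(urims):
    results = {}
    for urim in urims:
        try:
            resp = requests.get(API + urim, timeout=300)
            results[urim] = resp.json()   # keep the full JSON response
        except Exception:                 # timeouts, rate limiting, bad JSON, etc.
            results[urim] = None
    return results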

From the successfully processed URI-Ms, we created some visualizations to help us better understand the results, as seen below. The first graph (Figure 11) shows the average number of missing embedded resources per memento per year, according to the damage value (Dm) and the proportion of missing resources (Mm). The most interesting case appeared in 2011, where the Dm value was significantly higher than Mm. This means that although, on average, the URI-Ms from 2011 only lost 4% of their resources, those losses caused four times more damage than the Mm number suggests. On the other hand, in 2008, 2010, 2013, and 2017, the Dm value is lower than Mm, which implies that those missing resources are less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the total number of resources in each URI-M and how many of them are missing. The x-axis represents each URI-M, sorted in descending order by the number of resources, while the y-axis represents the number of resources in each URI-M. This graph shows that almost every URI-M lost at least one of its embedded resources.

Summary

In this research, we have improved the method for measuring the damage to a memento (URI-M) based on the earlier prototype. The improvements include reducing calculation time, fixing various bugs, handling redirection, and supporting new types of resources. We developed Memento-Damage into a comprehensive tool that can show the details of every resource contributing to the damage. Furthermore, it provides several ways of using the tool, such as the Python library and the Docker version. The testing results show that Memento-Damage works faster than the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements we made in Memento-Damage compared to the initial prototype.

No | Subject                  | Prototype                               | Memento-Damage
1  | Programming Language     | Javascript + Perl                       | Javascript + Python
2  | Interface                | CLI                                     | CLI, Website, REST API
3  | Distribution             | Source Code                             | Source Code, Python library, Docker
4  | Output                   | Plain Text                              | Plain Text, JSON
5  | Processing time          | Very slow                               | Fast
6  | Includes IFrame          | NA                                      | Available
7  | Redirection Handling     | NA                                      | Available
8  | Resolve Overlap          | NA                                      | Available
9  | Blacklisted URIs         | Only 1 blacklisted URI, added manually  | Several new blacklisted URIs, identified based on a certain pattern
10 | Batch execution          | Not supported                           | Supported
11 | DOM selector capability  | Only supports simple selection queries  | Supports complex selection queries
12 | Input filtering          | NA                                      | Only processes input in HTML format
Table 1. Improvements in Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs many improvements to increase its functionality and provide a better user experience. We strongly encourage everyone, especially people who work in the web archiving field, to try this tool and give us feedback. Please read the FAQ and HELP before starting to use Memento-Damage. Help us improve by telling us what we can do better: post any bugs, errors, issues, or difficulties that you find in this tool on our GitHub.

- Erika Siregar -

2017-12-03: Introducing Docker - Application Containerization & Service Orchestration


For the last few years, Docker, the application containerization technology, has been gaining a lot of traction in the DevOps community, and lately it has made its way into the academic and research communities as well. I have been following it since its inception in 2013, and for the last couple of years it has been a daily driver for me. At the same time, I have been encouraging my colleagues to use Docker in their research projects. As a result, we are gradually moving away from one virtual machine (VM) per project to a swarm of nodes running containers of various projects and services. If you have accessed MemGator, CarbonDate, Memento Damage, Story Graph, or some other WS-DL services lately, you have been served from our Docker deployment. We even have an on-demand PHP/MySQL application deployment system using Docker for the CS418 - Web Programming course.



Last summer, Docker Inc. selected me as the Docker Campus Ambassador for Old Dominion University. While I had already given some Docker talks to more focused groups, with the campus ambassador hat on I decided to organize an event from which grads and undergrads of the Computer Science department at large could benefit.


The CS department accepted it as a colloquium, scheduled for Nov 29, 2017. We were anticipating about 50 participants, but many more showed up. The increasing interest of students in containerization technology can be taken as an indicator of its usefulness, and perhaps it should be included in some courses offered in the future.


The session lasted for a little over an hour. It started with some slides motivating Docker with a Dockerization story and a set of problems that Docker can potentially solve. The slides then introduced some basics of Docker and illustrated how a simple script can be packaged into an image and distributed using DockerHub. The presentation was followed by a live demo of the step-by-step evolution of a simple script into a multi-container application using a micro-service architecture, demonstrating various aspects of Docker in each step. Finally, the session was opened for questions and answers.


For the purpose of illustration I prepared an application that scrapes a given web page to extract links from it. The demo code has folders for various steps as it progresses from a simple script to a multi-service application stack. Each folder has a README file to explain changes from the previous step and instructions to run the application. The code is made available on GitHub. Following is a brief summary of the demo.

Step 0

Step 0 has a simple linkextractor.py Python script (as shown below) that accepts a URL as an argument and prints all the hyperlinks on the page.
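The embedded gist is not reproduced here; a minimal sketch of such a script, using the requests and beautifulsoup4 libraries mentioned in the checklist below (an approximation, not necessarily the exact code in the repository), is:
#!/usr/bin/env python

import sys
import requests
from bs4 import BeautifulSoup

def main(url):
    res = requests.get(url)                        # fetch the page
    soup = BeautifulSoup(res.text, "html.parser")  # parse the HTML
    for link in soup.find_all("a", href=True):     # print every hyperlink
        print(link["href"])

if __name__ == "__main__":
    main(sys.argv[1])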


However, running this rather simple script might raise some of the following issues:

  • Is the script executable? (chmod a+x linkextractor.py)
  • Is Python installed on the machine?
  • Can you install software on the machine?
  • Is "pip" installed?
  • Are "requests" and "beautifulsoup4" Python libraries installed?

Step 1

Step 1 adds a simple Dockerfile to automate the installation of all the requirements and build an isolated, self-contained image.


Inclusion of this Dockerfile ensures that the script will run without any hiccups in a Docker container as a one-off command.

Step 2

Step 2 makes some changes to the Python script: 1) converting extracted paths to full URLs, 2) extracting both links and anchor texts, and 3) moving the main logic into a function that returns an object, so that the script can be used as a module in other scripts.
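A hypothetical sketch of the refactored module (the function name and returned fields are assumptions for illustration):
import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

def extract_links(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    return [{
        "href": urljoin(url, a["href"]),           # resolve relative paths to full URLs
        "text": a.get_text(strip=True) or "[IMG]"  # anchor text (or a placeholder)
    } for a in soup.find_all("a", href=True)]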

This step illustrates that new changes in the code will not affect any running containers and will not impact an image that was built already (unless overridden). Building a new image with a different tag allows both versions to co-exist and be run as desired.

Step 3

Step 3 adds another Python file, main.py, that utilizes the module written in the previous step to expose link extraction as a web service API that returns a JSON response. The required libraries are listed in the requirements.txt file. The Dockerfile is updated to accommodate these changes and to run the server by default rather than the script as a one-off command.
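A minimal sketch of what such an API server might look like (the use of Flask and the route shown here are assumptions for illustration, not necessarily the demo's exact code):
from flask import Flask, jsonify
from linkextractor import extract_links   # module from Step 2 (name assumed)

app = Flask(__name__)

@app.route("/api/<path:url>")
def api(url):
    return jsonify(links=extract_links(url))   # respond with the extracted links as JSON

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)         # bind to all interfaces inside the container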

This step demonstrates how host and container ports are mapped to expose the service running inside a container.

Step 4

Step 4 moves all the code written so far for the JSON API into a separate folder to build an independent image. In addition, it adds a PHP file, index.php, in a separate folder that serves as a front-end application which internally communicates with the Python API for link extraction. To glue these services together, a docker-compose.yml file is added.


This step demonstrates how multiple services can be orchestrated using Docker Compose. We did not create a custom image for the PHP application; instead, we demonstrated how the code can be mounted inside a container (in this case, a container based on the official php:7-apache image). This allows any modifications of the code to be reflected immediately inside the running container, which can be very handy in development mode.

Step 5

Step 5 adds another Dockerfile to build a custom image of the front-end PHP application. The Python API server is updated to utilize Redis for caching. Additionally, the docker-compose.yml file is updated to reflect the changes in the front-end application (the "web" service block) and to include a Redis service based on its official Docker image.
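A rough illustration of the caching change in the Python API (key naming and the hostname are assumptions; in Compose, the Redis service is typically reachable by its service name):
import json
import redis
from linkextractor import extract_links    # as in the earlier steps (name assumed)

cache = redis.StrictRedis(host="redis", port=6379)  # "redis" = Compose service name

def cached_extract(url):
    hit = cache.get(url)
    if hit is not None:
        return json.loads(hit)              # cache hit: reuse the stored result
    links = extract_links(url)
    cache.set(url, json.dumps(links))       # cache miss: store for next time
    return links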

This step illustrates how easy it is to progressively add components to compose a multi-container service stack. At this stage, the demo application architecture reflects what is illustrated in the title image of this post (the first figure).

Step 6

Step 6 completely replaces the Python API service component with an equivalent Ruby implementation. Slight modifications are made to the docker-compose.yml file to reflect these changes. Additionally, a "logs" directory is mounted in the Ruby API service as a volume for persistent storage.

This step illustrates how easily any component of a micro-service architecture application stack can be swapped out with an equivalent service. Additionally, it demonstrates volumes for persistent storage so that containers can remain stateless.


The video recording of the session is available on YouTube as well as on the department's colloquium recordings page (the latter has more background noise). The slides and demo code are available under permissive licenses to allow modification and reuse.

Resources



--
Sawood Alam

2017-12-11: Difficulties in timestamping archived web pages

Figure 1: A web page from nasa.gov is archived
 by Michael's Evil Wayback in July 2017.
Figure 2: When visiting the same archived page in October 2017,
we found that the content of the page has been tampered with. 
The 2016 Survey of Web Archiving in the United States shows an increasing trend of using public and private web archives in addition to the Internet Archive (IA). Because of this trend, we should consider the question of the validity of archived web pages delivered by these archives.
Let us look at an example where the important web page https://climate.nasa.gov/vital-signs/carbon-dioxide/, which keeps a record of the carbon dioxide (CO2) level in the Earth's atmosphere, is captured by the private web archive "Michael's Evil Wayback" on July 17, 2017 at 18:51 GMT. At that time, as Figure 1 shows, the CO2 level was 406.31 ppm.
When revisiting the same archived page in October 2017, we should be presented with the same content. Surprisingly, the CO2 level changed to 270.31 ppm, as Figure 2 shows. So which one is the "real" archived page?
We can detect that the content of an archived web page has been modified by generating a cryptographic hash value of the returned HTML code. For example, the following command will download the web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ and generate a SHA-256 hash value of its HTML content:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
b87320c612905c17d1f05ffb2f9401ef45a6727ed6c80703b00240a209c3e828  -
The next figure illustrates how this simple approach of generating hashes can detect tampering with the content of archived pages. In this example, the "black hat" in the figure (i.e., Michael's Evil Wayback) has changed the CO2 level to a lower value (i.e., in favor of individuals or organizations who deny that CO2 is one of the main causes of global warming).
Another possible solution for validating archived web pages is timestamping. If a trusted timestamp is issued on an archived web page, anyone should be able to verify that a particular representation of the web page existed at a specific time in the past.
As of today, many systems, such as OriginStamp and OpenTimestamps, offer a free-of-charge service to generate trusted timestamps of digital documents in blockchains such as Bitcoin. These tools perform multiple steps to create a timestamp. One of these steps requires computing a hash value that represents the content of the resource (e.g., with the cURL command above). Next, this hash value is converted to a Bitcoin address, and then a Bitcoin transaction is made in which one of the two sides of the transaction (i.e., the source or the destination) is the newly generated address. Once the transaction is accepted in the blockchain, its creation datetime is considered to be a trusted timestamp. Shawn Jones describes in "Trusted Timestamping of Mementos" how to create trusted timestamps of archived web pages using blockchain networks.
In our technical report, "Difficulties of Timestamping Archived Web Pages", we show that trusted timestamping of archived web pages is not an easy task, for several reasons. The main reason is that a hash value calculated on the content of an archived web page (i.e., a memento) should be repeatable; that is, we should always obtain the same hash value each time we retrieve the memento. In addition to describing those difficulties, we introduce some requirements that must be fulfilled in order to generate repeatable hash values for mementos.

--Mohamed Aturban


Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, "Difficulties of Timestamping Archived Web Pages." 2017. Technical Report. arXiv:1712.03140.

2017-12-14: Storify Will Be Gone Soon, So How Do We Preserve The Stories?

The popular storytelling service Storify will be shut down on May 16, 2018. Storify has been used by journalists and researchers to create stories about events and topics of interest. It has a wonderful interface, shown below, that allows one to insert text as well as social cards and other content from a variety of services, including Twitter, Instagram, Facebook, YouTube, Getty Images, and of course regular HTTP URIs.
This screenshot displays the Storify editing Interface.
Storify is used by news sources to build and publish stories about unfolding events, as seen below for the Boston NPR station WBUR.
Storify is used by WBUR in Boston to convey news stories.
It is also the visualization platform used for summarizing Archive-It collections in the Dark and Stormy Archives (DSA) Framework, developed by WS-DL members Yasmin AlNoamany, Michele Weigle, and Michael Nelson. In a previous blog post, I covered why this visualization technique works and why many other tools fail to deliver it effectively. An example story produced by the DSA is shown below.
This Storify story summarizes Archive-It Collection 2823 about a Russian plane crash on September 7, 2011.

Ian Milligan provides an excellent overview of the importance of Storify and the issues surrounding its use. Storify stories have been painstakingly curated and the aggregation of content is valuable in and of itself, so before Storify disappears, how do we save these stories?

Saving the Content from Storify



Manually


Storify does allow a user to save their own content, one story at a time. Once you've logged in, you can perform the following steps:
1. Click on My Stories
2. Select the story you wish to save
3. Choose the ellipsis menu from the upper right corner
4. Select Export
5. Choose the output format: HTML, XML, or JSON

Depending on your browser and its settings, the resulting content may display in your browser or a download dialog may appear. URIs for each file format do match a pattern. In our example story above, the slug for the story is 2823spst0s and our account name is ait_stories. The different formats for our example story reside at the following URIs.
  • JSON file format: https://api.storify.com/v1/stories/ait_stories/2823spst0s
  • XML file format: https://storify.com/ait_stories/2823spst0s.xml
  • Static HTML file format: https://storify.com/ait_stories/2823spst0s.html
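Given these patterns, a short script can fetch all three formats of any public story; the sketch below simply applies the URL templates above and is not part of any official tool:
import requests

def save_story(account, slug):
    urls = {
        "json": "https://api.storify.com/v1/stories/%s/%s" % (account, slug),
        "xml":  "https://storify.com/%s/%s.xml" % (account, slug),
        "html": "https://storify.com/%s/%s.html" % (account, slug),
    }
    for fmt, url in urls.items():
        resp = requests.get(url)
        with open("%s-%s.%s" % (account, slug, fmt), "wb") as out:
            out.write(resp.content)

save_story("ait_stories", "2823spst0s")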
If one already has the slugs and the account names, they can save any public story. Private stories, however, can only be saved by the owner of the story. What if we do not know the slugs of all of our stories? What if we want to save someone else's stories?

Using Storified From DocNow


For saving the HTML, XML, and JSON formats of Storify stories, Ed Summers, creator of twarc, has created the storified utility as part of the DocNow project. Using this utility, one can save public stories from any Storify account in the 3 available formats. I used the utility to save the stories from the DSA's own ait_stories account. After ensuring I had installed python and pip, I was able to install and use the utility as follows:
  1. git clone https://github.com/DocNow/storified.git
  2. pip install requests
  3. cd storified
  4. python ./storified.py ait_stories # replace ait_stories with the name of the account you wish to save
Update: Ed Summers mentions that one can now run pip install storified, replacing these steps. One only needs to then run storified.py ait_stories, again replacing ait_stories with the account name you wish to save.

Storified creates a directory with the given account name containing sub-directories named after each story's slug. For our Russia Plane crash example, I have the following:
~/storified/ait_stories/2823spst0s % ls -al
total 416
drwxr-xr-x 5 smj staff 160 Dec 13 16:46 .
drwxr-xr-x 48 smj staff 1536 Dec 13 16:47 ..
-rw-r--r-- 1 smj staff 58107 Dec 13 16:46 index.html
-rw-r--r-- 1 smj staff 48440 Dec 13 16:46 index.json
-rw-r--r-- 1 smj staff 98756 Dec 13 16:46 index.xml
I compared the content produced by the manual process above with the output from storified and there are slight differences in metadata between the authenticated manual export and the anonymous export generated by storified. Last seen dates and view counts are different in the JSON export, but there are no other differences. The XML and HTML exports of each process have small differences, such as <canEdit>false</canEdit> in the storified version versus <canEdit>true</canEdit> in the manual export. These small differences are likely due to the fact that I had to authenticate to manually export the story content whereas storified works anonymously. The content of the actual stories, however, is the same. I have created a GitHub gist showing the different exported content.

Using storified, I was able to extract and save our DSA content to Figshare for posterity. Figshare provides persistence as part of its work with the Digital Preservation Network, and used CLOCKSS prior to March 2015.

That covers extracting the base story text and structured data, but what about the images and the rest of the experience? Can we use web archives instead?

Using Web Archiving on Storify Stories



Storify Stories are web resources, so how well can they be archived by web archives? Using our example Russia Plane Crash story, with a screenshot shown below, I submitted its URI to several web archiving services and then used the WS-DL memento damage application to compute the memento damage of the resulting memento.
A screenshot of our example Storify story, served from storify.com.

A screenshot of our Storify story served from the Internet Archive, after submission via the Save Page Now Utility.
A screenshot of our Storify story served from archive.is.

A screenshot of our Storify story served from webrecorder.io.


A screenshot of our Storify story served via WAIL version 1.2.0-beta3.
Platform (Memento Damage Score), with visual inspection comments:

Original Page at Storify (0.002)
  • All social cards complete
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is present
  • Interactive Share Widget contains all images
  • No visible pagination animation
Internet Archive with Save Page Now (0.053)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • Pagination animation runs on click and terminates with errors
Archive.is (0.000)
  • Missing the last 5 social cards
  • Views Widget does not work
  • Embed Widget does not work
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing
  • Pagination animation is replaced by "Next Page" which goes nowhere
Webrecorder.io (0.051*)
  • Missing the last 5 social cards, but can capture all with user interaction while recording
  • Views Widget works
  • Embed Widget works
  • Livefyre Comments widget is missing
  • Interactive Share Widget contains all images
  • No visible pagination animation
WAIL (0.025)
  • All social cards complete
  • Views Widget works, but is missing downward arrow
  • Embed Widget is missing images, but otherwise works
  • Livefyre Comments widget is missing
  • Interactive Share Widget is missing images
  • Pagination animation runs and does not terminate


Out of these platforms, Archive.is has the lowest memento damage score, but in this case the memento damage tool has been misled by how Archive.is produces its content. Because Archive.is takes a snapshot of the DOM at the time of capture and does not preserve the JavaScript on the page, it may score low on Memento Damage, but also has no functional interactive widgets and is also missing 5 social cards at the end of the page. The memento damage tool crashed while trying to provide a damage score for Webrecorder.io; its score has been extracted from logging information.

I visually evaluated each platform for the authenticity of its reproduction of the interactivity of the original page. I did not expect functions that relied on external resources to work, but I did expect menus to appear and images to be present when interacting with widgets. In this case, Webrecorder.io produces the most authentic reproduction, only missing the Livefyre comments widget. Storify stories, however, do not completely display the entire story at load time. Once a user scrolls down, JavaScript retrieves the additional content. Webrecorder.io will not acquire this additional paged content unless the user scrolls the page manually while recording.

WAIL, on the other hand, retrieved all of the social cards. Even though it failed to capture some of the interactive widgets, it did capture all social cards and, unlike Webrecorder.io, does not require any user interaction once seeds are inserted. On playback, however, it still displays the animated pagination widget, as seen below, misleading the user into believing that more content is loading.
A zoomed in screenshot from WAIL's playback engine with the pagination animation outlined in a red box.


WAIL also has the capability of crawling the web resources linked from the social cards themselves, making it a suitable choice if linked content is more important than complete authentic reproduction.

The most value comes from the social cards and the text of the story, not the interactive widgets. Rather than using the story URIs themselves, one can avoid the page-load pagination problems by archiving the static HTML version of the story mentioned above: use https://storify.com/ait_stories/2823spst0s.html rather than https://storify.com/ait_stories/2823spst0s. I tested the static HTML URIs in all tools and found that all social cards were preserved.
The static HTML page version of the same story, missing interactive widgets, but containing all story content.

Unfortunately, other archived content probably did not link to the static HTML version. Because of this, if one were trying to browse a web archive's collection and followed a link intended to reach a Storify story, they would not see it, even though the static HTML version may have been archived. In other words, web archives would not know to canonicalize https://storify.com/ait_stories/2823spst0s.html and https://storify.com/ait_stories/2823spst0s.

Summary



As with most preservation, the goal of the archivist needs to be clear before attempting to preserve Storify stories. Using the manual method or DocNow's storified, we can save the information needed to reconstruct the text of the social cards and other text of the story, but with missing images and interactive content. Aiming web archiving platforms at the Storify URIs, we can archive some of the interactive functionality of Storify, with some degree of success, but also with loss of story content due to automated pagination.

For the purposes of preserving the visualization that is the story, I recommend using a web archiving tool to archive the static HTML version, which will preserve the images and text as well as the visual flow of the story so necessary for successful storytelling. I also recommend performing a crawl to preserve not only the story, but the items linked from the social cards. Keep in mind that web pages likely link to the Storify story URI and not its static HTML URI, hampering discovery within large web archives.

Even though we can't save Storify the organization, we can save the content of Storify the web site.

-- Shawn M. Jones


Updated on 2017/12/14 at 3:30 PM EST with note about pip install storified thanks to Ed Summers' feedback.

2017-12-19: CNI Fall 2017 Trip Report

The Coalition for Networked Information (CNI) Fall 2017 Membership Meeting was held in Washington, DC on December 11-12, 2017. University Librarian George Fowler and I represented ODU, which was recognized as a new member this year.

CNI runs several parallel sessions of project briefings, so I will focus on those sessions that I was able to attend. The attendees were active on Twitter, using the hashtag #cni17f, and I'll embed some of the tweets below.  CNI has the full schedule (pdf) available and will have some of the talks on the CNI YouTube channel. (I'll note if any sessions I attended were scheduled to be recorded and add the link when published.) The project briefings page has additional information on each briefing and links to presentations that have been submitted.

Dale Askey (McMaster University) has published his CNI Fall 2017 Membership Meeting notes, which covers several of the sessions that I was unable to attend.

DAY 1 - December 11

Plenary - recorded

CNI Executive Director (and newly-named ACM Fellow) Clifford Lynch opened the Fall meeting with a plenary talk.

Cliff gave a wide-ranging talk that touched on several timely issues including the DataRefuge movement, net neutrality, generative adversarial networks, provenance, Memento, the Digital Preservation Statement of Shared Values, annotation, and blockchain.
Our recent work investigating the challenges of timestamping archived webpages (available as a tech report at arXiv) is relevant here, given Cliff's comments about DataRefuge, provenance, Memento, and blockchain.


Archival Collections, Open Linked Data, and Multi-modal Storytelling
Andrew White (Rensselaer Polytechnic Institute)

The focus was on taking campus historical archives and telling a story, with links between students, faculty, buildings, and other historical relationships on campus. They developed a system using the Unity game engine to power visualizations and the interactive environment. The system is currently displayed on 3 side-by-side monitors:
  1. Google map of the campus with building nodes overlaid
  2. Location / Character / Event timeline
  3. Images from the archives for the selected node
The goal was to take the photos and relationships from their archives and build a narrative that could be explored in this interactive environment.


Always Already Computational: Collections as Data - slides
Thomas Padilla (UNLV), Hannah Frost (Stanford), Laurie Allen (Univ of Pennsylvania)

Always Already Computational is an IMLS-funded project with the following goals:
  1. creation of a collections as data framework to support collection transformation
  2. development of computationally amenable collection use cases and personas
  3. functional requirements that support development of technological solutions
Much of their current work is focused on talking with libraries and researchers to determine what the needs are and how data can be distributed to researchers. The bottom line is how to make the university collections more useful. There was a lot of interest and interaction with the audience about how to use library collections and make them available for researchers.


Web Archiving Systems APIs (WASAPI) for Systems Interoperability and Collaborative Technical Development - slides
Jefferson Bailey (Internet Archive), Nicholas Taylor (Stanford)
Jefferson and Nicholas reported on WASAPI, an IMLS-funded project to facilitate the transfer of web archive data (WARCs) or derivative data from WARCs.

One of the motivations for the work was a survey finding that local web archive preservation is still uncommon: only about 20% of institutions surveyed download their web archive data for preservation locally.

WASAPI's goal is to help foster and facilitate greater local data preservation and data transfer. There's currently an  Archive-It Data Transfer API that allows Archive-It partners to download WARCs and derivative data (WAT, CDX, etc.) from their Archive-It collections.



Creating Topical Collections: Web Archives vs. the Live Web
Martin Klein (Los Alamos National Laboratory)

Martin and colleagues compared creating topical collections from live web resources (URIs, Twitter hashtags, etc.) with creating topical collections from web archives. The work was inspired by Gossen et al.'s "Extracting Event-Centric Document Collections from Large-Scale Web Archives" (published in TPDL 2017, preprint available at arXiv) and uses WS-DL's Carbondate tool to help with extracting datetimes from webpages.

Through this investigation, they found:
  • Collections about recent events benefit more from the live web resources
  • Collections about events from the distant past benefit more from archived resources
  • Collections about less recent events can still benefit from the live web and from the archived web 


Creating Topical Collections: Web Archives vs. Live Web from Martin Klein


DAY 2 - December 12

From First Seeds to Now: Researching, Building, and Piloting a Harvesting Tool
Ann Connolly, bepress

bepress has developed a harvesting tool for faculty publications in their Expert Gallery Suite and ran a pilot study to gain feedback from potential users. The tool harvests data from MS Academic, which has been shown to have journal coverage on par with Web of Science and Scopus. In addition, MS Academic pulls in working papers, conference proceedings, patents, books, and book chapters. The harvesting tool allows university libraries to harvest metadata from published works of their faculty, including works published while the faculty member was at another institution.

Being unfamiliar with bepress, I didn't realize at first that this was essentially a product pitch. But I learned that bepress is the company behind Digital Commons, the platform that powers ODU's institutional repository, so I was at least a little familiar with the technology being discussed.

bepress was recently acquired by Elsevier, and this was the topic of much discussion during CNI. The acquisition was addressed in a briefing, "bepress and Elsevier: Let’s Go There", given by Jean-Gabriel Bankier, the Managing Director of bepress, on Day 1.


Value of Preserving and Disseminating Student Research Through Institutional Repositories - slides
Adriana Popescu and Radu Popescu (Cal Poly)

This study investigated the impact of hosting student research in an institutional repository (IR) on faculty research impact (citations). They looked at faculty publications indexed in the Web of Science from six departments at Cal Poly and undergraduate senior projects from those same departments deposited in the university's Digital Commons. For their dataset, they found that the citation impact increased as the student project downloads increased. One surprising finding was that the correlation between faculty repository activity and research impact was weaker than the correlation between student repository activity and research impact. The work will be published in Evidence-Based Library and Information Practice.


Annotation and Publishing Standards Work at the W3C - recorded
Timothy Cole (Illinois - Urbana-Champaign)

Tim presented an overview of the W3C Recommendations for Web Annotation and highlighted a few implementations:
Tim also talked about web publications and the challenges in how they can be accommodated on the web.  "A web publication needs to operate on the web as a single resource, even as its components are also web resources."

Tim also gave a pitch for those interested to join a W3C Community Group and noted that membership in W3C is not required for participation there.


Beprexit: Rethinking Repository Services in a Changing Scholarly Communication Landscape - slides
Sarah Wipperman, Laurie Allen, Kenny Whitebloom (UPenn Libraries)

Since I had learned a bit about bepress earlier in the day, I decided to attend this session to hear thoughts from those using Digital Commons and other bepress tools.

The University of Pennsylvania has been using bepress since 2004, but with its acquisition by Elsevier, they are now exploring open source options for hosting Penn's IR, ScholarlyCommons.  Penn released a public statement on their decision to leave bepress.

The presenters gave an overview of researcher services provided by the library and an outline of how they are carefully considering their role and future options.  As they said, Penn is "leaving, but not rushing." They are documenting their exploration of open repository systems at https://beprexit.wordpress.com/.

There was much interest from those representing other university libraries in the audience regarding joining Penn in this effort.


Paul Evan Peters Award & Lecture  - recorded

Scholarly Communication: Deconstruct and Decentralize?
Herbert Van de Sompel, Los Alamos National Laboratory

The final talk at the Fall 2017 CNI Meeting was the Paul Evan Peters Award Lecture.  This year's honoree was our friend and colleague, Herbert Van de Sompel. Herbert's slides are below, and the video will be posted soon.
Herbert discussed applying the principles of the decentralized web to scholarly communication. He proposed a Personal Scholarly Web Observatory that would automatically track the researcher's web activities, including created artifacts, in a variety of portals.
Herbert referenced several interesting projects that have inspired his thinking:
  • MIT's Solid Architecture - proposed set of conventions and tools for building decentralized social applications based on Linked Data principles
  • Sarven Capadisli's dokie.li - a decentralised article authoring, annotation, and social notification tool
  • Amy Guy's "Personal Web Observatory" - tracks daily activities, categorized and arranged visually with icons
These ideas could be used to develop a "Researcher Pod", which could combine an artifact tracker, an Event Store, and a communication platform that could be run on an institutional hosting platform along with an institutional archiving process.  These pods could be mobile and persistent so that researchers moving from one institution to another could take their pods with them.


Paul Evan Peters Lecture from Herbert Van de Sompel


Final Thoughts 

I greatly enjoyed attending my first CNI membership meeting. The talks were all high-quality, and I learned a great deal about some of the issues facing libraries and other institutional repositories.  Once the videos are posted, I encourage everyone to watch Cliff Lynch's plenary and Herbert Van de Sompel's closing talk. Both were excellent.

Because of the parallel sessions, I wasn't able to attend all of the briefings that I was interested in. After seeing some of the discussion on Twitter, I was particularly disappointed to have missed "Facing Slavery, Memory, and Reconciliation: The Research Library’s Role and Georgetown University’s Experience" presented by K. Matthew Dames (Georgetown) and Melissa Levine (Michigan).
Finally, I want to thank and acknowledge our funders, NEH, IMLS, and the Mellon Foundation.  Program officers from these organizations gave talks at CNI:
-Michele

2017-12-31: ACM Workshop on Reproducibility in Publication


On December 7 and 8 I attended the ACM Workshop on Reproducibility in Publication in NYC as part of my role as a member of the ACM Publications Board and co-chair (with Alex Wade) of the Digital Library Committee.  The purpose of this workshop was to gather input from the various ACM SIGs about the approach to reproducibility and "artifacts", objects supplementary to the conventional publication process.  The workshop was attended by 50+ people, mostly from the ACM SIGs, but also included representatives from other professional societies, repositories, and hosting services.  A collection of the slides presented at the workshop and a summary report are being worked on now, and as such this trip report is mostly my personal perspective on the workshop; I'll update with slides, summary, and other materials as they become available.

This was the third such workshop that had been held, but it was the first for me since I joined the Publications Board in September of 2017.  I have a copy of a draft report, entitled "Best Practices Guidelines for Data, Software, and Reproducibility in Publication" from the second workshop, but I don't believe that report is public so I won't share it here.

I believe it was from these earlier workshops that the ACM adopted its policy of including "artifacts" (i.e., software, data, videos, and other supporting materials) in the digital library.  At the time of this meeting the ACM DL had 600-700 artifacts.  To illustrate the ACM's approach to reproducibility and artifacts in the DL, below I show an example from ACM SIGMOD (each ACM SIG is implementing different approaches to reproducibility as appropriate within their community).

The first image below is a paper from ACM SIGMOD 2016, "ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks", which has the DOI URI of https://doi.org/10.1145/2882903.2882964.  This page also links to the SIGMOD guidelines for reproducibility.


Included under the "Source Materials" tab is a link to a zip file of the software and a separate README file in unprocessed markdown format.  What this page doesn't link to is the software page in the ACM DL, which has a separate DOI, https://doi.org/10.1145/3159287.  The software DOI does link back to the SIGMOD paper, but the SIGMOD paper does not appear to explicitly link to the software DOI (again, it links to just the zip and README).



On that page I've also clicked on the "artifacts" button to produce a pop up that explains the various "badges" that the ACM provides; a full description is also available at a separate page.  More tellingly, on this page there is a link to the software as it exists in GitHub.

In slight contrast to the SIGMOD example, the Graphics Replicability Stamp Initiative (GRSI) embraces GitHub completely, linking both to the repositories of the individuals (or groups) that wrote the code and to forks of the code within the GRSI account.  Of course, existing in GitHub is not the same as being archived (reminder: the fading of SourceForge and the closing of Google Code), and a DL has a long-term responsibility for hosting bits and not just linking to them (though to be fair, Git is bigger than GitHub and ACM could commit to git.acm.org).  On the other hand, as GRSI implicitly acknowledges, decontextualizing the code from the community and functions that the hosting service (in this case, GitHub) provides is not a realistic short- or mid-term approach either.  Resolving the tension between memory organizations (like ACM) and non-archival hosting services (like GitHub) is one of the goals of the ODU/LANL AMF funded project ("To the Rescue of the Orphans of Scholarly Communication": slides, video, DSHR summary) and I hope to apply the lessons learned from the research project to the ACM DL.

One of the common themes was "who evaluates the artifacts?"  Initially, most artifacts are considered only for publications otherwise already accepted, and in most cases the evaluation is done non-anonymously by a different set of reviewers.  That adapts best to the current publishing process, but it is unresolved whether or not this is the ideal process -- if artifacts are to become true first-class citizens in scholarly discourse (and thus the DL), perhaps they should be reviewed simultaneously with the paper submission.  Of course, the workload would be immense and anonymity (in both directions) would be difficult if not impossible.  Setting aside the issue of whether or not that is desirable, it would still represent a significant change to how most conferences and journals are administered.  Furthermore, while some SIGs have successfully implemented initial approaches to artifact evaluation with grad students and post-docs, it is not clear to me that this is scalable, nor am I sure it sends the right message about the importance of the artifacts.

Some other resources of note:
The discussion of identifiers, and especially DOIs, is of interest to me because one of the points I made in the meeting and continued on Twitter can roughly be described as "DOIs have no magical properties".  No one actually claimed this, of course, but I did feel the discussion edging toward "just give it a DOI" (cf. getting DOIs for GitHub repositories).  I'm not against DOIs; rather, the short version of my caution is that there is currently a correlation between "archival properties" and "things we give DOIs to", but DOIs do not cause archival properties.

There was a fair amount of back channel discussion on Twitter with "#acmrepro"; I've captured the tweets during and immediately after the workshop in the Twitter moment embedded below.

I'll update this post as slides and the summary report become available.

--Michael








2018-01-02: Link to Web Archives, not Search Engine Caches

Fig.1 Link TheFoundingSon Web Cache
Fig.2 TheFoundingSon Archived Post
In a recent article in Wired, "Yup, the Russian propagandists were blogging lies on Medium too," Matt Burgess makes reference to three now-suspended Twitter accounts: @TheFoundingSon (archived), @WadeHarriot (archived), and @jenn_abrams (archived), and their activity on the blogging service Medium.

Fig.3 TheFoundingSon Suspended Medium Account
Burgess reports that these accounts were suspended on Twitter and Medium, and quotes a Medium spokesperson as saying:
 With regards to the recent reporting around Russian accounts specifically, we’re paying close attention and working to ensure that our trust and safety processes continue to evolve and identify any accounts that violate our rules.
Unfortunately, to provide evidence of the pages' former content, Burgess links to Google caches instead of web archives.  At the time of this writing, two of the three links for @TheFoundingSon's blog posts, which were included in Wired's article, produced a 404 response code from Google (the search engine containing the cached page) when clicking on the link (see Fig.1). 

Only one link (Fig. 4), related to science and politics, was still available a few days after the article was written.

Fig.4 TheFoundingSon Medium Post Related to Science and Politics
Why is only one out of three web cache links still available? Search Engine (SE) caches are useful for covering transient errors in the live web, but they are not archives and thus not suitable for long-term access. In previous work our group has studied SE caches ("Characterization of Search Engine Caches") and the rate at which SE caches are purged ("Observed Web Robot Behavior on Decaying Web Subsites"). SE caches used to play a larger role in providing access to the past web (e.g., "How much of the web is archived?"), but improvements in the Internet Archive (i.e., no longer has a quarantine period, has a "save page now" function) and restrictions on SE APIs (e.g., monetization of the Yahoo BOSS API) have greatly reduced the role of SE caches in providing access to the past web.
To answer our original question of why only one of the three links is still useful: Burgess used SE caches to provide evidence of web pages that were removed from Medium's servers, and research has shown that SEs purge the index and cache of resources that are no longer available, so we can expect that all of the links in the Wired article pointing to SE caches will eventually decay.
If I were going to inquire about the type of blog @TheFoundingSon was writing, I could query https://medium.com/@TheFoundingSon in the IA's Wayback Machine at web.archive.org (Fig.5).
Fig.5 TheFoundingSon Web Archived Pages

Doing so provides a list of ten archived URIs:
  1. https://web.archive.org/web/20170223230217/https://medium.com/@TheFoundingSon/5-things-hillary-is-going-to-do-on-debate-night-81412f6878ab
  2. https://web.archive.org/web/20170223012115/https://medium.com/@TheFoundingSon/blindfolded-election-2016-bc269463dc7
  3. https://web.archive.org/web/20170626021233/https://medium.com/@TheFoundingSon/catholicism-is-evil-and-islam-is-religion-of-peace-74f2d7947162
  4. https://web.archive.org/web/20170222145442/https://medium.com/@TheFoundingSon/gun-control-absurdity-34cabd52f0e4
  5. https://web.archive.org/web/20170120073029/https://medium.com/@TheFoundingSon/hillarys-actions-vs-trump-s-words-what-is-louder-92798789eaf6
  6. https://web.archive.org/web/20170119183629/https://medium.com/@TheFoundingSon/lessons-huffpost-wants-us-to-learn-from-orlando-ac74f2a27922
  7. https://web.archive.org/web/20170222224659/https://medium.com/@TheFoundingSon/making-america-deplorable-37b9cea48b4b
  8. https://web.archive.org/web/20170223094335/https://medium.com/@TheFoundingSon/one-missed-wake-up-call-6cb87200cc2a
  9. https://web.archive.org/web/20170120021905/https://medium.com/@TheFoundingSon/see-something-say-nothing-b144aa5d4d39
  10. https://web.archive.org/web/20170807072351/https://medium.com/@TheFoundingSon/votes-that-count-7766810f0809
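For readers who want to gather such a list programmatically rather than through the Wayback Machine's web interface, a minimal Python sketch using the Internet Archive's public CDX API might look like the following (the choice of fields and filters here is an illustrative assumption, not the exact query behind Fig.5):

import json
import urllib.request

# Query the Internet Archive CDX API for successful captures of
# @TheFoundingSon's Medium posts (the trailing * requests a prefix match).
cdx_uri = ("http://web.archive.org/cdx/search/cdx"
           "?url=medium.com/@TheFoundingSon*"
           "&output=json&fl=timestamp,original&filter=statuscode:200")

with urllib.request.urlopen(cdx_uri) as response:
    rows = json.loads(response.read().decode("utf-8"))

for timestamp, original in rows[1:]:  # the first row is the field header
    print("https://web.archive.org/web/%s/%s" % (timestamp, original))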
The archived web pages are in a time capsule, preserved for generations to come, in contrast to SE caches, which decay in a very short period of time. It is interesting to see that for @WadeHarriot, the account with the smallest number of Twitter followers before its suspension, Wired resorted to the IA for the posting about Hillary Clinton's 'lies'; the other link was a web search engine cache. Both web pages are available in the IA.

Another advantage of web archives over search engine caches is that web archives allow us to analyze changes of a web page through time.  For example, @TheFoundingSon on 2016-06-16 had 14,253 followers, and on 2017-09-01 it had 41,942 followers.


The data to plot @JennAbrams and @TheFoundingSon Twitter follower counts over time were obtained by utilizing a tool created by Orkun Krand while working at the ODU Web Science Digital Library Group (@WebSciDL). Our tool, which will be released in the near future, makes use of the IA and Mementos. Ideally, we would like to capture as many copies (mementos) as possible of available resources, not only in the IA, but in all the web archives around the world. However, our Follower-Count-History tool only uses the IA, because some random Twitter pages most likely will not be found in the other web archives, and since our tool uses HTML scraping to extract the data, other archives may store their web pages in a different format than the IA.


The IA allows us to analyze our Twitter accounts in greater detail. We could not graph the follower count over time for @WadeHarriot because only one memento was available in the web archives. However, multiple mementos were found for the other two accounts. The Followers-Count-Over-Time tool provided the data to plot the two graphs shown above. We notice by looking at the graph of @TheFoundingSon that its Twitter followers doubled from around 15K to around 30K in only six months, and it continued an accelerated ascent, reaching over 40K followers before its suspension. A similar analysis can be made with the @jenn_abrams account. Before October of 2015, @jenn_abrams had around 30K followers; a year later that number had almost doubled to around 55K, topping 70K followers before its suspension. We could question whether the followers of these accounts are real people, or whether the rate at which they accumulated followers is normal for Twitter, but we will leave these questions for another post.

SE caches are an important part of the web infrastructure, but linking to them is a bad idea since they are expected to decay. Instead, we should link to web archives. They are more stable, and as shown in the Twitter-Followers-Count-Over-Time graphs, they allow time series analysis when we can find multiple mementos for the same URI.

- Plinio Vargas

HTTP responses for some links found in the Wired article.



2017-12-31: Digital Blackness in the Archive - DocNow Symposium Trip Report


From December 11-12, 2017, I attended the second Documenting the Now Symposium in St. Louis, MO.  The meeting presentations were recorded and are available along with an annotated agenda; for further background about the Documenting the Now project and my involvement via the advisory board, I suggest my 2016 trip report, as well as DocNow activity on github, slack, and Twitter.  In addition, the meeting itself was extensively live-tweeted with #BlackDigArchive (see also the data set of Tweet ids collected by Bergis Jules).



The symposium began at the Ferguson Public Library, first with a welcome from Vernon Mitchell of DocNow and Scott Bonner of the Ferguson Public Library.  This venue was chosen for its role in the events of Ferguson 2014 (ALA interview, CNN story).  The engaging opening keynote was by Marisa Parham of Amherst College, entitled "Sample, Signal, Strobe", and I urge you to take the time to watch it and not rely on my inevitably incomplete and inaccurate summary.  With those caveats, what I took away from Parham's talk can be summarized as addressing "the confluence of social media and the agency it gives some people" and "twitter as a dataset vs. twitter as an experience", and how losing the context of a tweet removes the "performance" part.  Watching hashtags emerge, watching the repetition of RTs, and the sense of contemporary community and shared experience (which she called "the chorus of again").  I can't remember if she made this analogy directly or if it is just what I put in my notes, but a movie in a theater is a different experience than at home, even though home theaters can be quite high-fidelity, in part because of the shared, real-time experience.  To this point I also tweeted a link to our Katrina web archive slides, because we find that replay of contemporary web pages makes a more powerful argument for web archives than, say, wikipedia or other summary pages.



Parham had a presentation online that provided some of the quotes that she used, but I did not catch the URI.  Here are some of the resources that I was able to track down while she talked (I'm sure I missed several):

Next up was the panel "The Ferguson Effect on Local Activism and Community Memory". Two of the panelists, Alexis Templeton and Kayla Reed, were repeat panelists from the 2016 meeting, which brought up a point they made during their presentations: while archives document specific points in time, the people involved should be allowed to evolve and live their lives without the expectations and weight of those moments.  There was a lot conveyed by the panelists, and I feel I would be doing them a disservice to further summarize their life experiences. Instead, at the risk of interrupting the flow of the post, I will include more tweets from others than I normally would and redirect you to the video for the full presentations and the pointed discussion that followed.















After this panel, we adjourned to the local institution of Drake's Place for lunch, and in the evening saw a screening of "Whose Streets?" at WUSTL.

The next morning we resumed the meeting on the campus of WUSTL and began with tool/technology overviews then breakout demos from Ed Summers, Alexandra Dolan-Mescal, Justin Littman, and Francis Kayiwa.



I'm not sure how much longer demo.docnow.io will be up, but I highly recommend that you interact with the service while you can and provide feedback (sample screen shots above).  The top screen shows trending hashtags for your geographic area, and the bottom screen shows the multi-panel display for the hashtag: tweets, users, co-occurring hashtags, and embedded media.

The second panel, "Supporting Research: Digital Black Culture Archives for the Humanities and Social Sciences", began after the tool demo sessions.


Meridith Clark began with the observation about the day of Ferguson, "some of my colleagues will see this just as data."  Unfortunately, this panel does not appear to have been recorded.  Catherine Knight Steele made the point that while social media are "public spaces", like a church they still require respect.






Clark also solicited feedback from the panel about what tools and functionality they would like to see.  Melissa Brown talked about Instagram (with which our group has done almost nothing to date) and Picodash (with extended features like geographic bounding of searches).  Someone (not clear in my notes) also discussed the need to maintain not just, for example, the text of a blog, but also the entire contemporary UI (this is clearly an application for web archiving, but social media is often not easy to archive).  Clark also discussed the need for more advanced visualization tools, and the panel ended with a discussion about IRBs and social media.

Unfortunately I had to leave for the airport right after lunch and had to miss the third panel, "Digital Blackness in the Archive: Collecting for the Culture".  Fortunately that panel was recorded and is linked from the symposium page.

Another successful meeting, and I'm grateful to the organizers (Vernon Mitchell, Bergis Jules, Tim Cole).  The DocNow project is coming to an end in early 2018, and although I'm not sure what happens next I hope to continue my relationship with this team.

--Michael



2018-01-06: Two WSDL Classes Offered for Spring 2018


Two Web Science & Digital Library (WS-DL) courses will be offered in Spring 2018:
Although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching two classes: CS 395 "Data Wrangling" and CS 395 "Data Analysis".

--Michael

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the last four years (2016--2013) we have graduated five PhD and three MS students, so the focus for this year was "recruiting", and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley,  published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for archival collection building based on material from local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the javascript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which allows easily running Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!)
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new javascript library that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I had to unexpectedly present this paper since, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had both a poster (and separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, hackathons:

















WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main coverage areas:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing and release new software and datasets via our GitHub account. Given the evolving nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year our significant releases and updates include:
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time, and we're excited to have a separate project dedicated to visualizing archived web pages. 

Another point you can probably infer from the discussion above but I decided to make explicit is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above in 2017 we were fortunate to continue to work closely with Martin (2011) now at LANL, Yasmin (2016) now at Berkeley, and Justin (2016) now at Mitre.  

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.

--Michael











2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module


Web pages are generally composed of many resources such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as relative path, absolute path, or a full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended and may result in a damaged memento. For example, a memento (an archived copy) of the web page https://www.odu.edu/ can be seen at https://web.archive.org/web/20180107155037/https://www.odu.edu/. Note that the domain name has changed from www.odu.edu to web.archive.org and some extra path segments were added. In order for this page to render properly, various resource references in it are rewritten; for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/http://www.odu.edu/etc/designs/odu/images/logo-university.png.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival version. Failure to do so would result in a broken rendering of archived pages (composite mementos) as the embedded resource references might resolve to their live version or an invalid location. With the growing use of JavaScript in web applications, often resources are injected dynamically, hence rewriting such references is not possible from the server side. To mitigate this issue, some JavaScript is injected in the page that overrides the global namespace to modify the DOM and monitor all network activity. In JCDL17 and WADL17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, but catches every network request, even those that were initiated due to dynamic resource injection. Read our paper for more details.
Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.


URL Rewriting


There are primarily three ways to reference a resource from another resource: relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In the case of archival replay, both the origin and base paths are changed from the original, while the original origin and paths usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs, on the other hand, are resolved incorrectly or live-leaked when a primary resource is served from an archive; neither of these conditions is desired in archival replay. There is a fourth way of referencing a resource, called schemeless (or protocol-relative), that starts with two forward slashes followed by a domain name and paths. However, web archives usually ignore the scheme part of the URI when canonicalizing URLs, so we can focus on the three main ways. The following table illustrates examples of each with their resolution issues.


Reference type    Example                                      Resolution after relocation
Relative path     images/logo.png                              Potentially correct
Absolute path     /public/images/logo.png                      Potentially incorrect
Absolute URL      http://example.com/public/images/logo.png    Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving them to the client so that various resource references point to their corresponding archival pages. Suppose a page, originally located at http://example.com/public/index.html, has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive at http://archive.example.org/<datetime>/http://example.com/public/index.html, the image reference needs to be rewritten as <img src="/<datetime>/http://example.com/public/images/logo.png"> in order for it to work as desired. However, URLs constructed dynamically by JavaScript on the client-side are difficult to rewrite through static analysis of the code at the server end. With the rising usage of JavaScript in web pages, it is becoming more challenging for archival replay systems to correctly replay archived web pages.

ServiceWorker


ServiceWorker is a new web API that can be used to intercept all the network requests within its scope or originated from its scope (with a few exceptions such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, where it is registered to watch for all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests within its scope. An active ServiceWorker sits in the middle of the client and the server as a proxy (which is built in to the browser). It can change both requests and responses as necessary. The primary use-case of the API is to provide a better offline experience in web apps by serving pages from a client-side cache when there is no network or by populating/synchronizing the cache. However, we found it useful for solving an archival replay problem.

Reconstructive


We created Reconstructive, a ServiceWorker module for archival replay that sits on the client-side and intercepts every potential archival request to properly reroute it. This approach requires no rewrites from the server side. It is being used successfully in our IPFS-based archival replay system called InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing any live-leaks (also known as zombie resources) or wrong URL resolutions.



The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live-web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.


In order to reroute requests to the URI of a potential archived copy (also known as Memento URI or URI-M) Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) of the referrer then combines them with the request URL as necessary to construct a potential URI-M for the request to be rerouted to. If the request URL is already a URI-M, it simply adds a custom request header X-ServiceWorker and fetches the response from the server. When necessary, the response is rewritten on the client-side to fix some quirks to make sure that the replay works as expected or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.


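To make the rerouting logic concrete, here is a minimal Python sketch of that URI-M construction step (the actual module is written in JavaScript and handles many more cases; the regular expression and fallbacks below are simplifications, not the module's real code):

import re
from urllib.parse import urljoin

# A URI-M is assumed to look like <archive prefix>/<14-digit datetime>/<URI-R>.
URIM_PATTERN = re.compile(r"^(?P<prefix>.*/)(?P<datetime>\d{14})/(?P<urir>.+)$")

def construct_potential_urim(request_url, referrer_urim):
    """Derive a potential URI-M for request_url using the referrer's URI-M."""
    if URIM_PATTERN.match(request_url):
        return request_url  # already a URI-M, no rerouting needed
    referrer = URIM_PATTERN.match(referrer_urim)
    if not referrer:
        return request_url  # referrer is not a recognizable URI-M
    # Resolve the request against the original URI-R, then wrap it in the
    # same archive prefix and datetime as the referrer.
    urir = urljoin(referrer.group("urir"), request_url)
    return referrer.group("prefix") + referrer.group("datetime") + "/" + urir

print(construct_potential_urim(
    "http://www.odu.edu/etc/designs/odu/images/logo-university.png",
    "https://web.archive.org/web/20180107155037/https://www.odu.edu/"))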
We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activities. The test suite comes with a pre-captured WARC file of a live test page. The captured resources are all green, while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system is replaying archived resources. In the test suite, a green box means proper rerouting, a red box means a live-leakage, and white/gray means an incorrectly resolved reference.


Module Usage


The module is intended to be used by archival replay systems backed by a Memento endpoint. It can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order to use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import the reconstructive.js script (locally or externally), which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed Reconstructive.reroute function.

importScripts('https://oduwsdl.github.io/reconstructive/reconstructive.js');
self.addEventListener('fetch', Reconstructive.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding some requests that match a default set of exclusion rules. However, URI-M pattern, exclusion rules, and many other configuration options can be customized. It even allows customization of the default response rewriting function and archival banner. The module can also be configured to only reroute a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details read the user documentation, example usage (registration process and sample ServiceWorker), or heavily documented module code.

Archival Banner


The Reconstructive module has implemented a custom element named <reconstructive-banner> to provide archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner leaking into the page or the other way around. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing the Reconstructive module, after which it will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but goes out of the way when not needed. The banner element is currently in an early stage with very limited information and interactivity, but it is intended to evolve into a more functional component.

<script src="https://oduwsdl.github.io/reconstructive/reconstructive-banner.js"></script>
<reconstructive-banner urir="http://example.com/" datetime="20180106175435"></reconstructive-banner>


Limitations


It is worth noting that we rely on some fairly new web APIs that might not have very good and consistent support across all browsers and may potentially change in the future. At the time of writing this post, ServiceWorker support is available in about 74% of active browsers globally. To help the server identify whether a request is coming from Reconstructive (to provide a fallback of server-side rewriting), we add a custom request header X-ServiceWorker.

As per current specifications, there can be only one ServiceWorker active on a given scope. This means that if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in typical web apps ServiceWorkers are generally used to improve the user experience, and such apps gracefully degrade to remain functional without one (though this is not guaranteed). The best we can do in this case is to rewrite any ServiceWorker registration code (on the client-side) in an archived page before serving the response, disabling it so that Reconstructive continues to work.

Conclusions


We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite ACRTS that can be useful by itself. This work is supported in part by NSF grant III 1526700.

Resources




--
Sawood Alam

2018-02-27: Summary of Gathering Alumni Information from a Web Social Network

While researching my dissertation topic (slides 2--28) on social media profile discovery, I encountered a related paper titled Gathering Alumni Information from a Web Social Network written by Gabriel Resende Gonçalves, Anderson Almeida Ferreira, and Guilherme Tavares de Assis, which was published in the proceedings of the 9th IEEE Latin American Web Congress (LA-WEB). In this paper, the authors detailed their approach to define a semi-automated method to gather information regarding alumni of a given undergraduate program at Brazilian higher education institutions. Specifically, they use the Google Custom Search Engine (CSE) to identify candidate LinkedIn pages based on a comparative evaluation of similar pages in their training set. The authors contend alumni are efficiently found through their process, which is facilitated by focused crawling of data publicly available on social networks posted by the alumni themselves. The proposed methodology consists of three main modules and two data repositories, which are depicted in Figure 1. Using this functional architecture, the authors constructed a tool that gathers professional data on the alumni in undergraduate programs of interest, then proceeds to classify the associated HTML page to determine relevance. A summary of their methodology is presented here.

Figure 1 - Functional architecture of the proposed method

Repositories

The first repository, Pages Repository, stores the web pages from the initial set of data samples which are used to start the classification process. This set is comprised of alumni lists obtained from five universities across Brazil. The lists contain the names of students enrolled between 2000 and 2010 in undergraduate programs, namely Computer Science at three institutions, Metallurgical Engineering at one institution, and Chemistry at one institution. The total number of alumni available on all lists is 6,093. For the purpose of validation, a random set of 15 alumni are extracted from each list as training examples during each run of their classifier. The second repository, Final Database, is the database where academic data on each alumnus is stored for further analysis.

Modules

The first module, Searcher, determines the candidate pages from a Google result set that might belong to the alumni group. LinkedIn is the social network of choice, from which the authors leverage public pages on the web that have been indexed by a search engine. The search is initiated using a combination of the first, middle, and last names of a given alumnus, and then relevant data concerning the undergraduate program, program degree, and institution are extracted from the candidate pages. The authors chose not to search using LinkedIn's Application Programming Interface (API) due to its inherent limitations. Specifically, the API requires authentication by a registered LinkedIn user, and searches are restricted to the first-degree connections of the user conducting the search. As an alternative, the authors use the Google Custom Search Engine, which provides access to Google's massive repository of indexed pages, but is limited to 100 daily free searches returning 100 results per query.
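As an illustration of the kind of query the Searcher module issues (the endpoint is Google's public Custom Search JSON API; the API key, search engine id, alumnus name, and query composition are placeholders, not values from the paper):

import requests  # third-party HTTP library, assumed to be installed

# Hypothetical CSE query combining an alumnus name with program terms,
# restricted to public LinkedIn pages.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

params = {
    "key": "YOUR_API_KEY",          # placeholder credential
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder custom search engine id
    "q": '"Maria Souza Lima" "Computer Science" site:linkedin.com',
}

response = requests.get(CSE_ENDPOINT, params=params)
response.raise_for_status()

# Each result is a candidate page to be stored in the Pages Repository.
for item in response.json().get("items", []):
    print(item.get("title"), item.get("link"))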

We should note that in the years since this paper was published in 2014, LinkedIn has instituted a number of security measures to impede data harvesting of public profiles. They employ a series of automated tools, FUSE, Quicksand, Sentinel, and Org Block, that are used to monitor suspicious activity and block web scraping. Requests are throttled based on the requester's IP address (see hiQ Labs v. LinkedIn Corporation).  Anonymous viewing of a large number of public LinkedIn profile pages, even if retrieved using Google's boolean search criteria, is not always possible. After an undisclosed number of public profile views, LinkedIn forces the user to either sign up or log in as a way to thwart scraping by 3rd party applications (Figure 2).


Figure 2 - LinkedIn Anonymous Search Limit Reached
The second module, Filter, determines the significance of the candidate pages provided by the Searcher module via the Pages Repository. The classification process determines the similarity among pages using the academic information on the LinkedIn page as terms, which are then separated into categories that describe the undergraduate program, institution, and degree. The authors use cosine similarity over term frequencies to relate candidate pages from the Searcher module to the initial training set, and they specify a 30% threshold for the minimum percentage of pages on which a term must appear.
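As a generic illustration of that similarity computation (this is not the authors' code, and the sample strings are made up), cosine similarity over raw term-frequency vectors can be computed along these lines:

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents using raw term frequencies."""
    tf_a, tf_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot_product = sum(tf_a[term] * tf_b[term] for term in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    return dot_product / (norm_a * norm_b) if norm_a and norm_b else 0.0

candidate = "bachelor computer science ouro preto 2008"
training = "computer science bachelor degree ouro preto"
print(round(cosine_similarity(candidate, training), 3))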

The third module, Extraction, extracts the demographic and academic information from the HTML pages returned by the Filter module using regular expressions as shown in Figure 3. The extracted information is stored in the Final Database for further analysis using the Naive Bayes bag-of-words model to identify specific alumni of the desired undergraduate program.


Figure 3 - Regular Expressions Used by Extraction Module

Results and Takeaways

The authors acknowledge that obtaining an initial list of alumni names is not a major obstacle. However, collecting the initial set of sample pages from a social network, such as LinkedIn, may be time consuming and labor intensive even with small data sets. Their evaluation, as shown in Figure 4, indicates satisfactory precision, and the methodology proposed in their paper is able to find an average of 7.5% to 12.2% of alumni for undergraduate programs with more than 1,000 alumni.

Figure 4 - Pages Retrieved and Precision Results For Proposed Method and Baseline
Given the highly structured design of LinkedIn HTML pages, we would expect the Filter and Extraction modules to identify and successfully retrieve a higher percentage of alumni, even without applying a machine learning technique. The bulk of this paper's research is predicated upon access to public data on the web. If social media networks choose to present barriers that impede the collection of this public information, continued research by these authors and others will be significantly impacted. With regard to LinkedIn public profiles, we can only anticipate the imminent outcome of pending litigation, which will determine who controls publicly available data.

--Corren McCoy (@correnmccoy)


Gonçalves, G. R., Ferreira, A. A., de Assis, G. T., & Tavares, A. I. (2014, October). Gathering alumni information from a web social network. In Web Congress (LA-WEB), 2014 9th Latin American (pp. 100-108). IEEE.

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.
The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working due to different environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.
How to run the CoreNLP server on localhost port 9000 from a Docker container
  1. Install Docker if not already available
  2. Pull the image from the repository and run the container:
Using the server
The server can be used from the browser, from the command line, or from custom scripts:
  1. Browser: To use the CoreNLP server from the browser, open your browser and visit http://localhost:9000/. This presents the user interface (Fig. 1) of the CoreNLP server.
  2. Command line (NER example):
    Fig. 2: Sample request URL sent to the Named Entity Annotator 
    To use the CoreNLP server from the terminal, learn how to send requests to a particular annotator from the CoreNLP usage webpage, or learn from the request URL the browser (1.) sends to the server. For example, this request URL was sent to the server by the browser (Fig. 2), and corresponds to the following command that uses the Named-Entity Recognition system to label the supplied text:
  3. Custom script (NER example): I created a Python function nlpGetEntities() that uses the NER annotator to label a user-supplied text.
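As a rough sketch of such a custom script (this is not the exact nlpGetEntities() implementation; it assumes the server is listening on localhost port 9000 and that the third-party requests library is installed):

import json
import requests

def ner_entities(text, server="http://localhost:9000"):
    """Send text to the CoreNLP server and return (token, NER label) pairs."""
    properties = {"annotators": "ner", "outputFormat": "json"}
    response = requests.post(server,
                             params={"properties": json.dumps(properties)},
                             data=text.encode("utf-8"))
    response.raise_for_status()
    document = response.json()
    return [(token["word"], token["ner"])
            for sentence in document["sentences"]
            for token in sentence["tokens"]]

print(ner_entities("Old Dominion University is located in Norfolk, Virginia."))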
To stop the server, issue the following command: 
The Dockerfile I created targets CoreNLP version 3.8.0 (2017-06-09). There is a newer version of the service (3.9.1). I believe it should be easy to adapt the Dockerfile to install the latest version by replacing all occurrences of "2017-06-09" with "2018-02-27" in the Dockerfile.  However, I have not tested this operation since version 3.9.1 is marginally different from version 3.8.0 for my use case, and I have not tested version 3.9.1 with my application benchmark. 

--Nwala

2018-03-12: NEH ODH Project Directors' Meeting



Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC.  We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC.

The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.



After the keynote, each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 45 projects highlighted and short descriptions of each are available through the award announcements (awarded in August 2017, awarded in December 2017).  Remember, video is coming soon for all of the 3-minute lightning talks.



Here are my 3 slides, previewing our grid, animation/slider, and timeline views for visualizing significant webpage changes over time.


Visualizing Webpage Changes Over Time from Michele Weigle


Following the lightning talks, the ODH at Ten celebration began with a keynote by Honorable John Unsworth, NEH National Council Member and University Librarian and Dean of Libraries at the University of Virginia.

I was honored to be invited to participate in the closing panel highlighting the impact that ODH support had on our individual careers and looking ahead to future research directions in digital humanities. 
Panel: Amanda French (George Washington), Jesse Casana (Dartmouth College), Greg Crane (Tufts), Julia Flanders (Northeastern), Dan Cohen (Northeastern),  Michele Weigle (Old Dominion), Matt Kirschenbaum (University of Maryland)



Thanks to the ODH staff, especially ODH Director Brett Bobley and our current Program Officer Jen Serventi, for organizing a great meeting.  It was also great to be able to catch up with our first ODH Program Officer, Perry Collins. We are so appreciative of the support for our research from NEH ODH.

Here are more tweets from our day at ODH:

-Michele


2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Due to the limitations of Twitter's API, we have only a restricted ability to collect historical data about a user's followers. The information for when one account started following another is unavailable, and without it, tracking the popularity of an account and how it grew cannot be done. Another pitfall is that when an account is deleted, Twitter does not provide data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to have been archived, then a follower count for a specific date can be collected. 

The previous method to determine followers over time is to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of a follower is the lower bound for when they could have started following the account under observation. Its correctness depends on new accounts immediately following the account under observation so that the lower bound is accurate. The order in which Twitter returns followers is subject to unannounced change, so it can't be depended on to work long term. It also will not show when an account starts losing followers, because the API only returns users still following the account. This tool will help accurately gather and plot the follower count based on mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to increase follower numbers.

I improved on a Python script, created by Orkun Krand, that collects the follower counts for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on GitHub. Through the historical pages kept in the Internet Archive, the number of followers can be observed for the specific date of each collected memento. The script collects the follower count by identifying the various CSS selectors associated with the follower count in most of the major layouts Twitter has implemented. If a Twitter page isn't popular enough to warrant being archived, or is too new, then no data can be collected on that user.
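As a rough sketch of that extraction step (the selector below matches only one of the Twitter layouts from that era; the real script tries a list of selectors, and the example URI-M at the end is hypothetical):

import urllib.request
from bs4 import BeautifulSoup

def follower_count(urim):
    """Extract the follower count from one memento of a Twitter profile page."""
    with urllib.request.urlopen(urim) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    # One of several selectors needed in practice; older and newer Twitter
    # layouts use different markup, so a real implementation falls back
    # through a list of selectors until one matches.
    node = soup.select_one("li.ProfileNav-item--followers .ProfileNav-value")
    if node is None:
        return None  # unrecognized page layout
    count = node.get("data-count") or node.get_text()
    return int(str(count).replace(",", ""))

# Hypothetical URI-M for illustration:
print(follower_count("http://web.archive.org/web/20170101000000/https://twitter.com/USAGym"))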

This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll @Ten_GOP, impersonating the Tennessee GOP was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived. 
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers due to Twitter suspending suspected bot accounts. This script enables investigating users who have lost followers, and for how long they lost them. It is important to note that the default flag to collect one memento a month is not expected to have the granularity to view behaviors that typically happen on a small time frame. To correct that, the flag [-e] to collect all mementos for an account should be used. The republican political commentator @mitchellvii lost followers in two recorded incidences. In January 2017 from the 1st to the 4th, @mitchellvii lost 1270 followers. In April 2017 from the 15th to the 17th, @mitchellvii lost 1602 followers. Using only the Twitter API to collect follower growth would not show this phenomenon.



Dependencies:


  • Python 3

  • R* (to create graph)

  • bs4

  • urllib

  • archivenow* (push to archive)

  • datetime* (push to archive)



*optional

How to run the script:

$ git clone https://github.com/oduwsdl/FollowerCountHistory.git
$ cd FollowerCountHistory
$ ./FollowerHist.py [-h] [-g] [-e] [-p | -P] <twitter-username-without-@>

Output: 

The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for that date, and the URL for that memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain all the dates of mementos where the follower count was not collected and will list the reason why. All file and folder names are named after the Twitter username provided, after being cleaned to ensure system safety.

If the [-g] flag is used, the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.

Example Output:

./FollowerHist.py -g -p USAGym
USAGym
http://web.archive.org/web/timemap/link/http://twitter.com/USAGym
242 archive points found
20120509183245
24185
20120612190007
...
20171221040304
250242
20180111020613
250741
Not Pushing to Archive. Last Memento Within Current Month.
null device
1


cd usagym/; ls
usagym.csv usagym-Error.csv usagym-line.png

How it works:

$ ./FollowerHist.py --help

usage: FollowerHist.py [-h] [-g] [-p | -P] [-e] uname

Follower Count History. Given a Twitter
username, collect follower counts from
the Internet Archive.

positional arguments:

uname       Twitter username without @

optional arguments:

-h, --help show this help message and exit
-g Generate a graph with data points
-p Push to Internet Archive
-P Push to all archives available through ArchiveNow
-e Collect every memento, not just one per month

First, the timemap, the list of all mementos for that URI, is collected for http://twitter.com/username. Then, the script collects the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all the following apply:
    1. A previously created .csv of the name the script would generate does not contain the date.
    2. The memento is not in the same month as a previously collected memento, unless [-e] is used.
    3. The page format can be interpreted to find the follower count.
    4. The follower count number can be converted to an Arabic numeral.
A .csv is created, or appended to, containing the date, number of followers, and memento URI for each collected data point.
An error .csv is created, or appended to, with the date and memento URI for each data point that was not collected, along with the reason why. This file will contain repeated entries if the script is run multiple times, because old entries are not deleted when new errors are written.
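
To make these steps concrete, below is a minimal Python sketch, not the actual FollowerHist.py source. The TimeMap regex and the CSS selector are assumptions: the selector matches only one of Twitter's historical layouts, while the real script checks several.

# A simplified sketch of the steps above, not the actual FollowerHist.py source.
import re
import urllib.request
from bs4 import BeautifulSoup

def follower_counts(username, one_per_month=True):
    timemap_uri = ("http://web.archive.org/web/timemap/link/"
                   "http://twitter.com/" + username)
    with urllib.request.urlopen(timemap_uri) as response:
        timemap = response.read().decode("utf-8", errors="ignore")

    # Pull each memento's URI-M and its 14-digit datetime out of the TimeMap.
    mementos = re.findall(
        r'<(http://web\.archive\.org/web/(\d{14})/[^>]*)>;\s*rel="[^"]*memento',
        timemap)

    seen_months = set()
    for uri_m, datetime14 in mementos:
        month = datetime14[:6]                       # YYYYMM
        if one_per_month and month in seen_months:   # the [-e] flag disables this
            continue
        seen_months.add(month)
        try:
            with urllib.request.urlopen(uri_m) as response:
                soup = BeautifulSoup(response.read(), "html.parser")
        except Exception:
            continue                                 # unreachable memento: error CSV
        # Assumed selector for one historical Twitter layout.
        node = soup.select_one('a[data-nav="followers"] .ProfileNav-value')
        if node is None:
            continue                                 # unrecognized layout: error CSV
        count = int(re.sub(r"[^\d]", "", node.get_text()) or 0)
        yield datetime14, count, uri_m

for date, count, uri_m in follower_counts("USAGym"):
    print("{},{},{}".format(date, count, uri_m))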

If the [-g] flag is used, a .png of the line chart will be created "<twitter-username-without-@>-line.png".
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.

As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.

Special thanks to Orkun Krand, whose work I am continuing.
--Miranda Smith (@mir_smi)


2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscription-based sites, such as the Financial Times or the Wall Street Journal, because these sites only provide snippets of an article before confronting users with a "Subscribe Now" prompt to view the remaining content. The New York Times, like some other news sites, also has subscriber-only content, but access is limited only after a user has exceeded a set number of free stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took these URI-Ms from the homepages of their respective news sites and examined how the Internet Archive captured them over a period of a month.



The image above shows requests sent to the Internet Archive's Memento API, with the initial request at 0 days and then 1, 7, and 30 days added to the initial request date, to see whether the retrieved URI-M resolved to something other than 404. The initial requests to these mementos all returned a 404 status code. Adding a day to the memento datetime and requesting a new copy from the Internet Archive resulted in some of the URI-Ms resolving with a 200 response code, showing that these articles became available. Adding 7 days to the initial request datetime shows that by this time the Internet Archive has found copies for all but one URI-M; the same result is repeated at 30 days. The response code "0" indicates no response, caused by an infinite redirect loop. The chart is consistent with the idea that content is released for free once a period of time has passed.
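
The sketch below illustrates the kind of requests behind this chart (not the exact code we used): it asks the Wayback Machine for the capture closest to the original memento datetime and again at +1, +7, and +30 days, recording the final HTTP status after redirects. The example article URI is hypothetical.

# A sketch of the experiment described above (not the exact code we used).
from datetime import datetime, timedelta
import requests

def status_over_time(original_uri, first_memento_datetime):
    start = datetime.strptime(first_memento_datetime, "%Y%m%d%H%M%S")
    results = {}
    for days in (0, 1, 7, 30):
        timestamp = (start + timedelta(days=days)).strftime("%Y%m%d%H%M%S")
        uri_m = "http://web.archive.org/web/{}/{}".format(timestamp, original_uri)
        try:
            # The Wayback Machine redirects to the closest capture it holds.
            response = requests.get(uri_m, allow_redirects=True, timeout=30)
            results[days] = response.status_code
        except requests.TooManyRedirects:
            results[days] = 0   # "0" marks an infinite redirect loop, as in the chart
    return results

print(status_over_time("https://www.wsj.com/articles/example-article",
                       "20161101120000"))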

The New York Times articles end up redirecting to a different part of the New York Times website: https://web.archive.org/web/20100726195833/http://www.nytimes.com/glogin. Although each of these URIs resolves with a 404 status code, an earlier capture shows that it was a login page asking for signup or subscription:

Paywalls in Academia

Paywalls restrict not just news content but also academic content. When users follow a DOI assigned to a paper, they are often redirected to a splash page showing a short description of the paper but not the actual PDF document. An example of this is: http://www.springerlink.com/index/rw3572714v41q507.pdf. This URI seemingly should yield a PDF but actually resolves to a splash page:

In order to actually access the content, a user is first redirected to the splash page:
https://link.springer.com/article/10.1023%2FA%3A1022602019183?LI=true

This splash page then contains a link to the desired content:
https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183.pdf

An archived copy of the desired content is:
http://web.archive.org/web/*/https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183.pdf

An interesting find is that the archived URI without the document type ".pdf" at the end of the URI also contains mementos of the content and not the splash page:
http://web.archive.org/web/*/https://link.springer.com/content/pdf/10.1023%2FA%3A1022602019183
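
As a quick illustration (a sketch using the same DOI as in the URIs above), following the DOI's redirect chain confirms that it lands on the HTML splash page rather than the PDF itself:

# A sketch of following a DOI's redirect chain (same DOI as the example above).
import requests

response = requests.get("https://doi.org/10.1023/A:1022602019183",
                        allow_redirects=True, timeout=30)
for hop in response.history:
    print(hop.status_code, hop.url)           # each 30x redirect along the way
print(response.status_code, response.url)     # the final landing (splash) page
print(response.headers.get("Content-Type"))   # text/html, not application/pdf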

Organizations that are willing to pay for a subscription to an association that hosts academic papers will have access to the content. A popular example is the ACM Digital Library. When non-subscribed users visit pages like SpringerLink, they may not get the blue "Download PDF" button but rather a grey button signifying that the download is disabled.

Van de Sompel et al. investigated 1.6 million URI references from arXiv and PubMed Central and found that over 500,000 of the URIs were locating URIs, pointing at the current document location. These URIs can expire over time, and referencing them instead of DOIs forfeits the persistence that DOIs are meant to provide.

Searching for Similarity

When considering hard paywall sites like the Financial Times (FT) and the Wall Street Journal (WSJ), it is intuitive that most of the paywall pages a non-subscribed user sees will look largely the same. We experimented with 10 of the top WSJ articles on 11/01/2016, each scraped from the WSJ homepage. We compared each pair of articles by taking the SimHash of each article's HTML representation and computing the Hamming distance between each unique pair of SimHash bit strings.
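
A minimal 64-bit SimHash and Hamming distance computation is sketched below; the real experiment ran over the full WSJ article HTML, and the page strings here are placeholders.

# A minimal SimHash/Hamming-distance sketch of the comparison described above.
import hashlib
import re
from itertools import combinations

def simhash(text, bits=64):
    vector = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (digest >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Pairwise distances between (hypothetical) downloaded article HTML strings.
pages = {"article1": "<html>...article one...</html>",
         "article2": "<html>...article two...</html>"}
fingerprints = {name: simhash(html) for name, html in pages.items()}
for (name1, f1), (name2, f2) in combinations(fingerprints.items(), 2):
    print(name1, name2, hamming_distance(f1, f2))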

We found that pages with completely different representations stood out with a Hamming distance of 40+ bits, while articles with the same styled representation had at most a 3-bit Hamming distance, regardless of whether the article was a snippet or a full-length article. This showed that SimHash was not well suited for discovering differences in content but rather differences in content representation, such as changes in CSS, HTML, or JavaScript. It did not help our observations that WSJ was including entire font-family data text inside its HTML at the time. In reference to Maciej Ceglowski's post on "The Website Obesity Crisis," WSJ injecting a CSS font-family data string does not aid in a healthy "web pyramid":



From here, I decided to explore using a binary image classifier on a thumbnail of a news site, labeling an image as a "paywall_page" or a "content_page." To accomplish this I used TensorFlow and the very approachable examples provided by the "TensorFlow for Poets" tutorial. Utilizing the MobileNet model, I trained on 122 paywall images and 119 content-page images, mainly news homepages and articles. The images were collected using Google Images and manually classified as content or paywall pages.


I trained the model on the new images for 4,000 iterations, which produced an accuracy of 80-88%. I then built a simple web application named paywall-classify, which can be found on GitHub. It uses Puppeteer to take screenshots of a given list of URIs (maximum 10) at a resolution of 1920x1080 and then uses TensorFlow to classify the images. More instructions on how to use the application can be found in the repository readme.
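
For reference, the sketch below shows a comparable transfer-learning setup in modern tf.keras rather than the original "TensorFlow for Poets" retraining script; the screenshots/ directory layout with paywall_page/ and content_page/ subfolders is an assumption.

# Not the original "TensorFlow for Poets" retraining script; a comparable
# transfer-learning sketch with tf.keras and MobileNetV2. The directory layout
# screenshots/paywall_page/ and screenshots/content_page/ is assumed.
import tensorflow as tf

IMG_SIZE = (224, 224)

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "screenshots", image_size=IMG_SIZE, batch_size=16)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # only train the new classification head

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # paywall vs. content page
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)
model.save("paywall_classifier.h5")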

There are many other techniques that could be considered for image classification of web pages, for example, slicing a full-page image of a news website into sections. However, this approach would more than likely bias toward the content class, since the "subscribe now" banner is almost always at the top of an article, meaning only 1 of n slices would contain it. For this application I also did not consider scrolling down a page to trigger a JavaScript popup of a paywall message.

Other approaches might utilize textual analysis, such as performing Naive Bayes classification on terms collected from a paywall page and then building a classifier from there. 

What to take away

It is actually difficult to determine why some of the URI-Ms listed result in 404 responses while other articles from those sites return a 200 response for their first memento. The New York Times has a limit of 10 "free" articles per user, so perhaps at crawl time the Internet Archive hit its quota. Mat Kelly et al., in Impact of URI Canonicalization on Memento Count, describe "archived 302s", where a live web site returns an HTTP 302 redirect at crawl time; these New York Times articles may actually have been redirecting to a login page when they were crawled.

-Grant Atkins (@grantcatkins)

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English


Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages whose default HTML language was expected to be English, but whose template appears in a foreign language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. Notice that some of the interface text, such as "followers", "following", and "log in", is not displayed in English but instead in Urdu. A similar observation was expressed by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications when looked at from a digital archivist's perspective.

The problem became more evident when Miranda Smith (a member of the WSDL lab) was finalizing the implementation of a Twitter Follower-History-Count tool. The application uses mementos extracted from the Internet Archive (IA) to find the number of followers a particular Twitter account had acquired over time. The tool expects the web page retrieved from the IA to be rendered in English in order to scrape the number of followers the account had at a particular time. Since it was now evident that Twitter pages were not archived only in English, we had to decide whether to account for all possible language settings or to discard non-English mementos. We asked ourselves: why are some Twitter pages archived in non-English languages when we generally expect them to be in English? Note that we are referring to the interface/template language and not the language of the tweet content.

We later found that this issue is more prevalent than we initially thought. We selected former US President Barack Obama as the personality to explore how many languages, and how often, his Twitter page was archived in. We downloaded the TimeMap of his page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in their widgets) across five different web archives: the Internet Archive (IA), Archive-It (A-It), the Library of Congress (LoC), the UK Web Archive (UKWA), and the Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. Of the remaining 47% of mementos, 22% were archived in Kannada and 25% in the 45 other languages combined. We excluded mementos from our dataset that were not "200 OK" or did not have language information.
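
The sketch below shows how such a language census can be run (it is not our exact analysis code): download a TimeMap through a MemGator aggregator, dereference each memento, and tally the lang attribute of the <html> element. The public MemGator endpoint used here is an assumption about deployment.

# A sketch of the language census described above (not our exact analysis code).
import re
from collections import Counter
import requests

def memento_language_census(uri_r, limit=None):
    timemap_uri = "https://memgator.cs.odu.edu/timemap/link/" + uri_r
    timemap = requests.get(timemap_uri, timeout=60).text
    uri_ms = re.findall(r'<([^>]+)>;\s*rel="[^"]*memento', timemap)
    census = Counter()
    for uri_m in uri_ms[:limit]:
        try:
            response = requests.get(uri_m, timeout=60)
        except requests.RequestException:
            continue
        if response.status_code != 200:
            continue  # we excluded non-200 mementos from the dataset
        match = re.search(r'<html[^>]*\slang="([^"]+)"', response.text)
        if match:
            census[match.group(1)] += 1
    return census

print(memento_language_census("https://twitter.com/BarackObama", limit=100))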

Fig. 2 shows that in the UKWA, English accounts for only 5% of the languages in which Barack Obama's Twitter pages were archived. Conversely, in the IA, about as many of Barack Obama's Twitter pages are archived in English as in all the remaining languages combined. It is worth noting that A-It is a subset of the IA. On the one hand, it is good to have more language diversity in archives (for example, the archival record is more complete for English-language web pages than for other languages). On the other hand, it is very disconcerting when a page is captured in a language that was not anticipated. We also noted that Twitter pages in the Kannada language are archived more often than all other non-English languages combined, although Kannada ranks 32nd globally by number of native speakers, who make up 0.58% of the global population. We tried to find out why some Twitter pages were archived in non-English languages when they belong to accounts that generally tweet in English, and why Kannada is so prevalent among the non-English languages. Our findings follow.

Fig. 2 Barack Obama Twitter Page Language Distribution in Web Archives

We started investigating the reason why web archives sometimes capture pages in non-English languages, and we came up with the following potential reasons:
  • Some JavaScript in the archived page is changing the template text in another language at the replay time
  • A cached page on a shared proxy is serving content in other languages
  • "Save Page Now"-like features are utilizing users' browsers' language preferences to capture pages
  • Geo-location-based language setting
  • Crawler jobs are intentionally or unintentionally configured to send a different "Accept-Language" header
The actual reason turned out to have nothing to do with any of these; instead, it was related to cookies. However, describing our thought process and how we arrived at the root of the issue offers some lessons worth sharing.

Evil JavaScript


Since JavaScript is known to cause issues in web archiving (a previous blog post by John Berlin expands on this problem), both at capture and replay time, we first thought this had to do with some client-side localization in which the wrong translation file was leaking in at replay time. However, when we looked at the page source in a browser as well as on the terminal using curl (as illustrated below), it was clear that the translated markup is generated on the server side. Hence, this possibility was ruled out.

$ curl --silent https://twitter.com/?lang=ar | grep "<meta name=\"description\""
<meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">

Caching


We thought Twitter might be doing content negotiation using the "Accept-Language" request header, so we changed the language preference in our web browser and opened Twitter in an incognito window, which confirmed our hypothesis: Twitter did indeed consider the language preference sent by the browser and responded with a page in that language. However, when we investigated the HTTP response headers, we found that twitter.com does not return the "Vary" header when it should. This behavior can be dangerous because content negotiation is happening on the "Accept-Language" header but is not advertised as a factor of content negotiation. This means a proxy can cache a response to a URI in some language and serve it back to someone else who requests the same URI, even with a different language in their "Accept-Language" setting. We considered this a potential way an undesired response could get archived.

On further investigation we found that Twitter tries very hard (sometimes in wrong ways) to make sure their pages are not cached. This can be seen in their response headers illustrated below. The Cache-Control and obsolete Pragma headers explicitly ask proxies and clients not to cache the response itself or anything about the response by setting values to "no-cache" and "no-store". The Date (the date/time at which the response was originated) and Last-Modified headers are set to the same value to ensure that the cache (if stored) becomes invalid immediately. Additionally, the Expires header (the date/time after which the response is considered stale) is set to March 31, 1981, a date far in the past, long before Twitter even existed, to further enforce cache invalidation.


$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
pragma: no-cache
date: Sun, 18 Mar 2018 17:43:25 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
...

Hence, the possibility of a cache returning pages in different languages due to the missing "Vary" header was also not sufficient to justify the number of mementos in non-English languages.

Geo-location


We thought about the possibility that Twitter identifies a potential language for guest visitors based on their IP address (to guess the geo-location). However, the languages seen in mementos do not align with the places where archival crawlers are located. For example, the Kannada language that is dominating in the UK Web Archive is spoken in the State of Karnataka in India, and it is unlikely that the UK Web Archive is crawling from machines located in Karnataka.

On-demand Archiving


The Internet Archive recently introduced the "Save Page Now" feature, which acts as a proxy and forwards the user's request headers to the upstream web server rather than its own. This behavior can be observed in a memento that we requested for an HTTP echo service, HTTPBin, from our browser. The service echoes back in the response the data it receives from the client in the request, so by archiving it we expect to see the headers that the service saw from the requesting client. The headers shown below are those of our browser, not of the IA's crawler, especially the "Accept-Language" header (which we customized in our browser) and the "User-Agent" header, which confirms our hypothesis that IA's Save Page Now feature acts as a proxy.

$ curl http://web.archive.org/web/20180227154007/http://httpbin.org/anything
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate",
"Accept-Language": "es",
"Connection": "close",
"Host": "httpbin.org",
"Referer": "https://web.archive.org/",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
},
"json": null,
"method": "GET",
"origin": "207.241.225.235",
"url": "https://httpbin.org/anything"
}


This behavior made us consider that people from different regions of the world, with different language settings in their browsers, would end up preserving Twitter pages in their preferred language when using the "Save Page Now" feature (since Twitter does honor the "Accept-Language" header in some cases). However, we were unable to replicate this in our browser. Also, not every archive offers on-demand archiving, so archives that never replay users' request headers cannot be explained this way.

We also repeated this experiment with Archive.is, another on-demand web archive. Unlike the IA, it does not replay users' headers like a proxy; instead, it sends its own custom request headers. Archive.is does not show the original markup, instead modifying the page heavily before serving it, so curl output is not very useful here. However, the content of our archived HTTP echo service page looks like this:

{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip",
"Accept-Language": "tt,en;q=0.5",
"Connection": "close",
"Host": "httpbin.org",
"Referer": "https://www.google.co.uk/",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2704.79 Safari/537.36"
},
"json": null,
"method": "GET",
"origin": "128.82.7.11, 128.82.7.11",
"url": "https://httpbin.org/anything"
}

Note that it sends its own custom "Accept-Language" and "User-Agent" headers (different from those of the browser from which we requested the capture), and it includes a custom "Referer" header. However, unlike the IA, it replayed our IP address as the origin. We then captured https://twitter.com/?lang=ar (http://archive.is/cioM5) followed by https://twitter.com/phonedude_mln/ (http://archive.is/IbHgB) to see if the language session sticks across two successive Twitter requests, but that was not the case, as the second page was archived in English (not in Arabic). This does not necessarily prove that their crawler does not have the issue: it is possible that two different instances of their crawler handled the two requests, or that some other Twitter links (with "?lang=en") were archived by someone else between our two requests. We do not have sufficient information to be certain.

Misconfigured Crawler


Some of the early mementos in which we observed this behavior were from Archive-It, so we thought that some collection maintainers might have misconfigured their crawl jobs to send a non-default "Accept-Language" header, resulting in such mementos. Since we did not have access to their crawling configuration, there was very little we could do to test this hypothesis. Many of the leading web archives, including Archive-It, use Heritrix as their crawler, and we happen to have some WARC files from Archive-It, so we started looking into those. We examined the request records in those WARC files for any Twitter links to see what "Accept-Language" header was sent. We were quite surprised to see that Heritrix never sent an "Accept-Language" header to any server, so this could not be the reason at all. However, when looking into those WARC files, we saw "Cookie" headers sent to the servers in the request records for Twitter and many other sites. This led us to uncover the actual cause of the issue.

Cookies, the Real Culprit


So far, we had been considering Heritrix to be a stateless crawler, but when we looked into the Archive-It WARC files, we observed cookies being sent to servers. This means Heritrix does have cookie management built in (which is often necessary to meaningfully capture some sites). With this discovery, we started investigating Twitter's behavior from a different perspective. Twitter's page source has a list of alternate links for each language they provide localization for (currently 47 languages), and this list can get added to the crawler's frontier queue. Although these links have different URIs (each with a query parameter "?lang=<lang-code>"), once any of them is loaded, the session is set to that language until the language is explicitly changed or the session expires or is cleared. In the past, Twitter's interface had options to manually select a language, which also got set for the session. It is understandable that general-purpose web sites cannot rely completely on "Accept-Language" for localization-related content negotiation, as browsers have made it difficult to customize language preferences, especially on a per-site basis.

We experimented with Twitter's language related behavior in our web browser by navigating to https://twitter.com/?lang=ar, which yields the page in the Arabic language. Then navigating to any Twitter page such as https://twitter.com/ or https://twitter.com/ibnesayeed (without the explicit "lang" query parameter) continues to serve Arabic pages (if a Twitter account is not logged in). Here is how Twitter's server behaves for language negotiation:

  • If a "lang" query parameter (with a supported language) is present in any Twitter link, that page is served in the corresponding language.
  • If the user is a guest, value from the "lang" parameter is set for the session (this gets set each time an explicit language parameter is passed) and remains sticky until changed/cleared.
  • If the user is logged in (using Twitter's credentials), the default language preference is taken from their profile preferences, so the page will only show in a different language if an explicit "lang" parameter is present in the URI. However, it is worth noting that crawlers generally behave like guests.
  • If the user is a guest and no "lang" parameter is passed, Twitter falls back to the language supplied in the "Accept-Language" header.
  • If the user is a guest, no "lang" parameter is passed, and no "Accept-Language" header is provided, then responses are in English (though, this could be affected by Geo-IP, which we did not test).

In the example below we illustrate some of this behavior using curl. First, we fetch Twitter's home page in Arabic using the explicit "lang" query parameter and show that the response was indeed in Arabic, as it contains the lang="ar" attribute in the <html> element. We also saved any cookies the server set to the file "/tmp/twitter.cookie", and then show that this file does indeed have a "lang" cookie with the value "ar" (there are some other cookies in it, but they are not relevant here). Next, we fetched Twitter's home page without any explicit "lang" query parameter and received a response in the default English language. Then we fetched the home page with the "Accept-Language: ur" header and got the response in Urdu. Finally, we fetched the home page again, this time supplying the saved cookies (which include the "lang=ar" cookie), and received the response in Arabic again.
$ curl --silent -c /tmp/twitter.cookie https://twitter.com/?lang=ar | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

$ cat /tmp/twitter.cookie | grep lang
twitter.com FALSE / FALSE 0 lang ar

$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">

$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">

$ curl --silent -b /tmp/twitter.cookie https://twitter.com/ | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">


Twitter Cookies and Heritrix


Now that we understood the reason, we wanted to replicate what happens in a real archival crawler. We used Heritrix to simulate the effect that Twitter cookies have when a Twitter page gets archived in the IA. We seeded the following URIs, in this order, in Heritrix's configuration file; the order was carefully chosen so we could see whether the first link sets the language to Arabic and the second one then gets captured in Arabic:
  1. https://twitter.com/?lang=ar
  2. https://twitter.com/phonedude_mln/
We had already shown that the first URI, which includes the language identifier for Arabic (lang=ar), will place the language identifier inside the cookie. The question now becomes: what effect will this cookie have on subsequent requests for future Twitter pages? Will the language identifier stay the same as the one already set in the cookie, or will it revert to a default language preference? The naive expectation for our seeded URIs is that the first Twitter page will be archived in Arabic and the second page in English, since a request without an explicit "lang" parameter usually defaults to English. However, since we have observed that Twitter's cookies contain the language identifier when this parameter is passed in the URI, it is plausible that the language identifier will be maintained if subsequent Twitter requests reuse the same cookie.

After running the crawling job in Heritrix for the seeded URIs, we inspected the WARC file generated by Heritrix. The results were as we expected. Heritrix was indeed saving and replaying "Cookie" headers, resulting in the second page being captured in Arabic. Relevant portions of the resulting WARC file are shown below:

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Concurrent-To: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
WARC-Record-ID: <urn:uuid:473273f6-48fa-4dd3-a5f0-81caf9786e07>
Content-Type: application/http; msgtype=request
Content-Length: 301

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC shown above is the request record for the URI https://twitter.com/?lang=ar. It shows the GET request made to the host "twitter.com" with the path and query parameter "/?lang=ar". This request yielded a response from Twitter that contains a "set-cookie" header with the language identifier from the URI, "lang=ar", as shown in the portion of the WARC below. The HTML was rendered in Arabic (notice the <html> element with the lang attribute in the response payload below).

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Payload-Digest: sha1:FCOPDBN2U5LXU7FEUUGQ4WXYGR7OP5JI
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
Content-Type: application/http; msgtype=response
Content-Length: 151985

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 150665
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:44 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:44 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:34 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: lang=ar; Path=/
set-cookie: ct0=10558ec97ee83fe0f2bc6de552ed4b0e; Expires=Sat, 17 Mar 2018 03:58:44 UTC; Path=/; Domain=.twitter.com; Secure
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: 2a2fc89f51b930202ab24be79b305312
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 100
x-transaction: 001495f800dc517f
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

The subsequent request for the second seeded URI (https://twitter.com/phonedude_mln/) generated an additional request record, shown in the WARC portion below. It shows the GET request made to the host "twitter.com" with the path "/phonedude_mln/". Notice that a "Cookie" header with the value lang=ar, set as a result of the first seeded URI, was included in the request.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Concurrent-To: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
WARC-Record-ID: <urn:uuid:eef134ed-f3dc-459b-95e7-624b4d747bc1>
Content-Type: application/http; msgtype=request
Content-Length: 655

GET /phonedude_mln/ HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: lang=ar; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; ct0=10558ec97ee83fe0f2bc6de552ed4b0e; guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown below shows the effect of Heritrix saving and replaying the "Cookie" headers. The <html> element proves that the HTML language was set to Arabic for the second seeded URI (https://twitter.com/phonedude_mln/), even though this URI did not include the language identifier.

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Payload-Digest: sha1:5LI3DGWO6NGK4LWSIHFZZHW43H2Z2IWA
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
Content-Type: application/http; msgtype=response
Content-Length: 518086

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 516921
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:48 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:48 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:38 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: ef102c969c74f3abf92966e5ffddb6ba
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 335
x-transaction: 0014986c00687fa3
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">

We used PyWb to replay pages from the captured WARC file. Fig. 3 is the page rendered after retrieving the first seeded URI of our collection (https://twitter.com/?lang=ar). For those not familiar with Arabic, this is indeed Twitter's home page in Arabic.

Fig.3  https://twitter.com/?lang=ar

Fig. 4 is the representation given by PyWb after requesting the second seeded URI (https://twitter.com/phonedude_mln). The page was rendered using Arabic as the default language, although we did not include this setting in the URI, nor did our browser language settings include Arabic.

Fig.4  https://twitter.com/phonedude_mln/ in Arabic

Why is Kannada More Prominent?


As we noted before, Twitter's page source now includes a list of alternate links for 47 supported languages. These links look something like this:

<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="en" href="https://twitter.com/?lang=en">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
...
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">

The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives. While each language-specific link overwrites the session set by its predecessor, the session set by the last one persists and affects many more Twitter links in the frontier queue. Twitter started supporting Kannada, along with three other Indian languages, in July 2015 and placed it at the very end of the language-related alternate links. Since then, it has been captured in various archives more often than any other non-English language. Before these new languages were added, Bengali was the last link in the alternate-language list for about a year, and our dataset shows dense archival activity for Bengali between July 2014 and July 2015, after which Kannada took over. This confirms our hypothesis that the placement of the last language-related link keeps the session stuck in that language for a long time, affecting all upcoming links from the same domain in the crawler's frontier queue until another language-specific link overwrites the session.

What Should We Do About It?


Disabling cookies does not seem to be a good option for crawlers, as some sites try hard to set a cookie by repeatedly returning redirect responses until their desired "Cookie" header is included in the request. However, explicitly reducing the cookie expiration duration in crawlers could mitigate the long-lasting impact of such sticky cookies: garbage-collecting any cookie that was set more than a few seconds ago would ensure that no cookie is reused for more than a few successive requests. Sandboxing crawl jobs into many isolated sessions is another potential way to minimize the impact. Alternatively, filtering policies can be set so that URLs that set session cookies are downloaded in a separate, short-lived session, isolating them from the rest of the crawl frontier queue.
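
As a rough illustration of the cookie garbage-collection idea (a sketch in Python with requests, not a Heritrix configuration), the session below drops any cookie older than a configurable number of seconds, so a sticky "lang" cookie set by one seed cannot affect the rest of a crawl:

# A sketch of cookie garbage collection for a crawler-like HTTP session.
import time
import requests

MAX_COOKIE_AGE = 10  # seconds; tune per crawl

class ShortLivedCookieSession(requests.Session):
    def __init__(self):
        super().__init__()
        self._cookie_birth = {}  # cookie name -> time it was first seen

    def request(self, method, url, **kwargs):
        now = time.time()
        # Forget cookies that are older than the allowed age.
        for name in list(self.cookies.keys()):
            if now - self._cookie_birth.get(name, now) > MAX_COOKIE_AGE:
                del self.cookies[name]
                self._cookie_birth.pop(name, None)
        response = super().request(method, url, **kwargs)
        # Record when newly set cookies first appeared.
        for name in self.cookies.keys():
            self._cookie_birth.setdefault(name, now)
        return response

# The second request carries the "lang=ar" cookie only if it happens within
# MAX_COOKIE_AGE seconds of the first one.
session = ShortLivedCookieSession()
session.get("https://twitter.com/?lang=ar")
time.sleep(MAX_COOKIE_AGE + 1)
session.get("https://twitter.com/")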

Conclusions


The problem of Twitter pages unintentionally being archived in non-English languages is quite significant. We found that 47% of the mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case it is disconcerting and counter-intuitive. We found that the root cause is Twitter's sticky language sessions, maintained using cookies, which the Heritrix crawler honors.

Because Kannada is the last language in the list of language-specific alternate links on Twitter's pages, its cookie overwrites the language cookies set by the URLs listed above it, causing more Twitter pages in the frontier queue to be archived in Kannada than in any other non-English language. Crawlers are generally considered stateless, but honoring cookies makes them somewhat stateful. This behavior may not be specific to Twitter; many other sites that use cookies for content negotiation could have similar consequences in web archives. The issue can potentially be mitigated by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling of URLs from the same domain across many small sandboxed instances.

--
Sawood Alam
and
Plinio Vargas

2018-04-09: Trip Report for the National Forum on Ethics and Archiving the Web (EAW)


On March 22-24, 2018 I attended the National Forum on Ethics and Archiving the Web (EAW), hosted at the New Museum and organized by Rhizome and the members of the Documenting the Now project.  The nor'easter "Toby" frustrated the travel plans of many, causing my friend Martin Klein to cancel completely and me to not arrive at the New Museum until after the start of the second session at 2pm on Thursday.  Fortunately, all the sessions were recorded and I link to them below.

Day 1 -- March 22, 2018


Session 1 (recording) began with a welcome, and then a keynote by Marisa Parham, entitled "The Internet of Affects: Haunting Down Data".  I did have the privilege of seeing her keynote at the last DocNow meeting in December, and looking at the tweets ("#eaw18") she addressed some of the same themes, including the issues of the process of archiving social media (e.g., tweets) and the resulting decontextualization, including "Twitter as dataset vs. Twitter as experience", and "how do we reproduce the feeling of community and enhance our understanding of how to read sources and how people in the past and present are engaged with each other?"  She also made reference to the Twitter heat map for showing interaction with the Ferguson grand jury verdict ("How a nation exploded over grand jury verdict: Twitter heat map shows how 3.5 million #Ferguson tweets were sent as news broke that Darren Wilson would not face trial").



After Marisa's keynote was the panel on "Archiving Trauma", with Michael Connor (moderator), Chido Muchemwa, Nick Ruest (slides), Coral Salomón, Tonia Sutherland, and Lauren Work.  There are too many important topics here and I did not experience the presentations directly, so I will refer you to the recording for further information and a handful of selected tweets below. 


The next session after lunch was "Documenting Hate" (recording), with Aria Dean (moderator), Patrick Davison, Joan Donovan, Renee Saucier, and Caroline Sinders.  I arrived at the New Museum about 10 minutes into this panel.  Caroline spoke about the Pepe the Frog meme, its appropriation by Neo-Nazis, and the attempt by its creator to wrest it back -- "How do you balance the creator’s intentions with how culture has remixed their work?"

Joan spoke about a range of topics, including archiving the Daily Stormer forum, archiving the disinformation regarding the attacks in Charlottesville this summer (including false information originating on 4chan about who drove the car), and an algorithmic image collection technique for visualizing trending images in the collection.


Renee Saucier talked about experiences collecting sites for the "Canadian Political Parties and Political Interest Groups" (Archive-It collection 227), which includes Neo-Nazi and affiliated political parties.


The next panel was "Web Archiving as Civic Duty", with Amelia Acker (co-moderator), Natalie Baur, Adam Kriesberg (co-moderator) (transcript), Muira McCammon, and Hanna E. Morris.  My own notes on this session are sparse (in part because most of the presenters did not use slides), so I'll include a handful of tweets I noted that I feel succinctly capture the essence of the presentations.  I did find a link to Muira's MS thesis "Reimagining and Rewriting the Guantánamo Bay Detainee Library: Translation, Ideology, and Power", but it is currently under embargo.  I did find an interview with her that is available and relevant.  Relevant to Muira's work with deleted US Govt accounts is Justin Littman's recent description of a disinformation attack with re-registering deleted accounts ("Vulnerabilities in the U.S. Digital Registry, Twitter, and the Internet Archive").


The third session, "Curation and Power" (recording) began with a panel with Jess Ogden (moderator), Morehshin Allahyari, Anisa Hawes, Margaret Hedstrom, and Lozana Rossenova.  Again, I'll borrow heavily from tweets. 


The final session for Thursday was the keynote by Safiya Noble, based on her recent book "Algorithms of Oppression" (recording).  I really enjoyed Safiya's keynote; I had heard of some of the buzz and controversy (see my thread (1, 2, 3) about archiving some of the controversy) around the book but I had not yet given it a careful review (if you're not familiar with it, read this five minute summary Safiya wrote for Time).  I include several insightful tweets from others below, but I'll also summarize some of the points that I took away from her presentation (and they should be read as such and not as a faithful or complete transcription of her talk).

First, as a computer scientist I understand and am sympathetic to the idea that the ranking algorithms that Google et al. use should be neutral.  It's an ultimately naive and untenable position, but I'd be lying if I said I did not understand the appeal.  The algorithms that help us differentiate quality pages from spam pages about everyday topics like artists, restaurants, and cat pictures do what they do well.  In one of the examples I use in my lecture (slides 55-58), it's the reason why for the query "DJ Shadow", the wikipedia.org and last.fm links appear on Google's page 1, and djshadow.rpod.ru appears on page 15: in this case the ranking of the sites based on their popularity in terms of links, searches, clicks, and other user-oriented metrics makes sense.  But what happens when the query is, as Safiya provides in her first example, "black girls"?  The result (ca. 2011) is almost entirely porn (cf. the in-conference result for "asian girls"), and the algorithms that served us so well in finding quality DJ Shadow pages in this case produce a socially undesirable result.  Sure, this undesirable result is from having indexed the global corpus (and our interactions with it) and is thus a mirror of the society that created those pages, but given the centrality in our lives that Google enjoys and the fact that people consider it an oracle rather than just a tool that gives undesirable results when indexing undesirable content, it is irresponsible for Google to ignore the feedback loop that they provide; they no longer just reflect the bias, they hegemonically reinforce the bias, as well as give attack vectors for those who would defend the bias.

Furthermore, there is already precedent for adjusting search results to eliminate bias in other dimensions: for example, PageRank by itself is biased against late-arriving pages/sites (e.g., "Impact of Web Search Engines on Page Popularity"), so search engines (SEs) adjust the rankings to accommodate these pages.  Similarly, Google has a history of intervening to remove "Google Bombs" (e.g., "miserable failure"), punish attempts to modify ranking, and even replacing results pages with jokes -- if these modifications are possible, then Google can no longer pretend the algorithm results are inviolable. 

She did not confine her criticism to Google, she also examined query results in digital libraries like ArtStor.  The metadata describing the contents in the DL originate from a point-of-view, and queries with a different POV will not return the expected results.  I use similar examples in my DL lecture on metadata (my favorite is reminding the students that the Vietnamese refer to the Vietnam War as the "American War"), stressing that even actions as seemingly basic as assigning DNS country codes (e.g., ".ps") are fraught with geopolitics, and that neutrality is an illusion even in a discipline like computer science. 

There's a lot more to her talk than I have presented, and I encourage you to take the time to view it.  We can no longer pretend Google is just the "backrub" crawler and google.stanford.edu interface; it is a lens that both shows and shapes who we are.  That's an awesome responsibility and has to be treated as such.


Day 2 -- March 23, 2018


The second day began with the panel "Web as Witness - Archiving & Human Rights" (recording), with Pamela Graham (moderator), Anna Banchik, Jeff Deutch, Natalia Krapiva, and Dalila Mujagic. Anna and Natalia presented the activities of the UC Berkeley Human Rights Investigations Lab, where they do open-source investigations (discovering, verifying, geo-locating, and more) of publicly available data about human rights violations.  Next was Jeff talking about the Syrian Archive, and the challenges they faced with YouTube algorithmically removing what it believed to be "extremist content".  He also had a nice demo of how they used image analysis to identify munitions videos uploaded by Syrians.  Dalila presented the work of WITNESS, an organization promoting the use of video to document human rights violations and how such videos can be used as evidence.  The final presentation was about airwars.org (a documentation project about civilian casualties in air strikes), but I missed a good part of this presentation as I focused on my upcoming panel.


My session, "Fidelity, Integrity, & Compromise", was Ada Lerner (moderator) (site), Ashley Blewer (slides, transcript), Michael L. Nelson (me) (slides), and Shawn Walker (slides).  I had the luxury of going last, but that meant that I was so focused on reviewing my own material that I could not closely follow their presentations.  I and my students have read Ada's paper and it is definitely worth reviewing.  They review a series of attacks (and fixes) that all center around "abandoned" live web resources (what we called "zombies") that can be (re-)registered and then included in historical pages.  That sounds like a far-fetched attack vector, except when you remember that modern pages include 100s of resources from many different sites via JavaScript, and there is a good chance that any page is likely to include a zombie whose live web domain is available for purchase.  Shawn's presentation dealt with research issues surrounding the use of social media, and Ashley's talk dealt with the role of fixity information (e.g., "There's a lot "oh I should be doing that" or "I do that" but without being integrated holistically into preservation systems in a way that brings value or a clear understand as to the "why"").  As for my talk, I asserted that Brian Williams first performed "Gin and Juice" in 1992, a full year before Snoop Dogg, and I have a video of a page in the Internet Archive to "prove" it.  The actual URI in which it is indexed in the Internet Archive is obfuscated, but this video is 1) of an actual page in the IA, that 2) pulls live web content into the archive, despite the fixes that Ada provided, and 3) the page rewrites the URL in the address bar to pretend to be at a different URL and time (in this case, dj-jay-requests.surge.sh, and 19920531014618 (May 31, 1992)).






The last panel before lunch was "Archives for Change", with Hannah Mandel (moderator), Lara Baladi, Natalie Cadranel, Lae’l Hughes-Watkins, and Mehdi Yahyanejad.  My notes for this session are sparse, so again I'll just highlight a handful of useful tweets.




After lunch, the next session (recording) was a conversation between Jarrett Drake and Stacie Williams on their experiences developing the People's Archive of Police Violence in Cleveland, which "collects, preserves, and shares the stories, memories, and accounts of police violence as experienced or observed by Cleveland citizens."  This was the only panel with the format of two people having a conversation (effectively interviewing each other) about their personal transformation and lessons learned.


The next session was "Stewardship & Usage", with Jefferson Bailey, Monique Lassere, Justin Littman, Allan Martell, and Anthony Sanchez.  Jefferson's excellent talk was entitled "Lets put our money where our ethics are", and was an eye-opening discussion about the state of funding (or lack thereof) for web archiving. The tweets below capture the essence of the presentation, but this is definitely one you should take the time to watch.  Allan's presentation addressed the issues of building "community archives" and being aware of tensions that exist between different marginalized groups. Justin's presentation was excellent, detailing both GWU's collection activities and the associated ethical challenges (including who and what to collect) and the gap between collecting via APIs and archiving web representations.  I believe Anthony and Monique jointly gave their presentation about how ethical web archiving requires proper representation from marginalized communities.



The next panel "The Right to be Forgotten", was in Session 7 (recording), and featured Joyce Gabiola (moderator), Dorothy Howard, and Katrina Windon.  The right to be forgotten is a significant issue facing search engines in the EU, but has yet to arrive as a legal issue in the US.  Again, my notes on this session are sparse, so I'm relying on tweets. 


The final regular panel was "The Ethics of Digital Folklore", and featured Dragan Espenschied (moderator) (notes), Frances Corry, Ruth Gebreyesus, Ian Milligan (slides), and Ari Spool.  At this point my laptop's battery died so I have absolutely no notes on this session. 


The final session was with Elizabeth Castle, Marcella Gilbert, Madonna Thunder Hawk, with an approximately 10 minute rough cut preview of "Warrior Women", a documentary about Madonna Thunder Hawk, her daughter Marcella Gilbert, Standing Rock, and the DAPL protests.


Day 3 -- March 24, 2018


Unfortunately, I had to leave on Saturday and was unable to attend any of the nine workshop sessions: "Ethical Collecting with Webrecorder", "Distributed Web of Care", "Open Source Forensics", "Ethically Designing Social Media from Scratch", "Monitoring Government Websites with EDGI", "Community-Based Participatory Research", "Data Sharing", "Webrecorder - Sneak Preview", "Artists’ Studio Archives", and unconference slots.   There are three additional recorded sessions corresponding to the workshops that I'll link here (session 8, session 9, session 10) because they'll eventually scroll off the main page.

This was a great event and the enthusiasm with which it was greeted is an indication of the importance of the topic.  There were so many great presentations that I'm left with the unenviable task of writing a trip report that's simultaneously too long and does not do justice to any of the presentations.  I'd like to thank the other members of my panel (Ada, Shawn, and Ashley), all who live-tweeted the event, the organizers at Rhizome (esp. Michael Connor), Documenting the Now (esp. Bergis Jules), the New Museum, and the funders: IMLS and the Knight Foundation.   I hope they will find a way to do this again soon.

--Michael

See also: Ashley Blewer wrote a short summary of EAW, with a focus on the keynotes and  three different presentations.  Please let me know if there are other summaries / trip reports to add.

Also, please feel free to contact me with additions / corrections for the information and links above.  





2018-04-13: Web Archives are Used for Link Stability, Censorship Avoidance, and Traffic Siphoning

ISIS members immolating captured Jordanian pilot
Web archives have been used for purposes other than digital preservation and browsing historical data. These purposes can be divided into three categories:

  1. Uploading content to web archives to ensure continuous availability of the data.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of direct links, when referring to news sites with opposing ideologies, to avoid increasing their web traffic and to deprive them of ad revenue.

1. Uploading content to web archives to ensure continuous availability of the data


Web archives, by design, are intended to solve the problem of digital data preservation so people can access data when it is no longer available on the live web. In the paper Who and What Links to the Internet Archive (Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson, 2013), the authors show that 65% of the requested archived pages no longer exist on the live web. The paper also determines where Internet Archive's Wayback Machine users come from. The following table, from the paper, contains the top 10 referrers that link to IA’s Wayback Machine; these top 10 referrers account for 51.9% of all referrers, and en.wikipedia.org outnumbers all other sites, including search engines and the home page of the Internet Archive (archive.org).
The top 10 referrers that link to IA’s Wayback Machine
Who and What Links to the Internet Archive, (AlNoamany et al. 2013) Table 5
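
The paper derives this breakdown from the Wayback Machine's own access logs. As a rough illustration of how such a referrer tally can be computed from ordinary web server logs, here is a minimal Python sketch; it is not the analysis pipeline from the paper, and the log file name and Combined Log Format assumption are mine.

    from collections import Counter
    from urllib.parse import urlparse
    import re

    # Combined Log Format: ... "request" status bytes "referer" "user-agent"
    LOG_LINE = re.compile(r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referer>[^"]*)"')

    def top_referrer_hosts(log_path, n=10):
        """Tally which sites send visitors, by the host in the Referer header."""
        hosts = Counter()
        with open(log_path) as log:
            for line in log:
                match = LOG_LINE.search(line)
                if not match:
                    continue
                referer = match.group("referer")
                if referer and referer != "-":
                    hosts[urlparse(referer).netloc.lower()] += 1
        total = sum(hosts.values()) or 1
        return [(host, count, 100.0 * count / total) for host, count in hosts.most_common(n)]

    # Hypothetical usage; "access.log" is an assumed file name.
    for host, count, pct in top_referrer_hosts("access.log"):
        print(f"{host:30s} {count:10d} {pct:5.1f}%")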

Sometimes the archived data is controversial, and users want to make sure that they can refer back to it later in case it is removed from the live web. A clear example is the deleted tweets of U.S. President Donald Trump.
Mr. Trump's deleted tweets on politwoops.eu
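
A user who fears a page (a tweet, an article, anything reachable by URL) might disappear can push it into the Wayback Machine through the public "Save Page Now" endpoint at https://web.archive.org/save/. The sketch below is one minimal way to do that with the requests library; the example URL is hypothetical, and the redirect behavior noted in the comment is typical rather than guaranteed.

    import requests

    SAVE_ENDPOINT = "https://web.archive.org/save/"

    def save_page_now(url):
        """Ask the Wayback Machine to capture a live URL right now."""
        response = requests.get(SAVE_ENDPOINT + url, timeout=120)
        response.raise_for_status()
        # The Wayback Machine typically redirects to the fresh capture,
        # so response.url then points at the archived snapshot.
        return response.url

    # Hypothetical example URL.
    print(save_page_now("https://example.com/a-page-that-might-vanish"))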


2. Avoiding governments' censorship or websites' terms of service


Using the Internet Archive to work around the terms of service of file sharing sites was addressed by Justin Littman in a blog post, Islamic State Extremists Are Using the Internet Archive to Deliver Propaganda. He stated that ISIS sympathizers are using the Internet Archive as a web delivery platform for extremist propaganda, posing a threat to the archival mission of the Internet Archive. Mr. Littman did not evaluate the content to determine if it is extremist in nature since much of it is in Arabic. This behavior is not new; it was noted with data uploaded by Al-Qaeda sympathizers long before ISIS was created. Al-Qaeda uploaded the file https://archive.org/details/osamatoobama to the Internet Archive on February 16, 2010 to circumvent the content removal policies of file sharing sites. ISIS sympathizers upload clips documenting battles, executions, and even video announcements by ISIS leaders to the Internet Archive because that type of content is automatically removed from video sharing sites like YouTube to prevent extremist propaganda.

On February 4, 2015, ISIS uploaded a video to the Internet Archive featuring the execution by immolation of captured Jordanian pilot Muath Al-Kasasbeh; that was only one day after the execution! The video violates YouTube's terms of service and is no longer on YouTube.
https://archive.org/details/YouTube_201502
ISIS members immolating captured Jordanian pilot (graphic video)
In fact, YouTube's algorithm is so aggressive that it removed thousands of videos documenting the Syrian revolution. Activists argued that the removed videos had been uploaded to document atrocities during the Syrian government's crackdown, and that YouTube killed any possible hope for future war crimes prosecutions.

Hani Al-Sibai, a lawyer, Islamic scholar, Al-Qaeda sympathizer, and former member of the Egyptian Islamic Jihad Group who lives in London as a political refugee, also uploads his content to the Internet Archive. Although he is anti-ISIS, his content more often than not does not encourage violence, and he has had only a few issues with YouTube, he still pushes his content to multiple sites on the web, including web archives, to ensure its continuous availability.

For example, this is an audio recording by Hani Al-Sibai condemning the immolation of the Jordanian pilot, Muath Al-Kasasbeh. Mr. Al-Sibai uploaded the recording to the Internet Archive a day after the execution.
https://archive.org/details/7arqTayyar
An audio recording by Hani Al-Sibai condemning the execution by burning (uploaded to IA a day after the execution)

These are some examples where the Internet Archive is used as a file sharing service: clips are simultaneously uploaded to YouTube, Vimeo, and the Internet Archive for the purpose of sharing.
Screenshot from justpaste.it where links to videos uploaded to IA are used for sharing purposes
Both videos shown in the screenshot were removed from YouTube for violating its terms of service, but they are not lost because they were also uploaded to the Internet Archive.

https://www.youtube.com/watch?v=Cznm0L5X9LE
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (removed from Youtube)

https://archive.org/details/Fajr3_201407
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (uploaded to IA)

https://www.youtube.com/watch?v=VuSgxhBtoic
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (removed from Youtube)

https://archive.org/details/Ta3liq_Hadi
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to IA)
The same video was not removed from Vimeo
https://vimeo.com/111975796
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to Vimeo)
I am not sure whether web archiving sites have content moderation policies, but even among sharing sites that do, enforcement is inconsistent! YouTube is a perfect example: no one knows what YouTube's rules even are anymore.

A less popular use of the Internet Archive is browsing archived versions of live web pages through Internet Archive links to bypass government censorship. Sometimes governments block sites with opposing ideologies, but the archived versions of those sites remain accessible. When these governments realize that their censorship is being evaded, they block the Internet Archive entirely to prevent access to the same content they blocked on the live web. In 2017, the IA’s Wayback Machine was blocked in India, and in 2015, Russia blocked the Internet Archive over a single page!
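
A reader looking for an archived copy of a blocked page (or anyone checking whether a snapshot exists before linking) can query the Wayback Machine's public Availability API at https://archive.org/wayback/available. The sketch below is a minimal example; the blocked-site URL and timestamp are hypothetical.

    import requests

    AVAILABILITY_API = "https://archive.org/wayback/available"

    def closest_snapshot(url, timestamp=None):
        """Return the URL of the closest archived snapshot, or None if there is none."""
        params = {"url": url}
        if timestamp:                       # e.g., "20170801" for around August 2017
            params["timestamp"] = timestamp
        data = requests.get(AVAILABILITY_API, params=params, timeout=30).json()
        snapshot = data.get("archived_snapshots", {}).get("closest", {})
        return snapshot.get("url") if snapshot.get("available") else None

    # Hypothetical example: look for a capture of a blocked news article.
    print(closest_snapshot("http://example-blocked-site.com/article", "20170801"))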

3. Using URLs from web archives instead of direct links for news sites with opposing ideologies to deprive them of ad revenue

Even when the live web version is not blocked, there are situations where readers want to deny traffic, and the resulting ad revenue, to web sites with opposing ideologies. In a recent paper, Understanding Web Archiving Services and Their (Mis)Use on Social Media (Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, 2018), the authors presented a large-scale analysis of web archiving services and their use on social networks: what content gets archived and how it is shared and used. They found that contentious news and social media posts are the most common types of archived content. URLs from web archiving sites are also widely posted in “fringe” communities on Reddit and 4chan to preserve controversial data that might disappear (this also falls under the first category). Furthermore, the authors found evidence of group admins forcing members to use URLs from web archives, instead of direct links, when referring to sites with opposing ideologies, so those sites receive no additional traffic or ad revenue. For instance, the The_Donald subreddit systematically targets the ad revenue of news sources with adverse ideologies using moderation bots that block URLs from those sites and prompt users to post archive URLs instead.

The authors also found that web archives are used to evade censorship policies in some communities: for example, /pol/ users post archive.is URLs to share content from 8chan and Facebook, which are banned on the platform, or to dodge word-filters (e.g., 4chan's filter rewrites ‘smh’ as ‘baka’, so a direct link to smh.com.au would be mangled into baka.com.au; an archive URL avoids the filter).

According to the authors, bots are responsible for posting a huge portion of the archive URLs on Reddit, as moderators try to ensure the availability of the data; this practice, however, reduces the amount of traffic the source sites would otherwise have received from Reddit.
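
The paper does not publish the bots' code, but the core idea, rewriting outbound links so they point at an archive instead of the source site, is simple. The sketch below shows one possible approach; the blocked-domain list is invented, and the choice of the Wayback Machine as the target archive is my assumption (the communities studied often use archive.is instead).

    import re

    # Hypothetical list of domains a moderation bot might refuse to link directly.
    BLOCKED_DOMAINS = {"example-news-site.com", "another-outlet.example"}

    URL_PATTERN = re.compile(r"https?://[^\s)\]]+")

    def rewrite_links(comment_text):
        """Replace direct links to blocked domains with Wayback Machine links."""
        def to_archive(match):
            url = match.group(0)
            host = re.sub(r"^https?://(www\.)?", "", url).split("/")[0].lower()
            if host in BLOCKED_DOMAINS:
                # web.archive.org/web/<url> redirects to the most recent capture.
                return "https://web.archive.org/web/" + url
            return url
        return URL_PATTERN.sub(to_archive, comment_text)

    print(rewrite_links("Read https://example-news-site.com/story for details."))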

I went on 4chan to collect a few examples similar to those examined in the paper. Despite not knowing what 4chan was prior to reading the paper, I was able to find a couple of examples of archived links being shared on 4chan in just under two minutes. I took screenshots of both examples; the threads themselves have since been deleted, because 4chan removes threads after they reach page 10.

Pages are archived on archive.is then shared on 4chan
Sharing links to archive.org in a comment on 4chan

The takeaway message is that web archives are used for purposes other than digital preservation and browsing historical data. These purposes include:
  1. Uploading content to web archives to mitigate the risk of data loss.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of original source links, when referring to news sites with opposing ideologies, to deprive them of ad revenue.
--
Hussam Hallak
