Channel: Web Science and Digital Libraries Research Group

2018-04-23: "Grampa, what's a deleted tweet?"


In early February 2018, Breitbart News made a splash with an inflammatory tweet suggesting that Muslims would end the Super Bowl, which it deleted twelve hours later, stating that the tweet did not meet its editorial standards. The deleted tweet contained an imaginary conversation between a Muslim child and a grandparent about the Super Bowl and linked to one of Breitbart's articles on the declining TV ratings of the National Football League (NFL) championship game. News articles from The Hill, Huffington Post, Politico, the Independent, etc., covered the deleted tweet controversy in detail.

As web archiving researchers, we decided to look into the Breitbart News incident and shed some light on the account's pattern of deleted tweets over recent months.

Role of web archives in finding deleted tweets   


Hany M. SalahEldeen and Michael L. Nelson, in their paper "Losing my revolution: How many resources shared on social media have been lost?", measured how many resources shared on social media are still live or preserved in public web archives. They concluded that nearly 11% of shared resources are lost in their first year, and after that we lose shared resources at a rate of about 0.02% per day.

Web archives such as the Internet Archive, Archive-It, the UK Web Archive, etc., play an important role in the preservation of resources shared on social media. Using web archives, we can sometimes recover deleted tweets. For example, Miranda Smith, in her blog post "Twitter Follower Count History via Internet Archive", talks about using the Internet Archive to fetch historical Twitter data and graph follower counts over time. She also explains the advantages of using web archives over the Twitter API for finding users' historical data.

The caveat in using web archives to uncover deleted tweets is their limited coverage of Twitter. But for popular Twitter accounts with a high number of mementos, such as RealDonaldTrump, Barack Obama, BreitbartNews, CNN, etc., we can often uncover deleted tweets. The question of "How Much of the Web Is Archived?" has been discussed by Ainsworth et al., but there has been no separate analysis of how much of Twitter is archived, which would help us estimate how reliably deleted tweets can be found using web archives.

Web services like Politwoops track deleted tweets of public officials, including people currently in office and candidates for office in the USA and some EU nations. However, tweets deleted before a person becomes a candidate or after a person leaves office are not covered. Although Politwoops tracks elected officials, it misses appointed government officials like Michael Flynn. For these Twitter accounts, web archives are the lone solution for finding deleted tweets. Another reason not to rely on such services alone is that Twitter can ban them: it happened in June 2015, with Twitter citing a violation of its developer agreement, and it took until December 2015 for Politwoops to resume its service. These bans suggest that we should explore web archives to uncover deleted tweets in case services like Politwoops are banned again.

Why are deleted tweets important?


With the surge in the usage of social media sites like Twitter and Facebook, researchers have been using them to study patterns of online user behaviour. In the context of Twitter, deleted tweets play an important role in understanding users' behavioural patterns. In "An Examination of Regret in Bullying Tweets", Xu et al. built an SVM-based classifier to predict which bullying-related tweets users would later regret and delete. Petrovic et al., in "I Wish I Didn't Say That! Analyzing and Predicting Deleted Messages in Twitter", discuss the reasons tweets are deleted and use a machine learning approach to predict deletion, concluding that tweets with swear words have a higher probability of being deleted. Zhou et al., in their papers "Tweet Properly: Analyzing Deleted Tweets to Understand and Identify Regrettable Ones" and "Identifying Regrettable Messages from Tweets", note that the impact of a published tweet cannot be fully undone by deletion, as other users may have noticed and cached the tweet before it was deleted.


How were deleted tweets found?


To begin our analysis, we used the Twitter API to fetch the most recent 3200 tweets from Breitbart News' Twitter timeline. The live tweets fetched from the Twitter API spanned 2017-10-22 to 2018-02-18. We then fetched the TimeMap for Breitbart's Twitter page using MemGator, the Memento aggregator service built by Sawood Alam. Using the URI-Ms from the TimeMap, we collected mementos of Breitbart's Twitter page within the time range of the live tweets fetched from the Twitter API.

Code to fetch recent tweets using Python-Twitter API

import twitter

api = twitter.Api(consumer_key='xxxxxx',
                  consumer_secret='xxxxxx',
                  access_token_key='xxxxxx',
                  access_token_secret='xxxxxx',
                  sleep_on_rate_limit=True)

# screen_name is the Twitter handle to fetch, e.g., 'BreitbartNews'
twitter_response = api.GetUserTimeline(screen_name=screen_name, count=200, include_rts=True)

Shell command to run Memgator locally 

$ memgator --contimeout=10s --agent=XXXXXX server 
MemGator 1.0-rc7

TimeMap : http://localhost:1208/timemap/{FORMAT}/{URI-R}
TimeGate : http://localhost:1208/timegate/{URI-R} [Accept-Datetime]
Memento : http://localhost:1208/memento[/{FORMAT}|proxy]/{DATETIME}/{URI-R}

# FORMAT => link|json|cdxj
# DATETIME => YYYY[MM[DD[hh[mm[ss]]]]]
# Accept-Datetime => Header in RFC1123 format

Code to fetch the TimeMap for any Twitter handle

url ="http://localhost:1208/timemap/"
data_format ="cdxj"
command = url + data_format +"/http://twitter.com/<screen-name>"+
response = requests.get(command)
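
To restrict the analysis to the period covered by the live tweets, the memento lines in the CDXJ TimeMap can be filtered by their capture datetimes. Below is a minimal sketch, assuming each memento line returned by MemGator begins with a 14-digit datetime followed by a JSON object whose "uri" key holds the URI-M (the function and variable names here are illustrative, not part of MemGator):

import json
from datetime import datetime

import requests

def mementos_in_range(cdxj_text, start, end):
    # Yield URI-Ms whose capture datetime falls within [start, end]
    for line in cdxj_text.splitlines():
        parts = line.split(' ', 1)
        # Skip metadata lines such as "!context" and "!meta"
        if len(parts) != 2 or not (len(parts[0]) == 14 and parts[0].isdigit()):
            continue
        capture_dt = datetime.strptime(parts[0], "%Y%m%d%H%M%S")
        if start <= capture_dt <= end:
            yield json.loads(parts[1])["uri"]

timemap = requests.get("http://localhost:1208/timemap/cdxj/http://twitter.com/<screen-name>")
for urim in mementos_in_range(timemap.text, datetime(2017, 10, 22), datetime(2018, 2, 18)):
    print(urim)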
We parsed tweets and their tweet ids from each memento and compared each archived tweet id with the live tweet ids fetched using the Twitter API. We then used the Twitter API to check the status of tweet ids that were present in the web archives but missing from the live timeline, confirming whether those tweets had actually been deleted (a sketch of this comparison step appears after the parsing code below). Comparing the live and archived tweets, we discovered 22 deleted tweets from Breitbart News.

Code to parse tweets, their timestamps and tweet ids from mementos


import bs4

soup = bs4.BeautifulSoup(open(<HTML representation of Memento>), "html.parser")
match_tweet_div_tag = soup.select('div.js-stream-tweet')
for tag in match_tweet_div_tag:
    if tag.has_attr("data-tweet-id"):
        # Get tweet id
        tweet_id = tag["data-tweet-id"]
        # Parse tweet text
        match_timeline_tweets = tag.select('p.js-tweet-text.tweet-text')
        ...........
        # Parse tweet timestamp
        match_tweet_timestamp = tag.find("span", {"class": "js-short-timestamp"})
        ...........
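
Below is a minimal sketch of the comparison and validation step, assuming live_ids holds the tweet ids returned by the Twitter API, archived_ids holds the ids parsed from the mementos, and api is the python-twitter client created earlier (the variable names are illustrative):

import twitter

candidate_deleted = archived_ids - live_ids   # archived but absent from the live timeline

confirmed_deleted = set()
for tweet_id in candidate_deleted:
    try:
        api.GetStatus(tweet_id)               # still retrievable: not deleted, just older than 3200 tweets
    except twitter.TwitterError:
        confirmed_deleted.add(tweet_id)       # API reports the status is gone

print(len(confirmed_deleted), "deleted tweets found")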

Analysis of Deleted Tweets from Breitbart News


The most prominent of the 22 deleted tweets was the Super Bowl tweet mentioned above. For people who are unaware of the role of web archives: taking screenshots of content you fear might be lost is smart, but it is even better to push that content to the web archives, where it will be preserved far longer than in someone's private archive. For further information, refer to Plinio Vargas's blog post "Links to Web Archives, not Search Engine Caches", which discusses the difference between archived pages and search engine caches in terms of how long the pages persist.

Fig 1 - Super Bowl tweet on Internet Archive
Tweet Memento at Internet Archive
Another tweet was originally posted by Allum Bokhari, a senior Breitbart correspondent, retweeted by Breitbart News, and later un-retweeted. The original tweet from Allum Bokhari is still present on the live web, but the retweet is gone; a plausible reason is that Breitbart News later retweeted a similar post from Allum Bokhari.
Undo retweet of Breitbart News
Fig 2 - Archived version of unretweeted tweet by Breitbart News
Tweet memento at the Internet Archive

Fig 3 - Live version of unretweeted tweet by Breitbart News
Live Tweet Status
Of the 22 deleted tweets, 20 were cases where Breitbart News had retweeted someone's tweet but the original tweet was later deleted. Of those 20, 18 were from two Breitbart News affiliates, NolteNC and John Carney. Therefore, we decided to look at both accounts to determine the reason for their deleted tweets.

Analysis of deleted tweets from John Carney and  NolteNC


We fetched live tweets for John Carney using the Twitter API, then fetched the TimeMap for his Twitter page using MemGator and collected mementos within the time range of the live tweets. Due to the low number of mementos within that time range, this analysis showed no deleted tweets. We then fetched live tweets from the Twitter API for John Carney over the course of a week and compared each day's response with all previous responses to find deleted tweets. We discovered that tweets older than seven days are automatically deleted on Tuesdays and Saturdays. The precision of this deletion pattern suggests the use of an automated tweet deletion service. There are a number of such services, like Twitter Deleter, Tweet Eraser, etc., which delete tweets based on conditions such as the age of a tweet or the number of tweets to keep on the timeline at any given time. A sketch of how such a pattern can be detected is shown below.
Fig 4 - John Carney's tweet deletion pattern shown with 50 tweet ids
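
Below is a minimal sketch of detecting the deletion pattern, assuming snapshots is a chronologically ordered list of (fetch_date, tweet_id_set) pairs collected from the Twitter API on successive days (the variable names are illustrative):

from collections import Counter

deletion_days = Counter()
for (prev_date, prev_ids), (curr_date, curr_ids) in zip(snapshots, snapshots[1:]):
    deleted_between = prev_ids - curr_ids
    if deleted_between:
        # Attribute the deletions to the weekday of the later fetch
        deletion_days[curr_date.strftime("%A")] += len(deleted_between)

print(deletion_days)   # deletions clustering on, e.g., Tuesday and Saturday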
We repeated the process for NolteNC: we fetched live tweets using the Twitter API, fetched the TimeMap for NolteNC's Twitter page using MemGator, and collected mementos within the time range of the live tweets. For NolteNC, we had a considerable number of mementos within that range, enough to discover deleted tweets. Our analysis yielded 169 live tweets and 3569 deleted tweets from 2017-11-03 to 2018-02-17.
Fig 5 - NolteNC's original tweet


Fig 6 - Breitbart News retweeting NolteNC's tweet.
With thousands of deleted tweets, it seemed unlikely that he was deleting tweets manually. We had every reason to believe that, like John Carney, NolteNC deleted tweets automatically using a tweet deletion service. We collected live tweets for his account over a week and compared each response with all previous responses from the Twitter API, concluding that all of his tweets aged more than seven days were deleted on Wednesdays and Saturdays.
Fig 7 - NolteNC's tweet deletion pattern shown with 50 tweets 

Conclusions

  1. It is not enough to take screenshots of controversial tweets; web content that we fear may be lost and wish to preserve should be pushed to the web archives, which have a longer retention capability than our personal archives.
  2. For finding deleted tweets, web archives work well for popular accounts because they are archived often, but for less popular accounts with fewer mementos this approach will not work.
  3. Although Breitbart News does not delete tweets often, some of its correspondents automatically delete their tweets, effectively deleting the corresponding retweets.
--
Mohammed Nauman Siddique (@m_nsiddique)


2018-04-24: Let's Get Visual and Examine Web Page Surrogates


Why visualize individual web pages? A variety of visualizations of individual web pages exist, but why do we need them when we can just choose a URI from a list and put it in our web browser? URIs are intended to be opaque: text from the underlying web resource does not need to exist in the URI.

Consider http://dx.doi.org/10.1007/s00799-016-0200-8. Where does it go? Should we click on it? What content exists under the veil of the URI? Will it meet our needs?

Now consider this web page surrogate produced by embed.ly for the same URI:

Avoiding spoilers: wiki time travel with Sheldon Cooper

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if...
If we were looking for research papers about avoiding spoilers for TV shows, then we know that clicking on this surrogate will take us to something that meets our information needs. If we were searching for marine mammals, then this surrogate shows us that the underlying page will not be very satisfying. In this case, the surrogate is intended to give the user enough information to answer the question: should I click on this?

Last year, when I reviewed a number of live web curation and social media tools, I was primarily focused on tools that produce social cards like the one above. This was because social cards appeared to be the lingua franca of web page surrogates. Social cards are not the only surrogate in use today and definitely not the only surrogate evaluated in the literature. In this post, I cover several surrogates that have been evaluated and then talk about the studies in which they played a part. I was curious as to which surrogate might be best for collections of mementos.

Different Web Page Surrogates




Text Snippet


Text snippets are one of the earliest surrogates. They only require fetching a given web page before selecting the text to be used in the snippet. The text selection can be done via many different methods like El-Beltagy's "KP-Miner: A Keyphrase Extraction System for English and Arabic Documents" and Chen's "A Practical System of Keyphrase Extraction for Web Pages". Text snippets are typically used by search engines for displaying results.

The Google search result text snippet for Michele Weigle's ODU CS page.
The Bing search result text snippet for Michele Weigle's ODU CS homepage. Note that Bing did not capture the last modified date, but does list a series of links on the bottom of the snippet, drawn from the menu of the homepage.
The DuckDuckGo search result for Michele Weigle's ODU CS homepage. Note that DuckDuckGo displays the favicon and generates a different text snippet from Google and Bing.

In the above search results for Michele Weigle's ODU CS homepage, the text snippets are slightly different depending on the search engine. Because there is a lot of variation in web pages, there are a lot of possibilities when building text snippets.

Text snippets still receive a bit of research, with Maxwell evaluating the effectiveness of snippet length in 2017 as part of "A Study of Snippet Length and Informativeness" (university repository copy).

As a group, text snippets are listed one per row on a web page. This is optimal for search results, as the position of the result conveys its relevancy. This format affects how many surrogates can be viewed at once: where text snippets are viewed one per row, more thumbnails can fit into the same amount of space.

Thumbnail


A thumbnail is produced by loading the given page in a browser and taking a screenshot of the contents of the browser window. Thumbnails have been used in many forms; the Safari web browser, for example, uses them to display the contents of tabs.
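
For example, such a thumbnail can be generated with a headless browser; the command below uses headless Chrome with an arbitrary output filename, window size, and example URL:

$ google-chrome --headless --disable-gpu --window-size=1024,768 --screenshot=thumbnail.png https://example.com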

The Safari web browser uses thumbnails to show surrogates for web pages  that are currently loaded in its tabs.
In "Visual preview for link traversal on the World Wide Web", Kopetzky demonstrated that thumbnails could be used to provide a preview of a linked page via a mouseover effect so that users could decide if a link was worth clicking. In "Data Mountain: Using Spatial Memory for Document Management" (Microsoft Research copy), Robertson proposed using a 3D virtual environment for organizing a corpus of web pages where each page is visualized as a thumbnail. Outside of the web, file management tools, such as macOS's Finder, use thumbnails to provide visual previews of documents.

An example of the interface for Data Mountain, a 3D environment for browsing web pages via thumbnails.

macOS Finder displaying thumbnails of file contents.

In the web archiving world, the UK Web Archive uses thumbnails to show a series of mementos so one can compare the content of each memento, effectively viewing the content drift over time. Thumbnails are also used in our own What Did It Look Like?, a platform that animates thumbnails so one can watch the changes to a web page over the years. Our group is also investigating the use of thumbnails for summarizing how a single webpage has changed over time, using three different visualizations: an animation, a grid view, and an interactive timeline view.

The UK Web Archive uses thumbnails to show different mementos for the same resource, allowing the user to view web page changes over time.

What Did It Look Like? allows the user to watch a web page change over time by animating the thumbnails of the mementos of a resource.

The size of thumbnails has a serious effect on their utility. If the thumbnail is too large, it does not provide room for comparison of surrogates. If the thumbnail is too small, users cannot see what is in the image. Thumbnails are also difficult for users to understand if a page consists mostly of text or has no unique features. In "How People Recognize Previously Seen Web Pages from Titles, URLs and Thumbnails", Kaasten established that the optimal thumbnail size is 208x208 pixels.

The viewport of a thumbnail is also an important part of its construction. Depending on what we want to emphasize on a web page, we may need to generate a thumbnail from content "below the fold". Aula evaluated the use of thumbnails that were the same size, but had magnified a portion of a web page at 20% versus 38%. She found that users performed better with thumbnails at a magnification of 20%.

Enhanced Thumbnail


In 2001, Woodruff introduced the enhanced thumbnail in "Using Thumbnails to Search the Web" (author copy). Prior to taking the screenshot of the browser as with a normal thumbnail, the HTML of the page is modified to make certain terms stand out. In the example below, changes in font size and background color emphasize certain terms of a page. The goal is to draw attention to these terms in hopes that search engine users could find relevant pages faster.

Examples of Thumbnails and Enhanced Thumbnails:
(a) Plain thumbnail
(b) Enhanced Thumbnail using HTML modification to emphasize the words "Recipe" and "Pound Cake"
(c) Enhanced Thumbnail using HTML and image modification to make "Recipe" and "Pound Cake" stand out more
(d) Emphasis on "MiniDisc Player"
(e) Emphasis on "hybrid", "car", and "mileage"
(f) Emphasis on "Hellerstein"
(g) Plain thumbnail of a page only consisting of text
(h) Enhanced thumbnail emphasizing specific terms in the text page


Even though enhanced thumbnails have performed well, they are computationally expensive to create. This likely explains why they have not been seen in use outside of laboratory studies.

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali developed something similar by adding a tag cloud to each thumbnail and named the concept a "visual tag".

Internal Image


An internal image is an image embedded within the web page. For some web pages, like news stories and product pages, these internal images can be good surrogates because of their uniqueness. Pinterest uses internal images as surrogates.

Pinterest uses internal images as surrogates for web pages.

The key is identifying which embedded image is best for representing the page. Hu identified the issues with solving this problem as part of "Categorizing Images in Web Documents", identifying a number of features such as using the text surrounding an image and evaluating the number of colors in the image. Maekawa worked on classifying images and achieved an 83.1% accuracy in "Image Classification for Mobile Web Browsing" (conference copy). While these studies provided solutions for classifying images, we really need to know which images are unique and relevant to the web page. Research does exist to address this issue, such as the work described in Li's "Improving relevance judgment of web search results with image excerpts" (conference copy). These solutions are imperfect, which may be why Pinterest and other sites ask the user to choose an image from those embedded in the page.


Visual Snippet


In 2009, Teevan introduced visual snippets as part of "Visual snippets: summarizing web pages for search and revisitation" (Microsoft Research copy, conference slides). Teevan gave 20 web pages to a graphic designer and asked him to generate a small 120x120 image representing each page. She observed a pattern in the resulting images and derived a template to use as a surrogate. These surrogates combine the internal image, placed within the background of the surrogate, with a title running across the top of the page, and a page logo.

Examples of thumbnails on the bottom and their corresponding visual snippets on top.
She used machine learning to choose a good internal image and logo. This is more complex than merely selecting a salient internal image as noted in the previous section. Not only does the visual snippet require two images, but two different types of images.

External Image


In 2010, Jiao put forth the idea of using external images in "Visual Summarization of Web Pages". Jiao notes that detecting the internal image may be difficult if not impossible for some pages. Instead, he suggests using image search engines to find a representative image to use as a surrogate.

A simplified version of his algorithm is:

  1. Extract key phrases from the target web page using Chen's KEX algorithm
  2. Use these phrases as queries for an image search engine
  3. Rerank the search engine results based on textual similarity to the target web page
  4. Choose the top ranked image
Though this would likely work well for live web pages about products, it may be a poor fit for mementos due to the temporal nature of words. Consider a memento from the late 1990s where one of the key phrases extracted contains the word Clinton. In the 1990s, the document was likely referring to US President Bill Clinton. If we use a search engine in 2018, it may return an image of 2016 presidential candidate Hillary Clinton. Some of these temporal issues have been detailed as part of the Longitudinal Analytics on Web Archive Data (LAWA) project.

Text + Thumbnail


In "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages" (google research copy) by Aula and "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" by Dziadosz, the authors consider the combination of text with a thumbnail as a surrogate.

The Internet Archive uses text and thumbnails for its search results, seen in the screenshot below.

The Internet Archive uses thumbnails and text together as part of its search results.
Al Maqbali further extended this concept with text + visual tags.

Social Card


The social card goes by many names: rich link, snippet, social snippet, social media card, Twitter card, embedded representation, or rich object. The social card typically consists of an image, a title, and a text snippet from the web page it visualizes.

The data within the social card is typically drawn from data within the meta tags of the HTML of the target web page. As an artifact of social media, different social media platforms consult different meta tags within the target page.

For example, for Twitter, I used the following tags to produce the card below:
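
(The markup below is an illustrative set of Twitter Card meta tags of the kind Twitter consults; the title, description, and image values are placeholders, not the actual markup of shawnmjones.org.)

<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Shawn M. Jones">
<meta name="twitter:description" content="Personal home page of Shawn M. Jones.">
<meta name="twitter:image" content="https://www.shawnmjones.org/images/profile.png">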

Social card for https://www.shawnmjones.org as seen on Twitter.


For Facebook, I used the following tags to produce the card below:
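
(Again, the markup below is an illustrative set of Open Graph meta tags of the kind Facebook consults; the values are placeholders.)

<meta property="og:type" content="website">
<meta property="og:url" content="https://www.shawnmjones.org">
<meta property="og:title" content="Shawn M. Jones">
<meta property="og:description" content="Personal home page of Shawn M. Jones.">
<meta property="og:image" content="https://www.shawnmjones.org/images/profile.png">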
Social card for https://www.shawnmjones.org as seen on Facebook.


Note how the HTML tags are different for each service. Facebook supports the Open Graph Protocol, developed around 2009 (according to the CarbonDate service) whereas Twitter's features were developed around 2010 (according to CarbonDate). There are pages that lack this kind of assistive markup. To produce those cards, social media platforms will often use other methods, like those mentioned above, to extract a text snippet and an internal image. Any mementos captured prior to 2009 will not have the benefit of this assistive markup.

Though most social cards generated by platforms come in landscape form, some platforms generate a portrait form as well. The intended use of the social cards and the nature of other visual cues on the platform often drive the decision as to which form the social card should take. All of the studies in this blog post evaluated social cards in their landscape form.
A landscape social card from Facebook.
A portrait social card from Google+.


Social cards are not just used by social media. Wikipedia uses social cards to provide a preview of links if the user hovers over the link, like what Kopetzky had envisioned with thumbnails. Google News often uses social cards for individual stories. Social cards sometimes include additional information beyond text snippet and image. In "What's Happening and What Happened: Searching the Social Web" Omar Alonso detailed the use of social cards in a prototype for Bing search results. Those cards also incorporated lists of users who shared the target web page as well as associated hashtags.

When a user hovers over an internal link, Wikipedia uses social cards  to display a preview of the linked web page.
Google News often uses social cards to list individual news articles.

There are similar concepts that are not instances of the social card. Some of the cards used by Google News are not social cards because each is a surrogate for a news story spanning multiple resources, rather than a single resource. Likewise, search engines use entity cards to display information about a specific entity drawn from multiple sources. Entity cards have been found to be useful by Bota's 2016 study "Playing Your Cards Right: The Effect of Entity Cards on Search Behaviour and Workload". I do not consider entity cards to be social cards because each social card is a surrogate for a single web resource, whereas an entity card is a surrogate for a conceptual entity and is drawn from multiple sources.
This card used by Google News is not a surrogate for a single web resource, and hence I do not consider it a social card.
This card format, used by Google is also not a surrogate for a single web resource. This is an entity card, drawing from multiple web resources.

The creation of social cards can also be a lucrative market, with Embed.ly offering plans for web platforms ranging from $9 to $99 per month. They provide embedding services for the long form blogging service Medium, supporting a limited number of source websites. Individual cards can be made on their code generation page.

Evaluations of these Surrogates


Web page surrogates have been of great interest to those studying search engine result pages. I have reviewed the nine studies listed in the table below, most of them mentioned above, focusing on how these studies compared surrogates with each other.


Author & YearText
Snippet
Internal/
External
Image
Visual
Snippet
ThumbnailEnhanced
Thumbnail/
Visual Tags
Text + ThumbnailSocial Card
Woodruff 2001XXX
Dziadosz 2002XXX
Li 2008XX
Teevan 2009XXX
Jiao 2010XXX
Aula 2010XXX
Al Maqbali 2010XXXXX
Loumakis 2011XXX
Capra 2013XXX


As noted above, Woodruff introduced the concept of enhanced thumbnails in "Using Thumbnails to Search the Web". To evaluate their effectiveness, she generated questions based on tasks users commonly perform on the web. The questions were divided into 4 categories, and 3 questions from each category were given to 18 participants. The participants were presented with search engine result pages consisting of 100 text snippets, thumbnails, or enhanced thumbnails. Participants were evaluated on their response times while trying to find web resources that would address their assigned questions. The results indicated that enhanced thumbnails provided the fastest response times overall, but the results varied depending on the type of task. For locating an entity's homepage, text snippets and enhanced thumbnails performed roughly the same. For finding a picture of an entity, thumbnails and enhanced thumbnails performed roughly the same. All three surrogate types performed equally well for e-commerce or medical side-effect questions.

Dziadosz tested the concept of text snippets combined with thumbnails in "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" In this study, 35 participants were each given 2 queries and 2 tasks, with each participant given a different surrogate type. The first task was to identify all search engine results on the page that they assumed to be relevant to their query. The second task was to visit the pages behind the surrogates and identify which were actually relevant. The number of correct decisions for text snippets combined with thumbnails was higher than for text alone or thumbnails alone. Aula, in "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages", also evaluated text snippets, thumbnails, and their combination, and found that they were effective for making relevance judgements.

Teevan evaluated the effectiveness of visual snippets in "Visual snippets: summarizing web pages for search and revisitation". Her study consisted of 276 participants who were each given 12 search tasks and a set of 20 search results, with 4 of the 12 tasks completed with different surrogates. She discovered that text snippets required the fewest clicks and thumbnails the most, indicating a lot of false positive matches for participants when using thumbnails. Participants preferred visual snippets and text snippets equally over thumbnails, and preferred visual snippets for shopping tasks. Most participants found thumbnails to be too small to be useful.

Jiao introduced the concept of using external images as a surrogate in "Visual Summarization of Web Pages". He compared the use of internal images, external images, thumbnails, and visual snippets. As in Dziadosz's study, participants were asked to guess the relevance of the web page behind the surrogate and then later evaluate whether their earlier guess was correct. To generate search results, he randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Bing. His results show that none of the surrogates works for all types of pages. Overall, internal images were best for pages that contained a dominant image, whereas thumbnails or external images were best for understanding pages that did not contain a dominant image.

In "Improving relevance judgment of web search results with image excerpts", Li was interested in identifying dominant images in web pages. I focus here on the second study in his work which compares text snippets and social cards. They randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Google. The search engine results were then evaluated and reformatted into either text snippets or social cards. Two groups of 12 students each were given the queries either classified by their functionalities or semantic categories. The participants were evaluated based on the number of clicks of relevant results and also on the amount of time they took with each search. Social cards were the clear winner over text snippets in terms of time and clicks.

Loumakis, in "This Image Smells Good: Effects of Image Information Scent in Search Engine Results Pages" (university copy) attempted to compare the performance of images, text snippets, and social cards. Using preselected queries and 81 participants, Loumakis also reformatted Google search results. He did not get the same level of performance in his study, noting that "Adding an image to a SERP result will not significantly help users in identifying correct results, but neither will it significantly hinder them if an image is placed with text cues where the scents may conflict."

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali explored the use of different image augmentations for visual snippets, text + thumbnail, social card, text + visual snippet, and a text + tag cloud/thumbnail combination. Al Maqbali had 65 participants evaluate the relevance of search engine result pages as in the prior studies. This study reached the same conclusion as Loumakis: adding images to text snippets does not appear to make a difference to the performance of search engine users.

To further understand the disagreement between the results of Loumakis, Al Maqbali, and Li, in "Augmenting web search surrogates with images", Capra explored the effectiveness of text snippets and social cards. He wanted to determine if the quality or relevance of the image used in the social card had any effect on performance. Prior to the relevance study, he had one set of participants rate individual internal images for a social card as good, bad, or mixed. For individual surrogates, Capra discovered that text snippets with good images have a slightly higher, statistically significant, accuracy score than text snippets alone, at the cost of judgement duration for each surrogate. The accuracy for text snippets was 0.864, the accuracy for social cards with bad images was also 0.864, and the accuracy for social cards with good images was 0.884. If the search engine result pages were evaluated overall, then there was evidence that good images improved accuracy for ambiguous queries (e.g., jaguar the car or the cat?), but in this case the improvements were not statistically significant.

Deciding on the best surrogate for use with web pages depends on a number of factors, and the studies comparing these surrogates have some disagreement. Text snippets continue to endure for search results likely due to Capra's, Al Maqbali's, and Loumakis' results. Social cards are preferred by users, but the minor improvement in search time and relevance accuracy does not warrant the effort necessary to select a good internal image for the card. This means that social cards are effectively relegated for use in social media where each can be generated individually rather than with hundreds of search results. This also means that thumbnails are relegated to other tasks, such as a surrogate for a file on a filesystem or within a browser's interface. As most of these studies focused primarily on search engine results, it is likely that many of these surrogates work better with other use cases.

Surrogates for Mementos


There are more uses for surrogates than search engine results. When grouped together, some surrogates provide more information than the answer to the question should I click on this?

Enhanced thumbnails often reflect the search terms of the query provided by the user. Most memento applications do not have a query, and hence there are no words or phrases to enhance within the thumbnail. Al Maqbali's tag cloud concept may be of interest here. I am examining other ways to expose words and phrases of interest from archived collections, so this surrogate type may find new life in mementos.

Internal images are often used as part of social cards. If one could expose the images that tie to a particular theme in a web archive collection, then it is possible that we could select images for use as memento surrogates within the theme of the collection. This would likely require some form of machine learning to be viable. This same process goes for visual snippets.

As noted above, external images are problematic surrogates for mementos due to the temporal nature of words. If we could divide a web archive into specific time periods, then external images could be extracted from pages around the same time, limiting the amount of temporal shift.

Thumbnails are often useful in groups to demonstrate the content drift of a single web resource. For this surrogate group to be useful, the consumer of such a thumbnail gallery needs to understand the direction that time flows in the visualization. Thumbnails are not limited to the "one-per-row" paradigm of landscape social cards or text snippets, and hence thumbnails can be presented in a grid formation. This can be confusing to the user trying to compare the content drift of a resource, but textual cues, such as the memento-datetime, placed above or below the thumbnail can clear up this confusion.

Storytelling often uses surrogates in the form of social cards, where the surrogates are visualizations of the underlying web pages. When provided as a series of social cards, one per row, in order of publication date or memento-datetime, collections of these surrogates can convey information about an unfolding news story, such as in AlNoamany's collection summarization work (preprint version, dissertation). Many mementos do not have the metadata that might assist in finding a good internal image. This means that any service providing social cards for mementos must instead rely upon a number of image selection algorithms with differing levels of success. Because text snippets are essentially social cards lacking an image, is it possible that they, too, would be suitable in this context?

Conclusion


I started on this journey looking for the best surrogate for use with mementos. I discovered many different surrogates for web resources. The studies evaluating these different surrogates focused on the success of users finding relevant information in search engine results. It appears that the search engine industry has largely settled on text snippets, as they are the least expensive surrogate to produce and studies indicate that the addition of images has minimal impact on their effectiveness. Mementos have many different uses, and it is possible that one or more of these surrogates may be a better fit for their temporal nature. Now that I am developing a vocabulary for these surrogates, I can start to explore how they might best be used with mementos, bringing other useful visualizations to web archive collections.

-- Shawn M. Jones

2018-04-24: Why we need multiple web archives: the case of blog.reidreport.com


This story started in December 2017 with Joy-Ann Reid (of MSNBC) apologizing for "insensitive LGBT blog posts" that she wrote on her blog many years ago when she was a morning radio talk show host in Florida. This apology was, at least in some quarters, (begrudgingly) accepted. Today's update was news that Reid and her lawyers had in December claimed that either her blog or the Internet Archive's record of the blog had been hacked (Mediaite, The Intercept). Later today, the Internet Archive issued a blog post denying the claim that it was hacked, stating:
This past December, Reid’s lawyers contacted us, asking to have archives of the blog (blog.reidreport.com) taken down, stating that “fraudulent” posts were “inserted into legitimate content” in our archives of the blog. Her attorneys stated that they didn’t know if the alleged insertion happened on the original site or with our archives (Reid’s claim regarding the point of manipulation is still unclear to us).
...
At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog.   
Checking the Internet Archive for robots.txt, we can see that on 2018-02-16 blog.reidreport.com had a standard robots.txt page that blocked the admin section of WordPress, but by 2018-02-21 they had a version that blocked all robots, and as of today (2018-04-24) they had a version that specifically blocked only the Internet Archive's crawler ("ia_archiver").  As of about 5pm EDT, the robots.txt file had been removed (probably because of the Internet Archive's blog post calling out the presence of the robots.txt; cf. a similar situation in 2013 with the Conservative Party in the UK), but it may take a while for the Internet Archive to register its absence.
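
For reference, a robots.txt that blocks only the Internet Archive's crawler takes this form:

User-agent: ia_archiver
Disallow: /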

2018-04-25 update: Thanks to Peter Sterne for pointing out that www.blog.reidreport.com/robots.txt still exists, even though blog.reidreport.com/robots.txt does not.  They technically can be two different URLs though the convention is for them to canonicalize to the same URL (which is what the Wayback Machine does).  HTTP session info provided below, but the summary is that robots.txt is still in effect and the need for other web archives is still paramount. 



Until the Internet Archive begins serving blog.reidreport.com again, this is a good time to remind everyone that there are web archives other than the Internet Archive.  The screen shot above shows the Memento Time Travel service, which searches about 26 public web archives.  In this case, it found mementos (i.e., captures of web pages) in five different web archives: Archive-It (a subsidiary of the Internet Archive), Bibliotheca Alexandrina (the Egyptian Web Archive), the National Library of Ireland, the archive.is on-demand archiving service, and the Library of Congress.  For a machine readable service, below I list the TimeMap (list of mementos) generated by our MemGator service; the details aren't important but it is the source of the URLs that will appear next.  

Beginning with the original tweets by @Jamie_Maz (2017-11-30 thread, 2018-04-18 thread), I scanned through the screen shots (no URLs were given) and looked for screen shots that had definitive datetimes (most images did not have them). The datetimes are listed below (the ones for which we have evidence are in bold, and the ones that we inferred by matching text are marked with "(inferred)"):

2005-04-25
2005-07-16
2005-07-21
2006-01-20 (inferred)
2006-06-05
2006-06-13 (inferred)
2006-10-03
2006-12-23
2007-02-21
2008-07-04
2008-10-16
2009-01-15
(update: because of canonicalization errors, some of the URLs are not being excluded; see below)

Most of those dates are pretty early in web archiving times, when the Internet Archive was the only archive commonly available, and many (all?) of the mementos in other web archives were surely originally crawled by the Internet Archive, even if on a contract basis (e.g., for the Library of Congress).  Nonetheless, with multiple copies geographically and administratively dispersed throughout the globe, an adversary would have had to hack multiple web archives and alter their contents (cf. lockss.org), or have hacked the original site (blog.reidreport.com) approximately 12 years ago for adulterated pages to have been hosted at all the different web archives.  While both scenarios are technically possible, they are extraordinarily unlikely.  

While we don't know the totality of the hacking claims, we can offer five archived web pages, hosted at the Library of Congress web archive (webarchive.loc.gov), that corroborate at least some of the claims by @Jamie_Maz.

2006-01-20


Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20060125004941/http://blog.reidreport.com/ 


2006-06-05


Evidence for this tweet can be found at (approximately 2/3 down): http://webarchive.loc.gov/all/20060608144033/http://blog.reidreport.com/


2006-06-13

I'm not sure this evidence maps directly to one of the tweets, but it fits the general theme of anti-Charlie Crist: http://webarchive.loc.gov/all/20060615134635/http://blog.reidreport.com/


This memento also exists at archive.is; it is a copy of the Internet Archive's copy but it is not blocked by robots.txt because it is in another archive: http://archive.is/20060615134635/http://blog.reidreport.com/

2006-10-03



Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20061010125903/http://blog.reidreport.com/


2008-10-16


Evidence for this tweet can be found at (approximately 1/3 down): http://webarchive.loc.gov/all/20081018020856/http://blog.reidreport.com/ 



In summary, of the many examples that @Jamie_Maz provides, I can find five copies in the Library of Congress's web archive.  These crawls were probably performed on behalf of the Library of Congress by the Internet Archive (for election-based coverage); even though there are many different (and independent) web archives now, in 2006 the Internet Archive was pretty much the only game in town.  Even though these mementos are not independent observations, there is no plausible scenario for these copies to have been hacked in multiple web archives or at the original blog 10+ years ago.  There may be additional evidence in the other web archives, but I haven't exhaustively searched them.

We don't know the full details of what Reid's lawyers alleged, so perhaps there is more to the story than we know. But the analysis from the Internet Archive crawl engineers, plus the evidence in separate web archives, suggests that the claim has no merit.

The case of blog.reidreport.com is another example of why we need multiple web archives.  


--Michael


Thanks to Prof. Michele Weigle and John Berlin for bringing this issue to my attention and uncovering some of the examples.   

Memento TimeMap for blog.reidreport.com:



2018-04-25 update: As noted above, Peter Sterne brought to my attention that the non-standard URL of www.blog.reidreport.com/robots.txt still exists (and is blocking "ia_archiver") even though the more standard blog.reidreport.com/robots.txt is 404. 



Another 2018-04-25 update: The NYT has covered the story ("MSNBC Host Joy Reid Blames Hackers for Anti-Gay Blog Posts, but Questions Mount"), and there was an interview with Reid's computer security expert ("Should We Believe Joy Reid’s Blog Was Hacked? This Security Consultant Says We Should"), Jonathon Nichols.  

 I embed a statement from Nichols (released by Erik Wemple), and a tweet from Nichols clarifying that they were not suggesting that Wayback Machine's mementos were hacked, but rather the hacked blog was crawled by the Internet Archive.  

This is where it's important to note that there may be a discrepancy between the posts that Nichols is concerned with and those that @Jamie_Maz surfaced. There is (semi-)independent evidence of @Jamie_Maz's pages, with the ultimate implication that for those pages to have been the result of a hack, blog.reidreport.com would have had to have been hacked as many as 12 years ago -- and nobody noticed at the time.

Reid (& Nichols) could always unblock the Internet Archive and share the evidence of the hack. 




Yet another 2018-04-25 update: Apparently there are some holes in the http vs. https canonicalization with respect to robots.txt blockage, allowing some of the posts to surface. Here's an example (via @YanceyMc):
https://web.archive.org/web/20060225041734/https://blog.reidreport.com/2005/10/harriet-miers-and-lesbian-hair-check.html





Also, @wvualphasoldier deleted his tweets then protected his account, so that's the reason the above embed no longer formats correctly. 

Yet, Yet Another 2018-04-25 update:

Thanks to Prof. Weigle and Mat Kelly for providing examples of some of the URLs that are slipping through the robots.txt exclusion.

Here's one: https://web.archive.org/web/20060805055643/https://blog.reidreport.com

and another: https://web.archive.org/web/20050728132003/https://blog.reidreport.com:443/

It has the following information that I thought I saw in the original @Jamie_Maz tweets, but now I can't find it, so perhaps I'm misremembering. It certainly fits the overall theme.



2018-04-30: A High Fidelity MS Thesis, To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages


It is hard to believe that the time has come for me to write a wrap up blog about the adventure that was my Masters Degree and the thesis that got me to this point. If you follow this blog with any regularity you may remember two posts, written by myself, that were the genesis of my thesis topic:

Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is ok if you cannot, as I will give an oh so brief TL;DR. The replay problems with cnn.com were, sadly, your typical here today, gone tomorrow replay issues involving this little thing, which I have come to love, known as JavaScript. What we also found, when replaying mementos of cnn.com from the major web archives, was that each web archive has its own unique and subtle variation of this thing called "replay". The next post, about the curious case of mendeley.com user pages (A State Of Replay), further confirmed that for us.

We found that not only do variations exist in how web archives perform URL rewriting (URI-Rs → URI-Ms) but also that, depending on the replay scheme employed, web archives modify the JavaScript execution environment of the browser and the archived JavaScript code itself beyond URL rewriting! As you can imagine, this left us asking a number of questions that led to the realization that web archiving lacks the terminology required to effectively describe the existing styles of replay and the modifications made to an archived web page and its embedded resources in order to facilitate replay.

Thus my thesis was born and is titled "To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages".

Since I am known around the WS-DL headquarters for my love of deep diving into the secrets of (securely) replaying JavaScript, I will keep the length of this blog post to a minimum. The thesis can be broken down into three parts, namely Styles Of Replay, Memento Modifications, and Auto-Generating Client-Side Rewriters. For more detailed information about my thesis, I have embedded my defense slides below, and the full text of the thesis has been made available.

Styles Of Replay

The existing styles of replaying mementos from web archives are broken down into two distinct models, namely "Wayback" and "Non-Wayback", and each has its own distinct styles. For the sake of simplicity and the length of this blog post, I will only (briefly) cover the replay styles of the "Wayback" model.

Non-Sandboxing Replay

Non-sandboxing replay is the style of replay that does not separate the replayed memento from the archive-controlled portion of replay, namely the banner. This style of replay is considered the OG (original gangster) way for replaying mementos simply because it was, at the time, the only way to replay mementos and was introduced by the Internet Archive's Wayback Machine. To both clarify and illustrate what we mean by "does not separate the replayed memento from archive-controlled portion of replay", consider the image below displaying the HTML and frame tree for a http://2016.makemepulse.com memento replayed from the Internet Archive on October 22, 2017.

As you can see from the image above, the archive's banner and the memento exist together on the same domain (web.archive.org), implying that the replayed memento can tamper with the banner (displayed during replay) and/or interfere with the archive's control over replay. For non-malicious examples of mementos containing HTML tags that can both tamper with the banner and interfere with archive control over replay, skip to the Replay Preserving Modifications section of this post. Now to address the recent claim that "memento(s) were hacked in the archive" and its correlation to non-sandboxing replay: additional discussion on this topic can be found in Dr. Michael Nelson's blog post covering the case of blog.reidreport.com and in his presentation for the National Forum on Ethics and Archiving the Web (slides, trip report).

For a memento to be considered (actually) hacked, the web archive the memento is replayed (retrieved) from must have been compromised in a manner that requires the hack to be made within the data stores of the archive and does not involve user-initiated preservation. However, user-initiated preservation can only tamper with a non-hacked memento when it is replayed from an archive. The tampering occurs when an embedded resource, un-archived at the memento-datetime of the "hacked" memento, is archived from the future (a datetime later than the memento-datetime) and typically involves the usage of JavaScript. Unlike non-sandboxing replay, the next style of Wayback replay, sandboxed replay, directly addresses this issue and the issue of how to securely replay archived JavaScript. PS: No signs of tampering, JavaScript-based or otherwise, were present in the blog.reidreport.com mementos from the Library of Congress. How do I know? Read my thesis and/or look over my thesis defense slides; I cover in detail what is involved in the mitigation of JavaScript-based memento tampering and what that actually looks like.

Sandboxed Replay

Sandboxed replay is the style of replay that separates the replayed memento from the archive-controlled portion of the page through replay isolation. Replay isolation is the usage of an iframe to sandbox the replayed memento, served from a different domain, away from the archive-controlled portion of replay. Because replay is split across two different domains (illustrated in the image below), one for the replay of the memento and one for the archive-controlled portion of replay (the banner), the memento cannot tamper with the archive's control over replay or the banner, due to the security restrictions the browser places on web pages from different origins, called the Same Origin Policy. Web archives employing sandboxed replay typically also perform the memento modification style known as temporal jailing. This style of replay is currently employed by Webrecorder and all web archives using Pywb (the open source, Python implementation of the Wayback Machine). For more information on the security issues involved in high-fidelity web archiving, see the talk entitled Thinking Like A Hacker: Security Considerations for High-Fidelity Web Archives given by Ilya Kreymer and Jack Cushman at WAC2017 (trip report), as well as Dr. David Rosenthal's commentary on the talk.
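
A minimal sketch of replay isolation (the domain names are placeholders): the banner and the memento are served from different origins, so the Same Origin Policy keeps the replayed memento from reaching into the archive-controlled page.

<!-- served from https://archive.example.org (archive-controlled banner) -->
<div id="replay-banner">Archived 22 Oct 2017 from http://2016.makemepulse.com/</div>
<!-- the memento is replayed from a different origin inside an iframe -->
<iframe src="https://content.archive.example.org/20171022000000/http://2016.makemepulse.com/"></iframe>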

Memento Modifications

The modifications made by web archives to mementos in order to facilitate their replay can be broken down into three categories, the first of which is archival linkage.

Archival Linkage Modifications

Archival linkage modifications are made by the archive to a memento and its embedded resources in order to serve (replay) them from the archive. The archival linkage category of modifications is the most fundamental and necessary set of modifications made to mementos by web archives, simply because they prevent the Zombie Apocalypse. You are probably already familiar with this category of memento modifications, as it is more commonly referred to as URL rewriting (URI-R → URI-M).

<!-- pre rewritten -->
<link rel="stylesheet" href="/foreverTime.css">
<!-- post rewritten -->
<link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">

URL rewriting (archival linkage modification) ensures that you relive (replay) mementos from the archive, not from the live web; hence the necessity of this kind of memento modification. However, it is becoming necessary to seemingly damage mementos simply in order to replay them.

Replay Preserving Modifications

Replay preserving modifications are modifications made by web archives to specific HTML element and attribute pairs in order to negate their intended semantics. To illustrate this, let us consider two examples, the first of which was introduced by our fearless leader Dr. Michael Nelson and is known as the zombie-introducing meta refresh tag, shown below.

        class="token punctuation"><metahttp-equiv        class="token attr-value">="refresh        class="token punctuation">"content        class="token attr-value">="35;url=?zombie=666        class="token punctuation">"/>

As you may know, the meta refresh tag will, after 35 seconds, refresh the page with "?zombie=666" appended to the original URL. When a page containing this dastardly tag is archived and replayed, the refresh plus the appending of "?zombie=666" to the URI-M causes the browser to navigate to a new URI-M that was never archived. To overcome this, archives must arm themselves with the attribute prefixing shotgun in order to negate the tag and attribute's effects. A successful defense against the zombie invasion using the attribute prefixing shotgun is shown below.


class="token punctuation"><meta_http-equiv class="token attr-value">="refresh class="token punctuation">"_content class="token attr-value">="35;url=?zombie=666 class="token punctuation">"/>

Now let me introduce to you a new, more insidious tag that does not introduce a zombie into replay but rather a demon: the meta csp tag, shown below.

        class="token punctuation"><metahttp-equiv        class="token attr-value">="Content-Security-Policy        class="token punctuation">"
content class="token punctuation">="default-src http://notarchive.com; img-src .... class="token punctuation">"/>

Naturally, web archives do not want web pages to be delivering their own Content-Security-Policies via meta tag because the results are devastating, as shown by the YouTube video below.

Readers, have no fear: this issue is fixed! I fixed the meta CSP issue for Pywb and Webrecorder in pull request #274 submitted to Pywb. I also reported this to the Internet Archive and they promptly fixed it.

Temporal Jailing

The final category of modifications, known as Temporal Jailing, is the emulation of the JavaScript environment as it existed at the original memento-datetime through client-side rewriting. Temporal jailing ensures both the secure replay of JavaScript and that JavaScript cannot tamper with time (introduce zombies) by applying overrides to the JavaScript APIs provided by the browser in order to intercept un-rewritten URLs. Yes, there is more to it, a whole lot more, but because it involves replaying JavaScript and I am attempting to keep this blog post reasonably short(ish), I must force you to consult my thesis or thesis defense slides for more specific details. However, for more information about the impact of JavaScript on archivability, and measuring the impact of missing resources, see Dr. Justin Brunelle's Ph.D. wrap up blog post. The technique for the secure replay of JavaScript known as temporal jailing is currently used by Webrecorder and Pywb.

Auto-Generating Client-Side Rewriters

Have I mentioned yet just how much I love JavaScript? If not, lemme give you a brief overview of how I am auto-generating client-side rewriting libraries, created a new way to replay JavaScript (currently used in production by Webrecorder and Pywb), and increased the replay fidelity of the Internet Archive's Wayback Machine.

First up, let me introduce to you Emu: Easily Maintained Client-Side URL Rewriter (GitHub). Emu allows any web archive to generate its own generic client-side rewriting library, one that conforms to the de facto standard implementation (Pywb's wombat.js), by supplying it with the Web IDL definitions for the JavaScript APIs of the browser. Web IDL was created by the W3C to describe interfaces intended to be implemented in web browsers, to allow the behavior of common script objects in the web platform to be specified more readily, and to specify how interfaces described with Web IDL correspond to constructs within ECMAScript execution environments. You may be wondering how I can guarantee that this tool will generate a client-side rewriter providing complete coverage of the JavaScript APIs of the browser, and that we can readily obtain these Web IDL definitions. My answer is simple: consider the following excerpt from the HTML specification:

This specification uses the term document to refer to any use of HTML, ..., as well as to fully-fledged interactive applications. The term is used to refer both to Document objects and their descendant DOM trees, and to serialized byte streams using the HTML syntax or the XML syntax, depending on context ... User agents that support scripting must also be conforming implementations of the IDL fragments in this specification, as described in the Web IDL specification

Pretty cool, right? What is even cooler is that a good number of the major browsers/browser engines (Chromium, Firefox, and WebKit) generate and make publicly available Web IDL definitions representing the browser's/engine's conformance to the specification! Next up: a new way to replay JavaScript.

Remember the curious case of mendely.com user pages (A State Of Replay) and how we found that Archive-It, in addition to applying archival linkage modifications, was rewriting JavaScript code to substitute a new, foreign, archive-controlled version of the JavaScript APIs it was targeting. This is shown in the image below.

Archive-It rewriting embedded JavaScript from the memento for the curious case of mendely.com user pages

Hmmmm, looks like Archive-It is rewriting only two out of the four instances of the text string location in the example shown above. This JavaScript rewriting was targeting the Location interface, which controls the location of the browser. Ok, so how well would Pywb/Webrecorder do in this situation? From the image shown below, not as well and maybe a tad bit worse...

Pywb v0.33 replay of https://reacttraining.com/react-router/web/example/auth-workflow

That's right, folks: JavaScript rewrites in HTML. Why? See below.

Bundling HTML in JavaScript, https://reacttraining.com/react-router/15-5fae8d6cf7d50c1c6c7a.js

Because the documentation site for React Router was bundling HTML inside of JavaScript containing the text string "location" (shown above), the rewrites were exposed in the documentation's HTML displayed to page viewers (second image above). In combination with how Archive-It rewrites archived JavaScript in a similar manner, I decided this needed to be fixed. And fix it I did. Let me introduce to you a brand new way of replaying archived JavaScript, shown below.

// window proxy
new window.Proxy({}, {
  get(target, prop) { /* intercept attribute getter calls */ },
  set(target, prop, value) { /* intercept attribute setter calls */ },
  has(target, prop) { /* intercept attribute lookup */ },
  ownKeys(target) { /* intercept own property lookup */ },
  getOwnPropertyDescriptor(target, key) { /* intercept descriptor lookup */ },
  getPrototypeOf(target) { /* intercept prototype retrieval */ },
  setPrototypeOf(target, newProto) { /* intercept prototype changes */ },
  isExtensible(target) { /* intercept is-object-extensible lookup */ },
  preventExtensions(target) { /* intercept prevent-extension calls */ },
  deleteProperty(target, prop) { /* intercept property deletion */ },
  defineProperty(target, prop, desc) { /* intercept new property definition */ },
})

// document proxy
new window.Proxy(window.document, {
  get(target, prop) { /* intercept attribute getter calls */ },
  set(target, prop, value) { /* intercept attribute setter calls */ }
})

The native JavaScript Proxy object allows an archive to perform runtime reflection on the proxied object. Simply put, it allows an archive to define custom or restricted behavior for the proxied object. I have annotated the code snippet above with additional information about the particulars of how archives can use the Proxy object. Using the JavaScript Proxy object in combination with the setup shown below, web archives can guarantee the secure replay of archived JavaScript and do not have to perform the kind of rewriting shown above. Yay! Less archival modification of JavaScript!! This method of replaying archived JavaScript was merged into Pywb on August 4, 2017 (contributed by yours truly) and has been used in production by Webrecorder since August 21, 2017. Now to tell you about how I increased the replay fidelity of the Internet Archive and how you can too.

        class="token keyword">var        class="token function-variable function">__archive$assign$function__        class="token operator">=function        class="token punctuation">(name)        class="token punctuation">{/*return archive override*/        class="token punctuation">};
{
// archive overrides shadow these interfaces
let window =__archive$assign$function__ class="token punctuation">("window") class="token punctuation">;
let self =__archive$assign$function__ class="token punctuation">("self" class="token punctuation">);
let document =__archive$assign$function__ class="token punctuation">("document" class="token punctuation">);
let location =__archive$assign$function__ class="token punctuation">("location" class="token punctuation">);
let top =__archive$assign$function__ class="token punctuation">("top" class="token punctuation">);
let parent =__archive$assign$function__ class="token punctuation">("parent") class="token punctuation">;
let frames =__archive$assign$function__ class="token punctuation">("frames") class="token punctuation">;
let opener =__archive$assign$function__ class="token punctuation">("opener") class="token punctuation">;
/* archived JavaScript */
}

Ok, so I generated a client-side rewriter for the Internet Archive's Wayback Machine using the code that is now Emu and crawled 577 Internet Archive mementos from the top 700 web pages found in the Alexa top 1 million web site list circa June 2017. The crawler I wrote for this can be found on GitHub. By using the generated client-side rewriter I was able to increase the cumulative number of requests made by the Internet Archive mementos by 32.8%, a 45,051 request increase (graph of this metric shown below). Remember that each additional request corresponds to a resource that previously could not be replayed from the Wayback Machine.

Hey look, I also decreased the number of requests blocked by the content security policy of the Wayback Machine by 87.5%, un-blocking 5,972 requests (graph of this metric shown below). Remember that each un-blocked request corresponds to a URI-R the Wayback Machine could not rewrite server-side, which requires the use of client-side rewriting (Pywb and Webrecorder already use this technique).

Now you must be thinking this is impressive to say the least, but how do I know these numbers are not faked or doctored in some way to give client-side rewriting the advantage? Well, you know what they say: seeing is believing! The generated client-side rewriter used in the crawl that produced the numbers shown to you today is available as the Wayback++ Chrome and Firefox browser extension! Source code for it is on GitHub as well. And oh look, a video demonstrating the increase in replay fidelity gained if the Internet Archive were to use client-side rewriting. Oh, I almost forgot to mention that at the 1:47 mark in the video I make mementos of cnn.com replayable again from the Internet Archive. Winning!!

Pretty good for just a master's thesis, wouldn't you agree? Now it's time for the obligatory list of all the things I have created in the process of this research and my time as a master's student:

What is next, you may ask? Well, I am going to be taking a break before I start down the path known as a Ph.D. Why? To become the senior backend developer for Webrecorder, of course! There is so much to be learned from actually getting my hands dirty facilitating high-fidelity web archiving that when I return, I will have a much better idea of what my research's focus should be.

If I have said this once, I have said it a million times: when you use a web browser in the preservation process, there is no such thing as an un-archivable web page! Long live high-fidelity web archiving!

- John Berlin (@johnaberlin, @N0taN3rd)

2018-05-04: An exploration of URL diversity measures

Recently, as part of a research effort to describe a collection of URLs, I was faced with the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a diversity value of 1 indicates maximum diversity - all different URLs (Fig. 2, first row, last column):
1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story
Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be other URL diversity measures in the Web Science community that exist under different names. 
Not surprisingly, Biologists (especially Conservation Biologists) have multiple measures for quantifying biodiversity called diversity indices. In this blog post, I will briefly describe how some common biodiversity measures, in addition to the WSDL diversity index, can be used to quantify URL diversity. Additionally, I have provided recommendations for choosing a URL diversity measure depending on the problem domain. I have also provided a simple Python script that reads a text file containing URLs and produces the URL diversity scores of the different measures introduced in this post.
Fig. 2: WSDL URL diversity matrix of examples across multiple policies (URL, hostname, and domain). For all policies, the schemes, URL parameters, and fragments are stripped before calculation. For hostname diversity calculation, only the host is considered, and for domain diversity calculation, only the domain is considered.
I believe the problem of quantifying how many different species there are in a biological community is very similar to the problem of quantifying how many different URLs there are in a collection of URLs. Biodiversity measures (or diversity indices) express the degree of variety in a community. Such measures answer questions such as: does a community of mushrooms include only one, two, or three species of mushrooms? Similarly, a URL diversity measure expresses the degree of variety in a collection of URLs and answers questions such as: does a collection of URLs represent only one (e.g., cnn.com), two (cnn.com and foxnews.com), or three (cnn.com, foxnews.com, and nytimes.com) domains? Even though biodiversity indices and URL diversity measures are similar, it is important to note that since the two domains are different, their respective diversity measures reflect these differences. For example, the WSDL diversity index I introduce later does not reward duplicate URLs because duplicate URLs do not increase the informational value of a URL collection.

URL Diversity Measures

Let us consider the WSDL diversity index for quantifying URL diversity, and apply popular biodiversity indices to quantify URL diversity.

URL preprocessing:
Since URLs have aliases, the following steps were taken before the URL diversity was calculated.

1. Scheme removal: This transforms
http://www.cnn.com/path/to/story?param1=value1&param2=value2#1 
to 
www.cnn.com/path/to/story?param1=value1&param2=value2#1

2. URL parameters and fragment removal: This transforms
www.cnn.com/path/to/story?param1=value1&param2=value2#1
to
www.cnn.com/path/to/story

3. Multi-policy and combined (or unified) policy URL diversity: For the WSDL diversity index (introduced below), the URL diversity can be calculated for multiple separate policies such as the URL (www.cnn.com/path/to/story), Domain (cnn.com), or Hostname (www.cnn.com). For the biodiversity measures introduced, the URL diversity can also be calculated by combining policies; for example, by combining the Hostname (or Domain) with URL paths. This involves considering the Hostnames (or Domains) as the species and the URL paths as the individuals. I call this combined-policy approach of calculating URL diversity unified diversity.

WSDL diversity index:

The WSDL diversity index (Fig. 3) rewards variety and not duplication. It is the ratio of unique items (URIs, Domain names, or Hostnames) to the total number of items |C|. We subtract 1 from both the numerator and the denominator in order to normalize the index to the 0 - 1 range. A value of 0 (e.g., Fig. 2, first row, first column) is assigned to a list of duplicate URLs. A value of 1 is assigned to a list of distinct URLs (e.g., Fig. 2, first row, last column).
Fig. 3: The WSDL diversity index (Equation 1) and the explanation of variables. U represents the count of unique URLs (or species - R).  |C| represents the number of URLs (or individuals N).
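Based on the description above and the variable definitions in the caption (a reconstruction, not a copy of the figure itself), Equation 1 can be written as:

\[
\text{WSDL diversity index} = \frac{U - 1}{|C| - 1}
\]

This yields 0 when all URLs are duplicates (U = 1) and 1 when all URLs are distinct (U = |C|).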
Unlike the other biodiversity indices introduced next, the WSDL diversity index can be calculated for separate policies: URL, Domain, and Hostname. This is because the numerator of the formula considers uniqueness, not counts. In other words, the numerator operates over sets of URLs (no duplicates allowed), unlike the biodiversity measures, which operate over lists (duplicates allowed). Since the biodiversity measures introduced below take counts (counts of species) into account, calculating the URL diversity across multiple policies results in the same diversity value unless the policies are combined (e.g., Hostname combined with URL paths).

Simpson's diversity index:

The Simpson's diversity index (Fig. 4, equation 2) is a common diversity measure in Ecology that quantifies the degree of biodiversity (variety of species) in a community of organisms. It is also known as the Herfindahl–Hirschman Index (HHI) in Economics and the Hunter–Gaston index in Microbiology. The index simultaneously quantifies two quantities - the richness (number of different kinds of organisms) and the evenness (the proportion of each species present) in a bio-community. Additionally, the index produces diversity values ranging between 0 and 1: 0 means no diversity and 1 means maximum diversity.
Fig. 4: Simpson's diversity index (Equation 2) and Shannon's evenness index (Equation 3) and the explanation of variables (R, n_i (n subscript i), and N) they share.
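For readers without the figure, the variants consistent with the worked values reported in Fig. 5a (0.7 and 0.86) appear to be the finite-sample form of Simpson's index (Equation 2) and Shannon's evenness (Equation 3); this is my reconstruction, not a copy of the figure:

\[
D = 1 - \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)}
\qquad
E = \frac{H}{\ln R}, \quad H = -\sum_{i=1}^{R} \frac{n_i}{N} \ln\!\left(\frac{n_i}{N}\right)
\]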
Applying the Simpson's diversity index to measure URL diversity:
There are multiple variants of the Simpson's diversity index; the variant shown in Fig. 4, equation 2 is applicable to measuring URL diversity in two ways. First, we may consider URLs as the species of biological organisms (Method 1). Second, we may consider the Hostnames as the species and the URL paths as the individuals (Method 2). There are three parameters needed to use the Simpson's diversity index (Fig. 4):
Method 1:
  1. R - total number of species (or URLs)
  2. n_i (n subscript i) - number of individuals for a given species, and 
  3. N - total number of individuals
Method 2 (Unified diversity):
  1. R - total number of species where the Hostnames (or Domains) are the species
  2. n_i (n subscript i) - number of individuals (URL paths) for a given species, and
  3. N - total number of individuals
Fig. 5a applies Method 1 to calculate the URL diversity. In Fig. 5a, there are 3 different URLs, interpreted as 3 species (R = 3) in the Simpson's diversity index formula (Fig. 4, equation 2):
1. www.cnn.com/path/to/story1
2. www.cnn.com/path/to/story2
3. www.vox.com/path/to/story1

Fig. 5a: Example showing how the Simpson's diversity index and Shannon's evenness index can be applied to calculate URL diversity by setting three variables: R represents the number of species (URLs). In the example, there are 3 different URLs. n_i (n subscript i) represents the count of the species (n_1 = 3, n_2 = 1, and n_3 = 1). N represents the total number of individuals (URLs). The Simpson's diversity index (Fig. 4, equation 2) is 0.7, Shannon's evenness index - 0.86
The first URL has 3 copies, which can be interpreted as 3 individuals for the first species (n_1) in the Simpson's diversity index formula. The second and third URLs have 1 copy each; similarly, this can be interpreted as 1 individual each for the second (n_2) and third (n_3) species. In total (including duplicates) we have 5 URL individuals (N = 5). With all the parameters of the Simpson's diversity index (Fig. 4, equation 2) set, the diversity index for the example in Fig. 5a is 0.7, as worked out below.
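Working the formulas above through with these counts (n_1 = 3, n_2 = 1, n_3 = 1, N = 5, R = 3) reproduces the values reported in the caption:

\[
D = 1 - \frac{3(2) + 1(0) + 1(0)}{5(4)} = 1 - \frac{6}{20} = 0.7,
\qquad
E = \frac{-(0.6\ln 0.6 + 2 \times 0.2\ln 0.2)}{\ln 3} \approx \frac{0.950}{1.099} \approx 0.86
\]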
Fig. 5b: Example showing how the Simpson's diversity index and Shannon's diversity index can be applied to calculate unified URL diversity by interpreting Hostnames as the species (R) and the URL paths as the individuals (n_i). This method combines the Hostname (or Domain) with URL paths for URL diversity calculation.
Fig. 5b applies Method 2 to calculate the unified diversity. In the unified diversity calculation, the policies are combined (Hostname with URL paths). For example, in Fig. 5b the Hostnames represent the species and the URL paths are considered the individuals.

Shannon-Wiener diversity index:

The Shannon-Wiener diversity index, or Shannon's diversity index, comes from information theory, where it is used to quantify the entropy in a string. However, in Ecology, similar to the Simpson's index, it is applied to quantify the biodiversity in a community. It simultaneously measures the richness (number of species) and the evenness (homogeneity of the species). The Shannon's Evenness Index (SEI) is the Shannon's diversity index divided by the maximum diversity (ln(R)), which occurs when each species has the same frequency (maximum evenness).

Applying the SEI to measure URL diversity:
The variables in the SEI are the same as the variables in the Simpson's diversity index. Fig. 5a evaluates the SEI (Equation 3) for a set of URLs, while Fig. 5b calculates the unified URL diversity by interpreting the Hostnames as species.
Fig. 6: Example showing how the URL diversity indices differ. For example, the WSDL diversity index rewards URL uniqueness and penalizes URL duplication, since the duplication of URLs does not increase informational value, while the Shannon's evenness index rewards balance in the proportion of URLs. It is also important to note that calculating URL diversity across multiple separate policies (URL, domain, and hostname) is only possible with the WSDL diversity index.
I recommend using the WSDL diversity index for measuring URL diversity if the inclusion of duplicate URLs should not be rewarded and there is a need to calculate URL diversity across multiple separate policies (URL, domain, and hostname). Both the Simpson's diversity index and the Shannon's evenness index strive to simultaneously capture richness and evenness. I believe Shannon's evenness index does a better job capturing evenness, which occurs when the proportion of species is distributed evenly (Fig. 6, first row, second column). I recommend using the Simpson's diversity and Shannon's evenness indices for URL diversity calculation when the definition of diversity is similar to the Ecological meaning of diversity and the presence of duplicate URLs need not penalize the overall diversity score. The source code that implements the URL diversity measures introduced here is publicly available; a rough sketch of the core calculations follows below.
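The publicly available script is the reference implementation; as an illustration only, here is a minimal Python sketch (the function names and the canonicalization helper are my own) of how the three URL-policy measures discussed above could be computed for a list of URLs:

import math
from collections import Counter
from urllib.parse import urlsplit

def canonicalize(url):
    """Strip scheme, query parameters, and fragment (preprocessing steps 1 and 2)."""
    parts = urlsplit(url)
    return parts.netloc + parts.path

def wsdl_diversity(urls):
    """(unique - 1) / (total - 1); 0 for all duplicates, 1 for all distinct."""
    total = len(urls)
    if total < 2:
        return 0.0
    return (len(set(urls)) - 1) / (total - 1)

def simpson_diversity(urls):
    """1 - sum(n_i(n_i - 1)) / (N(N - 1)), treating each distinct URL as a species."""
    counts = Counter(urls)
    n = sum(counts.values())
    if n < 2:
        return 0.0
    return 1 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def shannon_evenness(urls):
    """H / ln(R), where H is Shannon entropy over the species proportions."""
    counts = Counter(urls)
    n, r = sum(counts.values()), len(counts)
    if r < 2:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(r)

urls = [canonicalize(u) for u in [
    "http://www.cnn.com/path/to/story1", "https://www.cnn.com/path/to/story1",
    "http://www.cnn.com/path/to/story1", "http://www.cnn.com/path/to/story2",
    "https://www.vox.com/path/to/story1",
]]
print(wsdl_diversity(urls), simpson_diversity(urls), shannon_evenness(urls))
# 0.5 0.7 0.865...  (the Simpson's and SEI values match the Fig. 5a example)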
-- Nwala (@acnwala)

2018-05-15: Archives Unleashed: Toronto Datathon Trip Report

The Archives Unleashed team (pictured below) hosted a two-day datathon, April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons to give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale and the datathons are designed to help participants find ways to incorporate these tools into their own workflow. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing.


Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz

Day 1 - April 26, 2018
Ian Milligan kicked off the event by talking about why these datathons are so important to the Archives Unleashed project team. For the project to be a success, the team needs to: build a community, create a common vision for web archiving tool development, avoid black box systems that nobody really understands, and equip the community with these tools to be able to work as a collective.


Many presentations, conversations, and tweets during the event indicated that working with web archives, particularly WARC files, can be messy, intimidating, and really difficult. The AUT tries to help simplify the process by breaking it down into four parts:
  1. Filter - focus on a date range, a single domain, or specific content
  2. Analyze - extract information that might be useful such as links, tags, named entities, etc.
  3. Aggregate - summarize the analysis by counting, finding maximum values, averages, etc.
  4. Visualize - create tables from the results or files for use in external applications, such as Gephi


We were encouraged to use the AUT throughout the event to go through the process of filtering, analyzing, aggregating, and visualizing for ourselves. Multiple datasets were provided to us and preloaded onto powerful virtual machines, provided by Compute Canada, in an effort to maximize the time spent working with the AUT instead of fiddling with settings and data transfers.



Now that we knew the who, what, and why of the datathon, it was time to create our teams and get to work. We wrote down research questions (pink), datasets (green), and technologies/techniques (yellow) we were interested in using on sticky notes and posted them on a whiteboard. Teams started to form naturally from the discussion, but not very quickly, until we got a little help from Jimmy and Ian to keep things moving.

I worked with Jayanthy Chengan, Justin Littman, Shawn Walker, and Russell White. We wanted to use the #neveragain tweet dataset to see if we could filter out spam links and create a list of better quality seed URLs for archiving. Our main goal was to use the AUT without relying on other tools that we may have already been familiar with. Many of us had never even heard of Scala, the language that AUT is written in. We had all worked through the homework leading up to the datathon, but it still took us a few hours to get over the initial jitters and become productive.

Scala was a point of contention among many participants. Why not use Python or another language that more people are familiar with and can easily interface with existing tools? Jimmy had an answer ready, as he did for every question thrown at him over the course of the event.

Around 5pm, it was time for dinner at the Duke of York. My team decided against trying to get everyone up and running on their local machines, opting instead to enjoy dinner and come back fresh for day 2.



Day 2 - April 27, 2018 
Day 2 began with what felt like an epiphany for our team:
In reality, it was more like:

Either way, we learned from the hiccups of the first day and began working at a much faster pace. All of the teams worked right up until the deadline to turn in slides, with a few coffee breaks and lightning talks sprinkled throughout. I'll include more information on the lightning talks and team presentations as they become available.

Lightning Talks
  • Jimmy Lin led a brainstorming session about moving the AUT from RDD to DataFrames. Samantha Fritz posted a summary of the feedback received where you can participate in the discussion.
  • Nick Ruest talked about Warclight, a tool that helps with discovery within a WARC collection. He showed off a demo of it after giving us a little background information.
  • Shawn Jones presented the five minute version of a blog post he wrote last year that talks about summarizing web archive collections.
  • Justin Littman presented TweetSets, a service that allows a user to derive their own Twitter dataset from existing ones. You can filter by any Tweet attributes such as text, hashtags, mentions, date created, etc.
  • Shawn Walker talked about the idea of using something similar to a credit score to warn users, in realtime, of the risk that content they're viewing may be misinformation.

At 3:30pm, Ian introduced the teams and we began final presentations right on time.

Team Make Tweets Great Again (Shawn Jones' team) used a dataset including tweets sent to @realdonaldtrump between June 2017 and now, along with tweets with #MAGA in them from June - October 2017. A few of the questions they had were:

  • As a Washington insider falls from grace, how quickly do those active in #MAGA and @realDonaldTrump shift allegiance?
  • Did sentiment change towards Bannon before and after he was fired by Trump?

They used positive or negative sentiment (emojis and text-based analysis) as an indicator of shifting allegiance towards a person. There was a decline in the sentiment rating for Steve Bannon when he was fired in August 2017, but the real takeaway is that people really love the 😂 emoji. Shawn worked with Jacqueline Whyte Appleby and Amanda Oliver. Jacqueline decided to focus on Bannon for the analysis, Amanda came up with the idea to use emojis, and Shawn used twarc to gather the information they would need.


Team Pipeline Research used datasets made up of WARC files of pipeline activism and Canadian political party pages, along with tweets (#NoASP, #NoDAPL,  #StopKM, #KinderMorgan). From the datasets, they were able to generate word clouds, find the image most frequently used, perform link analysis between pages, and analyze the frequency of hashtags used in the tweets. Through the analysis process, they discovered that some URLs had made it into the collection erroneously. 



Team Spam Links (my team) used a dataset including tweets with various hashtags related to the Never Again/March for Our Lives movement. The question we wanted to answer was "What is the best algorithm for extracting quality seed URLs from social media data?" We created a Top 50 list of URLs tweeted in the unfiltered dataset and coded them as relevant, not relevant, or indeterminate. We then came up with multiple factors to filter the dataset by (users with/without the default Twitter profile picture, with/without bio in profile, user follower counts, including/excluding retweets, etc.) and generated a new Top 50 list each time. The original Top 50 list was then compared to each of the filtered Top 50 lists.


We didn’t find a significant change in the rankings of the spam URLs, but we think that’s because there just weren’t that many in the dataset’s Top 50 to begin with. Running these experiments against other datasets and expanding the Top 50 to maybe the Top 100 or more would likely yield better results. Special thanks to Justin and Shawn Walker for getting us started and doing the heavy lifting, Russell for coding all of the URLs, and Jayanthy for figuring out Scala with me.


Team BC Teacher Labour was the final group of the day and they used a dataset from Archive-It about the British Columbia Teachers’ Labour Dispute. While exploring the dataset with the AUT, they created word clouds showing the frequency of words compared between multiple domains, network graphs showing named entities and how they related to each other, and many others. The most interesting visual they created was an animated GIF that quickly showed the primary image from each memento, giving a good overview of the types of images in the collection. 



Team Just Kidding, There’s One More Thing was a team of one: Jimmy Lin. Jimmy was busy listening to feedback about Scala vs. Python and working on his own secret project. He created a new branch of the AUT running in a Python environment, enabling some of the things people were asking for at the beginning of Day 1. Awesome.



After Jimmy’s surprise, the organizers and teams voted for the winning project. All of the projects were great, but there can only be one winner and that was Team Make Tweets Great Again! I’m still convinced there’s a correlation between the number of emojis in their presentation, their team name, and the number of votes they received but 🤷🏻‍♂️. Just kidding 😂, your presentation was 🔥. Congratulations 🎊 to Shawn and his team! 

I’m brand new to the world of web archiving and this was my first time attending an event like this, so I had some trepidation leading up to the first day. However, I quickly discovered that the organizers and participants, regardless of skill level or background, were there to learn and willing to share their own knowledge. I would highly encourage anyone, especially if you’re in a situation similar to mine, to apply for the Vancouver datathon that was announced at the end of Day 2 or one of the 2019 datathons taking place in the United States.

Thanks again to the organizers (Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz, Ryan Deschamps, and Nich Worby), their partners, and the University of Toronto for hosting us. Looking forward to the next one!

- Brian Griffin

2018-06-08: Joint Conference on Digital Libraries (JCDL) 2018 Trip Report

The gathering place at the Cattle Raisers Museum, Fort Worth, Texas 
This year's 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) took place at the University of North Texas (Fort Worth, Texas). Between June 3-6, members of WSDL attended paper sessions, workshops, tutorials, panels, and a doctoral consortium.

The theme of this year's conference was "From Data to Wisdom: Resilient Integration across Societies, Disciplines, and Systems." The conference provided researchers across multiple disciplines ranging from Digital Libraries and Web science research to Libraries and Information science, with the opportunity to communicate the findings of their research.

Day 1 (June 3, 2018)

The first day of the conference was dedicated to the doctoral consortium, tutorials, and workshops. The doctoral consortium provided an opportunity for Ph.D. students in the early phases of their dissertation to present their thesis and research plans and receive constructive feedback. I will provide a link to the Doctoral Consortium blog post when it becomes available.

Day 2 (June 4, 2018)

The conference officially began on the second day with Dr. Jiangping Chen's introduction of the conference and the keynote speaker - Dr. Trevor Owens. Dr. Trevor Owens is a librarian, researcher and policy maker and the first head of Digital Content Management for library services at the Library of Congress. His talk was titled: "We have interesting problems." 

It started with a highlight of Ben Shneiderman's The New ABCs of Research, which provides students with guidance on how to succeed in research, and provides senior researchers and policy makers with guidance on how to respond to new problems and apply new technologies. The new ABCs of research may be grossly summarized with two acronyms included in the book: ABC (Applied, Basic, and Combined) and SED (Science, Engineering, and Design).
Additionally, he presented NDP@3, an IMLS framework for investments in digital infrastructures for libraries. He also presented multiple IMLS-funded projects such as Image Analysis for Archival Discovery (AIDA), which explores various ways to use millions of images representing the digitized cultural record.
Next he talked about some resources at the Library of Congress Labs such as:
  • Library of Congress Colors: provides the capability of exploring the colors in the Library of Congress collections.
  • LC for Robots: provides a list of APIs, data and tutorials for exploring the digital collections at the Library of Congress.
Following the keynote were three concurrent paper sessions with the theme: Use, Collection Building, and Semantics & Linking. I will briefly describe the papers discussed in two paper sessions.


Paper session 1B (Day 2)


Myriam Traub (best paper nominee), a PhD student at Centrum Wiskunde & Informatica (CWI), presented a full paper titled: "Impact of Crowdsourcing OCR Improvements on Retrievability Bias." She discussed how crowd-sourced correction of OCR errors affects the retrievability of documents in a historic newspaper corpus in a digital library.
Three short papers followed Traub's presentation. First, Karen Harker, a Collection Assessment Librarian at the University of North Texas Libraries, presented: "Applying the Analytic Hierarchy Process to an Institutional Repository Collection." She discussed the application of the Analytic Hierarchy Process (AHP) to create a model for evaluating collection development strategies of institutions. Second, Douglas Kennard presented: "Computer-Assisted Crowd Transcription of the U.S. Census with Personalized Assignments for Better Accuracy and Participation," where he introduced the Open Genealogy Data census transcription project that strives to make census data readily available to researchers and digital libraries. This was achieved through the use of automatic handwriting recognition to bootstrap their census database and subsequent crowd-sourced correction of the data through a web interface. Finally, Mandy Neumann, a research associate at the Institute of Information Science at TH Köln, presented: "Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp." She explored different features for ranking conference candidates by using a pseudo-relevance assessment.


Paper session 1C (Day 2)


Dr. Federico Nanni (best paper nominee), a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim presented the first of three full papers titled: "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context," in which he introduced a method for obtaining specific descriptions of entities in text by retrieving the most related section from Wikipedia.
Next, Gary Munnelly, a PhD student at the School of Computer Science and Statistics (SCSS) at Trinity College Dublin presented: "Investigating Entity Linking in Early English Legal Documents," discussing the effectiveness of different entity linking systems for the task of disambiguating named entities in 17th century depositions obtained during the 1641 Irish rebellion.
Finally, Dr. Ahmed Tayeh presented: "An Analysis of Cross-Document Linking Mechanisms," where he discussed different strategies for linking or associating information across physical and digital documents. The titles of other papers presented in a parallel session (1A) include:

Open Cross-Document Linking Service Based on a Plug-in Architecture from Ahmed Tayeh


Paper session 2A (Day 2)


Two full papers were presented after a break. The first, titled: "Putting Dates on the Map: Harvesting and Analyzing Street Names with Date Mentions and their Explanations," was presented by Rosita Andrade. She presented her research about the automated analysis of street names with date references around the world, and showed that "temporal streets" are frequently used to commemorate important events such as a political change in a country.
Next, Dr. Philipp Mayr, a deputy department head and a team leader at the GESIS department Knowledge Technologies for the Social Sciences, presented: "Contextualised Browsing in a Digital Library's Living Lab." He presented two approaches that contextualize browsing in a digital library. The first approach is based on document similarity and the second utilizes implicit session information (e.g., queries and document metadata from sessions of users).


Paper session 3A (Day 2)


Three concurrent paper sessions followed Dr. Philipp Mayr's presentation. Dr. Dominika Tkaczyk, a researcher and data scientist at the Applied Data Analysis Lab at the University of Warsaw (Poland), presented: "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers," in which she presented the results of comparing different methods for parsing scholarly article references.
Anne Lauscher, a PhD student at the University of Mannheim presented: "Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph." She presented the current state of the workflow and implementation of the Linked Open Citation Database project, which is a distributed infrastructure based on linked data technology for efficiently cataloging citations in libraries.


Paper session 3C (Day 2)


Norman Meuschke, a PhD student at the University of Konstanz, presented: "An Adaptive Image-based Plagiarism Detection Approach," in which he discussed his analysis of images in academic documents to detect disguised forms of plagiarism with approaches such as perceptual hashing, ratio hashing and position-aware OCR text matching. 


Hisham Benotman presented his work: "Extending Multiple Diagram Navigation with Internal Diagram And Collection Connections." He discussed his work about extending Multiple diagram navigation (MDN) such that diagram-to-content queries reach related collection documents not directly connected to the diagrams.
Other papers presented in a parallel session (3B) include:
Minute madness followed the paper sessions. The minute madness was an activity in which poster presenters were given 1 minute to advertise their respective posters to the conference attendees. The poster session began after the minute madness.





Day 3 (June 5, 2018)

Day 3 of the conference began with Dr. Niall Gaffney's keynote. Dr. Niall Gaffney is an Astronomer and Director of Data Intensive Computing at the Texas Advanced Computing Center (TACC). He started by emphasizing the importance of scientific reproducibility before moving on to show some of the projects supported by the computational machinery at TACC such as Firefly.
Two concurrent paper sessions followed a short break.

Paper session 4A (Day 3)


Dr. Gianmaria Silvello, an assistant professor at the Department of Information Engineering of the University of Padua presented a full paper titled: "Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents." He discussed his project about the development of an evaluation framework for validating the conformance of long-term preservation by assessing correctness, usability and usefulness.
Next, Dr. Pavlos Fafalios, a researcher at the L3S Research Center in Germany, presented a full paper titled: "Ranking Archived Documents for Structured Queries on Semantic Layers," in which he proposed two ranking models that rank archived documents by considering the similarity of documents to entities, the timeliness of documents, and the temporal relations between the entities.
The final paper presented in this session (by someone other than an author of the paper) was a short paper titled: "Modeling Author Contribution Rate With Blockchain." Three concurrent paper sessions (all full papers) followed after a break.


Paper session 4B (Day 3)


Florian Mai, a graduate student at Kiel University in Germany was the first presenter of the paper session on Text Collections. He presented a full paper titled: "Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text," in which he presented the findings from investigating how deep learning models obtained from training on titles compare to deep learning models obtained from training on full-texts.
Next, Chris Holstrom, a PhD student from the Information School at the University of Washington presented a short paper: "Social Tagging: Organic and Retroactive Folksonomies," in which he showed that tags on MetaFilter and AskMetaFilter follow a power law distribution and retroactive taggers do not use "organization" tags like professional indexers.
Next, Jens Willkomm, a PhD student at the Karlsruhe Institute of Technology in Germany, presented a full paper titled: "A Query Algebra for Temporal Text Corpora." He proposed a novel query algebra for accessing and analyzing words in large text corpora.


Paper session 5A (Day 3)


Omar Alonso (best paper nominee) presented a full paper titled: "How it Happened:  Discovering and Archiving the Evolution of a Story Using Social Signals." He introduced a method of showing the evolution of stories from the perspective of social media users as well as the articles that include social media as supporting evidence.
Tobias Backes, a researcher at GESIS, presented his paper titled: "Keep it Simple: Effective Unsupervised Author Disambiguation with Relative Frequencies." He addressed the problem of author name homonymy in the Web Science domain by proposing a novel probabilistic similarity measure for author name disambiguation based on feature overlap.
The last paper (best paper nominee) presented in this session was titled: "Digital History meets Microblogging: Analyzing Collective Memories in Twitter."


Paper session 5B (Day 3)


Noah Siegel, a researcher at the Allen Institute for Artificial Intelligence, presented a full paper titled: "Extracting Scientific Figures with Distantly Supervised Neural Networks," where he introduced a system for extracting figures from a large number of scientific documents without human intervention.
Next, André Greiner-Petter presented his full paper titled: "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context." He presented a new approach for mathematical format conversion that utilizes textual information to reduce the error rate. Additionally, he evaluated state-of-the-art tools for mathematical conversions and provided a public, manually created gold standard dataset for mathematical format conversion.

Next, Yuta Kobayashi presented a paper titled: "Citation Recommendation Using Distributed Representation of Discourse Facets in Scientific Articles," presenting the effectiveness of using facets of scientific articles such as "objective," "method," and "result" for citation recommendation by learning a multi-vector representation of scientific articles, in which each vector represents a facet of the article.

Paper session 5C (Day 3)


Catherine Marshall, an adjunct professor at Texas A&M University, presented: "Biography, Ephemera, and the Future of Social Media Archiving." She presented her findings from answering the following question: "Will the addition of new digital sources such as records repositories, digital libraries, social media, and collections of ephemera change biographical research practices?" She demonstrated how new digital resources unravel a subject's social network, thus exposing biographical information that was formerly invisible.
Next, I presented our full paper titled: "Scraping SERPs for Archival Seeds: It Matters When You Start" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson. In my presentation, first, I highlighted the importance of web archive collections for studying important historical events ranging from elections to disease outbreaks. Next, I showed that search engines (specifically Google) can be used to generate seeds. Finally, I showed that it becomes harder to find the older URLs of news stories over time, so seed generators that utilize search engines should begin early and persist to capture the evolution of an event.

Next, Mat Kelly (best paper nominee), a fellow PhD student at Old Dominion University and member of WSDL, presented his full paper titled: "A Framework for Aggregating Private and Public Web Archives." He showed his framework that provides a means of combining public web archive captures and private web captures (e.g., banking and social media information) without compromising sensitive information included in the private captures. This work utilizes Sawood Alam's MemGator, a Memento aggregator that supports multiple serialization formats such as Link, JSON, and CDXJ.


Paper session 6A (Day 3)


The last paper session on Topic Modeling and Detection consisted of three full papers. First, Julian Risch (best paper nominee), a PhD student at Hasso-Plattner Institute (Germany) presented: "My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections." He presented a topic model combined with automatic domain term extraction and phrase segmentation that distinguishes collection-specific and collection-independent words based on information entropy.
Next, Dr. Ralf Krestel, the head of Web Science Research Group & Senior Researcher at Hasso-Plattner Institute (Germany) presented his full paper titled: "WELDA: Enhancing Topic Models by Incorporating Local Word Context." He proposed a new topic model called WELDA that combines word embeddings (WE) and Latent Dirichlet Allocation (LDA).
Finally, Angelo Salatino, a PhD student at the Knowledge Media Institute (UK) presented a full paper titled: "AUGUR: Forecasting the Emergence of New Research Topics." He introduced AUGUR, which is a new approach for the early detection of research topics in order to help stakeholders such as universities, institutional funding bodies, academic publishers and companies recognize new research trends.

A dinner at the Fort Worth Museum of Science and History followed after a break. The best poster award was presented to Mohamed Aturban, a fellow PhD student at Old Dominion University and member of WSDL, for his poster "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation."
Dr. Federico Nanni (Providing Fine-Grained Semantics of Entities in Context) and Myriam Traub (Impact of Crowdsourcing OCR Improvements on Retrievability Bias) tied for the Vannevar Bush best paper award. Myriam Traub also won the best student paper award.


Day 4 (June 6, 2018)

Day 4 began with a keynote from Dr. Carly Strasser, director of Strategic Development for the Collaborative Knowledge Foundation. Her keynote "Open Source Tech for Scholarly Communication: Why It Matters," illustrated the problems in the submission, production and delivery of scholarly communication. She talked about the problem of the disjoint nature (silos) of the various stages of scholarly communication, as well as the expensive delivery, slow production, static and less interoperable output.
She also presented a vision of scholarly communication that consists of living documents that link to open source code and data, a cheaper delivery system, faster production and more interoperable and dynamic output. Additionally, she talked about the organizations working to achieve various aspects of this vision.
The main conference gave way to workshops and a preview of JCDL 2019 which is scheduled to take place at the School of Information Sciences at the University of Illinois, Urbana-Champaign from June 2-6, 2019.
I would like to thank the organizers of the conference, the hosts, the University of North Texas (UNT) College of Information and the UNT Health Science Center, as well as SIGIR for the travel grants. I will provide a link to Mat Kelly's Web Archiving and Digital Libraries (WADL) workshop trip report once it is available, but here is a preview of WADL from Jasmine Mulliken, Digital Production Associate at Stanford University Press. Dr. Min-Yen Kan set up a repository for all the slides from JCDL 2018; please upload your slides if you have not already done so.

-- Nwala (@acnwala)

2018-06-08: Joint Conference on Digital Libraries (JCDL) Doctoral Consortium Trip Report




On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work. As usual, I live-tweeted the event to capture salient points. It was a very enjoyable experience for all.

Thanks very much to the chairs. In this post I will cover the work of all accepted students, three of whom are from the Web Science and Digital Libraries Research Group at Old Dominion University. I would also like to thank the assigned mentors of the Doctoral Consortium, who provided insight and guidance not only to their own assigned students, but to the rest of us as well.

WS-DL Presentations



Shawn M. Jones




How does a researcher differentiate between web archive collections that cover the same topic? Some web archive collections consist of 100,000+ seeds, each with multiple mementos. There are more than 8000 collections in Archive-It as of the end of 2016. Existing metadata in Archive-It collections is insufficient because the metadata is produced by different curators from different organizations applying different content standards and different rules of interpretation. As part of my doctoral consortium submission, I proposed improving upon the solution piloted by Yasmin AlNoamany. She generated a series of representative mementos and then submitted them to the social media storytelling platform Storify in order to provide a summary of each collection. As part of my preliminary work I presented some findings that will be published at iPres 2018. We discovered four semantic categories of Archive-It collections: collections where an organization archived itself, collections about a specific subject, collections about expected events or time periods, and collections about spontaneous events. The collections AlNoamany used in her work fit into the last category. This also turned out to be the smallest category of collections, meaning that there are many other types of collections not evaluated by her method. She proved that humans could not tell the difference between her automatically-generated stories and other stories generated by humans. She did not, however, provide evidence that the visualization was useful for collection understanding. We also have the problem that Storify is no longer in service, something that I mentioned in a previous blog post. My plan includes developing a flexible framework that allows us to test different methods of selecting representative mementos. This framework will also allow us to test different types of visualizations using those representative mementos. Some of these visualizations may make use of different social media platforms. I plan to evaluate these collections by first creating user tasks that give us some idea that a user understands aspects of a collection. With these tasks I intend to then evaluate different solutions via user testing. The solutions that score best from the testing will address a large problem inherent to the scale of web archives.


Alexander Nwala




How do we find high quality seeds for generating web archive collections? Alexander is focusing on a different aspect of web archive collections than I am. I am analyzing existing collections. He is building collections from seeds supplied by social media users. He notes that users often create "micro-collections" of web resources, typically surrounding an event. Using examples like ebola epidemics, the Ukraine crisis, and school shootings, Alexander asks if seeds generated by social media are comparable to those generated by professional curators. He also seeks quantitative methods for evaluating collections. Finally, he wants to evaluate the quality of collections at scale.
He demonstrated the results of using a prototype system that extracts seeds from social media and compared these seeds to those extracted from Google search engine result pages (SERPs). He discovered that, when using SERPs, the probability of finding a URI for a news story diminishes with time. He introduced methods like distribution of topics, distribution of sources, distribution in time, content diversity, collection exposure, target audience, and more. He covered some of his work on the Local Memory Project as well as work that will be presented at JCDL 2018 and Hypertext 2018. He intends to do further research on hubs and authorities in social media, as well as evaluating the quality of collections. Alexander will ensure that good quality seeds make it into web archives, addressing an aspect of curation that has long been an area of concern in web archives.


Mohamed Aturban



How can we verify the content of web archives? Mohamed presented his work on fixity for mementos. He described issues with temporal violations and playback. He asked whether different web archives agree on the content of mementos produced for the same live resource at the same time. He showed how "evil" archives could potentially manipulate memento content to produce a different page than existed at the time of capture. So, how do we ensure that a memento has been unaltered since the time of capture?

He demonstrated that the playback engine used by a web archive can inadvertently change the result of the displayed memento. Just providing a timestamped hash of the memento HTML is not enough. He proposes generating a cryptographic hash for the memento and all of its embedded resources and then generating a manifest of these hashes. This manifest is then itself stored as a memento in multiple web archives. I expect this work to be quite important to the archiving community, addressing a concern that many professional archivists have had for quite some time.
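To make the manifest idea concrete, here is a rough sketch of how such a manifest could be generated. This is my illustration, not Mohamed's actual implementation, and it assumes the embedded resource URIs have already been extracted from the memento:

import hashlib
import json
import requests

def sha256_of(uri):
    """Download a resource and return the SHA-256 hash of its bytes."""
    response = requests.get(uri)
    return hashlib.sha256(response.content).hexdigest()

def build_fixity_manifest(memento_uri, embedded_resource_uris):
    """Hash the memento and every embedded resource, producing a manifest
    that could itself be stored as a memento in multiple web archives."""
    manifest = {
        "memento": memento_uri,
        "hashes": {memento_uri: sha256_of(memento_uri)}
    }
    for uri in embedded_resource_uris:
        manifest["hashes"][uri] = sha256_of(uri)
    return json.dumps(manifest, indent=2)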


Other Work Presented



André Greiner-Petter



Research papers use equations all of the time. Unfortunately, there isn't a good method of comparing equations or providing semantic information about them. André Greiner-Petter is working on creating a method of enriching the equations used in research papers. This will have a variety of uses, such as detecting plagiarism or finding related literature.


Timothy Kanke



How are people using Wikidata? I had attended a session on Wikidata at WikiConference USA 2014, but have not really examined it since. Will it be useful for me? How do I participate? Who is involved? Timothy Kanke seeks to understand the answers to all of these questions. The Wikidata project has grown over the last few years, feeding information back into the Wikipedia community. Kanke will study the Wikidata community and provide a good overview for those who want to use its content. Using his work, we will all have an understanding of the ways in which Wikidata can work for the scholarly community.

Hany Alsalmi



How many languages do you use for searching? What is the precision of the results when you switch languages, even for the same query? Hany Alsalmi noticed that users who search in English were getting different results than when they searched for the same term in Arabic. Alsalmi will perform studies on users of the Saudi Digital Library to understand how they perform their searches and how successful those searches are. He will also record their reactions to search results, with the concern being that the user will quit in frustration if the results are insufficient. His work will have implications for search engines in the Arabic-speaking world.

Corinna Breitinger




Scholarly recommendation systems examine papers using text similarity. Can we do better? What about the figures, citations, and equations? Corinna Breitinger will take all of these text-independent semantic markers into consideration in developing a new recommender approach targeted at STEM fields. Once that is done, she will create a new visualization concept that will help users view and navigate a collection of similar literature. Such a system will help spot redundant research and also help us find related work in the field.

Susanne Putze



How is research data managed? How can we facilitate making data management a “first-class citizen”? To do so would improve the amount of data shared by researchers as well as its quality. Susanne Putze has extended experiment models to improve data documentation. She will create prototypes and evaluate how well they work to address data management in the scholarly process. From there she will begin the process of improving knowledge discovery using these prototypes. Her research has implications for how we handle our data and incorporate it into scholarly communications.

Stephen Abrams



How successful are digital preservation efforts? Stephen Abrams is working on creating metrics for this purpose. He is planning on evaluating digital preservation from the perspective of communications rather than through preservation management concepts like quantities, ages, or quality of preserved material. Thanks to his presentation I will now examine terms like “verisimilitude”, “semiotic”, and “truthlikeness”. When he is done, we should have better metrics to evaluate things like the trustworthiness of preserved material. His work is more general and theoretical than Mohamed’s, but there is a loose connection to be sure.

Tirthankar Ghosal




Why are papers rejected by editors? Have we done a good job identifying what makes our paper novel? What if we could spot such complex issues in our papers prior to submission? Tirthankar Ghosal seeks to address these concerns by using AI techniques to help researchers and editors more easily identify papers that will likely be rejected. He has already done some work examining reasons for desk rejections. He will identify methods for detecting what makes a paper novel, whether a paper fits a given journal, and whether it is of sufficient quality to be accepted, and lastly he will create benchmark data that can be used to evaluate papers in the future. His work has large implications for scholarly communication and may affect not only the way we write, but also how submissions are handled in the future.

What Next?


I would like to thank all participants for their input and insight throughout the event. Hearing their feedback for the other participants was quite informative to me as well. We will all have improved candidacy proposals as a result of their input and, more importantly, will use this input to improve our contributions to the world.
Updated on 2018/06/09 at 20:50 EDT with embed of Mohamed Aturban's Slideshare.
--Shawn M. Jones

2018-06-11: Web Archiving and Digital Libraries (WADL) Workshop Trip Report from JCDL2018

Mat Kelly reports on the Web Archiving and Digital Libraries (WADL) Workshop 2018 that occurred in Fort Worth, Texas.


On June 6, 2018, after attending JCDL 2018 (trip report), WS-DL members attended the Web Archiving and Digital Libraries 2018 Workshop (#wadl2018) in Fort Worth, Texas (see trip reports from WADL 2017, 2016, 2015, 2013). WS-DL contributed multiple presentations to the workshop, including the workshop keynote by my PhD advisor, which I discuss below.

The Project Panel

Martin Klein (@mart1nkle1n) initially welcomed the workshop attendees and had the group of 26-or-so participants give a quick overview of who they were and their interest in attending. He then introduced Zhiwu Xie (@zxie) of Virginia Tech to begin the series of presentations reporting on the kickoff of the IMLS-funded project (as established at WADL 2017) "Continuing Education to Advance Web Archiving". A distinguishing feature of this project compared to others, Zhiwu said, is that it will use project-based problem solving rather than producing only surveys and lectures. He highlighted a collection of curriculum modules that take existing practice (event archiving), feed it into various Web archiving tools (e.g., Social Feed Manager (SFM), ArchiveSpark, and Archives Unleashed Toolkit), and build understanding of the fundamentals (e.g., web, data science, big data) to produce experience in libraries, archives, and programming. The focus here was on individuals with some prior experience with archives rather than on training for those with zero experience in the area.

ODU WS-DL's Michael Nelson (@phonedude_mln) continued by explaining that one motivation is to encourage storytelling using Web archives and how that has been hampered by the recent closing of Storify. Some recent work of the group (including the in-development project MementoEmbed) would allow this concept to be revitalized despite Storify's demise through systematic "card" generation for mementos, allowing a more persistent (in the preservation sense) version of the story to be extracted and retained.

Justin Littman (@justin_littman) of George Washington University Libraries continued the project description by describing Social Feed Manager and emphasized that what you get from the Twitter API may well differ from what you get from the Web interface. The purpose of SFM is to be an easy-to-use, self-service Web interface to drive down the barriers in collecting social media data for academic research.

Ian Milligan (@ianmilligan1) continued by giving a quick run-down of his group's Archives Unleashed projects, noting a realization in the project's development that not all historians like working with the command line and Scala. He then briefly described the project's filter-analyze-aggregate-visualize approach to making large collections of Web archives more effective for research.

Wrapping up the project report, Ed Fox described Virginia Tech's initial attempts at performing crawls with Heritrix via Archive-It and how noisy the results were. He emphasized that a typical crawling approach consisting of starting with seed URIs harvested from tweets does not work well. The event model his group is developing and further evaluating will help guide the crawling procedure.

Ed's presentation completed the series of reports for the IMLS project panel and began a series of individual presentations.

Individual Presentations

John Berlin (@johnaberlin) started off with an abbreviated version of his Master's Thesis titled, "Swimming In A Sea Of JavaScript, Or: How I Learned To Stop Worrying And Love High-Fidelity Replay". While John had recently given his defense in April (see his post for more details), this presentation focused on some of the more problematic aspects of archival replay as caused by JavaScript. He highlighted specific instances where the efforts of a replay system to accurately replay JavaScript varied from causing a page to display a completely blank viewport (see CNN.com has been unarchivable since November 1st, 2016) to the representation being highjacked to declare Brian Williams as the originator of "Gin and Juice" long before Snoop Dogg(y Dogg). John has created a Chrome and Firefox extension he dubbed "Wayback Plus Plus" that mitigates JavaScript-based replay issues using client-side redirects. See his presentation for more details.

The workshop participants then had a break to grab a boxed lunch and followed with Ed Fox, again, presenting "A Study of Historical Short URLs in Event Collections of Tweets". In this work Ed highlighted the number of tweets in their collections that had URLs, namely that 10% had 2 URLs and less than 0.5% had 3 or more. From this collection, his group analyzed how many of the URLs linked are still accessible in Internet Archive's Wayback Machine with an emphasis that the Wayback Machine is not covering a lot of things that are in the Twitter data he has gathered. His group also analyzed the time difference between when a tweet with URLs was made and when it was archived and found that 50% were archived within 5 days after the tweet was posted.

Keynote

The workshop keynote, "Enabling Personal Use of Web Archives" was next and presented by my PhD Advisor Dr. Michele C. Weigle (@weiglemc). Her presentation initially gave a high-level overview of the needs of those that want to perform personal Web archiving and the tools that the WS-DL group have created over the years in facilitating the efforts to address those needs. She highlighted the early work of the group in identifying disasters in existing archives with a segue of the realization that many archive users lack in that there are more archives beyond Internet Archive.

In her (our) group's tooling to encourage Web users to Archive What They See Now, they created the WARCreate Chrome extension to create WARC files from any Web page. To resolve the issue of what a user is to do with their WARCs, they then created the Web Archiving Integration Layer (WAIL) (and later an Electron version) to allow individuals to control both the preservation and replay process. To give users a better picture of the archived Web as they browsed, they created the Chrome extension Mink to give users a measure of how well-archived (in terms of quantity) a URI is as they browsed the live Web and optionally (and easily) submit the URI currently viewed to 1-to-3 Web archives.

Dr. Weigle also highlighted the work of other WS-DL students past and present, like Yasmin AlNoamany's (@yasmina_anwar) Dark and Stormy Archives (DSA) and Shawn Jones' (@shawnmjones) upcoming MementoEmbed tool.

Following the tool review, Dr. Weigle asked, "What if browsers could natively interpret and replay WARCs?". She performed a high-level review of what could be possible if the compatibility barriers between the archived and live Web were resolved through live Web tools that could natively interact with the archived Web. In one example, she provided a screenshot where, in place of the "secure" badge a browser provides, the browser might also be aware that it is viewing an archived page and indicate as much.

Libby Hemphill (@libbyh) presented next with "Developing a Social Media Archive at ICPSR", where her group seeks to make data useful for people of the distant future who want to understand how we live today. She mentioned how messy it can be to consider the ethical challenges of archiving social media data and that people have different levels of comfort depending on what sort of research their social media content is to be used for. She outlined an architecture of their social media archive SOMAR for federating data to follow the terms of service, rehydrating tweets to follow the terms of research, and other aspects of the social-media-to-research-data process.

The workshop then took another break with a simultaneous poster session including a poster by Justin Littman titled, "Supporting social media research at scale" and WS-DL's Sawood Alam's (@ibnesayeed) "A Survey of Archival Replay Banners". Just prior to their poster presentations, each gave a lightning talk as a quick overview to entice attendees into stopping by.

After the break, WS-DL's Mohamed Aturban (@maturban1) presented "It is Hard to Compute Fixity on Archived Web Pages". Mohamed's work highlighted that subtle changes in content may be difficult to detect using conventional hashing methods to compute the fixity of Web pages. He emphasized that computing the fixity of the root HTML page of a memento is not enough and that the fixity must also be computed for all embedded resources. Using an approach based on Merkle trees, he generates a hash of the composite memento representative of the fixity of all embedded resources. In one example highlighted in his recent post and tech report, Mohamed showed the manipulation of Climate Change data.
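For intuition about the Merkle-tree step, the following simplified sketch (my illustration, not Mohamed's code) combines the hashes of the embedded resources pairwise until a single root hash represents the composite memento:

import hashlib

def merkle_root(leaf_hashes):
    """Combine hex-encoded resource hashes pairwise until one root hash
    remains; the root summarizes the fixity of the composite memento."""
    level = list(leaf_hashes)
    if not level:
        return None
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash when the level is odd
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]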

To wrap up the presentations for the workshop, I (Mat Kelly, @machawk1) presented "Client-Assisted Memento Aggregation Using the Prefer Header". This work highlighted one particular aspect of my presentation the previous day at JCDL 2018 (see blog post), namely how the framework in the base presentation facilitates specifying which archives are aggregated using Memento. A previous investigation by Jones, Van de Sompel et al. (see "Mementos in the Raw, Take Two") used the HTTP Prefer header to allow a client to request the un-rewritten version of mementos from an archival replay system. In my work, I imagined a more capable Memento aggregator that would expose the archives aggregated and allow a client, basing its customizations on the aggregator's response, to customize the set of archives aggregated by sending that set as base64-encoded data in the Prefer request header.
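The exact header syntax is defined in the paper; purely as an illustration of the mechanism, a client request to such an aggregator might look like the following sketch, where the endpoint and the preference token name are assumptions rather than the actual API:

import base64
import json
import requests

# Hypothetical aggregator endpoint; the real base64-encoded preference
# format is specified in the paper, not reproduced here.
timemap_uri = "https://aggregator.example.org/timemap/link/http://example.com/"
archives = ["https://web.archive.org/web/", "https://archive.example.net/"]
encoded = base64.b64encode(json.dumps(archives).encode()).decode()

response = requests.get(timemap_uri, headers={"Prefer": "archives=" + encoded})
print(response.status_code)
print(response.headers.get("Preference-Applied"))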

Closing

When I was through with the final presentation, Ed Fox began the wrap-up of the workshop. This discussion among all attendees opened the floor for comments and recommendations for the future of the workshop. With the discussion finished, the workshop came to a close. As usual, I found this workshop extremely informative, even though I was familiar with much of the participants' previous work. I am hoping, as also expressed by other attendees, to encourage other fields to become involved and present their ongoing work and ideas at this informal workshop. Doing so, from the perspective of both an attendee and a presenter, has proven valuable.

Mat (@machawk1)

2018-06-11: Knowledge Discovery From Digital Libraries (KDDL) Workshop Trip Report from JCDL2018


Fort Worth Museum of Science & History 9/11 Tribute

The theme of the workshop on Knowledge Discovery from Digital Libraries (KDDL) was to uncover hidden relationships within data using techniques from artificial intelligence, mathematics, statistics, and algorithms. The workshop organizers, who included ODU Computer Science alumna Dr. Hui Shi, along with Dr. Wu He and Dr. Guandong Xu, identified the following objectives that we were to explore:
  • Existing and novel techniques to extract and present knowledge from digital libraries;
  • Advanced ways to organize and maintain digital libraries to facilitate knowledge discovery;
  • Knowledge discovery applications in business; and
  • New challenges and technologies brought to the area of knowledge discovery and digital libraries.

The KDDL workshop consisted of three paper presentations which are summarized here.

Presentation 1: I presented my work on Mining the Web to Approximate University Rankings based on the tech report "University Twitter Engagement: Using Twitter Followers to Rank Universities" (https://arxiv.org/abs/1708.05790) and discussed in an earlier blog post.


This paper presented an alternative methodology for approximating the academic rankings of a university using social media; specifically, the university's Twitter followers. We identified a strategy for discovering official Twitter accounts along with a comparative analysis of metrics mined from the web which could be predictors of high academic rank (e.g., athletic expenditures, undergraduate enrollment, endowment value). As expected, schools with more financial resources tend to have more Twitter followers based on larger enrollments, big endowments, and big investments in their sports programs. We also discovered that smaller schools like Wake Forest University can enhance their reputation when they employ faculty with national name recognition (e.g., Melissa Harris-Perry (@MHarrisPerry)). For those wishing to perform further analysis, we have posted all of the ranking and supporting data used in this study, which includes a social-media-rich data set containing over 1 million Twitter profiles, ranking data, and other institutional demographics, in the oduwsdl GitHub repository.

Presentation 2: Basic Science and Technological Innovation: A Classification of Research Publications was presented by Dr. Robert M. Patton, Oak Ridge National Laboratory. This paper explored the context required for funding decision makers, sponsors, and the general public to determine the value of research publications. Core questions addressed the accessibility of massive digital libraries and methods related to identification of new discoveries, data sets, publications in disparate journals, and new software codes. Dr. Patton asserted that research evaluation has become increasingly complicated and citation analysis alone is insufficient if considered within the context of the people who control the flow of funding. His presentation of evaluation techniques included altmetrics along with a comparison of Bohr’s, Edison’s, and Pasteur’s quadrants as classifiers that use the wording of titles and abstracts in conjunction with domain-specific terminology.

A Classification of Research Publications


Presentation 3: Introducing Math QA -- A Math Aware Question Answering System was presented by Felix Hamborg, University of Konstanz. This paper presented a software tool that allows a user to enter a textual request for a math formula (e.g., What is the formula for …?) in English or Hindi and then be presented with the required parameters and the actual formula from Wikidata. The authors mined 40 million articles in Wikidata searching for <math> tags to identify 17 thousand general and geometric formulas. They defined a QA system workflow consisting of three distinct modules for calculation, question parsing, and formula retrieval. Their discovery of geometric formulas (e.g., polygons, curves) was slightly more complex, as these formulas can include a nested hierarchy of related data that required traversal of the associated Wikidata subsections. Following evaluation and comparison to a commercial engine, exported information was parsed and ported back into Wikidata. The authors' source code and data are available in their GitHub repository (http://github.com/ag-gipp/MathQa).

A Math Aware Question Answering System

Following the paper presentations, the workshop participants divided into two groups to conduct a breakout session where we discussed Challenges and Research Trends in Knowledge Discovery from Digital Libraries and Beyond.  Each group was asked to offer opinions and provide summary responses for each of the following topics:
  • What are your reactions to the paper presentations? What did you learn that you didn’t previously know?
  • What are the current techniques, applications, and/or research questions that you are addressing in Knowledge Discovery from Digital Libraries and Beyond? What are the biggest impediments or challenges limiting Knowledge Discovery from Digital Libraries and Beyond?
  • What are your top priorities in implementing Knowledge Discovery from Digital Libraries and Beyond? 
  • What resources and/or support do you need to implement? 
  • What areas will you recommend for research? How do you think artificial intelligence (AI) can benefit knowledge discovery in digital libraries? 
  • Suggestions for coordination of research and future collaboration.

Collectively, my group's responses centered on the themes of data curation with less reliance on subject matter experts, methods or tools to make data more self-documenting, and new strategies for relationship extraction between linked entities. There was also considerable discussion related to reproducible research using common repositories and formats conducive to sharing data (e.g., XML) and open access to both software and the peer review process.

I would like to thank Old Dominion University for the Graduate Student Travel Award which helped to facilitate my participation in the JCDL conference and this workshop.

--Corren (@correnmccoy)

2018-06-27: InfoVis Spring 2017 Class Projects

This may sound familiar, but yet again I'm way behind in posting about my previous offerings of CS 725/825 Information Visualization.
(Previous semester highlights posts: Spring 2016, Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)

Here are a few projects that I'd like to highlight from Spring 2017. (All class projects are listed in my InfoVis Gallery.  This semester has its own page because there were 19(!) projects.)  All of the projects were implemented using the D3.js library.

Because the Spring 2017 semester began with President Donald Trump's Travel Ban (EO 13769) and we have a large international graduate student population, students were understandably interested in US immigration and refugee data.  The first two projects here focus on that.  In addition, one project looked at sentiment about the US Presidential candidates on social media on Election Day.

The last two projects that I'll highlight are focused on the lighter topic of sports, NFL football and IPL cricket.

Visualization of US Refugee Admittance Data
Created by Susan Zehra


This project focuses on refugee admittance to the US between 2008 and 2016. The visualization highlights the number of refugees by country of nationality/religion, and the relationship between a country's number of refugees, number of war deaths, population, GDP per capita (GDPPC), and State Fragility Index (SFI).



Foreign Travel and Immigration to the US
Created by Hind Aldabagh and Bathsheba Nelson


This project (available at http://www.cs.odu.edu/~bnelson/cs725/project1/index.html) shows the total number of immigrants (2010-2015) from each region, country, and class of admission as well as the totals that settle in each state in the US.  The visualizations include an interactive world map alongside a tabbed panel with various idioms (bar chart, line chart, choropleth map, text lists) to provide quick access to multiple views of immigration information.

Flash video available at http://www.cs.odu.edu/~haldabag/cs725/worldmap-template2/video.html

Sentiment Analysis Based on Social Media
Created by Triveni Bhardwaj

This project (available at http://www.cs.odu.edu/~ttriveni/cs725/SentimentAnalysis/test.html) presents emotion and sentiment analysis of Tweets about the 2016 US Presidential candidates on Election Day. Visualization idioms include treemap, wordcloud, and US map.



Insights into American Football
Created by Mahesh Kukunooru and Maheedhar Gunnam

This project presents a visualization interface to explore NFL football data over the past 10 years. Different idioms such as multi-line charts, bar charts, radar charts, and donut charts are used to visualize the football dataset, with the aim of providing a platform that helps users explore the data and find interesting insights that might otherwise go unnoticed.


IPL - Indian Premier League
Created by Karan Balmaui and Varun Kumar Karne

This project visualizes statistics of the Indian Premier League (IPL) for all 9 seasons. It concentrates on displaying complete information, from an entire season down to each ball in every match. The user is provided with performance rankings, points tables for each season, and total scores of each match in a season. After comparing total runs across all matches played by a team in a season, the user can navigate to run rate, loss of wickets, types of runs, and batting/bowling stands in the selected match.


-Michele

2018-07-02: The Off-Topic Memento Toolkit

Inspired by AlNoamany's work in "Detecting off-topic pages within TimeMaps in Web archives", I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018 and are now available as a preprint.

Web archive collections are created with a specific purpose in mind. A curator supplies seeds for the collection, and the archive captures multiple versions of these seeds in order to study the evolution of a web page over time. This is valuable for following the changes in an organization or the events in a news story. Unfortunately, depending on the curator's intent, sometimes these seeds go off-topic. Because web archive crawling software has no way to know that a page is off-topic, these mementos are added to the collection. Below I list a few examples of off-topic pages within Archive-It collections.

This memento from the Human Rights collection at Archive-It created by the Columbia University Libraries is off-topic. The page ceased to be available at some point and produced this "404 Page Not Found" response with a 200 HTTP status.

This memento from the Egypt Revolution and Politics collection at Archive-It created by the American University in Cairo is off-topic. The web site began having database problems.

It is important to note that the OTMT does not delete potentially off-topic mementos, but rather only flags them for curator review. Detecting such mementos allows us to exclude them from consideration or flag them for deletion by some downstream tool, which is important to our collection summarization and storytelling efforts. The OTMT detects these mementos using a variety of different similarity measures. One could also use the OTMT to detect and study off-topic mementos.

Installing the software


The OTMT requires Python 3.6. Once you have met that requirement, install OTMT by typing:

# pip install otmt

This installs the necessary libraries and provides the system with a new detect-off-topic command.

A simple run


To perform an off-topic run with the software on Archive-It collection 1068, type:

# detect-off-topic -i archiveit=1068 -tm cosine,bytecount -o myoutputfile.json

This will find all URI-Rs (seeds) related to Archive-It collection 1068, download their TimeMaps (URI-Ts), download the mementos within each TimeMap, process those mementos via the specified similarity measures (here cosine and byte count), and write the results in JSON format out to a file named myoutputfile.json.

The JSON output looks like the following.
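A rough sketch of that structure, using the keys described below (the scores and verdicts here are illustrative placeholders, not actual OTMT results), is:

{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {
          "stemmed": true,
          "tokenized": true,
          "removed boilerplate": true,
          "comparison score": 0.42,
          "topic status": "on-topic"
        },
        "bytecount": {
          "comparison score": -0.05,
          "topic status": "on-topic"
        }
      },
      "overall topic status": "on-topic"
    }
  }
}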



Each URI-T serves as a key containing all URI-Ms within that timemap. In this example the timemap at URI-T http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/ contains several mementos. For brevity, we are only showing results for the memento at http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/.

The key "timemap measures" contains all measures run against the memento. In this case I used the two measures "cosine" and "bytecount". Each measure entry indicates which preprocessing has been performed against that memento (e.g., stemmed, tokenized, and removed boilerplate). Under "comparison score" is that measure's score. Under "topic status" is a verdict on whether or not the memento is on or off-topic. Finally, the "overall topic status" indicates if any of the measures determined that the memento is off-topic.

The OTMT uses an input-measure-output architecture. This way the tool separates the concerns of input (e.g., how to process Archive-It collection 1068 for mementos) from measure (e.g., how to process these mementos using cosine and byte count similarity measures) and output (e.g., how to produce the output in JSON format and write it to the file myoutputfile.json). This architecture is extensible, providing interfaces allowing for more input types, measures, and output types to be added in the future.

The -i (for specifying the input) and -o (for specifying the output) options are the only required options. The following sections detail the different command line options available to this tool.

Input and Output


The input type is supplied by the -i option. OTMT currently supports the following input types:
  • an Archive-It collection ID (keyword: archiveit)
  • one or more TimeMap URIs (URI-T) (keyword: timemap)
  • one or more WARCs (keyword: warc)
An output file is supplied by the -o option. Output types are specified by the -ot option. OTMT currently supports the following output types:
  • JSON as shown above (the default) (keyword: json)
  • a comma-separated file consisting of the same content found in the JSON file (keyword: csv)
To specify multiple WARCs, list them after the warc option like so:

# detect-off-topic -i warc=mycrawl1.warc.gz,mycrawl2.warc.gz -o myoutputfile.json

Likewise, for multiple TimeMaps, list them with the timemap argument and separate their URI-Ts with commas, like so:

# detect-off-topic -i timemap=https://archive.example.org/urit/http://example.org,https://archive.example.org/urit/http://example2.org -o myoutputfile.json

To use the comma-separated file format instead of json use the -ot option as follows:

# detect-off-topic -i archiveit=3936 -o myoutputfile.csv -ot csv

For better processing, we want to eliminate any interference from HTML and JavaScript associated with archive-specific branding. In the case of TimeMaps and Archive-It collections, raw mementos will be downloaded where available. While any TimeMap may be specified for processing, raw mementos are preferred as they do not contain the additional banner information and other augmentations supplied by many web archives. These augmentations may skew the off-topic results. Currently, only raw mementos from Archive-It are detected and processed. WARC files, of course, are "raw" by their nature, so removing web-archive augmentations like banners is not needed for WARC files.

Measures


OTMT supports the following measures with the -tm (for "timemap measure") option:

Each of these measures considers the first memento in a TimeMap to be on-topic and evaluates all other mementos in that TimeMap against that first memento.
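Conceptually (this is not OTMT's internal code), a similarity measure of this kind compares the text of each memento against the first memento in the TimeMap and flags anything that falls below the threshold, roughly as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_off_topic(memento_texts, threshold=0.10):
    """Compare each memento's text against the first (assumed on-topic)
    memento; similarity below the threshold marks it as off-topic."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(memento_texts)
    scores = cosine_similarity(tfidf[0:1], tfidf).flatten()
    return [{"index": i, "score": float(s), "off_topic": s < threshold}
            for i, s in enumerate(scores)]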

Measures and thresholds can be supplied on the command line, separated by commas. For example, to use Jaccard with a threshold of 0.15, separate the measure name and the threshold value, like so:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15

Multiple measures can also be used, separated by commas. For example, to use jaccard and cosine similarity, type the following:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15,cosine=0.10

The default thresholds for these measures have been derived from testing using a gold standard dataset of on-topic and off-topic mementos originally generated by AlNoamany. This dataset is now available at: https://github.com/oduwsdl/offtopic-goldstandard-data/. We used this dataset as a standard and selected thresholds that produced the best F1 score for each measure. I will present the details of how we arrived at these thresholds at iPres 2018. Our study is available as a preprint on arXiv.

Other options


Optionally, one may also change the working directory (-d) and the logging file (-l). By default, the software uses the directory /tmp/otmt-working for its work and logs to the screen with stdout.

The Future


I am still researching several features that will make it into future releases. I have separated the capabilities into library modules for use with future Python applications, but the code is currently volatile and I expect changes to come in the following months as new features are added and defects are fixed.

The software does not currently offer an algorithm utilizing the Web-based kernel function specified in AlNoamany's paper. This algorithm augments terms from the memento with terms from search engine result pages (SERPs), pioneered by Sahami and Heilman. Due to the sheer number of mementos to be evaluated by the OTMT and Google's policy on blocking requests to its SERPs, I will likely not implement this feature unless it is requested by the community.

I am also interested in the concept of "collection measures". I created the "timemap measures" key in the JSON output to differentiate one set of measure results from another eventual category of collection-wide measures that would test each memento against the topic of an entire collection. Preliminary work using the Jaccard Distance in this area was not fruitful, but I am considering other ideas.

The Off-Topic Memento Toolkit is available at https://github.com/oduwsdl/off-topic-memento-toolkit. Please give it a try and report any issues encountered and features desired. Although developed with an eye toward Archive-It collections, we hope to increase its suitability for all themed collections of archived web pages, such as personal collections created with webrecorder.io.



-- Shawn M. Jones

2018-07-03: Extracting Metadata from Archive-It Collections with Archive-It Utilities

At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv.

As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects and produced a standalone Python library named Archive-It Utilities (AIU). This library is currently in alpha status, but is already being used with upcoming projects.

The metadata available from an Archive-It collection

Archive-It curators can use the predefined metadata fields of Dublin Core. They can also supply their own custom metadata fields.

A screenshot of Archive-It collection 4515 with metadata annotated.
Above is Archive-It collection 4515, named 2013 BART Strike and collected by the San Francisco Public Library. This collection's curators generated quite a bit of metadata. In this screenshot, we can see the following metadata fields for the collection:
  • Subject
  • Creator
  • Publisher
  • Source
  • Format
  • Rights
  • Language
  • Collector
In addition to collection-wide metadata, we see that the first seed has the following metadata applied:
  • Creator
  • Publisher
  • Language
  • Format
  • Date
For research purposes, there is quite a lot of data here to be analyzed, especially when comparing collections as we did in "The Many Shapes of Archive-It". I discovered that most collections used the controlled vocabulary from Dublin Core, shown as blue in the bar chart below, more often than freeform vocabulary, shown in green.

Distribution of the top 20 collection-wide metadata fields in public Archive-It collections.

Each collection can have one or more topics. As shown in the screenshot below, the curator can choose from the controlled vocabulary offered by the collection topics field. They can also add their own freeform topics in the subject field. The public-facing interface combines entries from both of these input fields into the public-facing subject field.

Metadata can be added by curators by using the metadata page of one of their Archive-It collections. 
The bar chart below shows the distribution of the top 20 topics in public Archive-It collections. I discovered that most curators apply the controlled vocabulary topics to their collections.

Distribution of the top 20 collection-wide subjects (also called topics) of public Archive-It Collections.


This creates a confusing nomenclature. When viewing an Archive-It collection from the outside, everything is displayed as part of the subject field. Because of this, the rest of this post, and Archive-It Utilities, uses the subject field to refer to these topics.

As work for "The Many Shapes of Archive-It" progressed, we focused more on collecting seed lists and then mementos for further analysis. We tried to predict the topics using machine learning, but were unsuccessful and chose a different path for predicting the semantic categories of a collection. Most of the metadata gathered did not make it into the study's results, but will be used in future work. I have included these results here to show the kinds of questions one can answer with Archive-It Utilities.

Installation

Archive-It Utilities requires Python 3.6. Once that requirement has been met, you can install it using:

pip install aiu

It provides several experimental executables. We will only cover fetch_ait_metadata in this post.

Running fetch_ait_metadata

The fetch_ait_metadata command produces a JSON file containing all of the information available about a public Archive-It collection.

To run it on collection 4515 and store the results in file output.json, type the following command:

fetch_ait_metadata -c 4515 -o output.json

The -c option allows one to specify an Archive-It collection and the -o option allows one to indicate where to store the JSON output.

The JSON output looks like the following, truncated for brevity:



From this JSON we can see the name of the collection, which organization created it from the collected_by field, the subjects the curator applied to the collection as a list in the subject field, and when the collection was created in the archived_since field.

Within the optional dictionary field, we see values for freeform metadata added by the curator. In this case we have creator, publisher, source, format, rights, language, and collector.

Also included is the "seed metadata" section containing a list of seeds both scraped from the HTML of the Archive-It collection's web pages and also gathered from the CSV report provided for each Archive-It collection. Above I've listed the seed http://www.bart.gov/news/articles/2013/news20130617 to demonstrate the type of metadata that can be gathered. As noted in "The Many Shapes of Archive-It", seed metadata is optional, but in this example the curator added a title, creator, publisher, language, format, and date to this seed.

Using Archive-It Utilities In Python Code

This information can also be acquired programmatically using the ArchiveItCollection object. The script below demonstrates how one can acquire the collection name, collecting organization, and the list of seed URIs for Archive-It collection ID 4515.
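A minimal version of such a script, using the methods documented below (the constructor signature here is an assumption on my part), would be:

from aiu import ArchiveItCollection

# Assumption: the constructor accepts the numeric Archive-It collection ID.
collection = ArchiveItCollection(4515)

print("Collection name:", collection.get_collection_name())
print("Collected by:", collection.get_collectedby())

for seed_uri in collection.list_seed_uris():
    print(seed_uri)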



which produces the following output, truncated for brevity:



The following methods of the ArchiveItCollection class are useful for analyzing the metadata of a collection:
  • get_collection_name - returns the name of the collection
  • get_collection_uri - returns the URI of the collection
  • get_collectedby - returns the name of the collecting organization
  • get_collectedby_uri - returns the URI of the collecting organization
  • get_description - returns the content of the "description" field
  • get_subject - returns a Python list containing the subjects applied to the collection
  • get_archived_since - returns the content of the "archived since" field
  • is_private - returns True if the collection is not public, False otherwise
  • does_exist - not all collection identifiers are valid, this method returns True if the collection identifier actually represents a real collection, False otherwise
  • list_seed_uris - returns a Python list of seed URIs
  • get_seed_metadata(uri) - returns a Python dictionary containing metadata for a specific seed at uri
  • return_collection_metadata_dict - returns a Python dictionary containing all collection-wide metadata
  • return_seed_metadata_dict - returns a Python dictionary containing all seeds and their metadata
  • return_all_metadata_dict - returns a Python dictionary containing all collection-wide and seed metadata
  • save_all_metadata_to_file(filename) - writes all collection-wide and seed metadata out as JSON to a file named filename


The code does perform some measure of lazy loading to be nice to Archive-It. If you only need the general collection-wide metadata, it only acquires the first page of the collection. If you need all seed URIs, it must download all Archive-It pages belonging to the collection.

Summary

Archive-It collections have metadata that can be used to answer many research questions. After working on "The Many Shapes of Archive-It", to be presented at iPres 2018, I used the lessons learned to create Archive-It Utilities, a Python library that can be used to acquire this metadata. Please try it out and log any issues at the GitHub repository https://github.com/oduwsdl/archiveit_utilities.

--Shawn M. Jones

2018-07-11: InfoVis Fall 2017 Class Projects

(Previous semester highlights posts: Spring 2017, Spring 2016, Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)

Here are a few projects that I'd like to highlight from Fall 2017. (All class projects are listed in my InfoVis Gallery.)  All of the projects were implemented using the D3.js library.

World Leader Interactions on Social Media (Twitter) 
Created by Grant Atkins
This project (available at http://www.cs.odu.edu/~gatkins/world-leader-vis/app/) provides an interactive dashboard to visualize ways Twitter list data can be used and represented. This visualization uses the World Leaders list on Twitter, with the addition of a few world leaders not on the list, to derive information and visualize shared information among these users. The goal of this visualization is to show shared term usage among world leaders, the times at which tweets are most likely to be sent, the sentiment of the users, and the decay of data over a static, decreasing time interval.


Investigation Into Cryptocurrency Pricing Patterns With Respect to Financial Instability
Created by Jason Orender


This project (available at http://www.cs.odu.edu/~jorender/cs725/CS725-PROJECT_Y9FziL/) provides a focused presentation of the world events coincident with spikes in peer-to-peer cryptocurrency transactions, together with a continuous evolutionary timeline to provide perspective on the state of development and usage at the national, regional, and worldwide levels. Bitcoin was the sole cryptocurrency used in this analysis due to the large amount of country-specific peer-to-peer data available.



Holiday Flight Patterns
Created by Asmita Gosavi

This project (available at http://www.cs.odu.edu/~agosav/cs725/HolidaysFlightPattern/index.html) is an interactive tool for visualizing holiday flight patterns using a dot chart to visualize the last year's data, a US map with bubbles to display the average arrival delay at particular airports, and a line chart which shows the monthly distribution from 2006-2015. The datasets selected for this visualization are the percentages of on-time arrivals, delays, and cancellations for different airlines operating in the US, over different years. The intention was to find flight delay patterns between holiday and non-holiday months.



Sports Injuries
Created by Plinio Vargas and Miranda Smith

This project (available at http://www.cs.odu.edu/~pvargas/cs725/cs725-project/) deals with an organization interested in reducing the number of personnel injuries due to the dangerous nature of their job, and it looks into more effective visualization techniques for measuring the organization's physical performance training program. The goal of the visualization project is to provide answers specific to the organization by identifying where most injuries occur, which activities are injury-prone, the correlation of injuries with the training program, and the evaluation and trends of its members' physical training performance.


Federal Workforce
Created by John Ashley

This project (available at http://www.cs.odu.edu/~jashley/cs725/project/) highlights some of the characteristics that make up the current Federal civil service workforce. The visualization also provides a visual snapshot of how widely dispersed the workforce is.

Video available at http://www.cs.odu.edu/~jashley/cs725/demo/

-Michele

2018-07-15: How well are the National Guideline Clearinghouse and the National Quality Measures Clearinghouse Archived?



On July 13, I saw this on Twitter:

There are two US government websites in danger, the National Guideline Clearinghouse (https://www.guideline.gov) and the National Quality Measures Clearinghouse (https://qualitymeasures.ahrq.gov). Both store medical guidelines. Both will "not be available after July 16, 2018". According to the linked Daily Beast article above:
Medical guidelines are best thought of as cheatsheets for the medical field, compiling the latest research in an easy-to-use format. When doctors want to know when they should start insulin treatments, or how best to manage an HIV patient in unstable housing — even something as mundane as when to start an older patient on a vitamin D supplement — they look for the relevant guidelines. The documents are published by a myriad of professional and other organizations, and NGC has long been considered among the most comprehensive and reliable repositories in the world.

The Sunlight Foundation Web Integrity Project wrote a report about the archivability of this service. They note that "interactive features do not function, making archived content much more difficult to access and, in many cases completely unavailable." Seeing as web archives typically crawl websites from the client side and have no access to the server components, I expect that the search functionality of these web sites will not work once archived.

The robots.txt for www.guideline.gov disallows everyone:
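A robots.txt file that disallows every crawler takes the standard form:

User-agent: *
Disallow: /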


The robots.txt for qualitymeasures.ahrq.gov disallows everyone:



In December of 2016, the Internet Archive stopped honoring robots.txt for .gov and .mil websites, hoping to "keep this valuable information available to users in the future". Seeing as these two sites will be shut down on July 16, 2018, how well are they archived?

Experiment Setup



The method I used to evaluate how much of each site was archived consisted of the following general steps:
  • Acquire a sample of original resource URIs from www.guideline.gov and qualitymeasures.ahrq.gov
  • Use a Memento Aggregator to determine if each original resource has at least one memento
As we can see in the robots.txt above, there is no machine readable site map for either web site. This means that I would need to crawl each site to find all of the URIs. Remembering lessons from when I evaluated URI patterns for Signposting the Scholarly Web, and when I manually crawled a number of scholarly web sites looking for URI patterns to help other crawling efforts, I started off thinking like someone who was planning to archive each site. I knew that I did not have time to manually crawl the entirety of each site so I tried to evaluate which documents appeared to be the main products of these sites. I have classified the documents I evaluated into five categories: main products - summaries, expert commentaries, guideline syntheses, summaries in other formats, and other pages. I will describe these categories in more detail in the following sections.

I created a GitHub repository to save my work. Due to the time crunch, I did not organize it nicely and it will be updated in the coming days with more content used in this article, so check back to it often if interested.

Update on 2018/07/16 at 20:37 GMT:The GitHub repository is now as stable as it is going to be. As it was written over the course of 3 days, the code is very, very rough. I have no intentions of improving it, but the data and code is provided for anyone who is interested. Feel free to contact me on Twitter with any questions.

After acquiring sample original resource URIs to test I installed MemGator, a Memento Aggregator developed by the WS-DL Research Group. I wrote a Python script which requested an aggregated TimeMap from MemGator for each original resource URI and recorded the number of mementos per URI.
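The check itself is simple; a sketch of it, assuming a local MemGator instance with its default TimeMap endpoint (the port and path here are assumptions), looks like this:

import requests

MEMGATOR = "http://localhost:1208/timemap/link/"  # assumed local MemGator endpoint

def memento_count(original_uri):
    """Request an aggregated link-format TimeMap and count its memento entries."""
    response = requests.get(MEMGATOR + original_uri, timeout=120)
    if response.status_code != 200:  # no TimeMap generally means no mementos
        return 0
    # Count link-format entries whose rel value ends in "memento"
    # (covers rel="memento", "first memento", "last memento", etc.).
    return sum(1 for line in response.text.splitlines() if 'memento"' in line)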

So, what categories of documents did I retrieve before feeding them into MemGator?

Main Products - Summaries



I reviewed the menus across the top of each site's home page. I discovered that the main product of www.guideline.gov appeared to be the guideline summaries and the main product of qualitymeasures.ahrq.gov appeared to be measure summaries. I focused on these documents because, if captured as mementos, an enterprising archivist could build their own search engine around them.

As shown in the screenshot below, these summaries were accessible via paginated search result pages. Fortunately, there is an "All Summaries" option which will list all summaries as a series of search results.


The qualitymeasures.ahrq.gov site also has its own "All Summaries" page, shown in the screenshot below, so these URIs can be scraped using a script aware of the paging as well.



As Corren wrote last year, pagination can result in missed captures. Knowing this, I wondered if the pagination would have an impact on whether the guideline summaries were archived.

I wrote some simple (and very rough) code in Python using the requests library and BeautifulSoup to scrape all URIs from each search result page. The same script was used to scrape both sites. For both sites I selected the guideline summary URIs, identified because they contained the string "/summaries/summary/", and removed duplicates. This gave me 1415 original URIs for www.guideline.gov and 2533 original URIs for qualitymeasures.ahrq.gov.
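A sketch of that scraping step (simplified; the real script also handled the pagination of the "All Summaries" results, which is abstracted away here) could look like:

import requests
from bs4 import BeautifulSoup

def summary_uris_from(results_page_uri):
    """Scrape one 'All Summaries' results page and return the guideline
    summary links it contains."""
    soup = BeautifulSoup(requests.get(results_page_uri).text, "html.parser")
    return {anchor["href"]
            for anchor in soup.find_all("a", href=True)
            if "/summaries/summary/" in anchor["href"]}

# Placeholder: the paginated 'All Summaries' result page URIs (not shown).
result_page_uris = []

# Taking the union of the per-page sets removes duplicate summary URIs.
unique_summaries = set()
for page_uri in result_page_uris:
    unique_summaries |= summary_uris_from(page_uri)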

Expert Commentaries



Both sites also contained expert commentaries about these summaries. I decided that this also looked important, even though these commentaries did not appear to be indexed by the search engine.

A screenshot of the Expert Commentaries page on www.guideline.gov

A screenshot of the Expert Commentaries page on qualitymeasures.ahrq.gov
I wrote a script to scrape the expert commentary URIs from these pages. With this I ended up with 45 URIs for www.guideline.gov and 52 URIs for qualitymeasures.ahrq.gov.

Guideline Syntheses



The www.guideline.gov site has a series of documents labeled guideline synthesis documenting "areas of agreement and difference, the major recommendations, the corresponding strength of evidence and recommendation rating schemes, and a comparison of guideline methodologies". These documents also seemed to be important, so I chose to include them as well.

The Guideline Synthesis page at www.guideline.gov is another set of documents provided by the web site.

I wrote a script to scrape this page for all guideline synthesis URIs. This led me to 18 URIs for www.guideline.gov. The qualitymeasures.ahrq.gov site did not contain this type of document.


Summaries in other formats



In addition to the HTML formatted guideline summaries, there were guideline summaries available in PDF, XML, and DOC format on www.guideline.gov. I wrote another script to iterate through all of the summary pages captured in the previous section and save off the PDF, XML, and DOC URIs. The qualitymeasures.ahrq.gov website only has HTML formatted measure summaries, so this document category does not apply to that site.

This screenshot demonstrates the multiple formats available for a guideline summary on www.guideline.gov.

My script to scrape these pages gave me 4185 URIs for www.guideline.gov.

Other Pages



Finally, I was curious about what may have been missed elsewhere. I decided to try to gather URIs as a crawler given the seed of the top-level page would. With this exercise, I was hoping to gather a number of top-level pages to see how their archive status differed from the guideline summaries, expert commentaries, guideline syntheses, and the measure summaries.

I wrote two simple spiders (crawlers) using the Python crawling framework scrapy. I pointed each spider at the homepage of each website, instructed it not to crawl outside of the domain of each site, and told it to print out any URI listed on a page it discovered while crawling. Unfortunately, I ran it on a machine with insufficient memory. The operating system killed scrapy in both cases because it was consuming too much memory. This means that the crawl for www.guideline.gov ran for 4 hours while the crawl for qualitymeasures.ahrq.gov ran for 7 hours. This inconsistency in crawl times was disappointing, but I kept the URIs from these crawls because they provide an interesting contrast in the results section.
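A minimal spider along those lines (a sketch under the stated constraints, not the exact spiders used) might be:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GuidelineSpider(CrawlSpider):
    """Crawl www.guideline.gov, stay inside the domain, and emit every URI
    linked from each page encountered."""
    name = "guideline"
    allowed_domains = ["www.guideline.gov"]
    start_urls = ["https://www.guideline.gov/"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Yielding items lets scrapy's feed export write out the URIs.
        for href in response.css("a::attr(href)").extract():
            yield {"uri": response.urljoin(href)}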

Once I had a list of URIs linked from pages encountered during the crawl, I then removed all URIs that were not in the domains of www.guideline.gov or qualitymeasures.ahrq.gov, respectively.

Hundreds of thousands of URIs returned were related to search facets. The crawl of www.guideline.gov returned 894,881 such URIs while the crawl of qualitymeasures.ahrq.gov returned 1,474,516. Because these search facet URIs were related to the summaries from the prior sections, I removed these search URIs in the interest of time and only focused on the other pages crawled because these other pages contained actual content. I removed any URIs containing fragments (i.e. hashes like #introduction). I also filtered the URIs for summaries, guideline syntheses, and expert commentaries so that there would be no overlap in results.

I then fed the URIs through MemGator to see if the pages were captured.

Results and Discussion



The table below shows the results of testing if a page was archived for www.guideline.gov. Of those URIs recorded for this experiment, 98.8% of them were indeed archived, which is good news.

www.guideline.gov

Page Category                        | Archived      | Not Archived | Total
Guideline Summaries (HTML)           | 1401          | 14           | 1415
Expert Commentaries                  | 45            | 0            | 45
Guideline Syntheses                  | 18            | 0            | 18
Guideline Summaries (Other Formats)  | 4185          | 57           | 4242
Other Pages                          | 150           | 2            | 152
Total                                | 5799 (98.8%)  | 73 (1.2%)    | 5872


Most importantly, of the 1415 guideline summaries from www.guideline.gov, 1401/1415 (99.0%) are archived. Only 14/1415 (1.0%) are not archived. Also, all 45 expert commentaries and all 18 guideline syntheses are archived. This means that almost all of the important site content is preserved and an enterprising archivist can build a search engine around them in the future.

The table below shows the results of testing if a page was archived for qualitymeasures.ahrq.gov. Of the URIs recorded for this experiment, 97.5% of them were archived.


qualitymeasures.ahrq.gov
Page Category            Archived        Not Archived    Total
Measure Summaries        2509            24              2533
Expert Commentaries      52              0               52
Other Pages              90              44              134
Total                    2651 (97.5%)    68 (2.5%)       2719


Of the 2533 measure summaries from qualitymeasures.ahrq.gov, 2509/2533 (99%) are archived. Only 24/2533 (0.9%) were not archived. Also, all 52 expert commentaries are archived. Again, this means that the majority of the important documents exist in a web archive and can be indexed by a potential search engine in the future. The picture is not so good for the other pages category, where only 90/134 (67.2%) of the pages exist in a web archive.

The high overall numbers are remarkable and likely a result of the Internet Archive's decision, at the end of 2016, to stop honoring robots.txt restrictions on US government web sites. The next sections answer additional questions.

What is the distribution of mementos per category per site?



Below, several histograms show the distribution of memento counts across the different categories of pages for www.guideline.gov. Note that this only applies to those pages with mementos.

Histogram of the number of mementos per URI for guideline summaries for www.guideline.gov.
Minimum: 1, Maximum: 24, Mode: 8.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for expert commentaries for www.guideline.gov.
Minimum: 9, Maximum: 14, Mode: 11.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for guideline syntheses for www.guideline.gov.
Minimum: 9,  Maximum: 18, Mode: 14.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for non-HTML guideline summaries for www.guideline.gov.
Minimum: 1, Maximum: 13, Mode: 9.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for other pages for www.guideline.gov.
Minimum: 1, Maximum: 2072, Mode: 1.
Note: only pages with mementos were evaluated.
We see that the more specific content in the guideline summaries, expert commentary, guideline syntheses, and the guideline summaries in multiple formats tend to have a mode of 8, 9, 11, or 14 mementos. This means that many of the more important pages have multiple mementos. The other content, consisting mostly of top level pages, has a mode of 1, meaning that a lot of these top level pages were only archived once. There is at least one page in the other category, though, that was archived 2072 times.
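For reference, counts like those summarized in these histograms can be derived by counting the memento entries in each TimeMap; the sketch below assumes the TimeMaps were saved to disk as link-format files, which is an assumption about intermediate data rather than the code used for this analysis:

import glob
import re
from collections import Counter

counts = []
for timemap_file in glob.glob("timemaps/guideline_summaries/*.link"):   # hypothetical layout
    with open(timemap_file) as f:
        body = f.read()
    # each memento entry in a link-format TimeMap carries a rel value containing "memento"
    n = len(re.findall(r'rel="[^"]*memento[^"]*"', body))
    if n > 0:                       # only pages with mementos are included in the histograms
        counts.append(n)

histogram = Counter(counts)
mode_value = histogram.most_common(1)[0][0]
print("Minimum:", min(counts), "Maximum:", max(counts), "Mode:", mode_value)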

Below, several histograms show the distribution of memento counts across the different categories of pages for qualitymeasures.ahrq.gov.


Histogram of the number of mementos per URI for measure summaries for qualitymeasures.ahrq.gov.
Minimum: 1, Maximum: 15, Mode: 4.
Histogram of the number of mementos per URI for expert commentaries for qualitymeasures.ahrq.gov.
Minimum: 6, Maximum: 7, Mode: 7.
Histogram of the number of mementos per URI for other pages for qualitymeasures.ahrq.gov.
Minimum: 2, Maximum 131, Mode: 2.

The numbers are much lower for qualitymeasures.ahrq.gov, but they exhibit the same pattern.

How does the crawling pattern for mementos change over time per category per site?



So, how does the crawling of www.guideline.gov change over time? The bar charts below show the number of mementos added to archives per month based on their memento-datetime.
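The month-by-month counts charted below can be produced by bucketing each memento-datetime by month; this sketch assumes the memento-datetimes were dumped to a text file, one RFC 1123 datetime per line, which again is an assumption about intermediate data rather than the code used for this post:

from collections import Counter
from datetime import datetime

per_month = Counter()
with open("memento_datetimes.txt") as f:             # hypothetical dump of memento-datetimes
    for line in f:
        line = line.strip()
        if not line:
            continue
        # memento-datetimes look like "Thu, 02 Jun 2005 19:45:24 GMT"
        dt = datetime.strptime(line, "%a, %d %b %Y %H:%M:%S GMT")
        per_month[dt.strftime("%Y-%m")] += 1

for month in sorted(per_month):
    print(month, per_month[month])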

Memento count per month for guideline summaries for www.guideline.gov. We see a big push in more recent months.
Memento count per month for expert commentaries for www.guideline.gov. There is much the same pattern as for the prior category.

The number of mementos crawled per month for the guideline syntheses  documents of www.guideline.gov. There has been a lot of activity the past few months.
Memento count per month for the non-HTML versions of guideline summaries for www.guideline.gov. Again, we see a big push in more recent months.

Memento count per month for other pages at www.guideline.gov. Here we see years of crawling with big spikes after the US election. This may be related to the Internet Archive's new robots.txt policy. 
It is interesting to note that mementos exist for some of these pages prior to December of 2016, meaning that people were archiving them with functionality such as "Save Page Now". Archiving really picked up in all cases in September of 2016, then again in October of 2017, and then again starting in April of 2018. These spikes appear to be a coordinated effort to archive parts of the site.

In the last graph, we see years of crawling the top level pages. This is interesting considering the contents of the robots.txt file. Did it change over time? Was it more permissive at some point? Fortunately, we have web archives we can use to check.

Here is a screenshot of the Internet Archive's capture calendar for www.guideline.gov/robots.txt from 2005. Orange indicates that the robots.txt file did not exist. Blue indicates that it did. 
Based on the above screenshot, it appears that a robots.txt did not exist for the site www.guideline.gov until 2005. It was first observed at this site on August 23, 2005 at 22:54:19 GMT. Its contents were as follows:


According to the robots.txt specification website, this indicates "To allow all robots complete access". This means that, at one time, the site was far more permissive about crawling than it is now. I randomly chose a memento of the robots.txt for each year after 2005 and found that it did not change. In August of 2008, the robots.txt disappeared again. In 2009, the successful robots.txt captures were actually of a soft-404 page indicating that it did not exist. By September 11, 2010 at 18:14:50 GMT, the robots.txt had become more complex, as shown below:



As we see, it still was not disallowing all content the way the current version mentioned at the beginning of the article does. This configuration persisted until August 26, 2016, when the robots.txt was still present but completely blank. The robots.txt was changed to its current state on April 27, 2017 before 20:09:28 GMT. The US Senate approved the nomination of Tom Price to the office of Secretary of Health and Human Services on February 10, 2017. This means that the site's robots.txt allowed crawling until after Tom Price took office, which is probably why so many top-level pages had been captured by web archives since the site's creation.

What about the qualitymeasures.ahrq.gov site? The bar charts below show the number of mementos per month for each of its categories.

Memento count per month for measure summaries at qualitymeasures.ahrq.gov. There is some activity in 2016, but a lot of very recent crawling of the content.
Memento count per month for expert commentaries at qualitymeasures.ahrq.gov. Like above, there is some activity in 2016, but a big push in June of 2017, and a lot of very recent crawling of the content.

Memento count per month for other pages at qualitymeasures.ahrq.gov. We see the same large push in recent history, with a lot of crawling.

The crawling of qualitymeasures.ahrq.gov follows much the same pattern, though not with the exact same spikes prior to this last month. From these graphs we see that there has been a concerted effort to archive both of these sites since June of 2018. This site acquired its first robots.txt on August 24, 2005, before 00:04:23 GMT, and that robots.txt was completely permissive, as with www.guideline.gov.

The emergence of a robots.txt for qualitymeasures.ahrq.gov on August 24, 2005, as shown on the Internet Archive's calendar page for the URI qualitymeasures.ahrq.gov/robots.txt.

The robots.txt went through much the same history for this site as for www.guideline.gov, implying a similar policy or even the same webmaster for both sites. It finally changed to its current disallow state on April 27, 2017 before 22:10:11 GMT. Again, this is after Tom Price took office. This again explains why so many of the top level pages of the site were archived throughout the history of qualitymeasures.ahrq.gov.

In which archives are these pages preserved?



I chose to use an aggregator because I wanted to search multiple web archives for these pages. How well are these mementos spread across the archives? The charts below show the number of mementos per archive for each category of pages at www.guideline.gov. Only archives containing mementos for a given category are displayed in each chart.

This chart of the guideline summaries for www.guideline.gov shows 6,848 mementos are present in the Internet Archive, with 4,887 mementos preserved by Archive-It and 10 mementos preserved by Archive.today (archive.is).
This chart of the non-HTML versions of the guideline summaries for www.guideline.gov shows that 14,846 mementos are preserved in the Internet Archive, while even more, 19,044, are preserved in Archive-It.

This chart of the expert commentaries for www.guideline.gov shows that 324 mementos are held by the Internet Archive while 177 are held by Archive-It.

This chart of the guideline syntheses for www.guideline.gov shows that 189 mementos are held by the Internet Archive while 62 are held by Archive-It.


The chart of the other pages for www.guideline.gov shows that the top-level pages are preserved at more archives than the previous categories. There are 3,397 mementos at the Internet Archive, 819 mementos at Archive-It, 128 mementos at the Library of Congress, 19 at Archive.today, 11 at the Icelandic Web Archive, 7 at the Portuguese Archive, and 1 at Perma.cc.

While the Internet Archive and Archive-It have most of the mementos, some mementos of the top-level pages of the site are held in other archives. As this is a US government web site, I was surprised that the Library of Congress was not featured more. Archive-It also has more non-HTML guideline summaries than the Internet Archive, indicating a particular effort by some organization to preserve these documents in other formats. Unfortunately, the Archive-It mementos I discovered with MemGator belonged to the collection /all/, meaning that I have no indication of which Archive-It collection or organization was preserving the pages.

Update on 2018/07/16 at 18:10 GMT: To find the specific Archive-It collection and collecting organization, Michele Weigle has suggested that one might be able to search the Archive-It collections for these URIs using Archive-It's explore all archives search interface. One would need to use the "Search Page Text" tab. I did try the string www.guideline.gov and discovered 5,906 search results, so this hostname is in the content of some of these pages. I tried using a URI reported to have an Archive-It memento, but did not receive any search results. If you are successful, please say something in the comments.
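For reference, the per-archive tallies shown in these charts can be derived from the aggregated TimeMaps by grouping each URI-M by its hostname; the sketch below assumes saved link-format TimeMaps and is not the code used for this post:

import glob
from collections import Counter
from urllib.parse import urlparse

per_archive = Counter()
for timemap_file in glob.glob("timemaps/**/*.link", recursive=True):   # hypothetical layout
    with open(timemap_file) as f:
        for entry in f.read().split(",\n"):           # one link entry per line in these TimeMaps
            if 'rel="' not in entry:
                continue
            if "memento" not in entry.split('rel="', 1)[1]:
                continue                              # skip original, timegate, and self links
            urim = entry.split(">", 1)[0].lstrip().lstrip("<")
            per_archive[urlparse(urim).netloc] += 1   # group mementos by archive hostname

for archive, count in per_archive.most_common():
    print(archive, count)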


The bar charts below show the distribution of mementos across web archives for the qualitymeasures.ahrq.gov web site.

This chart of the measure summaries for qualitymeasures.ahrq.gov shows 9,494 mementos at the Internet Archive and only 216 at Archive-It.

This chart of the expert commentaries for qualitymeasures.ahrq.gov shows that all 360 of their mementos are held by the Internet Archive.

This chart for the other pages at qualitymeasures.ahrq.gov shows 1,147 mementos at the Internet Archive, 145 mementos at Archive-It, 40 mementos at the Library of Congress, 12 at the Portuguese Web Archive, 1 at Archive.today, and 1 at Perma.cc.

The results for qualitymeasures.ahrq.gov show that most of the mementos for that site are archived at the Internet Archive, with a few in other archives. This is in contrast to the results for www.guideline.gov, where the numbers between the Internet Archive and Archive-It were close in many cases.


Attempts at Archiving the Missing Pages



On July 14, 2018, I attempted to use our own ArchiveNow to preserve the ~1% of summary URIs from each site that had not been archived. Unfortunately, the live resources started responding very slowly. The sample of summary URIs that had not been archived produced 500 status codes, as can be seen in the output from the curl commands below, each of which took close to a minute to execute:





I ran curl on all live URIs listed as not captured, and they returned an HTTP 500 status as of July 14, 2018 at approximately 16:50:00 GMT. Because I had scraped these URIs from the "All Summaries" page, it is possible that they returned 500 statuses at crawl time, which would explain why web archives do not currently have them: even on the live web, they were not available. The live versions of the other summary pages, those with mementos, returned a 200 status (after about a minute of delay).
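The check itself was simple; a Python equivalent looks roughly like the following, where the URI is a placeholder rather than one of the actual missing summaries:

import requests

uri = "https://www.guideline.gov/summaries/summary/00000"   # hypothetical missing summary URI

response = requests.get(uri, timeout=120, allow_redirects=True)
print(uri, response.status_code)    # the missing summaries returned 500 at this point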

It is also possible that the service at these web sites is degrading in their last hours. As of approximately 07:00 GMT on July 15, 2018, the qualitymeasures.ahrq.gov site was no longer available, displaying error messages for pages, as shown in the screenshot below.

As of 07:00 GMT on July 15, 2018, the qualitymeasures.ahrq.gov website started displaying error messages instead of content.
Update on 2018/07/16 at 19:00 GMT: The website qualitymeasures.ahrq.gov is available again, but the measure summaries that were missing from the archives still return HTTP 500 status codes. The missing guideline summaries for www.guideline.gov also still return HTTP 500 status codes.
This was quite disheartening, because my plan was to archive the pages I had detected as missing after I did my initial study. I thought I had until July 16 to save the web pages!

Conclusion



Almost all web archiving is done externally, with no knowledge of the software running on the server side. This reduces mementos to a series of observations of pages rather than a complete reproduction of all of the functionality that existed at a web site. The two US government websites that will be shut down on July 16, 2018, www.guideline.gov and qualitymeasures.ahrq.gov, have server-side functionality, but their most valuable assets are series of summary documents that can be captured without reproducing that server-side functionality. In this article, I've tried to determine how much of these web sites has been captured prior to their termination.

When focusing on the main products of each site, the guideline summaries and the measure summaries, we see that these products are actually quite well archived: 99% of guideline summaries for www.guideline.gov and 99% of measure summaries for qualitymeasures.ahrq.gov. We also observed that 100% of the expert commentaries were archived in both cases. Other aspects of the sites, such as all facets of the search engine, were not tested. I did, however, attempt to crawl the sites to gain a list of pages outside of these categories and discovered that, at least among the pages captured during a limited crawl, other pages at www.guideline.gov are archived at a rate of 99%, higher than those for qualitymeasures.ahrq.gov, which stand at only 67.2%.

Many of these main products have more than one memento, and as many as 24 in some cases. There are more mementos for www.guideline.gov than for qualitymeasures.ahrq.gov, but the mode for the number of mementos of the main products ranges between 4 and 14. This means that the main products have good coverage. The top-level content at these sites, however, has a mode of 1 or 2 mementos, indicating poor coverage of the changes over time for some top-level pages.

Over the life of these sites, most of the mementos stored in web archives are for the top-level pages, because crawling was permitted by their robots.txt until April 27, 2017, a few months after Tom Price became the Secretary of Health and Human Services. Fortunately, there has been a large push to archive the main products of the site since September of 2016, resulting in many mementos created within the last month.

Most of the mementos for these sites are stored in the Internet Archive. Archive-It has more mementos of the non-HTML versions of guideline summaries for www.guideline.gov, but its memento count is eclipsed by the Internet Archive in all other cases. After the Internet Archive and Archive-It, there is a long tail of archives for top-level pages, but the number of mementos for each of these archives is less than 100. With the exception of 10 guideline summaries for www.guideline.gov stored in Archive.today, none of the main products of these sites are stored outside of the Internet Archive or Archive-It.

My attempts to archive the pages after running this experiment failed, in large part due to the degradation in service at these web sites. Even though I tried preserving the pages prior to the cutoff date of July 16, 2018, they were no longer reliably available.

Because one needs to know the original resource URI in order to find mementos in a web archive, I have published the URIs I discovered to Figshare. I do this in hopes that someone might build a resource for providing easy access to the content of these sites, especially for medical personnel. If you want to access them, use these links.
Feel free to contact me if you run into problems with these files.

This case demonstrates the importance of organizations like the Sunlight Foundation for identifying at-risk resources. Also important are the web archives that allow us to preserve these resources. This case also demonstrates how we can come together and ensure that these resources are preserved. We do need to be concerned that so much of this content is preserved in one place, rather than spread across multiple archives. If a page is of value to you, you have an obligation to archive it, and to archive it in multiple archives. What web pages have you archived today, so that you, and others, can access their content long after the live site has gone away?

--Shawn M. Jones

2018-07-18: Why We Need Private Web Archives: Almost Two-Thirds of Web Traffic IS NOT Publicly Archivable


Google.com mementos from May 8th 1999 on the Internet Archive
In terms of the ability to be archived in public web archives, web pages fall into one of two categories: publicly archivable, or not publicly archivable.

    1. Publicly Archivable Web Pages:

    These pages are archivable by public archives. The pages can be accessed without login/authentication; in other words, these pages do not reside behind a paywall. Grant Atkins examined paywalls in the Internet Archive for news sites and found that web pages behind paywalls may actually redirect to a login page at crawl time. A good example of a publicly archivable page is Dr. Steven Zeil's page, since no authentication is required to view the page. Furthermore, it does not use client-side scripts (i.e., Ajax) to load additional content, so what you see in the web browser and what you can replay from public web archives are exactly the same.

    Screen shot from Dr. Steven Zeil's page captured on 2018-07-02
    Memento for Dr. Zeil's page on the Internet Archive captured on 2017-12-02 
    Some web pages provide "personalized" content depending on the GeoIP of the requester. In these cases, what you see in the browser and what you can replay from public web archives are nearly the same, except for some minor personalization/GeoIP related changes. For example, a user requesting https://www.islamicfinder.org from Suffolk, Virginia will see the prayer times for the closest major city (Norfolk, Virginia). On the other hand, when the Internet Archive crawls the page, it sees the prayer times for San Bruno, California. This is likely because the crawling/archiving is happening from San Francisco, California. The two pages, otherwise, are exactly the same!

    The live version of https://www.islamicfinder.org for a user in Suffolk, VA on 2018-07-02 
    Memento for  https://www.islamicfinder.org from the Internet Archive captured on 2018-06-22
    Some social media sites, like Twitter, are publicly archivable and the Internet Archive captures most of their content. Twitter's home page is personalized, so user-specific contents, like "Who to Follow" and "Trends for you" are not captured, but the tweets are. Also, some Twitter services require authentication.

    @twitter live web page
    @twitter memento from the Internet Archive captured on 2016-05-18

    The archived memento for the @twitter web page shows a message that cookies are used and are important for an enhanced user experience; nevertheless, the main content of the page, the tweets, is preserved (or at least the top-k tweets, since the crawler does not automatically scroll at archive time to activate the Ajax-based pagination, cf. Corren McCoy's "Pagination Considered Harmful to Archiving").

    Message from Twitter about cookies use to enhance user experience
    Also, deep links to individual tweets are archivable.
    Memento for a deep link to a tweet on the Internet Archive captured on 2013-01-18

    2. Not Publicly Archivable Web Pages:

    In terms of the amount of web traffic, search engines are at the top. According to SimilarWeb, Google is number one; its share is 10.10% of the entire web traffic. The Internet Archive crawls it on a regular basis and has over 0.5 million mementos as of 2018-05-01 (cf. Mat Kelly's tech report about the difficulty of counting the number of mementos). The captured mementos are exact copies in appearance, but obviously not functioning search pages.
    As of 2018-05-01 the IA has 552,652 mementos of Google.com

    Google.com memento from May 8th 1999 on the Internet Archive played on 2018-05-01
    It is possible to push a search result page from Google to a public web archive like archive.is, but that is not how web archives are normally used.
    A Google search query for "Machine Learning" on 2018-06-18 archived in archive.is
    Furthermore, it is not viable for web archives to try to archive search engines' result pages (SERPs) because there is an infinite number of possible URIs due to an infinite number of search queries and syntax, so even if we preserve a single SERP from June, 2018 (as shown above), we are unable to issue new queries against a June, 2018 version of Google. Maps and other applications that depend on user interaction are similar: individual pages may be archived, but we typically don't consider the entire application "archived" (cf. Michael Nelson's "Game Walkthroughs As A Metaphor for Web Preservation").

    Even when web archives use headless browsers to overcome the Ajax problem, there can be additional challenges. For example, I pushed a page from Google Maps with an address in Chesapeake, Virginia to archive.today, and the result was a page from Google support (in Russian) telling me that I (or more accurately, archive.today) needed to update the browser in order to use Google Maps! While technically not a paywall, this is similar to Grant's study mentioned above in that there is now something in the web archive corresponding to that Google Maps URI, but it does not match users' expectations. It also reveals a clue about the GeoIP of archive.today.
    Google Maps page for the address 4940 S Military HWY, Chesapeake, VA 23321 pushed to archive.today on 2018-07-02
    Memento for the Google Maps page I pushed to archive.today on 2018-07-02
    It is worth mentioning there are emerging tools like Webrecorder, WARCreate, WAIL, and Memento Tracer for personal web archiving (or community tools in the case of Tracer), but even if/when the Internet Archive replaces Heritrix with Brozzler and resolves the problems with Ajax, their Wayback Machine cannot be expected to have pages requiring authentication, nor pages with effectively infinite inputs like search engines and maps.

    Social media pages respond differently when web archives' crawlers try to crawl and archive them. Public web archives might have mementos of some social media pages; however, the sites often require a login to allow the download of a page's representation, and otherwise a redirection takes place. Another obstacle facing the archiving of social media pages is their heavy use of client-side scripts that will, for example, fetch new content when the page is scrolled or hide/show comments with no change in the URI. Facebook, for example, does not allow web archives' crawlers to access the majority of its pages. The Internet Archive's Wayback Machine returned 1,699 mementos for the former president's official Facebook page, but when I opened one of these mementos, it returned the infamous Facebook login-or-register page.
    1,699 mementos for the official Facebook page of Mr. Obama, former U.S. president as of 2018-05-01


    The memento captured on 2017-02-10 is showing the login page of Facebook
    There are a few exceptions where the Internet Archive is able to archive some user-contributed Facebook pages.


    Memento for a facebook page in the Internet Archive captured on 2012-03-02
    Also, it seems like archive.is is using a dummy account ("Nathan") to authenticate, view, and archive some Facebook pages.

    Memento for a facebook page in archive.is captured on 2018-06-21
    With the previous exceptions in mind, it is still safe to say that Facebook pages are not publicly archivable.

    LinkedIn exhibits the same behavior as Facebook. The notifications page has 46 mementos as of 2018-05-29, but they are entirely empty. The live page contains notifications about contacts: who is having a birthday, celebrating a job anniversary, starting a new job, and so on. This page is completely personalized and requires a cookie or login to display information related to the user; therefore, the Internet Archive has no way of downloading its representation.

    My account's notification page on Linkedin
     
    Memento of Linkedin's notification page

    The last example I would like to share is Amazon's "yourstore" page. I chose this example because it contains recommended items (another clear example of a personalized web page). The recommendations are based on the user's behavior. In my case, Amazon recommended electronics, automotive tools, and Prime Video.

    My Amazon's page (live) on 2018-05-02
    As of 2018-05-02, I found 111 mementos for my Amazon "yourstore" page in the Internet Archive, and opened one of them to see what had been captured.

    Mementos for Amazon's yourstore page in the Internet Archive on 2018-05-02
    As I expected, the page has a redirect to another page that asks for a login. It returned a 302 response code when it was crawled by the Internet Archive. The actual content of the original page was not archived because the IA crawler does not provide credentials to download the content of the page. The representation saved to the Internet Archive is for a resource different from the originally crawled page.

    IA crawler was redirected to a login page and captured it instead
    Login page captured in the IA instead of the crawled page
    There are many web sites with this behavior, so it is safe to assume that for some web sites, even when there are plenty of mementos, they all might return a soft 404.

    Estimating the amount of archivable web traffic:

    To explore the amount of web traffic that is archivable, I examined the top 100 sites as ranked by Alexa, and manually constructed a data set of those 100 sites using traffic analysis services from SimilarWeb and Semrush.

    The data was collected on 2018-02-23, and I captured three web traffic measures offered by both services: total visits, unique visits, and pages/visit.
    • Total visits is the total number of non-unique visits from last month.
    • Unique visits is the number of unique visits from last month.
    • Pages/visit is the average number of visited pages per user's visit.
    I determined whether or not a website is archivable based on the discussion I provided earlier, and put it all together in a CSV file to use later as input for my script. Suggestions, feedback, and pull requests are always welcome!

    The data set used in the experiment
    Using Python 3, I wrote a simple script that calculates the percentage of web traffic that is publicly archivable. I am assuming that the top 100 sites are a good representative sample of the whole web. I am aware that 100 sites is a small number compared to the 1.8 billion live websites on the Internet, but according to SimilarWeb, the top 100 sites receive 48.86% of the entire traffic on the web, which is consistent with a Pareto distribution. The program offers six different results, each based on a single measure or a combination of the measures total visits, unique visits, and pages/visit. Flags can be set to control which measures are used in the calculation; if no flags are set, the program shows all the results using all three measures and their combination. I came up with the following formula to calculate the percentage of publicly archivable web traffic based on all three measures combined:
    1. Multiply pages/visit by visits for each web site, from both SimilarWeb and SemRush
    2. Average the two results from step 1 across SimilarWeb and SemRush
    3. Average the unique visits for each website from SimilarWeb and SemRush
    4. Add the numbers obtained in steps 2 and 3
    5. Sum the number obtained in step 4 over all archivable websites
    6. Sum the number obtained in step 4 over all non-archivable websites
    7. Add the numbers obtained in steps 5 and 6 to get the total
    8. Calculate the percentages that the numbers obtained in steps 5 and 6 represent of the total obtained in step 7
    Using all measures, I found that 65.30% of the traffic of the top 100 sites is not archivable by public web archives. The program and the data set are available on Github.
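    A minimal sketch of that combined-measure calculation appears below; the row layout and the two example rows are illustrative stand-ins, not the actual data set published with this post:

    # each row: (site, archivable?, SimilarWeb visits, unique visits, pages/visit,
    #                               SemRush visits, unique visits, pages/visit)
    rows = [
        ("example-search.com", False, 1.0e9, 4.0e8, 8.0, 1.1e9, 4.2e8, 7.5),
        ("example-news.com",   True,  2.0e8, 9.0e7, 3.0, 1.8e8, 8.5e7, 3.2),
    ]

    archivable_total = 0.0
    non_archivable_total = 0.0
    for site, archivable, sw_v, sw_u, sw_p, sr_v, sr_u, sr_p in rows:
        pages_viewed = ((sw_p * sw_v) + (sr_p * sr_v)) / 2   # steps 1-2
        unique = (sw_u + sr_u) / 2                           # step 3
        weight = pages_viewed + unique                       # step 4
        if archivable:
            archivable_total += weight                       # step 5
        else:
            non_archivable_total += weight                   # step 6

    total = archivable_total + non_archivable_total          # step 7
    print("not archivable: %.2f%%" % (100 * non_archivable_total / total))   # step 8
    print("archivable:     %.2f%%" % (100 * archivable_total / total))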

    Now, it is possible to discuss three different scenarios and compute a range. If the top 100 sites receive 48.86% of the traffic, and 65.30% of that traffic is not publicly archivable, then:

    1.  If all of the remaining web traffic is publicly archivable, then 31.91% of the entire web traffic is not publicly archivable. 65.30 * 0.4886 = 31.91.
    2. If the remaining web traffic is similar to the traffic from the top 100 sites, then 65.30% of the entire web traffic is not publicly archivable.
    3. Finally, if all of the remaining web traffic is not publicly archivable, then only 16.95% of the entire web traffic is archivable. 34.7 * 0.4886 = 16.95. This means that 83.05% of the entire web traffic is not publicly archivable.

    So the percentage of not publicly archivable web traffic is between 31.91% and 83.05%. More likely, it is close to 65.30% (the second case).

    I would like to emphasize that since the top 100 websites are mainly Google, Bing, Yahoo, etc., and their derivatives, the nature of these top sites is the determining factor in my results. However, since the range has been calculated, it is safe to say that at least one third of the entire web traffic is not publicly archivable. This percentage demonstrates the necessity of private web archives. There are a few available tools that address this problem: Webrecorder, WARCreate, and WAIL. Public web archiving sites like the Internet Archive, archive.is, and others will never be able to preserve personalized or private web pages like emails, bank accounts, etc.

    Take Away Message:

    Personal web archiving is crucial since at least 31.91% of the entire web traffic is not archivable by public web archives. This is due to the increased use of personalized/private web pages and of technologies hindering the ability of web archives' crawlers to crawl and archive these pages. The experiment shows that the percentage of web traffic that is not publicly archivable can be as high as 83.05%, but the more likely case is that around 65% of web traffic is not publicly archivable. Unfortunately, no matter how good public web archives get at capturing web pages, there will always be a significant number of web pages that are not publicly archivable. This emphasizes the need for personal web archiving tools, such as Webrecorder, WARCreate, and WAIL, possibly combined with a collaboratively maintained repository of how to interact with complex sites, as introduced by Memento Tracer. Even if Ajax-related web archiving problems were eliminated, no less than one third of web traffic is to sites that will otherwise never appear in public web archives.

    --
    Hussam Hallak

    2018-07-18: HyperText and Social Media (HT) Trip Report


    Leaping Tiger statue next to the College of Arts at Towson University
    From July 9 - 12, the 2018 ACM Conference on Hypertext and Social Media (HT) took place at the College of Arts at Towson University in Baltimore, Maryland. Researchers from around the world presented the results of complete or ongoing work in tutorial, poster, and paper sessions. Also, during the conference I had the opportunity to present a full paper: "Bootstrapping Web Archive Collections from Social Media" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson.

    Day 1 (July 9, 2018)


    The first day of the conference was dedicated to a tutorial (Efficient Auto-generation of Taxonomies for Structured Knowledge Discovery and Organization) and three workshops:
    1. Human Factors in Hypertext (HUMAN)
    2. Opinion Mining, Summarization and Diversification
    3. Narrative and Hypertext
    I attended the Opinion Mining, Summarization and Diversification workshop. The workshop started with a talk titled: "On Reviews, Ratings and Collaborative Filtering," presented by Dr. Oren Sar Shalom, principal data scientist at Intuit, Israel. Next, Ophélie Fraisier, a PhD student studying stance analysis on social media at Paul Sabatier University, France, presented: "Politics on Twitter : A Panorama," in which she surveyed methods of analyzing tweets to study and detect polarization and stances, as well as election prediction and political engagement.
    Next, Jaishree Ranganathan, a PhD student at the University of North Carolina, Charlotte, presented: "Automatic Detection of Emotions in Twitter Data - A Scalable Decision Tree Classification Method."
    Finally, Amin Salehi, a PhD student at Arizona State University, presented: "From Individual Opinion Mining to Collective Opinion Mining." He showed how collective opinion mining can help capture the drivers behind opinions as opposed to individual opinion mining (or sentiment) which identifies single individual attitudes toward an item.

    Day 2 (July 10, 2018)


    The conference officially began on day 2 with a keynote: "Lessons in Search Data" by Dr. Seth Stephens-Davidowitz, a data scientist and NYT bestselling author of: "Everybody Lies."
    In his keynote, Dr. Stephens-Davidowitz revealed insights gained from search data ranging from racism to child abuse. He also discussed a phenomenon in which people are likely to lie to pollsters (social desirability bias) but are honest to Google ("Digital Truth Serum") because Google incentivizes telling the truth. The paper sessions followed the keynote with two full papers and a short paper presentation.


    The first (full) paper of day 2 in the Computational Social Science session: "Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora," was presented by Shubhanshu Mishra, a PhD student at the iSchool of the University of Illinois at Urbana-Champaign. He showed correlations between user-level and tweet-level metadata by addressing two questions: "Do tweets from users with similar Twitter characteristics have similar sentiments?" and "What meta-data features of tweets and users correlate with tweet sentiment?" 
    Next, Dr. Fred Morstatter presented a full paper: "Mining and Forecasting Career Trajectories of Music Artists," in which he showed that their dataset generated from concert discovery platforms can be used to predict important career milestones (e.g., signing by a major music label) of musicians.
    Next, Dr. Nikolaos Aletras, a research associate at the University College London, Media Futures Group, presented a short paper: "Predicting Twitter User Socioeconomic Attributes with Network and Language Information." He described a method of predicting the occupational class and income of Twitter users by using information extracted from their extended networks.
    After a break, the Machine Learning session began with a full paper (Best Paper Runner-Up): "Joint Distributed Representation of Text and Structure of Semi-Structured Documents," presented by Samiulla Shaikh, a software engineer and researcher at IBM India Research Labs.
    Next, Dr. Oren Sar Shalom presented a short paper titled: "As Stable As You Are: Re-ranking Search Results using Query-Drift Analysis," in which he presented the merits of using query-drift analysis for search re-ranking. This was followed by a short paper presentation titled: "Embedding Networks with Edge Attributes," by Palash Goyal, a PhD student at University of Southern California. In his presentation, he showed a new approach to learn node embeddings that uses the edges and associated labels.
    Another short paper presentation (Recommendation System session) by Dr. Oren Sar Shalom followed. It was titled: "A Collaborative Filtering Method for Handling Diverse and Repetitive User-Item Interactions." He presented a collaborative filtering model that captures multiple complex user-item interactions without any prior domain knowledge.
    Next, Ashwini Tonge, a PhD student at Kansas State University presented a short paper titled: "Privacy-Aware Tag Recommendation for Image Sharing," in which she presented a means of tagging images on social media in order to improve the quality of user annotations while preserving user privacy sharing patterns.
    Finally, Palash Goyal presented another short paper titled: "Recommending Teammates with Deep Neural Networks."

    The day 2 closing keynote by Leslie Sage, director of data science at DevResults followed after a break that featured a brief screening of the 2018 World Cup semi-final game between France and Belgium. In her keynote, she presented the challenges experienced in the application of big data toward international development.

    Day 3 (July 11, 2018)


    Day 3 of the conference began with a keynote: "Insecure Machine Learning Systems and Their Impact on the Web" by Dr. Ben Zhao, Neubauer Professor of Computer Science at the University of Chicago. He highlighted many milestones of machine learning by showing problems it has solved in natural language processing and computer vision. He also showed that opaque machine learning systems are vulnerable to attack by agents with malicious intent, and he expressed the idea that these critical issues must be addressed, especially given the rush to deploy machine learning systems.
    Following the keynote, I presented our full paper, "Bootstrapping Web Archive Collections from Social Media," in the Temporal session. I highlighted the importance of web archive collections as a means of preserving the historical record of important events, and of the seeds (URLs) from which they are formed. The seeds are collected by expert curators, but we do not have enough experts to collect seeds in a world of rapidly unfolding events. Consequently, I proposed exploiting the collective domain expertise of web users by generating seeds from social media collections, and I showed, through a novel suite of measures, that seeds generated from social media are similar to those generated by experts.

    Next, Paul Mousset, a PhD student at Paul Sabatier University, presented a full paper: "Studying the Spatio-Temporal Dynamics of Small-Scale Events in Twitter," in which he presented his work into the granular identification and characterization of event types on Twitter.
    Next, Dr. Nuno Moniz, invited Professor at the Sciences College of the University of Porto, presented a short paper: "The Utility Problem of Web Content Popularity Prediction." He demonstrated that state-of-the-art approaches for predicting web content popularity have been optimized for improving the predictability of average behavior of data: items with low levels of popularity.
    Next, Samiulla Shaikh (again) presented the first full paper (Nelson Newcomer Award winner) of the Semantic session: "Know Thy Neighbors, and More! Studying the Role of Context in Entity Recommendation," in which he showed how to efficiently explore a knowledge graph for the purpose of entity recommendation by utilizing contextual information to help select a subset of entities in the knowledge graph.
    Samiulla Shaikh (again), presented a short paper: "Content Driven Enrichment of Formal Text using Concept Definitions and Applications," in which he showed a method of making formal text more readable to non-expert users by text enrichment e.g., highlighting definitions and fetching of definitions from external data sources.
    Next, Yihan Lu, a PhD student at Arizona State University, presented a short paper: "Modeling Semantics Between Programming Codes and Annotations." He presented the results of investigating a systematic method to examine annotation semantics and its relationship with source code. He also showed their model, which predicts concepts in programming code annotations. Such annotations could be useful to new programmers.
    Following a break, the User Behavior session began. Dr. Tarmo Robal, a research scientist at the Tallinn University of Technology, Estonia, presented a full paper: "IntelliEye: Enhancing MOOC Learners' Video Watching Experience with Real-Time Attention Tracking." He introduced IntelliEye, a system that monitors students watching video lessons and detects when they are distracted and intervenes in an attempt to refocus their attention.
    Next, Dr. Ujwal Gadiraju, a postdoctoral researcher at L3S Research Center, Germany, presented a full paper: "SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing." He presented his findings from investigating the role of task similarity in microtask crowdsourcing on platforms such as Amazon Mechanical Turk and its effect on market dynamics.
    Next, Xinyi Zhang, a computer science PhD candidate at UC Santa Barbara, presented a short paper: "Penny Auctions are Predictable: Predicting and profiling user behavior on DealDash." She showed that penny auction sites such as DealDash are vulnerable to modeling and adversarial attacks by showing that both the timing and source of bids are highly predictable and users can be easily classified into groups based on their bidding behaviors.
    Shortly after another break, the Hypertext paper sessions began. Dr. Charlie Hargood, senior lecturer at Bournemouth University, UK, and Dr. David Millard, associate professor at the University of Southampton, UK, presented a full paper: "The StoryPlaces Platform: Building a Web-Based Locative Hypertext System." They presented StoryPlaces, an open source authoring tool designed for the creation of locative hypertext systems.
    Next, Sharath Srivatsa, a Masters student at International Institute of Information Technology, India, presented a full paper: "Narrative Plot Comparison Based on a Bag-of-actors Document Model." He presented an abstract "bag-of-actors" document model for indexing, retrieving, and comparing documents based on their narrative structures. The model resolves the actors in the plot and their corresponding actions.
    Next, Dr. Claus Atzenbeck, professor at Hof University, Germany, presented a short paper: "Mother: An Integrated Approach to Hypertext Domains." He stated that the Dexter Hypertext Reference Model which was developed to provide a generic model for node-link hypertext systems does not match the need of Component-Based Open Hypermedia Systems (CB-OHS), and proposed how this can be remedied by introducing Mother, a system that implements link support.
    The final (short) paper of the day, "VAnnotatoR: A Framework for Generating Multimodal Hypertexts," was presented by Giuseppe Abrami. He introduced a virtual reality and augmented reality framework for generating multimodal hypertexts called VAnnotatoR. The framework enables the annotation and linkage of texts, images and their segments with walk-on-able animations of places and buildings.
    The conference banquet at Rusty Scupper followed the last paper presentation. The next HyperText conference was announced at the banquet.

    Day 4 (July 12, 2018)


    The final day of the conference featured multiple paper presentations.
    The day began with a keynote "The US National Library of Medicine: A Platform for Biomedical Discovery and Data-Powered Health," presented by Elizabeth Kittrie, strategic advisor for data and open science at the National Library of Medicine (NLM). She discussed the role the NLM serves such as provider of health data for biomedical research and discovery. She also discussed the challenges that arise from the rapid growth of biomedical data, shifting paradigms of data sharing, as well as the role of libraries in providing access to digital health information.
    The Privacy session of exclusively full papers followed the keynote. Ghazaleh Beigi, a PhD student at Arizona State University presented: "Securing Social Media User Data - An Adversarial Approach." She showed a privacy vulnerability that arises from the anonymization of social media data by demonstrating an adversarial attack specialized for social media data.
    Next, Mizanur Rahman, a PhD student at Florida International University, presented: "Search Rank Fraud De-Anonymization in Online Systems." The bots and automatic methods session with two full paper presentations followed.
    Diego Perna, a researcher at the University of Calabria, Italy, presented: "Learning to Rank Social Bots." Given recent reports about the use of bots to spread misinformation/disinformation on the web in order to sway public opinion, Diego Perna proposed a machine-learning framework for identifying and ranking online social network accounts based on their degree similarity to bots.
    Next, David Smith, a researcher at the University of Florida, presented: "An Approximately Optimal Bot for Non-Submodular Social Reconnaissance." He noted that studies of how social bots befriend real users in order to collect sensitive information operate on the premise that the likelihood of users accepting bot friend requests is fixed, a premise contradicted by empirical evidence. Subsequently, he presented his work addressing this limitation.
    The News session began shortly after a break with a full paper (Best Paper Award) presentation from Lemei Zhang, a PhD candidate at the Norwegian University of Science and Technology: "A Deep Joint Network for Session-based News Recommendations with Contextual Augmentation." She highlighted some of the issues news recommendation systems suffer from, such as the fast update rate of news articles and the lack of user profiles. She then proposed a news recommendation system that combines user click events within sessions and news contextual features to predict the next click behavior of a user.
    Next, Lucy Wang, senior data scientist at Buzzfeed, presented a short paper: "Dynamics and Prediction of Clicks on News from Twitter."
    Next, Sofiane Abbar, senior software/research engineer at the Qatar Computing Research Institute, presented via a YouTube video: "To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content." He proposed a new approach for predicting the popularity of news articles before they are published. The approach is based on observations regarding article similarity and topicality and complements existing content-based methods.

    Next, two full papers (Community Detection session) were presented by Ophélie Fraisier and Amin Salehi. Ophélie Fraisier presented: "Stance Classification through Proximity-based Community Detection." She proposed the Sequential Community-based Stance Detection (SCSD) model for detecting stances (online viewpoints). It is a semi-supervised ensemble algorithm that considers multiple signals which inform stance detection. Next, Amin Salehi presented: "Sentiment-driven Community Profiling and Detection on Social Media." He presented a method of profiling social media communities based on their sentiment toward topics and proposed a method of detecting such communities and identifying the motives behind their formation.
    I would like to thank the organizers of the conference, the hosts, Towson University College of Arts, as well as IMLS for funding our research.
    -- Nwala (@acnwala)

    2018-07-22: Tic-Tac-Toe and Magic Square Made Me a Problem Solver and Programmer


    "How did you learn programming?", a student asked me in a recent summer camp. Dr. Yaohang Li organized the Machine Learning and Data Science Summer Camp for High School students of the Hampton Roads metropolitan region at the Department of Computer Science, Old Dominion University from June 25 to July 9, 2018. The camp was funded by the Virginia Space Grant Consortium. More than 30 students participated in it. They were introduced to a variety topics such as Data Structures, Statistics, Python, R, Machine Learning, Game Programming, Public Datasets, Web Archiving, and Docker etc. in the form of discussions, hands-on labs, and lectures by professors and graduate students. I was invited to give a lecture about my research and Docker. At the end of my talk I solicited questions and distributed Docker swag.

    The question "How did you learn programming?" led me to draw Tic-Tac-Toe Game and a 3x3 Magic Square on the white board. Then I told them a more than a decade old story of the early days of my bachelors degree when I had recently got my very first computer. One day while brainstorming on random ideas, I realized the striking similarity between the winning criteria of a Tic-Tac-Toe game and sums of 15 using three numbers of a 3x3 Magic Square that uses unique numbers from one to nine. The similarity has to do with their three rows, three columns, and two diagonals. After confirming that there are only eight combinations of selecting three unique numbers from one to nine whose sum is 15, I was sure that those are all placed at strategic locations in a magic square and there is no other possibility left for another such combination. If we assign values to each block of the Tic-Tac-Toe game according the Magic Square and store list of values acquired by the two players, we can decide potential winning moves in the next step by trying various combinations of two acquired vales of a player and subtracting it from 15. For example, if places 4 and 3 are acquired by the red (cross sign) player then a potential winning move would be place 8 (15-4-3=8). With this basic idea of checking potential wining move, when the computer is playing against a human, I could set strategies of first checking for the possibility of winning moves by the computer and if none are available then check for the possibility of the next winning moves by the human player and block them. While there are many other approaches to solve this problem, my idea was sufficient to get me excited and try to write a program for it.

    By that time I had only a basic understanding of programming constructs such as variables, conditions, loops, and functions in the C programming language, from the introductory Computer Science curriculum. While C is a great language for many reasons, it was not an exciting language for me as a beginner. If I were to write the Tic-Tac-Toe game in C, I would have ended up writing something with a text-based user interface in the terminal, which is not what I was looking for. I asked someone about the possibility of writing software with a graphical user interface (GUI), and he suggested that I try Visual Basic. So I went to the library, got a book on VB6, and studied it for about a week. Now I was ready to create a small window with nine buttons arranged in a 3x3 grid. When one of these buttons was clicked, a colored label (a circle or a cross) would be placed and a callback function would be called with an argument carrying the value associated with the position of the button (as per the Magic Square arrangement). The callback function could then update the state and play the next move. Later, the game was improved with different modes and settings.

    One day, I excitedly shared my program and approach with a professor (who is working for Microsoft now). He said this technique is explored in an algorithms book too. This made me feel a little deflated, because I was not the first one to come up with the idea. However, I was equally happy that I had discovered it independently and that it had already been validated by some smart people.

    This was not the only occasion when I had an idea and needed the right tool to express it. Over time my curiosity led me to many more challenges, ideas for potential solutions, and explorations of numerous suitable tools, techniques, and programming languages.


    My talk was scheduled for Wednesday, June 27, 2018. I started by introducing myself, the WS-DL Research Group, and the basics of Web Archiving, and then briefly talked about my Archive Profiling research. Without going too much into the technical details, I tried to explain the need for Memento Routing and how Archive Profiles can help achieve it.


    Luckily, Dr. Michele Weigle had already introduced Web Archiving to them the day before my talk. When I started mentioning Web Archives, they knew what I was talking about. This helped me cut my talk down and save some time to talk about other things and the Q/A session.


    I then put my Docker Campus Ambassador hat on and started with the Dockerizing ArchiveSpark story. Then I briefly described what Docker is, where it can be useful, and how it works. I walked them through a code example to illustrate the procedure of working with Docker. As expected, it was their first encounter with Docker, and many of them had no experience with the Linux operating system either, so I tried to keep things as simple as possible.


    I had a lot of stickers and some leftover T-shirts from my previous Docker event, so I gave them to those who asked any questions. A couple days later, Dr. Li told me that the students were very excited about Docker and especially those T-shirts, so I decided to give a few more of those away. For that, I asked them a few questions related to my earlier talk and whoever was able to recall the answers got a T-shirt.


    Overall, I think it was a successful summer camp. I am positive that those High School students had a great learning experience and exposure to some research techniques that can be helpful in their careers, and some of them might be encouraged to pursue a graduate degree one day. Being a research university, ODU is enriched with many talented graduate students with a variety of expertise and experiences that can benefit the community at large. I think more such programs should be organized in the Department of Computer Science and various other departments of the university.


    It was a fun experience for me, as I interacted with High School students here in the USA for the first time. They were all energetic, excited, and engaging. Good luck to all who were part of this two-week-long event. And now you know how I learned programming!

    --
    Sawood Alam

    2018-08-01: A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages


    As commonly seen on Facebook and Twitter, the social card is a type of surrogate that provides clues as to what is behind a URI. In this case, the URI is from Google and the social card makes it clear that the document behind this long URI is directions.
    As I described to the audience of Dodging the Memory Hole last year, surrogates provide the reader with some clue of what exists behind a URI. The social card is one type of surrogate. Above we see a comparison between a Google URI and a social card generated from that URI. Unless a reader understands the structure of all URIs at google.com, they will not know what the referenced content is about until they click on it. The social card, on the other hand, provides clues to the reader that the underlying URI provides directions from Old Dominion University to Los Alamos National Laboratory. Surrogates allow readers to pierce the veil of the URI's opaqueness.

    With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Our research team blog posts serve as containing pages for embeds all of the time. We typically use embeddable surrogates of tweets, videos from YouTube, and presentations from Slideshare, but surrogates can be generated for a variety of other resources as well. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.


    A recent WS-DL blog post containing embeddable surrogates of Slideshare presentations.


An example MementoEmbed social card for a memento from blasttheory.co.uk, whose description begins: "Sam Pearson and Clara Garcia Fraile are in residence for one month working on a new project called In My Shoes. They are developin…"

    MementoEmbed is the first archive-aware embeddable surrogate service. This means it can include memento-specific information such as the memento-datetime, the archive from which a memento originates, and the memento's original resource domain name. In the MementoEmbed social card above, we see the following information:
    • from the resource itself:
      • title — "Blast Theory"
      • a description conveying some information of what the resource is about — "Sam Pearson and Clara Garcia..."
      • a striking image from the resource conveying some visual aspect of aboutness
      • its original web site favicon — the bold "B" in the lower left corner
      • its original domain name — "BLASTTHEORY.CO.UK"
  • its memento-datetime — 2009-05-22T22:12:51Z
      • a link to its current version — under "Current version"
      • a link to other versions — under "Other Versions"
    • from the archive containing the resource:
      • the domain name of the archive — "WEBARCHIVE.ORG.UK"
      • the favicon of the archive — the white "UKWA" on the aqua background
  • a link to the memento in the archive — accessible via the links in the title and the memento-datetime
    Most of this information is not provided by services for live web resources, such as Embed.ly.

    MementoEmbed is a deployable service that currently generates social cards, like the one above, and thumbnails. As with most software I announce, MementoEmbed is still in its alpha prototype phase, meaning that crashes and poor output are to be expected. A bleeding edge demo is available at http://mementoembed.ws-dl.cs.odu.edu. The source code is available from https://github.com/oduwsdl/MementoEmbed. Its documentation is growing at https://mementoembed.readthedocs.io/en/latest/.

In spite of its simplicity in concept, MementoEmbed is an ambitious project, requiring that it not only support parsing and processing of the different web concepts and technologies of today, but all that have ever existed. With this breadth of potential in mind, I know that MementoEmbed does not yet handle all memento cases, but that is where you can help: contribute by submitting issue reports that help us improve it.

    But why use MementoEmbed instead of some other service? What are the goals of MementoEmbed? How does it work? What does the future of MementoEmbed look like?

    Why MementoEmbed?


    Why should someone use MementoEmbed and not some other embedding service? I reviewed several embedding services mentioned on the web. The examples in this section will demonstrate some embeds using a memento of the New York Times front page from 2005 preserved by the Internet Archive, shown below.

    This is a screenshot of the example New York Times memento used in the rest of this section. Its memento-datetime is June 2, 2005 at 19:45:24 GMT and it is preserved by the Internet Archive. This page was selected because it contains a lot of content, including images.
I reviewed Embed.ly, embed.rocks, Iframely, noembed, microlink, and autoembed. As of this writing, the autoembed service appears to be gone. The noembed service only provides embeds for a small number of web sites and does not support web archives. Iframely responds with errors for memento URIs, as shown below.
    Iframely fails to generate an embed for a memento of a New York Times page at the Internet Archive. The error message is misleading. There are multiple images on this page.
    What the Iframely parsers see for this memento according to their web application.
    What Iframely generates for the current New York Times web page (as of July 29, 2018 at 18:23:15 GMT).


Embed.ly, embed.rocks, and microlink are the only services that attempt to generate embeds for mementos. Unfortunately, none of them are fully archive-aware. One of the goals of a good surrogate is to convey some level of aboutness with respect to the underlying web resource. Mementos are documents with their own topics. They are typically not about the archives that contain them. Intermixing these two concepts of document content and archive information, without clear separation, produces surrogates that can confuse users. The microlink screenshot below shows an embed that fails to convey the aboutness of its underlying memento. The microlink service is not archive-aware. In this example, microlink mixes the Internet Archive favicon and Internet Archive banner with the title from the original resource. The embed.rocks example below does not fare much better, appearing to attribute the New York Times article to web.archive.org. What is the resource behind this surrogate really about? This mixing of resources weakens the surrogate's ability to convey the aboutness of the memento.

As seen in the screenshot of a social card for our example New York Times memento from 2005, microlink conflates original resource information and archive information.
    The embed.rocks social card does not fare much better, attributing the New York Times page to web.archive.org.

Embed.ly does a better job, but still falls short. In the screenshot below, an embed was created for the same resource. It contains the title of the resource as well as a short description and even a striking image from the memento itself. Unfortunately, it contains no information about the original resource, potentially implying that someone at archive.org is serving content for the New York Times. Even worse, in a world where readers are concerned about fake news, this surrogate may lead an informed reader to believe that it links to a counterfeit resource because it does not come from nytimes.com.
This screenshot of an embed for the same New York Times memento shows how well Embed.ly performs. While the image and description convey more of the original resource's aboutness, the only attribution information is about the archive.
    Below, the same resource is represented as a social card in MementoEmbed. MementoEmbed chose the New York Times logo as the striking image for this page. This card incorporates elements used in other surrogates, such as the title of the page, a description, and a striking image pulled from the page content. Further down, I annotate the card and show how the information exists in separate areas of the card. MementoEmbed places archive information and the original resource information into their own areas of the card, visually providing separation between these concepts to reduce confusion.

    A screenshot of the same New York Times memento in MementoEmbed.



    This is not to imply that cards generated by Embed.ly or other services should not be used, just that they appear to be tailored to live web resources. MementoEmbed is strictly designed for use with mementos and strives to occupy that space.

    Goals of MementoEmbed


MementoEmbed was built with the following goals in mind.

    1. The system shall provide archive-aware surrogates of mementos
    2. The system shall be deployable by others
    3. Surrogates shall degrade gracefully
    4. Surrogates shall have limited or no dependency on an external service
    5. Not just humans, but machines shall be able to generate surrogates
In the prior section, I demonstrated how we meet the first goal. In the following subsections, I'll provide an overview of how well the current service meets the other goals.

    Deployable by others



    I did not want MementoEmbed to be another centralized service. My goal is that eventually web archives can run their own copies of MementoEmbed. Visitors to those archives will be able to create their own embeds from mementos they find. The embeds can be used in blog posts and other web pages and thus help these archives promote themselves.

    MementoEmbed is a Python Flask application that can be run from a Docker container. Again, it is in its alpha prototype phase, but thanks to the expertise of fellow WS-DL member Sawood Alam, others can download the current version from DockerHub.

    Type the following to acquire the MementoEmbed Docker image:

    docker pull oduwsdl/mementoembed

    Type the following to create a container from the image and run it on TCP port 5550:

    docker run -it --rm -p 5550:5550 oduwsdl/mementoembed

Inside the container, the service runs on port 5550. The -p flag maps the container's port 5550 to your local port 5550. From here, the user can access the service at http://localhost:5550 and is greeted with the page below.

    The welcome page for MementoEmbed.

    Surrogates that degrade gracefully



    Prior to executing any JavaScript, MementoEmbed's social cards use the blockquote, div, and p tags. After JavaScript, these tags are augmented with styles, images, and other information. This means that if the MementoEmbed JavaScript resource is not available, the social card is still viewable in a browser, as seen below.

    A MementoEmbed social card generated for a memento from the Portuguese Web Archive.


    The same social card rendered without the associated JavaScript.


    Surrogates with limited or no external dependencies


    All web resources are ephemeral, and embedding services are no exception. If an embed service fails or otherwise disappears, what happens to its embeds? Consider Embed.ly. The embed code for Embed.ly is typically less than 100 bytes in length. They achieve this small size because their embeds contain the title of the represented page, the represented URI, and a URI to a JavaScript resource. Everything else is loaded from their service via that JavaScript resource. Web page authors trade a small embed code for dependency on an outside service. Once that JavaScript is executed and a page is rendered, the embed grows to around 2kB. What has the web page author using the embed really gained from the small size? They have less to copy and paste, but their page size still grows once rendered. Also, in order for their page to render, it now relies on the speed and dependability of yet another external service. This is why Embed.ly cards sometimes experience a delay when the containing page is being rendered.

    Privacy can be another concern. Embedded resources result in additional requests to web servers outside of the one providing the containing page. This means that an embed not only potentially conveys information about which pages it is embedded in, but also who is visiting these pages. If a web page author does not wish to share their audience with an outside service, then they might want to reconsider embeds.

    Thinking about this from the perspective of web archives, I decided that MementoEmbed can do better. I started thinking about how its embeds could outlive MementoEmbed while at the same time offering privacy to visiting users.

    MementoEmbed offers thumbnails as data URIs so that pages using these thumbnails do not depend on MementoEmbed.
    Currently, MementoEmbed provides surrogates either as social cards or thumbnails. In response to requests for thumbnails, MementoEmbed provides an embed as a data URI, as shown above. Data URI support for images in browsers is well established at this point. A web page containing the data URI can render it without relying upon any MementoEmbed service, thus removing an external dependency. Of course, one can also save the thumbnail locally and upload it to their own server.

    MementoEmbed offers the option of using data URIs for images and favicons in social cards so that these embedded resources are not dependent on outside services.
    For social cards, I tried to take the use of data URIs a step further. As seen in the screenshot above, MementoEmbed allows the user to use data URIs in their social card rather than just relying upon external resources for favicons and images. This makes the embeds larger, but ensures that they do not rely upon external services.
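
The idea behind these data URIs is straightforward. The sketch below (a minimal illustration, not MementoEmbed's internal code; the file name is a placeholder of my own) shows how raw image bytes can be wrapped in a data URI that a browser can render without contacting any external server:

import base64

def to_data_uri(image_bytes, mime_type="image/png"):
    # Encode the raw bytes so they can live inside an HTML attribute
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mime_type, encoded)

# For example, turn a saved thumbnail into a self-contained <img> element
with open("thumbnail.png", "rb") as f:   # "thumbnail.png" is a placeholder file name
    print('<img src="{}">'.format(to_data_uri(f.read())))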

    As noted in the previous section, MementoEmbed includes some basic data and simple HTML to allow for degradation. CSS and images are later added by JavaScript loaded from the MementoEmbed service. To eliminate this dependency, I am currently working on an option that will allow the user (or machine) to request an HTML-only social card.

    Not just for humans


    The documentation provides information on the growing web API that I am developing for MementoEmbed. For the sake of brevity, I will talk about how a machine can request a social card or a thumbnail here.

MementoEmbed uses tactics similar to those of other web archive frameworks. Each service has its own URI "stem", and the URI-M to be operated on is appended to this stem.

    Firefox displays a social card produced by the machine endpoint /services/product/socialcard at http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
    To request a social card, a URI-M is appended to the endpoint /services/product/socialcard/. For example, consider a system that wants to request a social card for the memento at http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ from the MementoEmbed service running at mementoembed.ws-dl.cs.odu.edu. The client would visit: http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the HTML and JavaScript necessary to render the social card, as seen in the above screenshot.

    Firefox displays a thumbnail produced by the machine endpoint /services/product/thumbnail at http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
    Likewise, to request a thumbnail for the same URI-M from the same service, the machine would visit the endpoint at /services/product/thumbnail at the URI http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the image as shown in the above Firefox screenshot. The thumbnail service returns thumbnails in the PNG image format.
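
Both endpoints are easy to script. Below is a minimal sketch using Python's requests library against the public demo (the saved file name is my own choice):

import requests

SERVICE = "http://mementoembed.ws-dl.cs.odu.edu"
URIM = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"

# Fetch the embed code (HTML and JavaScript) for a social card
card = requests.get(SERVICE + "/services/product/socialcard/" + URIM)
print(card.status_code, card.headers.get("Content-Type"))

# Fetch a PNG thumbnail for the same memento and save it locally
thumbnail = requests.get(SERVICE + "/services/product/thumbnail/" + URIM)
with open("mkelly-thumbnail.png", "wb") as f:
    f.write(thumbnail.content)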

Clients can use the Prefer header from RFC 7240 to control the generation of these surrogates. I have written about the Prefer header before, and Mat Kelly is using it in his work as well. Simply put, the client uses the Prefer header to request certain behaviors from a server with respect to a resource. The server responds with a Preference-Applied header indicating which behaviors exist in the response.

    For example, to change the width of a thumbnail to 500 pixels, a client would generate a Prefer header containing the thumbnail_width option. If one were to use curl, the HTTP request headers to a local instance of MementoEmbed would look like this, with the Prefer header marked red for emphasis:

    GET /services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ HTTP/1.1
    Host: localhost:5550
    User-Agent: curl/7.54.0
    Accept: */*
    Prefer: thumbnail_width=500

And the MementoEmbed service would respond with the following headers, with the Preference-Applied header marked red for emphasis:

    HTTP/1.0 200 OK
    Content-Type: image/png
    Content-Length: 216380
    Preference-Applied: viewport_width=1024,viewport_height=768,thumbnail_width=500,thumbnail_height=375,timeout=15,remove_banner=no
    Server: Werkzeug/0.14.1 Python/3.6.5
    Date: Sun, 29 Jul 2018 21:08:19 GMT

    The server indicates that the thumbnail returned has not only a width of 500 pixels, but also a height of 375 pixels. Also included are other preferences used in its creation, like the size of the browser viewport, the number of seconds MementoEmbed waited before giving up on a response from the archive, and whether or not the archive banner was removed.
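
The same exchange can be reproduced from a script. Here is a sketch using Python's requests library against a local instance (the saved file name is arbitrary):

import requests

urim = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"
endpoint = "http://localhost:5550/services/product/thumbnail/" + urim

# Ask for a 500 pixel wide thumbnail via the Prefer header
response = requests.get(endpoint, headers={"Prefer": "thumbnail_width=500"})

# Preference-Applied echoes the settings the service actually used
print(response.headers.get("Preference-Applied"))

with open("thumbnail-500px.png", "wb") as f:
    f.write(response.content)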

    The social card service also supports preferences for whether or not to use data URIs for images and favicons.

Other service endpoints exist, like /services/memento/archivedata, to provide individual pieces of the information used in social cards. In addition to these services, I am also developing an oEmbed endpoint for MementoEmbed.

    Brief Overview of MementoEmbed Internals



    Here I will briefly cover some of the libraries and algorithms used by MementoEmbed. The Memento protocol is a key part of what allows MementoEmbed to work. MementoEmbed uses the Memento protocol to discover the original resource domain, locate favicons, and of course to find a memento's memento-datetime.

    If metadata is present in HTML meta tags, then MementoEmbed uses those values for the social card. MementoEmbed favors Open Graph metadata tags first, followed by Twitter card metadata, and then resorts to mining the HTML page for items like title, description, and striking image.
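
To illustrate that fallback chain (this is only a sketch of the approach, with a function name of my own, not MementoEmbed's actual code), a title extractor might look like this:

from bs4 import BeautifulSoup

def extract_title(html):
    # Prefer Open Graph metadata, then Twitter card metadata, then the <title> tag
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", attrs={"property": "og:title"})
    if og and og.get("content"):
        return og["content"]
    tw = soup.find("meta", attrs={"name": "twitter:title"})
    if tw and tw.get("content"):
        return tw["content"]
    return soup.title.string.strip() if soup.title and soup.title.string else ""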

Titles are extracted for social cards using BeautifulSoup. The description is generated using readability-lxml. This library provides scores for paragraphs in an HTML document. Based on comments in the readability code, the paragraph with the highest score is considered to be "good content". The highest scored paragraph is selected for use in the description and truncated to the first 197 characters so it will fit into the card. If readability fails for some reason, MementoEmbed falls back to building one large paragraph from the content using justext and taking the first 197 characters from it, a process Grusky et al. refer to as Lede-3.
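
A rough sketch of that description pipeline, assuming the readability-lxml and justext packages are installed (again, an illustration of the approach rather than the project's exact code):

from bs4 import BeautifulSoup
from readability import Document   # provided by readability-lxml
import justext

def extract_description(html, limit=197):
    try:
        # readability returns its highest scored "good content" as an HTML fragment
        summary_html = Document(html).summary()
        text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
    except Exception:
        # Fall back to one large paragraph built with justext (Lede-3 style)
        paragraphs = justext.justext(html.encode("utf-8"),
                                     justext.get_stoplist("English"))
        text = " ".join(p.text for p in paragraphs if not p.is_boilerplate)
    return text[:limit]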

    Striking image selection is a difficult problem. To support our machine endpoints, I needed to find a method that would select an image without any user intervention. There are several research papers offering different solutions for image selection based on machine learning. I was concerned about performance, so I opted to use some heuristics instead. Currently, MementoEmbed employs an algorithm that scores images using the equation below.
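
An illustrative form consistent with the variables described below (the exact equation and weighting MementoEmbed uses may differ) is:

S = k_1 \frac{N - n}{N} + k_2 s - k_3 h - k_4 r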



    where S is the score, N is the number of images on the page, n is the current image position on the page, s is the size of the image in pixels, h is the number of bars in the image histogram containing a value of 0, and r is the ratio of width to height. The variables k1 through k4 are weights. This equation is built on several observations. Images earlier in a page (a low value of n) tend to be more important. Larger images (a high s) tend to be preferred. Images with a histogram consisting of many 0s tend to be mostly text, and are likely advertisements or navigational elements. Images whose width is much greater than their height (a high value for r) tend to be banner ads. For performance, the first 15 images on a page are scored. If the highest scoring image meets some threshold, then it is selected. If no images meet that threshold, then the next 15 are loaded and evaluated.

The thumbnails are generated by a call from Flask to Puppeteer. MementoEmbed includes a Python class that can make this cross-language call, provided the user has Puppeteer installed. If requested by the user, MementoEmbed uses its knowledge of various archives to produce a thumbnail without the archive banner. This only works for some archives. For Wayback archives, information for choosing URI-Ms without banners was gathered from Table 9 of John Berlin's Master's thesis.
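
One way to picture that cross-language call is a small wrapper that shells out to Node.js (a hypothetical sketch: "screenshot.js" stands in for a Puppeteer script and is not a real file in the project):

import subprocess

def generate_thumbnail(urim, output_path, viewport_width=1024):
    # Invoke a Node.js/Puppeteer script that loads the URI-M and writes a PNG
    # screenshot; "screenshot.js" is an invented placeholder for illustration.
    subprocess.run(
        ["node", "screenshot.js", urim, output_path, str(viewport_width)],
        check=True, timeout=30
    )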

    The Future of MementoEmbed



MementoEmbed has many possibilities. I have already mentioned that MementoEmbed will support features like an oEmbed endpoint and HTML-only social cards. In the foreseeable future, I will address language-specific issues and problems with certain web constructs, like framesets and ancient character sets. I also foresee the need for additional social card preferences, like changes to width and height as well as a preference for a vertical rather than horizontal card. One could even use content negotiation to request thumbnails in formats other than PNG.

The striking image selection algorithm will be improved. At the moment, the weights are set to values that worked in my limited testing. It is likely that new weights, a new equation, or even a new algorithm will be employed at some point. Feedback from the community will guide these decisions.

    Some other ideas that I have considered involve new forms of surrogates. Simple alterations to existing surrogates are possible, like social cards that contain thumbnails or social cards without any images. More complex concepts like Teevan's Visual Snippets or Woodruff's enhanced thumbnails would require a lot of work, but are possible within the framework of MementoEmbed.

    A lot of it will depend on the needs of the community. Thanks to Sawood Alam, Mat Kelly, Grant Atkins, Michael Nelson, and Michele Weigle for already providing feedback. As more people experience MementoEmbed, they will no doubt come up with ideas I had not considered, so please try our demo at http://mementoembed.ws-dl.cs.odu.edu or look at the source code in GitHub at https://github.com/oduwsdl/MementoEmbed. Most importantly, report any issues or ideas to our GitHub issue tracker: https://github.com/oduwsdl/MementoEmbed/issues.


    --Shawn M. Jones

    2018-08-25: Four WS-DL Classes Offered for Fall 2018


    Four WS-DL classes are offered for Fall 2018:
    Dr. Michele C. Weigle is not teaching this semester.

    Our current plan for courses in Spring 2019 is to offer a record five WS-DL courses:
    • CS 432/532 Web Science, Alexander Nwala
    • CS 725/825 Information Visualization, Dr. Michele C. Weigle
    • CS 734/834 Information Retrieval, Dr. Jian Wu
    • CS 795/895 Human-Computer Interaction (HCI), Dr. Sampath Jayarathna
    • CS 795/895 Web Archiving Forensics, Dr. Michael L. Nelson
    Note that CS 418, 431, and 432 all count for the CS Web Programming minor.  

    --Michael

