Channel: Web Science and Digital Libraries Research Group

2018-04-23: "Grampa, what's a deleted tweet?"


In early February 2018, Breitbart News made a splash with an inflammatory tweet suggesting that Muslims would end the Super Bowl, which it deleted twelve hours later, stating that the tweet did not meet its editorial standards. The deleted tweet contained an imaginary conversation between a Muslim child and a grandparent about the Super Bowl and linked to one of Breitbart's articles on the declining TV ratings of the National Football League (NFL) championship game. News articles from The Hill, Huffington Post, Politico, the Independent, etc., covered the deleted tweet controversy in detail.

As web archiving researchers, we decided to look into the Breitbart News incident and shed some light on the account's pattern of deleted tweets over recent months.

Role of web archives in finding deleted tweets   


Hany M. SalahEldeen and Michael L. Nelson, in their paper "Losing my revolution: How many resources shared on social media have been lost?", measured how many resources shared on social media are still live or preserved in public web archives. They concluded that nearly 11% of shared resources are lost in their first year, and after that we lose shared resources at a rate of about 0.02% per day.

Web archives such as the Internet Archive, Archive-It, the UK Web Archive, etc., play an important role in the preservation of resources shared on social media. Using web archives, we can sometimes recover deleted tweets. For example, Miranda Smith, in her blog post "Twitter Follower Count History via Internet Archive", talks about using the Internet Archive to fetch historical Twitter data and graph follower counts over time. She also explains the advantages of using web archives over the Twitter API for finding users' historical data.

The caveat in using web archives to uncover deleted tweets is their limited coverage of Twitter. But for popular Twitter accounts with a high number of mementos, such as RealDonaldTrump, Barack Obama, BreitbartNews, CNN, etc., we can often uncover deleted tweets. The question of "How Much of the Web Is Archived?" has been discussed by Ainsworth et al., but there has been no separate analysis of how much of Twitter is archived, which would help us estimate how reliably deleted tweets can be found using web archives.

Web services like Politwoops track deleted tweets of public officials, including people currently in office and candidates for office in the USA and some EU nations. However, tweets deleted before a person becomes a candidate or after a person leaves office are not covered. Although Politwoops tracks elected officials, it misses appointed government officials like Michael Flynn. For these Twitter accounts, web archives are the lone solution for finding deleted tweets. Another reason not to rely on such services alone is that Twitter can ban them: it happened in June 2015, with Twitter citing a violation of its developer agreement, and it took until December 2015 for Politwoops to resume its service. These bans suggest that we should explore web archives to uncover deleted tweets in case services like Politwoops are banned again.

Why are deleted tweets important?


With the surge in the usage of social media sites like Twitter and Facebook, researchers have been using them to study patterns of online user behaviour. In the context of Twitter, deleted tweets play an important role in understanding users' behavioural patterns. In "An Examination of Regret in Bullying Tweets", Xu et al. built an SVM-based classifier to predict which bullying-related tweets users would later regret and delete. Petrovic et al., in "I Wish I Didn't Say That! Analyzing and Predicting Deleted Messages in Twitter", discuss the reasons tweets are deleted and use a machine learning approach to predict deletion, concluding that tweets with swear words have a higher probability of being deleted. Zhou et al., in their papers "Tweet Properly: Analyzing Deleted Tweets to Understand and Identify Regrettable Ones" and "Identifying Regrettable Messages from Tweets", note that the impact of a published tweet cannot be fully undone by deletion, as other users may have noticed and cached the tweet before it was deleted.


How were deleted tweets found?


To begin our analysis, we used the Twitter API to fetch the most recent 3200 tweets from Breitbart News' Twitter timeline. The live tweets fetched from the Twitter API spanned 2017-10-22 to 2018-02-18. We then fetched the TimeMap for Breitbart's Twitter page using MemGator, the Memento aggregator service built by Sawood Alam. Using the URI-Ms from the TimeMap, we collected mementos of Breitbart's Twitter page within the time range of the live tweets fetched from the Twitter API.

Code to fetch recent tweets using Python-Twitter API

import twitter

api = twitter.Api(consumer_key='xxxxxx',
                  consumer_secret='xxxxxx',
                  access_token_key='xxxxxx',
                  access_token_secret='xxxxxx',
                  sleep_on_rate_limit=True)

# screen_name is the Twitter handle to fetch, e.g., 'BreitbartNews'
twitter_response = api.GetUserTimeline(screen_name=screen_name, count=200, include_rts=True)

Shell command to run Memgator locally 

$ memgator --contimeout=10s --agent=XXXXXX server 
MemGator 1.0-rc7

TimeMap : http://localhost:1208/timemap/{FORMAT}/{URI-R}
TimeGate : http://localhost:1208/timegate/{URI-R} [Accept-Datetime]
Memento : http://localhost:1208/memento[/{FORMAT}|proxy]/{DATETIME}/{URI-R}

# FORMAT => link|json|cdxj
# DATETIME => YYYY[MM[DD[hh[mm[ss]]]]]
# Accept-Datetime => Header in RFC1123 format

Code to fetch the TimeMap for any Twitter handle

url ="http://localhost:1208/timemap/"
data_format ="cdxj"
command = url + data_format +"/http://twitter.com/<screen-name>"+
response = requests.get(command)
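
To restrict the analysis to the period covered by the live tweets, the memento lines in the CDXJ TimeMap can be filtered by their capture datetimes. Below is a minimal sketch, assuming each memento line returned by MemGator begins with a 14-digit datetime followed by a JSON object whose "uri" key holds the URI-M (the function and variable names here are illustrative, not part of MemGator):

import json
from datetime import datetime

import requests

def mementos_in_range(cdxj_text, start, end):
    # Yield URI-Ms whose capture datetime falls within [start, end]
    for line in cdxj_text.splitlines():
        parts = line.split(' ', 1)
        # Skip metadata lines such as "!context" and "!meta"
        if len(parts) != 2 or not (len(parts[0]) == 14 and parts[0].isdigit()):
            continue
        capture_dt = datetime.strptime(parts[0], "%Y%m%d%H%M%S")
        if start <= capture_dt <= end:
            yield json.loads(parts[1])["uri"]

timemap = requests.get("http://localhost:1208/timemap/cdxj/http://twitter.com/<screen-name>")
for urim in mementos_in_range(timemap.text, datetime(2017, 10, 22), datetime(2018, 2, 18)):
    print(urim)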
We parsed tweets and their tweet ids from each memento and compared each archived tweet id with the live tweet ids fetched using the Twitter API. We then used the Twitter API to check the status of tweet ids that were present in the web archives but missing from the live timeline, confirming whether those tweets had actually been deleted (a sketch of this comparison step appears after the parsing code below). Comparing the live and archived tweets, we discovered 22 deleted tweets from Breitbart News.

Code to parse tweets, their timestamps and tweet ids from mementos


import bs4

soup = bs4.BeautifulSoup(open(<HTML representation of Memento>), "html.parser")
match_tweet_div_tag = soup.select('div.js-stream-tweet')
for tag in match_tweet_div_tag:
    if tag.has_attr("data-tweet-id"):
        # Get tweet id
        tweet_id = tag["data-tweet-id"]
        # Parse tweet text
        match_timeline_tweets = tag.select('p.js-tweet-text.tweet-text')
        ...........
        # Parse tweet timestamp
        match_tweet_timestamp = tag.find("span", {"class": "js-short-timestamp"})
        ...........
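
Below is a minimal sketch of the comparison and validation step, assuming live_ids holds the tweet ids returned by the Twitter API, archived_ids holds the ids parsed from the mementos, and api is the python-twitter client created earlier (the variable names are illustrative):

import twitter

candidate_deleted = archived_ids - live_ids   # archived but absent from the live timeline

confirmed_deleted = set()
for tweet_id in candidate_deleted:
    try:
        api.GetStatus(tweet_id)               # still retrievable: not deleted, just older than 3200 tweets
    except twitter.TwitterError:
        confirmed_deleted.add(tweet_id)       # API reports the status is gone

print(len(confirmed_deleted), "deleted tweets found")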

Analysis of Deleted Tweets from Breitbart News


The most prominent of the 22 deleted tweets was the Super Bowl tweet mentioned above. For people who are unaware of the role of web archives: taking screenshots of content you fear might be lost is smart, but it is even better to push that content to the web archives, where it will be preserved far longer than in someone's private archive. For further information, refer to Plinio Vargas's blog post "Links to Web Archives, not Search Engine Caches", which discusses the difference between archived pages and search engine caches in terms of how long the pages persist.

Fig 1 - Super Bowl tweet on Internet Archive
Tweet Memento at Internet Archive
Another tweet was originally posted by Allum Bokhari, a senior Breitbart correspondent, retweeted by Breitbart News, and later un-retweeted. The original tweet from Allum Bokhari is still present on the live web, but the retweet is gone; a plausible reason is that Breitbart News later retweeted a similar post from Allum Bokhari.
Undo retweet of Breitbart News
Fig 2 - Archived version of unretweeted tweet by Breitbart News
Tweet memento at the Internet Archive

Fig 3 - Live version of unretweeted tweet by Breitbart News
Live Tweet Status
Of the 22 deleted tweets, 20 were cases where Breitbart News had retweeted someone's tweet but the original tweet was later deleted. Of those 20, 18 were from two Breitbart News affiliates, NolteNC and John Carney. Therefore, we decided to look at both accounts to determine the reason for their deleted tweets.

Analysis of deleted tweets from John Carney and  NolteNC


We fetched live tweets for John Carney using the Twitter API, then fetched the TimeMap for his Twitter page using MemGator and collected mementos within the time range of the live tweets. Due to the low number of mementos within that time range, this analysis showed no deleted tweets. We then fetched live tweets from the Twitter API for John Carney over the course of a week and compared each day's response with all previous responses to find deleted tweets. We discovered that tweets older than seven days are automatically deleted on Tuesdays and Saturdays. The precision of this deletion pattern suggests the use of an automated tweet deletion service. There are a number of such services, like Twitter Deleter, Tweet Eraser, etc., which delete tweets based on conditions such as the age of a tweet or the number of tweets to keep on the timeline at any given time. A sketch of how such a pattern can be detected is shown below.
Fig 4 - John Carney's tweet deletion pattern shown with 50 tweet ids
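
Below is a minimal sketch of detecting the deletion pattern, assuming snapshots is a chronologically ordered list of (fetch_date, tweet_id_set) pairs collected from the Twitter API on successive days (the variable names are illustrative):

from collections import Counter

deletion_days = Counter()
for (prev_date, prev_ids), (curr_date, curr_ids) in zip(snapshots, snapshots[1:]):
    deleted_between = prev_ids - curr_ids
    if deleted_between:
        # Attribute the deletions to the weekday of the later fetch
        deletion_days[curr_date.strftime("%A")] += len(deleted_between)

print(deletion_days)   # deletions clustering on, e.g., Tuesday and Saturday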
We repeated the process for NolteNC: we fetched live tweets using the Twitter API, fetched the TimeMap for NolteNC's Twitter page using MemGator, and collected mementos within the time range of the live tweets. For NolteNC, we had a considerable number of mementos within that range, enough to discover deleted tweets. Our analysis yielded 169 live tweets and 3569 deleted tweets from 2017-11-03 to 2018-02-17.
Fig 5 - NolteNC's original tweet


Fig 6 - Breitbart News retweeting NolteNC's tweet.
With thousands of deleted tweets, it seemed unlikely that he was deleting tweets manually. We had every reason to believe that, like John Carney, NolteNC deleted tweets automatically using a tweet deletion service. We collected live tweets for his account over a week and compared each response with all previous responses from the Twitter API, concluding that all of his tweets aged more than seven days were deleted on Wednesdays and Saturdays.
Fig 7 - NolteNC's tweet deletion pattern shown with 50 tweets 

Conclusions

  1. It is not enough to take screenshots of controversial tweets; web content that we fear may be lost and wish to preserve should be pushed to the web archives, which have a longer retention capability than our personal archives.
  2. For finding deleted tweets, web archives work well for popular accounts because they are archived often, but for less popular accounts with fewer mementos this approach will not work.
  3. Although Breitbart News does not delete tweets often, some of its correspondents automatically delete their tweets, effectively deleting the corresponding retweets.
--
Mohammed Nauman Siddique (@m_nsiddique)


2018-04-24: Let's Get Visual and Examine Web Page Surrogates


Why visualize individual web pages? A variety of visualizations of individual web pages exist, but why do we need them when we can just choose a URI from a list and put it in our web browser? URIs are intended to be opaque: text from the underlying web resource does not need to exist in the URI.

Consider http://dx.doi.org/10.1007/s00799-016-0200-8. Where does it go? Should we click on it? What content exists under the veil of the URI? Will it meet our needs?

Now consider this web page surrogate produced by embed.ly for the same URI:

Avoiding spoilers: wiki time travel with Sheldon Cooper

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if...
If we were looking for research papers about avoiding spoilers for TV shows, then we know that clicking on this surrogate will take us to something that meets our information needs. If we were searching for marine mammals, then this surrogate shows us that the underlying page will not be very satisfying. In this case, the surrogate is intended to give the user enough information to answer the question: should I click on this?

Last year, when I reviewed a number of live web curation and social media tools, I was primarily focused on tools that produce social cards like the one above. This was because social cards appeared to be the lingua franca of web page surrogates. Social cards are not the only surrogate in use today and definitely not the only surrogate evaluated in the literature. In this post, I cover several surrogates that have been evaluated and then talk about the studies in which they played a part. I was curious as to which surrogate might be best for collections of mementos.

Different Web Page Surrogates




Text Snippet


Text snippets are one of the earliest surrogates. They only require fetching a given web page before selecting the text to be used in the snippet. The text selection can be done via many different methods like El-Beltagy's "KP-Miner: A Keyphrase Extraction System for English and Arabic Documents" and Chen's "A Practical System of Keyphrase Extraction for Web Pages". Text snippets are typically used by search engines for displaying results.

The Google search result text snippet for Michele Weigle's ODU CS page.
The Bing search result text snippet for Michele Weigle's ODU CS homepage. Note that Bing did not capture the last modified date, but does list a series of links on the bottom of the snippet, drawn from the menu of the homepage.
The DuckDuckGo search result for Michele Weigle's ODU CS homepage. Note that DuckDuckGo displays the favicon and generates a different text snippet from Google and Bing.

In the above search results for Michele Weigle's ODU CS homepage, the text snippets are slightly different depending on the search engine. Because there is a lot of variation in web pages, there are a lot of possibilities when building text snippets.

Text snippets still receive a bit of research, with Maxwell evaluating the effectiveness of snippet length in 2017 as part of "A Study of Snippet Length and Informativeness" (university repository copy).

As a group, text snippets are listed one per row on a web page. This is optimal for search results, as the position of the result conveys its relevancy. This format affects how many surrogates can be viewed at once: where text snippets are viewed one per row, more thumbnails can fit into the same amount of space.

Thumbnail


A thumbnail is produced by loading the given page in a browser and taking a screenshot of the contents of the browser window. Thumbnails have been used in many forms; the Safari web browser, for example, uses them to display the contents of tabs.
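
For example, such a thumbnail can be generated with a headless browser; the command below uses headless Chrome with an arbitrary output filename, window size, and example URL:

$ google-chrome --headless --disable-gpu --window-size=1024,768 --screenshot=thumbnail.png https://example.com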

The Safari web browser uses thumbnails to show surrogates for web pages  that are currently loaded in its tabs.
In "Visual preview for link traversal on the World Wide Web", Kopetzky demonstrated that thumbnails could be used to provide a preview of a linked page via a mouseover effect so that users could decide if a link was worth clicking. In "Data Mountain: Using Spatial Memory for Document Management" (Microsoft Research copy), Robertson proposed using a 3D virtual environment for organizing a corpus of web pages where each page is visualized as a thumbnail. Outside of the web, file management tools, such as macOS's Finder, use thumbnails to provide visual previews of documents.

An example of the interface for Data Mountain, a 3D environment for browsing web pages via thumbnails.

macOS Finder displaying thumbnails of file contents.

In the web archiving world, the UK Web Archive uses thumbnails to show a series of mementos so one can compare the content of each memento, effectively viewing the content drift over time. Thumbnails are also used in our own What Did It Look Like?, a platform that animates thumbnails so one can watch the changes to a web page over the years. Our group is also investigating the use of thumbnails for summarizing how a single webpage has changed over time, using three different visualizations: an animation, a grid view, and an interactive timeline view.

The UK Web Archive uses thumbnails to show different mementos for the same resource, allowing the user to view web page changes over time.

What Did It Look Like? allows the user to watch a web page change over time by animating the thumbnails of the mementos of a resource.

The size of thumbnails has a serious effect on their utility. If the thumbnail is too large, it does not provide room for comparison of surrogates. If the thumbnail is too small, users cannot see what is in the image. Thumbnails are also difficult for users to understand if a page consists mostly of text or has no unique features. In "How People Recognize Previously Seen Web Pages from Titles, URLs and Thumbnails", Kaasten established that the optimal thumbnail size is 208x208 pixels.

The viewport of a thumbnail is also an important part of its construction. Depending on what we want to emphasize on a web page, we may need to generate a thumbnail from content "below the fold". Aula evaluated the use of thumbnails that were the same size, but had magnified a portion of a web page at 20% versus 38%. She found that users performed better with thumbnails at a magnification of 20%.

Enhanced Thumbnail


In 2001, Woodruff introduced the enhanced thumbnail in "Using Thumbnails to Search the Web" (author copy). Prior to taking the screenshot of the browser as with a normal thumbnail, the HTML of the page is modified to make certain terms stand out. In the example below, changes in font size and background color emphasize certain terms of a page. The goal is to draw attention to these terms in hopes that search engine users could find relevant pages faster.

Examples of Thumbnails and Enhanced Thumbnails:
(a) Plain thumbnail
(b) Enhanced Thumbnail using HTML modification to emphasize the words "Recipe" and "Pound Cake"
(c) Enhanced Thumbnail using HTML and image modification to make "Recipe" and "Pound Cake" stand out more
(d) Emphasis on "MiniDisc Player"
(e) Emphasis on "hybrid", "car", and "mileage"
(f) Emphasis on "Hellerstein"
(g) Plain thumbnail of a page only consisting of text
(h) Enhanced thumbnail emphasizing specific terms in the text page


Even though enhanced thumbnails have performed well, they are computationally expensive to create. This likely explains why they have not been seen in use outside of laboratory studies.

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali developed something similar by adding a tag cloud to each thumbnail and named the concept a "visual tag".

Internal Image


An internal image is an image embedded within the web page. For some web pages, like news stories and product pages, these internal images can be good surrogates because of their uniqueness. Pinterest uses internal images as surrogates.

Pinterest uses internal images as surrogates for web pages.

The key is identifying which embedded image is best for representing the page. Hu identified the issues with solving this problem as part of "Categorizing Images in Web Documents", identifying a number of features such as using the text surrounding an image and evaluating the number of colors in the image. Maekawa worked on classifying images and achieved an 83.1% accuracy in "Image Classification for Mobile Web Browsing" (conference copy). While these studies provided solutions for classifying images, we really need to know which images are unique and relevant to the web page. Research does exist to address this issue, such as the work described in Li's "Improving relevance judgment of web search results with image excerpts" (conference copy). These solutions are imperfect, which may be why Pinterest and other sites ask the user to choose an image from those embedded in the page.


Visual Snippet


In 2009, Teevan introduced visual snippets as part of "Visual snippets: summarizing web pages for search and revisitation" (Microsoft Research copy, conference slides). Teevan gave 20 web pages to a graphic designer and asked him to generate a small 120x120 image representing each page. She observed a pattern in the resulting images and derived a template to use as a surrogate. These surrogates combine the internal image, placed within the background of the surrogate, with a title running across the top of the page, and a page logo.

Examples of thumbnails on the bottom and their corresponding visual snippets on top.
She used machine learning to choose a good internal image and logo. This is more complex than merely selecting a salient internal image as noted in the previous section. Not only does the visual snippet require two images, but two different types of images.

External Image


In 2010, Jiao put forth the idea of using external images in "Visual Summarization of Web Pages". Jiao notes that detecting the internal image may be difficult if not impossible for some pages. Instead, he suggests using image search engines to find a representative image to use as a surrogate.

A simplified version of his algorithm is:

  1. Extract key phrases from the target web page using Chen's KEX algorithm
  2. Use these phrases as queries for an image search engine
  3. Rerank the search engine results based on textual similarity to the target web page
  4. Choose the top ranked image
Though this would likely work well for live web pages about products, it may be a poor fit for mementos due to the temporal nature of words. Consider a memento from the late 1990s where one of the key phrases extracted contains the word Clinton. In the 1990s, the document was likely referring to US President Bill Clinton. If we use a search engine in 2018, it may return an image of 2016 presidential candidate Hillary Clinton. Some of these temporal issues have been detailed as part of the Longitudinal Analytics on Web Archive Data (LAWA) project.

Text + Thumbnail


In "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages" (google research copy) by Aula and "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" by Dziadosz, the authors consider the combination of text with a thumbnail as a surrogate.

The Internet Archive uses text and thumbnails for its search results, seen in the screenshot below.

The Internet Archive uses thumbnails and text together as part of its search results.
Al Maqbali further extended this concept with text + visual tags.

Social Card


The social card goes by many names: rich link, snippet, social snippet, social media card, Twitter card, embedded representation, or rich object. The social card typically consists of an image, a title, and a text snippet from the web page it visualizes.

The data within the social card is typically drawn from data within the meta tags of the HTML of the target web page. As an artifact of social media, different social media platforms consult different meta tags within the target page.

For example, for Twitter, I used the following tags to produce the card below:
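
(The markup below is an illustrative set of Twitter Card meta tags of the kind Twitter consults; the title, description, and image values are placeholders, not the actual markup of shawnmjones.org.)

<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Shawn M. Jones">
<meta name="twitter:description" content="Personal home page of Shawn M. Jones.">
<meta name="twitter:image" content="https://www.shawnmjones.org/images/profile.png">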

Social card for https://www.shawnmjones.org as seen on Twitter.


For Facebook, I used the following tags to produce the card below:
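
(Again, the markup below is an illustrative set of Open Graph meta tags of the kind Facebook consults; the values are placeholders.)

<meta property="og:type" content="website">
<meta property="og:url" content="https://www.shawnmjones.org">
<meta property="og:title" content="Shawn M. Jones">
<meta property="og:description" content="Personal home page of Shawn M. Jones.">
<meta property="og:image" content="https://www.shawnmjones.org/images/profile.png">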
Social card for https://www.shawnmjones.org as seen on Facebook.


Note how the HTML tags are different for each service. Facebook supports the Open Graph Protocol, developed around 2009 (according to the CarbonDate service) whereas Twitter's features were developed around 2010 (according to CarbonDate). There are pages that lack this kind of assistive markup. To produce those cards, social media platforms will often use other methods, like those mentioned above, to extract a text snippet and an internal image. Any mementos captured prior to 2009 will not have the benefit of this assistive markup.

Though most social cards generated by platforms come in landscape form, some platforms generate a portrait form as well. The intended use of the social cards and the nature of other visual cues on the platform often drive the decision as to which form the social card should take. All of the studies in this blog post evaluated social cards in their landscape form.
A landscape social card from Facebook.
A portrait social card from Google+.


Social cards are not just used by social media. Wikipedia uses social cards to provide a preview of links if the user hovers over the link, like what Kopetzky had envisioned with thumbnails. Google News often uses social cards for individual stories. Social cards sometimes include additional information beyond text snippet and image. In "What's Happening and What Happened: Searching the Social Web" Omar Alonso detailed the use of social cards in a prototype for Bing search results. Those cards also incorporated lists of users who shared the target web page as well as associated hashtags.

When a user hovers over an internal link, Wikipedia uses social cards  to display a preview of the linked web page.
Google News often uses social cards to list individual news articles.

There are similar concepts that are not instances of the social card. Some of the cards used by Google News are not social cards because each is a surrogate for a news story spanning multiple resources, rather than a single resource. Likewise, search engines use entity cards to display information about a specific entity drawn from multiple sources. Entity cards have been found to be useful by Bota's 2016 study "Playing Your Cards Right: The Effect of Entity Cards on Search Behaviour and Workload". I do not consider entity cards to be social cards because each social card is a surrogate for a single web resource, whereas an entity card is a surrogate for a conceptual entity and is drawn from multiple sources.
This card used by Google News is not a surrogate for a single web resource, and hence I do not consider it a social card.
This card format, used by Google is also not a surrogate for a single web resource. This is an entity card, drawing from multiple web resources.

The creation of social cards can also be a lucrative market, with Embed.ly offering plans for web platforms ranging from $9 to $99 per month. They provide embedding services for the long form blogging service Medium, supporting a limited number of source websites. Individual cards can be made on their code generation page.

Evaluations of these Surrogates


Web page surrogates have been of great interest to those studying search engine result pages. I have reviewed the nine studies listed in the table below, most of them mentioned above, focusing on how these studies compared surrogates with each other.


Author & YearText
Snippet
Internal/
External
Image
Visual
Snippet
ThumbnailEnhanced
Thumbnail/
Visual Tags
Text + ThumbnailSocial Card
Woodruff 2001XXX
Dziadosz 2002XXX
Li 2008XX
Teevan 2009XXX
Jiao 2010XXX
Aula 2010XXX
Al Maqbali 2010XXXXX
Loumakis 2011XXX
Capra 2013XXX


As noted above, Woodruff introduced the concept of enhanced thumbnails in "Using Thumbnails to Search the Web". To evaluate their effectiveness, she generated questions based on tasks users commonly perform on the web. The questions were divided into 4 categories, and 3 questions from each category were given to 18 participants. The participants were presented with search engine result pages consisting of 100 text snippets, thumbnails, or enhanced thumbnails. Participants were evaluated on their response times while trying to find web resources that would address their assigned questions. The results indicated that enhanced thumbnails provided the fastest response times overall, but the results varied depending on the type of task. For locating an entity's homepage, text snippets and enhanced thumbnails performed roughly the same. For finding a picture of an entity, thumbnails and enhanced thumbnails performed roughly the same. All three surrogate types performed equally well for e-commerce or medical side-effect questions.

Dziadosz tested the concept of text snippets combined with thumbnails in "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" In this study, 35 participants were each given 2 queries and 2 tasks, with each participant given a different surrogate type. The first task was to identify all search engine results on the page that they assumed to be relevant to their query. The second task was to visit the pages behind the surrogates and identify which were actually relevant. The number of correct decisions for text snippets combined with thumbnails was higher than for text alone or thumbnails alone. Aula, in "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages", also evaluated text snippets, thumbnails, and their combination, and found that they were effective for making relevance judgements.

Teevan evaluated the effectiveness of visual snippets in "Visual snippets: summarizing web pages for search and revisitation". Her study consisted of 276 participants who were each given 12 search tasks and a set of 20 search results, with 4 of the 12 tasks completed with different surrogates. She discovered that text snippets required the fewest clicks and thumbnails the most, indicating a lot of false positive matches for participants when using thumbnails. Participants preferred visual snippets and text snippets equally over thumbnails, and preferred visual snippets for shopping tasks. Most participants found thumbnails to be too small to be useful.

Jiao introduced the concept of using external images as a surrogate in "Visual Summarization of Web Pages". He compared the use of internal images, external images, thumbnails, and visual snippets. As in Dziadosz's study, participants were asked to guess the relevance of the web page behind the surrogate and then later evaluate whether their earlier guess was correct. To generate search results, he randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Bing. His results show that none of the surrogates works for all types of pages. Overall, internal images were best for pages that contained a dominant image, whereas thumbnails or external images were best for understanding pages that did not contain a dominant image.

In "Improving relevance judgment of web search results with image excerpts", Li was interested in identifying dominant images in web pages. I focus here on the second study in his work which compares text snippets and social cards. They randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Google. The search engine results were then evaluated and reformatted into either text snippets or social cards. Two groups of 12 students each were given the queries either classified by their functionalities or semantic categories. The participants were evaluated based on the number of clicks of relevant results and also on the amount of time they took with each search. Social cards were the clear winner over text snippets in terms of time and clicks.

Loumakis, in "This Image Smells Good: Effects of Image Information Scent in Search Engine Results Pages" (university copy) attempted to compare the performance of images, text snippets, and social cards. Using preselected queries and 81 participants, Loumakis also reformatted Google search results. He did not get the same level of performance in his study, noting that "Adding an image to a SERP result will not significantly help users in identifying correct results, but neither will it significantly hinder them if an image is placed with text cues where the scents may conflict."

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali explored the use of different image augmentations for visual snippets, text + thumbnail, social card, text + visual snippet, and a text + tag cloud/thumbnail combination. Al Maqbali had 65 participants evaluate the relevance of search engine result pages as in the prior studies. This study reached the same conclusion as Loumakis: adding images to text snippets does not appear to make a difference to the performance of search engine users.

To further understand the disagreement between the results of Loumakis, Al Maqbali, and Li, in "Augmenting web search surrogates with images", Capra explored the effectiveness of text snippets and social cards. He wanted to determine if the quality or relevance of the image used in the social card had any effect on performance. Prior to the relevance study, he had one set of participants rate individual internal images for a social card as good, bad, or mixed. For individual surrogates, Capra discovered that text snippets with good images have a slightly higher, statistically significant, accuracy score than text snippets alone, at the cost of judgement duration for each surrogate. The accuracy for text snippets was 0.864, the accuracy for social cards with bad images was also 0.864, and the accuracy for social cards with good images was 0.884. If the search engine result pages were evaluated overall, then there was evidence that good images improved accuracy for ambiguous queries (e.g., jaguar the car or the cat?), but in this case the improvements were not statistically significant.

Deciding on the best surrogate for use with web pages depends on a number of factors, and the studies comparing these surrogates have some disagreement. Text snippets continue to endure for search results likely due to Capra's, Al Maqbali's, and Loumakis' results. Social cards are preferred by users, but the minor improvement in search time and relevance accuracy does not warrant the effort necessary to select a good internal image for the card. This means that social cards are effectively relegated for use in social media where each can be generated individually rather than with hundreds of search results. This also means that thumbnails are relegated to other tasks, such as a surrogate for a file on a filesystem or within a browser's interface. As most of these studies focused primarily on search engine results, it is likely that many of these surrogates work better with other use cases.

Surrogates for Mementos


There are more uses for surrogates than search engine results. When grouped together, some surrogates provide more information than the answer to the question should I click on this?

Enhanced thumbnails often reflect the search terms of the query provided by the user. Most memento applications do not have a query, and hence there are no words or phrases to enhance within the thumbnail. Al Maqbali's tag cloud concept may be of interest here. I am examining other ways to expose words and phrases of interest from archived collections, so this surrogate type may find new life in mementos.

Internal images are often used as part of social cards. If one could expose the images that tie to a particular theme in a web archive collection, then it is possible that we could select images for use as memento surrogates within the theme of the collection. This would likely require some form of machine learning to be viable. This same process goes for visual snippets.

As noted above, external images are problematic surrogates for mementos due to the temporal nature of words. If we could divide a web archive into specific time periods, then external images could be extracted from pages around the same time, limiting the amount of temporal shift.

Thumbnails are often useful in groups to demonstrate the content drift of a single web resource. For this surrogate group to be useful, the consumer of such a thumbnail gallery needs to understand the direction that time flows in the visualization. Thumbnails are not limited to the "one-per-row" paradigm of landscape social cards or text snippets, and hence thumbnails can be presented in a grid formation. This can be confusing to the user trying to compare the content drift of a resource, but textual cues, such as the memento-datetime, placed above or below the thumbnail can clear up this confusion.

Storytelling often uses surrogates in the form of social cards, where the surrogates are visualizations of the underlying web pages. When provided as a series of social cards, one per row, in order of publication date or memento-datetime, collections of these surrogates can convey information about an unfolding news story, such as in AlNoamany's collection summarization work (preprint version, dissertation). Many mementos do not have the metadata that might assist in finding a good internal image. This means that any service providing social cards for mementos must instead rely upon a number of image selection algorithms with differing levels of success. Because text snippets are essentially social cards lacking an image, is it possible that they, too, would be suitable in this context?

Conclusion


I started on this journey looking for the best surrogate for use with mementos. I discovered many different surrogates for web resources. The studies evaluating these different surrogates focused on the success of users finding relevant information in search engine results. It appears that the search engine industry has largely settled on text snippets, as they are the least expensive surrogate to produce and studies indicate that the addition of images has minimal impact on their effectiveness. Mementos have many different uses, and it is possible that one or more of these surrogates may be a better fit for their temporal nature. Now that I am developing a vocabulary for these surrogates, I can start to explore how they might best be used with mementos, bringing other useful visualizations to web archive collections.

-- Shawn M. Jones

2018-04-24: Why we need multiple web archives: the case of blog.reidreport.com


This story started in December 2017 with Joy-Ann Reid (of MSNBC) apologizing for "insensitive LGBT blog posts" that she wrote on her blog many years ago when she was a morning radio talk show host in Florida. This apology was, at least in some quarters, (begrudgingly) accepted. Today's update was news that Reid and her lawyers had in December claimed that either her blog or the Internet Archive's record of the blog had been hacked (Mediaite, The Intercept). Later today, the Internet Archive issued a blog post denying the claim that it was hacked, stating:
This past December, Reid’s lawyers contacted us, asking to have archives of the blog (blog.reidreport.com) taken down, stating that “fraudulent” posts were “inserted into legitimate content” in our archives of the blog. Her attorneys stated that they didn’t know if the alleged insertion happened on the original site or with our archives (Reid’s claim regarding the point of manipulation is still unclear to us).
...
At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog.   
Checking the Internet Archive for robots.txt, we can see that on 2018-02-16 blog.reidreport.com had a standard robots.txt page that blocked the admin section of WordPress, but by 2018-02-21 they had a version that blocked all robots, and as of today (2018-04-24) they had a version that specifically blocked only the Internet Archive's crawler ("ia_archiver").  As of about 5pm EDT, the robots.txt file had been removed (probably because of the Internet Archive's blog post calling out the presence of the robots.txt; cf. a similar situation in 2013 with the Conservative Party in the UK), but it may take a while for the Internet Archive to register its absence.
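
For reference, a robots.txt that blocks only the Internet Archive's crawler takes this form:

User-agent: ia_archiver
Disallow: /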

2018-04-25 update: Thanks to Peter Sterne for pointing out that www.blog.reidreport.com/robots.txt still exists, even though blog.reidreport.com/robots.txt does not.  They technically can be two different URLs though the convention is for them to canonicalize to the same URL (which is what the Wayback Machine does).  HTTP session info provided below, but the summary is that robots.txt is still in effect and the need for other web archives is still paramount. 



Until the Internet Archive begins serving blog.reidreport.com again, this is a good time to remind everyone that there are web archives other than the Internet Archive.  The screen shot above shows the Memento Time Travel service, which searches about 26 public web archives.  In this case, it found mementos (i.e., captures of web pages) in five different web archives: Archive-It (a subsidiary of the Internet Archive), Bibliotheca Alexandrina (the Egyptian Web Archive), the National Library of Ireland, the archive.is on-demand archiving service, and the Library of Congress.  For a machine readable service, below I list the TimeMap (list of mementos) generated by our MemGator service; the details aren't important but it is the source of the URLs that will appear next.  

Beginning with the original tweets by @Jamie_Maz (2017-11-30 thread, 2018-04-18 thread), I scanned through the screen shots (no URLs were given) and looked for screen shots that had definitive datetimes (most images did not have them). The datetimes are listed below (the ones for which we have evidence are in bold, and the ones that we inferred by matching text are marked with "(inferred)"):

2005-04-25
2005-07-16
2005-07-21
2006-01-20 (inferred)
2006-06-05
2006-06-13 (inferred)
2006-10-03
2006-12-23
2007-02-21
2008-07-04
2008-10-16
2009-01-15
(update: because of canonicalization errors, some of the URLs are not being excluded; see below)

Most of those dates are pretty early in web archiving times, when the Internet Archive was the only archive commonly available, and many (all?) of the mementos in other web archives were surely originally crawled by the Internet Archive, even if on a contract basis (e.g., for the Library of Congress).  Nonetheless, with multiple copies geographically and administratively dispersed throughout the globe, an adversary would have had to hack multiple web archives and alter their contents (cf. lockss.org), or have hacked the original site (blog.reidreport.com) approximately 12 years ago for adulterated pages to have been hosted at all the different web archives.  While both scenarios are technically possible, they are extraordinarily unlikely.  

While we don't know the totality of the hacking claims, we can offer five archived web pages, hosted at the Library of Congress web archive (webarchive.loc.gov), that corroborate at least some of the claims by @Jamie_Maz.

2006-01-20


Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20060125004941/http://blog.reidreport.com/ 


2006-06-05


Evidence for this tweet can be found at (approximately 2/3 down): http://webarchive.loc.gov/all/20060608144033/http://blog.reidreport.com/


2006-06-13

I'm not sure this evidence maps directly to one of the tweets, but it fits the general theme of anti-Charlie Crist: http://webarchive.loc.gov/all/20060615134635/http://blog.reidreport.com/


This memento also exists at archive.is; it is a copy of the Internet Archive's copy but it is not blocked by robots.txt because it is in another archive: http://archive.is/20060615134635/http://blog.reidreport.com/

2006-10-03



Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20061010125903/http://blog.reidreport.com/


2008-10-16


Evidence for this tweet can be found at (approximately 1/3 down): http://webarchive.loc.gov/all/20081018020856/http://blog.reidreport.com/ 



In summary, of the many examples that @Jamie_Maz provides, I can find five copies in the Library of Congress's web archive.  These crawls were probably performed on behalf of the Library of Congress by the Internet Archive (for election-based coverage); even though there are many different (and independent) web archives now, in 2006 the Internet Archive was pretty much the only game in town.  Even though these mementos are not independent observations, there is no plausible scenario for these copies to have been hacked in multiple web archives or at the original blog 10+ years ago.  There may be additional evidence in the other web archives, but I haven't exhaustively searched them.

We don't know the full details of what Reid's lawyers alleged, so perhaps there is more to the story than we know. But the analysis from the Internet Archive crawl engineers, plus the evidence in separate web archives, suggests that the claim has no merit.

The case of blog.reidreport.com is another example of why we need multiple web archives.  


--Michael


Thanks to Prof. Michele Weigle and John Berlin for bringing this issue to my attention and uncovering some of the examples.   

Memento TimeMap for blog.reidreport.com:



2018-04-25 update: As noted above, Peter Sterne brought to my attention that the non-standard URL of www.blog.reidreport.com/robots.txt still exists (and is blocking "ia_archiver") even though the more standard blog.reidreport.com/robots.txt is 404. 



Another 2018-04-25 update: The NYT has covered the story ("MSNBC Host Joy Reid Blames Hackers for Anti-Gay Blog Posts, but Questions Mount"), and there was an interview with Reid's computer security expert ("Should We Believe Joy Reid’s Blog Was Hacked? This Security Consultant Says We Should"), Jonathon Nichols.  

 I embed a statement from Nichols (released by Erik Wemple), and a tweet from Nichols clarifying that they were not suggesting that Wayback Machine's mementos were hacked, but rather the hacked blog was crawled by the Internet Archive.  

This is where it's important to note that there may be a discrepancy between the posts that Nichols is concerned with and those that @Jamie_Maz surfaced. There is (semi-)independent evidence of @Jamie_Maz's pages, with the ultimate implication that for those pages to have been the result of a hack, blog.reidreport.com would have had to have been hacked as many as 12 years ago -- and nobody noticed at the time.

Reid (& Nichols) could always unblock the Internet Archive and share the evidence of the hack. 




Yet another 2018-04-25 update: Apparently there are some holes in the http vs. https canonicalization with respect to robots.txt blockage, allowing some of the posts to surface. Here's an example (via @YanceyMc):
https://web.archive.org/web/20060225041734/https://blog.reidreport.com/2005/10/harriet-miers-and-lesbian-hair-check.html





Also, @wvualphasoldier deleted his tweets then protected his account, so that's the reason the above embed no longer formats correctly. 

Yet, Yet Another 2018-04-25 update:

Thanks to Prof. Weigle and Mat Kelly for providing examples of some of the URLs that are slipping through the robots.txt exclusion.

Here's one: https://web.archive.org/web/20060805055643/https://blog.reidreport.com

and another: https://web.archive.org/web/20050728132003/https://blog.reidreport.com:443/

It has the following information that I thought I saw in the original @Jamie_Maz tweets, but now I can't find it, so perhaps I'm misremembering. It certainly fits the overall theme.



2018-04-30: A High Fidelity MS Thesis, To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages


It is hard to believe that the time has come for me to write a wrap up blog about the adventure that was my Masters Degree and the thesis that got me to this point. If you follow this blog with any regularity you may remember two posts, written by myself, that were the genesis of my thesis topic:

Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is ok if you cannot, as I will give an oh so brief TL;DR. The replay problems with cnn.com were, sadly, your typical here today, gone tomorrow replay issues involving this little thing, which I have come to love, known as JavaScript. What we also found, when replaying mementos of cnn.com from the major web archives, was that each web archive has its own unique and subtle variation of this thing called "replay". The next post, about the curious case of mendeley.com user pages (A State Of Replay), further confirmed that for us.

We found that not only do variations exist in how web archives perform URL rewriting (URI-Rs → URI-Ms) but also that, depending on the replay scheme employed, web archives modify the JavaScript execution environment of the browser and the archived JavaScript code itself beyond URL rewriting! As you can imagine, this left us asking a number of questions that led to the realization that web archiving lacks the terminology required to effectively describe the existing styles of replay and the modifications made to an archived web page and its embedded resources in order to facilitate replay.

Thus my thesis was born and is titled "To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages".

Since I am known around the WS-DL headquarters for my love of deep diving into the secrets of (securely) replaying JavaScript, I will keep the length of this blog post to a minimum. The thesis can be broken down into three parts, namely Styles Of Replay, Memento Modifications, and Auto-Generating Client-Side Rewriters. For more detailed information about my thesis, I have embedded my defense slides below, and the full text of the thesis has been made available.

Styles Of Replay

The existing styles of replaying mementos from web archives are broken down into two distinct models, namely "Wayback" and "Non-Wayback", and each has its own distinct styles. For the sake of simplicity and the length of this blog post, I will only (briefly) cover the replay styles of the "Wayback" model.

Non-Sandboxing Replay

Non-sandboxing replay is the style of replay that does not separate the replayed memento from the archive-controlled portion of replay, namely the banner. This style of replay is considered the OG (original gangster) way for replaying mementos simply because it was, at the time, the only way to replay mementos and was introduced by the Internet Archive's Wayback Machine. To both clarify and illustrate what we mean by "does not separate the replayed memento from archive-controlled portion of replay", consider the image below displaying the HTML and frame tree for a http://2016.makemepulse.com memento replayed from the Internet Archive on October 22, 2017.

As you can see from the image above, the archive's banner and the memento exist together on the same domain (web.archive.org), implying that the replayed memento can tamper with the banner (displayed during replay) and/or interfere with the archive's control over replay. For non-malicious examples of mementos containing HTML tags that can both tamper with the banner and interfere with archive control over replay, skip to the Replay Preserving Modifications section of this post. Now to address the recent claim that "memento(s) were hacked in the archive" and its correlation to non-sandboxing replay: additional discussion on this topic can be found in Dr. Michael Nelson's blog post covering the case of blog.reidreport.com and in his presentation for the National Forum on Ethics and Archiving the Web (slides, trip report).

For a memento to be considered (actually) hacked, the web archive the memento is replayed (retrieved) from must have been compromised in a manner that requires the hack to be made within the data stores of the archive and does not involve user-initiated preservation. However, user-initiated preservation can only tamper with a non-hacked memento when it is replayed from an archive. The tampering occurs when an embedded resource, un-archived at the memento-datetime of the "hacked" memento, is archived from the future (a datetime later than the memento-datetime) and typically involves the usage of JavaScript. Unlike non-sandboxing replay, the next style of Wayback replay, sandboxed replay, directly addresses this issue and the issue of how to securely replay archived JavaScript. PS: No signs of tampering, JavaScript-based or otherwise, were present in the blog.reidreport.com mementos from the Library of Congress. How do I know? Read my thesis and/or look over my thesis defense slides; I cover in detail what is involved in the mitigation of JavaScript-based memento tampering and what that actually looks like.

Sandboxed Replay

Sandboxed replay is the style of replay that separates the replayed memento from the archive-controlled portion of the page through replay isolation. Replay isolation is the usage of an iframe to sandbox the replayed memento, served from a different domain, away from the archive-controlled portion of replay. Because replay is split across two different domains (illustrated in the image below), one for the replay of the memento and one for the archive-controlled portion of replay (the banner), the memento cannot tamper with the archive's control over replay or the banner, due to the security restrictions the browser places on web pages from different origins, called the Same Origin Policy. Web archives employing sandboxed replay typically also perform the memento modification style known as temporal jailing. This style of replay is currently employed by Webrecorder and all web archives using Pywb (the open source, Python implementation of the Wayback Machine). For more information on the security issues involved in high-fidelity web archiving, see the talk entitled Thinking Like A Hacker: Security Considerations for High-Fidelity Web Archives given by Ilya Kreymer and Jack Cushman at WAC2017 (trip report), as well as Dr. David Rosenthal's commentary on the talk.
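
A minimal sketch of replay isolation (the domain names are placeholders): the banner and the memento are served from different origins, so the Same Origin Policy keeps the replayed memento from reaching into the archive-controlled page.

<!-- served from https://archive.example.org (archive-controlled banner) -->
<div id="replay-banner">Archived 22 Oct 2017 from http://2016.makemepulse.com/</div>
<!-- the memento is replayed from a different origin inside an iframe -->
<iframe src="https://content.archive.example.org/20171022000000/http://2016.makemepulse.com/"></iframe>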

Memento Modifications

The modifications made by web archives to mementos in order to facilitate their replay can be broken down into three categories, the first of which is archival linkage.

Archival Linkage Modifications

Archival linkage modifications are made by the archive to a memento and its embedded resources in order to serve (replay) them from the archive. The archival linkage category of modifications is the most fundamental and necessary set of modifications made to mementos by web archives, simply because they prevent the Zombie Apocalypse. You are probably already familiar with this category of memento modifications, as it is more commonly referred to as URL rewriting (URI-R → URI-M).

<!-- pre rewritten -->
<link rel="stylesheet" href="/foreverTime.css">
<!-- post rewritten -->
<link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">

URL rewriting (archival linkage modification) ensures that you relive (replay) mementos from the archive, not from the live web; hence the necessity of this kind of memento modification. However, it is becoming necessary to seemingly damage mementos simply in order to replay them.

Replay Preserving Modifications

Replay preserving modifications are modifications made by web archives to specific HTML element and attribute pairs in order to negate their intended semantics. To illustrate this, let us consider two examples, the first of which was introduced by our fearless leader Dr. Michael Nelson and is known as the zombie-introducing meta refresh tag, shown below.

        class="token punctuation"><metahttp-equiv        class="token attr-value">="refresh        class="token punctuation">"content        class="token attr-value">="35;url=?zombie=666        class="token punctuation">"/>

As you may know, the meta refresh tag will, after 35 seconds, refresh the page with "?zombie=666" appended to the original URL. When a page containing this dastardly tag is archived and replayed, the refresh plus the appending of "?zombie=666" to the URI-M causes the browser to navigate to a new URI-M that was never archived. To overcome this, archives must arm themselves with the attribute prefixing shotgun in order to negate the tag and attribute's effects. A successful defense against the zombie invasion using the attribute prefixing shotgun is shown below.


class="token punctuation"><meta_http-equiv class="token attr-value">="refresh class="token punctuation">"_content class="token attr-value">="35;url=?zombie=666 class="token punctuation">"/>

Now let me introduce to you a new, more insidious tag that does not introduce a zombie into replay but rather a demon: the meta csp tag, shown below.

        class="token punctuation"><metahttp-equiv        class="token attr-value">="Content-Security-Policy        class="token punctuation">"
content class="token punctuation">="default-src http://notarchive.com; img-src .... class="token punctuation">"/>

Naturally, web archives do not want web pages to be delivering their own Content-Security-Policies via meta tag because the results are devastating, as shown by the YouTube video below.

Readers, have no fear: this issue is fixed! I fixed the meta CSP issue for Pywb and Webrecorder in pull request #274 submitted to Pywb. I also reported this to the Internet Archive and they promptly fixed it.

Temporal Jailing

The final category of modifications, known as Temporal Jailing, is the emulation of the JavaScript environment as it existed at the original memento-datetime through client-side rewriting. Temporal jailing ensures both the secure replay of JavaScript and that JavaScript cannot tamper with time (introduce zombies) by applying overrides to the JavaScript APIs provided by the browser in order to intercept un-rewritten URLs. Yes, there is more to it, a whole lot more, but because it involves replaying JavaScript and I am attempting to keep this blog post reasonably short(ish), I must force you to consult my thesis or thesis defense slides for more specific details. However, for more information about the impact of JavaScript on archivability, and measuring the impact of missing resources, see Dr. Justin Brunelle's Ph.D. wrap up blog post. The technique for the secure replay of JavaScript known as temporal jailing is currently used by Webrecorder and Pywb.

Auto-Generating Client-Side Rewriters

Have I mentioned yet just how much I love JavaScript? If not, lemme give you a brief overview of how I am auto-generating client-side rewriting libraries, created a new way to replay JavaScript (currently used in production by Webrecorder and Pywb), and increased the replay fidelity of the Internet Archive's Wayback Machine.

First up, let me introduce to you Emu: Easily Maintained Client-Side URL Rewriter (GitHub). Emu allows any web archive to generate its own generic client-side rewriting library, one that conforms to the de facto standard implementation (Pywb's wombat.js), by supplying it with the Web IDL definitions for the JavaScript APIs of the browser. Web IDL was created by the W3C to describe interfaces intended to be implemented in web browsers, to allow the behavior of common script objects in the web platform to be specified more readily, and to specify how interfaces described with Web IDL correspond to constructs within ECMAScript execution environments. You may be wondering how I can guarantee that this tool will generate a client-side rewriter providing complete coverage of the JavaScript APIs of the browser, and that we can readily obtain these Web IDL definitions. My answer is simple: consider the following excerpt from the HTML specification:

This specification uses the term document to refer to any use of HTML, ..., as well as to fully-fledged interactive applications. The term is used to refer both to Document objects and their descendant DOM trees, and to serialized byte streams using the HTML syntax or the XML syntax, depending on context ... User agents that support scripting must also be conforming implementations of the IDL fragments in this specification, as described in the Web IDL specification

Pretty cool, right? What is even cooler is that a good number of the major browsers/browser engines (Chromium, Firefox, and WebKit) generate and make publicly available Web IDL definitions representing the browser's/engine's conformance to the specification! Next up: a new way to replay JavaScript.

Remember the curious case of mendely.com user pages (A State Of Replay) and how we found that Archive-It, in addition to applying archival linkage modifications, was rewriting JavaScript code to substitute a new, foreign, archive-controlled version of the JavaScript APIs it was targeting. This is shown in the image below.

Archive-It rewriting embedded JavaScript from the memento for the curious case of mendely.com user pages

Hmmmm, looks like Archive-It is rewriting only two out of the four instances of the text string location in the example shown above. This JavaScript rewriting was targeting the Location interface, which controls the location of the browser. Ok, so how well would Pywb/Webrecorder do in this situation? From the image shown below, not as well and maybe a tad bit worse...

Pywb v0.33 replay of https://reacttraining.com/react-router/web/example/auth-workflow

That's right, folks: JavaScript rewrites in HTML. Why? See below.

Bundling HTML in JavaScript, https://reacttraining.com/react-router/15-5fae8d6cf7d50c1c6c7a.js

Because the documentation site for React Router was bundling HTML inside of JavaScript containing the text string "location" (shown above), the rewrites were exposed in the documentation's HTML displayed to page viewers (second image above). In combination with how Archive-It rewrites archived JavaScript in a similar manner, I decided this needed to be fixed. And fix it I did. Let me introduce to you a brand new way of replaying archived JavaScript, shown below.

// window proxy
new window.Proxy({}, {
  get(target, prop) { /* intercept attribute getter calls */ },
  set(target, prop, value) { /* intercept attribute setter calls */ },
  has(target, prop) { /* intercept attribute lookup */ },
  ownKeys(target) { /* intercept own property lookup */ },
  getOwnPropertyDescriptor(target, key) { /* intercept descriptor lookup */ },
  getPrototypeOf(target) { /* intercept prototype retrieval */ },
  setPrototypeOf(target, newProto) { /* intercept prototype changes */ },
  isExtensible(target) { /* intercept is-object-extensible lookup */ },
  preventExtensions(target) { /* intercept prevent-extension calls */ },
  deleteProperty(target, prop) { /* intercept property deletion */ },
  defineProperty(target, prop, desc) { /* intercept new property definition */ },
})

// document proxy
new window.Proxy(window.document, {
  get(target, prop) { /* intercept attribute getter calls */ },
  set(target, prop, value) { /* intercept attribute setter calls */ }
})

The native JavaScript Proxy object allows an archive to perform runtime reflection on the proxied object. Simply put, it allows an archive to define custom or restricted behavior for the proxied object. I have annotated the code snippet above with additional information about the particulars of how archives can use the Proxy object. Using the JavaScript Proxy object in combination with the setup shown below, web archives can guarantee the secure replay of archived JavaScript and do not have to perform the kind of rewriting shown above. Yay! Less archival modification of JavaScript!! This method of replaying archived JavaScript was merged into Pywb on August 4, 2017 (contributed by yours truly) and has been used in production by Webrecorder since August 21, 2017. Now to tell you about how I increased the replay fidelity of the Internet Archive and how you can too.

        class="token keyword">var        class="token function-variable function">__archive$assign$function__        class="token operator">=function        class="token punctuation">(name)        class="token punctuation">{/*return archive override*/        class="token punctuation">};
{
// archive overrides shadow these interfaces
let window =__archive$assign$function__ class="token punctuation">("window") class="token punctuation">;
let self =__archive$assign$function__ class="token punctuation">("self" class="token punctuation">);
let document =__archive$assign$function__ class="token punctuation">("document" class="token punctuation">);
let location =__archive$assign$function__ class="token punctuation">("location" class="token punctuation">);
let top =__archive$assign$function__ class="token punctuation">("top" class="token punctuation">);
let parent =__archive$assign$function__ class="token punctuation">("parent") class="token punctuation">;
let frames =__archive$assign$function__ class="token punctuation">("frames") class="token punctuation">;
let opener =__archive$assign$function__ class="token punctuation">("opener") class="token punctuation">;
/* archived JavaScript */
}

Ok, so I generated a client-side rewriter for the Internet Archive's Wayback Machine using the code that is now Emu and crawled 577 Internet Archive mementos from the top 700 web pages found in the Alexa top 1 million web site list circa June 2017. The crawler I wrote for this can be found on GitHub. By using the generated client-side rewriter I was able to increase the cumulative number of requests made by the Internet Archive mementos by 32.8%, a 45,051 request increase (graph of this metric shown below). Remember that each additional request corresponds to a resource that previously could not be replayed from the Wayback Machine.

Hey look, I also decreased the number of requests blocked by the content security policy of the Wayback Machine by 87.5%, un-blocking 5,972 requests (graph of this metric shown below). Remember that each un-blocked request corresponds to a URI-R the Wayback Machine could not rewrite server-side, which requires the use of client-side rewriting (Pywb and Webrecorder already use this technique).

Now you must be thinking this is impressive to say the least, but how do I know these numbers are not faked or doctored in some way to give client-side rewriting the advantage? Well, you know what they say: seeing is believing! The generated client-side rewriter used in the crawl that produced the numbers shown to you today is available as the Wayback++ Chrome and Firefox browser extension! Source code for it is on GitHub as well. And oh look, a video demonstrating the increase in replay fidelity gained if the Internet Archive were to use client-side rewriting. Oh, I almost forgot to mention that at the 1:47 mark in the video I make mementos of cnn.com replayable again from the Internet Archive. Winning!!

Pretty good for just a master's thesis, wouldn't you agree? Now it's time for the obligatory list of all the things I have created in the process of this research and my time as a master's student:

What is next, you may ask? Well, I am going to be taking a break before I start down the path known as a Ph.D. Why? To become the senior backend developer for Webrecorder, of course! There is so much to be learned from actually getting my hands dirty facilitating high-fidelity web archiving that when I return, I will have a much better idea of what my research's focus should be.

If I have said this once, I have said it a million times: when you use a web browser in the preservation process, there is no such thing as an un-archivable web page! Long live high-fidelity web archiving!

- John Berlin (@johnaberlin, @N0taN3rd)

2018-05-04: An exploration of URL diversity measures

Recently, as part of a research effort to describe a collection of URLs, I was faced with the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a diversity value of 1 indicates maximum diversity - all different URLs (Fig. 2, first row, last column):
1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story
Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be other URL diversity measures in the Web Science community that exist under different names. 
Not surprisingly, Biologists (especially Conservation Biologists) have multiple measures for quantifying biodiversity called diversity indices. In this blog post, I will briefly describe how some common biodiversity measures, in addition to the WSDL diversity index, can be used to quantify URL diversity. Additionally, I have provided recommendations for choosing a URL diversity measure depending on the problem domain. I have also provided a simple Python script that reads a text file containing URLs and produces the URL diversity scores of the different measures introduced in this post.
Fig. 2: WSDL URL diversity matrix of examples across multiple policies (URL, hostname, and domain). For all policies, the schemes, URL parameters, and fragments are stripped before calculation. For hostname diversity calculation, only the host is considered, and for domain diversity calculation, only the domain is considered.
I believe the problem of quantifying how many different species there are in a biological community is very similar to the problem of quantifying how many different URLs there are in a collection of URLs. Biodiversity measures (or diversity indices) express the degree of variety in a community. Such measures answer questions such as: does a community of mushrooms include only one, two, or three species of mushrooms? Similarly, a URL diversity measure expresses the degree of variety in a collection of URLs and answers questions such as: does a collection of URLs represent only one (e.g., cnn.com), two (cnn.com and foxnews.com), or three (cnn.com, foxnews.com, and nytimes.com) domains? Even though biodiversity indices and URL diversity measures are similar, it is important to note that since the two domains are different, their respective diversity measures reflect these differences. For example, the WSDL diversity index I introduce later does not reward duplicate URLs because duplicate URLs do not increase the informational value of a URL collection.

URL Diversity Measures

Let us consider the WSDL diversity index for quantifying URL diversity, and apply popular biodiversity indices to quantify URL diversity.

URL preprocessing:
Since URLs have aliases, the following steps were taken before the URL diversity was calculated.

1. Scheme removal: This transforms
http://www.cnn.com/path/to/story?param1=value1&param2=value2#1 
to 
www.cnn.com/path/to/story?param1=value1&param2=value2#1

2. URL parameters and fragment removal: This transforms
www.cnn.com/path/to/story?param1=value1&param2=value2#1
to
www.cnn.com/path/to/story

3. Multi-policy and combined (or unified) policy URL diversity: For the WSDL diversity index (introduced below), the URL diversity can be calculated for multiple separate policies such as the URL (www.cnn.com/path/to/story), Domain (cnn.com), or Hostname (www.cnn.com). For the biodiversity measures introduced, the URL diversity can also be calculated by combining policies; for example, by combining the Hostname (or Domain) with URL paths. This involves considering the Hostnames (or Domains) as the species and the URL paths as the individuals. I call this combined-policy approach of calculating URL diversity unified diversity.

WSDL diversity index:

The WSDL diversity index (Fig. 3) rewards variety and not duplication. It is the ratio of unique items (URIs, Domain names, or Hostnames) to the total number of items |C|. We subtract 1 from both the numerator and the denominator in order to normalize the index to the 0 - 1 range. A value of 0 (e.g., Fig. 2, first row, first column) is assigned to a list of duplicate URLs. A value of 1 is assigned to a list of distinct URLs (e.g., Fig. 2, first row, last column).
Fig. 3: The WSDL diversity index (Equation 1) and the explanation of variables. U represents the count of unique URLs (or species - R).  |C| represents the number of URLs (or individuals N).
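Based on the description above and the variable definitions in the caption (a reconstruction, not a copy of the figure itself), Equation 1 can be written as:

\[
\text{WSDL diversity index} = \frac{U - 1}{|C| - 1}
\]

This yields 0 when all URLs are duplicates (U = 1) and 1 when all URLs are distinct (U = |C|).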
Unlike the other biodiversity indices introduced next, the WSDL diversity index can be calculated for separate policies: URL, Domain, and Hostname. This is because the numerator of the formula considers uniqueness, not counts. In other words, the numerator operates over sets of URLs (no duplicates allowed), unlike the biodiversity measures, which operate over lists (duplicates allowed). Since the biodiversity measures introduced below take counts (counts of species) into account, calculating the URL diversity across multiple policies results in the same diversity value unless the policies are combined (e.g., Hostname combined with URL paths).

Simpson's diversity index:

The Simpson's diversity index (Fig. 4, equation 2) is a common diversity measure in Ecology that quantifies the degree of biodiversity (variety of species) in a community of organisms. It is also known as the Herfindahl–Hirschman Index (HHI) in Economics and the Hunter–Gaston index in Microbiology. The index simultaneously quantifies two quantities - the richness (number of different kinds of organisms) and the evenness (the proportion of each species present) in a bio-community. Additionally, the index produces diversity values ranging between 0 and 1: 0 means no diversity and 1 means maximum diversity.
Fig. 4: Simpson's diversity index (Equation 2) and Shannon's evenness index (Equation 3) and the explanation of variables (R, n_i (n subscript i), and N) they share.
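For readers without the figure, the variants consistent with the worked values reported in Fig. 5a (0.7 and 0.86) appear to be the finite-sample form of Simpson's index (Equation 2) and Shannon's evenness (Equation 3); this is my reconstruction, not a copy of the figure:

\[
D = 1 - \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)}
\qquad
E = \frac{H}{\ln R}, \quad H = -\sum_{i=1}^{R} \frac{n_i}{N} \ln\!\left(\frac{n_i}{N}\right)
\]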
Applying the Simpson's diversity index to measure URL diversity:
There are multiple variants of the Simpson's diversity index; the variant shown in Fig. 4, equation 2 is applicable to measuring URL diversity in two ways. First, we may consider URLs as the species of biological organisms (Method 1). Second, we may consider the Hostnames as the species and the URL paths as the individuals (Method 2). There are three parameters needed to use the Simpson's diversity index (Fig. 4):
Method 1:
  1. R - total number of species (or URLs)
  2. n_i (n subscript i) - number of individuals for a given species, and 
  3. N - total number of individuals
Method 2 (Unified diversity):
  1. R - total number of species where the Hostnames (or Domains) are the species
  2. n_i (n subscript i) - number of individuals (URL paths) for a given species, and
  3. N - total number of individuals
Fig. 5a applies Method 1 to calculate the URL diversity. In Fig. 5a, there are 3 different URLs, interpreted as 3 species (R = 3) in the Simpson's diversity index formula (Fig. 4, equation 2):
1. www.cnn.com/path/to/story1
2. www.cnn.com/path/to/story2
3. www.vox.com/path/to/story1

Fig. 5a: Example showing how the Simpson's diversity index and Shannon's evenness index can be applied to calculate URL diversity by setting three variables: R represents the number of species (URLs). In the example, there are 3 different URLs. n_i (n subscript i) represents the count of the species (n_1 = 3, n_2 = 1, and n_3 = 1). N represents the total number of individuals (URLs). The Simpson's diversity index (Fig. 4, equation 2) is 0.7, Shannon's evenness index - 0.86
The first URL has 3 copies, which can be interpreted as 3 individuals for the first species (n_1) in the Simpson's diversity index formula. The second and third URLs have 1 copy each; similarly, this can be interpreted as 1 individual each for the second (n_2) and third (n_3) species. In total (including duplicates) we have 5 URL individuals (N = 5). With all the parameters of the Simpson's diversity index (Fig. 4, equation 2) set, the diversity index for the example in Fig. 5a is 0.7, as worked out below.
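Working the formulas above through with these counts (n_1 = 3, n_2 = 1, n_3 = 1, N = 5, R = 3) reproduces the values reported in the caption:

\[
D = 1 - \frac{3(2) + 1(0) + 1(0)}{5(4)} = 1 - \frac{6}{20} = 0.7,
\qquad
E = \frac{-(0.6\ln 0.6 + 2 \times 0.2\ln 0.2)}{\ln 3} \approx \frac{0.950}{1.099} \approx 0.86
\]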
Fig. 5b: Example showing how the Simpson's diversity index and Shannon's diversity index can be applied to calculate unified URL diversity by interpreting Hostnames as the species (R) and the URL paths as the individuals (n_i). This method combines the Hostname (or Domain) with URL paths for URL diversity calculation.
Fig. 5b applies Method 2 to calculate the unified diversity. In the unified diversity calculation, the policies are combined (Hostname with URL paths). For example, in Fig. 5b the Hostnames represent the species and the URL paths are considered the individuals.

Shannon-Wiener diversity index:

The Shannon-Wiener diversity index, or Shannon's diversity index, comes from information theory, where it is used to quantify the entropy in a string. However, in Ecology, similar to the Simpson's index, it is applied to quantify the biodiversity in a community. It simultaneously measures the richness (number of species) and the evenness (homogeneity of the species). The Shannon's Evenness Index (SEI) is the Shannon's diversity index divided by the maximum diversity (ln(R)), which occurs when each species has the same frequency (maximum evenness).

Applying the SEI to measure URL diversity:
The variables in the SEI are the same as the variables in the Simpson's diversity index. Fig. 5a evaluates the SEI (Equation 3) for a set of URLs, while Fig. 5b calculates the unified URL diversity by interpreting the Hostnames as species.
Fig. 6: Example showing how the URL diversity indices differ. For example, the WSDL diversity index rewards URL uniqueness and penalizes URL duplication, since the duplication of URLs does not increase informational value, while the Shannon's evenness index rewards balance in the proportion of URLs. It is also important to note that calculating URL diversity across multiple separate policies (URL, domain, and hostname) is only possible with the WSDL diversity index.
I recommend using the WSDL diversity index for measuring URL diversity if the inclusion of duplicate URLs should not be rewarded and there is a need to calculate URL diversity across multiple separate policies (URL, domain, and hostname). Both the Simpson's diversity index and the Shannon's evenness index strive to simultaneously capture richness and evenness. I believe Shannon's evenness index does a better job capturing evenness, which occurs when the proportion of species is distributed evenly (Fig. 6, first row, second column). I recommend using the Simpson's diversity and Shannon's evenness indices for URL diversity calculation when the definition of diversity is similar to the Ecological meaning of diversity and the presence of duplicate URLs need not penalize the overall diversity score. The source code that implements the URL diversity measures introduced here is publicly available; a rough sketch of the core calculations follows below.
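The publicly available script is the reference implementation; as an illustration only, here is a minimal Python sketch (the function names and the canonicalization helper are my own) of how the three URL-policy measures discussed above could be computed for a list of URLs:

import math
from collections import Counter
from urllib.parse import urlsplit

def canonicalize(url):
    """Strip scheme, query parameters, and fragment (preprocessing steps 1 and 2)."""
    parts = urlsplit(url)
    return parts.netloc + parts.path

def wsdl_diversity(urls):
    """(unique - 1) / (total - 1); 0 for all duplicates, 1 for all distinct."""
    total = len(urls)
    if total < 2:
        return 0.0
    return (len(set(urls)) - 1) / (total - 1)

def simpson_diversity(urls):
    """1 - sum(n_i(n_i - 1)) / (N(N - 1)), treating each distinct URL as a species."""
    counts = Counter(urls)
    n = sum(counts.values())
    if n < 2:
        return 0.0
    return 1 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def shannon_evenness(urls):
    """H / ln(R), where H is Shannon entropy over the species proportions."""
    counts = Counter(urls)
    n, r = sum(counts.values()), len(counts)
    if r < 2:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(r)

urls = [canonicalize(u) for u in [
    "http://www.cnn.com/path/to/story1", "https://www.cnn.com/path/to/story1",
    "http://www.cnn.com/path/to/story1", "http://www.cnn.com/path/to/story2",
    "https://www.vox.com/path/to/story1",
]]
print(wsdl_diversity(urls), simpson_diversity(urls), shannon_evenness(urls))
# 0.5 0.7 0.865...  (the Simpson's and SEI values match the Fig. 5a example)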
-- Nwala (@acnwala)

2018-05-15: Archives Unleashed: Toronto Datathon Trip Report

The Archives Unleashed team (pictured below) hosted a two-day datathon, April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons to give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale and the datathons are designed to help participants find ways to incorporate these tools into their own workflow. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing.


Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz

Day 1 - April 26, 2018
Ian Milligan kicked off the event by talking about why these datathons are so important to the Archives Unleashed project team. For the project to be a success, the team needs to: build a community, create a common vision for web archiving tool development, avoid black box systems that nobody really understands, and equip the community with these tools to be able to work as a collective.


Many presentations, conversations, and tweets during the event indicated that working with web archives, particularly WARC files, can be messy, intimidating, and really difficult. The AUT tries to help simplify the process by breaking it down into four parts:
  1. Filter - focus on a date range, a single domain, or specific content
  2. Analyze - extract information that might be useful such as links, tags, named entities, etc.
  3. Aggregate - summarize the analysis by counting, finding maximum values, averages, etc.
  4. Visualize - create tables from the results or files for use in external applications, such as Gephi


We were encouraged to use the AUT throughout the event to go through the process of filtering, analyzing, aggregating, and visualizing for ourselves. Multiple datasets were provided to us and preloaded onto powerful virtual machines, provided by Compute Canada, in an effort to maximize the time spent working with the AUT instead of fiddling with settings and data transfers.



Now that we knew the who, what, and why of the datathon, it was time to create our teams and get to work. We wrote down research questions (pink), datasets (green), and technologies/techniques (yellow) we were interested in using on sticky notes and posted them on a whiteboard. Teams started to form naturally from the discussion, but not very quickly, until we got a little help from Jimmy and Ian to keep things moving.

I worked with Jayanthy Chengan, Justin Littman, Shawn Walker, and Russell White. We wanted to use the #neveragain tweet dataset to see if we could filter out spam links and create a list of better quality seed URLs for archiving. Our main goal was to use the AUT without relying on other tools that we may have already been familiar with. Many of us had never even heard of Scala, the language that AUT is written in. We had all worked through the homework leading up to the datathon, but it still took us a few hours to get over the initial jitters and become productive.

Scala was a point of contention among many participants. Why not use Python or another language that more people are familiar with and can easily interface with existing tools? Jimmy had an answer ready, as he did for every question thrown at him over the course of the event.

Around 5pm, it was time for dinner at the Duke of York. My team decided against trying to get everyone up and running on their local machines, opting instead to enjoy dinner and come back fresh for day 2.



Day 2 - April 27, 2018 
Day 2 began with what felt like an epiphany for our team:
In reality, it was more like:

Either way, we learned from the hiccups of the first day and began working at a much faster pace. All of the teams worked right up until the deadline to turn in slides, with a few coffee breaks and lightning talks sprinkled throughout. I'll include more information on the lightning talks and team presentations as they become available.

Lightning Talks
  • Jimmy Lin led a brainstorming session about moving the AUT from RDD to DataFrames. Samantha Fritz posted a summary of the feedback received where you can participate in the discussion.
  • Nick Ruest talked about Warclight, a tool that helps with discovery within a WARC collection. He showed off a demo of it after giving us a little background information.
  • Shawn Jones presented the five minute version of a blog post he wrote last year that talks about summarizing web archive collections.
  • Justin Littman presented TweetSets, a service that allows a user to derive their own Twitter dataset from existing ones. You can filter by any Tweet attributes such as text, hashtags, mentions, date created, etc.
  • Shawn Walker talked about the idea of using something similar to a credit score to warn users, in realtime, of the risk that content they're viewing may be misinformation.

At 3:30pm, Ian introduced the teams and we began final presentations right on time.

Team Make Tweets Great Again (Shawn Jones' team) used a dataset including tweets sent to @realdonaldtrump between June 2017 and now, along with tweets with #MAGA in them from June - October 2017. A few of the questions they had were:

  • As a Washington insider falls from grace, how quickly do those active in #MAGA and @realDonaldTrump shift allegiance?
  • Did sentiment change towards Bannon before and after he was fired by Trump?

They used positive or negative sentiment (emojis and text-based analysis) as an indicator of shifting allegiance towards a person. There was a decline in the sentiment rating for Steve Bannon when he was fired in August 2017, but the real takeaway is that people really love the 😂 emoji. Shawn worked with Jacqueline Whyte Appleby and Amanda Oliver. Jacqueline decided to focus on Bannon for the analysis, Amanda came up with the idea to use emojis, and Shawn used twarc to gather the information they would need.


Team Pipeline Research used datasets made up of WARC files of pipeline activism and Canadian political party pages, along with tweets (#NoASP, #NoDAPL,  #StopKM, #KinderMorgan). From the datasets, they were able to generate word clouds, find the image most frequently used, perform link analysis between pages, and analyze the frequency of hashtags used in the tweets. Through the analysis process, they discovered that some URLs had made it into the collection erroneously. 



Team Spam Links (my team) used a dataset including tweets with various hashtags related to the Never Again/March for Our Lives movement. The question we wanted to answer was "What is the best algorithm for extracting quality seed URLs from social media data?" We created a Top 50 list of URLs tweeted in the unfiltered dataset and coded them as relevant, not relevant, or indeterminate. We then came up with multiple factors to filter the dataset by (users with/without the default Twitter profile picture, with/without bio in profile, user follower counts, including/excluding retweets, etc.) and generated a new Top 50 list each time. The original Top 50 list was then compared to each of the filtered Top 50 lists.


We didn’t find a significant change in the rankings of the spam URLs, but we think that’s because there just weren’t that many in the dataset’s Top 50 to begin with. Running these experiments against other datasets and expanding the Top 50 to maybe the Top 100 or more would likely yield better results. Special thanks to Justin and Shawn Walker for getting us started and doing the heavy lifting, Russell for coding all of the URLs, and Jayanthy for figuring out Scala with me.


Team BC Teacher Labour was the final group of the day and they used a dataset from Archive-It about the British Columbia Teachers’ Labour Dispute. While exploring the dataset with the AUT, they created word clouds showing the frequency of words compared between multiple domains, network graphs showing named entities and how they related to each other, and many others. The most interesting visual they created was an animated GIF that quickly showed the primary image from each memento, giving a good overview of the types of images in the collection. 



Team Just Kidding, There’s One More Thing was a team of one: Jimmy Lin. Jimmy was busy listening to feedback about Scala vs. Python and working on his own secret project. He created a new branch of the AUT running in a Python environment, enabling some of the things people were asking for at the beginning of Day 1. Awesome.



After Jimmy’s surprise, the organizers and teams voted for the winning project. All of the projects were great, but there can only be one winner and that was Team Make Tweets Great Again! I’m still convinced there’s a correlation between the number of emojis in their presentation, their team name, and the number of votes they received but 🤷🏻‍♂️. Just kidding 😂, your presentation was 🔥. Congratulations 🎊 to Shawn and his team! 

I’m brand new to the world of web archiving and this was my first time attending an event like this, so I had some trepidation leading up to the first day. However, I quickly discovered that the organizers and participants, regardless of skill level or background, were there to learn and willing to share their own knowledge. I would highly encourage anyone, especially if you’re in a situation similar to mine, to apply for the Vancouver datathon that was announced at the end of Day 2 or one of the 2019 datathons taking place in the United States.

Thanks again to the organizers (Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz, Ryan Deschamps, and Nich Worby), their partners, and the University of Toronto for hosting us. Looking forward to the next one!

- Brian Griffin

2018-06-08: Joint Conference on Digital Libraries (JCDL) 2018 Trip Report

The gathering place at the Cattle Raisers Museum, Fort Worth, Texas 
This year's 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) took place at the University of North Texas (Fort Worth, Texas). Between June 3-6, members of WSDL attended paper sessions, workshops, tutorials, panels, and a doctoral consortium.

The theme of this year's conference was "From Data to Wisdom: Resilient Integration across Societies, Disciplines, and Systems." The conference provided researchers across multiple disciplines ranging from Digital Libraries and Web science research to Libraries and Information science, with the opportunity to communicate the findings of their research.

Day 1 (June 3, 2018)

The first day of the conference was dedicated to the doctoral consortium, tutorials, and workshops. The doctoral consortium provided an opportunity for Ph.D. students in the early phases of their dissertation to present their thesis and research plans and receive constructive feedback. I will provide a link to the Doctoral Consortium blog post when it becomes available.

Day 2 (June 4, 2018)

The conference officially began on the second day with Dr. Jiangping Chen's introduction of the conference and the keynote speaker - Dr. Trevor Owens. Dr. Trevor Owens is a librarian, researcher and policy maker and the first head of Digital Content Management for library services at the Library of Congress. His talk was titled: "We have interesting problems." 

It started with a highlight of Ben Shneiderman's The New ABCs of Research, which provides students with guidance on how to succeed in research, and provides senior researchers and policy makers with guidance on how to respond to new problems and apply new technologies. The new ABCs of research may be grossly summarized with two acronyms included in the book: ABC (Applied, Basic, and Combined) and SED (Science, Engineering, and Design).
Additionally, he presented NDP@3, an IMLS framework for investments in digital infrastructures for libraries. He also presented multiple IMLS-funded projects such as Image Analysis for Archival Discovery (AIDA), which explores various ways to use millions of images representing the digitized cultural record.
Next he talked about some resources at the Library of Congress Labs such as:
  • Library of Congress Colors: provides the capability of exploring the colors in the Library of Congress collections.
  • LC for Robots: provides a list of APIs, data and tutorials for exploring the digital collections at the Library of Congress.
Following the keynote were three concurrent paper sessions with the theme: Use, Collection Building, and Semantics & Linking. I will briefly describe the papers discussed in two paper sessions.


Paper session 1B (Day 2)


Myriam Traub (best paper nominee), a PhD student at Centrum Wiskunde & Informatica (CWI), presented a full paper titled: "Impact of Crowdsourcing OCR Improvements on Retrievability Bias." She discussed how crowd-sourced correction of OCR errors affects the retrievability of documents in a historic newspaper corpus in a digital library.
Three short papers followed Traub's presentation. First, Karen Harker, a Collection Assessment Librarian at the University of North Texas Libraries, presented: "Applying the Analytic Hierarchy Process to an Institutional Repository Collection." She discussed the application of the Analytic Hierarchy Process (AHP) to create a model for evaluating collection development strategies of institutions. Second, Douglas Kennard presented: "Computer-Assisted Crowd Transcription of the U.S. Census with Personalized Assignments for Better Accuracy and Participation," where he introduced the Open Genealogy Data census transcription project that strives to make census data readily available to researchers and digital libraries. This was achieved through the use of automatic handwriting recognition to bootstrap their census database and subsequent crowd-sourced correction of the data through a web interface. Finally, Mandy Neumann, a research associate at the Institute of Information Science at TH Köln, presented: "Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp." She explored different features for ranking conference candidates by using a pseudo-relevance assessment.


Paper session 1C (Day 2)


Dr. Federico Nanni (best paper nominee), a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim presented the first of three full papers titled: "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context," in which he introduced a method for obtaining specific descriptions of entities in text by retrieving the most related section from Wikipedia.
Next, Gary Munnelly, a PhD student at the School of Computer Science and Statistics (SCSS) at Trinity College Dublin presented: "Investigating Entity Linking in Early English Legal Documents," discussing the effectiveness of different entity linking systems for the task of disambiguating named entities in 17th century depositions obtained during the 1641 Irish rebellion.
Finally, Dr. Ahmed Tayeh presented: "An Analysis of Cross-Document Linking Mechanisms," where he discussed different strategies for linking or associating information across physical and digital documents. The titles of other papers presented in a parallel session (1A) include:

Open Cross-Document Linking Service Based on a Plug-in Architecture from Ahmed Tayeh


Paper session 2A (Day 2)


Two full papers were presented after a break. The first, titled: "Putting Dates on the Map: Harvesting and Analyzing Street Names with Date Mentions and their Explanations," was presented by Rosita Andrade. She presented her research about the automated analysis of street names with date references around the world, and showed that "temporal streets" are frequently used to commemorate important events such as a political change in a country.
Next, Dr. Philipp Mayr, a deputy department head and a team leader at the GESIS department Knowledge Technologies for the Social Sciences, presented: "Contextualised Browsing in a Digital Library's Living Lab." He presented two approaches that contextualize browsing in a digital library. The first approach is based on document similarity and the second utilizes implicit session information (e.g., queries and document metadata from sessions of users).


Paper session 3A (Day 2)


Three concurrent paper sessions followed Dr. Philipp Mayr's presentation. Dr. Dominika Tkaczyk, a researcher and data scientist at the Applied Data Analysis Lab at the University of Warsaw (Poland), presented: "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers," in which she presented the results of comparing different methods for parsing scholarly article references.
Anne Lauscher, a PhD student at the University of Mannheim presented: "Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph." She presented the current state of the workflow and implementation of the Linked Open Citation Database project, which is a distributed infrastructure based on linked data technology for efficiently cataloging citations in libraries.


Paper session 3C (Day 2)


Norman Meuschke, a PhD student at the University of Konstanz, presented: "An Adaptive Image-based Plagiarism Detection Approach," in which he discussed his analysis of images in academic documents to detect disguised forms of plagiarism with approaches such as perceptual hashing, ratio hashing and position-aware OCR text matching. 


Hisham Benotman presented his work: "Extending Multiple Diagram Navigation with Internal Diagram And Collection Connections." He discussed his work about extending Multiple diagram navigation (MDN) such that diagram-to-content queries reach related collection documents not directly connected to the diagrams.
Other papers presented in a parallel session (3B) include:
Minute madness followed the paper sessions. The minute madness was an activity in which poster presenters were given 1 minute to advertise their respective posters to the conference attendees. The poster session began after the minute madness.





Day 3 (June 5, 2018)

Day 3 of the conference began with Dr. Niall Gaffney's keynote. Dr. Niall Gaffney is an Astronomer and Director of Data Intensive Computing at the Texas Advanced Computing Center (TACC). He started by emphasizing the importance of scientific reproducibility before moving on to show some of the projects supported by the computational machinery at TACC such as Firefly.
Two concurrent paper sessions followed a short break.

Paper session 4A (Day 3)


Dr. Gianmaria Silvello, an assistant professor at the Department of Information Engineering of the University of Padua presented a full paper titled: "Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents." He discussed his project about the development of an evaluation framework for validating the conformance of long-term preservation by assessing correctness, usability and usefulness.
Next, Dr. Pavlos Fafalios, a researcher at the L3S Research Center in Germany, presented a full paper titled: "Ranking Archived Documents for Structured Queries on Semantic Layers," in which he proposed two ranking models that rank archived documents by considering the similarity of documents to entities, the timeliness of documents, and the temporal relations between the entities.
The final paper presented in this session (by someone other than an author of the paper) was a short paper titled: "Modeling Author Contribution Rate With Blockchain." Three concurrent paper sessions (all full papers) followed after a break.


Paper session 4B (Day 3)


Florian Mai, a graduate student at Kiel University in Germany was the first presenter of the paper session on Text Collections. He presented a full paper titled: "Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text," in which he presented the findings from investigating how deep learning models obtained from training on titles compare to deep learning models obtained from training on full-texts.
Next, Chris Holstrom, a PhD student from the Information School at the University of Washington presented a short paper: "Social Tagging: Organic and Retroactive Folksonomies," in which he showed that tags on MetaFilter and AskMetaFilter follow a power law distribution and retroactive taggers do not use "organization" tags like professional indexers.
Next, Jens Willkomm, a PhD student at the Karlsruhe Institute of Technology in Germany, presented a full paper titled: "A Query Algebra for Temporal Text Corpora." He proposed a novel query algebra for accessing and analyzing words in large text corpora.


Paper session 5A (Day 3)


Omar Alonso (best paper nominee) presented a full paper titled: "How it Happened:  Discovering and Archiving the Evolution of a Story Using Social Signals." He introduced a method of showing the evolution of stories from the perspective of social media users as well as the articles that include social media as supporting evidence.
Tobias Backes, a researcher at GESIS, presented his paper titled: "Keep it Simple: Effective Unsupervised Author Disambiguation with Relative Frequencies." He addressed the problem of author name homonymy in the Web Science domain by proposing a novel probabilistic similarity measure for author name disambiguation based on feature overlap.
The last paper (best paper nominee) presented in this session was titled: "Digital History meets Microblogging: Analyzing Collective Memories in Twitter."


Paper session 5B (Day 3)


Noah Siegel, a researcher at the Allen Institute for Artificial Intelligence, presented a full paper titled: "Extracting Scientific Figures with Distantly Supervised Neural Networks," where he introduced a system for extracting figures from a large number of scientific documents without human intervention.
Next, André Greiner-Petter presented his full paper titled: "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context." He presented a new approach for mathematical format conversion that utilizes textual information to reduce the error rate. Additionally, he evaluated state-of-the-art tools for mathematical conversions and provided a public, manually created gold standard dataset for mathematical format conversion.

Next, Yuta Kobayashi presented a paper titled: "Citation Recommendation Using Distributed Representation of Discourse Facets in Scientific Articles," presenting the effectiveness of using facets of scientific articles such as "objective," "method," and "result" for citation recommendation by learning a multi-vector representation of scientific articles, in which each vector represents a facet of the article.

Paper session 5C (Day 3)


Catherine Marshall, an adjunct professor at Texas A&M University, presented: "Biography, Ephemera, and the Future of Social Media Archiving." She presented her findings from answering the following question: "Will the addition of new digital sources such as records repositories, digital libraries, social media, and collections of ephemera change biographical research practices?" She demonstrated how new digital resources unravel a subject's social network, thus exposing biographical information that was formerly invisible.
Next, I presented our full paper titled: "Scraping SERPs for Archival Seeds: It Matters When You Start" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson. In my presentation, first, I highlighted the importance of web archive collections for studying important historical events ranging from elections to disease outbreaks. Next, I showed that search engines (specifically Google) can be used to generate seeds. Finally, I showed that it becomes harder to find the older URLs of news stories over time, so seed generators that utilize search engines should begin early and persist to capture the evolution of an event.

Next, Mat Kelly (best paper nominee), a fellow PhD student at Old Dominion University and member of WSDL, presented his full paper titled: "A Framework for Aggregating Private and Public Web Archives." He showed his framework that provides a means of combining public web archive captures and private web captures (e.g., banking and social media information) without compromising sensitive information included in the private captures. This work utilizes Sawood Alam's MemGator, a Memento aggregator that supports multiple serialization formats such as Link, JSON, and CDXJ.


Paper session 6A (Day 3)


The last paper session on Topic Modeling and Detection consisted of three full papers. First, Julian Risch (best paper nominee), a PhD student at Hasso-Plattner Institute (Germany) presented: "My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections." He presented a topic model combined with automatic domain term extraction and phrase segmentation that distinguishes collection-specific and collection-independent words based on information entropy.
Next, Dr. Ralf Krestel, the head of Web Science Research Group & Senior Researcher at Hasso-Plattner Institute (Germany) presented his full paper titled: "WELDA: Enhancing Topic Models by Incorporating Local Word Context." He proposed a new topic model called WELDA that combines word embeddings (WE) and Latent Dirichlet Allocation (LDA).
Finally, Angelo Salatino, a PhD student at the Knowledge Media Institute (UK) presented a full paper titled: "AUGUR: Forecasting the Emergence of New Research Topics." He introduced AUGUR, which is a new approach for the early detection of research topics in order to help stakeholders such as universities, institutional funding bodies, academic publishers and companies recognize new research trends.

A dinner at the Fort Worth Museum of Science and History followed after a break. The best poster award was presented to Mohamed Aturban, a fellow PhD student at Old Dominion University and member of WSDL, for his poster "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation."
Dr. Federico Nanni (Providing Fine-Grained Semantics of Entities in Context) and Myriam Traub (Impact of Crowdsourcing OCR Improvements on Retrievability Bias) tied for the Vannevar Bush best paper award. Myriam Traub also won the best student paper award.


Day 4 (June 6, 2018)

Day 4 began with a keynote from Dr. Carly Strasser, director of Strategic Development for the Collaborative Knowledge Foundation. Her keynote "Open Source Tech for Scholarly Communication: Why It Matters," illustrated the problems in the submission, production and delivery of scholarly communication. She talked about the problem of the disjoint nature (silos) of the various stages of scholarly communication, as well as the expensive delivery, slow production, static and less interoperable output.
She also presented a vision of scholarly communication that consists of living documents that link to open source code and data, a cheaper delivery system, faster production and more interoperable and dynamic output. Additionally, she talked about the organizations working to achieve various aspects of this vision.
The main conference gave way to workshops and a preview of JCDL 2019 which is scheduled to take place at the School of Information Sciences at the University of Illinois, Urbana-Champaign from June 2-6, 2019.
I would like to thank the organizers of the conference, the hosts, the University of North Texas (UNT) College of Information and the UNT Health Science Center, as well as SIGIR for the travel grants. I will provide a link to Mat Kelly's Web Archiving and Digital Libraries (WADL) workshop trip report once it is available, but here is a preview of WADL from Jasmine Mulliken, Digital Production Associate at Stanford University Press. Dr. Min-Yen Kan set up a repository for all the slides from JCDL 2018; please upload your slides if you have not already done so.

-- Nwala (@acnwala)

2018-06-08: Joint Conference on Digital Libraries (JCDL) Doctoral Consortium Trip Report




On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work. As usual, I live-tweeted the event to capture salient points. It was a very enjoyable experience for all.

Thanks very much to the chairs. In this post I will cover the work of all accepted students, three of whom are from the Web Science and Digital Libraries Research Group at Old Dominion University. I would also like to thank the assigned mentors of the Doctoral Consortium, who provided insight and guidance not only to their own assigned students, but to the rest of us as well.

WS-DL Presentations



Shawn M. Jones




How does a researcher differentiate between web archive collections that cover the same topic? Some web archive collections consist of 100,000+ seeds, each with multiple mementos. There are more than 8000 collections in Archive-It as of the end of 2016. Existing metadata in Archive-It collections is insufficient because the metadata is produced by different curators from different organizations applying different content standards and different rules of interpretation. As part of my doctoral consortium submission, I proposed improving upon the solution piloted by Yasmin AlNoamany. She generated a series of representative mementos and then submitted them to the social media storytelling platform Storify in order to provide a summary of each collection. As part of my preliminary work I presented some findings that will be published at iPres 2018. We discovered four semantic categories of Archive-It collections: collections where an organization archived itself, collections about a specific subject, collections about expected events or time periods, and collections about spontaneous events. The collections AlNoamany used in her work fit into the last category. This also turned out to be the smallest category of collections, meaning that there are many other types of collections not evaluated by her method. She proved that humans could not tell the difference between her automatically-generated stories and other stories generated by humans. She did not, however, provide evidence that the visualization was useful for collection understanding. We also have the problem that Storify is no longer in service, something that I mentioned in a previous blog post. My plan includes developing a flexible framework that allows us to test different methods of selecting representative mementos. This framework will also allow us to test different types of visualizations using those representative mementos. Some of these visualizations may make use of different social media platforms. I plan to evaluate these collections by first creating user tasks that give us some idea that a user understands aspects of a collection. With these tasks I intend to then evaluate different solutions via user testing. The solutions that score best from the testing will address a large problem inherent to the scale of web archives.


Alexander Nwala




How do we find high quality seeds for generating web archive collections? Alexander is focusing on a different aspect of web archive collections than I am. I am analyzing existing collections. He is building collections from seeds supplied by social media users. He notes that users often create "micro-collections" of web resources, typically surrounding an event. Using examples like ebola epidemics, the Ukraine crisis, and school shootings, Alexander asks if seeds generated by social media are comparable to those generated by professional curators. He also seeks quantitative methods for evaluating collections. Finally, he wants to evaluate the quality of collections at scale.
He demonstrated the results of using a prototype system that extracts seeds from social media and compared these seeds to those extracted from Google search engine result pages (SERPs). He discovered that, when using SERPs, the probability of finding a URI for a news story diminishes with time. He introduced methods like distribution of topics, distribution of sources, distribution in time, content diversity, collection exposure, target audience, and more. He covered some of his work on the Local Memory Project as well as work that will be presented at JCDL 2018 and Hypertext 2018. He intends to do further research on hubs and authorities in social media, as well as evaluating the quality of collections. Alexander will ensure that good quality seeds make it into web archives, addressing an aspect of curation that has long been an area of concern in web archives.


Mohamed Aturban



How can we verify the content of web archives? Mohamed presented his work on fixity for mementos. He described issues with temporal violations and playback. He asked whether different web archives agree on the content of mementos produced for the same live resource at the same time. He showed how "evil" archives could potentially manipulate memento content to produce a different page than existed at the time of capture. So, how do we ensure that a memento has been unaltered since the time of capture?

He demonstrated that the playback engine used by a web archive can inadvertently change the result of the displayed memento. Just providing a timestamped hash of the memento HTML is not enough. He proposes generating a cryptographic hash for the memento and all of its embedded resources and then generating a manifest of these hashes. This manifest is then itself stored as a memento in multiple web archives. I expect this work to be quite important to the archiving community, addressing a concern that many professional archivists have had for quite some time.
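To make the manifest idea concrete, here is a rough sketch of how such a manifest could be generated. This is my illustration, not Mohamed's actual implementation, and it assumes the embedded resource URIs have already been extracted from the memento:

import hashlib
import json
import requests

def sha256_of(uri):
    """Download a resource and return the SHA-256 hash of its bytes."""
    response = requests.get(uri)
    return hashlib.sha256(response.content).hexdigest()

def build_fixity_manifest(memento_uri, embedded_resource_uris):
    """Hash the memento and every embedded resource, producing a manifest
    that could itself be stored as a memento in multiple web archives."""
    manifest = {
        "memento": memento_uri,
        "hashes": {memento_uri: sha256_of(memento_uri)}
    }
    for uri in embedded_resource_uris:
        manifest["hashes"][uri] = sha256_of(uri)
    return json.dumps(manifest, indent=2)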


Other Work Presented



André Greiner-Petter



Research papers use equations all of the time. Unfortunately, there isn't a good method of comparing equations or providing semantic information about them. André Greiner-Petter is working on creating a method of enriching the equations used in research papers. This will have a variety of uses, such as detecting plagiarism or finding related literature.


Timothy Kanke



How are people using Wikidata? I had attended a session on Wikidata at WikiConference USA 2014, but have not really examined it since. Will it be useful for me? How do I participate? Who is involved? Timothy Kanke seeks to understand the answers to all of these questions. The Wikidata project has grown over the last few years, feeding information back into the Wikipedia community. Kanke will study the Wikidata community and provide a good overview for those who want to use its content. Using his work, we will all have an understanding of the ways in which Wikidata can work for the scholarly community.

Hany Alsalmi



How many languages do you use for searching? What is the precision of the results when you switch languages, even for the same query? Hany Alsalmi noticed that users who search in English were getting different results than when they searched for the same term in Arabic. Alsalmi will perform studies on users of the Saudi Digital Library to understand how they perform their searches and how successful those searches are. He will also record their reactions to search results, with the concern being that the user will quit in frustration if the results are insufficient. His work will have implications for search engines in the Arabic-speaking world.

Corinna Breitinger




Scholarly recommendation systems examine papers using text similarity. Can we do better? What about the figures, citations, and equations? Corinna Breitinger will take all of these text-independent semantic markers into consideration in developing a new recommender approach targeted at STEM fields. Once that is done, she will create a new visualization concept that will help users view and navigate a collection of similar literature. Such a system will help spot redundant research and also help us find related work in the field.

Susanne Putze



How is research data managed? How can we facilitate making data management a “first-class citizen”? To do so would improve the amount of data shared by researchers as well as its quality. Susanne Putze has extended experiment models to improve data documentation. She will create prototypes and evaluate how well they work to address data management in the scholarly process. From there she will begin the process of improving knowledge discovery using these prototypes. Her research has implications for how we handle our data and incorporate it into scholarly communications.

Stephen Abrams



How successful are digital preservation efforts? Stephen Abrams is working on creating metrics for this purpose. He is planning on evaluating digital preservation from the perspective of communications rather than through preservation management concepts like quantities, ages, or quality of preserved material. Thanks to his presentation I will now examine terms like “verisimilitude”, “semiotic”, and “truthlikeness”. When he is done, we should have better metrics to evaluate things like the trustworthiness of preserved material. His work is more general and theoretical than Mohamed’s, but there is a loose connection to be sure.

Tirthankar Ghosal




Why are papers rejected by editors? Have we done a good job identifying what makes our paper novel? What if we could spot such complex issues in our papers prior to submission? Tirthankar Ghosal seeks to address these concerns by using AI techniques to help researchers and editors more easily identify papers that will likely be rejected. He has already done some work examining reasons for desk rejections. He will identify methods for detecting what makes a paper novel, whether a paper fits a given journal, and whether it is of sufficient quality to be accepted, and lastly he will create benchmark data that can be used to evaluate papers in the future. His work has large implications for scholarly communication and may affect not only the way we write, but also how submissions are handled in the future.

What Next?


I would like to thank all participants for their input and insight throughout the event. Hearing their feedback for the other participants was quite informative to me as well. We will all have improved candidacy proposals as a result of their input and, more importantly, will use this input to improve our contributions to the world.
Updated on 2018/06/09 at 20:50 EDT with embed of Mohamed Aturban's Slideshare.
--Shawn M. Jones

2018-06-11: Web Archiving and Digital Libraries (WADL) Workshop Trip Report from JCDL2018

Mat Kelly reports on the Web Archiving and Digital Libraries (WADL) Workshop 2018 that occurred in Fort Worth, Texas.


On June 6, 2018, after attending JCDL 2018 (trip report), WS-DL members attended the Web Archiving and Digital Libraries 2018 Workshop (#wadl2018) in Fort Worth, Texas (see trip reports from WADL 2017, 2016, 2015, 2013). WS-DL contributed multiple presentations to the workshop, including the workshop keynote by my PhD advisor, which I discuss below.

The Project Panel

Martin Klein (@mart1nkle1n) initially welcomed the workshop attendees and had the group of 26-or-so participants give a quick overview of who they were and their interest in attending. He then introduced Zhiwu Xie (@zxie) of Virginia Tech to begin the series of presentations reporting on the kickoff of the IMLS-funded project (as established at WADL 2017) "Continuing Education to Advance Web Archiving". A distinguishing feature of this project compared to others, Zhiwu said, is that it will use project-based problem solving rather than producing only surveys and lectures. He highlighted a collection of curriculum modules that take existing practice (event archiving), feed it into various Web archiving tools (e.g., Social Feed Manager (SFM), ArchiveSpark, and Archives Unleashed Toolkit), and build understanding of the fundamentals (e.g., web, data science, big data) to produce experience in libraries, archives, and programming. The focus here was on individuals with some prior experience with archives rather than on training for those with zero experience in the area.

ODU WS-DL's Michael Nelson (@phonedude_mln) continued by explaining that one motivation is to encourage storytelling using Web archives and how that has been hampered by the recent closing of Storify. Some recent work of the group (including the in-development project MementoEmbed) would allow this concept to be revitalized despite Storify's demise through systematic "card" generation for mementos, allowing a more persistent (in the preservation sense) version of the story to be extracted and retained.

Justin Littman (@justin_littman) of George Washington University Libraries continued the project description by describing Social Feed Manager and emphasized that what you get from the Twitter API may well differ from what you get from the Web interface. The purpose of SFM is to be an easy-to-use, self-service Web interface to drive down the barriers in collecting social media data for academic research.

Ian Milligan (@ianmilligan1) continued by giving a quick run-down of his group's Archives Unleashed projects, noting a realization in the project's development that not all historians like working with the command line and Scala. He then briefly described the project's filter-analyze-aggregate-visualize approach to making large collections of Web archives more effective for research.

Wrapping up the project report, Ed Fox described Virginia Tech's initial attempts at performing crawls with Heritrix via Archive-It and how noisy the results were. He emphasized that a typical crawling approach consisting of starting with seed URIs harvested from tweets does not work well. The event model his group is developing and further evaluating will help guide the crawling procedure.

Ed's presentation completed the series of reports for the IMLS project panel and began a series of individual presentations.

Individual Presentations

John Berlin (@johnaberlin) started off with an abbreviated version of his Master's Thesis titled, "Swimming In A Sea Of JavaScript, Or: How I Learned To Stop Worrying And Love High-Fidelity Replay". While John had recently given his defense in April (see his post for more details), this presentation focused on some of the more problematic aspects of archival replay as caused by JavaScript. He highlighted specific instances where the efforts of a replay system to accurately replay JavaScript varied from causing a page to display a completely blank viewport (see CNN.com has been unarchivable since November 1st, 2016) to the representation being highjacked to declare Brian Williams as the originator of "Gin and Juice" long before Snoop Dogg(y Dogg). John has created a Chrome and Firefox extension he dubbed "Wayback Plus Plus" that mitigates JavaScript-based replay issues using client-side redirects. See his presentation for more details.

The workshop participants then had a break to grab a boxed lunch and followed with Ed Fox, again, presenting "A Study of Historical Short URLs in Event Collections of Tweets". In this work Ed highlighted the number of tweets in their collections that had URLs, namely that 10% had 2 URLs and less than 0.5% had 3 or more. From this collection, his group analyzed how many of the URLs linked are still accessible in Internet Archive's Wayback Machine with an emphasis that the Wayback Machine is not covering a lot of things that are in the Twitter data he has gathered. His group also analyzed the time difference between when a tweet with URLs was made and when it was archived and found that 50% were archived within 5 days after the tweet was posted.

Keynote

The workshop keynote, "Enabling Personal Use of Web Archives" was next and presented by my PhD Advisor Dr. Michele C. Weigle (@weiglemc). Her presentation initially gave a high-level overview of the needs of those that want to perform personal Web archiving and the tools that the WS-DL group have created over the years in facilitating the efforts to address those needs. She highlighted the early work of the group in identifying disasters in existing archives with a segue of the realization that many archive users lack in that there are more archives beyond Internet Archive.

In her (our) group's tooling to encourage Web users to Archive What They See Now, they created the WARCreate Chrome extension to create WARC files from any Web page. To resolve the issue of what a user is to do with their WARCs, they then created the Web Archiving Integration Layer (WAIL) (and later an Electron version) to allow individuals to control both the preservation and replay process. To give users a better picture of the archived Web as they browsed, they created the Chrome extension Mink to give users a measure of how well-archived (in terms of quantity) a URI is as they browsed the live Web and optionally (and easily) submit the URI currently viewed to 1-to-3 Web archives.

Dr. Weigle also highlighted the work of other WS-DL students past and present, like Yasmin AlNoamany's (@yasmina_anwar) Dark and Stormy Archives (DSA) and Shawn Jones' (@shawnmjones) upcoming MementoEmbed tool.

Following the tool review, Dr. Weigle asked, "What if browsers could natively interpret and replay WARCs?". She performed a high-level review of what could be possible if the compatibility barriers between the archived and live Web were resolved through live Web tools that could natively interact with the archived Web. In one example, she provided a screenshot where, in place of the "secure" badge a browser provides, the browser might also be aware that it is viewing an archived page and indicate as much.

Libby Hemphill (@libbyh) presented next with "Developing a Social Media Archive at ICPSR", where her group seeks to make data useful for people of the distant future who want to understand how we live today. She mentioned how messy it can be to consider the ethical challenges of archiving social media data and that people have different levels of comfort depending on what sort of research their social media content is to be used for. She outlined an architecture of their social media archive SOMAR for federating data to follow the terms of service, rehydrating tweets to follow the terms of research, and other aspects of the social-media-to-research-data process.

The workshop then took another break with a simultaneous poster session including a poster by Justin Littman titled, "Supporting social media research at scale" and WS-DL's Sawood Alam's (@ibnesayeed) "A Survey of Archival Replay Banners". Just prior to their poster presentations, each gave a lightning talk as a quick overview to entice attendees into stopping by.

After the break, WS-DL's Mohamed Aturban (@maturban1) presented "It is Hard to Compute Fixity on Archived Web Pages". Mohamed's work highlighted that subtle changes in content may be difficult to detect using conventional hashing methods to compute the fixity of Web pages. He emphasized that computing the fixity of the root HTML page of a memento is not enough and that the fixity must also be computed for all embedded resources. Using an approach based on Merkle trees, he generates a hash of the composite memento representative of the fixity of all embedded resources. In one example highlighted in his recent post and tech report, Mohamed showed the manipulation of Climate Change data.
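For intuition about the Merkle-tree step, the following simplified sketch (my illustration, not Mohamed's code) combines the hashes of the embedded resources pairwise until a single root hash represents the composite memento:

import hashlib

def merkle_root(leaf_hashes):
    """Combine hex-encoded resource hashes pairwise until one root hash
    remains; the root summarizes the fixity of the composite memento."""
    level = list(leaf_hashes)
    if not level:
        return None
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash when the level is odd
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]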

To wrap up the presentations for the workshop, I (Mat Kelly, @machawk1) presented "Client-Assisted Memento Aggregation Using the Prefer Header". This work highlighted one particular aspect of my presentation the previous day at JCDL 2018 (see blog post), namely how the framework in the base presentation facilitates specifying which archives are aggregated using Memento. A previous investigation by Jones, Van de Sompel et al. (see "Mementos in the Raw, Take Two") used the HTTP Prefer header to allow a client to request the un-rewritten version of mementos from an archival replay system. In my work, I imagined a more capable Memento aggregator that would expose the archives aggregated and allow a client, basing its customizations on the aggregator's response, to customize the set of archives aggregated by sending that set as base64-encoded data in the Prefer request header.
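The exact header syntax is defined in the paper; purely as an illustration of the mechanism, a client request to such an aggregator might look like the following sketch, where the endpoint and the preference token name are assumptions rather than the actual API:

import base64
import json
import requests

# Hypothetical aggregator endpoint; the real base64-encoded preference
# format is specified in the paper, not reproduced here.
timemap_uri = "https://aggregator.example.org/timemap/link/http://example.com/"
archives = ["https://web.archive.org/web/", "https://archive.example.net/"]
encoded = base64.b64encode(json.dumps(archives).encode()).decode()

response = requests.get(timemap_uri, headers={"Prefer": "archives=" + encoded})
print(response.status_code)
print(response.headers.get("Preference-Applied"))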

Closing

When I was through with the final presentation, Ed Fox began the wrap-up of the workshop. This discussion among all attendees opened the floor for comments and recommendations for the future of the workshop. With the discussion finished, the workshop came to a close. As usual, I found this workshop extremely informative, even though I was familiar with much of the participants' previous work. I am hoping, as also expressed by other attendees, to encourage other fields to become involved and present their ongoing work and ideas at this informal workshop. Doing so, from the perspective of both an attendee and a presenter, has proven valuable.

Mat (@machawk1)

2018-06-11: Knowledge Discovery From Digital Libraries (KDDL) Workshop Trip Report from JCDL2018


Fort Worth Museum of Science & History 9/11 Tribute

The theme of the workshop on Knowledge Discovery from Digital Libraries (KDDL) was to uncover hidden relationships within data using techniques from artificial intelligence, mathematics, statistics, and algorithms. The workshop organizers, who included ODU Computer Science alumna Dr. Hui Shi, along with Dr. Wu He and Dr. Guandong Xu, identified the following objectives that we were to explore:
  • Existing and novel techniques to extract and present knowledge from digital libraries;
  • Advanced ways to organize and maintain digital libraries to facilitate knowledge discovery;
  • Knowledge discovery applications in business; and
  • New challenges and technologies brought to the area of knowledge discovery and digital libraries.

The KDDL workshop consisted of three paper presentations which are summarized here.

Presentation 1: I presented my work on Mining the Web to Approximate University Rankings based on the tech report "University Twitter Engagement: Using Twitter Followers to Rank Universities" (https://arxiv.org/abs/1708.05790) and discussed in an earlier blog post.


This paper presented an alternative methodology for approximating the academic rankings of a university using social media; specifically, the university's Twitter followers. We identified a strategy for discovering official Twitter accounts along with a comparative analysis of metrics mined from the web which could be predictors of high academic rank (e.g., athletic expenditures, undergraduate enrollment, endowment value). As expected, schools with more financial resources tend to have more Twitter followers based on larger enrollments, big endowments, and big investments in their sports programs. We also discovered that smaller schools like Wake Forest University can enhance their reputation when they employ faculty with national name recognition (e.g., Melissa Harris-Perry (@MHarrisPerry)). For those wishing to perform further analysis, we have posted all of the ranking and supporting data used in this study, which includes a social-media-rich data set containing over 1 million Twitter profiles, ranking data, and other institutional demographics, in the oduwsdl GitHub repository.

Presentation 2: Basic Science and Technological Innovation: A Classification of Research Publications was presented by Dr. Robert M. Patton, Oak Ridge National Laboratory. This paper explored the context required for funding decision makers, sponsors, and the general public to determine the value of research publications. Core questions addressed the accessibility of massive digital libraries and methods related to identification of new discoveries, data sets, publications in disparate journals, and new software codes. Dr. Patton asserted that research evaluation has become increasingly complicated and citation analysis alone is insufficient if considered within the context of the people who control the flow of funding. His presentation of evaluation techniques included altmetrics along with a comparison of Bohr’s, Edison’s, and Pasteur’s quadrants as classifiers that use the wording of titles and abstracts in conjunction with domain-specific terminology.

A Classification of Research Publications


Presentation 3: Introducing Math QA -- A Math Aware Question Answering System was presented by Felix Hamborg, University of Konstanz. This paper presented a software tool that allows a user to enter a textual request for a math formula (e.g., What is the formula for …?) in English or Hindi and then be presented with the required parameters and the actual formula from Wikidata. The authors mined 40 million articles in Wikidata searching for <math> tags to identify 17 thousand general and geometric formulas. They defined a QA system workflow consisting of three distinct modules for calculation, question parsing, and formula retrieval. Their discovery of geometric formulas (e.g., polygons, curves) was slightly more complex, as these formulas can include a nested hierarchy of related data that required traversal of the associated Wikidata subsections. Following evaluation and comparison to a commercial engine, exported information was parsed and ported back into Wikidata. The authors' source code and data are available in their GitHub repository (http://github.com/ag-gipp/MathQa).

A Math Aware Question Answering System

Following the paper presentations, the workshop participants divided into two groups to conduct a breakout session where we discussed Challenges and Research Trends in Knowledge Discovery from Digital Libraries and Beyond.  Each group was asked to offer opinions and provide summary responses for each of the following topics:
  • What are your reactions to the paper presentations? What did you learn that you didn’t previously know?
  • What are the current techniques, applications, and/or research questions that you are addressing in Knowledge Discovery from Digital Libraries and Beyond? What are the biggest impediments or challenges limiting Knowledge Discovery from Digital Libraries and Beyond?
  • What are your top priorities in implementing Knowledge Discovery from Digital Libraries and Beyond? 
  • What resources and/or support do you need to implement? 
  • What areas will you recommend for research? How do you think artificial intelligence (AI) can benefit knowledge discovery in digital libraries? 
  • Suggestions for coordination of research and future collaboration.

Collectively, my group's responses centered on the themes of data curation with less reliance on subject matter experts, methods or tools to make data more self-documenting, and new strategies for relationship extraction between linked entities. There was also considerable discussion related to reproducible research using common repositories and formats conducive to sharing data (e.g., XML) and open access to both software and the peer review process.

I would like to thank Old Dominion University for the Graduate Student Travel Award which helped to facilitate my participation in the JCDL conference and this workshop.

--Corren (@correnmccoy)

2018-06-27: InfoVis Spring 2017 Class Projects

This may sound familiar, but yet again I'm way behind in posting about my previous offerings of CS 725/825 Information Visualization.
(Previous semester highlights posts: Spring 2016, Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)

Here are a few projects that I'd like to highlight from Spring 2017. (All class projects are listed in my InfoVis Gallery.  This semester has its own page because there were 19(!) projects.)  All of the projects were implemented using the D3.js library.

Because the Spring 2017 semester began with President Donald Trump's Travel Ban (EO 13769) and we have a large international graduate student population, students were understandably interested in US immigration and refugee data.  The first two projects here focus on that.  In addition, one project looked at sentiment about the US Presidential candidates on social media on Election Day.

The last two projects that I'll highlight are focused on the lighter topic of sports, NFL football and IPL cricket.

Visualization of US Refugee Admittance Data
Created by Susan Zehra


This project focuses on refugee admittance to the US between 2008 and 2016. The visualization highlights the number of refugees by country of nationality/religion, and the relationship between a country's number of refugees, number of war deaths, population, GDP per capita (GDPPC), and State Fragility Index (SFI).



Foreign Travel and Immigration to the US
Created by Hind Aldabagh and Bathsheba Nelson


This project (available at http://www.cs.odu.edu/~bnelson/cs725/project1/index.html) shows the total number of immigrants (2010-2015) from each region, country, and class of admission as well as the totals that settle in each state in the US.  The visualizations include an interactive world map alongside a tabbed panel with various idioms (bar chart, line chart, choropleth map, text lists) to provide quick access to multiple views of immigration information.

Flash video available at http://www.cs.odu.edu/~haldabag/cs725/worldmap-template2/video.html

Sentiment Analysis Based on Social Media
Created by Triveni Bhardwaj

This project (available at http://www.cs.odu.edu/~ttriveni/cs725/SentimentAnalysis/test.html) presents emotion and sentiment analysis of Tweets about the 2016 US Presidential candidates on Election Day. Visualization idioms include treemap, wordcloud, and US map.



Insights into American Football
Created by Mahesh Kukunooru and Maheedhar Gunnam

This project presents a visualization interface to explore NFL football data over the past 10 years. Different idioms such as multi-line charts, bar charts, radar charts, and donut charts are used to visualize the football dataset, with the aim of providing a platform that helps users explore the data and find interesting insights that might otherwise go unnoticed.


IPL - Indian Premier League
Created by Karan Balmaui and Varun Kumar Karne

This project visualizes statistics of the Indian Premier League (IPL) for all 9 seasons. It concentrates on displaying complete information, from an entire season down to each ball in every match. The user is provided with performance rankings, points tables for each season, and total scores of each match in a season. After comparing total runs across all matches played by a team in a season, the user can navigate to run rate, loss of wickets, types of runs, and batting/bowling stands in the selected match.


-Michele

2018-07-02: The Off-Topic Memento Toolkit

Inspired by AlNoamany's work in "Detecting off-topic pages within TimeMaps in Web archives", I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018 and are now available as a preprint.

Web archive collections are created with a specific purpose in mind. A curator supplies seeds for the collection, and the archive captures multiple versions of these seeds in order to study the evolution of a web page over time. This is valuable for following the changes in an organization or the events in a news story. Unfortunately, depending on the curator's intent, sometimes these seeds go off-topic. Because web archive crawling software has no way to know that a page is off-topic, these mementos are added to the collection. Below I list a few examples of off-topic pages within Archive-It collections.

This memento from the Human Rights collection at Archive-It created by the Columbia University Libraries is off-topic. The page ceased to be available at some point and produced this "404 Page Not Found" response with a 200 HTTP status.

This memento from the Egypt Revolution and Politics collection at Archive-It created by the American University in Cairo is off-topic. The web site began having database problems.

It is important to note that the OTMT does not delete potentially off-topic mementos, but rather only flags them for curator review. Detecting such mementos allows us to exclude them from consideration or flag them for deletion by some downstream tool, which is important to our collection summarization and storytelling efforts. The OTMT detects these mementos using a variety of different similarity measures. One could also use the OTMT to detect and study off-topic mementos.

Installing the software


The OTMT requires Python 3.6. Once you have met that requirement, install OTMT by typing:

# pip install otmt

This installs the necessary libraries and provides the system with a new detect-off-topic command.

A simple run


To perform an off-topic run with the software on Archive-It collection 1068, type:

# detect-off-topic -i archiveit=1068 -tm cosine,bytecount -o myoutputfile.json

This will find all URI-Rs (seeds) related to Archive-It collection 1068, download their TimeMaps (URI-Ts), download the mementos within each TimeMap, process those mementos via the specified similarity measures (here cosine and byte count), and write the results in JSON format out to a file named myoutputfile.json.

The JSON output looks like the following.
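A rough sketch of that structure, using the keys described below (the scores and verdicts here are illustrative placeholders, not actual OTMT results), is:

{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {
          "stemmed": true,
          "tokenized": true,
          "removed boilerplate": true,
          "comparison score": 0.42,
          "topic status": "on-topic"
        },
        "bytecount": {
          "comparison score": -0.05,
          "topic status": "on-topic"
        }
      },
      "overall topic status": "on-topic"
    }
  }
}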



Each URI-T serves as a key containing all URI-Ms within that timemap. In this example the timemap at URI-T http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/ contains several mementos. For brevity, we are only showing results for the memento at http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/.

The key "timemap measures" contains all measures run against the memento. In this case I used the two measures "cosine" and "bytecount". Each measure entry indicates which preprocessing has been performed against that memento (e.g., stemmed, tokenized, and removed boilerplate). Under "comparison score" is that measure's score. Under "topic status" is a verdict on whether or not the memento is on or off-topic. Finally, the "overall topic status" indicates if any of the measures determined that the memento is off-topic.

The OTMT uses an input-measure-output architecture. This way the tool separates the concerns of input (e.g., how to process Archive-It collection 1068 for mementos) from measure (e.g., how to process these mementos using cosine and byte count similarity measures) and output (e.g., how to produce the output in JSON format and write it to the file myoutputfile.json). This architecture is extensible, providing interfaces allowing for more input types, measures, and output types to be added in the future.

The -i (for specifying the input) and -o (for specifying the output) options are the only required options. The following sections detail the different command line options available to this tool.

Input and Output


The input type is supplied by the -i option. OTMT currently supports the following input types:
  • an Archive-It collection ID (keyword: archiveit)
  • one or more TimeMap URIs (URI-T) (keyword: timemap)
  • one or more WARCs (keyword: warc)
An output file is supplied by the -o option. Output types are specified by the -ot option. OTMT currently supports the following output types:
  • JSON as shown above (the default) (keyword: json)
  • a comma-separated file consisting of the same content found in the JSON file (keyword: csv)
To specify multiple WARCs, list them after the warc option like so:

# detect-off-topic -i warc=mycrawl1.warc.gz,mycrawl2.warc.gz -o myoutputfile.json

Likewise, for multiple TimeMaps, list them with the timemap argument and separate their URI-Ts with commas, like so:

# detect-off-topic -i timemap=https://archive.example.org/urit/http://example.org,https://archive.example.org/urit/http://example2.org -o myoutputfile.json

To use the comma-separated file format instead of json use the -ot option as follows:

# detect-off-topic -i archiveit=3936 -o myoutputfile.csv -ot csv

For better processing, we want to eliminate any interference from HTML and JavaScript associated with archive-specific branding. In the case of TimeMaps and Archive-It collections, raw mementos will be downloaded where available. While any TimeMap may be specified for processing, raw mementos are preferred as they do not contain the additional banner information and other augmentations supplied by many web archives. These augmentations may skew the off-topic results. Currently, only raw mementos from Archive-It are detected and processed. WARC files, of course, are "raw" by their nature, so removing web-archive augmentations like banners is not needed for WARC files.

Measures


OTMT supports the following measures with the -tm (for "timemap measure") option:

Each of these measures considers the first memento in a TimeMap to be on-topic and evaluates all other mementos in that TimeMap against that first memento.
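Conceptually (this is not OTMT's internal code), a similarity measure of this kind compares the text of each memento against the first memento in the TimeMap and flags anything that falls below the threshold, roughly as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_off_topic(memento_texts, threshold=0.10):
    """Compare each memento's text against the first (assumed on-topic)
    memento; similarity below the threshold marks it as off-topic."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(memento_texts)
    scores = cosine_similarity(tfidf[0:1], tfidf).flatten()
    return [{"index": i, "score": float(s), "off_topic": s < threshold}
            for i, s in enumerate(scores)]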

Measures and thresholds can be supplied on the command line, separated by commas. For example, to use Jaccard with a threshold of 0.15, separate the measure name and the threshold value, like so:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15

Multiple measures can also be used, separated by commas. For example, to use jaccard and cosine similarity, type the following:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15,cosine=0.10

The default thresholds for these measures have been derived from testing using a gold standard dataset of on-topic and off-topic mementos originally generated by AlNoamany. This dataset is now available at: https://github.com/oduwsdl/offtopic-goldstandard-data/. We used this dataset as a standard and selected thresholds that produced the best F1 score for each measure. I will present the details of how we arrived at these thresholds at iPres 2018. Our study is available as a preprint on arXiv.

Other options


Optionally, one may also change the working directory (-d) and the logging file (-l). By default, the software uses the directory /tmp/otmt-working for its work and logs to the screen with stdout.

The Future


I am still researching several features that will make it into future releases. I have separated the capabilities into library modules for use with future Python applications, but the code is currently volatile and I expect changes to come in the following months as new features are added and defects are fixed.

The software does not currently offer an algorithm utilizing the Web-based kernel function specified in AlNoamany's paper. This algorithm augments terms from the memento with terms from search engine result pages (SERPs), pioneered by Sahami and Heilman. Due to the sheer number of mementos to be evaluated by the OTMT and Google's policy on blocking requests to its SERPs, I will likely not implement this feature unless it is requested by the community.

I am also interested in the concept of "collection measures". I created the "timemap measures" key in the JSON output to differentiate one set of measure results from another eventual category of collection-wide measures that would test each memento against the topic of an entire collection. Preliminary work using the Jaccard Distance in this area was not fruitful, but I am considering other ideas.

The Off-Topic Memento Toolkit is available at https://github.com/oduwsdl/off-topic-memento-toolkit. Please give it a try and report any issues encountered and features desired. Although developed with an eye toward Archive-It collections, we hope to increase its suitability for all themed collections of archived web pages, such as personal collections created with webrecorder.io.



-- Shawn M. Jones

2018-07-03: Extracting Metadata from Archive-It Collections with Archive-It Utilities

At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv.

As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects and produced a standalone Python library named Archive-It Utilities (AIU). This library is currently in alpha status, but is already being used with upcoming projects.

The metadata available from an Archive-It collection

Archive-It curators can use the predefined metadata fields of Dublin Core. They can also supply their own custom metadata fields.

A screenshot of Archive-It collection 4515 with metadata annotated.
Above is Archive-It collection 4515, named 2013 BART Strike and collected by the San Francisco Public Library. This collection's curators generated quite a bit of metadata. In this screenshot, we can see the following metadata fields for the collection:
  • Subject
  • Creator
  • Publisher
  • Source
  • Format
  • Rights
  • Language
  • Collector
In addition to collection-wide metadata, we see that the first seed has the following metadata applied:
  • Creator
  • Publisher
  • Language
  • Format
  • Date
For research purposes, there is quite a lot of data here to be analyzed, especially when comparing collections as we did in "The Many Shapes of Archive-It". I discovered that most collections used the controlled vocabulary from Dublin Core, shown as blue in the bar chart below, more often than freeform vocabulary, shown in green.

Distribution of the top 20 collection-wide metadata fields in public Archive-It collections.

Each collection can have one or more topics. As shown in the screenshot below, the curator can choose from the controlled vocabulary offered by the collection topics field. They can also add their own freeform topics in the subject field. The public-facing interface combines entries from both of these input fields into the public-facing subject field.

Metadata can be added by curators by using the metadata page of one of their Archive-It collections. 
The bar chart below shows the distribution of the top 20 topics in public Archive-It collections. I discovered that most curators apply the controlled vocabulary topics to their collections.

Distribution of the top 20 collection-wide subjects (also called topics) of public Archive-It Collections.


This creates a confusing nomenclature. When viewing an Archive-It collection from the outside, everything is displayed as part of the subject field. Because of this, the rest of this post, and Archive-It Utilities, uses the subject field to refer to these topics.

As work for "The Many Shapes of Archive-It" progressed, we focused more on collecting seed lists and then mementos for further analysis. We tried to predict the topics using machine learning, but were unsuccessful and chose a different path for predicting the semantic categories of a collection. Most of the metadata gathered did not make it into the study's results, but will be used in future work. I have included these results here to show the kinds of questions one can answer with Archive-It Utilities.

Installation

Archive-It Utilities requires Python 3.6. Once that requirement has been met, you can install it using:

pip install aiu

It provides several experimental executables. We will only cover fetch_ait_metadata in this post.

Running fetch_ait_metadata

The fetch_ait_metadata command produces a JSON file containing all of the information available about a public Archive-It collection.

To run it on collection 4515 and store the results in file output.json, type the following command:

fetch_ait_metadata -c 4515 -o output.json

The -c option allows one to specify an Archive-It collection and the -o option allows one to indicate where to store the JSON output.

The JSON output looks like the following, truncated for brevity:



From this JSON we can see the name of the collection, which organization created it from the collected_by field, the subjects the curator applied to the collection as a list in the subject field, and when the collection was created in the archived_since field.

Within the optional dictionary field, we see values for freeform metadata added by the curator. In this case we have creator, publisher, source, format, rights, language, and collector.

Also included is the "seed metadata" section containing a list of seeds both scraped from the HTML of the Archive-It collection's web pages and also gathered from the CSV report provided for each Archive-It collection. Above I've listed the seed http://www.bart.gov/news/articles/2013/news20130617 to demonstrate the type of metadata that can be gathered. As noted in "The Many Shapes of Archive-It", seed metadata is optional, but in this example the curator added a title, creator, publisher, language, format, and date to this seed.

Using Archive-It Utilities In Python Code

This information can also be acquired programmatically using the ArchiveItCollection object. The script below demonstrates how one can acquire the collection name, collecting organization, and the list of seed URIs for Archive-It collection ID 4515.
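A minimal version of such a script, using the methods documented below (the constructor signature here is an assumption on my part), would be:

from aiu import ArchiveItCollection

# Assumption: the constructor accepts the numeric Archive-It collection ID.
collection = ArchiveItCollection(4515)

print("Collection name:", collection.get_collection_name())
print("Collected by:", collection.get_collectedby())

for seed_uri in collection.list_seed_uris():
    print(seed_uri)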



which produces the following output, truncated for brevity:



The following methods of the ArchiveItCollection class are useful for analyzing the metadata of a collection:
  • get_collection_name - returns the name of the collection
  • get_collection_uri - returns the URI of the collection
  • get_collectedby - returns the name of the collecting organization
  • get_collectedby_uri - returns the URI of the collecting organization
  • get_description - returns the content of the "description" field
  • get_subject - returns a Python list containing the subjects applied to the collection
  • get_archived_since - returns the content of the "archived since" field
  • is_private - returns True if the collection is not public, False otherwise
  • does_exist - not all collection identifiers are valid, this method returns True if the collection identifier actually represents a real collection, False otherwise
  • list_seed_uris - returns a Python list of seed URIs
  • get_seed_metadata(uri) - returns a Python dictionary containing metadata for a specific seed at uri
  • return_collection_metadata_dict - returns a Python dictionary containing all collection-wide metadata
  • return_seed_metadata_dict - returns a Python dictionary containing all seeds and their metadata
  • return_all_metadata_dict - returns a Python dictionary containing all collection-wide and seed metadata
  • save_all_metadata_to_file(filename) - writes all collection-wide and seed metadata out as JSON to a file named filename


The code does perform some measure of lazy loading to be nice to Archive-It. If you only need the general collection-wide metadata, it only acquires the first page of the collection. If you need all seed URIs, it must download all Archive-It pages belonging to the collection.

Summary

Archive-It collections have metadata that can be used to answer many research questions. After working on "The Many Shapes of Archive-It", to be presented at iPres 2018, I used the lessons learned to create Archive-It Utilities, a Python library that can be used to acquire this metadata. Please try it out and log any issues at the GitHub repository https://github.com/oduwsdl/archiveit_utilities.

--Shawn M. Jones

2018-07-11: InfoVis Fall 2017 Class Projects

(Previous semester highlights posts: Spring 2017, Spring 2016, Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)

Here are a few projects that I'd like to highlight from Fall 2017. (All class projects are listed in my InfoVis Gallery.)  All of the projects were implemented using the D3.js library.

World Leader Interactions on Social Media (Twitter) 
Created by Grant Atkins
This project (available at http://www.cs.odu.edu/~gatkins/world-leader-vis/app/) provides an interactive dashboard to visualize ways Twitter list data can be used and represented. This visualization uses the World Leaders list on Twitter, with the addition of a few world leaders not on the list, to derive information and visualize shared information among these users. The goal of this visualization is to show shared term usage among world leaders, the times at which tweets are most likely to be sent, the sentiment of the users, and the decay of data over a static, decreasing time interval.


Investigation Into Cryptocurrency Pricing Patterns With Respect to Financial Instability
Created by Jason Orender


This project (available at http://www.cs.odu.edu/~jorender/cs725/CS725-PROJECT_Y9FziL/) provides a focused presentation of the world events coincident with spikes in peer-to-peer cryptocurrency transactions, together with a continuous evolutionary timeline to provide perspective on the state of development and usage at the national, regional, and worldwide levels. Bitcoin was the sole cryptocurrency used in this analysis due to the large amount of country-specific peer-to-peer data available.



Holiday Flight Patterns
Created by Asmita Gosavi

This project (available at http://www.cs.odu.edu/~agosav/cs725/HolidaysFlightPattern/index.html) is an interactive tool for visualizing holiday flight patterns using a dot chart to visualize the last year's data, a US map with bubbles to display the average arrival delay at particular airports, and a line chart which shows the monthly distribution from 2006-2015. The datasets selected for this visualization are the percentages of on-time arrivals, delays, and cancellations for different airlines operating in the US, over different years. The intention was to find flight delay patterns between holiday and non-holiday months.



Sports Injuries
Created by Plinio Vargas and Miranda Smith

This project (available at http://www.cs.odu.edu/~pvargas/cs725/cs725-project/) deals with an organization interested in reducing the number of personnel injuries due to the dangerous nature of their job, and it looks into more effective visualization techniques for measuring the organization's physical performance training program. The goal of the visualization project is to provide answers specific to the organization by identifying where most injuries occur, which activities are injury-prone, the correlation of injuries with the training program, and the evaluation and trends of its members' physical training performance.


Federal Workforce
Created by John Ashley

This project (available at http://www.cs.odu.edu/~jashley/cs725/project/) highlights some of the characteristics that make up the current Federal civil service workforce. The visualization also provides a visual snapshot of how widely dispersed the workforce is.

Video available at http://www.cs.odu.edu/~jashley/cs725/demo/

-Michele

2018-07-15: How well are the National Guideline Clearinghouse and the National Quality Measures Clearinghouse Archived?



On July 13, I saw this on Twitter:

There are two US government websites in danger, the National Guideline Clearinghouse (https://www.guideline.gov) and the National Quality Measures Clearinghouse (https://qualitymeasures.ahrq.gov). Both store medical guidelines. Both will "not be available after July 16, 2018". According to the linked Daily Beast article above:
Medical guidelines are best thought of as cheatsheets for the medical field, compiling the latest research in an easy-to-use format. When doctors want to know when they should start insulin treatments, or how best to manage an HIV patient in unstable housing — even something as mundane as when to start an older patient on a vitamin D supplement — they look for the relevant guidelines. The documents are published by a myriad of professional and other organizations, and NGC has long been considered among the most comprehensive and reliable repositories in the world.

The Sunlight Foundation Web Integrity Project wrote a report about the archivability of this service. They note that "interactive features do not function, making archived content much more difficult to access and, in many cases completely unavailable." Seeing as web archives typically crawl websites from the client side and have no access to the server components, I expect that the search functionality of these web sites will not work once archived.

The robots.txt for www.guideline.gov disallows everyone:
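A robots.txt file that disallows every crawler takes the standard form:

User-agent: *
Disallow: /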


The robots.txt for qualitymeasures.ahrq.gov disallows everyone:



In December of 2016, the Internet Archive stopped honoring robots.txt for .gov and .mil websites, hoping to "keep this valuable information available to users in the future". Seeing as these two sites will be shut down on July 16, 2018, how well are they archived?

Experiment Setup



The method I used to evaluate how much of each site was archived consisted of the following general steps:
  • Acquire a sample of original resource URIs from www.guideline.gov and qualitymeasures.ahrq.gov
  • Use a Memento Aggregator to determine if each original resource has at least one memento
As we can see in the robots.txt above, there is no machine readable site map for either web site. This means that I would need to crawl each site to find all of the URIs. Remembering lessons from when I evaluated URI patterns for Signposting the Scholarly Web, and when I manually crawled a number of scholarly web sites looking for URI patterns to help other crawling efforts, I started off thinking like someone who was planning to archive each site. I knew that I did not have time to manually crawl the entirety of each site so I tried to evaluate which documents appeared to be the main products of these sites. I have classified the documents I evaluated into five categories: main products - summaries, expert commentaries, guideline syntheses, summaries in other formats, and other pages. I will describe these categories in more detail in the following sections.

I created a GitHub repository to save my work. Due to the time crunch, I did not organize it nicely and it will be updated in the coming days with more content used in this article, so check back to it often if interested.

Update on 2018/07/16 at 20:37 GMT:The GitHub repository is now as stable as it is going to be. As it was written over the course of 3 days, the code is very, very rough. I have no intentions of improving it, but the data and code is provided for anyone who is interested. Feel free to contact me on Twitter with any questions.

After acquiring sample original resource URIs to test I installed MemGator, a Memento Aggregator developed by the WS-DL Research Group. I wrote a Python script which requested an aggregated TimeMap from MemGator for each original resource URI and recorded the number of mementos per URI.
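The check itself is simple; a sketch of it, assuming a local MemGator instance with its default TimeMap endpoint (the port and path here are assumptions), looks like this:

import requests

MEMGATOR = "http://localhost:1208/timemap/link/"  # assumed local MemGator endpoint

def memento_count(original_uri):
    """Request an aggregated link-format TimeMap and count its memento entries."""
    response = requests.get(MEMGATOR + original_uri, timeout=120)
    if response.status_code != 200:  # no TimeMap generally means no mementos
        return 0
    # Count link-format entries whose rel value ends in "memento"
    # (covers rel="memento", "first memento", "last memento", etc.).
    return sum(1 for line in response.text.splitlines() if 'memento"' in line)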

So, what categories of documents did I retrieve before feeding them into MemGator?

Main Products - Summaries



I reviewed the menus across the top of each site's home page. I discovered that the main product of www.guideline.gov appeared to be the guideline summaries and the main product of qualitymeasures.ahrq.gov appeared to be measure summaries. I focused on these documents because, if captured as mementos, an enterprising archivist could build their own search engine around them.

As shown in the screenshot below, these summaries were accessible via paginated search result pages. Fortunately, there is an "All Summaries" option which will list all summaries as a series of search results.


The qualitymeasures.ahrq.gov site also has its own "All Summaries" page, shown in the screenshot below, so these URIs can be scraped using a script aware of the paging as well.



As Corren wrote last year, pagination can result in missed captures. Knowing this, I wondered if the pagination would have an impact on whether the guideline summaries were archived.

I wrote some simple (and very rough) code in Python using the requests library and BeautifulSoup to scrape all URIs from each search result page. The same script was used to scrape both sites. For both sites I selected the guideline summary URIs, identified because they contained the string "/summaries/summary/", and removed duplicates. This gave me 1415 original URIs for www.guideline.gov and 2533 original URIs for qualitymeasures.ahrq.gov.
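A sketch of that scraping step (simplified; the real script also handled the pagination of the "All Summaries" results, which is abstracted away here) could look like:

import requests
from bs4 import BeautifulSoup

def summary_uris_from(results_page_uri):
    """Scrape one 'All Summaries' results page and return the guideline
    summary links it contains."""
    soup = BeautifulSoup(requests.get(results_page_uri).text, "html.parser")
    return {anchor["href"]
            for anchor in soup.find_all("a", href=True)
            if "/summaries/summary/" in anchor["href"]}

# Placeholder: the paginated 'All Summaries' result page URIs (not shown).
result_page_uris = []

# Taking the union of the per-page sets removes duplicate summary URIs.
unique_summaries = set()
for page_uri in result_page_uris:
    unique_summaries |= summary_uris_from(page_uri)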

Expert Commentaries



Both sites also contained expert commentaries about these summaries. I decided that this also looked important, even though these commentaries did not appear to be indexed by the search engine.

A screenshot of the Expert Commentaries page on www.guideline.gov

A screenshot of the Expert Commentaries page on qualitymeasures.ahrq.gov
I wrote a script to scrape the expert commentary URIs from these pages. With this I ended up with 45 URIs for www.guideline.gov and 52 URIs for qualitymeasures.ahrq.gov.

Guideline Syntheses



The www.guideline.gov site has a series of documents labeled guideline synthesis documenting "areas of agreement and difference, the major recommendations, the corresponding strength of evidence and recommendation rating schemes, and a comparison of guideline methodologies". These documents also seemed to be important, so I chose to include them as well.

The Guideline Synthesis page at www.guideline.gov is another set of documents provided by the web site.

I wrote a script to scrape this page for all guideline synthesis URIs. This led me to 18 URIs for www.guideline.gov. The qualitymeasures.ahrq.gov site did not contain this type of document.


Summaries in other formats



In addition to the HTML formatted guideline summaries, there were guideline summaries available in PDF, XML, and DOC format on www.guideline.gov. I wrote another script to iterate through all of the summary pages captured in the previous section and save off the PDF, XML, and DOC URIs. The qualitymeasures.ahrq.gov website only has HTML formatted measure summaries, so this document category does not apply to that site.

This screenshot demonstrates the multiple formats available for a guideline summary on www.guideline.gov.

My script to scrape these pages gave me 4185 URIs for www.guideline.gov.

Other Pages



Finally, I was curious about what may have been missed elsewhere. I decided to try to gather URIs as a crawler given the seed of the top-level page would. With this exercise, I was hoping to gather a number of top-level pages to see how their archive status differed from the guideline summaries, expert commentaries, guideline syntheses, and the measure summaries.

I wrote two simple spiders (crawlers) using the Python crawling framework scrapy. I pointed each spider at the homepage of each website, instructed it not to crawl outside of the domain of each site, and told it to print out any URI listed on a page it discovered while crawling. Unfortunately, I ran it on a machine with insufficient memory. The operating system killed scrapy in both cases because it was consuming too much memory. This means that the crawl for www.guideline.gov ran for 4 hours while the crawl for qualitymeasures.ahrq.gov ran for 7 hours. This inconsistency in crawl times was disappointing, but I kept the URIs from these crawls because they provide an interesting contrast in the results section.
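A minimal spider along those lines (a sketch under the stated constraints, not the exact spiders used) might be:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GuidelineSpider(CrawlSpider):
    """Crawl www.guideline.gov, stay inside the domain, and emit every URI
    linked from each page encountered."""
    name = "guideline"
    allowed_domains = ["www.guideline.gov"]
    start_urls = ["https://www.guideline.gov/"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Yielding items lets scrapy's feed export write out the URIs.
        for href in response.css("a::attr(href)").extract():
            yield {"uri": response.urljoin(href)}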

Once I had a list of URIs linked from pages encountered during the crawl, I then removed all URIs that were not in the domains of www.guideline.gov or qualitymeasures.ahrq.gov, respectively.

Hundreds of thousands of URIs returned were related to search facets. The crawl of www.guideline.gov returned 894,881 such URIs while the crawl of qualitymeasures.ahrq.gov returned 1,474,516. Because these search facet URIs were related to the summaries from the prior sections, I removed these search URIs in the interest of time and only focused on the other pages crawled because these other pages contained actual content. I removed any URIs containing fragments (i.e. hashes like #introduction). I also filtered the URIs for summaries, guideline syntheses, and expert commentaries so that there would be no overlap in results.

I then fed the URIs through MemGator to see if the pages were captured.

Results and Discussion



The table below shows the results of testing if a page was archived for www.guideline.gov. Of those URIs recorded for this experiment, 98.8% of them were indeed archived, which is good news.

www.guideline.gov

Page Category                        | Archived      | Not Archived | Total
Guideline Summaries (HTML)           | 1401          | 14           | 1415
Expert Commentaries                  | 45            | 0            | 45
Guideline Syntheses                  | 18            | 0            | 18
Guideline Summaries (Other Formats)  | 4185          | 57           | 4242
Other Pages                          | 150           | 2            | 152
Total                                | 5799 (98.8%)  | 73 (1.2%)    | 5872


Most importantly, of the 1415 guideline summaries from www.guideline.gov, 1401/1415 (99.0%) are archived. Only 14/1415 (1.0%) are not archived. Also, all 45 expert commentaries and all 18 guideline syntheses are archived. This means that almost all of the important site content is preserved and an enterprising archivist can build a search engine around them in the future.

The table below shows the results of testing if a page was archived for qualitymeasures.ahrq.gov. Of the URIs recorded for this experiment, 97.5% of them were archived.


qualitymeasures.ahrq.gov
Page Category            Archived        Not Archived    Total
Measure Summaries        2509            24              2533
Expert Commentaries      52              0               52
Other Pages              90              44              134
Total                    2651 (97.5%)    68 (2.5%)       2719


Of the 2533 measure summaries from qualitymeasures.ahrq.gov, 2509/2533 (99%) are archived. Only 24/2533 (0.9%) were not archived. Also, all 52 expert commentaries are archived. Again, this means that the majority of the important documents exist in a web archive and can be indexed by a potential search engine in the future. The picture is not so good for the other pages category, where only 90/134 (67.2%) of the pages exist in a web archive.

The high overall numbers are remarkable and likely a result of the Internet Archive's decision, at the end of 2016, to stop honoring robots.txt restrictions on US government web sites. The next sections answer additional questions.

What is the distribution of mementos per category per site?



Below, several histograms show the distribution of memento counts across the different categories of pages for www.guideline.gov. Note that this only applies to those pages with mementos.

Histogram of the number of mementos per URI for guideline summaries for www.guideline.gov.
Minimum: 1, Maximum: 24, Mode: 8.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for expert commentaries for www.guideline.gov.
Minimum: 9, Maximum: 14, Mode: 11.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for guideline syntheses for www.guideline.gov.
Minimum: 9,  Maximum: 18, Mode: 14.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for non-HTML guideline summaries for www.guideline.gov.
Minimum: 1, Maximum: 13, Mode: 9.
Note: only pages with mementos were evaluated.

Histogram of the number of mementos per URI for other pages for www.guideline.gov.
Minimum: 1, Maximum: 2072, Mode: 1.
Note: only pages with mementos were evaluated.
We see that the more specific content in the guideline summaries, expert commentary, guideline syntheses, and the guideline summaries in multiple formats tend to have a mode of 8, 9, 11, or 14 mementos. This means that many of the more important pages have multiple mementos. The other content, consisting mostly of top level pages, has a mode of 1, meaning that a lot of these top level pages were only archived once. There is at least one page in the other category, though, that was archived 2072 times.
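For reference, counts like those summarized in these histograms can be derived by counting the memento entries in each TimeMap; the sketch below assumes the TimeMaps were saved to disk as link-format files, which is an assumption about intermediate data rather than the code used for this analysis:

import glob
import re
from collections import Counter

counts = []
for timemap_file in glob.glob("timemaps/guideline_summaries/*.link"):   # hypothetical layout
    with open(timemap_file) as f:
        body = f.read()
    # each memento entry in a link-format TimeMap carries a rel value containing "memento"
    n = len(re.findall(r'rel="[^"]*memento[^"]*"', body))
    if n > 0:                       # only pages with mementos are included in the histograms
        counts.append(n)

histogram = Counter(counts)
mode_value = histogram.most_common(1)[0][0]
print("Minimum:", min(counts), "Maximum:", max(counts), "Mode:", mode_value)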

Below, several histograms show the distribution of memento counts across the different categories of pages for qualitymeasures.ahrq.gov.


Histogram of the number of mementos per URI for measure summaries for qualitymeasures.ahrq.gov.
Minimum: 1, Maximum: 15, Mode: 4.
Histogram of the number of mementos per URI for expert commentaries for qualitymeasures.ahrq.gov.
Minimum: 6, Maximum: 7, Mode: 7.
Histogram of the number of mementos per URI for other pages for qualitymeasures.ahrq.gov.
Minimum: 2, Maximum 131, Mode: 2.

The numbers are much lower for qualitymeasures.ahrq.gov, but they exhibit the same pattern.

How does the crawling pattern for mementos change over time per category per site?



So, how does the crawling of www.guideline.gov change over time? The bar charts below show the number of mementos added to archives per month based on their memento-datetime.
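The month-by-month counts charted below can be produced by bucketing each memento-datetime by month; this sketch assumes the memento-datetimes were dumped to a text file, one RFC 1123 datetime per line, which again is an assumption about intermediate data rather than the code used for this post:

from collections import Counter
from datetime import datetime

per_month = Counter()
with open("memento_datetimes.txt") as f:             # hypothetical dump of memento-datetimes
    for line in f:
        line = line.strip()
        if not line:
            continue
        # memento-datetimes look like "Thu, 02 Jun 2005 19:45:24 GMT"
        dt = datetime.strptime(line, "%a, %d %b %Y %H:%M:%S GMT")
        per_month[dt.strftime("%Y-%m")] += 1

for month in sorted(per_month):
    print(month, per_month[month])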

Memento count per month for guideline summaries for www.guideline.gov. We see a big push in more recent months.
Memento count per month for expert commentaries for www.guideline.gov. There is much the same pattern as for the prior category.

The number of mementos crawled per month for the guideline syntheses  documents of www.guideline.gov. There has been a lot of activity the past few months.
Memento count per month for the non-HTML versions of guideline summaries for www.guideline.gov. Again, we see a big push in more recent months.

Memento count per month for other pages at www.guideline.gov. Here we see years of crawling with big spikes after the US election. This may be related to the Internet Archive's new robots.txt policy. 
It is interesting to note that mementos exist for some of these pages prior to December of 2016, meaning that people were archiving them with functionality such as "Save Page Now". Archiving really picked up in all cases in September of 2016, then again in October of 2017, and then again starting in April of 2018. These spikes appear to be a coordinated effort to archive parts of the site.

In the last graph, we see years of crawling the top level pages. This is interesting considering the contents of the robots.txt file. Did it change over time? Was it more permissive at some point? Fortunately, we have web archives we can use to check.

Here is a screenshot of the Internet Archive's capture calendar for www.guideline.gov/robots.txt from 2005. Orange indicates that the robots.txt file did not exist. Blue indicates that it did. 
Based on the above screenshot, it appears that a robots.txt did not exist for the site www.guideline.gov until 2005. It was first observed at this site on August 23, 2005 at 22:54:19 GMT. Its contents were as follows:


According to the robots.txt specification website, this indicates "To allow all robots complete access". This means that, at one time, the site was far more permissive about crawling than it is now. I randomly chose a memento of the robots.txt for each year after 2005 and found that it did not change. In August of 2008, the robots.txt disappeared again. In 2009, the successful robots.txt captures were actually of a soft-404 page indicating that it did not exist. By September 11, 2010 at 18:14:50 GMT, the robots.txt had become more complex, as shown below:



As we see, it still was not disallowing all content the way the current version mentioned at the beginning of the article does. This configuration persisted until August 26, 2016, when the robots.txt was still present but completely blank. The robots.txt was changed to its current state on April 27, 2017 before 20:09:28 GMT. The US Senate approved the nomination of Tom Price to the office of Secretary of Health and Human Services on February 10, 2017. This means that the site's robots.txt allowed crawling until after Tom Price took office, which is probably why so many top-level pages had been captured by web archives since the site's creation.

What about the qualitymeasures.ahrq.gov site? The bar charts below show the number of mementos per month for each of its categories.

Memento count per month for measure summaries at qualitymeasures.ahrq.gov. There is some activity in 2016, but a lot of very recent crawling of the content.
Memento count per month for expert commentaries at qualitymeasures.ahrq.gov. Like above, there is some activity in 2016, but a big push in June of 2017, and a lot of very recent crawling of the content.

Memento count per month for other pages at qualitymeasures.ahrq.gov. We see the same large push in recent history, with a lot of crawling.

The crawling of qualitymeasures.ahrq.gov follows much the same pattern, though not with the exact same spikes prior to this last month. From these graphs we see that there has been a concerted effort to archive both of these sites since June of 2018. This site acquired its first robots.txt on August 24, 2005, before 00:04:23 GMT, and that robots.txt was completely permissive, as with www.guideline.gov.

The emergence of a robots.txt for qualitymeasures.ahrq.gov on August 24, 2005, as shown on the Internet Archive's calendar page for the URI qualitymeasures.ahrq.gov/robots.txt.

The robots.txt went through much the same history for this site as for www.guideline.gov, implying a similar policy or even the same webmaster for both sites. It finally changed to its current disallow state on April 27, 2017 before 22:10:11 GMT. Again, this is after Tom Price took office. This again explains why so many of the top level pages of the site were archived throughout the history of qualitymeasures.ahrq.gov.

In which archives are these pages preserved?



I chose to use an aggregator because I wanted to search multiple web archives for these pages. How well are these mementos spread across the archives? The charts below show the number of mementos per archive for each category of pages at www.guideline.gov. Only archives containing mementos for a given category are displayed in each chart.

This chart of the guideline summaries for www.guideline.gov shows 6,848 mementos are present in the Internet Archive, with 4,887 mementos preserved by Archive-It and 10 mementos preserved by Archive.today (archive.is).
This chart of the non-HTML versions of the guideline summaries for www.guideline.gov shows that 14,846 mementos are preserved in the Internet Archive, while even more, 19,044, are preserved in Archive-It.

This chart of the expert commentaries for www.guideline.gov shows that 324 mementos are held by the Internet Archive while 177 are held by Archive-It.

This chart of the guideline syntheses for www.guideline.gov shows that 189 mementos are held by the Internet Archive while 62 are held by Archive-It.


The chart of the other pages for www.guideline.gov shows that the top-level pages are preserved at more archives than the previous categories. There are 3,397 mementos at the Internet Archive, 819 mementos at Archive-It, 128 mementos at the Library of Congress, 19 at Archive.today, 11 at the Icelandic Web Archive, 7 at the Portuguese Archive, and 1 at Perma.cc.

While the Internet Archive and Archive-It have most of the mementos, some mementos of the top-level pages of the site are held in other archives. As this is a US government web site, I was surprised that the Library of Congress was not featured more. Archive-It also has more non-HTML guideline summaries than the Internet Archive, indicating a particular effort by some organization to preserve these documents in other formats. Unfortunately, the Archive-It mementos I discovered with MemGator belonged to the collection /all/, meaning that I have no indication of which Archive-It collection or organization was preserving the pages.

Update on 2018/07/16 at 18:10 GMT: To find the specific Archive-It collection and collecting organization, Michele Weigle has suggested that one might be able to search the Archive-It collections for these URIs using Archive-It's explore all archives search interface. One would need to use the "Search Page Text" tab. I did try the string www.guideline.gov and discovered 5,906 search results, so this hostname is in the content of some of these pages. I tried using a URI reported to have an Archive-It memento, but did not receive any search results. If you are successful, please say something in the comments.
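For reference, the per-archive tallies shown in these charts can be derived from the aggregated TimeMaps by grouping each URI-M by its hostname; the sketch below assumes saved link-format TimeMaps and is not the code used for this post:

import glob
from collections import Counter
from urllib.parse import urlparse

per_archive = Counter()
for timemap_file in glob.glob("timemaps/**/*.link", recursive=True):   # hypothetical layout
    with open(timemap_file) as f:
        for entry in f.read().split(",\n"):           # one link entry per line in these TimeMaps
            if 'rel="' not in entry:
                continue
            if "memento" not in entry.split('rel="', 1)[1]:
                continue                              # skip original, timegate, and self links
            urim = entry.split(">", 1)[0].lstrip().lstrip("<")
            per_archive[urlparse(urim).netloc] += 1   # group mementos by archive hostname

for archive, count in per_archive.most_common():
    print(archive, count)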


The bar charts below show the distribution of mementos across web archives for the qualitymeasures.ahrq.gov web site.

This chart of the measure summaries for qualitymeasures.ahrq.gov shows 9,494 mementos at the Internet Archive and only 216 at Archive-It.

This chart of the expert commentaries for qualitymeasures.ahrq.gov shows that all 360 of their mementos are held by the Internet Archive.

This chart for the other pages at qualitymeasures.ahrq.gov shows 1,147 mementos at the Internet Archive, 145 mementos at Archive-It, 40 mementos at the Library of Congress, 12 at the Portuguese Web Archive, 1 at Archive.today, and 1 at Perma.cc.

The results for qualitymeasures.ahrq.gov show that most of the mementos for that site are archived at the Internet Archive, with a few in other archives. This is in contrast to the results for www.guideline.gov, where the numbers between the Internet Archive and Archive-It were close in many cases.


Attempts at Archiving the Missing Pages



On July 14, 2018, I attempted to use our own ArchiveNow to preserve the ~1% of summary URIs from each site that had not been archived. Unfortunately, the live resources started responding very slowly. The sample of summary URIs that had not been archived produced 500 status codes, as can be seen in the output from the curl commands below, each of which took close to a minute to execute:





I ran curl on all live URIs listed as not captured, and they returned an HTTP 500 status as of July 14, 2018 at approximately 16:50:00 GMT. Because I had scraped these URIs from the "All Summaries" page, it is possible that they returned 500 statuses at crawl time, which would explain why web archives do not currently have them: even on the live web, they were not available. The live versions of the other summary pages, those with mementos, returned a 200 status (after about a minute of delay).
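The check itself was simple; a Python equivalent looks roughly like the following, where the URI is a placeholder rather than one of the actual missing summaries:

import requests

uri = "https://www.guideline.gov/summaries/summary/00000"   # hypothetical missing summary URI

response = requests.get(uri, timeout=120, allow_redirects=True)
print(uri, response.status_code)    # the missing summaries returned 500 at this point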

It is also possible that the service at these web sites is degrading in their last hours. As of approximately 07:00 GMT on July 15, 2018, the qualitymeasures.ahrq.gov site was no longer available, displaying error messages for pages, as shown in the screenshot below.

As of 07:00 GMT on July 15, 2018, the qualitymeasures.ahrq.gov website started displaying error messages instead of content.
Update on 2018/07/16 at 19:00 GMT: The website qualitymeasures.ahrq.gov is available again, but the measure summaries that were missing from the archives still return HTTP 500 status codes. The missing guideline summaries for www.guideline.gov also still return HTTP 500 status codes.
This was quite disheartening, because my plan was to archive the pages I had detected as missing after I did my initial study. I thought I had until July 16 to save the web pages!

Conclusion



Almost all web archiving is done externally, with no knowledge of the software running on the server side. This reduces mementos to a series of observations of pages rather than a complete reproduction of all of the functionality that existed at a web site. The two US government websites that will be shut down on July 16, 2018, www.guideline.gov and qualitymeasures.ahrq.gov, have server-side functionality, but their most valuable assets are series of summary documents that can be captured without reproducing that server-side functionality. In this article, I've tried to determine how much of these web sites has been captured prior to their termination.

When focusing on the main products of each site, the guideline summaries and the measure summaries, we see that these products are actually quite well archived: 99% of guideline summaries for www.guideline.gov and 99% of measure summaries for qualitymeasures.ahrq.gov. We also observed that 100% of the expert commentaries were archived in both cases. Other aspects of the sites, such as all facets of the search engine, were not tested. I did, however, attempt to crawl the sites to gain a list of pages outside of these categories and discovered that, at least among the pages captured during a limited crawl, other pages at www.guideline.gov are archived at a rate of 99%, higher than those for qualitymeasures.ahrq.gov, which stand at only 67.2%.

Many of these main products have more than one memento, and as many as 24 in some cases. There are more mementos for www.guideline.gov than for qualitymeasures.ahrq.gov, but the mode for the number of mementos of the main products ranges between 4 and 14. This means that the main products have good coverage. The top-level content at these sites, however, has a mode of 1 or 2 mementos, indicating poor coverage of the changes over time for some top-level pages.

Over the life of these sites, most of the mementos stored in web archives are for the top-level pages, because crawling was permitted by their robots.txt until April 27, 2017, a few months after Tom Price became the Secretary of Health and Human Services. Fortunately, there has been a large push to archive the main products of the site since September of 2016, resulting in many mementos created within the last month.

Most of the mementos for these sites are stored in the Internet Archive. Archive-It has more mementos of the non-HTML versions of guideline summaries for www.guideline.gov, but its memento count is eclipsed by the Internet Archive in all other cases. After the Internet Archive and Archive-It, there is a long tail of archives for top-level pages, but the number of mementos for each of these archives is less than 100. With the exception of 10 guideline summaries for www.guideline.gov stored in Archive.today, none of the main products of these sites are stored outside of the Internet Archive or Archive-It.

My attempts to archive the pages after running this experiment failed, in large part due to the degradation in service at these web sites. Even though I tried preserving the pages prior to the cutoff date of July 16, 2018, they were no longer reliably available.

Because one needs to know the original resource URI in order to find mementos in a web archive, I have published the URIs I discovered to Figshare. I do this in hopes that someone might build a resource for providing easy access to the content of these sites, especially for medical personnel. If you want to access them, use these links.
Feel free to contact me if you run into problems with these files.

This case demonstrates the importance of organizations like the Sunlight Foundation for identifying at-risk resources. Also important are the web archives that allow us to preserve these resources. This case also demonstrates how we can come together and ensure that these resources are preserved. We do need to be concerned that so much of this content is preserved in one place, rather than spread across multiple archives. If a page is of value to you, you have an obligation to archive it, and to archive it in multiple archives. What web pages have you archived today, so that you, and others, can access their content long after the live site has gone away?

--Shawn M. Jones

2018-07-18: Why We Need Private Web Archives: Almost Two-Thirds of Web Traffic IS NOT Publicly Archivable


Google.com mementos from May 8th 1999 on the Internet Archive
In terms of the ability to be archived in public web archives, web pages fall into one of two categories: publicly archivable, or not publicly archivable.

    1. Publicly Archivable Web Pages:

    These pages are archivable by public archives. The pages can be accessed without login/authentication; in other words, these pages do not reside behind a paywall. Grant Atkins examined paywalls in the Internet Archive for news sites and found that web pages behind paywalls may actually redirect to a login page at crawl time. A good example of a publicly archivable page is Dr. Steven Zeil's page, since no authentication is required to view the page. Furthermore, it does not use client-side scripts (i.e., Ajax) to load additional content, so what you see in the web browser and what you can replay from public web archives are exactly the same.

    Screen shot from Dr. Steven Zeil's page captured on 2018-07-02
    Memento for Dr. Zeil's page on the Internet Archive captured on 2017-12-02 
    Some web pages provide "personalized" content depending on the GeoIP of the requester. In these cases, what you see in the browser and what you can replay from public web archives are nearly the same, except for some minor personalization/GeoIP related changes. For example, a user requesting https://www.islamicfinder.org from Suffolk, Virginia will see the prayer times for the closest major city (Norfolk, Virginia). On the other hand, when the Internet Archive crawls the page, it sees the prayer times for San Bruno, California. This is likely because the crawling/archiving is happening from San Francisco, California. The two pages, otherwise, are exactly the same!

    The live version of https://www.islamicfinder.org for a user in Suffolk, VA on 2018-07-02 
    Memento for  https://www.islamicfinder.org from the Internet Archive captured on 2018-06-22
    Some social media sites, like Twitter, are publicly archivable and the Internet Archive captures most of their content. Twitter's home page is personalized, so user-specific contents, like "Who to Follow" and "Trends for you" are not captured, but the tweets are. Also, some Twitter services require authentication.

    @twitter live web page
    @twitter memento from the Internet Archive captured on 2016-05-18

    The archived memento for the @twitter web page shows a message that cookies are used and are important for an enhanced user experience; nevertheless, the main content of the page, the tweets, is preserved (or at least the top-k tweets, since the crawler does not automatically scroll at archive time to activate the Ajax-based pagination, cf. Corren McCoy's "Pagination Considered Harmful to Archiving").

    Message from Twitter about cookies use to enhance user experience
    Also, deep links to individual tweets are archivable.
    Memento for a deep link to a tweet on the Internet Archive captured on 2013-01-18

    2. Not Publicly Archivable Web Pages:

    In terms of the amount of web traffic, search engines are at the top. According to SimilarWeb, Google is number one; its share is 10.10% of the entire web traffic. The Internet Archive crawls it on a regular basis and has over 0.5 million mementos as of 2018-05-01 (cf. Mat Kelly's tech report about the difficulty of counting the number of mementos). The captured mementos are exact copies in appearance, but obviously not functioning search pages.
    As of 2018-05-01 the IA has 552,652 mementos of Google.com

    Google.com memento from May 8th 1999 on the Internet Archive played on 2018-05-01
    It is possible to push a search result page from Google to a public web archive like archive.is, but that is not how web archives are normally used.
    A Google search query for "Machine Learning" on 2018-06-18 archived in archive.is
    Furthermore, it is not viable for web archives to try to archive search engines' result pages (SERPs) because there is an infinite number of possible URIs due to an infinite number of search queries and syntax, so even if we preserve a single SERP from June, 2018 (as shown above), we are unable to issue new queries against a June, 2018 version of Google. Maps and other applications that depend on user interaction are similar: individual pages may be archived, but we typically don't consider the entire application "archived" (cf. Michael Nelson's "Game Walkthroughs As A Metaphor for Web Preservation").

    Even when web archives use headless browsers to overcome the Ajax problem, there can be additional challenges. For example, I pushed a page from Google Maps with an address in Chesapeake, Virginia to archive.today, and the result was a page from Google support (in Russian) telling me that I (or more accurately, archive.today) needed to update the browser in order to use Google Maps! While technically not a paywall, this is similar to Grant's study mentioned above in that there is now something in the web archive corresponding to that Google Maps URI, but it does not match users' expectations. It also reveals a clue about the GeoIP of archive.today.
    Google Maps page for the address 4940 S Military HWY, Chesapeake, VA 23321 pushed to archive.today on 2018-07-02
    Memento for the Google Maps page I pushed to archive.today on 2018-07-02
    It is worth mentioning there are emerging tools like Webrecorder, WARCreate, WAIL, and Memento Tracer for personal web archiving (or community tools in the case of Tracer), but even if/when the Internet Archive replaces Heritrix with Brozzler and resolves the problems with Ajax, their Wayback Machine cannot be expected to have pages requiring authentication, nor pages with effectively infinite inputs like search engines and maps.

    Social media pages respond differently when web archives' crawlers try to crawl and archive them. Public web archives might have mementos of some social media pages; however, the sites often require a login to allow the download of a page's representation, and otherwise a redirection takes place. Another obstacle facing the archiving of social media pages is their heavy use of client-side scripts that will, for example, fetch new content when the page is scrolled or hide/show comments with no change in the URI. Facebook, for example, does not allow web archives' crawlers to access the majority of its pages. The Internet Archive's Wayback Machine returned 1,699 mementos for the former president's official Facebook page, but when I opened one of these mementos, it returned the infamous Facebook login-or-register page.
    1,699 mementos for the official Facebook page of Mr. Obama, former U.S. president as of 2018-05-01


    The memento captured on 2017-02-10 is showing the login page of Facebook
    There are a few exceptions where the Internet Archive is able to archive some user-contributed Facebook pages.


    Memento for a facebook page in the Internet Archive captured on 2012-03-02
    Also, it seems like archive.is is using a dummy account ("Nathan") to authenticate, view, and archive some Facebook pages.

    Memento for a facebook page in archive.is captured on 2018-06-21
    With the previous exceptions in mind, it is still safe to say that Facebook pages are not publicly archivable.

    LinkedIn exhibits the same behavior as Facebook. The notifications page has 46 mementos as of 2018-05-29, but they are entirely empty. The live page contains notifications about contacts: who is having a birthday, celebrating a job anniversary, starting a new job, and so on. This page is completely personalized and requires a cookie or login to display information related to the user; therefore, the Internet Archive has no way of downloading its representation.

    My account's notification page on Linkedin
     
    Memento of Linkedin's notification page

    The last example I would like to share is Amazon's "yourstore" page. I chose this example because it contains recommended items (another clear example of a personalized web page). The recommendations are based on the user's behavior. In my case, Amazon recommended electronics, automotive tools, and Prime Video.

    My Amazon's page (live) on 2018-05-02
    As of 2018-05-02, I found 111 mementos for my Amazon "yourstore" page in the Internet Archive, and opened one of them to see what had been captured.

    Mementos for Amazon's yourstore page in the Internet Archive on 2018-05-02
    As I expected, the page has a redirect to another page that asks for a login. It returned a 302 response code when it was crawled by the Internet Archive. The actual content of the original page was not archived because the IA crawler does not provide credentials to download the content of the page. The representation saved to the Internet Archive is for a resource different from the originally crawled page.

    IA crawler was redirected to a login page and captured it instead
    Login page captured in the IA instead of the crawled page
    There are many web sites with this behavior, so it is safe to assume that for some web sites, even when there are plenty of mementos, they all might return a soft 404.

    Estimating the amount of archivable web traffic:

    To explore the amount of web traffic that is archivable, I examined the top 100 sites as ranked by Alexa, and manually constructed a data set of those 100 sites using traffic analysis services from SimilarWeb and Semrush.

    The data was collected on 2018-02-23, and I captured three web traffic measures offered by both services: total visits, unique visits, and pages/visit.
    • Total visits is the total number of non-unique visits from last month.
    • Unique visits is the number of unique visits from last month.
    • Pages/visit is the average number of visited pages per user's visit.
    I determined whether or not a website is archivable based on the discussion I provided earlier, and put it all together in a CSV file to use later as input for my script. Suggestions, feedback, and pull requests are always welcome!

    The data set used in the experiment
    Using Python 3, I wrote a simple script that calculates the percentage of web traffic that is publicly archivable. I am assuming that the top 100 sites are a good representative sample of the whole web. I am aware that 100 sites is a small number compared to the 1.8 billion live websites on the Internet, but according to SimilarWeb, the top 100 sites receive 48.86% of the entire traffic on the web, which is consistent with a Pareto distribution. The program offers six different results, each based on a single measure or a combination of the measures total visits, unique visits, and pages/visit. Flags can be set to control which measures are used in the calculation; if no flags are set, the program shows all the results using all three measures and their combination. I came up with the following formula to calculate the percentage of publicly archivable web traffic based on all three measures combined:
    1. Multiply pages/visit by visits for each web site, from both SimilarWeb and SemRush
    2. Average the two results from step 1 across SimilarWeb and SemRush
    3. Average the unique visits for each website from SimilarWeb and SemRush
    4. Add the numbers obtained in steps 2 and 3
    5. Sum the number obtained in step 4 over all archivable websites
    6. Sum the number obtained in step 4 over all non-archivable websites
    7. Add the numbers obtained in steps 5 and 6 to get the total
    8. Calculate the percentages that the numbers obtained in steps 5 and 6 represent of the total obtained in step 7
    Using all measures, I found that 65.30% of the traffic of the top 100 sites is not archivable by public web archives. The program and the data set are available on Github.
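    A minimal sketch of that combined-measure calculation appears below; the row layout and the two example rows are illustrative stand-ins, not the actual data set published with this post:

    # each row: (site, archivable?, SimilarWeb visits, unique visits, pages/visit,
    #                               SemRush visits, unique visits, pages/visit)
    rows = [
        ("example-search.com", False, 1.0e9, 4.0e8, 8.0, 1.1e9, 4.2e8, 7.5),
        ("example-news.com",   True,  2.0e8, 9.0e7, 3.0, 1.8e8, 8.5e7, 3.2),
    ]

    archivable_total = 0.0
    non_archivable_total = 0.0
    for site, archivable, sw_v, sw_u, sw_p, sr_v, sr_u, sr_p in rows:
        pages_viewed = ((sw_p * sw_v) + (sr_p * sr_v)) / 2   # steps 1-2
        unique = (sw_u + sr_u) / 2                           # step 3
        weight = pages_viewed + unique                       # step 4
        if archivable:
            archivable_total += weight                       # step 5
        else:
            non_archivable_total += weight                   # step 6

    total = archivable_total + non_archivable_total          # step 7
    print("not archivable: %.2f%%" % (100 * non_archivable_total / total))   # step 8
    print("archivable:     %.2f%%" % (100 * archivable_total / total))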

    Now, it is possible to discuss three different scenarios and compute a range. If the top 100 sites receive 48.86% of the traffic, and 65.30% of that traffic is not publicly archivable, then:

    1.  If all of the remaining web traffic is publicly archivable, then 31.91% of the entire web traffic is not publicly archivable. 65.30 * 0.4886 = 31.91.
    2. If the remaining web traffic is similar to the traffic from the top 100 sites, then 65.30% of the entire web traffic is not publicly archivable.
    3. Finally, if all of the remaining web traffic is not publicly archivable, then only 16.95% of the entire web traffic is archivable. 34.7 * 0.4886 = 16.95. This means that 83.05% of the entire web traffic is not publicly archivable.

    So the percentage of not publicly archivable web traffic is between 31.91% and 83.05%. More likely, it is close to 65.30% (the second case).

    I would like to emphasize that since the top 100 websites are mainly Google, Bing, Yahoo, etc., and their derivatives, the nature of these top sites is the determining factor in my results. However, since the range has been calculated, it is safe to say that at least one third of the entire web traffic is not publicly archivable. This percentage demonstrates the necessity of private web archives. There are a few available tools that address this problem: Webrecorder, WARCreate, and WAIL. Public web archiving sites like the Internet Archive, archive.is, and others will never be able to preserve personalized or private web pages like emails, bank accounts, etc.

    Take Away Message:

    Personal web archiving is crucial since at least 31.91% of the entire web traffic is not archivable by public web archives. This is due to the increased use of personalized/private web pages and of technologies hindering the ability of web archives' crawlers to crawl and archive these pages. The experiment shows that the percentage of web traffic that is not publicly archivable can be as high as 83.05%, but the more likely case is that around 65% of web traffic is not publicly archivable. Unfortunately, no matter how good public web archives get at capturing web pages, there will always be a significant number of web pages that are not publicly archivable. This emphasizes the need for personal web archiving tools, such as Webrecorder, WARCreate, and WAIL, possibly combined with a collaboratively maintained repository of how to interact with complex sites, as introduced by Memento Tracer. Even if Ajax-related web archiving problems were eliminated, no less than one third of web traffic is to sites that will otherwise never appear in public web archives.

    --
    Hussam Hallak

    2018-07-18: HyperText and Social Media (HT) Trip Report


    Leaping Tiger statue next to the College of Arts at Towson University
    From July 9 - 12, the 2018 ACM Conference on Hypertext and Social Media (HT) took place at the College of Arts at Towson University in Baltimore, Maryland. Researchers from around the world presented the results of complete or ongoing work in tutorial, poster, and paper sessions. Also, during the conference I had the opportunity to present a full paper: "Bootstrapping Web Archive Collections from Social Media" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson.

    Day 1 (July 9, 2018)


    The first day of the conference was dedicated to a tutorial (Efficient Auto-generation of Taxonomies for Structured Knowledge Discovery and Organization) and three workshops:
    1. Human Factors in Hypertext (HUMAN)
    2. Opinion Mining, Summarization and Diversification
    3. Narrative and Hypertext
    I attended the Opinion Mining, Summarization and Diversification workshop. The workshop started with a talk titled: "On Reviews, Ratings and Collaborative Filtering," presented by Dr. Oren Sar Shalom, principal data scientist at Intuit, Israel. Next, Ophélie Fraisier, a PhD student studying stance analysis on social media at Paul Sabatier University, France, presented: "Politics on Twitter : A Panorama," in which she surveyed methods of analyzing tweets to study and detect polarization and stances, as well as election prediction and political engagement.
    Next, Jaishree Ranganathan, a PhD student at the University of North Carolina, Charlotte, presented: "Automatic Detection of Emotions in Twitter Data - A Scalable Decision Tree Classification Method."
    Finally, Amin Salehi, a PhD student at Arizona State University, presented: "From Individual Opinion Mining to Collective Opinion Mining." He showed how collective opinion mining can help capture the drivers behind opinions as opposed to individual opinion mining (or sentiment) which identifies single individual attitudes toward an item.

    Day 2 (July 10, 2018)


    The conference officially began on day 2 with a keynote: "Lessons in Search Data" by Dr. Seth Stephens-Davidowitz, a data scientist and NYT bestselling author of: "Everybody Lies."
    In his keynote, Dr. Stephens-Davidowitz revealed insights gained from search data ranging from racism to child abuse. He also discussed a phenomenon in which people are likely to lie to pollsters (social desirability bias) but are honest to Google ("Digital Truth Serum") because Google incentivizes telling the truth. The paper sessions followed the keynote with two full papers and a short paper presentation.


    The first (full) paper of day 2 in the Computational Social Science session: "Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora," was presented by Shubhanshu Mishra, a PhD student at the iSchool of the University of Illinois at Urbana-Champaign. He showed correlations between user-level and tweet-level metadata by addressing two questions: "Do tweets from users with similar Twitter characteristics have similar sentiments?" and "What meta-data features of tweets and users correlate with tweet sentiment?" 
    Next, Dr. Fred Morstatter presented a full paper: "Mining and Forecasting Career Trajectories of Music Artists," in which he showed that their dataset generated from concert discovery platforms can be used to predict important career milestones (e.g., signing by a major music label) of musicians.
    Next, Dr. Nikolaos Aletras, a research associate at the University College London, Media Futures Group, presented a short paper: "Predicting Twitter User Socioeconomic Attributes with Network and Language Information." He described a method of predicting the occupational class and income of Twitter users by using information extracted from their extended networks.
    After a break, the Machine Learning session began with a full paper (Best Paper Runner-Up): "Joint Distributed Representation of Text and Structure of Semi-Structured Documents," presented by Samiulla Shaikh, a software engineer and researcher at IBM India Research Labs.
    Next, Dr. Oren Sar Shalom presented a short paper titled: "As Stable As You Are: Re-ranking Search Results using Query-Drift Analysis," in which he presented the merits of using query-drift analysis for search re-ranking. This was followed by a short paper presentation titled: "Embedding Networks with Edge Attributes," by Palash Goyal, a PhD student at University of Southern California. In his presentation, he showed a new approach to learn node embeddings that uses the edges and associated labels.
    Another short paper presentation (Recommendation System session) by Dr. Oren Sar Shalom followed. It was titled: "A Collaborative Filtering Method for Handling Diverse and Repetitive User-Item Interactions." He presented a collaborative filtering model that captures multiple complex user-item interactions without any prior domain knowledge.
    Next, Ashwini Tonge, a PhD student at Kansas State University presented a short paper titled: "Privacy-Aware Tag Recommendation for Image Sharing," in which she presented a means of tagging images on social media in order to improve the quality of user annotations while preserving user privacy sharing patterns.
    Finally, Palash Goyal presented another short paper titled: "Recommending Teammates with Deep Neural Networks."

    The day 2 closing keynote by Leslie Sage, director of data science at DevResults followed after a break that featured a brief screening of the 2018 World Cup semi-final game between France and Belgium. In her keynote, she presented the challenges experienced in the application of big data toward international development.

    Day 3 (July 11, 2018)


    Day 3 of the conference began with a keynote: "Insecure Machine Learning Systems and Their Impact on the Web" by Dr. Ben Zhao, Neubauer Professor of Computer Science at the University of Chicago. He highlighted many milestones of machine learning by showing problems it has solved in natural language processing and computer vision. He also showed that opaque machine learning systems are vulnerable to attack by agents with malicious intent, and he expressed the idea that these critical issues must be addressed, especially given the rush to deploy machine learning systems.
    Following the keynote, I presented our full paper, "Bootstrapping Web Archive Collections from Social Media," in the Temporal session. I highlighted the importance of web archive collections as a means of preserving the historical record of important events, and of the seeds (URLs) from which they are formed. The seeds are collected by expert curators, but we do not have enough experts to collect seeds in a world of rapidly unfolding events. Consequently, I proposed exploiting the collective domain expertise of web users by generating seeds from social media collections, and I showed, through a novel suite of measures, that seeds generated from social media are similar to those generated by experts.

    Next, Paul Mousset, a PhD student at Paul Sabatier University, presented a full paper: "Studying the Spatio-Temporal Dynamics of Small-Scale Events in Twitter," in which he presented his work into the granular identification and characterization of event types on Twitter.
    Next, Dr. Nuno Moniz, invited Professor at the Sciences College of the University of Porto, presented a short paper: "The Utility Problem of Web Content Popularity Prediction." He demonstrated that state-of-the-art approaches for predicting web content popularity have been optimized for improving the predictability of average behavior of data: items with low levels of popularity.
    Next, Samiulla Shaikh (again) presented the first full paper (Nelson Newcomer Award winner) of the Semantic session: "Know Thy Neighbors, and More! Studying the Role of Context in Entity Recommendation," in which he showed how to efficiently explore a knowledge graph for the purpose of entity recommendation by utilizing contextual information to help select a subset of entities in the knowledge graph.
    Samiulla Shaikh (again), presented a short paper: "Content Driven Enrichment of Formal Text using Concept Definitions and Applications," in which he showed a method of making formal text more readable to non-expert users by text enrichment e.g., highlighting definitions and fetching of definitions from external data sources.
    Next, Yihan Lu, a PhD student at Arizona State University, presented a short paper: "Modeling Semantics Between Programming Codes and Annotations." He presented the results of investigating a systematic method to examine annotation semantics and its relationship with source code. He also showed their model, which predicts concepts in programming code annotations. Such annotations could be useful to new programmers.
    Following a break, the User Behavior session began. Dr. Tarmo Robal, a research scientist at the Tallinn University of Technology, Estonia, presented a full paper: "IntelliEye: Enhancing MOOC Learners' Video Watching Experience with Real-Time Attention Tracking." He introduced IntelliEye, a system that monitors students watching video lessons and detects when they are distracted and intervenes in an attempt to refocus their attention.
    Next, Dr. Ujwal Gadiraju, a postdoctoral researcher at L3S Research Center, Germany, presented a full paper: "SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing." He presented his findings from investigating the role of task similarity in microtask crowdsourcing on platforms such as Amazon Mechanical Turk and its effect on market dynamics.
    Next, Xinyi Zhang, a computer science PhD candidate at UC Santa Barbara, presented a short paper: "Penny Auctions are Predictable: Predicting and profiling user behavior on DealDash." She showed that penny auction sites such as DealDash are vulnerable to modeling and adversarial attacks by showing that both the timing and source of bids are highly predictable and users can be easily classified into groups based on their bidding behaviors.
    Shortly after another break, the Hypertext paper sessions began. Dr. Charlie Hargood, senior lecturer at Bournemouth University, UK, and Dr. David Millard, associate professor at the University of Southampton, UK, presented a full paper: "The StoryPlaces Platform: Building a Web-Based Locative Hypertext System." They presented StoryPlaces, an open source authoring tool designed for the creation of locative hypertext systems.
    Next, Sharath Srivatsa, a Masters student at International Institute of Information Technology, India, presented a full paper: "Narrative Plot Comparison Based on a Bag-of-actors Document Model." He presented an abstract "bag-of-actors" document model for indexing, retrieving, and comparing documents based on their narrative structures. The model resolves the actors in the plot and their corresponding actions.
    Next, Dr. Claus Atzenbeck, professor at Hof University, Germany, presented a short paper: "Mother: An Integrated Approach to Hypertext Domains." He stated that the Dexter Hypertext Reference Model which was developed to provide a generic model for node-link hypertext systems does not match the need of Component-Based Open Hypermedia Systems (CB-OHS), and proposed how this can be remedied by introducing Mother, a system that implements link support.
    The final (short) paper of the day, "VAnnotatoR: A Framework for Generating Multimodal Hypertexts," was presented by Giuseppe Abrami. He introduced a virtual reality and augmented reality framework for generating multimodal hypertexts called VAnnotatoR. The framework enables the annotation and linkage of texts, images and their segments with walk-on-able animations of places and buildings.
    The conference banquet at Rusty Scupper followed the last paper presentation. The next HyperText conference was announced at the banquet.

    Day 4 (July 12, 2018)


    The final day of the conference featured multiple paper presentations.
    The day began with a keynote "The US National Library of Medicine: A Platform for Biomedical Discovery and Data-Powered Health," presented by Elizabeth Kittrie, strategic advisor for data and open science at the National Library of Medicine (NLM). She discussed the role the NLM serves such as provider of health data for biomedical research and discovery. She also discussed the challenges that arise from the rapid growth of biomedical data, shifting paradigms of data sharing, as well as the role of libraries in providing access to digital health information.
    The Privacy session of exclusively full papers followed the keynote. Ghazaleh Beigi, a PhD student at Arizona State University presented: "Securing Social Media User Data - An Adversarial Approach." She showed a privacy vulnerability that arises from the anonymization of social media data by demonstrating an adversarial attack specialized for social media data.
    Next, Mizanur Rahman, a PhD student at Florida International University, presented: "Search Rank Fraud De-Anonymization in Online Systems." The bots and automatic methods session with two full paper presentations followed.
    Diego Perna, a researcher at the University of Calabria, Italy, presented: "Learning to Rank Social Bots." Given recent reports about the use of bots to spread misinformation/disinformation on the web in order to sway public opinion, Diego Perna proposed a machine-learning framework for identifying and ranking online social network accounts based on their degree similarity to bots.
    Next, David Smith, a researcher at the University of Florida, presented: "An Approximately Optimal Bot for Non-Submodular Social Reconnaissance." He noted that studies of how social bots befriend real users in order to collect sensitive information operate on the premise that the likelihood of users accepting bot friend requests is fixed, a premise contradicted by empirical evidence. Subsequently, he presented his work addressing this limitation.
    The News session began shortly after a break with a full paper (Best Paper Award) presentation from Lemei Zhang, a PhD candidate at the Norwegian University of Science and Technology: "A Deep Joint Network for Session-based News Recommendations with Contextual Augmentation." She highlighted some of the issues news recommendation systems suffer from, such as the fast update rate of news articles and the lack of user profiles. She then proposed a news recommendation system that combines user click events within sessions and news contextual features to predict the next click behavior of a user.
    Next, Lucy Wang, senior data scientist at Buzzfeed, presented a short paper: "Dynamics and Prediction of Clicks on News from Twitter."
    Next, Sofiane Abbar, senior software/research engineer at the Qatar Computing Research Institute, presented via a YouTube video: "To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content." He proposed a new approach for predicting the popularity of news articles before they are published. The approach is based on observations regarding article similarity and topicality and complements existing content-based methods.

    Next, two full papers (Community Detection session) were presented by Ophélie Fraisier and Amin Salehi. Ophélie Fraisier presented: "Stance Classification through Proximity-based Community Detection." She proposed the Sequential Community-based Stance Detection (SCSD) model for detecting stances (online viewpoints). It is a semi-supervised ensemble algorithm that considers multiple signals which inform stance detection. Next, Amin Salehi presented: "Sentiment-driven Community Profiling and Detection on Social Media." He presented a method of profiling social media communities based on their sentiment toward topics and proposed a method of detecting such communities and identifying the motives behind their formation.
    I would like to thank the organizers of the conference, the hosts, Towson University College of Arts, as well as IMLS for funding our research.
    -- Nwala (@acnwala)

    2018-07-22: Tic-Tac-Toe and Magic Square Made Me a Problem Solver and Programmer


    "How did you learn programming?", a student asked me in a recent summer camp. Dr. Yaohang Li organized the Machine Learning and Data Science Summer Camp for High School students of the Hampton Roads metropolitan region at the Department of Computer Science, Old Dominion University from June 25 to July 9, 2018. The camp was funded by the Virginia Space Grant Consortium. More than 30 students participated in it. They were introduced to a variety topics such as Data Structures, Statistics, Python, R, Machine Learning, Game Programming, Public Datasets, Web Archiving, and Docker etc. in the form of discussions, hands-on labs, and lectures by professors and graduate students. I was invited to give a lecture about my research and Docker. At the end of my talk I solicited questions and distributed Docker swag.

    The question "How did you learn programming?" led me to draw Tic-Tac-Toe Game and a 3x3 Magic Square on the white board. Then I told them a more than a decade old story of the early days of my bachelors degree when I had recently got my very first computer. One day while brainstorming on random ideas, I realized the striking similarity between the winning criteria of a Tic-Tac-Toe game and sums of 15 using three numbers of a 3x3 Magic Square that uses unique numbers from one to nine. The similarity has to do with their three rows, three columns, and two diagonals. After confirming that there are only eight combinations of selecting three unique numbers from one to nine whose sum is 15, I was sure that those are all placed at strategic locations in a magic square and there is no other possibility left for another such combination. If we assign values to each block of the Tic-Tac-Toe game according the Magic Square and store list of values acquired by the two players, we can decide potential winning moves in the next step by trying various combinations of two acquired vales of a player and subtracting it from 15. For example, if places 4 and 3 are acquired by the red (cross sign) player then a potential winning move would be place 8 (15-4-3=8). With this basic idea of checking potential wining move, when the computer is playing against a human, I could set strategies of first checking for the possibility of winning moves by the computer and if none are available then check for the possibility of the next winning moves by the human player and block them. While there are many other approaches to solve this problem, my idea was sufficient to get me excited and try to write a program for it.

    By that time I had only a basic understanding of programming constructs such as variables, conditions, loops, and functions in the C programming language, from the introductory Computer Science curriculum. While C is a great language for many reasons, it was not an exciting language for me as a beginner. If I were to write the Tic-Tac-Toe game in C, I would have ended up writing something with a text-based user interface in the terminal, which is not what I was looking for. I asked someone about the possibility of writing software with a graphical user interface (GUI), and he suggested that I try Visual Basic. So I went to the library, got a book on VB6, and studied it for about a week. Now I was ready to create a small window with nine buttons arranged in a 3x3 grid. When one of these buttons was clicked, a colored label (a circle or a cross) would be placed and a callback function would be called with an argument carrying the value associated with the position of the button (as per the Magic Square arrangement). The callback function could then update the state and play the next move. Later, the game was improved with different modes and settings.

    One day, I excitedly shared my program and approach with a professor (who is working for Microsoft now). He said this technique is explored in an algorithms book too. This made me feel a little deflated, because I was not the first one to come up with the idea. However, I was equally happy that I had discovered it independently and that it had already been validated by some smart people.

    This was not the only occasion when I had an idea and needed the right tool to express it. Over time my curiosity led me to many more challenges, ideas for potential solutions, and explorations of numerous suitable tools, techniques, and programming languages.


    My talk was scheduled for Wednesday, June 27, 2018. I started by introducing myself, the WS-DL Research Group, and the basics of Web Archiving, and then briefly talked about my Archive Profiling research. Without going too much into the technical details, I tried to explain the need for Memento Routing and how Archive Profiles can help achieve it.


    Luckily, Dr. Michele Weigle had already introduced Web Archiving to them the day before my talk. When I started mentioning Web Archives, they knew what I was talking about. This helped me cut my talk down and save some time to talk about other things and the Q/A session.


    I then put my Docker Campus Ambassador hat on and started with the Dockerizing ArchiveSpark story. Then I briefly described what Docker is, where it can be useful, and how it works. I walked them through a code example to illustrate the procedure of working with Docker. As expected, it was their first encounter with Docker, and many of them had no experience with the Linux operating system either, so I tried to keep things as simple as possible.


    I had a lot of stickers and some leftover T-shirts from my previous Docker event, so I gave them to those who asked any questions. A couple days later, Dr. Li told me that the students were very excited about Docker and especially those T-shirts, so I decided to give a few more of those away. For that, I asked them a few questions related to my earlier talk and whoever was able to recall the answers got a T-shirt.


    Overall, I think it was a successful summer camp. I am positive that those High School students had a great learning experience and exposure to some research techniques that can be helpful in their careers, and some of them might be encouraged to pursue a graduate degree one day. Being a research university, ODU is enriched with many talented graduate students with a variety of expertise and experiences that can benefit the community at large. I think more such programs should be organized in the Department of Computer Science and various other departments of the university.


    It was a fun experience for me, as I interacted with High School students here in the USA for the first time. They were all energetic, excited, and engaging. Good luck to all who were part of this two-week-long event. And now you know how I learned programming!

    --
    Sawood Alam

    2018-08-01: A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages


    As commonly seen on Facebook and Twitter, the social card is a type of surrogate that provides clues as to what is behind a URI. In this case, the URI is from Google and the social card makes it clear that the document behind this long URI is directions.
    As I described to the audience of Dodging the Memory Hole last year, surrogates provide the reader with some clue of what exists behind a URI. The social card is one type of surrogate. Above we see a comparison between a Google URI and a social card generated from that URI. Unless a reader understands the structure of all URIs at google.com, they will not know what the referenced content is about until they click on it. The social card, on the other hand, provides clues to the reader that the underlying URI provides directions from Old Dominion University to Los Alamos National Laboratory. Surrogates allow readers to pierce the veil of the URI's opaqueness.

    With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Our research team blog posts serve as containing pages for embeds all of the time. We typically use embeddable surrogates of tweets, videos from YouTube, and presentations from Slideshare, but surrogates can be generated for a variety of other resources as well. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.


    A recent WS-DL blog post containing embeddable surrogates of Slideshare presentations.


An example MementoEmbed social card for a memento from blasttheory.co.uk, whose description begins: "Sam Pearson and Clara Garcia Fraile are in residence for one month working on a new project called In My Shoes. They are developin…"

    MementoEmbed is the first archive-aware embeddable surrogate service. This means it can include memento-specific information such as the memento-datetime, the archive from which a memento originates, and the memento's original resource domain name. In the MementoEmbed social card above, we see the following information:
    • from the resource itself:
      • title — "Blast Theory"
      • a description conveying some information of what the resource is about — "Sam Pearson and Clara Garcia..."
      • a striking image from the resource conveying some visual aspect of aboutness
      • its original web site favicon — the bold "B" in the lower left corner
      • its original domain name — "BLASTTHEORY.CO.UK"
  • its memento-datetime — 2009-05-22T22:12:51Z
      • a link to its current version — under "Current version"
      • a link to other versions — under "Other Versions"
    • from the archive containing the resource:
      • the domain name of the archive — "WEBARCHIVE.ORG.UK"
      • the favicon of the archive — the white "UKWA" on the aqua background
  • a link to the memento in the archive — accessible via the links in the title and the memento-datetime
    Most of this information is not provided by services for live web resources, such as Embed.ly.

    MementoEmbed is a deployable service that currently generates social cards, like the one above, and thumbnails. As with most software I announce, MementoEmbed is still in its alpha prototype phase, meaning that crashes and poor output are to be expected. A bleeding edge demo is available at http://mementoembed.ws-dl.cs.odu.edu. The source code is available from https://github.com/oduwsdl/MementoEmbed. Its documentation is growing at https://mementoembed.readthedocs.io/en/latest/.

In spite of its simplicity in concept, MementoEmbed is an ambitious project, requiring that it not only support parsing and processing of the different web concepts and technologies of today, but all that have ever existed. With this breadth of potential in mind, I know that MementoEmbed does not yet handle all memento cases, but that is where you can help: contribute by submitting issue reports that help us improve it.

    But why use MementoEmbed instead of some other service? What are the goals of MementoEmbed? How does it work? What does the future of MementoEmbed look like?

    Why MementoEmbed?


    Why should someone use MementoEmbed and not some other embedding service? I reviewed several embedding services mentioned on the web. The examples in this section will demonstrate some embeds using a memento of the New York Times front page from 2005 preserved by the Internet Archive, shown below.

    This is a screenshot of the example New York Times memento used in the rest of this section. Its memento-datetime is June 2, 2005 at 19:45:24 GMT and it is preserved by the Internet Archive. This page was selected because it contains a lot of content, including images.
I reviewed Embed.ly, embed.rocks, Iframely, noembed, microlink, and autoembed. As of this writing, the autoembed service appears to be gone. The noembed service only provides embeds for a small number of web sites and does not support web archives. Iframely responds with errors for memento URIs, as shown below.
    Iframely fails to generate an embed for a memento of a New York Times page at the Internet Archive. The error message is misleading. There are multiple images on this page.
    What the Iframely parsers see for this memento according to their web application.
    What Iframely generates for the current New York Times web page (as of July 29, 2018 at 18:23:15 GMT).


Embed.ly, embed.rocks, and microlink are the only services that attempt to generate embeds for mementos. Unfortunately, none of them are fully archive-aware. One of the goals of a good surrogate is to convey some level of aboutness with respect to the underlying web resource. Mementos are documents with their own topics. They are typically not about the archives that contain them. Intermixing these two concepts of document content and archive information, without clear separation, produces surrogates that can confuse users. The microlink screenshot below shows an embed that fails to convey the aboutness of its underlying memento. The microlink service is not archive-aware. In this example, microlink mixes the Internet Archive favicon and Internet Archive banner with the title from the original resource. The embed.rocks example below does not fare much better, appearing to attribute the New York Times article to web.archive.org. What is the resource behind this surrogate really about? This mixing of resources weakens the surrogate's ability to convey the aboutness of the memento.

As seen in the screenshot of a social card for our example New York Times memento from 2005, microlink conflates original resource information and archive information.
    The embed.rocks social card does not fare much better, attributing the New York Times page to web.archive.org.

Embed.ly does a better job, but still falls short. In the screenshot below, an embed was created for the same resource. It contains the title of the resource as well as a short description and even a striking image from the memento itself. Unfortunately, it contains no information about the original resource, potentially implying that someone at archive.org is serving content for the New York Times. Even worse, in a world where readers are concerned about fake news, this surrogate may lead an informed reader to believe that it links to a counterfeit resource because it does not come from nytimes.com.
This screenshot of an embed for the same New York Times memento shows how well Embed.ly performs. While the image and description convey more of the original resource's aboutness, the only attribution information is about the archive.
    Below, the same resource is represented as a social card in MementoEmbed. MementoEmbed chose the New York Times logo as the striking image for this page. This card incorporates elements used in other surrogates, such as the title of the page, a description, and a striking image pulled from the page content. Further down, I annotate the card and show how the information exists in separate areas of the card. MementoEmbed places archive information and the original resource information into their own areas of the card, visually providing separation between these concepts to reduce confusion.

    A screenshot of the same New York Times memento in MementoEmbed.



    This is not to imply that cards generated by Embed.ly or other services should not be used, just that they appear to be tailored to live web resources. MementoEmbed is strictly designed for use with mementos and strives to occupy that space.

    Goals of MementoEmbed


MementoEmbed was built with the following goals in mind.

    1. The system shall provide archive-aware surrogates of mementos
    2. The system shall be deployable by others
    3. Surrogates shall degrade gracefully
    4. Surrogates shall have limited or no dependency on an external service
    5. Not just humans, but machines shall be able to generate surrogates
In the prior section, I demonstrated how we meet the first goal. In the following subsections, I'll provide an overview of how well the current service meets the other goals.

    Deployable by others



    I did not want MementoEmbed to be another centralized service. My goal is that eventually web archives can run their own copies of MementoEmbed. Visitors to those archives will be able to create their own embeds from mementos they find. The embeds can be used in blog posts and other web pages and thus help these archives promote themselves.

    MementoEmbed is a Python Flask application that can be run from a Docker container. Again, it is in its alpha prototype phase, but thanks to the expertise of fellow WS-DL member Sawood Alam, others can download the current version from DockerHub.

    Type the following to acquire the MementoEmbed Docker image:

    docker pull oduwsdl/mementoembed

    Type the following to create a container from the image and run it on TCP port 5550:

    docker run -it --rm -p 5550:5550 oduwsdl/mementoembed

Inside the container, the service runs on port 5550. The -p flag maps the container's port 5550 to your local port 5550. From here, the user can access the service at http://localhost:5550 and is greeted with the page below.

    The welcome page for MementoEmbed.

    Surrogates that degrade gracefully



    Prior to executing any JavaScript, MementoEmbed's social cards use the blockquote, div, and p tags. After JavaScript, these tags are augmented with styles, images, and other information. This means that if the MementoEmbed JavaScript resource is not available, the social card is still viewable in a browser, as seen below.

    A MementoEmbed social card generated for a memento from the Portuguese Web Archive.


    The same social card rendered without the associated JavaScript.


    Surrogates with limited or no external dependencies


    All web resources are ephemeral, and embedding services are no exception. If an embed service fails or otherwise disappears, what happens to its embeds? Consider Embed.ly. The embed code for Embed.ly is typically less than 100 bytes in length. They achieve this small size because their embeds contain the title of the represented page, the represented URI, and a URI to a JavaScript resource. Everything else is loaded from their service via that JavaScript resource. Web page authors trade a small embed code for dependency on an outside service. Once that JavaScript is executed and a page is rendered, the embed grows to around 2kB. What has the web page author using the embed really gained from the small size? They have less to copy and paste, but their page size still grows once rendered. Also, in order for their page to render, it now relies on the speed and dependability of yet another external service. This is why Embed.ly cards sometimes experience a delay when the containing page is being rendered.

    Privacy can be another concern. Embedded resources result in additional requests to web servers outside of the one providing the containing page. This means that an embed not only potentially conveys information about which pages it is embedded in, but also who is visiting these pages. If a web page author does not wish to share their audience with an outside service, then they might want to reconsider embeds.

    Thinking about this from the perspective of web archives, I decided that MementoEmbed can do better. I started thinking about how its embeds could outlive MementoEmbed while at the same time offering privacy to visiting users.

    MementoEmbed offers thumbnails as data URIs so that pages using these thumbnails do not depend on MementoEmbed.
    Currently, MementoEmbed provides surrogates either as social cards or thumbnails. In response to requests for thumbnails, MementoEmbed provides an embed as a data URI, as shown above. Data URI support for images in browsers is well established at this point. A web page containing the data URI can render it without relying upon any MementoEmbed service, thus removing an external dependency. Of course, one can also save the thumbnail locally and upload it to their own server.

    MementoEmbed offers the option of using data URIs for images and favicons in social cards so that these embedded resources are not dependent on outside services.
    For social cards, I tried to take the use of data URIs a step further. As seen in the screenshot above, MementoEmbed allows the user to use data URIs in their social card rather than just relying upon external resources for favicons and images. This makes the embeds larger, but ensures that they do not rely upon external services.
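
The idea behind these data URIs is straightforward. The sketch below (a minimal illustration, not MementoEmbed's internal code; the file name is a placeholder of my own) shows how raw image bytes can be wrapped in a data URI that a browser can render without contacting any external server:

import base64

def to_data_uri(image_bytes, mime_type="image/png"):
    # Encode the raw bytes so they can live inside an HTML attribute
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mime_type, encoded)

# For example, turn a saved thumbnail into a self-contained <img> element
with open("thumbnail.png", "rb") as f:   # "thumbnail.png" is a placeholder file name
    print('<img src="{}">'.format(to_data_uri(f.read())))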

    As noted in the previous section, MementoEmbed includes some basic data and simple HTML to allow for degradation. CSS and images are later added by JavaScript loaded from the MementoEmbed service. To eliminate this dependency, I am currently working on an option that will allow the user (or machine) to request an HTML-only social card.

    Not just for humans


    The documentation provides information on the growing web API that I am developing for MementoEmbed. For the sake of brevity, I will talk about how a machine can request a social card or a thumbnail here.

MementoEmbed uses tactics similar to those of other web archive frameworks. Each service has its own URI "stem", and the URI-M to be operated on is appended to this stem.

    Firefox displays a social card produced by the machine endpoint /services/product/socialcard at http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
    To request a social card, a URI-M is appended to the endpoint /services/product/socialcard/. For example, consider a system that wants to request a social card for the memento at http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ from the MementoEmbed service running at mementoembed.ws-dl.cs.odu.edu. The client would visit: http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the HTML and JavaScript necessary to render the social card, as seen in the above screenshot.

    Firefox displays a thumbnail produced by the machine endpoint /services/product/thumbnail at http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
    Likewise, to request a thumbnail for the same URI-M from the same service, the machine would visit the endpoint at /services/product/thumbnail at the URI http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the image as shown in the above Firefox screenshot. The thumbnail service returns thumbnails in the PNG image format.
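
Both endpoints are easy to script. Below is a minimal sketch using Python's requests library against the public demo (the saved file name is my own choice):

import requests

SERVICE = "http://mementoembed.ws-dl.cs.odu.edu"
URIM = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"

# Fetch the embed code (HTML and JavaScript) for a social card
card = requests.get(SERVICE + "/services/product/socialcard/" + URIM)
print(card.status_code, card.headers.get("Content-Type"))

# Fetch a PNG thumbnail for the same memento and save it locally
thumbnail = requests.get(SERVICE + "/services/product/thumbnail/" + URIM)
with open("mkelly-thumbnail.png", "wb") as f:
    f.write(thumbnail.content)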

Clients can use the Prefer header from RFC 7240 to control the generation of these surrogates. I have written about the Prefer header before, and Mat Kelly is using it in his work as well. Simply put, the client uses the Prefer header to request certain behaviors from a server with respect to a resource. The server responds with a Preference-Applied header indicating which behaviors exist in the response.

    For example, to change the width of a thumbnail to 500 pixels, a client would generate a Prefer header containing the thumbnail_width option. If one were to use curl, the HTTP request headers to a local instance of MementoEmbed would look like this, with the Prefer header marked red for emphasis:

    GET /services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ HTTP/1.1
    Host: localhost:5550
    User-Agent: curl/7.54.0
    Accept: */*
    Prefer: thumbnail_width=500

And the MementoEmbed service would respond with the following headers, with the Preference-Applied header marked red for emphasis:

    HTTP/1.0 200 OK
    Content-Type: image/png
    Content-Length: 216380
    Preference-Applied: viewport_width=1024,viewport_height=768,thumbnail_width=500,thumbnail_height=375,timeout=15,remove_banner=no
    Server: Werkzeug/0.14.1 Python/3.6.5
    Date: Sun, 29 Jul 2018 21:08:19 GMT

    The server indicates that the thumbnail returned has not only a width of 500 pixels, but also a height of 375 pixels. Also included are other preferences used in its creation, like the size of the browser viewport, the number of seconds MementoEmbed waited before giving up on a response from the archive, and whether or not the archive banner was removed.
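
The same exchange can be reproduced from a script. Here is a sketch using Python's requests library against a local instance (the saved file name is arbitrary):

import requests

urim = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"
endpoint = "http://localhost:5550/services/product/thumbnail/" + urim

# Ask for a 500 pixel wide thumbnail via the Prefer header
response = requests.get(endpoint, headers={"Prefer": "thumbnail_width=500"})

# Preference-Applied echoes the settings the service actually used
print(response.headers.get("Preference-Applied"))

with open("thumbnail-500px.png", "wb") as f:
    f.write(response.content)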

    The social card service also supports preferences for whether or not to use data URIs for images and favicons.

Other service endpoints exist, like /services/memento/archivedata, to provide individual pieces of the information used in social cards. In addition to these services, I am also developing an oEmbed endpoint for MementoEmbed.

    Brief Overview of MementoEmbed Internals



    Here I will briefly cover some of the libraries and algorithms used by MementoEmbed. The Memento protocol is a key part of what allows MementoEmbed to work. MementoEmbed uses the Memento protocol to discover the original resource domain, locate favicons, and of course to find a memento's memento-datetime.

    If metadata is present in HTML meta tags, then MementoEmbed uses those values for the social card. MementoEmbed favors Open Graph metadata tags first, followed by Twitter card metadata, and then resorts to mining the HTML page for items like title, description, and striking image.
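
To illustrate that fallback chain (this is only a sketch of the approach, with a function name of my own, not MementoEmbed's actual code), a title extractor might look like this:

from bs4 import BeautifulSoup

def extract_title(html):
    # Prefer Open Graph metadata, then Twitter card metadata, then the <title> tag
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", attrs={"property": "og:title"})
    if og and og.get("content"):
        return og["content"]
    tw = soup.find("meta", attrs={"name": "twitter:title"})
    if tw and tw.get("content"):
        return tw["content"]
    return soup.title.string.strip() if soup.title and soup.title.string else ""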

Titles are extracted for social cards using BeautifulSoup. The description is generated using readability-lxml. This library provides scores for paragraphs in an HTML document. Based on comments in the readability code, the paragraph with the highest score is considered to be "good content". The highest scored paragraph is selected for use in the description and truncated to the first 197 characters so it will fit into the card. If readability fails for some reason, MementoEmbed falls back to building one large paragraph from the content using justext and taking the first 197 characters from it, a process Grusky et al. refer to as Lede-3.
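
A rough sketch of that description pipeline, assuming the readability-lxml and justext packages are installed (again, an illustration of the approach rather than the project's exact code):

from bs4 import BeautifulSoup
from readability import Document   # provided by readability-lxml
import justext

def extract_description(html, limit=197):
    try:
        # readability returns its highest scored "good content" as an HTML fragment
        summary_html = Document(html).summary()
        text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
    except Exception:
        # Fall back to one large paragraph built with justext (Lede-3 style)
        paragraphs = justext.justext(html.encode("utf-8"),
                                     justext.get_stoplist("English"))
        text = " ".join(p.text for p in paragraphs if not p.is_boilerplate)
    return text[:limit]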

    Striking image selection is a difficult problem. To support our machine endpoints, I needed to find a method that would select an image without any user intervention. There are several research papers offering different solutions for image selection based on machine learning. I was concerned about performance, so I opted to use some heuristics instead. Currently, MementoEmbed employs an algorithm that scores images using the equation below.
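
An illustrative form consistent with the variables described below (the exact equation and weighting MementoEmbed uses may differ) is:

S = k_1 \frac{N - n}{N} + k_2 s - k_3 h - k_4 r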



    where S is the score, N is the number of images on the page, n is the current image position on the page, s is the size of the image in pixels, h is the number of bars in the image histogram containing a value of 0, and r is the ratio of width to height. The variables k1 through k4 are weights. This equation is built on several observations. Images earlier in a page (a low value of n) tend to be more important. Larger images (a high s) tend to be preferred. Images with a histogram consisting of many 0s tend to be mostly text, and are likely advertisements or navigational elements. Images whose width is much greater than their height (a high value for r) tend to be banner ads. For performance, the first 15 images on a page are scored. If the highest scoring image meets some threshold, then it is selected. If no images meet that threshold, then the next 15 are loaded and evaluated.

The thumbnails are generated by a call from Flask to Puppeteer. MementoEmbed includes a Python class that can make this cross-language call, provided the user has Puppeteer installed. If requested by the user, MementoEmbed uses its knowledge of various archives to produce a thumbnail without the archive banner. This only works for some archives. For Wayback archives, information for choosing URI-Ms without banners was gathered from Table 9 of John Berlin's Master's thesis.
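
One way to picture that cross-language call is a small wrapper that shells out to Node.js (a hypothetical sketch: "screenshot.js" stands in for a Puppeteer script and is not a real file in the project):

import subprocess

def generate_thumbnail(urim, output_path, viewport_width=1024):
    # Invoke a Node.js/Puppeteer script that loads the URI-M and writes a PNG
    # screenshot; "screenshot.js" is an invented placeholder for illustration.
    subprocess.run(
        ["node", "screenshot.js", urim, output_path, str(viewport_width)],
        check=True, timeout=30
    )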

    The Future of MementoEmbed



MementoEmbed has many possibilities. I have already mentioned that MementoEmbed will support features like an oEmbed endpoint and HTML-only social cards. In the foreseeable future, I will address language-specific issues and problems with certain web constructs, like framesets and ancient character sets. I also foresee the need for additional social card preferences, like changes to width and height as well as a preference for a vertical rather than horizontal card. One could even use content negotiation to request thumbnails in formats other than PNG.

The striking image selection algorithm will be improved. At the moment, the weights are set to values that worked in my limited testing. It is likely that new weights, a new equation, or even a new algorithm will be employed at some point. Feedback from the community will guide these decisions.

    Some other ideas that I have considered involve new forms of surrogates. Simple alterations to existing surrogates are possible, like social cards that contain thumbnails or social cards without any images. More complex concepts like Teevan's Visual Snippets or Woodruff's enhanced thumbnails would require a lot of work, but are possible within the framework of MementoEmbed.

    A lot of it will depend on the needs of the community. Thanks to Sawood Alam, Mat Kelly, Grant Atkins, Michael Nelson, and Michele Weigle for already providing feedback. As more people experience MementoEmbed, they will no doubt come up with ideas I had not considered, so please try our demo at http://mementoembed.ws-dl.cs.odu.edu or look at the source code in GitHub at https://github.com/oduwsdl/MementoEmbed. Most importantly, report any issues or ideas to our GitHub issue tracker: https://github.com/oduwsdl/MementoEmbed/issues.


    --Shawn M. Jones

    2018-08-25: Four WS-DL Classes Offered for Fall 2018


    Four WS-DL classes are offered for Fall 2018:
    Dr. Michele C. Weigle is not teaching this semester.

    Our current plan for courses in Spring 2019 is to offer a record five WS-DL courses:
    • CS 432/532 Web Science, Alexander Nwala
    • CS 725/825 Information Visualization, Dr. Michele C. Weigle
    • CS 734/834 Information Retrieval, Dr. Jian Wu
    • CS 795/895 Human-Computer Interaction (HCI), Dr. Sampath Jayarathna
    • CS 795/895 Web Archiving Forensics, Dr. Michael L. Nelson
    Note that CS 418, 431, and 432 all count for the CS Web Programming minor.  

    --Michael

