
2015-11-06: iPRES2015 Trip Report

From November 2nd through November 5th, Dr. Nelson, Dr. Weigle, and I attended the iPRES2015 conference at the University of North Carolina Chapel Hill. This served as a return visit for Drs. Nelson and Weigle; Dr. Nelson worked at UNC through a NASA fellowship and Dr. Weigle received her PhD from UNC. We also met with Martin Klein, a WS-DL alumnus now at the UCLA Library. While the last ODU contingent to visit UNC was not so lucky, we returned to Norfolk relatively unscathed.

Cal Lee and Helen Tibbo opened the conference with a welcome on November 3rd, followed by Nancy McGovern's keynote address, delivered with Leo Konstantelos and Maureen Pennock. This was not a traditional keynote but an interactive dialogue in which several challenge areas were presented to the audience, and the audience responded -- live and on Twitter -- with significant achievements or advances in those challenge areas from #lastyear. For example, Dr. Nelson identified the #iCanHazMemento utility. The responses are available on Google Docs.


I attended the Institutional Opportunities and Challenges session to open the conference. Kresimir Duretec presented "Benchmarks for Digital Preservation Tools." His presentation touched on how we can get digital preservation tools that "Just Work", including benchmarks for evaluating tools on test beds and measuring them for quality. Related to this is Mat Kelly's work on the Archival Acid Test.



Alex Thirifays presented "Towards a Common Approach for Access to Digital Archival Records in Europe." This paper touched on user access: user needs, best practices for identifying requirements for access, and a capability gaps analysis of current tools versus user needs.

"Developing a Highly Automated Web Archive System Based
on IIPC Open Source Software" was presented by Zhenxin Wu. Her paper outlined a framework of open source tools to archive the web using Heritrix and a SOLR index of WARCS with an enhanced interface.

Barbara Sierman closed the session with her presentation "Best Until ... A National Infrastructure for Digital Preservation in the Netherlands" focusing on user accessibility and organizational challenges as part of a national strategy for preserving digital and cultural Dutch heritage.

After lunch, I led off the Infrastructure Opportunities and Challenges session with my paper on Archiving Deferred Representations Using a Two-Tiered Crawling Approach. We defined deferred representations as those that rely on JavaScript to load embedded resources on the client. We show that archives can use PhantomJS to create a crawl frontier 1.5 times larger than Heritrix's, but PhantomJS crawls 10.5 times slower. We recommend using a classifier to recognize deferred representations and using PhantomJS only to crawl those representations, mitigating the crawl slow-down while still reaping the benefits of the headless crawler.
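
To make the two-tiered idea concrete, here is a minimal sketch (not the code from the paper) that dispatches each URI in the frontier to either a Heritrix-style crawl or a PhantomJS crawl based on a classifier's prediction; classify_deferred, crawl_with_heritrix, and crawl_with_phantomjs are hypothetical stand-ins for the classifier and the two crawlers.

# Minimal sketch: only URIs classified as deferred representations are sent
# to the slower, headless PhantomJS crawl; everything else goes to Heritrix.
# classify_deferred(), crawl_with_heritrix(), and crawl_with_phantomjs() are
# hypothetical stand-ins, not the implementation evaluated in the paper.
def crawl_frontier(frontier, classify_deferred, crawl_with_heritrix, crawl_with_phantomjs):
    discovered = set()
    for uri in frontier:
        if classify_deferred(uri):
            # Deferred representation: let the headless browser execute
            # JavaScript so client-side embedded resources are discovered.
            discovered.update(crawl_with_phantomjs(uri))
        else:
            # Ordinary representation: the faster Heritrix-style crawl suffices.
            discovered.update(crawl_with_heritrix(uri))
    return discovered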

 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling Approach from Justin Brunelle
  
Douglas Thain followed with his presentation on "Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?" Similar to our work with deferred representations, his work focuses on scientific replay of simulations and software experiments. He presents several tools as part of a framework for preserving the context of simulations and simulation software, including dependencies and build information.

Hao Xu presented "A Method for the Systematic Generation of Audit Logs in a Digital Preservation Environment and Its Experimental Implementation In a Production Ready System". His presentation focused on the construction of a finite state machine for determining whether a repository is following compliance policies for auditing purposes.

Jessica Trelogan and Lauren Jackson presented their paper "Preserving an Evolving Collection: 'On-The-Fly' Solutions for the Chora of Metaponto Publication Series." They discussed the storage of complex artifacts of ongoing research projects in archeology with the intent of improving the shareability of the collections.

To wrap up Day 1, we attended a panel on Preserving Born-Digital News consisting of Edward McCain, Hannah Sommers, Christie Moffatt, Abigail Potter (moderator), Stéphane Reecht, and Martin Klein. Christie Moffatt identified the challenges with archiving born-digital news material, including the challenges with scoping a corpus. She presented their case study on the Ebola response. Stéphane Reecht presented the work by the BnF regarding their work to perform massive, once-a-year crawls as well as selective, targeted daily crawls. Hannah Sommers provided insight into the culture of a news producer (NPR) on digital preservation. Martin Klein presented SoLoGlo (social, local, and global) news preservation, including citing statistics about the preservation of links shortened by the LA Times. Finally, Edward McCain discussed the ephemeral nature of born-digital news media, and provided examples of the sparse number of mementos in news pages in the Wayback Machine.


To kick off Day 2, Lisa Nakamura gave her opening keynote The Digital Afterlives of This Bridge Called My Back: Public Feminism and Open Access. Her talk focused on the role of Tumblr in curating and sharing a book no longer in print as a way to open the dialogue on the role of piracy and curation in the "wild" to support open access and preservation.

I attended the Dimensions of Digital Preservation session, which began with Liz Lyon's presentation on "Applying Translational Principles to Data Science Curriculum Development." Her paper outlines a study to help revise the University of Pittsburgh's data science curriculum. Nora Mattern took over the presentation to discuss the expectations of the job market to identify the skills required to be a professional data scientist.

Elizabeth Yakel presented "Educational Records of Practice: Preservation and Access Concerns." Her presentation outlined the unique challenges with preserving, curating, and making available educational data. Education researchers or educators can use these resources to further their education, reuse materials, and teach the next generation of teachers.

Emily Maemura presented "A Survey of Organizational Assessment Frameworks in Digital Preservation." She presented the results of a survey of organizational assessment frameworks, which aim to do for digital preservation what software maturity models do for computer scientists. Further, her paper identifies trends, gaps, and models for assessment.

Matt Schultz, Katherine Skinner, and Aaron Trehub presented "Getting to the Bottom Line: 20 Digital Preservation Cost Questions." Their questions help institutions evaluate cost, including questions about storage fees, support, business plans, etc. to help institutions assess their approach to taking on digital preservation.

After lunch, I attended the panel on Long Term Preservation Strategies & Architecture: Views from Implementers consisting of Mary Molinaro (moderator), Katherine Skinner, Sibyl Schaefer, Dave Pcolar, and Sam Meister. Sibyl Schaefer led off with a presentation on Chronopolis and the ACE audit manager. Dave Pcolar followed by presenting the Digital Preservation Network (DPN) and their data replication policies for dark archives. Sam Meister discussed the BitCurator Consortium, which helps with the acquisition, appraisal, arrangement and description, and access of archived material. Finally, Katherine Skinner presented the MetaArchive Cooperative and their activities teaching institutions to perform their own archiving, along with other statistics (e.g., the minimum number of copies to keep stuff safe is 5).

Day 2 concluded with the poster session (including a poster by Martin Klein) and reception.



Pam Samuelson opened Day 3 with her keynote Mass Digitization of Cultural Heritage: Can Copyright Obstacles Be Overcome? Her keynote touched on the challenges with preserving cultural heritage introduced by copyright, along with some of the emerging techniques to overcome the challenges. She identified duration of copyright as a major contributor to the challenges of cultural preservation. She notes that most countries have exceptions for libraries and archives for preservation purposes, and explains recent U.S. evolutions in fair use through the Google Books rulings.

After Samuelson's keynote, I concluded my iPRES2015 visit and explored Chapel Hill, including a visit to the Old Well (at the top of this post) and an impromptu demo of the pit simulation. It was very scary.



Several themes emerged from iPRES2015, including an increased emphasis on web archiving and a need for improved context, provenance, and access for digitally preserved resources. I look forward to monitoring the progress in these areas.


--Justin F. Brunelle


2015-11-24: Twitter Follower Analysis of Virginia University Alumni Associations

The primary goal of any alumni association is to maintain and strengthen the ties between its alumni, the community, and the mission of the university. With social media, it's easier than ever to connect with current and former graduates on Facebook, Instagram, or Twitter with a simple invitation to "like us" or "follow me." Considering just one of these social platforms, we recently analyzed the Twitter networks of twenty-three (23) Virginia colleges and universities to determine what, if any, social characteristics were shared among the institutions and whether we could gain any insight by examining the public profiles of their respective followers. The colleges of interest, ranked by number of followers in Table 1, vary in size, mission, type of institution, admissions selectivity, and perceived prestige. On average, the alumni associations have maintained a Twitter presence for six (6) years. The oldest Twitter account belongs to Roanoke College (@roanokecollege), which is approaching the eight (8) year mark. The newest Twitter account was registered by Randolph-Macon College (@RMCalums) nearly two years ago.




University | Followers | Joined Twitter
University of Virginia | 12,100 | 11/1/2008
Roanoke College* | 9,588 | 3/1/2008
Regent University* | 7,966 | 11/1/2008
James Madison University | 7,865 | 8/1/2008
Virginia Tech | 6,418 | 4/1/2009
College of William & Mary | 4,448 | 1/1/2009
University of Mary Washington | 3,847 | 10/1/2009
Liberty University | 3,699 | 11/6/2009
University of Richmond | 3,299 | 5/1/2009
Sweet Briar College* | 2,523 | 8/1/2010
George Mason University | 2,375 | 2/1/2011
Hampton University | 2,372 | 2/15/2012
Christopher Newport University | 2,191 | 8/1/2010
Old Dominion University | 1,996 | 7/1/2009
Randolph College* | 1,857 | 8/1/2008
Washington and Lee University | 1,842 | 8/1/2011
Radford University | 1,758 | 3/11/2011
Hampden-Sydney College | 1,086 | 7/1/2009
Longwood University | 1,035 | 2/28/2013
Hollins University | 923 | 4/1/2009
Virginia Military Institute | 836 | 3/1/2009
Norfolk State University | 629 | 8/15/2011
Randolph-Macon College | 172 | 3/7/2014
Table 1 - Alumni Associations Ranked by Followers

* Institution does not have an official alumni Twitter account.
The university Twitter account was used instead.

Social Graph Analysis


NodeXL is a template for Microsoft Excel which makes network analysis easy and rather intuitive. We used this tool to import the Twitter networks and to analyze the various social media interactions. The Twitter API limits the amount of data that any one user can collect per hour. Due to this rate limiting, NodeXL will only import the 2,000 most recent friends and followers for any Twitter account. To improve the response time of the API, we further restricted our data collection to the 200 most recent tweets for both the university and each of its follower accounts.
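
For readers who would rather script the same collection constraints outside of NodeXL, a rough sketch follows. It assumes the tweepy library with OAuth credentials already configured; the method names match older tweepy releases and may differ in current versions, so treat it as illustrative rather than a drop-in script.

import tweepy

# Hypothetical credentials; replace with your own application's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# wait_on_rate_limit makes tweepy sleep until the Twitter API rate-limit
# window resets instead of raising an error mid-collection.
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_account(screen_name):
    # 2,000 most recent followers, mirroring the NodeXL import limit.
    followers = [f.screen_name for f in
                 tweepy.Cursor(api.followers, screen_name=screen_name).items(2000)]
    # 200 most recent tweets, mirroring the restriction used in this study.
    tweets = api.user_timeline(screen_name=screen_name, count=200)
    return followers, tweets

followers, tweets = collect_account("ODUAlumni")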

For our first look at the alumni associations, we clustered the data based on an algorithm in NodeXL which looks at how the vertices are connected to one another. The clusters, as shown in Figure 1, are indicated by the color of the nodes. The clusters themselves revealed some interesting patterns. The high level of inter-association connectivity, as measured in follows, tweets, and mentions, was unexpected. We would have thought that each association operated within the confines of its own Twitter space or that of its parent organization. As we examine the groupings in this network, it is not unreasonable that we would observe connections between Old Dominion University (@ODUAlumni), Norfolk State University (@nsu_alumni_1935), and Hampton University (@HamptonU_Alumni), as all three are located within close proximity of one another in the Hampton Roads area. But then we must take notice of Hollins University (@HollinsAlum), a small, private women's college in Roanoke, VA, which has connections with ten (10) other alumni associations, more than any other school. Hollins is one of the smallest universities in our group, with an enrollment of fewer than 800 students. Since Twitter is primarily about influence, in this instance we can probably assume the follows serve as a means to observe best practices and current engagement trends employed by larger institutions. While Hollins University is well connected, as are many of the other schools, at the opposite end of the spectrum we find Liberty University (@LibertyUAlum), a large school with more than 77,000 students. Liberty University remains totally isolated with no follower connections to the other alumni associations. At a minimum, you might expect some type of connection with Regent University (@RegentU), since both share a similar mission as private, Christian institutions, or with other universities in close physical proximity such as Randolph College (@randolphcollege).

Figure 1 - Connectivity of Alumni Associations

Twitter Followers, Enrollment, and Selectivity


We normally measure the popularity of a Twitter account based on the number of followers. Instead of simply quantifying the follower counts of each alumni association, we sought to understand whether certain factors, actions, or inherent qualities of the institution might influence the relative number of followers. First, we considered whether more active tweeters would attract more alumni followers. As shown in Figure 2, the College of William and Mary (@wmalumni) has generated the most tweets over its lifetime, approximately 6,200 or 2.5 tweets per day. But we also observe the University of Mary Washington (@UMaryWash), which has approximately half the student enrollment, a similar Twitter life span, and 50% fewer tweets at 2,800 or 1.3 per day, with only a slight difference in the number of followers, 4,400 versus 3,800 respectively. While the graph shows that schools such as Virginia Tech (@vt_alumni) and the University of Virginia (@UVA_Alumni) have more followers with fewer lifetime tweets, the caveat is that these public institutions have the benefit of considerably larger student populations, which inherently increases the pool of potential alumni.

Figure 2 - Lifetime Tweets Versus Followers


Next, we considered whether a higher graduation rate, or alumni production, would result in more followers. We obtained the most recent, 2014 overall graduation rates for each institution from the National Center for Education Statistics, with reported overall six-year graduation rates ranging from 34% to 94%. A 2015 Pew Research Center study of the Demographics of Social Media Users indicates that among all internet users, 32% in the 18 to 29 age range use Twitter. This is a key demographic as we would expect our alumni associations to be primarily focused on attracting recent undergraduates. We also factored in selectivity, a comparative scoring of the admissions process, using the categories defined in the 2016 U.S. News Best Colleges Directory. In this directory, colleges are designated as most selective, more selective, selective, less selective or least selective based on a formula.

As we look at Figure 3, we observe a positive correlation between admissions selectivity and the institution's overall graduation rate. Schools which were least selective during the admissions phase also produced the lowest graduation rates (less than 40%), while schools which were most selective experienced the highest graduation rates (around 90%). It isn't surprising that improved graduation rates positively affect the expected number of alumni Twitter followers. We'll leave it as an exercise for the reader to extrapolate how closely each institution's annual undergraduate enrollment, graduation rate, and expected level of engagement on Twitter corresponds to the actual number of followers when all three factors are considered.

Figure 3 - Followers Versus Graduation Rate

Potential Reach of Verified Followers


Users on Twitter want to be followed, so we looked carefully at who, besides alumni and students, was following each of the alumni associations. Specifically, we noted the number of Twitter-verified followers: accounts which are usually associated with high-profile users in "music, acting, fashion, government, politics, religion, journalism, media, sports, business and other key interest areas." In addition to an abundance of local news reporters and sports anchors, regional politicians, and career sites, other notable followers included: restaurant review site Zagat (@Zagat), automaker Toyota USA (@toyota), musician and rapper DJ King Assassin (@DjKingAssassin), the Nelson Mandela Foundation (@NelsonMandela), President of the United States Barack Obama (@BarackObama), Virginia Governor Terry McAuliffe (@GovernorVA), and artist and singer Yoko Ono (@yokoono). Some of the follower relationships with verified users were probably established prior to 2013, the year in which Twitter instituted new rules to kill the "auto follow," a programmatic way of following another user back after they follow you. Either way, the open question remains why these particular users would follow an alumni association when there are no readily apparent educational ties.

Twitter doesn't take follower count into consideration when verifying an account, but it's not unusual for a verified account to have a considerable following. Since the mission of an alumni association is essentially about networking and information dissemination, we also measured the potential reach, or level of influence, across the followers' extended networks contributed by the verified accounts. No single university had more than 70 verified accounts among its followers. However, when we look at their contribution in Figure 4, as a percentage of the combined reach achieved by all followers of each alumni association, these select users accounted for as little as 1.6% of the institution's total potential reach (i.e., followers of my followers) for the Virginia Military Institute (@vmialumni) and as much as 95.8% for Longwood University (@acaptainforlife).
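
A back-of-the-envelope version of this "followers of my followers" calculation is sketched below; it assumes each follower record carries the followers_count and verified fields found in Twitter API user objects.

def verified_reach_percentage(followers):
    # followers: list of dicts with 'verified' (bool) and 'followers_count'
    # (int), as found in Twitter API user objects.
    total_reach = sum(f["followers_count"] for f in followers)
    if total_reach == 0:
        return 0.0
    verified_reach = sum(f["followers_count"] for f in followers if f["verified"])
    return 100.0 * verified_reach / total_reach

# Example: one verified follower with a large audience dominates the reach.
sample = [
    {"verified": True, "followers_count": 500000},
    {"verified": False, "followers_count": 300},
    {"verified": False, "followers_count": 150},
]
print(round(verified_reach_percentage(sample), 1))  # 99.9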

Figure 4 - Potential Reach Percentage of Verified Accounts

Alumni Sentiment


Finally, we examined how each follower described themselves in the description (i.e., bio) portion of their Twitter profile by extracting the top 200 most frequently occurring terms for each alumni association. A word cloud for the alumni of each university is shown in Figure 5. When we further isolated the descriptions to the ten most frequently occurring words, we observed a common pattern among all alumni followers. In addition to the official institution name or some derivative of it (e.g., JMU, NSU, Tech), we find the terms love, life, and some intimate description of the follower as a mom, husband, student, father, or alumni. If the university has an athletic department, we also found mention of sports and, in the case of our two Christian universities, Liberty and Regent, the terms God, Jesus, and Christ were prevalent. In 22 of 23 institutions, the alumni primarily described themselves using these personal terms. Conversely, the alumni followers at only one institution, the University of Richmond (@urspidernetwork), described themselves in a more business-like or academic manner, with more frequent mention of the words PhD, career, and job.
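
A minimal sketch of the term-frequency step behind these word clouds follows; the stop-word list is illustrative only, and the tokenization is deliberately crude.

import re
from collections import Counter

# Illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "of", "a", "to", "in", "for", "at", "on", "i"}

def top_terms(bios, n=200):
    # bios: iterable of follower description (bio) strings.
    counts = Counter()
    for bio in bios:
        tokens = re.findall(r"[a-z']+", bio.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

bios = ["JMU alum, mom, and lover of life", "Proud ODU alumni. Husband. Father."]
print(top_terms(bios, n=5))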



Figure 5 - Word Clouds of Twitter Follower Descriptions



-- Corren McCoy

2015-11-28: Two WS-DL Classes Offered for Spring 2016

https://xkcd.com/1319/

Two WS-DL classes are offered for Spring 2016:

Information Visualization is being offered both online (CRNs 29183 (HR), 29184 (VA), 29185 (US)) and on-campus (CRN 25511).  Web Science is being offered for the first time with the 432/532 numbers (CRNs 27556 and 27557, respectively), but the class will be similar to the Fall 2014 offering as 495/595.

--Michael

2015-12-08: Evaluating the Temporal Coherence of Composite Mementos

When an archived web page is viewed using the Wayback Machine, the archival datetime is easy to determine from the URI and the Wayback Machine's display.  The archival datetime of embedded resources (images, CSS, etc.) is another story.  And what stories their archival datetimes can tell.  These stories are the topic of my recent research and Hypertext 2015 publication.  This post introduces composite mementos and the evaluation of their temporal (in-)coherence, and provides an overview of my research results.

     

    What is a composite memento?

     

A Memento is an archived copy of a web resource (RFC 7089).  The datetime when the copy was archived is called its Memento-Datetime.  A composite memento is a root resource, such as an HTML web page, and all of the embedded resources (images, CSS, etc.) required for a complete presentation.  Composite mementos can be thought of as a tree structure: the root resource embeds other resources, which may themselves embed resources, and so on.  The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT.  Or does it?
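
As a rough illustration of that tree structure, the sketch below collects the first level of embedded resource URIs from a root memento's HTML. It assumes the requests and BeautifulSoup libraries, ignores resources embedded deeper in the tree, and uses an illustrative URI-M constructed from the capture datetime mentioned above.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def embedded_resources(urim):
    # Return the URIs of resources embedded directly in the root memento.
    html = requests.get(urim).text
    soup = BeautifulSoup(html, "html.parser")
    uris = set()
    # Images, stylesheets, scripts, and frames are typical embedded resources.
    for tag, attr in (("img", "src"), ("link", "href"), ("script", "src"),
                      ("frame", "src"), ("iframe", "src")):
        for element in soup.find_all(tag):
            if element.get(attr):
                uris.add(urljoin(urim, element[attr]))
    return uris

# Illustrative URI-M constructed from the capture datetime mentioned above.
urim = "http://web.archive.org/web/20050514013608/http://www.cs.odu.edu/"
print(len(embedded_resources(urim)))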


       

      Hints of Temporal Incoherence

       

Consider the following weather report that was captured 2004-12-09 19:09:26 GMT.  The Memento-Datetime can be found in the URI, and the December 9, 2004 capture date is clearly visible near the upper right.  Look closely at the description of Current Conditions and the radar image.  How can there be no clouds on the radar when the current conditions are light drizzle?  Something is wrong here.  We have encountered temporal incoherence.  This particular incoherence is caused by inherent delays in the capture process used by Heritrix and other crawler-based web archives.  In this case, the radar image was captured much later (9 months!) than the web page itself.  However, there is no indication of this condition.



       

      A Framework for Evaluating Temporal Coherence


      In order to study temporal coherence of composite mementos, a framework was needed.  The framework details a series of patterns describing the relationships between root and embedded mementos and four coherence states.  The four states and sample patterns are described below.  The technical report describing the framework is available on arXiv.

       

      Prima Facie Coherent

      An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured.  The figure below illustrates the most common case.  Here the embedded memento was captured after the root but modified before the root.  The importance of Last-Modified is discussed in my previous post on the importance of header replay.


       

      Possibly Coherent

      An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured before the root.


       

      Probably Violative

      An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.


      Prima Facie Violative

An embedded memento is prima facie violative when evidence shows that it did not exist in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root and was also modified after the root.
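
Under the framework, the four example patterns above can be decided from three datetimes: the root's Memento-Datetime, the embedded memento's Memento-Datetime, and the embedded memento's Last-Modified datetime (which may be unknown). The sketch below is a simplified rendering of those patterns, not the full framework from the technical report.

from datetime import datetime

def coherence_state(root_mdt, embedded_mdt, embedded_last_modified=None):
    # Simplified classification of an embedded memento's coherence state.
    # All arguments are datetimes; embedded_last_modified may be None.
    if embedded_mdt <= root_mdt:
        # Captured before (or with) the root: it might have existed as archived.
        return "possibly coherent"
    if embedded_last_modified is None:
        # Captured after the root with no Last-Modified evidence.
        return "probably violative"
    if embedded_last_modified <= root_mdt:
        # Captured after the root but not modified since before the root.
        return "prima facie coherent"
    # Captured after the root and modified after the root.
    return "prima facie violative"

# Illustration using the weather report example: the radar image was captured
# roughly nine months after the root and carries no Last-Modified evidence.
root = datetime(2004, 12, 9, 19, 9, 26)
radar = datetime(2005, 9, 9, 0, 0, 0)
print(coherence_state(root, radar))  # probably violative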

       

       

       

      Only One in Five Archived Web Pages Existed as Presented


      Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 were available in a web archive.  Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.

      Single and multiple archives: Composite mementos were recomposed from single and multiple archives. For single archives, all embedded mementos were selected from the same archive as the root. For multiple archives, embedded mementos were selected from any of the 15 archives included in the study.

      Heuristics:  The minimum distance (or nearest) heuristic selects between multiple captures for the same URI by choosing the memento with the Memento-Datetime nearest to the root's Memento-Datetime, and can be either before or after the root's. The bracket heuristic also takes Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.
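
A sketch of the two heuristics is shown below; each candidate memento is represented by its Memento-Datetime and an optional Last-Modified datetime, and the bracket heuristic falls back to minimum distance when no candidate brackets the root. This is an illustration of the selection logic, not the experiment's code.

from datetime import datetime

def minimum_distance(candidates, root_mdt):
    # Choose the memento whose Memento-Datetime is nearest the root's,
    # whether before or after it. candidates: list of (mdt, last_modified).
    return min(candidates, key=lambda c: abs(c[0] - root_mdt))

def bracket(candidates, root_mdt):
    # Prefer a memento whose Last-Modified and Memento-Datetime bracket the
    # root's Memento-Datetime; otherwise fall back to minimum distance.
    bracketing = [c for c in candidates
                  if c[1] is not None and c[1] <= root_mdt <= c[0]]
    if bracketing:
        return minimum_distance(bracketing, root_mdt)
    return minimum_distance(candidates, root_mdt)

root = datetime(2005, 5, 14, 1, 36, 8)
candidates = [(datetime(2005, 5, 1), None),
              (datetime(2005, 6, 2), datetime(2005, 4, 30))]
print(bracket(candidates, root))  # the second candidate brackets the root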

      We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).

      The paper can be downloaded from the ACM Digital Library or from my local copy.  The slides from the Hypertext'15 talk follow.




      One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.

      -- Scott G. Ainsworth

      2015-12-22: 60% of Web Annotations are Orphaned or in Danger of Being Orphaned

      Figure 1. An Annotation is defined by OAC
       as a set of connected resources  
In our TPDL paper, we studied 6,281 highlighted text annotations (out of 7,744 annotations) available in the Hypothes.is annotation system in January 2015. The main goal was to investigate the prevalence of orphaned annotations, where neither a live Web page nor an archived copy of the web page contains the text that had previously been annotated.

Recently, we applied the same analysis as in our TPDL paper to a larger number of annotations.  Figure 2 illustrates that the number of annotations in Hypothes.is has been increasing since July 2013. Our TPDL paper focused on the 7,744 annotations available in January 2015.  Our updated paper (available at arXiv.org) analyzed the 20,133 highlighted text annotations (out of 33,946 total annotations) available in August 2015.  In this post, I will focus on reporting results of our arXiv paper.
      Figure 2. January 2015 - dataset used in TPDL paper
      August 2015 - dataset used in arXiv version  

Based on my experience analyzing web annotations in Hypothes.is, I have seen annotations created just for the purpose of testing the system to see how it works (e.g., some annotations in Hypothes.is contain the tag "test"). Although some annotations can be considered not beneficial, the majority of annotations are valuable to the community in different ways. For example, 9 out of the 10 most annotated websites in Hypothes.is are related to education, academic research, or publishing.

The Hypothes.is annotation system offers free accounts allowing users to annotate the Web by, for example, adding tags or notes to highlighted text or to a web page as a whole. Hypothes.is supports collaborative work by letting users reply to each other's comments as shown in Figure 3.

      Figure 3. Annotating the Web Using Hypothes.is Annotation System

      It is known that web pages are not fixed resources, and they might be changed or become unavailable at any time. These changes in webpages can affect the associated annotations. Figure 4 shows the target URI http://climatefeedback.org/ as it appeared in December 2014. The highlighted text “Scientific feedback for Climate Change information online” in the webpage was annotated with “After reading about your project at MIT news, I visited your page and ...”. In August 2015, this annotation can no longer be attached to the target web page because the highlighted text no longer appears on the page, as shown in Figure 5. Although the live Web version of http://climatefeedback.org/ has changed and the annotation was in danger of being orphaned, the original version that was annotated has been archived and is available at the Internet Archive. The annotation could be re-attached to this archived resource, or memento.

      Figure 4. http://climatefeedback.org/ in December 2014 
      Figure 5. http://climatefeedback.org/ in August 2015
      Because web pages are changing, the status of annotations is also affected. We can classify web annotations into 4 categories based on the attachment to their target live web pages and to mementos:
      • Safe - The annotation can be attached to the target live web page and also to at least one memento. 
      • In Danger - The annotation can be attached to the target live web page but it is not attached to any mementos. In this case, if the live web page is changed such that the associated annotations become unattached, then these annotations, unfortunately, would become orphaned.
      • Re-attached - The annotation is no longer attached to the live web page but, fortunately, it can be reattached to at least one memento from public web archives. 
      • Orphaned - The annotation is neither attached to the live web page nor any mementos.

Safe and re-attached annotations can be recovered with web archives, so they are in a better situation than the other two categories. We want to make annotations in the second category (in danger) safe or re-attached by archiving their target web pages. Obviously, we can do nothing about annotations in the orphaned category; they are lost.
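
The four categories reduce to two boolean tests per annotation: does the highlighted text still attach to the live page, and does it attach to at least one memento? A minimal sketch of that classification follows.

def annotation_status(attaches_to_live, attaches_to_memento):
    # Classify an annotation from two boolean attachment tests.
    if attaches_to_live and attaches_to_memento:
        return "safe"
    if attaches_to_live:
        return "in danger"      # archiving the page now would make it safe
    if attaches_to_memento:
        return "re-attached"    # recoverable from a public web archive
    return "orphaned"           # neither the live page nor any memento matches

print(annotation_status(attaches_to_live=False, attaches_to_memento=True))
# re-attached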

We used the LANL Memento Aggregator to look for archived copies of web pages (mementos) in the public archives. To be more specific, we were looking for the mementos closest to each annotation's creation date. In the example shown in Figure 4, we would need to find the closest mementos captured immediately before and after the annotation creation date (e.g., December 3, 2014 at 12:47 AM for the web page http://climatefeedback.org).

      Figure 6(a) shows an example where mementos are available before and after the annotation creation date. In this example, only M1 and M3 will be tested to see if the associated annotations can be re-attached to these mementos. Figure 6(b) shows mementos that are only available before the annotation creation date while Figure 6(c) shows mementos that are only available after the annotation date. Finally, Figure 6(d) shows annotations that have no existing mementos for their target web pages in the web archives.
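
A simplified sketch of this memento discovery step follows; it assumes the TimeMap has already been fetched (for example, from the LANL Memento Aggregator) and parsed into (Memento-Datetime, URI-M) pairs.

from datetime import datetime

def closest_mementos(timemap, annotation_created):
    # timemap: list of (memento_datetime, urim) pairs parsed from a TimeMap.
    # Returns (closest_before, closest_after); either may be None, matching
    # the four cases sketched in Figure 6.
    before = [m for m in timemap if m[0] <= annotation_created]
    after = [m for m in timemap if m[0] > annotation_created]
    closest_before = max(before, key=lambda m: m[0]) if before else None
    closest_after = min(after, key=lambda m: m[0]) if after else None
    return closest_before, closest_after

created = datetime(2014, 12, 3, 0, 47)
timemap = [(datetime(2014, 11, 20), "urim-1"), (datetime(2015, 1, 15), "urim-2")]
print(closest_mementos(timemap, created))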

      Figure 6. Discovering Mementos for Annotations' Target Web Pages
After we discovered the mementos closest to the annotations' creation dates and checked whether the annotations are still attached to their live web pages and to mementos, we arrived at the results illustrated in Figure 7. It shows that 19% of annotations are orphaned while 41% are in danger of being orphaned. The remaining 40% of annotations are in an acceptable situation: 37% are considered safe while only 3% can be re-attached using archives. The results also indicate that if mementos are available for an annotation's target web page, there is a high chance that the annotation can be re-attached. In addition, a copy of the same memento can be available in different web archives.

      Figure 7. The Status of  Current Hypothes.is Annotations

As we can see, having 60% of annotations orphaned or in danger of being orphaned leads us to conclude that archiving web pages at the time of annotation is important to avoid orphaned annotations.

      -- Mohamed Aturban

      2015-12-24: CNI Fall 2015 Membership Meeting Trip Report

      The CNI Fall 2015 Membership Meeting was held in Washington, D.C., December 14-15, 2015.  Like all CNI meetings, the Fall 2015 meeting was excellent and contained many high quality presentations.  Unfortunately, the members' project briefings ran simultaneously, with 7 or 8 different presentations overlapping at any given time.  As a result I missed a great deal. 

      Cliff Lynch kicked off the meeting with reflections about public access to federally funded research (e.g., CRS R42983), interoperability (e.g., OAI-ORE, ORCIDs, IIIF), linked data (e.g., Wikipedia notability guidelines for biographies),  privacy & surveillance (e.g., eavesdropping Barbies, Ashley Madison data breach, RFC 7624), and understanding the personalization algorithms that go into presenting (and thus archiving) the view of the web that you experience (e.g., our 2013 D-Lib Magazine article about mobile vs. desktop & GeoIP), and much more.  I'm hesitant to try to further summarize his talk -- watching the video of his talk, as always, is time well spent. 

      In the next session Herbert and I presented "Achieving Meaningful Interoperability for Web-based Scholarship", which is basically a summary of our recent D-Lib Magazine paper "Reminiscing About 15 Years of Interoperability Efforts". 



      See also the excellent summary and commentary from David Rosenthal about the "signposting" proposal.

      The next session I split between "Linked Data for Libraries and Archives: LD4L and Europeana" (see the "Linked Data for Libraries" site) and "Is Gold Open Access Sustainable? Update from the UC Pay-It-Forward Project" (slides, video).  The final session of the day included several presentations I would have liked to have seen but didn't.  I understand "Documenting Ferguson: Building A Community Digital Repository" (slides) was good & standing room only. 

I missed the opening session on the second day (including the "Update on Funding Opportunities" presentation), but made it to David Rosenthal's presentation about emulation.  See the transcript of his talk, as well as his 2015 Emulation and Virtualization as Preservation Strategies report for the AMF.

Unfortunately, David's talk collided with that of Martin & his UCLA colleagues.  Fortunately, CNI has posted the video of their talk, his slides are online, and he has a great interactive site to explore the data.



After lunch I attended Rob's talk "The Future of Linked Data in Libraries: Assessing BibFrame Against Best Practices" (slides).  Rob even referenced my "no free kittens" slogan (tirade?) from our time developing OAI-ORE:




      The closing plenary was an excellent talk from Julie Brill, head of the Federal Trade Commission, entitled "Transparency, Trust, and Consumer Protection in a Complex World".  The transcript is worth reading, but the essence of the talk explores the role the FTC would (should?) play in making sure that consumers can be aware of the data that companies track about them and how that data is used to make decisions about the consumers. 

A mostly complete list of slides is available via the OSF.  CNI recorded many of the presentations and has begun uploading the videos to the CNI YouTube channel.  The CNI Spring 2016 Membership Meeting will be held in San Antonio, TX, April 4-5, 2016.

Given all the simultaneous sessions, your CNI experience was probably different from mine.  Check out these other CNI Fall 2015 trip reports: Dale Askey, Jaap Geraerts, and Tim Pyatt.

      --Michael

      2016-01-02: Review of WS-DL's 2015


      The Web Science and Digital Libraries Research Group had a terrific 2015, marked by four new student members, one Ph.D. defense, and two large research grants.  In many ways it was even better than 2014 and 2013.

We had fewer students graduate or advance their status this year, but last year was unusually productive. We added four new students, graduated a PhD student and an MS student, and had two other students advance their status:
      Hany's Defense Luncheon
      Hany's defense saw us continue the WS-DL tradition of the post-PhD luncheon.

      We had 16 publications in 2015, which was about the same as 2014 (15) but down from 2013's impressive 22 publications.  This year we had:
      Next year we won't have this kind of showing at JCDL 2016 because Michele is one of the program co-chairs:

      JCDL 2016 Chairs

      In addition to the JCDL, TPDL, and iPRES conferences listed above, we traveled to and presented at ten conferences, workshops, or professional meetings that do not have formal proceedings:
      We were also fortunate to host Michael Herzog for the spring 2015 semester:

      MLN, MCW, and Michael Herzog

      As well as Herbert Van de Sompel for an extended colloquium / planning visit:




      We also released (or updated) a number of software packages, services, and format definitions:
      • Alexander Nwala created: 
      • Sawood released:
  • CDXJ - a proposed serialization of CDX files (among other formats) in JSON format (based on his discussions with Ilya Kreymer).
        • MemGator - A Go-based Memento aggregator (used by Ilya in his excellent emulation service oldweb.today).
      • Shawn, working with LANL colleagues, released the py-memento-client Python library.
• Wes and Justin released "Mobile Mink", an Android Memento-enabled client.
      • Mat has continued to update the Mink Chrome extension (github, Chrome store). 
      Our coverage in the popular press continued:
      We were fortunate to receive two significant research grants this year, totaling nearly $1M:
      Thanks to all who made 2015 a great year!  We are looking forward to 2016!

      -- Michael


      2016-01-28: January 2016 Federal Cloud Computing Summit


As I have mentioned previously, I am the MITRE chair of the Federal Cloud Computing Summit. The Summits are designed to allow representatives from government agencies that would not necessarily cross paths to collaborate and learn from one another about the best practices, challenges, and recommendations for adopting emerging technologies in the federal government. The MITRE-ATARC Collaboration Symposium is a working group-style session in which academics and representatives from industry, government, and FFRDCs discuss potential solutions and ways forward for the top challenges of emerging technology adoption in government. MITRE helps select the challenge areas by polling government practitioners on their top challenges, and the participants break into groups to discuss each challenge area. The Collaboration Symposium allows this heterogeneous group of cloud practitioners to collaborate across all levels, from end users to researchers to practitioners to policy makers (at the officer level).





The Summit series includes mobile, Internet of Everything, big data, and cyber security summits along with the cloud summit, each of which occurs twice a year. MITRE produces a white paper that summarizes the MITRE-ATARC Collaboration Symposium. The white paper is shared with industry to communicate the top challenges and current needs of the federal government and guide product development; with academia to identify the skill sets needed by the government and to influence curricula and research topics; and with government to communicate best practices and the current challenges of peer agencies.

The Summit takes place in Washington, D.C. and is a full-day event. The day begins at 7:30 AM with registration and an industry trade show that allows industry representatives to communicate with government representatives about their challenges and the solutions that industry has to offer. At 9:00, a series of panel discussions by academic researchers and government representatives begins. This also allows audience members to ask questions of the top implementers of cloud computing in the government and academia.

At 1:15, after lunch, the MITRE-ATARC Collaboration Symposium begins and runs until 3:45. There is also a final out-briefing from each collaboration session at the end of the day to communicate the major findings from each session to the summit participants.

      Common threads from the summit included the importance of cloud security, the importance of incorporating other emerging technologies (e.g., mobile, big data, Internet of Things) in cloud computing, and how each emerging technology enables or enhances the others, and the importance of agile processes in cloud migration planning. More details on the outcomes will be included in the white paper, which should be released in 6-8 weeks. Prior white papers are available at the ATARC website.

The results of the Summit have implications for web archivists. With the increasing importance of and emphasis on mobile, IoT, and cloud services, particularly within the government, there is increased importance in archiving these representations and the use of this material. As Julie Brill mentioned in her CNI talk, the government is interested in understanding how these services and technologies are being used, regardless of whether or not there is a UI or other interface with which humans can interact.

      Archiving data endpoints from HTTP is comparatively trivial (although challenges still exist with archiving at high fidelity, particularly when considering JavaScript and deferred representations), but archiving a data service that might exchange data through non-HTTP or even push (as opposed to pull) transactions may change the paradigm used for web archiving.

With increased adoption, the archiving of representations reliant on, or designed to be consumed through, emerging technologies will continue to increase, highlighting a potential frontier in web archiving and digital preservation.


      --Justin F. Brunelle *

      * APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. CASE NUMBER 15-3250
      The authors’ affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the authors.

      2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives.  These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work and Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.

      We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex.  We document our findings in a technical report entitled:  "Rules of Acquisition for Mementos and Their Content".

      Our technical report briefly covers the following key points:
      • Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
      • Special techniques for dealing with JavaScript Redirects created by the Internet Archive
      • An alternative to BeautifulSoup for removing elements and extracting text from mementos
      • Stripping away archive-specific additions to memento content
      • An algorithm for dealing with inaccurate character encoding
      • Differences in whitespace treatment between archives for the same archived page
      • Control characters in HTML and their effect on DOM parsers
      • DOM-corruption in various HTML pages exacerbated by how the archives present the text stored within <noscript> elements
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and handling inaccurate character encoding.

      Acquisition of Content from WebCite


      WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work.  An example WebCite page is shown below.
      For acquiring most memento content, we utilized the cURL data transfer tool.  With this tool, one merely types the following command to save the contents of the URI http://www.example.com:

      curl -o outputfile.html http://www.example.com

      For WebCite, the output from cURL for a given URI-M results in the same HTML frameset content, regardless of which URI-M is used.  We sought to acquire the actual content of a given page for text extraction, so merely utilizing cURL was insufficient.  An example of this HTML is shown below.


Instead of relying on cURL, we analyzed the resulting HTML frameset and determined that the content is actually returned by a request to the mainframe.php file.  Unfortunately, merely issuing a request to the mainframe.php file is insufficient because the cookies sent to the browser indicate which memento should be displayed. We developed custom PhantomJS code, presented as Listing 1 in the technical report, for overcoming this issue.  PhantomJS, because it must acquire, parse, and process the content of a page, is much slower than merely using cURL.

      The requirement to utilize a web browser, rather than HTTP only, for the acquisition of web content is common for live web content, as detailed by Kelly and Brunelle, but we did not anticipate that we would need a browser simulation tool, such as PhantomJS, to acquire memento content.

In addition to the issue of acquiring mementos, we also discovered reliability problems with WebCite, seen in the figure below.  We would routinely need to reattempt downloads of the same URI-M in order to finally acquire its content.

Finally, we experienced rate limiting from WebCite, forcing us to divide our list of URI-Ms and download content from several source networks.

Because of these issues, the acquisition of almost 100,000 mementos from WebCite took more than a month to complete, compared to the acquisition of 1 million mementos from the Internet Archive in 2 weeks.

      Inaccurate Character Encoding


Extracting text from documents requires that the text be decoded properly for processes such as text similarity or topic analysis.  For a subset of mementos, some archives do not present the correct character set in the HTTP Content-Type header.  Even though most web sites now use the UTF-8 character set, a subset of our mementos come from a time before UTF-8 was widely adopted, so proper decoding becomes an issue.

      To address this issue, we developed a simple algorithm that attempts to detect and use the character encoding for a given document.

      1. Use the character set from the HTTP Content-Type header, if present; otherwise try UTF-8.
      2. If a character encoding is discovered in the file contents, as is common for XHTML documents, then try to use that; otherwise try UTF-8.
      3. If any of the character sets encountered raise an error, raise our own error.

      We fall back to UTF-8 because it is an effective superset of many of the character sets for the mementos in our collection, such as ASCII. This algorithm worked for more than 99% of our dataset.
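
A loose Python rendering of the three steps follows; extract_declared_charset is a hypothetical helper standing in for whatever meta/XML-declaration parsing is actually used, and the final error is our own rather than a library exception.

import re

def extract_declared_charset(raw_bytes):
    # Hypothetical helper: look for a charset declared in the content itself
    # (e.g., an XML declaration or a <meta> element); None if not found.
    match = re.search(rb"charset=[\"']?([A-Za-z0-9_-]+)", raw_bytes[:2048])
    return match.group(1).decode("ascii") if match else None

def decode_memento(raw_bytes, content_type_charset=None):
    # Try the Content-Type charset, then any charset declared in the content,
    # then UTF-8; if every candidate fails, raise our own error.
    for charset in (content_type_charset,
                    extract_declared_charset(raw_bytes),
                    "utf-8"):
        if not charset:
            continue
        try:
            return raw_bytes.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    raise ValueError("unable to decode memento content")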

In the future, we intend to explore the use of confidence-based tools, such as the chardet library, to guess the character set when extracting text.  The use of such tools takes more time than merely using the Content-Type header, but is necessary when that header is unreliable and algorithms such as ours fail.

      Summary


      We were able to overcome most of the memento acquisition and text extraction issues encountered in our experiment.  Because we were unaware of the problems we would encounter, we felt that it would be useful to detail our solutions for others to assist them in their own research and engineering.

      --
      Shawn M. Jones
      PhD Student, Old Dominion University
      Graduate Research Assistant, Los Alamos National Laboratory
      - and -
      Harihar Shankar
      Research & Development Engineer, Los Alamos National Laboratory

      2016-03-07: Custom Missions in the COVE Tool

      When I am not studying Web Sciences at ODU, I work as a software developer at Analytical Mechanics Associates. In general, my work there aims to make satellite data more accessible. As part of this mission, one of my primary projects is the COVE tool.

      The COVE tool allows a user to view where a satellite could potentially take an image. The above image shows the ground swath of both Landsat 7 (red) and Landsat 8 (green) over a one day period. 
The CEOS Visualization Environment (COVE) tool is a browser-based system that leverages Cesium, an open-source JavaScript library for 3D globes and maps, in order to display satellite sensor coverage areas and identify coincidence scene locations. In other words, the COVE tool allows the user to see where a satellite could potentially take an image and where two or more satellite paths overlap during a specified time period. The Committee on Earth Observation Satellites (CEOS) is currently operating and planning hundreds of Earth observation satellites.  COVE initially began as a way to improve standard Calibration and Validation (Cal/Val) exercises for these satellites. Cal/Val exercises need to compare near-simultaneous surface observations and identify corresponding image pairs in order to calibrate and validate the satellite's orbit. These tasks are time-consuming and labor-intensive. The COVE tool has been pivotal in making these Cal/Val exercises much easier and more efficient.

      The COVE tool allows a user to see possible coincidences of two satellites. The above image shows the coincidences of ALOS-2 with Landsat 7 over a one week period.
In the past, the COVE tool only allowed this analysis to be done on historical, operational, or notional satellite missions with known orbit data, which COVE could then use to predict the propagation of the orbit accurately, within the bounds of the model's assumptions, for up to three (3) months past the last-known orbit data. This has proven extremely useful for missions whose orbit data is known; however, it was limited to those missions.

      Mission planning is another task which includes the prediction of satellite orbits, a task the COVE tool was well equipped for. However, in mission planning exercises, the orbit data of the satellite is unknown. Based on this need, we wanted to extend COVE to include customized missions, in which the user could define the orbit parameters and the COVE tool would then predict the orbit of the customized mission through a numerical propagation. I had the opportunity to be the lead developer for this new feature, which recently went live and can be accessed through the Custom Missions tab on the right of the COVE tool, as shown in the video below. This is an important addition to the COVE tool, as it allows for better planning of potential future missions and will hopefully help to improve satellite coverage of Earth in the future.



      Video Summary:
      00:07:04 - The "Custom" Missions and Instruments tab shows a list of the current user's custom missions. Currently, we do not have any custom missions.
00:09:03 - To create a custom mission, choose "Custom Missions" on the right panel. First, we need to "Add Mission." Once we have a mission, we can add additional instruments to the mission or delete the mission.
      00:20:15 - After choosing a mission name, we need to decide if we want to use an existing mission's orbit or define a custom orbit. We want to create a custom orbit. Clicking on "Custom defined orbit" gives three more options. A circular orbit is the most basic and for the novice user. A repeating sun synchronous orbit is a subset of circular orbits that must cover each area around the same time. For example, if the satellite passes over Hampton, VA at 10:00 AM, its next pass over Hampton should also be at 10:00 AM. The advanced orbit is for the experienced user and allows full control over the orbital parameters. We will create a repeating sun synchronous orbit, similar to Landsat 8.
      00:33:14 - When creating a repeating sun synchronous orbit, the altitude given is only an estimate as only certain inclination/altitude pairs are able to repeat. Thus, the user has the option to calculate the inclination and altitude that will be used.
      00:37:24 - The instrument and mode, along with the altitude of the orbit we just defined, determine the swath size of the potential images the satellite will be able to take.
00:49:23 - We need to define the "Field of View" and "Pointing Angle" of the instrument. We will also choose "Daylight only," so our custom mission will only take images during daylight hours. This is useful because many optical satellites, such as Landsat 8, are "Daylight only" since they cannot take good optical images at night.
      01:02:06 - We will now choose a date range over which we will propagate the orbit to see what our satellite's path will look like.
      01:21:18 - We can now see what path our satellite will take during the daylight hours, since we chose "Daylight only."

      This project was only possible thanks to other key AMA associates involved, namely Shaun Deacon--project lead and aerospace engineer, Andrew Cherry--developer and ODU graduate, and Jesse Harrison--developer.

      --Kayla

      2016-03-07: Archives Unleashed Web Archive Hackathon Trip Report (#hackarchives)

      The Thomas Fisher Rare Book Library (University of Toronto)
Between March 3 and March 5, 2016, librarians, archivists, historians, computer scientists, and others came together for the Archives Unleashed Web Archive Hackathon at the University of Toronto Robarts Library, Toronto, Ontario, Canada. This event gave researchers the opportunity to collaboratively develop open-source tools for web archives. The event was organized by Ian Milligan (assistant professor of Canadian and digital history in the Department of History at the University of Waterloo), Nathalie Casemajor (assistant professor in communication studies in the Department of Social Sciences at the University of Québec in Outaouais (Canada)), Jimmy Lin (the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo), Matthew Weber (assistant professor in the School of Communication and Information at Rutgers University), and Nicholas Worby (the Government Information & Statistics Librarian at the University of Toronto's Robarts Library).

Additionally, the event was made possible by the support of the Social Sciences and Humanities Research Council of Canada, the National Science Foundation, the University of Waterloo, the University of Toronto, Rutgers University, the University of Québec in Outaouais, the Internet Archive, Library and Archives Canada, and Compute Canada. Sawood Alam, Mat Kelly, and I joined researchers from Europe and North America to exchange ideas in an effort to unleash our web archives. The event was split across three days.

      DAY 1, THURSDAY MARCH 3, 2016

Ian Milligan kicked off the presentations with the agenda. Following this, he presented his current research effort:

HistoryCrawling with Warcbase (Ian Milligan, Jimmy Lin)

The presenters introduced Warcbase as a platform for exploring the past. Warcbase is an open-source tool used to manage web archives, built on Hadoop and HBase. Warcbase was introduced through two case studies and datasets, namely Canadian Political Parties and Political Interest Groups (2005 - 2015) and GeoCities.



      Put Hacks to Work: Archives in Research (Matthew Weber)

      Following Ian Milligan's presentation, Matthew Weber emphasized some important ideas to guide the development of tools for web archives, such as considering the audience.




Archive Research Services Workshop (Jefferson Bailey, Vinay Goel)

Following Matthew Weber's presentation, Jefferson Bailey and Vinay Goel presented a comprehensive introductory workshop for researchers, developers, and general users. The workshop addressed data mining and computational tools and methods for working with web archives.




      Embedded Metadata as Mobile Micro Archives (Nathalie Casemajor)

Following Jefferson Bailey and Vinay Goel's presentation, Nathalie Casemajor presented her research on tracking the evolution of images shared on the web. She talked about how embedded metadata in images helps track their dissemination across the web.





      Revitalization of the Web Archiving Program at LAC (Tom Smyth)

Following Nathalie Casemajor's presentation, Tom Smyth of Library and Archives Canada presented their archiving activities, such as domain crawls of federal sites, curation of thematic research collections, and preservation archiving of resources at risk. He also talked about their recent collections, such as Federal Election 2015, First World War Commemoration, and the Truth and Reconciliation collections.

After the first five short presentations, Jimmy Lin gave a technical tutorial on Warcbase. Helge Holzmann then presented ArchiveSpark, a framework built to make accessing web archives easier for researchers, enabling easy data extraction and derivation.


      After a short break, there were five more presentations targeting Web Archiving and Textual Analysis Tools:

      WordFish (Federico Nanni)

Federico Nanni presented Wordfish: an R program used to extract political positions from text documents. Wordfish is a scaling technique that does not need any anchoring documents to perform the analysis, relying instead on a statistical model of word frequencies.


      MemGator (Sawood Alam)

Following Federico Nanni's presentation, Sawood Alam presented a tool he developed called MemGator: a Memento aggregator CLI and server written in Go. Memento is a framework that adds the time dimension to the web; a timestamped copy of the representation of a resource is also called a memento, and a list/collection of such mementos is called a TimeMap. MemGator can generate the TimeMap of a given URI or provide the closest memento to a given time.
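As a rough illustration (not from the presentation), a running MemGator server could be queried for a TimeMap along these lines; the base URL, port, and the /timemap/link/ route used below are assumptions about a local deployment:

# A minimal sketch, assuming the requests package and a MemGator server
# listening locally; the port and route pattern are assumptions.
import requests

MEMGATOR = "http://localhost:1208"                 # assumed local MemGator server
uri = "http://example.com/"
resp = requests.get("{0}/timemap/link/{1}".format(MEMGATOR, uri))
if resp.ok:
    print(resp.text)                               # TimeMap of mementos across archives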



      Topic Words in Context (Jonathan Armoza)

Following Sawood Alam's presentation, Jonathan Armoza presented a tool he developed, TWIC (Topic Words in Context), by demonstrating LDA topic modeling of Emily Dickinson's poetry. TWIC provides a hierarchical visualization of LDA topic models generated by the MALLET topic modeler.
Following Jonathan Armoza's presentation, Nick Ruest presented Twarc: a Python command-line tool and library for archiving tweet JSON data. Twarc runs in three modes: search, filter stream, and hydrate.
Following Nick Ruest's presentation, I presented Carbon Date: a tool originally developed by Hany SalahEldeen, which I currently maintain. Carbon Date estimates the creation date of a website by polling multiple sources for datetime evidence, and it returns a JSON response containing the estimated creation date.
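As a hedged sketch only (not the tool's documented interface), a locally running Carbon Date service could be queried like this; the host, port, and route below are hypothetical placeholders:

# A hedged sketch: the host, port, and path are hypothetical placeholders,
# not Carbon Date's documented API.
import requests

CARBON_DATE = "http://localhost:8888/cd"           # hypothetical local endpoint
resp = requests.get(CARBON_DATE, params={"url": "http://example.com/"})
print(resp.json())                                 # JSON with the estimated creation date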
After the five short presentations about web archiving and textual analysis tools, all participants engaged in a brainstorming session in which ideas were discussed and clusters of researchers with common interests were iteratively formed. The brainstorming session led to the formation of seven groups, namely:
      1. I know words and images
      2. Searching, mining, everything
      3. Interplanetary WayBack
      4. Surveillance of First Nations
      5. Nuage
      6. Graph‐X‐Graphics
      7. Tracking Discourse in Social Media



      Following the brainstorming and group formation activity, all participants were received at the Bedford Academy for a reception that went on through the late evening. 


DAY 2, FRIDAY MARCH 4, 2016



      The second day of the Archives Unleashed Web Archive Hackathon began with breakfast, after which the groups formed on Day 1 met for about three hours to begin working on the ideas discussed the previous day. At noon, lunch was provided as more presentations took place:
Evan Light began the series of presentations by talking about the Snowden Archive-in-a-Box, which he created. The box features a stand-alone wifi network and web server that allows researchers to use the files leaked by Edward Snowden (and subsequently published by the media). The box, which serves as a portable archive, protects users from mass surveillance.

      Mediacat (Alejandro Paz and Kim Pham)

Following Evan Light's presentation, Alejandro Paz and Kim Pham presented Mediacat: an open-source web crawler and archive application suite that enables ethnographic research into how digital news is disseminated and used across the web.

      Data Mining the Canadian Media Public Sphere (Sylvain Rocheleau)

Following Alejandro Paz and Kim Pham's presentation, Sylvain Rocheleau talked about his research efforts to provide near-real-time data mining of Canadian news media. His research involves mass crawls of about 700 Canadian news websites at 15-minute intervals and data mining processes that include named entity recognition.

      Tweet Analysis with Warcbase (Jimmy Lin)

Following Sylvain Rocheleau's presentation, Jimmy Lin gave another tutorial in which he showed how to extract information from tweets using the Warcbase platform.

A five-hour hackathon session followed, briefly suspended for a visit to the Thomas Fisher Rare Book Library.
After the visit, the hackathon session continued until the evening, after which all participants went for dinner at the University of Toronto Faculty Club.

DAY 3, SATURDAY MARCH 5, 2016



The third and final day of the Archives Unleashed Web Archive Hackathon began in a similar fashion to the second: first breakfast, then a three-hour hackathon session, and then presentations over lunch:

      Malach Collection (Petra Galuscakova)
      Waku (Kyle Parry)
      Digital Arts and Humanities Initiatives at UH Mānoa (or how to do interesting things with few resources) (Richard Rath)

After the presentations, the hackathon session continued until 4:30 pm EST; thereafter, the group presentations began:

      PRESENTATIONS

      I know words and images (Kyle Parry, Niel Chah, Emily Maemura, and Kim Pham)

Searching, mining, everything (Jaspreet Singh, Helge Holzmann, and Vinay Goel)

      Interplanetary WayBack (Sawood Alam and Mat Kelly)

      "Who will archive the archives?"

To answer this question, Sawood Alam and Mat Kelly presented the archiving and replay system called Interplanetary Wayback (ipwb). In a nutshell, during the indexing process ipwb consumes WARC files one record at a time, splits each record into headers and payload, pushes the two pieces into the IPFS (a peer-to-peer file system) network for persistent storage, and stores the references (digests) in a file format called CDXJ along with some other lookup keys and metadata. For replay, it finds the records in the index file and builds the response by assembling the headers and payload retrieved from the IPFS network and performing the necessary rewrites. The major benefits of this system include deduplication, redundancy, and shared open access.
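The following is a minimal sketch of that indexing idea, not the ipwb implementation itself; it assumes the warcio and ipfshttpclient Python packages, a running local IPFS daemon, and a simplified CDXJ layout:

# A sketch of the indexing flow described above -- not the ipwb code itself.
# Assumes the warcio and ipfshttpclient packages and a local IPFS daemon.
import json
import ipfshttpclient
from warcio.archiveiterator import ArchiveIterator

client = ipfshttpclient.connect()                  # talk to the local IPFS daemon

with open('example.warc.gz', 'rb') as warc, open('index.cdxj', 'w') as cdxj:
    for record in ArchiveIterator(warc):
        if record.rec_type != 'response':
            continue
        uri = record.rec_headers.get_header('WARC-Target-URI')
        dt = record.rec_headers.get_header('WARC-Date')
        headers = record.http_headers.to_str().encode('utf-8')
        payload = record.content_stream().read()
        # Push headers and payload into IPFS separately; keep the digests.
        header_digest = client.add_bytes(headers)
        payload_digest = client.add_bytes(payload)
        # One simplified CDXJ line per record: key, datetime, JSON metadata.
        cdxj.write('{0} {1} {2}\n'.format(
            uri, dt, json.dumps({'locator': [header_digest, payload_digest]})))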

      Surveillance of First Nations (Evan Light, Katherine Cook, Todd Suomela, and Richard Rath)

Nuage (Petra Galuscakova, Neha Gupta, Rosa Iris R. Rovira, Nathalie Casemajor, Sylvain Rocheleau, Ryan Deschamps, and Ruqin Ren)

      Graph‐X‐Graphics (Jeremy Wiebe, Eric Oosenbrug, and Shane Martin)

Tracking Discourse in Social Media (Tom Smyth, Allison Hegel, Alexander Nwala, Patrick Egan, Nick Ruest, Yu Xu, Kelsey Utne, Jonathan Armoza, and Federico Nanni)

This team processed ~11.2 million tweets and ~50 million Reddit comments that referenced the Charlie Hebdo and Bataclan attacks, in an effort to track the evolution of social media commentary about the attacks. The team sought to measure the attention span, the flow of information and misinformation, and the co-occurrence network of terms in order to understand the dynamics of commentary about these events.

The votes were tallied; the Nuage team received the most votes and was declared the winner. The event concluded after some closing remarks.

      --Nwala

      2016-03-22: Language Detection: Where to start?


Language detection is not a simple task, and no method results in 100% accuracy. You can find different packages online to detect different languages. I have used several methods and tools to detect the language of either websites or text. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP language header and the HTML language tag. In addition, I reviewed several language detection packages: Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API. Since Python is my favorite coding language, I searched for tools written in Python.

I found that a primary way to detect the language of a webpage is to use the HTTP language header and the HTML language tag. However, only a small percentage of pages include them, and sometimes the detected language is affected by the browser settings. Guess-Language and Python-Language Detector are fast, but they are more accurate with more text, and you have to strip the HTML tags before passing the text to the tools. LangID is a tool that detects language and gives you a confidence score; it is fast, works well with short texts, and is easy to install and use. The Google Language Detection API is also a powerful tool with clients for different programming languages; it also provides a confidence score, but you need to sign up, and if your dataset is large (more than 5000 requests a day, 1 MB/day), you must choose a paid plan.
      HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. Content-Language tells you what languages are present on the requested page. The value is a two- or three-letter language code (such as 'fr' for French), sometimes followed by a country code (such as 'fr-CA' for French as spoken in Canada).

      For example:

      curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"

      Content-Language: da-DK,da-DK

      In this example the webpage's language is Danish (Denmark).
      In some cases you will find some sites offering content in multiple languages, and the Content-Language header only specifies one of the languages.
      For example:

      curl -I  --silent http://www.hotelrenania.it/ |grep -i "Content-Language"

      Content-Language: it

In this example, the webpage offers three languages in the browser (Italian, English, and Dutch), yet it states only Italian as its Content-Language. Note that the Content-Language does not always match the language displayed in your browser, because the displayed language depends on the browser's language preference, which you can change.

      For example:

      curl -I --silent https://www.debian.org/ |grep -i "Content-Language"

      Content-Language: en

This webpage offers its content in more than 37 different languages. Here I had my browser's language preference set to Arabic, and the Content-Language found was English.
In addition, in most cases the Content-Language header is not included at all. From a random sample of 10,000 English websites in DMOZ, I found that only 5.09% have the Content-Language header.

      For example:

      curl -I --silent http://www.odu.edu |grep -i "Content-Language"


      In this example we see that the Content-Language header was not found.
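These checks can also be scripted. Here is a minimal Python sketch (assuming the requests package) that mirrors the curl examples above:

# A minimal sketch, assuming the requests package: issue a HEAD request and
# read the Content-Language header, which may be absent.
import requests

for url in ['http://bagergade-bogb.dk/', 'https://www.debian.org/', 'http://www.odu.edu']:
    resp = requests.head(url, allow_redirects=True)
    print(url, resp.headers.get('Content-Language'))   # None if the header is missing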

      HTML Language:
Another indication of the language of a web page is the HTML language tag (such as <html lang="en">…</html>). Using this method will require you to save the HTML code first and then search for the HTML lang attribute.

      For example:

curl --silent http://ksu.edu.sa/ > ksu.txt

      grep "<html lang=" ksu.txt

<html lang="ar" dir="rtl" class="no-js">

      However, I found from a random sample of 10,000 English websites in DMOZ directory that only 48.6% have the HTML language tag.
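Instead of grepping the saved HTML, the attribute can be read programmatically. Here is a small sketch assuming the requests and beautifulsoup4 packages:

# A small sketch, assuming the requests and beautifulsoup4 packages: fetch the
# page and read the lang attribute of the <html> element, if present.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://ksu.edu.sa/').text, 'html.parser')
html_tag = soup.find('html')
if html_tag is not None and html_tag.has_attr('lang'):
    print(html_tag['lang'])                        # e.g., 'ar' in the example above
else:
    print('No HTML lang attribute found')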

      Guess-Language:
One tool to detect language in Python is Guess-Language. It detects the language of Unicode text and covers over 60 languages. However, two important notes: 1) this tool works better with more text, and 2) don't include HTML tags in the text or the result will be flawed. So if you want to check the language of a webpage, I recommend filtering out the tags using the Beautiful Soup package before passing the text to the tool (see the sketch at the end of this section).

      For example:

      curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'

      Tesla Secondary Simulation Project

      python
      from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
      guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

'en'

This example shows detecting the title language of a randomly selected English webpage from DMOZ, http://abelian.org/tssp/. The Guess-Language package detects the language of the title as French, which is wrong; when we extract more text, the result is English. In order to determine the language of short texts, you need to install PyEnchant and additional dictionaries. By default it only supports three languages (English, French, and Esperanto), so you need to download any additional language dictionaries you may need.
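Putting the recommendation above into practice, here is a small sketch (assuming the requests, beautifulsoup4, and guess_language packages) that strips the markup before guessing the language:

# A small sketch: strip the HTML tags first, then guess the language of the
# remaining text; more text generally yields a more reliable guess.
import requests
from bs4 import BeautifulSoup
from guess_language import guessLanguage

html = requests.get('http://abelian.org/tssp/').text
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
print(guessLanguage(text))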

      Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.

      For example:

      python
      import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)

'en'

Here, we also notice that the length of the text affects the result. When the text was short, we falsely got French as the language; when we added more text from the webpage, the correct answer appeared.

In another example, we check the title of a Korean webpage selected randomly from the DMOZ Korean webpage directory.

      For example:

      curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'

      EBN 물류&조선 뉴스

      python
      import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선 뉴스",300,300)

'ko'

Here the correct answer, Korean, showed up even though some English letters were in the title.

      LangID:
Another tool is LangID. This tool detects 97 different languages. As output, it states a confidence score for the probability prediction; the scores are re-normalized to produce an output in the 0-1 range. This is one of my favorite language detection tools because it is fast, handles short texts, and gives you a confidence score.

      For example:

      python
      import langid
langid.classify("Tesla Secondary Simulation Project")

('en', 0.9916567142572572)

      python
      import langid
      langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

('en', 1.0)

Using the same text as above, this tool identified the short text correctly with a confidence score of 0.99, and when the full text was provided, the confidence score was 1.0.


      For example:

      python
      import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")

('ar', 0.9999999797315073)

Testing another language, such as this Arabic phrase, it returned a confidence score of 0.99 for Arabic.

      Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool, and I think it is one of the strongest tools I found. Clients are available for different programming languages: Ruby, Java, Python, PHP, Crystal, and C#. To use this tool, you have to obtain an API key after creating an account and signing up. Each language test results in three outputs: isReliable (true/false), confidence (a rate), and language (a language code). The tool's website mentions that the confidence rate is not bounded to a range and can be higher than 100; no further explanation of how this score is calculated is given. The API allows 5000 free requests a day (1 MB/day). If you need more than that, there are different paid plans you can sign up for. You can also detect text language in an online demo. I recommend this tool if you have a small dataset, but it needs time to set up and to figure out how it runs.

      For example:

curl --silent http://moheet.com/ | grep "title" | sed -e 's/<[^>]*>//g' > moheet.txt

      python
import detectlanguage
detectlanguage.configuration.api_key = "Your key"
text = open("moheet.txt", "r").read()
detectlanguage.detect(text)
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]

In this example, I extract text from an Arabic webpage in the DMOZ Arabic directory. The tool detected its language as Arabic with isReliable set to True and a confidence of 7.73. Note that you have to remove newlines from the text; otherwise the tool considers it a batch detection and gives you a result for each line.

      In Conclusion:
So before you start looking for the right tool, you have to determine a few things first:
• Are you trying to detect the language of a webpage or of some text?
• What is the length of the text? Usually more text gives more accurate results (see this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083).
• What language do you expect to find (if it is known or expected)? Certain tools only cover certain languages.
• What programming language do you want to use?

Here is a short summary of the language detection methods I reviewed:

Method | Advantage | Disadvantage
HTTP Content-Language header and HTML language tag | can state the language directly | not always found and sometimes affected by browser settings
Guess-Language | fast, easy to use | works better on longer text
Python-Language Detector | fast, easy to use | works better on longer text
LangID | fast, gives you a confidence score, works on both long and short text | (none noted)
Google Language Detection API | gives you a confidence score, works on both long and short text | needs creating an account and setting up


      --Lulwah M. Alkwai

      2016-04-05: CNI Spring 2016 Trip Report

      The CNI Spring 2016 Members Meeting was held in San Antonio, TX, April 4-5, 2016.  As usual, the presentations were excellent but with six or more simultaneous sessions you are forced to make hard choices about what to catch up on.

This year Martin Halbert and Katherine Skinner arranged the "Digital Preservation of Federal Information Summit", convening 30+ people to discuss "...the topic of preservation and access to at-risk digital government information."  It was quite the collaborative exercise, and I know Martin produced some summary slides that I will link here when they are posted.  There were only a few presentations (done in Pecha Kucha format) for this Summit, and I was fortunate enough to give one for Herbert and me entitled "Why We Need Multiple Archives".  The answer is probably pretty obvious for the crowd that Martin assembled, but we often run into people who don't understand the role of archives beyond that of the (obviously excellent) Internet Archive.




Victoria Stodden gave the opening keynote, "Defining the Scholarly Record for Computational Research", in which she talked about the "Reproducible Research Standard", ResearchCompendia.org, and computational infrastructure within the context of legal and social norms.  CNI will eventually put the videos up; in the meantime I would encourage you to see her SC15 talk that touches on similar themes.

The next session I attended was Jason Varghese (NYPL) presenting "Microservices Architecture: Building Scalable (Library) Software Solutions." He's clearly doing cool stuff; I would have appreciated a more detailed discussion of the APIs they've implemented, but I guess that can be found at http://api.repo.nypl.org/.

      The next session was "Scaling Maker Spaces Across the Web: Weaving Maker Space Communities Together to Support Distributed, Networked Collaboration in Knowledge Creation", by Rick Luce and Carl Grant, both at Oklahoma University.  They talked about their experiences setting up a makerspace (complete with 3D printing and VR capabilities) in the library, both a small satellite for their on campus library (the "edge") and their much larger facility in the research park two miles away (the "hub").  I urge you to peruse the links -- this was truly impressive stuff & Rick consistently does exciting things with libraries.

I skipped the final session of the day in order to get my slides for Tuesday morning arranged.  I had originally thought I had a 30 minute slot, but in reality I had 15 minutes, so many slides needed tossing.  There was an evening reception, and then we had dinner at one of the many restaurants on the famed River Walk.

      Tuesday began with split sessions, and I was in the session that Martin Halbert arranged, "National Web Archiving Programs in the U.S.", along with Jefferson Bailey and Mark Phillips.  Jefferson gave a brief overview of the "Systems Interoperability and Collaborative Development for Web Archiving" project, Mark reviewed End of Term (EOT) web archiving, and Martin recapped the "Digital Preservation of Federal Information Summit" from the previous days.  I presented a brief status about our work using Storytelling interfaces for summarizing collections in Archive-It:




      Unfortunately, with the simultaneous sessions I had to miss "DBpedia Archive using Memento, Triple Pattern Fragments, and HDT", presented by Herbert Van de Sompel and Miel Vander Sande.

The next session I attended was about organization identifiers, and featured Geoffrey Bilder (Crossref), Patricia Cruse (DataCite), and (via FaceTime) Laure Haak (ORCID).  They are in the early stages of collaborating on org IDs, and while I learned a lot, I would have appreciated a more thorough review of existing org ID efforts and how they fall short of their goals.  They did share their requirements document and invited contributions.  "Challenges Presented by Organizational IDs" by Karen Smith-Yoshimura (OCLC), from CNI Spring 2015, provides some of the background that I did not have.

      The after lunch session that I attended on Tuesday was "Rebuilding the Getty Provenance Index as Linked Data".  I knew almost nothing about the art world going into this, so now I know more about the linked data challenges of porting Getty's legacy databases that await Rob Sanderson when he joins Getty later this month.

The closing keynote, "Activist Stewardship: The Imperative of Risk in Collecting Cultural Heritage", was handled by a trio from UCLA: Todd Grappone, Elizabeth McAulay, and Heather Briston (who was pinch-hitting for Sharon Farb).  They presented the Digital Ephemera Project and, in general, the role of archivists in collecting materials that can get the library or the contributor in trouble.  Some examples included the internal and external pressures surrounding UCLA's Scientology collection and concerns from contributors regarding the Green Movement collection.  Cliff Lynch gave a good introduction to this session and promised a wrap-up for it as well, but the session ran a bit long and that did not happen.  Rather than try to further summarize, I'll link the video when it comes out.  I did appreciate that Memento got a mention in the presentation regarding finding archived images embedded in tweets that had otherwise been deleted!


      If you want a mostly different path through the various simultaneous sessions, I encourage you to read Dale Askey's excellent conference notes.

      I'll update this post as additional slides and videos are uploaded.  Thanks to everyone @ CNI for yet another excellent meeting!

      --Michael



      2016-04-15: How I learned not to work full-time and get a PhD

ODU's commencement on May 7th marks the last day of my academic career as a student. I began my career at ODU in the Fall of 2004 and graduated with my BS in CS in the Spring of 2008, at which point I immediately began my Master's work under Dr. Levinstein. I completed my MS in Spring 2010, spent the summer with June Wright (now June Brunelle), and started my Ph.D. under Dr. Nelson in the Fall of 2010 (which is referred to as the Great Bait-and-Switch in our family). I will finish in the Spring of 2016, only to return as an adjunct instructor teaching CS418/518 at ODU in the Fall of 2016.


On February 5th, I defended my dissertation, "Scripts in a Frame: A Framework for Archiving Deferred Representations" (above picture courtesy of Dr. Danette Allen, video courtesy of Mat Kelly). My research in the WS-DL group focused on understanding, measuring, and mitigating the impacts of client-side technologies like JavaScript on the archives. In short, we showed that JavaScript causes missing embedded resources in mementos, leading to lower quality mementos (according to web user assessment). We designed a framework that uses headless browsing in combination with archival crawling tools to mitigate the detrimental impact of JavaScript. This framework crawls more slowly but more thoroughly than Heritrix and results in higher quality mementos. Further, if the framework interacts with the representations (e.g., clicks buttons, scrolls, mouses over), we add even more embedded resources to our crawl frontier, 92% of which are not archived.


      Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations from Justin Brunelle

      En route to these findings, we demonstrated the impact of JavaScript on mementos with our now-[in]famous CNN Presidential Debate example, defined the terms deferred representations to refer to representations dependent upon JavaScript to load embedded resources, descendants to refer to client-side states reached through the execution of client-side events, and published papers and articles on our findings (including Best Student Paper at DL2014 and Best Poster at JCDL2015).


At the end of a WS-DLer's academic tenure, it is customary to provide lessons learned, recommendations, and recaps of the academic experience useful to future WS-DLers and grad students. Rather than recap the work that we have documented in published papers, I will echo some of my advice and lessons learned for what it takes to be a successful Ph.D. student.

      Primarily, I learned that working while pursuing a Ph.D. is a bad idea. I worked at The MITRE Corporation throughout my doctoral studies. It took a massive amount of discipline, a massive amount of sacrifice (from myself, friends, and family), a forfeiture of any and all free time and sleep, and a near-lethal amount of coffee. Unless a student's "day job" aligns or overlaps significantly with her doctoral studies (I got close, but no cigar), I strongly recommend against doing this.

      I learned that a robust support system (family, friends, advisor, etc.) is essential to being a successful graduate student. I am lucky that June is patient and tolerant of my late nights and irritability during paper season, my family supported my sacrifices and picked up the proverbial slack when I was at conferences or working late, and that Dr. Nelson dedicates an exceptional portion of his time to his students. (Did I say that just like you scripted, Dr. Nelson?) I learned to challenge myself and ignore the impostor syndrome.

      I learned that a Ph.D. is life-consuming, demanding of 110% of a student's attention, and hard -- despite evidence to the contrary (i.e., they let me graduate) -- they don't give these things away. I also learned about what real, capital-R "Research" involves, how to do it, and the impact that it has. This is a lesson that I am applying to my day job and current endeavors.

      I learned to network. While I don't subscribe to the adage "It's not what you know, it's who you know", I will say that knowing people makes things much easier, more valuable, more impactful, and essential to success. However, if you don't know the "what", knowing the "who" is useless.

      I learned that not all Ford muscle cars are Mustangs (even though they are clearly the best), that it's best to root for VT athletics (or at least pretend), that I am terrible at commas, and that giving your advisors homebrew with your in-review paper submissions certainly can't hurt; the best collaborations and brainstorming sessions often happen outside of the office and over a cup of coffee or a pint of beer.

      Finally, I learned that finishing my Ph.D. before my son arrived was one of the best things I've done -- even if mostly by luck and divine intervention. I have thoroughly enjoyed spending the energy previously dedicated to staying up late, writing papers, and pounding my head against my keyboard to spending time with June, Brayden, and my family.

      Despite these hard lessons and a difficult ~5 years, pursuing a doctorate has been a great experience and well worth the hard work. I look forward to continued involvement with the WS-DL group, ODU, my dissertation committee, and sharing my many lessons learned with future students.


      --Dr. Justin F. Brunelle

      2016-04-17: A Summary of "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia"


Authors Nattiya Kanhabua, Ngoc Tu Nguyen, and Claudia Niederée from L3S published the following study at JCDL 2014. In the process of reviewing possible topics for my PhD research, I share my analysis of their findings. The full citation and presentation for the paper are below.


      Kanhabua, N., Nguyen, T. N., & Niederee, C. (2014, September). What triggers human remembering of events?: a large-scale analysis of catalysts for collective memory in Wikipedia. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 341-350). IEEE Press.





The article centers on identifying patterns that trigger recollection of events in collective memory. Since the number of categorical events is limitless, the authors focus on natural and man-made disasters, accidents, and terrorism. Their analysis confirms that two of the most notable characteristics across all events are time and location. While the two in conjunction are not consistent metrics for identifying triggers of event recollection, each is useful on its own. In addition, the study also confirms that semantics found in different types of events, like level of impact and damage cost, further help trigger remembrance of specific memories.

For their analysis, the authors use the English Wikipedia as the location of collective memory, which is built by an online community. It is important to note that this memory is dynamic in nature, changes over time, and is constructed through agreed-upon social influence. Essentially, the goal here is to extract patterns and characteristics of a particular memory and use them to identify how it can be triggered in recall. Note that, aside from characteristic analysis, we can identify the most popular memories by category, observe community division over topics, or even observe the edit wars centered around controversial topics.

      To get a better understanding of the underlying collection, the authors parse view logs of different events documented on Wikipedia. This allows them to visually interpret and categorize them. Figure 1 below shows how such a log can be used alongside a temporal attribute.

(Peaks signify an increase in resource views.)

      By observing the chart above, we can conclude that within some timespan, peaks are created as resource views dramatically increase. Thus, they become the driving factor behind correlating documents to temporal and categorical events. Take for example a document explaining a hurricane event in 2015 being viewed dramatically in 2016.

By itself, a peak is not a complete solution for identifying memory recollection, as there is nothing to compare it to. The proposed solution here is a remembrance score, which analyzes how likely peaks are to be memory catalysts of past events. In other words, it is a comparison between multiple peaks to see if relationships exist. This score is divided into three parts: the cross-correlation coefficient (CCF), the sum of squared errors (SSE), and kurtosis. These parts are all centered around time and peak analysis, and compare how likely it is that we remember one event by experiencing another. CCF is used as a means of understanding the similarity between two time series in a volume; it is a simple representation of how different events relate during particular time frames. SSE pushes further by measuring how unplanned a particular peak is within a time frame, promoting surprise detection; this helps us understand if one peak potentially triggered the other. Lastly, kurtosis is applied to the remembering score to account for the shape of the peaks; it considers the underlying distribution over time and answers the question: is the peak a constant phenomenon or a heavily influenced variable of change?
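To make the ingredients concrete, here is a rough sketch (assuming numpy and scipy; it is an illustration, not the paper's exact formulation) of the three components computed on two daily page-view series:

# A rough sketch of the three ingredients, assuming numpy and scipy; this is an
# illustration, not the paper's exact remembering-score formulation.
import numpy as np
from scipy.stats import kurtosis

views_a = np.array([120, 130, 5200, 900, 300, 250, 240], dtype=float)  # peaked series
views_b = np.array([100, 110, 4100, 800, 280, 230, 220], dtype=float)  # candidate trigger

ccf = np.corrcoef(views_a, views_b)[0, 1]       # CCF-style similarity of the two series
sse = np.sum((views_a - views_a.mean()) ** 2)   # SSE-style term: sharp peaks dominate
kurt = kurtosis(views_a)                        # peakedness of the distribution over time

print(ccf, sse, kurt)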

      (Table 1 shows the test data of events used from Wikipedia. Do note, italicized events are excluded from the experiment, as there were too few results for significant evaluation.)

      While this score is a good approach in understanding triggers for all events, the authors propose an analysis of common features to identify relationship development between similar events. This includes temporal similarity, or the time when the events occurred, and location similarity, where they occurred. Lastly, they also observe the impact of an event and how likely this event is to remain a continuous memory. Examples of impacts include: cost incurred due to event occurrence, affected regions, fatalities, etc.

      In Figure 5, location is a key observable similarity between hurricane events, whereas time is much more inconsistent.

      In Figure 10, time and location both play a significant role in identifying terrorist events. The conjunction of these attributes is much more evident here as opposed to hurricane events shown in Figure 5.


      In Figure 11, high impact events comprise between 25% and 50% of the top 10 triggered events. The percentage expands to 75% when considering the top 20.

By observing the charts above, we can conclude several things from the proposed study. First, location and time are key contributors when identifying which events cause remembrance of others, though their influence varies across the different types of events. Next, according to the results retrieved, contextual information also plays a very large role in determining relationships: the impact of events and semantic similarity can significantly boost or diminish the triggered recollection of the collective memories we have stored. Lastly, the computed remembrance scores are a good step towards identifying which peaks relate. While they can be tuned to score better for particular events, they must also remain generic enough for broad use.

It is clear that the study explored here has a great motive and even more interesting findings. However, it comes with two key limitations. First, the authors analyze human remembering of events against the English Wikipedia. While this could be helpful for a language-specific study, it could carry a large cultural bias compared to versions in other languages; it might also sway the focus toward events centered on regions relating to an English-based context. The other limitation is that the authors simply assume that the occurrence of one event triggers a recall from collective memory. While this can apply in many cases, the assumption does not consider that new events could trigger research into prior events, as opposed to remembrance.

Applying this in your research:

• Significant insight into forecasted and understood user recollection enables targeted event triggering. When users are searching for particular events, we could recommend other events that they might be interested in, within particular bounds of similarity.
• In contrast to exploring new data, we could also help the user recall what they have forgotten from the past.
--Slobodan Milanko

        2016-04-19: IIPC General Assembly 2016 Trip Report




        The 2016 IIPC General Assembly and the separate-but-related IIPC Web Archiving Conference 2016 were held in Reykjavík, Iceland, April 11-15, with the former being open to IIPC members only and the latter open to the public.  Unfortunately, my trip report will be incomplete since I had to leave midday on Wednesday.  The first day was primarily given to IIPC business: introducing the new officers, covering project status, budgets, new bylaws, etc.   Jason gave a brief overview of our IIPC-funded Web Archive Profiling Via Sampling Project, which is now coming to a close.  In addition to the resources and deliverables linked from the IIPC project page, Sawood Alam has developed the MemGator Memento Aggregator and the CDXJ format for serializing CDX files in json.  We welcome feedback on both.  I'd also like to repeat our request for web archiving logs so we can better model request patterns.

We had an introduction and Q&A from the Steering Committee members that worked well (I believe this was the first time this format had been used).  The day closed with updates from Alex Thurman & Abbie Grotke about the collaborative collections, Sara Aubry about the proposed WARC 1.1 format, and Andy Jackson on "Building Tools to Archive the Modern Web".

Unfortunately, Day 2 began with dual and triple tracks, so one was forced to make hard decisions about what to attend when they're all good.  I began in the session with Andy Jackson covering "Building Better Tools, Together", in which he covered the benefits of open source development.  The following session had David Rosenthal, Nicholas Taylor, and Jefferson Bailey covering the IMLS-funded web archiving API project.  The result of the session was a Google doc that contained the essence of the discussion.

        After lunch, I presented in the session "Harvesting Tools", with Jefferson Bailey and Youssef Eldakar.  Jefferson gave a preview of brozzler, a crawling package that combines real chrome browsers with warcprox for capturing all resources.  Youssef gave a demo of visualizing Heritrix crawls.  My talk closed the session and was based on Justin's work on crawling deferred representations and descendants (see the iPres 2015 paper and 2016 tech report for more information about these concepts, as well as Justin's PhD summary post). 




        The final session was by Martin Klein, Andrea Goethals, and Stephen Abrams on their plans for a submission to IMLS for nominating and coordinating seed URIs for crawls.

        Wednesday began the IIPC Web Archiving Conference, and it kicked off with a keynote from Iceland's own Hjálmar Gíslason, most recently at DataMarket.  He started off the keynote by defining the progression of "big data":


        Drawing from his current position and previous positions, he made a number of interesting observations regarding what is worth archiving.  Although "hoarding isn't a strategy", we frequently don't know in advance what will be valuable (e.g., the NY Times 1927 article that said "commercial use in doubt" regarding television).  His slides aren't posted yet, but hopefully soon.

After that was a joint presentation from Vint Cerf and Rick Witt from Google, which is now an IIPC member (!).  Vint rightly noted that the IIPC crowd didn't need the usual background material he typically provides (cf. DSHR's and my reaction to his 2015 AAAS talk).  Rick focused on potential roles for Google in the IIPC and web archiving in general:


        Vint was only able to be there for part of the day on Wednesday, but Rick was there the whole time.  Rick was careful to stress that Google was there to learn and assess, not to try to steer or dominate the community.  However, it is fair to say that the IIPC members that I spoke to were all very excited about Google's recognition of web archiving, even if no specific strategy or plan is adopted.  The Q&A after their presentation was quite lively and could have gone on much longer.  Brewster Kahle then moderated a panel about web archiving from the perspective of National Libraries with: Helen Hockx-Yu (IA, formerly of the British Library), Steve Knight (New Zealand), and Paul Koerbin (Australia). 

I had to leave after lunch, so I missed the remainder of the conference.  Rounding out Wednesday were David Rosenthal's "The Architecture of Emulation on the Web", Ilya Kreymer & Dragan Espenschied presenting oldweb.today (netcapsule on GitHub), Thomas Liebetraut talking about emulation (bwFLA), and Matthew S. Weber and Ian Milligan talking about their hackathons (Canada in March, US in June). Brewster concluded the day with a keynote, "20 Years of Web Archiving – What Do We Do Now?"  He previewed a really cool experimental interface for the Wayback Machine:





        I won't even try to summarize Thursday's sessions, and Friday consisted of a couple of different workshops.  The Twitter hashtags were #IIPCGA2016 and #IIPCWAC2016, respectively. Ed Summers has a nice page summarizing all the tweets for both events.  Kristinn Sigurðsson, who did a great job organizing the event, has a summary blog post for the event, and Peter Webster has a nice reflection piece about "What do we need to know about the archived web?" based on what he learned at IIPC.  I'll add more posts about the event as I discover them. 

As always, the IIPC meeting was excellent -- I highly encourage you to attend if you are at all interested in web archiving.  Next year's IIPC General Assembly and Web Archiving Conference will be in Lisbon, Portugal, in late March.

        --Michael


        2016-04-24: WWW 2016 Trip Report



I was fortunate to present a poster at the 25th International World Wide Web Conference, held April 11-15, 2016. Though my primary mission was to represent both the WS-DL and the LANL Prototyping Group, I gained a better appreciation for the state of the art of the World Wide Web.  The conference was held in Montréal, Canada, at the Palais des congrès de Montréal.



        SAVE-SD 2016


I began the conference at the SAVE-SD workshop, focusing on the semantics, analytics, and visualization of scholarly data.  They had 6 full research papers, 2 position papers, and 2 poster papers; the acceptance rate for this workshop is relatively high.  The workshop was kicked off by Alejandra Gonzalez-Beltran and Francesco Osborne. They encouraged the use of Research Articles in Simplified HTML.

Alex Wade gave us an introduction to the Microsoft Academic Service (MAS) and a sneak peek at the new features offered by Microsoft Academic, such as the Microsoft Academic Graph. They are in the process of adding semantic, rather than keyword, search with the intention of understanding academic users' intent when searching for papers. They have opened up their dataset to the community and provide APIs for future community research projects.
Angelo Salatino presented "Detection of Embryonic Research Topics by Analysing Semantic Topic Networks". The study investigated the discovery of "embryonic" (i.e., emerging) topics by testing for more than 2000 topics in more than 3 million publications. The goal is to determine if we can recognize trends in research while they are happening, rather than years later. They were able to show the features of embryonic topics, and their next step is to automate their detection.

        Bahar Sateli presented "Semantic User Profiles: Learning Scholars’ Competences by Analyzing their Publications". The goal of this study is to mitigate the information overload associated with semantic publishing. They found that it is feasible to semantically model a user's writing history. By modeling the user, better search ranking of document results can be provided for academic researchers. It can also be used to allow researchers to find others with similar interests for the purposes of collaboration.

        Francesco Ronzano presented "Knowledge Extraction and Modeling from Scientific Publications" where they propose a platform to turn data from scientific publications into RDF datasets, using the Dr. Inventor Text Mining Framework Java library.  They also generate several example interactive web visualizations of the data. In the future, they seek to improve the Text Mining Framework.

        Joakim Philipson presented "Citation functions for knowledge export - a question of relevance, or, can CiTO do the trick?".  He explored the use of the CiTO ontology in order to understand knowledge export - "the transfer of knowledge from one discipline to another as documented by cross-disciplinary citations". Unfortunately, he found that CiTO is not specific enough to capture all of the information needed to understand this.

        Sahar Vahdati presented "Semantic Publishing Challenge: Bootstrapping a Value Chain for Scientific Data". The study discussed "the use of Semantic Web technologies to make scholarly publications and data easier to discover, browse, and interact with".  Its goal is to use many different sources to produce linked open datasets about scholarly publications with the intent of improving scholarly communication, especially in the areas of searching and collaboration.  Their next step is to start building services on the data they have produced.

        Vidas Daudaravicious presented "A framework for keyphrase extraction from scientific journals".  His framework is able to use keyphrases to define topics that can differentiate journals. Using these keyphrases, one can improve search results by comparing journals to queries, allowing users to find articles of a similar nature. It also has the benefit of noting trends in research, such as when journal topics shift. Researchers can also use the framework to identify the best journals for paper submission.

        Ujwal Gadirju presented "Analysing Structured Scholarly Data Embedded in Web Pages". They analyzed the use of microdata, microformats, and RDF used as bibliographic metadata embedded in scholarly documents with the intent of building knowledge graphs. They found that the distribution of data across providers, domains, and topics was uneven, with few providers actually providing any embedded data. They also found that Computer Science and Life Science documents were more apt to contain this metadata than other disciplines, but also admitted that their Common Crawl dataset may have been skewed in this direction. In the future, they are planning a targeted crawl with further analysis.
        Shown below are participants enjoying the SAVE-SD 2016 Poster session. On the left below, Bahar Sateli presented "From Papers to Triples: An Open Source Workflow for Semantic Publishing Experiments". She showed how one could convert natural language academic papers into linked data, which could then be used to provide more specific search results for scholars. For example, the workflow allows a scholarly user to search a corpus for all contributions made in a specific topic.

        On the right below, Kata Gábor demonstrated "A Typology of Semantic Relations Dedicated to Scientific Literature Analysis". Her poster shows a model for extracting facts about the state of the art for a particular research field using semantic relations derived from pattern mining and natural language processing techniques.


        And shown to the left Erwin Marsi discussed his poster, "Text mining of related events from natural science literature". His study had the goal of producing aggregate facts on the concepts from articles within a corpus.  For example, it aggregates the fact that there is an increase in algae based on the text from many papers that had research results finding an increase in algae. The idea is to find trends in research papers through natural language processing.

        In closing, the SAVE-SD 2016 workshop mentioned that selected papers could be resubmitted to PeeRJ.

        TempWeb 2016


        On Tuesday I attended the 6th Temporal Web Analytics Workshop, where I learned about current studies using and analyzing the temporal nature of the web. I spoke to a few of the participants about our work on Memento, and they educated me as to the new work being done.

        The morning opened with a Keynote by Wolfgang Nejdl of the Alexandria Project.  Wolfgang Nejdl discussed the work at L3S and how they were trying to consider all aspects of the web, from the technical to its effects on community and society. He discussed how social media has become a powerful force, but tweets and posts link to items that can disappear, losing the context of the original post.  This reminded me of some other work I had seen in the past. He mentioned how important it was to archive these items.
He then went on to cover other aspects of searching the archived web, detailing challenges encountered by project BUDDAH, including the problem of ranking temporal search results. Seen below, he demonstrates an alternative way of visualizing temporal search results using the HistDiv project. This visualization helps in understanding the changing nature of a topic.  In this case, we see how searching for the term Rudolph Giuliani changes with time: as the person's career (and career aspirations) changes, so does the content of the archived pages about them. He closed by discussing the use of curated archiving collections from Archive-It in the collaborative search and sharing platform ArchiveWeb, which allows one to find archive collections pertinent to their search query.
        The workshop presentations started with two different investigations into ways of creating and performing calculations on temporal graphs.  On the right, Julia Stoyanovich presents "Towards a distributed infrastructure for evolving graph analytics".  She details Portal, a query language for temporal graphs, allowing one to easily query and calculate metrics such as PageRank for a temporal graph, given a specific interval.

Matthias Steinbauer presented "DynamoGraph: A Distributed System for Large-scale, Temporal Graph Processing, its Implementation and First Observations".  DynamoGraph is a system that also allows one to query and calculate metrics on temporal graphs.
        Both researchers used the following lunch to discuss temporal graphs at length.  I wondered if one could model TimeMaps in this way and use these tools to discover interesting connections between archived web pages.
Mohsen Shahriari discussed "Predictive Analysis of Temporal and Overlapping Community Structures in Social Media".  He went into detail on the evolution of communities, represented by graphs, detailing how they can grow, shrink, merge, split, or dissolve entirely.  Using datasets from Facebook, DBLP citations, and Enron emails, his experiments showed that smaller communities have a higher chance of survival, and his model had a high success rate in predicting whether a community would survive.
        Aécio Santos presented "A First Study on Temporal Dynamics on the Web".  He used topical web page classifiers in a focused crawling experiment to analyze how often web pages about certain topics changed.  Pages from his two topics, ebola and movies, changed at different rates. Pages on ebola were more volatile, losing and gaining links, mostly due to changing news stories on the topic, whereas movies pages were more stable, with authors only augmenting their contents. He did find that, in spite of this volatility, pages did tend to stay on topic over time. The goal is to ensure that crawlers are informed by differences in topics and adjust their strategies accordingly.
        Jannik Strötgen presented "Temponym Tagging: Temporal Scopes for Textual Phrases".  He discussed the discovery and use of temponyms to understand the temporal nature of text.  Using temponyms, machines can determine the time period that a text covers. He explained the issues with finding exact temporal intervals or times for web page topics, seeing as many pages are vague. His temponym project, HeidelTime, has been tested on the WikiWars corpus and the YAGO semantic web system.  He also presented further information on this topic, later in WWW 2016.

        We then shifted into using temporal analysis for security.  Staffan Truvé from Recorded Future presented "Temporal Analytics for Predictive Cyber Threat Intelligence". His company specializes in using social media and other web sources to detect potential protests, uprisings, and cyberattacks.  He indicated that protests and hacktivism are often talked about online before they happen, allowing authorities time to respond.

In closing, Omar Alonso from Microsoft presented "Time to ship: some examples from the real-world". He highlighted some of the ways in which the carousel at the top of Bing is populated, using topic virality on social media as one of the many inputs. He talked about the concept of social signatures, derived from all of the social media posts referring to the same link; using this text, they are able to further determine the aboutness of a given link, helping further with search results.  He switched to other topics that help with search, such as connecting place and time. Search results for points of interest (POI) in a given location in effect try to match people looking for things to do (queries) with social media posts, check-ins, and reviews for a given POI.  He concluded by saying that there is much work to be done, such as allowing POI results for a given time period, e.g., "things to do in Montréal at night".

        Keynotes


        Sir Tim Berners-Lee



        Sir Tim Berners-Lee spoke of the importance of decentralizing the web, ensuring that users own their own data, web security, work to standardize and improve the ease of payments on the web, and finally the Internet of Things (IoT).
        Mentioning the efforts of projects like Solid, he highlighted the need to ensure that users retain their data to ensure their privacy. The idea is that a user can tell the service where to store their data and then they have ownership and responsibility over that data.
He mentioned that, in the past, the Internet had to be deployed by sending tapes through the mail, but now we are heading to a point where the web platform, because it allows you to deploy a full computing platform very quickly, may become the rollout platform of the future. Because of this ability, security is becoming more and more important, and he wants to focus on a standard for security that uses the browser, rather than external systems, as the central point for asking a user for their credentials, thereby helping guard against trojans and malicious web sites. He said that the move from HTTP to HTTPS has been less easy than expected, considering that many HTTPS pages are "mixed", containing references to HTTP URIs.  This results in three different worlds: plain HTTP pages, pure HTTPS pages, and pages that use upgrade-insecure-requests, which are still mixed but in a way endorsed by the author.
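For concreteness, the upgrade mechanism he described works roughly as follows: a page opts in with a Content-Security-Policy directive, and supporting browsers advertise their willingness to upgrade subresource requests with a request header (the host and path below are illustrative only):

GET /mixed-page.html HTTP/1.1
Host: www.example.org
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html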
        Next, he spoke about making web payments standardized, comparing it to authentication. There are a wide variety of different solutions for web payments and there needs to be a standard interface. There is also an increasing call to allow customers to pay smaller amounts than before, which many current systems do not handle. Of course, customers will need to know when they are being phished, hence the security implications of a standardized system.
        Finally, he covered the Internet of Things (IoT), indicating there are connections to data ownership, privacy, and security.
        In the following Q&A session, I asked Sir Tim Berners-Lee about the steps toward browser adoption for technologies such as Memento.  He said the first step is to discuss them at conferences like WWW, then engage in working groups, workshops, and other venues.  He noted that one also needs to define the users for such new technologies so they can help with the engagement.
        Later, during the student Q&A session the following day, Morgannis Graham from McGill University asked Sir Tim Berners-Lee about his thoughts on the role of web archives.  He replied that "personally, I am a pack rat and am always concerned about losing things". He highlighted that while the general web users are thinking of the present, it is the role of libraries and universities to think about the future, hence their role in archiving the web.  He stated that universities and libraries should work more closely together in archiving the web so that if one university falls, others exist having the archives of the one that was lost. He also stated that we all have a role in ensuring that legislation exists to protect archiving efforts.  Finally, he tied his answer back to one of his current projects: what happens to your data when the site you have given it to goes out of business.

        Lady Martha Lane-Fox


        Wednesday evening ended with an inspiring talk from Lady Martha Lane-Fox.  She works for the UK in a variety of roles advancing the use of technology in society.  She states that a country that can: (1) improve gender balance in tech, (2) improve the technical skills of the populace, and (3) improve the ability to use tech in the public sector, will be the most competitive.


        She went further in explaining how the current gender balance is very depressing, noting that in spite of the freedom offered by technology, old hierarchies and structures have been re-established. She indicated that there are studies showing that companies with more diverse boards are more successful, and how we need to tackle this problem, not only from a technical, but also a social perspective.
        She discussed the challenges of bringing technology to everyday lives and applauded South Korea's success while highlighting the challenges still present in the UK. She relayed stories of encounters with the citizenry, some of whom were reluctant to embrace the web, but after doing so felt they had more freedom and capability in their lives than ever before. She praised the UK for putting coding on the school curriculum and looking toward the needs of future generations.
        She then talked about re-imagining public services entirely through the use of technology. The idea is to make government agencies digital by default in an effort to save money and provide more capability. She highlighted a project where a UK hospital once had 700 administrators and 17 nurses, and, through adopting technology, were able to then take the same money and hire 700 nurses to work with 17 administrators, thus providing better service to patients.
        She closed by discussing her program DotEveryone, which is a new organization promoting the promise of the Internet in the UK for everyone and by everyone. Her goal is for the UK to be the most connected, most digitally literate, and most gender equivalent nation on earth. In a larger sense, she wants to kick off a race among countries to use technology to create the best countries for their citizens.

        Mary Ellen Zurko


Wednesday morning started with a keynote by Mary Ellen Zurko, from Cisco. She discussed security on the web. Her first lesson: "The future will be different; so will the attacks and attackers, but only if you are wildly successful". Her point was that the success of the web has made it a target. She then covered the history of basic authentication, S-HTTP, and finally SSL/TLS in HTTPS.
She then discussed the social side of security, indicating that users are often confused about how to respond to web browser warnings about security. There is a 90% ignore rate on such warnings, and 60% of those are related to certificates. She highlighted how difficult it is for users to know whether or not a domain is legitimate and whether the certificate shown is valid. She also highlighted that most users, even expert users, do not fully understand the permissions they are granting when asked, due to the cryptic and sometimes misleading descriptions given to them, mentioning that 17% of Android users actually pay attention to permissions during installation and only 3% are able to answer questions on what the security permissions mean.


Reiterating the results of a study by Google, she stated that 70% of users clicked through malware warnings in Chrome, while Firefox users heeded them more often. The study found that the Firefox warnings provided a better user experience, and thus users were more apt to pay attention to and understand them. Following this study, Google changed its warnings in Chrome.
She said that the open web is an equal opportunity environment for both attackers and defenders, detailing how fraudulent tech support scams are quite lucrative. This was discovered in recent work by Cisco, "Reverse Social Engineering Social Tech Support Scammers", where Cisco engineers actively bluffed tech support scammers in order to gather information on their whereabouts and identities.
        Of note, she also mentioned that there is a largely unexploited partnership between web science and security.

        Peter Norvig


On Friday morning, Peter Norvig gave an engaging speech on the state of the Semantic Web. He mentioned that his job is to bring information retrieval and distributed systems together. He went through a history of information retrieval, discussing WAIS and the World Wide Web, as well as Archie. Before Google, several efforts were already trying to tame the nascent web.
After Google, the Semantic Web was developed as a way to extract information from the many pages that existed. He talked about how Tim Berners-Lee was a proponent, whereas Cory Doctorow highlighted that there was nothing but obstacles in its path. Peter said that Cory had several reasons for why it would fail, but the main ones were (1) people lie, (2) people are lazy, and (3) people are stupid, indicating that the information gathered from such a system would consist of intentional misinformation, lack of complete information, or misinformation due to incompetence.
Peter then highlighted several instances where this came about. Initially, excellent expressiveness was produced by highly trained logicians, giving us DAML, OWL, RDFa, FOAF, etc. Unfortunately, they found a 40% page error rate in practice, indicating that Cory was correct on all 3 fronts. Peter's conclusion was that highly trained logicians did not seem to solve the identified problems.

        Peter then posited "what about a highly trained webmaster?". In 2010, search companies promoted the creation of schema.org with the idea of keeping it simple. The search engines promised that if a site were marked up, then they would show it immediately in search results. This gave users an incentive to mark up their pages and now has resulted in technologies that can better present things like hotel reservations and product information. This led most to conclude that schema.org was an unexpected success.
Peter closed by saying that obstacles still remain, seeing as most of the data comes from web site owners, still leading to misinformation in some cases. He talked about the need to be able to connect different sources together so that one can, for example, not only find a book on Amazon, but also a listing of the author's interests on Facebook. He hopes that neural networks could be combined with semantic and syntactic approaches to solve some of these large connection problems.

        W3C Track


Tzviya Siegman, from John Wiley & Sons Publishing, presented "Scholarly Publishing in a Connected World". She discussed how publications of the past were immutable, and publishers did little with content once something was published. She confessed that in a world where machines are readers, too, publications are a bit behind the times. She further said that we still have an obsession with pages, citing them, marking them, and so on, when in reality the web is not bound by pages. She wants to standardize on a small set of RDFa vocabularies that would enable gathering of content by topic, whether the documents published are articles, data, or electronic notebooks. She closed by talking about how Wiley is trying to extract metadata from its own corpus to provide additional data for scholars.
        Hugh McGuire presented "Opening the book: What the Web can teach books, and what books can teach the Web". He talked about how books seem to hold a special power and value, specific to the boundedness of a book. The web, by contrast, is unbounded; even a single web site is unknowable with no sense of a beginning or an end. On the web, however, anyone can publish documents and data to a global audience without any required permission. He talked about how books are a singular important node of knowledge, with the ebook business having the opposite motive of the web, making ebooks a kind of restricted, broken version of the web. He wants to be able to combine the two. For example a system can provide location-aware annotations of an ebook while also sharing those annotations freely, essentially making ebooks smarter and more open.
Ivan Herman revealed Portable Web Publications, which have serious implications for archiving. The goal is to allow people to download web publications like they do ebooks, PDFs, or other portable articles. There is a need to do so because connectivity is not yet ubiquitous. With the power of the web, one can also embed interactivity into the downloaded document. Of course, there are also additional considerations, like the form factor of the reading device and the needs of the reader. The concept is more than just creating an ebook with interactive components or a web page that can be saved offline. He highlighted the work of publishers in terms of ergonomy and aesthetics, stating that web designers for such portable publications should learn from this work. Portable Web Publications would not be suitable for social web sites, web mail, or anything that depends on real-time data. PWP requires 3 layers of addressing: (1) locating the PWP itself, (2) locating a resource within a PWP, and (3) locating a target within such a resource. In practice, locators depend on the state of the resource, creating a bit of a mess. His group is currently focusing on a manifest specification to solve these issues.

        Poster Session


Of course, I was here to present a poster, "Persistent URIs Must Be Used to be Persistent", developed by Herbert Van de Sompel, Martin Klein, and me, which indicates important consequences for the use of persistent URIs such as DOIs.

In looking at the data from "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot", we reviewed 1.6 million web references from 1.8 million articles and discovered the following:
        1. use of web references is increasing in scholarly articles
        2. frequently authors use publisher web pages (locating URI) rather than DOIs (persistent URI) when creating references
        We show on the poster that, because many use browser bookmarks or citation managers that store these locating URIs, there must be an easy way to help tools find the DOI. Our suggestion is to store this DOI in the Link header for easy access by these tools.
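As a rough sketch of how a citation manager or bookmarking tool could take advantage of such a Link header, consider the snippet below; the landing page URI and the "identifier" relation type are placeholders for illustration, not a settled part of the proposal:

import requests

# Hypothetical publisher landing page; the relation type "identifier" is a
# placeholder used only for illustration.
landing_page = "http://www.example-publisher.org/article/landing/12345"

response = requests.head(landing_page, allow_redirects=True)

# requests parses the HTTP Link header into response.links, keyed by relation type.
persistent = response.links.get("identifier")
if persistent:
    print("Cite the persistent URI:", persistent["url"])
else:
    print("No persistent URI advertised; falling back to the locating URI.")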

        I appreciate the visit from Sarven Capadisli and Amy Guy who work on Solid. Many others came by to see our work, like Takeru Yokoi, Hideaki Takeda, Lee Giles, and Pieter Colpaert. Most appreciated the idea, noting it as "simple" with some asking "why don't we have this already?".

        WWW Conference Presentations


        Even though I attended many additional presentations, I will only detail a few of interest.
        As a person who has difficulty with SPARQL, I appreciated the efforts of Gonzalo Diaz and his co-authors in "Reverse Engineering SPARQL Queries". Their goal was to reverse engineer SPARQL queries with the intent of producing better examples for new users, seeing as new users have a hard time with the precise syntax and semantics of the language. Given a database and answers, they wanted to reverse engineer the queries that produced those answers. Unfortunately, they discovered that verifying a reverse engineered SPARQL query to determine if it is the canonical query for a given database and answer is an NP-complete (intractable) problem. They were however able to perform some heuristics on a specific subset of queries to solve this problem in polynomial time.
Fernando Suarez presented "Foundations of JSON Schema". He mentioned that JSON is very popular because it is flexible, but there is no way to describe what kind of JSON response a client should expect from a web service. He discussed a proposal from the IETF to develop JSON Schema, a set of restrictions that documents must satisfy. He said the specification is in its fourth draft, but is still ambiguous. Even online validators disagree on some content, meaning that we need clear semantics for validation, and he proposes a formal grammar. His contribution is an analysis showing that the validation problem is PTIME-complete, but that determining if a document has an equivalent JSON schema is PSPACE-hard for very simple schemas. For the future, he intends to work further on integrity constraints for JSON documents and more use cases for JSON Schema.
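For readers unfamiliar with the format, a minimal sketch of a JSON Schema (draft-04 style keywords; the fields are invented for illustration) looks like this:

{
  "type": "object",
  "properties": {
    "title":   { "type": "string" },
    "year":    { "type": "integer", "minimum": 1900 },
    "authors": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["title", "authors"]
}

A validator checks whether a given JSON document satisfies these restrictions; Suarez's complexity results concern exactly this validation problem and the equivalence of such schemas.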
David Garcia presented "The QWERTY Effect on the Web: How Typing Shapes the Meaning of Words in Online-Human Communications".  He highlighted a hypothesis that words typed with more letters from the right side of the keyboard are more positive than those with more letters from the left. He tested this hypothesis on product ratings from different datasets and found that 9 out of 11 datasets see a significant QWERTY effect which is independent of the number of views or comments on an item. He did mention that he needs to repeat the study with different languages and keyboard layouts. He closed by saying that there is no evidence yet that we can predict meanings or change evaluations based on this knowledge.
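A toy way to compute such a right-side score for a word is sketched below; the assignment of letters to hands is one common convention and is an assumption on my part, not necessarily Garcia's exact setup:

# Right-side advantage: (right - left) / (right + left) over a word's letters.
LEFT_HAND  = set("qwertasdfgzxcvb")   # assumed left-hand QWERTY letters
RIGHT_HAND = set("yuiophjklnm")       # assumed right-hand QWERTY letters

def right_side_advantage(word):
    letters = [c for c in word.lower() if c.isalpha()]
    right = sum(c in RIGHT_HAND for c in letters)
    left = sum(c in LEFT_HAND for c in letters)
    return (right - left) / (right + left) if letters else 0.0

print(right_side_advantage("lol"), right_side_advantage("sad"))  # 1.0 vs. -1.0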
Justin Cheng presented "Do Cascades Recur?", in which he analyzes the rise and fall of memes multiple times throughout social media. Prior work shows that cascades (meme sharing) rise and then fall, but in reality there are many rises and falls over time. He studies these different peaks and tries to determine how and why these cascades recur. Seeing as these bursts are separated among different network communities, cascades recur when people connect communities and reshare something. It turns out that a meme with high virality has less chance of recurring, but one with medium virality will recur months or perhaps years later. He would like to repeat his study with networks other than Facebook and develop improved models of recurrence based on other data.
Pramod Bhatotia presented "IncApprox: The Marriage of incremental and approximate computing". He discussed how data analytic systems transform raw data into useful information, but they need to strike a balance between low latency and high throughput. There are two computing paradigms that try to strike this balance: (1) incremental computation and (2) approximate computing. Incremental computation is motivated by the fact that we are recomputing the output with small changes in the input and can reuse memoized parts of the computation that are unaffected by the changed input. Approximate computing is motivated by the fact that an approximate answer is often good enough. With approximate computing, we get the entire input dataset, but compute only parts of the input and then produce approximate output in a low-latency manner. His contribution is the combination of these two approaches.
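A toy sketch of the idea (not the IncApprox system itself) might combine memoized per-chunk results with sampling, for example:

import random

# Toy combination of incremental and approximate computing: estimate the sum
# of a chunked dataset by (1) memoizing per-chunk partial sums so only new or
# changed chunks are recomputed, and (2) sampling chunks and scaling up.
memo = {}  # chunk id -> cached partial sum

def estimate_total(chunks, changed, sample_rate=0.5):
    for cid in changed:            # invalidate chunks whose input changed
        memo.pop(cid, None)
    sampled = [cid for cid in chunks if random.random() < sample_rate]
    total = 0.0
    for cid in sampled:
        if cid not in memo:        # incremental: reuse unchanged partial sums
            memo[cid] = sum(chunks[cid])
        total += memo[cid]
    return total / sample_rate     # approximate: scale the sampled portion up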
        Jessica Su presented "The Effect of Recommendations on Network Structure". She worked with Twitter on the rollout of a recommendation system that suggests new people to follow. They restricted the experiment to two weeks to avoid any noise from outside the rollout. They found that there is an effect; people's followers did increase after the rollout. They also confirmed that the "rich get richer", with those who already had many followers gaining more followers and those with few still gaining some followers. She also mentioned that people did not appear to be making friends, only following others.
Samuel Way presented "Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks". This study tried to investigate why women are not participating in computer science. He mentioned that there are conflicting results: universities have a 2-to-1 preference for female faculty applicants, but at the same time there is a bias favoring male students. They developed a framework for modeling faculty hiring networks using a combination of CVs, social media profiles, and other sources on a subset of people currently going through the tenure process. The model shows that gender bias is not uniformly, systematically affecting all hires in the same way and that the top institutions fight over a small group of people. Women are a limited resource in this market and some institutions are better at competing for them. The result is that accounting for gender does not help predict faculty placement, leading them to conclude that the effects of gender are accounted for by other factors, such as publishing or post-doctoral training rates, or the fact that some institutions appear to be better at hiring women than others. The model predicts that men and women will be hired at equal rates in Computer Science by the 2070s.

        Social

Of course, I did not merely enjoy the presentations and posters. Between the Monday night SAVE-SD dinner, the Thursday night Gala, and lunch each day, I took the opportunity to acquaint myself with many experts in the field. Google, Yahoo!, and Microsoft were also there looking to discuss data sharing, collaboration, and employment opportunities.

        I always had lunch company thanks to the efforts of Erik Wilde, Michael Nolting, Roland Gülle, Eike Von Seggern, Francesco Osborne, Bahar Sateli, Angelo Salatino, Marc Spaniol, Jannik Strötgen, Erdal Kuzey, Matthias Steinbauer,  Julia Stoyanovich, Jan Jones, and more.
Furthermore, the Gala introduced me to other attendees, like Chris LaRoche, Marc-Olivier Lamothe, Ashutosh Dhekne, Mensah Alkebu-Lan, Salman Hooshmand, Li'ang Yin, Alex Jeongwoo Oh, Graham Klyne, and Lukas Eberhard. Takeru Yokoi introduced me to Keiko Yokoi from the University of Tokyo who was familiar with many aspects of digital libraries and quite interested in Memento. I also had a fascinating discussion about Memento and the Semantic Web with Michel Gagnon and Ian Horrocks, who suggested I read "Introduction to Description Logic" to understand more of the concepts behind the semantic web and artificial intelligence.

        In Conclusion


As my first academic conference, WWW 2016 was an excellent experience, bringing me in touch with paragons at the forefront of web research. I now have a much better understanding of where we are in the many aspects of the web and scholarly communications.
        Even as we left the conference and said our goodbyes, I knew that many of us had been encouraged  to create a more open, secure, available, and decentralized web.




        2016-04-27: Mementos in the Raw

        While analyzing mementos in a recent experiment, we discovered problems processing archived content.  Many web archives augment the mementos they serve with additional archive-specific information, including HTML, text, and JavaScript.  We were attempting to compare content across many web archives, and had to develop custom solutions to remove these augmentations.

Most archives augment their mementos in order to provide additional user experience features, such as navigation to other mementos, by rewriting links and providing discovery tools. From an end-user perspective, these augmented mementos enhance the usability and overall experience of web archives and are the default case for user access to mementos.  An example from the PRONI web archive is shown below, with the augmentations outlined in red.



        Others have requirements to differentiate archived content from live content, because they expose archived content to web search engines. Below, we see that a Google search will return content from the UK National Archives, with one of these search results outlined in red.
        To indicate the archived nature of this content, the title of the web page, outlined in red below, has been altered to indicate that this archived page is "[ARCHIVED CONTENT]".


Our experiments were adversely affected by these augmentations. We required "mementos in the raw".  In the case of our study, we needed to access the content as it had existed on the web at the time of capture.  Research by Scott Ainsworth requires accurate replay of the headers as well. These captured mementos are invaluable to the growing number of research studies that use web archives. Captured mementos are also used by projects like oldweb.today, which truly need to access the original content so it can be rendered in old browsers; such a project seeks consistent content from different archives to arrive at an accurate page recreation. Fortunately, some web archives store the captured memento, but there is no uniform, standards-based way to access them across various archive implementations.

        Based on the needs of these research studies and software projects:
        1. A captured memento must contain only the memento content that was present in the original document:
        • no HTML, JavaScript, CSS, or text has been added to the output
        • linked URIs are not rewritten and exist as they were in the original document (e.g., http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html should just be http://www.lanl.gov/news/index.html)
2. A captured memento should also provide the original HTTP headers in some form (e.g., X-Archive-Orig-Content-Type: text/html for users desiring the original Content-Type)

The following table provides a list of some known web archives and the status of their ability to provide captured mementos, by way of either unaltered content and/or the original headers. Those columns with a "Yes" indicate that the archive is able to provide access to that specific dimension of captured mementos using software-specific approaches.


        Those entries with a ? and other archives not listed may or may not provide access to captured mementos. This ambiguity is part of the problem.  Those archives that run OpenWayback for serving their mementos have the capability to deliver captured mementos, as detailed in the OpenWayback Administrator Manual, by use of special URIs. In fact, the OpenWayback im_ URI flag provides the desired behavior, with original headers and original content, even though the documentation states that it is supposed to "return document as an image".

        Of course, not all web archives run OpenWayback, and developers have needed to create heuristics based on the software used by each individual web archive.  For example, our archive registry uses the un-rewritten-api-url attribute to provide a pattern for accessing captured mementos. Because there is no uniform approach, these pattern-based solutions are necessary but brittle, tying them to a small set of specific implementations, and making it difficult for clients to adapt to new or changing web archive software.
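A sketch of what such a heuristic looks like in practice is shown below, using the un-rewritten-api-url template from the registry's Icelandic Web Archive entry (shown later in this post); each archive needs its own pattern, which is exactly the brittleness we want to remove:

# Fill in an archive-specific URI template to reach a captured memento.
TEMPLATE = "http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}"

def captured_memento_uri(timestamp, url, template=TEMPLATE):
    return template.format(timestamp=timestamp, url=url)

print(captured_memento_uri("20091117131348", "http://www.lanl.gov/news/index.html"))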
        We propose a solution that uses the Memento specification (RFC 7089) in its current form, while still allowing uniform, standards-based access to both augmented and captured mementos.

        Proposed Solution for Accessing Augmented and Captured Mementos

        We propose two parallel Memento implementations: one with a TimeGate and TimeMap for access to augmented mementos (as currently exists) and another with a TimeGate and TimeMap for access to captured mementos.  A client that desires access to a specific type of memento (captured or augmented) only needs to access the TimeGate or TimeMap that specializes in finding and returning that type of memento. These parallel Memento implementations are based on the same infrastructure, the interactions are the same, and the only difference is in the nature of the memento each serves.

        Clients could use the Archive Registry for discovering these TimeGates and TimeMaps. The Registry contains entries for many public web archives and version control systems, for each detailing its TimeGate and TimeMap URIs, as well as any additional information pertinent to accessing the archives. Several tools, such as the Memento Aggregator, directly use the information in the Registry. In light of discussions on the Memento Development list, we are considering creating a curated location where improvements can be submitted by the community.

        A new attribute, profile, added to the timegate and timemap elements in the Registry, would allow a client to discover the TimeGate and/or TimeMap providing the type of memento it desires. A fictional enhanced Registry entry for the Icelandic Web Archive is shown below with the new profile attributes in red. Also, information currently provided in the <archive> element would either be deprecated (e.g. un-rewritten-api-url) or relocated (e.g. inside the timegate or timemap elements).

        <link id="is" longname="Icelandic Web Archive">
        <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
        <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
<timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented" />
        <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/"
paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured" />
        <icon uri="http://vefsafn.is/favicon.ico"/>
        <calendar uri="http://wayback.vefsafn.is/wayback/*/"/>
        <memento uri="http://wayback.vefsafn.is/wayback/*/"/>
        <archive type="snapshot" rewritten-urls="yes" un-rewritten-api-url="http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}" access-policy="public" memento-status="yes"/>
        </link>
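A client could then select the appropriate endpoint by profile; a minimal sketch, assuming the fictional registry entry above has been saved to a local file, follows:

import xml.etree.ElementTree as ET

CAPTURED = "http://mementoweb.org/terms/captured"

# archive_registry.xml is a hypothetical local copy of the registry containing
# <link> entries like the Icelandic Web Archive example above.
root = ET.parse("archive_registry.xml").getroot()

for link in root.iter("link"):
    for timegate in link.iter("timegate"):
        if timegate.get("profile") == CAPTURED:
            print(link.get("longname"), "captured TimeGate:", timegate.get("uri"))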

        This solution requires no changes to the Memento protocol and allows web archives to satisfy the needs of both end-users and software applications by returning the appropriate memento for each use-case. 
        In the case of OpenWayback, this capability should be easy to add. Consider the following example from the Icelandic Archive, running OpenWayback, where the following URIs refer to the mementos of http://www.lanl.gov with a Memento-Datetime of Tue, 17 Nov 2009 13:13:48 GMT:
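For example (illustrative forms that follow the archive's URI pattern and the im_ convention described above):

http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/ (augmented)
http://wayback.vefsafn.is/wayback/20091117131348im_/http://www.lanl.gov/ (captured)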
The memento that will be selected from the archive for the requested datetime, and hence the database interactions, will be the same for augmented and captured mementos. The only difference is the memento URI to which the TimeGates will redirect, which is limited to the addition of the string im_ in the captured memento's URI. The additional TimeGate only needs to add this string to its output.
This approach, fully aligned with the Memento protocol, removes the need for client heuristics and supports using syntaxes other than im_ to distinguish between captured and augmented memento URIs. A client that picks a TimeGate or TimeMap of a given nature will continue to receive that type of memento.

        Optional Additions


        With parallel "augmented" and "captured" Memento protocol support in place, as described above, we have supplied access to different types of mementos. The following section details other optional helpful changes that a client could use to identify and locate different types of mementos.

        Self-Describing TimeGates, TimeMaps, and Mementos

        TimeGates, TimeMaps, and mementos can self-describe their nature with an HTTP link using a profile relation, defined by RFC 6906, and a link target (Target IRI in the RFC) that indicates their augmented or captured nature.

        Example TimeGate response headers implementing this self-describing ability are shown below, with the profile relation specifying the captured nature in red.

        HTTP/1.1 302 Found
        Date: Thu, 21 Jan 2010 00:02:14 GMT
        Server: Apache
        Vary: accept-datetime
        Location: http://arxiv.example.net/web/captured/20010321203610/http://
        a.example.org/
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format"
        ; from="Tue, 15 Sep 2000 11:28:26 GMT"
        ; until="Wed, 20 Jan 2010 09:34:33 GMT",
        <http://mementoweb.org/terms/captured>; rel="profile"
        Content-Length: 0
        Content-Type: text/plain; charset=UTF-8
        Connection: close

        Example TimeMap response headers implementing this relation are shown below, again with additions in red describing this TimeMap as listing augmented mementos. The profile link is placed within the Link header so that clients can discard or consume the associated entity based on their needs. The profile link is also included in the TimeMap body so that the TimeMap itself is self-describing.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:06:50 GMT
        Server: Apache
        Content-Length: 4883
        Content-Type: application/link-format
        Link: <http://mementoweb.org/terms/augmented>; rel="profile"
        Connection: close

        <http://a.example.org>;rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org>
        ; rel="self";type="application/link-format",
        <http://mementoweb.org/terms/augmented>
        ; rel="profile",

        <http://arxiv.example.net/timegate/http://a.example.org>
        ; rel="timegate",
        <http://arxiv.example.net/web/20000620180259/http://a.example.org>
        ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
        <http://arxiv.example.net/web/20091027204954/http://a.example.org>
        ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
        <http://arxiv.example.net/web/20000621011731/http://a.example.org>
        ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
        <http://arxiv.example.net/web/20000621044156/http://a.example.org>
        ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT",
        ...

Finally, a memento can specify whether it is captured or augmented using the same method.  Seen in red in the example below, the headers describe this resource as a captured memento.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:15 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/captured/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/captured>; rel="profile"
        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        These additional profile relations allow archives to describe the nature of respective TimeGates, TimeMaps, and mementos without affecting existing Memento clients.

        Discovery of Other TimeGates and TimeMaps via Mementos

Here we introduce an approach for a client to get from a memento to its corresponding memento of the other type. This capability is handy in itself, but, as will be shown, it is also a way to get to the other type of TimeGate and TimeMap.

        By including another Link relation, a machine client can find the corresponding memento of another type.  Shown below, we build upon our previous example memento headers and add this new relation, marked in red, allowing clients to find this captured memento's augmented counterpart. Here a profile attribute is used with the memento relation type in order to indicate the type of memento found at the link target. This profile attribute has been requested as part of "Signposting the Scholarly Web", and is provided by a proposed update to a draft RFC detailing "link hints". This proposed update has been informally accepted by the RFC's author.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:15 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/captured/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/captured>; rel="profile",
        <http://arxiv.example.net/web/20010321203610/http://
        a.example.org/>
        ; rel="memento"; profile="http://mementoweb.org/terms/augmented"

        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        From there, a client can follow the link target to the augmented memento. In the example below, we have the headers for the corresponding augmented memento.  The Memento protocol already provides the associated timegate and timemap relations, shown in bold.  A client uses these relations to discover the TimeGate/TimeMap that serves this memento, and, of course, the TimeGate/TimeMap have the same augmented nature as this memento. Note that this augmented memento also links to its captured counterpart.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:16 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/augmented>; rel="profile",
<http://arxiv.example.net/web/captured/20010321203610/http://
        a.example.org/>
        ; rel="memento"; profile="http://mementoweb.org/terms/captured"
        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        Now the client can make future requests to this TimeGate and receive responses like the one below, finding additional augmented mementos for the original resource.

        HTTP/1.1 302 Found
        Date: Thu, 21 Jan 2010 00:02:17 GMT
        Server: Apache
        Vary: accept-datetime
        Location: http://arxiv.example.net/web/20100424131422/http://
        a.example.org/
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org/>
        ; rel="timemap"; type="application/link-format"
        ; from="Tue, 15 Sep 2000 11:28:26 GMT"
        ; until="Wed, 20 Jan 2010 09:34:33 GMT",
        <http://mementoweb.org/terms/augmented>; rel="profile"
        Content-Length: 0
        Content-Type: text/plain; charset=UTF-8
        Connection: close

        Likewise, a client can issue a request to the associated TimeMap to access augmented mementos for this resource. Of course, this process can start from an augmented memento and lead a client to the TimeGate/TimeMap for its captured counterpart as well.
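A minimal client-side sketch of this walk, using the fictional arxiv.example.net URIs from the examples above (and assuming the archive implements the proposed profile links), might look like:

import requests

AUGMENTED = "http://mementoweb.org/terms/augmented"

# Start from a captured memento and follow its profile-qualified "memento"
# link to the augmented counterpart, then read that response's "timegate"
# and "timemap" relations.
captured_uri = ("http://arxiv.example.net/web/captured/20010321203610/"
                "http://a.example.org/")
resp = requests.get(captured_uri)

counterpart = resp.links.get("memento")   # requests parses the Link header
if counterpart and counterpart.get("profile") == AUGMENTED:
    augmented = requests.get(counterpart["url"])
    print("Augmented TimeGate:", augmented.links["timegate"]["url"])
    print("Augmented TimeMap: ", augmented.links["timemap"]["url"])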

        Conclusion


        The "captured" and "augmented" parallel Memento implementations addresses the problem of accessing different types of mementos in a standard-based manner.  Given that the selected memento will be the same for both the captured and augmented cases and the difference will only be in the access mechanism (URI), the solution seems straightforward to implement for web archives. Existing clients will still continue to function as is, and clients desiring a specific type of memento can use the Archive Registry to find the resources that support the that type of memento.

        In addition, the optional profile and discovery links add further value, allowing clients to identify which type of mementos they have currently acquired as well as accessing the other types of mementos that are available.

        We look forward to feedback on this proposed solution.

        --
        Shawn M. Jones
        - and -
        Herbert Van de Sompel
        - and -
        Michael L. Nelson

        Acknowledgements: Ilya Kremer also contributed to the initial discussion of the need for a standard method of accessing captured mementos.

        2016-05-31: Can I find this story? API: Yes, Google: Maybe, Native Search: No

        A story on Storify titled: "Lecture on Academic Freedom" (capture date: 2016-05-31)
        The story on Storify titled: "Lecture on Academic Freedom" could not be found on Google (capture date: 2016-05-31)
        The story on Storify titled: "Lecture on Academic Freedom" could not be found on Storify native search (capture date: 2016-05-31)
A part of our research (funded by IMLS) to build collections for stories or events involves exploring content curation sites like Storify in order to determine if they hold quality (newsworthy, timely, etc.) content. Storify is a social network service used to create stories, which consist of text and multimedia content, as well as content from other social media sites like Twitter, Facebook, and Instagram.
Our exploration involved collecting stories from Storify over a period of time in order to manually inspect the stories and determine their newsworthiness. This exploration was dual natured: we collected the latest stories (across multiple topics) from the Storify API (browse/latest interface) over a period of time, and we also collected stories from Storify about the Ebola virus through Storify's search API. During this period we collected resources from Google (with the "site:storify.com" directive) as well. At a particular point in our exploration, we considered whether we could rely exclusively on Storify search as a means to find content or use Google's site directive to find Storify stories. In other words, how good is the Storify native search compared to Google search for discovery of stories on Storify when compared to the Storify browse/latest API?
        Storify API vs Google and Storify native search: A simple plan for measuring discovery
We focused on known-item searches to avoid the problem of subjective relevance measures. This gave us a very simple way of scoring Google and Storify's native search: if Google finds a specific story (query extracted from the exact title, body content, and description), Google gets 1 point. On the other hand, if Storify's native search (using the same query) finds the story, Storify gets 1 point.
Our set of test stories and their corresponding queries, generated from the story titles, body content, and description snippets, consisted of 10 stories created between February 2016 and March 2016 (enough time for both search services to index the stories). These stories were collected from the Storify browse/latest API interface, which allows for discovery of content but does not allow us to find topical content as with search. Here is the list of stories (collected 2016-05-30) and their respective creation datetime values, as well as the results outlining stories found by Google and/or Storify's native search:

Story | Creation datetime | Found? (Google) | Found? (Storify)
Commandos 2: Men of Courage full game free pc, download, play. download Commandos 2: Men of Courage for pc | 2016-02-22T22:36:03 | Yes | No
#SJUtakeover | 2016-02-17T21:16:43 | Yes | No
Annotations for Edgar Allan Poe | 2016-03-02T19:47:31 | No | No
Lecture on Academic Freedom | 2016-02-22T22:27:08 | No | No
Hitman: Codename 47 full game free pc, download, play. download Hitman: Codename 47 for pc | 2016-02-22T22:36:26 | Yes | No
AU Game Lab at GDC 2016 | 2016-03-18T17:36:34 | Yes | No
5 Leading Onlinegames For Females Cost Free | 2016-02-22T22:37:22 | Yes | No
Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes | 2016-03-18T23:50:55 | No | No
Senior Research Paper | 2016-02-26T19:47:19 | Yes | No
Syracuse community reacts to NCAA Tournament win over Dayton | 2016-03-18T17:38:34 | Yes | No

We searched for the stories by issuing queries with full quotes (for exact match) to Google search (with the "site:storify.com" directive) and Storify's native search, and counted the number of hits and misses for both. For both Google and Storify, all SERP links were included in the test. The results from Google did not exceed one page; for Storify, however, the average number of results was 20 stories.
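For reference, the queries were constructed along these lines; this is a sketch, and the Storify search URL shown is illustrative rather than the exact endpoint we used:

import urllib.parse

title = "Lecture on Academic Freedom"   # one of the test story titles above

google_q  = urllib.parse.quote_plus('"%s" site:storify.com' % title)
storify_q = urllib.parse.quote_plus('"%s"' % title)

print("https://www.google.com/search?q=" + google_q)
print("https://storify.com/search?q=" + storify_q)   # illustrative endpoint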
        Storify's native search finds 0/10 stories, Google finds 7/10
        We expected Storify to find more stories compared to Google, since the content resides on Storify, but this was not the case: out of 10 stories, Google found 7 but Storify found none! Google found all except the following stories:
        1. Annotations for Edgar Allan Poe
        2. Lecture on Academic Freedom
        3. Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes
        A story on Storify titled: "#SJUTakeover" (capture date: 2016-05-31)

        The story on Storify titled: "#SJUTakeover" could not be found on Storify search but found on Google (capture date: 2016-05-31)
Before our test, we checked and did not find a Storify utility to exclude a story from search during the story's creation. Consequently, our test result suggests that the Storify search index is not synchronized with its browse/latest API interface. This investigation also shows the utility of using the Storify API for discovery, which contradicts some of our previous experiences where APIs provide different, limited, or stale data (e.g., Delicious API, SE APIs).
        A proposal for a comprehensive study
We acknowledge that the sample size of our experiment is very small; however, because the stories were randomly selected, the preliminary results could approximate those of a larger study. The curious reader may consider verifying our result through a larger test consisting of a large collection of random stories published across a wide temporal window. If this is done, kindly share your findings with us.
        --Nwala

        2016-06-03: Lipstick or Ham: Next Steps for WAIL

        The development, state, and future of 🐳 Web Archiving Integration Layer. 💄∨🐷?                                                                 

        Some time ago I created and deployed Web Archiving Integration Layer (frequently abbreviated as WAIL), an application that provides users pre-configured local instances of Heritrix and OpenWayback. This tool was originally created for the Personal Digital Archiving 2013 conference and has gone through a metamorphosis.

The original impetus for creating the application was that the browser-based WARCreate extension required some sort of server-like software to save files locally because of the limitations of the Google Chrome API and JavaScript at the time (2012). WARCreate would perform an HTTP POST to this local server instance, which would then return an HTTP response with an appropriate MIME type that would cause the browser to download the file. I initially used XAMPP for this with a PHP script within the Apache instance. This was unwieldy and a little more complex of a procedure than I wanted for the user.

With the introduction of the HTML5 File API, this server software was no longer required. The File API, however, is sandboxed to an isolated file system accessible only to the browser. To circumvent this restriction, I utilize the FileSaver.js library, but this, too, has limitations on the size of the file that can be downloaded -- 500 MiB (about 524 MB) for Google Chrome.

        XAMPP to WAIL

With Apache no longer being a requirement for WARCreate, I investigated using XAMPP's bundled copy of the Apache web server and the additionally bundled Tomcat Java server for other web archiving purposes, namely as the engine to run the Java-based OpenWayback. This worked well but still felt heavy for a user's PC, as Java applications do. The added Java requirement also meant that I could include a pre-configured Heritrix, Internet Archive's Java-based archival crawler, within XAMPP. The XAMPP interface, however, was generic, geared toward simply controlling services, a UI scheme I wanted to obscure from the target audience.

A locally hosted web-based interface might have been suitable, but as with the WARCreate-to-local-file problems, having a browser launch applications on the user's machine was likely to be problematic. Being already familiar with Python, I created a script using the wxPython library (the Python port of wxWidgets) that allows a user to specify a URI for Heritrix to crawl (by programmatically creating crawl configurations) and locations for the resulting WARCs, to which Heritrix should write and from which OpenWayback should read.
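The shape of that script was roughly as follows; this is a minimal sketch rather than WAIL's actual code, with the crawl-launching logic stubbed out:

import wx

class WailSketchFrame(wx.Frame):
    """A single text field for a seed URI and a button whose handler would
    hand the URI off to Heritrix and point OpenWayback at the WARCs."""
    def __init__(self):
        super(WailSketchFrame, self).__init__(None, title="WAIL (sketch)")
        panel = wx.Panel(self)
        self.uri = wx.TextCtrl(panel, value="http://example.com", size=(300, -1))
        button = wx.Button(panel, label="Archive Now!")
        button.Bind(wx.EVT_BUTTON, self.on_archive)
        sizer = wx.BoxSizer(wx.HORIZONTAL)
        sizer.Add(self.uri, 1, wx.ALL | wx.EXPAND, 5)
        sizer.Add(button, 0, wx.ALL, 5)
        panel.SetSizer(sizer)

    def on_archive(self, event):
        seed = self.uri.GetValue()
        # Here WAIL would write a Heritrix job configuration for `seed`
        # and launch the crawl; omitted in this sketch.
        print("Would crawl:", seed)

if __name__ == "__main__":
    app = wx.App(False)
    WailSketchFrame().Show()
    app.MainLoop()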

This additional Graphical User Interface (GUI) "Layer" for "Integrating" "Web Archive" tools (Heritrix and OpenWayback) spawned the awkwardly named "Web Archiving Integration Layer". The acronym, while descriptive, also reiterated ODU WS-DL's trend of associating produced software with sea creatures (as I referenced once before).


Ceci n'est pas un cochon (This is not a pig)
Requiring the target user base (digital humanities scholars and amateur web archivists) to go to the command-line to launch a Python script was unacceptable, however, and the remedy to this problem has been partially to blame for the slowdown in further development of WAIL. To "Freeze" code is to create the more familiar "Application" that a user would double click to launch. At the time (2013), PyInstaller provided the best application freezing functionality in that it performed dependency resolution, created cross-platform binaries, and provided a mode to produce a single binary file, which was not initially necessary but became appealing.

In the beginning, WAIL was compiled for Windows and Mac OS X (nowadays called simply "OS X"). On the latter, single-file applications are very common, as OS X's ".app" faux directory structure allows the application tools and resources to be nicely packaged. Eventually, this was also a useful place to include the OpenWayback and Heritrix binaries. Windows does not have this abstraction, instead frequently providing a directory of files with the ".exe" being the binary, which is the reason that WAIL for Windows has not been updated since 2013.

        Plagued with Problems

As if the decoupling of the OS X and Windows versions was not bad enough, OS X ceased bundling the Java runtime with the operating system (which required WAIL to install the runtime), Heritrix required an older version of Java (it would break with the latest version), and there were just generally Java problems all around. These problems persist to this day, but ultimately it was these requirements and configuration issues that WAIL was designed to solve, or at least mitigate, for the user. The WAIL code that drives the UI is also quite the mess. Despite us being researchers, for whom code function should supersede its form, because WAIL is publicly available (both the binary and the source), it ought to reflect quality in form to the extent of its function.

        Refactor or Is That Fiddly?

        I have been maintaining and improving the code but eventually either another WS-DLite will be doing the same or the project will die. I believe there still to be merit in a locally hosted web archive, particularly for the digital humanities scholars that aren't familiar with system interaction via the command-line and manually rewriting configuration files.

We are looking into other routes to make the code more intuitive to maintain but still functionally equivalent to, if not greater than, the Python-based native app in its current state. We have bundled the newly developed Go-based MemGator Memento aggregator (blog post to come) with WAIL as a cross-platform native executable. We also hope to include other tools that personal web archivists would find useful, with the requirement being that each must run natively and include no further non-bundled dependencies. Two tools on our radar are Ilya Kreymer's pywb, part of the replay component that's driving Webrecorder, and the heavily coupled (with pywb) InterPlanetary Wayback (ipwb) system we developed at the Archives Unleashed Hackathon in March.

        The question still remains whether to rework the current code or to overhaul the UI in a way that is more extensible and maintainable. The Electron packaging library, as used by the native Slack application, Atom editor, and many other software projects, looks to be the route to take to achieve these goals. Additionally, interfaces written for Electron can be compiled to native applications, a feature that will allow the ethos of WAIL to be retained.

        However, rewriting the UI does not a more useful application make and doing so boils down to putting lipstick on a pig. External dependencies should be the primary problem to tackle. From that, including additional functionality and tools to make the application more useful (the "ham" if this simile can be stretched any further) ought to be given priority.

        —Mat Kelly (@machawk1)