
2015-11-06: iPRES2015 Trip Report

From November 2nd through November 5th, Dr. Nelson, Dr. Weigle, and I attended the iPRES2015 conference at the University of North Carolina Chapel Hill. This served as a return visit for Drs. Nelson and Weigle; Dr. Nelson worked at UNC through a NASA fellowship and Dr. Weigle received her PhD from UNC. We also met with Martin Klein, a WS-DL alumnus now at the UCLA Library. While the last ODU contingent to visit UNC was not so lucky, we returned to Norfolk relatively unscathed.

Cal Lee and Helen Tibbo opened the conference with a welcome on November 3rd, followed by Nancy McGovern's keynote address, delivered with Leo Konstantelos and Maureen Pennock. This was not a traditional keynote but an interactive dialogue in which several challenge areas were presented to the audience, and the audience responded -- live and on Twitter -- with significant achievements or advances in those challenge areas from #lastyear. For example, Dr. Nelson identified the #iCanHazMemento utility. The responses are available on Google Docs.


I attended the Institutional Opportunities and Challenges session to open the conference. Kresimir Duretec presented "Benchmarks for Digital Preservation Tools." His presentation touched on how we can get digital preservation tools that "Just Work", including benchmarks for evaluating tools on test beds and measuring them for quality. Related to this is Mat Kelly's work on the Archival Acid Test.



Alex Thirifays presented "Towards a Common Approach for Access to Digital Archival Records in Europe." This paper touched on user access: user needs, best practices for identifying requirements for access, and a capability gaps analysis of current tools versus user needs.

"Developing a Highly Automated Web Archive System Based
on IIPC Open Source Software" was presented by Zhenxin Wu. Her paper outlined a framework of open source tools to archive the web using Heritrix and a SOLR index of WARCS with an enhanced interface.

Barbara Sierman closed the session with her presentation "Best Until ... A National Infrastructure for Digital Preservation in the Netherlands" focusing on user accessibility and organizational challenges as part of a national strategy for preserving digital and cultural Dutch heritage.

After lunch, I led off the Infrastructure Opportunities and Challenges session with my paper on Archiving Deferred Representations Using a Two-Tiered Crawling Approach. We defined deferred representations as those that rely on JavaScript to load embedded resources on the client. We show that archives can use PhantomJS to create a crawl frontier 1.5 times larger than Heritrix's, but PhantomJS crawls 10.5 times slower. We recommend using a classifier to recognize deferred representations and using PhantomJS only to crawl those representations, mitigating the crawl slow-down while still reaping the benefits of the headless crawler.
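
To make the two-tiered idea concrete, here is a minimal sketch (not the code from the paper) that dispatches each URI in the frontier to either a Heritrix-style crawl or a PhantomJS crawl based on a classifier's prediction; classify_deferred, crawl_with_heritrix, and crawl_with_phantomjs are hypothetical stand-ins for the classifier and the two crawlers.

# Minimal sketch: only URIs classified as deferred representations are sent
# to the slower, headless PhantomJS crawl; everything else goes to Heritrix.
# classify_deferred(), crawl_with_heritrix(), and crawl_with_phantomjs() are
# hypothetical stand-ins, not the implementation evaluated in the paper.
def crawl_frontier(frontier, classify_deferred, crawl_with_heritrix, crawl_with_phantomjs):
    discovered = set()
    for uri in frontier:
        if classify_deferred(uri):
            # Deferred representation: let the headless browser execute
            # JavaScript so client-side embedded resources are discovered.
            discovered.update(crawl_with_phantomjs(uri))
        else:
            # Ordinary representation: the faster Heritrix-style crawl suffices.
            discovered.update(crawl_with_heritrix(uri))
    return discovered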

 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling Approach from Justin Brunelle
  
Douglas Thain followed with his presentation on "Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?" Similar to our work with deferred representations, his work focuses on scientific replay of simulations and software experiments. He presents several tools as part of a framework for preserving the context of simulations and simulation software, including dependencies and build information.

Hao Xu presented "A Method for the Systematic Generation of Audit Logs in a Digital Preservation Environment and Its Experimental Implementation In a Production Ready System". His presentation focused on the construction of a finite state machine for determining whether a repository is following compliance policies for auditing purposes.

Jessica Trelogan and Lauren Jackson presented their paper "Preserving an Evolving Collection: 'On-The-Fly' Solutions for the Chora of Metaponto Publication Series." They discussed the storage of complex artifacts of ongoing research projects in archeology with the intent of improving the shareability of the collections.

To wrap up Day 1, we attended a panel on Preserving Born-Digital News consisting of Edward McCain, Hannah Sommers, Christie Moffatt, Abigail Potter (moderator), Stéphane Reecht, and Martin Klein. Christie Moffatt identified the challenges with archiving born-digital news material, including the challenges with scoping a corpus. She presented their case study on the Ebola response. Stéphane Reecht presented the work by the BnF regarding their work to perform massive, once-a-year crawls as well as selective, targeted daily crawls. Hannah Sommers provided insight into the culture of a news producer (NPR) on digital preservation. Martin Klein presented SoLoGlo (social, local, and global) news preservation, including citing statistics about the preservation of links shortened by the LA Times. Finally, Edward McCain discussed the ephemeral nature of born-digital news media, and provided examples of the sparse number of mementos in news pages in the Wayback Machine.


To kick off Day 2, Lisa Nakamura gave her opening keynote The Digital Afterlives of This Bridge Called My Back: Public Feminism and Open Access. Her talk focused on the role of Tumblr in curating and sharing a book no longer in print as a way to open the dialogue on the role of piracy and curation in the "wild" to support open access and preservation.

I attended the Dimensions of Digital Preservation session, which began with Liz Lyon's presentation on "Applying Translational Principles to Data Science Curriculum Development." Her paper outlines a study to help revise the University of Pittsburgh's data science curriculum. Nora Mattern took over the presentation to discuss the expectations of the job market to identify the skills required to be a professional data scientist.

Elizabeth Yakel presented "Educational Records of Practice: Preservation and Access Concerns." Her presentation outlined the unique challenges with preserving, curating, and making available educational data. Education researchers or educators can use these resources to further their education, reuse materials, and teach the next generation of teachers.

Emily Maemura presented "A Survey of Organizational Assessment Frameworks in Digital Preservation." She presented the results of a survey of organizational assessment frameworks, which aim to do for digital preservation what software maturity models do for computer scientists. Further, her paper identifies trends, gaps, and models for assessment.

Matt Schultz, Katherine Skinner, and Aaron Trehub presented "Getting to the Bottom Line: 20 Digital Preservation Cost Questions." Their questions help institutions evaluate cost, including questions about storage fees, support, business plans, etc. to help institutions assess their approach to taking on digital preservation.

After lunch, I attended the panel on Long Term Preservation Strategies & Architecture: Views from Implementers consisting of Mary Molinaro (moderator), Katherine Skinner, Sibyl Schaefer, Dave Pcolar, and Sam Meister. Sibyl Schaefer led off with a presentation on Chronopolis and the ACE audit manager. Dave Pcolar followed by presenting the Digital Preservation Network (DPN) and their data replication policies for dark archives. Sam Meister discussed the BitCurator Consortium, which helps with the acquisition, appraisal, arrangement and description, and access of archived material. Finally, Katherine Skinner presented the MetaArchive Cooperative and their activities teaching institutions to perform their own archiving, along with other statistics (e.g., the minimum number of copies to keep stuff safe is 5).

Day 2 concluded with the poster session (including a poster by Martin Klein) and reception.



Pam Samuelson opened Day 3 with her keynote Mass Digitization of Cultural Heritage: Can Copyright Obstacles Be Overcome? Her keynote touched on the challenges with preserving cultural heritage introduced by copyright, along with some of the emerging techniques to overcome the challenges. She identified duration of copyright as a major contributor to the challenges of cultural preservation. She notes that most countries have exceptions for libraries and archives for preservation purposes, and explains recent U.S. evolutions in fair use through the Google Books rulings.

After Samuelson's keynote, I concluded my iPRES2015 visit and explored Chapel Hill, including a visit to the Old Well (at the top of this post) and an impromptu demo of the pit simulation. It was very scary.



Several themes emerged from iPRES2015, including an increased emphasis on web archiving and a need for improved context, provenance, and access for digitally preserved resources. I look forward to monitoring the progress in these areas.


--Justin F. Brunelle


2015-11-24: Twitter Follower Analysis of Virginia University Alumni Associations

The primary goal of any alumni association is to maintain and strengthen the ties between its alumni, the community, and the mission of the university. With social media, it's easier than ever to connect with current and former graduates on Facebook, Instagram, or Twitter with a simple invitation to "like us" or "follow me." Considering just one of these social platforms, we recently analyzed the Twitter networks of twenty-three (23) Virginia colleges and universities to determine what, if any, social characteristics were shared among the institutions and whether we could gain any insight by examining the public profiles of their respective followers. The colleges of interest, ranked by number of followers in Table 1, vary in size, mission, type of institution, admissions selectivity, and perceived prestige. On average, the alumni associations have maintained a Twitter presence for six (6) years. The oldest Twitter account belongs to Roanoke College (@roanokecollege), which is approaching the eight (8) year mark. The newest Twitter account was registered by Randolph-Macon College (@RMCalums) nearly two years ago.




University | Followers | Joined Twitter
University of Virginia | 12,100 | 11/1/2008
Roanoke College* | 9,588 | 3/1/2008
Regent University* | 7,966 | 11/1/2008
James Madison University | 7,865 | 8/1/2008
Virginia Tech | 6,418 | 4/1/2009
College of William & Mary | 4,448 | 1/1/2009
University of Mary Washington | 3,847 | 10/1/2009
Liberty University | 3,699 | 11/6/2009
University of Richmond | 3,299 | 5/1/2009
Sweet Briar College* | 2,523 | 8/1/2010
George Mason University | 2,375 | 2/1/2011
Hampton University | 2,372 | 2/15/2012
Christopher Newport University | 2,191 | 8/1/2010
Old Dominion University | 1,996 | 7/1/2009
Randolph College* | 1,857 | 8/1/2008
Washington and Lee University | 1,842 | 8/1/2011
Radford University | 1,758 | 3/11/2011
Hampden-Sydney College | 1,086 | 7/1/2009
Longwood University | 1,035 | 2/28/2013
Hollins University | 923 | 4/1/2009
Virginia Military Institute | 836 | 3/1/2009
Norfolk State University | 629 | 8/15/2011
Randolph-Macon College | 172 | 3/7/2014
Table 1 - Alumni Associations Ranked by Followers

* Institution does not have an official alumni Twitter account.
The university Twitter account was used instead.

Social Graph Analysis


NodeXL is a template for Microsoft Excel which makes network analysis easy and rather intuitive. We used this tool to import the Twitter networks and to analyze the various social media interactions. The Twitter API limits the amount of data that any one user can collect per hour. Due to this rate limiting, NodeXL will only import the 2,000 most recent friends and followers for any Twitter account. To improve the response time of the API, we further restricted our data collection to the 200 most recent tweets for both the university and each of its follower accounts.
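
For readers who would rather script the same collection constraints outside of NodeXL, a rough sketch follows. It assumes the tweepy library with OAuth credentials already configured; the method names match older tweepy releases and may differ in current versions, so treat it as illustrative rather than a drop-in script.

import tweepy

# Hypothetical credentials; replace with your own application's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# wait_on_rate_limit makes tweepy sleep until the Twitter API rate-limit
# window resets instead of raising an error mid-collection.
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_account(screen_name):
    # 2,000 most recent followers, mirroring the NodeXL import limit.
    followers = [f.screen_name for f in
                 tweepy.Cursor(api.followers, screen_name=screen_name).items(2000)]
    # 200 most recent tweets, mirroring the restriction used in this study.
    tweets = api.user_timeline(screen_name=screen_name, count=200)
    return followers, tweets

followers, tweets = collect_account("ODUAlumni")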

For our first look at the alumni associations, we clustered the data based on an algorithm in NodeXL which looks at how the vertices are connected to one another. The clusters, as shown in Figure 1, are indicated by the color of the nodes. The clusters themselves revealed some interesting patterns. The high level of inter-association connectivity, as measured in follows, tweets, and mentions, was unexpected. We would have thought that each association operated within the confines of its own Twitter space or that of its parent organization. As we examine the groupings in this network, it is not unreasonable that we would observe connections between Old Dominion University (@ODUAlumni), Norfolk State University (@nsu_alumni_1935), and Hampton University (@HamptonU_Alumni), as all three are located within close proximity of one another in the Hampton Roads area. But then we must take notice of Hollins University (@HollinsAlum), a small, private women's college in Roanoke, VA, which has connections with ten (10) other alumni associations, more than any other school. Hollins is one of the smallest universities in our group, with an enrollment of fewer than 800 students. Since Twitter is primarily about influence, in this instance we can probably assume the follows serve as a means to observe best practices and current engagement trends employed by larger institutions. While Hollins University is well connected, as are many of the other schools, at the opposite end of the spectrum we find Liberty University (@LibertyUAlum), a large school with more than 77,000 students. Liberty University remains totally isolated with no follower connections to the other alumni associations. At a minimum, you might expect some type of connection with Regent University (@RegentU), since both share a similar mission as private, Christian institutions, or with other universities in close physical proximity such as Randolph College (@randolphcollege).

Figure 1 - Connectivity of Alumni Associations

Twitter Followers, Enrollment, and Selectivity


We normally measure the popularity of a Twitter account based on the number of followers. Instead of simply quantifying the follower counts of each alumni association, we sought to understand whether certain factors, actions, or inherent qualities of the institution might influence the relative number of followers. First, we considered whether more active tweeters would attract more alumni followers. As shown in Figure 2, the College of William and Mary (@wmalumni) has generated the most tweets over its lifetime, approximately 6,200 or 2.5 tweets per day. But we also observe the University of Mary Washington (@UMaryWash), which has approximately half the student enrollment, a similar Twitter life span, and 50% fewer tweets at 2,800 or 1.3 per day, with only a slight difference in the number of followers, 4,400 versus 3,800 respectively. While the graph shows that schools such as Virginia Tech (@vt_alumni) and the University of Virginia (@UVA_Alumni) have more followers with fewer lifetime tweets, the caveat is that these public institutions have the benefit of considerably larger student populations, which inherently increases the pool of potential alumni.

Figure 2 - Lifetime Tweets Versus Followers


Next, we considered whether a higher graduation rate, or alumni production, would result in more followers. We obtained the most recent, 2014 overall graduation rates for each institution from the National Center for Education Statistics, with reported overall six-year graduation rates ranging from 34% to 94%. A 2015 Pew Research Center study of the Demographics of Social Media Users indicates that among all internet users, 32% in the 18 to 29 age range use Twitter. This is a key demographic as we would expect our alumni associations to be primarily focused on attracting recent undergraduates. We also factored in selectivity, a comparative scoring of the admissions process, using the categories defined in the 2016 U.S. News Best Colleges Directory. In this directory, colleges are designated as most selective, more selective, selective, less selective or least selective based on a formula.

As we look at Figure 3, we observe a positive correlation between admissions selectivity and the institution's overall graduation rate. Schools which were least selective during the admissions phase also produced the lowest graduation rates (less than 40%), while schools which were most selective experienced the highest graduation rates (around 90%). It isn't surprising that improved graduation rates positively affect the expected number of alumni Twitter followers. We'll leave it as an exercise for the reader to extrapolate how closely each institution's annual undergraduate enrollment, graduation rate, and expected level of engagement on Twitter corresponds to the actual number of followers when all three factors are considered.

Figure 3 - Followers Versus Graduation Rate

Potential Reach of Verified Followers


Users on Twitter want to be followed, so we looked carefully at who, besides alumni and students, was following each of the alumni associations. Specifically, we noted the number of Twitter-verified followers: accounts which are usually associated with high-profile users in "music, acting, fashion, government, politics, religion, journalism, media, sports, business and other key interest areas." In addition to an abundance of local news reporters and sports anchors, regional politicians, and career sites, other notable followers included: restaurant review site Zagat (@Zagat), automaker Toyota USA (@toyota), musician and rapper DJ King Assassin (@DjKingAssassin), the Nelson Mandela Foundation (@NelsonMandela), President of the United States Barack Obama (@BarackObama), Virginia Governor Terry McAuliffe (@GovernorVA), and artist and singer Yoko Ono (@yokoono). Some of the follower relationships with verified users were probably established prior to 2013, the year in which Twitter instituted new rules to kill the "auto follow," a programmatic way of following another user back after they follow you. Either way, the open question remains why these particular users would follow an alumni association when there are no readily apparent educational ties.

Twitter doesn't take follower count into consideration when verifying an account, but it's not unusual for a verified account to have a considerable following. Since the mission of an alumni association is essentially about networking and information dissemination, we also measured the potential reach, or level of influence, across the followers' extended networks contributed by the verified accounts. No single university had more than 70 verified accounts among its followers. However, when we look at their contribution in Figure 4, as a percentage of the combined reach achieved by all followers of each alumni association, these select users accounted for as little as 1.6% of the institution's total potential reach (i.e., followers of my followers) for the Virginia Military Institute (@vmialumni) and as much as 95.8% for Longwood University (@acaptainforlife).
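
A back-of-the-envelope version of this "followers of my followers" calculation is sketched below; it assumes each follower record carries the followers_count and verified fields found in Twitter API user objects.

def verified_reach_percentage(followers):
    # followers: list of dicts with 'verified' (bool) and 'followers_count'
    # (int), as found in Twitter API user objects.
    total_reach = sum(f["followers_count"] for f in followers)
    if total_reach == 0:
        return 0.0
    verified_reach = sum(f["followers_count"] for f in followers if f["verified"])
    return 100.0 * verified_reach / total_reach

# Example: one verified follower with a large audience dominates the reach.
sample = [
    {"verified": True, "followers_count": 500000},
    {"verified": False, "followers_count": 300},
    {"verified": False, "followers_count": 150},
]
print(round(verified_reach_percentage(sample), 1))  # 99.9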

Figure 4 - Potential Reach Percentage of Verified Accounts

Alumni Sentiment


Finally, we examined how each follower described themselves in the description (i.e., bio) portion of their Twitter profile by extracting the top 200 most frequently occurring terms for each alumni association. A word cloud for the alumni of each university is shown in Figure 5. When we further isolated the descriptions to the ten most frequently occurring words, we observed a common pattern among all alumni followers. In addition to the official institution name or some derivative of it (e.g., JMU, NSU, Tech), we find the terms love, life, and some intimate description of the follower as a mom, husband, student, father, or alumni. If the university has an athletic department, we also found mention of sports and, in the case of our two Christian universities, Liberty and Regent, the terms God, Jesus, and Christ were prevalent. In 22 of 23 institutions, the alumni primarily described themselves using these personal terms. Conversely, the alumni followers at only one institution, the University of Richmond (@urspidernetwork), described themselves in a more business-like or academic manner, with more frequent mention of the words PhD, career, and job.
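
A minimal sketch of the term-frequency step behind these word clouds follows; the stop-word list is illustrative only, and the tokenization is deliberately crude.

import re
from collections import Counter

# Illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "of", "a", "to", "in", "for", "at", "on", "i"}

def top_terms(bios, n=200):
    # bios: iterable of follower description (bio) strings.
    counts = Counter()
    for bio in bios:
        tokens = re.findall(r"[a-z']+", bio.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

bios = ["JMU alum, mom, and lover of life", "Proud ODU alumni. Husband. Father."]
print(top_terms(bios, n=5))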



Figure 5 - Word Clouds of Twitter Follower Descriptions



-- Corren McCoy

2015-11-28: Two WS-DL Classes Offered for Spring 2016

https://xkcd.com/1319/

Two WS-DL classes are offered for Spring 2016:

Information Visualization is being offered both online (CRNs 29183 (HR), 29184 (VA), 29185 (US)) and on-campus (CRN 25511).  Web Science is being offered for the first time with the 432/532 numbers (CRNs 27556 and 27557, respectively), but the class will be similar to the Fall 2014 offering as 495/595.

--Michael

2015-12-08: Evaluating the Temporal Coherence of Composite Mementos

When an archived web page is viewed using the Wayback Machine, the archival datetime is easy to determine from the URI and the Wayback Machine's display.  The archival datetime of embedded resources (images, CSS, etc.) is another story.  And what stories their archival datetimes can tell.  These stories are the topic of my recent research and Hypertext 2015 publication.  This post introduces composite mementos and the evaluation of their temporal (in-)coherence, and provides an overview of my research results.

     

    What is a composite memento?

     

A Memento is an archived copy of a web resource (RFC 7089).  The datetime when the copy was archived is called its Memento-Datetime.  A composite memento is a root resource, such as an HTML web page, and all of the embedded resources (images, CSS, etc.) required for a complete presentation.  Composite mementos can be thought of as a tree structure: the root resource embeds other resources, which may themselves embed resources, and so on.  The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT.  Or does it?
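
As a rough illustration of that tree structure, the sketch below collects the first level of embedded resource URIs from a root memento's HTML. It assumes the requests and BeautifulSoup libraries, ignores resources embedded deeper in the tree, and uses an illustrative URI-M constructed from the capture datetime mentioned above.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def embedded_resources(urim):
    # Return the URIs of resources embedded directly in the root memento.
    html = requests.get(urim).text
    soup = BeautifulSoup(html, "html.parser")
    uris = set()
    # Images, stylesheets, scripts, and frames are typical embedded resources.
    for tag, attr in (("img", "src"), ("link", "href"), ("script", "src"),
                      ("frame", "src"), ("iframe", "src")):
        for element in soup.find_all(tag):
            if element.get(attr):
                uris.add(urljoin(urim, element[attr]))
    return uris

# Illustrative URI-M constructed from the capture datetime mentioned above.
urim = "http://web.archive.org/web/20050514013608/http://www.cs.odu.edu/"
print(len(embedded_resources(urim)))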


       

      Hints of Temporal Incoherence

       

Consider the following weather report that was captured 2004-12-09 19:09:26 GMT.  The Memento-Datetime can be found in the URI, and the December 9, 2004 capture date is clearly visible near the upper right.  Look closely at the description of Current Conditions and the radar image.  How can there be no clouds on the radar when the current conditions are light drizzle?  Something is wrong here.  We have encountered temporal incoherence.  This particular incoherence is caused by inherent delays in the capture process used by Heritrix and other crawler-based web archives.  In this case, the radar image was captured much later (9 months!) than the web page itself.  However, there is no indication of this condition.



       

      A Framework for Evaluating Temporal Coherence


      In order to study temporal coherence of composite mementos, a framework was needed.  The framework details a series of patterns describing the relationships between root and embedded mementos and four coherence states.  The four states and sample patterns are described below.  The technical report describing the framework is available on arXiv.

       

      Prima Facie Coherent

      An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured.  The figure below illustrates the most common case.  Here the embedded memento was captured after the root but modified before the root.  The importance of Last-Modified is discussed in my previous post on the importance of header replay.


       

      Possibly Coherent

      An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured before the root.


       

      Probably Violative

      An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.


      Prima Facie Violative

An embedded memento is prima facie violative when evidence shows that it did not exist in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root and was also modified after the root.
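
Under the framework, the four example patterns above can be decided from three datetimes: the root's Memento-Datetime, the embedded memento's Memento-Datetime, and the embedded memento's Last-Modified datetime (which may be unknown). The sketch below is a simplified rendering of those patterns, not the full framework from the technical report.

from datetime import datetime

def coherence_state(root_mdt, embedded_mdt, embedded_last_modified=None):
    # Simplified classification of an embedded memento's coherence state.
    # All arguments are datetimes; embedded_last_modified may be None.
    if embedded_mdt <= root_mdt:
        # Captured before (or with) the root: it might have existed as archived.
        return "possibly coherent"
    if embedded_last_modified is None:
        # Captured after the root with no Last-Modified evidence.
        return "probably violative"
    if embedded_last_modified <= root_mdt:
        # Captured after the root but not modified since before the root.
        return "prima facie coherent"
    # Captured after the root and modified after the root.
    return "prima facie violative"

# Illustration using the weather report example: the radar image was captured
# roughly nine months after the root and carries no Last-Modified evidence.
root = datetime(2004, 12, 9, 19, 9, 26)
radar = datetime(2005, 9, 9, 0, 0, 0)
print(coherence_state(root, radar))  # probably violative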

       

       

       

      Only One in Five Archived Web Pages Existed as Presented


      Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 were available in a web archive.  Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.

      Single and multiple archives: Composite mementos were recomposed from single and multiple archives. For single archives, all embedded mementos were selected from the same archive as the root. For multiple archives, embedded mementos were selected from any of the 15 archives included in the study.

      Heuristics:  The minimum distance (or nearest) heuristic selects between multiple captures for the same URI by choosing the memento with the Memento-Datetime nearest to the root's Memento-Datetime, and can be either before or after the root's. The bracket heuristic also takes Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.
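
A sketch of the two heuristics is shown below; each candidate memento is represented by its Memento-Datetime and an optional Last-Modified datetime, and the bracket heuristic falls back to minimum distance when no candidate brackets the root. This is an illustration of the selection logic, not the experiment's code.

from datetime import datetime

def minimum_distance(candidates, root_mdt):
    # Choose the memento whose Memento-Datetime is nearest the root's,
    # whether before or after it. candidates: list of (mdt, last_modified).
    return min(candidates, key=lambda c: abs(c[0] - root_mdt))

def bracket(candidates, root_mdt):
    # Prefer a memento whose Last-Modified and Memento-Datetime bracket the
    # root's Memento-Datetime; otherwise fall back to minimum distance.
    bracketing = [c for c in candidates
                  if c[1] is not None and c[1] <= root_mdt <= c[0]]
    if bracketing:
        return minimum_distance(bracketing, root_mdt)
    return minimum_distance(candidates, root_mdt)

root = datetime(2005, 5, 14, 1, 36, 8)
candidates = [(datetime(2005, 5, 1), None),
              (datetime(2005, 6, 2), datetime(2005, 4, 30))]
print(bracket(candidates, root))  # the second candidate brackets the root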

      We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).

      The paper can be downloaded from the ACM Digital Library or from my local copy.  The slides from the Hypertext'15 talk follow.




      One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.

      -- Scott G. Ainsworth

      2015-12-22: 60% of Web Annotations are Orphaned or in Danger of Being Orphaned

      Figure 1. An Annotation is defined by OAC
       as a set of connected resources  
In our TPDL paper, we studied 6,281 highlighted text annotations (out of 7,744 annotations) available in the Hypothes.is annotation system in January 2015. The main goal was to investigate the prevalence of orphaned annotations, where neither a live Web page nor an archived copy of the web page contains the text that had previously been annotated.

Recently, we applied the same analysis as in our TPDL paper to a larger number of annotations.  Figure 2 illustrates that the number of annotations in Hypothes.is has been increasing since July 2013. Our TPDL paper focused on the 7,744 annotations available in January 2015.  Our updated paper (available at arXiv.org) analyzed the 20,133 highlighted text annotations (out of 33,946 total annotations) available in August 2015.  In this post, I will focus on reporting results of our arXiv paper.
      Figure 2. January 2015 - dataset used in TPDL paper
      August 2015 - dataset used in arXiv version  

Based on my experience analyzing web annotations in Hypothes.is, I have seen annotations created just for the purpose of testing the system to see how it works (e.g., some annotations in Hypothes.is contain the tag "test"). Although some annotations can be considered not beneficial, the majority of annotations are valuable to the community in different ways. For example, 9 out of the 10 most annotated websites in Hypothes.is are related to education, academic research, or publishing.

The Hypothes.is annotation system offers free accounts allowing users to annotate the Web by, for example, adding tags or notes to highlighted text or to a web page as a whole. Hypothes.is supports collaborative work by letting users reply to each other's comments as shown in Figure 3.

      Figure 3. Annotating the Web Using Hypothes.is Annotation System

      It is known that web pages are not fixed resources, and they might be changed or become unavailable at any time. These changes in webpages can affect the associated annotations. Figure 4 shows the target URI http://climatefeedback.org/ as it appeared in December 2014. The highlighted text “Scientific feedback for Climate Change information online” in the webpage was annotated with “After reading about your project at MIT news, I visited your page and ...”. In August 2015, this annotation can no longer be attached to the target web page because the highlighted text no longer appears on the page, as shown in Figure 5. Although the live Web version of http://climatefeedback.org/ has changed and the annotation was in danger of being orphaned, the original version that was annotated has been archived and is available at the Internet Archive. The annotation could be re-attached to this archived resource, or memento.

      Figure 4. http://climatefeedback.org/ in December 2014 
      Figure 5. http://climatefeedback.org/ in August 2015
      Because web pages are changing, the status of annotations is also affected. We can classify web annotations into 4 categories based on the attachment to their target live web pages and to mementos:
      • Safe - The annotation can be attached to the target live web page and also to at least one memento. 
      • In Danger - The annotation can be attached to the target live web page but it is not attached to any mementos. In this case, if the live web page is changed such that the associated annotations become unattached, then these annotations, unfortunately, would become orphaned.
      • Re-attached - The annotation is no longer attached to the live web page but, fortunately, it can be reattached to at least one memento from public web archives. 
      • Orphaned - The annotation is neither attached to the live web page nor any mementos.

Safe and re-attached annotations can be recovered with web archives, so they are in a better situation than the other two categories. We want to make annotations in the second category (in danger) safe or re-attached by archiving their target web pages. Obviously, we can do nothing about annotations in the orphaned category; they are lost.
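
The four categories reduce to two boolean tests per annotation: does the highlighted text still attach to the live page, and does it attach to at least one memento? A minimal sketch of that classification follows.

def annotation_status(attaches_to_live, attaches_to_memento):
    # Classify an annotation from two boolean attachment tests.
    if attaches_to_live and attaches_to_memento:
        return "safe"
    if attaches_to_live:
        return "in danger"      # archiving the page now would make it safe
    if attaches_to_memento:
        return "re-attached"    # recoverable from a public web archive
    return "orphaned"           # neither the live page nor any memento matches

print(annotation_status(attaches_to_live=False, attaches_to_memento=True))
# re-attached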

We used the LANL Memento Aggregator to look for archived copies of web pages (mementos) in the public archives. To be more specific, we were looking for the mementos closest to each annotation's creation date. In the example shown in Figure 4, we would need to find the closest mementos captured immediately before and after the annotation creation date (e.g., December 3, 2014 at 12:47 AM for the web page http://climatefeedback.org).

      Figure 6(a) shows an example where mementos are available before and after the annotation creation date. In this example, only M1 and M3 will be tested to see if the associated annotations can be re-attached to these mementos. Figure 6(b) shows mementos that are only available before the annotation creation date while Figure 6(c) shows mementos that are only available after the annotation date. Finally, Figure 6(d) shows annotations that have no existing mementos for their target web pages in the web archives.
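
A simplified sketch of this memento discovery step follows; it assumes the TimeMap has already been fetched (for example, from the LANL Memento Aggregator) and parsed into (Memento-Datetime, URI-M) pairs.

from datetime import datetime

def closest_mementos(timemap, annotation_created):
    # timemap: list of (memento_datetime, urim) pairs parsed from a TimeMap.
    # Returns (closest_before, closest_after); either may be None, matching
    # the four cases sketched in Figure 6.
    before = [m for m in timemap if m[0] <= annotation_created]
    after = [m for m in timemap if m[0] > annotation_created]
    closest_before = max(before, key=lambda m: m[0]) if before else None
    closest_after = min(after, key=lambda m: m[0]) if after else None
    return closest_before, closest_after

created = datetime(2014, 12, 3, 0, 47)
timemap = [(datetime(2014, 11, 20), "urim-1"), (datetime(2015, 1, 15), "urim-2")]
print(closest_mementos(timemap, created))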

      Figure 6. Discovering Mementos for Annotations' Target Web Pages
After we discovered the mementos closest to the annotations' creation dates and checked whether the annotations are still attached to their live web pages and to mementos, we arrived at the results illustrated in Figure 7. It shows that 19% of annotations are orphaned while 41% are in danger of being orphaned. The remaining 40% of annotations are in an acceptable situation: 37% are considered safe while only 3% can be re-attached using archives. The results also indicate that if mementos are available for an annotation's target web page, there is a high chance that the annotation can be re-attached. In addition, a copy of the same memento can be available in different web archives.

      Figure 7. The Status of  Current Hypothes.is Annotations

As we can see, having 60% of annotations orphaned or in danger of being orphaned leads us to conclude that archiving web pages at the time of annotation is important to avoid orphaned annotations.

      -- Mohamed Aturban

      2015-12-24: CNI Fall 2015 Membership Meeting Trip Report

      The CNI Fall 2015 Membership Meeting was held in Washington, D.C., December 14-15, 2015.  Like all CNI meetings, the Fall 2015 meeting was excellent and contained many high quality presentations.  Unfortunately, the members' project briefings ran simultaneously, with 7 or 8 different presentations overlapping at any given time.  As a result I missed a great deal. 

      Cliff Lynch kicked off the meeting with reflections about public access to federally funded research (e.g., CRS R42983), interoperability (e.g., OAI-ORE, ORCIDs, IIIF), linked data (e.g., Wikipedia notability guidelines for biographies),  privacy & surveillance (e.g., eavesdropping Barbies, Ashley Madison data breach, RFC 7624), and understanding the personalization algorithms that go into presenting (and thus archiving) the view of the web that you experience (e.g., our 2013 D-Lib Magazine article about mobile vs. desktop & GeoIP), and much more.  I'm hesitant to try to further summarize his talk -- watching the video of his talk, as always, is time well spent. 

      In the next session Herbert and I presented "Achieving Meaningful Interoperability for Web-based Scholarship", which is basically a summary of our recent D-Lib Magazine paper "Reminiscing About 15 Years of Interoperability Efforts". 



      See also the excellent summary and commentary from David Rosenthal about the "signposting" proposal.

      The next session I split between "Linked Data for Libraries and Archives: LD4L and Europeana" (see the "Linked Data for Libraries" site) and "Is Gold Open Access Sustainable? Update from the UC Pay-It-Forward Project" (slides, video).  The final session of the day included several presentations I would have liked to have seen but didn't.  I understand "Documenting Ferguson: Building A Community Digital Repository" (slides) was good & standing room only. 

I missed the opening session on the second day (including the "Update on Funding Opportunities" presentation), but made it to David Rosenthal's presentation about emulation.  See the transcript of his talk, as well as his 2015 Emulation and Virtualization as Preservation Strategies report for the AMF.

Unfortunately, David's talk collided with that of Martin & his UCLA colleagues.  Fortunately, CNI has posted the video of their talk, his slides are online, and he has a great interactive site to explore the data.



After lunch I attended Rob's talk "The Future of Linked Data in Libraries: Assessing BibFrame Against Best Practices" (slides).  Rob even referenced my "no free kittens" slogan (tirade?) from our time developing OAI-ORE:




      The closing plenary was an excellent talk from Julie Brill, head of the Federal Trade Commission, entitled "Transparency, Trust, and Consumer Protection in a Complex World".  The transcript is worth reading, but the essence of the talk explores the role the FTC would (should?) play in making sure that consumers can be aware of the data that companies track about them and how that data is used to make decisions about the consumers. 

A mostly complete list of slides is available via the OSF.  CNI recorded many of the presentations and has begun uploading the videos to the CNI YouTube channel.  The CNI Spring 2016 Membership Meeting will be held in San Antonio, TX, April 4-5, 2016.

Given all the simultaneous sessions, your CNI experience was probably different from mine.  Check out these other CNI Fall 2015 trip reports: Dale Askey, Jaap Geraerts, and Tim Pyatt.

      --Michael

      2016-01-02: Review of WS-DL's 2015


      The Web Science and Digital Libraries Research Group had a terrific 2015, marked by four new student members, one Ph.D. defense, and two large research grants.  In many ways it was even better than 2014 and 2013.

We had fewer students graduate or advance their status this year, but last year was unusually productive. We added four new students, graduated a PhD student and an MS student, and had two other students advance their status:
      Hany's Defense Luncheon
      Hany's defense saw us continue the WS-DL tradition of the post-PhD luncheon.

      We had 16 publications in 2015, which was about the same as 2014 (15) but down from 2013's impressive 22 publications.  This year we had:
      Next year we won't have this kind of showing at JCDL 2016 because Michele is one of the program co-chairs:

      JCDL 2016 Chairs

      In addition to the JCDL, TPDL, and iPRES conferences listed above, we traveled to and presented at ten conferences, workshops, or professional meetings that do not have formal proceedings:
      We were also fortunate to host Michael Herzog for the spring 2015 semester:

      MLN, MCW, and Michael Herzog

      As well as Herbert Van de Sompel for an extended colloquium / planning visit:




      We also released (or updated) a number of software packages, services, and format definitions:
      • Alexander Nwala created: 
      • Sawood released:
  • CDXJ - a proposed serialization of CDX files (among other formats) in JSON format (based on his discussions with Ilya Kreymer).
        • MemGator - A Go-based Memento aggregator (used by Ilya in his excellent emulation service oldweb.today).
      • Shawn, working with LANL colleagues, released the py-memento-client Python library.
• Wes and Justin released "Mobile Mink", an Android Memento-enabled client.
      • Mat has continued to update the Mink Chrome extension (github, Chrome store). 
      Our coverage in the popular press continued:
      We were fortunate to receive two significant research grants this year, totaling nearly $1M:
      Thanks to all who made 2015 a great year!  We are looking forward to 2016!

      -- Michael


      2016-01-28: January 2016 Federal Cloud Computing Summit


As I have mentioned previously, I am the MITRE chair of the Federal Cloud Computing Summit. The Summits are designed to allow representatives from government agencies that would not necessarily cross paths to collaborate and learn from one another about the best practices, challenges, and recommendations for adopting emerging technologies in the federal government. The MITRE-ATARC Collaboration Symposium is a working group-style session in which academics and representatives from industry, government, and FFRDCs discuss potential solutions and ways forward for the top challenges of emerging technology adoption in government. MITRE helps select the challenge areas by polling government practitioners on their top challenges, and the participants break into groups to discuss each challenge area. The Collaboration Symposium allows this heterogeneous group of cloud practitioners to collaborate across all levels, from end users to researchers to practitioners to policy makers (at the officer level).





The Summit series includes mobile, Internet of Everything, big data, and cyber security summits along with the cloud summit, each of which occurs twice a year. MITRE produces a white paper that summarizes the MITRE-ATARC Collaboration Symposium. The white paper is shared with industry to communicate the top challenges and current needs of the federal government and guide product development; with academia to identify the skill sets needed by the government and to influence curricula and research topics; and with government to communicate best practices and the current challenges of peer agencies.

The Summit takes place in Washington, D.C. and is a full-day event. The day begins at 7:30 AM with registration and an industry trade show that allows industry representatives to communicate with government representatives about their challenges and the solutions that industry has to offer. At 9:00, a series of panel discussions by academic researchers and government representatives begins. This also allows audience members to ask questions of the top implementers of cloud computing in the government and academia.

At 1:15, after lunch, the MITRE-ATARC Collaboration Symposium begins and runs until 3:45. There is also a final out-briefing from each collaboration session at the end of the day to communicate the major findings from each session to the summit participants.

      Common threads from the summit included the importance of cloud security, the importance of incorporating other emerging technologies (e.g., mobile, big data, Internet of Things) in cloud computing, and how each emerging technology enables or enhances the others, and the importance of agile processes in cloud migration planning. More details on the outcomes will be included in the white paper, which should be released in 6-8 weeks. Prior white papers are available at the ATARC website.

The results of the Summit have implications for web archivists. With the increasing importance of and emphasis on mobile, IoT, and cloud services, particularly within the government, there is increased importance in archiving these representations and the use of this material. As Julie Brill mentioned in her CNI talk, the government is interested in understanding how these services and technologies are being used, regardless of whether or not there is a UI or other interface with which humans can interact.

      Archiving data endpoints from HTTP is comparatively trivial (although challenges still exist with archiving at high fidelity, particularly when considering JavaScript and deferred representations), but archiving a data service that might exchange data through non-HTTP or even push (as opposed to pull) transactions may change the paradigm used for web archiving.

With increased adoption, the archiving of representations reliant on, or designed to be consumed through, emerging technologies will continue to increase, highlighting a potential frontier in web archiving and digital preservation.


      --Justin F. Brunelle *

      * APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. CASE NUMBER 15-3250
      The authors’ affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the authors.

      2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives.  These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work and Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.

      We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex.  We document our findings in a technical report entitled:  "Rules of Acquisition for Mementos and Their Content".

      Our technical report briefly covers the following key points:
      • Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
      • Special techniques for dealing with JavaScript Redirects created by the Internet Archive
      • An alternative to BeautifulSoup for removing elements and extracting text from mementos
      • Stripping away archive-specific additions to memento content
      • An algorithm for dealing with inaccurate character encoding
      • Differences in whitespace treatment between archives for the same archived page
      • Control characters in HTML and their effect on DOM parsers
      • DOM-corruption in various HTML pages exacerbated by how the archives present the text stored within <noscript> elements
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and handling inaccurate character encoding.

      Acquisition of Content from WebCite


      WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work.  An example WebCite page is shown below.
      For acquiring most memento content, we utilized the cURL data transfer tool.  With this tool, one merely types the following command to save the contents of the URI http://www.example.com:

      curl -o outputfile.html http://www.example.com

      For WebCite, the output from cURL for a given URI-M results in the same HTML frameset content, regardless of which URI-M is used.  We sought to acquire the actual content of a given page for text extraction, so merely utilizing cURL was insufficient.  An example of this HTML is shown below.


Instead of relying on cURL, we analyzed the resulting HTML frameset and determined that the content is actually returned by a request to the mainframe.php file.  Unfortunately, merely issuing a request to the mainframe.php file is insufficient because the cookies sent to the browser indicate which memento should be displayed. We developed custom PhantomJS code, presented as Listing 1 in the technical report, for overcoming this issue.  PhantomJS, because it must acquire, parse, and process the content of a page, is much slower than merely using cURL.

      The requirement to utilize a web browser, rather than HTTP only, for the acquisition of web content is common for live web content, as detailed by Kelly and Brunelle, but we did not anticipate that we would need a browser simulation tool, such as PhantomJS, to acquire memento content.

In addition to the issue of acquiring mementos, we also discovered reliability problems with WebCite, seen in the figure below.  We would routinely need to reattempt downloads of the same URI-M in order to finally acquire its content.

Finally, we experienced rate limiting from WebCite, forcing us to divide our list of URI-Ms and download content from several source networks.

Because of these issues, the acquisition of almost 100,000 mementos from WebCite took more than a month to complete, compared to the acquisition of 1 million mementos from the Internet Archive in 2 weeks.

      Inaccurate Character Encoding


Extracting text from documents requires that the text be decoded properly for processes such as text similarity or topic analysis.  For a subset of mementos, some archives do not present the correct character set in the HTTP Content-Type header.  Even though most web sites now use the UTF-8 character set, a subset of our mementos come from a time before UTF-8 was widely adopted, so proper decoding becomes an issue.

      To address this issue, we developed a simple algorithm that attempts to detect and use the character encoding for a given document.

      1. Use the character set from the HTTP Content-Type header, if present; otherwise try UTF-8.
      2. If a character encoding is discovered in the file contents, as is common for XHTML documents, then try to use that; otherwise try UTF-8.
      3. If any of the character sets encountered raise an error, raise our own error.

      We fall back to UTF-8 because it is an effective superset of many of the character sets for the mementos in our collection, such as ASCII. This algorithm worked for more than 99% of our dataset.
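
A loose Python rendering of the three steps follows; extract_declared_charset is a hypothetical helper standing in for whatever meta/XML-declaration parsing is actually used, and the final error is our own rather than a library exception.

import re

def extract_declared_charset(raw_bytes):
    # Hypothetical helper: look for a charset declared in the content itself
    # (e.g., an XML declaration or a <meta> element); None if not found.
    match = re.search(rb"charset=[\"']?([A-Za-z0-9_-]+)", raw_bytes[:2048])
    return match.group(1).decode("ascii") if match else None

def decode_memento(raw_bytes, content_type_charset=None):
    # Try the Content-Type charset, then any charset declared in the content,
    # then UTF-8; if every candidate fails, raise our own error.
    for charset in (content_type_charset,
                    extract_declared_charset(raw_bytes),
                    "utf-8"):
        if not charset:
            continue
        try:
            return raw_bytes.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    raise ValueError("unable to decode memento content")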

In the future, we intend to explore the use of confidence-based tools, such as the chardet library, to guess the character set when extracting text.  The use of such tools takes more time than merely using the Content-Type header, but is necessary when that header is unreliable and algorithms such as ours fail.

      Summary


      We were able to overcome most of the memento acquisition and text extraction issues encountered in our experiment.  Because we were unaware of the problems we would encounter, we felt that it would be useful to detail our solutions for others to assist them in their own research and engineering.

      --
      Shawn M. Jones
      PhD Student, Old Dominion University
      Graduate Research Assistant, Los Alamos National Laboratory
      - and -
      Harihar Shankar
      Research & Development Engineer, Los Alamos National Laboratory

      2016-03-07: Custom Missions in the COVE Tool

      When I am not studying Web Sciences at ODU, I work as a software developer at Analytical Mechanics Associates. In general, my work there aims to make satellite data more accessible. As part of this mission, one of my primary projects is the COVE tool.

      The COVE tool allows a user to view where a satellite could potentially take an image. The above image shows the ground swath of both Landsat 7 (red) and Landsat 8 (green) over a one day period. 
The CEOS Visualization Environment (COVE) tool is a browser-based system that leverages Cesium, an open-source JavaScript library for 3D globes and maps, in order to display satellite sensor coverage areas and identify coincidence scene locations. In other words, the COVE tool allows the user to see where a satellite could potentially take an image and where two or more satellite paths overlap during a specified time period. The Committee on Earth Observation Satellites (CEOS) is currently operating and planning hundreds of Earth observation satellites.  COVE initially began as a way to improve standard Calibration and Validation (Cal/Val) exercises for these satellites. Cal/Val exercises need to compare near-simultaneous surface observations and identify corresponding image pairs in order to calibrate and validate the satellite's orbit. These tasks are time-consuming and labor-intensive. The COVE tool has been pivotal in making these Cal/Val exercises much easier and more efficient.

      The COVE tool allows a user to see possible coincidences of two satellites. The above image shows the coincidences of ALOS-2 with Landsat 7 over a one week period.
In the past, the COVE tool only allowed this analysis to be done on historical, operational, or notional satellite missions with known orbit data, which COVE could then use to predict the propagation of the orbit accurately, within the bounds of the model's assumptions, for up to three (3) months past the last-known orbit data. This has proven extremely useful for missions whose orbit data is known; however, it was limited to those missions.

      Mission planning is another task which includes the prediction of satellite orbits, a task the COVE tool was well equipped for. However, in mission planning exercises, the orbit data of the satellite is unknown. Based on this need, we wanted to extend COVE to include customized missions, in which the user could define the orbit parameters and the COVE tool would then predict the orbit of the customized mission through a numerical propagation. I had the opportunity to be the lead developer for this new feature, which recently went live and can be accessed through the Custom Missions tab on the right of the COVE tool, as shown in the video below. This is an important addition to the COVE tool, as it allows for better planning of potential future missions and will hopefully help to improve satellite coverage of Earth in the future.



      Video Summary:
      00:07:04 - The "Custom" Missions and Instruments tab shows a list of the current user's custom missions. Currently, we do not have any custom missions.
00:09:03 - To create a custom mission, choose "Custom Missions" on the right panel. First, we need to "Add Mission." Once we have a mission, we can add additional instruments to the mission or delete the mission.
      00:20:15 - After choosing a mission name, we need to decide if we want to use an existing mission's orbit or define a custom orbit. We want to create a custom orbit. Clicking on "Custom defined orbit" gives three more options. A circular orbit is the most basic and for the novice user. A repeating sun synchronous orbit is a subset of circular orbits that must cover each area around the same time. For example, if the satellite passes over Hampton, VA at 10:00 AM, its next pass over Hampton should also be at 10:00 AM. The advanced orbit is for the experienced user and allows full control over the orbital parameters. We will create a repeating sun synchronous orbit, similar to Landsat 8.
      00:33:14 - When creating a repeating sun synchronous orbit, the altitude given is only an estimate as only certain inclination/altitude pairs are able to repeat. Thus, the user has the option to calculate the inclination and altitude that will be used.
      00:37:24 - The instrument and mode, along with the altitude of the orbit we just defined, determine the swath size of the potential images the satellite will be able to take.
00:49:23 - We need to define the "Field of View" and "Pointing Angle" of the instrument. We will also choose "Daylight only," so our custom mission will only take images during daylight hours. This is useful because many optical satellites, such as Landsat 8, are "Daylight only" since they cannot take good optical images at night.
      01:02:06 - We will now choose a date range over which we will propagate the orbit to see what our satellite's path will look like.
      01:21:18 - We can now see what path our satellite will take during the daylight hours, since we chose "Daylight only."

      This project was only possible thanks to other key AMA associates involved, namely Shaun Deacon--project lead and aerospace engineer, Andrew Cherry--developer and ODU graduate, and Jesse Harrison--developer.

      --Kayla

      2016-03-07: Archives Unleashed Web Archive Hackathon Trip Report (#hackarchives)

      The Thomas Fisher Rare Book Library (University of Toronto)
Between March 3 and March 5, 2016, librarians, archivists, historians, computer scientists, and others came together for the Archives Unleashed Web Archive Hackathon at the University of Toronto Robarts Library, Toronto, Ontario, Canada. This event gave researchers the opportunity to collaboratively develop open-source tools for web archives. The event was organized by Ian Milligan (assistant professor of Canadian and digital history in the Department of History at the University of Waterloo), Nathalie Casemajor (assistant professor in communication studies in the Department of Social Sciences at the University of Québec in Outaouais (Canada)), Jimmy Lin (the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo), Matthew Weber (assistant professor in the School of Communication and Information at Rutgers University), and Nicholas Worby (the Government Information & Statistics Librarian at the University of Toronto's Robarts Library).

Additionally, the event was made possible by the support of the Social Sciences and Humanities Research Council of Canada, the National Science Foundation, the University of Waterloo, the University of Toronto, Rutgers University, the University of Québec in Outaouais, the Internet Archive, Library and Archives Canada, and Compute Canada. Sawood Alam, Mat Kelly, and I joined researchers from Europe and North America to exchange ideas in an effort to unleash our web archives. The event was split across three days.

      DAY 1, THURSDAY MARCH 3, 2016

Ian Milligan kicked off the presentations with the agenda. Following this, he presented his current research effort:

HistoryCrawling with Warcbase (Ian Milligan, Jimmy Lin)

The presenters introduced Warcbase as a platform for exploring the past. Warcbase is an open-source tool used to manage web archives, built on Hadoop and HBase. Warcbase was introduced through two case studies and datasets, namely Canadian Political Parties and Political Interest Groups (2005 - 2015) and GeoCities.



      Put Hacks to Work: Archives in Research (Matthew Weber)

      Following Ian Milligan's presentation, Matthew Weber emphasized some important ideas to guide the development of tools for web archives, such as considering the audience.




Archive Research Services Workshop (Jefferson Bailey, Vinay Goel)

Following Matthew Weber's presentation, Jefferson Bailey and Vinay Goel presented a comprehensive introductory workshop for researchers, developers, and general users. The workshop addressed data mining and computational tools and methods for working with web archives.




      Embedded Metadata as Mobile Micro Archives (Nathalie Casemajor)

Following Jefferson Bailey and Vinay Goel's presentation, Nathalie Casemajor presented her research on tracking the evolution of images shared on the web. She talked about how embedded metadata in images helps track their dissemination across the web.





      Revitalization of the Web Archiving Program at LAC (Tom Smyth)

Following Nathalie Casemajor's presentation, Tom Smyth of Library and Archives Canada presented their archiving activities, such as domain crawls of federal sites, curation of thematic research collections, and preservation archiving of resources at risk. He also talked about their recent collections, such as Federal Election 2015, First World War Commemoration, and the Truth and Reconciliation collections.

After the first five short presentations, Jimmy Lin gave a technical tutorial on Warcbase. Helge Holzmann then presented ArchiveSpark, a framework built to make accessing web archives easier for researchers, enabling easy data extraction and derivation.


      After a short break, there were five more presentations targeting Web Archiving and Textual Analysis Tools:

      WordFish (Federico Nanni)

Federico Nanni presented Wordfish: an R program used to extract political positions from text documents. Wordfish is a scaling technique that does not need any anchoring documents to perform the analysis, relying instead on a statistical model of word frequencies.


      MemGator (Sawood Alam)

Following Federico Nanni's presentation, Sawood Alam presented a tool he developed called MemGator: a Memento aggregator CLI and server written in Go. Memento is a framework that adds the time dimension to the web; a timestamped copy of the representation of a resource is also called a memento, and a list/collection of such mementos is called a TimeMap. MemGator can generate the TimeMap of a given URI or provide the closest memento to a given time.
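As a rough illustration (not from the presentation), a running MemGator server could be queried for a TimeMap along these lines; the base URL, port, and the /timemap/link/ route used below are assumptions about a local deployment:

# A minimal sketch, assuming the requests package and a MemGator server
# listening locally; the port and route pattern are assumptions.
import requests

MEMGATOR = "http://localhost:1208"                 # assumed local MemGator server
uri = "http://example.com/"
resp = requests.get("{0}/timemap/link/{1}".format(MEMGATOR, uri))
if resp.ok:
    print(resp.text)                               # TimeMap of mementos across archives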



      Topic Words in Context (Jonathan Armoza)

Following Sawood Alam's presentation, Jonathan Armoza presented a tool he developed, TWIC (Topic Words in Context), by demonstrating LDA topic modeling of Emily Dickinson's poetry. TWIC provides a hierarchical visualization of LDA topic models generated by the MALLET topic modeler.
Following Jonathan Armoza's presentation, Nick Ruest presented Twarc: a Python command-line tool and library for archiving tweet JSON data. Twarc runs in three modes: search, filter stream, and hydrate.
Following Nick Ruest's presentation, I presented Carbon Date: a tool originally developed by Hany SalahEldeen, which I currently maintain. Carbon Date estimates the creation date of a website by polling multiple sources for datetime evidence, and it returns a JSON response containing the estimated creation date.
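As a hedged sketch only (not the tool's documented interface), a locally running Carbon Date service could be queried like this; the host, port, and route below are hypothetical placeholders:

# A hedged sketch: the host, port, and path are hypothetical placeholders,
# not Carbon Date's documented API.
import requests

CARBON_DATE = "http://localhost:8888/cd"           # hypothetical local endpoint
resp = requests.get(CARBON_DATE, params={"url": "http://example.com/"})
print(resp.json())                                 # JSON with the estimated creation date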
After the five short presentations about web archiving and textual analysis tools, all participants engaged in a brainstorming session in which ideas were discussed and clusters of researchers with common interests were iteratively formed. The brainstorming session led to the formation of seven groups, namely:
      1. I know words and images
      2. Searching, mining, everything
      3. Interplanetary WayBack
      4. Surveillance of First Nations
      5. Nuage
      6. Graph‐X‐Graphics
      7. Tracking Discourse in Social Media



      Following the brainstorming and group formation activity, all participants were received at the Bedford Academy for a reception that went on through the late evening. 


DAY 2, FRIDAY MARCH 4, 2016



      The second day of the Archives Unleashed Web Archive Hackathon began with breakfast, after which the groups formed on Day 1 met for about three hours to begin working on the ideas discussed the previous day. At noon, lunch was provided as more presentations took place:
Evan Light began the series of presentations by talking about the Snowden Archive-in-a-Box, which he created. The box features a stand-alone wifi network and web server that allows researchers to use the files leaked by Edward Snowden (and subsequently published by the media). The box, which serves as a portable archive, protects users from mass surveillance.

      Mediacat (Alejandro Paz and Kim Pham)

Following Evan Light's presentation, Alejandro Paz and Kim Pham presented Mediacat: an open-source web crawler and archive application suite that enables ethnographic research into how digital news is disseminated and used across the web.

      Data Mining the Canadian Media Public Sphere (Sylvain Rocheleau)

Following Alejandro Paz and Kim Pham's presentation, Sylvain Rocheleau talked about his research efforts to provide near-real-time data mining of Canadian news media. His research involves mass crawls of about 700 Canadian news websites at 15-minute intervals and data mining processes that include named entity recognition.

      Tweet Analysis with Warcbase (Jimmy Lin)

Following Sylvain Rocheleau's presentation, Jimmy Lin gave another tutorial in which he showed how to extract information from tweets using the Warcbase platform.

A five-hour hackathon session followed, briefly suspended for a visit to the Thomas Fisher Rare Book Library.
After the visit, the hackathon session continued until the evening, after which all participants went for dinner at the University of Toronto Faculty Club.

DAY 3, SATURDAY MARCH 5, 2016



The third and final day of the Archives Unleashed Web Archive Hackathon began in a similar fashion to the second: first breakfast, then a three-hour hackathon session, and then presentations over lunch:

      Malach Collection (Petra Galuscakova)
      Waku (Kyle Parry)
      Digital Arts and Humanities Initiatives at UH Mānoa (or how to do interesting things with few resources) (Richard Rath)

After the presentations, the hackathon session continued until 4:30 pm EST; thereafter, the group presentations began:

      PRESENTATIONS

      I know words and images (Kyle Parry, Niel Chah, Emily Maemura, and Kim Pham)

Searching, mining, everything (Jaspreet Singh, Helge Holzmann, and Vinay Goel)

      Interplanetary WayBack (Sawood Alam and Mat Kelly)

      "Who will archive the archives?"

To answer this question, Sawood Alam and Mat Kelly presented the archiving and replay system called Interplanetary Wayback (ipwb). In a nutshell, during the indexing process ipwb consumes WARC files one record at a time, splits each record into headers and payload, pushes the two pieces into the IPFS (a peer-to-peer file system) network for persistent storage, and stores the references (digests) in a file format called CDXJ along with some other lookup keys and metadata. For replay, it finds the records in the index file and builds the response by assembling the headers and payload retrieved from the IPFS network and performing the necessary rewrites. The major benefits of this system include deduplication, redundancy, and shared open access.
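The following is a minimal sketch of that indexing idea, not the ipwb implementation itself; it assumes the warcio and ipfshttpclient Python packages, a running local IPFS daemon, and a simplified CDXJ layout:

# A sketch of the indexing flow described above -- not the ipwb code itself.
# Assumes the warcio and ipfshttpclient packages and a local IPFS daemon.
import json
import ipfshttpclient
from warcio.archiveiterator import ArchiveIterator

client = ipfshttpclient.connect()                  # talk to the local IPFS daemon

with open('example.warc.gz', 'rb') as warc, open('index.cdxj', 'w') as cdxj:
    for record in ArchiveIterator(warc):
        if record.rec_type != 'response':
            continue
        uri = record.rec_headers.get_header('WARC-Target-URI')
        dt = record.rec_headers.get_header('WARC-Date')
        headers = record.http_headers.to_str().encode('utf-8')
        payload = record.content_stream().read()
        # Push headers and payload into IPFS separately; keep the digests.
        header_digest = client.add_bytes(headers)
        payload_digest = client.add_bytes(payload)
        # One simplified CDXJ line per record: key, datetime, JSON metadata.
        cdxj.write('{0} {1} {2}\n'.format(
            uri, dt, json.dumps({'locator': [header_digest, payload_digest]})))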

      Surveillance of First Nations (Evan Light, Katherine Cook, Todd Suomela, and Richard Rath)

Nuage (Petra Galuscakova, Neha Gupta, Rosa Iris R. Rovira, Nathalie Casemajor, Sylvain Rocheleau, Ryan Deschamps, and Ruqin Ren)

      Graph‐X‐Graphics (Jeremy Wiebe, Eric Oosenbrug, and Shane Martin)

Tracking Discourse in Social Media (Tom Smyth, Allison Hegel, Alexander Nwala, Patrick Egan, Nick Ruest, Yu Xu, Kelsey Utne, Jonathan Armoza, and Federico Nanni)

This team processed ~11.2 million tweets and ~50 million Reddit comments that referenced the Charlie Hebdo and Bataclan attacks, in an effort to track the evolution of social media commentary about the attacks. The team sought to measure the attention span, the flow of information and misinformation, and the co-occurrence network of terms in order to understand the dynamics of commentary about these events.

The votes were tallied; the Nuage team received the most votes and was declared the winner. The event concluded after some closing remarks.

      --Nwala

      2016-03-22: Language Detection: Where to start?


Language detection is not a simple task, and no method results in 100% accuracy. You can find different packages online to detect different languages. I have used several methods and tools to detect the language of either websites or text. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP language header and the HTML language tag. In addition, I reviewed several language detection packages: Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API. Since Python is my favorite coding language, I searched for tools written in Python.

I found that a primary way to detect the language of a webpage is to use the HTTP language header and the HTML language tag. However, only a small percentage of pages include them, and sometimes the detected language is affected by the browser settings. Guess-Language and Python-Language Detector are fast, but they are more accurate with more text, and you have to strip the HTML tags before passing the text to the tools. LangID is a tool that detects language and gives you a confidence score; it is fast, works well with short texts, and is easy to install and use. The Google Language Detection API is also a powerful tool with clients for different programming languages; it also provides a confidence score, but you need to sign up, and if your dataset is large (more than 5000 requests a day, 1 MB/day), you must choose a paid plan.
      HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. Content-Language tells you what languages are present on the requested page. The value is a two- or three-letter language code (such as 'fr' for French), sometimes followed by a country code (such as 'fr-CA' for French as spoken in Canada).

      For example:

      curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"

      Content-Language: da-DK,da-DK

      In this example the webpage's language is Danish (Denmark).
      In some cases you will find some sites offering content in multiple languages, and the Content-Language header only specifies one of the languages.
      For example:

      curl -I  --silent http://www.hotelrenania.it/ |grep -i "Content-Language"

      Content-Language: it

In this example, the webpage offers three languages in the browser (Italian, English, and Dutch), yet it states only Italian as its Content-Language. Note that the Content-Language does not always match the language displayed in your browser, because the displayed language depends on the browser's language preference, which you can change.

      For example:

      curl -I --silent https://www.debian.org/ |grep -i "Content-Language"

      Content-Language: en

This webpage offers its content in more than 37 different languages. Here I had my browser's language preference set to Arabic, and the Content-Language found was English.
In addition, in most cases the Content-Language header is not included at all. From a random sample of 10,000 English websites in DMOZ, I found that only 5.09% have the Content-Language header.

      For example:

      curl -I --silent http://www.odu.edu |grep -i "Content-Language"


      In this example we see that the Content-Language header was not found.
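These checks can also be scripted. Here is a minimal Python sketch (assuming the requests package) that mirrors the curl examples above:

# A minimal sketch, assuming the requests package: issue a HEAD request and
# read the Content-Language header, which may be absent.
import requests

for url in ['http://bagergade-bogb.dk/', 'https://www.debian.org/', 'http://www.odu.edu']:
    resp = requests.head(url, allow_redirects=True)
    print(url, resp.headers.get('Content-Language'))   # None if the header is missing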

      HTML Language:
Another indication of the language of a web page is the HTML language tag (such as <html lang="en">…</html>). Using this method will require you to save the HTML code first and then search for the HTML lang attribute.

      For example:

curl --silent http://ksu.edu.sa/ > ksu.txt

      grep "<html lang=" ksu.txt

<html lang="ar" dir="rtl" class="no-js">

      However, I found from a random sample of 10,000 English websites in DMOZ directory that only 48.6% have the HTML language tag.
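Instead of grepping the saved HTML, the attribute can be read programmatically. Here is a small sketch assuming the requests and beautifulsoup4 packages:

# A small sketch, assuming the requests and beautifulsoup4 packages: fetch the
# page and read the lang attribute of the <html> element, if present.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://ksu.edu.sa/').text, 'html.parser')
html_tag = soup.find('html')
if html_tag is not None and html_tag.has_attr('lang'):
    print(html_tag['lang'])                        # e.g., 'ar' in the example above
else:
    print('No HTML lang attribute found')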

      Guess-Language:
One tool to detect language in Python is Guess-Language. It detects the language of Unicode text and covers over 60 languages. However, two important notes: 1) this tool works better with more text, and 2) don't include HTML tags in the text or the result will be flawed. So if you want to check the language of a webpage, I recommend filtering out the tags using the Beautiful Soup package before passing the text to the tool (see the sketch at the end of this section).

      For example:

      curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'

      Tesla Secondary Simulation Project

      python
      from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
      guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

'en'

This example shows detecting the title language of a randomly selected English webpage from DMOZ, http://abelian.org/tssp/. The Guess-Language package detects the language of the title as French, which is wrong; when we extract more text, the result is English. In order to determine the language of short texts, you need to install PyEnchant and additional dictionaries. By default it only supports three languages (English, French, and Esperanto), so you need to download any additional language dictionaries you may need.
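Putting the recommendation above into practice, here is a small sketch (assuming the requests, beautifulsoup4, and guess_language packages) that strips the markup before guessing the language:

# A small sketch: strip the HTML tags first, then guess the language of the
# remaining text; more text generally yields a more reliable guess.
import requests
from bs4 import BeautifulSoup
from guess_language import guessLanguage

html = requests.get('http://abelian.org/tssp/').text
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
print(guessLanguage(text))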

      Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.

      For example:

      python
      import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)

'en'

Here, we also notice that the length of the text affects the result. When the text was short, we falsely got French as the language; when we added more text from the webpage, the correct answer appeared.

In another example, we check the title of a Korean webpage selected randomly from the DMOZ Korean webpage directory.

      For example:

      curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'

      EBN 물류&조선 뉴스

      python
      import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선 뉴스",300,300)

'ko'

Here the correct answer, Korean, showed up even though some English letters were in the title.

      LangID:
Another tool is LangID. This tool detects 97 different languages. As output, it states a confidence score for the probability prediction; the scores are re-normalized to produce an output in the 0-1 range. This is one of my favorite language detection tools because it is fast, handles short texts, and gives you a confidence score.

      For example:

      python
      import langid
langid.classify("Tesla Secondary Simulation Project")

('en', 0.9916567142572572)

      python
      import langid
      langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

('en', 1.0)

Using the same text as above, this tool identified the short text correctly with a confidence score of 0.99, and when the full text was provided, the confidence score was 1.0.


      For example:

      python
      import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")

('ar', 0.9999999797315073)

Testing another language, such as this Arabic phrase, it returned a confidence score of 0.99 for Arabic.

      Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool, and I think it is one of the strongest tools I found. Clients are available for different programming languages: Ruby, Java, Python, PHP, Crystal, and C#. To use this tool, you have to obtain an API key after creating an account and signing up. Each language test results in three outputs: isReliable (true/false), confidence (a rate), and language (a language code). The tool's website mentions that the confidence rate is not bounded to a range and can be higher than 100; no further explanation of how this score is calculated is given. The API allows 5000 free requests a day (1 MB/day). If you need more than that, there are different paid plans you can sign up for. You can also detect text language in an online demo. I recommend this tool if you have a small dataset, but it needs time to set up and to figure out how it runs.

      For example:

curl --silent http://moheet.com/ | grep "title" | sed -e 's/<[^>]*>//g' > moheet.txt

      python
import detectlanguage
detectlanguage.configuration.api_key = "Your key"
text = open("moheet.txt", "r").read()
detectlanguage.detect(text)
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]

In this example, I extract text from an Arabic webpage in the DMOZ Arabic directory. The tool detected its language as Arabic with isReliable set to True and a confidence of 7.73. Note that you have to remove newlines from the text; otherwise the tool considers it a batch detection and gives you a result for each line.

      In Conclusion:
So before you start looking for the right tool, you have to determine a few things first:
• Are you trying to detect the language of a webpage or of some text?
• What is the length of the text? Usually more text gives more accurate results (see this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083).
• What language do you expect to find (if it is known or expected)? Certain tools only cover certain languages.
• What programming language do you want to use?

Here is a short summary of the language detection methods I reviewed:

Method | Advantage | Disadvantage
HTTP Content-Language header and HTML language tag | can state the language directly | not always found and sometimes affected by browser settings
Guess-Language | fast, easy to use | works better on longer text
Python-Language Detector | fast, easy to use | works better on longer text
LangID | fast, gives you a confidence score, works on both long and short text | (none noted)
Google Language Detection API | gives you a confidence score, works on both long and short text | needs creating an account and setting up


      --Lulwah M. Alkwai

      2016-04-05: CNI Spring 2016 Trip Report

      The CNI Spring 2016 Members Meeting was held in San Antonio, TX, April 4-5, 2016.  As usual, the presentations were excellent but with six or more simultaneous sessions you are forced to make hard choices about what to catch up on.

This year Martin Halbert and Katherine Skinner arranged the "Digital Preservation of Federal Information Summit", convening 30+ people to discuss "...the topic of preservation and access to at-risk digital government information."  It was quite the collaborative exercise, and I know Martin produced some summary slides that I will link here when they are posted.  There were only a few presentations (done in Pecha Kucha format) for this Summit, and I was fortunate enough to give one for Herbert and me entitled "Why We Need Multiple Archives".  The answer is probably pretty obvious for the crowd that Martin assembled, but we often run into people who don't understand the role of archives beyond that of the (obviously excellent) Internet Archive.




Victoria Stodden gave the opening keynote, "Defining the Scholarly Record for Computational Research", in which she talked about the "Reproducible Research Standard", ResearchCompendia.org, and computational infrastructure within the context of legal and social norms.  CNI will eventually put the videos up; in the meantime I would encourage you to see her SC15 talk that touches on similar themes.

The next session I attended was Jason Varghese (NYPL) presenting "Microservices Architecture: Building Scalable (Library) Software Solutions." He's clearly doing cool stuff; I would have appreciated a more detailed discussion of the APIs they've implemented, but I guess that can be found at http://api.repo.nypl.org/.

      The next session was "Scaling Maker Spaces Across the Web: Weaving Maker Space Communities Together to Support Distributed, Networked Collaboration in Knowledge Creation", by Rick Luce and Carl Grant, both at Oklahoma University.  They talked about their experiences setting up a makerspace (complete with 3D printing and VR capabilities) in the library, both a small satellite for their on campus library (the "edge") and their much larger facility in the research park two miles away (the "hub").  I urge you to peruse the links -- this was truly impressive stuff & Rick consistently does exciting things with libraries.

I skipped the final session of the day in order to get my slides for Tuesday morning arranged.  I had originally thought I had a 30 minute slot, but in reality I had 15 minutes, so many slides needed tossing.  There was an evening reception, and then we had dinner at one of the many restaurants on the famed River Walk.

      Tuesday began with split sessions, and I was in the session that Martin Halbert arranged, "National Web Archiving Programs in the U.S.", along with Jefferson Bailey and Mark Phillips.  Jefferson gave a brief overview of the "Systems Interoperability and Collaborative Development for Web Archiving" project, Mark reviewed End of Term (EOT) web archiving, and Martin recapped the "Digital Preservation of Federal Information Summit" from the previous days.  I presented a brief status about our work using Storytelling interfaces for summarizing collections in Archive-It:




      Unfortunately, with the simultaneous sessions I had to miss "DBpedia Archive using Memento, Triple Pattern Fragments, and HDT", presented by Herbert Van de Sompel and Miel Vander Sande.

The next session I attended was about organization identifiers, and featured Geoffrey Bilder (Crossref), Patricia Cruse (DataCite), and (via FaceTime) Laure Haak (ORCID).  They are in the early stages of collaborating on org IDs, and while I learned a lot, I would have appreciated a more thorough review of existing org ID efforts and how they fall short of their goals.  They did share their requirements document and invited contributions.  "Challenges Presented by Organizational IDs" by Karen Smith-Yoshimura (OCLC), from CNI Spring 2015, provides some of the background that I did not have.

      The after lunch session that I attended on Tuesday was "Rebuilding the Getty Provenance Index as Linked Data".  I knew almost nothing about the art world going into this, so now I know more about the linked data challenges of porting Getty's legacy databases that await Rob Sanderson when he joins Getty later this month.

The closing keynote, "Activist Stewardship: The Imperative of Risk in Collecting Cultural Heritage", was handled by a trio from UCLA: Todd Grappone, Elizabeth McAulay, and Heather Briston (who was pinch-hitting for Sharon Farb).  They presented the Digital Ephemera Project and, in general, the role of archivists in collecting materials that can get the library or the contributor in trouble.  Some examples included the internal and external pressures surrounding UCLA's Scientology collection and concerns from contributors regarding the Green Movement collection.  Cliff Lynch gave a good introduction to this session and promised a wrap-up for it as well, but the session ran a bit long and that did not happen.  Rather than try to further summarize, I'll link the video when it comes out.  I did appreciate that Memento got a mention in the presentation regarding finding archived images embedded in tweets that had otherwise been deleted!


      If you want a mostly different path through the various simultaneous sessions, I encourage you to read Dale Askey's excellent conference notes.

      I'll update this post as additional slides and videos are uploaded.  Thanks to everyone @ CNI for yet another excellent meeting!

      --Michael



      2016-04-15: How I learned not to work full-time and get a PhD

ODU's commencement on May 7th marks the last day of my academic career as a student. I began my career at ODU in the Fall of 2004 and graduated with my BS in CS in the Spring of 2008, at which point I immediately began my Master's work under Dr. Levinstein. I completed my MS in Spring 2010, spent the summer with June Wright (now June Brunelle), and started my Ph.D. under Dr. Nelson in the Fall of 2010 (which is referred to as the Great Bait-and-Switch in our family). I will finish in the Spring of 2016, only to return as an adjunct instructor teaching CS418/518 at ODU in the Fall of 2016.


On February 5th, I defended my dissertation, "Scripts in a Frame: A Framework for Archiving Deferred Representations" (above picture courtesy of Dr. Danette Allen, video courtesy of Mat Kelly). My research in the WS-DL group focused on understanding, measuring, and mitigating the impacts of client-side technologies like JavaScript on the archives. In short, we showed that JavaScript causes missing embedded resources in mementos, leading to lower quality mementos (according to web user assessment). We designed a framework that uses headless browsing in combination with archival crawling tools to mitigate the detrimental impact of JavaScript. This framework crawls more slowly but more thoroughly than Heritrix and results in higher quality mementos. Further, if the framework interacts with the representations (e.g., clicks buttons, scrolls, mouses over), we add even more embedded resources to our crawl frontier, 92% of which are not archived.


      Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations from Justin Brunelle

      En route to these findings, we demonstrated the impact of JavaScript on mementos with our now-[in]famous CNN Presidential Debate example, defined the terms deferred representations to refer to representations dependent upon JavaScript to load embedded resources, descendants to refer to client-side states reached through the execution of client-side events, and published papers and articles on our findings (including Best Student Paper at DL2014 and Best Poster at JCDL2015).


At the end of a WS-DLer's academic tenure, it is customary to provide lessons learned, recommendations, and recaps of the academic experience useful to future WS-DLers and grad students. Rather than recap the work that we have documented in published papers, I will echo some of my advice and lessons learned for what it takes to be a successful Ph.D. student.

      Primarily, I learned that working while pursuing a Ph.D. is a bad idea. I worked at The MITRE Corporation throughout my doctoral studies. It took a massive amount of discipline, a massive amount of sacrifice (from myself, friends, and family), a forfeiture of any and all free time and sleep, and a near-lethal amount of coffee. Unless a student's "day job" aligns or overlaps significantly with her doctoral studies (I got close, but no cigar), I strongly recommend against doing this.

      I learned that a robust support system (family, friends, advisor, etc.) is essential to being a successful graduate student. I am lucky that June is patient and tolerant of my late nights and irritability during paper season, my family supported my sacrifices and picked up the proverbial slack when I was at conferences or working late, and that Dr. Nelson dedicates an exceptional portion of his time to his students. (Did I say that just like you scripted, Dr. Nelson?) I learned to challenge myself and ignore the impostor syndrome.

      I learned that a Ph.D. is life-consuming, demanding of 110% of a student's attention, and hard -- despite evidence to the contrary (i.e., they let me graduate) -- they don't give these things away. I also learned about what real, capital-R "Research" involves, how to do it, and the impact that it has. This is a lesson that I am applying to my day job and current endeavors.

      I learned to network. While I don't subscribe to the adage "It's not what you know, it's who you know", I will say that knowing people makes things much easier, more valuable, more impactful, and essential to success. However, if you don't know the "what", knowing the "who" is useless.

      I learned that not all Ford muscle cars are Mustangs (even though they are clearly the best), that it's best to root for VT athletics (or at least pretend), that I am terrible at commas, and that giving your advisors homebrew with your in-review paper submissions certainly can't hurt; the best collaborations and brainstorming sessions often happen outside of the office and over a cup of coffee or a pint of beer.

      Finally, I learned that finishing my Ph.D. before my son arrived was one of the best things I've done -- even if mostly by luck and divine intervention. I have thoroughly enjoyed spending the energy previously dedicated to staying up late, writing papers, and pounding my head against my keyboard to spending time with June, Brayden, and my family.

      Despite these hard lessons and a difficult ~5 years, pursuing a doctorate has been a great experience and well worth the hard work. I look forward to continued involvement with the WS-DL group, ODU, my dissertation committee, and sharing my many lessons learned with future students.


      --Dr. Justin F. Brunelle

      2016-04-17: A Summary of "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia"


Authors Nattiya Kanhabua, Ngoc Tu Nguyen, and Claudia Niederée from L3S published the following study at JCDL 2014. In the process of reviewing possible topics for my PhD research, I share my analysis of their findings. The full citation and presentation for the paper are below.


      Kanhabua, N., Nguyen, T. N., & Niederee, C. (2014, September). What triggers human remembering of events?: a large-scale analysis of catalysts for collective memory in Wikipedia. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 341-350). IEEE Press.





The article centers on identifying patterns that trigger recollection of events in collective memory. Since the number of categorical events is limitless, the authors focus on natural and man-made disasters, accidents, and terrorism. Their analysis confirms that two of the most notable characteristics across all events are time and location. While the two in conjunction are not consistent metrics for identifying triggers of event recollection, each is useful on its own. In addition, the study also confirms that semantics found in different types of events, like level of impact and damage cost, further help trigger remembrance of specific memories.

For their analysis, the authors use the English Wikipedia as the location of collective memory, which is built by an online community. It is important to note that this memory is dynamic in nature, changes over time, and is constructed through agreed-upon social influence. Essentially, the goal here is to extract patterns and characteristics of a particular memory and use them to identify how it can be triggered in recall. Note that, aside from characteristic analysis, we can identify the most popular memories by category, observe community division over topics, or even observe the edit wars centered around controversial topics.

      To get a better understanding of the underlying collection, the authors parse view logs of different events documented on Wikipedia. This allows them to visually interpret and categorize them. Figure 1 below shows how such a log can be used alongside a temporal attribute.

(Peaks signify an increase in resource views.)

      By observing the chart above, we can conclude that within some timespan, peaks are created as resource views dramatically increase. Thus, they become the driving factor behind correlating documents to temporal and categorical events. Take for example a document explaining a hurricane event in 2015 being viewed dramatically in 2016.

By itself, a peak is not a complete solution for identifying memory recollection, as there is nothing to compare it to. The proposed solution here is a remembrance score, which analyzes how likely peaks are to be memory catalysts of past events. In other words, it is a comparison between multiple peaks to see if relationships exist. This score is divided into three parts: the cross-correlation coefficient (CCF), the sum of squared errors (SSE), and kurtosis. These parts are all centered around time and peak analysis, and compare how likely it is that we remember one event by experiencing another. CCF is used as a means of understanding the similarity between two time series in a volume; it is a simple representation of how different events relate during particular time frames. SSE pushes further by measuring how unplanned a particular peak is within a time frame, promoting surprise detection; this helps us understand if one peak potentially triggered the other. Lastly, kurtosis is applied to the remembering score to account for the shape of the peaks; it considers the underlying distribution over time and answers the question: is the peak a constant phenomenon or a heavily influenced variable of change?
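To make the ingredients concrete, here is a rough sketch (assuming numpy and scipy; it is an illustration, not the paper's exact formulation) of the three components computed on two daily page-view series:

# A rough sketch of the three ingredients, assuming numpy and scipy; this is an
# illustration, not the paper's exact remembering-score formulation.
import numpy as np
from scipy.stats import kurtosis

views_a = np.array([120, 130, 5200, 900, 300, 250, 240], dtype=float)  # peaked series
views_b = np.array([100, 110, 4100, 800, 280, 230, 220], dtype=float)  # candidate trigger

ccf = np.corrcoef(views_a, views_b)[0, 1]       # CCF-style similarity of the two series
sse = np.sum((views_a - views_a.mean()) ** 2)   # SSE-style term: sharp peaks dominate
kurt = kurtosis(views_a)                        # peakedness of the distribution over time

print(ccf, sse, kurt)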

      (Table 1 shows the test data of events used from Wikipedia. Do note, italicized events are excluded from the experiment, as there were too few results for significant evaluation.)

      While this score is a good approach in understanding triggers for all events, the authors propose an analysis of common features to identify relationship development between similar events. This includes temporal similarity, or the time when the events occurred, and location similarity, where they occurred. Lastly, they also observe the impact of an event and how likely this event is to remain a continuous memory. Examples of impacts include: cost incurred due to event occurrence, affected regions, fatalities, etc.

      In Figure 5, location is a key observable similarity between hurricane events, whereas time is much more inconsistent.

      In Figure 10, time and location both play a significant role in identifying terrorist events. The conjunction of these attributes is much more evident here as opposed to hurricane events shown in Figure 5.


      In Figure 11, high impact events comprise between 25% and 50% of the top 10 triggered events. The percentage expands to 75% when considering the top 20.

By observing the charts above, we can conclude several things from the proposed study. First, location and time are key contributors when identifying which events cause remembrance of others, though their influence varies across the different types of events. Next, according to the results retrieved, contextual information also plays a very large role in determining relationships: the impact of events and semantic similarity can significantly boost or diminish the triggered recollection of the collective memories we have stored. Lastly, the computed remembrance scores are a good step towards identifying which peaks relate. While they can be tuned to score better for particular events, they must also remain generic enough for broad use.

It is clear that the study explored here has a great motive and even more interesting findings. However, it comes with two key limitations. First, the authors analyze human remembering of events against the English Wikipedia. While this could be helpful for a language-specific study, it could carry a large cultural bias compared to versions in other languages; it might also sway the focus toward events centered on regions relating to an English-based context. The other limitation is that the authors simply assume that the occurrence of one event triggers a recall from collective memory. While this can apply in many cases, the assumption does not consider that new events could trigger research into prior events, as opposed to remembrance.

Applying this in your research:

• Significant insight into forecasted and understood user recollection enables targeted event triggering. When users are searching for particular events, we could recommend other events that they might be interested in, within particular bounds of similarity.
• In contrast to exploring new data, we could also help the user recall what they have forgotten from the past.
--Slobodan Milanko

        2016-04-19: IIPC General Assembly 2016 Trip Report




        The 2016 IIPC General Assembly and the separate-but-related IIPC Web Archiving Conference 2016 were held in Reykjavík, Iceland, April 11-15, with the former being open to IIPC members only and the latter open to the public.  Unfortunately, my trip report will be incomplete since I had to leave midday on Wednesday.  The first day was primarily given to IIPC business: introducing the new officers, covering project status, budgets, new bylaws, etc.   Jason gave a brief overview of our IIPC-funded Web Archive Profiling Via Sampling Project, which is now coming to a close.  In addition to the resources and deliverables linked from the IIPC project page, Sawood Alam has developed the MemGator Memento Aggregator and the CDXJ format for serializing CDX files in json.  We welcome feedback on both.  I'd also like to repeat our request for web archiving logs so we can better model request patterns.

We had an introduction and Q&A from the Steering Committee members that worked well (I believe this was the first time this format had been used).  The day closed with updates from Alex Thurman & Abbie Grotke about the collaborative collections, Sara Aubry about the proposed WARC 1.1 format, and Andy Jackson on "Building Tools to Archive the Modern Web".

Unfortunately, Day 2 began with dual and triple tracks, so one was forced to make hard decisions about what to attend when they're all good.  I began in the session with Andy Jackson covering "Building Better Tools, Together", in which he covered the benefits of open source development.  The following session had David Rosenthal, Nicholas Taylor, and Jefferson Bailey covering the IMLS-funded web archiving API project.  The result of the session was a Google doc that contained the essence of the discussion.

        After lunch, I presented in the session "Harvesting Tools", with Jefferson Bailey and Youssef Eldakar.  Jefferson gave a preview of brozzler, a crawling package that combines real chrome browsers with warcprox for capturing all resources.  Youssef gave a demo of visualizing Heritrix crawls.  My talk closed the session and was based on Justin's work on crawling deferred representations and descendants (see the iPres 2015 paper and 2016 tech report for more information about these concepts, as well as Justin's PhD summary post). 




        The final session was by Martin Klein, Andrea Goethals, and Stephen Abrams on their plans for a submission to IMLS for nominating and coordinating seed URIs for crawls.

        Wednesday began the IIPC Web Archiving Conference, and it kicked off with a keynote from Iceland's own Hjálmar Gíslason, most recently at DataMarket.  He started off the keynote by defining the progression of "big data":


        Drawing from his current position and previous positions, he made a number of interesting observations regarding what is worth archiving.  Although "hoarding isn't a strategy", we frequently don't know in advance what will be valuable (e.g., the NY Times 1927 article that said "commercial use in doubt" regarding television).  His slides aren't posted yet, but hopefully soon.

After that was a joint presentation from Vint Cerf and Rick Witt from Google, which is now an IIPC member (!).  Vint rightly noted that the IIPC crowd didn't need the usual background material he typically provides (cf. DSHR's and my reaction to his 2015 AAAS talk).  Rick focused on potential roles for Google in the IIPC and web archiving in general:


        Vint was only able to be there for part of the day on Wednesday, but Rick was there the whole time.  Rick was careful to stress that Google was there to learn and assess, not to try to steer or dominate the community.  However, it is fair to say that the IIPC members that I spoke to were all very excited about Google's recognition of web archiving, even if no specific strategy or plan is adopted.  The Q&A after their presentation was quite lively and could have gone on much longer.  Brewster Kahle then moderated a panel about web archiving from the perspective of National Libraries with: Helen Hockx-Yu (IA, formerly of the British Library), Steve Knight (New Zealand), and Paul Koerbin (Australia). 

I had to leave after lunch, so I missed the remainder of the conference.  Rounding out Wednesday were David Rosenthal's "The Architecture of Emulation on the Web", Ilya Kreymer & Dragan Espenschied presenting oldweb.today (netcapsule on GitHub), Thomas Liebetraut talking about emulation (bwFLA), and Matthew S. Weber and Ian Milligan talking about their hackathons (Canada in March, US in June). Brewster concluded the day with a keynote, "20 Years of Web Archiving – What Do We Do Now?"  He previewed a really cool experimental interface for the Wayback Machine:





        I won't even try to summarize Thursday's sessions, and Friday consisted of a couple of different workshops.  The Twitter hashtags were #IIPCGA2016 and #IIPCWAC2016, respectively. Ed Summers has a nice page summarizing all the tweets for both events.  Kristinn Sigurðsson, who did a great job organizing the event, has a summary blog post for the event, and Peter Webster has a nice reflection piece about "What do we need to know about the archived web?" based on what he learned at IIPC.  I'll add more posts about the event as I discover them. 

As always, the IIPC meeting was excellent -- I highly encourage you to attend if you are at all interested in web archiving.  Next year's IIPC General Assembly and Web Archiving Conference will be in Lisbon, Portugal, in late March.

        --Michael


        2016-04-24: WWW 2016 Trip Report



I was fortunate to present a poster at the 25th International World Wide Web Conference, held April 11-15, 2016. Though my primary mission was to represent both the WS-DL and the LANL Prototyping Group, I gained a better appreciation for the state of the art of the World Wide Web.  The conference was held in Montréal, Canada, at the Palais des congrès de Montréal.



        SAVE-SD 2016


I began the conference at the SAVE-SD workshop, focusing on the semantics, analytics, and visualization of scholarly data.  They had 6 full research papers, 2 position papers, and 2 poster papers; the acceptance rate for this workshop is relatively high.  The workshop was kicked off by Alejandra Gonzalez-Beltran and Francesco Osborne. They encouraged the use of Research Articles in Simplified HTML.

Alex Wade gave us an introduction to the Microsoft Academic Service (MAS) and a sneak peek at the new features offered by Microsoft Academic, such as the Microsoft Academic Graph. They are in the process of adding semantic, rather than keyword, search with the intention of understanding academic users' intent when searching for papers. They have opened up their dataset to the community and provide APIs for future community research projects.
Angelo Salatino presented "Detection of Embryonic Research Topics by Analysing Semantic Topic Networks". The study investigated the discovery of "embryonic" (i.e., emerging) topics by testing for more than 2000 topics in more than 3 million publications. The goal is to determine if we can recognize trends in research while they are happening, rather than years later. They were able to show the features of embryonic topics, and their next step is to automate their detection.

        Bahar Sateli presented "Semantic User Profiles: Learning Scholars’ Competences by Analyzing their Publications". The goal of this study is to mitigate the information overload associated with semantic publishing. They found that it is feasible to semantically model a user's writing history. By modeling the user, better search ranking of document results can be provided for academic researchers. It can also be used to allow researchers to find others with similar interests for the purposes of collaboration.

        Francesco Ronzano presented "Knowledge Extraction and Modeling from Scientific Publications" where they propose a platform to turn data from scientific publications into RDF datasets, using the Dr. Inventor Text Mining Framework Java library.  They also generate several example interactive web visualizations of the data. In the future, they seek to improve the Text Mining Framework.

        Joakim Philipson presented "Citation functions for knowledge export - a question of relevance, or, can CiTO do the trick?".  He explored the use of the CiTO ontology in order to understand knowledge export - "the transfer of knowledge from one discipline to another as documented by cross-disciplinary citations". Unfortunately, he found that CiTO is not specific enough to capture all of the information needed to understand this.

        Sahar Vahdati presented "Semantic Publishing Challenge: Bootstrapping a Value Chain for Scientific Data". The study discussed "the use of Semantic Web technologies to make scholarly publications and data easier to discover, browse, and interact with".  Its goal is to use many different sources to produce linked open datasets about scholarly publications with the intent of improving scholarly communication, especially in the areas of searching and collaboration.  Their next step is to start building services on the data they have produced.

        Vidas Daudaravicious presented "A framework for keyphrase extraction from scientific journals".  His framework is able to use keyphrases to define topics that can differentiate journals. Using these keyphrases, one can improve search results by comparing journals to queries, allowing users to find articles of a similar nature. It also has the benefit of noting trends in research, such as when journal topics shift. Researchers can also use the framework to identify the best journals for paper submission.

        Ujwal Gadirju presented "Analysing Structured Scholarly Data Embedded in Web Pages". They analyzed the use of microdata, microformats, and RDF used as bibliographic metadata embedded in scholarly documents with the intent of building knowledge graphs. They found that the distribution of data across providers, domains, and topics was uneven, with few providers actually providing any embedded data. They also found that Computer Science and Life Science documents were more apt to contain this metadata than other disciplines, but also admitted that their Common Crawl dataset may have been skewed in this direction. In the future, they are planning a targeted crawl with further analysis.
        Shown below are participants enjoying the SAVE-SD 2016 Poster session. On the left below, Bahar Sateli presented "From Papers to Triples: An Open Source Workflow for Semantic Publishing Experiments". She showed how one could convert natural language academic papers into linked data, which could then be used to provide more specific search results for scholars. For example, the workflow allows a scholarly user to search a corpus for all contributions made in a specific topic.

        On the right below, Kata Gábor demonstrated "A Typology of Semantic Relations Dedicated to Scientific Literature Analysis". Her poster shows a model for extracting facts about the state of the art for a particular research field using semantic relations derived from pattern mining and natural language processing techniques.


        And shown to the left Erwin Marsi discussed his poster, "Text mining of related events from natural science literature". His study had the goal of producing aggregate facts on the concepts from articles within a corpus.  For example, it aggregates the fact that there is an increase in algae based on the text from many papers that had research results finding an increase in algae. The idea is to find trends in research papers through natural language processing.

        In closing, the SAVE-SD 2016 workshop mentioned that selected papers could be resubmitted to PeeRJ.

        TempWeb 2016


        On Tuesday I attended the 6th Temporal Web Analytics Workshop, where I learned about current studies using and analyzing the temporal nature of the web. I spoke to a few of the participants about our work on Memento, and they educated me as to the new work being done.

        The morning opened with a Keynote by Wolfgang Nejdl of the Alexandria Project.  Wolfgang Nejdl discussed the work at L3S and how they were trying to consider all aspects of the web, from the technical to its effects on community and society. He discussed how social media has become a powerful force, but tweets and posts link to items that can disappear, losing the context of the original post.  This reminded me of some other work I had seen in the past. He mentioned how important it was to archive these items.
He then went on to cover other aspects of searching the archived web, detailing challenges encountered by project BUDDAH, including the problem of ranking temporal search results. Seen below, he demonstrates an alternative way of visualizing temporal search results using the HistDiv project. This visualization helps in understanding the changing nature of a topic.  In this case, we see how searching for the term Rudolph Giuliani changes with time: as the person's career (and career aspirations) changes, so does the content of the archived pages about them. He closed by discussing the use of curated archiving collections from Archive-It in the collaborative search and sharing platform ArchiveWeb, which allows one to find archive collections pertinent to their search query.
        The workshop presentations started with two different investigations into ways of creating and performing calculations on temporal graphs.  On the right, Julia Stoyanovich presents "Towards a distributed infrastructure for evolving graph analytics".  She details Portal, a query language for temporal graphs, allowing one to easily query and calculate metrics such as PageRank for a temporal graph, given a specific interval.

Matthias Steinbauer presented "DynamoGraph: A Distributed System for Large-scale, Temporal Graph Processing, its Implementation and First Observations".  DynamoGraph is a system that also allows one to query and calculate metrics on temporal graphs.
        Both researchers used the following lunch to discuss temporal graphs at length.  I wondered if one could model TimeMaps in this way and use these tools to discover interesting connections between archived web pages.
Mohsen Shahriari discussed "Predictive Analysis of Temporal and Overlapping Community Structures in Social Media".  He went into detail on the evolution of communities, represented by graphs, detailing how they can grow, shrink, merge, split, or dissolve entirely.  Using datasets from Facebook, DBLP citations, and Enron emails, his experiments showed that smaller communities have a higher chance of survival, and his model had a high success rate in predicting whether a community would survive.
        Aécio Santos presented "A First Study on Temporal Dynamics on the Web".  He used topical web page classifiers in a focused crawling experiment to analyze how often web pages about certain topics changed.  Pages from his two topics, ebola and movies, changed at different rates. Pages on ebola were more volatile, losing and gaining links, mostly due to changing news stories on the topic, whereas movies pages were more stable, with authors only augmenting their contents. He did find that, in spite of this volatility, pages did tend to stay on topic over time. The goal is to ensure that crawlers are informed by differences in topics and adjust their strategies accordingly.
        Jannik Strötgen presented "Temponym Tagging: Temporal Scopes for Textual Phrases".  He discussed the discovery and use of temponyms to understand the temporal nature of text.  Using temponyms, machines can determine the time period that a text covers. He explained the issues with finding exact temporal intervals or times for web page topics, seeing as many pages are vague. His temponym project, HeidelTime, has been tested on the WikiWars corpus and the YAGO semantic web system.  He also presented further information on this topic, later in WWW 2016.

        We then shifted into using temporal analysis for security.  Staffan Truvé from Recorded Future presented "Temporal Analytics for Predictive Cyber Threat Intelligence". His company specializes in using social media and other web sources to detect potential protests, uprisings, and cyberattacks.  He indicated that protests and hacktivism are often talked about online before they happen, allowing authorities time to respond.

In closing, Omar Alonso from Microsoft presented "Time to ship: some examples from the real-world". He highlighted some of the ways in which the carousel at the top of Bing is populated, using topic virality on social media as one of the many inputs. He talked about the concept of social signatures, derived from all of the social media posts referring to the same link; using this text, they are able to further determine the aboutness of a given link, helping further with search results.  He switched to other topics that help with search, such as connecting place and time. Search results for points of interest (POI) in a given location in effect try to match people looking for things to do (queries) with social media posts, check-ins, and reviews for a given POI.  He concluded by saying that there is much work to be done, such as allowing POI results for a given time period, e.g., "things to do in Montréal at night".

        Keynotes


        Sir Tim Berners-Lee



        Sir Tim Berners-Lee spoke of the importance of decentralizing the web, ensuring that users own their own data, web security, work to standardize and improve the ease of payments on the web, and finally the Internet of Things (IoT).
        Mentioning the efforts of projects like Solid, he highlighted the need to ensure that users retain their data to ensure their privacy. The idea is that a user can tell the service where to store their data and then they have ownership and responsibility over that data.
He mentioned that, in the past, the Internet had to be deployed by sending tapes through the mail, but now we are heading to a point where the web platform, because it allows you to deploy a full computing platform very quickly, may become the rollout platform of the future. Because of this ability, security is becoming more and more important, and he wants to focus on a standard for security that uses the browser, rather than external systems, as the central point for asking a user for their credentials, thereby helping guard against trojans and malicious web sites. He said that the move from HTTP to HTTPS has been less easy than expected, considering that many HTTPS pages are "mixed", containing references to HTTP URIs.  This results in three different worlds: plain HTTP pages, pure HTTPS pages, and pages that use upgrade-insecure-requests, which are still mixed but in a way endorsed by the author.
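For concreteness, the upgrade mechanism he described works roughly as follows: a page opts in with a Content-Security-Policy directive, and supporting browsers advertise their willingness to upgrade subresource requests with a request header (the host and path below are illustrative only):

GET /mixed-page.html HTTP/1.1
Host: www.example.org
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html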
        Next, he spoke about making web payments standardized, comparing it to authentication. There are a wide variety of different solutions for web payments and there needs to be a standard interface. There is also an increasing call to allow customers to pay smaller amounts than before, which many current systems do not handle. Of course, customers will need to know when they are being phished, hence the security implications of a standardized system.
        Finally, he covered the Internet of Things (IoT), indicating there are connections to data ownership, privacy, and security.
        In the following Q&A session, I asked Sir Tim Berners-Lee about the steps toward browser adoption for technologies such as Memento.  He said the first step is to discuss them at conferences like WWW, then engage in working groups, workshops, and other venues.  He noted that one also needs to define the users for such new technologies so they can help with the engagement.
        Later, during the student Q&A session the following day, Morgannis Graham from McGill University asked Sir Tim Berners-Lee about his thoughts on the role of web archives.  He replied that "personally, I am a pack rat and am always concerned about losing things". He highlighted that while the general web users are thinking of the present, it is the role of libraries and universities to think about the future, hence their role in archiving the web.  He stated that universities and libraries should work more closely together in archiving the web so that if one university falls, others exist having the archives of the one that was lost. He also stated that we all have a role in ensuring that legislation exists to protect archiving efforts.  Finally, he tied his answer back to one of his current projects: what happens to your data when the site you have given it to goes out of business.

        Lady Martha Lane-Fox


        Wednesday evening ended with an inspiring talk from Lady Martha Lane-Fox.  She works for the UK in a variety of roles advancing the use of technology in society.  She states that a country that can: (1) improve gender balance in tech, (2) improve the technical skills of the populace, and (3) improve the ability to use tech in the public sector, will be the most competitive.


        She went further in explaining how the current gender balance is very depressing, noting that in spite of the freedom offered by technology, old hierarchies and structures have been re-established. She indicated that there are studies showing that companies with more diverse boards are more successful, and how we need to tackle this problem, not only from a technical, but also a social perspective.
        She discussed the challenges of bringing technology to everyday lives and applauded South Korea's success while highlighting the challenges still present in the UK. She relayed stories of encounters with the citizenry, some of whom were reluctant to embrace the web, but after doing so felt they had more freedom and capability in their lives than ever before. She praised the UK for putting coding on the school curriculum and looking toward the needs of future generations.
        She then talked about re-imagining public services entirely through the use of technology. The idea is to make government agencies digital by default in an effort to save money and provide more capability. She highlighted a project where a UK hospital once had 700 administrators and 17 nurses, and, through adopting technology, were able to then take the same money and hire 700 nurses to work with 17 administrators, thus providing better service to patients.
        She closed by discussing her program DotEveryone, which is a new organization promoting the promise of the Internet in the UK for everyone and by everyone. Her goal is for the UK to be the most connected, most digitally literate, and most gender equivalent nation on earth. In a larger sense, she wants to kick off a race among countries to use technology to create the best countries for their citizens.

        Mary Ellen Zurko


Wednesday morning started with a keynote by Mary Ellen Zurko, from Cisco. She discussed security on the web. Her first lesson: "The future will be different; so will the attacks and attackers, but only if you are wildly successful". Her point was that the success of the web has made it a target. She then covered the history of basic authentication, S-HTTP, and finally SSL/TLS in HTTPS.
She then discussed the social side of security, indicating that users are often confused about how to respond to web browser warnings about security. There is a 90% ignore rate on such warnings, and 60% of those are related to certificates. She highlighted how difficult it is for users to know whether or not a domain is legitimate and whether the certificate shown is valid. She also highlighted that most users, even expert users, do not fully understand the permissions they are granting when asked, due to the cryptic and sometimes misleading descriptions given to them, mentioning that 17% of Android users actually pay attention to permissions during installation and only 3% are able to answer questions on what the security permissions mean.


Reiterating the results of a study by Google, she stated that 70% of users clicked through malware warnings in Chrome, while Firefox users heeded them more often. The study found that the Firefox warnings provided a better user experience, and thus users were more apt to pay attention to and understand them. Following this study, Google changed its warnings in Chrome.
She said that the open web is an equal opportunity environment for both attackers and defenders, detailing how fraudulent tech support scams are quite lucrative. This was discovered in recent work by Cisco, "Reverse Social Engineering Social Tech Support Scammers", where Cisco engineers actively bluffed tech support scammers in order to gather information on their whereabouts and identities.
        Of note, she also mentioned that there is a largely unexploited partnership between web science and security.

        Peter Norvig


On Friday morning, Peter Norvig gave an engaging speech on the state of the Semantic Web. He mentioned that his job is to bring information retrieval and distributed systems together. He went through a history of information retrieval, discussing WAIS and the World Wide Web, as well as Archie. Before Google, several efforts were already trying to tame the nascent web.
After Google, the Semantic Web was developed as a way to extract information from the many pages that existed. He talked about how Tim Berners-Lee was a proponent, whereas Cory Doctorow highlighted that there was nothing but obstacles in its path. Peter said that Cory had several reasons for why it would fail, but the main ones were (1) people lie, (2) people are lazy, and (3) people are stupid, indicating that the information gathered from such a system would consist of intentional misinformation, lack of complete information, or misinformation due to incompetence.
Peter then highlighted several instances where this came about. Initially, excellent expressiveness was produced by highly trained logicians, giving us DAML, OWL, RDFa, FOAF, etc. Unfortunately, they found a 40% page error rate in practice, indicating that Cory was correct on all 3 fronts. Peter's conclusion was that highly trained logicians did not seem to solve the identified problems.

        Peter then posited "what about a highly trained webmaster?". In 2010, search companies promoted the creation of schema.org with the idea of keeping it simple. The search engines promised that if a site were marked up, then they would show it immediately in search results. This gave users an incentive to mark up their pages and now has resulted in technologies that can better present things like hotel reservations and product information. This led most to conclude that schema.org was an unexpected success.
Peter closed by saying that obstacles still remain, seeing as most of the data comes from web site owners, still leading to misinformation in some cases. He talked about the need to be able to connect different sources together so that one can, for example, not only find a book on Amazon, but also a listing of the author's interests on Facebook. He hopes that neural networks could be combined with semantic and syntactic approaches to solve some of these large connection problems.

        W3C Track


Tzviya Siegman, from John Wiley & Sons Publishing, presented "Scholarly Publishing in a Connected World". She discussed how publications of the past were immutable, and publishers did little with content once something was published. She confessed that in a world where machines are readers, too, publications are a bit behind the times. She further said that we still have an obsession with pages, citing them, marking them, and so on, when in reality the web is not bound by pages. She wants to standardize on a small set of RDFa vocabularies that would enable gathering of content by topic, whether the documents published are articles, data, or electronic notebooks. She closed by talking about how Wiley is trying to extract metadata from its own corpus to provide additional data for scholars.
        Hugh McGuire presented "Opening the book: What the Web can teach books, and what books can teach the Web". He talked about how books seem to hold a special power and value, specific to the boundedness of a book. The web, by contrast, is unbounded; even a single web site is unknowable with no sense of a beginning or an end. On the web, however, anyone can publish documents and data to a global audience without any required permission. He talked about how books are a singular important node of knowledge, with the ebook business having the opposite motive of the web, making ebooks a kind of restricted, broken version of the web. He wants to be able to combine the two. For example a system can provide location-aware annotations of an ebook while also sharing those annotations freely, essentially making ebooks smarter and more open.
Ivan Herman revealed Portable Web Publications, which have serious implications for archiving. The goal is to allow people to download web publications like they do ebooks, PDFs, or other portable articles. There is a need to do so because connectivity is not yet ubiquitous. With the power of the web, one can also embed interactivity into the downloaded document. Of course, there are also additional considerations, like the form factor of the reading device and the needs of the reader. The concept is more than just creating an ebook with interactive components or a web page that can be saved offline. He highlighted the work of publishers in terms of ergonomy and aesthetics, stating that web designers for such portable publications should learn from this work. Portable Web Publications would not be suitable for social web sites, web mail, or anything that depends on real-time data. PWP requires 3 layers of addressing: (1) locating the PWP itself, (2) locating a resource within a PWP, and (3) locating a target within such a resource. In practice, locators depend on the state of the resource, creating a bit of a mess. His group is currently focusing on a manifest specification to solve these issues.

        Poster Session


Of course, I was here to present a poster, "Persistent URIs Must Be Used to be Persistent", developed by Herbert Van de Sompel, Martin Klein, and me, which indicates important consequences for the use of persistent URIs such as DOIs.

In looking at the data from "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot", we reviewed 1.6 million web references from 1.8 million articles and discovered the following:
        1. use of web references is increasing in scholarly articles
        2. frequently authors use publisher web pages (locating URI) rather than DOIs (persistent URI) when creating references
        We show on the poster that, because many use browser bookmarks or citation managers that store these locating URIs, there must be an easy way to help tools find the DOI. Our suggestion is to store this DOI in the Link header for easy access by these tools.
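As a rough sketch of how a citation manager or bookmarking tool could take advantage of such a Link header, consider the snippet below; the landing page URI and the "identifier" relation type are placeholders for illustration, not a settled part of the proposal:

import requests

# Hypothetical publisher landing page; the relation type "identifier" is a
# placeholder used only for illustration.
landing_page = "http://www.example-publisher.org/article/landing/12345"

response = requests.head(landing_page, allow_redirects=True)

# requests parses the HTTP Link header into response.links, keyed by relation type.
persistent = response.links.get("identifier")
if persistent:
    print("Cite the persistent URI:", persistent["url"])
else:
    print("No persistent URI advertised; falling back to the locating URI.")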

        I appreciate the visit from Sarven Capadisli and Amy Guy who work on Solid. Many others came by to see our work, like Takeru Yokoi, Hideaki Takeda, Lee Giles, and Pieter Colpaert. Most appreciated the idea, noting it as "simple" with some asking "why don't we have this already?".

        WWW Conference Presentations


        Even though I attended many additional presentations, I will only detail a few of interest.
        As a person who has difficulty with SPARQL, I appreciated the efforts of Gonzalo Diaz and his co-authors in "Reverse Engineering SPARQL Queries". Their goal was to reverse engineer SPARQL queries with the intent of producing better examples for new users, seeing as new users have a hard time with the precise syntax and semantics of the language. Given a database and answers, they wanted to reverse engineer the queries that produced those answers. Unfortunately, they discovered that verifying a reverse engineered SPARQL query to determine if it is the canonical query for a given database and answer is an NP-complete (intractable) problem. They were however able to perform some heuristics on a specific subset of queries to solve this problem in polynomial time.
Fernando Suarez presented "Foundations of JSON Schema". He mentioned that JSON is very popular because it is flexible, but there is no way to describe what kind of JSON response a client should expect from a web service. He discussed a proposal from the IETF to develop JSON Schema, a set of restrictions that documents must satisfy. He said the specification is in its fourth draft, but is still ambiguous. Even online validators disagree on some content, meaning that we need clear semantics for validation, and he proposes a formal grammar. His contribution is an analysis showing that the validation problem is PTIME-complete, but that determining if a document has an equivalent JSON schema is PSPACE-hard for very simple schemas. For the future, he intends to work further on integrity constraints for JSON documents and more use cases for JSON Schema.
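For readers unfamiliar with the format, a minimal sketch of a JSON Schema (draft-04 style keywords; the fields are invented for illustration) looks like this:

{
  "type": "object",
  "properties": {
    "title":   { "type": "string" },
    "year":    { "type": "integer", "minimum": 1900 },
    "authors": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["title", "authors"]
}

A validator checks whether a given JSON document satisfies these restrictions; Suarez's complexity results concern exactly this validation problem and the equivalence of such schemas.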
David Garcia presented "The QWERTY Effect on the Web: How Typing Shapes the Meaning of Words in Online-Human Communications".  He highlighted a hypothesis that words typed with more letters from the right side of the keyboard are more positive than those with more letters from the left. He tested this hypothesis on product ratings from different datasets and found that 9 out of 11 datasets see a significant QWERTY effect which is independent of the number of views or comments on an item. He did mention that he needs to repeat the study with different languages and keyboard layouts. He closed by saying that there is no evidence yet that we can predict meanings or change evaluations based on this knowledge.
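A toy way to compute such a right-side score for a word is sketched below; the assignment of letters to hands is one common convention and is an assumption on my part, not necessarily Garcia's exact setup:

# Right-side advantage: (right - left) / (right + left) over a word's letters.
LEFT_HAND  = set("qwertasdfgzxcvb")   # assumed left-hand QWERTY letters
RIGHT_HAND = set("yuiophjklnm")       # assumed right-hand QWERTY letters

def right_side_advantage(word):
    letters = [c for c in word.lower() if c.isalpha()]
    right = sum(c in RIGHT_HAND for c in letters)
    left = sum(c in LEFT_HAND for c in letters)
    return (right - left) / (right + left) if letters else 0.0

print(right_side_advantage("lol"), right_side_advantage("sad"))  # 1.0 vs. -1.0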
Justin Cheng presented "Do Cascades Recur?", in which he analyzes the rise and fall of memes multiple times throughout social media. Prior work shows that cascades (meme sharing) rise and then fall, but in reality there are many rises and falls over time. He studies these different peaks and tries to determine how and why these cascades recur. Seeing as these bursts are separated among different network communities, cascades recur when people connect communities and reshare something. It turns out that a meme with high virality has less chance of recurring, but one with medium virality will recur months or perhaps years later. He would like to repeat his study with networks other than Facebook and develop improved models of recurrence based on other data.
Pramod Bhatotia presented "IncApprox: The Marriage of incremental and approximate computing". He discussed how data analytic systems transform raw data into useful information, but they need to strike a balance between low latency and high throughput. There are two computing paradigms that try to strike this balance: (1) incremental computation and (2) approximate computing. Incremental computation is motivated by the fact that we are recomputing the output with small changes in the input and can reuse memoized parts of the computation that are unaffected by the changed input. Approximate computing is motivated by the fact that an approximate answer is often good enough. With approximate computing, we get the entire input dataset, but compute only parts of the input and then produce approximate output in a low-latency manner. His contribution is the combination of these two approaches.
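A toy sketch of the idea (not the IncApprox system itself) might combine memoized per-chunk results with sampling, for example:

import random

# Toy combination of incremental and approximate computing: estimate the sum
# of a chunked dataset by (1) memoizing per-chunk partial sums so only new or
# changed chunks are recomputed, and (2) sampling chunks and scaling up.
memo = {}  # chunk id -> cached partial sum

def estimate_total(chunks, changed, sample_rate=0.5):
    for cid in changed:            # invalidate chunks whose input changed
        memo.pop(cid, None)
    sampled = [cid for cid in chunks if random.random() < sample_rate]
    total = 0.0
    for cid in sampled:
        if cid not in memo:        # incremental: reuse unchanged partial sums
            memo[cid] = sum(chunks[cid])
        total += memo[cid]
    return total / sample_rate     # approximate: scale the sampled portion up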
        Jessica Su presented "The Effect of Recommendations on Network Structure". She worked with Twitter on the rollout of a recommendation system that suggests new people to follow. They restricted the experiment to two weeks to avoid any noise from outside the rollout. They found that there is an effect; people's followers did increase after the rollout. They also confirmed that the "rich get richer", with those who already had many followers gaining more followers and those with few still gaining some followers. She also mentioned that people did not appear to be making friends, only following others.
Samuel Way presented "Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks". This study tried to investigate why women are not participating in computer science. He mentioned that there are conflicting results: universities have a 2-to-1 preference for female faculty applicants, but at the same time there is a bias favoring male students. They developed a framework for modeling faculty hiring networks using a combination of CVs, social media profiles, and other sources on a subset of people currently going through the tenure process. The model shows that gender bias is not uniformly, systematically affecting all hires in the same way and that the top institutions fight over a small group of people. Women are a limited resource in this market and some institutions are better at competing for them. The result is that accounting for gender does not help predict faculty placement, leading them to conclude that the effects of gender are accounted for by other factors, such as publishing or post-doctoral training rates, or the fact that some institutions appear to be better at hiring women than others. The model predicts that men and women will be hired at equal rates in Computer Science by the 2070s.

        Social

Of course, I did not merely enjoy the presentations and posters. Between the Monday night SAVE-SD dinner, the Thursday night Gala, and lunch each day, I took the opportunity to acquaint myself with many experts in the field. Google, Yahoo!, and Microsoft were also there looking to discuss data sharing, collaboration, and employment opportunities.

        I always had lunch company thanks to the efforts of Erik Wilde, Michael Nolting, Roland Gülle, Eike Von Seggern, Francesco Osborne, Bahar Sateli, Angelo Salatino, Marc Spaniol, Jannik Strötgen, Erdal Kuzey, Matthias Steinbauer,  Julia Stoyanovich, Jan Jones, and more.
Furthermore, the Gala introduced me to other attendees, like Chris LaRoche, Marc-Olivier Lamothe, Ashutosh Dhekne, Mensah Alkebu-Lan, Salman Hooshmand, Li'ang Yin, Alex Jeongwoo Oh, Graham Klyne, and Lukas Eberhard. Takeru Yokoi introduced me to Keiko Yokoi from the University of Tokyo who was familiar with many aspects of digital libraries and quite interested in Memento. I also had a fascinating discussion about Memento and the Semantic Web with Michel Gagnon and Ian Horrocks, who suggested I read "Introduction to Description Logic" to understand more of the concepts behind the semantic web and artificial intelligence.

        In Conclusion


As my first academic conference, WWW 2016 was an excellent experience, bringing me in touch with paragons at the forefront of web research. I now have a much better understanding of where we are in the many aspects of the web and scholarly communications.
        Even as we left the conference and said our goodbyes, I knew that many of us had been encouraged  to create a more open, secure, available, and decentralized web.




        2016-04-27: Mementos in the Raw

        While analyzing mementos in a recent experiment, we discovered problems processing archived content.  Many web archives augment the mementos they serve with additional archive-specific information, including HTML, text, and JavaScript.  We were attempting to compare content across many web archives, and had to develop custom solutions to remove these augmentations.

Most archives augment their mementos in order to provide additional user experience features, such as navigation to other mementos, by rewriting links and providing discovery tools. From an end-user perspective, these augmented mementos enhance the usability and overall experience of web archives and are the default case for user access to mementos.  An example from the PRONI web archive is shown below, with the augmentations outlined in red.



        Others have requirements to differentiate archived content from live content, because they expose archived content to web search engines. Below, we see that a Google search will return content from the UK National Archives, with one of these search results outlined in red.
        To indicate the archived nature of this content, the title of the web page, outlined in red below, has been altered to indicate that this archived page is "[ARCHIVED CONTENT]".


Our experiments were adversely affected by these augmentations. We required "mementos in the raw".  In the case of our study, we needed to access the content as it had existed on the web at the time of capture.  Research by Scott Ainsworth requires accurate replay of the headers as well. These captured mementos are invaluable to the growing number of research studies that use web archives. Captured mementos are also used by projects like oldweb.today, which truly need to access the original content so it can be rendered in old browsers; such a project seeks consistent content from different archives to arrive at an accurate page recreation. Fortunately, some web archives store the captured memento, but there is no uniform, standards-based way to access them across various archive implementations.

        Based on the needs of these research studies and software projects:
        1. A captured memento must contain only the memento content that was present in the original document:
        • no HTML, JavaScript, CSS, or text has been added to the output
        • linked URIs are not rewritten and exist as they were in the original document (e.g., http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html should just be http://www.lanl.gov/news/index.html)
2. A captured memento should also provide the original HTTP headers in some form (e.g., X-Archive-Orig-Content-Type: text/html for users desiring the original Content-Type)

The following table provides a list of some known web archives and the status of their ability to provide captured mementos, by way of either unaltered content and/or the original headers. Those columns with a "Yes" indicate that the archive is able to provide access to that specific dimension of captured mementos using software-specific approaches.


        Those entries with a ? and other archives not listed may or may not provide access to captured mementos. This ambiguity is part of the problem.  Those archives that run OpenWayback for serving their mementos have the capability to deliver captured mementos, as detailed in the OpenWayback Administrator Manual, by use of special URIs. In fact, the OpenWayback im_ URI flag provides the desired behavior, with original headers and original content, even though the documentation states that it is supposed to "return document as an image".

        Of course, not all web archives run OpenWayback, and developers have needed to create heuristics based on the software used by each individual web archive.  For example, our archive registry uses the un-rewritten-api-url attribute to provide a pattern for accessing captured mementos. Because there is no uniform approach, these pattern-based solutions are necessary but brittle, tying them to a small set of specific implementations, and making it difficult for clients to adapt to new or changing web archive software.
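A sketch of what such a heuristic looks like in practice is shown below, using the un-rewritten-api-url template from the registry's Icelandic Web Archive entry (shown later in this post); each archive needs its own pattern, which is exactly the brittleness we want to remove:

# Fill in an archive-specific URI template to reach a captured memento.
TEMPLATE = "http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}"

def captured_memento_uri(timestamp, url, template=TEMPLATE):
    return template.format(timestamp=timestamp, url=url)

print(captured_memento_uri("20091117131348", "http://www.lanl.gov/news/index.html"))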
        We propose a solution that uses the Memento specification (RFC 7089) in its current form, while still allowing uniform, standards-based access to both augmented and captured mementos.

        Proposed Solution for Accessing Augmented and Captured Mementos

        We propose two parallel Memento implementations: one with a TimeGate and TimeMap for access to augmented mementos (as currently exists) and another with a TimeGate and TimeMap for access to captured mementos.  A client that desires access to a specific type of memento (captured or augmented) only needs to access the TimeGate or TimeMap that specializes in finding and returning that type of memento. These parallel Memento implementations are based on the same infrastructure, the interactions are the same, and the only difference is in the nature of the memento each serves.

        Clients could use the Archive Registry for discovering these TimeGates and TimeMaps. The Registry contains entries for many public web archives and version control systems, for each detailing its TimeGate and TimeMap URIs, as well as any additional information pertinent to accessing the archives. Several tools, such as the Memento Aggregator, directly use the information in the Registry. In light of discussions on the Memento Development list, we are considering creating a curated location where improvements can be submitted by the community.

        A new attribute, profile, added to the timegate and timemap elements in the Registry, would allow a client to discover the TimeGate and/or TimeMap providing the type of memento it desires. A fictional enhanced Registry entry for the Icelandic Web Archive is shown below with the new profile attributes in red. Also, information currently provided in the <archive> element would either be deprecated (e.g. un-rewritten-api-url) or relocated (e.g. inside the timegate or timemap elements).

        <link id="is" longname="Icelandic Web Archive">
        <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
        <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
<timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented" />
        <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/"
paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured" />
        <icon uri="http://vefsafn.is/favicon.ico"/>
        <calendar uri="http://wayback.vefsafn.is/wayback/*/"/>
        <memento uri="http://wayback.vefsafn.is/wayback/*/"/>
        <archive type="snapshot" rewritten-urls="yes" un-rewritten-api-url="http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}" access-policy="public" memento-status="yes"/>
        </link>
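A client could then select the appropriate endpoint by profile; a minimal sketch, assuming the fictional registry entry above has been saved to a local file, follows:

import xml.etree.ElementTree as ET

CAPTURED = "http://mementoweb.org/terms/captured"

# archive_registry.xml is a hypothetical local copy of the registry containing
# <link> entries like the Icelandic Web Archive example above.
root = ET.parse("archive_registry.xml").getroot()

for link in root.iter("link"):
    for timegate in link.iter("timegate"):
        if timegate.get("profile") == CAPTURED:
            print(link.get("longname"), "captured TimeGate:", timegate.get("uri"))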

        This solution requires no changes to the Memento protocol and allows web archives to satisfy the needs of both end-users and software applications by returning the appropriate memento for each use-case. 
        In the case of OpenWayback, this capability should be easy to add. Consider the following example from the Icelandic Archive, running OpenWayback, where the following URIs refer to the mementos of http://www.lanl.gov with a Memento-Datetime of Tue, 17 Nov 2009 13:13:48 GMT:
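For example (illustrative forms that follow the archive's URI pattern and the im_ convention described above):

http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/ (augmented)
http://wayback.vefsafn.is/wayback/20091117131348im_/http://www.lanl.gov/ (captured)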
The memento that will be selected from the archive for the requested datetime, and hence the database interactions, will be the same for augmented and captured mementos. The only difference is the memento URI to which the TimeGates will redirect, which is limited to the addition of the string im_ in the captured memento's URI. The additional TimeGate only needs to add this string to its output.
This approach, fully aligned with the Memento protocol, removes the need for client heuristics and supports using syntaxes other than im_ to distinguish between captured and augmented memento URIs. A client that picks a TimeGate or TimeMap of a given nature will continue to receive that type of memento.

        Optional Additions


        With parallel "augmented" and "captured" Memento protocol support in place, as described above, we have supplied access to different types of mementos. The following section details other optional helpful changes that a client could use to identify and locate different types of mementos.

        Self-Describing TimeGates, TimeMaps, and Mementos

        TimeGates, TimeMaps, and mementos can self-describe their nature with an HTTP link using a profile relation, defined by RFC 6906, and a link target (Target IRI in the RFC) that indicates their augmented or captured nature.

        Example TimeGate response headers implementing this self-describing ability are shown below, with the profile relation specifying the captured nature in red.

        HTTP/1.1 302 Found
        Date: Thu, 21 Jan 2010 00:02:14 GMT
        Server: Apache
        Vary: accept-datetime
        Location: http://arxiv.example.net/web/captured/20010321203610/http://
        a.example.org/
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format"
        ; from="Tue, 15 Sep 2000 11:28:26 GMT"
        ; until="Wed, 20 Jan 2010 09:34:33 GMT",
        <http://mementoweb.org/terms/captured>; rel="profile"
        Content-Length: 0
        Content-Type: text/plain; charset=UTF-8
        Connection: close

        Example TimeMap response headers implementing this relation are shown below, again with additions in red describing this TimeMap as listing augmented mementos. The profile link is placed within the Link header so that clients can discard or consume the associated entity based on their needs. The profile link is also included in the TimeMap body so that the TimeMap itself is self-describing.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:06:50 GMT
        Server: Apache
        Content-Length: 4883
        Content-Type: application/link-format
        Link: <http://mementoweb.org/terms/augmented>; rel="profile"
        Connection: close

        <http://a.example.org>;rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org>
        ; rel="self";type="application/link-format",
        <http://mementoweb.org/terms/augmented>
        ; rel="profile",

        <http://arxiv.example.net/timegate/http://a.example.org>
        ; rel="timegate",
        <http://arxiv.example.net/web/20000620180259/http://a.example.org>
        ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
        <http://arxiv.example.net/web/20091027204954/http://a.example.org>
        ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
        <http://arxiv.example.net/web/20000621011731/http://a.example.org>
        ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
        <http://arxiv.example.net/web/20000621044156/http://a.example.org>
        ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT",
        ...

Finally, a memento can specify whether it is captured or augmented using the same method.  Seen in red in the example below, the headers describe this resource as a captured memento.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:15 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/captured/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/captured>; rel="profile"
        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        These additional profile relations allow archives to describe the nature of respective TimeGates, TimeMaps, and mementos without affecting existing Memento clients.

        Discovery of Other TimeGates and TimeMaps via Mementos

Here we introduce an approach for a client to get from a memento to its corresponding memento of the other type. This capability is handy in itself, but, as will be shown, it is also a way to get to the other type of TimeGate and TimeMap.

        By including another Link relation, a machine client can find the corresponding memento of another type.  Shown below, we build upon our previous example memento headers and add this new relation, marked in red, allowing clients to find this captured memento's augmented counterpart. Here a profile attribute is used with the memento relation type in order to indicate the type of memento found at the link target. This profile attribute has been requested as part of "Signposting the Scholarly Web", and is provided by a proposed update to a draft RFC detailing "link hints". This proposed update has been informally accepted by the RFC's author.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:15 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/captured/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/captured/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/captured>; rel="profile",
        <http://arxiv.example.net/web/20010321203610/http://
        a.example.org/>
        ; rel="memento"; profile="http://mementoweb.org/terms/augmented"

        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        From there, a client can follow the link target to the augmented memento. In the example below, we have the headers for the corresponding augmented memento.  The Memento protocol already provides the associated timegate and timemap relations, shown in bold.  A client uses these relations to discover the TimeGate/TimeMap that serves this memento, and, of course, the TimeGate/TimeMap have the same augmented nature as this memento. Note that this augmented memento also links to its captured counterpart.

        HTTP/1.1 200 OK
        Date: Thu, 21 Jan 2010 00:02:16 GMT
        Server: Apache-Coyote/1.1
        Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org/>
        ; rel="timemap"; type="application/link-format",
        <http://arxiv.example.net/timegate/http://a.example.org/>
        ; rel="timegate",
        <http://mementoweb.org/terms/augmented>; rel="profile",
<http://arxiv.example.net/web/captured/20010321203610/http://
        a.example.org/>
        ; rel="memento"; profile="http://mementoweb.org/terms/captured"
        Content-Length: 25532
        Content-Type: text/html;charset=utf-8
        Connection: close

        Now the client can make future requests to this TimeGate and receive responses like the one below, finding additional augmented mementos for the original resource.

        HTTP/1.1 302 Found
        Date: Thu, 21 Jan 2010 00:02:17 GMT
        Server: Apache
        Vary: accept-datetime
        Location: http://arxiv.example.net/web/20100424131422/http://
        a.example.org/
        Link: <http://a.example.org/>; rel="original",
        <http://arxiv.example.net/timemap/http://a.example.org/>
        ; rel="timemap"; type="application/link-format"
        ; from="Tue, 15 Sep 2000 11:28:26 GMT"
        ; until="Wed, 20 Jan 2010 09:34:33 GMT",
        <http://mementoweb.org/terms/augmented>; rel="profile"
        Content-Length: 0
        Content-Type: text/plain; charset=UTF-8
        Connection: close

        Likewise, a client can issue a request to the associated TimeMap to access augmented mementos for this resource. Of course, this process can start from an augmented memento and lead a client to the TimeGate/TimeMap for its captured counterpart as well.
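A minimal client-side sketch of this walk, using the fictional arxiv.example.net URIs from the examples above (and assuming the archive implements the proposed profile links), might look like:

import requests

AUGMENTED = "http://mementoweb.org/terms/augmented"

# Start from a captured memento and follow its profile-qualified "memento"
# link to the augmented counterpart, then read that response's "timegate"
# and "timemap" relations.
captured_uri = ("http://arxiv.example.net/web/captured/20010321203610/"
                "http://a.example.org/")
resp = requests.get(captured_uri)

counterpart = resp.links.get("memento")   # requests parses the Link header
if counterpart and counterpart.get("profile") == AUGMENTED:
    augmented = requests.get(counterpart["url"])
    print("Augmented TimeGate:", augmented.links["timegate"]["url"])
    print("Augmented TimeMap: ", augmented.links["timemap"]["url"])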

        Conclusion


        The "captured" and "augmented" parallel Memento implementations addresses the problem of accessing different types of mementos in a standard-based manner.  Given that the selected memento will be the same for both the captured and augmented cases and the difference will only be in the access mechanism (URI), the solution seems straightforward to implement for web archives. Existing clients will still continue to function as is, and clients desiring a specific type of memento can use the Archive Registry to find the resources that support the that type of memento.

        In addition, the optional profile and discovery links add further value, allowing clients to identify which type of mementos they have currently acquired as well as accessing the other types of mementos that are available.

        We look forward to feedback on this proposed solution.

        --
        Shawn M. Jones
        - and -
        Herbert Van de Sompel
        - and -
        Michael L. Nelson

        Acknowledgements: Ilya Kremer also contributed to the initial discussion of the need for a standard method of accessing captured mementos.

        2016-05-31: Can I find this story? API: Yes, Google: Maybe, Native Search: No

        A story on Storify titled: "Lecture on Academic Freedom" (capture date: 2016-05-31)
        The story on Storify titled: "Lecture on Academic Freedom" could not be found on Google (capture date: 2016-05-31)
        The story on Storify titled: "Lecture on Academic Freedom" could not be found on Storify native search (capture date: 2016-05-31)
A part of our research (funded by IMLS) to build collections for stories or events involves exploring content curation sites like Storify in order to determine if they hold quality (newsworthy, timely, etc.) content. Storify is a social network service used to create stories, which consist of text and multimedia content, as well as content from other social media sites like Twitter, Facebook, and Instagram.
Our exploration involved collecting stories from Storify over a period of time in order to manually inspect the stories and determine their newsworthiness. This exploration was dual natured: we collected the latest stories (across multiple topics) from the Storify API (browse/latest interface) over a period of time, and we also collected stories from Storify about the Ebola virus through Storify's search API. During this period we collected resources from Google (with the "site:storify.com" directive) as well. At a particular point in our exploration, we considered whether we could rely exclusively on Storify search as a means to find content or use Google's site directive to find Storify stories. In other words, how good is the Storify native search compared to Google search for discovery of stories on Storify when compared to the Storify browse/latest API?
        Storify API vs Google and Storify native search: A simple plan for measuring discovery
We focused on known-item searches to avoid the problem of subjective relevance measures. This gave us a very simple way of scoring Google and Storify's native search: if Google finds a specific story (query extracted from the exact title, body content, and description), Google gets 1 point. On the other hand, if Storify's native search (using the same query) finds the story, Storify gets 1 point.
Our set of test stories and their corresponding queries, generated from the story titles, body content, and description snippets, consisted of 10 stories created between February 2016 and March 2016 (enough time for both search services to index the stories). These stories were collected from the Storify browse/latest API interface, which allows for discovery of content but does not allow us to find topical content as with search. Here is the list of stories (collected 2016-05-30) and their respective creation datetime values, as well as the results outlining stories found by Google and/or Storify's native search:

Story | Creation datetime | Found? (Google) | Found? (Storify)
Commandos 2: Men of Courage full game free pc, download, play. download Commandos 2: Men of Courage for pc | 2016-02-22T22:36:03 | Yes | No
#SJUtakeover | 2016-02-17T21:16:43 | Yes | No
Annotations for Edgar Allan Poe | 2016-03-02T19:47:31 | No | No
Lecture on Academic Freedom | 2016-02-22T22:27:08 | No | No
Hitman: Codename 47 full game free pc, download, play. download Hitman: Codename 47 for pc | 2016-02-22T22:36:26 | Yes | No
AU Game Lab at GDC 2016 | 2016-03-18T17:36:34 | Yes | No
5 Leading Onlinegames For Females Cost Free | 2016-02-22T22:37:22 | Yes | No
Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes | 2016-03-18T23:50:55 | No | No
Senior Research Paper | 2016-02-26T19:47:19 | Yes | No
Syracuse community reacts to NCAA Tournament win over Dayton | 2016-03-18T17:38:34 | Yes | No

We searched for the stories by issuing queries with full quotes (for exact match) to Google search (with the "site:storify.com" directive) and Storify's native search, and counted the number of hits and misses for both. For both Google and Storify, all SERP links were included in the test. The results from Google did not exceed one page; for Storify, however, the average number of results was 20 stories.
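For reference, the queries were constructed along these lines; this is a sketch, and the Storify search URL shown is illustrative rather than the exact endpoint we used:

import urllib.parse

title = "Lecture on Academic Freedom"   # one of the test story titles above

google_q  = urllib.parse.quote_plus('"%s" site:storify.com' % title)
storify_q = urllib.parse.quote_plus('"%s"' % title)

print("https://www.google.com/search?q=" + google_q)
print("https://storify.com/search?q=" + storify_q)   # illustrative endpoint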
        Storify's native search finds 0/10 stories, Google finds 7/10
        We expected Storify to find more stories compared to Google, since the content resides on Storify, but this was not the case: out of 10 stories, Google found 7 but Storify found none! Google found all except the following stories:
        1. Annotations for Edgar Allan Poe
        2. Lecture on Academic Freedom
        3. Sony Ericsson Z610i (Pink): newest cellular Phone With Advanced attributes
        A story on Storify titled: "#SJUTakeover" (capture date: 2016-05-31)

        The story on Storify titled: "#SJUTakeover" could not be found on Storify search but found on Google (capture date: 2016-05-31)
Before our test, we checked and did not find a Storify utility to exclude a story from search during the story's creation. Consequently, our test result suggests that the Storify search index is not synchronized with its browse/latest API interface. This investigation also shows the utility of using the Storify API for discovery, which contradicts some of our previous experiences where APIs provide different, limited, or stale data (e.g., Delicious API, SE APIs).
        A proposal for a comprehensive study
We acknowledge that the sample size of our experiment is very small; however, because the stories were randomly selected, the preliminary results could approximate those of a larger study. The curious reader may consider verifying our result through a larger test consisting of a large collection of random stories published across a wide temporal window. If this is done, kindly share your findings with us.
        --Nwala

        2016-06-03: Lipstick or Ham: Next Steps for WAIL

        The development, state, and future of 🐳 Web Archiving Integration Layer. 💄∨🐷?                                                                 

        Some time ago I created and deployed Web Archiving Integration Layer (frequently abbreviated as WAIL), an application that provides users pre-configured local instances of Heritrix and OpenWayback. This tool was originally created for the Personal Digital Archiving 2013 conference and has gone through a metamorphosis.

The original impetus for creating the application was that the browser-based WARCreate extension required some sort of server-like software to save files locally because of the limitations of the Google Chrome API and JavaScript at the time (2012). WARCreate would perform an HTTP POST to this local server instance, which would then return an HTTP response with an appropriate MIME type that would cause the browser to download the file. I initially used XAMPP for this with a PHP script within the Apache instance. This was unwieldy and a little more complex of a procedure than I wanted for the user.

With the introduction of the HTML5 File API, this server software was no longer required. The File API, however, is sandboxed to an isolated file system accessible only to the browser. To circumvent this restriction, I utilize the FileSaver.js library, but this, too, has limitations on the size of the file that can be downloaded -- 500 MiB (about 524 MB) for Google Chrome.

        XAMPP to WAIL

With Apache no longer being a requirement for WARCreate, I investigated using XAMPP's bundled copy of the Apache web server and the additionally bundled Tomcat Java server for other web archiving purposes, namely as the engine to run the Java-based OpenWayback. This worked well but still felt heavy for a user's PC, as Java applications do. The added Java requirement also meant that I could include a pre-configured Heritrix, Internet Archive's Java-based archival crawler, within XAMPP. The XAMPP interface, however, was generic, geared toward simply controlling services, a UI scheme I wanted to obscure from the target audience.

A locally hosted web-based interface might have been suitable, but as with the WARCreate-to-local-file problems, having a browser launch applications on the user's machine was likely to be problematic. Being already familiar with Python, I created a script using the wxPython library (the Python port of wxWidgets) that allows a user to specify a URI for Heritrix to crawl (by programmatically creating crawl configurations) and locations for the resulting WARCs, to which Heritrix should write and from which OpenWayback should read.
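The shape of that script was roughly as follows; this is a minimal sketch rather than WAIL's actual code, with the crawl-launching logic stubbed out:

import wx

class WailSketchFrame(wx.Frame):
    """A single text field for a seed URI and a button whose handler would
    hand the URI off to Heritrix and point OpenWayback at the WARCs."""
    def __init__(self):
        super(WailSketchFrame, self).__init__(None, title="WAIL (sketch)")
        panel = wx.Panel(self)
        self.uri = wx.TextCtrl(panel, value="http://example.com", size=(300, -1))
        button = wx.Button(panel, label="Archive Now!")
        button.Bind(wx.EVT_BUTTON, self.on_archive)
        sizer = wx.BoxSizer(wx.HORIZONTAL)
        sizer.Add(self.uri, 1, wx.ALL | wx.EXPAND, 5)
        sizer.Add(button, 0, wx.ALL, 5)
        panel.SetSizer(sizer)

    def on_archive(self, event):
        seed = self.uri.GetValue()
        # Here WAIL would write a Heritrix job configuration for `seed`
        # and launch the crawl; omitted in this sketch.
        print("Would crawl:", seed)

if __name__ == "__main__":
    app = wx.App(False)
    WailSketchFrame().Show()
    app.MainLoop()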

This additional Graphical User Interface (GUI) "Layer" for "Integrating" "Web Archive" tools (Heritrix and OpenWayback) spawned the awkwardly named "Web Archiving Integration Layer". The acronym, while descriptive, also reiterated ODU WS-DL's trend of associating produced software with sea creatures (as I referenced once before).


Ceci n'est pas un cochon (This is not a pig)
Requiring the target user base (digital humanities scholars and amateur web archivists) to go to the command-line to launch a Python script was unacceptable, however, and the remedy to this problem has been partially to blame for the slowdown in further development of WAIL. To "Freeze" code is to create the more familiar "Application" that a user would double click to launch. At the time (2013), PyInstaller provided the best application freezing functionality in that it performed dependency resolution, created cross-platform binaries, and provided a mode to produce a single binary file, which was not initially necessary but became appealing.

In the beginning, WAIL was compiled for Windows and Mac OS X (nowadays called simply "OS X"). On the latter, single-file applications are very common, as OS X's ".app" faux directory structure allows the application tools and resources to be nicely packaged. Eventually, this was also a useful place to include the OpenWayback and Heritrix binaries. Windows does not have this abstraction, instead frequently providing a directory of files with the ".exe" being the binary, which is the reason that WAIL for Windows has not been updated since 2013.

        Plagued with Problems

As if the decoupling of the OS X and Windows versions was not bad enough, OS X ceased bundling the Java runtime with the operating system (which required WAIL to install the runtime), Heritrix required an older version of Java (it would break with the latest version), and there were just generally Java problems all around. These problems persist to this day, but ultimately it was these requirements and configuration issues that WAIL was designed to solve, or at least mitigate, for the user. The WAIL code that drives the UI is also quite the mess. Despite us being researchers, for whom code function should supersede its form, because WAIL is publicly available (both the binary and the source), it ought to reflect quality in form to the extent of its function.

        Refactor or Is That Fiddly?

        I have been maintaining and improving the code but eventually either another WS-DLite will be doing the same or the project will die. I believe there still to be merit in a locally hosted web archive, particularly for the digital humanities scholars that aren't familiar with system interaction via the command-line and manually rewriting configuration files.

We are looking into other routes to make the code more intuitive to maintain but still functionally equivalent to, if not greater than, the Python-based native app in its current state. We have bundled the newly developed Go-based MemGator Memento aggregator (blog post to come) with WAIL as a cross-platform native executable. We also hope to include other tools that personal web archivists would find useful, with the requirement being that each must run natively and include no further non-bundled dependencies. Two tools on our radar are Ilya Kreymer's pywb, part of the replay component that's driving Webrecorder, and the heavily coupled (with pywb) InterPlanetary Wayback (ipwb) system we developed at the Archives Unleashed Hackathon in March.

        The question still remains whether to rework the current code or to overhaul the UI in a way that is more extensible and maintainable. The Electron packaging library, as used by the native Slack application, Atom editor, and many other software projects, looks to be the route to take to achieve these goals. Additionally, interfaces written for Electron can be compiled to native applications, a feature that will allow the ethos of WAIL to be retained.

        However, rewriting the UI does not a more useful application make and doing so boils down to putting lipstick on a pig. External dependencies should be the primary problem to tackle. From that, including additional functionality and tools to make the application more useful (the "ham" if this simile can be stretched any further) ought to be given priority.

        —Mat Kelly (@machawk1)