Web Science and Digital Libraries Research Group

2014-11-14: Carbon Dating the Web, version 2.0




For over a year, Hany SalahEldeen's Carbon Date service has been out of service, mainly because of API changes in some of the underlying modules on which the service is built. Consequently, I have taken up responsibility for maintaining the service, beginning with the changes now available in Carbon Date v2.0.

Carbon Date v2.0


The Carbon Date service now makes requests to the different modules (archives, backlinks, etc.) concurrently through threading.
The server framework has been changed from Bottle to CherryPy, which is still a minimalist Python WSGI server, but a more robust framework that features a threaded server.
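As a rough sketch of the concurrency pattern (the module names and the fetch_earliest_date helper below are illustrative placeholders, not the actual Carbon Date code):

import threading

def fetch_earliest_date(module, url):
    # Placeholder for a real lookup against one source (an archive, backlinks, etc.).
    return None

def carbon_date(url, modules=("archives", "backlinks", "google", "bing")):
    results = {}

    def worker(module):
        # Each thread queries one source and records its earliest-date estimate.
        results[module] = fetch_earliest_date(module, url)

    threads = [threading.Thread(target=worker, args=(m,)) for m in modules]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The earliest non-empty estimate across the modules is the carbon date.
    return results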

How to use the Carbon Date service

There are three ways:
  • Through the website, http://cd.cs.odu.edu/: Given that carbon dating is highly computationally intensive, the site should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally (local.py or server.py).
  • Through the local server (server.py): The second way to use the Carbon Date service is through the local server application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.
  • Through the local application (local.py): The third way to use the Carbon Date service is through the local python application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.

The backlinks efficiency problem

Upon running the Carbon Date service, you will notice a significant difference in the runtime of the backlinks module compared to the other modules. This is because the most expensive operation in the carbon dating process is carbon dating the backlinks. Consequently, in the local application (local.py), the backlinks module is switched off by default and reactivated with the --compute-backlinks option. For example, to Carbon Date cnn.com with the backlinks module switched on:
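(The exact argument syntax may differ from the version you install; consult README.md. The invocation would look something like this:)

python local.py cnn.com --compute-backlinks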
Some effort was put towards optimizing the backlinks module; however, my conclusion is that the current implementation cannot be optimized.

This is because of the following cascade of operations associated with the inlinks:



Given a single backlink (an incoming link, or inlink, to the URL), the application retrieves all of that backlink's mementos (which could range from tens to hundreds). Thereafter, the application searches those mementos for the first occurrence of the URL.

At first glance, one might suggest binary search, since the mementos are in chronological order. However, given that there are potentially multiple mementos that contain the URL, binary search does not help: checking the midpoint memento for the URL gives us no information we can use to discard half of the list, since the first occurrence could lie in either half. Therefore, a linear scan is the only possible method.
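A sketch of that linear scan (illustrative only; the real module works over the backlink's TimeMap):

def first_occurrence(mementos, target_url, fetch):
    # mementos: list of (datetime, memento_uri) pairs in chronological order.
    # fetch: a function that returns the HTML body of a memento URI.
    # Returns the datetime of the earliest memento whose body links to target_url.
    for dt, uri in mementos:
        if target_url in fetch(uri):
            return dt
    return None

Because the presence of the link is not monotonic across the list, no midpoint test can safely discard half of the mementos; in the worst case, every memento up to the first hit has to be fetched and inspected.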

I am grateful to everyone who contributed to the debugging of Carbon Date, including George Micros and the members of the Old Dominion University Introduction to Web Science class (Fall 2014). Further recommendations or comments about how this service can be improved are welcome and appreciated.

--Nwala

2014-11-20: Archive-It Partners Meeting 2014



I attended the 2014 Archive-It Partners Meeting in Montgomery, AL on November 18.  The meeting attendees are representatives from Archive-It partners with interests ranging from archiving webpages about art and music to archiving government webpages.  (Presentation slides will be available on the Archive-It wiki soon.)  This is ODU's third consecutive Partners Meeting (see trip reports from 2012 and 2013).

The morning program was focused on presentations from partners who are building collections.  Here's a brief overview of each of those.

Penny Baker and Susan Roeper from the Clark Art Institute talked about their experience in archiving the 2013 Venice Biennale international art exhibition (Archive-It collection) and plans for the upcoming exhibition.  Their collection includes exhibition catalogs, monographs, and press releases about the event.  The material also includes a number of videos (mainly from vimeo), which Archive-It can now capture.

Beth Downs from the Montana State Library (Archive-It collection) spoke about working with partners around the state to fulfill the state mandate to make all government documents publicly available and working to make the materials available to state employees, librarians, teachers, and the general public.  One of the nice things they've added to their site footer is a Page History link that goes directly to the Archive-It Wayback calendar page for the current page.


Beth has also provided instructions for their state agencies on how to include the Page History link and a Search box into the archive on their pages.  This could be easily adapted to point to other state government archives or to the general Internet Archive Wayback Machine.

Dory Bower from the US Government Printing Office talked about the FDLP (Federal Depository Library Program) Web Archive (Archive-It collections).  They have several archiving strategies and use Archive-It mainly for the more content rich websites along with born-digital materials.

Heather Slania, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts (Archive-It collections), spoke about the challenges of capturing dynamic content from artists' websites.  This includes animation, video (mainly vimeo), and other types of Internet art. She has initially focused on capturing the websites of a selection of Internet artists.  These sites include over 6000 videos (from just 30 artists).  The next step is to archive the work of video artists and web comics.  As part of this project, she has been considering what types of materials are currently capture-able and categorizing the amount of loss in the archived sites.  This is related to our group's recent work on measuring memento damage (pdf, slides) and investigating the archivability of websites over time (pdf at arXiv, slides).

Nicholas Taylor from Stanford University Libraries gave an overview of the 2013 NDSA (National Digital Stewardship Alliance) Survey Report (pdf).  The latest survey was conducted in 2013 and the first was done in 2011.  NDSA's goal is to conduct this every 2 years.  Nicholas had lots of great stats in his slides, but here are a few that I noted:
  • 50% of respondents were university programs
  • 7% affiliated with IIPC, 33% with NDSA, 45% Web Archiving Roundtable, 71% with Archive-It
  • many are concerned with capturing social media, databases, and video
  • about 80% of respondents are using external services for archiving, like Archive-It
  • 80% haven't transferred data to their local repository
  • many are using tools that don't support WARC (but the percentage using WARC has increased since 2011)
Abbie Nordenhaug and Sara Grimm from the Wisconsin Historical Society (Archive-It collections) presented next.  They're just getting started archiving in a systematic manner.  They have a range of state agency partners, with websites ranging from highly dynamic to fairly static.  So far, they've set up monthly, quarterly, semi-annual, and annual crawls for those sites.

After these presentations, it was time for lunch.  Since we were in Alabama, I found my way to Dreamland BBQ.

After lunch, the presentations focused on collaborations, an update on 2014-2015 Archive-It plans, BOF breakout sessions, and strategies and services.

Anna Perricci from Columbia University Libraries spoke about their experiences with collaborative web archiving projects (Archive-It collections), including the Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) collection and the Contemporary Composers Web Archive (CCWA) collection.

Kent Underwood, Head of the Avery Fisher Center for Music and Media at the NYU Libraries, spoke about web archiving for music history (Archive-It collection).  Kent gave an eloquent argument for web archiving:  "Today’s websites will become tomorrow’s historical documents, and archival websites must certainly be an integral part of tomorrow’s libraries. But websites are fragile and impermanent, and they cannot endure as historical documents without active curatorial attention and intervention. We must act quickly to curate and preserve the memory of the Internet now, while we have the chance, so that researchers of tomorrow will have the opportunity to discover their own past. The decisions and actions that we take today in web archiving will be crucial in determining what our descendants know and understand about their musical history and culture."

Patricia Carlson from Mount Dora High School in Florida spoke about Archive-It's K-12 Archiving Program and its impact on her students (Mount Dora's Archive-It collection).  She talked about its role in introducing her students to primary sources and metadata.  She's also been able to use things that they already do (like tag people on Facebook) as examples of adding metadata. The students have even made a video chronicling their archiving experiences.

After the updates on ongoing collaborations, Lori Donovan and Maria LaCalle from Archive-It gave an overview of Archive-It's 2014 activities and upcoming plans for 2015.  Archive-It currently has 330 partners in 48 US states (only missing Arkansas and North Dakota!) and 16 countries.  In 2014, with version 4.9, Archive-It crawls certain pages with Heritrix and Umbra, which allows Heritrix to access sites in the same way a browser would.  This allows for capture of client-side scripting (such as JavaScript) and improves the capture of social media sites.  There were several new features in the 5.0 release, among them integration with Google Analytics. There will be both a winter 2014 release and a spring/summer 2015 release.  In the spring/summer release several new features are planned, including visual/UI redesign of the web app, the ability to move and share seeds between collections, ability to manually rank metadata facets on public site, enhanced integration with archive.org, updated Wayback look and feel, and linking related pages on the Wayback calendar (in case URI changed over time).

After a short break, we divided up into BOF groups:
  • Archive.org v2
  • Researcher Services
  • Cross-archive collaboration
  • QA (quality assurance)
  • Archiving video, audio, animations, social media
  • State Libraries
I attended the Research Services BOF, led by Jefferson Bailey and Vinay Goel from Internet Archive and Archive-It.  Jefferson and Vinay described their intentions with launching research services and asked for feedback and requests.  The idea is to use the Internet Archive's big data infrastructure to process data and provide smaller datasets of derived data to partners from their collections.  This would allow researchers to work on smaller datasets that would be manageable without necessarily needing big data tools.  This could also be used to provide a teaser as to what's in the collection, highlight link structure in the collection, etc.  One of the initial goals is to seed example use cases of these derivative datasets to show others what might be possible.  The ultimate goal is to help people get more value from the archive.  Jefferson and Vinay talked in more detail about what's upcoming in the last talk of the meeting (see below). Most of the other participants in the BOF were interested in ways that their users could make research use out of their archived collections.

After the BOF breakout, the final session featured talks on strategies and services.

First up was yours truly (Michele Weigle from the WS-DL research group at Old Dominion University).  My talk was a quick update on several of our ongoing projects, funded by NEH Office of Digital Humanities and the Columbia University Libraries Web Archiving Incentives program.


The tools I mentioned (WARCreate, WAIL, and Mink) are all available from our Software page.  If you try them out, please let us know what you think (contact info is on the last slide).

Mohamed Farag from Virginia Tech's CTRnet research group presented their work on an event focused crawler (EFC).  Their previous work on automatic seed generation from URIs shared on Twitter produced lots of seeds, but not all of them were relevant.  The new work allows a curator to select high quality seed URIs and then uses the event focused crawler (EFC) to retrieve webpages that are highly similar to the seeds.  The EFC can also read WARCs and perform text analysis (entities, topics, etc.) from them.  This enables event modeling, describing what happened, where, and when.

In the final presentation of the meeting, Jefferson Bailey and Vinay Goel from Internet Archive spoke about building Archive-It Research Services, planned to launch in January 2015. The goals are to expand access models to web archives, enable new insights into collections, and facilitate computational analysis.  The plan is to leverage the Internet Archive's infrastructure for large-scale processing.  This could result in increasing the use, visibility, and value of Archive-It collections.  Initially, three main types of datasets are planned:
  • WAT - consists of key metadata from a WARC file, includes text data (title, meta-keywords, description) and link data (including anchor text) for HTML
  • LGA - longitudinal graph analysis - what links to what over time
  • WANE - web archive named entities
All of these datasets are significantly smaller than the original WARC files.  Jefferson and Vinay have built several visualizations based on some of this data for demonstration and will be putting some of these online.  Their future work includes developing programmatic APIs, custom datasets, and custom processing.

All in all, it was a great meeting with lots of interesting presentations. It was good to see some familiar faces and to actually meet others I'd previously only emailed with.  It was also nice to be in an audience where I didn't have to motivate the need for web archiving.

There were several people live-tweeting the meeting (#ait14).  I'll conclude with some of the tweets.


-Michele

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page.  For example, the old URI:

http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html

is currently:

http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html

The same content had at least one other URI as well, from at least 2009 to 2012:

http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, InfoWorld is still redirecting the second URI to the current URI:



And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:



Jon's approach is to just give up on tracking the different URIs for his hundreds of articles and instead use a combination of metadata (title & author) and the "site:" operator submitted to a search engine to locate the current URI (side note: this approach is really similar to OpenURL).  For example, the link for the article above would become:

http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22
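Constructing such a query URI programmatically is straightforward; for example (a Python 3 sketch, using Bing only because that is the engine in Jon's example):

from urllib.parse import urlencode

def se_query_uri(site, author, title):
    # Quote the author and title so the engine matches them as exact phrases.
    q = 'site:%s "%s" "%s"' % (site, author, title)
    return "http://www.bing.com/search?" + urlencode({"q": q})

print(se_query_uri("infoworld.com", "jon udell",
                   "XQuery and the power of learning by example"))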

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in html that looks like:



I posted a comment saying that a search engine's robots.txt would prevent archives like the Internet Archive from archiving the SERPs, and thus they would not discover (and archive) the new URIs themselves.  In an email conversation, Martin made the point that rewriting the link to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to msn.com or live.com queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?").

In summary, Jon's use of SERPs as interstitial pages to combat link rot is an interesting solution to a common problem, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them and betting on the long-term stability of SEs.  The solution we need is a method to include more than one URI per HTML link, such as the one proposed in the "Missing Link" document.

--Michael

2015-01-03: Review of WS-DL's 2014

The Web Science and Digital Libraries Research Group's 2014 was even better than our 2013.  First, we graduated two PhD students and had many other students advance their status:
In April we introduced our now famous "PhD Crush" board that allows us to track students' progress through the various hoops they must jump through.  Although it started as sort of a joke, it's quite handy and popular -- I now wish we had instituted it long ago. 

We had 15 publications in 2014, including:
JCDL was especially successful, with Justin's paper "Not all mementos are created equal: Measuring the impact of missing resources" winning "best student paper" (Daniel Hasan from UFMG also won a separate "best student paper" award), and Chuck's paper "When should I make preservation copies of myself?" winning the Vannevar Bush Best Paper award.  It is truly a great honor to have won both best paper awards at JCDL this year (pictures: Justin accepting his award, and me accepting on behalf of Chuck).  In the last two years at JCDL & TPDL, that's three best paper awards and one nomination.  The bar is being raised for future students.

In addition to the conference paper presentations, we traveled to and presented at a number of conferences that do not have formal proceedings:
We were also fortunate enough to visit and host visitors in 2014:
We also released (or updated) a number of software packages for public use, including:
Our coverage in the popular press continued, with highlights including:
  • I appeared on the video podcast "This Week in Law" #279 to discuss web archiving.
  • I was interviewed for the German radio program "DRadio Wissen". 
We were more successful on the funding front this year, winning the following grants:
All of this adds up to a very busy and successful 2014.  Looking ahead to 2015, in addition to continued publication and funding success, we expect to graduate both an MS and a PhD student and to host another visiting researcher (Michael Herzog, Magdeburg-Stendal University).

Thanks to everyone that made 2014 such a great success, and here's to a great start to 2015!

--Michael





2015-01-15: The Winter 2015 Federal Cloud Computing Summit



On January 14th-15th, I attended the Federal Cloud Computing Summit in Washington, D.C., a recurring event in which I have participated in the past. In my continuing role as the MITRE-ATARC Collaboration Session lead, I assisted the host organization, the Advanced Technology And Research Center (ATARC), in organizing and running the MITRE-ATARC Collaboration Sessions. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

The collaboration sessions continue to be highly valued within the government and industry. The Winter 2015 Summit had over 400 government or academic registrants and more than 100 industry registrants. The whitepaper summarizing the Summer 2014 collaboration sessions is now available.

A discussion of FedRAMP and the future of its policies was held in a Government-only session at 11:00, before the collaboration sessions began.
At its conclusion, the four collaboration sessions began, focusing on the following topics.
  • Challenge Area 1: When to choose Public, Private, Government, or Hybrid clouds?
  • Challenge Area 2: The umbrella of acquisition: Contracting pain points and best practices
  • Challenge Area 3: Tiered architecture: Mitigating concerns of geography, access management, and other cloud security constraints
  • Challenge Area 4: The role of cloud computing in emerging technologies
Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will continue its practice of releasing a summary document after the Summit (for reference, see the Summer 2014 and Winter 2013 summit whitepapers).

On January 15th, I attended the Summit itself, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:25 to 4:10, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the fourth Federal Summit event in which I have participated, including the Winter 2013 and Summer 2014 Cloud Summits and the 2013 Big Data Summit. They are great events that the Government participants have consistently identified as high-value. The events also garner a decent amount of press in the federal news outlets and at MITRE. Please refer to the fedsummits.com list of press for the most recent articles about the summit.

We are continuing to expand and improve the summits, particularly with respect to the impact on academia. Stay tuned for news from future summits!

--Justin F. Brunelle

2015-02-05: What Did It Look Like?

Having often wondered why many popular videos on the web are time lapse videos (that is, videos which capture the change of a subject over time), I came to the conclusion that impermanence gives value to the process of preserving ourselves or other subjects in photography, as though it were a means to defy the compulsory, fundamental law of change. Just like our lives, one of the greatest products of human endeavor, the World Wide Web, was once small but has continued to grow. So it is only fitting for us to capture the transitions.
What Did It Look Like? is a Tumblr blog which uses the Memento framework to poll various public web archives, take the earliest archived version from each calendar year, and then create an animated image that shows the progression of the site through the years.

To seed the service we randomly chose some web sites and processed them (see also the archives). In addition, everyone is free to nominate web sites to What Did It Look Like? by tweeting: "#whatdiditlooklike URL". 

To see how this system works, consider the architecture diagram below.

The system is implemented in Python and utilizes Tweepy and PyTumblr to access the Twitter and Tumblr APIs, respectively. It consists of the following programs:
  1. timelapseTwitter.py: This application fetches tweets (with the "#whatdiditlooklike URL" signature) by using the tweet ID of the last tweet visited as a reference for where to begin retrieving tweets. For example, if the application initially visited tweet IDs 0, 1, and 2, it keeps track of ID 2 so as to begin retrieving tweets with IDs greater than 2 in a subsequent retrieval operation. Also, since Twitter rate limits search operations (180 requests per 15-minute window), the application sleeps in between search operations. A sketch of this fetch-after-the-last-tweet logic appears after this list.
  2. usingTimelapseToTakeScreenShots.py: This is a simple application which invokes timelapse.py for each nomination tweet (that is, a tweet with the "#whatdiditlooklike URL" signature).
  3. timelapse.py: Given an input URL, this application uses PhantomJS (a headless browser) to take screenshots and ImageMagick to create an animated GIF. The GIFs are optimized to keep their respective sizes under 1 MB, which ensures the animation is not deactivated by Tumblr.
  4. timelapseSubEngine.py: This application executes two primary operations:
    1. Publishing the animated GIFs of nominated URLs to Tumblr: This is done through the PyTumblr create_photo() method.
    2. Notifying the referrer and making status updates on Twitter: This is achieved through Tweepy's api.update_status() method. However, simply posting periodic status update messages could eventually get them flagged as spam by Twitter, which comes in the form of a 226 error code. To avoid this, timelapseSubEngine.py does not post the same status update or notification tweet twice. Instead, the application randomly selects from a suite of messages and injects a variety of attributes to ensure the status update tweets differ. The randomness in execution comes from a custom cron application which randomly executes the entire stack, from timelapseTwitter.py down to timelapseSubEngine.py.
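The code snippets in the original post were embedded as images; as a rough illustration of the fetch-after-the-last-tweet idea from step 1 (using the tweepy.API.search() call available at the time; the function and variable names here are mine):

import time

def fetch_new_nominations(api, last_seen_id):
    # api is an authenticated tweepy.API instance; last_seen_id is the ID of the
    # last nomination tweet already processed (persisted between runs).
    tweets = api.search(q="#whatdiditlooklike", since_id=last_seen_id)
    for tweet in tweets:
        last_seen_id = max(last_seen_id, tweet.id)
    # Twitter rate limits search to 180 requests per 15-minute window,
    # so sleep before the next search operation.
    time.sleep(60)
    return tweets, last_seen_id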

How to nominate sites onto What Did It Look Like?

If you are interested in seeing what a web site looked like through the years:
Tweet"#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn"to nominate multiple URLs.

How to explore historical posts

To explore historical posts, visit the archives page: http://whatdiditlooklike.tumblr.com/archives

Examples 

What Did cnn.com Look Like?

What Did cs.odu.edu Look Like?
What Did apple.com Look Like?

"What Did It Look Like?" is inspired by two sources: 1) the "One Terabyte of Kilobyte Age Photo Op" Tumblr that Dragan Espenschied presented at DP 2014 (which basically demonstrates digital preservation as performance art; see also the commentary blog by Olia Lialina& Dragan), and 2) the Digital Public Library of America (DPLA) "#dplafinds" hashtag that surfaces interesting holdings that one would otherwise likely not discover.  Both sources have the idea of "randomly" highlighting resources that you would otherwise not find given the intimidatingly large collection in which they reside.

We hope you'll enjoy this service as a fun way to see how web sites -- and web site design! -- have changed through the years.

--Nwala

2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive


On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge of link rot by detecting when a clicked link will return an HTTP 404 and using the Memento Time Travel service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. When the event fires, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When an HTTP 404 response code is detected, robustify.js makes an Ajax call to a remote server, uses the Memento Time Travel service to find mementos of the URI-R, and uses a JavaScript alert to let the user know that it will be redirected to a memento.
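A rough, simplified Python analogue of that client-side behavior (robustify.js itself does this in the browser via the digitopia.nl status service; the endpoint pattern below is the Time Travel URI template shown later in this post):

import requests

def robust_link(url, when="19990101000000"):
    # If the live URI does not return a 404, keep it.
    try:
        live = requests.head(url, allow_redirects=True, timeout=10)
        if live.status_code != 404:
            return url
    except requests.RequestException:
        pass
    # Otherwise, ask the Memento Time Travel service for a memento of the URI-R.
    timegate = "http://timetravel.mementoweb.org/memento/%s/%s" % (when, url)
    memento = requests.get(timegate, allow_redirects=True, timeout=10)
    return memento.url  # the URI-M the service redirected to, if any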

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Along these lines, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page http://www.dds.nl/~krantb/stellingen/.
  1. http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html which does not use robustify.js
  2. http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html which does use robustify.js
In robustifyTest.html, when the user clicks on the link to http://www.dds.nl/~krantb/stellingen/, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F HTTP/1.1
Host: digitopia.nl
Connection: keep-alive
Origin: http://www.cs.odu.edu
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Referer: http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ as expected.

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html and http://web.archive.org/web/20150206215522/http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html.

The Internet Archive re-wrote the embedded links in the memento to be relative to the archive, converting http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://www.dds.nl/~krantb/stellingen/. Upon further investigation, we noticed that robustify.js does not assign onclick events to anchor tags linking to pages within the same domain as the host page. No onclick event is assigned to any of the embedded anchor tags because all of the links now point within the Internet Archive, the host domain. Due to this design decision, robustify.js is never invoked within the archive.

When the user clicks the re-written link, the 2015-02-06 memento of http://www.dds.nl/~krantb/stellingen/ does not exist, so the Internet Archive redirects the user to the closest memento, http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/. The user ends up at the 1999 memento because the Internet Archive understands how to redirect from the 2015 URI-M, for which no memento exists, to the 1999 URI-M, for which one does. If the Internet Archive had no memento of http://www.dds.nl/~krantb/stellingen/ at all, the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel service to search additional archives.

The robustify.js file is archived (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js), but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by inserting the "yyyymmddhhmmss" and "url" variable strings:

archive:"http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"http://digitopia.nl/services/statuscode.php?url={url}"

These templates are rewritten during playback to be relative to the Internet Archive:

archive:"/web/20150206214020/http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"/web/20150206214020/http://digitopia.nl/services/statuscode.php?url={url}"

Because the robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) in our test page (http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html). In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2 (since test-r.html exists on www.cs.odu.edu, the links are relative to www.cs.odu.edu instead of archive.org).

Instead of issuing an HTTP GET for http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F, robustify.js issues an HTTP GET for
http://www.cs.odu.edu/web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F which returns an HTTP 404 when dereferenced.
The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to digitopia.nl/services/.
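The broken request can be reproduced in a few lines by resolving the re-written, root-relative service URI against the page that includes it (the URLs are those from the example above):

from urllib.parse import urljoin, quote

page = "http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html"
rewritten = "/web/20150206214020/http://digitopia.nl/services/statuscode.php?url={url}"
target = "http://www.dds.nl/~krantb/stellingen/"

# A root-relative URI resolves against the including page's host (www.cs.odu.edu),
# not archive.org, which yields the 404ing URI shown above.
print(urljoin(page, rewritten.replace("{url}", quote(target, safe=""))))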

In our test mementos, the Internet Archive also re-writes the URI-M http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/.

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.

We also created an Archive.today memento of our robustifyTest.html page: https://archive.today/l9j3O. In this memento, the functionality of the robustify script is removed, so clicking the link sends the user to http://www.dds.nl/~krantb/stellingen/, which results in an HTTP 404 response from the live web. The link to the Internet Archive memento is re-written to https://archive.today/o/l9j3O/http://www.dds.nl/~krantb/stellingen/, which results in a redirect (via a refresh) to http://www.dds.nl/~krantb/stellingen/, which again results in an HTTP 404 response from the live web. Archive.today uses this redirect approach as standard operating procedure: it re-writes all links to URI-Ms back to their respective URI-Rs.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that Archive.today simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to utilize URI-Ms to mitigate link rot or utilize tools to help discover mementos of defunct links, there is a potential that the archives may see additional challenges of this nature arising.


--Justin Brunelle

                       

2015-02-17: Reactions To Vint Cerf's "Digital Vellum"

Don't you just love reading BuzzFeed-like articles, constructed solely of content embedded from external sources?  Yeah, me neither.  But I'm going to pull one together anyway.

Vint Cerf generated a lot of buzz last week when, at an AAAS meeting, he gave a talk titled "Digital Vellum".  The AAAS version, to the best of my knowledge, is not online, but this version of "Digital Vellum" from CMU-SV earlier the same week is probably the same.



The media (e.g., The Guardian, The Atlantic, BBC) picked up on it, because when Vint Cerf speaks people rightly pay attention.  However, the reaction from archiving practitioners and researchers was akin to having your favorite uncle forget your birthday, mostly because Cerf's talk seemed to ignore the last 20 or so years of work in preservation.  For a thoughtful discussion of Cerf's talk, I recommend David Rosenthal's blog post.  But let's get to the BuzzFeed part...

In the wake of the media coverage, I found myself retweeting many of my favorite wry responses starting with Ian Milligan's observation:


Andy Jackson went a lot further, using his web archive (!) to find out how long we've been talking about "digital dark ages":



And another one showing how long The Guardian has been talking about it:


And then Andy went on a tear with pointers to projects (mostly defunct) with similar aims as "Digital Vellum":









Andy's dead right, of course.  But perhaps Jason Scott has the best take on the whole thing:



So maybe Vint didn't forget our birthday, but we didn't get a pony either.  Instead we got a dime kitty

--Michael

2015-03-02: Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month or maybe last week and had a difficult time recalling how the pieces fit together? Or, perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation.  I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took during the period from August to December of 2014, were part of a nine course specialization in the area of data science. The various topics included R Programming, Statistical Inference and Machine Learning. Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country. The courses I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work to be quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort each week.

While the data science courses are primarily focused on data collection, analysis and methods for producing statistical evidence, there was a persistent theme throughout -- this notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the data science specialization, illustrates the gap between no replication and the possibilities for full replication when both the data and the computer code are made available. This was a recurring theme that was reinforced with the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings and provide the code to a group of anonymous reviewers; other students in the course. This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.





I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments for scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research) or the Duke University medical research scandal which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered some studies, as noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications posted in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries which cannot be reproduced can also lead to retracted publications which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline; data which is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.


Cartoon by Sidney Harris (The New Yorker)


So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013 which lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas that were presented in the data science courses; a small sketch of how a few of them look in practice follows the list.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Archiving tools such as Subversion or Git can be used to track the evolution of code as it is being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations.
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
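As a small illustration of how a few of these rules look in practice (Rules 6 and 7 here, in a generic Python sketch with hypothetical file names; the courses themselves used R):

import csv
import random

RANDOM_SEED = 20150302          # Rule 6: record the exact seed used for the analysis
random.seed(RANDOM_SEED)

sample = [random.gauss(0, 1) for _ in range(1000)]

# Rule 7: store the raw values behind the plot in a standardized format,
# so the figure can be regenerated without redoing the entire analysis.
with open("figure1_raw_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])
    writer.writerows([v] for v in sample)

# Rules 4 and 10: this script itself is kept under version control (e.g., Git)
# and published alongside the data.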
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: Integrated development environment for R.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: New enhancements to the Sweave package for dynamic report generation. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Open source version control repository.
  • Apache Subversion (SVN): Open source version control repository.
  • iPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy

2015-03-10: Where in the Archive Is Michele Weigle?

(The title is an homage to the popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It's been captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/
http://www.cs.unc.edu/~clark/
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links re-written to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was only crawled 2 times, both in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so those both return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/
http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code. Everything that isn't 500-level (server error) is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.

Wouldn't it be cool if when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/
http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/
http://www.cs.unc.edu/~clark/
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two pages point to the same thing, should these two timemaps be merged? What happens if at some point in the future I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!

--Michele

2015-03-23: 2015 Capital Region Celebration of Women in Computing (CAPWIC)

On February 27-28, I attended the 2015 Capital Region Celebration of Women in Computing (CAPWIC) in Harrisonburg, VA on the campus of James Madison University.  Two of our graduating Masters students, Apeksha Barhanpur (ACM president) and Kayla Henneman (ACM-W president) attended with me.

With the snow that had blanketed the Hampton Roads region, we were lucky to get out of town on Friday morning.  We were also lucky that Harrisonburg had their foot of snow over the previous weekend so that there was plenty of time for all of the roads to be cleared.  We had some lovely scenery to view along the way.

We arrived a little late on Friday afternoon, but Apeksha and Kayla were able to attend "How to Get a Tech Job" by Ann Lewis, Director of Engineering at Pedago.  This talk focused on how each student has to pick the right field of technology for their career. The speaker presented some basic information on the different fields of technology and different levels of job positions and companies. The speaker also mentioned the "Because Software is Awesome" Google Group, which is a private group for students seeking information on programming languages and career development.

While they attended the talk, I caught up with ODU alum and JMU assistant professor, Samy El-Tawab.

After a break, I put on my Graduate Program Director hat and gave a talk titled "What's Grad School All About?"


I got to reminisce about my grad school days, share experiences of encountering the imposter syndrome, and discuss the differences between the MS and PhD degrees in computer science.


After my talk, we set up for the College and Career Fair.  ODU served as an academic sponsor, meaning that we got a table where we were able to talk with several women interested in graduate school.  Apeksha and Kayla also got to pass out their resumes to the companies that were represented.

I also got to show off my deck of Notable Women in Computing playing cards.  (You can get your own deck at notabletechnicalwomen.org.)


Our dinner keynote, "Technology and Why Diversity Matters," was given by Sydney Klein, VP for Information Security and Risk Management at Capital One. (Capital One had a huge presence at the conference.) One thing she emphasized is that Capital One now sees itself as more of a technology company than a bank. Klein spoke about the importance of women in technology and the percentages of women that are represented in the field at various levels. She also mentioned various opportunities present within the market for women.

After dinner, we had an ice breaker/contest where everyone was divided into groups, each with the task of creating a flag representing the group and its relation to the field of computer science. Apeksha was on the winning team!  Their flag represented the theme of the conference, "Women make the world work", and how the group was connected to the field of technology. Apeksha noted that this was a great experience, working with a group of women from different regions around the world.

On Saturday morning, Apeksha and Kayla attended the "Byte of Pi" talk given by Tejaswini Nerayanan and Courtney Christensen from FireEye. They demonstrated programming using the Raspberry Pi, a single-board computer.  The students were given a small demonstration on writing code and building projects.

Later Saturday, my grad school buddy, Mave Houston arrived for her talk.  Mave is the Founder and Head of USERLabs and User Research Strategy at Capital One. Mave gave a great talk, titled "Freedom to Fail". She also talked about using "stepping stones on the way to success." She let us play with Play-Doh, figuring out how to make a better toothbrush. My partner, a graduate student at Virginia State University, heard me talk about trying to get my kids interested in brushing their teeth and came up with a great idea for a toothbrush with buttons that would let them play games and give instructions while they brushed. Another group wanted to add a sensor that would tell people where they needed to focus their brushing.

We ended Saturday with a panel on graduate school that both Mave and I helped with and hopefully encouraged some of the students attending to continue their studies.

-Michele

2015-04-05: From Student To Researcher...

In 2010, I decided to again study at the Old Dominion University Computer Science Department for better employment opportunities. After taking some classes, I realized that I did not merely want to take classes and earn a Master's Degree, but also wanted to contribute knowledge, like those who wrote the many research papers I had read during my courses.

My Master's Thesis is titled "Avoiding Spoilers On MediaWiki Fan Sites Using Memento".   I came to the topic via a strange route.

During Dr. Nelson's Introduction to Digital Libraries course, we built a digital library based on a single fictional universe.  I chose the television show Lost, and specifically archived Lostpedia, a site that my wife and I used while watching and discussing the show.  We realized that fans were updating Lostpedia while episodes aired.  This highlighted the idea that wiki revisions created prior to the episode obviously did not contain information about that episode, and emphasized that episodes led to wiki revisions.

A few years later, a discussion about Game of Thrones occurred at work.  I realized that some of us had seen the episode from the night before while others had not.  We wanted to use the Game of Thrones Wiki to continue our conversation, but realized that those who had not seen the episode easily encountered spoilers.  By this point, I was quite familiar with Memento, had used Memento for Chrome, and was working on the Memento MediaWiki Extension.  The idea of using Memento to avoid spoilers was born.

The figure above exhibits the Naïve Spoiler Concept.  The concept is that wiki revisions in the past of a given episode should not contain spoilers, because the information has not yet been revealed by the episode, hence fans could not write about it.  Conversely, wiki revisions in the future of a given episode will likely contain spoilers, seeing as episodes cause fans to write wiki revisions.

It turned out that there was more to avoiding spoilers in fan wiki sites than merely using Memento and the Naïve Spoiler Concept.  Most TimeGates use a heuristic that is not reliable for avoiding spoilers, so I proposed a new one and demonstrated why the existing heuristic was insufficient by calculating the probability of encountering a spoiler using the current heuristic.  I also used the Memento MediaWiki Extension to demonstrate this new heuristic in action.  In this way I was able to develop a Computer Science Master's Thesis on the topic.

Mindist (minimum distance) is the heuristic used by most TimeGates. This works well for a sparse archive, because often the memento closest to the datetime you have requested is best.  Wikis have access to every revision, allowing us to use a new heuristic, minpast (minimum distance in the past, i.e., minimum distance without going over the given date).  Using records from fan wikis, I showed that, if one is trying to avoid spoilers, there can be as much as a 66% chance of encountering a spoiler if we use the Wayback Machine or a Memento TimeGate using mindist.  I also analyzed Wayback Machine logs for wikia.com requests and found that 19% of those requests ended up in the future.  From these studies, it was clear that using minpast directly on wikis was the best way to avoid spoilers.
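To make the difference between the two heuristics concrete, below is a minimal Python sketch; the revision datetimes and episode air time are hypothetical, and this is not the thesis's actual implementation.

```python
from datetime import datetime

def mindist(revisions, target):
    """The heuristic most TimeGates use: pick the revision closest to the
    requested datetime, in either direction."""
    return min(revisions, key=lambda r: abs(r - target))

def minpast(revisions, target):
    """The spoiler-avoiding heuristic: pick the closest revision that does not
    come after the requested datetime."""
    past = [r for r in revisions if r <= target]
    return max(past) if past else None

# Hypothetical wiki revision datetimes surrounding an episode's air time.
revisions = [datetime(2014, 4, 6, 20, 0), datetime(2014, 4, 7, 3, 0)]
requested = datetime(2014, 4, 7, 2, 0)

print(mindist(revisions, requested))  # 2014-04-07 03:00 -- a likely spoiler
print(minpast(revisions, requested))  # 2014-04-06 20:00 -- safely in the past
```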

While I was examining fan wikis for spoilers, I also had the opportunity to compare wiki revisions with mementos recorded by the Internet Archive.  Using this information I was able to reveal how the Internet Archive's sparsity is changing over time.  Because wikis keep track of every revision, we can see which updates the Internet Archive missed.

In the figure above, we see a timeline for each wiki page included in the study.  The X-axis shows time and the Y-axis consists of an identifier for each wiki page.  Darker colors indicate more missed updates by the Internet Archive.  We see that the colors are getting lighter, meaning that the Internet Archive has become more aggressive in recording pages.

Below are the slides for the presentation, available on my SlideShare account, followed by the video of my defense posted to YouTube.  The full document of my Master's Thesis is available here.







Thanks to Dr. Irwin Levinstein and Dr. Michele Weigle for serving on my committee.  Their support has been invaluable during this process. Were it not for Dr. Levinstein, I would not have been able to become a graduate student.  Were it not for Dr. Weigle's wonderful Networking class, I would not have been able to draw some of the conclusions necessary to complete this thesis.

Much of the thanks goes to my advisor, Dr. Michael L. Nelson, who spent hours discussing these concepts with me, helping correct my assumptions and assessments when I erred, while praising the experience when I came up with something original and new.  His patience and devotion not only to the area of study, but also the art of mentoring, led me down the path of success.

In the process of creating this thesis, I also created a technical report which can be referenced using the BibTeX code below.



So, what is next?  Do I use wikis to study the problem of missed updates in more detail? Do I study the use of the naïve spoiler concept in another setting?  Or do I do something completely different?

I realize that I have merely begun my journey from student to researcher, but know even more now that I will enjoy the path I have chosen.

--Shawn M. Jones, Researcher

2015-04-20: Virginia Space Grant Consortium Student Research Conference Report

Mat Kelly and various other graduate students in the state of Virginia present their graduate research at the Virginia Space Grant Consortium.                           

On Friday, April 17, 2015 I attended the Virginia Space Grant Consortium (VSGC) Student Research Conference at NASA Langley Research Center (LaRC) in Hampton, Virginia. This conference is slightly beyond the scope of what we at ODU WS-DL (@webscidl) usually investigate, as the research requirement was that it be relevant to NASA's objectives as a space agency.

My previous work with LaRC's satellite imagery allowed me to approach the imagery files from the perspective of a computational scientist. More on my presentation, "Facilitation of the A Posteriori Replication of Web Published Satellite Imagery", below.

The conference started off with registration and a provided continental breakfast. Mary Sandy, the VSGC Director, and Chris Carter, the VSGC Deputy Director, began by describing the history of the Virginia Space Grant Consortium program, including the amount contributed since its inception and the number of recipients that have benefitted from being funded.

The conference was organized in a model consisting of concurrent sessions of two to three themed presentations by undergraduate and graduate students at various Virginia universities.

First Concurrent Sessions

I attended the "Aerospace" session in the first morning session. In this session Maria Rye (Virginia Tech) started with her explorative research in suppressing distortions in tensegrity systems, a flexible structure held together by interconnected bars and tendons.

Marie Ivanco (Old Dominion University) followed Maria with her research in applying Analytic Hierarchy Processes (AHPs) for analytical sensitivity analysis and local inconsistency checks for engineering applications.

Peter Marquis (Virginia Tech) spoke third in the session with his research on characterizing the design variables to trim the LAICE CubeSat to obtain a statically stable flight configuration.

Second Concurrent Sessions

The second sessions seamlessly continued with Stephen Noel (Virginia Tech) presenting a similar work relating to LAICE. His work consisted of the development of software to read, parse, and interpret calibration data for the system.

Cameron Orr (Virginia Tech) presented the final work in the second Aerospace session with the exploration of the development of adapted capacitance manometers for thermospheric applications. Introducing this additional component as well as some detection circuitry allowed more accurate measurement of pressure changes.

Third Concurrent Sessions

After a short break where posters from graduate students around Virginia were presented, I opted to move to another room to view the Applied Science presentations.

Atticus Stovall (University of Virginia) described his system for modeling forest carbon using height-to-biomass relationships as well as voxel-based volume modeling as a means of evaluating the amount of carbon stored.

Matthew Giarra (Virginia Tech) wrapped up the short session with a visual investigation of the flow of hemolymph (blood) in insects' bodies as a potential model for non-directional fluid pumping.

Fourth Concurrent Sessions

The third session immediately segued into the fourth session of the day, where I changed rooms to attend the Astrophysics presentations.

Charles Fancher (William & Mary) presented work on a theoretical prototype for an ultracold atom-based magnetometer for accurate timekeeping in space.


John Blalock (Hampton University) presented next in the Astrophysics session with his work on using various techniques to measure wind speeds on Saturn from the results returned by the Cassini orbiter's Imaging Science Subsystem.


Kimberly Sokal (University of Virginia) wrapped up the fourth session with her enthusiastic presentation on emerging super star clusters with Wolf-Rayet stars. Her group discovered that the star cluster S26 in NGC 4449 is undergoing an evolutionary transition that is not well understood. The ongoing work may provide feedback as to the tipping point of the emerging process that affects the super star cluster's ability to remain bound.

The conference then broke for an invitation-only lunch with a keynote address by Dr. David Bowles, Acting Director of NASA Langley Research Center.

Fifth Concurrent Sessions

For the final session of the day, I attended and presented at the Astrophysics session. Emily Mitchell (University of Virginia) presented first with her study on the irradiation effects of H2-laden porous water ice films in the interstellar medium (ISM). She exposed ice to hydrogen gas at different pressures after deposition and during radiation. She reported that H2 concentration increases with decreasing ion flux, suggesting that as much as 7 percent solid H2 is trapped in interstellar ice by radiation impacts.

Following Emily, I (Mat Kelly, your author, of Old Dominion University) presented my work on the Facilitation of the A Posteriori Replication of Web Published Satellite Imagery. By creating software to mine the metadata and a system that allows peer-to-peer sharing of the public domain satellite imagery currently located solely on the NASA Langley servers, I was able to mitigate the reliance on a single source of the data. The system I created utilizes concepts from ResourceSync, BitTorrent, and WebRTC.

Wrap Up

The Virginia Space Grant Consortium Student Research Conference was extremely interesting despite being somewhat different in topic compared to our usual conferences. I am very glad that I got the opportunity to do the research for the fellowship and hope to progress the work for further applications beyond satellite imagery.

Mat (@machawk1)

2015-05-07: Teaching Undergraduate Computer Science Using GitHub and Docker

Mat Kelly taught CS418 - Web Programming at Old Dominion University in Spring 2015. This blog post highlights some teaching methods and technologies used (namely, Docker and GitHub) and how he integrated their usage into the flow of the course.                           

For Spring Semester at Old Dominion University I taught CS418 - Web Programming with some updated methods and content. This course has been previously taught by various members of ODU WS-DL (2014, 2013, 2012).

The first deviation from previous offerings of the course was to change the subject of the project. Previously, CS418 students were asked to progressively build an online forum like phpBB. Web sites resembling this medium are no longer as common as they once were on the Web, so a refresh was needed to keep the project familiar and relevant.

For Spring, I asked students to build a Question-and-Answer website akin to StackOverflow.com. Being students of computer science, all were familiar with the contemporary model of online discussions and soliciting help from others experienced in an area (e.g., computer programming).

The coursework began with lectures about Web Fundamentals, followed by more technical lectures on PHP, MySQL, JavaScript, and an HTML/CSS primer for those students who had programmed but never created a web page. The lectures were old news for some students, who were already employed (CS418 is a senior-level course), and completely new for others, who had programmed but never for the Web.

The delivery of the project is an aspect that made this semester's course unique. In a preliminary assignment very early in the semester, I required each student to:

  1. Fork the class GitHub repository
  2. Pull a working copy to their system
  3. Add a single file to the repository
  4. Commit the change
  5. Submit a pull request to the class repository

This ensured a base knowledge of version control dynamics and, via the single file submitted, also required the students to provide a reference to the repository for their class project. A student's project repository was different from the fork of the class repository.

GitHub inherently facilitates sharing of source code - an aspect that I did not particularly want to encourage with the individual students' projects. The GitHub Student Developer Pack provided a solution for this. By contacting GitHub and providing proof of student status, each student was supplied a small number of private repositories, which would normally require a monthly fee. The program also offers students many other benefits free of charge, such as credit on a cloud hosting platform, a free domain name on the .me TLD, and private builds from one of the more popular continuous integration services.

Along with submitting a pull request, I also asked the students to add me as a "collaborator" on GitHub for the repository they each specified, allowing access for grading whether or not the student decided to take advantage of the Student Developer Pack.

As the students began to build features for each of the four milestone requirements in the course, I reiterated that what was checked into their GitHub repository come demo day is what they would be graded on. This circumvented the "my computer crashed" and "the dog ate my homework" excuses, but introduced the issue of "I must have forgotten to check my updated code in". To remedy this, but mainly to allow students to verify that their code would work as expected on demo day, I put together a demo day deployment system using Docker.

Docker allows easy, systematic deployment of software that is sandboxed from a host system yet extensible to communicate between multiple instances ("containers" in Docker jargon). Using Docker allowed a student to iteratively test the code they had checked into their GitHub repository from the comfort of their home while instilling confidence on the correctness of the features they had implemented thus far. While previous offerings of the class provided students with a Virtual Machine (VM) on which to develop their project, I opted to use Docker instead, as it provides an isolated environment for each student with a freshly installed OS each time their code is deployed. Docker also allowed the packages and libraries needed by the students for production to be parameterized. A downside to using Docker over a VM is the students' reliance on a central server for deployment. However, this "benefit" of VMs does not guarantee the consistency of presentation for demo day, as a local VM might be configured differently than the demo day machine.

Our Docker Deployment System was hosted on a server at ODU but was accessible to the world. Each student was supplied a unique port number, allowing students to simultaneously use the system without fear of clashing with other students testing. The system evolved as the semester drew on and I continuously developed it. Using the system is fairly easy and intuitive.

The student first enters their ODU CS username in a text field.

The Docker Deployment System dynamically queries the class GitHub repository to link the CS username to a student's GitHub repository, as previously submitted. This cross-referencing prevented abuse by GitHub users who were not registered for the class and required students to execute the procedure as a prerequisite for demo day (i.e., submission of assignments).

The user can then authenticate with GitHub by clicking a button, which brings up the login dialog on the GitHub website.

Upon successful login, the user is returned to the Docker Deployment System interface with the same button now reading "Dockerize my code". Selecting this button invokes a server side scripted process.

Brief messages are shown to the user to indicate the script process that is being followed on the server. In sequence, the script:

  1. Deletes any old remnants from previous deployments by the student
  2. Clones the user's repository using Git and the GitHub API access token, obtained from the user logging in (this is critical if the user's project repository is private)
  3. Kills previously deployed Docker instances spawned by the user
  4. Removes the previously deployed instances to ensure a fresh copy is used by Docker
  5. Fires up a new container in the Docker Deployment System using the student's latest code (from the above repo clone)
  6. Provides HTML links to the student to test their code.
Docker containers are defined in a Dockerfile, a standard format that references the base OS and any packages required for the container. For the students' deployment, I used Ubuntu as the base along with Apache, PHP, and MySQL in the CS418 Dockerfile. A directive in the Dockerfile also provides the hook to allow the directory containing the student's code to be used as the default "website" for Apache. The students provided a MySQL database dump in the root of their project repository, which is loaded when the container for their project is instantiated.
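As a rough illustration of the deployment steps described above, here is a minimal Python sketch of such a server-side script; the image name, paths, port handling, and hostname are all hypothetical, and this is not the actual CS418 system's code.

```python
import subprocess

def dockerize(cs_username, repo_url, port, token):
    """Hypothetical sketch of the server-side deployment steps: wipe old
    remnants, clone the student's repository with the OAuth token, replace
    any running container, and start a fresh one on the student's port."""
    workdir = f"/tmp/cs418/{cs_username}"
    container = f"cs418_{cs_username}"

    # 1-2. Delete old remnants and clone the (possibly private) repository.
    subprocess.run(["rm", "-rf", workdir], check=True)
    authed_url = repo_url.replace("https://", f"https://{token}@")
    subprocess.run(["git", "clone", authed_url, workdir], check=True)

    # 3-4. Kill and remove any previously deployed container for this student.
    subprocess.run(["docker", "rm", "-f", container], check=False)

    # 5. Fire up a new container from the class image, mounting the student's
    #    code as the Apache document root.
    subprocess.run(["docker", "run", "-d", "--name", container,
                    "-p", f"{port}:80",
                    "-v", f"{workdir}:/var/www/html",
                    "cs418-base"], check=True)

    # 6. Return the link the student can use to test their deployment.
    return f"http://example-server.cs.odu.edu:{port}/"
```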

For the most part, the initial bumps for the students to effectively use the system were overcome. Students reiterated throughout the semester that the tool was extremely useful in testing their code and ensuring that nothing unexpected would occur on demo day.

In summary, the usage of the Docker Deployment System developed for the Spring 2015 session of CS418 Web Programming at Old Dominion University and the required submission of coursework via GitHub allowed students to gain experience with tools and iterative testing that previous models of verifying code submissions (e.g., "magic" laptops and e-mailed code, respectively) are unable to effectively facilitate. The project-based nature of CS418 was an appropriate testing medium for developing both the system and the workflow. In the future, I hope to reuse the system and workflow to teach a less technically driven course to evaluate the portability of the methods.

Special thanks to Sawood Alam (@ibnesayeed) for his technical assistance in working with Docker throughout the semester and Minhao Dong for being the ODUCS access point to ensure that students' project deployment did not compromise the university network.

Mat (@machawk1)

2015-05-09: IIPC General Assembly 2015 Trip Report

The day before the International Internet Preservation Consortium (IIPC) General Assembly 2015 we landed in San Francisco, where some delicious Egyptian dishes were waiting for us. Thank you Ahmed, Yasmin, Moustafa, Adrian, and Yusuf for hosting us. It was a great way to spend the evening before the IIPC GA and we were delighted to see you all after a long time.

Day 1

We (Sawood Alam, Michael L. Nelson, and Herbert Van de Sompel) entered the conference hall a few minutes after the session had started, and Michael Keller from Stanford University Libraries was about to leave the stage after the welcome speech. IIPC Chair Paul Wagner gave brief opening remarks and invited the keynote speaker, Vinton Cerf from Google, on stage. The title of the talk was "Digital Vellum: Interacting with Digital Objects Over Centuries" and it was an informative and delightful talk. He mentioned that high-density, low-cost storage media are evolving, but the devices to read them might not last long. While mentioning Internet-connected picture frames and surfboards, he added that we should not forget about security. To emphasize the security aspect he gave an example: grandparents would love to see their grandchildren in those picture frames, but will not be very happy if they see something they do not expect.

Moving on to software emulators, he invited Mahadev Satyanarayanan from Carnegie Mellon University to talk about their software archive and emulator called the Olive Archive. Satya gave various live demos including the Great American History Machine, ChemCollective (a copy of the website frozen at a certain time), PowerPoint 4.0 running in Windows 3.1, and the Oregon Trail, all powered by their virtual machines and running in a web browser. He also talked about the architecture of the Olive Archive and how, in the future, multiple instances could be launched and orchestrated to emulate a subset of the Internet for applications that rely on external services, with some instances running those services independently.

In the Q&A session someone asked Cerf how to get big companies like Google to provide the data from their Crisis Response efforts for archiving after they are done with it. Cerf responded, "you just did," while acknowledging the importance of such data for archiving. Here are some tweets that captured the moment:

After the break Niels Brügger and Janne Nielsen presented their case study of the Danish websphere under the title "Studying a nation's websphere over time: analytical and methodological considerations". Their study covered website content, file types, file sizes, backgrounds, fonts, layout, and, more importantly, domain names. They also raised points such as the size of the ".dk" domain, geolocation, inter- and intra-domain link networks, and whether Danish websites are actually in the Danish language. They talked about some crawling challenges. Their domain name analysis shows that only 10% of owners own 50% of all ".dk" domains. I suspected that this result might be due to private domain name registrations, so I talked to them later and they said they had not thought about private registrations, but they will revisit their analysis.

Andy Jackson from the British Library took the stage with his presentation titled "Ten years of the UK web archive: what have we saved?". This case study covers three collections: the Open Archive, the Legal Deposit Archive, and the JISC Historical Archive. These collections store over eight billion resources in over 160 TB of compressed files and are now adding about two billion resources per year. With the help of a nice graph he illustrated that not all ".uk" domains are interlinked, so to maximize the coverage the crawlers need to include other popular TLDs such as ".com". He also presented an analysis of reference rot and content drift utilizing the "ssdeep" fuzzy hash algorithm. Their analysis shows that 50% of resources are unrecognizable or gone after one year, 60% after two years, and 65% after three years.
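For readers unfamiliar with ssdeep, the following minimal Python sketch shows the kind of fuzzy-hash comparison involved, assuming the python-ssdeep bindings are installed; the captures are hypothetical and this is not the British Library's analysis code.

```python
import ssdeep  # python-ssdeep bindings for the ssdeep fuzzy hashing library

# Hypothetical captures of the same URL taken a year apart.
capture_2013 = "<html><body>Departmental news, seminars, and research updates.</body></html>"
capture_2014 = "<html><body>This domain has expired. Renew it now!</body></html>"

hash_2013 = ssdeep.hash(capture_2013)
hash_2014 = ssdeep.hash(capture_2014)

# ssdeep.compare returns a 0-100 similarity score; a low score suggests the
# content has drifted substantially or is effectively gone.
print(ssdeep.compare(hash_2013, hash_2014))
```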

I had lunch together with Scott Fisher from the California Digital Library. I told him about the various digital library and archiving research projects we are working on at Old Dominion University, and he described the holdings of his library and the challenges they have in upgrading their Wayback to bring Memento support.

After the lunch, the keynote speaker of the second session, Cathy Marshall from Texas A&M University, took the stage with a very interesting title, "Should we archive Facebook? Why the users are wrong and the NSA is right". She motivated her talk with some interview-style dialogues around the primary question, "Do you archive Facebook?", to which the answer was mostly "No!". She highlighted that people have developed the [wrong] sense that Facebook is taking care of their stuff, so they do not have to. She also noted that people usually do not value their Facebook content, or they think it has immediate value but no archival value. In a large survey she asked whether Facebook should be archived; three-fourths objected and half of them said "No" unconditionally. In the later part of her talk, she built the story of the marriage of Hal Keeler and Joan Vollmer by stitching together various cuttings from local newspapers. I am not sure if I could fully appreciate the story due to the cultural difference, but I laughed when everyone else did. I did follow her efforts and intention to highlight the need of archiving social media for future historians. And if you ask me, is the NSA right? My answer would be, "Yes, if they do it correctly with all the context included."

Meghan Dougherty from Loyola University Chicago and Annette Markham from Aarhus University presented their talk "Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life". They illustrated how, sometimes intentionally or unintentionally, people record moments of their lives with different media. Among various visual illustrations, I particularly liked the video of a street artist playing with a ring that was posted on Facebook in a very different context than the one in which it appeared on YouTube. They ended their talk with a hilarious video of Friendster.

Susan Aasman from the University of Groningen presented her talk "Everyday saving practices: "small data" and digital heritage strategies". This talk was full of motivation for why people should care about personally archiving their daily life moments. She described how the service Kodak Gallery launched in 2001 with the tag-line "live forever", and closed in 2012 after transferring billions of images to Shutterfly, which was only available to US customers. As a result, people from other countries lost their photo memories. She also played the Bye Bye Super 8 video by Johan Kramer, which was amusing and motivating for personal archiving.

After a short break Jane Winters from the Institute of Historical Research, Helen Hockx-Yu from the British Library, and Josh Cowls from the Oxford Internet Institute took the stage with their topic "Big UK domain data for Arts and Humanities", also known as the BUDDAH project. Jane highlighted the value of archives for research and described the development of a framework to help researchers leverage the archives. She illustrated the interface for the Big Data analysis of the BUDDAH project, described the planned output, and presented various case studies showing what can be done with that data.

Helen Hockx-Yu began her talk "Co-developing access to the UK Web Archive" with reference to the earlier talk by Andy. She noted that a scenario that fits everyone's needs is difficult. She described the high-level requirements, including query building, corpus formation, annotation and curation, and in-corpus and whole-dataset analysis. She illustrated the SHINE interface that provides features like full-text search, multi-facet filters, query history, and result export.

Finally, Josh Cowls presented his talk about the book "The Web as History: Using Web Archives to Understand the Past and the Present", to which he contributed a chapter. He talked about four second-level domains of the ".uk" TLD, including ".co.uk", ".org.uk", ".ac.uk", and ".gov.uk", and how they are interlinked. He described the growth of the web presence of the BBC and British universities.

IIPC Chair Paul Wagner concluded the day by emphasizing that we have only started scratching the surface. He also noted in his concluding remarks that the context matters.

Day 2

Herbert Van de Sompel from Los Alamos National Laboratory started the second day's sessions by talking about "Memento Time Travel". He started with a brief introduction to Memento followed by a bag full of announcements. For ease of use in JavaScript clients, Memento now supports JSON responses along with the traditional Link format. The Memento aggregator now provides responses in two modes: DIY (Do It Yourself) and WDI (We Do It). The service now also allows exporting the Time Travel Archive Registry in a structured format. Due to the default Memento support in Open Wayback, various Web archives now natively support Memento. There is an extension available for MediaWiki to enable Memento support in it. Herbert described Robust Links (Hiberlink) and how it can be used to avoid reference rot. He said that their service usage is growing, hence they upgraded the infrastructure and are now using the Amazon cloud for hosting services. He noted that going forward everyone will be able to participate by running Memento service instances in a distributed manner to provision load-balancing. He also demonstrated Ilya's work on constructing composite mementos from various sources to minimize temporal inconsistencies while visualizing the sources of mementos.
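As a small illustration of the JSON responses mentioned above, the sketch below queries the Memento Time Travel service from Python; treat the exact endpoint pattern and response keys as assumptions rather than a definitive API reference.

```python
import requests

uri = "http://example.com/"   # hypothetical original resource
accept_dt = "20150501120000"  # requested datetime, YYYYMMDDhhmmss

# Assumed Time Travel JSON endpoint: /api/json/<datetime>/<uri>
resp = requests.get(
    f"http://timetravel.mementoweb.org/api/json/{accept_dt}/{uri}")
resp.raise_for_status()
data = resp.json()

# The response describes mementos aggregated across participating archives;
# "closest" is the memento nearest the requested datetime.
print(data.get("mementos", {}).get("closest", {}))
```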

Daniel Gomes from the Portuguese Web Archive talked about "Web Archive Information Retrieval". He started by classifying web archive information needs into three categories: Navigational, Informational, and Transactional. He noted that the usual way of accessing an archive is URL search, which might not be known to the users. An alternate method is full-text search, which poses the challenge of relevance. Daniel described various relevance models in great detail and how to select features to maximize relevance. He announced that the dataset and code are available for free under an open source license. The code is hosted on Google Code, but due to their announcement of sunsetting the service, the code will be migrated to GitHub soon.

After this talk, there was a short break followed by the announcement that the remaining sessions of the day would have two parallel tracks. It was a hard decision to choose one track or the other, but I can watch the missed sessions later when the video recordings are made available. The parallel sessions were interfering with each other, so the microphone was turned off.

After the break Ilya Kreymer gave a live demo of his recent work "Web Archiving for all: Building WebRecorder.io". He acknowledged the collaboration with Rhizome and announced the availability of an invite-only beta implementation of WebRecorder. He demonstrated how WebRecorder can be used to perform personal archiving in What You See Is What You Archive (WYSIWYA) mode.

Zhiwu Xie from Virginia Tech presented "Archiving transactions towards an uninterruptible web service". He described an indirection layer between the web application server and the client that archives each successful response; when the server returns 4xx/5xx failure responses, it serves the most recent copy of the resource from the transactional archive. It is similar to services like CloudFlare in functionality from the clients' perspective, but it has the added advantage of building a transactional archive for website owners. Zhiwu demonstrated the implementation by reloading two web pages multiple times, of which one was utilizing the UWS and the other was directly connected to the web application server, which was returning the current timestamp with random failures. He mentioned that the system is not ready for prime time yet.
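The idea of the indirection layer can be sketched in a few lines of Python; this is only an illustration of the concept described above, not Zhiwu's implementation.

```python
import requests

# Maps a URL to the body of its most recent successful response.
transactional_archive = {}

def fetch_with_fallback(url):
    """Archive every successful origin response; on 4xx/5xx or a network
    failure, serve the most recent archived copy instead."""
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code < 400:
            transactional_archive[url] = resp.text
            return resp.text
    except requests.RequestException:
        pass
    return transactional_archive.get(url, "503 Service Unavailable")
```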

During the lunch break I was with Andy, Kristinn, and Roger, where we had a free-style conversation on advanced crawlers, CDX indexer memory error issues, the possibility of implementing the CDX indexer in Go, separating data and view layers in Wayback for easy customization, some YouTube videos such as "Is Your Red The Same as My Red?" and the hilarious "If Google was a Guy", TED talks such as "Can we create new senses for humans?", "Evacuated Tube Transport Technologies (ET3)", and the possible weather of Iceland around the time IIPC GA 2016 is scheduled.

Jefferson Bailey presented his talk on "Web Archives as research datasets". With various examples and illustrations from Archive-It collections he established the point that web archives are great sources of data for various kinds of research. He noted that WAT is a compact and easily parsable metadata file format that is about 18% of the size of the WARC data files.

Ian Milligan from the University of Waterloo presented his talk on "WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Three Types of Web Archives". He described the importance of web archives and why historians should use them. His talk was primarily based on three case studies: the Wide Web Scrape, the GeoCities End-of-Life Torrent, and the Archive-It longitudinal collection of Canadian Political Parties & Labour Organizations. I enjoyed his style of storytelling, some mesmerizing visualizations, and in particular the GeoCities case study. He noted that the GeoCities data was not in the form of WARC files; instead, it was a regular Wget crawl.

After a short break Ahmed AlSum from the Stanford University Library (and a WS-DL alumnus) presented his work on "Restoring the oldest U.S. website". He described how he turned yearly backup files of the SLAC website from 1992 to 1999 into WARC and CDX files with the help of Wget and by applying some manual changes to mimic the effect as if it had been captured in those early days. These transforms were necessary to allow the modern Open Wayback system to correctly replay it. Ahmed briefly handed the microphone over to Joan Winters, who was responsible for taking backups of the website in the early days, and she described how they did it. Ahmed also mentioned that the Wayback codebase had 1996 hardcoded as the earliest year, which was fixed by making it configurable.

As an afterthought, I would love to see this effort combined with Satya's Olive Archive so that everything from the server stack to the browser experience can be replicated as close to the original environment as possible.

Federico Nanni from the University of Bologna presented "Reconstructing a lost website". Looking at the schedule, my first impression was that it was going to be a talk about tools to restore any lost website and reconstruct all the pages and links with the help of archives. I was wondering if they were aware of Warrick, a tool that was developed at Old Dominion University with this very objective. But it turned out to be a case study of the world's oldest university, established around 1088. One of the many challenges he mentioned in reconstructing the university website was the exclusion of the site from the Wayback Machine for unknown reasons, which they tried to resolve together with the Internet Archive. Amusingly, one of the many sources of collecting snapshots was a clone of the site prepared by student protesters.

The last speaker of the second day, Michael L. Nelson from Old Dominion University, presented the work of his student Scott G. Ainsworth, "Evaluating the temporal coherence of archived pages". With an example of the Weather Underground site he demonstrated how unrealistic pages can be constructed by archives due to temporal violations. He acknowledged that, among various categories of temporal violations, there are at least 5% of cases with a provable temporal violation. He also noted that temporal violation is not always a concern.

Day 3

The third day's sessions were in the Internet Archive building in San Francisco instead of the usual Li Ka Shing Center at Stanford University, Palo Alto. A couple of buses transported us to the IA and we enjoyed the bus trip in the valley as the weather was very good. The IA staff were very humble and welcoming. The emulator of classic games installed in the lobby of the IA turned out to be the prime center of attraction. We came to know some interesting facts about the IA, such as that the building was a church which was acquired because of its similarity to the IA logo, and that the pillows in the hall were contributed by various websites with the domain name and logo printed on them.

Sessions before lunch were mainly related to consortium management and logistics; these included Welcome to the Internet Archive by Brewster Kahle, the Chair address by Paul Wagner, the Communication report by Jason Webber, the Treasurer report by Peter Stirling, and Consortium renewal by the chair, followed by break-out discussions to gather ideas and opinions from the IIPC members on various topics. Also, the date and venue for the next general assembly were announced: April 11, 2016, in Reykjavik, Iceland.

After the lunch break, your author, Sawood Alam from Old Dominion University, presented the progress report on the "Profiling web archives" project, funded by IIPC. With the help of some examples and scenarios he established the point that the long tail of archives matters. He acknowledged the growing number of Memento compliant archives and the growing use of the Memento aggregator service. In order for the Memento aggregator to perform efficiently, it needs query routing support apart from caching, which only helps when requests are repeated before the cache expires. He then acknowledged two earlier profiling efforts, one being a complete-knowledge profile by Sanderson and the other a minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He evaluated his findings and concluded that his work so far gained up to 22% routing precision with less than 5% cost relative to the complete-knowledge profile, without any false negatives. Sawood also announced the availability of the code to generate profiles and benchmark them in a GitHub repository. In a later wrap-up session the chair Paul Wagner referred to Sawood's motivation slide in his own words, "sometimes good enough is not good enough."

During the break various IA staff members gave us a tour of the IA facility, including the book scanners, the television archive, an ATM, a storage rack, and the music and video archive where they convert data from old recording media such as vinyl discs and cassettes.

After the break a historian and writer, Abby Smith Rumsey, talked about "The Future of Memory in the Digital Age". Her talk was full of insightful and quotable statements. I will quote one of my favorites and leave the rest in the form of tweets. She said, "ask not what we can afford to save; ask what we can afford to lose".

Finally the founder of the Internet Archive, Brewster Kahle, took the stage and talked about digital archiving and the role of the IA in the form of various initiatives, including the book archive, music archive, and TV archive to name a few. He described the zero-sum book lending model utilized by the Open Library for books that are not free for unlimited distribution. He invited all the archivists to create a common collective distributed library where people can share their resources such as computing power, storage, man power, expertise, and connections. During the Q&A session I asked whether, when he thinks about collaboration, he envisions a model similar to inter-library loan, where peer libraries refer to other places in the form of external links if they don't have the resources but others do, or whether, in contrast, they will copy each other's resources. He responded, "both."

The chair gave a wrap-up talk and formally ended the third day's session. The buses still had some time before they left, so people engaged in conversation, games, and photographs while enjoying drinks and food. I particularly enjoyed a local ice cream named "It's-It" recommended by an IA staff member.

Day 4

On the fourth day Sara Aubry presented her talk on "Harvesting Digital Newspapers Behind Paywalls" in Berge Hall A, where the Harvesting Working Group was gathered, while IIPC's communication strategy session was going on in Hall B. She discussed her experience of working with news publishers to make their content more crawler friendly. Some of the crawling and replay challenges include paywalls requiring authentication to grant access to the content and the inclusion of a daily changing date string in the seed URIs. They modified the Wayback to fulfill their needs, but the modifications are not committed back to the upstream repository. She said that if it is useful for the community, then the changes can be pushed to the main repository.

Roger Coram presented his talk on "Supplementing Crawls with PhantomJS". I found his talk quite relevant to my colleague Justin Brunelle's work. This is a necessary step to improve the quality of the crawls, especially when sites are becoming more interactive with extensive use of JavaScript. For some pages, he uses CSS selectors and takes screenshots to later complement the rendering.

Kristinn Sigurðsson engaged everyone in a conversation about the "Future of Heritrix". He started with the question, "is Heritrix dead?" and I said to myself, "can we afford this?". This ignited the talk about what can be done to increase the activity on its development. I asked what is slowing down the development of Heritrix: is it out of ideas and new feature requests, or are there not enough contributors to continue the development? There was no clear answer to this question, but it helped continue the discussion. I also suggested that if new developers are afraid of making changes that would break the system and discourage upgrades, then we could introduce a plug-in architecture where new features can be added as optional add-ons.

Helen Hockx-Yu took the microphone and talked about Open Wayback development. She gave a brief introduction to the development workflow and the periodic telecons. She also talked about the short and long term development goals, including better customization and internationalization support, displaying more metadata, ways to minimize live leaks, and acknowledging/visualizing temporal coherence.

After a short break Tom Cramer gave his talk on "APIs and Collaborative Software Development for Digital Libraries". He formally categorized the software development models into five categories. He suggested that IIPC take the position of unifying the high-level API for each category of archiving tools so that they can interoperate. This was very appealing to me because I was thinking along the same lines and had done some architectural design of an orchestration system that achieves the same goal via a layer of indirection.

Daniel Vargas from LOCKSS presented his talk on "Streamlining deployment of web archiving tools" and demonstrated the usage of Docker containers for deployment. He also demonstrated the use of plain WARC files on a regular file system and in HDFS with Hadoop clusters. I was glad to see someone else deploying the Wayback machine in containers, as I was pushing some changes to the Open Wayback repository that will make containerization of Wayback easier.

During the lunch break Hunter Stern from IA approached me and told me about the Umbra project to supplement the crawling of JS-rich pages. After the lunch there was a short open mic session where every speaker got four minutes to introduce exciting stuff they are working on. Unfortunately, due to the shortage of time I could not participate in it.

After the lunch break the Access Working Group gathered to talk about "Data mining and WAT files: format, tools and use cases". Peter Stirling, Sara Aubry, Vinay Goel, and Andy Jackson gave talks on "Using WAT at the BnF to map the First World War", "The WAT format and tools for creating WAT files", and "Use cases at Internet Archive and the British Library". Vinay had some really neat, interactive visualizations based on the WAT files. I talked to Vinay during the break and we had some interesting ideas to work on, such as building a content store indexed by hashes while using WAT files in conjunction for replay, and a WebSocket-based BOINC implementation in JavaScript to perform Hadoop-style distributed research operations on IA data on users' machines.

After a short break the Access Working Group talked about "Full-text search for web archives and Solr". Anshum Gupta, Andy Jackson, and Alex Thurman presented "Apache Solr: 5.0 and beyond", "Full-text search for web archives at the British Library", and "Solr-based full-text search in Columbia's Human Rights Web Archive" respectively. Anshum's talk was on the technical aspects of Solr while the other two talks were more case studies.

Day 5

On the last day of the conference the Collection Development and Preservation Working Groups discussed their current state and plans in separate parallel tracks. Before the break I attended the Collection Development Working Group. They demonstrated Archive-It account functionality. I expressed the need for a web-based API to interact with the Archive-It service. I gave the example of a project I was working on a few years ago in which a feed reader periodically reads news feeds and sends the articles to a disaster classifier that Yasmin AlNoamany and I (Sawood Alam) built. If the classifier classifies the news article as being in the disaster category, we wanted to archive that page immediately. Unfortunately, Archive-It did not provide a way to do that programmatically (unless we used page scraping or some headless browser), so we ended up using the WebCite service instead.

After the break I moved to the Preservation Working Group track where I had a talk scheduled. David S.H. Rosenthal presented his talk on "LOCKSS: Collaborative Distributed Web Archiving For Libraries". He described the workings of LOCKSS and how it has benefited the publishing industry. He described how Crawljax is used in LOCKSS to capture content that is loaded via Ajax. He also noted that most publishing sites try not to rely on Ajax, and if they do, they provide some other means to crawl their content to maintain their search engine ranking.

Sawood Alam (me) happened to be the last presenter of the conference, presenting his talk on "Archive Profile Serialization". This talk was a continuation of his earlier talk at IA. He described what should be kept in profiles and how it should be organized. He also talked briefly about the implications of each data organization strategy. Finally, he talked about the file format to be used and how it can affect the usefulness of the profiles. He noted that single-root-node file formats like XML, JSON, and YAML are not suitable for profiles, and he proposed an alternative format that is a fusion of the CDX and JSON formats. Kristinn provided his feedback that it seems to be the right approach to serializing such data, but he strongly suggested naming the file format something other than CDXJSON.
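To give a flavor of what such a CDX-and-JSON fusion might look like, here is a minimal Python sketch; the keys and statistics shown are hypothetical illustrations, not the proposed specification.

```python
import json

# Each line pairs a sort-friendly SURT-style key with a JSON blob of
# statistics about an archive's holdings under that key (hypothetical values).
profile_lines = [
    'com,example)/ {"urir": 120, "urim": 350}',
    'uk,ac,example)/ {"urir": 45, "urim": 60}',
]

def parse_profile_line(line):
    key, _, blob = line.partition(" ")
    return key, json.loads(blob)

for line in profile_lines:
    surt_key, stats = parse_profile_line(line)
    print(surt_key, stats["urim"])
```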

While we were having lunch, the chair took the opportunity to wrap up the day and the conference. And now I would like to thank all the organizing team members, especially Jason Webber, Sabine Hartmann, Nicholas Taylor, and Ahmed AlSum, for organizing and making the event possible.

In the afternoon Ahmed AlSum took me to the Computer History Museum where Marc Weber gave us a tour. It was a great place to visit after such an intense week.

Missed Talks

Due to the parallel tracks I missed some sessions that I wanted to attend, such as "SoLoGlo - an archiving and analysis service" by Martin Klein, "Web archive content analysis" by Mohammed Farag, "Identifying national parts of the internet" by Eld Zierau, "Warcbase: Building a scalable platform on HBase and Hadoop" by Jimmy Lin, "WARCrefs for deduplicating web archives" by Youssef Eldakar, and the "WARC Standard Revision Workshop" by Clément Oury, to name a few. I hope the video recordings will be available soon. Meanwhile I was following the related tweets.

Conclusions

IIPC GA 2015 was a fantastic event. I had a great time, met a lot of new people and some whom I previously knew only on the Web, shared my ideas, and learned from others. It was the most amazing complete week I have ever had. I appreciate the efforts of everyone who made this possible, including organizers, presenters, and attendees.

Resources

Please let us know of links to various resources related to IIPC GA 2015 to include below.

Official

Aggregations

Blog Posts

Tools

--
Sawood Alam

2015-05-29: Call me Dr. SalahEldeen

Dr. Nelson saying how awesome I am
Stick a fork in it...it’s done! So now what?

These are the two thoughts that floated through my mind just after defending my dissertation on May 5th. Is it over?... Well, my bet is that the road has just begun. I just became a doctor after all!
After merely 5 years, 4 months, and 13 days, I finished the PhD (see what I did there? that's sarcasm!). Fresh off the boat (err, the plane) when I landed on December 23rd, 2009, I thought I would just knock this PhD out in a couple of years and go work for a big company, and oh boy, little did I know. I believe I am a whole different man now; I learned things I didn't even imagine I would know, mostly about myself and the glorious fields of machine learning, modeling, user behavioral analysis, archiving, preservation, and of course engineering. What do you know, it turned out that research is awesome and I loved it. Finding a pattern, building a predictive model that learns with time, gives you a pleasure with no equal, and apparently I am good at it!

Back to the PhD: my dissertation is entitled Detecting, Modeling, and Predicting User Temporal Intention in Social Media. It's a new field of human intention in relation to time and content shared on social networking portals. It's an enticing area of study that merges multiple disciplines and various fields of study, and according to Dr. Nelson, right now I am the world expert on this tiny point in the collective human knowledge of the sciences ... yup, I will take that.

Our work gained both academic and public acclaim, demonstrated through our publication record and the articles about our work in the BBC, the Atlantic, Popular Mechanics, and MIT Tech Review, as shown in the last set of slides from my defense:



To watch me do an awesome job defending my dissertation:

Big thanks to Mat Kelly for taking awesome pictures throughout the defense day (pardon my weary look):

https://www.flickr.com/photos/124419986@N07/sets/72157651976633968/

I PhD Crushed it!
Well, back to the first part: it seems I was half right after all. I did not manage to finish the PhD in just "a couple" of years, but I did manage to land an awesome job at Microsoft (yes, that's the "big company" part). I accepted a job at Bing working on a very enticing project utilizing user behavioral analysis to best present the search results. Right up my alley, so I am super excited. I did work for Microsoft twice before, though, as an intern at Microsoft Research in 2009 and at Microsoft Silicon Valley in 2011. I guess it's a loyalty thing to come back; it's awesome to work there, to be honest.



So, I am packing my stuff as I write these words, getting a bit emotional about leaving the place I called home for 18.96% of my life and the amazing lifelong friends whom I will always cherish, and excited about new beginnings and awesome feats to achieve. I am shipping all my stuff to Seattle, and I am taking my old motorcycle (a.k.a. Beast) on a cross-country trip inspired by Che Guevara's life-changing journey across South America on his motorcycle La Poderosa ("The Mighty One") after he finished med school, documented in his marvelous memoir, The Motorcycle Diaries.

 So wish me luck and I will post about the trip soon, hopefully I won't break down!
Beast packed and ready
-- Hany SalahEldeen

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.                           

On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.

Robert Wolven of Columbia University Libraries started off the conference by welcoming the audience and emphasizing the variety of presentations that were to occur that day. He then introduced Jim Neal, the keynote speaker.

Jim Neal started by noting the challenges of "repository chaos", namely, which version of a document should be cited for online resources if multiple versions exist. "Born-digital content must deal with integrity", he said, "and remain as unimpaired and undivided as possible to ensure scholarly access."

Brian Carver (@brianwc) and Michael Lissner (@mlissner) of the Free Law Project (@freelawproject) followed the keynote, with Brian first stating, "Too frequently I encounter public access systems that have utterly useless tools on top of them and I think that is unfair." He described his project's efforts to make available court data from the wide variety of systems digitally deployed by various courts on the web. "A one-size-fits-all solution cannot guarantee this across hundreds of different court websites," he stated, further explaining that each site needs its own scraping algorithm to extract content.

To facilitate the crowdsourcing of scraping algorithms, he has created a system where users can supply "recipes" to extract content from the courts' sites as they are posted. "Everything I work with is in the public domain. If anyone says otherwise, I will fight them about it," he mentioned regarding the demands people have brought to him when finding their names in the now-accessible court documents. "We still find courts using WordPerfect. They can cling to old technology like no one else."

Shailin Thomas (@shailinthomas) and Jack Cushman from the Berkman Center for Internet and Society, Harvard University, spoke next about Perma.cc. "Of the digital citations in the Harvard Law Review over the last 10 years, 73% of the online links were broken. Over 50% of the links cited by the Supreme Court are broken." They continued to describe the Perma API and the recent Memento compliance.

After a short break, Deborah Kempe (@nyarcist) of the Frick Art Reference Library described her recent observation that there is a digital shift in art moving to the Internet. She has been working with both Archive-It and Hanzo Archives, for quality assurance of captured websites and for on-demand captures of sites that her organization found particularly challenging, respectively. One example of the latter is Wangechi Mutu's site, which has an animation on the homepage that Archive-It was unable to capture but Hanzo was.

In the same session, Lily Pregill (@technelily) of NYARC stated, "We needed a discovery system to unite NYARC arcade and our Archive-It collection. We anticipated creating yet another silo of an archive." While she stated that the user interface is still under construction, it does allow the results of her organization's archive to be supplemented with results from Archive-It.

Following Lily in the session, Anna Perricci (@AnnaPerricci) and Alex Thurman (@athurman) of Columbia University Libraries talked about the Contemporary Composers Web Archive, which consists of 11 participating curators, with 56 sites currently available in Archive-It. Alex then spoke of the varying legal environments among members based on their countries, some being able to do full TLD crawling while other members (namely, in the U.S.) have no protection from copyright. He spoke of the preservation of Olympics web sites from 2010, 2012, and 2014 - the latter being the first Olympic logo to contain a web address. "Though Archive-It had a higher upfront cost," he said about the initial weighing of various options for Olympic website archiving, "it was all-inclusive of preservation, indexing, metadata, replay, etc." To publish their collections, they are looking into utilizing the .int TLD, which is reserved for internationally significant information but is underutilized in that only about 100 sites exist, all of which have research value.

The conference then broke for a provided lunch then started with Lightning Talks.

To start off the lightning talks, Michael Lissner (@mlissner) spoke about RECAP: what it is, what it has done, and what is next for the project. Much of the content contained within the Public Access to Court Electronic Records (PACER) system consists of paywalled public domain documents. Obtaining the documents costs users ten cents per page with a three dollar maximum. "To download the Lehman Brothers proceedings would cost $27000," he said. His system leverages the user's browser via the extension framework to save a copy of a user's downloads to the Internet Archive, and it also first queries the archive to see if the document has been previously downloaded.

Dragan Espenschied (@despens) gave the next lightning talk on preserving digital art pieces, namely those on the web. He noted one particular example where the artist extensively used scrollbars, which are less commonplace in user interfaces today. To accurately re-experience the work, he fired up a browser-based MacOS 9 emulator:

Jefferson Bailey (@jefferson_bail) followed Dragan with his work investigating archive access methods that are not URI-centric. He has begun working with WATs (web archive transformations), LGAs (longitudinal graph analyses), and WANEs (web archive named entities).

Dan Chudnov (@dchud) then spoke of his work at GWU Libraries. He has developed Social Feed Manager, a Django application for collecting social media data from Twitter. Previously, researchers had been copying and pasting tweets into Excel documents; his tool automates this process. "We want to 1. see how to present this stuff, 2. do analytics to see what's in the data, and 3. find out how to document the now. What do you collect for live events? What keywords are arising? Whose info should you collect?", he said.

Jack Cushman from Perma.cc gave the next lightning talk about ToolsForTimeTravel.org, a site working toward a strong dark archive: one that would prevent even archivists from reading the material within until certain conditions are met. Examples where this would be applicable include the IRA archive at Boston College, Hillary Clinton's e-mails, etc.

With the completion of the Lightning Talks, Jimmy Lin (@lintool) of the University of Maryland and Ian Milligan (@ianmilligan1) of the University of Waterloo rhetorically asked, "When does an event become history?", stating that history is written 20 to 30 years after an event has occurred. "The history of the '30s was written in the '60s. Where are the Monica Lewinsky web pages now? We are getting ready to write the history of the 1990s," Jimmy said. "Users can't do much with current web archives. It's hard to develop tools for non-existent users. We need deep collaborations between users (archivists, journalists, historians, digital humanists, etc.) and tool builders. What would a modern archiving platform built on big data infrastructure look like?" He compared his recent work in creating warcbase with the monolithic OpenWayback Tomcat application: "Existing tools are not adequate."

Ian then talked about warcbase as an open source platform for managing web archives with Hadoop and HBase. WARC data is ingested into HBase and Spark is used for text analysis and services.

Zhiwu Xie (@zxie) of Virginia Tech then presented his group's work on maintaining web site persistence when the original site is no longer available. Using an approach akin to a proxy server, the content served when the site was last available continues to be served in lieu of the live site. "If we have an archive that archives every change of that web site and the website goes down, we can use the archive to fill the downtimes," he said.
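
The general idea can be sketched in a few lines: try the live site first, and fall back to the most recent memento when the origin is unreachable. The sketch below only illustrates that fallback pattern (it uses the Internet Archive's Wayback Machine as the fallback and made-up timeout values); it is not the system Zhiwu described.

    # Sketch of an archive-backed fallback: serve the live resource when the
    # origin responds, otherwise serve the newest memento from a web archive.
    # The Wayback URL form and timeouts are illustrative assumptions.
    import requests

    WAYBACK_LATEST = "http://web.archive.org/web/{uri}"  # redirects to the newest memento

    def fetch_with_archive_fallback(uri, timeout=5):
        try:
            live = requests.get(uri, timeout=timeout)
            if live.status_code < 500:
                return live.content, "live"
        except requests.RequestException:
            pass  # origin down or unreachable; fall through to the archive
        memento = requests.get(WAYBACK_LATEST.format(uri=uri), timeout=timeout)
        memento.raise_for_status()
        return memento.content, "archive"

    if __name__ == "__main__":
        body, source = fetch_with_archive_fallback("http://example.com/")
        print(source, len(body))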

Mat Kelly (@machawk1, your author) presented next with "Visualizing digital collections of web archives", in which I described the SimHash-based archival summarization strategy to efficiently generate a visual representation of how a web page changed over time. In this work, I created a stand-alone interface, a Wayback add-on, and an embeddable service for generating a summarization of a live web page. At the close of the presentation, I attempted a live demo.
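
To give a flavor of the fingerprinting step, here is a toy SimHash over page text with a Hamming-distance comparison between two captures. It illustrates the general technique only and is not the summarization service's actual code; the sample strings are made up.

    # Toy SimHash: fingerprint each capture's text, then compare fingerprints
    # by Hamming distance; a small distance suggests little change between
    # mementos. Generic illustration, not the service's implementation.
    import hashlib

    def simhash(text, bits=64):
        vector = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            for i in range(bits):
                vector[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    capture_2013 = "kelley blue book new and used car price values"
    capture_2015 = "kelley blue book new car prices used car values"
    print(hamming(simhash(capture_2013), simhash(capture_2015)))  # small => similar captures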

WS-DL's own Michele Weigle (@weiglemc) next presented Yasmin's (@yasmina_anwar) work on Detecting Off-Topic Pages. The recently accepted TPDL 2015 paper examined how pages in Archive-It collections change over time and how to detect when a page is no longer relevant to what the archivist intended to capture. She used six similarity metrics and found that cosine similarity performed the best.
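
As a rough sketch of how the best-performing metric can be applied, the snippet below scores a later memento's text against a page's first memento with TF-IDF cosine similarity. The threshold, preprocessing, and sample strings are illustrative assumptions, not the paper's exact setup.

    # Cosine similarity between an early memento's text and a later memento's
    # text; a low score hints that the page may have drifted off-topic.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    first_memento = "candidate campaign issues healthcare education economy"
    later_memento = "this domain is for sale click here to buy now"

    tfidf = TfidfVectorizer().fit_transform([first_memento, later_memento])
    score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
    print("off-topic" if score < 0.2 else "on-topic", round(score, 3))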

In the final presentation of the day, Andrea Goethals of Harvard Library and Stephen Abrams of the California Digital Library discussed the difficulties of keeping up with web archiving locally, citing outdated tools and systems. A hierarchical diagram of a potential solution they showed piqued the audience's interest, striking some as overcomplicated for smaller archives.

To close out the day, Robert Wolven gave a synopsis of the challenges to come and expressed his hope that there was something for everyone.

Day 2

The second day of the conference consisted of multiple concurrent topical sessions that were somewhat open-ended to facilitate more group discussion. I initially attended David Rosenthal's talk, where he discussed the need for tools and APIs for integration into various systems to standardize access. "A random URL on the web has less than a 50% chance of getting preserved anywhere," he said. "We need to use resources as efficiently as possible to up that percentage."

DSHR then discussed repairing archives for bit-level integrity and LOCKSS's approach to accomplishing it. "How would we go about establishing a standard archival vocabulary?", he asked. "'Crawl scope' means something different in Archive-It vs. other systems."

I then changed rooms to catch the last half hour of Dragan Espenschied's session, where he discussed pywb (the software behind webrecorder.io) in more depth. The software allows people to create their own public and private archives and offers a pass-through mode in which it does not record login information. Further, it can capture embedded YouTube and Google Maps content.

Following the first set of concurrent sessions, I attended Ian Milligan's demo of using warcbase to analyze Canadian political parties' websites (a private repo as of this writing, but it will be made public once cleaned up). He also demonstrated using Web Archives for Historical Research. In the subsequent and final presentation of day 2, Jefferson Bailey demonstrated Vinay Goel's (@vinaygo) Archive Research Services Workshop, which was created to serve as an introduction to data mining and computational tools and methods for working with web archives, aimed at researchers, developers, and general users. The system utilizes the WAT, LGA, and WANE derived data formats that Jefferson spoke of in his Day 1 lightning talk.

After Jefferson's talk, Robert Wolven again collected everyone into a single session to go over what was discussed in each session on the second day and gave a final closing.

Overall, the conference was very interesting and very relevant to my research in web archiving. I hope to dig into some of the projects and resources I learned about and follow up with contacts I made at the Columbia Web Archiving Collaboration conference.

— Mat (@machawk1)

2015-06-09: Mobile Mink merges the mobile and desktop webs

As part of my 9-to-5 job at The MITRE Corporation, I lead several STEM outreach efforts in the local academic community. One of our partnerships, with the New Horizons Governor's School for Science and Technology, pairs high school seniors with professionals in STEM careers. Wes Jordan has been working with me since October 2014 as part of this program and for his senior mentorship project, a requirement for graduation from the Governor's School.

Wes has developed Mobile Mink (soon to be available in the Google Play store). Inspired by Mat Kelly's Mink add-on for Chrome, Wes adapted the functionality to an Android application. This blog post discusses the motivation for and operation of Mobile Mink.

Motivation

The growth of the mobile web has encouraged web archivists to focus on ensuring it is thoroughly archived. However, mobile URIs are not as prevalent in the archives as their non-mobile (or, as we will refer to them, desktop) counterparts. This is apparent when we compare the TimeMaps of the Android Central site (with desktop URI http://www.androidcentral.com/ and mobile URI http://m.androidcentral.com/).

TimeMap of the desktop Android Central URI
 The 2014 TimeMap in the Internet Archive of the desktop Android Central URI includes a large number of mementos with a small number of gaps in archival coverage.
TimeMap of the mobile Android Central URI
Alternatively, the TimeMap in the Internet Archive of the mobile Android Central URI has far fewer mementos and many more gaps in archival coverage.

This example illustrates the discrepancy in archival coverage between mobile and desktop URIs. Additionally, as humans we understand that these two URIs represent content from the same site: Android Central. The connection between the URIs exists on the live web, where mobile user-agents trigger a redirect to the mobile URI, but this connection is lost during archiving.



The representations of the mobile and desktop URIs are different, even though a human will recognize the content as largely the same. Because archives commonly index by URI and archival datetime only, a machine may not be able to understand that these URIs are related.
The desktop Android Central representation
The mobile Android Central representation

Mobile Mink helps merge the mobile and desktop TimeMaps while also providing a mechanism to increase the archival coverage of mobile URIs. We detail these features in the Implementation section.

Implementation

Mobile Mink provides users with a merged TimeMap of the mobile and desktop versions of the same site. We use the URI permutations detailed in McCown's work to transform desktop URIs to mobile URIs (e.g., http://www.androidcentral.com/ -> http://m.androidcentral.com/) and mobile URIs to desktop URIs (e.g., http://m.androidcentral.com/ -> http://www.androidcentral.com/). This process allows Mobile Mink to establish the connection between mobile and desktop URIs.
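
The transformation itself is simple string manipulation over the host name. The sketch below shows the idea with a small, illustrative subset of permutations; it is not the complete set from McCown's work, nor Mobile Mink's Android code.

    # Generate candidate mobile URIs for a desktop URI (and vice versa) by
    # permuting the host name. The prefix list is an illustrative subset.
    from urllib.parse import urlsplit, urlunsplit

    MOBILE_PREFIXES = ["m.", "mobile.", "touch."]

    def mobile_candidates(desktop_uri):
        parts = urlsplit(desktop_uri)
        host = parts.netloc
        bare = host[4:] if host.startswith("www.") else host
        return [urlunsplit((parts.scheme, prefix + bare, parts.path,
                            parts.query, parts.fragment))
                for prefix in MOBILE_PREFIXES]

    def desktop_candidate(mobile_uri):
        parts = urlsplit(mobile_uri)
        host = parts.netloc
        for prefix in MOBILE_PREFIXES:
            if host.startswith(prefix):
                host = "www." + host[len(prefix):]
                break
        return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

    print(mobile_candidates("http://www.androidcentral.com/"))
    print(desktop_candidate("http://m.androidcentral.com/"))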



Merged TimeMap
With the mobile and desktop URIs identified, Mobile Mink uses Memento to retrieve the TimeMaps of both the desktop and mobile versions of the site. Mobile Mink merges all of the returned TimeMaps and sorts the mementos temporally, identifying the mementos of the mobile URIs with an orange icon of a mobile phone and the mementos of the desktop URIs with a green icon of a PC monitor.
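
Conceptually, the merge boils down to fetching one TimeMap per URI, tagging each memento with its source, and sorting the union by Memento-Datetime. A rough sketch of that step follows; the Memento aggregator endpoint and the crude link-format parsing are assumptions for illustration, not Mobile Mink's Android implementation.

    # Fetch link-format TimeMaps for the desktop and mobile URIs, tag each
    # memento with its source, and sort the union by datetime.
    import re
    import requests
    from email.utils import parsedate_to_datetime

    TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/{uri}"
    MEMENTO_RE = re.compile(
        r'<([^>]+)>;[^,]*rel="[^"]*memento[^"]*";[^,]*datetime="([^"]+)"')

    def mementos(uri, label):
        body = requests.get(TIMEMAP.format(uri=uri), timeout=30).text
        return [(parsedate_to_datetime(dt), m_uri, label)
                for m_uri, dt in MEMENTO_RE.findall(body)]

    merged = sorted(mementos("http://www.androidcentral.com/", "desktop") +
                    mementos("http://m.androidcentral.com/", "mobile"))
    for dt, m_uri, label in merged[:10]:
        print(dt.isoformat(), label, m_uri)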

To mitigate the discrepancy in archival coverage between the mobile and desktop URIs of web resources, Mobile Mink provides an option that lets users push both the mobile and desktop URIs to the Internet Archive's Save Page Now feature and to Archive.today. This allows Mobile Mink's users to actively archive mobile resources that might not otherwise be archived.
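
Pushing a URI to Save Page Now is itself just an HTTP request. A minimal sketch of that submission for both URI variants follows (Internet Archive only; the Archive.today submission endpoint is omitted, and the header handling is an assumption rather than Mobile Mink's code):

    # Ask the Internet Archive's Save Page Now to capture both URI variants.
    import requests

    def save_page_now(uri):
        resp = requests.get("https://web.archive.org/save/" + uri, timeout=60)
        # The Content-Location header (or final URL) typically points at the new memento.
        return resp.headers.get("Content-Location", resp.url)

    for uri in ("http://www.androidcentral.com/", "http://m.androidcentral.com/"):
        print(uri, "->", save_page_now(uri))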

These features mirror the functionality of Mink by providing users with a TimeMap of the site currently being viewed, but extend it by providing the merged mobile and desktop TimeMap. Mink also provides a feature to submit URIs to Archive.today and Save Page Now, but Mobile Mink extends this functionality by submitting both the mobile and desktop URIs to these two archival services.

Demonstration

The video below provides a demo of Mobile Mink. We use the Chrome browser and navigate to http://www.androidcentral.com/, which redirects us to http://m.androidcentral.com/. From the browser menu, we select the "Share" option. When we select the "View Mementos" option, Mobile Mink provides the aggregate TimeMap. Selecting the icon in the top right corner, we can access the menu to submit the mobile and desktop URIs to Archive.today and/or the Internet Archive.


Next Steps

We plan to release Mobile Mink in the Google Play store in the next few weeks. In the meantime, please feel free to download and use the app from Wes's GitHub repository (https://github.com/Thing342/MobileMemento) and provide feedback through the issue tracker (https://github.com/Thing342/MobileMemento/issues). We will continue to test and refine the software moving forward.

Wes's demo of Mobile Mink was accepted at JCDL 2015. Because he is graduating in June and preparing to start his collegiate career at Virginia Tech, someone from the WS-DL lab will present his work on his behalf. However, we hope to convince Wes to come to the Dark Side and join the WS-DL lab in the future. We have cookies.

--Justin F. Brunelle

2015-06-26: JCDL 2015 Doctoral Consortium

Mat Kelly attended and presented at the JCDL 2015 Doctoral Consortium. This is his report.                           

Evaluating progress between milestones in a PhD program is difficult due to the inherent open-endedness of research. A means of evaluating whether a student's topic is sound and has merit while still early on in his career is to attend a doctoral consortium. Such an event, as the one held at the annual Joint Conference on Digital Libraries (JCDL), has previously provided a platform for WS-DL students (see 2014, 2013, 2012, and others) to network with faculty and researchers from other institutions as well as observe the approach that other PhD students at the same point in their career use to explain their respective topics.

As the wheels have turned, I have shown enough progress in my research for it to be suitable for preliminary presentation at the 2015 JCDL Doctoral Consortium -- and did so this past Sunday in Knoxville, Tennessee. Along with seven other graduate students from universities throughout the world, I gave a twenty-minute presentation with ten to twenty minutes of feedback from an audience of other presenting graduate students, faculty, and researchers.

Kazunari Sugiyama of the National University of Singapore (where Hany SalahEldeen recently spent a semester as a research intern) welcomed everyone and briefly described the format of the consortium before getting underway: each student would have twenty minutes to present, with ten to twenty minutes for feedback from the doctors and the other PhD students present.

The Presentations

The presentations were broken up into four topical categories. In the first section, "User's Relevance in Search", Sally Jo Cunningham introduced the two upcoming speakers. Sampath Jayarathna (@OpenMaze) of Texas A&M University was the first presenter of the day with his topic, "Unifying Implicit and Explicit Feedback for Multi-Application User Interest Modeling". In his research, he asked users to type short queries, which he used to investigate methods for search optimization. He asked, "Can we combine implicit and semi-explicit feedback to create a unified user interest model based on multiple everyday applications?" Using a browser-based annotation tool, users in his study provided relevance feedback on the search results through both explicit and implicit means. One of his hypotheses is that, given a user model, he should be able to compare the model against the explicit feedback users provide in order to improve the relevance of results.


After Sampath, Kathy Brennan (@knbrennan) of the University of North Carolina presented her topic, "User Relevance Assessment of Personal Finance Information: What is the Role of Cognitive Abilities?". In her presentation she alluded to the similarities between buying a washer and dryer and obtaining a mortgage as indicators of a person's cognitive abilities. "Even for really intelligent people, understanding prime and subprime rates can be a challenge," she said. One study she described analyzed rounding behavior, with stock prices as an example of the critical details an individual observes. By psychometrically testing 69 different abilities while users analyzed documents for relevance, she found that someone with lower cognitive abilities has a lower threshold for relevance and thus marks more documents as relevant than someone with higher cognitive abilities. "However," she said, "those with a higher cognitive ability were doing a lot more in the same amount of time as those with lower cognitive abilities."

After a short coffee break, Richard Furuta of Texas A&M University introduced the two speakers of the second session, titled "Analysis and Construction of Archive". Yingying Yu of Dalian Maritime University presented first in this session with "Simulate the Evolution of Scientific Publication Repository via Agent-based Modeling". In her research, she seeks to find candidate co-authors for academic publications based on a model that includes venue, popularity, and author importance as a partial set of parameters. "Sometimes scholars only focus on homogeneous networks," she said.


Mat Kelly (@machawk1, your author) presented second in the session with "A Framework for Aggregating Private and Public Web Archives". In my work, I described the issues of integrating private and public web archives with respect to access restrictions, privacy, and other concerns that would arise were the archives' results to be aggregated.


The conference then broke for boxed lunch and informal discussions amongst the attendees.


After sessions resumed following the lunch break, George Buchanan (@GeorgeRBuchanan) of City University of London welcomed everybody and introduced the two speakers of the third session of the day, "User Generated Contents for Better Service".


Faith Okite-Amughoro (@okitefay) of the University of KwaZulu-Natal presented her topic, "The Effectiveness of Web 2.0 in Marketing Academic Library Services in Nigerian Universities: a Case Study of Selected Universities in South-South Nigeria". Faith noted that there has not been any assessment of how the libraries in her region of study have used Web 2.0 to market their services. "The real challenge is not how to manage their collection, staff and technology," she said, "but to turn these resources into services." She found that the most-used Web 2.0 tools were social networking, video sharing, blogs, and generally places where users could add content themselves.


Following Faith, Ziad Matni (@ziadmatni) of Rutgers University presented his topic, "Using Social Media Data to Measure and Influence Community Well-Being". Ziad asked, "How can we gauge how well people are doing in their local communities through the data that they generate on social media?" He is currently looking for useful measures of the components of community well-being and their relationships with collective feelings of stress and tranquility (as defined in his work). He hopes to focus on one or two social indicators and to understand the factors that correlate the sentiment expressed on social media with a geographical community's well-being.


After Ziad's presentation, the group took a coffee break and then began the last presentation session of the day, "Mining Valuable Contents". Kazunari Sugiyama (who welcomed the group at the beginning of the day) introduced the two speakers of the session.


The first presentation in this session was from Kahyun Choi of the University of Illinois at Urbana-Champaign, who presented her work, "From Lyrics to Their Interpretations: Automated Reading between the Lines". She is looking to find the source of subject information for songs, on the assumption that machines might have difficulty analyzing song lyrics directly. She has three general research questions: the first relates lyrics to their interpretations, the second asks whether topic modeling can discover the subject of the interpretations, and the third concerns reliably obtaining interpretations from the lyrics. She is training and testing a subject classifier, for which she collected lyrics and their interpretations from SongMeanings.com. From this she obtained eight subject categories: religion, sex, drugs, parents, war, places, ex-lover, and death. With 100 songs in each category, she assigned each song only one subject. She then used only the top ten interpretations per song to prevent the results from being skewed by songs with a large number of interpretations.


The final group presentation of the day was to come from Mumini Olatunji Omisore of the Federal University of Technology with "A Classification Model for Mining Research Publications from Crowdsourced Data". Because of visa issues, he was unable to attend and had planned to present via Skype or Google Hangouts. Despite changing wireless configurations, trying other services, and many other attempts, the bandwidth at the conference venue proved insufficient and he was unable to present. A contingency was set up between him and the doctoral consortium organizers to review his slides.


Two-on-Two

Following the attempts to allow Mumini to present remotely, the consortium broke up into groups of four (two students and two doctors) for private consultations. The doctors in my group (Drs. Edie Rasmussen and Michael Nelson) provided extremely helpful feedback on both my presentation and my research objectives. Particularly valuable was their discussion of how I could improve the evaluation of my proposed research.

Overall, the JCDL Doctoral Consortium was a very valuable experience. Seeing how other PhD students approach their research and obtaining critical feedback on my own made it, I believe, a priceless opportunity for improving the quality of one's PhD research.

— Mat (@machawk1)

Edit: Subsequent to this post, Lulwah reported on the main portion of the JCDL 2015 conference and Sawood reported on the WADL workshop at JCDL 2015.

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher-fidelity archives a page at a time; this is an example implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), which matches our goal of replaying interactions; WebRecorder.io is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automated Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides tight integration between the loaded page, its DOM, and the controlling code. This allows code to be injected directly into the target page and native DOM interaction to be performed, giving PhantomJS a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several API bindings (Java, Python, Perl, etc.) that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automating actions such as mouse movements.

Based on our experimentation, Selenium is a better tool for canned interactions -- for example, a pre-scripted set of clicks, drags, etc. A summary of the differences between PhantomJS, Selenium, and VisualEvent (explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be treated as a relative comparison rather than a definitive measurement.

Tool                  | PhantomJS                   | Selenium      | VisualEvent
Operation             | Headless                    | Full-Browser  | JavaScript bookmarklet and code
Speed (seconds)       | 2.5-8                       | 4-10          | < 1 (on user click)
DOM Integration       | Close integration           | 3rd party     | Close integration/embedded
DOM Event Extraction  | Semi-reliable               | Semi-reliable | 100% reliable
DOM Interaction       | Scripted, native, on-demand | Scripted      | None

To summarize, PhantomJS is faster (because it is headless) and more closely coupled with the browser, the DOM, and client-side events than Selenium (which loads a full browser). However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than maintaining that responsibility within the archival tool. This will prove beneficial as JavaScript, HTML5, and other client-side technologies evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded, pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.
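
For illustration, a canned Selenium interaction of the kind recommended above might look like the following Python script. The element IDs and the chosen option are hypothetical placeholders rather than KBB.com's real markup, and the wait condition is a simplifying assumption.

    # A canned (pre-scripted) Selenium interaction: open a page, pick an option
    # from a drop-down, and wait for the dependent menu to populate via Ajax.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.Firefox()  # a full ("headful") browser
    try:
        driver.get("http://www.kbb.com/")
        Select(driver.find_element(By.ID, "make")).select_by_visible_text("Honda")
        # Wait until the Ajax response has populated the dependent "model" menu.
        WebDriverWait(driver, 10).until(
            lambda d: len(Select(d.find_element(By.ID, "model")).options) > 1)
        print([o.text for o in Select(driver.find_element(By.ID, "model")).options])
    finally:
        driver.quit()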

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i -- the user interaction test -- but pass all others. This indicates that both Selenium and PhantomJS have difficulty identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop-downs).
Fig 6. The acid test results are identical for PhantomJS and Selenium, with both failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events rather than as an archival utility, but it can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of DOM event extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover their attached event handlers. VisualEvent instead starts with the JavaScript: it gathers all of the JavaScript functions, determines which DOM elements reference those functions, and determines whether they are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) through a visual overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive elements are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements, and then interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case -- automatically identifying a set of DOM interactions; other experimental conditions and goals may be better suited to Selenium or other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle