Web Science and Digital Libraries Research Group

2014-11-14: Carbon Dating the Web, version 2.0




For over a year, Hany SalahEldeen's Carbon Date service has been out of service, mainly because of API changes in some of the underlying modules on which the service is built. Consequently, I have taken up responsibility for maintaining the service, beginning with the changes now available in Carbon Date v2.0.

Carbon Date v2.0


The Carbon Date service now makes requests to the different modules (archives, backlinks, etc.) concurrently through threading.
The server framework has been changed from Bottle to CherryPy, which is still a minimalist Python WSGI server, but a more robust framework that features a threaded server.
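As a rough sketch of the concurrency pattern (the module names and the fetch_earliest_date helper below are illustrative placeholders, not the actual Carbon Date code):

import threading

def fetch_earliest_date(module, url):
    # Placeholder for a real lookup against one source (an archive, backlinks, etc.).
    return None

def carbon_date(url, modules=("archives", "backlinks", "google", "bing")):
    results = {}

    def worker(module):
        # Each thread queries one source and records its earliest-date estimate.
        results[module] = fetch_earliest_date(module, url)

    threads = [threading.Thread(target=worker, args=(m,)) for m in modules]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The earliest non-empty estimate across the modules is the carbon date.
    return results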

How to use the Carbon Date service

There are three ways:
  • Through the website, http://cd.cs.odu.edu/: Given that carbon dating is highly computationally intensive, the site should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally (local.py or server.py).
  • Through the local server (server.py): The second way to use the Carbon Date service is through the local server application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.
  • Through the local application (local.py): The third way to use the Carbon Date service is through the local python application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.

The backlinks efficiency problem

Upon running the Carbon Date service, you will notice a significant difference in the runtime of the backlinks module compared to the other modules. This is because the most expensive operation in the carbon dating process is carbon dating the backlinks. Consequently, in the local application (local.py), the backlinks module is switched off by default and reactivated with the --compute-backlinks option. For example, to Carbon Date cnn.com with the backlinks module switched on:
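(The exact argument syntax may differ from the version you install; consult README.md. The invocation would look something like this:)

python local.py cnn.com --compute-backlinks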
Some effort was put towards optimizing the backlinks module; however, my conclusion is that the current implementation cannot be optimized.

This is because of the following cascade of operations associated with the inlinks:



Given a single backlink (an incoming link, or inlink, to the URL), the application retrieves all of that backlink's mementos (which could range from tens to hundreds). Thereafter, the application searches those mementos for the first occurrence of the URL.

At first glance, one might suggest binary search, since the mementos are in chronological order. However, given that there are potentially multiple mementos that contain the URL, binary search does not help: checking the midpoint memento for the URL gives us no information we can use to discard half of the list, since the first occurrence could lie in either half. Therefore, a linear scan is the only possible method.
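A sketch of that linear scan (illustrative only; the real module works over the backlink's TimeMap):

def first_occurrence(mementos, target_url, fetch):
    # mementos: list of (datetime, memento_uri) pairs in chronological order.
    # fetch: a function that returns the HTML body of a memento URI.
    # Returns the datetime of the earliest memento whose body links to target_url.
    for dt, uri in mementos:
        if target_url in fetch(uri):
            return dt
    return None

Because the presence of the link is not monotonic across the list, no midpoint test can safely discard half of the mementos; in the worst case, every memento up to the first hit has to be fetched and inspected.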

I am grateful to everyone who contributed to the debugging of Carbon Date, including George Micros and the members of the Old Dominion University Introduction to Web Science class (Fall 2014). Further recommendations or comments about how this service can be improved are welcome and appreciated.

--Nwala

2014-11-20: Archive-It Partners Meeting 2014



I attended the 2014 Archive-It Partners Meeting in Montgomery, AL on November 18.  The meeting attendees are representatives from Archive-It partners with interests ranging from archiving webpages about art and music to archiving government webpages.  (Presentation slides will be available on the Archive-It wiki soon.)  This is ODU's third consecutive Partners Meeting (see trip reports from 2012 and 2013).

The morning program was focused on presentations from partners who are building collections.  Here's a brief overview of each of those.

Penny Baker and Susan Roeper from the Clark Art Institute talked about their experience in archiving the 2013 Venice Biennale international art exhibition (Archive-It collection) and plans for the upcoming exhibition.  Their collection includes exhibition catalogs, monographs, and press releases about the event.  The material also includes a number of videos (mainly from vimeo), which Archive-It can now capture.

Beth Downs from the Montana State Library (Archive-It collection) spoke about working with partners around the state to fulfill the state mandate to make all government documents publicly available and working to make the materials available to state employees, librarians, teachers, and the general public.  One of the nice things they've added to their site footer is a Page History link that goes directly to the Archive-It Wayback calendar page for the current page.


Beth has also provided instructions for their state agencies on how to include the Page History link and a Search box into the archive on their pages.  This could be easily adapted to point to other state government archives or to the general Internet Archive Wayback Machine.

Dory Bower from the US Government Printing Office talked about the FDLP (Federal Depository Library Program) Web Archive (Archive-It collections).  They have several archiving strategies and use Archive-It mainly for the more content rich websites along with born-digital materials.

Heather Slania, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts (Archive-It collections), spoke about the challenges of capturing dynamic content from artists' websites.  This includes animation, video (mainly vimeo), and other types of Internet art. She has initially focused on capturing the websites of a selection of Internet artists.  These sites include over 6000 videos (from just 30 artists).  The next step is to archive the work of video artists and web comics.  As part of this project, she has been considering what types of materials are currently capture-able and categorizing the amount of loss in the archived sites.  This is related to our group's recent work on measuring memento damage (pdf, slides) and investigating the archivability of websites over time (pdf at arXiv, slides).

Nicholas Taylor from Stanford University Libraries gave an overview of the 2013 NDSA (National Digital Stewardship Alliance) Survey Report (pdf).  The latest survey was conducted in 2013 and the first was done in 2011.  NDSA's goal is to conduct this every 2 years.  Nicholas had lots of great stats in his slides, but here are a few that I noted:
  • 50% of respondents were university programs
  • 7% affiliated with IIPC, 33% with NDSA, 45% Web Archiving Roundtable, 71% with Archive-It
  • many are concerned with capturing social media, databases, and video
  • about 80% of respondents are using external services for archiving, like Archive-It
  • 80% haven't transferred data to their local repository
  • many are using tools that don't support WARC (but the percentage using WARC has increased since 2011)
Abbie Nordenhaug and Sara Grimm from the Wisconsin Historical Society (Archive-It collections) presented next.  They're just getting started archiving in a systematic manner.  They have a range of state agency partners, with websites ranging from highly dynamic to fairly static.  So far, they've set up monthly, quarterly, semi-annual, and annual crawls for those sites.

After these presentations, it was time for lunch.  Since we were in Alabama, I found my way to Dreamland BBQ.

After lunch, the presentations focused on collaborations, an update on 2014-2015 Archive-It plans, BOF breakout sessions, and strategies and services.

Anna Perricci from Columbia University Libraries spoke about their experiences with collaborative web archiving projects (Archive-It collections), including the Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) collection and the Contemporary Composers Web Archive (CCWA) collection.

Kent Underwood, Head of the Avery Fisher Center for Music and Media at the NYU Libraries, spoke about web archiving for music history (Archive-It collection).  Kent gave an eloquent argument for web archiving:  "Today’s websites will become tomorrow’s historical documents, and archival websites must certainly be an integral part of tomorrow’s libraries. But websites are fragile and impermanent, and they cannot endure as historical documents without active curatorial attention and intervention. We must act quickly to curate and preserve the memory of the Internet now, while we have the chance, so that researchers of tomorrow will have the opportunity to discover their own past. The decisions and actions that we take today in web archiving will be crucial in determining what our descendants know and understand about their musical history and culture."

Patricia Carlson from Mount Dora High School in Florida spoke about Archive-It's K-12 Archiving Program and its impact on her students (Mount Dora's Archive-It collection).  She talked about its role in introducing her students to primary sources and metadata.  She's also been able to use things that they already do (like tag people on Facebook) as examples of adding metadata. The students have even made a video chronicling their archiving experiences.

After the updates on ongoing collaborations, Lori Donovan and Maria LaCalle from Archive-It gave an overview of Archive-It's 2014 activities and upcoming plans for 2015.  Archive-It currently has 330 partners in 48 US states (only missing Arkansas and North Dakota!) and 16 countries.  In 2014, with version 4.9, Archive-It crawls certain pages with Heritrix and Umbra, which allows Heritrix to access sites in the same way a browser would.  This allows for capture of client-side scripting (such as JavaScript) and improves the capture of social media sites.  There were several new features in the 5.0 release, among them integration with Google Analytics. There will be both a winter 2014 release and a spring/summer 2015 release.  In the spring/summer release several new features are planned, including visual/UI redesign of the web app, the ability to move and share seeds between collections, ability to manually rank metadata facets on public site, enhanced integration with archive.org, updated Wayback look and feel, and linking related pages on the Wayback calendar (in case URI changed over time).

After a short break, we divided up into BOF groups:
  • Archive.org v2
  • Researcher Services
  • Cross-archive collaboration
  • QA (quality assurance)
  • Archiving video, audio, animations, social media
  • State Libraries
I attended the Research Services BOF, led by Jefferson Bailey and Vinay Goel from Internet Archive and Archive-It.  Jefferson and Vinay described their intentions with launching research services and asked for feedback and requests.  The idea is to use the Internet Archive's big data infrastructure to process data and provide smaller datasets of derived data to partners from their collections.  This would allow researchers to work on smaller datasets that would be manageable without necessarily needing big data tools.  This could also be used to provide a teaser as to what's in the collection, highlight link structure in the collection, etc.  One of the initial goals is to seed example use cases of these derivative datasets to show others what might be possible.  The ultimate goal is to help people get more value from the archive.  Jefferson and Vinay talked in more detail about what's upcoming in the last talk of the meeting (see below). Most of the other participants in the BOF were interested in ways that their users could make research use out of their archived collections.

After the BOF breakout, the final session featured talks on strategies and services.

First up was yours truly (Michele Weigle from the WS-DL research group at Old Dominion University).  My talk was a quick update on several of our ongoing projects, funded by NEH Office of Digital Humanities and the Columbia University Libraries Web Archiving Incentives program.


The tools I mentioned (WARCreate, WAIL, and Mink) are all available from our Software page.  If you try them out, please let us know what you think (contact info is on the last slide).

Mohamed Farag from Virginia Tech's CTRnet research group presented their work on an event focused crawler (EFC).  Their previous work on automatic seed generation from URIs shared on Twitter produced lots of seeds, but not all of them were relevant.  The new work allows a curator to select high quality seed URIs and then uses the event focused crawler (EFC) to retrieve webpages that are highly similar to the seeds.  The EFC can also read WARCs and perform text analysis (entities, topics, etc.) from them.  This enables event modeling, describing what happened, where, and when.

In the final presentation of the meeting, Jefferson Bailey and Vinay Goel from Internet Archive spoke about building Archive-It Research Services, planned to launch in January 2015. The goals are to expand access models to web archives, enable new insights into collections, and facilitate computational analysis.  The plan is to leverage the Internet Archive's infrastructure for large-scale processing.  This could result in increasing the use, visibility, and value of Archive-It collections.  Initially, three main types of datasets are planned:
  • WAT - consists of key metadata from a WARC file, includes text data (title, meta-keywords, description) and link data (including anchor text) for HTML
  • LGA - longitudinal graph analysis - what links to what over time
  • WANE - web archive named entities
All of these datasets are significantly smaller than the original WARC files.  Jefferson and Vinay have built several visualizations based on some of this data for demonstration and will be putting some of these online.  Their future work includes developing programmatic APIs, custom datasets, and custom processing.

All in all, it was a great meeting with lots of interesting presentations. It was good to see some familiar faces and to actually meet others I'd previously only emailed with.  It was also nice to be in an audience where I didn't have to motivate the need for web archiving.

There were several people live-tweeting the meeting (#ait14).  I'll conclude with some of the tweets.


-Michele

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page.  For example, the old URI:

http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html

is currently:

http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html

The same content had at least one other URI as well, from at least 2009 to 2012:

http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, InfoWorld is still redirecting the second URI to the current URI:



And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:



Jon's approach is to just give up on tracking the different URIs for his hundreds of articles and instead use a combination of metadata (title & author) and the "site:" operator submitted to a search engine to locate the current URI (side note: this approach is really similar to OpenURL).  For example, the link for the article above would become:

http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22
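Constructing such a query URI programmatically is straightforward; for example (a Python 3 sketch, using Bing only because that is the engine in Jon's example):

from urllib.parse import urlencode

def se_query_uri(site, author, title):
    # Quote the author and title so the engine matches them as exact phrases.
    q = 'site:%s "%s" "%s"' % (site, author, title)
    return "http://www.bing.com/search?" + urlencode({"q": q})

print(se_query_uri("infoworld.com", "jon udell",
                   "XQuery and the power of learning by example"))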

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in html that looks like:



I posted a comment saying that a search engine's robots.txt would prevent archives like the Internet Archive from archiving the SERPs, and thus they would not discover (and archive) the new URIs themselves.  In an email conversation, Martin made the point that rewriting the link to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to msn.com or live.com queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?").

In summary, Jon's use of SERPs as interstitial pages to combat link rot is an interesting solution to a common problem, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them and betting on the long-term stability of SEs.  The solution we need is a method to include more than one URI per HTML link, such as the one proposed in the "Missing Link" document.

--Michael

2015-01-03: Review of WS-DL's 2014

The Web Science and Digital Libraries Research Group's 2014 was even better than our 2013.  First, we graduated two PhD students and had many other students advance their status:
In April we introduced our now famous "PhD Crush" board that allows us to track students' progress through the various hoops they must jump through.  Although it started as sort of a joke, it's quite handy and popular -- I now wish we had instituted it long ago. 

We had 15 publications in 2014, including:
JCDL was especially successful, with Justin's paper "Not all mementos are created equal: Measuring the impact of missing resources" winning "best student paper" (Daniel Hasan from UFMG also won a separate "best student paper" award), and Chuck's paper "When should I make preservation copies of myself?" winning the Vannevar Bush Best Paper award.  It is truly a great honor to have won both best paper awards at JCDL this year (pictures: Justin accepting his award, and me accepting on behalf of Chuck).  In the last two years at JCDL & TPDL, that's three best paper awards and one nomination.  The bar is being raised for future students.

In addition to the conference paper presentations, we traveled to and presented at a number of conferences that do not have formal proceedings:
We were also fortunate enough to visit and host visitors in 2014:
We also released (or updated) a number of software packages for public use, including:
Our coverage in the popular press continued, with highlights including:
  • I appeared on the video podcast "This Week in Law" #279 to discuss web archiving.
  • I was interviewed for the German radio program "DRadio Wissen". 
We were more successful on the funding front this year, winning the following grants:
All of this adds up to a very busy and successful 2014.  Looking ahead to 2015, in addition to continued publication and funding success, we expect to graduate both an MS and a PhD student and to host another visiting researcher (Michael Herzog, Magdeburg-Stendal University).

Thanks to everyone that made 2014 such a great success, and here's to a great start to 2015!

--Michael





2015-01-15: The Winter 2015 Federal Cloud Computing Summit



On January 14th-15th, I attended the Federal Cloud Computing Summit in Washington, D.C., a recurring event in which I have participated in the past. In my continuing role as the MITRE-ATARC Collaboration Session lead, I assisted the host organization, the Advanced Technology And Research Center (ATARC), in organizing and running the MITRE-ATARC Collaboration Sessions. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

The collaboration sessions continue to be highly valued within the government and industry. The Winter 2015 Summit had over 400 government or academic registrants and more than 100 industry registrants. The whitepaper summarizing the Summer 2014 collaboration sessions is now available.

A discussion of FedRAMP and the future of its policies was held in a Government-only session at 11:00, before the collaboration sessions began.
At its conclusion, the four collaboration sessions began, focusing on the following topics.
  • Challenge Area 1: When to choose Public, Private, Government, or Hybrid clouds?
  • Challenge Area 2: The umbrella of acquisition: Contracting pain points and best practices
  • Challenge Area 3: Tiered architecture: Mitigating concerns of geography, access management, and other cloud security constraints
  • Challenge Area 4: The role of cloud computing in emerging technologies
Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will continue its practice of releasing a summary document after the Summit (for reference, see the Summer 2014 and Winter 2013 summit whitepapers).

On January 15th, I attended the Summit itself, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:25 to 4:10, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the fourth Federal Summit event in which I have participated, including the Winter 2013 and Summer 2014 Cloud Summits and the 2013 Big Data Summit. They are great events that the Government participants have consistently identified as high-value. The events also garner a decent amount of press in the federal news outlets and at MITRE. Please refer to the fedsummits.com list of press for the most recent articles about the summit.

We are continuing to expand and improve the summits, particularly with respect to the impact on academia. Stay tuned for news from future summits!

--Justin F. Brunelle

2015-02-05: What Did It Look Like?

Having often wondered why many popular videos on the web are time lapse videos (that is, videos which capture the change of a subject over time), I came to the conclusion that impermanence gives value to the process of preserving ourselves or other subjects in photography, as though it were a means to defy the compulsory, fundamental law of change. Just like our lives, one of the greatest products of human endeavor, the World Wide Web, was once small but has continued to grow. So it is only fitting for us to capture the transitions.
What Did It Look Like? is a Tumblr blog which uses the Memento framework to poll various public web archives, take the earliest archived version from each calendar year, and then create an animated image that shows the progression of the site through the years.

To seed the service we randomly chose some web sites and processed them (see also the archives). In addition, everyone is free to nominate web sites to What Did It Look Like? by tweeting: "#whatdiditlooklike URL". 

To see how this system works, consider the architecture diagram below.

The system is implemented in Python and utilizes Tweepy and PyTumblr to access the Twitter and Tumblr APIs, respectively. It consists of the following programs:
  1. timelapseTwitter.py: This application fetches tweets (with the "#whatdiditlooklike URL" signature) by using the tweet ID of the last tweet visited as a reference for where to begin retrieving tweets. For example, if the application initially visited tweet IDs 0, 1, and 2, it keeps track of ID 2 so as to begin retrieving tweets with IDs greater than 2 in a subsequent retrieval operation. Also, since Twitter rate limits search operations (180 requests per 15-minute window), the application sleeps in between search operations. A sketch of this fetch-after-the-last-tweet logic appears after this list.
  2. usingTimelapseToTakeScreenShots.py: This is a simple application which invokes timelapse.py for each nomination tweet (that is, a tweet with the "#whatdiditlooklike URL" signature).
  3. timelapse.py: Given an input URL, this application uses PhantomJS (a headless browser) to take screenshots and ImageMagick to create an animated GIF. The GIFs are optimized to keep their respective sizes under 1 MB, which ensures the animation is not deactivated by Tumblr.
  4. timelapseSubEngine.py: This application executes two primary operations:
    1. Publishing the animated GIFs of nominated URLs to Tumblr: This is done through the PyTumblr create_photo() method.
    2. Notifying the referrer and making status updates on Twitter: This is achieved through Tweepy's api.update_status() method. However, simply posting periodic status update messages could eventually get them flagged as spam by Twitter, which comes in the form of a 226 error code. To avoid this, timelapseSubEngine.py does not post the same status update or notification tweet twice. Instead, the application randomly selects from a suite of messages and injects a variety of attributes to ensure the status update tweets differ. The randomness in execution comes from a custom cron application which randomly executes the entire stack, from timelapseTwitter.py down to timelapseSubEngine.py.
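The code snippets in the original post were embedded as images; as a rough illustration of the fetch-after-the-last-tweet idea from step 1 (using the tweepy.API.search() call available at the time; the function and variable names here are mine):

import time

def fetch_new_nominations(api, last_seen_id):
    # api is an authenticated tweepy.API instance; last_seen_id is the ID of the
    # last nomination tweet already processed (persisted between runs).
    tweets = api.search(q="#whatdiditlooklike", since_id=last_seen_id)
    for tweet in tweets:
        last_seen_id = max(last_seen_id, tweet.id)
    # Twitter rate limits search to 180 requests per 15-minute window,
    # so sleep before the next search operation.
    time.sleep(60)
    return tweets, last_seen_id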

How to nominate sites onto What Did It Look Like?

If you are interested in seeing what a web site looked like through the years:
Tweet"#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn"to nominate multiple URLs.

How to explore historical posts

To explore historical posts, visit the archives page: http://whatdiditlooklike.tumblr.com/archives

Examples 

What Did cnn.com Look Like?

What Did cs.odu.edu Look Like?
What Did apple.com Look Like?

"What Did It Look Like?" is inspired by two sources: 1) the "One Terabyte of Kilobyte Age Photo Op" Tumblr that Dragan Espenschied presented at DP 2014 (which basically demonstrates digital preservation as performance art; see also the commentary blog by Olia Lialina& Dragan), and 2) the Digital Public Library of America (DPLA) "#dplafinds" hashtag that surfaces interesting holdings that one would otherwise likely not discover.  Both sources have the idea of "randomly" highlighting resources that you would otherwise not find given the intimidatingly large collection in which they reside.

We hope you'll enjoy this service as a fun way to see how web sites -- and web site design! -- have changed through the years.

--Nwala

2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive


On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge of link rot by detecting when a clicked link will return an HTTP 404 and using the Memento Time Travel service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. When the event fires, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When an HTTP 404 response code is detected, robustify.js makes an Ajax call to a remote server, uses the Memento Time Travel service to find mementos of the URI-R, and uses a JavaScript alert to let the user know that it will be redirected to a memento.
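A rough, simplified Python analogue of that client-side behavior (robustify.js itself does this in the browser via the digitopia.nl status service; the endpoint pattern below is the Time Travel URI template shown later in this post):

import requests

def robust_link(url, when="19990101000000"):
    # If the live URI does not return a 404, keep it.
    try:
        live = requests.head(url, allow_redirects=True, timeout=10)
        if live.status_code != 404:
            return url
    except requests.RequestException:
        pass
    # Otherwise, ask the Memento Time Travel service for a memento of the URI-R.
    timegate = "http://timetravel.mementoweb.org/memento/%s/%s" % (when, url)
    memento = requests.get(timegate, allow_redirects=True, timeout=10)
    return memento.url  # the URI-M the service redirected to, if any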

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Along these lines, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page http://www.dds.nl/~krantb/stellingen/.
  1. http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html which does not use robustify.js
  2. http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html which does use robustify.js
In robustifyTest.html, when the user clicks on the link to http://www.dds.nl/~krantb/stellingen/, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F HTTP/1.1
Host: digitopia.nl
Connection: keep-alive
Origin: http://www.cs.odu.edu
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Referer: http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ as expected.

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html and http://web.archive.org/web/20150206215522/http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html.

The Internet Archive re-wrote the embedded links in the memento to be relative to the archive, converting http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://www.dds.nl/~krantb/stellingen/. Upon further investigation, we noticed that robustify.js does not assign onclick events to anchor tags linking to pages within the same domain as the host page. No onclick event is assigned to any of the embedded anchor tags because all of the links now point within the Internet Archive, the host domain. Due to this design decision, robustify.js is never invoked within the archive.

When the user clicks the re-written link, the 2015-02-06 memento of http://www.dds.nl/~krantb/stellingen/ does not exist, so the Internet Archive redirects the user to the closest memento, http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/. The user ends up at the 1999 memento because the Internet Archive understands how to redirect from the 2015 URI-M, for which no memento exists, to the 1999 URI-M, for which one does. If the Internet Archive had no memento of http://www.dds.nl/~krantb/stellingen/ at all, the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel service to search additional archives.

The robustify.js file is archived (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js), but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by inserting the "yyyymmddhhmmss" and "url" variable strings:

archive:"http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"http://digitopia.nl/services/statuscode.php?url={url}"

These templates are rewritten during playback to be relative to the Internet Archive:

archive:"/web/20150206214020/http://timetravel.mementoweb.org/memento/{yyyymmddhhmmss}/{url}",statusservice:"/web/20150206214020/http://digitopia.nl/services/statuscode.php?url={url}"

Because the robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) in our test page (http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html). In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2 (since test-r.html exists on www.cs.odu.edu, the links are relative to www.cs.odu.edu instead of archive.org).

Instead of issuing an HTTP GET for http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F, robustify.js issues an HTTP GET for
http://www.cs.odu.edu/web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F which returns an HTTP 404 when dereferenced.
The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to digitopia.nl/services/.
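The broken request can be reproduced in a few lines by resolving the re-written, root-relative service URI against the page that includes it (the URLs are those from the example above):

from urllib.parse import urljoin, quote

page = "http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html"
rewritten = "/web/20150206214020/http://digitopia.nl/services/statuscode.php?url={url}"
target = "http://www.dds.nl/~krantb/stellingen/"

# A root-relative URI resolves against the including page's host (www.cs.odu.edu),
# not archive.org, which yields the 404ing URI shown above.
print(urljoin(page, rewritten.replace("{url}", quote(target, safe=""))))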

In our test mementos, the Internet Archive also re-writes the URI-M http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/.

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.

We also created an Archive.today memento of our robustifyTest.html page: https://archive.today/l9j3O. In this memento, the functionality of the robustify script is removed, so clicking the link sends the user to http://www.dds.nl/~krantb/stellingen/, which results in an HTTP 404 response from the live web. The link to the Internet Archive memento is re-written to https://archive.today/o/l9j3O/http://www.dds.nl/~krantb/stellingen/, which results in a redirect (via a refresh) to http://www.dds.nl/~krantb/stellingen/, which again results in an HTTP 404 response from the live web. Archive.today uses this redirect approach as standard operating procedure: it re-writes all links to URI-Ms back to their respective URI-Rs.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that Archive.today simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to utilize URI-Ms to mitigate link rot or utilize tools to help discover mementos of defunct links, there is a potential that the archives may see additional challenges of this nature arising.


--Justin Brunelle

                       

2015-02-17: Reactions To Vint Cerf's "Digital Vellum"

Don't you just love reading BuzzFeed-like articles, constructed solely of content embedded from external sources?  Yeah, me neither.  But I'm going to pull one together anyway.

Vint Cerf generated a lot of buzz last week when, at an AAAS meeting, he gave a talk titled "Digital Vellum".  The AAAS version, to the best of my knowledge, is not online, but this version of "Digital Vellum" from CMU-SV earlier the same week is probably the same.



The media (e.g., The Guardian, The Atlantic, BBC) picked up on it, because when Vint Cerf speaks people rightly pay attention.  However, the reaction from archiving practitioners and researchers was akin to having your favorite uncle forget your birthday, mostly because Cerf's talk seemed to ignore the last 20 or so years of work in preservation.  For a thoughtful discussion of Cerf's talk, I recommend David Rosenthal's blog post.  But let's get to the BuzzFeed part...

In the wake of the media coverage, I found myself retweeting many of my favorite wry responses starting with Ian Milligan's observation:


Andy Jackson went a lot further, using his web archive (!) to find out how long we've been talking about "digital dark ages":



And another one showing how long The Guardian has been talking about it:


And then Andy went on a tear with pointers to projects (mostly defunct) with similar aims as "Digital Vellum":









Andy's dead right, of course.  But perhaps Jason Scott has the best take on the whole thing:



So maybe Vint didn't forget our birthday, but we didn't get a pony either.  Instead we got a dime kitty

--Michael

2015-03-02: Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month or maybe last week and had a difficult time recalling how the pieces fit together? Or, perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation.  I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took during the period from August to December of 2014, were part of a nine course specialization in the area of data science. The various topics included R Programming, Statistical Inference and Machine Learning. Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country. The courses I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work to be quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort each week.

While the data science courses are primarily focused on data collection, analysis and methods for producing statistical evidence, there was a persistent theme throughout -- this notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the data science specialization, illustrates the gap between no replication and the possibilities for full replication when both the data and the computer code are made available. This was a recurring theme that was reinforced with the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings and provide the code to a group of anonymous reviewers; other students in the course. This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.





I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments for scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research) or the Duke University medical research scandal which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered some studies, as noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications posted in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries which cannot be reproduced can also lead to retracted publications which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline; data which is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.


Cartoon by Sidney Harris (The New Yorker)


So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013 which lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas that were presented in the data science courses; a small sketch of how a few of them look in practice follows the list.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Archiving tools such as Subversion or Git can be used to track the evolution of code as it is being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations.
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
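As a small illustration of how a few of these rules look in practice (Rules 6 and 7 here, in a generic Python sketch with hypothetical file names; the courses themselves used R):

import csv
import random

RANDOM_SEED = 20150302          # Rule 6: record the exact seed used for the analysis
random.seed(RANDOM_SEED)

sample = [random.gauss(0, 1) for _ in range(1000)]

# Rule 7: store the raw values behind the plot in a standardized format,
# so the figure can be regenerated without redoing the entire analysis.
with open("figure1_raw_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["value"])
    writer.writerows([v] for v in sample)

# Rules 4 and 10: this script itself is kept under version control (e.g., Git)
# and published alongside the data.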
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: Integrated development environment for R.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: New enhancements to the Sweave package for dynamic report generation. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Open source version control repository.
  • Apache Subversion (SVN): Open source version control repository.
  • iPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy

2015-03-10: Where in the Archive Is Michele Weigle?

(The title is an homage to the popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It's been captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/
http://www.cs.unc.edu/~clark/
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links re-written to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was only crawled 2 times, both in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so those both return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/
http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code. Everything that isn't 500-level (server error) is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.

Wouldn't it be cool if when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/
http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/
http://www.cs.unc.edu/~clark/
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two pages point to the same thing, should these two timemaps be merged? What happens if at some point in the future I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!

--Michele

2015-03-23: 2015 Capital Region Celebration of Women in Computing (CAPWIC)

On February 27-28, I attended the 2015 Capital Region Celebration of Women in Computing (CAPWIC) in Harrisonburg, VA on the campus of James Madison University.  Two of our graduating Masters students, Apeksha Barhanpur (ACM president) and Kayla Henneman (ACM-W president) attended with me.

With the snow that had blanketed the Hampton Roads region, we were lucky to get out of town on Friday morning.  We were also lucky that Harrisonburg had their foot of snow over the previous weekend so that there was plenty of time for all of the roads to be cleared.  We had some lovely scenery to view along the way.

We arrived a little late on Friday afternoon, but Apeksha and Kayla were able to attend "How to Get a Tech Job" by Ann Lewis, Director of Engineering at Pedago.  This talk focused on how each student has to pick the right field of technology for their career. The speaker presented some basic information on the different fields of technology and different levels of job positions and companies. The speaker also mentioned the "Because Software is Awesome" Google Group, which is a private group for students seeking information on programming languages and career development.

While they attended the talk, I caught up with ODU alum and JMU assistant professor, Samy El-Tawab.

After a break, I put on my Graduate Program Director hat and gave a talk titled "What's Grad School All About?"


I got to reminisce about my grad school days, share experiences of encountering the imposter syndrome, and discuss the differences between the MS and PhD degrees in computer science.


After my talk, we set up for the College and Career Fair.  ODU served as an academic sponsor, meaning that we got a table where we were able to talk with several women interested in graduate school.  Apeksha and Kayla also got to pass out their resumes to the companies that were represented.

I also got to show off my deck of Notable Women in Computing playing cards.  (You can get your own deck at notabletechnicalwomen.org.)


Our dinner keynote, "Technology and Why Diversity Matters," was given by Sydney Klein, VP for Information Security and Risk Management at Capital One. (Capital One had a huge presence at the conference.) One thing she emphasized is that Capital One now sees itself as more of a technology company than a bank. Klein spoke about the importance of women in technology and the percentages of women that are represented in the field at various levels. She also mentioned various opportunities present within the market for women.

After dinner, we had an ice breaker/contest where everyone was divided into groups, each with the task of creating a flag representing the group and its relation to the field of computer science. Apeksha was on the winning team!  Their flag represented the theme of the conference, "Women make the world work", and how the group was connected to the field of technology. Apeksha noted that this was a great experience, working with a group of women from different regions around the world.

On Saturday morning, Apeksha and Kayla attended the "Byte of Pi" talk given by Tejaswini Nerayanan and Courtney Christensen from FireEye. They demonstrated programming using the Raspberry Pi, a single-board computer.  The students were given a small demonstration on writing code and building projects.

Later Saturday, my grad school buddy, Mave Houston arrived for her talk.  Mave is the Founder and Head of USERLabs and User Research Strategy at Capital One. Mave gave a great talk, titled "Freedom to Fail". She also talked about using "stepping stones on the way to success." She let us play with Play-Doh, figuring out how to make a better toothbrush. My partner, a graduate student at Virginia State University, heard me talk about trying to get my kids interested in brushing their teeth and came up with a great idea for a toothbrush with buttons that would let them play games and give instructions while they brushed. Another group wanted to add a sensor that would tell people where they needed to focus their brushing.

We ended Saturday with a panel on graduate school that both Mave and I helped with and hopefully encouraged some of the students attending to continue their studies.

-Michele

2015-04-05: From Student To Researcher...

In 2010, I decided to again study at the Old Dominion University Computer Science Department for better employment opportunities. After taking some classes, I realized that I did not merely want to take classes and earn a Master's Degree, but also wanted to contribute knowledge, like those who wrote the many research papers I had read during my courses.

My Master's Thesis is titled "Avoiding Spoilers On MediaWiki Fan Sites Using Memento".   I came to the topic via a strange route.

During Dr. Nelson's Introduction to Digital Libraries course, we built a digital library based on a single fictional universe.  I chose the television show Lost, and specifically archived Lostpedia, a site that my wife and I used while watching and discussing the show.  We realized that fans were updating Lostpedia while episodes aired.  This highlighted the idea that wiki revisions created prior to the episode obviously did not contain information about that episode, and emphasized that episodes led to wiki revisions.

A few years later, a discussion about Game of Thrones occurred at work.  I realized that some of us had seen the episode from the night before while others had not.  We wanted to use the Game of Thrones Wiki to continue our conversation, but realized that those who had not seen the episode easily encountered spoilers.  By this point, I was quite familiar with Memento, had used Memento for Chrome, and was working on the Memento MediaWiki Extension.  The idea of using Memento to avoid spoilers was born.

The figure above exhibits the Naïve Spoiler Concept.  The concept is that wiki revisions in the past of a given episode should not contain spoilers, because the information has not yet been revealed by the episode, hence fans could not write about it.  Conversely, wiki revisions in the future of a given episode will likely contain spoilers, seeing as episodes cause fans to write wiki revisions.

It turned out that there was more to avoiding spoilers in fan wiki sites than merely using Memento and the Naïve Spoiler Concept.  Most TimeGates use a heuristic that is not reliable for avoiding spoilers, so I proposed a new one and demonstrated why the existing heuristic was insufficient by calculating the probability of encountering a spoiler using the current heuristic.  I also used the Memento MediaWiki Extension to demonstrate this new heuristic in action.  In this way I was able to develop a Computer Science Master's Thesis on the topic.

Mindist (minimum distance) is the heuristic used by most TimeGates. This works well for a sparse archive, because often the memento closest to the datetime you have requested is best.  Wikis have access to every revision, allowing us to use a new heuristic, minpast (minimum distance in the past, i.e., minimum distance without going over the given date).  Using records from fan wikis, I showed that, if one is trying to avoid spoilers, there can be as much as a 66% chance of encountering a spoiler if we use the Wayback Machine or a Memento TimeGate using mindist.  I also analyzed Wayback Machine logs for wikia.com requests and found that 19% of those requests ended up in the future.  From these studies, it was clear that using minpast directly on wikis was the best way to avoid spoilers.
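To make the difference between the two heuristics concrete, below is a minimal Python sketch; the revision datetimes and episode air time are hypothetical, and this is not the thesis's actual implementation.

```python
from datetime import datetime

def mindist(revisions, target):
    """The heuristic most TimeGates use: pick the revision closest to the
    requested datetime, in either direction."""
    return min(revisions, key=lambda r: abs(r - target))

def minpast(revisions, target):
    """The spoiler-avoiding heuristic: pick the closest revision that does not
    come after the requested datetime."""
    past = [r for r in revisions if r <= target]
    return max(past) if past else None

# Hypothetical wiki revision datetimes surrounding an episode's air time.
revisions = [datetime(2014, 4, 6, 20, 0), datetime(2014, 4, 7, 3, 0)]
requested = datetime(2014, 4, 7, 2, 0)

print(mindist(revisions, requested))  # 2014-04-07 03:00 -- a likely spoiler
print(minpast(revisions, requested))  # 2014-04-06 20:00 -- safely in the past
```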

While I was examining fan wikis for spoilers, I also had the opportunity to compare wiki revisions with mementos recorded by the Internet Archive.  Using this information I was able to reveal how the Internet Archive's sparsity is changing over time.  Because wikis keep track of every revision, we can see which updates the Internet Archive missed.

In the figure above, we see a timeline for each wiki page included in the study.  The X-axis shows time and the Y-axis consists of an identifier for each wiki page.  Darker colors indicate more missed updates by the Internet Archive.  We see that the colors are getting lighter, meaning that the Internet Archive has become more aggressive in recording pages.

Below are the slides for the presentation, available on my SlideShare account, followed by the video of my defense posted to YouTube.  The full document of my Master's Thesis is available here.







Thanks to Dr. Irwin Levinstein and Dr. Michele Weigle for serving on my committee.  Their support has been invaluable during this process. Were it not for Dr. Levinstein, I would not have been able to become a graduate student.  Were it not for Dr. Weigle's wonderful Networking class, I would not have been able to draw some of the conclusions necessary to complete this thesis.

Much of the thanks goes to my advisor, Dr. Michael L. Nelson, who spent hours discussing these concepts with me, helping correct my assumptions and assessments when I erred, while praising the experience when I came up with something original and new.  His patience and devotion not only to the area of study, but also the art of mentoring, led me down the path of success.

In the process of creating this thesis, I also created a technical report which can be referenced using the BibTeX code below.



So, what is next?  Do I use wikis to study the problem of missed updates in more detail? Do I study the use of the naïve spoiler concept in another setting?  Or do I do something completely different?

I realize that I have merely begun my journey from student to researcher, but know even more now that I will enjoy the path I have chosen.

--Shawn M. Jones, Researcher

2015-04-20: Virginia Space Grant Consortium Student Research Conference Report

Mat Kelly and various other graduate students in the state of Virginia present their graduate research at the Virginia Space Grant Consortium.                           

On Friday, April 17, 2015 I attended the Virginia Space Grant Consortium (VSGC) Student Research Conference at NASA Langley Research Center (LaRC) in Hampton, Virginia. This conference is slightly beyond the scope of what we at ODU WS-DL (@webscidl) usually investigate, as the research requirement was that it be relevant to NASA's objectives as a space agency.

My previous work with LaRC's satellite imagery allowed me to approach the imagery files from the perspective of a computational scientist. More on my presentation, "Facilitation of the A Posteriori Replication of Web Published Satellite Imagery", below.

The conference started off with registration and a provided continental breakfast. Mary Sandy, the VSGC Director, and Chris Carter, the VSGC Deputy Director, began by describing the history of the Virginia Space Grant Consortium program, including the amount contributed since its inception and the number of recipients that have benefitted from being funded.

The conference was organized in a model consisting of concurrent sessions of two to three themed presentations by undergraduate and graduate students at various Virginia universities.

First Concurrent Sessions

I attended the "Aerospace" session in the first morning session. In this session Maria Rye (Virginia Tech) started with her explorative research in suppressing distortions in tensegrity systems, a flexible structure held together by interconnected bars and tendons.

Marie Ivanco (Old Dominion University) followed Maria with her research in applying Analytic Hierarchy Processes (AHPs) for analytical sensitivity analysis and local inconsistency checks for engineering applications.

Peter Marquis (Virginia Tech) spoke third in the session with his research on characterizing the design variables to trim the LAICE CubeSat to obtain a statically stable flight configuration.

Second Concurrent Sessions

The second sessions seamlessly continued with Stephen Noel (Virginia Tech) presenting a similar work relating to LAICE. His work consisted of the development of software to read, parse, and interpret calibration data for the system.

Cameron Orr (Virginia Tech) presented the final work in the second Aerospace session with the exploration of the development of adapted capacitance manometers for thermospheric applications. Introducing this additional component as well as some detection circuitry allowed more accurate measurement of pressure changes.

Third Concurrent Sessions

After a short break where posters from graduate students around Virginia were presented, I opted to move to another room to view the Applied Science presentations.

Atticus Stovall (University of Virginia) described his system for modeling forest carbon using height-to-biomass relationships as well as voxel-based volume modeling as a means of evaluating the amount of carbon stored.

Matthew Giarra (Virginia Tech) wrapped up the short session with a visual investigation of the flow of hemolymph (blood) in insects' bodies as a potential model for non-directional fluid pumping.

Fourth Concurrent Sessions

The third session immediately segued into the fourth session of the day, where I changed rooms to attend the Astrophysics presentations.

Charles Fancher (William & Mary) presented work on a theoretical prototype for an ultracold atom-based magnetometer for accurate timekeeping in space.


John Blalock (Hampton University) presented next in the Astrophysics session with his work on using various techniques to measure wind speeds on Saturn from the results returned by the Cassini orbiter's Imaging Science Subsystem.


Kimberly Sokal (University of Virginia) wrapped up the fourth session with her enthusiastic presentation on emerging super star clusters with Wolf-Rayet stars. Her group discovered that the star cluster S26 in NGC 4449 is undergoing an evolutionary transition that is not well understood. The ongoing work may provide feedback as to the tipping point of the emerging process that affects the super star cluster's ability to remain bound.

The conference then broke for an invitation-only lunch with a keynote address by Dr. David Bowles, Acting Director of NASA Langley Research Center.

Fifth Concurrent Sessions

For the final session of the day, I attended and presented at the Astrophysics session. Emily Mitchell (University of Virginia) presented first with her study on the irradiation effects of H2-laden porous water ice films in the interstellar medium (ISM). She exposed ice to hydrogen gas at different pressures after deposition and during radiation. She reported that H2 concentration increases with decreasing ion flux, suggesting that as much as 7 percent solid H2 is trapped in interstellar ice by radiation impacts.

Following Emily, I (Mat Kelly, your author, of Old Dominion University) presented my work on the Facilitation of the A Posteriori Replication of Web Published Satellite Imagery. By creating software to mine the metadata and a system that allows peer-to-peer sharing of the public domain satellite imagery currently located solely on the NASA Langley servers, I was able to mitigate the reliance on a single source of the data. The system I created utilizes concepts from ResourceSync, BitTorrent, and WebRTC.

Wrap Up

The Virginia Space Grant Consortium Student Research Conference was extremely interesting despite being somewhat different in topic compared to our usual conferences. I am very glad that I got the opportunity to do the research for the fellowship and hope to progress the work for further applications beyond satellite imagery.

Mat (@machawk1)

2015-05-07: Teaching Undergraduate Computer Science Using GitHub and Docker

Mat Kelly taught CS418 - Web Programming at Old Dominion University in Spring 2015. This blog post highlights some teaching methods and technologies used (namely, Docker and GitHub) and how he integrated their usage into the flow of the course.                           

For Spring Semester at Old Dominion University I taught CS418 - Web Programming with some updated methods and content. This course has been previously taught by various members of ODU WS-DL (2014, 2013, 2012).

The first deviation from previous offerings of the course was to change the subject of the project. Previously, CS418 students were asked to progressively build an online forum like phpBB. Web sites resembling this medium are no longer as common as they once were on the Web, so a refresh was needed to keep the project familiar and relevant.

For Spring, I asked students to build a Question-and-Answer website akin to StackOverflow.com. Being students of computer science, all were familiar with the contemporary model of online discussions and soliciting help from others experienced in an area (e.g., computer programming).

The coursework began with lectures about Web Fundamentals, followed by more technical lectures on PHP, MySQL, JavaScript, and an HTML/CSS primer for those students who had programmed but never created a web page. The lectures were old news for some students, who were already employed (CS418 is a senior-level course), and completely new for others, who had programmed but never for the Web.

The delivery of the project is an aspect that made this semester's course unique. In a preliminary assignment very early in the semester, I required each student to:

  1. Fork the class GitHub repository
  2. Pull a working copy to their system
  3. Add a single file to the repository
  4. Commit the change
  5. Submit a pull request to the class repository

This ensured a base knowledge of version control dynamics and, via the single file submitted, also required the students to provide a reference to the repository for their class project. A student's project repository was different from the fork of the class repository.

GitHub inherently facilitates sharing of source code - an aspect that I did not particularly want to encourage with the individual students' projects. The GitHub Student Developer Pack provided a solution for this. By contacting GitHub and providing proof of student status, each student was supplied a small number of private repositories, which would normally require a monthly fee. The program also offers students many other benefits free of charge, such as credit on a cloud hosting platform, a free domain name on the .me TLD, and private builds from one of the more popular continuous integration services.

Along with submitting a pull request, I also asked the students to add me as a "collaborator" on GitHub for the repository they each specified, allowing access for grading whether or not the student decided to take advantage of the Student Developer Pack.

As the students began to build features for each of the four milestone requirements in the course, I reiterated that what was checked into their GitHub repository come demo day is what they would be graded on. This circumvented the "my computer crashed" and "the dog ate my homework" excuses, but introduced the issue of "I must have forgotten to check my updated code in". To remedy this, but mainly to allow students to verify that their code would work as expected on demo day, I put together a demo day deployment system using Docker.

Docker allows easy, systematic deployment of software that is sandboxed from a host system yet extensible to communicate between multiple instances ("containers" in Docker jargon). Using Docker allowed a student to iteratively test the code they had checked into their GitHub repository from the comfort of their home while instilling confidence on the correctness of the features they had implemented thus far. While previous offerings of the class provided students with a Virtual Machine (VM) on which to develop their project, I opted to use Docker instead, as it provides an isolated environment for each student with a freshly installed OS each time their code is deployed. Docker also allowed the packages and libraries needed by the students for production to be parameterized. A downside to using Docker over a VM is the students' reliance on a central server for deployment. However, this "benefit" of VMs does not guarantee the consistency of presentation for demo day, as a local VM might be configured differently than the demo day machine.

Our Docker Deployment System was hosted on a server at ODU but was accessible to the world. Each student was supplied a unique port number, allowing students to simultaneously use the system without fear of clashing with other students testing. The system evolved as the semester drew on and I continuously developed it. Using the system is fairly easy and intuitive.

The student first enters their ODU CS username in a text field.

The Docker Deployment System dynamically queries the class GitHub repository to link the CS username to a student's GitHub repository, as previously submitted. This cross-referencing prevented abuse by GitHub users who were not registered for the class and required students to execute the procedure as a prerequisite for demo day (i.e., submission of assignments).

The user can then authenticate with GitHub by clicking a button, which brings up the login dialog on the GitHub website.

Upon successful login, the user is returned to the Docker Deployment System interface with the same button now reading "Dockerize my code". Selecting this button invokes a server side scripted process.

Brief messages are shown to the user to indicate the script process that is being followed on the server. In sequence, the script:

  1. Deletes any old remnants from previous deployments by the student
  2. Clones the user's repository using Git and the GitHub API access token, obtained from the user logging in (this is critical if the user's project repository is private)
  3. Kills previously deployed Docker instances spawned by the user
  4. Removes the previously deployed instances to ensure a fresh copy is used by Docker
  5. Fires up a new container in the Docker Deployment System using the student's latest code (from the above repo clone)
  6. Provides HTML links to the student to test their code.
Docker containers are defined in a Dockerfile, a standard format that references the base OS and any packages required for the container. For the students' deployment, I used Ubuntu as the base along with Apache, PHP, and MySQL in the CS418 Dockerfile. A directive in the Dockerfile also provides the hook to allow the directory containing the student's code to be used as the default "website" for Apache. The students provided a MySQL database dump in the root of their project repository, which is loaded when the container for their project is instantiated.
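As a rough illustration of the deployment steps described above, here is a minimal Python sketch of such a server-side script; the image name, paths, port handling, and hostname are all hypothetical, and this is not the actual CS418 system's code.

```python
import subprocess

def dockerize(cs_username, repo_url, port, token):
    """Hypothetical sketch of the server-side deployment steps: wipe old
    remnants, clone the student's repository with the OAuth token, replace
    any running container, and start a fresh one on the student's port."""
    workdir = f"/tmp/cs418/{cs_username}"
    container = f"cs418_{cs_username}"

    # 1-2. Delete old remnants and clone the (possibly private) repository.
    subprocess.run(["rm", "-rf", workdir], check=True)
    authed_url = repo_url.replace("https://", f"https://{token}@")
    subprocess.run(["git", "clone", authed_url, workdir], check=True)

    # 3-4. Kill and remove any previously deployed container for this student.
    subprocess.run(["docker", "rm", "-f", container], check=False)

    # 5. Fire up a new container from the class image, mounting the student's
    #    code as the Apache document root.
    subprocess.run(["docker", "run", "-d", "--name", container,
                    "-p", f"{port}:80",
                    "-v", f"{workdir}:/var/www/html",
                    "cs418-base"], check=True)

    # 6. Return the link the student can use to test their deployment.
    return f"http://example-server.cs.odu.edu:{port}/"
```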

For the most part, the initial bumps for the students to effectively use the system were overcome. Students reiterated throughout the semester that the tool was extremely useful in testing their code and ensuring that nothing unexpected would occur on demo day.

In summary, the usage of the Docker Deployment System developed for the Spring 2015 session of CS418 Web Programming at Old Dominion University and the required submission of coursework via GitHub allowed students to gain experience with tools and iterative testing that previous models of verifying code submissions (e.g., "magic" laptops and e-mailed code, respectively) are unable to effectively facilitate. The project-based nature of CS418 was an appropriate testing medium for developing both the system and the workflow. In the future, I hope to reuse the system and workflow to teach a less technically driven course to evaluate the portability of the methods.

Special thanks to Sawood Alam (@ibnesayeed) for his technical assistance in working with Docker throughout the semester and Minhao Dong for being the ODUCS access point to ensure that students' project deployment did not compromise the university network.

Mat (@machawk1)

2015-05-09: IIPC General Assembly 2015 Trip Report

The day before the International Internet Preservation Consortium (IIPC) General Assembly 2015 we landed in San Francisco, where some delicious Egyptian dishes were waiting for us. Thank you Ahmed, Yasmin, Moustafa, Adrian, and Yusuf for hosting us. It was a great way to spend the evening before the IIPC GA and we were delighted to see you all after a long time.

Day 1

We (Sawood Alam, Michael L. Nelson, and Herbert Van de Sompel) entered the conference hall a few minutes after the session had started, and Michael Keller from Stanford University Libraries was about to leave the stage after the welcome speech. IIPC Chair Paul Wagner gave brief opening remarks and invited the keynote speaker, Vinton Cerf from Google, on stage. The title of the talk was "Digital Vellum: Interacting with Digital Objects Over Centuries" and it was an informative and delightful talk. He mentioned that high-density, low-cost storage media are evolving, but the devices to read them might not last long. While mentioning Internet-connected picture frames and surfboards, he added that we should not forget about security. To emphasize the security aspect he gave an example: grandparents would love to see their grandchildren in those picture frames, but will not be very happy if they see something they do not expect.

Moving on to software emulators, he invited Mahadev Satyanarayanan from Carnegie Mellon University to talk about their software archive and emulator called the Olive Archive. Satya gave various live demos including the Great American History Machine, ChemCollective (a copy of the website frozen at a certain time), PowerPoint 4.0 running in Windows 3.1, and the Oregon Trail, all powered by their virtual machines and running in a web browser. He also talked about the architecture of the Olive Archive and how, in the future, multiple instances could be launched and orchestrated to emulate a subset of the Internet for applications that rely on external services, with some instances running those services independently.

In the Q&A session someone asked Cerf how to get big companies like Google to provide the data from their Crisis Response efforts for archiving after they are done with it. Cerf responded, "you just did," while acknowledging the importance of such data for archiving. Here are some tweets that captured the moment:

After the break Niels Brügger and Janne Nielsen presented their case study of the Danish websphere under the title "Studying a nation's websphere over time: analytical and methodological considerations". Their study covered website content, file types, file sizes, backgrounds, fonts, layout, and, more importantly, domain names. They also raised points such as the size of the ".dk" domain, geolocation, inter- and intra-domain link networks, and whether Danish websites are actually in the Danish language. They talked about some crawling challenges. Their domain name analysis shows that only 10% of owners own 50% of all ".dk" domains. I suspected that this result might be due to private domain name registrations, so I talked to them later and they said they had not thought about private registrations, but they will revisit their analysis.

Andy Jackson from the British Library took the stage with his presentation titled "Ten years of the UK web archive: what have we saved?". This case study covers three collections: the Open Archive, the Legal Deposit Archive, and the JISC Historical Archive. These collections store over eight billion resources in over 160 TB of compressed files and are now adding about two billion resources per year. With the help of a nice graph he illustrated that not all ".uk" domains are interlinked, so to maximize the coverage the crawlers need to include other popular TLDs such as ".com". He also presented an analysis of reference rot and content drift utilizing the "ssdeep" fuzzy hash algorithm. Their analysis shows that 50% of resources are unrecognizable or gone after one year, 60% after two years, and 65% after three years.
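For readers unfamiliar with ssdeep, the following minimal Python sketch shows the kind of fuzzy-hash comparison involved, assuming the python-ssdeep bindings are installed; the captures are hypothetical and this is not the British Library's analysis code.

```python
import ssdeep  # python-ssdeep bindings for the ssdeep fuzzy hashing library

# Hypothetical captures of the same URL taken a year apart.
capture_2013 = "<html><body>Departmental news, seminars, and research updates.</body></html>"
capture_2014 = "<html><body>This domain has expired. Renew it now!</body></html>"

hash_2013 = ssdeep.hash(capture_2013)
hash_2014 = ssdeep.hash(capture_2014)

# ssdeep.compare returns a 0-100 similarity score; a low score suggests the
# content has drifted substantially or is effectively gone.
print(ssdeep.compare(hash_2013, hash_2014))
```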

I had lunch together with Scott Fisher from the California Digital Library. I told him about the various digital library and archiving research projects we are working on at Old Dominion University, and he described the holdings of his library and the challenges they have in upgrading their Wayback to bring Memento support.

After the lunch, the keynote speaker of the second session, Cathy Marshall from Texas A&M University, took the stage with a very interesting title, "Should we archive Facebook? Why the users are wrong and the NSA is right". She motivated her talk with some interview-style dialogues around the primary question, "Do you archive Facebook?", to which the answer was mostly "No!". She highlighted that people have developed the [wrong] sense that Facebook is taking care of their stuff, so they do not have to. She also noted that people usually do not value their Facebook content, or they think it has immediate value but no archival value. In a large survey she asked whether Facebook should be archived; three-fourths objected and half of them said "No" unconditionally. In the later part of her talk, she built the story of the marriage of Hal Keeler and Joan Vollmer by stitching together various cuttings from local newspapers. I am not sure if I could fully appreciate the story due to the cultural difference, but I laughed when everyone else did. I did follow her efforts and intention to highlight the need of archiving social media for future historians. And if you ask me, is the NSA right? My answer would be, "Yes, if they do it correctly with all the context included."

Meghan Dougherty from Loyola University Chicago and Annette Markham from Aarhus University presented their talk "Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life". They illustrated how, sometimes intentionally or unintentionally, people record moments of their lives with different media. Among various visual illustrations, I particularly liked the video of a street artist playing with a ring that was posted on Facebook in a very different context than the one in which it appeared on YouTube. They ended their talk with a hilarious video of Friendster.

Susan Aasman from the University of Groningen presented her talk "Everyday saving practices: "small data" and digital heritage strategies". This talk was full of motivation for why people should care about personally archiving their daily life moments. She described how the service Kodak Gallery launched in 2001 with the tag-line "live forever", and closed in 2012 after transferring billions of images to Shutterfly, which was only available to US customers. As a result, people from other countries lost their photo memories. She also played the Bye Bye Super 8 video by Johan Kramer, which was amusing and motivating for personal archiving.

After a short break Jane Winters from the Institute of Historical Research, Helen Hockx-Yu from the British Library, and Josh Cowls from the Oxford Internet Institute took the stage with their topic "Big UK domain data for Arts and Humanities", also known as the BUDDAH project. Jane highlighted the value of archives for research and described the development of a framework to help researchers leverage the archives. She illustrated the interface for the Big Data analysis of the BUDDAH project, described the planned output, and presented various case studies showing what can be done with that data.

Helen Hockx-Yu began her talk "Co-developing access to the UK Web Archive" with reference to the earlier talk by Andy. She noted that a scenario that fits everyone's needs is difficult. She described the high-level requirements, including query building, corpus formation, annotation and curation, and in-corpus and whole-dataset analysis. She illustrated the SHINE interface that provides features like full-text search, multi-facet filters, query history, and result export.

Finally, Josh Cowls presented his talk about the book "The Web as History: Using Web Archives to Understand the Past and the Present", to which he contributed a chapter. He talked about four second-level domains of the ".uk" TLD, including ".co.uk", ".org.uk", ".ac.uk", and ".gov.uk", and how they are interlinked. He described the growth of the web presence of the BBC and British universities.

IIPC Chair Paul Wagner concluded the day by emphasizing that we have only started scratching the surface. He also noted in his concluding remarks that the context matters.

Day 2

Herbert Van de Sompel from Los Alamos National Laboratory started the second day's sessions by talking about "Memento Time Travel". He started with a brief introduction to Memento followed by a bag full of announcements. For ease of use in JavaScript clients, Memento now supports JSON responses along with the traditional Link format. The Memento aggregator now provides responses in two modes: DIY (Do It Yourself) and WDI (We Do It). The service now also allows exporting the Time Travel Archive Registry in a structured format. Due to the default Memento support in Open Wayback, various Web archives now natively support Memento. There is an extension available for MediaWiki to enable Memento support in it. Herbert described Robust Links (Hiberlink) and how it can be used to avoid reference rot. He said that their service usage is growing, hence they upgraded the infrastructure and are now using the Amazon cloud for hosting services. He noted that going forward everyone will be able to participate by running Memento service instances in a distributed manner to provision load-balancing. He also demonstrated Ilya's work on constructing composite mementos from various sources to minimize temporal inconsistencies while visualizing the sources of mementos.
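As a small illustration of the JSON responses mentioned above, the sketch below queries the Memento Time Travel service from Python; treat the exact endpoint pattern and response keys as assumptions rather than a definitive API reference.

```python
import requests

uri = "http://example.com/"   # hypothetical original resource
accept_dt = "20150501120000"  # requested datetime, YYYYMMDDhhmmss

# Assumed Time Travel JSON endpoint: /api/json/<datetime>/<uri>
resp = requests.get(
    f"http://timetravel.mementoweb.org/api/json/{accept_dt}/{uri}")
resp.raise_for_status()
data = resp.json()

# The response describes mementos aggregated across participating archives;
# "closest" is the memento nearest the requested datetime.
print(data.get("mementos", {}).get("closest", {}))
```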

Daniel Gomes from the Portuguese Web Archive talked about "Web Archive Information Retrieval". He started by classifying web archive information needs into three categories: Navigational, Informational, and Transactional. He noted that the usual way of accessing an archive is URL search, which might not be known to the users. An alternate method is full-text search, which poses the challenge of relevance. Daniel described various relevance models in great detail and how to select features to maximize relevance. He announced that the dataset and code are available for free under an open source license. The code is hosted on Google Code, but due to their announcement of sunsetting the service, the code will be migrated to GitHub soon.

After this talk, there was a short break followed by the announcement that the remaining sessions of the day would have two parallel tracks. It was a hard decision to choose one track or the other, but I can watch the missed sessions later when the video recordings are made available. The parallel sessions were interfering with each other, so the microphone was turned off.

After the break Ilya Kreymer gave a live demo of his recent work "Web Archiving for all: Building WebRecorder.io". He acknowledged the collaboration with Rhizome and announced the availability of an invite-only beta implementation of WebRecorder. He demonstrated how WebRecorder can be used to perform personal archiving in What You See Is What You Archive (WYSIWYA) mode.

Zhiwu Xie from Virginia Tech presented "Archiving transactions towards an uninterruptible web service". He described an indirection layer between the web application server and the client that archives each successful response; when the server returns 4xx/5xx failure responses, it serves the most recent copy of the resource from the transactional archive. It is similar to services like CloudFlare in functionality from the clients' perspective, but it has the added advantage of building a transactional archive for website owners. Zhiwu demonstrated the implementation by reloading two web pages multiple times, of which one was utilizing the UWS and the other was directly connected to the web application server, which was returning the current timestamp with random failures. He mentioned that the system is not ready for prime time yet.
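The idea of the indirection layer can be sketched in a few lines of Python; this is only an illustration of the concept described above, not Zhiwu's implementation.

```python
import requests

# Maps a URL to the body of its most recent successful response.
transactional_archive = {}

def fetch_with_fallback(url):
    """Archive every successful origin response; on 4xx/5xx or a network
    failure, serve the most recent archived copy instead."""
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code < 400:
            transactional_archive[url] = resp.text
            return resp.text
    except requests.RequestException:
        pass
    return transactional_archive.get(url, "503 Service Unavailable")
```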

During the lunch break I was with Andy, Kristinn, and Roger, where we had a free-style conversation on advanced crawlers, CDX indexer memory error issues, the possibility of implementing the CDX indexer in Go, separating data and view layers in Wayback for easy customization, some YouTube videos such as "Is Your Red The Same as My Red?" and the hilarious "If Google was a Guy", TED talks such as "Can we create new senses for humans?", "Evacuated Tube Transport Technologies (ET3)", and the possible weather of Iceland around the time IIPC GA 2016 is scheduled.

Jefferson Bailey presented his talk on "Web Archives as research datasets". With various examples and illustrations from Archive-It collections he established the point that web archives are great sources of data for various kinds of research. He noted that WAT is a compact and easily parsable metadata file format that is about 18% of the size of the WARC data files.

Ian Milligan from the University of Waterloo presented his talk on "WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Three Types of Web Archives". He described the importance of web archives and why historians should use them. His talk was primarily based on three case studies: the Wide Web Scrape, the GeoCities End-of-Life Torrent, and the Archive-It longitudinal collection of Canadian Political Parties & Labour Organizations. I enjoyed his style of storytelling, some mesmerizing visualizations, and in particular the GeoCities case study. He noted that the GeoCities data was not in the form of WARC files; instead, it was a regular Wget crawl.

After a short break Ahmed AlSum from the Stanford University Library (and a WS-DL alumnus) presented his work on "Restoring the oldest U.S. website". He described how he turned yearly backup files of the SLAC website from 1992 to 1999 into WARC and CDX files with the help of Wget and by applying some manual changes to mimic the effect as if it had been captured in those early days. These transforms were necessary to allow the modern Open Wayback system to correctly replay it. Ahmed briefly handed the microphone over to Joan Winters, who was responsible for taking backups of the website in the early days, and she described how they did it. Ahmed also mentioned that the Wayback codebase had 1996 hardcoded as the earliest year, which was fixed by making it configurable.

As an afterthought, I would love to see this effort combined with Satya's Olive Archive so that everything from the server stack to the browser experience can be replicated as close to the original environment as possible.

Federico Nanni from the University of Bologna presented "Reconstructing a lost website". Looking at the schedule, my first impression was that it was going to be a talk about tools to restore any lost website and reconstruct all the pages and links with the help of archives. I was wondering if they were aware of Warrick, a tool that was developed at Old Dominion University with this very objective. But it turned out to be a case study of the world's oldest university, established around 1088. One of the many challenges he mentioned in reconstructing the university website was the exclusion of the site from the Wayback Machine for unknown reasons, which they tried to resolve together with the Internet Archive. Amusingly, one of the many sources of collecting snapshots was a clone of the site prepared by student protesters.

The last speaker of the second day, Michael L. Nelson from Old Dominion University, presented the work of his student Scott G. Ainsworth, "Evaluating the temporal coherence of archived pages". With an example of the Weather Underground site he demonstrated how unrealistic pages can be constructed by archives due to temporal violations. He acknowledged that, among various categories of temporal violations, there are at least 5% of cases with a provable temporal violation. He also noted that temporal violation is not always a concern.

Day 3

The third day's sessions were in the Internet Archive building in San Francisco instead of the usual Li Ka Shing Center at Stanford University, Palo Alto. A couple of buses transported us to the IA and we enjoyed the bus trip in the valley as the weather was very good. The IA staff were very humble and welcoming. The emulator of classic games installed in the lobby of the IA turned out to be the prime center of attraction. We came to know some interesting facts about the IA, such as that the building was a church which was acquired because of its similarity to the IA logo, and that the pillows in the hall were contributed by various websites with the domain name and logo printed on them.

Sessions before lunch were mainly related to consortium management and logistics; these included Welcome to the Internet Archive by Brewster Kahle, the Chair address by Paul Wagner, the Communication report by Jason Webber, the Treasurer report by Peter Stirling, and Consortium renewal by the chair, followed by break-out discussions to gather ideas and opinions from the IIPC members on various topics. Also, the date and venue for the next general assembly were announced: April 11, 2016, in Reykjavik, Iceland.

After the lunch break, your author, Sawood Alam from Old Dominion University, presented the progress report on the "Profiling web archives" project, funded by IIPC. With the help of some examples and scenarios he established the point that the long tail of archives matters. He acknowledged the growing number of Memento compliant archives and the growing use of the Memento aggregator service. In order for the Memento aggregator to perform efficiently, it needs query routing support apart from caching, which only helps when requests are repeated before the cache expires. He then acknowledged two earlier profiling efforts, one being a complete-knowledge profile by Sanderson and the other a minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He evaluated his findings and concluded that his work so far gained up to 22% routing precision with less than 5% cost relative to the complete-knowledge profile, without any false negatives. Sawood also announced the availability of the code to generate profiles and benchmark them in a GitHub repository. In a later wrap-up session the chair Paul Wagner referred to Sawood's motivation slide in his own words, "sometimes good enough is not good enough."

During the break various IA staff members gave us a tour of the IA facility, including the book scanners, the television archive, an ATM, a storage rack, and the music and video archive where they convert data from old recording media such as vinyl discs and cassettes.

After the break a historian and writer, Abby Smith Rumsey, talked about "The Future of Memory in the Digital Age". Her talk was full of insightful and quotable statements. I will quote one of my favorites and leave the rest in the form of tweets. She said, "ask not what we can afford to save; ask what we can afford to lose".

Finally the founder of the Internet Archive, Brewster Kahle, took the stage and talked about digital archiving and the role of the IA in the form of various initiatives, including the book archive, music archive, and TV archive to name a few. He described the zero-sum book lending model utilized by the Open Library for books that are not free for unlimited distribution. He invited all the archivists to create a common collective distributed library where people can share their resources such as computing power, storage, man power, expertise, and connections. During the Q&A session I asked whether, when he thinks about collaboration, he envisions a model similar to inter-library loan, where peer libraries refer to other places in the form of external links if they don't have the resources but others do, or whether, in contrast, they will copy each other's resources. He responded, "both."

The chair gave a wrap-up talk and formally ended the third day's session. The buses still had some time before they left, so people engaged in conversation, games, and photographs while enjoying drinks and food. I particularly enjoyed a local ice cream named "It's-It" recommended by an IA staff member.

Day 4

On the fourth day Sara Aubry presented her talk on "Harvesting Digital Newspapers Behind Paywalls" in Berge Hall A, where the Harvesting Working Group was gathered, while IIPC's communication strategy session was going on in Hall B. She discussed her experience of working with news publishers to make their content more crawler friendly. Some of the crawling and replay challenges include paywalls requiring authentication to grant access to the content and the inclusion of a daily changing date string in the seed URIs. They modified the Wayback to fulfill their needs, but the modifications are not committed back to the upstream repository. She said that if it is useful for the community, then the changes can be pushed to the main repository.

Roger Coram presented his talk on "Supplementing Crawls with PhantomJS". I found his talk quite relevant to my colleague Justin Brunelle's work. This is a necessary step to improve the quality of the crawls, especially when sites are becoming more interactive with extensive use of JavaScript. For some pages, he uses CSS selectors and takes screenshots to later complement the rendering.

Kristinn Sigurðsson engaged everyone in a conversation about the "Future of Heritrix". He started with the question, "is Heritrix dead?" and I said to myself, "can we afford this?". This ignited the talk about what can be done to increase the activity on its development. I asked what is slowing down the development of Heritrix: is it out of ideas and new feature requests, or are there not enough contributors to continue the development? There was no clear answer to this question, but it helped continue the discussion. I also suggested that if new developers are afraid of making changes that would break the system and discourage upgrades, then we could introduce a plug-in architecture where new features can be added as optional add-ons.

Helen Hockx-Yu took the microphone and talked about Open Wayback development. She gave a brief introduction to the development workflow and the periodic telecons. She also talked about the short and long term development goals, including better customization and internationalization support, displaying more metadata, ways to minimize live leaks, and acknowledging/visualizing temporal coherence.

After a short break Tom Cramer gave his talk on "APIs and Collaborative Software Development for Digital Libraries". He formally categorized the software development models into five categories. He suggested that IIPC take the position of unifying the high-level API for each category of archiving tools so that they can interoperate. This was very appealing to me because I was thinking along the same lines and had done some architectural design of an orchestration system that achieves the same goal via a layer of indirection.

Daniel Vargas from LOCKSS presented his talk on "Streamlining deployment of web archiving tools" and demonstrated the usage of Docker containers for deployment. He also demonstrated the use of plain WARC files on a regular file system and in HDFS with Hadoop clusters. I was glad to see someone else deploying the Wayback machine in containers, as I was pushing some changes to the Open Wayback repository that will make containerization of Wayback easier.

During the lunch break Hunter Stern from IA approached me and told me about the Umbra project to supplement the crawling of JS-rich pages. After the lunch there was a short open mic session where every speaker got four minutes to introduce exciting stuff they are working on. Unfortunately, due to the shortage of time I could not participate in it.

After the lunch break the Access Working Group gathered to talk about "Data mining and WAT files: format, tools and use cases". Peter Stirling, Sara Aubry, Vinay Goel, and Andy Jackson gave talks on "Using WAT at the BnF to map the First World War", "The WAT format and tools for creating WAT files", and "Use cases at Internet Archive and the British Library". Vinay had some really neat, interactive visualizations based on the WAT files. I talked to Vinay during the break and we had some interesting ideas to work on, such as building a content store indexed by hashes while using WAT files in conjunction for replay, and a WebSocket-based BOINC implementation in JavaScript to perform Hadoop-style distributed research operations on IA data on users' machines.

After a short break the Access Working Group talked about "Full-text search for web archives and Solr". Anshum Gupta, Andy Jackson, and Alex Thurman presented "Apache Solr: 5.0 and beyond", "Full-text search for web archives at the British Library", and "Solr-based full-text search in Columbia's Human Rights Web Archive" respectively. Anshum's talk was on the technical aspects of Solr while the other two talks were more case studies.

Day 5

On the last day of the conference the Collection Development and Preservation Working Groups discussed their current state and plans in separate parallel tracks. Before the break I attended the Collection Development Working Group. They demonstrated Archive-It account functionality. I expressed the need for a web-based API to interact with the Archive-It service. I gave the example of a project I was working on a few years ago in which a feed reader periodically reads news feeds and sends the articles to a disaster classifier that Yasmin AlNoamany and I (Sawood Alam) built. If the classifier classifies the news article as being in the disaster category, we wanted to archive that page immediately. Unfortunately, Archive-It did not provide a way to do that programmatically (unless we used page scraping or some headless browser), so we ended up using the WebCite service instead.

After the break I moved to the Preservation Working Group track where I had a talk scheduled. David S.H. Rosenthal presented his talk on "LOCKSS: Collaborative Distributed Web Archiving For Libraries". He described the workings of LOCKSS and how it has benefited the publishing industry. He described how Crawljax is used in LOCKSS to capture content that is loaded via Ajax. He also noted that most publishing sites try not to rely on Ajax, and if they do, they provide some other means to crawl their content to maintain their search engine ranking.

Sawood Alam (me) happened to be the last presenter of the conference, presenting his talk on "Archive Profile Serialization". This talk was a continuation of his earlier talk at IA. He described what should be kept in profiles and how it should be organized. He also talked briefly about the implications of each data organization strategy. Finally, he talked about the file format to be used and how it can affect the usefulness of the profiles. He noted that single-root-node file formats like XML, JSON, and YAML are not suitable for profiles, and he proposed an alternative format that is a fusion of the CDX and JSON formats. Kristinn provided his feedback that it seems to be the right approach to serializing such data, but he strongly suggested naming the file format something other than CDXJSON.
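To give a flavor of what such a CDX-and-JSON fusion might look like, here is a minimal Python sketch; the keys and statistics shown are hypothetical illustrations, not the proposed specification.

```python
import json

# Each line pairs a sort-friendly SURT-style key with a JSON blob of
# statistics about an archive's holdings under that key (hypothetical values).
profile_lines = [
    'com,example)/ {"urir": 120, "urim": 350}',
    'uk,ac,example)/ {"urir": 45, "urim": 60}',
]

def parse_profile_line(line):
    key, _, blob = line.partition(" ")
    return key, json.loads(blob)

for line in profile_lines:
    surt_key, stats = parse_profile_line(line)
    print(surt_key, stats["urim"])
```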

While we were having lunch, the chair took the opportunity to wrap up the day and the conference. And now I would like to thank all the organizing team members, especially Jason Webber, Sabine Hartmann, Nicholas Taylor, and Ahmed AlSum, for organizing and making the event possible.

In the afternoon Ahmed AlSum took me to the Computer History Museum where Marc Weber gave us a tour. It was a great place to visit after such an intense week.

Missed Talks

Due to the parallel tracks I missed some sessions that I wanted to attend, such as "SoLoGlo - an archiving and analysis service" by Martin Klein, "Web archive content analysis" by Mohammed Farag, "Identifying national parts of the internet" by Eld Zierau, "Warcbase: Building a scalable platform on HBase and Hadoop" by Jimmy Lin, "WARCrefs for deduplicating web archives" by Youssef Eldakar, and the "WARC Standard Revision Workshop" by Clément Oury, to name a few. I hope the video recordings will be available soon. Meanwhile I was following the related tweets.

Conclusions

IIPC GA 2015 was a fantastic event. I had a great time, met a lot of new people and some whom I previously knew only on the Web, shared my ideas, and learned from others. It was the most amazing complete week I have ever had. I appreciate the efforts of everyone who made this possible, including organizers, presenters, and attendees.

Resources

Please let us know of links to various resources related to IIPC GA 2015 to include below.

Official

Aggregations

Blog Posts

Tools

--
Sawood Alam

2015-05-29: Call me Dr. SalahEldeen

Dr. Nelson saying how awesome I am
Stick a fork in it...it’s done! So now what?

These are the two thoughts that floated through my mind just after defending my dissertation on May 5th. Is it over?... Well, my bet is that the road has just begun. I just became a doctor after all!
After merely 5 years, 4 months, and 13 days, I finished the PhD (see what I did there? that's sarcasm!). Fresh off the boat (err, the plane) when I landed on December 23rd, 2009, I thought I would just knock this PhD out in a couple of years and go work for a big company, and oh boy, little did I know. I believe I am a whole different man now; I learned things I didn't even imagine I would know, mostly about myself and the glorious fields of machine learning, modeling, user behavioral analysis, archiving, preservation, and of course engineering. What do you know, it turned out that research is awesome and I loved it. Finding a pattern, building a predictive model that learns with time, gives you a pleasure with no equal, and apparently I am good at it!

Back to the PhD: my dissertation is entitled Detecting, Modeling, and Predicting User Temporal Intention in Social Media. It's a new field of human intention in relation to time and content shared on social networking portals. It's an enticing area of study that merges multiple disciplines and various fields of study, and according to Dr. Nelson, right now I am the world expert on this tiny point in the collective human knowledge of the sciences ... yup, I will take that.

Our work gained both academic and public acclaim, demonstrated through our publication record and the articles about our work in the BBC, the Atlantic, Popular Mechanics, and MIT Tech Review, as shown in the last set of slides from my defense:



To watch me do an awesome job defending my dissertation:

Big thanks to Mat Kelly for taking awesome pictures throughout the defense day (pardon my weary look):

https://www.flickr.com/photos/124419986@N07/sets/72157651976633968/

I PhD Crushed it!
Well, back to the first part: it seems I was half right after all. I did not manage to finish the PhD in just "a couple" of years, but I did manage to land an awesome job at Microsoft (yes, that's the "big company" part). I accepted a job at Bing working on a very enticing project utilizing user behavioral analysis to best present the search results. Right up my alley, so I am super excited. I did work for Microsoft twice before, though, as an intern at Microsoft Research in 2009 and at Microsoft Silicon Valley in 2011. I guess it's a loyalty thing to come back; it's awesome to work there, to be honest.



So, I am packing my stuff as I write these words, getting a bit emotional about leaving the place I called home for 18.96% of my life and the amazing lifelong friends whom I will always cherish, and excited about new beginnings and awesome feats to achieve. I am shipping all my stuff to Seattle, and I am taking my old motorcycle (a.k.a. Beast) on a cross-country trip inspired by Che Guevara's life-changing journey across South America on his motorcycle La Poderosa ("The Mighty One") after he finished med school, documented in his marvelous memoir, The Motorcycle Diaries.

 So wish me luck and I will post about the trip soon, hopefully I won't break down!
Beast packed and ready
-- Hany SalahEldeen

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.                           

On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.

Robert Wolven of Columbia University Libraries started off the conference by welcoming the audience and emphasizing the variety of presentations that were to occur that day. He then introduced Jim Neal, the keynote speaker.

Jim Neal started by noting the challenges of "repository chaos", namely, which version of a document should be cited for online resources if multiple versions exist. "Born-digital content must deal with integrity", he said, "and remain as unimpaired and undivided as possible to ensure scholarly access."

Brian Carver (@brianwc) and Michael Lissner (@mlissner) of the Free Law Project (@freelawproject) followed the keynote, with Brian first stating, "Too frequently I encounter public access systems that have utterly useless tools on top of them and I think that is unfair." He described his project's efforts to make available court data from the wide variety of systems digitally deployed by various courts on the web. "A one-size-fits-all solution cannot guarantee this across hundreds of different court websites," he stated, further explaining that each site needs its own scraping algorithm to extract content.

To facilitate the crowdsourcing of scraping algorithms, he has created a system where users can supply "recipes" to extract content from the courts' sites as they are posted. "Everything I work with is in the public domain. If anyone says otherwise, I will fight them about it," he mentioned regarding the demands people have brought to him when finding their names in the now-accessible court documents. "We still find courts using WordPerfect. They can cling to old technology like no one else."

Shailin Thomas (@shailinthomas) and Jack Cushman from the Berkman Center for Internet and Society, Harvard University, spoke next about Perma.cc. "Of the digital citations in the Harvard Law Review over the last 10 years, 73% of the online links were broken. Over 50% of the links cited by the Supreme Court are broken." They continued to describe the Perma API and the recent Memento compliance.

After a short break, Deborah Kempe (@nyarcist) of the Frick Art Reference Library described her recent observation that there is a digital shift in art moving to the Internet. She has been working with both Archive-It and Hanzo Archives, for quality assurance of captured websites and for on-demand captures of sites that her organization found particularly challenging, respectively. One example of the latter is Wangechi Mutu's site, which has an animation on the homepage that Archive-It was unable to capture but Hanzo was.

In the same session, Lily Pregill (@technelily) of NYARC stated, "We needed a discovery system to unite NYARC arcade and our Archive-It collection. We anticipated creating yet another silo of an archive." While she stated that the user interface is still under construction, it does allow the results of her organization's archive to be supplemented with results from Archive-It.

Following Lily in the session, Anna Perricci (@AnnaPerricci) and Alex Thurman (@athurman) of Columbia University Libraries talked about the Contemporary Composers Web Archive, which consists of 11 participating curators, with 56 sites currently available in Archive-It. Alex then spoke of the varying legal environments among members based on their countries, some being able to do full TLD crawling while other members (namely, in the U.S.) have no protection from copyright. He spoke of the preservation of Olympics web sites from 2010, 2012, and 2014 - the latter being the first Olympic logo to contain a web address. "Though Archive-It had a higher upfront cost," he said about the initial weighing of various options for Olympic website archiving, "it was all-inclusive of preservation, indexing, metadata, replay, etc." To publish their collections, they are looking into utilizing the .int TLD, which is reserved for internationally significant information but is underutilized in that only about 100 sites exist, all of which have research value.

The conference then broke for a provided lunch then started with Lightning Talks.

To start off the lightning talks, Michael Lissner (@mlissner) spoke about RECAP: what it is, what it has done, and what is next for the project. Much of the content contained within the Public Access to Court Electronic Records (PACER) system consists of paywalled public domain documents. Obtaining the documents costs users ten cents per page with a three dollar maximum. "To download the Lehman Brothers proceedings would cost $27000," he said. His system leverages the user's browser via the extension framework to save a copy of a user's downloads to the Internet Archive, and it also first queries the archive to see if the document has been previously downloaded.

Dragan Espenschied (@despens) gave the next lightning talk on preserving digital art pieces, namely those on the web. He noted one particular example where the artist extensively used scrollbars, which are less commonplace in user interfaces today. To accurately re-experience the work, he fired up a browser-based MacOS 9 emulator:

Jefferson Bailey (@jefferson_bail) followed Dragan with his work investigating archive access methods that are not URI-centric. He has begun working with WATs (web archive transformations), LGAs (longitudinal graph analyses), and WANEs (web archive named entities).

Dan Chudnov (@dchud) then spoke of his work at GWU Libraries. He has developed Social Feed Manager, a Django application for collecting social media data from Twitter. Previously, researchers had been copying and pasting tweets into Excel documents; his tool automates this process. "We want to 1. see how to present this stuff, 2. do analytics to see what's in the data, and 3. find out how to document the now. What do you collect for live events? What keywords are arising? Whose info should you collect?", he said.

Jack Cushman from Perma.cc gave the next lightning talk about ToolsForTimeTravel.org, a site working toward a strong dark archive: one that would prevent even archivists from reading the material within until certain conditions are met. Examples where this would be applicable include the IRA archive at Boston College, Hillary Clinton's e-mails, etc.

With the completion of the Lightning Talks, Jimmy Lin (@lintool) of the University of Maryland and Ian Milligan (@ianmilligan1) of the University of Waterloo rhetorically asked, "When does an event become history?", stating that history is written 20 to 30 years after an event has occurred. "The history of the '30s was written in the '60s. Where are the Monica Lewinsky web pages now? We are getting ready to write the history of the 1990s," Jimmy said. "Users can't do much with current web archives. It's hard to develop tools for non-existent users. We need deep collaborations between users (archivists, journalists, historians, digital humanists, etc.) and tool builders. What would a modern archiving platform built on big data infrastructure look like?" He compared his recent work in creating warcbase with the monolithic OpenWayback Tomcat application: "Existing tools are not adequate."

Ian then talked about warcbase as an open source platform for managing web archives with Hadoop and HBase. WARC data is ingested into HBase and Spark is used for text analysis and services.

Zhiwu Xie (@zxie) of Virginia Tech then presented his group's work on maintaining web site persistence when the original site is no longer available. Using an approach akin to a proxy server, the content served when the site was last available continues to be served in lieu of the live site. "If we have an archive that archives every change of that web site and the website goes down, we can use the archive to fill the downtimes," he said.
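
The general idea can be sketched in a few lines: try the live site first, and fall back to the most recent memento when the origin is unreachable. The sketch below only illustrates that fallback pattern (it uses the Internet Archive's Wayback Machine as the fallback and made-up timeout values); it is not the system Zhiwu described.

    # Sketch of an archive-backed fallback: serve the live resource when the
    # origin responds, otherwise serve the newest memento from a web archive.
    # The Wayback URL form and timeouts are illustrative assumptions.
    import requests

    WAYBACK_LATEST = "http://web.archive.org/web/{uri}"  # redirects to the newest memento

    def fetch_with_archive_fallback(uri, timeout=5):
        try:
            live = requests.get(uri, timeout=timeout)
            if live.status_code < 500:
                return live.content, "live"
        except requests.RequestException:
            pass  # origin down or unreachable; fall through to the archive
        memento = requests.get(WAYBACK_LATEST.format(uri=uri), timeout=timeout)
        memento.raise_for_status()
        return memento.content, "archive"

    if __name__ == "__main__":
        body, source = fetch_with_archive_fallback("http://example.com/")
        print(source, len(body))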

Mat Kelly (@machawk1, your author) presented next with "Visualizing digital collections of web archives", in which I described the SimHash-based archival summarization strategy to efficiently generate a visual representation of how a web page changed over time. In this work, I created a stand-alone interface, a Wayback add-on, and an embeddable service for generating a summarization of a live web page. At the close of the presentation, I attempted a live demo.
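
To give a flavor of the fingerprinting step, here is a toy SimHash over page text with a Hamming-distance comparison between two captures. It illustrates the general technique only and is not the summarization service's actual code; the sample strings are made up.

    # Toy SimHash: fingerprint each capture's text, then compare fingerprints
    # by Hamming distance; a small distance suggests little change between
    # mementos. Generic illustration, not the service's implementation.
    import hashlib

    def simhash(text, bits=64):
        vector = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            for i in range(bits):
                vector[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    capture_2013 = "kelley blue book new and used car price values"
    capture_2015 = "kelley blue book new car prices used car values"
    print(hamming(simhash(capture_2013), simhash(capture_2015)))  # small => similar captures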

WS-DL's own Michele Weigle (@weiglemc) next presented Yasmin's (@yasmina_anwar) work on Detecting Off-Topic Pages. The recently accepted TPDL 2015 paper examined how pages in Archive-It collections change over time and how to detect when a page is no longer relevant to what the archivist intended to capture. She used six similarity metrics and found that cosine similarity performed the best.
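
As a rough sketch of how the best-performing metric can be applied, the snippet below scores a later memento's text against a page's first memento with TF-IDF cosine similarity. The threshold, preprocessing, and sample strings are illustrative assumptions, not the paper's exact setup.

    # Cosine similarity between an early memento's text and a later memento's
    # text; a low score hints that the page may have drifted off-topic.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    first_memento = "candidate campaign issues healthcare education economy"
    later_memento = "this domain is for sale click here to buy now"

    tfidf = TfidfVectorizer().fit_transform([first_memento, later_memento])
    score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
    print("off-topic" if score < 0.2 else "on-topic", round(score, 3))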

In the final presentation of the day, Andrea Goethals of Harvard Library and Stephen Abrams of the California Digital Library discussed the difficulties of keeping up with web archiving locally, citing outdated tools and systems. A hierarchical diagram of a potential solution they showed piqued the audience's interest, striking some as overcomplicated for smaller archives.

To close out the day, Robert Wolven gave a synopsis of the challenges to come and expressed his hope that there was something for everyone.

Day 2

The second day of the conference consisted of multiple concurrent topical sessions that were somewhat open-ended to facilitate more group discussion. I initially attended David Rosenthal's talk, where he discussed the need for tools and APIs for integration into various systems to standardize access. "A random URL on the web has less than a 50% chance of getting preserved anywhere," he said. "We need to use resources as efficiently as possible to up that percentage."

DSHR then discussed repairing archives for bit-level integrity and LOCKSS's approach to accomplishing it. "How would we go about establishing a standard archival vocabulary?", he asked. "'Crawl scope' means something different in Archive-It vs. other systems."

I then changed rooms to catch the last half hour of Dragan Espenschied's session, where he discussed pywb (the software behind webrecorder.io) in more depth. The software allows people to create their own public and private archives and offers a pass-through mode in which it does not record login information. Further, it can capture embedded YouTube and Google Maps content.

Following the first set of concurrent sessions, I attended Ian Milligan's demo of using warcbase to analyze Canadian political parties' websites (a private repo as of this writing, but it will be made public once cleaned up). He also demonstrated using Web Archives for Historical Research. In the subsequent and final presentation of day 2, Jefferson Bailey demonstrated Vinay Goel's (@vinaygo) Archive Research Services Workshop, which was created to serve as an introduction to data mining and computational tools and methods for working with web archives, aimed at researchers, developers, and general users. The system utilizes the WAT, LGA, and WANE derived data formats that Jefferson spoke of in his Day 1 lightning talk.

After Jefferson's talk, Robert Wolven again collected everyone into a single session to go over what was discussed in each session on the second day and gave a final closing.

Overall, the conference was very interesting and very relevant to my research in web archiving. I hope to dig into some of the projects and resources I learned about and follow up with contacts I made at the Columbia Web Archiving Collaboration conference.

— Mat (@machawk1)

2015-06-09: Mobile Mink merges the mobile and desktop webs

As part of my 9-to-5 job at The MITRE Corporation, I lead several STEM outreach efforts in the local academic community. One of our partnerships, with the New Horizons Governor's School for Science and Technology, pairs high school seniors with professionals in STEM careers. Wes Jordan has been working with me since October 2014 as part of this program and for his senior mentorship project, a requirement for graduation from the Governor's School.

Wes has developed Mobile Mink (soon to be available in the Google Play store). Inspired by Mat Kelly's Mink add-on for Chrome, Wes adapted the functionality to an Android application. This blog post discusses the motivation for and operation of Mobile Mink.

Motivation

The growth of the mobile web has encouraged web archivists to focus on ensuring it is thoroughly archived. However, mobile URIs are not as prevalent in the archives as their non-mobile (or, as we will refer to them, desktop) counterparts. This is apparent when we compare the TimeMaps of the Android Central site (with desktop URI http://www.androidcentral.com/ and mobile URI http://m.androidcentral.com/).

TimeMap of the desktop Android Central URI
 The 2014 TimeMap in the Internet Archive of the desktop Android Central URI includes a large number of mementos with a small number of gaps in archival coverage.
TimeMap of the mobile Android Central URI
Alternatively, the TimeMap in the Internet Archive of the mobile Android Central URI has far fewer mementos and many more gaps in archival coverage.

This example illustrates the discrepancy in archival coverage between mobile and desktop URIs. Additionally, as humans we understand that these two URIs represent content from the same site: Android Central. The connection between the URIs exists on the live web, where mobile user-agents trigger a redirect to the mobile URI, but this connection is lost during archiving.



The representations of the mobile and desktop URIs are different, even though a human will recognize the content as largely the same. Because archives commonly index by URI and archival datetime only, a machine may not be able to understand that these URIs are related.
The desktop Android Central representation
The mobile Android Central representation

Mobile Mink helps merge the mobile and desktop TimeMaps while also providing a mechanism to increase the archival coverage of mobile URIs. We detail these features in the Implementation section.

Implementation

Mobile Mink provides users with a merged TimeMap of the mobile and desktop versions of the same site. We use the URI permutations detailed in McCown's work to transform desktop URIs to mobile URIs (e.g., http://www.androidcentral.com/ -> http://m.androidcentral.com/) and mobile URIs to desktop URIs (e.g., http://m.androidcentral.com/ -> http://www.androidcentral.com/). This process allows Mobile Mink to establish the connection between mobile and desktop URIs.
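
The transformation itself is simple string manipulation over the host name. The sketch below shows the idea with a small, illustrative subset of permutations; it is not the complete set from McCown's work, nor Mobile Mink's Android code.

    # Generate candidate mobile URIs for a desktop URI (and vice versa) by
    # permuting the host name. The prefix list is an illustrative subset.
    from urllib.parse import urlsplit, urlunsplit

    MOBILE_PREFIXES = ["m.", "mobile.", "touch."]

    def mobile_candidates(desktop_uri):
        parts = urlsplit(desktop_uri)
        host = parts.netloc
        bare = host[4:] if host.startswith("www.") else host
        return [urlunsplit((parts.scheme, prefix + bare, parts.path,
                            parts.query, parts.fragment))
                for prefix in MOBILE_PREFIXES]

    def desktop_candidate(mobile_uri):
        parts = urlsplit(mobile_uri)
        host = parts.netloc
        for prefix in MOBILE_PREFIXES:
            if host.startswith(prefix):
                host = "www." + host[len(prefix):]
                break
        return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

    print(mobile_candidates("http://www.androidcentral.com/"))
    print(desktop_candidate("http://m.androidcentral.com/"))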



Merged TimeMap
With the mobile and desktop URIs identified, Mobile Mink uses Memento to retrieve the TimeMaps of both the desktop and mobile versions of the site. Mobile Mink merges all of the returned TimeMaps and sorts the mementos temporally, identifying the mementos of the mobile URIs with an orange icon of a mobile phone and the mementos of the desktop URIs with a green icon of a PC monitor.
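
Conceptually, the merge boils down to fetching one TimeMap per URI, tagging each memento with its source, and sorting the union by Memento-Datetime. A rough sketch of that step follows; the Memento aggregator endpoint and the crude link-format parsing are assumptions for illustration, not Mobile Mink's Android implementation.

    # Fetch link-format TimeMaps for the desktop and mobile URIs, tag each
    # memento with its source, and sort the union by datetime.
    import re
    import requests
    from email.utils import parsedate_to_datetime

    TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/{uri}"
    MEMENTO_RE = re.compile(
        r'<([^>]+)>;[^,]*rel="[^"]*memento[^"]*";[^,]*datetime="([^"]+)"')

    def mementos(uri, label):
        body = requests.get(TIMEMAP.format(uri=uri), timeout=30).text
        return [(parsedate_to_datetime(dt), m_uri, label)
                for m_uri, dt in MEMENTO_RE.findall(body)]

    merged = sorted(mementos("http://www.androidcentral.com/", "desktop") +
                    mementos("http://m.androidcentral.com/", "mobile"))
    for dt, m_uri, label in merged[:10]:
        print(dt.isoformat(), label, m_uri)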

To mitigate the discrepancy in archival coverage between the mobile and desktop URIs of web resources, Mobile Mink provides an option that lets users push both the mobile and desktop URIs to the Internet Archive's Save Page Now feature and to Archive.today. This allows Mobile Mink's users to actively archive mobile resources that might not otherwise be archived.
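
Pushing a URI to Save Page Now is itself just an HTTP request. A minimal sketch of that submission for both URI variants follows (Internet Archive only; the Archive.today submission endpoint is omitted, and the header handling is an assumption rather than Mobile Mink's code):

    # Ask the Internet Archive's Save Page Now to capture both URI variants.
    import requests

    def save_page_now(uri):
        resp = requests.get("https://web.archive.org/save/" + uri, timeout=60)
        # The Content-Location header (or final URL) typically points at the new memento.
        return resp.headers.get("Content-Location", resp.url)

    for uri in ("http://www.androidcentral.com/", "http://m.androidcentral.com/"):
        print(uri, "->", save_page_now(uri))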

These features mirror the functionality of Mink by providing users with a TimeMap of the site currently being viewed, but extend it by providing the merged mobile and desktop TimeMap. Mink also provides a feature to submit URIs to Archive.today and Save Page Now, but Mobile Mink extends this functionality by submitting both the mobile and desktop URIs to these two archival services.

Demonstration

The video below provides a demo of Mobile Mink. We use the Chrome browser and navigate to http://www.androidcentral.com/, which redirects us to http://m.androidcentral.com/. From the browser menu, we select the "Share" option. When we select the "View Mementos" option, Mobile Mink provides the aggregate TimeMap. Selecting the icon in the top right corner, we can access the menu to submit the mobile and desktop URIs to Archive.today and/or the Internet Archive.


Next Steps

We plan to release Mobile Mink in the Google Play store in the next few weeks. In the meantime, please feel free to download and use the app from Wes's GitHub repository (https://github.com/Thing342/MobileMemento) and provide feedback through the issue tracker (https://github.com/Thing342/MobileMemento/issues). We will continue to test and refine the software moving forward.

Wes's demo of Mobile Mink was accepted at JCDL 2015. Because he is graduating in June and preparing to start his collegiate career at Virginia Tech, someone from the WS-DL lab will present his work on his behalf. However, we hope to convince Wes to come to the Dark Side and join the WS-DL lab in the future. We have cookies.

--Justin F. Brunelle

2015-06-26: JCDL 2015 Doctoral Consortium

Mat Kelly attended and presented at the JCDL 2015 Doctoral Consortium. This is his report.                           

Evaluating progress between milestones in a PhD program is difficult due to the inherent open-endedness of research. A means of evaluating whether a student's topic is sound and has merit while still early on in his career is to attend a doctoral consortium. Such an event, as the one held at the annual Joint Conference on Digital Libraries (JCDL), has previously provided a platform for WS-DL students (see 2014, 2013, 2012, and others) to network with faculty and researchers from other institutions as well as observe the approach that other PhD students at the same point in their career use to explain their respective topics.

As the wheels have turned, I have shown enough progress in my research for it to be suitable for preliminary presentation at the 2015 JCDL Doctoral Consortium -- and did so this past Sunday in Knoxville, Tennessee. Along with seven other graduate students from universities throughout the world, I gave a twenty-minute presentation with ten to twenty minutes of feedback from an audience of other presenting graduate students, faculty, and researchers.

Kazunari Sugiyama of the National University of Singapore (where Hany SalahEldeen recently spent a semester as a research intern) welcomed everyone and briefly described the format of the consortium before getting underway: each student would have twenty minutes to present, with ten to twenty minutes for feedback from the doctors and the other PhD students present.

The Presentations

The presentations were broken up into four topical categories. In the first section, "User's Relevance in Search", Sally Jo Cunningham introduced the two upcoming speakers. Sampath Jayarathna (@OpenMaze) of Texas A&M University was the first presenter of the day with his topic, "Unifying Implicit and Explicit Feedback for Multi-Application User Interest Modeling". In his research, he asked users to type short queries, which he used to investigate methods for search optimization. He asked, "Can we combine implicit and semi-explicit feedback to create a unified user interest model based on multiple everyday applications?" Using a browser-based annotation tool, users in his study provided relevance feedback on the search results through both explicit and implicit means. One of his hypotheses is that, given a user model, he should be able to compare the model against the explicit feedback users provide in order to improve the relevance of results.


After Sampath, Kathy Brennan (@knbrennan) of the University of North Carolina presented her topic, "User Relevance Assessment of Personal Finance Information: What is the Role of Cognitive Abilities?". In her presentation she alluded to the similarities between buying a washer and dryer and obtaining a mortgage as indicators of a person's cognitive abilities. "Even for really intelligent people, understanding prime and subprime rates can be a challenge," she said. One study she described analyzed rounding behavior, with stock prices as an example of the critical details an individual observes. By psychometrically testing 69 different abilities while users analyzed documents for relevance, she found that someone with lower cognitive abilities has a lower threshold for relevance and thus marks more documents as relevant than someone with higher cognitive abilities. "However," she said, "those with a higher cognitive ability were doing a lot more in the same amount of time as those with lower cognitive abilities."

After a short coffee break, Richard Furuta of Texas A&M University introduced the two speakers of the second session, titled "Analysis and Construction of Archive". Yingying Yu of Dalian Maritime University presented first in this session with "Simulate the Evolution of Scientific Publication Repository via Agent-based Modeling". In her research, she seeks to find candidate co-authors for academic publications based on a model that includes venue, popularity, and author importance as a partial set of parameters. "Sometimes scholars only focus on homogeneous networks," she said.


Mat Kelly (@machawk1, your author) presented second in the session with "A Framework for Aggregating Private and Public Web Archives". In my work, I described the issues of integrating private and public web archives with respect to access restrictions, privacy, and other concerns that would arise were the archives' results to be aggregated.


The conference then broke for boxed lunch and informal discussions amongst the attendees.


After sessions resumed following the lunch break, George Buchanan (@GeorgeRBuchanan) of City University of London welcomed everybody and introduced the two speakers of the third session of the day, "User Generated Contents for Better Service".


Faith Okite-Amughoro (@okitefay) of the University of KwaZulu-Natal presented her topic, "The Effectiveness of Web 2.0 in Marketing Academic Library Services in Nigerian Universities: a Case Study of Selected Universities in South-South Nigeria". Faith noted that there has not been any assessment of how the libraries in her region of study have used Web 2.0 to market their services. "The real challenge is not how to manage their collection, staff and technology," she said, "but to turn these resources into services." She found that the most-used Web 2.0 tools were social networking, video sharing, blogs, and generally places where users could add content themselves.


Following Faith, Ziad Matni (@ziadmatni) of Rutgers University presented his topic, "Using Social Media Data to Measure and Influence Community Well-Being". Ziad asked, "How can we gauge how well people are doing in their local communities through the data that they generate on social media?" He is currently looking for useful measures of the components of community well-being and their relationships with collective feelings of stress and tranquility (as defined in his work). He hopes to focus on one or two social indicators and to understand the factors that correlate the sentiment expressed on social media with a geographical community's well-being.


After Ziad's presentation, the group took a coffee break and then began the last presentation session of the day, "Mining Valuable Contents". Kazunari Sugiyama (who welcomed the group at the beginning of the day) introduced the two speakers of the session.


The first presentation in this session was from Kahyun Choi of the University of Illinois at Urbana-Champaign, who presented her work, "From Lyrics to Their Interpretations: Automated Reading between the Lines". She is looking to find the source of subject information for songs, on the assumption that machines might have difficulty analyzing song lyrics directly. She has three general research questions: the first relates lyrics to their interpretations, the second asks whether topic modeling can discover the subject of the interpretations, and the third concerns reliably obtaining interpretations from the lyrics. She is training and testing a subject classifier, for which she collected lyrics and their interpretations from SongMeanings.com. From this she obtained eight subject categories: religion, sex, drugs, parents, war, places, ex-lover, and death. With 100 songs in each category, she assigned each song only one subject. She then used only the top ten interpretations per song to prevent the results from being skewed by songs with a large number of interpretations.


The final group presentation of the day was to come from Mumini Olatunji Omisore of the Federal University of Technology with "A Classification Model for Mining Research Publications from Crowdsourced Data". Because of visa issues, he was unable to attend and had planned to present via Skype or Google Hangouts. Despite changing wireless configurations, trying other services, and many other attempts, the bandwidth at the conference venue proved insufficient and he was unable to present. A contingency was set up between him and the doctoral consortium organizers to review his slides.


Two-on-Two

Following the attempts to allow Mumini to present remotely, the consortium broke up into groups of four (two students and two doctors) for private consultations. The doctors in my group (Drs. Edie Rasmussen and Michael Nelson) provided extremely helpful feedback on both my presentation and my research objectives. Particularly valuable was their discussion of how I could improve the evaluation of my proposed research.

Overall, the JCDL Doctoral Consortium was a very valuable experience. Seeing how other PhD students approach their research and obtaining critical feedback on my own made it, I believe, a priceless opportunity for improving the quality of one's PhD research.

— Mat (@machawk1)

Edit: Subsequent to this post, Lulwah reported on the main portion of the JCDL 2015 conference and Sawood reported on the WADL workshop at JCDL 2015.

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher-fidelity archives a page at a time; this is an example implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), which matches our goal of replaying interactions; WebRecorder.io is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automated Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides tight integration between the loaded page, its DOM, and the controlling code. This allows code to be injected directly into the target page and native DOM interaction to be performed, giving PhantomJS a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several API bindings (Java, Python, Perl, etc.) that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automating actions such as mouse movements.

Based on our experimentation, Selenium is a better tool for canned interactions -- for example, a pre-scripted set of clicks, drags, etc. A summary of the differences between PhantomJS, Selenium, and VisualEvent (explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be treated as a relative comparison rather than a definitive measurement.

Tool                  | PhantomJS                   | Selenium      | VisualEvent
Operation             | Headless                    | Full-Browser  | JavaScript bookmarklet and code
Speed (seconds)       | 2.5-8                       | 4-10          | < 1 (on user click)
DOM Integration       | Close integration           | 3rd party     | Close integration/embedded
DOM Event Extraction  | Semi-reliable               | Semi-reliable | 100% reliable
DOM Interaction       | Scripted, native, on-demand | Scripted      | None

To summarize, PhantomJS is faster (because it is headless) and more closely coupled with the browser, the DOM, and client-side events than Selenium (which loads a full browser). However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than maintaining that responsibility within the archival tool. This will prove beneficial as JavaScript, HTML5, and other client-side technologies evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded, pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.
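
For illustration, a canned Selenium interaction of the kind recommended above might look like the following Python script. The element IDs and the chosen option are hypothetical placeholders rather than KBB.com's real markup, and the wait condition is a simplifying assumption.

    # A canned (pre-scripted) Selenium interaction: open a page, pick an option
    # from a drop-down, and wait for the dependent menu to populate via Ajax.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver = webdriver.Firefox()  # a full ("headful") browser
    try:
        driver.get("http://www.kbb.com/")
        Select(driver.find_element(By.ID, "make")).select_by_visible_text("Honda")
        # Wait until the Ajax response has populated the dependent "model" menu.
        WebDriverWait(driver, 10).until(
            lambda d: len(Select(d.find_element(By.ID, "model")).options) > 1)
        print([o.text for o in Select(driver.find_element(By.ID, "model")).options])
    finally:
        driver.quit()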

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i -- the user interaction test -- but pass all others. This indicates that both Selenium and PhantomJS have difficulty identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop-downs).
Fig 6. The acid test results are identical for PhantomJS and Selenium, with both failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events rather than as an archival utility, but it can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of DOM event extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover their attached event handlers. VisualEvent instead starts with the JavaScript: it gathers all of the JavaScript functions, determines which DOM elements reference those functions, and determines whether they are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) through a visual overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive elements are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements, and then interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case -- automatically identifying a set of DOM interactions; other experimental conditions and goals may be better suited to Selenium or other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle