2013-08-24: Two WS-DL Classes Offered for Fall 2013

August 24, 2013, 11:39 am

≫ Next: 2013-09-06: Wolfram Data Summit 2013 Trip Report

≪ Previous: 2013-08-23: Archive-It Supports Memento

Two WS-DL classes are offered for Fall 2013:

CS 725/825 - Information Visualization, Dr. Weigle
CS 495/595 - Introduction to Web Science, Dr. Nelson

Information Visualization has been taught twice before, but with a 795/895 course number. This semester will be the first time that Web Science has been taught at ODU, although the course is based on Dr. McCown's Spring 2013 class at Harding University.

--Michael

↧

2013-09-06: Wolfram Data Summit 2013 Trip Report

September 8, 2013, 7:41 am

≫ Next: 2013-09-09: MS Thesis: HTTP Mailbox - Asynchronous RESTful Communication

≪ Previous: 2013-08-24: Two WS-DL Classes Offered for Fall 2013

I was fortunate enough to be invited to present at the 2013 Wolfram Data Summit in Washington DC, September 5-6, 2013. My talk was about the future of web archiving, but the focus of the data summit was "big data". As such, there was a variety of disciplines represented at the summit since the unifying factor was the scale of the data. Logistics dictated that I missed several of the presentations, but many of the ones I did attend were very engaging. The slides will be posted at the Wolfram site later, but I'll provide some short summaries below.

First was Greg Newby presenting about Project Gutenberg, the long-running collection of free ebooks. His focus was on PG as a portable collection, which is subtly different from universal access from different interfaces (even if the interface is just Google). The focus was more on PG as a collection to be explored and personalized services to be built-on. During the question and answer period someone asked "what's next for Project Gutenberg?", and during lunch the next day me, Greg, and others talked about PG and Open Annotation, and maybe uploading some content to Rap Genius (I got the idea from Rob Sanderson).

Andrew Ng gave a skype presentation (which, unlike most video presentations, worked rather well) about Coursera. I'm rather skeptical about most universities' stampede for MOOCs, but I should probably start looking for quality segments in Coursera to augment my own classes.

Another really engaging discussion was Paul Lamere of The Echo Nest. With lots of illustrative examples using pop music, Paul gave one of the most well-received presentations of the summit. We learned that band names are not getting longer (I was surprised, I thought they were, but older conventions like "Herb Albert and the Tijuana Brass" make for long names), metal fans are more "passionate" (defined as replays/favorites) than dub step fans (that one was easy), and that we can easily tell human drummers from machines by analyzing variances in the signal (his example was the variations in "So Lonely"). Paul's blog, Music Machinery, is worth checking out.

Eric Newburger of the US Census Bureau gave an excellent presentation about how Census data is the original "big data". Tufte fans will enjoy checking out his presentation (prior presentations are available at the moment). He made a good pitch for using Census data for ground truth for a variety of business purposes, but you really should check out some of the early visualizations.

Ryan Cordell and David Smith of Northeastern gave a great presentation about "infectious texts", a project to mine early US newspapers for early "viral" memes. Apparently early newspapers were equal parts news, fiction, and apocryphal stories half-way between truth and fiction, and editors would fill their local papers with large-scale copying from other newspapers, with and without attribution. The project analyzes the types of stories chosen for 19th century retweeting, the networks of reuse (which don't always match geography and population networks), their temporal patterns, etc. During the Q&A period and later during lunch we speculated about identifying timeless stories (e.g., the soldier returning from war) and reintroducing them to Facebook & Twitter and see if they reignite. The project uses LC data from the Chronicling of America project, and the OCR data is especially noisy and requires a host of tricks to align and find the reused portions.

Roger Macdonald of the Internet Archive discussed the Television Archive, which features 2M+ hours of TV news. I'm guilty of thinking the Internet Archive is just web pages (of which they have some 338B), but they have a great deal more: 30k software titles, 600k books, 900k films/movies, 1M audio recordings (many concerts), and 2M ebooks. The TV news archive features a very attractive and useful interface for browsing, search, and sharing its content.

Leslie Johnston from the Library of Congress gave an overview of LC's collections and services. Most of these I was already familiar with, but I'll mention two sites that I was not aware of. First, the venerable THOMAS will be replaced with a new congress.gov (see the beta version now) and will will soon feature APIs for accessing the data behind the site. See these reviews: O'Reilly, TechPresident. I was also unavailable of id.loc.gov, a site that gathers the various naming, standards, and vocabulary functions into one place. I knew LC performed this function, but I didn't know of this particular site.

Eric Rabkin gave a fascinating talk about the analysis of titles of works of science fiction and what that revealed about the society that they reflect. Quoting from his "Genre Evolution Project" page:

We study literature as a living thing, able to adapt to society’s desires and able to influence those desires. Currently, we are tracking the evolution of pulp science fiction short stories published between 1926 and 1999. Just as a biologist might ask the question, “How does a preference for mating with red-eyed males effect eye color distribution in seven generations of fruit flies?” the GEP might ask, “How does the increasing representation of women as authors of science fiction affect the treatment of medicine in the 1960s and beyond?”

In addition the slides (when they're available), you might be interested in his SF course on Coursera.

I gave the last presentation of the day, talking about trends in web archiving. I gave a high-level overview of some of our recent JCDL and TPDL papers, as well as mentioning long-running projects like Memento and how they integrate the various public web archives, most of which most people have never heard of.

Who Will Archive the Archives? Thoughts About the Future of Web Archiving from Michael Nelson

Since I was the last presentation of the summit, we had an extended question and answer period with a handful of people who were not in a hurry to leave and jump in DC traffic. I ended up meeting my friend Terry for dinner and then headed back to Norfolk at about 7:45 that evening.

Overall this was a really interesting summit and I enjoyed the multidisciplinary nature of presentations. I regret that I ended up missing as many as I did, but that's how things worked out. I would definitely recommend the 2014 summit. While waiting for the 2013 presentations to be posted, you might want to check out the presentations from 2012, 2011, and 2010.

--Michael

↧

2013-09-09: MS Thesis: HTTP Mailbox - Asynchronous RESTful Communication

September 9, 2013, 1:52 pm

≫ Next: 2013-10-04: TPDL 2013 Trip Report

≪ Previous: 2013-09-06: Wolfram Data Summit 2013 Trip Report

It is my pleasure to report the successful completion of my Master's degree thesis entitled "HTTP Mailbox - Asynchronous RESTful Communication". I have defended my thesis on July 11th and got my written thesis accepted on August 23rd 2013. In this blog post I will briefly describe the problem that the thesis is targeting at followed by proposed and implemented solution to the problem. I will walk through an example that will illustrate the usage of the HTTP Mailbox then I will provide various links and resources to further explore the HTTP Mailbox.

Traditionally, general web services used only the GET and POST methods of HTTP while several other HTTP methods like PUT, PATCH, and DELETE were rarely utilized. Additionally, the Web was mainly navigated by humans using web browsers and clicking on hyperlinks or submitting HTML forms. Clicking on a link is always a GET request while HTML forms only allow GET and POST methods. Recently, several web frameworks/libraries have started supporting RESTful web services through APIs. To support HTTP methods other than GET and POST in browsers, these frameworks have used hidden HTML form fields as a workaround to convey the desired HTTP method to the server application. In such cases, the web server is unaware of the intended HTTP method because it receives the request as POST. Middleware between the web server and the application may override the HTTP method based on special hidden form field values. Unavailability of the servers is another factor that affects the communication. Because of the stateless and synchronous nature of HTTP, a client must wait for the server to be available to perform the task and respond to the request. Browser-based communication also suffers from cross-origin restrictions for security reasons.

We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in an asynchronous mode with a full range of HTTP methods otherwise unavailable to standard clients and servers. HTTP Mailbox also allows for multicast semantics via HTTP. We evaluate a reference implementation using ApacheBench (a server stress testing tool) demonstrating high throughput (on 1,000 concurrent requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP Mailbox implementation in a human-assisted Web preservation application called "Preserve Me!" and a visualization application called "Preserve Me! Viz".

The HTTP Mailbox is inspired by the pre-Web distributed computing model Linda and modern Web scale distributed computing architecture REST. It tunnels the HTTP traffic over HTTP using message/http (or application/http) MIME type and stores the HTTP messages (requests/responses) along with some extra metadata for later retrieval. The HTTP Mailbox provides a RESTful API to send and retrieve asynchronous HTTP messages. For a quick walk-through of the thesis please refer to the oral presentation slides (HTML) or access them on SlideShare. A complete copy of the thesis (PDF) is also available publicly at:

Sawood Alam, HTTP Mailbox - Asynchronous RESTful Communication, MS Thesis, Computer Science Department, Old Dominion University, August 2013.

HTTP Mailbox - Asynchronous RESTful Communication

Our preliminary implementation code can be found on GitHub. We have also deployed an instance of our implementation on Heroku for public use. This instance internally uses Fluidinfo service for message storage. Let us have a look at the deployed service to illustrate its usage.

Let us assume that we want to check the HTTP Mailbox to see if there any messages for http://example.com/all. Our HTTP Mailbox API endpoint is located at http://httpmailbox.herokuapp.com/hm/. Hence we will make a GET request as illustrated below.

$ curl -i http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 404 Not Found
Content-Type: message/http
Date: Mon, 09 Sep 2013 16:59:13 GMT
Server: HTTP Mailbox
Content-Length: 0
Connection: keep-alive

This indicates that there are no messages for the given URI. Now let us POST something to that URI first. We have an example file named "welcome.txt" that is a valid HTTP message which we want to send to http://example.com/all.

$ cat welcome.txt
POST /all HTTP/1.1
Host: example.com
Content-Type: text/plain
Content-Length: 32

Welcome to the HTTP Mailbox! :-)

Now let us POST this message to the given URI.

$ curl -i -X POST --data-binary @welcome.txt \
> -H "Sender: hm-deployer" \
> -H "Content-Type: message/http" \
> http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 201 Created
Content-Type: message/http
Date: Mon, 09 Sep 2013 17:13:02 GMT
Location: http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6
Server: HTTP Mailbox
Content-Length: 0
Connection: keep-alive

Now that we have POSTed the message, we can retrieve it anytime later.

$ curl -i http://httpmailbox.herokuapp.com/hm/http://example.com/all
HTTP/1.1 200 OK
Content-Type: message/http
Date: Mon, 09 Sep 2013 17:15:33 GMT
Link: <http://httpmailbox.herokuapp.com/hm/http://example.com/all>; rel="current",
<http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="self",
<http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="first",
<http://httpmailbox.herokuapp.com/hm/id/ab3defce-dfa9-4d09-a72d-cac267531ca6>; rel="last"
Memento-Datetime: Mon, 09 Sep 2013 17:13:01 GMT
Server: HTTP Mailbox
Via: sent by 128.82.4.75 on behalf of hm-deployer, delivered by http://httpmailbox.herokuapp.com/hm/
Content-Length: 114
Connection: keep-alive

POST /all HTTP/1.1
Host: example.com
Content-Type: text/plain
Content-Length: 32

Welcome to the HTTP Mailbox! :-)

So far, there is only one message for the given URI. If more messages are posted to the same URI, above retrieval request will only retrieve the last message of the chain. From there the "Link" header can be used to navigate through the message chain.

We have been using HTTP Mailbox service in various applications including "Preserve Me!" and "Preserve Me! Viz". Following screenshot illustrates its usage in "Preserve Me!".

We would like to thank GitHub for hosting our code, Heroku for running our HTTP Mailbox instance on their cloud infrastructure, and Fluidinfo for storing messages in their "tag and value" style RESTful storage system.

I am grateful to my advisor Dr. Michael L. Nelson, committee members Dr. Michele C. Weigle and Dr. Ravi Mukkamala, colleagues and everyone else who helped me in the process of getting my Master's degree. Now, I am continuing my research under the guidance of Dr. Michael L. Nelson at Old Dominion University.

Resources

--
Sawood Alam

↧

2013-10-04: TPDL 2013 Trip Report

October 4, 2013, 10:35 am

≫ Next: 2013-10-11: Archive What I See Now

≪ Previous: 2013-09-09: MS Thesis: HTTP Mailbox - Asynchronous RESTful Communication

I attended the 2013 Theory and Practice of Digital Libraries (TPDL) Conference on September 22-26 in Valletta, Malta. Although I've had papers at several of the prior TPDL (known as ECDL prior to 2011) conferences, I think this is the first one I've personally attended since ECDL 2005 in Austria. Normally I prefer to send students to present their papers, but this year we had five full papers accepted, so I could not afford to send all the students and I went in their stead. An unfortunate side effect of having so many papers is that between preparation and my own presentations I was unable to see as much of the conference as I would have liked.

The conference began with Herbert Van de Sompel and I giving a tutorial about ResourceSync. Attendees registered for all tutorials and were free to attend whichever one they preferred. We had as many as ten people in ours at one point, but more importantly we had some key people present who will be implementing ResourceSync in their organizations. We also received some feedback and will probably reorder the slide deck to focus more on particular cases instead of a reference list of all possible capabilities and their implementation.

ResourceSync Tutorial from Open Archives Initiative

The main conference began with an opening keynote from Chris Borgman, reviewing the state of scholarly communication with her talk "Digital Scholarship and Digital Libraries: Past, Present, and Future". The slides are already available, and I believe videos will be eventually be posted on the TPDL Vimeo channel, but directly form her slides the best summary is 1) Open scholarship is the norm, 2) Formal and informal scholarly communication are converging, 3) Data practices are local, and 4) Open access to data is a paradigm shift.

I had two papers in the "Aggregating and Archiving" session following the keynote, although Herbert helped me out and presented one of them. I first presented "On the Change in Archivability of Websites Over Time" (with Mat Kelly, Justin Brunelle, and Michele Weigle), and then Herbert presented "Profiling Web Archive Coverage for Top-Level Domain and Content Language" (with Ahmed AlSum, Michele Weigle, and Herbert Van de Sompel).

On the Change in Archivability of Websites Over Time from Michael Nelson

Profiling Web Archive Coverage for Top-Level Domain and Content Language from Michael Nelson

There was a single parallel session after lunch, followed by a panel on the EU Cooperation on Science and Technology (COST), and then the Minute Madness and poster session that evening. At the reception, they honored Ingeborg Sølvberg for her upcoming retirement. Ingeborg has been active in the community for quite some time, and Herbert and I were PC co-chairs with her for JCDL 2012.

The following day opened with a large panel on DLs and e-Infrastructure, and then a single session that featured the eventual best paper winner, "An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles" by Stefan Klamp and Roman Kern. In the "Architectures and Interoperability" session after lunch, I presented the paper "Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool" (with Justin Brunelle, Herbert Van de Sompel, Robert Sanderson, and Lyudmila Balakireva). This paper quantified the negligible impact of running the SiteStory Transactional Web Archiving software on Apache systems.

Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool from Michael Nelson

There was a single session after this on the Semantic Web, and then the conference dinner in Mdina that evening.

The next day opened with a keynote about linked data and DLs from Soren Auer:

What can linked data do for digital libraries from Sören Auer

On this topic is a great blog post from Max Kemman entitled "The Future of Libraries is in Linking". Max covers the entire TPDL from the point of view of linked data and I think he's spot on.

In the closing session, I presented two papers: "Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web" (with Hany SalahEldeen), and "Who and What Links to the Internet Archive" (with Yamin AlNoamany, Ahmed AlSum, and Michele Weigle).

Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context to the Disappearing Web from Michael Nelson

Who and What Links to the Internet Archive from Michael Nelson

We were fortunate enough to have Yasmin's paper win "Best Student Paper"! Scott Ainsworth's paper was a nominee for this award at JCDL 2013, but this represents the first win for our research group! Congratulations to Yasmin and Ahmed!

“@tpdl2013: Best Student Paper Award #tpdl2013 http://t.co/JPllioad3k” @aalsum @yasmina_anwar @weiglemc
— Michael L. Nelson (@phonedude_mln) September 25, 2013

After the close of the conference, I had the opportunity to do a little bit of touring with George Buchanan and Annika Hinze, going to the Tarxien Temples and then to the picturesque fishing village of Marsaxlokk. Most of the following day on Thursday was spent getting back to US, in time to watch Va Tech beat Ga Tech. Additional photos can be found in TPDL 2013 flickr pool.

Next year, JCDL and TPDL will be held together September 8--12 in London. George Buchanan is the conference chair and the web site for the conference will be available soon.

--Michael

↧

2013-10-11: Archive What I See Now

October 11, 2013, 11:23 am

≫ Next: 2013-10-14: Right-Click to the Past -- Memento for Chrome

≪ Previous: 2013-10-04: TPDL 2013 Trip Report

Earlier this year, we were awarded an NEH Digital Humanities Start-Up Grant for our project "Archive What I See Now": Bringing Institutional Web Archiving Tools to the Individual Researcher.

We were invited to attend the NEH Office of Digital Humanities Project Directors' Meeting in early October, but due to the government shutdown, the meeting was cancelled. Here I'll give the quick overview of the project that I'd planned for that meeting. (Mat Kelly has already posted a nice description of the tools we've been developing, WARCreate and WAIL, at http://bit.ly/wc-wail.)

The slides I'd prepared are below:

"Archive What I See Now" - NEH ODH overview from Michele Weigle

Our project is focused on helping people archive web pages. Since much of our cultural heritage is now published on the web, we want to make sure that important pages are archived for the future.

Since 1996, the Internet Archive and other archiving services have done great work preserving web pages. But, the Internet Archive can only do so much. What if you had a website that the Internet Archive doesn’t or can’t crawl or one that changes more frequently than they would crawl it? Until now, your solution was to archive the page yourself, either using ad-hoc methods like “Save Page As” or by attempting to install your own crawler and Wayback Machine instance.

Our partners in this project include church historians who want to allow individual churches to archive their own websites, artists who want to preserve their own sites, political scientists who want to archive conversations about elections in social media, and social scientists who want to archive conversations about disasters in social media.

There are a couple problems here that we’re addressing. First, if you want an archive of a webpage in a standard format, called a WARC, you have to install and configure some rather complex software. Second, if the webpages you want to archive are behind authentication, the crawler will not be able to access them. Another problem is that you typically set the crawl frequency ahead of time, so if you find a page that you want to archive and it might change soon, it may be difficult to schedule.

So, we’ve built some tools that allow you to get around these problems. They let you “Archive”, “What I See”, “Now”. Essentially, what you see in the browser is what gets archived.

The two tools that we've developed are WARCreate and WAIL. WARCreate is a browser extension (right now for Chrome, but Firefox is coming soon) that lets you create a WARC of whatever page you’re viewing. It can be on social media, it can be a dynamic page, or it can be behind authentication. The WARC is created locally and saved on your local machine.

So, now that you have a WARC, what do you do with it? Our second tool, WAIL, addresses this issue. It is package that contains Heritrix, the Internet Archive’s crawler, and wayback, the software behind the Wayback Machine. This package installs and configures the software in one click. Once WAIL is running, you can point it to a directory of WARCs that you created with WARCreate, and then you can access your archives locally using the Wayback Machine interface.

Right now, WARCreate can only archive a single page and just saves it locally. We are working on building in the ability to archive a set of pages, or a whole site, and the ability to upload the created WARC to a remote server, including a service like the Internet Archive’s Archive-It.

We hope that these two tools will be useful and can help non-IT experts archive important pages for the future.

If you try out these tools, please fill out our feedback form at http://bit.ly/wc-wail-feedback

-Michele

↧

2013-10-14: Right-Click to the Past -- Memento for Chrome

October 14, 2013, 1:21 pm

≫ Next: 2013-10-15: Grace Hopper Celebration of Women in Computing (GHC) 2013

≪ Previous: 2013-10-11: Archive What I See Now

Last week LANL released Memento for Chrome, an extension that adds Memento capability for Chrome browsers. It represents such a leap in capability and speed that the prior MementoFox (Memento for FireFox) add-on should be considered deprecated.

It's not just a FireFox vs. Chrome thing either; Memento for Chrome features a subtle change in how it interacts with the past and present. MementoFox had a toggle switch for present vs. Time Travel mode that would trap and modify all outbound requests, from the current page and all subsequent pages until turned off, to go from the form of:

http://example.com/index.html

to:

http://mementoproxy.lanl.gov/aggr/timegate/http://example.com/index.html

This involved some complicated logic to determine when you were getting a memento (i.e., archived web entity) vs. something from the live web. When you factored in native Memento archives vs. proxied Memento archives, things could get hairy (see the 2011 Code4Lib paper for a (dated) discussion of some of the issues). Due to differences in how they archive web pages, it was not possible to take an HTML page from archives like WebCite and Archive.is and modify all the links to go through the Memento aggregator.

Instead of a toggle switch, Memento for Chrome features a "right-click" model in which time travel is only for the next click and (from the client's point of view) is not sticky. Basically, you load the present version of "index.html", and the prior versions are accessed by right-clicking in the page or on the next link itself to pull up the option of traveling to some prior date (set separately via a calendar interface). This means the client only modifies a single request, and the subsequent requests are processed unfiltered.

For most web archives, all the subsequent requests for embedded images, stylesheets, etc. will be rewritten to be relative to the archive of the parent HTML page. In other words, if you land inside the Internet Archive, then all the embedded resources will come from the Internet Archive, and all the subsequent links in the HTML to other pages will be rewritten to take you to pages inside the Internet Archive. This means sometimes you'll miss a resource that is present in another archive but not in the current archive or the target date can drift over many clicks (see Scott Ainsworth's JCDL 2013 paper on this topic), but this allows the client to run much faster. You can always choose to right-click (instead of just a regular click) to reapply the Memento time travel.

On other archives, like WebCite and Archive.is, as well as other systems like wikis, the links to other pages aren't rewritten to point back into the archive, and a regular click will pop you out of time travel mode and back to the present web. In this case, successive right-clicks are required to stay in time travel mode.

Herbert has prepared a very nice demo video that packs many features into 78 seconds. If you want to know why Memento for Chrome is really special, watch this video:

he starts at the current version of techcrunch.com, and then sets a datetime preference for June 20, 2011.
right-clicking in the current page, he chooses the option to time travel to a version near June 20, 2011 (in this case, he gets June 20, 2011 exactly, but that's not always possible)
he right-clicks on the link for gnonstop.com, and chooses to get an archived version (in this case, the archive delivers a close but not exact version of June 21, 2011).
note the archived pages for techcrunch.com and gnonstop.com both come from the Internet Archive.
to see the current version, he right-clicks and chooses "get at current time" and sees that the current version is unavailable.
from that page he right-clicks and chooses "get near current time", which is basically "get me the most recent archived version", which at the time of this video was July 8, 2011 and the archived pages comes from a different Memento-enabled archive, Archive.is.

If the above is interesting to you, I recommend the longer (10 minute) video Herbert prepared with an earlier version of Memento for Chrome:

Some highlights include:

00:30 -- 1:30: a Google search is done in the present, but the first blue link is right-clicked to visit the prior version
02:15 -- 03:00: from the Google results page, a link is followed in the present, then the page is viewed in the past via a right-click
04:10 -- 05:03: shows how the client works with the SiteStory Transactional Web Archive
05:10 -- 07:00: an extended session about how it works with Wikipedia (i.e., Mediawiki)
07:20 -- 08:20: interacting with an archived 404 and resetting the date

Keep in mind this is an older version of the software, but there are enough interesting bits in the video that I think it still warrants viewing for those who care about various special cases.

On the other hand, if you don't care about the details watch the short video and then download the Chrome extension and get started. We welcome feedback on the memento-dev email list, or please be the first to review the extension. Thanks to Harish Shankar for an excellent job on the development of this extension.

--Michael

P.S. Note that other Memento clients are still available, including iOS, Android, and the mcurl command line client. Though slowed by his PhD breadth exam and other obligations, Mat Kelly is still developing Tachyon, a Chrome extension with a toggle model similar to MementoFox, first developed by Andy Jackson.

↧

2013-10-15: Grace Hopper Celebration of Women in Computing (GHC) 2013

October 15, 2013, 4:26 am

≫ Next: 2013-10-22: IEEE VIS Trip Report

≪ Previous: 2013-10-14: Right-Click to the Past -- Memento for Chrome

On October 2-5, I was thrilled to attend Grace Hopper Celebration of Women in Computing (GHC), the world's largest gathering for women in computing, and meet so many amazing and inspiring women in computing. This year, GHC was held in Minneapolis, MN. It is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together research and career interests of women in computing and encourage the participation of women in computing. GHC was held for the first time in 1994 in Washington DC. The theme of the conference this year was "Think Big - Drive Forward".

There were many sessions and workshops that targeted academics and business. The Computing Research Association Committee on Women in Computing (CRA-W), offered sessions targeted towards academics. I had a chance to attend Graduate Cohort Workshop last April, which was held in Boston, MA, and created a blog post about it.

The first day started with welcoming new comers by the program Co-Chairs, Wei Lin from Symantec Corporation and Tiffani Williams from Texas A&M University. They expressed their happiness to be among 4,600 brilliant women in computing. They also highlighted that there were many experts and collaborators who were eager to help and answer our questions.

Barb Gee, the vice president of programs for Anita Borg institute, spoke about ABI global expansion and it was a successful experiment in India. Gee said, "we believe that if women are equally represented at the innovation table, the products will meet better satisfaction and solutions for many problems will be optimized".

Then, the plenary session was composed of three amazing world thought leaders who had an enlightening conversation about "How we can think big, and drive forward": Sheryl Sandberg, the COO of Facebook and the founder of LeanIn.org, with Maria Klawe, the president of Harvey Mudd College, and Telle Whitney, the President and CEO of the Anita Borg Institute. The conversation started with a question from Klawe to Sandberg about the reason of writing her book "Lean In". Sandberg started her answer with "because it turns our the world is still run by men, and I'm not sure it's going very well!".

Sandberg left all of us with a great inspiration because of her question: "What would you do if you are not afraid?"
Here are some quotes from their conversation:

"People who imagine and build technology are problem solvers. They look at what the world needs and they create it.
"We are here because we believe that each one of you has a potential to create a different future."
"Women who make up 51% of the population and are part of 80% of the purchasing decisions, only make up 23% of the computer science work force."
"Next time you see a little girl and someone is calling her bossy, take a deep breath and big smile on your face, and say, ‘that little girl is not bossy she has executive leadership skills."
"What would you do, if you were not afraid? When you leave GHC, whatever you want to do, go and do it!"
"Women inspire other women"

At the end, Withney announced a partnership between LeanIn.org Foundation and Anita Borg Institute to create circles for women in computing.

For reading more about the conversation, here are a blog post and an article:

This is the video of the conversation:

After the opening keynote, we attended the scholarship lunch which was sponsored by Walmart Labs, in which we had small talks from Walmart people during the lunch. After lunch, I attended the Arab Women in Computing meeting. This is the first time to have a real existence for Arab Women in Computing organization in GHC, based on Sana Odeh, the founder and the chair of the organization, from New York University. Then I attended couple of Leadership workshops in which we had circles and exchanged the questions with expert senior women in computing who answered questions about how to move our career forward.

In the evening, I presented my poster entitled "Access Patterns for Robots and Humans in Web Archives" during the poster session. The poster contains an analysis of the user access patterns of web archives using the Internet Archive's Wayback Machine access logs. The detailed paper of this research appeared at JCDL 2013 proceedings.

User Access Patterns in Web Archives from Yasmina Anwar

In the meantime, many famous companies, such as Google, Facebook, Microsoft, IBM, etc., were there in the career fair. Each company has many representatives to discuss the different opportunities they have for women. A few men also attended the conference. For a perspective of the conference from a man's point of view, Owyn Richen created blog post titled "Grace Hopper 2013 – A Guy’s Perspective". This is also another post on Questionable Intelligence blog.

Thomson Reuters attracted many women's attention with a great promotion through bringing up a caricature artists. I was seeing large queues of many women the whole time of the days of the career fair waiting for a delight draw. They also had many representatives for promoting the company and also for interviewing. I enjoyed being among all of these women in the career fair which inspired me enough to think about how to direct my future in a way to contribute to computing and also encourage many other women to computing. My advice to anyone who will go to GHC next years, print many copies of your resumes to be prepared for the career fair.

On day 2, Telle Whitney gave an inspiring short talk before the second keynote begins. She presented some statistics about the conference to realize how fortunate we were to be among 4,817 attendees of the conference. Based on Whitney's, 54 countries, 305 companies, and 402 universities are represented. She also presented the top 10 universities that brought the most students, and also the top 10 companies who brought participants to GHC 2013. University of Minnesota is in the lead of the universities and Microsoft is in the lead of the companies. Here are some quotes from here talk:

"Think Big, because you can!"
"You cannot fight every battle or certainly cannot win every war, but you can stay true to who you are, by never giving up on yourself. Drive Forward."

Whitney talked about ACM support and partnership between ACM and ABI, then introduced John White, from ACM, for the opening Remarks. Vincent Cerf, the president of ACM, was supposed to attend, but he couldn't. Cerf created a video for the attendees to speak about how it is important to be in GHC. He expressed his sadness from some colleagues for badly treating women in computing. He wished to attend GHC 2014 personally to help encouraging more women to be in the computing field.

Megan Smith, the Vice President of Google[x], gave a keynote titled, "Passion, Adventure and Heroic Engineering". Before Smith showing up, a short inspiring video about moonshot thinking was presented. The most inspiring quote of the video was "When you find your passion, you are unstoppable.". Smith had image oriented presentation that flow her talks. She shared details about the 4 Google[x] projects:

The highlight quote from her talk was "Find your passion and combine it with work, you will be unstoppable.".

Here are a blog post and an article about Smith's talk:

The following is the video of Smith talk:

At the end, we were surprised by Nora Danzel, who gave an amazing talk last year in GHC's keynote opening. Dr. Michele Weigle created a blog post about it. Danzel talked shortly about Anita Borg story and how that amazing women started the organization to encourage women in computing to get together and increase the women in computing. She asked for donation for keeping Anita Borg Institute up to help many women every year.

I attended a couple of workshops after the break, but the most highlighted one is an invitation only event from Microsoft Workshop. I had a great chance to meet many senior women from Microsoft from many different projects and exchange the knowledge on how can be a successful leaderships in our careers.

At the end of the day, the ABI award ceremony was held. Shikoh Gitau, the ABIE Change Agent Awards Winner, gave a very emotional talk. After this it was the dancing party and the entertainment. In the same time, there was a documentary video about Anita Borg's life and her influence on the creation of the Anita Borg Institute, and the Systers group. It showed also how she started these initiatives for bringing women in computing together. Here is the documentary video about Anita Borg:

I spent most of the third day in the career fair. Grace Hopper not only gave me inspiration, happily it allowed me to meet many old friends and new amazing friends. It also allowed me to discuss my research ideas with many senior women and got positive feedback about it. I'm pleased to have this great opportunity which allowed me to network and communicate with many great women in computing.

For more information about GHC, here are some articles and blog posts:

"Grace Hopper Conference 2013: Think Big, Drive Forward" by Ivo Lukas from the Huffington Post
"Grace Hopper Celebration of Women in Computing" by Research at Google
"Advancing Your Career With Leadership Presence" blog post
A GHC roundup by Gail Carmichael from women in science and engineering of Carleton University.
"Why I Told a Dude to Go to a Women’s Conference" by Betsy Aoki from Bing
A blog post from Groupon Engineers by Laura Roman
Many blog posts about GHC on Medium.com
Many blog posts from Thomson Reuters

---
Yasmin

↧

2013-10-22: IEEE VIS Trip Report

October 21, 2013, 9:32 pm

≫ Next: 2013-10-23: Preserve Me! (... if you can, using Unsupervised Small-World graphs.)

≪ Previous: 2013-10-15: Grace Hopper Celebration of Women in Computing (GHC) 2013

If you recall, way back in 2012, Kalpesh Padia (now at N.C. State under Christopher Healey) and Yasmin AlNoamany (@yasmina_anwar) presented "Visualizing Digital Collections at Archive-It", a paper presented at JCDL 2012, which was the product of Dr. Michele C. Weigle's (@weiglemc) pair of infovis-related courses at Old Dominion University (ODU): CS825 - Information Visualization and CS895 - Applied Visual Analytics.

Like Kalpesh and Yasmin, I have turned a semester project into a conference submission with a poster/demo accepted to IEEE VIS 2013: Graph-Based Navigation of a Box Office Prediction System. The impetus for this strangely out-of-topic (for this blog's theme) submission has roots in the IEEE Visual Analytics Science and Technology (VAST) Challenge, a competition where a large data set is supplied to contestants and a meaningful visual representation is created with each submission. Both Kalpesh and I had previously participated in the VAST Challenge in 2011 (see a summary of my Visual Investigator submission) yet neither of us attended the conference, so without further ado, the Trip Report.

I arrived on Wednesday morning, setup my poster, and headed off to the first session, which consisted of "Fast Forwards" of the papers. This summary session is akin to the "Minute Madness" at JCDL and allows conference attendees to get a glimpse at the many papers to be presented and to choose which concurrent session to attend. The one that piqued my interest the most was the InfoVis Papers session: Storytelling & Presentation.

With the completion of the Fast Forward Summaries, I headed over to the Atrium Ballroom of the Atlanta Marriott Marquis (the conference venue, pictured above) to first see Jessica Hullman of University of Michigan present "A Deeper Understanding of Sequence in Narrative Visualization" (full paper).

In the presentation she stated, "Users are able to understand data if they're seeing the same type of transition repeatedly." In her study, her group created over fifty presentational transitions using public data with varying type and cost (she describes the latter as a function in the paper). From the study, she found that low cost transitions are preferred, temporal, temporal transitions are easy to grasp and hierarchical transitions were the most difficult for the user.

She then created 6 visualizations with and without parallel structures and utilized them in a timed presentation given to 82 users. She then asked for the transitions to be compared and explained as well as requested the user to recall the order of the content. With further studies on the topic she was able to confidently conclude that "Presentation order matters!" and that "Sequence affects the consumption of data by the user."

Following Jessica, Bongshin Lee (@bongshin) of Microsoft Research presented "Sketchstory: Telling More Engaging Stories with Data through Freeform Sketching". Sketchstory is a means of utilizing underlying data in interactive presentations, as is done on an interactive whiteboard. Bongshin demonstrated the product by showing that just through gesturing, data can be immediately plotted or graph in a variety of Powerpoint-esque figured to help a presenter explain data interactively to an audience. The system is capable of drawing icons and axes while utilizing the data on-the-fly, which makes it suitable for storytelling.

In a study of the product, Bongshin's group found that users enjoyed the presentations given with Sketchstory more than Powerpoint presentations, felt they were more engaged with the presentations and that the presenters felt the system was easy enough to learn. However, possibly due to previous familiarity with Powerpoint, most presenters felt that creating a presentation in Sketchstory required more effort than doing so in Powerpoint.

In followup questions to the presentation, one audience participant (at the conference, not in the study) asked how engagement was measured in the study, to which Bongshin replied that questions were asked using a Likert scale. When another audience member asked where they could try out the software to evaluate it for themselves, Bongshin relied that it was not available for download and that is only suitable for internal (Microsoft) use."

The next presentation was on StoryFlow, a tool (inspired by the work of Randall Munroe, illustration pictured) for creating storyline visualizations interactively. The authors determined that in order to be more effective of a visualization, the timeline plot needed to reduce the number of edge crossings and minimize whitespace and "wiggles", with the latter referring to unnecessary movements for association in the graph.

The authors mathematically optimized the plots using quadratic programming to facilitate ordering alignment and compaction of the plots. Evaluation was done by comparing the plots generated against a genetic algorithm method and Randall's method. From their work, the authors concluded that a storyline visualization system was an effective hybrid approach at producing the graphs through being aware of the hierarchy needed based on the plots. Further, their system provided a method for interactively and progressively rendering the graphs if the user though a more visually pleasing layout is preferred.

The fifth and last presentation of the Storytelling and Presentation session was "Visual Sedimentation", a interesting approach at showing data flow. "Data streams are everywhere but difficult to visualize.", stated Samuel Huron (@cybunk).

Along with Romain Vuillemot (@romsson) and Jean-Daniel Fekete (@jdfaviz) of Inria, their early work started in visualizing political Twitter streams during the French 2012 presidential elections, and social interactions during a TV show. Through an effect of compaction, old data is "merged" into the total value to escape visual clutter and provide an interesting accumulation abstraction. The video (below) gives a good overview of the work but for those technically handy, the project is on Github.

After a short coffee break, I attended the next session wherein Carlos Scheidegger (@scheidegger) presented "Nanocubes for Real-Time Exploration of Spatiotemporal Datasets". Nanocubes are "a fast datastructure for in-memory data cubes developed at the Information Visualization department at AT&T Labs – Research". Based on Data Cubes by J. Gray et. al along with many other works well known in the Vis community (e.g., by Stolte, Mackinlay, Kandel and other works) Carlos showed how they went about extracting the information necessary based on two location fields and a device field aggregated to create a summary record.

Carlos' summation of the project were the nanocubes enabled interactive visual interfaces for datasets that previously were much too large to visualize. Further, he emphasized that these data sets did not have massive hardware requirements but instead, the system was designed to allow exploration of the data sets from a laptop of cell phone. The project is open source with the server back end written in C++11 and the front end written in C++11, OpenGL, JavaScript, D3 and a few other technologies.

After Carlos, Raja Sambasivan (@RS1999ent) of Carnegie Mellon University presented "Visualizing Request-Flow Comparison to Aid Performance Diagnosis in Distributed Systems". "Distributed Systems are prone to difficult-to-solve problems due to scale and complexity", he said. "Request flow show client-server interaction".

After Raja, Michelle Borkin (@michelle_borkin) of Harvard presented "Evaluation of Filesystem Provenance Visualization Tools" in which she initiated the talk by introducing file system provenance through the recording of relationships of reads and writes on the file system of a computer. The application of recording this information might lie in "IT Help, chemistry, physics and astronomy", she said. Through a time-based node grouping method algorithm, data is broken up into groups by activity versus being a whole stream grouping or a simple marker for the start of activity.

She illustrated various methods for visualizing file provenance showing a side-by-side of how a node-and-link diagram gets unwieldy with large data sets and expounding on the radial graphs as an alternative is preferable.

A running theme in the conference was the addition of a seemingly random dinosaur on the slides of presenters. The meme originated with Michelle's presentation on Tuesday titled "What Makes a Visualization Memorable?" (paper) in which, she was quoted as saying, "What makes a visualization memorable? Try adding a dinosaur. If that’s not enough, add some colors." With this in-mind, dinosaurs began popping up on the slides each author felt was the take-home of his/her presentation.

Following Michelle, Corinna Vehlow presented "Visualizing Fuzzy Overlapping Communities in Networks". "There are two types of overlapping communities: crisp and fuzzy.", she continued, "Analyzing is essential in finding out what attributes contribute to each of these types." Her group has developed an approach for utilizing undirected weighted graphs for clarifying the grouping and representing the overlapping community structure. Through their approach, they were able to define the predominant community of each object and allow the user of their visualization to observe outliers and identify objects with fuzzy associations to the various defined groups.

After Corinna, Bilal Alsallakh (@bilalalsallakh) presented "Radial Sets: Interactive Visual Analysis of Large Overlapping Sets". In his talk, he spoke about Euler diagrams' limited scalability and the concept he created called "Radial Sets" that allows association to be encoded using relative proximity. The interactive visualization he created allowed for interactivity wherein extra information could be accessed (e.g., set union, intersection) by holding down various keyboard modifiers (e.g., alt, control). By using a brushing gesture, sets could be combined and aggregate data returned to the user.

The conference then broke for a long lunch. Upon returning, a panel commenced titled "The Role of Visualization in the Big Data Era: An End to a Means or a Means to an End?" with Danyel Fisher (@FisherDanyel), Carlos Scheidegger (@scheidegger), Daniel Keim, Robert Kosara (@eagereyes), and Heidi Lam. Danyel stated "The means to an end is about exploration. The ends to a means is about presentation". He noted that a lot of big data is under-explored. In 1975, he illustrated, big data was defined in VLDB's first year as 200,000 magnetic tape reels. He cited his own 2008 paper about Hotmaps as an exhibition that big data is frequently not suitably convertable for interactivity. "There were things I couldn't do quickly", he said, alluding to tasks like finding hte most popular tile in the world in the visualization. He finished his portion of the panel by stating that Visualizations is both an ends to a mean and a means to an end. "They're complementary, not opposing", he concluded.

Carlos was next and stated that there are two kinds of big data, the type that is large in quantity and the type that is "a mess". "Big data is a means to a means.", he said, "Solving one problem only brings about more questions. Technology exists to solve problems created by technology." He continued by noting that people did not originally expect data to be interactive. "Your tools need to be able to grow with the user so you're not making toy tools." He continued by saying that we need more domain-specific languages, "Let's do more of that!".

Heidi followed Carlos noting that "When company profits are down, consider whether they've always been down.", alluding to the causal aspect of the panel. She noted two challenge: First, figure out what not to show in a visualization; Secondly, Aggregated data is likely missing meaning only apparent when the full data set is explored, an issue with big data. She finished by describing Simpson's paradox by saying "Only looking at aggregate data and not slices might result in the wrong choice.", referring back to her original "profits down" example.

Robert spoke after Heidi by asking the audience, "What does big data mean? How do you scale from three hundred to three hundred thousand? Where should it live?" In reference to a tree map he asked, "Why would I want to look at a million items and how is this going to scale?" Juxtaposed to Heidi he stated that he cares about totals and aggregate data and likely not the individual data.

In the Q&A part of the panel, one panelist noted, "Visualization at big data does not work. Shneiderman's mantra does not work for big data.". The good news, stated another panelist, is that automated analysis of big data does work.

Following the panel, the poster session commenced early as the last event of the day.. There I presented the poster/demo I showed earlier in this post "Graph-Based Navigation of a Box Office Prediction System".

Thursday

The second day of my attendance at IEEE VIS started with a presentation by Ross Maciejewski titled, "Abstracting Attribute Space for Transfer Function Design". In the paper Ross inquired as to how to take 3D data and map the bits to color. In his group's work, they proposed a modification to such a visualization in which the user is presented with an information metric detailing the relationship between attributes of the multivariate volumetric data instead of simply the magnitude of the attribute. "Why are high values interesting?", he asked and replied with "We can see the skewness change rapidly in some places." His visualization gives a hint of where to start in processing the data for visualization and gives additional information metrics like mean, standard deviation, skewness, and entropy. Any of these values can then be plugging into the information metric of the visualization.

Carsten Görg followed Ross with "Combining Computational Analyses and Interactive Visualization for Document Exploration and Sensemaking in Jigsaw". "Given a collection of textual documents", he said, "we want to assist analysts in information foraging and sensemaking." Targeted analysis is a bottom up approach, he described, whole an open-ended scenario is top to bottom. He then proceeded to show an hourglass as an analogy of information flow in either scenario. His group did an evaluation study with four settings, using paper, a desktop, an entity, and Jigsaw each with using four strategies: overflow, filtering, and detail; build from detail; hit the keyword; and find a clue and follow the trail. From a corpus of academic papers he showed a demo wherein corresponding authors were displayed on-the-fly when one was selected.

Ross Maciejewski completed the sandwich around Carsten by presenting another paper after him titled, "Bristle Maps: A Multivariate Abstraction Technique for Geovisualization. In the talk he first described four map types and some issues with each:

Point maps are cluttered
Choropleth maps are not modifiable and exhibit a real unit problem
Heat Maps are limited to one variable per map
Line maps allow two variables per map, but that's it.

"Bristle maps allow the user to visualize seven variables utilizing traits like color, size, shape, and orientation in the visualization." His group tried different combinations of encoding to see what information could be conveyed. As an example, he visualized crime data at Purdue and found that people were better at identifying values in the visualization with bristle maps than with a bi-variate color map.

After Ross' sandwich, Jing Yang presented "PIWI: Visually Exploring Graphs Based on Their Community Structure (HTML). In the presentation she described the process of using Vizster and NodeXL to be able to utilize tag clouds, vertex plots, boolean operations and U-Groups (User-defined vertex groups).

Following Jing, Zyiyuan Zhang presented "The Five W's for Information Visualization with Application to Healthcare Informatics". "Information organization uses the 5 Ws scheme", he said, "Who (the patient), What (their problems), Where (location of What), When (time and duration of what)" conveniently leaving out the "Why". He encoded these questions into a means more navigable to doctors than the usual form-based layout healthcare professionals experience.

Following a break, Charles Perin (@charles_perin) presented "SoccerStories: A Kick-off for Visual Soccer Analysis. "Usually there's not enough data and only simple statistics are shown.", Charles said. "If there's too much data, it's difficult to explore." His group developed a series of visualizations that allows each movement on the field to be visualized usually context-sensitive visualization types that are appropriate for the type of action on the field they're trying to describe. Upon presentation to a journalist, his reply was "My readers are not ready for this complex visualization", noting that a higher degree of visualization literacy would be required to fully appreciate the visualization's dynamics.

Following Charles, Rahul Basole (@basole) presented "Understanding Interfirm Relationships in Business Ecosystems with Interaction Visualization". "Business makers understand their immediate competitive position but not beyond that." His group's approach enabled decision makers to:

Visually explore the complexity of inter-firm relations in the mobile ecosystem
Discover the relation between current and emerging segments
Determine the impact of convergence on ecosystem structure
Understand a firm's competitive position
Identify inter-firm relation patterns that may influence their choice of innovation strategy or business models.

Following Rahul, Sarah Goodwin (@sgeoviz presented Creative User-Centered Visualization Design For Energy Analysts and Modelers". In the presentation she visualized energy usage of individuals to provide insight into time-shifting their usage (a la Smart House) to less peak times.

Christian Partl spoke after Sarah on his paper "Entourage: Visualizing Relationships between Biological Pathways using Contextual Subsets." His work expounded on Kono 2009 by showing that biological processes can be broken down into pathways and asked three questions:

How do we visualize multiple pathways at the same time?
How do we visualize relationships between pathways?
How do we visualize experimental data on the pathways?

To visualize multiple pathways, he connected the pathways by shared nodes with "focus" pathways and "context" pathways. When focusing on a node, his visualization only displays the immediately surrounding node. Relationships can be visualized by the connection of stubs and guessing which pathway it is. A system called enRoute allows selection of a path within a path and can display it in a separate view to show experimental data.

Joel Ferstay came up after Christian with "Variant View: Visualizing Sequence Variants in their Gene Context". In their study they created visualization for DNS analysts using an interactive and iterative fashion to ensure the visualization was maximally useful in regards to allowing exploration and providing insights onto the data. From the data source of DNA sequence variants (e.g., human versus monkey), their work helped to determine which variants are helpful and which are harmless. Their goal was to show all attributes necessary for variant analysis and nothing else. To evaluate their visualization, they compare it to MuSiC, a different variant visualization plot and found Variant View showed encoding on separate lanes, so did not have the disadvantage of variant overlap, which would hinder usefulness.

Sébastien Rufiange next presented "DiffAni: Visualizing Dynamic Graphs with a Hybrid of Difference Maps and Animation". In his presentation, he tried to resolve the node-link bad but matrices hard-to-read problem by using dynamic networks in small multiples and embedded glyphs with data at each point.

John Alexis Guerra Gómez (@duto_guerra) followed Sébastien with "Visualizing Change Over Time Using Dynamic Hierarchies: TreeVersity2 and the StemView" where he showed how to display categorical data as trees. The trees consisted of data with either fixed hierarchy, dynamic data (e.g., gender, ethnicity, age), or mixed (e.g., gender, state, city).

Following John, Eamonn Maguire presented "Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs". In the paper, they created macros in workﬂow visualization as a support tool to increase the efficiency of data curation tasks. Further, they discovered that the state transition information used to identify macro candidates characterizes the structural pattern of the macro and can be harnessed as part of the visual design of the corresponding macro glyph.

After Earmonn, Eirik Bakke (@eirikbakke) presented "Automatic Layout of Structured Hierarchical Reports". In their visualization, Eirik's group with to overcome the form-based layout style of visualization that is normally supplied to those having to interface with a database". Using a nested table approach allowed them to display data based on the screen real estate available and be adaptive when the space available was conducive.

Tim Dwyer presented next with "Edge Compression Techniques for Visualization of Dense Directed Graphs" where he attempted to simplify dense graphs by creating boxes. His visualization were created by using Power-graphic compression through MiniZinc.

After a much-needed break (as evidenced by the length of my trip report notes), R. Borgo presented "Visualizing Natural Image Statistics" in which he, utilizing the Forier representations for image, noted that it's difficult to uniquely identify different images by sight. Further, he found that it was difficult to even define the statical criteria for classifying these images. The example he used were images of manmade versus natural images wherein some degree of similarity existed between those of the same class but the distinction was insufficient. Using Gabor filters, four scales and eight orientations were used for the classification task.

Yu-Hsuan Chan presented next with "The Generalized Sensitivity Scatterplot". She had asked people to identify 2 functional trends from a scatterplot, determined the flow of the data independently and determined how well the trends matched.

Michael Gleicher presented his paper next with "Scatterplots: Overcoming Overdraw in Scratterplots". In his paper, he asked "What happens when you have scatterplots with too many points?". He continued, "Data is unbounded, visual is bounded". His group utilized Kernel Density Estimation to determine when to cluster data and utilized the GPU to ensure that the visualization was interactive.

Ceci n'est pas une kelp.

Wouter Meulemans presented next with his paper "KelpFusion: a Hybrid Set Visualization Technique". He said, "Given a set of points, each point is part of a set. To find the strucuture, connect the nodes to form a minimum spanning tree." He went on to correlate Kelp Diagrams with Bubble Sets and Line Sets. He toted KelpFusion as a means to interactively explore hybrid selection. He then went on to explore the various considerations and heuristics he used in strategically generating the areas between nodes to express relation beyond a simple node-and-link diagram while simultaneously retain the context potentially provided on and underlying layer (see below).

Ceci est une kelp.

The final presentation I attended was Sungkil Lee's "Perceptually-Driven Visibility Optimization for Categorical Data Visualization". The goals of Sungkil's study were to define a measure of perceptual intensity for categorical distances. They define class visibility as a metric to measure perceptual intensity of categorical groups. Without doing so, they found, the dominant structure suppresses small inhomogeneous groups.

Following the presentations, I headed back to Norfolk armed with new knowledge of the extent of visualization research that is currently being done. Had I simply perused proceedings or read papers, I am not sure I would have gotten the benefit of hearing the authors give insights into their work.

—Mat (@machawk1)

↧

2013-10-23: Preserve Me! (... if you can, using Unsupervised Small-World graphs.)

October 22, 2013, 6:53 pm

≫ Next: 2013-11-2: WSDL NFL Power Rankings Week 9

≪ Previous: 2013-10-22: IEEE VIS Trip Report

Everyday we create more and more digital files that record our lives. We take selfies (with and without our loved ones). We record our baby's first step. We take pictures of things that we have or would like to have. The number of digital file and artifacts we create grows and grows and the places where we can store them seem to have almost infinite capacity. Smart phones with 64Gigabytes of storage, could hold almost 20,000 MP3 files (roughly 1,000 hours of listening time, or about 6 months of listening 8 hours a day). Amateur cameras can have the same amount of storage, and depending on image size and frames per second can store days of continuous recordings or about 500,000 still images. We can and are creating more digital artifacts than we can manage. Being able to create so much, means we don't care about what we create. We create because it is easy. We create because it is fun. We create because we have a new toy. We create because we can. There is a significant downside to this creation craze. How can we preserve our selfies for our children?? How can we share our baby's first step with their babys?? How can we show what we had when we were young, now that our hair is silver?? How can we show unknown others in the future those things that were important to us in our youth?? How do we preserve our selves??

We could foist the preservation responsibility of all that we create onto our children (seems sort of unkind). We could preserve our selves using a commercial or governmental institution , but that may not be too much better. Another way to attack the problem is to rephrase the question. Instead of: how we preserve digital artifacts and objects?? Change the question to: how can digital objects preserve themselves?? If we can imbue digital objects with directions to preserve themselves and provide a benign environment where they can survive then they should be able to continue to be available long after we are gone. Long after our children are gone. Long after those that loved and cared for us are forgotten. Imbuing digital objects with preservation directives and providing a benign environment are at the heart of Unsupervised Small-World (USW) graphs.

At Old Dominion University, we created a demonstration USW environment composed of representative sample webpages, faux domains with supporting RESTful methods, and a robot to represent users as they wandered through the Internet and viewing the representative webpages. We scraped parts of four domains (flickr.com, arXiv.org, RadioLab.org and Gutenberg.com) to collect representative pages with different types of digital files.

As the human facing portion of the USW graph, we mocked up a Preserve Me! button so that the webpage viewer could add the webpage to the USW graph.

Mock up of an ArXiv page with the Preserve Me! button.

Putting the Preserve Me! capability into the hands of everyone is in keeping with the idea that everyone should be an curator (Frank McCown "Everyone is a Curator: Human-Assisted Preservation for ORE Aggregations"). After the Preserve Me! button is pressed, the second screen appears,

Mock up of Preserve Me! REM messages.

and an Object Exchange and Reuse (ORE) REsource Map (REM) serialization of the original webpage is created. The REM representation of the original webpage will be preserved by the USW process.

There are two major parts of the benign infrastructure. Firstly is a set of servers that support two USW RESTful methods called "copy" and "edit." The "copy" method creates a copy of a foreign REM in the local domain and "edit" updates selected REMs on the local domain. Secondly is an HTTP message server (Sawood Alman's Master's thesis) which provides a communication mechanism for exchanging actionable HTTP directives between the USW imbued digital objects.

As an example, we going to talk about preserving a scanned image from the 1900s.

Josie McClure, 1907, 15 years old.

The image was uploaded to flickr.com and was scraped to become part of the ODU benign USW demonstration environment.

A robot was written to act as a human visiting the different pages in the ODU USW demonstration environment. The robot was written rather than have a human repeatedly press the Preserve Me! button on different pages. It is possible to watch the USW graph grow using the Preserve Me! Visualizer

Preserve Me! Visualizer

and to this specific example.

Preserve Me! Visualizer with data

Things to look for and at in the Visualizer:

1. The "copy," "edit," and HTTP mailbox infrastructure components are represented by the three cyan colored icons in the center of the display.

2. Original USW REMs are in a concentric circle close to the infrastructure icons and are color coded. REMs from flickr have a magenta frame, those from RadioLab have a blue frame, and REMs from Gutenberg have a yellow frame.

3. Copy USW REMs are much further out from the center and have the same color as the domain they are hosted on, but the contents of the REMs are from their original domain.

4. Permanent connections between REMs (edges in the USW graph) are directional and colored white.

5. Activity between a REM and any of the infrastructure components are directional, red and transient.

6. If a REM is removed from the system, a red slash is drawn through it's icon.

7. Below the plotting area are VCR like controls, including speed controls, toggling the background between black and white, capturing an image and maximizing the display.

8. Placing the pointer over any of the icons will cause almost all other icons and edges to become hidden. The only things that will be visible is the icon under the pointer, permanent edges originating at that icon, and icons that are pointed to by the permanent edge.

9. Clicking on an icon will show explanatory information about the icon.

10. A REM will try to make preservation copies on domains, other than its own that it knows about.

Preserve Me! Viz replays a prerecorded JSON log of events. These events came from a scenario that the robot executed. Between the time the robot created the JSON log file and when you replay the visualization of the robot's actions, the USW graph created by the robot may no longer be in existence (caveat emptor).

The general events are:

1. REM #1 retrieves messages from its mailbox. (As indicated by the flashing red line from the REM to the HTTP mailbox icon.)

2. Based on the messages, REM #1 might execute HTTP patch directives (as indicated by the flashing red line from the REM to the edit icon), might create preservation copies of another REM (as indicated by the flashing red icon from the REM to the copy icon and the creation of preservation REM), or other actions.

3. REM #1 might inspect REM #2 to retrieve data from REM #2.

4. Based on that information, REM #1 might send HTTP patch directives or copy requests to REM #2.

A REM will never directly affect another REM. A REM will send requests and directives via the HTTP mailbox.

The replay file shows 17 webpages, across 4 domains creating preservation copies of themselves on domains different than the one where they were created. Josie originated on the flickr domain (at the 6 o'clock position and framed in magenta), preserves a copy on the Gutenberg domain (at the 1:30 position and framed in yellow), and made USW connections to a REM originating on the Gutenberg domain (at the 12:00 o'clock framed in yellow) and preservation copies on the flickr and RadioLab domains (framed in magenta and green respectively).

Things to watch for include during a replay of the example, or you can watch a video:

Event number (real time in seconds):
2 (1.825) Josie exists in the USW realm.

5 (6.175) USW infrastructure is complete and available.

6 - 9 (8.575 - 14.290) The first USW REM connection is made from flickr's Kittens to Gutenberg's Pride and Prejudice.

10 - 277 (663.476) Additional REMs are added to the system and make connections to Gutenberg's Pride and Prejudice.

278 - 326 (664.884 - 770.372) Gutenberg's Pride and Prejudice begins to read messages from its mailbox.

327 - 525 (771.436 - 1134.216) Gutenberg's Pride and Prejudice creates reciprocal REM connections to other REMs, creates preservation copies on the Gutenberg domain and sends messages back to requesting REMs.

526 (1135.699) A preservation copy of Josie is created on the Gutenberg domain.

527 - 1056 (2921.045) REMs continue to make preservation copies and permanent edges as directed by messages from the HTTP mailbox.

1057 (2922.324) The first REM on the RadioLab domain is lost. The next few events will show all REMs on the RadioLab domain as lost. These few events simulate the total loss of the domain either through closing the domain, terminating the domain's participation in the USW process, or disconnection of the domain from the Internet.

1093 - 1735 (2999.098 - 5398.655) The remaining REMs continue to process messages from their respective mailboxes until all messages have been processed and no more communications are needed or necessary. In effect, the USW system has reached a point of stability and does not have any growth or change opportunities.

Now Josie (my grandmother's sister) exists on two domains and given a larger benign environment, could spread to more places thereby increasing the likelihood of being around long after those that knew her have been forgotten.

On the picture's back: Josie McClure's Picture taken Feb. 30, 1907 at Poteau I. T. Fifteen years of age. When this was taken weighed 140 lbs.

--

Chuck Cartledge

↧

2013-11-2: WSDL NFL Power Rankings Week 9

November 3, 2013, 9:49 am

≫ Next: 2013-11-08: Proposals for Tighter Integration of the Past and Current Web

≪ Previous: 2013-10-23: Preserve Me! (... if you can, using Unsupervised Small-World graphs.)

We are halfway through the 2013 NFL season and it is time for our WSDL mid-season rankings. Both conferences have one winless team, Jacksonville in the AFC and Tampa Bay in the NFC. The NFC is looking rather lackluster this year with no standout teams so far. The NFC East teams in particular need to get their acts together. The AFC appears to be dominating the League with a number of teams that are performing quite well. Two teams that show up on the top of every power ranking list are the Denver Broncos and the Kansas City Chiefs.

Kansas City has a great defense, using our efficiency ratings they are rated as the fifth best defense in the league. However a good defense will only get you so far when your offense is ranked at 27th out of 32. Denver on the other hand has the highest ranked offense in our system with a lot of that on Peyton Mannings shoulders. A good passing offense correlates quite well with a team that wins games.

Here is where our ranking system rates each of the teams. The size of each circle is the rank of the team, larger circle, higher rank. The arrows are wins and point from the loser to the winner.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current years season. Each team is represented by a node in the graph. For every game played a directed edge is created from the loser pointing to the winner and it is weighted by the Margin of Victory.

In the Pagerank model each link from a webpage i to webpage j causes webpage i to give some of its own Pagerank to webpage j. This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week Denver beat the Redskins 45 to 21, in the graph a directed edge from the Redskins to Denver with a weight of 24 was created.

Our rankings differ from most of the others you will find. One of the reasons is that our algorithm takes the strength of schedule into account. Denver has the EASIEST strength of schedule out of the entire league and Kansas City is only a few teams above them. Kansas City is the most over-rated team in the league according to our ratings with the largest difference between our rating and the average of power rankings found on the Internet.

We are not saying that one or the other isn't going to go to the playoffs because neither one has more than one or two good opponents in the schedule, they will continue to win against mediocre opponents. Look for them to stumble against good teams and they are scheduled to play each other twice in the second half of the season.

Our rankings:

Indianapolis Colts         0.103362928
San Diego Chargers      0.076939541
Cincinnati Bengals       0.070915973
New England Patriots   0.059724726
Cleveland Browns        0.055835742
San Francisco 49ers      0.050768683
New Orleans Saints      0.047412718
Oakland Raiders           0.047354270
Denver Broncos            0.044543605
Seattle Seahawks        0.043283728
Miami Dolphins           0.041914789
Green Bay Packers       0.041591108
Kansas City Chiefs       0.034758260
Detroit Lions                0.027743700
Tennessee Titans          0.025216934
New York Jets               0.022644022
Chicago Bears               0.022401144
Arizona Cardinals         0.021622872
Houston Texans             0.021445405
Baltimore Ravens          0.019740968
Washington Redskins    0.017983031
Dallas Cowboys            0.014199985
Carolina Panthers          0.013691420
St. Louis Rams              0.012693121
Pittsburgh Steelers         0.010377841
Buffalo Bills                  0.009797045
Philadelphia Eagles       0.008701059
New York Giants           0.007876355
Atlanta Falcons              0.007223360
Minnesota Vikings         0.007014133
Jacksonville Jaguars      0.005610766
Tampa Bay Buccaneers 0.005610766

--Greg Szalkowski

↧

2013-11-08: Proposals for Tighter Integration of the Past and Current Web

November 8, 2013, 10:10 am

≫ Next: 2013-11-13: 2013 Archive-It Partner Meeting Trip Report

≪ Previous: 2013-11-2: WSDL NFL Power Rankings Week 9

The Memento Team is soliciting feedback on two white papers that address related proposals for more tightly integrating the past and current web.

The first is "Thoughts on Referencing, Linking, Reference Rot", which is inspired by the hiberlink project. This paper proposes making temporal semantics part of the HTML <a> element, via "versiondate" and "versionurl" attributes that respectively include the datetime the link was created and optionally a link to an archived version of the page (in case the live web version becomes 404, goes off topic, etc.). The idea is that "versiondate" can be used as a Memento-Datetime value by a client, and "versionurl" can be used to record a URI-M value. This approach is inspired by the Wikipedia Citation Template, which has many metadata fields, including "accessdate" and "archiveurl". For example, in the article about the band "Coil", one of the links to the source material is broken, but the Citation Template has values for both "accessdate" and "archiveurl":

Unfortunately, when this is transformed into HTML the semantics are lost or relegated to microformats:

A (simple) version with machine-actionable links suitable for the Memento Chrome extension or Zotero could have looked like this in the past, ready to activate when the link eventually went 404:

The second paper, "Memento Capabilities for Wikipedia", "describes added value that Memento can bring to Wikipedia and other MediaWiki platforms. One is enriching their external links with the recommendations from our first paper (described above), and the second is about native Memento support for wikis.

Native Memento support is possible via a new Memento Extension for MediaWiki servers that we announced for testing and feedback on the wikitech-l list. This new extension is the result of a significant re-engineering effort guided by feedback received from Wikipedia experts to a previous version. When installed, this extension allows clients to access the "history" portion of wikis in the same manner as they access web archives. For example, if you wanted to visit the Coil article as it existed on February 2, 2007 instead of wading through the many pages of the article's history, your client would use the Memento protocol to access a prior version with the "Accept-Datetime" request header:

and the server would eventually redirect you to:

In a future blog post we will describe how using a Memento-enabled wiki can be used to avoid spoilers on fan wikis (e.g., The Songs of Ice and Fire wiki) by setting the Accept-Datetime to be right before a episode or book is released.

We've only provided a summary of the content of the two papers and we invite everyone to review them and provide us with feedback (here, twitter, email, etc.).

--Michael & Herbert

↧

2013-11-13: 2013 Archive-It Partner Meeting Trip Report

November 13, 2013, 3:41 pm

≫ Next: 2013-11-19: REST, HATEOAS, and Follow Your Nose

≪ Previous: 2013-11-08: Proposals for Tighter Integration of the Past and Current Web

On November 12, I attended the 2013 Archive-It Partner Meeting in Salt Lake City, Utah, our research group's second year of attendance (see 2012 Trip Report). The meeting started off casually at 9am with breakfast and registration. Once everyone was settled, Kristine Hanna, the Director of Archiving Services at Internet Archive introduced her team that was present of the meeting. Kristine acknowledged the fire at Internet Archive last week and the extent of the damage. "It did burn to the ground but thankfully, nobody was injured." She reminded the crowd of partners to review Archive-It's storage and preservation policy and mentioned the redundancies in-place, including a soon-to-be mirror at our very own ODU. Kristine then mentioned news of a new partnership with Reed Technologies to jointly market and sell Archive-It (@archiveitorg). She reassured the audience that nothing would change beyond having more resources for them to accomplish their goals.

Kristine then briefly mentioned the upcoming release of Archive-It 5.0, which would be spoken about in-depth in a later presentation. She asked everyone in the room (of probably 50 or so attendees) to introduce themselves and to state their affiliated. With the intros out of the way, the presentations began.

Kate Legg of National Center for Atmospheric Research (NCAR) presented "First steps toward digital preservation at NCAR". She started by saying that NCAR is a federally funded research and development center (FFRDC) whose mission is to "preserve, maintain and make accessible records and materials that document the history and scientific contributions of NCAR". With over 70 collections and 1500 employees, digital preservation is on the organization's radar. Their plan, while they have a small library and staff, is to accomplish this along with other competing priorities.

"Few people were thinking about the archives for collecting current information", Kate described of some of the organization not understanding that preserving now will create archives for later. "The archive is not just where old where old stuff goes, but new stuff as well." One of the big obstacles for the archiving initiatives of the organizations has been funding. Even with this limitation, however, NCAR was still able to subscribe to Archive-It through a low level subscription. With this subscription, they started to preserve their Facebook group but increasingly found huge amounts of data, including videos, that they felt was too resource heavy to archive. The next step for the initiative is to add a place on the organization's webpage where archive content will be accessible to the public.

Jaime McCurry (@jaime_ann) of Folger Shakespeare Library followed Kate with "The Short and the Long of It: Web Archiving at the Folger Shakespeare Library". Jaime is currently participating in the National Digital Stewardship Residency where her goal is to establish local routines and best practices for archiving and preserving the library's born-digital content. They have two collection with over 6 millions documents (over 400 gigabytes of data) currently where the topic being collected is to preserve content on the web relating to the works of Shakespeare (particularly in social media and from festivals). In trying to describe the large extent of the available content, Jaime said, "In trying to archive Shakespeare's presence on the web, you really have to define what you're looking for. Shakespeare is everywhere!". She noted that one of the first things she realized when she first started on the project at Folger was that nobody knew that the organization was performing web archiving, so she wished to establish an organization-wide web archiving policy. One of the recent potential targets of her archiving project was the NYTimes' Hamlet contest wherein the newspaper suggested Instagram users create 15-second clips of their interpretation of a passage from the play. Because this related to Shakespeare, it would be an appropriate target for the Folger Shakespeare Library.

After Jaime finished, Sharon Farnel of University of Alberta began her presentation "Metadata workflows for web archiving – thinking and experimenting with ‘metadata in bulk’". In her presentation she referenced a project called Blacklight, an open source project that provides a discovery interface for any Solr index via a customizable, template-based user interface. In her collection, from the context of metadata, she wished to think about where and why discovery of content tasks place in web archiving. She utilized a mixed model wherein entries might have MARC records, Dublin Core data or both. Sharon emphasized that metadata was an important functionality of Archive-It. To better parse the data, her group created XSLT stylesheets to be able to export the data into a more interoperable format like Excel, which it could then be imported back into Blacklight after manipulation. She referenced some of the difficulties in working the the different technologies but said, "None of these tools were a perfect solution on their own but by combining the tools in-house, we can get good results with the metadata."

After a short break (to caffeinate), Abbie Grotke (@agrtoke) of Library of Congress remotely presented "NDSA Web Archiving Survey Update". In her voice-over presentation from DC, she gave preliminary results of the NDSA Web Archiving Survey, stating that the initiative of the NDIIP program had yielded about 50 respondents so far. For the most part, the biggest concern about web archiving reported by the survey participants was database preservation followed by social media and video archiving. She stated that the survey is still open and encouraged attendees to take it (Take it here).

Trevor Alvord of Brigham Young University was next with "A Muddy Leak: Using the ARL Code of Best Practices in Fair Use with Web Archiving". His efforts with the L. Tom Perry Special Collections at BYU was to build a thematic based collection based on Mormonism. He illustrated that many battles had been fought and won over digital preservation content rights (e.g., Perfect 10 vs. Google and Students vs. iParadigms), so his collection should be justified based on the premises in those cases. "Web archiving is routinely done by two wealthiest corporations (Google and Microsoft)", he quoted Jonathan Band, a recognized figure in the lawsuits versus Google. "In the last few months, libraries have prevailed.", Trevor said, "Even with our efforts, we have not received any complaints about their website being archived by libraries."

Trevor then went on to describe the problem with his data set, alluding to the Teton Dam flooding wherein millions of documents are being produced about Mormonism and now he is having to capture whatever he can. This is partially due to the lowering of the age allowed for missionaries and the Mormon church's encouragement for young Mormons to post online. He showed two examples of Mormon "mommy" bloggers Peace Love Lauren, a very small impact bloggers and NieNie Dialogs, a very popular blog. He asked the audience, "How do you prioritize what content to archive given popular content is more important but also more likely to be preserved?"

Following Trevor, Vinay Goel of Internet Archive presented "Web Archive Analysis". He started by saying that "Most Partners access Archive-It via the Wayback Machine." where other methods would be by using the Archive-It search service or downloading the archival contents. He spoke of de-duplication and how it is represented in WARCs via a revisit record. The core of his presentation spoke of the various WARC meta formats, Web Archive Transformation (WAT) files and CDX files, the format used for WARC indexing. "WAT files are WARC metadata records.", he said, "CDX files are space delimited text files that record where a file resides in a WARC and its offset." Vinay has come up with an analysis toolkit that would allow researchers to express question they want to ask about the archives in a high level language that would then be translated to a low level language understandable by an analysis system. "We can capture YouTube content", he said, giving an example use case, "but the content is difficult to replay." Some of the analysis information he displayed was identifying this non-replayable content in the archives and showing the in-degree and out-degree information of each resource. Further, his toolkit is useful in studying how this linking behavior changes over time.

The crowd then broke for lunch only to return to Scott Reed (@vector_ctrl) of Internet Archive presenting the new features that would be present in the next iteration of Archive-It, 5.0. The new system, among other things, allows users to create test crawls and is better at social media archiving. Some of the goals to be implemented in the system before the end of the year is to get the best capture and display the capture in currently existing tools. Scott mentioned an effort by Archive-It to utilize phantomjs (with which we're familiar at WS-DL through our experiments) through a feature they're calling "Ghost". Further, the new version promises to have an API. Along with Scott, Maria LaCalle spoke of a survey completed about the current version of Archive-It and Solomon Kidd spoke of work done on user interface refinements of the upcoming system.

Following Scott, the presentations continued with your author, Mat Kelly (@machawk1) presenting "Archive What I See Now".

After I finished my presentation, the final presentation of the day was by Debbie Kempe of The Frick Collection and Frick Art Reference Library with "Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art Resources". In her presentation, she stated that there was a broad overlap of art between the Brooklyn Museum, Museum of Modern Art, and the Frick Art Reference Library. Citing Abbie Grotke's survey from earlier, she reminded the audience that no museums responded to the survey, which is problematic for evaluating their archiving needs. "Not all information is digital in the art community", Debbie said. In initiating archiving effort, it wasn't so much clear to the museums' organizers as to why or how web archiving of their content should be done but rather, "Who will do it?" and "How will we pay for it?" She ran a small experiment in accomplishing the preservation tasks of the museum and is now subsequently running a longer "experiment", given more content is being create that is digital and less in print in their collections. In the longer trial, she hopes to test and formulate a sustainable workflow, including re-skilling and organizational changes.

After Debbie, the crowd was freed into a Birds of a Feather session to discuss issues about web archiving that interested each individual, to which I collected with a group about "Capture", given my various software projects relating to the topic. After the BoF session, Lori Donovan and Kristine Hanna adjourned the room to a following reception.

Overall, I felt the trip to Utah to meet with a group with a common interest was a unique experience that I don't get at other conferences where some of the audiences' focuses are disjoint from one another. The feedback I received on my research and the discussion I had with various attendees was extremely valuable in learning how the Archive-It community works and I hope to attend again next year.

— Mat (@machawk1)

↧

2013-11-19: REST, HATEOAS, and Follow Your Nose

November 19, 2013, 10:26 am

≫ Next: 2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

≪ Previous: 2013-11-13: 2013 Archive-It Partner Meeting Trip Report

This post is hardly timely, but I wanted to gather together some resources that I have been using for REST (Representational State Transfer) and HATEOAS (Hypermedia as the Engine of Application State). It seems like everyone claims to be RESTful, but mentioning HATEOAS is frequently met with silence. Of course, these terms come from Roy Fielding's PhD dissertation, but I won't claim that it is very readable (it is not the nature of dissertations to be readable...). Fortunately he's provided more readable blog posts about REST and HATEOAS. At the risk of aggressively over-simplifying things, REST = "URIs are nouns, not verbs" and HATEOAS = "follow your nose".

"Follow your nose" simply means that when a client dereferences a URI, the entity that is returned is responsible for providing a set of links that allows the user agent to transition to the next state. This standard procedure in HTML: you follow links to guide you through an online transaction (e.g., ordering a book from Amazon) all the time -- it is so obvious you don't even think about it. Unfortunately, we don't hold our APIs to the same level; non-human user agents are expected to make state transitions based on all kinds of out-of-band information encoded in the applications. When there is a change in the states and the transitions between them, there is no direct communication with the application and it simply breaks.

I won't dwell on REST because most APIs get the noun/verb thing right, but we seem to be losing ground on HATEOAS (and technically, HATEOAS is a constraint on REST and not something "in addition to REST", but I'm not here to be the REST purity police).

There are probably many good descriptions of HATEOAS and I apologize if I've left your favorite out, but these are the two that I use in my Web Server Design course (RESTful design isn't goal of the course, but more like a side benefit). Yes, you could read about a book about REST, but these two slide decks will get your there in minutes.

The first is from Jim Webber entitled "HATEOAS: The Confusing Bit from REST". There is a video of Jim presenting these slides as well as a white paper about it (note: the white paper is less correct than the slides when it comes to things like specific MIME types). He walks you through a simple but illustrative (HTTP) RESTful implementation of ordering coffee. If the user agent knows the MIME type "applications/vnd.restbucks+xml" (and the associated Link rel types), then it can follow the embedded links to transition from state-to-state. And if you don't know how to do the the right thing (tm) with this MIME type, you should stop what you're doing.

HATEOAS: The Confusing Bit from REST from elliando dias

It seems like the Twitter API is moving away from HATEOAS. Brian Mulloy has a nice blog post about this (from which I took the image at the top of this post). The picture nicely summarizes that from an HTML representation of a Tweet there are all manner of state transitions available, but the equivalent json representation is effectively a dead-end; the possible state transitions have to be in the mind of the programmer and in encoded in the application. Their API returns MIME types of "application/json" just like 1000 other APIs and it is up to your program to sort out the details. Twitter's 1.1 API, with things like removing support for RSS, is designed for lock-in and not abstract ideals like HATEOAS. Arguably all the json APIs, with their ad-hoc methods for including links and uniform MIME type, are a step away from HATEOAS (see the stackoverflow.com discussion).

The second presentation also address a pet peeve of mine: API deprecation (e.g., my infatuation with Topsy has tempered after they crashed all the links I had created -- grrr.). The presentation "WHY HATEOAS: A simple case study on the often ignored REST constraint" from Wayne Lee walks through a proper way to define your API with an eye to new versions, feature evolution, etc.

Why HATEOAS from Lee Wayne

Again, I'm sure there are many other quality REST and HATEOAS resources but I wanted to gather the couple that I had found useful into one place and not just have them buried in class notes. Apologies for being about five years late to the party.

--Michael

↧

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

November 20, 2013, 9:13 pm

≫ Next: 2013-11-21: 2013 Southeast Women in Computing Conference (SEWIC)

≪ Previous: 2013-11-19: REST, HATEOAS, and Follow Your Nose

.@Conservatives put speeches in Streisand's house: http://t.co/6aRiOsHwxO @UKWebArchive: http://t.co/BGD3tYavEx via @lljohnston @hhockx
— Michael L. Nelson (@phonedude_mln) November 13, 2013

Circulating the web last week the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below). Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving their archived versions of the pages as well (see Note 2 below of a discussion of robots.txt, as well as for an update about availability in the Internet Archive). But even though the Internet Archive allows site owners to redact pages from their archive, mementos of the pages likely exist in other archives. Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):

Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):

So it seems clear that this speech will not disappear down a memory hole. But how do you discover these copies in these archives? Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework. If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):

The nice thing about the multi-archive access of Memento is as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI.

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:

Internet Archive: crawl everything
Archive-It: collections defined by subscribers
UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
Archive.is: archives individual pages on user request

Download and install the Chrome extension and all of these archives and more will easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt. For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013). "We've always been at war with Eastasia."

Note 2: In the process of writing this blog, the Internet Archive is no longer blocking access to this speech (and presumably the others). Here is the raw HTTP of the speech being blocked (the key is the line with "X-Archive-Wayback-Runtime-Error:" line):

But access was restored sometime in the space of three hours before I could generate a screen shot:

Why was it restored? Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?). The 08:36:36 version of robots.txt has:

...

Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/

...

But the 18:10:19 version has:

...

Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/

...

These "Disallow" rules no longer match the URI of the original speech. I guess the Internet Archive cached the disallow rule and it just now expired one week later. See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:

curl -i http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx HTTP/1.1 200 OK Server: nginx/1.2.8 Date: Thu, 21 Nov 2013 04:54:59 GMT Content-Type: application/link-format Transfer-Encoding: chunked Connection: keep-alive  <http:>;rel="original" , <http:>;rel="memento"; datetime="Fri, 13 Nov 2009 14:41:56 GMT" , <http:>;rel="memento"; datetime="Sun, 15 Nov 2009 09:23:25 GMT" , <http:>;rel="memento"; datetime="Tue, 08 Dec 2009 02:48:37 GMT" , <http:>;rel="memento"; datetime="Thu, 11 Mar 2010 20:49:24 GMT" , <http:>;rel="memento"; datetime="Wed, 14 Apr 2010 19:54:15 GMT" , <http:>;rel="memento"; datetime="Fri, 30 Apr 2010 00:37:36 GMT" , <http:>;rel="memento"; datetime="Wed, 04 May 2011 02:24:06 GMT" , <http:>;rel="memento"; datetime="Wed, 22 Jun 2011 06:27:40 GMT" , <http:>;rel="memento"; datetime="Wed, 29 Jun 2011 02:53:46 GMT" , <http:>;rel="memento"; datetime="Wed, 26 Oct 2011 22:41:26 GMT" , <http:>;rel="memento"; datetime="Fri, 28 Oct 2011 18:41:31 GMT" , <http:>;rel="memento"; datetime="Sat, 26 Nov 2011 19:59:22 GMT" , <http:>;rel="memento"; datetime="Tue, 27 Dec 2011 21:37:40 GMT" , <http:>;rel="memento"; datetime="Fri, 27 Jan 2012 20:40:27 GMT" , <http:>;rel="memento"; datetime="Wed, 08 Feb 2012 20:00:18 GMT" , <http:>;rel="memento"; datetime="Fri, 02 Mar 2012 18:02:34 GMT" , <http:>;rel="memento"; datetime="Tue, 06 Mar 2012 04:16:05 GMT" , <http:>;rel="memento"; datetime="Wed, 14 Mar 2012 08:32:13 GMT" , <http:>;rel="memento"; datetime="Thu, 10 May 2012 13:52:48 GMT" , <http:>;rel="memento"; datetime="Sat, 14 Jul 2012 07:01:01 GMT" , <http:>;rel="memento"; datetime="Sat, 20 Oct 2012 12:30:16 GMT" , <http:>;rel="memento"; datetime="Sun, 21 Oct 2012 10:19:09 GMT" , <http:>;rel="memento"; datetime="Sun, 25 Nov 2012 07:47:32 GMT" , <http:>;rel="memento"; datetime="Wed, 02 Jan 2013 06:28:52 GMT" , <http:>;rel="memento"; datetime="Fri, 15 Mar 2013 13:49:17 GMT" , <http:>;rel="memento"; datetime="Thu, 21 Mar 2013 20:14:57 GMT" , <http:>;rel="memento"; datetime="Mon, 25 Mar 2013 01:29:16 GMT" , <http:>;rel="memento"; datetime="Tue, 02 Apr 2013 09:39:38 GMT" , <http:>;rel="memento"; datetime="Mon, 08 Apr 2013 04:24:02 GMT" , <http:>;rel="memento"; datetime="Tue, 27 Aug 2013 09:45:25 GMT"  , <http:>;rel="timegate"  , <http:>;rel="self"; type="application/link-format"; from ="Fri, 13 Nov 2009 14:41:56 GMT";until="Tue, 27 Aug 2013 09:45:25 GMT" </http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:></http:>

Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate. Instead of returning a 404 page at the original URI (like Apache), it 302 redirects to another URI that returns the 404:

See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.

↧

2013-11-21: 2013 Southeast Women in Computing Conference (SEWIC)

November 21, 2013, 10:13 am

≫ Next: 2013-11-28: Replaying the SOPA Protest

≪ Previous: 2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

Last weekend (Nov 14-17), I was honored to give a keynote at the Southeast Women in Computing Conference (SEWICC), located at the beautiful Lake Guntersville State Park in north Alabama. The conference was organized by Martha Kosa and Ambareen Siraj (Tennessee Tech University), and Jennifer Whitlow (Georgia Tech).

Videos from the keynotes and pictures from the weekend will soon be posted on the conference website.

The 220+ attendees included faculty, graduate students, undergraduates, and even some high school students (and even some men!).

On Friday night, Tracy Camp from the Colorado School of Mines presented the first keynote, "What I Know Now... That I Wish I Knew Then". It was a great kickoff to the conference and provided a wealth of information on (1) the importance of mentoring, networking, and persevering, (2) tips on negotiating and time management, and (3) advice on dealing with the Impostor Syndrome.

During her talk, she pointed out that women's participation != women's interest. She had some statistics showing that in 1970 the percentages of women in law school (5%), business school (4%), medical school (8%), and high school sports (4%) were very low. Then she contrasted that with data from 2005: law school (48%), business school (45%), medical school (50%), and high school sports (42%). The goal was to counter the frequent comment that "Oh, women aren't in computing and technology because they're just not interested."

She also listed qualities that might indicate that you have the impostor syndrome. From my discussions with friends and colleagues, it's very common among women in computing and technology. (I've heard that there a few men who suffer from this, too!) Here's the list:

Do you secretly worry that others will find out that you're not as bright/capable as they think you are?
Do you attribute your success to being a fluke or "no big deal"?
Do you hate making a mistake, being less than fully prepared, or not doing things perfectly?
Do you believe that others are smarter and more capable than you are?

Saturday morning began with my keynote, "Telling Stories with Web Archives".

Telling Stories with Web Archives from Michele Weigle

I talked a bit about web archiving in general and then described Yasmin AlNoamany's PhD work on using the archives for storytelling. The great part for me was to be able to introduce the Internet Archive and the Wayback Machine to lots of people. I got several comments from both students and faculty afterwards with ideas of how they would incorporate the Wayback Machine in their work and studies.

After my talk, I attend a session on Education. J.K. Sherrod and Zach Guzman from Pellissippi State Community College in Knoxville, TN presented "Using the Raspberry Pi in Education". They had been teaching cluster computing using Mac Minis, but it was getting expensive, so they started purchasing Raspberry Pi devices (~$35) for their classes. The results were impressive. Since the devices run a full version of Linux, they even were able to implement a Beowulf Cluster.

I followed this up by attending a panel "Being a Woman in Technology: What Does it Mean to Us?" The panelists and audience discussed both positive connotations and challenges to being a woman in technology. This produced some amazing stories, including one student who related being told by a professor that she was no good at math and was "a rotten mango".

You're not a rotten mango #sewic2013 pic.twitter.com/8i7OTBttDY
— Shannon Wood (@Shannonanagains) November 16, 2013

After lunch, several students presented 5 minute lightning talks on strategies for success in school and life. It was great to see so many students excited to share their experiences and lessons learned.

The final keynote was given on Saturday night by Valentina Salapura, from IBM TJ Watson on "Cloud Computing 24/7". After telling her story and things she learned along the way (and including a snapshot from the Wayback Machine of her former academic webpage), she described the motivation and promise of cloud computing.

Sunday was the last day, and I attend a talk by Ruthe Farmer, Director of Strategic Initiatives, NCWIT on "Research and Opportunities for Women in Technology". The National Center for Women & Information Technology was started in 2004 and is a non-profit that partners with corporations, academic institutions, and other agencies with the goal of increasing women's participation in technology and computing. One of their slogans is "50:50 by 2020". There's a wealth of information and resources available on the NCWIT website (including the NCWIT academic alliance and Aspirations in Computing program).

Ruthe described the stereotype threat that affects both women and men. This is the phenomena where awareness of negative stereotypes associated with a peer group can inhibit performance. She described a study where a group of white men from Stanford were given a math test. Before the test, one set of students were reminded of the stereotype that Asian students outperform Caucasian students in math, and the other set was not reminded of this stereotype. The stereotype threatened test takers performed worse than the control set.

Before we left on Sunday, I had the opportunity to sit in the red chair.
Sit With Me is a promotion by NCWIT to recognize the role of women in computing. "We sit to recognize the value of women's technical contributions. We sit to embrace women's important perspectives and increase their participation."

All in all, it was a great weekend. I drank lots of sweet tea, heard great southern accents that reminded me of home (Louisiana), and met amazing women from around the southeast, including students and faculty from Trinity University (San Antonio, TX), Austin Peay University, Georgia Tech, James Madison University (Virginia), Tennessee Tech, Pellissippi State Community College (Knoxville, TN), Murray State University (Kentucky), NC State, Rhodes College (Memphis, TN), Univ of Georgia, Univ of Tennessee, and the Girls Preparatory School (Chattanooga, TN).

There are plans for another SEWIC Conference in 2015.

It's closing time. Teardrop. Great conference! #SEWIC2013 pic.twitter.com/Uu6X4Trj14
— Ben Henderson (@ben_henderson) November 17, 2013

-Michele

↧

2013-11-28: Replaying the SOPA Protest

November 28, 2013, 4:45 am

≫ Next: 2013-12-13: Hiberlink Presentation at CNI Fall 2013

≪ Previous: 2013-11-21: 2013 Southeast Women in Computing Conference (SEWIC)

In an attempt to limit online piracy and theft of intellectual property, the U.S. Government proposed the Stop Online Privacy Act (SOPA). This act was widely unpopular. On January 18th, 2012, many prevalent websites (e.g., XKCD) organized a world-wide blackout of their websites in protest of SOPA.

While the attempted passing of SOPA may end up being a mere footnote in history, the overwhelming protest in response is significant. This event is an important observance and should be archived in our Web archives. However, some methods of implementing the protest (such as JavaScript and Ajax) made the resulting representations unarchiveable by archival services at the time. As a case study, we will examine the Washington, D.C. Craigslist site and the English Wikipedia page. All screenshots of the live protests were taken during the protest on January 18th, 2012. The screenshots of the mementos were taken on November 27th, 2013.

Screenshot of the live Craigslist SOPA Protest from January 18th, 2012.

Craigslist put up a blackout page that would only provide access to the site through a link that appears after a timeout. In order to preserve the SOPA splash page on the Craigslist site, we submitted the URI-R for the Washington D.C. Craigslist page to WebCite for archiving producing a memento for the SOPA screen:

http://webcitation.org/query?id=1326900022520273

At the bottom of the SOPA splash page, JavaScript counts down from 10 to 1 and then provides a link to enter the site. The countdown operates properly in the memento, providing an accurate capture of the resource as it existed on January 18th, 2012.

Screenshot of the Craigslist protest memento in WebCite.

The countdown on the page is created with JavaScript that is included in the HTML:

The countdown behavior is archived along with the page content because the JavaScript script creating the countdown is captured with the content and is available when the onload event fires on the client and subsequent startCountDown code is executed. However, the link that appears at the bottom of the screen dereferences to the live version of Craigslist. Notice that the live Craigslist page has no reference to the SOPA protest. Since WebCite is a page-at-a-time archival service, it only archives the initial representation and all embedded resources, meaning the the linked Craigslist page is missed during archiving.

Screenshot of the Craigslist homepage linked from the
protest splash page. This is also the live version of the
homepage.

The Heritrix crawler archived the Craigslist page on January 18th, 2012. The Internet Archive contains a memento for the protest:

Screenshot of the Craigslist protest splash page in the Wayback Machine.

http://web.archive.org/web/20120118050348/http://washingtondc.craigslist.org/

as does Archive-It:

http://wayback.archive-it.org/all/20120119183432/http://washingtondc.craigslist.org/

The Internet Archive memento has the same splash page and countdown as the WebCite memento that was captured with the Heritrix crawler. The link on the Internet Archive memento leads to a memento of the Craigslist page rather than the live version albeit with archival timestamps one day and 13 hours, 30 minutes, and 44 seconds apart (2012-01-19 18:34:32 vs 2012-01-18 05:03:48):

Screenshot of the Craigslist homepage memento, linked from the
protest splash screen.

http://web.archive.org/web/20120120201008/http://washingtondc.craigslist.org/h

The Internet Archive converts embedded links to be relative to the archive rather than target the live web. Since Heritrix also crawled the linked page, the embedded link dereferences to the proper memento with a note embedded in the HTML protesting SOPA.

The Craigslist protest was readily archived by WebCite, Archive-It, and the Internet Archive. Policies within each archival institution impacted how the Craigslist homepage (past the protest splash screen) is referenced and accessed by archive users. This differs from the Wikipedia protest, which was not readily archived.

Screenshot of the live Wikipedia SOPA Protest.

Wikipedia displayed a splash screen protesting SOPA blocking access to all content on the site. A version of which is still available live on Wikipedia as of November 27th, 2013:

The live version of the Wikipedia SOPA Protest.

http://en.wikipedia.org/?banner=blackout

On January 18th, 2012, we submitted the page to WebCite to produce a memento that did not captured the splash page. Instead the memento has only the representation hidden by the splash page.

A screenshot of the WebCite memento of the Wikipedia SOPA Protest.

http://webcitation.org/query?id=1326888962288259

The mementos captured by Heritrix and presented through the Internet Archive's Wayback Machine and Archive-It are also missing the SOPA splash page.

A screenshot of the Internet Archive memento of the
Wikipedia SOPA Protest.

http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page

A screenshot of the Archive-It memento of the
Wikipedia SOPA Protest.

http://wayback.archive-it.org/all/20120118184432/http://en.wikipedia.org/wiki/Main_Page

To investigate the cause of the missing splash page further, we requested that WebCite archive the current version of the Wikipedia blackout page on November 27th, 2013. The new memento does not capture the splash page, either:

A screenshot of the WebCite memento of the current
Wikipedia blackout page.

http://www.webcitation.org/6LRY6XyXC

Heritrix has also created a memento for the current blackout page on August 24th, 2013. This memento suffers the same problem as the aforementioned mementos and does not capture the splash page:

A screenshot of the Internet Archive memento of the current
Wikipedia blackout page.

http://web.archive.org/web/20130824022954/http://en.wikipedia.org/?banner=blackout

When looking through the client-side DOM of the Wikipedia mementos we reference, there is no mention of the splash page protesting SOPA. This means the page was loaded from either Cascading Style Sheets (CSS) or JavaScript. Since clicking the browser's "Stop" button prevents the splash page from appearing, we hypothesize (and show) that JavaScript is responsible for loading the splash page. JavaScript loads the image needed for the splash page as a result of a client-side event. Since the tools have no way of executing the event, the tools have no way of knowing to archive the image.

When we load the live blackout resource, we see that there are several files loaded by Wikimedia. Some of the JavaScript files return a 403 Forbidden response since they are blocked by the Wikipedia Robots.txt file:

Google Chrome's developer console showing the resources requested
by http://web.archive.org/web/20130824022954/http://en.wikipedia.org/?banner=blackout
and their associated response codes.

Specifically, the Robots.txt file preventing these resources from being archived is:

http://bits.wikimedia.org/robots.txt

The Robots.txt file is archived, as well:

http://web.archive.org/web/*/http://bits.wikimedia.org/robots.txt

We will look at one specific HTTP request for a JavaScript file:

This JavaScript file contains code defining a function that adds CSS to the page, overlaying an image as a splash page and overlays the associated text on the image (I have added the line breaks for readability):

Without execution of the insertBanner function, the archival tools will not know to archive the image of the splash page (WP_SOPA_Splash_Full.jpg) or the overlayed text. In this example, Wikimedia is constructing the URI of the image and using Ajax to request the resource:

The blackout image is available in the Internet Archive, but the mementos in the Wayback Machine do not attempt to load it:

http://web.archive.org/web/20120118165255/http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg

Without execution of the client-side JavaScript and subsequent capture of the splash screen, the SOPA blackout protest is not seen by the archival service.

We have presented two different uses of JavaScript by two different web sites and its impact on the archivability of their SOPA protests. The Craigslist mementos provide representations of the SOPA protest, although the archives may be missing associated content due to policy differences and intended use. The Wikipedia mementos do not provide a representation of the protest. While the constituent parts of the Wikipedia protest are not entirely lost, they are not properly reconstituted, making the representation unarchivable with the tools available on January 18th, 2012 and November 27th, 2013.

We have previously demonstrated that JavaScript in mementos can cause strange things to happen. This is another example of how technologies that normally improve a user's browsing experience can actually be more difficult, if not impossible, to archive.

--Justin F. Brunelle

↧

2013-12-13: Hiberlink Presentation at CNI Fall 2013

December 13, 2013, 7:38 am

≫ Next: 2013-12-18: Avoiding Spoilers with the Memento Mediawiki Extension

≪ Previous: 2013-11-28: Replaying the SOPA Protest

Herbert and Martin attended the recent Fall 2013 CNI meeting in Washington DC, where they gave an update about the Hiberlink Project (joint with the University of Edinburgh), which is about preserving the referential integrity of the scholarly record. In other words, we link to the general web in our technical publications (and not just other scholarly material) and of course the links rot over time. But the scholarly publication environment does give us several hooks to help us access web archives to uncover the correct material.

As always, there are many slides but they are worth the time to study them. Of particular importance are slides 8--18, which helps differentiate Hiberlink from other projects, and slides 66-99 which walk through a demonstration of the "Missing Link" concepts (along with the Memento for Chrome extension) can be used to address the problem of link rot. In particular, absent specific versiondate attributes on a link, such as:

<a versiondate="some-date-value" href="...">

A temporal context can be inferred from the "datePublished" META value defined by schema.org:

<META itemprop="datePublished" content="some-ISO-8601-date-value">

Hiberlink: Investigating Reference Rot, December 2013 from Herbert Van de Sompel

Again, the slides are well-worth your time.

--Michael

↧

2013-12-18: Avoiding Spoilers with the Memento Mediawiki Extension

December 18, 2013, 6:56 pm

≫ Next: 2013-12-19: 404 - Your interview has been depublished

≪ Previous: 2013-12-13: Hiberlink Presentation at CNI Fall 2013

From Modern Family to the Girl with the Dragon Tatoo, fans have created a flood of fan-based wikis based on their favorite television, book, and movie series. This dedication to fiction has allowed fans to settle disputes and encourage discussion using these resources.

These resources, coupled with the rise in experiencing fiction long after it is initially released, has given rise to another cultural phenomenon: spoilers. Using a fan-based resource is wonderful for those who are current with their reading/watching, but is fraught with disaster for those who want to experience the great reveals and have not caught up yet.

Memento can help here.

Above is a video showing how the Memento Chrome Extension from Los Alamos National Laboratory (LANL) can be used to avoid spoilers while browsing for information on Downtown Abbey. This wiki is of particular interest because the TV show is released in the United Kingdom long before it is released in other countries. The wiki has a nice sign warning all visitors about impending spoilers should they read the pages within, but the warning is redundant, seeing as fans who have not caught up will know that spoilers are implied.

A screenshot of the page with this notice is shown below.

We can use Memento to view pre-spoiler versions.

To avoid spoilers for Downtown Abbey Series 4, we choose a date prior to its release: August 30, 2012. Then we use LANL's Memento Chrome Extension to browse to that date. The HTTP conversation for this exchange is captured using Google Chrome's Live HTTP Headers Extension and detailed in the steps below.

1. The Chrome Memento Extension sends a HEAD request to the site using Memento's Accept-Datetime header*.

HEAD /wiki/Downton_Abbey_Wiki HTTP/1.1 Host: downtonabbey.wikia.com Accept: */* Accept-Datetime: Thu, 30 Aug 2012 16:55:16 GMT Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8 Cookie: wikia_beacon_id=eMUofNNFos; Geo={%22city%22:%22FIXME%22%2C%22country%22:%22US%22%2C%22continent%22:%22NA%22}; __qca=P0-2109602207-1386434547402; __utma=251085184.467733389.1386434548.1386434548.1386434548.1; __utmb=251085184.0.10.1386434548; __utmc=251085184; __utmz=251085184.1386434548.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=73223046.522101053.1386434548.1386434548.1386434548.1; __utmb=73223046.0.10.1386434548; __utmc=73223046; __utmz=73223046.1386434548.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); OAID=965cffa6e646896cf2dae0a0da51f55d; __gads=ID=d998f75908c8191b:T=1386434550:S=ALNI_MZ7eDrJdZMgw5dYUPEan_ONPzkg0Q; fm_tag_0_called=1386520954027; PD_ExpansionCap_TimesPerDay.6108967=1386434556157,1; LTS=6_135l0r1m702%2C6_1868l0r1m702%2C6_1162l0r1m702%2C6_2468l1%2C6_805l1%2C6_983l1%2C6_1494l1%2C6_2491l1; varnish-stat=/server/ASH/cache-v37-ASH/(null)/; loadtime=S1386412259.042873621,VS0,VS37,VE38,VE22350429 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36  HTTP/1.1 200 OK Accept-Ranges: bytes Cache-Control: private, s-maxage=0, max-age=0, must-revalidate Connection: keep-alive Content-Encoding: gzip Content-language: en Content-Length: 27103 Content-Type: text/html; charset=utf-8 Date: Sat, 07 Dec 2013 16:43:29 GMT Last-Modified: Sat, 07 Dec 2013 10:02:17 GMT Server: Apache Set-Cookie: varnish-stat=/server/ASH/cache-v37-ASH/(null)/; path=/ Set-Cookie: loadtime=S1386412259.042873621,VS0,VS37,VE38,VE22350451; path=/ Vary: Accept-Encoding,Cookie X-Age: 22350 X-Cache: HIT, HIT X-Cache-Hits: 4, 5 X-Cacheable: YES X-Content-Type-Options: nosniff X-Served-By: cache-s24-SJC, cache-v37-ASH X-Timer: S1386412259.042873621,VS0,VS37,VE38,VE22350451

2. Because there are no Memento headers in the response, it connects to LANL's Memento aggregator using a GET request with the same Accept-Datetime header and gets back a 302 redirection response.

3. Then it follows the URI from the Location response header to a TimeGate specifically set up for Wikia, making another GET request using the Accept-Datetime request header on that URI. The TimeGate uses the date given by Accept-Datetime to determine which revision of a page to retrieve. The URI for this revision is sent back in the Location response header as part of the 302 redirection response.

GET /wikia/timegate/http://downtonabbey.wikia.com/wiki/Downton_Abbey_Wiki HTTP/1.1 Host: mementoproxy.lanl.gov Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 Accept-Datetime: Thu, 30 Aug 2012 16:55:16 GMT Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36  HTTP/1.1 302 Found Connection: keep-alive Content-Length: 0 Content-Type: text/plain; charset=UTF-8 Date: Sat, 07 Dec 2013 16:43:30 GMT Link: <http:>;rel="original",<http:>;rel="timemap";type="application/link-format",<http:>;datetime="Tue, 05 Oct 2010 02:44:57 GMT";rel="first memento",<http:>;datetime="Tue, 03 Dec 2013 06:25:41 GMT";rel="last memento",<http:>;datetime="Wed, 01 Aug 2012 19:21:56 GMT";rel="prev memento",<http:>;datetime="Thu, 06 Sep 2012 03:04:22 GMT";rel="next memento",<http:>;datetime="Fri, 31 Aug 2012 23:16:51 GMT";rel="memento" Location: http://downtonabbey.wikia.com/index.php?oldid=11477 Server: nginx/1.2.8 Vary: negotiate,accept-datetime </http:></http:></http:></http:></http:></http:></http:>

4. From here it performs a final GET request on the URI specified in the Location response header, which is the revision of the article closest to the date requested. A screenshot of that revision is shown below, without the spoiler warning.

Even though this method works, it is not optimal.

The external Memento aggregator must know about the site and provide a site-specific TimeGate. In this case, the aggregator is merely looking for the presence of "wikia.com" in the URI and redirecting to the appropriate TimeGate in step 3. Behind the scenes, the Mediawiki API is used to acquire the list of past revisions and the TimeGate selects the best one in step 4. This requires LANL, or another Memento participant like the UK National Archives, to provide a TimeGate for all possible Wiki sites on the Internet, which is not possible.

To see where this is relevant, let's look at the fan site A Wiki of Ice and Fire, detailing information on the series A Song of Ice and Fire (aka Game of Thrones). LANL has no Memento TimeGate specifically for this real fan wiki, unlike what we saw with the Downtown Abbey site.

Here's a screenshot of the starting page for this demonstration. Let's assume we want to avoid spoilers from the book A Dance With Dragons, which was released in July 2011, so we choose the date of June 30, 2011.

1. The Chrome Memento Extension connects with an Accept-Datetime request header, hoping for a response with Memento headers.

HEAD /index.php/Kevan_Lannister HTTP/1.1 Host: awoiaf.westeros.org Accept: */* Accept-Datetime: Wed, 29 Jun 2011 16:55:16 GMT Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8 Cookie: __cfduid=d637a57abb101a95207dd1cae6eb7e1561385421276735; __utma=199493950.1531075210.1385421278.1387324291.1387326684.5; __utmb=199493950.1.10.1387326686; __utmc=199493950; __utmz=199493950.1387324293.4.4.utmcsr=ws-dl-05.cs.odu.edu|utmccn=(referral)|utmcmd=referral|utmcct=/demo/index.php/Kevan_Lannister; __atuvc=2%7C48%2C25%7C49%2C0%7C50%2C6%7C51 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36  HTTP/1.1 200 OK Cache-Control: private, must-revalidate, max-age=0 CF-RAY: de7c07a21260099 Connection: keep-alive Content-Encoding: gzip Content-Language: en Content-Type: text/html; charset=utf-8 Date: Wed, 18 Dec 2013 00:31:43 GMT Expires: Thu, 01 Jan 1970 00:00:00 GMT Last-Modified: Wed, 27 Nov 2013 23:45:37 GMT Server: cloudflare-nginx Vary: Accept-Encoding, Cookie X-Cache: MISS

2. Because there were no Memento headers in the response, it turns to the Memento Aggregator at LANL, which serves as the TimeGate, using the datetime given by the Accept-Datetime request header to find the closest version of the page to the requested date. The TimeGate then provides a Location response header containing the archived version of the page at the Internet Archive.

GET /aggr/timegate/http://awoiaf.westeros.org/index.php/Kevan_Lannister HTTP/1.1 Host: mementoproxy.lanl.gov Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 Accept-Datetime: Wed, 29 Jun 2011 16:55:16 GMT Accept-Encoding: gzip,deflate,sdch Accept-Language: en-US,en;q=0.8 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36  HTTP/1.1 302 Moved Temporarily Connection: keep-alive Content-Length: 0 Content-Type: text/plain; charset=iso-8859-1 Date: Wed, 18 Dec 2013 00:31:44 GMT Link: <http:>;rel="original" ,<http:>;rel="timemap"; type="application/link-format",<http:>;rel="timegate",<http:>;rel="memento"; datetime="Wed, 27 Apr 2011 14:53:59 GMT",<http:>;rel="memento first"; datetime="Wed, 23 May 2007 03:27:34 GMT",<http:>;rel="memento last"; datetime="Thu, 15 Aug 2013 18:23:42 GMT" Location: http://web.archive.org/web/20110427145359/http://awoiaf.westeros.org/index.php/Kevan_Lannister Server: nginx/1.2.8 Vary: Accept-Datetime </http:></http:></http:></http:></http:></http:>

3. Using the URI from that Location response header, the page is then retrieved directly from the Internet Archive.

But this page has a date of 27 Apr 2011, which is missing information we want, like who played this character in the TV series, which was added to the 7 June 2011 revision of the page. This is because the Internet Archive only contains two revisions around our requested datetime: 27 Apr 2011 and 1 Aug 2011. Even though the fan wiki contains the 7 June 2011 revision, the Internet Archive does not.

Fortunately, there is the nativeMemento Mediawiki Extension, supported by the Andrew Mellon Foundation, which addresses these issues. It has been developed jointly by Old Dominion University and LANL. Mediawiki was chosen because it is the most widely used Wiki software, used in sites such as Wikipedia and Wikia.

This native extension allows direct access to all revisions of a given page, avoiding spoilers. It can also return the data directly, requiring no Memento aggregators or other additional external infrastructure.

We set up a demonstration wiki using data from the same Game of Thrones fan wiki above. The video above shows this extension in action. Because our demonstration wiki has the native extension installed, it allows for access to all revisions of each article.

We will try the same scenario using this Memento-enabled wiki.

Here is a screenshot of the starting page for this demonstration.

In this case, because the Memento Mediawiki Extension has full Memento support, the HTTP messages sent are different. We again use the date June 30, 2011 to show that we can acquire information about a given article without revealing any spoilers from the book A Dance With Dragons, which was released on July 2011.

1. The Memento Chrome Extension sends an Accept-Datetime request header, but this time Mediawiki itself is serving as the TimeGate, deciding on the page closest to, but not over, the date requested. Mediawiki then issues its own 302 redirection response.

2. That response gives a Location response header pointing to the correct revision of the page, which was published on June 7, 2011, prior to the release of A Dance With Dragons. From here the Memento Chrome Extension can issue a GET request on that URI to retrieve the correct representation of the page.

As this demonstrates, running the Memento Mediawiki Extension on a fan wiki will ensure that site visitors can not only browse the site spoiler free, but also will get the date closest, but not over, their requested date. This way they avoid spoilers and don't miss any information.

To recap, the native extension gives us the following benefits:

The Memento Infrastructure cannot know about all possible wikis and provide TimeGates for each one, so the chances of a wiki having one are low.
The Internet Archive does not have all revisions of each fan wiki page, meaning that visitors to a fan wiki may miss out on information.
Visitors to the fan wiki site who are trying to avoid spoilers don't need to worry about any issues with the Memento wiki TimeGate infrastructure. Changes to a wiki's API can threaten the whole process, and APIs change frequently while Memento is established by a more stable RFC.

If you are running a fan wiki and want to help your visitors avoid spoilers, the Memento Mediawiki Extension is what you need. Please contact us and we'll help you customize it to your needs, if necessary.

--Shawn

* = Memento for Chrome version 0.1.11 actually performs two HEAD requests on the resource, but this will be fixed in the next release.

↧

2013-12-19: 404 - Your interview has been depublished

December 19, 2013, 2:39 pm

≫ Next: 2014-01-06: Review of WS-DL's 2013

≪ Previous: 2013-12-18: Avoiding Spoilers with the Memento Mediawiki Extension

Early November 2013 I gave an invited presentation at the EcoCom conference (picture left) and at the Spreeeforum, an informal gathering of researchers to facilitate knowledge exchange and foster collaborations. EcoCom was organized by Prof. Dr. Michael Herzog and his SPiRIT team and the Spreeforum was hosted by Prof. Dr. Jürgen Sieck who leads the INKA research group. Both events were supported by the Alcatel-Lucent Stiftung for Communications research. In my talks I gave a high-level overview of the state of the art in web archiving, outlined the benefits of the Memento protocol, pointed at issues and challenges web archives face today, and gave a demonstration of the Memento for Chrome extension.

Following the talk at the Spreeforum I was asked to give an interview for the German radio station Inforadio (you may think of it as Germany's NPR). The piece was aired on Monday, November 18th at 7.30am CET. As I had left Germany already I was not able to listen to it live but was happy to find the corresponding article online that basically contained the transcript of the aired report and an audio file was embedded in the document. I immediately bookmarked the page.

A couple of weeks later I revisited the article at its original URI only to find it was no longer available (screenshot left). Now, we all know that the web is dynamic and hence links break and even we have seen odd dynamics at other media companies before but in this case, as I was about to find out, it was higher powers that caused the detrimental effect. Inforadio is a public radio station and therefore, as many others in Germany and throughout Europe, to a large extent financed by the public (as of 2013 the broadcast receiving license is 17.98 Euros (almost USD 25) per month per household). As such they are subject to the "Rundfunkstaatsvertrag", which is a contract between the German states to regulate broadcasting rights. The 12th amendment to this contract from 2009 mandates that most online content must be removed after 7 days of publication. Huh? Yeah, I know, it sounds like a very bad joke but it is not. It even lead to coining the term "depublish" - a paradox by itself. I had considered public radio stations as "memory organizations", in league with libraries, museums, etc. How wrong was I and how ironic is this, given my talk's topic!? For what it's worth though, the content does not have to be deleted from the repository but it has to be taken offline.

I can only speculate about the reasons for this mandate but to me believable opinions circulate indicating that private broadcasters and news publishers complained about unfair competition. In this sense, the claim was made that "eternal" availability of broadcasted content on the web is unfair competition as the private sector is not given the appropriate funds to match that competitive advantage. Another point that supposedly was made is that this online service goes beyond the mandate of public radio stations and hence would constitute a misguided use of public money. To me personally, none of this makes any sense. Broadcasters of all sorts have realized that content (text, audio, and video) is increasingly consumed online and hence are adjusting their offerings. How this can be seen as unfair competition is unclear to me.

But back to my interview. Clearly, one can argue (or not) whether the document is worth preserving but my point here is a different one:
Not only did I bookmark the page when I saw it, I also immediately tried to push it into as many web archives as I could. I tried the Internet Archive's new "save page now" service but, to add insult to injury, Inforadio also has a robots.txt file in place that prohibits the IA from crawling the page. To the best of my knowledge this is not part of the 12th amendment to the "Rundfunkstaatsvertrag" so the broadcaster could actually take action to preserve their online content. Other web sites of public radio and TV stations such as Deutschlandfunk or ZDF do not prohibit archives from crawling their pages.

Fortunately, the archiving service Archive.is was able to grab the page (screenshot left) but the audio feed is lost.

Just one more thing (Peter Falk style):
Note that the original URI of the page:

http://www.inforadio.de/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

when requested in a web browser redirects (200-style) to:

http://www.inforadio.de/error/404.html?/rbb/inf/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

The good news here: it is not a soft 404 so the error is somewhat robot friendly. The bad news is that the original URI is thrown away. As the original URI is the only key for a search in web archives, we can not retrieve any archived copies (such as the one I created in Archive.is) without it. Unfortunately, this is not only true for manual searches but it also undermines automatic retrieval of archives copies by clients such as the browser extension Memento for Chrome. As stressed in our recent talk at CNI this is very bad practice and unnecessarily makes life harder for those interested in obtaining archived copies of web pages at large, not only my radio interview.

--
Martin

↧

2014-01-06: Review of WS-DL's 2013

January 6, 2014, 6:28 pm

≫ Next: 2014-01-07: Two WS-DL Classes Offered for Spring 2014

≪ Previous: 2013-12-19: 404 - Your interview has been depublished

The Web Science and Digital Libraries Research Group had a really good year in 2013. There was a burst of student progress in 2012 and there will likely be more in 2014, but in 2013:

Mat Kelly passed his breadth exam
Sawood Alam finished his MS thesis, entered the PhD program, and passed the breadth exam

We (along with many co-authors at other institutions) were involved with 22 publications, including new venues for our group as well as very strong showings at traditional ones such as JCDL (pictured above) and TPDL:

one Internet RFC (Memento)
one beta specification (ResourceSync)
one poster at IEEE VIS
one paper at the WWW developer track
two papers at the Temporal Web Workshop
two D-Lib Magazine articles
two technical reports
four full papers and three posters at JCDL
five full papers at TPDL

Two of the papers received recognition: Scott Ainsworth's paper "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive" was one of three nominees for Best Student Paper at JCDL, and Yasmin Anwar's paper "Who and What Links to the Internet Archive" won Best Student Paper at TPDL. Congratulations to Scott and Yasmin, who are the first from our group to receive these honors.

In addition to the conference paper presentations, we presented at a number of conferences that do not have formal proceedings:

Yasmin received two travel grants to present posters at: Grad Cohort Workshop (CRA-W), and Grace Hopper Celebration of Women in Computing (GHC)
Mat presented his software at the Archive-It Partner's Meeting, Digital Preservation, and Personal Digital Archiving
Ahmed AlSum and I (Michael Nelson) had several presentations at the IIPC General Assembly (our department also officially joined IIPC this year)
Justin Brunelle, Hany SalahEldeen, Ahmed, and Yasmin all gave presentations at the Web Archiving and Digital Libraries Workshop (held after JCDL)
I gave an invited presentation at the Wolfram Data Summit
Michele Weigle gave a keynote at the Southeast Women in Computing Conference

We also released (or updated) a number of software packages for public use:

Mat released two: WARCreate and WAIL
Ahmed created "mcurl", a Memento-enable curl command
Hany released Carbon Date
Sawood released HTTP Mailbox
Shawn Jones released a Memento plugin for MediaWiki

Our research activities continued to enjoy a lot of coverage in the popular press. Some of the highlights include:

"Internet Archaeologists Reconstruct Lost Web Pages", MIT Technology Review
"How to Carbon-Date a Web Page", MIT Technology Review
"Archiving the Internet", Virginia Public Radio
"Computer Scientists Measure How Much of the Web is Archived", MIT Technology Review

Getting external funding is increasingly difficult, but we did manage two small new grants this year:

M. C. Weigle (PI), M. L. Nelson, "Archive What I Can See Now", NEH, $57,892
M. L. Nelson, "Enhanced Memento Support for MediaWiki", Andrew Mellon Foundation, $25,000

Lastly, we're awfully excited and proud about these upcoming student opportunities:

In a few months Ahmed will begin a full-time position as a Web Archive Engineer at the Stanford University Libraries (SUL)
Hany will spend the spring semester with Min-Yen Kan's Web Information Retrieval / Natural Language Processing Group (WING) at the National University of Singapore

Thanks to all that made 2013 a great success for WS-DL! We're looking forward to an even better 2014.

--Michael

2014-01-09 Edit:

Ahmed AlSum was the winner of the 2013 WS-DL lunch pail award (inspired by Bud Foster's "lunch pail defense"), awarded at yesterday's luncheon. Previous winners of the lunch pail include Martin Klein and Matthias Prellwitz.

↧