
2017-06-09: InfoVis Spring 2016 Class Projects

I'm way behind in posting about my Spring 2016 offering of CS 725/825 Information Visualization, but better late than never. (Previous semester highlights posts: Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)
 
Here are a few projects that I'd like to highlight. (All class projects are listed in my InfoVis Gallery.)

Expanding the WorldVis Simulation
Created by Juliette Pardue, Mridul Sen, Christos Tsolakis


This project (available at http://ws-dl.cs.odu.edu/vis/world-vis/) was an extension of the FluNet visualization, developed as a class project in 2013. The students generalized the specialized tool to handle arbitrary datasets of quantitative attributes per country over time and added attributes based on continent averages. They also computed summary statistics for each dataset for each year, so at a glance, the user can see information such as the countries with the minimum and maximum values.

This work was accepted as a poster to IEEE VIS 2016:
Juliette Pardue, Mridul Sen, Christos Tsolakis, Reid Rankin, Ayush Khandelwal and Michele C. Weigle, "WorldVis: A Visualization Tool for World Data," In Proceedings of IEEE VIS. Baltimore, MD, October 2016, poster abstract. (PDF, poster, trip report blog post)



Visualization for Navy Hearing Conservation Program (HCP)
Created by Erika Siregar (@erikaris), Hung Do (@hdo003), Srinivas Havanur


This project (available at http://www.cs.odu.edu/~hdo/InfoVis/navy/final-project/index.html) was also an extension of previous work. The first version of this visualization was built by Lulwah Alkwai.

The aim of this work is to track the hearing levels of US Navy workers over time through the Hearing Conservation Program (HCP). The HCP's goal is to detect and prevent noise-induced hearing loss among service members by analyzing their hearing levels over the years. The students analyzed data from an audiogram dataset and produced interactive visualizations using D3.js that show workers' hearing curves over the years.



ODU Student Demographics
Created by Ravi Majeti, Rajyalakshmi Mukkamala, Shivani Bimavarapu

This project (available at http://webspace.cs.odu.edu/~nmajeti/InfoViz/World/worldmap-template.html) concentrates on ODU international student information. It visualizes the headcount of international graduate and undergraduate students studying at ODU from each country for a particular major in a selected year, and it visualizes the gender ratio of undergraduate and graduate students in the university for each year. The main goal is to provide an interactive interface for prospective students to explore the global diversity at ODU and decide whether ODU meets their expectations with respect to alumni from their major and country.



Visualizing Web Archives of Moderate Size
Created by John Berlin (@johnaberlin), Joel Rodriguez-Ortiz, Dan Milanko


This work (available at http://jrodgz.github.io/project/) develops a platform for understanding web archives in a multi-user setting. The students used contextual data provided during the archival process to offer a new approach to identifying the general state of the archives. This metadata allows us to identify the most common domains, archived resources, times, and tags associated with a web collection. The tool outlines the most important areas of focus in web archives and gives users a clearer picture of what their collections comprise, in both specific and general terms.




-Michele


2017-06-26: IIPC Web Archiving Conference (WAC) Trip Report

Mat Kelly reports on the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017 in London, England.                            

In the latter part of Web Archiving Week (#waweek2017) from Wednesday to Friday, Sawood and I attended the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017, held jointly with the RESAW Conference at the Senate House and British Library Knowledge Center in London. Each of the three days had multiple tracks. Reported here are the presentations I attended.

Prior to the keynote, Jane Winters (@jfwinters) of University of London and Nicholas Taylor (@nullhandle) welcomed the crowd with admiration toward the Senate House venue. Leah Lievrouw (@Leah53) from UCLA then began the keynote. In her talk, she walked through the evolution of the Internet as a medium to access information prior to and since the Web.

With reservation toward the "Web 3.0" term, Leah described a new era in the shift from documents to conversations to big data. With a focus toward the conference, Leah described the social science and cultural breakdown as it has applied to each Web era.

After the keynote, two concurrent presentation tracks proceeded. I attended the track where Jefferson Bailey (@jefferson_bail) presented "Advancing access and interface for research use of web archives". First citing an updated metric of the Internet Archive's holdings (see Ian's tweet below), Jefferson provided an update on some contemporary holdings and collections at IA, including some details on his GifCities project (introduced with IA's 20th anniversary, see our celebration), which provides searchable access to the archive's holdings of the animated GIFs that once resided on Geocities.com.

In addition to this, Jefferson highlighted the beta features of the Wayback Machine, including an anchor-text-based search algorithm, a MIME-type breakdown, and much more. He also described some other available APIs, including one built on top of WAT files, a metadata format derived from WARC.

Through recent efforts by IA for their anniversary, they also had put together a collection of military PowerPoint slide decks.

Following Jefferson, Niels Brügger (@NielsBr) led a panel consisting of a subset of authors from the first issue of his journal, "Internet Histories". Marc Weber stated that the journal had been in the works for a while. When he initially told people he was looking at the history of the Web in the 1990s, people were puzzled. He went on to compare the Internet to being in its Victorian era, evolved from 170 years of the telephone and 60 years of being connected through the medium. Of the vast history of the Internet, we have preserved relatively little. He finished by noting that we need to treat history and preservation as something that should be done quickly, as we cannot go back later to find the materials if they are not preserved.

Steve Jones of the University of Illinois at Chicago spoke second, about the Programmed Logic for Automatic Teaching Operations (PLATO) system. There were two key interests, he said, in developing for PLATO: multiplayer games and communication. The original PLATO lab was in a large room, and because the developers could not be bothered to walk to each other's desks, they developed the "Talk" system to communicate and save messages so the same message would not have to be communicated twice. PLATO was not designed for lay users but for professionals, he said, though it was also used by university and high school students. "You saw changes between developers and community values," he said, "seeing development of affordances in the context of the discourse of the developers that archived a set of discussions." Access to the PLATO system is still available.

Jane Winters presented third on the panel, stating that there is a lot of archival content that has seen little research engagement. This may be due to continuing work on digitizing traditional texts, but it is hard to engage with the history of the 21st century without engaging with the Web. The absence of metadata is another issue. "Our histories are almost inherently online," she said, "but they only gain any real permanence through preservation in Web archives. That's why humanists and historians really need to engage with them."

The tracks then joined together for lunch and split back into separate sessions, where I attended the presentation "A temporal exploration of the composition of the UK Government Web Archive", which examined the evolution of the holdings of the UK National Archives (@uknatarchives). This was followed by a presentation by Caroline Nyvang (@caobilbao) of the Royal Danish Library that examined current web referencing practices. Her group proposed the persistent web identifier (PWID) format for referencing Web archives, which was eerily familiar to the URI semantics often used in another protocol.

Andrew (Andy) Jackson (@anjacks0n) then took the stage to discuss the UK Web Archive's (@UKWebArchive) catalog and the challenges they have faced while considering the inclusion of Web archive material. He detailed a process, represented by a hierarchical diagram, describing the sorts of transformations required to go from the data to reports and indexes about the data. In doing so, he also compared his process with other archival workflows that would be performed in a conventional library catalog architecture.

Following Andy, Nicola Bingham (@NicolaJBingham) discussed curating collections at the UK Web Archive, which has been archiving since 2013, and the challenges in determining the boundaries and scope of what should be collected. She encouraged researchers to engage with them to shape their collections. Their current holdings consist of about 400 terabytes with 11 to 12 billion records, growing by 60 to 70 terabytes and 3 billion records per year. Their primary mission is to collect UK web sites under UK TLDs (like .uk, .scot, .cymru, etc.). Crawls are currently capped at 512 megabytes per domain, and even then other technical limitations exist in capture (proprietary formats, plugins, robots.txt, etc.).

When Nicola finished, there was a short break. Following that, I traveled upstairs in the Senate House to the "Data, process, and results" workshop, led by Emily Maemura (@emilymaemura). She first described three different research projects, each with its researcher present, and asked attendees to break into groups to discuss the facets of each project in detail with that researcher. I opted to discuss Federico Nanni's (@f_nanni) work with him and a group of other attendees. His work consisted of analyzing and resolving issues in the preservation of the web site of the University of Bologna. The site specifies a robots.txt exclusion, which makes the captures inaccessible to the public, but through his investigation and efforts he was able to get the local policy changed to allow further examination of the captures.

With the completion of the workshop, everyone still in attendance joined back together in the Chancellor's Hall of the Senate House as Ian Milligan (@ianmilligan1) and Matthew Weber (@docmattweber) gave a wrap-up of the Archives Unleashed 4.0 Datathon, which had occurred prior to the conference on Monday and Tuesday. Part of the wrap-up was time given to the three top-ranked projects as determined by judges from the British Library. The group I was part of at the Datathon, "Team Intersection", was one of the three, so Jess Ogden (@jessogden) gave a summary presentation. More information on our intersection analysis between multiple data sets can be found on our GitHub.io page. A blog post detailing our report of the Datathon will be posted here in the coming days.

Following the AU 4.0 wrap-up, the audience moved to the British Library Knowledge Center for a panel titled, "Web Archives: truth, lies and politics in the 21st century". I was unable to attend this, opting for further refinement of the two presentations I was to give on the second day of IIPC WAC 2017 (see below).

Day Two

The second day of the conference was split into three concurrent tracks -- two at the Senate House and a third at the British Library Knowledge Center. Given I was slated to give two presentations at the latter (and the venues were about 0.8 miles apart), I opted to attend the sessions at the BL.

Nicholas Taylor opened the session with the scope of the presentations for the day and introduced the first three presenters. First on the bill was Andy Jackson with "Digging documents out of the web archives." He began by comparing this talk to the one he had given the day prior (see above) relating to the workflows in cataloging items. In the second day's talk, he discussed the process of the Digital ePrints team and the inefficiencies of its manual process for ingesting new content. Based on this process, his team set up a new harvester that watches targets, extracts the documents and machine-readable metadata from the targets, and submits them to the catalog. Still, issues remained, one being what to identify as the "publication" for e-prints relative to the landing page, assets, and what is actually cataloged. He discussed the need for further experimentation with a variety of workflows to optimize the outcome for quality, to ensure the results are discoverable and accessible, and to keep the process mostly automated.

Ian Milligan and Nick Ruest (@ruebot) followed Andy with their presentation on making their Canadian web archival data sets easier to use. "We want web archives to be used on page 150 in some book," they said, reinforcing that they want the archives to inform the insights rather than the subject necessarily being the archives themselves. They also discussed their extraction and processing workflow: acquiring the data from the Internet Archive, then using Warcbase and other command-line tools to make the data contained within the archives more accessible. Nick said that since last year, when they presented webarchives.ca, they have indexed 10 terabytes, representing over 200 million Solr docs. Ian also discussed derivative datasets they had produced, including domain and URI counts, full text, and graphs. Making the derivative data sets accessible and usable by researchers is a first step toward their work being used on page 150.
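To give a sense of what a derivative dataset like a domain count looks like, here is a minimal sketch that tallies domains from a single WARC file using the warcio library; it stands in for the Warcbase-based pipeline described above, and the file name is a placeholder.

```python
# Tally response records in a WARC by domain -- a toy version of a
# "domain count" derivative dataset.
from collections import Counter
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

def domain_counts(warc_path):
    counts = Counter()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI") or ""
            domain = urlparse(uri).netloc
            if domain:
                counts[domain] += 1
    return counts

for domain, n in domain_counts("collection.warc.gz").most_common(10):
    print(domain, n)
```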

Greg Wiedeman (@GregWiedeman) presented third in the technical session, first giving the context of his work at the University at Albany (@ualbany), where they are required to preserve state records with no dedicated web archives staff. Some records have paper equivalents, like archived copies of their Undergraduate Bulletins, while digital versions might consist of Microsoft Word documents corresponding to the paper copies. They use DACS to describe archives, so he questioned whether they should use it for Web archives as well. On a technical level, he runs a Python script over their collection of CDX files to schedule crawls, which are displayed in their catalog as they complete. "Users need to understand where web archives come from," he says, "and need provenance to frame their research questions, which will add weight to their research."

A short break commenced, followed by Jefferson Bailey presenting "Who, what, when, where, why, WARC: new tools at the Internet Archive". Initially apologizing for repetition of his prior day's presentation, Jefferson went into some technical details of statistics IA has generated, APIs they have to offer, and new interfaces with media queries of a variety of sorts. They have also begun to use SimHash to identify dissimilarity between related documents.

I (Mat Kelly, @machawk1) presented next with "Archive What I See Now – Personal Web Archiving with WARCs". In this presentation I described the advancements we have made to WARCreate, WAIL, and Mink with support from the National Endowment for the Humanities, which we have reported on in a few prior blog posts. This presentation served as a wrap-up of new modes added to WARCreate, the evolution of WAIL (see Lipstick or Ham, then Electric WAILs and Ham), and the integration of Mink (#mink #mink #mink) with local Web archives. Slides below for your viewing pleasure.

Lozana Rossenova (@LozanaRossenova) and Ilya Kreymer (@IlyaKreymer) talked next about Webrecorder, and namely about remote browsers. Showing a live example of viewing a web archive with a contemporary browser, they demonstrated that technologies that are no longer supported are not replayed as expected, often not being visible at all. Their work allows a user to replicate the original experience of the browser of the day, with the technologies as they were (e.g., Flash/Java applet rendering), for a more accurate portrayal of how the page existed at the time. This is particularly important for replicating art work that depends on these technologies to display. Ilya also described their Web Archiving Manifest (WAM) format, which allows a collection of Web archives to be used in replaying Web pages, with fetches performed at the time of replay. This patching technique allows for a more accurate replication of the page at a given time.

After Lozana and Ilya finished, the session broke for lunch, then reconvened with Fernando Melo (@Fernando___Melo) describing their work at the publicly available Portuguese Web Archive. He showed their work building an image search over their archive, using an API to describe Charlie Hebdo-related captures. His co-presenter João Nobre went into further details of the image search API, including the ability to parameterize the search by query string, timestamp, first-capture time, and whether the image is "safe". Discussion from the audience afterward asked the pair what their basis was for calling an image "safe".

Nicholas Taylor spoke about recent work with LOCKSS and WASAPI and the re-architecting of the former to open the potential for further integration with other Web archiving technologies and tools. They recently built a service for bibliographic extraction of metadata for Web harvest and file transfer content, which can then be mapped to the DOM tree. They also performed further work on an audit and repair protocol to validate the integrity of distributed copies.

Jefferson presented again, discussing IMLS-funded APIs they are developing to test transfers of WARCs to their partners using WASAPI. His group ran surveys showing that 15-20% of Archive-It users download their WARCs to store locally. Their WASAPI Data Transfer API returns a JSON object derived from the set of WARCs transferred, with fields for pagination, count, the requested URI, etc. Other fields representing an Archive-It ID, checksums, and collection information are also present. Naomi Dushay (@ndushay) then showed a video overview of their deployment procedure.
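To make the shape of such an API concrete, here is a rough sketch of paging through a WASAPI data-transfer endpoint with Python's requests library. The endpoint URL, credentials, and exact field names are assumptions for illustration based on the talk, not a documented Archive-It client.

```python
# Page through a WASAPI "webdata" listing, yielding one file record per WARC.
import requests

WASAPI_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/webdata"  # assumed

def list_warc_files(auth, collection=None):
    params = {"collection": collection} if collection else {}
    url = WASAPI_ENDPOINT
    while url:
        page = requests.get(url, params=params, auth=auth, timeout=30).json()
        for f in page.get("files", []):
            # each record is expected to carry a filename, checksums,
            # and one or more download locations
            yield f
        url = page.get("next")   # pagination: follow "next" until exhausted
        params = {}              # the query is already baked into "next"

if __name__ == "__main__":
    for f in list_warc_files(auth=("user", "password"), collection=1234):
        print(f.get("filename"), f.get("checksums"))
```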

After another short break, Jack Cushman & Ilya Kreymer tag-teamed to present "Thinking like a hacker: Security Issues in Web Capture and Playback". Through a mock dialog, they discussed issues in securing Web archives and a suite of approaches challenging users to compromise a dummy archive. Ilya and Jack also iterated through various security problems that might arise in serving, storing, and accessing Web archives, including stealing cookies, frame hijacking to display a false record, banner spoofing, etc.

Following Ilya and Jack, I (@machawk1, again) and David Dias (@daviddias) presented "A Collaborative, Secure, and Private InterPlanetary WayBack Web Archiving System using IPFS". This presentation served as follow-on work from the InterPlanetary Wayback (ipwb) project Sawood (@ibnesayeed) had originally built at Archives Unleashed 1.0 and then presented at JCDL 2016, WADL 2016, and TPDL 2016. This work, in collaboration with David of Protocol Labs, who created the InterPlanetary File System (IPFS), displayed some advancements in both IPWB and IPFS. David began with an overview of IPFS, the problem it is trying to solve, its system of content addressing, and its mechanism for facilitating object permanence. I discussed, as with previous presentations, IPWB's integration of web archive (WARC) files with IPFS using an indexing and replay system that utilizes the CDXJ format. One item in David's recent work is bringing IPFS to the browser with his JavaScript port, which interfaces with IPFS from the browser without the need for a running local IPFS daemon. I had recently introduced encryption and decryption of WARC content to IPWB, allowing for further permanence of archival Web data that may be sensitive in nature. To close the session, we performed a live demo of IPWB, replicating WARC data from another machine onto the presentation machine.
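For a sense of the indexing idea, the following is a conceptual sketch, not ipwb's actual code, of storing WARC response payloads in IPFS via a local daemon's HTTP API and recording the returned content addresses in CDXJ lines. It assumes the warcio library and an IPFS daemon on port 5001; the CDXJ field names are illustrative.

```python
# Push each WARC response payload into IPFS and write a CDXJ line that maps
# (URI, timestamp) to the returned content address.
import json

import requests
from warcio.archiveiterator import ArchiveIterator

IPFS_ADD = "http://localhost:5001/api/v0/add"  # local daemon's HTTP API

def ipfs_add(data: bytes) -> str:
    resp = requests.post(IPFS_ADD, files={"file": data}, timeout=30)
    return resp.json()["Hash"]  # content address (CID) of the stored bytes

def index_warc(warc_path, cdxj_path):
    with open(warc_path, "rb") as warc, open(cdxj_path, "w") as cdxj:
        for record in ArchiveIterator(warc):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ts = record.rec_headers.get_header("WARC-Date")
            payload_cid = ipfs_add(record.content_stream().read())
            entry = {"locator": "ipfs/" + payload_cid, "original_uri": uri}
            cdxj.write(f"{uri} {ts} {json.dumps(entry)}\n")

# index_warc("example.warc.gz", "example.cdxj")
```

At replay time the CDXJ line is looked up by URI and datetime, and the payload is fetched back out of IPFS by its content address, which is what makes replication across machines (as in the live demo) straightforward.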

Following our presentation, Andy Jackson asked for feedback on the sessions and what IIPC can do to support the enthusiasm for open source and collaborative approaches. Discussions commenced among the attendees about how to optimize funding for events, with Jefferson Bailey reiterating that travel eats away a large portion of the cost for such events. Further discussions were had about why the events were not recorded, and about how to remodel the hackathon events along the lines of other organizations' efforts, like Mozilla's Global Sprints, the events organized by the NodeJS community, and sponsoring developers through the Google Summer of Code. The audience then discussed how to follow up and communicate once the day was over, including via the IIPC Slack channel and the IIPC GitHub organization. With that, the second day concluded.

Day 3

By Friday, with my presentations for the trip complete, I had but one obligation remaining for the conference and the week (other than writing my dissertation, of course): to write the blog post you are reading. This was done while preparing for JCDL 2017 in Toronto the following week (which I attended by proxy, post coming soon). I missed the morning sessions, unfortunately, but joined in to catch the end of João Gomes' (@jgomespt) presentation on Arquivo.pt, also presented the prior day. I was saddened to learn that I had missed Martin Klein's (@mart1nkle1n) "Uniform Access to Raw Mementos", detailing his, Los Alamos', and ODU's recent collaborative work on extending Memento to support access to unmodified content, among other characteristics that cause a "Raw Memento" to be transformed. WS-DL's own Shawn Jones (@shawnmjones) has blogged about this on numerous occasions; see Mementos in the Raw and Take Two.

The first full session I was able to attend was Abbie Grotke's (@agrotke) presentation, "Oh my, how the archive has grown...", which detailed the progress the Library of Congress's Web archive has made and the substantial increase in the size of its holdings despite minimal growth in staff. While captivated, I came to know via the conference Twitter stream that Martin's third presentation of the day coincided with Abbie's. Sorry, Martin.

I did manage to switch rooms to see Nicholas Taylor discuss using Web archives in legal cases. He stated that in some cases, social media used by courts may only exist in Web archives and that courts now accept archival web captures as evidence. The first instance of using IA's Wayback Machine in court was in 2004, and its use has been contested many times without success. The Internet Archive provides affidavit guidance, which suggests asking the court whether it will consider captures from the archive as valid evidence. Nicholas alluded to FRE 201, which allows facts to be judicially noticed and used as evidence, the basis on which the archive has been used. He also cited various cases where expert testimony on Web archives was used (Khoday v. Symantec Corp., et al.), a defamation case where the IA disclaimer was used to argue against admitting it as evidence (Judy Stabile v. Paul Smith Limited et al.), and others. Nicholas also cited WS-DL's own Scott Ainsworth's (@Galsondor) work on Temporal Coherence and how a composite memento may not have existed as displayed.

Following Nicholas, Anastasia Aizman and Matt Phillips (@this_phillips) presented "Instruments for Web archive comparison in Perma.cc". In their work with Harvard's Library Innovation Lab (where WS-DL's Alex Nwala was recently a Summer fellow), the Perma team's goal is to allow users to cite things on the Web, create WARCs of those things, and then organize the captures. Their initial work with the Supreme Court corpus from 1996 to the present found that 70% of the references had rotted. Anastasia asked, "How do we know when a web site has changed, and how do we know which changes are important?"

They used a variety of ways to determine significant change, including MinHash (by calculating Jaccard coefficients), Hamming distance (via SimHash), and sequence matching against a baseline. As a sample corpus, they took over 2,000 Washington Post articles consisting of over 12,000 resources, examined the SimHash distances, and found big gaps; for MinHash, the distances appeared much closer. In their implementation, they surface this to the user on Perma via a banner that provides an option to highlight changes between sets of documents.
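For readers unfamiliar with the two measures, here is a small, self-contained Python illustration, not Perma.cc's implementation, of a 64-bit SimHash compared by Hamming distance and of the Jaccard coefficient that MinHash approximates (the sketch computes the exact coefficient for clarity).

```python
# Toy SimHash + Jaccard similarity for comparing two document versions.
import hashlib

def shingles(text, k=3):
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def simhash(text, bits=64):
    weights = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def jaccard(text_a, text_b, k=3):
    sa, sb = shingles(text_a, k), shingles(text_b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc_a = "the court held that the archived capture was admissible"
doc_b = "the court held that the archived snapshot was admissible"
print(hamming_distance(simhash(doc_a), simhash(doc_b)))  # small distance => similar
print(jaccard(doc_a, doc_b))                             # near 1 => similar
```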

There was a brief break, and then I attended a session where Peter Webster (@pj_webster) and Chris Fryer (@C_Fryer) discussed their work with the UK Parliamentary Archives. Their recent work consists of capturing the official social media feeds of members of parliament, critical as it captures their relationship with the public. They sought to examine the patterns of use and access by the members and determine the level of understanding users have of their archive. "Users are hard to find and engage," they said, citing that users were largely ignorant of what web archives are. In a second study, they found that users wanted a mechanism for discovery that mapped to an internal view of how parliament functions. Their studies found many things users do not want from web archives, but a takeaway is that they uncovered some issues in their assumptions, and their study raised the profile of the Parliamentary Web Archives among their colleagues.

Emily Maemura and Nicholas Worby presented next with their discussion of origin studies as they relate to web archives, provenance, and trust. They examined decisions made in creating collections in Archive-It by the University of Toronto Libraries, namely the collections involving the Canadian political parties, the Toronto 2015 Pan Am games, and their Global Summitry Archive. From these they determined that the three collections were, respectively, long running, a one-time event, and a collaboratively created archive. For the candidates' sites, they also noticed the implementation of robots.txt exclusions in a supposed attempt to prevent the sites from being archived.

Alexis Antracoli and Jackie Dooley (@minniedw) presented next about their OCLC Research Library Partnership web archive working group. Their examination determined that discoverability was the primary issue for users. One example was Princeton's use of Archive-It, which was not documented and therefore hard to discover. Through their study they established use cases for libraries, archives, and researchers. In doing so, they created a data dictionary of characteristics of archives consisting of 14 data elements like Access/rights, Creator, Description, etc., with many fields having a direct mapping to Dublin Core.

With a short break, the final session then began. I attended the session where Jane Winters (@jfwinters) spoke about increasing the visibility of web archives, asking first, "Who is the audience for Web archives?" and then enumerating researchers in the arts, humanities, and social sciences. She then described various examples in the press relating to web archives, including Computer Weekly's report on the Conservatives erasing official records of speeches from the IA and Dr. Anat Ben-David's work on getting the .yu TLD restored in the IA.

Cynthia Joyce then discussed her work studying Hurricane Katrina's unsearchable archive. Because New Orleans was not a tech-savvy place at the time, and it was pre-Twitter and Facebook was young, the personal record is not what it would be were the events to happen today. Researching as a citizen, she attempted to identify themes and stories that would have been missed in mainstream media. She said, "On Archive-It, you can find the Katrina collection ranging from resistance to gratitude." She collected the information only 8-9 years later, and many of the writers never expected their work to be preserved.

For the final presentation of the conference, Colin Post (@werrthe) discussed net-based art and how to go about making such works objects of art history. Colin used Alexei Shulgin's "Homework" as an example that uses pop-ups and self-conscious elements that add to the challenge of preservation. In Natalie Bookchin's course, Alexei Shulgin encouraged artists to turn in homework for grading, also doing so himself. His assignment is dominated by popups, something we view in a different light today. "Archives do not capture the performative aspect of the piece," Colin said. He cited oldweb.today as providing interesting insights into how the page was captured over time, with multiple captures being combined. "When I view the whole piece, it is emulated and artificial; it is disintegrated and inauthentic."

Synopsis

The trip proved very valuable to my research. Not documented in this post is the time between sessions, when I was able to speak to some of the presenters about their work as it related to my own, and even with those who were not presenting, to find intersections in our respective research.

Mat (@machawk1)

2017-06-29: Joint Conference on Digital Libraries (JCDL) 2017 Trip Report

The 2017 Joint Conference on Digital Libraries (JCDL) took place at the University of Toronto, Canada. From June 19-23, we (WS-DL) attended workshops, tutorials, panels, and a doctoral consortium. The theme of this year's conference was #TOscale, #TOanalyze, and #TOdiscover. The conference provided researchers from disciplines such as digital libraries and information science with the opportunity to communicate the findings of their respective research areas.
Day 1 (June 19)
The first (pre-conference) day kicked off with a Doctoral Consortium and a tutorial, Introduction to Digital Libraries. These events took place in parallel with a workshop, the 6th International Workshop on Mining Scientific Publications (WOSP 2017). The final event of the day was a tutorial titled "Scholarly Data Mining: Making Sense of the Scientific Literature".

Day 2 (June 20)
The conference officially started on the second day with opening remarks from Ian Milligan, shortly followed by a keynote from Liz Lyon in which she presented a retrospective on data management, highlighting the successes and achievements of the last decade, assessing the current state of data, and providing insight into the research, policies, and practices needed to sustain progress.
Following Liz Lyon's keynote, Dr. Justin Brunelle opened the Web archives paper session with a presentation of a full paper titled "Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly." In this presentation, he discussed the challenges Web archives face in crawling pages with deferred representations due to JavaScript, and proposed a method for discovering and archiving deferred representations and their respective descendants, which are only visible from the client.
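A minimal sketch of the underlying idea (not the paper's crawler) is to contrast the server-returned HTML with the client-side DOM after JavaScript executes; links that appear only in the latter belong to the deferred representation. The sketch below assumes Selenium with a headless Chrome driver and uses a crude substring check for illustration.

```python
# Surface URIs that are only visible after client-side JavaScript execution.
from urllib.parse import urljoin

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

def deferred_only_uris(url):
    # URIs visible in the raw HTML returned by the server
    raw_html = requests.get(url, timeout=30).text

    # URIs visible in the DOM after the client executes JavaScript
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        client_uris = {urljoin(url, a.get_attribute("href") or "") for a in anchors}
    finally:
        driver.quit()

    # Approximate "descendants of the deferred representation":
    # links absent from the server-returned HTML.
    return {u for u in client_uris if u and u not in raw_html}

if __name__ == "__main__":
    for uri in sorted(deferred_only_uris("https://example.com/")):
        print(uri)
```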
Next, Faryaneh Poursardar presented a short paper, "What is Part of that Resource? User Expectations for Personal Archiving," in which she talked about the difficulty users face in answering the question: what is part of and what is not part of an Internet resource? She also explored various user perceptions of this question and its implications for personal archiving.
Next, Dr. Weijia Xu presented a short paper, "A Portable Strategy for Preserving Web Applications and Data Functionality". Dr. Xu proposed a preservation strategy for decoupling web applications from their data and hosting environment in order to improve the reproducibility and portability of the applications across different platforms over time.
Sawood Alam was scheduled to present his short paper titled: "Client-side Reconstruction of Composite Mementos Using ServiceWorker," but his flight was cancelled the previous day, delaying his arrival until after the paper session. 
Dr. Nelson presented the paper on his behalf, and discussed the use of the ServiceWorker (SW) web API to help archival replay systems avoid the problem of incorrect URI references due to URL rewriting, by strategically rerouting HTTP requests from embedded resources instead of rewriting URLs.
The conference continued with the second paper session (Semantics and Linking) after a break. This session consisted of a pair of full paper presentations followed by a pair of short paper presentations.
First, Pavlos Fafalios presented - "Building and Querying Semantic Layers for Web Archives," which was also a Vannevar Bush Best Paper Nominee. Pavlos Fafalios proposed a means to improve the use of web archives. He highlighted the lack of efficient and meaningful methods for exploring web archives, and proposed an RDF/S model and distributed framework that describes semantic information about the content of web archives.
Second, Abhik Jana presented "WikiM: Metapaths based Wikification of Scientific Abstracts" - a method of wikifying scientific publication abstracts - in order to effectively help readers decide whether to read the full articles. 
Third, Dr. Jian Wu presented "HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities." Dr. Jian Wu presented a variant of automatic keyphrase extraction called Scientific Domain Knowledge Entity (SDKE) extraction. Unlike keyphrases (important noun phrases of a document), SDKEs refer to a span of text which represents a concept which can be classified as a process, material, task, dataset etc.
Fourth, Xiao Yang presented "Smart Library: Identifying Books in a Library using Richly Supervised Deep Scene Text" - a library inventory building/retrieval system based on scene text reading methods, which has the potential of reducing the manual labor required to manage book inventories.
The third paper session (Collection Access and Indexing) began with Martin Toepfer's presentation of his full paper (Vannevar Bush Best Paper Nominee) titled "Descriptor-invariant Fusion Architectures for Automatic Subject Indexing: Analysis and Empirical Results on Short Texts." He discussed the need for digital libraries to index documents automatically and accurately, especially considering concept drift and the rapid increase in content such as scientific publications. Martin Toepfer also discussed approaches to automatic indexing as a means of helping researchers and practitioners in digital libraries decide on appropriate methods.
Next, Guillaume Chiron presented his short paper titled "Impact of OCR errors on the use of digital libraries: Towards a better access to information." He discussed his research estimating the impact of OCR errors on the use of the Gallica Digital Library from the French National Library, and proposed a means of predicting the relative mismatch between queried terms and the target resources due to OCR errors.
Next,  Dr. Kevin Page presented a short paper titled: "Information-Seeking in Large-Scale Digital Libraries: Strategies for Scholarly Workset Creation." He discussed his research which examined the information-seeking models ('worksets') proposed by the HathiTrust Research Center for research into the 15 million volumes of HathiTrust content. This research also involved assessing whether the information-seeking models effectively capture emergent user activities of scholarly investigation.
Next, Dr. Peter Darch presented a short paper titled "Uncertainty About the Long-Term: Digital Libraries, Astronomy Data, and Open Source Software." Dr. Darch talked about the uncertainty Digital Library developers experience when designing and implementing digital libraries, presenting the case study of building the Large Synoptic Survey Telescope (LSST) Digital Library.
The third paper session concluded with a short paper presentation from Jaimie Murdock titled "Towards Publishing Secure Capsule-based Analysis," in which he discussed recent advancements in aiding HathiTrust Digital Library (HTDL) researchers who intend to publish their results from Big Data analysis of the HTDL. The advancements include provenance, workflows, worksets, and non-consumptive exports.
After the Day 2 paper sessions, Dr. Nelson conducted the JCDL plenary community meeting, in which attendees were given the opportunity to give feedback to improve the conference. The plenary community meeting was followed by Minute Madness, a session in which authors of posters had one minute to convince the audience to visit their poster stands.
The Minute Madness gave way to the poster session and a reception followed. 
Day 3 (June 21)
Day 3 started with a keynote from Dr. Raymond Siemens, in which he discussed the ways a social scholarship framing of the production, accumulation, organization, retrieval, and navigation of knowledge encourages building knowledge to scale in a humanistic context.
Following the keynote, the fourth paper session (Citation Analysis) began with a prerecorded full paper presentation (Vannevar Bush Best Paper Nominee) from Dr. Saeed-Ul Hassan titled "Identifying Important Citations using Contextual Information from Full Text," in which he addressed the problem of classifying cited work into important and non-important classes with respect to the developments presented in a research publication, an important step for algorithms designed to track emerging research topics.
Next, Luca Weihs presented a full paper titled "Learning to Predict Citation-Based Impact Measures." He presented non-linear probabilistic techniques for predicting the future scientific impact of a research paper, unlike linear probabilistic methods, which focus on understanding the past and present impact of a paper.
The final full paper presentation of this session, "Understanding the Impact of Early Citers on Long-Term Scientific Impact," was given by Mayank Singh. He presented his investigation of whether the set of authors who cite a paper early (within 1-2 years) affects the paper's Long-Term Scientific Impact (LTSI). In his research he discovered that influential early citers negatively affect LTSI, probably due to "attention stealing."
The conference continued with the fifth paper session (Exploring and Analyzing Collections), consisting of three full paper presentations. The first (Student Paper Award Nominee), titled "Matrix-based News Aggregation: Exploring Different News Perspectives," was presented by Norman Meuschke. He presented NewsBird, a matrix-based news analysis (MNA) system that helps users see news from various perspectives, as a means of helping to avoid biased news consumption.
The second paper (Vannevar Bush Best Paper Nominee), titled "Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787," was presented by Dr. Nicholas Cole, who introduced the Quill framework. Quill is a new approach to presenting and studying the records of formal negotiations such as the creation of constitutions, treaties, and legislation. Quill currently hosts the records of the Constitutional Convention of 1787 that wrote the Constitution of the United States.
The final presentation of this session was from Dr. Kevin Page, titled "Realising a Layered Digital Library: Exploration and Analysis of the Live Music Archive through Linked Data," in which he discussed his research following a Linked Data approach to build a layered Digital Library, utilizing content from the Internet Archive's Live Music Archive.
The sixth paper session (Text Extraction and Analysis) consisted of three full paper presentations. The first, titled "A Benchmark and Evaluation for Text Extraction," was presented by Dr. Hannah Bast. Dr. Bast highlighted the difficulty of extracting text from PDF documents, due to the fact that PDF is a layout-based format which specifies position information of characters rather than semantic information (e.g., body text or footnote). She also presented her evaluation of 13 state-of-the-art tools for extracting text from PDF, showed that her method, Icecite, outperformed the other tools but is not perfect, and outlined the steps necessary to make text extraction from PDF a solved problem.
Next, Kresimir Duretec presented "A text extraction software benchmark based on a synthesized dataset." To help text data processing workflows in digital libraries, he described a dataset generation method based on model-driven engineering principles and used it to synthesize a dataset and its ground truth directly from a model. He also presented a benchmark for text extraction tools.
This paper session concluded with a presentation by Tokinori Suzuki titled "Mathematical Document Categorization with Structure of Mathematical Expressions." He presented his research in Mathematical Document Categorization (MDC), the task of classifying mathematical documents into mathematical categories such as probability theory and set theory, and proposed a classification method that uses text and the structures of mathematical expressions.
The seventh paper session (Collection Building) consisted of three full paper presentations and began with Dr. Federico Nanni's presentation (Best Student Paper Award Nominee) titled "Building Entity-Centric Event Collections." Federico Nanni presented an approach that utilizes large web archives to build event-centric sub-collections consisting of core documents related to the events as well as documents associated with the premise and consequences of the events.
Next, Jan R. Benetka presented a paper titled "Towards Building a Knowledge Base of Monetary Transactions from a News Collection," in which he addressed the problem of extracting structured representations of economic events (e.g., large company buyouts) from a large corpus of news articles. He presented a method which combines natural language processing and machine learning techniques to address this task.
I concluded the seventh paper session with a presentation titled "Local Memory Project: providing tools to build collections of stories for local events from local sources". In this presentation, I discussed the need to expose local media sources and introduced two tools under the umbrella of the Local Memory Project. The first tool, Geo, helps users discover nearby local news media sources such as newspapers, TV, and radio stations. The second, a collection-building tool, helps users build, save, share, and archive collections of local events from local sources, for US and non-US media sources.
Here are the slides I presented:
The eighth paper session (Classification and Clustering) occurred in parallel with the sixth paper session. It consisted of a pair of full papers and a pair of short papers. The first paper, titled "Classifying Short Unstructured Data using the Apache Spark Platform," was presented by Saurabh Chakravarty. He highlighted the difficulty traditional classifiers have in classifying tweets, due in part to the shortness of tweets and the presence of abbreviations, hashtags, emojis, and non-standard usage of written language. Consequently, he proposed the use of the Spark platform to implement two short-text classification strategies, and showed these strategies are able to effectively classify millions of texts composed of thousands of distinct features and classes.
Next, Abel Elekes presented his full paper (Best Student Paper Award Nominee) titled "On the Various Semantics of Similarity in Word Embedding Models," in which he discussed the results of two experiments to determine when exactly the similarity scores of a word embedding model are meaningful. He proposed that his method could provide a better understanding of the notion of similarity in embedding models and improve the evaluation of such models.
Next, Mirco Kocher presented his short paper titled "Author Clustering Using Spatium." He proposed a model for clustering authors, after presenting the author clustering problem as it relates to authorship attribution questions. The model uses a distance measure called Spatium, derived from a weighted version of the L1 norm (the Canberra measure). He showed that the model produced high precision and F1 values when tested on 20 test collections.
Finally, Shaobin Xu presented a short paper titled "Retrieving and Combining Repeated Passages to Improve OCR." He presented a new method to improve the output of Optical Character Recognition (OCR) systems: first detect duplicate passages, then perform consensus decoding combined with a language model.
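As a small aside on the distance measure, the following sketch shows a Canberra-style weighted L1 distance of the kind named as the basis for Spatium, applied to relative word-frequency profiles of two texts; it illustrates the measure only, not the paper's full clustering model, and the numbers are made up.

```python
# Canberra-style weighted L1 distance between two word-frequency profiles.
def canberra(x, y):
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)

# relative frequencies of the same most-frequent terms in two texts (illustrative)
profile_a = [0.031, 0.012, 0.007, 0.004]
profile_b = [0.029, 0.015, 0.006, 0.002]
print(canberra(profile_a, profile_b))  # smaller distance => more likely same author
```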
The ninth paper session (Content Provenance and Reuse) began with Dr. David Bamman's full paper presentation titled "Estimating the Date of First Publication in a Large-Scale Digital Library." He discussed his findings from evaluating methods for approximating the date of first publication. The methods considered (and used in practice) include using the date of publication from available metadata, multiple deduplication methods, and automatically predicting the date of composition from the text of the book. He found that a simple heuristic of metadata-based deduplication performs best in practice.
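A toy sketch of that metadata-based deduplication heuristic might look like the following; the field names and records are illustrative, not drawn from the paper.

```python
# Group records sharing a normalized title/author and take the earliest year
# in each group as the date of first publication.
from collections import defaultdict

def first_publication_dates(records):
    """records: iterable of dicts with 'title', 'author', 'year' metadata fields."""
    groups = defaultdict(list)
    for r in records:
        key = (r["title"].strip().lower(), r["author"].strip().lower())
        groups[key].append(r["year"])
    return {key: min(years) for key, years in groups.items()}

records = [
    {"title": "Moby-Dick", "author": "Herman Melville", "year": 1851},
    {"title": "Moby-Dick ", "author": "herman melville", "year": 1922},  # reprint
]
print(first_publication_dates(records))
```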
Dr. George Buchanan presented his full paper titled: "The Lowest form of Flattery: Characterising Text Re-use and Plagiarism Patterns in a Digital Library Corpus," in which he discussed a first assessment of text re-use (plagiarism) for the digital libraries domain, and suggested measures for more rigorous plagiarism detection and management.
Next, Corinna Breitinger presented her short paper titled "CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain." She introduced CryptSubmit as a means of addressing the fear researchers have that their work may be leaked or plagiarized by a program committee or anonymous peer reviewers. CryptSubmit utilizes the decentralized Bitcoin blockchain to establish trust and verifiability by creating a publicly verifiable and tamper-proof timestamp for each manuscript.
Next, Mayank Singh presented a short paper titled "Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science." He proposed a new model of conceptualizing plagiarism in scholarly research based on the reuse of explicit citation sentences in scientific research articles, unlike traditional plagiarism detection, which uses text similarity. He provided examples of plagiarism and revealed that this practice is widespread, even among well-known researchers.
A conference banquet at Sassafraz Restaurant followed the last paper session of the day.
During the banquet, awards for best poster, best student paper, and the Vannevar Bush best paper were given. Sawood Alam received the most votes for his poster, "Impact of URI Canonicalization on Memento Count," and thus received the award for best poster. Felix Hamborg, Norman Meuschke, and Dr. Bela Gipp received the best student paper award for "Matrix-based News Aggregation: Exploring Different News Perspectives." Finally, Dr. Nicholas Cole, Alfie Abdul-Rahman, and Grace Mallon received the Vannevar Bush best paper award for "Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787."
Day 4 (June 22)
Day four of the conference began with a panel session titled "Can We Really Show This?: Ethics, Representation and Social Justice in Sensitive Digital Space," which addressed the ethical issues experienced by curators who work with sensitive and contentious content from marginalized populations. The panel consisted of Deborah Maron (moderator) and the following speakers: Dorothy Berry, Raegan Swanson, and Erin White.
The tenth and last paper session (Scientific Collections and Libraries) followed and consisted of three full paper presentations. First, Dr. Abdussalam Alawini presented a paper titled "Automating data citation: the eagle-i experience," in which he highlighted the growing concern of giving credit to contributors and curators of datasets. He presented his research on automating citation generation for an RDF dataset called eagle-i, and discussed a means of generalizing this citation framework across a variety of different types of databases.
Next, Sandipan Sikdar presented "Influence of Reviewer Interaction Network on Long-term Citations: A Case Study of the Scientific Peer-Review System of the Journal of High Energy Physics" (Best Student Paper Award Nominee). His research sought to answer the question "Could the peer review system be improved?" amid a consensus from the research community that it is indispensable but flawed. It attempted to answer this question by introducing a new reviewer-reviewer interaction network, showing that structural properties of this network surprisingly serve as strong predictors of the long-term citations of a submitted paper.
Finally, Dr. Martin Klein presented "Discovering Scholarly Orphans Using ORCID". He proposed a new paradigm for archiving scholarly orphans: web-native scholarly objects that are largely neglected by current archival practices. He presented his research investigating the feasibility of using the Open Researcher and Contributor ID (ORCID) as a means of discovering the web identities and scholarly orphans of active researchers.
Here are the slides he presented:
Dr. Salvatore Mele gave the keynote of the day. He discussed the significant impact preprints have had on research, such as in the High-Energy Physics domain, which has benefited from a rich preprint culture for more than half a century. He also reported on the results of two studies that aimed to assess the coexistence and complementarity between preprints and academic journals that are less open.
The 2017 JCDL conference officially concluded with Dr. Ed Fox's announcement of the 2018 JCDL conference to be held at the University of North Texas. 
--Nwala

2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017


The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22 to June 23, 2017. I live-tweeted both days; you can follow along on Twitter with this blog post using the hashtag wadl2017 or via the notes/minutes of WADL2017. I also created a Twitter list of the speakers'/presenters' handles; go give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

Keynote

The opening keynote of WADL2017 was "National Digital Platform (NDP), Funding Opportunities, and Examples of Currently Funded Projects" by Ashley Sands (IMLS).
In the keynote, Sands spoke about the desired values for the national digital platform, the various grant categories and funding opportunities IMLS offers for archiving projects, and the submission procedure for grants, as well as tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like and how to apply to become a reviewer of proposals!

Lightning Talks

First up in the lightning talks was Ross Spencer from the New Zealand Web Archive with "HTTPreserve: Auditing Document-Based Hyperlinks" (poster).

Spencer has created a tool, httpreserve, that checks the status of a URL on the live web and whether it has been archived by the Internet Archive; it is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
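The following is a minimal sketch of the kind of check httpreserve performs (not its actual code): probe the live web for a URL's current status and ask the Internet Archive's Wayback availability API whether a capture exists.

```python
# Check a URL's live status and its closest Wayback Machine capture.
import requests

def check_url(url):
    try:
        live_status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        live_status = None  # unreachable on the live web

    wayback = requests.get("https://archive.org/wayback/available",
                           params={"url": url}, timeout=10).json()
    closest = wayback.get("archived_snapshots", {}).get("closest")
    return {
        "url": url,
        "live_status": live_status,
        "archived": bool(closest),
        "closest_capture": closest["url"] if closest else None,
    }

print(check_url("http://example.com/"))
```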
The second talk was by Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal seeks to give researchers access to browse and search through custom collections and provides tools for analyzing these collections via Warcbase.
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation, you can read our blog post or the full technical report.

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about nearline vs. transactional Web archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps with "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue that "to use Canadian web archives you have to really want to use them, that is, you need to be an expert" by "bringing Canadian web archives into a centralised portal with access to derivative datasets".
Enter WALK: 61 collections, 16 TB of WARC files, and a new Solr front end based on Project Blacklight (250 million records currently indexed). The WALK workflow consists of using Warcbase and a handful of other command-line tools to retrieve data from the Internet Archive, generate scholarly derivatives (visualizations, etc.) automatically, upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.
To ensure that WALK can scale, the project will build on top of Blacklight and contribute the work back to the community as WARCLight.
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker." Alam spoke about how, through the use of a ServiceWorker, URIs that were missed during rewriting, or not rewritten at all due to the dynamic nature of the web, can be rerouted dynamically to hit the archive rather than the live web.
Avoiding Zombies in Archival Replay Using ServiceWorker from Sawood Alam

Ian Milligan was up next, presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan noted during his talk was how to train a classifier when there is no annotated data to train it with. To address this, Milligan used bootstrapping via bag-of-words and keyword matching, noting that this method works with noisy but reasonable data. The classifiers were trained to look for biases in administrations; Trump vs. Obama seems to work with dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation, "Web Archives: A preliminary exploration vs reality". Ayala spoke about analyzing Archive-It support tickets (exported as XML, then cleaned and anonymized) using qualitative coding and grounded theory, and presented the expectations users hold, and the realities that confront them, when working with web archives:
Expectation: The original website had X number of documents, so it would follow that the archived website also has X number of documents.
Reality: An archived website was often much larger or smaller than the user had expected.

Expectation: A web archive only includes content that is closely related to the topic.
Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as being of little relevance and superfluous.

Expectation: Content that looks irrelevant is actually irrelevant.
Reality: A website contains pages or elements that are not obviously important but help "behind the scenes" to make other elements or pages render correctly or function properly. This is knowledge that is known by the partner specialist, but usually unknown or invisible to the user or creator of an archive. Partner specialists often had to explain the true nature of this seemingly irrelevant content.

Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
Reality: These differences usually affect how a website is captured.

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event took place in December in Toronto, to preserve EPA data from the Obama administration during the Trump transition. The event had roughly two hundred participants and produced hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events have been hosted or co-hosted in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley, with thirty-one more planned in cities across the country.
After the panel was Tom J. Smyth on "Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada". Smyth spoke on how to start building a collection for a budding web archive that does not yet have the scale of an established one, and outlined what such an archive needs:
Web Archival Scoping Documents
  • What priority
  • What type
  • What are we trying to document
  • What degree are we trying to document
Controlled Collection Metadata, Controlled vocabulary
  • Evolves over time with the collection topic
Quality Control Framework
  • Essential for setting a cut-off point for quality control
Selected Web Resources must pass four checkpoints
  • Is the resource in-scope of the collection and theme
    (when in doubt consult the Scoping Document)
  • Heritage Value: is the content unique or available in other formats
    (in what contexts can it be used)
  • Technology / Preservation
  • Quality Control

The next paper presenters were Muhammad Umar Qasim and Sam-Chin Li for "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI - DPN)". The Canadian Government Information Digital Preservation Network (CGI - DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project uses Archive-It for web crawls and collection building, then uses LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).
Nick Ruest was up next speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult, and about how to handle that big Twitter data in a sane manner using tools such as Hydrator and twarc from the DocNow project.


The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on Word2Vec-based vector representations of tweets and how it can be used to label unlabeled examples, how training a classifier using augmented training provides improvements in classification efficacy, and how a Word2Vec representation generated from a richer corpus like Google News provides better improvements with augmented training.

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. There were a lot of great ideas and suggestions made as the round table progressed, with the participants of this discussion becoming most excited about the following:
  1. WADL 2018 (naturally of course)
  2. Seeking out additional collaboration and information sharing with those who are actively involved in web archiving but are unaware of, or did not make it to, WADL
  3. Looking into bringing proceedings to WADL, perhaps even a journal
  4. Extending the length of WADL to a full two or three day event
  5. Integration of remote site participation for those who wish to attend but cannot due to geographic location or travel expenses
Till Joint Conference on Digital Libraries 2018 June 3 - 7 in Fort Worth, Texas, USA
- John Berlin

2017-07-06: Web Science 2017 Trip Report

I was fortunate enough to have the opportunity to present Yasmin AlNoamany's work at Web Science 2017. Dr. Nelson offers an excellent class on Web Science, but it had been years since I had taken it, and I was still uncertain about the current state of the art.
Web Science 2017 took place in Troy, a small city in upstate New York that is home to Rensselaer Polytechnic Institute (RPI). The RPI team had organized an excellent conference focused on a variety of Web Science topics, including cyber bullying, taxonomies, social media, and ethics.

Keynote Speakers


Day One


The opening keynote by Steffen Staab from the Institute for Web Science and Technologies (WeST) was entitled "The Web We Want". He discussed how we need to determine what values we want to meet before deciding on the web we want. Dr. Staab defined three key values: accessibility for the disabled, freedom from harassment, and a useful semantic web.

Staab detailed the MAMEM project whose goal is to provide access to the web for the disabled, accounting for those without the ability to operate a mouse and keyboard as well as those who cannot see or hear. He mentioned that the z-stacking used by the Google search engine's textbox frustrates a lot of accessibility tools.


On the topic of harassment, Staab indicated that we need to determine the roles and structure used by people in social networks. Who is each person linked to? Are they initiators or do they join discussions later? Are they trolls? Are they contributors? Are they moderators? Can we differentiate these roles based on previous experience? He showed the procedure by which the ROBUST project classifies users into each of these roles with the goal of providing an early response to trolls, attacks, and spam.




For a useful Semantic Web, Staab stressed the importance of data that is interlinked and allows us to further describe entities. Most quality assessments for existing links don't take into account the usefulness of the data. How close are we to benchmarking this usefulness? It depends on the application. So far, we have recommender systems that work based on what someone else said was useful, but that may not fit the needs of the user under consideration. Tests of usefulness are further frustrated by the fact that people behave differently during testing the second time through.

He closed by stressing that we need to measure our achievement of these goals. As we measure the achievement of these values, it inspires our engineering and this same measurement is required to understand how well an engineering solution works.

Day Two


The second day keynote was by Jennifer Golbeck, world leader in social media research and science communication, creator of the field of social analytics, and professor at the University of Maryland. She started by talking about one of the reviewers of her paper, "A Large Labeled Corpus for Online Harassment Research", submitted to Web Science 2017. This reviewer liked the work done on social media harassment, but objected to the inclusion of harassing tweets as evidence in the paper. Dr. Golbeck objected to this idea that we should not include evidence in scientific papers, no matter how offensive it may be. She used the story of the upset reviewer throughout the rest of her talk.
Her research tries to answer questions such as who is posting harassing content and why. She also mentioned that Twitter will often not help you block someone if you report them. In order to study the phenomenon, she sought out harassing tweets on Twitter. Fortunately, there is a low density of harassing tweets in the Twitter firehose. After several unsuccessful methods, including blocklists like Block Together, she resorted to finding harassing tweets using Twitter searches by combining expletives and the names of marginalized groups. Harassment is directed at these groups because they have less power, and it is intended to silence them. Sadly, 50% of the tweets containing the word "feminist" are part of harassment on Twitter.


She discovered that there were several main groups of harassers in the data, labeled Gamergate, Trump Supporters, Alt-right, UK-based Brexit/anti-Muslim, and "Anime". This does not mean that all Trump Supporters harass people, but there is a large group of harassers that are Trump Supporters. The "Anime" group seemed to be very interested in the Japanese cartoon style, but not all Anime fans are trolls, and likewise with other descriptors.


She also highlighted the work "Trolls Just Want to Have Fun", by Buckels, Trapnell, and Paulhus. Buckels discovered that trolls exhibit a higher percentage of the following personality traits: machiavellianism, narcissism, psychopathy, and sadism. This was contrasted with those who merely engage in debating issues. These personality traits are known in psychological circles as the Dark Tetrad of personality, identifying individuals that are more likely to cause "social distress".


In spite of some of this progress, we still have no idea of the full picture of harassment on Twitter. One would need to learn the language of the communities under study, both of harassers and victims, in order to fully discover all of the harassment going on.  This makes members of these groups -- women, minorities, etc. -- more careful about what they say on social media because they have to weigh the potential harassment before even speaking. She returned to the reviewer of her paper and stated that she included the Tweets not only as evidence, but because the more we are silent on these issues, the more they will continue.

I Presented Yasmin AlNoamany's Work


On the third day, I was fortunate enough to present Dr. AlNoamany's work on using storytelling tools to summarize Archive-It collections. She uses the example of the Egyptian Revolution, much of which was recorded online in real time, as a use case for summarizing web archive collections. Much of the web resources from the Revolution are gone, but have been preserved in web archives.


csvconfyasmin2017_05_03 from Yasmin AlNoamany, PhD

There are multiple archive collections about the revolution and it is difficult to visualize more than potentially 1000 different captures of potentially 1000 seed URIs. We seek to answer questions such as: "What is in this collection?" and "What makes this collection different from another?" She uses social media Storytelling as an existing interface with which users are familiar. This presentation discusses, at a high level, the Dark and Stormy Archives (DSA) framework which automatically summarizes the collection and generates the visualization in Storify.

Selected Posters


There were many excellent posters at Web Science 2017. Unfortunately, I do not have room to cover them all, so I will highlight a select few.


In "Understanding Parasocial Breakups on Twitter" (preprint), Kiran Garimella studied perceived virtual relationships, known as para-social relationships, on Twitter. This scenario erupts when a user follows a celebrity on social media and then are followed back. For some fans, this provides the illusion of a real relationship. A para-social breakup (PSB) occurs when a fan stops following the celebrity. He studied the 15 most followed celebrities from popular culture on Twitter and used a subset of their fans. He classified fans into 3 types: (1) involved, (2) casual, and (3) random. The involved fans tweet often with their chosen celebrity, but also have a higher probability of unfollowing the celebrity than casual fans who tweet with their celebrity only once per year, or a random sample of followers. Garimella's study has implications for marketing.

The mobile game Pokémon Go has a feature known as the Pokéstop, where players can gather more resources to continue playing the game. In "Pokémon Go: Impact on Yelp Restaurant Reviews" (preprint), Pavan Kondamundi evaluated whether or not the inclusion of Pokéstops in Yelp restaurant profiles had an impact on the reviews for those restaurants. His study included 100 restaurants, half of which contained Pokéstops. He found an increase in the number of reviews for the period of 2014 to 2015, but a slight decrease for the following period.

Policy documents are used by many organizations, not just those within government. Bharat Kale's work, "Predicting Research that will be cited in Policy Documents" (preprint), attempts to determine what features increase the probability of an academic work being cited by a policy document. Using features related to online attention, he discovered that the Random Forest classifier showed the best results for predicting if an article is cited by a policy document. Mention counts on peer-review platforms, such as PubPeer, seem to be the most influential feature, and mentions in the news appear to be the least influential. He intends to extend the work "to predict the number of policy citations a given work is likely to receive."
As I mentioned in an earlier blog post, the problem of entity resolution, and more specifically author disambiguation, continues to confound solutions for scholarly communication. Janaína Gomide focuses on the synonym problem, where a single individual has multiple names. In her work "Consolidating identities of authors through egonet structure", rather than using content information about a given author, she is studying egonets, networks of collaborators built from co-authorship information. She is developing an algorithm that attempts to disambiguate authors based on the shape of their egonet. Preliminary results with datasets from DBLP and Google Scholar show promise for the current version of this algorithm.

There was a lot of work on social networks at the Web Science conference, and Nirmal Sivaraman's work "On Social Synchrony in Online Social Networks" was no exception. He defines social synchrony as "a kind of collective social activity associated with some event where the number of people who participate in the activity first increases exponentially and then decreases exponentially". He developed an algorithm that determines if synchrony has occurred within a dataset of social media data.

Spencer Norris won the best poster award for his "A Semantic Workflow Approach to Web Science Analytics". He highlights the use of linked data to build workflows for use in running and repeating scientific experiments.  His work focuses on the use of semantic workflows for Web Science, indicating that these workflows, because of their ease of publication and analysis, also easily allow "Web Science analyses to be recreated and recombined". He combines the Workflow INstance Generation and Specialization (WINGS) system with the existing Semantic Numeric Exploration Technology (SemNExT) framework.

Selected Papers

There were 45 papers accepted at Web Science. I will summarize a few here to convey the type of research being conducted at the conference.

The design of web pages has shifted over time, leading to differences in how we consume them. Bardia Doosti presented "A Deep Study into the History of Web Design" (copy on author's website). In their work, they point out that web design, much like paintings and architecture, can be analyzed to indicate the concepts and ideas that represent the era from which a web page comes. They developed several automated techniques for analyzing archived web pages, including the use of deep Convolutional Neural Networks, with the hope of identifying the web pages' subject areas as well as determining which web sites (such as apple.com) may influence the design of others.

Olga Zagovora presented "The gendered presentation of professions on Wikipedia" (preprint), where she and her co-authors conducted a study comparing the number of women mentioned on the profession pages of German Wikipedia to the actual number of women in those professions, indicating that there is still a gender bias in the pages. They compared the number of images, mentioned persons, and wiki page titles. It is likely that the choices representing individuals in these professions may be made out of tradition or due to the historical preponderance of males in these fields, but this work is useful in informing further development of guidelines for the Wikipedia community. The data is available on GitHub.

Companies, celebrities, and even general users want to know what helps them acquire more Twitter followers. Juergen Mueller and his co-authors attempted to determine the influence of a user's profile information on her number of followers in "Predicting Rising Follower Counts on Twitter Using Profile Information" (preprint). Because of the rate limitations of the Twitter API, they are interested in determining what can be predicted based on a user's profile alone. Using several classifiers, they discovered that follower count is affected by the "subjective impression" of the profile's name, indicating that follower counts are adversely affected for accounts with a name that is perceived as feminine. They also confirmed earlier research indicating that users with a given name in their name field have fewer followers.


The concept of fake news has received a lot of media attention, especially since the 2016 US Presidential Elections. In "The Fake News Spreading Plague: Was it Preventable?" (preprint), Eni Mustafaraj presented recipes for spreading misinformation on Twitter from 2010 and spreading fake news on Facebook from 2016. The two recipes have the same steps, meaning that perhaps the spread of misinformation during the 2016 US Presidential elections could have been avoided. She mentioned that even though Facebook had been working on preventing the spread of hoaxes since January of 2015, they were unsuccessful.


Omar Alonso and his colleagues at Microsoft created a search engine in "What's Happening and What Happened: Searching the Social Web". They are building a growing archive of tweets to find relevant links that have been shared on social media. Their project differs from others because they also add a temporal dimension to their data gathering to show what people were talking about at a given time, which keeps the search engine fresh but also allows for some historical analysis. The system uses a concept of virality rather than just popularity for the inclusion of results. Because of this focus on virality, their system is able to filter fake news from the results. Contrary to other results, they discovered that "the total number of shares of the real links was higher than the fake links" on Twitter. The resulting search engine allows a user to search for a topic at a given date and time and discover what links were relevant to that topic at that time. The results are presented as a series of social cards rather than the "10 blue links" presented by well known web search engines. These social cards are similar to the link cards used in Storify: they contain an image, title, and short description of the link behind the card.



Alonso also presented the work "Automatic Generation of Timelines from Social Data", which attempts to determine what occurred on a given day for a given hashtag. The system evaluates the tweets by several metrics for relevance, quality, and popularity to produce a vector of relevant n-grams for that hashtag. Once this is done, links are extracted from the tweets, and titles are extracted from these links. The document link titles are evaluated using a new technique the authors name Social Pseudo Relevance Feedback which combines their existing n-gram vectors with the concept of pseudo relevance feedback from information retrieval in order to re-rank the link titles. The highest ranked title for the time period, a day or an hour, is then presented as an entry into the story. The dates can then be listed next to the title produced for that date which, when presented in this fashion, represents a timeline of events matching a given hashtag (seen for #parisattacks and #deflategate in the photos above). I thought this was quite brilliant. One could easily extend this technique by presenting the generated links in order using a tool like Storify, much like Dr. AlNoamany has done for Archive-It collections.


Web Archives are important to the research of the Old Dominion University WS-DL group, so I was intrigued by "Observing Web Archives" (university repository version) presented by Jessica Ogden. She was interested in the in-depth "day-to-day decisions, activities and processes that facilitate web archiving in practice". She used an ethnographic approach to understand the practices at the Internet Archive.





Kiran Garimella presented "The Effect of Collective Attention on Controversial Debates on Social Media" (preprint), studying polarized debates on Twitter. They analyzed four controversial topics on Twitter from 2011 to 2016. They discovered that "spikes in interest correspond to an increase in the controversy of the discussion" and that these spikes result in changes to the vocabulary being used during the discussions as each side argues its case. They want to develop a model that allows us to use "early signals" from social media to "predict the impact of an event". Kiran won best student paper for this study.

As Web Science researchers, we spend a lot of time analyzing the data available from the web. While working on the online harassment Digital Wildfire project, Helena Webb, Marina Jirotka, and their co-authors began to question the ethics of exploring a user's Twitter data without the user's consent. Even though Twitter is largely a public social network, issues arise when one considers that researchers are deriving behaviors and information about people, and thus this work has parallels with research on human subjects. Marina Jirotka presented "The Ethical Challenges of Publishing Twitter Data for Research Dissemination" (link to university repository). They indicated that there is indeed harm to be caused by republishing social media posts, exposing the attacker to retaliation and forcing the victim to relive the experience. Even if one were to anonymize posts, it is still difficult to fully anonymize the subject, considering the posts can be found via search engines on social media sites.

If a researcher wanted to acquire consent, how would they do so? In the case of Twitter, the social media feed is so large that many users do not view it all and may miss requests for consent. How often should the researcher attempt to contact them? Is an opt-out policy better than an opt-in policy? Echoing Jennifer Golbeck's keynote: If posts were observed, but cannot be included in published research, how do we support our findings? The study exposes many of these concerns in hopes that we can come to a consensus on how to handle them as a community. This study won best paper.


Everything Else








There was a panel discussion at the end of the first day on the ethics of web science. It echoed some of the issues brought up in Helena Webb and Marina Jirotka's paper, but also introduced some additional perspectives. The panel consisted of Jim Hendler, Jeanna Matthews, Steffen Staab, and Hans Akkermans. Each offered many different perspectives, but it was clear that the concern is that Web Science researchers need to drive the ethics discussion before groups outside of the community drive it for them.



Of course, we enjoyed the time learning from one another. Discussions over dinner were influenced by the presentations we had witnessed during the day. We were also able to educate one another about our individual projects. Memento made an appearance in some of those discussions and stickers ended up on some laptops.


I would like to thank John Erickson, Juergen Mueller, Lee Fiorio, Jim Hendler, Omar Alonso, Wendy Hall, Kiran Garimella, Jessica Ogden, Marina Jirotka, James McCusker, Deborah McGuinness, Olga Zagovora, Frederick Ayala-Gómez, Hamed Alhoori, Peter Fox, Spencer Norris, Katharina Kinder-Kurlanda, Eni Mustafaraj, Xiaogang (Marshall) Ma, and many others for fascinating insight and interesting discussions over meals and outings.

Summary



The Web Science 2017 conference was invigorating and fascinating. It has really inspired me to make Web Science an area of interest in future studies. The Web Science Trust has summarized the conference and also provided a Storify story of what happened. I am looking forward to possibly attending this conference again in Amsterdam in 2018 where I may contribute the next grains of knowledge to the discipline.

-- Shawn M. Jones

2017-07-19: Archives Unleashed 4.0: Web Archive Datathon Trip Report


They: Hey Sawood, nice to see you again.
Me: Hi, I am glad to see you too.
They: Did you attend all hackathons, I mean datathons?
Me: Yes, I attended all of the four Archives Unleashed events so far.
They: How did you like it?
Me: Well, there is a reason why I attended all of them, despite being a seemingly busy PhD researcher.
They: So, what is your research about?
Me: I am trying to profile various web archives to build a high-level understanding of their holdings, primarily, for the sake of efficiently routing Memento aggregation requests, but there can be many more use cases of such profiles... [and the conversation continues...]


On day zero of Archives Unleashed 4.0 in London, conversations among many familiar and unfamiliar faces started with travel and lodging related questions, but soon evolved into mass storage challenges, scaling issues, quality and coverage of web archives, long-term maintenance of archival tools, documentation and discovery of libraries, exchange of research ideas, and more. Ian and Matt were looking fresh and welcoming in the reception of #HackArchives as always. This was all familiar; it is how the previous AU events started too, and they yielded great networking among the web archiving community members.


Previously, the Web Science and Digital Libraries Research Group (WSDL) has been well-represented at AU events, but visa issues and competing events meant that only Mat and I were able to attend.


The next day, on Monday, June 12, 2017, the main event started at the British Library in the morning with the usual registration process, a welcome kit, and strange, AU-branded, 3D-printed-looking red rubber balls (that no one had any idea what to do with). Dr. Matthew Weber and Dr. Ian Milligan began with the opening remarks, describing the scope of the event and the available datasets and other resources.


Next was the current efforts session for which Ian, Jefferson, Tom, and Andy were supposed to talk about Warcbase, Internet Archive APIs, National Archives Datasets, and UK Web Archive respectively. Since Jefferson could not make it to the event on time, Ian had to morph into Jefferson for the corresponding talk about IA APIs. All of these talks were very insightful and had a lot to learn from.

Possibly the most interesting aspect of AU events is the phenomenon of group formation. People and idea stickers flock around the room and naturally cluster into smaller groups with similar interests to come up with a more precise research question and datasets to use. This time, they formed a total of eight different groups with a diverse set of research questions and scopes.


After the lunch break, teams settled at their tables and started worrying about task refinement, computing resources, data acquisition, and action plans. One of the most difficult issues at AU events is the problem of dataset acquisition. Advertised datasets are often not in an easy-to-get condition. Additionally, these datasets are often too large to be copied over to the respective computing instances in a feasible amount of time; some preprocessing and sampling can be helpful. Complex (and often unknown) authentication barriers should also be removed from the data acquisition process. On one hand it is part of the learning process to acquire and understand the data and learn about other tools to create derivative data, but on the other hand I have consistently noticed that this process is difficult and limits the opportunity for actual data analysis.

Another very useful aspect of AU events is the opportunity to allow people to share their current projects and efforts in the field of web archiving using short lightning talks. In the past we have taken advantage of it to introduce various WSDL efforts such as MemGator, IPWB, CarbonDate, WhatDidItLookLike, and ICanHazMemento. Following the tradition, this time also there were a handful of lightning talks lined up for both the days.
After the first round of five lightning talks, teams went back to their hacking tasks, mostly trying to acquire datasets, understand them, and adjust their ambitious plans to something more feasible within the short time limit. Then everyone left for dinner while discussing ideas and the scope of their work with their team members. The dinner was really good, but it did not stop people from exchanging world-shaking ideas.


The next morning many teams were talking about how much data they had processed overnight and what to do next. The next couple of hours were very critical for every team to come up with something that provided some answers to their proposed research questions. After another session of lightning talks, teams continued to work on their projects, but now they started thinking about the reporting aspect and visualizations of their findings as more and more results became apparent. The efforts continued during and after the short lunch break. One could see people multi-tasking to get everything done before the final presentations that were only a coffee break away, but some people still had the courage to put everything aside for a while and go for a walk outside. Not every team was working on data analysis, but the overall experience was still generalizable. Finally, the time had arrived for brief project presentations to share the findings of the "Samudra Manthan" in front of three esteemed judges from the British Library.
  • Team Portuguese Archive presented their outcome of archived image classification using TensorFlow. As a testbed they used maps to distinguish contemporary maps from historic maps.
  • Team Intersect (of which I was a member) presented the archival coverage of the Occupy Wall Street movement in various collections and social media along with the overlap among various datasets. They found less than 1% overlap among different datasets, which means the more collectors, the better the coverage. They also found that two-thirds of the outlinks from these collections were not archived.
  • The Olympians presented gender distribution in Olympic committees and found strong male bias.
  • Team Shipman Report analyzed text in Shipman Report and found it deadly and dark.
  • Team Links analyzed WARC files to find the trend in distribution of relative/absolute paths and absolute URLs in anchor element along with HTML element distribution around anchors over the time.
  • Team Robots analyzed different types of robots.txt files in web archives with the intent of finding the impact on archival captures if the robots.txt was honored. They found that the impact would not be huge.
  • Team Curated built a prototype of an upcoming Rhizome tool for better curation and annotation. They illustrated some wire frame prototypes of various components and workflow.
  • Team WARCs peeked inside WARC files for traces of politics and elections in the US.
While the judges were deciding the winners, Ian wrapped up the event by looking back at the past two days and briefly mentioning the highlights of the event. He gave a vote of thanks for all individuals and sponsoring organizations who supported the event in various ways including data and computing resources, venue and logistics, and travel grants. The judges' verdict was in; Team Links, Team Robots, and Team Intersect were found guilty of being the best. Everyone was a winner, but some of them performed more efficiently than others within a very short span of time. I am sure every team had much more to show than what they could in the short five-minute presentations.

Now it was time to disperse and continue exchanging ideas over drinks and dinner while getting ready for the rest of the Web Archiving Week events.

They: So, Sawood, are you planning to continue attending all future AU events?  
Me: I hope so! ;-)


--
Sawood Alam

2017-07-24: Replacing Heritrix with Chrome in WAIL, and the release of node-warc, node-cdxj, and Squidwarc


I have written posts detailing how an archive's modifications to the JavaScript of a web page being replayed collided with the JavaScript libraries used by the page, and how JavaScript + CORS is a deadly combination during replay. Today I am here to announce the release of a suite of high fidelity web archiving tools that help to mitigate the problems surrounding web archiving and a dynamic, JavaScript-powered web. To demonstrate this, consider the image above: the left-hand screen shot shows today's cnn.com archived and replayed in WAIL, whereas the right-hand screen shot shows cnn.com in the Internet Archive on 2017-07-24T16:00:02.

In this post, I will be covering:


WAIL


Let me begin by announcing that WAIL has transitioned away from using Heritrix as the primary preservation method. Instead, WAIL now directly uses a full Chrome browser (provided by Electron) as the preservation crawler. WAIL does not use WARCreate, Browsertrix, brozzler, Webrecorder, or a derivation of one of these tools, but my own special sauce. The special sauce powering these crawls has been open sourced and made available through node-warc. Rest assured WAIL still provides auto-configured Heritrix crawls, but the Page Only, Page + Same Domain Links, and Page + All Links crawls now use a full Chrome browser in combination with automatic scrolling of the page and the Pepper Flash Plugin. You can download this version of WAIL today. Oh, and did I mention that WAIL's browser-based crawls pass Mat Kelly's Archival Acid Test?

But I thought WAIL was already using a full Chrome browser and a modified WARCreate for the Page Only crawl? Yes, that is correct, but the key aspect here is the modified in modified WARCreate. WARCreate was modified for automation: to use Node.js buffers, to re-request every resource besides the fully rendered page, and to work in Electron, which is not an extension environment. What remained shared with the original was saving both the rendered page and the request/response headers. So how did I do this, and what kind of black magic did I use in order to achieve this? Enter node-warc.

Before I continue


It is easy to forget which tool did this first and continues to do it extremely well. That tool is WARCreate. None of this would be possible if WARCreate had not done it first and I had not cut my teeth on Mat Kelly's projects. So look for this very same functionality to come to WARCreate in the near future, as Chrome and the Extension APIs have matured beyond what was initially available to Mat at WARCreate's inception. It still amazes me that he was able to get WARCreate to do what it does in the hostile environment that is Chrome Extensions. Thanks Mat! Now get that Ph.D. so that we come closer to not being concerned with WARCs containing cookies and other sensitive information.


node-warc


node-warc started out as a Node.js library for reading WARC files, to address my dislike for how other libraries would crash and burn if they encountered an off-by-one error (webarchiveplayer, OpenWayback, Pywb indexers), and to build one that is more performant and has a nicer API than the only other one on npm, which is three years old with no updates and no gzip support. As I worked on making the WAIL-provided crawls awesome, and on Squidwarc, node-warc became a good home for the browser-based preservation side of handling WARCs. node-warc is now a one-stop shop for reading and creating WARC files using Node.js.

On the reading side, node-warc supports both gzipped and non-gzipped WARCs. An example of how to get started reading WARCs using node-warc is shown below, and the API documentation is available online.
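The embedded code example does not survive in this version of the post, so here is a minimal sketch of reading a WARC with node-warc; the AutoWARCParser class name, event names, and record properties are my assumptions, so consult the linked API documentation for the authoritative usage.

// A sketch of reading a (gzipped or plain) WARC file with node-warc.
// Class, event, and property names are assumptions; see the API docs.
const { AutoWARCParser } = require('node-warc');

const parser = new AutoWARCParser('collection.warc.gz'); // hypothetical file path

parser.on('record', (record) => {
  // Each WARC record (warcinfo, request, response, ...) is surfaced as it is parsed.
  console.log(record.warcType, record.warcTargetURI); // property names are assumptions
});
parser.on('error', (error) => console.error(error));
parser.on('done', () => console.log('finished parsing'));

parser.start();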

How performant is node-warc? Below are performance metrics for parsing both gzipped and un-gzipped WARC files of different sizes.

un-gzipped

size      records   time   max process memory
145.9MB   8,026     2s     22 MiB
268MB     852       2s     77 MiB
2GB       76,980    21s    100 MiB
4.8GB     185,662   1m     144.3 MiB

gzipped

size      records   time    max process memory
7.7MB     1,269     297ms   7.1 MiB
819.1MB   34,253    16s     190.3 MiB
2.3GB     68,020    45s     197.6 MiB
5.3GB     269,464   4m      198.2 MiB
Now to the fun part. node-warc provides the means for archiving the web using Electron's provided Chrome browser, or using Chrome or headless Chrome through chrome-remote-interface, a Node.js wrapper for the DevTools Protocol. If you wish to use this library for preservation with Electron, use ElectronRequestCapturer and ElectronWARCGenerator. The Electron archiving capabilities were developed in WAIL and then put into node-warc so that others can build high fidelity web archiving tools using Electron. If you need an example to help you get started, consult wail-archiver.

For use with Chrome via chrome-remote-interface, use RemoteChromeRequestCapturer and RemoteChromeWARCGenerator. The Chrome-specific portion of node-warc came from developing Squidwarc, a high fidelity archival crawler that uses Chrome or Chrome Headless. Both the Electron and remote Chrome WARCGenerator and RequestCapturer share the same DevTools Protocol, but each has its own way of accessing that API; node-warc takes care of that for you by providing a unified API for both Electron and Chrome. The special sauce here is that node-warc retrieves the response body from Chrome/Electron simply by asking for it, and Chrome/Electron will give it to us. It is that simple. Documentation for node-warc is available via n0tan3rd.github.io/node-warc and is released on Github under the MIT license. node-warc welcomes contributions and hopes that it will be found useful. Download it today using npm (npm install node-warc or yarn add node-warc).
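To give a rough feel for how the remote Chrome pieces fit together, the sketch below drives a page load over chrome-remote-interface and hands the captured traffic to node-warc; the DevTools Protocol calls are standard, but the node-warc class and method usage shown here is an assumption rather than the documented API.

// Sketch: capture a page's network traffic via the DevTools Protocol and
// write it out as a WARC. The RemoteChromeRequestCapturer / RemoteChromeWARCGenerator
// usage is assumed for illustration; check the node-warc docs before relying on it.
const CDP = require('chrome-remote-interface');
const { RemoteChromeRequestCapturer, RemoteChromeWARCGenerator } = require('node-warc');

CDP(async (client) => {
  const { Network, Page } = client;
  const capturer = new RemoteChromeRequestCapturer(Network); // assumption
  await Network.enable();
  await Page.enable();
  await Page.navigate({ url: 'https://example.com' }); // hypothetical seed
  await Page.loadEventFired();
  const generator = new RemoteChromeWARCGenerator();
  await generator.generateWARC(capturer, Network, { warcPath: 'example.warc' }); // assumption
  await client.close();
}).on('error', (err) => console.error(err));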

node-cdxj


The companion library to node-warc is node-cdxj (cdxj on npm), a Node.js library for parsing CDXJ files commonly used by Pywb. An example of how to use this library is seen below.
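The embedded usage example is also missing here; the sketch below only conveys the general shape of reading a CDXJ index, and the exported function and entry properties are assumptions, so refer to the API documentation for the real interface.

// A hedged sketch of reading a CDXJ index with node-cdxj.
// The parse() entry point and the entry properties are assumptions.
const cdxj = require('cdxj');

cdxj.parse('indexes/index.cdxj') // hypothetical path to a Pywb-style index
  .then((entries) => {
    entries.forEach((entry) => {
      // Each CDXJ line pairs a SURT and datetime with a JSON block of capture metadata.
      console.log(entry.surt, entry.datetime, entry.json); // property names are assumptions
    });
  })
  .catch((err) => console.error(err));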

node-cdxj is distributed via Github and npm (npm install cdxj or yarn add cdxj). Full API documentation is available via n0tan3rd.github.io/node-cdxj and is released under the MIT license.

Squidwarc


Now that Vitaly Slobodin has stepped down as the maintainer of PhantomJS (it's dead, Jim) in deference to Headless Chrome, it is with great pleasure that I introduce to you today Squidwarc, a high fidelity archival crawler that uses Chrome or Headless Chrome directly. Squidwarc aims to address the need for a high fidelity crawler akin to Heritrix while remaining easy enough for the personal archivist to set up and use. Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls, but rather seeks to address Heritrix's shortcomings, namely:
  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users
Those are some bold (cl)aims. Yes, they are, but in comparison to other web archiving tools, using Chrome as the crawler makes sense. Plus, to quote Vitaly Slobodin:
Chrome is faster and more stable than PhantomJS. And it doesn't eat memory like crazy.
So why work hard when you can let the Chrome devs do a lot of the hard work for you? They must keep up with the crazy, fast-changing world of web development, so why shouldn't the web archiving community utilize that to our advantage? I think we should, at least, and that is why I created Squidwarc. This reminds me of the series of articles Kalev Leetaru wrote entitled Why Are Libraries Failing At Web Archiving And Are We Losing Our Digital History?, Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web, and The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web? Well, sir, I present to you Squidwarc, an archival crawler that can handle the ever-changing and dynamic web. I have shown you mine; what do your crawlers look like?

Squidwarc is an HTTP/1.1 and HTTP/2, GET, POST, HEAD, OPTIONS, PUT, DELETE preserving, JavaScript executing, page interacting archival crawler (just to name a few capabilities). And yes, it can do all that. If you doubt me, see the documentation for what Squidwarc is capable of through chrome-remote-interface and node-warc. Squidwarc is different from brozzler as it supports both Chrome and Headless Chrome right out of the box, does not require a middleman to capture the requests and create the WARC, makes full use of the DevTools Protocol thanks to being a Node.js based crawler (Google approved), and is simpler to set up and use.

So what can be done with Squidwarc at its current stage? I created a video demonstrating the technique described by Dr. Justin Brunelle in Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants which can be viewed below. The code used in this video is on Github as is Squidwarc itself.



The crawls operate in terms of a Composite Memento. For those who are unfamiliar with this terminology, a composite memento is a root resource such as an HTML web page and all of the embedded resources (images, CSS, etc.) required for a complete presentation. An example crawl configuration (currently a JSON file) is seen below; the annotations (comments) are not valid JSON. A non-annotated configuration file is provided in the Squidwarc Github repository.
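The annotated configuration embedded in the original post is not visible here, so the sketch below conveys its general shape; every field name is an assumption except the crawl modes, and the comments, as noted above, make it invalid JSON (see the Squidwarc repository for the real, non-annotated file).

{
  "mode": "page-same-domain",       // one of: page-only, page-same-domain, page-all-links
  "depth": 1,                       // how far to follow links from the seed(s) (assumption)
  "seeds": ["http://example.com"],  // pages whose composite mementos will be preserved
  "connect": {
    "host": "localhost",            // where the (headless) Chrome instance is listening
    "port": 9222
  },
  "warc": {
    "naming": "url"                 // how generated WARC files are named (assumption)
  }
}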

The definitions for these crawl modes are seen below; remember, Squidwarc crawls operate in terms of a composite memento. The frontier is web pages, not web pages plus the resources of a web page (Chrome retrieves those for us automatically).
  • page-only: Preserve the page so that there is no difference between replaying the page and viewing it in a web browser at preservation time
  • page-same-domain: page-only plus preserve all links found on the page that are on the same domain as the page
  • page-all-links: page-same-domain plus all links from other domains
Below is a video demonstrating that the Squidwarc crawls do in fact preserve only the page, page + same domain links, and page + all links using the initial seed n0tan3rd.github.io/wail/sameDomain1.



Squidwarc is an open source project and available on Github. Squidwarc is not yet available via npm but you can begin using Squidwarc by cloning the repo. Let us build the next phase of web archiving together. Squidwarc welcomes all who wish to be part of its development and if there are any issues feel free to open one up.

Both WAIL and Squidwarc use node-warc and a Chrome browser for preservation. If portability and no setup are what you seek, download and start using WAIL. If you just want to use the crawler, clone the Squidwarc repository and begin preserving the web using your Chrome browser today. All the projects in this blog post welcome contributions as well as issues via Github. The excuse of "I guess the crawler was not configured to preserve this page" or the term "unarchivable page" are no longer valid in the age of browser-based preservation.

- John Berlin

2017-08-07: rel="canonical" does not mean what you think it means

The rel="identifier" draft has been submitted to the IETF.  Some of the feedback we've received via Twitter and email are variations of 'why don't you use rel="canonical" to link to the DOI?'  We discussed this in our original blog post about rel="identifier", but in fairness that post discussed a great deal of things and through updates and comments it has become quite lengthy. 

The short answer is that rel="canonical" handles cases where there are two or more URIs for a single resource (AKA "URI aliases"), whereas  rel="identifier" specifies relationships between multiple resources.

Having two or more URIs for the same resource is also known as "DUST: different URLs, similar text".  This is commonplace with SEO and catalogs (see the 2009 Google blog post and help center article about rel="canonical").  RFC 6596 gives abstract examples, but below we will examine real world examples (only one of which I'm fully prepared to buy).

Consider the two lexicographically different URIs for the same resource (in this case, Amazon's page for DJ Shadow's upcoming EP "The Mountain Has Fallen"):
  1. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow
  2. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q
The first URI is what I got when I searched amazon.com for "dj shadow" and clicked on a search result.  The second URI is the "canonical" version that should be indexed by Google et al.  The first URI uses an HTML <link> element to inform search engines about the second URI so they know they haven't found two different resources with two different URIs:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | grep -i canonical
<link rel="canonical" href="https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" />


We can see that the raw HTML is not exactly the same (if it were, it would be trivial for the search engines to dedup), but the rendered HTML is essentially the same, with the exception of the navigation trail ("‹ Back to search results for "dj shadow"") vs. the categorization ("CDs & Vinyl › Dance & Electronic › Electronica") on the left-hand side, right above the EP artwork:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | wc
   12711   16648  446841


$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" | wc
   12802   17120  459761




It is clear there is no need for a search engine to index both pages.   The raw HTML is nearly (but not exactly!) the same and unless it is aware of amazon.com URI patterns, your crawler would not easily discover that they refer to the same resource.  We can construct a similar example with ebay.com: again the raw HTML differs slightly but in this case I cannot tell a difference in the rendered HTML:

$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | fmt | grep --context canonical | tail -3
    hreflang="es-ni" /><link rel="canonical"
    href="http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713"
    /><lmeta Property="og:image"


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713" | wc
    2678    9225  189098


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | wc
    2688    9246  189235




So why can't we use rel="canonical" for, say, DOIs and publisher pages?  In the case of DOIs, a technical reason is that the resource identified by the DOI and the resource identified by the publisher's page are not the same resource.  Admittedly this is a detour into the esoteric realm of HTTP 303 semantics, but the HTTP URI of a DOI does not have a representation and the publisher's URI does; the resources identified by these URIs are related but fundamentally different.

Another reason would be when you wish to specify part-whole relationships between resources that comprise the resource identified by a DOI.  For example, XML vs. HTML, Zip file(s) of associated code and data, embedded (and "recontextualizable"!) images, sound, or video, etc.  This would be for the purpose of expressing identity, and would not preclude combination with navigation (e.g., rel="up") or SEO links (e.g., rel="canonical"). These identification patterns are presented in more detail at the Signposting web site.

Another argument against using rel="canonical" for linking to DOIs (and friends) is that publishers are already using canonical to manage SEO within their own domain.  In the example below, springer.com signals to search engines that the URI in the third redirect from the DOI is canonical and not the previous two:

$ curl -iL --silent http://dx.doi.org/10.1007/978-3-319-43997-6_35 | egrep -i "(HTTP/1.1 [0-9]|^location:|rel=.canonical)"
HTTP/1.1 303 See Other
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 301 Moved Permanently
Location: https://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 302 Found
Location: https://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
HTTP/1.1 200 OK
        <link rel="canonical" href="https://link.springer.com/chapter/10.1007/978-3-319-43997-6_35"/>


Furthermore,  publishers are specifying DOIs with a variety of incompatible ad hoc approaches (see the prior blog post for examples), meaning there is demand for this function even though there is currently not a standardized method of achieving it.

But there are other applications for rel="identifier" outside of scholarly content.   Consider the Wikipedia page for DJ Shadow.  As I type this, it has not yet been edited to include the upcoming EP mentioned above, but there's a good chance that by the time you read this that will have changed.


I can reference the particular version of the page using the "permalink", which yields the URI https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397.   That page will remain static, and never mention "The Mountain Has Fallen".  That page does use rel="canonical" to link back to the generic, current version of the page:

$ curl --silent -i "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>


Which is entirely expected and desirable: we don't want Google to separately index the 1000s of prior versions of this page, just the latest version.  The generic version of the page also asserts that it is canonical:

$ curl --silent -i "https://en.wikipedia.org/wiki/DJ_Shadow" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

But if I were using a reference manager to cite https://en.wikipedia.org/wiki/DJ_Shadow, and if that page also had:

<link rel="identifier" href="https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397"/>

Then the reference manager would cite the specific version of the page, providing a machine-readable version of the human-readable guidance already provided under the "Cite This Page" link.  This use of rel="identifier" would not collide with the rel="canonical" which is already in place for SEO*.  In this Wikipedia example, the two rels coexist and specify URI preferences for different purposes:
  • rel="canonical": preferred for content indexing
  • rel="identifier": preferred for referencing
Herbert insisted on a New Mexico specific example, so we'll consider the ubiquitous multi-page articles, designed to expand content to increase advertising revenue.  Of interest to us is page 5 of this particular article about TV continuity errors: http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5.  It uses rel="canonical" to inform search engines to strip off any common, superfluous arguments that might also be present (e.g., "&utm_source=...&utm_medium=...&utm_campaign=..."):

$ curl -i --silent "http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" | grep canonical
<link rel="canonical" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" />

Assuming for a moment that coolimba.com wanted to facilitate referencing of this page as part of an aggregation, it could include:

<link rel="identifier up" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/" />

In this case, rel="up" also serves as a simple navigation function, if you chose to view these pages as a tree and not a list (if this is indeed a list, then "up" is probably not applicable).  But note that rel="up" would not be applicable in the Wikipedia (or even DOI) example(s) above.  Also note that rel="up" and rel="identifier" sharing the same URI is something of a coincidence: if a multi-page article has more than two "levels" then we would expect the URIs to diverge.

In conclusion, SEO/indexing and referencing are different functions and thus require different rel types; cases where the target URIs overlap should be considered coincidences.  rel="canonical" is used to collapse multiple URIs that yield duplicative text into a single, preferred URI to facilitate indexing, and rel="identifier" is used to select a single URI from among multiple URIs that yield different text to facilitate referencing. 


--Michael & Herbert


P.S. To return to our original pop culture reference: "have fun storming the castle!"


* Note that rel="permalink" and rel="bookmark" (the former was never registered and ultimately supplanted by the latter) do different things and are not usable in HTTP Link headers; see the prior blog post for details.


2017-08-11: Where Can We Post Stories Summarizing Web Archive Collections?



A social card generated by Facebook for my previous blog post.
Rich links, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, social card. These visualizations of web objects now pervade our existence on and off of the Web. The concept has been used to render web documents as results in academic research projects, like in Omar Alonso's "What's Happening and What Happened: Searching the Social Web". oEmbed is a standard for producing rich embedded representations of web objects for a variety of consuming services. Google experiments with using richer objects in their search results, even including images and other content from pages. Facebook, Twitter, Tumblr, Storify, and other tools use these cards. They have become so ubiquitous that services that do not produce these cards, like Google Hangouts, seem antiquated. These cards also no longer just sit within the confines of the web browser, being used in Apple's iMessage application since the release of iOS 10, as shown below. For simplicity, I will use the term social card for the rest of this post.

Apple's iOS iMessage app also generates social cards. This example also shows a card linking to my previous blog post.
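Since oEmbed came up above, here is a small sketch of how a consuming service can ask a provider for the card-ready metadata behind a URL; the provider endpoint is hypothetical, while the response fields (type, version, title, thumbnail_url, html) come from the oEmbed specification.

// Sketch: fetching oEmbed metadata that a service could render as a social card.
// The provider endpoint is hypothetical; response fields follow the oEmbed spec.
const https = require('https');

const target = 'http://www.history.com/topics/boston-marathon-bombings';
const endpoint = 'https://provider.example/oembed?format=json&url=' + encodeURIComponent(target);

https.get(endpoint, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const card = JSON.parse(body);
    // A consuming service would lay these out as the title, image, and snippet of the card.
    console.log(card.title, card.thumbnail_url, card.type);
  });
}).on('error', (err) => console.error(err));
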
Why use these cards? Why not just allow applications to copy and paste links as plaintext URIs? For many end users, URIs are unwieldy. Consider the URI below. Even though copying and pasting mitigates many of the issues with having to type this URI, it is still quite long. There is also very little information in the URI indicating to what document it will lead the end user.

https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582

Now consider the following social card from Facebook for this same URI. The card tells the user that it is from Google Maps and contains directions from Old Dominion University to Los Alamos National Laboratory. Most importantly, it does not require that the user know any details about how Google Maps constructs its URIs.

A social card on Facebook generated from a Google Maps URI that represents a document providing directions from Old Dominion University to Los Alamos National Laboratory.

In effect, social cards are visualizations of web objects, piercing the veil created by the opaqueness of a URI. Thanks to social cards, the end user gets some information about the content of the URI before clicking on it, preventing them from visiting a site they may not have time or bandwidth for. In Yasmin AlNoamany's Dark and Stormy Archives (DSA), she uses social cards in Storify stories to summarize mementos from Archive-It collections. These stories take the form of 28 high quality mementos represented by social cards ordered by publication date. The screenshot below shows the Storify story containing links generated by the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.

The Dark and Stormy Archives (DSA) application summarizes Archive-It collections as a collection of 28 well-chosen, high-quality mementos that are ordered by publication date and then visualized as social cards in Storify. This screenshot shows the Storify output of the DSA for Archive-It collection 3649 about the Boston Marathon Bombing in 2013.
A visualization requiring increased cognitive load demands more effort from the end user and, in some cases, hinders performance. Earlier attempts at visualizing Archive-It collections by Padia and others required training in how to use each visualization, and their complexity may have produced increased cognitive load in the end user. A well-chosen, reduced set of links visualized as social cards works better than other visualizations that attempt to summarize web archives due to the low cognitive load required on behalf of the consumer. Each social card is a visualization unto itself; hence a collection of social cards becomes an instance of the visualization technique of small multiples.
Small multiples were initially categorized in 1983 by Tufte in his Visual Display of Quantitative Information, but the technique is present as far back as Eadweard Muybridge's Horse in Motion from 1886. Small multiples allow the user to compare the same attributes across different sets of data. Consider the line graphs in the example below. Each details the expenses for a different department in an organization during the time period ranging from July to December. Note how the same x-axis on each graph allows the viewer to compare the expenses over time between departments. The key is that each visualization places the same kind of data in the same spatial region, allowing for easy comparison.
An example of small multiples. Source: Wikipedia.


Each social card is a data item consisting of multiple attributes. The same attribute for each item is presented in the same spatial region of a given card. This allows the user to scan the list of cards for a given attribute, such as title, without being overwhelmed by the values of the rest of the attributes present. This consistency makes it easy to compare each card in the set. Below is a diagram of a given Storify card with annotations detailing its attributes. This becomes an effective storytelling method for events because users can see the cards in the order that their respective content was written.
Storify cards consist of multiple attributes that are visualized in the same spatial region on each card. This card exists for the live link http://www.history.com/topics/boston-marathon-bombings.
AlNoamany uses Storify in this way, but how well might other tools work for visualizing the output of the DSA? Can they serve as a replacement for Storify?

This post is a re-examination of the landscape since AlNoamany's dissertation to see if there are tools other than Storify that the DSA can use. It covers the tools living in the spaces of content curation, storytelling, and social media. AlNoamany's dissertation lists several that fit into different categories, and understanding these categories led to the discovery of more tools. The tools discussed in this post come from three sources: AlNoamany's dissertation, "Content Curation Tools: The Ultimate List" by Curata, and "40 Social Media Curation Sites and Tools" by Shirley Williams. Curation takes many forms for many different reasons, but not all of them are suitable for the DSA framework. After this journey, I settle on four tools -- Facebook, Storify, Tumblr, and Twitter Moments -- that might be useful contenders.

For some tools, in order to test how well they generated social cards and collections for mementos, I used the URIs from the Boston Marathon Bombing 2013 stories 3649spst0s and 3649spst1s generated as part of AlNoamany's DSA work. If I needed to contrast them with live web examples, I used the URI http://www.history.com/topics/boston-marathon-bombings.

Engaging With Customers


A number of tools exist for the purpose of customer engagement. They provide the ability to curate content from the web with the goal of increasing confidence in a brand.


With Falcon.io, collections can be shared internally so that they can be reviewed by teams in order to craft a message. It allows an organization to curate its own content and coordinate a single message across multiple social channels. It also provides social monitoring and analysis of the impact of that message. Organizations use their curated content to develop plans for addressing trends, dealing with crises (e.g., the recent Pepsi commercial fiasco), and ensuring that customers know the company is a key player in the market (e.g., IBM's Big Data & Analytics Hub). Cision, FlashIssue, Folloze, Spredfast, Sharpr, Sprinklr, and Trap!t are tools with a similar focus. I requested demos and discussions about these tools with their corresponding companies, but only received feedback from Falcon.io and Spredfast, who were instrumental in helping me understand this space.

Roojoom, Curata, SimilarWeb, and Waywire Enterprises focus more on helping influence the development of the corporate web site with curated content. RockTheDeadLine offers to curate content on the organization's behalf. CurationSuite (formerly CurationTraffic) focuses on providing a curated collection as a component of a WordPress blog. These services go one step further and provide site integration components in addition to mere content curation. Curata has a lot of documentation and several whitepapers that helped me understand the reasons for these tools.

Hootsuite, Pluggio, PostPlanner, and SproutSocial focus on collecting and organizing content and responses from social media. They do not provide collections for public consumption in the same way Storify or a Facebook album would. Hootsuite in particular provides a way to gather content from many different social networking accounts at once while synchronizing outgoing posts across all of them.

All of these tools offer analytics packages that permit the organization to see how the produced newsletter or web content is performing. Though these tools do curate content, their primary focus is customer engagement and marketing. Most of them focus on trends and web content in aggregate rather than showcasing individual web resources.

Our focus in this project is to find new ways of visualizing and summarizing Archive-It collections. Though some of these tools might be capable of doing this, their cost and unused functionality make them a poor fit for our purpose.

Focusing on the Present

Some tools allow the user to supply a topic as a seed for curated content. The tool will then use that topic and its own internal curation service to locate content that may be useful to the user. A good example is a local newspaper. A resident of Santa Fe, for example, will likely want to know what content is relevant to their city, and hence would be better served by the curation services of the Santa Fe New Mexican than they would by the Seattle Times. The newspaper changes every day, but the content reflects the local area. 
Paper.li presents a different collection each day based on the user's keywords. I created "The Science Daily", which changes every day. The content for June 4, 2017 (left) is different from the content for June 5, 2017 (right).
This category of curation tools is not limited by geographic location. The input to the system is a set of search terms representing a topic. Paper.li and UpContent allow one to create a personalized newspaper about a given topic that changes each day, providing fresh content to the user. ContentGems is much the same, but supports a complex workflow system that can be edited to supply content from multiple sources. ContentGems also allows one to share their generated paper via email, Twitter, RSS feeds, website widgets, IFTTT, Zapier, and a whole host of other services. DrumUp uses a variety of sources from the general web and social media to generate topic-specific collections. It also allows the user to schedule social media posts to Facebook, Twitter, and LinkedIn. Where Paper.li appears to be focused on a single user, ContentGems and DrumUp easily stretch into customer engagement, and UpContent offers different capabilities depending on which tier the user has subscribed to.
(left) The Tweeted Times shows some of the tweets from the author's Twitter feed.
(right) Tagboard requires that a user supply a hashtag as input before creating a collection. 
The Tweeted Times and Tagboard both focus on content from social media. The Tweeted Times attempts to summarize a user's Twitter feed and publishes that summary at a URI for the end user to consume. Tagboard uses hashtags from Facebook or Twitter as seeds to its content curation system.

The tools in this section focus on content from the present. They do not allow a user to supply a list of URIs to be stored in a collection, and hence are not suitable for inclusion in the Dark and Stormy Archives framework.

Sharing and the Lack of Social Cards


There is a spectrum of sharing. Storify allows one to share their collection publicly for all to see. Other tools expect only subscribed accounts to view their collections. In these cases, subscribed accounts may be acquired for free or at cost. Feedly supports sharing of collections only for other users in one's team, a grouping of users that can view each other's content. Pinboard and Pocket are slightly less restrictive, permitting other portal users to view their content. In addition, both Pinboard and Pocket promise paying customers the ability to archive their saved web resources for later viewing. Shareist only shares content via email and on social media, not producing a web-based visualization of a collection. We are interested in tools that allow us to not only share collections of mementos on the web, but also share them with as few barriers to viewing as possible.

Huzzaz and Vidinterest only support URIs of web resources that contain video. Both support YouTube and Vimeo URIs, but only Vidinterest supports Dailymotion. Neither supports general URIs, let alone URI-Ms from Archive-It. Instagram and Flickr work specifically with images, and they do not create social cards for URIs. Sutori allows one to curate URIs, but does not create social cards. Even though Twitter may render a social card in a tweet, the card is not present when the tweets are visualized in a collection using Togetter.

A screenshot of a Tweet containing a social card for http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a Togetter collection of live links containing the Tweet from above as the fourth in the collection. Note that none of these URIs show a social card, even though Twitter itself rendered one for that tweet.
This screenshot shows a live link http://www.history.com/topics/boston-marathon-bombings inserted into a Sutori story, with no social card.

A test post in Instagram where I attempted to add several URIs as comments, including the URI http://www.history.com/topics/boston-marathon-bombings used in the Twitter example above. Instagram produced no social cards for these URIs and did not make them links either.

Card Size Matters


Some tools change the size of the card for effect, or to allow extra data in one card rather than another. These size changes interrupt the visual flow of the small multiples paradigm I mentioned in the introduction. While good for presenting in newspapers or other tools that collect articles, such size changes make it difficult to follow the flow of events in a story. They create additional cognitive load on the user, forcing her to constantly ask "does this different sized card come before or after the other cards in my view?" and "how does this card fit into the story timeline?"

Flipboard


Flipboard orders the social cards from left to right then up and down, but changes the size of some of the cards.

Flipboard often makes the first social card the largest, dominating the collection as seen in the screenshot above. Sometimes it will choose another card in the collection and increase its size as well. Flipboard also has other issues. In the screenshot below, we see a social card rendered for a live link, but in the screenshot below that we see that Flipboard does not do so well with mementos.
A social card generated in Flipboard for the live URI http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a collection of mementos about the Boston Marathon Bombing stored in Flipboard.

Scoop.it

In this Scoop.it collection, Scoop.it changes the size of some social cards based on the amount of data present in the card.
Scoop.it changes the size of some social cards due to the presence of large images or extra text in the snippet. These changes distort the visual flow of the collection. There are also restrictions, even for paying users, on the amount of content that can be stored, with even a top subscription of $33 per month being limited to only 15 collections.

Flockler

Flockler alters the sizes of some cards based on the information present. Note: because I only had a trial account, this Flockler collection may no longer be accessible.
Flockler alters the size of its cards based on the information present. Cards with images, titles, and snippets are larger than those with just text. As shown below, sometimes Flockler cannot extract any information and generates empty cards or cards whose title is the URI.


A screenshot of social cards generated from Archive-It mementos in a Flockler collection about the Boston Marathon Bombing. The one on top just displays the link while the one in the middle is empty. Links to mementos: top, middle, bottom.

Pinterest


The same mementos visualized in social cards in this Pinterest collection. Pinterest supports collections, but does not typically generate social cards, favoring images.

Pinterest has a distinct focus on images, but does create social cards (i.e., "pins" in the Pinterest nomenclature) for web resources. The system requires a user to select an image for each pin. Interestingly, the first image presented when a user is generating a pin is often the same one that is selected by Storify when it generates social cards. Unfortunately, the images are all different sizes, making it difficult to follow the sequence of events in the story.

In addition to the size issue, if Pinterest cannot find an image in a page or if the image is too small, it will not create a social card. It could not find any images for URI-M http://wayback.archive-it.org/3649/20140404170835/https://sites.tufts.edu/museumstudents/2013/06/27/help-create-the-boston-marathon-bombing-archive/ and all images for http://wayback.archive-it.org/3649/20130422044829/https://twitter.com/LadieAuPair/status/325365298196795394/ were too small.

If an image is too small, Pinterest will issue an error and refuse to post the link.
Pinterest also presents another problem. During the processing of some social cards, Pinterest converts the URI-M into a URI-R. In the screenshot below, we see that the social card bears the domain name "wayback.archive-it.org", but clicking on the card leads one to the card for "newyorker.com".

Juxtapost


As seen in this collection, Juxtapost changes the size of social cards and even moves them out of the way for advertisements (top right text says "--Advertisement--"). Which direction does the story flow?

Juxtapost is the other tool which changes the size of the social cards. In addition, it requires that the end user select an image and insert a description for every card. Even setting aside the changing card sizes, this manual labor alone may make it unsuitable for use in the DSA.

Juxtapost also refuses to add a resource (e.g., http://wayback.archive-it.org/3649/20140408194419/http://www.boston.com/yourtown/news/watertown/2014/01/digital_archive_exhibit_on_marathon_bombing_to_visit_waterto.html) for which it can find no images.

Google+


Google+ collection for the Boston Marathon Bombing viewed with a window size of 2033 x 1254.
The same Google+ collection viewed with a window size of 1080 x 1263.


The same Google+ collection viewed in a window resized to 945 x 1265.
As shown in the screenshots above, the direction and size of the cards in a Google+ collection change depending on the resolution used to view the collection. This is likely a result of adjusting the page for mobile screen sizes. Even though Google+ had no problems generating cards for all of our test mementos, the first figure above does not indicate well in which direction the events in the story unfolded, and thus this information is lost in Google+.


Problems That APIs Might Solve

Of course, the Dark and Stormy Archives software generates its visualization automatically. This makes the use of a web API quite important for the tool. The DSA generates 28 links per Archive-It collection. Would it be acceptable for a human to submit these links to one of these tools much like I have done? What if the collection changes frequently and the DSA must be rerun to account for these changes?

In addition to freeing humans from creating stories, AlNoamany was able to use the Storify API to assist Storify in developing richer social cards, adding dates and favicons to override and improve upon the information that Storify extracted from mementos. The human interface for Storify also had some problems creating cards for mementos, and these problems could be overcome by using the Storify API.

Pearltrees has no API.  I could not find APIs for Symbaloo, eLink, ChannelKit, or BagTheWeb. Listly has an API, but it is not public.

BagTheWeb requires additional information supplied by the user in order to create a social card. As seen below, BagTheWeb does not generate any social card data based solely on the URI. If there were an API, the DSA platform might be able to address some of these shortcomings. Symbaloo is much the same. It chooses an image, but often favors the favicon over an image selected from the article.

This is a screenshot of a social card created by BagTheWeb for http://www.history.com/topics/boston-marathon-bombings.
A screenshot of a card created by Symbaloo for the same URI.
Pearltrees has problems that may be addressed by an API that allows the user to specify information. The example screenshot below displays a Firefox error instead of a selected image in the social card. This is especially surprising because the system was able to extract the title from the destination URI. Pearltrees also tends to convert URI-Ms to URI-Rs, linking to the live page instead of the archived Archive-It page.

A screenshot of two social cards created from Archive-It mementos by Pearltrees in a collection about the Boston Marathon. The one on the left displays a Firefox error instead of a selected image for the memento. Links to mementos: left, right.
The social cards generated by eLink have a selected image, a title, and a text snippet. Sometimes, however, they do not seem to find the image, as seen in the screenshot below. Scoop.it also has similar problems for some URIs, also shown below. An API call that allows one to select an image for the card would help improve this tool's shortcomings.

A screenshot of two social cards generated from Archive-It mementos from an eLink collection about the Boston Marathon Bombing. The one of the left shows a missing selected image while the one on the right displays fine. Links to mementos: left, right.
ChannelKit usually generates nice social cards, complete with a title, text snippet, and a selected image or web page thumbnail. Sometimes, as shown below, the resulting card contains no information and a human must intervene. Listly also has issues with some of the links submitted to it. It usually generates a title, text snippet, and selected image, but in some cases, as shown below, just lists the URI. Flockler also has similar problems, shown below. An API call that allows one to supply the missing information would be helpful in addressing these issues.

A screenshot of the social cards generated from Archive-It mementos in a ChannelKit collection about the Boston Marathon Bombing. The one on the right shows no useful information. Links to mementos: left, right.
A screenshot of social cards generated from Archive-It mementos in a Listly collection about the Boston Marathon Bombing. The one on the top has no information but the URI. The one on the bottom contains a title, selected image, and snippet. Links to mementos: top, bottom.


Curation Tools Useful for Visualization of Archive-It Collections

The final four tools -- Facebook, Storify, Tumblr, and Twitter Moments -- have APIs, produce social cards, and allow for collections. I reviewed them in more detail using the mementos generated by the DSA tool against Archive-It collection 3649 about the Boston Marathon Bombing in 2013, corresponding to this Storify story. I created these collections by hand and did not use the associated APIs. Storify is already in use in the DSA, and hence I did not review it again here.

In this section I discuss these tools and their shortcomings. I also discuss how DSA might be able to overcome some of those shortcomings with the tool's API.

Facebook


Selected mementos from a run of the Dark and Stormy Archives tool on Archive-It collection 3649 about the Boston Marathon Bombing as visualized in social cards in Facebook comments where the collection is stored as a Facebook post.

With 1.871 billion users, Facebook is the most popular social media tool. Facebook supports social cards in posts and comments. Facebook also supports creating albums of photos, but not of posts. Posts do contain comments, however. In order to generate a series of social cards in a collection, I gave the post the title of the collection and supplied each URI-M in the story as a separate comment. In this way, I generated a story much as AlNoamany had done with Storify.

A screenshot of two Facebook comments. The URI-M below generated a social card, but the URI-M above did not.
As seen above, Facebook does occasionally fail to generate social cards for links. The Facebook API could be used to update such comments with a photo and a snippet, if necessary. Providing additional images is not possible, as Facebook posts and comments will not generate a social card if the post/comment already has an image.
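If the DSA were to automate this, a minimal sketch against the Facebook Graph API might look like the following; the post ID, access token, and API version are placeholders, and the permissions needed to post comments programmatically are glossed over.

import requests

GRAPH = "https://graph.facebook.com/v2.10"  # Graph API version assumed for 2017
ACCESS_TOKEN = "EAAB..."                    # placeholder access token
POST_ID = "1234567890_0987654321"           # placeholder ID of the post that titles the story

def add_memento_comment(uri_m):
    # Attach one URI-M to the story post as a comment; Facebook builds the card.
    data = {"message": uri_m, "access_token": ACCESS_TOKEN}
    resp = requests.post("{}/{}/comments".format(GRAPH, POST_ID), data=data, timeout=10)
    resp.raise_for_status()
    return resp.json()["id"]

story_uri_ms = [
    "http://wayback.archive-it.org/3649/20130422044829/https://twitter.com/LadieAuPair/status/325365298196795394/",
    # ... the remaining URI-Ms selected by the DSA
]
for uri_m in story_uri_ms:
    add_memento_comment(uri_m)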

Tumblr

The same mementos visualized in social cards in Tumblr where the collection is denoted by a hashtag.
Weighing in with 550 million users is Tumblr. Tumblr is a complex social media tool supporting many different types of posts. A user selects which type of post they desire and then supplies the necessary data or metadata. For example, if a user wanted to generate something like a Facebook post or a Twitter tweet, they would choose "Text". The interface for selecting a type of post is shown below.

This screenshot shows the interface used by a user when they wish to post to Tumblr. It shows the different types of posts possible with the tool.
The post type "Link" produces a social card for the supplied link. In addition to the social card generated by Tumblr, the "Link" post can also be adorned with an additional photo, video, or text.

All of these post types are available as part of the Tumblr API. If a social card lacks an image, or if the DSA wants to supply additional text, the post can be updated appropriately.

I use hashtags to create collections on Tumblr. The hashtags are confined to a specific blog controlled by that blog's user, hence posts outside of the blog do not intrude into the collection, as would happen with hashtags on Twitter or Facebook.
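Below is a sketch of how the DSA might drive this process through the API, using the pytumblr client; the blog name, OAuth credentials, and collection hashtag are hypothetical, and I am assuming create_link accepts the documented link-post fields.

import pytumblr

# OAuth credentials are placeholders; real values come from registering an app with Tumblr.
client = pytumblr.TumblrRestClient(
    "consumer_key", "consumer_secret", "oauth_token", "oauth_secret")

BLOG = "dsa-demo.tumblr.com"       # hypothetical blog controlled by the DSA
COLLECTION_TAG = "archiveit-3649"  # hashtag denoting this collection

def post_memento(uri_m, title=None, description=None):
    # Create a "Link" post; Tumblr renders the social card for the URI-M.
    # The title/description arguments let the DSA supply information that
    # Tumblr fails to extract on its own.
    return client.create_link(
        BLOG, url=uri_m, title=title, description=description,
        tags=[COLLECTION_TAG])

post_memento("http://www.history.com/topics/boston-marathon-bombings",
             title="Boston Marathon Bombings - Facts & Summary")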

Twitter Moments


This Twitter Moment contains tweets that contain the URI-Ms from our Dark and Stormy Archives summary.

Twitter has 317 million users worldwide. While all of these tools require that the user title the collection in some way, Twitter Moments also requires that the user upload an image separately in order to create a collection. This image serves as the striking image for the collection. The user is also compelled to supply a description.

Sadly, much like Flipboard, Twitter does not appear to generate social cards for URI-Ms from Archive-It. Shown below in a Twitter Moment, the individual URI-Ms are displayed in their tweets with no additional visualization.
Unfortunately, as we see in the same Twitter Moment, tweets do not render social cards for our Archive-It URI-Ms.
DSA could use the Twitter API to add images and additional text (up to 140 characters of course) to supplement these tweets. At that point, the DSA is building its own social cards out of tweets.
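A sketch of that supplementation using the tweepy library appears below; the credentials and image file are placeholders, and the naive 140-character truncation ignores t.co link wrapping.

import tweepy

# Placeholder credentials; real values come from a registered Twitter app.
auth = tweepy.OAuthHandler("consumer_key", "consumer_secret")
auth.set_access_token("access_token", "access_token_secret")
api = tweepy.API(auth)

def tweet_memento(uri_m, summary, image_path=None):
    # Build our own "card": supplementary text plus the URI-M, optionally with an image.
    # Real code would budget for t.co link wrapping rather than naively truncating.
    status = "{} {}".format(summary, uri_m)[:140]
    if image_path:
        media = api.media_upload(image_path)
        return api.update_status(status=status, media_ids=[media.media_id])
    return api.update_status(status=status)

tweet_memento(
    "http://wayback.archive-it.org/3649/20130422044829/https://twitter.com/LadieAuPair/status/325365298196795394/",
    "Reactions on Twitter shortly after the Boston Marathon Bombing",
    image_path="selected_image.jpg")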

Other Paradigms


In this post, I tried to find the tools that could replace Storify as it currently exists, but what about different paradigms of storytelling? The point of the DSA framework is to visualize an Archive-It collection. Other visualization techniques could make use of the tools I have discarded here. For example, Instagram has been used successfully by activist organizations and government entities as a storytelling tool. It is also being actively used by journalists. Even though Instagram works primarily through photos, is there some way we can use it for storytelling as these organizations and journalists have been doing? What other paradigms can we explore for storytelling?

Summary


Considering how Storify is used in the Dark and Stormy Archives framework took me on a long ride through the world of online curation. I read about tools that are used purely for customer engagement, those that live in the perpetual now, those that do not provide public sharing, those that do not provide social cards, and those that do not support our use of small multiples. I reviewed tools that do seem to have some problems generating social cards from Archive-It mementos, and provide no API with which to address the issues.
I finally came down to three tools that may serve as replacements for Storify, with varying degrees of capability. The collections housing the same story derived from Archive-It collection 3649 are here:


Twitter does not appear to make social cards for Archive-It mementos, and hence passes this issue onto Twitter Moments. In this case, Twitter requires that the DSA supply more information than just a URI to create social cards and hence is a poor choice to replace Storify. Facebook and Tumblr do create social cards for most URIs and provide an API that can be used to augment these cards. These tools have 1.871 billion and 550 million users, respectively. Because of this familiarity, they also satisfy one of the other core requirements of the DSA: an interface that people already know how to use.

-- Shawn M. Jones

Acknowledgements: A special thanks to the folks at Flockler for extending my trial, Curata for producing so much trade literature on curation, and Sarah Zickelbrach at Cision, Jeffery at Falcon.io, and Chase Schlachter from Spredfast for answering my questions and helping me to understand the space where some of these tools live.

2017-08-14: Introducing Web Archiving and Docker to Summer Workshop Interns


Last Wednesday, August 9, 2017, I was invited to give a talk to some summer interns of the Computer Science Department at Old Dominion University. Every summer our department invites some undergrad students from India and hosts them for about a month to work on projects under a research lab here as summer interns. During this period, various research groups introduce their work to those interns to encourage them to become potential graduate applicants. Those interns also act as academic ambassadors who motivate their colleagues back in India to pursue higher studies.

This year, Mr. Ajay Gupta invited a group of 20 students from Acharya Institute of Technology and B.N.M. Institute of Technology and supervised them during their stay at Old Dominion University. Like last year, I was again selected from the Web Science and Digital Libraries Research Group to introduce them to the concept of web archiving and the various research projects of our lab. An overview of the talk can be found in my last year's post.



Recently, I was selected as the Docker Campus Ambassador for ODU. I thought it would be a great opportunity to introduce those interns to the concept of software containerization. Among numerous other benefits, it would help them deal with the "works on my machine" problem (also known as the "magic laptop" problem), common in students' lives.


After finishing the web archiving talk, I briefly introduced them to the basic concepts and building blocks of Docker. Then I illustrated the process of containerization with the help of a very simple example.



I encouraged them to interrupt me during the talk to ask any relevant questions, as both topics were fairly new to them. Additionally, I tried to bring in references from Indian culture, politics, and cinema to make the talk more engaging. Overall, I was very happy with the kind of questions they were asking, which gave me the confidence that they were actually absorbing these new concepts and not asking questions just for the sake of grabbing some swag, which included T-shirts and stickers from Docker and Memento.


--
Sawood Alam

2017-08-25: University Twitter Engagement: Using Twitter Followers to Rank Universities


Figure 1: Summing primary and secondary followers for @ODUNow
Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary Twitter accounts referenced on a university's home page. We show that UTE has a significant, positive correlation with expert university reputation rankings (e.g., USN&WR, THE, ARWU) as well as rankings by endowment, enrollment, and athletic expenditures (EEE). As illustrated in Figure 1, we bootstrap the process by starting with the URI for the university's homepage obtained from the detailed institutional profile information in the ranking lists. For each URI, we navigated to the associated webpage and searched the HTML source for links to valid Twitter handles. Once the Twitter screen name was identified, the Twitter GET users/Show API was used to retrieve the URI from the profile of each user name. If the domain of the URI matched exactly or resolved to the known domain of the institution, we considered the account to be one of the university's official, primary Twitter handles since the user had self-associated with the university via the URI reference.

As an example, the user names @NBA, @DukeAnnualFund, @DukeMBB, and @DukeU were extracted from the page source of the Duke University homepage (www.duke.edu). However, only @DukeAnnualFund and @DukeU are considered official primary accounts because their respective URIs, annualfund.duke.edu and duke.edu, are in the same domain as the university. On the other hand, @DukeMBB maps to GoDuke.com/MBB, which is not in the same domain as duke.edu, so we don't include it among the official accounts. Ultimately, we delve deeper into the first and second degree relationships between Twitter followers to identify the pervasiveness of the university community, which includes not only academics, but sports teams, high profile faculty members, and other sponsored organizations.
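A rough sketch of this bootstrapping step is shown below; the handle-extraction regex and domain comparison are simplified relative to the full study (which also resolved redirects and shortened profile URIs), and the bearer token is a placeholder.

import re
import requests
from urllib.parse import urlparse

BEARER_TOKEN = "AAAA..."  # placeholder; obtained from a registered Twitter app
USERS_SHOW = "https://api.twitter.com/1.1/users/show.json"

def candidate_handles(homepage_uri):
    # Extract Twitter screen names referenced in the homepage's HTML source.
    html = requests.get(homepage_uri, timeout=10).text
    return set(re.findall(r"twitter\.com/([A-Za-z0-9_]{1,15})", html))

def is_primary_account(screen_name, university_domain):
    # An account is considered "primary" if its profile URI is in the university's domain.
    resp = requests.get(USERS_SHOW,
                        params={"screen_name": screen_name},
                        headers={"Authorization": "Bearer " + BEARER_TOKEN},
                        timeout=10)
    resp.raise_for_status()
    profile_url = resp.json().get("url") or ""
    host = urlparse(profile_url).netloc.lower()
    # Simplified: the study also resolved shortened/redirecting profile URIs before comparing.
    return host == university_domain or host.endswith("." + university_domain)

handles = candidate_handles("http://www.duke.edu")
primary = [h for h in handles if is_primary_account(h, "duke.edu")]
print(primary)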

We aggregated the rankings from multiple expert sources to calculate an adjusted reputation rank (ARR) for each university which allows direct comparison based on position in the list and provides a collective perspective of the individual rankings. In rank-to-rank comparisons using Kendall's Tau, we observed a significant, positive rank correlation (τ=0.6018) between UTE and ARR which indicates that UTE could be a viable proxy for ranking atypical institutions normally excluded from traditional lists.  We also observed a strong correlation (τ=0.6461) between UTE and EEE suggesting that universities with high enrollments, endowments, and/or athletic budgets also have high academic rank. The top 20 universities as ranked by UTE are shown in Table 1. We've highlighted a few universities where there is a significant disparity between the ARR and the UTE ranking which indicates larger Twitter followings than can be explained just by academic rank.


University                                      UTE Rank   ARR Rank
Harvard University                                     1          1
Stanford University                                    2          2
Cornell University                                     3         10
Yale University                                        4          7
University of Pennsylvania                             5          8
Arizona State University--Tempe                        6         59
Columbia University in the City of New York            7          4
Texas A&M University--College Station                  8         39
Wake Forest University                                 9         74
University of Texas--Austin                           10         16
Pennsylvania State University--Main Campus            11         24
University of Michigan--Ann Arbor                     12         10
University of Minnesota--Twin Cities                  13         16
Ohio State University--Main Campus                    14         22
Princeton University                                  15          4
University of Wisconsin--Madison                      16         14
University of Notre Dame                              17         46
Boston University                                     18         21
University of California--Berkeley                    19          3
Oklahoma State University--Main Campus                20        100

Table 1: Top 20 Universities Ranked by UTE for Comparison With ARR
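As a small illustration of the rank-to-rank comparison, the snippet below computes Kendall's Tau over the first ten rows of Table 1 using SciPy; because the list is truncated, the resulting value will not match the full-study correlations reported above.

from scipy.stats import kendalltau

# (UTE rank, ARR rank) pairs for the first ten rows of Table 1.
ute = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
arr = [1, 2, 10, 7, 8, 59, 4, 39, 74, 16]

tau, p_value = kendalltau(ute, arr)
print("Kendall's tau = {:.4f} (p = {:.4f})".format(tau, p_value))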

We have prepared an extensive technical report of our findings, available on arXiv (linked below). We have also posted all of the ranking and supporting data used in this study, which includes a social media rich dataset containing over 1 million Twitter profiles, ranking data, and other institutional demographics, in the oduwsdl GitHub repository.

- Corren (@correnmccoy)


Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

2017-08-26: rel="bookmark" also does not mean what you think it means

Extending our previous discussion about how the proposed rel="identifier" is different from rel="canonical" (spoiler alert: "canonical" is only for pages with duplicative text), here I summarize various discussions about why we can't use rel="bookmark" for the proposed scenarios.  We've already given a brief review of why rel="bookmark" won't work (spoiler alert: it is explicitly prohibited for HTML <link> elements or HTTP Link: headers) but here we more deeply explore the likely original semantics. 

I say "likely original semantics" because:
  1. the short phrases in the IANA link relations registry ("Gives a permanent link to use for bookmarking purposes") and the HTML5 specification ("Gives the permalink for the nearest ancestor section") are not especially clear, nor is the example in the HTML5 specification.
  2.  rel="bookmark" exists to address a problem, anonymous content, that has been so thoroughly solved that the original motivation is hard to appreciate. 
In our Signposting work, we had originally hoped we could use rel="bookmark" to mean "please use this other URI when you press control-D".  For example, we hoped the HTML at http://www.sciencedirect.com/science/article/pii/S038800011400151X could have:

<link rel="bookmark">http://dx.doi.org/10.1016/j.langsci.2014.12.003</link>

And when the user hit "control-D" (the typical keyboard sequence for bookmarking), the user agent would use the doi.org URI instead of the current URI at sciencedirect.com.  But alas, that's not why rel="bookmark" was created, and the original intention is likely why rel="bookmark" is prohibited from <link> elements.  I say likely because the motivation is not well documented and I'm inferring it from the historical evidence and context.

In the bad old days of the early web, newsfeeds, blogs, forums, and the like did not universally support deep links, or permalinks, to their content. A blog would consist of multiple posts displayed within a single page. For example, page 1 of a blog would have the seven most recent posts, page 2 would have the previous seven posts, etc. The individual posts were effectively anonymous: you could link to the "top" of a blog (e.g., blog.dshr.org), but links to individual posts were not supported; for example, this individual post from 2015 is no longer on page 1 of the blog, and without the ability to link directly to its permalink, one would have to click backwards through many pages to discover it.

Of course, now we take such functionality for granted -- we fully expect to have direct links to individual posts, comments, etc.  The earliest demonstration I can find is from this blog post from 2000 (the earliest archived version is from 2003; here's the 2003 archived version of the top-level link to the blog where you can see the icon the post mentions). This early mention of a permalink does not use the term "permalink" or the relation rel="bookmark"; those would follow later.

The implicit model with permalinks appears to be that there would be > 1 rel="bookmark" assertions within a single page, thus the relation is restricted to <a> and <area> elements. This is because <link> elements apply to the entire context URI (i.e., "the page") and not to specific links, so having > 1 <link> elements with rel="bookmark" would not allow agents to understand the proper scoping of which element "contains" the content that has the stated permalink (e.g., this bit of javascript promotes rel="bookmark" values into <link> elements, but scoping is lost). An ASCII art figure is in order here:

+----------------------------+
|                            |
|  <A href="blog.html"       |
|     rel=bookmark>          |
|  Super awesome alphabet    |
|  blog! </a>                |
|  Each day is a diff letter!|
|                            |
|  +---------------------+   |
|  | A is awesome!!!!    |   |
|  | <a href="a.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for A </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | B is better than A! |   |
|  | <a href="b.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for B </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | C is not so great.  |   |
|  | <a href="c.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for C </a>|   |
|  +---------------------+   |
|                            |
+----------------------------+

$ curl blog.html
Super awesome alphabet blog!
Each day is a diff letter!
A is awesome!!!!
permalink for A
B is better than A!
permalink for B
C is not so great.
permalink for C
$ curl a.html
A is awesome!!!!
permalink for A
$ curl b.html
B is better than A!
permalink for B
$ curl c.html
C is not so great.
permalink for C


In the example above, the blog has a rel="bookmark" to itself ("blog.html"), and since that <a> element appears at the "top level" of the HTML, it is understood that its scope applies to the entire page. In the subsequent posts, the scope of the link is bound to some ancestor element (perhaps a <div> element) and thus does not apply to the entire page. The rel="bookmark" to "blog.html" is perhaps unnecessary, since the user agent already knows its own context URI (a user agent typically knows the URL of the page it is currently displaying, though it might not in some conditions, such as when the page is the response to a POST request), but surfacing the link with an <a> element makes it easy for the user to right-click, copy-n-paste, etc. If "blog.html" had four <link rel="bookmark"> elements, the links would not be easily available for user interaction and scoping information would be lost.

And it's not just for external content ("a.html", "b.html", "c.html") like the example above.  In the example below, rel="bookmark" is used to provide permalinks for individual comments contained within a single post.

+----------------------------+
|                            |
|  <A href="a.html"          |
|     rel=bookmark>          |
|  A is awesome!!!!</a>      |
|                            |
|  +---------------------+   |
|  | <a name="1"></a>    |   |
|  | Boo -- I hate A.    |   |
|  | <a href="a.html#1"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-01 </a>     |   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | <a name="2"></a>    |   |
|  | a series of tubes!  |   |
|  | <a href="a.html#2"  |   |
|  |    rel=bookmark>    |   |
|  | 2017-08-03 </a>     |   |
|  +---------------------+   |
|                            |
+----------------------------+


This style exposes the direct links of the individual comments, and in this case the anchor text for the permalink is the datestamp of when the post was made (by convention, permalinks often have anchor text or title attributes of "permalink", "permanent link", datestamps, the title of the target page, or variations of these approaches).  Again, it would not make sense to have three separate <link rel="bookmark"> elements here, obscuring scoping information and inhibiting user interaction. 

So why prohibit <link rel="bookmark"> elements?  Why not allow just a single <link rel="bookmark"> element in the <head> of the page, which would by definition enforce the scope to apply to the entire document?  I'm not sure, but I guess it stems from 1) the intention of surfacing the links to the user, 2) the assumption that a user-agent already knows the URI of the current page, and 3) the assumption that there would be > 1 bookmarks per page.  I suppose uniformity was valued over expressiveness. A 1999 HTML specification does not explicitly mention the <link> prohibition, but it does mention having several bookmarks per page.

An interesting side note is that while typing self-referential, scoped links with rel="bookmark" to differentiate them from just regular links to other pages seemed like a good idea ca. 1999, such links are now so common that many links with the anchor text "permalink" or "permanent link" often do not bother to use rel="bookmark" (e.g., Wikipedia pages all have "permanent link" in the left-hand column, but do not use rel="bookmark" in the HTML source, though the blogger example captured in the image above does use bookmark). The extra semantics are no longer novel and are contextually obvious.

In summary, in much the same way there is confusion about rel="canonical",  which is better understood as rel="hey-google-index-this-url-instead", perhaps a better name for rel="bookmark" would have been rel="right-click".  If you s/bookmark/right-click/g, the specifications and examples make a lot more sense. 

--Michael & Herbert



N.B. This post is a summary of discussions in a variety of sources, including this WHATWG issue, this tweet storm, and this IETF email thread.

2017-08-27: Media Manipulation research at the Berkman Klein Center at Harvard University Trip Report


A photo of me inside "The Yellow House" -
The Berkman Klein Center for Internet & Society
On June 5, 2017, I started work as an intern at the Berkman Klein Center for Internet & Society at Harvard University under the supervision of Dr. Rob Faris, the Research Director for the Berkman Klein Center. This was a wonderful opportunity to conduct news media related research, and my second consecutive summer of research at Harvard. The Berkman Klein Center is an interdisciplinary research center that studies the means to tackle some of the biggest challenges on the Internet. Located in a yellow house at the Harvard Law School, the Center is committed to studying the development, dynamics, norms, and standards of cyberspace. The center has produced many significant contributions such as the review of ICANN (Internet Corporation for Assigned Names and Numbers) and the founding of the DPLA (Digital Public Library of America).
During the first week of my internship, I met with Dr. Faris to identify the research I would conduct in collaboration with Media Cloud at Berkman. Media Cloud is an open-source platform for studying media ecosystems. The Media Cloud platform provides various tools for studying media, such as Dashboard, Topic Mapper, and Source Manager.
Media Cloud tools for visualizing and analyzing online news
Dashboard helps you see how a specific topic is spoken about in digital media. Topic Mapper helps you conduct in-depth topic analysis by identifying the most influential sources and stories. Source Manager helps explore Media Cloud's vast collection of digital media sources. The Media Cloud collection consists of about 547 million stories from over 200 countries. Some of the most recent Media Cloud research publications include: "Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election" and "Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election."

Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election | Berkman Klein Center

In this study, we analyze both mainstream and social media coverage of the 2016 United States presidential election. We document that the majority of mainstream media coverage was negative for both candidates, but largely followed Donald Trump's agenda: when reporting on Hillary Clinton, coverage primarily focused on the various scandals related to the Clinton Foundation and emails.

Partisan Right-Wing Websites Shaped Mainstream Press Coverage Before 2016 Election, Berkman Klein Study Finds | Berkman Klein Center

The study found that on the conservative side, more attention was paid to pro-Trump, highly partisan media outlets. On the liberal side, by contrast, the center of gravity was made up largely of long-standing media organizations.

Rob and I narrowed my research area to media manipulation. Given the widespread concern about the spread of fake news, especially during the 2016 US General Election, we sought to study the various forms of media manipulation and possible measures to mitigate this problem. I worked closely with Jeff Fossett, a co-intern on this project. My research about media manipulation began with a literature review of the state of the art. Jeff and I explored various research and news publications about media manipulation.

SOME MECHANISMS OF MEDIA MANIPULATION

Case 1: How the Trump-Russia Data Machine Games Google to Fool Americans

A year ago I was part of a digital marketing team at a tech company. We were maybe the fifth largest company in our particular industry, which was drones. But we knew how to game Google, and our site was maxed out.

Roger Sollenberger revealed a different kind of organized misinformation/disinformation campaign from the conventional publication of fake news (pure fiction - fabricated news). This campaign is based on Search Engine Optimization (SEO) of websites. Highly similar, less popular (by Alexa rank), and often fringe news websites beat more popular traditional news sites in the Google rankings. They do this by optimizing their content with important trigger keywords, generating a massive volume of similar, fresh (constantly updated) content, and linking among themselves.

Case 2: Data & Society - Manipulation and Disinformation Online

Media Manipulation and Disinformation Online

Data & Society  introduced the subject of media manipulation in this report: media manipulators such as some far-right groups exploit the media's proclivity for sensationalism and novelty over newsworthiness. They achieve this through the strategic use of social media, memes, and bots to increase the visibility of their ideas through a process known as "attention hacking."

Case 3: Comprop - Computational Propaganda in the United States of America: Manufacturing Consensus Online 

Computational Propaganda in the United States of America: Manufacturing Consensus Online

This report by the Computational Propaganda Project at Oxford University illustrated the influence of bots during the 2016 US General Election and the dynamics between bots and human users. They illustrated how armies of bots allowed campaigns, candidates, and supporters to achieve two key things during the 2016 election: first, to manufacture consensus and second, to democratize online propaganda.
COMPROP: 2016 US General Election sample graph showing network of humans (black nodes) retweeting bots (green nodes)
Their findings showed that armies of bots were built to follow, retweet, or like a candidate's content, making that candidate seem more legitimate and more widely supported than they actually are. In addition, they showed that the largest botnet in the pro-Trump network was almost 4 times larger than the largest botnet in the pro-Clinton network:
COMPROP: 2016 US General Election sample botnet graph showing the pro-Clinton and more sophisticated pro-Trump botnets

WHAT IS MEDIA MANIPULATION?
Based on my study, I define media manipulation as:

calculated efforts taken to circumvent or harness the mainstream media in order to set agendas and propagate ideas, often with the utilization of social media as a vehicle.
A STEP TOWARD STUDYING MEDIA MANIPULATION
On close consideration of the different mechanisms of media manipulation, I saw a common theme. This common theme of media manipulation was described by the Computational Propaganda Project as "Manufactured Consensus." The idea of manufactured consensus is: we naturally lend some credibility to stories we see at multiple different sources, and media manipulators know this. Consequently, some media manipulators manufacture consensus around a story that is pure fabrication (fake news); at other times, they take some truth from a story, distort it, and replicate this distorted version of the truth across multiple sources to manufacture consensus. An example of this manufacture of consensus is Case 1. Manufactured consensus ranges from fringe disinformation websites to bots on Twitter which artificially boost the popularity of a false narrative. The idea of manufactured consensus motivated my attempt to, first, provide a means of identifying consensus and, second, learn to distinguish organic consensus from manufactured consensus both within news sources and on Twitter. I will briefly outline the beginning part of my study to identify consensus specifically across news media sources.
IDENTIFY CONSENSUS ACROSS NEWS MEDIA NETWORKS
I acknowledge there are multiple ways to define "consensus in news media," so I define consensus within news media:
 as a state in which multiple news sources report on the same or highly similar stories.
Let us use a graph (nodes and edges) to identify consensus in news media networks. In this graph representation, nodes represent news stories from media sources, and consensus is captured by edges between the same or highly similar news stories (nodes). For example, Graph 1 below shows consensus between NPR and BBC for a story about a shooting in a Moscow court.
Graph 1: Consensus within NPR and BBC for "Shooting at a Moscow Court" story
Consensus may occur within the mainstream left news media (Graph 2) or the mainstream right media (Graph 3). 
Graph 2: Consensus on the left (CNN, NYTimes, & WAPO) media for the "Transgender ban" story
Graph 3: Consensus on the right (Breitbart, The Blaze, & Fox) for the "Republican senators who killed the Skinny repeal bill" story
Consensus may or may not be exclusive to left, center, or right media, but if there is consensus across different media networks (e.g., mainstream left, center, and right), we would like to capture or approximate the level of bias or "spin" expressed by the various media networks (left, center, and right). For example, on August 8, 2017, during a roundtable at his golf club in Bedminster, New Jersey, President Trump said that if North Korea continues to threaten the US, "they will be met with fire and fury." Not surprisingly, various news media reported this story; in other words, there was consensus within left, center, and right news media organizations (Graph 4) for the "Trump North Korea fire and fury" story.
Graph 4: Consensus across left, center and right media networks for the "Trump North Korea fire and fury story"
Let us inspect the consensus Graph 4 closely, beginning with the left, then the center and right, consider the following titles:
LEFT
Vox:
  1. Calm down: we’re (probably) not about to go to war with North Korea, I’m at least, like, 75 percent sure.
  2. Trump now sounds more North Korea-y than North Korea
Politicus USA:
  1. Trump Threatens War With North Korea While Hanging Out In The Clubhouse Of His Golf Club
Huffington Post, Washington Post, and NYTimes, respectively:
CENTER
The Hill:
RIGHT
Gateway Pundit, The Daily Caller, Fox, Breitbart, Conservative Tribune, Washington Examiner:
  1. WOW! North Korea Says It Is ‘Seriously Considering’ Military Strike on Guam (Gateway Pundit)
  2. North Korea: We May Attack Guam (The Daily Caller)
  3. Trump: North Korea 'will be met with fire and fury like the world has never seen' if more threats emerge (Fox)
  4. Donald Trump Warns North Korea: Threats to United States Will Be Met with ‘Fire and Fury’ (Breitbart)
  5. Breaking: Trump Promises Fire And Fury.. Attacks North Korea In Unprecedented Move (Conservative Tribune)
  6. North Korea threatens missile strike on Guam (Washington Examiner)
In my opinion, on the left, consider the critical outlook offered by Vox, claiming the President sounded like the North Korean Dictator Kim Jong Un. At the center, The Hill emphasized the unfavorable response of some senators due to the President's statement. I think one might say the left and some parts of the center painted the President as reckless due to his threats. On the right, consider the focus on the North Korean threat to strike Guam. The choice of words and perspectives reported on this common story exemplifies the "spin" due to the political bias of the various polarized media. We would like to capture this kind of spin. But it is important to note that our goal is NOT to determine what is the truth. Instead, if we can identify consensus and go beyond consensus to capture or approximate spin, this solution could be useful in studying media manipulation. It is also relevant to investigate if spin is related to misinformation or disinformation.
I believe the prerequisite for quantifying spin is identifying consensus. The primitive operation of identifying consensus is the binary operation of measuring the similarity (or distance) between two stories. I have begun this analysis with an algorithm in development. This algorithm was applied to generate Graphs 1-4. Explanation of this algorithm is beyond the scope of this post, but you may see the algorithm in action through this polar media consensus graph, which periodically computes a consensus graph for left, center, and right media.
Consensus graph generated by an algorithm in development
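While the consensus-identifying algorithm itself is beyond the scope of this post, the primitive operation described above can be illustrated with a simple baseline: compute pairwise story similarity and draw an edge whenever it exceeds a threshold. The TF-IDF/cosine approach and the 0.3 threshold below are illustrative choices of mine, not the algorithm under development.

from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consensus_edges(stories, threshold=0.3):
    # stories: list of (source, text) tuples; returns (source_i, source_j, similarity)
    # edges for every pair of stories whose similarity exceeds the threshold.
    texts = [text for _, text in stories]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sim = cosine_similarity(tfidf)
    edges = []
    for i, j in combinations(range(len(stories)), 2):
        if sim[i, j] >= threshold:
            edges.append((stories[i][0], stories[j][0], round(float(sim[i, j]), 3)))
    return edges

# Hypothetical snippets standing in for full article text.
stories = [
    ("NPR", "Gunmen opened fire inside a Moscow court during a hearing on Tuesday ..."),
    ("BBC", "A shooting at a Moscow court has left several people dead, officials say ..."),
    ("Fox", "North Korea threatens a missile strike on Guam after Trump's warning ..."),
]
print(consensus_edges(stories))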
The second part of our study was to identify consensus on Twitter. I will strive to report the developments of this research, as well as a formal introduction of the consensus-identifying algorithm, when our findings are concrete.
In addition to researching media manipulation, I had the pleasure of seeing the 4th of July fireworks across the Charles River from Matt's rooftop and attending Law and Cyberspace lectures hosted by Harvard Law School professors Jonathan Zittrain and Urs Gasser. I had the wonderful opportunity to teach Python to and learn from my fellow interns, as well as present my media manipulation research to Harvard LIL.

Media manipulation is only going to evolve, making its study crucial. I am grateful for the constant guidance of my Ph.D. supervisors, Dr. Michael Nelson and Dr. Michele Weigle, and am also very grateful to Dr. Rob Faris at Media Cloud and the rest of the Berkman Klein community for providing me with the opportunity to research this pertinent subject.

--Nwala

2017-08-27: Four WS-DL Classes Offered for Fall 2017


An unprecedented four Web Science & Digital Library (WS-DL) courses will be offered in Fall 2017:
Finally, although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching CS 395 "Data Analysis Bootcamp".

I'm especially proud of this semester's breadth of course offerings and the participation by two alumni and one current WS-DL member.

--Michael

2017-09-13: Pagination Considered Harmful to Archiving



Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015


Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015
https://web.archive.org/web/20151030092546/https://www.usnews.com/education/best-global-universities/rankings

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%)  had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursadar and Shipman (2017).  Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews, single page writings) and surveyed regarding their expectation for later access to additional content that was linked from or appeared on the main page.  Specifically, they investigated (1) how relationships between page content affect expectations and (2) how perceptions of content value relate to internet resources. In other words, if I save the main page as a resource, what else should I expect to be saved along with it?

I experienced this paradox first hand when I attempted to locate an historical entry from the 2016 edition of the U.S. News Best Global University Rankings.  As shown in Figure 1, October 30, 2015 is a particular date of interest because on the day prior, a revision of the original ranking for the University at Buffalo-SUNY was reported. The university's ranking was revised due to incorrect data related to the number of PhD awards. A re-calculation of the ranking metrics resulted in a change of the university's ranking position from a tie at No. 344 to a tie at No. 181.


Figure 3 - Summary of U.S. News
https://web.archive.org/web/*/https://www.usnews.com/education/best-global-universities/rankings



 

Figure 4 - Capture of U.S. News Revision for Buffalo-SUNY
https://web.archive.org/web/20160314033028/http://www.usnews.com/education/best-global-universities/rankings?page=19

A search of the Internet Archive, Figure 3, shows the U.S. News web site was saved 669 times between October 28, 2014 and September 3, 2017. We should first note that regardless of the ranking year you choose to locate via a web search, U.S. News reuses the same URL from year to year. Therefore, an inquiry against the live web will always direct you to their most recent publication. As of September 3, 2017, the redirect would be to the 2017 edition of their ranking list. Next, as shown in Figure 2, the 2016 U.S. News ranking list consisted of 750 universities presented in groups of 10 spread across 75 web pages. Therefore, the revised entry for the University at Buffalo-SUNY at rank No. 181 should appear on page 19, Figure 4.
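As a quick sanity check of the page arithmetic (a sketch, not part of the original analysis): with 10 universities per page, rank 181 indeed lands on page 19.

```python
# Map a ranking position to its page number, assuming 10 entries per page.
import math

rank, per_page = 181, 10
print(math.ceil(rank / per_page))  # 19
```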

Page No. | Captures | Start Date | End Date
1  | 669 | 10/28/2014 | 09/03/2017
2  | 434 | 10/28/2014 | 08/15/2017
3  | 171 | 10/28/2014 | 07/17/2017
4  |  43 | 10/28/2014 | 01/13/2017
5  |  37 | 10/28/2014 | 01/13/2017
6  |   7 | 10/28/2014 | 06/29/2017
7  |   4 | 09/29/2015 | 01/13/2017
8  |   1 | 01/12/2017 | 01/12/2017
9  |   2 | 01/28/2016 | 01/12/2017
10 |   1 | 01/12/2017 | 01/12/2017
11 |   2 | 01/12/2017 | 02/22/2017
12 |   1 | 02/22/2017 | 02/22/2017
13 |   1 | 02/22/2017 | 02/22/2017
14 |   2 | 02/22/2017 | 03/30/2017
15 |   2 | 03/12/2017 | 03/12/2017
16 |   2 | 03/30/2017 | 06/30/2017
17 |   2 | 06/20/2015 | 07/16/2017
18 |   4 | 06/19/2015 | 07/16/2017
19 |   3 | 06/18/2015 | 07/16/2017

Table 1 - Page Captures of U.S. News (Abbreviated Listing)

While I could readily locate the main page of the 2016 list as it appeared on October 30, 2015, I noted that subsequent pages were archived with diminishing frequency and over a much shorter period of time. We see in Table 1 that, after the first three pages, there is significant variance in how frequently the remaining pages are crawled. And, as noted earlier, more than a quarter (28%) of the ranking list cannot be reconstructed at all. Ainsworth and Nelson examined the degree of temporal drift that can occur when browsing sparsely archived pages under the Sliding Target policy allowed by the web archive user interface (UI); a user can drift many years in just a few clicks. Since a substantial portion of the U.S. News ranking list is missing, it is very likely that following the page links will produce a hybrid list of universities drawn from different ranking years.
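Per-page capture counts like those in Table 1 can also be gathered programmatically. The sketch below uses the Internet Archive's CDX API; it is illustrative only (the page-range loop and URL construction are my assumptions), not the exact process used in our study.

```python
# Count Internet Archive captures for each paginated U.S. News ranking page.
import json
import urllib.parse
import urllib.request

CDX_API = 'http://web.archive.org/cdx/search/cdx'
BASE = 'https://www.usnews.com/education/best-global-universities/rankings'

def capture_count(page_number):
    """Return the number of captures recorded for one ranking page."""
    target = BASE if page_number == 1 else '%s?page=%d' % (BASE, page_number)
    query = urllib.parse.urlencode({'url': target, 'output': 'json'})
    with urllib.request.urlopen('%s?%s' % (CDX_API, query)) as response:
        body = response.read().decode('utf-8').strip()
    if not body:                      # an empty body means no captures
        return 0
    rows = json.loads(body)
    return max(len(rows) - 1, 0)      # the first row is the field header

if __name__ == '__main__':
    for page in range(1, 76):         # the 2016 list spanned 75 pages
        print(page, capture_count(page))
```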


Figure 5 - Frequency of Page Captures

Ultimately, we found page 19 had been captured three times during the specified time frame. However, the page containing the revised ranking of interest, Figure 4, was not available in the archive until March 14, 2016, almost five months after the ranking list had been updated. Further, in Figure 5, we note heavy activity for the first few and last few pages of the ranking list, which may occur because, as shown in Figure 2, these links are presented prominently on page 1. The remaining intermediate pages must be discovered manually by clicking through successive pages, and Figure 5 shows the sporadic capture pattern for these pages.

Current web designs which feature pagination create a frustrating experience for the user when subsequent pages are omitted from the archive. It was my expectation that all pages associated with the ranking list would be saved in order to maintain the integrity of the complete listing of universities as they appeared on the publication date. My intuition is consistent with Poursardar and Shipman, who, among their other conclusions, noted that navigational distance from the primary page can affect perceptions of what counts as viable content that should be preserved. However, for multi-page articles, nearly 80% of the participants in their study considered linked information on the later pages to be part of the resource. This perception was especially pronounced "when the content of the main and connected pages are part of a larger composition or set of information," as in, perhaps, a ranking list.


Overall, the findings of Poursardar and Shipman, along with our own observations, indicate that archiving systems require an alternative methodology or domain rules that recognize when content spread across multiple pages represents a single collection or composite resource that should be preserved in its entirety. From a design perspective, we can only wonder why there isn't a "view all" link on multi-page content such as the U.S. News ranking list. This feature might offer a way to circumvent paginated design schemes so the Internet Archive can obtain a complete view of a particular web site, especially if the "view all" link is located on the first few pages, which appear to be crawled most often. On the other hand, the use of pagination might also represent a conscious choice by the web designer or site owner to limit page scraping, even though people can still find ways to do so. Ultimately, the collateral damage of this design scheme is an uneven distribution of captures in the archive, resulting in an incomplete archival record.

Sources:

Scott G. Ainsworth and Michael L. Nelson, "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive," International Journal on Digital Libraries 16: 129-144, 2015. DOI: 10.1007/s00799-014-0120-4

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Faryaneh Poursardar and Frank Shipman, "What Is Part of That Resource? User Expectations for Personal Archiving," Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017.
    -- Corren (@correnmccoy)



    2017-09-19: Carbon Dating the Web, version 4.0

    With this release of Carbon Date, we introduce new features to support automated testing and to enforce Python standard formatting conventions. This version is dubbed Carbon Date v4.0.



    We've also decided to switch from MementoProxy to the Memgator aggregator tool built by Sawood Alam.

    Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below.

    The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to only allow 30-day trials with 1,000 requests per month unless you pay. We also discovered a few more use cases for Pubdate extraction by applying it to the mementos retrieved from Memgator. By default, Memgator provides the Memento-Datetime retrieved from an archive's HTTP headers. However, news articles can contain metadata indicating the actual publication date or time, which gives our tool a more accurate estimate of an article's publication time.
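    The idea can be illustrated with a small sketch (this is not Carbon Date's actual extractor; the Open Graph article:published_time property is just one common place such metadata appears):

```python
# Look inside a memento's HTML for an explicit publication timestamp that is
# more precise than the Memento-Datetime HTTP header.
import re

def extract_publication_date(html):
    """Return the first article:published_time meta value found, else None."""
    pattern = (r'<meta[^>]+property=["\']article:published_time["\']'
               r'[^>]+content=["\']([^"\']+)["\']')
    match = re.search(pattern, html, flags=re.IGNORECASE)
    return match.group(1) if match else None

sample = '<meta property="article:published_time" content="2017-09-19T14:05:00Z"/>'
print(extract_publication_date(sample))  # 2017-09-19T14:05:00Z
```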

    What's New

    With APIs changing over time, we decided we needed a proper way to test Carbon Date. To address this, we adopted the popular Travis CI service. Travis CI enables us to test our application every day using a cron job. Whenever an API changes, a piece of code breaks, or code is styled in an unconventional way, we'll get a notification saying something has broken.

    CarbonDate contains modules for getting dates for URIs from Google, Bing, Bitly, and Memgator. Over time the code had accumulated various styles and no consistent convention. To address this, we decided to conform all of our Python code to the PEP 8 formatting conventions.

    We found that when using Google query strings to collect dates, we would always get a date at midnight. This is simply because there is no timestamp, just a year, month, and day. This caused Carbon Date to always choose this value as the earliest date. Therefore, we've changed it to be the last second of the day instead of the first. For example, the date '2017-07-04T00:00:00' becomes '2017-07-04T23:59:59', which gives better precision when estimating the creation timestamp.
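    A minimal sketch of that adjustment (the authoritative code lives in the Carbon Date repository; this is just an illustration):

```python
# Push a date-only value from midnight to the last second of the day.
from datetime import datetime, time

def to_end_of_day(date_string):
    """'2017-07-04T00:00:00' -> '2017-07-04T23:59:59'"""
    day = datetime.strptime(date_string, '%Y-%m-%dT%H:%M:%S').date()
    return datetime.combine(day, time(23, 59, 59)).isoformat()

print(to_end_of_day('2017-07-04T00:00:00'))  # 2017-07-04T23:59:59
```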

    We've also decided to change the JSON output format to something more conventional, as shown below:

    Other sources explored

    It has been a long-term goal to continuously find and add sources to the Carbon Date application that can offer a creation date. However, not all the sources we explore deliver what we expect. Below is a list of APIs and other sources that were tested but were unsuccessful in returning a URI creation date. We explored URL shortener APIs such as:
    The Bitly URL shortener remains the best, as its API allows a lookup of full URLs, not just shortened ones.

    How to use

    Carbon Date is built on top of Python 3 (most machines have Python 2 by default), so we recommend installing Carbon Date with Docker.

    We also host a server version here: http://cd.cs.odu.edu/. However, carbon dating is computationally intensive and the site can only handle 50 concurrent requests, so the web service should be used only for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally via Docker.


    Instructions:

    After installing Docker you can do the following:

    2013 Dataset explored

    The Carbon Date application was originally built by Hany SalahEldeen and described in his 2013 paper. In 2013, a dataset of 1,200 URIs was created to test the application, and it was considered the "gold standard" dataset. It's now four years later, and we decided to test that dataset again.

    We found that the 2013 dataset had to be updated. The dataset originally contained URIs and actual creation dates collected from WHOIS domain lookups, sitemaps, atom feeds, and page scraping. When we ran the dataset through the Carbon Date application, we found that Carbon Date successfully estimated 890 creation dates, but 109 URIs had estimated dates older than their recorded creation dates. This was because various web archives held mementos with creation dates older than what the original sources provided, or because sitemaps may have reported updated page dates as original creation dates. Therefore, we took the oldest archived version of each URI as the actual creation date to test against.

    We found that 628 of the 890 estimated creation dates matched the actual creation date, achieving 70.56% accuracy - compared to 32.78% in the original study by Hany SalahEldeen. Below you can see a second-degree polynomial curve fit to the real creation dates.
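    For readers who want to reproduce that kind of curve, a rough sketch of a second-degree polynomial fit is shown below (the date arrays here are hypothetical stand-ins, not the 2013 dataset):

```python
# Fit a second-degree polynomial to estimated vs. actual creation dates by
# converting dates to ordinal day numbers so numpy can work with them.
from datetime import date
import numpy as np

actual = [date(2005, 3, 1), date(2008, 7, 15), date(2011, 1, 9), date(2013, 6, 2)]
estimated = [date(2005, 5, 20), date(2008, 7, 15), date(2010, 11, 30), date(2013, 6, 2)]

x = np.array([d.toordinal() for d in actual], dtype=float)
y = np.array([d.toordinal() for d in estimated], dtype=float)

coefficients = np.polyfit(x, y, deg=2)   # second-degree fit
fitted = np.polyval(coefficients, x)     # fitted values, e.g., for plotting
print(coefficients)
```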

    Troubleshooting:

    Q: I can't install Docker on my computer for various reasons. How can I use this tool?
    A: If you can't use Docker, then my recommendation is to download the source code from GitHub, create a Python virtual environment, and install the dependencies with pip from there.

    Q: After x amount of requests, Google doesn't give a date when I think it should. What's happening?
    A: Google is very good at catching programs (robots) that aren't using its APIs. Carbon Date does not use an API, but rather issues a query string like a browser would and then looks at the results. You might have hit a Captcha, so Google might lock Carbon Date out for a while.

    Q: I sent a simple website like http://apple.com to Carbon Date to check its creation date, but it says it was not found in any archive. Why is that?
    A: Websites like apple.com, cnn.com, google.com, etc., all have an exceedingly large number of mementos. The Memgator tool searches for tens of thousands of mementos for these websites across multiple archives. This request can take minutes, which eventually leads to a timeout, which in turn means Carbon Date will return zero archives.

    Q: I have another issue not listed here; where can I ask questions?
    A: This project is open source on GitHub. Just navigate to the issues tab, start a new issue, and ask away!

    Carbon Date 4.0? What about 3.0?

    With this being Carbon Date 4.0, there have been three previous blog posts for this project! You can find them here:
    -Grant Atkins

    2017-10-16: Visualizing Webpage Changes Over Time - new NEH Digital Humanities Advancement Grant

    In August, we were excited to be awarded an 18-month Digital Humanities Advancement Grant from the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services (IMLS).  Our project, "Visualizing Webpage Changes Over Time", was one of 31 awards made through this joint NEH/IMLS program (award announcement).
    Michele C. Weigle and Michael L. Nelson - ODU
    Deborah Kempe - Frick Art Reference Library and New York Art Resources Consortium
    Pamela Graham and Alex Thurman - Columbia University Libraries
    Oct 2017 – Mar 2019, $75,000
    As web archives grow in importance and size, techniques for understanding how a web page changes through time need to adapt from an assumption of scarcity (just a few copies of a page, no more than a few weeks or months apart) to one of abundance (tens of thousands of copies of a page, spanning as much as 20 years). This project, a joint effort among ODU, the New York Art Resources Consortium (NYARC), and Columbia University Libraries (CUL), will research and develop tools for efficient visualization of and interaction with archived web pages. This work will be informed by and in support of CUL’s and NYARC’s existing web archiving activities.


    This project is an extension of AlSum and Nelson's Thumbnail Summarization Techniques for Web Archives, published in ECIR 2014 (presentation slides), and our previous work, funded by an incentive grant from Columbia University Libraries and the Andrew W. Mellon Foundation.

    For this project, we will develop
    1. a tool for visualizing web page changes in arbitrary web archives
    2. a plug-in for the popular Wayback Machine web archiving system
    3. scripts for easy embedding of the visualizations in live web pages, providing tighter integration of the archived web and live web. 
    The visualizations we will develop fall into three main categories:
    • grid view - This view would show the entire thumbnail summary in a grid.
    • interactive timeline view - This view would place the thumbnail summary on an interactive timeline. Depending upon the size of the TimeMap, other mementos (those not selected as part of the summary) may be indicated on the timeline as well.
    • single thumbnail view - The area of this view would be a single thumbnail. In addition to the standard screenshot, we will also develop visualizations that employ a Twitter-style card. We propose to develop several instances of this view:
      • image slider - This view would be similar to iPhoto image previews, where the thumbnail image changes as the user moves the mouse across the image.
      • animated GIF - This view would automatically cycle through the selected thumbnails, similar to our “What Did It Look Like?” service.
      • video - This view would be a video of the thumbnails. We would use existing services, such as YouTube or Instagram, which allow annotation with links to mementos and direct access to particular thumbnails. For instance, YouTube provides access to particular points in a video through the #t={mm}m{ss}s URL parameter (e.g., https://www.youtube.com/watch?v=7CahU8bSgyA#t=0m22s).
    We are grateful for the continued support of NEH and IMLS for our web archiving research and look forward to producing exciting tools and services for the community.

    -Michele and Michael

    2017-10-24: Grace Hopper Celebration of Women in Computing (GHC) 2017

    This year's Grace Hopper Celebration 2017 (@ghc, #GHC17), the world's largest gathering of women technologists, took place in Orlando, Florida, at the Orange County Convention Center on October 4-6. The events occurred in two locations: the Orange County Convention Center West (OCCC) and the Hyatt Regency (Hyatt), which are directly connected by a skybridge.

    GHC is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994. Grace Hopper attendees grew from 500 in 1994 to over 18,000 in 2017 with around 700 speakers.


    This was my first time attending the conference, and I was fortunate to receive an AnitaB.org travel scholarship to attend GHC. The scholarships cover registration, travel, hotel, and meal expenses. Only 657 were selected from almost 15,000 applicants. In addition, three other graduate students from the Department of Computer Science at Old Dominion University attended the conference: Aida Ghazizadeh presented a poster, Maha Abdelaal received a Google scholarship, and Wessam Elhefnawy was a Hopper scholar. I was also lucky to have my sister Lamia Alkwai attending the conference; she is a grad student at King Saud University, Riyadh, Saudi Arabia, and also received an AnitaB.org travel scholarship. Previously, Yasmin AlNoamany, a PhD graduate from @ODUCS / @WebSciDL, attended GHC in 2015, 2014, and 2013.


    Before the conference, a phone application named GHC 17 was introduced that includes a news feed, schedule, sponsors, top companies, resources, and more. I really liked this application because it makes it easy to build your own selection of talks to attend, with the location, time, and a brief description of each talk. Another good resource is the GHC Scholars Facebook Group, where questions are answered promptly and you can connect with other GHC attendees.

    The conference schedule had inspiring talks and events which included keynote speakers, presentations, panels, workshops, a career fair, and a poster session. This year there were 20 different tracks which are: career, community, CRA-W, student opportunity, ACM research competition, general poster session, general session, lunches and receptions, special sessions, IOT/Wearable tech, products A to Z, artificial intelligence, computer system engineering, data science, human computer interaction, interactive/privacy, software engineering, open source, and organization transformation.

    Tuesday, Oct 3
    First Timers Orientation
    On Tuesday, there was the first-timers orientation, where five speakers gathered to talk about the keynotes, featured speakers, types of sessions, student and faculty highlights, networking opportunities, planning your GHC survival skills, and staying engaged with AnitaB.org. The presenters in this session were Kathryn Magno, the Sr. Program Manager of GHC Operations at AnitaB.org; Rhonda Leiva, the Senior Program Manager of Student Programs at AnitaB.org; Beth Roberts, the Manager of GHC Content Operations; Stuti Badoni, the Director of GHC Content; and Radha Jhatakia, the Program Manager of GHC Content. Some of the tips mentioned were to not miss the keynotes, connect with speakers and students, check out the poster session, and attend the career fair. Other notes included wearing comfortable shoes, drinking plenty of water, and pacing yourself. They also mentioned the "I AM" Movement, a sign that you can write on to define who you are in your own words. They concluded the talk by answering some of the audience's questions.


    Wednesday, Oct 4
    Wednesdays Keynote
    Wednesday's keynote started with Aicha Evans, the Chief Strategy Officer at Intel Corporation and a Board Member at AnitaB.org. She welcomed everyone and mentioned that today's women technologists are shaping the future of technology. She introduced Dr. Fei-Fei Li, a Professor and Director of the Stanford University AI Lab and the Chief Scientist at Google Cloud AI/ML. Dr. Fei-Fei Li talked about her experience as an immigrant when she moved to the United States at age 16, where she had to learn the language and understand a new culture. She also discussed the research she has been doing on Artificial Intelligence.


    After that, Vicki Mealer-Burke introduced the winner of the Technical Leadership ABIE Award. This year's winner was Diane Greene, the CEO of Google Cloud and a Board Member at Alphabet. People described her as a problem solver who understands technology and how to bring it into the marketplace at a foundational level. She believes that one of the biggest contributions to her success was her father, who gave her confidence at an early age and shaped her life journey; he would hand her the steering wheel and let her navigate the sailboat at age three. She also discussed the projects she worked on early in her career and the challenges she faced while working on VMware.


    Next, Megan Smith, a former U.S. Chief Technology Officer, started her talk by showing her "I AM" Movement sign - "I am so ready to upgrade tech culture with all of you." She talked about diversity and how to find jobs for people. She discussed the challenges of tech diversity and said it is time to make real progress on this matter. She showed that in 2016 only 21.7% of the US technical workforce were women, and in 2017 that had risen to 22.95%. This rate is considered low, and we must work hard to make increasing it a leadership priority. To that end, four companies that made the most progress were celebrated at GHC:
      
    1- The first company, which made the most progress with senior executives, technical executives, and others, was IBM.
    2- The second company to receive an award was Accenture, in the category of technical workforces over 10,000.
    3- The third company was GEICO, in the category of technical workforces of 1,000-10,000.
    4- The fourth company was ThoughtWorks, in the category of technical workforces under 1,000.


      Next, Monique Chenier, the Director of Employment at Hackbright Academy, presented the ABIE award for social impact. The Social Impact ABIE Award was presented to Dr. Sue Black OBE. Dr. Black talked about the struggles she faced early in her life that led her to found #TechMums, which empowers mothers and their families through technology. TechMums is currently working with 500 mothers in the UK to give them the skills, knowledge, and confidence to build successful lives. Dr. Black said, "If you help a million mothers, you are helping at least two million people because every single mother is a caretaker for at least one other person." Dr. Black is also the founder of BCSWomen, the UK's first online network for women in tech.


      Next, Melinda Gates, the Co-Chair of the Bill & Melinda Gates Foundation, talked about her early relationship with computers in her high school years. Afterwards, she talked about her projects at Microsoft and the big risks that you must take to move forward. Then she spoke about the importance of diverse teams. She pointed out that in 2015 women made up only 25% of the tech workforce and held only 15% of technical roles. She also mentioned that tech is becoming part of our lives and will continue to grow, that AI should be taught the best of what humanity has to offer, and that AI should be built by all genders and all ethnicities. She discussed the importance of creating many pathways for people who have interest and talent in tech.


      The next ABIE award was presented by Cindy Finkelman, the CIO of FactSet Research Systems. The winner of the Student of Vision ABIE Award was Mehul Raje, a Master of Engineering student at Harvard University. She talked about the steps she is taking to increase the number of women in computing.

      Dr. Vicki Hanson, the President of the Association for Computing Machinery (ACM), introduced Dr. Telle Whitney, who co-founded GHC along with Dr. Anita Borg and served as president of AnitaB.org for 15 years. Dr. Whitney talked about the work that GHC does to create cultures where women thrive and about the goals she hopes will be accomplished. Dr. Fran Berman, the Hamilton Distinguished Professor of Computer Science at the Rensselaer Polytechnic Institute and the Chair of the AnitaB.org Board of Trustees, introduced Dr. Brenda Darden Wilkerson, the President and CEO of AnitaB.org. Dr. Wilkerson was excited to continue the journey that Dr. Telle and Dr. Anita started.


      Wednesday Keynote Speakers GHC17: Dr. Fei-Fei Li, Professor and Director, Stanford University AI Lab; Chief Scientist, Google Cloud AI/ML, Diane Greene, ABIE Award Winner, Megan Smith, Former U.S. Chief Technology Officer, Dr. Sue Black OBE, ABIE Award Winner, Melinda Gates, Co-chair, Bill & Melinda Gates Foundation, Mehul Raje, ABIE Award Winner


      Presentation Session: Artificial Intelligence Track
      In the Artificial Intelligence track, I attended the presentation AI for Social Good, which had two talks. The first was "Can Machine Learning, IoT, Drones, and Networking Solve World Hunger?", presented by Jennifer Marsman, a Principal Software Development Engineer at Microsoft. She talked about fighting world hunger using machine learning, drones, IoT, and networking research for precision agriculture. Recent studies show that food production must double by 2050 to meet demand from the world's growing population. This work uses precision agriculture, a farming management concept based on observing, measuring, and responding to inter- and intra-field variability in crops. They use drones, together with sensors placed to measure the amount of water, to help accomplish this. They had several challenges, such as sensor maintenance, power consumption, and where to place the sensors. This approach was tested on 100 acres of farmland in Carnation, Washington, and on 2,000 acres in upstate New York. A demo of this work can be found at Microsoft Cognitive Services.



      The second presentation was "Bias in Artificial Intelligence," presented by Neelima Kumar, a Software Manager at Oracle. The presenter talked about the shortcomings of Artificial Intelligence (AI) and showed real-life examples of AI being racist or sexist. She first asked what we picture in our minds when the word nurse is used: is it a female nurse or a male nurse? Another example was the Google Photos app that categorized a picture of a 22-year-old and his friend as gorillas. She then discussed where bias is introduced in AI. In AI, the steps usually include collecting and annotating training data, training the model, and finally producing the output; bias can appear in every step. To address the bias, we need to build awareness of possible biases, push for inclusion and diversity, and work with the communities affected most. The presentation concluded with the phrase "AI will change the world, but who will change AI?"



      Career Fair
      The career fair was huge; it had most of the major companies, such as Google, Microsoft, IBM, Facebook, Snapchat, and many more. Each company discussed the different opportunities they have for women. I would recommend having your CV printed out if you are looking for an internship or a job, as there are also on-the-spot interviews.

      Poster Session
      There were also two poster sessions: the ACM Student Research Competition, where Aida Ghazizadeh presented her poster, and the GHC general poster session. It was interesting to see all the different research people were doing. There was a poster competition, and the winners were announced at Friday's keynote.


      Thursday, Oct 5
      Thursday Keynote
      Thursday's keynote started with Ana Pinczuk, the Senior Vice President at Hewlett Packard Enterprise and a Board Member at AnitaB.org, presenting three more ABIE award winners. She introduced Mary Spio, the Founder and CEO of CEEK VR. Mary started by showing her "I AM" Movement sign, on which she wrote, "I am changing the face of innovation and I need your help." She mentioned her struggles in her early days: when she came to the US at age 16, she worked at McDonald's, and when she got her first paycheck she thought it was more than she should have received and asked her boss if it was a mistake. Her boss thought she was complaining and instantly offered her a higher-paying job. This was her first lesson to not be afraid to ask for more. After that, she joined the army, then got a scholarship to go to college. Her career then bloomed, and she received patents for technologies she developed. Later she did tours on behalf of the US to spread American goodwill. She then worked on CEEK VR, a virtual reality eyewear experience. She concluded by discussing the importance of diversity and how it is a necessity for innovation.


      The next ABIE award was presented by Astrid Atkinson, the Engineering Director at Google Product Infrastructure. She presented the winner of the Change Agent ABIE Award, Marie Claire Murekatete, a Software Manager at the Rwanda Information Society Authority (RISA) and the Founder of Refugee Girls Need You. She discussed the challenges she faced, such as not owning a computer while in college. "Not giving up and having endless curiosity; my wish for all of you is to enjoy life to the fullest, overcome any challenges that come your way, and find a tribe of wonderful people who can help you say yes to big opportunities," said Marie Claire Murekatete of what she learned from her experiences in life.


      Cathy Scerbo, the VP of IT Business Operations at Liberty Mutual Insurance, recognized the next ABIE award winner. The Leadership ABIE Award was presented to Mercedes Soria, Vice President of Software Engineering at Knightscope, Inc. She discussed the challenges she faced coming from Ecuador and not knowing how to speak English, but she overcame them and always focused on her studies. She concluded her talk by showing her "I AM" Movement sign: "I am a crime fighter and I develop technology that saves people's lives."


      Next was Debbie Sterling, the Founder and CEO of GoldieBlox. Debbie started by showing her "I AM" Movement sign, which says, "I am a pink aisle disruptor." She talked about her journey in finding her passion. She studied mechanical engineering and product design at Stanford University. She and her friend had the idea of building construction toys for girls. Bit by bit, she started to come up with toys that girls would enjoy and that would let their imaginations take off. She ended up creating the company GoldieBlox, which is based on a girl engineer who solves problems. The company was founded in 2012, has been recognized as a leader in children's entertainment, and has reached billions of consumers.

      After that, Kim Garner, the Vice President of Advisory Services at Neustar, Inc., introduced the winner of the Technology Entrepreneurship ABIE Award. The award was presented to Dr. Laura Mather, the CEO and Founder of Talent Sonar. Dr. Mather's startup experience was in the cybersecurity space, working with companies such as Bank of America and Apple. After that, she turned to using technology to address unconscious bias. She explained that the main way to address this bias is to change processes such as hiring, decision making, and promotions.



      Thursday Keynote Speakers GHC17: Mary Spio, Founder and CEO, CEEK VR, Marie Claire Murekatete, ABIE Award Winner, Mercedes Soria, ABIE Award Winner, Debbie Sterling, Founder and CEO, GoldieBlox, Dr. Laura Mather, ABIE Award Winner


      Featured Speaker: Yasmin Mustafa
      In the special sessions track, I attended the presentation by featured speaker Yasmin Mustafa, the Founder at ROAR, Coded by Kids, Temple University. She is a refugee of the Persian Gulf War. She talked about her journey in coming to the US and her struggles in life. She became an American citizen, started running a blog, built her audience, and fell in love with marketing. She then started the Philadelphia chapter of Girl Develop It, an organization that aims to get more women to learn programming in supportive environments. This enabled her to travel around South America for six months, where she encountered many women who had been attacked or harassed. This inspired her to create a self-defense wearable technology company aimed at diminishing attacks against women and addressing the underlying causes of violence. She founded ROAR for Good to create these wearable devices and became its CEO. The takeaway of her talk was, "You have to be a little naive and crazy to start a company, don't listen to naysayers, and you can start a tech company as a non-techie." You can find Yasmin's TEDx Talk, "The Birth Lottery Does Not Define You".


      Presentation Session: Artificial Intelligence Track
      The next session I attended was in the AI track, titled "Recommendation Systems". The first talk was "Buy It Again: Repeat Purchase Recommendations for Consumables", presented by Aditi Bhattacharyya, a Technical Lead in the Amazon Personalization organization. She started by asking the audience if they had ever used Amazon, and almost all of them had. She explained that Amazon uses personalized recommendations for each customer and gave an overview of the types of recommendations that exist: collaborative filtering, which collects information from other users who previously bought similar items, and item-to-item collaborative filtering, which recommends items in the same category as the customer's purchase history.

      For consumables such as groceries, house supplies, office supplies, and health care items, customers buy the same items repeatedly, so the model has to predict when the user will need to buy an item again. She then discussed how the model is built. It starts from the repeat purchase count; however, a challenge is that an item may no longer be relevant to the customer, such as baby formula. A time decay model assigns weights to purchases based on time, with the weight decaying over time; this also has a drawback, because after a certain amount of time an item's weight can become very low even though the item is still relevant. These challenges led to a consumption rate, derived from the repeat purchase rate and the timestamps of the first and last purchase. To estimate the repeat purchase score of an item even when the user has purchased it only once, the item's signal from other customers is used: the repeat purchase score of the item is based on the repeat purchase rate and the aggregated purchase signal of the item across all customers. This means that both the repeat signal of the individual customer and that of all customers is considered in the recommendation. The threshold for making a recommendation is based on the repeat customer count, the total customer count, and the repeat customer probability; an item is repeat-purchasable if both the repeat customer count and the repeat customer probability exceed thresholds calculated during offline and online processing, and candidates are finally ranked in descending order of repeat purchase score.

      To make these calculations feasible over all items, the work is divided into offline and online processes. The offline process includes the expensive computations, the build process, and offline filtering; the online process includes real-time service lookup, online filtering, and front-end rendering. For evaluation, both offline and online metrics are used: precision and recall offline, and an A/B testing framework online that compares the current recommendations against the newly introduced ones using metrics such as purchases, views, click rates, and engagement. They found that this method of recommendation improved the purchase rate.
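      To make the time-decay idea concrete, here is a toy sketch (my own illustration of the concept described in the talk, not Amazon's model): each past purchase contributes a weight that decays with age, and the customer's own signal is blended with an aggregate signal across all customers.

```python
# Toy time-decayed repeat-purchase score, blended with an aggregate signal.
import math
from datetime import date

def decayed_repeat_score(purchase_dates, today, half_life_days=90.0):
    """Sum of exponentially decayed weights over a customer's past purchases."""
    score = 0.0
    for purchased in purchase_dates:
        age_in_days = (today - purchased).days
        score += math.exp(-math.log(2) * age_in_days / half_life_days)
    return score

def blended_score(customer_score, aggregate_score, alpha=0.7):
    """Blend the customer's own repeat signal with the all-customer signal."""
    return alpha * customer_score + (1 - alpha) * aggregate_score

today = date(2017, 10, 6)
history = [date(2017, 4, 1), date(2017, 7, 1), date(2017, 9, 15)]
print(blended_score(decayed_repeat_score(history, today), aggregate_score=1.2))
```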



      The second talk was "Learning to be Relevant: Course Recommender Systems for Online Education", presented by Shivani Rao of LinkedIn. The goal of the talk was to explain the algorithms behind course recommendations for the online education platform. In general, because skills come and go and the job market is dynamic, learning does not stop after school and there is a need for lifelong learning. Creating member-to-course recommendations involves both offline and online processing. The offline process combines member data and course data into member-to-course recommendations, which are stored in Voldemort, a key-value store, for high-scalability storage. The online process then serves these through a Rest.li service to the front end. Usually the offline process is used for email flows, while the online process can be used for online feeds and other online activities. To solve the cold start problem, they use the information provided by the user in their LinkedIn profile. However, since only 30% of users fill in the profile, there is an option to use a member's inferred title and the most distinctive skills associated with it, based on the member's cohort. Another challenge was that only 2% of skills are covered by manual tagging; the solution was to learn a supervised model using the manual tags as labels. Selecting courses for members is done by scoring and then ranking the courses. An additional challenge is scale, because each model produces recommendations for 500M+ members and 200+ courses. One solution is that a model is not served to all members, and the same model may be served in different channels with different treatments. The last topic discussed was micro-content and its challenges, such as how to identify the videos that are useful, which is solved using key features. More information on this topic can be found here.

      The final talk in this session was "Related Pins at Pinterest: Evolution of a Real-World Recommender System", presented by Jenny Liu, a Software Engineer at Pinterest. The talk started with an overview of Pinterest. She noted that 40% of Pinterest views and saves come from related-pins recommendations. The main principles for building such systems are to start simple, keep it simple, and optimize for iteration. The major pieces of related pins are candidate generation, memorization, and learning to rank. Candidates can be generated from user boards and board co-occurrence. Since calculating the co-occurrence of every pin is computationally expensive, they started off with random sampling and a heuristic score, which worked really well. To find candidate recommendations for pins that are connected to only one board, they look for pins that are more than one hop away; they use the Pixie system to place all the pins in memory and perform random walks of over 100,000 steps. However, in this system newly added pins are not visited during the random walks, so they added fresh candidates, which are the recently added pins. The next major piece of related pins is memorization. When checking what people are clicking, we need to know the reason for the click: is it relevance or position? To account for this, they added position normalization, adjusting for the position at which a pin was shown. The final major piece is learning to rank: each candidate's features are converted into a numerical feature vector and passed to a scoring model, which returns a score by multiplying the features by learned weights. This method increased pin saves. However, it had some challenges: a linear model is not expressive enough, there were feedback loops, and it was hard to experiment with models. The linear model limitation arises because a feature's weight is the same for all candidates; this was solved by moving to a gradient-boosted decision tree. The feedback loop problem is that the same candidates keep showing up over and over again; this was solved by dedicating 1% of the traffic to low-scoring candidates. The final challenge was that experimenting with models offline was hard because of the huge amount of data, the time it consumed, and the fact that personalization is not possible with offline results; the outcome was investing more in online serving.
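      The candidate-generation step can be pictured with a toy random walk over a pin-board graph (an illustration of the general idea, not Pinterest's Pixie system):

```python
# Generate related-pin candidates by hopping pin -> board -> pin repeatedly
# from the query pin and counting which pins are visited most often.
import random
from collections import Counter

pin_to_boards = {            # hypothetical toy graph
    'p1': ['b1', 'b2'],
    'p2': ['b1'],
    'p3': ['b2', 'b3'],
    'p4': ['b3'],
}
board_to_pins = {}
for pin, boards in pin_to_boards.items():
    for board in boards:
        board_to_pins.setdefault(board, []).append(pin)

def related_pins(query_pin, steps=100000, top_n=3):
    visits = Counter()
    current = query_pin
    for _ in range(steps):
        board = random.choice(pin_to_boards[current])
        current = random.choice(board_to_pins[board])
        if current != query_pin:
            visits[current] += 1
    return visits.most_common(top_n)

print(related_pins('p1'))
```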


      Friday, Oct 6
      Presentation Session: Artificial Intelligence Track
      I attended the presentation Real World Applications of AI. The first presentation in this session was "Bring Intelligence to Resource Utilization", presented by Xiaoqian Liu, a Data Engineer at Salesforce. Salesforce Einstein is a layer within the Salesforce platform that infuses artificial intelligence features and capabilities across all Salesforce clouds; it builds a specific model for each customer per app. Successful resource assignments improve job efficiency and costs. A case study applied reinforcement learning (RL) to Spark job parameter tuning in real time. The workflow includes job generation, the model, and the Hadoop server; the next steps were system implementation and then the experiment phase. The results show a slight downtrend in the output graph, which indicates that further improvements are needed.


      The next presentation was "Recommending Dream Jobs in a Biased Real World", presented by Nadia Fawaz, a Staff Software Engineer at LinkedIn. Recommendation systems trained on biased data may reflect that bias, which is especially consequential when it affects job finding. The technical challenges lie in training, evaluating, and deploying a large-scale recommendation system based on biased real-world data. This talk presented an overview of Jobs You May Be Interested In (JYMBII), a personalized list of jobs for the user. The list is generated by a machine learning model trained to predict a relevance score measuring how relevant each job is for a specific LinkedIn member. The model is trained on data such as member information, job postings, and previous interactions between members and jobs, such as clicks, saves, and dismisses. This processing is performed both online and offline.

      So where does the bias come from? It comes from different stages: in the data, such as gender gaps in positions (more women than men in educational roles and fewer in tech roles); in how the data is collected, such as low-ranked pages from which data is never collected, or position bias when the user clicks only on the top jobs instead of looking at all the options; and in the model itself, when the model is too simple, with too few parameters or insufficient features to fully represent the outcome. Reducing bias matters because biased outcomes may be considered unacceptable or even be illegal; in business terms, members care about the quality of recommendations, and removing bias can produce a lift in the business; and technically, bias in the model can distort performance metrics in both online and offline evaluations.

      Reducing the online bias in the data can be done using two methods: a fully random bucket for models that might not otherwise have been chosen, and session-based top-k randomization after ranking, which randomizes the positions of the first k results. This can be done using explore/exploit parsimonious randomization. The bias in the training dataset can be reduced by augmenting the dataset with random negatives, inferred positives, and high-quality manual tagging. This can be evaluated with a replay method: take the random-bucket dataset as input, rerank it with the algorithm being evaluated, and compute the top-k reward on matched inputs.
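      The session-based top-k randomization mentioned above is easy to picture with a small sketch (an illustration of the idea, not LinkedIn's implementation): after ranking, only the first k results are shuffled, and the shuffle is tied to the session so it stays stable within it.

```python
# Shuffle only the top k of a ranked list, seeded per session.
import random

def top_k_randomize(ranked_items, k, session_seed):
    """Randomize the order of the first k items; leave the tail untouched."""
    rng = random.Random(session_seed)
    head, tail = list(ranked_items[:k]), list(ranked_items[k:])
    rng.shuffle(head)
    return head + tail

ranked_jobs = ['job-A', 'job-B', 'job-C', 'job-D', 'job-E']
print(top_k_randomize(ranked_jobs, k=3, session_seed='member-123:session-42'))
```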


      The final presentation in this session was "Where DNA Meets AI - and How You Can Help!", presented by Amanda Fernandez, an Assistant Professor in Practice at the University of Texas at San Antonio. This talk was a general overview of connecting DNA with machine learning and AI solutions. To get started in this field, first learn a programming language and familiarize yourself with the research problems. Two recommended resources for bridging CS and biology are Deep Learning for Health Informatics (Ravi et al., 2016) and Deep Learning for Computational Biology (Angermueller et al., 2016). Some helpful and recommended open source resources are Google's PAIR Initiative, Kaggle, and Big Data Genomics.


      Social Celebrating Arab Women in Technical Roles
      I then attended the Social Celebrating Arab Women in Technical Roles, sponsored by Visa, where I met many Arab and non-Arab women in computing (ArabWIC) from all over the world. The talk was presented by Dr. Sana Odeh, a Clinical Professor in the Computer Science Department at New York University, and Dr. Kaoutar El Maghraoui, a Research Scientist at IBM. They talked about the goals of the group and our role in achieving them. After that, we socialized, exchanged our bios, and discussed different ways to contribute to the group.



      Friday Keynote
      Friday's keynote was presented by Nora M. Denzel of the AnitaB.org Board of Trustees. She introduced Dr. Ayanna Howard, Professor and Linda J. and Mark C. Smith Endowed Chair in Bioengineering in the School of Electrical and Computer Engineering at the Georgia Institute of Technology. Dr. Howard gave an overview of robotics. Her recent research is in the area of pediatric therapy, using robots in the home for kids with special needs, which is important for families who cannot afford regular therapy sessions. The lessons learned from this work are that humans trust robots and that our intelligent machines inherit our human biases.


      Next, Dr. Brenda Darden Wilkerson, the President and CEO of AnitaB.org, presented the 2017 Lifetime Achievement Award, in recognition of her dedication to women in tech, to Dr. Telle Whitney. To help honor Dr. Whitney, Dr. Fran Berman, the Hamilton Distinguished Professor of Computer Science at the Rensselaer Polytechnic Institute and the Chair of the AnitaB.org Board of Trustees, talked about all the amazing things Dr. Whitney did in the 15 years she served AnitaB.org. Dr. Berman then presented the award to Dr. Whitney, who talked about how important it is for us to make a change and contribute to the movement of creating more opportunities for women.

      Next, Dr. Jodi Tims, Chair of the ACM Council on Women in Computing, presented the ACM Student Research Competition. GHC hosts one of the largest technical poster sessions in the U.S.; this year there were 90 posters in total. The poster winners were announced.



      The winners of the Undergraduate Student Research Competition:



      The winners of the Graduate Student Research Competition:

      1- Parishad Karimi, WINLAB, Rutgers, The State University of New Jersey, with the poster "SMART: A Distributed Architecture for Dynamic Spectrum Management".
      3- Mariam Nouh, University of Oxford, UK, with the poster "CCINT: Cyber-Crime INTelligence Framework for Detecting Online Radical Content".


      Next, Dr. Jennifer Chayes, Technical Fellow and Managing Director at Microsoft, presented the winner of the Denice Denton Emerging Leader ABIE Award, Dr. Aysegul Gunduz. Dr. Gunduz develops tools and devices that identify neurological disorders.


      After that, Dr. Deborah Berebichez, Chief Data Scientist at Metis and co-host of "Outrageous Acts of Science", talked about her life story, how important it is to keep going and face all the obstacles that may come in life, and how important it is to encourage each other.


      Next, Sherry Ryan, Vice President and Chief Information Security Officer at Juniper Networks, presented the winner of the A. Richard Newton Educator ABIE Award, Dr. Marie desJardins, Professor of Computer Science and Electrical Engineering at the University of Maryland. Dr. desJardins has been dedicated to supporting women in computing, giving guidance, support, care, and feedback to female faculty as they prepare their tenure packages.

      The next speaker was Maureen Fan, CEO and Co-founder of Baobab Studios. Maureen talked about her passion for virtual reality and animation and how it inspires you to dream. She also talked about storytelling and its power to make you care about a character, and about how to create your own path and follow your dreams even if it is not an easy path to take.


      Nora M. Denzel ended the keynote with a quote from Grace Hopper herself: "Ships in the harbor are safe, but that's not what ships are made for."


      Friday Keynote Speakers GHC17: Dr. Ayanna Howard, Professor and Linda J. and Mark C. Smith Endowed Chair, Bioengineering, School of Electrical and Computer Engineering, Georgia Institute of Technology, Aysegul Gunduz, ABIE Award Winner, Dr. Deborah Berebichez, Chief Data Scientist, Metis; Co-host, “Outrageous Acts of Science”, Dr. Marie desJardins, ABIE Award Winner, Maureen Fan, CEO and Co-founder, Baobab Studios

      Friday Night Celebration
      During the Friday Night Celebration, we all went to dance, socialize, get swag from Google, eat food, and celebrate Grace Hopper. I met incredible people and enjoyed my time.

      My Recommendations and Final Thoughts
      My recommendations for future attendees are to plan the talks you're going to attend before the conference and to make sure you know where the talks are, as people line up almost 15 minutes before a talk and spaces fill up fast. It is important to have a plan-B talk. Also, some popular talks are repeated during the day, so if you could not attend the first session you can attend the repeat. In addition, do not miss the opportunity to socialize with other attendees or speakers. If you are looking for a job or internship opportunity, upload your CV to the GHC database and bring printed copies of your CV for the career fair. Finally, you will receive a lot of swag during the conference, so prepare to make space in your bag.


      It was amazing to see all the women in computing; it made me feel that I am not alone in this field. I would like to thank the Grace Hopper Celebration for this amazing conference. I enjoyed every bit of it, from meeting amazing women in computing to listening to the inspirational talks, dancing, enjoying Orlando, and seeing my sister who came all the way from Saudi Arabia.

      --Lulwah Alkwai

      2017-11-06: Association for Information Science and Technology (ASIS&T) Annual Meeting 2017

      The crowds descended upon Arlington, Virginia for the 80th annual meeting of the Association for Information Science and Technology. I attended this meeting to learn more about ASIS&T, including its special interest groups. Also attending with me was former ODU Computer Science student and current Los Alamos National Laboratory librarian Valentina Neblitt-Jones.
      The ASIS&T team had organized a wonderful collection of panels, papers, and other activities for us to engage in.

      Plenary Speakers

      Richard Marks: Head of the PlayStation Magic Lab at Sony Interactive Entertainment

      Richard Marks talked about the importance of play to the human experience. He covered innovations at the Playstation Magic Lab in an effort to highlight possible futures of human-computer interaction. The goal of the laboratory is "experience engineering" whereby the developers focus on improving the experience of game play rather than on more traditional software development. Play is about interaction and the Magic Lab focuses on amplifying that interaction.

      One of the new frontiers of gaming is virtual reality, whereby users are immersed in a virtual world. Marks talked about how using an avatar in a game initiates a "virtual transfer of identity". Consider the example of pouring water: seeing oneself pour water on a screen while using a controller provides one level of immersion, but seeing the virtual glass of water in your hands makes the action far more natural. He mentioned that VR players confronted with a virtual tightrope suspended above New York City had difficulty stepping onto the tightrope, even though they knew it was just a game.

      He talked about thresholds of technology change, detailing the changes in calculating machines throughout the 20th Century and how "when you can get it into your shirt pocket, now everything changes". Though this calculator example seems an obvious direction of technology, it was not entirely obvious when calculating machines were first being developed. The same parallel can be made for user interfaces. Marks also mentioned that games allow their researchers to explore many different techniques without having to worry about the potential for loss of life or other challenges that confront interface researchers in other industries.

      William Powers: Laboratory for Social Machines at the MIT Media Lab

      William Powers, author of "Hamlet's Blackberry" and reporter at the Washington Post, gave a thoughtful talk on the effects of information overload on society. To him, tech revolutions are all about depth and meaning. Depth is about focus and reflection, when "the human mind takes its most valuable and most important journeys". Meaning is about our ability to develop "theories about what exists".

      He talked about the current social changes people are experiencing in the online (and offline) world. He personally found that he was not able to give attention to the things he cared about: the more time he spent online, the harder it became to read longer pieces of work, like books. A number of media stories exist about diminishing attention spans correlated with an increase in online use.

      While at a fellowship at Harvard's Shorenstein Center, Powers began work on what print on paper had done for civilization. He covered different "Philosophers of Screens" from history. Socrates believed that the alphabet would destroy our minds, fearing that people would not think beyond the words on the page; he felt that people needed distance to truly digest the world around them. Seneca lived in a world of many new technologies, such as postal systems and paved roads, but he feared the "restless energy" that haunted him, developing mental exercises to focus the mind. By inventing the printing press, Gutenberg helped mass-produce the written word, leading some of his era to fear the end of civilization as misinformation was being printed. In Shakespeare's time, people complained that the print revolution had given them too much to read and that they would not be able to keep up with it. Benjamin Franklin worked to overcome his own addictions through the use of ritual. Henry David Thoreau bemoaned the distracted nature of his compatriots in the 19th Century, noting that "when our life ceases to be inward and private, conversation degenerates to gossip." Marshall McLuhan also believed that we could rise above information overload by developing our own strategies.

      The output of this journey became the paper "Hamlet's Blackberry: Why Paper Is Eternal", which then led to the book "Hamlet's Blackberry". The common thread was that each age has had new technical advances and concerns that people were becoming less focused and more out of touch. Each age also had visionaries who found that they could rise above the information fray by developing their own techniques for focus and introspection. Every technical revolution starts with the idea that the technology will consume everything, but this is hardly the case. Says Powers, "If all goes well with the digital revolution, then tech will allow us to have the depth that paper has given us." Powers even mentioned that he had been discussing with Tim Berners-Lee how to build a "better virtual society in the virtual world" that would in turn improve our real world.


      Sample of Papers Presented

      As usual, I cannot cover all papers presented, and, due to overlaps, was not present at all sessions. I will discuss a subset of the presentations that I attended.

      Top Ranked Papers

      Eric Forcier presented something close to one of my own topics of interest in "Re(a)d Wedding: A Case Study Exploring Everyday Information Behaviors of the Transmedia Fan". In the paper he talks about the phenomenon of transmedia fandom: fans who explore a fictional world through many different media types. The paper specifically focuses on an event in the Game of Thrones media franchise: the Red Wedding. Game of Thrones is an HBO television show based on a series of books named A Song of Ice and Fire. This story event is of interest because book fans were aware of the events of the Red Wedding before television fans experienced them, leading to a variety of different experiences for both. Forcier details the different types of fans and how they interact. His work has some connection to my work on spoilers and using web archives to avoid them.


      In "Before Information Literacy [Or, Who Am I , As a Subject-Of-(Information)-Need?]", Ronald Day of the Department of Information and Library Science at Indiana University discusses the current issue of fake news. In his paper he considers the current solutions of misinformation exposure to be incomplete. Even though we are focusing on developing better algorithms for detecting fake news and also attempting to improve information literacy, there is also the possibility of improving a person's ability to determine what they want out of an information source. Day's paper provides an interesting history of information retrieval from an information science perspective. Over the years, I have heard that "data becomes information, but not all data is information"; Day extends this further by stating that "knowledge may result in information, but information doesn't necessarily have to come from or result in knowledge".

      In "Affordances and Constraints in the Online Identity Work of LGBTQ+ Individuals", Vanessa Kitzie discusses the concepts of online identity in the LGBTQ+ community. Using interviews with thirty LGBTQ+ individuals, she asks about the experiences of the LGBTQ+ community in both social media and search engines. She finds that search engines are often used by members of the community to find the language necessary to explore their identity, but this is problematic because labels are dependent on clicks rather than on identity. Some members of the community create false social profiles so that they can "escape the norms confining" their "physical body" and choose the identity they want others to see. Many use social media to connect to other members of the community. The suggestions of further people to follow often introduces the user to more terms that help them with their identity. Her work is an important exploration of the concept of self, both on and offline.

      Other Selected Papers
      Sarah Bratt presented "Big Data, Big Metadata, and Quantitative Study of Science: A Workflow Model for Big Scientometrics". In this paper, she and her co-authors demonstrate a repeatable workflow used to process bibliometric data for the GenBank project. She maps the workflow that they developed for this project to the standard areas detailed in Jeffrey M. Stanton's Data Science. It is their hope that the framework can be applied to other areas of big data analytics, and they intend to pursue a workflow that will work in these areas. I wondered if their workflow would be applicable to projects like the Clickstream Map of Science. I was also happy to see that her group was trying to tackle disambiguation, something I've blogged about before.


      Yu Chi presented "Understanding and Modeling Behavior Patterns in Cross-Device Web Search." She and her co-authors conducted a user study to explore the behaviors involved in beginning a web search on one device and then continuing it on another, compared with searching on a single device. They make the point that "strategies found on the single device, single-session search might not be applicable to the cross-device search". Users switching devices exhibit a new behavior, re-finding, that might be necessary due to the interruption. They discovered that there are differences in user behavior between the two cases and that Hidden Markov Models could be used to model and uncover some user behavior. This work has implications for search engines and information retrieval.


      "Toward A Characterization of Digital Humanities Research Collections: A Contrastive Analysis of Technical Designs" is the work of Katrina Fenlon. She talks about thematic research collections, which are collected by scholars who are trying to "support research on a theme". She focuses on the technical designs of thematic research collections and explores how collections with different uses have different designs. In the paper, she reviews three very different collections and categorizes them based on need: providing advanced access to value-added sources, providing context and interrelationships to sources, and also providing a platform for "new kinds of analysis and interpretation". I was particularly interested in Dr. Felon's research because of my own work on collections.


      I was glad to once again see Leslie Johnston from the United States National Archives and Records Administration. She presented her work on "ERA 2.0: The National Archives New Framework for Electronic Records Preservation." This paper discusses the issues of developing the second version of the Electronic Records Archives (ERA), the system that receives and processes US government records from many agencies before permanently archiving them for posterity. It is complex because records not only consist of different file formats, but many also have different regulations surrounding their handling. ERA 2.0 now uses an Agile software methodology for development as well as cloud computing in order to effectively adapt to changing needs and requirements.


      Unique to my experience at the conference was Kolina Koltai's presentation of "Questioning Science with Science: The Evolution of the Vaccine Safety Movement." In this work, the authors interviewed those who sought more research on vaccine safety, often called "anti-vaxxers". Most participants cited concern for children, and not just their own, as one of their values. They often read scientific journals and are concerned about financial conflicts of interest between government agencies and the corporations that they regulate, especially in light of prior issues involving research into the safety of tobacco and sugar. The Deficit Model, the idea that the group just lacks sufficient information, does not apply to this group. The authors also discovered that the Global Belief Model has not been effective in understanding members of this movement. It is the hope of the authors that this work will be helpful in developing campaigns and addressing concerns about vaccine safety. In a larger sense, it supports other work on "how people develop belief systems based upon their values", while also providing information for those attempting to study fake news.


      Manasa Rath presented "Identifying the Reasons Contributing to Question Deletion in Educational Q&A." They looked at "bad" questions asked on the Q&A site Brainly. I was particularly interested in this work because the authors identified which features of a question caused moderators to delete it and then discovered that a J48 decision tree classifier was best at predicting whether a given question would be deleted.


      "Tweets May Be Archived: Civic Engagement. Digital Preservation, and Obama White House Social Media Data" was presented by Adam Kriesberg. Using data from the Obama White House Social Media Archive stored at the Internet Archive the authors discussed the archiving -- not just web archiving -- of Barack Obama's social media content on Twitter, Vine, and Facebook. Problems exist on some platforms, such as Facebook, where data can be downloaded by users, but is not necessarily structured in a way useful to those outside of Facebook. Facebook data is only browseable by year and photographs included in the data store lack metadata. Obama changed Vine accounts during his presidency, making it difficult for archivists to determine if they have a complete collection from even a single social media platform. An archived Twitter account is temporal, meaning that counts for likes and retweets are only from a snapshot in time. On this note, Kriesberg says that values are likes and retweets are "incorrect", but I object to the terminology of "incorrect". Content drift is something I and others of WS-DL have studied and any observation from the web needs to be studied with the knowledge that it is a snapshot in time. He notes that even though we have Obama's content, we do not have the content of those he engaged with, making some conversations one-sided. He finally mentions that social media platforms provide a moving target for archivists and researchers, as APIs and HTML changes quickly, making tool development difficult. I recommend this work for anyone attempting to archive or work with social media archives.

      Social

      As with other conferences, ASIS&T provided multiple opportunities to connect with researchers in the community. I appreciated the interesting conversations with Christina Pikas, Hamid Alhoori, and others during breaks. I also liked the lively conversations with Elaine Toms and Timothy Bowman. I want to thank Lorri Mon for inviting me to the Florida State University alumni lunch with Kathleen Burnett, Adam Worrall, Gary Burnett, Lynette Hammond Gerido, and others where we discussed each others' work as well as developments at FSU.

       I apologize to anyone else I have left off.

      Summary

      ASIS&T is a neat organization focusing on the intersections of information science and technology. As always, I am looking forward to possibly attending future conferences, like Vancouver in 2018.

      -- Shawn M. Jones

      2017-11-16: Paper Summary for Routing Memento Requests Using Binary Classifiers

      While researching my dissertation topic, I re-encountered the paper, "Routing Memento Requests Using Binary Classifiers" by Bornand, Balakireva, and Van de Sompel from JCDL 2016 (arXiv:1606.09136v1). The high-level gist of the paper is that, by using two corpora of URI-Rs drawn from requests to their Memento aggregator (one for training, the other for evaluating the training), the authors were able to significantly reduce wasted requests to archives that contained no mementos for a requested URI-R.

      A classifier was generated for each of the 17 Web archives included in the experiment, with the exception of the Internet Archive, on the assumption that it would always return a positive result. Given a URI-R, each classifier informed the decision of whether the respective Web archive should be queried.
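      As a concrete illustration, below is a minimal sketch of how an aggregator might route a request through per-archive binary classifiers. The function name, the classifier dictionary, and the feature extractor are hypothetical placeholders of my own, not the authors' implementation; the classifiers are assumed to expose a scikit-learn-style predict() method.

      def archives_to_query(uri_r, classifiers, extract_features):
          """Return the archives predicted to hold at least one memento for uri_r."""
          features = extract_features(uri_r)
          # The Internet Archive is always queried; no classifier is trained for it.
          selected = ["Internet Archive"]
          for archive, clf in classifiers.items():
              if clf.predict([features])[0] == 1:  # 1 = archive likely holds this URI-R
                  selected.append(archive)
          return selected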

      Optimization of this sort has been performed before. For example, AlSum et al. from TPDL 2013 (trip report, IJDL 2014, and arXiv) created profiles for 12 Web archives based on TLD and showed that it is possible to obtain a complete TimeMap for 84% of the URI-Rs requested using only the top 3 archives. In two separate papers, from TPDL 2015 (trip report) and TPDL 2016 (trip report), Alam et al. (2015, 2016) described making routing decisions to optimize queries when the archive's CDX information is available and when the archive's query interface must be used to expose its holdings, respectively.

      The training data set was based on the LANL Memento Aggregator cache from September 2015, containing over 1.2 million URI-Rs. The authors used Receiver Operating Characteristic (ROC) curves comparing the rate of false positives (a URI-R should not have been included but was) to the rate of true positives (a URI-R was rightfully included in the classification). When requesting a prediction from a trained classifier, a pair of these rates is chosen corresponding to the most acceptable compromise for the application.

      A sample ROC curve (from the paper) to visualize memento requests to an archive.
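      To illustrate how such a curve can drive the choice of an operating point, here is a hedged sketch using scikit-learn; the labels, scores, and false-positive budget below are illustrative values of my own, not data from the paper.

      import numpy as np
      from sklearn.metrics import roc_curve

      y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])  # 1 = archive actually held a memento
      y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # classifier scores

      fpr, tpr, thresholds = roc_curve(y_true, y_score)

      # Choose the threshold with the highest true positive rate whose
      # false positive rate stays under an application-specific budget.
      budget = 0.25
      ok = fpr <= budget
      best = np.argmax(tpr[ok])
      print("threshold:", thresholds[ok][best], "TPR:", tpr[ok][best], "FPR:", fpr[ok][best])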

      Classification of this sort required feature selection; the authors used the character length of the URI-R, the count of special characters, and the Public Suffix List (PSL) domain as features (cf. AlSum et al.'s use of TLD as a primary feature). The rationale for choosing PSL over TLD was that most archives cover the same popular TLDs. An additional token feature was obtained by parsing the URI-R, removing delimiters to form tokens, and transforming the tokens to lowercase.
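      A minimal sketch of this kind of feature extraction appears below; the use of the tldextract library for the Public Suffix List lookup and the particular regular expressions are my assumptions, not the authors' exact implementation.

      import re
      import tldextract  # third-party helper for Public Suffix List lookups

      def uri_features(uri_r):
          tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", uri_r) if t]
          return {
              "length": len(uri_r),
              "special_chars": len(re.findall(r"[^A-Za-z0-9]", uri_r)),
              # Registered domain per the Public Suffix List, e.g. "example.co.uk"
              "psl_domain": tldextract.extract(uri_r).registered_domain,
              "tokens": tokens,
          }

      print(uri_features("http://example.co.uk/path/page?id=42"))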

      The authors used four different methods to evaluate the ranking of the features being explored for the classifiers: frequency over the training set, the sum of the differences between feature frequencies for a URI-R and the aforementioned frequencies, entropy as defined by Hastie et al. (2009), and Gini impurity (see Breiman et al. 1984). Each metric was evaluated to determine how it affected prediction by training a binary classifier using the logistic regression algorithm.
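      For reference, here are illustrative definitions (my own, not code from the paper) of two of these criteria, entropy and Gini impurity, for a binary outcome with positive-class probability p.

      import math

      def entropy(p):
          """Shannon entropy of a binary outcome with positive-class probability p."""
          if p in (0.0, 1.0):
              return 0.0
          return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

      def gini(p):
          """Gini impurity of a binary outcome with positive-class probability p."""
          return 2 * p * (1 - p)

      print(entropy(0.5), gini(0.5))  # maximal impurity: 1.0 and 0.5
      print(entropy(0.9), gini(0.9))  # purer outcome: ~0.47 and 0.18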

      The paper includes applications of the above plots for each of the four feature selection strategies. Following the training, they evaluated the performance of each algorithm for classification, with a preference toward low computational load and memory usage, using the corresponding sets of selected features. The algorithms evaluated were logistic regression (as used before), Multinomial Bayes, Random Forest, and SVM. Aside from Random Forest, the other three algorithms had similar prediction runtimes, so they were evaluated further.

      A classifier was trained for each combination of the three remaining algorithms and each archive. To determine the true positive threshold, they brought in a second data set consisting of 100,000 unrelated URI-Rs from the Internet Archive's log files from early 2012. Of the three algorithms, they found that logistic regression performed the best for 10 archives and Multinomial Bayes for 6 others (per above, IA was excluded).
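      The following is a hedged sketch of how such a comparison might look for a single archive using scikit-learn; the toy URI-Rs, labels, and the character n-gram hashing are my assumptions, not the paper's setup.

      from sklearn.feature_extraction.text import HashingVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      uris = [
          "http://example.com/news/2015/story",
          "http://archive.example.org/old/page",
          "http://blog.test.net/post?id=1",
          "http://example.co.uk/research/paper",
      ] * 10
      labels = [1, 0, 1, 0] * 10  # 1 = this archive held a memento (toy labels)

      # Hash character n-grams so MultinomialNB sees non-negative counts.
      vec = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                              n_features=2**12, alternate_sign=False)
      X = vec.transform(uris)
      X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                                random_state=0, stratify=labels)

      for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                        ("multinomial NB", MultinomialNB())]:
          clf.fit(X_tr, y_tr)
          auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
          print(f"{name}: ROC AUC = {auc:.3f}")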

      The authors then evaluated the trained classifiers using yet another dataset of URI-Rs from 200,000 randomly selected requests (cleaned to just over 187,000) from oldweb.today. Because this data set was based on inter-archive requests, it was more representative of an aggregator's requests than the IA dataset. They computed recall, computational cost, and response time using a simulated method to avoid the need for thousands of requests. These results confirmed that the currently used heuristic of querying all archives has the best recall (results are comprehensive), but response time could be drastically reduced using a classifier. With a reduction in recall of 0.153, fewer than 4 requests instead of 17 would reduce the response time from just over 3.7 seconds to about 2.2 seconds. Additional details of optimization obtained through evaluation of the true positive rate can be found in the paper.

      Take Away

      I found this paper to be an interesting and informative read on a very niche topic that is hyper-relevant to my dissertation topic. I foresee a potential chance to optimize archival queries from other Memento aggregators like MemGator and look forward to further studies in this realm on both optimization and caching.

      Mat (@machawk1)

      Nicolas J. Bornand, Lyudmila Balakireva, and Herbert Van de Sompel. "Routing Memento Requests Using Binary Classifiers," In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pp. 63-72, 2016. (Also at arXiv:1606.09136).
