Channel: Web Science and Digital Libraries Research Group

2014-07-10: Federal Cloud Computing Summit



As mentioned in my previous post, I attended the Federal Cloud Computing Summit on July 8th and 9th at the Ronald Reagan Building in Washington, D.C. I helped the host organization, the Advanced Technology Academic Research Center (ATARC), organize and run the MITRE-ATARC Collaboration Sessions that kicked off the event on July 8th. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on current challenges in cloud computing.

A FedRAMP primer was held at 10:00 AM on July 8th in a Government-only session. At its conclusion, we began the MITRE-ATARC Collaboration Sessions, which focused on Cloud Computing in Austere Environments, Cloud Computing for the Mobile Worker, Security as a Service, and the Impact of Cloud Computing on the Enterprise. Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will construct a summary document that outlines the main points of the discussions, identifies orthogonal ideas across the challenge areas, and makes recommendations for the Government and academia based on the discussions. For reference, please see the 2013 Federal Cloud Computing Summit Whitepaper.

On July 9th, I attended the Summit itself, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:30 to 4:15, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's discussions.

To follow along on Twitter, refer to the Federal Cloud Computing Summit handle (@cloudfeds), the ATARC handle (@atarclabs), and the #cloudfeds hashtag.

This was the third Federal Summit event in which I have participated. They are great events that the Government participants have consistently identified as high-value, and I am excited to start planning the Winter installment of the Federal Cloud Computing Summit.

--Justin F. Brunelle

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript


One very large part of digital preservation is the act of crawling pages on the live Web and saving them in a format that future generations can view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often purpose-built. Because these tools are built specifically for preservation, users expect them to perform much better than a general-purpose tool would.

As anyone who has looked up a complex web page in the Internet Archive can tell you, the more complex the page, the less likely it is that all of the resources needed to replay it were captured. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers handle well but preservation tools have trouble with. We have grouped these tests into three categories, with various tests under each (a short illustrative sketch of a few of these cases follows the lists below):

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positives (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)
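
To make a few of these cases concrete, below is a minimal Python sketch (an illustration only, not the actual test code behind http://acid.matkelly.com) that writes a toy page exercising cases 1a, 1d, 1e, and 2f. Pointing a capture tool at the generated page and replaying the result quickly shows which of these cases survive:

    # Toy generator for a few Archival Acid Test-style cases (illustrative only).
    PAGE = """<!DOCTYPE html>
    <html>
    <body>
      <!-- 1a: local image, relative to the test page -->
      <img src="images/local.png">

      <!-- 1d: inline content, base64-encoded image (1x1 transparent GIF) -->
      <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">

      <!-- 1e: scheme-less (protocol-relative) resource -->
      <img src="//example.com/remote.png">

      <!-- 2f: DOM manipulation after a delay; a tool that captures too early
           will miss the late-loaded image -->
      <div id="late"></div>
      <script>
        setTimeout(function () {
          document.getElementById('late').innerHTML =
            '<img src="images/delayed.png">';
        }, 2000);
      </script>
    </body>
    </html>
    """

    with open('mini-acid-test.html', 'w') as f:
        f.write(PAGE)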

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).

The results for this initial study (Figure 2) were accepted for publication (see the paper) to the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your tools/websites and the code that runs the site is available on GitHub.

— Mat Kelly (@machawk1)

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe this as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often than not it seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and reload the current page with the value specified in the optional url= argument (if no URL is provided, the client uses the current page's URL).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.   The problem comes when a page is archived, but the refresh capability is 1) not removed or 2) the URL argument is not (correctly) rewritten.

Last week I had loaded a memento of cnn.com from WebCitation, specifically http://webcitation.org/5lRYaE8eZ, which shows the page as it existed on 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule but could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticeable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are more sparse.
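
As an aside, a site-specific canonicalization rule of the sort mentioned above (ignoring "?refresh=1" before index lookup) is simple to express. Here is a minimal, hypothetical sketch in Python, not anything Wayback actually implements:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def canonicalize(uri, drop_params=('refresh',)):
        # Drop the listed query parameters before index lookup, so that
        # www.cnn.com/?refresh=1 and www.cnn.com resolve to the same captures.
        parts = urlsplit(uri)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in drop_params]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), parts.fragment))

    # canonicalize('http://www.cnn.com/?refresh=1') -> 'http://www.cnn.com/'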

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing), but presumably because it is a newer archive, it does not yet have a "?refresh=1" version of cnn.com archived.  It is possible that Perma.cc has a Wayback backend, but I'm not sure.  I had to push a 2014-07-11 version into Perma.cc (i.e., it did not already have cnn.com archived).  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.
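
Both remedies (rewriting the url= argument, as Wayback-style archives do, or stripping the tag, as Archive.today does) are easy to apply at capture or replay time. Here is a minimal sketch using the BeautifulSoup library; it is only an illustration, not the code any of these archives actually run:

    import re
    from bs4 import BeautifulSoup

    def fix_meta_refresh(html, rewrite_url=None):
        # Strip <meta http-equiv="refresh"> tags, or rewrite their url=
        # argument to an archival URI if rewrite_url is given.
        soup = BeautifulSoup(html, 'html.parser')
        refresh = re.compile('^refresh$', re.I)
        for tag in soup.find_all('meta', attrs={'http-equiv': refresh}):
            if rewrite_url is None:
                tag.decompose()                  # the Archive.today approach
            else:
                # content looks like "1800;url=http://www.cnn.com/?refresh=1"
                delay = tag.get('content', '').split(';', 1)[0]
                tag['content'] = '%s;url=%s' % (delay, rewrite_url)
        return str(soup)

Calling fix_meta_refresh(html) parks the memento where it is; calling fix_meta_refresh(html, '/web/20091121211700/http://www.cnn.com/?refresh=1') reproduces the Wayback-style rewrite shown earlier.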

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback archives) rewrites the URI, but because it does not canonicalize the site-specific ?refresh=1 URI, it violates the user's expectations with a single time jump (the size of which depends on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project.
In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant.  This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year.

Our project goals are two-fold:
  1. to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know
  2. to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback).
Our innovation is in allowing individuals to "archive what I see now". The user can create a standard web archive file ("archive") of the content displayed in the browser ("what I see") at a particular time ("now").
    Our work focuses on bringing the power of institutional web archiving tools like Heritrix and wayback to humanities scholars through open-source tools for personal-scale web archiving. We are building the following tools:
    • WARCreate - A browser extension (for both Google Chrome and Firefox) that can create an archive of a single webpage in the standard WARC format and save it to local disk. It can allow a user to archive pages behind authentication or that have been modified after user interaction.
    • WAIL (Web Archiving Integration Layer) - A stand-alone application that provides one-click installation and GUI-based configuration of both Heritrix and wayback on the user’s personal computer.
    • Mink - A browser extension (for both Google Chrome and Firefox) that provides access to archived versions of live webpages. This is an additional Memento client that can be configured to access locally stored WARC files created by WARCreate.
    With these three tools, a researcher could, in her normal workflow, discover a web resource (using her browser), archive the resource as she saw it (using WARCreate in her browser), and then later index and replay the archived resource (using WAIL). Once the archived resource is indexed, it would be available for view in the researcher’s browser (using Mink).
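
    For readers unfamiliar with the WARC files these tools read and write, the sketch below shows roughly what "a standard web archive file" contains: an HTTP response captured as a WARC response record. It uses the warcio Python library purely for illustration; WARCreate itself is browser-side JavaScript, and this is not its implementation:

        from io import BytesIO
        from warcio.warcwriter import WARCWriter
        from warcio.statusandheaders import StatusAndHeaders

        # Write a single response record to example.warc.gz.
        with open('example.warc.gz', 'wb') as output:
            writer = WARCWriter(output, gzip=True)
            payload = b'<html><body>What I see now</body></html>'
            http_headers = StatusAndHeaders('200 OK',
                                            [('Content-Type', 'text/html')],
                                            protocol='HTTP/1.1')
            record = writer.create_warc_record('http://example.com/', 'response',
                                               payload=BytesIO(payload),
                                               http_headers=http_headers)
            writer.write_record(record)

    A file like this can then be indexed and replayed locally with wayback, which is exactly the workflow WAIL packages up.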

    We are looking forward to working with our project partners and advisory board members: Kristine Hanna (Archive-It), Lily Pregill (NY Art Resources Consortium), Dr. Louisa Wood Ruby (Frick Art Reference Library), Dr. Steven Schneider (SUNY Institute of Technology), and Dr. Avi Santo (ODU).

    A previous post described the work we did for the start-up grant:
    http://ws-dl.blogspot.com/2013/10/2013-10-11-archive-what-i-see-now.html

    We've also posted previously about some of the tools (WARCreate and WAIL) that we've developed as part of this project:
    http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

    See also "ODU gets largest of 11 humanities grants in Virginia" from The Virginian-Pilot.

    -Michele

    2014-07-25: Digital Preservation 2014 Trip Report


    On July 22 and 23, 2014, Dr. Michael Nelson (@phonedude_mln) and I (@machawk1) attended Digital Preservation 2014 in Washington, DC. This was my fourth consecutive NDIIPP (@ndiipp) / NDSA (@ndsa2) meeting (see trip reports from Digital Preservation 2011, 2012, 2013). With the largest attendance yet (300+) and compressed into two days, the schedule was jam-packed with interesting talks. Per usual, videos for most of the presentations are included inline below.

    Day One

    Micah Altman (@drmaltman) led the presentations with information about the NDSA and asked, regarding Amazon claiming a reliability of 99.999999999% (eleven nines) for uptime, "What do the eleven nines mean?" "There are a number of risks that we know about [as archivists] that Amazon doesn't", he said, continuing, "No single institution can account for all the risks." Micah spoke about the updated National Agenda for Digital Stewardship, which is to have the theme of "developing evidence for collective action".

    Matt Kirschenbaum (@mkirschenbaum) followed Micah with his presentation Software, It's a Thing, with reference to the recently excavated, infamous Atari E.T. game. "Though the data on the tapes had been available for years, the uncovering was more about aura and allure". Matt spoke about George R. R. Martin's (of Game of Thrones fame) recent interview in which he stated that he still uses a word processing program called WordStar on an un-networked DOS machine and that it is his "secret weapon from distraction and viruses." In the 1980s, WordStar dominated the market until WordPerfect took over, followed by Microsoft Word. "A power user that has memorized all of the WordStar commands could utilize the software with the ease of picking up a pencil and starting to write."

    Matt went on to talk (also see his Medium post) about software as different concepts: as an asset, as an object, as a kind of notation or score (qua music), as shrinkwrap, etc. For a full explanation of each, see his presentation:

    Following Matt, Shannon Mattern (@shannonmattern) shared her presentation on Preservation Aesthetics. "One of preservationists' primary concerns is whether an item has intrinsic value," she said. Shannon then spoke about various sorts of auto-destructive software and art, including art that is light sensitive and software that deletes itself on execution. In addition to recording her talk (see below), she graciously included the full text of her talk online.

    The conference took a short break, followed by a quick preview of the poster session to come later in Day 1. After the break, there was a panel titled "Stewarding Space Data". Hampapuram Ramapriyan of NASA began the panel by stating that "NASA data is available as soon as the satellites are launched." He continued, "This data is preserved... We shouldn't lose bits that we worked hard to collect and process for results." He also stated that he (and NASA) is part of the Earth Science Information Partners (ESIP) Data Stewardship Committee.

    Deirdre Byrne of NOAA then spoke about NOAA's dynamics and the further need for documenting data, preserving it with provenance, providing IT support to maintain the data's integrity, and being able to work with the data in the future (a digital preservation plus). Deirdre then referenced Pathfinder, a technology that allows the visualization of sea surface temperatures among other features, such as indicators of coral bleaching, fish stocks, and the water quality on coasts. Citing its use as a de facto standard means for this purpose, she described the physical dynamics as having 8 different satellites for its functionality along with 31 mirror satellites on standby for redundancy.

    Emily Frieda Shaw (@emilyfshaw) of University of Iowa Libraries followed Deirdre on the panel and spoke about Iowa's role in preserving the original development data for the Explorer I launch. Upon converting and analyzing the data, her fellow researchers realized that at certain altitudes the radiation detection dropped to zero, which indicated that there were large belts of particles surrounding the Earth (later recognized as the Van Allen belts). After discovering more data about the launch in a basement, the group received a grant to recover and restore the badly damaged and moldy documentation.

    Karl Nilsen (@KarlNilsen) and Robin Dasler (@rdasler) of University of Maryland Libraries were next, with Robin first talking about her concerns with data-related issues in the library realm. She noted that one project's data still resided at the University of Hawaii's Institute for Astronomy because it was the home institution of one of the original researchers on the project. The project consisted of data measuring the distances between galaxies, produced by combining and compiling data from various sources of both observational data and standard corpora. To display the data (500 gigabytes total), the group developed a UI using web technologies like MySQL to make the data more accessible. "Researchers were concerned about their data disappearing before they retired," she stated about the original motive for increasing the data's accessibility.

    Karl shifted topics somewhat, stating that two different perspectives can be taken on data from a preservation standpoint (format-centric or system-centric). "The intellectual value of a database comes from ad hoc combination from multiple tables in the form of joins and selections," he said. "Thinking about how to provide access," he continued, "is itself preservation." He followed this with approaches including utilizing virtual machines (VMs), migrating from one application to a more modern web application, and collapsing the digital preservation horizon to ten years at a time.

    Following a brief Q&A for the panel was a series of "Lightning talks". James (Jim) A. Bradley of Ball State University started with his talk, "Beyond the Russian Doll Effect: Reflexivity and the Digital Repository Paradigm", in which he spoke about promoting and sharing digital assets for reuse. Jim then talked about the Digital Media Repository (DMR), which allowed information to be shared and made available at the page level. His group had the unique opportunity to tell which materials are in the library, who was using them, and when. From these patterns, grad students made 3-D models, which were then added and attributed to the original objects.

    Dave Gibson (@davegilbson) of the Library of Congress followed Jim by presenting Video Game Source Disc Preservation. He stated that his group has been the "keepers of The Game" since 1996 and has various source code excerpts from a variety of unreleased games, including Duke Nukem Critical Mass, which was released for the Nintendo DS but not the PlayStation Portable (PSP), despite being developed for both platforms. In their exploration of the game, they uncovered 28 different file formats on the discs, many of which were proprietary, and wondered how they could get access to the files' contents. After using MediaCoder to convert many of the audio files and hex editors to read the ASCII, his group found source code fragments hidden within the files. From this work, they now have the ability to preserve unreleased games.

    Rebecca (Becky) Katz of the Council of the District of Columbia was next with UELMA-Compliant Preservation: Questions and Answers?. She first described UELMA, an act declaring that if a U.S. state passes the act and its digital legal publications are official, the state must make sure that the publications are preserved in some way. Because of this requirement, many states are reluctant to rely solely on digital documents and instead keep paper copies in addition to the digital copies. Many of the barriers in the preservation process for the states lie in how they ought to preserve the digital documents. "For long term access," she said, "we want to be preserving our digital content." Becky also spoke about appropriate formats for preservation, noting that common features of formats, like authentication for PDF, are not open source. "All the metadata in the world doesn't mean anything if the data is not usable," she said. "We want to have a user interface that integrates with the [preservation] repository." She concluded by recommending that states that adopt UELMA have agreements with universities or other long-standing institutions to allow users to download updates of the documents, ensuring that many copies are available.

    Kate Murray of the Library of Congress followed Becky with "We Want You Just the Way You Are: The What, Why and When of Fixity in Digital Preservation". She referenced the "Migrant Mother" photo and how, through moving the digital photo from one storage component to another, there have been subtle changes to the photo. "I never realized that there are three children in the photo!" she said, referencing the comparison between the altered and original photo. To detect such changes, she uses fixity checks (see the related article on The Signal) on a whole collection of data, which ensures bit-level integrity.

    Following Kate, Krystyna Ohnesorge of the Swiss Federal Archives (SFA) presented "Save Your Databases Using SIARD!". "About 90 percent of specialized applications are based on relational databases," she said. SIARD is used to preserve database content for the long term so that the data can still be understood in 20 or 50 years. The system is already used in 54 countries, with 341 licenses currently existing for different international organizations. "If an institution needs to archive relational databases, don't hesitate to use the SIARD suite and SIARD format!"

    The conference then broke for lunch where the 2014 Innovation Awards were presented.

    Following lunch, Cole Crawford (@coleinthecloud) of the Open Compute Foundation presented "Community Driven Innovation", where he spoke about Open Compute being an internationally based open source project. "Microsoft is moving all Azure data to Open Compute", he said. "Our technology requirements are different. To have an open innovation system that's reusable is important." He then emphasized that his talk would focus specifically on the storage aspects of Open Compute. He started with "FACT: The 19 inch server rack originated in the railroad industry, then propagated to the music industry, then it was adopted by IT." He continued, "One of the most interesting things Facebook has done recently is move from tape storage to ten thousand Blu-ray discs for cold storage." He stated that in 2010, the Internet consisted of 0.8 zettabytes. In 2012, this number was 3.0 zettabytes, and by 2015, he claimed, it will be 40 zettabytes in size. "As stewards of digital data, you guys can be working with our project to fit your requirements. We can be a great engine for procurement. As you need more access to content, we can get that."

    After Cole was another panel, titled "Community Approaches to Digital Stewardship". Fran Berman (@resdatall) of Rensselaer Polytechnic Institute started off with reference to the Research Data Alliance. "All communities are working to develop the infrastructure that's appropriate to them," she said. "If I want to worry about asthma (referencing an earlier comment about whether asthma is more likely to be acquired in Mexico City versus Los Angeles), I don't want to wait years until the infrastructure is in place. If you have to worry about data, that data needs to live somewhere."

    Bradley Daigle (@BradleyDaigle) of the University of Virginia followed Fran and spoke about the Academic Preservation Trust, a group of 17 members that takes a community-based approach, attempting to have not just more solutions but better solutions. The group would like to create a business model based on preservation services. "If you have X amount of data, I can tell you it will take Y amount of cost to preserve that data," he said, describing an ideal model. "The AP Trust can serve as a scratch space with preservation being applied to the data."

    Following Bradley on the panel, Aaron Rubinstein from the University of Massachusetts Amherst described his organization's scheme as being similar to Scooby Doo, iterating through each character displayed on-screen and stating the name of a member organization. "One of the things that makes our institution privileged is that we have administrators that understand the need for preservation."

    The last presenter on the panel, Jamie Schumacher of Northern Illinois University, started with "Smaller institutions have challenges when starting digital preservation. Instead of obtaining an implementation grant when applying to the NEH, we got a 'Figure it Out' grant. ... Our mission was to investigate a handful of digital preservation solutions that were affordable for organizations with restricted resources like small staff sizes and those with lone rangers. We discovered that practitioners are overwhelmed. To them, digital objects are a foreign thing." Some of the roadblocks her team eliminated were the questions of which tools and services to use for preservation tasks, for which Google frequently gave poor or too many results.

    Following a short break, the conference split into five different concurrent workshops and breakout sessions. I attended the session titled Next Generation: The First Digital Repository for Museum Collections where Ben Fino-Radin (@benfinoradin) of Museum of Modern Art, Dan Gillean of Artefactual Systems and Kara Van Malssen (@kvanmalssen) of AVPreserve gave a demo of their work.

    As I was presenting a poster at Digital Preservation 2014, I was unable to stay for the second presentation in the session, Revisiting Digital Forensics Workflows in Collecting Institutions by Marty Gengenbach of Gates Archive, as I was required to set up my poster. At 5 o'clock, the breakout sessions ended and a reception, with the poster session, was held in the main area of the hotel. My poster, "Efficient Thumbnail Summarization for Web Archives", is an implementation of Ahmed AlSum's initial work published at ECIR 2014 (see his ECIR Trip Report).

    Day Two

    The second day started off with breakfast and an immediate set of workshops and breakout sessions. Among these, I attended the Digital Preservation Questions and Answers session from the NDSA Innovation Working Group, where Trevor Owens (@tjowens), among other group members, introduced the history and migration of an online digital preservation question and answer system. The current site, residing at http://qanda.digipres.org, is in the process of migration from previous attempts, including a failed try at a Digital Preservation Stack Exchange. This work, completed in part by Andy Jackson (@anjacks0n) at the UK Web Archive, began its migration with his Zombse project, which extracted all of the initial questions and data from the failed Stack Exchange into a format that would eventually be readable by another Q&A system.

    Following a short break after the first long set of sessions, the conference re-split into the second set of breakout sessions for the day, where I attended the session titled Preserving News and Journalism. Aurelia Moser (@auremoser) moderated this panel-style presentation and initially showed a URI where the presentation's resources could be found (I typed bit.ly/1klZ4f2 but that seems to be incorrect).

    The panel, consisting of Anne Wootton (@annewooton), Leslie Johnston (@lljohnston), and Edward McCain (@e_mccain) of the Reynolds Journalism Institute, initially asked, "What is born digital and what are news apps?". The group had put forth a survey to 476 news organizations, consisting of 406 hybrid organizations (those that put content in print and online) and 70 online-only publications.

    From the results, the surveyors asked what the issue was with responses, as they kept the answers open-ended for the journalists in order to obtain an accurate account of their practices. "Newspapers that are in chains are more likely to have written policies for preservation. The smaller organizations are where we're seeing data loss." At one point, Anne Wootton's group organized a "Design-a-Thon" where they gathered journalists, archivists, and developers. Regarding preservation practices, the group stated that Content Management System (CMS) vendors for news outlets are the holders of the keys to the kingdom for newspapers in regard to preservation.

    After the third breakout session of the conference, lunch was served (Mexican!) with Trevor Owens of the Library of Congress, Amanda Brennan (@continuants) of Tumblr, and Trevor Blank (@trevorjblank) of The State University of New York at Potsdam giving a preview of CURATECamp, to occur the day after the conference concluded. While lunch proceeded, a series of lightning talks was presented.

    The first lightning talk was by Kate Holterhoff (@KateHolterhoff) of Carnegie Mellon University and was titled Visual Haggard and Digitizing Illustration. In her talk, she introduced Visual Haggard, a catalog of many images from public domain books and documents that attempts to provide better quality representations of the images in these documents compared to other online systems like Google Books. "Digital archivists should contextualize and mediate access to digital illustrations", she said.

    Michele Kimpton of DuraSpace followed Kate with DuraSpace and Chronopolis Partner to Build a Long-term Access and Preservation Platform. In her presentation she introduced tools like Chronopolis (used for dark archiving) and DuraCloud, and described her group's approach to getting various open source tools to work together to provide a more comprehensive solution for preserving digital content.

    Following Michele, Ted Westervelt of the Library of Congress presented Library of Congress Recommended Formats, in which he referenced the Best Edition Statement, a largely obsolete but useful document that needed supplementation to account for modern best practices and newer media. His group has developed the "Recommended Format Specification", a work in progress that provides this supplementation without superseding the original document. The goal of the document is to set parameters for the target objects so that most content currently unhandled by the in-place specification will have directives, ensuring that digital preservation of the objects is guided and easy.

    After Ted, Jane Zhang of the Catholic University of America presented Electronic Records and Digital Archivists: A Landscape Review, a cursory study of employment positions for digital archivists, both formally trained and trained on the job. She attempted to answer the question "Are libraries hiring digital archivists?" and tried to identify patterns from one hundred job descriptions.

    After the lightning talks, another panel was held, titled "Research Data and Curation". Inna Kouper (@inkouper) and Dharma Akmon (@DharmaAkmon), of Indiana University and the University of Michigan, Ann Arbor, respectively, discussed Sustainable Environment Actionable Data (SEAD, @SEADdatanet) and a Research Object Framework for the data that will be very familiar to practitioners working with video producers. "Research projects are bundles", they said. "The ROF captures multiple aspects of working with data, including unique IDs, agents, states, relationships, and content, and how they cyclically relate. Research objects change states."

    They continued, "Curation activities are happening from the beginning to the end of an object's lifecycle. An object goes through three main states", they listed, "Live objects are in a state of flux, handled by members of project teams, and their transition is initiated by the intent to publish. Curation objects consist of content packaged using the BagIt protocol with metadata, and relationships via OAI/ORE maps, which are mutable but allow selective changes to metadata. Finally, publication objects are immutable and citable via a DOI and have revisions and derivations tracked."

    Ixchel Faniel from OCLC Research then presented, stating that there are three perspectives on archeological practice. "The data has a lifecycle from the data collection, data sharing, and data reuse perspective and cycle." Her investigation consisted of detecting how actions in one part of the lifecycle facilitated work in other parts of the lifecycle. She collected data over one and a half years (2012-2014) from 9 data producers, 2 repository staff, and 7 data re-users and concluded that actions in one part of the cycle influence things that occur in other stages. "Repository actions are overwhelmingly positive but cannot always reverse or explain documentation procedures."

    Following Ixchel and the panel, George Oates (@goodformand), Director of Good, Form & Spectacle, presented Contending with the Network. "Design can increase access to the material", she said, drawing on her experience with Flickr and the Internet Archive. Relating to her work with Flickr, she referenced The Commons, a program that is attempting to catalog the world's public photo archives. In her work with IA, she was most proud of the Understanding 9/11 interface she designed. She then worked for a group named MapStack and created a project called Surging Seas, an interactive tool for visualizing sea level rise. She recently started a new business, Good, Form & Spectacle, and proceeded on a formal mission of archiving all documents related to the mission through metadata. "It's always useful to see someone use what you've designed. They'll do stuff you anticipate and that you think is not so clear."

    Following a short break, the last session of the day started with the Web Archiving Panel. Stephen Abrams of California Digital Library (CDL) started the presentation asking "Why web archiving? Before, the web was a giant document retrieval system. This is no longer the case. Now, the web browser is a virtual machine where the language of choice is JavaScript and not HTML." He stated that the web is a primary research data object and that we need to provide programmatic and business ways to support web archiving.

    After Stephen, Martin Klein (@mart1nkle1n) of Los Alamos National Laboratory (LANL), and formerly of our research group, gave an update on the state of the work done with Memento. "There's extensive Memento infrastructure in place now!", he said. New web services soon to be offered by LANL include a Memento redirect service (for example, going to http://example.com/memento/20040723150000/http://matkelly.com will automatically be resolved in the archives to the closest available archived copy); a memento list/search service to allow memento lookup through a user interface, with specified dates, times, and a URI; and finally, a Memento TimeMap service.

    After Martin, Jimmy Lin (@lintool) of the University of Maryland, and formerly of Twitter, presented on leveraging his big data expertise for use in digital preservation. "The problem", he said, "is that web archives are an important part of heritage but are vastly underused. Users can't do that much with web archives currently." His goal is to build tools to support exploration and discovery in web archives. A tool his group built, Warcbase, uses Hadoop and HBase for topic modeling.

    After Jimmy, ODU WS-DL's very own Michael Nelson (@phonedude_mln) presented, starting off with "The problem is that right now, we're still in the phase of 'Hooray! It's in the web archive!' whenever something shows up. What we should be asking is, 'How well did we archive it?'" Referencing the recent publicity around the Internet Archive capturing evidence related to the plane shot down in Ukraine, Michael said, "We were just happy that we had it archived. When you click on one of the videos, however, it just sits there and hangs. We have the page archived but maybe not all the stuff archived that we like." He then went on to describe the ways his group is assessing web archives: determining the importance of what's missing, detecting temporal violations, and benchmarking how well the tools handle the content they're made to capture.


    After Dr. Nelson presented, the audience had an extensive number of questions.

    After the final panel, Dragan Espenschied (@despens) of Rhizome presented Big Data, Little Narration (see his interview). In his unconventional presentation, he reiterated that some artifacts don't make sense in the archives. "Each data point needs additional data about it somewhere else to give it meaning," he said, giving a demonstration of an authentic replay of Geocities sites in Netscape 2.0 via a browser-accessible emulator. "Every instance of digital culture is too large for an institution because it's digital and we cannot completely represent it."

    As the conference adjourned, I was glad I was able to experience it, see the progress other attendees have made in the last three (or more) years, and present the status of my research.

    — Mat Kelly (@machawk1)

    2014-08-22: One WS-DL Class Offered for Fall 2014

    This fall, only one WS-DL class will be offered:
    This class approaches the Web as a phenomenon to be studied in its own right.  In this class we will explore a variety of tools (e.g., Python, R, D3) as well as applications (e.g., social networks, recommendations, clustering, classification) that are commonly used in analyzing the socio-technical structures of the Web.

    The class will be similar to the fall 2013 offering. 

    Right now we're planning on offering in spring 2015:
    • CS 418 Web Programming 
    • CS 725/825 Information Visualization
    • CS 751/851 Introduction to Digital Libraries
    --Michael

    2014-08-26: Memento 101 -- An Overview of Memento in 101 Slides

    In preparation for the upcoming "Tools and Techniques for Revisiting Online Scholarly Content" tutorial at JCDL 2014, Herbert and I have revamped the canonical slide deck for Memento, and have called it "Memento 101" for the 101 slides it contains.  The previous slide deck was from May 2011 and was no longer current with the RFC (December 2013).  The slides cover Memento basic and intermediate concepts, with pointers for some of the more detailed and esoteric bits (like patterns 2, 3, and 4, as well as the special cases) of interest to only the most hard-core archive wonks. 

    The JCDL 2014 tutorial will choose a subset of these slides, combined with updates from the Hiberlink project and various demos.  If you find yourself in need of explaining Memento please feel free to use these slides in part or in whole (PPT is available for download from slideshare). 
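
    For a flavor of what the deck covers, here is a minimal sketch of Memento datetime negotiation (RFC 7089) in Python, using the Internet Archive's public Wayback Machine TimeGate; treat the endpoint and headers as illustrative rather than normative:

        import requests

        TIMEGATE = 'http://web.archive.org/web/'   # acts as a Memento TimeGate
        target = 'http://www.cnn.com/'

        # Ask the TimeGate for the memento closest to the requested datetime.
        resp = requests.head(TIMEGATE + target,
                             headers={'Accept-Datetime': 'Sat, 21 Nov 2009 21:00:00 GMT'},
                             allow_redirects=True)

        print(resp.url)                                   # URI-M of the selected memento
        print(resp.headers.get('Memento-Datetime'))       # its archival datetime
        print(resp.links.get('timemap', {}).get('url'))   # TimeMap listing all known mementos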




    --Michael & Herbert

    2014-08-28: InfoVis 2013 Class Projects

    (Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

    I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012)

    In Spring 2013, I taught an Applied Visual Analytics course that asked students to create visualizations based on the "Life in Hampton Roads" annual survey performed by the Social Science Research Center at ODU.  In Fall 2013, I taught the traditional InfoVis course that allowed students to choose their own topics.  (All class projects are listed in my InfoVis Gallery.)

    Life in Hampton Roads
    Created by Ben Pitts, Adarsh Sriram Sathyanarayana, Rahul Ganta


    This project (currently available at https://ws-dl.cs.odu.edu/vis/LIHR/) provides a visualization of the "Life in Hampton Roads" survey for 2010-2012.  Data cleansing was done with the help of tools like Excel and OpenRefine.  The results of the survey are shown using maps and bar charts, making it easier to understand public opinion on a particular question. The visualizations are created using tools like D3, JavaScript, jQuery, and HTML, and they are reader-driven and exploratory.  The user can interactively change the filters in the top half of the page and see the data only for the selected groups below.

    The video below provides a demo of the tool.


    MeSH Viewer
    Created by Gil Moskowitz

    MeSH is a controlled vocabulary developed and maintained by the U.S. National Library of Medicine (NLM) for indexing biomedical databases, including the PubMed citation database. PubMed queries can be made more precise, returning fewer citations with higher relevance, by issuing them with specific reference to MeSH terms. The database of MeSH terms is large, with many interrelationships between terms. The MeSH viewer (currently available at https://ws-dl.cs.odu.edu/vis/MeSH-vis/) visually presents a subset of MeSH terms, specifically MeSH descriptors. The MeSH descriptors are organized into trees based on a hierarchy of MeSH numbers. This work includes a tree view to show the relationships as structured by the NLM. However, individual descriptors may have multiple MeSH numbers, hence multiple locations in the forest of trees, so this view is augmented with a force-directed node-link view of terms found in response to a user search. The terms can be selected and used to build PubMed search strings, and an estimate of the specificity of this combination of terms is displayed.

    The video below provides a demo of the tool.


    Visualizing Currency Volatility
    Created by Jason Long



    This project (currently available at https://ws-dl.cs.odu.edu/vis/currency/) was focused on showing changes in currency values over time. The initial display shows the values of 39 currencies (including Bitcoin) as compared to the US Dollar (USD).  It's easy to see when the Euro was introduced and individual currencies in EU countries dropped off.  Clicking on a particular currency expands its chart and allows for closer inspection of its change over time.  There is also a tab for a heatmap view that allows the user to view a moving average of the difference from USD.  The color indicates whether the USD is appreciating (green) or depreciating (red).

    The video below provides a demo of the tool.

    -Michele

    2014-09-02: WARCMerge: Merging Multiple WARC files into a single WARC file

    WARCMerge is the name given to a new tool for organizing WARC files. The name describes it -- merging multiple WARC files into a single one. In web archiving, WARC files can be generated by well-known web crawlers such as Heritrix and the Wget command, or by state-of-the-art tools like WARCreate/WAIL and Webrecorder.io, which were developed to support personal web archiving. WARC files contain records not only for HTTP responses and metadata elements but also for all of the original HTTP requests. By having those WARC files, any replay tool (e.g., the Wayback Machine) can be used to reconstruct and display the original web pages. I would emphasize here that a single WARC file may consist of records related to different web sites. In other words, multiple web sites can be archived in the same WARC file.

    This Python program runs in three different modes. In the first mode, the program sequentially reads records one by one from different WARC files and combines them into a new file, in which an extra metadata element is added to indicate when this merging operation occurred. Furthermore, a new "warcinfo" record is placed at the beginning of the resulting WARC file(s). This record contains information about the WARCMerge program and metadata for the date and time.

    The second mode is very similar to the first mode; the only difference here is the source of WARC files. In the first mode the source files are from a specific directory, while in the second mode an explicit list of WARC files is provided.


    In the third mode, an existing WARC file is appended to the end of another WARC file. In this case, only one metadata element (WARC-appended-by-WARCMerge) is added to each "warcinfo" record found in the first file.

    Finally, regardless of the mode, WARCMerge always performs checks such as validating the resulting WARC files and ensuring that the size of each resulting file does not exceed the maximum size limit. (The maximum size limit can be changed through the program's source code by assigning a new value to the variable MaxWarcSize.)
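
    The core of the first two modes can be illustrated with a short sketch: read every record from each input file and copy it into one output file. This is only an illustration using the warcio library, not WARCMerge's actual implementation (see the GitHub repository below), and it omits the extra warcinfo and metadata records that WARCMerge adds:

        import sys
        from warcio.archiveiterator import ArchiveIterator
        from warcio.warcwriter import WARCWriter

        def merge_warcs(input_paths, output_path):
            # Copy every record from each input WARC into a single output WARC.
            with open(output_path, 'wb') as out:
                writer = WARCWriter(out, gzip=output_path.endswith('.gz'))
                for path in input_paths:
                    with open(path, 'rb') as stream:
                        for record in ArchiveIterator(stream):
                            writer.write_record(record)

        if __name__ == '__main__':
            # e.g., python merge_sketch.py 1.warc 2.warc merged.warc.gz
            merge_warcs(sys.argv[1:-1], sys.argv[-1])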

    Download WARCMerge:
    • WARCMerge's source code is available on GitHub, or by running the following command:
                       git clone https://github.com/maturban/WARCMerge.git

    Dependencies:
    • Tested on Linux Ubuntu 12.04
    • Requires Python 2.7+
    • Requires Java to run Jwattool for validating WARC files
    Running WARCMerge:

    As described above, WARCMerge can be run in three different modes; see the three examples below. (Adding the option '-q' makes the program run in quiet mode, in which it does not display any messages.)


    Example 1: Merging WARC files (found in "input-directory") into new WARC file(s):

    % python WARCMerge.py ./collectionExample/ my-output-dir

       Merging the following WARC files:
       ----------------------------------:
       [Yes] ./collectionExample/world-cup/20140707174317773.warc
       [Yes] ./collectionExample/warcs/20140707160258526.warc
       [Yes] ./collectionExample/warcs/20140707160041872.warc
       [Yes] ./collectionExample/world-cup/20140707183044349.warc

       Validating the resulting WARC files:
       ----------------------------------:
         - [valid] my-output-dir/WARCMerge20140806040712197944.warc
    Example 2: Merging all listed WARC files into new WARC file(s):

    % python WARCMerge.py 1.warc 2.warc ./dir1/3.warc ./warc/4.warc mydir

        Merging the following WARC files:
        ----------------------------------:
        [Yes] ./warc/4.warc
        [Yes] ./1.warc
        [Yes] ./dir1/3.warc
        [Yes] ./2.warc

        Validating the resulting WARC files:
        ----------------------------------:
        - [valid] mydir/WARCMerge20140806040546699431.warc
    Example 3: Appending a WARC file to another WARC file. The option '-a' is used here to make sure that any change in the destination file is done intentionally:

    % python WARCMerge.py -a ./test/src/1.warc ./test/dest/2.warc

          The resulting (./test/dest/2.warc) is valid WARC file
    In case a user enters any incorrect command-line arguments, the following message will be shown:

    usage: WARCMerge [[-q] -a <source-file> <dest-file> ]                 
                                 [[-q] <input-directory> <output-directory> ]
                                 [[-q] <file1> <file2> <file3> ... <output-directory> ]  
    WARCMerge can be useful as an independent component, or it can be integrated with other existing tools. For example, WARCreate is a Google Chrome extension that helps users create WARC files for pages they visit in the browser. Instead of having hundreds of such WARC files, WARCMerge can bring all of these files together in one place.

    -M. Aturban

    2014-09-09: DL2014 Doctoral Consortium


    After exploring London on Sunday, I attended the first DL2014 session: the Doctoral Consortium. Held in the College Building at the City University London, the Doctoral Consortium offered early-career Ph.D. students the opportunity to present their research and academic plans and receive feedback from digital libraries professors and researchers.

    Edie Rasmussen chaired the Doctoral Consortium. I was a presenter at the Doctoral Consortium in 2012 with Hany SalahEldeen, but I attended this year as a Ph.D. student observer.

    Session I: User Interaction was chaired by José Borbinha. Hugo Huurdeman was first to present his work entitled "Adaptive Search Systems for Web archive research". His work focuses on information retrieval and discovery in the archives. He explained the challenge with searching not only across documents but also across time.

    Georgina Hibberd presented her work entitled "Metaphors for discovery: how interfaces shape our relationship with library collections." Georgina is working on digitally modeling the physical inputs library users receive when interacting with books and physical library media to allow the same information to be available when interacting with digital representations of the collection. For example, how can we incorporate physical proximity and accidental discovery in the digital systems, or how can we demonstrate frequency of use that would previously be shown in the condition of a book's spine?

    Yan Ru Guo presented her work entitled "Designing and Evaluating an Affective Information Literacy Game", in which she proposes serious games to help tertiary students improve their ability to perform searches and information discovery in digital environments.

    After a break to re-caffeinate, Session II: Working with Digital Collections began. Dion Goh chaired the session. Vincent Barrallon presented his work entitled "Collaborative Construction of Updatable Digital Critical Editions: A Generic Approach." This work aims to establish an updatable data structure to represent the collaborative flow of annotation, especially with respect to editorial efforts. He proposes using bidirectional graphs, triple graphs, or annotated graphs as representatives, and proposes methods of identifying graph similarity.

    Hui Li finished the session with her presentation entitled "Social Network Extraction and Exploration of Historic Correspondences" in which she is working to use Named Entity Extraction to create a social network from digitized historical documents. Her effort utilizes topic modeling and event extraction to construct the network.

    Due to a scheduling audible, lunch and Session III: Social Factors overlapped slightly. Ray Larson chaired this session, and Mat Kelly was able to attend after landing in LHR and navigating to our hotel. Maliheh Farrokhnia presented her work entitled "A request-based framework for FRBRoo graphical representation: With emphasis on heterogeneous cultural information needs." Her work takes user interests (through adaptive selection of target information) to present relational graphs of digital library content.

    Abdulmumin Isah presented his work entitled "The Adoption and Usage of Digital Library Resources by Academic Staff in Nigerian Universities: A Case Study of University of Ilorin." His work highlights a developing country's use of digital resources in academia and cites factors influencing the success of digital libraries.

    João Aguir Castro presented his work entitled "Multi-domain Research Data Description -- Fostering the participation of researchers in an ontology based data management environment." His work with Dendro uses metadata and ontologies to aid in long-term preservation of research data.

    The last hour of the consortium was dedicated to an open mic session chaired by Cathy Marshall with the goal of having the student observers present their current work. I presented first and explained my work that aims to mitigate the impact of JavaScript on archived web pages. Mat went next and discussed his work about integrating public and private web archives with tools like WAIL and WARCreate.

    Alexander Ororbia presented his work on using artificial intelligence and deep learning for error correcting crowd sourced data from scholarly texts. Md Arafat Sultan discussed his work on using natural language processing to detect similarity in text to identify text segments that adhered to set standards (e.g. educational standards). Kahyun Choi discussed her work on perceived mood in eastern music from the point of view of western listeners. Finally, Fawaz Alarfaj discussed his work using entity extraction, information retrieval, and natural language processing to identify experts within a specified field.

    As usual, the Doctoral Consortium was full of interesting ideas, valuable recommendations, and highly motivated Ph.D. students. Tomorrow marks the official beginning of DL2014.


    --Justin F. Brunelle

    2014-09-16: A long and tortuous trail to a PhD


    (or how I learned to embrace the new)


    I am reaching the end of this part of my professional, academic, and personal life.  It is time to reflect and consider how I got here.

    The trail ahead.
    When I started, I thought that I knew the path, the direction, and the work that it would take.  I was wrong.  The path was rugged, steep, and covered with roots and stones that lay in wait to trip the unwary.  The direction was not straightforward.  At times I wasn't sure how to set my compass, and which way to steer.  In the end, there was more work than I thought in the beginning.  But the end is nigh.  The path has been long.  At times the direction was confusing.  The work seemed never ending.  This is a story of how I got to the end, using a little help from "a friend" at the end of this post.

    Bringing the initially disparate disciplines of graph theory, digital preservation, and emergent behavior together to solve a particular class of problem was non-trivial.  Sometimes you have to believe in a solution before you can see it.

    Graph theory is: the study of graphs, the mathematical structures used to model pairwise relations between objects.  In my world, I focused on the application of graph theory as it applied to the creation of graphs that had the small-world properties of a high clustering coefficient and a low average path length.
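
    To make those two properties concrete, here is a small sketch (not part of the USW simulator) using the networkx library to compare a Watts-Strogatz small-world graph against a random graph of the same size:

        import networkx as nx

        # A Watts-Strogatz graph: 1000 nodes, each joined to 10 neighbors,
        # with a 10% chance of rewiring each edge.
        ws = nx.watts_strogatz_graph(n=1000, k=10, p=0.1)
        # An Erdos-Renyi random graph with the same number of nodes and edges.
        er = nx.gnm_random_graph(n=1000, m=ws.number_of_edges())

        for name, g in [('small-world', ws), ('random', er)]:
            print(name,
                  'clustering =', round(nx.average_clustering(g), 3),
                  'avg path length =', round(nx.average_shortest_path_length(g), 2))

    The small-world graph keeps a much higher clustering coefficient while its average path length stays close to that of the random graph, which is the combination the USW process aims to produce.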

    Digital preservation is: a series of managed activities necessary to ensure continued access to digital materials for as long as they are needed.  In my world, I focused on preserving the "essence" of a web object (WO), not the entire object.  WOs can include links to resources and capabilities that are protected and not visible on the "surface web."  While this web "dark matter" could contain unknown wealth and information,  I was interested in the essence of the WO and preserving that for the long term.

    Emergent behavior is: unanticipated behavior shown by a system.  In my world, I took Craig Reynolds' axiom of imbuing objects with a small set of rules, turning them loose, and seeing what happens.  My rules guided the WOs through their explorations of the Unsupervised Small-World (USW) graph, how they made decisions about which other WOs to connect to, and when and where to make preservation copies.

    Graph theory, digital preservation, and emergent behavior are brought together in the USW process; the heart of my dissertation.

    At the end of a very long climb, there is:

    A video of the USW process in action:



    My PhD Defense PowerPoint presentation on SlideShare.






     A video of my dissertation defense can be found here.

     My dissertation in two different file sizes.
    A small (19 MB) version of my dissertation.

    A much larger (619 MB) version of my dissertation can be found here.

    A simple chronology from the Start in 2007 through the PhD in 2014 (with a little help from my friend).

    2007: I started down this trail
    The "story" of my dissertation. (My friend.)

    2007 - 2013: The Unsupervised Small-World (USW) simulator (on GitHub) directly supported almost all phases of my work.  It went through many iterations from its first inception until its final form.  What started as a simple way to create simple graphs in Python, through a couple of other scripting languages, stabilized as a message-driven 5K-line C++ program.  The program served as a way to generate USW graphs to test different theories and ideas.  The simulator generated data, while offline R scripts did the heavy-lift analysis.  One of my favorite graphs was a by-product of the simulator (and it didn't have anything to do with USW).

    2008: Emergent behavior: a poster entitled "Self-Arranging Preservation Networks."

    2009: Emergent behavior and graph theory: a short paper entitled "Unsupervised Creation of Small World Networks for the Preservation of Digital Objects."

    2009: Graph theory: Doctoral consortium

    2010: Digital preservation: a long paper entitled: "Analysis of Graphs for Digital Preservation Suitability."

    2011: Graph theory: an arXiv technical report entitled: "Connectivity Damage to a Graph by the Removal of an Edge or Vertex."

    2011: Graph theory: a WS-DL blog article: "Grasshopper, prepare yourself. It is time to speak of graphs and digital libraries and other things."

    2012: Digital preservation: a long paper entitled: "When Should I Make Preservation Copies of Myself?"

    2013: Digital preservation: a WS-DL blog article: "Preserve Me! (... if you can, using Unsupervised Small-World graphs.)"

    2013: The USW robot, my own Marvin (on GitHub), grew from the lessons learned from the simulator.  Marvin worked with Sawood Alam's HTTP Mailbox application to actually create USW graphs based on data in the USW-instrumented web pages.

    2013 - 2014: Emergent behavior: working with Sawood Alam and his HTTP Mailbox application.  The Mailbox was the communication mechanism used by USW Web Objects.

    2014: Digital preservation: an updated long paper entitled: "When Should I Make Preservation Copies of Myself?"

    2014: My PhD defense (link to set of slides).

    2014: LaTeX: a WS-DL blog article: LaTeX References, and how to control them.

    2014: LaTeX: a WS-DL blog article: An ode to the "Margin Police," or how I learned to love LaTeX margins.

    2014: Dissertation submitted and accepted by the Office of the Registrar.

    In many movies, there is one line that stands out.  One line that resonates.  One line that sums up many things.  The one that comes to my mind was uttered by Sean Connery as William Forrester in the movie "Finding Forrester" when he pointed to the faded photograph on the wall and said: "I'm that one."

    The trail and the road were long and trying, with many places where things could have gone awry. But in the end, like Kwai Chang Caine and his brazier, the way out of the temple was shown and the last trial was completed.

    Chuck

    Published works (ready for copying and pasting):
    • Sawood Alam, Charles L. Cartledge, and Michael L. Nelson. HTTP Mailbox - Asynchronous RESTful Communication. Technical report, arXiv:1305.1992, Old Dominion University, Computer Science Department, Norfolk, VA, 2013.
    • Sawood Alam, Charles L. Cartledge, and Michael L. Nelson. Support for Various HTTP Methods on the Web. Technical report, arXiv:1405.2330, Old Dominion University, Computer Science Department, Norfolk, VA, 2014.
    • Charles Cartledge. Preserve Me! (... if you can, using Unsupervised Small-World graphs.). http://ws-dl.blogspot.com/2013/10/2013-10-23-preserveme-if-you-can-using.html/, 2013.
    • Charles L. Cartledge and Michael L. Nelson. Self-Arranging Preservation Networks. In Proc. of the 8th ACM/IEEE-CS Joint Conf. on Digital Libraries, pages 445 – 445, 2008.
    • Charles L. Cartledge and Michael L. Nelson. Unsupervised Creation of Small World Networks for the Preservation of Digital Objects. In Proc. of the 9th ACM/IEEE-CS Joint Conf. on Digital Libraries, pages 349 – 352, 2009.
    • Charles L. Cartledge and Michael L. Nelson. Analysis of Graphs for Digital Preservation Suitability. In Proc. of the 21st ACM conference on Hypertext and hypermedia, pages 109 – 118. ACM, 2010.
    • Charles L. Cartledge and Michael L. Nelson. Connectivity Damage to a Graph by the Removal of an Edge or Vertex. Technical report, arXiv:1103.3075, Old Dominion University, Computer Science Department, Norfolk, VA, 2011.
    • Charles L. Cartledge and Michael L. Nelson. When Should I Make Preservation Copies of Myself? Tech. Report arXiv:1202.4185, 2012.
    • Charles L. Cartledge and Michael L. Nelson. When Should I Make Preservation Copies of Myself? In Proc. of the 14th ACM/IEEE-CS Joint Conf. on Digital Libraries, page TBD, 2014.

    Published works (ready for BibTeX):

    2014-09-17: NEH ODH Project Directors' Meeting

    On Monday (Sep 15), Michael and I attended the NEH Office of Digital Humanities Project Directors' Meeting at their new location in the Constitution Center in Washington, DC. We were invited based on our "Archive What I See Now" project being funded as a Digital Humanities Implementation Grant.


    There were two main goals of the meeting: 1) provide administrative information and advice to project directors and 2) allow project directors to give a 3-minute overview of their project to the general public.

    The morning was devoted to the first goal.  One highlight for me was ODH Director Brett Bobley's welcome in which he talked a bit about the history of the NEH (NEH's 50th anniversary is coming up in 2015).  The agency is currently in the process of digitizing their historical documents, including records of all of the grants that have been awarded (originally stored on McBee Key Sort cards). He also mentioned the recent article "The Rise of the Machines" that describes the history of NEH and digital humanities. Bottom line, digital humanities is not a new thing.

    The public afternoon session was kicked off with a welcome from the new NEH Chairman, Bro Adams.

    The keynote address was given by Michael Whitmore, Director of the Folger Shakespeare Library.  He talked about how adjacency in libraries allows people to easily find books with similar subjects ("virtuous adjacency").  But, if you look deeper into a book and are looking for items similar to a specific part of the book (his example was the use of the word "ape"), then the adjacent books in the stacks probably aren't relevant ("vicious adjacency"). In a physical library, it's not easy to rearrange the stacks, but in a digital library, you can have the "bookshelf rearrange itself".

    His work uses Docuscope to analyze types of words in Shakespeare's plays.  The algorithm classifies each word according to its type (imperative, dialogue, anger, abstract nouns, ...) and then uses PCA to cluster plays according to these descriptors. One of the things learned through this visual analysis is that Shakespeare used more sentence-starting imperatives than his peers. Another project mentioned was Visualizing English Print, 1530-1799.  The project visualized topics in 1080 texts with 40 texts from each decade. The visualization tool, Serendip, will be presented at IEEE VAST 2014 in Paris (30-second video).

    After the keynote, it was time for the lightning rounds.  Each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 33 projects presented, so I'll just mention and give links to a few here.

    Lightning Round 1 - Special Projects and Start-Up Grants
    Lightning Round 2 - Implementation Grants
    • Pop Up Archive, PRX, Inc. - archiving, tagging, transcribing audio
    • Bookworm, Illinois at Urbana-Champaign - uses HathiTrust Corpus and is essentially an open-source version of Google n-gram viewer
    The program ended with a panel on how to move projects beyond the start-up phase.

    Thanks to the ODH staff (Brett Bobley, Perry Collins, Jason Rhody, Jen Serventi, and Ann Sneesby-Koch) for organizing a great meeting!

    For another take on the meeting, see the article "Something Old, Something New" at Inside Higher Ed. Also, the community has some active tweeters, so there's more commentary at #ODH2014.

    The lightning presentations were recorded, so I expect to see a set of videos available in the future, as was done with the 2011 meeting.

    One great side thing I learned from the trip is that mussels and fries (or, moules-frites) is a traditional Belgian dish (and is quite yummy).
    -Michele

    2014-09-18: Digital Libraries 2014 (DL2014) Trip Report

    Mat Kelly, Justin F. Brunelle and Dr. Michael L. Nelson travel to London, UK to report on the Digital Libraries 2014 Conference.                           

    On September 9th through 11th, 2014, Dr. Nelson (@phonedude_mln), Justin (@justinfbrunelle), and I (@machawk1) attended the Digital Libraries 2014 conference (a composite of the JCDL (see trip reports for 2013, 2012, 2011) and TPDL (see trip reports for 2013 and 2012) conferences this year) in London, England. Prior to the conference, Justin and I attended the DL2014 Doctoral Consortium, which occurred on September 8th.

    The main conference on September 9th opened with George Buchanan (@GeorgeRBuchanan) indicating that this year's conference was a combination of both TPDL and JCDL from previous years. With the first digital libraries conference being in 1994, this year marked the 20th anniversary of the conference. George celebrated this by singing Happy Birthday to the conference and introduced Ian Locks, the Master of the Worshipful Company of Stationers and Newspaper Makers, to continue the introduction.

    Ian first gave a primer and history of his organization as a "chain gang that dated back 1000 years" that "reduced the level of corruption when people could not read or write". Originally, his organization became Britain's first library of deposit, wherein printed works needed to be deposited with them, and the organization was central to the first copyright in the world in 1710.

    Ian then gave way to Martin Klein (@mart1nkle1n), who gave insight into the behind-the-scenes dynamics of the conference. He stated that the program committee had 183 members to allocate reviews for every submission received. The committee's goal was to have four first-level reviews. The papers received represented 38 countries, with the largest number coming from the U.S., followed by the U.K. and then Germany. The acceptance rate for full papers was 29% while the rate for short papers was 32%. 33 posters and 12 demos were also accepted. Interestingly, among the countries that submitted more than five papers, Brazil had the highest acceptance rate, with over half of its papers accepted.

    Martin then segued to introducing the keynote speaker, Professor Dieter Fellner of the Fraunhofer Institute. Dieter's theme consisted mostly of the different means and issues in digitally preserving 3-dimensional objects. He described the digitization as, "A grand opportunity for society, research, and economy but a grand challenge for digital libraries." In reference to object recovery for preservation before or after an act of loss he said, "if we cannot physically preserve an object, having a digital artifact is second best." Dieter then went on to tell of the inaccuracies of preserving artifacts from single or insufficient lighting conditions. To evaluate how well an object is preserved, he spoke of a "Digital Artifact Turing Test" wherein, he said, first create photos of a 3D artifact and then make a 3D model. If you can't tell the difference, then the capture can be deemed successful and representative.

    Dieter continued with some approaches they have used to achieve better lighting conditions and how varying the lighting conditions has provided instances of uncovering data that previously was hard to accurately capture. As an example, he showed a piece of driftwood from Easter Island that had an ancient etched message that was very subtle to see and thus would likely have been unknowingly used as firewood. By varying the light conditions when preserving the object, the ancient writing was exposed and preserved for later translation once more is known about the language.

    Another example he gave concerned how accurately today's scans of ancient objects can replicate the original color, citing the discolored bust of Nefertiti. Further inspection using variously colored lighting to scan produced potentially better results for a capture.

    After a short break, the meeting resumed with simultaneous sessions. I attended the "Web archives and memory" session where WS-DL's Michael Nelson led off with "When Should I Make Preservation Copies of Myself?", a work related to WS-DL's recent alumnus Chuck Cartledge's PhD dissertation. In his presentation, Michael spoke of the preservation of objects, particularly of Chuck's now-famous ancestor "Josie", for whom he had a physical photo over a hundred years old with some small bits of metadata hand-written on the back. With respect to modeling the self-preservation of the correlative digital object of the Josie photo (e.g., a scanned image on Flickr), Michael described the "movement" of how this image should propagate in the model in a way akin to Craig Reynolds' Boids in the desired behavior of collision avoidance, velocity matching, and flock centering. The set of duplicated objects in a variety of locations can be described with the "small world" concept and will not create a lattice structure in its propagation scheme.

    Chuck's implementation work includes adding a linked image embedded on the web page (using the HTML "link" tag and not the "a" tag) that allows the user to specify that they would like the object preserved. Michael then described the three policies used for duplication to ensure optimal spread of a resource, which included one-at-a-time until a limit is hit, as aggressively as possible until a soft limit then one-at-a-time until a hard limit, or a super aggressive policy of duplication until the hard limit is hit. From Chuck's work, Michael said, "It pays to be aggressive, conservation doesn't work for preservation. What we envisioned", he continued, "was to create objects that live longer than the people that created them."

    Cathy Marshall (@ccmarshall) followed Michael with "An Argument for Archiving Facebook as a Heterogeneous Personal Store". From her previous studies, she found that users were apathetic about preserving their Facebook contents, "Why we should archive Facebook despite the users not caring about archiving it?", she said. "Evidence has suggested that people are not going to archive their stuff in any kind of consistent way in the long term." Most users in her study could not think of anything to save on Facebook, assuming that if Facebook died, those files also live somewhere else and can be recovered. Despite this, she attempted to identify what users found most important in their Facebook contents with 50% of the users saying that they find the most value in their photos, 35% saying they would carry over their contacts if they needed to, and other than that, they did not care about much else.

    Michael Day (@bindonlane) followed Cathy with a review of recent work at the British Library. His group has been making attempts to implement preservation concepts from extremely large collections of digital material. They have published the British Library Content Strategy, which attempts to guide their efforts.

    When Michael was done presenting, I (Mat Kelly, @machawk1) presented my paper "The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript". The purpose of the work was to evaluate different web archiving tools and web sites in a way similar to the Acid Tests originally designed for web standards, but with more clarity and with tests of specific facets of the web with which web archiving tools have trouble.

    After I presented came the lunch session and another set of concurrent sessions. For this session I attended the "Digital Libraries: Evolving from Collections to Communities?" panel with Brian Beaton, Jill Cousins (@JilCos), and Herbert Van de Sompel (@hvdsomp), with Deanna Marcum and Karen Calhoun as moderators.

    Jill stated, "Europeana can't be everything to everybody but we can provide the data that everyone reuses." The group spoke about the Europeana 2020 strategy and how it aims to fulfill 3 principles: "share data, improve access conditions to data and create value in what we're doing". Brian asked, "Can laypeople do managed forms of expert work? That question has been answered. The real issue is determining how crowd-sourced projects can remain sustainable," referencing previous discussion on the integration of crowdsourcing to fund preservation studies and efforts. He continued, "I think we're at the moment where we are at a lot of competitors entering race [for crowd sourcing platforms]. Lots of non-profits are turning to integration of crowd-funded platforms. I'm curious to see what happens where more competition emerges for crowd-sourcing."

    Following the panel and a short break, Alexander Ororbia of Penn State presented "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" relating to CiteSeerX, "The scholarly big data platform". The application he described relates to research management, collaboration discovery, citation recommendation, expert search, etc. and uses technologies like a private cloud, HDFS, NoSQL, MapReduce, and crawl scheduling.

    Following Alex, Per Møldrup-Dalum (@perdalum) presented "Bridging the Gap Between Real World Repositories and Scalable Preservation Environments". In his presentation, Per spoke of Hadoop and his work in digitizing 32 million scanned newspaper pages using OCR and ensuring the digitization was valid according to his group's preservation policy. To accomplish this, he created a "stager" and "loader" as proof-of-concept implementations using the SCAPE APIs. Per wanted to emphasize the reusability of the products he produced, as his work was mostly based on the reuse of other projects.

    After Per, Yinlin Chen described their work on utilizing the ACM Digital Library (DL) data set as the basis for a project on finding good feature representations that minimize the differences between the source and target domains of selected documents.

    C. Lee Giles of Penn State came next with his presentation "The Feasibility of Investing of Manual Correction of Metadata for a Large-Scale Digital Library". In this work, he sought to build a classifier through a truth discovery process using metadata from Google Scholar. He found that a "web labeling system" seemed more promising compared to simple models of crowdsourcing the classification.

    This finished the presentations for the first day of the conference and the poster session followed. In the poster session, I presented my work on developing a Google Chrome extension called Mink (now publicly available) that attempts to integrate the live and archived web viewing experience.

    Day 2

    George Buchanan started the second day by introducing Professor Jane Ohlmeyer of Trinity College, Dublin (@tcddublin) and her work relating to the 1641 Depositions, the records of massacre, atrocity and ethnic cleansing in seventeenth-century Ireland. These testimonies related to the Irish Rebellion around the 22nd of October in 1641, when Catholics robbed, murdered, and pillaged their Protestant neighbors. From what's documented of the conflict, Jane noted that "we only hear one side of the suffering and we don't have reports of how the Catholics suffered, were massacred, etc.", referring to the accounts being collected mostly from a single perspective of the conflict. Jane highlighted one particular account by Anne Butler from the 7th of September, 1642, in which Anne first explained who she was and then described how neighbors with whom she had previously interacted daily in the market subsequently threatened her during the conflict solely because she was Protestant. "It's as if the depositions are allowing us to hear the story of those that suffered through fear and conflicts.", Jane said, referencing the accounts. The depositions had originally been donated to Trinity College in 1741 and were locked away due to their controversial documentation of the conflict. The accounts consist of over 19,000 pages (about 3.5 million words) and include 8,000 witness testimonies of events related to the 1641 rebellion. Multiple parties had attempted to publish the accounts in the past (including an attempt in 1930), but they had previously been censored by the Irish government because of their graphic nature. Now that the parties involved in the conflict are at peace, further work is being done by Jane's group preserving the accounts while dealing with various features of the writing (e.g., multiple spellings, document bleed-through, inconsistent data collection patterns) that might otherwise be lost were the documents naively digitized.

    Jane's group has since launched a website (in 2010) to ensure that the documents are accessible to the public; it currently has over 17,000 registered users. All of the data they have added is open source. For the launch, they had both Mary McAleese and Ian Paisley (who was notoriously anti-Catholic) together, with Paisley surprisingly saying that he advocated the publication of the documents, as it "promoted learning", and he encouraged the documents be made accessible in the classroom to 14-, 15-, and 16-year-olds so that society could "remember the past but not bound by the past". Through the digitization process, Jane's group has looked to other more recent (and some currently ongoing) controversial conflicts and how the accounts of those conflicts can be documented and released in a way that is appropriate to the respectively affected society.

    Following Jane's keynote (and a coffee break), the conference was split into concurrent sessions. I attended the "Browsing and Searching" session, where Edie Rasmussen introduced Dana McKay (@girlfromthenaki), who began her presentation, "Borrowing rates of neighbouring books as evidence for browsing". In her work, she sought to explore the concept of browsing with respect to the various digital platforms for doing so (e.g., for books on Amazon) vs. the analog of browsing in a library. With library-based browsing, a patron is able to maintain physical context and see other nearby books as shelved by the library. "Browsing is part of the human information seeking process", she said, following with the quote, "The experience of browsing a physical library is enough to dissuade people to use e-books." In her work, she used 6 physical libraries as a sample set and checked the frequency at which physically nearby books had been borrowed as a function of likelihood with respect to an initially checked out book. In preliminary research, she found that, from her sample set, over 50% of the books had ever been borrowed and just 12% had been borrowed in the last year. In an attempt to quantify browsing, she first split her set of libraries into two sets consisting of those that used the Dewey Decimal system and those that used the Library of Congress system of organization. She first tested 100 random books, checked if they had been borrowed on day Y, then checked the physically nearby books to see if they had been borrowed the day before. From her study, Dana found that there is definitely a causal effect on the location of books borrowed and that, especially in libraries, browsing has an effect on usage.

    Javier Lacasta followed Dana with "Improving the visibility of geospatial data on the Web". In the work, his group's objective was to identify, classify, interrelate, and facilitate access to geospatial web services. They wanted to create an automatic process that identified existing geospatial services on the web by using an XML specification. From the discovered services, Javier wanted to extract the content of fields containing the resource's title, description, thematic keywords, content date, spatial coordinates, textual descriptions of the place, and the creator of the service. By doing this, they hoped to harmonize the content for consistency between services. Further, they wanted a means of classifying the services by assigning values from a controlled vocabulary. The study, he said, ought to be applicable to other fields, though his discovery of services was largely limited by a lack of content for these types of services on the web.

    Martyn Harris was next with "The Anatomy of a Search and Mining System for Digital Humanities", where he looked at the barriers to tool adoption in the digital humanities spectrum. He found that documentation and usability evaluation were mostly neglected, so he looked toward "dogfooding" in developing his own tool using context-dependent toolsets. An initial prototype uses a treemap for navigating the Old Testament and considers the probability of querying each document.

    Õnne Mets followed Martyn with "Increasing the visibility of library records via a consortial search engine". The target for the study was the search engine behind the National Library of Estonia, which provides an e-books-on-demand service as well as a service for digitizing public domain books into e-book form. Their service has been implemented in 37 libraries in 12 countries and provides an "EOD button" that sends the request to the respective library to scan and transfer the images from the physical book. Their service provides a central point for users to discover EOD-eligible books and uses OAI-PMH to harvest and batch upload the book files via FTP. Despite the service's interface, Õnne said that 89% of the hits on their search interface came directly to their landing pages via Google. From this, Õnne concluded that collaboration with a consortial search engine does in fact make collections of digitized books more visible, which increases the potential audience.

    The conference then broke for lunch but returned with Daniel Hasan Dalip's presentation of "Quality Assessment of Collaborative Content With Minimal Information". In this work, Daniel investigated how users can better utilize the larger amount of information created in web documents using a multi-view approach that indicates a meaningful view of quality. As a use case, he divided a Wikipedia article into different views representing the evidence conveyed in the article. Using Support Vector Regression (SVR), he worked to identify features within the document with a low prediction error. He concluded that using the algorithm allows the feature set of 68 features to be reduced by 15%, 18%, and 25% for the three sample Wikipedia data sets "MUPPET", "STARWAR", and "WIKIPEDIA", respectively.

    As Daniel's presentation was going on, Justin viewed Adam Jatowt's presentation on analyzing how word meanings change over time. In this work, Adam showed that words change meaning over time, using tools to verify words' evolution. He first took 5-grams from The Corpus of Historical American English (COHA) on Google Books and measured both the frequency and the temporal entropy of each 5-gram. He found that if a word is popular in one decade, it's usually popular in the next decade. He also investigated similarity based on context (i.e., the position in a sentence). Through the study he discovered word similarities, as was the case with the word "nice" being synonymous with "pleasant" around the year 1850.
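    As an aside, one plausible reading of the "temporal entropy" measure mentioned above (an illustration only, not necessarily the paper's exact formulation) is the Shannon entropy of an n-gram's normalized per-decade frequencies, so that a term used evenly across decades scores high:

```python
# Sketch: frequency and "temporal entropy" of an n-gram across decades.
# Illustrative reading of the measures named in the talk, not the paper's
# exact formulation; the per-decade counts below are made up.
import math

decade_counts = {1850: 120, 1860: 140, 1870: 95, 1880: 160, 1890: 150}

total = sum(decade_counts.values())                  # overall frequency
probs = [c / total for c in decade_counts.values()]  # per-decade distribution
entropy = -sum(p * math.log2(p) for p in probs if p > 0)

print(f"total frequency: {total}")
print(f"temporal entropy: {entropy:.3f} bits (max {math.log2(len(probs)):.3f})")
```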

    I then joined Justin for Nikolaos Aletras's (@nikaletras) presentation, "Representing Topics Labels for Exploring Digital Libraries". In this work, Nikolaos stated that the problem with online documents is that they have no structure, metadata, or manually created classification system accompanying them, which makes it difficult to explore and find specific information. He created unsupervised topic models that were data-driven and captured the themes discussed within the documents. The documents were then represented as a distribution over various topics. To accomplish this, he developed a topic model pipeline where a set of documents acted as the input, with the output consisting of two matrices: topic-word (probability of each word given a topic) and topic-document (probability of each document given a topic). He then used his trained model to identify as many documents as possible relevant to a set of queries within 3 minutes in a document collection using document models. The data set used was a subset of the Reuters Corpus from Rose et al. 2002. This data set had already been manually classified, so it could be used for model verification. From the data set, 20 subject categories were used to generate a topic model. 84 topics were produced and provided an alternative means of browsing the documents.
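    A minimal sketch of such a pipeline, assuming scikit-learn's LDA implementation and a toy corpus rather than whatever toolkit and data the author actually used, showing where the two matrices come from:

```python
# Sketch: a topic-model pipeline yielding the two matrices described above.
# Assumes scikit-learn; the toy corpus and parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks fell as markets reacted to interest rates",
    "the team won the match in the final minutes",
    "central bank raises rates to curb inflation",
    "players and coaches celebrated the championship",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)            # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)              # per-document topic proportions
topic_word = lda.components_                       # per-topic word weights

print(doc_topic.round(2))                          # each row sums to ~1
print(topic_word.shape)                            # (n_topics, vocabulary size)
```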

    Han Xu presented next with "Topical Establishment Leveraging Literature Evolution", where he attempted to discover research topics from a collection of papers and to measure how well or not a given topic is recognized by the community. First, Han's group identified research topics whose recognition can be described as either persistent, withering, or booming. Their approach was inspired by bidirectional mutual reinforcement between papers and topical recognition. By using the weight of a topic as a sum of its recognitions in papers, he could compare against PageRank and RALEX (their previous work using random walks) and show that their own approach was more suitable, as it was designed to take into account literature evolution, unlike PageRank.

    Fuminori Kimura was next with "A Method to Support Analysis of Personal Relationship through Place Names Extracted from Documents", a followup study on previous research for extracting personal relationships through place names. In this work, they extracted personal names and place names and counted the co-occurrence between them. Next, they created a person's feature vector, then calculated the personal relationships and stored this product in a database for further analysis. When a personal name and a place name appeared in the same paragraph, they hypothesized, it is an indicator of the relationship between the person and the location. Using cosine similarity and clustering, Fuminori found that initial tests of their work on Japanese historical documents could produce a relationship network graph of closely related people backed by their common relationships with locations.

    After a short break, the final set of concurrent sessions started, where I attended Christine Borgman's (@scitechprof) presentation of "The Ups and Downs of Knowledge Infrastructures in Science: Implications for Data Management". In this work, she spoke of how countries in Europe, the U.S., and other parts of the world are now requiring scholars to release the data from their studies and questioned what sort of digital libraries we should be building for this data. Her work reported on the progress from the Alfred P. Sloan Foundation's study of 4 different scientific processes and how they make and use data. "What kind of new professionals should be prepared for data mining", she asked. She described four different projects in a 2x2 matrix where two had large amounts of data and two were projects that were just ramping up (with each project of the four holding a unique combination of these traits). The four projects (the Center for Embedded Networked Sensing (CENS), the Sloan Digital Sky Survey (SDSS), the Center for Dark Energy Biosphere Investigations (C-DEBI), and the Large Synoptic Survey Telescope (LSST)) each either had previous methods of storing the data or were proposing ways to handle, store, and filter the large amount of data to come. "You don't just trickle the data out as it comes across the instruments. You must clean, filter, document, and release very specific blocks.", she said of some projects releasing cleaned data sets while others were planning to release the raw data to the public. "Each data is accompanied by a paper with 250 authors", she said, highlighting that they were greatly used as a basis for much further research.

    Carl Lagoze of the University of Michigan presented next with "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository", which he described as "yet another metadata repository collection system." In a deal between the NSF and the Census Bureau, he worked to make better use of the Census Bureau's huge amount of data. Further work on the data was meant to increase the emphasis on having scientists make data available on the network and make the data useful for replicating methods, verifying/validating studies, and taking advantage of the results. Key facets of the census data are that it is highly controlled and confidential, with the latter describing both the content itself as well as the metadata of the content. Because of this, both identity and provenance were key issues that had to be dealt with in the controlled data study. Regarding the mixing of this confidential data with public data, Carl said, "Taking controlled data spaces and mixing it with uncontrolled data spaces creates a new data problem in data integrity and scientific integrity."

    David Bainbridge presented next with "Big Brother is Watching You - But in a Good Way", where he initially presented the use case of having had something on his screen earlier but being unable to remember the specifics of the text. His group has created a system that records and remembers text that has been displayed on a machine running X Windows (think: Linux) and allows the collected data to be searched with graphical recall.

    During the presentation, David gave a live demo wherein he visited a website, which was immediately indexed and became searchable as well as showing results from earlier relevant browsing sessions.

    Rachael Kotarski (@RachPK) presented next with "A comparative analysis of the HSS & HEP data submission workflows", where she worked with a UK data archive looking for social science data. She noted that users registering for an ORCID greatly helps with the mining process and takes only five minutes.

    Nikos Houssos (@nhoussos) presented the last paper of the day with "An Open Cultural Digital Content Infrastructure", where he spoke of 70 cultural heritage projects costing about 60 million Euros and how his group has helped associate successful validation with funding cash flows. By building a suite of services for repositories, they have provided a single point of access for these services through aggregation and harvesting. Much of the back-end, he said, is largely automated checking and compliance for safekeeping.

    Nikos closed out the sessions for Day 2. Following the sessions, the conference dinner was held at The Mermaid Function Centre.

    Day Three

    The third day of Digital Libraries was short, but leading off was ODU WS-DL's own Justin Brunelle (@justinfbrunelle) with "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources". In this paper, we (I was a co-author) investigated the effects of missing resources on an archived web page and how the impact of a resource is not sufficiently evaluated on an each-resource-has-equal-weight basis. Instead, using measures of size, position, and centrality, Justin developed an algorithm to weight a missing resource's impact (i.e., "damage") to a web page if it is not captured by the archive. He initially used the example of the web comic XKCD and how a single resource (the main comic) has much more importance for the page's purpose than all other resources on the page. When missing a stylesheet, the algorithm considers the background color of the page and the concentration of content, with the assumption that if the stylesheet is missing and important, most of the content will be in the left third of the page.
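    To give a flavor of this kind of weighting, here is a toy sketch; the feature mix and weights below are invented for illustration and are not the algorithm from the paper, which combines size, position, and centrality in its own way:

```python
# Sketch: weighting a missing embedded resource's "damage" to a page.
# The features and weights are invented for illustration; the paper's
# algorithm combines size, position, and centrality differently.
def resource_damage(width, height, x, y, page_w, page_h, is_stylesheet=False):
    area = (width * height) / float(page_w * page_h)          # relative size
    cx, cy = x + width / 2.0, y + height / 2.0
    off_center = (abs(cx - page_w / 2.0) / page_w +
                  abs(cy - page_h / 2.0) / page_h)            # 0 means centered
    centrality = 1.0 - min(off_center, 1.0)
    damage = 0.6 * area + 0.4 * centrality                    # illustrative mix
    return damage + (0.5 if is_stylesheet else 0.0)           # missing CSS is costly

# The XKCD example: the main comic dominates the page's purpose.
print(resource_damage(740, 600, 130, 250, 1000, 1200))        # large, central image
print(resource_damage(100, 30, 10, 10, 1000, 1200))           # small corner banner
```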

    Hugo Huurdeman (@TimelessFuture) followed Justin with "Finding Pages on the Unarchived Web" by first asking, "Given that we cannot crawl lost web pages, how can we recover the content lost?" Working with the National Library of the Netherlands, whose archive consists of about 10 terabytes of data from 2007, they focused on a subset of this data for 2012 with a temporal span of one year. From this they extracted the data for processing and sought to answer three research questions:

    1. Can we recover a significant fraction of unarchived pages?
    2. How rich are the representations for the unarchived pages?
    3. Are these representations rich enough to characterize the content?

    Using a measure involving Mean Reciprocal Rank, they took the average scores of the first correct result of each query while utilizing keywords within the URLs for non-homepages. A second measure, "Success Rate", allowed them to estimate that 46.7% of homepages and 46% of non-homepages could have a summary generated even if never preserved. Their approach claimed to "reconstruct significant parts of the unarchived web" based on descriptions and link evidence pointing to the unpreserved pages.
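    For reference, Mean Reciprocal Rank averages the reciprocal rank of the first correct result per query; a minimal sketch (with made-up ranks, not the paper's data):

```python
# Sketch: Mean Reciprocal Rank over a set of queries.
# Each entry is the 1-based rank of the first correct result for a query,
# or None if nothing correct was returned; the values are made up.
def mean_reciprocal_rank(first_correct_ranks):
    scores = [1.0 / r if r else 0.0 for r in first_correct_ranks]
    return sum(scores) / len(scores)

print(mean_reciprocal_rank([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ~= 0.458
```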

    Nattiya Kanhabua presented last in the session with "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia", where she investigated the scenario of a computer that forgets intentionally and how that plays into digital preservation. "Forgetting plays a crucial role for human remembering and life," she said. Nattiya spoke of "managed forgetting", i.e., remembering the right information. "Individuals' memories are subject to a fast forgetting process." She referenced various psychological studies to correlate the preservation process with "flashbulb memories". For a case study, they looked at the Wikipedia view logs as a signal for collective memory, as they're publicly available traffic over a long span of time. "Looking at page views does not directly reflect how people forget; significant patterns are a good estimate for public remembering," she said. Their approach developed a "remembering score" to rank related past events and identify features (e.g., time, location) as having a high correlation with remembering.

    Following a short final break, the final paper presentations of the conference commenced. I was able to attend the last two presentations of the conference, where C. Lee Giles of Penn State University presented "RefSeer: A Citation Recommendation System", a citation recommendation system based on the content of an entire manuscript query. His work served as an example of how to build a tool on top of other systems through integration. To further facilitate this, the system contains a novel language translation method and is intended to help users write papers better.

    Hamed Alhoori presented the last paper of the conference with "Do Altmetrics Follow the Crowd or Does the Crowd Follow Altmetrics?" where he used bookmarks as metrics. His work found that journal-level altmetrics have significant correlation among themselves compared with the weak correlations within article-level altmetrics. Further, they found that Mendeley and Twitter have the highest usage and coverage of scholarly activities.

    Following Hamed's presentation, George Buchanan provided information on next year's JCDL 2015 and TPDL 2015 (which would again be split into two locations) and what ODU WS-DL was waiting for: the announcements for best papers. For best student paper, the nominees were:

    • Glauber Dias Gonçalves, Flavio Vinicius Diniz de Figueiredo, Marcos Andre Goncalves and Jussara Marques de Almeida. Characterizing Scholar Popularity: A Case Study in the Computer Science Research Community
    • Daniel Hasan Dalip, Harlley Lima, Marcos Gonçalves, Marco Cristo and Pável Calado. Quality Assessment of Collaborative Content With Minimal Information
    • Justin F. Brunelle, Mat Kelly, Hany Salaheldeen, Michele C. Weigle and Michael Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources

    For best paper, the nominees were:

    • Chuck Cartledge and Michael Nelson. When Should I Make Preservation Copies of Myself?
    • David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, John Wilkerson and Nick Stramp. Detecting and Modeling Local Text Reuse
    • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar and Arjen P. de Vries. Finding Pages on the Unarchived Web

    The results, announced via tweet, served as a great finish to a conference with many fantastic papers that we will be exploring in-depth for the next year.

    — Mat

    2014-09-18: A tale of two questions


    (with apologies to Charles Dickens, Robert Frost, and Dr. Seuss)


    "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, ..."(A Tale of Two Cities, by Charles Dickens).

    At the end of this part of my journey, it is time to reflect on how I got here, and what the future may hold.

    Looking back, I am here because of answering two simple questions.  One from a man who is no longer here, one from a man who still poses new and interesting questions.  Along the way, I've formed a few questions of my own.

    The first question was posed by my paternal uncle, Bertram Winston.  Uncle Bert was a classic type A personality.  Everything in his life was organized and regimented.  When planning a road trip across the US, he would hand-write the daily itinerary: when to leave a specific hotel, how many miles to the next hotel, phone numbers along the way, people to visit in each city, and sites to see.  He would snail-mail a copy of the itinerary to each friend along the way, so they would know when to expect him and Aunt Artie to arrive (and to depart).  He did this all before MapQuest and Google Maps.  He did all of this without a computer, using paper maps and AAA tour books.
    Uncle Bert and Aunt Artie

    Bert took this attention to detail to the final phase of his life.  As he made preparations for his end, he went through their house and boxed up pictures and mementos for friends and family.  These boxes would arrive unannounced, and were full of treasures.  After receiving, opening, and sharing this detritus with Mary and our son Lane, I thanked Bert for helping to answer some of the questions that had plagued me since I was a child.  During the conversation, he posed the first question to me.  Bert said that he had been through his house many times and still had lots of stuff left that he didn't know what to do with.  He said, "What will I do with the rest?"  I said that I would take it, all of it, and that I would take care of each piece.

    I continued to receive boxes until his death.  With each, Mary, Lane, and I would sit in our living room and I would explain the history behind each memento.  One of these mementos was a picture of Josie McClure.  She became my muse for answering the second question.
    Josie McClure, my muse.



    Dr. Michael L. Nelson, my academic parent.
    The second question was posed by my academic "parent," Michael L. Nelson.  One day in 2007, he stopped me in the Engineering and Computational Sciences Building on the Old Dominion University campus, and posed the question "Are you interested in solving a little programming problem?"  I said "yes" not having any idea about the question, the possible difficulties involved, the level of commitment that would be necessary, or the incredible highs and lows that would torment my soul.  But I did know that I liked the way he thought, his outlook on life, and his willingness to explore new ideas.

    The combination of answering two simple questions resulted in a long journey, filled with incredible highs brought on by discovering things that no one else in the world knew or understood, and incredible lows brought on by no one else in the world knowing or understanding what I was doing.  My long and tortuous trail can be found here.

    While on this journey, I have accreted a few things that I hope will serve me well.

    My own set of questions:


    1.  What is the problem??  Sometimes just formulating the question is enough to see the solution, or puts the topic into perspective and makes it non-interesting.  Formulating the problem statement can be an iterative process where constant refining reveals the essence of the problem.

    2.  Why is it important??  The world is full of questions.  Some are important, others are less so.  Everyone has the same number of hours per day, so you have to choose which questions are important in order to maximize your return on the time you spend.

    3.  What have others done to try and solve the problem??  If the problem is good and worthy, then take a page from Newton and see what others have done about the problem.  It may be that they have solved the problem and you just hadn't been able to spend the time trying to find an existing solution.  If they haven't solved the problem, then you might be able to say (as Newton was wont to say) "If I have seen further it is by standing on the shoulders of giants."

    4.  What will I do to solve the problem??  If no one has solved the problem, then how will you attack it??  How will your approach be different or better than everything done  by everyone else??

    5.  What did I do to prove I solved the problem??  How to show that your approach really solved the problem??

    6.  What is the conclusion??  After you have labored long and hard on a problem, what do you do with the knowledge you have created??

    Be an active reader.

    Read everything closely to ensure that I understand what the author was (and was not) saying.  Making notes in the margins on what has been written.  Noting the good, the bad, and the ugly.  If it is important enough, track down the author and speak to them about the ideas and thoughts they had written.  Imagine if you will, receiving a call from a total stranger about something that you've published a few years before.  It means that someone has read your stuff, has questions about it, and that it was important enough to talk directly to you.  How would you feel if that happened to you??  I've made those calls and you can almost feel the excitement radiating through the phone.

    Understand all the data you collect.

    In keeping with Isaac Asimov's view on data: "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'"  When we conduct experiments, we collect data of some sort.  Be that memento temporal coverage, public digital longevity, digital usage patterns, or data of all sorts and types.  Then we analyze the data, and try to glean a deeper understanding.  Watch for the outliers; the data that "looks funny" has additional things to say.

    Everyone has stories to tell.  

    Our stories are the threads of the fabric of our lives.  Revel in stories from other people.  Those stories they choose to share are an intimate part of what makes them who they are.  Treat their stories with care and reverence, and they will treat yours the same way.

    Don't be afraid to go where others have not.  

     During our apprenticeship, all our training and work point us to new and uncharted territories.  To wit:
    "...
    Two roads diverged in a wood, and I,
    I took the one less traveled by,
    And that has made all the difference."
    (The Road Not Taken, by Robert Frost)

    Remember through it all;


    The highs are incredible, the lows will crush your soul, others have survived, and that you are not alone.

    And in the end,

    "So...
    be your name Buxbaum or Bixby or Bray
    or Mordecai Ali Van Allen O'Shea,
    you're off to Great Places!
    Today is your day!
    Your mountain is waiting.
    So...get on your way!"
    (Oh, the Places You'll Go!, by Dr. Seuss)





    With great fondness and affection,

    Chuck Cartledge
    The III. A rapscallion.  A husband.  A father.  A USN CAPT.  A PhD.  A simple man.








    Thanks to Sawood Alam, Mat Kelly, and Hany SalahEldeen for their comments and review of "my 6 questions."  They were appreciated and incorporated.

    2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

    The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. I searched for the term "dictionary" while I was casually browsing the scanned book collection to see how many dictionaries they have. I found several dictionaries in various languages. I randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson from the search results. I opened the dictionary in fullscreen mode using IA's open-source online BookReader application. This book reader application has common tools for browsing an image-based book, such as flipping pages, seeking a page, zooming, and changing the layout. In the toolbar it has some interesting features like reading aloud and full-text searching. I wondered how it could possibly perform text searching and read aloud a scanned raster-image-based book. I peeked inside the page source code, which pointed me to some documentation pages. I realized it is using an Optical Character Recognition (OCR) engine called ABBYY FineReader to power these features.

    I was curious to find out how the term "dictionary" was defined in a dictionary of the early 19th century. So I gave the "search inside" feature of IA's book reader a try and searched for the term "dictionary" there. It took about 40 seconds to search for the lookup term in a book with 850 pages and returned three results. Unfortunately, they were pointing to the title and advertisement pages where this term appeared, but not the page where it was defined. After this failed OCR attempt, I manually flipped pages in the BookReader back and forth the way word lookup is performed in printed dictionaries until I reached the appropriate page. Then I located the term on the page, and the definition there was, "A book containing the words of any language in alphabetical order, with explanations of their meaning; a lexicon; a vocabulary; a word-book." I thought I would give the "search inside" feature another try. According to the definition above, a dictionary is a book, hence I chose "book" as the next lookup term. This time the BookReader took about 50 seconds to search and returned 174 possible places where the term was highlighted in the entire book. These matches included derived words and definitions or examples of other words where the term "book" appeared. Although the OCR engine did work, the goal of finding the definition of the lookup term was still not achieved.

    After experimenting with an English dictionary, I was tempted to give another language a try. When it comes to a non-Latin language, there is no better choice for me than Urdu. Urdu is a Right-to-Left (RTL) complex script language inspired by the Arabic and Persian languages; it shares a lot of vocabulary and grammar rules with Hindi, is spoken by more than 100 million people globally (the majority in Pakistan and India), and happens to be my mother tongue as well. I picked an old dictionary entitled Farhang-e-Asifia (1908) - Sayed Ahmad Dehlavi (four volumes). I searched for several terms one after the other, but every time the response was "No matches were found," although I verified their existence in the book. It turns out that ABBYY FineReader claims OCR support for about 190 languages, but it does not support more than 60% of the world's 100 most popular languages, and the recognition accuracy of the supported languages is not reliable.


    Dictionaries are a condensed collection of the words and definitions of a language and capture the essence of the cultural vocabulary of the era in which they are prepared, hence they have great archival value and are of equal interest to linguists and archivists. Improving the accessibility of preserved scanned dictionaries will make them more useful not only for linguists and archivists, but for general users too. Unlike general literature books, dictionaries have some special characteristics: they are sorted to make the lookup of words easy, and lookup in dictionaries is fielded searching as opposed to full-text searching. These special properties can be leveraged when developing an application for accessing scanned dictionaries.

    To solve the scanned dictionary exploration and word lookup problem, we chose a crowdsourced manual approach that works well for every language irrespective of how poorly it is supported by OCR engines. In our approach, pages or words of each dictionary are indexed manually to load the appropriate pages that correspond to the lookup word. Our indexing approach is progressive, hence it increases the usefulness and ease of lookup as more crowdsourced energy is put into the system, starting from the base case, "Ordered Pages", which is at least as good as IA's current BookReader. In the next stage the dictionary can go into the "Sparse Index" state, in which the first lookup word of each page is indexed; this is sufficient to determine the page where any arbitrary lookup word can be found if it exists in the dictionary. To further improve the accessibility of these dictionaries, an exhaustive "Full Index" is prepared that indexes every single lookup word found in the dictionary with its corresponding pages, as opposed to just the first lookup word of each page. This index is very helpful in certain dictionaries where the sorting of words is not linear. To determine the exact location of the lookup word on the page, we have a "Location Index" that highlights the place on the page where the lookup word is located to point the user's attention there. Apart from indexing, we have introduced an annotation feature where users can link various resources to words on dictionary pages. Users are encouraged to help and contribute to improving the various indexes and annotations as they use the application. For a more detailed description of our approach, please read our technical report:
    Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages. Technical Report arXiv:1409.1284, 2014.
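    To illustrate how a "Sparse Index" is enough to resolve a lookup word to a page, here is a minimal sketch; the index entries below are hypothetical, and a real dictionary's ordering (especially for complex scripts) would need a proper collation function rather than plain string comparison:

```python
# Sketch: page lookup against a "Sparse Index" (first lookup word of each page).
# The entries below are hypothetical; real indexes come from crowdsourced
# contributions, and complex scripts need proper collation, not plain <=.
from bisect import bisect_right

# (first word on page, page number), sorted in the dictionary's own order
sparse_index = [("aardvark", 12), ("banner", 45), ("candle", 90), ("dawn", 130)]
first_words = [word for word, _ in sparse_index]

def page_for(term):
    i = bisect_right(first_words, term) - 1   # last page whose first word <= term
    return sparse_index[max(i, 0)][1]

print(page_for("book"))   # lands on the page that starts with "banner" -> 45
```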
    We have built an online application called "Dictionary Explorer" that utilizes the indexing described above and has an interface suitable for dictionaries. The application serves as an explorer of various dictionaries in various languages and at the same time presents various context-aware controls for feedback to contribute to indexes and annotations. In the Dictionary Explorer the user selects a lookup language, which loads a tree-like word index in the sidebar for the selected language and various tabs in the center region; each tab corresponds to one monolingual or multilingual dictionary that has indexes in the selected language. The user can then either directly input the lookup term in the search field or locate the search term in the sidebar by expanding the corresponding prefixes. Once the lookup is performed, all the tabs are loaded simultaneously with the appropriate pages corresponding to the lookup term in each dictionary. A pin is placed where the word exists on the page if the location index is available for the lookup word, which allows interaction with the word and annotations. A special tab accumulates all the related resources such as user-contributed definitions, audio, video, images, examples, and resources from third-party online dictionaries and services.



    Following are some feature highlights to summarize the Dictionary Explorer application:
    • Support for various indexing stages.
    • Indexes in multiple languages and multiple monolingual and multilingual dictionaries in each language.
    • Bidirectional (right-to-left and left-to-right) language support.
    • Multiple input methods such as keyboard input, on-screen keyboard, and word prefix tree.
    • Simultaneous lookup in multiple dictionaries.
    • Pagination and zoom controls.
    • Interactive location marker pins.
    • Context aware user feedback and annotations.
    • Separate tab for related resources such as user contributions, related media, and third-party resources.
    • API for third-party applications (a hypothetical request/response sketch follows this list).
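    The report does not spell out the third-party API, so the following is purely a hypothetical sketch of what a lookup response could look like; the endpoint path and every field name are invented for illustration:

# Hypothetical request:  GET /api/lookup?lang=urdu&word=kitab
# Hypothetical response, shown here as a Python dict:
example_response = {
    "word": "kitab",
    "language": "urdu",
    "results": [
        {
            "dictionary": "Sample Urdu-English Dictionary",
            "page": 812,
            "index_level": "location",        # ordered / sparse / full / location
            "pin": {"x": 0.41, "y": 0.63},    # relative position of the word on the page image
        },
    ],
    "annotations": [],                        # user-contributed definitions, audio, images, links
}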
    We have successfully developed a progressive approach to indexing that enables lookup in scanned dictionaries of any language with very little initial effort and improves over time as more people interact with the dictionaries. In the future we want to explore the specific challenges of indexing and interaction in several other languages, such as Mandarin or Japanese, whose dictionaries are not sorted in a simple alphabetical order because of their huge character sets. We also want to utilize the indexes that users have already developed over time to predict pages for lookup terms in dictionaries that are not indexed yet or have only partial indexing. We have the intuition that we can automatically predict the pages of an arbitrary dictionary for a lookup term, with acceptable variance, by aligning the pages of a dictionary with one or more resources such as the indexes of other dictionaries in the same language, a corpus of the language, the most popular words in the language, and partial indexes of the dictionary itself.


    Resources

    --
    Sawood Alam


    2014-10-03: Integrating the Live and Archived Web Viewing Experience with Mink

    The goal of the Memento project is to provide a tighter integration between the past and current web.    There are a number of clients now that provide this functionality, but they remain silent about the archived page until the user remembers to invoke them (e.g., by right-clicking on a link).

    We have created another approach based on persistently reminding the user just how well archived (or not) are the pages they visit.  The Chrome extension Mink (short for Minkowski Space) queries all the public web archives (via the Memento aggregator) in the background and will display the number of mementos (that is, the number of captures of the web page) available at the bottom right of the page.  Selecting the indicator allows quick access to the mementos through a dropdown.  Once in the archives, returning to the live web is as simple as clicking the "Back to Live Web" button.
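    As a rough illustration of what happens behind the scenes (this is not Mink's source code), the sketch below fetches a link-format TimeMap from a Memento aggregator and counts the mementos it lists; the aggregator URL is an assumption and the parsing is deliberately simplistic:

import re
import requests  # third-party HTTP library

# Assumed aggregator endpoint for link-format TimeMaps; the exact URL may differ.
TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"

def memento_count(url):
    """Fetch the TimeMap for `url` and count the mementos it lists."""
    resp = requests.get(TIMEMAP + url, timeout=30)
    resp.raise_for_status()  # an unarchived page may come back as 404
    # Pull every rel attribute out of the link-format response and keep
    # the entries tagged as mementos (e.g., "memento", "first memento").
    rels = re.findall(r'rel="([^"]*)"', resp.text)
    return sum(1 for rel in rels if "memento" in rel.split())

print(memento_count("http://www.cs.odu.edu/"))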

    For the case where there are too many mementos to make navigating an extensive list usable (think CNN.com captures), we have provided a "Miller Columns" interface that allows hierarchical navigation and is common in many operating systems (though most people don't know it by name).

    For the opposite case where there are no mementos for a page, Mink provides a one-click interface to submit the page to Internet Archive or Archive.today for immediate preservation and provides just-as-quick access to the archived page.

    Mink can be used concurrently with Memento for Chrome, which provides a different modality of letting the user specify desired Memento-Datetime as well as reading cues provided by the HTML pages themselves.  For those familiar with Memento terminology, Memento for Chrome operates on TimeGates and Mink operates on TimeMaps.  We also presented a poster about Mink at JCDL 2014 in London (proceedings, poster, video).

    Mink is for Chrome, free, publicly available (go ahead and try it now!), and open source (so you know there's no funny business going on).

    —Mat

    2014-10-07: FluNet Visualization

    (Note: This wraps up the current series of posts about visualizations created either by students in our research group or in our classes. I'll post more after the Spring 2015 offering of the course.)

    I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012, 2013)

    The final visualization in this series is an interactive visualization of the World Health Organization's global influenza data, created by Ayush Khandelwal and Reid Rankin in the Fall 2013 InfoVis course. The visualization is currently available at https://ws-dl.cs.odu.edu/vis/flunet-vis/ and is best viewed in Chrome.

    The Global Influenza Surveillance and Response System (GISRS) has been in operation since 1995 and aggregates data weekly from laboratories and flu centers around the world. The FluNet website was constructed to provide access to this data, but does not include interactive visualizations.

    This project presents an interactive visualization of all of the GISRS data available through FluNet as of October 2013. The main visualization is an animated 3D choropleth globe where hue corresponds to virus lineage (influenza type A or type B) and color intensity corresponds to infection level. This shows the propagation of influenza across the globe over time.  The globe is also semi-transparent, so that the user can see how influenza infection rates change on the opposite hemisphere. The user may pick a specific time period or press the play button and watch the yearly cycle of infection play itself out on the globe's surface.

    The visualization also includes the option to show a 2D version of the globe, using the Natural Earth projection.

    There is a stacked area slider located under the globe for navigating through time (example of a "scented widget").  The stacked area chart provides a view of the progression of infection levels over time and is shown on a cubic-root scale to compensate for the peaks during the 2009 flu pandemic.
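    For readers unfamiliar with the trick, a cube-root scale compresses extreme peaks while keeping ordinary weeks visible; the toy numbers below are only illustrative and are not FluNet data:

# Toy weekly case counts: typical weeks versus a pandemic spike.
counts = [40, 200, 1500, 9000, 250000]

peak = max(counts)
linear = [c / peak for c in counts]                  # normalized linear scale
cube_root = [(c / peak) ** (1 / 3) for c in counts]  # normalized cube-root scale

for c, lin, cr in zip(counts, linear, cube_root):
    print(f"{c:>7}  linear={lin:.3f}  cube_root={cr:.3f}")
# The 40-case week is essentially invisible on the linear scale (0.000)
# but still readable (about 0.054) on the cube-root scale.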

    If the user clicks on a country, a popout chart will be displayed, showing a single year of data for that country, centered on the current point in time.  The default view is a stacked area chart, but there are options to show either a streamgraph or an expanded 100% stacked area chart.  The popout chart animates with the choropleth.

    The video below shows a demo:


    Although the data was freely available from the GISRS website, there was still a significant amount of data cleaning involved.  Both OpenRefine and Mr. Data Converter were used to clean and format the data into JSON.  The D3.js, NVD3, and TopoJSON libraries were used to create the visualization.

    Our future work on this project involves turning this into an extensible framework that can be used to show other global datasets over time.

    -Michele

    2014-10-16: Grace Hopper Celebration of Women in Computing (GHC) 2014

    Photo credit to my friend Mona El Mahdy
    I was thrilled and humbled to attend the Grace Hopper Celebration of Women in Computing (GHC) 2014 for the second time; it is the world's largest gathering of women technologists. GHC is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together the research and career interests of women in computing and encourage the participation of women in computing. The twentieth anniversary of GHC was held in Phoenix, Arizona on October 8-10, 2014. This year GHC almost doubled last year's attendance, bringing together 8,000 women with research and business interests from about 67 countries and about 900 organizations to get inspired, gain expertise, get connected, and have fun.

    Aida Ghazizadeh from the Department of Computer Science at Old Dominion University was also awarded a travel scholarship to attend this year's GHC. I hope ODU will have more participation in the upcoming years.

    The conference theme this year was "Everywhere. Everyone." Computer technologies are everywhere, and everyone should be included in driving innovation. There were multiple technical tracks featuring the latest technologies in many fields such as cloud computing, data science, security, and Apple's new Swift programming language and its Playgrounds. Conference presenters represented many different sectors, such as academia, industry, and government. The non-profit Computing Research Association Committee on the Status of Women in Computing Research (CRA-W) also offered sessions targeted towards academia and business. I had a chance to attend the Graduate Cohort Workshop in 2013, which was held in Boston, MA, and wrote a blog post about it.


    The first day started off with Dr. Telle Whitney, the president and CEO of the Anita Borg Institute, welcoming the 8,000 conference attendees. She mentioned how GHC was held for the first time in 1994 in Washington, DC to bring together the research and career interests of women in computing and encourage the participation of women in computing. "Join in, connect with one another, be inspired by our speakers, be inspired by our award winners, develop your own skill and knowledge at the leadership workshops and at the technical sessions, let's all inspire and increase the ratio, and make technology for everyone everywhere," Whitney said. Then she introduced Alex Wolf, the President of the Association for Computing Machinery (ACM) and a professor in Computing at Imperial College London, UK, for opening remarks.

    Ruthe Farmer
    Barbara Birungi and Durdana Habib
    After the opening remarks, the ABIE Awards for Social Impact and Change Agent were presented by the awards' sponsors. The recognitions went to Ruthe Farmer, Barbara Birungi, and Durdana Habib, who gave nice and motivating talks. Some highlights from Farmer's talk were:
    • "The next time you witness a technical woman doing something great, please tell her, or better tell others about her."
    • "The core of aspiration in computing is a powerful formula recognition plus community.” 
    • "Technical Women are not outliers."
    • "Heads up to all of you employers out there. There is a legion of young women heading your way that will negotiating their salaries ... so budget accordingly!"

    The keynote of the first day was by Shafi Goldwasser, RSA Professor of Electrical Engineering and Computer Science at MIT and 2012 recipient of the Turing Award, about the history and benefits of cryptography and her own work in the field. She discussed the challenges in encryption and cloud computing. Here are some highlights from Goldwasser's talk:
    • "With the magic of cryptography, we can get the benefits of technology without the risks."
    • "Cryptography is not just about finding the bad guys, it is really about correctness, and privacy of computation"
    • "I believe that a lot of the challenges for the future of computer science are to think about new representations of data. And these new representations of data will enable us to solve the challenges of the future."



    Picture taken from My Ramblings blog
    After the opening keynote, we attended the Scholarship Recipients Lunch, which was sponsored this year by Apple. We had engineers from Apple at each table to talk with during the lunch.

    The sessions started after the lunch break. I attended the CRA-W track "Finding Your Dream Job Presentations", which had presentations by Jaeyeon Jung from Microsoft Research and Lana Yarosh from the University of Minnesota. The session targeted late-stage graduate students, helping them decide how to apply for jobs, how to prepare for interviews, and how to negotiate a job offer. The presenters allotted a large time slot for questions after they finished their presentations. For more information about the "Finding Your Dream Job Presentations" session and its highlights, here is an informative blog post:
    GHC14 - Finding your Dream Job Presentations


    A global community of women leaders panel
    The next session I attended was the "A Global Community of Women Leaders" panel in the career track, moderated by Jody Mahoney (Anita Borg Institute). The panelists were Sana Odeh (New York University), Judith Owigar (Akirachix), Sheila Campbell (United States Peace Corps), and Andrea Villanes (North Carolina State University). They explained their roles in increasing the number of women in computing and the best ways to identify global technology leaders through their experience. At the end, they opened the floor to questions from the audience. "In the Middle East, the women in technology represent a big ratio of the people in computing," said Sana Odeh.

    There were many interesting sessions, such as "Building Your Professional Persona Presentations" and "Building Your Professional Network Presentations", which presented how to build your professional image and how to promote yourself and present your ideas in a concise and appealing way. These are two blog posts that cover the two sessions in detail:
    Facebook booth in the career fair #GHC14
    In the meantime, the career fair launched on the first day, Wednesday, October 8, at 4:30 - 6:30 p.m. and continued through the second day and part of the third day. The career fair is a great forum for facilitating open conversations about career positions in industry and academia. Many famous companies, such as Google, Facebook, Microsoft, IBM, Yahoo, and Thomson Reuters; many universities, such as Stanford University, Carnegie Mellon University, The George Washington University, and Virginia Tech; and non-profit organizations such as CRA-W were represented. Each organization had many representatives to discuss the different opportunities they have for women. The poster session was held in the evening.

    Cotton candy in the career fair #GHC14
    Like last year, Thomson Reuters attracted many women's attention with a great promotion, bringing in caricature artists. Other companies used nice ideas to promote themselves, such as cotton candy. There were many representatives promoting each organization and also conducting interviews. I enjoyed being among all of these women in the career fair, which inspired me to think about how to direct my future in a way that contributes to computing and also encourages many other women into computing. My advice to anyone going to GHC next year: print many copies of your resume so you are prepared for the career fair.


    Day 2 started with a welcome to the audience from Barb Gee, the vice president of programs for the Anita Borg Institute. Gee presented the Girl Rising video clip "I'm not a number".

    After the clip, Dr. Whitney introduced the special guest, the amazing Megan Smith, the new Chief Technology Officer of the United States and previously a vice president at Google[x]. Smith was one of last year's keynote speakers, when she gave a very inspiring talk entitled "Passion, Adventure and Heroic Engineering". Smith welcomed the audience and talked about her new position as the CTO of the United States. She expressed her happiness to serve the president of the USA and serve her country. "Let's work together to bring everyone along and to bring technology that we know how to solve the problems with," Smith said at the end of her short, inspiring talk.

    Dr. Whitney talked about the Building Recruiting And Inclusion for Diversity (BRAID) initiative between the Anita Borg Institute and Harvey Mudd College to increase the diversity of computer science undergraduates. The BRAID initiative is funded by Facebook, Google, Intel, and Microsoft.


    The 2014 GHC Technical Leadership ABIE Award went to Anne Condon, a professor and the head of the Department of Computer Science at the University of British Columbia. Condon donated her award to Grace Hopper India and programs of the Computing Research Association (CRA).



    Maria Klawe on the right, Satya Nadella on the left
    Satya Nadella, the Chief Executive Officer (CEO) of Microsoft, in an interesting conversation with Maria Klawe, the president of Harvey Mudd College, was the second keynote of GHC 2014. Nadella is the first male speaker at GHC. Nadella was asked many interesting questions. One of them was "Microsoft has competitors like Apple, Google, Facebook, Amazon. What can Microsoft uniquely do in this new world?" Nadella answered that the two things he believes Microsoft contributes to the world are productivity and the platform. Maria continued, "it is not a competition, it is a partnership".

    In answer to a tough question, "Why does Microsoft hire fewer female engineers than male?", Nadella said that they all now have the numbers out there. Microsoft's number is about 17%, which is almost the same as Google and Facebook, and a little below Apple. He said, "the real issue in our company is how to make sure that we are getting women who are very capable into the company and well represented".

    In response to a question about how to ask for a raise in salary, Nadella said: "It's not really about asking for a raise, but knowing and having faith that the system will give you the right raise." Nadella received a torrent of criticism and irate reactions on Twitter.

    Nadella later apologized for his "inarticulate" remarks in a tweet, followed by a statement issued to Microsoft employees, which was published on the company's website.

    "I answered that question completely wrong," said Nadella. "I believe men and women should get equal pay for equal work. And when it comes to career advice on getting a raise when you think it’s deserved, Maria’s advice was the right advice. If you think you deserve a raise, you should just ask."






    Day 3 started with some announcements from the ABI board, then the best-poster announcements and the Awards Presentation. The last keynote was by Dr. Arati Prabhakar, the Director of the Defense Advanced Research Projects Agency (DARPA). Dr. Prabhakar talked about "how do we shape our times with the technology that we work on and are passionate about?". She shared neat technologies with us in her keynote, starting with a video of a quadriplegic woman using her thoughts to control a robotic arm by connecting her brain to the computer. She talked about building technologies at DARPA and answered many questions at the end related to her work there. It is amazing to see a successful woman who creates technology that serves her country. The keynote ended with a nice video promoting GHC 2015.





    Latest trends and technical challenges of big data panel
    After the keynote, I attended the "Latest Trends and Technical Challenges of Big Data Analytics Panel", which was moderated by Amina Eladdadi (College of Saint Rose). The panelists were Dr. Bouchra Bouqata from GE, Dr. Kaoutar El Maghraoui from IBM, Dr. Francine Berman from RPI, and Dr. Deborah Agarwal from LBNL. This panel focused on discussing new data-driven Big Data Analytics technologies, infrastructure, and challenges. The panelists introduced use cases from industry and academia. There are many challenges facing big data: storage, security (specifically for cloud computing), and the scale of the data and bringing everything together to solve the problem.

    ArabWIC lunch table
    After the panel, I visited the career fair and then attended the Arab Women in Computing (ArabWIC) meeting during lunch. I had my first real experience with the ArabWIC organization at GHC 2013. ArabWIC had more participation this year. I also attended the ArabWIC reception, sponsored by the Qatar Computing Research Institute (QCRI), on Wednesday night and got a chance to connect with many Arab women in computing in business and academia.



    After that I attended the "Data Science in Social Media Analysis Presentations", which included three presentations about data analysis. The three useful presentations were:
    "How to be a data scientist?" by Christina Zou
    The presenters talked about real-life projects. The highlights of the presentations were:



  1. "Improve the accuracy is what we strove for."
  2. "It’s important to understand the problem."
  3. "Divide the problem into pieces."
  4. After the presentations, I talked to Christina about my research, and she gave me some ideas that I'll apply.
    The picture taken from GHC Facebook page
    At the end of the day, the Friday celebration, sponsored by Google, Microsoft, and GoDaddy, began at 7:30. The dance floor was full of amazing ladies celebrating and dancing with glow sticks!

    It was fantastic meeting a large number of like-minded peers and future employers. I'm pleased to have had this great opportunity, which allowed me to network and communicate with many great women in computing. GHC allowed me to discuss my research ideas with many senior women and get positive feedback about them. I came back with multiple ideas that will help me shape the next phase of my research and my career path.

     ---
    Yasmin

    2014-10-27: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent

    Herbert and I attended the "404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent" workshop at the Georgetown Law Library on Friday, October 24, 2014.  Although the origins for this workshop are many, catalysts for it probably include the recent Liebler & Liebert study about link rot in Supreme Court opinions, and the paper by Zittrain, Albert, and Lessig about Perma.cc and the problem of link rot in the scholarly and legal record, along with the popular media coverage that resulted from it (e.g., NPR and the NYT). 

    The speakers were naturally drawn from the legal community at large, but some notable exceptions included David Walls from the GPO, Jefferson Bailey from the Internet Archive, and Herbert Van de Sompel from LANL. The event was streamed and recorded, and videos + slides will be available from the Georgetown site soon so I will only hit the highlights below. 

    After a welcome from Michelle Wu, the director of the Georgetown Law Library, the workshop started with an excellent keynote from the always entertaining Jonathan Zittrain, called "Cites and Sites: A Call To Arms".  The theme of the talk centered around "Core Purpose of .edu", which he broke down into:
    1. Cultivation of Scholarly Skills
    2. Access to the world's information
    3. Freely disseminating what we know
    4. Contributing actively and fiercely to the development of free information platforms



    For each bullet he gave numerous anecdotes and examples, some innovative and some humorous and/or sad.  For the last point he mentioned Memento, Perma.cc, and timed-release crypto.

    Next up was a panel with David Walls (GPO), Karen Eltis (University of Ottawa), and Ed Walters (Fastcase).  David mentioned the Federal Depository Library Program Web Archive, Karen talked about the web giving us "Permanence where we don't want it and transience where we require longevity" (I tweeted about our TPDL 2011 paper that showed that for music videos on YouTube, individual URIs die all the time but the content just shows up elsewhere), and Ed generated a buzz in the audience when he announced that in rendering their pages they ignore the links because of the problem of link rot.  (Panel notes from Aaron Kirschenfeld.)

    The next panel had Raizel Liebler (Yale), author of another legal link rot study mentioned above and an author of one of the useful handouts about links in the 2013-2014 Supreme Court documents, and Rod Wittenberg (Reed Tech), who talked about the findings of the Chesapeake Digital Preservation Group and gave a data dump about link rot in Lexis-Nexis and the resulting commercial impact (wait for the slides).  (Panel notes from Aaron Kirschenfeld.)

    After lunch, Roger Skalbeck (Georgetown) gave a webmaster's take on the problem, talking about best practices, URL rewriting, and other topics -- as well as coining the wonderful phrase "link rot deniers".  During this talk I also tweeted TimBL's classic 1998 resource "Cool URIs Don't Change". 

    Next was Jefferson Bailey (IA) and Herbert.  Jefferson talked about web archiving, the IA, and won approval from the audience for his references to Lionel Hutz and HTTP status dogs.  Herbert's talk was entitled "Creating Pockets of Persistence", and covered a variety of topics, obviously including Memento and Hiberlink.




    The point is to examine web archiving activities with an eye to the goal of making access to the past web:
    1. Persistent
    2. Precise
    3. Seamless
    Even though this was a gathering of legal scholars, the point was to focus on technologies and approaches that are useful across all interested communities.  He also gave examples from our "Thoughts on Referencing, Linking, Reference Rot" (aka "missing link") document, which was also included in the list of handouts.  The point of this effort is to enhance existing links (with archived versions, mirror versions, etc.), but not at the expense of removing the link to the original URI and the datetime of the intended link.  See our previous blog post on this paper and a similar one for Wikipedia.
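    As a hedged illustration of that "enhance, don't replace" idea (the data-* attribute names follow the "missing link" / Robust Links proposal, but the document itself is the authoritative reference), a decorated link keeps the original URI in the href while carrying an archived copy and the intended datetime alongside it:

# Attribute names assumed from the Robust Links / "missing link" proposal.
def robust_link(original_uri, archived_uri, intended_date, text):
    """Build an HTML anchor that keeps the original URI as the href while
    recording an archived version and the datetime of the intended link."""
    return (f'<a href="{original_uri}" '
            f'data-versionurl="{archived_uri}" '
            f'data-versiondate="{intended_date}">{text}</a>')

print(robust_link("http://example.com/opinion.pdf",
                  "https://web.archive.org/web/20141024120000/http://example.com/opinion.pdf",
                  "2014-10-24",
                  "the cited opinion"))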

    The closing session was Leah Prescott (Georgetown; subbing for Carolyn Cox), Kim Dulin (Harvard), and E. Dana Neacşu (Columbia).   Leah talked some more about the Chesapeake Digital Preservation Group and how their model of placing materials in a repository doesn't completely map to the Perma.cc model of web archiving (note: this actually has fascinating implications for Memento that are beyond the scope of this post).  Kim gave an overview of Harvard's Perma.cc archive, and Dana gave an overview of a prior archiving project at Columbia.  Note that Perma.cc recently received a Mellon Foundation grant (via Columbia) to add Memento capability.

    Thanks to Leah Prescott and everyone else that organized this event.  It was an engaging, relevant, and timely workshop.  Herbert and I met several possible collaborators that we will be following up with. 




    Resources:

    -- Michael

    2014-11-09: Four WS-DL Classes for Spring 2015

    We're excited to announce that four Web Science & Digital Library (WS-DL) courses will be offered in Spring 2015:
    Web Programming, Big Data, Information Visualization, & Digital Libraries -- we have you covered for Spring 2015.  

    --Michael