
2015-07-02: JCDL2015 Main Conference


Large, Dynamic and Ubiquitous – The Era of the Digital Library






JCDL 2015 (#JCDL2015) took place at the University of Tennessee Conference Center in Knoxville, Tennessee. The conference was four days long, June 21-25, 2015. This year three students from our WS-DL CS group at ODU had their papers accepted, as well as one poster (see trip reports for 2014, 2013, 2012, 2011). Dr. Weigle (@weiglemc), Dr. Nelson (@phonedude_mln), Sawood Alam (@ibnesayeed), Mat Kelly (@machawk1) and I (@LulwahMA) went to the conference. We drove from Norfolk, VA. Four former members of our group, Martin Klein (UCLA, CA) (@mart1nkle1n), Joan Smith (Linear B Systems, Inc., VA) (@joansm1th), Ahmed Alsum (Stanford University, CA) (@aalsum) and Hany SalahEldeen (Microsoft, Seattle) (@hanysalaheldeen), also met us there. The drive took around 8 hours. We enjoyed the mountain views and the beautiful farms. We also caught parts of a storm on our way, but it only lasted two hours or so.

The first day of the conference (Sunday, June 21, 2015) consisted of four tutorials and the Doctoral Consortium. The four tutorials were: "Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research", "Introduction to Digital Libraries", "Digital Data Curation Essentials for Data Scientists, Data Curators and Librarians", and "Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories".


Mat Kelly (ODU, VA)(@machawk1) covered the Doctoral Consortium.




The main conference started on Monday June 22, 2015 and opened with Paul Logasa Bogen II (Google, USA) (@plbogen). He welcomed the attendees and mentioned that this year there were 130 registered attendees from 87 different organizations, 22 states, and 19 different countries.



Then the program chairs: Geneva Henry (University Libraries, GWU, DC), Dion Goh (Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore) and Sally Jo Cunningham (Waikato University, New Zealand) continued with announcements and the acceptance statistics for JCDL 2015: 18 full research papers (30%), 30 short research papers (50%), and 18 posters and demos (85.7%) were accepted. Finally, the speaker announced the nominees for best student paper and best overall paper.

The best paper nominees were:
The best student paper nominees were:
Then Unmil Karadkar (The University of Texas at Austin, TX) introduced the keynote speaker, Piotr Adamczyk (Google Inc., London, UK). Piotr's talk was titled “The Google Cultural Institute: Tools for Libraries, Archive, and Museums”. He presented some of Google's attempts to add to the online cultural heritage. He introduced the Google Cultural Institute website, which consists of three main projects: the Art Project, Archive (Historic Moments), and World Wonders. He showed us the Google Art Project (Link from YouTube: Google Art Project) and then introduced an application to search museums and navigate and look at art. Next, he introduced Google Cardboard (Link from YouTube: “Google Cardboard Tour Guide”) (Link from YouTube: “Expeditions: Take your students to places a school bus can’t”), where you can explore different museums by looking into a cardboard viewer that houses a user's electronic device. He mentioned that more museums are allowing Google to capture images of their galleries and to let others explore them using Google Cardboard, and that Google would like to further engage with cultural partners. His talk was similar to a talk he gave in 2014 titled "Google Digitalizing Culture?".

Then we started off with two simultaneous sessions, "People and Their Books" and "Information Extraction". I attended the second session. The first paper was “Online Person Name Disambiguation with Constraints”, presented by Madian Khabsa (PSU, PA). The goal of this work is to map name mentions to real-world people. They found that 11%-17% of the queries in search engines are personal names. He mentioned that two issues are not usually addressed: adding constraints to the clustering process and adding data incrementally without re-clustering the entire database. One challenge they faced was redundant names. Constraints can be useful in a digital library, where users can make corrections. Madian described a constraint-based clustering algorithm for person name disambiguation.


Sawood Alam (ODU, VA) (@ibnesayeed) followed Madian with his paper “Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages”. He mentioned that general online book readers are not suitable for scanned dictionaries. He proposed an approach for indexing scanned pages of a dictionary that enables direct access to the appropriate pages on lookup. He implemented an application called Dictionary Explorer, in which he indexed monolingual and multilingual dictionaries at a speed of over 20 pages per minute per person.



Next, Sarah Weissman (UMD, Maryland) presented “Identifying Duplicate and Contradictory Information in Wikipedia”. Sarah identified sentences in Wikipedia articles that are identical. She randomly selected 2,000 articles and manually classified the matches, finding that 45% are identical, 30% are templates, 13.15% are copy editing, 5.8% are factual drift, 0.3% are references, and 4.9% are other pages.

The last presenter in this session was Min-Yen Kan (National University of Singapore, Singapore) (@knmnyn), presenting “Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs”. He introduced the notion of extensible features for higher order semi-CRFs that allow memoization to speed up inference algorithms.


The papers in the other concurrent session that I was unable to attend were:

After the Research at Google-sponsored banquet lunch, Sally Jo Cunningham (University of Waikato, NZ) introduced the first panel, "Lifelong Digital Libraries", and then the first speaker, Cathal Gurrin (Dublin City University, Ireland) (@cathal). His presentation was titled "Rich Lifelong Libraries". He started off with using wearable devices and information loggers to automatically record your life in detail. He gave examples of devices currently on the market, such as Google Glass or Apple's iWatch, that record every moment. He has been gathering a digital memory of himself since 2006 by using a wearable camera. The talk was similar to one he gave in 2012 titled "The Era of Personal Life Archives".

The second speaker was Taro Tezuka (University of Tsukuba, Japan). His presentation was titled "Indexing and Searching of Conversation Lifelogs". He argued that search capability is as important as storage capability in lifelogging applications, and that clever indexing of recorded content is necessary for implementing useful lifelog search systems. He also showed LifeRecycle, a system for recording and retrieving conversation lifelogs: it first records the conversation, then performs speech recognition, stores the result, and finally supports searching and displaying the results. He mentioned that security and privacy concerns are the main challenges in getting people to agree to be recorded.

The last speaker of the panel was Håvard Johansen (University of Tromsø, Norway). He started with definitions of lifelogs and then discussed the use of personal data for sports analytics, and how to construct privacy-preserving lifelogging. After the third speaker, the audience discussed some of the privacy issues that lifelogging raises.



The third and fourth sessions were simultaneous as well. The third session was "Big Data, Big Resources". The first presenter was Zhiwu Xie (Virginia Tech, VA) (@zxie) with his paper “Towards Use And Reuse Driven Big Data Management”. This work focused on integrating digital libraries and big data analytics in the cloud; he then described the system model and its evaluation.


Next, “iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling” was presented by Gerhard Gossen (L3S Research Center, Germany) (@GerhardGossen). iCrawl combines crawling of the Web and social media for a specified topic. The crawler collects web and social content in a single system and exploits the stream of new social media content for guidance. The target users of this web crawling toolbox are web science and qualitative humanities researchers. The approach is to start with a topic and follow its outgoing links that are relevant.


G. Craig Murray (Verisign Labs) presented in place of Jimmy Lin (University of Maryland, College Park) (@lintool). The paper was titled “The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi”. Craig discussed how useful it is to have a copy of Wikipedia that you can access without the Internet by connecting to a Raspberry Pi device via Bluetooth or WiFi. He passed the Raspberry Pi around the audience and allowed them to connect to it wirelessly. The device is arguably better than Google in that it offers offline search and full-text access, full control over the search algorithms, and private search. The advantage of the data being on a separate device instead of on the phone is that it is cheaper per unit of storage and offers a full Linux stack and hardware customizability.


The last presenter in this session was Tarek Kanan (Virginia Tech, VA), presenting “Big Data Text Summarization for Events: a Problem Based Learning Course”. Problem/project Based Learning (PBL) is a student-centered teaching method, where student teams learn by solving problems. In this work, seven teams of students, 30 students in all, applied big data methods to produce corpus summaries. They found that PBL helped students in a computational linguistics class automatically build good text summaries for big data collections. The students also learned many of the key concepts of NLP.


The fourth session, which I missed, was "Working the Crowd"; Mat Kelly (ODU, VA) (@machawk1) recorded the session.



After that, the Conference Banquet was served at the Foundry on the Fair Site.


On Tuesday, June 23, 2015, after breakfast, the keynote was given by Katherine Skinner (Educopia Institute, GA). Her talk was titled “Moving the needle: from innovation to impact”. She discussed how to engage others to make use of digital libraries and archiving, getting out there and being an important factor in the community, as we should be. She asked what digital libraries could accomplish as a field if we shifted our focus from "innovation" to "impact".

After that, there were two other simultaneous sessions, "Non-Text Collection" and "Ontologies and Semantics". I attended the first session, where one long paper and four short papers were presented. The first speaker in this session was Yuehan Wang (Peking University, China). His paper was “WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document”. The speaker discussed the challenges of extracting mathematical formulas from different representations. They propose an upgraded Mathematical Information Retrieval system named WikiMirs 3.0. The system can extract mathematical formulas from PDFs and supports typed-in queries. The system is publicly available at: www.icst.pku.edu.cn/cpdp/wikimirs3/.


Next, Kahyun Choi (University of Illinois at Urbana-Champaign, IL) presented “Topic Modeling Users’ Interpretations of Songs to Inform Subject Access in Music Digital Libraries”. Her paper addressed whether topic modeling can discover subjects from interpretations, and how to improve the quality of the topics automatically. Their dataset was extracted from songmeanings.com and contained almost four thousand songs with at least five interpretations per song. Topic models were generated using Latent Dirichlet Allocation (LDA), and the normalization of the top ten words in each topic was calculated. For evaluation, a sample was manually assigned to six subjects, yielding 71% accuracy.


Frank Shipman (Texas A&M University, TX) presented “Towards a Distributed Digital Library for Sign Language Content”. This work tries to locate content relating to sign language on the Internet. They describe the software components of a distributed digital library of sign language content, called SLaDL, which detects sign language content in videos.




The final speaker of this session was Martin Klein (UCLA, CA) (@mart1nkle1n), presenting “Analyzing News Events in Non-Traditional Digital Library Collections”. In this work they identified indicators relevant for building non-traditional collections. From the two collections, an online archive of TV news broadcasts and an archive of social media captures, they found that there is an 8-hour delay between social media and TV coverage, which continues at a high frequency level for a few days after a major event. In addition, they found that news items have the potential to influence other collections.




The session I missed was "Ontologies and Semantics"; the papers presented were:

After lunch, there were two other simultaneous sessions, "User Issues" and "Temporality". I attended the "Temporality" session, where there were two long papers. The first was presented by Thomas Bögel (Heidelberg University, Germany), titled “Time Will Tell: Temporal Linking of News Stories”. Thomas presented a framework to link news articles based on the temporal expressions that occur in them. This work recovers the arrangement of events covered in an article; in the bigger picture, a network of articles can be ordered in time.

The second paper was “Predicting Temporal Intention in Resource Sharing”, presented by Hany SalahEldeen (ODU, VA) (@hanysalaheldeen). Pages linked from Twitter could change over time and might no longer match the user's intention. In this work they enhance their prior temporal intention model by adding linguistic feature analysis and semantic similarity, and by balancing the training dataset. The current model achieved 77% accuracy in predicting the intention of the user.




The session I missed "User Issues" had four papers:

Next, there was a panel on “Organizational Strategies for Cultural Heritage Preservation”. Paul Logasa Bogen, II (Google, WA) (@plbogen) introduced the four speakers on this panel: Katherine Skinner (Educopia Institute, Atlanta), Stacy Kowalczyk (Dominican University, IL) (@skowalcz), Piotr Adamczyk (Google Cultural Institute, Mountain View) and Unmil Karadkar (The University of Texas at Austin, Austin) (@unmil). The panel discussed preservation goals, the challenges facing organizations that practice centralized or decentralized preservation, and how to balance these approaches. In the final minutes there were questions from the audience regarding privacy and ownership in cultural heritage collections.

Following that was Minute Madness, a session in which each poster presenter had two chances (60 seconds, then 30 seconds) to talk about their poster in an attempt to lure attendees to come by during the poster session.



The final session of the day was the "Reception and Posters", where posters and demos were viewed and everyone in the audience got three stickers to vote for the best poster/demo.


On Wednesday June 24, 2015, there was one session "Archiving, Repositories, and Content" and three different workshops: "4th International Workshop on Mining Scientific Publications (WOSP 2015)", "Digital Libraries for Musicology (DLfM)" and "Web Archiving and Digital Libraries (WADL 2015)".

The day's session, "Archiving, Repositories, and Content", had four papers. The first paper in the last session of the conference was presented by Stacy Kowalczyk (Dominican University, IL) (@skowalcz): “Before the Repository: Defining the Preservation Threats to Research Data in the Lab”. She mentioned that lost media is a big threat that preservation needs to address. She conducted a survey to quantify the risks to the preservation of research data. With a sample of 724 National Science Foundation awardees completing the survey, she found that human error was the greatest threat to preservation, followed by equipment malfunction.




Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author) presented “How Well Are Arabic Websites Archived?”. In this work we focused on determining whether Arabic websites are archived and indexed. We collected a sample of Arabic websites and discovered that 46% of the websites are not archived and 31% are not indexed. We also analyzed the dataset and found that almost 15% had an Arabic country code top-level domain and almost 11% had an Arabic geographical location. We recommend that if you want an Arabic webpage to be archived, you should list it in DMOZ and host it outside an Arabic country.





Next, Ke Zhou (University of Edinburgh, UK) presented his paper “No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving” (from the Hiberlink project). This paper addresses the issue of a 404 on a reference in a scholarly article. They identified two types of reference decay, content drift and link rot, and found that around 30% of web references are rotten. This work suggests that authors proactively archive links that are likely to rot.




Then Jacob Jett (University of Illinois at Urbana-Champaign, IL) presented his paper “The Problem of ‘Additional Content’ in Video Games”. In this work they first discuss the challenges that video games now face due to additional content such as modifications and downloadable content. They propose to address these challenges by capturing the additional content.




After the final paper of the main conference, lunch was served along with the announcement of the best poster/demo, decided by counting the audience votes. This year there were two best poster/demo awards: one to Ahmed Alsum (Stanford University, CA) (@aalsum) for “Reconstruction of the US First Website”, and one to Mat Kelly (ODU, VA) (@machawk1) for “Mobile Mink: Merging Mobile and Desktop Archived Webs”, by Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson (learn more about Mobile Mink).




Next was the announcement of the awards for best student paper and best overall paper. The best student paper award went to Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author), Michael L. Nelson, and Michele C. Weigle for our paper “How Well Are Arabic Websites Archived?”, and the Vannevar Bush best paper award went to Pertti Vakkari and Janna Pöntinen for their paper “Result List Actions in Fiction Search”.

After that there was the "Closing Plenary and Keynote", where J. Stephen Downie talked about “The HathiTrust Research Center: Providing Analytic Access to the HathiTrust Digital Library’s 4.7 Billion Pages”. HathiTrust is trying to preserve the cultural record. It has currently digitized 13,496,147 volumes, 6,778,492 books, and much more. There are many current projects that the HathiTrust Research Center is working on, such as the HathiTrust BookWorm, in which you can search for a specific term and see its number of occurrences and positions. This presentation was similar to a presentation titled "The HathiTrust Research Center: Big Data Analytics in Secure Data Framework" presented in 2014 by Robert McDonald.

Finally, JCDL 2016 was announced to be located in Newark, NJ, June 19-23.

After that, I attended the "Web Archiving and Digital Libraries" workshop; Sawood Alam (ODU, VA) (@ibnesayeed) will cover the details in a blog post.





by Lulwah Alkwai

Special thanks to Mat Kelly for taking the videos and helping to edit this post.

2015-07-07: WADL 2015 Trip Report


It was the last day of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2015 when the Workshop on Web Archiving and Digital Libraries (WADL) 2015 was scheduled, and it started on time. When I entered the workshop room, I realized we needed a couple more chairs to accommodate all the participants, which was a good problem to have. The session started with brief informal introductions from the individual participants. Without wasting any time, the lightning talks session started.

Gerhard Gossen started the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily at researchers and journalists. The demonstration illustrated how to search the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure basic crawling policies, and finally start or schedule the crawl.

Ian Milligan presented his short talk on "Finding Community in the Ruins of GeoCities: Distantly Reading a Web Archive". He introduced GeoCities and explained why it matters. He illustrated preliminary explorations of the data, such as images, text, and topic extraction. He announced plans for a Web Analytics Hackathon in Canada in 2016 based on Warcbase and is looking for collaborators. He expressed the need for documentation for researchers. To acknowledge the need for context, he said, "In an archive you can find anything you want to prove, need to contextualize to validate the results."



Zhiwu Xie presented a short talk on "Archiving the Relaxed Consistency Web". It focused on the inconsistency problem mainly seen in crawler-based archives. He described the illusion of consistency on distributed social media systems and the role of timezone differences. They found that newer content is more inconsistent; in a simulation, more than 60% of timelines were found to be inconsistent. They propose proactive redundant crawls and compensatory estimation of archival credibility as potential solutions to the issue.



Martin Lhotak and Tomas Foltyn presented their talk on "The Czech Digital Library - Fedora Commons based solution for aggregation, reuse, dissemination and archiving of digital documents". They introduced the three main digitization areas in the Czech Republic - Manuscriptorium (early printed books and manuscripts), Kramerius (modern collections from 1801), and WebArchiv (digital archive of Czech web resources). Their goal is to aggregate all digital library content from the Czech Republic under the Czech Digital Library (CDL).

Todd Suomela presented "Analytics for monitoring usage and users of Archive-IT collections". The University of Alberta has been using Archive-It since 2009 and has 19 different collections, of which 15 are public. Their collections are proposed by the public, faculty, or librarians; each proposal then goes to the Collection Development Committee for review. Todd evaluated the user activity (using Google Analytics) and the collection management aspects of the UA digital libraries.



After the lightning talks were over, workshop participants took a break and looked at the posters and demonstrations associated with the lightning talks above.


Our colleague Lulwah Alkwai had the presentation of her "Best Student Paper" award-winning full paper, "How Well Are Arabic Websites Archived?", scheduled the same day, hence we joined her in the main conference track.



During the lunch break, awards were announced, and our WSDL Research Group secured both the Best Student Paper and the Best Poster awards. While some people were still enjoying their lunch, Dr. J. Stephen Downie presented the closing keynote on the HathiTrust Digital Library. I learned a lot more about the HathiTrust, its collections, how they deal with copyright and (not so) open data, and their mantra, "bring computing to the data", for the sake of fair use of copyrighted data. Finally, there were announcements about next year's JCDL conference, which will be held in Newark, NJ from 19 to 23 June, 2016. After that we assembled again in the workshop room for the remaining sessions of WADL.



Robert Comer and Andrea Copeland together presented "Methods for Capture of Social Media Content for Preservation in Memory Organizations". They talked about preserving personal and community heritage. They outlined the issues and challenges in preserving the history of the social communities and the problem of preserving the social media in general. They are working on a prototype tool called CHIME (Community History in Motion Everyday).




Mohamed Farag presented his talk on "Building and Archiving Event Web Collections: A focused crawler approach". Mohamed described the current approaches to building event collections: 1) manually, which leads to high quality but requires a lot of effort, and 2) from social media, which is quick but may result in potentially low quality collections. They are looking for a balance between the two approaches and are developing an Event Focused Crawler (EFC) that retrieves web pages similar to curator-selected seed URLs with the help of a topic detection model. They have made an event detection service demo available.



Zhiwu Xie presented "Server-Driven Memento Datetime Negotiation - A UWS Case". He described Uninterruptable Web Service (UWS) architecture which uses Memento to provide continuous service even if a server goes down. Then he proposed an ammendment in the workflow of the Memento protocol for a server-driven content negotiation instead of an agent-driven approcah to improve the efficiency of UWS.



Luis Meneses presented his talk on "Grading Degradation in an Institutionally Managed Repository". He motivated his talk by saying that degradation in a data collection is like a library of books with missing pages. He illustrated examples from his testbed collection to introduce nine classes of degradation, from the least damaged to the worst: 1) kind of correct, 2) university/institution pages, 3) directory listings, 4) blank pages, 5) failed redirects, 6) error pages, 7) pages in a different language, 8) domain for sale, and 9) deceiving pages.



The last speaker of the session, Sawood Alam (your author), presented "Profiling Web Archives". I briefly described the Memento Aggregator and the need to profile the long tail of archives to improve the efficiency of the aggregator. I described various profile types and policies, analyzed their cost in terms of space and time, and measured the routing efficiency of each profile. I also discussed the serialization format and scale-related issues such as incremental updates. I took advantage of being the last presenter of the workshop and kept the participants away from their dinner longer than I was supposed to.





Thanks Mat for your efforts in recording various sessions. Thanks Martin for the poster pictures. Thanks to everyone who contributed to the WADL 2015 Group Notes, it was really helpful. Thanks to all the organizers, volunteers and participants for making it a successful event.

Resources

--
Sawood Alam

2015-07-22: I Can Haz Memento

Inspired by the "#icanhazpdf" movement and built upon the Memento service, I Can Haz Memento attempts to expand awareness of web archiving through Twitter. Given a URL (for a page) in a tweet with the hashtag "#icanhazmemento", the I Can Haz Memento service replies to the tweet with a link pointing to an archived version of the page closest to the time of the tweet. The consequence of this is that the archived version closest to the time of the tweet likely expresses the intent of the user at the time the link was shared.
Consider a scenario where Jane shares a link in a tweet to the front page of CNN about a story on healthcare. Given the fluid nature of the news cycle, at some point the healthcare story would be replaced by another, fresher story; thus the intent of Jane's tweet (the healthcare story) becomes misrepresented by her original link, which now points to the new story. This is where I Can Haz Memento comes into the picture. If Jane included "#icanhazmemento" in her tweet, the service would have replied to Jane's tweet with a link representing:
  • An archived version (closest to her tweet time) of the front page healthcare story on cnn, if the page had already been archived. Or
  • A newly archived version of the same page. In other words, the service does the archiving and returns the link to the newly archived page, if the page was not already archived.
Method 1: In order to use the service, include the hashtag "#icanhazmemento" in the tweet with the link to the page you intend to archive or for which you want an archived version. For example, consider Shawn Jones' tweet below for http://www.cs.odu.edu:
Which prompted the following reply from the service:
Method 2: In Method 1, the hashtag "#icanhazmemento" and the URL, http://www.cs.odu.edu, reside in the same tweet, but Method 2 does not impose this restriction. If someone (@acnwala) tweeted a link (e.g., arsenal.com), and you (@wsdlodu) wished the request to be treated in the same manner as Method 1 (as though "#icanhazmemento" and arsenal.com were in the same tweet), all that is required is a reply to the original tweet (which did not include "#icanhazmemento") with a tweet that includes "#icanhazmemento". Consider an example of Method 2 usage:
  1. @acnwala tweets arsenal.com without "#icanhazmemento"
  2. @wsdlodu replies the @acnwala's tweet with "#icanhazmemento"
  3. @icanhazmemento replies @wsdlodu with the archived versions of arsenal.com
The scenario (1, 2 and 3) is outlined by the following tweet threads:
 I Can Haz Memento - Implementation

I Can Haz Memento is implemented in Python and leverages the Tweepy API for Twitter. The implementation is captured by the following subroutines:
  1. Retrieve links from tweets with "#icanhazmemento": This is achieved with Tweepy's api.search method. The sinceIDValue is used to keep track of already visited tweets. Also, the application sleeps in between requests in order to comply with Twitter's API rate limits, but not before retrieving the URLs from each tweet. A sketch of the basic operations of fetching tweets after the last tweet visited appears after this list.
  2. After the URLs in 1. have been retrieved, the following subroutine
  • Makes an HTTP request to the TimeGate API in order to get the Memento (instance of the resource) closest to the time of the tweet, since the time of the tweet is passed as a parameter for datetime content negotiation (see the sketch after this list).
  • If the page is not found in any archive, it is pushed to archive.org and archive.is for archiving.
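
The original post embedded screenshots of these code snippets; the sketch below is a hedged reconstruction of the overall flow, not the repository's actual code. The TimeGate and Save Page Now endpoints (the Internet Archive's) and the helper names closest_memento and run are assumptions made for illustration.

# Hedged sketch of the I Can Haz Memento flow (illustrative, not the actual
# repository code); endpoint choices and helper names are assumptions.
import time
import requests
import tweepy

TIMEGATE = "http://web.archive.org/web/"    # a Memento TimeGate (assumed choice)
SAVE = "https://web.archive.org/save/"      # Save Page Now endpoint

def closest_memento(url, tweet_time):
    # Datetime content negotiation: ask the TimeGate for the memento
    # closest to the time of the tweet.
    accept_dt = tweet_time.strftime("%a, %d %b %Y %H:%M:%S GMT")
    resp = requests.get(TIMEGATE + url,
                        headers={"Accept-Datetime": accept_dt},
                        allow_redirects=True)
    if resp.ok:
        return resp.url                     # URI of the selected memento
    # Not archived yet: push the page into the archive and reply with that
    # copy (the real tool also pushes to archive.is).
    return requests.get(SAVE + url).url

def run(api, since_id):
    while True:
        # Fetch only tweets newer than the last one already handled.
        for tweet in api.search(q="#icanhazmemento", since_id=since_id):
            since_id = max(since_id, tweet.id)
            for u in tweet.entities.get("urls", []):
                memento = closest_memento(u["expanded_url"], tweet.created_at)
                api.update_status("@%s %s" % (tweet.user.screen_name, memento),
                                  in_reply_to_status_id=tweet.id)
        time.sleep(5)   # stay well under Twitter's 180 searches per 15 minutes
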
The source code for the application is available on GitHub. We acknowledge the effort of Mat Kelly, who wrote the first draft of the application. And we hope you use #icanhazmemento.
--Nwala

2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship".  It's a short and simple review of some of the interoperability projects we've worked on since 1999, including OAI-PMH, OAI-ORE, and Memento.  He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description.



The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol.  Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now.  Some highlights include:
  • Although Google existed, it was not the hegemonic force that it is today, and contemporary search engines that did exist (e.g., AltaVista, Lycos) weren't that great (both in terms of precision and recall).  
  • The Deep Web was still a thing -- search engines did not reliably find obscure resources like scholarly resources (cf. our 2006 IEEE IC study "Search Engine Coverage of the OAI-PMH Corpus" and Kat Hagedorn's 2008 follow-up "Google Still Not Indexing Hidden Web URLs").
  • Related to the above, the focus in digital libraries was on repositories, not the web itself.  Everyone was sitting on an SQL database of "stuff" and HTTP was seen just as a transport in which to export the database contents.  This meant that the gateway script (ca. 1999, it was probably in Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").  
  • Focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP.  In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004.  Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945) but... we didn't. 
  • The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport. 
  • Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (betraying their news syndication origin), which made them unsuitable for digital library applications.
  • Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links").  Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999!  Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!").  Credit to CiteSeer for being an early digital library that was the first to use full-text (DL 1998).
Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place.  It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps). 

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community.   As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed.  I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community. 

In summary, our advice would be to fully embrace HTTP, since it is our community's Fortran and it's not going anywhere anytime soon.

--Michael

2015-07-27: Upcoming Colloquium, Visit from Herbert Van de Sompel

On Wednesday, August 5, 2015 Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web".  It will be held in the third floor E&CS conference room (r. 3316) at 11am.  Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend.  The abstract for his talk is:

 A Perspective on Archiving the Scholarly Web
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves.  Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fulfillment of the archival function for web-native scholarship. Observations regarding the status quo, which largely consists of back-office processes that have their origin in paper-based communication, suggest the need for a change. The outlines of a different archival approach inspired by existing web archiving practices will be explored.
This presentation will be an evolution of ideas following his time as a visiting scholar at DANS, in conjunction with Dr. Andrew Treloar (ANDS) (2014-01 & 2014-12).

Dr. Van de Sompel is an internationally recognized pioneer in the field of digital libraries and web preservation; his contributions include many of the architectural solutions that define the community: OpenURL, SFX, OAI-PMH, OAI-ORE, info URI, bX, djatoka, MESUR, aDORe, Memento, Open Annotation, SharedCanvas, ResourceSync, and Hiberlink.

Also during his time at ODU, he will be reviewing the research projects of PhD students in the Web Science and Digital Libraries group as well as exploring new areas for collaboration with us.  This will be Dr. Van de Sompel's first trip to ODU since 2011, when he and Dr. Sanderson served as the external committee members for Martin Klein's PhD dissertation defense.

--Michael

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page.  For example, the old URI:

http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html

is currently:

http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html

The same content had at least one other URI as well, from at least 2009--2012:

http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, InfoWorld is still redirecting the second URI to the current URI:



And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:



Jon's approach is to just give up on tracking different URIs for his 100s of articles and instead use a combination of metadata (title & author) and the "site:" operator submitted to a search engine to locate the current URI (side note: this approach is really similar to OpenURL).  For example, the link for the article above would become:

http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).   
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in HTML like the sketch below:
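
(The markup below is a hedged reconstruction, since the original example was an image: the data-* attribute names follow the "Missing Link"/Robust Links proposal, the publication date is inferred from the old URI, and the year-based Wayback Machine URI is illustrative rather than taken from the original post.)

<!-- href refinds the article via a search engine; data-* attributes keep the
     original URI, its publication date, and an archived copy -->
<a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22"
   data-originalurl="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html"
   data-versiondate="2006-11-15"
   data-versionurl="https://web.archive.org/web/2006/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html">
  XQuery and the power of learning by example</a>
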



I posted a comment saying that a search engine's robots.txt would prevent archives like the Internet Archive from archiving the SERPs, and thus the new URIs themselves would not be discovered (and archived).  In an email conversation Martin made the point that rewriting the link to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to msn.com or live.com queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that is not always true for general web pages, whose titles often change (see "Is This A Good Title?"). 

In summary, Jon's use of SERPs as interstitial pages is an interesting solution to the common problem of link rot, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them and betting on the long-term stability of SEs.  The solution we need is a method to include more than one URI per HTML link, such as proposed in the "Missing Link" document.

--Michael

2015-01-03: Review of WS-DL's 2014

The Web Science and Digital Libraries Research Group's 2014 was even better than our 2013.  First, we graduated two PhD students and had many other students advance their status:
In April we introduced our now famous "PhD Crush" board that allows us to track students' progress through the various hoops they must jump through.  Although it started as sort of a joke, it's quite handy and popular -- I now wish we had instituted it long ago. 

We had 15 publications in 2014, including:
JCDL was especially successful, with Justin's paper "Not all mementos are created equal: Measuring the impact of missing resources" winning "best student paper" (Daniel Hasan from UFMG also won a separate "best student paper" award), and Chuck's paper "When should I make preservation copies of myself?" winning the Vannevar Bush Best Paper award.  It is truly a great honor to have won both best paper awards at JCDL this year (pictures: Justin accepting his award, and me accepting on behalf of Chuck).  In the last two years at JCDL & TPDL, that's three best paper awards and one nomination.  The bar is being raised for future students.

In addition to the conference paper presentations, we traveled to and presented at a number of conferences that do not have formal proceedings:
We were also fortunate enough to visit and host visitors in 2014:
We also released (or updated) a number of software packages for public use, including:
Our coverage in the popular press continued, with highlights including:
  • I appeared on the video podcast "This Week in Law" #279 to discuss web archiving.
  • I was interviewed for the German radio program "DRadio Wissen". 
We were more successful on the funding front this year, winning the following grants:
All of this adds up to a very busy and successful 2014.  Looking ahead to 2015, as well as continued publication and funding success, we expect to graduate one MS and one PhD student and host another visiting researcher (Michael Herzog, Magdeburg-Stendal University). 

Thanks to everyone that made 2014 such a great success, and here's to a great start to 2015!

--Michael





2015-01-15: The Winter 2015 Federal Cloud Computing Summit



On January 14th-15th, I attended the Federal Cloud Computing Summit in Washington, D.C., a recurring event in which I have participated in the past. In my continuing role as the MITRE-ATARC Collaboration Session lead, I assisted the host organization, the Advanced Technology And Research Center (ATARC), in organizing and running the MITRE-ATARC Collaboration Sessions. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

The collaboration sessions continue to be highly valued within the government and industry. The Winter 2015 Summit had over 400 government or academic registrants and more than 100 industry registrants. The whitepaper summarizing the Summer 2014 collaboration sessions is now available.

A discussion of FedRAMP and the future of the policies was held in a Government-only session at 11:00 before the collaboration sessions began.
At its conclusion, the collaboration sessions began, with four sessions focusing on the following topics:
  • Challenge Area 1: When to choose Public, Private, Government, or Hybrid clouds?
  • Challenge Area 2: The umbrella of acquisition: Contracting pain points and best practices
  • Challenge Area 3: Tiered architecture: Mitigating concerns of geography, access management, and other cloud security constraints
  • Challenge Area 4: The role of cloud computing in emerging technologies
Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will continue its practice of releasing a summary document after the Summit (for reference, see the Summer 2014 and Winter 2013 summit whitepapers).

On January 15th, I attended the Summit itself, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:25 to 4:10, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the fourth Federal Summit event in which I have participated, including the Winter 2013 and Summer 2014 Cloud Summits and the 2013 Big Data Summit. They are great events that the Government participants have consistently identified as high-value. The events also garner a decent amount of press in the federal news outlets and at MITRE. Please refer to the fedsummits.com list of press for the most recent articles about the summit.

We are continuing to expand and improve the summits, particularly with respect to the impact on academia. Stay tuned for news from future summits!

--Justin F. Brunelle

2015-02-05: What Did It Look Like?

Having often wondered why many popular videos on the web are time lapse videos (that is, videos which capture the change of a subject over time), I came to the conclusion that impermanence gives value to the process of preserving ourselves or other subjects in photography, as though it were a means to defy the compulsory, fundamental law of change. Just like our lives, one of the greatest products of human endeavor, the World Wide Web, was once small but has continued to grow. So it is only fitting for us to capture its transitions.
What Did It Look Like? is a Tumblr blog which uses the Memento framework to poll various public web archives, take the earliest archived version from each calendar year, and then create an animated image that shows the progression of the site through the years.

To seed the service we randomly chose some web sites and processed them (see also the archives). In addition, everyone is free to nominate web sites to What Did It Look Like? by tweeting: "#whatdiditlooklike URL". 

In order to see how this system works, consider the architecture diagram below.

The system is implemented in Python and utilizes Tweepy and PyTumblr to access the Twitter and Tumblr APIs respectively, and consists of the following programs:
  1. timelapseTwitter.py: This application fetches tweets (with the "#whatdiditlooklike URL" signature) by using the tweet ID of the last tweet visited as a reference for where to begin retrieving new tweets. For example, if the application initially visited tweet IDs 0, 1, and 2, it keeps track of ID 2 so as to begin retrieving tweets with IDs greater than 2 in a subsequent retrieval operation. Also, since Twitter rate limits the number of search operations (180 requests per 15-minute window), the application sleeps in between search operations. A sketch of the basic operations of fetching tweets after the last tweet visited appears after this list.
  2. usingTimelapseToTakeScreenShots.py: This is a simple application which invokes timelapse.py for each nomination tweet (that is, a tweet with the "#whatdiditlooklike URL" signature).
  3. timelapse.py: Given an input URL, this application utilizes PhantomJS (a headless browser) to take screenshots and ImageMagick to create an animated GIF. It should also be noted that the GIFs created are optimized (see the sketch after this list) in order to reduce their respective sizes to under 1MB. This ensures the animation is not deactivated by Tumblr.
  4. timelapseSubEngine.py: This application executes two primary operations:
    1. Publication of the animated GIFs of nominated URLs to Tumblr: This is done through the PyTumblr create_photo() method, as sketched after this list.
    2. Notifying the referrer and making status updates on Twitter: This is achieved through Tweepy's api.update_status() method. However, a simple periodic Twitter status update message could eventually be flagged as spam by Twitter; this comes in the form of a 226 error code. In order to avoid this, timelapseSubEngine.py does not post the same status update or notification tweet twice. Instead, the application randomly selects from a suite of messages and injects a variety of attributes which ensure status update tweets are different. The randomness in execution is due to a custom cron application which randomly executes the entire stack, beginning with timelapseTwitter.py down to timelapseSubEngine.py.
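
The code snippets referenced above were embedded as images in the original post; the following is a hedged sketch of the pipeline based on those descriptions. The module names match the list above, but the function bodies, the screenshot.js PhantomJS rasterization script, and the message templates are illustrative assumptions rather than the actual source.

# Hedged sketch of the What Did It Look Like? pipeline (illustrative only).
import random
import subprocess
import pytumblr   # Tumblr API client
import tweepy     # Twitter API client

def fetch_nominations(api, since_id):
    # timelapseTwitter.py: fetch "#whatdiditlooklike URL" tweets newer than
    # the last tweet ID already processed.
    tweets = api.search(q="#whatdiditlooklike", since_id=since_id)
    urls = [u["expanded_url"] for t in tweets for u in t.entities.get("urls", [])]
    new_since_id = max([t.id for t in tweets], default=since_id)
    return urls, new_since_id

def build_gif(screenshots, gif_path):
    # timelapse.py: rasterize yearly mementos with PhantomJS, then assemble an
    # optimized animated GIF with ImageMagick (kept under 1MB so Tumblr does
    # not deactivate the animation).
    for memento_uri, png in screenshots:
        subprocess.call(["phantomjs", "screenshot.js", memento_uri, png])
    subprocess.call(["convert", "-delay", "100", "-loop", "0",
                     "-layers", "Optimize"]
                    + [png for _, png in screenshots] + [gif_path])

def publish(tumblr, twitter, gif_path, url, nominator):
    # timelapseSubEngine.py: post the GIF to Tumblr, then notify the nominator
    # on Twitter with a randomly varied message to avoid duplicate-status errors.
    tumblr.create_photo("whatdiditlooklike", state="published",
                        tags=["whatdiditlooklike"], data=[gif_path],
                        caption="What did %s look like?" % url)
    template = random.choice([
        "@%s here is what %s looked like through the years",
        "@%s see how %s has changed over time",
    ])
    twitter.update_status(template % (nominator, url))
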

How to nominate sites onto What Did It Look Like?

If you are interested in seeing what a web site looked like through the years:
  1. Search to see if the web site already exists by using the search service in the archives page; this can be done by typing the URL of the web site and hitting submit. If the URL does not exist on the site, go ahead to step 2.
  2. Tweet"#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn"to nominate multiple URLs.
Tweet"#whatdiditlooklike URL" to nominate a web site or tweet "#whatdiditlooklike URL1, URL2, ..., URLn"to nominate multiple URLs.

How to explore historical posts

To explore historical posts, visit the archives page: http://whatdiditlooklike.tumblr.com/archives

Examples 

What Did cnn.com Look Like?

What Did cs.odu.edu Look Like?
What Did apple.com Look Like?

"What Did It Look Like?" is inspired by two sources: 1) the "One Terabyte of Kilobyte Age Photo Op" Tumblr that Dragan Espenschied presented at DP 2014 (which basically demonstrates digital preservation as performance art; see also the commentary blog by Olia Lialina& Dragan), and 2) the Digital Public Library of America (DPLA) "#dplafinds" hashtag that surfaces interesting holdings that one would otherwise likely not discover.  Both sources have the idea of "randomly" highlighting resources that you would otherwise not find given the intimidatingly large collection in which they reside.

We hope you'll enjoy this service as a fun way to see how web sites -- and web site design! -- have changed through the years.

--Nwala

2015-08-18: Three WS-DL Classes Offered for Fall 2015


https://xkcd.com/657/

The Web Science and Digital Libraries Group is offering three classes this fall.  Unfortunately there are no undergraduate offerings this semester, but there are three graduate classes covering the full WS-DL spectrum:

Note that while 891 classes count toward the 24 hours of 800-level class work for the PhD program, they do not count as one of the "four 800-level regular courses" required.  Students looking to satisfy one of the 800-level regular courses should consider CS 834.  Students considering doing research in the broad areas of Web Science should consider taking all three of these classes this semester.

--Michael

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting



Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco to share our latest work with a group of Web archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we quickly introduced ourselves to each other, mentioning the purpose and nature of our work and of our visit to IA.

Then, Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that allow users to visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments. Furthermore, it contains a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo for the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.


The off-topic tool aims to automatically detect when an archived page goes off-topic, which means the page has changed through time and moved away from its initial scope. The tool suggests a list of off-topic pages based on a threshold that is input by the user. Based on evaluating the tool, we suggest threshold values in a research paper* that can be used to detect off-topic pages; a rough sketch of the idea appears below.
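
As a rough illustration of the cosine similarity method (not the code in the repository), each memento of a URI can be compared against the first memento of that URI, and any memento whose similarity falls below the user-supplied threshold is reported as off-topic. The use of scikit-learn below is an assumption for the sketch, not necessarily what the tool itself uses.

# Illustrative sketch of threshold-based off-topic detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def off_topic_mementos(memento_texts, threshold=0.15):
    # memento_texts: extracted text of each memento of one URI, in
    # chronological order; the first memento defines the page's topic.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(memento_texts)
    sims = cosine_similarity(tfidf[0:1], tfidf).flatten()
    # Report mementos whose similarity to the first memento drops below
    # the threshold (e.g., 0.15, as in the example runs below).
    return [(i, s) for i, s in enumerate(sims) if i > 0 and s < threshold]
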

A site for one of the candidates in Egypt’s 2012 presidential election. Many of the captures of hamdeensabahy.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples of the usage of the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list

http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org

50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/


Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/


Nicholas emphasized the importance of the off-topic tool from a QA perspective, while the Internet Archive folks focused on the required computational resources and how the tool could be shared with Archive-It partners. The group discussed some user interface options for displaying the output of the tool.

After the demo, we discussed the importance of the tool, especially in the crawling quality assurance practices.  While demoing ArchiveWeb interface, some of the visualization for pages from different collections showed off-topic pages.  We all agreed that it is important that those pages won’t appear to the users when they browse the collections.

It was amazing to spend time in IA and knowing about the last trend from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin




2015-08-28 Original Header Replay Considered Coherent

0
0

Introduction


As web archives have advanced over time, their ability to capture and playback web content has grown. The Memento Protocol, defined in RFC 7089, defines an HTTP protocol extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have conducted analysis of web archive temporal coherence. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives playback original headers.

Archive Headers and Original Headers


Consider the headers (Figure 1) returned for a logo from the ODUComputer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback

Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. This difficulty answering is due to the Last-Modified header reflecting the date archived instead of the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read Memento-Datetime is not Last-Modified for more details.)

Now consider the headers (figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of original headers that were included in the response when the logo image was captured by the web archive.

HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback

Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?". Clearly the answer is yes.

Why This Matters


For the casual web archive user, the previous example may seem like must nit-picky detail. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.

Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
The Weather Underground page (like most) is a composition of many resources including the web page itself, images,  style sheets, and JavaScript. Note the conflict between the forecast of light drizzle and the completely clear radar image. Figure 4 shows the relevant headers returned for the radar image:

HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers

Clearly the radar image was captured much later than the web page—over 9 months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story, showing that not only was the radar image captured well after the web page was archived, but also that the radar image was modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Org-Last-Modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image X-Archive-Orig-Last-Modified and Memento-Datetime bracket the web page archival time. Details on this pattern and others are detailed in our framework technical report.

Figure 5. Prima Facie Coherence

But Does Header Playback Really Matter?


Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that using the Last-Modified header makes a significant coherence improvement: using Last-Modified to select embedded resources increased mean prima facie coherence from ~41% to ~55% compared to using just Memento-Datetime. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!

Web Archives Today


When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to only embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now playback original headers. The table also show which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.

Table 1. Current Web Archive Status
Web ArchiveHeader Playback?Memento Compliant?OpenWayback?
Archive-ItYesYesYes
archive.todayNoYesNo
arXivNoNoNo
Bibliotheca Alexandrina Web ArchiveUnknown1YesYes
Canadian Government Web ArchiveNoProxyNo
Croatian Web ArchiveNoProxyNo
DBPedia ArchiveNoYesNo
Estonian Web ArchiveNoProxyNo
GitHubNoProxyNo
Icelandic Web ArchiveYesYesYes
Internet ArchiveYesYesYes
Library of Congress Web ArchiveYesYesYes
NARA Web ArchiveNoProxyYes
OrainNoProxyNo
PastPages Web ArchiveNoYesNo
Portugese Web ArchiveNoProxyNo
PRONI Web ArchiveNoYesYes
Slovenian Web ArchiveNoProxyNo
Stanford Web ArchiveYesYesYes
UK Government Web ArchiveNoYesYes
UK Parliament's Web ArchiveNoYesYes
UK Web ArchiveYesYesYes
Web Archive SingaporeNoProxyNo
WebCiteNoProxyNo
WikiPediaNoProxyNo
1Unavailable at the time this post was written.

Wrap Up


Web archives featuring both the capture and replay of original header show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended as it will allow implementation of improved recomposition heuristics (which is a topic for another day and another post).


— Scott G. Ainsworth

2015-09-01: From Student To Researcher II

0
0









After successfully defending my Master's Thesis, I accepted a position as a Graduate Research Assistant at Los Alamos National Laboratory (LANL) Library's Digital Library Research and Prototyping Team.  I now work directly for Herbert Van de Sompel, in collaboration with my advisor, Michael Nelson.

Up to this point, I worked for years as a software engineer, but then re-entered academia in 2010 to finish my Master's Degree.  I originally just wanted to be able to apply for jobs that required Master's Degrees in Computer Science, but during my time working on my thesis, I discovered that I had more of a passion for the research than I had expected, so I became a PhD student in Computer Science at Old Dominion University.  During the time of my Master's Degree, I had taken coursework that counts toward my PhD, so I am free to accept this current extended internship while I complete my PhD dissertation.

LANL is a fascinating place to work.  In my first week, we learned all about safety and security. We learned not only about office safety (don't trip over cables), but also nuclear and industrial safety (don't eat the radioactive material).  This was in preparation for the possibility that we might actually encounter an environment where these dangers existed. One of the more entertaining parts of the training was being aware of wildlife and natural dangers, such as rattlesnakes, falling rocks, flash floods, and tornadoes.  We also learned about the more mundane concepts like how to use our security badge and how to fill out timesheets.  I was fortunate to meet people from a variety of different disciplines.


We have nice, powerful computing equipment, good systems support, excellent collaboration areas, and very helpful staff. Everyone has been very conscientious and supportive as I have acquired access rights and equipment.

By the end of my first week, I had begun working with the Prototyping Team.  They shared their existing work with me, educating me on back-end technical aspects of Memento as well as other projects, such as Hiberlink. My team members Harihar Shankar and Lyudmila Balakireva have been nice enough to answer questions about the research work as well as general LANL processes.

I am already doing literature reviews and writing code for upcoming research projects.  We just released a new Python Memento Client library in collaboration with Wikipedia. I am also evaluating Jupyter for use in future data analysis and code collection.  I have learned so much in my first month here!

I know my friends and family miss me back in Virginia, but this time spent among some of the best and brightest in the world is already shaping up to be an enjoyable and exciting experience.

--Shawn M. Jones, Graduate Research Assistant, Los Alamos National Laboratory

2015-09-08: Releasing an Open Source Python Project, the Services That Brought py-memento-client to Life

0
0
The LANL Library Prototyping Team recently received correspondence from a member of the Wikipedia team requesting Python code that could find the best URI-M for an archived web page based on the date of the page revision. Collaborating with Wikipedia, Harihar Shankar, Herbert Van de Sompel, Michael Nelson, and I were able to create the py-mement-client Python library to suit the needs of pywikibot.

Over the course of library development, Wikipedia suggested the use of two services, Travis CI and Pypi, that we had not used before.  We were very pleased with the results of those services and learned quite a bit from the experience.  We have been using GitHub for years, and also include it here as part of the development toolchain for this Python project.

We present three online services that solved the following problems for our Python library:
  1. Where do we store source code and documentation for the long term? - GitHub
  2. How do we ensure the project is well tested in an independent environment?  - Travis CI
  3. Where do we store the final installation package for others to use? - Pypi
We start first with storing the source code.

GitHub

As someone who is concerned about the longevity of the scholarly record, I cannot emphasize enough how important it is to check your code in somewhere safe.  GitHub provides a wide variety of tools, at no cost, that allow one to preserve and share their source code.

Git and GitHub are not the same thing.  Git is just a source control system.  GitHub is a dedicated web site providing additional tools and hosting for git repositories.

Here are some of the benefits of just using Git (without GitHub):
  1. Distributed authoring - many people can work separately on the same code and commit to the same place
  2. Branching is built in, allowing different people to work on features in isolation (like unfinished support for TimeMaps)
  3. Tagging can easily be done to annotate a commit for release
  4. Many IDEs and other development tools support Git out of the box
  5. Ease of changing remote git repositories if switching from one site to another is required
  6. Every git clone is actually a copy of the master branch of the repository and all of its history, talk about LOCKSS!!!
That last one is important.  It means that all one needs to do is clone a git repository and they now have a local archive of that repository branch, with complete history, at the time of cloning.  This is in contrast to other source control systems, such as Subversion, where the server is the only place storing the full history of the repository.  Using git avoids this single point of failure, allowing us to still have a archival copy, including history, in the case that our git local server or GitHub goes away.


Here are some of the benefits of using GitHub:
  1. Collaboration with others inside and outside of the project team, through the use of pull requests, code review, and an issue tracker
  2. Provides a GUI for centralizing and supporting the project
  3. Allows for easy editing of documentation using Markdown, and also provides a wiki, if needed
  4. The wiki can also be cloned as a Git repository for archiving!
  5. Integrates with a variety of web services, such as Travis CI
  6. Provides release tools that allow adding of release notes to tags while providing compiled downloads for users
  7. Provides a pretty-parsed view of the code where quick edits can be made on the site itself
  8. Allows access from multiple Internet-connected platforms (phone, tablet, laptop, etc.)
  9. And so much more that we have not yet explored....
We use GitHub for all of these reasons and we are just scratching the surface.  Now that we have our source code centralized, how do we independently build and test it?

Travis CI

Travis CI provides a continuous integration environment for code. In our case, we use it to determine the health of the existing codebase.

We use it to evaluate code for the following:
  1. Does it compile? - tests for syntax and linking errors
  2. Can it be packaged? - tests for build script and linking errors
  3. Does it pass automated tests? - tests that the last changes have not broken functionality
Continuous integration provides an independent test of the code. In many cases, developers get code to work on their magiclaptop or their magic network and it works for no one else. Continuous Integration is an attempt to mitigate that issue.

Of course, far more can be done with continuous integration, like publish released binaries, but with our time and budget, the above is all we have done thus far.

Travis CI provides a free continuous integration environment for code.  It easily integrates with GitHub.  In fact, if a user has a GitHub account, logging into Travis CI will produce a page listing all GitHub projects that they have access to. To enable a project for building, one just ticks the slider next to the desired project.

It then detects the next push to GitHub and builds the code based on the a .travis.yml file, if present in the root of the Git repository.

The .travis.yml file has a relatively simple syntax whereby one specifies the language, language version, environment variables, pre-requisite requirements, and then build steps.

Our .travis.yml looks as follows:

language: python
cache: # caching is only available for customers who pay
directories:
- $HOME/.cache/pip
python:
- "2.7"
- "3.4"
env:
- DEBUG_MEMENTO_CLIENT=1
install:
- "pip install requests"
- "pip install pytest-xdist"
- "pip install ."
script:
- python setup.py test
- python setup.py sdist bdist_wheel
branches:
only:
- master

The language section tells Travis CI which language is used by the project. Many languages are available, including Ruby and Java.

The cache section allows caching of installed library dependencies on the server between builds. Unfortunately, the cache section is only available for paid customers.

The python section lists for which versions of Python the project will be built.  Travis CI will attempt a parallel build in every version specified here.  The Wikimedia folks wanted our code to work with both Python 2.7 and 3.4.

The env section contains environment variables for the build.

The install section runs any commands necessary for installing additional dependencies prior to the build.  We use it in this example to install dependencies for testing.  In the current version this section is removed because we now handle dependencies directly via Python's setuptools, but it is provided here for completeness.

The script section is where the actual build sequence occurs.  This is where the steps are specified for building and testing the code.   In our case, Python needs no compilation, so we skip straight to our automated tests before doing a source and binary package to ensure that our setup.py is configured correctly.

Finally, the branches section is where one can indicate additional branches to build.  We only wanted to focus on master for now.

There is extensive documentation indicating what else one can do with .travis.yml.

Once changes have have pushed to GitHub, Travis CI detects the push and begins a build.  As seen below, there are two builds for py-memento-client:  for Python 2.7 and 3.4.



Clicking on one of these boxes allows one to watch the results of a build in real time, as shown below. Also present is a link allowing one to download the build log for later use.


All of the builds that have been performed are available for review.  Each entry contains information about the the commit, including who performed the commit, as well as how long it took, when it took place, how many tests passed, and, most importantly, if it was successful.  Status is indicated by color:  green for success, red for failure, and yellow for in progress.


Using Travis CI we were able to provide an independent sanity check on py-memento-client, detecting test data that was network-dependent and also eliminating platform-specific issues.  We developed py-memento-client on OSX, tested it at LANL on OSX and Red Hat Enterprise Linux, but Travis CI runs on Ubuntu Linux so we now have confidence that our code performs well in different environments.
Closing thought:  all of this verification only works as well as the automated tests, so focus on writing good tests.  :)

Pypi

Finally, we wanted to make it straightforward to install py-memento-client and all of its dependencies:

pip install memento_client

Getting there required Pypi, a site that globally hosts Python projects (mostly libraries).  Pypi not only provides storage for built code so that others can download it, it also requires that metadata be provided so that others can see what functionality the project provides.  Below is an image of the Pypi splash page for the py-memento-client.


Getting support for Pypi and producing the data for this splash page required that we use Python setuptools for our build. Our setup.py file, inspired by Jeff Knupp's "Open Sourcing a Python Project the Right Way", provides support for a complete build of the Python project.  Below we highlight the setup function that is the cornerstone of the whole build process.

setup(
name="memento_client",
version="0.5.1",
url='https://github.com/mementoweb/py-memento-client',
license='LICENSE.txt',
author="Harihar Shankar, Shawn M. Jones, Herbert Van de Sompel",
author_email="prototeam@googlegroups.com",
install_requires=['requests>=2.7.0'],
tests_require=['pytest-xdist', 'pytest'],
cmdclass={
'test': PyTest,
'cleanall': BetterClean
},
download_url="https://github.com/mementoweb/py-memento-client",
description='Official Python library for using the Memento Protocol',
long_description="""
The memento_client library provides Memento support, as specified in RFC 7089 (http://tools.ietf.org/html/rfc7089)
For more information about Memento, see http://www.mementoweb.org/about/.
This library allows one to find information about archived web pages using the Memento protocol. It is the goal of this library to make the Memento protocol as accessible as possible to Python developers.
""",
packages=['memento_client'],
keywords='memento http web archives',
extras_require = {
'testing': ['pytest'],
"utils": ["lxml"]
},
classifiers=[

'Intended Audience :: Developers',

'License :: OSI Approved :: BSD License',

'Operating System :: OS Independent',

'Topic :: Internet :: WWW/HTTP',
'Topic :: Scientific/Engineering',
'Topic :: Software Development :: Libraries :: Python Modules',
'Topic :: Utilities',

'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3.4'
]
)

Start by creating this function call to setup, supplying all of these named arguments.  Those processed by Pypy are name, version, url, license, author, download_url, description, long_description, keywords, and classifiers.  The other arguments are used during the build to install dependencies and run tests.

The name and version arguments are used as the title for the Pypi page.  They are also used by those running pip to install the software.  Without these two items, pip does not know what it is installing.

The url argument is interpreted by Pypi as Home Page and will display on the web page using that parameter.

The license argument is used to specify how the library is licensed. Here we have a defect, we wanted users to refer to our LICENSE.txt file, but Pypi interprets it literally, printing License: LICENSE.txt.  We may need to fix this.

The author argument maps to the Pypi Author field and will display literally as typed, so commas are used to separate authors.

The download_url argument maps to the Pypi Download URL field.

The description argument becomes the subheading of the Pypi splash page.

The long_description argument becomes the body text of the Pypi splash page.  All URIs become links, but attempts to put HTML into this field produced a spash page displaying HTML, so we left it as text until we required richer formatting.

The keywords argument maps to the Pypi Keywords field.

The classifiers argument maps to the Pypi Categories field.  When choosing classifiers for a project, use this registry.  This field is used to index the project on Pypi to make finding it easier for end user.

For more information on what goes into setup.py, check out "Packaging and Distributing Projects" and "The Python Package Index (PyPI)" on the Python.org site.

Once we had our setup.py configured appropriately, we had to register for an account with Pypi.  We then created a .pypirc file in the builder's home directory with the contents shown below.

[distutils]
index-servers =
pypi

[pypi]
repository: https://pypi.python.org/pypi
username: hariharshankar
password: <password>

The username and password fields must both be present in this file. We encountered a defect while uploading the content whereby the setuptools did not prompt for the password if it was not present and the download failed.

Once that is in place, use the existing setup.py to register the project from the project's source directory:

python setup.py register

Once that is done, the project show up on the Pypi web site under the Pypi account. After that, publish it by typing:

python setup sdist upload

And now it will show up on Pypi for others to use.

Of course, one can also deploy code directly to Pypi using Travis CI, but we have not yet attempted this.

Conclusion


Open source development has evolved quite a bit over the last several years.  The first successful achievement being sites such as Freshmeat (now defunct) and SourceForge, providing free repositories and publication sites for projects.  GitHub fulfills this role now, but developers and researchers need more complex tools.

Travis CI, coupled with good automated tests, allows independent builds, and verification that software works correctly.  It ensures that a project not only compiles for users, but also passes functional tests in an independent environment.  As noted, one can even use it to deploy software directly.

Pypi is a Python-specific repository of Python libraries and other projects.  It is the backend of the pip tool commonly used by Python developers to install libraries.  Any serious Python development team should consider the use of Pypi for hosting and providing easy access to their code.

Using these three tools, we not only developed py-memento-client in a small amount of time, but we also independently tested, and published that library for others to enjoy.

--Shawn M. Jones
Graduate Research Assistant, Los Alamos National Laboratory
PhD Student, Old Dominion University

2015-09-10: CDXJ: An Object Resource Stream Serialization Format

0
0
I have been working on an IIPC funded project of profiling various web archives to summarize their holdings. The idea is to generate statistical measures of the holdings of an archive under various lookup keys where a key can be a partial URI such as Top Level Domain (TLD), registered domain name, entire domain name along with any number of sub-domain segments, domain name and a few segments from the path, a given time, a language, or a combination of two or more of these. Such a document (or archive profile) can be used answer queries like "how many *.edu Mementos are there in a given archive?", "how many copies of the pages are there in an archive that fall under netpreserve.org/projects/*", or "number of copies of *.cnn.com/* pages of 2010 in Arabic language". The archive profile can also be used to determine the overlap between two archives or visualize their holdings in various ways. Early work of this research was presented at the Internet Archive during the IIPC General Assembly 2015 and later it was published at:
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal, Web Archive Profiling Through CDX Summarization, Proceedings of TPDL 2015.
One of many challenges to solve in this project was to come up with a suitable profile serialization format that has the following properties:
  • scalable to handle terabytes of data
  • facilitates fast lookup of keys
  • allows partial key lookup
  • allows arbitrary split and merge
  • allows storage of arbitrary number of attributes without changing the existing data
  • supports link data semantics (not essential, but good to have)
We were initially leaning towards JSON format (with JSON-LD for linked data) because it has wide language and tool support and it is expressive like XML, but less verbose than XML. However, in the very early stage of our experiments we realized that it has scale issues. JSON, XML, and YAML (a more human readable format) are all single root node document formats, which means a single document serialized in any of these formats can not have multiple starting nodes; they all must be children of a single root node. This means it has to be fully loaded in the memory which can be the bottleneck in the case of big documents. Although there are streaming algorithms to parse XML or JSON, they are slow and usually only suitable for cases when an action is to be taken while parsing the document as opposed to using them for frequent lookup of the keys and values. Additionally, JSON and XML are not very fault tolerant i.e., a single malformed character may result in making the entire document fail to be parsed. Also, because of the single root node, splitting and merging of the documents is not easy.

We also thought about using simple and flat file formats such as CSV, ARFF, or CDX (a file format used in indexing WARC files for replay). These flat formats allow sorting that can facilitate fast binary lookup of keys and the files can be split in arbitrary places or multiple files with the same fields (in the same order) can be merged easily. However, the issue with these formats is that they do not support nesting and every entry in them must have the same attributes. Additionally the CDX has limited scope of extension as all the fields are already described and reserved.

Finally, we decided to merge the good qualities from CDX and JSON to come up with a serialization format that fulfills our requirements listed above. We call it CDXJ (or CDX-JSON). Ilya Kreymer first introduced this format in PyWB, but there was no formal description of the format. We are trying to formally introduce it and make some changes that makes it extendable so that it can be utilized by the web archiving community as well as broader web communities. The general description of the format is, "a plain file format that stores key-value pairs per line in which the keys are strings that are followed by their corresponding value objects where the values are any valid JSON with the exception that the JSON block does not contain any new line characters in it (encoded newline "\n" is allowed)." Here is an example:
@context ["http://oduwsdl.github.io/contexts/arhiveprofiles"]
@id {"uri": "http://archive.org/"}
@keys ["surt_uri", "year"]
@meta {"name": "Internet Archive", "year": 1996}
@meta {"updated_at": "2015-09:03T13:27:52Z"}
com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}
uk,ac,rpms)/ - {"frequency": 241, "spread": 3}
uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}

Lines starting with @ sign signify special purpose keys and they make these lines to appear together at the top of the file when sorted. The first line of the above example with the key @context provides context to the keywords used in the rest of the document. The value of this entry can be an array of contexts or an object with named keys. In the case of an array, all the term definitions from all the contexts will be merged in the global namespace (resolving name conflicts will be the responsibility of the document creator) while in the case of a named object it will serve like the XML Namespace.

The second entry @id holds an object that identifies the document itself and established relationship with other documents such as a parent/sibling when it is split. The @keys entry specifies the name of the key fields in the data section as an array of the field names in the order they appear (such as the primary key name appears first then the secondary key, and so on). To add more information about the keys, each element of the @keys array can have an object. All the lines except the special keys (@-prefixed) must have the exact number of fields as described in the @keys entry. Missing fields in the keys must have the special placeholder character "-".

The @meta entries describe the aboutness of the resource and other metadata. Multiple entries of the same special keys (that start with an @ sign) should be merged at the time of consuming the document. Splitting them in multiple lines increases the readability and eases the process of updates. This means the two @meta lines can be combined in a single line or split into three different lines each holding "name", "year", and "updated_at" separately. The policy to resolve the conflicts in names when merging such entries should be defined per key basis as suitable. These policies could be "skip", "overwrite", "append" (specially for the values that are arrays), or some other function to derive new value(s).

The latter three lines are the data entries in which the first one starts with a key com,cnn)/world (which is the SURT for of the http://www.cnn.com/world) followed by a nested data structure (in JSON format) that holds some statistical distribution of the archive holdings under that key. The next line holds different style of statistics (to illustrate the flexibility of the format) for a different key. The last line illustrates a secondary key in which the primary keys is the SURT form of a URI followed by the a secondary key that further divides the distribution yearly.

Now, let's reproduce the above example in JSON-LD, YAML, and XML respectively for comparison:
{
"@context": "http://oduwsdl.github.io/contexts/arhiveprofiles",
"@id": "http://archive.org/",
"meta": {
"name": "Internet Archive",
"year": 1996,
"updated_at": "2015-09:03T13:27:52Z"
},
"surt_uri": {
"com,cnn)/world": {
"urim": {
"min": 2,
"max": 9,
"total": 98
},
"urir": 46
},
"uk,ac,rpms)/": {
"frequency": 241,
"spread": 3
},
"uk,co,bbc)/images": {
"year": {
"2013": {
"frequency": 725,
"spread": 1
}
}
}
}
}
---
@context: "http://oduwsdl.github.io/contexts/arhiveprofiles"
@id: "http://archive.org/"
meta:
name: "Internet Archive"
year: 1996
updated_at: "2015-09:03T13:27:52Z"
surt_uri:
com,cnn)/world:
urim:
min: 2
max: 9
total: 98
urir: 46
uk,ac,rpms)/:
frequency: 241
spread: 3
uk,co,bbc)/images:
year:
2013:
frequency: 725
spread: 1
<?xml version="1.0" encoding="UTF-8"?>
<profile xmlns="http://oduwsdl.github.io/contexts/arhiveprofiles">
<id>http://archive.org/</id>
<meta>
<name>Internet Archive</name>
<year>1996</year>
<updated_at>2015-09:03T13:27:52Z</updated_at>
</meta>
<data>
<record surt-uri="com,cnn)/world">
<urim>
<min>2</min>
<max>9</max>
<total>98</total>
</urim>
<urir>46</urir>
</record>
<record surt-uri="uk,ac,rpms)/">
<frequency>241</frequency>
<spread>3</spread>
</record>
<record surt-uri="uk,co,bbc)/images" year="2013">
<frequency>725</frequency>
<spread>1</spread>
</record>
</data>
</profile>

The WAT format, commonly used in the web archiving community also uses JSON fragments as values for each entry separately to deal with the single root document issue, but it does not restrict the use of new-line character. As a consequence, sorting the file line-wise is not allowed, which affects the lookup speed. In contrast, CDXJ files can be sorted (like CDX files) which allows binary search in the files on the disk and prove very efficient in lookup heavy applications.

We have presented the earlier thoughts to seek feedback on the CDXJ serialization format at Stanford University during IIPC GA 2015. The slides of the talk are available at:


Going forward we are proposing to split the syntax and semantics of the format in separate specifications where the overall syntax of the file is defined as a base format while further restrictions and semantics such as adding meaning to the keys, making certain entries mandatory, giving meaning to the terms, enforcing specific sort order and defining the scope of the usage for the document are described in a separate derived specification. This practice is analogous to the XML which defines the basic syntax of the format and other XML-based formats such as XHTML or  Atom add semantics to it.

A generic format for this purpose can be defined as Object Resource Stream (ORS) that registers ".ors" file extension and "application/ors" media type. Then CDXJ extends from that to add semantics to it (as described above) which registers ".cdxj" file extension and "application/cdxj+ors" media type.

Object Resource Stream (ORS)

The above railroad diagram illustrates the grammar of the ORS format. Every entry in this format acquires one line. Empty lines are allowed that should be skipped during the consumption of the file. Apart from the empty lines, every line starts with a string key, followed by a single-line JSON block as the value. The keys are allowed to have optional leading or trailing spaces (SPACE or TAB characters) for indentation, readability, or alignment purposes, but should be skipped when consuming the file. Keys can be empty strings which means values (JSON blocks) can be present without being associated with any keys. Quoting keys is not mandatory, but if necessary one can use double quotes for the purpose. Quoted string keys will preserve any leading or trailing spaces inside quotes. None of the keys or values are allowed to break the line (new lines should be encoded if any) as the line break starts a new entry. As mentioned before, it is allowed to have a value block without a corresponding key, but not the opposite. Since the opening square and curly brackets indicate the start of the JSON block, hence it is necessary to escape them (as well as the escape and double quote characters) if they appear in the keys, and optionally their closing pairs should also be escaped. An ORS parser should skip malformed lines and continue with the remaining document. Optionally the malformed entries can be reported as warnings.

CDXJ (or CDX-JSON)

The above railroad diagram illustrates the grammar of the CDXJ format. CDXJ is a subset of ORS as it introduces few extra restriction in the syntax that are not present in the ORS grammar. In the CDXJ format the definition of the key string is strict as it does not allow leading spaces before the key or empty string as the key. If there are spaces in the CDXJ key string, it is considered a compound key where every space separated segment has an independent meaning. Apart from the @-prefixed special keys, every key must have the same number of space separated fields and empty fields use the placeholder "-". CDXJ only allows a single SPACE character to be used as the delimiter between the parts of the compound key. It also enforces a SPACE character to separate the key from the JSON value block. As opposed to the ORS, CDXJ does not allow TAB character as the delimiter. Since the keys cannot be empty strings in CDXJ, there must be a non-empty key associated with every value in it. Additionally, the CDXJ format also prohibits empty lines. These restrictions are introduced in the CDXJ to encourage its use as sorted files to facilitate binary search on the disk. When sorting CDXJ files, byte-wise sorting is encouraged for greater interoperability (this can be achieved on Unix-like operating systems by setting an environment variable LC_ALL=C). On the semantics side CDXJ introduces optional @-prefixed special keys to specify metadata, the @keys key to specify the field names of the data entries, and the @id and the @context keys to provision linked-data semantics inspired by JSON-LD.

Applications

There are many applications where a stream of JSON block is being used or can be used. Some of the applications even enforce the single line JSON restriction and optionally prefix the JSON block with associated keys. However, the format is not formally standardized and it is often called JSON for the sake of general understanding. The following are some example applications of the ORS or CDXJ format:
  • Archive Profiler generates profiles of the web archives in CDXJ format. An upcoming service will consume profiles in the CDXJ format to produce a probabilistic rank ordered list of web archives with the holdings of a given URI.
  • PyWB accepts (and encourages the usage of) CDXJ format for the archive collection indexes and the built-in collection indexer allows generating CDXJ indexes.
  • MemGator is a Memento aggregator that I built. It can be used as a command line tool or run as a web service. The tool generates TimeMaps in CDXJ format along with Link and JSON formats. The CDXJ format response is sorted by datetime as the key and it makes it very easy and efficient to consume the data chronologically or using text processing tools to perform filtering based on partial datetime.
  • 200 million Reddit link (posts) corpus that I collected and archived recently (it will be made publicly available soon) in CDXJ format (where the key is the link id), while 1.65 billion Reddit comments corpus is available in a format that conforms ORS format (although it is advertised as series of JSON blocks delimited by new lines).
  • A very popular container technology Docker and a very popular log aggeragation and unification service Fluentd are using a data format that conforms the above described specification of ORS. Docker calls their logging driver JSON which actually generates a stream of single-line JSON blocks that can also have the timestamp prefix with nano-second precision as the key for each JSON block. Fluentd log is similar, but it can have more key fields as prefix to each line of JSON block.
  • NoSQL databases including key-value store, tuple store, data structure server, object database, and wide column store implementations such as Redis and CouchDB can use ORS/CDXJ format to import and export their data from and to the disk.
  • Services that provide data streams and support JSON format such as Twitter and Reddit can leverage ORS (or CDXJ) to avoid unnecessary wrapper around the data to encapsulate the under a root node. This will allow immediate consumption of the stream of the data as it arrives to the client, without waiting for the end of the stream.
In conclusion, we proposed a generic Object Resource Stream (ORS) data serialization format that is composed of single line JSON values with optional preceding string keys per line. For this format we proposed the file extension ".ors" and the media-type "application/ors". Additionally, we proposed a derivative format of ORS as CDXJ with additional semantics and restrictions. For the CDXJ format we proposed the file extension ".cdxj" and the media-type "application/cdxj+ors". The two formats ORS and CDXJ can be very helpful in dealing with endless streams of structured data such as server logs, Twitter feed, and key-value stores. These formats allow arbitrary information in each entry (like schema-less NoSQL databses) as opposed to the fixed-field formats such as spreadsheets or relational databases. Additionally, these formats are text processing tool friendly (such as sort, grep, and awk etc.) which makes them very useful and efficient for file based data store. We have also recognized that the proposed formats are already in use on the Web and have proved their usefulness. However, they are neither formally defined nor given a separate media-type.

--
Sawood Alam


2015-09-21: InfoVis Spring 2015 Class Projects

0
0
In Spring 2015, I taught Information Visualization (CS 725/825) for MS and PhD students.  This time we used Tamara Munzner's Visualization Analysis & Design textbook, which I highly recommend:
"This highly readable and well-organized book not only covers the fundamentals of visualization design, but also provides a solid framework for analyzing visualizations and visualization problems with concrete examples from the academic community. I am looking forward to teaching from this book and sharing it with my research group."
—Michele C. Weigle, Old Dominion University
I also tried a flipped-classroom model, where students read and answer homework questions before class so that class time can focus on discussion, student presentations, and in-class exercises. It worked really well -- students liked the format, and I didn't have to convert a well-written textbook into Powerpoint slides.

Here I highlight a couple of student projects from that course.  (All class projects are listed in my InfoVis Gallery.)

Chesapeake Bay Currents Dataset Exploration
Created by Teresa Updyke


Teresa is a research scientist at ODU's Center for Coastal Physical Oceanography (CCPO). This visualization (currently available at http://www.radarops.comoj.com/CS725/project/index.html) gives a view of the metadata related to the high-frequency radar data that the CCPO collects. For all stations, users can investigate the number of data files available, station count, vector count, and average speed of the currents. The map allows users to select one of the three stations and further investigate the radial count and type collected on each day. This visualization aids researchers in quickly determining the quality of data collected at specific times and in identifying interesting areas for further investigation.

The thing I really liked about this project was that it solved a real problem and will help Teresa to do her job better. I asked Teresa how researchers previously determined what data was available.  Her reply: "They called me, and I looked it up in the log files."


In and Out Flow of DoD Contracting Dollars
Created by Kayla Henneman and Swaraj Wankhade



This project (currently available at http://kaylamarie0110.github.io/infovis/project.html) is a visualization of the flow of Department of Defense (DoD) contracting dollars to and from the Hampton Roads area of Virginia. The system is for those who wish to analyze how the in- and out-flow of DoD contracting dollars affects the Hampton Roads economy. The visualization consists of an interactive bubble map which shows the flow of DoD contracting dollars to and from Hampton Roads based on counties, along with line charts which show the total amount of inflow and outflow dollars.  Hovering over a county on the map shows the inflow and outflow amounts for that county over time.



Federal Contracting in Hampton Roads
Created by Valentina Neblitt-Jones and Shawn Jones


This project (currently available at http://shawnmjones.github.io/hr-contracting/app/index.html) is a visualization for US federal government contracting awards in the Hampton Roads region of Virginia. The visualization consists of a choropleth map displaying different colors based on the funding each locality receives. To the right of the map is a bar chart indicating how much funding each industry received. On top of the map and the bar chart is a sparkline showing the trend in funding. The visualization allows the user to select a year, agency, or locality within the Hampton Roads area and updates the choropleth, bar chart, and sparkline as appropriate.


-Michele

2015-09-28: TPDL 2015 in Poznan, Poland

0
0

The Old Market Square in Poznan
On September 15 2015, Sawood Alam and I (Yasmin AlNoamany) attended the 2015 Theory and Practice of Digital Libraries (TPDL) Conference in Poznan, Poland. This year, WS-DL had four accepted papers in TPDL for three students (Mohamed Aturban (who could not attend the conference because of visa issues), Sawood Alam, and Yasmin AlNoamany). Sawood and I arrived in Poznan on Monday, Sept. 14. Although we were tired from travel, we could not resist walking to the the best area in Poznan, the old market square. It was fascinating to see those beautiful colorful houses at night with the reflection of the water on them after it rained with the beautiful European music by many artists who were playing in the street.

The next morning we headed to the conference, which was held in Poznań Supercomputing and Networking Center. The organization of the conference was amazing and the general conference co-chairs, Marcin Werla and Cezary Mazurek, were always there to answer our questions. Furthermore, the people at the reception of the conference were there for us the whole time to help us with transportation, especially with the communication with taxi drivers; we do not speak Polish and they do not speak English. On every day of the conference, there were coffee break where we had hot and cold drinks and snacks. It is worth mentioning that I had the best coffee I have ever tasted in Poland :-). The main part of the TPDL 2015 conference was streamed live and recorded. The recordings will be processed and made publicly available on-line on PlatonTV portal.

Sawood (on the left) and Jose (on the right)
We met Jose Antonio Olvera, who interned in WS-DL lab in summer 2014, at the entrance. At the conference, Jose had an accepted poster “Evaluating Auction Mechanisms for the Preservation of Cost-Aware Digital Objects Under Constrained Digital Preservation Budgets” that was presented at the evening of the first day in the poster session. It was nice meeting him, since I was not there when he interned in our lab.
The first day of the main conference, September 15th, started with a Keynote speech by David Giaretta, whom I was honored to speak to many times during the conference and had him among the audience of my presentations, talked about "Data – unbound by time or discipline – challenges and new skills needed". At the beginning, Giaretta introduced himself with a summary about his background. His speech mainly was about data preservation and the challenges that this field faces, such as link rots, which Giaretta considered a big horror. He mentioned many examples about the possibility of data loss. Giaretta talked about big data world and presented the 7 (or 8 (or 9)) V’s of big data: volume, velocity, variety, volatility, veracity, validity, value, variability, and visualization. I loved these quotes from his speech:
  • "Preservation is judged by continuing usability, then come value". 
  • "Libraries are gateways to knowledge". 
  • "Metadata is classification".
  • "emulate or migrate".
He talked about how it is valuable and expensive to preserve the scientific data, then raised an issue about reputation for keeping things over time and long term funding. Funding is a big challenge in digital preservation, so he talked about vision and opportunities for funding. Giaretta concluded his keynote with the types of digital objects that needs to be preserved, such as simple documents and images, scientific data, complex objects, and the changing over time (such as the annotations). He raised this question: "what questions can one ask when confronted with some completely unfamiliar digital objects?" Giaretta ended his speech with an advice: "Step back and help the scientists to prepare data management plans, the current data management plan is very weak".

After the keynote we went to a coffee break, then the first session of the conference "Social-technical perspectives of digital information" started. The session was led by WS-DL’s Sawood Alam presenting his work "Archive Profiling Through CDX Summarization", which is a product of an IIPC funded project. He started with a brief introduction about the memento aggregator and the need of profiling the long tail of archives to improve the efficiency of the aggregator. He described two earlier profiling efforts: the complete knowledge profile by Sanderson and minimalistic TLD only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He also talked about the newly introduced CDXJ serialization format for profiles and illustrated its usefulness in serializing profiles on scale with the ability of merging and splitting arbitrary profiles easily. He evaluated his findings and concluded that his work so far gained up to 22% routing precision with less than 5% cost relative to the complete knowledge profile without any false negatives. The code to generate profiles and benchmark can be found in a GitHub repository.



Next, there was a switch between the second and the third presentations and since Sawood was supposed to present on the behalf of Mohamed Aturban, the chair of the session gave Sawood enough time to breathe between the two presentations.

The second presentation was "Query Expansion for Survey Question Retrieval in the Social Sciences" by Nadine Dulisch from GESIS and Andreas Oskar Kempf from ZBW. Andreas started with a case study for the usage of survey questions, which were developed by operational organizations, in social science. He presented the importance of social science survey data for social scientists.  Then, Nadine talked about the approaches they applied for query expansion retrieval. She showed that statistical-based expansion was better than intellectual-based expansion. They presented the results of their experiments based on Trec_eval. They evaluated thesaurus-based and co-occurrence-based expansion approaches for query expansion to improve retrieval quality in digital libraries and research data archives. They found that automatically expanded queries using extracted co-occurring terms could provide better results than queries manually reformulated by a domain expert.

Sawood presented "Quantifying Orphaned Annotations in Hypothes.is". In this paper, Aturban et al. analyzed 6281 highlighted text annotations in Hypothes.is annotation system. They also used the Memento Aggregator to look for archived versions of the annotated pages. They found that 60% of the highlighted text annotations are orphans (i.e. annotations are attached to neither the live web nor memento(s)) or in danger of being orphaned (i.e. annotations are attached to the live web but not to memento(s)). They found that if a memento exists, there is a 90% chance that it recovers the annotated webpage. Using public archives, only 3% of all highlighted text annotations were reattached, otherwise they would be orphaned. They found that for the majority of the annotations, no memento existed in the archives. Their findings highlighted the need for archiving pages at the time of annotation.


After the end of the general session, we took a lunch break where we gathered with Jose Antonio Olvera and many of the conference attendees to exchange our research ideas.

After the lunch break, we attended the second session of the day, "Multimedia information management and retrieval and digital curation". The session started with "Practice-oriented Evaluation of Unsupervised Labeling of Audiovisual Content in an Archive Production Environment” presented by Victor de Boer. In their work, Victor et al. evaluated the automatic labeling of the audiovisual content to improve efficiency and inter-annotator agreement by generating annotation suggestions automatically from textual resources related to the documents to be archived. They performed pilot studies to evaluate term suggestion methods through precision and recall by taking terms assigned by archivists as ‘ground-truth’. The found that the quality of automatic term-suggestion are sufficiently high.

The second presentation was "Measuring Quality in Metadata Repositories" by Dimitris Gavrilis. He started his presentation by mentioning that this is a hard topic, then he explained why this research is important. He explained the specific criteria that determine the data quality: completeness, validity, consistency, timeliness, appropriateness, and accuracy constituents. In their paper, Dimitris et al. introduced a metadata quality evaluation model (MQEM) that provides a set of metadata quality criteria as well as contextual parameters concerning metadata generation and use. The MQEM allows the curators and the metadata designers to assess the quality of their metadata and to run queries on existing datasets. They evaluated their framework on two different use cases: application design and content aggregation.

After the session, we took a break and I got illness which prevented me from attending the discussion panel session, which was entitled "Open Access to Research Data: is it a solution or a problem?", and the poster session. I went back to the hotel to rest and prepare for the next day's presentation. I am embedding the tweets about the panel and the poster session.

The next day I felt fine, so we went early to have breakfast in the beautiful old market square, then headed to the conference. The opening of the second day was by Cezary Mazurek who introduced the sessions of the second day and thanked the sponsors of the conference. Then he left us with a beautiful soundtrack of music, which was related to the second keynote speaker.

The Keynote speech was "Digital Audio Asset Archival and Retrieval: A Users Perspective" by Joseph Cancellaro, active composer, musician, and the chair of the Interactive Art and Media Department of Columbia College in Chicago. Cancellaro started by a short bio about himself. The first part of his presentation handled issues of audio asset and the constant problematic for sound designers and non-linear environments (naming convention (meta tag), search tools, storage (failure), retrieval (failure), DSP (Digital signal processing), etc. He also mentioned how do they handle these issues in his department, for example for naming conventions, they add tags to the files. He explained the simple sound asset SR workflow. Preservation to Cancellaro is “not losing any more audio data". His second part of the presentation was about storage, retrieval, possible solutions, and content creation. He mentioned some facts about storage and retrieval:
  • The decrease in technology costs has reduced the local issues of storage capacity (this is always a concern in academia). 
  • Bandwidth is still an issue in real-time production. 
  • Non-linear sound production is a challenge for linear-minded composers and sound designers.
He mentioned that searching for sound objects is a blocking point for many productions, then continued, "when I ask my students about the search options for the soundtracks they have, all I hear are crickets". At the end, Dr. Cancellaro presented the agile concept as a solution for content management systems (CMS). He also presented basic digital audio theory: sound as a continuous analog event is captured at specific data points.

After the keynote, we took a coffee break, then the sessions of the second day started with "Influence and Interrelationships among Chinese Library and Information Science Journals in Taiwan" by Ya-Ning Chen. In this research, the authors investigate the citation relations between journals based on a data set collected from 11 Chinese LIS journals in Taiwan (2,031 articles from 2001 to 2012). The authors measured the indegree, outdegree, and self-feeding ratios between the journals. They also measured degree and betweenness centrality, two social network analysis (SNA) measures, to investigate the information flow between Chinese LIS journals in Taiwan. They created an 11 × 11 matrix that expresses the journal-to-journal analysis, and a sociogram of the interrelationships among Chinese LIS journals in Taiwan that summarizes the citation relations between the journals they studied.
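
As an illustration of the SNA measures mentioned above, the following sketch builds a directed citation graph from a small hypothetical journal-to-journal matrix (the paper uses an 11 × 11 matrix) and computes degree and betweenness centrality with networkx; the journal names and counts are made up.

```python
import networkx as nx
import numpy as np

# Hypothetical 3x3 journal-to-journal citation counts (the paper uses an 11 x 11 matrix)
journals = ["Journal A", "Journal B", "Journal C"]
citations = np.array([[0, 5, 2],
                      [3, 0, 7],
                      [1, 4, 0]])

G = nx.from_numpy_array(citations, create_using=nx.DiGraph)  # weighted, directed graph
G = nx.relabel_nodes(G, dict(enumerate(journals)))

print("in-degree:", dict(G.in_degree(weight="weight")))    # citations received
print("out-degree:", dict(G.out_degree(weight="weight")))  # citations given
print("betweenness:", nx.betweenness_centrality(G))
```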

Next was a presentation entitled "Tyranny of Distance: Understanding Academic Library Browsing by Refining the Neighbour Effect" by Dana McKay and George Buchanan.
Dana and George explained the importance of browsing books as a part of information seeking, and how this is not well supported for e-books. They used different datasets to examine patterns of co-borrowing and several aspects of the neighbour effect on browsing behavior. Finally, they presented their findings for improving browsing in digital libraries.

The last presentation of this session was a study on Storify entitled "Characteristics of Social Media Stories" by Yasmin AlNoamany. Based on an analysis of 14,568 stories from Storify, AlNoamany et al. identified the structural characteristics of popular (i.e., most-viewed) human-generated stories in order to build a template that will later be used to generate (semi-)automatic stories from archived collections. The study investigated many questions about story features, such as the length of the story, the number of elements, and the decay rate of the stories. Finally, the study differentiated popular from unpopular stories based on their main features. Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, popular and unpopular stories differ in most of the features. Popular stories tend to have more web elements (medians of 28 vs. 21), a longer timespan (5 hours vs. 2 hours), longer editing time intervals, and a lower decay rate.
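
For readers unfamiliar with the Kruskal-Wallis test used here, the snippet below shows how such a comparison could be run with SciPy on two groups for a single feature; the numbers are placeholders, not values from the study.

```python
from scipy.stats import kruskal

# Placeholder values for one feature (number of web elements per story) in each group
popular_elements   = [28, 31, 25, 40, 27, 33, 29]
unpopular_elements = [21, 18, 22, 19, 24, 20, 17]

statistic, p_value = kruskal(popular_elements, unpopular_elements)
print(f"H = {statistic:.2f}, p = {p_value:.4f}")
if p_value <= 0.05:
    print("the two groups differ significantly on this feature")
```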

 

After the presentation, we had lunch, during which some attendees continued the conversation about my research. It was a useful discussion regarding the future of my research, especially integrating data from archived collections with storytelling services.

The "user studies for and evaluation of digital library systems and applications" session started after the break with "On the Impact of Academic Factors on Scholar Popularity: A Cross-Area Study” presentation by Marcos Gonçalves. Gonçalves et al. presented a cross-area study on the impact of key academic factors on scholar popularity for understanding how different factors impact scholar popularity. They conducted their study based on scholars affiliated to different graduate programs in Brazil and internationally, with more than 1,000 scholars and 880,000 citations, over a 16-year period. They found that scholars in technological programs (e.g., Computer Science, Electrical Engineering, Bioinformatics) tend to be the most "popular" ones in their universities. They also found that international popularity in still much higher than that obtained by Brazilian scholars.

After the first presentation, there was a panel on "User Studies and Evaluation" by George Buchanan, Dana McKay, and Giannis Tsakonas, moderated by Seamus Ross, as a replacement for two presentations whose presenters were absent. The panel started with a question from Seamus Ross: are user studies in digital libraries soft? Each of the panelists presented their point of view on the importance of user studies. Buchanan said that user studies matter, and Dana followed up that we want to create something that all people can use. Tsakonas said he had done studies that never developed into systems. Seamus Ross then asked the panelists: what makes a person a good user-study researcher? Dana answered with a joke: "choose someone like me". Dana works as User Experience Manager and Architect at the academic library of Swinburne University of Technology, so she has experience with user needs and user studies. I followed up that we do user studies to learn what people need or to evaluate a system, and then asked whether Mechanical Turk (MTurk) experiments are a form of user studies. At the end, Seamus Ross concluded the panel with some advice on conducting user studies, such as considering a feedback loop in the user-study process.

After the panel, we had a coffee break. I had a great discussion about user evaluation in the context of my research with Brigitte Mathiak, who gave me much useful advice about evaluating the stories that will be created automatically from the web archives. Later on my European trip, I gave a presentation at Magdeburg-Stendal University of Applied Sciences that gives the big picture of my research.

In the last session, I attended Brigitte Mathiak's presentation of "Are there any Differences in Data Set Retrieval compared to well-known Literature Retrieval?". In the beginning, Mathiak explained the motivation for their work. Based on two user studies, a lab study with seven participants and telephone interviews with 46 participants, they investigated the requirements users have for a data set retrieval system in the social sciences and in digital libraries. They found that choosing a data set is more important to researchers than choosing a piece of literature. Moreover, metadata quality and quantity are even more important for data sets.

In the evening, we had the conference dinner, which was held at Concordia Design along with beautiful music. At the dinner, the conference chairs announced two awards: the best paper award to Matthias Geel and Moira Norrie for "Memsy: Keeping Track of Personal Digital Resources across Devices and Services" and the best poster/demo award to Clare Llewellyn, Claire Grover, Beatrice Alex, Jon Oberlander, and Richard Tobin for "Extracting a Topic Specific Dataset from a Twitter Archive".



The third day started early at 9:00 am with sessions about digital humanities, in which I presented my study "Detecting Off-Topic Pages in Web Archives". The paper investigates different methods for automatically detecting when an archived page goes off-topic. It presents six methods that mainly depend on comparing an archived copy of a page (a memento) with the first memento of that page. The methods were tested on different archived collections from Archive-It. The suggested best method is a combination of a textual method (cosine similarity over TF-IDF vectors) and a structural method (word count), which gave an average precision of 0.92 on 11 different collections. The output of this research is a tool for detecting off-topic pages in the archive. The code can be downloaded and tested from GitHub, and more information can be found in my recent presentation at the Internet Archive.
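
To give a flavor of the combined method, here is a minimal sketch that compares a later memento against the first memento using TF-IDF cosine similarity and relative word-count change; the thresholds and the function name are illustrative assumptions, and the actual tool on GitHub is more elaborate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_off_topic(first_memento_text, later_memento_text,
                 cosine_threshold=0.15, wordcount_threshold=-0.85):
    """Flag a memento as off-topic when it drifts too far from the first memento.
    The thresholds are placeholders, not the tuned values reported in the paper."""
    # textual signal: cosine similarity over TF-IDF vectors of the two mementos
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(
        [first_memento_text, later_memento_text])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    # structural signal: relative change in word count since the first memento
    first_wc = len(first_memento_text.split())
    later_wc = len(later_memento_text.split())
    wc_change = (later_wc - first_wc) / max(first_wc, 1)

    # off-topic if either signal falls below its threshold
    return cosine < cosine_threshold or wc_change < wordcount_threshold
```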


 

The next paper presented in the digital humanities session was "Supporting Exploration of Historical Perspectives across Collections" by Daan Odijk. In their work, Odijk et al. introduced tools for selecting, linking, and visualizing the Second World War (WWII) collections of the NIOD, the National Library of the Netherlands, and Wikipedia. They link digital collections via implicit events: if two articles are close in time and similar in content, they are considered related. Furthermore, they provided an exploratory interface for exploring the connected collections. They used Manhattan distance for textual similarity over document terms in a TF-IDF weighted vector space and measured temporal similarity using a Gaussian decay function. They found that textual similarity performed better than temporal similarity, and that combining the two improved the nDCG score.
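
A rough sketch of how such textual and temporal signals could be blended is shown below; the distance-to-similarity conversion, the decay width `sigma_days`, and the mixing weight `alpha` are my assumptions for illustration, not the parameters used by Odijk et al.

```python
import math
from datetime import datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import manhattan_distances

def combined_similarity(text_a, text_b, date_a, date_b, sigma_days=30.0, alpha=0.5):
    """Blend a textual and a temporal similarity score for two articles."""
    # textual: Manhattan distance over TF-IDF vectors, squashed into (0, 1]
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    text_sim = 1.0 / (1.0 + manhattan_distances(tfidf[0], tfidf[1])[0, 0])
    # temporal: Gaussian decay over the gap (in days) between the two dates
    gap = abs((date_a - date_b).days)
    time_sim = math.exp(-(gap ** 2) / (2 * sigma_days ** 2))
    return alpha * text_sim + (1 - alpha) * time_sim

# Hypothetical usage with two WWII-era article dates
print(combined_similarity("bombardment of Rotterdam in May 1940",
                          "the Rotterdam Blitz destroyed the city centre",
                          datetime(1940, 5, 14), datetime(1940, 5, 20)))
```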


The third paper, entitled "Impact Analysis of OCR Quality on Research Tasks in Digital Archives", was presented by Myriam C. Traub. Traub et al. performed user studies on digital archives to classify research tasks and describe the potential impact of OCR quality on them, by interviewing scholars from the digital humanities. They analyzed the questions and categorized the research tasks. Myriam said that few scholars could quantify the impact of OCR errors on their own research tasks, and that OCR is unlikely ever to be perfect. They could not offer solutions, but they could suggest strategies that lead toward them. At the end, Myriam suggested that the tools should be open source and that there should be evaluation metrics.


 
At the end, I attended the last keynote speech, by Costis Dallas, "The post-repository era: scholarly practice, information and systems in the digital continuum", which was about digital humanists' practices in the age of curation. Then the conference ended with the closing session, in which TPDL 2016 in Hannover, Germany was announced.

After the conference, Sawood and I took the train from Poznan to Potsdam, Germany to meet Dr. Michael A. Herzog, the Vice Dean for Research and Technology Transfer in the Department of Economics and head of the SPiRIT research group. We were invited to talk about our research in a Digital Preservation lecture at Magdeburg-Stendal University of Applied Sciences in Magdeburg. Sawood wrote a nice blog post about our talks.

---
Yasmin

2015-09-30: Digital Preservation - Magdeburg Germany Trip Report


Dr. Herzog: This large green area on your left is Sanssouci Park. It has 11 palaces in it.
Yasmin: I want to visit this park after we are back from the university, can we?
Dr. Herzog: We sure can... I think we will be back before sunset.
Yasmin: I love beautiful things.
Dr. Herzog: Who doesn't?
Sawood: [Smiles]

The three souls were heading to the Hochschule Magdeburg-Stendal (Magdeburg-Stendal University of Applied Sciences) from Potsdam, Germany in Dr. Michael Herzog's car for a lunch lecture on the topic of Digital Preservation. Yasmin and Sawood from the Web Science and Digital Libraries Research Group of Old Dominion University, Norfolk, Virginia were invited for the talk by Dr. Herzog at his SPiRIT Research Group. The two WSDL members had presented their work at TPDL 2015 in Poznan, Poland, and on their way back home they stopped over and were hosted by Dr. Herzog in Germany for the lunch lecture. You may also enjoy the TPDL 2015 trip report by Yasmin.


Passing by beautiful landscapes, crossing bridges and rivers, observing renewable energy sources such as windmills and solar panels, and touching almost 200 km/h on the highway, we reached the university in Magdeburg. Due to the vacation period there were not many people on campus, but the canteen was still crowded when we went there for lunch. Dr. Herzog's student, Benjamin Hatscher (who created the poster for the talk), joined us for lunch. Then we headed to the room that was reserved for the talk and started the session.

Dr. Herzog briefly introduced us, our research group, and our topics for the day to the audience. He also shared his recent memories about the time he spent at ODU and about his interactions with the WSDL members. Then he left the podium for Yasmin.


Yasmin presented her talk on the topic "Using Web Archives to Enrich the Live Web Experience Through Storytelling". She noted that her work is supported in part by the IMLS. She started her introduction with a set of interesting images. She then illustrated the importance of the time aspect in storytelling and described what storytelling looks like on the Web, especially on social media. She discussed the need to select a very small but representative subset from a big pile of resources around a certain topic to tell its story. Selecting this small representative subset is a challenging but important task: it gives a brief summary as well as an entry point to dive into the story and explore the remaining resources. She gave examples of storytelling services such as Facebook Lookback, which compiles a few highlights from hundreds or thousands of someone's posts, and 1 Second Everyday. Then she moved on to the popular social media storytelling service Storify and described its issues, such as the flat representation, bookmarking rather than preservation, and resources going off-topic over time. This led her to a description of web archives, Memento, and web archiving services (mainly Archive-It). She then described the shortcomings of web archiving services when it comes to storytelling and how they can be addressed by combining web archives and storytelling services. After that, she concluded her talk by describing her approaches and policies for selecting a representative subset of resources from a collection.


I, Sawood Alam, presented my talk on the topic "Web Archiving: A Brief Introduction". I briefly introduced myself with the help of my academic footprint and my lexical signature. The "lexical signature" term let me touch on Martin Klein's work and how I used it to describe a person instead of a document. Then I followed the agenda of the talk, beginning with a description of archiving in general, the concept of web archiving, and the differences between the two.


I then briefly talked about the purpose and importance of web archiving on institutional and personal scales. Then I described various phases of and challenges involved in web archiving, such as crawling, storage, retrieval, replay, completeness, accuracy, and credibility. This gave me the opportunity to reference various WSDL members' research, such as Justin's two-tiered crawling and Scott's work on temporal violations. Then I talked about existing web and digital archiving efforts and various tools used by web archivists at different stages. The list included widely used tools such as Heritrix, OpenWayback, and TimeTravel, as well as various tools developed by WSDL members or other individual developers, such as CarbonDate, Warrick, Synchronicity, WARCreate, WAIL, Mink, MemGator, and Browsertrix. After that, I briefly described the Memento protocol and the Memento aggregator.

This led me to my IIPC-funded research on Archive Profiling. In this section of the talk I described why archive profiling is important, how it can help in Memento query routing, and what an archive profile looks like.
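
To make the routing idea concrete, here is a toy sketch in which each archive is summarized by counts of URI-key prefixes and a query is routed only to archives whose profile matches; real profiles in this work are derived from CDX data and are far richer, so the structure and names below are illustrative only.

```python
# Toy archive profiles: each archive summarizes its holdings as memento counts
# keyed by (simplified) URI-key prefixes.
profiles = {
    "archive-a": {"com,example)/": 1200, "org,example)/blog": 15},
    "archive-b": {"uk,co,bbc)/news": 50000},
}

def route_query(uri_key):
    """Return only the archives whose profile suggests they may hold the URI."""
    return [archive for archive, keys in profiles.items()
            if any(uri_key.startswith(prefix) for prefix in keys)]

print(route_query("com,example)/page.html"))  # -> ['archive-a']
print(route_query("net,unknown)/"))           # -> [] (skip all archives, saving requests)
```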

To motivate the audience toward research in the web archiving field, I discussed various related areas that offer vast research opportunities to explore.

Then I concluded my talk with an introduction to our Web Science and Digital Libraries Research Group. This was the fun part of the talk, full of pictures illustrating the lifestyle and work environment of our lab. I illustrated how we use the tables in our lab for fun traditions, such as bringing lots of food after a successful defense or spreading assignment submissions on the ping-pong table for parallel evaluation. I illustrated our effective use of the whiteboards, from the "about:blank" state to the highly busy and annotated state, and the reserved space for the "PhD Crush" board that keeps track of the progress of each WSDL member in a visual and fun way. I couldn't resist showing our origami skills, on the scale of covering an entire cubicle and every single item in it individually.




After a brief Q&A session, Dr. Herzog formally concluded the event.


From there we were all free to explore the beauty of the surrounding places, and we did so to the extent possible. We toured the historical sites of the city of Magdeburg, such as the Gothic architectural masterpiece Magdeburg Cathedral, and on our way back to Potsdam we saw the newly built Magdeburg Water Bridge, the largest canal bridge of its kind.


By the time we reached Potsdam the sun had already set, but we still managed to see a couple of the palaces in Sanssouci Park, and they looked beautiful in that light. We even managed to take a few pictures in the low light.


Dr. Herzog invited us for dinner at his place, and we had no reason or intention to say no. He was the head chef in his kitchen and prepared a delicious rice dish and white asparagus (which was a new vegetable for me). Since I like cooking, I decided to join him in his kitchen and he gladly welcomed me. I did not have any plans in advance, but after a brief look inside his fridge I decided to prepare egg hearts and a salad. During and after the dinner, Dr. Herzog described and showed pictures of many historical places in Potsdam and made us excited to visit them the next day.


The next morning we had to head back to Berlin, but we sneaked in a couple of hours to see the beauty of Sanssouci Park and the Sanssouci Palace in the bright sunlight. The long series of stairs from the front entrance of the palace leading down to the water fountain, with stepped walls on both sides covered with grape vines, was mesmerizing.


Dr. Herzog dropped us off at the train station (or Bahnhof in German), from where we took the train to Berlin. We had almost a day to explore Berlin, and we did so to the extent possible. It is an amazing city, full of historical masterpieces and state-of-the-art architecture. At one point we got stuck in a public demonstration and couldn't use any transport due to the blocked roads, although we had no idea what the demonstration was for.


Later in the evening, Dr. Herzog came to Berlin to pick up his wife from the Komische Oper Berlin, where she was performing in an opera, and we got a chance to look inside this beautiful place. This way we got a few more hours of a guided tour of Berlin, and we had dinner in an Italian restaurant.


It was a fun trip to explore three beautiful cities of Germany immediately after exploring the beautiful and colorful city of Poznan, Poland. We couldn't have imagined anything better. I published seven photo spheres of various churches and palaces on Google Maps during this trip and came back with an album full of pictures.

On behalf of my university, department, research group, and myself, I would like to extend my sincere thanks and regards to Dr. Herzog for his invitation, warm welcome, hosting, and the time he spent showing us the best of Magdeburg, Potsdam, and Berlin during our stay in Germany. He is a fantastic host and tour guide. Now, tuning back to the send-off conversation among the three.

Sawood: Yasmin, now you know why Dr. Herzog said, "who doesn't" when you said, "I love beautiful things".
Yasmin: [Smiles]
Dr. Herzog: [Smiles]

--
Sawood Alam

2015-10-07: IMLS and NSF fund web archive research for WS-DL

In the spring and summer of 2015, the Web Science and Digital Libraries (WS-DL) group received a total of $950k of funding from the IMLS and the NSF to study various aspects of web archiving.  Although previously announced on twitter (IMLS: 2015-03-31 & NSF: 2015-08-25), here we provide greater context for how these awards support our vision for the future of web archiving*.

Our IMLS proposal is titled "Combining Social Media Storytelling With Web Archives" and a PDF of the full proposal is available directly from the IMLS.  This proposal is joint with our partners at Archive-It and is informed by our experiences in several areas, such as:
Our most illuminating insight (somewhat obvious in retrospect) is to not try to include all of the collection's holdings in its summarization, but to only surface the exemplary components sufficient to distinguish one collection from the next.  One example we frequently use is "how do we distinguish the many `human rights' collections available in Archive-It?"  They all have different perspectives, but they can be difficult to navigate for those without detailed knowledge of the seed URIs and the collection development policy. 

The IMLS proposal will investigate two main thrusts:
  1. Selecting a small number (e.g., 20) of exemplary pages from a collection (often 100s of archived copies of 1000s of web pages) and loading them in an existing tool such as Storify as a summarization interface (instead of custom & unfamiliar interfaces).  Yasmin AlNoamany has some exciting preliminary work in this area; for example see her TPDL 2015 paper examining what makes a "good" story on Storify, and her presentation "Using Web Archives to Enrich the Live Web Experience Through Storytelling".
  2. Using existing stories to generate seed URIs for collections.  One problem for human-generated web archive collections is that they depend on the domain knowledge of curators.   For example, the image above shows two Storify stories about early riots in Kiev (aka Kyiv) which predated much of the exposure in Western media and then the subsequent escalation of the crisis.  The collection at Archive-It was not begun until the annexation of the Crimea was imminent, possibly missing the URIs that document the early stages of this developing story.  Our idea is to mine social media, especially stories, for semi-automated, early creation of web archive collections. 
The NSF proposal is titled "Increasing the Value of Existing Web Archives" and represents a shift in how we think about web archiving.  One point we've made for a while now (for example, see our 2014 presentation "Assessing the Quality of Web Archives") is that we must shift our current focus from simply piling up bits in the archive to more nuanced questions of how to make the archives more immediately useful (as opposed to just insurance against future loss) and how to assess & meaningfully convey the quality of an archived page.  This proposal will have three main research thrusts:
  1. Inspired by Martin Klein's PhD research and Hugo Huurdeman et al.'s "Finding Pages on the Unarchived Web" from JCDL 2014, we would like to see archives provide recommendations of related pages in the archive, as well as suggested "replacements" for pages that are not archived.  Web archives now just return a "yes" (200) or "no" (404) when you query for a URI -- they should be able to provide more detailed answers based on their holdings.
  2. We'd like to further investigate the various issues of how well a page is archived.  We have some preliminary work from Justin Brunelle for automatically assessing the impact of missing embedded resources (typically stylesheets and images), as well as from Scott Ainsworth on detecting temporal violations -- combinations of HTML and images that never occurred on the live web (see "Only One Out of Five Archived Web Pages Existed as Presented" from HT 2015).  
  3. Related to #2, we need to find a better way to visualize the temporal & archival makeup of replayed pages.  For example, the LANL Time Travel service does a nice job of showing the various archives that contribute resources to a reconstruction, but questions remain about scale as well as describing temporal violations and their likely semantic impact.  Similarly, we'd like to investigate how to convey the request environment that generated the representation you're viewing now (see our 2013 D-Lib paper "A Method for Identifying Personalized Representations in Web Archives" for preliminary ideas on linking various geoip, mobile vs. desktop, and other related representations). 
We have been very fortunate with respect to funding in 2015 and we look forward to continued progress on the research thrusts outlined above.  We'd like to thank everyone that made these awards possible.  We welcome any feedback or interest on these (and other) projects as we progress.  Watch this blog and @WebSciDL for continued research updates.

--Michael


* = See also our 2014 award for $324k from the NEH for the study of personal web archiving and our 2014 award for $49k from the IIPC for profiling web archives for a more complete picture of our research vision for web archives.

2015-10-21: Grace Hopper Celebration of Women in Computing (GHC) 2015


On October 13-17, the atmosphere at the George R. Brown Convention Center in Houston, Texas was electric, with 12,000 women in tech from all around the world attending the Grace Hopper Celebration of Women in Computing (GHC), the world's largest gathering of women in computing. GHC was co-founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together the research and career interests of women in computing and to encourage the participation of women in computing, and it is presented by the Anita Borg Institute (ABI) for Women and Technology. GHC has grown incredibly, from 500 women in technology in 1994 to 12,000 women this year.

I was humbled to receive a scholarship from the ABI to attend GHC 2015. I was also thrilled to have attended GHC 2013 in Minnesota and GHC 2014 in Phoenix. This year, I represented the Computer Science Department at Old Dominion University; the ArabWIC organization, as a member of the leadership committee and as a mentor in the academic mentoring sessions; and the ABI organization, for which I volunteered by blogging and taking notes from GHC. You can visit the Grace Hopper Celebration 2015 wiki page to read more about the session notes.

The conference was filled with an exciting lineup of inspiring speakers, panels, sessions, and workshops. There were multiple technical tracks: career, emerging tech, general sessions, open source, organizational transformation, and technology (e.g., data science, artificial intelligence, HCI, security, software engineering). Conference presenters represented many different fields, such as academia, industry, and government. The non-profit organization Computing Research Association Committee on Women in Computing (CRA-W) also offered sessions targeted toward academics and business. I had a chance to attend the Graduate Cohort Workshop in 2013, which was held in Boston, MA, and wrote a blog post about it.

The first day was kicked off by the amazing and inspiring Telle Whitney, the president and CEO of the ABI, welcoming the audience. Whitney gave the audience a piece of advice: "talk to almost anyone you pass by in the conference and introduce yourself. It is your time to learn, to join new communities, to reach out to people, and to offer advice. It is our time to lead". She introduced the featured keynote speakers of the three days of the conference: Susan Wojcicki (the CEO of YouTube), Megan Smith (the first female CTO of the United States), Sheryl Sandberg (the COO of Facebook), Manuela M. Veloso (professor in the Computer Science Department at Carnegie Mellon University), Clara Shih (CEO and founder of Hearsay Social), and Hilary Mason (founder of Fast Forward Labs). At the end, Whitney introduced Alex Wolf, the President of the Association for Computing Machinery (ACM) and a professor in the Department of Computing at Imperial College London, UK, for opening remarks.

As the day progressed, the Open Source Day sessions and presentations were taking place. Open Source Day: Code-a-thon for Humanity gives women from around the world the chance to learn how to contribute to the open source community, regardless of their skill or experience level, by developing a variety of humanitarian projects. The Open Source Day 2015 page contains more details about the projects.

The Wednesday Keynote by Hilary Mason: "This is the best room in the world!!" is how Mason started her keynote, which was about machine intelligence research. Mason introduced herself as a data scientist, CEO, and software engineer, and followed up with "I look like all of those". She talked about the importance of data and mentioned that data products are everywhere. She gave examples of apps that use machine intelligence research: Foursquare, an app from a New York City company that collects data and, based on that data, recommends places to go around a user's current location, and the Dark Sky app, which predicts when it will rain or snow. Dark Sky was built on top of government weather data. It may not be interesting for a Californian, but it is interesting for the rest of us :-).

Mason talked about how she became passionate about data science. She defined a data scientist as a professional role that combines multiple capabilities: math and statistics, coding ability to build infrastructure, domain knowledge, and the communication skills needed to talk to someone who has a problem. A data scientist works on analytical problems. She said technology is changing rapidly, and people's adoption of technology is growing even faster. One of the interesting parts of her talk was about predicting the future. She said, "predicting the future is hard", then showed a picture of people from the past imagining the future.

In the second part of the talk, she talked about her company, Fast Forward Labs, which started in 2014 to introduce a new method for applied research. They focus on innovation opportunities through data and algorithms. Fast Forward Labs sits in the middle of three communities: established companies, startups, and academic research. What makes a machine intelligence technology interesting?
  1. A theoretical breakthrough 
  2. A change of economics 
  3. A capability becoming a commodity (e.g., Hadoop) 
  4. New data becoming available (e.g., Wikipedia), or existing data being made useful 
Mason ended her talk by thanking everyone who helped her, then she gave the audience a piece of advice: "If you are at the beginning of your career and you are thinking of where you might end up, you need to know that my first GHC was in 2002, and I was a shy, quiet student who mostly sat in the back of every talk and was too shy to ask a question. But it is amazing to be in this room today with so many people who have affected my career".

At the end of the keynote, the 2015 Technical Leadership ABIE Award was given to Lydia E. Kavraki, the Noah Harding Professor of Computer Science and professor of Bioengineering at Rice University.

After the keynote, I attended the "CRA-W Early Career: The Tenure Process" session by Julia Hirschberg from Columbia University and Joan Francioni from Winona State University. The session covered the tenure process, i.e., research, teaching, service, expectations of the department, annual reviews, letter writers, and the typical process. The speakers gave advice and tips on understanding the requirements and expectations of your institution, such as having an overall teaching plan and goals and not grading too hard or too easy. They also gave tips regarding collaboration: a successful collaboration is a multiplier, since you can achieve more than you can on your own, while an unsuccessful collaboration can be a negative multiplier that wastes time, is stressful, and creates hard feelings.

The panel of Global Women Technical Leaders Program 
Next, I attended the "Global Women Technical Leaders Program: After the Grace Hopper Celebration: Building and Sustaining Community" panel in the career track. The panelists were Josephine Ndambuki of Safaricom Ltd, Rosario Robinson of the Anita Borg Institute, Alaa Fatayer of Jawwal, and Sana Odeh of NYU and ArabWIC, and the panel was moderated by Arezoo Miot of TechWomen. Sana introduced the panel and thanked ABI and the panelists for their support and for growing the women-in-tech communities. Rosario talked about her journey; she said she was the only woman in mathematics. The panel discussed the essentials of building a community to support women in technology. Alaa talked about her experience, starting with TechWomen and building a community in Palestine. Some of the questions addressed were: How and where do you start in creating a community? What programs are out there to support technical women? How do we overcome obstacles in creating local communities? How can we develop allies in our communities? At the end, Rosario gave the following advice: "be clear about what you are and what you do".

A panel by directors from Apple in the scholars lunch 
The scholars lunch: Suzanne Mathew, an Assistant Professor of Computer Science at the United States Military Academy, introduced a panel of amazing women from Apple: Esther Hare, a Director in the Worldwide Developer Marketing team, and Maryam Najafi, a Director of UX, moderated by Karen Sipprell, the VP Marcom at Portal Software. The scholars lunch was sponsored by Apple; scholars gathered for discussions at tables, with each table hosting one woman who has a role at Apple. The number of scholars at GHC15 was 500 out of 2,000 applicants, as Professor Nancy Amato from Texas A&M University announced. The panel of senior women from Apple covered many interesting experiences from each of them. Here is some advice from the panelists:
  • Be around as much as you can; the more you get around, the more opportunities you will find. 
  • Find your passion, so you can solve problems. 
  • Go out and solve problems that freak you out. 
The lunch ended with the fun part, seven Apple watches for seven lucky women who found animal stickers on the bottom of their chairs!

After lunch I had to work on some stuff for the ABI blogging and social media activities. I also communicated with many amazing women during the conference.

The Wednesday Afternoon Plenary: We had three TED-style talks on "Transforming the Culture of Tech" by Clara Shih, the CEO and Founder of Hearsay Social; Blake Irving, the CEO and Board Director of GoDaddy; and the amazing Megan Smith, the Chief Technology Officer (CTO) of the United States of America.

The afternoon plenary speakers
Clara Shih mentioned that she attended GHC for the first time in 2004, when there were 800 attendees. She told the audience about her journey over the past decade, from student to software engineer, then project manager, to CEO and Founder of Hearsay Social. Shih shared with the audience the lessons she learned along the way: 1) listen carefully; 2) be OK with being different; 3) cherish relationships above all and help other women; 4) there is no failure, only learning; 5) the future is on us, because if not us, then who? And when, if not now? "If people had just sat back 11 years ago, GHC would not be 12,000 today!" Every time we decide to lift a woman up, we lift all women up.



Blake Irving talked about how he has been closing the gender gap at the company since he took over as CEO two years ago, and mentioned the solid progress in the ratio of women at GoDaddy. Since last year's GHC, GoDaddy has more than doubled the number of women interns and graduate hires. Blake talked about pay equality and showed many graphs based on GoDaddy's data. "If you are a leader of a tech company, be vulnerable again and again. Do not hide your problems. Go public with your diversity statistics, publish your salaries. Seek change from the top and bottom. Do the research, find your issues. Surround yourself with people that will challenge you," Blake said, "bad things live in the dark, bad things die in the light."

Megan Smith with the President's tech team showing
 the Declaration of Sentiments
"It is great to be back to my people!", this is how the amazing Megan Smith started her talk. Before mentioning the highlight of Megan Smith talk, I would like to highlight her amazing job during the conference to encourage and inspire the attendees by talking to them by herself. This lovely inspiring woman passed by the community booths at the career fair and allow people to talk to her personally and take pictures with her. She also was creative in showing some of the federal tech projects nowadays and bringing many ladies in tech from the president team. At the beginning, Smith talked about her new a role as a CTO of the USA, in which she serves as assistant to the President through advising him and his team on how technology policy, data and innovation can advance our future as a nation. She described the people in the federal government as so passionate, mission driven, and extraordinary.

GHC archives that were found the previous Thanksgiving 
Smith mentioned that they found GHC archives the previous Thanksgiving. She talked about many projects they are working on, such as Innovation Nation, Active STEM Learning, and the Police Data Initiative. She described the President as "an incredible leader, so smart, so technical, a science and tech president, and he opens the doors for us to innovate". Smith introduced many amazing young women from the President's tech team, who talked about their different roles in serving the nation.

At the end, Smith talked about the Declaration of Sentiments, a document signed by 100 people in 1848 (68 women and 32 men) at the first women's rights convention to be organized by women, in Seneca Falls, New York. The document is missing, and they are looking for it with many archivists using the hashtag #FindTheSentiments.

There was a short discussion at the end with the three speakers about why changing is hard and what strategies are working for them.

In the meantime, the career fair and the community fair were running. At the career fair, many famous companies, such as Google, Thomson Reuters, Facebook, Microsoft, and IBM, were there to hire as many talented women in tech as they could. The community fair is a dedicated space within the Expo for attendees to interact with GHC communities, such as BlackWIC and ArabWIC. The ABI booth was at the center of the community fair, where I met the amazing Telle Whitney and talked to her many times. The career fair was the place for anyone who wanted to apply for job opportunities at all levels across industry and academia. Each company at the career fair had many representatives to discuss the different opportunities they have for women, and the companies were very creative in advertising themselves. A few men also attended the conference.

Megan Smith at ArabWIC booth in the community fair 

The amazing Megan Smith passed by the ABI community booths and stopped by the ArabWIC booth. We had a great chance to talk to her personally and take a close look at the Declaration of Sentiments. She left us with encouragement and inspiration for leading communities and attracting more women into tech!

At the end of the first day, I attended the ArabWIC reception, which was sponsored by the Qatar Computing Research Institute (QCRI). We met many new Arab women in computing, and non-Arab women as well. We exchanged our bios and described how each of us is contributing to serving women in technology.
The Thursday Keynote had two speakers: Susan Wojcicki, the CEO of YouTube, and Hadi Partovi, the CEO and co-founder of Code.org. "I’m feeling that I’m really the talking guy in the room," Hadi Partovi said at the beginning of his talk. He shared with the audience the personal story that changed his life: his dad brought home a computer that did not have any games on it, and a book for Hadi to learn from so he could write his own games. He talked about Hour of Code, a non-profit, bootstrapped project that started in 2013 to expand access to computer science in schools. Code.org has support from both Democrats and Republicans, and from many celebrities (e.g., President Obama, Bill Gates, Mark Zuckerberg). Code.org has trained 15,000 teachers to teach computer science this year, reaching 600K students (43% female).

The Hour of Code
Partovi insisted that his main goal for Code.org is not teaching kids how to code; it is teaching kids computer science. He claimed that CS education is recovering after many years of decline, but that there is still a problem in CS. He also mentioned that about 9 out of 10 parents want their children to learn CS. I have already started Code.org with my 7-year-old, and he was so excited to write his first code :-). Partovi claimed that the gender gap starts in K-12: "Almost 70% of high school kids do not have access to the computer science field. When kids go to school, every kid learns how electricity works or the basic math equations. In the 21st century, it is equally foundational to learn how algorithms work or how the internet works". Partovi continued that "the school system can evolve to teach kids computer science. Over 70 school districts have embraced CS, including New York, Houston, Chicago, etc.". Regarding diversity, Partovi asked whether we can change the stereotype without changing the facts on the ground. He commented that the way to change the stereotype is the Hour of Code, which now has 300 partners and 150,000 teachers across 196 countries. At the end, Partovi asked the audience to help get more volunteers. To encourage people to get involved, Microsoft and Amazon will give away gift cards to any teacher who organizes an Hour of Code.

After Partovi's talk, the 2015 Grace Hopper Celebration Change Agent ABIE Award winners, Maria Celeste Medina from Kenya and Mai Abualkas Temraz from Palestine, were announced. The award winners gave short, inspiring talks about their journeys leading women in technology and how they started.
Susan Wojcicki described the conference as a lifeline where women come together, learn, feel supported, are computer scientists, and can be themselves. She started her speech with a story about her daughter, who told her she hated computers, even though she had been going to Google since she was born. Susan talked about the serious impact of leaving girls out of the conversation when it comes to technology. "Girls think that technology is insular and anti-social. By 2020, jobs in computer science are expected to grow nearly two times faster than the national average, totaling nearly 5 million jobs. Technology is revolutionizing almost every part of our lives. Every car today has more computing technology than Apollo 11, which first landed on the moon. Yet, today women hold only 26% of all tech jobs. The fact that women represent a small portion of the tech workforce is not just a wake-up call, it is a 'Sputnik' moment. It risks future competitiveness," Susan said. "If women don't participate in tech, with its massive prominence in our lives and society, we risk losing many of the economic, political and social gains we have made over decades." Susan continued that female representation in tech is a problem and it is getting worse; the representation of women in tech was better in the 80s. Susan Wojcicki shared an exclusive teaser of the Codegirl movie, directed by Lesley Chilcott, the Oscar-winning film producer.

She talked about the balance between family and work. She had her baby five months after she joined Google. The constraints of family (for example, how tough it is for kids to be the last ones picked up from day care) enabled her to develop a work style that focuses on efficiency, productivity, and prioritization, and on getting that done during office hours. She mentioned a Harvard study showing that employees who take breaks from work have a higher level of focus compared to those who do not. Furthermore, employees who feel encouraged by their bosses to take breaks are 100% more loyal to their employers.

Susan Wojcicki was the first person to take maternity leave at Google, and she is the only person to have taken five maternity leaves there. Interestingly, each leave enriched her life, left her with peace of mind, and gave her a chance to reflect on her career. A generous maternity leave increases retention; when women are given short maternity leave and are under pressure to stay on call, they quit. When Google increased its paid family leave from 12 to 18 weeks, the rate at which mothers quit fell by 50%. 88% of women in the USA are not given family leave. Susan said, "men don't get asked how they balance it all". Susan's daughter now loves computer science. Susan enrolled her in a computer camp for girls; afterward, her daughter sketched a computer watch that held her friends' contacts and info, before Samsung and Apple came up with their watches.

At the end, Susan insisted that we have to make it our personal responsibility to show the next generation of girls that they belong to the world of computer science.

Advice from Susan:
  • We need to give everyone a chance to understand computer science. 
  • Make computer science available to everyone in the USA by making it mandatory. 
  • Focus on working smart. Work smart, work hard. Do a great job, but then GO home.
  • Keep asking, look out for yourself, be an advocate and do not feel guilty about it! 
  • For tech companies, you need to help employees to find balance between work and family.
  • Tech companies need to pay generous maternity leave. 
  • A step back helps sometimes.
  • If you work for a company where you feel you cannot work a balanced day and the maternity leave is bad, I recommend that you leave and search for a supportive company; and by the way, we are hiring! 


The Thursday Afternoon Plenary: This plenary was a conversation between Sheryl Sandberg, Facebook COO and author of the best-selling book Lean In, and Nora Denzel, Board Director of Ericsson, AMD, and Outerwall (makers of Redbox, Coinstar, and ecoATM), about what it means to be an effective leader and why it is so important to have women at the table to create technology. Sheryl shared her story about being a keynote speaker at GHC. The conversation covered gender diversity in technology and the pay gap. Sandberg asked the audience to negotiate for pay equality. She talked about the Lean In book and Lean In circles and how important mentoring is, and she advised the audience to join Lean In circles. Sandberg said, "Starting a Lean In circle is a great leadership opportunity". To read more about the conversation, here is a nice article:
Sandberg: Tech offers the best jobs, needs more women voices, and women need to stick with it


I attended the "Change Agent and Social Impact Awards" session with the ABI award winners: Michal Segalov of Mind the Gap, Maria Celeste Medina of Ada IT, Daniel Raijman of Mind the Gap, and Mai Abualkas Temraz of Gaza Sky Geeks.

The moderator had a conversation with the ABI award winners to draw out their stories. The winners talked about the turning points in their lives, the challenges they faced, and what continues to motivate them to make a difference.
Daniel said they started Mind The Gap eight years ago to expose girls to computer science. The program has since expanded globally and, now in its 8th year, has reached more than 10,000 participants to date.

Michal said that they cared most about making Mind The Gap scalable. Mind The Gap lets people choose how to give or volunteer; for example, some people provide tech classes while others give talks. They had about 100 volunteers, and each volunteer gives only one hour of their time per month, which makes it easy for people and encourages them to volunteer. Michal's advice was to be open to changing things, yourself, and your passion.

María mentioned that her mom encouraged and supported her the most. In one year, working with the Programá Tu Futuro team, Maria has introduced more than 6,000 people to coding: kids, teenagers, adults, and senior citizens (of whom 30% are women). She said that there are also studies on how to empower women.

Mai from Gaza spoke over Skype because she could not attend for political reasons. Mai was asked for some fun facts, but she said that she was not in a good mood because she could not make it to the conference, which made it hard to mention fun facts. In 2014, she became a TechWomen Emerging Leader. She encouraged everyone to help and support them, and also to keep inviting them, so maybe one day they will be able to attend. Mai said they face a lot of challenges in Gaza, but she likes to call them opportunities to learn and to get better at solving the problems they face. At the end, Mai said, "I’m kept motivated by events like this, where I’m exposed to the global women’s tech community. My goal here is to bring back as much of your energy as I can to Gaza. You can come mentor in Gaza." She mentioned several people who have gone to Gaza to mentor, such as Angie Chang, the founder of Women 2.0, Dave McClure, the founder of 500 Startups, and many others. "Don’t worry, it’s safe," Mai said, "or you can mentor women in Gaza remotely." Mai is a member of ArabWIC as well.

Speed mentoring sessions took place over lunch on Thursday and Friday. I joined mentoring discussions about academic careers. It was useful to hear from many senior women in academia about their career journeys and to hear some questions about applying for academic positions.

At the career fair, I was lucky to meet Sinead Borgersen, a Principal HR Business Partner at CA Technologies and Dr. Michele Weigle's friend. We had a quick discussion about careers at CA Technologies and how they would fit with my interests. Sinead is an amazing lady who is full of enthusiasm.

The Friday Keynote: Friday morning started with a cool technical keynote on "Robotics as a Part of Society" by Manuela Veloso, Herbert A. Simon University Professor in the Computer Science Department at Carnegie Mellon University. Manuela has become well-known in the AI community as the guiding force behind robot soccer. In her keynote, Manuela highlighted different perspectives on robots in a collaborative network of robots and humans. Manuela talked about CoBots, the robots she and her students created to help them with simple tasks in their offices and labs. These robots can use the internet or send emails to ask for help. She showed that autonomous robots learn from interacting with humans. "Technology is about diversity," Manuela said. "You don’t have to do everything, but some do things that others can’t."




At the end of the keynote, there were announcements about Grace Hopper 2016. GHC 2016 will take place in Houston, Texas. The general program co-chairs for GHC 2016 will be Kaoutar El Maghraoui, from IBM Research and ArabWIC, and Maria Gini, from the University of Minnesota. I spent most of Friday at the career fair, then I attended the mentoring session at the ArabWIC lunch table and met many women in computing from different fields.

The Friday Afternoon Plenary: The day wrapped up with an afternoon plenary session focused on the importance of diversity in technology by Janet George, Chief Data Scientist for Big Data/Data Science and Cognitive Computing at SanDisk; Isis Anchalee, Platform Engineer at OneLogin; and Miral Kotb, Director, Producer, Choreographer, and Playwright for iLuminate.


I couldn’t attend the afternoon keynote, but I heard from many friends about iLuminate, a wearable lighting system that enables novel dance performances, one of which the audience was treated to at the end of the conference. For more about the afternoon plenary, here are nice wrap-ups of the three talks:
GHC 2015 ended with busting a move on the dance floor in a night to remember at Minute Maid Park. There were many photo booths, t-shirts, glow sticks, and desserts. It is a Grace Hopper Celebration, after all!

It was fascinating to be at GHC 2015 to hear from the most talented and inspiring women in technology and to get advice from them, as well as to spend a great time with many awesome women and come back with many friends who support each other. I was also glad to be involved in many activities this year for the ABI community and ArabWIC.


---Yasmin