2016-09-27: Introducing Web Archiving in the Summer Workshop

September 27, 2016, 2:15 pm

≫ Next: 2016-10-03: Summary of “Finding Pages on the Unarchived Web"

≪ Previous: 2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

For the last few years the Department of Computer Science at Old Dominion University invites a group of undergrad students from India and hosts them in the summer. They work closely with a research group on some relevant projects. Additionally, researchers from deferment research groups in the departments present their work to the guest students twice a week and introduce various different projects that they are working on. The goal of this practice is to allow them to collaborate with graduate students of the department and to encourage them for research studies. The invited students also act as ambassadors to share their experience with their colleagues and spread the word out when they go back to India.

This year a group of 16 students from Acharya Institute of Technology and B.N.M. Institute of Technology visited Old Dominion University, they were hosted under the supervision of Ajay Gupta. They worked in the areas of Sensor Networks and Mobil Application Development. They researched ways to integrate mobile devices with low-cost sensors to solve problems in health care-related areas and vehicular networks.

I (Sawood Alam) was selected to represent our Web Science and Digital Libraries Research Group this year on July 28. Mat and Hany represented the group in the past. I happened to be the last presenter before they return back to India, by the time they were overloaded with scholarly information. Additionally, the students were not primarily from the Web science or digital libraries background. So, I decided to keep my talk semi-formal and engaging rather than purely scientific. The slides were inspired from my last year's talk in Germany on "Web Archiving: A Brief Introduction".

Introducing Web Archiving and WSDL Research Group by Sawood Alam

I began with my presentation slides entitled, "Introducing Web Archiving and WSDL Research Group". I briefly introduced myself with the help of my academic footprint and the lexical signature. I described the agenda of the talk and established the motivation for Web archiving. From there, I followed the talk agenda as laid out before, covering topics like issues and challenges in Web archiving, existing tools, services, and research efforts, my own research work about Web archive profiling, and some open research topics in the field of Web archiving. Then I introduced the WSDL research Group along with all the fun things we do in the lab. Being an Indian, I was able to pull in some cultural references from India to keep the audience engaged and entertained while still being on the agenda of the talk.

I heard encouraging words from Ajay Gupta, Ariel Sturtevant, and some of the invited students after my talk as they acknowledged it being one of the most engaging talks during the entire summer workshop. I would like to thank all who were involved in organizing this summer workshop and gave me the opportunity to introduce my field of interest and the WSDL research group.

--
Sawood Alam

↧

2016-10-03: Summary of “Finding Pages on the Unarchived Web"

October 3, 2016, 9:00 am

≫ Next: 2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

≪ Previous: 2016-09-27: Introducing Web Archiving in the Summer Workshop

by: Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, and Arjen P. de Vries
Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries 2014

In this paper, the authors detailed their approach to recover the unarchived Web based on links and anchors of crawled pages. The data used was from the Dutch 2012 Web archive at the National Library of the Netherlands (KB), totaling about 38 million webpages. The collection was selected by the library based on categories related to Dutch history, social and cultural heritage. Each website is categorized using UNESCO code. The authors try to address three research questions: Can we recover a significant fraction of unarchived pages?, How rich are the representations for the unarchived pages?, and Are these representations rich enough to characterize the content?

The link extraction used Hadoop MapReduce and Apache Pig to process all archived webpages and used JSoup to extract links from their content. A second MapReduce job was to index the URLs and check if they are archived or not. Then the data was deduplicated based on the value of year, anchor text, source, target, and hashcode (MD5). In addition basic cleaning and processing was performed on the data set. The resulting number of pages in the dataset was 11 million webpages. Both external links (inter-server links) which are links between different servers, and site internal links (intra-server links) which occur within a server were included in the data set. Apache Pig script was used to aggregate the extracted links to different element such as TLD, domains, hosts, and file type.

The processed file list is as following:
(sourceURL, sourceUnesco, sourceInSeedProperty, targetURL, targetUnesco, targetInSeedProperty, anchorText, crawlDate, targetInArchiveProperty, sourceHash).

There are four main classification of URLs found in this data set, shown in Figure 1:
1-Intentionally archived URLs in the seed list, which is 92% of the dataset (10.1M).
2-Unintentionally archived URLs due to crawler configuration, which is 8% of the dataset (0.8M).
3-Inner Aura: unarchived URLs which the parent domain is included in the seed list (5.5M), (20% depth 4, because 94% are links to the site).
4-Outer Aura: unarchived URLs which do not have a parent domain that is on the seed list (5.2M), (29.7% depth 2).

In this work, the Aura is defined as Web documents which are not included in the archived collection but are known to have existed through references to those unarchived Web documents in the archived pages.

They analyzed the four classification and checked unique hosts, domain, and TLD. They found that unintentionally archived URLs have higher percentage of unique hosts, domain, and TLD compared to intentionally archived URLs. And that outer Aura have higher percentage of unique hosts, domain and TLD compared to inner Aura.

When checking the Aura they found that most of the unarchived Aura points to textual web content. The inner Aura mostly had a (.nl) top level domain (95.7%) and the outer Aura had 34.7% (.com) TLD, 31.1% (.nl) TLD, and 18% (.jp) TLD. They high percentage of Japanese TLD is that they unintentionally archived those pages. Also, they analyzed the indegree of the Aura where all target representations in the outer Aura have at least one source link, 18% have at least 3 links, and 10% have 5 links or more. In addition, the Aura was compared by the number of intra-server links and the inter-server links, the inner Aura had 94.4% intra-server links. On the other hand the outer Aura had 59.7% of inter-server links.

The number of unique anchor text words for both inner and outer Aura was almost similar, 95% had at least one word describing them, 30% have at least three words, and 3% have 10 words or more.

To test the theory of finding missing unarchived Web pages, they took a random 300 websites where 150 are homepages and 150 are non-homepages. They made sure the websites selected are either live or archived. They found that 46.7% of the targets page were found within the top 10 SERP using anchor text. However for non-homepage 46% were found using texts obtained from the URLs. By combining anchor text and URL word evidence both homepage and non-homepage had a high percentage, 64% of the homepages, and 55.3% of the deeper pages can be retrieved. Another random sample of URLs was selected to check the anchor text and words from the link, and they found homepages can be represented with anchor text, on the other hand non-homepages are better represented with both anchor text and words from the link.

They found that the archived pages show evidence of a large number of unarchived pages and websites. They also found that only a few homepage webpages have rich representations. Finally, they found that even with a few words to describe a missing webpage they can be found within the first rank. Future work include adding further information such as surrounding text and advance retrieve models.

Resources:

JCDL 2014 paper.
A longer detailed version of the paper IJDL 2015, "Lost but not forgotten: finding pages on the unarchived web".
SlideShare JCDL2014.
JCDL2014 trip report.
Summary image.

-Lulwah M. Alkwai

↧

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

October 3, 2016, 10:50 am

≫ Next: 2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

≪ Previous: 2016-10-03: Summary of “Finding Pages on the Unarchived Web"

"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)

The first presidential debate (H. Clinton v. D. Trump) took place on last Monday, September 26, 2016 at Hofstra University, New York. The questions were about topics like economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, they associated a topic with a particular place or a US state (e.g., shootings in Chicago, Illinois, and crime rate in New York). This reminded me about the work that we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with the "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites are maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or other newspapers like www.washingtonpost.com and www.nytimes.com. They were crawled on the days around the election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., Internet Archive's Heritrix writes what it finds on the Web in ARC files of 100MB each). ARC is the predecessor format to the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Apache Spark's Scala scripts to be able to iterate over all ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content as shown below (we hid the content of web pages due to copyright issues). Results were collected into a single TSV file.

[2] Extract named entities and topics

We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:

NLTK to tokenize text
Stemming and removing stop words (involving TF-IDF weighting)
Gensim and Latent Dirichlet Allocation for topic modeling

After applying the above techniques, the results were aggregated in a text file which will be used as input to the visualization tool (described in step [3]). A part of the results are shown in the table below.

State	Candidate	Frequency of mentioning the state	The most important topic
Mississippi	Kerry	85	Iraq
Mississippi	Bush	131	Energy
Oklahoma	Kerry	65	Jobs
Oklahoma	Bush	85	Retirement
Delaware	Kerry	53	Colleges
Delaware	Bush	2	Other
Minnesota	Kerry	155	Jobs
Minnesota	Bush	303	Colleges
Illinois	Kerry	86	Iraq
Illinois	Bush	131	Health
Georgia	Kerry	101	Energy
Georgia	Bush	388	Tax
Arkansas	Kerry	66	Iraq
Arkansas	Bush	42	Colleges
New Mexico	Kerry	157	Jobs
New Mexico	Bush	384	Tax
Indiana	Kerry	132	Tax
Indiana	Bush	43	Colleges
Maryland	Kerry	94	Jobs
Maryland	Bush	213	Energy
Louisiana	Kerry	60	Iraq
Louisiana	Bush	262	Tax
Texas	Kerry	195	Terrorism
Texas	Bush	1108	Tax
Tennessee	Kerry	69	Tax
Tennessee	Bush	134	Teacher
Arizona	Kerry	77	Iraq
Arizona	Bush	369	Jobs

...

[3] Interactive US map

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic) while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide more information manually like the winning party for each state. In addition, we inserted different locations, latitude and longitude, to locate the bubbles on the map (two circles for each state). By hovering over the bubbles, the most important topic mentioned by the candidate will be shown. If you are interested to interact with the map, visit (http://www.cs.odu.edu/~maturban/hackathon/).

So exciting to see use of our LC web archives in use at the #hackarchives ! So inspiring! pic.twitter.com/DXksowhrn2
— Abbie Grotke (@agrotke) June 15, 2016

Looking at the map might help us answer the research questions, but it might raise other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered as "red" states? On the other hand, it is clear that they paid more attention to other "swing" states like Colorado and Florida. Finally, I would say that it might be useful to introduce this topic at this time as we are close to the next 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could apply again to see what newspapers say about this event.

--Mohamed Aturban

↧

2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

October 22, 2016, 3:30 pm

≫ Next: 2016-10-23: Institutional Repositories, OAI-PMH, and Anonymous FTP

≪ Previous: 2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

Dodging the Memory Hole 2016, held at UCLA's Charles Young Research Library in Los Angeles California, was a two-day event to discuss and highlight potential solutions to the issue of preserving born-digital news. Organized by Edward McCain (digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries) this event brought together technologists, archivists, librarians, journalists and fourteen graduate students who had won travel scholarships for attendance. Among the attendees were four members of the WS-DL group (l-r): Mat Kelly, John Berlin, Dr. Michael Nelson, and Shawn Jones.

The event was made possible by support from the Reynolds Journalism Institute, Journalism Digital News Archive (JDNA), UCLA Library, the Educopia Institute and theInstitute of Museum and Library Services (IMLS).

Day 1 (October 13, 2016)

Day one started off at 9am with Edward McCain welcoming everyone to the event and then turning it over to Ginny Steel, UCLA University Librarian, for opening remarks.

@RJIJDNA @UCLA @vsteel Saving Online #News #Legal #Technical #Policy #dtmh2016 #digitalmemory #freeexpression #historicalrecord #infoaccess pic.twitter.com/D5WFalhrWR
- Sharon E. Farb (@FarbThink) October 13, 2016

In the opening remarks, Ginny reflected on her career as a lifelong librarian, the evolution of printed news to digital and in closing she summarized the role archiving has to play in the digital-born news era.

The challenge to #dtmh2016 is to develop a framework to preserve the news. @vsteel
- Todd Grappone (@liber8er) October 13, 2016

After opening remarks, Edward McCain went over the goals and sponsors of the event before transitioning to the first speaker Hjalmar Gislason.

Hjalmar Gislason's talk was entitled "Digital Salvage Operations: What's worth saving?"

#DtMH2016 @hjalli: The fundamental question: Do you want to save everything or do you want to get rid of everything?
- ChrisAldrich (@ChrisAldrich) October 13, 2016

In the talk, Hjalmar touched on issues concerning the amount of data currently being generated, how to determine context about data and the importance of if and that data lost due to not knowing if it is important could mean losing someone's life work. Hjalmar ended his talk with two takeaway points: "There is more to news archiving than the web: there is mobile content" and "Television news is also content that is important to save".

@hjalli Keynote #dtmh2016 #Digital #Salvage What #News Worth Saving? Not enough to save #stories. #Context is Everything #authenticity #1984 pic.twitter.com/vH82BZeD4F
- Sharon E. Farb (@FarbThink) October 13, 2016

After a short break, panel one which consisted of Chris Freeland, Matt Weber, Laura Wrubel, and moderator Ana Krahmer addressed the question of "Why Save Online News".

Next Up. Why #Save #Online #News? Challenge #Access #Online #News Post Event @chrisfreeland @liblaura @docmattweber #dtmh2016 #localnews pic.twitter.com/KTpg7z4NwB
- Sharon E. Farb (@FarbThink) October 13, 2016

Matt Weber started off the discussion by talking about the interactions between web archives and news media. Stating that digital only media has no offline surrogate and how it is becoming increasingly difficult to do anything but look at it now as it exists. Following Mat Weber were Laura Wrubel and Chris Freeland who both talked about the large share Twitter has in online news. Laura Wrubel brought up that in 2011 journalists primarily used Twitter to direct people to articles rather than for conversation. Chris Freeland stated that Twitter the primary source of information during the Ferguson protests in St. Louis and that the local news outlets were far behind in reporting the organic story as it happened.

#dtmh2016 @docmattweber"I don't think you'll ever convince publishers at scale to donate their economic property to memory institutions."
- Kate Zwaard (@kzwa) October 13, 2016

Following panel one was Tim Groeling (professor and former chair of the UCLA Department of Communication Studies) giving presentation one entitled "NewsScape: Preserving TV News".

The NewsScape project is currently migrating analog recordings of TV news to digital for archival lead by Tim Groesling. The collection contains recording dating back to 1950's and is the largest collection of TV news and public affairs programs containing a mix of U-matic, Betamax, and VHS tapes.

Currently, the project is working its way through the collections tapes completing 36k hours of encoding this year. Tim Groeling pointed out that VHS despite being the newest tapes are the most threatened.

#DtMH2016 Tim Groeling: We use a layer of dead VCR's over our good VCR's to prevent RF interference and audio buzzing. :)
- ChrisAldrich (@ChrisAldrich) October 13, 2016

After lunch, the attendees were broken up into fifteen groups for the first of two breakout sessions. Each group was tasked with formulating three things that could be included in a national agenda for news preservation and to come up with a project to advance the practice of online news preservation.

Each group sent up one person who briefly went over what they had come up with. Despite the diverse background of the attendees at dtmh2016 the ideas that each group came up with had a lot in common:

A list of tools/technologies for archiving (awesome memento)
Identifying broken links in new articles
Increase awareness of how much or how little is archived
Work with news organization to increase their involvement in archiving
More meetups, events, hackathons that bring together technologists
with journalists and librarians

The final speaker of the day was Clifford Lynch giving a talk entitled "Born-digital news preservation in perspective".

Dr. Clifford Lynch #dtmh2016 speaking about problems of scholarly journals, a topic very important to Phittle pic.twitter.com/ZWE6SFEXir
— Phittle (@ThePhittle) October 13, 2016

In his talk, Clifford Lynch spoke about problems that plague news preservation such as link rot and the need for multiple archives.

#DtMH2016 Clifford Lynch: The material on lots of links (as sources) disappears after a short period of time.
— ChrisAldrich (@ChrisAldrich) October 13, 2016

Clifford Lynch of @cni_org at #dtmh2016: "We have this mythology that @internetarchive archives the web. ... It's not a total solution."
— Ben Welsh (@palewire) October 13, 2016

He also spoke on the need to preserve other kinds of media like data dumps and that archival record keeping goes hand in hand with journalism.

Who preserves the data dumps? Who preserves the PDFs and reports? No one has really stepped up. -Clifford Lynch #dtmh2016
— P. Kim Bui (@kimbui) October 13, 2016

"Responsible journalism implies a strong permanent record of that work." -- Clifford Lynch #dtmh2016
— Kate Zwaard (@kzwa) October 13, 2016

After his talk was over Edward McCain gave final remarks for day one and transitioned us to reception for the scholarship winners. The scholarship winners purposed projects (to be completed by December 2016) that would aid in digital news preservation and of these students three were WS-DL members (Shawn Jones, Mat Kelly, John Berlin).

#dtmh2016 Introducing Amazing #GradStudents #Scholars #UCLA #Saving #Online #News #Students=Future pic.twitter.com/WGKFNrHs3O
- Sharon E. Farb (@FarbThink) October 13, 2016

Day 2 (October 14, 2016)

Day two of dodging the memory hole 2016 began with Sharon Farb welcoming us back.

@FarbThink greeting #dtmh2016 participants on start of day 2 @UCLA_library #savenews pic.twitter.com/2u353dLi9R
— Edward McCain (@e_mccain) October 14, 2016

@FarbThink talks about human rights and the role journalism and journalists play. It's critical that we preserve that work #dtmh2016
— Todd Grappone (@liber8er) October 14, 2016

Followed by the first presentation of the day by our very own Dr. Nelson titled "Summarizing archival collections using storytelling techniques"

#dtmh2016 @phonedude_mln presents work with @yasmina_anwar and @weiglemc on "summarizing archival collections using storytelling techniques"pic.twitter.com/pW38yRrYs0
— Shawn M. Jones (@shawnmjones) October 14, 2016

The presentation highlighted the work done by Yasmin AlNoamany in her doctoral dissertation, in particular, The Dark and Stormy Archives (DSA) Framework.

#dtmh2016 @phonedude_mln details the Dark and Stormy Archives (DSA) framework for storytelling with archives pic.twitter.com/cpF6pBB2kQ
— Shawn M. Jones (@shawnmjones) October 14, 2016

Up next was Pulitzer prize winning journalist Peter Arnett who presented "Writing The First Draft of History - and Saving It!" talking about his experiences while covering the Vietnam War and how he saved the Associated Presses Saigon office archives.

Peter Arnett talks about being a journalist covering the Vietnam War and censorship #dtmh2016 pic.twitter.com/GgKL3NOMs4
— Todd Grappone (@liber8er) October 14, 2016

Following Perter Arnett was the second to last panel of dtmh2016 Kiss your app goodbye: the fragility of data journalism featuring Ben Welsh, Regina Roberts, Meredith Broussard and moderated by Martin Klein.

Meredith Broussard spoke about how archiving of news apps has become difficult as their content does not live in a single place.

#DtMH2016 @merbroussard: News apps don't live in any of the CMSs. They're bespoke and live on a separate data server.
— ChrisAldrich (@ChrisAldrich) October 14, 2016

This is even more complicated with news apps, which are dynamic and separate from the web CMS @merbroussard #dtmh2016
— Kate Zwaard (@kzwa) October 14, 2016

Ben Welsh was up next speaking about the work he has done at the LA Times Data Desk.

Ben Welsh @palewire presenting on news apps at #dtmh2016 @UCLA_library pic.twitter.com/6MVECkJmOl
— Edward McCain (@e_mccain) October 14, 2016

In his talk, he stressed the need for more tools to be made that allowed people like himself to make archiving and viewing of archived news content easier.

.@palewire Made a Django Momento plugin https://t.co/xlsSO4GJjQ #dtmh2016
— Kate Zwaard (@kzwa) October 14, 2016

Following Ben Welsh was Regina Roberts who spoke about the work done at Standford for archiving and adding context to the data sets that live beside the codebases of research projects.

#dtmh2016 Regina Lee Roberts on preservation and sharing of big data at Stanford pic.twitter.com/1aVaMiX2mR
— Shawn M. Jones (@shawnmjones) October 14, 2016

#dtmh2016 Regina Lee Roberts talks about creating BLDR (big local data repository) at Stanford pic.twitter.com/X9VQO8yRhC
— Shawn M. Jones (@shawnmjones) October 14, 2016

The last panel of dtmh2016 "The future of the past: modernizing The New York Times archive" featured members of the technology team at the New York Times Evan Sandhaus, Jane Cotler, and Sophia Van Valkenburg with moderator Edward McCain.

Evan Sandhause presented the New York Times own take on the wayback machine called TimesMachine. The TimesMachine allows users to view the microfilm archive of The New York Times.

#dtmh2016 @kansandhaus introduces the @nytimes TimesMachine of scans and metadata from microfilm https://t.co/ExRZGkdqfm pic.twitter.com/x3bpvZ0Yut
— Shawn M. Jones (@shawnmjones) October 14, 2016

Sophia Van Valkenburg spoke about how the New York Times was transitioning its news archives into a more modern system.

#dtmh2016 Sophia van Valkenburg demonstrates flowchart for converting legacy born digital articles @nytimes into format used by current CMS pic.twitter.com/XqlHAFGBwa
— Shawn M. Jones (@shawnmjones) October 14, 2016

After Sophia Valkenburg, was Jan Cotler who spoke about the gotchas encountered during the migration process. Most notable of the gotchas was that the way in which the articles were viewed (i.e, visual aesthetics) was not preserved in the migration process in favor of a "better user experience" and that in migrating to the new system links to the old pages would no longer work.

#dtmh2016 Jane Cotler mentioned decommissioning old URLs and how this can lead to link rot for those linking to @nytimes
— Shawn M. Jones (@shawnmjones) October 14, 2016

#DtMH2016 @janecotler: We made the decision of taking out data we had in lieu of making a better user experience for missing sections.
— ChrisAldrich (@ChrisAldrich) October 14, 2016

#dtmh2016 @kansandhaus"much easier to preserve print journalism because it is not a nexus of content and software"
— Shawn M. Jones (@shawnmjones) October 14, 2016

Lightning rounds were up next.

Mark Grahm of the Internet Archive was up first with a presentation on the wayback machine and how later this year it would be getting site search.

@MarkGraham of @internetarchive lightning talk @waybackmachine #dtmh2016 pic.twitter.com/z1ajvUuBvP
— John Berlin (@johnaberlin) October 14, 2016

#dtmh2016 @MarkGraham from @internetarchive discussed "save page now"@internetarchive, upcoming site search, and more
— Shawn M. Jones (@shawnmjones) October 14, 2016

Jefferson Bailey also of the Internet Archive spoke on the continual efforts at the Internet Archive to get the web archives into the hands of researchers.

#dtmh2016 @jefferson_bail on "trying to get web archives into the hands of researchers"pic.twitter.com/TKu9s6MU8b
— Shawn M. Jones (@shawnmjones) October 14, 2016

#dtmh2016 @jefferson_bail is talking about derivative data sets for researchers, extracting metadata from collections into WAT, LGA, WANE
— Shawn M. Jones (@shawnmjones) October 14, 2016

Terry Britt spoke about how social media over time establishes "collective memory".

On episodic and mediated memory. Journalists are responsible for mediated memory. #dtmh2016 pic.twitter.com/hLQH1EFBAg
— P. Kim Bui (@kimbui) October 14, 2016

Katherine Boss presented "Challenges facing the preservation of born-digital news applications" and how they end up in dependency hell.

Lightning talk from Katherine Boss and Meredith Broussard#dtmh2016 pic.twitter.com/EbKxqrfE8j
— John Berlin (@johnaberlin) October 14, 2016

Eva Revear presented a tool to discover frameworks and software used for news apps

#dtmh2016 @erevear presents a survey tool to discover frameworks and software used for news apps pic.twitter.com/LdIdtpnO8q
— Shawn M. Jones (@shawnmjones) October 14, 2016

Cynthia Joyce talked about a book on Hurricane Katrina and its use of archived news coverage of the storm.

#dtmh2016 @cynthiajoyce talks about Hurricane Katrina and a book of the curated experiences of those who covered the storm pic.twitter.com/hw0u0yuMAh
— Shawn M. Jones (@shawnmjones) October 14, 2016

Jennifer Younger presented the work being done by the Catholic News Archive.

#dtmh2016 Jennifer Younger presents https://t.co/1Aw1jWHrM4 pic.twitter.com/m2TNPc0J4E
— Shawn M. Jones (@shawnmjones) October 14, 2016

Kalev Leetaru talked about the work he and the gdeltproject are doing in web archival.

@kalevleetaru giving lighting talk about @gdeltproject #dtmh2016 pic.twitter.com/pKAfMchdKW
— John Berlin (@johnaberlin) October 14, 2016

An overview of the loss of journalistic content. #dtmh2016 pic.twitter.com/HKc6OU8Muy
— P. Kim Bui (@kimbui) October 14, 2016

The last presentation of the event was by Kate Zwaard titled "Technology and community Why we need partners, collaborators, and friends".

Kate Zwaard talked about the success of web archival events such as the recent Collections as Data and Archives Unleashed 2.0 held at the Library of Congress.

#dtmh2016 @kzwa mentioned the success of #archivesunleashed @librarycongress earlier this year pic.twitter.com/YkjY0IoHEJ
— Shawn M. Jones (@shawnmjones) October 14, 2016

The web archive collection at the Library of Congress.

#dtmh2016 @kzwa talking about web archive @librarycongress https://t.co/oNRWXpLv7Q, which is #memento compliant pic.twitter.com/yE09UWNY2B
— Shawn M. Jones (@shawnmjones) October 14, 2016

How they are putting Jupyter notebooks on top of database dumps.

#dtmh2016 @kzwa talks about saving #Jupyter notebooks and database dumps https://t.co/k4cZJues1Q pic.twitter.com/LQa2DPE0C4
— Shawn M. Jones (@shawnmjones) October 14, 2016

And the diverse skill sets required for librarians of today.

"It's like physicists in the '50s."@kzwa of @librarycongress talks about wide range of skill sets necessary for librarians #dtmh2016 pic.twitter.com/chM22UCuAJ
— JDNA (@RJIJDNA) October 14, 2016

The final breakout sessions of dtmh2016 consisted of four topic discussions.

Jefferson Bailey's session, Web Archiving For News, was an informal breakout where he asked the attendants about collaboration between the Archive and other organizations. A notable response was from the NYTimes representative Evan Sandhaus with a counter question about whether organizations or archives should be responsible for the preservation of news content. Jefferson Bailey responded that he wished organizations were more active in practicing self-archiving. Others responded with their organizations or ones they knew about approaches to self-archiving.

Ben Welsh's session, News Apps, discussed issues archiving news apps which are online web applications providing rich data experiences. An example app to illustrate this was California's War Dead which was archived by the Internet Archive but with diminished functionality. In spite of this "success", Ben Welsh brought up the difficulty in preserving the full experience of the app as web crawlers only interact with client side code, not server side which is required. To address this issue, he suggested solutions such as the python library django-backery for producing flat, static versions of news apps based on database queries. These static versions can be more easily archived while still providing a fuller experience when replayed.

Ben Welsh @palewire focusing on how his news app for @latimes is made at #dtmh2016 at @UCLA_library ssavenews pic.twitter.com/5Rk71tc5B9
— Edward McCain (@e_mccain) October 14, 2016

Eric Weig's session, Working with CMS, started out with him sharing his experience of migrating one the Univeristy of Kentucky Libraries Special Collections Research Center newspaper sites cms from a local data center using sixteen cpus to a less powerful cloud-based solution using only two cpus. One of the biggest performance increases came when he switched from dynamically generating pages to serving static html pages. Generating the static html pages for the eighty-two thousand issues contained in this cms took only three hours on the two cpu cloud-based solution. After sharing this experience the rest of the time was used to hear from the audience about their experiences using cms and an impromptu roundtable discussion on cms.

Kalev Leetaru's session, The GDELT Project: A Look Inside The World's Largest Initiative To Understand And Archive The World's News, was a more in depth version of the lightning talk he gave. Kalev Leetaru shared experiences that The GDELT Project had with archival crawling of non-English language news sites, his work with the Internet Archive on monitoring news feeds and broadcasts, the untapped opportunities for exploration of Internet Archive and A Vision Of The Role and Future Of Web Archives. He also shared two questions he is currently pondering: "Why are archives checking certain news organizations more than others?" and "How do we preserve GeoIP generated content especially in non-western news sites?".

@kalevleetaru: datasets @gdeltproject uses. #dtmh2016 pic.twitter.com/bgJPWFsHlp
— John Berlin (@johnaberlin) October 14, 2016

The last speaker of dtmh2016 was Katherine Skinner with Alignment and Reciprocity. In her speech Katherine Skinner called for volunteers to carry out some of the actions mentioned at dtmh2016 and reflected on the past two days.

Katherine Skinner from @Educopia talks about Alignment and Reciprocity at #dtmh2016 @UCLA_library #savenews pic.twitter.com/iOM3UUKQHF
— Edward McCain (@e_mccain) October 14, 2016

Closing out dtmh2016 was Edward McCain who thanked everyone for coming and expressed how enjoyable this event was especially with the graduate students and Todd Grappone's closing remarks. In the closing remarks, Todd Grappone reminded attendees of the pressing problems in news archival and how they require both academic and software solutions.

I'm sad about the end of #dtmh2016; it was good to meet everyone; lots of good experiences pic.twitter.com/bQX6DUXxrx
— Shawn M. Jones (@shawnmjones) October 14, 2016

Video recordings of DTMH2016 can be found on the Reynolds Journalism Institute's Facebook page. Chris Aldrich recorded audio along with a transcription of days one and two. NPR's Research, Archive & Data Strategy team created a storify page of tweets covering topics they found interesting.

-- John Berlin

↧

2016-10-23: Institutional Repositories, OAI-PMH, and Anonymous FTP

October 24, 2016, 6:21 am

≫ Next: 2016-10-24: Fun with Fictional Web Sites and the Internet Archive

≪ Previous: 2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

Richard Poynder's recent blog post "Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository?" has generated a lot of discussion, including a second post from Richard to address the comments and the always insightful commentary from David Rosenthal ("Why Did Institutional Repositories Fail?"). There surely have been enough articles about institutional repositories to fill an institutional repository, but of particular interest to me are discussions about the technical and aspirational goals of OAI-PMH.

A year ago Herbert and I reflected on OAI-PMH and other projects ("Reminiscing About 15 Years of Interoperability Efforts"), which I wish Richard would have referenced in his discussion (although Cliff does allude to this in his interview (MLN edit: Richard points out that I missed his quoting of that paper in his second blog post), as well as the original SFC and UPS papers. For his response to Richard, Herbert had a series of tweets which I collected:

Herbert Van de Sompel's Reaction to Richard Poynder's "Q&A with..."

I also put forward my own perspective in a series of tweets, which I will summarize below. To me, OAI-PMH is the logical conclusion of the trajectory of the computer science department tradition of publishing technical reports on anonymous FTP servers. These were both pre- and post-print versions, and whereas arXiv.org was based on a centralized approach (due in part to its SMTP origins), the anonymous FTP approach was inherently distributed, and was a departmental-level institutional repository.

Within the CS community, the CS-TR project (which produced Dienst) and WATERS project evolved into NCSTRL, which was arguably one of the first open source institutional repository systems. An unrelated effort that is often overlooked was the Unified Computer Science Technical Report Index (UCSTRI), whose real innovation was that it provided a centralized interface to the distributed anonymous FTP servers without requiring them to do anything. It would cleverly crawl and index known FTP servers, parse the README files, and construct URLs from the semi-structured metadata. The parsing results weren't always perfect, but for 1993 it was highly magical and presaged the idea of building centralized services on top of existing, uncoordinated servers.

At NASA Langley Research Center in 1993, I brought the anonymous FTP culture to NASA technical reports (mostly their own report series, but some post-prints, see NASA TM-4567), followed by a web interface in 1993 (NASA TM-109162). In 1994, we integrated several of these web interfaces into the NASA Technical Report Server (NTRS, AIAA-95-0964), which continues in name to this day (ntrs.nasa.gov) as an institutional repository that largely goes unrecognized as such (albeit covering a smaller range of subjects than a typical university). NTRS is a centralized operation today, but it was originally a distributed search model. Due in part to the limited number of NASA Centers, projects, and affiliated institutes (there were probably never more than a dozen in NTRS) it was initially a distributed architecture.

By 1999 there was a proliferation of both subject-based and institutional repositories, which lead to the UPS experiment and ultimately OAI-PMH itself. The proliferation of the web made it possible to greatly enhance the functionality of the anonymous FTP server (searching, better browsing, etc.). But at the same time the web also killed the CS departmental technical report series and the servers that hosted them. Although some may exist somewhere, off the top of my head I'm not aware of any CS departments with an active CS technical report series, at least not like the 80s and 90s.

The web made it possible for individuals to list their pre- and post-prints on their own page (e.g., my publication page, Herbert's publication page), and systems like CiteSeer, Google Scholar, and others -- much like UCSTRI before them -- evolved to discover these e-prints linked from individuals' home pages and centrally index them with no administrative or author effort.

In summary, I believe any discussion of institutional repositories (and OAI-PMH) has to acknowledge that while the web allowed for their evolution of repository systems to their current advanced state, the web also obsoleted many of the models and assumptions that drove the development of repository systems in the first place. The web allowed for "fancy" anonymous FTP servers, but it also meant that we no longer needed them. Or perhaps we need them differently and a lot less: institutional repositories still have a functional role, but they need to be operated more like Google Scholar et al.

--Michael

↧

2016-10-24: Fun with Fictional Web Sites and the Internet Archive

October 24, 2016, 9:05 am

≫ Next: 2016-10-24: Are My Favorite Arabic Websites Archived?

≪ Previous: 2016-10-23: Institutional Repositories, OAI-PMH, and Anonymous FTP

As we celebrate the 20th anniversary of the Internet Archive, I realize that using Memento and the Wayback Machine has become second nature when solving certain problems, not only in my research, but also in my life. Those who have read my Master's Thesis, Avoiding Spoilers on Mediawiki Fan Sites Using Memento, know that I am a fan of many fictional television shows and movies. URIs are discussed in these fictional worlds, and sometimes the people making the fiction actually register these URIs, seen in the example below, creating an additional vector for fans to find information on their favorite characters and worlds.

Real web site at http://www.piedpiper.com/ for the fictional company Pied Piper from HBO's TV series Silicon Valley

Unfortunately, interest in maintaining these URIs fades once the television show is cancelled or the movie is no longer showing. As noted in my thesis, the advent of services like Netflix and Hulu allow fans to watch old television shows for the first time, sometimes years after they have gone off of the air. Those first-time fans might want to visit a URI they encountered in one of these shows, but instead encounter the problems of link rot and content drift shown in the examples below.

Link rot for http://www.starkexpo2010.com/ showing the StarkExpo,
a fictitious technology fair from the Marvel Studios film Iron Man 2 (left -memento),
now leads to a dead link (right - current dead site)

Content drift for http://www.richardcastle.net/
from the fictional character's web site (left -memento)
now leads to an advertisement
for the cancelled ABC television show Castle (right - live site)

Fortunately, the Internet Archive can come to the rescue. Below is a chart listing some fictional URIs and the television shows in which they occur. The content at these URIs is no longer available live, but is still available thanks to the efforts of the Internet Archive. Included in the far right column are links to example URI-Ms from the Internet Archive for each of these URI-Rs, showing how fans can indeed go back and visit these URIs.

TV Show or Movie	Network or Production Company	URI-R	Current URI-R Status Compared to URI-M	Link to URI-M from the Internet Archive
The Simpsons	FOX	http://www.dorks-gone-wild.com/	Link Rot No HTTP server at hostname	URI-M
True Blood	HBO	http://www.americanvampireleague.com/	Content Drift 301 Redirect to HBO.com	URI-M
30 Rock	NBC	http://jdlutz.com/karen/proof/	Link Rot 500 HTTP Status	URI-M
Iron Man 2	Marvel Studios	http://www.starkexpo2010.com/	Link Rot Hostname does not resolve	URI-M
Castle	ABC	http://www.richardcastle.net/	Content Drift 301 Redirect to ABC.com Castle page	URI-M
LOST	ABC	http://www.oceanic-air.com/	Link Rot 301 Redirect and 404 HTTP Status	URI-M
Jurassic World	Universal Studios	http://www.jurassicworld.com/	Content Drift Was Fictional Content, Now Advertises Movie and Games	URI-M

The practice of publishing content at these fictional URIs shows no signs of abating. For example, the HBO TV Series Silicon Valley is a comedy about the lives of tech entrepeneurs working in Silicon Valley. The television show features several fictional companies that have real web sites that fans can visit, such as http://www.piedpiper.com, http://www.hooli.com/, and http://www.bachmanity.com. Because the show is about software developers, there is even a real Github account for one of the fictional characters, shown in the screenshot below. Using the "Save Page Now" feature, I just created a URI-M for it today in the Internet Archive.

This concept will become more important over time. As historians and sociologists study our past, some of these resources may be important to understanding these fictional worlds and how they fit into the time period in which they were developed. This makes improved archivability and reduction in Memento damage important even for these pages.

As to the meaning of the content, that's up to the fans to evaluate and discuss.

-- Shawn M. Jones

↧

2016-10-24: Are My Favorite Arabic Websites Archived?

October 24, 2016, 1:24 pm

≫ Next: 2016-10-24: 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) Trip Report

≪ Previous: 2016-10-24: Fun with Fictional Web Sites and the Internet Archive

In this work, I collected the top 20 Arabic websites that I like and browse, and in my personal judgment consider as popular (shown in Table 1). For each, I checked its ranking globally and locally based on Alexa Ranking. Then I used MemGator tool to check if it is archived, and got the estimated creation date based on the first memento date. After that, I checked who archived the webpage first (shown in Table 2).

Arabic websites in general were evaluated based on how well they were archived and indexed in my previous work, How Well Are Arabic Websites Archived?. We sampled 300,646 Arabic language pages, and found that 46% are not archived and 31% are not indexed by Google.

Table 1: List of my favorite Arabic Websites and its description
Website	Site Description
maktoob.yahoo.com Yahoo! مكتوب	A major internet portal and email service provider in Arabic language.
aljazeera.net الجزيرة نت	News channel, political, economic, and thoughts.
6arab.com طرب موقع طرب اغاني طرب كوم	Arabic singers and music directory.
alarabiya.net العربية.نت	Arabic language news network. Breaking news and features along with videos, photo galleries and In Focus sections on major news topics.
kooora.com كووورة: الموقع العربي الرياضي الأول	The first Arabic website of football featuring World Championships with an Arab follow-up and analysis of all events of football.
hawaaworld.com منتديات عالم حواء	Women's network concerned with women affairs, life, family, cooking, children, and Beauty.
mbc.net ترفيه، جدول البرامج، مشاهير،أفلام، مسلسلات، برامج تلفزيونية	The Middle East Broadcasting Center Group is the first private free-to-air satellite broadcasting company in the Arab World.
alriyadh.com جريدة الرياض	The first daily newspaper published in Arabic in the capital of Saudi Arabia.
ksu.edu.sa جامعة الملك سعود	Established in 1957, King Saud University is the largest educational institution in the Kingdom of Saudi Arabia, and generally considered the premier institute for academics and research in Arab and Muslim countries.
abunawaf.com شبكة أبو نواف - المتعة والفائدة	The site specializes in multimedia and entertainment, it also contains the biggest content for a mailing lists.
eqla3.com شبكة الإقلاع	Sites and forums comprehensive that occur daily with what is new in all fields.
samba.com سامبا: خدمات الافراد و المصرفية الالكترونية	Samba Financial Group (formerly known as The Saudi American Bank), is a large banking firm in Saudi Arabia.
sabq.org صحيفة سبق الإلكترونية‎	Saudi newspaper was founded in 2007. Working in the field of electronic media, dealing with the most important local events in particular and the Arab and international in general.
ar.wikipedia.org ويكيبيديا، الموسوعة الحرة	A free encyclopedia built collaboratively using wiki software in Arabic language.
mekshat.com مكشات - الصفحة الرئيسة	A website interested in trips and camping.
cksu.com تجمع طلبة جامعة الملك سعود	King Saud University students gathering, a group of human energies are working on a good environment for dialogue and purposeful upscale between students and faculty.
uoh.edu.sa جامعة حائل	The University of Ha'il was officially established in 2006. The university is located in the north of Saudi Arabia.
ar-sa.namshi.com موقع نمشي للأزياء, وجهتك الأولى لتسوق الأزياء في السعودية	Website is interested in fashion and online shopping.
arabtravelersforum.com منتديات المسافرون العرب الاصلي	Travelers Arab forum is interested in tourism and travel.
vanilla.sa موقع فانيلا	Website is interested in fashion and online shopping.

Table 2: Alexa Ranking and Archiving Results of My Favorite Websites
Website	Global Alexa RankingOct, 2016	Local Alexa RankingOct, 2016	Memento count	First memento date	Who archived it first
maktoob.yahoo.com	5	(US)=5	28,866	2009-08-31	IA
aljazeera.net	1,673	(SA)=174	20,468	1998-11-11	IA+BA
6arab.com	113,624	(Egypt)=6,120	12,991	1999-11-27	IA+BA+Archive Today
alarabiya.net	1,548	(SA)=37	9,737	2003-11-26	IA+BA
kooora.com	517	(Algeria)=15	4,658	2002-10-19	IA+BA
hawaaworld.com	10,026	(SA)=166	4,149	2001-01-10	IA+BA
mbc.net	1,195	(SA)=57	3,924	1999-10-13	IA+BA
alriyadh.com	5,136	(SA)=45	3,415	2000-02-29	IA+BA
ksu.edu.sa	6,093	(SA)=87	3,025	2000-03-02	IA
abunawaf.com	24,238	(SA)=413	2,446	2002-05-23	IA+BA
eqla3.com	9,741	(SA)=107	1,906	2000-05-10	IA+BA
samba.com	10,756	(SA)=129	1,451	1999-01-17	IA+BA
sabq.org	793	(SA)=5	1,170	2007-02-23	IA
ar.wikipedia.org	6	(US)=6	1,106	2003-02-09	IA+BA
mekshat.com	24,343	(SA)=404	828	2001-04-28	IA+BA
cksu.com	47,807	(SA)=708	643	2004-02-21	IA+BA
uoh.edu.sa	88,397	(SA)=806	210	2006-07-16	IA
ar-sa.namshi.com	10,968	(SA)=279	85	2012-04-05	IA
arabtravelersforum.com	31,259	(SA)=960	43	2014-11-29	IA
vanilla.sa	118,442	(SA)=1,053	13	2015-03-27	IA

Alexa calculates the global and local ranking of a website based on its traffic statistics. However, this tool is based on calculating the traffic of the domain. For example, if we check the ranking of the Arabic Wikipedia, ar.wikipedia.org, the tool will return the statistics of wikipedia.org instead. Based on this information we note that the top two global ranking in my list had an English domain. The two websites are maktoob.yahoo.com with a global ranking 5 and ar.wikipedia.org with a global ranking of 6. The third top global ranking website is kooora.com with a global rank of 15 and local rank of 15 in Algeria. Followed by sabq.org with a global ranking of 793 and a high local ranking in Saudi Arabia of 5.

In my list, I found that 4 out of the 20 websites were created before 2000. However, when looking in the archive I found that mbc.net domain was created in 1999 and it was in Korean language, then it became the Arabic website written in English in 2003, finally the Arabic version of the website was created in 2004.

mbc.net in 1999

mbc.net in 2003

mbc.net in 2004

Also, as expected I found that the Internet Archive was first to archive the webpage. However, in some websites I found that the Bibliotheca Alexandrina archive had a copy of the exact memento records, that is due to that the BA having duplicate record of the IA. Only, 6arab.com was first archived by three separate archives: archive.is, the IA, and the BA archive.

As for the memento count, I would expect that the websites that existed before 2000 had more mementos. However, two of the four webpages, samba.com and mbc.net, have only 1,451 and 3,924 mementos, respectively, which seems low considering how long they have existed. On other hand, aljazeera.net's first memento was in 1998 and has around 20,468 mementos, which is the second highest memento count in my list after maktoob.yahoo.com with memento count of 28,866.

The website 6arab.com is currently being blocked in Saudi Arabia (and access via the IA is blocked as well), due to violating the Saudi regulation of the Ministry of Culture and Information. So it does not have a local ranking in Saudi Arabia. Instead the top local ranking of this site is in Egypt.

-Lulwah M. Alkwai

↧

2016-10-24: 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) Trip Report

October 24, 2016, 1:47 pm

≫ Next: 2016-10-25: Paper in the Archive

≪ Previous: 2016-10-24: Are My Favorite Arabic Websites Archived?

"Dad, he is pushing random doorbell buttons", Dr. Herzog's daughter complained about her brother while we were walking back home late night after having dinner in the city center of Potsdam. Dr. Herzog smiled and replied, "it's rather a cool idea, let's all do it". Repeating the TPDL 2015 tradition, Dr. Michael Herzog's family was hosting me (Sawood Alam) at their place after the TPDL 2016 conference in Hannover. Leaving some cursing people behind (who were disturbed by false doorbells), he asked me, "how was your conference this year?"

Day 1

Between the two parallel sessions of the first day, I attended the Doctoral Consortium session as a participant. The chair Kjetil Nørvåg, Norwegian University of Science and Technology, Norway, began the session with the formal introduction of the session structure and timeline. Out of the seven accepted Doctoral Consortium submissions, only five could make it to the workshop.

Õnne Mets from the University of Milano-Bicocca, Italy, presented her talk on "Social Interaction and Discoverability of Digital Resources in Memory Institutions"
Joffrey Decourselle from LIRIS, France, presented his talk on "Case-oriented Semantic Enrichment of Bibliographic Entities"
I, Sawood Alam from Old Dominion University, USA, presented my talk on "Web Archive Profiling for Efficient Memento Aggregation"
Sebastian Dungs from the University of Duisburg-Essen, Germany, presented his work on "Describing user's search behaviour with Hidden Markov Models"
Konstantina Lazaridou from Hasso-Plattner-Institut, Germany, presented her work on "Identifying Political Bias in News Articles"

My talk was mainly praised for the good content organization, an easy to follow story for the problem description, tiered approach to solving problems, and inclusion of the work and publication plans. Konstantina's talk on political bias identification generated the maximum discussion during the QA session. I owe her references to A visual history of Donald Trump dominating the news cycle and Text analysis of Trump's tweets confirms he writes only the (angrier) Android half.

TPDL 2016 Doctoral Consortium - Web Archive Profiling by Sawood Alam

Each presenter was assigned a mentor for more in-depth feedback on their work and provide and outsider's perspective that would help define the scope of the thesis and recognize parts that might need more elaboration. After formal presentation session, presenters were spread apart for one-to-one session with their corresponding mentor. Nattiya Kanhabua, from Aalborg University, Denmark, was my mentor. She provided great feedback and some useful references that might be relevant to my research. We also talked about the possibilities of collaboration in future where our research interest intersects.

After the conclusion of the Doctoral Consortium Workshop we headed to Technische Informationsbibliothek (TIB) where Mila Runnwerth welcomed us to German National Library of Science and Technology. She gave us an insightful presentation followed by a guided tour to the library facilities.

Day 2

The main conference started on the second day with David Bainbridge's keynote presentation on "Mozart's Laptop: Implications for Creativity in Multimedia Digital Libraries and Beyond". He introduced a tool named Expeditee that gives a universal UI for text, image, and music interaction. The talk was full of interesting references and demonstrations such a querying music by humming. Following the keynote, I attended the Digital Humanities track while missing the other two parallel tracks.

Gerhard Lauer presented his work on "DH-Research and eInfrastructures: Entanglements and Points of Tension"
Oliver Schöner presented David Zellhöfer's work on "Exploring Large Digital Libraries by Multimodal Criteria"
Annika Hinze presented her work on "The challenge of creating geo-location markup for digital books"

Then I moved to another track for Search and User Aspects sessions.

Zeljko Carevi presented his work on "Survey on High-level Search Activities based on the Stratagem Level in Digital Libraries"
Ralph Ewerth presented his work on "Content-Based Video Retrieval in Historical Collections of the German Broadcasting Archive"
Christos Doulkeridis presented his work on "Profile-based Selection of Expert Groups"
Annika Hinze presented her work on "Tracking and Re-finding printed material using a Personal Digital Library"
Zeon Trevor Fernando presented his work on "ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections"

Following the regular presentation tracks the Posters and Demos session was scheduled. It came to me as a surprise that all the Doctoral Consortium submissions were automatically included in the Posters session (apart from the regular poster and demo submissions) and assigned reserved places in the hall for posters, which means I had to do something for the traditional Minute Madness event that I was not prepared for. So I ended up reusing #IAmNotAGator gag that I prepared for JCDL 2016 Minute Madness and utilized the poster time to advertise MemGator and Memento.

Day 3

On the second day of the conference I had two papers to present. So, I decided to wear business formal attire. As a consequence, the conference photographer stopped me at the building entrance and asked me to pose for him near the information desk. The lady on the information desk tried to explain me routes to various places of the city, but the modeling session extended so long that it became awkward and we both started smiling.

The day began with Jan Rybicki's keynote talk on "Pretty Things Done with (Electronic) Texts: Why We Need Full-Text Access". For the first time I came to know about the term Stylometry. His slides were full of beautiful visualizations. The tool used to generate the data for the visualizations is published as an R package called stylo. After the keynote, I attended the Web Archives session.

I, Sawood Alam, presented our work on "Web Archive Profiling Through Fulltext Search" (prototype implementation)
Thaer Samar presented his work on "Comparing Topic Coverage in Breadth-first & Depth-first Crawls using Anchor Texts"
Philipp Kemkes presented his work on "How to Search the Internet Archive Without Indexing It" (prototype implementation)

Web Archive Profiling Through Fulltext Search by Sawood Alam

After the lunch break I moved to the Short Papers track where I had my second presentation of the day.

Sarantos I. Kapidakis presented his work on "A Case Study of Summarizing and Normalizing the Properties of DBpedia Building Instances"
Jorgina Kaumbe do Rosario Paihama presented her work on "What happens when the untrained search for training information"
Sarantos I. Kapidakis presented his work on "Exploring Metadata Providers Reliability and Update Behavior"
I, Sawood Alam, presented our work on "InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives" (prototype implementation)

InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives by Sawood Alam

After the coffee break I attended the Multimedia and Time Aspects track while missing the panel session on Digital Humanities and eInfrastructures.

Yue Zhao presented his work on "Sub-document Timestamping: A study on the Content Creation Dynamics of Web Documents"
Helge Holzmann presented his work on "Archiving Software Surrogates on the Web for Future Reference"
Cynthia C. S. Liem presented her work on "From Water Music to 'Underwater Music': Multimedia Soundtrack Retrieval with Social Mass Media Resources"

In the evening we headed to the XII Apostel Hannover for the conference dinner. The food was good. During the dinner they announced Giannis Tsakonas and Joffrey Decourselle as the best paper and the best poster winners respectively.

Day 4

On the last day of the main conference I decided to skip the panel and tutorial tracks in the favor of the Digital Library Evaluation research track.

Giannis Tsakonas presented his work on "The “nomenclature of multidimensionality” in the digital libraries evaluation domain"
Gabriel Pacheco presented his work on "Dissecting a Scholar Popularity Ranking into Different Knowledge Areas"
Hugo Manguinhas and Juliane Stiller presented their work on "Exploring comparative evaluation of semantic enrichment tools for cultural heritage metadata"

After a brief coffee break everyone gathered for the closing keynote presentation by Tony Veale on "Metaphors All the Way Down: The many practical uses of figurative language understanding". The talk was very informative, interesting, and full of hilarious examples. He mentioned the Library of Babel which reminded me of a digital implementation of it and a video talking about it. Slides looked more like a comic strip which was very much in line with the theme of the talk which ended up talking about various Twitter bots such as MetaphorIsMyBusiness and MetaphorMirror.

Following the closing keynote the main conference was concluded with some announcements. Next year TPDL 2017 will be hosted in Thessaloniki, Greece, during September 17-21, 2017. TPDL is willing to expand the scope and encouraging young researchers to come forward with session ideas, chair events, and take the lead. People who are active on social media and scientific communities are encouraged to spread the word out to bring more awareness and participation. This year's Twitter hashtag was #TPDL2016 where all the relevant Tweets can be found.

The rest of the afternoon I spent in the Alexandria Workshop.

Wolfgang Nejdl presented the Alexandria Workshop opening keynote on "Temporal Retrieval Exploration and Analytics in Web Archives"
Steffen Staab presented "Text Mining using LDA with Context"
Nattiya Kahnabua presented "Real-time Timeline Summarisation for High-impact Events in Twitter"
Besnik Fetahu presented "Wikipedia Enrichment"

Day 5

It was my last day in Hannover. I checked out from the conference hotel, Congress Hotel am Stadtpark Hannover. The hotel was located next to the conference venue and the views from the hotel were good. However, the experience at the hotel was not very good. It was located far away from the city center and there were no restaurants nearby. Despite complaints I have found an insect jumping on my laptop and bed on fifteenth floor, late night, for two consecutive nights. The basic Wi-Fi was useless and unreliable. In my opinion, nowadays, high-speed Wi-Fi in hotels should not be counted in luxury amenities, especially for business visitors. The hotel was not cheap either. These factors should be considered when choosing a conference venue and hotel by organizers.

I realized I still have some time to spare before I begin my journey. So, I decided to go to the conference venue where the Alexandria Workshop was ongoing. I was able to catch the keynote by Jane Winters in which she talked about many Web archiving related familiar projects. Then I headed to the Hannover city center to catch the train to Stendal.

"I know the rest of the story, since I received you in Stendal", Dr. Herzog interrupted me. We have reached home and it was already very late, hence, we called it a night and went to our beds.

Post-conference Days

After the conference, I spent a couple of days with Dr. Herzog's family on my way back. We visited Stendal University of Applied Sciences, met some interesting people for lunch at Schlosshotel Tangermünde, explored Potsdam by walking and biking, did some souvenir shopping and kitchen experiments, visited Dr. Herzog's daughter's school and the Freie Universität Berlin campus along with many other historical places on our way, and had dinner in Berlin where I finally revealed the secret of the disappearing earphone magic trick to Mrs. Herzog. On Sunday morning Dr. Herzog dropped me to the Berlin airport.

Dr. Herzog is a great host and tour guide. He has a beautiful, lovely, and welcoming family. Visiting his family is a single sufficient reason for me to visit Germany anytime.

--
Sawood Alam

↧

2016-10-25: Paper in the Archive

October 25, 2016, 6:01 am

≫ Next: 2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

≪ Previous: 2016-10-24: 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) Trip Report

Mat reports on his journalistic experience and how we can relive it through Internet Archive (#IA20)

We have our collections, the things we care about, the mementos that remind us of our past. Many of these things reside on the Web. For those we want to recall and should have (in hindsight) saved, we turn to the Internet Archive.

As a computer science (CS) undergrad at University of Florida, I worked at the student-run university newspaper, The Independent Florida Alligator. This experience became particularly relevant with my recent scholarship to preserve online news. At the paper, we reported mostly on the university community, but also on news that catered to the ACRs through reports about Gainesville (e.g., city politics).

News is compiled late in the day to maximize temporal currency. I started at the paper as a "Section Producer" and eventually evolved to be a Managing Editor. I was in charge of the online edition, the "New Media" counterpart of the daily print edition -- Alligator Online. The late shift fit well with my already established coding schedule.

Proof from '05, with the 'thew' still intact.

The Alligator is an independent newspaper -- the content we published could conflict with the university without fear of being censored by the university. Typical associated college newspapers have this conflict of interest, which potentially limits their content only to that which is approved. This was part of the draw to the paper for me and I imagine, the student readers seeking less biased reporting. The orange boxes were often empty well before day's end. Students and ACRs read the print paper. As a CS student, I preferred Alligator Online.

With a unique technical perspective among my journalistic peers, I introduced a homebrewed content management system (CMS) into the online production process. This allowed Alligator Online to focus on porting the print content and not on futzing with markup. This also made the content far more accessible and, as time has shown thanks to Internet Archive, preservable.

Internet Archive's capture of Alligator Online at alligator.org over time with my time there highlighted in orange.

After graduating from UF in 2006, I continued to live and work elsewhere in Gainesville for a few years. Even then technically an ACR, I still preferred Alligator Online to print. A new set of students transitioned into production of Alligator Online and eventually deployed a new CMS.

Now as a PhD student of CS studying the past Web, I have observed a resultant decline in accessibility that occurred after I had moved on from the paper. This corresponds further with our work On the Change in Archivability of Websites Over Time (PDF). Thankfully, adaptations at Alligator Online and possibly IA have allowed the preservation rate to recover (see above, post-tenure).

alligator.org before (2004) and after (2006) I managed, per captures by Internet Archive.

With Internet Archive celebrating 20 years in existence (#IA20), IA has provided the means for me to see the aforementioned trend in time. My knowledge in the mid-2000s of web standards and accessibility facilitated preservation. Because of this, with special thanks to IA, the collections of pages I care about -- the mementos that remind me of my past -- are accessible and well-preserved.

— Mat (@machawk1)

NOTE: Only after publishing this post I thought to check alligator.org's robots.txt file as archived by IA. The final capture of alligator.org in 2007 before the next temporally adjacent one in 2009 occurred on August 7, 2007. At that time (and prior), no robots.txt file existed for alligator.org despite IA preserving the 404. Around late October of that same year, a robot.txt file was introduced with the lines:
User-Agent: *
Disallow: /

↧

2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

October 25, 2016, 5:59 pm

≫ Next: 2016-10-25: Web Archive Study Informs Website Design

≪ Previous: 2016-10-25: Paper in the Archive

Web Archives perform the crucial service of preserving our collective digital heritage. October 26, 2016 marks the 20th anniversary of the Internet Archive, and the United States presidential Election will take place November 8, 2016. To commemorate both occasions, let us look at the 2008 and 2012 US General Elections as told by Web Archives from the perspectives of CNN and Fox News. We started with three news media - MSNBC, CNN and Fox News in order to capture both ends of the political spectrum. However, msnbc.com has redirected to various different URLs in the past (e.g., msnbc.msn.com, nbcnews.com) and the result is that the site is not well-archived.

Obama vs McCain - Fox News (2008)

Obama vs McCain - CNN (2008)

Obama vs Romney - Fox News (2012)

The archives show that the current concerns about voter fraud and election irregularities are not new (at least on Fox News, we did not find corresponding stories at CNN).

This Fox News page contains a story titled: "Government on High Alert for Voter Fraud" (2008)

Fox News: "Trouble at the ballot box" (2008)

Fox News claims a mural of Obama at a Philly polling station, that was ordered to be covered by a Judge, was not properly covered (2012)

Fox News reports about Election day Monitors on the lookout for voter fraud and "funny business" (2012)

Obama vs Romney - CNN (2012)

We appreciate the ability to tell these stories by virtue of the presence of public Web archives such as the Internet Archive. We also appreciate frameworks such as the Memento protocol that provide a means to access multiple web archives, and tools such as Sawood's Memgator which implements the memento protocol. For the comprehensive list of mementos (extracted with Memgator) for these stories see: Table vis or Timeline vis.

--Nwala

↧

2016-10-25: Web Archive Study Informs Website Design

October 25, 2016, 6:44 pm

≫ Next: 2016-10-26: They should not be forgotten!

≪ Previous: 2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

Shortly after beginning my Ph.D. research with the Old Dominion University Web Science and Digital Libraries team, I also rediscovered a Hampton Roads folk music non-profit I had spent a lot of time with years before. Somehow I was talked into joining the board (not necessarily the most sensible thing when pursuing a Ph.D.).

My research area being digital preservation and web archiving, I decided to have a look at the Tidewater Friends of Folk Music (TFFM) website and its archived web pages (mementos). Naturally, I looked at oldest copy of the home page available, 2002-01-25. What I found is definitely reminiscent of early, mostly hand-coded HTML:

tffm.org 2002-01-25 23:57:26 GMT (Internet Archive)
https://web.archive.org/web/20020125235726/http://tffm.org/

Of course the most important thing for most people is concerts, so I had a look at the concerts page too (interestingly, the newest concerts page available is five years newer than the oldest home page—this phenomena was the subject was of my JCDL 2013 paper.).

tffm.org/concerts 2007-10-07 06:17:32 GMT (Internet Archive)
https://web.archive.org/web/20071007061732/http://tffm.org/concerts.html

Clicking my way through the home and concert page and mementos, I found little had changed over time other than masthead image.


2005-08-26 21:05:28 GMT	2005-12-11 09:23:55 GMT	2009-08-31 06:31:40 GMT

The end result is that I became, and remain, TFFM’s web master. However, studying web archive quality, that is completeness and temporal coherence, has greatly influenced my redesigns of the TFFM website. First up was bringing the most important information to the forefront in a much more readable and navigable format. Here is a memento captured 2011-05-23:

tffm.org 2011-05-23 11:10:54 GMT (Internet Archive)
https://web.archive.org/web/20110523111054/http://www.tffm.org/concerts.html

As part of the redesign, I put my new-found knowledge of archival crawler to use. The TFFM website now had a proper sitemap and every concert its own URI with very few URI aliases. This design lasted until the TFFM board decided to replace “Folk” with “Acoustic,” changing the name to Tidewater Friends of Acoustic Music (TFAM).

Along with the change came a brighter look and mobile-friendly design. Again, putting knowledge from my Ph.D. studies to work, the mobile-friendly design is responsive, adapting to the user’s device, rather than incorporating a second set of URIs and independent design. With the response approach, archived copies replay correctly in both mobile and desktop browsers.

tidewateracoustic.org 2014-10-07 01:56:07 GMT
https://web.archive.org/web/20141007015607/http://tidewateracoustic.org/

After watching several fellow Ph.D. students struggle with the impact of JavaScript and dynamic HTML on archivability, I elected to minimized the use of JavaScript on the TFAM the site. JavaScript greatly complicates web archiving and reduces archive quality significantly.

So, the sensibility of taking on a volunteer website project while pursuing my Ph.D. aside, I can say that in some ways the two have synergy. My Ph.D. studies have influenced the design of the TFAM website and the TFAM website is a small, practical, and personal proving ground for my Ph.D. work. The two have complemented each other well.

Enjoy live music? Check out http://tidewateracoustic.org!

— Scott G. Ainsworth

↧

2016-10-26: They should not be forgotten!

October 25, 2016, 9:56 pm

≫ Next: 2016-10-27: UrduTech - The GeoCities of Urdu Blogosphere

≪ Previous: 2016-10-25: Web Archive Study Informs Website Design

Source: http://www.masrawy.com/News/News_
Various/details/2015/6/7/596077/أسرة-الشهيد-أحمد-بسيوني
-فوجئنا-بصورته-على-قناة-الشرق-والقناة-نرفض-التصريح

I remembered his face and smile very well. It was very tough for me to look at his smile and realize that he will not be in this world again. It got worse for me when I read his story and many others who had died defending the future of my home country, Egypt, hoping to draw a better future for their kids. Ahmed Basiony, one of Egypt’s great artists, was killed by the Egyptian Regime on the January 28th, 2011. One of the main reasons that drove Basiony to participate in the protests is filming police beatings to document the protests. While he was filming, he also used his camera during the demonstration to zoom on the soldiers and warn the people around him so they take cautions before they had gunfire. Suddenly, his camera fell down.

Basiony was a dad for two kids: one and six years old. He has been loved by everyone who knew him. I hope Basiony's and others' stories will remain for future generations.

Basiony was among the protests in the first days of the Egypt Revolution.
Source: https://www.facebook.com/photo.php?
fbid=206347302708907&set=a.139725092704462.24594.
100000009164407&type=3&theater

curl -I http://1000memories.com/egypt
HTTP/1.1 404 Not Found
Date: Tue, 25 Oct 2016 16:53:04 GMT
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=UTF-8

Basiony's information and many other martyrs were documented at the site 1000memories.com/egypt. The 1000memories site contained a digital collection of around 403 martyrs with information about their live. The entire Web site is unavailable now, and the Internet Archive is the only place where it was archived. Not only the 1000memories that has been disappeared, there are also many other repositories that contained videos, images, etc. that document the 18 days of the Egyptian Revolution disappeared. Examples are iamtahrir.com (archived version), which contained the artwork produced during the Egyptian Revolution, and 25Leaks.com (archived versions), which contained about 100s of important papers posted by people during the revolution. Both sites were created for collecting content related to the Egyptian Revolution.

An archived copy of 1000memories in Archive-It.

The Jan. 25 Egyptian Revolution is one of the most important events that has happened in recent history. Several books and initiatives have been published for documenting the 18 days of the Egyptian Revolution. These books cited many digital collections and other sites that were dedicated to document the Egyptian Revolution (e.g., 25Leaks.com). Unfortunately, the links to many of these Web sites are now broken and there is no way (without the archive) to know what they contained.

Luckily, 1000memories.com/egypt has multiple copies in the "Egypt Revolution and Politics" collection in Archive-It, a subscription service from the Internet Archive that allow institutions to develop, curate, and preserve collections of Web resources. I'm glad I found information of Basiony and many more martyrs archived!

Archiving Web pages is a method for ensuring these resources are available for posterity. My PhD research focused on exploring methods for summarizing and interacting with collections in Archive-It, and recording the events of the Egyptian Revolution spurred my initial interest in web archiving. My research necessarily focused on quantitative analysis, but this post has allowed me to revisit the humanity behind these web pages that would be lost without web archiving.

Sources:

--Yasmin

↧

2016-10-27: UrduTech - The GeoCities of Urdu Blogosphere

October 27, 2016, 9:54 pm

≫ Next: 2016-10-31: Two Days at IEEE VIS 2016

≪ Previous: 2016-10-26: They should not be forgotten!

On December 12, 2008, an Urdu blogger Muhammad Waris reported an issue in Urdu Mehfil about his lost blog that was hosted on UrduTech.net. Not just Waris, but many other Urdu bloggers of that time were anxious about their lost blogs due to a sudden outage of the blogging service UrduTech. The downtime lasted for several weeks which has changed the shape of the Urdu blogosphere.

Before diving into the UrduTech story, let's have a brief look into the Urdu language and the role of the Urdu Mehfil forum in promoting Urdu on the Web. Urdu is a language spoken by more than 100 million people worldwide (about 1.5% of the global population), primarily in India and Pakistan. It has a rich literature, while being one of the premier languages of poetry in South Asia for centuries. However, the digital footprint of Urdu has been relatively smaller than some other languages like Arabic or Hindi. In the early days of the Web, computers were not easily available to the masses of the Urdu speaking community. Urdu input support was often not built-in or would require additional software installation and configuration. The right-to-left (RTL) direction of the text flow in Urdu script was another issue of writing and reading it on devices that were optimized for left-to-right languages. There were not many fonts that support Urdu character set completely and properly. The most commonly used Nastaleeq typeface was initially only available in a propriety page-making software called InPage which did not support Unicode and locked-in the content of books and news papers. Early online Urdu news sites used to export the content as images and publish on the web.

Urdu community used to write Urdu text in Roman script on the Web initially, but the efforts of promoting Unicode Urdu were happening on small scales; one such early effort was Urdu Computing Yahoo Group by Eijaz Ubaid. In the year 2005, some people from the Urdu community including Nabeel, Zack, and many others took an initiative to build a platform to promote Unicode Urdu on the Web and created UrduWeb and a discussion board under that with the name Urdu Mehfil. This has quickly become the hub for Urdu related discussions, development, and idea exchange. The community created tools to ease the process of reading and writing Urdu in computers and on the Web. They created many beautiful Urdu fonts and keyboard layouts, translated various software and CMS systems and customized themes to make them RTL friendly, created dictionaries and encyclopedia, developed plugins for various software to enable Urdu in them, developed Urdu variants of Linux OS, provided technical help and support, digitized printed books, created Urdu blog aggregator (Saiyarah) to promote blogging and increase the visibility of new bloggers, and gave a platform to share literary work. These are just a few of many contributions of UrduWeb. These efforts played a significant role in shaping up the presence of Urdu on the Web.

I, Sawood Alam, am associated with UrduWeb since early 2008 with my continuing interest in getting the language and culture online. For the last seven years I am administering UrduWeb. In this period I have mentored various projects, developed many tools, and took various initiatives. I recently collaborated with Fateh, another UrduWeb member, to published a paper entitled, "Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages" (PDF), in an effort to enable easy and fast lookup in many classical and culturally significant Urdu dictionaries that are available in scanned form in the Internet Archive.

To give a sense of the increased activity and presence of Urdu on the Web we can take a couple examples. In the year 2007 when UrduTech was introduced as a blogging platform, Urdu Wikipedia was in the third group of languages on Wikipedia based on the number of articles, with only 1,000+ articles. Fast forward eight years, now in 2016 it has jumped to the second group of languages with 100,000+ articles and actively growing.

In May, 2015 Google Translate Community hosted a translation challenge, in which Urdu languages surfaced in the top ten most contributing languages that was highlighted by Google Translate as, "Notably Bengali and Urdu are in the lead along with some larger languages."

Now, back to the Urdu blogging story, in the year 2007, WordPress CMS was the most popular blogging software for those who can afford to host their site and make it work. For those who were not technically sound or did not want to pay for hosting, WordPress and Blogger were among the most popular hosted free blogging platforms. However, when it comes to Urdu, both platforms had some limitations. WordPress allowed flexible options of plugins, translations, and theming etc., but only if one runs the CMS on their server, hosted free service in contrast, had limited number of themes of which none were RTL friendly and it did not allow custom plugins either. This means, changing CSS to better suit the rendering of the mixed bidirectional content was not allowed that would render the lines containing bidirectional text (which is not uncommon in Urdu) in an unnatural and unreadable order. Lack of custom plugin support would also mean that providing JavaScript based Urdu input support in the reply form was not an option as a result articles would receive more comments in Roman script than in Urdu. On the other hand, blogger allowed theme customization, but the comment form was rendered inside an iframe that had no way to inject external JavaScript in it to allow Urdu input support. As a result, those Urdu bloggers who chose one of these hosted free blogging services had some compromises.

The technical friction of getting things to work for Urdu was a big reason for the slow adoption of Urdu blogging. To make it easier, Imran Hameed, a member of UrduWeb, introduced UrduTech blogging service. People from UrduWeb including Mohib, Ammar, Mawra, and some others encouraged many people to start Urdu blogging. UrduTech used WordPress MU to allow multi-user blogging on a single installation. It was hosted on a shared hosting service. Creating a new blog was as simple as filling an online form with three fields and hit the "Next" button. From there, one can choose from a handful of beautiful RTL-friendly themes and enable pre-installed add-ons to allow Urdu input support, both in the dashboard for post writing and on the public facing site for comments. Removing all the frictions WordPress and Blogger had, UrduTech gave a big boost to the Urdu community and many people started creating their blogs.

It turned out that creating a new blog on UrduTech was easy not just for legitimate people, but for spammers as well. This is evident from the earliest capture of UrduTech.net in the Internet Archive. Unfortunately, the styleseets, images, and other resources were not well archived, so please bear with the ugly looking (damaged Memento) screenshots.

Later captures in the web archive show that as the Urdu bloggers community grew on UrduTech, so did the attack from spam bots. This has increased the burden of the moderation to actively and regularly clean the spam registrations.

The service ran for a little over a year with occasional minor down times. Urdu blogosphere has started evolving slowly and the diversity of the content increased. During this period, some people have slowly started migrating to other blogging platforms such as their personal free or paid hosting, other Urdu blogging offerings, or hosted free services of WordPress and Blogger. This is evident from the blogroll of various bloggers in their archived copies.

Increasing activity on UrduTech from both human and bots lead to the point where the shared hosting provider decided to shut the service down without any warning. People were anxious of the sudden loss of their content and demanding for the backup. Who makes backups? (Hint: Web archives!) Imran, the founder of the service, was busy in his other priorities that took him more than a month to bring the service back online. In the interim, people either decided to never do blogging again or swiftly moved on to other more robust options to start over from scratch (so did Waris) with the lesson learned the hard way to make backup of their content regularly.

"Did Waris really lost all his hard work and hundreds of valuable articles he wrote about Urdu and Persian literature and poetry?" I asked myself. The answer was perhaps to be found somewhere in 20,000 hard drives of the Internet Archive. However, I didn't know his lost blog's URL, but the Internet Archive was there to help. I first looked through a few captures of the UrduTech in the archive, from there I was able to find his blog link. I was happy to discover that his blog's home page a was archived a few times, however the permalinks of individual blog posts were not. Also, the pages of the blog home with older posts were not archived either. This means, from the last capture, only the 25 latest posts can be retrieved (without comments). When other earlier captures of the home page are combined, a few more posts can be archived, but perhaps not all of them. Although the stylesheet and various template resources are missing, the images in the post are archived, which is great.

What happened to the UrduTech service? When it came back online after a long outage, many people have already lost their interest and trust in the service. In less than three months, the service went down again, but this time it was the ultimate death of the service until the domain name registration expired.

Due to its popularity and search engine ranking, the domain was a good target for drop catching. Mementos (captures) during November 27, 2011 and December 18, 2014 show a blank page when viewed using WayBack Machine. A closer inspection of the page source reveals what is happening there. Using JavaScript, the page is loaded in the top frame (if not already) and the page has frames to load more content. Unfortunately, resources in the frame are not archived, so it is difficult to say how the page might have looked in that duration. However, there is some plain text for "noframe" fallback that reveals that the domain drop catchers were trying to exploit the "tech" keyword present in the UrduTech name, though they have nothing to do with Urdu.

Sometime before March 25, 2015, the domain name was presumably went through another drop catch. Alternatively, it is possible that the same domain name owner has decided to host a different type of content on that domain. Whatever is the case, since then the domain is serving a health-related "legitimate-looking fake" site, it is still live, and adding new content every now and then. However, the content of the site has nothing to do with neither "Urdu" nor "tech".

UrduTech simplified a challenging task at that time, made it accessible to people with the little technical skills, proliferated the community, killed the service, but the community has moved on (though the hard way) and transformed into a more mature and stable blogosphere. It has played the same role for Urdu blogging what the GeoCities did for personal home page hosting, only on a smaller scale for a specific community. Over the time the Web technology matured, support for Urdu in computer and smart phones became better, awareness of the tools and technologies grew in the community in general, and various new communication media such as social media sites helped spread the word and connect people together. Now, the Urdu blogosphere has grown significantly and people in the community organize regular meetups and Urdu blogger conferences. Manzarnamah, another initiative from UrduWeb members, introduces new bloggers in the community, publishes interviews of regular bloggers, and distributes annual awards to bloggers. Bilal, another member of the UrduWeb, is independently creating tools and guides to help new bloggers and the Urdu community in general. UrduTech was certainly not the only driving force for Urdu blogging, but it did play a significant role.

On the occasion of 20th birthday celebration of the Internet Archive (#IA20), on behalf of WS-DL Research Group and the Urdu community I extend my gratitude for preserving the Web for 20 long years. Happy Birthday Internet Archive, keep preserving the Web for many many more years to come. I could only wish that the preservation was more complete and less damaged, but having something is better than nothing and as DSHR puts it, "You get what you get and you don't get upset". Without these archived copies I would not be able to augment my own memories and tell the story of the evolution of a community that is very dear to me and to many others. I can only imagine how many more such stories are buried in the spinning discs of the Internet Archive.

--
Sawood Alam

↧

2016-10-31: Two Days at IEEE VIS 2016

October 31, 2016, 6:44 am

≫ Next: 2016-11-03: Jones International University: A Look Back at a Controversial Online Institution

≪ Previous: 2016-10-27: UrduTech - The GeoCities of Urdu Blogosphere

I attended a brief portion of IEEE VIS 2016 in Baltimore on Monday and Tuesday (returned to ODU on Wednesday for our celebration of the Internet Archive's 20th anniversary). As the name might suggest, VIS is the main academic conference for visualization researchers and practitioners. I've taught our graduate Information Visualization course for several years (project blog posts, project gallery), so I've enjoyed being able to attend VIS occasionally. (Mat Kelly presented a poster in 2013 and wrote up a great trip report.)

This year's conference was held at the Baltimore Hilton, just outside the gates of Oriole Park at Camden Yards. If there had been a game, we could have watched during our breaks.

My excuse to attend this year (besides the close proximity) was that another class project was accepted as a poster. Juliette Pardue, Mridul Sen (@mridulish), and Christos Tsolakis took a previous semester's project, FluNet Vis, and generalized it. WorldVis (2-pager, PDF poster, live demo) allows users to load and visualize datasets of annual world data with a choropleth map and line charts. It also includes a scented widget in the form of a histogram showing the percentage of countries with reported data for each year in the dataset.

Before I get to the actual conference, I'd like to give kudos to whomever picked out the conference badges. I loved having a pen (or two) always handy.

Monday, October 24

Monday was "workshop and associated events" day. If you're registered for the full conference, then you're able to attend any of these pre-main conference events (instead of having to pick and register for just one). This is nice, but results in lots of conflict in determining which interesting session to attend. Thankfully, the community is full of live tweeters (#ieeevis), so I was able to follow along with some of the sessions I missed. It was a unique experience to be at a conference that appealed not only to my interest in visualization, but also to my interests in digital humanities and computer networking.

I was able to attend parts of 3 events:

VizSec (IEEE Symposium on Visualization for Cyber Security) - #VizSec
Vis4DH (Workshop on Visualization for the Digital Humanities) - #Vis4DH
BELIV (Beyond Time and Errors: Novel Evaluation Methods for Visualization) Workshop - #BELIV

VizSec

The VizSec keynote, "The State of (Viz) Security", was given by Jay Jacobs (@jayjacobs), Senior Data Scientist at BitSight, co-author of Data-Driven Security, and host of the Data-Driven Security podcast. He shared some of his perspectives as Lead Data Analyst on multiple Data Breach Investigations Reports. His data-driven approach focused on analyzing security breaches to help decision makers (those in the board room) better protect their organizations against future attacks. Rather than detecting a single breach, the goal is to determine how analysis can help them shore up their security in general. He spoke about how configuration (TLS, certificates, etc.) can be a major problem and that having a P2P share on the network indicates the potential for botnet activity.

In addition, he talked about vis intelligence and how CDFs and confidence intervals are often lost on the general public.

Now @jayjacobs arguing that we need visual literacy in support of security — optimal charts often hard to interpret. #VizSec #ieeevis
— Lane Harrison (@laneharrison) October 24, 2016

He also mentioned current techniques in displaying IT risk and how some organizations allow for manual override of the analysis.

Fair criticism from @jayjacobs on traditional risk registers & heat maps, & the impact of manually overriding model results #ieeevis #vizsec pic.twitter.com/6JEBeNFuE9
— Savannah Fitzwater (@Atomic_Fitz) October 24, 2016

And then during question time, a book recommendation: How to Measure Anything in Cybersecurity Risk

Great ? from audience about how to improve heatmaps. @jayjacobs recommends "How to Measure Anything in Cybersecurity Risk"#ieeevis #vizsec
— Savannah Fitzwater (@Atomic_Fitz) October 24, 2016

In addition to the keynote, I attended sessions on Case Studies and Visualizing Large Scale Threats. Here are notes from a few of the presentations.

"CyberPetri at CDX 2016: Real-time Network Situation Awareness", by Dustin Arendt, Dan Best, Russ Burtner and Celeste Lyn Paul, presented analysis of data gathered from the 2016 Cyber Defense Exercise (CDX).

"Uncovering Periodic Network Signals of Cyber Attacks", by Ngoc Anh Huynh, Wee Keong Ng, Alex Ulmer and Jörn Kohlhammer, looked at analyzing network traces of malware and provided a good example of evaluation using a small simulated environment and real network traces.

"Bigfoot: A Geo-based Visualization Methodology for Detecting BGP Threats", by Meenakshi Syamkumar, Ramakrishnan Durairajan and Paul Barford, brought me back to my networking days with a primer on BGP.

"Understanding the Context of Network Traffic Alerts" (video), by Bram Cappers and Jarke J. van Wijk, used machine learning on PCAP traces and built upon their 2015 VizSec paper "SNAPS: Semantic Network traffic Analysis through Projection and Selection" (video).

"An alert does not stand on its own." Bram Cappers argues for context-understanding tools for network traffic alerts. #ieeevis #vizsec
— Steven R. Gomez (@steveg_cs) October 24, 2016

Fantastic demo from Cappers and van Wijk's work on ConTA at #VizSec. If you missed it, a video is up at https://t.co/e6ewqGAD1t #IEEEVIS
— Sophie Engle (@sjengle) October 24, 2016

Vis4DH

DJ Wrisley (@djwrisley) put together a great Storify with tweets from Vis4DH.

Super excited to attend Visualization for DH workshop #VIS4DH #ieeevis& looking forward to sharing when I get back @uglibrary @DHatGuelph! pic.twitter.com/wwmv04cxkU
— Lucia Costanzo (@luciacostanzo) October 24, 2016

Here's a question we also ask in my main research area of digital preservation:

#ieeevis #vis4dh“how do we deal with large collections in digital humanities?” Gregory Crane
— Jonathan C. Roberts (@jcrbrts) October 24, 2016

A theme throughout the sessions I attended was the tension between the humanities and the technical ("interpretation vs. observation", "rhetoric vs. objective"). Speakers advocated for technical researchers to attend digital humanities conferences, like DH 2016, to help bridge the gap and get to know others in the area.

There was also a discussion of close reading vs. distant reading.

AJ Bradley at #ieeevis #VIS4DH: "I don't care how many nouns are in this poem" - on the absurd focus on deriving insights from vis
— matt brehmer (@mattbrehmer) October 24, 2016

Distant reading, analyzing the structure of a work, is relatively easy to visualize (frequency of words, parts of speech, character appearance), but close reading is about interpretation and is harder to fit into a visualization. But the discussion did bring up the promise of distant reading as a way to navigate to close reading.

BELIV
I made the point to attend the presentation of the 2016 BELIV Impact Award so that I could hear Ben Shneiderman (@benbendc) speak. He and his long-time collaborator, Catherine Plaisant, were presented the award for their 2006 paper, "Strategies for Evaluating Information Visualization Tools: Multidimensional In-depth Long-term Case Studies".

Ben and Catherine have been working together for 29 years! #IEEEVIS pic.twitter.com/emGV9c30Vu
— Enrico Bertini (@FILWD) October 24, 2016

Ben's advice was to "get out of the lab" and "work on real problems".

I also attended the "Reflections" paper session, which consisted of position papers from Jessica Hullman (@JessicaHullman), Michael Sedlmair, and Robert Kosara (@eagereyes). Jessica Hullman's paper focused on evaluations of uncertainty visualizations, and Michael Sedlmair presented seven scenarios (with examples) for design study contributions:

propose a novel technique
reflect on methods
illustrate design guidelines
transfer to other problems
improve understanding of a VIS sub-area
address a problem that your readers care about
strong and convincing evaluation

Robert Kosara challenged the audience to "reexamine what we think we know about visualization" and looked at how some well-known vis guidelines have either recently been questioned or should be questioned.

I propose to establish the new Robert Kosara's Mantra "How Do We Know That?"#ieeevis.
— Enrico Bertini (@FILWD) October 24, 2016

Tuesday, October 25

The VIS keynote was given by Ricardo Hausmann (@ricardo_hausman), Director at the Center for International Development& Professor of the Practice of Economic Development, Kennedy School of Government, Harvard University. He gave an excellent talk and shared his work on the Atlas of Economic Complexity and his ideas on how technology has played a large role in the wealth gap between rich and poor nations.

When Adam Smith wrote his book, wealth gap between richest and poorest country was 4:1, today it is 256:1. Ricardo Hausmann at #IEEEVIS
— Ming the Merciless (@eagereyes) October 25, 2016

Hausmann argues that the difference between rich and poor is due to technology. #IEEEVIS
— Ming the Merciless (@eagereyes) October 25, 2016

Technology isn’t just tools and encoded knowledge, but critically the tacit knowledge that’s deep in people’s brains, nowhere else #IEEEVIS
— Ming the Merciless (@eagereyes) October 25, 2016

Tools as embodied knowledge; codes as codified knowledge; knowhow as tacit knowledge @ricardo_hausman #ieeevis
— Tamara Munzner (@tamaramunzner) October 25, 2016

Online version of @ricardo_hausman product space visualization https://t.co/nQBkcJraV7 (United States, 2014) @HarvardCID #dataviz #ieeevis pic.twitter.com/1LBiHJwpzq
— Romain Vuillemot (@romsson) October 25, 2016

When the Treemaps father @benbendc ask questions to the Treemaps fan and everyday user @ricardo_hausman #ieeevis pic.twitter.com/UXnPHqrlpJ
— Romain Vuillemot (@romsson) October 25, 2016

After the keynote, in the InfoVis session, Papers Co-Chairs Niklas Elmqvist (@NElmqvist), Bongshin Lee (@bongshin), and Kwan-Liu Ma described a bit of the reviewing process and revealed even more details in a blog post. I especially liked the feedback and statistics (including distribution of scores and review length) that were provided to reviewers (though I didn't get a picture of the slide). I hope to incorporate something like that in the next conference I have a hand in running.

I attended parts of both InfoVis and VAST paper sessions. There was a ton of interesting work presented in both. Here are notes from a few of the presentations.

"Visualization by Demonstration: An Interaction Paradigm for Visual Data Exploration" (website with demo video), by Bahador Saket (@bahador10), Hannah Kim, Eli T. Brown, and Alex Endert, presented a new interface for allowing relatively novice users to manipulate their data. Items start out as a random scatterplot, but users can rearrange the points into bar charts, true scatterplots, add confidence intervals, etc. just by manipulating the graph into the idea of what it should look like.

"Vega-Lite: A Grammar of Interactive Graphics" (video), by Arvind Satyanarayan (@arvindsatya1), Dominik Moritz (@domoritz), Kanit Wongsuphasawat (@kanitw), and Jeffrey Heer (@jeffrey_heer), won the InfoVis Best Paper Award. This work presents a high-level visualization grammar for building rapid prototypes of common visualization types, using JSON syntax. Vega-Lite can be compiled into Vega specifications, and Vega itself is an extension to the popular D3 library. Vega-Lite came out of the Voyager project, which was presented at InfoVis 2015. The authors mentioned that this work has already been extended - Altair is a Python API for Vega-Lite. One of the key features of Vega-Lite is the ability to create multiple linked views of the data. The current release only supports a single view, but the authors hope to have multi-view support available by the end of the year. I'm excited to have my students try out Vega-Lite next semester.

Best Paper at InfoVis goes to Vega Lite by @arvindsatya1, @domoritz, @kanitw, and @jeffrey_heer. Congrats! https://t.co/LXxkPSeBSM #IEEEVIS
— Ming the Merciless (@eagereyes) October 25, 2016

"HindSight: Encouraging Exploration through Direct Encoding of Personal Interaction History", by Mi Feng, Cheng Deng, Evan M. Peck, and Lane Harrison (@laneharrison), allows users to explore visualizations based on their own history in interacting with the visualization. The tool is also described in a blog post and with demo examples.

Introducing HindSight, a design technique for interaction history - #ieeevis

Post: https://t.co/FSgmeaszl3

Demos: https://t.co/hQZFgrmfnz pic.twitter.com/VFMPBvxsFf
— Lane Harrison (@laneharrison) October 24, 2016

"PowerSet: A Comprehensive Visualization of Set Intersections" (video), by Bilal Alsallakh (@bilalalsallakh) and Liu Ren, described a new method for visualizing set data (typically shown in Venn or Euler diagrams) in a rectangle format (similar to a treemap).

On Tuesday night, I attended the VisLies meetup, which focused on highlighting poor visualizations that had somehow made it into popular media. This website will be a great resource for my class next semester. I plan to ask each student to pick one of these and explain what went wrong.

Although I was only able to attend two days of the conference, I saw lots of great work that I plan to bring back into the classroom to share with my students.

(In addition to this brief overview, check out Robert Kosara's daily commentaries (Sunday/Monday, Tuesday, Wednesday/Thursday, Thursday/Friday) at https://eagereyes.org.)

-Michele (@weiglemc)

↧

2016-11-03: Jones International University: A Look Back at a Controversial Online Institution

November 3, 2016, 4:43 pm

≫ Next: 2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

≪ Previous: 2016-10-31: Two Days at IEEE VIS 2016

I’m currently teaching the undergraduate CS 462 Cybersecurity Fundamentals course which is delivered online using a combination of Blackboard and ODU’s Distance Learning PLE (personal learning environment). Although I’ve taught both online and on campus for many years, I was curious as to which institution initially helped to trigger the transition from physical campuses to the virtual environments used today. While people are paying a lot of attention to MOOCs (Massive Open Online Course) these days due to the sheer size of their enrollment, online learning has been around for quite some time. In the 1990s and 2000s, distance learning using the Internet was embraced by many universities seeking to include alternative modes of instruction for their students. I happened upon this interactive infographic created by a team of Ph.D.’s and NASA scientists at Post University which depicts “the Evolution of Distance Learning in Higher Education.”

One milestone in the infographic notes one particular for-profit institution, Jones International University (JIU), as the first accredited online university; offering degrees primarily in business and education. Although, JIU was a small and not very well-known university, its accreditation was met with criticism and fostered much debate regarding the sanctity and overall mission of higher education outside of traditional brick and mortar institutions.

If you visit the homepage of JIU today, you’ll be greeted by a dialog box which states the university is now closed and all student information has been turned over to the Colorado Department of Higher Education. So, what happened here? If we proceed a little further into the site, we learn that JIU was a fully functioning university from 1999 through February 2016. Located at the time in Centennial, Colorado, the school was forced to shut down following a 55% drop in enrollment during the period between 2011 and the end of 2014. Another factor which contributed to its demise was increased competition from traditional four-year colleges entering the online education market.

As expected, the website for JIU is no longer active, but what did it look like before the university shuttered its virtual doors and windows? When we check the Internet Archive, we find there isn’t much historical content available prior to 2005.

http://web.archive.org/web/*/http://www.jiu.edu/

However, there is an archived study “Lessons in Change: What We are Really Changing by Moving Education into Online Environment”, published in April 2001, which provides a glimpse into the heated discussions in academic circles which surrounded the accreditation of JIU. Lack of sufficient, full-time faculty ratios, low academic quality, and civil disputes among the administration are all documented.

https://archive.org/details/ERIC_ED461898

When we look at the first available archived web page dated October 30, 2005, we see that many of the embedded images cannot be retrieved and the alternative text is displayed instead. It's ironic that we cannot fully visualize the page given the articulated goal of the school’s founder, media mogul Glenn R. Jones, “to develop and deliver rich content and learning to adult learners across the world via the Internet and Web.”

http://web.archive.org/web/20051030024451/http://www.jiu.edu/

As we delve further into the university's history, we take a deeper look at one of those early days in 2008 when the web site was crawled numerous times. Here’s what the site looked like on July 31, 2008. We see richer graphics in the design and there’s a section for online classes and even a self-assessment which evaluates a student's capacity for online learning.

http://web.archive.org/web/20080731004257/http://www.jiu.edu/

Here, we see the last fully functioning web page for JIU which was archived on March 28, 2015. It’s somewhat ironic that one of the last pages posted by the university includes a caption which states “dreams don’t have expiration dates” given that the school’s closing was imminent.

http://web.archive.org/web/20150328021130/http://www.jiu.edu/

And, finally we see the notice to students of the impending closure which was first archived on November 24, 2015. Students with a year or less left were “able to complete their courses and graduate from JIU.” The remainder of the student body was given the option to transfer to other institutions which offered many of the same courses.

https://web.archive.org/web/20151117161246/http://www.jiu.edu

The accreditation of the now-defunct Jones International University as an entirely virtual academic institution was historically significant when first announced back in 1993. Although the school’s brief history is only sparsely documented in the Internet Archives, we can still obtain some sense of how the university began, the critical reception it received from other academic professionals and ultimately how it was forced to cease operations.

Sources:

Crotty, James Marshal. Distance Learning Has Been Around Since 1892, You Big MOOC. Forbes. 14 Nov. 2012.
Chuang, Tamara. First accredited online school, Jones University, to shutter in 2016. The Denver Post. 02 Apr. 2015.
Ogden, Joslyn. Cyber U: The Accreditation of Jones International University. The Kenan Institute for Ethics at Duke University. n.d.

-- Corren McCoy

↧

2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

November 4, 2016, 10:31 pm

≫ Next: 2016-11-07: Linking to Persistent Identifiers with rel="identifier"

≪ Previous: 2016-11-03: Jones International University: A Look Back at a Controversial Online Institution

Internet Archive & Libyan newspapers logos

Colonel Gaddafi ruled Libya for 42 years after taking power from King Idris in a 1969 military coup. In August 2011, his regime was toppled in the so-called Arab Spring. For more than four decades, media in Libya was highly politicized to support Gaddafi’s regime and secure his power. After the Libyan revolution (in 2011), media became freed from the tight control of the government, and we have seen the establishment of tens if not hundreds of new media organizations. Here is an overview of one side, newspapers, of Gaddafi’s propaganda machine:

71 newspapers and magazines
All monitored and published by the Libyan General Press Corporation (LGPC)
The Jamahiriya News Agency (JANA) was the main source of domestic news
No real political function other than to polish the regime’s image
Publish information provided by the regime

The following are the Libyan most well-known newspapers which are all published by LGPC:

All Libyan newspaper websites are no longer controlled by the government

After the revolution, most of the Libyan newspapers' websites including the website of the Libyan General Press Corporation (LGPC) became controlled by foreign institutions, in particular, by an Egyptian company. Al Jamahiriya (www.aljamahiria.com/), El shams (alshames.com), and El Fajr El Jadid (www.alfajraljadeed.com/) became Egyptian news websites under different names: Jime News (www.news.aljamahiria.com/), Kifah Arabi (www.news.kifaharabi.com/), and El Fajr El Jadid Alakbaria while the El Zahf Al Akhdar (www.azzahfalakhder.com/) is now a German sport blog. Here are the logos of the new websites (the new websites remain with the same domain name except the alshames.com which redirects to www.news.kifaharabi.com/):

Can we still have access to the old state media?

After this big change in Libya with the fall of the regime, can we still have access to the old state media? (This question might apply to other countries as well. Would any political or regime change in any country lead to loss a part of its digital history?)

Fortunately, Internet Archive has captured thousands of snapshots of the Libyan newspapers' websites. The main pages of Al Jamahiriya (www.aljamahiria.com/), El shams (alshames.com), El Zahf Al Akhdar (www.azzahfalakhder.com/), and El Fajr El Jadid (www.alfajraljadeed.com/) have been captured 2310, 606, 1398, and 836 times, respectively, by the Internet Archive.

www.aljamahiria.com/ captured 2,310 times by the Internet Archive

www.azzahfalakhder.com/ captured 1,398 times by the Internet Archive

Praise for Qaddafi no longer on the live web

Although we can not conclude that the Internet Archive has captured everything due to the fact that the content in these newspapers was extremely redundant as they focus in praising the regime, the Internet Archive has captured important events, such as the regime's activities during the "2011"revolution, a lot of domestic news and the regime's interpretation of international news, many economic articles, the long process taken by Libyan authorities in order to establish the African Union, Gaddafi's speeches, etc. Below is an example of one of these articles during the Libyan "2011" revolution indicating the "there will be no future for Libya without our leader Gaddafi". This article is no longer available on the live web.

From the Internet Archive https://web.archive.org/web/20

110514103049/http://www.alfajraljadeed.com//full.pdf

From the live web http://www.alfajraljadeed.com//full.pdf

Slides about this post is also available:

--Mohamed Aturban

↧

2016-11-07: Linking to Persistent Identifiers with rel="identifier"

November 7, 2016, 9:05 am

≫ Next: 2016-11-16: Introducing the Local Memory Project

≪ Previous: 2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

Do you remember hearing about that study that found that people who are "good" at swearing actually have a large vocabulary, refuting the conventional wisdom about a "poverty-of-vocabulary"? The DOI (digital object identifier) for the 2015 study is:

http://dx.doi.org/10.1016/j.langsci.2014.12.003

But if you read about it in the popular press, such as the Independent or US News & World Report, you'll see that they linked to:

http://www.sciencedirect.com/science/article/pii/S038800011400151X

The problem is that although the DOI is the preferred link, browsers follow a series of redirects from the DOI to the ScienceDirect link, which is then displayed in the address bar of the browser, and that's the URI that most people are going to copy and paste when linking to the page. Here's a curl session showing just the HTTP status codes and corresponding Location: headers for the redirection:

$ curl -iL --silent http://dx.doi.org/10.1016/j.langsci.2014.12.003 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS038800011400151X%3Fvia%253Dihubkey=072c950bffe98b3883e1fa0935fb56a6f1a1b364
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X
HTTP/1.1 200 OK

Most publishers follow this model of a series of redirects to implement authentication, tracking, etc. While DOI use has made significant progress in scholarly literature, many times the final URL is the one that is linked to instead of the more stable DOI (see the study by Herbert, Martin, and Shawn presented at WWW 2016 for more information). Furthermore, while sometimes the mapping between the final URL and DOI is obvious (e.g., http://dx.doi.org/10.1371/journal.pone.0115253 --> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253), the above example proves that's not always the case.

Ad-hoc linking back to DOIs

One of the obstacles limiting the correct linking is that there is no standard, machine-readable method for the HTML from the final URI to link back to its DOI (and by "DOI" we also mean all other persistent identifiers, such as handles, purls, arks, etc.). In practice, each publisher adopts its own strategy for specifying DOIs in <meta> HTML elements:

In http://link.springer.com/article/10.1007%2Fs00799-016-0184-4 we see:

<meta name="citation_publisher" content="Springer Berlin Heidelberg"/>
<meta name="citation_title" content="Web archive profiling through CDX summarization"/>
<meta name="citation_doi" content="10.1007/s00799-016-0184-4"/>
<meta name="citation_language" content="en"/>
<meta name="citation_abstract_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_fulltext_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_pdf_url" content="http://link.springer.com/content/pdf/10.1007%2Fs00799-016-0184-4.pdf"/>

In http://www.dlib.org/dlib/january16/brunelle/01brunelle.html we see:

<meta charset="utf-8" />
<meta id="DOI" content="10.1045/january2016-brunelle" />
<meta itemprop="datePublished" content="2016-01-16" />
<meta id="description" content="D-Lib Magazine" />

In http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253 we see:

<meta name="citation_doi" content="10.1371/journal.pone.0115253" />
...
<meta name="dc.identifier" content="10.1371/journal.pone.0115253" />

In https://www.computer.org/csdl/proceedings/jcdl/2014/5569/00/06970187-abs.html we see:

<meta name='doi' content='10.1109/JCDL.2014.6970187' />

And in http://ieeexplore.ieee.org/document/754918/ there are no HTML elements specifying the corresponding DOI. Furthermore, HTML elements can only appears in HTML -- which means you can't provide Links for PDF, CSV, Zip, or other non-HTML representations. For example, NASA uses handles for the persistent identifiers of the PDF versions of their reports:

$ curl -IL http://hdl.handle.net/2060/19940023070
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Expires: Thu, 03 Nov 2016 17:47:07 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 221
Date: Thu, 03 Nov 2016 17:47:07 GMT

HTTP/1.1 301 Moved Permanently
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Location: https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Content-Length: 984250
Content-Type: application/pdf

And the final PDF obviously cannot use HTML elements to link back to its handle.

To address these shortcomings, and in support of our larger vision of Signposting the Scholarly Web, we are proposing a new IANA link relation type, rel="identifier", that will support linking from the final URL in the redirection chain (AKA as the "locating URI") back to the persistent identifier that ideally one would use to start the resolution. For example, in the NASA example above the PDF would link back to its handle with the proposed Link header in red:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier"
Content-Length: 984250
Content-Type: application/pdf

And in the Language Sciences example that we began with, the final HTTP response (which returns the HTML landing page) would use the Link header like this:

HTTP/1.1 200 OK
Last-Modified: Fri, 04 Nov 2016 00:36:50 GMT
Content-Type: text/html
X-TransKey: 11/03/2016 20:36:50 EDT#2847_006#2415#68.228.137.112
X-RE-PROXY-CMP: 1
X-Cnection: close
X-RE-Ref: 0 1478219810005195
Server: www.sciencedirect.com
P3P: CP="IDC DSP LAW ADM DEV TAI PSA PSD IVA IVD CON HIS TEL OUR DEL SAM OTR IND OTC"
Vary: Accept-Encoding, User-Agent
Expires: Fri, 04 Nov 2016 00:36:50 GMT
Cache-Control: max-age=0, no-cache, no-store
Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"
...

But it's not just the landing page that would link back to the DOI, but also the constituent resources that are also part of a DOI-identified object. Below is a request and response for the PDF file in the Language Sciences example, and it carries the same Link: response header as the landing page:

$ curl -IL --silent "http://ac.els-cdn.com/S038800011400151X/1-s2.0-S038800011400151X-main.pdf?_tid=338820f0-a442-11e6-9f85-00000aab0f6b&acdnat=1478451672_5338d66f1f3bb88219cd780bc046bedf"
HTTP/1.1 200 OK
Accept-Ranges: bytes
Allow: GET
Content-Type: application/pdf
ETag: "047508b07a69416a9472c3ac02c5a9a01"
Last-Modified: Thu, 15 Oct 2015 08:11:25 GMT
Server: Apache-Coyote/1.1
X-ELS-Authentication: SDAKAMAI
X-ELS-ReqId: 67961728-708b-4cbb-af64-bb68f1da03ea
X-ELS-ResourceVersion: V1
X-ELS-ServerId: ip-10-93-46-150.els.vpc.local_CloudAttachmentRetrieval_prod
X-ELS-SIZE: 417655
X-ELS-Status: OK
Content-Length: 417655
Expires: Sun, 06 Nov 2016 16:59:44 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sun, 06 Nov 2016 16:59:44 GMT
Connection: keep-alive
Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"

Although at first glance there seems to be a number of existing rel types (some registered and some not) that would be suitable:

rel="canonical"
rel="alternate"
rel="duplicate"
rel="related"
rel="bookmark"
rel="permalink"
rel="shortlink"

It turns out they all do something different. Below we explain why these rel types are not suitable for linking to persistent identifiers.

rel="canonical"

This would seem to be a likely candidate and it is widely used, but it actually exists for a different purpose: to "identify content that is either duplicative or a superset of the content at the context (referring) IRI." Quoting from RFC 6596:

If the preferred version of a IRI and its content exists at:

http://www.example.com/page.php?item=purse

Then duplicate content IRIs such as:

http://www.example.com/page.php?item=purse&category=bags
http://www.example.com/page.php?item=purse&category=bags&sid=1234

may designate the canonical link relation in HTML as specified in
[REC-html401-19991224]:

<link rel="canonical"
href="http://www.example.com/page.php?item=purse">

In the representative cases shown above, the DOI, handle, etc. is neither duplicative nor a superset of the content. For example, the URI of the NASA report PDF clearly bears some relation to its handle, but the PDF URI is clearly not duplicative nor a superset of the handle. This is reinforced by the semantics of the "303 See Other" redirection, which indicates there are two different resources with two different URIs*. rel="canonical" is ultimately about establishing primacy among the (possibly) many URI aliasesfor a single resource. For SEO purposes, this avoids splitting Pagerank.

Furthermore, publishers like Springer are already using rel="canonical" (highlighted in red) to specify a preferred URI in their chain of redirects:

$ curl -IL http://dx.doi.org/10.1007/978-3-319-43997-6_35
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
Expires: Mon, 31 Oct 2016 20:52:26 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 191
Date: Mon, 31 Oct 2016 20:40:48 GMT

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Location: http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
Server: Jetty(9.2.14.v20151106)
X-Environment: live
X-Origin-Server: 19t9ulj5bca
X-Vcap-Request-Id: 48d17c7e-2556-4cff-4b2b-0e6fbae94237
Content-Length: 0
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448562:07a49aef;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=d9cf189bedb640a9b5d55c9d0;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: <http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35>; rel="canonical"
Server: openresty
X-Environment: live
X-Origin-Server: 19ta3iq6v47
X-Served-By: core-internal.live.cf.private.springer.com
X-Ua-Compatible: IE=Edge,chrome=1
X-Vcap-Request-Id: 5a458b2c-de85-42cd-7157-022c440a9668
X-Vcap-Request-Id: 54b0e2dc-7766-4c00-4f95-d33bdb6c427a
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448766:c35e0847;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=1d67fdfb47ab4a5f94b43326e;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

And some publishers use it inconsistently. In this Elsevier example, the content from http://dx.doi.org/10.1016/j.acra.2015.10.004 is indexed at three different URIs:

Even if we accept that the PubMed version is a different resource (i.e., hosted at NLM instead of Elsevier) and should have a separate URI, Elsevier still maintains two different URIs for this article:

http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract
http://www.sciencedirect.com/science/article/pii/S1076633215004535

The DOI resolves to the former URI (academicradiology.org), but it is the latter (sciencedirect.com) that has in the HTML (and not in the HTTP response header):

<link rel="canonical" href="http://www.sciencedirect.com/science/article/pii/S1076633215004535">

Presumably to distinguish this URI from the various URIs that you get starting with http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 instead of the DOI:

$ curl -iL --silent http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectPrefsPerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1076633215004535%3Fvia%253Dihub&key=07077ac16f0a77a870586ac94ad3c000cfa1973f
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535
HTTP/1.1 200 OK

In summary, although "canonical" seems promising at first, the semantics are different from what we propose and publishers are already using it for internal linking purposes. This eliminates "canonical" from consideration.

rel="alternate"

This rel type has been around for a while and has some reserved historical definitions for stylesheets and RSS/Atom, but the general semantics for "alternate" is to provide "an alternate representation of the current document." In practice, this means surfacing different representations for the same resource, but varying in Content-type (e.g., application/pdf vs. text/html) and/or Content-Language (e.g., en vs. fr). Since a DOI, for example, is not simply a different representation of the same resource, "alternate" is removed from consideration.

rel="duplicate"

RFC 6249 specifies how resources can specify resources with different URIs are in face byte-for-byte equivalent. "duplicate" might suitable for stating equivalence between the PDFs linked at both http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract and http://www.sciencedirect.com/science/article/pii/S1076633215004535, but we can't use it to link back to http://dx.doi.org/10.1016/j.acra.2015.10.004.

rel="related"

Defined in RFC 4287, "related" is probably the closest to what we propose but its semantics are purposefully vague. A DOI is certainly related to locating URI, but it is also related to a lot of other resources as well: the other articles in a journal issue, other publications by the authors, citing articles, etc. Using "related" to link to DOIs could be ambiguous, and would eventually lead to parsing the linked URI for strings like "dx.doi.org", "handle.net", etc. -- not what we want to encourage.

rel="bookmark"

We initially hoped this could mean "when you press , use this URI instead of one in your address bar." Unfortunately, "bookmark" is instead used to identify permalinks for different sections of the document that it appears in. And as a result, it's not even defined for Link: HTTP headers, and thus eliminated from consideration.

rel="permalink"

It turns out that "permalink" was intended for what we thought "bookmark" would be used for, but although it was proposed, it was never registered nor did it gain significant traction ("bookmark" was used instead). It is most closely associated with the historical problem of creating deep links within blogs and as such we choose not to resurrect it for persistent identifiers.

rel="shortlink"

We include this one mostly for completeness since the semantics arguably provide the opposite of what we want: instead of a link to a persistent identifier, it allows linking to a shortened URI. Despite its widespread use, it is actually not registered.

The ecosystem around persistent identifiers is fundamentally different than that of shortened URIs even though they may look similar to the untrained eye. Putting aside the preservation nightmare scenario of bit.ly going out of business or Twitter deprecating t.co, "shortlink" could be used to complement "identifier". Revisiting the NASA example from above, the two rel types could be combined to link to both the handle and the nasa.gov branded shortened URI:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
<http://go.nasa.gov/2fkvyya>; rel="shortlink"
Content-Length: 984250
Content-Type: application/pdf

Combining rel="identifier" with other Links

The "shortlink" example above illustrates that "identifier" can be combined with other rel type for more expressive resources. Here we extend the NASA example further with rel="self":

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
<http://go.nasa.gov/2fkvyya>; rel="shortlink",
<http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf>; rel="self"
Content-Length: 984250
Content-Type: application/pdf

Now the HTTP response for the PDF is self-contained and unambiguously lists all of the appropriate URIs. We could also combine rel="identifier" with version information. arXiv.org does not issue DOIs or handles, but it does mint its own persistent identifiers. Here we propose, using rel types from RFC 5829, how version 1 of an eprint could link both to version 2 (both the next and current version) as well as the persistent identifier (which we also know to be the "latest-version"):

$ curl -I https://arxiv.org/abs/1212.6177v1
HTTP/1.1 200 OK
Date: Fri, 04 Nov 2016 02:31:19 GMT
Server: Apache
ETag: "Tue, 08 Jan 2013 01:02:17 GMT"
Expires: Sat, 05 Nov 2016 00:00:00 GMT
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=68.228.137.112.1478226679112962; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Tue, 08 Jan 2013 01:02:17 GMT
Link: <https://arxiv.org/abs/1212.6177>; rel="identifier latest-version",
<https://arxiv.org/abs/1212.6177v2>; rel="successor-version",
<https://arxiv.org/abs/1212.6177v1>; rel="self"
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=utf-8

The Signposting web site has further examples how rel="identifier" can be used to express the relationship between the persistent identifiers, the "landing page", the "publication resources" (e.g., the PDF, PPT), and the combination of both the landing page and publication resources. We encourage you to explore the analyses of existing publishers (e.g., Nature) and repository systems (e.g., DSpace, Eprints).

In summary, we propose rel="identifier" to standardize linking to DOIs, handles, and other persistent identifiers. HTML <meta> tags can't be used as headers in HTTP responses, and existing rel types such as "canonical" and "bookmark" have different semantics.

We welcome feedback about this proposal, which we intend to eventually standardize with an RFC and register with IANA. Herbert will cover these issues at PIDapalooza, and we will include the slides here after the conference.

--Michael & Herbert

* Technically, a DOI is a "digital identifier of an object" rather than "identifier of a digital object", and thus there is not a representation associated with the resource identified by a DOI (i.e., not an information resource). Relationships like "canonical", "alternate", etc. only apply to information resources, and thus are not applicable to most persistent identifiers. Interested readers are encouraged to further explore the HTTPRange-14 issue.

↧

2016-11-16: Introducing the Local Memory Project

November 15, 2016, 10:22 pm

≫ Next: 2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia

≪ Previous: 2016-11-07: Linking to Persistent Identifiers with rel="identifier"

Collage made from screenshot of local news websites across the US

The national news media has different priorities than the local news media. If one seeks to build a collection about local events, the national news media may be insufficient, with the exception of local news which “bubbles” up to the national news media. Irrespective of this “bubbling” of some local news to the national surface, the perspective and reporting of national news differs from local news for the same events. Also, it is well known that big multinational news organizations routinely cite the reports of smaller local news organizations for many stories. Consequently, local news media is fundametal to journalism.

It is important to consult local sources affected by local events. Thus the need for a system that helps small communities to build collections of web resources from local sources for important local events. The need for such a system was first (to the best of my knowledge) outlined by Harvard LIL. Given Harvard LIL's interest of helping facilitate participatory archiving by local communities and libraries, and our IMLS-funded interest of building collections for stories and events, my summer fellowship at Harvard LIL provided a good opportunity to collaborate on the Local Memory Project.

Our goal is to provide a suite of tools under the umbrella of the Local Memory Project to help users and small communities discover, collect, build, archive, and share collections of stories for important local events from local sources.

Local Memory Project dataset

We currently have a public json US dataset scraped from USNPL of:

5,992 Newspapers
1,061 TV stations, and
2,539 Radio stations

The dataset structure is documented and comprises of the media website, twitter/facebook/youtube links, rss/open search links, as well as geo-coordinates of the cities or counties in which the local media organizations reside. I strongly believe this dataset could be essential to the media research
community.

There are currently 3 services offered by the Local Memory Project:

1. Local Memory Project - Google Chrome extension:

This service is an implementation of Adam Ziegler and Anastasia Aizman's idea for a utility that helps one build a collection for a local event which did not receive national coverage. Consequently, given a story expressed by a query input, for a place, represented by a zip code input, the Google Chrome extension performs the following operations:

Retrieve a list of local news (Newspapers and TV stations) websites that serve the zip code
For each local news website search Google for stories from all the local news websites retrieved from 1.

The result is a collection of stories for the query from local news sources.

For example, given the problem of building a collection for Zika virus for Miami Florida, we issue the following inputs (Figure 1) to the Google Chrome Extension and click "Submit":

Figure 1: Google Chrome Extension, input for building a collection about Zika virus for Miami FL

After the submit button is pressed the application issues the "zika virus"query to Google with the site directive for newspapers and tv stations for the 33101 area.

Figure 2: Google Chrome Extension, search in progress. Current search in image targets stories about Zika virus from Miami Times

After the search, the result (Figure 3) was saved remotely.

Figure 3: A subset (see complete) of the collection about Zika virus built for the Miami FL area.

Here are examples of other collections built with the Google Chrome Extension (Figures 4 and 5):

Figure 4: A subset (see complete) of the collection about Simone Biles' return for Houston Texas

Figure 5: A subset (see complete) of the collection about Protesters and Police for Norfolk Virginia

The Google Chrome extension also offers customized settings that suit different collection building needs:

Figure 6: Google Chrome Extension Settings (Part 1)

Figure 7: Google Chrom Extension Settings (Part 2)

Google max pages: The number of Google search pages to visit for each news source. Increase if you want to explore more Google pages since the default value is 1 page.
Google Page load delay (seconds): This time delay between loading Google search pages ensures a throttled request.
Google Search FROM date: Filter your search for news articles crawled from this date. This comes in handy if a query spans multiple time periods, but the curator is interested in a definite time period.
Google Search TO date: Filter your search for news articles before this date. This comes in handy especially when combined with 3, it can be used to collect documents within a start and end time window.
Archive Page load delay (seconds): Time delay between loading pages to be archived. You can increase this time if you want to have the chance to do something (such as hit archive again) before the next archived page loads automatically. This is tailored to archive.is.
Download type: Download to your machine for a personal collection in (json or txt format). But if you choose to share, save remotely (you should!)
Collection filename: Custom filename for collection about to be saved.
Collection name: Custom name for your collection. It's good practice to label collections.
Upload a saved collection (.json): For json collections saved locally, you may upload them to revisualize the collection.
Show Thumbnail: A flag that decides whether to send a remote request to get a card (thumbnail summary) for the link. Since cards require multiple GET requests, you may choose to switch this off if you have a large collection.
Google news: The default search of the extension is the generic Google search page. Check this box to search teh Google news vertical instead.
Add website to existing collection: Add a website to an existing collection.

2. Local Memory Project - Geo service:

The Google Chrome extension utilizes the Geo service to find media sources that serve a zip code. This service is an implementation of Dr. Michael Nelson's idea for a service that supplies an ordered list of media outlets based on their proximity to a user-specified zip code.

Figure 8: List of top 10 Newspapers, Radio and TV station closest to zip code 23529 (Norfolk, VA)

3. Local Memory Project - API:

The local memory project Geo website is meant for human users, while the API website targets machine users. Therefore, it provide the same services as the Geo website but returns a json output (as opposed to HTML). For example, below is a subset output (see complete) corresponding to a request for 10 news media sites in order of proximity to Cambridge, MA.

{
"Lat": 42.379146, 
"Long": -71.12803, 
"city": "Cambridge", 
"collection": [
    {
"Facebook": "https://www.facebook.com/CambridgeChronicle", 
"Twitter": "http://www.twitter.com/cambridgechron", 
"Video": "http://www.youtube.com/user/cambchron", 
"cityCountyName": "Cambridge", 
"cityCountyNameLat": 42.379146, 
"cityCountyNameLong": -71.12803, 
"country": "USA", 
"miles": 0.0, 
"name": "Cambridge Chronicle", 
"openSearch": [], 
"rss": [], 
"state": "MA", 
"type": "Newspaper - cityCounty", 
"website": "http://cambridge.wickedlocal.com/"
    }, 
    {
"Facebook": "https://www.facebook.com/pages/WHRB-953FM/369941405267", 
"Twitter": "http://www.twitter.com/WHRB", 
"Video": "http://www.youtube.com/user/WHRBsportsFM", 
"cityCountyName": "Cambridge", 
"cityCountyNameLat": 42.379146, 
"cityCountyNameLong": -71.12803, 
"country": "USA", 
"miles": 0.0, 
"name": "WHRB 95.3 FM", 
"openSearch": [], 
"rss": [], 
"state": "MA", 
"type": "Radio - Harvard Radio", 
"website": "http://www.whrb.org/"
    }, ...

Saving a collection built with the Google Chrome Extension

Collection built on a user machine can be saved in one of two ways:

Save locally: this serves as a way to keep a collection private. Saving can be done by clicking "Download collection" in the Generic settings section of the extension settings. A collection can be saved in json or plaintext format. The json format permits the collection to be reloaded through "upload a saved collection" in the Generic settings section of the extension settings. The plaintext format does not permit reloading into the extension, but contains all the links which make up the collection.
Save remotely: in order to be able to share the collection you built locally with the world, you need to save remotely by clicking the "Save remotely" button on the frontpage of the application. This leads to a dialog requesting a mandatory unique collection author name (if one doesn't exist) and an optional collection name (Figure 10). After supplying the inputs the application saves the collection remotely and the user is presented with a link to the collection (Figure 11).

Before a collection is saved locally or remotely, you may choose to exclude an entire news source (all links from a given source) or a single news source as described by Figure 9:

Figure 9: Exclusion options before saving locally/remotely

Figure 10: Saving a collection prompts a dialog requesting a mandatory unique collection author name and an optional collection name

Figure 11: A link is presented after a collection is saved remotely

Archiving a collection built with the Google Chrome Extension

Saving is the first step to make a collection persist after it is built. However, archiving ensures that the links referenced in a collection persist even if the content is moved or deleted. Our application currently integrates archiving via Archive.is, but we plan to expand the archiving capability to include other public web archives.

In order to archive your collection, click the "Archive collection" button on the frontpage of the application. This leads to a dialog similar to the saving dialog which requests a mandatory unique collection author name (if one doesn't exist) and an optional collection name. Subsequently, the application archives the collection by first archiving the front page which contains all the local news sources, and secondly, the application archives the individual links which make up the collection (Figure 12). You may choose to stop the archiving operation at any time by clicking "Stop" on the archiving update orange-colored message bar. At the end of the archiving process, you get a short URI corresponding to the archived collection (Figure 13).

Figure 12: Archiving in progress

Figure 13: When the archiving is complete, a short link pointing to the archived collection is presented

Community collection building with the Google Chrome Extension

We envision a community of users contributing to a single collection for a story. Even though the collections are built in isolation, we consider a situation in which we can group collections around a single theme. To begin this process, the Google Chrome Extension lets you share a locally built collections on Twitter by clicking the "Tweet" button (Figure 14).

Figure 14: Tweet button enables sharing the collection

This means if user 1 and user 2 locally build collections for Hurricane Hermine, they may use the hashtags #localmemory and #hurricanehermine when sharing the collection. Consequently, all Hurricane Hermine-related collections will be seen via Twitter with the hashtags. We encourage users to include #localmemory and the collection hashtags in tweets when sharing collections. We also encourage you to follow the Local Memory Project on Twitter.

.@localmem local memory collection, hermine (Hilton Head SC): https://t.co/EaYsRYYGy5 https://t.co/wZQ8su9sIw #localmemory #hurricanehermine
— Alexander C. Nwala (@acnwala) September 3, 2016

The local news media is a vital organ of journalism, but one in decline. We hope by providing free and open source tools for collection building, we can contribute in some capacity to help its revival.

I am thankful for everyone who has contributed to the ongoing success of this project. From Adam, Anastasia, Matt, Jack and the rest of the Harvard LIL team, to my Supervisor Dr. Nelson and Dr. Weigle, and Christie Moffat at the National Library of Medicine, as well as Sawood and Mat and the rest of my colleagues at WSDL, thank you.

--Nwala

↧

2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia

November 16, 2016, 11:45 am

≫ Next: 2016-11-21: WS-DL Celebration of #IA20

≪ Previous: 2016-11-16: Introducing the Local Memory Project

Image is taken from Wikipedia

Indonesia and Australia are neighboring countries that, just like what always happens between neighbors, have a hot-and-cold relationship. The History has recorded a number of disputes between Indonesia and Australia, from East Timor disintegration (now Timor Leste) in 1999 to the Bali Nine case (the execution of Australian drug smugglers) in 2015. One of the issues that has really caused a stir in Indonesia-Australia's relationship is the spying imbroglio conducted by Australia toward Indonesia. The tension arose when an Australian newspaper The Sydney Morning Herald published an article titled Exposed: Australia's Asia spy network and a video titled Spying at Australian diplomatic facilities on October 31st, 2013. It revealed one of Edward Snowden's leaks that Australia had been spying on Indonesia since 1999. This startling fact surely enraged Indonesia's government and, most definitely, the people of Indonesia.

Indonesia strongly demanded clarification and an explanation by summoning Australia's ambassador, Greg Moriarty. Indonesia also demanded Australia to apologize. But Australia refused to apologize by arguing that this is something that every government will do to protect its country. The situation was getting more serious when it was also divulged that an Australian security agency attempted to listen in on Indonesian President Susilo Bambang Yudhoyono's cell phone in 2009. Yet, Tony Abbott, Australia's prime minister at that time, still refused to give either explanation or apology. This caused President Yudhoyono to accuse Tony Abbott of 'belittling' Indonesia's response to the issue. All of these situations made the already enraged Indonesian became more furious. Furthermore, Indonesian people judged that the government was too slow in following up and responding to this issue.

Image is taken from The Australian

To channel their frustration and anger, a group of Indonesian hacktivists named 'anonymous Indonesia' launched a number of attacks to hundreds of Australian websites that were chosen randomly. They hacked and defaced those websites to spread the message 'stop spying on Indonesia'. Over 170 Australian websites were hacked during November 2013, some of them are government websites such as Australian Secret Intelligence Service (ASIS), Australian Security Intelligence Organisation (ASIO), and Department of Foreign Affairs and Trade (DFAT).

Australian hackers also took revenge by attacking several important Indonesian websites such as the Ministry of Law and Human Rights and Indonesia's national airline, Garuda Indonesia. But, the number of the attacked websites is not as many as what have been attacked by the Indonesians. These websites are already recovered now and they look as if the attacks never happened. Fortunately, those who never heard this spying row before, could take advantage of using Internet Archive and go back in time to see how those websites looked like when they got attacked. Unfortunately, not all of those attacked websites have archives for November 2013. For example, according to Sydney Morning Herald and Australian Broadcasting Corporation, the ASIS websites were hacked on November 11, 2013. The Australian newspaper also reported that ASIO website was also hacked on November 13, 2013. But, these incidents were not archived by the Internet Archive as we cannot see any snapshot for the given dates.

https://web.archive.org/web/20130101000000*/http://asis.gov.au

https://web.archive.org/web/20130101000000*/http://asio.gov.au

However, we are lucky enough to have sufficient examples to give us a clear idea of the cyber war that once took place between Indonesia and Australia.

http://web.archive.org/web/20130520072344/http://australianprovidores.com.au

http://web.archive.org/web/20131106225110/http://www.danzaco.com.au/

http://web.archive.org/web/20131112141017/http://defence.gov.au/

http://web.archive.org/web/20131107064017/http://dmresearch.com.au

http://web.archive.org/web/20131109094537/http://www.flufferzcarwashcafe.com.au/

http://web.archive.org/web/20131105222138/http://smartwiredhomes.com.au

- Erika (@erikaris)-

↧

2016-11-21: WS-DL Celebration of #IA20

November 20, 2016, 10:04 pm

≫ Next: 2016-12-20: Archiving Pages with Embedded Tweets

≪ Previous: 2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia

.@WebSciDL celebrates 20 years of #webarchiving& @internetarchive w tacos and @djspooky CDs! #IA20 pic.twitter.com/AFb3qUiuzz
— Michael L. Nelson (@phonedude_mln) October 26, 2016

The Web Science & Digital Library Research Group celebrated the 20th Anniversary of the Internet Archive with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving. This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky).

Normally our group posts about research developments and technical analysis of web archiving, but for #IA20 we had members of our group write mostly non-technical stories drawn from personal experiences and interests that are made possible by web archiving. We are often asked "Why archive the web?" and we hope these blog posts will help provide you with some answers.

Shawn blogged about archiving "fictional" web sites. For examples, sites that are mentioned in TV shows and then set up in real life (tm) to support the shows.
Lulwah followed up her JCDL 2015 paper with a more personal note about which of her favorite Arabic language web sites were archived.
Mat described his time as an editor of the The Alligator, the student newspaper at the University of Florida.
Scott retold how he applied the results of his archivability research to better the design of a local acoustic music site that he maintains.
Alexander, in anticipation of the 2016 election, examined the mementos for Fox News and CNN web sites for both the 2008 and 2012 US presidential elections.
Yasmin revisited how well archived are some of the sites, no longer on the live web, documenting the 2011 Egyptian Revolution.
Sawood blogged about early Urdu language blogs and how they disappeared from the live web, Geocities-style.
Corren recounted the rise and fall of one of the first online universities, as captured by the Internet Archive.
Mohamed summarized the fate of various Libyan newspapers controlled by the government after the 2011 conflict in Libya.
Erika pulled some examples from the web archives of hacked pages from the 2013 cyber war between Indonesia and Australia.

We've collected these links and more material related to #IA20 in both a Storify story and a Twitter moment; we hope you can take the time to explore them further. We'd like to thank everyone at the Internet Archive for 20 years of yeoman's work, the many other archives that have come on-line more recently, and all of the WS-DL members who made the time to provide their personal stories about the impacts and opportunities of web archiving.

--Michael

↧