
2014-01-07: Two WS-DL Classes Offered for Spring 2014


We are offering two WS-DL classes for Spring 2014:
MS students are welcome in the 895 course; I just don't cross-list it as a 795 for my administrative convenience.

For an idea of what the classes will be like, please see the Fall 2012 offering for CS 495/595 and the Spring 2013 offering for CS 895. 

--Michael

2014-03-01: Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistics on the number of domains (RFC 3986 hosts) required to completely render each page.  Not surprisingly, the mean is quite high: 33.  Also not surprisingly, he has found pages that depend on more than 100 different domains.

This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013.  So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data shown in Figure 1.

Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite mementos from web archives does not run JavaScript, thus it only finds static URIs. Second, Peter's sample appears to be heavy on media sites, which tend to aggregate information, social media, and advertising from a multitude of other sites. On the other hand, our sample of 4,000 URIs and 82,425 composite mementos might be larger than Peter's sample and is probably more diverse. This difference is immaterial; direct comparability with Peter's results is not required to examine change in domains over time.
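As a rough sketch of the kind of computation involved (this is illustrative, not the actual analysis code; it assumes each composite memento is represented as a list of its embedded resource URIs), counting domains per memento in Python might look like:

from statistics import median
from urllib.parse import urlparse

def count_domains(resource_uris):
    """Count the unique hosts (RFC 3986 'host') among a memento's embedded resources."""
    hosts = {urlparse(uri).hostname for uri in resource_uris}
    hosts.discard(None)  # skip URIs without a host component
    return len(hosts)

# composite_mementos: {memento URI: [embedded resource URIs]} -- an assumed structure
def median_domains(composite_mementos):
    return median(count_domains(uris) for uris in composite_mementos.values())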

Figure 1 also shows the median number of resources (more precisely, the median number of unique URIs) required to recompose composite mementos. Median resources also increased over time. Furthermore, as shown in Figure 2, domains clearly increase as resources increase. This correlation seems to weaken as the number of resources increases. Note, however, that above 250 resources, the data is quite thin.
Another question that comes to mind is "what is the occurrence frequency of composite mementos at each resource level?" Figure 3 shows an ECDF for the data. Although it is hard to tell from the figure, 99% (81,387) of our composite mementos use 100 resources or fewer and 90% use 43 or fewer. Indeed, only 34 (0.0412%) have more than 300 resources.
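A minimal sketch of how an ECDF like the one in Figure 3 can be computed (assuming resource_counts holds the per-memento resource counts):

import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations at or below each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# e.g., fraction of composite mementos using at most 100 resources:
# x, y = ecdf(resource_counts)
# frac_at_most_100 = y[np.searchsorted(x, 100, side='right') - 1]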
Also interesting is the distribution of composite mementos with respect to the number of domains, which is shown in Figure 4. Here 97.5% of our composite mementos use at most 10 domains. It only takes 14 domains to cover 99% of our composite mementos.
Clearly, the number of resources and domains per web page has increased over time and the rate of increase has accelerated over time. These results are not directly comparable to Peter Bengtsson's, but I suspect that were he to use a 17-year sample the same patterns would emerge. I was half tempted to plug the 4,000 URIs from our sample into Peter's Number of Domains page to see what happens; unfortunately, I don't have the time available. Still, the results would be very interesting.


—Scott G. Ainsworth

March 2 update: minor grammatical corrections.


2014-03-01: Starting my research internship at NUS



Well, I made it! I am finally on the fine green island. After a long trip from Norfolk International Airport to Washington Dulles, then 23 hours in the air (except for a fueling pit-stop at Tokyo Narita Airport), I landed at Changi Airport in Singapore.

To give you some context, I was invited to spend a semester at the National University of Singapore and work with Dr. Min-Yen Kan in the WING research group. The purpose was to work in a common area of interest that helps me progress in the final leg of my PhD marathon and to increase the collaboration between our WS-DL lab and WING, yielding a reputable paper (or more?). In short, I am a WING this semester! So buckle up!

Even though jet lag was a miserable companion for the first couple of days, I decided not to take the first day off to rest and settle, and went directly to the university instead. Or maybe it was my excitement? I will never confess.

At NUS I did the regular paperwork and met my colleague and research partner for the next couple of months, Ms. Tao Chen. Tao showed me around the university and labs and gave me pointers on what to expect around here.

The next day I met with Dr. Kan and we discussed all the logistics of my arrival and the possible ideas to zero in on the angle I am going to focus on. We also discussed points of collaboration on side projects with Tao and Jun Ping, a fellow researcher at WING.

I will be working in the lab on the 5th floor of AS6, the building next to COM1 in the School of Computing. My desk is next to a huge window spanning half of the wall and overlooking an adjacent forest with singing birds! I guess I am a very happy PhD student now!


The journey starts now, let's see what I can do in the next couple of months while working with Asia's finest. Wish me luck!

-- Hany M. SalahEldeen

2014-04-01: Yesterday's (Wiki) Page, Today's Image?

Web pages, being complex documents, contain embedded resources like images.  As practitioners of digital preservation well know, ensuring that the correct embedded resource is captured when the main page is preserved presents a very difficult problem.  In A Framework for Evaluation of Composite Memento Temporal Coherence, Scott Ainsworth, Michael L. Nelson, and Herbert Van de Sompel explore this very concept.

Figure 1: Web Archive Weather Underground Example Showing the Different Ages of Embedded Resources
In Figure 1, borrowed from that paper, we see a screenshot of the Web Archive's December 9, 2004 memento from Weather Underground.  Even though the ages of most of these embedded images differ greatly from that of the main page, they don't really impact its meaning.  Of interest is the weather map, which differs by 9 months and shows clear skies even though the forecast on the main page calls for clouds and light rain.

The Web Archive, as a service external to the resource that it is trying to preserve, only has access to resources that exist at the time it can make a crawl, leading to inconsistencies.  Wikis, on the other hand, have access to all resources under their control, embedded or otherwise.

This is why it is surprising that MediaWiki, even though it allows for access to all previous revisions of a given page, does not tie the datetime of those embedded resources back to that main page.

A pertinent example is that of the Wikipedia article Same-sex marriage law in the United States by state.

Figure 2: Screenshot of Wikipedia article on Same-sex marriage law in the United States by state
Figure 2 shows the current (as of this writing) version of this article, complete with a color-coded map indicating the types of same-sex marriage laws applying to each state.  In this case, the correctness of the version of the embedded resource is pertinent to the understanding of the article.

Figure 3: Screenshot of the same Wikipedia page, but for a revision from June of 2013
Figure 3 shows a June 2013 revision of this article, displaying the same color-coded map.  This is a problem because an old revision of the article is being rendered with the current version of the map: when accessing the June 2013 version of the article on Wikipedia, I get the March 2014 version of the embedded resource.  To ensure that this revision makes sense to the reader, the map from Figure 4 should be displayed with the article instead.  As Figure 5 shows, Wikipedia has all previous revisions of this resource.

Figure 4: The June 2013 revision of the embedded map resource
Figure 5: Listing of all of the revisions of the map resource on Wikipedia

For this particular topic, any historian (or paralegal) attempting to trace the changes in laws on this topic will be confused when presented with a map that does not match the text, and may possibly question the validity of this resource as a whole.

We tried to address this issue with the Memento MediaWiki extension.  MediaWiki provides the ImageBeforeProduceHTML hook, which appears to do what we want.  It provides a $file argument, giving access to the LocalFile object for the image.  It also provides a $time argument that signifies the timestamp of the file in 'YYYYMMDDHHIISS' string form, or false for the current version.

We were perplexed when the hook did not perform as expected, so we examined the source of MediaWiki version 1.22.5.  Below we see the makeImageLink function that calls the hook on line 569 of Linker.php.

We see that later on, inside this conditional block, $time is used on line 655 as an argument to the makeThumbLink2 function (bottom of code snippet).
And, within the makeThumbLink2 function, it gets used to make a boolean argument for a call to the function makeBrokenImageLinkObj on line 861.
Back inside the makeImageLink function, we see a second opportunity to use the $time value on line 675, but again it is used to create a boolean argument to the same function.
Note that its timestamp value in 'YYYYMMDDHHIISS' string form is never actually used as prescribed.  So, the documentation for the ImageBeforeProduceHTML hook is incorrect on the use of this $time argument.  In fact, the hook was introduced in MediaWiki version 1.13.0 and this code doesn't appear to have changed much since that time.  It is possible that the $time functionality is intended to be implemented in a future version.

Alternatively, we considered using the &$res argument from that hook to replace the HTML with the images of our choosing, but we would still need to use the object provided by the $file argument, which has no ready-made way to select a specific revision of the embedded resource.

At this point, in spite of having all of the data needed to solve this problem, MediaWiki, and transitively Wikipedia, does not currently support rendering old revisions of articles as they truly looked in the past.

--Shawn M. Jones


2014-04-17: TimeGate Design Options For MediaWiki

We've been working on the development, testing, and improvement of the Memento MediaWiki Extension.  One of our principal concerns is performance.

The Memento MediaWiki Extension supports all Memento concepts:
  • Original Resource (URI-R) - in MediaWiki parlance referred to as a "topic URI"
  • Memento (URI-M) - called "oldid page" in MediaWiki
  • TimeMap (URI-T) - analogous to the MediaWiki history page, but in a machine readable format
  • TimeGate (URI-G) - no native equivalent in MediaWiki; acquires a datetime from the Memento client, supplies back the appropriate URI-M for the client to render
This article will focus primarily on the TimeGate (URI-G), specifically the analysis of two different alternatives in the implementation of TimeGate.  In this article we use the following terms to refer to these two alternatives:
  • Special:TimeGate - where we use a MediaWiki Special Page to act as a URI-G explicitly
  • URI-R=URI-G - where a URI-R acts as a URI-G if it detects an Accept-Datetime header in the request
Originally, the Memento MediaWiki Extension used Special:TimeGate.
A Special:TimeGate datetime negotiation session would proceed as follows, as described in Pattern 2.1 in Section 4.2.1 of RFC 7089:
  1. HEAD request is sent with Accept-Datetime header to the URI-R*; URI-R responds with a Link header containing the location of the URI-G
  2. GET request is sent with Accept-Datetime header to the URI-G; URI-G responds with a 302 response header containing the location of the URI-M
  3. GET request is sent to the URI-M; URI-M responds with a 200 response header and Memento content
Obviously, this consists of 3 separate round trips between the client and server.  This URI-G architecture is referred to as Special:TimeGate.
The duration for Special:TimeGate is represented by:
dstg = a + RTT + b + RTT + M + RTT
dstg = 3RTT + a + b + M                            (1)
where:
  • a - time to generate the initial URI-R response in step 1
  • b - time to generate the URI-G response in step 2
  • M - time to generate the URI-M response in step 3
  • RTT - round trip time for each request
Based on a conversation with the Wikimedia team, we chose to optimize this exchange by reducing the number of round trips, effectively implementing Pattern 1.1 in Section 4.1.1 of RFC 7089:
  1. HEAD request is sent with Accept-Datetime header to the URI-R; URI-R responds with a 302 response header containing the location of the URI-M
  2. GET request is sent to the URI-M; URI-M responds with a 200 response header and Memento content
This URI-G architecture is referred to as URI-R=URI-G.

The duration for URI-R=URI-G is represented by:
drg = B + RTT + M + RTT
drg = 2RTT + B + M                                 (2)
where:
  • B - time to generate the URI-G response in step 1
  • M - time to generate the URI-M response in step 2
  • RTT - round trip time for each request
Intuitively, URI-R=URI-G should be faster.  It has fewer round trips to make between client and server.
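To make the two patterns concrete, here is a minimal client-side sketch of each exchange, using the Python requests library against a hypothetical Memento-enabled wiki (the URI and the link-relation lookup are illustrative only):

import requests

ACCEPT_DT = {'Accept-Datetime': 'Mon, 30 Jun 2011 00:00:00 GMT'}
URI_R = 'http://wiki.example.org/index.php/Some_Page'   # hypothetical topic URI

def special_timegate_pattern(uri_r):
    """Pattern 2.1 (RFC 7089 Sec. 4.2.1): URI-R -> URI-G -> URI-M, three round trips."""
    r1 = requests.head(uri_r, headers=ACCEPT_DT)                          # step 1
    uri_g = r1.links['timegate']['url']                                   # from the Link header
    r2 = requests.get(uri_g, headers=ACCEPT_DT, allow_redirects=False)    # step 2: 302
    uri_m = r2.headers['Location']
    return requests.get(uri_m)                                            # step 3: the memento

def uri_r_equals_uri_g_pattern(uri_r):
    """Pattern 1.1 (RFC 7089 Sec. 4.1.1): URI-R acts as URI-G, two round trips."""
    r1 = requests.head(uri_r, headers=ACCEPT_DT, allow_redirects=False)   # step 1: 302
    uri_m = r1.headers['Location']
    return requests.get(uri_m)                                            # step 2: the memento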

For URI-R=URI-G to be the better choice, drg < dstg, which is the same as the following derived relationship:
2RTT + B + M < 3RTT + a + b + M
B + M < RTT + a + b + M

yielding:

B < RTT + a + b                                       (3)
First, let's try to acquire the value of the term a.

After review of the Wikimedia architecture, it also became apparent that caching was an important aspect of our design and architecture plans.  Because the initial architecture utilized a Special:TimeGate URI and 302 responses are not supposed to be cached, caching was not of much concern.  Now that we've decided to pursue URI-R=URI-G, it becomes even more important.
Experiments with Varnish (the caching server used by Wikimedia) indicate that the Vary header correctly indicates which representations of the resource are to be cached.  If the URI-R response contains a Vary header with the value Accept-Datetime, Varnish caches a separate URI-R representation for each Accept-Datetime value seen in requests for that URI-R.  Other headers listed in Vary have a finite number of possible values, but Accept-Datetime can take a near-infinite number of values, making caching nearly useless for URI-R=URI-G.

Those visitors of a URI-R that don't use Accept-Datetime in the request header will readily reap the benefits of caching.  Memento users of a URI-R=URI-G system will never reap this benefit, because Memento clients send Accept-Datetime with every initial request.
Caching is important to our duration equations because a good caching server returns a cached URI-R in a matter of milliseconds, meaning our value of a in (3) above is incredibly small, on the order of 0.1 seconds on average.

Next we attempt to get the values of b and B in (3) above.

To get a good range of values, we conducted testing using the benchmarking tool Siege on our demonstration wiki.  The test machine is running an Apache HTTP Server 2.2.15 on top of Red Hat Enterprise Linux 6.5.  This server is a virtual machine consisting of two 2.4 GHz Intel Xeon CPUs and 2 GB of RAM.  The test machine consists of two installs of MediaWiki containing the Memento MediaWiki Extension: one with Special:TimeGate implemented, and a second using URI-R=URI-G.

Both TimeGate implementations use the same function for datetime negotiation; the only major difference is whether it is called from a topic page (URI-R) or a Special page.

Tests were performed against localhost to avoid the benefits of the installed Varnish caching server.

The output from siege looks like the following:


This output was processed to extract the 302 responses, which correspond to those instances of datetime negotiation (the 200 responses are just siege dutifully following the 302 redirect). The URI then indicates which version of the Memento MediaWiki Extension is installed. URIs beginning with /demo-special use the Special:TimeGate design option. URIs beginning with /demo use the URI-R=URI-G design option. From these lines we can compare the amount of time it takes to perform datetime negotiation using each design option.
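A rough sketch of that post-processing step (the log line format in the regular expression is hypothetical and would need to be adjusted to the actual siege output):

import re

# Hypothetical siege log line, e.g.:
#   HTTP/1.1 302   0.45 secs:     178 bytes ==> GET  /demo-special/index.php?title=...
# Adjust the regular expression to whatever the real output looks like.
LINE = re.compile(r'HTTP/1\.\d\s+(\d{3})\s+([\d.]+)\s+secs:.*==>\s+\S+\s+(\S+)')

def negotiation_durations(log_lines):
    """Collect durations of 302 (datetime negotiation) responses per design option."""
    durations = {'demo-special': [], 'demo': []}
    for line in log_lines:
        m = LINE.search(line)
        if not m or m.group(1) != '302':
            continue
        secs, uri = float(m.group(2)), m.group(3)
        key = 'demo-special' if uri.startswith('/demo-special') else 'demo'
        durations[key].append(secs)
    return durations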

The date of Mon, 30 Jun 2011 00:00:00 GMT was used for datetime negotiation, because the test wiki contains fan-created data for the popular book series A Song of Ice And Fire (aka Game of Thrones), and this date corresponds to a book released during the wiki's use.

Figure 1: Differences in URI-G performance between URI-R=URI-G and Special:TimeGate
Figure 1 shows the results of performing datetime negotiation against 6304 different wiki pages.  The plot shows the difference between the URI-R=URI-G durations and the Special:TimeGate durations. Because most values are above 0 (i.e., URI-R=URI-G took longer), there is a marked benefit to using Special:TimeGate.

Why the big difference?  It turns out that the earliest MediaWiki hook in the chain that we can use for URI-R=URI-G is ArticleViewHeader, because we needed something that provides an object that allows access to both the request (for finding Accept-Datetime) and response (for providing a 302) at the same time.  This hook is called once all of the data for a page has been loaded, leading to a lot of extra processing that is not incurred by the Special:TimeGate implementation.

Figure 2: Histogram showing the range of URI-R=URI-G values
Figure 2 shows a histogram with 12 buckets containing the values for the durations of URI-R=URI-G.  The minimum value is 0.56 seconds.  The maximum value is 12.06 seconds.  The mean is 1.24 seconds. The median is 0.77 seconds. The biggest bucket spans 0 to 1.0 seconds.
Figure 3: Histogram showing the range of Special:TimeGate values
Figure 3 shows a histogram, also with 12 buckets (for comparison), containing the values for the durations of Special:TimeGate.  The Special:TimeGate values only stretch between 0.22 and 1.75 seconds. The mean is 0.6 seconds. The median is 0.59 seconds. The biggest bucket spans 0.5 to 0.6 seconds.

Using this data, we can derive a solution for (3).  The values for B range from 0.56 to 12.06 seconds.  The values for b range from 0.22 to 1.75 seconds.

Now, the values of RTT can be considered.

The round trip time (RTT) is a function of the transmission delay (dt) and propagation delay (dp):
RTT = dt + dp                                                (4)
And transmission delay is a function of the number of bits (N) divided by the rate of transmission (R)
dt = N / R                                                      (5)
The average TimeGate request-response pair consists of a 300 Byte HTTP HEAD request header + 600 Byte HTTP 302 response header + 20 Byte TCP header + 20 Byte IP header = 940 Byte  = 7520 bit payload.

For 1G wireless telephony (28,800 bps), the end user would experience a transmission delay of
dt = 7520 b / 28800 bps
dt = 0.26 s
So, in our average case for both URI-G implementations (using a = 0.1 for a cached URI-R in (3)):
B < RTT + a + b
B < dp + dt + a + b
1.24 s < dp + dt + 0.1 s + 0.6 s
we replace RTT with our value for 1G wireless telephony:
1.24 s < dp + 0.26 s + 0.1 s + 0.6 s
1.24 s < dp + 0.96 s
So, an end user with 1G wireless telephony would need to experience an additional 0.28 s of propagation delay in order for URI-R=URI-G to even be comparable to Special:TimeGate.

Propagation delay is a function of distance and propagation speed:
dp = d / sp                                                (6)
Seeing as 1G wireless telephony travels at the speed of light, the distance one would need to transmit a signal to make URI-R=URI-G viable becomes
0.28 s = d / (299,792,458 m/s)
(0.28 s) (299,792,458 m/s) = d
d = 83,941,888 m = 83,942 km = 52,158 miles
This is more than the circumference of the Earth.  Even if we used copper wire (which is worse) rather than radio waves, the order of magnitude is still the same.  Considering the amount of redundancy on the Internet, the probability of hitting this distance is quite low, so let's ignore propagation delay for the rest of this article.

That brings us back to transmission delay.  At what transmission delay, and essentially what bandwidth, does URI-R=URI-G win out over Special:TimeGate using our average values for the generation of the 302 response?
B < dt + a + b from (3) and (4), dropping dp
1.24 s < dt + 0.1 s + 0.6 s
1.24 s < dt + 0.7 s
0.54 s < dt

dt = N / R
0.54 s = 7520 b / R
(0.54 s) R = 7520 b
R = 7520 b / 0.54 s
R ≈ 13,926 bps ≈ 14 kbps
Thus, those MediaWiki sites with users using something slower than a 14.4 modem will benefit from the URI-R=URI-G implementation for TimeGate using our average values for the generation of TimeGate responses.
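The arithmetic above is easy to reproduce with a short script using the average values from the measurements (this is just the calculation from the text, not part of the extension):

# Reproducing the transmission-delay arithmetic from the text (values are the
# averages measured above; nothing here is specific to the extension itself).
payload_bits = (300 + 600 + 20 + 20) * 8      # 940 bytes = 7520 bits per exchange

a, b, B = 0.1, 0.6, 1.24                      # seconds (cached URI-R, mean b, mean B)

dt_1g = payload_bits / 28800                  # ~0.26 s at 28,800 bps
extra_dp = B - (dt_1g + a + b)                # ~0.28 s of extra propagation delay
distance_m = extra_dp * 299_792_458           # ~84,000 km before break-even

dt_needed = B - (a + b)                       # 0.54 s of transmission delay needed
rate_bps = payload_bits / dt_needed           # ~13,926 bps, i.e., about 14 kbps

print(dt_1g, extra_dp, distance_m, rate_bps)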

Therefore, we have decided that Special:TimeGate provides the best performance in spite of the extra request needed between the client and server.  The reason that the intuitive choice did not work out in most cases is due to idiosyncrasies in the MediaWiki architecture, rather than network concerns, as originally assumed.

--Shawn M. Jones

* It is not necessary for a client to send an Accept-Datetime header to a URI-R.  Most Memento clients do (and RFC 7089 Section 3.1 demonstrates this), in hopes that they encounter a URI-R=URI-G pattern and can save on an extra request.

2014-04-18: Grad Cohort Workshop (CRA-W) 2014 Trip Report

Last week, on April 10-11, 2014, I attended the Graduate Cohort Workshop 2014, which took place at the Hyatt Regency in Santa Clara. While there, I enjoyed the nice California weather and saw the area that is home to the headquarters of several high-tech companies.

CRA-W (Computer Research Association's Committee on the Status of Women in Computing Research) sponsors a number of activities focused on helping graduate students succeed in CSE research careers. These include educational and community building events, and mentoring.

The event is part of CRA-W's efforts and has several goals, including (1) increasing the number of women in computing, (2) providing strategies and information on navigating graduate school, (3) giving early insight into career paths, and (4) letting attendees meet speakers and other graduate students and build their networks.

Women students in their first, second, or third year of graduate school in computer science and engineering or a closely related field, who are studying at a US or Canadian institution, are eligible to apply to attend the event. This year was the eleventh year of the workshop, which started in 2004.  In that first workshop, there were only 100 applicants and all were accepted; this year there were 503 applications and only 304 were accepted.

There were general sessions for the entire audience as well as three simultaneous sessions for students in their first, second, and third years of graduate school. Attendees could attend whatever they thought was relevant.

The program agenda is available on the Grad Cohort Workshop website, along with previous agendas and slides. They will be uploading the slides from this year's talks as well.
Friday morning started with registration, then breakfast was served, where I got to meet some wonderful graduate students. We shared our personal experiences in graduate school and exchanged our CONNECT IDs. CONNECT provides conferences with a searchable online attendee list; it allows us to upload our picture, name, school, year in graduate school, interests, and personal website link, and share them with other attendees. It also lets us look at other attendees' profiles, find people with similar interests, and send messages.
After that there was a welcome session that explained what the workshop was about, why it is important, why we were there, and how we were selected as attendees.

Dr. Tracy Camp from the Colorado School of Mines presented the first session, Networking. She started by introducing herself and talking about her professional and personal background. Then she provided some information about networking strategies, emphasizing that networking is not genetic but a skill that can be developed, and discussed how networking goes in all directions: up, down, and across. At the end of the session she had us practice one-on-one conversations with the people seated in front of and behind us.
In the second session, Dr. Yuanyuan Zhou, a Professor at the University of California, San Diego, presented on Finding a Research Topic. She first talked about her personal experience struggling to find what she wanted to work on and what she was passionate about.
During her talk she noted that a zigzag path to finding your research topic is fine and that you should not expect to find it in one shot. Some pointers to help you find the right path: (1) find your own strengths and know what to look for in a topic, (2) identify your interests and play to your strengths, (3) set goals and milestones so you can successfully finish, and (4) think outside the box. She also showed some other graduate students' experiences in finding a research topic.

After that there was a lunch break where we found tables with research topic tags. I sat at the visualization table, but was able to talk about web archiving as well. It was interesting to talk to both graduate students and professors with similar interests.

Next, I attended the session on “Balancing Graduate School and Personal Life”, presented by Dr. Yanlei Diao, an Associate Professor in the Department of Computer Science at the University of Massachusetts Amherst, and Dr. Angela Demke Brown, an Associate Professor in the Department of Computer Science at the University of Toronto. The talk was about how to set and achieve long- and short-term goals and enjoy one day at a time: always set target dates and manage your time. The main tip was to manage time and choose activities carefully, and to treat grad school as a job, that is, to separate work and personal life.





After that, Dr. Farnam Jahanian, who leads the US NSF's Directorate for Computer and Information Science and Engineering (CISE), talked about the “Future of Computer Science”. In the talk, he described who belongs to the CISE community: 61% Computer Science, Information Science & Computer Engineering; 24% Science and Humanities; 12% Engineering (excluding computer engineering); and 3% Interdisciplinary Centers. He also pointed out the divisions and core research areas.

Then he mentioned the six Emerging Frontiers: (1) Data Explosion, (2) Smart Systems: Sensing, Analysis, and Decision (such as environment sensing, people-centric sensing, energy response, and smart health care), (3) Expanding the Limits of Computation, (4) Secure Cyberspace (securing our nation's cyberspace), (5) Universal Connectivity, and (6) Augmenting Human Capabilities.
He also mentioned some awards that are granted each year to explore the frontiers of computing.
After that there was a poster session with about 90 posters displaying a range of interesting research topics. The primary research areas were networking, HCI, AI, databases, graphics, security, and many other areas of computing research.



Between sessions there were breaks where snacks were provided and the sponsors shared information on their work and job openings.


Saturday morning started with breakfast and a session on “Strategies for Human-Human Interaction” presented by three speakers: Dr. Amanda Stent, Principal Research Scientist at Yahoo; Dr. Laura Haas, IBM researcher; and Dr. Margaret Martonosi, Professor at Princeton University.
    
The session started with a short introduction of all the speakers; then the talk focused on (1) interaction strategies between faculty and students, (2) the challenges of being a woman in the computing technology field, and (3) examples of uncomfortable situations that may occur and how to respond.
After that I attended a session on “Building Self Confidence”, presented by Dr. Julia Hirschberg, a Professor and the Department Chair at Columbia University. The talk mainly focused on (1) how to recover from not doing as well in a course as you expected, (2) the frustration of not knowing what your specific research project is, (3) the feeling that you don't know as much as your fellow graduate students, and (4) some examples of situations that may occur and how to build your self-confidence in your own way.


Then there was a Wrap-up and Final Remarks session where all the speakers and attendees were thanked for coming; after that, lunch was provided.

Finally, there was a “Resume Writing Clinic” and an “Individual Advising” session where the speakers provided one-on-one help to attendees as needed.

It was nice to attend these kinds of sessions, to meet so many wonderful women in computing, both professors and students, from all over the world, and to share our thoughts and experiences in graduate school.


--Lulwah Alkwai


Special thanks to Professor Michele C. Weigle for editing this post

2014-04-14: ECIR 2014 Trip Report

From the ECIR 2014 official Flickr account
From Apr. 14 to Apr. 16, 2014, in the beautiful city of Amsterdam in the Netherlands, I attended the 36th European Conference on Information Retrieval (ECIR 2014). The conference started with a workshops/tutorials day on Apr. 13, which I didn't attend.

The first day was the workshops and tutorials day. ECIR 2014 had a wide range of workshops/tutorials that covered various aspects of IR such as: Text Quantification: A Tutorial, GamifIR' 14 workshop,  Context Aware Retrieval and Recommendation workshop (CaRR 2014), Information Access in smart cities workshop (i-ASC 2014), and Bibliometric-enhanced Information Retrieval workshop (BIR 2014).

The main conference started on April 14 with a welcome note from the conference chair, Maarten de Rijke. After that, Ayse Goker from Robert Gordon University presented the Karen Spärck Jones award to its winner and the keynote speaker, Eugene Agichtein, a professor at Emory University. His presentation, entitled "Inferring Searcher Attention and Intention by Mining Behavior Data", covered the challenges and opportunities in the IR field and future research areas.

First, he compared the challenges of “Search” in 2002, when it aimed to support global information access and contextual retrieval, with “Search” in 2012 (SWIRL 2012), when the focus moved beyond the ranked list and its evaluation. Eugene then moved to the concept of inferring search intention. In this area, he pointed to using interaction data, for example understanding search terms well enough to ask questions in social CQA; some unsuccessful queries may even be converted into automatic questions that are forwarded to people (CQA) to answer. He also considered mining query logs and click logs as sources of data that may enhance the search experience.

Then, Eugene discussed the challenges of obtaining realistic search behavior data outside the major search engines.  He described UFindIt, a game for collecting controlled search behavior data at scale. He also showed examples of replacing big and expensive eye-tracking equipment, such as ViewSer, which enables remote eye tracking.

Finally, Eugene listed some future trends in the IR field, such as: behavior models for ubiquitous search, a future vision for the search interface based on intelligent assistants and augmented reality, new tools for the analysis of cognitive processing, using mobile devices with cameras as eye-tracking tools, optimizing the power consumption of search tasks on mobile devices, and privacy concerns in searching.

After the break there were two parallel sessions (Recommendation and Evaluation). I attended the recommendation session, where Chenyi Zhang from Zhejiang University presented his paper entitled "Content + Attributes: a Latent Factor Model for Recommending Scientific Papers in Heterogeneous Academic Networks". In this paper, they proposed an enhanced latent factor model for recommending academic papers. The system incorporates the paper content (e.g., title and abstract in plain text) and additional attributes (e.g., author, venue, publication year), and it addresses the cold-start problem for new users by incorporating social media.  In the evaluation session, Colin Wilkie from the University of Glasgow presented Best and Fairest: An Empirical Analysis of Retrieval System Bias. After lunch, we had the first poster/demo session, which featured a set of interesting demos: DAIKnow, Khresmoi Professional, and ORMA.

The second day, April 15, started with a "Panel on the Information Retrieval Research Ecosystem", but due to jet lag I couldn't attend the morning session. After lunch, we had the next poster/demo session. I enjoyed the discussions around GTE-Cluster: A Temporal Search Interface for Implicit Temporal Queries and TripBuilder, which won the best demo award.

On the third and last day, April 16, the keynote speaker was Gilad Mishne, Director of Search at Twitter. Gilad described Twitter search as building the train track while the train is running hundreds of miles an hour. He discussed the challenges of the search task at Twitter: the massive stream of incoming tweets, real-time indexing, ranking tweets, and aggregating results across tweets and people, which requires multiple indexes and multiple ranking techniques. He also distinguished behavior in Twitter search from that in general search engines, as queries are not repeated: 29% of top queries on Twitter change hourly and 44% change daily. Gilad explained that there is a human in the loop for annotation; Twitter hires "on-call" crowdsourced workers to categorize queries, for example to determine whether a query is news-related or not. A number of IR techniques will not work with Twitter search, such as anchor text, term frequency, click data, and relevance judgments. Twitter results optimization targets decreasing bad results, which improves the search experience, using an evaluation metric called cr@p3 (the fraction of crap in the top 3 docs).

The next session was the "Digital Library" session, where I presented my paper "Thumbnail Summarization for Web Archives". In this paper, we proposed various techniques to predict changes in a web page's visual appearance based on changes in its HTML text, in order to select a subset of the TimeMap that represents the major changes of the website through time. We suggested using SimHash fingerprints to estimate the changes between pages, and we proposed three algorithms that can reduce the size of the TimeMap to 25% of its original size.
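As a rough illustration of the SimHash idea (not the code from the paper): each page's tokens are hashed, bit positions are accumulated with +1/-1 weights, and the Hamming distance between two fingerprints estimates how much the text changed.

import hashlib

def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint over a list of tokens."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def estimated_change(tokens_a, tokens_b, bits=64):
    """Hamming distance between fingerprints: a rough proxy for textual change."""
    return bin(simhash(tokens_a, bits) ^ simhash(tokens_b, bits)).count('1')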



The next presentation was "CiteSeerX: A Scholarly Big Dataset" by Cornelia Caragea. She spoke about some use cases for scholarly article databases. Cornelia used DBLP content to clean the CiteSeerX database, assuming that if two articles are similar in title, author, and number of pages, then they are duplicates. However, one of the audience members raised a special case in medical publications where this assumption does not hold.

Then, Marijn Koolen from the University of Amsterdam presented User Reviews in the Search Index? That'll Never Work!. Marijn studied user reviews of books on the web, e.g., on Amazon, to enhance the search experience for books. He showed various examples of useful and unhelpful comments. He used a big dataset of 2.8 million book descriptions collected from Amazon and LT, augmented by 1.8 million entries from LoC and BL. The industry track ran in parallel with my session; here are some interesting slides from Alessandro Benedetti of Zaizi UK.



After lunch, I attended the industry track session, with presentations from the global search engines. Pavel Serdyukov from Yandex presented "Analyzing Behavioral Data for Improving Search Experience at Yandex". Pavel spoke about Yandex's efforts to share user data; Yandex has now run a click data challenge for three years. He showed how they anonymized the click logs by converting them into numbers.



The next presenter was Peter Mika from Yahoo Labs, whose presentation was entitled "Semantic Search at Yahoo". Peter gave an overview of the status of the semantic web and how it is used by search engines.



At the end of the day there was the closing session, where the conference chair thanked the organizers for their efforts. The ECIR 2015 committee also promoted the next ECIR event in Vienna, Austria. Finally, the ECIR 2014 media committee made this wonderful video that captures various moments from ECIR 2014.



----
Ahmed AlSum

2014-05-08: Support for Various HTTP Methods on the Web

While clearly not all URIs will support all HTTP methods, we wanted to know which methods are widely supported and how well that support is advertised in HTTP responses. Support for the full range of HTTP methods is crucial for RESTful Web services; please read our previous blog post for definitions and pointers about REST and HATEOAS. Earlier, we did a brief analysis of HTTP method support in the HTTP Mailbox paper. We have extended that study to carry out a deeper analysis and look at various aspects of it.

We initially sampled 100,000 URIs from DMOZ and found that only 40,870 URIs were live. Our further analysis was based on the response code, "Allow" header, and "Server" header for OPTIONS requests to those live URIs. We found that out of those 40,870 URIs:
  • 55.31% do not advertise which methods they support
  • 4.38% refuse the OPTIONS method, either with a 405 or 501 response code
  • 15.33% support only HEAD, GET, and OPTIONS
  • 38.53% support HEAD, GET, POST, and OPTIONS
  • 0.12% have syntactic errors in how they convey which methods they support
  • 2.99% have RFC compliance issues such as a 200 (OK) response code to an OPTIONS request but OPTIONS is not present in the Allow header, 405 (Method not supported) response code without an Allow header, or 405 response code, but OPTIONS method is present in the Allow header
Below is an example of an OPTIONS request with a successful response:

$ curl -I -X OPTIONS http://www.cs.odu.edu/
HTTP/1.1 200 OK
Date: Wed, 07 Aug 2013 23:11:04 GMT
Server: Apache/2.2.17 (Unix) PHP/5.3.5 mod_ssl/2.2.17 OpenSSL/0.9.8q
Allow: GET,HEAD,POST,OPTIONS
Content-Length: 0
Content-Type: text/html

$

The above output illustrates that the URI http://www.cs.odu.edu/ returns a 200 OK response, uses the Apache web server, and supports the GET, HEAD, POST, and OPTIONS methods.

The following OPTIONS request illustrates an unsuccessful response that has an RFC compliance issue:

$ curl -I -X OPTIONS http://dev.bitly.com/
HTTP/1.1 405 Not Allowed
Content-Type: text/html
Date: Wed, 07 Aug 2013 22:24:05 GMT
Server: nginx
Content-Length: 166
Connection: keep-alive

$

The above output illustrates that the URI http://dev.bitly.com/ returns a 405 Not Allowed response, uses the Nginx web server, and does not say which methods it allows (a 405 response should include an Allow header).
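A sketch of how this kind of survey can be automated, mirroring the categories above (this uses the Python requests library; it is illustrative and not the exact analysis code used for the study):

import requests

def classify_options_response(uri):
    """Probe a URI with OPTIONS and report what it advertises (a rough sketch)."""
    try:
        resp = requests.options(uri, timeout=10)
    except requests.RequestException:
        return 'dead'
    allow = [m.strip().upper() for m in resp.headers.get('Allow', '').split(',') if m.strip()]
    if resp.status_code in (405, 501):
        return 'refuses OPTIONS'
    if resp.status_code == 200 and not allow:
        return 'does not advertise supported methods'
    if resp.status_code == 200 and 'OPTIONS' not in allow:
        return 'compliance issue: 200 but OPTIONS missing from Allow'
    return 'advertises: ' + ', '.join(sorted(allow))

print(classify_options_response('http://www.cs.odu.edu/'))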

Table 1: Interleaved Method Support Distribution.

Table 1 gives an interleaved distribution of method support. It shows the count and percentage of URIs in our sample set for all the combinations of supported and unsupported methods. If a combination is not listed in the table then it does not occur in our sample set.

In our sample set, about 55% of URIs claim support for the GET and POST methods, but less than 2% of the URIs claim support for one or more of the PUT, PATCH, or DELETE methods. The full technical report can be found on arXiv.

Resources:

--
Sawood Alam


2014-05-25: IIPC GA 2014

I attended the International Internet Preservation Consortium (IIPC) General Assembly 2014 (#iipcGA14) hosted by the Bibliothèque nationale de France (BnF) in Paris.  Although the GA ran the entire week (May 19 -- May 23), I was only able to attend May 20 & 21.  It looks like I missed some good material on the first day, including keynotes from Wendy Hall and Wolfgang Nejdl, and a presentation from Common Crawl.  Martin Klein also presented an overview of the Hiberlink project, as well as the "mset attribute" that we are working on with the people from Harvard.

I arrived after lunch on May 20, in time for a really strong session on "Harvesting and access: technical updates", featuring talks about Solr indexing (Andy Jackson et al.) (Andy's slides), deduplicating content in WARCs (Kristinn Sigurðsson), Heritrix updates (Kris Carpenter), and Open Wayback (Helen Hockx).  Within WS-DL, we haven't really done much with Solr in our projects or classes and that's a shortcoming we should address soon.

The morning of May 21 began with presentations from Helen Hockx and Gildas Illien about creating IIPC-branded collections (essentially continuing the Olympics collections available so far), followed by breakout sessions to discuss the legal and technical issues regarding such collections (guess which one is the most problematic!).  Although all considered this an interesting direction for IIPC to pursue, I'm not sure we made much progress on how to proceed.

After lunch, I gave my presentation in a session that included status updates about the KB's web archives (Anna Rademakers (slides)) and the Internet Memory Foundation (Leïla Medjkoune and Florent Carpentier (slides)).  My talk established the metaphor of web archives as "cluttered attics, garages, and basements" and then discussed profiling web archives to better perform query routing at the Memento Aggregator, as well as to provide an interchange format and mechanism to coordinate IIPC crawling and coverage activities, including the contents of dark archives.




The day ended with a session about archiving Dutch public TV (Lotte Belice Baltussen (slides)) and crawling & archiving RSS feeds (Kristinn Sigurðsson (slides)).  Thursday and Friday closed out with public workshops, but I was already well into my homeward bound ordeal during those days. 

As always, the IIPC GA was filled with informative sessions and a collaborative spirit.  It was great catching up with old friends, and especially good to see WS-DL alumni Martin Klein (LANL) and Ahmed AlSum (Stanford).  Unfortunately, it is probably one of the last events at which we'll see Kris Carpenter since she is transitioning out of the Internet Archive.  I regret that my schedule did not allow me to attend the entire GA.  Although it is not quite official yet, it looks like the 2015 GA will be held at/near Stanford.

--Michael

N.B. I will update the narrative above with links to the slides as they become available.

2014-05-27 Update: A mostly complete set of presentations is now available.


2014-05-28: The road to the most precious three letters, PHD

The commencement on May 10th, 2014, with hundreds of students wearing their caps and gowns and ready for the moment of graduation, can't be forgotten. For me, it was the culmination of a long journey toward my Ph.D. degree in computer science. A few days before that, on May 3rd, 2014, I submitted my dissertation, entitled “Web Archive Services Framework For Tighter Integration Between The Past And Present Web”, to the ODU registrar's office as a declaration of the completion of the requirements for the degree. On Feb 26th, 2014, I had defended my dissertation, which was presented with these slides and is available for viewing via video streaming.







In my research, I explored a proposed service framework that provided APIs for the web archive corpus to enable users and third party developers to access the web archive on four levels.

  • The first level is the content level, which gives access to the actual content of web archive corpora with various filters. 
  • The second level is the metadata level, which gives access to two types of metadata. The first is the temporal web graph: the ArcLink system extracts, preserves, and delivers the temporal web graph for the corpus. ArcLink was published as a poster at JCDL 2013 (with my favorite minute madness) along with a more detailed version as a tech report; it was also presented at the IIPC GA 2013 and received good feedback from the web archiving consortium. The second type of metadata is thumbnails: we proposed thumbnail summarization techniques to select and generate a distinguished set of pages that represent the main changes in the visual appearance of a webpage through time. This work was presented at ECIR 2014.
  • The third level is the URI level, where we tried to extend the default URI lookup interface to benefit from HTTP redirection. This research was discussed at TempWeb 2013 and the full paper is available in the proceedings. 
  • The fourth level is the archive level, where we quantified current web archiving activities in two directions. The first was the percentage of web archive materials relative to the live web corpus, presented at JCDL 2011 with a detailed version appearing as a tech report; this work attracted the attention of various outlets, such as The Atlantic, The Chronicle of Higher Education, and MIT Technology Review. The second direction was the distribution of web archive materials, where we developed new methods to profile web archives based on TLDs and languages. That work was presented at TPDL 2013, and an extended version with a larger dataset has been accepted for publication in an IJDL special issue.
Now, writing about it from my office at the Stanford University Library, where I'm working as a web archiving engineer and leading the technical activities for the new Stanford web archiving project, I remember the long trip since I arrived in the US in Fall 2009 to start my degree. It was a long journey to gain the most precious three letters that will be attached to my name forever: Ahmed AlSum, PhD.
@JFK on Aug 2, 2009
------
Ahmed AlSum

2014-06-02: WikiConference USA 2014 Trip Report


Amid the smell of coffee and bagels, the crowd quieted down to listen to the opening by Jennifer Baek, who, in addition to getting us energized, also paused to recognize Adrianne Wadewitz and Cynthia Sheley-Nelson, two Wikipedians who, after contributing greatly to the Wikimedia movement, had recently passed away.  The mood became more uplifting as Sumana Harihareswara began her keynote, discussing the wonders of contributing knowledge and her experience with the Ada Initiative, Geek Feminism, and Hacker School.  She detailed how the Wikimedia culture can learn from the experiences at Hacker School, discussing different methods of learning, and how these methods allow all of us to nurture learning in a group.  She went on to discuss the difference between liberty and hospitality, and the importance of both to community, detailing how the group must ensure that individuals do not feel marginalized due to their gender or ethnicity, but also detailing how good hospitality engenders contribution as well as learning.  Thus began WikiConference USA 2014 at 9:30 am, on May 30, 2014.


At 10:30 am, I attended a session on the Global Economic Map by Alex Peek.  The Global Economic Map is a Wikidata project whose goal is to make economic data available to all in a consistent and easy to access format.  It pulls in sources of data from the World Bank, UN Statistics, U.S. Census, the Open Knowledge Foundation, and more.  It will use a bot to incorporate all of this data into a single location for processing.  They're looking for community engagement to assist in the work.

At 10:50 am, Katie Filbert detailed what Wikidata is and what the community can do with it, which consists mainly of bots collecting information from many sources and consolidating them into a usable format using MediaWiki.  The data is stored and accessible with XML, and is also multi-lingual.  This data is also fed back into the other Wikimedia projects, like Wikipedia, for use in infoboxes.  They are incorporating Lua into the mix in order to allow the infoboxes to be more intelligent about what data they are displaying.  They will be developing a query interface for Wikidata so information can be more readily retrieved from their datastore.

At 11:24 am, Max Klein showed us how to answer big questions with Wikidata.  In addition to the possibilities expressed in previous talks, Wikidata aims to provide modeling of all Wikipedias, allowing further analysis and comparison between each Wikipedia.  He showed visual representations of the gender bias of each Wikipedia, how much each language writes about other languages, and a map of the connection of data within Wikipedia by geolocation.  He showed us the exciting Wikidata Toolkit that allows for translation from XML to RDF and other formats as well as simplifying queries.  The toolkit uses Java to generate JSON, which can be processed by Python for analysis.

At noon, Frances Hocutt gave a presentation on how to use the MediaWiki API to get data out of Wikipedia.  She expounded upon the ability to extract specific structured data, such as links, from given Wikipedia pages.  She mentioned that the data can be directly accessed in XML or JSON, but there is also the mwclient Python library which may be easier to use.  Afterwards, she led a workshop on using the API, guiding us through the API sandbox and the API documentation.
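For example, pulling the links out of an article through the MediaWiki API might look roughly like this minimal sketch (pagination via the API's continuation parameters is omitted):

import requests

API = 'https://en.wikipedia.org/w/api.php'

def page_links(title):
    """Fetch (the first batch of) links from a Wikipedia article via the API."""
    params = {
        'action': 'query',
        'prop': 'links',
        'titles': title,
        'pllimit': 'max',
        'format': 'json',
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data['query']['pages'].values()))
    return [link['title'] for link in page.get('links', [])]

print(page_links('Web archiving'))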

Our lunch was provided by the conference and our lunch keynote was given by DC Vito.  He is a founding member of the Learning About Multimedia Project (LAMP), an organization that educates people about the effects of media in their lives.  He wanted to highlight their LAMPlatoon effort and specifically their Media Breaker tool, which allows users to take existing commercials and other media, and edit them in a legal way to interject critical thinking and commentary.  They are working on a deal with the Wikimedia foundation to crowdsource the legal review of the uploaded media so that his organization can avoid lawsuits.

At 3:15 pm, Mathias Klang gave a presentation concerning the attribution of images on Wikipedia and how Wikipedia deals with the copyright of images.  He highlighted that though images are important, the interesting part is often the caption and also the metadata.  He mentioned how sharing came first on the web, but it is only recently that the law has begun to catch up with such easy-to-use licenses as Creative Commons.  His organization, Commons Machinery, is working to return that metadata, such as the license of the image, back to the image itself.  He revealed Elogio, a browser plugin that allows one to gather and store resources from around the web while storing their legal metadata for later use.  Then he mentioned that Wikimedia does not store the attribution metadata in a way that Elogio and other tools can find it.  One of the audience members indicated that Wikimedia is actively interested in this.

At 4:15 pm, Timothy A. Thompson and Mairelys Lemus-Rojas gave a presentation on the Remixing Archival Metadata Project (RAMP) Editor, which is a browser-based tool that uses traditional library finding aids to create individual and organization authority pages for creators of archival collections.  Under the hood, it takes in a traditional finding aid as a Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF) record.  Then it processes this file, pulling in relevant data from other external sources, such as Encoded Archival Description (EAD) files and OCLC.  Then it transforms these EAC-CPF records into wiki markup, allowing for direct publication to English Wikipedia via the Wikipedia API.  The goal is to improve Wikipedia's existing authority records for individuals and organizations with data from other sources.

At 5:45 pm, Jennifer Baek presented her closing remarks, mentioning the conference reception on Saturday at 6:00 pm.  Thus we closed out the first day and socialized with Sumana and several other Wikimedians for the next hour.

At 8:30 am on Saturday, our next morning keynote was given by Phoebe Ayers, who wanted to discuss the state of the Wikimedia movement and several community projects.  She detailed the growth of Wikipedia, even in the last few years, while also expressing concern over the dip in editing on English Wikipedia in recent years, echoing the concerns of a recent paper picked up by the popular press.  She showed how there are many classes being taught on Wikipedia right now.  She highlighted many of the current projects being worked on by the Wikimedia community, briefly focusing on the Wikidata Game as a way to encourage data contribution via gamification.  She mentioned that the Wikimedia Foundation has been focusing on a new visual editor and other initiatives to support its editors, including grantmaking.  She closed with big questions that face the Wikimedia community, such as promoting growth, providing access for all, and fighting government and corporate censorship.  And our second day of fun had started.

At 10:15 am, I began my first talk.

In my talk, Reconstructing the past with MediaWiki, I detailed our attempts and successes in bringing temporal coherence to MediaWiki using the Memento MediaWiki Extension.  I partitioned the problem into the individual resources needed to faithfully reproduce the past revision of a web page.  I covered old HTML, images, CSS, and JavaScript and how MediaWiki should be able to achieve temporal coherence because all of these resources are present in MediaWiki.



My second talk, Using the Memento MediaWiki Extension to Avoid Spoilers, detailed a specific use case of the Memento MediaWiki Extension.  I showed how we could avoid spoilers by using Memento.  This generated a lot of interest from the crowd.  Some wanted to know when it would be implemented.  Others indicated that there were past efforts to implement spoiler notices in Wikipedia but they were never embraced by the Wikipedia development team.


Using the Memento MediaWiki Extension to Avoid Spoilers

At noon, Isarra Yos gave a presentation on vector graphics, detailing the importance of their use on Wikipedia.  She mentioned how Wikimedia uses librsvg for rendering SVG images, but Inkscape gets better results.  She has been unable to convince Wikimedia to change because of performance and security considerations.  She also detailed the issues in rendering complex images with vector graphics, and why bitmaps are used instead.  Using Inkscape, she showed how to convert bitmaps into vector images.

At 12:30 pm, Jon Liechty gave a presentation on languages in Wikimedia Commons.  He indicated that half of Wikimedia uses the English language template, but the rest of the languages fall off logarithmically.  He is concerned about the "exponential hole" separating the languages on each side of the curve.  He has reached out to different language communities to introduce Commons to them in order to get more participation from those groups.  He also indicated that some teachers are using Wikimedia Commons in their foreign language courses.

After lunch, Christie Koehler, a community builder from Mozilla, gave a presentation on encouraging community building in Wikipedia.  She indicated that community builders are not merely specialized people, but that all of us, by virtue of working together, are community builders.  She has been instrumental in growing Open Source Bridge, an event that brings together discussions on all kinds of open source projects, both for technical and maker communities.  According to her, a community provides access to experienced people you can learn from, and also provides experienced people the ability to deepen their skills by letting them share their knowledge in new ways.  She detailed how it is important for a community to be accessible socially and logistically, otherwise the community will not be as successful.  She highlighted how a community must also preserve and share knowledge for the present and the future.  She mentioned that some resources in a community may be essential, but may also be invisible until they are no longer available, so it is important to value those who maintain these resources.  She also mentioned how important it is for communities to value all contributions, not just those from the people who contribute most often.

At 3:15 pm, Jason Q. Ng gave a highly attended talk on a comparison of Chinese Wikipedia with Hudong and Baidu Baike.  He works on Blocked on Weibo, which is a project showing what content Weibo blocks that is otherwise available on the web.  He mentioned that censorship can originate from a government, industry, or even users.  Sensitive topics flourish on Chinese Wikipedia, which creates problems for those entities that want to censor information.  Hudong Baike and Baidu Baike are far more dominant than Wikipedia in China, even though they censor their content.  He has analyzed articles and keywords across these three encyclopedias, using HTTP status codes, character count, number of likes, number of edits, number of deleted edits, and whether an article is locked from editing, to determine if a topic is censored in some way.

At 5:45 pm, James Hare gave closing remarks detailing the organizations that made the conference possible.  Richard Knipel, of Wikimedia NYC, told us about his organization and how they are trying to grow their Wikimedia chapter within the New York metropolitan area.  James Hare returned to the podium and told us about the reception upstairs.

At 6:00 pm, we all got together on the fifth floor, got to know each other, and discussed the events of the day at the conference reception.

Sunday was the unstructured unconference.  There were lightning talks and shorter discussions on digitizing books (George Chris), video on Wikimedia and the Internet Archive (Andrew Lih), new projects from Wikidata (Katie Filbert), contribution strategies for Wikipedia (Max Klein), low Earth micro-satellites (Gerald Shields), the importance of free access to laws via Hebrew WikiSource (Asaf Bartov), the MozillaWiki (Joelle Fleurantin), ACAWiki, religion on Wikipedia, Wikimedia program evaluation and design (Edward Galvez), Wikimedia meetups in various places, Wikipedia in education (Flora Calvez), and Issues with Wikimedia Commons (Jarek Tuszynski).

I spent time chatting with Sumana Harihareswara, Frances Hocutt, Katie Filbert, Brian Wolff, and others about the issues facing Wikimedia.  I was impressed by the combination of legal and social challenges to Wikimedia.  It helped me understand the complexity of their mission.

At the end of Sunday, information was exchanged, goodbyes were said, lights were turned off, and we all spread back to the corners of the Earth from which we came, but within each of us was a renewed spirit to improve the community of knowledge and contribute.


-Shawn M. Jones

2014-06-18: Navy Hearing Conservation Program Visualizations

(Note: This is the first in a series of posts about visualizations created either by students in our research group or in our classes.)

The US Navy runs a Hearing Conservation Program (HCP), which aims to protect hearing and prevent hearing loss in service members.  Persons who are exposed to noise levels in the range 85-100 dB are in the program and have their hearing regularly tested.  During an audiogram, a beep is sounded at different frequencies with increasing volume.  The person being tested raises their hand when they hear the beep, and the frequency and volume (in dBA) are recorded.  A higher volume value means worse hearing (i.e., the beep had to be louder before it was audible).  Not only are people in the HCP regularly tested, but they are also provided hearing protection to help prevent hearing loss.  The audiogram data includes information about the person's current job as well as whether they are using hearing protection.

Researchers are interested in studying Noise Induced Hearing Loss (NIHL).  The theory behind NIHL is that if you're exposed to a massive noise event, you lose lots of hearing instantly, but that if you're exposed to long-term noise, there could be up to a 5 year lag before you notice hearing loss.  One goal of the HCP is to track hearing over time to see if this can be identified.  Hearing in the 4000-6000 Hz range is the most affected by NIHL.

We obtained a dataset of audiograms from the HCP with over 700,000 records covering 20 years of the program.  From this, PhD student Lulwah Alkwai produced three interactive visualizations.

In the first visualization, we show frequency (Hz) vs. hearing level (dB) averaged by job code.  The average over all persons with that job code is shown as the solid line (black is left ear, blue is right ear).  Normal impairment in each ear is shown as a dotted line.  The interactive visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/hz-db.html) allows the user to explore the hearing levels of various job codes.  The visualization also includes representative job codes for the different hearing levels as a guide for the user.

The second visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/age-year.html) shows the age of the person tested vs. the year in which they were tested.  The colored dots indicating hearing level use the same color scheme as the first visualization.  The visualization allows the user to filter the displayed data between all persons, those who used hearing protection, and those who used no hearing protection.  Note that this visualization shows only a sample of the full dataset.


The final visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/age-db.html) is an animated visualization showing how age vs. total hearing (left ear hearing level + right ear hearing level) has changed through time.  Once the page loads, the animation begins, with the current year indicated in the bottom right corner.  The visualization is also interactive.  If the user hovers over the year, the automatic animation stops and the user takes control of the year displayed by moving the mouse left or right.  As with the previous visualization, this shows only a sample of the full dataset.


We created a short demo video of all three visualizations in action.



All three of these visualizations were made using the D3.js library, drawing on examples from Mike Bostock's gallery.  The animated chart was based on his D3.js recreation of the Gapminder Wealth and Health of Nations chart.
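For readers who have not used D3, the general shape of the code behind charts like these is roughly as follows.  This is only a sketch in the D3 v3 style of the time; the CSV file name and column names (year, age, left_db, right_db) are assumptions, not the actual dataset fields.

// Minimal sketch of a D3 v3 scatter plot of age vs. total hearing level.
// The CSV name and column names are hypothetical.
var svg = d3.select("body").append("svg")
    .attr("width", 600)
    .attr("height", 400);

var x = d3.scale.linear().domain([18, 65]).range([40, 580]);    // age
var y = d3.scale.linear().domain([0, 200]).range([380, 20]);    // left + right hearing level (dB)

d3.csv("hcp-sample.csv", function (error, rows) {
  if (error) { return console.error(error); }
  svg.selectAll("circle")
      .data(rows.filter(function (d) { return +d.year === 1995; }))   // one year at a time
    .enter().append("circle")
      .attr("cx", function (d) { return x(+d.age); })
      .attr("cy", function (d) { return y(+d.left_db + +d.right_db); })
      .attr("r", 3);
});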

-Michele

2014-06-18: Google and JavaScript


In this blog post, we detail three short tests in which we challenge the Google crawler's ability to index JavaScript-dependent representations. After an introduction to the problem space, we describe the three tests summarized below.
  1. String and DOM modification: we modify a string and insert it into the DOM. Without the ability to execute JavaScript on the client, the string will not be indexed by the Google crawler.
  2. Anchor Tag Translation: we decode an encoded URI and add it to the DOM using JavaScript. The Google crawler should index the decoded URI after discovering it from the JavaScript-dependent representation.
  3. Redirection via JavaScript: we use JavaScript to build a URI and redirect the browser to the newly built URI. The Google crawler should be able to index the resource to which JavaScript redirects.

Introduction

JavaScript continues to create challenges for web crawlers run by web archives and search engines. To summarize the problem, our web browsers are equipped with the ability to execute JavaScript on the client, while crawlers commonly do not have the same ability. As such, content created -- or requested, as in the case of Ajax -- by JavaScript is often missed by web crawlers. We discuss this problem and its impacts in more depth in our TPDL '13 paper.

Archival institutions and search engines are attempting to mitigate the impact JavaScript has on their archival and indexing effectiveness. For example, Archive-It has integrated Umbra into its archival process in an effort to capture representations dependent upon JavaScript. Google has announced that its crawler will index content created by JavaScript, as well. There is evidence that Google's crawler has been able to index JavaScript-dependent representations in the past, but they have announced a commitment to improve and more widely use the capability.

We wanted to investigate how well the Google solution could index JavaScript-dependent representations. We created a set of three extremely simple tests to gain some insight into how Google's crawler operated.

Test 1: String and DOM Modification

To challenge the Google crawler in our first test, we constructed a test page with an MD5 hash string "1dca5a41ced5d3176fd495fc42179722" embedded in the Document Object Model (DOM). The page includes a JavaScript function that changes the hash string by performing a ROT13 translation on page load. The function overwrites the initial string with the ROT13-translated string "1qpn5n41prq5q3176sq495sp42179722".
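A minimal sketch of such a test page (illustrative only; the element id and helper function are assumptions, while the hash string is the one used in the test) is:

<p id="hash">1dca5a41ced5d3176fd495fc42179722</p>
<script>
  // ROT13 each alphabetic character; digits and punctuation pass through unchanged.
  function rot13(s) {
    return s.replace(/[a-zA-Z]/g, function (c) {
      var base = c <= 'Z' ? 65 : 97;
      return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
    });
  }
  // On page load, overwrite the original hash string in the DOM with its ROT13 translation.
  window.onload = function () {
    var el = document.getElementById('hash');
    el.textContent = rot13(el.textContent);
  };
</script>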

Before the page was published, both hash strings returned 0 results when searched in Google. Now, Google shows the result of the ROT13 translation that JavaScript embedded in the DOM (1qpn5n41prq5q3176sq495sp42179722) but not the original string (1dca5a41ced5d3176fd495fc42179722). The Google crawler successfully passed this test and accurately crawled and indexed this JavaScript-dependent representation.

Test 2: Anchor Tag Translation

Continuing our investigation with a second test, we wanted to determine whether Google could discover a URI to add to its frontier if the anchor tag is generated by JavaScript and only inserted into the DOM after page load. We constructed a page that uses JavaScript to ROT13-decode the string "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy" into a decoded URI. The JavaScript inserts an anchor tag linking to the decoded URI. This test evaluates whether the Google crawler will extract the URI from the anchor tag after JavaScript performs the insertion or if the crawler only indexes the original DOM before it is modified by JavaScript.
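A minimal sketch of such a page (illustrative only; the element id is an assumption, while the encoded string and anchor text are those used in the test) is:

<p id="deep">The deep web link is: </p>
<script>
  // Same ROT13 helper as in the first test; ROT13 is its own inverse, so it also decodes.
  function rot13(s) {
    return s.replace(/[a-zA-Z]/g, function (c) {
      var base = c <= 'Z' ? 65 : 97;
      return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
    });
  }
  // After page load, decode the obfuscated URI and insert a real anchor tag into the DOM.
  window.onload = function () {
    var decoded = rot13('uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy');
    var a = document.createElement('a');
    a.href = decoded;
    a.textContent = 'HIDDEN!';
    document.getElementById('deep').appendChild(a);
  };
</script>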

The representation of the resource identified by the decoded URI contains the MD5 hash string "75ab17894f6805a8ad15920e0c7e628b". At the time of this blog posting's publication, this string returned 0 results in Google. To protect our experiment from contamination (i.e., linking to the resource from a source other than the JavaScript-reliant page), we will not post the URI of the hidden resource in this blog.


The text surrounding the anchor tag is "The deep web link is: " followed by the anchor tag whose target is the decoded URI and whose text is "HIDDEN!".  If we search for the text surrounding the anchor tag, we receive a single result which includes the link to the decoded URI.  However, at the time of this blog posting's publication, the Google crawler has not discovered the hidden resource identified by the decoded URI.  It appears Google's crawler is not extracting URIs for its frontier from JavaScript-reliant resources.

Test 3: Redirection via JavaScript

In a third test, we created two pages, one of which is linked from my homepage and is called "Google Test Page 1". This page has an MD5 hash string embedded in the DOM: "d41d8cd98f00b204e9800998ecf8427e".

A JavaScript function changes the hash code to "4e4eb73eaad5476aea48b1a849e49fb3" when the page's onload event fires. In short, when the page finishes loading in the browser, a JavaScript function will change the original hash string to a new hash string. After the DOM is changed, JavaScript constructs a URI string to redirect to another page.
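A sketch of that logic (hypothetical variable names; the hash strings and testerpage names are those discussed in this test) is:

<p id="hash">d41d8cd98f00b204e9800998ecf8427e</p>
<script>
  window.onload = function () {
    // Overwrite the original hash string in the DOM.
    document.getElementById('hash').textContent = '4e4eb73eaad5476aea48b1a849e49fb3';

    // Build the redirect target.  The (1 == 0) branch can never be taken, so a crawler
    // that indexed testerpage1.php would have been fooled by dead code.
    var target;
    if (1 == 0) {
      target = 'testerpage1.php';
    } else {
      target = 'testerpage' + (1 + 1) + '.php';   // resolves to testerpage2.php
    }
    window.location.href = target;
  };
</script>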



In the impossible case (1==0 always evaluates to "false"), the redirect URI is testerpage1.php. This page does not exist. We put in this false URI to try to trick the Google crawler into indexing a page that never existed. (Google was not fooled.)

JavaScript constructs the URI of testerpage2.php, which has the hash string "13bbd0f0352dc9f61f8a3d8b015aef67" embedded in its DOM. This page -- prior to this blog post -- was not linked from anywhere, and Google cannot discover it without executing the JavaScript redirect embedded in Google Test Page 1. When we searched for the hash string, Google returned 0 results.

testerpage2.php also writes to a text file whenever the page is loaded. We waited for an entry to appear in the text file, indicating that the crawler had loaded the page. After that point, when we search Google for the hash string in testerpage2.php, we receive a result that shows the content and hash of testerpage2.php, but shows the URI of the original Google Test Page 1.


While some may argue that the search result in our third test should show the URI of testerpage2.php, Google has chosen to provide the original URI rather than the URI of the redirect target.

Conclusion

This very simple test set shows that Google is effectively executing JavaScript and indexing the resulting representation. However, the crawler is not expanding its frontier to include URIs that are generated by JavaScript. In all, Google shows that crawling resources reliant on JavaScript is possible at Web scale, but more work is left to be done to properly crawl all JavaScript-reliant representations.

--Justin F. Brunelle

2014-06-23: Federal Big Data Summit


On June 19th and 20th, I attended the Federal Big Data Summit at the Ronald Reagan Building in the heart of Washington, D.C. The summit is hosted by the Advanced Technology Academic Research Center (ATARC).

I participated as an employee of the MITRE Corporation -- we help ATARC organize a series of collaboration sessions designed to help identify and make recommendations for solutions to big challenges in the federal government. I led a collaboration session between government, industry, and academic representatives on Big Data Analytics and Applications. The goal of the session was to facilitate discussions between the participants regarding the application of big data in the government and preparing for the continued growth in importance of big data. The targeted topics included access to data in disconnected environments, interoperability between data providers, parallel processing (e.g., MapReduce), and moving from data to decision in an optimal fashion.

Due to the nature of the discussions (protected by the Chatham House Rule), I cannot elaborate on the specific attendees or specific discussions. In a few weeks, MITRE will produce a publicly released summary and set of recommendations for the federal government based on the discussions. When it is released, I will update this blog with a link to the report. It will be in a similar format and contain information at a similar level as the 2013 Federal Cloud Computing Summit deliverable.

On July 8th and 9th, I will be attending the Federal Cloud Computing Summit where I will run the MITRE-ATARC Collaboration Sessions on July 8th and moderate a panel of collaboration session participants on July 9th. Stay tuned for another blog posting on the Cloud Summit!

--Justin F. Brunelle

2014-06-26: InfoVis Fall 2011 Class Projects

(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course (then CS 795/895, now CS 725/825) since Fall 2011.  Each semester, I assign an open-ended final project that asks students to create an interactive visualization of something they find interesting.  Here's an example of the project assignment.  In this series of blog posts, I want to highlight a few of the projects from each course offering.  Some of these projects are still active and available for use, while others became inactive after their creators graduated.

The following projects are from the Fall 2011 semester.  Both Sawood and Corren are PhD students in our group.  Another nice project from this semester was done by our PhD student Yasmin AlNoamany and MS alum Kalpesh Padia.  The project led directly to Kalpesh's MS Thesis, which has its own blog post.

K-12 Archive Explorer
Created by Sawood Alam and Chinmay Lokesh


The K-12 Web Archiving Program was developed for high schools in partnership with the Archive-It team at the Internet Archive and the Library of Congress. The program has been active since 2008 and allows students to capture web content to create collections that are archived for future generations. The visualization helps to aggregate this vast collection of information. The explorer (currently available at http://k12arch.herokuapp.com/) provides users with a single interface for fast exploration and visualization of the K-12 archive collections.

The video below provides a demo of the tool.




We Feel Fine: Visualizing the Psychological Valence of Emotions
Created by Corren McCoy and Elliot Peay



This work was inspired by the "We Feel Fine" project by Jonathan Harris and Sep Kamvar.  The creators harvested blog entries for occurrences of the phrases "I feel" and "I am feeling" to determine the emotion behind the statement. They collected and maintained a database of several million human feelings from prominent websites such as Myspace and Flickr. This work uses the "We Feel Fine" data to measure the nature and intensity of a person’s emotional state as noted in the emotion-laden sentiment of individual blog entries. The specific words in the blogs related to feelings are rated on a continuous 1 to 9 scale using a psychological valence score to determine the degree of happiness. This work also incorporates elements of a multi-dimensional color wheel of emotions popularized by Plutchik to visually show the similarities between words. For example, happy positive feelings are bright yellow, while sad negative feelings are dark blue. The final visualization method combines a standard histogram which describes the emotional states with an embedded word frequency bar chart. We refer to this visualization technique as a "valence bar" which allows us to compare not only how the words used to express emotion have changed over time, but how this usage differs between men and women.

The video below shows a screencast highlighting how the valence bars change for different age groups and different years.



-Michele

2014-07-02 An ode to the "Margin Police," or how I learned to love LaTeX margins

To the great Margin Police:

"You lay down rules for all that approach you,

One and half on the left-hand edge,

One on all the other edges,

Page numbers one half down from the top.

These are your words.

And we are grateful for your guidance and direction.

Lo, you lead us in the ways of professionalism and consistency.

We, the unwashed are grateful."

But I have one question:

Why doesn't the LaTeX style file help me achieve these goals??

And so the exploration begins.

Sometimes we use LaTeX to write and submit papers and reports for publication.  Often the publishers provide a style file for us to use that dictates things like margins, number of columns per page, headers, footers, and other formatting directives.  Other times, guidance comes from "instructions to authors" and we are expected and required to meet the requirements.  What follows is how to see what the current margins are, how to set the margins, and how to check whether your document stays within the margins.  (LaTeX has environments that "float" and will sometimes ignore the margins.)  Hold on while we wander through the great and beautiful world of LaTeX margins.

LaTeX "thinks" of sheets of paper as a collection of "boxes."  It fills the boxes with text and whatnot.  At first glance, the location and description of these boxes is arcane and really without much apparent rhyme or reason.  What it comes down to is that a box is defined to start relative to where other boxes end, and each box has a height and width.  Defining locations like this allows an entire set of boxes to be moved by changing the reference starting point for the beginning box.

A sample layout result page.
You can see what LaTeX thinks the current box settings are by including the package "layout" and then inside your document executing the command:

\layout

The layout command will inject a new page showing the boxes on the page and their dimensions.  The dimensions are expressed as points (1 inch = 72 points).  Because the \layout command injects a page into your document, you won't want to use the command in your final document.

Once you have your arms (sort of) around the idea of a layout, the next question is how to affect the layout.  One way to do this is to set the various values that LaTeX uses by executing \setlength commands (see the definition of the command MyPageSetup).  The values in MyPageSetup will result in the 1.5x1.0x1.0x1.0 margins with US letter paper that our "Margin Police" example dictated.  (Changing the \textwidth value to 4.0in will result in very wide right margins because the text box is now much narrower.)
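A hypothetical MyPageSetup along these lines would produce those margins (an illustrative sketch only; LaTeX adds a default 1in to the side and top offsets, and the exact top-of-page values depend on how headers and footers are handled):

\newcommand{\MyPageSetup}{%
  \setlength{\oddsidemargin}{0.5in}%   default 1in + 0.5in = 1.5in left margin
  \setlength{\evensidemargin}{0.5in}%
  \setlength{\textwidth}{6.0in}%       8.5in - 1.5in (left) - 1.0in (right)
  \setlength{\topmargin}{-0.5in}%      puts the header about 0.5in from the top edge
  \setlength{\headheight}{12pt}%
  \setlength{\headsep}{24pt}%          body text then starts roughly 1in from the top edge
  \setlength{\textheight}{9.0in}%      11in - 1.0in (top) - 1.0in (bottom)
}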

One of the "flies in the ointment" with the above approach is that not everything obeys LaTeX margins and stays within their "boxes." Some specific examples are figure and table environments that "float" on the page.  So, now that we have told the regular text where its boxes are, and we assume that LaTeX will honor those boxes, how do we identify those times when the "floating" things don't honor the margins.  Conceptually the answer is fairly simple: put a template that matches the margins on all the pages and then tell us which pages have things that are outside the margins.  Simply stated; not so simply answered.

Beware, gory details ahead!!  There is a make file (Listing 1), a LaTeX document (Listing 2), and an image analysis report (Listing 3).

Because I like make, the make file has a couple of targets that taken together answer the question: which page has something that violates the margins??


First the target: margins (the key commands are sketched after these steps).  Here we:

1.  Define, remove and create a temporary directory.

The redacting mask.
2.  Copy the PDF document that we want to check into the temporary directory.

3.  In the temporary directory, we use pdftk to split the large PDF into a collection of small PDFs with one page per file.  (There are several other commands that will do the same job; I happened to choose pdftk.)

4.  Create a couple of skyblue redacting images (the images are sized to match page numbers, and the main body of text).

5.  For every page in the PDF (from step 3 above), overlay on that page the two redacting images (from step 4 above), and create a new file with the word "redacted" in the file name.

6.  It is always nice to tell the user that something is happening.

7.  Gather up all the redacted pages into a single large PDF.
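As a rough idea of what this target boils down to, the core commands look something like the following sketch (paths, mask sizes, and offsets are illustrative, not the values in Listing 1):

# split the document into one PDF per page
pdftk report.pdf burst output tmp/pg_%04d.pdf

# skyblue masks sized to the text body and the page-number area (72 dpi here)
convert -size 432x648 xc:skyblue tmp/body_mask.png
convert -size 72x36   xc:skyblue tmp/pagenum_mask.png

# overlay both masks on every page; anything peeking out is a margin violation
for p in tmp/pg_*.pdf; do
  convert "$p" \
    tmp/body_mask.png    -geometry +108+72  -composite \
    tmp/pagenum_mask.png -geometry +396+720 -composite \
    "${p%.pdf}_redacted.pdf"
done

# gather the redacted pages back into a single PDF for flipping through
pdftk tmp/pg_*_redacted.pdf cat output redacted.pdf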

At this point in time, we have two files of interest: the original large PDF, and a second PDF where all the text should be redacted.  If the files are small (10 pages or less being small), it is a simple matter to manually flip through the redacted file and see if there is anything that hasn't been redacted.  If the redacted file is large (100 pages or more being large), manually flipping pages is troublesome.

Now the target: checkColors.  Here we:

1.  Return to the temporary directory where we created the redacted pages.

2.  Define a "magic" number (to be explained later).
A redacted page with offending text.

3.  Define a "threshold" number (can be tweaked as desired).

4.  For all the redacted PDF files, figure out how many pixels are skyblue, how many are white, and compare that number with the threshold.  If the number of pixels that are neither skyblue nor white exceeds the threshold then alert the user and record that offending data.

The magic number 484704 is the number of pixels in the PDF image of a single US letter size page.  Changing the pixel density or page size means that you will have to use a different magic number.  The number comes from adding the number of pixels of the different colors as returned by the convert command.
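For a single redacted page, the per-color pixel counts can be obtained with something like this sketch; the checkColors target then compares the non-skyblue, non-white total against the threshold:

# print, for each distinct color in the page image, the number of pixels of that color
convert tmp/pg_0001_redacted.pdf -format %c histogram:info:-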

At the end of this processing you have:
  • The original and unchanged PDF
  • A page for page version of the original PDF that has been redacted
  • A report listing which redacted pages exposed more pixels than the threshold value allowed
Using these files you can do the following:
  1. Open the original PDF in a PDF viewer,
  2. Open the redacted PDF in a PDF viewer,
  3. Open the report in a TXT viewer,
  4. For each "bad" page in the report, goto that page in the redacted and see what the problem is, and if the problem is severe enough (you have to decide what severe enough means) then goto that page in the original and correct the problem.
Using the Listing 2 LaTeX document and Listing 3 report, page 1 exposes too many pixels, but the offending pixels are just the page number in the footer.  Formatting the headers and footers will correct this problem.  On page 2, the table extends past the right-hand margin.  Reading the LaTeX source document does not indicate that this should be a problem, but the LaTeX processing created it.  Correction of this problem is a little more challenging.

Is all the above processing worth the time and effort??  For a small document (less than 10 pages), perhaps not.  For a largish document (greater than 100 pages), yes.  Processing is fairly quick: a 17 MB (~500 page) PDF takes about 9 minutes to process, less time than it would take to manually flip through each page, even in a print preview mode.

And so we say to you, great Margin Police:

"We have heeded thy words on the margins,

We have checked and rechecked our margins,

And they are good.

We beseech thee, Oh Great Margin Police,

Let us pass and we will be enlightened all the days of our lives."

-- Chuck Cartledge

Listing 1: The make file.

 


 

Listing 2: The LaTeX file.

 


 

 Listing 3: The results.txt file.

 

2014-07-02 LaTeX References, and how to control them

With just a little abuse:

"Which way did they go?
How many were there?
I must find my references;
For I am their master."


LaTeX references are wonderful things.  In this short epistle, we will explore some of the interesting things that you can do with them, problems that can arise from misusing them, problems that can arise from not using them, and finally how to spice them up just a little.

A sample page with reference problems.

First we will set up a conceptual model based on the LaTeX file (Listing 1), the make file (Listing 2), and some auxiliary files that LaTeX creates.  Firstly, copy the LaTeX file and the make file to a convenient directory.  Create references.pdf from the command line by executing make.  You should get a sample PDF like the one in the image.  Now that we have something to look at, we can construct the conceptual model.

Opening references.aux, searching for the lines that begin with the \newlabel token, and comparing them to the references.tex file shows that the label tbl:twice is declared twice.  It is first declared on page 3 and then again on page 4.  So a major piece in our model is that the label is actually a token used to find data associated with that token.  In effect, it is a key into a database of labels.  Using the label, you can get the sequential number of the type of object the label is associated with, the label's page number, and the title associated with the label.

It is but a little leap to realize that we are dealing with two related types of entities and that they can behave in many interesting ways.  The entities are the labels and how they are referred to (references).  Just to keep things simple, we'll look at just the \label and \ref commands and see how they interact.

  1. If a label is never declared and it isn't referenced then nothing happens.  This is the nil case.
  2. If a label is never declared and it is referenced then a LaTeX warning such as "Reference `tbl:nonExistent' on page 1 undefined on input line 37" is recorded in the log file and ?? marks are written to the PDF.
  3. If a label is declared once and it is referenced then this is the ideal case.  The appropriate sequential number will be written to the PDF.
  4. If a label is declared more than once then a LaTeX warning such as "Label `tbl:twice' multiply defined." is recorded in the log and the sequence number of the last declaration will be used in the PDF.  This condition could happen if a file is included more than once, or if label creation is not well disciplined, or if there is a typo.
  5. If a label is declared and is never referenced, is this an error??  This could happen if all references to the label were lost during editing.

The references.tex file contains all these types of conditions.
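A minimal file that reproduces these conditions (a sketch, not the actual references.tex; the label names follow the ones mentioned above) might look like:

\documentclass{article}
\begin{document}

\section{Results}\label{sec:results}   % case 3: declared once and referenced below
\section{Extra}\label{tbl:twice}       % case 4: the same label declared ...
\section{More}\label{tbl:twice}        % ... twice: "multiply defined" warning
\section{Orphan}\label{sec:orphan}     % case 5: declared but never referenced

See Section~\ref{sec:results}.         % resolves to the correct sequence number
See Table~\ref{tbl:nonExistent}.       % case 2: undefined, prints ?? in the PDF

\end{document}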

How do we address these conditions??  Let's look at each individually:

Sample page after refcheck processing.
  1. Not an error, so nothing to do.
  2. Examining the log file will show all the undeclared labels.  Now you have to root around in the tex files to see where the label should be.  Depending on how the tex files are linked, this can be easy or hard.
  3. Not an error, so nothing to do.
  4. Multiple label declarations can come from including the same LaTeX file more than once, so you'll have to figure when and how that could occur.  Also, the labels may have originated from different source files that were never meant to be in the same document, but now are.  Or it could be as simple as a typo.
  5. This one is interesting in its own right, for a couple of reasons.  Firstly, how do you find unused labels, and secondly, what do you do about them?  Like so many other things in LaTeX, the secret is finding the right package.  Uncommenting the \usepackage{refcheck} line in references.tex will add reference-related information to the log file and the PDF.  In the left margin of the PDF, each label will be printed where it is declared, with an annotation indicating whether or not the label is used.  Similar information is written to the log file.  Now you can identify labels (for tables, sections, figures, equations, etc.) that were deserving of a label at one time, but now aren't referenced in the text.  If the label is no longer of interest, you could ignore the condition, but it does raise the question of why the label is no longer necessary.  This is particularly true if a table or figure is not referenced anywhere in the document.  If it is no longer referenced, is there a reason for its existence in the current PDF??

By this point, we have resolved all our label referencing.  We've got unique labels for all the things deserving of being labeled, and we're referencing all the labels using things like \ref or \pageref.  Let's spice things up a little bit.

Uncommenting:

 %% \renewcommand{\MyRef}[1]{\vref{#1}}

will cause all occurrences of \MyRef to be replaced by \vref (part of the varioref package).  \ref returns the sequence number associated with the label; \vref does that and a little more.  \vref looks at where the reference is relative to the thing that is being referenced, and changes the text that it returns.  \vref returns:

  1. Sequence number only - if label and reference are on the same page
  2. Sequence number on the previous page - if the label is one page prior to the reference
  3. Sequence number on the next page - if the label is one page after the reference
  4. Sequence number on page number - if the label is more than one page away from the reference

Remake the sample LaTeX file and see how the table references change.
Sample page after \vref processing.

Because \vref changes the text in the document, it is possible that LaTeX can get in a situation where the software can't figure out what to do.  So use \vref only near the very end of the editing and document creation process.  \vref is especially useful when your publisher requires that a table or figure be within 1.5 pages of the reference.

Now you are the master of your table, figure, and section references.  You know what they are conceptually used for, what types of logical conundrums can arise when declaring and using references, and how to spice up your references' readability.

And so we can say:
"I know which way they went.
I know how many there are.
I control my references;
For I am their master."

-- Chuck Cartledge

Listing 1. The sample LaTeX file.



Listing 2.  The make file.


2014-07-07: InfoVis Fall 2012 Class Projects


(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous post: Fall 2011)

The Fall 2012 projects were based on the 2012 ODU Infographics Contest. Participants were tasked with visualizing the history and trajectory of work done in the area of quantum sensing.

Top Quantum Sensing Trends
Created by Wayne Stilwell


This project (currently available at https://ws-dl.cs.odu.edu/vis/quantum-stilwell/) is a visualization for displaying the history and trajectory of quantum sensing. History is shown as a year-by-year slideshow. The most publicized quantum sensing areas for the selected year are displayed. Clicking on a topic shows the number of publications on that subject over time compared to the most popular topic (gray line). This allows users to see when a subject started to rise in popularity and at what point in time (if any) it started to decline. The visualization also shows which research groups have the most publications for the selected subject. When a new year is chosen, animation is used to show which topics increased in popularity and which decreased. The final slide in the visualization is a projection for the year 2025 to show where quantum sensing is headed in the future.  This project won the overall competition.

The video below provides a demo of the tool.



BibTeX Corpus Visualizer
Created by Mat Kelly


One method to find trends in any industry is to examine the publications related to that industry. Given a set of publications, one should be able to extrapolate trends based solely on the publications' metadata, e.g., title, keywords, abstract. Analyzing raw text data to determine trends is daunting, so another method should be used that analyzes this data and presents it in a way that can be easily consumed by a casual user. This casual user should be able to achieve the goal of identifying trends in the respective industry. This project (currently available at https://ws-dl.cs.odu.edu/vis/quantum-kelly/index.php) is a visualization that examines a small corpus consisting of metadata (in BibTeX format) about a collection of articles related to Quantum Sensing. The interface allows a user to explore this data and draw conclusions about many attributes of the data set and industry, including trends.  This project was built using the jQuery and D3.js libraries.

The video below provides a demo of the tool. 


Mat is one of our PhD students and has done other visualization work (described in his IEEE VIS 2013 trip report).

-Michele

2014-07-08: Potential MediaWiki Web Time Travel for Wayback Machine Visitors






Over the past year, I've been working on the Memento MediaWiki Extension.  In addition to trying to produce a decent product, we've also been trying to build support for the Memento MediaWiki Extension at WikiConference USA 2014.  Recently, we've reached out via Twitter to raise awareness and find additional supporters.

To that end, we attempt to answer two questions:
  1. The Memento extension provides the ability to access the page revision closest to, but not after, the datetime specified by the user.  As mentioned in an earlier blog post, the Internet Archive only has access to the revisions of articles that existed at the time it crawled, but a wiki can access every revision.  How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
  2. How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
Answering the first question shows why the Wayback Machine is not a suitable replacement for a native MediaWiki Extension.

Answering the second question gives us an idea of the potential user base for the Memento MediaWiki Extension.

Thanks to Yasmin AlNoamany's work in "Who and What Links to the Internet Archive", we have access to 766 GB of (compressed) anonymized Internet Archive logs in a common Apache format.  Each log file represents a single day of access to the Wayback Machine.  We can use these logs to answer these questions.

Effectiveness of accessing closest desired datetime in the Wayback Machine

How effective is the Wayback Machine at ensuring that visitors gain access to pages close to the datetimes they desire?
To answer the first question, I used the following shell command to review the logs.



This command was run only on this single log file to find a potential English Wikipedia page as an example to trace in the logs; it was used only to search for an answer to the first question above.

From that command, I found a Wayback Machine capture of a Wikipedia article about the Gulf War.  The logs were anonymized, so of course I couldn't see the actual IP address of the visitor, but I was able to follow the path of referrers back to see what path the user took as they browsed via the Wayback Machine.



We see that the user engages in a Dive pattern, as defined in Yasmin AlNoamany's "Access Patterns for Robots and Humans in Web Archives".
  1. http://web.archive.org/web/20071218235221/angel.ap.teacup.com/gamenotatsujin/24.html
  2. http://web.archive.org/web/20080112081044/http://angel.ap.teacup.com/gamenotatsujin/259.html
  3. http://web.archive.org/web/20071228223131/http://angel.ap.teacup.com/gamenotatsujin/261.html
  4. http://web.archive.org/web/20071228202222/http://angel.ap.teacup.com/gamenotatsujin/262.html
  5. http://web.archive.org/web/20080105140810/http://angel.ap.teacup.com/gamenotatsujin/263.html
  6. http://web.archive.org/web/20071228202227/http://angel.ap.teacup.com/gamenotatsujin/264.html
  7. http://web.archive.org/web/20071228223136/http://angel.ap.teacup.com/gamenotatsujin/267.html
  8. http://web.archive.org/web/20071228223141/http://angel.ap.teacup.com/gamenotatsujin/268.html
  9. http://web.archive.org/web/20080102052100/http://en.wikipedia.org/wiki/Gulf_War

The point of this exercise was not to read the Japanese blog that the user was initially interested in.  From this series of referrers, we see that the end user chose the original URI with a datetime of 2007/12/18 23:52:21 (from the 20071218235221 part of the archive.org URI).  This datetime is the best we can do to determine which Accept-Datetime they would have chosen if they were using Memento.  What they actually got at the end was an article with a Memento-Datetime of 2008/01/02 05:21:00.


So, we could assume that perhaps there were no changes to this article between these two dates.  The Wikipedia history for that article shows a different story, listing 51 changes to the article in that time.

The Internet Archive produced a page that maps to revision 181419148 (1 January 2008), rather than revision 178800602 (19 December 2007), which is the closest revision to what the visitor actually desired.

What did the user miss out on by getting the more recent version of the article?  The old revision discusses how the Gulf War was the last time the United States used battleships in war, but an editor in between decided to strike this information from the article.  The old revision listed different countries in the Gulf War coalition than the new revision.

So, seeing as the Internet Archive's Wayback Machine slides the user from date to date, they end up getting a different revision than they originally desired.  This algorithm makes sense in an archival environment like the Wayback Machine, where the mementos are sparse.

The Memento MediaWiki Extension has access to all revisions, meaning that the user can get the revision closest to the date they want.

Potential Memento MediaWiki Extension Users at the Internet Archive

How many visitors of the Wayback Machine could benefit from the use of the Memento MediaWiki Extension?
The second question involves discovering how many visitors are using the Wayback Machine for browsing Wikipedia when they could be using the Memento MediaWiki Extension.

We processed these logs in several stages to find the answer, using different scripts and commands than the one used earlier.

First, a simple grep command, depicted below, was run on each logfile.  The variable $inputfile was the compressed log file, and the $outputfile was stored in a separate location.
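One plausible form of this per-logfile filter (an assumption, not necessarily the exact command used) is:

# keep only the Wayback Machine log entries that mention wikipedia.org anywhere
zcat "$inputfile" | grep 'wikipedia\.org' > "$outputfile"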



Considering we are looping through 766 GB of data, this took quite some time to complete on our  dual-core 2.4 GHz virtual machine with 2 GB of RAM.

As Yasmin AlNoamany showed in "Who and What Links to the Internet Archive", wikipedia.org is the biggest referrer to the Internet Archive, but we wanted direct users, so entries that merely arrived via a Wikipedia referrer were a problem.  Because Wikipedia links to the Internet Archive to avoid dead links in article references, these logs contain many requests with Wikipedia referrers.

We used the simple Python script below on each of the 288 output files returned from the first pass, stripping out all entries whose referrer contained the string 'wikipedia.org'.
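A minimal sketch of the idea (hypothetical names, and assuming the Apache combined log format, where the referrer is the second-to-last quoted field):

import sys

def strip_wikipedia_referrers(infile, outfile):
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            try:
                referrer = line.split('"')[-4]   # second-to-last quoted field
            except IndexError:
                referrer = ''
            if 'wikipedia.org' not in referrer:
                fout.write(line)                 # keep only direct requests

if __name__ == '__main__':
    strip_wikipedia_referrers(sys.argv[1], sys.argv[2])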



Python was used because it offered better performance than merely using a combination of sed, grep, and awk to achieve the same goal.

Once we had stripped the Wikipedia referrers from the processed log data, we could count accesses to Wikipedia with another script.  The script below was run with the argument of wikipedia.org as the search term.  Since the referrer entries had been removed, only actual requests for wikipedia.org should remain.
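A sketch of this counting step (the file naming and date extraction are assumptions):

import glob
import os
import sys

def count_requests(pattern, search_term):
    # each referrer-stripped file covers one day, so emit one CSV row per file
    for path in sorted(glob.glob(pattern)):
        date = os.path.basename(path).split('.')[0]
        with open(path) as f:
            count = sum(1 for line in f if search_term in line)
        print('%s,%d' % (date, count))

if __name__ == '__main__':
    count_requests('stripped/*.log', sys.argv[1])   # e.g. wikipedia.org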



Because each log file represents one day of activity, this script gives us a CSV containing a date and a count of how many wikipedia.org requests occur for that day.

Now that we have a list of counts from each day, it is easy to take the numbers from the count column in this CSV and find the mean.  Again, enter Python, because it was simple and easy.
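A few lines over the date,count CSV produced above suffice (a sketch):

import csv
import sys

with open(sys.argv[1]) as f:
    counts = [int(row[1]) for row in csv.reader(f)]
print(sum(counts) / float(len(counts)))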



It turns out that the Wayback Machine, on average, receives 131,438 requests for Wikipedia articles each day.

If we perform the same exercise for the other popular wiki sites on the web, we get the results shown in the table below.

Wiki Site                                Mean number of daily requests to the Wayback Machine
*.wikipedia.org (All Wikipedia sites)    131,438
*.wikimedia.org (Wikimedia Commons)      26,721
*.wikia.com (All Wikia sites)            9,574

So, there are a potential 168,000 requests per day that could benefit if these wikis used the Memento MediaWiki Extension.

On top of that, these logs represent a snapshot in time for the Wayback Machine only.  The Internet Archive has other methods of access that were not included in this study, so the number of potential Memento requests per day is actually much higher.

Summary


We have established support for two things:
  1. the Memento MediaWiki Extension will produce results closer to the date requested than the Wayback Machine
  2. there are a potential 168,000 Memento requests per day that could benefit from the Memento MediaWiki Extension
Information like this is useful when newcomers ask: who could benefit from Memento?

--Shawn M. Jones

2014-07-08: Presenting WS-DL Research to PES University Undergrads


On July 7th and 8th, 2014, Hany SalahEldeen and I (Mat Kelly) were given the opportunity to present our PhD research to visiting undergraduate seniors from a leading university in Bangalore, India (PES University). About thirty students were in attendance at each session and indicated their interest in the topics through a large quantity of relevant questions.


Dr. Weigle (@weiglemc)

Prior to the ODU CS students' presentations, Dr. Michele C. Weigle (@weiglemc) gave the students an overview of some of WS-DL's research topics with her presentation Bits of Research.

In her presentation she covered our lab's foundational work, recent work, some outstanding research questions, and some potential projects to entice interested students to work with our research group.


Mat (@machawk1), your author

Between Hany and me, I (Mat Kelly) presented a fairly high level yet technical overview titled Browser-Based Digital Preservation, which highlighted my recent work in creating WARCreate, a Google Chrome extension that allows web pages to be preserved from the browser.

Rather than merely giving a demo of the tool (as was done at Digital Preservation and JCDL 2012), I began with a primer on the dynamics of the web, HTTP, the state of web archiving, and some issues relating to personal versus institutional web archiving, and finally covered the problems that WARCreate addresses.  I also covered some other related topics and browser-based preservation dynamics, which can be seen in the slides included in this post.


Hany (@hanysalaheldeen) presented the next day after my presentation, giving a general overview of his academic career and research topics.  His presentation Zen and the Art of Data Mining covered a wide range of topics including (but not limited to) temporal user intention, the Egyptian Revolution, and his experience as an ODU WS-DL PhD student (to, again, entice the students).

The opportunity for Hany and me to present what we work on day-to-day to bright-eyed undergraduate students was unique: their interests lie within our research area (computer science), yet as potential graduate students they still have open doors on which research path to take.

We hope that the presentations and questions we were able to answer were of some help in facilitating their decisions to pursue a graduate career at the Web Science and Digital Libraries Research Lab at Old Dominion University.



— Mat Kelly