
2019-01-03: Five WS-DL Classes Offered for Spring 2019

https://xkcd.com/2085/
"Both arXiv and archive.org are invaluable projects which, if they didn't exist, we would dismiss as obviously ridiculous and unworkable."

A record five WS-DL classes are offered for Spring 2019:
  • CS 432/532 Web Science is taught by Alexander Nwala, Thursdays 4:20-7:00pm.  This class explores web phenomena with a variety of data science tools such as Python, R, D3, ML, and IR. 
  • CS 725/825 Information Visualization is taught by Dr. Michele C. Weigle, Wednesdays 9:30am-12:15pm. This class will explore the background and tools needed to develop effective visualizations through analyzing existing visualizations and visualization problems.
  • CS 734/834 Information Retrieval is taught by Dr. Jian Wu, Tuesdays & Thursdays, 9:30-10:45am.  This class will explore the theory and engineering of information retrieval in the context of developing web-based search engines.
  • CS 795/895 Human-Computer Interaction (HCI) is taught by Dr. Sampath Jayarathna, Tuesdays 4:20-7:00pm.  This class will explore the major cognitive and social phenomena surrounding human use of computers with the goal of understanding their impact and creating guidelines for the design and evaluation of software and physical products and services.
  • CS 795/895 Web Archiving Forensics is taught by Dr. Michael L. Nelson, Wednesdays 4:20-7:00pm.  This class explores the use of web archives for verifying the priority and authenticity of web pages, especially in the face of faulty and untrustworthy archives.
If you're interested in any of these classes you'll need to take them this semester since Fall 2019 will likely bring a completely different line up:
  • CS 418/518 Web Programming, Dr. Jian Wu
  • CS 431/531 Web Server Design, Sawood Alam
  • CS/DASC 600 Introduction to Data Science, Dr. Sampath Jayarathna
  • CS/DASC 625 Data Visualization, Dr. Michele C. Weigle
I will likely be on research leave (and thus not teaching) in Fall 2019.

--Michael

2019-01-07: Review of WS-DL's 2018


The Web Science and Digital Libraries Research Group had a strong year, with the most significant event being our expansion from two professors to four.  Beginning in Fall 2018, we added two new assistant professors: Dr. Jian Wu and Dr. Sampath Jayarathna.
We're very lucky to have them both join WS-DL.  Their collective experience in HCI, ML, Big Data, and mining scholarly data increases the capabilities of our team and will allow WS-DL to greatly expand our teaching and research portfolios. For example, we will offer a record five WS-DL classes for Spring 2019.

Dr. Michele Weigle and I also had an eventful 2018: she was promoted to full professor and I received a joint appointment with Virginia Modeling, Analysis & Simulation Center (VMASC).

In 2018 we also had three MS students graduate, two students do internships, two students advance to PhD candidacy, one new research grant ($248k) awarded, 11 publications, and 11 trips to conferences, workshops, hackathons, internships, etc.

We had 11 publications in 2018.  This total does not include the 2018 publications from Drs. Wu and Jayarathna since those were already in the pipeline prior to them joining WS-DL; their contributions will be included in the 2019 summary.  Our students' publications this year mainly centered around three conferences, with one "best poster" award and two "best paper" nominations:

In addition to iPres, Hypertext, and JCDL/JCDL-DC/WADL/KDD, we also attended 12 additional events:

We were fortunate enough to host Michael Herzog and several other members of Hochschule Magdeburg-Stendal in March.


For internal outreach, we gave several presentations, seminars, and colloquia within ODU.




This year was exceptionally good for public outreach about web archiving.  I was quoted in the Washington Post, The Atlantic, and Vox, culminating in an interview on CNN in April.  The entire story is nicely summarized in an ODU press release.





We've continued to update existing software and datasets and to release new ones via our GitHub account. Given the nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year our significant releases and updates include:
This year we also had a number of significant contributions that were neither software, data, nor conventional publications.  Instead, they were quick analyses, evaluations, definitions, or reviews.  They may yet form the basis of future publications, but the blog allowed for rapid release:
We were able to continue our string of nine consecutive years with new funding.  The IMLS funded "Continuing Education to Advance Web Archiving", with Virginia Tech as the lead institution and several others, including WS-DL, in a supporting role.   With several existing funded projects scheduled to close out this year, plus two additional faculty members, the focus of 2019 will be acquiring additional external funding.

WS-DL annual reviews are also available for 2017, 2016, 2015, 2014, and 2013.  Finally, we'd like to thank all those who have complimented our blog, students, and the WS-DL research group in general.  We really appreciate the feedback, some of which we include below.

--Michael

2019-02-01: WS-DL Hosts Ed Fox For A Colloquium About Global Event And Trend Archiving


On January 18, 2019, the Web Science and Digital Libraries Group hosted Prof. Ed Fox from Va Tech for the colloquium "Integrating Research and Education Regarding Global Event and Trend Archiving".  After the colloquium we had a working lunch where three of our PhD students briefed Dr. Fox on their current status.  Photos and links are included in the Twitter thread below.

--Michael

2019-02-02: Two Days in Hawaii - the 33rd AAAI Conference on Artificial Intelligence (AAAI-19)

The 33rd AAAI Conference on Artificial Intelligence, the 31st Innovative Applications of Artificial Intelligence Conference, and the 9th Symposium on Educational Advances in Artificial Intelligence were held at the Hilton Hawaiian Village in Honolulu, Hawaii. I had one paper accepted by IAAI 2019, "Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets," coauthored with Athar Sefid (my student at PSU), Jing Zhao (my mentee at PSU), Lu Liu (a graduate student who published a Nature Letter), Cornelia Caragea (my collaborator at UIC), Prasenjit Mitra, and C. Lee Giles.
This year, AAAI received its greatest number of submissions ever -- 7,095, roughly double the 2018 total. 18,191 reviews were collected, and over 95% of papers received 3 reviews. 1,147 papers were accepted, or 16.2% of all submissions, the lowest acceptance rate in the conference's history. There were 122 technical sessions in total, with 460 oral presentations (15-minute talks) and 687 posters (2-minute flash talks). Authors from China submitted the largest number of papers and had the largest number accepted (382, an acceptance rate of about 16%); authors from the US had 264 papers accepted (21%), and Israel had the highest acceptance rate (24.4%). The topics of MVP (machine learning, NLP, and computer vision) accounted for over 61% of all submissions and 59% of accepted papers. The top 3 areas of submission growth were reasoning under uncertainty, applications, and humans and AI, while the top 3 areas of decline were cognitive systems, computational sustainability, and human computation and crowdsourcing. Papers with supplementary material had a noticeably higher acceptance rate (27%) than papers without (12%).
IAAI 2019 was less competitive, with 118 submissions and an acceptance rate of 35%. There were 36 emerging application papers (including ours) and 5 deployed application papers. Deployed application awards were conferred on 5 papers:
  • Grading uncompilable programs by Rohit Takhar & Varun Aggarwal (Machine Learning India)
  • Automated Dispatch of Helpdesk Email Tickets by Atri Mandal et al. (IBM Research India)
  • Transforming Underwriting in the Life Insurance Industry by Marc Maier et al. (Massachusetts Mutual Life Insurance Company)
  • Large Scale Personalized Categorization of Financial Transactions by Christopher Lesner et al. (Intuit Inc.)
  • A Genetic Algorithm for Finding a Small and Diverse Set of Recent News Stories on a Given Subject: How We Generate AAAI’s AI-Alert by Joshua Eckroth and Eric Schoen (i2kconnect)
The Robert S. Engelmore Memorial Award was conferred on Milind Tambe (USC) for outstanding research in the area of multi-agent systems and their application to problems of societal significance. I know Tambe's work partially through his student Amulya Yadav, whom I interviewed for an assistant professor position at Penn State IST. He is well known for his work on connecting AI with social good.
The classic paper award was conferred on Prem Melville et al. for their 2002 AAAI paper "Content-boosted collaborative filtering for improved recommendations" (cited 1,621 times on Google Scholar). This work proposed content-boosted collaborative filtering for recommendation, which is now a classic textbook approach for recommender systems.
Due to the limited amount of time I spent in Hawaii, I attended only 3 invited talks.
The first was given by Cynthia Breazeal, director of the Personal Robots Group at MIT. Her presentation was on a social robot called Jibo. Unlike Google Home and Amazon Echo, this robot focuses on social communication with people rather than on selling products and controlling devices. It is based on the Bayesian Theory of Mind communication framework and Bloom's learning theory. Jibo has been tested in early childhood education and in fostering community connections among aging people. The goal is to promote early learning with peers and to treat loneliness, helplessness, and boredom. It can talk like a human and perform simple motions, such as dancing. My personal opinion is that we should be careful when using these robots: they may have value in medical treatment, but people should always be encouraged to reach out to other people, not to robots.
The second was given by Milind Tambe on "AI and Multiagent Systems Research for Social Good". He divided this broad topic into 3 areas: public safety and security, conservation, and public health. He views social problems as multiagent systems and pointed out that the key research challenge is how to optimize limited intervention resources when interacting with other agents. Examples include conservation and wildlife protection, in which game theory was used to successfully predict poachers in national parks in Uganda; homeless youth shelters in Los Angeles (this is Amulya's work); and patrol scheduling using game theory.
The last one was given by the world-famous deep learning expert Ian Goodfellow, Senior Staff Research Scientist at Google AI and an author of the widely used Deep Learning book. His talk was on "Adversarial Machine Learning" -- of course, he invented the Generative Adversarial Network (GAN). He described the current prosperity of machine learning as a Cambrian explosion, and gave applications of GANs in security, model-based optimization, reinforcement learning, extreme reliability, label efficiency, domain adaptation, fairness, accountability, transparency, and finally neuroscience. His current research focuses on designing extremely reliable systems for use in autonomous vehicles, air traffic control, surgical robots, medical diagnosis, etc. A lot of his data is images.
There were too many sessions and I was interested in many of them, but I finally chose to focus on the NLP sessions. The paper titles can be found on the conference website. Most NLP papers use AI techniques to tackle fundamental NLP problems such as representation learning, sentence-level embedding, and entity and relation extraction. I summarize what I learned below:
(1) GANs, attentive models, and Reinforcement Learning (RL) are gaining more attention, especially the latter. For example, RL is used to embed sentences using attentive recursive trees (Jiaxin Shi et al., Tsinghua University) and to build a hierarchical framework for relation extraction (Takanobu et al., Tsinghua University). An attentive GAN was used to generate chatbot responses (Yu Wu et al., Beihang University), and RL was used to generate topically coherent visual stories (Qiuyuan Huang et al., MSR). Plain deep neural networks are still popular, but less dominant in NLP tasks.
(2) Zero-shot learning became a popular topic. Zero-shot learning means learning to handle classes or tasks without any training instances. For example, Lee and Jha (MSR) presented Zero-shot Adaptive Transfer for Conversational Language Understanding, and Shruti Rijhwani (CMU) presented Zero-Shot Neural Transfer for Cross-Lingual Entity Linking.
(3) Entity and relation extraction, one of the fundamental tasks in NLP, is still not well solved. People are approaching this problem in different ways, but it seems that joint extraction works better than handling the two tasks separately. The CoType model proposed by Xiang Ren et al. has become a baseline. New models are being proposed that do better, though the improvements are marginal. For example, Rijhwani et al. (CMU) proposed Zero-Shot Neural Transfer for Cross-Lingual Entity Linking, Changzhi Sun & Yuanbin Wu (East China Normal University) proposed Distantly Supervised Entity Relation Extraction with Adapted Manual Annotations, and Gupta et al. (Siemens) proposed Neural Relation Extraction Within and Across Sentence Boundaries.
(4) Most advances in QA systems are still limited to the answer selection task. Generating natural language is still very difficult, even with DNNs. There is interesting work by Lili Yao (Peking University) in which short stories are generated from a given keyphrase, but the code is not yet ready to be released.

(5) There is one paper on a framework for question generation via phrase extraction by Siyuan Wang et al. (Fudan University), which is related to my recent research in summarization. However, the input to the system is single sentences, rather than paragraphs, not to mention full text, so it is not directly applicable to our work. Some session names look interesting in general, but the papers usually focus on very narrow topics.

The IAAI session I attended featured 5 presentations. 
  • Early-stopping of scattering pattern observation with Bayesian Modeling by Asahara et al. (Japan). This is a good example of applying AI to physics: they basically use unsupervised learning to predict neutron scattering patterns, with the goal of reducing the cost of the equipment needed to generate powerful neutron beams.
  • Novelty Detection for Multispectral Images with Application to Planetary Exploration by Hannah R. Kerner et al. (ASU). They are designing AI techniques to facilitate fast decision making for the Mars project.
  • Expert Guided Rule Based Prioritization of Scientifically Relevant Images for Downlinking over Limited Bandwidth from Planetary Orbiters by Srija Chakraborty (ASU).
  • Ours, on Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets. I received a comment and a question. The comment, from an audience member named Chris Lesner, referred me to MinHashing of shingles, which could be another potential solution to the scalability problem. The question was from a person trying to understand the entity matching problem. I also got the business card of Diane Oyen, a staff scientist in the Information Science group at Los Alamos National Lab; she has some interesting plagiarism detection problems on which we could potentially collaborate.
  • A fast machine learning workflow for rapid phenotype prediction from whole shotgun metagenomes by Anna Paola Carrieri et al. (IBM)
One impression I took away from the conference is that most presentations were terrible. Dr. Huan Sun, assistant professor at OSU, agreed. This is the disadvantage of sending students to present at conferences: it benefits the students much more than the audience. Slides were not readable, presenters spoke too quietly, and many did not spend enough time explaining key results, leaving essentially no room for high-quality questions -- the audience simply could not follow what the presenters were talking about. In particular, although Chinese scholars had many papers accepted and presented, most did not present them well. Much of the audience was swiping at smartphones rather than listening to the talks.
Another impression is that the conference is too big! There is little chance to cover enough of it or to meet with speakers. I was lucky to meet my old colleagues from Penn State: Madian Khabsa (now at Apple), Shuting Wang (now at Facebook), and Alex Ororbia II (now at RIT). I also met Prof. Huan Liu of ASU and had lunch with a few new friends at Apple.
Overall, the conference was well organized, although the program arrived very late, which delayed my trip planning. The venue was scenic but too expensive: registration is almost $1k and lunch is not covered! Hawaii is very beautiful; I enjoyed Waikiki Beach and came across a rainbow.

-- Jian Wu

2019-02-08: Google+ Is Being Shuttered, Have We Preserved Enough of It?


In 2017 I reviewed many storytelling, curation, and social media tools to determine which might be a suitable replacement for Storify, should the need arise. Google+ was one of the tools under consideration. It did not make the list of top three, but it did produce quality surrogates.

On January 31, 2019, Sean Keane of CNET published an article indicating that the consumer version of Google+ will shut down on April 2, 2019. I knew that the social media service was being shut down in August, but I was surprised to see the new date. Google's blog mentions that they changed the deadline on December 10, 2018, for security reasons. David Rosenthal's recent blog post cites Google+ as yet another example of Google moving up a service decommission date.


This blog post is long because I am trying to answer several useful questions for would-be Google+ archivists. Here are the main bullet points:
  • End users can create a Google Takeout archive of their Google+ content. The pages from the archive do not use the familiar Google+ stylesheets. The archive only includes images that you explicitly posted to Google+.
  • Google+ pages load more content when a user scrolls. Webrecorder.io is the only web archiving tool that I know of that can capture this content.
  • Google+ consists of mostly empty, unused profiles. We can detect empty, unused profiles by page size. Profile pages less than 568kB are likely empty.
  • The robots.txt for plus.google.com does not block web archives.
  • Even when only considering estimates of active profiles, I estimate that less than 1% of Google+ is archived in either the Internet Archive or Archive.today.
  • I sampled some Google+ mementos from the Internet Archive and found a mean Memento Damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.
Google+ will join the long list of shuttered Web platforms. Verizon will be shuttering some former AOL and Yahoo! services in the next year. Here are some more recent examples:
Sometimes the service is not shuttered, but large swaths of content are removed, such as with Tumblr's recent crackdown on porn blogs, and Flickr's mass deletion of the photos of non-paying users.

The content of these services represents serious value for historians. Thus Geocities, Vine, and Tumblr were the targets of concerted hard-won archiving efforts.

Google launched Google+ in 2011. Writers have been declaring Google+ dead since its launch. Google+ has been unsuccessful for many reasons. Here are some mentioned in the news over the years:
As seen below, Google+ still has active users. I lost interest in 2016, but WS-DL member Sawood Alam, Dave Matthews Band, and Marvel Entertainment still post content to the social network. Barack Obama did not last as long as I did; his last post was in 2015.

I stopped posting to Google+ in 2016.
WS-DL member Sawood Alam is a more active Google+ member, having posted 17 weeks ago.

Dave Matthews Band uses Google+ to advertise concerts. Their last post was 1 week ago.

Marvel Entertainment was still posting to Google+ while I was writing this blog post.

Barack Obama lost interest in Google+. His last post was on March 6, 2015.

Back in July of 2018, I analyzed how much of the U.S. Government's AHRQ websites were archived. Google+ is much bigger than those two sites. Google+ allows users to share content with small groups or the public. In this blog post, I focus primarily on public content and current content.

I will use the following Memento terminology in this blog post:
  • original resource - a live web resource
  • memento - an observation, a capture, of a live web resource
  • URI-R - the URI of an original resource
  • URI-M - the URI of a memento
ArchiveTeam has a wiki page devoted to the shutdown of Google+. They list the archiving status as "Not saved yet." As shown below, I have found less than 1% of Google+ pages in the Internet Archive or Archive.today.

In the spirit of my 2017 blog post about saving data from Storify, I cover how one can acquire their own Google+ data. My goal is to provide information for archivists trying to capture the accounts under their care. Finally, in the spirit of the AHRQ post, I discuss how I determined how much of Google+ is probably archived.

Saving Google+ Data

Google Takeout


There are professional services like PageFreezer that specialize in preserving Google+ content for professionals and companies. Here I focus on how individuals might save their content.

Google Takeout allows users to acquire their data from all of Google's services. 


Google provides Google Takeout as a way to download personal content for any of their services. After logging into Google Takeout, it presents you with a list of services. Click "Select None" and then scroll down until you see the Google+ entries.

Select "Google+ Stream" to get the content of your "main stream" (i.e., your posts). There are additional services from which you can download Google+ data. "Google+ Circles" allows you to download vCard-formatted data for your Google+ contacts. "Google+ Communities" allows you to download the content for your communities.

Once you have selected the desired services, click Next. Then click Create Archive on the following page. You will receive an email with a link allowing you to download your archive.

From the email sent by Google, a link to a page like the one in the screenshot allows one to download their data.

The Takeout archive is a zip file that decompresses into a folder containing an HTML file and a set of folders. These HTML pages include your posts, events, information about posts that you +1'd, comments you made on others' posts, poll votes, and photos.

Note that the actual files of some of these images are not part of this archive. It does include your profile pictures and pictures that you uploaded to posts. Images from any Google+ albums you created are also available. With a few exceptions, references to images from within the HTML files in the archive are all absolute URIs pointing to googleusercontent.com.  They will no longer function if googleusercontent.com is shut down. Anyone trying to use this Google Takeout archive will need to do some additional crawling for the missing image content.
Google Takeout (right) does not save some formatting elements in your Google+ posts (left). The image, in this case, was included in my Google Takeout download because it is one that I posted to the service.
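As a hedged sketch of what that additional crawling might look like, the script below walks an unpacked Takeout folder, collects absolute googleusercontent.com image URIs from the HTML files, and downloads local copies. The folder name, URI pattern, and output layout are illustrative assumptions, not part of the Takeout format.

```python
# Hypothetical helper for fetching the googleusercontent.com images that a
# Google Takeout archive references but does not include. Paths are assumptions.
import os
import re
import requests

TAKEOUT_DIR = "Takeout/Google+ Stream"   # assumed location after unzipping
OUT_DIR = "googleusercontent_images"
os.makedirs(OUT_DIR, exist_ok=True)

# Match absolute image URIs pointing at googleusercontent.com
uri_pattern = re.compile(r"https://[a-z0-9.-]+\.googleusercontent\.com/[^\"'\s<>)]+")

found = set()
for root, _, files in os.walk(TAKEOUT_DIR):
    for name in files:
        if name.endswith(".html"):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                found.update(uri_pattern.findall(f.read()))

for i, uri in enumerate(sorted(found)):
    resp = requests.get(uri, timeout=60)
    if resp.ok:
        with open(os.path.join(OUT_DIR, f"image_{i:05d}"), "wb") as out:
            out.write(resp.content)
```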

Webrecorder.io


One could use webrecorder.io to preserve their profile pages. Webrecorder saves web content to WARCs for use in many web archive playback tools. I chose Webrecorder because Google+ pages require scrolling to load all content, and scrolling is a feature with which Webrecorder assists.

A screenshot of my public Google+ profile replayed in Webrecorder.io.
One of Webrecorder's strengths is the ability to authenticate to services. We should be able to use this authentication ability to capture private Google+ data.

I tried recording my profile using my native Firefox browser, but that did not work well. Unfortunately, as shown below, sometimes Google's cookie infrastructure got in the way of authenticating with Google from within Webrecorder.io.

In Firefox, Google does not allow me to log into my Google+ account via Webrecorder.io.

I recommend changing the internal Webrecorder.io browser to Chrome to preserve your profile page. I tried to patch the recording a few times to capture all of the JavaScript and images. Even in these cases, I was unable to record all private posts. If someone else has better luck with Webrecorder and their private data, please indicate how you got it to work in the comments.

Other Web Archiving Tools

The following methods only work on your public Google+ pages. Google+ supports a robots.txt that does not block web archives.

The robots.txt for plus.google.com as of February 5, 2019, is shown below:



You can manually browse through each of your Google+ pages and save them to multiple web archives using the Mink Chrome Extension. The screenshots below show it in action saving my public Google+ profile.

The Mink Chrome Extension in action, click to enlarge. Click the Mink icon to show the banner (far left), and then click on the "Archive Page To..." button (center left). From there choose the archive to which you wish to save the page (center right), or select "All Three Archives" to save to multiple archives. The far right displays a WebCite memento of my profile saved using this process.
Archive.is and the Internet Archive both support forms where you can insert a URI and have it saved. Using the URIs of your Google+ public profile, collections, and other content, manually submit them to these forms and the content will be saved.

The Internet Archive (left) has a Save Page Now form as part of the Wayback Machine.
archive.today (right) has similar functionality on its front page.
My Google+ profile saved using the Internet Archive's Save Page Now form.
If you have all of your Google+ profile, collection, community, post, photo, and so on URIs in a file and wish to push them to web archives automatically, submit them to the ArchiveNow tool. ArchiveNow can save them to archive.is, archive.st, the Internet Archive, and WebCite by default. It also provides support for Perma.cc if you have a Perma.cc API key.
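For example, a minimal sketch using ArchiveNow's Python interface might look like the following; the input file name is hypothetical, and the handler identifiers and return values should be checked against the oduwsdl/archivenow documentation.

```python
# A hedged sketch: push each Google+ URI to the Internet Archive ("ia") and
# archive.today ("is") handlers via ArchiveNow. The input file is hypothetical.
from archivenow import archivenow

with open("googleplus_uris.txt") as f:
    uris = [line.strip() for line in f if line.strip()]

for uri in uris:
    for archive_id in ("ia", "is"):
        # push() is expected to return the resulting memento URI(s)
        print(archivenow.push(uri, archive_id))
```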

Current Archiving Status of Google+

How Much of Google+ Should Be Archived?

This section is not about making relevance judgments based on the historical importance of specific Google+ pages. A more serious problem exists: most Google+ profiles are indeed empty. Google made it quite difficult to enroll in Google's services without signing up for Google+ at the same time. At one time, if one wanted a Google account for Gmail, Android, Google Sheets, Hangouts, or a host of other services, they would inadvertently be signed up for Google+. Acquiring an actual count of active users has been difficult because Google reported engagement numbers for all services as if they were for Google+. President Obama, Tyra Banks, and Steven Spielberg have all hosted Google Hangouts. This participation can be misleading, as Google Hangouts and Photos were the features most often used, and those users may not have maintained a Google+ profile. Again, there are a lot of empty Google+ profiles.

In 2015, Forbes wrote that less than 1% of users (111 million) are genuinely active, citing a study done by Stone Temple Consulting. In 2018, Statistics Brain estimated 295 million active Google+ accounts.

As archivists trying to increase the quality of our captures, we need to detect the empty Google+ profiles. Crawlers start with seeds from sitemaps. I reviewed the robots.txt for plus.google.com and found four sitemaps, one of which focuses on profiles. The sitemap at http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml consists of 50000 additional sitemaps. Due to the number and size of the files, I did not download them all to get an exact profile count. Each consists of between 67,000 and 68,000 URIs for an estimated total of 3,375,000,000 Google+ profiles.
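That estimate can be reproduced roughly with a sketch like the one below, which reads the profiles sitemap index, samples a few child sitemaps, counts their <loc> entries, and extrapolates. Whether the child sitemaps remain reachable and whether they are served plain or gzip-compressed are assumptions to check.

```python
# A rough, hedged sketch of the profile-count estimate described above.
import random
import re
import requests

INDEX = "http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml"

index_xml = requests.get(INDEX, timeout=60).text
child_sitemaps = re.findall(r"<loc>(.*?)</loc>", index_xml)
print(len(child_sitemaps))               # ~50,000 child sitemaps per the post

sample = random.sample(child_sitemaps, 10)
counts = [len(re.findall(r"<loc>", requests.get(u, timeout=60).text))
          for u in sample]
avg_per_sitemap = sum(counts) / len(counts)          # ~67,000-68,000 URIs each
print(int(avg_per_sitemap * len(child_sitemaps)))    # roughly 3.4 billion profiles
```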


An example of an "empty" Google+ profile.


How do we detect accounts that were never used, like the one shown above?  The sheer number of URIs makes it challenging to perform an extensive lexical analysis in a short amount of time, so I took a random sample of 601 profile page URIs from the sitemap. I chose the sample size using the Sample Size Calculator provided by Qualtrics and verified it with similar calculators provided by SurveyMonkey, Raosoft, the Australian Bureau of Statistics, and Creative Research Systems. These sample sizes represent a confidence level of 95% and a margin of error of ±4%.
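For reference, the textbook sample-size formula for estimating a proportion reproduces that number; this is just the standard calculation, not the internals of any particular calculator.

```python
# n = z^2 * p * (1 - p) / e^2 with the most conservative p = 0.5,
# z = 1.96 (95% confidence), and e = 0.04 (margin of error of ±4%).
import math

z, p, e = 1.96, 0.5, 0.04
n = math.ceil(z**2 * p * (1 - p) / e**2)
print(n)   # 601
```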

Detecting unused and empty profiles is similar to the off-topic problem that I tackled for web archive collections, and it turns out that the size of the page is a good indicator of a profile being unused.  I attempted to download all 601 URIs with wget, but 18 returned a 404 status. A manual review of this sample indicated that profiles of size 568kB or higher contain at least one post. Anyone attempting to detect an empty Google+ profile can issue an HTTP HEAD and record the byte size from the Content-Length header. If the byte size is less than 568kB, then the page likely represents an empty profile and can be ignored.

One could automate this detection using a tool like curl or any HTTP client that can issue a HEAD request and extract the status line, date, and Content-Length. For an "empty" Google+ profile, such a request reports a Content-Length of 567,342 bytes; the same request for a non-empty Google+ profile reports 720,352 bytes.
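A small sketch of that check in Python follows. It assumes, as observed above, that the server reports a Content-Length header on HEAD responses; the threshold comes from the sample described earlier.

```python
# Hedged sketch: flag a Google+ profile as probably empty if its HEAD response
# reports a Content-Length below the 568 kB threshold observed in the sample.
import requests

EMPTY_THRESHOLD = 568 * 1000   # bytes

def is_probably_empty(profile_uri):
    resp = requests.head(profile_uri, allow_redirects=True, timeout=30)
    length = int(resp.headers.get("Content-Length", "0"))
    return length < EMPTY_THRESHOLD

# Example, using the numeric profile URI mentioned later in this post:
print(is_probably_empty("https://plus.google.com/115814011603595795989"))
```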
An example of a 703kB Google+ profile with posts from 3 weeks ago.

An example of Google+ loading more posts in response to user scrolling. Note the partial blue circle on the bottom of the page, indicating that more posts will load.

As seen above, Google+ profiles load more posts on scroll. Profiles at 663kB or greater have filled the first "page" of scrolling. Any Google+ profile larger than this has more posts to view. Unfortunately, crawling tools must execute a scroll event on the page to load this additional content. Web archive recording tools that do not automatically scroll the page will not record this content.
This histogram displays counts of the file sizes of the downloaded Google+ profile pages. Most profiles are empty, hence a significant spike for the bin containing 554kB.
From my sample 57/601 (9.32%) had content larger than 568kB. Only 12/601 (2.00%) had content larger than 663kB, potentially indicating active users. By applying this 2% to the total number of profiles, we estimate that 67.5 million profiles are active. Of course, based on the sample size calculator, my estimate may be off by as much as 4%, leaving an upper estimate of 135 million, which is between the 111 million number from the 2015 Stone Temple study and the 295 million number from the 2018 StatisticsBrain web page. The inconsistencies are likely the result of the sitemap not reporting all profiles for the entire history of Google+ as well as differences in the definition of a profile between these studies.

I looked at various news sources that had linked to Google+ profiles. The profile URIs from the sitemaps do not correspond to those often shared and linked by users. For example, my vanity Google+ profile URI is https://plus.google.com/+ShawnMJones, but it is displayed in the sitemap as a numeric profile URI https://plus.google.com/115814011603595795989. Google+ uses the canonical link relation to link these two URIs but reveals this relation only in the HTML of these pages. For a tool to discover this relationship, it must dereference each sitemap profile URI, an expensive discovery process at scale. If Google had placed these relations in the HTTP Link header, then archivists could use an HTTP HEAD to discover the relationship. The additional URI forms make it difficult to use profile URIs from sitemaps alone for analysis.
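A hedged sketch of that discovery step: dereference a profile URI and pull the canonical link out of the HTML with a simple regular expression. A real crawler would use a proper HTML parser and handle attribute-order variations; the URI below is the numeric example from this post.

```python
# Hedged sketch: extract the rel="canonical" link from a Google+ profile page.
import re
import requests

def canonical_uri(profile_uri):
    html = requests.get(profile_uri, timeout=30).text
    match = re.search(r'<link[^>]*rel="canonical"[^>]*href="([^"]+)"', html)
    return match.group(1) if match else None

# Prints whichever URI form Google+ declares canonical for this profile.
print(canonical_uri("https://plus.google.com/115814011603595795989"))
```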

The content of the pages found at the vanity and numeric profile URIs is slightly different. Their SHA256 hashes do not match. A review in vimdiff indicates that the differences are self-referential identifiers in the HTML (i.e., JavaScript variables containing +ShawnMJones vs. 115814011603595795989), a nonce that is calculated by Google+ and inserted into the content when it serves each page, and some additional JavaScript. Visually they look the same when rendered.

How much of Google+ is archived?


The lack of easy canonicalization of profile URIs makes it challenging to use the URIs found in sitemaps for web archive analysis. I chose instead to evaluate the holdings reported by two existing web archives.

For comparison, I used numbers from the sitemaps downloaded directly from plus.google.com.
I use these totals for comparison in the following sections.
Internet Archive Search Engine Result Pages
I turned to the Internet Archive to understand how many Google+ pages exist in its holdings. I downloaded the data file used in the AJAX call that produces the page shown in the screenshot below.

The Internet Archive reports 83,162 URI-Rs captured for plus.google.com.

The Internet Archive reports 83,162 URI-Rs captured. Shown in the table below, I further analyzed the data file and broke it into profiles, posts, communities, collections, and other by URI.

| Category    | # in Internet Archive | % of Total from Sitemap |
|-------------|-----------------------|-------------------------|
| Collections | 1                     | 0.00000572%             |
| Communities | 0                     | 0%                      |
| Posts       | 12,946                | Not reported in sitemap |
| Profiles    | 65,000                | 0.00193%                |
| Topics      | 0                     | 0%                      |
| Other       | 5,217                 | Not reported in sitemap |

The archived profile page URIs are both of the vanity and numeric types. Without dereferencing each, it is difficult to determine how much overlap exists. Assuming no overlap, the Internet Archive possesses 65,000 profile pages, which is far less than 1% of 3 billion profiles and 0.0481% of our estimate of 135 million active profiles from the previous section.

I randomly sampled 2,334 URI-Rs from this list, corresponding to a confidence level of 95% and a margin of error of ±2%. I downloaded TimeMaps for these URI-Rs and calculated a mean of 67.24 mementos per original resource.
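The per-URI-R memento counts behind that mean can be gathered from the Wayback Machine's link-format TimeMap endpoint. The sketch below is a simplified version of that step; it assumes the standard https://web.archive.org/web/timemap/link/{URI-R} form and a simple string match on the rel values.

```python
# Hedged sketch: count mementos for a URI-R from its Internet Archive TimeMap.
import statistics
import requests

def memento_count(uri_r):
    timemap_uri = "https://web.archive.org/web/timemap/link/" + uri_r
    resp = requests.get(timemap_uri, timeout=60)
    if resp.status_code != 200:           # e.g., 404: no captures
        return 0
    # In link-format TimeMaps, every memento entry has a rel value ending in "memento"
    return sum(1 for line in resp.text.splitlines() if 'memento"' in line)

sample_uri_rs = ["https://plus.google.com/115814011603595795989"]  # hypothetical sample
print(statistics.mean(memento_count(u) for u in sample_uri_rs))
```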
Archive.today Search Engine Result Pages
As shown in the screenshot below, Archive.today also provides a search interface on its web site.

Archive.today reports 2551 URI-Rs captured for plus.google.com.

Archive.today reports 2,551 URI-Rs, but scraping its search pages returns 3,061 URI-Rs. I analyzed the URI-Rs returned from the scraping script to place them into the categories shown in the table below.

| Category    | # in Archive.today | % of Total from Sitemap |
|-------------|--------------------|-------------------------|
| Collections | 10                 | 0.0000572%              |
| Communities | 0                  | 0%                      |
| Photos      | 22                 | Not reported in sitemap |
| Posts       | 1,994              | Not reported in sitemap |
| Profiles    | 989                | 0.0000293%              |
| Topics      | 1                  | 0.248%                  |
| Other       | 45                 | Not reported in sitemap |


Archive.today contains 989 profiles, a tiny percent of the 3 billion suggested by the sitemap and the 135 million active profile estimate that we generated from the previous section.

Archive.today is Memento-compliant, so I attempted to download TimeMaps for these URI-Rs. For 354 URI-Rs, I received 404s for their TimeMaps, leaving me with 2707 TimeMaps. Using these TimeMaps, I calculated a mean of 1.44 mementos per original resource.

Are these mementos of good quality?

Merely having mementos in an archive is not enough; their quality matters as well. Crawling web content often results in missing embedded resources such as stylesheets. Fortunately, Justin Brunelle developed an algorithm for scoring the quality of a memento that takes missing embedded resources into account, and Erika Siregar developed the Memento Damage tool based on Justin's algorithm so that we can calculate these scores. I used the Memento Damage tool to score the quality of some mementos from the Internet Archive.

The histogram of memento damage scores from our random sample shows that most have a damage score of 0.
Memento damage takes a long time to calculate, so I needed to keep the sample size small. I randomly sampled 383 URI-Rs from the list acquired from the Internet Archive and downloaded their TimeMaps. I then acquired a list of 383 URI-Ms by randomly sampling one URI-M from each TimeMap and fed these URI-Ms into a local instance of the Memento Damage tool. The Memento Damage tool experienced errors for 41 URI-Ms.

This memento has the highest damage score of 0.941 in our sample. The raw size of its base page is 635 kB.


The mean damage score for these mementos is 0.347. A score of 0 indicates no damage. This score may be misleading, however, because more content is loaded via JavaScript when the user scrolls down the page. Most crawling software does not trigger this JavaScript code and hence misses this content.

The screenshot above displays the largest memento in our sample. The base page has a size of 1.3 MB and a damage score of 0.0. It is not a profile page, but a page for a single post with comments.
The screenshot above displays the smallest memento in our sample with a size greater than zero and no errors while computing damage. This single post page redirects to a page not captured by the Internet Archive. The base page has a size of 71kB and a damage score of 0.516.
The screenshot above displays a memento for a profile page of size 568kB, the lower bound of pages with posts from our earlier live sample. It has a memento damage score of 0.

This histogram displays the file sizes in our memento sample. Note how most have a size between 600kB and 700kB. 

As an alternative to memento damage, I also downloaded the raw memento content of the 383 mementos to examine their sizes.  The HTML has a mean size of 466kB and a median of 500kB. In this sample, we have mementos of posts and other types of pages mixed in. Post pages appear to be smaller. The memento of a profile page shown below still contains posts at 532kB. Mementos for profile pages smaller than this had just a user name and no posts. It is possible that the true lower bound in size is around 532kB.

This memento demonstrates a possible new lower bound in profile size at 532kB. The Internet Archive captured it in January of 2019.

Discussion and Conclusions


Google+ is being shut down on April 2, 2019. What direct evidence will future historians have of its existence? We have less than two months to preserve much of Google+. In this post, I detailed how users might preserve their profiles with Google Takeout, Webrecorder.io, and other web archiving tools.

I mentioned that there are questions about how many active users ever existed on Google+. In Google's attempt to make all of its services "social" it conflated the number of active Google+ users with active users of other Google services. Third-party estimates of active Google+ users over the years have ranged from 111 million to 295 million.  With a sample size of 601 profiles from the profile sitemap at plus.google.com, I estimated that the number might be as high as 135 million.

To archive non-empty Google+ pages, we have to be able to detect pages that are empty. I analyzed a small sample of Google+ profile pages and discovered that pages of size 663kB or larger contain enough posts to fill the first "page" of scrolling. I also discovered that inactive profile pages tend to be less than 568kB. Using the HTTP HEAD method and the Content-Length header, archivists can use this value to detect unused or sparsely used Google+ profiles before downloading their content.

I estimated how much of Google+ exists in public web archives. Scraping URIs from the search result pages of the Internet Archive, the most extensive web archive, reveals only 83,162 URI-Rs for Google+. Archive.today only reveals 2,551 URI-Rs. Both have less than 1% of the totals of different Google+ page categories found in the sitemap. The fact that so few are archived may indicate that few archiving crawls found Google+ profiles because few web pages linked to them.

I sampled some mementos from the Internet Archive and found a mean damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.

Because Google+ uses page scrolling to load more content, many mementos will likely be of poor quality unless they are recorded with tools like Webrecorder.io. With the sheer number of pages to preserve, we may have to choose quantity over quality.

If a sizable sample of those profiles is considered to be valuable to historians, then web archives have much catching up to do.

A concerted effort will be necessary to acquire a significant number of profile pages by April 2, 2019. My recommendations are for users to archive their public profile URIs with ArchiveNow, Mink, or the save page forms at the Internet Archive or Archive.today. Archivists looking to archive Google+ more generally should download the topics sitemap and at least capture the 404 (four hundred four, not 404 status) topics pages using these same tools. Enterprising archivists can search news sources, like this Huffington Post article and this Forbes article, that feature popular and famous Google+ users. Sadly, because of the lack of links, much of the data from these articles is not in a machine-readable form.  A Google+ archivist would need to search Google+ for these profile page URIs manually. Once that is done, the archivist can then save these URIs using the tools mentioned above.

Due to its lower usage compared to other social networks and its controversial history, some may ask "Is Google+ worth archiving?" Only future historians will know, and by then it will be too late, so we must act now.

-- Shawn M. Jones

2019-02-14: CISE papers need a shake -- spend more time on the data section


A Crucial Step for Averting AI Disasters

I know this is a large topic and I may not have enough evidence to convince everyone, but based on my experience reviewing journal articles and conference proceedings, I strongly feel that computer and information science and engineering (CISE) papers need to devote more text to describing and analyzing their data.

This argument partially comes from my background in astronomy and astrophysics. Astronomers and astrophysicists usually spend a huge chunk of text in their papers discussing the data they adopt, including but not limited to where the data were collected, why they did not use another dataset, how the raw data were pre-processed, and why outliers were ruled out. They also analyze the data and report statistical properties, trends, or biases to ensure that they are using legitimate points in their plots.

In contrast, in many papers I have read and reviewed, even in top conferences, CISE people do not often do such work. They usually assume that because a dataset has been used before, they can simply use it. Many emphasize the size of the data, but few look into its structure, completeness, taxonomy, noise, and potential outliers. The consequence is that they spend a lot of space on algorithms and report results better than baselines, but that by itself is not a guarantee of anything. Good CISE papers usually discuss the bias and potential risks introduced by the data, but such papers are rare, even in top conferences.

Algorithms are one of the pillars of CISE, but they are not everything. An algorithm only provides the framework, like a photo frame; the data are the photo. Without the right photo, the framed picture will not look pleasing, and even if it looks pleasing for one photo, it won't for others. Of course, no algorithm fits all data, but a paper should at least discuss what types of data its algorithm should be applied to.

The good news is that many CISE people have started paying attention to this problem. At the IEEE Big Data Conference, Blaise Aguera y Arcas, the Google AI director, emphasized that AI algorithms have to be accompanied by the right data to be ethical and useful. Recently, a WSJ article titled "A Crucial Step for Averting AI Disasters" echoed the idea. The article quoted Douglas Merrill's words -- "The answer to almost every question in machine learning is more data" -- and I would supplement this by adding "right" after "more". If we claim we are doing Data Science, how can we neglect the first part?

-- Jian Wu

2019-03-05: 365 dots in 2018 - top news stories of 2018

Fig. 1: News stories for 365 days in 2018. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, and the y-axis represents the average degree of the selected GCC.

There was no shortage of big news headlines in 2018. Amidst this abundance, a natural question is: what were the top news stories of 2018? There are multiple lists from different news organizations that present candidate top stories of 2018, such as CNN's most popular stories and videos of 2018 and The year in review: Top news stories of 2018 month by month from CBS. Even though such lists from respectable news organizations pass the "seems right" test, they mostly present the top news stories without explaining their selection process. In other words, they often do not state why one story made the list and another did not. We consider such information very helpful for two reasons. First, an explanation of the criteria behind the list opens those criteria to critique and helps alleviate concerns about bias. Second, the criteria are inherently valuable because they can be reused and reapplied on a different collection. For example, one could apply the same process to find the top news stories in a different country.

Fortunately, StoryGraph is well suited to answer our main question, "what were the top news stories of 2018?"

A brief introduction of StoryGraph
We plan to publish a blog post introducing and explaining StoryGraph in the near future, but here is a quick explanation. StoryGraph is a service that periodically (at 10-minute intervals) generates a news similarity graph. In this graph, the nodes represent news stories, and an edge between a pair of nodes represents a high degree of similarity between those news stories. For example, the story In eulogy, Obama says McCain called on nation to be ‘bigger’ than politics ‘born in fear’ is highly similar (similarity: 0.46) to the story John McCain honored at National Cathedral memorial service; therefore, an edge exists between the two stories in their parent story graph.
Small stories vs big stories: how StoryGraph quantifies the magnitude of news stories
On a slow news day, news organizations report on many different news stories. This results in a low degree of similarity between pairs of news stories (e.g., Fig. 2) and in smaller connected components.
In contrast, shortly after a major news event, news organizations publish multiple highly similar news stories. This results in a high degree of similarity between pairs of news stories. This often leads to a Giant Connected Component (GCC) in the news story graph  (e.g., Fig. 3).
Fig. 2: A small news story exhibits lower pairwise node similarity and a lower Giant Connected Component average degree (4) than a big news story (17.03 in Fig. 3).
In short, the larger the average degree of a Giant Connected Component of a story graph, the bigger the news event, and vice versa.
StoryGraph generates 144 graphs per day (one every 10 minutes), which means there are 144 candidate news graphs (duplicate stories included) to pick from when selecting the top news story for a given day. The following steps were applied to select the top news story for a single day. First, each story graph was assigned a score: the average degree of its giant connected component. Second, from the set of 144 story graphs, the graph with the highest score -- that is, the graph whose giant connected component has the highest average degree -- was selected to represent the top news story for the day. These two steps were applied across 365 days to generate the scores plotted in Fig. 1, which captures the top news stories for each of the 365 days in 2018.
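The sketch below mirrors those two steps using networkx; it assumes each of a day's 144 story graphs has already been loaded as a networkx Graph (loading from StoryGraph's output is omitted) and treats the GCC as the largest connected component of each graph.

```python
# Hedged sketch of the daily top-story selection described above.
import networkx as nx

def gcc_average_degree(g):
    """Average degree of the largest connected component of g."""
    if g.number_of_nodes() == 0:
        return 0.0
    gcc_nodes = max(nx.connected_components(g), key=len)
    gcc = g.subgraph(gcc_nodes)
    return 2 * gcc.number_of_edges() / gcc.number_of_nodes()

def top_story_graph(day_graphs):
    """From a day's 144 story graphs, pick the one whose GCC has the highest average degree."""
    return max(day_graphs, key=gcc_average_degree)
```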
The top news stories of 2018
From Fig. 1 and the table below, it is clear that the Kavanaugh hearings were the biggest news story of 2018, with a GCC avg. degree of 25.85. In fact, this story was the top news story for about 25 days (red dots in Fig. 1). Also, the top story had three sibling story graphs with GCC avg. degrees (18.96 - 21.94) higher than the second top story.

| Rank | Date (MM-DD) | News Story (Selected) Title | GCC Avg. Deg. |
|------|--------------|-----------------------------|---------------|
| 1  | 09-27 | Kavanaugh accuser gives vivid details of alleged assault - CNN Video | 25.85 |
| 2  | 02-02 | Disputed GOP-Nunes memo released - CNNPolitics | 18.81 |
| 3  | 06-12 | Kim Jong Un, Trump participate in signing ceremony, make history: Live updates - ABC News | 18.15 |
| 4  | 10-24 | Clinton and Obama bombs: Secret Service intercepts suspicious packages - The Washington Post | 17.03 |
| 5  | 03-17 | Trump celebrates McCabe firing on Twitter - CNN Video | 16.32 |
| 6  | 06-14 | DOJ IG report 'reaffirmed' Trump's 'suspicions' of bias of some in FBI: White House - ABC News | 15.63 |
| 7  | 08-29 | A Black Progressive and a Trump Acolyte Win Florida Governor Primaries - The New York Times | 15.37 |
| 8  | 04-14 | Trump orders strike on Syria in response to chemical attack - ABC News | 15.21 |
| 9  | 02-25 | Trump calls Schiff a 'bad guy,' Democratic memo 'a nothing' - CNNPolitics | 15.13 |
| 10 | 11-07 | John Brennan Warns After Sessions’ Firing: ‘Constitutional Crisis Very Soon’ - Breitbart | 14.88 |

The next major news story of 2018 was the release of the Nunes memo (GCC avg. degree: 18.81). Similar to the Kavanaugh hearings, the story about the release of the controversial memo was the top news story for seven days. In contrast, the story about Schiff's rebuttal memo did not receive as much attention, ranking 9th with a GCC avg. degree of 15.13. In third place was the Trump-Kim summit with a GCC avg. degree of 18.15. Unlike the top two stories, this story, although initially big, did not linger beyond two days; it is an example of a big news story that lacked staying power.

Multiple news stories in our list were included in the lists of top stories from other news organizations such as CNN, CBS, NBC News, and Business Insider. For example, the Kavanaugh hearings (No. 1), the Trump-Kim summit (No. 3), the Pipe bomber (No. 4), and the Midterm elections (No. 7) were included in multiple top news lists. However, to our surprise, the MSD shooting news story (GCC avg. degree of 7.74) was not in our list of top 10 news stories, even though it appeared in multiple top news lists from multiple news organizations. Also, the Nunes memo story (No. 2) was our second top story, but it was absent from the lists of the four major news organizations we considered.

President Trump was a dominant figure in the 2018 news discourse. As shown in Fig. 1, out of the 365 days, "Trump" was included in the title representing the day's top story graph 197 times (~54%).

-- Alexander Nwala (@acnwala)

2019-03-18: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages

Figure 1: Mixed language blocks on a memento of a Twitter timeline. Highlighted with blue colored box for Portuguese, orange for English, and red for Urdu. Dotted border indicates the template present in the original HTML response while blocks with solid borders indicate lazily loaded content.

Would you be surprised if I told you that Twitter is a multi-lingual website, supporting 47 different international languages? What if I told you that a typical Twitter timeline can contain tweets in whatever languages the owner of the handle chooses to tweet in, yet can also show the navigation bar and various sidebar blocks in many different languages simultaneously -- surprised now? While it makes no sense, it can actually happen in web archives when a memento of a timeline is accessed, as shown in Figure 1. Spoiler alert! Cookies are to blame, once again.

Last month, I was investigating a real-life version of the "Ron Burgundy will read anything on the teleprompter" (Anchorman) and "Chatur's speech" (3 Idiots) moments when I noticed something that caught my eye. I was looking at a memento (i.e., a historical version of a web page) of Pratik Sinha's Twitter timeline from the Internet Archive. Pratik is the co-founder of Alt News (an Indian fact-checking website) and the person who edited an internal document of the IT Cell of the BJP (the current ruling party of India), which was then copy-pasted and tweeted by some prominent handles of the party. Tweets on his timeline are generally in English, but the archived page's template language was not English (although I did not request the page in any specific language). However, this was not surprising to me, as we investigated the reason behind this template-language behavior last year and found that HTTP cookies were causing it. After I had spent a minute or so on the page, a small notice appeared in the main content area, right above the list of tweets, suggesting that there were 20 more tweets, but the message was in Urdu, a Right-to-Left (RTL) language very different from the language used in the page's navigation bar. Urdu being my first language, this immediately alerted me that something was not quite right. Upon further investigation, I found that the page was composed of three different languages, Portuguese, English, and Urdu, as highlighted in Figure 1 (here I am not talking about the language of the tweets themselves).

What Can Deface a Composite Memento?


This defaced composite memento is a serious archival replay problem because it shows a web page that perhaps never existed. While the individual representations all existed separately on the live web, they were never combined in the page as it is replayed by the web archive. In the Web Science and Digital Libraries Research Group, we have uncovered a couple of causes in the past that can yield defaced composite mementos. One of them is live-leakage (also known as zombies), for which Andy Jackson proposed using Content-Security-Policy, Ada Lerner et al. took a security-centric approach that was deployed by the Internet Archive's Wayback Machine, and we proposed Reconstructive as a potential solution using Service Workers. The other known cause is temporal violations, which Scott Ainsworth is working on as his PhD research. However, this mixed-language Twitter timeline issue cannot be explained by zombies or temporal violations.

Anatomy of a Twitter Timeline


To uncover the cause, I further investigated the anatomy of a Twitter timeline page and the various network requests it makes when accessed live or from a web archive, as illustrated in Figure 2. Currently, when a Twitter timeline is loaded anonymously (without logging in), the page is returned with a block containing a brief description of the user, a navigation bar (with counts of tweets, followers, etc.), a sidebar block encouraging visitors to create a new account, and an initial set of tweets. The page also contains empty placeholders for some sidebar blocks, such as related users to follow, globally trending topics, and recent media posted on that timeline. Apart from loading page requisites, the page makes some follow-up XHR requests to populate these blocks. While the page is active (i.e., the browser tab is focused), it polls for new tweets every 30 seconds and for global trends every 5 minutes. Successful responses to these asynchronous XHR requests contain data in JSON format, but instead of providing language-independent, bare-bones structured data to rendering templates on the client side, they contain some server-side rendered, encoded markup. This markup is decoded on the client side, injected directly into the corresponding empty placeholders (or replaces any existing content), and then the block is made visible. This server-side partial markup rendering needs to know the language of the parent page in order to use phrases translated into the corresponding language and yield a consistent page.

Figure 2: An active Twitter timeline page asynchronously populates related users and recent media blocks then polls for new tweets every 30 seconds and global trends every 5 minutes.

How Does Twitter's Language Internationalization Work?


From our past investigation we know that Twitter handles languages in two primary ways: a query parameter and a Cookie header. In order to fetch a page in a specific language (from their 47 currently supported languages) one can either add a "?lang=<language-code>" query parameter to the URI (e.g., https://twitter.com/ibnesayeed?lang=ur for Urdu) or send a Cookie header containing the "lang=<language-code>" name/value pair. A URI query parameter takes precedence in this case and also sets the "lang" Cookie accordingly (overwriting any existing value) for all subsequent requests, until overwritten again explicitly. This works well on the live site, but has some unfortunate consequences when a memento of a Twitter timeline is replayed from a web archive, causing the hodgepodge illustrated in Figure 1 (the area highlighted by the dotted border indicates the template served in the initial HTML response, while areas surrounded by solid borders were lazily loaded). This mixed-language rendering does not happen when a memento of a timeline is loaded with an explicit language query parameter in the URI, as illustrated in Figures 3, 4, and 5 (the "lang" query parameter is highlighted in the archival banner, as are the lazily loaded blocks from each language that correspond to the blocks in Figure 1). In this case, all the subsequent XHR URIs also contain the explicit "lang" query parameter.

Figure 3: A memento of a Twitter timeline explicitly in Portuguese.

Figure 4: A memento of a Twitter timeline explicitly in English.

Figure 5: A memento of a Twitter timeline explicitly in Urdu. The direction of the page is Right-to-Left (RTL); as a result, the sidebar blocks are moved to the left-hand side.

To understand the issue, consider the following sequence of events during the crawling of a Twitter timeline page. Suppose we begin a fresh crawling session and start by fetching the https://twitter.com/ibnesayeed page without any specific language code supplied. Depending on the geo-location of the crawler or other factors, Twitter might return the page in a specific language, for instance, in English. The crawler extracts links of all the page requisites and hyperlinks to add them to the frontier queue. The crawler may also attempt to extract URIs of potential XHR or other JS-initiated requests, which might add URIs like:
https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module
and
https://twitter.com/i/related_users/28631536
(and various other lazily loaded resources) to the frontier queue. The HTML page also contains 47 language-specific alternate links (and one x-default hreflang) in its markup (with "?lang=<language-code>" style parameters). These alternate links will also be added to the frontier queue of the crawler in some order. When these language-specific links are fetched by the crawler, the lang Cookie will be set, overwriting any prior value. Now, suppose https://twitter.com/ibnesayeed?lang=ur was fetched before the "/i/trends" data; this would set the language for any subsequent requests to be served in Urdu. When the data for the global trends block is fetched, Twitter's server will return server-side rendered markup in Urdu, which will be injected into the page that was initially served in English. This will cause the header of the block to say "دنیا بھر کے میں رجحانات" instead of "Worldwide trends". Here, I would take a long pause of silence to express my condolences on the brutal murder of a language with more than 100 million speakers worldwide by a platform as big as Twitter. The Urdu translation of this phrase, appearing in such a prominent place on the page, is nonsense and grammatically wrong. Twitter, if you are listening, please change it to something like "عالمی رجحانات" and get an audit of other translated phrases. Now, back to the original problem, the following is a walk-through of the scenario described above.

$ curl --silent "https://twitter.com/ibnesayeed" | grep "<html"
<html
lang="en" data-scribe-reduced-action-queue="true">
$ curl --silent
-c /tmp/twitter.cookie"https://twitter.com/ibnesayeed?lang=ur" | grep "<html"
<html
lang="ur" data-scribe-reduced-action-queue="true">
$ grep
lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0
lang ur
$ curl --silent
-b /tmp/twitter.cookie"https://twitter.com/ibnesayeed" | grep "<html"
<html
lang="ur" data-scribe-reduced-action-queue="true">
$ curl --silent
-b /tmp/twitter.cookie"https://twitter.com/i/trends?k=&pc=true&profileUserId=28631536&show_context=true&src=module" | jq
{
"module_html": "<div class=\"flex-module trends-container context-trends-container\">\n <div class=\"flex-module-header\">\n \n <h3><span class=\"trend-location js-trend-location\">
دنیا بھر کے میں رجحانات</span></h3>\n </div>\n <div class=\"flex-module-inner\">\n <ul class=\"trend-items js-trends\">\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"#PiDay\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/hashtag/PiDay?src=tren&amp;data_id=tweet%3A1106214111183020034\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#PiDay</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n Google employee sets new record for calculating π to 31.4 trillion digits\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"#SaveODAAT\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/hashtag/SaveODAAT?src=tren&amp;data_id=tweet%3A1106252880921747457\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#SaveODAAT</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n Netflix cancels One Day at a Time after three seasons\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"Beto\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/search?q=Beto&amp;src=tren&amp;data_id=tweet%3A1106142158023786496\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Beto</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n Beto O’Rourke announces 2020 presidential bid\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"#AvengersEndgame\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:hashtag_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/hashtag/AvengersEndgame?src=tren&amp;data_id=tweet%3A1106169765830295552\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#AvengersEndgame</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n Marvel dropped a new Avengers: Endgame trailer\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"12 Republicans\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/search?q=%2212%20Republicans%22&amp;src=tren\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">12 Republicans</span>\n\n \n <div class=\"js-nav 
trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n 6,157 ٹویٹس\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"#NationalAgDay\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/hashtag/NationalAgDay?src=tren\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#NationalAgDay</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n 6,651 ٹویٹس\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"Kyle Guy\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_1000_10000_metadescription:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/search?q=%22Kyle%20Guy%22&amp;src=tren\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Kyle Guy</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n 1,926 ٹویٹس\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"#314Day\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:hashtag_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/hashtag/314Day?src=tren\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">#314Day</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n 12 ہزار ٹویٹس\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"Tillis\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:entity_trend:taxi_country_source:moments_metadescription:moments_badge:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/search?q=Tillis&amp;src=tren&amp;data_id=tweet%3A1106266707230777344\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Tillis</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n Senate votes to block Trump&#39;s border emergency declaration\n </div>\n </a>\n\n</li>\n\n <li class=\"trend-item js-trend-item context-trend-item\"\n data-trend-name=\"Bikers for Trump\"\n data-trends-id=\"1025618545345384837\"\n data-trend-token=\":location_request:entity_trend:taxi_country_source:tweet_count_10000_100000_metadescription:\"\n \n >\n\n <a class=\"pretty-link js-nav js-tooltip u-linkComplex \"\n href=\"/search?q=%22Bikers%20for%20Trump%22&amp;src=tren\"\n data-query-source=\"trend_click\"\n \n >\n <span class=\"u-linkComplex-target trend-name\" dir=\"ltr\">Bikers for Trump</span>\n\n \n <div class=\"js-nav trend-item-context js-ellipsis\"></div>\n <div class=\"js-nav trend-item-stats js-ellipsis\">\n 16.8 ہزار ٹویٹس\n </div>\n </a>\n\n</li>\n\n </ul>\n </div>\n</div>\n",
"personalized": false,
"woeid": 1
}

Here, I started by fetching my Twitter timeline without specifying any language in the URI or via cookies. The response was returned in English. I then fetched the same page with an explicit "?lang=ur" query parameter and saved any returned cookies in the "/tmp/twitter.cookie" file. The response was indeed returned in Urdu. I then checked the saved cookie file to see if it contains a "lang" cookie, which it does, with a value of "ur". I then utilized the saved cookie file to fetch the main timeline page again, but without an explicit "?lang=ur" query parameter, to illustrate that Twitter's server respects the cookie and returns the response in Urdu. Finally, I fetched the global trends data while utilizing the saved cookies and illustrated that the response contains JSON-serialized HTML markup with Urdu header text in it as "<h3><span class=\"trend-location js-trend-location\">دنیا بھر کے میں رجحانات</span></h3>" under the "module_html" JSON key. The original response is encoded using Unicode escapes, but I used the jq utility here to pretty-print the JSON and decode the escaped markup for easier illustration.

Understanding Cookie Violations


When fetching a single page (and all its page requisites) at a time, this problem, let's call it a cookie violation, might not happen as often. However, when crawling is done on a large scale, completely preventing such unfortunate misalignment of the frontier queue is almost impossible, especially since the "lang" cookie is set for the root path of the domain and affects every resource from the domain.

The root cause here can more broadly be described as lossy state information being utilized when replaying a stateful resource representation from archives that originally performed content negotiation based on cookies or other headers. Most of the popular archival replay systems (e.g., OpenWayback, PyWB, and even our own InterPlanetary Wayback) do not perform any content negotiation when serving a memento other than on the Accept-Datetime header (which is not part of the original crawl-time interaction, but a means to add the time dimension to the web). Traditional archival crawlers (such as Heritrix) mostly interacted with web servers by using only URIs, without any custom request headers that might affect the returned response. This means a canonicalized URI along with the datetime of the capture was generally sufficient to identify a memento. However, cookies are an exception to this assumption, as they are needed for some sites to behave properly, hence cookie management support was added to these crawlers a long time ago. Cookies can be used for tracking, client-side configuration, key/value storage, and authentication/authorization session management, but in some cases they can also be used for content negotiation (as is the case with Twitter). When cookies are used for content negotiation, the server should advertise it in the "Vary" header, but Twitter does not. Accommodating cookies at capture/crawl time, but not utilizing them at replay time, has the consequence of cookie violations, resulting in defaced composite mementos. Similarly, in aggregated personal web archiving, which is the PhD research topic of Mat Kelly, not utilizing session cookies (or other forms of authorization headers) at replay time can result in a serious security vulnerability of private content leakage. In modern headless browser-based crawlers there might even be some custom headers that a site utilizes in XHR (or fetch API) requests for content negotiation, which should be considered when indexing the content for replay (or filtering at replay time from a subset). Ideally, a web archive should behave like an HTTP proxy/cache when it comes to content negotiation, but it may not always be feasible.
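To make this indexing gap a bit more concrete, below is a minimal sketch in Python (not taken from any existing replay tool, and with illustrative header names and values) of an index key that, in addition to the canonicalized URI and capture datetime, records the request headers named in the response's Vary header so that a cookie-aware replay system could later match on them. As noted above, Twitter does not actually send "Vary: Cookie", so with real Twitter captures this key would degenerate to just the URI and datetime, and the "lang" cookie would still be lost.

def index_key(canonical_uri, capture_datetime, request_headers, response_headers):
    # Build a lookup key for a captured response: (URI, datetime) plus the
    # values of any request headers the server claims to vary on.
    key = [canonical_uri, capture_datetime]
    vary = response_headers.get("Vary", "")
    for header_name in [h.strip().lower() for h in vary.split(",") if h.strip()]:
        key.append((header_name, request_headers.get(header_name, "")))
    return tuple(key)

# Hypothetical capture of the trends resource; the "Vary: Cookie" header is an
# assumption for illustration, since Twitter omits it.
print(index_key(
    "com,twitter)/i/trends?k=&pc=true&profileuserid=28631536",
    "20190211132145",
    {"cookie": "lang=ur"},
    {"Vary": "Cookie"},
))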

What Should We Do About It?


So, should we include cookies in the replay index and only return a memento if the cookies in the request headers match? Well, that would be a disaster, as it would cause an enormous number of false negatives (i.e., mementos that are present in an archive and should be returned, but won't be). Perhaps we can canonicalize cookies and only index ones that are authentication/authorization session-related or used for content negotiation. However, identifying such cookies will be difficult and will require some heuristic analysis or machine learning, because these are opaque strings whose names are decided by the server application (rather than using any standardized names).
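For illustration, here is a rough heuristic in Python, an assumption rather than a tested classifier, for deciding whether a cookie looks like a content-negotiation cookie (and thus might be worth indexing) rather than an opaque session or tracking token. The name allowlist and the language-code pattern are guesses; a real solution would need the kind of heuristic analysis or machine learning mentioned above.

import re

NEGOTIATION_NAMES = {"lang", "locale", "language", "country", "currency"}  # illustrative guesses
LANG_CODE = re.compile(r"^[a-z]{2,3}(-[A-Za-z]{2,4})?$")

def looks_like_negotiation_cookie(name, value):
    if name.lower() in NEGOTIATION_NAMES:
        return True
    # Short, human-readable values such as "ur" or "pt-BR" are more likely to
    # steer content than 100+ character signed session blobs.
    return len(value) <= 8 and LANG_CODE.match(value) is not None

print(looks_like_negotiation_cookie("lang", "ur"))                    # True
print(looks_like_negotiation_cookie("_twitter_sess", "BAh7CSIK..."))  # False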

Even if we can somehow sort this issue out, there are even bigger problems in making it work. For example, how do we get the client to send suitable cookies in the first place? How will the web archive know when to send a "Set-Cookie" header? Should the client follow the exact path of interactions with pages that the crawler did when a set of pages was captured in order to set appropriate cookies?

Let's ignore session cookies for now and only focus on content-negotiation-related cookies. Also, let's relax the cookie matching condition further by only filtering mementos if a Cookie header is present in a request, and otherwise ignoring cookies in the index. This means the replay system can send a Set-Cookie header if the memento in question was originally observed with a Set-Cookie header and expect to see it in subsequent requests. Sounds easy? Welcome to cookie collision hell. Cookies from various domains will have to be rewritten to set the domain name of the web archive that is serving the memento. As a result, identical cookie names from various domains served over time from the same archive will step over each other (it is worth mentioning that a single web page often has page requisites from many different domains). Even the web archive can have some of its own cookies, independent of the memento being served.

We can attempt to solve this collision issue by rewriting the path of cookies and prefixing it with the original domain name to limit the scope (e.g., changing "Set-Cookie: lang=ur; Domain: twitter.com; Path=/" to "Set-Cookie: lang=ur; Domain: web.archive.org; Path=/twitter.com/"). This is not going to work, because the client will not send this cookie unless the requested URI-M path has a prefix of "/twitter.com/", but the root path of Twitter is usually rewritten as something like "/web/20190214075028/https://twitter.com/" instead. If the same rewriting rule is used in the cookie path, then the unique 14-digit datetime path segment will block it from being sent with subsequent requests that have a different datetime (which is almost always the case after an initial redirect). Unfortunately, cookie paths do not support wildcards like "/web/*/https://twitter.com/".

Another possibility could be prefixing the name of the cookie with the original domain [and path] (with some custom encoding and unique-enough delimiters), then setting the path to the root of the replay (e.g., changing the above example to "Set-Cookie: twitter__com___lang=ur; Domain: web.archive.org; Path=/web/"), which the replay server understands how to decode and apply properly. I am not aware of any other cookie attributes that can be exploited to annotate cookies with additional information. The downside of this approach is that if the client relies on these cookies for certain functionality, then the changed names will affect it.

Additionally, an archival replay system should also rewrite the cookie expiration time to a short-lived future value (irrespective of the original value, which could be in the past or in the very distant future); otherwise, the growing pile of cookies from many different pages will increase request sizes significantly over time. Moreover, incorporating cookies in replay systems will have some consequences for cross-archive aggregated memento reconstruction.
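Below is a minimal sketch in Python of this name-prefixing idea, assuming a replay host of web.archive.org and a replay root path of /web/; the encoding scheme and the 300-second lifetime cap are arbitrary illustrative choices, not the behavior of any existing replay system.

from http.cookies import SimpleCookie

def rewrite_set_cookie(set_cookie_value, original_domain,
                       archive_host="web.archive.org", replay_root="/web/",
                       max_age_cap=300):
    original = SimpleCookie()
    original.load(set_cookie_value)
    rewritten = SimpleCookie()
    for name, morsel in original.items():
        # Encode the original domain into the cookie name to avoid collisions.
        new_name = original_domain.replace(".", "__") + "___" + name
        rewritten[new_name] = morsel.value
        rewritten[new_name]["domain"] = archive_host
        rewritten[new_name]["path"] = replay_root
        # Clamp the lifetime so stale negotiation cookies do not pile up.
        rewritten[new_name]["max-age"] = max_age_cap
    return rewritten.output(header="Set-Cookie:")

print(rewrite_set_cookie("lang=ur; Domain=twitter.com; Path=/", "twitter.com"))
# e.g., Set-Cookie: twitter__com___lang=ur; Path=/web/; Domain=web.archive.org; Max-Age=300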

In our previous post about another cookie-related issue, we proposed that explicitly expiring cookies (and garbage collecting cookies older than a few seconds) may reduce the impact. We also proposed that distributing crawl jobs for URIs from the same domain across smaller sandboxed instances could minimize the impact. I think these two approaches can be helpful in mitigating this mixed-language issue as well. However, it is worth noting that these are crawl-time solutions, which will not solve the replay issues of existing mementos.

Dissecting the Composite Memento


Now, back to the memento of Pratik's timeline from the Internet Archive. The page is archived primarily in Portuguese. When it is loaded in a web browser that can execute JavaScript, the page makes subsequent asynchronous requests to populate various blocks, as it does on the live site. The recent media block is not archived, so it does not show up. The related users block is populated in Portuguese (because this block is generally populated immediately after the main page is loaded and does not get a chance to be updated later, hence it is unlikely to load a version in a different language). The closest successful memento of the global trends data is loaded from https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=module (which is in English). As the page starts to poll for new tweets for the account, it first finds the closest memento at the URI-M https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856, which is in Urdu. This adds a notification bar above the main content area that suggests there are 20 new tweets available (clicking on this bar will insert those twenty tweets into the timeline, as the necessary markup is already returned in the response, waiting for a user action). I found the behavior of the page to be inconsistent due to intermittent issues, but reloading the page a few times and waiting for a while helps. In the subsequent polling attempts, the latent_count parameter changes from "0" to "20" (this indicates how many new tweets are loaded and ready to be inserted) and the min_position parameter changes from "1095942934640377856" to "1100819673937960960" (these are IDs of the most recent tweets loaded so far). Every other parameter generally remains the same in the successive XHR calls made every 30 seconds. If one waits long enough on this page (while the tab is still active), occasionally another successful response arrives that updates the new tweets notification from 20 to 42 (but in a language different from Urdu). To see if there are any other clues that can explain why the banner was inserted in Urdu, I investigated the HTTP response as shown below (the payload is decoded, pretty-printed, and truncated for ease of inspection):

$ curl --silent -i "https://web.archive.org/web/20190227220450/https://twitter.com/i/profiles/show/free_thinker/timeline/tweets?composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&latent_count=0&min_position=1095942934640377856"
HTTP/2 200
server: nginx/1.15.8
date: Fri, 15 Mar 2019 04:25:14 GMT
content-type: text/javascript; charset=utf-8
x-archive-orig-status: 200 OK
x-archive-orig-x-response-time: 36
x-archive-orig-content-length: 995
x-archive-orig-strict-transport-security: max-age=631138519
x-archive-orig-x-twitter-response-tags: BouncerCompliant
x-archive-orig-x-transaction: 00becd1200f8d18b
x-archive-orig-x-content-type-options: nosniff
content-encoding: gzip
x-archive-orig-set-cookie: fm=0; Max-Age=0; Expires=Mon, 11 Feb 2019 13:21:45 GMT; Path=/; Domain=.twitter.com; Secure; HTTPOnly, _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCIRWidxoAToMY3NyZl9p%250AZCIlYzlmNGViODk4ZDI0YmI0NzcyMTMyMzA3M2M5ZTRjZDI6B2lkIiU2ODFi%250AZjgzYjMzYjEyYzk1NGNlMDlmYzRkNDIzZTY3Mg%253D%253D--22900f43bec575790847d2e75f88b12296c330bc; Path=/; Domain=.twitter.com; Secure; HTTPOnly
x-archive-orig-expires: Tue, 31 Mar 1981 05:00:00 GMT
x-archive-orig-server: tsa_a
x-archive-orig-last-modified: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report
x-archive-orig-x-connection-hash: bca4678d59abc86b8401176fd37858de
x-archive-orig-pragma: no-cache
x-archive-orig-cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
x-archive-orig-date: Mon, 11 Feb 2019 13:21:45 GMT
x-archive-orig-x-frame-options:
cache-control: max-age=1800
x-archive-guessed-content-type: application/json
x-archive-guessed-encoding: utf-8
memento-datetime: Mon, 11 Feb 2019 13:21:45 GMT
link: <https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="original", <https://web.archive.org/web/timemap/link/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="timegate", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="first memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="memento"; datetime="Mon, 11 Feb 2019 13:21:45 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="next memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT", <https://web.archive.org/web/20190217171144/https://twitter.com/i/trends?k=&amp;pc=true&amp;profileUserId=7431372&amp;show_context=true&amp;src=module>; rel="last memento"; datetime="Sun, 17 Feb 2019 17:11:44 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: liveweb-20190211133005/liveweb-20190211132143-wwwb-spn01.us.archive.org.warc.gz
x-app-server: wwwb-app23
x-ts: ----
x-location: All
x-cache-key: httpsweb.archive.org/web/20190211132145/https://twitter.com/i/trends?k=&pc=true&profileUserId=7431372&show_context=true&src=moduleUS
x-page-cache: MISS

{
"max_position": "1100819673937960960",
"has_more_items": true,
"items_html": "\n <li class=\"js-stream-item stream-item stream-item\n\" data-item-id=\"1100648521127129088\"\nid=\"stream-item-tweet-1100648521127129088\"\ndata-item-type=\"tweet\"\n data-suggestion-json=\"{&quot;suggestion_details&quot;:{},&quot;tweet_ids&quot;:&quot;1100648521127129088&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}\"> ... [REDACTED] ... </li>",
"new_latent_count": 20,
"new_tweets_bar_html": "<button class=\"new-tweets-bar js-new-tweets-bar\" data-item-count=\"20\" style=\"width:100%\">\n
دیکھیں 20 نئی ٹویٹس\n\n </button>\n",
"new_tweets_bar_alternate_html": []
}

While many web archives are good at exposing original response headers via X-Archive-Orig-* headers in mementos, I do not know of any web archive (yet) that also exposes the corresponding original request headers (I propose using something like X-Archive-Request-Orig-* headers). By looking at the above response we can understand the structure of how the new tweets notification works on a Twitter timeline, but it does not answer why the response was in Urdu (as highlighted in the value of the "new_tweets_bar_html" JSON key). Based on my assessment and the experiment above, I think that the corresponding request had a header like "Cookie: lang=ur", which could be verified if the corresponding WARC file were available.
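If the WARC file named in the X-Archive-Src header above were available (it is not publicly accessible), a short sketch like the following, using the warcio library, could confirm whether the crawl-time request for the trends resource carried a "lang" cookie; the file name below is copied from that header and the check itself is hypothetical.

from warcio.archiveiterator import ArchiveIterator

with open("liveweb-20190211132143-wwwb-spn01.us.archive.org.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "request":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri and uri.startswith("https://twitter.com/i/trends"):
            # Print the Cookie request header to see whether "lang=ur" was sent.
            print(uri)
            print("Cookie:", record.http_headers.get_header("Cookie"))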

Cookie Violations Experiment on the Live Site


Finally, I attempted to recreate this language hodgepodge on the live site on my own Twitter timeline. I followed the steps below and ended up with the page shown in Figure 6 (which contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have contained all 47 supported languages).

  1. Open your Twitter timeline in English by explicitly supplying "?lang=en" query parameter in a browser tab (it can be an incognito window) without logging in, let's call it Tab A
  2. Open another tab in the same window and load your timeline without any "lang" query parameter (it should show your timeline in English), let's call it Tab B
  3. Switch to Tab A and change the value of the "lang" parameter to one of the 47 supported language codes and load the page to update the "lang" cookie (which will be reflected in all the tabs of the same window)
  4. From a different browser (that does not share cookies with the above tabs) or device, log in to your Twitter account (if not logged in already) and retweet something
  5. Switch to Tab B and wait for a notification to appear suggesting one new tweet in the language selected in Tab A (it may take a little over 30 seconds)
  6. If you want to add more languages, then click on the notification bar (which will insert the new tweet in the current language) and repeat from step 3; otherwise continue
  7. To see the global trends block of Tab B in a different language perform step 3 with the desired language code, switch back to Tab B, and wait until it changes (it may take a little over 5 minutes)

Figure 6: Mixed language illustration on Twitter's live website. It contains phrases from English, Arabic, Hindi, Spanish, Chinese, and Urdu, but could have all 47 supported languages.


Conclusions


With the above experiment on the live site, I am confident in my assessment that a cookie violation could be one reason why a composite memento would be defaced. How common this issue is in Twitter's mementos and on other sites is still an open question. While I do not know a silver-bullet solution to this issue yet, I think it can potentially be mitigated to some extent for future mementos by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawling of URLs from the same domain across many small sandboxed instances. Investigating options for filtering responses by matching cookies needs more rigorous research.

--
Sawood Alam


2019-04-01: Creating a data set for 116th Congress Twitter handles

Senators from Alabama in the 115th Congress

Any researcher conducting research on Twitter and the US Congress might think, "how hard could it be to create a data set of Twitter handles for the members of Congress?" At any given time, we know the number of members in the US Congress and we also know the current members of Congress. At this point, creating a data set of Twitter handles for the members of Congress might seem like an easy task, but it turns out to be a lot more challenging than expected. We present the challenges involved in creating a data set of Twitter handles for the members of the 116th US Congress and provide a data set of Twitter handles for the 116th US Congress.

Brief about the US Congress


The US Congress is a bicameral legislature comprising the Senate and the House of Representatives. The Congress consists of:

  • 100 senators, two from each of the fifty states.
  • 435 representatives, with seats distributed by population across the fifty states.
  • 6 non-voting members from the District of Columbia and US territories, which include American Samoa, Guam, the Northern Mariana Islands, Puerto Rico, and the US Virgin Islands.
Every US Congress is consecutively numbered and has a term of two years. The current US Congress is the 116th Congress, which began on 2019-01-03 and will end on 2021-01-03.

Previous Work on Congressional Twitter


Since the inception of social media, Congress members have aggressively used it as a medium of communication with the rest of the world. Previous researchers have compiled their data sets of US Congress Twitter handles by both using other lists and manually adding to them.

Jennifer Golbeck et al. in their papers "Twitter Use by the US Congress" (2010) and "Congressional twitter use revisited on the platform's 10-year anniversary" (2018) used Tweet Congress to build their data set of Twitter handles for the members of Congress. An important highlight from their 2018 paper is that every member of Congress has a Twitter account. Libby Hemphill in "What's congress doing on twitter?" talks about the manual creation of a list of 380 Twitter handles for the US Congress, which were used for collecting tweets in the winter of 2012. Theresa Loraine Cardenas in "The Tweet Delete of Congress: Congress and Deleted Posts on Twitter" (2013) used Politwoops to create the list of Twitter handles for members of Congress. Jihui Lee et al. in their paper "Detecting Changes in Congressional Twitter Networks over Time" used the community-maintained GitHub repository from @unitedstates to collect Twitter data for 369 of the 435 representatives in the 114th US Congress. Libby Hemphill and Matthew A. Shapiro in their paper "Appealing to the Base or to the Moveable Middle? Incumbents' Partisan Messaging Before the 2016 U.S. Congressional Elections" (2018) also used the community-maintained GitHub repository from @unitedstates.
Screenshot from Tweet Congress

Twitter Handles of the 116th Congress 


January 3, 2019 marked the beginning of the 116th United States Congress, with 99 freshman members. It has already been two months since the new Congress was sworn in. Now, let us review Tweet Congress and the @unitedstates GitHub repository to check how up-to-date these sources are with the Twitter handles for the current members of Congress. We also review the CSPAN Twitter lists for the members of Congress in our analysis.

Tweet Congress 

Tweet Congress is an initiative from the Sunlight Foundation with help from Twitter to create a transparent environment which allows easy conversation between lawmakers and voters in real time. It was launched in 2011. It lists all the members of Congress and their contact information. The service also provides visualizations and analytics for Congressional accounts.     

@unitedstates (GitHub Repository)

It is a community-maintained GitHub repository which has lists of members of the United States Congress from 1789 to the present, congressional committees from 1973 to the present, current committee memberships, and information about all the presidents and vice presidents of the United States. The data is available in YAML, JSON, and CSV formats.

CSPAN (Twitter List)

CSPAN maintains Twitter lists for the 116th US Representatives and US Senators. The Representatives list has 482 Twitter accounts while the Senators list has 114 Twitter accounts. 

Combining Lists  


We used the Wikipedia page on the 116th Congress as our gold-standard data for the current members of Congress. The data from Wikipedia was collected on 2019-03-01. Correspondingly, the data from CSPAN, @unitedstates (GitHub repository), and Tweet Congress was also collected on 2019-03-01. We then manually compiled a CSV file with the members of Congress and the presence of their Twitter handles in all the different sources. The reason for manual compilation of the list was largely the discrepancies in the names of the members of Congress across the different sources under consideration (a sketch of the kind of name normalization this involves follows the screenshots below).
  • Some members of Congress use diacritic characters in their names. For example, Wikipedia and Tweet Congress list the representative from New York as Nydia Velázquez, while Twitter and the @unitedstates repository list her as Nydia Velazquez.
Screenshot from Wikipedia showing Nydia Velázquez, representative from New York, using diacritic characters

Screenshot from Twitter for Rep. Nydia Velazquez from New York without diacritic characters
  • Some members of Congress are listed with their middle names in some sources but not in others.
Screenshot from Wikipedia for Rep. Mark Green from Tennessee with his middle name

Screenshot from Twitter for Rep. Mark Green from Tennessee without his middle name
Screenshot from Tweet Congress for Rep. Mark Green from Tennessee without his middle name
  • Some members of Congress go by a nickname in some sources and by their given name in others.
Screenshot from Wikipedia for Rep. Chuck Fleischmann from Tennessee using his nickname
Screenshot from Twitter for Rep. Chuck Fleischmann from Tennessee using his nickname
Screenshot from Tweet Congress for Rep. Chuck Fleischmann from Tennessee using his given name
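For illustration, here is a minimal Python sketch of the kind of normalization such matching requires (this is not the exact procedure we used; as noted above, the list was ultimately compiled manually, and the middle-name spelling below is illustrative): strip diacritics, lowercase, and drop middle tokens so that, for example, "Nydia Velázquez" and "Nydia Velazquez" compare equal. Nicknames such as "Chuck" versus "Charles" still require manual review.

import unicodedata

def normalized_key(name):
    # Decompose accented characters and drop the combining marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = ascii_name.lower().replace(".", "").split()
    # Keep only the first and last tokens to ignore middle names (a lossy heuristic).
    return (tokens[0], tokens[-1]) if tokens else ()

print(normalized_key("Nydia Velázquez") == normalized_key("Nydia Velazquez"))        # True
print(normalized_key("Mark E. Green") == normalized_key("Mark Green"))               # True
print(normalized_key("Chuck Fleischmann") == normalized_key("Charles Fleischmann"))  # False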

What did we learn from our analysis?


As of 2019-03-01, the US Congress had 538 of its 541 members, with three vacant representative positions. The three vacant positions were the third and ninth Congressional Districts of North Carolina and the twelfth Congressional District of Pennsylvania. Of the 538 members of Congress, 537 have Twitter accounts, while the non-voting member from Guam, Michael San Nicolas, has no Twitter account.


Name | Position | Joined Congress | CSPAN | @unitedstates | TweetCongress | Remark
Collin Peterson | Rep. | 1991-01-03 | F | F | F | @collinpeterson
Greg Gianforte | Rep. | 2017-06-21 | F | F | F | @GregForMontana
Gregorio Sablan | Del. | 2019-01-03 | F | T | T |
Rick Scott | Sen. | 2019-01-08 | T | !T | F |
Tim Kaine | Sen. | 2013-01-03 | T | !T | F |
James Comer | Rep. | 2016-11-08 | T | !T | F |
Justin Amash | Rep. | 2011-01-03 | T | !T | F |
Lucy Clay | Rep. | 2001-01-03 | T | !T | F |
Bill Cassidy | Sen. | 2015-01-03 | T | !T | T |
Members of the 116th Congress whose Twitter handles are missing from one or more of the sources. T represents both name and Twitter handle present, !T represents name present but Twitter handle missing, and F represents both name and Twitter handle missing.
  • CSPAN has Twitter handles for 534 members of Congress out of the 537 members of Congress with two representatives and a non-voting member missing from its list. The absentees from the list are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), and Delegate Gregorio Sablan (@Kilili_Sablan).
  • The GitHub repository, @unitedstates has Twitter handles for 529 members of Congress out of the 537 members of Congress with five representatives and three senators missing from its data set. The absentees from the repository are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), Rep. Lucy Clay (@LucyClayMO1), and Sen. Bill Cassidy (@SenBillCassidy).
  • Tweet Congress has Twitter handles for 530 members of Congress out of the 537 members of Congress with five representatives and two senators missing.  The absentees are Rep. Collin Peterson (@collinpeterson), Rep. Greg Gianforte (@GregForMontana), Sen. Rick Scott (@SenRickScott), Sen. Tim Kaine (@timkaine), Rep. James Comer (@KYComer), Rep. Justin Amash (@justinamash), and Rep. Lucy Clay (@LucyClayMO1).
The combined list of Twitter handles for the members of Congress from all the sources is missing two representatives, namely Collin Peterson, a representative from Minnesota since 1991-01-03, and Greg Gianforte, a representative from Montana since 2017-06-21. The combined list from all the sources also has six members of Congress who have different Twitter handles in different sources.


Name | Position | Joined Congress | CSPAN | @unitedstates + TweetCongress
Chris Murphy | Sen. | 2013-01-03 | @ChrisMurphyCT | @senmurphyoffice
Marco Rubio | Sen. | 2011-01-03 | @marcorubio | @SenRubioPress
James Inhofe | Sen. | 1994-11-16 | @JimInhofe | @InhofePress
Julia Brownley | Rep. | 2013-01-03 | @RepBrownley | @JuliaBrownley26
Seth Moulton | Rep. | 2015-01-03 | @Sethmoulton | @teammoulton
Earl Blumenauer | Rep. | 1996-05-21 | @repblumenauer | @BlumenauerMedia
Members of the 116th Congress who have different Twitter handles in different sources

Possible reasons for disagreement in creating a Members of Congress Twitter handles data set


Scenarios involved in creating Twitter handles for members of Congress when done over a period of time

One Seat - One Member - One Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, the perfect situation is one where we have one seat in Congress which is held by one member for the entire Congress tenure, and that member holds one Twitter account. For example, Amy Klobuchar, senator from Minnesota, has only one Twitter account, @amyklobuchar.

Google search screenshot for Sen. Amy Klobuchar's Twitter account
Twitter screenshot for Sen. Amy Klobuchar's Twitter account

One Seat - One Member - No Twitter Handle: When creating our data set of Twitter handles for members of Congress over a period of time, we may have one seat in Congress which is held by one member for the entire Congress tenure who does not have a Twitter account. For example, Michael San Nicolas, delegate from Guam, has no Twitter account.

Screenshot from Congressman Michael San Nicolas' page showing a Twitter link to the HouseDems Twitter account, while the rest of the social media icons link to his personal accounts

One Seat - One Member - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we may have one seat in Congress which is held by one member for the entire Congress tenure who has more than one Twitter account. A member of Congress can have multiple Twitter accounts. Based on their purpose, these accounts can be classified as personal, official, and campaign accounts.

  • Personal Account: A Twitter account used by a member of Congress to tweet their personal thoughts can be referred to as a personal account. A majority of these accounts might have a creation date prior to when the member was elected to Congress. For example, Marco Rubio, a Senator from Florida, created his Twitter account @marcorubio in August 2008, while he was sworn in to Congress on 2011-01-03.
Screenshot for the Personal Twitter account of Sen. Marco Rubio from Florida. The account was created in August, 2008 while he was elected to Congress on 2011-01-03 
  • Official Account: A Twitter account used by the member of Congress or their staff to tweet out official information for the general public related to the member's congressional activity is referred to as an official account. A majority of these accounts have creation dates close to the date on which the member of Congress was elected. For example, Marco Rubio, a Senator from Florida, has a Twitter account @senrubiopress which has a creation date of December 2010, while he was sworn in to Congress on 2011-01-03.
Screenshot for the Official Twitter account of Sen. Marco Rubio from Florida. The account was created in December, 2010 while he was elected to Congress on 2011-01-03.
  • Campaign Account: A Twitter account used by a member of Congress for their election campaigns is referred to as a campaign account. For example, Rep. Greg Gianforte from Montana has a Twitter account @gregformontana which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Campaign account of Rep. Greg Gianforte from Montana which contains tweets related to his re-election campaigns.
Twitter Screenshot for the Personal account of Rep. Greg Gianforte from Montana which has personal tweets from him. 

One Seat - Multiple Members - Multiple Twitter Handles: When creating our data set of Twitter handles for members of Congress over a period of time, we can have a seat in Congress which is held by different members, with different Twitter accounts, at different points during the tenure of that Congress. An example from the 115th Congress is the Alabama Senate situation between January 2017 and July 2018. On February 9, 2017, Jeff Sessions resigned as senator and was succeeded by the Alabama Governor's appointee Luther Strange. After the special election, on January 3, 2018, Luther Strange left office to make way for Doug Jones as the Senator from Alabama. Now, who do we include as the Senator from Alabama for the 115th Congress? We might decide to include all of them based on the dates they joined or left office, but when this analysis is done a year later, who will provide us all the historical information for the current Congress in session? As of now, all the sources we analyzed try to provide the most recent information rather than historical information about the current Congress and its members over its entire tenure.
  
Alabama Senate seat situation between January 2017 and July 2018. It highlights the issue in context of Social Feed Manager's 115th Congress tweet dataset.  
One other issue worth mentioning is when members of Congress change their Twitter handles. An example of this scenario is when Rep. Alexandria Ocasio-Cortez from New York tweeted on 2018-12-28 about changing her Twitter handle from @ocasio2018 to @aoc. In the case of popular Twitter accounts of members of Congress, it is easy to discover a change of handle, but for members of Congress who are not popular on Twitter, such changes might go unnoticed for quite some time.

Screenshot of memento for @Ocasio2018
Screenshot of memento which shows the announcement for change of Twitter handle from @Ocasio2018 to @aoc 
Screenshot of @aoc

Twitter Handle Data Set for the 116th Congress

  • We have created a data set of Twitter handles for the 116th Congress which resolves the issues found in CSPAN, Tweet Congress, and @unitedstates (GitHub repository).
  • We have Twitter handles for all 537 current members of Congress who are on Twitter; the one exception is the delegate from Guam, who does not have a Twitter account.
  • Unlike the other sources, our data set does not include any members of Congress who are not part of the 116th Congress.
  • In cases where the sources under investigation disagreed on the Twitter handle for a member of Congress, we chose the account personally managed by the member of Congress (personal Twitter account) over accounts managed by their teams or used for campaign purposes (official or campaign accounts). The reason for choosing personal accounts over official or campaign accounts is that some members of Congress explicitly mention in the Twitter biography of their personal accounts that all the tweets are their own, which is not the case in their official or campaign accounts' Twitter biographies.
Twitter Screenshot of the Personal account for Rep. Seth Moulton where he states that all the tweets are his own in his Twitter bio.

Name | Position | WSDL Data Set | CSPAN | @unitedstates + TweetCongress
Chris Murphy | Sen. | @ChrisMurphyCT | @ChrisMurphyCT | @senmurphyoffice
Marco Rubio | Sen. | @marcorubio | @marcorubio | @SenRubioPress
James Inhofe | Sen. | @JimInhofe | @JimInhofe | @InhofePress
Julia Brownley | Rep. | @RepBrownley | @RepBrownley | @JuliaBrownley26
Seth Moulton | Rep. | @Sethmoulton | @Sethmoulton | @teammoulton
Earl Blumenauer | Rep. | @repblumenauer | @repblumenauer | @BlumenauerMedia
Members of the 116th Congress who have different Twitter handles in different sources. The WSDL data set prefers personal Twitter handles over official ones.

Conclusion


Of the three sources, Tweet Congress, @unitedstates (GitHub repository), and CSPAN, none has full coverage of the Twitter handles for the members of the 116th Congress. One member of Congress does not have a Twitter account, and two additional members of Congress do not have their Twitter handles present in any of the sources. No source provides historical information about the members of Congress over the entire tenure of a Congress, as all the sources focus on recency rather than holding information about the entire tenure. It turns out that creating a data set of Twitter handles for members of Congress seems an easy task at first glance, but it is a lot more difficult, owing to multiple reasons for disagreement, when the study is to be done over a period of time. We share a data set of Twitter handles for the 116th Congress created by combining all the lists.

https://github.com/oduwsdl/US-Congress

----
Mohammed Nauman Siddique
@m_nsiddique

2019-04-17: Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives


On March 11, 2019 in the NBA, the Utah Jazz hosted their Northwest Division rivals, the Oklahoma City Thunder. During the game, a Utah fan (Shane Keisel) and an Oklahoma City player (Russell Westbrook) engaged in a verbal exchange, with the player stating the fan was directing racist comments at him and the fan admitting to heckling but denying that his comments were racist. The event was well documented (see, for example, this Bleacher Report article), and the following day the fan received a lifetime ban from all events at the Vivint Smart Home Arena and the player received a $25k fine from the NBA.

Disclaimer: I have no knowledge of what the fan said during the game, nor do I have an opinion regarding the appropriateness of the respective penalties.  My interest is that after the game, the fan gave at least one interview with a TV station reporter in which he exposed his identity.  That set off a rapidly evolving series of events with both real and fake Twitter accounts, which we unravel with the aid of multiple web archives.  The initial analysis was performed by Justin Whitlock as a project in my CS 895 "Web Archiving Forensics" class; prior to Justin proposing it as a project topic, my only knowledge of this event was via the Daily Show.


First, let's establish a timeline of events. The timeline is made a little bit complicated because, although the game was played in the Mountain time zone, most media reports are relative to Eastern time, and the web crawlers report their time in UTC (or GMT). Furthermore, daylight saving time began on Sunday, March 10, and the game was played on Monday, March 11. This means there is a four hour differential between UTC and EDT, and a six hour differential between UTC and MDT. Although most events occur after the daylight saving switch, some events occur before it (where there would be a five hour differential between UTC and EST). A short code sketch after the timeline below illustrates these conversions.
  • 2019-03-12T01:00:00Z -- the game is scheduled to begin at March 11, 9pm EDT (March 12, 1am UTC).  An NBA game will typically last 2--2.5 hours, and at least one tweet shows Westbrook talking to someone in the bleachers midway through the second quarter (there may be other videos in circulation as well).
  • 2019-03-12T03:58:00Z -- based on the empty seats and the timestamp on the tweet (11:58pm EDT), the post-game interview with a KSL reporter embedded above reveals the fan's name and face.  The uncommon surname of "Keisel" combined with a closeup of his face enables people to quickly find his Twitter account: "@skeisel391".
  • 2019-03-12T04:57:34Z -- Within an hour of the KSL interview being posted, Keisel's Twitter account is "protected". This means we can see his banner and avatar photos and his account metadata, but not his tweets.
  • 2019-03-12T12:23:42Z -- Less than 9 hours after the KSL interview, his Twitter account is "deleted". No information is available from his account at this time.
  • 2019-03-12T15:29:47Z -- Although his Twitter account is deleted, the first page (i.e., first 20 tweets) is still in Google's cache and someone has pushed Google's cached version of the page into a web archive.  The banner of the web archive (archive.is) obscures the banner inserted by Google's cache, but a search of the source code of http://archive.is/K6gP4 reveals: 
    "It is a snapshot of the page as it appeared on Mar 6, 2019 11:29:08 GMT."
In other words, an archived version of Google's cached page reveals Keisel's tweets (the most recent 20 tweets anyway) from nearly a week before (i.e., 2019-03-06T11:29:08Z) the game on March 11, 2019.
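As an aside, here is a small sketch of the timezone arithmetic above using only the Python standard library (3.9+): the scheduled tip-off in UTC rendered in Eastern and Mountain time after the March 10 switch to daylight saving time.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tipoff = datetime(2019, 3, 12, 1, 0, tzinfo=timezone.utc)
print(tipoff.astimezone(ZoneInfo("America/New_York")))  # 2019-03-11 21:00:00-04:00 (9pm EDT)
print(tipoff.astimezone(ZoneInfo("America/Denver")))    # 2019-03-11 19:00:00-06:00 (7pm MDT)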

Although Keisel quickly protected and then ultimately deleted his account, until it was deleted his photos and account metadata were available and allowed a number of fake accounts to proliferate.  The most successful fake is "@skeiseI391", which is also now deleted but stayed online until at least 2019-03-17T04:18:48Z.  "@skeiseI391" replaces the lowercase L ("l") with an uppercase I ("I").  Depending on the font of your browser, the two characters can be all but indistinguishable (here they are side-by-side: lI).  I'm not sure who created this account, but we discovered it in this tweet, where the user provides not only screen shots but also a video of scrolling and clicking through the @skeiseI391 account before it was deleted.
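To see just how easy the two handles are to confuse programmatically as well as visually, here is a small Python illustration: the handles differ in exactly one character, and simple case-folding does not catch the substitution. The confusables map is a tiny hand-made example for this case, not a complete homoglyph table.

real = "skeisel391"
fake = "skeiseI391"

print(real == fake)                  # False
print(real.lower() == fake.lower())  # still False -- "I".lower() is "i", not "l"
print([(i, a, b) for i, (a, b) in enumerate(zip(real, fake)) if a != b])  # [(6, 'l', 'I')]

CONFUSABLES = {"I": "l"}  # illustrative subset of visually similar characters

def skeleton(handle):
    return "".join(CONFUSABLES.get(ch, ch) for ch in handle)

print(skeleton(real) == skeleton(fake))  # True -- the two handles collide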






The video has significant engagement: originally posted at 2019-03-12T10:55:00Z, it now has greater than 1k RTs, 3k likes, and 381k views.  There are many other accounts circulating these screen shots, some of which are provably true, some of which are provably false, and some of which cannot be verified using public web archives.  The screen shots have had an impact in the news as well, showing up in, among others, The Root, News One, and BET.  BET even quoted a provably fake tweet in the headline of their article:

This article's headline references a fake tweet.
The Internet Archive has mementos (archived web pages) for both the fake @skeiseI391 and the real @skeisel391 accounts, but the Twitter account metadata (e.g., when the account was created, how many followers, how many tweets) is in Chinese for the fake account and in Kannada for the real account.  This is admittedly confusing, but it is a result of how the Internet Archive's crawler and Twitter's cookies interact; see our research group's posts from 2018-03 and 2019-03 on these topics for further information.  Fortunately, archive.is does not have the same problems with cookies, so we use their mementos for the following screen shots (two from the real account at archive.is and one from the fake account at archive.is).

real account, 2019-03-06T11:29:08Z (Google cache)
real account, 2019-03-12T04:57:34Z
From the account metadata, we can see this was not an especially active account: established in October 2011, it has 202 total tweets, 832 likes, following 51 accounts, and from March 6 to March 12, it went from 41 to 53 followers.  The geographic location is set to "Utah, USA", and the bio has no linked URL and has three flag emojis.

fake account; note the difference in the account metadata
The fake account has notably different metadata: the bio has only two flag emojis, plus a link to "h.cm", a page for a parked domain that appears to have never had actual content (the Internet Archive has mementos back to 2012). Furthermore, this account is far more active with 7k tweets, 23k likes, 1500 followers and following 1300 accounts, all since being created in August 2018.

Twitter allows users to change their username (or "handle") without losing followers, old tweets, etc.  Since the handle is reflected in the URL and web archives only index by URL, we cannot know what the original handle of the fake @skeiseI391 account was, but at some point after the game the owner changed from the original handle to "skeiseI391".  Since the account is no longer live, we cannot use the Twitter API to extract more information about the account (e.g., followers and following, tweets prior to the game), but given the link to a parked/spam web page and the high level of engagement in a short amount of time, this was likely a burner bot account designed to amplify legitimate accounts (cf. "The Follower Factory"), and then was adapted for this purpose.

We can pinpoint when the fake @skeiseI391 account was changed.  By examining the HTML source from the IA mementos of the fake and real accounts, we can determine the URLs of the profile images:

Real: https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg

Fake: https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg

Both images are 404 now, but they are archived at those URLs in the Internet Archive:

Archived real image, uploaded 2017-06-07T03:08:07Z
Archived fake image, uploaded 2019-03-12T04:29:09Z
Also note that the tool used to download the real image and then upload it as the fake image preserved the circular profile picture instead of the original square.

For those familiar with curl, I include just a portion of the command line interface that shows the original "Last-Modified" HTTP response header from twitter.com.  It is those dates that record when the image changed at Twitter; these are separate from the dates when the image was archived at the Internet Archive.  The relevant response headers are shown below:

Real image:
$ curl -I http://web.archive.org/web/20190312045057/https://pbs.twimg.com/profile_images/872289541541044225/X6vI_-xq_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:12:02 GMT
Content-Type: image/jpeg
...

X-Archive-Orig-last-modified: Wed, 07 Jun 2017 03:08:07 GMT
...

Memento-Datetime: Tue, 12 Mar 2019 04:50:57 GMT
...


Fake image:
$  curl -I http://web.archive.org/web/20190312061306/https://pbs.twimg.com/profile_images/1105325330347249665/YHcWGvYD_400x400.jpg
HTTP/1.1 200 OK
Server: nginx/1.15.8
Date: Wed, 17 Apr 2019 15:13:21 GMT
Content-Type: image/jpeg
...

X-Archive-Orig-last-modified: Tue, 12 Mar 2019 04:29:09 GMT
...

Memento-Datetime: Tue, 12 Mar 2019 06:13:06 GMT
...


The "Memento-Datetime" response header is when the Internet Archived crawled/created the memento (real = 2019-03-12T04:50:57Z; fake = 2019-03-12T06:13:06Z), and the "X-Archive-Orig-last-modified" response header is the Internet Archive echoing the "Last-Modified" response header it received from twitter.com at crawl time.  From this we can establish that the image was uploaded to the fake account at 2019-03-12T04:29:09Z, not quite 30 minutes before we can establish that the real account was set to "protected" (2019-03-12T04:57:34Z). 

We've presented a preponderance of evidence that the @skeiseI391 account is fake and that this fake account is responsible for the "come at me _____ boy" tweet referenced in multiple news outlets.  But what about some of the other screen shots referenced in social media and the news?  Are they real?  Are they photoshopped?  Are they from other, yet-to-be-uncovered fake accounts?

First, any tweet that is a reply to another tweet will be difficult to verify with web archives unless we know the direct URL of the original tweet or the reply itself (e.g., twitter.com/[handle]/status/[many numbers]).  Unfortunately, the deep links for individual tweets are rarely crawled and archived for less popular accounts.  While the top-level page will be crawled and the most recent 20 tweets included, one has to be logged in to Twitter to see the tweets included in the "Tweets & replies" tab, and public web archives are not logged in when they crawl, so those contents are typically not available.  As such, it is hard to establish via web archives whether the screen shot of the reply below is real or fake.  The original thread is still on the live web, but of the 45 replies, two are marked "This Tweet is unavailable".  One of those could be a reply from the real @skeisel391, but we don't have enough information to definitively determine whether that is the case.  The particular tweet shown below ("#poorloser") is at issue because, even though it was posted nearly a year before the game, it would contradict the "we were having fun" attitude from the KSL interview.  Other screen shots that appear as replies will be similarly difficult to uncover using web archives.

This could be a real reply, but with web archives it is difficult to establish provenance of reply tweets.
The tweet below is more difficult to establish, since it does not appear to be a reply and the datetime at which it was posted (2018-10-06T16:11:00Z) falls within the date range of the memento of the page in the Google cache, which has tweets from 2019-02-27 back to 2018-10-06.  The use of "#MAGA" is in line with what we know Keisel has tweeted (at least 7 of the 20 tweets are clearly conservative / right-wing).  At first glance it appears that the memento covers tweets all the way back to 2018-10-04, since a retweet with that timestamp appears as the 20th and final tweet on the page, and thus a tweet from 2018-10-06 should appear before the one with a timestamp of 2018-10-04.  But retweeting does not reset the timestamp; for example, if I tweeted something yesterday and you retweeted it today, your retweet would still show my timestamp from yesterday.  So although the last timestamp shown on the page is 2018-10-04, the 19th tweet on the page is from Keisel and shows a timestamp of 2018-10-06.  It is therefore possible that the retweet occurred on 2018-10-06 and the tweet below just missed being included in the 20 most recent tweets (i.e., it was the 21st most recent tweet).  The screen shot shows a time of "11:11am", and in the HTML source of Google's cached page, the 19th tweet has:

title="8:11 AM - 6 Oct 2018"

This would suggest that the screen shot was taken after the 19th tweet, but without time zone information we can't reliably sequence the tweets.  Depending on the GeoIP of Google's crawler, Twitter would set the "8:11 AM" value relative to that timezone.  It's tempting to think the crawler is in California and thus on Pacific time, but we can't be certain.  Regardless, there's no way to know the default time zone of the presumed client in the screen shot.

We cannot definitively establish the provenance of this tweet.
Bing's cache also has a copy of Keisel's page, and it covers the period from 2018-09-14 back to 2018-03-27.  Unfortunately, that leaves a coverage gap from 2018-10-06 back to 2018-09-14, inclusive, and if the "#MAGA" tweet is real it could fall in that gap between the coverage provided by Google's cache and Bing's cache.

This leaves three scenarios to account for the above "#MAGA" tweet and why we don't have a cached copy of it:
  1. Keisel deleted this tweet on or before March 6, 2019 in anticipation of the game on March 11, 2019.  While not impossible, it does not seem probable because it would require someone taking a screen shot of the tweet prior to the KSL interview.  Since the real @skeisel391 account was not popular (~200 tweets, < 50 followers), this seems like an unlikely scenario.
  2. Someone photoshopped or otherwise created a fake tweet.  Given the existence of the fake @skeiseI391 account (and other fake accounts), this cannot be ruled out.  If it is a fake, it does not appear to have the same origin as the fake @skeiseI391 account.  
  3. The screen shot is legitimate and we are simply unlucky that the tweet in question fell in the coverage gap between the Google cache and the Bing cache, just missing appearing on the page in Google's cache.
I should note that in the process of extending Justin's analysis we came across this thread from sports journalist @JonMHamm, in which he uncovered the fake account and also looked at the page in Google's cache, although he was unaware that the earliest date it establishes is 2018-10-06 and not 2018-10-04.  He also vouches for a contact who claims to have seen the "#MAGA" tweet while it was still live, but that's not something I can independently verify.




In summary, of the three primary tweets offered as evidence, we can reach the following conclusions:
  1. "come at me _____ boy" -- this tweet is definitively fake.
  2. "#poorloser" -- this tweet is a reply, and in general reply tweets will not appear in public web archives, so web archives cannot help us evaluate this tweet.
  3. "#MAGA" -- this tweet is either faked, or it falls in the gap between what appears in the Google cache and what appears in the Bing cache; using web archives we cannot definitively determine explanation is more likely.
We welcome any feedback, additional cache sources, deep links to individual tweets, evidence that these tweets were ever embedded in HTML pages, or any additional forensic evidence.  I thank Justin Whitlock for the initial analysis, but I take responsibility for any errors (including the persistent fear of incorrectly computing time zone offsets).

Finally, in the future please don't just take a screen shot; push the page to multiple web archives as well.

--Michael




Note: There are other fake Twitter accounts, for example: @skeisell391 (two lowercase L's),  @skeisel_ (trailing underscore), but they are not well-executed and I have omitted them from the discussion above.  

2019-05-03: Selected Conferences and Orders in WS, DL, IR, DS, NLP, AI

The timing of research is usually less predictable than that of homework: you may submit a paper next year, but you cannot submit your homework next year. Even if there is a target deadline, the results may not be delivered on time. Even if the results are ready, the paper may not be in good shape, especially papers written by students. Even if a paper is submitted, it can be rejected. Therefore, it is usually useful to know where the work could be submitted next. 

I used to struggle to find the next deadline for my work, so I compiled this timeline, sorted by month. The deadlines are not intended to be exact because they change every year and are often extended, and they may vary depending on the submission type: full paper, short paper, poster, etc. The focus is on the approximate chronological order in which the deadlines occur; one can always visit a conference's website for the exact dates. The list is also not intended to be exhaustive, as it focuses on popular conferences. I do not want the list to be too crowded, but it can be updated by adding new conferences.

The list below was made for people in the Web Science and Digital Libraries Group (WS-DL) at ODU, but it can be generalized to researchers working in Web Science, Digital Libraries, Information Retrieval, Data Science, Natural Language Processing, and Artificial Intelligence to better plan where their research can be disseminated. 

----------------------------------------------------------------------------------------------

January
  • JCDL (full/short/poster) (January 25, 2019)
  • SIGIR (full/short) (January 28, 2019)
February
  • ICDAR (full) (February 15, 2019)
  • IJCAI (full) (February 15, 2019)
  • KDD (full/short) (Feb 3, 2019)
  • ACM Web Science (full/short/poster) (Feb 18, 2019)
March
  • ACL (full/short) (March 4, 2019)
April
  • TPDL (full/short/poster) (April 15, 2019) 
May
  • IRI (full) (May 2, 2019)
  • EMNLP (full/short) (May 21, 2019)
  • CIKM (full/short) (May 22, 2019) 
June
  • ICDM (full) (June 5, 2019)
  • K-CAP (full/short) (June 22, 2019)
July

August
  • WSDM (full) (August 15, 2018)
  • IEEE Big Data (full) (August 19, 2019), poster due later
September
  • AAAI, IAAI (full) (September 5, 2018)
October
  • ECIR (full/short) (October 1, 2019)
  • SDM (full) (October 12, 2018)
November
  • WWW (full/short) (November 5, 2018)
December
  • NAACL (full/short) (December 10, 2018) 

Jian Wu 

2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV showing four (data-tweet-id, data-conversation-id, data-screen-name, and tweet-text) attributes with meaningful names. These attributes are absent in the new tweet DIV (Fig. 2).
On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch had caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident: I was part of a subset of Twitter users who were switched over without the option to opt in to try the new layout. 
New desktop look 
While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. So I immediately opened the Google Chrome Developer Tools to inspect the Twitter DOM, and I was displeased to learn that the changes went beyond the cosmetic new look of the icons and into the DOM itself. This meant that I would have to rewrite all of my research applications built to scrape data from the old Twitter layout.
Old Twitter desktop look
At the moment, I am unsure if it will be possible to extract all the data previously accessible from the old layout. It is important to note that scraping goes against Twitter's Terms of Service, and Twitter offers an API that fulfills many requests, removing the need for scraping in those cases. However, the Twitter API is limited in search, and most importantly, the API does not offer a method for extracting all tweets from a conversation. Extracting tweets from a conversation is a task fundamental to my PhD research, so I scrape Twitter privately for research. In this blogpost, I will use the tweet below to highlight some of the major changes to the Twitter DOM, specifically the tweet DIV, by comparing the old and new layouts. 
Fig. 2: In the new tweet DIV, semantic items (e.g, the four semantic items in Fig. 1) are absent or obscured.
Old Tweet DIV vs New Tweet DIV
The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names while the new tweet DIV does not. In fact, in the old layout, the fundamental unit, the tweet, was explicitly labeled a "tweet" by a DIV with classname="tweet," unlike the new layout. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important attributes marked in Fig. 1 (a short scraping sketch using these attributes follows the list below):
  1. data-tweet-id: In the old layout, data-tweet-id (which contains the tweet ID - the unique string that identifies a tweet) was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
  2. data-conversation-id: This attribute, present in the old layout but absent in the new layout, is responsible for chaining tweets together and is thus required for identifying tweets in a reply or conversation thread. A tweet that is a reply includes the Tweet ID of its parent tweet as the value of the data-conversation-id attribute.
  3. data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
  4. tweet-text: Within the old tweet DIV, the DIV with class name, "tweet-text," marks the text of the tweet, but in the new tweet DIV, there is no such semantic label for the tweet-text.
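To make the difference concrete, here is a minimal sketch (not my actual scraper) of reading these four attributes from the old layout with BeautifulSoup; the HTML snippet is a simplified, hypothetical fragment of an old-layout tweet DIV.

# Extract the four semantic attributes from an old-layout tweet DIV.
from bs4 import BeautifulSoup

html = """
<div class="tweet" data-tweet-id="1113322" data-conversation-id="1113300"
     data-screen-name="example_user">
  <p class="tweet-text">An example tweet body.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for tweet in soup.find_all("div", class_="tweet"):
    print("tweet id:        ", tweet.get("data-tweet-id"))
    print("conversation id: ", tweet.get("data-conversation-id"))
    print("author handle:   ", tweet.get("data-screen-name"))
    text = tweet.find(class_="tweet-text")
    print("tweet text:      ", text.get_text(strip=True) if text else None)

None of these hooks exist in the new layout, which is why scrapers built against them break.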
The new Twitter layout is still under development, so it comes as no surprise that I discovered a glitch. I noticed that reloading my timeline caused Twitter to load and then quickly remove sponsored tweets from my timeline. This happens too fast to capture with a screenshot, so I recorded my screen to capture the glitch (Fig. 3).
Fig. 3: New Twitter layout glitch showing the loading and subsequent removal of sponsored tweets
It is not clear if the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or if the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious: scraping Twitter has just gotten harder.

-- Alexander C. Nwala (@acnwala)

2019-05-14: Back to Pennsylvania - Artificial Intelligence for Data Discovery and Reuse (AIDR 2019)

AIDR 2019
The 2019 Artificial Intelligence for Data Discovery and Reuse conference, supported by the National Science Foundation, was held at Carnegie Mellon University in Pittsburgh, PA, between May 13 and May 15, 2019. It is called a conference, but it is more like a workshop: there are only plenary sessions (plus a small poster session), and the presentations are not all about research frontiers. Many of them are research reviews in which the speakers try to connect their work with "data reuse". The presenters come from various domains, from text mining to computer vision, and from medical imaging to self-driving cars. Another difference from regular CS conferences is that the list of accepted presenters is made based only on the abstracts they submitted; the full papers are submitted later. 

Because CiteSeerX collects a lot of data from the Web, and because our group does a lot of work on information extraction and classification and reuses a lot of data for training AI models, Dr. Lee Giles recommended that I give a presentation. The title of my talk was "CiteSeerX: Reuse and Discovery for Scholarly Big Data". In general, the talk was well received. One person asked how we plan to collect annotations from authors and readers by crowdsourcing. My answer was to take advantage of the CiteSeerX platform, but we need to collect more papers (especially more recent papers) and build better author profiles before sending out such requests. I will compile everything into a 4-page paper. 

In my day and a half at CMU, I listened to two keynotes. The first was given by Tom Mitchell, one of the pioneers of machine learning and the chair of the machine learning department. His talk was on "Discovery from Brain Image Data". I had previously attended a webinar he gave on a similar topic. His research is on connecting natural language with brain activity, studying how the brain reacts to spoken-language stimuli. Here are some takeaways: (1) it takes about 400 ms for the brain to fully take in a word such as "coffee"; (2) the reaction happens in different regions of the brain and is dynamic (changing over time). The data were collected using fMRI for several people, and there was quite a bit of work to denoise the fMRI signals and filter out other ongoing activities. 

The second keynote was given by Glen de Vries, the CEO of a startup company called Medidata. Glen talked about how Medidata improves drug-testing confidence by using synthetic data. The presentation was given in a very professional way (like a TED talk), but Dr. Lee Giles commented that he was using a statistical method called "boosting", and Glen agreed. So essentially, they are not doing enterprise AI. 

Another interesting talk was given by Natasha Noy from Google. Her talk was about the recently launched search engine called "Google Dataset Search". According to Natasha, the idea was proposed in one of her blog posts in 2017, and the search engine went online in September 2018. Unfortunately, because it was not well advertised, very few people know about it; I personally learned about it only two weeks ago. The search engine uses data crawled by Google. The backend uses basic methods to identify publicly available datasets annotated with the schema defined at schema.org, which specifies a comprehensive list of fields for the metadata of semantic entities. I explored this schema in 2016. The schema could be used for CiteSeerX, replacing Dublin Core, but it does not cover semantically typed entities and relationships, so currently it is good for metadata management. The datasets indexed are also limited to certain domains. Another interesting data search engine is Auctus, a dataset search engine tailored for data augmentation: it searches for data using data as input. 

Other interesting talks included:



  • Dr. Cornelia Caragea gave two presentations, one on "keyphrase extraction" - she is an expert in this field - and one on "web archiving", based on her collaborations with UNT.  
  • Matias Carrasco Kind, an astronomer, talked about searching for similarities and anomalies in galaxy images.
At the conference, I met with Dr. C. Lee Giles and Dr. Cornelia Caragea; all of us were very glad to see each other, and we had a very pleasant dinner at a restaurant called "Spoon". I had a lunch conversation with Dr. Beth Plale, an NSF program director, who gave me some good suggestions on how to survive as a tenure-track professor. I also had brief conversations with Natasha Noy of Google AI and Martin Klein of Los Alamos National Laboratory. 

Overall, the conference experience was very good and I learned a lot by listening to top speakers from CMU. The registration fee was low, and they served breakfast, lunch, and a banquet (which I could not attend). The city of Pittsburgh is still cool and windy, but I felt quite used to it because I lived in Pennsylvania for 14 years! The Cathedral of Learning reminded me of the good old days when I would visit my friend Jintao Liu, who was then a graduate student at Pitt and is now a professor at Tsinghua University. 


Jian Wu








2019-05-29: In The Battle of the Surrogates: Social Cards Probably Win

Web archive collections provide meaning by sampling specific resources from the web. We want to summarize these resources by sampling mementos from those collections and visualizing them as a social media story.
On Tuesday, we released our latest pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections". My work builds on AlNoamany's work of using social media storytelling to provide a visualization that summarizes web archive collections. In previous blog posts I discussed different storytelling services. A key component of their capability to convey understanding is the surrogate, a small visualization of a web page that provides a summary of that page, like the surrogate within the Twitter Tweet example shown below. However, there are many types of surrogates. We want to use a group of surrogates together as a story to provide a summary of a web archive collection. Which type of surrogate works best for helping users understand the underlying collection?

An annotated tweet containing a surrogate referring to one of my prior blog posts.

Dr. Nelson, Dr. Weigle, and I iterated for several months to produce this study. Using Mechanical Turk, we evaluated six different surrogate types and discovered that the social card, as produced by our MementoEmbed project, probably provides better understanding of the overall collection than the surrogate currently employed in the Archive-It interface.

How Much Information Do We Get From the Surrogates on the Archive-It Collection Page?


As seen in this screenshot, each Archive-It collection page contains surrogates of its seeds. For most collections, how much information do the surrogates provide the user about the collection? (link to collection in screenshot)

Archive-It allows curators to supply optional metadata on seeds. We analyzed how much information might be available to a user viewing such metadata and found that 54.60% of all Archive-It seeds have no metadata. As shown in the scatter plot below, we discovered that, as the number of seeds in a collection increases, the average number of metadata fields decreases.

As the number of seeds increases, we see a decrease in the mean number of metadata fields per collection.


Without this metadata, an Archive-It surrogate consists of the seed URL, the number of mementos, and the first and last memento-datetimes, as shown below. Is this enough for a user to glean meaning about the underlying documents?

A minimal Archive-It surrogate

We adapted some of Lulwah Alkwai's recent work (link forthcoming) and determined that seed URLs still contain some information that may lead to understanding. An Euler diagram counting the URLs that contain some of this information is shown below. Thus, seed URLs may still help with collection understanding.
An Euler diagram showing the number of Archive-It seed URLs that contain different categories of information.


In the paper, we also highlight the top 10 metadata fields in use and define the different information classes found in seed URLs.

In a Story, Which Surrogate Best Supports Collection Understanding?


Brief Methodology

The figures below show the different types of surrogates that we displayed to participants. Each story consisted of a set of mementos visualized as surrogates in a given order. We varied the surrogates but did not change the order of the mementos. The mementos for each story had been chosen by human curators from AlNoamany's previous work and are available as a Figshare dataset. In our pre-print, we chose stories from four different collections to display to participants.

Our first surrogate type is the de-facto Archive-It interface that users would encounter when trying to understand a web archive collection. We used our own Archive-It Utilities to gather the metadata from the Archive-It collection in order to generate these surrogates.

A screenshot of part of an example story using surrogates from the Archive-It interface.


Our second is the browser thumbnail, commonly used by web archives. We employed MementoEmbed to generate these thumbnails.

A screenshot of an example story using browser thumbnails.


Next was the social card, as produced by MementoEmbed.

A screenshot of an example story using social cards.


The next three surrogates we displayed to users were combinations of social cards and browser thumbnails.

A screenshot of an example story using social cards next to browser thumbnails.


A screenshot of an example story using social cards, but with thumbnails instead of striking images

A screenshot of an example story using social cards, but where thumbnails appear when the user hovers over the striking image.


For each participant, we showed them the story using a given surrogate for 30 seconds. We then refreshed the web page and presented them with six surrogates of the same type as the story that they had just viewed. Two surrogates represented pages from the collection, but the other four were drawn from different collections. We asked them to select the two surrogates from the six that they believed belonged to the same collection. We recorded all mouse hovers and clicks over links and images.

Brief Results


Our results show no significant difference in response times at p < 0.05, but they do show a difference in answer accuracy for social cards vs. the Archive-It interface at p = 0.0569 and social cards side-by-side with thumbnails at p = 0.0770. The paper further details these results overall and per collection. Even though our use case is different, our results are similar to those in a 2013 IR study performed by Capra et al.

More users interacted with thumbnails than with any other surrogate element. We assume that the user was attempting to zoom in and see the thumbnail better. Also, more users clicked on thumbnails to read the web page behind the surrogate than they did for social cards. In fact, social cards had the fewest participants interacting with them of all the surrogate types. We assume this means that most users were satisfied with the information provided by the social card and did not feel the need to interact as much.

The Future


In this post, I briefly summarized our recent pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections." This is not the end, however. We are planning more studies to further examine different types of storytelling with future participants. Our work has implications not only for our own web archive summarization efforts, but for any storytelling tool that employs surrogates.

-- Shawn M. Jones



Thank you @assertpub for letting us know that this pre-print was the #1 paper on arXiv in the Digital Libraries category for May 29, 2019.

2019-06-03: Metadata on Datasets Saves You Time

When I joined ODU in Spring 2019, my first task was to explore datasets in digital libraries with the hope of discovering ways to enable users to discover data, and for data to find its way to users. This led to some interesting findings that I will elaborate on in this post.

First things first, let's take a look at what tools and platforms are available that attempt to make it easier for users to find and visualize data. A quick Google search provided a link to this awesome GitHub repository, which contains a list of topic-centric public dataset repositories. This collection proved useful for gathering the types of dataset descriptions available at present.

The first dataset collection I explored was Kaggle. Here, the most upvoted dataset (as of May 31, 2019) was a CSV file on the topic of "Credit Card Fraud Detection". Taking a quick look at the data, the first two columns provide a textual description of the content, but the rest do not. Since I'm not the maintainer of that dataset (hence the term distributed digital collections), I wasn't allowed to contribute improvements to its metadata.
Figure 1: "Credit Card Fraud Detection" Dataset in Kaggle [Link]

One useful feature that's prominent on Kaggle (and in most publicly available dataset repositories) is that they provide textual descriptions of the content; however, the semantics of the data fields and the links between the files in a dataset were either buried in that description or not included at all. Only a handful of datasets actually documented their data fields.
Figure 2: Metadata of "Credit Card Fraud Detection" Dataset in Kaggle [Link]

If you have no expertise in a particular domain but are only interested in using publicly available data to test a hypothesis, encountering datasets with inadequate documentation is inevitable, owing to the fact that the semantics of most publicly available datasets are vague and arcane.

This gave us enough motivation to dig a little deeper and find a way to change this trend in digital libraries. We formulated a metadata schema and envisioned a file system, DFS, which aims to reverse this state of ambiguity and bring more sense to datasets.

Quoting from our poster publication "DFS: A Dataset File System for Data Discovering Users" at JCDL 2019 [link to paper]:
Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research efficiency. In this paper we propose DFS, a file system to standardize the metadata representation of datasets, and DDU, a scalable architecture based on DFS for semi-automated metadata generation and data recommendation on the cloud. We discuss how DFS and DDU lays groundwork for automatic dataset aggregation, how it integrates with existing data wrangling and machine learning tools, and explores their implications on datasets stored in digital libraries.
We published an extended version of the paper at arXiv [Link] that elaborates more on the two components that help achieve our goal:
  • DFS - A metadata-based file system to standardize the metadata of datasets
  • DDU - A data recommendation architecture based on DFS to bring data closer to users
DFS isn't the next new thing; rather, it's a solution to the lack of a systematic way of describing datasets with enough detail to make them sensible to an end user. It provides the means to manage versions of data and ensures that no important information about the dataset is left out. Most importantly, it provides a machine-understandable format to define dataset schematics. The JSON shown below is a description of a dataset in the DFS meta format.
Figure 3: Sample Metafile in DFS (Shortened for Brevity)
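Since Figure 3 is an image, the sketch below gives a purely illustrative idea of what such a machine-readable metafile could look like; the field names are assumptions made for illustration only and are not the actual DFS schema (see the paper for that).

# A hypothetical, DFS-style dataset metafile expressed as a Python dict.
import json

metafile = {
    "name": "credit-card-fraud-detection",          # illustrative values only
    "version": "1.0.0",
    "description": "Transactions labeled as fraudulent or legitimate.",
    "files": [
        {
            "path": "creditcard.csv",
            "format": "csv",
            "fields": [
                {"name": "Time",   "type": "float", "unit": "seconds since first transaction"},
                {"name": "Amount", "type": "float", "unit": "USD"},
                {"name": "Class",  "type": "int",   "description": "1 = fraud, 0 = legitimate"},
            ],
        }
    ],
}

print(json.dumps(metafile, indent=2))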

On the other hand, DDU (or Data Discovering Users) is an architecture that we envisioned to simplify the process of plugging in data to test hypotheses. Assuming that each dataset has metadata compliant with the proposed DFS schema, the goal is to automate data preprocessing and machine learning, while providing a visualization of the steps taken to reach the final results. So if you are not a domain expert, but still want to test a hypothesis in that domain, you could easily discover a set of datasets that match your needs, plug them into the DDU SaaS, and voila! You just got the results needed to validate your hypothesis, with a visualization of the steps followed to get them.
Figure 4: DDU Architecture

As of now, we are working hard to bring DFS to as many datasets as possible. For starters, we aim to automatically generate DFS metadata for EEG and eye tracking data acquired in real time. The goal is to intercept live data from Lab Streaming Layer [Link] and generate metadata as the data files are generated.

But the biggest question is, does this theory hold true for all domains of research? We plan to answer this in our future work.

- Yasith Jayawardana

2019-06-05: Wikis Are Archives: Integrating Memento and Mediawiki

Since 2013, I have been a principal contributor to the Memento MediaWiki Extension. We recently released version 2.2.0 to support MediaWiki versions 1.31.1 and greater. During the extension's development, I have detailed some of its concepts on this blog, presented it at WikiConference USA 2014, and even helped the W3C adopt it. It became the cornerstone of my Master's Thesis, where I showed how the Memento MediaWiki Extension could help people avoid spoilers on fan wikis. Why do Memento and MediaWiki belong together?

The "dimensions of genericity" table from "Web Architecture: Generic Resources" by Tim Berners-Lee in 1996, annotated to display the RFCs that implemented these dimensions for the Web.
Memento is not limited to web archives. When Tim Berners-Lee was developing the Web, he identified four dimensions of genericity: time, language, content-type, and target medium. HTTP enthusiasts will recognize that three of these became the dimensions of content-negotiation that browsers handle for us invisibly. Time, however, was not handled until Memento. Memento provides a way to request not a specific language or file format, but the version of a page from a specific time.
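As a concrete illustration of the time dimension, the minimal sketch below performs datetime negotiation (RFC 7089) against the public Memento aggregator's TimeGate at timetravel.mementoweb.org; the target URL and datetime are arbitrary examples.

# Ask a Memento TimeGate for the memento of a page closest to a desired datetime
# by sending the Accept-Datetime request header.
import requests

timegate = "http://timetravel.mementoweb.org/timegate/"
target = "https://en.wikipedia.org/wiki/Web_archiving"      # arbitrary example page

r = requests.get(timegate + target,
                 headers={"Accept-Datetime": "Fri, 19 Apr 2013 17:00:00 GMT"},
                 allow_redirects=True)

print(r.url)                              # URI-M of the selected memento
print(r.headers.get("Memento-Datetime"))  # archival datetime of that memento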
Two timelines show the relationship between a live web page and a web archive capturing its mementos.
The top is the timeline of a live web page.
The bottom is the timeline of the observations of that page captured as mementos o1, o2, and o3.
We do not know what changes to the page the archive missed.
As shown in the diagram above, web archives capture observations of live web pages and store them as mementos. We have no knowledge of how often the page changed and which changes we missed. Web archives provide the only history of these pages.


Wikis hold every revision of a given page.
Even though Memento was first deployed on web archives, it is applicable to any system that provides versioning. As shown in the screenshot above, wikis store all revisions of a web page as mementos. We know how often a page changed because there is a record. When web archives capture wiki pages, they treat them like live web pages and often miss revisions, as shown in the diagram below. This makes wikis a more complete source of version information than web archives. Wikis are web archives in their own right.

Two timelines show a similar relationship between a wiki page and a web archive capturing its mementos.
The top is the timeline of revisions of that wiki.
The bottom line is a timeline of observations of that page captured as mementos o1, o2, and o3.
The web archive missed wiki revisions r3 and r4.


Why do Memento and MediaWiki belong together? With a browser extension like Memento for Chrome, or Memento for Firefox, users can pick a date in the past and seamlessly browse the Web as if it were that date. If Memento were supported by wikis, Memento would carry a user from a web archive observation to a wiki revision and back out to another site, all the while keeping them near their desired date. Who would need this functionality? In the next sections, I provide two scenarios where this would be useful. I then conclude with a section containing resources with more information about the Memento MediaWiki Extension.

Example Usage Scenarios



The Historian


The unfolding events of the search for suspects from the Boston Marathon Bombing, as told by mementos for a single Bostinno article in Archive-It's 2013 Boston Marathon Bombing collection.

Hannah is researching the Boston Marathon Bombing. The screenshots above show how the mementos for a single news article change as an event unfolds. To provide context to the earliest version of this article, Hannah wants to know what Wikipedia published about the event around 5 PM on April 19, 2013. Without Memento installed on Wikipedia, she would have to tediously scroll through pages of article history, as shown in the animation below, just to find the correct revision.

Anyone trying to find a specific revision in MediaWiki must scroll through pages of article history.
Using the Memento Extension for Chrome, she can set her browser to the date she needs. If Memento were installed on Wikipedia, she could visit her web archive pages and seamlessly transition to this wiki page at the same date, saving time. Even better, she could continue to browse Wikipedia and the rest of the Web with pages around the same date.
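For readers curious about the programmatic equivalent of what the extension does for Hannah, the sketch below uses the standard MediaWiki API to fetch the revision of an article that was current at a given datetime. The article title and datetime match the scenario above; this is an illustration, not how the extension itself is implemented.

# Ask the MediaWiki API for the newest revision at or before a given datetime:
# rvdir=older with rvstart lists revisions going back in time from rvstart.
import requests

api = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Boston Marathon bombing",
    "rvlimit": 1,
    "rvdir": "older",
    "rvstart": "2013-04-19T17:00:00Z",
    "rvprop": "ids|timestamp",
    "format": "json",
}

page = next(iter(requests.get(api, params=params).json()["query"]["pages"].values()))
rev = page["revisions"][0]
print(rev["revid"], rev["timestamp"])     # revision that was current at the desired datetime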

The Fan



I am a fan of many different fictional universes. Usually, because I am a PhD student, I cannot watch my favorite television shows on the night that they air. This does not mean that my fandom waits until I have caught up. The Fandom (formerly Wikia) web site runs MediaWiki. As a Star Trek fan, I watch Star Trek: Discovery. The screenshot below contains a spoiler about one of the characters. The episode "Project Daedalus" aired on March 14, 2019. If I had not seen the episode, I would immediately see a spoiler when visiting the current version of the page below.
The current version of this Fandom page contains a spoiler about this Star Trek character.

The version of the page before the episode air date does not contain the spoiler.


If I were using Memento, I could set the date in my browser and follow links to this page, as shown below. However, I still see a spoiler? Why?

Even though I use Memento for Chrome and set my date prior to the date of the episode, I still get the spoiler?


In "Avoiding spoilers: wiki time travel with Sheldon Cooper," Michael Nelson, Herbert Van de Sompel, and I explain in more detail why this happens. Web archives only have access to some observations of a wiki page, and hence the nearest memento to the user's desired datetime is often correct. Because the wiki has access to all revisions, installing the Memento MediaWiki Extension directly on the wiki allows us to see the exact revision present at the desired datetime, thus avoiding this problem.


More Information About The Memento MediaWiki Extension



These are just two scenarios where users could benefit from wikis containing the Memento MediaWiki Extension. Can you think of others? Are there times when you wish you could have browsed the Wikipedia of the past? The following resources provide more information.
Memento provides access to any past resource on the Web. Wikis contain all previous versions of a page and thus are web archives in their own right. Do you know of a wiki that would benefit from Memento?

-- Shawn M. Jones

2019-06-05: Joint Conference on Digital Libraries (JCDL) 2019 Trip Report

Alma Mater, a bronze statue at the University of Illinois by sculptor Lorado Taft. Photo by Illinois Library, used under CC BY 2.0 / Cropped from original
It's June, so it's time for the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019). This year's JCDL was held at the University of Illinois at Urbana-Champaign (UIUC) between June 2 and 6. Similar to last year's conference, we (members of WSDL) attended paper sessions, workshops, tutorials, and panels, in which researchers from multiple disciplines presented the findings or progress of their respective research efforts. Unlike previous years, we did not feature any students or faculty in this year's JCDL doctoral consortium. We regret this and hope to resume next year.


Day 1



Following a welcome statement by Dr. Stephen Downie, Professor and Associate Dean for Research at the School of Information Sciences at UIUC, Day 1 began with a keynote from Dr. Patricia Hswe (pronounced "sway"), the program officer for Scholarly Communications at The Andrew W. Mellon Foundation. The title of her keynote was: Innovation is Dead! Long Live Innovation!
Her keynote proposed rethinking the purpose of innovation in the Digital Libraries domain: what is built need not be entirely new, and innovation should also include adaptation, reuse, recovery, etc., instead of rushing to build the next "Next New Shiny Thing."

Three parallel paper sessions followed the keynote after a break:
  1. Generation and Linking
  2. Analysis and Curation, and 
  3. Search Logs

Generation and Linking Session


Pablo Figueira began this paper session with a full paper presentation titled: Automatic Generation of Initial Reading Lists: Requirements and Solutions. They proposed an automatic method for generating reading lists of scientific articles to help researchers familiarize themselves with existing literature by presenting four existing requirements, and one novel requirement for generating reading lists.
Next, Lucy McKenna, a PhD student at Trinity College Dublin, presented a full paper titled: NAISC: An Authoritative Linked Data Interlinking Approach for the Library Domain. They showed that Information Professionals such as librarians, archivists, and cataloguers have difficulty in creating five star Linked Data. Consequently, they proposed NAISC, an approach for assisting Information Professionals in the Linked Data creation process.
Next, Rohit Sharma presented a short paper titled: BioGen: Automated Biography Generation. They proposed BioGen, a system that automatically creates biographies of people by generating short sets of biographical sentences related to multiple life events. They also showed their system produced biographies similar to those manually generated by Wikipedia.
The Generation and Linking session ended with a short paper presentation by Tinghui Duan, a PhD student at the University of Jena, titled: Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Their work proposes a method of building Digital Humanities corpora by searching for and extracting fragments of high-quality digitized versions of artifacts from the Web.

Analysis and Curation Session


Dr. Antoine Doucet, professor of Computer Science at the University of La Rochelle, France, began the first paper session by presenting their full paper: Deep Analysis of OCR Errors for Effective Post‐OCR Processing. They presented the results of a study of five general Optical Character Recognition (OCR) errors: misspellings (real-word and non-word errors), edit operations, length effects, character position errors, and word boundary. Subsequently, they recommended different approaches to design and implement effective OCR post-processing systems.
Next,  Colin Post, a doctoral candidate in the Information and Library Science program at the University of North Carolina, Chapel Hill, presented a full paper (best paper nominee) titled: Digital curation at work: Modeling workflows for digital archival materials. This research provides insight about digital curation in practice by studying and comparing the digital curation workflows of 12 cultural heritage institutions, and focusing on the use of open-source software in their workflows.
Next was a presentation from Julianna Pakstis, Metadata Librarian at the Department of Biomedical and Health Informatics (DBHi) at the Children's Hospital of Philadelphia (CHOP), and Christiana Dobrzynski, Digital Archivist at DBHi. Their short paper presentation was titled: Advancing Reproducibility Through Shared Data: Bridging Archival and Library Practice. This research highlights the work of a team of librarians and archivists at CHOP who implemented Arcus, an initiative of the CHOP Research Institute with the purpose of making the biomedical research data archive and discovery catalog more broadly available within their institution.
The session concluded with Ana Lucic's short paper presentation titled: Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents. This research addresses the problem of separating the main text of a work from its surrounding paratext, a task common to the processing of large collections of scanned text in the Digital Humanities domain. The paratext often needs to be removed in order to avoid distorting word counts, the locating of references, etc. They proposed a method for detecting paratext based on a smoothed unsupervised clustering technique and showed that their method improved subsequent text processing once the paratext was removed.

Search Logs Session


This session began with the first (a best paper nominee) of three full paper presentations, by Behrooz Mansouri, a Computer Science PhD student at the Rochester Institute of Technology, titled: Toward math-enabled digital libraries: Characterizing searches for mathematical concepts. The work explores what queries people use to search for mathematical concepts (e.g., "Taylor series") by studying a dataset of 392,586 queries from a two-year query log. Their results show that math search sessions are typically longer and less successful than general search, and that their queries are more diverse. They claim these findings could aid in the design of search engines for processing mathematical notation.
Next, Maram Barifah, presented a full paper titled: Exploring Usage Patterns of a Large-scale Digital Library in which they proposed a framework for assisting librarians and webmasters explore the usage patterns of Digital Libraries.
Finally, Yasunobu Sumikawa presented the final full paper of the session, titled: Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search. In this presentation they reported the results of a study of a 15-month snapshot of query logs from the online portal of the National Library of France to understand the interests of users and how users find cultural heritage content.

Classification, Discovery and Recommendation Sessions


Following a lunch break, Abel Elekes presented the first full paper, titled: Learning from Few Samples: Lexical Substitution with Word Embeddings for Short Text Classification. To help in the classification of short text, this paper proposes clustering semantically similar terms when training data is scarce to improve the performance of text classification tasks.
Next, Andrew Collins, a researcher at Trinity College Dublin, presented a short paper titled: Document Embeddings vs. Keyphrases vs. Terms for Recommender Systems: A Large‐Scale Online Evaluation. They compared a standard term-based recommendation approach to document embedding and keyphrases - two methods used for related-article recommendation in digital libraries, by applying the algorithms to multiple recommender systems.
Next, Corinna Breitinger, a PhD student at the University of Konstanz, presented her short paper titled: 'Too Late to Collaborate': Challenges to the Discovery of in-Progress Research. She presented the findings from an investigation into how computer science researchers from four disciplines currently identify ongoing research projects within their respective fields. Additionally, she outlined the challenges faced by researchers, such as avoiding duplicate research while protecting the progress of their own research for fear of idea plagiarism.
Finally, Norman Meuschke, a PhD candidate at the University of Wuppertal, presented a full paper titled: Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. He presented their approach for addressing the problem of detecting concealed plagiarism (heavy paraphrasing, translation, etc.) in scholarly text which consists of a two-staged detection that combines similarity assessments of mathematical content, academic citations, and text, as well as similarity measures that consider the order of mathematical features.
Minute Madness followed after Norman's presentation, wrapping up the scholarly activities of Day 1 of JCDL. In Minute Madness, poster presenters were given one minute to advertise their respective posters to the conference attendees. The poster session began after the minute madness.

Minute Madness



Day 2



Day 2 of JCDL 2019 began with a keynote from Dr. Robert Sanderson, the Semantic Architect for the J. Paul Getty Trust: Standards and Communities: Connected People, Consistent Data, Usable Applications. The keynote highlighted the value of Web/Internet standards in providing the underlying foundation that makes the connected world possible. Additionally, the keynote explored the relationship between standards and their target communities, and some common inverse relationships such as the trade-offs between completeness and usability, production and consumption, etc.
The Web Archives session followed the keynote.


Web Archives 1 Session


Sawood Alam, a PhD student at Old Dominion University and member of the WSDL group, presented a full paper on behalf of Mohamed Aturban: Archive Assisted Archival Fixity Verification Framework. Sawood presented two approaches, Atomic and Block, to establish and check the fixity of archived resources (testing whether an archived resource has been altered since its capture time). The Atomic approach involves storing the fixity information of web pages in a JSON file and publishing the fixity content before it is disseminated to multiple on-demand Web archives. In contrast, the Block approach involves merging the fixity information of multiple archived pages into a single file before its publication and dissemination to the archives.

Next, Dr. Martin Klein, a research scientist at the Los Alamos National Laboratory, presented a short paper titled: Evaluating Memento Service Optimizations. He explained the problem of the long response times experienced by services that utilize the Memento Aggregator. This problem arises because search requests are broadcast to all Web archives connected to the Aggregator, even though some URI requests can only be fulfilled by a subset of those archives. He subsequently reported the results of some performance optimizations of the Memento Aggregator, such as caching and machine learning-based predictions.
Finally, Sawood Alam again presented a full paper (best paper nominee), titled: MementoMap Framework for Flexible and Adaptive Web Archive Profiling. Sawood proposed the MementoMap framework as a flexible and adaptive means of efficiently summarizing the holdings of a Web archive, showing its application to the holdings of the Portuguese Web archive (http://arquivo.pt/), a collection consisting of 5 billion mementos (archived copies of web pages).

Other papers were presented concurrently in the Analysis and Processing session.


Analysis and Processing Session


In this session, Felix Hamborg, a PhD candidate at the University of Konstanz, presented a full paper titled: Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Felix presented their research about an automatic method to detect a specific form of news bias - Word Choice and Labeling (WCL). WCL often occurs when journalists use different terms (e.g., "economic migrants" vs. "refugees.") to refer to the same concepts.
Next, Drahomira Herrmannova presented a full paper (Vannevar Bush best paper award winner) titled: Do Authors Deposit on Time? Tracking Open Access Policy Compliance. This paper presented the findings from an analysis of 800,000 research papers published over a 5-year period. They investigated whether the time lag between the publication date of research papers and the dates the papers were deposited in a repository can be tracked across thousands of repositories globally.
Following a break, the paper sessions continued.

Web Archives 2 Session


Sergej Wildemann, a researcher at the L3S Research Center, began with a full paper presentation titled: Tempurion: A Collaborative Temporal URI Collection for Named Entities, where he introduced Tempurion, a collaborative service for enriching entities (e.g., People, Places, and Creative Work) by linking them with URLs that best describe them. The URLs are dynamic in nature and change as the associated entities change.
Next, I (Alexander Nwala) presented a full paper (best paper nominee) titled: Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections. I highlighted the importance of Web archive collections as a means of traveling back in time to study events (e.g., the Ebola Virus Outbreak and the Flint Water Crisis) that may not be properly represented on the live Web due to link rot. These archived collections begin with seed URLs that are often manually selected by experts or crowdsourced. Because collecting seed URLs for Web archive collections is time consuming, it is common for major news events to occur without the creation of a Web archive collection to memorialize them, justifying the need for automatically generating seed URLs. I showed that social media micro-collections (curated lists created by social media users) provide an opportunity for generating seeds, and that they produce collections with properties distinct from conventional collections generated by scraping Web and social media Search Engine Result Pages (SERPs).

Next, Dr. Ian Milligan, history professor at the University of Waterloo, presented a short paper titled: The Cost of a WARC: Analyzing Web Archives in the Cloud. Dr. Milligan explored and answered (US$7 per TB) the question he proposed: "How much does it cost to analyze Web archives in the cloud?" He used the Archives Unleashed platform as an example to show some of the infrastructural and financial cost associated with supporting scholarship in the humanities and social sciences.
Finally, Dr. Ian Milligan, again, presented another short paper titled: Building Community and Tools for Analyzing Web Archives through Datathons. In his second talk of the session, Dr. Milligan highlighted lessons learned from conducting the Archives Unleashed Datathons. The Archives Unleashed Datathons started in March 2016, as a collaborative Data hackathon in which social scientists, humanists, archivists, librarians, computer scientists, etc. work together for 2-3 days on analyzing Web archive data.
Another series of paper sessions followed after a break.

User Interface and Behavior Session


Dr. George Buchanan and Dr. Dana Mckay, researchers at the University of Melbourne School of Computing and Information Systems, presented a full paper titled: One Way or Another I'm Gonna Find Ya: The Influence of Input Mechanism on Scrolling in Complex Digital Collections. They presented their findings from comparing the effect of input modality (touch vs. scrolling) on navigation in book browsing interfaces, reporting the user satisfaction associated with horizontal and two-dimensional scrolling.
Next, Dr. Dagmar Kern, a Human Computer Interaction and User Interface Engineering researcher at Gesis, presented a short paper titled: Recognizing Topic Change in Search Sessions of Digital Libraries Based on Thesaurus and Classification System. She presented their thesaurus and classification-based solution for segmenting user session information of a social science literature into its topical components.
Finally, Cole Freeman, a researcher at Northern Illinois University, presented the last short paper of the session, titled: Shared Feelings: Understanding Facebook Reactions to Scholarly Articles, where he presented a new dataset of Facebook Reactions to research papers and the results of analyzing it.

Citation Session


Dattatreya Mohapatra, a recent Computer Science graduate of the Indraprastha Institute of Information Technology, presented a full paper (best student paper award winner) titled: Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees. He presented a novel data structure, the Influence Dispersion Tree (IDT), to model the impact of a scientific paper without relying on citation counts, instead capturing the relationships among follow-up papers and their citation dependencies.
Next, Leonid Keselman, a researcher at Carnegie Mellon University, presented a full paper titled: Venue Analytics: A Simple Alternative to Citation-Based Metrics. He presented a means for automatically organizing and evaluating the quality of Computer Science publishing venues, producing venue scores for conferences and journals by formulating venue authorship as a regression problem.
Day 2 ended with the conference banquet and awards presentation at the Memorial football stadium.
The best demo award was given to MELD: a Linked Data Framework for Multimedia Access to Music Digital Libraries, by Dr. Kevin Page, David Lewis, and Dr. David M. Weigl
The best student paper award was given to Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees, by Dattatreya Mohapatra, Abhishek Maiti, Dr. Sumit Bhatia and Dr. Tanmoy Chakraborty
The Vannevar Bush best paper award was given to Do Authors Deposit on Time? Tracking Open Access Policy Compliance by Drahomira Herrmannova, Nancy Pontika and Dr. Petr Knoth


Day 3



Day 3 of JCDL 2019 began with a keynote from Dr. John Wilkin, the Dean of Libraries and University Librarian at the University of Illinois at Urbana-Champaign. His keynote was titled: How do you lift an elephant with one hand? and explored the challenges overcome in building the HathiTrust Digital Library, a large-scale digital repository that offers millions of titles digitized from libraries around the world.
Following the keynote was an ACM Digital Library (DL) panel session titled: Towards a DL by the Communities and for the Communities. The ACM Digital Library & Technology Committee is headed by Dr. Michael Nelson and Dr. Ed Fox, and the panel session featured talks from Dr. Daqing He, Dr. Dan Wu, Wayne Graves, and Dr. Martin Klein. During the panel, Dr. Daqing He presented usage statistics of the ACM DL; Wayne Graves, Director of Information Systems at ACM, presented the redesigned ACM DL website (available soon) and received feedback on existing and future services; and Dr. Martin Klein presented Piloting a ResourceSync Interface for the ACM Digital Library. Dr. Dan Wu invited the researchers to Wuhan University, the host of the JCDL 2020 conference, and introduced the audience to the city. Subsequently, Dr. Stephen Downie gave the conference closing remarks.



I would like to thank the organizers and sponsors of the conference, the hosts, Dr. Stephen Downie and the University of Illinois at Urbana-Champaign (UIUC), and Corinna Breitinger for taking and uploading additional photos of the conference. A link to the WADL (Web Archiving and Digital Libraries) workshop trip report will be included here once it is available.

-- Alexander C. Nwala (@acnwala)

2019-06-09: How to Become a Tenure-Track Assistant Professor - Part I (publications, research, teaching and service)

This is a three-part write-up. In this first post, I'll talk about what you need to do over the next 2 to 3 years to prepare yourself for a tenure-track assistant professor job. In the second post, I'll cover how to find tenure-track positions, how to shortlist your target schools, and how to prepare your CV, teaching statement, research statement, and cover letters. In the third post, I'll cover interview prep (Skype/phone, onsite), what to do and not to do during your on-campus interview, offer negotiations, the two-body problem, etc.
If you are considering a tenure-track job, start working on your teaching, research topics, and publications as early as possible (2-3 years before your intended job search). Some of these require careful planning: to accumulate enough publications, you need to start publishing as early as possible in your doctoral research and as often as you can each year. If possible, do real teaching, not just a teaching assistantship. Let me get into each of these and explain how I went about them during my tenure-track job search.
Publications:
There is no hard and fast rule that says if you publish this many papers you are guaranteed a tenure-track job, so you need to figure this out for yourself. When I went on the job market in 2016, I had about 9 conference papers and 2 journal publications. If you take a look at my CV, you can see the publications are all over the place: some in not-so-good venues and a few in top venues (CIKM, JCDL, and journals like IEEE TBE and ACM TAP). Some of them are from my MS degree at another school (in a different research topic). My recommendation: publish early, collaborate, and find multiple research topics (I'll get back to this when I talk about research topics below). As you can see from my not-so-stellar publication track record, I didn't do all of this, but I tried to publish as early as I could in my PhD; easier said than done. I was the only PhD student in my research group on this particular topic, and I was the last generation of my adviser's PhD students working on this particular project. I'd suggest you ask your adviser for multiple projects or ask to be paired with other students in the group. You can also talk to other faculty in the department, especially new assistant professors: they always have new topics and need people to jump-start their own publication track record for the tenure process. You can also start your own side project from something interesting, but always think about who is going to pay for your publications (conferences are expensive!). Talk to people from industry, research labs, and other universities you meet at conferences and meetings. Things don't always work out as you hope; people rarely stay true to their word about collaborating (when you meet them at conferences and workshops) if you are just a PhD student, so I'd stick with internal collaborators like faculty from other departments.
If you are serious about getting a good tenure-track job in a PhD-granting CS department, I'd say you need about 15-20 solid publications (a combination of 2-page poster papers, short and full papers, and journal articles), better if they are from good venues. Again, this is not a rule of thumb, just my personal opinion from going through the tenure-track job market multiple times and successfully getting offers each time. Another tip: if you have a list of universities/departments you are interested in applying to, take a look at their most recently recruited assistant professors and their track records. You can compare their publications, where they came from (tenure-track somewhere else, postdoc, or direct ABD), and where they got their PhDs, and if you are lucky they may have a CV online so you can take a peek and see what else they did (service, teaching) to get in. Again, tenure-track hiring dynamics change with every search committee and also depend on the area the department wants to hire in, so don't dwell too much on this, but you can get a good sense of the hiring process and what sort of people the department hires. If your target is primarily teaching institutions (non-PhD-granting institutions, liberal arts schools, 4-year colleges, community colleges), you should be fine with maybe fewer than 10 publications, but you never know.
Teaching Experience:
You need solid teaching experience to get a tenure-track offer, or at least an interview. Your teaching assistant (TA) experience is probably not going to cut it. In my first 5 years of the PhD, I was pretty much a research assistant (RA), doing research-related work. I was lucky enough to get into a teaching fellowship during my final year to do real teaching. I taught for about 2 semesters as the instructor of record, and I'm sure my first job offers (from teaching-oriented schools) and all the interviews I got were largely because of my solid experience teaching undergraduates. The second time around, I had enough teaching experience as an assistant professor at a primarily teaching school. I recommend all aspiring tenure-track applicants find solid teaching experience.
Here are a few tips. Talk to your department chair (or the faculty member responsible for handling teaching assignments); they may need someone to take over lower-level programming courses. Also talk to other schools in your area, like community colleges, technical schools, and state schools; they always hire adjunct faculty to teach. If you are an international student, your only option is to find a teaching position on campus. If none of the above works, talk to your adviser and see whether you can pair-teach. During my fellowship, I pair-taught with my adviser and also with another professor who was a member of my dissertation committee; this, I believe, helped immensely in getting solid letters of recommendation. If that doesn't work, ask your adviser whether you can deliver a couple of invited lectures; I did this a couple of times, covering a class session when my adviser was traveling to a conference. You can ask other faculty in the department in any related area too; who doesn't love to skip a teaching duty once in a while!
Also, don't forget to collect some informal feedback from your classes. Take a look at my teaching portfolio: I regularly collect feedback and keep track of what students say about my teaching in class. This is especially important when you write your teaching statement; you can talk about your formal evaluations (if you get any) and, if not, some of the informal feedback you got. This is also good practice for your tenure package, where you can quote from this informal feedback to describe how you go about improving your classroom experience.
Research Experience:
You need multiple research directions you want to pursue when you start as a tenure-track assistant professor. Remember, your dissertation topic is not going to be enough; you need topics that can span several years and are interesting enough to attract publications, grant money, and students. Talk to your adviser or other faculty and collaborate. I was fortunate that my adviser got me into a secondary topic area that helped me apply my experience to areas beyond my PhD topic. This is important when you write your research statement: having multiple topics will show the search committee that you have a well-planned direction for your future research agenda.
Letter of Recommendation Writers:
You need at least 3 solid letters of recommendation. Obviously your PhD adviser will be one of them, and having a good relationship with him/her is imperative. Make sure you are making good progress, and that this is exemplified by your work as an independent researcher. Having multiple topics in different areas and collaborating with other faculty should give you those solid letter writers; you need people who can vouch for your collegiality, dedication, and expertise. Remember, faculty life is often about how you work with your colleagues, so if you get an on-campus interview, this is something the faculty who have one-on-one meetings with you will be looking for (more about this later in my on-campus interview blog post).
Service:
This won't be as important as the other points I discussed earlier (publications, research, and teaching), but solid service experience works wonders. Service is part of faculty life, so getting to do things other than just research and teaching helps build your portfolio. The earlier you learn to juggle multiple things, the better you will survive the busy faculty life of managing multiple roles. These opportunities won't come to you, so reach out, talk to others, and ask around. In the early days of my PhD, I reached out to student communities like the University Graduate Association (part of student government) for possible representation. At one time I held multiple university-wide positions representing the department, the college, and the university. Again, these are time-consuming tasks, so I advise you to participate in them early in your PhD career so you can focus on other important aspects (publications and research) later in your PhD life. I'd suggest you focus on things related to student advising (most of these committees require some student participants; see what is available at your university); these will give you good points to talk about in your teaching statement, like what life looks like on the other side as a faculty member. Also reach out and see if you have any opportunity to help the department's search committee in some capacity; you get to see your potential competitors in the job market. Bonus: you get to see some of the winning candidates.
Another service activity you need to start early is reviewing for conferences/journals and organizing committee activities for conferences and workshops. Do student volunteering early in your PhD career, get to know the organizing committee members for future years, and talk to them about volunteering for organizing committee activities. If you are from a CS background, https://chisv.org/ lists a bunch of conferences where you can volunteer your time. Most of these conferences offer a free conference registration, food, and free goodies/t-shirts, and some will even give you travel grants. Also look out for ACM-W, TAPIA, and Grace Hopper travel scholarships for women in computing. If you don't get to do student volunteering, talk to your adviser or other faculty; I got my first organizing committee opportunity by asking around. Jump into any vacant position, help out, and make a name for yourself. You don't always need to travel to the conference if you become part of the organizing committee; some positions require you to get things done before the conference starts, like publication chair (responsible for setting up the conference publishing activities) and publicity chair (Twitter, and posting CFPs to mailing lists). Most of these positions also come with co-chairing opportunities, so you are not the only one responsible for the activity; you get to share the work with several others. Also ask your colleagues, advisers, and other faculty to recommend you to a program committee; you'll get a chance to help the conference by reviewing papers. You can also ask your adviser for reviewing opportunities as a sub-reviewer. Go to journal websites; most of them have a place for you to create a reviewer profile, and you'll get requests when papers relevant to your area of expertise are submitted. Again, these are time-consuming tasks; you cannot build your profile overnight, and it takes careful planning and starting early in your PhD life.
Other Volunteering:
Volunteer for any opportunity that comes your way: departmental, college, university, or external. There may be opportunities to judge a high school or undergraduate poster competition, to volunteer to teach something or give a lecture at a local public school, to organize an educational event, or to present a poster. Find things; all of these are CV points you can add that bring substance to your portfolio.
Internships:
From the get-go, apply for internships. Talk to your faculty, friends, and collaborators and see if they can refer you to a company summer position. Don't be afraid to shoot high; apply to top companies like Google and Microsoft, and also look out for positions in places like research labs (Department of Energy, DOD, etc.). The tenure-track market is very competitive; there are hundreds of people applying for a handful of vacant positions. Keep your options open by having some industry experience.
Student Advising:
See whether your adviser will let you co-advise a few undergraduate and other graduate students in the group. You can help them with their day-to-day research work, show them around during their initial years, and help them acclimate to academic life. Volunteer for departmental and college-wide opportunities to mentor students. These will come in handy when you write your teaching statement.

--Sampath Jayarathna (@openmaze)

2019-06-18: It is time to go back home!

On May 11, 2019, I officially obtained my PhD in Computer Science from Old Dominion University. My graduate studies journey started when I received a full scholarship from the University of Hail in Saudi Arabia, where I had worked for two years as a teaching assistant. I came to the USA, specifically to San Francisco, in 2010 with my husband and my three-month-old daughter. I attended Kaplan Institute, where I took English classes and a GRE course for almost a year. After that I was accepted to ODU as a CS Masters student in 2011. In July 2013 I welcomed my second baby girl, Jenna, and in August I graduated from the Masters program and joined the PhD program to work with the WS-DL (Web Science and Digital Libraries) research group.
On April 4, 2019, I defended my dissertation research, “Expanding the usage of web archives by recommending archived webpages using only the URI” (slides, video).

The goal of my work was to build a model for selecting and ranking possible recommended webpages at a Web archive. This is to enhance both the archive's HTTP 404 responses and HTTP 200 responses by surfacing webpages in the archive that the user may not know exist. For example, consider a user requesting a Virginia Tech football webpage from the archive. The user knows about the popular Virginia Tech football webpage http://hokiesports.com/football/ and will request that webpage. This webpage is currently on the live Web and archived. However, the user does not know that the webpage http://techsideline.com exists in the archive. In 2013, requesting that webpage on the live Web redirected to https://virginiatech.sportswar.com. If the user never had a link to that webpage on the live Web, the user would never know it existed.

To accomplish this, we first detect the semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. Archival quality refers to measuring memento damage by evaluating the number and impact of missing resources in a webpage. Webpage popularity considers how often the webpage has been archived and its popularity on the live Web. A special case of popularity is webpages in "cold spots", which are pages that are not on the live Web and are not currently popular, but are archived. Temporal similarity refers to how close the candidate webpage's Memento-Datetime is to the requested URI. URI similarity assesses the similarity of the candidate URI's tokens to the requested URI's tokens. We tested the model using human evaluation to determine if we could classify and find recommendations for a sample of requests from the Internet Archive's Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than "do not agree" and "I do not know". This indicates that reviewers are more likely to agree with the recommendations when the full categorization is selected. When selecting only the first level of categorization, reviewers agreed with only 25.5% of the recommendations. This indicates that deeper-level categorization improves the performance of finding relevant recommendations.
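To make the filter-and-rank step more concrete, here is a minimal Python sketch that scores candidate archived webpages with a weighted combination of the features described above. This is only an illustration under my own assumptions: the candidates, feature values, and weights are hypothetical, not the dissertation's actual implementation.

# Minimal illustrative sketch (not the dissertation's actual code): rank
# candidate archived webpages by a weighted sum of normalized feature values.
# All candidates, feature values, and weights below are hypothetical.

def score_candidate(candidate, weights):
    # Each feature value is assumed to be normalized to [0, 1].
    return sum(weights[f] * candidate[f] for f in weights)

def rank_candidates(candidates, weights, top_k=5):
    # Highest combined score first.
    return sorted(candidates, key=lambda c: score_candidate(c, weights),
                  reverse=True)[:top_k]

weights = {"archival_quality": 0.25, "popularity": 0.25,
           "temporal_similarity": 0.25, "uri_similarity": 0.25}
candidates = [
    {"uri": "http://techsideline.com", "archival_quality": 0.9,
     "popularity": 0.6, "temporal_similarity": 0.8, "uri_similarity": 0.4},
    {"uri": "http://hokiesports.com/football/", "archival_quality": 0.7,
     "popularity": 0.9, "temporal_similarity": 0.5, "uri_similarity": 0.9},
]
for c in rank_candidates(candidates, weights):
    print(c["uri"], round(score_candidate(c, weights), 3))

In practice the weights and normalization would be tuned per feature; equal weights here simply keep the example small.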
My life as a graduate student, and especially as a PhD student, was not an easy one. Trying to juggle family responsibilities with academic work is a hard task, and it took me some time to figure out a way to balance and handle both. There are some lessons learned that I think could be helpful to other graduate students. First, working on a research concentration that interests you, with an advisor who is committed and productive, is a key to success. It may take time to find the exact topic you are going to work on, but with the right guidance from your advisor, a lot of reading of other people's research, and some experiments along the way, you will get there. Second, working with a group that is energized will keep you motivated. It is important to have meetings with the other group members and talk about what was accomplished and what the future work is. Not only does this keep you energized, but it also may lead to research contributions. Third, try to find a balance between your personal life and academic life. It is not easy to have kids and do graduate studies; however, having great support from family and friends is important. Finally, being a graduate student requires patience and hard work. You need to be self-motivated during this journey and believe in yourself.
After 9 long, hard, and beautiful years as a graduate student in the US, I will be heading home in June to where it all started, the University of Hail, to work as an assistant professor in the College of Computer Science and Engineering.

-Lulwah M. Alkwai

2019-06-19: Use of Cognitive Memory to Improve the Accessibility of Digital Collections

Eye Tracking Scenario
(source - https://imotions.com/blog/eye-tracking-work/)
Since I joined ODU, I have been working with eye tracking data recorded while participants completed a Working Memory Capacity (WMC) measure, with the goal of predicting a diagnosis of Attention-Deficit/Hyperactivity Disorder (ADHD). People with ADHD can be restless and hyperactive, with distinct behavioral symptoms such as difficulty paying attention and controlling impulsive behaviors. WM is a cognitive system that makes it possible for humans to hold and manipulate information simultaneously. Greater WMC means greater ability to use attention to avoid distraction. Theoretically, adults with ADHD have reduced working memory compared with their peers, demonstrating significant differences in WMC.

Among the many tasks used to measure WMC (O-Span, R-Span, N-Back), the reading span task (R-Span) is a valid measure of working memory that yields a WMC score. In R-Span, participants are asked to read a sentence and a letter they see on a computer screen. Sentences are presented in varying sets of 2-5 sentences. Participants are asked to judge sentence coherency by saying 'yes' or 'no' at the end of each sentence. Then, participants are asked to remember the letter printed at the end of the sentence. After a 2-5 sentence set, participants are asked to recall all the letters they can remember from that set. R-Span scores are generated based on the number of letters accurately recalled in order, divided by the total number of possible letters. This task represents a person's ability to hold and manipulate information simultaneously.
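As a rough illustration of that scoring rule, here is a minimal Python sketch (my own simplification, not the actual scoring software used in the study) that computes an R-Span score from the letters a participant recalls for each set.

# Minimal sketch of R-Span scoring (a simplification, not the study's software):
# score = letters recalled correctly and in order / total letters presented.

def rspan_score(presented_sets, recalled_sets):
    correct = 0
    total = 0
    for presented, recalled in zip(presented_sets, recalled_sets):
        total += len(presented)
        # Count letters recalled in the correct serial position.
        correct += sum(1 for p, r in zip(presented, recalled) if p == r)
    return correct / total if total else 0.0

# Example: two sets of 3 and 2 letters; the participant misses one letter.
print(rspan_score([["F", "K", "P"], ["T", "B"]],
                  [["F", "K", "X"], ["T", "B"]]))  # -> 0.8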

We investigated eye gaze metrics collected during this task to differentiate the performance of adults with and without ADHD. This is important because it reveals differences in eye movement features between typical and more complex attention systems. Precise measurements of eye movements during cognitively demanding tasks provide a window into the underlying brain systems affected by ADHD or other learning disabilities.
Fig 1: Comparison of Eye Fixations for ADHD (Left) and Non-ADHD (Right) participants during the WMC Task (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)
We chose standard information retrieval evaluation metrics, precision, recall, F-measure, and accuracy, to evaluate our work. We developed three detailed saccade (rapid changes of gaze) and fixation feature sets. Saccades are eye movements used to jump rapidly from one point to another. Fixations are the times during which our eyes stop scanning and hold the vision in place to process what is being looked at. Features include the qualifiers: gender, number of fixations, fixation duration measured in milliseconds, average fixation duration in milliseconds, fixation standard deviation in milliseconds, pupil diameter left, pupil diameter right, and diagnosis label or class. The three feature sets are categorized according to metric type (a small illustrative classification sketch follows the list):
1) fixation feature set
2) saccade feature set
3) saccade and fixation combination feature set
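As mentioned above, here is a minimal sketch of how one of these feature sets could be evaluated with the chosen metrics. The study itself used Weka classifiers, so scikit-learn's RandomForestClassifier below is only a rough stand-in, and the file name and column names are hypothetical.

# Minimal illustrative sketch: evaluate a feature set with 10-fold cross
# validation and the metrics named above. Assumptions: a hypothetical
# "fixations.csv" containing the feature columns listed above (numerically
# encoded, including gender) plus a binary 0/1 "diagnosis" column; the study
# used Weka, so RandomForestClassifier is only a rough stand-in.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

data = pd.read_csv("fixations.csv")      # hypothetical file name
X = data.drop(columns=["diagnosis"])     # fixation/saccade metrics, pupil diameters, gender
y = data["diagnosis"]                    # 1 = ADHD, 0 = non-ADHD

clf = RandomForestClassifier(n_estimators=100, random_state=42)
results = cross_validate(clf, X, y, cv=10,
                         scoring=["accuracy", "precision", "recall", "f1"])
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, round(results["test_" + metric].mean(), 3))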

Fig 2: Classification of Eye Saccade Features during WMC (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)

Fig 3: Classification of Eye Fixation and Saccade Features during WMC (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)
The purpose of our research was to determine if eye gaze patterns during a WMC task could help us create an objective measuring system to differentiate a diagnosis of ADHD in adults. We identified six of the top-performing classifiers for each of the three feature sets: J48, LMT, RandomForest, REPTree, K Star, and Bagging. While fixation features, saccade features, and a combination of saccade and fixation features all predicted the classification of ADHD with an accuracy greater than 78%, saccade features were the best predictors, with an accuracy of 91%.
We published our work as an IGI Global book chapter:
Anne M. P. Michalek, *Gavindya Jayawardena, and Sampath Jayarathna. "Predicting ADHD Using Eye Gaze Metrics Indexing Working Memory Capacity", Computational Models for Biomedical Reasoning and Problem Solving, IGI Global, pp. 66-88. 2019
An extended version of the paper, published on arXiv, elaborates further on the use of areas of interest (AOIs) during ADHD diagnosis with eye tracking measures.
Use of Working Memory Capacity in the Wild...
Research shows that learning disabilities may be present from birth or develop later in life due to dementia or injuries. Regardless of declining cognitive abilities, people are interested in learning new things. For instance, older adults love to read books and learn new things after retirement to make use of their free time, but physical disabilities and declining cognitive abilities might restrict people from accessing library materials. The Library of Congress Digital Collections is an excellent place for people to do their research, as all they need is a computer and an internet connection. Therefore, it is essential to make these public digital collections accessible.

Fig 4: The Library of Congress Digital Collections Home Page.
Most of the time, web developers focus on regular users and tend to forget to cater to all types of users. Digital collections require careful consideration of the web UI to make them accessible, and we believe, based on our eye tracking research on WMC, that we can help content creators such as the Library of Congress Digital Collections achieve that.
In Dr. Jayarathna's HCI course, I learned to understand people, to be mindful of different perspectives, and to design for clarity and consistency. But as you can see, the application of these rules may differ with requirements. Since we predicted a diagnosis of ADHD with an accuracy greater than 78% using eye gaze data, there is potential to identify people with and without declined cognitive abilities. This would allow us to dynamically determine the complexity of a user's attentional system (whether typical or complex) and provide variations of the UI (similar to how a language translator works: a click of a button changes the UI to be accessible).

In the Future...
Consider an example scenario in which a person with ADHD views the content of the Library of Congress Digital Collections. With a click of a button, the web UI can change the presentation of the content. If the person has ADHD or some other learning disability, the content could be arranged in a different layout that allows the user to interact with it differently.
Expanding on our results, our goal is to explore how we can generalize our study to improve content accessibility for people with learning disabilities without overloading their cognitive memory. We plan to use the Library of Congress or other similar online platforms to start our exploration.
There is a real opportunity for us to help content creators of digital collections make these collections accessible for people regardless of their cognitive abilities.

-- Gavindya Jayawardena