
2019-06-20: Web Archiving and Digital Libraries Workshop 2019 Trip Report



A subset of JCDL 2019 attendees assembled on June 6, 2019, at the Illini Union for the Web Archiving and Digital Libraries workshop (WADL 2019). Like previous years, this year's workshop was organized by Dr. Martin Klein, Dr. Zhiwu Xie, and Dr. Edward A. Fox. Martin inaugurated the session by welcoming everyone and introducing the schedule for the day. He observed that WADL 2019 had equal representation of males and females, not only among attendees but also among presenters. The Web Science and Digital Libraries Research Group (WS-DL) from Old Dominion University was represented by Dr. Michele C. Weigle, Alexander C. Nwala, and Sawood Alam (me), with two accepted talks.



Cathy Marshall from Texas A&M University presented her keynote talk entitled, "In The Reading Room: what we can learn about web archiving from historical research". She told many fascinating stories and described the process she went through to collect bits and pieces of those stories. Her talk shed light on many problems similar to what we see in web archiving. It reminded me of her presentation at the IIPC General Assembly 2015 entitled, "Should we archive Facebook? Why the users are wrong and the NSA is right".


Corinna Breitinger from the University of Konstanz (who has since moved to the University of Wuppertal) presented her team's work entitled, "Securing the integrity of time series data in open science projects using blockchain-based trusted timestamping". She discussed a service called OriginStamp that allows people to create a tamper-proof record of ownership of some digital data at the current time by creating a record in a blockchain. She mentioned the Blockchain_Pi project that allows connecting a Raspberry Pi to a blockchain for timestamping various sensor data. A remarkable achievement of their project was being cited in a German Supreme Court ruling on a dashcam recording that was configured to trigger a timestamping call on a short clip when something unusual happened on the road.


I, Sawood Alam, presented "Impact of HTTP Cookie Violations in Web Archives". This was a summary of two of our blog posts entitled "Cookies Are Why Your Archived Twitter Page Is Not in English" and "Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages" in which we performed detailed investigation of two HTTP cookie related issues in web archives. We found that long-lasting cookies in web archives have undesired consequences in both crawling and replay.


Ed Fox from Virginia Tech presented his team's work entitled, "Users, User Roles, and Topics in School Shooting Collections of Tweets". They attempted to identify patterns in user engagement on Twitter regarding school shootings. They also created a tool called TwiRole (source code) that classifies a Twitter handle as "Male", "Female", or a "Brand" using multiple techniques.


Ian Milligan from the University of Waterloo presented his talk entitled, "From Archiving to Analysis: Current Trends and Future Developments in Web Archive Use". He emphasized that future historians writing the history of the post-1996 era will need to understand the Web. Web archives will play a big role in writing the history of today. It is important that there are tools beyond the Wayback Machine that historians can use to interact with web archives and understand their holdings. He mentioned the Archives Unleashed Cloud as a step in that direction.


Jasmine Mulliken from the Stanford University Press (SUP) presented her talk entitled, "Web Archive as Scholarly Communication". She described various SUP projects and related stories. She spent a fair amount of time describing the use of Webrecorder at SUP in projects like Enchanting the Desert. She also described that SUP is in peril and mentioned Save SUP site that documents the timeline of recent events threatening the existence of SUP. While talking about this, she played a clip from the finale of the Game of Thrones in which the dragon burns the Iron Throne.


Brenda Reyes Ayala from the University of Alberta presented her talk entitled, "Using Image Similarity Metrics to Measure Visual Quality in Web Archives" (slides). Automated quality assurance of archival collections is a topic of interest for many IIPC members. Brenda shared her team's initial findings using image similarities between captures with and without archival banners. She concluded that their results showed significant success in identifying poor and high quality captures, but there is a lot more to be done to improve quality assurance automation.


Sergej Wildemann from the L3S Research Center presented his talk entitled, "A Collaborative Named Entity Focused URI Collection to Explore Web Archives". He started his talk by noting that the temporal aspect of named entities is often neglected when indexing the live web. Temporal changes associated with an entity become more important when exploring an archival collection related to that entity. He mentioned the beta version of a new Wayback Machine prototype that the Internet Archive released in 2016, which provided text search indexed on the anchor text pointing to sites. Towards the end of his talk he showcased his tool, Tempurion, which allows named entity search over archives with a temporal dimension for filtering search results by date range.


I, Sawood Alam, presented my second talk (and the last talk of the day) entitled, "MementoMap: An Archive Profile Dissemination Framework". This talk was primarily based on our JCDL submission that was nominated for the best paper award, but in the WADL presentation we focused more on technical details, use cases, and possible extensions, instead of experimental results. We also talked about the Universal Key Value Store (UKVS) format with some examples.


Once all the formal presentations were over, we discussed post-workshop plans. The first matter was making the proceedings available online or as a special issue of a journal. In previous years (except last year) WADL proceedings were published in the IEEE-TCDL Bulletin, which has since been discontinued. Potential fallback options include: 1) compiling all submissions with an introduction and publishing them as a single document on arXiv, though citing individual works would be an issue, 2) publishing on OSF Preprints, and 3) utilizing GitHub Pages, with the added advantage of providing supplementary material such as slides. To enable more effective communication, a proposal was made to create a mailing list (e.g., using Google Groups) for the WADL community. It was also proposed that posters should not be included in the call for papers, because the number of submissions is usually small enough to give a presentation slot to everyone. Fun fact: only Corinna brought a poster this time. We discussed the possibility of more than one WADL event per year, which may or may not be associated with a conference. Since the next JCDL will be in China, people had some interest in having an independent WADL workshop in the US. Finally, we discussed the possibility of adding a day ahead of JCDL for a hackathon and a day after for the workshop, where hackathon results could be discussed in addition to the usual talks.

It was indeed a fun week of #JCDL2019 and #WADL2019 where we got to meet many familiar and new faces and explored the spacious campus of the University of Illinois. You may also want to read our detailed trip report of JCDL 2019. We would like to thank the organizers and sponsors of both JCDL and WADL for making it happen. We would like to extend special thanks to Dr. Stephen Downie, without whom this event would not have been as organized and fun as it was. We would also like to thank NSF, AMF, and ODU for funding our travel expenses. Last but not least, I would personally like to thank the "WADL DongleNet", which made it possible for me to connect my laptop to the projector twice.


--
Sawood Alam


2019-07-08: Time Series Data Analysis - What, Why and How

In this article, I plan to introduce time series data and discuss a few fundamental, yet important, concepts on how time series data is analyzed in the context of data science. In the latter part of the article, I will explain how I conducted time series data analysis on EEG data and discuss what was achieved from it.
Visualization of an EEG time series
If you are new to time series data analysis, the first question is, what is time series data? In layman's terms, it's a set of data points collected over a period of time (hence the term "series"). Each data point represents the state of the observed object(s) at a point in time.

In time series data, the collected data should indicate when each observation was made, along with the observations themselves. These observations could be made at regular or irregular intervals. In most time series data collections, a fixed set of properties is observed at each instance, hence tabular data formats such as CSV are widely used to store time series data.

Now that we have introduced the what and why of time series data analysis, let's move on to the how. How time series data is analyzed depends on the domain of the data, but there are a few fundamental techniques that apply in general.

What does my data look like?

Statistical measures provide a good estimate for the majority of data without presenting all of it. They come in handy when describing the nature of large data sets. Statistics such as mean, median, and mode indicate the central tendency (read more), while statistics such as minimum and maximum indicate the range of the data.

Though the above statistics provide a good estimate of the central tendency and range of the data, they do not describe how densely or sparsely the data is distributed across that range. This is where statistics such as variance and standard deviation come into the picture. They provide an estimate of how far the majority of the data deviates from the mean.

In the context of time series data, these statistics indicate the nature of the observed data and help to eliminate outliers. But what if the majority of the data is not scattered around a center? This could be evaluated by comparing the central tendency estimates (mean, mode, and median) with each other. If they deviate significantly from each other, it could be an indication of skewed data. In such cases, probability density functions can be useful for visualization and estimation.
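As a minimal illustration in Python (the array of observations here is hypothetical), these statistics can be computed with NumPy:

import numpy as np

samples = np.random.normal(loc=10.0, scale=2.0, size=1000)  # hypothetical observations

print("mean:   ", np.mean(samples))                  # central tendency
print("median: ", np.median(samples))
print("min/max:", np.min(samples), np.max(samples))  # range of the data
print("var:    ", np.var(samples))                   # spread around the mean
print("std:    ", np.std(samples))

# A large gap between the mean and the median hints at skewed data.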

Signal Processing and Time Series Data

With statistical measures, the relationships between consecutive data points are not taken into consideration. These measures cannot capture how data changes over time; they only provide time-invariant estimates of the data. This is where signal processing techniques come in handy for analyzing time series data. Here, the time series is treated as a signal, and signal processing techniques are applied to eliminate noise (filtering), observe periodic trends (spectral density), and much more.
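Here is a small hedged sketch of both ideas with SciPy (the sampling rate and the synthetic signal are assumptions for illustration):

import numpy as np
from scipy import signal

fs = 250.0                                # hypothetical sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)  # 10 Hz tone + noise

# Filtering: 4th-order Butterworth low-pass at 30 Hz to suppress high-frequency noise
b, a = signal.butter(4, 30, btype="low", fs=fs)
x_filtered = signal.filtfilt(b, a, x)

# Spectral density: Welch's method reveals periodic trends in the signal
freqs, psd = signal.welch(x_filtered, fs=fs, nperseg=512)
print(freqs[np.argmax(psd)])              # dominant frequency, ~10 Hz for this example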

Some Data Points are Missing!

Missing data points are a rather common issue encountered when performing time series data analysis. When a data point is missing, you have two options: 1) approximate the missing value using the available data (interpolation), or 2) ignore that data point entirely. The latter option cannot be used if signal processing techniques are to be applied to the data, as it changes the sampling frequency of the data.
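A short sketch of both options with pandas (the series and its timestamps are hypothetical):

import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0],
               index=pd.date_range("2019-01-01", periods=6, freq="S"))

filled = ts.interpolate(method="linear")  # option 1: approximate the missing values
dropped = ts.dropna()                     # option 2: ignore them (breaks the fixed sampling rate)
print(filled)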

When analyzing time series data, though the fundamentals are the same, applications may vary from domain to domain. I recently collaborated on a project with Dr. Sampath and Dr. Mark Jaime at IUPUC to determine if the correlation between Autism Spectrum Disorder (ASD) and Social Interaction can be captured through EEG recordings. We addressed this question by building a set of classifiers that use EEG data to predict whether or not a subject has ASD.

The preliminary work of collecting and pre-processing EEG data and using it to build the classifiers is published in the book chapter "Electroencephalogram (EEG) for Delineating Objective Measure of Autism Spectrum Disorder" in Computational Models for Biomedical Reasoning and Problem Solving, IGI Global [Link]. Next, we extended this work by adopting an approach that takes both short-term and long-term trends in EEG data into account. We submitted a paper titled "Analysis of Temporal Relationships between ASD and Brain Activity through EEG and Machine Learning", which elaborates on this work, to the IEEE Information Reuse and Integration for Data Science Conference (IEEE IRI) 2019 (in press). I'll discuss these points in more detail in the sections below.

EEG data, being a classic example of time series data, requires certain pre-processing steps to eliminate noise and artifacts and to transform the signal into features. For this study, we used the following pre-processing pipeline:
Pre-processing Raw EEG Data
  • Removing low frequency baseline drifts using a 1 Hz high pass filter.
  • Removing 50-60 Hz AC noise
  • Bad Channel Rejection using two criteria: 1) flat signal for > 5 s or 2) poor correlation with adjacent channels
  • Artifact Subspace Reconstruction (ASR) (read more)
These steps were done using EEGLAB, a MATLAB tool for EEG data processing. The signal we obtained from pre-processing contained minimal noise and artifacts while retaining the majority of the brain signal. We used it as a clean signal source for feature extraction.
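We used EEGLAB for these steps; purely as a hedged illustration of a comparable pipeline in Python, the MNE library covers the filtering and bad-channel steps (the file name and bad channel below are hypothetical, and ASR itself is not part of core MNE):

import mne

raw = mne.io.read_raw_edf("subject01.edf", preload=True)  # hypothetical recording
raw.filter(l_freq=1.0, h_freq=None)      # 1 Hz high-pass to remove baseline drift
raw.notch_filter(freqs=[50, 60])         # suppress 50-60 Hz AC line noise
raw.info["bads"] = ["T7"]                # hypothetical bad channel (flat or poorly correlated)
raw.drop_channels(raw.info["bads"])      # bad channel rejection
# Artifact Subspace Reconstruction is provided by EEGLAB's clean_rawdata (or the asrpy
# package), not by core MNE.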
Transforming Clean Signal to a Power Matrix
When extracting features, we followed two approaches. In the first approach, we decomposed each signal into 5 signals corresponding to the δ, θ, α, β and γ bands.
EEG signals filtered into Frequency Bands δ, θ, α, β and γ
Next, we chunked the signals into fixed periods of 5 seconds and calculated the mean, median and mode of each signal. In this manner, we created three series of mean, median and mode values for each electrode, corresponding to each chunk. We used this as Feature Set I, trained several models using WEKA, and documented their evaluation results.
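A minimal sketch of this band decomposition and 5-second chunking for a single channel (the sampling rate and signal are hypothetical, and only the mean and median are shown):

import numpy as np
from scipy import signal

fs = 250.0                                  # hypothetical sampling rate (Hz)
eeg = np.random.randn(int(fs) * 60)         # hypothetical 60 s single-channel signal

bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}
chunk = int(5 * fs)                         # 5-second chunks

features = []
for name, (lo, hi) in bands.items():
    b, a = signal.butter(4, [lo, hi], btype="band", fs=fs)
    band_sig = signal.filtfilt(b, a, eeg)
    for start in range(0, band_sig.size - chunk + 1, chunk):
        seg = band_sig[start:start + chunk]
        features.append((name, start // chunk, np.mean(seg), np.median(seg)))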

Evaluation Results for Feature Set I

The top 6 classifiers show > 90% accuracy when 10-fold cross validation was used. The only features used here were the average and power of the 5 frequency bands of each chunk. The temporal connections between the 5-second chunks were not taken into consideration here.

For our second feature set, we transformed the time series of each electrode into a power matrix by applying wavelet transforms and calculating their spectral densities. Each matrix coefficient indicated the strength of a signal at a given time, and frequency. The diagrams below visualize two power matrices in 2D form.
Visualization of an Autistic Subject
Visualization of a Typically Developing Subject
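As a rough sketch of how such a power matrix can be produced in Python, PyWavelets offers a continuous wavelet transform (the signal, scales, and wavelet choice here are illustrative assumptions):

import numpy as np
import pywt

fs = 250.0                                  # hypothetical sampling rate (Hz)
eeg = np.random.randn(int(fs) * 10)         # hypothetical 10 s single-channel signal

scales = np.arange(1, 64)                   # wavelet scales (roughly inverse to frequency)
coeffs, freqs = pywt.cwt(eeg, scales, "morl", sampling_period=1 / fs)
power = np.abs(coeffs) ** 2                 # rows: frequency, columns: time
print(power.shape)                          # this is the kind of matrix visualized above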

According to the diagrams, the spectral densities show different patterns for ASD and TD subjects for the same electrode. This motivated us to use the spectral density matrices as inputs to a Convolutional Neural Network (CNN), to take temporal trends into account. The architecture of the CNN layers included Dropout, Regularization Kernels, Convolution Layers, Dense Layers and a Sigmoid Neuron for Binary Classification.
Layers of the Convolutional Neural Network (CNN)
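A minimal Keras sketch of such a network (the input shape, filter counts, and regularization strength are illustrative assumptions, not the exact architecture used in our paper):

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4),
                  input_shape=(64, 128, 1)),   # hypothetical power-matrix shape
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(32, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),     # binary ASD vs. TD classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])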

Results from both classifiers showed that they perform equally well for the classification task.
Comparing Evaluation Results of Feature Set II with Feature Set I

These metrics were obtained using a sample size of 8 ASD subjects and 9 TD subjects. Increasing the number of subjects to counter any class imbalance would likely result in more generalized evaluation results.

An extended version of this research can be found on arXiv [Link] titled "Electroencephalogram (EEG) for Delineating Objective Measure of Autism Spectrum Disorder (ASD) (Extended Version)".

-- Yasith Jayawardana (@yasithmilinda)

2019-07-11: Raintale -- A Storytelling Tool For Web Archives


My work builds upon AlNoamany's efforts to use social media storytelling to summarize web archive collections. AlNoamany employed Storify as a visualization platform. Storify is now gone. I explored alternatives to Storify in 2017 and found many of them to be insufficient for our purposes. In 2018, I developed MementoEmbed to produce surrogates for mementos and we used it in a recent research study. Surrogates summarize individual mementos. They are the building blocks of social media storytelling. Using MementoEmbed, Raintale takes surrogates to the next level, providing social media storytelling for web archives. My goal is to help web archives not only summarize their collections but promote their holdings in new ways.

Raintale is the latest entry in the Dark and Stormy Archives project. Our goal is to provide research studies and tools for combining web archives and social media storytelling. Raintale provides the storytelling capability. It has been designed to visualize a small number of mementos selected from an immense web archive collection, allowing a user to summarize and visualize the whole collection or a specific aspect of it.

Raintale accepts a list of memento URIs (URI-Ms) from the user and produces a story containing surrogates of those URI-Ms. It then publishes this story to an individual file, in a format like HTML (as seen below), or to a service, like Twitter (as seen above). Our goal is to explore and offer different publishing services and file formats to meet a variety of storytelling needs. You can help by finding defects and making suggestions on the directions we should take. The rest of this article highlights some of Raintale's features. For more information, please consult Raintale's website, its documentation, and our GitHub repository.

Raintale provides many customization options for different types of storytelling. In this example, the HTML output contains Bootstrap cards and animated GIFs (MementoEmbed imagereels) of the best five images from each memento.

What Is Possible With Raintale



We created Raintale with several types of users in mind. Web archives can use it as another tool for featuring their holdings in new ways. Collection curators can promote their collections by featuring a small sample. Bloggers and other web page authors can write stories like they previously did with Storify.

When a user supplies the URI-Ms, Raintale supplies the formatted story. The URI-Ms do not even need to be from the same web archive. Raintale uses the concept of a storyteller to allow you to publish content to a variety of different file formats and social media services.

Raintale supports HTML storytelling with MementoEmbed social cards (see below). Story authors can use this HTML for static web sites or paste it into services like Blogger. Web archiving professionals can incorporate it into scripts for curation workflows. Raintale also provides storytellers that generate Jekyll headers for HTML or Markdown, suitable for posting to GitHub pages.

Raintale, by default, generates MementoEmbed social cards via the HTML storyteller.


Seen below, Raintale supports MediaWiki storytelling. It generates MediaWiki markup that story authors can paste into a wiki page. This MediaWiki storyteller can help organizations that employ storytelling with wiki pages as part of ongoing collaboration.

Raintale can generate a story as MediaWiki markup suitable for pasting into MediaWiki pages.


Likewise, Raintale provides a Markdown storyteller with output suitable for GitHub gists. This output is useful for developers providing a list of resources from web archives.

Raintale provides a story as Markdown, rendered here in a GitHub gist available at this link.


For social media sharing, Raintale can also generate a Twitter story. Raintale leverages MementoEmbed's ability to surgically extract specific information from a memento to produce tweets for each URI-M in a story. These tweets are then bound together by an overarching tweet, publishing the whole story as a Twitter thread.

Raintale's default Twitter storyteller generates surrogates consisting of the title of the memento, its memento-datetime, its URI-M, a browser thumbnail, and the top 3 images as ranked by MementoEmbed. The Tweet starting the thread contains information about the name of the story, who generated it, and which collection to which it is connected. The Twitter thread shown in this screenshot is available here.


Our Facebook equivalent is still in its experimental phase. We use a Facebook post to contain the story, and Raintale visualizes each URI-M as an individual comment to that post. Our Facebook posts do not yet have image support. The lack of images leads Facebook to generate social cards for the URI-Ms. As noted in a prior blog post, Facebook does not reliably produce surrogates for mementos. Also, Facebook's authentication tokens expire within a short window (sometimes 10 minutes), which requires the user to request new ones continually. We have observed that the comments on the post are not in the order they were submitted. We welcome suggestions on improving Raintale's Facebook storyteller.
We are beginning to explore Raintale's ability to post stories to Facebook.


We are experimenting with producing MPEG videos of collections, as seen below. Raintale generates these videos from the top sentences and top images of the submitted URI-Ms. A story author can then publish the video to Twitter or YouTube to tell their story. Below, we show a tweet containing an example video created with Raintale.



Raintale supports presets, allowing you to alter the look of your story. If you do not like the social card HTML story shown above, a four column thumbnail story may work better (shown below). These presets provide a variety of options for users. Presets are templates that are already included with Raintale. We will continue to add new presets as development continues. To see what is available, visit our Template Gallery.

This story was produced via the HTML storyteller, but with the thumbnails4col preset. Presets are templates included with Raintale. Users can also supply their own templates.
Templates are an easy way to generate different types of surrogates for our research studies. Some of the initial presets come from those studies and are quite vanilla in tone because we wanted to limit what might influence the study participants. Raintale's output does not need to be this way. Raintale provides template support so that you can choose which surrogate features work best for your web archive or blog, as shown in the fictional "My Archive" story below.

Raintale allows users to supply their own templates, such as this one for the fictional "My Archive." Using these templates, curators can create their own stories describing their collections.
In the "My Archive" example, we show how one can brand a story using their own formatting and images. This example displays thumbnails, favicons, text snippets, titles, original resource domains, memento-datetimes, links to other mementos, links to the live web page, and the top four images discovered in each memento. Each of these features is controlled by a template variable and there are more features available than those shown here. We will continue to add new features as development proceeds.

Take a look at our Template Gallery to see what is available. The documentation provides more information on how to build your own templates using the variables and preferences provided by Raintale.

Requirements for Running Raintale



In the Raintale documentation, we discuss the different ways of installing and running Raintale. Raintale is a command-line Python application tightly coupled to MementoEmbed. The easiest way to run Raintale is with docker-compose. We implemented a command-line utility named tellstory so that a user can easily include Raintale in scripts for automation.

For file formats, tellstory requires a -o argument to specify the output file.

# docker-compose run raintale tellstory -i story-mementos.txt --storyteller html -o mystory.html --title "This is My Story Title"


For social media services, tellstory requires a -c argument to specify the file containing your API credentials.

# docker-compose run raintale tellstory -i story_mementos.txt --storyteller twitter --title "This is My Story Title" -c twitter-credentials.yml



A user can supply the content of the story as either a text file, like the story_mementos.txt above, or JSON. The text file is a newline-separated list of URI-Ms. Alternatively, the user can supply a JSON file for more control over the content. See the documentation for more information.
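As an illustration, a minimal input text file might look like the following (these URI-Ms are hypothetical examples, shown only to demonstrate the format):

https://web.archive.org/web/20190101000000/https://example.com/page1
https://web.archive.org/web/20190215120000/https://example.org/page2
https://web.archive.org/web/20190301000000/https://example.net/page3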

Our Reasons for Developing Raintale



As noted in a prior blog post, each surrogate is a visualization of the underlying resource. My research focuses on social media storytelling and web archives. The surrogates, presented together as a group, are visualizations of a sample of the underlying collection. In recent work, we explored how well different surrogates worked for collection understanding. Raintale came out of the lessons learned from generating stories with different types of surrogates. We decided that both we and the community would benefit from a tool fitting in this problem space.

Providing Feedback on Raintale



Development on Raintale is just starting, and we would appreciate feedback at our GitHub issues page. In addition to finding defects, we also want to know where you think Raintale should go. Have you developed a template that you find to be useful and want to share it? Is there a storyteller (file format or service) that you want us to incorporate?

The Dark and Stormy Archives Toolkit



Raintale joins MementoEmbed, the Off-Topic Memento Toolkit, and Archive-It Utilities as another member of the growing Dark and Stormy Archives (DSA) Toolkit. The DSA Toolkit includes tools for summarizing and generating stories from web archive collections. The next tool in development, Hypercane, will use structural features of web archive collections, along with similarity metrics and Natural Language Processing, to select the best mementos from collections for our stories.

We will continue to improve Raintale. What stories will we all tell with it?

-- Shawn M. Jones

2019-07-15: Lab Streaming Layer (LSL) Tutorial for Windows

First of all, I would like to give credit to Matt Gray for going through the major hassle in figuring out the prerequisites and for the awesome documentation provided on how to Install and Use Lab Streaming Layer on Windows.
In this blog, I will guide you through installing the open source Lab Streaming Layer (LSL) and streaming data (an eye tracking example using the PupilLabs eye tracker) to the NeuroPype Academic edition. Though a basic version of LSL is available along with NeuroPype, you will still need to complete the following prerequisites before installing LSL.
You can find installation instructions for LSL at https://github.com/sccn/labstreaminglayer/blob/master/doc/BUILD.md. The intention of this blog is to provide an easier and more streamlined step-by-step guide for installing LSL and NeuroPype.
LSL is a low-level technology for exchanging time series data between programs and computers.

Figure: LSL core components
Source: ftp://sccn.ucsd.edu/pub/bcilab/lectures/Demo_1_The_Lab_Streaming_Layer.pdf


Christian A. Kothe, one of the developers of LSL, has a YouTube video in which he explains the structure and function of LSL.
Figure: LSL network overview
Source: ftp://sccn.ucsd.edu/pub/bcilab/lectures/Demo_1_The_Lab_Streaming_Layer.pdf
Installing Dependencies for LSL: LSL needs to be built and installed manually using CMake. We will need a C++ compiler to install LSL; we can use Visual Studio 15 2017 as the C++ compiler. In addition to CMake and Visual Studio, it is required to install Git, Qt, and Boost prior to the LSL installation. Though Qt and Boost are not required for the core liblsl library, they are required for some of the apps used to connect to the actual devices.

Installing Visual Studio 15.0 2017: Visual Studio can be downloaded and installed from https://visualstudio.microsoft.com/vs/older-downloads/. You must download Visual Studio 2017, since other versions (including the latest, 2019) do not work when building some of the dependencies. You can select the Community edition as it is free.
VS version - 2017


The installation process will ask which Workloads you want to install additionally. Select the following Workloads to install:
        1. .NET desktop development
        2. Desktop development with C++
        3. Universal Windows Platform development

Figure: Workloads need to be installed additionally
Installing Git: Git is an open source distributed version control system. We will use Git to download the LSL Git repository. Download Git for Windows from https://git-scm.com/download/win. Continue the installation with the default settings, except feel free to choose your own default text editor (vim, notepad++, sublime, etc.) to use with Git. In addition, when you encounter the Adjust your PATH environment page, make sure to choose the Git from the command line and also from 3rd-party software option in order to execute git commands from the command prompt, Python prompts, and other third-party software.

Installing CMake:
Figure: First interface of CMake Installer
CMake is a program for building/installing other programs onto an OS. 
You can download CMake from https://cmake.org/download/. Choose the cmake-3.14.3-win64-x64.msi file to download under Binary distributions.
When installing, feel free to choose the default selections, except, when prompted, choose Add CMake to the system PATH for all users.

Installing Qt:
Qt is a GUI generation program mostly used to create user interfaces. Some of the LSL apps use this to create user interfaces for the end user to interact with when connecting to the device. 
The open-source version can be downloaded and installed from https://www.qt.io/download. An executable installer for Qt is provided, so installation should be easy.

You will be asked to enter the details of a Qt account in the install wizard. You can either create an account or log in if you already have one.
Figure: Qt Account creation step

During the installation process, select the defaults for all options except on the Select Components page, where you should select the following to install:
  • Under Qt 5.12.3:
    • MSVC 2017 64-bit
    • MinGW 7.3.0 64-bit
    • UWP ARMv7 (MSVC 2017)
    • UWP x64 (MSVC 2017)
    • UWP x86 (MSVC 2017)
Figure: Select Components to be installed in Qt

The directory that you need for installing LSL is  C:\Qt\5.12.3\msvc2017_64\lib\cmake\Qt5

Installing Boost
Boost is a set of C++ libraries which provide additional functionality for C++ coding. Boost also needs to be compiled/installed manually. The online instructions for doing this are at https://www.boost.org/doc/libs/1_69_0/more/getting_started/windows.html.
You can download Boost from https://www.boost.org/users/history/version_1_67_0.html. Download the boost_1_67_0.zip file and extract it directly into your C:\ drive. Then, open a command prompt window and navigate to the C:\boost_1_67_0 folder using the cd C:\boost_1_67_0 command.

Then execute the following commands one after the other:
1. bootstrap
2. .\b2
Figure: Executing bootstrap and .\b2 commands

Figure: After Executing bootstrap and .\b2 commands

The directory that you need for installing LSL is C:\boost_1_67_0\stage\lib

Installing Lab Streaming Layer: Clone the lab streaming layer repository from GitHub into your C:\ drive.

In a command prompt, execute the following commands.
1. cd C:\
2. git clone https://github.com/sccn/labstreaminglayer.git --recursive
Make a build directory in the labstreaminglayer folder:
3. cd labstreaminglayer
4. mkdir build && cd build

Configure lab streaming layer using CMake

5. cmake C:\labstreaminglayer -G "Visual Studio 15 2017 Win64"  
-DLSL_LSLBOOST_PATH=C:\labstreaminglayer\LSL\liblsl\lslboost 
-DQt5_DIR=C:\Qt\5.12.3\msvc2017_64\lib\cmake\Qt5 
-DBOOST_ROOT=C:\boost_1_67_0\stage\lib 
-DLSLAPPS_LabRecorder=ON 
-DLSLAPPS_XDFBrowser=ON
-DLSLAPPS_Examples=ON
-DLSLAPPS_Benchmarks=ON 
-DLSLAPPS_BestPracticesGUI=ON

The above command configures LSL, defines which apps are installed, and tells CMake where Qt, Boost, and the other dependencies are installed.
     i.   C:\labstreaminglayer is the path to the lab streaming layer root directory (where you cloned LSL from GitHub).
     ii.  The -G option defines the compiler used to compile LSL (we use Visual Studio 15 2017 Win64).
     iii. -D is the prefix for additional options:
          1. -DLSL_LSLBOOST_PATH → path to the LSL Boost directory
          2. -DQt5_DIR → path to the Qt cmake files
          3. -DBOOST_ROOT → path to the installed Boost libraries
          4. -DLSLAPPS_<App Name>=ON → the apps located in the Apps folder (C:\labstreaminglayer\Apps) that you want installed. Just add the name of the folder within the Apps folder directly after -DLSLAPPS_ with no spaces.
Build (install) lab streaming layer using CMake
6. cd ..
7. mkdir install


      8. cmake --build C:\labstreaminglayer\build --config Release --target C:\labstreaminglayer\install



Now that the LSL installation is complete, we will have a look at the LabRecorder. LabRecorder is the main LSL program used to interact with all the streams. You can find the LabRecorder program at C:\labstreaminglayer\install\LabRecorder\LabRecorder.exe.

The interface of LabRecorder looks like the following.
Figure: LabRecorder interface when PupilLabs is streaming

The green check box entries below Record from Streams are the PupilLabs (eye tracking device) streams. When all programs are installed and running for each respective device, the devices' streams will appear under Record from Streams as shown above. You can check the required data streams from the devices listed, then just press Start to begin data recording from all the devices. The value under Saving to on the right specifies where the data files (in XDF format) will be saved.

Installing the PupilLabs LSL connection: There are many devices that can be connected to LSL; the Muse EEG device, the Emotiv EPOC EEG device, and the PupilLabs Core eye tracker are some of them. The example below shows how to use the PupilLabs Core eye tracker with LSL to stream data to NeuroPype.

Figure : PupilLabs core eye tracker, Source - https://pupil-labs.com/products/core/

Let us first begin with setting up the PupilLabs Core eye tracker. You can find instructions for using and developing with PupilLabs here. Below, I provide the steps to set everything up from start to finish to work with LSL. The LSL install instructions for PupilLabs are at https://github.com/labstreaminglayer/App-PupilLabs/tree/9c7223c8e4b8298702e4df614e7a1e6526716bcc

To set up the PupilLabs eye tracker, first download the PupilLabs software from https://github.com/pupil-labs/pupil/releases/tag/v1.11. Choose the pupil_v1.11-4-gb8870a2_windows_x64.7z file and unzip it into your C:\ drive. You may need the 7-Zip program for unzipping. Then, just plug the PupilLabs eye tracker into your computer. It will automatically begin to install drivers for the hardware.

After that, you can run the Pupil Capture program located at: C:\pupil_v1.11-4-gb8870a2_windows_x64\pupil_capture_windows_x64_v1.11-4-gb8870a2\pupil_capture.exe with Administrative Privileges so that it can install the necessary drivers. Next, you can follow the instructions in https://docs.pupil-labs.com/ to setup, calibrate, and use the eye tracker with the Pupil Capture program.

Connect PupilLabs with LSL: Build liblsl-Python in a Python or Anaconda prompt. You can do this in a regular command prompt as well. Execute the following commands:
1. cd C:\labstreaminglayer\LSL\liblsl-Python
2. python setup.py build

Then, you have to install LSL as a plugin in the Pupil Capture program.
a. In the newly created C:\labstreaminglayer\LSL\liblsl-Python\build\lib folder, copy the pylsl folder and all its contents into the C:\Users\<user_profile>\pupil_capture_settings\plugins folder (replace <user_profile> with your Windows user profile).
b. In the C:\labstreaminglayer\Apps\PupilLabs folder, copy pupil_lsl_relay.py into the C:\Users\<user_profile>\pupil_capture_settings\plugins folder.
Figure: Original Location of  pupil_lsl_relay.py

Figure: After copying pupil_lsl_relay.py and pylsl folder into C:\Users\<user_profile>\pupil_capture_settings\plugins folder

If the pylsl folder does not have a lib folder containing liblsl64.dll, there is a problem with the pylsl build. As an alternative approach, install pylsl via pip by running the pip3 install pylsl command in a command prompt. Make sure you have installed pip on your computer prior to running these commands. You can use the pip3 show pylsl command to see where the pylsl module is installed on your computer. This module will include the pre-built library files. Copy this newly installed pylsl module to the C:\Users\<user_profile>\pupil_capture_settings\plugins folder.
In this example, the pylsl module was installed in the C:\Users\<user_profile>\AppData\Local\Python\Python37\Lib\site-packages\pylsl folder. It includes a lib folder which contains liblsl64.dll.
 Figure: pylsl module's installation location when used pip3 install pylsl command
As the next step, launch pupil_capture.exe and enable Pupil LSL Relay from Plugin manager in Pupil Capture – World window.

Figure: Enabled Pupil LSL Relay from Plugin Manager
Now when you hit the R button on the left of the World window, you start recording from PupilLabs while streaming the data to LSL. In LabRecorder, you can see the streams in green (see Figure: LabRecorder interface when PupilLabs is streaming).
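If you would rather verify the stream from Python than from LabRecorder, here is a minimal hedged sketch using pylsl (it assumes the Pupil LSL Relay plugin is already streaming):

from pylsl import StreamInlet, resolve_stream

streams = resolve_stream("type", "Pupil Capture")  # find the PupilLabs stream on the network
inlet = StreamInlet(streams[0])
sample, timestamp = inlet.pull_sample()            # pull one sample from the stream
print(timestamp, sample)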
Now, let's have a look at how to get data from LSL into NeuroPype.

Getting Started with Neuropype and Pipeline Designer:
First, you have to download and install the NeuroPype Academic Edition (neuropype-suite-academic-edition-2017.3.2.exe) from https://www.neuropype.io/academic-edition. The NeuroPype Academic Edition includes a Pipeline Designer application, which you can use to design, edit, and execute NeuroPype pipelines using a visual ‘drag-and-drop’ interface. 

Before launching the NeuroPype Pipeline Designer, make sure that the NeuroPype server is running in the background. If not, you can run it by double clicking the NeuroPype Academic icon. You can also set the NeuroPype server to launch on startup.
The large white area in the following screenshot is the ‘canvas’ that shows your current pipeline, which you can edit using drag-and-drop and double-click actions. On the left you have the widget panel, which has all the available widgets or nodes that you can drag onto the canvas.


Create an Example Pipeline: From the widget panel, select the LSL Input node from the Network (green) section, Dejitter Timestamp from the Utilities (light blue) section, Assign Channel Locations from the Source Localization (pink) section, and Print To Console from the Diagnostics (pink) section.
After creating the pipeline, the canvas looks like the figure "Pipeline created in NeuroPype" below. After getting the nodes onto the canvas, you can connect them using the dashed curved lines on either side of each node. Double click on one of the dashed lines of a node and drag the connecting line to a dashed curved line of another node. This creates a connection between the two nodes named Data.
You can hover the mouse over any section or widget or click on a widget on canvas and see a tooltip that briefly summarizes it. 

Figure: Pipeline created in Neuropype
Start Processing
LSL is not only a way to get data from one computer to another, but also to get data from your EEG system, or any other kind of sensor that supports it, into NeuroPype. You can also use it to get data out of NeuroPype into external real-time visualizations, stimulus presentation software, and so on.
Make sure that the LSL Input node has a query string that matches that sensor. For instance, if you use PupilLabs, you need to enter type=’Pupil Capture’ as below. Then NeuroPype will pick up data from the PupilLabs eye tracker.
                                                                     
Figure: Set up type of LSL Input
To launch the current patch, click the BLUE pause icon in the toolbar (initially engaged) to unpause it. As a result, the Pipeline Designer will ask the NeuroPype server to start executing your pipeline. This will print some output. 

Congratulations! You successfully set up LSL, PupilLabs and NeuroPype Academic version. Go ahead and experiment with your EEG system, or any other kind of sensor that supports LSL and NeuroPype.

Feel free to tweet @Gavindya2 if you have any questions about this tutorial or need any help with your installation.

--Gavindya Jayawardena (@Gavindya2)

2019-07-17: Bathsheba Farrow (Computer Science PhD Student)

My name is Bathsheba Farrow.  I joined Old Dominion University as a PhD student in the fall of 2016.  I am currently researching various technologies for reliable data collection in individuals suffering from Post-Traumatic Stress Disorder (PTSD).  I intend to use machine learning algorithms to identify patterns in their physiological data to support rapid, reliable PTSD diagnosis.  However, diagnosis is only one side of the equation.  I also plan to investigate eye movement desensitization and reprocessing, brainwave technology, and other methods that may actually alleviate or eliminate PTSD symptoms.  I am working with partners at Eastern Virginia Medical School (EVMS) to discover more ways technology can be used to diagnose and treat PTSD patients.

In May 2019, I wrote and submitted my first paper related to my PTSD research to the IEEE 20th International Conference on Information Reuse and Integration (IRI) for Data Science: Technological Advancements in Post-Traumatic Stress Disorder Detection: A Survey.  The paper was accepted in June 2019 by the conference committee as a short paper.  I am currently scheduled to present the paper at the conference on 30 July 2019.  The paper describes brain structural irregularities and psychophysiological characteristics that can be used to diagnose PTSD.  It identifies technologies and methodologies used in past research to measure symptoms and biomarkers associated with PTSD that have aided or could aid in diagnosis.  The paper also describes some of the shortcomings of past research and other technologies that could be utilized in future studies.

While working on my PhD, I also work full-time as a manager of a branch of military, civilian, and contractor personnel within Naval Surface Warfare Center Dahlgren Division Dam Neck Activity (NSWCDD DNA).  I originally started my professional career with internships at the Bonneville Power Administration and Lucent Technologies.  Since 2000, I have worked as a full-time software engineer developing applications for Verizon, the National Aeronautics and Space Administration (NASA), the Defense Logistics Agency (DLA), Space and Naval Warfare (SPAWAR) Systems Command, and Naval Sea Systems Command (NAVSEA).  I have used a number of programming languages and technologies during my career including, but not limited to, Smalltalk, Java, C++, Hibernate, Enterprise Architect, SonarQube, and HP Fortify.

I completed  a Master’s degree in Information Technology through Virginia Tech in 2007 and a Bachelor of Science degree in Computer Science at Norfolk State University in 2000.  I also completed other training courses through my employers including, but not limited to, Capability Maturity Model Integration (CMMI), Ethical Hacking, and other Defense Acquisition University courses.

I am originally from the Hampton Roads area.  I have two children, with my oldest beginning her undergraduate computer science degree program in the fall 2019 semester.

2019-07-30: SIGIR 2019 in Paris Trip Report

ACM SIGIR 2019 was held in Paris, France July 21-25, 2019 in the conference center of the Cite des sciences et de l'industrie. Attendees were treated to great talks, delicious food, sunny skies, and warm weather. The final day of the conference was historic - Paris' hottest day on record (42.6 C, 108.7 F).
 
There were over 1000 attendees, including 623 for tutorials, 704 for workshops, and 918 for the main conference. The acceptance rate for full papers was a low 19.7%, with 84/426 submissions accepted. Short papers were presented as posters, set up during the coffee breaks, which allowed for nice interactions among participants and authors. (Conference schedule - will be updated with links to videos of most of the talks)

Several previously-published ACM TOIS journal papers were invited for presentation as posters or oral presentations. We were honored to be invited to present our 2017 ACM TOIS paper, "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages" (Alkwai, Nelson, Weigle) during the conference.

Opening Reception

On Sunday, the conference opened with the Doctoral Consortium, tutorials, and a lovely reception at the Grande Galerie de l'Evolution.

Keynote 1 (July 22)

The opening keynote was given by Bruce Croft (@wbc11), Director of UMass' Center for Intelligent Information Retrieval, on the "Importance of Interaction in Information Retrieval" (slides).
Croft began with categorizing two IR research communities: CS as system-oriented and IS as user-oriented
From there, he gave an overview of interaction in IR and pointed to questions and answers (and conversational recommendation) as an essential component of interactive systems. Asking clarifying questions is key to a quality interaction.  Interaction in IR requires a dialogue.

I appreciated the mentions of early infovis in IR.

I'll let these tweets summarize the rest of the talk, but if you missed it you should watch the video when it's available (I'll add a link).

SIRIP Panel (July 23)

The SIGIR Symposium on IR in Practice (SIRIP) (formerly known as the "SIGIR industry track")  panel session was led by Ricardo Baeza-Yates and focused on the question, "To what degree is academic research in IR/Search useful for industry, and vice versa?"

The panelists were:
It was an interesting discussion with nice insights into the roles of industrial and academic research and how they can work together.






      Women in IR Session (July 23)

      The keynote for the Women in IR (@WomenInIR) session was given by Mounia Lalmas (@mounialalmas) from Spotify.

      This was followed by a great panel discussion on several gender equity issues, including pay gap and hiring practices.

      Banquet (July 23)

      The conference banquet was held upstairs in the Cite des sciences et de l'industrie.


      During a break in the music, the conference award winners were announced:

      Best Presentation at the Doctoral Consortium: From Query Variations To Learned Relevance Modeling
      Binsheng Liu (RMIT University)

      Best Short Paper: Block-distributed Gradient Boosted Trees
      Theodore Vasiloudis (RISE AI, @thvasilo), Hyunsu Cho (Amazon Web Services), Henrik Boström (KTH Royal Institute of Technology)

      Best Short Paper (Honorable Mention): Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models
      Wei Yang (University of Waterloo), Kuang Lu (University of Delaware), Peilin Yang (No affiliation), Jimmy Lin (University of Waterloo)

      Best Paper (Honorable Mention): Online Multi-modal Hashing with Dynamic Query-adaption
      Xu Lu (Shandong Normal University), Lei Zhu (Shandong Normal University), Zhiyong Cheng (Qilu University of Technology (Shandong Academy of Sciences)), Liqiang Nie (Shandong University), Huaxiang Zhang (Shandong Normal University)

      Best Paper: Variance Reduction in Gradient Exploration for Online Learning to Rank
      Huazheng Wang (University of Virginia), Sonwoo Kim (University of Virginia), Eric McCord-Snook (University of Virginia), Qingyun Wu (University of Virginia), Hongning Wang (University of Virginia)

      Test of Time Award: Novelty and Diversity in Information Retrieval Evaluation (pdf)
      Charles L. A. Clarke, Maheedhar Kolla (@imkolla), Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, Ian MacKinnon
      Published at SIGIR 2008, now with 881 citations 


      Keynote 2 (July 24)

      The final keynote was given by Cordelia Schmid (@CordeliaSchmid) from INRIA and Google on "Automatic Understanding of the Visual World".
      She presented her work on understanding actions in video and interaction with the real world.  One interesting illustration was a video of a person walking and then falling down.  Without taking enough context into account, a model may classify this as a person sitting (seeing only the result of the fall), but by tracking the action, their model can detect and correctly classify the falling action.


      My Talk (July 24)

      After the final keynote, I presented our 2017 ACM TOIS paper, "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages" (Alkwai, Nelson, Weigle) during Session 7B: Multilingual and Cross-modal Retrieval.



      Other Resources

      Check out these other takes on the conference:

        Au Revoir, Paris!





          -Michele

          2019-03-27: Install ParsCit on Ubuntu

          ParsCit is a citation parser developed through a joint effort of Pennsylvania State University and the National University of Singapore. Over the past ten years, it has been compared with many other citation parsing tools and is still widely used. Although Neural ParsCit has been developed, its implementation is still not as easy to use as ParsCit. In particular, PDFMEF encapsulates ParsCit as the default citation parser.

          However, many people have found that installing ParsCit is not very straightforward. This is partially because it is written in Perl and the instructions on the ParsCit website are not 100% accurate. In this blog post, I describe the installation procedure for ParsCit on an Ubuntu 16.04.6 LTS desktop. Installation on CentOS should be similar. The instructions do not cover Windows.

          The following steps assume we install ParsCit under /home/username/github.
          1. Download the source code from https://github.com/knmnyn/ParsCit and unzip it.
            $ unzip ParsCit-master.zip
          2.  Install c++ compiler
            $  sudo apt install g++
            To test it, write a simple program hello.cc and run
            $ g++ -o hello hello.cc
            $ ./hello
          3. Install ruby
            $ sudo apt install ruby-full
            To test it, run
            $ ruby --version
          4. Perl usually comes with the default Ubuntu installation, to test it, run
            $ perl --version
          5. Install Perl modules, first start CPAN
            $ perl -MCPAN -e shell
            choose the default setups until the CPAN prompt is up:
            cpan[1]>
            Then install packages one by one
            cpan[1]> install Class::Struct
            cpan[2]> install Getopt::Long
            cpan[3]> install Getopt::Std
            cpan[4]> install File::Basename
            cpan[5]> install File::Spec
            cpan[6]> install FindBin
            cpan[7]> install HTML::Entities
            cpan[8]> install IO::File
            cpan[9]> install POSIX
            cpan[10]> install XML::Parser
            cpan[11]> install XML::Twig
            choose the default setups
            cpan[12]> install XML::Writer
            cpan[13]> install XML::Writer::String
          6. Install crfpp (version 0.51) from source.
            1. Get into the crfpp directory
              $ cd crfpp/
            2. Unzip the tar file
              $ tar xvf crf++-0.51.tar.gz
            3. Get into the CRF++ directory
              $ cd CRF++-0.51/
            4. Configure
              $ ./configure
            5. Compile
              $ make
              This WILL cause an error like below
              path.h:26:52: error: 'size_t' has not been declared
                   void calcExpectation(double *expected, double, size_t) const;
                                                                  ^
              Makefile:375: recipe for target 'node.lo' failed
              make[1]: *** [node.lo] Error 1
              make[1]: Leaving directory '/home/jwu/github/ParsCit-master/crfpp/CRF++-0.51'
              Makefile:240: recipe for target 'all' failed
              make: *** [all] Error 2
              This is likely caused by the following two lines missing from node.cpp and path.cpp. Add these two lines before the other include statements, so the beginning of each file looks like
              #include "stdlib.h"
              #include <iostream>
              #include <cmath>
              #include "path.h"
              #include "common.h"

              then run ./configure and "make" again.
            6. Install crf++
              $ make clean
              $ make
              This should rebuild crf_test and crf_learn.
          7. Move the executables to where ParsCit expects to find them.
            $ cp crf_learn crf_test ..
            $ cd .libs
            $ cp -Rf * ../../.libs
          8. Test ParsCit. Under the bin/ directory, run
            $ ./citeExtract.pl -m extract_all ../demodata/sample2.txt
            $ ./citeExtract.pl -i xml -m extract_all ../demodata/E06-1050.xml

          2019-08-03: TweetedAt: Finding Tweet Timestamps for Pre and Post Snowflake Tweet IDs

          Figure 1: Screenshot from TweetedAt service showing timestamp for a deleted tweet from @SpeakerPelosi
          On May 11, 2019, Derek Willis from Politwoops shared a list of deleted tweet IDs which could not be attributed to any Twitter handle followed by them. We tried multiple techniques to find the deleted tweet IDs in web archives, but we were unsuccessful in finding any of them within the time range of our analysis. During our investigation, we learned of Snowflake, a service used by Twitter to generate unique IDs. We used Snowflake to extract the timestamps from the deleted tweet IDs. Of the 107 deleted tweet IDs shared with us, only seven were in the time range of our initial analysis. In this post, we describe TweetedAt, a web service and library to extract the timestamps of post-Snowflake IDs and estimate the timestamps of pre-Snowflake IDs.

          Previous implementations of Snowflake in different programming languages such as Python, Ruby, PHP, Java, etc. have implemented finding the timestamp of a Snowflake tweet ID, but none provide for estimating the timestamps of pre-Snowflake IDs.

          The reasons for implementing TweetedAt are:
          • It is the only web service which allows users to find the timestamp of Snowflake tweet IDs and estimate tweet timestamps for pre-Snowflake Tweet IDs.
          • Twitter developer API has access rate limits. It acts as a bottleneck in finding timestamps for a data set of tweet IDs. This bottleneck is not present in TweetedAt because we do not interact with Twitter's developer API for finding the timestamps. 
          • Deleted, suspended, and private tweets do not have their metadata accessible from Twitter's developer API. TweetedAt is the solution to finding the timestamps for any of these inaccessible tweets. 

          Snowflake


          In 2010, Twitter migrated its database from MySQL to Cassandra. Unlike MySQL, Cassandra does not support a sequential ID generation technique, so Twitter announced Snowflake, a service to generate unique IDs for all tweet IDs and other objects within Twitter like lists, users, collections, etc. Snowflake generates unsigned 64-bit integers which consist of:
          • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
          • configured machine ID - 10 bits - gives us up to 1024 machines
          • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)

          According to Twitter's post on Snowflake, the tweet IDs are k-sorted within a second bound but the millisecond bound cannot be guaranteed. We can extract the timestamp for a tweet ID by right shifting the tweet ID by 22 bits and adding the Twitter epoch time of 1288834974657.  
          Python code to get UTC timestamp of a tweet ID

from datetime import datetime

def get_tweet_timestamp(tid):
    offset = 1288834974657              # Twitter's custom epoch in milliseconds
    tstamp = (tid >> 22) + offset       # drop the 22 low bits (machine ID and sequence)
    utcdttime = datetime.utcfromtimestamp(tstamp / 1000)
    print(str(tid) + " : " + str(tstamp) + " => " + str(utcdttime))
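Building on the snippet above, the remaining fields of the layout can also be recovered with shifts and masks. This is a sketch based on the bit widths listed earlier, not code from the TweetedAt library:

def decompose_snowflake(tid):
    timestamp_ms = (tid >> 22) + 1288834974657  # 41-bit timestamp plus Twitter's epoch
    machine_id = (tid >> 12) & 0x3FF            # next 10 bits: configured machine ID
    sequence = tid & 0xFFF                      # lowest 12 bits: per-machine sequence
    return timestamp_ms, machine_id, sequence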


Twitter released Snowflake on November 4, 2010, but Twitter itself has been around since March 2006. Pre-Snowflake IDs do not have their timestamps encoded in them, but we can estimate those timestamps using 2362 tweet IDs with known timestamps.

          Estimating tweet timestamps for pre-Snowflake tweet IDs

TweetedAt estimates the timestamps for pre-Snowflake IDs with an approximate error of 1 hour. For our implementation, we collected 2362 tweet IDs and their timestamps at a daily interval between March 2006 and November 2010 to create a ground truth data set. The ground truth data set is used for estimating the timestamp of any tweet ID prior to Snowflake. Using a weekly interval for the ground truth data set instead resulted in an approximate error of 3 hours and 23 minutes.
          Batch cURL command to find first tweet  
          msiddique@wsdl-3102-03:/$ curl -Is "https://twitter.com/foobarbaz/status/[0-21]"| grep "^location:"
          location: https://twitter.com/jack/status/20
          location: https://twitter.com/biz/status/21
The ground truth data set ranges from tweet ID 20 to 29700859247. The first non-404 tweet ID found using the cURL batch command is 20. We found a memento from the Internet Archive for @nytimes, captured close to the Snowflake release date, which contains the pre-Snowflake ID 29548970348. We then tried all possible digit combinations on the tweet ID 29548970348 using the cURL batch command to uncover the largest non-404 tweet ID known to us, 29700859247.
          Figure 2: Exponential tweet growth rate in pre-Snowflake time range

          Figure 3: Semi-log scale of tweet growth in pre-Snowflake time range


Figure 4: Pre-Snowflake time range graph showing two close points on the curve (an upper bound and a lower bound) and a point between them for which the timestamp is to be estimated. Each point on the graph is represented by a tuple of Tweet Timestamp (T) and Tweet ID (I).
As shown in Figure 4, assuming the two points are very close on the graph, the curve between them can be approximated as a straight line.


We know the tweet ID (I) for a tweet and want to estimate its timestamp (T), which can be computed using the formula below:
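Based on the description of Figure 4, this is a linear interpolation between the bracketing ground truth points; a minimal sketch, with variable names of our choosing:

def estimate_timestamp(tweet_id, lower, upper):
    # lower and upper are (tweet ID, timestamp) tuples from the ground truth
    # data set that bracket tweet_id; assume a linear slope between them
    (il, tl), (iu, tu) = lower, upper
    return tl + (tweet_id - il) * (tu - tl) / (iu - il)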

The pre-Snowflake timestamp estimation formula was tested on 1000 random tweet IDs generated between the minimum and maximum tweet IDs of the range, and the test set resulted in an approximate average error of 45 minutes. We also created a weekly test data set with 1932 tweet IDs for the pre-Snowflake time range and found an approximate mean error of 59 minutes. Figure 5 shows that, after 2006, the half-yearly mean error rate is within 60 minutes.
          Summary of error difference between the estimated timestamp and the true Tweet timestamp (in minutes) generated on 1000 pre-Snowflake random Tweet IDs
We could replace the timestamp estimation formula with a segmented curve fitting technique applied to the graph shown in Figure 2, which would reduce the program size by excluding all 2362 data points.
          Figure 5: Box plot of error range for Pre-Snowflake IDs conducted over a weekly test set.
          Summary of error difference between the estimated timestamp and the true Tweet timestamp (in minutes) generated on weekly pre-Snowflake random Tweet IDs

          Estimating the timestamp of a deleted pre-Snowflake ID

Figure 7 shows a pre-Snowflake deleted tweet from @barackobama, which can be validated by the cURL response for the tweet ID. The timestamp of the tweet in the memento is in Pacific Daylight Time (GMT-7). Upon converting the timestamp to GMT, it changes from Sun, 19 October 2008 10:41:45 PDT to Sun, 19 October 2008 17:41:45 GMT. Figure 8 shows TweetedAt returning the estimated timestamp of Sun, 19 October 2008 17:29:27 GMT, which is off by approximately 12 minutes.

          Figure 7: Memento from Internet Archive for @barackobama having a pre-Snowflake deleted tweet ID  
          cURL response for @barackobama deleted Tweet ID

          msiddique@wsdl-3102-03:~/WSDL_Work/Twitter-Diff-Tool$ curl -IL https://twitter.com/barackobama/966426142
          HTTP/1.1 301 Moved Permanently
          location: /barackobama/lists/966426142
          ...

          HTTP/1.1 404 Not Found
          content-length: 6329
          last-modified: Wed, 31 Jul 2019 22:00:56 GMT
          ...

          Figure 8: TweetedAt timestamp response for @barackobama's pre-Snowflake deleted tweet ID 966426142 which is off  by 12 minutes

To summarize, we released TweetedAt, a service to find the timestamp of any tweet ID from 2006 through today. We created a ground truth data set of pre-Snowflake IDs collected at a daily interval for estimating the timestamp of any tweet ID prior to Snowflake (November 4, 2010). We tested our pre-Snowflake tweet estimation formula on 1000 test data points and found an approximate mean error of 45 minutes. We also tested it on 1932 test data points collected weekly and found an approximate mean error of 59 minutes.

          Related Links


          2019-08-03: Searching Web Archives for Unattributed Deleted Tweets From Politwoops

On May 11th, 2019, Derek Willis, who works at ProPublica and also maintains the Politwoops project, tweeted a list of deleted tweet IDs found by Politwoops that could not be attributed to any Twitter handle tracked by Politwoops. This was an opportunity for us to revisit our interest in using web archives to uncover deleted tweets. Although we were unsuccessful in finding any of the deleted tweet IDs provided by Politwoops in the web archives, we are documenting our process for coming to this conclusion.

          Politwoops  

Politwoops is a web service which tracks deleted tweets of elected public officials and candidates running for office in the USA and 55 other countries. Politwoops USA is supported by ProPublica.

          Creating Twitter handles list for the 116th Congress 

In a previous post, we discussed the challenges involved in creating a data set of Twitter handles for the members of Congress and provided a data set of Twitter handles for the 116th Congress. A member of Congress can have multiple Twitter accounts, which can be categorized into official, personal, and campaign accounts. We decided to create a data set of official Congressional Twitter accounts rather than personal or campaign accounts because we did not want to track personal tweets from the members of Congress. For this reason, our data set has a one-to-one mapping between a member of Congress and their Twitter handle, listing all 537 current members of Congress with their official Twitter handles. However, Politwoops has a one-to-many mapping between a member of Congress and their Twitter handles because it tracks all the Twitter handles for a member of Congress. We expanded our data set of Twitter handles for the 116th Congress by adding the rest of the Twitter handles that Politwoops tracks to those we already had. For example, our data set of Twitter handles for the 116th Congress has @RepAOC as the Twitter handle for Rep. Alexandria Ocasio-Cortez, while Politwoops lists @AOC and @RepAOC as her Twitter handles.
          Figure 1: Screenshot of  Rep. Alexandria Ocasio-Cortez's Politwoops page highlighting the two handles (@AOC, @RepAOC) Politwoops  tracks for her

          Creating the President, the Vice-President, and Committee Twitter handles list

Politwoops USA tracks members of Congress, the President, and the Vice-President. ProPublica provides the list of Twitter handles tracked by Politwoops in the data sources list of the ProPublica Congress API. Furthermore, we found a subset of committee Twitter handles present in Politwoops that are not advertised in that data sources list. With no complete list of the committee Twitter handles being tracked by Politwoops, we used the CSPAN list of committee handles.
          Figure 2: CSPAN Twitter Committee List showing the SASC Majority committee's Twitter handle, @SASCMajority
          Figure 3: Politwoops returns a 404 for SASC Majority committee's Twitter handle, @SASCMajority
          Figure 4: Screenshot of the @HouseVetAffairs committee Twitter handle being tracked by Politwoops

          List of Different Approaches Used to Find Deleted Tweets using the Web Archives   

Internet Archive CDX Server API

The Internet Archive CDX Server API can be used to list all the mementos in the index of the Internet Archive for a URL or URL prefix. We can broaden our search for a URL with the URL match scope option provided by the CDX Server API. In our study, we used the URL match scope of "prefix".
The URL http://web.archive.org/cdx/search/cdx?url=https://twitter.com/repaoc&matchType=prefix searches for all the URLs in the Internet Archive with the prefix https://twitter.com/repaoc. Using this approach, we received all the different URL variants that exist in the Internet Archive's index for @RepAOC.
Excerpt of the response received from the Internet Archive's CDX Server API for @RepAOC
          com,twitter)/repaoc 20190108184114 https://twitter.com/RepAOC text/html 200 GBB2ADFZOLTFQAPQACVT2XFVBVSEEHT5 42489
          com,twitter)/repaoc 20190109161007 https://twitter.com/RepAOC text/html 200 SLZHJQKN25URYRWQUQI7DW5JZD5M5E6F 43004
          com,twitter)/repaoc 20190109200548 https://twitter.com/RepAOC text/html 200 DWGHG6CSHBE7OETXJD3TEINEWKV372DJ 45123
          com,twitter)/repaoc 20190120082837 https://twitter.com/repaoc text/html 200 JVHASBSCBHPGKCVR7GBVOYRM4H5KQYBP 53697
          com,twitter)/repaoc 20190126051939 https://twitter.com/repaoc text/html 200 YRE4RPA46F7PTQNBQUMHKCLWLL2WUXE2 56420
          com,twitter)/repaoc 20190202170000 https://twitter.com/RepAOC text/html 200 6VS73H6XD5T2TVRC4UJXNT2D6FCNZWMJ 55388
          com,twitter)/repaoc 20190207211032 https://twitter.com/repaoc text/html 200 NQQI4UJ6TUMHS36JATOY35D7P255MEIA 56378
          com,twitter)/repaoc 20190221024247 https://twitter.com/RepAOC text/html 200 K6B3P7IRHIXTZSPXRWUPSBCRZ2HCWBZB 56678
          com,twitter)/repaoc 20190223102039 https://twitter.com/RepAOC text/html 200 OO2U6EUXYTGGEE2Q3ARQJ4SI4QGF2CLR 58008
          com,twitter)/repaoc 20190223180906 https://twitter.com/RepAOC text/html 200 HC6RCIVTTUV6JU35PA2JZ256E7RXY2MN 56799
          com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
          com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
          com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59586
          com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59587
          com,twitter)/repaoc 20190307011545 https://twitter.com/RepAOC text/html 200 R5PQUDWVYCZGAH3B4LVSBQXFXZ5MVXSY 59388
          com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59430
          com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59431
          com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
          com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
          com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59498
          com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59499

          Example for a status URL
          com,twitter)/repaoc/status/1082706172623376384 20190108201259 http://twitter.com/RepAOC/status/1082706172623376384 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 447
The tweet ID from each status URL was compared with the list of deleted tweet IDs provided by Politwoops. Using this approach, we did not find any matching tweet IDs.
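A minimal sketch of this comparison step (the helper names are ours; the prefix query and the CDX server's default space-separated output are as described above):

import re
import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def archived_tweet_ids(handle):
    # One CDX line per capture indexed under https://twitter.com/<handle>
    resp = requests.get(CDX_API, params={"url": "https://twitter.com/" + handle,
                                         "matchType": "prefix"})
    ids = set()
    for line in resp.text.splitlines():
        if not line:
            continue
        original_url = line.split(" ")[2]        # third CDX field is the original URL
        match = re.search(r"/status(?:es)?/(\d+)", original_url)
        if match:
            ids.add(int(match.group(1)))
    return ids

deleted_ids = set()                              # the 107 IDs shared by Politwoops would go here
print(archived_tweet_ids("RepAOC") & deleted_ids)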

          From mementos of the 116th Congress

          For this approach, we fetched all the mementos from web archives for the 116th Congress between 2019-01-03 and 2019-05-15 using MemGator, a Memento aggregator service.
          For example, we queried multiple web archives for Rep. Karen Bass's Twitter handle, @RepKarenBass, to fetch all the mementos for her Twitter profile page. All the embedded tweets from the memento were parsed and compared with the deleted list of tweet ids from Politwoops. Using this approach we did not find any matching tweet ids. 
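A sketch of how one of those TimeMap requests might look through MemGator (the aggregator base URL and the CDXJ format choice are assumptions for illustration, not a record of the exact command we ran):

import requests

# An assumed MemGator deployment exposing /timemap/<format>/<URI-R>
MEMGATOR = "https://memgator.cs.odu.edu/timemap/cdxj/"
uri_r = "https://twitter.com/RepKarenBass"

timemap = requests.get(MEMGATOR + uri_r).text
for line in timemap.splitlines():
    print(line)    # one URI-M per CDXJ line, as in the example below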
          Example of a URI-M  in CDXJ format


          20190201043735 {"uri": "http://web.archive.org/web/20190201043735/https://twitter.com/RepKarenBass", "rel": "memento", "datetime": "Fri, 01 Feb 2019 04:37:35 GMT"}

          Figure 5: Screenshot of the memento for Rep. Karen Bass's Twitter profile page with 20 embedded tweets
          Output upon parsing the fetched mementos
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827923375880359937Timestamp: 1486227289|||TweetText: Sometimes the best way to stand up is to sit down. Happy Birthday Rosa Parks. #OurStory #BlackHistoryMonthpic.twitter.com/fjPMeD3RzX
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827593256988860417Timestamp: 1486148583|||TweetText: I urge the Gov't of Cameroon to respect the civil and human rights of all of its citizens. See my full statement: http://bass.house.gov/media-center/press-releases/rep-bass-condemns-intimidation-against-english-speaking-population …
          TweetType: RT|||ScreenName: RepKarenBass|||TweetId: 827292997100376064Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
          TweetType: OTR|||ScreenName: RepBarbaraLee|||TweetId: 827285964674441216Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
          TweetType: OT|||ScreenName: RepBarbaraLee|||TweetId: 827201943323938816Timestamp: 1486055286|||TweetText: This month is National Children’s Dental Health Month (NCDHM). This year's slogan is "Choose Tap Water for a Sparkling Smile"pic.twitter.com/gk1cj8oTK9
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196902273929217Timestamp: 1486054084|||TweetText: On the growing list of things I shouldn't have to defend my stance on, add #UCBerkeley, 1 of our nation's most prestigious pub. universities
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196521347166209Timestamp: 1486053993|||TweetText: .@realDonaldTrump: #UCBerkeley developed immunotherapy for cancer!
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196386512871425Timestamp: 1486053961|||TweetText: .@realDonaldTrump: Do you like Vitamin K? Discovered/synthesized at #UCBerkeley
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196102554296320Timestamp: 1486053894|||TweetText: .@realDonaldTrump What's your stance on painkillers? Beta-endorphins invented at #UCBerkeleyhttps://twitter.com/realDonaldTrump/status/827112633224544256 …
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826960463590207488Timestamp: 1485997713|||TweetText: Happy to see Judge Birotte of LA continue the fight towards ending Pres. Trump’s exec. order.http://www.latimes.com/local/lanow/la-me-ln-federal-order-travel-ban-20170201-story.html …
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826877783787847681Timestamp: 1485978000|||TweetText: This morning, I was happy to attend @MENTORnational's Capitol Hill Day, where mentors advocate for services for all youth. Thank you!pic.twitter.com/EyVgDSIvuE
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826860675007930368Timestamp: 1485973921|||TweetText: A civil & women's rights activist, Dorothy Height helped black women throughout America succeed. #OurStory #BlackHistoryMonth #NewStamppic.twitter.com/v8wnHFpgMu
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826833993874042880Timestamp: 1485967560|||TweetText: Let's not turn our backs on the latest refugees and potential citizens just because they come from Africa. More: https://bass.house.gov/media-center/press-releases/rep-bass-pens-letter-urging-president-trump-rescind-travel-ban …pic.twitter.com/J9veQNSpJu
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826822912057413633Timestamp: 1485964918|||TweetText: Trump's listening session is w people he knows and should be "listening" to all the time---campaign surrogates, supporters, employees
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826799517295058944Timestamp: 1485959340|||TweetText: 57 years ago, four Black college students sat at a lunch counter and asked for lunch. We will not go back. #OurStory #BlackHistoryMonthpic.twitter.com/ER00yv1q7B
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826606703928078336Timestamp: 1485913370|||TweetText: 7 in 10 Americans do NOT support @POTUS relentless quest to strike down Roe v Wade. Where does #Gorsuch stand? #SCOTUS
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826579567376637952Timestamp: 1485906900|||TweetText: Proud to stand w/ my Foreign Affairs colleagues and defend dissenting diplomats..http://www.politico.com/story/2017/01/trump-immigration-ban-state-department-dissent-democrats-234433 …
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826547056235933697Timestamp: 1485899149|||TweetText: Treasury nominee #Mnuchin denied that his company engaged in robo-signing, foreclosing on Americans without proper review #RejectMnuchin
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826474625831890945Timestamp: 1485881880|||TweetText: Few cities on this planet have benefited so handsomely from immigration as LA. Read the @TrumanProject letter: http://ow.ly/oOhF308waXJ
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826426234422829056Timestamp: 1485870343|||TweetText: Today is the day! #GetCoveredhttps://twitter.com/JoaquinCastrotx/status/826416237223755777 …
          TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826275582883270656Timestamp: 1485834425|||TweetText: Pres. Trump has replaced Yates as standing AG for standing up for millions. You can't replace us all.http://www.cbsnews.com/amp/news/trump-fires-acting-attorney-general-sally-yates/?client=safari …

          From mementos of the 115th Congress

          For this approach, we reused locally stored TimeMaps and mementos from the 115th Congress which we collected between 2017-01-20 and 2019-01-03. The list of Twitter handles for the 115th Congress was obtained from the data set on the 115th Congressional tweet ids released by Social Feed Manager. The request for mementos from the web archives was carried out by expanding the URI-Rs for the Twitter handle with the language and with_replies argument.
          For example, we queried multiple web archives for Doug Jones's Twitter handle, @dougjones, by expanding with the language and with_replies arguments as shown below:

          https://twitter.com/dougjones
          Twitter supports 47 different language variations and multiple arguments such as with_replies. Upon searching for the URI-R https://twitter.com/dougjones, the web archives return all the mementos for the exact URI-R without any language variations or arguments.

          Excerpt of the TimeMap response received for https://twitter.com/dougjones

          20110210134919 {"uri": "http://web.archive.org/web/20110210134919/http://twitter.com:80/dougjones", "rel": "first memento", "datetime": "Thu, 10 Feb 2011 13:49:19 GMT"}
          20180205201909 {"uri": "http://web.archive.org/web/20180205201909/https://twitter.com/DougJones", "rel": "memento", "datetime": "Mon, 05 Feb 2018 20:19:09 GMT"}
          20180306132212 {"uri": "http://wayback.archive-it.org/all/20180306132212/https://twitter.com/DougJones", "rel": "memento", "datetime": "Tue, 06 Mar 2018 13:22:12 GMT"}
          20180912165539 {"uri": "http://wayback.archive-it.org/all/20180912165539/https://twitter.com/DougJones", "rel": "memento", "datetime": "Wed, 12 Sep 2018 16:55:39 GMT"}
          Upon searching for the URI-R https://twitter.com/dougjones?lang=en, the web archives return all the mementos for the language variation "en".

          TimeMap response received for https://twitter.com/dougjones?lang=en

          20190424140424 {"uri": "http://web.archive.org/web/20190424140424/https://twitter.com/dougjones?lang=en", "rel": "first memento", "datetime": "Wed, 24 Apr 2019 14:04:24 GMT"}
          20190501165834 {"uri": "http://web.archive.org/web/20190501165834/https://twitter.com/dougjones?lang=en", "rel": "memento", "datetime": "Wed, 01 May 2019 16:58:34 GMT"}
          20190509164649 {"uri": "http://web.archive.org/web/20190509164649/https://twitter.com/dougjones?lang=en", "rel": "last memento", "datetime": "Thu, 09 May 2019 16:46:49 GMT"}
Many mementos in the web archives contain Twitter handle URLs with the language and with_replies arguments. Therefore, for each Twitter handle we queried both the profile URL and its with_replies variant, each with 47 different language variations. In total, we created 96 URLs for each Twitter handle.
https://twitter.com/dougjones?lang=en (47 URLs for 47 languages, plus the base profile URL)
https://twitter.com/dougjones/with_replies?lang=en (47 URLs for 47 languages, plus the base with_replies URL)
Total: 96 URLs for each URI-R

          Example for different language variation URLs:
          ...
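A sketch of how such a set of URL variants can be generated (the language list here is abbreviated and illustrative; Twitter's full set of 47 language codes would be used in practice):

LANGUAGES = ["en", "es", "fr", "de", "ar", "ja"]   # abbreviated; 47 codes in total

def url_variants(handle):
    bases = ["https://twitter.com/" + handle,
             "https://twitter.com/" + handle + "/with_replies"]
    urls = []
    for base in bases:
        urls.append(base)                          # no language argument
        urls += [base + "?lang=" + lang for lang in LANGUAGES]
    return urls                                    # 2 * (1 + 47) = 96 with the full list

print(len(url_variants("dougjones")))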

The parsed embedded tweets from the mementos were compared with the deleted list from Politwoops. Using this approach, we did not find any matching tweet IDs.
We also had locally stored mementos for the 115th Congress from 2017-01-01 to 2018-06-30. The data set of Twitter handles for this collection was created by taking a Wikipedia page snapshot of the then-current members of Congress on July 4, 2018, and using the CSPAN Twitter list of members of Congress and Politwoops to get all the Twitter handles. Upon parsing the embedded tweets from the mementos, we compared the parsed tweets with the deleted list from Politwoops. Using this approach, we did not find any matching tweet IDs.

          From mementos of the President, the Vice-President and the Committee Twitter handles list

For this analysis, we fetched all the mementos for the President, the Vice-President, and the committee handles between 2019-01-03 and 2019-06-30. Upon fetching the mementos and parsing the embedded tweets, we compared the parsed tweets with the deleted list from Politwoops. Using this approach, we did not find any matching tweet IDs.

During our analysis, we also learned how to extract the timestamp from any tweet ID. Snowflake is a service used to generate unique IDs for all the tweet IDs and other objects within Twitter, like lists, users, and collections. Snowflake generates unsigned 64-bit integers which consist of:
          • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
          • configured machine id - 10 bits - gives us up to 1024 machines
          • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
We have created a web service, TweetedAt, to extract timestamps from deleted tweet IDs. Using TweetedAt, we found the timestamps of all the deleted tweet IDs provided by Derek Willis from Politwoops. Of the 107 Politwoops deleted tweet IDs, only six were in the 116th Congress time range and nine were in the 115th Congress time range.
To summarize, we were unable to find any of the deleted tweet IDs provided by Derek Willis from Politwoops. We analyzed the sources given below:
          • Mementos for the 116th Congress Twitter handles between 2019-01-03 and 2019-05-15.
          • Twitter handle mementos for the committees, the President and the Vice-President between 2019-01-03 and 2019-06-30. 
          • Mementos for the 115th Congress Twitter handles between 2017-01-03 and 2019-01-03. 
          • The Internet Archive cdx server API responses on the 116th Congress Twitter handles.
          There are several possible reasons for being unable to find the deleted tweet ids provided by Derek Willis from Politwoops:
          • 92 out of 107 deleted tweets were outside the date range of our analysis.
          • The mementos in a web archive are indexed by their URI-Rs. When a user changes their Twitter handle, the original resource URI-R for the user's Twitter account also changes. For example, Rep. Nancy Pelosi used the Twitter handle, @nancypelosi, during the 115th Congress but changed it to @speakerpelosi in the 116th Congress. Now querying the web archives for the mementos for Rep. Nancy Pelosi with her Twitter handle, @speakerpelosi, returns the earliest mementos from the 116th Congress. In order to get mementos prior to the 116th Congress, we need to query the web archives with Twitter handle, @nancypelosi. 
          • The data set of Twitter handles for the US Congress used in our analysis has a one-to-one mapping between a seat in the Congress and the member of Congress. If a seat in the US Congress has been held by multiple members over the Congress tenure, the data set includes the current member of Congress over the former members thus losing out on Twitter handles of the former members within the same Congress.
          We analyzed the web archives for the 115th and the 116th Congress members, the President, the Vice-President, and committee Twitter handles for finding the deleted tweet ids provided to us by Derek Willis from Politwoops. Despite being unable to find any match for the deleted tweet ids from our analysis, we will continue to investigate as we learn more.  We welcome any information that might aid our analysis.

          -----
          Mohammed Nauman Siddique
          (@m_nsiddique)

          2019-08-14: Building the Better Crowdsourced Study - Literature on Mechanical Turk

The XKCD comic "Study" parodies the challenges of recruiting study participants.

          As part of "Social Cards Probably Provide For Better Understanding Of Web Archive Collections" (recently accepted for publication by CIKM2019), I had to learn how to conduct user studies. One of the most challenging problems to solve while conducting user studies is recruiting participants. Amazon's Mechanical Turk (MT) solves this problem by providing a marketplace where participants can earn money by completing studies for researchers. This blog post summarizes the lessons I have learned from other studies that have successfully employed MT. I have found parts of this information scattered throughout different bodies of knowledge, but not gathered in one place; thus, I hope it is a useful starting place for future researchers.

MT is by far the largest source of study participants, with over 100,000 available participants. MT is an automated system that facilitates the interaction of two actors: the requester and the worker. A worker signs up for an Amazon account and must wait a few days to be approved. Once approved, MT provides the worker with a list of assignments to choose from. A Human Intelligence Task (HIT) is an MT assignment. Workers perform HITs for anywhere from $0.01 up to $5.00 or more. Workers earn as much as $50 per week completing these HITs. Workers are the equivalents of subjects or participants found in research studies.

          Workers can browse HITs to complete via Amazon's Mechanical Turk.
          Requesters are the creators of HITs. After a worker completes a HIT, the requester decides whether or not to accept the HIT and thus pay the worker. Requesters use the MT interface to specify the amount to be paid for a HIT, how many unique workers per HIT, how much time to allot to workers, and when the HIT will no longer be available for work (expire). Also, requesters can specify that they only want workers with specific qualifications, such as age, gender, employment history, or handedness. The Master Qualification is assigned automatically by the MT system based on the behavior of the workers. Requesters can also specify that they only want workers with precise approval rates.

          Requesters can create HITs using the MT interface, which provides a variety of templates.
          The HITs themselves are HTML forms entered into the MT system. Requesters have much freedom within the interface to design HITs to meet their needs, even including JavaScript. Once the requester has entered the HTML into the system, they can preview the HIT to ensure that it looks and responds as expected. When the requester is done creating the HIT, they can then save it for use. HITs may contain variables for links to visualizations or other external information. When the requester is ready to publish a HIT for workers to perform, they can submit a CSV file containing the values for these variables. MT will create one HIT per row in the CSV file. Amazon will require that the requester deposit enough money into their account to pay for the number of HITs they have specified. After the requester pays for the HITs, workers can see the HIT and then begin their submissions. The requester then reviews each submission as it comes in and pays workers.

          The MT environment is different from that used in traditional user studies. MT participants can use their own devices to complete the study wherever they have a connection to the Internet. Requesters are limited in the amount of data that they can collect on MT participants. For each completed HIT, the MT system supplies the completion time and the responses provided by the MT participant. A requester may also employ JavaScript in the HIT to record additional information.

          In contrast, traditional user studies allow a researcher to completely control the environment and record the participant's physical behavior. Because of these differences, some scholars have questioned the effectiveness of MT's participants. To assuage this doubt, Heer et al. reproduced the results of a classic visualization experiment. The original experiment used participants recruited using traditional methods. Heer recruited participants via MT and demonstrated that the results were consistent with the original study. Kosara and Ziemkiewicz reproduced one of their previous visualization studies and discovered that MT results were equally consistent with the earlier study. Bartneck et al. conducted the same experiment with both traditionally recruited participants and MT workers. They also confirmed consistent results between these groups.

          MT is not without its criticism. Fort, Adda, and Cohen raise questions on the ethical use of MT, focusing on the potentially low wages offered by requesters. In their overview of MT as a research tool, Mason and Suri further discuss such ethical issues as informed consent, privacy, and compensation. Turkopticon is a system developed by Irani and Silberman that helps workers safely voice grievances about requesters, including issues with payment and overall treatment.

          In traditional user studies, the presence of the researcher may engender some social motivation to complete a task accurately. MT participants are motivated to maximize their revenue over time by completing tasks quickly, leading some MT participants to not exercise the same level of care as a traditional participant. Because of the differences in motivation and environments, MT studies require specialized design. Based on the work of multiple academic studies, we have the following advice for requesters developing meaningful tasks with Mechanical Turk:
          • complex concepts, like understanding, can be broken into smaller tasks that collectively provide a proxy for the broader concept (Kittur 2008)
          • successful studies ensure that each task has questions with verifiable answers (Kittur 2008)
          • limiting participants by their acceptance score has been successful for ensuring higher quality responses (Micallef 2012, Borkin 2013)
          • participants can repeat a task – make sure each set of responses corresponds to a unique participant by using tools such as Unique Turker (Paolacci 2010)
          • be fair to participants; because MT is a competitive market for participants, they can refuse to complete a task, and thus a requester's actions lead to a reputation that causes participants to avoid them (Paolacci 2010)
          • better payment may improve results on tasks with factually correct answers (Paolacci 2010, Borkin 2013, PARC 2009) – and can address the ethical issue of proper compensation
          • being up front with participants and explaining why they are completing a task can improve their responses (Paolacci 2010) – this can also help address the issue of informed consent
          • attention questions can be useful for discouraging or weeding out malicious or lazy participants that may skew the results (Borkin 2013, PARC 2009)
          • bonus payments may encourage better behavior from participants (Kosara 2010) – and may also address the ethical issue of proper compensation
          MT provides a place to recruit participants, but recruitment is only one part of successfully conducting user experiments. To create successful user experiments, I recommend starting with "Methods for Evaluating Interactive Information Retrieval Systems with Users" by Diane Kelly.

          For researchers starting down the road of user studies, I recommend starting first with Kelly's work and then circling back to the other resources noted here when developing their experiment.

          -- Shawn M. Jones

          2019-08-24: Six WS-DL Classes Offered for Fall 2019

          https://xkcd.com/2180/

          A record six WS-DL courses are offered for Fall 2019:
          I am on research leave for Fall 2019 and will not be teaching.

          Dr. Brunelle's CS 891 is especially suitable for incoming graduate students that would like an introduction on how to read research papers and give presentations.

          If you're interested in these classes, you need to take them this semester.  Although subject to change, a likely offering of WS-DL classes for Spring 2020 is:
          • CS 395 Research Methods in Data and Web Science, Dr. Michael L. Nelson
          • CS 480/580 Introduction to AI, Dr. Vikas Ashok
          • CS 495/595 Introduction to Data Mining, Dr. Sampath Jayarathna
          • CS 800 Research Methods, Dr. Michele C. Weigle
Dr. Wu has a course buyout and will not be teaching in Spring 2020.

          --Michael

          2019-08-30: Where did the archive go? Part1: Library and Archives Canada


Web archives are established with the objective of providing permanent access to archived web pages, or mementos. However, in our 14-month study of 16,627 mementos from 17 public web archives, we found that three web archives changed their base URLs and did not leave a machine-readable method of locating their new URLs. We were able to manually discover the three new URLs for these archives. A fourth archive has partially ceased operations.

          (1) Library and Archives Canada (collectionscanada.gc.ca)
Around May 2018, mementos in this archive were moved to a new archive (webarchive.bac-lac.gc.ca) which has a different domain name. We noticed that 49 mementos (out of 351) cannot be found in the new archive.

          (2) The National Library of Ireland (NLI) 
Around May 2018, the European Archive (europarchive.org) was shut down and the domain name was purchased by another entity. The National Library of Ireland (NLI) collection preserved by this archive was moved to another archive (internetmemory.org). All 979 mementos could be retrieved from the new archive (i.e., no missing mementos). Around September 2018, the archive internetmemory.org became unreachable (timeout error). The NLI collection preserved by this archive was moved to another archive (archive-it.org). The other archived collections in internetmemory.org may also have been moved to archive-it.org or to other archives. The number of missing mementos from NLI is 192 (out of 979).

          (3) Public Record Office of Northern Ireland (PRONI) (webarchive.proni.gov.uk)
Around October 2018, all mementos preserved by this archive were moved to archive-it.org. The PRONI archive's homepage is still online and shows a list of web page URLs (not memento URLs). Clicking on any of these URLs redirects to an HTML page in archive-it.org that shows the available mementos (i.e., the TimeMap) associated with the selected URL. The number of missing mementos from PRONI is 114 (out of 469).

          (4) WebCite (webcitation.org)
          The archive has been unreachable (timeout error) for about a month (from June 06, 2019 to July 08, 2019). The archive no longer accepts any new archiving requests, but it still provides access to all preserved mementos.

          Library and Archives Canada 

          In this post, we provide some details about changes in the archive Library and Archives Canada. Changes in the other three archives will be described in upcoming posts.

          We refer to the archive from which mementos have moved as the "original archive", and we use the "new archive" to refer to the archive to which the mementos have moved. A memento is identified by a URI-M as defined in the Memento framework. 

In our study, we have 351 mementos from collectionscanada.gc.ca. Around May 2018, 302 of those mementos were moved to webarchive.bac-lac.gc.ca (the remaining 49 mementos are lost). For instance, the memento:

          http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/

          is now available at:

          http://webarchive.bac-lac.gc.ca:8080/wayback/20051228174058/http://nationalatlas.gov/

          The representations of both mementos are illustrated in the figure below. The original archive uses the green banner (left) while the new archive uses the yellow banner (right):



          We have several observations about the change in the archive Library and Archives Canada:


          Observation 1: The HTTP request of a URI-M from the original archive does not redirect to the corresponding URI-M in the new archive

          The institution (Library and Archives Canada) that has developed the new archive (webarchive.bac-lac.gc.ca) still controls and maintains the domain name of the original archive (www.collectionscanada.gc.ca). Thus, it would be possible for requests of mementos (URI-Ms) to the original archive to redirect to the corresponding URI-Ms in the new archive. However, we found that every memento request to the original archive redirected to the home page of the new archive as shown below:

          $ curl --head --location --silent http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/ | egrep -i "(HTTP/1.1|^location:)"

          Location: http://www.bac-lac.gc.ca/eng/discover/archives-web-government/Pages/web-archives.aspx
          HTTP/1.1 302 Found
          Location: http://webarchive.bac-lac.gc.ca/?lang=en
          HTTP/1.1 200

          Here is the representation of the home page of the new archive:



We had to manually intervene to detect the corresponding URI-Ms of the mementos in the new archive, which can be done by replacing "www.collectionscanada.gc.ca/webarchives" with "webarchive.bac-lac.gc.ca:8080/wayback" in the URI-Ms of the original archive.
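A minimal sketch of that substitution (illustrative only; the function name is ours):

def migrate_urim(urim):
    # Rewrite an original-archive URI-M to its expected location in the new archive
    return urim.replace("www.collectionscanada.gc.ca/webarchives",
                        "webarchive.bac-lac.gc.ca:8080/wayback")

print(migrate_urim("http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/"))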

          This reminds us of The End of Term Archive (eot.us.archive.org) which was established with the goal of preserving the United States government web (.gov). The domain name (eot.us.archive.org) is still under the control of the Internet Archive (archive.org). The example below shows how the HTTP request to a URI-M in the End of Term Archive redirects to the corresponding URI-M in the Internet Archive. This practice maintains link integrity via "follow-your-nose" from the old URI-M to the new URI-M.

          $ curl --head --location --silent http://eot.us.archive.org/eot/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf | egrep -i "(HTTP/|^location:)"

          HTTP/1.1 302 Found
          Location: https://web.archive.org/web/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf
          HTTP/2 200

We can rewrite URI-Ms of the original archive and have them redirect (301 Moved Permanently) to their corresponding URI-Ms in the new archive. For example, for the Apache web server, mod_rewrite rules can be used to perform automatic redirects and rewrite requested URIs on the fly. Here is a rewrite rule example that the original archive could use to redirect requests to the new archive:

          # With mod_rewrite
          RewriteEngine on
          RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]

          If the original archive serves only mementos under /webarchives, then the mod_rewrite rule would be even simpler:

          # With mod_rewrite
          RewriteEngine on
          RewriteRule   "^/webarchives/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1  [L,R=301]


          Observation 2: Not all mementos are available in the new archive

          Each memento (URI-M) represents a prior version of an original web page (URI-R) at a particular datetime (Memento-Datetime). The timestamp, usually included in a URI-M, is identical to the value of the response header Memento-Datetime. 

          For example, for:

          URI-M = http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

          we have:

          Memento-Datetime = Wed, 08 Feb 2006 07:50:19 GMT
          URI-R = http://www.cdc.gov/

For a URI-M from the original archive, if the values of the Memento-Datetime, the URI-R, and the final HTTP status code are not identical to the values of the corresponding URI-M from the new archive, we consider this a missing memento.
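A sketch of this comparison (the field names are ours; original and new describe the final responses, after following redirects, for the corresponding URI-Ms in the two archives):

def is_missing(original, new):
    # original and new are dicts with the final Memento-Datetime, URI-R,
    # and HTTP status code observed for the corresponding URI-Ms
    keys = ("memento_datetime", "uri_r", "status_code")
    return any(original[k] != new[k] for k in keys)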

In this study, we found that 49 mementos (out of 351) cannot be retrieved from the new archive. Instead, the archive responds with other mementos that have different Memento-Datetimes. Those mementos may (or may not) have the same content as the content returned by the original archive. For example, when we requested the URI-M:

          http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

          from the original archive (www.collectionscanada.gc.ca) on February 27, 2018, we received the HTTP status "200 OK" with the following representation (the Memento-Datetime of this memento is Wed, 08 Feb 2006 07:50:19 GMT):


          In www.collectionscanada.gc.ca
          Then, we requested the corresponding URI-M:

          http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/

          from the new archive. As shown in the cURL session below, the request redirected to another URI-M:

          http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

          This memento has a different Memento-Datetime (Thu, 26 Oct 2006 06:02:47 GMT) for a delta of about 260 days. The content of this memento (the figure below) in the new archive is different from the content of the memento that used to be available in the original archive (the figure above).
          In webarchive.bac-lac.gc.ca
          $ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

          HTTP/1.1 302 Found
          Location: http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/
          HTTP/1.1 200 OK
Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT

The figure below shows a set of screenshots taken for the memento within 14 months. The screenshots with a blue border are representations of the memento in the original archive (www.collectionscanada.gc.ca) before it was moved to the new archive. The screenshots with a red border show the home page of the new archive, before we manually detected the corresponding URI-Ms in the new archive. The screenshots with a green border show the representations resulting from requesting the memento from the new archive (webarchive.bac-lac.gc.ca). The representation before the archive's change (blue border) is different from the representation of the memento after the change (green border).


We replayed the memento 33 times within 14 months.

          Observation 3: New features available in the new archive because of the upgraded replay tool

The new archive (webarchive.bac-lac.gc.ca) uses an updated version of OpenWayback (i.e., OpenWayback Release 1.6.0 or later) that enables new features, such as raw mementos and Memento support. These features were not supported by the original archive, which was running OpenWayback Release 1.4 (or earlier).

          Raw mementos

          At replay time, archives transform the original content of web pages to appropriately replay them (e.g., in a user’s browser). Archives add their own banners to provide metadata about both the memento being viewed and the original page. Archives also rewrite links of embedded resources in a page so that these resources are retrieved from the archive, not from the original server.

          Many archives allow accessing unaltered, or raw, archived content (i.e., retrieving the archived original content without any type of transformation by the archive). The most common mechanism to retrieve the raw mementos is by adding "id_" after the timestamp in the requested URI-M.

          The feature of retrieving the raw mementos was not provided by the original archive (www.collectionscanada.gc.ca). However, it is supported by the new archive. For example, to retrieve the raw content of the memento identified by the URI-M

          http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

          we add "id_" after the timestamp as shown in the cURL session below:

$ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247id_/http://www.cdc.gov/ | egrep -i "(HTTP/|^Memento-Datetime)"

          HTTP/1.1 200 OK
          Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT

          Memento support

          The Memento protocol is supported by most public web archives including the Internet Archive. The protocol introduces two HTTP headers for content negotiation. First, Accept-Datetime is an HTTP Request header through which a client can request a prior state of a web resource by providing the preferred datetime, for example,

          Accept-Datetime: Mon, 09 Jan 2017 11:21:57 GMT.

          Second, the Memento-Datetime HTTP Response header is sent by a server to indicate the datetime at which the resource was captured, for instance,

          Memento-Datetime: Sun, 08 Jan 2017 09:15:41 GMT.

          The Memento protocol also defines:

          • TimeMap: A resource that provides a list of mementos (URI-Ms) for a particular original resource, 
          • TimeGate: A resource that supports content negotiation based on datetime to access prior versions of an original resource. 
          The cURL session below shows the TimeMap of the original resource (http://www.cdc.gov/) available in the new archive. The TimeMap indicates that the memento with the Memento-Datetime Wed, 08 Feb 2006 07:50:19 GMT (as described above) is not available in the new archive.

          $ curl http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/

          <http://www.cdc.gov/>; rel="original",
          <http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/>; rel="self"; type="application/link-format"; from="Thu, 26 Oct 2006 06:02:47 GMT"; until="Fri, 09 Oct 2015 13:26:42 GMT",
          <http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/>; rel="timegate",
          <http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/>; rel="first memento"; datetime="Thu, 26 Oct 2006 06:02:47 GMT",
          <http://webarchive.bac-lac.gc.ca:8080/wayback/20151009132642/http://www.cdc.gov/>; rel="last memento"; datetime="Fri, 09 Oct 2015 13:26:42 GMT"
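Using the TimeGate listed in this TimeMap, a client can also perform datetime negotiation. A minimal sketch with the Python requests library (assuming the archive and its Memento endpoints remain reachable):

import requests

# Negotiate for a datetime near the missing capture; the archive should
# respond with the closest memento it still holds
timegate = "http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/"
resp = requests.get(timegate,
                    headers={"Accept-Datetime": "Wed, 08 Feb 2006 07:50:19 GMT"},
                    allow_redirects=True)
print(resp.url)                                  # URI-M the archive settled on
print(resp.headers.get("Memento-Datetime"))      # e.g., the 26 Oct 2006 capture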

          It is possible that two archives use the same version of OpenWayback but with different configuration options, such as whether to support Memento framework or not:

           <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
            <property name="accessPointPath" value="${wayback.url.prefix}/wayback/"/>
            <property name="internalPort" value="${wayback.url.port}"/>
            <property name="serveStatic" value="true" />
            <property name="bounceToReplayPrefix" value="false" />
            <property name="bounceToQueryPrefix" value="false" />
            <property name="enableMemento" value="true" />

          or how to respond to (raw) archival redirects (thanks to Alex Osborne for help in locating this information):

          <!-- WARN CLIENT ABOUT PATH REDIRECTS -->
          <bean class="org.archive.wayback.replay.selector.RedirectSelector">
           <property name="renderer">
             <bean class="org.archive.wayback.replay.JSPReplayRenderer">
               <property name="targetJsp" value="/WEB-INF/replay/UrlRedirectNotice.jsp" />
             </bean>
           </property>
          </bean>
          ...
          <!-- Explicit (via "id_" flag) IDENTITY/RAW REPLAY -->
          <bean class="org.archive.wayback.replay.selector.IdentityRequestSelector">
            <property name="renderer" ref="identityreplayrenderer"/>
          </bean>


          Observation 4: The HTTP status code may change in the new archive 

          The HTTP status codes of URI-Ms in the new archive might not be identical to the HTTP status code of the corresponding URI-Ms in the original archive. For example, the HTTP request of the URI-M:

          http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/

          to the original archive resulted in the following "302" redirects before it ended up with the HTTP status code "404":

          http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/ (302)
          http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
          http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
          http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (404)

When we requested the corresponding URI-M from the new archive, it ended up with the HTTP status code "200":

          http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.berlin.gc.ca/ (Redirect by JavaScript (JS))
          http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
          http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
          http://webarchive.bac-lac.gc.ca:8080/wayback/20071115025620/http://www.international.gc.ca/canada-europa/germany/ (302) 
http://webarchive.bac-lac.gc.ca:8080/wayback/20071115023828/http://www.international.gc.ca/canada-europa/germany/ (200)

          The list of all 351 URI-Ms is shown below. The file contains the following information:
          • The URI-M from the original archive (original_URI-M).
          • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M). 
          • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
          • The URI-M from the new archive (new_URI-M).
          • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
          • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
          • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
          • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs). The different URI-Rs are labeled with "No", otherwise "-".
          • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code). The different status codes are labeled with "No", otherwise "-".
          • The first 49 rows contain the information of the missing mementos.


          Conclusions

When Library and Archives Canada migrated their archive in May 2018, 49 of the 351 mementos we were tracking resurfaced in the new archive with a change in Memento-Datetime, URI-R, or the final HTTP status code. In three cases, the HTTP status codes of mementos in the new archive changed from the status codes in the original archive. Also, updating or upgrading a web archival replay tool (e.g., OpenWayback or PyWb) may affect how migrated mementos are indexed and replayed. In general, with any memento migration, we recommend that, when possible, requests for mementos to the original archive be redirected to their corresponding mementos in the new archive (e.g., the case of the End of Term Archive explained above).

In the upcoming posts, we will provide some details about the changes in the other archives: the National Library of Ireland (NLI), the Public Record Office of Northern Ireland (PRONI), and WebCite.

          --Mohamed Aturban

          2019-09-02: So Long, and Thanks for All the Frogs

Mat Kelly has received his PhD. This is the Final Blog Post. ⓖⓞⓖⓐⓣⓞⓡⓢ


On May 7th, 2019, after a very long trek as a PhD student, I successfully defended my dissertation, "Aggregating Private and Public Web Archives Using the Mementity Framework" (slides). The tome (physical height still to be determined), originally titled "A Framework for Aggregating Private and Public Web Archives", consisted of exactly that and a bit more. The crux of the work was originally presented in a best-paper-nominated JCDL 2018 paper (arXiv) bearing the latter title (hence the change). The extended version addressed issues beyond the 10-page conference paper limit. In this post I will provide a very high level synopsis of the work's contribution to the area of web archiving and a round-up of my experience as a PhD student.

First, a word about the "mementity" concept, which we introduced as an alternative to the already overloaded "entity" nomenclature. In the parlance of the framework, a mementity (Memento entity) is a realized or implemented concept like a Memento TimeGate or Memento Aggregator. In the framework, we introduced three new mementities:

          Memento Meta-Aggregator (MMA)
          For allowing advanced aggregation of archives like subsetting, supplementing, and filtering of archival sources
          Private Web Archive Adapter (PWAA)
For integrating access to private web archives where dereferencing a URI is insufficient
          StarGate (SG)
          For advanced querying of web archives based on attributes of the mementos

The concepts behind the mementities were progressively developed. The first requirement was allowing clients to have more control over which archival sources are used for aggregation. MMAs allow exactly this and, through the usage of the HTTP Prefer header (RFC 7240), also allow for query precedence of the archives queried (cf. sending requests to all known archives at the same time) as well as short-circuiting (halting queries to subsequent archives when a specified condition is met).

          Aggregating public and private web archives may not be straightforward, as private web archives can require additional querying parameters, e.g., credentials, to access their mementos. Through the base case usage of OAuth 2 (RFC 6749), access patterns as used on the live web can be systematically translated to accessing private web archives. If an MMA is aware that an archive is private, it can delegate the authentication dance to a separate mementity, the PWAA.
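          A minimal sketch of that delegation, assuming a hypothetical private archive endpoint and an OAuth 2 bearer token held by the PWAA (not the MMA); the URIs and token handling are illustrative only:

          import requests

          PRIVATE_ARCHIVE_TIMEMAP = 'https://private.archive.example/timemap/link/'  # hypothetical endpoint
          ACCESS_TOKEN = 'stored-oauth2-access-token'  # obtained out-of-band via an OAuth 2 grant

          def pwaa_fetch_timemap(uri_r):
              """Dereference a private archive's TimeMap on behalf of an MMA,
              attaching credentials that the MMA itself never sees."""
              resp = requests.get(PRIVATE_ARCHIVE_TIMEMAP + uri_r,
                                  headers={'Authorization': 'Bearer ' + ACCESS_TOKEN})
              resp.raise_for_status()
              return resp.text

          # The MMA simply calls the PWAA; the authentication dance stays out of the MMA
          print(pwaa_fetch_timemap('http://example.com/')[:300])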

          Because querying web archives can be temporally, computationally, and spatially expensive when performed in bulk, exposing attributes of an archive's holdings in a TimeMap allows for more sophisticated querying. For example, when creating summaries of a URI over time, generating a SimHash of the HTML of each memento allows for detection of significant changes in the page and identifies likely candidates for inclusion (per Ahmed AlSum's ECIR 2014 paper). We encountered this issue when initially implementing the web archive summarization visualization for the Web Archiving Collaboration Conference. Retaining these SimHashes, once calculated, allows subsequent summaries to be generated much more quickly. This amounts to populating TimeMaps with attributes of mementos beyond time. The addition of this arbitrary, wildcard (*) set of attributes semantically renames TimeMaps to StarMaps per the framework. Being able to filter on these attributes requires communicating with endpoints, like one to generate SimHashes for a URI-M. Delegating this role to a separate mementity, the StarGate, allows for client-side negotiation of web archives in dimensions beyond time. We initially explored this in our work for WADL 2018.
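          To make the SimHash filtering concrete, here is a small sketch (my own illustration, not the StarGate implementation) that computes a SimHash per memento and keeps only mementos that differ substantially from the previously kept one; it assumes the simhash Python package and hypothetical URI-Ms:

          import requests
          from simhash import Simhash  # pip install simhash

          # Hypothetical URI-Ms; in practice these would come from a TimeMap (or StarMap)
          uri_ms = [
              'https://web.archive.org/web/20140101000000/http://example.com/',
              'https://web.archive.org/web/20140601000000/http://example.com/',
              'https://web.archive.org/web/20141201000000/http://example.com/',
          ]

          THRESHOLD = 10  # Hamming-distance threshold for a "significant" change
          kept, last_hash = [], None

          for uri_m in uri_ms:
              html = requests.get(uri_m).text
              h = Simhash(html)
              # Keep the first memento and any memento far enough from the last kept one
              if last_hash is None or h.distance(last_hash) > THRESHOLD:
                  kept.append(uri_m)
                  last_hash = h

          print(kept)  # candidate mementos for the summary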

          The framework can be implemented piecemeal -- no mementity is reliant on the others. The power of the framework for the contribution of aggregating private and public web archives is emphasized when all mementities are used. This was the fundamental component of the dissertation that I defended in May.

          "But May...", you might say, "why the delay?"

          Following a defense, one must make edits and refine the document per their committee's recommendations. Luckily, aside from adding an additional appendix, some clarifications, and stylistic changes, my document did not require extensive changes. I applied these changes and submitted the document to the ODU College of Sciences on June 7, 2019:

          The college approved and I submitted the document to ProQuest. About two and a half months later, I heard back with the sole minor change being an incorrect page number (thanks, LaTeX!), which I promptly adjusted. After a couple more weeks and some pings to ODU's ProQuest representative, my dissertation was approved:

          This completes my time as an academic student. My next role will keep me in academia but on the other side of the table:

          During my term as a graduate student, I had 19 peer-reviewed publications (worth 72 WS-DL publication points), collaborated with authors/presenters from 8 different institutions1, and wrote 29 blog posts (inclusive). I also lived in four cities (Charleston, Goose Creek, Virginia Beach, Portsmouth) in two locales (Lowcountry, Tidewater), had a child, worked for three employers, and most importantly, climbed to the top of the PhD Crush board.

          —Mat (@machawk1)

          1 Old Dominion University, Los Alamos National Laboratory, Clemson University, Science Systems and Applications, Inc., NASA Langley Research Center, BMW Group, Intel Corporation, Protocol Labs

          2019-09-04: Invited Talk at ODU CS Summer Research Workshop: Eye Tracking for Predicting ADHD

          In summer 2019, ten students from B.N.M. Institute of Technology, fourteen students from Acharya Institute of Technology, and one student from Ramaiah Institute of Technology participated in the Summer Research Workshop organized by Ajay Gupta and the CS department at ODU. Over the past few years, this workshop has enabled participants to collaborate with various research groups and join ODU for graduate degrees. One of the main goals of this annual workshop is to encourage undergraduate students to actively engage in research activities.

          I was invited to give a talk in one of the sessions on the topic of "Eye Tracking for Predicting ADHD". The slides are available at: https://www.slideshare.net/GavindyaJayawardena/eye-tracking-for-predicting-adhd.

          I was able to make the talk interactive by introducing a couple of eye trackers (PupilLabs tracker and Tobii 4pc) from our lab to the audience. I covered the following topics in the first half of my talk: eye tracking, what we exactly do in eye tracking, muscles of the eye, basic eye movements, why eye tracking is important, applications of eye tracking, and a short demonstration of eye trackers. The second half of my talk was primarily about my research work on predicting ADHD using eye tracking data, information about my collaborations (Dr. Anne Michalek from ODU Special Education), what ADHD is, participants, the task of the experiment, our research interests, how we predicted ADHD, and the results of our study. Finally, I wrapped up my talk with a demonstration of the machine learning software Weka.

          It was rewarding to see that the students were really excited about the prospects of using eye trackers in their future research studies. They genuinely enjoyed testing out some of our eye trackers, particularly the PupilLabs Core eye tracker, which is wearable and has a world camera. Some students wanted me to share details about where to purchase eye trackers, as they were interested in conducting a user study for their final-year project. They also had questions about how to set up a study with eye trackers as well as what features to collect. Since this year's Summer Research Workshop participants were asked to work in groups to research detecting Alzheimer's patients' aggression in real time, some students were interested in using eye tracking in their studies and building a proof-of-concept.

          I would like to thank my adviser Dr. Sampath Jayarathna and Ajay Gupta for giving me the opportunity to present our research work at the workshop.

          --Gavindya Jayawardena (@Gavindya2)

          2019-09-05: How to Become a Tenure-Track Assistant Professor - Part II (job ads, CV, teaching and research statement, LOR and cover letter)

          This is a three-part write-up. In this second post, I'll talk about how to find tenure-track positions, how to shortlist your target schools, and how to prepare your CV, teaching statement, research statement, and cover letters. I'll do another blog post later about how to prepare for interviews (Skype/phone, onsite), what to do and not to do during your on-campus interviews, offer negotiations, the two-body problem, etc.
          • How to Become a Tenure-Track Assistant Professor - Part I (publications, research, teaching and service) 
          • How to Become a Tenure-Track Assistant Professor - Part II (job ads, CV, teaching and research statement, LOR and cover letter) 
          • How to Become a Tenure-Track Assistant Professor - Part III (interview prep, on-campus interview, offer negotiations, two-body problem) 
          Where to find jobs: 
          There are a number of options for finding job advertisements for tenure-track positions (or non-tenure-track teaching positions and postdoc opportunities). I primarily used the ACM and CRA for my job search; these carry CS-related job advertisements (and maybe some non-tenure-track teaching positions and postdoc opportunities in related fields such as Math, Statistics, and Information Science). I find both the ACM and CRA to be solid resources for tracking tenure-track advertisements. Here's a tip: for both of these, you can subscribe (add your email or profile) and receive emails when positions are posted to the site. For other areas, you can find job advertisements in HigherEdJobs, ScienceCareers, etc.
          I'd suggest starting early (preferably in the summer) and creating your profile at ACM and CRA so you'll cover most of the positions. It may also be a good idea to start this process a few years early so you get an idea of the research areas most schools are hiring in. Also remember to keep an eye out for recent advancements in technology so you can keep up with the trends. I remember during my job search, machine learning was the hot topic; the majority of the positions were posted for this particular area. Then, the second time around, it was more about data science. So you need to tune in early and make sure your research directions and expertise are aligned with the hiring trends. I'm not saying that you have to change your area of research, but the earlier you know the trends, the better you can prepare and accommodate some of these relevant areas in your own research.

          Timeline: From my personal experience, the majority of job ads are posted around August to October. Most schools reconvene in the Fall term (for US schools), and the search committees put together job advertisements, get approvals, etc. Most of the application due dates are around early December to late January/February. So, I suggest getting ready to submit your completed applications around early December; this means you need a solid plan with a complete set of your CV, teaching statement, research statement, and cover letters ready to go. The hardest part is getting your letter writers to submit the letters of recommendation on time (I'll talk more about this in the latter part of this write-up).
          How to organize and shortlist your list:
          Figure 1: Organize your list in an Excel sheet
          I used an Excel file (see Figure 1) to keep track of the job advertisements I wanted to apply to. Remember, this is going to be overwhelming; I definitely felt that way (too many applications to submit and too little time) during my job search. In my initial list I had about 120 positions I wanted to submit applications to, and I ended up submitting close to 60. So you need a careful bookkeeping strategy to keep things afloat during this time. It's easy to get discouraged by the sheer number of applications and the amount of work to get yours submitted on time. Here's how I did it. When I received an aggregated list via email from CRA (or ACM), I quickly glanced through it to see if there were any schools I wanted to apply to. I had a certain strategy (or interest) for where I wanted to apply and certain key points I looked at. You need to come up with your own priority list; this may be something to do with the school ranking, program (4-year, MS only, PhD), location, areas of interest, collaborators, etc. I normally hate cold weather, especially anything to do with snow (I like gardening), so for me a regular hot season was a plus. I also used to pick places where I had a chance of getting an interview. I talked about this in my previous blog post. Always take a look at the most recent assistant professors hired by the school you want to apply to. Other than a few exceptions, most departments tend to hire based on a similar trend (the candidate's school reputation, prior publication track record, or grants). Let's take an example: say you want to apply to UT Austin Computer Science. Here's an example of the new faculty they hired in 2016-2017, https://www.cs.utexas.edu/news/2016/new-faculty-2016-17, so you get an idea of whether you have a chance or you are wasting your time. Again, this is just my personal opinion. My recommendation is to find a list of relevant schools based on your own profile. Carefully take a look at your list of publications (venue, acceptance rate, and reputation), research outcomes, your current position (ABD or AP somewhere relevant), and grants. Here's something I believe: academic life is a ladder; you can always find places lower than your current school to jump to, but in order to jump higher, you need a strong portfolio (grants, strong publications, some noteworthy awards).
          Use the above strategies to filter the schools you want to apply to. I used to move the CRA/ACM ad emails to a separate folder in my Gmail account called "unsorted" and then took time to review them each week, pick the ads of schools I wanted to apply to, and move them to another folder called "selected". I also started filling out the Excel sheet with information like the research area of the position (machine learning, data science, HCI, etc.), due date, required items, etc., and color coded them based on priority. Then I started applying to these schools (maybe about 5 or 6 each week). Remember, this is time-consuming, especially if you have a long list of schools to apply to. You cannot do this in one sitting, so spread it out over the weeks and take your time carefully changing your cover letter and statements. Also remember to check the documents multiple times before you submit them. Don't make simple mistakes like using a different school's name in your cover letter. Create a different folder for each school and add the drafts of the CV, cover letter, and teaching and research statements. Carefully rename each file; most schools nowadays use subscription services like Interfolio online dossiers, so you might accidentally end up submitting a cover letter intended for another school. Don't forget to notify your letter writers about the list of schools you applied to and the due dates (I'll talk more about this later).
          CV:
          In my personal opinion, the CV is the most important item other than your cover letter in your candidate package. Most search committees will go through this carefully, looking at the publication track record and other relevant information such as teaching experience. I'm not going to spend too much time explaining the sections you need to have; take a look at my CV and format accordingly. I'd recommend having your own style. I use (hidden) tables in my CV to structure it, and I like to align things and rarely use bullet points. Start this early; I spent the majority of my summer (1 year before the intended job search) refining my CV until I was satisfied. You can do a quick search across several top schools, go to recent assistant professors' pages, and you'll probably find some good CV formats; find something you like and create a personalized template for your own CV based on these (proven to be effective) samples. Also don't forget to have a web page (a school portfolio, not LinkedIn); you don't need to buy your own domain and pay for someone to create a fancy web page. Take a look at mine: I have only 1 picture and pretty much text and hyperlinks. I use KompoZer to maintain my webpage; you can do the same. I'm sure most schools provide you a free domain under your department or university. Talk to the IT people and create something so you can link all of the PDF copies of your research publications and teaching and research statements. Here's a pro tip: you can also add a visitor tracker like StatCounter or Google Analytics to your page to track any visitors from schools. You can be creative with the use of your webpage to include additional information like teaching feedback received, other interesting projects, and articles, and have a link to it in your write-ups (teaching, research, CV).
          Teaching and Research Statement: 
          I'm not going to spend too much time explaining what should go in these. Make sure you have a solid write-up available and have multiple versions with different page lengths. I remember some schools require only a single-page statement, so having versions of various lengths prepared early will save you time during submissions. If you are applying primarily to teaching schools, don't forget to list the actual courses you can teach (most of this information should be available online on the department webpage), and also new courses you intend to propose/prepare. I remember when I was a search committee member, we specifically looked at whether you listed the courses you could teach (from the department curriculum), so we could match your expertise to the department's teaching needs. Especially in your teaching statement, talk about quantitative outcomes like your course evaluation results if available. Don't forget to list your student advising experience and outreach activities; this will show an overall picture of your profile.
          Your research statement should include your agenda for the next few years and several topics that you intend to work on. You need to carefully align this with the job description. If the department is looking for a quantum computing candidate and your background is HCI, then you are wasting your time. Show how your background matches the job description through the topics you are planning to investigate. Have multiple areas of research you are planning to focus on, and state a few specific problems you are currently working on and how you plan to solve them.
          Cover Letter: 
          You need a solid cover letter that quickly summarizes your strengths. Don't forget to highlight your key strengths: things like grants, any best paper awards or nominations, or other noteworthy awards. State your area of research and how it aligns with the areas the particular school is looking for; the Excel sheet I mentioned earlier will be handy at this point. You can create a template with the areas to fill in left highlighted, so you can quickly modify the draft to match the job description. This is one place where things can go wrong: double check before you submit your cover letter, since it's very easy to miss things like the name of the school or to have the wrong school name.
          Have a PDF copy of all the items, like the CV, cover letter, and teaching and research statements. Don't forget to have a scanned copy of your school transcripts handy; some schools ask for unofficial copies to be uploaded at the time of submission.
          Letter of Recommendations:
          You need at least 3 letters of recommendation, preferably from faculty who can write a strong and personalized letter. You need to identify and build these relationships early in your PhD career. Talk to people from other schools, collaborate, and create lasting relationships so you can get 3 solid letter writers. Your PhD adviser obviously should be one of the letter writers, along with members of your dissertation committee, and maybe researchers from industry and government. Writing a letter is time consuming; most of the letters are the same and will not change from school to school, but it takes significant time to prepare a solid letter for a student. Notify your writers early (preferably 3-4 months before the due dates) and provide them enough time to prepare a solid letter. Provide them with a package containing things like your updated CV, list of publications, research projects, etc. Highlight major accomplishments in a separate write-up, so they don't need to read through your CV to find the details.

          2019-08-07: Invited Talk at ODU CS Summer Research Workshop: Introducing EEG in Data Science

          The Department of Computer Science at ODU has been conducting summer workshops over the past few years for selected undergraduate student groups from India. During this period, the students are provided with on-campus accommodation and are arranged to work closely with research groups. Researchers from various groups at ODU present their work and ongoing research to them. Some students who participated in this program in the past have already joined graduate programs at ODU. Ajay Gupta, the Director of Computer Resources at the Department of Computer Science, ODU, plays a significant role in conducting this event. Overall, the program has been a great encouragement for students to engage in research.
          This year, the summer workshop group comprised 25 undergraduates from three universities: Acharya Institute of Technology, BNM Institute of Technology, and Ramaiah Institute of Technology. I had the privilege of presenting my work to them on behalf of the Web Science and Digital Libraries Research Group. I conducted this session on the 16th of July, 2019 at the Engineering and Computational Sciences Building (ECSB) at ODU.
          My presentation was titled "Introduction to Data Science with EEG" (slides). With the hope of providing a more hands-on session, I brought two EEG recording devices, courtesy of my adviser Dr. Sampath Jayarathna. I started with an introduction about myself and gave an overview of the talk. In the beginning, I explained the anatomy of the human brain and the cause of its electrical activity. I also explained how parts of the brain correspond to different functions. Next, I discussed how the brainwave spectrum can be separated into frequency bands and the states of mind that they represent.
          Following this, I moved on to introduce the EEG recording devices I brought with me. I had a Muse Headband and an Emotiv Insight, which I used to contrast consumer-level and research-level EEG devices. With that context, I used the Emotiv Insight and the Emotiv BCI application to demonstrate "mind control," i.e., training a model to push an imaginary box using brainwaves only.
          My first attempt at a demo was unsuccessful due to an error in Emotiv BCI. As a result, I pushed the demonstrations to the end of the talk and discussed my work using EEG and machine learning. Here, I introduced the research group and the problem we were trying to address, i.e., early diagnosis of neurodevelopmental conditions via EEG. Next, I presented the specific neurodevelopmental condition we explored, which is Autism. I explained how we came up with the machine learning models and the rationale behind using them. I kept the fine details to a minimum and provided a high-level overview of each model along with their evaluation results. At this point, one student raised a question about why the specific activation functions were selected. It was a good entry point to discuss what different activation functions look like and what purpose each serves. I also gave visual examples of how overfitting can affect the performance of a model.
          By the end of the session, I succeeded in performing a working demonstration of the "mind control" task using the Emotiv Insight and Emotiv BCI. I trained a model with two states, Neutral and Push, where Neutral is the idle state and Push is the state in which I'm attempting to push the virtual box. I was able to perform this successfully. I even had a volunteer who put on the EEG cap and tried the same. Overall, this got the crowd intrigued about EEG and its potential.
          I would like to thank my advisor Dr. Sampath Jayarathna, and Ajay Gupta, for providing me the opportunity to present our research at this workshop.

          2019-09-09: Introducing sumgram, a tool for generating the most frequent conjoined ngrams

          Comparison of the top 20 bigrams (first column), top 20 six-grams (second column), and top 20 sumgrams (conjoined ngrams, third column) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak. Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams; sumgram strives to remedy this. Generating six-grams surfaces non-salient six-grams. Click image to expand.

          A Web archive collection consists of groups of webpages that share a common topic, e.g., "Ebola virus" or "Hurricane Harvey." One of the most common tasks involved in understanding the "aboutness" of a collection is generating the top k (e.g., k = 20) ngrams. For example, given a collection about the Ebola Virus, we could generate the top 20 bigrams as presented in Fig. 1. This simple operation of calculating the most frequent bigrams unveils useful bigrams that help us understand the focus of the collection and may be used as a summary for the collection. For example, the most frequent bigram, "ebola virus," validates our prior knowledge about the collection topic, and the second bigram, "west africa," provides geographical information related to the disease.

          Closer inspection, however, exposes a serious defect of the bigram summary: the splitting of multi-word proper nouns. Since the bigrams method splits terms into word pairs, all proper nouns with more than two words are split. For example, the trigram "world health organization" was split into two bigrams ("world health" - rank 7 and "health organization" - rank 10). Also, the six-gram "centers for disease control and prevention" was split into three bigrams: "disease control" - rank 9, "centers disease" - rank 12, and "control prevention" - rank 13. The splitting of multi-word proper nouns was easy to detect in this example because I am familiar with the Ebola Virus subject, but the split might go unnoticed if I were presented with bigrams for a collection I am not familiar with. The need to avoid splitting, or to fix the splitting, of multi-word proper nouns motivated the development of sumgram.

          Ngrams vs. Sumgrams
          To estimate the aboutness of a collection we could generate bigrams (n=2) or trigrams (n=3). Irrespective of the value of n, we only generate terms of the same ngram class. For example, in Fig. 1, bigrams were generated, so it is not possible to find a trigram term in the list. Fixing split multi-word proper nouns involves "gluing" together ngrams. For example, if we sum the bigrams "world health" and "health organization," the result is the trigram "world health organization." This means that in our effort to conjoin split bigrams we could end up with a summary that consists of bigrams, trigrams, four-grams, five-grams, six-grams, etc. (Fig. 1, column 1). A higher-order ngram (e.g., "world health organization") generated by conjoining lower-order ngrams (e.g., "world health" and "health organization") is what we call a sumgram. In other words, sumgrams are formed by conjoining multiple lower-order ngrams. This means a collection summarized with sumgrams could include multiple different ngram classes (bigrams, trigrams, four-grams, five-grams, etc.). For example, as seen in Fig. 1, column 1, the first term ("ebola virus") is a bigram, the second term ("in west africa") a trigram, the sixth a four-gram, and the eighth a six-gram.
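          To illustrate just the gluing step (a toy example of the idea, not sumgram's pos_glue_split_ngrams or mvg_window_glue_split_ngrams algorithms), two overlapping lower-order ngrams can be conjoined into a higher-order ngram as follows:

          def glue(ngram_a, ngram_b):
              """Conjoin two overlapping ngrams, e.g.,
              ('world', 'health') + ('health', 'organization') -> ('world', 'health', 'organization')"""
              a, b = list(ngram_a), list(ngram_b)
              # Find the longest suffix of a that is also a prefix of b
              for k in range(min(len(a), len(b)), 0, -1):
                  if a[-k:] == b[:k]:
                      return tuple(a + b[k:])
              return None  # no overlap, nothing to glue

          print(glue(('world', 'health'), ('health', 'organization')))
          # ('world', 'health', 'organization')
          print(glue(('centers', 'disease'), ('disease', 'control')))
          # ('centers', 'disease', 'control')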


          Performance consideration and Named Entity Recognition
          Initially, we considered applying Named Entity Recognition (NER) as a means to identify and avoid splitting multi-word proper nouns. With an NER system, one could easily label a text collection with entity labels (e.g., PERSON, LOCATION, and ORGANIZATION) and instruct the ngram generator to avoid splitting ngrams that carry those labels, as a means to remedy the split ngrams problem. However, we decided not to apply NER to resolve split ngrams because NER would impose additional performance overhead upon sumgram. It was important to keep sumgram as lightweight as possible without compromising the quality of results. There are some phrases, such as "direct contact with" and "health care workers," that sumgram can generate but NER cannot. However, unlike sumgram, NER provides the benefit of labeling ngrams (e.g., "CDC" - Organization), although at additional performance cost.


          The Sumgram Tool
          We created a Python tool, sumgram, that implements two novel algorithms (pos_glue_split_ngrams and mvg_window_glue_split_ngrams) responsible for gluing split multi-word proper nouns. In addition to sumgram, we also released NwalaTextUtils, a collection of functions for processing text, such as:

          1. derefURI(): Dereference URIs, returning HTML
          2. cleanHtml(): Remove boilerplate from HTML, returning plaintext
          3. getPgTitleFrmHTML(): Return text from within HTML title tag
          4. parallelGetTxtFrmURIs(): Dereference and remove boilerplate from URIs in parallel
          5. parallelGetTxtFrmFiles(): Return text from files (remove boilerplate from HTML) in parallel
          6. parallelTask(): Generic function for parallelizing other functions

          The GitHub documentation explains how to install and use sumgram, as well as how to use sumgram together with NwalaTextUtils.parallelGetTxtFrmURIs().
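          Below is a rough usage sketch based on my reading of the GitHub READMEs; the module paths, function names (get_top_sumgrams, parallelGetTxtFrmURIs), parameter keys, and result fields are assumptions drawn from that documentation and may differ in the current releases:

          # pip install sumgram NwalaTextUtils
          from sumgram.sumgram import get_top_sumgrams                 # assumed module path
          from NwalaTextUtils.textutils import parallelGetTxtFrmURIs   # assumed module path

          # Example URIs for a small Ebola-themed collection
          uris = [
              'https://www.cdc.gov/vhf/ebola/',
              'https://www.who.int/health-topics/ebola',
          ]

          # Dereference the URIs and strip boilerplate in parallel
          doc_lst = parallelGetTxtFrmURIs(uris)

          # Generate the top 10 sumgrams, starting from bigrams (n=2)
          params = {'top_sumgram_count': 10}
          report = get_top_sumgrams(doc_lst, n=2, params=params)

          for sgrm in report.get('top_sumgrams', []):   # assumed result structure
              print(sgrm.get('ngram'), sgrm.get('term_freq'))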

          I would like to express my gratitude to Shawn Jones for his feedback, pull requests, and advice on improving sumgram and NwalaTextUtils. Additionally, I would like to thank Sawood Alam for reviewing sumgram. We encourage you to use sumgram and we welcome all feedback.

          -- Alexander C. Nwala (@acnwala)

          2019-09-09: Information Reuse & Integration for Data Science (IRI) 2019 Trip Report

          The 20th IEEE Information Reuse and Integration for Data Science (IRI) 2019 conference was held in Los Angeles, CA this year. Given the emerging global information-centric IT landscape, which has tremendous social and economic implications, effectively processing and integrating humongous volumes of information from diverse sources to enable effective decision making and knowledge generation has become one of the most significant challenges of current times. Information Reuse and Integration for Data Science (IRI) seeks to maximize the reuse of information by creating simple, rich, and reusable knowledge representations and consequently explores strategies for integrating this knowledge into systems and applications. The IEEE IRI conference serves as a forum for researchers and practitioners from academia, industry, and government to present, discuss, and exchange ideas that address real-world problems with real-world solutions. Theoretical and applied papers are both included. The conference program includes special sessions, open forum workshops, panels, and keynote speeches.

          Day 1 (July 30)
          This year the conference had 69 submissions and accepted only 16 (23%) as regular papers.

          Keynote 1
          The first day of the conference started with a keynote by Dr. Huan Liu, Professor, School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Arizona, USA, titled "Some New Data Science Challenges for Data Science."
          Professor Liu brought a fresh perspective to the idea of paper acceptance with three key ideas.
          We had a nice chat at the end. He was the adviser of my good friend Xia "Ben" Hu at Texas A&M, the creator of the popular AutoKeras framework.
          In the first session of the morning, Machine Learning and AI, Rafael Almeida presented "Combining Data Mining Techniques for Evolutionary Analysis of Programming Languages." His work examines whether changes to a programming language made a positive or negative impact on its community.
          Next, Dr. Khoshgoftaar from Florida Atlantic University presented an important case study in the area of Medicare fraud detection.  He said he met multiple federal government officials in Washington, DC during the last couple of months.
          For the final paper of the morning session, Brian Blake presented his group's work on crowd-sourced sentiment for decision making. He talked about subjective influence on each category as well as correlation and significance, based on data sets from a presidential debate, election night, and a post-election protest.
          Keynote 2
          The second keynote speaker for day 1, Dr. Matthew C. Stafford, Chief Learning Officer, US Air Force's Air Education and Training Command, Joint Base San Antonio, Texas, USA, spoke about "Chameleons" – Actors Who Can "Play Any Part": Your Data Can Have a Starring Role, too!
          The key insights of Dr. Stafford's talk centered on mistrust in machines: when do you trust them and when don't you? Especially in machine learning, there is a conflict, and the human element has to be put into the data. Data scientists often do not understand the data and the human side of the knowledge; they need to bring both the machine learning side and the human side to the table.

          The evening session opened the track on Novel Data Mining and Machine Learning Applications. The first talk was on GO-FOR: A Goal-Oriented Framework for Ontology Reuse by Cássio Reginato.

          The first day's sessions concluded with a panel on Expanding the Impact of Data Science on the Theory of Intelligence and Its Applications, chaired by IRI 2019 general chair Stuart H. Rubin. The panelists were Chengcui Zhang, University of Alabama at Birmingham, USA; Taghi Khoshgoftaar, Florida Atlantic University; and Matthew C. Stafford, Chief Learning Officer, US Air Force's Air Education and Training Command, Joint Base San Antonio, Texas, USA.
          Day 2 (July 31)

          Despite a few morning clouds, it turned out to be another beautiful day in southern California.  After attendees arrived and got their morning fill of coffee and pastries, they once again made their way into the main conference room.

          Keynote 3
          The second day of the conference started with the final keynote address by Aidong Zhang, William Wulf Faculty Fellow and Professor of Computer Science at the University of Virginia (UVA), titled "Graph Neural Networks for Supporting Knowledge-Data Integrated Machine Learning." Many were surprised at the findings presented (use of a small sample to train a deep learning model) and wanted more information regarding the research being performed by Dr. Zhang's group at UVA.

          Soon after the keynote address, participants dispersed, making their way to different rooms for one of three breakout sessions: visual analytics, biomedical applications, and big data applications. ODU's own Dr. Sampath Jayarathna (@OpenMaze) started things off in the biomedical applications session by presenting the work nominated for best student paper, "Analysis of Temporal Relationships between ASD and Brain Activity through EEG and Machine Learning."
          The next presentation was on how deep multi-task learning for interpretable glaucoma detection could overcome some of the challenges faced in the field. After Nooshin Moab finished addressing questions following her presentation, most conference participants parted ways to find lunch.
          After the break, an overview of MIT's Lottery Ticket hypothesis for training neural networks was discussed. Afterwards, we remained in the main conference room for the Novel Data Mining and Machine Learning Applications II breakout session. The session began with a presentation of machine learning models for airfare prediction. Dr. Jayarathna (@OpenMaze) then returned to discuss "Eye Tracking Area of Interest (AOI) in the Context of Working Memory Capacity Tasks." The discussion focused on how proper utilization of the AOI could allow the capture of eye gaze metrics that could predict Attention-Deficit/Hyperactivity Disorder (ADHD) in humans.

          Next, Mohammed Kuko presented a review of his machine learning method for single and clustered cervical cell classification in support of an automated Pap smear screening system. Bathsheba Farrow finished that afternoon's breakout session with a discussion of the different techniques used for post-traumatic stress disorder (PTSD) detection, which marked the path of her future research.
          There was not a heavy concentration of posters during the poster session. There was an interesting poster presentation on graph visualizations of Asian music and another on hand gesture recognition with convolutional neural networks. Many attendees took a few minutes to take in the poster presentations and then found an opportunity to network with the other presenters in the hotel foyer.

          The second day of the conference ended with a hotel banquet for the conference goers. Dinner and dessert were followed by an awards ceremony. Several conference organizers received awards for their work on various committees. There was definitely excitement in the air when it was announced that the Best Student Paper award would go to our own Yasith Jayawardana (@yasithmilinda), Dr. Mark Jaime, and Dr. Sampath Jayarathna (@OpenMaze) for their work on the "Analysis of Temporal Relationships between ASD and Brain Activity through EEG and Machine Learning."
          The best paper award went to "Data Clustering using Online Variational Learning of Finite Scaled Dirichlet Mixture Models" by Hieu Nguyen, Meeta Kalra, Muhammad Azam, and Nizar Bouguila.

          Other Resources
          Check out the pictures of the event available at:

          -- Bathsheba Nelson (@sheissheba) and Sampath Jayarathna (@openmaze)

          2019-09-10: Twitter Follower Growth for the 2020 Democratic Candidates

          Figure 1: Popularity measure of the candidates labeled on the basis of their Twitter follower growth in absolute number and percentage
          There are more than 20 candidates running for the 2020 Democratic Party presidential nomination, but everyone knows there will be only one winner. Since only a handful of the candidates have a real shot at receiving the nomination, the question arises: "why are so many candidates running for their party's nomination?" One answer is that running for the nomination increases a candidate's national media coverage, and the resulting popularity creates a launchpad for their future endeavors. This is clearly evident in the case of candidates like Pete Buttigieg and Andrew Yang, both of whom have enjoyed increased national exposure regardless of the outcome of the primaries. Since "popularity" is hard to define and quantify, we use the Twitter followers of each candidate as a proxy for their popularity. The absolute and relative increase in Twitter followers since January 1, 2019 can then indicate whether the candidates' efforts have been worthwhile in increasing the size of their audience.

          Previous Work on Twitter Follower Growth


          FiveThirtyEight published their two-article series about the NBC and CNN Democratic debates where they analyzed each candidate based on five criteria. One of the criteria was the Twitter follower growth chart from the night of the debate to the following afternoon.
          Figure 2: Twitter follower growth for Democratic candidates after the NBC debate between the night of the debate and the following afternoon. 
          Source: https://fivethirtyeight.com/features/the-first-democratic-debate-in-five-charts/
          Figure 3: Twitter follower growth for Democratic candidates after the CNN debate between the night of the debate and the following afternoon. 
          Source: https://fivethirtyeight.com/features/the-second-democratic-debate-in-5-charts/
          Although FiveThirtyEight captured the immediate effect of an event using Twitter follower growth, other major events, such as campaign announcements and TV appearances discussing their candidacy, also affect a candidate's overall Twitter follower growth.
          For a better understanding of Twitter follower growth as a measure of a candidate's popularity, we need to include all the events that have happened in a candidate's campaign to date, which requires us to study their historical Twitter information. Miranda Smith, in her post "Twitter Follower Count History via the Internet Archive," explains the reasons for using web archives over the Twitter API for finding historical Twitter follower counts. In this post, we will rework the analysis done by Miranda Smith on the Democratic candidates, using the web archives to gather their historical information and present a broader view of the Twitter follower growth of each candidate in 2019.

          How we built our data set


          We used the 2020 Democratic Party presidential primaries Wikipedia webpage on August 24, 2019 to create our baseline of candidates. On that date, we found 21 candidates still running for the nomination and five candidates who had already withdrawn. With 24 of the 26 candidates announcing their candidacy in 2019, we limited our study to the period between January 1, 2019 and August 23, 2019. We used the same Twitter handles as the FiveThirtyEight article for the 21 candidates mentioned there, and for the remaining candidates we used the Twitter accounts that mentioned their 2020 presidential candidacy in their Twitter bio.
          We collected mementos from multiple web archives using MemGator for all the Twitter handles between January 1, 2019 and August 23, 2019, and retrieved the follower count from each memento to build our data set of historical follower count information.
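          The sketch below outlines this collection step; it is illustrative only: the MemGator TimeMap route shown is an assumption about the public ODU instance, and the follower-count extraction is a simplified regex over archived Twitter profile HTML rather than the exact parsing we used.

          import re
          import requests

          MEMGATOR = 'https://memgator.cs.odu.edu/timemap/link/'  # assumed MemGator instance and route

          def memento_uris(uri_r):
              """Return the URI-Ms listed in the aggregated link-format TimeMap for uri_r."""
              tm = requests.get(MEMGATOR + uri_r).text
              return [line.split(';')[0].strip('<> ')
                      for line in tm.splitlines() if 'rel="memento"' in line]

          def follower_count(uri_m):
              """Very rough extraction of a follower count from an archived Twitter profile page."""
              html = requests.get(uri_m).text
              m = re.search(r'([\d,\.]+[KM]?)\s*Followers', html, re.IGNORECASE)
              return m.group(1) if m else None

          for uri_m in memento_uris('https://twitter.com/petebuttigieg')[:5]:
              print(uri_m, follower_count(uri_m))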

          Results   


          Table 1 is a sortable table which contains the follower count growth in absolute numbers and in percentage, start follower count, end follower count, memento date range, and Twitter handle for all the candidates. Memento date range for a candidate represents the start and the end time of all the mementos collected from the web archives for their Twitter account. Increase represents the increase in absolute number of Twitter followers for a candidate from the first memento to the last memento. Increase% represents the percentage increase in the follower count for a candidate with respect to their first memento follower count.

             
          There are two possible reasons why not all candidates have mementos in the web archives for their Twitter handles on or near 2019-01-01:
          • Archiving rate correlates with popularity, so candidates who were less popular are likely to have a lower archival rate. Pete Buttigieg's Twitter account, @petebuttigieg, has 329 mementos in 2019, in contrast with the 70 mementos between 2012 and 2018. The high archival rate of his Twitter account can be attributed to the meteoric rise in his Twitter followers by 1.3M. On the other hand, although Marianne Williamson has 2.6M Twitter followers, she was archived only three times by the Internet Archive between January and March 2019. So, the correlation between archival rate and popularity might not hold in all situations. 
          • The web archives index their mementos by their URLs (URI-Rs). A change in the Twitter handle creates a new URL for the same web page. In order to fetch all the mementos for a Twitter account, we need to query the web archives with both URLs, the previous and the current Twitter handle URLs. Andrew Yang changed his Twitter handle from @andrewyangvfa to @andrewyang, which changed the URL of his Twitter account. Therefore, the first memento for @andrewyang in the Internet Archive is from March 21, 2019. John Hickenlooper (@hickforco to @hickenlooper), John Delaney (@JDelaneyforMD to @JohnKDelaney to @johndelaney), and Michael Bennet (@BennetForCO to @MichaelBennet) have also changed their Twitter handles to reflect their shift from state to national focus.

          Table 1: List of all the Democratic Candidates with their follower count increase in 2019.
          "1" represents the first Democratic debates held on the NBC. "2" represents the second Democratic debate held on the CNN. "3" represents the third Democratic debate to be held on 2019-09-12 on ABC. "D" represents the candidates who have dropped out as on 2019-08-24.
          Name | Debates | Twitter Handle | Memento Date Range | Start Follower Count | End Follower Count | Increase | Increase%
          Michael Bennet | 1, 2 | @michaelbennet | 18 May - 23 Aug | 21,210 | 38,267 | 17,057 | 80.42
          Joe Biden | 1, 2, 3 | @joebiden | 01 Jan - 23 Aug | 3,175,558 | 3,690,554 | 514,996 | 16.22
          Cory Booker | 1, 2, 3 | @corybooker | 03 Jan - 23 Aug | 4,083,584 | 4,341,153 | 257,569 | 6.31
          Pete Buttigieg | 1, 2, 3 | @petebuttigieg | 24 Jan - 23 Aug | 94,963 | 1,383,508 | 1,288,545 | 1356.90
          Julian Castro | 1, 2, 3 | @juliancastro | 02 Jan - 23 Aug | 136,275 | 373,432 | 237,157 | 174.03
          Bill de Blasio | 1, 2 | @billdeblasio | 10 Jan - 23 Aug | 139,293 | 168,092 | 28,799 | 20.68
          John Delaney | 1, 2 | @johndelaney | 16 Apr - 23 Aug | 19,201 | 35,595 | 16,394 | 85.38
          Steve Bullock | 2 | @governorbullock | 02 Jan - 20 Aug | 166,137 | 184,448 | 18,311 | 11.02
          Tulsi Gabbard | 1, 2 | @tulsigabbard | 01 Feb - 23 Aug | 216,704 | 537,745 | 321,041 | 148.15
          Kamala Harris | 1, 2, 3 | @kamalaharris | 01 Jan - 23 Aug | 1,990,349 | 3,071,524 | 1,081,175 | 54.32
          Amy Klobuchar | 1, 2, 3 | @amyklobuchar | 24 Jan - 23 Aug | 566,492 | 751,017 | 184,525 | 32.57
          Beto O'Rourke | 1, 2, 3 | @betoorourke | 01 Jan - 23 Aug | 1,111,690 | 1,553,346 | 441,656 | 39.73
          Tim Ryan | 1, 2 | @timryan | 16 Apr - 20 Aug | 18,116 | 36,434 | 18,318 | 101.12
          Bernie Sanders | 1, 2, 3 | @berniesanders | 01 Jan - 23 Aug | 8,943,122 | 9,580,209 | 637,087 | 7.12
          Elizabeth Warren | 1, 2, 3 | @ewarren | 03 Jan - 23 Aug | 2,172,769 | 3,041,438 | 868,669 | 39.98
          Marianne Williamson | 1, 2 | @marwilliamson | 05 Mar - 20 Aug | 2,602,291 | 2,758,962 | 156,671 | 6.02
          Andrew Yang | 1, 2, 3 | @andrewyang | 27 Mar - 23 Aug | 200,361 | 713,462 | 513,101 | 277.02
          Kirsten Gillibrand | 1, 2 | @sengillibrand | 02 Jan - 23 Aug | 1,297,306 | 1,461,659 | 164,353 | 12.67
          Joe Sestak | - | @joesestak | 25 Jun - 20 Aug | 10,715 | 12,346 | 1,631 | 15.22
          Wayne Messam | - | @waynemessam | 16 Apr - 21 Aug | 5,834 | 8,415 | 2,581 | 44.24
          Eric Swalwell | 1, D | @ericswalwell | 01 Jan - 20 Aug | 23,873 | 107,484 | 83,611 | 350.23
          John Hickenlooper | 1, 2, D | @hickenlooper | 13 Mar - 23 Aug | 135,774 | 159,947 | 24,172 | 17.80
          Jay Inslee | 1, 2, D | @jayinslee | 01 Jan - 20 Aug | 30,614 | 105,935 | 75,321 | 246.03
          Tom Steyer | - | @tomsteyer | 01 Jan - 20 Aug | 211,202 | 244,384 | 33,182 | 15.71
          Mike Gravel | - | @mikegravel | 04 Apr - 21 Aug | 40,497 | 131,905 | 91,408 | 225.72
          Seth Moulton | D | @sethmoulton | 06 Feb - 20 Aug | 135,481 | 147,289 | 11,808 | 8.72





          Table 2: All candidates categorized by their increase in follower count (absolute number and percentage)

          Already Popular (high absolute increase, low percentage increase):
          @marwilliamson (155K, 6%), @sengillibrand (165K, 13%), @amyklobuchar (185K, 32%), @corybooker (260K, 6%), @betoorourke (440K, 40%), @joebiden (515K, 16%), @berniesanders (640K, 7%), @ewarren (870K, 40%), @kamalaharris (1.1M, 55%)

          Big Winners (high absolute increase, high percentage increase):
          @juliancastro (240K, 175%), @tulsigabbard (320K, 150%), @andrewyang (515K, 275%), @petebuttigieg (1.3M, 1350%)

          Nobody Noticed (low absolute increase, low percentage increase):
          @joesestak (1.5K, 15%), @waynemessam (2.5K, 44%), @sethmoulton (11K, 9%), @johndelaney (16K, 85%), @michaelbennet (17K, 80%), @governorbullock (18K, 11%), @hickenlooper (24K, 18%), @billdeblasio (29K, 21%), @tomsteyer (33K, 16%)

          Beneficial (low absolute increase, high percentage increase):
          @timryan (18K, 101%), @jayinslee (75K, 245%), @ericswalwell (84K, 350%), @mikegravel (90K, 225%)

          Figure 4: @billdeblasio added 29K followers with a growth rate of 21% and is an example from the "Nobody Noticed" category

          Figure 5: @berniesanders added 640K followers with a growth rate of 7% and is an example from the "Already Popular" category
          Figure 6: @ericswalwell added 84K followers with a growth rate of 350% and is an example from the "Beneficial" category
          Figure 7: @petebuttigieg added 1.3M followers with a growth rate of 1350% and is an example from the "Big Winners" category

          Some observations:
          • Joe Biden and Bernie Sanders already have a large number of Twitter followers, so even though the absolute size of their increase is large, their relative increase is small. This matches our intuition of them both being nationally recognized names for whom their candidacies are not about setting up their "next move".
          • Since January 2019, Pete Buttigieg's Twitter account has witnessed an increase in follower count of 1.3M, and Andrew Yang's has witnessed a rise of 515K. The meteoric rise in their Twitter followers is congruent with both candidates becoming nationally recognized names within a span of months.
          • The top three candidates with the lowest Twitter follower growth (@JoeSestak, @WayneMessam, and @SethMoulton) have not appeared in any debate.
          • Except for Representative Tulsi Gabbard, the other ten of the top 11 candidates with the highest growth in Twitter follower count will be appearing in the third Democratic debate on September 12, 2019. This matches our intuition of using Twitter followers as a proxy for measuring the popularity of candidates.
          We analyzed the follower counts for 26 Democratic candidates and used their Twitter follower growth as a proxy to measure their popularity. We categorized four Twitter handles as big winners, four as beneficial, nine as already popular, and nine as nobody noticed, based on threshold values of 100K absolute growth and 100% percentage growth in Twitter followers. The ten candidates who will be appearing in the third debate are among the top 11 highest gainers of Twitter followers, matching our intuition of using Twitter followers as a proxy to measure the popularity of candidates.
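          The quadrant assignment itself is just two threshold tests; below is a small sketch of how a candidate's row could be labeled using the 100K and 100% cutoffs above (the function name and boundary handling are my own illustration):

          def categorize(increase, increase_pct, abs_cut=100_000, pct_cut=100.0):
              """Map a candidate's absolute and percentage follower growth to a Table 2 quadrant."""
              if increase >= abs_cut:
                  return 'Big Winners' if increase_pct >= pct_cut else 'Already Popular'
              return 'Beneficial' if increase_pct >= pct_cut else 'Nobody Noticed'

          print(categorize(1_288_545, 1356.90))  # Pete Buttigieg -> Big Winners
          print(categorize(637_087, 7.12))       # Bernie Sanders -> Already Popular
          print(categorize(83_611, 350.23))      # Eric Swalwell  -> Beneficial
          print(categorize(28_799, 20.68))       # Bill de Blasio -> Nobody Noticed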
            --------
            Mohammed Nauman Siddique
            (@m_nsiddique)

            2019-09-10: Where did the archive go? Part 2: National Library of Ireland


            In the previous post, we provided some details about changes in the Library and Archives Canada web archive. After they upgraded their web archive replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In part 2 of this four-part series, we focus on the movement of a collection from the National Library of Ireland (NLI).


            In May 2018, we discovered that 979 mementos from the NLI collection that were originally archived at the European Archive (europarchive.org) had been moved to the Internet Memory Foundation archive (internetmemory.org). Then in September 2018, we found that the collection of mementos had been moved to Archive-It (archive-it.org). We found that 192 mementos, out of 979, cannot be found in Archive-It (i.e., missing mementos).

            For example, a memento of http://www.defense.gov/ (Memento-Datetime 2014-10-13 20:41:17 GMT) originally archived at the European Archive has been moved to the Internet Memory Foundation (IMF) archive at:

            http://collections.internetmemory.org/nli/20141013204117/http://www.defense.gov/

            before it ended up at Archive-it:

            http://wayback.archive-it.org/10702/20141013204117/http://www.defense.gov/

            The representations of the three mementos are illustrated in the figure below.



            There were no changes in the 979 mementos (other than their URIs) when they moved from the European Archive to the IMF archive (even the archival banner remained the same, as the figure above shows), but we found some significant changes upon the move from IMF to Archive-It, which we will focus on in this post.

            We refer to the archive from which mementos were moved (i.e., internetmemory.org) as the "original archive", and we use the "new archive" to refer to the archive to which the mementos were moved (i.e., archive-it.org). A memento is identified by a URI-M as defined in the Memento framework.

            Our observations about changes in the NLI collection (from IMF to Archive-It) are as follows:

            Observation 1: The functionality of the original archival banner is gone

            Users of the European Archive and IMF were able to navigate through available mementos via the custom archival banner (marked in red in the top two screenshots in the figure above). Via this banner, the original archive allowed users to view the available mementos and the representation of a selected memento on the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This new archive's banner informs users that they are viewing an "archived" web page. The banner also contains multiple links; one of them will take you to a web page on archive-it.org that shows all available mementos in the archive, as shown in the figure below:





            Observation 2: The original archive is no longer reachable

            After mementos were moved from internetmemory.org, the archive became unreachable as the following cURL session shows:

            $ date
            Tue May 21 08:03:51 EDT 2019

            $ curl http://www.internetmemory.org
            curl: (7) Failed to connect to www.internetmemory.org port 80: Operation timed out



            In addition to IMF, the European Archive (europarchive.org) is also no longer maintained; it was shut down, and the domain name was purchased by another entity and now serves spam. 



            The movement of mementos from these two archives will affect link integrity across web resources that contain links to mementos from the European Archive or IMF. As mentioned in the previous post, there are actions that original archives can perform to maintain link integrity via "follow-your-nose" from the old URI-Ms to the corresponding URI-Ms in the new archive. 


            For example, the archive Library and Archives Canada changed its domain name from collectionscanada.gc.ca to webarchive.bac-lac.gc.ca (described in part 1), and because the archive still controls the original domain name collectionscanada.gc.ca, it could (even though it currently does not) redirect requests for URI-Ms in collectionscanada.gc.ca to the new archive webarchive.bac-lac.gc.ca. For instance, if the original archive uses the Apache web server, a mod_rewrite rule can be used to perform automatic redirects:

            # With mod_rewrite
            RewriteEngine on
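            # Capture the 14-digit datetime ($1) and the original URI-R ($2), then issue a 301 redirect to the new archive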
            RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]

            But these practices become impractical in the case of the European Archive and IMF because:

            • The archives no longer exist (e.g., the European Archive and IMF),  so there is not a maintained web server available to issue the redirects. 
            • Even if it still existed, the archive might decide to not issue redirects for former customers in order to increase lock-in.
            In the upcoming post, we will describe the movement of mementos from the Public Record Office of Northern Ireland (PRONI) web archive to Archive-It. The PRONI organization still controls and maintains the original domain name webarchive.proni.gov.uk, so it is possible for PRONI to issue redirects to the new URI-Ms in Archive-It.

            Observation 3: Not all mementos are available in the new archive

            As defined in the previous post, a missing memento occurs when the values of the Memento-Datetime, the URI-R, and the final HTTP status code of the memento from the original archive are not identical to the values of the corresponding memento from the new archive. In this study, we found 192 missing mementos (out of 979) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. We give two examples of missing mementos. The first example shows a memento that cannot be found in the new archive with the same Memento-Datetime as in the original archive. When requesting the URI-M:

            http://collections.internetmemory.org/nli/20121221162201/http://bbc.co.uk/news/

            from the original archive (internetmemory.org) on September 03, 2018, the archive responded with "200 OK" and returned the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Fri, 21 Dec 2012 16:22:01 GMT. Then, we requested the corresponding URI-M:

            http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/


            from the new archive (archive-it.org). As shown in the cURL session below, the request was redirected to another URI-M:

            http://wayback.archive-it.org/10702/20121221163248/http://www.bbc.co.uk/news/

            As shown in the figure below, the representations of both mementos are identical (except for the archival banners). However, we consider the memento from the original archive missing because the two mementos have different Memento-Datetime values (i.e., Fri, 21 Dec 2012 16:32:48 GMT in the new archive), a delta of about 10 minutes. Even though the 10-minute delta might not be semantically significant (apparently just a change in the canonicalization of the URI-R, with bbc.co.uk redirecting to www.bbc.co.uk), we do not consider the mementos to be the same since their Memento-Datetime values are not identical.

            $ curl --head --location --silent http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

            HTTP/1.1 302 Found
            Location: /10702/20121221163248/http://www.bbc.co.uk/news/
            HTTP/1.1 200 OK
            Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
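
            More generally, this is the comparison we apply to every memento pair: derive the corresponding Archive-It URI-M from the original URI-M, follow redirects at both archives, and compare the final Memento-Datetime, URI-R, and HTTP status code. The sketch below is a simplified illustration of that logic, not the exact scripts used in this study; note also that the original archive is no longer reachable, so the first request only reflects what we recorded in September 2018:

            # Simplified sketch of the missing-memento check (not the exact study
            # code). The URI-M prefix mapping is taken from the examples above.
            import re
            import requests

            OLD_PREFIX = "http://collections.internetmemory.org/nli/"
            NEW_PREFIX = "http://wayback.archive-it.org/10702/"

            def final_memento(urim):
                """Follow redirects; return the final Memento-Datetime, URI-R, and status code."""
                resp = requests.head(urim, allow_redirects=True, timeout=30)
                # The URI-R is everything after the 14-digit timestamp in the final URI-M.
                match = re.search(r"/\d{14}/(.+)$", resp.url)
                urir = match.group(1) if match else None
                return resp.headers.get("Memento-Datetime"), urir, resp.status_code

            old_urim = OLD_PREFIX + "20121221162201/http://bbc.co.uk/news/"
            new_urim = old_urim.replace(OLD_PREFIX, NEW_PREFIX)

            # A memento is classified as missing if any of the three values differ.
            if final_memento(old_urim) != final_memento(new_urim):
                print("missing memento:", old_urim)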



            The second example shows a memento whose corresponding memento in the new archive has different Memento-Datetime and URI-R values. When requesting the memento:

            http://collections.internetmemory.org/nli/20121223122758/http://www.whitehouse.gov/

            on September 03, 2018, the original archive returned "200 OK" for an archival "403 Forbidden", as the WARC record below shows:


            WARC/1.0
            WARC-Type: response
            WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/
            WARC-Date: 2018-09-03T16:31:30Z
            WARC-Record-ID: <urn:uuid:d03e5020-af96-11e8-9d72-f10b53f82929>
            Content-Type: application/http; msgtype=response
            Content-Length: 1694

            HTTP/1.1 200 OK

            Date: Mon, 03 Sep 2018 16:31:19 GMT
            Server: Apache/2.4.10
            Age: 0
            Vary: Accept-Encoding
            Content-Type: text/html
            Via: 1.1 varnish-v4
            cache-control: max-age=86400
            X-Varnish: 28318986 28187250
            Memento-Datetime: Sun, 23 Dec 2012 12:27:58 GMT
            Connection: keep-alive
            Accept-Ranges: bytes
            Link: <http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/>; rel="memento"; datetime="Sun, 23 Dec 2012 12:27:58 GMT", <http://collections.internetmemory.org/nli/content/20110223072152/http://www.whitehouse.gov/>; rel="first memento"; datetime="Wed, 23 Feb 2011 07:21:52 GMT", <http://collections.internetmemory.org/nli/content/20180528183514/http://www.whitehouse.gov/>; rel="last memento"; datetime="Mon, 28 May 2018 18:35:14 GMT", <http://collections.internetmemory.org/nli/content/20121221220430/http://www.whitehouse.gov/>; rel="prev memento"; datetime="Fri, 21 Dec 2012 22:04:30 GMT", <http://collections.internetmemory.org/nli/content/20131208014833/http://www.whitehouse.gov/>; rel="next memento"; datetime="Sun, 08 Dec 2013 01:48:33 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.whitehouse.gov/>; rel="timegate", <http://www.whitehouse.gov/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.whitehouse.gov/>; rel="timemap"; type="application/link-format"
            Content-Length: 287


            <HTML><head>
            <title>[ARCHIVED CONTENT] Access Denied</title>
            </head><BODY>
            <H1>Access Denied</H1>
            Reference&#32;&#35;18&#46;d8407b5c&#46;1356265678&#46;2324d94
            </BODY>
            </HTML>


            You don't have permission to access "http&#58;&#47;&#47;wwws&#46;whitehouse&#46;gov&#47;" on this server.<P>

            When requesting the corresponding memento from archive-it.org,

            http://wayback.archive-it.org/10702/20121223122758/http://www.whitehouse.gov/

            the request was redirected to another URI-M:

            http://wayback.archive-it.org/10702/20121221222130/http://www.whitehouse.gov/administration/eop/nec/speeches/gene-sperling-remarks-economic-club-washington

            which is "200 OK". Notice that not only the values of the Memento-Datetime are different but also the URI-Rs. The representations of both mementos from the original and new archives are shown below:



            Observation 4: Both archives handle archival 4xx/5xx responses differently

            The replay tool in the original archive (internetmemory.org) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. 

            For example, when requesting the memento:

            http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/


            on September 03, 2018, the original archive returned "200 OK" for an archival "503 Service Unavailable", as the WARC record below shows:


            WARC/1.0
            WARC-Type: response
            WARC-Target-URI: http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/
            WARC-Date: 2018-09-03T16:46:51Z
            WARC-Record-ID: <urn:uuid:f4f2e910-af98-11e8-8de6-6f058c4e494a>
            Content-Type: application/http; msgtype=response
            Content-Length: 271841

            HTTP/1.1 200 OK
            Date: Mon, 03 Sep 2018 16:46:39 GMT
            Server: Apache/2.4.10
            Age: 0
            Vary: Accept-Encoding
            Content-Type: text/html; charset=utf-8
            Via: 1.1 varnish-v4
            cache-control: max-age=86400
            Transfer-Encoding: chunked
            X-Varnish: 28349831
            Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
            Connection: keep-alive
            Accept-Ranges: bytes
            ...

            Even the response for the inner iframe, in which the archived content is loaded, had the HTTP status code "200 OK":

            WARC/1.0
            WARC-Type: response
            WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/
            WARC-Date: 2018-09-03T16:46:51Z
            WARC-Record-ID: <urn:uuid:f500cbc0-af98-11e8-8de6-6f058c4e494a>
            Content-Type: application/http; msgtype=response
            Content-Length: 2642

            HTTP/1.1 200 OK
            Date: Mon, 03 Sep 2018 16:46:40 GMT
            Server: Apache/2.4.10
            Age: 0
            Vary: Accept-Encoding
            Content-Type: text/html
            Via: 1.1 varnish-v4
            cache-control: max-age=86400
            Transfer-Encoding: chunked
            X-Varnish: 27227468 27453379
            Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
            Connection: keep-alive
            Accept-Ranges: bytes
            Link: <http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", <http://collections.internetmemory.org/nli/content/20110221192317/http://www.amazon.com/>; rel="first memento"; datetime="Mon, 21 Feb 2011 19:23:17 GMT", <http://collections.internetmemory.org/nli/content/20180711130159/http://www.amazon.com/>; rel="last memento"; datetime="Wed, 11 Jul 2018 13:01:59 GMT", <http://collections.internetmemory.org/nli/content/20121016174254/http://www.amazon.com/>; rel="prev memento"; datetime="Tue, 16 Oct 2012 17:42:54 GMT", <http://collections.internetmemory.org/nli/content/20121025120853/http://www.amazon.com/>; rel="next memento"; datetime="Thu, 25 Oct 2012 12:08:53 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.amazon.com/>; rel="timegate", <http://www.amazon.com/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; rel="timemap"; type="application/link-format"

            <html>
            <head>
            <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
            <title>[ARCHIVED CONTENT] 500 Service Unavailable Error</title>
            </head>
            <body style="padding:1% 10%;font-family:Verdana,Arial,Helvetica,sans-serif">
              <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/"><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/img09/x-site/other/a_com_logo_200x56.gif" alt="Amazon.com" width="200" height="56" border="0"/></a>
              <table>
                <tr>
                  <td valign="top" style="width:52%;font-size:10pt"><br/><h2 style="color:#E47911">Oops!</h2><p>We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/">homepage</a>.</p></td>
                  <th><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/x-locale/common/errors-alerts/product-fan-500.jpg" alt="product collage"/></th>
                </tr>
              </table>
            </body>

            </html>

            When requesting the corresponding memento:

            http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

            Archive-It properly returned the status code 503 for the archival 503:

            $ curl -I http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

            HTTP/1.1 503 Service Unavailable
            Server: Apache-Coyote/1.1
            Content-Security-Policy-Report-Only: default-src 'self''unsafe-inline''unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
            Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
            Link: <http://www.amazon.com/>; rel="original", <https://wayback.archive-it.org/10702/timemap/link/http://www.amazon.com/>; rel="timemap"; 
            ...
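
            One way to detect this kind of status-code flattening is to look for replayed responses whose HTTP status is "200 OK" but whose payload is an archived error page. The sketch below illustrates such a heuristic over the WARC files we recorded, using the warcio library; the file name and the "[ARCHIVED CONTENT]" marker are assumptions based on the records shown above, not a general rule:

            # Heuristic sketch (not part of the archives' tooling): flag replayed
            # mementos whose HTTP status is 200 OK but whose payload looks like an
            # archived error page. "responses.warc.gz" is a hypothetical file name
            # for the WARCs we recorded while requesting the mementos.
            from warcio.archiveiterator import ArchiveIterator

            ERROR_MARKERS = (b"Service Unavailable", b"Access Denied", b"Not Found")

            with open("responses.warc.gz", "rb") as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type != "response":
                        continue
                    status = record.http_headers.get_statuscode()
                    body = record.content_stream().read()
                    if status == "200" and b"[ARCHIVED CONTENT]" in body \
                            and any(marker in body for marker in ERROR_MARKERS):
                        print("soft archival error:",
                              record.rec_headers.get_header("WARC-Target-URI"))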

            Observation 5: The HTTP status code may change in the new archive

            The HTTP status codes of URI-Ms in the new archive might not be identical to the HTTP status codes of the corresponding URI-Ms in the original archive. For example, an HTTP request for the URI-M:

            http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/


            to the original archive resulted in "200 OK", as the partial WARC record below shows:

            WARC/1.0
            WARC-Type: response
            WARC-Target-URI: http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/
            WARC-Date: 2018-09-03T08:40:43Z
            WARC-Record-ID: <urn:uuid:0b947600-af55-11e8-9b13-5bce71cafd38>
            Content-Type: application/http; msgtype=response
            Content-Length: 28447

            HTTP/1.1 200 OK
            Date: Mon, 03 Sep 2018 08:40:30 GMT
            Server: Apache/2.4.10
            Age: 0
            Vary: Accept-Encoding
            Content-Type: text/html; charset=utf-8
            Via: 1.1 varnish-v4
            cache-control: max-age=86400
            Transfer-Encoding: chunked
            X-Varnish: 27888670
            Memento-Datetime: Sun, 23 Dec 2012 03:18:37 GMT
            Connection: keep-alive
            Accept-Ranges: bytes
            Link: <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="first memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="last memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="prev memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="next memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/timegate/http://www2008.org/>; rel="timegate", <http://www2008.org/>; rel="original", <http://collections.internetmemory.org/nli/timemap/http://www2008.org/>; rel="timemap"; type="application/link-format"



            <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            ...

            The representation of the memento is illustrated below:


            In internetmemory.org

            The request to the corresponding URI-M:

            http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

            from Archive-It results in "404 Not Found", as the cURL session below shows:

            $ curl --head --silent http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

            HTTP/1.1 404 Not Found
            Server: Apache-Coyote/1.1
            Content-Security-Policy-Report-Only: default-src 'self''unsafe-inline''unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
            Content-Type: text/html;charset=utf-8
            Content-Length: 4902
            Date: Thu, 05 Sep 2019 08:28:27 GMT

            The list of all 979 URI-Ms is appended below. The file contains the following fields (a short sketch of how they can be summarized follows the list):
            • The URI-M from the original archive (original_URI-M).
            • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
            • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
            • The URI-M from the new archive (new_URI-M).
            • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
            • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
            • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
            • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
            • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
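
            As a quick illustration of how these fields can be used, the sketch below tallies the missing mementos. It assumes the data has been saved locally as a CSV file named nli_mementos.csv with the column names above; the file name and the value encodings are assumptions, not part of the released data:

            # Illustrative sketch: count missing mementos from the appended data.
            # Assumes a local CSV named "nli_mementos.csv" with the columns listed
            # above; the boolean encodings ("true"/"false") are assumptions.
            import csv

            missing = 0
            total = 0
            with open("nli_mementos.csv", newline="") as fh:
                for row in csv.DictReader(fh):
                    total += 1
                    # A memento is missing if the Memento-Datetimes differ (delta != 0),
                    # the final URI-Rs differ, or the final status codes differ.
                    if (float(row["delta"]) != 0
                            or row["same_final_URI-Rs"].strip().lower() != "true"
                            or row["same_final_URI-Ms_status_code"].strip().lower() != "true"):
                        missing += 1

            print(f"{missing} missing mementos out of {total}")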

            Conclusions

            We did not find any changes in the 979 mementos of the National Library of Ireland (NLI) collection when they were moved from europarchive.org to internetmemory.org in May 2018; both archives used the same replay tool and archival banners. The NLI collection was then moved to archive-it.org in September 2018. We found that 192 out of the 979 mementos resurfaced in archive-it.org with a change in the Memento-Datetime, the URI-R, or the final HTTP status code. We also found that some of the functionality available in the original archival banner is no longer present in the new archive, and that internetmemory.org and archive-it.org respond differently to requests for archival 4xx/5xx mementos. 

            In upcoming posts, we will provide more details about changes in the archives.
            --Mohamed Aturban





