2019-09-25: Where did the archive go? Part 3: Public Record Office of Northern Ireland

September 25, 2019, 10:13 am

≫ Next: 2019-10-01: Attending the Machine Learning + Libraries Summit at the Library of Congress

≪ Previous: 2019-09-10: Where did the archive go? Part 2: National Library of Ireland

In Where did the archive go? Part 1, we provided some details about changes in the archive Library and Archives Canada. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In Part 2, we focused on the movement of the National Library of Ireland (NLI). Mementos from NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 mementos, out of 979, can not be found in Archive-It.

In part 3 of this four part series, we focus on changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. In October 2018, mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 mementos, out of 469, can no longer be found in Archive-It (i.e., missing mementos).

We refer to the archive from which mementos have moved as the "original archive" (i.e., PRONI archive), and we use the "new archive" to refer to the archive to which the mementos have moved (i.e., Archive-It). A memento is identified by a URI-M as defined in the Memento framework.

We have several observations about the changes in the PRONI web archive:

Observation 1: The HTTP request to a URI-M in PRONI archive does not redirect to the corresponding URI-M in the new archive

As shown in the cURL session below, every request to a memento (URI-M) in the PRONI archive will return the HTTP status code "404 Not Found":

$ curl --head --location http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

HTTP/1.1 302 Found
Cache-Control: no-cache
Content-length: 0
Location: https://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
HTTP/2 404
date: Fri, 20 Sep 2019 08:13:45 GMT
server: Apache/2.4.18 (Ubuntu)
content-type: text/html; charset=iso-8859-1

PRONI did not leave a machine readable method of locating the new URI-Ms. However, we were able to manually discover the corresponding URI-Ms in Archive-It. For example, the memento:

http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

is now available at:

http://wayback.archive-it.org/11112/20100218151844/http://www.berr.gov.uk/

The representations of both mementos are illustrated in the figure below:

Unlike the European Archive and IMF, the Public Record Office of Northern Ireland (PRONI) still owns the domain name of the original archive, webarchive.proni.gov.uk. Therefore, to maintain link integrity via "follow-your-nose", PRONI could issue redirects (even though it currently does not) to the corresponding URI-Ms in Archive-It. For example, since PRONI uses the Apache web server, the mod_rewrite rule that could be used to perform automatic redirects is:

# With mod_rewrite
RewriteEngine on
RewriteRule "^/(\d{14})/(.+)" http://wayback.archive-it.org/11112/$1/$2 [L,R=301]

Observation 2: The functionality of the original archival banner is gone

Similar to the archival banners provided by the European Archive and IMF, users of the PRONI archive were able to navigate through available mementos via the custom archival banner (marked in red in the top screenshot in the figure above). Users are allowed to view the available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This banner does not have the same functionality compared to the original archive's banner. The new archive's banner contains information to inform users that they are viewing an "archived" web page and shows multiple links. One of the links will redirect to a web page that shows all available mementos in archive-it.org. For example, the figure below shows the available mementos for the web page http://www.berr.gov.uk/:

Observation 3: Not all mementos are available in the new archive

We define a memento "missing" if the values of the Memento-Datetime, the URI-R, and the final HTTP status code of a memento from the original archive are not identical to the values of a corresponding memento in the new archive. In this study, we found 114 missing mementos (out of 469) that can not be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. For example, when requesting the URI-M:

http://webarchive.proni.gov.uk/20160901021637/https://www.flickr.com/

from the original archive (PRONI) on 2017-12-01, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Thu, 01 September 2016 02:16:37 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/11112/20160901021637/https://www.flickr.com/

from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/11112/20170401014520/https://www.flickr.com/

As shown in the figure below, the representations of both mementos are identical (except for the archival banners), we consider the memento from the original archive as missing because both mementos have different values of the Memento-Datetime (i.e., Fri, 21 Apr 2017 01:45:20 GMT in the new archive) for a delta of about 211 days.

$ curl --head --location --silent http://wayback.archive-it.org/11112/20160901021637/http://www.flickr.com/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170401014520/https://www.flickr.com/
HTTP/1.1 200 OK
Memento-Datetime: Sat, 01 Apr 2017 01:45:20 GMT

We found that 63 missing mementos, out of 114, have different values of the Memento-Datetime of a delta of less than 11 seconds. For example, the request to the memento:

http://webarchive.proni.gov.uk/20170102004044/http://www.fws.gov/

from the original archive on 2017-11-18 returned "302" redirect to

http://webarchive.proni.gov.uk/20170102004044/https://fws.gov/

The request to the corresponding memento:

http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/

from the new archive redirects to the memento:

http://wayback.archive-it.org/11112/20170102004051/https://www.fws.gov/

as the cURL session below shows:

$ curl --head --location --silent http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170102004051/https://www.fws.gov/
HTTP/1.1 200 OK
Memento-Datetime: Mon, 02 Jan 2017 00:40:51 GMT

There are 10-second difference between the values of the Memento-Datetime which might not be semantically significant (apparently just a change in the canonicalization of the URIs, with http://www.fws.gov/ redirecting to https://www.fws.gov), but we do not consider the memento in the original archive is identical to the corresponding memento in the new archive because of the difference in the values of the Memento-Datetime.

When the new archive receives an archived collection from the original archive, it may apply some post-crawling techniques to the received files (e.g., WARC files) including deduplication, spam filtering, and indexing. This may result in mementos in the new archive that have different values of the Memento-Datetime compared to their corresponding values in the original archive.

Observation 4: PRONI provides a list of original pages' URIs (URI-Rs)

Mementos in PRONI archive were moved to the Archive-It under the archival collection https://archive-it.org/collections/11112/:

As shown in Observation 1, requests to URI-Ms in PRONI do not redirect to Archive-It. However, webarchive.proni.gov.uk provides a list of all original resources' URIs (URI-Rs) for which mementos have been created as the following figure shows:

For instance, if interested in finding the corresponding memento in Archive-It to the PRONI memento:
URI-M = http://webarchive.proni.gov.uk/20150318223351/http://www.afbini.gov.uk/

In proni.gov.uk

which has:
URI-R = http://www.afbini.gov.uk/
Memento-Datetime = Wed 18 Mar 2015 22:33:51 GMT

From the index at webarchive.proni.gov.uk, we can click on the URI-R www.afbini.gov.uk, which will redirect to an Archive-It HTML page that contains all available mementos for the selected URI-R as shown in the figure below:

Finally, we choose 2015-03-18 since it is the same Memento-Datetime as in the original archive. The representation of the memento is shown below:

Although the PRONI archive does not issue "301" redirects to URI-Ms in the new archive (i.e., PRONI does not provide a direct mapping between original URI-Ms and new URI-Ms), users of the archive can indirectly find the corresponding URI-Ms as explained above.

Observation 5: Archival 4xx/5xx responses are handled differently

Similar to the European Archive and IMF, the replay tool in the original archive (proni.gov.uk) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. For example, when requesting the memento:

http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/

on 2017-11-18, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81bb8530-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 60568

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
...

Even the HTTP status code of the inner iframe in which the archived content is loaded had "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/content/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81c967e0-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 4587

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 24849777 24360264
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<title>[ARCHIVED CONTENT] 403 FORBIDDEN : LOGGED BY www.megalithic.co.uk</title>
</head>
<body style=\"font:Arial Black,Arial\">
     403 FORBIDDEN!     

You have been blocked from the Megalithic Portal by our defence robot. Possibly the IP address you are accessing from has been banned for previous bad behavior or you have attempted a hostile action. If you think this is an error please click the Trouble Ticket link below to communicate with the site admin.
...

When requesting the corresponding memento:

http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

Archive-It properly returned the status codes 403 for the archival 403:

$ curl --head http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

HTTP/1.1 403 Forbidden
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self''unsafe-inline''unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
X-Archive-Guessed-Charset: windows-1252
X-Archive-Orig-Server: Apache/2.2.15 (CentOS)
X-Archive-Orig-Connection: close
X-Archive-Orig-X-Powered-By: PHP/5.3.3
X-Archive-Orig-Status: 403 FORBIDDEN
X-Archive-Orig-Content-Length: 3206
X-Archive-Orig-Date: Tue, 16 Feb 2016 15:39:59 GMT
X-Archive-Orig-Warning: 199 www.megalithic.co.uk:80 You_are_abusive/hacking/spamming_www.megalithic.co.uk
X-Archive-Orig-Content-Type: text/html; charset=iso-8859-1
...

The representations of both mementos are illustrated below:

Observation 6: Some URI-Rs were removed from PRONI

Mementos may disappear when moving from the original archive to the new archive. For example, the request to the URI-M:

http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/

from the original archive resulted in "200 OK" as the part of the WARC below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/
WARC-Date: 2017-11-18T02:06:22Z
WARC-Record-ID: <urn:uuid:13136370-cc05-11e7-83e9-19ddf7ecdbd2>
Content-Type: application/http; msgtype=response
Content-Length: 28657

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 02:06:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 20429973
Memento-Datetime: Tue, 08 Apr 2014 18:55:12 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="last memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="prev memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="next memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", <http://www.www126.com/>; rel="original", <http://webarchive.proni.gov.uk/timemap/http://www.www126.com/>; rel="timemap"; type="application/link-format"

The representation of the memento is illustrated below:

In proni.gov.uk

The request to the corresponding URI-M:

http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Content-Length: 4910
Date: Tue, 24 Sep 2019 02:00:45 GMT

Before transferring collections to the new archive, it is possible that the original archive reviews collections and removes URI-Rs/URI-Ms that are considered off topic (you may also read about the off-topic memento toolkit) or spam (e.g., the URI-R www.www126.com is about auto insurance).

Observation 7: PRONI may have used the Europe Archive and IMF as hosting services

PRONI used an archival banner that is similar to what the Europe Archive and IMF used. Furthermore, the three archives returned similar set of HTTP response headers to requests of mementos. Values of multiple response headers (e.g., server and via) are identical as shown below:

The Europe Archive:
URI-M = http://collection.europarchive.org/nli/20161213111140/http://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collection.europarchive.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2017-12-01T18:15:32Z
WARC-Record-ID: <urn:uuid:9e705c20-d6c3-11e7-9f9e-5371622c3ef9>
Content-Type: application/http; msgtype=response
Content-Length: 159855

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 18:15:18 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 15695216
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

The IMF archive:
URI-M = http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2018-09-03T16:33:41Z
WARC-Record-ID: <urn:uuid:1e3173c0-af97-11e8-8819-6df9b412b877>
Content-Type: application/http; msgtype=response
Content-Length: 300060

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:33:27 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 11721161
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link:...

The PRONI archive:
URI-M = http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
WARC-Date: 2017-12-01T01:59:56Z
WARC-Record-ID: <urn:uuid:5452da60-d63b-11e7-91e2-8bf9abaf94b4>
Content-Type: application/http; msgtype=response
Content-Length: 36908

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 01:59:43 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 16262557
Memento-Datetime: Thu, 18 Feb 2010 15:18:44 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

We found that PRONI archival collection was listed as one of the maintained collections by the Europe Archive and IMF as the figure below shows:

https://web.archive.org/web/20180707131510/http://collections.internetmemory.org/

Even though the PRONI collection were moved from the Europe Archive to IMF, URI-Ms served by proni.gov.uk had not changed. It is possible that the PRONI archive followed a strategy of serving mementos under proni.gov.uk while using the hosting services provided by the Europe Archive and IMF. Thus, the regular users of the PRONI archive did not notice any change in URI-Ms. We do not think custom domains are available with Archive-It, so PRONI was unable to continue to host their mementos in their own URI namespace.

The list of all 979 URI-Ms is appended below. The file contains the following information:

The URI-M from the original archive (original_URI-M).
The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
The URI-M from the new archive (new_URI-M).
The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).

Conclusions

We described the movement of mementos from the PRONI archive (proni.gov.uk) to archive-it.org in October 2018. We found that 114 out of the 469 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner has gone. We noticed that both archives proni.gov.uk and archive-it.org react differently to requests of archival 4xx/5xx. In the upcoming posts we will provide some details about changes in webcitation.org.

--Mohamed Aturban

↧

2019-10-01: Attending the Machine Learning + Libraries Summit at the Library of Congress

October 1, 2019, 8:20 pm

≫ Next: 2019-10-21: Where did the archive go? Part 4: WebCite

≪ Previous: 2019-09-25: Where did the archive go? Part 3: Public Record Office of Northern Ireland

On September 20, 2019, I attended the Machine Learning + Libraries Summit at the Library of Congress. The aim of the meeting is to gather computer and information scientists, engineers, data scientists, and Liberians from reputable universities, government agencies, and industrial companies to come up with ideas on a bunch of questions such as how to expand the service of digital libraries, how to establish a good collaboration with other groups on machine learning projects, and what factors to consider to design a sustainable machine learning project, especially in the digital library domain. In the initial solicitation, the focus was cultural heritage, but the discussion went far beyond that.

The meeting features many interesting lightning talks. Unfortunately, due to the relatively short time allocated, many questions and discussions have to go offline. The organizer also arranged several activities, stimulating brainstorm discussion and teamwork between people from different places. I took notes of some speakers and their presentation contents that are highly relevant to my research.

The summit organizers solicited many other potentially interesting topics but because there was not enough time, they opened a Google doc to create a "look book" allowing people to post 3 slides to highlight their research and potential contribution to the project. There are 3 sections of presentations.

Section 1: existing projects:
* Leen-Kiat Soh, Liz Lorang: University of Nebraska-Lincoln
These people are focusing on newspapers and manuscript collections. It is an explorative project in image segmentation and visual content extraction. The project is called Aida.

* Thu Phuong 'Lisa' Nguyen, Hoover Institution Library & Archives, Stanford University
These people are trying to process digital collection fo scanned documents from 1898 to 1945, published in the USA. They are working toward extracting meaningful data, document layout analysis, page-level segmentation, article segmentation. The text could be arranged in different directions (from left to right or from top to bottom). Some scripts are mixed, i.e., English and Japanese.

* Kurt Luther: Assistant Professor of Computer Science and (by courtesy) History, Virginia Tech
Kurt was leading a group on a project called civil war photo sleuth, which combines crowdsourcing and face recognition to identify historical portraits. They have about 4 million portraits today but only 10-20% are identified. They have developed a crowdsourcing platform with about 10k registered users.

* Ben Lee + Michael Haley Goldman: United States Holocaust Memorial Museum
Ben and Michael are working on a project that involves 190 million images in WWII. Their goal is to trace missing family members. This dataset is an invaluable resource of Holocaust survivors and their families, as well as Holocaust researchers. They mostly use random forest models + template matching methods.

* Harish Maringanti: Associate Dean for IT & Digital Library Services; J. Willard Marriott Library at University of Utah

* David Smith: Associate Professor, Computer Science, Northeastern University
David introduced his work on Networked Texts.

* Helena Sarin: Founder, Neural Bricolage
* Nick Adams: Goodly Labs

Section 2: Partnerships
* Mark Williams: Media Ecology Lab, Dartmouth College
Mark mentioned an annotation tool called SAT "Semantic Annotation Tool".
* Karen Cariani: WGBH Media Library and Archives
* Josh Hadro + Tom Cramer: IIIF, Stanford Libraries
* Mia Ridge + Adam Farquhar: British Library
* Kate Murray: Library of Congress
* Representatives from the Smithsonian Institution Data Lab, National Museum of American History
Rebecca from Smithsonian OCIO data science lab talks about machine learning at Smithsonian. Some interesting and potentially useful tools include Google vision API, RESNET50, and VGG. Their experiments indicate that the Google tool achieves high performance, but not customizable, RESNET and VGG have far lower success numbers but can be customized and re-trained.

* Jon Dunn + Shawn Averkamp: Indiana University Bloomington, AVP
* Peter Leonard: Yale DH Lab
Peter talked about their project called PixPlot, which is a web interface to visualize about 30k images from Lund, Sweden. The source code is at https://github.com/YaleDHLab/pix-plot. The website is https://dhlab.yale.edu/projects/pixplot/.

Section 3: Future Possibilities & Challenges
* Michael Lesk: Rutgers University
Michael talked about duplicate image detector tool at NMAH, including between 1-2 TB of images stored on legacy hardware and network directory. The goal is to determine if there are duplicates. If there are, which images have higher quality.

* Heather Yager: MIT Libraries
* Audrey Altman: Digital Public Library of America
* Hannah Davis: Independent Research Artist
Hannah mentions an interesting tagger: https://www.tagtog.net/

Besides, the summit also has arranged open discussions and activities to stimulate the attendant's thoughts and discussions. Some noted questions are
* How do we communicate machine learning results/outputs to end-users?
* How does one get ML products from the pilot to production?
* Do you know of existing practical guidelines for successful machine learning project agreements?
* How can we overcome the challenges of access variable resources across varying contexts, such as infrastructure, licensing, and copyright structure?
* Which criteria would you use for evaluation of service whether for providers for internal/external use?

Another activity is to ask attendants in different tables to form groups and discuss factors to consider in collaborations with machine learning projects. Some noted points include
* Standardize and documenting data
* Clarity of roles and communication
* User expectation, regular share document of progress
* Organizational and political factors to get the project done.
* Get the right reasons, the right people, and the right plan. Having a value of the project.

Below are the people I met with both known and new friends.

* Stephen Downie from UIUC. He introduced to me some useful tools in HathiTrust that I can borrow for my ETD project.
* Tom Cramer from Stanford. Tom was leading a team to work on a similar project on mining ETDs. He also introduced the yewno.com website, which they are working with, to transform information in ETDs into knowledge.
* Kurt Luther from Virginia Tech at Arlington. Kurt was doing a historical portrait study.
* Wayne Graham from CLIR.
* Heather Yager from MIT. Heather and I had a brief chat on accessing ETDs from DSpace in MIT libraries.
* David Smith from Northeastern. David was an expert on image processing. He introduced hOCR to me which is exactly the tool I was looking for to identify bounding boxes of text on a scanned document.
* Michael Lesk from Rutgers. A senior but still energetic information scientist. He knew Dr. C. Lee Giles.
* Kate Zwaard the chief of National Digital Initiatives at the Library of Congress

Overall, the summit was very successful. The attendances presented real-world problems and discussed very practical questions. The logistic was also good. Eileen Jakeway did excellent jobs on communicating with people before and after the summit, including a follow-up survey. I thank Dr. Michael Nelson for telling me to register for this meeting.

I made a wise decision to stay overnight before the meeting. The traffic from Springfield to the Library of Congress is terrible with 3 accidents in the morning. I was lucky to find a parking spot costing $18 a day near LOC. The back trip was 1 hr longer than the map distance due to constructions. But the weather was fine and the people were friendly!

Jian Wu

↧

2019-10-21: Where did the archive go? Part 4: WebCite

October 20, 2019, 9:03 pm

≫ Next: 2019-10-25: Summary of "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online"

≪ Previous: 2019-10-01: Attending the Machine Learning + Libraries Summit at the Library of Congress

webcitation.org

We previously described changes in the following web archives:

In Where did the archive go? Part 1, we provided some details about changes in the archive Library and Archives Canada. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages).
In Part 2, we focused on the movement of the National Library of Ireland (NLI). Mementos from NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 mementos, out of 979, cannot be found in Archive-It.
In Part 3, we described changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. Mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 mementos, out of 469, can no longer be found in Archive-It (i.e., missing mementos).

In the last part of this four part series, we focus on changes in webcitation.org (WebCite). The WebCite archive has been operational in its current form since at least 2004 and was the first archive to offer an on-demand archiving service by allowing users to submit URLs of web pages. Around 2019-06-07, the archive became unreachable. The Wayback Machine indicates that there were no mementos captured for the domain webcitation.org between 2019-06-07 and 2019-07-09 (for about a month), which is the longest period of time in 2019 that has no mementos for WebCite in the Internet Archive:

https://web.archive.org/web/*/webcitation.org

The host webcitation.org was not resolving as shown in the cURL session below:

$ date
Mon Jul 01 01:33:15 EDT 2019

$ curl -I www.webcitation.org
curl: (6) Could not resolve host: www.webcitation.org

We were conducting a study on a set of mementos from WebCite when the archive was inaccessible. The study included downloading the mementos and storing them locally in WARC files. Because the archive was unreachable, the WARC files contained only request records, with no response records, as shown below (the URI of the memento (URI-M) was http://www.webcitation.org/5ekDHBAVN):

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/5ekDHBAVN
WARC-Date: 2019-07-09T02:01:52Z
WARC-Concurrent-To: <urn:uuid:8519ea60-a1ed-11e9-82a3-4d5f15d9881d>
WARC-Record-ID: <urn:uuid:851c5b60-a1ed-11e9-82a3-4d5f15d9881d>
Content-Type: application/http; msgtype=request
Content-Length: 303

GET /5ekDHBAVN HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 5B0BDCB102052B8A7EF0772D50B85540
Host: www.webcitation.org

<EOF>

The WebCite archive was back online on 2019-07-09 with a significant change; the archive no longer accepts archiving requests as its homepage indicates (e.g., the first screenshot above).

Our WARC records below show the time at which the archive came back online on 2019-07-09T13:17:16Z. Note that there are a few hours difference between the value of the WARC-Date below and its value in the WARC record above. However, the WebCite archive was still down on 2019-07-09T02:01:52Z, while it was online again around 13:17:16Z:

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Concurrent-To: <urn:uuid:df5de3b0-a24b-11e9-aaf9-bb34816ea2ff>
WARC-Record-ID: <urn:uuid:df5f6a50-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=request
Content-Length: 414

GET /6MIxPlJUQ HTTP/1.1
Pragma: no-cache
Accept-Encoding: gzip, deflate
Host: www.webcitation.org
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Cache-Control: no-cache
Connection: keep-alive

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Record-ID: <urn:uuid:df5fb870-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=response
Content-Length: 1212

HTTP/1.1 200 OK
Pragma: no-cache
Date: Tue, 09 Jul 2019 13:16:47 GMT
Server: Apache/2.4.29 (Ubuntu)
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Set-Cookie: PHPSESSID=ugssainrn4pne801m58d41lm2r; path=/
Cache-Control: no-store, no-cache, must-revalidate
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
Content-Length: 814
Expires: Thu, 19 Nov 1981 08:52:00 GMT

<!DOCTYPE html PUBLIC
...

The archive was still down a few seconds before 2019-07- 09T13:17:16Z as there were not response records in the WARC files:

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6E2fTSO15
WARC-Date: 2019-07-09T13:16:39Z
WARC-Concurrent-To: <urn:uuid:c94dd940-a24b-11e9-8d2f-a5805b26a392>
WARC-Record-ID: <urn:uuid:c9515bb0-a24b-11e9-8d2f-a5805b26a392>
Content-Type: application/http; msgtype=request
Content-Length: 303

GET /6E2fTSO15 HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 78DDE3B763F42A09787D0EBA241C9C4A
Host: www.webcitation.org

The archive has had some issues in the past related to funding limitations:

#WebCite will stop accepting new submissions end of 2013, unless they reach their fundraising goals http://t.co/nug32Zzc . Please support.
— Ahmed AlSum (@aalsum) February 16, 2013

https://archive.fo/eAETp

One of the main objectives for which WebCite was established was to reduce the impact of reference rot by allowing researchers and authors of scientific work to archive cited web pages. The instability in providing archiving services and being inaccessible from time to time raises important questions:

Is there any plan by the web archiving community to recover web pages archived by WebCite if the archive is gone?

Why didn't the archive give a notice (e.g., in their homepage) before they became unreachable? This will give users some time to deal with different scenarios, such as downloading a particular set of mementos

Has the archived content changed before the archive came back online?

The archive now does not accept archiving requests nor does it do web page crawling. Is there any plan by the archive to resume the on-demand archiving service in the future?

In case WebCite disappears, the structure of the URI-M used by the archive makes it difficult to recover mementos from other web archives. This is because the URI-M of a memento (e.g., www.webcitation.org/5BmjfFFB1) does not give any additional information about the memento. These shortened URI-Ms are also used by other archives, such as archive.is and perma.cc. In contrast, the majority of web archives that employ one of the Wayback Machine’s implementations (e.g., OpenWayback and PyWb) use the URI-M structure illustrated below. Note that archive.is and perma.cc also support this Wayback Machine-style URI-Ms (i.e., each memento has two URI-Ms):

URI-M structure typically used by Wayback-style archives

This common URI-M structure provides two pieces of information, the original page's URI (URI-R) and the creation datetime of the memento (Memento-Datetime). This information then can be used to look up similar (or even identical) archived pages in other web archives using services like the LANL Memento aggregator. With the URI-M structure used by WebCite, it is not possible to recover mementos using only the URI-M. The Robust Links article introduces a mechanism to allow a user to link to an original URI and at the same time describe the state of the URI so that, in the future, users will be able to obtain information about the URI even if it disappears from the live web.

The WebCite archive is a well-known web archive and one of the few on-demand archiving services. It is important that the web archiving community takes further steps to guarantee long-term preservation and stability of those archived pages at webcitation.org.

https://t.co/C46aEumrEM - the uncontested pioneer for fighting link rot in #scholcomm - should really consider handing over its activities to an org with some guarantees re long-term stability. And #Memento support ;-) / cc @permacc
— Herbert @hvdsomp@octodon.social (@hvdsomp) August 1, 2019

--Mohamed Aturban

↧

2019-10-25: Summary of "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online"

October 25, 2019, 5:58 pm

≫ Next: 2019-10-28: The interaction between search engine caches and web archives

≪ Previous: 2019-10-21: Where did the archive go? Part 4: WebCite

Figure 1 Disclosed Vulnerabilities by Year (Source: CVE Details)

The number of software vulnerabilities discovered and disclosed to the public is steadily increasing every year. As shown in Figure 1, in 2018 alone, more than 16,000 Common Vulnerabilities and Exposures (CVE) identifiers were assigned by various CVE Numbering Authorities (CNA). CNAs are organizations from around the world that are authorized to assign CVE IDs to vulnerabilities affecting products within their distinct, agreed-upon scope. In the presence of voluminous amounts of data and limited skilled cyber security resources, organizations are challenged to identify the vulnerabilities that pose the greatest risk to their technology resources.

One of the key reasons the current approaches to cyber vulnerability remediation are ineffective is that organizations cannot effectively determine whether a given vulnerability poses a meaningful threat. In their paper, "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online", Almukaynizi et al. draw on a body of work that seeks to define an exploit prediction model that leverages data from online sources generated by the white-hat community (i.e., ethical hackers). The white-hat data is combined with vulnerability mentions scraped from the dark web to provide an early predictor of exploits that could appear "in the wild" (i.e., real world attacks).

Video: What is the dark web? And, what will you find there? (Source: https://youtu.be/7QxzFbrrswA)

Common Vulnerability Scoring System (CVSS) Explained
The CVSS is a framework for rating the severity of security vulnerabilities in software and hardware. Operated by the Forum of Incident Response and Security Teams (FIRST), the CVSS uses an publicly disclosed algorithm to determine three severity rating scores: Base, Temporal, and Environmental. The scores are numeric and range from 0.0 through 10.0 with 10.0 being the most severe. According to the most recent version of the CVSS, V3.0:

A score of 0.0 receives a "None" rating.
A 0.1-3.9 score receives a "Low" severity rating.
A score of 4.0-6.9 receives a "Medium" rating.
A score of 7.0-8.9 receives a "High" rating.
A score of 9.0 - 10.0 receives a "Critical" rating.

As shown in Figure 2, the score is computed according to elements and subscores required for each of the three metric groups.

The Base score is the metric most relied upon by enterprise organizations and reflects the inherent qualities of a vulnerability.
The Temporal score represents the qualities of the vulnerability that change over time.
The Environmental score represents the qualities of the vulnerability that are specific to the affected user's environment.

Figure 2 CVSS Metric Groups (Source: FIRST)

Due to their specificity with the organization's environment, it should be noted the temporal and environmental metrics are normally not reflected in the reported CVSS base score, but can be calculated independently using the equations published in the FIRST Common Vulnerability Scoring V3.1 Specification document. The CVSS allows organizations to prioritize which vulnerabilities to remediate first and gauge the overall impact of the vulnerabilities on their systems. A consistent finding in this stream of research is that the status quo for how organizations address vulnerability remediation is often less than optimal and has significant room for improvement. With that in mind, Almukaynizi et al. present an alternative prediction model which they evaluate against the standard CVSS methodology.

Exploit Prediction Model
Figure 3 depicts the individual elements that Almukaynizi et al. use to describe the three phases of their exploit prediction model. The phases are Data Collection, Feature Extraction, and Prediction.

Figure 3 Exploit Prediction Model (Source: Almukaynizi)

Data Collection
This phase is used to build a dataset for further analysis. Vulnerability data is assimilated from:

NVD. 12,598 vulnerabilities (unique CVE IDs) were disclosed and published in the National Vulnerability Database (NVD) between 2015 and 2016. For each vulnerability, the authors gathered the description, CVSS base score, and scoring vector.
EDB (white-hat community). Exploit Database is an archive of public, proof of concept (PoC) exploits and corresponding vulnerable software, developed for use by penetration testers and vulnerability researchers. The PoC exploits can often be mapped to a CVE ID. Using the unique CVE-IDs from the NVD database for the time period between 2015 and 2016, the authors queried the EDB to determine whether a PoC exploit was available. 799 of the vulnerabilities in the 2015 to 2016 data set were found to have verified PoC exploits. For each PoC, the authors gathered the date the PoC exploit was posted.
ZDI (vulnerability detection community). The Zero Day Initiative (ZDI) encourages the reporting of verified zero-day vulnerabilities privately to the affected vendors by financially rewarding researchers; a process which is sometimes referred to as a bug bounty. The authors queried this database to identify 824 vulnerabilities in the 2015 to 2016 data set that were common with the NVD.
DW (black-hat community). The authors built a system which crawls marketplace sites and forums on the dark web to collect data related to malicious hacking. They used a machine learning approach to identify content of interest and exclude irrelevant postings (e.g., pornography). They retained any postings which specifically have a CVE number or could be mapped from a Microsoft Security Bulletin to a corresponding CVE ID. They found 378 unique CVE mentions between 2015 and 2016.
Attack Signatures (Ground truth). Symantec's anti-virus and intrusion detection attack signatures were used to identify actual exploits that were used in the wild and not just PoC exploits. Some attack signatures are mapped to the CVE ID of the vulnerability which were correlated with NVD, EDB, ZDI, and DW. The authors noted this database may be biased towards products from certain vendors (e.g., Microsoft, Adobe).

Table I shows the number of vulnerabilities exploited as compared to the ones disclosed for all the data sources considered.

Source: Almukaynizi

Feature Extraction

A summary of features extracted from all the data sources mentioned is provided in Table II.

Source: Almukaynizi

The NVD description provides information on the vulnerability and the capabilities attackers will gain if they exploit it. Contextual information gleaned from DW was appended to the NVD description. Here, the authors observed foreign languages in use which they translated into English using the Google Translate API. The text features were analyzed using Term Frequency-Inverse Document Frequency (TF-IDF) to create a vocabulary of 1000 most frequent words in the entire data set. Common words were eliminated as important features.
The NVD provides CVSS base scores and vectors which indicate the severity of the vulnerability. The categorical components of the vector include Access Complexity, Authentication, Confidentiality, Integrity, and Availability. All possible categories of features were vectorized then assigned a value of "1" or "0" to denote whether the category is present or not.
The DW feeds are posted in different languages; most notably in English, Chinese, Russian, and Swedish. The language of the DW post is used rather than the description since important information can be lost during the translation process.
The presence of PoC on either EDB, DW, or ZDI increases the likelihood of a vulnerability being exploited. This is treated as a binary feature.

Prediction

The authors employed several supervised machine learning approaches to determine a binary classification on the selected features indicating whether the vulnerability would be exploited or not.

Vulnerability and Exploit Analysis
As shown in Table III, Almukaynizi et al. assessed the importance of aggregating disparate data sources by first analyzing the likelihood of exploitation based on the coverage of each source. Then, they conducted a language based analysis to identify any socio-cultural factors present in the DW sites which might influence exploit likelihood. Table III presents the percentage of exploited vulnerabilities that appeared in each source along with results for the intersection.

Table III

	EDB	ZDI	DW	ZDI V DW	EDB V ZDI V DW
Number of vulnerabilities	799	824	378	1180	1791
Number of exploited vulnerabilities	74	95	52	140	164
Percentage of exploited vulnerabilities	21%	31%	17%	46%	54%
Percentage of total vulnerabilities	6.3%	6.5%	3.0%	9.3%	14.2%

Source: Almukaynizi

The authors determined that 2.4% of the vulnerabilities disclosed in the NVD are exploited in the wild. The correct prediction of exploit likelihood increased when additional data sources were included. This was balanced by the fact that each data community (EDB, ZDI, DW) operates under a distinct set of guidelines (e.g., white hat, researchers, hackers).

In the DW, four languages were detected which resulted in noticeable variations in the exploit likelihood. English and Chinese have more vulnerability mentions (n=242 and n= 112, respectively) than Russian and Swedish (n=13 and n=11, respectively). Chinese postings exhibited the lowest exploitation rate (~10%). However, 46% of the vulnerabilities mentioned in Russian postings were exploited. Figure 4 shows the language analysis based on vulnerability mentions.

Figure 4 Exploited Vulnerabilities by Language (Source: Almukaynizi)

Performance Evaluation
Experiments using the exploit prediction model were examined using different supervised machine learning algorithms including Support Vector Machine (SVM), Random Forest (RF), Naive Bayes Classifier (NB), and Logistic Regression (LOG-REG). Random Forest, which is based on generating multiple decision trees, was found to provide the best F1 measure to determine classes of exploited versus not exploited vulnerabilities.

Their classifier was evaluated based on precision, recall, and Receiver Operating Characteristics (ROC). If minimizing the number of incorrectly flagged vulnerabilities is the goal, then high precision is desired. If minimizing the number of undetected vulnerabilities is the goal, then high recall is desired. To avoid temporal intermixing, the NVD data was sorted by the disclosure date and the first 70% was used for training and the rest for testing. This was necessary so that future PoC events would not influence the prediction of past events (i.e., vulnerability is published before the exploitation date). Table IV shows the precision, recall, and corresponding F1 measure for vulnerabilities mentioned on DW, ZDI, and EDB. DW information was able to identify exploited vulnerabilities with the highest level of precision at 0.67.

Table IV

Source: Almukaynizi

Discussion
Almukaynizi et al. indicate promising results based on their random forest classification scheme. It should be noted that random forest outputs a confidence score for each sample which can be evaluated against a user-defined threshold for predicting a vulnerability as exploited. While the authors acknowledge the threshold can be varied in accordance with other factors of importance to the organization (e.g., vendors), they do not disclose the hard-cut threshold used during their experiments. It is also noteworthy that false negatives that received the lowest confidence scores shared common features (e.g., Adobe Flash and Acrobat Reader), base scores, and similar descriptions in the NVD. A similar observation was noted among the false positives where all predicted exploited vulnerabilities existed in Microsoft products. The inherent class imbalance in vulnerability data may also be a contributing factor along with perceived biases in the Symantec attack signatures which provide ground truth. In the future, the authors hope to enhance their exploit prediction model by expanding the vulnerability data sources to include social media sites and online blogs.

-- Corren McCoy (@correnmccoy)

Almukaynizi, Mohammed, et al. "Proactive identification of exploits in the wild through vulnerability mentions online." 2017 International Conference on Cyber Conflict (CyCon US). IEEE, 2017. http://usc-isi-i2.github.io/papers/almukaynizi17-cycon.pdf

↧

2019-10-28: The interaction between search engine caches and web archives

October 28, 2019, 6:22 pm

≫ Next: 2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

≪ Previous: 2019-10-25: Summary of "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online"

News articles from Indian newspapers about a corruption case involving an Indian doctor. The left images show screenshots of the article from the print newspaper. The right images show URLs for the articles returning with 404 pages.

My brother, a lawyer in India, recently sent me two screenshots shown in Figures 1 and 2, of a news article about a corruption case involving a renowned doctor from India. In order to proceed with legal proceedings against the newspapers for publishing the article, my brother needed some evidence about the publication of the articles. Therefore he sought my help in finding the URLs of the articles shown in the screenshots. The news articles were published in an English language newspaper, The Asian Age, and a Hindi language newspaper, Punjab Kesari.

Figure 1: Screenshot of the news article from the English language newspaper, The Asian Age shared with me by my brother

Figure 2: Screenshot of the news article from the Hindi language newspaper, Punjab Kesari shared with me by my brother

Finding URLs for the screenshot of the news articles

I searched the websites of The Asian Age and Punjab Kesari for the articles and found links to the articles (shown in the Original URL row of Tables 1 and 2) but they both redirected to a 404 page, as shown in Figures 3 and 4. Fortunately, we found search engine (SE) cached copies of both articles in the Google and Bing caches, as shown in Figures 5 and 6.

Plinio Vargas in his post "Link to Web Archives, not Search Engine Caches" talks about the ephemeral nature of the SE cache URLs and highlights the reason for linking to the web archives over the SE cache URLs. Furthermore, Dr. Michael Nelson in his post "Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives" has already shown us the use of SE cache URLs and the web archives to find answers to real world problems.

Figure 3: A 404 page appears on accessing the news article from Punjab Kesari

Figure 4: A 404 page appears on accessing the news article from The Asian Age

cURL response for the The Asian Age news article which redirects to a 404 page

msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html"
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:07 GMT
Server: Apache/2.4.7 (Ubuntu)
Location: https://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Cache-Control: max-age=300
Expires: Fri, 20 Sep 2019 18:40:07 GMT
Connection: close
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:08 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=dsp7g2kkn5sfk2eggaftg3un84; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: /404.html
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html

HTTP/1.1 200 OK
Date: Fri, 20 Sep 2019 18:35:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=koaujt0tiaqgjvafa5je1djps5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html

Figure 5: Bing Cache http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html

Figure 6: Google Cache http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html

cURL response for the Punjab Kesari news article which redirects to a 404 page

msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341"
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Location: https://haryana.punjabkesari.in/common404.aspx
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 18:45:12 GMT
X-Cache: Miss from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: Ub5SmJxPQWHJQSIg9xEz-GVZLQtNA4KHkXHT2-qp_6ZD8AFKF_fQKQ==

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 76757
Connection: keep-alive
Cache-Control: public, no-cache="Set-Cookie", max-age=15000
Expires: Fri, 20 Sep 2019 17:17:08 GMT
Last-Modified: Fri, 20 Sep 2019 13:07:08 GMT
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 13:07:08 GMT
Vary: Accept-Encoding,Cookie
X-Cache: Hit from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: 5PzkcGPXziNxfNLDffTV3-V6Ks2w3FQiEUWnHMzfZm_aDKfyBKjw7A==
Age: 20281

Push the cached URLs to multiple web archives

We pushed the Bing and Google cache URLs (URI-R-SEs) for both news articles to the Internet Archive, perma.cc, and archive.is. The URI-Ms for the URI-R-SEs are shown in Tables 1 and 2. We can use ArchiveNow to automate pushing of URLs to multiple web archives. We also captured the WARC files of the URI-R-SEs for the articles using Webrecorder and stored the WARCs locally.

Table 1: Links to the original URL, SE cache URLs, and the mementos for *The Asian Age* news article.
The Asian Age News Article
Original URL (URI-R)		http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Google Cache URL (URI-R-SE)		https://webcache.googleusercontent.com/search?q=cache:NZBrw4FQYRUJ:https://www.asianage.com/amp/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html%3Futm_source%3DlatestPromotion%26utm_medium%3Dend%26utm_campaign%3Dlatest+&cd=1&hl=en&ct=clnk&gl=us
Mementos for Google Cache (URI-M)	Internet Archive	http://web.archive.org/web/20190913044502/https://webcache.googleusercontent.com/search?q=cache:NZBrw4FQYRUJ:https://www.asianage.com/amp/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html%3Futm_source%3DlatestPromotion%26utm_medium%3Dend%26utm_campaign%3Dlatest+&cd=1&hl=en&ct=clnk&gl=us
	archive.is	http://archive.is/Pc5Ss
	perma.cc	https://perma.cc/43A5-ZGUP
Bing Cache URL (URI-R-SE)		http://cc.bingj.com/cache.aspx?q=http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4900757184710965&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO
Mementos for Bing Cache (URI-M)	Internet Archive	https://web.archive.org/web/20190913222156/http://cc.bingj.com/cache.aspx?q=http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4900757184710965&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO
	archive.is	http://archive.is/MLEPL
	perma.cc	https://perma.cc/L87D-YNTD

Table 2: Links to the original URL, SE cache URLs, and the mementos for *Punjab Kesari* news article.
Punjab Kesari News Article
Original URL (URI-R)		https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341
Google Cache URL (URI-R-SE)		No results found
Mementos for Google Cache (URI-M)		No Mementos
Bing Cache URL (URI-R-SE)		http://cc.bingj.com/cache.aspx?q=https%3a%2f%2fharyana.punjabkesari.in%2fnational%2fnews%2fpolice-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341&d=4562373894013993&mkt=en-US&setlang=en-GB&w=h-neQ61HBpgaugtetBnppSMOpz05iO79
Mementos for Bing Cache (URI-M)	Internet Archive	http://web.archive.org/web/20190915055809/http://cc.bingj.com/cache.aspx?q=https%3a%2f%2fharyana.punjabkesari.in%2fnational%2fnews%2fpolice-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341&d=4562373894013993&mkt=en-US&setlang=en-GB&w=h-neQ61HBpgaugtetBnppSMOpz05iO79
	archive.is	http://archive.is/4yhYv
	perma.cc	https://perma.cc/C3V3-DYA5

Accessing the Cache URLs in the Web Archives

Web archives index mementos by their URI-R. A SE cache URI-M can only be accessed by users who know the URI-R-SE, which is mostly opaque as a result of various parameters and encodings. As shown in Figure 7, the URI-R-SE for the same web resource may vary according to different geographic location which means that the same web resource may be indexed under different URI-R-SEs in the web archives.

In the US, the Bing Cache URL for the The Asian Age news article is
http://cc.bingj.com/cache.aspx?

q=http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2

frs-100-crore-duping-claim-against-top-doctor.html&d=4900757184710965

&mkt=en-US&setlang=en-GB&w=2azCVRmBXeu0mxbmz4qBzg-JwMMcKGUO

In India, the Bing cache URL for the The Asian Age news article is
http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y

Figure 7: The Bing Cache URL for the US (left) is 200 and the one for India
is 404 (right)

Pushing the URI-R-SE to multiple web archives not only makes it accessible from web multiple archives, but also some web archives can be leveraged to find mementos in the other web archives. As shown in Figure 8, archive.is extracts the URI-R of the article from the URI-R-SE of the article and indexes the URI-Ms for the URI-R-SE under both the URI-R and URI-R-SE. As shown in Figure 9, we accessed a memento from Internet Archive for the URI-R-SE using the extracted URI-R-SE from archive.is which is what the other web archives consider as URI-R.

Figure 8: archive.is lists the Bing cache URL for the memento upon searching for the URL of the web page which can be used to search in other web archive

Figure 9: Using the Bing cache URL from archive.is to retrieve
mementos of the search engine cache from the Internet Archive

Figure 10: Memento of a SE cache which did not capture the intended content

Figure 11: Google indexed a document from the Internet Archive which lists the memento from perma.cc for the The Asian Age news article

As shown in Figure 10, the Internet Archive has archived Bing's soft 404 for the URI-R-SE. Fortunately, archive.is, as shown in Figure 8, archived its memento before the URI-R-SE became a soft 404. At times, we can find URI-Ms to a 404 page indexed in Google search result. As shown in Figure 11, the Google search result for the The Asian Age news article listed a document from Internet Archive which contains the URI-M from perma.cc for the news article.

Sometimes SE caches have pages that are missing (404) from the live web but not yet archived. We should push SE cache URL (URI-R-SE) to multiple web archives. We can automate the process of saving URLs to multiple web archives simultaneously by using ArchiveNow. We can use web archives like archive.is to get the URI-R-SE using the URI-R of the resource which can further to be used to search the other web archives for mementos of the URI-R-SEs.

My studies in web archiving helped me solve a real world problem posed by my brother where he needed the URLs of news articles for which he provided me with the screenshots. I found those URLs in SE caches and pushed them to multiple web archives which will be used by him in his legal proceedings.

------
Mohammed Nauman Siddique
(@m_nsiddique)

↧

2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

October 31, 2019, 8:06 am

≫ Next: 2019-11-03: STEAM on Spectrum at VMASC

≪ Previous: 2019-10-28: The interaction between search engine caches and web archives

Note: This blog post may be updated with additional links to slides and other resources as they become publicly available.

On October 28, 2019, web archiving experts met with librarians and archivists at the George Washington University in Washington, DC. As part of the Continuing Education to Advance Web Archiving (CEDWARC) effort, we covered several different modules related to tools and technologies for web archives. The event consisted of morning overview presentations and afternoon lab portions. Here I will provide an overview of the topics we covered.

Web Archiving Fundamentals

Prior to attending the event Edward A. Fox, Martin Klein, Anna Perricci, and Zhiwu Xie created a brief tutorial covering the fundamentals of web archiving. This tutorial, shown below, was distributed as a video to attendees prior to the event so they could familiarize themselves with the concepts we would discuss at CEDWARC.

Zhiwu Xie kicked off the event with a refresher of this tutorial. He stressed the complexities of archiving web content due to the number of different resources necessary to reconstruct a web page at a later time. He mentioned that it was not just necessary to capture all of these resources, but also replay them properly. Improper replay can lead to temporal inconsistencies, as has been covered on this blog by Scott Ainsworth. He further covered WARCs and other concepts, like provenance, related to the archive and replay of web pages.

Resources:

Slides: Web Archiving Fundamentals
Video: Web Archiving Fundamentals
Slides: Welcome and Web Archiving Fundamentals Recap

Memento

Now that the attendees were familiar with web archives, Martin Klein provided a deep dive into what they could accomplish with Memento. Klein covered how Memento allows users to find mementos for a resource in multiple web archives via the Memento Aggregator. He further touched on recent machine learning work to improve the performance of the Memento Aggregator.

At CEDWARC (https://t.co/J4nFrTuQRh), @mart1nkle1n presents Time Travel on the web with Memento. For more information, see https://t.co/yzXZEwKDvo pic.twitter.com/v8BSR0gZFy
— Shawn M. Jones (@shawnmjones) October 28, 2019

Klein highlighted how to use the Memento browser extension, available for Chrome and Firefox. He mentioned how one could use Memento with Wikipedia, and echoed my frustration with trying to get Wikipedia to adopt the protocol. He closed by introducing various Memento Time Travel APIs available.

First presentation at #CEDWARC is about Memento, a web aggregator that accesses more than 2 dozen web archives. It uses a federated search and ranks the URIs by proximity to the date/time specified. Totally brilliant, and is has web extensions!! https://t.co/gBbJlfIiDp
— Haian (@haian_o) October 28, 2019

Resources:

Slides: Accessing and Using Web Archives

Social Feed Manager

Laura Wrubel and Dan Kerchner covered Social Feed Manager, a tool by George Washington University that helps researchers build social media archives from Twitter, Tumblr, Flickr, and Sina Weibo. SFM does more than archive the pages of social media content. It also acquires content available via each API, preserving identifiers, location information, and other data not present on a given post's web page.

At #CEDWARC, @liblaura and @DanKerchner present @SocialFeedMgr for capturing social media data from multiple networks for #webarchiving. For more information, see https://t.co/LaZltX6WWB pic.twitter.com/rVewiEFld3
— Shawn M. Jones (@shawnmjones) October 28, 2019

As part of the #CEDWARC presentation on @SocialFeedMgr, @liblaura also gives a shout out to other projects like @documentnow and twarc (https://t.co/C9pnaKZE0G). She emphasizes their partnerships with these projects and others. pic.twitter.com/rWRkbsE4tH
— Shawn M. Jones (@shawnmjones) October 28, 2019

Storytelling

I presented work from the Dark and Stormy Archives Project on using storytelling techniques with web archives. I introduced the concept of a social cards as a summary of the content of a single web page. Storytelling services like Wakelet combine social cards together to summarize a topic. We can use this same concept to summarize web archives. I broke storytelling with web archives into two actions: selecting the mementos for our story and visualizing those mementos.

Storytelling With Web Archives from Shawn Jones

I briefly covered the problems of scale with selecting mementos manually before moving on to the steps of AlNoamany's Algorithm. WS-DL alumnus Yasmin AlNoamany developed this algorithm to produce a set of representative mementos from a web archive collection. AlNoamany's Algorithm can be executed using our Off-Topic Memento Toolkit.

Existing platforms do not reliably produce social cards for #mementos! @shawnmjones has an alternative:https://t.co/m8Jez6GdXf #CEDWARC pic.twitter.com/l1mP0GIZlj
— Martin Klein (@mart1nkle1n) October 28, 2019

Visualizing mementos requires that our social cards do a good job describing the underlying content. These cards should also avoid confusion. Because of the confusion introduced by other card services, we created MementoEmbed to produce surrogates for mementos. From MementoEmbed, we then created Raintale to produce stories from lists of memento URLs (URI-Ms).

In the afternoon, I conducted a series of laboratory exercises with the attendees using these tools.

@shawnmjones rocking the Storytelling session at #CEDWARC pic.twitter.com/QNc6mPGYWi
— Martin Klein (@mart1nkle1n) October 28, 2019

Resources:

Slides: Storytelling With Web Archives
Tutorials: Laboratory Exercises

ArchiveSpark

Helge Holzmann presented ArchiveSpark for efficiently processing and extracting data from web archive collections. ArchiveSpark provides filters and other tools to reduce the data and provide it in a variety of accessible formats. It provides efficient access by first extracting information from CDX and other metadata files before directly processing WARC content.

Fourth presentation is on Archive Spark, a powerful and super impressive tool to efficiently process and analyze web archives, with the example of downloading WARC and CDX files from the Internet Archive's Wayback Machine #CEDWARC https://t.co/yKDkU63Qco
— Haian (@haian_o) October 28, 2019

ArchiveSpark uses Apache Spark to run multiple processes in parallel. Users employ Scala to filter and process their collections to extract data. Helge emphasized that data is provided in a JSON format that was inspired by the Twitter API. He closed by showing how one can use ArchiveSpark with Jupyter notebooks.

“Twitter has a nice JSON API that many people and tools already understand” so ArchiveSpark tries to emulate this idea for data from #webarchive collections, according to @helgeho at #CEDWARC https://t.co/nPK6oHbCiN pic.twitter.com/P4Oe68zUqb
— Shawn M. Jones (@shawnmjones) October 28, 2019

Resources:

Slides: Efficient Web Archive Processing With ArchiveSpark

Archives Unleashed

Samantha Fritz and Sarah McTavish highlighted several tools provided by the Archives Unleashed Project. WS-DL members have been to several Archives Unleashed events, and I was excited to see these tools introduced to a new audience.

Archives Unleashed Project team giving a brief overview of the project’s mission and afternoon session! Boom 💥 #CEDWARC pic.twitter.com/NF2FsX9pN9
— Cal Murgu (@CalMurgu) October 28, 2019

The team briefly covered the capabilities of the Archives Unleashed Toolkit (AUT). AUT employs Hadoop and Apache Spark to allow users to provide collection analytics, text analysis, named-entity recognition, network analysis, image analysis and more. From there, they introduced Archives Unleashed Cloud (AUK) for extracting datasets from ones own Archive-It collections. These datasets can then be consumed and further analyzed in Archives Unleashed Notebooks. Finally, the covered Warclight which provides a discovery layer for web archives.

Presenters gave a plug for the Archives Unleashed Datathon, where participants work on collaboratively inspired web archives project at scale using the Archives Unleashed tools. Next one is in March at Columbia! #CEDWARC https://t.co/YBGcQFaH9B
— Haian (@haian_o) October 28, 2019

Resources:

Slides: Analyzing Web Archives with the Archives Unleashed Project

Event Archiving

Ed Fox closed out our topics by detailing the event archiving work done by the Virginia Tech team. He talked about the issues with using social media posts to supply URLs for events so that web archives could then quickly capture them. After some early poor results, his team has worked extensively on improving the quality of seeds through the use of topic modeling, named entity extraction, location information, and more. This work is currently reflected in the GETAR (Global Event and Trend Archive Research) project.

At #CEDWARC, @edwardafox talks about the Event Focused Crawler, to be demonstrated this afternoon. This paper may provide more information on the ideas behind this tool: https://t.co/Wc6rUM09WO pic.twitter.com/aqP7ohSdjv
— Shawn M. Jones (@shawnmjones) October 28, 2019

In the afternoon session, he helped attendees acquire seed URLs via the Event Focused Crawler (EFC). Using code from Code Ocean and Docker containers, we were able to perform a focused crawl to locate additional URLs about an event. In addition to EFC, we classified Twitter accounts using TwiRole.

At #CEDWARC, @edwardafox is now covering some of the work featured at https://t.co/XfbWQILkDM pic.twitter.com/F2oVIcbNpU
— Shawn M. Jones (@shawnmjones) October 28, 2019

Slides: Event Archiving Introduction
Slides: Event Archiving Practicum
Document: Tutorial for EFC and TwiRole on Code Ocean

Webrecorder

Update on 2019/10/31 at 20:42 GMT:The original version neglected to include the afternoon Webrecorder laboratory session, which I did not attend. Thanks to Anna Perricci for providing us with a link to her slides and some information about the session.

In the afternoon, Anna Perricci presented a laboratory titled "Human scale web collecting for individuals and institutions" which was about using Webrecorder. Unfortunately, I was not able to engage in these exercises because I was at Ed Fox's Event Archiving session. Webrecorder was part of the curriculum because it is a free, open source tool that demonstrates some essential web archiving elements. Her session covered manual use of Webrecorder as well as its newer autopilot capabilities.

Slides: Human scale web collecting for individuals and institutions

Summary

CEDWARC's goal was to educate and provide outreach to the greater librarian and archiving community. We introduced the attendees to a number of tools and concepts. Based on the response to our initial announcement and the results from our sessions, I think we have succeeded. I look forward to potential future events of this type.

-- Shawn M. Jones

↧

2019-11-03: STEAM on Spectrum at VMASC

November 3, 2019, 6:41 pm

≫ Next: 2019-11-18: The 28th ACM International Conference on Information and Knowledge Management (CIKM)

≪ Previous: 2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

The second annual STEAM on Spectrum event was held at the Old Dominion University (ODU) Virginia Modeling Analysis & Simulation Center (VMASC) in Suffolk, VA on October 12, 2019.

All of my students (@NirdsLab) are at @vmasc_odu for the "STEAM on Spectrum 2019"#inclusive event for Autistic kids. Nearly 200 participants and 50+ volunteers. @WebSciDL @oducs @ODUSCI @sheissheba @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/f131qYGjkA
— Sampath Jayarathna (@OpenMaze) October 12, 2019

The event, sponsored by IEEE Hampton Roads Section and the VMASC Industry Association, intended to provide inclusive, accessible resources and activities related to science, technology, engineering, art and math (STEAM), and encourage students to pursue careers in the STEAM areas. Event organizers included VMASC, ODU, and the Mea'Alofa Autism Support Center.

It is a beautiful day @vmasc_odu with a great turnout for today’s STEAM on Spectrum 2019 event. @NirdsLab @WebSciDL @oducs @ODUSCI @OpenMaze @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/1DsTqeveJi
— Bathsheba Farrow (@sheissheba) October 12, 2019

In support of the event, Dr. Sampath Jayarantha (@OpenMaze), and multiple WS-DL PhD students volunteered to organize an eye tracking exhibit.

All of my students (@NirdsLab) are at @vmasc_odu for the "STEAM on Spectrum 2019"#inclusive event for Autistic kids. Nearly 200 participants and 50+ volunteers. @WebSciDL @oducs @ODUSCI @sheissheba @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/f131qYGjkA
— Sampath Jayarathna (@OpenMaze) October 12, 2019

The exhibit featured various eye tracking devices, and eye target games. Once participants understood how the eye trackers worked and realized that there was no need to keep moving their heads, children and their parents took on the challenge of hitting various targets with just their eyes. Some families even started competitions among themselves, with children and parents trying earn the day's highest score.

There were several other interactive and hands-on experiences including interactive games, flight and driving simulators, robots, and an activity that converted colors to music.

Participants in today’s STEAM on Spectrum 2019 event can experience more than just eye tracking technologies @vmasc_odu. @NirdsLab @WebSciDL @oducs @ODUSCI @OpenMaze @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/77dTYC6P8z
— Bathsheba Farrow (@sheissheba) October 12, 2019

Event organizers not only ensured that there were fun activities for the children, but also provided them with a goodie bags that included STEAM related toys. Volunteers and participants of the event were also treated to lunch.

ODU CS grad students help children beat the eye tracking game’s highest score, open STEAM on Spectrum 2019 event gifts and themselves to lunch @vmasc_odu. @NirdsLab @WebSciDL @oducs @ODUSCI @OpenMaze @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/9OcXAH2lDZ
— Bathsheba Farrow (@sheissheba) October 12, 2019

The event was well organized and attended, and I expect it to be even better next year. I am glad to have had an opportunity to support the event with others from @WebSciDL and hope to see more volunteers and attendees next year.

--Bathsheba Farrow (@sheissheba)

↧

2019-11-18: The 28th ACM International Conference on Information and Knowledge Management (CIKM)

November 18, 2019, 11:09 am

≫ Next: 2019-11-20: PURS 2020 Proposal Awarded to Support Undergraduate Research in Computer Science

≪ Previous: 2019-11-03: STEAM on Spectrum at VMASC

Students, professors, industry experts, and others came to Beijing to attend the 28th ACM International Conference on Information and Knowledge Management (CIKM). This was the first time CIKM had accepted a long paper from the Old Dominion University Web Science and Digital Libraries Research Group (WS-DL) and I was happy to represent us at this prestigious conference.

CIKM is different from some of our other conference destinations. CIKM's focus spans all forms of information and knowledge, leading to a high diversity in submission topics. The conference organizers classified CIKM papers into topics including advertising, user modeling, urban systems, knowledge graphs, information retrieval, data mining, natural language processing, machine learning, social media, health care, privacy, and security.

There were multiple tracks going in five different rooms across three days. There were 202 long papers, 107 short papers, 38 applied research papers, and 26 demos. I will not be able to summarize all of the impressive work present at this conference, but I will try to convey my wonderful CIKM experiences.

Keynote Speakers

Our first day's keynote was delivered by Professor Steve Maybank from the Department of Computer Science and Information Systems at Birkbeck College in the University of London. He is also a Fellow of the Institute of Mathematics and its Applications, a Fellow of the IEEE, a Fellow of the Royal Statistical Society, and a Fellow of the Higher Education Academy. In his keynote, "The Fisher-Rao Metric in Computer Vision", Maybank covered the mathematics behind one method used to detect curves and circles in computer images. To paraphrase Maybank, his solution is better because it uses simple structures in the image, but creates exact calculations. This has implications in biometrics and facial recognition for more precisely detecting irises and other circular features.

ACM and IEEE Fellow Jian Pei, from Simon Fraser University, delivered our second day keynote on "Practicing the Art of Data Science." He stressed that "as data scientists we are responsible for helping people obtain domain knowledge (from data)." He cautioned against merely using machine learning results as "it is not right to take a black box and try to explain it, instead we should build a model that is interpretable." Jian Pei discussed his own research team's work during his keynote, focusing on citations as a proxy for how generalizable a given paper's work must be. He detailed his team's work on using data science to assist programmers in discovering API methods, finding gangs of war in social networks, discovering the fastest changing portion of a graph over time, network embedding, developing piecewise linear neural networks, and more. He mentioned that a specialized solution for one type of problem may not be scalable and that his team tried to find solutions that covered many different types of problems. Jian Pei closed by stating "if given 1 hour to solve a problem, we should send 50 minutes trying to figure out what the problem is really about."

On our third day, Jiawei Han, Abel Bliss Professor from the Department of Computer Science at the University of Illinois at Urbana-Champaign, discussed methods of converting unstructured data to knowledge in "From Unstructured Text to TextCube: Automated Construction and Multidimensional Exploration." As with previous speakers, Jiawei Han is also an ACM Fellow and an IEEE Fellow and is the author of the popular textbook Data Mining: Concepts and Techniques, 3rd Edition. He spoke of the challenges of converting unstructured text to meaningful content. He covered many of his research team's projects. TopMine uses statistical measure to find groups of words from text and requires on training. Autophrase detects phrases in text via supervised learning. ClusType detects named entities. TaxoGen clusters documents to build a taxonomy from corpora. TextCube allows users to extract summaries and knowledge graphs from corpora. EventCube allows users to apply TextCube to events. WeSTClass classifies text with minimal data. JoSE provides better word embeddings than Word2Vec. Doc2Cube for creating TextCubes in an unsupervised way, without human annotators. Jiawei Han showed how such tools had real world applications. He demonstrated an analysis that provided further evidence that Malaysia Airlines Flight 17 was shot down by Ukrainian separatists. He finished with an analysis of the 2019 Hong Kong protests.

Jianping Shi introduced our fourth day with "Autonomous Driving Towards Mass Production". Jianping Shi is the Executive Research Director of SenseTime. She develops algorithms that facilitate autonomous driving by incorporating data from sensors, maps, and many other sources. Deep learning allows machines to acquire knowledge and skills and China has a lot of data on its people to offer to AI systems. They are currently trying to build the largest dataset ever to advance deep learning. Autonomous driving requires this same level of knowledge to work properly. Using this knowledge, an autonomous driving system can make decisions based on data collected by its sensors. Such sensors are not merely cameras taking in the same information as humans, but also incorporate data from systems such as LiDAR to measure distances and detect objects not visible to normal cameras. The solutions at SenseTime combine information from cameras and LiDAR to produce features for neural networks that are processed via hardware such as FPGAs. With this architecture, their systems can detect issues such as lane departures, potential pedestrian collisions, forward collisions, and correctly recognizing signs and traffic lights in all weather conditions. SenseTime also provides systems that open doors for drivers via facial recognition, determine if children or pets have left their seat, and cover blind spots by providing screens where windows are blocked by parts of the car.

Paper Presentations

As I state in other trip reports, conference presentations contain far more content than can be summarized in a single blog post. The full proceedings from CIKM are available for viewing. Below, I will discuss my own work and also feature some brief summaries of interesting work from CIKM.

My Work

As noted in my previous blog post, I have been evaluating the effectiveness of different types of surrogates for use in understanding web archive collections. CIKM accepted the paper from that work, "Social Cards Probably Provide Better Understanding of Web Archive Collections", and I presented it at CIKM on November 5, 2019. Thanks to Dr. Jian Wu for posting my picture to Twitter and thanks to the original photographer for taking the photo.

Shawn is presenting at CIKM 2020 Beijing. ⁦@shawnmjones⁩ ⁦@WebSciDL⁩ pic.twitter.com/UpKZ0qTy0q
— Jian Wu (@fanchyna) November 5, 2019

In this work, we provide results from an analysis of the existing Archive-It interface and how well it might provide information for consumers of an Archive-It collection. We then take curator generated stories and visualize each using six different surrogate types, the Archive-It interface, browser thumbnails (screenshots), social cards, and three combinations of social cards and thumbnails. We present these visualized stories to Mechanical Turk participants and found that participants scored best on our test question when presented with social cards. In addition, participants clicked on browser thumbnails to view the content beneath them and interacted much with social cards the least. Thus, social cards probably provide better understanding of web archive collections.

Social Cards Probably Provide For Better Understanding Of Web Archive Collections from Shawn Jones

Selected Work From Others

Some sessions, like LR15: Search & Retrieval, I could not attend because there was no more room. Please visit the ACM Digital Library website for access to the freely available proceedings for more detail. Here I will try to quickly summarize some of the work I witnessed.

Because I am trying to ultimately summarize web archive collections, several colleagues have recommended that I become more familiar with recommender systems. I attended Jun Liu's presentation of "BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer" where the author described the issue of evolving user interests and trying to predict their interests using using concepts from Google's BERT language model. Shaoyun Shi provided some insight into a solution to the cold start problem as part of "Adaptive Feature Sampling for Recommendation with Missing Content Feature Values". In that work, the authors detailed the problems with supplying random feature values to neural networks for recommendations and showed how their adaptive feature sampling system produced better results. Wayun Chen presented "A Dynamic Co-Attention Network for Session-based Recommendation" whereby the authors propose breaking a user's preferences into a long term and short term preferences before using a neural network to provide better recommendations based on this model. As a one of the authors of "A Hierarchical Self-Attentive Model for Recommended User-Generated Item Lists", Yun He discussed recommendations in environments containing playlists (e.g., YouTube, Goodreads, Spotify) and how user-generated playlists could be used to recommend other items to those users' followers. I'm intrigued by Yun He's work because it makes me wonder if Archive-It collections by the same curator can provide insight into each other. Di You mentioned how most users get their news from social media and even scholars have issues separating fake from real news. As part of "Attributed Multi-Relational Attention Network for Fact-checking URL Recommendation", Di You and her authors apply neural networks to a model that takes into account the user's social network in order to recommend fact-checking URLs to users. Any WS-DL member interested in studying fake news should look at Di You's (links: to advisor's homepage, advisor's Google Scholar) work.

We often employ Natural Language Processing (NLP) at WS-DL for solving a variety or problems, with Alexander Nwala's sumgram being a recent contribution to this area. To assist in this effort, I tried to attend the session on NLP. Natural language contains nested relations whereby concepts relating to the same entity are expressed within the same clause. For example: "the GDP of the United States in 2018 grew 2.9% compared with 2017" contains at multiple relations (GDP growth rate in 2018, compared to 2017, GDP of the US). As part of "Nested Relation Extraction with Iterative Neural Network", Yixuan Cao and co-authors model nested relations from text as directed acyclic graphs and then employ iterative neural networks to help the system to identify these relations. Word and document embeddings have a variety of uses in digital libraries, from expanding search terms to document clustering, and Tyler Derr builds on the state of the art in "Beyond word2vec: Distance-graph Tensor Factorization for Word and Document Embeddings." Derr and his co-authors identify the problem of comparing word and document embeddings to each other and propose a solution by modeling the entire corpus. In "Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters", Yun Zhang provided a solution for Chinese that cannot be applied to English because each character contains within it additional features that can be used to produce better embeddings. Chen Shiyun employed sequential neural networks and tagging to improve sentiment analysis of text in "Sentiment Commonsense Induced Sequential Neural Networks for Sentiment Classification" and showed that their solution performed better on three datasets. In "Interactive Multi-Grained Joint Model for Targeted Sentiment Analysis", Da Yin and co-authors employ neural networks to tackle the issue of polarization of sentiment where different parts of a sentence may contain different sentiments. WS-DL members seeking to understand sentiment in longer form text like news articles should examine either of these last papers.

As we study the graphs in social networks and perform link analysis on web content, network science becomes an important part of the WS-DL's work. In "Discovering Interesting Cycles in Directed Graphs", Florian Adriaens detailed how interesting cycles can be identified and measured using different measures and that finding interesting cycles is NP-hard. To assist users trying to find interesting cycles, they propose a number of heuristics. Bipartite graphs joining two sets can be used to model a number of problems. As part of "FLEET: Butterfly estimation from a Bipartite Graph Stream", the authors present FLEET, a suite of algorithms for estimating the number of butterflies, a specific type of bipartite graph. Tyler Derr presented "Balance in Signed Bipartite Networks", where he and his co-authors perform an analysis of signed butterflies in bipartite networks, develop methods to predict the signs of edges, and evaluate these methods on real-world signed bipartite networks. Albert Bifet presented his groups work on estimating the betweenness centrality of a node in the graph as part of "Adaptive Algorithms for Estimating Betweenness and k-path Centralities." The centrality of a node may indicate its importance (e.g., most important person in a social network) and estimating its centrality may help a system more quickly focus on specific nodes of interest. Of potentially interest to my work is Chen Zhang's "Selecting the Optimal Groups: Efficiently Computing Skyline k-Cliques." The skyline of a graph consists of objects that may dominate but are not dominated by other objects, and hence may be similar to the authorities in the HITS algorithm. Her work has implications in identifying members to assign to teams as well as product recommendations.

A variety of solutions exist in various algorithmic techniques that may be of interest to our future research. As part of "On VR Spatial Query for Dual Entangled Words", the authors explore the issue of spatial queries, those queries relevant based on the location of the user, that apply to both the real and VR world at the same time. Quang-Huy Duong presented "Sketching Streaming Histogram Elements using Multiple Weighted Factors" which provides solutions for estimating a histogram from a stream of incoming data. Such a histogram can be used to computer the similarities of two streams and predict when one may diverge. As part of "Improved Compressed String Dictionaries", Guillermo de Bernardo presented a novel solution to compressing the names of nodes in a network and providing for fast lookup times during network analysis. With IR research, participants are typically asked about the relevance of documents to a query and are asked to employ some scale estimating that relevance. Unfortunately, not all scales have the same number of points. Lei Han analyzed the TREC datasets to address the problem of transforming scales for comparison as part of "On Transforming Relevance Scales." In "Streamline Density Peak Clustering for Practical Adoption", the authors discuss the 2014 general purpose Density Peak Clustering (DPC) algorithm and roadblocks that have prevented its practical adoption. The authors propose a drop-in replacement for DPC, Streamlined DPC, that has enhanced speed and scalability. From this section, the spatial query work, histogram estimation, and node compression may be of interest to other WS-DL students. I may be able to apply the scale transformation and SDPC to my own summarization work.

My session was titled "User Behavior" and included a lot of other interesting work. Dustin Arendt identified problems with performing graph analysis on social networks, specifically that real-world graphs are dynamic and forecasting future behavior using existing static "snapshots" fails to incorporate information from the history of the graph. In "Learning from Dynamic User Interaction Graphs to Forecast Diverse Social Behavior", Arendt and his co-authors provide solutions for forecasting directly over dynamic graphs. Their solution is generalizable across social platforms and can do more than just predict new edges. WS-DL members investigating social networks should review Arendt's work. In "Exploring The Interaction Effects for Temporal Spatial Behavior Prediction", the authors create a model that creates representations of a user's action, location, time, and other user information int the same latent space. This way they can provide better recommendations based on predicted user behavior. As part of "Understanding Default Behavior in Online Lending", the authors are interested in using social networking to identify borrowers who are more likely to default on their microcredit loans. While analyzing data provided by lending platform PPDai, the authors identified "cheating agents", a new type of user who teachers borrowers how to cheat the system. They propose a framework for predicting both cheating agents and those who will default. Of some interest to the user study portion of my dissertation work is that of Anna Squicciarini, who, in "Rating Mechanisms for Sustainability of Crowdsourcing Platforms", identifies the issues in ensuring that Crowdsourcing platforms not only remain fair to crowd workers, but also sustainable. The authors introduce rating mechanisms and show how their new model can be used to improve Amazon's Mechanical Turk.

I attended a second natural language processing session that had some interesting ideas both for me and other WS-DL members. Topic modeling is one of the possible methods I will use to select representative samples from web archive collections. In "Federated Topic Modeling", the authors explore the issues with separating proprietary and sensitive training data for topic modeling. Their solution guarantees privacy protection while also limiting network communications, allowing for their solution to be used on high latency or low bandwidth networks. Chuhuan Wu presented "Sentiment Lexicon Ehanced Neural Sentiment Classification" whereby the authors employ sentiment lexicons to improve sentiment analysis with neural networks. They demonstrate two approaches with experimental results showing improvement of the state of the art. Wei Huang discussed a hierarchical multi-label text classification problem for organizing documents as part of "Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach." The authors employ a Hierarchical Attention-based Recurrent Neural Network, which means that it requires training data. I am particularly interested in this work and would like to see how I might be able to apply it to web archive collections, but do not yet have training data to start with. In "Multi-Turn Response Selection in Retrieval-Based Chatbots with Iterated Attentive Convolution Matching Network", the authors cover the issues of developing chatbots that are capable of not only responding to the last question asked, but also the whole context of the conversation. They employ an Iterative Attentive Convolution Matching Network (IACM) to solve the issue of context and are inspired by existing work with reading comprehension. Almost all of the work I encountered at CIKM employed some type of neural network, so I was pleasantly surprised by Md Zahidul Islam's presentation on "A Semantics Aware Random Forest for Text Classification" which provided an improvement over a different, but familiar classifier. The authors propose an improvement over how the Random Forest classifier selects its decision trees. I look forward to evaluating their new SARF algorithm the next time I need to use Random Forest to classify text.

Computer Vision is the name given to the act of computers trying to provide understanding of images. It is an area that the WS-DL group has little expertise in. I attended a session on this topic in hopes that I might gain some insight into how we might apply this area to web archive collections. Xiaomin Wang presented "Video-level Multi-model Fusion for Action Recognition" where her team worked with a convolutional neural networks to identify actions (e.g., a person throwing a ball) from video. On top of being impressed with their accomplishments, I am also intrigued by their solution because it incorporates a support vector machine at some point to process one of the layers of the neural network. Andrei Boiarov tried to address the problem of identifying landmarks in photos as part of "Large Scale Landmark Recognition via Deep Metric Learning." Their solution tries to overcome the problem of classification of a landmark (i.e., when is it a building, a statue, a wall?) via a novel scoring mechanism applied to their neural network. The world known to a classifier can be closed, but increasingly we want systems that can deal with and classify the open world in which we live. Xiaojie Guo tackles this problem with her coauthors in "Multi-stage Deep Classifier Cascades for Open World Recognition" whereby they propose a classifier architecture that provides a generic framework for solving this problem. Humans infer context from complex images, but neural networks cannot do the same from pixels. Manan Shah and his coauthors attempt to provide a solution for incorporating and predicting text for images as part of "Inferring Context from Pixels for Multimodal Image Classification." While these other researchers focused on acquiring meaning from image data, Mingkun Wang presented a different concept in "Multi-Target Multi-Camera Tracking with Human Body Part Semantic Features" whereby the authors combine content from multiple cameras to track individual people across different views, improving on previous efforts. Mingkun Wang's work does not yet have implications for web archives, but can be employed for security and AR applications.

I was quite impressed with the content that I viewed from all presenters and was not able to provide a summary of all short papers or long papers sessions. I was intrigued with Phillipp Christmann's work in "Look before you Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion" on improving the conversational ability of systems like Siri and Alexa by employing localized information from a knowledge graph to ensure that the system understands follow-up questions. Sarah Oppold provided a demonstration of "LuPe: A System for Personalized and Transparent Data-driven Decisions", a personalized transparent decision support system for evaluating the credit worthiness of a user without employing a single model for all users, thus avoiding unfairness inherent with other systems. I intend to read through the proceedings to find all of the items I missed.

Tutorial Sessions

Prior to the conference, multiple tutorials were offered on a variety of topics. Each tutorial lasted half of a day, so I could only choose two of the eight available. These tutorials provided each presenter an opportunity to cover each topic's state of the art, including the presenter's work.

Abdullah Mueen from the University of New Mexico presented "Taming Social Bots: Detection, Exploration And Measurement." Mueen covered the behaviors of social media bots, specifically with an eye toward identifying them based on their behavior. Bots have signatures from multiple feature categories: time, space, text, content, and network. Building on this, Mueen covered many different research projects for detecting bots. Botornot tries to classify bots based on these features. DeBot tries to detect bots based on the frequency and synchronicity of posts on a topic. Mueen's BotWalk relies upon the concept that bots form cliques in social networks and uses these cliques to identify them. BotWalk is able to detect 6,000 Twitter bots per day compared to 1,619 bots/day with DeBot and 1,469 bots/day with Botornot. Other research projects for detecting bots include the Rest Sleep Comment Model, CopyCatch, and Bot Dynamics. WS-DL members who focus on social media should further investigate this tutorial and other parts of Mueen's work.

My second tutorial, "Learning-Based Methods with Human-in-the-Loop for Entity Resolution", was developed by Lucian Popa and Siaram Gurajada, both from IBM Research. Popa and Gurajada covered the complex problem that I have referred to before as disambiguation - essentially determining if two entities are the same (e.g., Obama and Barack Obama). They covered different models for solving the problem. Explainable representation languages for this topic include Dedupalog, Markov Logic Networks, Declarative Entity Linking, and HIL. Learning based methods that incorporate machine learning and other techniques include Corleone/Falcon, MLN Learning, SystemER, and ExplainER. They discussed how one can improve results by incorporating crowdsourcing via tools like Amazon's Mechanical Turk, but also highlighted the shortcomings of these tools as crowd workers are often not experts on the entities being resolved. I am interested in these concepts because I seek to improve collection understanding of web archive collections and entities help provide meaning.

BigScholar 2019

BigScholar is a workshop that allows academics and practitioners to discuss ideas and concepts concerning big scholarly data. The goal is to focus on problems and solutions for knowledge discovery, data science, machine learning and more. I attended this workshop in the hopes that I would learn more about the intersection of git data and scholarly communications. BigScholar had taken place alongside WWW from 2014 - 2017, moved with KDD in 2018, and was part of CIKM in 2019. In 2020, the conference will unify several other workshops of the same type at ACL in Seattle.

Mattia Albergante, from the publisher Frontiers, gave the keynote for BigScholar 2019. The goal of Frontiers is to provide open access publishing across 71 journals covering more than 600 academic disciplines. They are the 14th largest publisher in the World and the 5th most cited publisher. Submissions to a Frontiers journal have an average review time of 92 days. Albergante spent the bulk of his talk highlighting the digital products that Frontiers developed in order to keep this review time reasonable while also maintaining quality. Key to assisting in peer review is their Artificial Intelligence Review Assistant (AIRA). AIRA speeds up their process by: (1) electronically evaluating the quality of submissions, and (2) identifies reviewers. Other systems may handle identifying plagiarism, but AIRA goes farther to identify potential personally identifiable images that violate privacy, issues with language quality, and images that may be mistakes. Once these issues have been mitigated, AIRA can move on to identifying which reviewers would best match the submission. Often reviewers are contacted by publishers to review a paper and tell the publisher that the paper is "not my field." AIRA employs text analysis, information from Microsoft Academic, various neural networking, and other technologies. Frontiers has reduced the "not my field" declination rate from 47% based on keywords alone to 15% using AIRA, setting themselves apart in quality from other open-access publishers like PLoS.

BigScholar also had four presentations. Yu Zhang presented "From Big Scholarly Data to Solution-Oriented Knowledge Repository" whereby he discussed using natural language processing to mine solutions from high impact papers so that scholars can more quickly solve research problems. Hyeon-Jo Jeon discussed "Is Performance of Scholars Correlated to their Research Collaboration Patterns?" where she covered her work on how co-authorship networks reflect different research styles and collaboration patterns have a high correlation to research performance, but this correlation does not always hold. Zhuoya Fan represented her team's work from "ScholarCitation: Chinese Scholar Citation Analysis Based on ScholarSpace in the Field of Computer Science" which examined cross-language citations between Chinese and English speaking scholars and found that even though there are different patterns in each field, Chinese Scholars tend to favor citing English papers. Kei Kurakawa presented his team's work in "Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence" (link not found) whereby they create an approach that allows for generating a new classification scheme using existing bibliographic databases.

Social

Our hosts treated us to lunch each day, a wonderful reception on Monday, and a fantastic banquet on Tuesday. During the banquet we were entertained with various traditional Chinese art forms including lion dances and brief Beijing opera.

The banquet acquainted us with a number of Chinese dishes. Many Americans are acquainted with a style of food that is named "Chinese" but is truly adapted for American palates. In this case, we were able to sample from a large variety of "true" Chinese dishes using a turntable at the center of the table. It was here that I was also able to acquaint myself with fellow conference attendees. There are too many names to list here, but I appreciate all of the assistance and guidance I received from those who were not only local and familiar with Chinese customs but also those who had attended previous CIKM conferences. This made the social engagements all the more enjoyable.

Summary

A lot happened at CIKM. The conference is a very different venue than this research group regularly attends, and that is a good thing. My horizons were expanded quite a bit during this trip. I was able to acquaint myself with a number of students on the other side of the Pacific Ocean as well as many in Europe and North America that I would never have met otherwise. I now understand things about natural language processing, networks, computer vision, and a host of other topics. Next year, CIKM 2020 will be in Galway, Ireland where Brian Wall assures me they have an exciting new program planned for all attendees.

-- Shawn M. Jones

↧

2019-11-20: PURS 2020 Proposal Awarded to Support Undergraduate Research in Computer Science

November 20, 2019, 9:08 am

≫ Next: 2019-11-26: Summary of "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub"

≪ Previous: 2019-11-18: The 28th ACM International Conference on Information and Knowledge Management (CIKM)

Image result for research papers pictures

I am delighted that my proposal entitled "Toward Knowledge Extraction: Finding Datasets, Methods, and Tools Adopted in Research Papers" is awarded under the Program for Undergraduate Research and Scholarship (PURS) by the Old Dominion University Perry Honors College Undergraduate Research Program, in cooperation with the ODU Office of Research.

With the increasing volumes of publications in all academic fields, researchers are under great pressure to read and digest research papers that deliver existing and new discoveries, even in niche domains. With the advancement of natural language processing (NLP) techniques in the last decade, it is possible to build frameworks to process free textual content to extract key facts (datasets, methods, and tools) from research papers. The goal of this project is to develop a machine learning framework to automatically extract datasets, methods, and tools from research papers in Computer and Information Science and Engineering (CISE) domains. The tools the proposed research will leverage include but not limited to information retrieval, NLP, machine learning, and deep learning techniques. The outcome will become the preliminary results towards a future proposal on building a knowledge base (KB) of research papers in open academic domains. The KB will benefit many research topics such as citation and expert recommendation, artificial tutors, and question-answer systems.

The project will hire a self-motivated undergraduate student in Computer Science or related fields to conduct research, under the supervision of Dr. Wu, assistant professor of Computer Science. The project is conducted in the Spring, Summer, and Fall semesters through the calendar year of 2020.

Jian Wu

↧

2019-11-26: Summary of "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub"

November 27, 2019, 4:06 pm

≫ Next: 2019-12-03: Excited to be back at ODU and even more excited to be part of the WS-DL group!

≪ Previous: 2019-11-20: PURS 2020 Proposal Awarded to Support Undergraduate Research in Computer Science

Figure 1: The Life-Cycle of a Vulnerability (Source: Horawalavithana)

Cyber security attacks can be enabled by the fact that many widely-used applications share open-source libraries. As a result, a vulnerability or software weakness in one of these libraries can have far reaching impact. Once discovered, security experts may announce the vulnerability on a variety of forums, blogs, and social media sites. Cyber-adversaries might also explore these public information channels and private discussion threads on the dark web to identify potential attack targets and ways to exploit them.

In their 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI '19) paper, "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub", Sameera Horawalavithana, Abhishek Bhattacharjee, Renhao Liu, Nazim Choudhury, Lawrence O. Hall, and Adriana Iamnitchi present a quantitative analysis of user-generated content related to security vulnerabilities on three digital platforms: two social media conversation channels (Reddit and Twitter) and a collaborative software development platform (GitHub). Their analysis shows that while more security vulnerabilities are discussed on Twitter, relevant conversations go viral earlier on Reddit. They also show that the two social media platforms can be used to accurately predict activity on GitHub.

https://www.slideshare.net/SameeraHorawalavithana/mentions-of-security-vulnerabilities-on-reddit-twitter-and-github

Dataset
The authors investigated security vulnerabilities and their mention in social media over a period of 18 months using a private data set released by DARPA as part of their 2018 Computational Simulation of Online Social Behavior (SocialSim) challenge. First, Horawalavithana et al. selected Common Vulnerability Exposure (CVE) identifiers published in the National Vulnerability Database (NVD) between January 2015 and May 2018. Second, the authors filtered posts shown on Reddit and Twitter between March 2016 and August 2017 to identify posts which mentioned a CVE ID. Last, they selected repositories from GitHub that were related to the CVE IDs already identified. The longer time frame for the NVD was chosen to allow comparison of the timing of conversations on Reddit and Twitter and the public disclosure of vulnerabilities in the NVD. The use case and supplemental processing, if any, for each data set are described below.

The NVD serves as the standard repository for publicly disclosed security vulnerabilities. The NVD fields of interest include the published date, the CVSS score (0-10), and the attack severity (i.e., critical, high, medium, low). In Figure 1, Horawalavithana et al. describe the life cycle of a vulnerability using three defined phases adapted from related research.

Black Risk Phase is the period between initial identification and public disclosure of the vulnerability. The exploitation risk is highest during this phase and discussions might be observed among small communities on public and private forums.
Grey Risk Phase occurs when the NVD accepts the vulnerability and continues until the vendor releases an official patch or countermeasure.
White Risk Phasecovers the time needed to deploy the countermeasure.

The Social Media Datasets from Reddit and Twitter consist of conversations and tweets that include at least one CVE ID. The authors used the regular expression pattern of CVE-\d{4}-\d{4}\d* to match CVE IDs that appeared in posts, comments and tweets. In addition, the dataset was augmented to include re-tweet cascades, sentiment analysis (polarity and subjectivity), and bot detection. Bot driven messages were removed in favor of human responses. Table 1 provides a comparison of the social media platforms in terms of the volume of security vulnerability mentions.

Table 1: Size of dataset. Activities represent posts (18%) and comments (82%) in Reddit, tweets (76%), re-tweets (19%) and replies (5%) in Twitter, and events (push, issue, pull-request, watch, fork) in GitHub. Communities are represented by subreddits in Reddit, hashtags in Twitter, and software repositories in GitHub. (Source: Horawalavithana)

The GitHub Dataset focused on public repositories that have one of the CVE IDs from the NVD dataset in the repository text description or a Git commit message. The same regular expression was used to match CVE IDs in the text descriptions. The authors noted a significant overlap between both the NVD and the social media dataset with a 47% and 3% overlap in the CVE IDs observed on Twitter and Reddit, respectively.

Data Analysis
Previous work in the area of online cyber security discussions suggests that messages shared on social media platforms can be used as early signals to detect security vulnerabilities. With this knowledge, Horawalavithana et al. analyzed the reaction of each social media platform. Further, they attempted to show how information disseminated on Reddit and Twitter relates or compels software development activities in GitHub repositories. Core questions of interest addressed by the authors include:

How do social media platforms compare in terms of these signals on security vulnerabilities?
To what extent are named vulnerabilities discussed on public channels before the official disclosure day?

CVE Mentions in Reddit and Twitter
The authors characterized the social media platforms based on the appearance of CVE IDs. They analyzed the CVE IDs discussed only on Twitter, only on Reddit, and on both platforms. As shown in Figure 2, 10,257 CVE IDs were mentioned in the Reddit-Twitter dataset. 95% of CVE IDs were mentioned only on Twitter. 0.5% of the CVE IDs were mentioned only on Reddit. 4.5% were mentioned on both platforms.

Figure 2: More security vulnerabilities are discussed on Twitter. (Source: Horawalavithana)

The timing of mentions relative to public disclosure of the vulnerability was used to describe early signals. Of the 10,209 CVE IDs discussed on Twitter, 17% were mentioned before their public disclosure. During the same time frame, of the 460 CVE IDs discussed on Reddit, 51% were mentioned in advance of public disclosure. Figure 3 shows the daily volume of posts and tweets as related to Day 0 which represents the NVD public disclosure date. The published date of the message (post/tweet) is relative to NVD public disclosure date of the mentioned CVE ID. Horawalavithana et al. observed that both Reddit and Twitter have mentions of CVE IDs more than a year prior to public disclosure. They also observed a spike in CVE mentions around Day 0 on both platforms.

Figure 3: Both platforms show a peak in the mentions of CVE IDs near their public disclosure (Source: Horawalavithana)

Discussions on Reddit and Twitter were classified by topics suggested by subreddits for Reddit and hashtags for Twitter. As noted in Tables 2 and 3, the majority of CVE IDs found on both Reddit and Twitter were discussed before public disclosure.

Table 3: Top 10 subreddits by the total number of posts published. (Source: Horawalavithana)

Table 4: Top 10 hashtags by the total number of tweets published. (Source: Horawalavithana)

CVE Mentions in GitHub Actions (Software Development)

Next, the authors considered how GitHub activity typically follows the public disclosure of security vulnerabilities. There were 10,502 distinct CVE IDs included in the text descriptions of GitHub events. As shown in Table 5, most CVE IDs appear in commit messages. While the majority of CVE IDs are mentioned in only one GitHub event, there are some vulnerabilities mentioned in multiple repositories. The pattern of observed GitHub activity over time is shown in Figure 4 based on a vulnerability associated with the Linux kernel. Horawalavithana et al. surmise that spikes in activity sometimes months after public disclosure are likely due to the software development life cycle where vulnerabilities are not addressed until after a major exploit. Through calculations of dynamic time warping (dtw), the authors attempted to determine similarities between time series events. As expected, push events were popular since this is the mechanism used to contribute to a repository. On the other hand, the authors also observed similarities between fork and watch activities which are measures of popularity (dtw is 323 and 263, respectively). Their analysis suggests that only certain types of GitHub activity are influenced by the volume of CVE mentions which they attribute to interest in learning about bug fixes or developing exploit code.

Table 5: Distribution of distinct CVE IDs as they appeared in GitHub event texts. (Source: Horawalavithana)

Figure 4: The distribution of GitHub events associated with CVE IDs. The insert presents the number of GitHub events over time that are related to CVE-2015-1805, a vulnerability in the Linux kernel. (Source: Horawalavithana)

Predicting GitHub Activities
Finally, Horawalavithana et al. investigate whether Twitter and Reddit CVE mentions help predict the actual activity on GitHub repositories. Activity in this area strengthens the case that online social media platforms create an ecosystem in which signals travel across platforms. Predicting GitHub activities may be important because:

GitHub hosts many exploits and patches related with CVE IDs.
Predictions might reflect the software development activities of an attacker who develops an exploit.
Predictions can be used to estimate the availability of a patch related to a security vulnerability.

Machine Learning
The authors trained two machine-learning models to predict GitHub events. A GitHub event consists of the type of action (as listed in Table 5), associated GitHub repository, the identity of a user who performed the action, and the event time-stamp. The model's features include the daily count of posts, daily count of active authors, daily count of active subreddits, and daily counts of comments on Reddit; and daily count of tweets, daily count of tweeting users, daily count of retweets, and daily count of retweeting users. Using a much larger feature value vector, the authors trained a recurrent neural network for each GitHub event type to predict the likelihood of a user action to a given GitHub repository in a particular hour. Expanded features used for prediction include daily counts of posts, active authors, active subreddits, comments, tweets, and retweets. Horawalavithana et al. used GitHub data from January to May 2017 as training data, the following two months as validation data, and the month of August as test data.

Prediction Results
Since these events measure the popularity of a GitHub repository, prediction results were reported by the authors only on fork and watch events in GitHub. The distribution of forks and watches are presented Figure 5a and 5b. The authors measured the similarities between ground truth and their simulated predictions of GitHub events using Jensen-Shannon (js) divergence; a statistical method which quantifies the difference between finite random variables using a range of 0 (indistinguishable) and 1. Horawalavithana et al. determined their predicted distribution of fork and watch events were nearly equivalent to ground truth with low js divergence scores of 0.0029 and 0.0020, respectively. Further, the coefficient of determination (R squared) which measures the goodness of fit of a model was determined to be 0.6300 and 0.6067 for the predicted events where 1 is considered a perfect fit. Finally, the authors examined their predictions via a time series, Figure 6, tracks the growth of the most active GitHub repository in August 2017. Horawalavithana et al. concluded their simulations followed accurately for the first week but observed limitations in the predictive power of their model over longer intervals.

Figure 5: GitHub Popularity: the distribution of a) Fork events and b) Watch events across GitHub repositories. (Source: Horawalavithana)

Figure 6: The growth of the most active GitHub repository by the number of daily events occurring in August 2017. (Source: Horawalavithana)

Summary Discussion
This paper compares the volume and timing of security vulnerability mentions on three social platforms over a period of 18 months, from March 2016 to August 2017. In addition, Horawalavithana et al. present machine-learning models that predict the patterns of popularity and engagement level activities in GitHub using information gleaned from Reddit and Twitter. The authors theorized and concluded that diverse online platforms are interconnected such that the activities in one platform can be predicted based on the activities in others. Their conclusions were based on the following observations:

The volume of security vulnerability mentions is significantly higher on Twitter than on Reddit and appear slightly earlier. This suggests Twitter is a better platform to monitor for early vulnerability alerts.
Most of vulnerability mentions on Reddit occur before public disclosure. Deeper levels of discussion among professional communities were also noted on this platform.
The majority of GitHub activity occurs after public disclosure of a vulnerability. Signals from Reddit and Twitter may be useful for predicting events in repositories which mention vulnerabilities. Here, Horawalavithana et al. stress they are not suggesting that activity on Twitter and Reddit directly affects or drives activity observed on GitHub.

The findings of Horawalavithana et al. could be practically applied to:

Advance or calibrate security alert tools based on information from multiple social media platforms.
Coordinate software development activities with the lessons learned from social-media information.

-- Corren McCoy (@correnmccoy)

Horawalavithana, S., Bhattacharjee, A., Liu, R., Choudhury, N., O Hall, L., & Iamnitchi, A. (2019, October). Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub. In IEEE/WIC/ACM International Conference on Web Intelligence (pp. 200-207). ACM. doi: 10.1145/3350546.3352519

↧

2019-12-03: Excited to be back at ODU and even more excited to be part of the WS-DL group!

December 2, 2019, 9:47 pm

≫ Next: 2019-11-20: Trip Report to K-CAP 2019

≪ Previous: 2019-11-26: Summary of "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub"

I joined ODU as an M.S. student in Fall 2009 and was fortunate to be immediately hired as a Research Assistant. As an RA, I was exposed to many interesting and intriguing problems in then emerging fields of Computer Science, which triggered my passion for research. At some point, I started imagining how wonderful it would be if I could have a career in research and teaching. Fast forward to 2019, I am back at ODU as an Assistant Professor!

I received my Ph.D. from Stony Brook University in 2018, and continued my research there as a Postdoctoral Associate till the end of summer 2019. My research mainly centered around Web Accessibility, with specific focus on making the web more usable for people with visual disabilities. This work was very rewarding, and I am thankful for the support from Stony Brook University (Catacosinos Fellowship 2015) and National Science Foundation (Co-PI on a grant).

Back to the present - I am excited to be a member of the WS-DL group at ODU. As Michael says, Web Accessibility and Web Archivability are close cousins, and I am very eager to explore their intersection. I am also keen on investigating other research topics in Human Computer Interaction, Artificial Intelligence, and Natural Language Processing. It sure feels good to return to my home turf ;)

--Vikas Ashok--

↧

2019-11-20: Trip Report to K-CAP 2019

December 7, 2019, 7:47 am

≫ Next: 2019-12-12: Bhanuka Mahanama (Computer Science PhD Student)

≪ Previous: 2019-12-03: Excited to be back at ODU and even more excited to be part of the WS-DL group!

Between November 18 and 20, I attended the 2019 International Conference on Knowledge Capture (K-CAP 2019). K-CAP is an ACM sponsored conference, rated as “A” in the ERA conference rating system. It happens once every two years. Its counterpart in Europe is EKAW (unfortunately, EKAW is rated as B), which also happens every two years. I had papers accepted by K-CAP 2015 and 2017.

This year, I co-authored a short paper titled “Searching for Evidence of Scientific News in Scholarly Big Data” (poster link). The first author is my co-advised student Reshad Hoque at ODU. I also co-authored and presented a long paper titled “Automatic Slide Generation for Scientific Papers” in the 3rdInternational Workshop on Capturing Scientific Knowledge (SciKnow 2019). The first author is my co-advised student Athar Sefid at Penn State. Due to my tight schedule, I had to return right after the keynote by Peter Clark on the first day, so this trip report summarizes the SciKnow workshop, the tutorial on “Build a large-scale cross-lingual text search engine from scratch”, the Poster session for short papers, and the keynote session.

At the beginning of the SciKnow workshop, Dr. Yolanda Gil gave the keynote speech titled “Hypothesis-driven data analysis”. Yolanda and her group have been working on ontology linking and knowledge extraction for a long time. She presented a lot of top-level research including how to automate hypothesis testing, which is related to a very popular topic on R&R (repeatability & reproducibility). She also talked about workflow alignment and merging, model coupling, combination, and distribution, the ontology of future scientific papers. The talk was wrapped up by an open question on how to use automatic hypothesis testing for decision making. The topics were very timely and interesting, but it was a little bit diverse. She mentioned several existing and on-going projects, such as WINGS and MINT (Model INTegration).She introduced a website https://www.scientificpaperofthefuture.org/. They had been holding sessions to train scientists to write papers in a more structured, complete, repeatable, and reproducible manners. I definitely agree with her about the way to write future scientific papers. The only question I had was how the hypothesis are represented and how they can be tested and updated automatically.

The SciKnow 2019 workshop features 4 long papers and 4 short papers. Each long paper is given 20 minutes including 15 min presentation and 5 minutes Q&A.

In the presentation titled “Semantic Workflows and machine learning for the assessment of carbon”, the authors attempted to build a classifier to distinguish between grass, trees, water, and imperviousness regions. I am impressed that he manually annotated several thousand Google earth images.

Another talk that interested me was about a knowledge graph (KG) system called ORKG(open resource knowledge graph). The KG was designed to answer 2 questions. (i) How to compare research contributions in a graph-based system? (ii) How to specify and visualize research contribution comparison in a user interface? Different from most KG systems which are constructed based on NLP techniques, this KG was based on crowdsourcing. The current analysis is based on a relatively small sample (on the order of tens). Although human-labeled data tend to be more accurate, how to scale up the system is a problem the authors need to overcome.

Other useful resources I noted in the workshop include

· The GLUE benchmark for NLP tasks

· Word embedding models (some I did not know): ULMFit, OpenAI GPT, BERT, XLNet, characters (ELMo) Subwords (GPT, BERT), ngrams (fastText)

· BIO2RDF: an API to generate RDF triples in biological domains

· A knowledge graph mapping tool called YARRML?

· The WDPlus project and its framework called T2WML

I missed the tutorial called “Hybrid Techniques for Knowledge-based NLP - Knowledge Graphs meet Machine Learning and all their friends” by Jose Manuel Gomez-Perez, Ronald Denaux and Raul Ortega in the morning, but attended one called “Building a large scale cross language text search engine” by Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia and Oscar Corcho from Universidad Politécnica de Madrid (UPM). The tutorial started with simple IR concepts such as tokenization and lemmatization. It then dived into the topic models, such as the Latent Dirichlet Allocation and how to represent documents using topical vector and then apply the Approximate Nearest Neighbor to classify documents. For the cross-lingual part, they adopted Spanish. Asian languages (e.g., CJK) are not covered. The tutorial presenter used Google classroom to present all the source codes and their results. They use Docker to encapsulate everything needed for demos, so we can play them on our personal computers. These tools can be borrowed for my future courses and tutorials.

The poster session is held in the Information ScienceInstitute (ISI), about 25 minutes’ walk from the conference hotel (yes, I walked up there). Some interesting posters drew my attention.

Jointly Learning from Social Media and Environmental Data for Typhoon Intensity Prediction. In this study, entities are extracted from social media data and encoded into vector representations, which are then concatenated with conventional environmental data to predict the intensity of typhoons. About 100k tweets were collected in a time range of about 10 years. They used Spacyfor NER and then ConceptNet for semantic embedding. The architecture includes a single direction LSTM (used to encode deep semantic feature) and a feedforward network (with dropouts) followed by a softmax to generate a probability. My question was how useful the model is because (i) to collect a sufficient amount of data about this typhoon, we may have to wait for a long time, when we may miss the typhoon; (ii) the intensity of a typhoon may change over time. But feature analysis in this paper does indicate that social media features are more important features than environmental features, which was surprising.

Understanding Financial Transaction Documents using Natural Language Processing. The authors developed a pretty sophisticated way to detect ineligible reimbursement items in financial systems. The proposed system uses customized (may need to retrain) tesseract OCR to extract text out of scanned or pictured reimbursement reports. They then do entity extraction, and semantic analysis to identify items that do not comply with certain restrictions. For example, “Spa treatment massage” is identified to be an ineligible item. The system is developed for commercial uses, so it uses some proprietary datasets and tools. The evaluation gives decent performance with F1=85% based on 6000 test samples.

The first keynote was given by Peter Clark, the director of the Aristo project by AllenAI (AI2). The Aristo project is aimed at building an intelligent system that is able to capture and reason with scientific knowledge. The system starts by converting problems and relevant information into structured knowledge. They tried relational database tables and knowledge graphs and found that the latter works better. They are now able to answer multiple-choice questions at the 8^th grade level such as which month has the longest day time in New York City? Actually, the accuracy on NY Regents 8^th Grade (NDMC) has achieved over 90% in 2019, with language models + specialist solvers.

One failing question is
“Which of these objects will most likely float in water?”
(correct) Table tennis ball. (wrong) hard rubber ball.

Other failing questions are reading and comprehension types. For example:

“A student wants to determine the effect of garlic on the growth of a fungus species. Several samples of fungus cultures are grown in the same amount of agar and light. Each sample is given a different amount of garlic. What is the independent variable in this investigation? ”

(correct) amount of garlic. (wrong) amount of growth.

Aristo is unable to answer a traditional math question like this.

“Two trains are driving in opposite directions with different speed v1 and v2. They started at a distance of s. How long does it take for them to meet?”

In summary, Aristo has gain surprising success with language modeling. The project finds that structure is not essential for many tasks. Just pattern matching can answer many questions. But the method falls short with numerous types of questions, implying that many other AI aspects are missing. Structured reasoning and knowledge capture but with more language-like representations can be the way to go forward.

Marina del Ray is in the vicinity of Los Angeles with very beautiful beach scenes. One nearby beach is called Venice Beach with a very long fishing deck. But I just took a brief look due to my tight schedule. USC and UCLA both have some buildings in this town. The public transportation is crowded especially from the airport, so I decided to take Uber. Carpool can save some money but at the sacrifice of potential time loss because the driver is obligated to pick up at least 2 passengers. I had an experience when the second passenger canceled the trip at the last minute. The hotel I stayed in is called Jolly Roger Hotel, a very small but affordable hotel. People are very friendly but the Thai restaurant near my hotel was very disappointing.

The most important thing in a conference is to meet people. Some of the known people include

· Yolanda Gil: Research Professor of Computer Science and Spatial Sciences and Principal Scientist at USC Information Sciences Institute

· Jay Pujara: Research Assistant Professor of Computer Science

· Ken Barker: IBM

· Krutarth Patel: KSU

· Peter Clark: AllenAI

Some new friends include

· Andre Valdestilhas: a Brazil graduate student studying in Germany

· Enrico Daga: Open University Knowledge Media Institute

· Tim Lebo: Air Force Research Lab

· Prateek Jain: Director Data Science at AppZen

Jian Wu

↧

2019-12-12: Bhanuka Mahanama (Computer Science PhD Student)

December 12, 2019, 12:15 pm

≫ Next: 2019-12-21: Preserving Open Source Software with GitHub's Arctic Vault

≪ Previous: 2019-11-20: Trip Report to K-CAP 2019

My name is Bhanuka Mahanama, I joined Old Dominion University as a Ph.D. student in fall 2019 under the supervision of Dr. Sampath Jayarathna. I'm currently researching audio, visual, and eye-tracking integration in classroom environments. My research interests include multi-sensory environments, information retrieval, and machine learning.

University of Moratuwa

I received my bachelor's degree in Computer Science and Engineering from the University of Moratuwa, Sri Lanka in 2018. I followed the ICE (Integrated Computer Engineering) stream with courses in embedded and industrial computer engineering systems. I also have an advanced diploma in management accounting from the Chartered Institute of Management Accountants (CIMA) in the UK. I was a visiting instructor for computer networks at the University of Moratuwa, Sri Lanka. These lab sessions ranged from setting up a basic network with routers to configuring multi-area OSPF environments using CISCO routers and CISCO packet tracer. Soon after my undergraduate degree, I was appointed as a junior consultant at the University of Moratuwa for coordinating practical course components of Computer Communications, Computer Networks, and Industrial Computer Engineering modules. I was responsible for designing a practical lab series for the Industrial Computer Engineering module, working with programmable logic controllers and industrial environments.

I started my professional career (June, 2017 - December, 2017) as a trainee software engineer at Wavenet International (Pvt) Ltd in Sri Lanka and got promoted to an associate software engineer after the completion of the training program (January, 2018 - July, 2018). At Wavenet, I worked mainly on building highly available software solutions for telecommunication using Erlang and MySQL, including optimizing systems to improve the throughput of code-level changes, database optimizations, and deployment. Later, I started working as a freelance software engineer serving local and foreign clients and building software application solutions.

I'm proficient in coding Java, C++, Javascript, Erlang, Typescript, Python and PHP, and have experience in frameworks such as Spring, AngularJS, Yaws, Erlang OTP, Cordova, Ionic, Scikit, Symfony, Laravel, databases such as MySQL (InnoDB and NDB cluster), MongoDB, Redis, SQLite, and Neo4J.

At ODU, I worked on a game-play interface to replace mouse interactions using eye-tracking technology.

Figure 2: Eye Tracking Game Play Interface

Using this interface users will be able to play conventional games through gaze interactions. The interface is designed to be independent of the game environment allowing the application to be applicable in a variety of games ranging from flash games to multiplayer shooting games. The project leverages the hardware and software API of pupil labs trackers for detecting the eye movements, gaze information and fixation information. The developed environment retrieves data from the pupil labs software through the ZMQ message broker. The application acts as the bridge between the pupil labs software and the gaming environment translating the sensory data from pupil labs to user inputs emulating a mouse. This project was deployed at the recent "STEAM on Spectrum" event at Virginia Modeling and Simulation Center (VMASC). The idea behind the event is to make Science, Technology, Engineering, Art, and Math (STEAM) more accessible and inclusive to people despite their sensory and cognitive needs.

All of my students (@NirdsLab) are at @vmasc_odu for the "STEAM on Spectrum 2019"#inclusive event for Autistic kids. Nearly 200 participants and 50+ volunteers. @WebSciDL @oducs @ODUSCI @sheissheba @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/f131qYGjkA
— Sampath Jayarathna (@OpenMaze) October 12, 2019

It is a beautiful day @vmasc_odu with a great turnout for today’s STEAM on Spectrum 2019 event. @NirdsLab @WebSciDL @oducs @ODUSCI @OpenMaze @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/1DsTqeveJi
— Bathsheba Farrow (@sheissheba) October 12, 2019

We also showcased our eye-tracking and medical sensing equipment at the Science Connect event organized by the Old Dominion University College of Science on October 19th. The event was free and open for Norfolk public high school students. Students were able to tour the labs, meet faculty staff, learn about courses offered, and research conducted at ODU.

Dr. Sampath Jayarathna (@OpenMaze) representing @WebSciDL @oducs @NirdsLab at the Science Connection Day! https://t.co/kpiSmhDKia pic.twitter.com/b5FylKhYv5
— Michael L. Nelson (@phonedude_mln) November 8, 2019

Science connection happening now @ODU @NirdsLab demonstrating #eyetracking #empatica @OpenMaze @WebSciDL @yasithmilinda @Gavindya2 pic.twitter.com/81Mguc18OM
— Bhanuka Mahanama (@mahanama94) October 19, 2019

Recently, we held our inaugural "Trick or Research" event in the computer science department, where Undergraduates and Graduate students were invited to the labs in the department for informative sessions about the research projects being carried out.

We are getting ready for the very first Trick-or-Research @oducs, candy bags ready, passport printed...!!! @ODUSCI @WebSciDL @odu pic.twitter.com/qCNMG9TS22
— Sampath Jayarathna (@OpenMaze) October 30, 2019

-- Bhanuka Mahanama

↧

2019-12-21: Preserving Open Source Software with GitHub's Arctic Vault

December 21, 2019, 2:31 pm

≫ Next: 2019-12-31: Muntabir Hasan Choudhury (Computer Science PhD Student)

≪ Previous: 2019-12-12: Bhanuka Mahanama (Computer Science PhD Student)

Source: Techworm

GitHub is used by more than 40 million developers and currently hosts more than 100 million repositories. In early November 2019, GitHub shared plans to open the Arctic Code Vault, an effort to store and preserve open source software like Flutter and TensorFlow. With this endeavor, code for all open source projects will be stored on specialized ultra-durable 3,500-foot film with frames that include 8.8 million pixels each, designed to last 1,000 years. The data can be read by a computer or a human with a magnifying glass in case of a global power outage.

"Our primary mission is to preserve open source software for future generations. We also intend the GitHub Archive Program to serve as a testament to the importance of the open source community. It’s our hope that it will, both now and in the future, further publicize the worldwide open source movement; contribute to greater adoption of open source and open data policies worldwide; and encourage long-term thinking." (Excerpt: GitHub Archive Program website)

The GitHub Archive Program is working to preserve open source software for future generations—including a long-term archive designed to last 1,000 years 🌍 https://t.co/aLbThVswJS #GitHubUniverse
— GitHub (@github) November 13, 2019

GitHub is partnering with the Stanford Libraries, the Long Now Foundation, the Internet Archive, the Software Heritage Foundation, Piql, Microsoft Research, and Oxford University's Bodleian Library to preserve the world’s open source code. These partners represent the warm and cold tiers in the pace layer strategy GitHub has adopted for archiving code. Each institution provides redundancy by storing multiple copies across various data formats and locations, including a very-long-term archive called the GitHub Arctic Code Vault.

The Arctic World Archive
The Arctic World Archive (AWA), located in Svalbard, Norway, is a vault that wants to preserve the world’s digital heritage and make it available for future generations. The AWA already holds art collections, the Vatican’s 1,500-year-old manuscripts, and even film clips of the Brazilian football player Pele. In collaboration with their clients, the AWA determines whether to store the content in a digital format or in visual format so text and images are human readable. The open source code on GitHub will be maintained in a decommissioned coal mine designed specifically for the AWA. Archivists believe that cold and near-constant conditions can help in film preservation. As one of the northernmost cities on Earth, Svalbard's permafrost can extend hundreds of meters below the surface. While Svalbard is affected by climate change, it’s likely to affect only the outermost few meters of permafrost in the foreseeable future. Warming is not expected to threaten the stability of the mine. The AWA caters to any country, institution or company in need of ultra-secure storage and was likely chosen by GitHub because:

Svalbard is a Norwegian archipelago situated approximately 1,300 kilometers from the North Pole; essentially out of reach for most cyber attacks.
The Svalbard Treaty signed in 1920 declares the area to be a demilitarized zone (DMZ) with no military activity; ensuring the data will not become a casualty in a military conflict. 43 nations, including the United States, Russia and China signed the treaty.
Its unique location, geopolitical and climatic stability makes it a suitable place for safe long-term storage. No electricity or other human intervention is needed as the climatic conditions in the Arctic are ideal for long-term archival of film.

Contents of the 2020 Snapshot
Earlier in 2019, thousands of popular GitHub projects like Blockchain, WordPress, and programming languages like Rust and Ruby were added to the archive. Next year, the Arctic Code Vault will be extended to include all public GitHub repositories. The first snapshot will take place on February 2, 2020. For anyone with an active repository, the associated code will automatically be included in the snapshot. The snapshot will consist of the default branch of each repository, excluding any binaries larger than 100 KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data. The snapshot will also include "significant dormant repositories as determined by stars, dependencies, and an advisory panel", according to GitHub. The advisory panel will include experts from a range of fields, including anthropology, archaeology, history, linguistics, archival science, and futurism. Currently, the Archive Program Advisors include:

Shannon Lee Dawdy, Archaeologist / Anthropologist / Historian
Brewster Kahle, Internet Archive
John McWhorter, Linguist
Alexander Rose, Executive Director, Long Now Foundation
Ada Palmer, Historian / Science Fiction Author
Hussein Bassir, Archaeologist / Egyptologist / Director of the Antiquities Museum at the Library of Alexandria
Christine Moran, Computational Astrophysicist / Security Engineer

Over time, GitHub will develop a cadence to store code once a year or every two years, and a way for open source projects to retrieve code, but those processes are still being developed. The Frequently Asked Questions (FAQs) state that GitHub plans to evaluate the archived film reels and their current state every five years. The current film technology used in archiving, developed by Piql, is coated in iron oxide power. This medium has a lifespan of 500 years as measured by the International Standards Organization (ISO); simulated aging tests indicate Piql’s film will last twice as long. Depending upon GitHub's evaluation results, another snapshot may be taken and archived in the cold storage facility. However, this is not guaranteed or yet known.

Future Proofing Open Source Code
What will software look like 1,000 years from now? Developers and archivist can only guess. Meanwhile, GitHub is working to ensure that today’s most important building blocks make it to tomorrow. Much of today's technology is powered by open source software. It’s a hidden cornerstone and shared foundation for future development efforts. The mission of the GitHub Archive Program is to preserve the legacy for generations to come.

"There is a long history of lost technologies from which the world would have benefited, as well as abandoned technologies which found unexpected new uses, from Roman concrete, or the anti-malarial DFDT, to the hunt for mothballed Saturn V blueprints after the Challenger disaster. It is easy to envision a future in which today’s software is seen as a quaint and long-forgotten irrelevancy until an need for it arises. Like any backup, the GitHub Archive Program is also intended for currently unforeseeable futures as well." (Excerpt: GitHub Archive Program website)

Besides the Archive Program, GitHub is also working on Microsoft’s Project Silica to archive all active public repositories for over 10,000 years, by writing them into quartz glass platters using a femtosecond laser. For anyone who wants to safeguard their own code in the GitHub Arctic Code Vault, there's still time to do so. On the other hand, GitHub will only archive public repositories, so opting out is as simple as making your repository private which is a free feature for all users. One point that should be noted with GitHub's archival plan is that code depends on the underlying infrastructure to run (e.g., hardware, supporting libraries, assembly language, compilers). It's unknown whether GitHub will also include these elements in the AWA. As an alternative, the archive will also include a Tech Tree that provides an overview of the archive and how to use it. The Tech Tree will serve as a quick start manual on software development and computing, bundled with a user guide for the archive. The archive will also include information and guidance for applying open source, with context for how we use it today, in case future readers need to rebuild technologies from scratch. Answers to other common questions can be found in the GitHub Archive Program FAQs.

The Case for Cold Storage?
While the digital preservation of open source code may be culturally significant, David Rosenthal's "Seeds or Code?" blog post offers a slightly different perspective on GitHub's endeavor. He contends the AWA initiative is a publicity stunt conceived by Microsoft which acquired GitHub in 2018. Rosenthal questions whether the AWA would rank high on the list of basic necessities if the world were sufficiently devastated. Further, he draws comparison to other scientific projects which are inspiring in their intent but, in his opinion, similarly lack practical use cases. For emphasis, Rosenthal draws attention to these projects:

Clock of the Long Now which ticks once per year and is designed to accurately keep time for 10,000 years.
Voyager Golden Records which contain Earth sounds and images intended for any intelligent extraterrestrial life form who may find them.

The AWA itself is a cold storage facility that developers may never need until the day comes that they do. Currently, lots of organizations rely on open source software which means the AWA may represent a crucial building block in the restorative process. However, there are both high-quality and poorly-designed projects maintained on GitHub with no visible means to discriminate. As a result, AWA will house code in both categories which may present a challenge to the future of software development. While Rosenthal discounts the media hype surrounding the AWA, he readily acknowledges the importance of the software archives and accessible pace layer partnerships that increase awareness of digital preservation overall.

--Corren McCoy (@correnmccoy)

GitHub Archive Program, Preserving open source software for future generations, N.D., Retrieved from https://archiveprogram.github.com/ on 01-December-2019.

↧

2019-12-31: Muntabir Hasan Choudhury (Computer Science PhD Student)

December 30, 2019, 9:35 pm

≫ Next: 2020-01-01: Himarsha Jayanetti (Computer Science Master’s Student)

≪ Previous: 2019-12-21: Preserving Open Source Software with GitHub's Arctic Vault

My name is Muntabir Hasan Choudhury. I am an international student from Bangladesh. I joined the WS-DL (Web Science and Digital Libraries Research Group) at Old Dominion University as a PhD student in the Fall of 2019 under the direct supervision of Dr. Jian Wu. My current research involves text mining on non-born digital Electronic Theses and Dissertations (ETDs), in collaboration with Dr. Fox and Bill Ingram at Virginia Tech Computer Science. My research interests include, but are not limited to, Big Data, Natural Language Processing, and Machine Learning.

I was always an enthusiast about science, technology, and innovation. Because of curiosity regarding computing, computer architecture, robotics, and artificial intelligence, I decided to study Computer Engineering at college. In May 2018, I received my Bachelor's degree in Computer Engineering from Elizabethtown College, PA.

While studying at Elizabethtown College, I was energized by the study of Big Data and its vast application in industry. This interest drove me to work as a database assistant in college, allowing me to conduct undergraduate research on Relationship Building in Higher Education, which focused on data analytics. The objective was to optimize the relational data from the college database and analyze it to predict the variable to understand the resources of philanthropy through various programs. I ran SQL using Info-Maker to create a sample datasheet which consists of thousands of companies. Later, I imported the data in Microsoft Excel to update and modify inconsistent and incomplete data in Jenzabar. After identifying the duplicates and sorting out the useful data, I created a final datasheet where I had a fair amount of companies' and organizations' data to work with. Lastly, using R and SPSS tools, I used Multiple Linear Regression and Backward Elimination to process the unsupervised data to get a predicted variable.

I did a junior and a senior year group project. My junior year project was Automated Agriculture Simulations and Real-Time Control over the Internet, including correlation to Weather Data. The goal was to implement an automated watering system using a Hummbox (a sensor) provided by a client in France, GreenCityZen. I worked on improving the algorithm and model for Hummbox and coded a Raspberry Pi using Python for opening a valve. My senior year project was implementing a cost-effective Basketball Training Machine. We built a mobile application using MIT App Inventor— a cloud-based tool to create any mobile application. Our application takes the GPS coordinates to track the location of the player. I assembled and tested electrical components. Also, I partially worked on coding using C-programming for feeding GPS input to the Android application.

Automated Agriculture Simulations and Watering Systems

Basketball Training Machine Electrical Component Assembly

Prior to joining WS-DL, I worked at Resource9 Group, Inc.— a New York-based start-up, as a Junior Application Performance Engineer. At Resource9, I helped clients by monitoring the performance of the web and mobile applications using cloud-based monitoring tools called AppDynamics and Moogsoft AI-Ops. My experience working with the IT operations team helped me to enhance my skills in computer networking, automation, and cloud computing. I have used multiple technologies and programming languages throughout my career. I am proficient in scripting and coding in C, Python, PHP, SQL, HTML, JavaScript, CSS, JSON, XML, and also have expertise on Docker (intermediate), AWS (EC2, Fargate, S3), AppDynamics (APM tool), and Linux.

Upon joining WS-DL at ODU, I went to three computer science talks and attended one colloquium organized by the Department of Computer Science. Moreover, the most eye-catching event organized by the Computer Science Department was Trick-or-Researchon the day of Halloween. WS-DL participated in this event, and we cordially welcomed all graduate and undergraduate students to our lab and explained our research work while giving out candies. Dr. Jian Wu and I demonstrated our current research project to the students.

Trick or Research happening now @WebSciDL . Bring your curios minds and get some candies!! Happy Halloween! pic.twitter.com/TDaFx199EN
— kritika garg (@kritika_garg) October 31, 2019

WS-DL is a great place to work. We share our thoughts, provide feedback to other's work, and diligently get things done on time. I am delighted to be a part of this research group, and I am excited to announce that I will be working as a full-time Graduate Research Assistant (GRA) from Spring, 2020.

--Muntabir Choudhury

↧

2020-01-01: Himarsha Jayanetti (Computer Science Master’s Student)

January 1, 2020, 8:41 pm

≫ Next: 2020-01-04: 365 dots in 2019 - top news stories of 2019

≪ Previous: 2019-12-31: Muntabir Hasan Choudhury (Computer Science PhD Student)

My name is Himarsha Jayanetti. I am an international student from Sri Lanka. I joined Old Dominion University as a Master’s student in Fall 2019 under the supervision of Dr. Michele Weigle. My current research project involves observing access patterns of robots vs. humans in Internet Archive and studying whether the patterns prevalent in the Internet Archive are present across other web archives. My collaborator, Kritika Garg, and I are working on this project by extending prior research by WS-DL alumna Dr. Yasmin AlNoamany.

My intrinsic strength has always been my quantitative and analytical potential, embedded with a special passion for mathematics from my early schooling days. This in turn made me choose mathematics, physics, and chemistry at my high school. An exceptional result in the advanced level examinations in Sri Lanka helped me get the Nehru Memorial Scholarship offered by the Ministry of External Affairs of India. Under this scholarship, I started my undergraduate studies in Computer Engineering at Gujarat Technological University, India in 2013. In addition to academic performance, the extracurricular activities I had done during the period of my schooling played a significant role in making myself outstanding at the selection of the students for the scholarship.

I received my Bachelor’s degree in Computer Engineering from Gujarat Technological University, India in 2017. During my Bachelor's degree work, my main area of focus was on computer networking and network security. As I was highly fascinated by the subject, I followed a certification course (Cisco Certified Network Associates) in networking in my leisure time, as I realized it covers a vast area of this particular subject. Furthermore, during summer vacation, I was offered an internship at the Department of National Archives, Sri Lanka for two months (June 2016 - August 2016). During this training, I was exposed to the implementation of desktop application development, web design, networking, and hardware engineering.

Pertaining to the degree, I had completed the final year project which was a website for the International students in Gujarat. As I was among the first batch to get an admission in any university in Gujarat as an international student, the exposure we had was very limited at that time. Through a survey, we got to know that the main reason was the lack of information and communication among the international students. The Regional Officer of the Indian Council for Cultural Relations, Mr. Shri Jigar Inamder, personally formed a group of elite Computer Science students to create a website for the international students in Gujarat. I, along with four other fellow students from different countries, designed and implemented this website. Our project was highly recognized by the college and the website iccrgisc.com (Indian Council for Cultural Relations, Gujarat International Students’ Cell) was launched at the certificate awarding ceremony on March 31, 2017.

Development team of iccrgisc.com - During the website launch
at the certificate awarding ceremony

This brought a lot of appreciation from the student body as well as the officials present at the venue. Unfortunately, it is no longer available on the web proving that not only implementation and initial hosting, but maintenance and continuation are equally important in hosting a website. This website was available on the web for over a year but, there is no archived copy available across different web archiving sites. This, in fact, is a good instance which implies that not everything is indeed archived. Finally, with this project and continuous hard work and commitment, I completed my degree in 2017 with a First Class Distinction.

A screen capture of the website home page - During the website launch at the certificate awarding ceremony

Soon after I returned to my country at the completion of my Bachelor’s degree, I was selected to work as a Network Engineer at Exetel Private Ltd affiliated to Australia (July 2017 - July 2019). I was involved in basic networking, problem-solving, and technical support as a Network Engineer. Also, I was able to acquire IESL (Institute of Engineers Sri Lanka) membership in the year 2019, which is the apex body for professional engineers in Sri Lanka.

During my first semester as a graduate student, I took three courses:

CS625 Data Visualization (by Dr. Michele Weigle)
CS531 Web Server Design (by Sawood Alam)
CS518 Web Programming (by Dr. Jian Wu)

Upon completion of the data visualization course, I have learned the theory and application of data visualization along with R language, data analysis techniques, and several visualization tools such as Tableau and Vega-Lite. The web server design class was focused on understanding the Hypertext Transfer Protocol (HTTP) and the implementation of a web server. I have built a web server from scratch using the Python programming language. This indeed was the most challenging course I took in the Fall semester but through this course I have learned more than I could have ever imagined in a single semester. During the web programming course, I built a search engine using Elasticsearch with the help of programming languages like HTML, CSS, and LAMP (Linux, Apache, MySQL, PHP).
Moreover, I was thrilled to become a part of the WS-DL (Web Science and Digital Libraries Research Group) as a research assistant starting this Fall. Even though the thought of joining a research group was overwhelming, the hospitality provided by the faculty and colleagues of the group was impeccable. A few things I noticed during the first semester working in this group are that they will go all the way to make sure that we are very well understood about a certain topic through discussion and explanation, provide continuous feedback on performance and most of all provide a fun work environment. The most recent event that caught everyone's attention was the Trick-or-Research event on Halloween organized by the Computer Science Department.

pic.twitter.com/TJL13ibgd0
— nirds-lab (@NirdsLab) October 31, 2019

— Sampath Jayarathna (@OpenMaze) October 31, 2019

Trick or Research happening now at ECSB 3100!! Don't forget to bring your Halloween passport.. @WebSciDL @oducs https://t.co/9sUCQDe3s7 pic.twitter.com/gC5ZjtvgBO
— Himarsha Jayanetti (@HimarshaJ) October 31, 2019

Ten different research groups from the department participated in this event. WS-DL also participated in the event where we gave out candy to visiting undergraduate/first-year graduate students, and demo our research. Halloween passports (maps included) were provided to the students and they were encouraged to visit different labs in the Computer Science department. Students could get their passports stamped when they visit a lab, and all students who visited at least 5 labs were eligible to win some prizes. This was an amazing opportunity for students to network with Computer Science faculty, find opportunities to join a lab to do some awesome research, and become a paid Research Assistant. My colleague, Kritika Garg, and I also showcased our current research work at the event (Slides). During the first couple of months into the WS-DL group, Dr. Michael L. Nelson gave an overview presentation about WS-DL's approach to scholarly communication. The presentation was about but not limited to journals, conferences, blogs, and tweets (Slides). I would say that no better place encourages not only timely but also high-quality work than the WS-DL research group. I am really glad to be a part of this group and determined to contribute to the group in all my abilities.

-- Himarsha Jayanetti --

↧

2020-01-04: 365 dots in 2019 - top news stories of 2019

January 4, 2020, 4:12 pm

≫ Next: 2020-01-04: Four WS-DL classes Offered for Spring 2020

≪ Previous: 2020-01-01: Himarsha Jayanetti (Computer Science Master’s Student)

Fig. 1 (Click on image to expand) 365 dots in 2019 - News stories for 365 days in 2019. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, the y-axis represent the average degree of the GCC.

In March 2019 I published "365 dots in 2018" where I presented the top stories for each day in 2018 according to StoryGraph. Now that 2019 is over, it is natural to ask what were the top news stories of 2019? News organizations will often publish "the year's top stories" or "year in review" (e.g., CNN, CBS, FoxNews), but the selection criteria is not always made explicit. The closest to a selection criteria I have seen from news organizations is the presentation of their top most viewed(or most popular) news stories. But this criteria is not accessible to ordinary users who cannot access the private traffic statistics of news articles. As I mentioned previously, we consider specifying the selection criteria important for two reasons. First, an explanation or presentation of the criteria opens the criteria to critique and helps alleviate concerns of bias. Second, the criteria is inherently valuable because it could be reused and reapplied on a different collection. For example, one could apply the process to find out the top news stories in a different country.

StoryGraph's criteria for a "top story" is a high average degree of a connected component generated by computing the similarity between entities by processing news articles extracted from the RSS feeds of 17 US news sources across the partisanship spectrum (left, center, and right). The code is available and the algorithm is described elsewhere. But essentially, the more various news organizations use the same entities (e.g., people, locations, organizations) in their reporting, the more important the story. A graph is constructed with the nodes as news articles represented by the entities extracted from the news articles, and the edge between the news articles represents a high degree of similarity between the news articles. A connected component represents a news story (e.g., Mueller Report), and the average degree of the graph's connected components (GCC avg. deg.) is the attention score.

The top news stories of 2019

The table below shows that the top 10 news stories (extracted from Fig. 1) of 2019 were clustered around three primary stories:

The Mueller Report (Ranks 1st, 4th, 6th, and 8th),
The impeachment inquiry against President Trump (Ranks 2nd and 3rd), and
The 2019 Democratic debates (Ranks 5th, and 7th).

Rank	Date (MM-DD)	News Story	GCC Avg. Deg
1	03-24	AG William Barr releases Mueller Report's principal conclusions	22.93
2	09-24	House Speaker Pelosi announces formal impeachment inquiry	18.60
3	11-19/20	Impeachment inquiry public testimony (Tie: 11-19, 11-20)	18.18
4	01-19	Mueller: BuzzFeed Report 'Not Accurate'	17.19
5	07-31	Second Democratic debates	15.39
6	07-24	Robert Mueller's testimony at congress	15.05
7	09-13	Third Democratic debates	14.37
8	05-01	AG Barr and Robert Mueller split on obstruction	14.36
9	04-08	Homeland Security Chief Kirstjen Nielsen resigns	13.43
10	12-20	Sixth Democratic debates	13.33

Stories surrounding the release of the Mueller Report (red dots in Fig. 1) received the most attention in 2019. On March 22, 2019, Robert Mueller submitted his report to AG William Barr (GCC avg. degree: 18.72). Two days later, AG William Barr released his summary (principal conclusions) of the report. This story received the most attention (GCC avg. degree: 22.93) in 2019. AG William Barr's principal conclusions of the Mueller report was received with skepticism by the Democrats who claimed the conclusions were highly favorable to President Trump. In contrast, the Republicans claimed the summary exonerated the President from any wrongdoing.

StoryGraph top news story for 2019 thus far, Mueller Report Release with average degree 20.23.https://t.co/032Pn3MpED #storygraph pic.twitter.com/iw786cQOgX
— StoryGraph (@storygraphbot) April 18, 2019

The next top story in 2019 (blue dots in Fig. 1) with GCC average degree of 18.60 was Speaker Nancy Pelosi's announcement of an official impeachment inquiry (September 24, 2019) four days after the whistleblower's report. Similarly, at rank three (green dots in Fig. 1) were stories chronicling the public testimonies of the impeachment inquiry.

4 days, 4 degrees later...

Left: https://t.co/ifPhNHEYC5 #Whistleblower #storygraph
September 20, 2019 (Average degree 14.21)

Right: https://t.co/mGsGsWDhgu #Impeachment #storygraph
September 24, 2019 (Average degree 18.60) pic.twitter.com/2bhlHKnFAr
— Alexander C. Nwala (@acnwala) September 24, 2019

Similar to 2018, President Trump was a dominant figure in the 2019 news discourse. As shown in Fig. 1, out of the 365 days, "Trump" was included in the title representing the story graphs 193 (~52%) times (vs. 54% in 2018).

Fig. 1 consists of 365 dots. Each dot represents a single news graph out of 144 candidates. A dot represents the connected component with the highest average degree for that day. Since we select only one connected component (out of 144) — and indeed this is needed to avoid plotting 52,560 (144 x 365) dots — we lose so much information (news stories) for the sake of compression. The need for a method of summarizing the news of the year without discarding too many news articles led me to apply sumgram to summarize the news of 2019.

60 Sumgrams in 2019

Fig. 2: Summary of the top news stories in 2019 according to sumgram. List of five top sumgrams (n = 2) generated for each month in 2019 from the 2019 StoryGraph dataset. The red text highlights the base ngrams. Key: # - Rank, DF - Document Frequency, DFR - Document Frequency Rate

Fig. 2 consists of the list of top five sumgrams generated by processing the StoryGraph 2019 news dataset with base_ngram = 2, and removal of these stop words: "2019 read, abc news, apr 2019, april 2019, associated press, aug 2019, august 2019, com, dec 2019, december 2019, donald trump, feb 2019, february 2019, fox news, getty images, jan 2019, january 2019, jul 2019, july 2019, jun 2019, june 2019, last month, last week, last year, mar 2019, march 2019, may 2019, new york, nov 2019, november 2019, oct 2019, october 2019, pic, pm et, president donald, presidentdonald trump, president trump, president trump’s, said statement, send whatsapp, sep 2019, september 2019, sign up, trump administration, trump said, twitter, united states, washington post, white house, york times."

Recall that President Trump was a dominant figure (mentioned in the titles of news articles in Fig. 1, 52% of the time). Consequently, Fig 2. was generated by treating the above bolded terms (e.g., donald trump, president donald, presidentdonald trump,etc.) associated with "Trump" as stop words. This was done in order to give other salient sumgrams a chance of appearing in the top five sumgrams instead of being crowded out my the highly popular terms associated with "Trump." However, Fig. 3 below was generated without treating terms associated with "Trump" as stop words.

Fig. 3: Summary of top news stories of 2019 according to sumgram. List of five top sumgrams (n = 2) generated for each month in 2019 from the 2019 StoryGraph dataset WITHOUT treating terms associated with "Trump" as stopwords unlike Fig. 2. Consequently, across all months in 2019, "president donald trump," was the top sumgram. The red text highlights the base ngrams. Key: # - Rank, DF - Document Frequency, DFR - Document Frequency Rate

Below I highlight my observations from the summary (Fig. 2) of the news cycle in 2019 according to sumgram, grouped by months.

JANUARY to FEBRUARY - The border wall and the partial government shutdown
The top sumgram the partial government shutdown of January highlights the budget fight between President Trump and the House Democrats over funding for the President's border wall (Fig. 2, January, Rank 2) which led to the partial government shutdown that began on December 22, 2018. The border wall sumgram in February signals the lingering of the partial government shutdown story into February even though the 35-day shutdown — the longest in US history — ended on January 25, 2019.

MARCH - The Mueller Report and AG Barrs principal conclusions
Stories surrounding the release of the special counsel robert mueller's (Fig. 2, March Rank 1) report and attorney general william barr's (Fig. 2, March, Rank 2) release of the principal conclusions of the report dominated the news cycle in March 2019.

APRIL to MAY - The Mueller Report and Biden announces his candidacy for President
Mueller Report stories which began dominating the news cycle in March 2019 continued dominating the news cycle into April, but they shared the spotlight with stories reporting the Biden Presidential Candidacy following his announcement on April 25, 2019. Joe Biden remained a constant fixture in the news cycle from April to December 2019. However, it is important to state that the context around the mention of Joe Biden before September was probably due to his status as a top tier candidate in the Democratic field. But from September 2019, the context of his mention changed because of his involvement in President's Trump's call with the Ukrainian President which led to the Whistleblower's report that precipitated the impeachment inquiry.

JUNE to JULY - The Democratic candidates and Alexandria Ocasio-Cortez
The 2020 US Democratic candidates such as bernie sanders (Fig. 2, June Rank 4) were in the June - July news spotlight of 2019. They shared the spotlight with Congresswoman alexandria ocasio-cortez (Fig. 2, July Rank 3) who received considerable attention from the media in July for different stories such as the green new deal, and a secret Facebook group of current and former Border Patrol members that consisted of posts that demeaned the Congresswoman.

AUGUST - The El Paso mass shooting
The tragic El Paso mass shooting (Fig. 2, August, Rank 5) dominated the August 2019 news cycle.

SEPTEMBER to DECEMBER - Impeachment inquiry announcement, public testimonies, and the impeachment of President Donald J. Trump
September through December chronicled the various stages of the impeachment inquiry and eventually the impeachment of President Trump. On September 24, 2019, House Speaker Nancy Pelosi announced the start of an official impeachment inquiry(Fig. 2, September Rank 4). Next, the November 2019 news cycle was dominated by the public impeachment hearings (Fig. 2, November Rank 5), with the eventual passing of the articles [of] impeachment (Fig. 2, December Rank 1).

StoryGraph has been generating news similarity graphs at 10-minute intervals since August 2017. A single graph file (e.g., this impeachment inquiry graph generated on September 24, 2019) includes the URL of the news articles, plaintext, entities, publication dates, etc. This post only reports an investigation into identifying the news stories that received significant attention in 2019. But there is still the opportunity for further study and we welcome any such initiatives.

-- Alexander C. Nwala (@acnwala)

↧

2020-01-04: Four WS-DL classes Offered for Spring 2020

January 5, 2020, 7:51 am

≫ Next: 2020-01-09: Kritika Garg (Computer Science PhD Student)

≪ Previous: 2020-01-04: 365 dots in 2019 - top news stories of 2019

"Is the pipeline literally running from your laptop?""Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."

Four WS-DL classes are being offered for Spring 2020:

CS 395 Research Methods in Data and Web Science is taught by Dr. Michael L. Nelson, Wednesdays 4:20pm - 7pm. This class will introduce undergraduates to writing proposals, reading & writing papers, giving presentations, Python, Web APIs, reproducibility, LaTeX, and GitHub.
CS 432/532 Web Science is taught by by Dr. Michele C. Weigle, Tues/Thurs, 11am-12:15pm. This class explores web phenomena with a variety of data science tools such as Python, R, D3, ML, and IR.
CS 480/580 Intro to Artificial Intelligence is taught by Dr. Vikas Ashok, Tues/Thur, 3-4:15pm. The class will cover fundamental concepts, principles, and techniques in Artificial Intelligence. Topics include problem representation, problem-solving methods, search, pattern recognition, natural language processing, vision processing, machine learning, and expert systems.
CS 495/595 Intro to Data Mining is taught by Dr. Sampath Jayarathana, Tues/Thurs, 5:45-7pm. This class will introduce concepts, techniques, and tools to deal with various facets of data mining practices such as data preprocessing, pattern mining, outlier analysis, and mining of text, data streams, time series, spatial, and graph data.

In addition, Dr. Michele Weigle is offering the newly created CS 800 Research Methods, a P/F required course for PhD students.

Dr. Jian Wu, via a course buyout, will not be teaching in Spring 2020.

If you're interested in these classes, take them in Spring 2020 since it is not guaranteed when they will be offered again. A tentative plan for Fall 2020 is:

CS 418/518 Web Programming, Dr. Jian Wu
CS 432/532 Web Science (online only), Dr. Michele C. Weigle
CS 620 Intro to Data Science & Analytics, Dr. Sampath Jayarathna
CS 625 Data Visualization, Dr. Michele C. Weigle
CS 791/891 Topics on Mining Scholarly Big Data, Dr. Jian Wu
CS 795/895 Web Archiving Forensics, Dr. Michael L. Nelson
CS 795/895 Natural Language Processing, Dr. Vikas Ashok

-- Michael

↧

2020-01-09: Kritika Garg (Computer Science PhD Student)

January 9, 2020, 11:00 am

≫ Next: 2020-01-13: Data Science Fall 2019 Class Projects

≪ Previous: 2020-01-04: Four WS-DL classes Offered for Spring 2020

I am Kritika Garg, a first-year Ph.D. student at Old Dominion University. My research interests are in the fields of web archiving, social media, and natural language processing. I joined the Old Dominion University in the fall of 2019 under the supervision of Dr. Michael L. Nelson and Dr. Michele C. Weigle. I work with Web Science and Digital Libraries Research Group (WS-DL) where our focus is in the fields of web archiving, digital preservation, social media, and human-computer interaction. My current research work is in the field of web archiving, including analyzing access patterns of robots and humans in web archives and studying whether the patterns prevalent with the Internet Archive are present across different web archives.

I completed my undergrad from Guru Gobind Singh Indraprastha University in June 2019. During my undergrad, I started attending various tech events by tech-groups such as Google Developer Group, PyDelhi, Women Techmakers, Women Who Code, etc. These events acquainted me with various computer science technologies and tools such as python language, machine learning, natural language processing. Attending these events and meeting notable people in science inspired me to pursue a career in research.

Before Joining the WS-DL group, I worked as a research intern at the Indian Institute of Technology (IIT) in Delhi, India. I worked in the fields of natural language processing, social network analysis, machine learning, and information retrieval, under the supervision of Dr. Saroj Kaushik and Mr. Kuntal Dey. We worked on topic life-cycle analysis based on a novel idea where each topic is defined by a cluster of semantically related hashtags. Furthermore, we analyzed the hashtag and topic life-cycle with respect to the communities, about how topics morph over evolutions of hashtags within and across communities. This resulted in "Topic Lifecycle on Social Networks: Analyzing the Effects of Semantic Continuity and Social Communities" (ECIR 2018, with Kuntal Dey, Saroj Kaushik, and Ritvik Shrivastava). We also conducted a thorough study of topic life-cycle by assessing the influence of users and frequently-used hashtags on the life-cycle, which lead to the publication of "Assessing the role of participants in evolution of topic life cycles on social networks"(Computational Social Networks 2018, also with Dey, Kaushik, and Shrivastava).

Topic Lifecycle on Social Networks: Analyzing the Effects of Semantic Continuity and Social Communities

The four of us also published "Assessing Topical Homophily on Twitter" (Complex Networks 2018) where we investigated the relationship between the familiarity of users and textual similarity of their social media content at the user, peer-group and community granularity. During my internship, I also developed a novel socio-temporal hashtag recommendation system using machine learning and the NLP-based approach. I re-implemented the emtagger model based on word vectors and deeply ingrained socio-temporal techniques within it. The social aspect of the system aims to make use of the hashtags generated by familiar users and the temporal aspect aims to age the tweets. We also published "A Socio-Temporal Hashtag Recommendation System for Twitter" in Complex Networks 2018. Before coming to ODU, I also worked on developing an information retrieval system based on insights gained from topical life-cycle analysis.

I joined ODU in August 2019 and I recently completed my first semester. I took three courses this semester which helped me to enhance my technical skills. In the Data Visualization course, I learned how to create effective visualizations in R. The Web Server Design course taught me how to run my own RFC-compliant HTTP webserver using python. The Introduction to Emerging Technologies course acquainted me with research work in modern emerging technologies and taught me how to review the work.

During this time period, I worked on analyzing the access logs of the Internet Archive to understand how users access a web archive. This can help gain insights on how to design web archives and how to tailor to their holdings to their respective user bases. This work is an extension of published work by Dr. Yasmin AlNoamany. We presented this work in the Trick or Research event on Halloween in our lab to undergrad students.

Trick or Research happening now @WebSciDL . Bring your curios minds and get some candies!! Happy Halloween! pic.twitter.com/TDaFx199EN
— kritika garg (@kritika_garg) October 31, 2019

-- Kritika Garg (@kritika_garg)

↧

2020-01-13: Data Science Fall 2019 Class Projects

January 13, 2020, 12:10 pm

≫ Next: 2020-01-13: Review of WS-DL's 2019

≪ Previous: 2020-01-09: Kritika Garg (Computer Science PhD Student)

Here’s a list of projects from the CS 620 Introduction to Data Science & Analytics course from Fall 2019. All the projects are implemented using Python and Google Colab. Google Colab (Colaboratory) is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

All the projects are based on publicly available datasets, you’ll be able to find the links to the datasets, all the pre-processing, wrangling, analytics, machine learning and visualization steps using Python from these Colab Reports. If you need a quick summary, there’s a summary of the project at the very end of each Colab report.

Classifying White Wine Samples

Dissolving the Myth surrounding Gender, Ethnic and Job discrimination in the city of Norfolk

Innovations and Development around the World

Crime Analysis Project

Accidental Drug Related Deaths

Admission Recommender System for Nursery Schools

Weather History Analysis

Zomato bangalore Restaurants

Hospital Charge Data

PlayerUnknown's BattleGrounds (PUBG) Final Placement Prediction

Individual Household Electric Power Consumption

Mood based Songs Classification

Book Rating Prediction & Recommender System

Book Recommendation System

Predicting underinsured Health Insurance Coverage from 2013-2017

World University Ranking

FIFA 19 Complete Player Dataset

Adult Income Prediction System

Recommendation System for Perfume

Loan Classification Data

Drone RF Signal Classification

Analyzing The Metropolitan Museum of Art Inventory

Physical Activity Intensity Prediction

FIFA World Cup One on One Match Prediction

Boston Crimes Exploratory Data Analysis

Assessing the Credit Worthiness and Credibility of card holders

Energy Consumption and Carbon Emissions

Determining the contributing factors and similarities of absentees at work and predicting future absentees

Analyzing Amazon Product Reviews Data

Incident Management Process Delay Prediction

Street Light Outage and Police Incident Reports in Norfolk

Chasing Stars

Predicting a Hepatitis Patient's Chance of Survival

New York City Airbnb

Predicting the winner of the next season of IPL Cricket based on past data, Visualizations, Perspectives.

Here are few projects that I'd like to highlight from the list.

Street Light Outage and Police Incident Reports in Norfolk

There are two datasets used in this project. Both of them were taken from City of Norfolk Open Data. The two datasets are the street light outage dataset and the Police Incident Reports dataset in Norfolk. The end goal of this project was to find the relationship between the street light outage and number of incidents that were reported to the police in Norfolk. However, there are possibilities of finding some other useful information once the datasets are carefully observed and explored. For example, The Street Light Outage dataset might reveal the factors that contribute to the functionality of the Street Lights using the prediction models.

Dissolving the Myth surrounding Gender, Ethnic and Job discrimination in the city of Norfolk

In the city of Norfolk which is home to a population of over 244,000 people, one would say discrimination and bias towards certain individuals does not exist but according to the saying," The facts are in the details" and "Numbers dont lie", It could be infered from the above charts that 'YES' there is some form of discrimination going on.

A careful observation of the Police officer group below reveals that white female within this group earn far less in terms of base salary when compared to white male. A robust formation at the first 25 percentile of violin plot reveals the hypothesis is true. There were not much difference in salary between the Black female and Male. The Hispanics had a much rather staggering base salary but Hispanic female do make more in terms of salary while the American Indians even though were not much,their presence in this group still had higher pay.

Determining the contributing factors and similarities of absentees at work and predicting future absentees

The goal with the project was to use data regarding absentees for work to determine what factors play the biggest roles and attempt to predict absentees in employees along with their reasoning. There can be many reasons why a employee may have to take off work regarding health, family life, and social activities. If the reasons and factors are fairly consistent with the attributes of an individual, it would be possible to predict why an new individual with similar attributes would take off work.

The major reasons for missing work are medical consultation and dental consultation.

Factors that affect absenteeism

Spring has the most absences and march is the most frequently missed month. (Flu Season)
The most commonly missed day of the week is Monday and least commonly missed is Thursday and Friday.
There is a strong correlation between employees who have a disciplinary failure and the reason they're absent. They're mostly absent because of a dentist appointment.

Correlations between employees:

Employees who are social drinkers are more likely to live far away from work.
Employees who own a pet are more likely to have an increased transportation expense.
Employees who worked with the company for long periods of time are more likely to be social drinkers.

Boston Crimes Exploratory Data Analysis

Boston is the largest city and the capital of Massachusetts. It’s one of the oldest and famous city in the U.S. As a cultural anchor in the thriving Seaport District, Boston attracts thousands of tourists every year. This dataset is provided by the Boston Police Department(BPD) which collects the crime types, date, frequencies and so on. As we can see from the EDA:

Larceny is by far the most common type of serious crime.
Serious crimes are most likely to occur in the afternoon and evening.
Serious crimes are most likely to occur on Friday and least likely to occur on Sunday.
Serious crimes are most likely to occur in the summer and early fall, and least likely to occur in the winter (with the exemption of January, which has a crime rate more similar to the summer).
There is no outstanding connection between major holidays and crime ratess
Serious crimes are most common in the city center, especially districts A1 and D4.

This EDA just one approach to analysis the dataset. Further study would expect how different types of crimes vary in time and space. Another interesting direction would be to combine this with another data about Boston city, such as demography or even the weather, to investigate what factors facilitate to predict crime rates across time and space.

-- Sampath

↧