Channel: Web Science and Digital Libraries Research Group

2018-08-30: Excited to Join WS-DL group in ODU!

I am an outlier compared with most computer scientists because I spent 10 years in a field called "Astronomy and Astrophysics." Very few computer scientists have followed my path of transferring from a seemingly unrelated major. But this is where my passion is, so I did it, and I made it!

Right after I earned my PhD in 2011, I joined the CiteSeerX group directed by Dr. C. Lee Giles at IST, Penn State University. I worked as a DBA for web crawling at the beginning, soon became the tech lead of the search engine, and recently became co-PI of an NSF-awarded proposal on CiteSeerX. I spent six years there, an unusually long time as a postdoc, and was then promoted to a teaching faculty position. However, I kept moving on, because I wanted to do research!

Luckily, Michael and Michele did not mind taking the risk and bet on me to become a tenure-track faculty member at Old Dominion University. So I accepted the offer and became a member of the Web Science and Digital Libraries group at ODU CS.

I appreciate the many CS faculty members, including but not limited to Dr. Jing He, Dr. Cong Wang, and Dr. Ravi Mukkamala, who helped me before and after I moved to Virginia Beach. Michael and Michele have already given me tremendous guidance on how to be successful. I am also glad to know Dr. Sampath Jayarathna and Dr. Jiangwen Sun as new colleagues. It is unbelievable that Sampath and I submitted our first NSF proposal before the first class began this fall!

I cherish my old friends at Penn State. I also look forward to doing more exciting work in this new position!

Posted by Jian Wu at ECSB, Norfolk, VA


2018-09-02: Sampath Jayarathna (Assistant Professor, Computer Science)

I am really excited to be part of Old Dominion University and the WS-DL group. I joined the faculty at Old Dominion University in 2018. Before that, I was a tenure-track assistant professor for two years at California State Polytechnic University (Cal Poly Pomona). I am truly grateful to Frank Shipman, Oleg Komogortsev, Richard Furuta, Dilma Da Silva, and Cecilia Aragon for their help throughout this faculty search. It is sad to say goodbye to my colleagues at Cal Poly, but I am excited to have an amazing bunch of mentors and colleagues here at ODU: Michael, Michele, Nikos, Ravi, Jian, Cong, Shubham, Anne, and many more. It's truly amazing that I was able to team up and put together two NSF proposals (CRII and REU Site) within such a short period of time.

I received my Ph.D. in Computer Science from Texas A&M University in 2016, advised by Frank Shipman. I was a member of the Center for the Study of Digital Libraries (CSDL) group. In 2012, I did a 6-month internship at Knowledge Based Systems Inc. in College Station, TX, building a collaborative analysis tool for the JackalFish enterprise search tool. I earned my MS degree from Texas State University-San Marcos in 2010, where I worked with Oleg Komogortsev on oculomotor systems research, eye tracking, and biometrics using eye movements. I spent the summer of 2009 at Lawrence Berkeley National Lab with Cecilia Aragon (currently a professor at UW Seattle) and Deb Agarwal on a very cool eye-movement-based biometric project.

My undergraduate degree is a B.S. in Computer Science (First Class Honors, similar to the Latin honor summa cum laude) from the University of Peradeniya, Sri Lanka, in 2006.
I am an avid gardener; my wife says I have a "green thumb," something to do with coming from a tropical island. Most of my plants did not survive the 20-day west-to-east-coast journey.

Sri Lankan "King Coconut"


I grow a variety of vegetables including tomatoes, watermelon, leafy greens, chilies, and some exotic tropical fruits and veggies. It is exciting to see what I can do with long hot summers and four-season weather.

My academic Genealogy, Bucket List, Goodreads Bookshelf, YouTube playlist, IMDB lists of favorite TV-shows, and Movies.

Posted by Sampath Jayarathna at 12:37 AM, Norfolk, VA

2018-09-03: Let's compare memento damage measures!

It is always nice getting a Google Scholar alert that one of my papers has been cited. In this case, I learned that the paper "Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment" (to appear in the ACM Journal of Data and Information Quality) cited a paper that I wrote during my doctoral studies with fellow PhD students Mat Kelly and Hany SalahEldeen and our advisors Michael Nelson and Michele Weigle. More specifically, the Reproducible Web Corpora paper (by Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, and Martin Potthast) is a very important and well-executed follow-on to our paper "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources" (a best student paper at JCDL 2014 and subsequently published in the International Journal of Digital Libraries).

In this blog post, I will be providing a quick recap and analysis of the Kiesel paper from the perspective of an author of the paper that provides the Brunelle15 metric used as the benchmark measure in the Kiesel 2018 paper.

(I suppose this should be referred to as a "guest post" since I have since graduated from the WS-DL research lab and am currently working as a Principal Researcher at the MITRE Corporation.)

Despite missing more embedded resources, the screenshot of the web comic (XKCD) on the left rates as higher quality than the screenshot on the right, since the screenshot on the right is missing the most important embedded resource: the image of the comic.
To begin, it is worthwhile to reflect on our 2014/2015 paper. We set out with the goal of improving upon the naive metric of "percentage of missing resources" used by archivists to assess the quality of a memento. Intuitively, larger, more central embedded resources (e.g., images, videos) are likely more important to a human's interpretation of quality than smaller images on the periphery of a web page. Similarly, a CSS resource that is responsible for formatting the look-and-feel of a page is more important than a CSS resource that does not have as great an impact on the visual layout of the page content. Using these qualitative notions of quality, we created an algorithm that measures -- quantitatively -- the quality of a memento based on the measured importance of its missing embedded resources. For example (and -- not coincidentally -- the one used in both the Brunelle and Kiesel papers), a web comic that is missing its large, centrally-located comic is much more "damaged" than a news article missing the social media share icons at the bottom of the page despite the percentage of missing embedded resources being greater for the news article than the web comic.
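As a rough sketch of that intuition (a simplification, not the exact algorithm from our paper; the weights and field names below are hypothetical), a damage score can weight each missing embedded resource by its size and its centrality on the page:

    # Hypothetical sketch of an importance-weighted damage score; the real
    # algorithm in our paper handles CSS, multimedia, and weighting differently.

    def resource_importance(width, height, center_dist, page_area):
        """Weight a missing resource by its size and proximity to the page center."""
        size_weight = (width * height) / page_area   # larger resources matter more
        centrality = 1.0 / (1.0 + center_dist)       # central resources matter more
        return size_weight * centrality

    def memento_damage(missing_resources, page_area):
        """Sum the importance of every missing embedded resource."""
        return sum(
            resource_importance(r["width"], r["height"], r["center_dist"], page_area)
            for r in missing_resources
        )

    # A single large, centered comic image contributes far more damage than
    # several small icons at the page's periphery:
    comic = [{"width": 740, "height": 600, "center_dist": 10}]
    icons = [{"width": 32, "height": 32, "center_dist": 800}] * 5
    page_area = 1024 * 2000
    print(memento_damage(comic, page_area) > memento_damage(icons, page_area))  # True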

From this baseline, we used Amazon's Mechanical Turk to identify whether or not our new measure of quality of a memento aligns more closely with a human's interpretation of quality than the naive measure of proportion of the missing embedded resources. I will leave the details and specifics of our approach, algorithm, and result to the reader, but the punch-line is that our algorithm out-performed (albeit slightly) the measure of proportion of missing embedded resources. The take-away from this paper is that there is merit to evaluating the nuances of how we interpret the quality of a memento during and after archiving its live-web counterpart. Erika Siregar has turned the algorithm from our paper into a service for measuring memento damage.




With this context, we can appropriately analyze the 2018 Kiesel paper. Kiesel and his counterparts are developing The Webis Web Archiver and wanted to create a method of immediately and automatically assessing quality of its mementos. The Webis Web Archiver uses an archiving approach that exercises the JavaScript-enabled aspects of a representation to ensure the deferred representations are appropriately archived, replayable, and maintain their behavior when archived. (This can be best described as a combination of the two-tiered crawling approach we proposed in 2017 and the approach used by Webrecorder.io.)

To accomplish the quality assessment of the Webis Web Archiver mementos, Kiesel, et al. sampled 10,000 URI-Rs (and if you want to analyze their data, reuse it, or extend this work, they have made the dataset available!) and identified 6,348 URI-Rs with mementos that had "reproduction errors" (which leads me to believe that the remaining 3,652 mementos were pixel-wise and embedded-resource equivalents of their live-web counterparts). With a methodology similar to our work, the authors assigned 9 Turkers to rate the quality of a screenshot of each memento against a screenshot of its live-web counterpart using a Likert scale (1-5), with 1 being "minimal impact" and 5 being "completely unusable".

One topic left out of the Kiesel paper is that a Turker's evaluation of what makes a "well preserved web page" is likely to differ from the evaluation of an archivist. This was a notional finding of our 2014/2015 memento damage work and -- while the nuances of this difference are alluded to -- is not directly mentioned in the Kiesel paper. For example, the edge case of a video "still loading" in the screenshot (among other examples cited in the Kiesel paper) is considered a low-quality memento by the paper's authors but may not be considered completely unusable by the Turkers. To reinforce the difference between archivists and Turkers, the authors noted that they changed the aggregate quality score of the Turker assessments for 11% of the comparisons (717 of the 6,348). To test the null hypothesis in this experiment, the authors could have presented Turkers with a perfect memento from the set of 3,652 mementos without reproduction errors. Ideally, the Turkers would have rated the comparisons in this set as 1s on the authors' Likert scale.
The image on the right shows the loading multi-media embedded resource.
This is part of Figure 5 in the Kiesel paper.

Using their evaluation approach, Kiesel, et al. compared the Brunelle15 approach, a pixel-wise comparison using RMSE, and a neural network-driven classifier according to their respective alignment with Turker assessments. As I would have assumed, the Brunelle15 measure is uncorrelated with Turker assessments. The authors' interpretation of this result matches mine: the Brunelle15 measure is performed in absence of the live-web representation, meaning it has to make assumptions about things like image size and placement when unavailable. Further, Brunelle15 puts an increased emphasis on image/multimedia as opposed to CSS. We assumed that Turkers focus on the potentially more highly visible CSS damage in a memento whereas archivists focus on the absence of prominent embedded resources despite formatting and positioning. I was surprised at how closely RMSE correlated with Turker assessments. This could be a potentially low-computational cost (as compared to training a neural network) method of identifying quality. Of course, the neural network approach performed best and demonstrates the promise of this approach.
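For readers curious what a pixel-wise RMSE comparison looks like in practice, here is a minimal sketch (assuming two same-sized screenshots on disk; the file names are placeholders, and the Kiesel paper's exact preprocessing may differ):

    import numpy as np
    from PIL import Image

    def screenshot_rmse(live_path, memento_path):
        """Root-mean-square error between two same-sized screenshots (grayscale)."""
        live = np.asarray(Image.open(live_path).convert("L"), dtype=np.float64)
        memento = np.asarray(Image.open(memento_path).convert("L"), dtype=np.float64)
        return np.sqrt(np.mean((live - memento) ** 2))

    # Lower values mean the memento's rendering is closer to the live page:
    # print(screenshot_rmse("live.png", "memento.png"))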

Figure 6 of the Kiesel paper shows the correlation of the models with the Turker ratings.
An interesting extension to the Kiesel work would be to identify the aspects of a page that lead to lower quality scores. In alignment with my assumption that higher ratings from Turkers are due to well-preserved CSS while archivists rate well-preserved "important" embedded resources higher, it would be interesting to see a more granular feature vector for a memento that can be used to tune an archival service -- for example, favoring small images over CSS if the research indicates that Turkers (or archivists) do not assign much value to quality based on formatting. (This is unlikely, in my opinion, but is a valid result.) The tuning can also be performed based on wall-clock time of the crawl (a topic that we discussed in our JCDL 2017 paper on what it "costs" to archive JavaScript). Another interesting extension would be comparing the quality of mementos resulting from different archival approaches such as archive.is and Webrecorder.io, particularly with respect to resources with deferred representations.

The Kiesel, et al. paper uses an optimal approach for immediate and automatic memento quality assessment -- comparison to the live web and human-assessed interpretations. They also use a neural network to learn to assess quality from the human evaluations. I view this work as a natural and necessary next step toward understanding how to measure memento quality. I look forward to their future work!

--Justin F. Brunelle

The authors' affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions or viewpoints expressed by the authors. Approved for Public Release; Distribution Unlimited. Case Number 18-2725-2.

2018-09-03: Trip Report for useR! 2018 Conference




This year I was really lucky to have my abstract and poster accepted for the useR! 2018 conference. The useR! conference is an annual worldwide conference for the international R user and developer community. The fourteenth annual conference was held in the Southern Hemisphere in Brisbane, Australia from July 10-13, 2018. This four-day conference consisted of nineteen 3-hour tutorials, seven keynote speeches, and more than 200 contributed talks, lightning talks, and posters on using, extending, and deploying R. This year, the program successfully gathered almost 600 users of the data analysis language R, from all corners of the world and from various levels of R expertise.

Distribution map of useR! 2018 participants across the globe


Fortunately, I was also granted a travel scholarship from the useR! 2018 and could attend the conference including the tutorial sessions for free (thanks useR! 2018).

Day 1 (July 10, 2018): Registration and Tutorial

The conference was held at the Brisbane Convention and Exhibition Centre (BCEC). Each participant registered at the secretariat desk and received a goodie bag containing a t-shirt, a pair of socks, and a lanyard (if lucky). The name tags could be picked up from a board ordered by last name.

The Secretariat Desk


T-shirt and name tag from useR! 2018


useR! 2018 was identified with hexagonal shapes, which could be found everywhere at useR! 2018: the name tags, the hex stickers, and of course, the amazing hexwall designed by Mitchell O'Hara-Wild. He also wrote a blog post about how he created the hexwall. There was also a hexwall photo contest where all conference attendees were asked to take a picture with the hexwall and post it on Twitter with the hashtag #hexwall.

Me and the hexwall


The R tutorials were conducted in parallel sessions from Tuesday to Wednesday morning (July 10-11, 2018). Each participant could take part in a maximum of three tutorials. The first tutorial that I attended was Wrangling Data in the Tidyverse by Simon Jackson.

This was my first time using the Tidyverse, and I found it really helpful for data transformation and visualization once I got familiar with it. Using example data from booking.com, we got hands-on experience with various data wrangling techniques such as handling missing values, reshaping, filtering, and selecting data. The thing that I love the most about the Tidyverse is the dplyr package. It comes with a very interesting feature, the pipe (%>%), which allows us to chain together many operations.

In the second tutorial, Statistical Models for Sport in R by Stephanie Kovalchik, we learned how to use R to implement statistical models that are common in sports statistics. The tutorial consisted of three parts:
  1. web scraping to gather and clean public sports data using RSelenium and rvest
  2. explore data with graphics
  3. implementing statistical models: Bradley-Terry paired comparison models, the Pythagorean Theorem, Generalized Additive Models, and forecasting with Bayes.
During the tutorials session, I met three other Indonesians who are currently studying in Australia as Ph.D. students (small world!).
Indonesian students at useR! 2018

Day 2 (July 11, 2018): Tutorial, Opening Ceremony, and Poster Presentation. 

Tutorial

The morning session was filled with tutorials continuing the series that began the day before. I attended the tutorial Follow Me: Introduction to social media analysis in R by Maria Prokofieva, Saskia Freytag, and Anna Quaglieri.

Dr. Maria Prokofieva talked about social media analytics using R

During this 2.5-hour tutorial, we learned how to use the R libraries twitteR and rtweet to extract data from Twitter and then convert the tweets in the text column into tokens using tidytext. In general, the whole process is quite similar to what I learned in the Web Science class taught by Dr. Michael Nelson at Old Dominion University (ODU), except that all of the processing is done in R instead of Python. At the end of the session, we were given a challenge to compare tweets mentioning Harry with tweets mentioning Meghan in the royal wedding time series. The answer was to be uploaded to Twitter using the hashtags #useR2018, #rstats, and #socialmediachallenge. All tutorial materials are available on R-Ladies Melbourne's GitHub.

R-Ladies Gathering

There was an R-Ladies gathering that took place during lunch after the tutorial session. It was an excellent opportunity to meet other amazing R-Ladies members who have done various projects and research in R and have published their R libraries on CRAN. It was really inspiring to hear their stories of promoting gender diversity in the R community. There are 75 R-Ladies groups spread across the globe. Unfortunately, there is no R-Ladies group in Indonesia at this moment. Maybe I should start one?
With Jenny Bryan during the R-Ladies meeting

Opening Ceremony, Keynote Speeches, and Poster Lightning Talk

At 1:30 pm, all conference attendees gathered in the auditorium for the opening ceremony. The event started with a Welcome to Country performance by Songwoman Maroochy, followed by an opening speech delivered by the useR! 2018 chief organizer, Professor Di Cook from the Department of Econometrics and Business Statistics at Monash University. In her remarks, Professor Cook encouraged all attendees to enjoy the meeting, learn as much as we can, and be cognizant of ensuring others have a good experience, too.

Opening speech by Professor Di Cook

By the way, for those who are curious, here's a sneak peek of the Songwoman Maroochy performance.



Next, we had a keynote speech by Steph de Silva: Beyond Syntax, Towards Culture: The Potentiality of Deep Open Source Communities.

After the keynote speech, there was a poster lightning talk session where every presenter was given a chance to advertise their work, let everyone know what it is about, and encourage people to come and see it during the poster session.
My poster lightning talk
Before ending the opening ceremony, there was another keynote speech by Kelly O'Briant of RStudio titled RStudio 2018 - Who We are and What We Do.

Poster Session.

The poster session wrapped up the day. I am so grateful that useR! 2018 used all-electronic posters, so we did not have to bother with printing a large poster and carrying it across the globe all the way to Australia. There were two poster sessions, one on Wednesday evening and another during lunch on Thursday. For the poster presentations, the conference committee provided twenty 47-inch TVs with HDMI connections to hook up to our laptops. This way, if someone asked, we could do a demo directly or show a specific part of our code on the TV as well.

In this conference, I presented a poster titled AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data. This project idea originated from a challenge we face at BPS-Statistics Indonesia. BPS produces a massive amount of strategic data every year. However, these data are still underutilized by public users because of several issues such as bureaucratic procedures, fees, and the long wait to get requested data processed. That's why we introduced AnalevR, an online R-based analysis environment that allows anyone anywhere to access BPS data and perform analyses by typing R code on a notebook-like interface and getting the output immediately. This project is still a prototype and currently in the development stage. The poster and the code are available on my GitHub.
Me during the poster session

Day 3 (July 12, 2018): Keynote Speech, Talk, Poster Presentation, and Conference Dinner

The agenda for day 3 was packed with two keynote speeches, several talks, poster presentation, and conference dinner.

Keynote Speech

The first keynote speech was The Grammar of Animation by Thomas Lin Pedersen (video, slides). In his speech, Pedersen explained that a visualization falls somewhere among three dimensions of DataViz nirvana: static, interactive, and animated. Each dimension has its own pros and cons. Mara Averick's tweet below gives a clearer illustration of this.
Pedersen implements this grammar concept by rewriting the gganimate package, which extends the ggplot2 package to include descriptions of animation such as transition, view, and shadow. He made his presentation even more engaging by showing an example that channels Hans Rosling's 200 Countries, 200 Years, 4 Minutes visualization. The example is made by utilizing the transition_time() function in the gganimate package.

The second keynote speech was Adventures with R: Two Stories of Analyses and a New Perspective on Data by Bill Venables. He discussed two recent analyses, one from psycholinguistics and the other from fisheries, that show the versatility of R in tackling the full range of challenges facing the statistician/modeler adventurer. He also compared Statistics and Data Science and discussed how they relate to each other. The emerging field of data science is not a natural successor of Statistics; there are some subtle differences between them. Professor Venables said that both are important, connected domains, but we have to think of them as bifurcating to some extent rather than taking on each other's roles. Things work best when domain expert and analyst work hand in hand.
Professor Venables ended his speech by mentioning two quotes that I would like to requote here:

"The relationship between Mathematics and Statistics is like that between chemistry and winemaking. You can bring as much chemistry as you can to winemaking, but it takes more than chemistry to make a drinkable dry red wine." 

"Everyone here is smart, distinguish yourself by being kind."


There was a tribute to Bill Venables at the end of the event.

The Talk Sessions

There were 18 parallel sessions of talks conducted from 10:30 am to 4:50 pm. The sessions were held in three parts, separated by two tea breaks and one lunch break. I managed to attend eight talks that covered topics in data handling and visualization.
  1. Statistical Inference: A Tidy Approach using R by Chester Ismay.
    Chester Ismay from DataCamp introduced the infer package, which was created to implement common classical inferential techniques in a tidyverse-friendly framework that is expressive of the underlying procedure. There are four main objectives of this package:
    1. Dataframe in, dataframe out
    2. Compose tests and intervals with pipes
    3. Unite computational and approximation methods
    4. Reading a chain of infer code should describe the inferential procedure
  2. Data Preprocessing using Recipes by Max Kuhn.
    Max Kuhn of RStudio gave a talk about the recipes package, which is aimed at data preprocessing for predictive modeling. Recipes works in three steps (recipe → prepare → bake):
    1. Create a recipe, which is the blueprint of how your data will be processed. No data has been modified at this point.
    2. Prepare the recipe using the training set. 
    3. Bake the training set and the test set. At this step, the actual modification will take place.
  3. Build Scalable Shiny Applications for Employee Attrition Prediction on Azure Cloud by Le Zhang
    Le Zhang of Microsoft delivered a talk about building a model for employee attrition prediction and deploying the analytical solution as a Shiny-based web service on the Azure cloud. The project is available on GitHub.
  4. Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix by Bryan Galvin
    Bryan Galvin of Netflix gave the audience a look inside the machine learning infrastructure at Netflix. Galvin briefly explained how Netflix moves models to production using R and a microframework named Metaflow. Here's the link to the slides.
  5. Rjs: Going Hand in Hand with Javascript by Jackson Kwok
    rjs is a package designed for utilizing JavaScript's visualization libraries and R's modeling packages to build tailor-made interactive apps. I think this package is super cool, and it was an absolute highlight for me at useR! 2018. I will definitely spend some time learning this package. Below is an example of an rjs implementation. Check the complete project on GitHub.
  6. Shiny meets Electron: Turn your Shiny App into a Standalone Desktop App in No Time by Katie Sasso
    Katie Sasso of Columbus Collaboratory shared how the Columbus Collaboratory team overcame the barriers of using Shiny for large enterprise consulting by coupling R Portable and Electron. The result is a Shiny app in a stand-alone executable format. The details of her presentation, along with the source code and a tutorial video, are available on her GitHub.
  7. Combining R and Python with GraalVM by Stepan Sindelar
    Stepan Sindelar of Oracle Labs told us how to combine R and Python into a polyglot application running on GraalVM. GraalVM enables us to operate on the same data without the need to copy the data when crossing language boundaries.
  8. Large Scale Data Visualization with Deck.gl and Shiny by Ian Hansel.
    Ian Hansel of Verge Labs talked about how to integrate deck.gl, a web data visualization framework released by Uber, with Shiny using the R package deckard.
Conference Dinner

The conference dinner ticket

The conference dinner could only be attended by people who had a ticket. I was fortunate because, as a scholarship recipient, I got a free ticket for the dinner (again, thank you, useR! 2018 and R-Ladies Melbourne). There was a trivia quiz at the end of the dinner. All attendees were grouped based on the table they were sitting at and had to team up to answer all the questions on the question sheets. The solution for the quiz can be found here. The teams who won the quiz got free books as prizes.

The conference dinner and the trivia quiz
Day 4 (July 13, 2018): Keynote Speech, Talk, and Closing Ceremony

Keynote Speech

The last day of the conference started with a keynote speech, Teaching R to New Users: From tapply to Tidyverse by Roger Peng. In his talk, Dr. Peng talked about teaching R and selling R to new users. It can be difficult to describe the value proposition of R to someone who has never seen it before. Is it an interactive system for data analysis, or is it a sophisticated programming language for software developers? To answer this, Dr. Peng quoted a remark from John Chambers (one of the creators of the S language):

"The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."

I think this is the beauty of R that attracts me. I did not have to jump into development directly, but could instead gradually transition into programming. To sum up, Dr. Peng shared keywords that could be useful in selling R to new users: free, open source, graphics, reproducibility - reporting - automation, R packages + community, RStudio, transferable skills, and jobs ($$).

Some tips for selling R by Dr. Roger Peng
The second keynote speech was R for Psychological Science by Danielle Navarro (video, slides). Dr. Navarro shared her experience teaching R to psychology students. Fear, apparently, is the main challenge that prevents students from learning. She also talked about the difficulty she faced in finding a good textbook to use in her class, which finally led her to write her own lecture notes. Her lecture notes tried to address student fears by using a relaxed style. This worked so well that she ended up with her own book and won a teaching award. Dr. Navarro ended her talk by encouraging everyone to conquer their fears and climb the mountain of R. It might not be easy to avoid the 'dragon' at the top, but there are always people who will support and help us. She reminded our community that we are stronger when we are kind to each other.
The third and last keynote was Code Smells and Feels by Jenny Bryan. She shared some tips and tricks on how to write code elegantly so that it is easier to understand and cheaper to modify. Some code smells apparently have official names, such as Primitive Obsession and Inappropriate Intimacy.
Here are some tips that I summarize from her talk:
  1. Write simple conditions
  2. Use helper functions
  3. Handle class properly
  4. Return and exit early
  5. Use polymorphism
  6. Use switch() if you need to dispatch different logic based on a string.
Besides the three great keynotes above, I also attended several short talks:
  1. Tidy forecasting in R by Rob Hyndman
  2. jstor: An R Package for Analysing Scientific Articles by Thomas Klebel
  3. What is in a name? 20 Years of R Release Management by Peter Dalgaard
  4. Sustainability Community Investment in Action - A Look at Some of the R Consortium Funded Grant Projects and Working Groups by Joseph Rickert
  5. What We are Doing in the R Consortium Funds by various funded researchers

Closing Ceremony

The closing speech was delivered by Professor Di Cook from the Department of Econometrics and Business Statistics at Monash University. There was also a small handover ceremony between Di Cook and Nathalie Vialaneix, who will organize next year's useR! 2019 in Toulouse, France.
At the end of the ceremony, there was an announcement of the winners of the hexwall photo contest, who were chosen randomly.
It was indeed a delightful experience for me. I went home happy, with a list of homework and new packages that I have to learn. For those who did not make it to the useR! 2018 conference, do not feel FOMO: all talks and keynote speeches are posted online on the R Consortium's YouTube account.

I would like to thank Professor Di Cook of Monash University as well as R-Ladies Melbourne for giving me a scholarship and making it possible for me to attend this conference. I would also like to congratulate the entire useR! 2018 organizing committee for their great and brilliant efforts to make this event a success. I really look forward to joining next year's useR! 2019, which will be held from July 9-12, 2019, in Toulouse, France. So, do not miss the updates: check its website and follow the Twitter account @UseR2019_Conf and the hashtag #useR2019.

@erikaris

2018-10-10: Americans More Open Than Asians to Sharing Personal Information on Twitter: A Paper Review

Mat Kelly reviews "A Personal Privacy Preserving Framework..." by Song et al. at SIGIR 2018.


Americans are more open to share personal aspects on the Web than Asians.— Song et al. 2018

I recently read a paper published at SIGIR 2018 by Song et al. titled "A Personal Privacy Preserving Framework: I Let You Know Who Can See What" (PDF). The title alone captivated my interest with the above claim deep within the text.

The authors' goal was to reduce users' privacy risks on social networks by determining who could see what sort of information they posted. They did so by summarizing boundary regulations from the literature and associating them with 32 categories, each corresponding to a personal aspect of a user, broken down into 8 groups ranging from personal attributes to life milestones. The authors then fed a list of keywords to the Twitter Search Service for each category they established. From this taxonomy they created a model to uncover personal aspects from users' posts. Their model, TOKEN (a forced abbreviation of laTent grOup multi-tasK lEarniNg), allowed the authors to create guidelines for information disclosure by users into four kinds of social circles and to generate a data set consisting of a rich set of privacy-oriented features (available here).

The authors noted that users' private tweets are very sparse, so they used the Twitter service to gather posts matching the categories in their taxonomy, collecting just over 269k tweets. To reduce noise in the collection, the authors filtered out tweets containing URLs that were not in reference to the users' respective posts on other social media. Retweets and tweets of fewer than 50 characters were excluded. The authors did not justify this exclusion.

To establish a ground truth, the authors used Amazon Mechanical Turk to have each post annotated with their selected categories. Turkers whose annotations did not agree with the authors' sample at least 80% of the time were excluded from the results. This procedure resulted in just over 11k posts being labeled. To determine inter-worker reliability, the authors employed Fleiss' kappa (PDF of the 1969 paper), adapting for the potential variance in label count per post by reducing to a binary classification, and determined moderate agreement (Fleiss' coefficient of 0.43).
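For readers unfamiliar with the measure, a minimal sketch of computing Fleiss' kappa on a binary labeling task (the counts below are made up; statsmodels provides an implementation):

    import numpy as np
    from statsmodels.stats.inter_rater import fleiss_kappa

    # Each row is one post; the columns count how many of the 9 Turkers chose
    # each of the two (binary) labels, e.g. "discloses this aspect" vs. "does not".
    ratings = np.array([
        [7, 2],
        [5, 4],
        [9, 0],
        [3, 6],
    ])

    # Values around 0.4-0.6 are conventionally read as moderate agreement.
    print(fleiss_kappa(ratings))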

The authors then extracted a set of privacy-oriented linguistic features using Linguistic Inquiry Word Count (LIWC), a Privacy Dictionary (per Vasalou et al.'s 2011 JASIST work), Sentiment Analysis (via Stanford's NLP classifier), Sentence2Vector (with each tweet a sentence), and an ad hoc meta-feature approach. The aforementioned final approach considered the presence of hashtags, slang words, images, emojis, and user mentions. Slang, here, was identified using the Internet Slang Dictionary.

Following this analysis, the authors established a prediction component by first formulating a predictive model inter-relating each of the 32 "tasks" within the 8 "groups". The authors anticipated that tasks within the same group would share relevant features, e.g., "places planning to go" and "current location", would share common features within the location group in their taxonomy. From this initial formulation they established the matrix L, whose columns represent the latent features, and S, whose rows represent the weights of the features in L.

To solve for L and S, the authors optimized one variable while fixing the other in each iteration of the analysis. To determine L, they took the derivative of their objective function (their Equation 5, see paper) with respect to L to produce a linear system with a vector B representing the stacking of columns into a single matrix and A, a definite and invertible matrix. Computing S with L fixed was a bit more mathematically complex, which I will leave as an exercise in understanding for the interested reader.

Prescription

...there is still a societal consensus that certain information is more private than the others (sic) from a general societal view. — A. Islam et al. 2014

The authors used Mechanical Turk to build guidelines regarding disclosure norms in different circles. This was performed on two selections of Turkers limited by their respective geographies of the U.S. and Asia. The authors note that 99% of the Asian participants were Indians. An anticipated real-world goal of the authors was, when a user posts a tweet containing information on a health condition (for example), to set the privacy setting so that the tweet is shared only with her family members. This, I felt, would be an odd recommendation given:

  1. The corpus was of publicly available tweets.
  2. Twitter does not currently have a means of limiting who may see a tweet, akin to services like Facebook.

This drastically reduces the usefulness of the recommendation, I feel, in the context of the medium observed.

Verification

The authors sought to detect privacy leakage by comparing the precision of TOKEN using the S@K and P@K metrics, as they had previously done in Song et al. 2015 (from IJCAI). Here, S@K represents the mean probability that a correct interest is captured within the top K recommended categories, and P@K represents the proportion of the top K recommendations that are correct. They used a grid search strategy with 10-fold cross-validation to obtain the optimal parameters.
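My reading of these two metrics, sketched below for a single post (the category names and helper functions are hypothetical; the paper averages these values over all posts):

    def success_at_k(true_categories, ranked_predictions, k):
        """S@K for one post: 1 if any correct category appears in the top-K recommendations."""
        return int(any(c in true_categories for c in ranked_predictions[:k]))

    def precision_at_k(true_categories, ranked_predictions, k):
        """P@K for one post: fraction of the top-K recommendations that are correct."""
        return sum(c in true_categories for c in ranked_predictions[:k]) / k

    truth = {"health condition"}
    ranked = ["current location", "health condition", "occupation"]
    print(success_at_k(truth, ranked, 3))    # 1
    print(precision_at_k(truth, ranked, 3))  # 0.333...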

Using S@K and P@K where K was set to 1, 3, and 5, the authors found LIWC to be most representative of the characterization of users' privacy features as compared to the aforementioned Privacy Dictionary, Sentence2Vector, etc. approaches. They attributed this to LIWC's inclusion of pronouns and verb tense that provide references and temporal hints.

In applying these feature configurations to their corpus, the authors noticed that timestamps played an important role in identifying private information leakage, so they took a detour to cursorily explore this. Based on the patterns found (pictured below), various activities peak at certain times of day, e.g., drug and alcohol tweets around "20pm" (sic). It is unclear from the paper whether this was applied to both the U.S. and the Asian/Indian results. Further, the multiple-plot display with varying inter-plot y-axis scales produces a deceptive result that the authors do not address.

Plots per Song et al. show temporal patterns.

To finally validate their model compared to S@K and P@K, they used SVM, MTL_Lasso,

A different set of categories was compared to show similarities in sharing comfort between Americans and Asians. For these (healthcare treatments, health conditions, passing away, specific complaints, home address, current location, contact information, and places planning to go), American Turkers were much more restrictive about sharing with the outside world, whereas Asian Turkers exhibited a similarly and relatively conservative sentiment about sharing. From this, the authors concluded:

Americans are more open to share personal aspects on the Web than Asians.— Song et al. 2018

Take Home

I found this study to be interesting despite some of the methodological problems and derived conclusions. As I mentioned, the inability to regulate who sees tweets when posting (a la Facebook) affects the nature of the tweet, with a likely bias toward tweeters who are less concerned about privacy. The authors did not mention whether each Turker was asked if they personally used Twitter, or whether the Turkers were even told that the text they judged consisted of tweets rather than just "messages posted online". This context, if excluded, could make those judging the tweets unsuitable to do so. I would hope to see an expanded version of this study (say, posted to arXiv) with more comprehensive results, as the authors stated space was a limitation, but there was no indication of one.

—Mat (@machawk1)

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018 in Boston, MA, for which both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. However, this year the organizers created and shared collaborative notes with all attendees for all sessions to help those who couldn't attend individual sessions. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to pick up their registration lanyards and iPRES swag.

Afterwards, there were scheduled workshops and tutorials to enjoy throughout the day. Registrants needed to sign up early to get into these workshops. Many different topics were available for attendees to choose from, listed on the Open Science Framework event page. Shawn and I chose to attend:
  • Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
  • Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.
Our first session, on archiving email, consisted of talks and small group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in our organizations. Within our group discussion, it turned out that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially regarding studying natural language in email content.
Many of the difficulties of archiving email unsurprisingly revolve around issues of privacy. The difficulties range from actually requesting and acquiring emails from users, to discovering and disclosing sensitive information inside emails, to other ethical decisions around preserving email.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails inside a collection can be time-consuming and redundant, which requires the development of new tools to combat these challenges.
This workshop also exposed many interesting tools to use for archiving and exploring emails including:



Many different workflows for archiving email and also using the aforementioned tools for archiving emails were explained thoroughly at the end of the session. These workflows covered migrations with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

Following the email archiving session we continued on for the Human Scale Web Collecting for Individuals and Institutions session presented by Anna Perricci from the Webrecorder team.


Having used Webrecorder before, I was very excited for this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are formed as easily as clicking different links on a website. Webrecorder can handle JavaScript replay very efficiently. For example, past videos streamed from a website like Vine or YouTube are recorded from a user's perspective and then available for replay later in time. Other examples included automated scrolling through Twitter feeds or capturing interactive news stories from the New York Times.
During the presentation Anna showed Webrecorder's capability of extracting mementos from other web archives for the possibility of repairing missing content. For example, it managed to take CNN mementos from the Internet Archive past November 1, 2016 and then fix their replay by aggregating resources from other web archives and also the live web - although this could also be potentially harmful. This is an example of Time Travel Reconstruct implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how it could play an important role in replaying specific content like Flash. He demonstrated various tools available as open source on GitHub, including: pywb, Webrecorder WARC player, warcio, and warcit.
Ilya also teased at Webrecorder's Auto Archiver Prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors for these platforms. Auto Archiver introduces automation of the capture of many different web resources on a website, including video and other sources.
Webrecorder Scalar automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day attendees gathered for the plenary session which was opened by a statement from Chris Bourg.



Eve Blau then continued the session by presenting the Urban Intermedia: City, Archive, Narrative capstone project of a Mellon grant. This talk was about a Mellon Foundation project, the Harvard Mellon Urban Initiative. It is a collaborative effort across multiple institutions spanning architecture, design, and the humanities. Using multimedia and visual constructs, it looked at processes and practices that shape geographical boundaries, focusing on blind spots in:
  • Planned / unplanned - informal processes
  • Migration / mobility, patterns, modalities of inclusion & exclusion
  • Dynamic of nature & technology, urban ecologies
After the keynote I hurried over to open the Web Preservation session with my paper on Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the tool top-news-selectors we created for mining archived news, how the similarity of news collections was calculated, the events that peaked in similarity, and how the U.S. election was recognized as a significant event among many of the news sites.


Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It and then detailed how many of these collections can go off topic: for example, pages that have missing resources at a point in time, content drift that causes different languages to be included in a collection, site redesigns, etc. This led to the development of the Off-Topic Memento Toolkit to detect off-topic mementos inside a collection by collecting each memento and then assigning it a score, testing multiple different measures. The study showed that Word Count had the highest accuracy and the best F1 score for detecting off-topic mementos.
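As a rough sketch of the word-count idea (my own simplification, not the toolkit's code; the threshold below is illustrative), a memento whose word count drops sharply relative to the seed's first capture can be flagged as potentially off-topic:

    import re

    def word_count(html_text):
        """Crude word count after stripping tags; real tools do more careful text extraction."""
        return len(re.sub(r"<[^>]+>", " ", html_text).split())

    def off_topic_by_word_count(first_memento_html, later_memento_html, threshold=-0.85):
        """Flag a memento whose word count dropped sharply versus the first capture."""
        first = word_count(first_memento_html)
        if first == 0:
            return False
        change = (word_count(later_memento_html) - first) / first
        return change <= threshold  # e.g., an 85% drop suggests the page went off-topic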

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 collections from Archive-It as data, Shawn explained the concept of growth curves for collections, which compare seed count, memento count, and memento-datetime. Using different classifiers, Shawn showed that the structural features of a collection can be used to predict its semantic category, with the best classifier found to be Random Forest.
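The structural-feature classification could look roughly like the sketch below (the feature names, values, and category labels are hypothetical, not the paper's data):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical structural features per collection:
    # [seed count, memento count, lifespan in days, growth-curve slope]
    X = np.array([
        [25, 1200, 900, 0.4],
        [300, 54000, 2200, 0.9],
        [10, 150, 120, 0.1],
        [80, 9000, 1500, 0.6],
    ])
    y = ["event", "topic", "event", "topic"]  # made-up semantic categories

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict([[40, 2000, 800, 0.3]]))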


Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, here synonymous with a file or a collection of files, are bounded in digital preservation. He introduced the concept of "performative boundaries" to describe the different potentials of an object -- bound, blurry, and boundless -- using early software examples like early-2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for future replay of these objects:

  • Emulation of auxiliary machines
  • Synthetic stub services or simulations
  • Capture network traffic and re-enact on access 

Dragan Espenschied presenting on Apparently Infinite Objects
The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. This presentation informed us that digital preservation of one's work generally isn't part of the teaching at art schools, and it should be. Digital technologies are widely used today for creating art in a variety of different formats. When various artists were asked about digital preservation, this is how they answered:
“It’s not the kind of thing that gets taught in art school, is it?”
“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.” 
The third presentation was by Morgane Stricot for her short paper Open the museum's gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a large threat to digital art and that supporting media archaeology is required for the preservation of some forms of digital art. Backups of older operating systems (OS) on disks help avoid issues of incompatibility. She also detailed how, because of copyright prohibitions, older operating systems such as old Mac OS versions are difficult to find, and that many pirates as well as "harmless hackers" have cracks to gain access to these OS environments, while some remain unsalvageable.
The final paper presentation was by Claudia Roeck on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior, using the case study Horizons (2008) by Geert Mul. Claudia explored different possible preservation strategies for software, such as reprogramming in a different programming language, migration of software, virtualization, and emulation, as well as the significant properties that determine the qualities one would want to preserve. She used Horizons as an example project to explore the use cases and determined that reprogramming was the option they decided was suitable for it. However, she stated that there was no clear winner for the best mid-term preservation strategy for the work.
For the rest of the day lightning talks were available to the attendees and it became packed with viewers. Some of these talks consisted of preservation games to be held the next day such as: Save my Bits, Obsolescence, Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, held a lightning talk showing a demo of the new Auto Archiver prototype for Webrecorder.


After the proceedings another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards 

This day was opened by a review of iPRES's achievements and challenges over the past 15 years, with a panel discussion composed of William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research interests. This led to iPRES's adoption of cross-domain preservation in libraries, archives, and digital art. Some of the challenges include archivists deciding what to do with past and future data and conforming to the OAIS standard.
After talking about the past 15 years, it was time to talk about the next 15 years with a panel discussion composed of Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed what would make it possible for more attendees to take part in the future. They discussed possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were suggestions for updates to the Code of Conduct and the value it should hold for the future.
After the discussion panels it was time for minute madness. I had seen videos of this before, but it was the first time I had seen it in person, and I found it somewhat theatrical. Most people had to pitch their research in a minute so that we would later come visit them during the poster session, while some of them put on a show, like Remco van Veenendaal. The topics ranged from workflow integration, new portals for preserving digital materials, and code ethics to timelines detailing file formats.

After the minute madness attendees wandered around to view the posters available. The first poster I visited conveniently was referencing work from our WSDL group!
Another interesting poster consisted of research into file format usage over time.
I was also surprised at the number of tools and technologies and at some of the new preservation platforms for government agencies that had emerged, like Vitam, the French government IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture
Following the poster sessions I was back at the paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. This talk covered machine-actionable data management plans (maDMPs), which represent living documents automated by information collection and notification systems. He showed how currently formatted data management plans could be transformed to reuse existing standards such as Dublin Core and PREMIS.
Alex Green then went on to present her short paper Using blockchain to engender trust in public digital archives. It was explained that archivists alter, migrate, normalize, and sometimes otherwise change digital files, but there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposes to use blockchain to verify the integrity of these files and their provenance. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper on his blog.
I then went on to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The focus of the talk was computing checksums on files to detect bit rot in digital objects, and it compared different approaches for verifying bit-level preservation. It showed that data integrity can be achieved when computer hardware and filesystems such as ZFS are dedicated to digital preservation. This work builds a bridge between digital preservation practices and high-performance computing for detecting bit rot.
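Independent of the filesystem-level approach the paper discusses, a minimal application-level fixity check is only a few lines of Python (a sketch; real preservation systems also record when and how each check was run):

    import hashlib

    def sha256_of_file(path, chunk_size=1 << 20):
        """Stream a file through SHA-256 so large preservation objects fit in memory."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def fixity_check(path, stored_checksum):
        """Recompute the checksum and compare it to the stored value to detect bit rot."""
        return sha256_of_file(path) == stored_checksum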

After this presentation I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchy for long-term preservation. The goals of this layout are structure at scale, readiness for migrations while minimizing file transfers, and the ability to be managed by many different applications. With a set of defined principles for this file layout, such as the ability to log transactions on digital objects, there is a plan for a draft spec release around the end of 2018.
This day closed with the award ceremony for best poster, short papers, and long papers. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not prevail as the winner. The winners were as follows:
  • Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
  • Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it  by Eld Zierau
  • Best poster award: Precise & Persistent Web Archive References by Eld Zierau



Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting and pushing these materials into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and knowledge about accessing these documents lost over time with no standards in place.

Afterwards I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. It covered the acquisition of three different floppy disk collections dating from 1977 to 1989! This presentation introduced me to unfamiliar hardware, software, and encodings required for attempting to recover data from floppy disk media, as well as a workflow for data recovery from these floppies.

The last paper presentation I viewed was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already been to the email preservation workshop, I was excited for this presentation and I was not let down. Using 20 GB of publicly available emails, they applied two different methods, a capstone approach and a predictive coding approach, for discovering sensitive content inside emails. With the predictive coding approach, which uses machine learning to train on and then classify documents, they showed preliminary results indicating that automatic email classification is capable of handling email collections at scale.
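For intuition, predictive coding of this kind can be approximated with an off-the-shelf text classifier. The sketch below is purely illustrative and not the authors' system; the tiny training set and the "sensitive" labels are made up.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in for reviewer-labeled training emails (1 = sensitive, 0 = not)
    train_emails = [
        "Attached is the draft settlement agreement, please keep confidential",
        "Lunch menu for the cafeteria this week",
        "Employee SSN and salary spreadsheet attached",
        "Reminder: staff meeting moved to 3pm",
    ]
    train_labels = [1, 0, 1, 0]

    model = make_pipeline(TfidfVectorizer(stop_words='english'),
                          LogisticRegression())
    model.fit(train_emails, train_labels)

    # Predict labels for unreviewed emails at scale
    print(model.predict(["Budget spreadsheet with employee salaries attached"]))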

As a final farewell, attendees were handed bags of tulip bulbs and told this:
"An Honorary Award will be presented to the people with the best tulip pictures."
It seems William Kilbride, among others, has already gotten a leg up on the competition.
This marks the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and hearing about future research in the realm of digital preservation.


Resources for iPRES 2018:


Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a very straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs.


  • CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside a CDATA section are not parsed. In other words, they are kept as they are, including markup. One example is
    <script>
    <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]>
    </script>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding a text file uses before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file with a candidate encoding. If the encoding is wrong, the program will usually throw an error. In that case, we try another possible encoding. The "file" command in Linux gives encoding information, so I know there are 2 encodings in the ACM DL XML files: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only 5 built-in entities in XML are &quot;, &amp;, &apos;, &lt;, and &gt;. Any other entities should be defined in a DTD file to specify what they mean. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

    import codecs
    import logging
    from bs4 import BeautifulSoup

    encodings = ['ISO-8859-1', 'ascii']
    for e in encodings:
        try:
            fh = codecs.open(confc['xmlfile'], 'r', encoding=e)
            fh.read()  # force a decode so a wrong guess raises UnicodeDecodeError
            fh.close()
        except UnicodeDecodeError:
            logging.debug('got unicode error with %s, trying a different encoding' % e)
        else:
            logging.debug('opening the file with encoding: %s' % e)
            break
    f = codecs.open(confc['xmlfile'], encoding=e)
    soup = BeautifulSoup(f.read(), 'html.parser')

Note that we use codecs.open() instead of the Python built-in open(). And we open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found that BeautifulSoup handles XML parsing better than lxml alone, not just because it is easier to use but also because you are allowed to pick the parser. Note that I chose html.parser instead of the lxml parser. This is because the lxml parser is not able to parse all entries (for some unknown reason), as reported by other users on Stack Overflow.
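Related to the entity problem above: if you need to feed the text to a stricter XML parser that has no DTD for the extra entities, one workaround (not used in the snippet above, and only a hedged sketch) is to expand HTML named entities first with Python's built-in html module:

    import html

    raw = 'J&ouml;rg &amp; friends visited the caf&eacute;'
    print(html.unescape(raw))  # Jörg & friends visited the café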

2018-11-08: Decentralized Web Summit: Shaping the Next Web


In my wallet I have a few ₹500 Indian currency notes that say, "I PROMISE TO PAY THE BEARER THE SUM OF FIVE HUNDRED RUPEES", followed by the signature of the Governor of the Reserve Bank of India. However, this promise was broken two years ago today; since then, these bills in my pocket have been nothing more than rectangular pieces of printed paper. So, I decided to utilize my origami skills and turn them into butterflies.

On November 8, 2016, at 8:00 PM (Indian Standard Time), Indian Prime Minister Narendra Modi announced the demonetization (effective four hours later, at midnight) of the two biggest currency notes (₹1,000 and ₹500) in circulation at that time. Together these two notes represented about 86% of the total cash economy of India. More than 65% of the Indian population still lives in rural and remote areas where the availability of electricity, the Internet, and other utilities is not yet reliable. Hence, cash is a very common means of doing business in daily life there. It was morning here in Norfolk (USA) and I was going through the news headlines when I saw this announcement. For a while I could not believe the news was real and not a hoax. I did not even know there was a concept called demonetization that governments could practice. Irrespective of my political views and irrespective of the intents and goals behind the decision (whatever good or bad they might have been), I was shocked to realize that the system has so much centralization of power in place that a single person can overnight impose suffering on about 18% of the global population and cause chaos in the system. I wished for a better and more resilient system; I wanted a system with decentralization of power by design, where no one entity has a significant share of power and influence. I wanted a DECENTRALIZED SYSTEM!

When the Internet Archive (IA) announced plans for the Decentralized Web (DWeb) Summit, I was on board to explore what we can do to eliminate centralization of control and power in systems on the Web. With generous support from the Protocol Labs, AMF, and NSF IIS-1526700 grants, I was able to travel to the West Coast to experience four days full of fun and many exciting events. I got the opportunity to meet many big names who brought us the Web we experience today, and many of those who are working towards shaping the future of the Web with their vision, ideas, experience, code, art, legal understanding, education, or social values. They all had a different perspective to share, but all seemed to agree on one goal: fixing the current Web, where freedom of expression is under an ever-growing threat, governments control the voice of dissent, big corporations use the personal data of Internet users for monetary benefit and political influence, and those in power try to suppress the history they might be uncomfortable with.

There was so much going on in parallel that perhaps no two people have experienced the same sequence of events. Also, I am not even pretending to tell everything I have observed there. In this post I will be describing my experience of the following four related events briefly that happened between July 31 and August 3, 2018.

  • IndieWebCamp SF
  • Science Fair
  • Decentralized Web Summit
  • IPFS Lab Day

IndieWebCamp SF


The IndieWeb is a people-focused alternative to the "corporate web". Its objectives include: 1) Your content is yours, 2) You are better connected, and 3) You are in control. Some IndieWeb people at Mozilla decided to host IndieWebCamp SF, a bootcamp the day before #DWebSummit, and shared an open invitation with all participants. I was quick to RSVP for what was going to be my first interaction with the IndieWeb.

On my way from the hotel to Mozilla's SF office, the Uber driver asked me why I had come to SF. I replied, "to participate in an effort to decentralize the Web". She seemed puzzled and said, "my son was mentioning something about it, but I don't know much". "Have you heard about Bitcoin?", I asked her to get an idea of how to explain. "I have heard this term in the news, but don't really know much about it", she said. So, I started the elevator pitch, and in the next eight or so minutes (about four round trips of Burj Khalifa's elevator from the ground to the observation deck) I was able to explain some of the potential dangers of centralization in different aspects of our social life and what some of the alternatives are.




The bootcamp had both on-site and remote participants and was well organized. We started with keynotes from Miriam Avery, Dietrich Ayala, and Ryan Barrett, then some people introduced themselves, explained why they were attending the DWeb Summit, and shared the ideas they had for the IndieWeb bootcamp. Some people gave lightning demos. I demonstrated InterPlanetary Wayback (IPWB) briefly. I got to meet some people behind projects I was well aware of (such as Universal Viewer and Dat Project) and also got to know about some projects I didn't know before (such as Webmention and Scuttlebutt). We then scheduled BarCamp breakout sessions and had lunch.

During and after the lunch I had an interesting discussion and exchanged ideas with Edward Silverton from the British Library and a couple of people from Mozilla's Mixed Reality team about the Universal Viewer, IIIF, Memento, and multi-dimensional XR on the Web.




Later I participated in two sessions, "Decentralized Web Archiving" and "Free Software + Indieweb" (see the schedule for notes on various sessions). The first one was proposed by me; in it I explained the state of web archiving, current limitations and threats, and the need to move it to a more persistent and decentralized infrastructure. I also talked about IPWB and how it can help in distributed web archiving (see the notes for details and references). In the latter session we talked about different means to support Free Software and open-source developers (for example, bug bounties, crowdfunding, and recurring funding) and compared and contrasted different models and their sustainability against closed-source software backed by for-profit organizations. We also briefly touched on some licensing complications.

I had to participate in the Science Fair at IA, so I had to get there a little earlier than the start time of the session. With that in mind, Dietrich (from the Firefox team) and I left the session a little before it was formally wrapped up as the SF traffic in the afternoon was going to make it a rather long commute.

Science Fair


The taxi driver was an interesting person with whom Dietrich and I shared the ride from the Mozilla SF office to the Internet Archive, talking about national and international politics, history, languages, music, and whatnot until we reached our destination, where food trucks and stalls were serving dinner. It was windier and chillier out there than I had anticipated in my rather thin jacket. Brewster Kahle, the founder of the IA, who had just come out of the IA building, welcomed us and led us to the registration desk, where a very helpful team of volunteers gave us our name badges and project sign holders. I acquired a table right outside the entrance of the IA's building, placed the InterPlanetary Wayback sign on it, and went to the food truck to grab my dinner. When I came back I found that the wind had blown my project sign off the table, so I moved it inside the building where it was cozier and more crowded.

The Science Fair event was full of interesting projects. You may explore the list of all the Science Fair projects along with their description and other details. Alternatively, flip through the pages of the following photo albums of the day.






Many familiar and new faces visited my table, discussed the project, and asked about its functionality, architecture, and technologies. On one hand, I met people who were already familiar with our work; on the other hand, some needed a more detailed explanation from scratch. I even met people who asked with surprise, "why would you make your software available to everyone for free?" This called for a brief overview of how the open-source software ecosystem works and why one would participate in it.




This is not a random video. This clip was played to invite Mike Judge, co-creator of HBO's Silicon Valley, on stage for a conversation with Cory Doctorow at the Opening Night Party after Brewster's welcome note (due to a streaming rights issue the clip is missing from IA's full session recording). I can't think of a better way to begin the DWeb Summit. This was my first introduction to Mike (yes, I had not watched the Silicon Valley show before). After an interesting Q&A session on the stage, I got the opportunity to talk to him in person, took a low-light blurred selfie with him, mentioned the Indian demonetization story (which, apparently, he was unaware of), and asked him to make a show in the future about potential threats on the DWeb. Web 1.0 emerged with a few entities having control over publishing while the rest of the people were consumers of that content. Web 2.0 enabled everyone to participate in the Web both as creators and consumers, but privacy and censorship controls went into the hands of governments and a few Internet giants. If Web 3.0 (or the DWeb) could fix this issue too, what would potentially be the next threat? There should be something that we may or may not be able to think of just yet, right?


Mike Judge and Sawood Alam


Decentralized Web Summit


For the next two days (August 1–2) the main DWeb Summit was organized in the historical San Francisco Mint building. There were numerous parallel sessions going on all day long. At any given moment perhaps there was a session suitable for everyone's taste and no one could attend everything they would wish to attend. A quick look at the full event schedule would confirm this. Luckily, the event was recorded and those recordings are made available, so one can watch various talks asynchronously. However, being there in person to participate in various fun activities, observe artistic creations, experience AR/VR setups, and interacting with many enthusiastic people with many hardware, software, and social ideas are not something that can be experienced in recorded videos.





If the father of the Internet with his eyes closed trying to create a network with many other participants with the help of a yellow yarn, some people trying to figure out what to do with colored cardboard shapes, and some trying to focus their energy with the help of specific postures are not enough, then flip through these photo albums of the event to get a glimpse into the many other fun activities we had there.





Initially, I tried to plan my agenda, but soon I realized it was not going to work. So, I randomly picked one of the many parallel sessions of interest, spent an hour or two there, and moved to another room. In the process I interacted with many people from different backgrounds, participating in their individual or organizational capacities. Apart from the usual talk sessions, we discussed various decentralization challenges and their potential technical and social solutions in one-to-one or small group conversations. An interesting mention of an additive economy (a non-zero-sum economy where transactions are never negative) reminded me of the gamification idea we explored when working on the Preserve Me! project, and I ended up having a long conversation with a couple of people about it during a breakout session.




If Google Glass was not cool enough then meet Abhik Chowdhury, a graduate student, working on a smart hat prototype with a handful of sensors, batteries, and low-power computer boards placed in a 3D printed frame. He is trying to find a balance in on-board data processing, battery usage, and periodic data transfer to an off-the-hat server in an efficient manner, while also struggling with the privacy implications of the product.

It was a conference where "Crypto" meant "Cryptocurrency", not "Cryptography" and every other participant was talking about Blockchain, Distributed/Decentralized Systems, Content-addressable Filesystem, IPFS, Protocols, Browsers, and a handful other buzz-words. Many demos there were about "XXX but decentralized". Participants included the pioneers and veterans of the Web and the Internet, browser vendors, blockchain and cryptocurrency leaders, developers, researchers, librarians, students, artists, educators, activists, and whatnot.

I had a lightning talk entitled "InterPlanetary Wayback: A Distributed and Persistent Archival Replay System Using IPFS" in the "New Discoveries" session. Apart from that, I spent a fair amount of my time there talking about Memento and its potential role in making decentralized and content-addressable filesystems history-aware. During a protocol-related panel discussion, I worked with a team of four people (including members from the Internet Archive and MuleSoft) to pitch the need for a decentralized naming system that is time-aware (along the lines of IPNS-Blockchain) and can resolve a version of a resource at a given time in the past. I also talked to many people from Google Chrome, Mozilla Firefox, and other browser vendors and tried to emphasize the need for native Memento support in web browsers.

Cory Doctorow's closing keynote on "Big Tech's problem is Big, not Tech" was perhaps one of the most talked-about talks of the event, receiving many reactions and much commentary. The recorded video of his talk is worth watching. Among many other things in his talk, he encouraged people to learn programming and to understand the functions of each piece of software we use. After his talk, an artist asked me how she or anyone else could learn programming. I told her that if one can learn a natural language, then programming languages are far more systematic, less ambiguous, and easier to learn. There are really only three basic constructs in a programming language: variable assignments, conditionals, and loops. Then I verbally gave her a very brief example of mail merge using all three constructs that yields gender-aware invitations from a message template for a list of friends to be invited to a party. She seemed enlightened and delighted (while enthusiastically sharing her freshly learned knowledge with other members of her team) and exchanged contacts with me to learn more about some learning resources.
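For the curious, here is roughly what that verbal example looks like written down: a minimal Python sketch with a made-up guest list, using exactly the three constructs mentioned above.

    # Hypothetical guest list
    friends = [
        {"name": "Alice", "gender": "female"},
        {"name": "Bob", "gender": "male"},
    ]
    template = "Dear {salutation} {name}, please join us for the party on Saturday!"

    for friend in friends:                    # loop over the guest list
        if friend["gender"] == "female":      # conditional picks the salutation
            salutation = "Ms."
        else:
            salutation = "Mr."
        # variable assignment fills the template for this friend
        message = template.format(salutation=salutation, name=friend["name"])
        print(message)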

IPFS Lab Day


It looks like people were too energetic to get tired of such jam-packed and eventful days, as some of them had planned post-DWeb events for special interest groups. I was invited by Protocol Labs to give an extended talk at one such IPFS-centric post-DWeb event called Lab Day 2018 on August 3. Their invitation arrived the day after I had booked my tickets and reserved the hotel room, so I ended up updating my reservations. This event was at a different location, and the venue was decorated with a more casual touch, with bean bags, couches, chairs, and benches near the stage and some containers for group discussions. You may take a glimpse of the venue in these pictures.








They welcomed us with new badges, some T-shirts, and some best-seller books to take home. The event had a good lineup of lightning talks and some relatively longer presentations, mostly extended forms of similar presentations from the main DWeb event. Many projects and ideas presented there were in their early stages. These sessions were recorded and published later after the necessary editing.

I presented my extended talk entitled "InterPlanetary Wayback: The Next Step Towards Decentralized Archiving". Along with the work already done and published about IPWB, I also talked about what is yet to be done. I explored the possibility of an index-free, fully decentralized, collaborative web archiving system as the next step. I proposed some solutions that would require changes in IPFS, IPNS, IPLD, and other surrounding technologies to accommodate the use case. I encouraged people to discuss with me if they had any better ideas to help solve these challenges. The purpose was to spread the word so that people keep web archiving use cases in mind while shaping the next Web. Some people from the core IPFS/IPNS/IPLD developer community approached me, and we had an extended discussion after my talk. The recording of my talk and my slides are available online.




It was a fantastic event to be part of, and I am looking forward to more such events in the future. The IPFS community and the people at Protocol Labs are full of fresh ideas and enthusiasm, and they are a pleasure to work with.

Conclusions


The Decentralized Web has a long way to go, and the DWeb Summit is a good place to bring people from various disciplines with different perspectives together every once in a while, to synchronize all the distributed efforts and to identify the next set of challenges. While I could not attend the first summit (in 2016), I really enjoyed the second one and would love to participate in future events. Those two short days of the main event had more material than I can perhaps digest in two weeks, so my only advice would be to extend the duration of the event instead of having multiple parallel sessions with overlapping interests.

I extend my heartiest thanks to the organizers, volunteers, funders, and everyone involved in making this event happen and making it a successful one. I hope that going forward not just the Web but many other organizations, including governments, become more decentralized, so that I do not open my wallet once again to find worthless pieces of currency bills that were demonetized overnight.

Resources




--
Sawood Alam


2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs, effectively parsing them and extracting useful metadata for analytics, training, and prediction has become a key problem in mining big text data.
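To make the idea concrete, a Grok expression such as %{IP:client} %{WORD:method} %{URIPATHPARAM:request} is essentially a library of named regular-expression captures. Here is a rough, hedged Python equivalent using named groups; the log line and the hand-rolled sub-patterns are illustrative only, not the actual Grok definitions.

    import re

    log_line = "55.3.244.1 GET /index.html"

    # Hand-rolled stand-ins for the IP, WORD, and URIPATHPARAM grok patterns
    pattern = re.compile(
        r"(?P<client>\d{1,3}(?:\.\d{1,3}){3})\s+"
        r"(?P<method>\w+)\s+"
        r"(?P<request>\S+)"
    )

    match = pattern.match(log_line)
    if match:
        print(match.groupdict())
        # {'client': '55.3.244.1', 'method': 'GET', 'request': '/index.html'}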

In this article, Ran Ramati gives a beginner's guide to Grok patterns used in Logstash, one of the powerful tools in the Elastic Stack (the other two being Kibana and Elasticsearch).

https://logz.io/blog/logstash-grok/

The StreamSets webpage gives a list of Grok pattern examples: 

https://streamsets.com/documentation/datacollector/3.4.3/help/datacollector/UserGuide/Apx-GrokPatterns/GrokPatterns_title.html

A recent paper by Huawei's research lab in China summarizes and compares a number of log parsing tools:

https://arxiv.org/abs/1811.03509

I am kind of surprised that although they cited the Logstash website, they did not compare Logstash with its peers.

 Jian Wu

2018-11-10: Scientific news and reports should cite original papers


I highly encourage all scientific news stories and reports to cite the corresponding articles. ScienceAlert usually does a good job of this. This piece of scientific news from ScienceAlert reports the discovery of two rogue planets. Most planets we have discovered orbit a star. A rogue planet does not orbit a star, but rather the center of the galaxy. Because planets do not emit light, rogue planets are extremely hard to detect. This piece of news cites a recently published paper on arXiv. Although anybody can publish papers on arXiv, papers posted by reputable organizations should be reliable.

A reliable citation is beneficial for all parties. It makes the scientific news more trustworthy. It gives credit to the original authors. It can also lead readers to a place to explore other interesting science.

Jian Wu



2018-10-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?



From Science magazine:

More than 7000 abstracts have been quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011. The plot below clearly shows when the retractions happened. The stated reason was weird:
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles."
Similar things have happened in a Nature subsidiary journal (link) and other journals (link).


The question is: can we find them in the Internet Archive? Can they still be legally posted on a digital library like CiteSeerX? If so, they could provide a very unique training dataset for fraud and/or plagiarism detection, assuming that the reason under the hood is one of those.
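One quick way to probe the first question is the Internet Archive's Wayback Machine availability API, which reports the closest capture for a given URL. Below is a hedged sketch; the paper URL is a placeholder, not a real retracted abstract.

    import requests

    # Placeholder URL standing in for a retracted abstract's landing page
    url = "https://ieeexplore.ieee.org/document/XXXXXXX"

    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if closest:
        print("archived at", closest["url"], "on", closest["timestamp"])
    else:
        print("no capture found in the Wayback Machine")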

Jian Wu

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate



Google Scholar has long been regarded as the digital library containing the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise you cannot guarantee the citation count of a paper, or equivalently the in-links of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than other digital libraries with fancier functions.

Today, I found two very interesting aspects of Google Scholar: one is clever and one is silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and counts their citations separately.

If you search "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as I attached. Note that there are three results. The first is a paper on IEEE. The second actually contains a list of completely different authors. These people are probably doing a presentation of that paper. The third is actually a pre-print on arXiv. These three have different numbers of citations, which they should do.


The silly side is also reflected in the search result. How does a paper published less than a year ago receive more than 1900 citations? You may say it is a super popular paper. But if you look into the citations, some do not make sense. For example, the first paper that "cites" the DeepLab paper was published in 2015! How could it cite a paper published in 2018?

Actually, that first citing paper's citation count is also problematic. A paper published in 2015 has been cited more than 6500 times! And another paper published in 2014 has been cited more than 16660 times!

Something must be wrong with Google Scholar! The good news is that the numbers look higher, which makes everyone happy! :)


Jian Wu

2018-11-15: LANL Internship Report


Los Alamos National Laboratory
On May 27 I landed in sunny Santa Fe, New Mexico to start my 6-month internship at Los Alamos National Laboratory (LANL) with the Digital Library Research and Prototyping Team under the guidance of Herbert Van de Sompel and WS-DL alumnus Martin Klein.

Work Accomplished

A majority of my time was spent working on the Scholarly Orphans project, a joint project between LANL and ODU sponsored by the Andrew W. Mellon Foundation. This project explores, from an institutional perspective, how an institution can discover, capture, and archive the scholarly artifacts that its researchers deposit in various productivity portals. After months of working on the project, Martin Klein showcased the Scholarly Orphans pipeline at TPDL 2018.

Scholarly Orphans pipeline diagram




My main task for this pipeline was to create and manage two components: the artifact tracker and the pipeline orchestrator. Communication between the components was accomplished using ActivityStreams 2.0 (AS2) messages and Linked Data Notification (LDN) inboxes for sending and receiving messages. AS2 messages describe events users have performed in a "human friendly but machine-processable" JSON format. LDN inboxes provide endpoints where messages can be received, advertising these endpoints via link headers. Applications (senders) can discover these endpoints and send messages to them (receivers). In this case each component was both a sender and a receiver. For example, the orchestrator sent an AS2 message to the tracker component's inbox to start a process to track a user across a list of portals; the tracker responded by sending an AS2 message with the results to the orchestrator's inbox, which was then saved in a database.

This pipeline was designed as a distributed network where the orchestrator knows where each component's inbox is before sending messages. The tracker, capture, and archiver components are told by the orchestrator where to send their AS2 messages and also where their generated AS2 event messages will be accessible. An example AS2 message from the orchestrator component to the tracker component shows an event object with a "to" endpoint telling the tracker where to send its message and a "tracker:eventBaseUrl" to which a UUID is appended to form the URL where the event generated by the tracker will be accessible. After the tracker has found events for the user, it generates a new AS2 message and sends it to the orchestrator's "to" endpoint.
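To give a flavor of the mechanics, the sketch below POSTs a stripped-down AS2 payload to an LDN inbox using Python's requests library. The inbox URLs, actor identifiers, and message vocabulary here are hypothetical placeholders and are much simpler than the actual Scholarly Orphans messages.

    import json
    import requests

    TRACKER_INBOX = "https://tracker.example.org/inbox/"            # hypothetical
    ORCHESTRATOR_INBOX = "https://orchestrator.example.org/inbox/"  # hypothetical

    # Minimal AS2-style event asking the tracker to track one researcher
    as2_message = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Offer",
        "actor": "https://orchestrator.example.org/",
        "object": {
            "type": "Person",
            "id": "https://example.org/researchers/jdoe",
            "name": "Example Researcher"
        },
        "to": ORCHESTRATOR_INBOX  # where the tracker should send its response
    }

    resp = requests.post(TRACKER_INBOX,
                         data=json.dumps(as2_message),
                         headers={"Content-Type": "application/ld+json"})
    print(resp.status_code)  # LDN receivers typically answer 201 Created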

Building the tracker and orchestrator components allowed me to learn a great deal about W3C web standards, mostly dealing with the Semantic Web. I was required to learn various technologies during my work, including Elasticsearch as a database, Celery task scheduling, using Docker Compose in a production environment, Flask and uWSGI as a Python web server, and working with OAI-PMH interfaces.

I was also exposed to the various technologies the Prototyping Team had developed previously and included these technologies in various components of the Scholarly Orphans pipeline. These included: Memento, Memento Tracer, Robust Links, and Signposting.

The prototype interface of the Scholarly Orphans project is hosted at myresearch.institute for a limited time. On the website you can see the various steps of the pipeline, the AS2 event messages, the WARCs generated by the capture process, and the replay of those WARCs via the archiver process for each of the researchers' productivity portal events. The tracker component of the Scholarly Orphans pipeline is available on GitHub: https://github.com/oduwsdl/scholarly-orphans-trackers.

New Mexico Lifestyle

Housing

Over the course of my internship I stayed in a house in Los Alamos shared with multiple Ph.D. students studying in fields as diverse as Computer Vision, Nuclear Engineering, Computer Science, and Biology. The views of the mountains were always amazing, accompanied by rain only during the monsoon season. A surprising discovery during the summer was that there always seemed to be a forest fire somewhere in New Mexico.
Los Alamos, NM

Dining

During my stay and adventures I discovered the level of spiciness that apparently every New Mexican has become accustomed to by adding the local green chile to practically any and every meal.

Adventures

Within the first two weeks of landing I had already planned a trip to Southern NM. Visiting Roswell, NM I discovered aliens were very real.
Roswell, NM International UFO Museum
Going further south I got to visit Carlsbad, NM the home of the Carlsbad Caverns which were truly incredible.
Carlsbad, NM Carlsbad Caverns
I was able to visit Colorado for a few days and went on a few great hikes. On August 11, I got to catch the Rockies vs. Dodgers MLB game, where I saw my first walk-off home run, hit by the Rockies.

I also managed a weekend road trip to Zion Canyon, Utah, allowing me to hike some great trails like the Observation Point Trail, The Narrows, and the Emerald Pools.
Zion Canyon, Utah - Observation Point Trail

Advice

If you're a visiting researcher not hired by the lab, consider living in a shared home with multiple other students. This can help alleviate boredom and also help you find people to plan trips with. Otherwise you will usually be excluded from the events the lab plans for its students.

If you're staying in Los Alamos, plan to make weekend trips out to Santa Fe. Los Alamos is beautiful and has some great hikes, but it is frequently short on entertainment.

Final Impressions

I feel very blessed to have been offered this 6-month internship. At first I was reluctant to move out West; however, it allowed me to travel to many great locations with new friends. My internship exposed me to various subjects related to WS-DL research, which will surely improve, expand, and influence my own research in the future.

A special thanks to Herbert Van de Sompel, Martin Klein, Harihar Shankar, and Lyudmila Balakireva for allowing me to collaborate, contribute, and learn from this fantastic team during my stay at LANL.

--Grant Atkins (@grantcatkins)

2018-11-30: The Illusion of Multitasking Boosts Performance

Today, I read the article at
https://www.psychologicalscience.org/news/releases/the-illusion-of-multitasking-boosts-performance.html

The title is "The Illusion of Multitasking Boosts Performance". At first, I thought it argues for single-task at once, but after reading it, I found that it is not. It actually supports multi-tasking, but in the sense that the worker "believes" the work he is working on is a combination of multi-tasks.

The original paper published in Psychological Science is 
https://journals.sagepub.com/doi/full/10.1177/0956797618801013

and the title is "The Illusion of Multitasking and Its Positive Effect on Performance". 

In my opinion, the original article's title is accurate, but the press release's title tells only part of the story and actually distorts the article's original meaning. The reader may get the illusion that multitasking produces a negative effect.

Jian Wu

2018-11-30: Archives Unleashed: Vancouver Datathon Trip Report


The Archives Unleashed Datathon #Vancouver was a two-day event held November 1–2, 2018, hosted by the Archives Unleashed team in collaboration with Simon Fraser University Library and Key, SFU's big data initiative. This was the second in a series of Archives Unleashed datathons funded by The Andrew W. Mellon Foundation, and the first time that I, Mohammed Nauman Siddique of the Web Science and Digital Libraries research group (WS-DL) at Old Dominion University, traveled to one of these datathons.

 Day 1



The event kicked off with Ian Milligan welcoming all the participants to the Archives Unleashed Datathon #Vancouver. This was followed by a welcome speech from Gwen Bird, University Librarian at SFU, and Peter Chow-White, Director and Professor at the GeNA lab. After the welcome, Ian talked about the Archives Unleashed Project, why we care about web archives, the purpose of organizing the datathons, and the roadmap for future datathons.
Ian's talk was followed by Nick Ruest walking us through the details of the Archives Unleashed Toolkit and the Archives Unleashed Cloud. For more information about the Archives Unleashed Toolkit and Cloud services, you can follow them on Twitter or check their website.
For the purpose of the datathon, Nick had already loaded all the datasets onto six virtual machines provided by Compute Canada. We were given twelve dataset options, courtesy of the University of Victoria, the University of British Columbia, Simon Fraser University, and the British Columbia Institute of Technology.
Next, the floor was open for us to decide on our projects and form teams. We had to arrange our individual choices on the whiteboard, with information about the dataset we wanted to use in blue, the tools we intended to use in pink, and the research questions we cared about in yellow. Teams formed quickly based on the datasets and the purpose of each project. The first team, led by Umar Quasim, wanted to work on the ubc-bc-wildfires dataset, a collection of webpages related to wildfires in British Columbia, to understand and find relationships between the events and media articles related to the wildfires. The second team, led by Brenda Reyes Ayala, wanted to work on improving the quality of archived pages using the uvic-anarchist-archives dataset. The third team, led by Matt Huculak, wanted to investigate the politics of British Columbia using the uvic-bc-2017-candidates dataset. The fourth team, led by Kathleen Reed, wanted to work on the ubc-first-nations-indigenous-communities dataset to investigate the history of First Nations indigenous communities and its discourse in the media.

I worked with Matt Huculak, Luis Menese, Emily Memura, and Shahira Khair on the British Columbia candidates dataset. Thanks to Nick, we had already been provided with the derivative files for our dataset, which included a list of all the captured domain names with their archival counts, the extracted text from all the WARC files with basic file metadata, and a Gephi file with a network graph. It was the first time that the Archives Unleashed team had provided the participating teams with derivative files, which saved us hours of wait time that would otherwise have been spent extracting all this information from the dataset's WARC files. We continued to work on our projects through the day, with a break for lunch. Ian moved around the room to check on all the teams, motivate us with his light humor, and provide any help needed to get our projects going.

Around 4 pm, the floor was open for the Day 1 talk session. The talks started with Emily Memura (PhD student at the University of Toronto) presenting her research on understanding the use and impact of web archives. Emily's talk was followed by Matt Huculak (Digital Scholarship Librarian at the University of Victoria), who talked about the challenges faced by libraries in creating web collections using Archive-It. He emphasized the use of regular expressions in Archive-It and the problems they pose to non-technical librarians and web archivists. Nick Ruest presented Warclight and its framework, the latest service released by the Archives Unleashed team, followed by a working demo of the service. Last but not least, I presented my research work on Congressional deleted tweets, talking about why we care about deleted tweets, the difficulties involved in curating the dataset for Members of Congress, and results on the distribution of deleted tweets across the multiple services that can be used to track them.
   



We wrapped up at 4:30 pm, only to meet again for dinner at 5 pm at the Irish Heather in downtown Vancouver. At dinner, Nick, Carl Cooper, Ian, and I had a long conversation ranging from politics to archiving to libraries. After dinner, we called it a day, only to meet again fresh the next morning.

Day 2



The morning of Day 2 in Vancouver greeted us with a clear view of the mountains across Vancouver Harbour, a perfect start to the day. We continued on our project with the occasional distraction of taking pictures of the beautiful view that lay in front of us. We did some brainstorming on our network graph and bubble chart visualizations from Gephi to understand the relationships between all the URLs in our dataset. We also categorized all the captured URLs into political party URLs, social media URLs, and the rest. While reading the list of crawled domains present in the dataset, we discovered a bias towards a particular domain, which made up approximately 510k of the approximately 540k mementos. The standout domain was westpointgrey.com, owned by Brian Taylor, who ran as an independent candidate. We set out to investigate the reason behind that bias by parsing out and analyzing the status codes from the response headers of each WARC record. We realized that out of approximately 540k mementos, only 10k had status code 200 OK; the rest were either 301s, 302s, or 404s. Our investigation of all the URLs that showed up for westpointgrey.com led us to the conclusion that it was a calendar trap for crawlers.

Most relevant topics word frequency count in BC Candidates dataset 

During lunch, we had three talks scheduled for Day 2. The first speaker was Umar Quasim from the University of Alberta, who talked about the current status of web archiving in their university library and discussed some of their future plans. The second presenter, Brenda Reyes Ayala, Assistant Professor at the University of Alberta, talked about measuring archival damage and the metrics to evaluate it, which had been discussed in her PhD dissertation. Lastly, Samantha Fritz talked about the future of the Archives Unleashed toolkit and cloud service. She mentioned in her talk that starting in 2019, computations using the Archives Unleashed toolkit will be a paid service.
    

Team BC 2017 Politics



We were the first to present, starting with a talk about the BC candidates dataset at our disposal and the different visualizations we had used to understand it. We talked about the relationships between different URLs and their connections. We also highlighted the westpointgrey.com crawler trap issue. Our dataset comprised approximately 510k of 540k mementos from this single domain. The reason for the large memento count from a single domain was a calendar crawler trap, which was evident on analyzing all the URLs that had been crawled for it. Of the approximately 510k mementos crawled from this domain, only six were 302s and seven were 200s, while the rest of the URLs returned a status code of 404. In a nutshell, we had a meager seven mementos with useful information out of the approximately 510k mementos crawled for this domain, and the dataset of approximately 540k mementos had only approximately 10k mementos with relevant information. Based on our brainstorming over the two days, we summarized lessons learned and advice for future historians curating seeds for collections on Archive-It.

Team IDG

Team IDG started off by talking about the difficulties they faced in settling on their final dataset, walking us through the different datasets they tried before settling on the one (ubc-hydro-cite-c) used in their project. They presented a visualization of top keywords based on frequency counts and the relationships between different keywords. They also highlighted the issue of extracting text from tables and talked about their solution. They walked us through all the steps involved in plotting their events on a map: starting from the table of processed text, they geocoded their dataset and plotted it onto a map showing the occurrences of the events. They also showed a timeline of how the events evolved over time by plotting it onto the map.

Team Wildfyre



Team Wildfyre opened their talk with a description of their dataset and the other datasets they used in their project. They talked about their research questions and the tools used in their project. They presented multiple visualizations showing top keywords, top named entities, and a geocoded map of the events. They also had a heat map of the distribution of the dataset based on the domain names it contained. They pointed out that even when analyzing named entities in the wildfire dataset, the most talked-about entity during these events was Justin Trudeau.


Team Anarchy

 


Team Anarchy had split their project into two smaller projects. The first, undertaken by Ryan Deschamps, was about finding linkages between all the URLs in the dataset. He presented a concentric circles graph describing the linkage between pages from depth 0 to 5. They found that starting from the base URL, following links to depth level 5 led to a spam or a government website in most cases. He also talked about the challenges faced in extracting images from the WARC files and comparing them with their live web counterparts. The second project, undertaken by Brenda, was about capturing archived pages and measuring their degree of difference from the live versions of those pages. She showed multiple examples with varying degrees of difference between the archived and live pages.



Once the presentations were done, Ian asked us all to write out our votes, and the winner was decided by popular vote. Congratulations to Team IDG for winning the Archives Unleashed Datathon #Vancouver. For closing comments, Nick talked about what to take away from these events and how to build a better web archiving research community. After all the suspense, the next edition of the Archives Unleashed Datathon was announced.



More information about the Archives Unleashed Datathon #WashingtonDC can be found on their website or by following the Archives Unleashed team on Twitter.

This was my first time at an Archives Unleashed Datathon. I went with the idea of meeting, all under one roof, the researchers, librarians, and historians who propel the web archiving research domain. The organizers strike a good balance by bringing together the web archiving community and other research communities with diverse backgrounds and experience. It was an eye-opening trip for me, where I learned from my fellow participants about their work, how libraries build collections for web archives, and the difficulties and challenges they face. Thanks to Carl Cooper, Graduate Trainee at the Bodleian Libraries, Oxford University, for strolling around downtown Vancouver with me. I am really excited and look forward to attending the next edition of the Archives Unleashed Datathon in Washington, DC.

View of Downtown Vancouver
Thanks again to the organizers (Ian Milligan, Rebecca Dowson, Nick Ruest, Jimmy Lin, and Samantha Fritz), their partners, and the SFU library for hosting us. Looking forward to seeing you all at future Archives Unleashed datathons.


Mohammed Nauman Siddique
@m_nsiddique

2018-12-03: Acidic Regression of WebSatchel

Mat Kelly reviews WebSatchel, a browser based personal preservation tool.                                                                                                                                                                                                                                                                                                                                                                            ⓖⓞⓖⓐⓣⓞⓡⓢ


Shawn Jones (@shawnmjones) recently made me aware of a personal tool to save copies of a Web page using a browser extension called "WebSatchel". The service is somewhat akin to the offerings of browser-based tools like Pocket (now bundled with Firefox after a 2017 acquisition) among many other tools. Many of these types of tools use a browser extension that allows the user to send a URI to a service that creates a server-side snapshot of the page. This URI delegation procedure aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value".

With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:

Note the thumbnail of the screenshot captured. The red square in the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity) is indicative of a user interacting with the page for the content to load and thus be accessible for preservation. With respect to only evaluating the tool's capture ability, the red in the thumbnail may not be indicative of the capture. A repeat of this procedure to ensure that I "surfaced" the red square on the live web (i.e., interacted with the page before telling WebSatchel to grab it) resulted in a thumbnail where all squares were blue. As expected, this may be indicative that WebSatchel is using the browser's screenshot extension API at the time of URI submission rather than creating a screenshot of their own capture. The limitation of the screenshot to the viewport (rather than the whole page) also indicates this.

Mis(re-)direction

I then clicked the "Open Save Page" button and was greeted with a slightly different result. This captured resided at https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html.

curling that URI results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.


$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login
Content-Type: text/html

Note the lack of a scheme in the Location header. RFC 2616 (HTTP/1.1) Section 14.30 requires the Location value to be an absolute URI (per RFC 3986 Section 4.3). In an attempt to legitimize their hostname-leading redirect pattern, I also checked the more current RFC 7231 Section 7.1.2, which revises the value of the Location response header to be a URI reference in the spirit of RFC 3986. This updated HTTP/1.1 RFC allows relative references, as was already done in practice prior to RFC 7231. WebSatchel's Location pattern causes browsers to interpret the hostname as a relative path reference per the standards, causing a redirect to https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login


$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login

...and repeated recursively until the browser reports "Too Many Redirects".
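The resolution behavior is easy to reproduce. Here is a minimal sketch with Python's urllib, purely illustrative of RFC 3986 reference resolution and not WebSatchel's code:

    from urllib.parse import urljoin

    base = "https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html"
    location = "websatchel.com/j/public/login"  # scheme-less Location header value

    # With no scheme and no leading "//", the value is a relative path reference,
    # so the hostname is folded into the path instead of becoming a new authority,
    # and each redirect hop grows the path -- hence the loop.
    print(urljoin(base, location))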

Interacting with the Capture

Despite the redirect issue, interacting with the capture retains the red square. In the case where all squares were blue on the live Web, the aforementioned square was red when viewing the capture. In addition to this, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing, representative of an iframe (without anything CORS-related behind the scenes) and an embedded HTML5 object (using the standard video element, nothing related to Custom Elements).

"Your" Captures

I hoped to also evaluate archival leakage (aka zombies), but the service did not seem to provide a way for me to save my captures to my own system, i.e., "your" archives are remotely (and solely) hosted. In investigating a way to liberate my captures, I noticed that the default account is simply a trial of the service, which ends a month after creating the account, with a relatively steep monthly pricing model thereafter. The "free" account is also listed as being limited to 1 GB per account and 3 pages per day, with access removed to their "page marker" feature, WebSatchel's system for a sort of text-highlighting form of annotation.

Interoperability?

WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera, but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breath for too long, I used my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture. As above, attempting to access the capture via curl would require circumventing the infinite redirect and manually going through an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.

—Mat (@machawk1)

2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 different languages, and constantly updated by volunteers. Wikipedia is not only an encyclopedia; it has also been used as an ontology to build corpora, classify entities, cluster documents, create annotations, recommend documents to users, and more. Below, I review some of the significant publications in these areas.
Using Wikipedia as a corpus:
Wikipedia has been used to create corpora that can be used for text classification or annotation. In “Named entity corpus construction using Wikipedia and DBpedia ontology” (LREC 2014), Younggyum Hahm et al. created a method to use Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus. The method used in this paper can be accomplished in any language.
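As a flavor of how DBpedia can be queried programmatically for such corpus-building work, here is a hedged sketch using the SPARQLWrapper Python package; the query is illustrative only and is not the one from the paper.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?person ?name WHERE {
            ?person a dbo:Person ;
                    rdfs:label ?name .
            FILTER (lang(?name) = "en")
        } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)

    # Each binding pairs a DBpedia entity URI with its English label
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], "-", row["name"]["value"])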
Fabian Suchanek used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper “YAGO: A large ontology from Wikipedia and Wordnet” (Web Semantics 2008) describes how this dataset was created.
Using Wikipedia to classify entities:
In the paper “Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach” (VLDB Endowment 2013), Abhishek Gattani et al. created a method that accepts text from social media, such as Twitter, and then extracts important entities, matches the entities to Wikipedia links, filters and classifies the text, and then creates tags for the text. The data used is called a knowledge base (KB). Wikipedia was used as the KB, and its graph structure was converted into a taxonomy. For example, if we have the tweet “Obama just gave a speech in Hawaii”, then entity extraction selects the two tokens “Obama” and “Hawaii”. The resulting tokens are then paired with Wikipedia links: (Obama, en/wikipedia.org/wiki/Barak_Obama) and (Hawaii, en/wikipedia.org/wiki/Hawaii). This step is called entity linking. Finally, the classification and tagging of the tweet are set to “US politics, President Obama, travel, Hawaii, vacation”, which is referred to as social tagging. The actual process to go from tweet to tags takes ten steps. The overall architecture is shown in Figure 1.
  1. Preprocess: detect the language (English), and select nouns and noun phrases
  2. Extract pair of (string, Wiki link): using the text in the tweet, the text is matched to Wikipedia links and is paired, where the pair of (string, Wikipedia) is called a mention
  3. Filter and score mentions: remove certain pairs and score the rest
  4. Classify and tag tweet: use mentions to classify and tag the tweet
  5. Extract mention features
  6. Filter mentions
  7. Disambiguate: select between topics, e.g. is apple categorized to a fruit or a technology?
  8. Score mentions
  9. Classify and tag tweet: use mentions to classify and tag the tweet
  10. Apply editorial rules
The dataset used in this paper was described in “Building, maintaining, and using knowledge bases: a report from the trenches” (SIGMOD 2013) by Omkar Deshpande et al. In addition to using Wikipedia, web and social context were used to tag the tweets more accurately. After collecting tweets, they gather web context for each tweet, i.e., retrieving the link included in the tweet, if one exists, and extracting its content, title, and other information. Then entity extraction is performed, followed by linking, classification, and tagging. Next, the tweet with its tags is used to create a social context of the user, hashtags, and web domains. This information is saved and used for new tweets that need to be tagged. They also used the web and social context for each node in the KB, and this is saved for future usage.
Abhik Jana et al. added Wikipedia links to the keywords in scientific abstracts in “WikiM: Metapaths Based Wikification of Scientific Abstracts” (JCDL 2017). This method helps a reader determine whether they are interested in reading the full article. Their first step was to detect important keywords in the abstract, which they call mentions, using tf-idf. Then a list of candidate Wikipedia links, which they call candidate entries, is selected for each mention. The candidate entries are ranked based on similarity. Finally, the single candidate entry with the highest similarity score is selected for each mention.
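The mention-detection step can be approximated with any off-the-shelf tf-idf implementation. Below is a hedged sketch using scikit-learn; the abstracts are made up, and this is not the WikiM authors' code.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    abstracts = [
        "We present a framework for archiving scholarly artifacts on the web.",
        "A study of named entity recognition in historical newspaper archives.",
    ]

    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(abstracts)
    terms = np.array(vec.get_feature_names_out())

    # Top-3 scoring terms of the first abstract = candidate mentions to wikify
    scores = tfidf[0].toarray().ravel()
    print(terms[np.argsort(scores)[::-1][:3]])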
Using Wikipedia to cluster documents:
Xiaohua Hu et al. used Wikipedia to cluster documents in “Exploiting Wikipedia as External Knowledge for Document Clustering” (KDD 2009). In this work, documents are enriched with Wikipedia concepts and category information; both exact concept matches and related concepts are included. Documents are then clustered based on a combination of the document content, the added Wikipedia concepts, and the added category information. This method was evaluated on three datasets: TDT2, LA Times, and 20-newsgroups. Several clustering schemes were compared:
  1. Clustering based on the word vector
  2. Clustering based on the concept vector
  3. Clustering based on the category vector
  4. Clustering based on the combination of word vector and concept vector
  5. Clustering based on the combination of word vector and category vector
  6. Clustering based on the combination of concept vector and category vector
  7. Clustering based on the combination of word vector, concept vector, and category vector
They found that on all three datasets, clustering based on the word and category vectors (method #5) and clustering based on the word, concept, and category vectors (method #7) always had the best results (a minimal sketch of method #7 follows).
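
The sketch below assumes the Wikipedia concepts and categories for each document have already been looked up; here they are hand-written stand-ins, and the clustering itself is ordinary k-means rather than the paper's full setup.

from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs       = ["stocks fell amid inflation fears",
              "the team won the championship game",
              "central bank raises interest rates"]
concepts   = ["Stock_market Inflation", "Championship Team_sport",
              "Central_bank Interest_rate"]                 # assumed Wikipedia concepts
categories = ["Economics Finance", "Sports Games", "Economics Banking"]  # assumed categories

word_vec, concept_vec, cat_vec = (TfidfVectorizer() for _ in range(3))
X = hstack([word_vec.fit_transform(docs),
            concept_vec.fit_transform(concepts),
            cat_vec.fit_transform(categories)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents 0 and 2 should land in the same cluster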
Using Wikipedia to annotate documents:
Wikipedia has also been used to annotate documents, as in the paper “Wikipedia as an ontology for describing documents” (ICWSM 2008) by Zareen Saba Syed et al. Wikipedia text and links were used to identify topics related to terms in a given document. Three methods were tested: using the article text alone, the article text and categories with spreading activation, and the article text and links with spreading activation. The accuracy of the approach depends on factors such as whether a Wikipedia page links to non-relevant articles, the presence of links between related concepts, and the extent to which a concept appears in Wikipedia at all.
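
As a toy illustration of spreading activation, the following sketch activates the concepts matched in a document and spreads that activation over a tiny, hand-built link graph with a decay factor; the graph, decay value, and hop count are assumptions for illustration only.

LINKS = {  # article -> linked articles (hand-built toy graph)
    "Python_(programming_language)": ["Guido_van_Rossum", "Scripting_language"],
    "Scripting_language": ["Programming_language"],
    "Guido_van_Rossum": ["Python_(programming_language)"],
}

def spread(seeds, decay=0.5, hops=2):
    # Seeds get activation 1.0; each hop passes a decayed share to neighbors.
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for node, value in frontier.items():
            for neighbor in LINKS.get(node, []):
                nxt[neighbor] = nxt.get(neighbor, 0.0) + value * decay
        for node, value in nxt.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = nxt
    return sorted(activation.items(), key=lambda kv: -kv[1])

# Concepts matched in the document text act as seeds.
print(spread(["Python_(programming_language)"]))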
Using Wikipedia to create recommendations:
Wiki-Rec uses Wikipedia to create semantically based recommendations. This technique is discussed in the paper “Wiki-rec: A semantic-based recommendation system using Wikipedia as an ontology” (ISDA 2010) by Ahmed Elgohary et al. They predict terms common to a set of documents. In this work, the user reads a document and evaluates it. Then, using Wikipedia, all the concepts in the document are annotated and stored, and the user's profile is updated with the new information. By matching the user's profile against other users' profiles with similar interests, a list of recommended documents is presented to the user (a toy sketch of this matching step appears below). The overall system model is shown in Figure 2.
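
The sketch below assumes user profiles are simply sets of Wikipedia concepts accumulated from annotated documents and that similarity is plain Jaccard overlap; the profiles, documents, and threshold are invented for illustration, and the paper's actual matching may differ.

profiles = {
    "alice": {"Machine_learning", "Information_retrieval", "Wikipedia"},
    "bob":   {"Machine_learning", "Neural_network", "Information_retrieval"},
    "carol": {"Gardening", "Botany"},
}
read_docs = {"bob": ["doc-42", "doc-77"], "carol": ["doc-9"]}

def jaccard(a, b):
    # Overlap of two concept sets relative to their union.
    return len(a & b) / len(a | b)

def recommend(user, threshold=0.3):
    # Recommend documents read by users whose profiles are similar enough.
    recs = []
    for other, concepts in profiles.items():
        if other != user and jaccard(profiles[user], concepts) >= threshold:
            recs.extend(read_docs.get(other, []))
    return recs

print(recommend("alice"))   # -> ['doc-42', 'doc-77']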
Using Wikipedia to match ontologies:
Other work, such as “WikiMatch - Using Wikipedia for Ontology Matching” (OM 2012) by Sven Hertling and Heiko Paulheim, used Wikipedia to determine whether two ontology concepts are similar, even if they are described in different languages. In this work, the Wikipedia search engine is used to retrieve articles related to a term, and for each article all of its language links are retrieved as well. Two concepts are then compared by comparing their sets of retrieved article titles. However, this approach is time-consuming because every term requires querying Wikipedia.
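
A minimal sketch of this idea, assuming the Wikipedia search results for each ontology label have already been retrieved (hard-coded here to avoid live queries): two concepts are treated as a likely match when their retrieved article sets overlap enough.

search_results = {  # label -> set of retrieved Wikipedia article titles (assumed)
    "car":        {"Car", "Automobile", "Vehicle"},
    "automobile": {"Automobile", "Car", "Motor vehicle"},
    "tree":       {"Tree", "Plant", "Forest"},
}

def overlap(label_a, label_b):
    # Fraction of shared articles relative to the smaller result set.
    a, b = search_results[label_a], search_results[label_b]
    return len(a & b) / min(len(a), len(b))

for pair in [("car", "automobile"), ("car", "tree")]:
    print(pair, round(overlap(*pair), 2))   # high overlap suggests a match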
In conclusion, Wikipedia is not only an information source, it has also been used as a corpus to classify entities, cluster documents, annotate documents, create recommendations, and match ontologies.
-Lulwah M. Alkwai

2018-12-14: CNI Fall 2018 Trip Report

Mat Kelly reports on his recent trip to Washington, DC for the CNI Fall 2018 meeting.


I (Mat Kelly, @machawk1) attended my first CNI (#cni18f) meeting on December 10-11, 2018, an atypical venue for a PhD student, and am reporting my trip experience (also see previous trip reports from Fall 2017, Spring 2017, Spring 2016, Fall 2015, and Fall 2009).

Dr. Nelson (@phonedude_mln) and I left Norfolk, VA for DC, having questioned whether the roads would be clear after the unseasonably significant snow storm the night before (they were).

The conference was split up into eight sessions with up to 7 separate presentations being given concurrently in each session, which required attendees to choose a session. Between each session was a break, which allowed for networking and informal discussions. The eight sessions I chose to attend were:

  1. Collaboration by Design: Library as Hub for Creative Problem Solving Space
  2. From Prototype to Production: Turning Good Ideas into Useful Library Services
  3. First Steps in Research Data Management Under Constraints of a National Security Laboratory
  4. Blockchain: What's Not To Like?
  5. The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions
  6. What Is the Future of Libraries in Academic Research?
  7. Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
  8. Building Infrastructure and Services for Open Access to Research

Also be sure to check out Dale Askey's report of the CNI 2018 Fall Membership Meeting. With so many concurrent sessions, he had a different experience.

Day One

In the Open Plenary prior to the concurrent sessions, Cliff Lynch described his concerns with the suggestions for using blockchain as the panacea for data problems. He discounted blockchain's practicality as a solution to most problems to which it is applied, and expressed both concern and enthusiasm for the use of machine learning (ML), though he stated his wariness of ML being conflated with AI. Without training sets, he noted further, ML does not do well, and if there is bias in the training data, the classifiers learn to internalize the biases.

Cliff continued by briefly discussing the rollout of 5G: while it will create competition for home cable-based Internet access, it will not fix the digital divide, as those who don't currently have access will likely not gain access with the introduction of 5G. He went on with his concerns over IoT devices, the emulation of old systems, and the security implications of reintroducing old, unpatched software.

He then mentioned the upcoming sunsetting of The Digital Preservation Network (DPN) and how their handling of the phase-out process is a good example of what is at stake in terms of good behavior (cf. "we're shutting down in 2 weeks"). DPN's approach is very systematic in that they are going through their holdings and figuring out where the contents need to be returned, where other copies of these contents are being held, etc. As was relevant, he also mentioned the recently announced formalization of a succession plan by CLOCKSS for when the time comes that the organization ceases operation.

Continuing, Cliff referenced Lisa Spiro's Plenary Talk at Personal Digital Archiving (PDA) 2018 this past April (which WS-DL attended in 2017, 2013, 2012, and 2011) and the dialogue that occurred following Hurricane Harvey on enabling and archiving the experiences of those affected. He emphasized that, as exemplified by natural disasters like the hurricane, the recent wildfires in California, etc., we are in a state of denial about how valuable the collections on our sites have become.

Cliff also referenced recent developments in scholarly communication with respect to open access, namely the raising of the technical bar with the deposit of articles with a strict DTD as prescribed by the European Plan S. The plan requires researchers who receive state funding to publish their work in open repositories or journals. He mentioned that for large open-access journals like PLoS and the "big commercial players", doing so is not much of a problem, compared to the hardship that will be endured by smaller "labor of love" journals like those administered using OJS. He also lamented the quantification of impact measurement in non-reproducible ways and the potential long-term implications of using such measures. In contrast, he noted that when journal editors get together and change the rules to reflect desired behaviors in a community, they can be a powerful force for change, citing the example of genomics journals requiring data submission to GenBank prior to any consideration of submission.

After touching on a few more points, Cliff welcomed the attendees and thus began the concurrent sessions.

Collaboration by Design: Library as Hub for Creative Problem Solving Space

The first session I attended was very interactive. The three speakers (Elliot Felix, Julia Maddox, and Mary Ann Mavrinac) gave a high-level overview of the iZone system as deployed at the University of Rochester. They first asked the attendees to go to a site or text their replies about the role of libraries and its needs, then watched the presentation screen enumerate the responses as they came in, in real time.

The purpose of the iZone system, as they described it, is to serve as a collaboration hub for innovation where students can explore ideas of social or community benefit. The system seemed open-ended, but the organization appeared helpful to students who "didn't have a methodology to do research or didn't know how to form teams."

Though the iZone team tried to encourage an "entrepreneurial" mindset, their vision statement intentionally did not include the word, as they found that students did not like its connotations. The presenters then had the audience members fill out a sort-of Mad Lib, as supplied on the attendees' seats, stating:



For __audience__ who __motivation__ we deliver __product/service__ with __unique characteristic__ that benefit __benefit__.

Most of those who supplied a response offered students some service for the benefit of whatever the library at their institution offered. Of course, the iZone representatives provided their own, relating to offering a "creative space for problem solving".

Describing another barrier with students, the presenters used Bob McKim's tactic, from when he was still teaching, of having students draw their neighbor on a sheet of paper for 20 seconds on the first day of class. Having the CNI audience do this was to demonstrate that "we fear judgement of peers" and that "Throughout our education and upbringing, we worry about society's reaction to creative thoughts and urges, no matter how untamed they may be."

This process was an example of how they (at iZone) would help all students to become resilient, creative problem solvers.

Slides for this presentation are available (PDF)

From Prototype to Production: Turning Good Ideas into Useful Library Services

After a short break, I attended the session presented by Andrew K. Pace (@andrewkpace) of OCLC and Holly Tomren (@htomren) of Temple University Libraries. Andrew described a workflow that OCLC Research has been using to ensure that prototypes that are created and experimented with do not end up sitting in the research department without going to production. He described their work on two prototype-to-production projects: IIIF integration into a digital discovery environment and a prototype for digital discovery of linked data in Wikibase. His progression from prototyping to production consisted of 5 steps:

  1. Creating a product vision statement
  2. Justifying the project using a "lean canvas" to determine effort behind a good idea.
  3. "Put the band together" consisting of assembling those fit to do the prototyping with "stereophonic fidelity" (here he cited Helen Keller with "Happiness is best attained through fidelity to a worthy purpose")
  4. Setting team expectations using the OCLC Community Center (after failing to effectively use a listserv) to concretely declare a finishing date that they had to stick to, preventing research staff from having to manage the project after completion.
  5. Accepting the outcome with a fail-fast-and-hard approach, stating that "you should be disappointed if you wind up with something that looks exactly like what you expected to build from the start."

Holly then spoke of her experience at Temple piloting the PASSAGE project (the Wikibase project above) from May to September 2018. An example use case from their pilot asked users to annotate the Welsh translation of James and the Giant Peach; one question was which properties should be associated with the original work and which with the translation.

Another such example was a portrait of Caroline Still Anderson from Temple University Libraries' Charles L. Blockson Afro-American Collection, and deliberating on attributes like "depicts" rather than "subject" in describing the work. In a discussion with the community, they sought to clarify the distinction between the photo itself and the subject in the photo. "Which properties belong to which entity?", they asked, noting the potential loss of context if one did not click through. To further emphasize this point, she discussed a photo titled "Civil Rights demonstration at Girard College August 3, 1965", where a primary attribute of "Philadelphia" would remove too much context in favor of more descriptive attributes of the subject in the photo, like "Event: demonstration" and "Location: Girard College".

These sorts of descriptions, Holly summarized, need a cascading, inheritance-style description relative to other entities. This experience was much different from her familiarity with using MARC records to describe entities.

First Steps in Research Data Management Under Constraints of a National Security Laboratory

Martin Klein (@mart1nkle1n) and Brian Cain (@briancain101) of Los Alamos National Laboratory (LANL) presented next, with Martin initially highlighting a 2013 OSTP memo stating that all federal agencies with over $100 million in annual R&D expenditures are required to store their data and make it publicly accessible to search, retrieve, and analyze. LANL, being one of 17 national labs under the US Department of Energy with $12 billion in R&D funding (much greater than $100 million), was required to abide.

Brian highlighted a series of interviews at other institutions as well as in-depth interviews about data at their own institution. Responses to these interviews expressed a desire for a centralized storage solution rather than storing data locally, and for more assurance of its location "after the postdoc has left".

Martin documented an unnecessarily laborious process of having to burn data to a DVD, walk it to their review and release system (RASSTI), and then, once complete, physically walk the approval to a second location. He reported that this was a "humongous pain" and thus "lots of people don't do this even though they should". He noted that the lab has begun initiatives to look into where money goes, tracing it from the starting point of an idea to funding, to papers, patents, etc.

He went on to describe the model used by the Open Science Framework (OSF) to bring together portability measures the researchers at LANL were already used to. Based on OSF, they created "Nucleus", a scholarly commons to connect the entire research cycle and act as the glue that sits in the middle of research workflow and productivity portals. Nucleus can connect to storage solutions like GitLab and to other authentication systems (or whatever their researchers are used to and want to reuse) to act as a central means of access. As a prototype, Martin's group established an ownCloud instance to demonstrate the power of a sync-n-share solution for their lab. The intention is that Nucleus would make the process of submitting datasets to RASSTI much less laborious, making it easier to obtain approval and comply.

Blockchain: What's Not To Like?

David Rosenthal presented the final session of the day that I attended, one much anticipated based on the promotion in the official CNI blog post. As is his convention, Rosenthal's presentation consisted of a near-literal reading of his (then-)upcoming blog post with an identical title; go and read that to do his very interesting and opinionated presentation justice.

As a very high-level summary, Rosenthal emphasized the attractiveness but frequent mis-application of blockchain with respect to usage in libraries. He noted that Satoshi Nakamoto's revolutionary idea was to decentralize the consensus mechanism, which is often the problematic component in these sorts of systems. The application of the idea in practice, he summarized, and the side effects of Bitcoin as an exemplification of blockchain (e.g., potential manipulation of consensus, high transaction latency) highlight the real-world use case and its corresponding issues.

Rosenthal repeatedly mentioned the pump-and-dump schemes that allow for price manipulation and "creating money out of thin air". Upon completion of his talk and some less formal, opinionated thoughts on Austrian-led efforts to promote blockchain/Bitcoin (through venues of universities, companies, etc.), Dr. Nelson asked, "Where are we in 5 years?"

Rosenthal answered with his prediction: "Cryptocurrency has been devaluing for a year. It is hard to sustain a belief that cryptocurrencies will continue 'going up'; miners are getting kicked out of mining. This is a death spiral. If it gets to this level, someone can exploit it. This has happened to small altcoins. You can see instances of using Amazon computing power to mount attacks."

Day 1 of CNI finished with a reception consisting of networking and some decent crab cakes. In a gesture of cosmic unconsciousness, Dr. Nelson preferred the plates of shrimp.

Day Two

Day two of the CNI Fall 2018 meeting started with breakfast and one of the four sessions of the day.

The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions

The first session I attended was presented by Roger C. Schonfeld (@rschon) & Oya Rieger (@OyaRieger) of Ithaka S+R (of recent DSHR infamy), who reported on a recent open-ended study with "21 subject experts" to identify outstanding perspectives and issues in digital preservation. Oya noted that the interviewees were not necessarily a representative sample.

Oya referenced her report, published in October 2018 and titled "The State of Digital Preservation in 2018", which highlights the landscape of Web archiving, among other things, and how to transition the preserved bits for future use. In the report she (summarily) asked:

  1. What is working well now?
  2. What are your thoughts on how the community is preparing for new content types and formats?
  3. Are you aware of any new research in practices and their impact?
  4. What areas need further attention?
  5. If you were writing a new preservation grant, what would be your focus?

From the paper, she noted that there are evolving priorities in research libraries, which are already spread thin, and questioned whether digital preservation is a priority for libraries' overall role in the community. Oya referenced the recent Harper's article, "Toward an ethical archive of the web", with a thought-provoking pull quote: "When no one is likely to lay eyes on a particular post or web page ever again, can it really be considered preserved?"

What Is the Future of Libraries in Academic Research?

Tom Hickerson, John Brosz (@jbrosz), and Suzanne Goopy of the University of Calgary presented next, noting that academic research has changed and asking whether libraries have adapted. Through support of the Mellon Foundation, their group explored a multitude of projects, which John enumerated. They sought to develop a new model for working with campus scholars using a research platform, as well as providing equipment to augment the library's technical offerings.



Suzanne, a self-described "library user", described Story Map ECM (Empathic Cultural Mapping), which helps identify small data in big data and vice versa. This system merges personal stories of newcomers to Calgary with a map to show (for example) how adjusting to bus routes in Calgary can affect a newcomer's health.



Tom closed the session by emphasizing the need to support a diversity of research endeavors through a research platform offering economies of scale instead of one-off solutions. Of the 12 projects that John described, he stated, there was only one instance where researchers asked for a resource the library had to try to subscribe to, emphasizing the under-utilized availability of library resources; even that case was an unconventional example of access. "By having a common association with a research project", he continued, "these various silos of activity have developed new relationships with each other and strengthened our collegial involvement."

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

WS-DL's own Dr. Michael L. Nelson (@phonedude_mln) presented the second of the two blockchain-related sessions I attended at CNI, greeting the attendees with Explosions in the Sky (see related posts) and starting with a recent blog post in which Peter Todd claimed to "Carbon Date (almost) the Entire Internet", along with its caveat stating "In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots". Todd's approach is more applicable to ensuring that IA holdings like music recordings have not been altered (Nelson stated, "It's great to know that your Grateful Dead recording has not been modified") but is not as useful for validating Web pages; the fixity techniques Todd used are too naive to be applicable to Web archiving.

Nelson then demonstrated this using a memento of a Web page recording his travel log over the years. When this page is curled from different archives, each reports a different content length due to how the content has been amended at replay time. This served as a base example of the runtime manipulation of a page without any external resources; most pages, however, contain embedded resources, JavaScript that can manipulate the HTML, etc., which cause an increasing level of variability in this content length, the content preserved, and the content served at the time of replay (see the sketch below).
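
As a small illustration of this kind of comparison, the following Python sketch fetches the rewritten and raw ("id_") versions of a Wayback Machine memento and compares the length and hash of the returned HTML; the URI and timestamp are illustrative, the archive may redirect to its nearest capture, and this is not the script Nelson used.

import hashlib
import requests

BASE = "https://web.archive.org/web/20180101000000{mod}/http://umich.edu/"

for modifier in ["", "id_"]:                      # rewritten vs. raw replay
    resp = requests.get(BASE.format(mod=modifier), timeout=30)
    body = resp.content
    digest = hashlib.sha256(body).hexdigest()[:16]
    print(modifier or "rewritten", len(body), digest)
# The lengths and hashes will normally differ because replay-time rewriting
# (banners, rewritten links, archive JavaScript) changes the returned HTML.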

As a potential solution, Nelson facetiously suggested that the whole page with all embedded resources could be loaded, a snapshot taken, and the snapshot hashed; however, he demonstrated a simple example where an image rotated at runtime via JavaScript would indicate a change in presentation despite no change in contents, so he discarded this approach.

Nelson then highlighted work relating to temporal violations in the archive where, because of the difference in time of preservation of embedded resources, pages that never existed are presented as the historical record of the live Web.

The problem is that even when viewing the same memento over time, what one sees at one point in an archive may be different later -- hardly a characteristic of what one would expect from an "archive". As an example, Nelson replayed the raw and rewritten versions of a memento of the homepage of the future losers of the 2018 Peach Bowl (at the URI http://umich.edu). By doing so 35 times between November 2017 and October 2018, Nelson noted variance in the very same memento, even when seemingly stable (e.g., images failed to load), as well as an archival outage due to a subsequently self-reported upgrade. Nelson found that in 11 months, 11% of the URLs they surveyed disappeared or changed. This conclusion was supported by observing 16,627 URI-Ms from 17 different archives over that time frame, with 87.92% of them yielding two different hash values within that period. The conclusion: you cannot replay twice the same archived page (with a noted apology to Heraclitus).

As a final analogy, Nelson played a video of a scene from Monty Python and the Holy Grail, alluding to the guards as the archive and the king as the user.

Building Infrastructure and Services for Open Access to Research

The final session was presented by Jefferson Bailey (@jefferson_bail), Jason Priem (@jasonpriem, presenting remotely via Zoom), and Nick Shockey (@nshockey). Jason first described his motivations and efforts in creating Unpaywall, which seeks to create an open database of scholarly articles. He emphasized that their work is open source and that he was delighted to see the reuse of their early prototypes by Plum Analytics. All data that Unpaywall collects is available through their data APIs, which serve about 1 million calls per day and are "well used and fast".

Jason emphasized that the organization behind Unpaywall (Impactstory) is a non-profit, so it cannot be "acquired" in the traditional sense. Unpaywall seeks to be 98% accurate in the level of open access in returned results and works with publishers and authors to increase the degree of openness of a work if it is unsatisfactory.

He and his Impactstory co-founder published a paper titled "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles", which categorized these degrees of open access and quantified the current state of open access articles in the scholarly record. Some of these articles from the 1920s, he stated, were listed as Open Access even though the concept did not exist at the time. He predicted that, based on current trends, within 8 years 80% of articles will be Open Access. Unpaywall also has a freely available browser extension.

Jefferson described the Internet Archive's preservation efforts in general, with projects like GifCities, a search engine on top of archived GeoCities for all GIFs contained within, and a collection of every PowerPoint in military domains (about 60,000 in number). Relating to the other presenters' objectives, he provided IA's one-liner objective: to "Build a complete, use-oriented, highly available graph and archive of every publicly accessible scholarly article with bibliographic metadata and full-text, enhanced with public identifier metadata, linked with associated data/blog/etc, with a priority on long-tail, at-risk publications and distributed, machine-readable access."

He mentioned that a lot of groups (inclusive of Unpaywall) are doing work in aggregating Open Access articles. They are taking three approaches toward facilitating this:

  • Top-down: using lists, IDs, etc to target harvesting
  • Middle-sideways: Integrating with OA public systems and platforms
  • Bottom-up: using open source tools, algorithms, and machine learning to identify extant works, assess quality of preservation, identify additional materials.

Jefferson referenced IA's use of GROBID for metadata extraction; through their focus on the not-so-well-archived, they found 2 million long-tail articles with DOIs that are not archived. Of those found, 2 out of 3 articles were duplicates. With these removed, IA currently has about 10 million articles in their collection. Their goal is to build a huge knowledge graph of what is archived, what is out there, etc.; once they have that, they can build services on top of it.

Nick presented last of the group and first mentioned he was filling in for Joe McArthur (@Mcarthur_Joe). Nick introduced the Open Access Button, which provides free alternatives to paywalled articles with a single click; if it is unable to, the service "finds a way to make the article open access for you". They recently switched from a model of user tools to institutional tooling (with a library focus). Their tools, as Nick reported, were able to find Open Access versions for 23.2% of ILL requests using the Open Access Button or Unpaywall. They are currently building a way to deposit an article when an Open Access version is not available, using a simple drag-and-drop procedure after notifying authors. This tool can also be embedded in institutions' Web pages to make it easier for authors to facilitate more Open Access works.

Slides for this presentation are available (PDF).

Closing Plenary

And with that session (and a break), Cliff Lynch began the closing plenary of CNI by introducing Patricia Flatley Brennan, director of the National Library of Medicine (NLM). She initially described NLM's efforts to create trust in data. "What does a library do?", she said, "We fundamentally create trust. The substrate is data."

She noted that the NLM is best known for its products and services like PubMed, the MEDLINE database, the Visible Human Project, etc.

"There has never been a greater need for trusted, secure, accessible, valued information in this world.", she said, "Libraries, data scientists, and networked information specialists are essential to the future." Despite the "big fence" around the physical NLM campus in Bethesda, the library is open for visits. She described a refactoring of PubMed via PubMed Labs to create a relevance-based ranking tool instead of reverse temporal order. This would also entail a new user interface. Both of these improvements were formed by the observation that 80% of the people that launch a PubMed search never go to the second page.

...and finally

Upon completion of the primary presentation and prior to audience questions, Dr. Nelson and I left to beat the DC traffic back to Norfolk. Patricia's slides are promised to be available soon from CNI, and I will include them in this post later.

Overall, CNI was an interesting meeting and quite different from those I am used to attending. The heavier, less technical focus was an interesting perspective and made me even more aware that there is quite a lot done in libraries of which I have only a high-level idea. As a PhD student, in Computer Science no less, I am grateful for the rare opportunity to see the presentations in person when I have only ever viewed them via Twitter from afar. Beyond this post, I have also taken extensive notes on many topics that I plan to explore in the near future to make myself aware of current work and research going on at other institutions.

—Mat (@machawk1)

2018-12-14: New Insight to Big Data: Trip to IEEE Big Data 2018

The IEEE Big Data 2018 conference was held in the Westin Seattle Hotel between December 10 and December 13, 2018. There were more than 1,100 people registered. The acceptance rates varied between 13% and 24%, with an average rate of 19%. I had a poster accepted, titled “CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset”, co-authored with C. Lee Giles, two of his graduate students (Bharath and Shaurya), and an undergraduate student who produced preliminary results (Jianyu Mao). I attended the conference on Day 2 and Day 3 and left the conference hotel after the keynote on Day 3.



Insights from Personal Meetings
The most important part of attending conferences is meeting old friends and making new ones. Old friends I met include Kyle Williams (Microsoft Bing), Mu Qiao (IBM, chair of the I&G track), Yang Song (Google AI, co-chair of the I&G track), Manlin Li (Google Cloud), and Madian Khabsa (Apple Siri).

Kyle introduced his recent project on recommendations inferred from dialogs. He also committed to giving an invited talk for my IR class in the Spring semester.
Mu mentioned his project on anomaly detection on time-series data.
Yang talked about his previous work on CiteSeerX and Microsoft Academic Search. He said that one big obstacle for people to use MAS (and all other digital library search engines) is that none of them is comparable to Google Scholar in terms of completeness; the reason is simple: people want to see higher citation counts for their papers. He suggested that I switch my focus to mining information from the text that is not made available by publishers.
Madian told me that although I may think nobody uses Siri, there are still quite a lot of usage logs. One of the reasons that Siri is not perfect is its relatively small team compared with Google and Microsoft. He also said that it is a good time to apply for academic jobs these days because industry pays far more than universities, which attracts the best PhDs in AI.

I also introduced myself to Aidong Zhang, an NSF IIS director. Apparently, she knows Yaohang Li and Jing He well. I sent my CV to her. I also met Huaglory Tianfield and Liqiang Wang at the University of Central Florida.

Insights from Keynote Speakers
There were two keynote speakers that I liked best: one was Blaise Aguera y Arcas from Google AI (he is actually Yang Song's boss), and the other was Xuedong Huang from Microsoft.

Blaise's talk started from the first neural network paper by McCulloch & Pitts (1943), now cited 16k+ times according to Google Scholar. He reviewed the development of AI since 2006, the year when Deep Learning people started to come to CS conferences. He talked about Jeff Dean, the director of Google Brain, and the recent paper by Bonawitz et al. (2016), and pointed out the recent progress on Federated Learning, the learning of deep neural networks from decentralized data. Finally, he made a very good point: a successful application does not only depend on the model, but also on the data. He gave an example of a project that attempts to predict sexuality from facial features; those features strongly depend on the shooting angle of the photograph, so the model makes wrong predictions. In contrast, a work on predicting criminality using facial features of standard ID photographs achieves a very accurate result.

Xuedong Huang's talk was also comprehensive. He focused on the impact of big data on natural language processing, using Microsoft products as case studies. One of the most encouraging results is that Microsoft has developed effective real-time translation tools that can facilitate team meetings held in different languages; this implies that if TTS (text-to-speech) becomes sophisticated enough, people may not need to learn a foreign language anymore. He also reminded people that big data is a vehicle, not the final destination: knowledge is the final destination. He admitted that current techniques are not sophisticated at denoising data.

The other keynote speeches were not very impressive to me. I always feel that although it is fine for keynote speakers to talk about their own research or product, they should try to stand at a higher vantage point, overseeing the many problems the community is interested in, rather than focusing on a few narrowly defined problems with too much jargon and too many definitions and math equations.

Impressive Talks
I selectively went to presentations and posters. My impression was that streaming data, temporal data, and anomaly detection have become more and more popular. Below are some talks I was particularly interested in.

BigSR: real-time expressive RDF stream reasoning on modern Big Data platforms (Session L9: Recommendation Systems and Stream Data Management)
The motivation is to use a semantics-based method to facilitate anomaly detection. This was my first time hearing of Apache Flink. BigSR and Ray are promising replacements for Spark. I just took a Spark training session by PSC last week, and now there are systems faster than Spark!

Unsupervised Threshold Autoencoder to Analyze and Understand Sentence Elements (Annual Workshop on Big Data Analytics)
The author was working on a multiclass classification problem using an autoencoder. He found that the performance of the model depends on some hyperparameters, such as the number of hidden layers and/or neurons. I commented that this was an artifact of his relatively small training set (44k); with unlimited training data, the differences between model architectures may diminish. The author did not explain very well how he handles the imbalance of training samples across categories.

Forecasting and Anomaly Detection on Application Metrics using LSTM (In Intelligent Data Mining Workshop)
The two challenges are (1) interpretability (explaining the reason for an anomaly) and (2) rarity (how rare an abnormal sample is). The author uses Pegasus, an algorithm for solving non-linear classification with SVM.

Multi-layer Embedding Neural Architecture with External Memory for Large-Scale Text Categorization, Mississippi State (In Intelligent Data Mining Workshop)
The authors attempt to capture long-range correlations by storing more memory in LSTM nodes. The idea looks intuitive, but I am skeptical of (1) how useful it is for scholarly data, as the model was trained on news articles, and (2) whether the overhead is significant when classifying big data.

A machine learning based NL question answering system for healthcare data search using complex queries (In health data workshop)
The author attempts to classify all incoming questions into 6 categories. Although this particular model looks simplistic (the author admits he has scalability issues), it may be a good idea to map all questions into a narrow range of question types; this greatly reduces dimensionality and may be useful for summarization.

Conference Organization, Transportation, and the City of Seattle
The organization was very good, though registration was very expensive ($700). The conference was well sponsored by Baidu and another Chinese company. One impressive part of this conference was a hackathon, asking participants to solve a practical problem in 24 hours. I think JCDL should do something like this; the results may not be the best, but it pushes participants to think intensively within a very limited time window.

The conference center is located in downtown Seattle. Transportation is super convenient, with bus, light rail, and monorail stations near any place of interest. Pike Place, where the first Starbucks store is located, is a 10-minute walk away. There are many restaurants with gourmet food from all over the world. I stayed in the Mediterranean Inn, one mile from the center, which is still within walking distance. The Expedia combo (hotel + flight) cost me $850 for a 3-night hotel stay and a round-trip flight from ORF to SEA.

Seattle is a beautiful city. It is often lightly rainy this season, so local people like to wear waterproof hoodies. People are nice. I got a chance to visit the University of Washington library, which is often compared to the Hogwarts school in Harry Potter.

Jian Wu

2018-12-17: CoQA Challenge: Machine Reading Competition Recent Result

CoQA is a dataset containing more than 127,000 questions with answers, collected from more than 8,000 conversations. Each conversation is about a passage, in the form of questions and answers. An example passage is below:

Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!"

"I only wanted to be more like you".

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry.

"Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!"

Then Cotton thought, "I change my mind. I like being special".
This reads like a picture-book story, so you can see what kind of text current machine reading systems can handle. Sample questions and their answers are:
Q  What color was Cotton?
A  white || a little white kitten named Cotton
A white || white kitten named Cotton
A white || white
A white || white kitten named Cotton.
Q  Where did she live?
A  in a barn || in a barn near a farm house, there lived a little white kitten
A in a barn || in a barn near a farm house, there lived a little white kitten named Cotton
A in a barn || in a barn
A in a barn near || in a barn near a farm house, there lived a little white kitten named Cotton.
Q  Did she live alone?
A  no || Cotton wasn't alone
A no || But Cotton wasn't alone
A No || wasn't alone
A no || But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters.
Note that there can be multiple reference answers because they are given based on different sentences quoted from the story; these sentences are used as the explanations or justifications of the answers.
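
To give a feel for how such a conversation can be consumed programmatically, here is a short Python sketch over an in-line record that mirrors my reading of the released CoQA JSON structure; the field names are assumptions and should be checked against the actual dataset files.

import json

record = json.loads("""{
  "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton.",
  "questions": [{"turn_id": 1, "input_text": "What color was Cotton?"}],
  "answers":   [{"turn_id": 1, "input_text": "white",
                 "span_text": "a little white kitten named Cotton"}]
}""")

# Walk the question/answer turns in parallel and show the evidence span.
for q, a in zip(record["questions"], record["answers"]):
    print(f'Q{q["turn_id"]}: {q["input_text"]}')
    print(f'A{a["turn_id"]}: {a["input_text"]}  (evidence: "{a["span_text"]}")')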

As of November 2018, the best model was an ensemble model called SDNet, developed by Microsoft, with an overall accuracy of about 79%. In December 2018, iFlyTek and HIT (Harbin Institute of Technology) beat them and achieved an overall accuracy of about 80% using a single model. iFlyTek is a Chinese IT company and HIT is a university in China. The SDNet model and the iFlyTek model both adopt Google's BERT module. The Stanford NLP group is at #8 with an accuracy of 65%, and AllenAI is at #4, following Microsoft (single model), with an accuracy of 75%. This represents the best performance of QA systems today. The SDNet system is described in a paper on arXiv.

For the most recent results, please see the front page of the competition website.

Below is copied directly from the competition website. 
The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

Jian Wu