Both researchers spent the following lunch discussing temporal graphs at length. I wondered if one could model TimeMaps this way and use these tools to discover interesting connections between archived web pages.
". He discussed the discovery and use of temponyms to understand the temporal nature of text. Using temponyms, machines can determine the time period that a text covers. He explained the issues with finding exact temporal intervals or times for web page topics, seeing as many pages are vague. His temponym project,
, has been tested on the WikiWars corpus and the YAGO semantic web system. He also presented further information on this topic,
.
We then shifted to using temporal analysis for security.
Staffan Truvé from
Recorded Future presented "
Temporal Analytics for Predictive Cyber Threat Intelligence". His company specializes in using social media and other web sources to detect potential protests, uprisings, and
cyberattacks. He indicated that protests and hacktivism are often talked about online before they happen, allowing authorities time to respond.
In closing, Omar Alonso from Microsoft presented "
Time to ship: some examples from the real-world". He highlighted some of the ways in which the carousel from the top of Bing is populated, using topic virality on social media as one of the many inputs. He talked about the concept of
social signatures, derived from all of the social media posts referring to the same link. Using this text, they are able to further determine aboutness for a given link, helping improve search results. He then switched to other topics that help with search, such as connecting place and time. Returning search results for points of interest (POI) at a given location is, in effect, an attempt to match people looking for things to do (queries) with social media posts, check-ins, and reviews for a given POI. He concluded by saying that there is much work to be done, such as allowing POI results for a given time period (e.g., "things to do in Montréal at night").
Keynotes
Sir Tim Berners-Lee
Sir Tim Berners-Lee spoke of the importance of decentralizing the web, ensuring that users own their own data, improving web security, standardizing and easing payments on the web, and finally the
Internet of Things (IoT).
Mentioning the efforts of projects like
Solid, he highlighted the need for users to retain their data in order to preserve their privacy. The idea is that a user can tell a service where to store their data, with the user retaining ownership of and responsibility for that data.
He mentioned that, in the past, the Internet had to be deployed by sending tapes through the mail, but now we are heading to a point where the web platform, because it allows you to deploy a full computing platform very quickly, may become the rollout platform of the future. Because of this ability, security is becoming more and more important, and he wants to focus on a standard for security that uses the browser, rather than external systems, as the central point for asking a user for their credentials, thereby helping guard against trojans and malicious web sites. He said that the move from HTTP to HTTPS has been harder than expected, considering many HTTPS pages are "mixed", containing references to HTTP URIs. This results in three different worlds: pages served over HTTP, pages served over HTTPS, and
pages using upgrade-insecure-requests, which are still mixed but whose upgrade is endorsed by the author.
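As a concrete illustration of that third case, here is a minimal sketch in Python (with a placeholder URL, not one from the talk) that checks whether a page asks the browser to upgrade its insecure subresources via the Content-Security-Policy header and whether it still references resources over plain HTTP.

```python
# Minimal sketch: fetch an HTTPS page, look for the
# "upgrade-insecure-requests" directive, and count any plain-HTTP
# subresource references that would make the page "mixed".
# The URL is a placeholder for illustration only.
import re
import urllib.request

url = "https://example.com/"
with urllib.request.urlopen(url) as response:
    csp = response.headers.get("Content-Security-Policy", "")
    html = response.read().decode("utf-8", errors="replace")

if "upgrade-insecure-requests" in csp:
    print("The author asked the browser to upgrade insecure subresources.")

# Any src/href still pointing at http:// would make this a mixed page.
mixed = re.findall(r'(?:src|href)="http://[^"]+"', html)
print(f"Found {len(mixed)} plain-HTTP references.")
```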
Next, he spoke about making web payments standardized, comparing it to authentication. There are a wide variety of different solutions for web payments and there needs to be a standard interface. There is also an increasing call to allow customers to pay smaller amounts than before, which many current systems do not handle. Of course, customers will need to know when they are being phished, hence the security implications of a standardized system.
Finally, he covered the Internet of Things (IoT), indicating there are connections to data ownership, privacy, and security.
In the following Q&A session, I asked
Sir Tim Berners-Lee about the steps toward browser adoption for technologies such as
Memento. He said the first step is to discuss them at conferences like WWW, then engage in working groups, workshops, and other venues. He noted that one also needs to define the users for such new technologies so they can help with the engagement.
Later, during the student Q&A session the following day,
Morgannis Graham from
McGill University asked Sir Tim Berners-Lee about his thoughts on the role of web archives. He replied that "personally, I am a pack rat and am always concerned about losing things". He highlighted that while general web users think of the present, it is the role of libraries and universities to think about the future, hence their role in archiving the web. He stated that universities and libraries should work more closely together in archiving the web so that, if one university falls, others will still hold the archives of the one that was lost. He also stated that we all have a role in ensuring that legislation exists to protect archiving efforts. Finally, he tied his answer back to one of his current projects: what happens to your data when the site you have given it to goes out of business.
Lady Martha Lane-Fox
Wednesday evening ended with an inspiring talk from
Lady Martha Lane-Fox. She works for the UK in a variety of roles advancing the use of technology in society. She argued that the country that can (1) improve gender balance in tech, (2) improve the technical skills of the populace, and (3) improve the ability to use tech in the public sector will be the most competitive.
She went on to explain how the current gender balance is very depressing, noting that, in spite of the freedom offered by technology, old hierarchies and structures have been re-established. She pointed to studies showing that companies with more diverse boards are more successful, and argued that we need to tackle this problem not only from a technical perspective but also a social one.
She discussed the challenges of bringing technology to everyday lives and applauded South Korea's success while highlighting the challenges still present in the UK. She relayed stories of encounters with the citizenry, some of whom were reluctant to embrace the web, but after doing so felt they had more freedom and capability in their lives than ever before. She praised the UK for putting coding on the school curriculum and looking toward the needs of future generations.
She then talked about re-imagining public services entirely through the use of technology. The idea is to make government agencies digital by default in an effort to save money and provide more capability. She highlighted a project where a UK hospital once had 700 administrators and 17 nurses and, through adopting technology, was able to take the same money and instead employ 700 nurses working with 17 administrators, thus providing better service to patients.
She closed by discussing her program
DotEveryone, a new organization promoting the promise of the Internet in the UK for everyone and by everyone. Her goal is for the UK to be the most connected, most digitally literate, and most gender-equal nation on earth. In a larger sense, she wants to kick off a race among countries to use technology to create the best countries for their citizens.
Mary Ellen Zurko
Wednesday morning started with a keynote by
Mary Ellen Zurko, from Cisco. She discussed security on the web. Her first lesson: "The future will be different; so will the attacks and attackers, but only if you are wildly successful". Her point was that the success of the web has made it a target. She then covered the history of
basic authentication,
S-HTTP, and finally
SSL/TLS in HTTPS.
She then discussed the social side of security, indicating that users are often confused about how to respond to web browser warnings about security. There is a 90% ignore rate on such warnings, and 60% of those are related to certificates. She highlighted how difficult it is for users to know whether or not a domain is legitimate and whether the certificate shown is valid. She also noted that most users, even expert users, do not fully understand the permissions they are granting, due to the cryptic and sometimes misleading descriptions given to them, mentioning that 17% of Android users actually pay attention to permissions during installation and only 3% are able to answer questions about what the security permissions mean.
Reiterating the results of
a study by Google, she stated that 70% of users clicked through malware warnings in Chrome, while the rate was lower in Firefox. The Google study found that the Firefox warnings provided a better user experience, and thus users were more apt to pay attention to and understand them. Following this study, Google changed its warnings in Chrome.
She said that the open web is an equal opportunity environment for both attackers and defenders, detailing how fraudulent tech support scams are quite lucrative. This was discovered in recent work by Cisco, "
Reverse Social Engineering Social Tech Support Scammers", where Cisco engineers actively bluffed tech support scammers in order to gather information on their whereabouts and identities.
Of note, she also mentioned that there is a largely unexploited partnership between web science and security.
Peter Norvig
On Friday morning, Peter Norvig gave an engaging talk on the state of the Semantic Web. He mentioned that his job is to bring information retrieval and distributed systems together. He went through a history of information retrieval, discussing WAIS and the World Wide Web, as well as Archie. Before Google, several groups were trying to tame the nascent web.
After Google, the Semantic Web was developed as a way to extract information from the many pages that existed. He talked about how Tim Berners-Lee was a proponent, whereas
Cory Doctorow highlighted that there was nothing but obstacles in its path. Peter said that
Cory had several reasons why it would fail, but the main ones were that (1) people lie, (2) people are lazy, and (3) people are stupid, indicating that the information gathered from such a system would consist of intentional misinformation, incomplete information, or misinformation due to incompetence.
Peter then highlighted several instances where this came about. Initially, excellent expressiveness was produced by highly trained logicians, giving us
DAML,
OWL,
RDFa,
FOAF, etc. Unfortunately, they found a 40% page error rate in practice, indicating that Cory was correct on all three fronts. Peter's conclusion was that the highly trained logicians did not seem to solve the identified problems.
Peter then posited "what about a highly trained webmaster?". In 2010, search companies promoted the creation of
schema.org with the idea of keeping it simple. The search engines promised that if a site were marked up, then they would show it immediately in search results. This gave webmasters an incentive to mark up their pages and has now resulted in technologies that can better present things like hotel reservations and product information. This led most to conclude that schema.org was an unexpected success.
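To make the "keep it simple" approach concrete, here is a minimal, hedged sketch of the sort of schema.org markup a webmaster might add to a product page; the product and its values are invented, and only the vocabulary terms (Product, Offer) come from schema.org.

```python
# Minimal sketch: schema.org product markup expressed as the JSON-LD a
# webmaster might embed in a page. The values are placeholders.
import json

product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}

# A page would carry this inside <script type="application/ld+json"> so a
# search engine could present the product directly in its results.
print(json.dumps(product, indent=2))
```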
Peter closed by saying that obstacles still remain, seeing as most of the data comes from web site owners, which still leads to misinformation in some cases. He talked about the need to connect different sources together so that one can, for example, not only find a book on Amazon but also a listing of the author's interests on Facebook. He hopes that neural networks can be combined with semantic and syntactic approaches to solve some of these large connection problems.
W3C Track
Tzviya Siegman, from
John Wiley & Sons Publishing, presented "
Scholarly Publishing in a Connected World". She discussed how publications of the past were immutable, and publishers did little with content once something was published. She confessed that in a world where machines are readers, too, publications are a bit behind the times. She further said that we still have an obsession with pages, citing them, marking them, and so on, when in reality the web is not bound by pages. She wants to standardize on a small set of RDFa vocabularies that would enable gathering content by topic, whether the documents published are articles, data, or electronic notebooks. She closed by talking about how Wiley is trying to extract metadata from its own corpus to provide additional data for scholars.
Hugh McGuire presented "
Opening the book: What the Web can teach books, and what books can teach the Web". He talked about how books seem to hold a special power and value, specific to the boundedness of a book. The web, by contrast, is unbounded; even a single web site is unknowable, with no sense of a beginning or an end. On the web, however, anyone can publish documents and data to a global audience without any required permission. He talked about how books are singularly important nodes of knowledge, with the ebook business having the opposite motive of the web, making ebooks a kind of restricted, broken version of the web. He wants to be able to combine the two. For example, a system could provide location-aware annotations of an ebook while also sharing those annotations freely, essentially making ebooks smarter and more open.
Ivan Herman revealed
Portable Web Publications, which have serious implications for archiving. The goal is to allow people to download web publications as they do ebooks, PDFs, or other portable articles. There is a need to do so because connectivity is not yet ubiquitous. With the power of the web, one can also embed interactivity into the downloaded document. Of course, there are additional considerations, like the form factor of the reading device and the needs of the reader. The concept is more than just creating an ebook with interactive components or a web page that can be saved offline. He highlighted the work of publishers in terms of ergonomics and aesthetics, stating that web designers for such portable publications should learn from this work. Portable Web Publications would not be suitable for social web sites, web mail, or anything that depends on real-time data. PWP requires three layers of addressing: (1) locating the PWP itself, (2) locating a resource within a PWP, and (3) locating a target within such a resource. In practice, locators depend on the state of the resource, creating a bit of a mess. His group is currently focusing on a manifest specification to solve these issues.
Poster Session
In looking at the data from "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot", we reviewed 1.6 million web references from 1.8 million articles and found that:
- use of web references is increasing in scholarly articles
- frequently authors use publisher web pages (locating URI) rather than DOIs (persistent URI) when creating references
We show on the poster that, because many authors use browser bookmarks or citation managers that store these locating URIs, there must be an easy way to help such tools find the DOI. Our suggestion is to expose the DOI in the Link header so that these tools can find it easily, as sketched below.
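As a rough sketch of that suggestion (not the exact mechanics from the poster), here is how a citation tool could pull a DOI out of a Link header in Python; the landing page URL, the DOI, and the "cite-as" relation type are illustrative placeholders.

```python
# Minimal sketch: ask a publisher landing page for its Link header and
# extract the DOI advertised there. URL, DOI, and relation type are
# placeholders, not taken from the poster.
import requests

landing_page = "https://publisher.example.org/articles/12345"
response = requests.head(landing_page, allow_redirects=True)

# requests parses the Link header into a dict keyed by relation type, e.g.
# {'cite-as': {'url': 'https://doi.org/10.1234/example', 'rel': 'cite-as'}}
doi_link = response.links.get("cite-as")
if doi_link:
    print("Persistent identifier:", doi_link["url"])
else:
    print("No DOI advertised in the Link header.")
```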
WWW Conference Presentations
Even though I attended many additional presentations, I will only detail a few of interest.
As a person who has difficulty with SPARQL, I appreciated the efforts of Gonzalo Diaz and his co-authors in "
Reverse Engineering SPARQL Queries". Their goal was to reverse engineer SPARQL queries with the intent of producing better examples for new users, seeing as new users have a hard time with the precise syntax and semantics of the language. Given a database and answers, they wanted to reverse engineer the queries that produced those answers. Unfortunately, they discovered that verifying a reverse-engineered SPARQL query to determine if it is the canonical query for a given database and answer is an NP-complete (intractable) problem. They were, however, able to apply heuristics to a specific subset of queries to solve this problem in polynomial time.
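To illustrate the verification problem at toy scale, here is a hedged sketch using Python and rdflib: given a tiny graph and a target answer set, it brute-force checks which invented candidate queries reproduce the answers. The data, queries, and answers are made up; the paper's point is that doing this in general is intractable, so brute force only works for toy inputs.

```python
# Minimal sketch of the verification step: check whether candidate SPARQL
# queries reproduce a target answer set over a tiny RDF graph.
from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
"""
g = Graph()
g.parse(data=data, format="turtle")

target = {("http://example.org/alice",), ("http://example.org/bob",)}

candidates = [
    "SELECT ?x WHERE { ?x <http://example.org/knows> ?y . }",
    "SELECT ?x WHERE { ?y <http://example.org/knows> ?x . }",
]

for query in candidates:
    answers = {tuple(str(v) for v in row) for row in g.query(query)}
    if answers == target:
        print("Candidate reproduces the answers:", query)
```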
Fernando Suarez presented "
Foundations of JSON Schema". He mentioned that JSON is very popular because it is flexible, but there is no way to describe what kind of JSON response a client should expect from a web service. He discussed a proposal from the Internet Engineering Task Force (IETF) to develop JSON Schema, a set of restrictions that documents must satisfy. He said the specification is in its fourth draft but is still ambiguous. Even online validators disagree on some content, meaning that we need clear semantics for validation, and he proposes a formal grammar. His contribution is an analysis showing that the validation problem is PTIME-complete, but that determining if a document has an equivalent JSON schema is PSPACE-hard for very simple schemas. For the future, he intends to work further on integrity constraints for JSON documents and more use cases for JSON Schema.
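As a tiny, hedged example of what such a schema looks like in practice, here is a sketch using the Python jsonschema library; the schema and document are invented for illustration.

```python
# Minimal sketch: validate a JSON document against a JSON Schema using the
# "jsonschema" library. Schema and document are made up for illustration.
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer", "minimum": 1989},
    },
    "required": ["title"],
}

document = {"title": "WWW 2016 Trip Report", "year": 2016}

# Raises jsonschema.ValidationError if the document violates the schema.
jsonschema.validate(instance=document, schema=schema)
print("Document satisfies the schema.")
```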
David Garcia presented "
The QWERTY Effect on the Web: How Typing Shapes the Meaning of Words in Online Human Communication". He highlighted the hypothesis that words typed with more letters from the right side of the keyboard are more positive than those with more letters from the left. He tested this hypothesis on product ratings from different datasets and found that 9 out of 11 datasets show a significant QWERTY effect, which is independent of the number of views or comments on an item. He did mention that he needs to repeat the study with different languages and keyboard layouts. He closed by saying that there is no evidence yet that we can predict meanings or change evaluations based on this knowledge.
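To make the hypothesis concrete, here is a minimal sketch of the quantity behind it: the fraction of a word's letters typed by the right hand on a QWERTY keyboard. The hand assignment below is the conventional touch-typing split, used only for illustration and not taken from the paper.

```python
# Minimal sketch: fraction of a word's letters typed by the right hand on
# a QWERTY keyboard, using the conventional touch-typing hand split.
RIGHT_HAND = set("yuiophjklnm")

def right_side_ratio(word: str) -> float:
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in RIGHT_HAND for c in letters) / len(letters)

# Under the hypothesis, words with a higher ratio would tend to be rated
# more positively.
for word in ("sad", "think", "lookup"):
    print(word, round(right_side_ratio(word), 2))
```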
Justin Cheng presented "
Do Cascades Recur?", in which he analyzed how memes rise and fall multiple times on social media. Prior work shows that a cascade (meme sharing) rises and then falls, but in reality there are many rises and falls over time. He studied these different peaks and tried to determine how and why these cascades recur. Because these bursts are separated among different network communities, cascades recur when people connect communities and reshare something. It turns out that a meme with high virality has less chance of recurring, but one with medium virality will recur months or perhaps years later. He would like to repeat his study with networks other than Facebook and develop improved models of recurrence based on other data.
Pramod Bhatotia presented "IncApprox: The Marriage of incremental and approximate computing". He discussed how data analytics systems transform raw data into useful information but need to strike a balance between low latency and high throughput. There are two computing paradigms that try to strike this balance: (1) incremental computation and (2) approximate computing. Incremental computation is motivated by the fact that we are often recomputing the output after small changes in the input and can reuse memoized parts of the computation that are unaffected by the changed input. Approximate computing is motivated by the fact that an approximate answer is often good enough: given the entire input dataset, it computes over only parts of the input and produces an approximate output with low latency. His contribution is the combination of these two approaches.
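Here is a minimal sketch of the idea (not the IncApprox system itself): per-chunk results are memoized so that only changed chunks are recomputed (the incremental part), and the final answer is estimated from a sample of those memoized results (the approximate part). All names and numbers are invented for illustration.

```python
# Minimal sketch combining incremental and approximate computation for a
# running sum over chunked input. Not the IncApprox system itself.
import random

chunk_sums = {}  # memoized result per chunk id

def update(chunks, changed_ids):
    # Incremental step: recompute only the chunks whose input changed.
    for cid in changed_ids:
        chunk_sums[cid] = sum(chunks[cid])

def approximate_total(sample_fraction=0.5):
    # Approximate step: sample a subset of the memoized chunk results and
    # scale up, trading accuracy for lower latency.
    ids = list(chunk_sums)
    k = max(1, int(len(ids) * sample_fraction))
    sample = random.sample(ids, k)
    return sum(chunk_sums[cid] for cid in sample) * (len(ids) / k)

chunks = {i: list(range(i, i + 100)) for i in range(10)}
update(chunks, changed_ids=chunks.keys())  # first run computes everything
chunks[3][0] += 1000                       # a small change to the input
update(chunks, changed_ids=[3])            # reuse the other nine chunk sums
print("Estimated total:", approximate_total())
```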
Jessica Su presented "
The Effect of Recommendations on Network Structure". She worked with Twitter on the rollout of a recommendation system that suggests new people to follow. They restricted the experiment to two weeks to avoid any noise from outside the rollout. They found that there is an effect; people's followers did increase after the rollout. They also confirmed that the "rich get richer", with those who already had many followers gaining more followers and those with few still gaining some followers. She also mentioned that people did not appear to be making friends, only following others.
Samuel Way presented "
Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks". This study investigated why women are underrepresented in computer science. He mentioned that there are conflicting results: universities have a 2-to-1 preference for female faculty applicants, but at the same time there is a bias favoring male students. They developed a framework for modeling faculty hiring networks using a combination of CVs, social media profiles, and other sources on a subset of people currently going through the tenure process. The model shows that gender bias is not uniformly and systematically affecting all hires in the same way and that the top institutions fight over a small group of people. Women are a limited resource in this market, and some institutions are better at competing for them. The result is that accounting for gender does not help predict faculty placement, leading them to conclude that the effects of gender are accounted for by other factors, such as publishing or postdoctoral training rates, or the fact that some institutions appear to be better at hiring women than others. The model predicts that men and women will be hired at equal rates in computer science by the 2070s.
Social
Of course, I did not merely enjoy the presentations and posters. Between the Monday night SAVE-SD dinner, the Thursday night Gala, and lunch each day, I took the opportunity to acquaint myself with many experts in the field. Google, Yahoo!, and Microsoft were also there looking to discuss data sharing, collaboration, and employment opportunities.
I always had lunch company thanks to the efforts of
Erik Wilde,
Michael Nolting,
Roland Gülle, Eike Von Seggern,
Francesco Osborne, Bahar Sateli, Angelo Salatino,
Marc Spaniol, Jannik Strötgen,
Erdal Kuzey, Matthias Steinbauer, Julia Stoyanovich,
Jan Jones, and more.
Furthermore, the Gala introduced me to other attendees, like
Chris LaRoche, Marc-Olivier Lamothe,
Ashutosh Dhekne,
Mensah Alkebu-Lan,
Salman Hooshmand,
Li'ang Yin,
Alex Jeongwoo Oh,
Graham Klyne, and Lukas Eberhard. Takeru Yokoi introduced me to
Keiko Yokoi from the
University of Tokyo who was familiar with many aspects of digital libraries and quite interested in Memento. I also had a fascinating discussion about Memento and the Semantic Web with
Michel Gagnon and
Ian Horrocks, who suggested I read "
Introduction to Description Logic" to understand more of the concepts behind the semantic web and artificial intelligence.
In Conclusion
As my first academic conference, WWW 2016 was an excellent experience, bringing me into contact with people at the forefront of web research. I now have a much better understanding of where we are across the many aspects of the web and scholarly communication.
Even as we left the conference and said our goodbyes, I knew that many of us had been encouraged to create a more open, secure, available, and decentralized web.