Channel: Web Science and Digital Libraries Research Group

2025-06-17: The 6th Research Infrastructures for the Study of Archived Web Materials (RESAW) Conference Trip Report


 


Siegen University



The 6th RESAW (Research Infrastructures for the Study of Archived Web Materials) conference took place in Siegen, Germany from June 4 – June 6, 2025. The conference occurs every other year in Europe, and features a mix of presentations along the spectrum from technical to digital humanities, by researchers affiliated with universities as well as from national and commercial web archiving organizations. The conference started off with quite a flurry when Cologne was evacuated following discovery of WWII bombs, disrupting many travel plans into Siegen.

Wednesday

Workshop: Mentorship for Early Career Scholars in Web Archive Studies


The Mentorship for Early Career Scholars in Web Archive Studies workshop started with participants introducing themselves. There were PhD students, postdocs, and mentors. Then, participants wrote down questions they had about web archive studies. The mentors grouped the questions into topics such as ethics, publishing, and methods. Because the group was so large, we then split into two groups for discussion. My group included Emmanuelle Bermès, Susan Aasman, Anders Klindt Myrvoll, Inga Schuppener of Siegen University, Marina Hervieu of BnF and the Skybox blogs project, Beatrice Cannelli of the University of London, Louis Ravn of the University of Amsterdam, and Marcus Rommel of Siegen University. Some of the topics we discussed were that there are standard guidelines for legal use but not for ethical use, that access policies can hinder dissemination, and how to engage with web archive content of underrepresented communities. One of the takeaways of the workshop was that there should be a Zotero list of papers for getting started with web archive research. The organizers of RESAW have attempted this a few times now, and ODU previously participated in a similar initiative, CEDWARC, as well.

Thursday

Keynote: Conference Opening Session


The opening keynote by Niels Brügger traced the history of the RESAW workshop. He also introduced the demographics of the participants, which skew heavily European. There were 95 registered participants from 11 countries, and 40 presentations over the two and a half days of the conference.




RESAW 2025 attendees came from 11 countries.





RESAW 2025 attendees primarily live in Europe.


Next, Carolin Gerlitz of University of Siegen talked about the host lab, the Media of Cooperation, funded as a Collaborative Research Centre. She introduced the idea of the datafied web, and observed that “The web is full of things that were not meant to be saved.” She questioned whose view is being preserved. She also stated that the true value of data lies in its ability to be remixed, rather than its amount.




Carolin Gerlitz gave the opening keynote about the datafied web.


Finally, Mara Bertelsen and Sophie Gebeil gave an entertaining skit on the publication of the previous RESAW conference’s proceedings, including props and audience participation. Everyone was entertained by the skit’s portrayal of the challenges of publishing a multinational volume.

Roundtable: The Datafied Web


The opening roundtable on the datafied web started with each of the five panelists - Sebastian Gießmann, Thomas Haigh, Carolin Gerlitz, Anne Helmond, and Miglė Bareikytė - introducing their view of the datafied web. Sebastian Gießmann, drawing on his research on the history of online payments, argued that datafication is a result of capitalism. Next, Thomas Haigh talked about how database management systems are transitioning towards NoSQL to accommodate vector (LLM) and graph (social media) models. He stated that there was never a web that wasn’t datafied, and that it matters how data is stored because web archives capture traces of the web, not the original data.


Carolin Gerlitz followed by asking what is being measured on the web, how, and who is doing the measuring. She talked about page counters on the web transitioning to social media reactions and further to LLMs learning from our input. Next, Anne Helmond talked about data in the context of blogs and her work with Carolin on social media reactions. Finally, Miglė Bareikytė, who researches war sensing, talked about her work with archiving Telegram chats and channels in the context of the Russian invasion of Ukraine, as 70% of Ukrainians use the software. She stated that an archive that is not used is easily forgotten, and that once users know they are being archived, it changes how they behave.


Questions for the panel included how to study power relations in the context of the datafied web, and what kind of data the panel would like to have from the current web. The panel gave the Covid-19 app collection at the Internet Archive as an example of modern archived data.

Session: Web Archives Practices


The “web archives practices” session opened with Vasco Rato of the Portuguese Web Archive presenting “Bulk access to web-archived data using APIs”. The Portuguese Web Archive has four public APIs: full text search, image search, CDX, and Memento. He stated that the archive now holds 1.4 petabytes of information, and that approximately half of its requests now come through the APIs. Some examples of projects that have been built using the APIs are the Gloria Portuguese LLM, and the 1st place Arquivo.pt 2023 Awards winner project “Viajar no tempo sobre carris”, which aligned CDX schedule data with full text news data.


Next, Eveline Vlassenroot of Ghent University presented, “Navigating the Datafied Web: User requirements and literacy with web archives.” Belgium does not currently have a web archive, so Belgicaweb is a three-year project to develop one with FAIR principles in mind. She conducted a mixed-methods user study and found that users want curated collections with transparent documentation about selection, as well as search interfaces. One of the main differences between this user study and past web archive user studies is that users are now requesting datasheets and APIs. For her PhD work, she is developing a web archive user literacy framework and a researcher playbook that includes ethical guidelines.


Finally, Helena Byrne of the British Library presented “Lessons learnt from preparing collections as data: the UK Web Archive experience.” The initial framework was presented at the previous RESAW conference in 2023. Since then, a toolkit has been published, and an online self-paced course is available through the University of London. The UK Web Archive (still offline) has applied the framework to 10 collections, released under a Creative Commons license.


Session: Social Media and APIs


In the social media and APIs session, Katie Mackinnon of the University of Copenhagen presented, “Robots.txt and A History of Consent for Web Data Capture.” She talked about the development of robots.txt by Martijn Koster, and how following it is only a gentleman’s agreement. She traced policies at the Internet Archive around robots.txt, culminating in the 2017 blog post announcing that they would no longer follow it. She also showed a 1995 post on the webcrawler mailing list asking for a zip file of the entire internet for AI purposes, which the audience found humorous. She also talked about the study by Longpre et al., which traced how website managers are using robots.txt to restrict LLM crawlers' access to their content.
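
To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URL, and the `ExampleArchiveBot` agent name are made up for illustration; `GPTBot` is used only as an example of an LLM-oriented crawler name. Nothing here is taken from the talk itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow one LLM-oriented crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.org/articles/1"  # hypothetical page

# The check is purely advisory: nothing stops a crawler from skipping it.
for agent in ("GPTBot", "ExampleArchiveBot"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "disallowed"
    print(f"{agent}: {verdict}")
```

The point of the sketch is that the rules only have an effect if the crawler chooses to call something like `can_fetch` and respect the answer, which is why robots.txt is an agreement and not an access control.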


Christian Schulz of Paderborn University next presented “On Reciprocity - Algorithmic Interweavings between PageRank and Social Media.” He talked about comparable systems for ranking authority of social media, such as pingbacks, and also talked about how users can manipulate their authority ranking on social media, including the “like4like” strategy. The final presenter in the session was Christina Schinzel of Bauhaus University, Weimar. She presented, “APIs. How their role in the history of computing and their software engineering principles shape the modern datafied web.” She traced the appearance of the term API to the paper, “Data structures and techniques for remote computer graphics.” She talked about the history of APIs on the web, and how they have led to interconnected ecosystems.


Friday

Session: Web Archives Practices




Lesley Frew presenting “Temporally Extending Existing Web Archive Collections for Longitudinal Analysis.” Photo courtesy of Eveline Vlassenroot.


In the Web Archives Practices session, I presented our work, “Temporally Extending Existing Web Archive Collections for Longitudinal Analysis.” Our pre-print is available on arXiv. We extended the EDGI 2016-2020 federal environmental changes dataset back to 2008 to analyze whether the terms being deleted by the first Trump administration were added by the Obama administration. We described our methodology, which used past web crawling, the End of Term 2008 dataset, and the Wayback Machine CDX API. We found that 87% of the pages with terms deleted by the first Trump administration contained deleted terms originally added by the Obama administration. 
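
For readers unfamiliar with the Wayback Machine CDX API, the following is a rough sketch of the kind of query involved in finding older captures of a page. It is not our actual pipeline: the function names and the sample response are invented for illustration, and the sample rows are fabricated. The `url`, `from`, `to`, `output`, `filter`, and `collapse` parameters are standard CDX server options.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url: str, start_year: int, end_year: int) -> str:
    """Build a CDX query URL for successful captures within a year range."""
    params = {
        "url": url,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",
        "filter": "statuscode:200",  # keep only HTTP 200 captures
        "collapse": "digest",        # drop adjacent captures with identical content
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(body: str) -> list[dict]:
    """Turn the JSON response (header row followed by data rows) into dicts."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Fabricated sample response, shaped like real CDX JSON output.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,example)/", "20080115000000", "http://example.org/", "text/html", "200", "AAAA", "1234"],
    ["org,example)/", "20160301000000", "http://example.org/", "text/html", "200", "BBBB", "2345"],
])

query_url = build_cdx_query("example.org", 2008, 2016)
captures = parse_cdx_json(SAMPLE)
for capture in captures:
    print(capture["timestamp"], capture["original"])
```

In practice the query URL would be fetched over HTTP and the response body passed to `parse_cdx_json`; the sample is inlined here so the sketch runs without network access.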



Next, Andrea Kocsis presented, “Engaging audiences with the UK Web Archive: Strategies for general readers, data users, and the digitally curious.” She worked with three groups of users on a computational skills spectrum: general readers do not have experience with web archives, data users have heavy experience with computational use of web archives, and in between are the digitally curious, for example users with some Jupyter notebook experience without specific web archive data experience. She also worked on a first of its kind physical web archive exhibit called “Digital Ghosts - Exploring Scotland’s Heritage on the Web.” Andrea’s research can be found on GitHub.


Finally, Ricardo Basílio presented “Seed lists on themes and events on Arquivo.pt: a curious starting point for discovering a web archive.” He talked about his methodology for obtaining lists of seeds, which includes both automated (Bing) and manual (individual and community) seed list development. He pointed out that automated methods like scraping a seed URL for additional URLs do not result in correct seed lists for very specific collections, such as a list of Portuguese art galleries, since external links wouldn’t be restricted to only Portuguese art galleries, which necessitates manual curation. WSDL has also done previous work on seeds and collections, including scraping social media for micro collections by Alexander Nwala and summarizing collections by Shawn Jones.
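
The limitation of automated seed expansion can be illustrated with a small sketch. This is not Arquivo.pt's method; the HTML snippet, hostnames, and function names below are hypothetical. It extracts links from a seed page with Python's standard `html.parser` and separates same-host links from external ones, which shows how external links on a gallery page can point to sites that are not galleries at all.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href> tags of one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the seed URL.
                self.links.append(urljoin(self.base_url, value))


def split_candidates(base_url: str, html: str) -> tuple[list[str], list[str]]:
    """Return (same-host links, external links) found on the seed page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    seed_host = urlparse(base_url).netloc
    internal = [u for u in collector.links if urlparse(u).netloc == seed_host]
    external = [u for u in collector.links if urlparse(u).netloc != seed_host]
    return internal, external


# Hypothetical seed page for an art gallery.
SEED = "https://gallery.example.pt/"
HTML = """
<a href="/exposicoes">Exhibitions</a>
<a href="https://social.example.com/gallery">Follow us</a>
<a href="https://tickets.example.com/buy">Tickets</a>
"""

internal, external = split_candidates(SEED, HTML)
print("same host:", internal)
print("external:", external)
```

Neither external link in this toy example is an art gallery, so adding them to the seed list automatically would dilute a narrowly scoped collection, which is where manual curation comes in.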





Andrea Kocsis, Ricardo Basílio, and Lesley Frew at the Web Archives Practices session discussion. Photo courtesy of Eveline Vlassenroot.


The way that the sessions are structured at RESAW is that all presenters give their presentations back-to-back, and then all presenters answer questions together in a panel. Here are some of the questions that the audience had for my presentation.

Session: My PhD in 5 Minutes


In the My PhD in 5 Minutes session, three web archives PhD students presented their dissertation topics. The chair, Jane Winters, asked each of them to further summarize their PhD in one sentence stating what they wanted people to take away. Nathalie Fridzema of the University of Groningen presented, “Before WEB 2.0: A Cultural History of Early Web Practices in the Netherlands from 1994 until 2004.” She wants people to take away the rich history of the Dutch web found in web archives. Anya Shchetvina of Humboldt University presented, “Manifesting The Web: Network Imaginaries in Manifesto Writing Between the 1980s and the 2020s.” She wants people to take away that manifestos have persisted over time but changed. Johanna Hiebl of European University Viadrina presented, “Battlefield of Truth(s) on Investigative Frontlines: From Data Activism to OSINT Professionalism.” She wants people to take away an open source methodology that is ethical for social sciences. There were no parallel sessions during this time, so all 100 RESAW attendees came to this session and were able to ask the presenters questions.


Roundtable: RESAW - The First 10 Years and What’s Next


In the RESAW business session roundtable, the panelists (Susan Aasman, Emmanuelle Bermès, Kat Brewster, Iris Geldermans, James Hodges, Tatjana Seitz, and Jesper Verhoef) engaged with each other and the full conference audience about what is to come. One of the successes they discussed was the value of a conference whose topics of study are not US-centered. One of the challenges was that RESAW attendees are currently almost entirely European, so they wish to increase geographical diversity to include Africa and South Asia. There was also an extensive discussion about having “less” (in the context of what web archives hold), but there were concerns about who decides what is deleted and how, and, in a complementary sense, who decides what is kept.

Session: Conference Closing



The RESAW conference is an informal, international, interdisciplinary web archives conference.


The RESAW conference happens every other year, and the locations are announced four years in advance (so for the next two conferences). RESAW 2027 will be in Groningen, Netherlands, and RESAW 2029 will be hosted by C2DH in Luxembourg.

Conclusion


I enjoyed attending RESAW because it brought together 100 interdisciplinary web archive researchers. The interdisciplinary departments represented include digital humanities, media studies, information science, and computer science. While representatives of web archiving organizations were present at the conference, the focus was on individual research rather than products or tools. I liked seeing the different methodologies that non-computational scholars use with web archives for their research. I hope that WSDL can attend this conference regularly in the future.




River Sieg


-Lesley


