Twitter allows its users to author original content and share links to other web pages. Archivists can mine tweets for these shared URIs and use them as seeds to create web archive collections. Dr. Alexander Nwala identified these collections of URIs in social media posts as micro-collections (MCs) in his 2019 study (Nwala et al.). The term micro-collection reflects the scope and size of these sets of URIs. Unlike URIs collected by scraping search engines (e.g., Google), the collections of URIs in social media (e.g., Twitter) are curated by users around specific topics or events. These collections of external web resources reflect the editorial effort and domain expertise of the people using the platform, making them a vital source of seed URIs for web archive collections. However, with the recent changes to the platform (read more about the changes), including its rebranding to X, we think that this practice may be diminishing. In this blog post, I report quantitative data about MCs from our recent study, which confirms our intuition and examines whether it is still worth scraping Twitter to look for seed URIs for web archive collections.
What is a micro-collection?
Users on social media platforms routinely create and share posts consisting of hand-selected URIs of news stories, tweets, videos, etc. Nwala et al. identified these shared URIs as micro-collections and considered them an important source for archival seeds because the effort taken to create micro-collections is an indication of editorial activity and a demonstration of domain expertise.
Figure 1: Example of a micro-collection from Twitter by a single author (@dtdchange) consisting of a pair of three tweets that are part of a reply thread about the Flint water crisis. (Figure 1 in Nwala et al.)
Figure 1 shows a tweet thread containing external links related to the Flint water crisis. Nwala et al. introduced post-class terminology to understand the MCs:
P1A1 - Single post by a single author (Example - a single tweet)
PnA1/PnAn - Multiple posts by a single author / multiple posts by multiple authors (Example - a tweet thread)
Under this terminology, an MC on Twitter can be created in three ways:
1) A single person creates a single tweet including multiple external URIs.
2) A single person creates a tweet thread including multiple URIs.
3) A group of people creates a thread with multiple URIs.
The first case is rarely found because of the character limitations for regular users. However, in April 2023, Twitter increased the character limit up to 25000 for premium users, allowing them to create MCs in a single tweet. The example below is a tweet created by a single user with a premium account, containing multiple external links.
These are my Jewish Warriors. 👇
- Jewish authors, books, & podcasts on Palestine -
In the past 6 months I've read 20 books on Israel-Palestine to improve my knowledge of the conflict and have also had the great privilege of talking to 6 of those authors about it on… pic.twitter.com/8ki7jwBkvf
— Omar Nizam (@OmarNizam) June 6, 2024
Data Collection
We used the following six queries:
Israel-Palestine conflict
US presidential election 2024
Donald Trump's conviction
Aurora Borealis 2024 (Northern Lights 2024)
Solar eclipse 2024
Paris 2024 Summer Olympics
We issued the queries to Twitter to extract the first 100 tweets (posts) from the Search Engine Result Pages (SERPs). We also collected up to 100 replies per tweet for the first 100 tweets. We scraped tweets from both the top and latest categories. The top category shows tweets that are considered the most popular for a given search term and the latest category shows tweets in reverse chronological order, with the most recent tweets appearing first. Figures 2 and 3 show examples of tweets from the top category and latest category, respectively.
Figure 2: Example of a tweet in the top category for the “Israel-Palestine conflict” query. The tweet has 11,000 reposts and 9,100 likes as of 2024-11-26.
Figure 3: Examples of tweets (tweet 1, tweet 2) in the latest category for the “Israel-Palestine conflict” query. Observed date and time: 2024-11-26T02:38:00. The tweets were posted 34 minutes and 36 minutes before the observed time.
We scraped Twitter SERPs using the third-party scraper twscrape and, in the first step, extracted the tweet ID, the raw tweet content, and the number of replies. Next, we categorized the collected tweets by the number of replies per tweet. A single tweet is a tweet with no replies (P1A1), and a tweet thread is a tweet with one or more replies (PnA1 or PnAn). Table 1 shows that the majority of the tweets in the latest category are P1A1 and the majority of the tweets in the top category are PnA1/PnAn.
Table 1: Tweet counts per query and post-class for the first 100 tweets in each SERP.
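The categorization by reply count described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the `Tweet` record and the `post_class` and `count_post_classes` functions are hypothetical names, the sample data is made up, and the sketch assumes that any tweet with at least one reply counts as a thread (matching the n>1 convention in the figures). Reply counts alone cannot separate PnA1 from PnAn, so both share one label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tweet:
    """Hypothetical record holding the three fields extracted in the first step."""
    tweet_id: int
    raw_content: str
    reply_count: int


def post_class(tweet: Tweet) -> str:
    """Label a tweet by post-class using only its reply count.

    Assumption: one or more replies makes the tweet a thread.
    """
    return "P1A1" if tweet.reply_count == 0 else "PnA1/PnAn"


def count_post_classes(tweets: list[Tweet]) -> Counter:
    """Tally post-classes for the tweets of one SERP."""
    return Counter(post_class(t) for t in tweets)


# Made-up sample data, for illustration only.
sample = [
    Tweet(1, "a single tweet", 0),
    Tweet(2, "start of a thread", 7),
    Tweet(3, "another single tweet", 0),
]
counts = count_post_classes(sample)
print(dict(counts))
```

Running the same tally once per query and per category (top, latest) would give counts in the shape that Table 1 describes.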
To further investigate the number of tweets in a thread, we created graphs comparing the lengths of tweet threads in the top and latest categories. Figures 4 and 5 show that the top category contains more tweet threads with more than five replies than the latest category.
Figure 4: Distribution of tweet threads in the latest category based on the number of replies for 6 queries. The x-axis shows the number of tweets in a thread, where n=1 is a single tweet and n>1 is a thread. The y-axis shows the number of threads.
Figure 5: Distribution of tweet threads in the top category based on the number of replies for 6 queries. The x-axis shows the number of tweets in a thread, where n=1 is a single tweet and n>1 is a thread. The y-axis shows the number of threads.
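The counts behind distributions like these can be derived from the reply counts alone. The sketch below is illustrative only: the function names are hypothetical, the reply counts are made up, and it assumes the thread length n is the root tweet plus its collected replies.

```python
from collections import Counter


def thread_length_distribution(reply_counts: list[int]) -> dict[int, int]:
    """Map thread length n (root tweet plus replies) to the number of threads."""
    lengths = Counter(1 + replies for replies in reply_counts)
    return dict(sorted(lengths.items()))


def share_longer_than(reply_counts: list[int], min_replies: int = 5) -> float:
    """Fraction of tweets whose reply count exceeds min_replies."""
    if not reply_counts:
        return 0.0
    return sum(r > min_replies for r in reply_counts) / len(reply_counts)


# Made-up reply counts, for illustration only.
latest_replies = [0, 0, 0, 1, 2, 0]
top_replies = [0, 6, 12, 3, 8, 0]

print(thread_length_distribution(latest_replies))
print(share_longer_than(top_replies))
```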
We extracted the Twitter short URLs (for example, https://t.co/pCgFKaRbUJ) and then dereferenced them to collect the full URIs. For instance, a curl request to https://t.co/pCgFKaRbUJ returns the full URI in the Location response header. Next, we filtered out the URIs that redirect to other tweets or to images and videos hosted at Twitter. Figures 6 and 7 show that, in the latest category, single tweets contain the most external URIs, while in the top category more external URIs appear in tweet threads with six or more replies.
Figure 6: Distribution of external URIs in the latest category across tweet threads of varying lengths for 6 queries. The x-axis shows the number of tweets in a thread, where n=1 is a single tweet and n>1 is a thread. The y-axis shows the number of external links.
Figure 7: Distribution of external URIs in the top category across tweet threads of varying lengths for 6 queries. The x-axis shows the number of tweets in a thread, where n=1 is a single tweet and n>1 is a thread. The y-axis shows the number of external links.
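The extract, dereference, and filter steps for short URLs could look roughly like the sketch below. It is an assumption-laden illustration rather than the study's code: the function names, the regular expression, and the list of domains treated as Twitter-hosted are all hypothetical choices, and `dereference` uses a HEAD request in place of the curl request mentioned above.

```python
import http.client
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches Twitter short links such as https://t.co/pCgFKaRbUJ
SHORT_URL_RE = re.compile(r"https://t\.co/[A-Za-z0-9]+")

# Assumption: domains treated as Twitter-hosted; extend as needed.
TWITTER_DOMAINS = ("twitter.com", "x.com", "twimg.com", "t.co")


def extract_short_urls(text: str) -> list[str]:
    """Return every t.co short URL found in the raw tweet content."""
    return SHORT_URL_RE.findall(text)


def dereference(short_url: str, timeout: float = 10.0) -> Optional[str]:
    """Send one HEAD request and return the Location header, if any.

    The redirect is not followed; only the first hop is read.
    """
    parts = urlsplit(short_url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().getheader("Location")
    finally:
        conn.close()


def is_external(uri: str) -> bool:
    """True if the URI's host is outside the Twitter-hosted domains."""
    host = (urlsplit(uri).hostname or "").lower()
    if not host:
        return False
    return not any(host == d or host.endswith("." + d) for d in TWITTER_DOMAINS)


# Offline example with made-up full URIs (no network request is made here).
resolved = [
    "https://www.example.org/flint-water-report",
    "https://twitter.com/someuser/status/1",
    "https://pic.twitter.com/8ki7jwBkvf",
]
external = [u for u in resolved if is_external(u)]
print(external)
```

Matching on the domain suffix rather than the exact host keeps subdomains such as pic.twitter.com out of the external set without also excluding unrelated hosts whose names merely end in the same letters.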
Identifying micro-collections
Calculating the precision of a micro-collection
Because some MCs could contain spam or other off-topic links, we calculated the precision of the resulting MCs. For instance, if we have an MC of 5 URIs for query 1 and none of the URIs in the collection contain any relevant information about query 1 or the selected event, that MC has precision=0.0 (0/5) and is of no use for generating collections of seed URIs. To calculate the precision, we manually inspected all the URIs in the discovered MCs and made a personal relevance judgment for each. Next, we divided the number of relevant URIs in the MC by the total number of URIs in the MC to obtain a numerical value for the precision.
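The precision calculation above amounts to a single ratio. In the sketch below, the function name `mc_precision` and the representation of manual relevance judgments as a list of booleans are assumptions made for illustration, and the judgments themselves are made up.

```python
def mc_precision(relevance: list[bool]) -> float:
    """Precision of one MC: relevant URIs divided by total URIs.

    Each entry is a manual relevance judgment for one URI in the MC.
    """
    if not relevance:
        raise ValueError("an MC must contain at least one URI")
    return sum(relevance) / len(relevance)


# The example from the text: 5 URIs, none relevant to the query.
off_topic_mc = [False] * 5
# A made-up MC in which 3 of 5 URIs were judged relevant.
mixed_mc = [True, True, False, True, False]

print(mc_precision(off_topic_mc))  # 0.0
print(mc_precision(mixed_mc))      # 0.6
```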
Results
Discussion
The results from this study indicate a notable shift in the occurrence of MCs on the platform formerly known as Twitter in 2024, compared to 2019, when Nwala et al. studied MCs before the major changes to the platform. They identified 4549 MCs from Twitter top and 5110 MCs from Twitter latest. For the MCs collected from Twitter, they reported the conditional probabilities of relevance as 0.6 given that the MC contains 1 URI, 0.61 given that it contains 2 URIs, 0.46 given that it contains 3-4 URIs, and 0.42 given that it contains 5 or more URIs. Our findings suggest that the frequency of MCs has declined in individual posts (P1A1 post-class). Micro-collections remain, but they are now more commonly found in tweet threads (PnA1/PnAn post-class), where users or groups of users collectively share external web resources.
When comparing the number of MCs between the top and latest categories, the top category yielded a higher number of MCs in the PnA1/PnAn post-class, which includes longer conversations and multiple contributors. We can therefore expect to find MCs in top-category tweet threads, since they contain more extended interactions.
The study also highlights that certain types of events, such as political events like the U.S. presidential election and Trump’s conviction, are more likely to generate a significant number of micro-collections compared to other events, such as the solar eclipse or the aurora borealis. This suggests that events with higher public engagement and discourse tend to foster more MC activity.
Precision is an indicator of the relevance of an MC to a selected query. The results suggest that although we could find MCs, they do not have the expected quality; that is, many of the tweeted URIs do not describe the specific event. Nearly half of the MCs in the P1A1 post-class are useful, while only 1/5 of the total MCs in the PnA1/PnAn post-class are useful. Increased activity by advertising bots in Twitter threads may be one reason for the decline in the precision of MCs. Twitter’s rebranding to X and the changes in its algorithm, coupled with users’ shifting behavior, could be contributing factors to the reduced number of MCs. The platform's discouragement of posts with external links, focus on promoting commercialized content, and emphasis on monetized user engagement may be affecting the editorial activity that previously encouraged the organic curation of MCs. A recent development makes this worse: the platform suppresses tweets with external links so that they get less engagement. The significant departure from the platform of scientists, journalists, and other professionals who authored high-quality content may be another reason for the lack of MCs in 2024.
Ultimately, the study demonstrates that while micro-collections still exist in 2024, their utility as a source of seed URIs is now questionable. However, we can still expect to find some MCs for high-engagement topics. We recommend implementing a relevance test when using Twitter to extract MCs, to ensure that the extracted MCs align with the desired context and purpose. Additionally, it is important to explore methods for mining MCs from other social media platforms, including emerging alternatives to Twitter. Further research could focus on understanding how platform algorithms affect content curation practices and on identifying strategies to efficiently mine valuable editorial content on evolving social media platforms.
Acknowledgements
I would like to express my sincere gratitude to Dr. Michael L. Nelson and Dr. Sampath Jayarathne for their guidance, support, and encouragement throughout this work.
-Kumushini Thennakoon (@KumushiniT)