Channel: Web Science and Digital Libraries Research Group

2025-06-10: Comparing the Archival Rate of Arabic and English News Stories Published Between 1999 and 2022

Aljazeera Arabic Timemap

About 0.5% of websites publish their content in Arabic, which places Arabic 20th among languages on the web; however, Arabic is the 6th most spoken language in the world at 3.4%. A considerable portion of Arabs live in English-speaking countries. For example, Arabs make up roughly 1.2% of the U.S. population. Some of them, mainly the first generation, are able to consume news in Arabic in addition to English. Second-, third-, and fourth-generation Arabs might be interested in the Arabic narrative of news stories, but they prefer English since it is their first language. In this post, we present a quantitative study of the archival rate of news webpages published in Arabic compared to news webpages published in English by Arabic media from 1999 to 2022. We reveal that, contrary to the general conjecture that web archives favor English webpages, the archival rate of Arabic webpages is increasing more rapidly than the archival rate of English webpages.

The Dataset

Our dataset consists of the URLs of 1.5 million multilingual news stories, collected in September 2022 from the sitemaps of four prominent news websites: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. We examined multiple methods to fetch URLs, including RSS, Twitter, GDELT, web crawling, and sitemaps; sitemaps yielded the largest number of story URLs. We selected a sample of our dataset based on the median number of stories published each day by year: the median day for the number of published stories represents the year. For example, because the median number of stories published each day in 2002 is 93, we selected the stories published on that day as our sample for 2002. For all 23 years we studied, the median number of published stories is very close to the mean for the year. Our sample contains 4116 URIs of news stories (2684 Arabic and 1432 English). The dataset is available on GitHub.
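The median-day sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the function name `median_day_sample` and the `(url, date)` input format are hypothetical, days with no collected stories are ignored, and when no day matches the median exactly the closest day is taken, with ties going to the earliest date.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_day_sample(stories):
    """Pick one representative day per year.

    stories: iterable of (url, publication_date) pairs (hypothetical format).
    Returns {year: (chosen_day, [urls published on that day])}.
    """
    # Group story URLs by publication day.
    per_day = defaultdict(list)
    for url, day in stories:
        per_day[day].append(url)

    # Group the days that have at least one story by year.
    days_by_year = defaultdict(list)
    for day in per_day:
        days_by_year[day.year].append(day)

    sample = {}
    for year, days in days_by_year.items():
        target = median(len(per_day[d]) for d in days)
        # Assumption: choose the day whose story count is closest to the
        # median; sorted() makes the earliest date win a tie.
        chosen = min(sorted(days), key=lambda d: abs(len(per_day[d]) - target))
        sample[year] = (chosen, per_day[chosen])
    return sample


# Toy data: three days in 2002 with 1, 3, and 5 stories.
toy = (
    [("u1", date(2002, 1, 1))]
    + [(f"v{i}", date(2002, 1, 2)) for i in range(3)]
    + [(f"w{i}", date(2002, 1, 3)) for i in range(5)]
)
day, urls = median_day_sample(toy)[2002]
print(day, len(urls))
```

Sampling one day per year keeps the number of URIs that must be checked against the web archives small while still covering every year in the collection.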

Our dataset, collected in September 2022, consists of 1.5 million news stories in Arabic and English published between 1999 and 2022. We found that 47% of the stories published in Arabic were not archived, compared with 42% of the stories published in English. However, the archival rate of Arabic stories increased from 24% to 53% from 2013 to 2022, whereas the archival rate of news stories published in English increased only from 47% to 58%. For Arabic webpages, our results are similar to those of a study published in 2017, which found that Arabic webpages were archived at a rate of 53% in a different dataset consisting of general Arabic webpages from website directories including DMOZ, Raddadi, and Star28 (defunct). There has been a notable increase in the percentage of archived pages from Arabic websites in the last 10 years.

We discovered that 47% of English news stories published between 1999 and 2013 were archived. This differs from another study, which used a different dataset and found in 2017 that 72% of English webpages were archived. The discrepancy may come from the fact that our dataset only included English news stories published by Arabic media, whereas their dataset consisted of general English webpages from the website directory DMOZ.

In our dataset, 58% of English news stories published between 1999 and 2022 were archived. While the archival rate of English pages increased, the increase is not as large as the increase for Arabic pages. For English news stories, the increase could be considered normal for a 10-year timeframe. It is worth mentioning that, since websites have used more and more JavaScript over the last 10 years, archived mementos have more missing resources such as images and other multimedia. The increase is therefore an overall improvement, but there is a chance that less content per page has been captured in recent years. We did not study missing resources in the archived mementos we found, and we cannot confirm whether missing resources are still on the rise in archived webpages.

Arabic and English news stories' URIs archival rate

Category            Arabic Language URIs    English Language URIs
URIs Queried        2684                    1432
URIs Archived       1435                    834
URIs Not Archived   1249                    598
Archival Rate       0.53                    0.58
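
The rates in the table follow directly from the counts: the archival rate is the number of archived URIs divided by the number of URIs queried. The short check below recomputes them from the table's figures; the variable and function names are only illustrative.

```python
# Counts taken from the table above.
counts = {
    "Arabic": {"queried": 2684, "archived": 1435},
    "English": {"queried": 1432, "archived": 834},
}


def archival_rate(archived, queried):
    """Fraction of queried URIs that have at least one memento."""
    return archived / queried


for language, c in counts.items():
    not_archived = c["queried"] - c["archived"]
    rate = archival_rate(c["archived"], c["queried"])
    print(f"{language}: not archived = {not_archived}, rate = {rate:.2f}")
```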

While we were sampling from our dataset, we noticed an increase in Arabic stories published per day (median) for each year. The increase in the number of collected stories over time is expected due to news outlets moving towards publishing on the web in the last 20+ years.

The lower number in the following figure for 2022 is due to our dataset only spanning stories published between January 1999 and September 2022.

The number of collected Arabic stories per day (median)



We could not observe a consistent increase or decrease in the number of published stories in English per day (median) for each year because Arab News did not include any stories published after 2013 in its sitemap. Only Aljazeera English, in our dataset, included stories published after 2013 in its sitemap. The other two news websites, Aljazeera Arabic and Alarabiya, publish news in Arabic.

The number of collected English stories per day (median)

For Arabic news stories published on the median day in our dataset, nothing was archived for 1999, 2000, and 2004. We chose to sample using the median day for the number of stories published per day each year because the median was very close to the mean number of stories published per day. Moreover, using the median day, we were able to obtain a relatively small sample (4116 URIs) that spans and represents 23 years' worth of news stories from four news networks in two languages. Studying the archival rate of all 1.5 million URIs would otherwise not be feasible.

The min, max, median, and mean for the number of collected stories' URIs each day by year

We found a small increase in the archival rate of Arabic webpages until 2010. The rate fluctuates after 2013, but it remained above 40% from 2014 to 2022. Overall, the increase in the archival rate of Arabic news webpages over the last 20 years is significant.

Arabic webpages archival rate by year

For English news stories, nothing was collected for 1999 and 2000 because these news outlets had little to no presence on the web during these years. We noticed even more fluctuation in the archival rate of English webpages, but a smaller overall increase than for Arabic webpages.

English webpages archival rate by year

We measured the archival rate for Arabic webpages in our dataset by web archive to find the contribution of each archive to the archiving of these URIs. Using MemGator to check if the collected news stories were archived by public web archives, we studied the following archives:

1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.vefsafn.is: Icelandic Web Archive
4. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
5. web.archive.org.au: Australian Web Archive
6. webarchive.bac-lac.gc.ca: Library and Archives Canada
7. webarchive.loc.gov: Library of Congress
8. webarchive.nationalarchives.gov.uk: UK National Archives Web Archive
9. webarchive.nrscotland.gov.uk: National Records of Scotland
10. webarchive.org.uk: UK Web Archive
11. webarchive.parliament.uk: UK Parliament Web Archive
12. wayback.nli.org.il: National Library of Israel
13. archive.today: Archive Today
14. arquivo.pt: The Portuguese Web Archive
15. perma.cc: Perma.cc Archive
16. swap.stanford.edu: Stanford Web Archive
17. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
18. web.archive.org: the Internet Archive
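
As a rough illustration of how a MemGator response can be broken down by archive, the sketch below counts mementos per archive host in a JSON TimeMap. Several things here are assumptions rather than details from the study: the base URL of the MemGator instance, the `/timemap/json/` endpoint path, the `mementos` / `list` / `uri` layout of the response, and the choice to treat an HTTP 404 as "no mementos". The function names are hypothetical.

```python
import json
from collections import Counter
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumption: a MemGator instance is reachable at this base URL.
MEMGATOR = "https://memgator.cs.odu.edu"


def timemap_url(uri_r, base=MEMGATOR):
    """Build the JSON TimeMap endpoint for an original resource (URI-R)."""
    return f"{base}/timemap/json/{uri_r}"


def fetch_timemap(uri_r):
    """Fetch the aggregated TimeMap for one URI-R.

    Assumption: a 404 response means no archive returned a memento.
    """
    try:
        with urlopen(timemap_url(uri_r), timeout=60) as response:
            return json.load(response)
    except HTTPError as err:
        if err.code == 404:
            return {}
        raise


def mementos_per_archive(timemap):
    """Count mementos (URI-Ms) per archive host in a parsed JSON TimeMap."""
    hosts = Counter()
    for memento in timemap.get("mementos", {}).get("list", []):
        hosts[urlsplit(memento["uri"]).netloc] += 1
    return hosts


# Offline example with a hand-written TimeMap in the assumed layout.
sample_timemap = {
    "original_uri": "http://example.com/",
    "mementos": {
        "list": [
            {"datetime": "2015-01-01T00:00:00Z",
             "uri": "https://web.archive.org/web/20150101000000/http://example.com/"},
            {"datetime": "2016-01-01T00:00:00Z",
             "uri": "https://web.archive.org/web/20160101000000/http://example.com/"},
            {"datetime": "2016-06-01T00:00:00Z",
             "uri": "https://arquivo.pt/wayback/20160601000000/http://example.com/"},
        ]
    },
}
print(mementos_per_archive(sample_timemap))
```

A URI counts as archived if its TimeMap contains at least one memento, and the per-host counts show which archives contributed.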

Apart from the Internet Archive, only archive.today and arquivo.pt returned any mementos for the 2684 Arabic URIs we queried. Together, they returned a total of seven mementos for six different URIs.

We found that the Internet Archive has archived more Arabic news pages than all other archives combined by a large margin. Other archives hardly contributed to archiving Arabic stories' URIs.

The percentage of archived Arabic news stories in web archives

As for English news webpages, looking at the archival rate by web archive, the Internet Archive returned mementos for a much larger number of URIs than all other web archives combined, but the gap in contribution between the IA and all other web archives combined is not as large as it is for Arabic news webpages in our dataset.

The percentage of archived English news stories in web archives

Furthermore, we found that the union of all other archives' URI-Rs (original resource URIs) is a proper subset of the IA's URI-Rs. In other words, only the IA had exclusive copies of URIs of Arabic news stories. All other archives had no exclusive copies. This doesn't necessarily mean that the union of all other archives' URI-Ms (memento URIs) is a proper subset of the IA's URI-Ms, because URIs could've been archived at different times by different web archives. This finding indicates that losing all web archives besides the IA would cause almost no loss of information. On the other hand, losing the IA would be disastrous for the web archiving of Arabic pages.
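
The exclusivity check described above comes down to set operations on the URI-Rs each archive holds. The sketch below uses made-up toy holdings, not the study's data, and the function name `exclusive_holdings` is hypothetical.

```python
def exclusive_holdings(holdings):
    """Return, per archive, the URI-Rs that no other archive holds.

    holdings: {archive_name: set of URI-Rs with at least one memento there}.
    """
    exclusive = {}
    for archive, uris in holdings.items():
        # Union of everything held by the other archives.
        others = set().union(
            *(u for name, u in holdings.items() if name != archive)
        )
        exclusive[archive] = uris - others
    return exclusive


# Toy holdings for illustration only.
holdings = {
    "web.archive.org": {"uri-a", "uri-b", "uri-c"},
    "archive.today": {"uri-a"},
    "arquivo.pt": {"uri-b"},
}

exclusive = exclusive_holdings(holdings)
others_union = holdings["archive.today"] | holdings["arquivo.pt"]

# The union of the other archives is a proper subset of the IA's holdings,
# so only the IA has exclusive copies in this toy example.
print(others_union < holdings["web.archive.org"])
print({name: sorted(uris) for name, uris in exclusive.items()})
```

The `<` operator on Python sets tests for a proper subset, which matches the relationship stated above between the other archives' URI-Rs and the IA's URI-Rs.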

The percentage of exclusively archived Arabic news stories

For English news webpages in our sample, the IA had many more exclusive copies of URIs than all other archives combined. This indicates that losing all web archives besides the IA would cause very little loss of information, whereas losing the IA would be catastrophic.

The percentage of exclusively archived English news stories

While it is no secret that the IA is the largest web archive on the internet, our study shows that the bulk of archived webpages on the internet could be lost forever if the Internet Archive were shut down by legal threats or crippled by repeated cyberattacks. The most recent DDoS attack and data breach happened in October 2024. Luckily, the DDoS attack was mitigated, but the incident caused the IA to go fully or partially offline to keep the data safe.

Our finding differs from an earlier study by Alsum et al. (2014), who found that it is possible to retrieve full TimeMaps for 93% of their dataset using the top nine web archives without the IA.

Conclusions

The archival rate of Arabic news pages was, and still is, lower than that of English news pages, but the gap is much smaller than it used to be. The archival rate of Arabic news pages has increased from 24% between 1999 and 2013 to 53% between 2013 and 2022. Our study shows that most of the increase is due to the IA's growth over time, while other web archives did not experience such improvement. There was also more room for improvement in archiving Arabic news than English news. We show that losing all archives except the IA would cause no loss of archived Arabic news pages, but the loss would be irreversible if the IA no longer existed. For English webpages, the majority of archived copies would be lost forever if the IA were crippled.


-Hussam Hallak

