About 0.5% of websites publish their content in Arabic, which ranks Arabic 20th among languages on the web; however, Arabic is the 6th most spoken language in the world at 3.4%. A considerable portion of Arabs live in English-speaking countries. For example, Arabs make up roughly 1.2% of the U.S. population. Some of them, mainly the first generation, are able to consume news in Arabic in addition to English. Second, third, and fourth generation Arabs might be interested in the Arabic narrative of news stories, but they prefer English since it is their first language. In this post, we present a quantitative study of the archival rate of news webpages published in Arabic compared to news webpages published in English by Arabic media from 1999 to 2022. We reveal that, contrary to the common assumption that web archives favor English webpages, the archival rate of Arabic webpages is increasing more rapidly than the archival rate of English webpages.
The Dataset
Our dataset consists of the URLs of 1.5 million multilingual news stories, collected in September of 2022 from the sitemaps of four prominent news websites: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. We examined multiple methods to fetch URLs, including RSS, Twitter, GDELT, web crawling, and sitemaps; using sitemaps yielded the largest number of story URLs. We selected a sample of our dataset based on the median number of stories published each day by year, so that the median day represents the year. For example, because the median number of stories published each day in 2002 is 93, we selected the stories published on that day as the sample representing 2002. For all 23 years we studied, the median number of published stories is very close to the mean for the year. Our sample contains 4116 URIs of news stories (2684 Arabic and 1432 English). The dataset is available on GitHub.
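The median-day sampling described above could be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function name `sample_median_days`, the `(url, date)` input shape, the use of the lower median, and the tie-breaking rule (earliest matching day) are all assumptions. Days without any story in the sitemap are simply not counted.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import median_low


def sample_median_days(stories):
    """Pick, for each year, the stories published on that year's median day.

    stories: iterable of (url, publication_date) pairs (hypothetical shape).
    Returns {year: [urls published on the chosen day]}.
    """
    stories = list(stories)
    # Number of stories published on each calendar day.
    per_day = Counter(day for _, day in stories)

    # Group the daily counts by year.
    by_year = defaultdict(list)
    for day, count in per_day.items():
        by_year[day.year].append((count, day))

    sample = {}
    for year, counts in by_year.items():
        # Lower median, so the value always corresponds to a real day.
        target = median_low(c for c, _ in counts)
        # Assumption: when several days share the median count, take the earliest.
        median_day = min(d for c, d in counts if c == target)
        sample[year] = [url for url, day in stories if day == median_day]
    return sample


stories = [
    ("u1", date(2002, 1, 1)),
    ("u2", date(2002, 1, 2)), ("u3", date(2002, 1, 2)),
    ("u4", date(2002, 1, 3)), ("u5", date(2002, 1, 3)), ("u6", date(2002, 1, 3)),
]
print(sample_median_days(stories))
```

In this toy input the daily counts for 2002 are 1, 2, and 3, so the day with two stories is chosen to represent the year.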
Our dataset, collected in September of 2022, consists of 1.5 million news stories in Arabic and English published between 1999 and 2022. We found that 47% of stories published in Arabic were not archived, whereas only 42% of the stories published in English were not archived. However, the archival rate of Arabic stories rose from 24% to 53% between 2013 and 2022. By contrast, the archival rate of news stories published in English increased only from 47% to 58%. For Arabic webpages, our results are similar to those of a study published in 2017, in which Arabic webpages were found to be archived at a rate of 53% for a different dataset consisting of general Arabic webpages from website directories including DMOZ, Raddadi, and Star28 (defunct). There is a notable increase in the percentage of archived pages from Arabic websites in the last 10 years.
We discovered that 47% of English news stories published between 1999 and 2013 were archived. This differs from a 2017 study, based on a different dataset, which found that 72% of English webpages were archived. It is possible that the discrepancy arises because our dataset included only English news stories published by Arabic media, whereas their dataset consisted of general English webpages drawn from the DMOZ website directory.

Category | Arabic Language URIs | English Language URIs |
---|---|---|
URIs Queried | 2684 | 1432 |
URIs Archived | 1435 | 834 |
URIs Not Archived | 1249 | 598 |
Archival Rate | 0.53 | 0.58 |
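As a quick check, the rates in the table follow directly from the URI counts. The snippet below only redoes that arithmetic; the variable and function names are ours and purely illustrative.

```python
# URI counts taken from the table above.
counts = {
    "Arabic": {"queried": 2684, "archived": 1435},
    "English": {"queried": 1432, "archived": 834},
}


def archival_rate(archived, queried):
    """Fraction of queried URIs that have at least one archived copy."""
    return archived / queried


rates = {
    lang: round(archival_rate(c["archived"], c["queried"]), 2)
    for lang, c in counts.items()
}
not_archived = {
    lang: round(1 - archival_rate(c["archived"], c["queried"]), 2)
    for lang, c in counts.items()
}
print(rates)         # archival rate per language
print(not_archived)  # share of URIs with no archived copy
```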
The lower number in the following figure for 2022 is due to our dataset only spanning stories published between January 1999 and September 2022.
Figure: The number of collected Arabic stories per day (median)
We could not observe a consistent increase or decrease in the number of published stories in English per day (median) for each year because Arab News did not include any stories published after 2013 in its sitemap. Only Aljazeera English, in our dataset, included stories published after 2013 in its sitemap. The other two news websites, Aljazeera Arabic and Alarabiya, publish news in Arabic.
Figure: The number of collected English stories per day (median)
For Arabic news stories published on the median day in our dataset, nothing was archived for 1999, 2000, and 2004. We chose to sample by the median day because the median number of stories published per day was very close to the mean. Moreover, using the median day, we were able to obtain a relatively small sample of 4116 URIs that spans and represents 23 years' worth of news stories from four news networks in two languages; studying the archival rate of all 1.5 million URIs would otherwise not have been feasible.
Figure: The min, max, median, and mean for the number of collected stories' URIs each day by year
We found only a slight increase in the archival rate of Arabic webpages until 2010; the rate fluctuated after 2013 but remained above 40% from 2014 to 2022. Overall, the increase in the archival rate of Arabic news webpages over the last 20 years is significant.
Figure: Arabic webpages archival rate by year
For English news stories, nothing was collected for 1999 and 2000 because these news outlets had little to no presence on the web during these years. We noticed even more fluctuation in the archival rate of English webpages, but a smaller overall increase than for Arabic webpages.
Figure: English webpages archival rate by year
We measured the archival rate for Arabic webpages in our dataset by web archive to find the contribution of each archive to the archiving of these URIs. Using MemGator to check if the collected news stories were archived by public web archives, we studied the following archives:
1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.vefsafn.is: Icelandic Web Archive
4. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
5. web.archive.org.au: Australian Web Archive
6. webarchive.bac-lac.gc.ca: Library and Archives Canada
7. webarchive.loc.gov: Library of Congress
8. webarchive.nationalarchives.gov.uk: UK National Archives Web Archive
9. webarchive.nrscotland.gov.uk: National Records of Scotland
10. webarchive.org.uk: UK Web Archive
11. webarchive.parliament.uk: UK Parliament Web Archive
12. wayback.nli.org.il: National Library of Israel
13. archive.today: Archive Today
14. arquivo.pt: The Portuguese Web Archive
15. perma.cc: Perma.cc Archive
16. swap.stanford.edu: Stanford Web Archive
17. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
18. web.archive.org: the Internet Archive
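To illustrate how such a lookup might work, here is a minimal sketch that asks a MemGator service for the aggregated TimeMap of a URI and lists the archive hosts that returned mementos. It assumes the public MemGator instance at memgator.cs.odu.edu, its JSON TimeMap layout (`mementos` → `list` → `uri`), and a 404 response when no archive holds a copy; the function names are ours, and this is not the script used in the study. Note that an archive may serve mementos from a different hostname than the one listed above.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumption: public MemGator instance and its JSON TimeMap endpoint.
MEMGATOR = "https://memgator.cs.odu.edu/timemap/json/"


def fetch_timemap(uri):
    """Return the aggregated TimeMap as a dict, or None if nothing is archived."""
    try:
        with urlopen(MEMGATOR + uri, timeout=60) as resp:
            return json.load(resp)
    except HTTPError as err:
        if err.code == 404:  # assumed to mean: no mementos in any archive
            return None
        raise


def archives_holding(timemap):
    """Set of archive hostnames that returned at least one memento."""
    if not timemap:
        return set()
    return {urlsplit(m["uri"]).hostname for m in timemap["mementos"]["list"]}


# Offline example with a hand-written TimeMap fragment (illustrative data only).
example = {
    "mementos": {
        "list": [
            {"datetime": "2020-01-01T00:00:00Z",
             "uri": "https://web.archive.org/web/20200101000000/https://example.com/"},
            {"datetime": "2020-01-02T00:00:00Z",
             "uri": "https://arquivo.pt/wayback/20200102000000/https://example.com/"},
        ]
    }
}
print(sorted(archives_holding(example)))
```

Calling `archives_holding(fetch_timemap(uri))` for each sampled URI would give the per-archive holdings that the figures below summarize.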
Apart from the Internet Archive, only archive.today and arquivo.pt returned any mementos for the 2684 Arabic URIs we queried. Together they returned a total of seven mementos for six different URIs.
We found that the Internet Archive has archived more Arabic news pages than all other archives combined by a large margin. Other archives hardly contributed to archiving Arabic stories' URIs.
Figure: The percentage of archived Arabic news stories in web archives
As for English news webpages, looking at the archival rate by web archive, the Internet Archive (IA) returned mementos for a much larger number of URIs than all other web archives combined, but the gap in contribution between the IA and the other web archives is not as large as it is for Arabic news webpages in our dataset.
Figure: The percentage of archived English news stories in web archives
Figure: The percentage of exclusively archived Arabic news stories
For English news webpages in our sample, the IA had many more exclusive copies of URIs than all other archives combined, which indicates that losing all web archives besides the IA would cause very little loss of information, whereas losing the IA would be catastrophic.
Figure: The percentage of exclusively archived English news stories
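The notion of exclusive archiving can be made concrete with a small sketch: a URI counts as exclusive to an archive when that archive is the only one holding a copy. The function name, the input shape, and the choice of denominator (all URIs archived somewhere) are assumptions for illustration; the figures above may have been computed differently.

```python
from collections import Counter


def exclusive_share(holdings):
    """Share of archived URIs held by exactly one archive, per archive.

    holdings: {uri: set of archive hostnames holding a copy} (hypothetical shape).
    Assumption: the denominator is the number of URIs archived anywhere.
    """
    archived = {uri: hosts for uri, hosts in holdings.items() if hosts}
    # Count URIs whose only copy lives in a single archive.
    exclusive = Counter(
        next(iter(hosts)) for hosts in archived.values() if len(hosts) == 1
    )
    return {host: n / len(archived) for host, n in exclusive.items()}


# Illustrative data only, not from the study.
holdings = {
    "a": {"web.archive.org"},
    "b": {"web.archive.org", "archive.today"},
    "c": {"archive.today"},
    "d": set(),  # not archived anywhere
    "e": {"web.archive.org"},
}
print(exclusive_share(holdings))
```

An archive with a high exclusive share is one whose loss would remove copies that exist nowhere else.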
The archival rate of Arabic news pages was, and still is, lower than that of English news pages, but the gap is much smaller than it used to be. The archival rate of Arabic news pages has increased from 24% between 1999 and 2013 to 53% between 2013 and 2022. Our study shows that most of the increase is due to the growth of the IA's coverage over time, while other web archives saw no comparable improvement. There was also more room for improvement in archiving Arabic news than English news. We show that losing all archives except the IA would cause no loss in archived Arabic news pages, but the loss would be irreversible if the IA no longer existed. For English webpages, the majority of archived copies would be lost forever if the IA were crippled.