News articles from Indian newspapers about a corruption case involving an Indian doctor. The left images show screenshots of the articles from the print newspapers. The right images show the URLs for the articles returning 404 pages.
My brother, a lawyer in India, recently sent me two screenshots (Figures 1 and 2) of news articles about a corruption case involving a renowned doctor from India. To proceed with legal action against the newspapers for publishing the articles, he needed evidence that the articles had been published, so he asked me to help find the URLs of the articles shown in the screenshots. The articles were published in an English-language newspaper, The Asian Age, and a Hindi-language newspaper, Punjab Kesari.
Figure 1: Screenshot of the news article from the English-language newspaper The Asian Age, shared with me by my brother
Figure 2: Screenshot of the news article from the Hindi-language newspaper Punjab Kesari, shared with me by my brother
Finding URLs for the screenshot of the news articles
I searched the websites of The Asian Age and Punjab Kesari and found links to both articles (shown in the Original URL row of Tables 1 and 2), but both links redirected to a 404 page, as shown in Figures 3 and 4. Fortunately, we found search engine (SE) cached copies of both articles in the Google and Bing caches, as shown in Figures 5 and 6.
Plinio Vargas in his post "Link to Web Archives, not Search Engine Caches" talks about the ephemeral nature of the SE cache URLs and highlights the reason for linking to the web archives over the SE cache URLs. Furthermore, Dr. Michael Nelson in his post "Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives" has already shown us the use of SE cache URLs and the web archives to find answers to real world problems.
Figure 3: A 404 page appears on accessing the news article from Punjab Kesari
Figure 4: A 404 page appears on accessing the news article from The Asian Age
cURL response for The Asian Age news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html"
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:07 GMT
Server: Apache/2.4.7 (Ubuntu)
Location: https://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
Cache-Control: max-age=300
Expires: Fri, 20 Sep 2019 18:40:07 GMT
Connection: close
Content-Type: text/html; charset=iso-8859-1
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:08 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=dsp7g2kkn5sfk2eggaftg3un84; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: /404.html
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html
HTTP/1.1 200 OK
Date: Fri, 20 Sep 2019 18:35:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=koaujt0tiaqgjvafa5je1djps5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Cache: MISS from www.asianage.com
Connection: close
Content-Type: text/html
Figure 6: Google cache for http://www.asianage.com/metros/delhi/260819/rs-100-crore-duping-claim-against-top-doctor.html
cURL response for the Punjab Kesari news article which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL "https://haryana.punjabkesari.in/national/news/police-is-not-taking-action-on-dr-purushottam-who-cheated-the-patients-1050341"
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Location: https://haryana.punjabkesari.in/common404.aspx
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 18:45:12 GMT
X-Cache: Miss from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: Ub5SmJxPQWHJQSIg9xEz-GVZLQtNA4KHkXHT2-qp_6ZD8AFKF_fQKQ==
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 76757
Connection: keep-alive
Cache-Control: public, no-cache="Set-Cookie", max-age=15000
Expires: Fri, 20 Sep 2019 17:17:08 GMT
Last-Modified: Fri, 20 Sep 2019 13:07:08 GMT
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 13:07:08 GMT
Vary: Accept-Encoding,Cookie
X-Cache: Hit from cloudfront
Via: 1.1 21b0487d8c28cb4577401d2a73a03053.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: 5PzkcGPXziNxfNLDffTV3-V6Ks2w3FQiEUWnHMzfZm_aDKfyBKjw7A==
Age: 20281
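Both header dumps above follow the same pattern: one or more 301 hops, one of which points at an error page, and a final 200 OK (often called a soft 404). The following is a minimal Python sketch, not part of the original investigation, that parses `curl -IL` output and flags that pattern. The function names and the heuristic (the last path segment of a `Location` header contains "404") are assumptions for illustration; sites that name their error page differently would need another check.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Matches status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 200".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})")

Hop = Tuple[int, Optional[str]]


def parse_hops(header_dump: str) -> List[Hop]:
    """Return (status, location) for each response in a `curl -IL` dump."""
    hops: List[list] = []
    for raw in header_dump.splitlines():
        line = raw.strip()
        match = STATUS_LINE.match(line)
        if match:
            # A new status line starts a new hop in the redirect chain.
            hops.append([int(match.group(1)), None])
        elif hops and line.lower().startswith("location:"):
            # Header names are case-insensitive ("Location" vs "location").
            hops[-1][1] = line.split(":", 1)[1].strip()
    return [(status, location) for status, location in hops]


def looks_like_soft_404(hops: List[Hop]) -> bool:
    """Heuristic (assumption): chain ends in 200 after a redirect to a '404' page."""
    if not hops or hops[-1][0] != 200:
        return False
    for _status, location in hops[:-1]:
        if location is None:
            continue
        last_segment = urlsplit(location).path.rsplit("/", 1)[-1]
        if "404" in last_segment:
            return True
    return False


if __name__ == "__main__":
    sample = (
        "HTTP/1.1 301 Moved Permanently\n"
        "Location: https://haryana.punjabkesari.in/common404.aspx\n"
        "HTTP/1.1 200 OK\n"
    )
    print(looks_like_soft_404(parse_hops(sample)))
```

Checking only the final status code would report both articles as live, which is why the sketch inspects the intermediate `Location` headers as well.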
Push the cached URLs to multiple web archives
We pushed the Bing and Google cache URLs (URI-R-SEs) for both news articles to the Internet Archive, perma.cc, and archive.is. The URI-Ms for the URI-R-SEs are shown in Tables 1 and 2. We can use ArchiveNow to automate pushing of URLs to multiple web archives. We also captured the WARC files of the URI-R-SEs for the articles using Webrecorder and stored the WARCs locally.
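As a rough illustration of the automation mentioned above, the sketch below fans one URI-R-SE out to several archives and records what each returned. It is a hedged sketch rather than the exact procedure we followed: the helper `push_everywhere` is hypothetical, and it takes the push function as a parameter so it can run without any archive client installed. With ArchiveNow installed, one would pass `archivenow.push`, assuming the `push(url, handler)` call and the handler identifiers `"ia"` and `"is"` described in the ArchiveNow README; perma.cc typically needs an API key, which this sketch does not handle.

```python
from typing import Callable, Dict, Iterable, List


def push_everywhere(
    url: str,
    handlers: Iterable[str],
    push: Callable[[str, str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Call push(url, handler) for each archive and collect the results.

    A failure in one archive is recorded and does not stop the others.
    """
    results: Dict[str, List[str]] = {}
    for handler in handlers:
        try:
            results[handler] = list(push(url, handler))
        except Exception as exc:  # keep going if one archive is unreachable
            results[handler] = [f"error: {exc}"]
    return results


# Hypothetical usage, assuming ArchiveNow is installed:
#
#   from archivenow import archivenow
#   uri_ms = push_everywhere(cache_url, ["ia", "is"], archivenow.push)
```

Recording failures per archive matters here because SE cache entries are short-lived, so a retry against the archive that failed may need to happen quickly.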
Accessing the Cache URLs in the Web Archives
Web archives index mementos by their URI-R. A SE cache URI-M can only be accessed by users who know the URI-R-SE, which is mostly opaque because of its various parameters and encodings. As shown in Figure 7, the URI-R-SE for the same web resource may vary by geographic location, which means that the same web resource may be indexed under different URI-R-SEs in the web archives.
In the US, the Bing cache URL for The Asian Age news article is
http://cc.bingj.com/cache.aspx?
In India, the Bing cache URL for The Asian Age news article is
http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html&d=4857393311190329&mkt=en-IN&setlang=en-US&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y
Figure 7: The Bing cache URL for the US (left) returns 200, while the one for India (right) returns 404
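To make the opacity of the URI-R-SE concrete, here is a small Python sketch that splits the Indian Bing cache URL shown above into its query parameters. The function name is hypothetical, and the interpretation is an assumption based only on what is visible in the URL: `q` carries the percent-encoded URI-R, `mkt` carries the market (`en-IN`), and the remaining parameters such as `d` and `w` are treated as opaque.

```python
from typing import Dict, List, Optional, Union
from urllib.parse import parse_qs, urlsplit


def describe_bing_cache_url(uri_r_se: str) -> Dict[str, Union[str, None, List[str]]]:
    """Split a Bing cache URL into the URI-R, the market, and opaque keys."""
    params = parse_qs(urlsplit(uri_r_se).query)
    uri_r: str = params.get("q", [""])[0].strip()  # "+" decodes to a leading space
    market: Optional[str] = params.get("mkt", [None])[0]
    opaque = sorted(k for k in params if k not in ("q", "mkt", "setlang"))
    return {"uri_r": uri_r, "market": market, "opaque": opaque}


if __name__ == "__main__":
    india = (
        "http://cc.bingj.com/cache.aspx?q=+http%3a%2f%2fwww.asianage.com"
        "%2fmetros%2fdelhi%2f260819%2frs-100-crore-duping-claim-against-top-doctor.html"
        "&d=4857393311190329&mkt=en-IN&setlang=en-US"
        "&w=dLDQJ43_8q6g4yPEAeK5Q-U3JNpx878y"
    )
    print(describe_bing_cache_url(india))
```

Because the market and the opaque identifiers are part of the URL itself, two users in different countries can end up archiving the same cached page under two different URI-R-SEs.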
Pushing the URI-R-SE to multiple web archives not only makes it accessible from multiple web archives, but also lets some web archives be leveraged to find mementos in the others. As shown in Figure 8, archive.is extracts the URI-R of the article from the URI-R-SE and indexes the URI-Ms for the URI-R-SE under both the URI-R and the URI-R-SE. As shown in Figure 9, we accessed a memento of the URI-R-SE from the Internet Archive using the URI-R-SE extracted from archive.is, which is what the other web archives treat as the URI-R.
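Once the URI-R-SE has been recovered from archive.is, looking it up in the Internet Archive amounts to appending it to a Wayback Machine URL. The sketch below builds the two lookups we find most useful: a TimeMap listing all captures, and a datetime-based URL that the Wayback Machine resolves to the closest capture. The function names are hypothetical, and appending the URI-R-SE verbatim (query string included) is an assumption about how the archive keyed the capture.

```python
from datetime import datetime

WAYBACK = "https://web.archive.org/web"


def wayback_timemap_url(uri: str) -> str:
    """Link-format TimeMap of every capture the Wayback Machine holds for uri."""
    return f"{WAYBACK}/timemap/link/{uri}"


def wayback_lookup_url(uri: str, when: datetime) -> str:
    """URL that the Wayback Machine resolves to the capture closest to `when`."""
    # The Wayback Machine uses a 14-digit YYYYMMDDhhmmss datetime in its URLs.
    return f"{WAYBACK}/{when:%Y%m%d%H%M%S}/{uri}"


if __name__ == "__main__":
    # Illustrative cache URL only; substitute the URI-R-SE found on archive.is.
    uri_r_se = "http://cc.bingj.com/cache.aspx?d=4857393311190329&mkt=en-IN"
    print(wayback_timemap_url(uri_r_se))
    print(wayback_lookup_url(uri_r_se, datetime(2019, 9, 20, 18, 35, 7)))
```

The same lookup with the article's URI-R instead of the URI-R-SE would miss these captures, since the Internet Archive indexes them under the cache URL.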
Figure 9: Using the Bing cache URL from archive.is to retrieve mementos of the search engine cache from the Internet Archive
Figure 10: Memento of a SE cache which did not capture the intended content
Figure 11: Google indexed a document from the Internet Archive which lists the memento from perma.cc for The Asian Age news article
Sometimes SE caches hold pages that are missing (404) from the live web but not yet archived. We should push the SE cache URL (URI-R-SE) to multiple web archives, and we can automate saving URLs to multiple web archives simultaneously by using ArchiveNow. We can use web archives like archive.is to get the URI-R-SE from the URI-R of the resource, which can then be used to search the other web archives for mementos of the URI-R-SE.
My studies in web archiving helped me solve a real-world problem posed by my brother, who needed the URLs of news articles for which he had only screenshots. I found those URLs in SE caches and pushed them to multiple web archives so that he can use them in his legal proceedings.
Mohammed Nauman Siddique
(@m_nsiddique)