Channel: Web Science and Digital Libraries Research Group

2024-01-16: Paper summary: Archival HTTP Redirection Retrieval Policies


Figure 1: URI-R and URI-M HTTP redirection relationship cases. Figure 2 from AlSum et al.

Redirection of web pages refers to the process of forwarding a user from one URI (Uniform Resource Identifier) to another. This can happen for various reasons, and it is a common practice on the web. Redirection is implemented using HTTP status codes in the 3xx range. For example, status code 301 is a permanent redirect, and status code 302 is a temporary redirect. In these cases the client (typically a web browser) automatically sends the user to the URI given in the Location response header. In this blog post I summarize the paper titled "Archival HTTP redirection retrieval policies" by AlSum et al.
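To make the redirect mechanics concrete, here is a minimal Python sketch of how a client follows Location headers until it reaches a non-redirect response. It works on an in-memory dictionary rather than the network; the function name `follow_redirects`, the `responses` mapping, and the example.com-style URIs are illustrative assumptions, not something taken from the paper.

```python
from urllib.parse import urljoin

# Status codes that carry a Location header the client is expected to follow.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(start_uri, responses, max_hops=10):
    """Follow Location headers through `responses`.

    `responses` is a hypothetical stand-in for the network: a dict mapping
    a URI to a (status_code, location_or_None) tuple.
    Returns (final_uri, final_status, chain_of_visited_uris).
    """
    uri = start_uri
    chain = [uri]
    for _ in range(max_hops):
        status, location = responses[uri]
        if status not in REDIRECT_CODES or location is None:
            return uri, status, chain
        # Location may be relative, so resolve it against the current URI.
        uri = urljoin(uri, location)
        chain.append(uri)
    raise RuntimeError("too many redirects")


# Usage with made-up URIs: a 301 followed by a 302 that ends in a 200.
responses = {
    "http://a.example/": (301, "/b"),
    "http://a.example/b": (302, "http://c.example/"),
    "http://c.example/": (200, None),
}
final_uri, final_status, chain = follow_redirects("http://a.example/", responses)
```

The same idea of taking the last response in a chain is what the stability and reliability calculations later in this post rely on.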

In web archives, mementos (archived web pages) can be archived redirects, i.e., a URI-M (URI of the memento) with a 3xx HTTP status code, which occur when the URI-R (URI of the original resource) returned a redirection status code at crawl time.


Figure 2 shows the calendar for the mementos of https://srilankacricket.lk/ in 2022. In the figure, blue indicates the HTTP status code 200, green indicates HTTP 3xx status code, and orange indicates the URL was not found, or 4xx HTTP status code.


Figure 2: Calendar of https://srilankacricket.lk/ in 2022 in Internet Archive’s Wayback Machine.

When a user retrieves a memento with a green color code, there are two possible destinations: (1) the user might end up at a memento in the same TimeMap, or (2) the user may be redirected to a memento in a different TimeMap with a different URI-R. For example, in the TimeMap of https://srilankacricket.lk/, the memento from 2008-07-01 is an archived redirect that leads to a memento with the URI-R http://sl.cricinfo.com/db/NATIONAL/SL/SLC/, which has its own separate TimeMap.


Figure 3: Calendar of https://srilankacricket.lk/ in 2008.


Figure 4: Archived version on 2008-07-01 of https://srilankacricket.lk/.


The simplest and most straightforward method of retrieving archived web pages is to use the URI-R as the lookup URI. The situation is different when the URI-R redirects on the live web. In that case the user needs to choose the lookup URI from between URI-R and URI-R* (the URI of the redirected resource).

 

AlSum et al. introduced and evaluated two policies that can be used to choose the best URI to use as the lookup URI in cases where redirection is present, based on a quantitative study. 


Abstract model


The authors proposed an abstract model to quantify the measurements, URI stability and URI reliability, based on the behavior of HTTP status codes of URI-R and URI-R*.


URI stability was discussed in four different categories of TimeMaps as shown in Figure 5.


Figure 5: Categorization of TimeMaps. Figure 3 from AlSum et al.


I considered https://srilankacricket.lk/ as an example to explore URI stability and URI reliability. As of 2023-11-30, https://srilankacricket.lk/ has a total of 1342 mementos. The HTTP status codes of those 1342 mementos are distributed as shown in Table 1.


Table 1: Distribution of HTTP status codes on the TimeMap of https://srilankacricket.lk/.

| HTTP status code | Number of mementos |
|------------------|--------------------|
| 200              | 539                |
| 301/302          | 745                |
| 403              | 7                  |
| 500/522/503      | 7                  |


Based on the categorization in AlSum et al., https://srilankacricket.lk/ belongs to category (d) in Figure 5, which covers the scenario where mementos have different HTTP status codes. Equations 1 and 2 (equation 1 in AlSum et al.) show how to calculate the stability of a TimeMap.


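$$\text{Stability} = 1 - \frac{\sum_{M_i \in TM} \text{Change}(M_i, M_{i-1})}{|TM|} \tag{1}$$

$$\text{Change}(M_i, M_{i-1}) = \begin{cases} 1 & \text{if } \text{Status}(M_i) \neq \text{Status}(M_{i-1}) \text{ or } \text{Location}(M_i) \neq \text{Location}(M_{i-1}) \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where |TM| > 0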
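The stability definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each memento is represented as a hypothetical `(status, location)` tuple in chronological order, and the first memento, which has no predecessor, is assumed to contribute no change.

```python
def stability(mementos):
    """Stability of a TimeMap given chronologically ordered mementos.

    Each memento is a (status_code, location_or_None) tuple. A change is
    counted whenever the status or the Location differs from the previous
    memento. Assumption: the first memento counts as "no change".
    """
    if not mementos:
        raise ValueError("TimeMap must contain at least one memento")
    changes = sum(
        1
        for prev, curr in zip(mementos, mementos[1:])
        if curr[0] != prev[0] or curr[1] != prev[1]
    )
    return 1 - changes / len(mementos)


# Made-up TimeMap: one change (200 -> 301) across four mementos.
example = [
    (200, None),
    (200, None),
    (301, "http://example.org/new"),
    (301, "http://example.org/new"),
]
example_stability = stability(example)
```

With one change in four mementos the sketch gives 1 - 1/4 = 0.75; a TimeMap whose status and Location never change gives exactly 1.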


AlSum et al. consider a URI stable if its stability equals 1 or is close to 1. I collected the memento URIs (URI-Ms) of https://srilankacricket.lk/ and, for each URI-M with a 3xx redirection status code, I extracted the Location response header (if there was a redirection chain, I used the Location header of the last response). The results indicate that there are five changes in URI-R at different times along the TimeMap. Since the URI-R changed only five times, according to equation 1 the stability of the URI-R is high. Therefore, it can be categorized as a stable URI.


The stability of https://srilankacricket.lk/ can be calculated as:


$$\text{Stability} = 1 - \frac{5}{1342} = 0.9963$$


While stability provides insight into the status code changes of URI-R over time, it doesn't guarantee successful memento retrieval. Mementos (M(R)) are categorized into two types: successful retrieval, indicated by an HTTP status code of 200 or a redirection chain ending with 200, and unsuccessful retrieval, characterized by a memento having 4xx/5xx or a redirection chain ending with 4xx/5xx. URI reliability was defined as the ratio of successful mementos to the total number of mementos per TimeMap.


Continuing the example of https://srilankacricket.lk/, the reliability of the URI-R can be calculated using equation 3 (equation 2 in AlSum et al.). 


$$\text{Reliability} = \frac{\text{Number of mementos ending with a 200 HTTP status code}}{|TM|} \tag{3}$$

where |TM| > 0
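As a rough illustration of the reliability ratio, the sketch below takes the final HTTP status code of each memento (the code of the memento itself, or of the last response in its redirection chain) and computes the share that ended with 200. The function name and the input representation are assumptions made for this example; the counts 1310 and 32 are the ones reported in this post.

```python
def reliability(final_statuses):
    """Share of mementos whose final HTTP status code is 200.

    `final_statuses` holds one status code per memento: the memento's own
    code, or the code of the last response in its redirection chain.
    """
    if not final_statuses:
        raise ValueError("TimeMap must contain at least one memento")
    successful = sum(1 for status in final_statuses if status == 200)
    return successful / len(final_statuses)


# Counts from the srilankacricket.lk example: 1310 successful retrievals
# and 32 unsuccessful ones (404 is used as a placeholder for any 4xx/5xx).
statuses = [200] * 1310 + [404] * 32
ratio = round(reliability(statuses), 4)
```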


To calculate the reliability of https://srilankacricket.lk/, I extracted the HTTP status codes of the URI-Ms (for mementos with redirection chains, I used the HTTP status code of the last response). I found 1310 mementos ending with a 200 status code, which AlSum et al. classify as successful retrieval. Table 2 shows the number of mementos successfully and unsuccessfully retrieved for the TimeMap of https://srilankacricket.lk/.


Table 2: Distribution of the successful and unsuccessful retrieval of mementos

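| HTTP status code/HTTP status code of the last response of redirection chains | Number of mementos |
|------------------------------------------------------------------------------|--------------------|
| 200 OK (successful retrieval)                                                | 1310               |
| 4xx/5xx (unsuccessful retrieval)                                             | 32                 |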


Therefore, the reliability of https://srilankacricket.lk/ is:


$$\text{Reliability} = \frac{1310}{1342} = 0.9762$$

Data Collection


To assess URI stability and reliability, the authors obtained a random sample of 10,000 URIs from the Open Directory Project (DMOZ) in January 2012. Table 3 shows the distribution of HTTP status codes for the 10,000 URIs in the current web.


Table 3: Distribution of HTTP status codes in the current web for the sample of 10,000 URIs, Table 3 from AlSum et al.


The study employed a Memento Aggregator to obtain the TimeMap (TM(R)) for each URI-R. For redirected resources, the TimeMap (TM(R*)) was also retrieved. Subsequently, the HTTP status code (Status(M)) for each memento in TM(R) was collected. In cases of redirection mementos (M(Rx) → M(Ry)), the destination M(Ry) was recorded, and its original resource Ry was extracted after canonicalization for consistent comparison of URIs.
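The step of extracting the original resource from a redirect destination and canonicalizing it can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes Wayback Machine style URI-Ms of the form `https://web.archive.org/web/<timestamp>/<URI-R>`, and the canonicalization rules (ignore the scheme, lowercase the host, drop a leading `www.`, strip trailing slashes) are assumptions chosen for the example.

```python
import re
from urllib.parse import urlsplit

# Wayback Machine style URI-M: /web/<timestamp>[modifier]/<URI-R>.
URI_M_PATTERN = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4,14})(?:[a-z]{2}_)?/(.+)$"
)


def split_uri_m(uri_m):
    """Return (timestamp, uri_r) extracted from a Wayback-style URI-M."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognized URI-M: {uri_m}")
    return match.group(1), match.group(2)


def canonicalize(uri_r):
    """Simplified canonical form used only to compare URI-Rs.

    Assumed rules: ignore the scheme, lowercase the host, drop a leading
    'www.', and strip trailing slashes from the path.
    """
    parts = urlsplit(uri_r)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


timestamp, original = split_uri_m(
    "https://web.archive.org/web/20080701000000/http://sl.cricinfo.com/db/NATIONAL/SL/SLC/"
)
same_resource = canonicalize("https://www.srilankacricket.lk/") == canonicalize(
    "http://srilankacricket.lk"
)
```

Comparing canonical forms rather than raw strings avoids counting a switch from http to https, or an added `www.`, as a redirect to a different original resource.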


Results


Relationship between TimeMap of URI-R and TimeMap of URI-R*


The authors categorized the relationship between the two TimeMaps into 7 cases. For example, in case 1 the TimeMap for URI-R* ends before the TimeMap for URI-R begins.


Figure 6: The relationship between the TimeMap of the original URI-R and the TimeMaps of the redirected URI-R*, Figure 3 (a) from AlSum et al.


Figure 7 shows the stability ratio for the TimeMaps returned for the 10,000 URIs considered in AlSum et al. It illustrates that as the number of mementos in the TimeMap increases, the URI tends to be stable. The results indicate that a large proportion of the URIs in this DMOZ sample have stability close to 1.


Figure 7: URI stability for the collected sample of URIs, Figure 4 from AlSum et al.


Figure 8 shows the reliability ratio against the number of mementos for each TimeMap that contains at least one redirection status code. The graph illustrates that there is no strong correlation between the number of mementos and the reliability, but most of the points with higher reliability ratios (reliability ratio > 0.8) have more than 50 mementos.


Figure 8: URI reliability for the collected sample of URIs, Figure 5 from AlSum et al.

Policies for querying the archive with a URI that has HTTP redirection, and their evaluation using an example


Policy one: Retrieving a memento for URI-R


Policy one targets URI-Rs that have an HTTP redirection on the live web. The steps of policy one, as described in AlSum et al., are:


  1. If the retrieved memento for URI-R has a 200 HTTP status code, return it.

  2. Else if the retrieved memento is a redirection (3xx), go to policy two.

  3. Else if the retrieved memento is unavailable (4xx/5xx HTTP status code) and URI-R redirects to URI-R* on the live web, use URI-R* instead of URI-R.
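The three steps above can be sketched as a small decision function. This is a hedged illustration rather than the authors' implementation: `lookup_memento` and `live_redirect_target` are hypothetical callables standing in for an archive query and a live-web check, the returned action labels are invented for the example, and treating a missing memento like a 4xx/5xx memento is an assumption.

```python
def policy_one(uri_r, lookup_memento, live_redirect_target):
    """Decide what to do for a lookup of `uri_r` under policy one.

    lookup_memento(uri) -> (status_code, uri_m) or None   (hypothetical)
    live_redirect_target(uri) -> URI-R* or None           (hypothetical)
    Returns an (action, value) tuple.
    """
    result = lookup_memento(uri_r)
    if result is not None:
        status, uri_m = result
        if status == 200:
            # Step 1: a successful memento is returned as is.
            return ("memento", uri_m)
        if 300 <= status < 400:
            # Step 2: archived redirects are handled by policy two.
            return ("policy_two", uri_m)
    # Step 3: memento unavailable (assumption: a missing memento is
    # treated the same as 4xx/5xx), so try the live-web redirect target.
    uri_r_star = live_redirect_target(uri_r)
    if uri_r_star is not None:
        return ("retry_with", uri_r_star)
    return ("not_found", None)


# Usage with made-up data: dicts stand in for the archive and the live web.
archive = {
    "http://ok.example/": (200, "memento-1"),
    "http://moved.example/": (302, "memento-2"),
    "http://old.example/": (404, "memento-3"),
}
live = {"http://old.example/": "http://new.example/"}
decision = policy_one("http://old.example/", archive.get, live.get)
```

Passing the two lookups in as functions keeps the decision logic separate from any particular archive or HTTP client.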


As of 2023-11-30 https://srilankacricket.lk/ lacks redirection in the live web. Therefore policy one cannot be evaluated using it.


Policy two: Retrieving a memento for URI-M


Policy two addresses the case where the URI-R has no redirection, but the URI-M has an HTTP redirection. Here, the memento will redirect to a different memento with a different original resource.


Policy two is based on URI-Ms having redirections. Therefore, I examined the mementos having 3xx HTTP status codes at different times in the TimeMap of https://srilankacricket.lk/.


I found that https://srilankacricket.lk/ had redirects to the following URI-R*s over time:


  1. http://www.cricinfo.com/link_to_database/NATIONAL/SL/CORPORATE/ 

  2. http://www.cricinfo.com/link_to_database/NATIONAL/SL/SLC/

  3. http://sl.cricinfo.com/db/NATIONAL/SL/SLC/

  4. https://cricket.lk/


TimeMaps corresponding to the above URI-R*s are shown in Table 3.


Table 3: TimeMaps of the URI-R*s of https://srilankacricket.lk/ 

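| TimeMap | Number of mementos | Time Period              |
|---------|--------------------|--------------------------|
| TM (1)  | 18                 | 2004-01-07 to 2023-10-09 |
| TM (2)  | 194                | 2004-03-31 to 2023-11-20 |
| TM (3)  | 88                 | 2004-04-27 to 2023-10-09 |
| TM (4)  | 1258               | 2013-12-21 to 2023-09-24 |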


To evaluate policy two, the time frame coverage of the TimeMaps of the URI-R*s was compared with the original URI-R’s TimeMap.


The TimeMap of https://srilankacricket.lk/ contains 1342 mementos from 2004-02-22 to 2023-11-30, so most of the URI-R's archived history is covered by its own TimeMap. All URI-R*s in Table 3 have TimeMaps smaller than that of https://srilankacricket.lk/, which means |TM (R)| > |TM (R*)| for all four TimeMaps.
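The size comparison can be written down directly from the memento counts reported in this post. The sketch below is only an illustration of the |TM (R)| versus |TM (R*)| check; the helper name `larger_timemaps` is an assumption made for the example.

```python
def larger_timemaps(tm_r_size, tm_r_star_sizes):
    """Return the URI-R*s whose TimeMap is larger than the URI-R's TimeMap."""
    return [
        uri for uri, size in tm_r_star_sizes.items() if size > tm_r_size
    ]


# Memento counts reported in this post for https://srilankacricket.lk/
# and for the four URI-R*s it redirected to over time.
TM_R_SIZE = 1342
TM_R_STAR_SIZES = {
    "http://www.cricinfo.com/link_to_database/NATIONAL/SL/CORPORATE/": 18,
    "http://www.cricinfo.com/link_to_database/NATIONAL/SL/SLC/": 194,
    "http://sl.cricinfo.com/db/NATIONAL/SL/SLC/": 88,
    "https://cricket.lk/": 1258,
}

bigger = larger_timemaps(TM_R_SIZE, TM_R_STAR_SIZES)
```

For this example the list comes back empty, which matches the observation that |TM (R)| > |TM (R*)| for all four URI-R*s.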


Discussion


After analyzing HTTP responses across time for 10,000 URIs, the authors found that a large portion of URIs have a stability score close to 1 and that there is no strong correlation between the number of mementos and the reliability. In the policy evaluations using the sample data obtained from the Open Directory Project, the authors showed that the new policies can deliver mementos that were unreachable using the regular methods. The authors introduced policy one to retrieve mementos when the URI-R has a redirection on the live web, while policy two focuses on the HTTP redirection of mementos. The authors compared |TM (R)| and |TM (R*)| to decide whether the URI-R* can contribute to TM (R) or not. A client-side implementation of these policies can guide users to select a path according to the behavior of the URI of interest when interacting with web archives.


Conclusions


AlSum et al. conducted a quantitative study to evaluate a set of policies that help the client identify the correct memento when redirection is present. The authors identified four categories of TimeMaps by observing the changes in HTTP status codes and introduced two metrics, URI stability and URI reliability, to assess a URI-R. The example (https://srilankacricket.lk/) that I used to demonstrate URI stability and reliability turned out to be stable (stability = 0.9963) and reliable (reliability = 0.9762). This is because the URI-R redirected to different URI-R*s only a few times along the TimeMap from 2004 to 2021, and later it became steadier, with mementos that had no redirections or that redirected to the same URI-R. The reliability depends on the number of mementos that have HTTP status code 200 or a redirection chain ending with 200. In the TimeMap of https://srilankacricket.lk/, almost all the mementos with redirections and redirection chains end with 200 HTTP status codes. Therefore the majority of the mementos either have 200 HTTP status codes or have redirection chains ending with 200 HTTP status codes, which shows that the URI-R is reliable. Although there is no strong correlation between the size of the TimeMap and the reliability ratio, reliability tends to be high for the larger TimeMaps.


Based on the experiment, two policies for returning a memento when faced with redirection were introduced. Policy one was successful in 22% of the cases where the URI-R had a redirection on the live web. Policy two was successful in 58% of the cases, contributing to the original resource’s TimeMap (|TM (R)|) where the URI-R* of archived redirects had a TimeMap (|TM (R*)|) larger than |TM (R)|. In the example https://srilankacricket.lk/, there are four different URI-R*s present along the timeline, with |TM (R)| > |TM (R*)| for all four URI-R*s. However, to understand the history of https://srilankacricket.lk/, a user needs to observe all five TimeMaps, including the TimeMap of https://srilankacricket.lk/.


Reference


AlSum, A., Nelson, M. L., Sanderson, R., & Van De Sompel, H. (2013). Archival HTTP redirection retrieval policies. 22nd International Conference on World Wide Web. https://doi.org/10.1145/2487788.2488117


-Kumushini Thennakoon (@KumushiniT)


