Figure 1: Jawa Overview. Figure from https://www.usenix.org/system/files/osdi22-goel.pdf
While working on the Saving Ads project, we identified problems with replaying ads (technical report: “Archiving and Replaying Current Web Advertisements: Challenges and Opportunities”) that used JavaScript code to dynamically generate URLs. These URLs included random values that differed during crawl time and replay time, resulting in failed requests upon replay. Figure 2 shows an example ad iframe URL that failed to replay, because a dynamically generated random value was used in the subdomain. URL matching approaches like fuzzy matching could resolve these problems by matching the dynamically generated URL with the URL that was crawled.
Figure 2: Different SafeFrame URLs during crawl and replay sessions. Google’s pubads_impl.js (WACZ | URI-R: https://securepubads.g.doubleclick.net/pagead/managed/js/gpt/m202308210101/pubads_impl.js?cb=31077272) generates the random SafeFrame URL.
Goel, Zhu, Netravali, and Madhyastha’s "Jawa: Web Archival in the Era of JavaScript" involved identifying sources of non-determinism that cause replay problems when dynamically loading resources, removing some of the non-deterministic JavaScript code, and applying URL matching algorithms to reduce the number of failed requests that occur during replay.
Video 1: Presentation video
Sources of Non-Determinism
Non-determinism can cause variance in dynamically generated URLs (e.g., the same resource referenced by multiple URLs with different query string values, such as https://www.example.com/?rnd=4734 and https://www.example.com/?rnd=7765). This variance can result in failed requests (like the example shown in Figure 2) if the replay system does not have an approach for matching the requested URL with one that was successfully crawled. The sources of non-determinism that cause problems with replaying archived web pages are server-side state, client-side state, client characteristics, and JavaScript's Date, Random, and Performance (DRP) APIs. When replaying web pages, client browsers do not have access to server-side and client-side state. The other sources of non-determinism (client characteristics and DRP APIs) are present during replay and impact JavaScript execution.
When a web page’s functionality requires dynamically constructed server responses (e.g., posting comments, push notifications, and login), that functionality can be impacted because the archived web page can no longer communicate with the website’s origin servers. The functionality would also be impacted if the archived web page requests resources that were not archived during the crawling session. For client characteristics, the authors ensured that all APIs would return the same value during replay time as they did during crawl time. For DRP APIs, they used server-side matching of requested URLs to crawled URLs.
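The paper does not spell out the exact mechanism for pinning client characteristics, but the general idea can be illustrated with a minimal sketch: record the values returned by client-characteristic APIs during the crawl and generate a script that overrides them during replay so JavaScript sees identical values. The recorded keys and values below are hypothetical examples, not Jawa's actual recording format.

```python
import json

# Hypothetical client-characteristic values recorded during the crawl.
# These keys and values are illustrative only.
recorded_characteristics = {
    "navigator.userAgent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "navigator.language": "en-US",
    "screen.width": 1920,
    "screen.height": 1080,
}

def build_replay_init_script(recorded: dict) -> str:
    """Generate JavaScript that pins client-characteristic APIs to the
    values observed at crawl time, so scripts see the same values at replay."""
    lines = []
    for path, value in recorded.items():
        obj, prop = path.rsplit(".", 1)
        lines.append(
            f"Object.defineProperty({obj}, {json.dumps(prop)}, "
            f"{{ value: {json.dumps(value)}, configurable: true }});"
        )
    return "\n".join(lines)

print(build_replay_init_script(recorded_characteristics))
```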
Reducing Storage When Archiving Web Pages
Goel et al. created a web crawler named Jawa (JavaScript-aware web archive) that removes non-deterministic JavaScript code so that the replay of an archived web page does not change when different users replay it. Since Jawa removes some third-party scripts, its preservation style falls in between Archival Caricaturization and the Wayback style. Archival Caricaturization is a term created by Berlin et al. (“To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages”) to describe a type of preservation that does not preserve the web page as it originally was during crawl time. Archive.today is an example of Archival Caricaturization, where all of the original JavaScript is removed during replay. In contrast, archives that use the Wayback style archive and replay all of the resources of a web page and only make minimal rewrites during replay.
Jawa reduced the storage necessary for their corpus of 1 million archived web pages by 41% when compared to techniques that were used by the Internet Archive during 2020 (Figure 3). This storage savings occurred because they discarded 84% of JavaScript bytes (Figure 4). During their presentation (https://youtu.be/WdxWpGJ-gUs?t=877), the authors mentioned that the 41% reduction in storage also includes other resources (e.g., HTML, CSS, and images) that would have been loaded by the excluded JavaScript code. Jawa saves storage by not archiving non-functional JavaScript code and by removing unreachable code. When removing JavaScript code, they ensured that the removed code did not affect the execution of the rest of the code.
Figure 4: JavaScript bytes removed. Figure from https://www.usenix.org/sites/default/files/conference/protected-files/osdi22_slides_goel.pdf
Brunelle et al.’s “Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly” also involved measuring JavaScript’s impact on storage when archiving web pages. They found that using a browser-based crawler that executes JavaScript during the crawling session required 11.3 times more storage for all 303 web pages in their collection and 5.12 times more storage per URI (approximately 413.2 KB/URI). If we multiply this per-URI measurement by the number of URIs in Jawa’s corpus of 1 million web pages, a browser-based crawler would be expected to require approximately 413.2 GB of storage to archive all of the web pages. When Goel et al. used techniques similar to the Internet Archive’s (which the authors referred to as IA*), 535 GB was required to archive the web pages in the corpus, while Jawa required 314 GB. Since the amount of JavaScript (and the resources dynamically loaded by this code) has increased (Figure 5), the IA* approach required more storage than Brunelle et al.’s measurements would have predicted. Even though the amount of storage required to archive web pages has increased, Jawa achieved enough storage savings to fall below the storage previously expected for browser-based crawlers.
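As a quick sanity check on these numbers, the back-of-the-envelope arithmetic can be written out in a few lines of Python (using decimal units, where 1 GB = 10^6 KB):

```python
# Estimate from Brunelle et al.'s per-URI measurement:
kb_per_uri = 413.2          # approximate storage per URI with a browser-based crawler
num_uris = 1_000_000        # size of Jawa's corpus

estimated_gb = kb_per_uri * num_uris / 1_000_000   # KB -> GB (decimal units)
print(f"Estimated storage: {estimated_gb:.1f} GB")  # ~413.2 GB

# Reported measurements from Goel et al. for the same corpus:
ia_star_gb = 535   # IA*-style crawling
jawa_gb = 314      # Jawa
print(f"Jawa saves {(1 - jawa_gb / ia_star_gb) * 100:.0f}% vs IA*")  # ~41%
```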
Figure 5: The amount of JavaScript code used on a web page is increasing. Figure from https://www.usenix.org/sites/default/files/conference/protected-files/osdi22_slides_goel.pdf
Brunelle et al. also compared crawl throughput and showed that browser-based crawlers take significantly longer (38.9 times longer than Heritrix) to crawl web pages than traditional web archive crawlers that do not execute JavaScript during crawl time. Since Jawa reduces the amount of JavaScript archived, it was able to improve crawling throughput by 39% when it archived web pages from Goel et al.’s corpus (Figure 6).
Figure 6: Comparison of crawling throughput. Figure from https://www.usenix.org/system/files/osdi22-goel.pdf
Removing Non-Functional Code From JavaScript Files
Their approach for removing non-functional code is based on two observations about JavaScript code that relies on interacting with origin servers and therefore will not work on archived web pages:
- Most non-functional JavaScript code is compartmentalized in a few files and is not included in all JavaScript files.
- The execution of third-party scripts will not work when replaying archived web pages.
To identify non-functional JavaScript, they created filter lists, instead of using complex code analysis. Their filter lists contain rules that were created based on manual analysis of the scripts from their corpus.
Every rule was based on domain, file name, or URL token:
- For domain rules, they would not archive URLs from domains associated with a third-party service.
- For file name rules, they would identify files like “jquery.cookie.js” (which is used for cookie management) from any domain and not archive them.
- For URL token rules, if a keyword such as “recaptcha” was found in the URL, they would not archive the resource.
The filter lists can be used during crawl time to exclude JavaScript files that are not needed. They removed third-party scripts whose absence would not prevent post-load interactions from working during replay time. They also removed scripts that were on EasyList, which is an ad blocking filter list.
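As an illustration of how such a filter list could be applied during crawl time, here is a minimal sketch. The specific rule entries and the should_archive helper are hypothetical examples, not Jawa's actual filter lists or rule syntax.

```python
from urllib.parse import urlparse

# Illustrative rules for each of the three rule types; these entries
# are examples, not Jawa's actual filter lists.
DOMAIN_RULES = {"tracker.example.com"}      # exclude everything from these domains
FILENAME_RULES = {"jquery.cookie.js"}       # exclude these file names from any domain
URL_TOKEN_RULES = {"recaptcha"}             # exclude URLs containing these keywords

def should_archive(url: str) -> bool:
    """Return False if any domain, file name, or URL token rule matches."""
    parsed = urlparse(url)
    if parsed.hostname in DOMAIN_RULES:
        return False
    filename = parsed.path.rsplit("/", 1)[-1]
    if filename in FILENAME_RULES:
        return False
    if any(token in url.lower() for token in URL_TOKEN_RULES):
        return False
    return True

print(should_archive("https://cdn.example.com/js/jquery.cookie.js"))  # False
print(should_archive("https://www.google.com/recaptcha/api.js"))      # False
print(should_archive("https://www.example.com/js/app.js"))            # True
```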
They checked whether the code removed by Jawa visually or functionally impacted the replay of archived web pages. For visual impact, they checked whether the archived web page looks the same with or without filtering by comparing screenshots of each web page archived by Jawa with filtering and without filtering. They then manually inspected the web pages that had differing pixel values and found only insignificant differences, such as different timestamp information on the web page and different animations caused by JavaScript’s Date, Random, and Performance (DRP) APIs.
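The paper does not specify the tooling used for the screenshot comparison, but the pixel-level check could look something like this minimal sketch using Pillow (the file names are placeholders):

```python
from PIL import Image, ImageChops

def screenshots_differ(path_with_filtering: str, path_without_filtering: str) -> bool:
    """Return True if the two screenshots have any differing pixels."""
    img_a = Image.open(path_with_filtering).convert("RGB")
    img_b = Image.open(path_without_filtering).convert("RGB")
    if img_a.size != img_b.size:
        return True
    diff = ImageChops.difference(img_a, img_b)
    return diff.getbbox() is not None   # None means the images are identical

# Pages flagged here would then be inspected manually, as the authors did,
# to decide whether the differences are significant.
if screenshots_differ("page_with_filtering.png", "page_without_filtering.png"):
    print("Flag page for manual inspection")
```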
For functional impact, they checked whether the post-load interactions would work on the archived web page. They found that removing the files that matched their filter lists did not negatively impact the navigational and informational interactions.
Garg et al. (“Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests”) identified cases where archived web pages that require regular updates (e.g., news updates, new tweets, and sports score updates) would repeatedly make failed requests to an origin server during replay time, which resulted in unnecessary traffic. Jawa could resolve this problem because it removes code that communicates with an origin server, which should reduce the number of failed requests that occur during replay for these types of archived web pages.
Removing Unreachable Code
Jawa also removes unreachable code that will never be executed during replay. This unreachable code is associated with sources of non-determinism that are absent during replay time and with sources of non-determinism caused by asynchronous execution and APIs for client characteristics. The code that is executed when an event handler is invoked can differ depending on the order in which the user interacts with elements on the page (Figure 7), the inputs the user provides to the events, and the values returned by the browser’s APIs:
- Order of user interaction: They focused on read-write dependencies between event handlers and found that the handlers with these dependencies would not impact the replay, since these events were used for user analytics that track user interaction (a minimal sketch of this kind of dependency check appears after Figure 7).
- User input: None of the event handlers that work at replay time would read inputs that impact which code gets executed.
- Browser APIs: Jawa removes APIs for client characteristics so only DRP APIs would be executed during replay time. When they checked the web pages in their corpus, the DRP APIs did not impact the reachable code for any event handler.
Figure 7: Code executed depends on the order of user interactions. Image from https://www.usenix.org/sites/default/files/conference/protected-files/osdi22_slides_goel.pdf
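As a rough illustration of the read-write dependency check described above, the sketch below flags pairs of event handlers where one handler writes state that another reads, so interaction order could matter. The handler names and read/write sets are hypothetical; in Jawa this information would come from analyzing the handlers' code.

```python
from itertools import combinations

# Hypothetical read/write sets of global state for each registered event handler.
handlers = {
    "onScrollAnalytics": {"reads": {"scrollDepth"}, "writes": {"scrollDepth"}},
    "onClickMenu":       {"reads": {"menuOpen"},    "writes": {"menuOpen"}},
    "onResizeLayout":    {"reads": {"viewport"},    "writes": {"layoutCache"}},
}

def order_sensitive_pairs(handlers: dict) -> list:
    """Pairs of handlers where one writes state that the other reads,
    so the order in which the user triggers them could change what executes."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(handlers.items(), 2):
        if a["writes"] & b["reads"] or b["writes"] & a["reads"]:
            pairs.append((name_a, name_b))
    return pairs

print(order_sensitive_pairs(handlers))   # [] -> interaction order does not matter here
```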
For the resources not filtered out by their filter lists, Jawa injects code (Figure 8) to record which code was executed and in what order during the page load, and then triggers every registered event handler using default input values and records the code that those handlers execute. The code that was executed gets stored. They then ensure that during replay the browser will follow the same execution schedule and use the same client characteristics (a simplified sketch of this instrumentation idea follows Figure 8).
Figure 8: Their process for identifying which parts of the remaining JavaScript files (that were not filtered out) need to be archived. Image from https://www.usenix.org/sites/default/files/conference/protected-files/osdi22_slides_goel.pdf
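Jawa's instrumentation is built into its crawler, but the general idea of recording registered event handlers during a page load and then triggering them with default inputs can be sketched with a browser automation tool such as Playwright (an assumption on my part; the URL is a placeholder, and this simplified version only records handlers, not the full execution schedule):

```python
from playwright.sync_api import sync_playwright

# Init script (JavaScript) that records which event handlers get registered.
# This is only a sketch of the idea; Jawa's actual instrumentation also records
# which code runs and in what order so the replay can follow the same schedule.
RECORD_HANDLERS_JS = """
window.__registeredHandlers = [];
const origAdd = EventTarget.prototype.addEventListener;
EventTarget.prototype.addEventListener = function (type, listener, options) {
    window.__registeredHandlers.push({ target: this, type });
    return origAdd.call(this, type, listener, options);
};
"""

# Trigger every registered handler once with a default (empty) event so the
# crawler can observe the code that each handler executes.
TRIGGER_HANDLERS_JS = """() => {
    const fired = [];
    for (const { target, type } of window.__registeredHandlers) {
        target.dispatchEvent(new Event(type, { bubbles: true }));
        fired.push(type);
    }
    return fired;
}"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.add_init_script(RECORD_HANDLERS_JS)
    page.goto("https://www.example.com")       # placeholder URL
    triggered = page.evaluate(TRIGGER_HANDLERS_JS)
    print("Triggered handler types:", triggered)
    browser.close()
```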
Utilizing URL Matching Algorithms to Handle Non-Determinism
Jawa’s replay system uses two URL matching algorithms, querystrip and fuzzy matching, to match dynamically generated URLs with crawled URLs. Querystrip removes the query string from a URL before initiating a match. This approach can help with cases where the query string is updated for a resource based on the server-side state. Figures 9 and 10 show an example where querystrip would be useful. We identified a replay problem (that resulted in a failed request for most replay systems except for ReplayWeb.page) with Amazon ad iframes that used a random value in the query string. If the query string is removed from this URL and a search is performed for the base URL in the WACZ file, then the URL that was dynamically generated during replay could be matched with the URL that was crawled (a minimal sketch of this idea appears after Figure 10).
Figure 9: Example URI for an Amazon ad iframe. The rnd parameter in the query string contains a random value that is dynamically generated when loading an ad.
Figure 10: When replaying an Amazon ad iframe, the rnd parameter is not the same as the original value that is in the URI-R. Even though an incorrect URI-M is generated, ReplayWeb.page is able to load the ad. WACZ | URI-R: https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=...
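A minimal sketch of the querystrip idea, reusing the illustrative example.com URLs from earlier (this is not the authors' implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def querystrip(url: str) -> str:
    """Remove the query string (and fragment) from a URL before matching."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# The rnd value differs between crawl and replay, but the stripped URLs match.
crawled  = "https://www.example.com/?rnd=4734"
replayed = "https://www.example.com/?rnd=7765"
print(querystrip(crawled) == querystrip(replayed))   # True
```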
Goel et al.’s fuzzy matching approach used Levenshtein distance to find the best match for a requested URL. An example of fuzzy matching in an existing replay system is pywb’s rules.yaml and the fuzzymatcher.py script that uses these rules (a minimal sketch of the core Levenshtein comparison appears after Figure 11). According to their presentation (https://youtu.be/WdxWpGJ-gUs?t=931), Jawa eliminated failed network fetches on around 95% of the pages from their corpus of 3,000 web pages (Figure 11). Their paper reported that failed network fetches were eliminated on 99% of the pages, but the more recent slides list 95%.
Figure 11: Failed fetches that occurred during replay of their corpus of 3,000 web pages. Figure from https://www.usenix.org/sites/default/files/conference/protected-files/osdi22_slides_goel.pdf
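A minimal sketch of edit-distance-based URL matching is shown below; Jawa's and pywb's actual fuzzy matching involve additional rules (e.g., restricting the candidate set), so this only illustrates the core Levenshtein comparison:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def fuzzy_match(requested_url: str, crawled_urls: list) -> str:
    """Return the crawled URL with the smallest edit distance to the request."""
    return min(crawled_urls, key=lambda u: levenshtein(requested_url, u))

crawled_urls = ["https://www.example.com/?rnd=4734",
                "https://www.example.com/other.js"]
print(fuzzy_match("https://www.example.com/?rnd=7765", crawled_urls))
# -> https://www.example.com/?rnd=4734
```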
This group’s continued work (“Detecting and Diagnosing Errors in Replaying Archived Web Pages”) involves identifying URL rewriting problems caused by JavaScript that impact the quality of an archived web page during replay. Their goal is to create a new approach for verifying the quality of an archived web page that is better than comparing screenshots and viewing failed requests. Their approach involves capturing (during crawl and replay time) each visible element in the DOM tree, the location and dimensions of the elements, and the JavaScript that produces visible effects. Their approach reduced false positives when detecting low fidelity during replay compared to using only screenshots and runtime and fetch errors.
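A minimal sketch of how such crawl-time and replay-time captures could be compared is shown below; the element records and the fidelity_issues helper are hypothetical and greatly simplified relative to their approach:

```python
# Hypothetical records of visible elements captured at crawl and replay time:
# selector -> (x, y, width, height). The real approach also tracks JavaScript
# that produces visible effects.
crawl_elements = {
    "div#header":    (0, 0, 1280, 90),
    "img.ad-banner": (10, 100, 728, 90),
}
replay_elements = {
    "div#header":    (0, 0, 1280, 90),
    # img.ad-banner is missing at replay -> potential fidelity problem
}

def fidelity_issues(crawl: dict, replay: dict) -> list:
    """Report elements that are missing or whose position/size changed at replay."""
    issues = []
    for selector, box in crawl.items():
        if selector not in replay:
            issues.append(f"missing at replay: {selector}")
        elif replay[selector] != box:
            issues.append(f"moved/resized at replay: {selector}")
    return issues

print(fidelity_issues(crawl_elements, replay_elements))
# -> ['missing at replay: img.ad-banner']
```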
Summary
The sources of non-determinism that cause problems with replaying archived web pages are server-side state, client-side state, client characteristics, and JavaScript's Date, Random, and Performance APIs. When non-determinism caused variance in a dynamically generated URL during replay, they used two URL matching algorithms, querystrip and fuzzy matching, to match the requested URL with a crawled URL. These URL matching algorithms can reduce the number of failed requests and could resolve replay problems associated with random values in dynamically generated URLs, which is a problem we encountered during the Saving Ads project while replaying ads.
--Travis Reid (@TReid803)
Goel, A., Zhu, J., Netravali, R., and Madhyastha, H. V. "Jawa: Web Archival in the Era of JavaScript". In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), Jul. 2022 (California, USA), pp. 805–820.