Channel: Web Science and Digital Libraries Research Group

2024-01-19: Paper Summary of "Overcoming Barriers to Information Exchange on the Web" by Ayush Goel

Introduction

    Dr. Ayush Goel is a recent Ph.D. graduate from the University of Michigan’s Systems Lab, where he studied under Dr. Harsha V. Madhyastha and Dr. Ravi Netravali; he is now a Research Scientist at Hewlett-Packard Labs. While the UMich Systems Lab does not focus on Web archiving per se, much of their work overlaps with our own here at the Web Science and Digital Libraries group. I first became aware of Dr. Goel through his earlier publication, “Jawa: Web Archival in the Era of JavaScript”, which WS-DL alumna Emily Escamilla mentioned in her 2022 IIPC Web Archiving Conference trip report and again in her 2023 IIPC WAC trip report. The Jawa project is an impressive accomplishment on its own, but it is only one piece of Dr. Goel’s body of work. Because it bears on my own research and has long-lasting implications for the Web archiving community at large, I’d like to review and share some highlights from Dr. Goel’s dissertation, “Overcoming Barriers to Information Exchange on the Web”.
    Dr. Goel’s dissertation is summed up in his thesis statement: “It is practical to speed up web page loads both for individual users and web crawlers and enable efficient web archiving without any loss in fidelity by leveraging runtime behaviors of web computations which can be accurately and efficiently extracted with the help of fine-grained program analysis.” Throughout, he touches on a wide range of topics: Internet networking, server-client communication in the Web browser, the technical intricacies of how the browser handles JavaScript code, Web crawling and archiving, and mobile computation. Dr. Goel states that there are about four and a half billion unique interactions across more than six billion Web pages on the World Wide Web, and that this figure is only expected to increase. While network conditions have improved, the increasing weight of JavaScript usage and computation continues to hamper performance for browsers and Web applications.

JavaScript Analysis Engine

According to Dr. Goel, JavaScript accounts for 49% of the bytes of all stored Web data. For large archival institutions, such as the Internet Archive, this represents a significant portion of their operating costs. Modern browsers and developer tooling offer some insight into this utilization but, as Dr. Goel points out, only after the page has loaded. To this end, Dr. Goel wrote custom software capable of analyzing JavaScript during page runtime to determine the properties and outputs of the JavaScript code run on crawled pages. He also makes an interesting observation about how instrumentation code is injected: it can be added either by modifying the browser itself or by integrating it with the page's own JavaScript, and the latter approach not only avoids browser recompilation and is easier to work with, but is also better at avoiding detection. The analysis engine operates both statically and dynamically: it can flexibly inject static analysis code using a number of methods, and it injects the runtime analysis code early in the page load so that it runs alongside the rest of the JavaScript being compiled. To “instrument” the code, Dr. Goel uses some clever function wrappers that track JavaScript variable accesses and allow test input values to be written to them.
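To make the wrapping idea concrete, here is a minimal sketch (my own illustration, not Dr. Goel's actual engine) of page-injected instrumentation: wrapping an object in a JavaScript `Proxy` logs every property read and write, much like the variable-access tracking described above.

```javascript
// Record of every observed variable access.
const accessLog = [];

// Wrap an object so all property reads and writes are logged.
function instrument(target, name) {
  return new Proxy(target, {
    get(obj, prop) {
      accessLog.push({ object: name, prop: String(prop), kind: "read" });
      return obj[prop];
    },
    set(obj, prop, value) {
      accessLog.push({ object: name, prop: String(prop), kind: "write" });
      obj[prop] = value;
      return true;
    },
  });
}

// Hypothetical page state being tracked:
const state = instrument({ counter: 0 }, "state");
state.counter = state.counter + 1; // one read, then one write

console.log(accessLog.length); // → 2
```

A real engine would also need to wrap function calls and inject itself before any page script runs, but the same interception principle applies.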
    Dynamically tracking JavaScript code is tricky. I have used utilities such as Chrome DevTools to retrieve a byte-level coverage map, but it does not contain all of the contextual detail and is only available after the page has loaded. For instance, pinpointing precisely which JavaScript function calls trigger the network load of a given resource URI cannot necessarily be done with the browser's native toolset. Dr. Goel lists further complications in tracking JavaScript utilization at runtime, such as JavaScript’s lack of static typing, the lack of proper serialization support in native JavaScript for extracting complex data, the mapping of page events, and the modification of page variables, all of which required the development of custom code. Dr. Goel describes the utilities of his analytical engine as the basis for the optimizations presented in the rest of his dissertation.

Improving Page Loads

In the fourth chapter of his dissertation, Dr. Goel presents some highly interesting facts regarding Web performance, particularly for mobile devices. He posits that most optimization in this space has focused on reducing the computation required of the browser, either by offloading computations to powerful proxy servers or by using server-side rendering techniques. Dr. Goel argues that these computational optimizations should instead shift toward client-side devices, investigating how much the required computation can be reduced locally on the device.
    Dr. Goel’s first method for improving local computation is an in-browser caching mechanism that reuses computations, a process called memoization. By memoizing JavaScript functions, the result of an expensive function call can be produced again at a fraction of the original effort, or even returned directly, akin to a key-value store lookup. This is a common technique for network requests, but it is rarely applied to JavaScript computations. Using his memoization implementation, Dr. Goel estimates that he may be able to eliminate nearly half of page computations on average! He further observes that on computationally intensive pages, fewer than 30% of JavaScript functions drive 80% of the total execution time, so the system can efficiently cache only what it needs to.
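The core idea can be sketched in a few lines. This is a textbook memoization wrapper of my own, purely illustrative; Dr. Goel's system memoizes actual page JavaScript functions rather than a hand-picked helper like this one.

```javascript
// Wrap a deterministic function so repeat calls with the same
// arguments become cache lookups instead of recomputations.
function memoize(fn) {
  const cache = new Map();
  return function (...args) {
    const key = JSON.stringify(args); // simple argument fingerprint
    if (cache.has(key)) return cache.get(key); // cache hit: no recompute
    const result = fn(...args);
    cache.set(key, result);
    return result;
  };
}

// Hypothetical expensive computation, with a call counter:
let calls = 0;
const slowSquare = memoize((n) => {
  calls++;
  return n * n;
});

slowSquare(12); // computed
slowSquare(12); // served from cache
console.log(calls); // → 1
```

Note that this only works for deterministic functions with serializable arguments, which is exactly why the determinism analysis described below in the caching discussion matters.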
    Dr. Goel’s findings on mobile performance, especially, are a bit staggering. JavaScript computation alone can already push page loads outside the boundaries of user comfort, before even accounting for poor network connectivity or latency. Citing numerous recent studies, he shows that network conditions have generally improved to the point that they have taken a back seat to the performance cost of client-side computation. The crux of the issue lies in the average mobile browser’s continued inability to take advantage of all available CPU cores, despite the growing number of cores per device. He finds that, for his sampled dataset, single-threaded JavaScript execution consumes a 65% majority of page load time on mobile devices!
    There are huge benefits available in client-side caching that are not immediately obvious. Dr. Goel describes how there is minimal opportunity for savings by naively reusing cached data between page loads, but that, even on an hourly scale, only a quarter of Web pages dramatically alter their JavaScript state; 89% of the loaded JavaScript code and 86% of the heap state remain identical and available for caching. Caching is applied at the function level: first, non-deterministic functions are identified and excluded from the cache; then, the inputs and outputs of the remaining functions are recorded across multiple page loads to determine whether they can be cached. Dr. Goel was able to track roughly half of the pages independently across each of his data sets, finding that, of those successfully tracked, 76% of JavaScript can be reused in minute-to-minute crawls, while roughly 57% can be reused in day-to-day crawls, an amount that jumped to 76% on computationally intensive pages. He similarly finds that for a data set of Web pages from the DMOZ index, about 70% of JavaScript can be reused in day-to-day crawls. Across pages, computational reuse drops to around 15% for pages crawled at a daily rate, though this is still a rather significant amount of computation that can be saved.

Fig. 2: "Figure 4.5" from "Overcoming Barriers to Information Exchange on the Web"
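The determinism filter described above can be approximated very simply: run a function more than once on identical inputs, and exclude it from the cache if its outputs differ. The helper below is hypothetical and deliberately naive (a real system would also track reads of global and DOM state, not just return values).

```javascript
// Naive determinism probe: a function whose output changes across
// identical invocations is non-deterministic and must not be cached.
function isCacheable(fn, sampleArgs) {
  const first = JSON.stringify(fn(...sampleArgs));
  const second = JSON.stringify(fn(...sampleArgs));
  return first === second;
}

const pureAdd = (a, b) => a + b;          // deterministic: cacheable
const noisy = () => Math.random();        // non-deterministic: excluded

console.log(isCacheable(pureAdd, [2, 3])); // → true
console.log(isCacheable(noisy, []));       // false (with overwhelming probability)
```

Functions like `Date.now()`, `Math.random()`, and anything touching the network are the classic exclusions; everything else becomes a candidate for input/output recording across loads.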


A second route to saving computational resources is parallelizing computational tasks to improve overall efficiency and reduce execution time. Most browsers execute JavaScript on a single thread, wasting the capability of the multi-core processors now common in mobile devices. As Dr. Goel describes, though, this is not really the browser's fault, as many browsers have provided multi-threading facilities for years. The real issue lies in the lack of adoption by Web developers, who rarely code their sites to take advantage of multi-core architectures (a problem shared by many desktop applications as well). In his dissertation, Dr. Goel explains that he used his analytical engine only for measurement and research, but that it was further developed into a tool named Horcrux. Horcrux, originally presented at OSDI 2021, is a Web optimization system that was able to speed up JavaScript execution by 80%, resulting in an overall 40% decrease in page load time on an 8-core mobile device! For more information on this area of Dr. Goel's work, please refer to his publication, "Rethinking Client-Side Caching for the Mobile Web".

Sprinter

Sprinter was developed as a hybrid Web crawler which efficiently crawls the Web by first sampling pages in order to discern common computations that can be extrapolated to other pages. The pages in this sampled subset are crawled with high fidelity in order to capture page events, code executions, and other important details. Sprinter then uses a much lighter weight method to capture the remaining pages, utilizing the sampled insights to augment them in order to preserve their original fidelity.
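The two-phase strategy can be sketched roughly as follows. Every name here is hypothetical (`crawlWithBrowser`, `fetchStatic`, and the intersection-based notion of "shared" resources are my simplifications of what Sprinter actually computes).

```javascript
// Hybrid crawl sketch: full-browser crawl of a sample, then cheap
// static fetches of the rest, augmented with state shared site-wide.
async function hybridCrawl(pages, { sampleSize, crawlWithBrowser, fetchStatic }) {
  const sample = pages.slice(0, sampleSize);
  const rest = pages.slice(sampleSize);

  // Phase 1: high-fidelity crawl of the sampled subset.
  const sampled = await Promise.all(sample.map(crawlWithBrowser));

  // Resources observed on every sampled page (e.g. common scripts).
  const shared = sampled.length
    ? sampled.reduce(
        (acc, page) => acc.filter((r) => page.resources.includes(r)),
        sampled[0].resources
      )
    : [];

  // Phase 2: lightweight fetch of remaining pages, patched with the
  // shared state so their recorded fidelity is preserved.
  const light = await Promise.all(
    rest.map(async (url) => ({
      url,
      resources: [...shared, ...(await fetchStatic(url)).resources],
    }))
  );

  return [
    ...sampled.map((p) => ({ url: p.url, resources: p.resources })),
    ...light,
  ];
}
```

The payoff is that the expensive browser is only paid for on the sample, while the long tail of a site's pages rides on the shared observations.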

Fig. 3: "Figure 5.1" from "Overcoming Barriers to Information Exchange on the Web"

    As seen in the original Figure 5.1, static crawls are, by comparison, an order of magnitude faster, though they sacrifice a massive amount of page fidelity to attain that speed. Sprinter is not a silver bullet, but it does offer a large leap forward in speed and overall quality. A large part of this is due to its management of crawled pages and its observation of potential reuse across multiple pages of the same Web site. By combining the best of both worlds with its enhanced crawling techniques, Sprinter was able to crawl a set of 50,000 pages at five times the speed of traditional, browser-based crawls while maintaining a comparable level of page fidelity. Upon re-crawling the same corpus a week later, Dr. Goel reports a further 78% improvement over the original crawl results. For more information, you can review his published paper "Sprinter: Speeding up High-Fidelity Crawling of the Modern Web".

Jawa

Fig. 4: Excerpt from Ayush Goel's Jawa presentation poster

    Jawa, or "JavaScript-Aware Web Archive", is a new Web archive design that takes advantage of Dr. Goel’s research and potentially offers huge savings for the Web archiving community. By applying the JavaScript analysis and pruning sections of unused and unreachable code, significant storage savings can be gained for archived Web pages at scale. By eliminating sources of non-determinism that cause issues when replaying archived pages, and by altering the execution of JavaScript that is incompatible with Web archive replay, the majority of archival fidelity issues can also be overcome. In many ways this can mean less code in the final archive, but it can also mean increased overhead for certain pages, as extra data must be stored to cover potential page states. One example is shown below in Dr. Goel’s Figure 6.5, where heap data is stored for both potential page states, as found by analyzing the page's event handlers.

Fig. 5: "Figure 6.5" from "Overcoming Barriers to Information Exchange on the Web"
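The pruning idea can be illustrated with runtime coverage: record which functions actually execute during the crawl, then drop the rest from the archived bundle. This is my own simplified sketch; Jawa's real analysis also reasons about reachability across all of a page's event handlers before removing code.

```javascript
// Runtime coverage: record which bundle functions actually executed.
const executed = new Set();

function track(name, fn) {
  return (...args) => {
    executed.add(name); // this function was reached during the crawl
    return fn(...args);
  };
}

// Hypothetical page bundle with two functions:
const bundle = {
  render: track("render", () => "rendered"),
  analytics: track("analytics", () => "beacon sent"),
};

// During the crawl, only render() fires:
bundle.render();

// Prune: keep only the functions observed to execute.
const kept = Object.keys(bundle).filter((name) => executed.has(name));
console.log(kept); // → ["render"]
```

Coverage alone would over-prune (a handler might only fire on a rare interaction), which is why Jawa's event-handler analysis of potential page states, as in Figure 6.5, is the harder and more important half of the design.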

    Dr. Goel examined a data set of 300 pages from the Internet Archive and found that JavaScript accounted for 44% of the bytes used by these pages in 2020, more than double the 20% for the same pages in 2000. With this increased usage comes an increased chance of performance, infrastructure, and fidelity issues when archiving. Jawa uses a sophisticated NodeJS crawler, leveraging the techniques reviewed above to efficiently remove unnecessary code and map the event handlers on a page. On a corpus of one million archived pages, Jawa’s crawling throughput increased by 39%, and it was able to eliminate 84% of stored JavaScript while maintaining fidelity. Something pertinent to my own research is the observation that, of the corpus of one million archived pages, more than 10% exhibited page damage, measured as failed resource requests, at or above 25%. By comparison, Jawa’s design manages to match 99% of the crawled resources during archival replay. While Jawa outputs WARC files, they use a modified format that institutions would need to adopt in order to see the benefit. Given Dr. Goel’s research, this transition just might be worth it, as overall storage requirements could be reduced by 41% compared with traditional Web archiving formats. For any institution managing a large corpus of Web archives, this is a huge opportunity for potential cost savings.

Conclusion

Dr. Goel’s research has far-reaching implications for Web archivists and for how Web developers code their sites. From huge cost savings for institutions to efficient JavaScript processing in Web browsers, Dr. Goel’s dissertation and related work offer a lot of potential for the Web community. Particularly for mobile devices, which have only grown as the primary gateway to the Web, it is surprising that multi-core processing is still not handled better. The research performed here could yield huge savings for the mobile Web experience and even for device longevity, given that the browser and network traffic account for a large portion of mobile device utilization. I am looking forward to trying out some of these tools myself and seeing where they can be taken.


References

Goel, A., Ruamviboonsuk, V., Netravali, R., & Madhyastha, H. V. (2021, February). Rethinking client-side caching for the mobile web. In Proceedings of the 22nd International Workshop on Mobile Computing Systems and Applications (pp. 112-118).

Goel, A., Zhu, J., Netravali, R., & Madhyastha, H. V. (2022). Jawa: Web Archival in the Era of JavaScript. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (pp. 805-820).

Goel, A., Zhu, J., Netravali, R., & Madhyastha, H. V. (2024). Sprinter: Speeding Up High-Fidelity Crawling of the Modern Web. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24).

Mardani, S., Goel, A., Ko, R., Madhyastha, H. V., & Netravali, R. (2021). Horcrux: Automatic JavaScript Parallelism for Resource-Efficient Web Computation. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21).

