HTML Search and Record | Ben Pettis
Ben Pettis

HTML Search and Record

May 20, 2022

A cartoon image of an open file folder with the Google reCAPTCHA logo appearing out of it

View code on GitHub

Niels Brügger has described five “web strata,” or scopes at which the Web can be treated as an object of analysis.1 These strata are:

  • the web element
  • the web page
  • the website
  • the web sphere
  • the web as a whole

Tools such as the Internet Archive’s Wayback Machine and other web scrapers are good at targeting the web page or the website, but preserving individual web elements, along with how users encounter them in their original contexts, is more challenging. I have developed this Chrome extension as a versatile tool to detect and record any arbitrary HTML element.

In my own work, I am using this for the study of Google’s reCAPTCHA. A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is commonly found on online forms throughout the Web, where it is used to minimize automated spam submissions and other supposedly non-genuine interactions.2 Google’s reCAPTCHA is a multi-faceted site of production where users perform labor by providing valuable AI/ML training data. Following Maurizio Lazzarato’s definition of immaterial labor, I view the reCAPTCHA as an interface element that contributes to the production of the informational and cultural content of a given website.

One approach to preserving reCAPTCHAs might treat each one as an individual fragment of a web page and attempt to isolate it from its context. Though I am interested in the reCAPTCHA element specifically, I also want to preserve the entire user experience of solving the reCAPTCHA. Accordingly, I follow Niels Brügger’s proposed method of the screen movie: recording the screen as a user navigates a website, which captures the context of the reCAPTCHA along with how long the user takes to solve it.

The extension works by monitoring the current DOM for a specified HTML element with a specific attribute, and counting the number of appearances. In the case of reCAPTCHA, I search for any <iframe> element with the attribute title="recaptcha". The user can choose to receive a browser alert when a match is found, which prompts them to begin a screen recording that is saved locally to their computer. As my project progresses, I want to develop a method for users to easily preview these recordings and decide whether to share them with a researcher. This method preserves individual autonomy and control over privacy while also facilitating researchers' ability to preserve user interactions with specific Web elements.
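The detection step described above can be sketched in a few lines of JavaScript. This is a minimal illustration and not the extension's actual source: the function names buildSelector, countMatches, and watchForElement are hypothetical, and the sketch assumes it runs in a page context (for example, a content script) where MutationObserver is available.

```javascript
// Hypothetical helper: build a CSS attribute selector such as iframe[title="recaptcha"].
function buildSelector(tagName, attrName, attrValue) {
  // Escape backslashes and double quotes so the value stays inside the quoted string.
  const escaped = String(attrValue).replace(/["\\]/g, "\\$&");
  return `${tagName}[${attrName}="${escaped}"]`;
}

// Hypothetical helper: count matching elements under any root that
// supports querySelectorAll (normally the page's document).
function countMatches(root, tagName, attrName, attrValue) {
  return root.querySelectorAll(buildSelector(tagName, attrName, attrValue)).length;
}

// Hypothetical helper: re-count whenever the DOM changes and call
// onChange(count) each time the number of matches changes.
function watchForElement(root, tagName, attrName, attrValue, onChange) {
  let lastCount = countMatches(root, tagName, attrName, attrValue);
  if (lastCount > 0) onChange(lastCount);

  const observer = new MutationObserver(() => {
    const count = countMatches(root, tagName, attrName, attrValue);
    if (count !== lastCount) {
      lastCount = count;
      onChange(count);
    }
  });

  // Watch for added or removed nodes and for changes to the attribute of interest.
  observer.observe(root, {
    childList: true,
    subtree: true,
    attributes: true,
    attributeFilter: [attrName],
  });
  return observer; // call observer.disconnect() to stop watching
}
```

In a page, this might be started with watchForElement(document, "iframe", "title", "recaptcha", (n) => console.log(n + " match(es)")). The attribute selector matches the value exactly, so a researcher targeting a different element would change only the three string arguments.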

There are six sub-components in the image, arranged in two rows of three items each. The first item in the top left is a screenshot of a reCAPTCHA challenge with several photos of bicycles. The second item in the top center is my custom icon: the reCAPTCHA logo appearing from a folder. The third item in the top right is a fake reCAPTCHA challenge with a painting of a ship and the challenge to "select all squares with the ship of Theseus." The fourth item at the bottom left has a large number 1 and the text "search DOM for specified HTML element + attribute." The fifth item at bottom center has a large number 2 and the text "Alert user and present options." The sixth item at bottom right has a large number 3 and the text "Record a .mp4 or .webm video & save locally."
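The third step in the figure, recording a video and saving it locally, could be approached with the browser's standard screen-capture and MediaRecorder APIs. The sketch below is an assumption about one way to do this, not the extension's own code: pickMimeType, recordingFilename, and recordScreen are hypothetical names, and recordScreen assumes it is called from a page in response to a user action, since browsers only show the screen-sharing prompt after a user gesture.

```javascript
// Hypothetical helper: pick the first container format the browser can record.
function pickMimeType(candidates, isSupported) {
  return candidates.find((type) => isSupported(type)) || "";
}

// Hypothetical helper: build a timestamped file name,
// e.g. recaptcha-2022-05-20T14-03-00.webm
function recordingFilename(prefix, date, mimeType) {
  const ext = mimeType.startsWith("video/mp4") ? "mp4" : "webm";
  const stamp = date.toISOString().slice(0, 19).replace(/:/g, "-");
  return `${prefix}-${stamp}.${ext}`;
}

// Ask the user to share their screen, record until sharing ends,
// then offer the recording as a local download.
async function recordScreen(prefix) {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const mimeType = pickMimeType(["video/webm", "video/mp4"], (type) =>
    MediaRecorder.isTypeSupported(type)
  );
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks = [];

  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };

  recorder.onstop = () => {
    // Nothing is uploaded: the Blob is handed to the browser's download flow.
    const blob = new Blob(chunks, { type: recorder.mimeType });
    const url = URL.createObjectURL(blob);
    const link = document.createElement("a");
    link.href = url;
    link.download = recordingFilename(prefix, new Date(), recorder.mimeType);
    link.click();
    setTimeout(() => URL.revokeObjectURL(url), 1000);
  };

  // Stop recording when the user ends screen sharing from the browser UI.
  stream.getVideoTracks()[0].addEventListener("ended", () => {
    if (recorder.state !== "inactive") recorder.stop();
  });

  recorder.start();
  return recorder;
}
```

Because the file is produced entirely in the browser and handed to the download flow, the recording stays on the user's computer unless they later choose to share it.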

I've released a version of the extension, along with its source code, under an MIT License so that other researchers can use this approach in their own work. This version only saves videos locally, but it retains the ability to specify any arbitrary HTML element and attribute. Because the extension functions by detecting a specified HTML element, it can easily be customized to search for and record interactions with whatever Web content a researcher is interested in, and it can serve as a model for other studies of online spaces and Web histories.
