Wrangling Date Metadata in the Media History Digital Library
2024-07-31
I wrote this blog post for the Wisconsin Center for Film and Theater Research website to describe some of the ongoing "behind-the-scenes" work that I've been doing to keep the Media History Digital Library running as people expect. Sometimes that means working on implementing specific features or optimizing server configurations, but as this example below lays out, it also sometimes entails chasing down obscure issues with how certain metadata is stored.
You can read the original blog post on the WCFTR website here.
---
For over a decade the Media History Digital Library (MHDL) has supported film and media studies by providing online access to trade papers, fan magazines, and other primary source materials. Lantern, the search platform for the MHDL, provides full-text search for millions of pages within the collections. These resources are popular because users can instantly access thousands of volumes without relying on incomplete microfilm facsimiles or having to face the logistics of locating and accessing physical copies.
I have worked as one of the main MHDL developers for the past 4+ years. During that time, I have seen first-hand the various interconnected systems and tools, as well as the continual “behind the scenes” work, that keep things running smoothly. This work is not always visible or apparent to users, so what I’d like to do with this blog post is share what some of this work has involved.
There’s a saying among developers: the two most challenging things in computer science are:
- Cache Invalidation
- Naming Things
- Off-by-one errors
Of course, if you talk to an individual programmer about their projects, you will quickly learn of many additional problems and frustrations that they would like to add to this list. A clear candidate for another addition is formatting and sorting dates and times. Across various coding languages, developers find dealing with dates and times to be incredibly challenging due to the complexities and inconsistencies inherent in timekeeping systems. In JavaScript, for example, the Date object can be notoriously tricky, with issues like zero-based months and inconsistent handling of time zones. Python developers often grapple with the datetime module, where converting between time zones or handling daylight saving time can lead to unexpected bugs. In C++, the standard library’s std::chrono can be cumbersome, requiring careful management of different time units and formats. These challenges are compounded by the need to account for leap years, varying month lengths, and internationalization, making date and time manipulation a common source of frustration across programming languages.
In 2022, we launched Lantern 2.0 – a major upgrade to the website where we rolled out a fresh interface design along with several behind-the-scenes changes. As part of that process, we discovered that the date metadata for large portions of the collection was inaccurate. A team of hard-working graduate students identified numerous publications with inaccurate metadata and corrected the information by hand. As part of the Lantern 2.0 development, I wrote a series of “helper functions,” or small snippets of code, which retrieve date information from various locations and format it so that it can be displayed within the Lantern interface.
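```ruby
module DatestringHelper
  # Returns the volume label for display. If the label does not already
  # contain a four-digit year, the human-readable dateString is appended
  # in square brackets.
  def append_datestring(item)
    # Drop the first two and last two characters of the stringified value
    # (presumably the '["' and '"]' left by calling to_s on a one-element array).
    volume = item['volume'].to_s[2..-3]
    if volume.match?(/\d{4}/)
      volume
    else
      "#{volume} [#{item['dateString']}]"
    end
  end
end
```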
Along with the tireless work of metadata cleaning, the Lantern 2.0 upgrade ensured that all items had date information displayed, which is crucial for researchers who rely on the MHDL for primary sources.
The team thought that we had seen the end of our date-related data woes, but unfortunately, they keep popping back up. Earlier this year, we ran into another problem that ended up being a result of how Lantern handles date-related metadata. A handful of users reported to us that certain issues of a publication weren’t appearing in Lantern. When they searched for “Motography” in the main search bar, numerous pages appeared, exactly as expected. However, when using the Advanced Search field, entering “Motography” and the year 1917 returned no results. This was strange, because the MHDL has multiple issues of Motography from 1917 digitized, which we confirmed on the publication’s page (https://mediahist.org/features/publications-volumes.php?id=Motography).
It was confusing that Lantern was unable to find the 1917 issues, especially because both websites use the same backend service (Apache Solr). If the information appeared on mediahist.org, it clearly was included in the search index! So why wasn’t it appearing on lantern.mediahist.org?
Further confounding this problem was the fact that some date + title searches were working as expected! An Advanced Search for “Motography” and the year 1916 worked exactly as expected. 1918 also had no problems, so what was going on with 1917?
After a few frustrating weeks of troubleshooting and experimentation, we figured out that the problem came down to a difference in how the “date string” was formatted in the database. This formatting caused the Solr search index to skip over many matches that it should have been including. A different publication, “Motion Picture Magazine,” had many of its dates stored as “1914” – a simple year. But for “Motography,” many dates included a month range as well as the year, e.g. “Jul-Dec 1917.” This minor difference seemed to be cascading and causing searches to overlook many matching items.
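To make the difference between those two formats concrete, here is a minimal Ruby sketch of the kind of audit that can surface it. This is an illustration rather than code from Lantern: the method name date_string_shape, the two patterns, and the sample list are all assumptions made for this example.

```ruby
# Hypothetical audit helper (not part of Lantern): bucket date strings by
# the shape of their format so that inconsistencies stand out.
YEAR_ONLY   = /\A\d{4}\z/                               # e.g. "1914"
MONTH_RANGE = /\A[A-Z][a-z]{2}-[A-Z][a-z]{2} \d{4}\z/   # e.g. "Jul-Dec 1917"

def date_string_shape(value)
  case value.to_s.strip
  when YEAR_ONLY   then :year_only
  when MONTH_RANGE then :month_range_and_year
  else :other
  end
end

# Sample values borrowed from the examples in this post, plus an empty string.
samples = ['1914', 'Jul-Dec 1917', 'July 2017', '']

# Group the samples by shape and print one line per group.
samples.group_by { |s| date_string_shape(s) }.each do |shape, values|
  puts "#{shape}: #{values.inspect}"
end
```

Run against a full export of date strings, a grouping like this would show at a glance how many records fall outside the plain four-digit-year format.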
But hold on, why are we storing dates in this way? This inconsistency actually turned out to be the crux of the problem!
Our database and search index actually have multiple fields for storing date information:
- dateString – a human-readable string, such as July 2017
- date – a string field whose stored format has been inconsistent over the years
- year – a four-digit integer, such as 2017
- dateStart – an ISO 8601 timestamp, such as 2024-07-01T00:00:00Z
- dateEnd – an ISO 8601 timestamp, such as 2024-07-31T19:37:26Z
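As a rough illustration of how these fields relate to one another, the Ruby sketch below derives year, dateStart, and dateEnd from a human-readable string. It is not the code we run in production. The method name derive_date_fields is hypothetical, it only understands the two formats mentioned in this post (a bare year, or an abbreviated month range plus a year), and the choices to treat a bare year as January through December and to end ranges at 23:59:59 are assumptions made for the example.

```ruby
require 'date'

# Hypothetical helper: derive structured date fields from a string such as
# "1914" or "Jul-Dec 1917". Returns nil for any other format.
def derive_date_fields(date_string)
  match = date_string.to_s.match(/\A(?:([A-Z][a-z]{2})-([A-Z][a-z]{2}) )?(\d{4})\z/)
  return nil unless match

  year = match[3].to_i
  # Date::ABBR_MONTHNAMES is [nil, "Jan", ..., "Dec"], so index gives 1..12.
  # Assumption: a bare year spans January through December.
  first_month = match[1] ? Date::ABBR_MONTHNAMES.index(match[1]) : 1
  last_month  = match[2] ? Date::ABBR_MONTHNAMES.index(match[2]) : 12
  return nil if first_month.nil? || last_month.nil?

  {
    'dateString' => date_string,
    'year'       => year,
    'dateStart'  => Date.new(year, first_month, 1).strftime('%Y-%m-%dT00:00:00Z'),
    # A day of -1 asks Date for the last day of that month.
    'dateEnd'    => Date.new(year, last_month, -1).strftime('%Y-%m-%dT23:59:59Z')
  }
end

p derive_date_fields('Jul-Dec 1917')
p derive_date_fields('1914')
```

Keeping the structured fields derivable from one source string is one way to stop them from drifting apart, although it only works once the source strings themselves follow a known set of formats.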
In general, we use dateString for most searches, because it most closely matches how users tend to enter dates as part of a search query. However, we use a combination of the other fields for various sorting and filtering functions.
It turns out there were some problems with the “data type” of the dateString field. Sam Hansen, our database administrator, examined the data model and discovered that a small aspect of how the database was configured was causing the search index to overlook numerous entries. They wrote that, “the problem was that dateString was text but multiValued=false so whenever there was a space it failed out.” Thanks to Sam’s expertise, we were able to quickly reconfigure our database and search index and make adjustments to the Web app to ensure that date metadata displayed as expected. The updated search configuration was tested and deployed in April 2024, and is now operational on the live website. The missing Motography issues were successfully relocated!
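Spaces in a value matter on the query side as well. In Solr’s standard query syntax, unquoted whitespace separates clauses, so a multi-word value has to be quoted as a phrase to be matched against a single field. The Ruby sketch below is not the fix Sam made, which was a change to the search configuration. It is only an illustration, with a hypothetical helper name, of how a value like “Jul-Dec 1917” could be quoted when building a field clause.

```ruby
# Hypothetical helper (not from Lantern): build a Solr field clause with the
# value quoted as a phrase, so that a space inside the value does not split
# it into separate clauses.
def solr_phrase_clause(field, value)
  # Inside a quoted phrase, double quotes and backslashes must be escaped.
  escaped = value.to_s.gsub(/(["\\])/) { "\\#{$1}" }
  %(#{field}:"#{escaped}")
end

puts solr_phrase_clause('dateString', 'Jul-Dec 1917')
# prints: dateString:"Jul-Dec 1917"
```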
This issue with the display of date metadata for a single publication was ultimately minor, but nevertheless represents the unending work of managing the MHDL’s digital collections and ensuring the functionality of the Lantern Web app. Our challenge is to balance the competing priorities of data organization, provenance, and user experience. These priorities often demand we take opposite actions. For example, Lantern provides users with a consistent date for any given page scan, but in doing so obfuscates parts of the underlying object’s attributes and provenance. The scanned page becomes decoupled from its original place within a physical volume on a library shelf, a volume which itself may contain multiple issues of a publication. Minor details, such as whether a page is listed under “1960” or “January – December 1960,” represent a balancing act of technical requirements, date formatting standards, and user expectations.
What is especially fascinating to me is how these decisions get written into the source code itself, becoming part of the platform and its infrastructure. The nice thing about Lantern and the MHDL is that these structures and interfaces are able to be hidden “behind the scenes” – so long as everything is working, most researchers using the websites will never have to think about dateStrings versus dateStarts. But simultaneously, these data formatting and structuring decisions operate as a layer of abstraction between users and archival decisions of description and organization. Attending to the underlying code is a way for researchers to read either “along” or “against” the “archival grain” of a digital collection. But because source code is often deliberately kept out of sight of most users, important archival decisions and processes can easily go unnoticed. The digitization of a collection never fully eliminates the labor of accessing primary source materials; it merely transforms it and transfers it to other people.
We appreciate that so many people have incorporated Lantern and the MHDL into their own research. We don’t take this lightly; we want to provide the best experience possible. If you notice any issues with Lantern, please email us at mhdl@commarts.wisc.edu to let us know.
Happy searching!