This kind of work is crucial because the US government holds invaluable international and national data relating to climate. “These are irreplaceable repositories of important climate information,” says Lauren Kurtz, executive director of the Climate Science Legal Defense Fund. “So fiddling with them or deleting them means the irreplaceable loss of critical information. It’s really quite tragic.”
Like the OEDP, the Catalyst Cooperative is trying to make sure data related to climate and energy is stored and accessible for researchers. Both are part of the Public Environmental Data Partners, a collective of organizations dedicated to preserving federal environmental data. "We have tried to identify data sets that we know our communities make use of to make decisions about what electricity we should procure or to make decisions about resiliency in our infrastructure planning," says Christina Gosnell, cofounder and president of Catalyst.
Archiving can be a difficult task; there is no one easy way to store all the US government’s data. “Various federal agencies and departments handle data preservation and archiving in a myriad of ways,” says Gosnell. Nor does anyone have a complete list of all the government websites in existence.
This hodgepodge of data means that in addition to using web crawlers, tools that capture snapshots of websites and data, archivists often have to scrape data manually. Sometimes a data set sits behind a login or a captcha designed to stop automated tools from pulling it. Web scrapers can also miss key features of a site: pages often link out to other pieces of information that aren’t captured in a scrape, or a scrape may simply fail because of the way a site is structured. Having a person in the loop to double-check the scraper’s work, or to capture data by hand, is often the only way to ensure that the information is properly collected.
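For illustration only, here is a minimal sketch in Python of the basic step a crawler performs: fetching a page, keeping its raw HTML, and listing the outbound links a human archivist might still need to chase down. The URL and function names are placeholders, not the actual tooling used by these groups, and a page behind a login or captcha would simply refuse a request like this.

```python
# Illustrative sketch of a single crawl step; not the archivists' actual tooling.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag so missed links can be reviewed by hand."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def snapshot(url):
    """Fetch one page, keep its raw HTML, and report the links it points to."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative links so an archivist can see which pages the crawl missed.
    return html, [urljoin(url, link) for link in parser.links]


if __name__ == "__main__":
    page, links = snapshot("https://example.gov/climate-data")  # placeholder URL
    print(f"Captured {len(page)} bytes; found {len(links)} outbound links to review")
```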
And there are questions about whether scraping the data will really be enough. Restoring websites and complex data sets is often not a simple process. “It becomes extraordinarily difficult and costly to attempt to rescue and salvage the data,” says Hedstrom. “It is like draining a body of blood and expecting the body to continue to function. The repairs and attempts to recover are sometimes insurmountable where we need continuous readings of data.”
“All of this data archiving work is a temporary Band-Aid,” says Gosnell. “If data sets are removed and are no longer updated, our archived data will become increasingly stale and thus ineffective at informing decisions over time.”
These effects may be long-lasting. “You won’t see the impact of that until 10 years from now, when you notice that there’s a gap of four years of data,” says Jacobs.
Many digital archivists stress the importance of understanding our past. “We can all think about our own family photos that have been passed down to us and how important those different documents are,” says Trevor Owens, chief research officer at the American Institute of Physics and former director of digital services at the Library of Congress. “That chain of connection to the past is really important.”
“It’s our library; it’s our history,” says Richards. “This data is funded by taxpayers, so we definitely don’t want all that knowledge to be lost when we can keep it, store it, potentially do something with it and continue to learn from it.”