Hello! I feel like so much has happened since I last wrote, and I’m excited to share some new project work with you all. I’ve been really busy at CUNY Television, continuing work on modifying micro-services and creating documentation for micro-services use. I still have a few more enhancements to complete before moving on to the next phase of my project: migrating a petabyte of data from LTO 5 to LTO 7 tapes. However, a lot of the work that I am doing now is going to assist with this migration, as it has to do with standardizing the structure and contents of our Archival Information Packages (AIPs) and writing scripts that automate file fixity, account for any changes to files, and ensure that our AIPs contain their necessary components.
With this blog post, I hope to start a discussion exploring how different institutions handle the creation, documentation, and migration of AIPs over time, as well as internal standards outlining the contents of AIPs. I will begin with a discussion of the AIP conceptual model, as outlined in the OAIS Reference Model, and then use my work at CUNY Television as a use case for how to document and verify the contents of an AIP. My observations at CUNY have raised large questions for me about how AIPs are approached by different institutions with many kinds of systems, and what kinds of tools could be used to document and verify the structure and contents of AIPs. To be clear, I don’t mean to suggest a standard that AIPs should adhere to; there is a reason the AIP model is conceptual. Instead, I’m curious to hear how others have handled documenting and verifying the contents of AIPs across varying institutional contexts. I hope you’ll contribute your ideas, in the comments, on Twitter, or via email!
OAIS Reference Model and AIPs at CUNY Television
At CUNY Television, we follow a workflow that is akin to the OAIS Functional Model, meaning that we receive Submission Information Packages (SIPs) in the form of shows ready for broadcast and we transform those SIPs into AIPs and Dissemination Information Packages (DIPs). Our SIPs are the media itself, usually a ProRes file but sometimes files from SxS cards, hard drives, or tapes, and the associated metadata, which comes as an email from the producer describing the show, or is created by our Broadcast Librarian, Oksana. SIPs become AIPs through our ingestfile micro-service, which takes the SIP file and transcodes access and service copies, makes metadata, delivers access copies, and packages everything up into a directory named with a unique identifier. Most of our DIPs get delivered to the web team and the broadcast server automatically in the process of AIP creation, but we regularly return to our AIPs for files that need to be rebroadcast. This means that in some cases we have “active AIPs,” which don’t really exist in the OAIS documentation but are a legitimate concept in our workflow. [1]
For those of you who have not had the opportunity to read through the Reference Model for an Open Archival Information System (OAIS), an AIP is defined “to provide a concise way of referring to a set of information that has, in principle, all the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object.” (4-36)
a potentially helpful but sort of confusing “conceptual” diagram of an AIP
A DIP, on the other hand, is intended for the “consumer,” meaning that DIPs contain the “access copies” of content. Our DIPs consist of YouTube files, MP3s for podcasts, a service copy for broadcast, and a collection of still images. But we also include our DIPs in our AIP because we want to preserve not only our original file, but the derivatives and their corresponding metadata. Because AIPs are a conceptual model, it is difficult to figure out exactly what belongs in an AIP. According to the OAIS standard, “the AIP itself is an Information Object that is a container of other Information Objects. Within the AIP is the designated Information Object, and it is called the Content Information. Also within the AIP is an Information Object called the Preservation Description Information (PDI). The PDI contains additional information about the Content Information and is needed to make the Content Information meaningful for the indefinite Long Term.” (4-36 – 4-37). Are you wondering what an Information Object is? “The Information Object is composed of a Data Object that is either physical or digital, and the Representation Information that allows for the full interpretation of the data into meaningful information.” (4-20 – 4-21).
Here’s another OAIS diagram
To break this all down in the context of our archival workflows, and perhaps yours too, an Information Object is the video file and the corresponding metadata that make it possible to interpret the file. Our AIPs contain multiple Information Objects: our original preservation master, the service copy, access copies, and their metadata. Because OAIS provides a framework, and the AIP is a conceptual model, AIPs probably look really different across collections, which makes sense given different institutional needs. Here’s the structure of our AIPs using a command line tool called tree (which I’ll discuss more in the following section):
The structure of a CUNY TV Archival Information Package using the command line tool tree
Documenting and Validating AIPs
Let's go AIPshit. pic.twitter.com/5a8buSYHiI
— tvc15 (he/him) (@tvc15brian) December 2, 2015
In preparation for our data migration, we need to be able to pull all of our AIPs (approximately 1 petabyte of data, or 950,000 files) off our long term storage (LTO), perform fixity checks, and validate that the structure and contents of each AIP match what we’ve defined they should be. Specifically, we’ve been working to develop some kind of tool that can map out the directory structure of an AIP, and then validate its contents against a set of rules. At CUNY TV, we’re concerned about the change of internal AIP specification over time. For example, our digital preservation software, media micro-services, has evolved over the years. An AIP from 3 years ago won’t be the same as one created this week. In theory, this would be the case at other institutions as well, even with the use of digital preservation software like Archivematica or Preservica, as that software has changed over time.
If we accept that the contents and structure of an AIP might change over time, shouldn’t we be concerned about setting up institution-based parameters for what the AIP should look like? We’re not saying that an AIP specification has to be the same across all institutions, because that would be impossible given the plethora of digital content types, but that there should be a standardized way to document and verify the contents of an AIP. At CUNY TV, we also want to make sure that the files within the AIP are what they should be, for example, that they adhere to specific codecs and other encoding criteria. Do you share these concerns about documenting and verifying the contents of your AIPs?
Presently, we’re working on a solution that involves two scripts, maketree and verifytree. maketree uses a tool called tree, which you saw in the previous section, and creates an XML document that shows the structure and contents of our AIPs. With tree, we’re also able to show the date each file was last modified and its size. Then, verifytree uses XPath statements and xmlstarlet to run the XML document through a series of tests about the structure and contents of an AIP. If there are any discrepancies, verifytree prints them to the terminal window. These processes are just the beginning though, as we’ll want to implement other forms of validation testing as well.
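To make the idea concrete, here is a minimal sketch of that kind of rule-based structure check. It is written in Python rather than with xmlstarlet, and it is not the actual verifytree script. The sample XML only approximates the nested directory and file elements that tree can emit as XML, and the package name, directory names, and rules are all invented for illustration; they are not CUNY TV's real AIP layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a tree-style XML listing of one package.
# Every directory and file name here is invented for illustration.
SAMPLE_TREE_XML = """<tree>
  <directory name="SHOW0001">
    <directory name="objects">
      <file name="SHOW0001.mov"/>
      <directory name="access">
        <file name="SHOW0001.mp4"/>
      </directory>
    </directory>
    <directory name="metadata">
      <file name="checksum.md5"/>
    </directory>
  </directory>
</tree>"""

# Each rule pairs a description with an XPath expression (restricted to the
# subset ElementTree supports) that must match at least one element.
RULES = [
    ("package has an objects directory",
     "./directory/directory[@name='objects']"),
    ("objects contains at least one file",
     "./directory/directory[@name='objects']/file"),
    ("package has a metadata directory",
     "./directory/directory[@name='metadata']"),
    ("package has a logs directory",
     "./directory/directory[@name='logs']"),
]


def verify_tree(xml_text, rules=RULES):
    """Return the description of every rule the package fails."""
    root = ET.fromstring(xml_text)
    return [desc for desc, xpath in rules if root.find(xpath) is None]


if __name__ == "__main__":
    for failure in verify_tree(SAMPLE_TREE_XML):
        print("FAILED:", failure)
```

Keeping the rules in a plain list means that a change to the internal AIP specification becomes a change to the list, not to the checking logic.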
Still to come
One thing we want to be sure to test, as I mentioned earlier, is the file itself. We want to ensure that each of our access, service, and preservation master files adheres to a standard. Some of the AIPs we’ll be looking at are from 3 or 4 years ago, when we didn’t have the same workflow or internal standards that we do now. Our current thought is to use MediaInfo as a way of testing that a file adheres to specified requirements. We use MediaInfo to create metadata about files at their time of creation, and the output of that is an XML file. Similar to verifytree, we’d use XPath statements again to test against a set of rules about the encoding of the file.
Another idea we’ve been considering is putting all of this metadata into a METS document and using something like METS Schematron to validate the contents and structure of the package. My mentor, Dave Rice, has been working on this aspect. What Dave has done thus far is combine the tree.xml file and the output of MediaInfo for master, service, and access copies into a METS file. The process can be seen in makemets, on GitHub. This script isn’t completed yet, but it is the beginning of a way to represent the structure and contents of a package using METS. One issue that we’re having is implementing PREMIS events, in particular for files that do not have a corresponding MediaInfo file.
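As a rough illustration of what testing encoding rules with XPath could look like, here is a small Python sketch. The XML below is a simplified, hypothetical stand-in for MediaInfo output (real output varies by version and may use XML namespaces), and the policy values are assumptions made up for the example, not our actual encoding requirements.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a MediaInfo XML report.
SAMPLE_MEDIAINFO_XML = """<MediaInfo>
  <media>
    <track type="General"><Format>MPEG-4</Format></track>
    <track type="Video">
      <Format>ProRes</Format><Width>1920</Width><Height>1080</Height>
    </track>
    <track type="Audio"><Format>PCM</Format><Channels>2</Channels></track>
  </media>
</MediaInfo>"""

# Assumed policy: each XPath must resolve to exactly the expected text.
POLICY = [
    (".//track[@type='Video']/Format", "ProRes"),
    (".//track[@type='Video']/Width", "1920"),
    (".//track[@type='Audio']/Channels", "4"),
]


def check_encoding(xml_text, policy=POLICY):
    """Return (xpath, expected, actual) for every rule that does not match."""
    root = ET.fromstring(xml_text)
    problems = []
    for xpath, expected in policy:
        actual = root.findtext(xpath)  # None when the element is missing
        if actual != expected:
            problems.append((xpath, expected, actual))
    return problems


if __name__ == "__main__":
    for xpath, expected, actual in check_encoding(SAMPLE_MEDIAINFO_XML):
        print(f"{xpath}: expected {expected!r}, found {actual!r}")
```

A separate policy could be kept for each file role (preservation master, service copy, access copy), since each would be expected to meet different encoding criteria.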
Finally, another aspect of this workflow to consider is how to log which packages do not validate, and automate a way to make sure they are reprocessed to the present AIP standard if needed. If we want to validate 1,000 packages at a time, or even 10 times that, ideally we wouldn’t just want all of that information to output into the terminal window. Some possible solutions would be to set up a system where if a package fails a test, we would log why it failed somewhere, and automatically move it to a place where it could be reprocessed OR reprocess it immediately once we know it failed.
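One way the first option could look is sketched below in Python: every failure is appended to a CSV log, and the failed package is moved into a holding directory for reprocessing. The function name, the log format, and the directory names are all hypothetical, and nothing here is part of media micro-services.

```python
import csv
import shutil
from pathlib import Path


def triage_package(package_dir, failures, log_path, reprocess_dir):
    """Log each failure for a package and, if there are any, move it aside.

    Returns the package's final location: unchanged if it validated,
    or its new path inside reprocess_dir if it did not.
    """
    package_dir = Path(package_dir)
    if not failures:
        return package_dir

    # Append one row per failure so the log explains why a package was moved.
    with open(log_path, "a", newline="") as log:
        writer = csv.writer(log)
        for reason in failures:
            writer.writerow([package_dir.name, reason])

    # Move the whole package into the (hypothetical) reprocessing area.
    reprocess_dir = Path(reprocess_dir)
    reprocess_dir.mkdir(parents=True, exist_ok=True)
    destination = reprocess_dir / package_dir.name
    return Path(shutil.move(str(package_dir), str(destination)))
```

Called once per package with the list of failed tests, this leaves validated packages where they are and produces a single log that can be reviewed after a large batch, instead of output scrolling past in the terminal.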
Concluding thoughts
You may have noticed that there are a lot of questions posed in this document. That’s because we’re genuinely curious how other organizations are looking at or handling these same problems, if at all. I really encourage discussion in the comments or on Twitter, or feel free to email me at dinah [at] cuny.tv. There’s no AIP-shaming, and this is a safe-AIP-space. Thanks for reading, and I hope you’ll be in touch!
==========================
*Dave
[1] I grabbed this concept of an “active AIP” from page 8 of the OAIS-YES docs, written at the AMIA/DLF Hack Day in November.