My CUNY TV residency has consisted of three tasks: to assess and enhance media microservices, to assist with the implementation of a Digital Asset Management (DAM) system, and to build a workflow for the migration of 1 petabyte of data from LTO 5 to LTO 7. While I’m still working on media microservices and DAM implementation, much of my time these days is spent working with LTO 7. In this post I’m going to talk about LTO storage, my proposed data migration workflow at CUNY Television, and some of the challenges we’ve had working with the new LTO 7 technology.
What are LTOs?
LTO stands for Linear Tape Open and is a magnetic tape data storage technology. LTO technology is commonly used for large scale IT backups using a tape robot.
LTO has many advantages as a long term storage technology. First, the format specification for LTO is open, meaning that tapes and drives made by different companies are compatible (in theory) and that no single company can hold a monopoly on manufacturing the tapes. Unless companies collectively decide to stop manufacturing LTO, both of these properties work against obsolescence of the technology. Second, LTO drives can read and write tapes one generation back, and can read (but not write) tapes two generations back, meaning that we are able to read back our LTO 5 tapes in our LTO 7 drives, but we can’t write to LTO 5 tapes using our LTO 7 drives.
Third, LTO 5 and later tapes have an option to use LTFS, which stands for Linear Tape File System. LTFS expresses the file system on the tape in XML, which the computer reads back so that files on the tape appear as they would in a file browser such as the Finder. LTFS also has command line utilities, and the source code is available from the vendors who make LTO (Tandberg, HP). Because the LTFS format specification is open, we are able to write scripts to perform specific LTO-related tasks. We use and contribute to ltopers, which is part of the AMIA open source repository.
Finally, LTO technology is available at a lower cost than other types of storage technology such as cloud storage or hard drives. LTO 7 tapes cost $153.99 and can hold approximately 5.4 TB of uncompressed data, making their cost per TB approximately $28.52. The tapes are advertised as having a 6 TB capacity, but thus far we have not found this to be true. This is explained somewhat by how orders of magnitude in data are counted: 6 TB in decimal units (6 × 10^12 bytes) is about 5.46 TiB in binary units.
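The capacity and cost figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the price and capacities come from this post, while the constant and function names are made up for illustration, and reading the 5.4 figure as binary units is an assumption rather than something we have confirmed.

```python
# Figures from the post; names are illustrative only.
TAPE_PRICE_USD = 153.99
ADVERTISED_CAPACITY_BYTES = 6 * 10**12  # 6 TB in decimal units
OBSERVED_CAPACITY_TB = 5.4              # roughly what we have been able to write


def bytes_to_tib(n_bytes):
    """Convert a byte count to binary terabytes (TiB, 2**40 bytes each)."""
    return n_bytes / 2**40


def cost_per_tb(price_usd, capacity_tb):
    """Price divided by capacity, in dollars per terabyte."""
    return price_usd / capacity_tb


# The advertised 6 TB shrinks when counted in binary units.
print(f"{bytes_to_tib(ADVERTISED_CAPACITY_BYTES):.2f} TiB")  # 5.46 TiB

# Cost per TB using the capacity we actually observe.
print(f"${cost_per_tb(TAPE_PRICE_USD, OBSERVED_CAPACITY_TB):.2f}")  # $28.52
```

The gap between 5.46 and 5.4 is small enough that file system overhead could plausibly account for it, but that part is a guess.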
If you want to learn more specifics about LTO technology, I recommend visiting the Wikipedia article on LTO.
Our (proposed) migration workflow
Presently, we have many of our media assets stored on LTO 5, and are beginning to migrate to LTO 7, which was released in late 2015. A high level overview of the proposed workflow is as follows:
- Reading: Read data from LTO 5 tapes (an A tape and a B tape, which are identical) to staging hard drives (a separate drive each for the A tape read back and the B tape read back)
- Fixity: Perform fixity checking on the data (for some LTO tapes this will be a matter of verifying checksums, for others, such as tapes that contain AIPs, the verification process will also include ensuring that the package contains its necessary contents)
- Writing: Once approximately 5 TB have been read back from LTO 5, write E and F LTO 7 tapes
- Reading (again): Read back the data from LTO 7 tapes and verify that all the files transferred correctly
This list is highly deceptive, however, because each of these steps contains numerous smaller steps that depend on the outcome of the previous one, and we have a number of edge cases that complicate these processes. Furthermore, we’ve been encountering some … interesting hardware and software issues, requiring a lot of patience and troubleshooting.
One of the most significant reading problems we’ve had thus far has been reading back LTO 5 tapes using LTO 7 drives. We purchased two standalone LTO 7 drives and one rack mounted LTO 7 drive with two slots. We began our migration by reading back LTO 5 tapes using the LTO 7 standalone drives, and in addition to having multiple files read back with a checksum mismatch (6 mismatches across 4 tapes), the drives were making some wild sounds!
(turn up your volume/headphones for maximum sounds)
After some correspondence with the manufacturer of the drives, we came to the conclusion that the drives themselves were damaged. This was underscored when we had numerous errors reading back from LTO 7 tapes in the LTO 7 drives (more on this under the “Reading (again)” step below).
The materials that we have stored on LTO 5 span the past 5 years, which means some were organized or packaged using a different workflow than the one we currently use. This results in some fun surprises when mounting and reading back LTO 5 tapes from the earlier days of the Library and Archives, like no checksums for the contents of a tape…
Luckily, this isn’t too much of a problem because we have an identical copy of each file on the B tape, so we can create checksums for the contents of each tape, then sort and diff the two lists to verify the contents. We can also use ffmpeg to check the individual video files for errors (ffmpeg -i [input] -f null -).
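The checksum, sort, and diff comparison between the A and B read backs could look something like the following in Python. This is a sketch rather than the script we actually run: the function names and the idea of pointing it at two staging directories are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its MD5 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(a_root, b_root):
    """Report files missing from either copy and files whose checksums differ."""
    a_sums = checksum_tree(a_root)
    b_sums = checksum_tree(b_root)
    shared = set(a_sums) & set(b_sums)
    return {
        "only_on_a": sorted(set(a_sums) - set(b_sums)),
        "only_on_b": sorted(set(b_sums) - set(a_sums)),
        "mismatched": sorted(n for n in shared if a_sums[n] != b_sums[n]),
    }
```

A mismatch only says that the two copies differ, not which one is damaged, which is where a decode check like the ffmpeg command comes in.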
If we do have a file with an error (we’ve only had two thus far, out of 15 TB), we make sure to note that and swap out the “bad” file for the “good” file from the second tape when we create the collection that we write to the LTO 7 tape.
We’ve also added checksumming of metadata files as a part of our fixity checking, as it only takes a few additional seconds and will make it easier in the future to verify the contents of the tape. Now, in addition to individual package and folder checksum.md5 files, we have a checksum.md5 file that represents the entire contents of the tape, similar to a BagIt manifest.
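To give a sense of what a tape-level checksum.md5 involves, here is a minimal sketch that walks a staging directory and writes one line per file. The function names are hypothetical, and the line layout (checksum, two spaces, relative path, as produced by the md5sum utility) is an assumption for the example, not a description of our exact file format.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_tape_manifest(tape_root, manifest_name="checksum.md5"):
    """Write a manifest covering every file under tape_root.

    Package-level checksum and metadata files are included like any other
    file; only the top-level manifest itself is skipped.
    """
    tape_root = Path(tape_root)
    manifest_path = tape_root / manifest_name
    lines = []
    for path in sorted(tape_root.rglob("*")):
        if path.is_file() and path != manifest_path:
            relative = path.relative_to(tape_root).as_posix()
            lines.append(f"{md5_of_file(path)}  {relative}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest_path
```

Sorting the paths keeps the manifest stable between runs, so two manifests for the same content can be compared with a plain diff.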
To write LTOs, we’re using the writelto script from ltopers. writelto uses rsync to write files to the tape, and we’ve added functionality so that checksums are captured and logged with rsync. That way, if there are checksum mismatches when we read files back from LTO 7 tape, we can find out whether the errors were introduced while the files were being moved. Overall, the writing process has gone smoothly, and with the exception of attempting to write too much data on one set of tapes, we haven’t encountered any specific errors that we can attribute to the write process.
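One way to avoid overfilling a tape is to total the staged files before writing and compare against a capacity budget. The sketch below is not part of writelto; the function names and the 5 TB default (borrowed from the "approximately 5 TB" batch size in the workflow overview) are assumptions for illustration.

```python
from pathlib import Path

# Hypothetical budget: about 5 TB in decimal units, leaving headroom on the tape.
DEFAULT_BUDGET_BYTES = 5 * 10**12


def staged_size_bytes(staging_root):
    """Total size in bytes of every file under the staging directory."""
    return sum(
        path.stat().st_size
        for path in Path(staging_root).rglob("*")
        if path.is_file()
    )


def fits_on_tape(staging_root, budget_bytes=DEFAULT_BUDGET_BYTES):
    """True if the staged files fit within the byte budget for one tape."""
    return staged_size_bytes(staging_root) <= budget_bytes
```

This only counts file contents, so the real limit on tape will be somewhat lower once the file system index and any other overhead are taken into account.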
Because we were sort of nervous about how well our LTO 7 standalone drives were performing, given the checksum mismatches from reading back LTO 5 tapes and their disconcerting noises, we decided to add an additional checksum verification after writing to LTO 7 tape. After we’ve written an LTO 7 tape, we’ve been reading the contents back and creating a secondary checksum file to compare against the tape’s checksum file. This is where things get interesting (weird… bad… etc.). Out of the three sets of tapes we’ve written (somewhere around 15 TB), we’ve had 17 files with checksum mismatches when reading back LTO 7 tapes from the rack mounted LTO 7 drive.
Yikes! This perplexed us, as we knew that all of the files’ checksums had verified before we wrote them. So we decided to read back and checksum the mismatched files a second time. Upon the second read back, the files’ checksums verified, matching the original checksum and the checksum generated as part of the rsync log.
Thus far, every checksum mismatch we’ve had when reading back a file from LTO 7 tape has occurred upon the first read, and upon subsequent reads the checksum verifies. What is the deal?!
While we are excited and relieved that all of our files have transferred successfully thus far, it is disconcerting that we are having checksum mismatches upon the first read back, something that we’ve never experienced with our LTO 5 tapes and drives. We’re going to continue testing and recording our results, with the hope of coming to some better conclusion about why we are having this issue. If anyone else out there in the digital preservation community is using LTO 7, I would love to talk with you more about this!
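The read, compare, and re-read pattern described above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the manifest is assumed to use the md5sum-style layout (checksum, whitespace, relative path), and recording which read first verified is our own bookkeeping idea, not a feature of ltopers.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(manifest_path):
    """Read 'checksum  relative/path' lines into a {path: checksum} dict."""
    expected = {}
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            checksum, name = line.split(None, 1)
            expected[name] = checksum.lower()
    return expected


def verify_read_back(read_root, manifest_path, max_reads=2):
    """Checksum each listed file, re-reading on a mismatch.

    Returns {path: n}, where n is the read on which the checksum first
    matched, or None if it never matched or the file was missing.
    """
    read_root = Path(read_root)
    results = {}
    for name, expected in parse_manifest(manifest_path).items():
        first_good_read = None
        for attempt in range(1, max_reads + 1):
            try:
                actual = md5_of_file(read_root / name)
            except FileNotFoundError:
                break  # a missing file will not appear on a re-read
            if actual == expected:
                first_good_read = attempt
                break
        results[name] = first_good_read
    return results
```

Logging the read number for each file, rather than a simple pass or fail, would make it easier to see over time whether first-read mismatches cluster on particular drives or tapes.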