Last week the NDSR-NY cohort travelled to Philadelphia to attend {code4lib}. This was my first time at the conference, and I wanted to focus on gathering general information and troubleshooting tips for microservices-based workflows.
Code4lib highlights the DIY/open-source/information-sharing work of the coder/hacker/library/archivist community, and sessions are voted on by the public for inclusion in the schedule. Shira Peltzman and Alice Prael kicked off the conference with a great talk on “good enough” digital preservation. You can read all about their presentation – as well as a great recap of some highlights from the conference! – in Alice’s post on the NDSR-DC blog.
There were a few presentations on microservices this year, including Dinah and Ashley’s talk, which focused on their use in audiovisual collections; Sebastian Hammer’s, which discussed how a microservices approach can be better achieved for libraries; and Mike Shallcross’s debut of the new features and functions being developed during their ArchivesSpace-Archivematica-DSpace integration project (looking forward to the release with all those additions!).
At WCS, we’re currently testing the transfer and ingest processes of Archivematica, and specifically how to ingest huge SIPs with complex folder structures and mixed-format contents. We’re all about microservices and the flexibility that open-source projects provide for organizations with few staff and resources, but have been struggling with the learning curve of configuration for our particular system. We were chugging along in our testing phase until something happened:
This little dot you see under the “f” is not actually a dot. It’s a character from Unicode’s Private Use Area (PUA). And it totally halted our microservices-based workflow. Private Use Area code points have no defined character, so they are displayed in various forms across different systems. This little bugger got added to a filename twelve years ago, and when Archivematica’s microservices kicked in and called on rsync to look at our SIP and copy it from the transfer storage location into what would be the Backlog, the file with the funky name appeared to vanish, which stopped the transfer and produced an rsync error.
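For anyone who wants to see what is really sitting in a suspicious filename, here is a minimal Python sketch. It relies on the fact that Unicode assigns every private-use code point the general category "Co". The filename and the function name are hypothetical, made up for illustration; this is not the check we actually ran.

```python
import unicodedata

def find_private_use(name):
    """Return (index, 'U+XXXX') for each private-use character in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        # "Co" is the Unicode general category for private-use code points.
        if unicodedata.category(ch) == "Co"
    ]

# Hypothetical filename with one private-use character after the "f".
example = "f\uf022report.doc"
print(find_private_use(example))   # [(1, 'U+F022')]
print(find_private_use("plain.txt"))  # []
```

Printing the code point rather than the character sidesteps the display problem entirely: the number is the same no matter which system is rendering the name.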
The storage location was mounted on our Linux machine (the Archivematica instance) as a Samba share. Through the SMB connection, PUAs are automatically converted into characters. So, on my Mac, the file name looks like (in red):
…and on Linux it looks like:
…and on some Windows systems it looks like:
And on an exFAT formatted thumb drive, it looks like the tiny pixel below the “f” in the first image.
Here is my rough, probably incorrectly described explanation of our problem: The file is stored on a Linux system. The file appears to vanish when rsync attempts to transfer it because its name changes along the way.
And here is a reference to a couple of similar problems from the Rsync Web Pages:
“Something else that can trip up rsync is a filesystem changing the filename behind the scenes. This can happen when a filesystem changes an all-uppercase name into lowercase, or when it decomposes UTF-8 behind your back.
An example of the latter can occur with HFS+ on Mac OS X: if you copy a directory with a file that has a UTF-8 character sequence in it, say a 2-byte umlaut-u (\0303\0274), the file will get that character stored by the filesystem using 3 bytes (\0165\0314\0210), and rsync will not know that these differing filenames are the same file (it will, in fact, remove a prior copy of the file if --delete is enabled, and then recreate it).
You can avoid a charset problem by passing an appropriate --iconv option to rsync that tells it what character-set the source files are, and what character-set the destination files get stored in.”
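The byte sequences in that quote can be reproduced with Python’s standard `unicodedata` module. This is only a sketch of the general idea: it uses the standard NFC and NFD normalization forms, while HFS+ applies its own variant of decomposition, so treat it as an illustration rather than an exact model of the filesystem. The helper name `octal_bytes` is made up.

```python
import unicodedata

def octal_bytes(text):
    """Render the UTF-8 bytes of text as octal escapes, as in the rsync quote."""
    return "".join(f"\\{b:04o}" for b in text.encode("utf-8"))

composed = unicodedata.normalize("NFC", "\u00fc")    # u-umlaut as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "u" + combining diaeresis

print(octal_bytes(composed))    # \0303\0274
print(octal_bytes(decomposed))  # \0165\0314\0210

# The two names look identical on screen but do not compare equal...
print(composed == decomposed)   # False
# ...until both are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A tool that compares filenames byte for byte sees two different files here, which is the same kind of mismatch that made our file seem to disappear.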
We did find a couple of solutions: My mentor, Kim Fisher, created a Python script that finds the PUAs and any other weird characters and converts them into a “_”. This must be run on the original source directory (e.g. in its original location on the external drive), or over an NFS connection, which, unlike a Samba share, does not convert characters. The connection to the transfer storage will also be switched to NFS.
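To give a sense of what a script like that involves, here is a rough sketch. It is not Kim’s script. The function names are hypothetical, and the definition of “weird” (private-use, control, and unassigned characters) is an assumption chosen for the example. It defaults to a dry run so nothing is renamed until the list of changes has been reviewed.

```python
import os
import unicodedata

# Assumption: "weird" means private-use (Co), control (Cc), or unassigned (Cn).
SUSPECT_CATEGORIES = {"Co", "Cc", "Cn"}

def sanitize_name(name, replacement="_"):
    """Replace every suspect character in a single file or folder name."""
    return "".join(
        replacement if unicodedata.category(ch) in SUSPECT_CATEGORIES else ch
        for ch in name
    )

def sanitize_tree(root, dry_run=True):
    """Rename suspect entries under root; return a list of (old, new) paths."""
    renamed = []
    # Walk bottom-up so a folder is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            clean = sanitize_name(name)
            if clean == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, clean)
            if os.path.exists(new):
                # Never overwrite: leave collisions for a person to resolve.
                continue
            if not dry_run:
                os.rename(old, new)
            renamed.append((old, new))
    return renamed

if __name__ == "__main__":
    # Hypothetical path; prints the planned renames without touching anything.
    for old, new in sanitize_tree("/path/to/source", dry_run=True):
        print(f"{old!r} -> {new!r}")
```

Keeping the returned list of old and new paths somewhere safe also gives a record of the original filenames, which matters if they are part of what is being preserved.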
This small issue served to prove a number of things. First, more thinking must be done around how much arrangement and appraisal should be done before collections are transferred to our backlog, as well as some thinking about using BagIt (and how to avoid it taking twenty years to package an enormous folder…). Second, sometimes microservices fail, and we don’t really know why, and it takes a lot of time, effort, and resources to troubleshoot problems when you’re not a coder / developer / Python reader-writer. For organizations that can’t afford fancy software and technical support, open-source microservices are being hailed as solutions. But not everyone has a software-developing mentor in their back pocket to troubleshoot tiny dots. So what are they to do when they run into a problem that is (almost literally) invisible? The network of support is encouraging, but taxing to harness, especially for those lone arrangers out there who don’t have budgets for outside consultation, or for taking time to learn a new language.
Don’t get me wrong: troubleshooting is fun! It is always an opportunity to learn new skills and revisit one’s workflow. And throughout the conference I was excited about the new tools and the support that everyone has for one another’s projects, and of course grateful for the community-based help that is available on user forums and through information and experiences shared on archivists’ blogs and documentation repositories.
I don’t really have a conclusion, but I left c4l with a number of questions about how the open-source-friendly LAM community can better address the need for free technical support for those in need and without a grant-funded digital preservation resident to spend a week doing research. User forums can only get you so far, and the main reason we found this little dot was because I tried to wipe it off of the screen and it didn’t go away…