Featured post

National Digital Stewardship Residency in New York

Posted on by

Metropolitan New York Library Council, in partnership with Brooklyn Historical Society, is implementing the National Digital Stewardship Residency (NDSR) program in the New York metropolitan area through generous funding from the Institute of Museum and Library Services (IMLS) via a 2013 Laura Bush 21st-Century Librarian Program grant.

The National Digital Stewardship Residency is working to develop the next generation of digital stewardship professionals who will be responsible for acquiring, managing, preserving, and making accessible our nation's digital assets. Residents serve 9-month, paid residencies in host institutions working on digital stewardship initiatives. Host institutions receive the dedicated contribution of a recent graduate who has received advanced training in digital stewardship.

An affiliated project, NDSR Boston, was awarded by IMLS to Harvard Library, in partnership with MIT Libraries, to implement the fellowship program in Boston, MA. Hosts and residents from both programs will participate in the broader NDSR network of professionals working to advance digital preservation and develop a sustainable, extensible model for postgraduate residencies combining advanced training and experiential learning focused on the collection, preservation, and availability of digital materials. See the About NDSR-NY page for additional program and contact information.

The Case for a Digital Preservation Policy Document

Posted on by

Hi everyone, Shira here. Today's blog post will be something of a project update by way of some thoughts about the need for digital preservation policies, inspired by a conversation between Jefferson Bailey and Meghan Banach Bergin that went up on the Signal earlier this week. Meghan is the author of a Report on Digital Preservation Practices at 148 Institutions worldwide, and Jefferson interviewed her to discuss the results of her research and its implications for her work at the University of Massachusetts Amherst Libraries, where she is the Bibliographic Access and Metadata Coordinator.


Bergin's report provides a fascinating snapshot of how digital preservation is being done around the world at this particular moment in time. Specifically, it highlights the extreme degree of variation that exists within the field; although there are some common themes among digital preservation initiatives (tight budgets, imperfect tools, and not enough staff), no two institutions take exactly the same approach to preserving digital information.

For me, one of the most compelling insights to emerge from Bergin's report is the fact that over 90% of respondents said they had undertaken efforts to preserve digital materials, and yet only about 25% of the institutions surveyed had a written digital preservation policy. (Briefly, for those wondering what a digital preservation policy is: it's a written statement, authorized by the repository's management, describing the approach to be taken for the preservation of digital objects deposited into the repository.) When I first read that statistic I found it somewhat shocking, but the longer I sat with it the less surprising it began to seem. In fact, I realized that it chimed with my own experience working in the field, which led me to reflect on why this has come to be the case for so many institutions.

Photo © 2006–2014 Institute of Software Technology and Interactive Systems

The first reason I hit upon is one that Bergin herself mentions in her interview with Bailey. When he asks what she feels accounts for the discrepancy, Bergin says that writing a digital preservation policy is time consuming, and that, "that's why a lot of institutions have decided to skip writing a policy and just proceed straight to actually doing something to preserve their digital materials."

This struck me as a particularly revealing sentiment because it implies that many respondents understand the act of writing a high-level digital preservation policy document to be fundamentally different from (and inherently less valuable than) "actually doing something" to preserve their collections. Quite frankly, this is something that needs to change. Institutions need to begin approaching digital preservation as a holistic task rather than as a series of actions that fulfill the basic preservation requirements of bit preservation and content accessibility. There needs to be a deeply ingrained understanding that any archive, museum, or library seeking to preserve digital material must commit to implementing these high-level policy documents, and that doing so is digital preservation in and of itself.

Of course I understand that there is a practical difference between high-level conceptual planning and the hands-on tasks involved in the day-to-day management of an ever-growing collection of digital information; I have yet to work at a repository that has had the luxury of operating completely outside of the triage stage, and at any given time finding a balance between meeting users' and administrators' needs—not to mention the collections'—can be a challenge.

But there will always be emergencies, deadlines, and budget cuts. In fact, that is part of what makes having a digital preservation policy in place so important: articulating your repository's approach to preserving the objects accessioned into it may help identify gaps or areas of weakness in your preservation strategy, which will in turn leave the repository better prepared for an emergency if one should occur. A good preservation policy may also aid in demonstrating that certain positions are vital to the organization's mission, or even—if paired with a well-constructed strategic survey of user groups—help determine the repository's worth to an organization. In short, this is why having a digital preservation policy is considered a prerequisite to becoming a Trustworthy Digital Repository (for more on that subject see my previous blog post, On The Subject of Trust, which discusses why this matters).

Digital preservation policies matter, and I feel lucky to have the opportunity to help Carnegie Hall draft one. The final deliverable for my NDSR project will be a Digital Preservation and Sustainability document for Carnegie Hall that will outline a set of policies, procedures, best practices, and workflows for the ongoing management of digital files. And yes, this will ultimately include a digital preservation policy, as well as a Preservation Implementation Plan, Strategic Plan, Repository Mission Statement, and Access Policy. (Here’s lookin’ at you, TDR)

At this point in my project I'm midway through the process of gathering the information I will need to eventually create this document. I've spent the past month interviewing a variety of different departments across Carnegie Hall to determine how digital content is being created, used, stored, and managed within the organization. I've learned a great deal; talking to people has not only given me a better understanding of how digital content is used within Carnegie Hall, but has also given me the opportunity to learn what matters most to each department, which will be crucial when we begin rolling out the new DAMS next year. Currently I'm in the process of reviewing, synthesizing, and summarizing these interviews. I will return to them over the next couple of months as my project progresses and I reach a point where I can begin to document some of the workflows described in them, so watch this space for updates.

***

Further Reading: There are a lot of good resources available on this subject, but I wanted to give a quick h/t to Daniel Noonan's "Digital Preservation Policy Framework: A Case Study", which charts the process through which The Ohio State University Libraries created an organizational policy for digital information. I particularly enjoyed this case study and highly recommend giving it a read.

Prove Yourself: Needs Assessment Edition

Posted on by

What I've come to love about the library science field (which after years of waiting tables you'd think I'd hate) is the service aspect to everything we do. Librarians are intensely user-focused in all of our work: through the use of needs assessment surveys, we mold our libraries to what users want, expect, and need. We use the results to design programs, buy technology, even create positions within a library (YA librarian is a thing because of that!). Some common ways to implement a library assessment include focus groups, interviews, scorecards, comment cards, usage statistics from circulation and reference, and surveys sent to users via email or on paper.

This past week, I attended a workshop at METRO with the fabulous Julia Kim that focused on the design and implementation of surveys, called "Assessment in Focus: Designing and Implementing an Effective User Feedback Survey." The presenter, Nisa Bakkalbasi, the assessment coordinator at Columbia University Libraries/Information Services, is a former statistician and presented on the many ways one can glean statistically valuable quantitative data from simple survey questions.

The first part of this workshop dealt with the assessment process and types of survey questions, while the second dealt mainly with checking your results for errors. I will focus here on the first part, which is about data gathering and question design.

I will touch briefly on the assessment process by saying this: all the questions asked should relate directly to the objectives laid out at the beginning of the process. Also, surveying is an iterative process: as a library continues to survey its users, the quality of the survey, and the value of the results it yields, will also increase.

Assessment Process: http://libraryassessment.org/bm~doc/Bakkalbasi_Nisa_2012_2.pdf

While my work at AMNH is conducted solely through interviews, I found the discussion Nisa led on the types of questions used in survey design particularly helpful. She focused the session on closed-ended questions, because there is no way to get quantitative data from open-ended questions. With open-ended questions, all the results can say is "the majority of respondents said XYZ," whereas with closed-ended questions the results can say "86% of respondents chose X over Y and Z." This emphasis was extremely important, because real quantifiable data is the easiest to work with when putting together results to share within an institution.

When designing survey questions, it is important to keep a few things in mind:

  • Ask one thing at a time
  • Keep questions (and survey!) short and to the point
  • Ask very few required questions
  • Use clear, precise language (think The Giver!)
  • Avoid jargon and acronyms!

The two most common closed-ended question types are multiple choice questions and rating scale questions.

For multiple choice questions, it is important to include all options without any overlap. The user should not have to think about whether they fit into two of the categories or none at all. For rating scales, my biggest takeaway was the use of an even number of points to remove the neutral option. While forcing people to have opinions is considered rude at the dinner table, it is crucial to the success of a survey project.

“Filthy neutrals!” — Commodore 64 Zapp Brannigan

Both of these types of questions (and all closed-ended questions) allow for easy statistical analysis. By a simple count of answers, you have percentage data that you can then group by other questions, such as demographic questions (only use when necessary! sensitive data is just that–sensitive) or other relevant identifying information.

In terms of results, this can be structured like: "78% of respondents who visit the library 1-4 times a week said that they come in for group study work." That statement combines two multiple choice questions: what is your primary use of the library, and how often do you come in? These provide measurable results, critically important in libraryland and something librarians can rely heavily upon.
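For anyone curious how that kind of grouped percentage falls out of raw responses, here is a minimal sketch in Python using pandas. The question wording, column names, and answers are hypothetical, invented purely for illustration rather than drawn from the workshop.

```python
import pandas as pd

# Hypothetical survey responses: one row per respondent.
responses = pd.DataFrame({
    "visit_frequency": ["1-4 times a week", "1-4 times a week", "daily",
                        "1-4 times a week", "rarely", "daily"],
    "primary_use": ["group study", "group study", "quiet study",
                    "group study", "printing", "group study"],
})

# Cross-tabulate the two multiple choice questions and normalize by row,
# so each cell is the percentage of respondents at that visit frequency
# who chose that primary use.
table = pd.crosstab(responses["visit_frequency"],
                    responses["primary_use"],
                    normalize="index") * 100

print(table.round(1))
```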

I also want to briefly discuss more innovative ways libraries can begin to use this incredible tool: proving value–the library's value, that is. Libraries traditionally lose resources in both space and funding due to a perceived lack of value by management, the train of thought usually being that since libraries aren't money-makers, they inherently have less value to the institution.

We as librarians know this to be both ludicrous and false. And we need to prove it. If the result the library is looking for says something like "95% of respondents said that they could not have completed their work without the use of the library," then that is a rating scale question waiting to happen. And an incredible way to quantitatively prove value to upper management.



Quantitative data gathered via strategic surveying of user groups can be a powerful tool that librarians can–and should!–use to demonstrate their value. In business decisions, hard numbers carry more weight than testimonials. Library directors and other leaders could have access to materials that allow them to better represent the library to upper management on an institution-wide level. This can be the difference between a library closure and a library expansion, especially in institutions where funding can be an issue.

Librarians can and should use these surveys for their own needs, both internally for library services and externally on an institution-wide scale. Whether you are a public library trying to prove why you need a larger share of the community's budget, or a corporate library vying for that larger space in the office, the needs assessment survey can help cement the importance of a library as well as guide the development of library programs.

In the words of Socrates, “an unexamined life is not worth living.”

Jeremy Blake’s Time-Based Paintings

Posted on by

Julia here. In my last post, I gave an overview of the digital forensics AMIA panel I chaired. In this post, I'll go over some of the work I'm doing as a resident at New York University Libraries, with a special focus on the Jeremy Blake Papers. My current task is to create access-driven workflows for the handling of complex, born-digital media archives. My work, then, does not stop at ingest but must account for researcher access. I'm processing 20 collections, each with its own set of factors that influence the direction workflows may take. For example, collections can range in size from 30 MB on 2 floppy disks to multiple terabytes from an institution's RAID. Collection content may comprise simple .txt and ubiquitous .doc files or, as is the case with material collected from computer hard drives, may hold hundreds of unique and proprietary file types. Further complicating the task of workflow creation, collections of born-digital media often present thorny privacy and intellectual property issues, especially with regard to identity-specific information (e.g., Social Security numbers), which is generally considered off-limits in areas of public access.

At this point in the fellowship, I have conducted preliminary surveys of several small collections with relatively simple image, text, moving image, and sound file formats. By focusing on accessibility with these smaller collections first, I'll develop a workflow that encompasses disparate collection characteristics. These initial efforts will help me formulate a workflow as I approach two large, incredibly complex collections: the Jeremy Blake Papers and the Exit Art Collection. I'll spend the rest of this post discussing the Blake Papers.

Jeremy Blake (1971-2007) was an American digital artist best known for his "time-based paintings" and his innovations in new media. The Winchester trilogy exemplifies his methodology, which traversed myriad artistic practices: here, he combined 8mm film, vector graphics, and hand-painted imagery to create distinctive color-drenched, even hallucinatory, atmospheric works. Blake cemented his reputation as a gifted artist with his early artistic and commercial successes, such as his consecutive Whitney Biennial entries (2000–2004, inclusive) and his animated sequences in P.T. Anderson's Punch-Drunk Love (2002).

The Jeremy Blake Papers include over 340 pieces of legacy physical media in formats that span optical media, short-lived Zip and Jaz disks, digital linear tape cartridges, and multiple duplicative hard drives. Much of what we recovered seemed to be a carefully kept personal working archive of drafts, digitized and digital source images, and various backups in multiple formats, both for himself and for exhibition. While the content was often bundled into stacks by artwork title (as exhibited), knowing that multiple individuals had already combed through the archive before and after acquisition of the material makes any certainty as to provenance and dating impossible for now. In addition to work files, we are also processing emails and other assorted files recovered from his laptop.

Photoshop component files from Chemical Sundown (2011) displayed on PowerMac G3.

Through the work I’ll be doing over the course of this fellowship (stay tuned), researchers will be able to explore Blake’s work process, the software tools he used, and the different digital drafts of moving image productions like Chemical Sundown (2011).

Processing the Jeremy Blake Papers will necessitate exploration of the problems inherent in the treatment of digital materials. Are emails, with their ease of transmission and seeming immateriality, actually analogous to the paper-based drafts and correspondence in the types of archives we have been processing for years? Or are we newly faced with the transition to a medium that requires us to seriously rethink our understandings and retool our policies and procedures to protect privacy and prevent future vulnerability? While we haven't explicitly addressed the issue yet, these are some of the bigger questions that our field will need to explore as we balance our obligations to donors as well as future researchers. Tangential, but not irrelevant, are the questions surrounding the belated conception, positioning, and exhibition of post-mortem presentations of incomplete works, such as Blake's unfinished Glitterbest. These are some of the serious conundrums I am addressing in my work as I draft the clauses addressing born-digital materials for our donor agreement templates—creating concrete policies that will be implemented during the course of an acquisition and donor interview next week.

The Blake collection was initially reported to include over 125,000 files. We have recently had to renumber and rethink the accuracy of some of the initial figures, thanks in no small part to the discovery of hitherto occluded media in unprocessed boxes. Initially, my mentor, New York University Digital Archivist Don Mennerich, and I were working with files copied (and therefore significantly altered) from Blake's hard drives received in 2010, before write-blocker hardware was part of the required protocol for handling digital material at NYU. Although digital forensics takes its cues from the fields of legal and criminal investigation, cultural heritage institutions did not adopt the relevant forensic practices until after breakthroughs such as the publication of CLIR's report on born-digital forensics (2010). Not having the file timestamps severely limited our ability to assess the collection's historical timespan. In our predictions with regard to research interest, charting Blake's work progress over time would have been high on the list, so this bar chart (created with Access Data's FTK software) was obviously not ideal. Digital files are delicate; the ways in which file access information is recorded lend themselves to distortion.

 

Visualization of the 1st set of Blake born-digital material with all dates modified. The grey rectangle represents the modified access date. That is, all the files show the same date rather than a span of years.

Visualization of the 2nd set of Blake born-digital material with intact date span, as represented by the many gray lines across almost a decade.

Luckily, the issues created by previous access to archival files were resolved after some digging into written reports regarding the collection, along with the important discovery of four boxes of unprocessed material. Enlisting the aid of a number of student interns, we've imaged more than half of these materials (created bit-exact replicas, which can itself be a difficult hurdle). Comparing newly imaged material with the initial Blake acquisition files, we have determined that many of the acquired, compromised files are duplicative, and consequently we have been able to assign the correct time-date stamps! That is, many of the files from the 2nd set of born-digital media images were in the 1st set as well. Blake clearly understood the importance of redundancy in his own workflow. I've no doubt that this is (or may prove to be) a common experience for archivists processing digital materials.
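A rough illustration of the comparison described above (matching files from two sets by checksum so that trustworthy timestamps from one set can stand in for the altered ones in the other) might look like the Python sketch below. The directory names are placeholders, and this is only a simplified stand-in for the FTK-based workflow, not the actual procedure used at NYU.

```python
import hashlib
from pathlib import Path

def checksums(root):
    """Map SHA-256 digest -> file path for every file under root."""
    digests = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digests[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return digests

# Placeholder directories: the 2010 acquisition (timestamps altered by
# copying) and the newly created disk images (timestamps intact).
altered = checksums("acquisition_2010")
intact = checksums("disk_images_2014")

# Byte-identical files appear in both sets, so the trustworthy
# modification time from the imaged copy can be associated with the
# altered copy.
for digest, altered_path in altered.items():
    if digest in intact:
        good_mtime = intact[digest].stat().st_mtime
        print(f"{altered_path}: trustworthy mtime {good_mtime}")
```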

Examples of media from the Jeremy Blake Papers (note the optical media that looks like vinyl).

At this point, Blake's collections have been previewed, preliminarily processed, and arranged through Access Data's FTK software. This is a powerful but expensive software program that can make an archivist's task—to dynamically sift through vast quantities of digital materials—even plausible as a 9-month project. While Don and I manage the imaging and processing, we've also started discussing what access might look like. This necessitates discussions with representatives from all three of NYU's archival bodies (Fales, University Archives, and Tamiment), as well as the head of its new (trans-archive) processing department, the Archival Collections Management Department. In our inaugural meeting last week, we discussed making a very small (30 MB) collection accessible to researchers in the very near future as a test case for providing access to some of our larger collections. As part of my responsibilities here, I'll be chairing this group as we devise access strategies for collection content.

More specifically, we have also set up hardware and software configurations that may help us to understand Blake's artistic output. In the past two weeks, for example, Don has identified the various Adobe Photoshop versions that Blake used by viewing the files in hex (the hexadecimal representation of the binary). We have sought out those obsolete versions of Adobe Photoshop, and my office area is now crowded with different computers configured to read materials from software versions common to Blake's most active years of artistic production. Redundancy isn't just conducive to preventing data loss, however; we still need multiple methods with which to view and assess Blake's working files. In addition to using multiple operating systems, write-blockers, imaging techniques, and programs, I spent several days installing emulators on our contemporary Mac, PC, and Unix machines. After imaging material, we'll start systematically accessing outdated Photoshop files via these older environments, both emulated and actual.
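To give a flavor of that kind of hex-level inspection, the sketch below reads the header of a Photoshop file in Python: PSD files begin with the ASCII signature '8BPS' followed by a two-byte format version. Pinning down which application version produced a file, as Don did, generally means digging further into embedded metadata; this shows only the first step, and the file name is a placeholder.

```python
import struct

def inspect_psd_header(path):
    """Print the signature and format version from a Photoshop file header."""
    with open(path, "rb") as f:
        header = f.read(6)
    # Show the raw bytes the way a hex editor would.
    print("hex:", header.hex(" "))
    signature, version = struct.unpack(">4sH", header)
    if signature == b"8BPS":
        kind = {1: "PSD", 2: "PSB"}.get(version, "unknown")
        print(f"Photoshop file, format version {version} ({kind})")
    else:
        print("Not a Photoshop file (missing 8BPS signature)")

inspect_psd_header("working_file.psd")  # placeholder file name
```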

Hex editor view used to help identify software versions (extra points if you recognize which Blake piece this file is from).

In the meantime, I still need to make a number of decisions and the workflow is still very much a work in progress! This underpins a larger point: this fellowship necessitates documentation to address gaps like these. That is, while there are concrete deliverables for each phase of the project, in order to deliver I'll need to understand and investigate intricacies in the overall digital preservation strategy here at NYU. While working with very special collections like the Jeremy Blake Papers is a great opportunity, it's also great that the questions we address will be useful at our host sites for many other projects down the line. While I may not be able to write more on Blake in the blog, Don Mennerich and I will co-present a paper documenting our findings at the American Institute for Conservation (AIC) meeting this May…but in the meantime, lots of work will need to get done!

On the Subject of Trust

Posted on by

Building a Network of Trustworthy Digital Repositories
Shira here. Earlier this week Karl gave us an excellent run-down of some of the newly released 2015 National Agenda for Digital Stewardship’s principal findings. Whereas Karl’s blog post focused on two of the banner recommendations made in the “Organizational Policies and Practices” section of the report—namely, the importance of multi-institutional collaborations and the acute need to train and empower a new generation of digital stewardship professionals—my blog post today will focus on one of the “blink and you’ll miss it” points that the report makes in its final section on “Research Priorities”.

If you made it all the way to page 41 of the report (and I know you all did… right?), you may have noticed a vaguely phrased section entitled "Policy Research on Trust Frameworks". Although this section is buried in the report under other findings and actionable recommendations, it contains what is, in my opinion, one of the most important points that the report raises. The purpose of this section is to highlight the importance of developing robust "trust frameworks" for digital repositories. If you read this last sentence and thought, "Hooray! I'm delighted to see this in the NDSA report," then you can probably skip this blog post. For those of you still wondering what a trust framework is and why it matters to digital preservation, read on.

Kara Van Malssen and Seth Anderson giving a workshop to NDSR mentors and residents on the Audit and Certification of Trustworthy Digital Repositories (ISO 16363:2012)


WTF – What’s a Trust Framework?
A trust framework is a digital preservation planning tool that clearly defines the characteristics and responsibilities of a sustainable digital repository. Trust frameworks typically lay out the organizational and technical infrastructure required for an institution to be considered trustworthy and capable of storing digital information over the long term. Simply put, a trust framework provides organizations with a way to measure—and thereby demonstrate to potential donors, clients, or auditors—their trustworthiness as stewards of digital information. The need for trust frameworks in digital preservation has become more apparent over the past couple of decades, and a number of distinct initiatives have been developed in response.

Some trust frameworks, like the Global LOCKSS Network, MetaArchive, the Data Preservation Alliance for the Social Sciences (Data-PASS), and the Digital Preservation Network, work on the principle of redundancy and broad-based collaborative institutional mechanisms as a strategy for mitigating single points of failure within a given institution. In this system, multiple organizations that may not be able to individually provide all the elements necessary for a sustainable, end-to-end digital repository enter into an agreement to become collaborative stewards of digital information.


Other trust frameworks, like the NESTOR Catalogue of Criteria for Trusted Digital Repositories, DRAMBORA, and the Trustworthy Repositories Audit & Certification (TRAC) criteria, take the form of auditing tools. These trust frameworks are intended to allow organizations to determine their capability, reliability, commitment, and readiness to assume long-term preservation responsibilities. While some of these tools offer organizations the possibility of performing a self-audit, others offer repositories the possibility of being evaluated and ultimately certified as a trustworthy digital repository by an outside auditor.


This post was largely inspired by a recent workshop given by Kara Van Malssen and Seth Anderson of AVPreserve on one of these tools, the Audit and Certification of Trustworthy Digital Repositories, so I thought I’d take this opportunity to talk a little bit about what it is and why it matters to digital preservation.


TDR: A Little Bit of Background
I’ll assume anyone reading this blog has more than a passing familiarity with the Reference Model for an Open Archival Information System, which defines the requirements necessary for an archive to permanently preserve digital information. (If you’re not familiar with OAIS—also known as ISO 14721:2003—stop whatever you’re doing and go read it here). While it’s pretty hard to overstate the importance of OAIS for digital preservation, one thing it lacks is any comprehensive definition or consensus on the characteristics and responsibilities of a sustainable digital repository. That’s precisely where the Audit and Certification of Trustworthy Digital Repositories comes in.

In 2003, the Research Libraries Group (RLG) and the National Archives and Records Administration (NARA) embarked on a joint initiative to specifically address the creation of a trust framework for digital repositories that would rely on a process of certification. The result of their effort was the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), the purpose of which is to identify digital repositories capable of reliably storing, migrating, and providing access to digital collections. The TRAC criteria formed the basis of the Audit and Certification of Trustworthy Digital Repositories, which ultimately superseded it.

The Audit and Certification of Trustworthy Digital Repositories (ISO 16363:2012, or just "TDR" for short) outlines 109 distinct criteria designed to measure a repository's trustworthiness as a steward of digital information. Intended for use in combination with the OAIS Reference Model, TDR lays out the organizational and technical infrastructure that an institution must have in order to be considered trustworthy and capable of ensuring the stewardship of digital objects over the long term.
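By way of illustration only (TDR does not prescribe any particular tooling), an institution running a self-audit might track its status against each criterion with a structure as simple as the Python sketch below. The identifiers, summaries, and evidence files here are illustrative paraphrases and placeholders, not quotations from the standard.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    """One self-audit line item: a criterion and the evidence gathered for it."""
    identifier: str                # section number within the standard
    summary: str                   # paraphrase of what the criterion asks for
    status: str = "not assessed"   # e.g. "met", "partially met", "not met"
    evidence: list = field(default_factory=list)  # links to policy documents, etc.

# Hypothetical entries for the sake of the example.
checklist = [
    Criterion("3.1.1", "Mission statement commits to digital preservation",
              status="met", evidence=["repository_mission_statement.pdf"]),
    Criterion("3.1.2", "Preservation strategic plan exists and is current",
              status="partially met", evidence=["draft_strategic_plan.docx"]),
]

unmet = [c for c in checklist if c.status != "met"]
print(f"{len(unmet)} of {len(checklist)} sampled criteria still need attention")
```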


 

Aww Come On. Just Trust Me!
Why do these tools matter? Here are a few reasons:

  • In order to provide reliable, long‐term access to managed digital resources, archives must assume high-level responsibility for this material. This requires a significant amount of resources, organization, infrastructure, and planning across all levels of an organization. Attempting to steward digital material over the long term on an ad-hoc basis, or without the appropriate resources and infrastructure in place, is dangerous and will ultimately put the material an archive is tasked with caring for at risk. In order to do this effectively, archives must have some metric by which they can evaluate their progress. Trust frameworks like TDR provide this.
  • TDR is designed to take into account a number of criteria beyond merely an organization's digital preservation infrastructure. These include, for example, the degree of fiscal responsibility an institution is able to demonstrate, and whether or not the institution has an appropriate succession plan, contingency plans, and/or escrow arrangements in place in case the repository ceases to operate. In spite of the fact that both of these criteria are critical to an organization's ability to provide long‐term access to digital resources, they might easily be overlooked. Being evaluated according to a set of established standards—whether via a self-audit or by an external auditor—can highlight holes in a repository's operation that may not be apparent in the course of normal, day-to-day business.
  • As TDR states in its introduction, “Claims of trustworthiness are easy to make but are thus far difficult to justify or objectively prove.” On a very basic level, trust frameworks provide institutions with a metric that allows them to compare their own systems and procedures against an established, high profile standard in order to evaluate their trustworthiness. Employing a trust framework like TDR will allow archives to provide evidence to potential grantmaking bodies, donors, or board members that they are responsible and trustworthy digital stewards.


Sounds Great. Sign Me Up!
Not so fast, cowboy. While it is undeniably clear that establishing reliable trust frameworks is of the utmost importance to the field, that doesn't mean that TDR—or any of the other trust frameworks I've mentioned here—provides the whole answer in and of itself. As the NDSA report points out, this is still a relatively under-explored area and there is a lot of room for additional standards, models, and frameworks to be developed. Moreover, the report takes pains to point out that many of these frameworks have yet to be empirically tested and systematically measured, and that there are still a lot of questions that remain to be answered: "How reliable are certification procedures, self-evaluations, and the like at identifying good practices? How much do the implementations of such practices actually reduce risk of loss?" the report asks. "The evidence base is not yet rich enough to answer these questions."

The report concludes with a recommendation that funding and research bodies concentrate their research efforts on exploring ways to improve the reliability and efficiency of trust frameworks for digital preservation. As with so many things in this field, there is no final solution; the journey is the destination.

 

All together now: NYARC and the National Agenda for Digital Stewardship

Posted on by

Karl again! Some truly hardworking authors and editors released the 2015 National Agenda for Digital Stewardship this month. Tl;dr? Then don't worry, because I've got you covered. This vision of the field's challenges and of the work ahead in fact contains something for everyone with a vested interest in digital preservation. Of course I couldn't help but see reflections of my own project with the New York Art Resources Consortium (NYARC) throughout.

As befits an organization with the scope and mandate of the National Digital Stewardship Alliance, the agenda aspires to enable the kind of top-down national infrastructure solutions that have been so influential to digital stewardship in the EU–especially when it comes to our biggest big data problems. Nonetheless, it promotes multi-institutional collaborations among smaller organizations like NYARC's as the most important model for digital stewardship.

Why? Throughout the agenda, NDSA makes the case that this model: provides essential redundancies; enables specialist institutions to focus on collecting within their unique scopes; lets the same perennially under-resourced institutions punch above their weight class as teams; and, ultimately, further distributes a preservation network that by means of its interconnection can perform its longest-term functions more sustainably than can any single vendor. This is an “ecosystem” in which many more can thrive.

Of course that’s all easy enough to say, but much more difficult to coordinate among our respective institutions’ no less important short-term and internal requirements. Visualize just the closest circle of partners in NYARC’s very specific (and very young!) effort to archive and manage specialist art historical resources from the web, and what you get is already a fairly complex machine:

Schematic diagram of web archive management at NYARC by Karl-Rainer Blumenthal. Icons by Hello Many, Juan Pablo Bravo, and Simple Icons of The Noun Project.

The NYARC directors at the Frick Art Reference Library, the Brooklyn Museum, and the Museum of Modern Art, and their financial supporters—the Andrew W. Mellon Foundation and the Institute of Museum and Library Services—all clearly understand that this project has to resolve the greatest inter-institutional challenges in order to lay a foundation for future collaborations. Those challenges, as articulated throughout the National Agenda, are to openly share documentation, engender buy-in from content creators in the commercial sector, and foster the next generation of digital stewards.

As Peggy succinctly demonstrated last week, implementing standards for highly specialized digital resources tends to be a difficult balance of specificity and interoperability. NYARC embraces this challenge by modeling processes for the quality assurance, preservation metadata, and archival storage of web archives among three institutions with many disciplinary affinities but great collecting diversity. This facilitates a kind of 'instant interoperability' with standards sufficiently specific to be practically applicable at each institution. Moreover, each institution will contribute to the project's iterative documentation, which the wider field of art librarians will be invited to reference and develop. Watch this space for updates on that front!

Others are certainly eager to take that baton and run with it. As the NDSA’s national survey of web archiving programs—released right on the heels of the National Agenda—indicates, resources for collaborative efforts like the Internet Archive’s Archive-It software service will support an explosion of desire to archive the web as a team:

Graph from the NDSA's "Web Archiving in the United States: A 2013 Survey" (p. 12).

Such projects have thus far relied on the Herculean efforts of human catalysts like NYARC’s web archiving program coordinator Sumitra Duncan. And while open documentation lightens their load significantly, we all must also do more to engage commercial actors and next-generation digital stewards to share it.

For-profit enterprises define the work of web archiving by exponentially expanding and diversifying the contents of the web, and yet they have been all-too-little engaged to shape expectations of long-term viability. NYARC closes this circle by going directly to the publishers, auction houses, galleries, and curators who create the live web, and demonstrating the program's value not just as an immediate scholarly resource, but as a dark archive for proprietary content with no extant preservation strategy. In so doing, it also overcomes its greatest technical obstacles with input from Hanzo Archives—a private web archiving organization whose expertise with the most challenging elements known to web design stems from its work with innovative corporate web instances. Encouraging more private concerns like these to team with heritage organizations on long-term preservation strategies is absolutely essential to ensuring that our organizations have the intellectual resources we need to get the job done.

Perhaps most important of all to the long-term viability of all of this work, however, is the empowerment of our truly born-digital generation of information professionals to own and replenish it. I'm lucky to have a temporary view from the vanguard myself, but I'm especially curious to see what NYARC's youngest and most imaginative teammates will make of it. With the support of Mellon and IMLS, we currently enlist the support of no fewer than six interns from Pratt Institute's Master's degree programs in Library & Information Science, Digital Art & Information, and History of Art. The students already provide critical vision to otherwise tedious tasks. Their responsibility for the management and use of these resources will only increase, so the onus is on all of us to ensure in the meantime that they have a heritage worth stewarding.

Source: The Henry Ford

Long-term preservation and shared responsibility are critical to all of us and yet notoriously difficult to articulate in value statements. If we ever want to streamline the intricately-structured, grant-funded, and highly project-specific work, however, we’ll have to make convincing cases for the return on external and internal investments in permanent infrastructure, full-time staff appointments, and further research. All together now, let’s join the structures, resources, and projects to those values to give them form.

 

Science: The Final Frontier (of digipres)

Posted on by

Science: the final frontier. These are the voyages of Vicky Steeves. Her nine-month mission: to explore how scientific data can be preserved more efficiently at the American Museum of Natural History, to boldly interview every member of science staff involved in data creation and management, to go into the depths of the Museum where none have gone before.

Hi there. Digital preservation of scientific data is criminally under-addressed nationwide. Scientific research is increasingly digital and data intensive, with repositories and aggregators built every day to house this data. Some popular aggregators in natural history include the NIH-funded GenBank for DNA sequence data and the NSF-funded MorphBank for image data of specimens. These aggregators are places where scientists submit their data for dissemination, and they act as phenomenal tools for data sharing; however, they cannot be relied upon for preservation.

AMNH scientists at work in the scorpion lab: http://scorpion.amnh.org/page19/page23/page23.html

Science is, at its core, the act of collecting, analyzing, refining, re-analyzing, and reusing data. Reuse and re-analysis are important parts of the evolution of our understanding of the world and the universe, so to carry out meaningful preservation, we as digital preservationists need to equip future users with the necessary tools to reuse said data.

 

Therein lies the biggest challenge of the digital preservation of scientific data: the very real need to preserve not only the dataset but the ability to deliver that knowledge to a future user community. Technical obsolescence is a huge problem in the preservation of scientific data, due in large part to the field-specific proprietary software and formats used in research. This software is sometimes even project-specific, and often is not backwards compatible, meaning that a new version of the software won't be able to open a file created in an older version. This works against both access and preservation.

An example of obsolete database software: popular back in the day, but not widely used today.

Digital data are not only research output, but also input into new hypotheses and research initiatives, enabling future scientific insights and driving innovation. In the case of the natural sciences, specimen collections and taxonomic descriptions from the 19th century (and earlier) are still used in modern scientific discourse and research. There is a unique concern in the digital preservation of scientific datasets, where the phrase "in perpetuity" has real usability and consequence, in that these data have value that will only increase with time. A hundred years from now, historians of science will look to these data to document the processes of science and the evolution of research. Scientists themselves will use these data for additional research or even comparative study: "look at the population density of this scorpion species in 2014 versus today, 2114; I wonder what caused the shift." Some data, particularly older data, aren't necessarily replicable, and in that case, the value of the material for preservation increases exponentially.

So the resulting question is how to develop new methods, management structures and technologies to manage the diversity, size, and complexity of current and future datasets, ensuring they remain interoperable and accessible over the long term. With this in mind, it is imperative to develop an approach to preserving scientific data that continuously anticipates and adapts to changes in both the popular field-specific technologies, and user expectations.

Open Science: published data aren't the end-all-be-all of digipres for science. There are a lot of data that need our help! http://www.opensciencenet.org/

There is a pressing need for digital preservationists to get involved in looking after scientific data. While strides have been made by organizations such as the National Science Foundation, the Interagency Working Group on Digital Data, and NASA, no overarching methodology or policy has been accepted by scientific fields at large. And this needs to change.

The library, computer science, and scientific communities need to come together to make decisions for preservation of research and collections data. My specific NDSR project at AMNH is but a subset of the larger collaborative effort that needs to become a priority in all three fields. It is the first step of many in the right direction that will contribute to the preservation of these important scientific data. And until a solution is found, scientific data loss is a real threat, to all three communities and our future as a species evolving in our combined knowledge of the world.

I will leave you, dear readers, with a video from the Alliance for Permanent Access conference in 2011. Dr. Tony Hey speaks on data-intensive scientific discovery and digital preservation and exemplifies perfectly the challenges and importance of preserving digital scientific research data:

Capturing a Shadow: Digital Forensics Applications with Born-Digital Legacy Material

Posted on by

Hi – Julia here. Like Shira, I just spent the last week attending the Association of Moving Image Archivists (AMIA) conference in Savannah, Georgia. AMIA brings together approximately 500 moving image professionals and students from all over the world for a wide variety of workshops, panels, and special screenings. While it's too much to cover in depth, those of you unfamiliar with the conference can check out the program as well as another blog post by NDSR Boston resident Rebecca Fraimow.

As part of my conference experience, I chaired and moderated a panel on digital forensics applications with personal collections, with speakers Elizabeth Roke, Digital Archivist at Emory University, and Peter Chan, Digital Archivist at Stanford University. The work of both directly corresponds to my projects at NYU Libraries, where I've been tasked with developing the infrastructure, policy, and workflows to preserve and make accessible born-digital collections. One of the major collections I'm working on is the Jeremy Blake Papers and his "time-based paintings." I'll detail my progress in this area in my next post.

This panel was the first of its kind to be introduced to the AMIA community. There had been no previous discussion of digital forensics concepts, use cases, or projects, making this panel a unique experience for me as both a moderator and a community member. Digital forensics has its origins in the legal and criminal investigative worlds, but its support of archival principles such as provenance, chain of custody, and authenticity has driven its recent adoption in the archives.

While digital forensics is an emerging field, both Emory and Stanford were among the first universities to create forensics labs and acquire equipment to process backlogs of obsolete born-digital media. Rarer still, neither Emory nor Stanford stops at ingest: both have processed collections that are now accessible to researchers.

 

Elizabeth Roke, Emory University

Elizabeth Roke began her presentation with an introduction to digital forensics concepts and workflows. She strongly emphasized, however, that there is a huge gap between ideal workflows and the reality of born-digital processing. This is a theme the panel and Q&A returned to throughout. She often, for example, finds herself processing media with little to no documentation. File names can be baffling and nondescript (" ,,,,,,.doc"). Chain of custody may already be broken by well-meaning individuals who have copied files over, permanently altering their time stamps. Additionally, Elizabeth stressed that disk imaging itself–the initial process of refreshing the media into a more actionable format–could take a lot of time, effort, and experimentation, and was not always successful.


Elizabeth also updated us on Emory's seminal work on Salman Rushdie's four personal computers, as well as some recently processed, less complicated collections, such as the Alice Walker papers. Preserving and providing access to the Rushdie computers involved significant dedicated staff time because of both the high level of technical requirements involved in full-scale personal computing emulation and the number of nuanced access levels and restrictions. Few institutions can dedicate the resources to make such a project happen. While the earliest Rushdie computer was emulated and accessible in February 2010, the remainder of the Rushdie born-digital collections is not yet accessible. She contrasted that immense project with the Alice Walker papers, a collection of word processing files on floppy disks with comparatively few restrictions.

 

Peter Chan, Stanford University

Peter Chan then jumped in via Skype to discuss and demo ePADD (email: Process, Accession, Discovery), a Stanford project specifically tackling email processing that is scheduled for an April 2015 release to the public (NYU Libraries is a beta-tester on this project). One of the great things about ePADD is that it directly addresses a major stumbling block in born-digital access: personal and private information extraction. ePADD extracts text for keyword searches to cull sensitive content such as health information, Social Security numbers, credit card numbers, and any other topics deemed private by a donor. While you can read much more about ePADD in a recent post, one of the fun aspects is the visualizations possible through ePADD. ePADD mines email records and can create cool visualizations displaying word usage and correspondence over time, as seen below with the Robert Creeley papers image (courtesy of Peter):

ePADD visualization of word usage in the Robert Creeley papers (image courtesy of Peter Chan).
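Returning to the extraction of sensitive content mentioned above: to give a generic sense of the kind of screening involved (this is not how ePADD is actually implemented, just an illustration of the concept), flagging text that looks like a Social Security or credit card number can start with a couple of regular expressions.

```python
import re

# Deliberately naive patterns for the illustration; real screening tools use
# far more sophisticated rules, including checksum validation for card numbers.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def flag_sensitive(text):
    """Return (label, match) pairs for an archivist to review."""
    hits = [("possible SSN", m.group()) for m in SSN_PATTERN.finditer(text)]
    hits += [("possible card number", m.group()) for m in CARD_PATTERN.finditer(text)]
    return hits

sample = "Re: the contract. SSN 123-45-6789, card 4111 1111 1111 1111."
for label, match in flag_sensitive(sample):
    print(label, "->", match)
```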

 

“The Roke and Chan virus gap”

At the end of the panel, audience members asked some interesting questions that highlighted different institutional responses. For example, how does each institution handle viruses? Are they worth preserving despite the risks? Stanford preserves the whole disk image, virus and all. Emory, it turns out, also preserves the disk image with viruses intact, but excludes viruses from any exported files, and some institutions may exclude viruses from ingest altogether. This is only one example of differences not only in methods, but in the values and evaluations that determine what counts as an object of study for future researchers. One person's context is another person's text. At this point, we can all only speculate; the researchers aren't there yet.

While my task is to develop policies addressing issues like this, I'm not sure what we'll be doing in this area! From my preliminary surveys, I can already tell that preserving and making accessible Jeremy Blake's work, for example, will present a whole other set of considerations due to its artistic context. Determining its essential qualities will present challenges that a text-based record, for example, wouldn't. I'll blog more about that in my next post!

 

AMIA 2014: Open Source Digital Preservation & Access Stream

Posted on by

Hi everyone, Shira here. Last weekend I attended the Association of Moving Image Archivists Conference in Savannah, GA. For those of you who don’t already know, AMIA is a nonprofit international association dedicated to the preservation and use of moving image media. Although the conference has traditionally focused more on issues surrounding the preservation of analog film and video, in recent years it has brought the subject of digital preservation to the fore.

This year was no exception. One of the three curated programming streams that comprised this year’s AMIA conference was devoted to addressing the open source software in use within the digital preservation community, and by my count, 22 of 53 panels (~42%) were directly related to digital preservation. There was also significant buzz around the projects that were developed as part of the second annual Hack Day, the goal of which is to design and/or improve upon practical, open source solutions around digital audiovisual preservation and access.

Although enough ground was covered at AMIA to provide days’ worth of blogging fodder, my post today is going to focus on the Open Source Digital Preservation & Access stream, which served as a showcase for a variety of exciting tools for the preservation and access of digital video and born-digital moving images.

AMIA swag bags

What does “open source” mean, and why is it important for digital preservation?
Open source software is made available with a license that permits users to freely run, study, modify, and redistribute its source code. Open source tools are usually developed in a public, collaborative forum such as GitHub or GitLab. This means that any user can improve, fix, or add onto the source code over time.

Using the open source model to develop tools for digital preservation has a number of advantages over proprietary software. The principal benefit is cost; most open source tools are available to the public at no cost, which is a big deal for many perennially cash-strapped organizations in the archives community. Another benefit of the open source model is its versatility. Making the source code available to the public allows tools to be perpetually refined and modified according to the needs of a particular group, and as Bill LeFurgy explains in a 2011 blog post on The Signal, open source also "gives organizations the opportunity to stitch together a preservation system from existing components rather than laboriously start from scratch." The last benefit I'll mention here is that open and freely accessible code is far simpler to preserve than closed proprietary code, making it more likely that open source tools themselves will be around in the future.

#osdpa / #amia14
In case you missed the AMIA conference but still want to look into some of the things that were discussed there, you can always play catch-up by following the #osdpa and #amia14 hashtags on Twitter. AMIA has a contingent of active tweeters (myself included), and the good news is that the intrepid Ashley Blewer has put together a twarc archive of tweets from this year’s AMIA conference in JSON and text formats (available here).
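If you grab the JSON version of that archive, a few lines of Python are enough to pull out the tweets for a single hashtag. This assumes the file is line-delimited JSON with one tweet per line, which is how twarc typically writes its output; the file name below is a placeholder.

```python
import json

def tweets_with_hashtag(path, hashtag):
    """Yield tweet texts from a line-delimited twarc JSON file mentioning a hashtag."""
    tag = hashtag.lower().lstrip("#")
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            tweet = json.loads(line)
            tags = [h["text"].lower()
                    for h in tweet.get("entities", {}).get("hashtags", [])]
            if tag in tags:
                yield tweet.get("text", "")

# Placeholder file name for the downloaded archive.
for text in tweets_with_hashtag("amia14_tweets.json", "#osdpa"):
    print(text)
```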

She's also put up a list including links to additional information on all the technologies discussed during the conference, either within the Open Source Digital Preservation & Access stream, at the Hack Day, or at large. It's an extremely valuable resource and I highly recommend giving it a look.

Trevor Thornton speaking as part of the "Open Source Tools, Technologies and Considerations" panel. Photo courtesy of Kathryn Gronsbell.

Interview with Chris Lacinak
Although following the #osdpa hashtag will give readers a good sense of the talks that comprised the open source stream, I wanted to offer some background information on the stream and why it was put together. Before leaving Savannah I spoke with AVPreserve's Chris Lacinak, who curated the Open Source stream. Our conversation is below:

Shira Peltzman (SP): How did the open source stream come into being?

Chris Lacinak (CL): Last year the Digital Library Federation (DLF) and AMIA sponsored the first AMIA/DLF Hack Day, and there was a lot of excitement within AMIA about that event. Thanks to the hard work of Kara Van Malssen, Lauren Sorenson, and Steven Villereal it was a big success and was well received, not just by the people that took part, but by the rest of the membership at large; it gave them an opportunity to engage with that type of event and see what kinds of things happen there. The energy and buzz that came out of the first Hack Day was great. I was the AMIA board liaison to the Open Source Committee, and at the committee meeting the members made it clear that they wanted open source to be a greater part of AMIA. We talked about ways to make that happen, and one of the ideas was to have an Open Source stream as part of the conference. I went back and pitched it to the board and they were very much in favor, asking me to be the stream's curator.

SP: I noticed that many of the panels at the Open Source stream were standing room only. Why do you think this stream was so popular?

CL: Standing room only, and that was accidental! Originally the room we were in was supposed to be half the size that it actually was. Clearly we touched a nerve within the membership. Software has become an integral component of digital preservation practice. Open source software has been embraced wholeheartedly by the archival community largely based on preservation principles as well as budgetary considerations. However, there is still a lack of clarity regarding the process and component parts that make up open source software projects, and people are really interested to get their heads around this. It was also interesting that lots of the people filling the seats had minimal experience with open source or Hack Days or anything like that. So a lot of people were new but were clearly enthralled with the topics and types of tools they were seeing. The other thing is that obviously both digital preservation and access are huge things; people are hungry for content around digital media in general so this served both the digital preservation and access folks and also the open source interest.

Inaugural session of the Open Source Digital Preservation & Access stream at AMIA 2014

SP: Are there any trends that you noticed this year among the presentations in the Open Source stream?

CL: I think that both the tools and the understanding have reached a level of maturity that's very real now. Lots of conversations on open source within the community a few years ago were painted in an experimental light, as if it were something for folks on the fringe and not for "real" archives. Now it feels very central and real. In the stream presentations I really noticed, one, the sophistication of the tools; two, that all of the presenters were very articulate; and three, that the audience was able to receive and process the information in a way that hasn't happened in the past. So I think there's definitely a maturity within this ecosystem that's new and interesting and lets us operate on a different level than in the past.

SP: What are some of the open source projects that you’re most interested in seeing developed in the coming year that were discussed at Hack Day and during the open source stream?

CL: First and foremost I want to point everyone to the Hack Day wiki which also has links to all of the Hack Day projects on GitHub. (Find this here).

All of the projects are really amazing and deserve to be reviewed. The prize winning projects this year were Video Sprites! and Hack Day Capture, with AV Artifact Atlas getting a special jury prize. Personally I would encourage people to take a look at Video Sprites!, the Video Characterization Comparison Viewer, and the ffmpeg Documentation projects. Speaking of a documentation project, another thing that I loved this year was the addition of an Edit-A-Thon (part of AMIA/DLF Hack Day), because what you find is that a large number of open source tools are poorly documented, seriously limiting their usability. Documentation projects really answer a huge need, so I think it's great to have gotten this done. It's extremely important and valuable work.

In the stream, QC Tools is really exciting. I think it's an amazing tool with a feature set that rivals commercial offerings, and I'm excited to see it continue to grow, have new features added, and be used by more organizations. I am interested in combining QC Tools with our tool, MDQC, in one package. Erik Piil gave a lightning talk on an open hardware cleaner he's working on, which is awesome. And on a larger scale, MoMA's development of the first digital preservation repository for museum collections is a phenomenal project. It's hard to pick, though, because they really are all great. If I didn't think they were awesome I wouldn't have picked them for the stream! The entire stream will be posted online for those who are interested in watching the presentations. Keep an eye on the AMIA website.

SP: Well congrats on a really successful stream.

CL: Thank you.