Hello readers! This past week was a busy one for NDSR-NYC. All five of us were able to attend the 2016 Code4Lib conference in Philadelphia, and Dinah gave a presentation with Ashley Blewer, Applications Developer at the New York Public Library, advocating for the benefits of sharing workflow documentation for A/V digitization and digital preservation processes. Genevieve will write more about Code4Lib and Dinah & Ashley’s talk later this week.
In this post, I will follow up on a presentation I gave after Code4Lib, at a joint conference of the Art Libraries Society of North America and the Visual Resources Association, about Rhizome’s use of Webrecorder for preservation and access of born-digital artworks. I will describe how to use two web archiving tools that were also created by Ilya Kreymer and that complement the ability to capture web content via Webrecorder.io.
Art Libraries + Visual Resources: Joint Conference in Seattle
The 2016 ARLIS/NA + VRA conference in Seattle marked the 44th Annual Conference of ARLIS/NA, the 34th Annual Conference for VRA, and the third time these two organizations have combined forces for a joint conference. The occasion was preceded by a one-day THATCamp digital humanities unconference on March 8, and followed by the annual meeting of the Association of Architecture School Librarians (AASL). Meanwhile, the joint conference itself offered a packed schedule of excellent sessions, events, and opportunities to participate in combined committee meetings representing the shared interests and priorities of these two organizations.
NDSR was well-represented at ARLIS/NA + VRA. A panel called “Terra Fluxus: Surveying the Digital Information Landscape of Environmental Design” organized by NDSR-NYC alum Karl-Rainer Blumenthal included current NDSR-DC resident Valerie Collins speaking about her work at the American Institute of Architects. In “Duty Now for the Future,” a panel discussion of the importance of internships for providing Library/Archives/Information students with much-needed practical education and experience, Sumitra Duncan, a mentor for the NDSR-NYC 2014-15 cohort, discussed her development of a web archiving practicum at the Frick Art Reference Library. George Coulbourne, Chief of Internships & Fellowships, National & International Outreach, at the Office of Strategic Initiatives, Library of Congress, spoke at a Conference Capstone session. He provided a high-level view of the NDSR programs, from initial development to current and ongoing revisions of and improvements to the curriculum, and discussed NDSR’s relationship to the Library of Congress DPOE program and the LC’s overall vision for providing ongoing training and leadership in digital preservation.
“The Web Sits for its Portrait,” a panel about web archiving organized by Mark Bresnan, Head of Bibliographic Records at the Frick Art Reference Library, included discussions of web archiving initiatives at the Sterling and Francine Clark Art Institute Library, the New York Art Resources Consortium (NYARC), and Rhizome. Penny Baker spoke about the Clark’s collection of web resources relating to the Venice Biennale, and Lily Pregill presented on NYARC’s integration of their collections of web materials captured with Archive-It into NYARC Discovery, the consortium’s new discovery platform. I presented on Rhizome’s development of Webrecorder as an accessible graphical interface for users to create individualized “high-fidelity” web archives of dynamic content, and also discussed Rhizome’s use of Webrecorder for preserving born-digital artworks.
Two Tools for DIY Web Archives: Web Archive Player and pywb
The two tools I will cover in this post allow the user flexibility in managing archived web materials locally. Web Archive Player provides a simple graphical user interface for replaying the contents of WARC and ARC files, and pywb allows for building, indexing, and replaying collections of archived web materials on your own computer. The instructions in this post are all taken from the documentation in Ilya Kreymer’s GitHub repositories, which provide clearly written and thorough directions for installing and using all of this software. In this post, I’ll cover just the basic steps to get you started. If you’re interested in more complex features for these tools, I’d definitely recommend checking out Ilya’s docs to investigate other options not covered here.
Web Archive Player: View those WARCs
Web Archive Player is a simple desktop application for viewing the contents of WARC and ARC files created with Webrecorder or other programs. To install Web Archive Player, go to https://github.com/ikreymer/webarchiveplayer, where you can download Web Archive Player for OS X or Windows. Double-click the downloaded file to open it. (For OS X, open the .dmg file to mount the volume and extract the player, then add it to your Applications folder.)
When you open Web Archive Player, a file dialogue box appears that prompts you to select which WARC or ARC you’d like to access.
Choose the file you’d like to view, and Web Archive Player opens a browser tab for you at http://localhost:8090/ that lists all the pages in this file. Click on any page in this list to view it, or scroll to the bottom of the list and enter a URL in the Search box to find a particular page.
Build, index, and replay web archive collections with pywb
Web Archive Player is great for conveniently checking the contents of individual archive files. Pywb provides even more powerful capabilities for creating and browsing your own collections of WARC files. To get pywb, go to https://github.com/ikreymer/pywb. Ilya notes that pywb has been tested on Mac, Linux, and Windows. So far I’ve only tested pywb on Mac and Linux; these instructions will apply to those operating systems. I will try to update this post with instructions for setting up pywb on Windows by the end of my residency!
First, if you’re using a Mac, install Homebrew if you don’t already have this on your machine. I recommend checking out a blog post on this topic by my NDSR-NYC cohort member Dinah Handel. Dinah provided a very clear and easy-to-follow description of installing Homebrew in her blog post on the open-source A/V tools she’s developing in her residency at CUNY-TV.
Then, make sure you have Python installed; you’ll also need pip, which is a tool for installing and managing Python packages. You can check whether Python is installed by opening a terminal window and using this command:
which python
To run pywb, you’ll need a recent Python version. You can check which version you have with:
python --version
As long as you’re running 2.7+, you should be all set. Ilya is currently working on a new version which will support Python 3.0.
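If you’d like to check for Python and pip in one go, here is a minimal sketch. The helper name check_tool is my own invention for illustration, not something that ships with pywb; it only reports whether a command can be found on your PATH, so you still need python --version to confirm the version number.

```shell
# check_tool is a hypothetical helper name used only for this example.
# It prints where a command lives, or a note that it could not be found.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

# The two tools this post relies on before installing pywb.
check_tool python
check_tool pip
```

If either line reports "not found", work through the setup steps described next before moving on.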
If you do not have Python installed, or if you need to update Python, I recommend the documentation provided at The Hitchhiker’s Guide to Python. The instructions at this site worked well for me, using a laptop running OS X 10.10.5. This page also has information about installing pip and setuptools via Homebrew. The steps at this site are well-documented and easy to follow, but of course you can use any documentation source you like for setting up Python and pip.
Once you have these setup steps complete, you are ready to install pywb. The command for this is:
pip install pywb
You also need to create a new directory where you will store your collections of web archives on your computer. The example below creates it inside your home directory (that is what the ~/ prefix means); substitute a different path if you’d like it to live somewhere else:
mkdir ~/name_of_directory
Then change into the new main web archives directory:
cd ~/name_of_directory
From here, you’ll use a pywb utility called wb-manager for setting up and managing collections. To create your first collection inside the main web archives directory you just created, use this command:
wb-manager init name_of_collection
Now when you look inside the web archives directory, you’ll see that wb-manager has also created a couple of other subdirectories automatically. It created “collections” inside our main web archives directory, and put our new collection folder (“name_of_collection” in the example command) inside “collections”. It also created other required directories (“static” and “templates”) inside our main web archives directory.
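One quick way to see this layout for yourself is to list the folders from the terminal. This is just a sketch using the standard find command, not a pywb feature; run it from your main web archives directory.

```shell
# List every directory up to two levels below the current one,
# sorted so that each folder appears next to its subfolders.
find . -maxdepth 2 -type d | sort
```

Based on the description above, the listing should include “collections”, “static”, and “templates”, with your collection folder nested inside “collections”. The exact set of folders may differ between pywb versions.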
At this point you’re ready to add WARC files to your brand-new collection. Make sure you do this from the main web archives directory, not from inside the “collections” folder, or from within an individual collection.
Add WARC files to a collection with this command:
wb-manager add name_of_collection path/to/warc
Wait, what WARCs?
Well, pretty much any WARCs you’d like to include; these might be WARCs you’ve downloaded from Webrecorder, or generated from manual crawls via wget, wpull, Heritrix, or another process. If you’re using Webrecorder, at the Storage tab for any of your collections you can download WARCs from pages you’ve recorded. I recommend keeping WARCs you’d like to add to your collections in another directory within the web archive directory you created, just to keep the path simple when adding WARCs to the collection via wb-manager, but you can store them wherever you’d like.
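If you have several WARCs waiting in one folder, a small shell loop can add them one at a time. This is a rough sketch rather than anything from Ilya’s docs: the function name add_all_warcs and the folder name “warcs” are placeholders I made up, and it assumes your files end in .warc or .warc.gz. It uses only the wb-manager add command shown above, and should be run from the main web archives directory.

```shell
# add_all_warcs COLLECTION DIR
# Hypothetical helper: adds every .warc and .warc.gz file in DIR to
# COLLECTION by calling "wb-manager add" once per file.
add_all_warcs() {
    collection="$1"
    warc_dir="$2"
    for warc in "$warc_dir"/*.warc "$warc_dir"/*.warc.gz; do
        # Skip a pattern that matched no files.
        [ -e "$warc" ] || continue
        echo "Adding $warc"
        wb-manager add "$collection" "$warc"
    done
}

# Example call; "warcs" is an assumed folder name, change it to yours.
add_all_warcs name_of_collection warcs
```

If the folder is empty or does not exist, the loop simply does nothing.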
An especially nice feature of wb-manager is that it automatically indexes the files as it adds them to your collection. You can also manually re-index your collection. Just run:
wb-manager reindex name_of_collection
Now you’re all set to view and explore the collection you created. From within your main web archives directory (go back to the main directory if you’ve moved around inside the “collections” or “archives” folders), run:
wayback
Then open a browser at http://localhost:8080/, and you can see the collection you made listed there.
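If the page doesn’t load, it can help to check from a second terminal window whether anything is answering at that address. Here is a small sketch using curl; the function name check_wayback is a placeholder of my own, and the default address is simply the one given above.

```shell
# check_wayback [URL]
# Hypothetical helper: reports whether a web server answers at URL.
# Defaults to the address used in this post.
check_wayback() {
    url="${1:-http://localhost:8080/}"
    if curl --silent --fail --output /dev/null "$url"; then
        echo "wayback is answering at $url"
    else
        echo "nothing is answering at $url (is wayback running?)"
    fi
}

check_wayback
```

Keep in mind that wayback has to stay running in its own terminal window while you browse the collection.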
Go forth and collect!
With these basic steps, you can get started building your own test collections. It’s a great way to get a sense of how you might define the boundaries of an archive by what context you choose to preserve. It’s also very satisfying to be able to make changes to your collection when you discover more content that you’d like to include; just capture any new pages with Webrecorder or another method, download the new WARC, add it to your collection, reindex, and replay the collection with wayback to view your changes.
I hope you’ll enjoy experimenting with these programs. Make sure to check out Ilya Kreymer’s documentation for more details and other options to try, and please feel free to reach out with any comments or questions!