Author Archives: Morgan McKeehan

Shared Conferences & DIY Web Archives


Hello readers! This past week was a busy one for NDSR-NYC. All five of us were able to attend the 2016 Code4Lib conference in Philadelphia, and Dinah gave a presentation with Ashley Blewer, Applications Developer at the New York Public Library, advocating for the benefits of sharing workflow documentation for A/V digitization and digital preservation processes. Genevieve will write more about Code4Lib and Dinah & Ashley’s talk later this week.

In this post, I will follow up on a presentation I gave after Code4Lib, at a joint conference of the Art Libraries Society of North America (ARLIS/NA) and the Visual Resources Association (VRA), about Rhizome’s use of Webrecorder for preserving and providing access to born-digital artworks. I will describe how to use two web archiving tools, also created by Ilya Kreymer, that complement the ability to capture web content via Webrecorder.io.

Art Libraries + Visual Resources: Joint Conference in Seattle

The 2016 ARLIS/NA + VRA conference in Seattle marked the 44th Annual Conference of ARLIS/NA, the 34th Annual Conference for VRA, and the third time these two organizations have combined forces for a joint conference. The occasion was preceded by a one-day THATCamp digital humanities unconference on March 8, and followed by the annual meeting of the Association of Architecture School Librarians (AASL). Meanwhile, the joint conference itself offered a packed schedule of excellent sessions, events, and opportunities to participate in combined committee meetings representing the shared interests and priorities of these two organizations.


NDSR was well-represented at ARLIS/NA + VRA. A panel called “Terra Fluxus: Surveying the Digital Information Landscape of Environmental Design” organized by NDSR-NYC alum Karl-Rainer Blumenthal included current NDSR-DC resident Valerie Collins speaking about her work at the American Institute of Architects. In “Duty Now for the Future,” a panel discussion of the importance of internships for providing Library/Archives/Information students with much-needed practical education and experience, Sumitra Duncan, a mentor for the NDSR-NYC 2014-15 cohort, discussed her development of a web archiving practicum at the Frick Art Reference Library. George Coulbourne, Chief of Internships & Fellowships, National & International Outreach, at the Office of Strategic Initiatives, Library of Congress, spoke at a Conference Capstone session. He provided a high-level view of the NDSR programs, from initial development to ongoing revisions and improvements to the curriculum, and discussed NDSR’s relationship to the Library of Congress DPOE program and the LC’s overall vision for providing ongoing training and leadership in digital preservation.

“The Web Sits for its Portrait,” a panel about web archiving organized by Mark Bresnan, Head of Bibliographic Records at the Frick Art Reference Library, included discussions of web archiving initiatives at the Sterling and Francine Clark Art Institute Library, the New York Art Resources Consortium (NYARC), and Rhizome. Penny Baker spoke about the Clark’s collection of web resources relating to the Venice Biennale, and Lily Pregill presented on NYARC’s integration of their collections of web materials captured with Archive-It into NYARC Discovery, the consortium’s new discovery platform. I presented about Rhizome’s development of Webrecorder as an accessible graphic interface for users to create individualized “high-fidelity” web archives of dynamic content, and also discussed Rhizome’s use of Webrecorder for preserving born-digital artworks.

Two Tools for DIY Web Archives: Web Archive Player and pywb

The two tools I will cover in this post allow the user flexibility in managing archived web materials locally. Web Archive Player provides a simple graphical user interface for replaying the contents of WARC and ARC files, and pywb allows for building, indexing, and replaying collections of archived web materials on your own computer. The instructions in this post are all taken from the documentation in Ilya Kreymer’s GitHub repositories, which provide clearly written and thorough directions for installing and using all of this software. In this post, I’ll cover just the basic steps to get you started. If you’re interested in more complex features for these tools, I’d definitely recommend checking out Ilya’s docs to investigate other options not covered here.

Web Archive Player: View those WARCs

Web Archive Player is a simple desktop application for viewing the contents of WARC and ARC files created with Webrecorder or other programs. To install Web Archive Player, go to https://github.com/ikreymer/webarchiveplayer, where you can download Web Archive Player for OS X or Windows. Double-click the downloaded file to open it (for OS X, open the .dmg file to mount the volume and extract the player, then add it to your Applications folder).

When you open Web Archive Player, a file dialogue box appears that prompts you to select which WARC or ARC you’d like to access.

Web Archive Player file dialogue box

Here’s what the file dialogue box will look like when you open Web Archive Player. Just choose the WARC you’d like to view.

Choose the file you’d like to view, and Web Archive Player opens a browser tab for you at http://localhost:8090/ that lists all the pages in this file. Click any page in the list to view it, or scroll to the bottom of the list and enter a URL in the Search box to find a particular page.

Web Archive Player running in localhost:8090

If you scroll down this long list of URLs, there’s a search box that lets you search for a particular page, if you know there’s one within this WARC that you’d like to view.

Build, index, and replay web archive collections with pywb

Web Archive Player is great for conveniently checking the contents of individual archive files. Pywb provides even more powerful capabilities for creating and browsing your own collections of WARC files. To get pywb, go to https://github.com/ikreymer/pywb. Ilya notes that pywb has been tested on Mac, Linux, and Windows. So far I’ve only tested pywb on Mac and Linux; these instructions will apply to those operating systems. I will try to update this post with instructions for setting up pywb on Windows by the end of my residency!

First, if you’re using a Mac, install Homebrew if you don’t already have this on your machine. I recommend checking out a blog post on this topic by my NDSR-NYC cohort member Dinah Handel. Dinah provided a very clear and easy-to-follow description of installing Homebrew in her blog post on the open-source A/V tools she’s developing in her residency at CUNY-TV.

Then, make sure you have Python installed; you’ll also need pip, which is a tool for installing and managing Python packages. You can check to see if you have Python installed by opening a terminal window and using this command:

which python

To run pywb, you’ll need a recent Python version; you can check yours with:

python --version

As long as you’re running 2.7+, you should be all set. Ilya is currently working on a new version which will support Python 3.0.
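If you’d like to run both checks in one go, here is a minimal sketch. The helper name check_tool is my own invention for illustration, not part of Python, pip, or pywb, and the 2.7 threshold simply mirrors the requirement mentioned above.

```shell
# Hypothetical helper: print where a tool lives on the PATH, or say it is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: $(command -v "$1")"
  else
    echo "$1: not found"
  fi
}

check_tool python
check_tool pip

# If python is present, ask it whether it is at least version 2.7.
if command -v python >/dev/null 2>&1; then
  python -c 'import sys; print("2.7 or newer" if sys.version_info >= (2, 7) else "older than 2.7")'
fi
```

If either tool shows up as “not found,” that’s your cue to install it before moving on.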

If you do not have Python installed, or if you need to update Python, I recommend the documentation provided at The Hitchhiker’s Guide to Python. The instructions at this site worked well for me, using a laptop running OS X 10.10.5. This page also has information about installing pip and setuptools via Homebrew. The steps at this site are well-documented and easy to follow, but of course you can use any documentation source you like for setting up Python and pip.

Once you have these setup steps complete, you are ready to install pywb. The command for this is:

pip install pywb

You also need to create a new directory where you will store your collections of web archives on your computer. The example below creates it inside your home directory (that’s what the ~ means); change the path if you’d like to keep your web archives somewhere else:

mkdir ~/name_of_directory

Then change into the new main web archives directory:

cd ~/name_of_directory

From here, you’ll use a pywb utility called wb-manager for setting up and managing collections. To create your first collection inside the main web archives directory you just created, use this command:

wb-manager init name_of_collection

Now when you look inside the web archives directory, you’ll see that wb-manager has also created a couple of other subdirectories automatically. It created “collections” inside our main web archives directory, and put our new collection folder (“name_of_collection” in the example command) inside “collections”. It also created other required directories (“static” and “templates”) inside our main web archives directory.

view of web archives directory

In this example, the directory I am using for my web archives (“name_of_directory” in the example command) is very gracefully named “m-staging.webenact.” Inside m-staging.webenact you can see the “collections” directory as well as the “static” and “templates” directories that were created automatically by wb-manager when I created my new collection (“name_of_collection” in the example command, not shown in this image because it’s inside the “collections” folder). Check out Ilya’s docs for more instructions on what you can do with these other features!


view of web archives directory with many levels

In this series of commands you can see everything that’s going on inside m-staging.webenact. You can see that the collection I created is called ARLIS. The ARLIS collection also contains subdirectories called “archive”, “indexes”, “static”, and “templates”. The WARCs for the ARLIS collection are stored inside the “archive” folder.


 

At this point you’re ready to add WARC files to your brand-new collection. Make sure you do this from the main web archives directory, not from inside the “collections” folder, or from within an individual collection.  

Add WARC files to a collection with this command:

wb-manager add name_of_collection path/to/warc

terminal window: add WARCs to a collection

Run the “wb-manager add” command from the main web archives directory
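If you tend to lose track of where you are in the terminal (I do!), a small check like the one below can help before you add anything. This is only a sketch: in_archive_root is a hypothetical helper name, and it assumes that having a “collections” folder in the current directory is a good enough sign that you’re in the main web archives directory.

```shell
# Hypothetical guard: succeed only when the current directory contains
# a "collections" folder, as the main web archives directory does.
in_archive_root() {
  [ -d "./collections" ]
}

if in_archive_root; then
  echo "OK: run wb-manager add from here"
else
  echo "Not the main web archives directory: $(pwd)"
fi
```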

Wait, what WARCs?

Well, pretty much any WARCs you’d like to include; these might be WARCs you’ve downloaded from Webrecorder, or generated from manual crawls via wget, wpull, Heritrix, or another process. If you’re using Webrecorder, at the Storage tab for any of your collections you can download WARCs from pages you’ve recorded. I recommend keeping WARCs you’d like to add to your collections in another directory within the web archive directory you created, just to keep the path simple when adding WARCs to the collection via wb-manager, but you can store them wherever you’d like.
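If you do keep your WARCs together in one folder, a short loop can add them one at a time with the same “wb-manager add” command shown above. Treat this as a sketch under a few assumptions: add_all_warcs and the DRY_RUN switch are names I made up for illustration, the folder name “warcs” is just an example, and I’m assuming your files end in .warc or .warc.gz.

```shell
# Hypothetical helper: add every WARC in a directory to one collection.
# Set DRY_RUN=1 to print the commands instead of running them.
add_all_warcs() {
  collection=$1
  warc_dir=$2
  for warc in "$warc_dir"/*.warc "$warc_dir"/*.warc.gz; do
    [ -e "$warc" ] || continue   # skip patterns that matched nothing
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "wb-manager add $collection $warc"
    else
      wb-manager add "$collection" "$warc"
    fi
  done
}

# Preview what would be added from a folder called "warcs" (an example name).
DRY_RUN=1 add_all_warcs name_of_collection warcs
```

Running it with DRY_RUN=1 first lets you look over the list of files before anything is actually added to the collection.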

An especially nice feature of wb-manager is that it automatically indexes the files as it adds them to your collection. You can also manually re-index your collection; just run:

wb-manager reindex name_of_collection

Now you’re all set to view and explore the collection you created. From within your main web archives directory (go back to the main directory if you’ve moved around inside the “collections” or “archive” folders), run:

wayback

Then open a browser at http://localhost:8080/, and you can see the collection you made listed there.

view of terminal window, with wayback command

It is very exciting to use the “wayback” command. I recommend yelling out “wayback!”

Go forth and collect!

With these basic steps, you can get started building your own test collections. It’s a great way to get a sense of how you might define the boundaries of an archive by what context you choose to preserve. It’s also very satisfying to be able to make changes to your collection when you discover more content that you’d like to include; just capture any new pages with Webrecorder or another method, download the new WARC, add it to your collection, reindex, and replay the collection with wayback to view your changes.

I hope you’ll enjoy experimenting with these programs. Make sure to check out Ilya Kreymer’s documentation for more details and other options to try, and please feel free to reach out with any comments or questions!