Illustration by Cheri Marshall for the Urban Institute

Creating an archive of Urban Institute publications using the Box API and Python

At the Urban Institute, we’ve been working for more than 50 years to produce evidence-based economic and social policy research. A good portion of that research is available on our main website, urban.org, which launched in October 1996, but a significant portion pre-dates the website. Up to now, older research has been available only in paper form. When we moved our office to 500 L’Enfant Plaza in 2019, we cleared out a lot of paper. But in the spirit of Marie Kondo, we wanted to save the publications and papers that sparked joy. As part of that effort, our communications team and researchers scanned hundreds of these older publications and converted them into PDF documents.

The resulting archive project was a collaborative internal effort of the Tech and Data group and the communications group, with support from our library staff. Eventually, we scanned hundreds of these publications and stored the PDF documents in a single Box folder, which was a big win. But given the nature of scanned PDF documents, we needed a way to make them accessible and searchable by Urban researchers who might be interested in them. We also wanted to allow researchers to contribute artifacts to the archive going forward, so that publications we missed could be included at any time.

In this post, I’ll describe part of the technical effort to ensure the documents were easily searchable, where we leveraged the Box application programming interface (API) and Box Python SDK to access the documents stored in Urban’s Box instance, added metadata, and created a basic search interface. Finally, I’ll supply an example program that uses the Box API to read and write custom metadata, for those of you who may want to explore further.

Why use metadata?

First, why attach metadata to objects in a file system like Box? With good metadata, a collection of files that does not have readable text, such as old photos and images of scanned documents, goes from being difficult to search and classify to being easy to search and classify. For an example of useful metadata, I use Google Photos to save virtually all the pictures I take with my iPhone. My digital pictures contain a default set of metadata (in EXIF, or exchangeable image file format), including geographic location. So I can ask Google Photos to show me all the pictures I’ve taken in a certain place, or a list of all the places in which I’ve taken pictures.

Recently, I started scanning some old family slides. Those images don’t contain geolocation information, so there would be no way to find a scanned slide of the family trip to Yellowstone Park using location metadata — I would just have to look at each image and annotate each picture with metadata on location to enable such a search. In creating the Urban Archive, we had roughly the same situation — scanned PDFs but no reliable metadata (e.g., title, author, publication date, abstract, topic). Luckily, Box gives us the data structures we need to make our older, nondigital publications ready for search by adding metadata.

In another piece of good fortune, Urban still maintains the old databases that have all the data we need about the scanned publications. So, much of the work in the project involved developing methods for attaching these metadata by matching each publication with a record in our older databases. The first step was to use optical character recognition (OCR) to translate the scanned text into machine-readable text. Then, using text processing tools such as SequenceMatcher in Python’s difflib, we could (partially) automate the process of matching records with the scanned PDF files by looking at the (usually comprehensible) OCR’ed text inside the PDFs and comparing it with titles and other columns in the databases. SequenceMatcher is a powerful tool, and it allowed us to find the best “fuzzy” match between fields such as title and authors in the database records and strings found in the PDF text.
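A minimal sketch of this kind of fuzzy matching might look like the following. It is an illustration under assumptions, not the archive project's actual code: the record layout (a dict with "pub_id" and "title" keys), the 0.8 threshold, and the choice to compare each title against the first few lines of OCR text are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_title_match(
    ocr_text: str,
    records: list,
    threshold: float = 0.8,  # assumed cutoff; tune against real data
    max_lines: int = 10,     # assumed: titles appear near the top of page one
) -> tuple:
    """Return (record, score) for the best title match, or (None, score)."""
    lines = [ln for ln in ocr_text.splitlines() if ln.strip()][:max_lines]
    best_record: Optional[dict] = None
    best_score = 0.0
    for record in records:
        # Score a record by its best-matching line in the OCR text
        score = max((similarity(record["title"], ln) for ln in lines), default=0.0)
        if score > best_score:
            best_record, best_score = record, score
    if best_score < threshold:
        return None, best_score  # too weak to trust; leave for manual review
    return best_record, best_score
```

Matches that fall below the threshold are returned as None so they can be reviewed by hand, which fits the "partially automated" nature of the process.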

Once correlated with a record, we could easily write information like title, authors, abstract, topics, and publication date into the Box metadata, in a process I describe in the following section. Box then provides us a basic search on the metadata as part of its web interface. We then uploaded a copy of the metadata to our TDNet library service (many thanks to their helpful staff), which provided a search interface that is better suited for publications. This way, the Box metadata becomes the reference system for the Archive, and TDNet supplies a search interface that integrates with our Library website. As with finding the slides of our Yellowstone trip, I can now ask for all publications tagged with the topic “Housing Discrimination,” or whatever other keywords were tagged in our old databases, or any of the other associated metadata fields we had available.

Example program — reading and writing metadata with the Box API

I don’t want to describe everything that went into the archive effort, but I will summarize the steps I’ve described above and provide some sample code for reading and writing metadata in Box. The Box API is a powerful tool that allows you to manipulate metadata for Box folders and files, and it was just what we needed. And the Box Python SDK gives you a Pythonic way to use the API endpoints. Here is what we’ll cover:

  1. Getting started
  2. Creating a Box Metadata Template and applying it to a Box folder
  3. Entering metadata into the form for our documents
  4. Installing the Box SDK for Python
  5. Reading and writing metadata from a document in Box using methods supplied by the Python Box SDK package

Getting started

If you want to try out the code, you can use a Box instance you have access to, or you can sign up for a free Box account (10 GB limit at this writing) and set up test documents. The free account gives you all you need to create a Box app and a developer token for the app, which is what you need to use the API. And the nice thing about having your own account is that you are the administrator. Python will work on any platform: Mac, Windows, Linux, or a hosted environment like PythonAnywhere. You can also use the Box Postman collection, a great way to try out your API calls. Once you get something to work in Postman, you can use the Python code generated by Postman in your own program. The Box API documentation also provides example code for cURL, .NET, Python, Node, and iOS for any given API endpoint.

Creating a Box Metadata Template and applying it to a Box folder

To have metadata to work with, we first need to create a metadata template and then apply that template to a Box folder. Box’s documentation on metadata templates explains how to do this. You need to either be an administrator in Box or ask another Box administrator to help you.

Once the metadata template is created, the metadata fields are available to apply to a folder in the Box interface, and you can enter or write metadata to a given file under that folder.

Entering metadata into the form for our documents

Here is a sample document we uploaded to the Box folder “Urban Institute Historical Publications.” The folder has had a metadata template applied to it, which means that any document we upload to that folder will have the custom metadata fields available for the document. Box also has default metadata — that is, properties for all files and folders. To provide unique identification, Box assigns all folders a folder ID, and all files a file ID.

For the example, we have a slimmed-down set of custom metadata fields: PubID (a unique ID number from our publication databases), title, authors, publication date, and abstract, all of which are text fields except publication date. In the screenshot here, we have filled in the template with data appropriate to the artifact. To be able to edit the metadata, you need to first click on the “M” button to display the metadata, and then click on the pencil icon to edit. A Save button (not shown) lets you save your entries. At this point, we can manually enter metadata for any publication document.
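For readers who prefer to see the field set as data, here is a rough sketch of how the five fields might be described in the shape that, as far as I understand it, Box's create-metadata-template endpoint accepts (scope, displayName, templateKey, and a list of typed fields). The field keys and the function name are hypothetical; check the real keys in your own Box instance before relying on them.

```python
def build_template_schema(display_name: str, template_key: str) -> dict:
    """Describe a five-field publication template as a plain dictionary."""
    # (key, display name, type); the keys are illustrative guesses, not Urban's
    fields = [
        ("pubId", "PubID", "string"),
        ("title", "Title", "string"),
        ("authors", "Authors", "string"),
        ("publicationDate", "Publication Date", "date"),
        ("abstract", "Abstract", "string"),
    ]
    return {
        "scope": "enterprise",
        "displayName": display_name,
        "templateKey": template_key,
        "fields": [
            {"type": ftype, "key": key, "displayName": label}
            for key, label, ftype in fields
        ],
    }
```

Every field is a text (string) field except the publication date, matching the description above.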

Now that we have a Box folder, at least one Box document, and some metadata, our next task will be to read the metadata for a file using Python and the Box SDK, and then write some new metadata using the API, instead of doing it manually. The reference for writing metadata is here: Box API endpoint for updating metadata on a file.
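To give a feel for what "writing metadata" means at the API level: as I understand it, the update endpoint takes a list of JSON Patch operations (for example "add" and "replace") rather than a whole new record. The sketch below builds such a list by comparing current and desired values. The function name and field names are hypothetical, and it only prepares the operations; it does not call Box.

```python
def build_metadata_patch(current: dict, desired: dict) -> list:
    """Build JSON Patch operations that turn `current` fields into `desired`."""
    ops = []
    for key, value in desired.items():
        if key.startswith("$"):
            continue  # keys such as $id or $template are managed by Box
        # Escape "~" and "/" as required for JSON Pointer paths (RFC 6901)
        path = "/" + key.replace("~", "~0").replace("/", "~1")
        if key not in current:
            ops.append({"op": "add", "path": path, "value": value})
        elif current[key] != value:
            ops.append({"op": "replace", "path": path, "value": value})
    return ops
```

Fields whose values are unchanged produce no operation, so re-running a load over the same records sends nothing for documents that are already up to date.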

Installing the Box SDK for Python, and accessing your Box instance

Directions for this step are available in the Box Python SDK documentation, but basically, you just need to run the following pip command:

pip install boxsdk

Or, on a platform like PythonAnywhere:

pip install --user boxsdk

Then you will have the many methods of the Box Python SDK available to you by importing the package with

from boxsdk import DevelopmentClient

To have a way to authenticate to Box, you need to first create a Box app, using the Developer Console. The Box developer documentation explains how to do that.

To create a client connection to your Box instance, you need a Developer Token, which is available on your Box Developer Console. The Developer Token lasts for only an hour or so, but it’s easy to revoke or refresh as needed. For longer-term applications, authenticate using OAuth2.

Here is an overview of the program “run_metadata_example.py.” The Python code is in a Gist that appears below.

  • Import pubsfun_examples, which contains two utility routines:
      • get_my_fields(boxmeta): Return a dictionary of fields from the Box metadata
      • get_my_props(boxprops): Return a dictionary of basic Box properties from the Box metadata
  • Connect to Box by calling DevelopmentClient, which prompts for the token
  • For a given file and folder (supply the IDs):
      • Get the Box properties and the custom metadata using client.file and print them out
      • Write new metadata using .update
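The outline above can be sketched roughly as follows. This is a hedged reconstruction, not the original program: the bodies of get_my_fields and get_my_props are my own guesses at what such helpers might do, show_and_update and main are hypothetical names, and the file ID, template key, and field name are placeholders you would replace with your own.

```python
def get_my_fields(boxmeta: dict) -> dict:
    """Return only the custom fields, dropping Box's own "$"-prefixed keys."""
    return {k: v for k, v in boxmeta.items() if not k.startswith("$")}


def get_my_props(boxprops) -> dict:
    """Return a few basic properties from a Box file object."""
    names = ("id", "name", "size", "created_at", "modified_at")
    return {n: getattr(boxprops, n, None) for n in names}


def show_and_update(client, file_id, scope, template, field, new_value) -> dict:
    """Print a file's properties and metadata, then replace one field."""
    box_file = client.file(file_id).get()
    print(get_my_props(box_file))

    metadata = client.file(file_id).metadata(scope=scope, template=template)
    print(get_my_fields(metadata.get()))

    # Queue a change to an existing field, then send it to Box
    updates = metadata.start_update()
    updates.update("/" + field, new_value)
    return get_my_fields(metadata.update(updates))


def main():
    # Imported here so the helpers above can be used without the SDK installed
    from boxsdk import DevelopmentClient

    client = DevelopmentClient()  # prompts for a Developer Token
    result = show_and_update(
        client,
        file_id="YOUR_FILE_ID",        # placeholder
        scope="enterprise",
        template="YOUR_TEMPLATE_KEY",  # placeholder
        field="title",                 # placeholder field key
        new_value="A corrected title",
    )
    print(result)
```

Call main() once the placeholders are filled in. The update step assumes the field already has a value; for a field that has never been set, the SDK's updates.add call is the likely alternative.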

When you’re using the DevelopmentClient to connect, you will get a lot of information in JSON (JavaScript Object Notation) when you call the API endpoints. This is handy when you’re debugging or trying to find out the real names of metadata fields, the enterprise scope, and the template name. Once you have everything working and switch to an OAuth2 connection, your program output will be less verbose. Also, to write metadata to your files, you will need to authorize your Box app with the proper permissions. Details are available in the Box developer documentation.

To conclude

Box proved to be a great solution for us as a content management system for the Urban Archive. As we were already using Box as our cloud-based file system and TDNet for our library service, the project had no need for new software acquisition and had zero budget impact in that respect. The archive project has been a great learning experience for making use of the Box API, and we hope to use it in other areas where our content is unstructured and could benefit from being organized with good metadata.

Digital preservation of your content is worth thinking about. But don’t take my word for it; ask internet pioneer Vint Cerf. The Urban Archive project is counting on archival PDF documents to preserve our publications and artifacts, which is great as long as you think of the publications as computer-readable versions of print documents. Other publications we are making these days at Urban, however, are more dynamic. How do we preserve these as hyperlinks deteriorate, operating systems change, software is updated, and servers are retired? This will be our next big challenge.

Recommended Links

The Role of Archives in Digital Preservation: https://cacm.acm.org/magazines/2018/1/223880-the-role-of-archives-in-digital-preservation/fulltext

-Douglas Murray
