
Understanding File Formats

When you download your W2, you know it’s a PDF because of the ".pdf" at the end of the filename or because of the program that automatically launches when you open it. But how do you know if it’s a PDF 1.7? Or a PDF/A? Or a 3D PDF? (I hope not.) Or another one of the 50 different types of PDF? Or maybe it’s a Microsoft Word file with the wrong extension at the end of the filename?

You don’t need to know the exact file format of your W2 in order to file your taxes. But for libraries, archives, and museums that want to make files available for researchers decades in the future, knowing the exact file format is crucial. This series of posts will discuss how we in NYPL's Special Collections understand and use formats.

Defining a File Format

Wikipedia's definition of a file format, "a standard way that information is encoded for storage," is a start, but there's a lot to unpack in that definition. Different kinds of information are generally stored in different formats. Different formats can use different kinds of encodings. Over time, the same format can be versioned to encode information in a slightly different standard way. Different kinds of storage might require different formats.

When the first PDF format was launched in 1993, it was supposed to make it easy to display information in the same way regardless of the device being used to display it. Since then, new kinds of content, functionality, and devices have created dozens of "standard ways" to encode a PDF, and the future will bring even more variants.
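To make "versioned" a little more concrete: a PDF file normally announces its version in a short header near the very start of the file, such as %PDF-1.7. The Python sketch below reads that header. It is only an illustration, not the method used for our survey. The function name pdf_header_version is made up for this post, and searching the first 1024 bytes (rather than only the first line) is a leniency I'm assuming so that files with stray leading bytes still match. The header also isn't the whole story: later PDF versions let a file override the header version elsewhere in the document, and flavors like PDF/A are declared in the file's metadata rather than in the header.

```python
import re

# Matches a PDF header such as b"%PDF-1.7" and captures the two version digits.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return the version declared in the PDF header (e.g. "1.7"), or None.

    Hypothetical helper for illustration. It inspects only the header,
    so it cannot tell a PDF/A from an ordinary PDF of the same version.
    """
    # Assumption: tolerate junk before the header by searching the
    # first 1024 bytes instead of requiring the header at byte 0.
    match = PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return f"{major}.{minor}"
```

You could call it with the first bytes of a file, for example pdf_header_version(open("w2.pdf", "rb").read(1024)), where "w2.pdf" is a placeholder filename.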

Starting with a List

If that seems like a mess, have you ever tried to list all the different types of something simple, like tables? An ontology maintained by The Getty Museum defines over 60 kinds of tables by function, 30 kinds by form, 8 kinds by location, and 2 kinds by design. That's over 100 different kinds of tables, and you can probably think of other kinds of tables that aren't on any of those lists.

Trying to come up with a list of all file formats is an endless, seemingly impossible, but ultimately necessary task for all staff who work with digital materials at NYPL. A file’s format helps us determine how we acquire, describe, store, and provide access to our digital materials. When we find a PDF 1.7 in our collection, we know that it can have different preservation needs than a PDF/A-1b. So despite knowing that no list is perfect, our community has created and maintains a few file format registries.

The PRONOM registry is different from others since its entries are written in a way that lets us use tools to match files to a specific entry. For this blog series, we surveyed all of the files held by our Digital Archives Program. Here's a visualization of file formats from that survey:

Analysis of the number and size of files of each file format in NYPL Digital Archives

Of the tens of thousands of files in our collections of digital archives, the survey matched 129 different PRONOM-registered formats. It also identified 24 formats by extension or file contents. Extensions don't give us an absolute identification. Going back to PDFs, every kind of PDF uses the same ".pdf" extension. However, these groups are useful starting points for finding a more specific format. Finally, the survey shows ~9,000 files that could not be identified with a PRONOM ID or with just an extension. In order to identify these files, we will probably have to create new entries in the PRONOM registry.
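The difference between identifying a file by its contents and by its extension can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the tooling behind our survey: the SIGNATURES table holds just three well-known leading byte patterns, the function names are invented for this post, and real PRONOM entries describe formats in far more detail than a fixed prefix.

```python
from pathlib import Path

# Illustrative only: a few well-known byte patterns found at the start of files.
SIGNATURES = {
    b"%PDF-": "PDF (version not determined)",
    b"PK\x03\x04": "ZIP container (used by .docx, .xlsx, .epub, ...)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 compound file (used by legacy .doc, .xls, ...)",
}


def guess_by_contents(data: bytes):
    """Return a label if the leading bytes match a known signature, else None."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None


def identify(filename: str, data: bytes):
    """Return (method, guess), preferring contents over the extension.

    Hypothetical helper: falls back to the bare extension when the
    contents are not recognized, and gives up when there is neither.
    """
    by_contents = guess_by_contents(data)
    if by_contents is not None:
        return ("contents", by_contents)
    extension = Path(filename).suffix.lower()
    if extension:
        return ("extension", extension)
    return ("unidentified", None)
```

A file named "w2.pdf" whose first bytes are the OLE2 pattern would come back as ("contents", "OLE2 compound file (used by legacy .doc, .xls, ...)"), which is the "Word file with the wrong extension" case. A file with an unrecognized body falls back to whatever its extension claims, and a file with neither lands in the unidentified pile.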

Next time, I'll talk about what goes into a PRONOM ID and how it allows us to identify formats with confidence.

*If you've followed me to the bottom of this post, you'd probably also enjoy this 9-minute oral history of the PDF format from Computerphile.