Unlocking the Record of American Creativity—with Your Help
The New York Public Library (NYPL) is embarking on a pilot project to extract the data from a publication known as the Catalog of Copyright Entries, published periodically by the United States Copyright Office. The volumes have already been digitized and are freely available through the Internet Archive; our project aims to extract and parse the data contained in the records in order to create a searchable database that will aid copyright research.
Background
One of the best records of American creativity is currently locked in a set of paper records that are difficult to search and require a high level of expertise to use. These historical records of the United States Copyright Office document a significant part of the literary, musical, artistic and scientific production of the United States from 1870 to 1977. Unlocking these records will have a number of benefits, from determining the copyright status of millions of items to enabling the study of the production of creative works in new ways. The records represent a rich trove of data vital to copyright owners, copyright users and academics.
The Project
Today, with your help, we begin the difficult process of unlocking these records. Our goal is to create a dataset to enable accurate searching of these historical records. If we’re successful, in the future a user should be able to retrieve all of the records related to a copyrighted work. That means a user should not only see the text of the record, but should be able to retrieve the image files of the records to verify the information in the database. A user should also be able to search specific fields to help narrow searches and make it easier to retrieve relevant records. The data will also be programmatically accessible through APIs so it can be integrated into other tools. At all times, NYPL is committed to making this data available for free and without restriction on any type of use. This data is, after all, owned by the citizens of the United States.
The task before us is vast: there are some 70 million historical records in a variety of formats. To make this task more manageable, we will begin with one set of records, the Catalog of Copyright Entries (CCEs). The CCEs comprise about 450,000 pages and are printed compilations of brief registration and renewal records. The Copyright Office published these records at regular intervals, ranging in frequency from semi-weekly to semi-annually. The CCEs are divided by classes of works, such as books, periodicals, music, drama, maps, and photographs. For our pilot, we will focus on a small subset of the CCEs: 10,000 pages of book registration records published between 1923 and 1964.
Although these records were digitized by the Internet Archive, converting the images into a highly accurate database requires some work. The records need to be transcribed and the data parsed to create a reliable dataset. Because our goal is to make the dataset reliable, the accuracy of this transcription is important. For example, during certain periods of U.S. copyright law, a copyright owner had to renew their copyright or the work would enter the public domain. That means users of these records are often trying to prove a negative: that the copyright in a particular item was not properly renewed. If the data were inaccurately transcribed, there would be a significant chance that a user would fail to locate an existing renewal record and would proceed under the false assumption that the copyright in the work was not renewed.
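To make the risk concrete, here is a minimal sketch in Python of how a single mis-transcribed character can hide a renewal record from an exact-match search. The titles are invented for illustration, and the helper names `exact_lookup` and `fuzzy_lookup` are hypothetical; this is not the project's search method. Approximate matching with the standard library's `difflib` is shown only as one possible mitigation, not as a substitute for accurate transcription.

```python
import difflib

# Invented titles standing in for a transcribed renewal index.
# The first entry contains a typical OCR slip: lowercase "l" in place of "I".
transcribed_renewals = ["The lnvented Garden", "A Made-Up History"]


def exact_lookup(title, index):
    """Return the title if it appears verbatim in the index, else None."""
    return title if title in index else None


def fuzzy_lookup(title, index, cutoff=0.8):
    """Return index entries whose similarity to the title meets the cutoff."""
    return difflib.get_close_matches(title, index, n=3, cutoff=cutoff)


query = "The Invented Garden"

# The exact search misses the record, which a user could misread as
# "this work was never renewed."
exact_result = exact_lookup(query, transcribed_renewals)

# An approximate search still surfaces the mis-transcribed entry.
fuzzy_result = fuzzy_lookup(query, transcribed_renewals)

print(exact_result)
print(fuzzy_result)
```

Approximate matching can also return false positives, so a user would still need to check any candidate against the page image.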
We also expect that some users will want the ability to search over specific fields of information. Limiting a search to the copyright owner, book title, registration date, or registration number may be more efficient than a search over all of the data. That means the second task after transcription is to parse the record data into the respective fields. In each CCE volume, the record is a block of text and lacks field names. That means the classification of each word within the text block must be inferred from the order and context of the word. In some cases, an individual CCE volume includes a key listing the fields of information so that a user can decipher the block of text. Unfortunately, not all of the CCEs selected for the pilot include this key. We, along with DCL, our partner on this project, have attempted to identify all of the fields that exist within the pilot dataset.
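As a rough illustration of what inferring fields from order and context can look like, the Python sketch below parses one invented record into the four fields mentioned above. The record layout (title, then a © sign and owner, then date, then registration number) and the sample entry are assumptions made for this example only; actual CCE entries vary by volume and carry more fields, and this is not the parser used in the pilot.

```python
import re
from typing import Optional

# Assumed, simplified layout (hypothetical, for illustration only):
#   Title. © Copyright owner; DDMonYY; REGNUMBER.
RECORD_PATTERN = re.compile(
    r"^(?P<title>.+?)\.\s+"                                   # text before the © sign
    r"©\s*(?P<copyright_owner>[^;]+);\s*"                     # name after the © sign
    r"(?P<registration_date>\d{1,2}[A-Z][a-z]{2}\d{2});\s*"   # e.g. 5Jan50
    r"(?P<registration_number>[A-Z]+\d+)\.?$"                 # e.g. A12345
)


def parse_record(text: str) -> Optional[dict]:
    """Split one block of record text into named fields.

    Returns None when the text does not fit the assumed layout, so that
    unparsed records can be set aside for review instead of guessed at.
    """
    match = RECORD_PATTERN.match(text.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


# Invented sample entry, not taken from any CCE volume.
sample = "An invented title. © Jane Doe; 5Jan50; A12345."
fields = parse_record(sample)
print(fields)
```

Because no field is labeled in the text, the pattern relies entirely on position and punctuation; a record that departs from the assumed order returns `None` rather than a wrong classification.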
Your Input Needed
This is where we need your help. We need you to review the list of fields we’ve identified to confirm these are the right fields for books and pamphlets. We also have some open issues to resolve in our parsing efforts that you can view and comment on in our repository for this project. You can submit comments on our repository or leave them here and we’ll make sure they are incorporated.
Our goal is to make a reliable dataset from these records. That means we will be making the transcribed and parsed data produced as part of the pilot open to the public. The pilot is the first step toward unlocking the data contained in these records. By starting with the CCEs, we hope to build a framework from which we can organize the other historical records held by the Copyright Office. We will keep you apprised of our progress in this important endeavor.