Extracting Text from Our Collection of PACER Documents

Michael Lissner

We're getting ready to launch a brand new search engine for PACER content. When it launches, one of the big features it will have is full-text search for the millions of documents that people have submitted using our RECAP system. To our knowledge, this will be the first free system for searching PACER content in this way, allowing you to look up documents by any word they might contain.

The big problem with this goal? We have about a million PDFs that consist only of images. Some of these are actually quite beautiful:

Handwritten Motion

A beautiful handwritten motion. It goes on like this for 46 pages.

But others are hideous:

Log from 1957

An 84-page log from 1957. It's come a long way just to appear on this blog today.

But no matter how a document looks, we want to extract the text so that we can make it searchable. We do this with a technique called Optical Character Recognition (OCR), which looks at each pixel in each page of each document and tries to figure out what letter it is a part of. As you might expect, this can take a while when you're processing millions of documents averaging 9.1 pages each.
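To make the per-page idea concrete, here is a minimal sketch of how a pipeline might decide which pages need OCR at all. It is not our actual code: the names `needs_ocr`, `extract_document_text`, and `ocr_page`, and the 20-character threshold, are all hypothetical, and the OCR engine itself is passed in as a plain function so the sketch stays independent of any particular library.

```python
from typing import Callable, Iterable, List


def needs_ocr(embedded_text: str, min_chars: int = 20) -> bool:
    """Treat a page as image-only when it has almost no embedded text.

    min_chars is an arbitrary, hypothetical threshold.
    """
    return len(embedded_text.strip()) < min_chars


def extract_document_text(
    embedded_pages: Iterable[str],
    ocr_page: Callable[[int], str],
) -> List[str]:
    """Return one text string per page, running OCR only where needed.

    embedded_pages: text already embedded in each page ("" for scans).
    ocr_page: hypothetical callable that OCRs the page at a given index.
    """
    texts = []
    for index, embedded in enumerate(embedded_pages):
        if needs_ocr(embedded):
            # Scanned page: fall back to the (slow) OCR step.
            texts.append(ocr_page(index))
        else:
            # The page already carries text, so keep it as-is.
            texts.append(embedded)
    return texts
```

The point of the split is that OCR is the expensive step, so a pipeline along these lines would only pay for it on pages that are really just images.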

About a month ago we started working on this using two very powerful computers with 40 CPU cores between them. Both have been working very hard on extracting text from these documents. For example, this is what one of them looks like right now:

A system manager showing lots of cores fully pegged.

This shows 24 CPUs each at 100% utilization.

As of today, we have only about 100,000 more documents to go, and we expect we'll be done in about another ten days.
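For a rough sense of what those numbers imply, here is a back-of-the-envelope calculation. It assumes the remaining documents also average 9.1 pages and that all 40 cores stay busy around the clock, neither of which is guaranteed; the function name `ocr_throughput` is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ocr_throughput(docs_remaining, days_remaining, pages_per_doc, cores):
    """Estimate the processing rates implied by a backlog and a deadline."""
    docs_per_day = docs_remaining / days_remaining
    pages_per_day = docs_per_day * pages_per_doc
    pages_per_core_per_day = pages_per_day / cores
    # Average time one core would spend on one page at that rate.
    seconds_per_page = SECONDS_PER_DAY / pages_per_core_per_day
    return {
        "docs_per_day": docs_per_day,
        "pages_per_day": pages_per_day,
        "seconds_per_page_per_core": seconds_per_page,
    }


# Figures from this post: 100,000 documents, ten days, 9.1 pages, 40 cores.
estimate = ocr_throughput(100_000, 10, 9.1, 40)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions, this works out to about 10,000 documents a day, or roughly 38 seconds of one core's time per page.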

In the meantime, we're working on the software side of this project, and we will be launching this feature soon!


© 2021 Free Law Project. Content licensed under a Creative Commons BY-ND 4.0 International license, except where indicated.