The Cost of PACER Data? Around One Billion Dollars.
Recently, we started a new project to analyze a few million PACER documents that we acquired through the RECAP Project. As we began working with the data, one thing we did was count how many pages every document had so that we could calculate the average length of a PDF in PACER. Fairly quickly we learned that based on our sample, the average length of a PACER document is 9.1 pages.1
Based on a sample of about 2M PDFs, the average length of a PACER document is 9.1 pages. The max (so far) is 4,417.— RECAP the Law (@RECAPtheLaw) September 2, 2016
This is a really interesting statistic. Another is that there are more than one billion documents in PACER:
CM/ECF currently contains, in aggregate, more than one billion retrievable documents spread among the 13 courts of appeals, 94 district courts, 90 bankruptcy courts, and other specialized tribunals.
With these two statistics and the knowledge that downloading a document costs ten cents per page, we can once again see how PACER---the biggest paywall the world has ever known---is a deeply troubling system. At this price, purchasing the contents of PACER would cost somewhere on the order of one billion dollars.2
One 👏 Billion 👏 Dollars 👏
For reference, storing the entire PACER corpus in the cloud would cost around $128,000/year.3 That number doesn't include a variety of other expenses in the PACER system, but the difference between the storage cost and the amount it would cost to purchase the content is astounding.
How did this happen? How is it that as the cost of storing data has gone down, the cost of PACER data has gone up? How is it that non-profit organizations like the Internet Archive can share 15 petabytes of data for free4 while PACER data costs so much?
Well, one reason the price is high is that the Administrative Office of the Courts (AO), the federal organization that runs PACER, has a monopoly. In the E-Government Act, Congress asked the AO to set up PACER and said that it could charge a reasonable price to recoup costs. Since then, the AO has had an officially sanctioned monopoly on this data.
Let's Talk (Briefly) About Monopolies
Contrary to popular wisdom, there are good reasons for monopolies. For example, sometimes you have a product that's extremely expensive and that you wouldn't want to create in duplicate. A classic example of this is the sewer line that connects to your house. This sewer line is almost always run by an organization that has a local monopoly for your city because putting in sewer lines is expensive. We don't want every house to be connected to a half dozen sewer lines run by different companies. A competitive system like that would never work.
This is often the case for infrastructure that's expensive to set up, and PACER is like this in some respects. We could have multiple systems where people uploaded and downloaded legal documents, but each system would require a lot of upfront investment. Plus, if we set up multiple systems like that, we would lose the centralized system we currently have. These are probably good reasons to set up a limited monopoly, and Congress was probably right to do so when they passed the E-Government Act.
However, whenever you create monopolies, oversight becomes important. How do we know that our sewer lines are priced efficiently and that the folks running our sewers aren't gouging us? We know because of oversight.
Oversight comes in a few forms. It can be performed by journalists, by the public, or by other government bodies, like Inspectors General. To its credit, the AO has a page on this topic on its website that lists the various auditing and accountability mechanisms that are in place, but the fact is that none of these mechanisms have been successful in reviewing the excessive costs of the PACER system.
The next tool is Congressional oversight. The AO is required to file annual reports to Congress, but so far those reports have said that PACER is a resounding success and that surveys report great satisfaction among users. Congress hasn't questioned these assertions, and so this oversight has failed.
Unlike most other parts of the federal government, the AO and the judicial branch do not have Inspectors General. That's out too.
What about FOIA?
One especially powerful tool that is worth mentioning is a Freedom of Information Act (FOIA) request. These can be sent by members of the public to request records of various kinds from the government. Journalists and the public rely on FOIA requests to learn about the inner workings of the government, and have used them to uncover all manner of malfeasance.
Unfortunately, because FOIA applies only to executive branch agencies, the AO, which is part of the judicial branch, is not subject to FOIA requests.
What We're Left With
The E-Government Act created a monopoly for the distribution of Federal Court data, and allowed the AO to charge money to recoup its costs. Since the time PACER was created, it has brought in hundreds of millions of dollars, and its prices have risen while the costs of storage and computing have fallen.
We estimate that storing the entire PACER database could cost around $128,000/year, but in 2015 PACER revenue was $145M.
Congress has an oversight role for the AO, but so far it hasn't acted to curb these costs and rein in PACER revenue. FOIA and Inspectors General can't help.
In the end, we believe a solution to PACER's egregious fees will require the cooperation of the public, Congress, journalists, and the courts themselves. We've written extensively about the role each of these groups plays in fixing this problem, including guidance for the public, Congress, and members of the judiciary, and we hope you'll read it and get involved in whatever capacity you can. (Journalists, you know what to do.)
As it stands, PACER thwarts the ability of the press and the public to ensure the proper functioning of our democracy, and it cripples researchers who wish to study the federal courts. PACER has been this way almost since its inception, but it need not be this way forever.
We've corroborated this number with other organizations that hold large collections of PACER data. Their numbers are similar.↩
9.1 pages per document × ten cents per page = $0.91 per document. $0.91 × 1B documents (as of 2014) = $910,000,000. Critics will point out that there's a $3 cap per document, so this average isn't quite right. Still, this number doesn't factor in the cost of the docket sheets or search results (the latter aren't subject to a $3 cap). On top of this, the corpus has undoubtedly grown since 2014. Can we agree that one billion dollars is the right ballpark?↩
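For readers who want to check the arithmetic in this footnote, here is a minimal Python sketch. The function and constant names are ours, invented for illustration. The inputs are the figures quoted in the article: ten cents per page, a $3 per-document cap, 9.1 pages on average, and one billion documents. Because we only have the average page count, the corpus estimate below ignores the cap. Applying the cap properly would need the page count of every document.

```python
# Rough check of the purchase-cost estimate. Prices are kept in integer
# cents so per-document costs are exact.
PRICE_PER_PAGE_CENTS = 10       # ten cents per page
CAP_CENTS = 300                 # $3 cap per document
AVG_PAGES = 9.1                 # average from the ~2M-document sample
TOTAL_DOCS = 1_000_000_000      # "more than one billion" documents


def document_cost_cents(pages: int, capped: bool = True) -> int:
    """Cost of one document in cents, optionally applying the $3 cap."""
    cost = pages * PRICE_PER_PAGE_CENTS
    return min(cost, CAP_CENTS) if capped else cost


def uncapped_corpus_estimate_dollars(avg_pages: float, total_docs: int) -> float:
    """Average pages x price per page x number of documents, ignoring the cap."""
    return avg_pages * PRICE_PER_PAGE_CENTS * total_docs / 100


if __name__ == "__main__":
    estimate = uncapped_corpus_estimate_dollars(AVG_PAGES, TOTAL_DOCS)
    print(f"Uncapped corpus estimate: ${estimate:,.0f}")

    # The longest document in the sample (4,417 pages) shows what the cap does.
    longest = 4417
    print(f"Longest document, capped:   ${document_cost_cents(longest) / 100:.2f}")
    print(f"Longest document, uncapped: ${document_cost_cents(longest, capped=False) / 100:.2f}")
```

The cap only applies to documents longer than 30 pages, so how far it lowers the total depends on how many long documents there are, which the average alone doesn't tell us.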
These numbers are rough, but Amazon's "GovCloud" storage product costs $0.0383/gigabyte/month. The average size of a document in RECAP is 278 kilobytes. So, 278 kilobytes × 1B documents means PACER has about 278 terabytes of data, which would cost about $127,901 to store each year at the current pricing.
The revenue of PACER in 2015, the last year that's available, was $145M (see item #1130 on page 51 of the 2016 Judicial Budget). Running PACER has other costs, but that revenue is 1,133 times more than what storing the documents in the cloud would cost.↩
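The storage estimate can be reproduced the same way. This is a rough sketch using the rounded figures quoted in this footnote; the names are hypothetical, and we assume decimal units (1 GB = 1,000,000 KB). With these rounded inputs the result lands a little under the $127,901 quoted above, presumably because the original calculation used an unrounded average document size.

```python
# Rough check of the storage-cost estimate, using the rounded figures
# quoted in the footnote and decimal units (1 GB = 1,000,000 KB).
PRICE_PER_GB_MONTH = 0.0383     # quoted GovCloud price, dollars per GB per month
AVG_DOC_KB = 278                # average RECAP document size in kilobytes
TOTAL_DOCS = 1_000_000_000      # one billion documents
PACER_REVENUE_2015 = 145_000_000


def corpus_size_gb(avg_doc_kb: float, total_docs: int) -> float:
    """Total corpus size in gigabytes (decimal units)."""
    return avg_doc_kb * total_docs / 1_000_000


def annual_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Yearly storage cost in dollars: size x monthly price x 12 months."""
    return size_gb * price_per_gb_month * 12


if __name__ == "__main__":
    size_gb = corpus_size_gb(AVG_DOC_KB, TOTAL_DOCS)
    yearly = annual_storage_cost(size_gb, PRICE_PER_GB_MONTH)
    print(f"Corpus size: {size_gb / 1000:,.0f} TB")
    print(f"Storage cost: ${yearly:,.0f} per year")
    print(f"2015 revenue is about {PACER_REVENUE_2015 / yearly:,.0f}x that cost")
```

Either way, the ratio of revenue to storage cost comes out at more than a thousand to one.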
A petabyte is about 1,000 terabytes, and a terabyte is about 1,000 gigabytes. A gigabyte can hold thousands of documents. So, 15 petabytes, or roughly 15 million gigabytes, is probably tens of billions of documents.↩