Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, so I spent some time today working out how best to convert the PDF files the courts give us into HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn’t find an easy way to use their technology in an automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we’re going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as well as possible, some basic HTML metadata applied, and the UTF-8 encoding …
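To run the same command over a whole folder of cases, a small wrapper script may help. The following is only a sketch: it assumes pdftotext is installed and on the PATH, and the function names and directory arguments are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, out_path):
    """Return the argument list for the pdftotext invocation shown above."""
    return [
        "pdftotext",
        "-htmlmeta",      # wrap the text in simple HTML with meta information
        "-layout",        # keep the original physical layout as far as possible
        "-enc", "UTF-8",  # write the output as UTF-8
        str(pdf_path),
        str(out_path),    # explicit output file (hypothetical naming choice)
    ]


def convert_directory(pdf_dir, html_dir):
    """Convert every PDF in pdf_dir, writing one .html file per PDF."""
    html_dir = Path(html_dir)
    html_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        out_path = html_dir / (pdf_path.stem + ".html")
        # check=True raises CalledProcessError if pdftotext fails on a file
        subprocess.run(build_pdftotext_command(pdf_path, out_path), check=True)
        converted.append(out_path)
    return converted
```

Calling convert_directory("pdfs", "html") would then produce one HTML file per PDF, stopping at the first file pdftotext cannot handle.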


RECAP Extension 0.6 Beta Released

The Mozilla Foundation released version 3.6 of Firefox today, and we’re proud to release the corresponding version of the RECAP extension, beta version 0.6. In addition to Firefox 3.6 compatibility, we’ve also thrown in a new feature suggested by our users: the option to save documents using filenames that we describe as “lawyer style” in contrast to the “Internet Archive style” we’ve traditionally used. For example, rather than saving a document as “gov.uscourts.cand.204881.46.0.pdf,” you can now configure the extension to store a document as “N.D.Cal._3-08-cv-03251_46_0.pdf.” Those who prefer the traditional filenames are free to continue using those as well.
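To make the two naming styles concrete, here is a rough sketch that reproduces the example filenames above. It is not the extension's actual code. The function names, the meaning assigned to each filename component (court, case number, document number, attachment number), and the replacement of the colon in the docket number are all assumptions inferred from the two examples.

```python
def internet_archive_name(court_code, case_id, doc_num, attachment_num):
    """Build an "Internet Archive style" filename (field meanings assumed)."""
    return f"gov.uscourts.{court_code}.{case_id}.{doc_num}.{attachment_num}.pdf"


def lawyer_style_name(court_abbrev, docket_number, doc_num, attachment_num):
    """Build a "lawyer style" filename (field meanings assumed)."""
    # Assumption: a docket number such as "3:08-cv-03251" has its colon
    # replaced with a hyphen so the result is a safe filename.
    safe_docket = docket_number.replace(":", "-")
    return f"{court_abbrev}_{safe_docket}_{doc_num}_{attachment_num}.pdf"


print(internet_archive_name("cand", 204881, 46, 0))
print(lawyer_style_name("N.D.Cal.", "3:08-cv-03251", 46, 0))
```

With those inputs, the two calls print the filenames quoted in the example above.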

We’ve also improved our docket-parsing code, allowing us to extract more metadata from court dockets. New fields we’re now scraping include “Assigned to,” “Referred to,” “Cause,” “Nature of Suit,” “Jury Demand,” “Jurisdiction,” and “Demand.” We also scrape information about parties, including names, contact information, and attorneys. You can see a good example here (to choose a case at random).

If you’re an existing Firefox user, Firefox periodically checks for updates to extensions and should automatically fetch the new version of the RECAP …


RECAP in the Columbia Science and Technology Law Review

’s place at the heart or the periphery of the movement remains to be seen. Like any crowdsourcing application, RECAP’s usefulness increases as more people use it. Yet PACER’s prime users are large, bill-paying law firms, which tend to be wary about adopting new technology and have little incentive to contribute documents they paid for to a free database.

“Success” for RECAP may not be mainstream adoption, however. Merely by creating the working plugin and calling attention to the problem of restricted access to court documents, CITP has advanced the cause of reforming and opening up access to PACER. That alone is “Turning PACER around.”

One point this misses is that using RECAP can directly reduce firms’ PACER fees. It’s true, of course, that most firms pass these costs along to their clients. However, in today’s economic climate, clients are increasingly pressing their law firms for cost savings. Adopting RECAP is a painless way for firms to demonstrate cost-consciousness. And the cost savings from RECAP adoption will only get bigger as RECAP’s user base continues to grow. So while we think judicial transparency is reason enough to use RECAP, installing RECAP is good for every …


Google Project Shows Value of Open Judicial Records

We’re excited to see Google has unveiled a dramatic expansion of Google Scholar to include Supreme Court decisions going back to the 18th century, lower federal court decisions since the 1920s, and state Supreme Court and appellate decisions going back to the 1950s. They’ve done an impressive job with automated parsing of legal citations, transforming them into hyperlinks and allowing Google to do automated analysis of case similarity.

This type of project was precisely what we had in mind when some of us wrote “Government Data and the Invisible Hand” last year. The judiciary may be the foundation of a free society, but it’s not especially good at building websites or search engines. By making public records easily available for re-publications by third parties, the judiciary (and the other branches of government) can enable private parties to dramatically expand public access to public information.

In this case, the state and federal courts haven’t made it easy to download bulk data, so Google had to get the information from third parties. Google is a big company with significant resources at its disposal. But in an ideal world, it wouldn’t take the resources of a large company …


RECAP Media Recap

Last week, we got our first major media coverage from across the pond, as the Guardian gave us a generous write-up. They call RECAP “an ingenious twist on peer to peer networking” and write that “since the system launched in August, legal circles have been buzzing with support for the idea.”

Meanwhile, RECAP continues to generate interest from the legal profession. Earlier this month, RECAP’s own Tim Lee spoke to a group of New Jersey lawyers about how the software can save their clients money while expanding access to the public domain. And Arizona Attorney magazine has an in-depth article about RECAP and the debate over public access. They write that “there appears to be nothing illegal about the use of RECAP by those who are paying PACER users” (we agree). And they conclude that we’ve “carefully thought through the ethical implications and goals of the program.” We like to think so. The December issue of Virginia Lawyer magazine profiles RECAP, describing in detail the efforts so far to liberate PACER documents.


RECAP in Minnesota Lawyer

Word about RECAP continues to spread through the legal profession. The latest issue of Minnesota Lawyer covers the case of a Minneapolis lawyer who was sanctioned for inadvertently including the Social Security numbers and dates of birth of dozens of individuals in court documents, when the rules of civil procedure mandate that only the last four digits of a Social Security number and the year of birth be disclosed in documents filed with the court.

The article then mentions RECAP as one reason for attorneys to be careful about redaction when they’re filing court documents:

Friedemann said that concern over the publication of sensitive information has been elevated by recent Web programs like RECAP, which has made it easier to access public court filings.

RECAP automatically uploads all PACER documents a user is viewing onto an archive maintained by the non-profit group Internet Archive. When the next RECAP user attempts to view a PACER document that has already been archived, RECAP automatically uploads the copy to prevent that user from paying for those materials. The system allows users of PACER to slowly create a secondary archive of these public documents that can be accessed for free.

Friedemann explained that …


An Effort to Define the Ideal “Law.gov”

A group of academics has been convened by Public.Resource.Org to define recommendations for a proposed federal government site: law.gov. The group will study the feasibility of creating the equivalent of a data.gov for legal materials. The process will define a concrete path forward for the government. Specifically, it will deliver:

  • Detailed technical specifications for markup, authentication, bulk access, and other aspects of a distributed registry.
  • A bill of lading defining which materials should be made available on the system.
  • A detailed business plan and budget for the organization in the government running the new system.
  • Sample enabling legislation.
  • An economic impact statement detailing the effect on federal spending and economic activity.
  • Procedures for auditing materials on the system to ensure authenticity.

Ed Felten, Executive Director of Princeton’s Center for Information Technology Policy (which also produced RECAP), is one of the co-conveners.


Schultze on RECAP at Yale

Last week RECAP’s Steve Schultze and Harlan Yu visited Yale Law School to give a talk sponsored by Yale’s Information Society Project. Yale librarian Jason Eiseman produced a short interview with Steve that he describes as “a little Blair Witch.” Steve talks about the origins of RECAP, discusses some of the current challenges faced by RECAP, and talks briefly about RECAP’s newest sister project, FedThread.


RECAP in the Los Angeles Times and Elsewhere

Monday’s Los Angeles Times has a great article about the growing movement for government transparency. It focuses on three of our favorite transparency advocates: Ellen Miller, co-founder of the Sunlight Foundation; Josh Tauberer, a regular at CITP conferences; and Carl Malamud, whose non-profit, public.resource.org, is a key RECAP partner.

The article discusses RECAP in some detail, describing it as “a sort of digital Kumbaya.” We’re always happy to have news outlets help spread the word about RECAP, and we’re also glad that the article makes clear that RECAP is part of a broader movement for web-enabled government transparency. Folks like Carl, Josh, and Ellen have been pushing the envelope on these issues longer than we have.

One minor correction that’s worth noting: the article refers to “the courts’ PACER revenue of $10 million a year.” In reality, the expected revenue for 2009 is $87 million. This and many other details about PACER’s budget can be found in RECAP co-author Steve Schultze’s recently released paper on the subject.

RECAP has been a subject of discussion in other venues as well. Ars Technica discussed the courts’ reaction to RECAP in its story about the …


RECAP’s Steve Schultze at the Gov 2.0 Expo

RECAP co-author Steve Schultze is having a busy month. Last week, he released a new paper called “Electronic Public Access Fees and the United States Federal Courts’ Budget: An Overview.” It provides a comprehensive overview of PACER’s budget. It explains how the courts decide how much to charge for PACER and how the money is spent. It’s an invaluable roadmap for anyone interested in understanding the debate over PACER’s future.

Today, Steve is at the Gov 2.0 Expo giving a talk about RECAP. If you’re at the expo as well, we hope you’re planning to go to the talk, which starts at 10:50. If not, you can see a pre-recorded version of his talk here:


Finally, next week Steve will start his new job as associate director of the Center for Information Technology Policy at Princeton, which is the home of RECAP and its other co-authors. The rest of the RECAP team is excited that we’ll soon have Steve as a colleague as well as a co-author.


RECAP in the Wall Street Journal

Last week we did a round-up of leading technology-focused sites that have covered RECAP. Now, it seems that news of RECAP is spreading beyond the “tech blogosphere,” as more mainstream publications have begun writing about our software. Foreign Policy’s Evgeny Morozov covered RECAP, calling it “smart and subversive.” On Wednesday NextGov, a National Journal publication widely read within the government IT community, ran a thorough write-up of RECAP by Aliya Sternstein. It included some good background on how RECAP fits into the larger debate about judicial transparency.

Finally, Katherine Mangu-Ward has penned a piece for the Wall Street Journal about RECAP. Katherine calls RECAP “a sleek little add-on” with “a stylish and subversive touch.” She writes:

With the possible exception of the ever-leaky CIA, no aspect of government remains more locked down than the secretive, hierarchical judicial branch. Digital records of court filings, briefs and transcripts sit behind paywalls like Lexis and Westlaw. Legal codes and judicial documents aren’t copyrighted, but governments often cut exclusive distribution deals, rendering other access methods a bit legally questionable. Supreme Court decisions are easy to get, but the briefs and decisions of lower courts can be hard to come by.

Last week …


A Note on RECAP’s Commitment to Privacy

We’ve gotten our first official reaction from the judiciary, in the form of a statement on the New Mexico Bankruptcy court’s website. It contains two important points about the PACER terms of use, and a misleading statement about privacy that we want to correct.

First, the good news: the court acknowledges a point we’ve made before, namely that use of RECAP is consistent with the law and the PACER terms of use. The only potential exception is if you’ve received a fee waiver for PACER. In that case, use of RECAP could violate the terms of the fee waiver, which reads: “Any transfer of data obtained as the result of a fee exemption is prohibited unless expressly authorized by the court.” We’re not lawyers, so we don’t know if the court’s interpretation is correct, but we encourage our users to honor the terms of the fee waiver.

Now, an important correction. The statement raises the concern that RECAP could compromise sealed or private documents that attorneys access via CM/ECF, the system attorneys use for electronic filing and retrieval of documents in pending cases. Protecting privacy is our top priority, and we specifically designed …


Tell The Courts to Improve PACER

One way to promote broader public access to the public record is to use RECAP to share documents with others. A complementary approach is to tell the U.S. Courts directly what should change. Recently, Stanford Law Librarian Erika Wayne launched a petition to “Improve PACER,” which suggested several changes:

  1. Provide document authentication. As the raw materials of adjudication become digitized and disseminated online, we must have some means of knowing that they are genuine. This is a dilemma that RECAP faces in helping users to trust the documents they download.
  2. Lower costs, improve interfaces. Our ultimate goal is to remove PACER’s paywall entirely and free the database up for third parties to build interfaces. But in the meantime, it would certainly benefit the public to gain less expensive access to the law through more useful interfaces. The petition recommends that the U.S. Courts reduce the transaction costs of access, and make that access more usable.
  3. Free access from Federal Depository Libraries

Erika will deliver the petition to the Administrative Office of the Courts in the near future. If you support these goals, consider signing the petition.


Accessing the RECAP Repository without PACER

Of all the questions we’ve received, probably the most common is whether it will be possible to access the documents in our archive without using PACER at all. The answer is yes, but at the moment we don’t offer any good browsing or searching tools.

The big reason has to do with privacy. One of our top priorities in developing RECAP was making sure we don’t inadvertently compromise the privacy of individuals who are the subject of court records. A lot of sensitive personal information is revealed in the course of federal court cases. A variety of private parties might be interested in using the information contained in these records for illicit purposes such as identity theft, stalking, and witness intimidation. We wanted to make sure we weren’t inadvertently facilitating those types of activities.

In theory, the courts have redaction rules designed to deal with these problems. Judges can order particularly sensitive documents to be sealed, and the rest of the documents are supposed to be redacted to prevent inadvertent disclosure of private information. Unfortunately, this process is far from perfect. Private information does sometimes wind up in the public version of court documents.

When court …


Law Professors, Librarians, and Think Tankers Praise RECAP

We’ve been getting a ton of helpful feedback from users over the weekend. We’re grateful for all the supportive emails, comments, and tweets we’ve received. We’re also grateful for the bug reports and feature requests we’ve gotten. We need this kind of feedback to make RECAP better.

Most of the questions we’ve received are now answered by our Frequently Asked Questions. Stay tuned for some upcoming blog posts where we’ll address some of these questions in more detail. But first, we wanted to highlight some more of the commentary that RECAP’s release has generated.

James Grimmelmann, a law professor at New York Law School who has done some great writing on public access to the law, gives RECAP this generous endorsement:

The great part about this is that because the Archive is providing the server space for free, every RECAP user is saving the court system work. Each time you download through RECAP, you avoid having to go through PACER’s servers at all. So yes, RECAP will mean a decrease in PACER’s revenues, but it also means a decrease in the things those revenues need to pay for. It …


The Blogosphere Weighs in on RECAP

We’re thrilled at the reception RECAP has gotten in its first few hours. Among the notable reactions, Techcrunch discusses the legal issues and concludes that using RECAP doesn’t violate copyright law. RECAP is a hot topic of conversation at Slashdot. CNet also weighed in, highlighting one of the challenges RECAP may face in the coming months:

There are some potential problems. One is that because the RECAP developers plan to make the source code available, it wouldn’t be hard for someone to seed the Internet Archive with “official court documents” that had been modified in some way. (The answer is for users to pay to download important files from PACER, or for the courts to employ digital signatures.)

Techdirt calls RECAP “ingenious”, and concludes that “this is a fantastic idea that hopefully will help to open up public domain court information that has been locked behind PACER’s paywalls for too long.”

Finally, Ars Technica does its usual thorough job of covering RECAP, writing:

The RECAP project could also illuminate potential solutions to the problems that are blocking a more complete PACER overhaul. Despite growing pressure from Congress to reform the PACER system and make data available …


A Million Documents At Your Fingertips

In our last post, we mentioned that we were already working with other organizations that support judicial transparency to help us build the public repository that lies at RECAP’s foundation. Public.resource.org, led by Carl Malamud, has been especially helpful in this regard. They have a vast repository of court documents, weighing in at more than 500 gigabytes in total. Over the last few weeks we’ve been pre-stocking the archive with these documents, and we recently crossed the million-document threshold.

What this means is that installing RECAP will not only help you contribute to government transparency, but it’s likely to start saving you money right out of the gate. For example, if you practice law in New York City, you’ll be happy to know that we have 238,098 documents from the Southern District of New York. If you have RECAP installed, you can use PACER the way you normally do, and RECAP will automatically inform you if the document you need is already available for free.

Here is a table of the other courts where we have a significant number of documents:

Court                 No. of Documents
District of Alaska    52,797
Northern District …


Turning PACER Around

Transparency is a fundamental principle of our legal system. Since the 1980s, the cutting edge of judicial transparency has been PACER, an electronic system that allows attorneys and the general public to access millions of federal court records. PACER was a big step forward when it was originally created, but lately it has begun to show its age. At a time when the other two branches of government are becoming ever more subject to online scrutiny, the judicial branch still requires citizens to provide a credit card and pay eight cents a page for its documents. For reasons we detail on our “Why It Matters” page, we think this needs to change, and the sooner the better.

Today we’re excited to release the public beta of RECAP. RECAP is an extension to the popular Firefox web browser that gives PACER users a hassle-free way to contribute to a free, open repository of federal court records. When a RECAP user purchases a document from PACER, the RECAP extension helps her automatically send a copy of that document to the RECAP archive. And RECAP saves its users money by notifying them when documents they’re searching for are already available for …
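The flow described above can be sketched in a few lines. This is a simplified illustration and not the extension's code: the Archive class is an in-memory stand-in for the public repository, purchase_from_pacer is a hypothetical callback, and the sketch fetches the free copy automatically where the real extension notifies the user instead.

```python
class Archive:
    """In-memory stand-in for the shared public repository (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def has(self, doc_id):
        return doc_id in self._docs

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, content):
        self._docs[doc_id] = content


def fetch_document(doc_id, archive, purchase_from_pacer):
    """Return (content, paid): use the free copy if one has been archived."""
    if archive.has(doc_id):
        # Someone already contributed this document, so nobody pays again.
        return archive.get(doc_id), False
    # Otherwise the user pays PACER once, and a copy goes to the archive.
    content = purchase_from_pacer(doc_id)
    archive.put(doc_id, content)
    return content, True
```

The first request for a given document pays PACER and contributes a copy. Every later request for the same document is served from the archive at no charge.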


RECAP Project — Why It Matters

The right of access to criminal trials in particular is properly afforded protection by the First Amendment both because such trials have historically been open to the press and public and because such right of access plays a particularly significant role in the functioning of the judicial process and the government as a whole.

Globe Newspaper Co. v. Superior Ct., 457 U.S. 596

We are a nation of laws. Our law is created not only via legislation, but also through the adjudicative process of the courts. Whereas we generally have open and free access to the statutes that bind us, case law has had a more mixed history. Earlier experiments in secret proceedings did not go well. Western law subsequently developed strong precedents for access to judicial proceedings, citing the importance of transparency in promoting court legitimacy, accountability, fairness, and democratic due process. When the law is accessible, “ignorance of the law is no excuse.”

Legal accessibility has traditionally meant that citizens may review the law via the contemporary technology, and redistribute it at will. In ancient courts, this implied open public access to the proceeding itself. Indeed, the principle was literally built into the architecture of the courthouses …
