$10,000 in Further Awards for RECAP Projects

Today, teams across the country are hard at work on the Aaron Swartz Memorial Grants. These grants, offered by the Think Computer Foundation, provide $5,000 awards for three different projects related to RECAP.

We are delighted to announce additional awards. The generous folks over at Google’s Open Source Programs team have pledged to support two more RECAP-related project awards — at $5,000 each. These are open to anyone who wishes to submit a proposal for a significant improvement to the RECAP system. We will work with the proposers to scope the project and define what qualifies for the award. All projects must be open source.

There are several potential ideas. For instance, someone might propose adding support to RECAP for displaying the user’s current balance and prompting the user to liberate documents up to their free quarterly $15 allocation as the end of the quarter approaches (inspired by Operation Asymptote). Someone might propose improving the https://www.courtlistener.com/recap/ interface, or improving the detection and removal of private information. Or someone might propose an idea that we haven’t thought of. You may wish to watch the discussion of a few of these initial ideas …

more ...

Another new court on CourtListener

We’re on a roll, and today I’m happy to share that we’ve added yet another court to the site. Today’s court, with about 50 cases so far, is the Bankruptcy Appellate Panel for the Ninth Circuit.

We’ll be adding a historical scraper for this court soon, but for now, sit back and enjoy our super-fast results as they get delivered straight to your email.

50 today. 1,000 tomorrow.

more ...

New Courts at CourtListener with Historical Data

I mentioned in my last post that we’ve added some new courts to the site. Today we’ve added the historical data that was available on these courts’ websites.

This amounts to about 1,500 new cases on CourtListener:

  • 112 from November 2003 to today at the Court of Appeals for the Armed Forces
  • 764 from January 2000 to today at the Court of Veterans Claims
  • 600 from January 2008 to today at the Court of International Trade

All of these docs are immediately available via search, RSS, or our dump API, and they will be included in our dump of all our cases when it is regenerated at the end of the month.

This also marks an important achievement for the Juriscraper library. Since CourtListener now has scrapers for all federal courts of special jurisdiction, we’re officially moving it to version 0.2. It’s taken longer than we wanted to get it here, but this is a huge step for the library.

Freeing 1,000 docs at a time.

more ...

A few updates at CourtListener

It’s been quiet around here for a little while, so it’s about time I share what’s been going on behind the scenes. As you might imagine, just because we haven’t had a lot of news doesn’t mean that we haven’t been busy.

The biggest thing I have to share today is that we’ve moved our CourtListener infrastructure to new and bigger hardware. This task has taken months to complete and involved applying many updates to the code and infrastructure. For developers, this upgrade comes with a few changes:

  1. Our default database for CourtListener is now Postgres rather than MySQL. This is something that’s been planned for a while, but wasn’t really possible until a big upgrade like this one. The big changes that come out of this are non-locking queries for our database dumps and better performance for many of our queries. Since Postgres is a transactional, stricter, and more featureful database, we’re convinced that it is a better way forward than MySQL. Oracle hasn’t been a great steward of MySQL lately, so it was a good time to jump ship. As a bonus, Postgres was started in Berkeley …
more ...

Announcing the Aaron Swartz Memorial Grants

Last week, our community lost Aaron Swartz. We are still reeling. Aaron was a fighter for openness and freedom, and many people have been channeling their grief into positive actions for causes that were close to Aaron’s heart. One of these people is Aaron Greenspan, creator of the open-data site PlainSite and the Think Computer Foundation. He has established a generous set of grants to be awarded to the first person (or group) that develops the following upgrades to RECAP, our court record liberation system. RECAP would not exist without the work of Aaron Swartz.

Three grants are being made available related to RECAP. Each grant is worth $5,000.00:

  1. Grant 1: Develop and release a version of RECAP for the Google Chrome browser that matches the current Firefox browser extension functionality
  2. Grant 2: Develop and release a version of RECAP for Internet Explorer that matches the current Firefox browser extension functionality
  3. Grant 3: Update the Firefox browser extension to capture appellate court documents, and update the RECAP server code to parse them and respond appropriately to browser extension requests

For more details, see The Aaron Swartz Memorial Grants. If you are interested, you must register by the …

more ...

Presentation on Juriscraper and CourtListener for LVI2012

Yesterday and today I’ve been in Ithaca, New York, participating in the Law via the Internet Conference (LVI), where I’ve been learning tons!

I had the good fortune to have my proposal topic selected for Track 4: Application Development for Open Access and Engagement.

In the interest of sharing, I’ve attached the latest version of my slides to this blog post, and the audio for the talk may eventually be posted on the LVI site.

Attachments

LVI-Presentation-Lissner-Juriscraper

more ...

New tool for testing lxml XPath queries

I got a bit frustrated today and decided that I should build a tool to fix my frustration. The problem was that we use a lot of XPath queries to scrape various court websites, but there was no tool for testing XPath expressions efficiently.

There are a couple of tools quite similar to what I just built: there’s one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, these each operate on a different DOM interpretation than the one that lxml builds.

So while these tools helped, I consistently found that when the HTML got nasty, they’d start falling over.

No more! Today I built a quick Django app that can be run locally or on a server. It’s quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it’s pretty basic on the whole.
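
If you’re curious what that boils down to under the hood, here’s a minimal sketch of the core operation: evaluating an XPath expression against the DOM that lxml builds. This is just the underlying library call, not the app itself:

```python
# A minimal sketch of the tool's core: parse the pasted HTML with
# lxml (the same parser our scrapers use) and evaluate the XPath
# expression against the resulting tree.
from lxml import html

def test_xpath(html_source, xpath_expression):
    """Return all matches for an XPath expression in some HTML."""
    tree = html.fromstring(html_source)
    return tree.xpath(xpath_expression)

# Example: pull the link targets out of a snippet of sloppy HTML.
print(test_xpath(
    "<p>See <a href='/opinion/1'>one case</a> and <a href='/opinion/2'>another</a>",
    "//a/@href",
))  # -> ['/opinion/1', '/opinion/2']
```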

I’d love to get any feedback I can about this. It’s …

more ...

Announcing the third series of the Federal Reporter!

Following on Friday’s big announcement about our new citator, today I’m excited to share that we’ve completed incorporating volumes 1 to 491 of the third series of the Federal Reporter (F.3d). This has been a monumental task over the past six months. Since we already had many cases from the same time period and jurisdiction, we had to work very hard on our duplicate merging algorithm. In the end, we were able to get upwards of 99% accuracy with our merging code, and any cases that could not be merged automatically were handled by human review. The outcome of this work is an improved dataset beyond any that has been available previously: in tens of thousands of cases, we have been able to merge the metadata on Resource.org with data that we obtained directly from the court websites.
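
The merging code itself is too involved to reproduce here, but the basic idea can be sketched: compare candidate pairs on decision date and case-name similarity, and only merge automatically above a confidence threshold. The fields and the 0.9 cutoff below are illustrative, not our production values:

```python
# Illustrative sketch of duplicate detection, not our production
# code: two records are treated as the same case when their dates
# match and their names are very similar. The 0.9 cutoff is made up.
from difflib import SequenceMatcher

def is_probable_duplicate(case_a, case_b, threshold=0.9):
    """case_a and case_b are dicts with 'name' and 'date' keys."""
    if case_a["date"] != case_b["date"]:
        return False
    similarity = SequenceMatcher(
        None, case_a["name"].lower(), case_b["name"].lower()
    ).ratio()
    return similarity >= threshold

print(is_probable_duplicate(
    {"name": "Smith v. Jones", "date": "1995-06-12"},
    {"name": "Smith v Jones", "date": "1995-06-12"},
))  # -> True
```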

These new cases bring our total number of cases up to 756,713, and we hope to hit a million by the end of the year. With this done, our next task is to begin incorporating data from all of the appellate-level state courts. We will be working on this in a …

more ...

Building a Citator on CourtListener

I’m incredibly excited today to announce that over the past few weeks we have successfully rolled out a Citator on CourtListener. This feature was developed by UC Berkeley School of Information students Karen Rustad and Rowyn McDonald after a thorough design and development cycle that included everything from user interviews to performance optimizations of our citation finding algorithm.

As you’re browsing the site, you’ll immediately see three big new features. First, all Federal citations to documents that we have in our collection are now links. So as you’re reading, if there’s a reference to a prior case that you feel might be useful to your research, you can just click the link to that case and continue your research there. This allows you to go upstream in your research, looking at the important cases that came before.

The second big change you’ll see is a new sidebar on all case pages that lists the top five cases that reference the one you’re reading. This allows you to go downstream from the case you’re reading, where you’ll be able to identify how the case was later interpreted by other courts.
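
The citation finding code Karen and Rowyn optimized is part of the CourtListener codebase; as a simplified illustration of the general idea, the familiar volume/reporter/page pattern can be spotted with a regular expression. The reporter list below is a tiny, assumed subset:

```python
# Simplified illustration of citation finding: match the
# volume/reporter/page pattern. Only a tiny subset of reporters is
# listed here; the real algorithm handles far more, plus edge cases.
import re

CITATION_RE = re.compile(
    r"\b(\d{1,4})\s+(U\.S\.|S\. ?Ct\.|F\. ?Supp\.|F\.2d|F\.3d)\s+(\d{1,5})\b"
)

text = "See Roe v. Wade, 410 U.S. 113 (1973); accord 539 F.3d 1011."
for volume, reporter, page in CITATION_RE.findall(text):
    print(volume, reporter, page)
```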

At the …

more ...

Further privacy protections at CourtListener

I’ve written previously about the lengths we go to at CourtListener to protect people’s privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it’s possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
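
In Django terms, the change amounts to something like the sketch below (a minimal version; our real sitemap views do more):

```python
# Minimal sketch of serving a sitemap with the noindex X-Robots-Tag
# header in Django: crawlers may read the file, but are told not to
# index it or serve it in results.
from django.http import HttpResponse

def sitemap(request):
    # A literal stands in for real sitemap generation here, just to
    # keep the sketch self-contained.
    xml = '<?xml version="1.0" encoding="UTF-8"?><urlset></urlset>'
    response = HttpResponse(xml, content_type="application/xml")
    response["X-Robots-Tag"] = "noindex"
    return response
```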

more ...

My Presentation Proposals for LVI 2012

The Law via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7–9. I will be attending, and with any luck, I’ll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: by simply visiting a court’s website, it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of which suffer from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those who wish to collect court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally licensed open source library that can be picked up and used by any organization to scrape case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support …
more ...

Announcing OCR Support on CourtListener

For the past few months, we have been blogging about our research into how to handle scanned documents at CourtListener since a number of courts have a habit of releasing their opinions in this manner. Previously when this happened, it meant that we couldn’t get the text out of the document, and as a result, it was impossible for anybody to find these cases on the site.

Obviously, this is a bad situation for our users, so we are excited to announce that as of today we have a new Optical Character Recognition (OCR) system for extracting the text from scanned documents. We’re currently extracting the text from an additional 10,000 opinions that were previously unsearchable, and going into the future we’ll do this automatically as we get cases from the courts.
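
The pipeline itself is more involved, but the heart of the operation looks roughly like this sketch, which runs Tesseract over a scanned page via the pytesseract wrapper (an assumption for illustration, not necessarily the exact stack on our servers):

```python
# Rough sketch of OCR'ing one scanned opinion page. pytesseract is
# assumed here for brevity; it is not necessarily the exact tooling
# running on our servers.
from PIL import Image
import pytesseract

def extract_text(image_path):
    """Run OCR on a scanned page image and return the text found."""
    return pytesseract.image_to_string(Image.open(image_path))

print(extract_text("scanned_opinion_page.png"))
```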

This change further expands the breadth of our coverage, and we hope you find it to be a useful change!

more ...

Announcing CourtListener’s New Sub-Project: Juriscraper

For the past two years at CourtListener, we used a mess of code to scrape the federal court system. This worked remarkably well, but when we recently began expanding our coverage, it became clear that a rewrite was needed. For the past several weeks, we’ve been building a replacement called Juriscraper that is more reliable, understandable, flexible, and expandable.

Unlike our old scrapers, Juriscraper is a library that anybody can pick up and use, and which allows your project to easily scrape court websites. It is currently at version 0.1, which supports all of the courts on CourtListener, and over the next few weeks we’ll be adding many more courts until we have all of the available courts in the United States.
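
To give a flavor of the kind of work the library wraps up, here’s a rough sketch of the underlying pattern: fetch a court’s recent-opinions page and pull metadata out of it with lxml. The URL and XPath expressions are placeholders, and this is not Juriscraper’s actual API:

```python
# A rough sketch of the pattern Juriscraper generalizes. The URL and
# XPath expressions below are placeholders, not the library's API.
import requests
from lxml import html

def scrape_recent_opinions(url, row_xpath, name_xpath, link_xpath):
    """Fetch a court page and yield (case name, document link) pairs."""
    page = html.fromstring(requests.get(url, timeout=30).text)
    page.make_links_absolute(url)
    for row in page.xpath(row_xpath):
        names = row.xpath(name_xpath)
        links = row.xpath(link_xpath)
        if names and links:
            yield names[0].strip(), links[0]

# Hypothetical usage against a court's "recent opinions" table:
for name, link in scrape_recent_opinions(
    "http://www.example-court.gov/opinions.html",  # placeholder
    "//table[@id='opinions']//tr",
    "./td[1]/text()",
    "./td[2]/a/@href",
):
    print(name, link)
```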

We hope that this project will be something that others will use, and that we can thus centralize our scraping efforts. There are many organizations currently scraping court websites, each with their own implementations to build and maintain. This creates a lot of duplicated work and slows down maintenance for everybody. By finally creating a liberally licensed shared scraper, we hope to bring everybody under the same scraping roof so we can share …

more ...

Adding New Fonts to Tesseract 3 OCR Engine

Update: I’ve turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but its instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I’ve explained the process so others may more easily add fonts to their system.

The process has a few major steps:

Create training documents

To create training documents, open up MS Word or LibreOffice and paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt using character spacing. I’ve attached a sample doc too, if that helps. Set the text …

more ...

The Winning Font in Court Opinions

At CourtListener, we’re developing a new system to convert scanned court documents to text. As part of our development we’ve analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we’ve attached a spreadsheet with our findings, and a script that others can use to extract font metadata from PDFs.
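
The attached script is the one we actually used; as a rough modern sketch of the same idea, you can walk a PDF’s layout objects and tally font names (pdfminer.six is an assumption here, not necessarily what the attachment uses):

```python
# Rough sketch of tallying fonts in a PDF with pdfminer.six (an
# assumed library for illustration; see the attached script for the
# version we actually used).
from collections import Counter
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

def font_counts(pdf_path):
    """Count how many characters each font renders in a PDF."""
    counts = Counter()
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            counts[char.fontname] += 1
    return counts

print(font_counts("opinion.pdf").most_common(5))
```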

Unsurprisingly, the top font — drumroll please — is Times New Roman.

Font                 Regular   Bold   Italic   Bold Italic   Total
Times                   1454    953      867            47    3321
Courier                  369    333      209           131    1042
Arial                    364     39       11            41     455
Symbol                   212      0        0             0     212
Helvetica                 24    161        2             2     189
Century Schoolbook        58     54       52             9     173
Garamond                  44     42       41             0     127
Palatino Linotype         36     24       24             1      85
Old English               42      0        0             0      42
Lincoln                   27      0        0             0      27

Attachments

extract_font_metadata_from_files.py_.txt

font-analysis.ods

more ...

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn’t seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here’s Google’s page on the topic. And here’s Bing’s. The msnbot is retired.
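
To make the two mechanisms concrete, here’s a small sketch that checks a page for both signals, the HTTP header and the HTML meta tag (illustrative only):

```python
# Small sketch showing where the two "noindex" signals live: the
# x-robots-tag HTTP header and the robots HTML meta tag.
import requests
from lxml import html

def noindex_signals(url):
    """Report whether a page carries either noindex signal."""
    response = requests.get(url, timeout=30)
    header = response.headers.get("X-Robots-Tag", "")
    metas = html.fromstring(response.content).xpath(
        "//meta[@name='robots']/@content"
    )
    return {
        "x_robots_tag": "noindex" in header.lower(),
        "robots_meta": any("noindex" in m.lower() for m in metas),
    }

print(noindex_signals("http://example.com/"))
```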

Yahoo, AOL

Yahoo!’s search engine is provided by Bing. AOL’s is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma), and Yandex (Russia’s search engine, known as yandex), support the robots meta tag, but do not appear to support the x-robots-tag. Ask’s page on the topic is here, and Yandex’s is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!

The Internet Archive, Alexa

The Internet Archive uses Alexa’s crawler, which is known as ia_archiver. This crawler does not seem …

more ...

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what’s available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we’re often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover and/or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person’s right to privacy and the public’s need to access court records, and to what extent do changes in practical obscurity compel action on our …

more ...

Our Biggest Change Ever is Live!

After three months of hard development, I’m pleased to announce that the new version of CourtListener is going live at this very moment. In this version, we’ve completely rewritten vast swaths of the underlying code, and we’ve switched to a hugely more powerful architecture.

The new site comes with some significant improvements:

  • You can now search by case name, date, court, precedential status, or citation
  • Results can be ordered by date or by relevance
  • New Boolean operators are supported, and our syntax is much more intuitive (see here for many more details)
  • If you want, you can now dig very deeply into the results. Previously, we had a cap at 1,000 results for a query. Not anymore.
  • Court documents will now show up in our search results within milliseconds of being found on the court’s website. In the future, if there’s demand, we may use this to offer Realtime alerts.
  • We now have snippets and highlighting on our results page
  • Huge performance improvements
  • Better support for mobile devices and tablets
  • Better support for disabled users and for those who prefer not to use JavaScript
  • Finally, some polish everywhere to make things prettier

And that just …

more ...

RECAP Featured in XRDS Magazine

XRDS Magazine recently ran an article by Steve Schultze and Harlan Yu entitled Using Software to Liberate U.S. Case Law. The article describes the motivation behind RECAP, and outlines the state of public access to electronic court records.

Using PACER is the only way for citizens to obtain electronic records from the Courts. Ideally, the Courts would publish all of their records online, in bulk, in order to allow any private party to index and re-host all of the documents, or to build new innovative services on top of the data. But while this would be relatively cheap for the Courts to do, they haven’t done so, instead choosing to limit “open” access.

[…]

Since the first release, RECAP has gained thousands of users, and the central repository contains more than 2.3 million documents across 400,000 federal cases. If you were to purchase these documents from scratch from PACER, it would cost you nearly $1.5 million. And while our collection still pales in comparison to the 500 million documents purportedly in the PACER system, it contains many of the most-frequently accessed documents the public is searching for.

[…]

As with many issues, it all comes down to …

more ...

Announcements, Updates and the Current Roadmap

Just a quick note today to share some exciting news and updates about CourtListener.

First, I am elated to announce that the CourtListener project is now supported in part by a grant from Public.Resource.Org. With this support, we are now able to develop much more ambitious improvements to the site that would otherwise not be possible. Over the next few months, the site should be changing greatly thanks to this support, and I’d like to take a moment to share both what we’ve already been able to do, and the coming changes we have planned.

One feature that we added earlier this week is a single location where you can download the entire CourtListener corpus. With a single click, you can download 2.2GB of court cases in XML format. Check out the information on the dump page for more details about when the dump is generated, and how you can get it: http://courtlistener.com/dump-info/
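
If you’d rather script the download than click, the file can be streamed to disk so the 2.2GB never has to sit in memory. The URL below is a placeholder; the real one is listed on the dump-info page:

```python
# Stream the bulk dump to disk in chunks so the 2.2GB file never
# has to fit in memory. The URL is a placeholder; find the real one
# at http://courtlistener.com/dump-info/.
import requests

DUMP_URL = "http://courtlistener.com/dumps/latest.xml.gz"  # placeholder

with requests.get(DUMP_URL, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open("courtlistener-dump.xml.gz", "wb") as out:
        for chunk in response.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```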

The second exciting feature that we’ve been working on is a platform change that enables CourtListener to support a much larger corpus. In the past, we’ve had difficulty with jobs being performed synchronously with the court scrapers …

more ...