A few small API changes

We’re updating our code in several ways today, and those updates change the format of our data dumps. If you use the dumps in an automated fashion, please note the following renamed fields:

  • dateFiled is now date_filed
  • precedentialStatus is now precedential_status
  • docketNumber is now docket_number
  • westCite is now west_cite
  • lexisCite is now lexis_cite

Additionally, a new field, west_state_cite, has been added, which will contain any citations to West’s state reporters. We’ve made these changes in preparation for a proper API that will return XML and JSON. Before we release that API, we needed to clean up some old field values so they were more consistent. From this point forward, you can expect better consistency in the fields of our XML.
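If you consume the dumps with scripts, a small shim can translate the old keys to the new ones while you migrate. This is just an illustrative sketch, not code we ship:

```python
# Illustrative sketch only (not shipped with CourtListener): map the old
# dump field names to the new ones while migrating downstream scripts.
OLD_TO_NEW = {
    "dateFiled": "date_filed",
    "precedentialStatus": "precedential_status",
    "docketNumber": "docket_number",
    "westCite": "west_cite",
    "lexisCite": "lexis_cite",
}

def rename_fields(record):
    """Return a copy of a parsed dump record with the new key names."""
    return {OLD_TO_NEW.get(key, key): value for key, value in record.items()}
```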

If this causes any inconvenience or if you need any help with these changes, please let us know.

more ...

Two RECAP Grants Awarded in Memory of Aaron Swartz

In memory of Internet activist Aaron Swartz, Think Computer Foundation (http://www.thinkcomputer.org) and the Center for Information Technology Policy (CITP) at Princeton University (http://citp.princeton.edu) are announcing the winners of two $5,000 grant awards for improving RECAP.

Since 2009, a team of researchers at Princeton has worked on a web browser-based system known as RECAP (https://free.law/recap/) that allows citizens to recapture public court records from the federal government’s official PACER database. The Administrative Office of the Courts charges per-page user fees for PACER documents, which makes it expensive to access these public records. RECAP allows users to easily share the records that they purchase and to freely access documents that others have already purchased.

Shortly after the unexpected death of Mr. Swartz, Think Computer Foundation announced that it would fund grants worth $5,000 each to extend RECAP and make use of data contained in Think Computer Foundation’s PlainSite database of legal information.

Two of these grants are being awarded today.

Ka-Ping Yee, a Canadian software developer living in Northern California, has created a version of RECAP for Google’s Chrome browser. This gives RECAP a much larger base of …

more ...

Five new courts added to CourtListener

We’re excited to announce today that we’ve added five new courts to the list of courts we support.

Today we add the Supreme Courts of:

  1. California
  2. Indiana
  3. West Virginia
  4. Wisconsin
  5. Wyoming

These are the first state courts that we support, and over the next few days we’ll be adding more as the Juriscraper library gains support for them. We already have another seven state courts in the wings!

By launching these courts today, we’re making a small change in our plans. We had been working towards having all 50 supreme courts ready so we could add them in one big push, but since developing these scrapers is taking longer than we would like, we’re going to start adding state courts as they’re ready, one by one.

Today’s launch adds five courts and about 1,200 more cases to the project. We need help getting the remaining courts ready. If you’re a developer and want to help, get in touch via our contact form and we’ll get you up and coding in no time.

more ...

Live coverage graphs now available

Thanks to a great volunteer contribution, we now have amazing graphs on our coverage page instead of simple static numbers.

The old version simply stated the total number of documents we had for a court, leaving you scratching your head. The new version shows you a timeline indicating how many documents we have in each court for each year. It’s a great improvement that brings a lot more transparency to the coverage we have on the site.

more ...

$10,000 in Further Awards for RECAP Projects

Today, teams across the country are hard at work on the Aaron Swartz Memorial Grants. These grants, offered by the Think Computer Foundation, provide $5,000 awards for three different projects related to RECAP.

We are delighted to announce additional awards. The generous folks over at Google’s Open Source Programs team have pledged to support two more RECAP-related project awards — at $5,000 each. These are open to anyone who wishes to submit a proposal for a significant improvement to the RECAP system. We will work with the proposers to scope the project and define what qualifies for the award. All projects must be open source.

There are several potential ideas. For instance, someone might propose adding support to RECAP for displaying the user’s current balance and prompting the user to liberate documents with their free quarterly $15 allocation as the end of the quarter approaches (inspired by Operation Asymptote). Someone might propose improving the https://www.courtlistener.com/recap/ interface and the detection and removal of private information. Someone might propose some other idea that we haven’t thought of. You may wish to watch the discussion of a few of these initial ideas …

more ...

Another new court on CourtListener

We’re on a roll, and today I’m happy to share that we’ve added yet another court to the site. Today’s court, with about 50 cases so far, is the Bankruptcy Appellate Panel for the Ninth Circuit.

We’ll be adding a historical scraper for this court soon, but for now, sit back and enjoy our super-fast results as they get delivered straight to your email.

50 today. 1,000 tomorrow.

more ...

New Courts at CourtListener with Historical Data

I mentioned in my last post that we’ve added some new courts to the site. Today we’ve added the historical data for these courts that was available on their websites.

This amounts to about 1,500 new cases on CourtListener:

  • 112 from November 2003 to today at the Court of Appeals for the Armed Forces
  • 764 from January 2000 to today at the Court of Veterans Claims
  • 600 from January 2008 to today at the Court of International Trade

All of these docs are immediately available via search, RSS, or our dump API, and will be included in the dump of all our cases when it is regenerated at the end of the month.

This also marks an important achievement for the Juriscraper library. Since CourtListener now has scrapers for all federal courts of special jurisdiction, we’re officially moving it to version 0.2. It’s taken longer than we wanted to get it here, but this is a huge step for the library.

Freeing 1,000 docs at a time.

more ...

A few updates at CourtListener

It’s been quiet around here for a little while, so it’s about time I share what’s been going on behind the scenes. As you might imagine, just because we haven’t had a lot of news doesn’t mean that we haven’t been busy.

The biggest thing I have to share today is that we’ve moved our CourtListener infrastructure to new and bigger hardware. This task has taken months to complete and involved applying many updates to the code and infrastructure. For developers, this upgrade comes with a few changes:

  1. Our default database for CourtListener is now Postgres rather than MySQL. This is something that’s been planned for a while, but wasn’t really possible until a big upgrade like this one. The big changes that come out of this are non-locking queries for our database dumps and better performance for many of our queries. Since Postgres is a transactional, stricter, and more featureful database, we’re convinced that it is a better way forward than MySQL. Oracle hasn’t lately been a great steward of MySQL, so it was a good time to jump ship. As a bonus, Postgres was started in Berkeley …
more ...

Announcing the Aaron Swartz Memorial Grants

Last week, our community lost Aaron Swartz. We are still reeling. Aaron was a fighter for openness and freedom, and many people have been channeling their grief into positive actions for causes that were close to Aaron’s heart. One of these people is Aaron Greenspan, creator of the open-data site PlainSite and the Think Computer Foundation. He has established a generous set of grants to be awarded to the first person (or group) that develops the following upgrades to RECAP, our court record liberation system. RECAP would not exist without the work of Aaron Swartz.

Three grants are being made available related to RECAP. Each grant is worth $5,000.00:

  1. Grant 1: Develop and release a version of RECAP for the Google Chrome browser that matches the current Firefox browser extension functionality
  2. Grant 2: Develop and release a version of RECAP for Internet Explorer that matches the current Firefox browser extension functionality
  3. Grant 3: Update the Firefox browser extension to capture appellate court documents, and update the RECAP server code to parse them and respond appropriately to browser extension requests

For more details, see The Aaron Swartz Memorial Grants. If you are interested, you must register by the …

more ...

Presentation on Juriscraper and CourtListener for LVI2012

Yesterday and today I’ve been in Ithaca, New York, participating in the Law via the Internet Conference (LVI), where I’ve been learning tons!

I had the good fortune to have my proposal topic selected for Track 4: Application Development for Open Access and Engagement.

In the interest of sharing, I’ve attached the latest version of my slides to this blog post, and the audio of the talk may eventually be posted on the LVI site.

Attachments

LVI-Presentation-Lissner-Juriscraper

more ...

New tool for testing lxml XPath queries

I got a bit frustrated today and decided that I should build a tool to fix my frustration. The problem was that we use a lot of XPath queries to scrape various court websites, but there was no tool that could be used to test XPath expressions efficiently.

There are a couple of tools that are quite similar to what I just built: there’s one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, each of these operates on a different DOM interpretation than the one that lxml builds.

So while these tools helped, I consistently found that when the HTML got nasty, they’d start falling over.

No more! Today I built a quick Django app that can be run locally or on a server. It’s quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it’s pretty basic on the whole.
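At its core, the check the app performs is the same one you’d write by hand with lxml; here is a minimal sketch of that core (the app just wraps something like this in a Django form and view):

```python
# A minimal sketch of the tool's core idea: parse HTML with lxml's own
# parser and evaluate an XPath expression against the resulting tree.
from lxml import html

document = html.fromstring("<html><body><p class='a'>Hello</p></body></html>")
for match in document.xpath("//p[@class='a']/text()"):
    print(match)  # -> Hello
```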

I’d love to get any feedback I can about this. It’s …

more ...

Announcing the third series of the Federal Reporter!

Following on Friday’s big announcement about our new citator, today I’m excited to share that we’ve completed incorporating volumes 1 to 491 of the third series of the Federal Reporter (F.3d). This has been a monumental task over the past six months. Since we already had many cases from the same time period and jurisdiction, we had to work very hard on our duplicate-merging algorithm. In the end, we were able to get upwards of 99% accuracy with our merging code, and any cases that could not be merged automatically were handled by human review. The outcome of this work is an improved dataset beyond any that has been available previously: in tens of thousands of cases, we have been able to merge the metadata on Resource.org with data that we obtained directly from the court websites.
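The post doesn’t describe the merging algorithm itself, but as a toy illustration of the general idea, duplicates can be flagged by comparing case names and dates against a similarity threshold:

```python
# A toy illustration of duplicate detection, NOT the merging algorithm
# described above. difflib ships with the Python standard library.
from difflib import SequenceMatcher

def likely_duplicates(name_a, name_b, date_a, date_b, threshold=0.9):
    """Flag two opinions as probable duplicates when they share a decision
    date and their case names are nearly identical."""
    if date_a != date_b:
        return False
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold
```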

These new cases bring our total number of cases up to 756,713, and we hope to hit a million by the end of the year. With this done, our next task is to begin incorporating data from all of the appellate-level state courts. We will be working on this in a …

more ...

Building a Citator on CourtListener

I’m incredibly excited today to announce that over the past few weeks we have successfully rolled out a Citator on CourtListener. This feature was developed by UC Berkeley School of Information students Karen Rustad and Rowyn McDonald after a thorough design and development cycle that included everything from user interviews to performance optimizations of our citation-finding algorithm.

As you’re browsing the site, you’ll immediately see three big new features. First, all Federal citations to documents that we have in our collection are now links. So as you’re reading, if there’s a reference to a prior case that you feel might be useful to your research, you can just click the link to that case and continue your research there. This allows you to go upstream in your research, looking at the important cases that came before.

The second big change you’ll see is a new sidebar on all case pages that lists the top five cases that reference the one you’re reading. This allows you to go downstream from the case you’re reading, where you’ll be able to identify how the case was later interpreted by other courts.
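A citator has to start by finding citation strings in opinion text. As a toy illustration only (not the algorithm Karen and Rowyn built), the common “volume reporter page” pattern can be matched with a regular expression:

```python
# Toy illustration of citation finding, not CourtListener's algorithm:
# match "volume reporter page" strings such as "410 U.S. 113".
import re

CITATION_RE = re.compile(r"\b(\d{1,4})\s+(U\.S\.|F\.[23]d|F\. Supp\.)\s+(\d{1,5})\b")

text = "See Roe v. Wade, 410 U.S. 113 (1973); accord 598 F.3d 1120."
print(CITATION_RE.findall(text))
# [('410', 'U.S.', '113'), ('598', 'F.3d', '1120')]
```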

At the …

more ...

Further privacy protections at CourtListener

I’ve written previously about the lengths we go to at CourtListener to protect people’s privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it’s possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
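In Django terms (CourtListener is a Django app), the change amounts to attaching that header to the sitemap response. A minimal sketch, with an illustrative wrapper name rather than our actual code:

```python
# A minimal sketch, assuming Django: add a noindex X-Robots-Tag so that
# crawlers may read the sitemaps but won't index or serve them.
# The wrapper view name is illustrative, not our actual code.
from django.contrib.sitemaps.views import sitemap

def noindex_sitemap(request, **kwargs):
    response = sitemap(request, **kwargs)
    response["X-Robots-Tag"] = "noindex"
    return response
```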

more ...

My Presentation Proposals for LVI 2012

The Law Via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7–9. I will be attending, and with any luck, I’ll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: by simply visiting a court’s website, it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of which suffer from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those who wish to collect court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally licensed open source library that can be picked up and used by any organization to scrape case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support …
more ...

Announcing OCR Support on CourtListener

For the past few months, we have been blogging about our research into how to handle scanned documents at CourtListener since a number of courts have a habit of releasing their opinions in this manner. Previously when this happened, it meant that we couldn’t get the text out of the document, and as a result, it was impossible for anybody to find these cases on the site.

Obviously, this is a bad situation for our users, so we are excited to announce that as of today we have a new Optical Character Recognition (OCR) system for extracting the text from scanned documents. We’re currently extracting the text from an additional 10,000 opinions that were previously unsearchable, and going into the future we’ll do this automatically as we get cases from the courts.
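The post doesn’t spell out the pipeline, but our neighboring posts describe a Tesseract-based system; in outline, extraction from a scanned PDF looks something like this sketch (pdf2image and pytesseract are stand-ins for our actual glue code):

```python
# Sketch of Tesseract-based text extraction from a scanned PDF.
# pdf2image and pytesseract are assumptions standing in for the real
# glue code; the overall shape (rasterize, then OCR) is the point.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("opinion.pdf", dpi=300)  # one image per page
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text[:500])
```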

This change further expands the breadth of our coverage, and we hope you find it to be a useful change!

more ...

Announcing CourtListener’s New Sub-Project: Juriscraper

For the past two years at CourtListener, we used a mess of code to scrape the federal court system. This worked remarkably well, but when we recently began expanding our coverage, it became clear that a rewrite was needed. For the past several weeks, we’ve been building a replacement called Juriscraper that is more reliable, understandable, flexible, and expandable.

Unlike our old scrapers, Juriscraper is a library that anybody can pick up and use, and which allows your project to easily scrape court websites. It is currently at version 0.1, which supports all of the courts on CourtListener, and over the next few weeks we’ll be adding many more courts until we have all of the available courts in the United States.
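To give a feel for what “a library that anybody can pick up and use” means, here is a hypothetical usage sketch; the module path and attribute names follow later Juriscraper releases and may not match the 0.1 API exactly:

```python
# Hypothetical usage sketch; module path and attribute names follow
# later Juriscraper releases and may differ from the 0.1 API.
from juriscraper.opinions.united_states.federal_appellate import ca9

site = ca9.Site()
site.parse()  # fetch and parse the court's latest opinions page
for date, name, url in zip(site.case_dates, site.case_names, site.download_urls):
    print(date, name, url)
```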

We hope that this project will be something that others will use, and that we can thus centralize our scraping efforts. There are many organizations that are currently scraping court websites, each with their own implementations that they build and maintain. This creates a lot of duplicated work and slows down maintenance for everybody. By finally creating a liberally licensed shared scraper, we hope to bring everybody under the same scraping roof so we can share …

more ...

Adding New Fonts to Tesseract 3 OCR Engine

Update: I’ve turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but their instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I’ve explained the process so others may more easily add fonts to their system.

The process has a few major steps:

Create training documents

To create training documents, open up MS Word or LibreOffice and paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt using character spacing. I’ve attached a sample doc too, if that helps. Set the text …

more ...

The Winning Font in Court Opinions

At CourtListener, we’re developing a new system to convert scanned court documents to text. As part of our development we’ve analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we’ve attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.
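The attached script is the canonical version; for readers who just want the idea, here is a sketch using pdfminer.six (an assumption on my part, not necessarily what the attachment uses) that tallies the fonts used in a PDF:

```python
# Sketch of per-font character counting in a PDF using pdfminer.six,
# which is an assumption here; see the attached script for what we ran.
from collections import Counter
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

def count_fonts(pdf_path):
    counts = Counter()
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            counts[char.fontname] += 1
    return counts

print(count_fonts("opinion.pdf").most_common(5))
```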

Unsurprisingly, the top font — drumroll please — is Times New Roman.

| Font | Regular | Bold | Italic | Bold Italic | Total |
| --- | ---: | ---: | ---: | ---: | ---: |
| Times | 1454 | 953 | 867 | 47 | **3321** |
| Courier | 369 | 333 | 209 | 131 | **1042** |
| Arial | 364 | 39 | 11 | 41 | **455** |
| Symbol | 212 | 0 | 0 | 0 | **212** |
| Helvetica | 24 | 161 | 2 | 2 | **189** |
| Century Schoolbook | 58 | 54 | 52 | 9 | **173** |
| Garamond | 44 | 42 | 41 | 0 | **127** |
| Palatino Linotype | 36 | 24 | 24 | 1 | **85** |
| Old English | 42 | 0 | 0 | 0 | **42** |
| Lincoln | 27 | 0 | 0 | 0 | **27** |

Attachments

extract_font_metadata_from_files.py_.txt

font-analysis.ods

more ...

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn’t seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (whose crawler is known as Googlebot) and Bing (known as Bingbot) support both the x-robots-tag header and the robots HTML meta tag. Here’s Google’s page on the topic. And here’s Bing’s. The msnbot is retired.
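For reference, the two mechanisms look like this; both are standard forms:

```
HTTP response header:       X-Robots-Tag: noindex
HTML meta tag (in <head>):  <meta name="robots" content="noindex">
```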

Yahoo, AOL

Yahoo!’s search engine is provided by Bing. AOL’s is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma) and Yandex (Russia’s search engine, known as yandex) support the robots meta tag, but do not appear to support the x-robots-tag header. Ask’s page on the topic is here, and Yandex’s is here. The popular open source crawler Nutch also supports the robots meta tag, but not the x-robots-tag header. Update: newer versions of Nutch now support the x-robots-tag!

The Internet Archive, Alexa

The Internet Archive uses Alexa’s crawler, which is known as ia_archiver. This crawler does not seem …

more ...