Adding New Fonts to Tesseract 3 OCR Engine

Update: I’ve turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but its instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I’ve explained the process so others may more easily add fonts to their system.

The process has a few major steps:

Create training documents

To create training documents, open up MS Word or LibreOffice, paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I’ve attached a sample doc too, if that helps. Set the text …

more ...

The Winning Font in Court Opinions

At CourtListener, we’re developing a new system to convert scanned court documents to text. As part of our development we’ve analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we’ve attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

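| Font | Regular | Bold | Italic | Bold Italic | Total |
|---|---|---|---|---|---|
| Times | 1454 | 953 | 867 | 47 | **3321** |
| Courier | 369 | 333 | 209 | 131 | **1042** |
| Arial | 364 | 39 | 11 | 41 | **455** |
| Symbol | 212 | 0 | 0 | 0 | **212** |
| Helvetica | 24 | 161 | 2 | 2 | **189** |
| Century Schoolbook | 58 | 54 | 52 | 9 | **173** |
| Garamond | 44 | 42 | 41 | 0 | **127** |
| Palatino Linotype | 36 | 24 | 24 | 1 | **85** |
| Old English | 42 | 0 | 0 | 0 | **42** |
| Lincoln | 27 | 0 | 0 | 0 | **27** |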
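For readers who want to build a similar tally from their own list of font names, here is a minimal sketch. It is not the attached script: it assumes you already have raw PDF font names as strings, and it assumes the common naming convention where style words such as "Bold" or "Italic" follow a comma or hyphen after the family name. The function names `classify_style` and `tally` are ours, chosen for illustration.

```python
from collections import Counter


def classify_style(font_name: str) -> str:
    """Map a raw font name to one of the four style columns in the table.

    Assumption: style is encoded by the words 'bold', 'italic' or 'oblique'
    appearing somewhere in the name.
    """
    name = font_name.lower()
    bold = "bold" in name
    italic = "italic" in name or "oblique" in name
    if bold and italic:
        return "Bold Italic"
    if bold:
        return "Bold"
    if italic:
        return "Italic"
    return "Regular"


def tally(font_names):
    """Count occurrences per (family, style) pair."""
    counts = Counter()
    for raw in font_names:
        # Drop a subset prefix such as 'ABCDEF+' if one is present.
        base = raw.split("+", 1)[-1]
        # Treat the text before the first ',' or '-' as the family name.
        family = base.replace("-", ",").split(",", 1)[0]
        counts[(family, classify_style(base))] += 1
    return counts
```

Real PDFs name fonts inconsistently, so family names would likely need further normalization (for example, mapping "TimesNewRoman" and "Times-Roman" to one row) before the counts match a table like the one above.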

Attachments

extract_font_metadata_from_files.py_.txt

font-analysis.ods

more ...

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn’t seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here’s Google’s page on the topic. And here’s Bing’s. The msnbot is retired.
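To make the two mechanisms concrete, here is a rough sketch of how a crawler might decide that a page asks not to be indexed, checking both the x-robots-tag header and the robots HTML tag. This is our own illustration using only the Python standard library, not code from any search engine. The names `RobotsMetaParser` and `is_noindex` are hypothetical, and the sketch ignores crawler-specific variants such as a user-agent prefix in the header value.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def is_noindex(headers: dict, html: str) -> bool:
    """Return True if the header or the meta tag contains 'noindex'."""
    header_value = ""
    for key, value in headers.items():
        # Header names are case-insensitive.
        if key.lower() == "x-robots-tag":
            header_value = value
    header_directives = {d.strip().lower() for d in header_value.split(",")}

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

A crawler that only supports the meta tag would skip the header branch entirely, which is why the distinction between the two matters for documents such as PDFs that have no HTML head to put a tag in.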

Yahoo, AOL

Yahoo!’s search engine is provided by Bing. AOL’s is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma), and Yandex (Russia’s search engine, known as yandex), support the robots meta tag, but do not appear to support the x-robots-tag. Ask’s page on the topic is here, and Yandex’s is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!

The Internet Archive, Alexa

The Internet Archive uses Alexa’s crawler, which is known as ia_archiver. This crawler does not seem …

more ...

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what’s available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.
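As an illustration of what blocking a document can look like at the HTTP level, here is a minimal sketch. It is not CourtListener's actual code: the functions `privacy_headers` and `robots_meta_tag` and the `blocked` flag are hypothetical, and we assume that a per-document boolean is stored somewhere and consulted when the page is served.

```python
def privacy_headers(blocked: bool) -> dict:
    """Build response headers, adding X-Robots-Tag for blocked documents."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if blocked:
        # Ask crawlers that honor the header not to index this response.
        headers["X-Robots-Tag"] = "noindex"
    return headers


def robots_meta_tag(blocked: bool) -> str:
    """Return the robots meta tag for the page head, or '' if not blocked."""
    if blocked:
        return '<meta name="robots" content="noindex">'
    return ""
```

Emitting both the header and the meta tag is a deliberate belt-and-braces choice in this sketch, since not every crawler supports both mechanisms.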

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we’re often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover and/or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person’s right to privacy and the public’s need to access court records, and to what extent do changes in practical obscurity compel action on our …

more ...

Our Biggest Change Ever is Live!

After three months of hard development, I’m pleased to announce that the new version of CourtListener is going live at this very moment. In this version, we’ve completely rewritten vast swaths of the underlying code, and we’ve switched to a far more powerful architecture.

The new site comes with some significant improvements:

  • You can now search by casename, date, court, precedential status or citation
  • Results can be ordered by date or by relevance
  • New Boolean operators are supported, and our syntax is much more intuitive (see here for many more details)
  • If you want, you can now dig very deeply into the results. Previously, we had a cap at 1,000 results for a query. Not any more.
  • Court documents will now show up in our search results within milliseconds of being found on the court’s website. In the future, if there’s demand, we may use this to offer Realtime alerts.
  • We now have snippets and highlighting on our results page.
  • Finally, some polish everywhere to make things prettier.
  • Huge performance improvements.
  • Better support for mobile devices and tablets.
  • Better support for disabled people, and users that prefer not to use JavaScript.

And that just …

more ...

RECAP Featured in XRDS Magazine

XRDS Magazine recently ran an article by Steve Schultze and Harlan Yu entitled Using Software to Liberate U.S. Case Law. The article describes the motivation behind RECAP, and outlines the state of public access to electronic court records.

Using PACER is the only way for citizens to obtain electronic records from the Courts. Ideally, the Courts would publish all of their records online, in bulk, in order to allow any private party to index and re-host all of the documents, or to build new innovative services on top of the data. But while this would be relatively cheap for the Courts to do, they haven’t done so, instead choosing to limit “open” access.

[…]

Since the first release, RECAP has gained thousands of users, and the central repository contains more than 2.3 million documents across 400,000 federal cases. If you were to purchase these documents from scratch from PACER, it would cost you nearly $1.5 million. And while our collection still pales in comparison to the 500 million documents purportedly in the PACER system, it contains many of the most-frequently accessed documents the public is searching for.

[…]

As with many issues, it all comes down to …

more ...

Announcements, Updates and the Current Roadmap

Just a quick note today to share some exciting news and updates about CourtListener.

First, I am elated to announce that the CourtListener project is now supported in part by a grant from Public.Resource.Org. With this support, we are now able to develop much more ambitious improvements to the site that would otherwise not be possible. Over the next few months, the site should be changing greatly thanks to this support, and I’d like to take a moment to share both what we’ve already been able to do, and the coming changes we have planned.

One feature that we added earlier this week is a single location where you can download the entire CourtListener corpus. With a single click, you can download 2.2GB of court cases in XML format. Check out the information on the dump page for more details about when the dump is generated, and how you can get it: http://courtlistener.com/dump-info/

The second exciting feature that we’ve been working on is a platform change that enables CourtListener to support a much larger corpus. In the past, we’ve had difficulty with jobs being performed synchronously with the court scrapers …

more ...

Second Series of Federal Reporter from 1950 to 1993 now on CourtListener

Over the past few months we have been working on cleaning and importing the 2nd series of the Federal Reporter (F.2d) from http://law.resource.org. Today we’re excited to share that we’ve made over 12,000 meta data additions, corrections or categorizations, and that we’ve finally added F.2d to our corpus.

This expands our coverage to nearly 600,000 searchable cases, and improves the quality of bulk data that is available for free on the Web.

We’re very excited by these new features, and we hope to import the third series next. If you’re interested in contributing to this work, please drop us a line - it’s a huge task cleaning and importing this information and we can use all the help we can get!

more ...

New formats for dump files

As mentioned in a previous post, we are currently making some changes to our back end to allow better citation meta data and searching granularity. As part of these changes, we have made two small changes to our dump formats.

The first change is to list docketNumber, westCite and lexisCite instead of caseNumber and westCitation. We previously had many West-style citations listed as generic case numbers. This wasn’t very accurate, so we’ve re-organized this to have better granularity.

The second change we’ve made is to how we handle missing or incomplete data. Previously, if a case was missing data, we would simply not include it in a dump. This was not the best solution, so we’re now including any information we have about a case in every dump we create. In some cases, this can create partial cases that lack vital meta data.
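If you consume the dumps, the practical consequence is that any of the citation fields may now be absent. Below is a minimal sketch of defensive parsing. The field names docketNumber, westCite and lexisCite come from this post, but everything else is an assumption made for illustration: the `<opinion>` element name, the idea that the fields are XML attributes, and the helper names `read_record` and `is_complete`. Check the real dump files before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Field names from the post; their placement as attributes is assumed.
FIELDS = ("docketNumber", "westCite", "lexisCite")


def read_record(xml_text: str) -> dict:
    """Parse one record, mapping missing or empty fields to None."""
    elem = ET.fromstring(xml_text)
    return {name: (elem.get(name) or None) for name in FIELDS}


def is_complete(record: dict) -> bool:
    """True only if every citation field has a value."""
    return all(value is not None for value in record.values())
```

For example, `read_record('<opinion docketNumber="09-1234" westCite="" />')` yields a record whose westCite and lexisCite are both `None`, so downstream code can filter on `is_complete` instead of crashing on a missing key.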

We hope these changes will be easy to work with, and that they’ll cause no disruption.

more ...

The abolishment of the Emergency Court of Appeals (April 18, 1962)

One of the coming features at CourtListener is an API for the law. Part of that feature is going to be some basic information about the courts themselves, so I spent some time over the weekend researching courts that served a special purpose but were since abolished.

One such court was the Emergency Court of Appeals. It was created during World War II to set prices, and, naturally, was the court of appeals for many cases. The creation date of the court is prominently published in various places on the Internet, but the abolishment history of the court was very difficult to find. After researching online for some time, and learning that my library card had expired (sigh), I put in a query with the Library of Congress, which provides free research of these types of things.

Within a couple of days, they provided me with this amazing response, which I’m sharing here, and on the above Wikipedia article:

As stated in the Legislative Notes to 50 U.S. Code Appendix §§ 921 to 926, as posted at http://www.law.cornell.edu/uscode/html/uscode50a/usc_sec_50a_00000921----000-notes.html, the following explanation is given regarding the amendment and repeal of Act …

more ...

Site refresh and new features now live!

After many months of work and about 100 revisions to the code, today we’ve rolled out the latest version of the site. This version comes with some great enhancements:

  • We rolled this out to our Twitter stream a few months ago, but we finally have proper branding and a proper logo. We’re still keeping things simple, but this should make things a little prettier.
  • We’ve added the search box to all pages so searches are easier to make and so you can see what search brought you to the document you’re looking at.
  • A new favorites feature has been added that allows you to make notes about cases that interest you, and to see all of your notes in your profile.
  • The sidebar has been moved to the left in preparation for faceted searching and browsing
  • Lots of code clean up, lots of aesthetic fixes and dozens of small fixes here, there and everywhere.

We’re really happy with this refresh and the new features that are coming along with it. If you notice anything that’s not working properly or that could be better, we’re always happy to hear your feedback.

more ...

Updated Supreme Court Case Dates and The First Release of Early SCOTUS Data in Machine-Readable Form

A few years ago, the Library of Congress released a PDF that listed the exact dates that the early Supreme Court Cases were decided. Since the written record only contained the month and year of the decision, this list served as the official record for the cases.

While it was great for the Library of Congress to publish this report, unfortunately they did so in a large PDF rather than a more useful format that could be used by projects such as CourtListener. Attempts to contact the Library of Congress were unable to locate the original version of the document, so we converted the PDF into both a CSV and an ODS spreadsheet so that the data can be easily read by a computer. I’m happy to be releasing these files today so that they can be used by others.

The second project we have been working on at Free Law Project was to import this data into our system. Because citations in the file are not always unique, we had to devise a heuristic algorithm to link up the data in the CSV with the data in our system. Today, we’re happy to share that we did …

more ...
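The excerpt above does not describe the heuristic used to link CSV rows to existing cases, so the following is only one plausible sketch of the general idea: match on citation first, and when a citation is shared by several cases, fall back to case-name similarity. The dictionary keys (`citation`, `case_name`, `id`), the function name `best_match`, and the 0.6 threshold are all assumptions of ours.

```python
from difflib import SequenceMatcher


def best_match(csv_row: dict, candidates: list, threshold: float = 0.6):
    """Pick the candidate case that best matches a CSV row, or None.

    Step 1: keep only candidates with the same citation.
    Step 2: if more than one remains, compare case names and take the
            closest one, provided it clears the similarity threshold.
    """
    same_cite = [c for c in candidates if c["citation"] == csv_row["citation"]]
    if not same_cite:
        return None
    if len(same_cite) == 1:
        return same_cite[0]

    def similarity(candidate):
        return SequenceMatcher(
            None, csv_row["case_name"].lower(), candidate["case_name"].lower()
        ).ratio()

    best = max(same_cite, key=similarity)
    return best if similarity(best) >= threshold else None
```

Returning `None` for ambiguous rows, rather than guessing, leaves them for manual review, which seems the safer default when the goal is to correct decision dates.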


Schultze and Lee on RECAP at NYLS

On February 15, Steve and Tim spoke at New York Law School on “PACER, RECAP, and Free Law.” Video of the event is below:

[![]({filename}/images/recap/20110215_Lee_Schultze_RECAP_NYLS.png) ](https://recap.s3.amazonaws.com/20110215_Lee_Schultze_RECAP_NYLS.mp4)
more ...

Changes and Plans at CourtListener.com

A few weeks ago, we made a fairly major change at CourtListener.com to include ID numbers in all of our case URLs. This change meant that links that were previously like this:

http://courtlistener.com/scotus/Wong-v.-Smith/

Are now like this:

http://courtlistener.com/scotus/V5o/wong-v-smith/

Most of the old links should continue to work, but using the new links should be much faster and more reliable. The major difference between the two is the ID number, which is encoded as a short string of characters (in this case V5o). This ID corresponds directly with the ID number in our database, aiding us greatly in serving up cases quickly and accurately.
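To show how a database ID can be packed into a short URL segment like this, here is a minimal sketch of base-62 encoding. The post does not say which alphabet or base the site uses, so the `ALPHABET` ordering below (digits, then uppercase, then lowercase) and the function names `encode_id` and `decode_id` are assumptions, and the output will not necessarily match real CourtListener URLs.

```python
import string

# Assumed alphabet: 0-9, A-Z, a-z (62 symbols). The real ordering is unknown.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)


def encode_id(n: int) -> str:
    """Encode a non-negative integer ID as a short base-62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_id(s: str) -> int:
    """Decode a base-62 string back into the integer ID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because decoding gives the primary key directly, a lookup by ID avoids matching on the case name in the URL, which fits the speed and reliability benefits described above.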

Around the same time as this change, we added social networking links to all of our case pages to make them easier to share with friends and colleagues. These links use our new tiny domain, http://crt.li/, and should thus be ideal for websites like Twitter or Reddit.

In the next few months we will be getting a major new server, and will be migrating our data to it. This will allow us to serve more data, and—drum roll please—will allow us to begin …

more ...

RECAP Extension 0.8 Beta Released

We are proud to announce beta version 0.8 of RECAP.

This release of RECAP fixes an issue introduced by the newest version of PACER, which has been deployed to several district courts. We’d like to thank the users who brought this issue to our attention and also encourage all RECAP users to contact us if they notice any irregularities in the future. Each district court operates its own version of PACER, so there are often small differences in code which can affect the way that RECAP operates.

In addition, we’ve added a feature that will allow CM/ECF users to more conveniently contribute documents to the RECAP archive. A substantial number of our users are attorneys who have a separate “ECF” login as well as a standard PACER account. Many of these users find it easy to download and pay for PACER documents while logged into the ECF system, but previous versions of RECAP would not upload these documents to the shared archive. Version 0.8 changes this behavior, allowing ECF users to contribute these documents to the RECAP archive.

When we released RECAP over a year ago, we intentionally disabled the extension when it detected an …

more ...

RECAP Extension 0.7 Beta Released

We are proud to announce beta version 0.7 of RECAP. This release adds support for Firefox 4 beta, for those of you living on the cutting edge.

We’ve also added a feature requested by our users. Before this release, the only way to see if RECAP had any free documents for a particular case was to purchase and examine the docket report for that case. In version 0.7, RECAP will notify you before you run a docket report if a free archived docket is already available. On the docket query page for a case that has archived information, you should see a box appear at the bottom of your screen. Clicking on the link in that box will take you to RECAP’s summary page, which includes any docket information we have on the case as well as links to any documents we may have. Here’s an example of what you should see:

[Image: Visual of new RECAP feature]

Version 0.7 also fixes a number of bugs, both minor and major. Thanks to a few extremely helpful users, we were able to fix a problem that prevented RECAP from working correctly behind certain types of proxy servers. Users behind a corporate proxy or firewall …

more ...

RECAP Firefox Search Plugin

One of the ideas behind the RECAP project is that once government data is made accessible in a free and open format, people will find useful new ways to search and process that data. We have heard from many folks looking to do interesting things with the documents archived by RECAP, and last year a group of students built the searchable web-based RECAP Archive. Today, Brian Carver shared a simple tool he built on top of that — a Firefox RECAP search plugin. You know that little search box in the top-right corner of Firefox? If you install his plugin you can choose the RECAP Archive as one of the search engines in the drop-down menu, so that finding free federal court documents is even easier.

Pretty cool!

more ...

Assessing PACER’s Access Barriers

The U.S. Courts recently conducted a year-long assessment of their Electronic Public Access program which included a survey of PACER users. While the results of the assessment haven’t been formally published, the Third Branch Newsletter has an interview with Bankruptcy Judge J. Rich Leonard that discusses a few high-level findings of the survey. Judge Leonard has been heavily involved in shaping the evolution of PACER since its inception twenty years ago and continues to lead today.

The survey covered a wide range of PACER users—“the courts, the media, litigants, attorneys, researchers, and bulk data collectors”—and Judge Leonard claims they found “a remarkably high level of satisfaction”: around 80% of those surveyed were “satisfied” or “very satisfied” with the service.

If we compare public access before we had PACER to where we are now, there is clearly much success to celebrate. But the key question is not only whether current users are satisfied with the service but also whether PACER is reaching its entire audience of potential users. Are there artificial obstacles preventing potential PACER users—who admittedly would be difficult to poll—from using the service? The satisfaction statistic may be fine at face value, assuming …

more ...

New Search and Browsing Interface for the RECAP Archive

Update: We wound down this version of the archive, but we replaced it with something much better.

One of the most-requested RECAP features is a better web interface to the archive. Today we’re releasing an experimental system for searching and browsing, at archive.recapthelaw.org. There are also a couple of extra features that we’re eager to get feedback on. For example, you can subscribe to an RSS feed for any case in order to get updates when new documents are added to the archive. We’ve also included some basic tagging features that let anybody add tags to any case. We’re sure that there will be bugs to be fixed or improvements that can be made.

The first version of the system was built by an enterprising team of students in Professor Ed Felten’s “Civic Technologies” course: Jen King, Brett Lullo, Sajid Mehmood, and Daniel Mattos Roberts. Dhruv Kapadia has done many of the subsequent updates. The links from the RECAP Archive pages point to files on our gracious host, the Internet Archive.

See, for example, the RECAP Archive page for United States of America v. Arizona, State of, et al. This is the Arizona …

more ...