At CourtListener, we’re developing a new system to convert scanned court documents to text. As part of that work, we’ve analyzed more than 1,000 court opinions to determine which fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we’ve attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

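| Font | Regular | Bold | Italic | Bold Italic | Total |
|---|---|---|---|---|---|
| Times | 1454 | 953 | 867 | 47 | **3321** |
| Courier | 369 | 333 | 209 | 131 | **1042** |
| Arial | 364 | 39 | 11 | 41 | **455** |
| Symbol | 212 | 0 | 0 | 0 | **212** |
| Helvetica | 24 | 161 | 2 | 2 | **189** |
| Century Schoolbook | 58 | 54 | 52 | 9 | **173** |
| Garamond | 44 | 42 | 41 | 0 | **127** |
| Palatino Linotype | 36 | 24 | 24 | 1 | **85** |
| Old English | 42 | 0 | 0 | 0 | **42** |
| Lincoln | 27 | 0 | 0 | 0 | **27** |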
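
For readers curious how counts like these might be tallied, here is a minimal Python sketch. It is not the attached script. It assumes you already have a list of raw font names pulled from PDFs (for example, names like those reported by a tool such as Poppler's `pdffonts`), and it groups them by family and style. The names `FAMILIES`, `classify_font`, and `tally_fonts` are hypothetical, and the matching rules are simplified assumptions that real data would probably outgrow.

```python
import re
from collections import Counter

# Hypothetical mapping from a table family name to a substring to look for
# in the normalized font name. Extend this list for real data.
FAMILIES = [
    ("Times", "times"),
    ("Courier", "courier"),
    ("Arial", "arial"),
    ("Symbol", "symbol"),
    ("Helvetica", "helvetica"),
    ("Century Schoolbook", "centuryschoolbook"),
    ("Garamond", "garamond"),
    ("Palatino Linotype", "palatino"),
]


def classify_font(raw_name):
    """Return a (family, style) pair for one raw PDF font name."""
    # Embedded subset fonts carry a six-letter tag such as "ABCDEF+"; drop it.
    name = re.sub(r"^[A-Z]{6}\+", "", raw_name)
    # Keep only letters so "Century Schoolbook" and "CenturySchoolbook" match.
    key = re.sub(r"[^a-z]", "", name.lower())

    family = next((fam for fam, needle in FAMILIES if needle in key), "Other")

    bold = "bold" in key
    italic = "italic" in key or "oblique" in key
    if bold and italic:
        style = "Bold Italic"
    elif bold:
        style = "Bold"
    elif italic:
        style = "Italic"
    else:
        style = "Regular"
    return family, style


def tally_fonts(raw_names):
    """Count how often each (family, style) pair appears."""
    return Counter(classify_font(name) for name in raw_names)


if __name__ == "__main__":
    # Made-up sample names for illustration only, not data from the analysis.
    sample = [
        "TimesNewRomanPS-BoldMT",
        "ABCDEF+TimesNewRomanPSMT",
        "Courier-Oblique",
        "Arial,BoldItalic",
    ]
    for (family, style), count in sorted(tally_fonts(sample).items()):
        print(family, style, count)
```

Substring matching keeps the sketch short, but it will misfile fonts whose names happen to contain another family's name, so a real analysis would likely need a more careful lookup.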

Attachments

extract_font_metadata_from_files.py_.txt

font-analysis.ods