Parser
A parser that reads HTML files downloaded from PACER and converts them into JSON.
Usage
To run the parser:
python parse_pacer.py [OPTIONS] INPUT_DIR
Arguments
- INPUT_DIR: Relative path to the folder from which HTMLs will be read, e.g. ../../data/pacer/ilnd/html.

If you are using the parser in conjunction with SCALES's PACER scraper, you will likely want your input directory to be the scraper-generated html folder within your chosen court directory, as outlined here. Similarly, the output and summaries directories will be inferred as the json and summaries folders within that chosen court directory, but can be overridden by providing values for --output-dir and --summaries-dir.
Options
- -o, --output-dir TEXT: (path) The folder where the parsed JSONs will be placed. If none is provided, they will be placed in INPUT_DIR/../json/.
- -s, --summaries-dir TEXT: (path) The folder where the parser will look for accompanying case summaries. If none is provided, it will default to INPUT_DIR/../summaries/. See more on case summaries below.
- -c, --court TEXT: (defaults to none) The standard abbreviation for the district court being parsed, e.g. ilnd. If not specified, and if using the directory structure mentioned above, the parser will infer the court abbreviation from the parent folder.
- -d, --debug: (flag) Turns off concurrency in the parser. Useful for ensuring that error traces are printed properly.
- -f, --force-rerun: (flag) Tells the parser to process HTMLs even when their corresponding JSONs already exist. Useful for obtaining fresh parses after scraping updates to existing dockets.
- --force-ucids TEXT: (path) A path to a CSV file that contains a 'ucid' column. If supplied, the parser will force reruns only on HTMLs that match the provided UCIDs, rather than on every HTML in the input directory.
- -nw, --n-workers INTEGER: (defaults to 16) Number of concurrent workers, i.e. the number of parses running simultaneously.
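The default locations described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the parser's actual code; the function name infer_dirs and its return shape are hypothetical.

```python
from pathlib import Path


def infer_dirs(input_dir, output_dir=None, summaries_dir=None, court=None):
    """Mimic the documented defaults (hypothetical helper, not parser code)."""
    # The court directory is the parent of the html folder,
    # e.g. .../ilnd when the input is .../ilnd/html.
    court_dir = Path(input_dir).parent
    return {
        "output_dir": Path(output_dir) if output_dir else court_dir / "json",
        "summaries_dir": Path(summaries_dir) if summaries_dir else court_dir / "summaries",
        # The court abbreviation falls back to the parent folder's name.
        "court": court if court else court_dir.name,
    }


dirs = infer_dirs("../../data/pacer/ilnd/html")
print(dirs["court"])  # ilnd
```

Any value passed explicitly (the counterpart of --output-dir, --summaries-dir, or --court) takes precedence over the inferred one.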
Shell scripts
Two shell scripts, parse_all.sh and parse_subset.sh, are provided for batch runs across multiple court directories. To run them:
sh parse_all.sh INPATH [OPTIONS]
sh parse_subset.sh INPATH -s STARTDIR -e ENDDIR [OPTIONS]
where INPATH is the relative path to a parent folder containing multiple court directories, STARTDIR and ENDDIR define the inclusive alphabetical range of court directories to parse (e.g. nyed through nywd), and OPTIONS are any command-line options you would like to pass through to parse_pacer.py (e.g. --debug, --force-rerun, --n-workers).
Note: each court directory in the batch must include an HTML folder for input and a JSON folder for output, as is true in the scraper-generated directory structure.
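To make the "inclusive alphabetical range" concrete, here is a small Python sketch of how such a selection could work. It assumes plain string comparison on the directory names; the shell script's real implementation may differ, and select_courts is a hypothetical name.

```python
def select_courts(court_dirs, start, end):
    """Return the court directory names in the inclusive range [start, end]."""
    # Assumption: ordering is ordinary lexicographic string order.
    return [name for name in sorted(court_dirs) if start <= name <= end]


courts = ["nywd", "ilnd", "nyed", "nynd", "nysd", "ohnd"]
print(select_courts(courts, "nyed", "nywd"))  # ['nyed', 'nynd', 'nysd', 'nywd']
```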
Case summaries
Case summary pages can be downloaded through the SCALES scraper. They provide a small amount of additional information (documented below) that is not available in the case docket reports. By default, the scraper will have placed any downloaded summaries in the /summaries sub-directory of a given court directory.
When the parser runs, it will also parse any summaries associated with a case. It will search for the HTML files for these summaries in the summaries sub-directory (which can be manually specified with the --summaries-dir option).
Parser schema
Filepath fields
The following fields are inferred from the filepath:
- case_id (string) - Pacer's case ID, which has the form O:YY-TY-##### (where O is a court office code, YY is a year, TY is the case type, and ##### is a numeric identifier associated with this case)
- case_type (string) - usually 'cr' (criminal) or 'cv' (civil); other types are acceptable ('mc', 'bk'...), but they will result in an incomplete parse
- court (string) - read from the command line if passed in with the -c option
- ucid (string) - SCALES's case ID (stands for 'unique case id'), generated by prepending the court abbreviation to the Pacer case ID and used to ensure that cases with identical Pacer IDs from different districts can be distinguished from one another
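As a rough illustration of the O:YY-TY-##### form and of how a UCID is built by prepending the court, consider the sketch below. The regular expression and the ';;' separator are assumptions made for this example (the text above does not specify either), and split_case_id and make_ucid are hypothetical names rather than parser functions.

```python
import re

# Assumed pattern for O:YY-TY-#####: numeric office code, two-digit year,
# lowercase case type, numeric identifier. Real PACER IDs may vary.
CASE_ID_RE = re.compile(
    r"^(?P<office>\d+):(?P<year>\d{2})-(?P<case_type>[a-z]+)-(?P<number>\d+)$"
)


def split_case_id(case_id):
    """Split a case ID into its four documented components."""
    match = CASE_ID_RE.match(case_id)
    if match is None:
        raise ValueError(f"unrecognized case ID: {case_id!r}")
    return match.groupdict()


def make_ucid(court, case_id, sep=";;"):
    """Prepend the court abbreviation; the separator is an assumption."""
    return f"{court}{sep}{case_id}"


parts = split_case_id("1:16-cv-00001")
print(parts["case_type"])                  # cv
print(make_ucid("ilnd", "1:16-cv-00001"))  # ilnd;;1:16-cv-00001
```

Because the court is part of the UCID, '1:16-cv-00001' in ilnd and the same Pacer ID in another district yield different identifiers.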
Header fields
The following fields are pulled from the header of the Pacer docket:
- header_case_id (string) - similar to case_id, but pulled from the docket itself rather than the filepath
- case_status (string) - 'closed' if a terminating date is listed, else 'open'
- case_name (string)
- filing_date (string)
- terminating_date (string)
- city (string)
- lead_case_id (string) - if the case is part of multi-district litigation (MDL), the id of the lead case
- lead_case_pacer_id (string) - the internal numerical ID that PACER uses to refer to the lead case
- magistrate_case_ids (list of strings)
- related_cases (list of strings) - any case numbers found in the "Related cases:" header line
- other_courts (list of strings) - any case numbers found in the "Case in other court:" header line
- filed_in_error_text (string) - any known "filed in error" messages that were found on the page
- case_flags (list of strings) - any flags listed in the upper right corner of the Pacer docket
- mdl_code (integer)
- Civil only: judge (string) - encoded on a per-defendant basis in criminal cases
- Civil only: referred_judges (list of strings) - encoded on a per-defendant basis in criminal cases
- Civil only: appeals_case_ids (list of strings) - encoded on a per-defendant basis in criminal cases
- Civil only: nature_suit (string)
- Civil only: jury_demand (string)
- Civil only: cause (string)
- Civil only: jurisdiction (string)
- Civil only: monetary_demand (string)
Body fields
The following fields are pulled from the body of the Pacer docket:
- parties (list of dicts) - each item contains:
  - name (string)
  - pacer_id (integer) - only filled in for criminal cases, which number their defendants
  - role (string) - the role (Plaintiff, Defendant, Petitioner, Respondent, etc.) listed on the docket
  - party_type (string) - the SCALES-defined category ('plaintiff', 'defendant', 'other_party', etc.) into which the role was classified
  - entity_info (dict):
    - raw_info, office_name, address, phone, fax, email (strings)
    - terminating_date (string) - any individual terminating date pertaining to this party
  - counsel (list of dicts) - each item contains:
    - name (string)
    - is_pro_se, is_lead_attorney, is_notice_attorney, is_pro_hac_vice (booleans)
    - designation, bar_status, trial_bar_status (strings)
    - has_see_above (boolean) - whether or not the address is listed as 'see above for address', meaning address info should be obtained from preceding counsel entries
    - entity_info (dict) - contains the same fields as the party-level entity_info dict described above
  - Criminal only: pending_counts, terminated_counts (lists of dicts) - each item contains pacer_id (integer), counts (string), disposition (string)
  - Criminal only: complaints_text (string)
  - Criminal only: complaints_disposition (string)
  - Criminal only: highest_offense_level_opening (string)
  - Criminal only: highest_offense_level_terminated (string)
  - Criminal only: judge (string)
  - Criminal only: referred_judges (list of strings)
  - Criminal only: appeals_case_ids (list of strings)
- docket_available (boolean) - False if the docket table was empty, redacted, etc.
- docket (list of dictionaries) - each item contains:
  - date_filed (string)
  - ind (string) - Pacer's numerical index for this entry (can be an empty string, as not all Pacer entries are numbered)
  - docket_text (string)
  - documents (dictionary) - each key is either a non-zero attachment number or '0' for the main document, and each value is a dictionary with the following structure:
    - url (string) - the Pacer URL for this document
    - span (dictionary) - the starting and ending indices (within docket_text) of the hyperlink to the document, formatted as a dictionary with keys start and end
  - edges (list of tuples) - each element is a three-value tuple (encoded in graph-edge format) representing a hyperlink between two docket entries, with the first value encoding the index of the source entry within docket, the second value encoding the index of the target entry, and the third value encoding the starting and ending indices of the hyperlink within docket_text (as specified in span above)
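The docket structure above can be read back with a short Python sketch. The sample entries, dates, and URL are made up for illustration. The code also assumes that edges is stored on each docket entry, that the span refers to the source entry's docket_text, and that end is exclusive (Python slice semantics); none of these points is confirmed above. Note that JSON stores tuples as lists.

```python
def link_text(entry, span):
    """Return the hyperlinked text of an entry (assumes an exclusive end index)."""
    return entry["docket_text"][span["start"]:span["end"]]


def describe_edges(docket):
    """Yield (source ind, target ind, link text) for every edge in the docket."""
    for entry in docket:
        for source, target, span in entry.get("edges", []):
            yield docket[source]["ind"], docket[target]["ind"], link_text(docket[source], span)


# Hypothetical two-entry docket that follows the schema above.
docket = [
    {
        "date_filed": "01/04/2016",
        "ind": "1",
        "docket_text": "COMPLAINT filed by Jane Doe",
        "documents": {"0": {"url": "https://example.invalid/doc/1",
                            "span": {"start": 0, "end": 9}}},
        "edges": [],
    },
    {
        "date_filed": "02/01/2016",
        "ind": "2",
        "docket_text": "ANSWER to 1 Complaint",
        "documents": {},
        # Entry at position 1 links to the entry at position 0.
        "edges": [[1, 0, {"start": 10, "end": 11}]],
    },
]

main_doc = docket[0]["documents"]["0"]
print(link_text(docket[0], main_doc["span"]))  # COMPLAINT
print(list(describe_edges(docket)))            # [('2', '1', '1')]
```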
Receipt fields
The following fields are pulled from the 'Transaction Receipt' at the bottom of the Pacer docket:
- billable_pages (integer)
- cost (float)
- download_timestamp (string)
- n_docket_reports (integer) - the number of times the SCALES scraper has modified this docket (1 if there have never been updates, >1 if new docket entries have been added after the initial download)
Case-summary fields
When a case summary page is available, the following fields are pulled from that page and stored as a dict in the summary field of the main JSON.
- case_id (string) - the case id, e.g. '1:16-cv-00001, All defendants'
- case_name (string) - e.g. 'USA v. Johnson et al.'
- date_filed (string)
- date_terminated (string)
- date_of_last_filing (string) - the date of the latest filing in the case
- presiding (string) - presiding judge, if any
- referral (string) - referred judge, if any
- billable_pages (int) - no. of billable pages (usually just 1)
- cost (float) - the cost of downloading the case summary (usually 0.10)
- download_timestamp (string) - the time the case summary was downloaded
- Civil only: parties (list of objects) - each item contains:
  - name (string)
  - role (string) - their role in the case, e.g. 'Plaintiff' or 'Defendant'
  - represented_by (string) - name of the party's representation
  - email (string)
  - phone (string)
  - fax (string) - n.b.: non-fax-related things often end up here (e.g. 'Pro Hac Vice', 'MDL')
- Criminal only: defendants (list of dicts) - each item contains:
  - name (string)
  - ind (string) - should correspond to the defendant indices in the docket report
  - office (string) - the court office
  - county (string) - the court county
  - filed (string) - defendant-specific filing date
  - terminated (string) - defendant-specific terminating date
  - reopened (string) - defendant-specific reopening date
  - plaintiffs (list of objects) - the plaintiffs bringing the case against this defendant (each item contains the same six fields that belong to the civil parties items above)
  - other_court_case (string) - other associated cases
  - magistrate_case (string) - previous magistrate case, if any
  - flags (list of strings) - PACER flags that apply to this defendant, e.g. ['CLOSED', 'PRO_SE']
  - defendant_custody_status (string)
  - pending_status (string)
  - counts (list of objects) - each item contains count (the count reference, e.g. '1sss'), citation, offense_level, and text (the text associated with the count)
  - complaints (list of objects) - each item contains citation, offense_level, and text
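As a small usage sketch, the criminal summary fields above could be flattened into one row per count like this. The sample summary (names, citation, offense level) is invented for illustration, and list_counts is a hypothetical helper, not part of the parser.

```python
def list_counts(summary):
    """Return (defendant name, count reference, citation) for every count."""
    rows = []
    # Civil summaries have no 'defendants' key, so default to an empty list.
    for defendant in summary.get("defendants", []):
        for count in defendant.get("counts", []):
            rows.append((defendant["name"], count["count"], count["citation"]))
    return rows


# Hypothetical summary dict shaped like the schema above.
summary = {
    "case_id": "1:16-cr-00001, All defendants",
    "defendants": [
        {
            "name": "John Doe",
            "ind": "1",
            "counts": [
                {"count": "1", "citation": "18:1343",
                 "offense_level": "Felony", "text": "WIRE FRAUD"},
            ],
            "complaints": [],
        }
    ],
}

print(list_counts(summary))  # [('John Doe', '1', '18:1343')]
```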
Internal use only
The following fields are not pulled directly from the Pacer docket, and are primarily meant for internal use:
- mdl_id_source (string) - the origin of mdl_code (either 'lead_case_id' or 'flags')
- is_mdl (boolean) - true if mdl_code is non-null or if the case has any MDL flags
- is_multi (boolean) - true if is_mdl is true or if any of lead_case_id, member_case_key, or other_court is non-null
- member_case_key (string) - a UCID-formatted version of lead_case_id (or a copy of ucid if this case is a lead case); used to write MDL-related data to an external file for improved performance
- download_url (string) - the URL from which this HTML was downloaded; only present if parsing an HTML from the SCALES scraper
- case_pacer_id (integer) - the internal numerical ID that PACER uses to refer to this case; only present if parsing an HTML from the SCALES scraper
- scraper_labels (list of strings) - any text labels (in the form of a "stamp" that the SCALES scraper leaves at the end of the HTML) added by the user who scraped this case
- is_stub (boolean) - True if 'stub' is in scraper_labels
- is_private (boolean) - True if 'private' is in scraper_labels
- Deprecated: source (string) - was used to distinguish between JSONs from this parser and similarly-formatted JSONs from other sources
- Deprecated: recap_id (integer) - was used for RECAP's internal numerical ids when remapping data acquired from RECAP
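The boolean fields above follow simple rules that can be sketched in Python. This is an illustration of the documented rules only, not the parser's code. In particular, treating any flag that contains 'MDL' as an "MDL flag" is an assumption, and derive_flags is a hypothetical name.

```python
def derive_flags(case):
    """Recompute the documented booleans from a parsed case dict."""
    flags = case.get("case_flags") or []
    labels = case.get("scraper_labels") or []
    # Assumption: an "MDL flag" is any case flag containing the text 'MDL'.
    has_mdl_flag = any("MDL" in flag.upper() for flag in flags)
    is_mdl = case.get("mdl_code") is not None or has_mdl_flag
    is_multi = is_mdl or any(
        case.get(key) is not None
        for key in ("lead_case_id", "member_case_key", "other_court")
    )
    return {
        "is_mdl": is_mdl,
        "is_multi": is_multi,
        "is_stub": "stub" in labels,
        "is_private": "private" in labels,
    }


example = {
    "mdl_code": None,
    "case_flags": ["CLOSED"],
    "lead_case_id": "1:15-cv-00002",
    "scraper_labels": ["stub"],
}
print(derive_flags(example))
```

In this example the case is not an MDL case, but it still counts as multi because lead_case_id is set.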