Parser
A parser that reads HTML dockets downloaded from Pacer.gov and converts them into JSON.
Usage
To run the parser:
python parse_pacer.py [OPTIONS] INPUT_DIR
Arguments
INPUT_DIR
: Relative path to the folder from which HTMLs will be read, e.g. ../../data/pacer/ilnd/html
If you are using the parser in conjunction with SCALES's Pacer scraper, you will likely want your input directory to be the scraper-generated html folder within your chosen court directory, as outlined here. Similarly, the output and summaries directories will be inferred as the json and summaries folders within that chosen court directory, but can be overridden by providing values for --output-dir and --summaries-dir.
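The default locations described above can be sketched with pathlib. This is a minimal illustration rather than the parser's actual code: the function name infer_dirs is hypothetical, and it assumes the scraper-generated layout in which the html, json and summaries folders sit side by side inside a court directory. It also takes the court abbreviation from the parent folder name when none is given, as described for the --court option.

```python
from pathlib import Path


def infer_dirs(input_dir, output_dir=None, summaries_dir=None, court=None):
    """Hypothetical helper mirroring the documented defaults."""
    input_dir = Path(input_dir)
    court_dir = input_dir.parent  # e.g. ../../data/pacer/ilnd
    return {
        # Explicit values win; otherwise fall back to sibling folders.
        "output_dir": Path(output_dir) if output_dir else court_dir / "json",
        "summaries_dir": Path(summaries_dir) if summaries_dir else court_dir / "summaries",
        # Assumption: the court directory is named after the court abbreviation.
        "court": court if court else court_dir.name,
    }


dirs = infer_dirs("../../data/pacer/ilnd/html")
print(dirs["output_dir"])
print(dirs["summaries_dir"])
print(dirs["court"])
```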
Options
-o, --output-dir TEXT
(path) The folder where the parsed JSONs will be placed. If none is provided they will be placed in INPUT_DIR/../json/
-s, --summaries-dir TEXT
(path) The folder where the parser will look for accompanying case summaries. If none is provided it will default to INPUT_DIR/../summaries/. See more on case summaries below.
-c, --court TEXT
(defaults to none) The standard abbreviation for the district court being parsed, e.g. ilnd. If not specified, and if using the directory structure mentioned above, the parser will infer the court abbreviation from the parent folder.
-d, --debug
(flag) Turns off concurrency in the parser. Useful for ensuring that error traces are printed properly.
-f, --force-rerun
(flag) Tells the parser to process HTMLs even when their corresponding JSONs already exist. Useful for obtaining fresh parses after scraping updates to existing dockets.
--force-ucids
(path) A path to a .csv file that contains a 'ucid' column. If supplied, the parser will force a rerun only on HTMLs that match the provided UCIDs (rather than force rerunning on the entire INPUT_DIR).
-nw, --n-workers INTEGER
(defaults to 16) Number of concurrent workers, i.e. the number of parses running simultaneously.
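One way to picture how --force-rerun and --force-ucids interact is the filter below. It is a sketch under stated assumptions, not the parser's implementation: the function select_cases and its arguments are hypothetical, cases are represented by opaque identifier strings, and it assumes that HTMLs with no existing JSON are always parsed.

```python
def select_cases(html_ucids, parsed_ucids, force_rerun=False, force_ucids=None):
    """Return the identifiers whose HTMLs should be (re)parsed.

    html_ucids:   identifiers for which an HTML exists
    parsed_ucids: identifiers for which a JSON already exists
    force_ucids:  optional identifiers, e.g. read from the 'ucid' column of a .csv
    """
    parsed = set(parsed_ucids)
    if force_ucids is not None:
        forced = set(force_ucids)   # rerun only the listed cases
    elif force_rerun:
        forced = set(html_ucids)    # rerun everything
    else:
        forced = set()              # skip anything already parsed
    return [u for u in html_ucids if u not in parsed or u in forced]


htmls = ["case-a", "case-b", "case-c"]
done = ["case-a", "case-b"]
print(select_cases(htmls, done))
print(select_cases(htmls, done, force_rerun=True))
print(select_cases(htmls, done, force_ucids=["case-b"]))
```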
Shell scripts
Two shell scripts, parse_all.sh and parse_subset.sh, are provided for batch runs across multiple court directories. To run them:
sh parse_all.sh INPATH [OPTIONS]
sh parse_subset.sh INPATH -s STARTDIR -e ENDDIR [OPTIONS]
where INPATH is the relative path to a parent folder containing multiple court directories, STARTDIR and ENDDIR define the inclusive alphabetical range of court directories to parse (e.g. nyed through nywd), and OPTIONS are any command-line options you would like to pass through to parse_pacer.py (e.g. --debug, --force-rerun, --n-workers).
Note: each court directory in the batch must include an HTML folder for input and a JSON folder for output, as is true in the scraper-generated directory structure.
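The inclusive alphabetical range used by parse_subset.sh can be illustrated in Python. This is only a sketch of the selection rule as described above; the function name courts_in_range is hypothetical, and plain string comparison is an assumption about how the directory names are ordered.

```python
def courts_in_range(court_dirs, start, end):
    """Return court directory names from start to end, inclusive, in alphabetical order."""
    return [c for c in sorted(court_dirs) if start <= c <= end]


courts = ["ohnd", "nysd", "ilnd", "nywd", "nyed", "nynd"]
print(courts_in_range(courts, "nyed", "nywd"))
# prints ['nyed', 'nynd', 'nysd', 'nywd']
```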
JSON Schema
The following fields are inferred from the filepath:
case_id
(string) - Pacer's case ID, which has the form O:YY-TY-##### (where O is a court office code, YY is a year, TY is the case type, and ##### is a numeric identifier associated with this case)
case_type
(string) - usually 'cr' (criminal) or 'cv' (civil); other types are acceptable ('mc', 'bk'...), but they will result in an incomplete parse
download_court
(string) - read from the command line if passed in with the -c option
ucid
(string) - SCALES's case ID (stands for 'unique case id'), generated by prepending the court abbreviation to the Pacer case ID and used to ensure that cases with identical Pacer IDs from different districts can be distinguished from one another
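The relationship between case_id and ucid can be sketched as follows. The helper names are hypothetical, and two details are assumptions not stated above: that the office code is numeric, and that the court abbreviation is joined to the case ID with a ';;' separator (exposed here as a parameter so it can be changed).

```python
import re

# O:YY-TY-##### -> office code, two-digit year, two-letter case type, number
CASE_ID_RE = re.compile(
    r"^(?P<office>\d+):(?P<year>\d{2})-(?P<case_type>[a-z]{2})-(?P<number>\d+)$"
)


def split_case_id(case_id):
    """Break a Pacer case ID into its documented components."""
    match = CASE_ID_RE.match(case_id)
    if match is None:
        raise ValueError(f"unrecognized case ID: {case_id!r}")
    return match.groupdict()


def make_ucid(court, case_id, sep=";;"):
    """Prepend the court abbreviation to the case ID (separator is an assumption)."""
    return f"{court}{sep}{case_id}"


print(split_case_id("1:16-cv-00001"))
print(make_ucid("ilnd", "1:16-cv-00001"))
```

Because the court abbreviation is part of the identifier, the same Pacer case ID from two districts yields two distinct UCIDs.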
The following fields are pulled from the header of the Pacer docket:
header_case_id
(string) - similar to case_id, but pulled from the docket itself rather than the filepath
case_name
(string)
filing_date
(string)
terminating_date
(string)
case_status
(string) - 'closed' if a terminating date is listed, else 'open'
judge
(string)
referred_judge
(string) - only present when the case was referred to a second judge
nature_suit
(string) - civil cases only
jury_demand
(string) - civil cases only
cause
(string) - civil cases only
jurisdiction
(string) - civil cases only
monetary_demand
(string) - civil cases only
lead_case_id
(string) - only present when the case is part of multi-district litigation (MDL)
other_court
(string) - only present when another case ID is provided by Pacer as 'Case in other court'; doesn't pick up all alternate case IDs (e.g. appeals court case numbers)
case_flags
(list of strings) - only present when there are flags listed in the upper right corner of the Pacer docket
mdl_code
(integer)
The following fields are pulled from the body of the Pacer docket:
plaintiffs, defendants, bankruptcy_parties, other_parties, misc_participants
(dictionary) - each key is the name of a participant in the case, and each value is a dictionary with the following structure:
  counsel
  (dictionary) - each key is the name of a lawyer representing this participant, and each value is a dictionary with the following structure:
    office
    (string)
    is_lead_attorney
    (boolean)
    is_pro_hac_vice
    (boolean)
    additional_info
    (dictionary) - keys vary according to the information in the docket ('Designation', 'Bar Status', etc.)
  is_pro_se
  (boolean)
  roles
  (list of strings) - 'Plaintiff', 'Petitioner', 'Movant', etc.
pending_counts, terminated_counts
(dictionary) - criminal cases only; each key is the name of a party who was charged with a criminal count, and each value is a list in which each element has the following dictionary structure:
  counts
  (string)
  disposition
  (string)
complaints
(dictionary) - certain criminal cases only; each key is the name of a party who was charged with a criminal count, and each value is the statute(s) specified as the basis of the charges
docket_available
(boolean)
docket
(list of dictionaries) - contains one item per docket entry, structured as follows:
  date_filed
  (string)
  ind
  (string) - Pacer's numerical index for this entry (can be an empty string, as not all Pacer entries are numbered)
  docket_text
  (string)
  documents
  (dictionary) - each key is either a non-zero attachment number or '0' for the main document, and each value is a dictionary with the following structure:
    url
    (string) - the Pacer URL for this document
    span
    (dictionary) - the starting and ending indices (within docket_text) of the hyperlink to the document, formatted as a dictionary with keys start and end
  edges
  (list of tuples) - each element is a three-value tuple (encoded in graph-edge format) representing a hyperlink between two docket entries, with the first value encoding the index of the source entry within docket, the second value encoding the index of the target entry, and the third value encoding the starting and ending indices of the hyperlink within docket_text (as specified in span above)
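To make the span and edges fields concrete, the sketch below walks a small hand-written docket. The sample entries, the URL, and the helper names are invented for illustration. It assumes that span indices follow Python slice semantics (start inclusive, end exclusive) and that tuples appear as lists once the JSON is loaded.

```python
# Invented sample data shaped like the docket field described above.
docket = [
    {
        "date_filed": "01/04/2016",
        "ind": "1",
        "docket_text": "COMPLAINT (Attachments: # 1 Exhibit A)",
        "documents": {
            "1": {"url": "https://example.com/doc/1", "span": {"start": 26, "end": 27}}
        },
        "edges": [],
    },
    {
        "date_filed": "02/01/2016",
        "ind": "2",
        "docket_text": "ANSWER to 1 Complaint",
        "documents": {},
        # [source index in docket, target index in docket, span in source text]
        "edges": [[1, 0, {"start": 10, "end": 11}]],
    },
]


def link_text(entry, span):
    """Return the hyperlinked substring of an entry's docket_text."""
    return entry["docket_text"][span["start"]:span["end"]]


def describe_edges(entries):
    """List (source ind, target ind, link text) for every entry-to-entry link."""
    found = []
    for entry in entries:
        for source, target, span in entry["edges"]:
            found.append(
                (entries[source]["ind"], entries[target]["ind"], link_text(entries[source], span))
            )
    return found


attachment = docket[0]["documents"]["1"]
print(link_text(docket[0], attachment["span"]))
print(describe_edges(docket))
```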
The following fields are not pulled directly from the Pacer docket, and are primarily meant for internal use:
mdl_id_source
(string) - the origin of mdl_code (either 'lead_case_id' or 'flags')
is_mdl
(boolean) - true if mdl_code is non-null or if the case has any MDL flags
is_multi
(boolean) - true if is_mdl is true or if any of lead_case_id, member_case_key, or other_court is non-null
member_case_key
(string) - a UCID-formatted version of lead_case_id (or a copy of ucid if this case is a lead case); used to write MDL-related data to an external file for improved performance
source
(string) - used to distinguish between JSONs from this parser and similarly-formatted JSONs from other sources; if generated by this parser, will always be 'pacer'
download_url
(string) - the URL from which this HTML was downloaded; only present if parsing an HTML from the SCALES scraper
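The is_mdl and is_multi rules above translate into a few lines of Python. This is a sketch of the stated logic only: the function name mdl_fields is hypothetical, and treating any flag that contains 'MDL' as an MDL flag is an assumption, since the exact flag test is not described here.

```python
def mdl_fields(case):
    """Derive is_mdl and is_multi from an already-parsed case dictionary."""
    flags = case.get("case_flags") or []
    # Assumption: a flag counts as an MDL flag if it contains 'MDL'.
    has_mdl_flag = any("MDL" in flag.upper() for flag in flags)
    is_mdl = case.get("mdl_code") is not None or has_mdl_flag
    is_multi = is_mdl or any(
        case.get(key) is not None
        for key in ("lead_case_id", "member_case_key", "other_court")
    )
    return {"is_mdl": is_mdl, "is_multi": is_multi}


print(mdl_fields({"mdl_code": 2741}))
print(mdl_fields({"other_court": "USCA, 16-1234"}))
print(mdl_fields({"case_flags": ["CLOSED"]}))
```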
The following fields are pulled from the 'Transaction Receipt' at the bottom of the Pacer docket:
billable_pages
(integer)
cost
(float)
download_timestamp
(string)
n_docket_reports
(integer) - the number of times the SCALES scraper has modified this docket (1 if there have never been updates, >1 if new docket entries have been added after the initial download)
pacer_case_id
(integer) - the unique numerical ID that Pacer uses internally to identify this document (pulled from Pacer's XML responses to user queries; not visible on the docket sheet itself)
Case summaries:
summary
(object) - case summary information, fully documented below
Case summaries
Case summaries can be downloaded through the SCALES scraper. They provide some additional information that is not available in the case docket reports. By default the scraper will place any downloaded summaries in the /summaries sub-directory of a given court directory.
When the parser runs it will also parse any summaries associated with a case. It will search for the HTML files for these summaries in the summaries sub-directory (which can be manually specified with the --summaries-dir option).
The schemas for civil and criminal cases are slightly different because PACER presents the data in different ways. The main difference is that for criminal cases each defendant has its own list of plaintiffs, whereas for civil cases there is a single list of all parties in a case (including both plaintiffs and defendants).
Criminal schema
case_id
(string) - the case ID, e.g. '1:16-cv-00001, All defendants'
case_name
(string) - e.g. 'USA v. Johnson et al.'
date_filed
(string) - the case filing date
date_terminated
(string) - the case terminating date
date_of_last_filing
(string) - the date of last filing in the case
presiding
(string) - presiding judge, if any
referral
(string) - referred judge, if any
billable_pages
(int) - no. of billable pages (usually just 1)
cost
(float) - the cost of downloading the case summary (usually 0.10)
download_timestamp
(string) - the time the case summary was downloaded
defendants
(list of objects) - a list of defendants in the case. For each defendant there is:
  plaintiffs
  (list of objects) - each object contains role, represented_by and the contact fields fax, email and phone. Note: non-fax-related things often end up in the fax field, e.g. 'US Govt Attorney'
  name
  (string) - defendant name
  ind
  (string) - defendant index within the case (should link back to the docket report)
  office
  (string) - the court office
  county
  (string) - the court county
  filed
  (string) - defendant-specific filing date
  terminated
  (string) - defendant-specific termination date
  reopened
  (string) - defendant-specific reopening date
  other_court_case
  (string) - other associated cases
  defendant_custody_status
  (string)
  flags
  (list of strings) - Pacer flags that applied to the defendant, e.g. ['CLOSED', 'PRO_SE']
  pending_status
  (string)
  magistrate_case
  (string) - previous magistrate case, if any
  counts
  (list of objects) - each object contains count (the count reference, e.g. '1sss'), citation, offense_level and text (the text associated with the count)
  complaints
  (list of objects) - each object contains citation, offense_level and text (the text associated with the count)
Civil schema
case_id
(string) - the case ID, e.g. '1:16-cv-00001, All defendants'
case_name
(string) - e.g. 'USA v. Johnson et al.'
date_filed
(string) - the case filing date
date_terminated
(string) - the case terminating date
date_of_last_filing
(string) - the date of last filing in the case
presiding
(string) - presiding judge, if any
referral
(string) - referred judge, if any
billable_pages
(int) - no. of billable pages (usually just 1)
cost
(float) - the cost of downloading the case summary (usually 0.10)
download_timestamp
(string) - the time the case summary was downloaded
parties
(list of objects) - a list of parties in the case. For each party there is:
  role
  (string) - their role in the case, e.g. 'Plaintiff', 'Defendant'
  name
  (string) - party name
  represented_by
  (string) - name of the party's representation
  fax
  (string) - contact fax no. Note: non-fax-related things often end up in the fax field, e.g. 'Pro Hac Vice', 'MDL'
  email
  (string) - contact email address
  phone
  (string) - contact phone no.
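Because the two summary layouts differ, code that reads summaries usually has to branch on which one it was given. The sketch below collects contact email addresses from either layout; the function name summary_emails and the sample data are invented for illustration, and it assumes that a summary with a top-level parties list uses the single-list layout while one with a defendants list uses the per-defendant layout.

```python
def summary_emails(summary):
    """Return the sorted, de-duplicated contact emails found in a case summary."""
    if "parties" in summary:
        # Single list of all parties in the case.
        contacts = summary["parties"]
    else:
        # One plaintiffs list per defendant.
        contacts = [
            plaintiff
            for defendant in summary.get("defendants", [])
            for plaintiff in defendant.get("plaintiffs", [])
        ]
    return sorted({c["email"] for c in contacts if c.get("email")})


# Invented sample summaries, one per layout.
single_list = {
    "parties": [
        {"role": "Plaintiff", "name": "Jane Doe", "email": "jd@example.com"},
        {"role": "Defendant", "name": "Acme Corp", "email": ""},
    ]
}
per_defendant = {
    "defendants": [
        {"name": "John Roe", "plaintiffs": [{"role": "Plaintiff", "email": "ausa@example.com"}]},
        {"name": "Richard Roe", "plaintiffs": [{"role": "Plaintiff", "email": "ausa@example.com"}]},
    ]
}
print(summary_emails(single_list))
print(summary_emails(per_defendant))
```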