Scraper
A collection of web scrapers to download data from Pacer.gov. The scrapers.py
script contains five scraper modules:
- Query Scraper
- Docket Scraper
- Summary Scraper
- Member Scraper
- Document Scraper
Module | Purpose | Input | Output |
---|---|---|---|
Query Scraper | Pull case query results | Query parameters | Query results page (html) |
Docket Scraper | Pull case dockets | Query html or csv | Case dockets (html) |
Summary Scraper | Pull case summaries | Query html or csv | Case summaries (html) |
Member Scraper | Pull MDL member case pages | Query html or csv | Member cases pages (html) |
Document Scraper | Pull case documents & attachments | Case dockets | Case documents (pdf) |
Getting Started
Setup
To run this scraper you will need the following:
- Python 3.7+
- Selenium 3.12+
- Firefox 80.0+
- GeckoDriver
Login Details
Before running the scraper, you will need an account on Pacer.gov. Create an auth file (in .json format) with your login details, as below:

```
{
    "user": "<your_username>",
    "pass": "<your_password>"
}
```
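If you prefer to create this file from a script, a minimal sketch (auth.json is just an example filename; pass whatever path you use via the -a/--auth-path option):

```python
import json

# Write PACER credentials in the format the scraper expects (see above).
credentials = {
    "user": "<your_username>",
    "pass": "<your_password>",
}

with open("auth.json", "w") as f:  # "auth.json" is an example filename
    json.dump(credentials, f, indent=4)
```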
Directory Structure
As the scraper is run on a single district court at a time, it is recommended that Pacer downloads be separated into different directories by court. An example of a data folder is the following:
```
/data
|-- pacer
|   |-- ilnd
|   |-- nyed
|   |-- txsd
|   |-- ...
```
When running the scraper, a court directory will have an imposed structure as below (the necessary sub-directories will be created):
```
/ilnd
|-- html        # Original case dockets
|   |-- 1-16-cv-00001.html
|   |-- ...
|
|-- json        # Parsed case dockets
|   |-- 1-16-cv-00001.json
|   |-- ...
|
|-- queries     # Downloaded queries and saved configs
|   |-- 2016cv_result.html
|   |-- 2016cv_config.json
|   |-- ...
|
|-- summaries   # Downloaded case summaries
|   |-- 1-16-cv-00001.html
|   |-- ...
|
|-- docs        # Downloaded documents and attachments
|   |-- ilnd;;1-16-cv-00001_1_2_u7905a347_t200916.pdf
|   |-- ...
|
|-- _temp_      # Temporary download folder for scraper
|   |-- 0
|   |   ...
|   |-- 1
|   |   ...
|   |-- ...
```
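The scraper creates the necessary sub-directories itself, but if you want to scaffold a new court directory up front, a sketch along these lines would work (the sub-directory names follow the structure above):

```python
from pathlib import Path

def scaffold_court_dir(court_dir):
    """Create the imposed sub-directory structure for a court directory."""
    for sub in ("html", "json", "queries", "summaries", "docs", "_temp_"):
        # mkdir is idempotent here: existing directories are left untouched.
        Path(court_dir, sub).mkdir(parents=True, exist_ok=True)

scaffold_court_dir("../../data/pacer/ilnd")
```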
UCIDs (unique case identifiers)
To uniquely identify cases, the scraper uses its own identifiers, called UCIDs, which are constructed from the following two components:

```
<court abbreviation>;;<case id>
```

For example, the case 1:16-cv-00001 in the Northern District of Illinois would be identified as ilnd;;1:16-cv-00001.

Note: In some districts, it is common to include judge initials at the end of a case id, e.g. 2:15-cr-11112-ABC-DE. These initials are always excluded from the UCID.
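For illustration, a minimal sketch of constructing a UCID (this is not the scraper's own implementation, and the judge-initials pattern is an assumption based on the example above):

```python
import re

def build_ucid(court, case_id):
    """Build a UCID, stripping trailing judge initials (e.g. '-ABC-DE')."""
    # Assumed pattern: initials are trailing groups of 1-4 uppercase letters.
    case_id = re.sub(r"(-[A-Z]{1,4})+$", "", case_id)
    return f"{court};;{case_id}"

assert build_ucid("ilnd", "1:16-cv-00001") == "ilnd;;1:16-cv-00001"
assert build_ucid("ilnd", "2:15-cr-11112-ABC-DE") == "ilnd;;2:15-cr-11112"
```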
Runtime
The scraper is designed to run at night to reduce its impact on server load. By default it will only run between 8pm and 4am (CDT). These parameters can be altered and overridden through the -rts, -rte and --override-time options (see below for details).
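Note that the default window wraps past midnight (8pm through 4am), so the check cannot be a simple start <= hour < end comparison. A sketch of the logic (illustrative only, not the scraper's own code):

```python
from datetime import datetime

def in_runtime_window(hour, start=20, end=4):
    """True if `hour` is inside the allowed window; handles windows that wrap midnight."""
    if start <= end:
        return start <= hour < end
    # Wrapping window, e.g. 20 -> 4: valid hours are 20..23 and 0..3.
    return hour >= start or hour < end

# Assumes the local clock is on CDT, as the -rts/-rte defaults are.
print(in_runtime_window(datetime.now().hour))
```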
Fees
Pacer fees can rack up quickly! Running this scraper will incur costs to your own Pacer account. A number of scraper options exist to limit the potential for accidentally incurring large charges:
- Case limit - a maximum no. of cases/dockets to be downloaded can be specified (see --case-limit below).
- Document limit - a maximum no. of documents per case can be specified, so as to exclude dockets with large amounts of documents from the Document Scraper (see --document-limit below).
Usage
To run the scraper:
```
python scrapers.py [OPTIONS] INPATH
```
Arguments
inpath
: Relative path to the court directory folder, e.g. ../../data/pacer/ilnd. This is the directory that will have the imposed structure as outlined above.
Options
The options passed to the scraper can be grouped into the following categories:

General (apply to all modules)
-m, --mode [query|docket|summary|member|document]
: Which scraper mode to run.

-a, --auth-path PATH
: Relative path to the login details auth file (see above).

-c, --court
: The standard abbreviation for the district court being scraped, e.g. ilnd.

-nw, --n-workers INTEGER
: No. of workers to run simultaneously (for docket/document scrapers), i.e. no. of simultaneous browsers running.

-ct, --case-type TEXT
: Specify a single case type to filter query results. If none given, the scraper will pull 'cv' and 'cr' cases.

-rts, --runtime-start INTEGER
: (default: 20) The start runtime hour (in 24hr, CDT). The scraper will not run if the current hour is before this hour.

-rte, --runtime-end INTEGER
: (default: 4) The end runtime hour (in 24hr, CDT). The scraper will stop running when the current hour reaches this hour.

--override-time
: Override the time restrictions and run the scraper regardless of the current time.

--case-limit INTEGER
: Sets a limit on the maximum no. of cases to process (enter 'false' or 'f' for no limit). This will be applied as a limit on:
  - the no. of case dockets the docket scraper pulls
  - the no. of case dockets the document scraper takes as an input

--headless
: Run Selenium in headless mode (i.e. no Firefox window will appear); useful if running on a server that does not have a display.

--verbose
: Give slightly more verbose logging output.
Query Scraper
-qc, --query-conf PATH
: Configuration file (.json) for the query that will be used to populate the query form on Pacer. If none is specified, the query builder will run in the terminal.

--query-prefix TEXT
: A prefix for the filenames of output query HTMLs. If the date range of the query is greater than 180 days, the query will be split into chunks of 31 days to prevent PACER crashing while serving a large query results page. Multiple files will be created that follow the pattern {query_prefix}__i.html, where i enumerates over the date range chunks.
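For illustration, a sketch of how a date range might be split into such chunks (the 180-day threshold and 31-day chunk size follow the description above; the helper itself is hypothetical, not the scraper's own code):

```python
from datetime import date, timedelta

def chunk_date_range(start, end, threshold=180, chunk_days=31):
    """Split [start, end] into 31-day chunks if the range exceeds the threshold."""
    if (end - start).days <= threshold:
        return [(start, end)]
    chunks, chunk_start = [], start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=chunk_days - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks

# A year-long query gets split into 31-day chunks, enumerated as {query_prefix}__i.html.
for i, (a, b) in enumerate(chunk_date_range(date(2016, 1, 1), date(2016, 12, 31))):
    print(f"2016cv__{i}.html: {a} to {b}")
```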
Docket Scraper
--docket-input PATH
: A relative path that is the input for the Docket Scraper module: this can be a single query result page (.html), a directory of query html files, or a csv with UCIDs.

-mem, --docket-mem-list [always|avoid|never]
: (default: never) How to deal with member lists in docket reports (affects costs, particularly with class actions and MDLs):
  - always: Always include them in reports.
  - avoid: Do not include them in a report if the current case was previously listed as a member case in a previously downloaded docket.
  - never: Never include them in reports.

--docket-exclude-parties
: If given, 'Parties and counsel' and 'Terminated parties' will be excluded from docket reports (this reduces the page count for the docket report, so can reduce costs).

-ex, --docket-exclusions PATH
: Relative path to a csv file with a column of UCIDs that are cases to be excluded from the Docket Scraper.

--docket-update
: Check for new docket lines in existing cases. A --docket-input must also be provided. If the docket input is a csv, a latest_date column can be provided to give the latest date across docket lines for each case. This date (+1) is passed to the "date filed from" field in Pacer when the docket report is generated. If no latest_date column is provided for a case that has been previously downloaded, the date is calculated from the case json.
Summary Scraper
--summary-input PATH
: Similar to --docket-input. A relative path that is the input for the Summary Scraper module: this can be a single query result page (.html), a directory of query html files, or a csv with UCIDs.
Member Scraper
--member-input PATH
: A relative path to a csv that has at least one of the following columns: pacer_id, case_no, ucid.
Document Scraper
--document-input PATH
: A relative path that is the input for the Document Scraper module: a csv file that contains a ucid column. These will be the cases that the Document Scraper will run on. If a doc_no column is provided, then the specific documents specified will be downloaded; see Downloading specific documents below. Otherwise, an error will appear warning the user to use the --document-all-docs option if they want to download all documents for a case (see below).

--document-all-docs
: This will force the scraper to download all documents for each of the cases supplied in document-input. Warning: this can be very expensive!

--document-att / --no-document-att
: (default: True) Whether or not to get document attachments from docket lines.

--document-skip-seen / --no-document-skip-seen
: (default: True) Whether to skip seen cases. If true, documents will only be downloaded for cases that have not previously had documents downloaded. That is, if CaseA is in the input for the Document Scraper, it will be excluded and not have any documents downloaded in this session if there are any documents associated with CaseA that have previously been downloaded (i.e. that are in the /docs subdirectory).

--document-limit INTEGER
: (default: 1000) A limit on the no. of documents within a case. Cases that have more documents than the limit (i.e. extremely long dockets) will be excluded from the Document Scraper step.
Notes
Downloading specific documents
When giving the Document Scraper specific dockets to download, you can specify which documents to download from each docket. If you need to download every document in each case you have supplied, use the --document-all-docs flag.
There are two types of documents that can be downloaded:
- Line documents: these are documents that relate to the whole docket entry line in the docket report; the links for these documents appear in the # column of the docket report table.
- Attachments: these are attachments or exhibits included in the line; they are referenced in-line in the docket entry text.
Note: Many docket entries contain links with references to documents from previous lines. These are ignored and not treated as attachments. To download these, refer to their original line.
To specify which documents should be downloaded, give the --document-input argument a csv that has both a ucid and a doc_no column. In the doc_no column you can give a comma-delimited list of documents to download. The following are valid individual values:
- x - just the line document x
- x:y - the line documents from x to y, inclusive
- x_z - the z'th attachment on line x
- x_a:b - attachments a through b, inclusive, from line x
These values are combined into a comma-delimited list; for example, for a given case you could specify: "2,3:5,6_1,7_1:4". (See "Common Tasks" below for a full example of this.)
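For reference, a sketch of how such a value could be expanded (a hypothetical helper, not the scraper's internal parser):

```python
def parse_doc_no(spec):
    """Expand a doc_no spec like '2,3:5,6_1,7_1:4' into line docs and attachments.

    Returns (line_docs, attachments), where attachments are (line, att_index) pairs.
    """
    line_docs, attachments = [], []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "_" in part:  # attachment spec: x_z or x_a:b
            line, att = part.split("_")
            if ":" in att:
                a, b = map(int, att.split(":"))
                attachments += [(int(line), i) for i in range(a, b + 1)]
            else:
                attachments.append((int(line), int(att)))
        elif ":" in part:  # line document range: x:y (inclusive)
            x, y = map(int, part.split(":"))
            line_docs += list(range(x, y + 1))
        else:  # single line document
            line_docs.append(int(part))
    return line_docs, attachments

print(parse_doc_no("2,3:5,6_1,7_1:4"))
# ([2, 3, 4, 5], [(6, 1), (7, 1), (7, 2), (7, 3), (7, 4)])
```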
Notes:
- If a doc_no column is not present in the csv and the --document-all-docs flag has not been supplied, the scraper will give an error message. You need to either supply a doc_no column or specify that you want to download all documents for each case by using the --document-all-docs flag.
- If the doc_no column is present and there is a row with a case that has no value (empty string) specified for doc_no, all documents will be downloaded for that case. Note: this may be very expensive.
- The no. or index of a document corresponds to the # column in the docket table on PACER. These are not necessarily displayed in sequential order due to PACER filing peculiarities.
Specific defendant dockets
For criminal cases, there may be separate dockets/stubs for defendants if there are multiple defendants. To download a docket for a specific defendant, you can supply a def_no column in the docket input csv. In this column, any blank value will be interpreted as getting the main docket. If the def_no column is excluded, the scraper will pull the main docket for every case.
For example, given the following file:

/docket_update.csv
```
ucid,def_no
ilnd;;1:16-cr-12345,2
ilnd;;1:16-cr-12345,3
ilnd;;1:16-cr-12346,
ilnd;;1:16-cr-12347,4
```
Running the following:

```
python scrapers.py -m docket \
    --docket-input <path_to_file>/docket_update.csv --docket-update <path_to_ilnd_folder>
```
will pull the following dockets:
- ilnd;;1:16-cr-12345: the dockets for defendants 2 and 3
- ilnd;;1:16-cr-12346: the main docket
- ilnd;;1:16-cr-12347: the docket for defendant 4
Common tasks
1. Run a search query from start to finish
Suppose you want to run a search query -- for example, all cases opened in the Northern District of Illinois in the first week of 2020. To do this:

```
python scrapers.py -m query -a <path_to_auth_file> --query-prefix "first_week_2020" \
    -c ilnd <path_to_ilnd_folder>
```
Since the Query Scraper module will run and no query config file has been specified, the query config builder will run in the terminal, allowing you to enter search parameters for the Pacer query form.
2. Downloading Dockets
Suppose you have already run the above search query, and it created a file at pacer/ilnd/queries/first_week_2020.html. Now, to download all civil and criminal cases included in that search result, run the following:

```
python scrapers.py -m docket -a <path_to_auth_file> \
    --docket-input <path_to_first_week_2020.html> \
    -c ilnd <path_to_ilnd_folder>
```
The dockets will be downloaded into pacer/ilnd/html/<year>/, depending on the year code in the case id. Note that this may differ from the actual filing date (e.g. a case ilnd;;1:20-cv-XXXX may have a filing date from 2019 in PACER).
Alternatively, if you had the list of cases (either from that query html file or just an ad-hoc/manual list), you could put them in a csv file (with a ucid column) and pass that as the argument for --docket-input instead of the query html file.
3. Run Document Scraper on a subset of dockets
If you have previously downloaded a bunch of case dockets and you want to download the documents for just a subset of these cases, you first need to create a file with the subset of interest. This can be any csv file that has a ucid column and a doc_no column, which we will create and call subset.csv, as below:

```
ucid,doc_no
ilnd;;1:16-cv-03630,2
ilnd;;1:16-cv-03631,"4,5"
```
To run the document scraper on just this subset you could do the following:
```
python scrapers.py -m document \
    -a <path_to_auth_file> -c ilnd \
    --document-input <path_to_subset.csv> <path_to_ilnd_folder>
```
Notes:
- The dockets for these cases must have been downloaded and must be in the /html folder for the Document Scraper to detect them.
- The doc_no column will download specific documents (see Downloading specific documents above).
- If you need to download all documents in each case, you can forgo the doc_no column and supply the --document-all-docs flag (see above).
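If you are generating subset.csv from a script, note that doc_no values containing commas (like "4,5" above) must be quoted; Python's csv module handles this automatically. A minimal sketch:

```python
import csv

rows = [
    ("ilnd;;1:16-cv-03630", "2"),
    ("ilnd;;1:16-cv-03631", "4,5"),  # comma-delimited values get quoted automatically
]

with open("subset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ucid", "doc_no"])
    writer.writerows(rows)
```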
4. Update dockets
To run a docket update, you need to give a csv file to the --docket-input argument and also use the --docket-update flag, as in the following csv:

/docket_update.csv
```
ucid,latest_date
ilnd;;1:16-cv-03630,1/31/2016
ilnd;;1:16-cv-03631,
ilnd;;1:16-cv-03632,
```
To run the scraper:
```
python scrapers.py -m docket \
    --docket-input <path_to_file>/docket_update.csv \
    --docket-update <path_to_ilnd_folder>
```
Suppose that ...630 and ...631 are cases that have previously been downloaded, but ...632 has not been downloaded yet. The following will occur when the Docket Scraper runs:
- For ...630: the date 2/1/2016 will be passed to the "date filed from" field in Pacer when the docket report is generated. A new docket will be downloaded and saved as ...630_1.html (or _2, _3, etc., depending on whether previous updates exist).
- For ...631: as it has previously been downloaded but no date has been given in the latest_date column, the date of the latest docket entry will be retrieved from the case json and filled in as the latest_date (see the sketch below); the rest proceeds as above.
- For ...632: since this case has not previously been downloaded, the whole docket report will be downloaded (i.e. it will proceed as normal for this case).
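For illustration, a sketch of how that fallback date could be computed from a parsed case json (the "docket" and "date_filed" keys, and the date format, are assumptions for illustration; this README does not specify the json schema):

```python
import json
from datetime import datetime, timedelta

def next_date_from_case_json(json_path, date_format="%m/%d/%Y"):
    """Return the latest docket-entry date + 1 day, for the 'date filed from' field.

    Assumes the case json has a 'docket' list whose entries carry a
    'date_filed' field -- hypothetical keys, for illustration only.
    """
    with open(json_path) as f:
        case = json.load(f)
    latest = max(
        datetime.strptime(entry["date_filed"], date_format)
        for entry in case["docket"]
    )
    return (latest + timedelta(days=1)).strftime(date_format)
```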
5. Download specific documents
When running the Document Scraper, you can specify a list of specific documents to download (see above for valid values). For example, suppose the following file is given:
/document_downloads.csv
```
ucid,doc_no
ilnd;;1:16-cv-03630,"1,3:5"
ilnd;;1:16-cv-03631,"7_6,7_9:11"
ilnd;;1:16-cv-03632,
```
To run this:

```
python scrapers.py -m document \
    --document-input <path_to_file>/document_downloads.csv \
    <path_to_ilnd_folder>
```
When it runs, the document downloader will download the following:
- For case ...630: line documents 1, 3, 4, and 5
- For case ...631: attachments 6, 9, 10, and 11 from line 7
- For case ...632: all documents