PACER redaction script
This script is a wrapper for the function redact_private_individual_names() in support/data_tools.py. That function takes a text representation of a PACER case, i.e. either raw HTML or a SCALES-parsed JSON dict, and returns the same text with the names of private individuals (see "Notes on approach" for an explanation of how we defined this category) replaced with anonymized hash strings.
redact_pacer.py simply applies redact_private_individual_names to a list of files and writes the results to a new directory; the input and output directories are currently hardcoded to reflect the structure of SCALES's internal datastore.
python redact_pacer.py file_pattern outdir_replacement_target outdir_replacement_text
file_pattern should be a glob pattern describing all the files to be redacted; for instance, to redact all the HTML/JSON files in a folder called pacer conforming to the SCALES scraper/parser file structure, pass pacer/*/[hj]*/*/*.[hj]* (the 'h' and 'j' are meant to match 'html' and 'json').
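As a quick check of what that pattern selects, the sketch below builds a throwaway directory tree and globs it. The court code, subfolder names, and file names are illustrative assumptions; only the pattern itself comes from the text above.

```python
import glob
import os
import tempfile

# The glob pattern suggested in the text above.
PATTERN = "pacer/*/[hj]*/*/*.[hj]*"


def build_demo_tree(root):
    """Create a small hypothetical tree: pacer/<court>/<type>/<subdir>/<file>."""
    for rel in [
        "pacer/akd/html/00/case1.html",
        "pacer/akd/json/00/case1.json",
        "pacer/akd/summaries/00/case1.txt",  # should NOT match the pattern
    ]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


def matching_files(root, pattern=PATTERN):
    """Return the files under root that match pattern, as sorted '/'-separated relative paths."""
    # glob.escape protects any special characters in the temporary root path.
    hits = glob.glob(os.path.join(glob.escape(root), pattern))
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_tree(root)
        for path in matching_files(root):
            print(path)
```

Only the files under the html and json folders are selected; the summaries folder is skipped because its name does not start with 'h' or 'j'.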
outdir_replacement_target and outdir_replacement_text provide a convenient way to specify an output path for each file: for a file with the path inpath, the redacted version of that file will be written to the path inpath.replace(outdir_replacement_target, outdir_replacement_text), and all intermediate directories will be created if they do not yet exist. For example, to create an output folder called
pacer_redacted in which to write the redacted versions of files in the
pacer folder, pass
pacer_redacted; in that case, for a file with path
pacer/akd/html/00/1:00-cv-00001.html, the redacted version would be written to path
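The replacement rule described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the script applies str.replace exactly as stated; the helper names redacted_path and write_redacted are hypothetical, and the "in/" and "out/" values are borrowed from the pacer_tools call shown in this README.

```python
import os


def redacted_path(inpath, outdir_replacement_target, outdir_replacement_text):
    """Compute the output path using the str.replace rule described above."""
    return inpath.replace(outdir_replacement_target, outdir_replacement_text)


def write_redacted(inpath, redacted_text, target, replacement):
    """Write redacted_text to the computed path, creating intermediate directories."""
    outpath = redacted_path(inpath, target, replacement)
    outdir = os.path.dirname(outpath)
    if outdir:
        os.makedirs(outdir, exist_ok=True)
    with open(outpath, "w", encoding="utf-8") as f:
        f.write(redacted_text)
    return outpath


if __name__ == "__main__":
    print(redacted_path("in/case.html", "in/", "out/"))
    # str.replace rewrites EVERY occurrence of the target, not just the first:
    print(redacted_path("in/main/case.html", "in/", "out/"))
```

Because str.replace substitutes every occurrence, a short target can alter more of the path than intended (the second call above yields out/maout/case.html), so it is worth choosing a target string that appears only once in each input path.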
This script is also included in the
pacer_tools package, where it can be called as follows:
from pacer_tools.code.tasks import redact_pacer
redact_pacer.main(["in/*.html", "in/", "out/"], standalone_mode=False)
(Another option is to import
pacer_tools.code.support and call
Why is this needed?
At SCALES, facilitating access to legal data is our primary goal, but we also strive to weigh the benefits of a given increase in accessibility against the attendant risks. In the case of court records, greater access means greater opportunities for research, journalism, and judicial transparency, but also greater risk of abuse by, say, scammers hoping to leverage information about a person's history in the courts to intimidate or embarrass them, or employers attempting to discriminate against formerly incarcerated people. Fortunately, upon replacing each instance of a certain person's name in a PACER docket with a hash string specific to that person, these privacy concerns mostly disappear while the docket still retains its value in the context of aggregate research. This is the strategy we used to anonymize our own collection of PACER data (as it appears both in the Satyrn app and the raw data files we've made available for public download), and we offer this redaction script as a way for you to anonymize your own data.
Notes on approach
The process of redacting names begins with a call to support/party_classification.py to determine which parties on the case are private individuals, continues with a call to support/party_tagging.py to identify text spans that include those parties' names, and concludes with code that performs the redaction itself by replacing those spans with hash strings. We settled on this piecemeal solution to the problem because it was a relatively quick-to-implement choice (time was of the essence given our fall '23 public launch of the Satyrn platform) and because it reused portions of our large pre-existing codebase rather than starting from scratch on tasks like party classification and intra-case span tagging.
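The last two stages of that pipeline (span tagging, then hash replacement) can be illustrated with a toy sketch. It is not the SCALES implementation: the function names, the use of SHA-256, the 12-character truncation, and the exact-match tagging are all assumptions made for illustration, and the list of private names is taken as given rather than produced by a classifier.

```python
import hashlib
import re


def name_hash(name, length=12):
    """Deterministic pseudonym: the same normalized name always maps to the same string."""
    normalized = " ".join(name.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


def find_name_spans(text, name):
    """Return (start, end) offsets of exact, case-insensitive, whole-word matches of name."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def redact(text, private_names):
    """Replace every tagged span with the hash of the name it belongs to.

    Assumes the names do not overlap one another within the text.
    """
    spans = []
    for name in private_names:
        token = name_hash(name)
        spans.extend((start, end, token) for start, end in find_name_spans(text, name))
    # Work from the end of the text so that earlier offsets remain valid.
    for start, end, token in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text


if __name__ == "__main__":
    docket = "Alice Smith sued Acme Corp. Motion filed by ALICE SMITH."
    print(redact(docket, ["Alice Smith"]))
```

Because the hash depends only on the normalized name, every mention of the same person receives the same token, which is what keeps the redacted docket usable for aggregate research.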
We also tried to account for the eccentricities of PACER data in our approach to the redaction task. For instance, although the set of all unique party names in our PACER dataset reveals a remarkable variety of entity types and text patterns, most of the hardest-to-interpret phenomena (complex chains of "X as representative of Y," names with missing or garbled text, single words with no other clarifying information, and so on) belong to a long tail of low-frequency party names that together comprise only a small portion of all parties. Thus, when classifying parties, we felt secure in using a conceptually simple lexicon-based approach, knowing that such an approach could reliably mop up the non-long-tail part of the distribution that contains the vast majority of PACER parties.
Finally, what sets apart "private individuals" from other parties? In other words, whose privacy needs to be protected via the redaction process? We define a private individual as a person who is involved in a court case in a non-professional capacity; that is, (1) they are a single human rather than a company, organization, government, group of people, etc., and (2) their participation in the litigation is not contingent on their status as a public official, police officer, doctor, lawyer, chairperson, owner of an asset, representative of a company or an estate, etc. (For more information, see the name_redaction_tree() decision-tree function in party_classification.py, which contains the rules that demarcate this category in practice.) With this definition, we believe we are drawing a balanced distinction between people involved in the legal system due to the circumstances of their private lives on the one hand, and people and organizations knowingly engaging in legal activity pertinent to the public on the other. That said, in truly gray-area situations, we have tried to make the "private individual" category over-inclusive and err on the side of over-redaction, under the assumption that the harm of leaving a true private individual's name exposed would be greater than the harm of erroneously concealing the name of a non-private-individual party.
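To make the two-part definition concrete, here is a deliberately tiny lexicon-based classifier. It is a sketch under stated assumptions, not a reproduction of name_redaction_tree(): the word lists and the function is_private_individual are hypothetical, and the real lexicons are far larger.

```python
import re

# Hypothetical, deliberately small lexicons (assumptions for illustration only).
NONPERSON_WORDS = {"inc", "llc", "corp", "company", "bank", "county", "city", "department"}
PROFESSIONAL_WORDS = {"officer", "warden", "commissioner", "sheriff", "trustee", "executor"}


def tokens(text):
    """Lowercased alphabetic tokens of a party name or extra-info string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_private_individual(name, extra_info=""):
    """Apply the two-part test: (1) a single human, (2) not acting in a professional role."""
    name_tokens = tokens(name)
    # Rule 1: words that signal a company, organization, or government.
    if name_tokens & NONPERSON_WORDS:
        return False
    # Rule 2: words that signal a professional or representative capacity.
    if (name_tokens | tokens(extra_info)) & PROFESSIONAL_WORDS:
        return False
    # Default to "private" so that gray-area parties are over-redacted, not exposed.
    return True


if __name__ == "__main__":
    for party in ["Alice Smith", "Acme Widgets, Inc.",
                  "Social Security Commissioner Carolyn Colvin", "Carolyn Colvin"]:
        print(party, "->", is_private_individual(party))
```

The default of returning True mirrors the over-inclusive stance described above: a party is treated as private unless some lexicon entry gives a reason not to.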
- redact_pacer.py currently does not use multiprocessing, and would probably run a lot faster if it did. (We made a quick attempt at parallelization using joblib, but this ended up slowing the script down immensely.)
- For speed reasons, fuzzy matching had to be removed from
- A few cases (i.e. cases with over 1000 parties slated for redaction) raised the is_huge_case flag and thus were redacted without applying the anchor-tagging strategy (i.e. with exact-match tagging only), due to the extra few hours of runtime per case that anchor tagging would have entailed. If the anchor tagger can be optimized (or if redact_pacer.py is parallelized, or if future phases of work on this problem can accommodate very long runtimes), those cases should be rerun.
- The "hash1 or hash2" notation used for redaction of ambiguous parties in the docket table has not been implemented for redaction of the case title.
- The case-title redaction heuristic is based on the PACER parser's conservative approach to finding the case title within the HTML file, and thus misses some fairly obvious appearances of party names in the handful of cases with non-standard case titles (see e.g. utd;;1:16-cv-00101).
- Pro-se parties are currently identified by a check for the text "PRO SE," a PACER flag that is present alongside almost all pro-se parties. However, a function in support/text_functions.py offers a more robust solution to the problem of pro-se identification, and could be used in redact_private_individual_names() instead of the current naive check.
- The portions of redact_private_individual_names() that split HTML data into its constituent parts (1) may be duplicative given the HTML-parsing code already present in parsers/parse_pacer.py and (2) might be performing redundant computations in situations in which the HTML files in question have already been parsed into JSON files residing elsewhere.
- Some of the
_apply_second_order_labeling_procedures() may be ending their respective loops before all of the necessary label adjustments have been completed (this could be fixed by rewriting them as list comprehensions).
- Lines involving the variable first_ei_word should be rewritten to allow checks for multi-word phrases at the beginning of the extra info. This is not currently a problem in name_redaction_multient_handler(), but it is a problem in _handle_multi_entities(), the generic handler (unused in redact_private_individual_names()) left over from the era in which party_classification.py was used for more than just redaction.
- The regexes used in the multi-entity-handling code currently match some phrases (like "individual capacity") that were relevant in earlier eras of fine-grained party splitting/detection but that may be a hindrance in the simpler context of private-individual detection.
- The synecdoche_nonperson_superseding word list, which was designed to solve certain as-yet-unsolved problems relating to multi-entity classification (see the associated comment in lexicon.py), is not currently being used.
- No checks have been performed to detect collisions between hash abbreviations within individual cases, or between long hashes across the whole dataset (although such collisions are exceedingly unlikely).
- At present, the party-tagging algorithm may produce false positives when multiple party names share tokens (e.g. when "Alice Smith" needs to be redacted, "Officer Bob Smith" may erroneously be redacted as well) or when name tokens can double as non-name tokens (e.g. when "Jude Law" needs to be redacted, the word "law" may erroneously be redacted wherever it appears). Inspecting the set of all unique party names may reveal further problems of this sort.
- The human_affixes list we inherited as part of party_tagging.py is quite limited, and future versions could benefit from expanding it, perhaps by pulling in some keywords from
- The distinction between person_words (associated with private individuals) and professional_words (associated with non-private individuals) is currently rather blurry. For instance, can we presume that people listed with honorifics like "Dr," "MD," "PhD," and "RN" are participating in their cases in their professional capacities, or should we allow for the possibility that these people are involved in the litigation no differently than those listed as "Mr," "Mrs," and so on?
- The as_x_of_y regex was originally designed for multi-entity detection in the context of fine-grained party classification; now that the multi-entity-detection parts of this code are essentially a proxy for the detection of all non-private entities, it could be worthwhile to expand that regex to match the broader pattern "X of Y."
- Some of the rules in name_redaction_tree() depend on context-specific information like a case's nature of suit, but in an ideal world, the party-classification process would be as context-independent as possible.
- Among the handful of parties whose names have already been redacted (to e.g. "John Doe" or "FNU LNU"), which the redaction code currently passes over, some are occasionally followed by extra info that contains their real name (e.g. "also known as Joe Smith"), meaning that these parties should be redacted after all. (See e.g. ilnd;;1:12-cr-00624.)
- The distinction between party names like "Social Security Commissioner Carolyn Colvin" (which would not be redacted) and "Carolyn Colvin" (which would be classified as a private individual and redacted), although reasonable on a programmatic level, is rather absurd on a conceptual level.
- About 500 parties have extra-info blocks containing address information, which may include words that erroneously flag them as non-private entities.
- Arguably, if city/state information were to be preserved in its unredacted form in address lines (both in pro-se blocks and extra-info blocks), research opportunities would increase without compromising privacy.
- The presumption that individuals appearing in court on behalf of other people or entities are not private individuals may break down in the context of estate-settlement cases, in which, for instance, individuals named as beneficiaries or representatives may have been pulled into litigation without any deliberate action on their part.
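One of the items above, the missing check for collisions between hash abbreviations within a case, is straightforward to sketch. The hashing scheme here is an assumption (SHA-256 of the name, abbreviated by taking a prefix); the actual scheme used by the redaction code may differ, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def long_hash(name):
    """Full-length hash of a party name (SHA-256 is an assumption, not the documented scheme)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def find_abbreviation_collisions(names, abbrev_len):
    """Group a case's party names by abbreviated hash and return only the colliding groups.

    The result maps each abbreviation shared by two or more distinct names
    to the sorted list of those names; an empty dict means no collisions.
    """
    by_abbrev = defaultdict(set)
    for name in names:
        by_abbrev[long_hash(name)[:abbrev_len]].add(name)
    return {abbrev: sorted(group) for abbrev, group in by_abbrev.items() if len(group) > 1}


if __name__ == "__main__":
    case_parties = ["Alice Smith", "Bob Jones", "Carol Diaz"]
    collisions = find_abbreviation_collisions(case_parties, abbrev_len=8)
    print(collisions or "no collisions among abbreviations")
```

Running the same function over all names in the dataset with the full hash length would cover the cross-dataset check mentioned in the same item.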