PACER redaction script

This script is a wrapper for the function redact_private_individual_names() in support/data_tools.py. That function takes a text representation of a PACER case, i.e. either raw HTML or a SCALES-parsed JSON dict, and returns the same text with the names of private individuals (see "Notes on approach" for an explanation of how we defined this category) replaced with anonymized hash strings. redact_pacer.py simply applies redact_private_individual_names to a list of files, and writes the results to a new directory; the input and output directories are currently hardcoded to reflect the structure of SCALES's internal datastore.

Usage

python redact_pacer.py file_pattern outdir_replacement_target outdir_replacement_text

file_pattern should be a glob pattern describing all the files to be redacted; for instance, to redact all the HTML/JSON files in a folder called pacer conforming to the SCALES scraper/parser file structure, pass pacer/*/[hj]*/*/*.[hj]* (the 'h' and 'j' are meant to match 'html' and 'json').

outdir_replacement_target and outdir_replacement_text provide a convenient way to specify an output path for each file: for a file with the path inpath, the redacted version of that file will be written to the path inpath.replace(outdir_replacement_target, outdir_replacement_text), and all intermediate directories will be created if they do not yet exist. For example, to create an output folder called pacer_redacted in which to write the redacted versions of files in the pacer folder, pass pacer and pacer_redacted; in that case, for a file with path pacer/akd/html/00/1:00-cv-00001.html, the redacted version would be written to path pacer_redacted/akd/html/00/1:00-cv-00001.html.
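The path handling described above can be sketched as follows. This is a minimal illustration rather than the code in redact_pacer.py: the helper names output_path_for and plan_outputs are hypothetical, and only the glob matching and the inpath.replace(...) rule come from the description above.

```python
import glob
import os


def output_path_for(inpath, target, replacement):
    """Derive the output path by plain substring replacement on the input path."""
    return inpath.replace(target, replacement)


def plan_outputs(file_pattern, target, replacement):
    """Map each file matching the glob pattern to its output path.

    Intermediate output directories are created if they do not exist yet.
    (Hypothetical helper; the real script also performs the redaction itself.)
    """
    plan = {}
    for inpath in sorted(glob.glob(file_pattern)):
        outpath = output_path_for(inpath, target, replacement)
        os.makedirs(os.path.dirname(outpath) or ".", exist_ok=True)
        plan[inpath] = outpath
    return plan


# The example from the text above:
print(output_path_for("pacer/akd/html/00/1:00-cv-00001.html", "pacer", "pacer_redacted"))
# -> pacer_redacted/akd/html/00/1:00-cv-00001.html
```

Because str.replace substitutes every occurrence of the target, it is safest to pass a target string that appears exactly once in each input path.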

This script is also included in the pacer_tools package, where it can be called as follows:

from pacer_tools.code.tasks import redact_pacer
redact_pacer.main(["in/*.html", "in/", "out/"], standalone_mode=False)

(Another option is to import data_tools from pacer_tools.code.support and call data_tools.redact_private_individual_names() directly.)

Why is this needed?

At SCALES, facilitating access to legal data is our primary goal, but we also strive to weigh the benefits of a given increase in accessibility against the attendant risks. In the case of court records, greater access means greater opportunities for research, journalism, and judicial transparency, but also greater risk of abuse by, say, scammers hoping to leverage information about a person's history in the courts to intimidate or embarrass them, or employers attempting to discriminate against formerly incarcerated people. Fortunately, upon replacing each instance of a certain person's name in a PACER docket with a hash string specific to that person, these privacy concerns mostly disappear while the docket still retains its value in the context of aggregate research. This is the strategy we used to anonymize our own collection of PACER data (as it appears both in the Satyrn app and the raw data files we've made available for public download), and we offer this redaction script as a way for you to anonymize your own data.

Notes on approach

The process of redacting names begins with a call to support/party_classification.py to determine which parties on the case are private individuals, continues with a call to support/party_tagging.py to identify text spans that include those parties' names, and concludes with code that performs the redaction itself by replacing those spans with hash strings. We settled on this piecemeal solution to the problem because it was a relatively quick-to-implement choice (time was of the essence given our fall '23 public launch of the Satyrn platform) and because it utilized portions of our large pre-existing codebase rather than starting from scratch on tasks like party classification and intra-case span tagging.
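The final step of that pipeline, replacing tagged spans with hash strings, can be sketched as below. This is an illustration under stated assumptions, not the code in data_tools.py: the function names, the SHA-256 digest, the 10-character abbreviation, and the exact-match stand-in for the tagging step are all hypothetical.

```python
import hashlib
import re


def name_hash(name, length=10):
    """Return a short, deterministic token for a name (hypothetical scheme).

    Lowercasing and collapsing whitespace makes spelling variants that differ
    only in case or spacing map to the same token.
    """
    normalized = " ".join(name.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]


def redact_spans(text, spans):
    """Replace each non-overlapping (start, end, name) span with the name's token.

    Spans are applied from right to left so that earlier offsets stay valid.
    """
    for start, end, name in sorted(spans, reverse=True):
        text = text[:start] + name_hash(name) + text[end:]
    return text


# Stand-in for the tagging step: exact matches of one private individual's name.
docket_text = "Jane Roe v. Acme Corp; motion filed by Jane Roe"
spans = [(m.start(), m.end(), "Jane Roe") for m in re.finditer(re.escape("Jane Roe"), docket_text)]
print(redact_spans(docket_text, spans))
```

Because the token depends only on the name, every mention of the same person receives the same replacement, which is what lets a redacted docket retain its value for aggregate research.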

We also tried to account for the eccentricities of PACER data in our approach to the redaction task. For instance, although the set of all unique party names in our PACER dataset reveals a remarkable variety of entity types and text patterns, most of the hardest-to-interpret phenomena (complex chains of "X as representative of Y," names with missing or garbled text, single words with no other clarifying information, and so on) belong to a long tail of low-frequency party names that together comprise only a small portion of all parties. Thus, when classifying parties, we felt secure in using a conceptually simple lexicon-based approach, knowing that such an approach could reliably mop up the non-long-tail part of the distribution that contains the vast majority of PACER parties.

Finally, what sets apart "private individuals" from other parties? In other words, whose privacy needs to be protected via the redaction process? We define a private individual as a person who is involved in a court case in a non-professional capacity; that is, (1) they are a single human rather than a company, organization, government, group of people, etc., and (2) their participation in the litigation is not contingent on their status as a public official, police officer, doctor, lawyer, chairperson, owner of an asset, representative of a company or an estate, etc. (For more information, see the name_redaction_tree() decision-tree function in party_classification.py, which contains the rules that demarcate this category in practice.) With this definition, we believe we are drawing a balanced distinction between people involved in the legal system due to the circumstances of their private lives on the one hand, and people and organizations knowingly engaging in legal activity pertinent to the public on the other. That said, in truly gray-area situations, we have tried to make the "private individual" category over-inclusive and err on the side of over-redaction, under the assumption that the harm of leaving a true private individual's name exposed would be greater than the harm of erroneously concealing the name of a non-private-individual party.
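To make the two-part definition concrete, here is a toy lexicon-based classifier. It is only a sketch: the word lists are hypothetical and far shorter than anything in lexicon.py, and the function is_private_individual is not the name_redaction_tree() logic, which applies many more rules.

```python
import re

# Hypothetical, heavily abbreviated word lists for illustration only.
ORGANIZATION_WORDS = {"inc", "llc", "corp", "company", "bank", "county", "city", "department"}
PROFESSIONAL_WORDS = {"officer", "warden", "commissioner", "sheriff", "trustee", "executor", "attorney"}


def is_private_individual(party_name, extra_info=""):
    """Classify a party using simple word-list lookups.

    Rule 1: organization words suggest the party is not a single human.
    Rule 2: professional words suggest the party appears in a professional capacity.
    Anything else defaults to "private individual", mirroring the preference
    for over-redaction in gray areas.
    """
    tokens = set(re.findall(r"[a-z]+", f"{party_name} {extra_info}".lower()))
    if tokens & ORGANIZATION_WORDS:
        return False
    if tokens & PROFESSIONAL_WORDS:
        return False
    return True


print(is_private_individual("Carolyn Colvin"))
print(is_private_individual("Social Security Commissioner Carolyn Colvin"))
print(is_private_individual("Joe Smith", extra_info="as Trustee of the Smith Estate"))
```

Defaulting to True when no lexicon word matches is a deliberate choice in this sketch: an unrecognized name is treated as private, so mistakes tend toward concealing too much rather than too little.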

Known issues

Technical

redact_pacer.py currently does not use multiprocessing, and would probably run a lot faster if it did. (We made a quick attempt at parallelization using joblib, but this ended up slowing the script down immensely.)
For speed reasons, fuzzy matching had to be removed from party_tagging.py.
A few cases (i.e. cases with over 1000 parties slated for redaction) raised the is_huge_case flag and thus were redacted without applying the anchor-tagging strategy (i.e. with exact-match tagging only), due to the extra few hours of runtime per case that anchor tagging would have entailed. If the anchor tagger can be optimized (or if redact_pacer.py is parallelized, or if future phases of work on this problem can accommodate very long runtimes), those cases should be rerun.
Due to a shortcut taken when generating party_chunks (in order to avoid the awkward split_on_multiple_separators method used by the PACER parser), party hashes in HTML files often differ from those in the corresponding JSONs.
The "hash1 or hash2" notation used for redaction of ambiguous parties in the docket table has not been implemented for redaction of the case title.
The assumption that party names in the case title will take the form party_name.split()[-1] is not always correct; for instance, in a random sample of 10,000 cases, 16 included names whose final token was an item in lexicon.person_words (e.g. "Joe Smith Jr").
The case-title redaction heuristic is based on the PACER parser's conservative approach to finding the case title within the HTML file, and thus misses some fairly obvious appearances of party names in the handful of cases with non-standard case titles (see e.g. utd;;1:16-cv-00101).
Pro-se parties are currently identified by a check for the text "PRO SE," a PACER flag that is present alongside almost all pro-se parties. However, the function pro_se_identifier() (in support/text_functions.py) is a more robust solution to the problem of pro-se identification, and could be used in redact_private_individual_names() instead of the current naive check.
The portions of redact_private_individual_names() that split HTML data into its constituent parts (1) may be duplicative given the HTML-parsing code already present in parsers/parse_pacer.py and (2) might be performing redundant computations in situations in which the HTML files in question have already been parsed into JSON files residing elsewhere.
Some of the break statements in _apply_second_order_labeling_procedures() may be ending their respective loops before all of the necessary label adjustments have been completed (this could be fixed by rewriting them as list comprehensions).
Lines involving the variable first_ei_word should be rewritten to allow checks for multi-word phrases at the beginning of the extra info. This is not currently a problem in name_redaction_multient_handler(), but it is a problem in _handle_multi_entities(), the generic handler (unused in redact_private_individual_names()) left over from the era in which party_classification.py was used for more than just redaction.
The regexes used in the multi-entity-handling code currently match some phrases (like "individual capacity") that were relevant in earlier eras of fine-grained party splitting/detection but that may be a hindrance in the simpler context of private-individual detection.
The synecdoche_nonperson_superseding word list, which was designed to solve certain as-yet-unsolved problems relating to multi-entity classification (see the associated comment in lexicon.py), is not currently being used.
No checks have been performed to detect collisions between hash abbreviations within individual cases, or between long hashes across the whole dataset (although such collisions are exceedingly unlikely).
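The missing collision check mentioned in the last item could look roughly like the sketch below. The function find_collisions and the sample data are hypothetical; the sketch only assumes that each key (a party name, or a long hash) maps to exactly one token (a long hash, or an abbreviation).

```python
from collections import defaultdict


def find_collisions(token_by_key):
    """Return every token that is shared by more than one distinct key.

    Pass {party_name: long_hash} to check the whole dataset, or
    {long_hash: abbreviation} to check the abbreviations within one case.
    """
    keys_by_token = defaultdict(set)
    for key, token in token_by_key.items():
        keys_by_token[token].add(key)
    return {token: sorted(keys) for token, keys in keys_by_token.items() if len(keys) > 1}


# Made-up abbreviations for the long hashes in a single case.
case_abbreviations = {"9f2c41aa": "9f2c", "9f2c77b0": "9f2c", "03de18c5": "03de"}
print(find_collisions(case_abbreviations))
# -> {'9f2c': ['9f2c41aa', '9f2c77b0']}
```

An empty result means no two keys share a token, so the check could run once per case for abbreviations and once over the full dataset for long hashes.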

Conceptual

At present, the party-tagging algorithm may produce false positives when multiple party names share tokens (e.g. when "Alice Smith" needs to be redacted, "Officer Bob Smith" may erroneously be redacted as well) or when name tokens can double as non-name tokens (e.g. when "Jude Law" needs to be redacted, the word "law" may erroneously be redacted wherever it appears). Inspecting the set of all unique party names may reveal further problems of this sort. (Note that false positives in party classification exacerbate this tagging problem; for example, in the rare case of a party ending in "Company" being flagged for redaction, as in ilnd;;3:21-cv-50125, the word "company" may erroneously be redacted elsewhere as well.)
The human_affixes list we inherited as part of party_tagging.py is quite limited, and future versions could benefit from expanding it, perhaps by pulling in some keywords from support/lexicon.py.
The distinction between person_words (associated with private individuals) and professional_words (associated with non-private individuals) is currently rather blurry. For instance, can we presume that people listed with honorifics like "Dr," "MD," "PhD," and "RN" are participating in their cases in their professional capacities, or should we allow for the possibility that these people are involved in the litigation no differently than those listed as "Mr," "Mrs," and so on?
The as_x_of_y regex was originally designed for multi-entity detection in the context of fine-grained party classification; now that the multi-entity-detection parts of this code are essentially a proxy for the detection of all non-private entities, it could be worthwhile to expand that regex to match the broader pattern "X of Y."
Some of the rules in name_redaction_tree() depend on context-specific information like a case's nature of suit, but in an ideal world, the party-classification process would be as context-independent as possible.
Among the handful of parties whose names have already been redacted (to e.g. "John Doe" or "FNU LNU"), which the redaction code currently passes over, some are occasionally followed by extra info that contains their real name (e.g. "also known as Joe Smith"), meaning that these parties should be redacted after all. (See e.g. ilnd;;1:12-cr-00624.)
The distinction between party names like "Social Security Commissioner Carolyn Colvin" (which would not be redacted) and "Carolyn Colvin" (which would be classified as a private individual and redacted), although reasonable on a programmatic level, is rather absurd on a conceptual level.
About 500 parties have extra-info blocks containing address information, which may include words that erroneously flag them as non-private entities.
Arguably, if city/state information were to be preserved in its unredacted form in address lines (both in pro-se blocks and extra-info blocks), research opportunities would increase without compromising privacy.
The presumption that individuals appearing in court on behalf of other people or entities are not private individuals may break down in the context of estate-settlement cases, in which, for instance, individuals named as beneficiaries or representatives may have been pulled into litigation without any deliberate action on their part.
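The first conceptual issue above suggests inspecting party names for tokens that are likely to cause tagging false positives. A minimal audit along those lines might look like this; the helper names and the ORDINARY_WORDS list are hypothetical stand-ins and are not part of party_tagging.py or lexicon.py.

```python
import re

# Hypothetical stand-in for a real list of ordinary English words.
ORDINARY_WORDS = {"law", "company", "bill", "may", "will"}


def name_tokens(name):
    """Lowercased alphabetic tokens of a party name."""
    return set(re.findall(r"[a-z]+", name.lower()))


def risky_tokens(redacted_names, unredacted_names, ordinary_words=ORDINARY_WORDS):
    """Report tokens of to-be-redacted names that may match text they should not.

    A token is flagged if it also occurs in a party name that is not slated
    for redaction, or if it doubles as an ordinary word.
    """
    redacted = set()
    for name in redacted_names:
        redacted |= name_tokens(name)
    kept = set()
    for name in unredacted_names:
        kept |= name_tokens(name)
    return {
        "shared_with_unredacted_party": sorted(redacted & kept),
        "doubles_as_ordinary_word": sorted(redacted & set(ordinary_words)),
    }


print(risky_tokens(["Alice Smith", "Jude Law"], ["Officer Bob Smith"]))
# -> {'shared_with_unredacted_party': ['smith'], 'doubles_as_ordinary_word': ['law']}
```

A report like this does not fix the tagger, but it could help prioritize which cases or tokens deserve manual review.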
