SCALES NLP

An AI toolkit for legal research that includes a collection of deep learning models and utilities to:

  • Download dockets from PACER and parse their contents
  • Classify the text of docket entries with 70+ labels
  • Extract named entities from docket entries
  • Apply case-level logic to tagged data to identify key opening and dispositive events according to the SCALES Litigation Ontology

Pretrained Models

This module builds on a collection of publicly available pretrained language models that can be used to apply the SCALES Litigation Ontology to your own data. Check out the scales-okn page on the Hugging Face Model Hub for updates.

Docket Language Model

scales-okn/docket-language-model

This is the base model that we use for downstream fine-tuning on both classification and entity recognition tasks. It builds on microsoft/deberta-v3-large and has been further fine-tuned on 11 million docket entries using the masked language modeling task.

# use with transformers
from transformers import AutoModel, AutoTokenizer

model_name = 'scales-okn/docket-language-model'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Docket Classifier

scales-okn/docket-classification

This is a multi-label classifier that has been fine-tuned to predict the SCALES Litigation Ontology labels. The complete list of labels and their descriptions can be found here.

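import scales_nlp

# Placeholder docket entry texts; substitute your own
texts = ["COMPLAINT filed by Plaintiff.", "ORDER granting Motion to Dismiss."]

model = scales_nlp.pipelines('docket-classifier', device=0)

predictions = model(texts, batch_size=4)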
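Because the classifier is multi-label, a single docket entry can receive several labels at once. The sketch below illustrates the usual multi-label decision rule, in which each label's logit is passed through a sigmoid and compared against a threshold independently of the other labels. The label names, logits, and the 0.5 threshold are illustrative assumptions; the actual output format and decision rule of the 'docket-classifier' pipeline are not documented here.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent probability meets the threshold.

    Unlike single-label (softmax) classification, zero, one, or many
    labels may be selected for the same text.
    """
    return [
        name
        for name, logit in zip(label_names, logits)
        if sigmoid(logit) >= threshold
    ]


# Hypothetical label names and logits, for illustration only
label_names = ["motion", "order", "dismiss"]
logits = [-2.0, 1.5, 3.0]
selected = select_labels(logits, label_names)
print(selected)
```

Raising the threshold trades recall for precision, so fewer labels are selected per entry.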

Docket Encoder

scales-okn/docket-encoder

This is a text encoding model that can be used to map a docket entry to a vector for similarity search. It was fine-tuned using a semi-supervised training routine based on a contrastive, multi-dimensional version of batch hard triplet loss, aimed at approximating the relative TF-IDF distances between batches of docket entries. Note that if you use this model without the SCALES 'docket-encoder' pipeline, the embedding for the whole entry is the CLS token embedding from the last hidden layer.

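import scales_nlp

# Placeholder docket entry texts; substitute your own
texts = ["COMPLAINT filed by Plaintiff.", "ORDER granting Motion to Dismiss."]

model = scales_nlp.pipelines('docket-encoder', device=0)

vectors = model(texts, batch_size=4)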
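Once entries have been mapped to vectors, a similarity search amounts to ranking candidate vectors by their closeness to a query vector. The following is a minimal sketch using cosine similarity on plain Python lists. The choice of cosine similarity is an assumption (the source does not say which metric suits this encoder best), and the toy 3-dimensional vectors merely stand in for real encoder output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, v)) for i, v in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for encoder output (illustrative only)
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = rank_by_similarity(toy_vectors[0], toy_vectors[1:])
print(ranking)
```

For large collections, an approximate nearest-neighbor index would normally replace this exhaustive loop.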

Quickstart

Installation

This module can be installed with pip:

$ pip install scales-nlp

You should also have the appropriate version of PyTorch installed for your system.

Configuration

To get the most out of this module, run the following to set module-wide configuration variables. These variables can also be set or overridden by your environment.

$ scales-nlp configure

  • PACER_DIR: The folder where all of your PACER data will be saved and managed. This should be the same top-level directory that you use with the scraper and parser.
  • PACER_USERNAME: The username to your PACER account.
  • PACER_PASSWORD: The password to your PACER account.
  • HUGGING_FACE_TOKEN: You only need to include this if you want to use the pipelines API with private models on Hugging Face.
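The precedence mentioned above, where a value from the environment overrides a saved one, can be pictured with the sketch below. It only illustrates the idea and is not the module's actual configuration code: the resolve_config helper, the saved-settings dictionary, and the example paths are all hypothetical.

```python
import os

# Variable names taken from the list above
CONFIG_KEYS = ("PACER_DIR", "PACER_USERNAME", "PACER_PASSWORD", "HUGGING_FACE_TOKEN")


def resolve_config(saved, environ=None):
    """Merge saved settings with the environment; the environment wins."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for key in CONFIG_KEYS:
        if key in environ:
            resolved[key] = environ[key]
        elif key in saved:
            resolved[key] = saved[key]
    return resolved


# Hypothetical saved values and a hypothetical environment override
saved = {"PACER_DIR": "/data/pacer", "PACER_USERNAME": "alice"}
environ = {"PACER_DIR": "/tmp/pacer"}
config = resolve_config(saved, environ)
print(config)
```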

In addition to these variables, you can also configure the default values that are used by the training routines API. These can be set by running scales-nlp configure train-args.

Collect PACER Data

This module includes simplified wrappers for the SCALES scraper and parser. These wrappers omit much of the original tools' functionality in order to make it easy to download a single case. For bulk downloads of PACER data, or to take advantage of the wider range of functionality in the original implementation, consult the PACER-tools package, as well as the individual documentation pages for the scraper and parser.

Simplified Scraper

To download a single case, provide a UCID to the following command:

$ scales-nlp download [UCID]

The UCID consists of the court abbreviation and the docket number, separated by two semicolons (e.g. azd;;3:18-cv-08134).
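The UCID format described above can be built and taken apart with a few lines of Python. This is a small sketch based only on the format stated here (court abbreviation, two semicolons, docket number); make_ucid and split_ucid are hypothetical helper names, not functions provided by the module.

```python
def make_ucid(court, docket_number):
    """Join a court abbreviation and a docket number into a UCID."""
    return f"{court};;{docket_number}"


def split_ucid(ucid):
    """Split a UCID into (court, docket_number), validating the separator."""
    court, sep, docket_number = ucid.partition(";;")
    if not sep or not court or not docket_number:
        raise ValueError(f"not a valid UCID: {ucid!r}")
    return court, docket_number


ucid = make_ucid("azd", "3:18-cv-08134")
print(ucid)
print(split_ucid(ucid))
```

Note that the docket number itself contains a colon, so splitting on the double semicolon rather than on any single punctuation mark keeps it intact.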

Simplified Parser

To run the parser across all of your downloaded cases, run the following:

$ scales-nlp parse

You may also provide a court abbreviation as an argument if you only want to apply the parser to cases within a single court.

Apply SCALES Models

Update Classifier Labels

The following command computes classifier labels from the SCALES Litigation Ontology for new data and saves them. By default, the model outputs are cached in your PACER_DIR, and the model is only applied to new cases that do not already have saved predictions. You can override saved predictions by passing the --reset flag. For best performance, run inference with the SCALES models only on a device with a GPU. If you run into memory errors, try lowering the --batch-size.

$ scales-nlp update-labels --batch-size 4
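The caching and batching behavior described above can be pictured with the following sketch. It is not the module's implementation: the helper names, the in-memory set standing in for saved predictions, and every UCID other than the example given earlier are hypothetical.

```python
def pending_cases(all_ucids, cached_ucids, reset=False):
    """Cases that still need predictions; reset=True reprocesses everything."""
    if reset:
        return list(all_ucids)
    return [ucid for ucid in all_ucids if ucid not in cached_ucids]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical case list and cache contents
all_ucids = ["azd;;3:18-cv-08134", "azd;;3:18-cv-00001", "azd;;3:18-cv-00002"]
cached = {"azd;;3:18-cv-08134"}

todo = pending_cases(all_ucids, cached)
todo_after_reset = pending_cases(all_ucids, cached, reset=True)

# Entry texts are fed to the model a few at a time to bound memory use
entry_texts = [f"entry {i}" for i in range(10)]
chunks = list(batches(entry_texts, 4))
print(len(todo), len(todo_after_reset), [len(c) for c in chunks])
```

A smaller batch size means more, smaller chunks, which lowers peak memory use at the cost of throughput.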

Update Named-entity Extraction

SCALES will release several NER models in the near future.

Loading Tagged Dockets

Downloaded cases can be loaded using the SCALES NLP Docket object as follows. If classifier labels or entity spans have been computed for the case, these will be accessible as well. Furthermore, the Docket object consolidates all of the information available in the case to infer the specific pathway events. To learn more about labels, entities, and pathway events, check out the Litigation Ontology Guide.

import scales_nlp

docket = scales_nlp.Docket.from_ucid("CASE UCID")

print(docket.ucid)
print(docket.case_name)
print(docket.header.keys())

for entry in docket:
    print(entry.row_number, entry.entry_number, entry.date_filed)
    print(entry.text)
    print(entry.event)
    print(entry.labels)
    print(entry.spans)
    print()
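Since a Docket can be iterated entry by entry, simple case-level summaries are easy to build on top of it. The sketch below tallies how often each label occurs across a docket. It assumes that entry.labels is an iterable of label strings (or empty when no labels have been computed), which the source does not spell out, and it uses stand-in objects with made-up labels instead of a real Docket so that it runs without any downloaded data.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels(entries):
    """Tally label occurrences across all entries of a docket."""
    counts = Counter()
    for entry in entries:
        # Tolerate entries without computed labels (None or empty)
        counts.update(entry.labels or [])
    return counts


# Stand-in entries with hypothetical labels; a real Docket would be
# passed in the same way, e.g. count_labels(docket)
entries = [
    SimpleNamespace(labels=["complaint"]),
    SimpleNamespace(labels=["motion", "dismiss"]),
    SimpleNamespace(labels=["order", "dismiss"]),
    SimpleNamespace(labels=None),
]
label_counts = count_labels(entries)
print(label_counts.most_common(1))
```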