SCALES NLP
An AI toolkit for legal research that includes a collection of deep learning models and utilities to:
- Download dockets from PACER and parse their contents
- Classify the text of docket entries with 70+ labels
- Extract named entities from docket entries
- Apply case-level logic to tagged data to identify key opening and dispositive events according to the SCALES Litigation Ontology
Pretrained Models
This module builds on a collection of publicly available pretrained language models that can be used to apply the SCALES Litigation Ontology to your own data. Check out the scales-okn page on the Hugging Face Model Hub for updates.
Getting started
To help illustrate how to use these models, we will use an example dataset.
entries = [
'MOTION by Plaintiff Charles Shulruff, D.D.S. to certify class (Edelman, Daniel) (Entered: 01/22/2016)',
'MOTION by Plaintiff Able Home Health, LLC to certify class (Edelman, Daniel) (Entered: 01/28/2016)',
'MOTION by Plaintiff Dr. Charles Shulruff, D.D.S. to certify class (Attachments: # 1 Exhibit A-E)(Edelman, Daniel) (Entered: 02/24/2016)',
'MOTION by Plaintiff Lindabeth Rivera to certify class (Carroll, Katrina) (Entered: 03/01/2016)',
'MOTION by Plaintiff Christy Griffith to certify class (Glapion, Jeremy) (Entered: 08/01/2016)'
]
Docket Language Model
scales-okn/docket-language-model
The base model that we use for downstream fine-tuning on both classification and entity recognition tasks. This model builds on microsoft/deberta-v3-large and has been further fine-tuned on 11 million docket entries using the masked language modeling task.
# use with transformers
from transformers import AutoModel, AutoTokenizer
model_name = 'scales-okn/docket-language-model'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
To use the model with the example data, tokenize the entries and pass the result to the model as input.
model_inputs = tokenizer(entries, padding=True, truncation=True, return_tensors="pt")
outputs = model(**model_inputs)
We can then access the calculated tensors for each of the five entries. The output shape is 5x49x1024: five entries, 49 tokens per entry after padding, and a hidden size of 1024.
outputs[0].shape
## torch.Size([5, 49, 1024])
Docket Classifier
scales-okn/docket-classification
A multi-label classifier that has been fine-tuned to predict the SCALES Litigation Ontology labels. The complete list of labels and their descriptions can be found here.
#Import the scales package
import scales_nlp
#Load the model
model = scales_nlp.pipeline('docket-classifier')
predictions = model(entries[0])
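To illustrate what "multi-label" means in practice, here is a minimal sketch of the usual decision rule: each label gets its own sigmoid score and every label above a threshold is kept, so one docket entry can receive several labels or none. The label names, logits, and the 0.5 threshold are placeholders for illustration; whether the 'docket-classifier' pipeline applies exactly this rule or this threshold is an assumption, not something stated above.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, label_names, threshold=0.5):
    """Keep every label whose sigmoid score reaches the threshold.

    Unlike single-label (softmax) classification, the scores are
    independent, so any number of labels can be selected.
    """
    return [
        name
        for name, logit in zip(label_names, logits)
        if sigmoid(logit) >= threshold
    ]


# Placeholder label names and logits, not real SCALES labels or model outputs.
label_names = ["label_a", "label_b", "label_c"]
logits = [2.3, -1.1, 0.4]

print(select_labels(logits, label_names))  # ['label_a', 'label_c']
```

Raising the threshold trades recall for precision, which can be useful when only high-confidence labels should be acted on.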
Docket Encoder
A text encoding model that can be used to map a docket entry to a vector for similarity search. This model was fine-tuned using a semi-supervised training routine based on a contrastive, multi-dimensional version of batch hard triplet loss aimed at approximating the relative TF-IDF distances between batches of docket entries. Note that if you use this model without the SCALES 'docket-encoder' pipeline, the CLS token embedding in the last hidden layer serves as the embedding representation for the whole entry.
#Import the scales package
import scales_nlp
#Load the model
model = scales_nlp.pipeline('docket-encoder')
vectors = model(entries)
vectors.shape
## (5, 1024)
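Once entries are encoded, similarity search comes down to comparing vectors. The sketch below ranks candidates by cosine similarity using only the standard library; the three-dimensional toy vectors stand in for the 1024-dimensional rows of `vectors`, and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative rather than part of scales_nlp. Cosine similarity is a common choice for this kind of embedding, though the metric best suited to this encoder is an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (candidate_index, score) pairs, most similar first."""
    scores = [
        (index, cosine_similarity(query, vector))
        for index, vector in enumerate(candidates)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for encoder outputs; real vectors would have 1024 dimensions.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

# Rank the second and third vectors against the first.
ranking = rank_by_similarity(toy_vectors[0], toy_vectors[1:])
print(ranking[0][0])  # 0
```

The same functions accept any sequence of floats, so each row of the encoder output can be passed in directly. For large collections, a vectorized or indexed approach would be more efficient.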
Quickstart
Installation
This module can be installed with pip:
$ pip install scales-nlp
You should also have the appropriate version of PyTorch installed for your system.
Configuration
To get the most out of this module, run the following to set module-wide configuration variables. These variables can also be set or overridden by your environment.
$ scales-nlp configure
PACER_DIR
The folder where all of your PACER data will be saved and managed. This should be the same top-level directory that you use with the scraper and parser.
PACER_USERNAME
The username to your PACER account.
PACER_PASSWORD
The password to your PACER account.
HUGGING_FACE_TOKEN
You only need to include this if you want to use the pipelines API with private models on Hugging Face.
In addition to these variables, you can also configure the default values that are used by the training routines API. These can be set by running scales-nlp configure train-args.
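Because the configuration variables can also come from the environment, one option is to set them from Python before the module is used. This is a minimal sketch under two assumptions: that the variable names listed above are read from the process environment exactly as written, and that setting them before importing scales_nlp is sufficient (when the module actually reads them is not stated here). The `pacer_data` folder name is a placeholder.

```python
import os
from pathlib import Path

# Placeholder location; substitute your own PACER data folder.
pacer_dir = Path.home() / "pacer_data"

# setdefault leaves any value already exported in the shell untouched,
# so an existing environment setting still takes precedence.
os.environ.setdefault("PACER_DIR", str(pacer_dir))

# Import scales_nlp only after the environment is prepared
# (an assumption about when the settings are read).
# import scales_nlp

print(os.environ["PACER_DIR"])
```

Credentials such as PACER_PASSWORD are better exported in the shell or a secrets manager than written into a script.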
Collect PACER Data
This module includes simplified wrappers for the SCALES scraper and parser. This version eschews much of those tools' original functionality in order to make it easy to download a single case. For running bulk downloads of PACER data or otherwise taking advantage of the wide range of functionality available in the original implementation, you can consult the PACER-tools package, as well as the individual documentation pages for the scraper and parser.
Simplified Scraper
To download a single case, provide a UCID to the following command:
$ scales-nlp download [UCID]
The UCID consists of the court abbreviation and the docket number, separated by two semicolons (e.g. azd;;3:18-cv-08134).
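The UCID format described above is simple enough to build and split by hand. The helpers below are a small illustrative sketch, not part of scales_nlp; the names `Ucid`, `split_ucid`, and `make_ucid` are hypothetical.

```python
from typing import NamedTuple


class Ucid(NamedTuple):
    """The two parts of a UCID: court abbreviation and docket number."""
    court: str
    docket_number: str


def split_ucid(ucid: str) -> Ucid:
    """Split 'court;;docket_number' into its two parts."""
    court, separator, docket_number = ucid.partition(";;")
    if not separator or not court or not docket_number:
        raise ValueError(f"not a valid UCID: {ucid!r}")
    return Ucid(court, docket_number)


def make_ucid(court: str, docket_number: str) -> str:
    """Join a court abbreviation and docket number with two semicolons."""
    return f"{court};;{docket_number}"


parts = split_ucid("azd;;3:18-cv-08134")
print(parts.court)           # azd
print(parts.docket_number)   # 3:18-cv-08134
print(make_ucid(*parts))     # azd;;3:18-cv-08134
```

Validating the string before passing it to the download command catches a missing or single semicolon early.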
Simplified Parser
To run the parser across all of your downloaded cases, run the following:
$ scales-nlp parse
You may also provide a court abbreviation as an argument if you only want to apply the parser to cases within a single court.
Apply SCALES Models
Update Classifier Labels
The following command computes classifier labels from the SCALES Litigation Ontology for new data and saves them. By default, the model outputs are cached in your PACER_DIR, and the model is only applied to new cases that do not already have saved predictions. You can override saved predictions by passing the --reset flag. For best performance, it is recommended that you only run inference with the SCALES models on a device with a GPU. If you run into memory errors, try lowering the --batch-size.
$ scales-nlp update-labels --batch-size 4
Update Named-entity Extraction
SCALES will release several NER models in the near future.
Loading Tagged Dockets
Downloaded cases can be loaded using the SCALES NLP Docket object as follows. If classifier labels or entity spans have been computed for the case, these will be accessible as well. Furthermore, the Docket object will consolidate all of the information available in the case to infer the specific pathway events. To learn more about labels, entities, and pathway events, check out the Litigation Ontology Guide.
import scales_nlp
docket = scales_nlp.Docket.from_ucid("CASE UCID")
print(docket.ucid)
print(docket.case_name)
print(docket.header.keys())
for entry in docket:
    print(entry.row_number, entry.entry_number, entry.date_filed)
    print(entry.text)
    print(entry.event)
    print(entry.labels)
    print(entry.spans)
    print()
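Since iterating over a docket yields its entries, per-entry labels can be aggregated into a quick case-level summary. The sketch below assumes that `entry.labels` is an iterable of label strings, possibly empty or None when no labels have been computed; that shape is an assumption, as are the placeholder label names. Stand-in objects are used so the example runs without any downloaded data.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels(entries):
    """Count how often each label appears across a sequence of entries.

    Assumes each entry exposes a `labels` attribute holding an iterable
    of strings (or None when no labels are available).
    """
    counts = Counter()
    for entry in entries:
        counts.update(entry.labels or [])
    return counts


# Stand-in entries with placeholder labels, not real SCALES output.
stand_in_entries = [
    SimpleNamespace(labels=["label_a"]),
    SimpleNamespace(labels=["label_a", "label_b"]),
    SimpleNamespace(labels=[]),
]

print(count_labels(stand_in_entries).most_common())
# [('label_a', 2), ('label_b', 1)]
```

With a loaded case, the same helper could be called as count_labels(docket), provided the assumption about `entry.labels` holds.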