This repository was archived by the owner on Jun 28, 2023. It is now read-only.

Commit 1ec1b6d

Merge branch 'main' of https://github.com/JSv4/AtticusClassifier_Spacy into main

2 parents 8b5fe9d + 706b9c2

File tree

2 files changed: +30 −214 lines


README.rst

Lines changed: 20 additions & 214 deletions
@@ -1,90 +1,19 @@
-******************************************
-Atticus Legal Clause Classifiers for Spacy
-******************************************
+This is still a work in progress and is not meant for public use yet.
+Leaving the repository public in case anyone stumbles across it and it
+saves some time in preparing your own Atticus classifiers. I am planning
+to write a blog post / instructions once the performance is slightly better.

-Introduction
-############
+I am using LexNLP for its great sentence tokenization functionality. You
+could probably get decent performance out of NLTK. Spacy is ok. Currently,
+I have been training the classifiers, then using LexNLP to clean and split
+sentences / sections / whatever and then running those chunks through Spacy
+and checking the category labels.

-The `Atticus Project <https://www.atticusprojectai.org/>`_ was recently announced as an initiative
-to, among other things, build a world-class corpus of labelled legal contracts which could be used
-to train and/or benchmark text classifiers and question-answering NLP models. Their initial release
-contains 200 labelled contracts. I wanted to experiment with the data set and build a working classifier
-that I could use on contract data, so I set out to build a simple project to load the dataset, convert it
-into a format that Spacy can read, and then train some classifiers to see how the data set performs.
-This repository contains the code I used to train classifiers based on 1) Word2Vec embeddings and 2)
-a BERT-based transformer model.
+To get the full (and excellent) Atticus dataset, go here:
+https://www.atticusprojectai.org/

-Quickstart - Use the Classifier
-###############################
-
-If you are in a hurry to test out the classifiers and are not really interested in how they were trained,
-you can currently install the classifier directly from a package I'm hosting on my AWS bucket by typing::
-
-    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
-This should install Spacy, Spacy-transformers, a BERT model and the classifiers. Once you've installed the
-model, you can use it like this::
-
-    import spacy
-
-    nlp = spacy.load('en_atticus_classifier_bert')
-
-    clause = """The Joint Venturers shall maintain adequate books
-    and records to be kept of all the Joint Venture activities and affairs
-    conducted pursuant to the terms of this Agreement. All direct costs and
-    expenses, which shall include any insurance costs in connection with the
-    distribution of the Products or operations of the Joint Venture, or if the
-    business of the Joint Venture requires additional office facilities than
-    those now presently maintained by each Joint Venturer"""
-
-    cats = nlp(clause).cats
-    cats = [label for label in cats if cats[label] > .7]  # If you want to filter by similarity scores > .7
-    print(cats)  # Show the categories
-
-
-As discussed below, the performance of the model is good enough to be interesting,
-but currently not good enough to really be production ready. I *think* this is primarily
-due to the dataset being relatively small and many clause categories having fewer than 20
-examples. I wanted to release this as-is, however so others could experiment. As the Atticus
-Project corpus grows, these classifiers should get better. In my experience 50 - 100 examples
-is typically a good target to aim for, so doubling or tripling the Atticus Corpus will
-hopefully lead to much, much better performance.
-
-Build a Word2Vec-Based Model
-############################
-
-I first experimented with using Spacy's OOTB Word2Vec models. This approach was very
-quick to train, but the performance was not very good. The f-score was about .6. I also
-tried using a different set of word embeddings released as "Law2Vec", and these improved
-performance marginally to an F-Score of ~.64. I've included the code to train these models
-in Word2VecModelBuilder.py. You can simply run that python script. The default settings
-will load Spacy's en_core_web_lg model and embeddings. You can also load the Law2Vec model
-if you download the vector file::
-
-    wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt
-
-Then you can use Spacy to convert this file into a Spacy-compatible model like so::
-
-    mkdir /models
-    python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt
-
-Then you can change the model argument (per the example above) to '/models/Law2VecModel'.
-You probably want to change the output_dir too. Once you've trained a new model, you can
-load the trained model with spacy.load(output_dir).
-
-Train a BERT-based Model
-########################
-
-Overview
-The transformer models encode a lot more contextual information about words than Word2Vec models,
-so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good
-news was performance increased substantially using a BERT-based model. This is still probably not good enough for use in production, but it's good
-enough to yield some interesting insights, particularly if you set your similarity threshold very
-high.
-
-Training Results
-Using a BERT-based model, the beta release of the Atticus training set yields
-an acceptable (but still not really production-ready) F-score of .735::
+Using a BERT-based model, the beta release of the Atticus training set yields
+an acceptable (but still not really production-ready) F-score of .735::

    LOSS     P      R      F
    1.093  0.739  0.472  0.576

@@ -100,134 +29,11 @@ Training Results
    0.219  0.751  0.720  0.735
    0.551  0.751  0.720  0.735

-Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
-graphics card is absolutely recommended. Using a Nvidia 1050 Ti, the above training
-took about three hours.
-
-Step 1 - Sign Up for Atticus Project Data and Download
-I've included the Atticus CSV in the repository for convenience, but you should go to the
-Atticus Project website and signup there. For one, they would like to collect user and
-contact info for people downloading their dataset. For another, you should go there to make
-sure you get the latest version of their dataset.
-
-Step 2 - Install Python Dependencies and SPACY BERT Model
-First, install Python dependencies (I'm using LexNLP to tokenize test data, you do not
-need it to build the model)::
-
-    pip install lexnlp spacy pip install spacy-transformers==0.5.2 pandas
-
-Then, download the BERT transformer model::
-
-    !python -m spacy download en_trf_bertbaseuncased_lg
-
-Step 3 - Load Atticus Data and Format for Spacy
-The Atticus dataset is a csv, so we can use Pandas to load and manipulate it. Since
-we're training classifiers and not answering questions, we only care about the columns
-containing text for a given classification. The columns with headers marked "...-Answer"
-are meant for question-answering and we don't want to train on this data. We also don't
-really want the filename column or the document title columns, which are the first and
-second columns respectively. The following function will load our Atticus CSV, filter
-out the ...-Answer cols, the filename col and the document title col. Then, it will
-format the data into Spacy's preferred training format and split the training set into
-two pieces - a training set and an evaluation set. The default is to split the total data
-set so 80% is used for training and 20% is used for evaluation.
-
-**Code**::
-
-    def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):
-
-        """
-        Load data from the atticus csv (omitting the answer cols as we want to train classifiers
-        not question answering).
-
-        Data is returned in the Spacy training format:
-        TRAIN_DATA = [
-            ("text1", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
-        ]
-
-        A list of headers is also returned so you can add these labels. FYI, the Filename and Doc name
-        columns are dropped as well.
-
-        """
-
-        # Load csv
-        atticus_clauses_df = pd.read_csv(filepath)
-
-        # Do a little post-processing
-        data_headers = [h for h in list(atticus_clauses_df.columns) if not "Answer" in h]
-        data_headers.pop(0)  # Drop filename col (index 0 for col 1)
-        data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))
-
-        training_values = {i: 0 for i in data_headers}
-        atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]
-
-        train_data = []
-
-        # Iterate over csv to build training data dict
-        for header in atticus_clauses_data_df.columns:
-
-            for row in atticus_clauses_data_df[[header]].iterrows():
-
-                value = row[1][header]
-
-                if not pd.isnull(value):
-                    train_data.append((value, {'cats': {**training_values, header: 1}}))
-
-        return train_data, data_headers
-
-
-    def create_training_set(train_data=[{}], limit=0, split=0.8):
-        """Load data from the Atticus dataset, splitting off a held-out set."""
-        random.shuffle(train_data)
-        train_data = train_data[-limit:]
-
-        texts, labels = zip(*train_data)
-        split = int(len(train_data) * split)
-
-        # Return data in format that matches example here:
-        # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
-        return (texts[:split], labels[:split]), (texts[split:], labels[split:])
-
-
-Step 4 - Build the Model
-*WARNING - running the training takes a looong time, even if you have a CUDA-compatible
-graphics card and it's properly configured in your environment*
-
-You can just run the BertModelBuilder.py with default settings. On my Nvidia 1050 Ti, it took
-about 3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you
-just use my pre-built models.
-
-Packaging / Serving Model for Use
-#################################
-
-You can follow Spacy's excellent instructions `here <https://spacy.io/api/cli#package>`_
-to package up the final model into a tar that can be installed with pip like this::
-
-    pip install local_path_to_tar.tar.gz
-
-I've uploaded the package to my public AWS bucket, and you can install directly from there
-like so::
-
-    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
-Now you can load it just like this::
-
-    nlp = spacy.load('en-atticus-classifier-bert-based-0.1.0')
-
-I plan to also upload this to PyPi as well so you can just do something like this::
-
-    pip install atticus_classifiers_spacy (DOESN'T WORK YET)
-
-Another option, is you can load the pickled model in the pre-trained folder::
-
-    import pickle
-    import spacy
-
-    nlp = pickle.load(open("/path/to/BertClassifier.pickle", "rb"))
-
-    # Then you can use the spacy object just like normal:
-    clause = "Test clause"
-    cats = nlp(clause).cats
-    cats = [label for label in cats if cats[label] > .7]  # If you want to look only at labels with similarity scores over .7
-    print(cats)
+Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
+graphics card is absolutely recommended. Using a Nvidia 1050 Ti, the above training
+took about three hours.

+Training older, Word2Vec-based models yields less promising results but is much,
+much faster to train. Spacy's en_web_core_lg model yields an f-score of around .6.
+Using Word2Vec embeddings trained on legal data (such as Law2Vec) yields a slightly
+better f-score of around .635. Neither is as good as the BERT-based approach, however.
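
As an aside, the workflow the revised README describes (clean and split the contract text with LexNLP, then run each chunk through the trained Spacy classifier and keep the high-scoring labels) would look roughly like the sketch below. The model path, the input file name and the 0.7 score threshold are illustrative assumptions rather than values fixed by the repository::

    import spacy
    from lexnlp.nlp.en.segments.sentences import pre_process_document, normalize_text, get_sentence_list

    # Load a previously trained, classification-only Spacy pipeline
    # (point this at whatever output_dir you trained into).
    nlp = spacy.load("/models/BertClassifier")

    # Any raw contract text will do as input.
    text = open("example_contract.txt").read()

    # Clean the document, split it into sentences with LexNLP, then run each
    # sentence through the classifier and keep only confident labels.
    for sent in get_sentence_list(normalize_text(pre_process_document(text))):
        cats = nlp(sent).cats
        likely = [label for label in cats if cats[label] > .7]
        if likely:
            print(sent)
            print(likely)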

TestClassifier.py

Lines changed: 10 additions & 0 deletions
@@ -1,13 +1,20 @@
 import spacy
+<<<<<<< HEAD
 import pickle
+=======
+>>>>>>> 706b9c26af2815cc3e4afb9e7ef2b9281fa8fd66
 from lexnlp.nlp.en.segments.sentences import pre_process_document, normalize_text, get_sentence_list

 # Here we load our pre-trained Spacy model with the classifiers. Point Spacy to the directory you
 # saved your model to in Word2VecModelBuilder or BertModelBuilder. Be aware that this model is missing
 # other pipeline components and really just does classification, so, if you need other Spacy features,
 # you need to add them to the model pipeline after loading or change how the model is generated and saved.
+<<<<<<< HEAD
 nlp = spacy.load("/models/BertClassifier")
 nlp_pickle = pickle.load(open("./pre-trained/BertClassifier.pickle", "rb"))
+=======
+nlp = spacy.load("/home/jman/PycharmProjects/AtticusWord2Vec/data/Law2VecClassiier")
+>>>>>>> 706b9c26af2815cc3e4afb9e7ef2b9281fa8fd66

 text="""JOINT VENTURE AGREEMENT
 Collectible Concepts Group, Inc. ("CCGI") and Pivotal Self Service Tech, Inc.

@@ -89,9 +96,12 @@
     cats = [label for label in cats if cats[label] > .7]
     print(cats)

+<<<<<<< HEAD
 print("\n\n#############################################33\nTest Pickled Model\n")
 for sent in get_sentence_list(normalize_text(pre_process_document(text))):
     print(f"LexNLP Sentence: {sent}")
     cats = nlp_pickle(sent).cats
     cats = [label for label in cats if cats[label] > .7]
     print(cats)
+=======
+>>>>>>> 706b9c26af2815cc3e4afb9e7ef2b9281fa8fd66
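
Note that this merge leaves unresolved conflict markers (<<<<<<< HEAD / ======= / >>>>>>> 706b9c2) in TestClassifier.py, so the file will not run as committed. Resolving the conflict presumably means keeping one of the two model loads; keeping the HEAD side (which the pickled-model test at the bottom of the script depends on) would leave the top of the file looking roughly like this (the choice of side is an assumption)::

    import spacy
    import pickle
    from lexnlp.nlp.en.segments.sentences import pre_process_document, normalize_text, get_sentence_list

    # HEAD side: load the saved Spacy model directory plus the pickled copy
    # that the "Test Pickled Model" section later in the script uses.
    nlp = spacy.load("/models/BertClassifier")
    nlp_pickle = pickle.load(open("./pre-trained/BertClassifier.pickle", "rb"))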

0 commit comments