
Commit ca856d1 (1 parent: 1ec1b6d)

Updated Readme to show how to just load the packaged model.

1 file changed: README.rst (+213 additions, -20 deletions)

******************************************
Atticus Legal Clause Classifiers for Spacy
******************************************

Introduction
############

The `Atticus Project <https://www.atticusprojectai.org/>`_ was recently announced as an initiative
to, among other things, build a world-class corpus of labelled legal contracts that can be used to
train and/or benchmark text classifiers and question-answering NLP models. Their initial release
contains 200 labelled contracts. I wanted to experiment with the dataset and build a working
classifier that I could use on contract data, so I set out to create a simple project to load the
dataset, convert it into a format that Spacy can read, and then train some classifiers to see how
the dataset performs. This repository contains the code I used to train classifiers based on
1) Word2Vec embeddings and 2) a BERT-based transformer model.

Quickstart - Use the Classifier
###############################

If you are in a hurry to test out the classifiers and are not really interested in how they were
trained, you can currently install the classifier directly from a package I'm hosting on my AWS
bucket by typing::

    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz

This should install Spacy, Spacy-transformers, a BERT model and the classifiers. Once you've
installed the model, you can use it like this::

    import spacy

    nlp = spacy.load('en_atticus_classifier_bert')

    clause = """The Joint Venturers shall maintain adequate books
    and records to be kept of all the Joint Venture activities and affairs
    conducted pursuant to the terms of this Agreement. All direct costs and
    expenses, which shall include any insurance costs in connection with the
    distribution of the Products or operations of the Joint Venture, or if the
    business of the Joint Venture requires additional office facilities than
    those now presently maintained by each Joint Venturer"""

    cats = nlp(clause).cats
    cats = [label for label in cats if cats[label] > .7]  # Keep only labels with scores over .7
    print(cats)  # Show the categories

As discussed below, the performance of the model is good enough to be interesting, but currently
not good enough to really be production-ready. I *think* this is primarily due to the dataset being
relatively small, with many clause categories having fewer than 20 examples. I wanted to release
this as-is, however, so others could experiment. As the Atticus Project corpus grows, these
classifiers should get better. In my experience, 50 - 100 examples per category is typically a good
target to aim for, so doubling or tripling the Atticus corpus will hopefully lead to much, much
better performance.

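To run the classifier over a whole contract rather than a single clause, split the text into
sentences first and check the category labels on each chunk. I use LexNLP's sentence tokenizer for
this, but any sentence splitter will do. A minimal sketch (the contract text and the .7 threshold
are placeholders; see the LexNLP docs for its sentence-splitting helpers)::

    import spacy
    from lexnlp.nlp.en.segments.sentences import get_sentence_list  # Any sentence splitter works here

    nlp = spacy.load('en_atticus_classifier_bert')

    contract_text = "..."  # Full contract text, loaded from wherever you keep it

    labelled_clauses = []
    for sentence in get_sentence_list(contract_text):
        cats = nlp(sentence).cats
        labels = [label for label in cats if cats[label] > .7]  # Same .7 threshold as above
        if labels:
            labelled_clauses.append((sentence, labels))

    print(labelled_clauses)
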
Build a Word2Vec-Based Model
############################

I first experimented with Spacy's out-of-the-box Word2Vec models. This approach was very quick to
train, but the performance was not very good: the F-score was about .6. I also tried a different
set of word embeddings released as "Law2Vec", and these improved performance marginally, to an
F-score of ~.64. I've included the code to train these models in Word2VecModelBuilder.py; you can
simply run that Python script. The default settings will load Spacy's en_core_web_lg model and
embeddings. You can also use the Law2Vec embeddings if you download the vector file::

    wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt

Then you can use Spacy to convert this file into a Spacy-compatible model like so::

    mkdir /models
    python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt

Then you can change the model argument (per the example above) to '/models/Law2VecModel'. You
probably want to change the output_dir too. Once you've trained a new model, you can load it with
spacy.load(output_dir), as sketched below.

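For example (the paths below are placeholders for whatever model and output_dir you actually pass
to Word2VecModelBuilder.py)::

    import spacy

    # Sanity-check the converted Law2Vec model created by `spacy init-model` above
    law2vec = spacy.load('/models/Law2VecModel')
    print(law2vec.vocab.vectors.shape)  # Roughly (number of words, 200) for the 200d Law2Vec file

    # Load the classifier trained by Word2VecModelBuilder.py from its output directory
    nlp = spacy.load('/models/AtticusWord2VecClassifier')  # Placeholder output_dir
    clause = "Each party shall keep the other party's Confidential Information strictly confidential."
    print(nlp(clause).cats)
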
Train a BERT-based Model
########################

Overview
********

The transformer models encode a lot more contextual information about words than Word2Vec models,
so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good news
is that performance increased substantially with a BERT-based model. This is still probably not
good enough for use in production, but it's good enough to yield some interesting insights,
particularly if you set your similarity threshold very high.

Training Results
****************

Using a BERT-based model, the beta release of the Atticus training set yields an acceptable (but
still not really production-ready) F-score of .735::

    LOSS    P       R       F
    1.093   0.739   0.472   0.576
    ...
    0.219   0.751   0.720   0.735
    0.551   0.751   0.720   0.735

Training the BERT-based model takes a lot more computing power, and a CUDA-compatible graphics card
is strongly recommended. Using an Nvidia 1050 Ti, the above training took about three hours.

Step 1 - Sign Up for Atticus Project Data and Download
*******************************************************

I've included the Atticus CSV in the repository for convenience, but you should go to the Atticus
Project website and sign up there. For one, they would like to collect user and contact info for
people downloading their dataset. For another, you should go there to make sure you get the latest
version of their dataset.

Step 2 - Install Python Dependencies and Spacy BERT Model
***********************************************************

First, install the Python dependencies (I'm using LexNLP to tokenize test data; you do not need it
to build the model)::

    pip install lexnlp spacy spacy-transformers==0.5.2 pandas

Then, download the BERT transformer model::

    python -m spacy download en_trf_bertbaseuncased_lg

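A quick sanity check that the download worked (just a sketch; the exact pipeline component names
may differ between spacy-transformers versions)::

    import spacy

    nlp = spacy.load("en_trf_bertbaseuncased_lg")
    print(nlp.pipe_names)  # Should include the transformer components, e.g. trf_wordpiecer and trf_tok2vec
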
Step 3 - Load Atticus Data and Format for Spacy
*************************************************

The Atticus dataset is a CSV, so we can use Pandas to load and manipulate it. Since we're training
classifiers and not answering questions, we only care about the columns containing text for a given
classification. The columns with headers marked "...-Answer" are meant for question-answering, and
we don't want to train on that data. We also don't really want the filename column or the document
title column, which are the first and second columns respectively. The following functions load the
Atticus CSV, filter out the "...-Answer" columns, the filename column and the document title
column, format the data into Spacy's preferred training format, and split the data into two
pieces - a training set and an evaluation set. The default is to split the total dataset so 80% is
used for training and 20% for evaluation.

**Code**::

    import random

    import pandas as pd


    def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):
        """
        Load data from the atticus csv (omitting the answer cols as we want to train classifiers,
        not question answering).

        Data is returned in the Spacy training format:
            TRAIN_DATA = [
                ("text1", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
            ]

        A list of headers is also returned so you can add these labels. FYI, the Filename and Doc name
        columns are dropped as well.
        """

        # Load csv
        atticus_clauses_df = pd.read_csv(filepath)

        # Do a little post-processing
        data_headers = [h for h in list(atticus_clauses_df.columns) if "Answer" not in h]
        data_headers.pop(0)  # Drop filename col (index 0 for col 1)
        data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))

        training_values = {i: 0 for i in data_headers}
        atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]

        train_data = []

        # Iterate over csv to build training data dict
        for header in atticus_clauses_data_df.columns:
            for row in atticus_clauses_data_df[[header]].iterrows():
                value = row[1][header]
                if not pd.isnull(value):
                    train_data.append((value, {'cats': {**training_values, header: 1}}))

        return train_data, data_headers


    def create_training_set(train_data=[{}], limit=0, split=0.8):
        """Load data from the Atticus dataset, splitting off a held-out set."""
        random.shuffle(train_data)
        train_data = train_data[-limit:]

        texts, labels = zip(*train_data)
        split = int(len(train_data) * split)

        # Return data in format that matches example here:
        # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
        return (texts[:split], labels[:split]), (texts[split:], labels[split:])

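Putting the two functions together looks something like this (the default filepath assumes the beta
dataset is unpacked under /tmp/aok_beta, per the default argument above)::

    train_data, data_headers = load_atticus_data()
    (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data, split=0.8)

    print(f"{len(train_texts)} training examples / {len(dev_texts)} evaluation examples")
    print(data_headers[:5])  # First few clause categories, which become the classifier's labels
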
Step 4 - Build the Model
*************************

*WARNING - running the training takes a long time, even if you have a CUDA-compatible graphics card
and it's properly configured in your environment.*

You can just run BertModelBuilder.py with the default settings. On my Nvidia 1050 Ti, it took about
3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you just use my
pre-built models. For reference, a rough sketch of what the training setup looks like is below.

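This is a loose sketch modelled on the spacy-transformers text classification example, using the
load_atticus_data and create_training_set helpers from Step 3 - it is not a copy of
BertModelBuilder.py, and the epoch count, batch size and dropout are illustrative only, so see the
script itself for the real settings::

    import random

    import spacy
    from spacy.util import minibatch

    nlp = spacy.load("en_trf_bertbaseuncased_lg")

    # Add a transformer text classifier and register every Atticus clause category as a label
    textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})
    train_data, data_headers = load_atticus_data()
    for label in data_headers:
        textcat.add_label(label)
    nlp.add_pipe(textcat, last=True)

    (train_texts, train_cats), _ = create_training_set(train_data)
    optimizer = nlp.resume_training()

    for epoch in range(4):  # Illustrative epoch count
        losses = {}
        examples = list(zip(train_texts, train_cats))
        random.shuffle(examples)
        for batch in minibatch(examples, size=8):  # Illustrative batch size
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
        print(epoch, losses)
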
Packaging / Serving Model for Use
#################################

You can follow Spacy's excellent instructions `here <https://spacy.io/api/cli#package>`_ to package
up the final model into a tar that can be installed with pip like this::

    pip install local_path_to_tar.tar.gz

I've uploaded the package to my public AWS bucket, and you can install directly from there like
so::

    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz

Now you can load it just like this::

    nlp = spacy.load('en_atticus_classifier_bert')

I also plan to upload this to PyPI so you can just do something like this::

    pip install atticus_classifiers_spacy (DOESN'T WORK YET)

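Since it is packaged as a standard Spacy model, you can also import it as a module if you prefer
(a sketch; the module name matches the installed package)::

    import en_atticus_classifier_bert

    nlp = en_atticus_classifier_bert.load()
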
Another option is to load the pickled model in the pre-trained folder::

    import pickle
    import spacy

    nlp = pickle.load(open("/path/to/BertClassifier.pickle", "rb"))

    # Then you can use the spacy object just like normal:
    clause = "Test clause"
    cats = nlp(clause).cats
    cats = [label for label in cats if cats[label] > .7]  # If you want to look only at labels with similarity scores over .7
    print(cats)
