- ******************************************
- Atticus Legal Clause Classifiers for Spacy
- ******************************************
+ This is still a work in progress and is not meant for public use yet.
+ I am leaving the repository public in case anyone stumbles across it and it
+ saves you some time in preparing your own Atticus classifiers. I am planning
+ to write a blog post / instructions once the performance is slightly better.

- Introduction
- ############
+ I am using LexNLP for its great sentence tokenization functionality. You
+ could probably get decent performance out of NLTK, and Spacy's own splitting is OK.
+ Currently, I train the classifiers, use LexNLP to clean and split the text into
+ sentences / sections / whatever, and then run those chunks through Spacy and
+ check the category labels.
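+
+ As a rough sketch of that workflow (the model package name, the file name and the .7
+ threshold here are placeholders; LexNLP's get_sentence_list helper does the splitting)::
+
+     import spacy
+     from lexnlp.nlp.en.segments.sentences import get_sentence_list
+
+     nlp = spacy.load('en_atticus_classifier_bert')  # placeholder model name
+
+     with open('some_contract.txt') as f:  # placeholder file
+         contract_text = f.read()
+
+     # LexNLP handles cleaning / sentence splitting; Spacy scores each chunk
+     for sentence in get_sentence_list(contract_text):
+         cats = nlp(sentence).cats
+         labels = [label for label in cats if cats[label] > .7]
+         if labels:
+             print(labels, sentence[:80])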

- The `Atticus Project <https://www.atticusprojectai.org/>`_ was recently announced as an initiative
- to, among other things, build a world-class corpus of labelled legal contracts which could be used
- to train and/or benchmark text classifiers and question-answering NLP models. Their initial release
- contains 200 labelled contracts. I wanted to experiment with the dataset and build a working classifier
- that I could use on contract data, so I set out to build a simple project that loads the dataset, converts it
- into a format that Spacy can read, and then trains some classifiers to see how the dataset performs.
- This repository contains the code I used to train classifiers based on 1) Word2Vec embeddings and 2)
- a BERT-based transformer model.
+ To get the full (and excellent) Atticus dataset, go here:
+ https://www.atticusprojectai.org/

- Quickstart - Use the Classifier
- ###############################
-
- If you are in a hurry to test out the classifiers and are not really interested in how they were trained,
- you can currently install the classifier directly from a package I'm hosting on my AWS bucket by typing::
-
-     pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
- This should install Spacy, Spacy-transformers, a BERT model and the classifiers. Once you've installed the
- model, you can use it like this::
-
-     import spacy
-
-     nlp = spacy.load('en_atticus_classifier_bert')
-
-     clause = """The Joint Venturers shall maintain adequate books
-     and records to be kept of all the Joint Venture activities and affairs
-     conducted pursuant to the terms of this Agreement. All direct costs and
-     expenses, which shall include any insurance costs in connection with the
-     distribution of the Products or operations of the Joint Venture, or if the
-     business of the Joint Venture requires additional office facilities than
-     those now presently maintained by each Joint Venturer"""
-
-     cats = nlp(clause).cats
-     cats = [label for label in cats if cats[label] > .7]  # If you want to filter by similarity scores > .7
-     print(cats)  # Show the categories
-
-
- As discussed below, the performance of the model is good enough to be interesting,
- but currently not good enough to really be production ready. I *think* this is primarily
- due to the dataset being relatively small and many clause categories having fewer than 20
- examples. I wanted to release this as-is, however, so others could experiment. As the Atticus
- Project corpus grows, these classifiers should get better. In my experience, 50 - 100 examples
- is typically a good target to aim for, so doubling or tripling the Atticus Corpus will
- hopefully lead to much, much better performance.
-
- Build a Word2Vec-Based Model
- ############################
-
- I first experimented with using Spacy's out-of-the-box Word2Vec models. This approach was very
- quick to train, but the performance was not very good. The F-score was about .6. I also
- tried using a different set of word embeddings released as "Law2Vec", and these improved
- performance marginally to an F-score of ~.64. I've included the code to train these models
- in Word2VecModelBuilder.py. You can simply run that Python script. The default settings
- will load Spacy's en_core_web_lg model and embeddings. You can also load the Law2Vec model
- if you download the vector file::
-
-     wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt
-
- Then you can use Spacy to convert this file into a Spacy-compatible model like so::
-
-     mkdir /models
-     python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt
-
- Then you can change the model argument (per the example above) to '/models/Law2VecModel'.
- You probably want to change the output_dir too. Once you've trained a new model, you can
- load the trained model with spacy.load(output_dir).
-
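- For example (a minimal sketch; '/models/AtticusWord2Vec' stands in for whatever
- output_dir you passed to Word2VecModelBuilder.py)::
-
-     import spacy
-
-     # The converted Law2Vec vectors from the command above (no classifier yet)
-     law2vec = spacy.load('/models/Law2VecModel')
-     print(law2vec.vocab.vectors.shape)  # roughly (n_words, 200) for the 200d vectors
-
-     # A classifier trained by Word2VecModelBuilder.py (path is a placeholder)
-     nlp = spacy.load('/models/AtticusWord2Vec')
-     cats = nlp("Confidential Information shall not include information that ...").cats
-     print([label for label in cats if cats[label] > .7])
-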
- Train a BERT-based Model
- ########################
-
- Overview
- The transformer models encode a lot more contextual information about words than Word2Vec models,
- so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good
- news was that performance increased substantially with a BERT-based model. This is still probably
- not good enough for use in production, but it's good enough to yield some interesting insights,
- particularly if you set your similarity threshold very high.
-
- Training Results
- Using a BERT-based model, the beta release of the Atticus training set yields
- an acceptable (but still not really production-ready) F-score of .735::
+ Using a BERT-based model, the beta release of the Atticus training set yields
+ an acceptable (but still not really production-ready) F-score of .735::

      LOSS    P       R       F
      1.093   0.739   0.472   0.576
@@ -100,134 +29,11 @@ Training Results
      0.219   0.751   0.720   0.735
      0.551   0.751   0.720   0.735

- Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
- graphics card is absolutely recommended. Using an Nvidia 1050 Ti, the above training
- took about three hours.
-
- Step 1 - Sign Up for Atticus Project Data and Download
- I've included the Atticus CSV in the repository for convenience, but you should go to the
- Atticus Project website and sign up there. For one, they would like to collect user and
- contact info for people downloading their dataset. For another, you should go there to make
- sure you get the latest version of their dataset.
-
- Step 2 - Install Python Dependencies and Spacy BERT Model
- First, install the Python dependencies (I'm using LexNLP to tokenize test data; you do not
- need it to build the model)::
-
-     pip install lexnlp spacy spacy-transformers==0.5.2 pandas
-
- Then, download the BERT transformer model::
-
-     python -m spacy download en_trf_bertbaseuncased_lg
-
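- You can sanity-check the install (and whether Spacy sees your GPU) with a couple of lines -
- the pipeline names shown here are what I'd expect from the stock BERT package, so treat
- them as a rough guide::
-
-     import spacy
-
-     print(spacy.prefer_gpu())  # True if a usable CUDA device was found
-
-     nlp = spacy.load('en_trf_bertbaseuncased_lg')
-     print(nlp.pipe_names)  # e.g. ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
-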
- Step 3 - Load Atticus Data and Format for Spacy
- The Atticus dataset is a CSV, so we can use Pandas to load and manipulate it. Since
- we're training classifiers and not answering questions, we only care about the columns
- containing text for a given classification. The columns with headers marked "...-Answer"
- are meant for question answering, and we don't want to train on that data. We also don't
- really want the filename column or the document title column, which are the first and
- second columns respectively. The following function will load our Atticus CSV and filter
- out the ...-Answer cols, the filename col and the document title col. Then, it will
- format the data into Spacy's preferred training format and split it into
- two pieces - a training set and an evaluation set. The default split is 80% for
- training and 20% for evaluation.
-
- **Code**::
-
-     import random
-
-     import pandas as pd
-
-
-     def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):
-         """
-         Load data from the Atticus csv (omitting the answer cols as we want to train classifiers,
-         not question answering).
-
-         Data is returned in the Spacy training format:
-             TRAIN_DATA = [
-                 ("text1", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
-             ]
-
-         A list of headers is also returned so you can add these labels. FYI, the Filename and Doc name
-         columns are dropped as well.
-         """
-
-         # Load csv
-         atticus_clauses_df = pd.read_csv(filepath)
-
-         # Do a little post-processing
-         data_headers = [h for h in list(atticus_clauses_df.columns) if "Answer" not in h]
-         data_headers.pop(0)  # Drop filename col (index 0 for col 1)
-         data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))
-
-         # Every example starts with a 0 for every label; the matching label is set to 1 below
-         training_values = {i: 0 for i in data_headers}
-         atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]
-
-         train_data = []
-
-         # Iterate over csv to build training data dict
-         for header in atticus_clauses_data_df.columns:
-             for row in atticus_clauses_data_df[[header]].iterrows():
-                 value = row[1][header]
-                 if not pd.isnull(value):
-                     train_data.append((value, {'cats': {**training_values, header: 1}}))
-
-         return train_data, data_headers
-
-
-     def create_training_set(train_data=[{}], limit=0, split=0.8):
-         """Shuffle the Atticus examples and split off a held-out evaluation set."""
-         random.shuffle(train_data)
-         train_data = train_data[-limit:]  # limit=0 keeps everything (list[-0:] is the whole list)
-
-         texts, labels = zip(*train_data)
-         split = int(len(train_data) * split)
-
-         # Return data in format that matches example here:
-         # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
-         return (texts[:split], labels[:split]), (texts[split:], labels[split:])
-
-
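- Putting the two together (using the default CSV path from above) looks roughly like this::
-
-     train_data, labels = load_atticus_data()
-
-     # 80/20 split into a training set and an evaluation set
-     (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data)
-
-     print(len(labels), "labels")
-     print(len(train_texts), "training examples,", len(dev_texts), "evaluation examples")
-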
- Step 4 - Build the Model
- *WARNING - running the training takes a long time, even if you have a CUDA-compatible
- graphics card and it's properly configured in your environment.*
-
- You can just run BertModelBuilder.py with default settings. On my Nvidia 1050 Ti, it took
- about 3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you
- just use my pre-built models.
-
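- If you're curious what BertModelBuilder.py does under the hood, the core of it is a fairly
- standard spacy-transformers text classification loop over the data prepared in Step 3.
- Roughly (a simplified sketch rather than the actual script - the epoch count, batch size
- and output path are placeholders)::
-
-     import random
-
-     import spacy
-     import torch
-     from spacy.util import minibatch
-
-     if spacy.prefer_gpu():
-         torch.set_default_tensor_type("torch.cuda.FloatTensor")
-
-     nlp = spacy.load("en_trf_bertbaseuncased_lg")
-
-     # One non-exclusive label per Atticus clause category
-     textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})
-     train_data, labels = load_atticus_data()
-     for label in labels:
-         textcat.add_label(label)
-     nlp.add_pipe(textcat, last=True)
-
-     (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data)
-
-     optimizer = nlp.resume_training()
-     for epoch in range(10):  # placeholder epoch count
-         losses = {}
-         examples = list(zip(train_texts, train_cats))
-         random.shuffle(examples)
-         for batch in minibatch(examples, size=8):  # placeholder batch size
-             texts, annotations = zip(*batch)
-             nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
-         print(epoch, losses)
-
-     nlp.to_disk("/models/AtticusBert")  # placeholder output path
-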
- Packaging / Serving Model for Use
- #################################
-
- You can follow Spacy's excellent instructions `here <https://spacy.io/api/cli#package>`_
- to package up the final model into a tar that can be installed with pip like this::
-
-     pip install local_path_to_tar.tar.gz
-
- I've uploaded the package to my public AWS bucket, and you can install directly from there
- like so::
-
-     pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
- Now you can load it just like this::
-
-     nlp = spacy.load('en_atticus_classifier_bert')
-
- I plan to upload this to PyPI as well so you can just do something like this::
-
-     pip install atticus_classifiers_spacy (DOESN'T WORK YET)
-
- Another option is to load the pickled model in the pre-trained folder::
-
-     import pickle
-     import spacy
-
-     nlp = pickle.load(open("/path/to/BertClassifier.pickle", "rb"))
-
-     # Then you can use the spacy object just like normal:
-     clause = "Test clause"
-     cats = nlp(clause).cats
-     cats = [label for label in cats if cats[label] > .7]  # If you want to look only at labels with similarity scores over .7
-     print(cats)
+ Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
+ graphics card is absolutely recommended. Using an Nvidia 1050 Ti, the above training
+ took about three hours.

+ Training older, Word2Vec-based models yields less promising results but is much,
+ much faster to train. Spacy's en_core_web_lg model yields an F-score of around .6.
+ Using Word2Vec embeddings trained on legal data (such as Law2Vec) yields a slightly
+ better F-score of around .635. Neither is as good as the BERT-based approach, however.