******************************************
Atticus Legal Clause Classifiers for Spacy
******************************************

Introduction
############

The `Atticus Project <https://www.atticusprojectai.org/>`_ was recently announced as an initiative
to, among other things, build a world-class corpus of labelled legal contracts which can be used
to train and/or benchmark text classifiers and question-answering NLP models. Their initial release
contains 200 labelled contracts. I wanted to experiment with the dataset and build a working
classifier that I could use on contract data, so I set out to build a simple project to load the
dataset, convert it into a format that Spacy can read, and then train some classifiers to see how
the dataset performs. This repository contains the code I used to train classifiers based on
1) Word2Vec embeddings and 2) a BERT-based transformer model.

Quickstart - Use the Classifier
###############################

If you are in a hurry to test out the classifiers and are not really interested in how they were
trained, you can install the classifier directly from a package I'm hosting on my AWS bucket by
typing::

    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz

This should install Spacy, spacy-transformers, a BERT model and the classifiers. Once you've
installed the model, you can use it like this::

    import spacy

    nlp = spacy.load('en_atticus_classifier_bert')

    clause = """The Joint Venturers shall maintain adequate books
    and records to be kept of all the Joint Venture activities and affairs
    conducted pursuant to the terms of this Agreement. All direct costs and
    expenses, which shall include any insurance costs in connection with the
    distribution of the Products or operations of the Joint Venture, or if the
    business of the Joint Venture requires additional office facilities than
    those now presently maintained by each Joint Venturer"""

    cats = nlp(clause).cats
    cats = [label for label in cats if cats[label] > .7]  # If you want to filter by similarity scores > .7
    print(cats)  # Show the categories

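To classify a whole contract rather than a single clause, I split the document into sentences first
and run each chunk through the pipeline. Here is a minimal sketch using LexNLP's sentence segmenter
(installed in Step 2 below); the contract path is a placeholder::

    import spacy
    from lexnlp.nlp.en.segments.sentences import get_sentence_list

    nlp = spacy.load('en_atticus_classifier_bert')

    # 'my_contract.txt' is a placeholder - point it at any plain-text contract
    with open('my_contract.txt') as contract_file:
        contract_text = contract_file.read()

    for sentence in get_sentence_list(contract_text):
        cats = nlp(sentence).cats
        labels = [label for label in cats if cats[label] > .9]  # keep only high-scoring labels
        if labels:
            print(labels, sentence[:75])
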
As discussed below, the performance of the model is good enough to be interesting, but not yet good
enough to be production-ready. I *think* this is primarily due to the dataset being relatively
small, with many clause categories having fewer than 20 examples. I wanted to release this as-is,
however, so others could experiment. As the Atticus Project corpus grows, these classifiers should
get better. In my experience, 50 - 100 examples per label is typically a good target to aim for, so
doubling or tripling the Atticus corpus will hopefully lead to much, much better performance.

Build a Word2Vec-Based Model
############################

I first experimented with Spacy's out-of-the-box Word2Vec models. This approach was very quick to
train, but the performance was not very good: the F-score was about .6. I also tried a different
set of word embeddings released as "Law2Vec", which improved performance marginally, to an F-score
of roughly .64. I've included the code to train these models in Word2VecModelBuilder.py; you can
simply run that Python script. The default settings load Spacy's en_core_web_lg model and
embeddings. You can also use the Law2Vec embeddings if you download the vector file::

    wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt

Then use Spacy to convert this file into a Spacy-compatible model like so::

    mkdir /models
    python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt

Then change the model argument (per the example above) to '/models/Law2VecModel'. You probably want
to change the output_dir too. Once you've trained a new model, you can load it with
spacy.load(output_dir).

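If you want to see the moving parts, the training in Word2VecModelBuilder.py boils down to a
standard Spacy v2 textcat training loop along these lines. This is an illustrative sketch rather
than a verbatim copy of the script, and it assumes the load_atticus_data / create_training_set
helpers from Step 3 below (imported here from a hypothetical atticus_data module)::

    import random

    import spacy
    from spacy.util import minibatch, compounding

    from atticus_data import load_atticus_data, create_training_set  # hypothetical module with the Step 3 helpers

    nlp = spacy.load("en_core_web_lg")  # or "/models/Law2VecModel" for the Law2Vec embeddings
    textcat = nlp.create_pipe("textcat", config={"exclusive_classes": False})
    nlp.add_pipe(textcat, last=True)

    train_data, labels = load_atticus_data()
    for label in labels:
        textcat.add_label(label)

    (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data)
    train_set = list(zip(train_texts, train_cats))

    # Only train the text classifier, leaving the rest of the pipeline untouched
    with nlp.disable_pipes(*[p for p in nlp.pipe_names if p != "textcat"]):
        optimizer = nlp.begin_training()
        for epoch in range(10):
            losses = {}
            random.shuffle(train_set)
            for batch in minibatch(train_set, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print(epoch, losses)

    nlp.to_disk("/models/atticus_word2vec")  # this is the output_dir you later pass to spacy.load()
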
Train a BERT-based Model
########################

Overview
********

The transformer models encode a lot more contextual information about words than Word2Vec models,
so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good news
is that performance increased substantially with a BERT-based model. It is still probably not good
enough for use in production, but it is good enough to yield some interesting insights,
particularly if you set your similarity threshold very high.

Training Results
****************

Using a BERT-based model, the beta release of the Atticus training set yields an acceptable (but
still not really production-ready) F-score of .735::

    LOSS      P       R       F
    1.093     0.739   0.472   0.576
    ...
    0.219     0.751   0.720   0.735
    0.551     0.751   0.720   0.735

Training the BERT-based model takes a lot more computing power, and a CUDA-compatible graphics card
is strongly recommended. Using an Nvidia 1050 Ti, the above training took about three hours.

Step 1 - Sign Up for Atticus Project Data and Download
*******************************************************

I've included the Atticus CSV in the repository for convenience, but you should go to the Atticus
Project website and sign up there. For one, they would like to collect user and contact info for
people downloading their dataset. For another, you should go there to make sure you get the latest
version of the dataset.

Step 2 - Install Python Dependencies and the Spacy BERT Model
**************************************************************

First, install the Python dependencies (I'm using LexNLP to tokenize test data; you do not need it
to build the model)::

    pip install lexnlp spacy spacy-transformers==0.5.2 pandas

Then, download the BERT transformer model::

    python -m spacy download en_trf_bertbaseuncased_lg

Step 3 - Load Atticus Data and Format for Spacy
************************************************

The Atticus dataset is a CSV, so we can use Pandas to load and manipulate it. Since we're training
classifiers and not answering questions, we only care about the columns containing the text for a
given classification. The columns with headers marked "...-Answer" are meant for question
answering, and we don't want to train on that data. We also don't want the filename column or the
document title column, which are the first and second columns respectively. The two functions below
load the Atticus CSV, filter out the ...-Answer columns, the filename column and the document title
column, format the data into Spacy's preferred training format, and split it into two pieces - a
training set and an evaluation set. The default is to use 80% of the data for training and 20% for
evaluation.

**Code**::

    import random

    import pandas as pd


    def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):
        """
        Load data from the Atticus csv (omitting the answer cols, as we want to train classifiers,
        not question answering).

        Data is returned in the Spacy training format:

            TRAIN_DATA = [
                ("text1", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
            ]

        A list of headers is also returned so you can add these labels. FYI, the filename and doc
        name columns are dropped as well.
        """

        # Load csv
        atticus_clauses_df = pd.read_csv(filepath)

        # Do a little post-processing: drop the question-answering ("...-Answer") columns
        data_headers = [h for h in list(atticus_clauses_df.columns) if not "Answer" in h]
        data_headers.pop(0)  # Drop filename col (index 0 for col 1)
        data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))

        training_values = {i: 0 for i in data_headers}
        atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]

        train_data = []

        # Iterate over the clause columns to build the training data list
        for header in atticus_clauses_data_df.columns:
            for row in atticus_clauses_data_df[[header]].iterrows():
                value = row[1][header]
                if not pd.isnull(value):
                    train_data.append((value, {'cats': {**training_values, header: 1}}))

        return train_data, data_headers


    def create_training_set(train_data=[{}], limit=0, split=0.8):
        """Shuffle the Atticus training data and split off a held-out evaluation set."""
        random.shuffle(train_data)
        train_data = train_data[-limit:]  # limit=0 keeps the entire dataset

        texts, labels = zip(*train_data)
        split = int(len(train_data) * split)

        # Return data in format that matches example here:
        # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
        return (texts[:split], labels[:split]), (texts[split:], labels[split:])

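Used together, the two helpers produce the train / evaluation split (the 0.8 split mirrors the
default)::

    train_data, labels = load_atticus_data()
    (train_texts, train_cats), (eval_texts, eval_cats) = create_training_set(train_data, split=0.8)
    print(len(labels), "clause categories;", len(train_texts), "training examples;",
          len(eval_texts), "evaluation examples")
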
Step 4 - Build the Model
*************************

*WARNING - running the training takes a looong time, even if you have a CUDA-compatible graphics
card and it's properly configured in your environment.*

You can just run BertModelBuilder.py with the default settings. On my Nvidia 1050 Ti, it took about
3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you just use my
pre-built models.

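If you do want to train it yourself, BertModelBuilder.py follows the same overall loop as the
Word2Vec sketch above, swapped onto the BERT pipeline and spacy-transformers' trf_textcat pipe.
Again, this is an illustrative sketch (hyperparameters, paths and the atticus_data module are
placeholders), not the verbatim script::

    import random

    import spacy
    from spacy.util import minibatch

    from atticus_data import load_atticus_data, create_training_set  # hypothetical module with the Step 3 helpers

    spacy.prefer_gpu()  # training on the CPU is painfully slow

    nlp = spacy.load("en_trf_bertbaseuncased_lg")
    textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})
    nlp.add_pipe(textcat, last=True)

    train_data, labels = load_atticus_data()
    for label in labels:
        textcat.add_label(label)

    (train_texts, train_cats), (eval_texts, eval_cats) = create_training_set(train_data)
    train_set = list(zip(train_texts, train_cats))

    optimizer = nlp.resume_training()  # resume (not begin) so the pretrained BERT weights are kept and fine-tuned
    for epoch in range(4):
        losses = {}
        random.shuffle(train_set)
        for batch in minibatch(train_set, size=4):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
        print(epoch, losses)

    nlp.to_disk("/models/atticus_bert")
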
Packaging / Serving Model for Use
#################################

You can follow Spacy's excellent instructions `here <https://spacy.io/api/cli#package>`_ to package
the final model into a tarball that can be installed with pip like this::

    pip install local_path_to_tar.tar.gz

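In short, the packaging workflow from those docs looks roughly like this (the model and output
paths are placeholders)::

    # /models/atticus_bert is the trained model directory, /packages is any empty output directory
    python -m spacy package /models/atticus_bert /packages
    cd /packages/en_atticus_classifier_bert-0.1.0
    python setup.py sdist  # the installable .tar.gz ends up in ./dist
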
I've uploaded the package to my public AWS bucket, and you can install it directly from there
like so::

    pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz

Now you can load it just like this::

    nlp = spacy.load('en_atticus_classifier_bert')

I plan to upload this to PyPI as well, so eventually you will be able to do something like this::

    pip install atticus_classifiers_spacy (DOESN'T WORK YET)

Another option is to load the pickled model in the pre-trained folder::

    import pickle
    import spacy

    nlp = pickle.load(open("/path/to/BertClassifier.pickle", "rb"))

    # Then you can use the spacy object just like normal:
    clause = "Test clause"
    cats = nlp(clause).cats
    cats = [label for label in cats if cats[label] > .7]  # If you want to look only at labels with similarity scores over .7
    print(cats)