- ******************************************
- Atticus Legal Clause Classifiers for Spacy
- ******************************************
+ This is still a work in progress and is not meant for public use yet.
+ I am leaving the repository public in case anyone stumbles across it and it
+ saves you some time in preparing your own Atticus classifiers. I am planning
+ to write a blog post / instructions once the performance is slightly better.

- Introduction
- ############
+ I am using LexNLP for its great sentence tokenization functionality. You
+ could probably get decent performance out of NLTK, and Spacy's own splitting is OK.
+ Currently, I train the classifiers, use LexNLP to clean and split the text into
+ sentences / sections / whatever, and then run those chunks through Spacy and
+ check the category labels.
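+
+ As a rough sketch of that workflow (the model package name, the file name and the .7
+ threshold here are placeholders; LexNLP's get_sentence_list helper does the splitting)::
+
+     import spacy
+     from lexnlp.nlp.en.segments.sentences import get_sentence_list
+
+     nlp = spacy.load('en_atticus_classifier_bert')  # placeholder model name
+
+     with open('some_contract.txt') as f:  # placeholder file
+         contract_text = f.read()
+
+     # LexNLP handles cleaning / sentence splitting; Spacy scores each chunk
+     for sentence in get_sentence_list(contract_text):
+         cats = nlp(sentence).cats
+         labels = [label for label in cats if cats[label] > .7]
+         if labels:
+             print(labels, sentence[:80])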

- The `Atticus Project <https://www.atticusprojectai.org/>`_ was recently announced as an initiative
- to, among other things, build a world-class corpus of labelled legal contracts which could be used
- to train and/or benchmark text classifiers and question-answering NLP models. Their initial release
- contains 200 labelled contracts. I wanted to experiment with the dataset and build a working classifier
- that I could use on contract data, so I set out to build a simple project that loads the dataset, converts it
- into a format that Spacy can read, and then trains some classifiers to see how the dataset performs.
- This repository contains the code I used to train classifiers based on 1) Word2Vec embeddings and 2)
- a BERT-based transformer model.
+ To get the full (and excellent) Atticus dataset, go here:
+ https://www.atticusprojectai.org/

- Quickstart - Use the Classifier
- ###############################
-
- If you are in a hurry to test out the classifiers and are not really interested in how they were trained,
- you can currently install the classifier directly from a package I'm hosting on my AWS bucket by typing::
-
-     pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
- This should install Spacy, Spacy-transformers, a BERT model and the classifiers. Once you've installed the
- model, you can use it like this::
-
-     import spacy
-
-     nlp = spacy.load('en_atticus_classifier_bert')
-
-     clause = """The Joint Venturers shall maintain adequate books
-     and records to be kept of all the Joint Venture activities and affairs
-     conducted pursuant to the terms of this Agreement. All direct costs and
-     expenses, which shall include any insurance costs in connection with the
-     distribution of the Products or operations of the Joint Venture, or if the
-     business of the Joint Venture requires additional office facilities than
-     those now presently maintained by each Joint Venturer"""
-
-     cats = nlp(clause).cats
-     cats = [label for label in cats if cats[label] > .7]  # If you want to filter by similarity scores > .7
-     print(cats)  # Show the categories
-
-
- As discussed below, the performance of the model is good enough to be interesting,
- but currently not good enough to really be production ready. I *think* this is primarily
- due to the dataset being relatively small and many clause categories having fewer than 20
- examples. I wanted to release this as-is, however, so others could experiment. As the Atticus
- Project corpus grows, these classifiers should get better. In my experience, 50 - 100 examples
- is typically a good target to aim for, so doubling or tripling the Atticus Corpus will
- hopefully lead to much, much better performance.
-
- Build a Word2Vec-Based Model
- ############################
-
- I first experimented with using Spacy's out-of-the-box Word2Vec models. This approach was very
- quick to train, but the performance was not very good. The F-score was about .6. I also
- tried using a different set of word embeddings released as "Law2Vec", and these improved
- performance marginally to an F-score of ~.64. I've included the code to train these models
- in Word2VecModelBuilder.py. You can simply run that Python script. The default settings
- will load Spacy's en_core_web_lg model and embeddings. You can also load the Law2Vec model
- if you download the vector file::
-
-     wget -O ~/Downloads/Law2Vec.200d.txt https://archive.org/download/Law2Vec/Law2Vec.200d.txt
-
- Then you can use Spacy to convert this file into a Spacy-compatible model like so::
-
-     mkdir /models
-     python -m spacy init-model en /models/Law2VecModel --vectors-loc ~/Downloads/Law2Vec.200d.txt
-
- Then you can change the model argument (per the example above) to '/models/Law2VecModel'.
- You probably want to change the output_dir too. Once you've trained a new model, you can
- load the trained model with spacy.load(output_dir).
-
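- For example (a minimal sketch; '/models/AtticusWord2Vec' stands in for whatever
- output_dir you passed to Word2VecModelBuilder.py)::
-
-     import spacy
-
-     # The converted Law2Vec vectors from the command above (no classifier yet)
-     law2vec = spacy.load('/models/Law2VecModel')
-     print(law2vec.vocab.vectors.shape)  # roughly (n_words, 200) for the 200d vectors
-
-     # A classifier trained by Word2VecModelBuilder.py (path is a placeholder)
-     nlp = spacy.load('/models/AtticusWord2Vec')
-     cats = nlp("Confidential Information shall not include information that ...").cats
-     print([label for label in cats if cats[label] > .7])
-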
- Train a BERT-based Model
- ########################
-
- Overview
- The transformer models encode a lot more contextual information about words than Word2Vec models,
- so I wanted to see if I could squeeze more performance out of the dataset using BERT. The good
- news was that performance increased substantially with a BERT-based model. This is still probably
- not good enough for use in production, but it's good enough to yield some interesting insights,
- particularly if you set your similarity threshold very high.
-
- Training Results
- Using a BERT-based model, the beta release of the Atticus training set yields
- an acceptable (but still not really production-ready) F-score of .735::
+ Using a BERT-based model, the beta release of the Atticus training set yields
+ an acceptable (but still not really production-ready) F-score of .735::

      LOSS    P       R       F
      1.093   0.739   0.472   0.576
@@ -100,134 +29,11 @@ Training Results
      0.219   0.751   0.720   0.735
      0.551   0.751   0.720   0.735

- Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
- graphics card is absolutely recommended. Using an Nvidia 1050 Ti, the above training
- took about three hours.
-
- Step 1 - Sign Up for Atticus Project Data and Download
- I've included the Atticus CSV in the repository for convenience, but you should go to the
- Atticus Project website and sign up there. For one, they would like to collect user and
- contact info for people downloading their dataset. For another, you should go there to make
- sure you get the latest version of their dataset.
-
- Step 2 - Install Python Dependencies and Spacy BERT Model
- First, install the Python dependencies (I'm using LexNLP to tokenize test data; you do not
- need it to build the model)::
-
-     pip install lexnlp spacy spacy-transformers==0.5.2 pandas
-
- Then, download the BERT transformer model::
-
-     python -m spacy download en_trf_bertbaseuncased_lg
-
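- You can sanity-check the install (and whether Spacy sees your GPU) with a couple of lines -
- the pipeline names shown here are what I'd expect from the stock BERT package, so treat
- them as a rough guide::
-
-     import spacy
-
-     print(spacy.prefer_gpu())  # True if a usable CUDA device was found
-
-     nlp = spacy.load('en_trf_bertbaseuncased_lg')
-     print(nlp.pipe_names)  # e.g. ['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']
-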
- Step 3 - Load Atticus Data and Format for Spacy
- The Atticus dataset is a CSV, so we can use Pandas to load and manipulate it. Since
- we're training classifiers and not answering questions, we only care about the columns
- containing text for a given classification. The columns with headers marked "...-Answer"
- are meant for question answering, and we don't want to train on that data. We also don't
- really want the filename column or the document title column, which are the first and
- second columns respectively. The following function will load our Atticus CSV and filter
- out the ...-Answer cols, the filename col and the document title col. Then, it will
- format the data into Spacy's preferred training format and split it into
- two pieces - a training set and an evaluation set. The default split is 80% for
- training and 20% for evaluation.
-
- **Code**::
-
-     import random
-
-     import pandas as pd
-
-
-     def load_atticus_data(filepath='/tmp/aok_beta/Final Publication/master_clauses.csv'):
-         """
-         Load data from the Atticus csv (omitting the answer cols as we want to train classifiers,
-         not question answering).
-
-         Data is returned in the Spacy training format:
-             TRAIN_DATA = [
-                 ("text1", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})
-             ]
-
-         A list of headers is also returned so you can add these labels. FYI, the Filename and Doc name
-         columns are dropped as well.
-         """
-
-         # Load csv
-         atticus_clauses_df = pd.read_csv(filepath)
-
-         # Do a little post-processing
-         data_headers = [h for h in list(atticus_clauses_df.columns) if "Answer" not in h]
-         data_headers.pop(0)  # Drop filename col (index 0 for col 1)
-         data_headers.pop(0)  # Drop doc name (orig col 2 (index 1) but now first col (index 0))
-
-         # Every example starts with a 0 for every label; the matching label is set to 1 below
-         training_values = {i: 0 for i in data_headers}
-         atticus_clauses_data_df = atticus_clauses_df.loc[:, data_headers]
-
-         train_data = []
-
-         # Iterate over csv to build training data dict
-         for header in atticus_clauses_data_df.columns:
-             for row in atticus_clauses_data_df[[header]].iterrows():
-                 value = row[1][header]
-                 if not pd.isnull(value):
-                     train_data.append((value, {'cats': {**training_values, header: 1}}))
-
-         return train_data, data_headers
-
-
-     def create_training_set(train_data=[{}], limit=0, split=0.8):
-         """Shuffle the Atticus examples and split off a held-out evaluation set."""
-         random.shuffle(train_data)
-         train_data = train_data[-limit:]  # limit=0 keeps everything (list[-0:] is the whole list)
-
-         texts, labels = zip(*train_data)
-         split = int(len(train_data) * split)
-
-         # Return data in format that matches example here:
-         # https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
-         return (texts[:split], labels[:split]), (texts[split:], labels[split:])
-
-
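- Putting the two together (using the default CSV path from above) looks roughly like this::
-
-     train_data, labels = load_atticus_data()
-
-     # 80/20 split into a training set and an evaluation set
-     (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data)
-
-     print(len(labels), "labels")
-     print(len(train_texts), "training examples,", len(dev_texts), "evaluation examples")
-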
- Step 4 - Build the Model
- *WARNING - running the training takes a long time, even if you have a CUDA-compatible
- graphics card and it's properly configured in your environment.*
-
- You can just run BertModelBuilder.py with default settings. On my Nvidia 1050 Ti, it took
- about 3 - 4 hours to run the training. Unless you're adding additional data, I'd suggest you
- just use my pre-built models.
-
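- If you're curious what BertModelBuilder.py does under the hood, the core of it is a fairly
- standard spacy-transformers text classification loop over the data prepared in Step 3.
- Roughly (a simplified sketch rather than the actual script - the epoch count, batch size
- and output path are placeholders)::
-
-     import random
-
-     import spacy
-     import torch
-     from spacy.util import minibatch
-
-     if spacy.prefer_gpu():
-         torch.set_default_tensor_type("torch.cuda.FloatTensor")
-
-     nlp = spacy.load("en_trf_bertbaseuncased_lg")
-
-     # One non-exclusive label per Atticus clause category
-     textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})
-     train_data, labels = load_atticus_data()
-     for label in labels:
-         textcat.add_label(label)
-     nlp.add_pipe(textcat, last=True)
-
-     (train_texts, train_cats), (dev_texts, dev_cats) = create_training_set(train_data)
-
-     optimizer = nlp.resume_training()
-     for epoch in range(10):  # placeholder epoch count
-         losses = {}
-         examples = list(zip(train_texts, train_cats))
-         random.shuffle(examples)
-         for batch in minibatch(examples, size=8):  # placeholder batch size
-             texts, annotations = zip(*batch)
-             nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
-         print(epoch, losses)
-
-     nlp.to_disk("/models/AtticusBert")  # placeholder output path
-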
- Packaging / Serving Model for Use
- #################################
-
- You can follow Spacy's excellent instructions `here <https://spacy.io/api/cli#package>`_
- to package up the final model into a tar that can be installed with pip like this::
-
-     pip install local_path_to_tar.tar.gz
-
- I've uploaded the package to my public AWS bucket, and you can install directly from there
- like so::
-
-     pip install https://jsv4public.s3.amazonaws.com/en_atticus_classifier_bert-0.1.0.tar.gz
-
- Now you can load it just like this::
-
-     nlp = spacy.load('en_atticus_classifier_bert')
-
- I plan to upload this to PyPI as well so you can just do something like this::
-
-     pip install atticus_classifiers_spacy (DOESN'T WORK YET)
-
- Another option is to load the pickled model in the pre-trained folder::
-
-     import pickle
-     import spacy
-
-     nlp = pickle.load(open("/path/to/BertClassifier.pickle", "rb"))
-
-     # Then you can use the spacy object just like normal:
-     clause = "Test clause"
-     cats = nlp(clause).cats
-     cats = [label for label in cats if cats[label] > .7]  # If you want to look only at labels with similarity scores over .7
-     print(cats)
+ Training the BERT-based model takes a lot more computing power, and a CUDA-compatible
+ graphics card is absolutely recommended. Using an Nvidia 1050 Ti, the above training
+ took about three hours.

+ Training older, Word2Vec-based models yields less promising results but is much,
+ much faster to train. Spacy's en_core_web_lg model yields an F-score of around .6.
+ Using Word2Vec embeddings trained on legal data (such as Law2Vec) yields a slightly
+ better F-score of around .635. Neither is as good as the BERT-based approach, however.