NER with BERT

Two scripts for training and testing BERT-like pretrained models on Named Entity Recognition (NER) tasks.

  • Train (finetune) a BERT-based NER model (from Hugging Face or stored locally on your computer) on a dataset in CoNLL-like format.
  • Use the trained model to predict NER tags on test data, producing an evaluation report, or on new data, producing a tagged version of the data.

The scripts handle sequences longer than the model's maximum input length by splitting them into multiple chunks that share a context overlap.
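
As an illustration, a chunking strategy of this kind can be sketched as follows (chunk_with_overlap is a hypothetical helper written for this README, not the scripts' actual code):

def chunk_with_overlap(tokens, max_len, overlap):
    # Split a token sequence into chunks of at most max_len tokens;
    # consecutive chunks share `overlap` tokens of context.
    # e.g. chunk_with_overlap(list(range(10)), 6, 2) -> [[0..5], [4..9]]
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks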

Data Format

The input data should follow a CoNLL-like format:

  • Each line contains a token in the first position and its corresponding tag in the last position, separated by a space.
  • Sentences are separated by blank lines.

Example:

Barack B-PER 
Obama I-PER 
was O 
born O 
in O 
Hawaii B-LOC
. O
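
A minimal reader for this format might look like the following sketch (read_conll is an illustrative helper, not part of the repository):

def read_conll(path):
    # Parse a CoNLL-like file: token in the first field, tag in the
    # last field, blank lines separating sentences.
    sentences, tokens, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            fields = line.split()
            tokens.append(fields[0])
            tags.append(fields[-1])
    if tokens:
        sentences.append((tokens, tags))
    return sentences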

Script Arguments

Training Script (train_ner_tagger.py)

usage: train_ner_tagger.py [-h] --data_dir DATA_DIR [--data_file_suffix DATA_FILE_SUFFIX] --model_dir MODEL_DIR [--model_name MODEL_NAME] [--k K] [--skip_cv] [--skip_train] [--learning_rate LEARNING_RATE]
                           [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE] [--num_train_epochs NUM_TRAIN_EPOCHS] [--weight_decay WEIGHT_DECAY]

Train a token classifier using Hugging Face models.

options:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Directory containing training data files.
  --data_file_suffix DATA_FILE_SUFFIX
                        Extension of training data files (default=.txt).
  --model_dir MODEL_DIR
                        Directory to save the final trained model and/or results.
  --model_name MODEL_NAME
                        Name of a pretrained Hugging Face model or path to a local model to use (default=bert-base-cased).
  --k K                 Number of folds for cross-validation (default=10).
  --skip_cv             Do not do cross-validation (default=do cv).
  --skip_train          Do not train on the whole training data (default=do training).
  --learning_rate LEARNING_RATE
                        Learning rate for the optimizer (default: 2e-5).
  --per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                        Batch size for training (default: 32).
  --per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                        Batch size for evaluation (default: 32).
  --num_train_epochs NUM_TRAIN_EPOCHS
                        Number of training epochs (default: 10).
  --weight_decay WEIGHT_DECAY
                        Weight decay for optimizer (default: 0.01).
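
These options map onto the standard Hugging Face Trainer recipe for token classification. The following is a minimal, self-contained sketch of that recipe, not the script's actual code; the reduced label set and the single-sentence in-memory dataset are stand-ins for illustration:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ['B-LOC', 'B-PER', 'I-PER', 'O']   # reduced set for the sketch
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased', num_labels=len(labels))

def encode(tokens, tags):
    # Tokenize pre-split words and align word-level tags to subwords;
    # special tokens and padding get -100 so the loss ignores them.
    enc = tokenizer(tokens, is_split_into_words=True,
                    truncation=True, padding='max_length', max_length=32)
    enc['labels'] = [label2id[tags[w]] if w is not None else -100
                     for w in enc.word_ids()]
    return enc

train_data = [encode(['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', '.'],
                     ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O'])]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='models/sketch', learning_rate=2e-5,
                           per_device_train_batch_size=32,
                           num_train_epochs=10, weight_decay=0.01),
    train_dataset=train_data)
trainer.train()
trainer.save_model('models/sketch')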

Testing Script (apply_ner_tagger.py)

usage: apply_ner_tagger.py [-h] --model_name MODEL_NAME --data_dir DATA_DIR [--data_file_suffix DATA_FILE_SUFFIX] [--ignore_unknown_tags] --output_dir OUTPUT_DIR [--overlap OVERLAP]

Predict tags for input text files using a trained token classification model.

options:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        Path to a directory containing the trained model or name of a Hugging Face model.
  --data_dir DATA_DIR   Directory containing input files to be tagged.
  --data_file_suffix DATA_FILE_SUFFIX
                        Extension of input files.
  --ignore_unknown_tags
                        Ignore tags in input files that are not defined in the model (default=throw error).
  --output_dir OUTPUT_DIR
                        Directory to save the output files with predicted tags.
  --overlap OVERLAP     Context length for sequences longer than model max length (default=50)
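
When chunks overlap, predictions for tokens in the shared region must be reconciled. One simple policy, sketched below under the assumption that all chunks except possibly the last are full length, is to split each overlap region in the middle (this is an illustration, not necessarily the exact policy implemented in apply_ner_tagger.py):

def merge_chunk_predictions(chunk_preds, max_len, overlap):
    # chunk_preds: per-chunk tag lists, for chunks produced with stride
    # max_len - overlap. In each overlap region, keep the earlier chunk's
    # tags for the first half and the later chunk's tags for the rest.
    step = max_len - overlap
    merged = list(chunk_preds[0])
    for i, preds in enumerate(chunk_preds[1:], start=1):
        keep_from = overlap // 2
        merged = merged[:i * step + keep_from] + preds[keep_from:]
    return merged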

Example of use

This is an example of training and testing a BERT NER model on the CoNLL++ dataset.

Training a NER model

The following command trains a BERT model on the CoNLL++ dataset:

CUDA_VISIBLE_DEVICES=3 python train_ner_tagger.py \
--data_dir ./data \
--data_file_suffix train.txt \
--model_dir models/bert_conllpp

Example output:

204567 tokens, 14987 sentences read from 1 files.
Training fold 1...
{'loss': 1.4032, 'grad_norm': 3.16690731048584, 'learning_rate': 1.9952606635071093e-05, 'epoch': 0.02}
{'loss': 0.6847, 'grad_norm': 1.0225292444229126, 'learning_rate': 1.990521327014218e-05, 'epoch': 0.05}

[... 10 fold validation ...]

{'loss': 0.0008, 'grad_norm': 0.01256786659359932, 'learning_rate': 9.478672985781992e-08, 'epoch': 9.95}
{'loss': 0.0029, 'grad_norm': 0.05868620052933693, 'learning_rate': 4.739336492890996e-08, 'epoch': 9.98}
{'loss': 0.0009, 'grad_norm': 0.016589513048529625, 'learning_rate': 0.0, 'epoch': 10.0}
{'eval_loss': 0.033828623592853546, 'eval_runtime': 0.5779, 'eval_samples_per_second': 2592.008, 'eval_steps_per_second': 81.325, 'epoch': 10.0}
{'train_runtime': 210.2245, 'train_samples_per_second': 641.647, 'train_steps_per_second': 20.074, 'train_loss': 0.02340272987770111, 'epoch': 10.0}
100%|██████████| 4220/4220 [03:30<00:00, 20.07it/s]
100%|██████████| 47/47 [00:00<00:00, 85.92it/s]
Metrics for fold 10: {'eval_loss': 0.027988383546471596, 'eval_runtime': 0.556, 'eval_samples_per_second': 2694.247, 'eval_steps_per_second': 84.532, 'epoch': 10.0}
Confusion Matrix for fold 10 (labels=['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']):
[[680, 5, 5, 3, 0, 0, 0, 0, 5], [6, 326, 5, 1, 0, 4, 0, 0, 7], [9, 6, 594, 5, 0, 0, 4, 0, 7], [1, 2, 5, 638, 0, 0, 0, 2, 7], [0, 0, 0, 0, 100, 2, 4, 1, 2], [0, 2, 0, 0, 3, 100, 0, 0, 3], [0, 0, 0, 0, 2, 3, 333, 2, 3], [0, 0, 0, 0, 0, 1, 1, 445, 2], [2, 8, 4, 1, 0, 7, 5, 0, 16842]]
Cross-validation report saved to models/bert_conllpp/cross_validation_report.json
Retraining on the entire dataset...
{'loss': 0.1346, 'grad_norm': 0.4300181269645691, 'learning_rate': 1.7867803837953093e-05, 'epoch': 1.07}                                                                                                            

[... final training on the whole training dataset ...]

{'train_runtime': 199.9011, 'train_samples_per_second': 749.721, 'train_steps_per_second': 23.462, 'train_loss': 0.021958918129203163, 'epoch': 10.0}                                                                                                  
100%|██████████| 4690/4690 [03:19<00:00, 23.46it/s]
Final model and tokenizer saved to models/bert_conllpp

The finetuned model is saved in models/bert_conllpp, and the cross_validation_report.json file in that directory reports the cross-validation results:

cat models/bert_conllpp/cross_validation_report.json

{
  "fold_metrics": [
    {
      "fold": 1,
      "metrics": {
        "eval_loss": 0.03650137037038803,
        "eval_runtime": 1.0708,
        "eval_samples_per_second": 1399.861,
        "eval_steps_per_second": 43.892,
        "epoch": 10.0
      },
      "confusion_matrix": [
        [ 668, 5, 6, 3, 0, 0, 1, 1, 2],
        [ 1, 326, 7, 1, 0, 4, 0, 0, 7],
        [ 14, 5, 580, 4, 0, 0, 3, 0, 8],
        [ 2, 2, 7, 657, 1, 0, 1, 1, 7],
        [ 0, 0, 0, 0, 102, 0, 8, 0, 4],
        [ 1, 3, 0, 0, 0, 121, 3, 1, 8],
        [ 0, 0, 1, 0, 2, 9, 330, 0, 9],
        [ 0, 0, 0, 3, 0, 0, 2, 444, 0],
        [ 4, 7, 3, 1, 1, 5, 8, 4, 17963]
      ],
      "labels": [ "B-LOC", "B-MISC", "B-ORG", "B-PER", "I-LOC", "I-MISC", "I-ORG", "I-PER", "O"]
    },

   [... fold metrics repeated for all the folds ...]

  ],
  "average_metrics": {
    "eval_loss": 0.036608549393713476,
    "eval_runtime": 0.7434000000000001,
    "eval_samples_per_second": 2117.7788,
    "eval_steps_per_second": 66.4155,
    "epoch": 10.0
  },
  "aggregated_confusion_matrix": [
    [ 6956, 52, 61, 26, 4, 0, 8, 1, 32],
    [ 43, 3203, 64, 17, 0, 34, 3, 1, 73],
    [ 106, 80, 5963, 64, 1, 1, 24, 0, 82],
    [ 22, 20, 59, 6433, 1, 1, 2, 15, 47],
    [ 5, 1, 1, 0, 1066, 16, 40, 7, 21],
    [ 4, 24, 1, 0, 6, 1022, 35, 4, 59],
    [ 6, 0, 17, 1, 28, 34, 3527, 16, 75],
    [ 1, 0, 0, 11, 5, 5, 16, 4480, 10],
    [ 33, 100, 82, 24, 17, 76, 78, 8, 170106]
  ],
  "labels": [ "B-LOC", "B-MISC", "B-ORG", "B-PER", "I-LOC", "I-MISC", "I-ORG", "I-PER", "O"]
}
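
Per-label precision and recall can be recomputed directly from the aggregated confusion matrix in the report; a short sketch, assuming the usual convention of rows as true labels and columns as predictions:

import json

with open('models/bert_conllpp/cross_validation_report.json') as f:
    report = json.load(f)

cm = report['aggregated_confusion_matrix']
for i, label in enumerate(report['labels']):
    tp = cm[i][i]
    support = sum(cm[i])                   # tokens truly labeled `label`
    predicted = sum(row[i] for row in cm)  # tokens predicted as `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    print(f'{label}: P={precision:.3f} R={recall:.3f} (support={support})')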

Testing the NER model

The following command tests the trained model on a test dataset and writes predictions to an output directory:

CUDA_VISIBLE_DEVICES=3 python apply_ner_tagger.py \
--data_dir ./data \
--data_file_suffix test.txt \
--model_name models/bert_conllpp \
--output_dir ./output_conllpp_annotated

Example output:

46666 tokens, 3684 sentences read from 1 files.
100%|██████████| 231/231 [00:01<00:00, 121.99it/s]

Labels:
['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']

Confusion Matrix:
 [[ 1575    18    28     1     1     1     9     0    13]
 [   19   623    36     9     0     6     1     0    29]
 [   44    37  1579    21     0     0     7     0    27]
 [   10     2    27  1567     0     0     0     0    12]
 [    1     0     1     2   244     1     6     0     4]
 [    1     8     1     0     5   182    19     1    37]
 [    1     0     4     1    20    14   807     3    32]
 [    0     0     0     2     1     1     2  1155     0]
 [   10    28    48    10     4    36    30     1 38241]]

Classification Report:
               precision    recall  f1-score   support

       B-LOC       0.95      0.96      0.95      1646
      B-MISC       0.87      0.86      0.87       723
       B-ORG       0.92      0.92      0.92      1715
       B-PER       0.97      0.97      0.97      1618
       I-LOC       0.89      0.94      0.91       259
      I-MISC       0.76      0.72      0.74       254
       I-ORG       0.92      0.91      0.92       882
       I-PER       1.00      0.99      1.00      1161
           O       1.00      1.00      1.00     38408

    accuracy                           0.99     46666
   macro avg       0.92      0.92      0.92     46666
weighted avg       0.99      0.99      0.99     46666
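
The saved model can also be loaded outside these scripts with the standard Hugging Face pipeline API for quick inference on raw text; a sketch, assuming the saved config carries the label names:

from transformers import pipeline

# aggregation_strategy='simple' groups subword pieces into whole entities
ner = pipeline('token-classification',
               model='models/bert_conllpp',
               aggregation_strategy='simple')
print(ner('Barack Obama was born in Hawaii.'))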

License

See LICENSE
