Updates to the transformers conf docs and yaml file (#1467)
omri374 authored Oct 13, 2024
1 parent 13ae328 commit 21361f9
Showing 2 changed files with 96 additions and 25 deletions.
118 changes: 94 additions & 24 deletions docs/analyzer/nlp_engines/transformers.md
@@ -55,7 +55,75 @@ Then, also download a spaCy pipeline/model:
python -m spacy download en_core_web_sm
```

#### Creating a configuration file

### Configuring the NER pipeline

Once the models are downloaded, configure them either in Python or in a YAML configuration file.
Either way, the configuration must contain both a `spaCy` pipeline name and a transformers model name.
It can also include options that control how the results of the transformers model are parsed.

#### Configuring the NER pipeline via code

Example configuration in Python:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, TransformersNlpEngine

# Transformer model config
model_config = [
    {
        "lang_code": "en",
        "model_name": {
            "spacy": "en_core_web_sm",  # for tokenization, lemmatization
            "transformers": "StanfordAIMI/stanford-deidentifier-base",  # for NER
        },
    }
]

# Entity mappings between the model's and Presidio's
mapping = dict(
    PER="PERSON",
    LOC="LOCATION",
    ORG="ORGANIZATION",
    AGE="AGE",
    ID="ID",
    EMAIL="EMAIL",
    DATE="DATE_TIME",
    PHONE="PHONE_NUMBER",
    PERSON="PERSON",
    LOCATION="LOCATION",
    GPE="LOCATION",
    ORGANIZATION="ORGANIZATION",
    NORP="NRP",
    PATIENT="PERSON",
    STAFF="PERSON",
    HOSP="LOCATION",
    PATORG="ORGANIZATION",
    TIME="DATE_TIME",
    HCW="PERSON",
    HOSPITAL="LOCATION",
    FACILITY="LOCATION",
    VENDOR="ORGANIZATION",
)

# Model labels that should never be returned as entities
labels_to_ignore = ["O"]

ner_model_configuration = NerModelConfiguration(
    model_to_presidio_entity_mapping=mapping,
    alignment_mode="expand",  # "strict", "contract", "expand"
    aggregation_strategy="max",  # "simple", "first", "average", "max"
    labels_to_ignore=labels_to_ignore,
)

transformers_nlp_engine = TransformersNlpEngine(
    models=model_config,
    ner_model_configuration=ner_model_configuration,
)

# Transformer-based analyzer
analyzer = AnalyzerEngine(
    nlp_engine=transformers_nlp_engine,
    supported_languages=["en"],
)
```
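
To make the role of `model_to_presidio_entity_mapping` and `labels_to_ignore` more concrete, here is a minimal, library-free sketch of the translation step. It is an illustration only, not Presidio's internal implementation: the helper `to_presidio_entity` is hypothetical, the mapping is a small subset of the one above, and passing unmapped labels through unchanged is an assumption.

```python
from typing import Optional

# Hypothetical subset of the mapping used in the configuration above
MAPPING = {"PER": "PERSON", "HOSPITAL": "LOCATION", "DATE": "DATE_TIME"}
LABELS_TO_IGNORE = {"O"}


def to_presidio_entity(model_label: str) -> Optional[str]:
    """Translate a model label to a Presidio entity type, or None if ignored."""
    if model_label in LABELS_TO_IGNORE:
        return None
    # Assumption: labels without a mapping entry are passed through unchanged.
    return MAPPING.get(model_label, model_label)


labels = ["PER", "O", "HOSPITAL", "DATE"]
entities = [e for e in map(to_presidio_entity, labels) if e is not None]
print(entities)  # ['PERSON', 'LOCATION', 'DATE_TIME']
```

With this commit, `HOSPITAL` resolves to `LOCATION` rather than `ORGANIZATION`, which is the behavior the sketch reflects.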

#### Creating a YAML configuration file

Once the models are downloaded, one option to configure them is to create a YAML configuration file.
Note that the configuration needs to contain both a `spaCy` pipeline name and a transformers model name.
@@ -75,9 +143,9 @@ models:
ner_model_configuration:
labels_to_ignore:
- O
aggregation_strategy: simple # "simple", "first", "average", "max"
aggregation_strategy: max # "simple", "first", "average", "max"
stride: 16
alignment_mode: strict # "strict", "contract", "expand"
alignment_mode: expand # "strict", "contract", "expand"
model_to_presidio_entity_mapping:
PER: PERSON
LOC: LOCATION
@@ -92,33 +160,15 @@ ner_model_configuration:
DATE: DATE_TIME
PHONE: PHONE_NUMBER
HCW: PERSON
HOSPITAL: ORGANIZATION
HOSPITAL: LOCATION
VENDOR: ORGANIZATION

low_confidence_score_multiplier: 0.4
low_score_entity_names:
- ID
```
Where:
- `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`.
- The `model_name.transformers` is the full path for a huggingface model. Models can be found on [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`

The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model.
- `stride`: The value is the length of the window overlap in transformer tokenizer tokens.
- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.

See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for more information.

#### Calling the new model
##### Calling the new model
Once the configuration file is created, it can be used to create a new `TransformersNlpEngine`:

@@ -143,6 +193,26 @@ Once the configuration file is created, it can be used to create a new `Transfor
print(results_english)
```

#### Explaining the configuration options

- `model_name.spacy` is the name of a spaCy model/pipeline that wraps the transformers NER model. For example, `en_core_web_sm`.
- `model_name.transformers` is the full path of a Hugging Face model. Models can be found on the [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`.

The `ner_model_configuration` section contains the following parameters:

- `labels_to_ignore`: A list of labels to ignore. For example, `O` (no entity) or entities you are not interested in returning.
- `aggregation_strategy`: The strategy to use when aggregating the results of the transformers model.
- `stride`: The length of the window overlap, in transformer tokenizer tokens.
- `alignment_mode`: The strategy to use when aligning the results of the transformers model to the original text.
- `model_to_presidio_entity_mapping`: A mapping between the transformers model labels and the Presidio entity types.
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.
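
As an illustration of `aggregation_strategy`, the sketch below resolves one word that the tokenizer split into several sub-tokens, following the way the Hugging Face token-classification pipeline describes `first`, `average`, and `max`. It is a simplified sketch, not the pipeline's code: the function `aggregate_word` and the scores are hypothetical, and `simple` (which groups tokens without word-level resolution) is left out.

```python
from typing import Dict, List, Tuple


def aggregate_word(
    token_scores: List[Dict[str, float]], strategy: str
) -> Tuple[str, float]:
    """Pick one (label, score) for a word from its sub-token label scores."""
    if strategy == "first":
        # Use the scores of the word's first sub-token only
        chosen = token_scores[0]
    elif strategy == "max":
        # Use the sub-token that holds the single highest score
        chosen = max(token_scores, key=lambda scores: max(scores.values()))
    elif strategy == "average":
        # Average each label's score across sub-tokens
        count = len(token_scores)
        chosen = {
            label: sum(scores[label] for scores in token_scores) / count
            for label in token_scores[0]
        }
    else:
        raise ValueError(f"Unsupported strategy: {strategy}")
    label = max(chosen, key=lambda name: chosen[name])
    return label, chosen[label]


# Hypothetical scores for one word split into two sub-tokens
word = [{"PER": 0.6, "ORG": 0.4}, {"PER": 0.1, "ORG": 0.9}]
for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(word, strategy))
```

In this made-up case `first` keeps `PER`, while `average` and `max` both select `ORG`, which shows why the choice of strategy can change the returned entity type.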

For more information on these parameters, see the [spacy-huggingface-pipelines GitHub repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).
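
The three `alignment_mode` values mirror the semantics of spaCy's `Doc.char_span`. The following library-free sketch shows the idea on character offsets; the function `align` and the example offsets are hypothetical and only approximate what spaCy does.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def align(span: Span, tokens: List[Span], mode: str) -> Optional[Span]:
    """Snap a predicted character span to token boundaries."""
    start, end = span
    if mode == "strict":
        # Keep the span only if both ends already sit on token boundaries
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        return span if start in starts and end in ends else None
    if mode == "contract":
        # Keep only tokens that lie fully inside the span
        kept = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the span at all
        kept = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"Unsupported alignment mode: {mode}")
    return (kept[0][0], kept[-1][1]) if kept else None


# Token offsets for the text "John Smith called"
tokens = [(0, 4), (5, 10), (11, 17)]
predicted = (2, 10)  # the model predicted "hn Smith"

print(align(predicted, tokens, "strict"))    # None
print(align(predicted, tokens, "contract"))  # (5, 10)
print(align(predicted, tokens, "expand"))    # (0, 10)
```

With `expand`, a prediction that starts mid-token still yields the full name, whereas `strict` drops it entirely.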

Once the configuration file is created, see [the NLP configuration documentation](../customizing_nlp_models.md#Configure-Presidio-to-use-the-new-model) for how to configure Presidio to use the new model.
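
The effect of `low_confidence_score_multiplier` and `low_score_entity_names` can be sketched as a simple score adjustment. This is a hypothetical illustration using the values from the YAML example (`0.4` and `ID`); the helper `adjust_score` is not part of Presidio's API.

```python
# Values taken from the YAML example in this document
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4
LOW_SCORE_ENTITY_NAMES = {"ID"}


def adjust_score(entity_type: str, score: float) -> float:
    """Scale down the score of entity types flagged as low confidence."""
    if entity_type in LOW_SCORE_ENTITY_NAMES:
        return score * LOW_CONFIDENCE_SCORE_MULTIPLIER
    return score


print(adjust_score("ID", 0.5))      # 0.2
print(adjust_score("PERSON", 0.5))  # 0.5
```

Down-weighting noisy entity types this way lets downstream score thresholds filter them out without removing the entity from the mapping.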


### Training your own model

!!! note "Note"
3 changes: 2 additions & 1 deletion presidio-analyzer/presidio_analyzer/conf/transformers.yaml
@@ -36,8 +36,9 @@ ner_model_configuration:
TIME: DATE_TIME
PHONE: PHONE_NUMBER
HCW: PERSON
HOSPITAL: ORGANIZATION
HOSPITAL: LOCATION
FACILITY: LOCATION
VENDOR: ORGANIZATION

low_confidence_score_multiplier: 0.4
low_score_entity_names: