HunFlair

HunFlair is a state-of-the-art NER tagger for biomedical texts. It comes with models for genes/proteins, chemicals, diseases, species and cell lines. HunFlair builds on pretrained domain-specific language models and outperforms other biomedical NER tools on unseen corpora. Furthermore, it contains harmonized versions of 31 biomedical NER data sets and comes with a Flair language model ("pubmed-X") and FastText embeddings ("pubmed") that were trained on roughly 3 million full texts and about 25 million abstracts from the biomedical domain.

Content: Quick Start | BioNER-Tool Comparison | Tutorials | Citing HunFlair

Quick Start

Requirements and Installation

HunFlair is based on Flair 0.6+ and Python 3.6+. If you do not have Python 3.6, install it first. Here is how for Ubuntu 16.04. Then, in your favorite virtual environment, simply do:

pip install flair

Example 1: Biomedical NER

Let's run named entity recognition (NER) over an example sentence. All you need to do is make a Sentence, load a pre-trained model and use it to predict tags for the sentence:

from flair.data import Sentence
from flair.nn import Classifier

# make a sentence 
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load biomedical tagger
tagger = Classifier.load("hunflair")

# tag sentence
tagger.predict(sentence)

Done! The Sentence now has entity annotations. Let's print the entities found by the tagger:

for entity in sentence.get_labels():
    print(entity)

This should print:

Span[0:2]: "Behavioral abnormalities" → Disease (0.6736)
Span[9:12]: "Fragile X Syndrome" → Disease (0.99)
Span[4:5]: "Fmr1" → Gene (0.838)
Span[6:7]: "Mouse" → Species (0.9979)
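If you want to work with the predictions programmatically rather than print them, you can group them by entity type. The following is a minimal sketch, not part of HunFlair itself: `group_entities_by_type` and `min_score` are hypothetical names, and the code assumes each label exposes `value`, `score` and `data_point.text`, as the label objects returned by `sentence.get_labels()` do in recent Flair versions. Stand-in objects are used so that the sketch runs without downloading a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities_by_type(labels, min_score=0.0):
    """Group labels by entity type, keeping (text, score) pairs.

    Assumes each label has .value, .score and .data_point.text
    (an assumption about the Flair label interface).
    """
    grouped = defaultdict(list)
    for label in labels:
        if label.score >= min_score:
            grouped[label.value].append((label.data_point.text, label.score))
    return dict(grouped)


def make_label(text, value, score):
    """Build a stand-in object that mimics the assumed label interface."""
    return SimpleNamespace(value=value, score=score,
                           data_point=SimpleNamespace(text=text))


# Stand-ins mirroring the example output above; with a real tagger you
# would pass sentence.get_labels() instead.
demo_labels = [
    make_label("Behavioral abnormalities", "Disease", 0.6736),
    make_label("Fragile X Syndrome", "Disease", 0.99),
    make_label("Fmr1", "Gene", 0.838),
    make_label("Mouse", "Species", 0.9979),
]

# Keep only predictions with a confidence of at least 0.8
for entity_type, mentions in group_entities_by_type(demo_labels, min_score=0.8).items():
    print(entity_type, mentions)
```

Filtering by score is optional; whether a threshold such as 0.8 is appropriate depends on your application.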

Example 2: Biomedical NER with Better Tokenization

Scientific texts are difficult to tokenize. For this reason, we recommend installing SciSpaCy for improved pre-processing and tokenization of scientific / biomedical texts:

pip install scispacy==0.5.1
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Use this code to apply scientific tokenization:

from flair.data import Sentence
from flair.nn import Classifier
from flair.tokenization import SciSpacyTokenizer

# make a sentence and tokenize with SciSpaCy
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
                    use_tokenizer=SciSpacyTokenizer())

# load biomedical tagger
tagger = Classifier.load("hunflair")

# tag sentence
tagger.predict(sentence)

Comparison to other biomedical NER tools

Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets. However, they are applied "in the wild" to a much larger collection of texts, often varying in topic, entity distribution, genre (e.g. patents vs. scientific articles) and text type (e.g. abstract vs. full text), which can lead to severe drops in performance.

HunFlair outperforms other biomedical NER tools on corpora that were not used to train either HunFlair or any of the competitor tools.

Corpus	Entity Type	Misc¹	SciSpaCy	HUNER	HunFlair
CRAFT v4.0	Chemical	42.88	35.73	42.99	59.83
	Gene/Protein	64.93	47.76	50.77	73.51
	Species	81.15	54.21	84.45	85.04
BioNLP 2013 CG	Chemical	72.15	58.43	67.37	81.82
	Disease	55.64	56.48	55.32	65.07
	Gene/Protein	68.97	66.18	71.22	87.71
	Species	80.53	57.11	67.84	76.41
Plant-Disease	Species	80.63	75.90	73.64	83.44

All results are F1 scores using partial matching of predicted text offsets with the original character offsets of the gold standard data. We allow a shift of at most one character.

1: Misc displays the results of multiple taggers: tmChem for Chemical, GNormPlus for Gene and Species, and DNorm for Disease.
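To make the evaluation criterion more concrete, here is a rough sketch of an F1 computation with offset tolerance. It is not the evaluation script used for the table. It assumes that a prediction counts as a true positive when its entity type equals that of a gold span and its start and end offsets each differ by at most one character; the exact matching rule used for the table may differ in detail. The function names and the `(start, end, type)` tuple layout are hypothetical.

```python
def spans_match(pred, gold, max_shift=1):
    """True if types agree and both offsets differ by at most max_shift."""
    p_start, p_end, p_type = pred
    g_start, g_end, g_type = gold
    return (p_type == g_type
            and abs(p_start - g_start) <= max_shift
            and abs(p_end - g_end) <= max_shift)


def partial_match_f1(predicted, gold, max_shift=1):
    """F1 over (start, end, type) spans; each gold span is matched at most once."""
    unmatched_gold = list(gold)
    true_positives = 0
    for pred in predicted:
        for candidate in unmatched_gold:
            if spans_match(pred, candidate, max_shift):
                unmatched_gold.remove(candidate)
                true_positives += 1
                break  # stop after the first match for this prediction
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative character offsets, not real evaluation data
gold_spans = [(0, 24, "Disease"), (32, 36, "Gene"), (41, 46, "Species")]
predicted_spans = [
    (0, 24, "Disease"),   # exact match
    (32, 37, "Gene"),     # end shifted by one character: still counted
    (40, 50, "Species"),  # end off by four characters: not counted
]

print(round(partial_match_f1(predicted_spans, gold_spans), 4))
```

In this toy case two of three predictions match, so precision and recall are both 2/3 and the F1 score is 2/3.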

Here's how to reproduce these numbers using Flair. You can find detailed evaluations and discussions in our paper.

Tutorials

We provide a set of quick tutorials to get you started with HunFlair:

Tutorial 1: Tagging
Tutorial 2: Training biomedical NER models

Citing HunFlair

Please cite the following paper when using HunFlair:

@article{weber2021hunflair,
  title={HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition},
  author={Weber, Leon and S{\"a}nger, Mario and M{\"u}nchmeyer, Jannes and Habibi, Maryam and Leser, Ulf and Akbik, Alan},
  journal={Bioinformatics},
  volume={37},
  number={17},
  pages={2792--2794},
  year={2021},
  publisher={Oxford University Press}
}
