This example project uses the popular CoNLL 2002 dataset. The CSV file consists of multiple rows, each containing a word and its corresponding tag. Multiple rows together form a single sentence.
The dataset contains the following entity tags:
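To make the row-per-word layout concrete, here is a minimal sketch that groups rows back into sentences. The column names `sentence_id`, `word`, and `tag` and the sample rows are assumptions for illustration only; check the header of the actual CSV before reusing this.

```python
import csv
import io

# Hypothetical sample data. The column names are assumptions and may not
# match the header of the real ner.csv file.
SAMPLE_CSV = """sentence_id,word,tag
1,London,B-geo
1,on,O
1,Monday,B-tim
1,evening,I-tim
2,Paris,B-geo
2,calling,O
"""


def group_sentences(csv_file):
    """Group word-level rows into sentences of (word, tag) pairs."""
    sentences = {}
    for row in csv.DictReader(csv_file):
        # Rows that share a sentence id belong to the same sentence.
        sentences.setdefault(row["sentence_id"], []).append(
            (row["word"], row["tag"])
        )
    # Dicts preserve insertion order, so sentences come back in file order.
    return list(sentences.values())


sentences = group_sentences(io.StringIO(SAMPLE_CSV))
for sentence in sentences:
    print(sentence)
```

Each sentence ends up as a list of `(word, tag)` pairs, which is a convenient shape for most sequence-labeling code.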
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
Each tag is defined in IOB format. IOB (short for inside, outside, beginning) is a common format for tagging tokens.
B - indicates that the token is the beginning of an entity
I - indicates that the token is inside an entity
O - indicates that the token is outside of any entity (not annotated)
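One way to read IOB tags back into entities is sketched below. The helper `extract_entities` is hypothetical and not part of the example project. It assumes every tag is either `O` or has the form `B-<type>` / `I-<type>`, and it is run on the same sentence as the example that follows.

```python
def extract_entities(tokens, tags):
    """Collapse IOB-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag starts a new entity. An I- tag whose type does not
            # match the open entity is also treated as a new start.
            flush()
            current_words, current_type = [token], entity_type
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_words.append(token)
        else:
            # An O tag closes any open entity.
            flush()
            current_words, current_type = [], None

    flush()
    return entities


tokens = ["London", "on", "Monday", "evening"]
tags = ["B-geo", "O", "B-tim", "I-tim"]
entities = extract_entities(tokens, tags)
print(entities)  # [('London', 'geo'), ('Monday evening', 'tim')]
```

Treating a stray `I-` tag as the start of an entity is a lenient design choice; a stricter parser could raise an error instead.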
"London on Monday evening"
"London(B-geo) on(O) Monday(B-tim) evening(I-tim)"
You can download the dataset from Kaggle. For convenience, we have also uploaded it to GCS:
gs://kubeflow-examples-data/named_entity_recognition_dataset/ner.csv
The training pipeline uses this data as is; no further data preparation steps are required.
Next: Custom prediction routine
Previous: Build the pipeline components