Dataset

Dataset description

This example project uses the popular CoNLL 2002 dataset. The CSV file consists of rows that each contain one word and its corresponding tag. Consecutive rows together form a single sentence.
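
The row-per-word layout described above can be turned back into sentences with a few lines of Python. This is a minimal sketch, not the project's own loader: the column names `Sentence #`, `Word` and `Tag`, the convention that `Sentence #` is only filled in on the first row of a sentence, and the sample rows are all assumptions made for illustration. Check them against the header of the actual file before relying on this.

```python
import csv
import io


def group_sentences(csv_text):
    """Group (word, tag) rows into sentences.

    Assumption: the CSV has columns "Sentence #", "Word" and "Tag",
    and "Sentence #" is non-empty only on the first row of a sentence.
    """
    sentences = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A non-empty sentence marker (or the very first row) starts a new sentence.
        if row["Sentence #"].strip() or not sentences:
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences


# Hypothetical sample rows, made up for this illustration.
sample = """Sentence #,Word,Tag
Sentence: 1,London,B-geo
,on,O
,Monday,B-tim
,evening,I-tim
Sentence: 2,Hello,O
"""

sentences = group_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Each sentence comes back as a list of `(word, tag)` pairs, which is a convenient shape for building token and label sequences later on.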

The dataset contains the following entity tags:

geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon

Each tag is written in the IOB format. IOB (short for inside, outside, beginning) is a common format for tagging tokens.

B - indicates that the token begins an entity

I - indicates that the token is inside an entity

O - indicates that the token is outside of any entity (not annotated)

Example

"London on Monday evening"
"London(B-geo) on(O) Monday(B-tim) evening(I-tim)"

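To make the IOB scheme concrete, the following sketch collects tagged tokens into entity spans. It is an illustrative helper, not part of the example project: the function name `extract_entities` is hypothetical, and treating a stray `I-` tag without a preceding `B-` as the start of a new entity is a lenient design choice assumed here.

```python
def extract_entities(tagged_tokens):
    """Collect (entity_type, text) spans from (word, IOB tag) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in tagged_tokens:
        # "B-geo" -> ("B", "geo"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # B starts a new entity; a stray I is leniently treated the same way.
            current_type, current_words = label, [word]
        else:
            current_type, current_words = None, []
    # Flush an entity that runs to the end of the sentence.
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


tagged = [("London", "B-geo"), ("on", "O"), ("Monday", "B-tim"), ("evening", "I-tim")]
entities = extract_entities(tagged)
print(entities)  # [('geo', 'London'), ('tim', 'Monday evening')]
```

"Monday" and "evening" end up in one `tim` span because the `I-tim` tag continues the entity opened by `B-tim`.
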
Data Preparation

You can download the dataset from Kaggle. For convenience, we have also uploaded it to GCS:

gs://kubeflow-examples-data/named_entity_recognition_dataset/ner.csv

The training pipeline uses this data directly. No further data preparation steps are required.

Next: Custom prediction routine

Previous: Build the pipeline components
