Dataset

Dataset description

This example project uses the popular CoNLL 2002 dataset. The CSV file consists of rows that each contain a single word and its corresponding tag; consecutive rows together form a sentence.
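As a rough illustration of the one-word-per-row layout, the sketch below regroups rows into sentences. The column names `sentence_id`, `word`, and `tag` and the inline sample are assumptions made for this example; the actual header of `ner.csv` may differ.

```python
import csv
import io

# Hypothetical sample: the column names and rows are assumed for illustration
# and may not match the real header of ner.csv.
SAMPLE = """sentence_id,word,tag
1,London,B-geo
1,on,O
1,Monday,B-tim
1,evening,I-tim
2,Paris,B-geo
"""


def group_sentences(rows):
    """Collect (word, tag) pairs per sentence id, preserving row order."""
    sentences = {}
    for row in rows:
        pair = (row["word"], row["tag"])
        sentences.setdefault(row["sentence_id"], []).append(pair)
    return list(sentences.values())


sentences = group_sentences(csv.DictReader(io.StringIO(SAMPLE)))
```

Each element of `sentences` is the list of `(word, tag)` pairs that share one sentence id, in the order the rows appeared.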

The dataset contains the following tags:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon

Each tag is defined in IOB format. IOB (short for inside, outside, beginning) is a common format for tagging tokens.

B - indicates that the token begins an entity

I - indicates that the token is inside an entity

O - indicates that the token is outside of any entity (not annotated)

Example

"London on Monday evening"
"London(B-geo) on(O) Monday(B-tim) evening(I-tim)"
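To make the tagging scheme concrete, here is a minimal sketch that turns a sequence of IOB tags back into entity spans. The helper name `iob_to_entities` is hypothetical and not part of this project; treating a stray `I-` tag as outside any entity is one possible convention among several.

```python
def iob_to_entities(tokens, tags):
    """Return (entity text, entity type) pairs decoded from IOB tags."""
    entities = []
    words, entity_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if words:
            entities.append((" ".join(words), entity_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            words, entity_type = [token], tag[2:]
        elif tag.startswith("I-") and entity_type == tag[2:]:
            words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (an assumption: such tokens are treated as outside).
            flush()
            words, entity_type = [], None
    flush()
    return entities


entities = iob_to_entities(
    ["London", "on", "Monday", "evening"],
    ["B-geo", "O", "B-tim", "I-tim"],
)
print(entities)  # [('London', 'geo'), ('Monday evening', 'tim')]
```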

Data Preparation

You can download the dataset from Kaggle. For convenience, we have also uploaded the dataset to GCS:

gs://kubeflow-examples-data/named_entity_recognition_dataset/ner.csv

The training pipeline uses this data directly; no further data preparation steps are required.

Next: Custom prediction routine

Previous: Build the pipeline components