Commit 98a4b39

iss15: update README and upload page
1 parent a3027a8 commit 98a4b39

7 files changed

+142
-14
lines changed

README.md

Lines changed: 138 additions & 11 deletions
@@ -102,47 +102,174 @@ Now, try logging in with the superuser account you created in the previous step.
102102

103103
<img src="./docs/projects.png" alt="projects" width=600>
104104

105-
You should see there is no project.
105+
No project has been created yet. This brief tutorial walks through doccano using a named entity recognition (NER) annotation task on science fiction works.
106+
107+
Below is a JSON file containing descriptions of science fiction works in several languages.
108+
109+
`books.json`
110+
```JSON
111+
{"text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film."}
112+
{"text": "《三体》是中国大陆作家刘慈欣于2006年5月至12月在《科幻世界》杂志上连载的一部长篇科幻小说,出版后成为中国大陆最畅销的科幻长篇小说之一。2008年,该书的单行本由重庆出版社出版。本书是三体系列(系列原名为:地球往事三部曲)的第一部,该系列的第二部《三体II:黑暗森林》已经于2008年5月出版。2010年11月,第三部《三体III:死神永生》出版发行。 2011年,“地球往事三部曲”在台湾陆续出版。小说的英文版获得美国科幻奇幻作家协会2014年度“星云奖”提名,并荣获2015年雨果奖最佳小说奖。"}
113+
{"text": "『銀河英雄伝説』(ぎんがえいゆうでんせつ)は、田中芳樹によるSF小説。また、これを原作とするアニメ、漫画、コンピュータゲーム、朗読、オーディオブック等の関連作品。略称は『銀英伝』(ぎんえいでん)。原作は累計発行部数が1500万部を超えるベストセラー小説である。1982年から2009年6月までに複数の版で刊行され、発行部数を伸ばし続けている。"}
114+
```
106115

107116
To create your project, make sure you’re on the project list page and click the `Create Project` button. You should see the following screen:
108117

109118
<img src="./docs/create_project.png" alt="Project Creation" width=400>
110119

111-
In project creation, you can select three project types: text classificatioin, sequence labeling and sequence to sequence. You should select a type with your purpose.
120+
In this step, you can choose one of three project types: text classification, sequence labeling, and sequence to sequence. Select the type that fits your purpose.
112121

113-
### Import text items
122+
For this tutorial, we name the project `sequence labeling for books`, write a short description, choose the sequence labeling project type, and select the user we created.
114123

115-
Now that we’ve created a project. Now you’re at the “dataset” page for the project. This page displays all the documents in the project. You can see there is no documents.
124+
### Import Data
116125

117-
To import text items, select `Import Data` button in the navigation bar. You should see the following screen:
126+
After creating a project, you will be taken to the "Import Data" page. You can also reach it by clicking the `Import Data` button in the navigation bar. You should see the following screen:
118127

119128
<img src="./docs/upload.png" alt="Upload project" width=600>
120129

121-
The text items should be provided in txt format. As of now, it must contain only texts. Each line must contain a text:
130+
You can upload two types of files:
131+
- `TXT file`: each line contains one text, so a text cannot contain line breaks (`\n`).
132+
- `JSON file`: each line contains a JSON object with a `text` key. The JSON format supports rendering line breaks.
122133

134+
> Notice: doccano won't render line breaks on the annotation page for sequence labeling tasks due to an indentation problem, but the exported JSON file still contains the line breaks.
135+
136+
`example.txt` (or `example.csv`)
123137
```python
124138
EU rejects German call to boycott British lamb.
125139
President Obama is speaking at the White House.
126140
He lives in Newark, Ohio.
127141
...
128142
```
143+
`example.json`
144+
```JSON
145+
{"text": "EU rejects German call to boycott British lamb."}
146+
{"text": "President Obama is speaking at the White House."}
147+
{"text": "He lives in Newark, Ohio."}
148+
...
149+
```
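
If your texts are in a TXT file and you would rather upload the JSON format described above, a conversion along the following lines might help. This is only a minimal sketch and not part of doccano: the helper name `to_json_lines` is hypothetical, and it assumes one text per input line.

```python
import json


def to_json_lines(lines):
    """Yield one JSON string with a "text" key per non-empty input line.

    Hypothetical helper for illustration; not part of doccano.
    """
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip():
            continue  # skip blank lines
        # ensure_ascii=False keeps non-ASCII characters readable in the output
        yield json.dumps({"text": text}, ensure_ascii=False)


# In-memory stand-in for the lines of example.txt
sample = [
    "EU rejects German call to boycott British lamb.\n",
    "\n",
    "He lives in Newark, Ohio.\n",
]
converted = list(to_json_lines(sample))
```

To convert a real file, pass an open file object as `lines` and write each yielded string followed by a newline.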
150+
151+
Once you have selected a TXT or JSON file on your computer, click the `Upload dataset` button. For this tutorial, we select the JSON format and upload the `books.json` file.
129152

130-
Once you select a csv file on your computer, select `Upload` button.
153+
After uploading the dataset file, you will see the `Dataset` page (you can also open it by clicking the `Dataset` button in the left bar). This page displays all the documents uploaded to the project.
131154

132155
### Define labels
133156

134-
Now we’ll define your labels. To define your labels, select `Labels` menu. You should see the label editor page:
157+
Click the `Labels` button in the left bar to define your own labels. You should see the label editor page, where you can create labels by specifying the label text, shortcut key, background color, and text color.
135158

136159
<img src="./docs/label_editor.png" alt="Edit label" width=600>
137160

138-
In label editor page, we can create labels by specifying label text, shortcut key, background color and text color.
161+
For this tutorial, we created some entity labels related to science fiction.
139162

140163
### Annotation
141164

142-
Now, we are ready to annotate the texts. Back to the project list page and select the project. You can start annotation!
165+
Now you are ready to annotate the texts. Click the `Annotate Data` button in the navigation bar to start annotating the documents you uploaded.
143166

144167
<img src="./docs/annotation.png" alt="Annotation" width=600>
145168

169+
### Export Data
170+
171+
After the annotation step, you can download the annotated data. Click the `Edit data` button in the navigation bar, and then click `Export Data`. You should see the following screen:
172+
173+
<img src="./docs/export_data.png" alt="Export data" width=600>
174+
175+
You can export the data as a CSV file or a JSON file by clicking the corresponding button. Below is the annotated result for our tutorial project.
176+
177+
`sequence_labeling_for_books.json`
178+
```JSON
179+
{"doc_id": 33, "text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film.", "entities": [[0, 36, "Title"], [63, 67, "Title"], [69, 75, "Title"], [78, 82, "Title"], [89, 111, "Genre"], [130, 143, "Person"], [158, 180, "Genre"], [184, 193, "Other"], [199, 203, "Date"], [254, 265, "Genre"], [267, 273, "Genre"], [275, 286, "Genre"], [290, 294, "Date"], [295, 304, "Genre"], [308, 312, "Date"], [313, 323, "Genre"], [329, 333, "Date"], [334, 346, "Genre"]], "username": "admin"}
180+
{"doc_id": 34, "text": "《三体》是中国大陆作家刘慈欣于2006年5月至12月在《科幻世界》杂志上连载的一部长篇科幻小说,出版后成为中国大陆最畅销的科幻长篇小说之一。2008年,该书的单行本由重庆出版社出版。本书是三体系列(系列原名为:地球往事三部曲)的第一部,该系列的第二部《三体II:黑暗森林》已经于2008年5月出版。2010年11月,第三部《三体III:死神永生》出版发行。 2011年,“地球往事三部曲”在台湾陆续出版。小说的英文版获得美国科幻奇幻作家协会2014年度“星云奖”提名,并荣获2015年雨果奖最佳小说奖。", "entities": [[1, 3, "Title"], [5, 7, "Location"], [11, 14, "Person"], [15, 22, "Date"], [23, 26, "Date"], [28, 32, "Other"], [43, 45, "Genre"], [53, 55, "Location"], [70, 75, "Date"], [126, 135, "Title"], [139, 146, "Date"], [149, 157, "Date"], [162, 172, "Title"], [179, 184, "Date"], [195, 197, "Location"], [210, 212, "Location"], [227, 230, "Other"], [220, 225, "Date"], [237, 242, "Date"], [242, 245, "Other"]], "username": "admin"}
181+
{"doc_id": 35, "text": "『銀河英雄伝説』(ぎんがえいゆうでんせつ)は、田中芳樹によるSF小説。また、これを原作とするアニメ、漫画、コンピュータゲーム、朗読、オーディオブック等の関連作品。略称は『銀英伝』(ぎんえいでん)。原作は累計発行部数が1500万部を超えるベストセラー小説である。1982年から2009年6月までに複数の版で刊行され、発行部数を伸ばし続けている。", "entities": [[1, 7, "Title"], [23, 27, "Person"], [30, 34, "Genre"], [46, 49, "Genre"], [50, 52, "Genre"], [53, 62, "Genre"], [63, 65, "Genre"], [66, 74, "Genre"], [85, 88, "Title"], [9, 20, "Title"], [90, 96, "Title"], [108, 114, "Other"], [118, 126, "Other"], [130, 135, "Date"], [137, 144, "Date"]], "username": "admin"}
182+
```
183+
184+
Congratulations! You have just learned how to use doccano for a sequence labeling project. The export formats for document classification and sequence to sequence projects are described below.
185+
186+
**JSON output**
187+
188+
In the JSON export format, every annotated document is written on one line, and each line is a JSON object (a dictionary once loaded in Python) with 4 keys:
189+
* `doc_id`: document id
190+
* `text`: original text
191+
* `labels`, `entities`, or `sentences`: the annotation (the key name depends on the project type)
192+
* `username`: annotator name
193+
194+
A JSON export example for *document classification*.
195+
```JSON
196+
{"doc_id": 20, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "labels": ["label1"], "username": "admin"}
197+
{"doc_id": 21, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统,\n任期从2009月1月20日到2017年1月20。", "labels": ["label1", "label2"], "username": "admin"}
198+
{"doc_id": 22, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "labels": ["label1", "label2", "label3"], "username": "admin"}
199+
```
200+
201+
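A JSON export example for *sequence labeling*. Entity positions are character offsets into `text`. Line breaks need no special handling: in the examples below, each `\n` counts as one character.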
202+
```JSON
203+
{"doc_id": 23, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "entities": [[0, 20, "PER"], [87, 104, "ORG"], [110, 126, "DATE"], [131, 147, "DATE"]], "username": "admin"}
204+
{"doc_id": 24, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统,\n任期从2009月1月20日到2017年1月20。", "entities": [[0, 11, "PER"], [29, 31, "ORG"], [38, 48, "DATE"], [49, 58, "DATE"]], "username": "admin"}
205+
{"doc_id": 25, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "entities": [[0, 12, "PER"], [16, 20, "ORG"], [29, 39, "DATE"], [41, 51, "DATE"]], "username": "admin"}
206+
```
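
The entity offsets can be mapped back to the annotated strings by slicing `text`. The sketch below does this for the English line above. It assumes each entity is `[start, end, label]` with an exclusive end offset and that the `\n` counts as one character, which is how the offsets in this example line up. The helper name `entity_spans` is hypothetical.

```python
import json

# One exported line, copied from the English example above (raw string keeps "\n" as JSON text).
line = r'{"doc_id": 23, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "entities": [[0, 20, "PER"], [87, 104, "ORG"], [110, 126, "DATE"], [131, 147, "DATE"]], "username": "admin"}'


def entity_spans(record):
    """Return (surface text, label) pairs for one exported record.

    Assumes [start, end, label] entities with an exclusive end offset.
    """
    text = record["text"]
    return [(text[start:end], label) for start, end, label in record["entities"]]


record = json.loads(line)
spans = entity_spans(record)
```

Printing `spans` is a quick way to check that exported offsets still match the text after any preprocessing you apply.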
207+
208+
A JSON export example for *sequence to sequence*.
209+
```JSON
210+
{"doc_id": 26, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "sentences": ["バラク・フセイン・オバマ2世は、アメリカの政治家であり、 2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。"], "username": "admin"}
211+
{"doc_id": 27, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统,\n任期从2009月1月20日到2017年1月20。", "sentences": ["Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统, 任期从2009月1月20日到2017年1月20。"], "username": "admin"}
212+
{"doc_id": 28, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "sentences": ["Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017."], "username": "admin"}
213+
```
214+
215+
Because each JSON object is saved as one line in the JSON file, you should read the file line by line. Here is a simple script that loads this format for your task.
216+
217+
```Python
218+
import json
219+
with open("export.json", encoding="utf-8") as f:
220+
    jsons = [json.loads(line) for line in f]
221+
```
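
As a slightly more defensive variant, the sketch below skips blank lines while parsing and then tallies how often each entity label was used. It is only an illustration under stated assumptions: the function names are hypothetical, the two sample records are made up for this example, and the tally assumes the `[start, end, label]` entity layout shown in the sequence labeling export.

```python
import json
from collections import Counter


def parse_export(lines):
    """Parse one JSON object per line, skipping blank lines.

    `lines` can be an open file or any iterable of strings.
    """
    return [json.loads(line) for line in lines if line.strip()]


def count_entity_labels(records):
    """Count how often each label occurs in the "entities" lists."""
    counts = Counter()
    for record in records:
        for _start, _end, label in record.get("entities", []):
            counts[label] += 1
    return counts


# Made-up records standing in for an export file (illustration only).
sample_lines = [
    '{"doc_id": 1, "text": "Dune by Frank Herbert", "entities": [[0, 4, "Title"], [8, 21, "Person"]], "username": "admin"}\n',
    "\n",
    '{"doc_id": 2, "text": "Solaris", "entities": [[0, 7, "Title"]], "username": "admin"}\n',
]
records = parse_export(sample_lines)
label_counts = count_entity_labels(records)
```

With a real export, pass the open file object to `parse_export` instead of `sample_lines`.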
222+
223+
**CSV output**
224+
225+
The CSV export format for *document classification* has four columns: document id, text, label (one label per line), and user name. Below is a multi-label example.
226+
227+
```CSV
228+
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label1,admin
229+
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label2,admin
230+
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label3,admin
231+
...
232+
```
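
Because a multi-label document is spread over several rows, you may want to regroup the rows by document id. The following is a minimal sketch, assuming the four-column layout described above and no header row. The helper name `labels_by_document` is hypothetical, and the sample text is shortened for readability.

```python
import csv
import io


def labels_by_document(rows):
    """Group (doc_id, text, label, username) rows into one entry per document."""
    docs = {}
    for doc_id, text, label, _username in rows:
        entry = docs.setdefault(doc_id, {"text": text, "labels": []})
        entry["labels"].append(label)
    return docs


# Shortened in-memory stand-in for the exported CSV file.
sample = (
    '20,"Barack Hussein Obama II is an American politician.",label1,admin\n'
    '20,"Barack Hussein Obama II is an American politician.",label2,admin\n'
    '20,"Barack Hussein Obama II is an American politician.",label3,admin\n'
)
docs = labels_by_document(csv.reader(io.StringIO(sample)))
```

Note that `csv.reader` returns every field as a string, so the document id key here is `"20"`, not the integer `20`.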
233+
234+
The CSV export format for *sequence labeling* is the [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) at the character level, which has three columns: document id, character, entity.
235+
236+
```CSV
237+
23,B,B-PER
238+
23,a,I-PER
239+
23,r,I-PER
240+
23,a,I-PER
241+
23,c,I-PER
242+
23,k,I-PER
243+
23, ,I-PER
244+
23,H,I-PER
245+
23,u,I-PER
246+
23,s,I-PER
247+
23,s,I-PER
248+
23,e,I-PER
249+
23,i,I-PER
250+
23,n,I-PER
251+
23, ,I-PER
252+
23,O,I-PER
253+
23,b,I-PER
254+
23,a,I-PER
255+
23,m,I-PER
256+
23,a,I-PER
257+
23, ,O
258+
23,I,O
259+
23,I,O
260+
23, ,O
261+
23,i,O
262+
23,s,O
263+
...
264+
```
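
If you need entity spans rather than per-character tags, the rows can be folded back together. The sketch below is one possible way to do it, not doccano's own code: it assumes tags of the form `B-XXX`, `I-XXX`, or `O`, that the rows of a document appear in text order, and that there is no header row. The helper name `iob_rows_to_spans` is hypothetical.

```python
import csv
import io


def iob_rows_to_spans(rows):
    """Rebuild text and [start, end, label] spans from character-level IOB rows.

    Each row is (doc_id, character, tag); end offsets are exclusive.
    """
    docs = {}
    for doc_id, char, tag in rows:
        doc = docs.setdefault(doc_id, {"chars": [], "entities": [], "open": None})
        pos = len(doc["chars"])
        doc["chars"].append(char)
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            doc["entities"].append([pos, pos + 1, label])  # start a new span
            doc["open"] = label
        elif prefix == "I" and doc["open"] == label:
            doc["entities"][-1][1] = pos + 1  # extend the current span
        else:
            doc["open"] = None  # "O" or a stray I- tag closes the span
    return {
        doc_id: {"text": "".join(d["chars"]), "entities": d["entities"]}
        for doc_id, d in docs.items()
    }


# Rebuild the rows shown above for the first 26 characters of document 23.
text = "Barack Hussein Obama II is"
tags = ["B-PER"] + ["I-PER"] * 19 + ["O"] * 6
sample = "".join(f"23,{ch},{tag}\n" for ch, tag in zip(text, tags))
result = iob_rows_to_spans(csv.reader(io.StringIO(sample)))
```

The result uses the same `[start, end, label]` layout as the `entities` key in the JSON export, which makes the two export formats easier to compare.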
265+
266+
The CSV export format for *sequence to sequence* has four columns: document id, original text, sentence (one sentence per line), and user name. The example below shows the English text translated into Japanese and Chinese.
267+
268+
```CSV
269+
26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",バラク・フセイン・オバマ2世は、アメリカの政治家であり、2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。,admin
270+
26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统, 任期从2009月1月20日到2017年1月20。,admin
271+
```
272+
146273
I hope you are having a great day!
147274

148275
## Contribution
@@ -153,4 +280,4 @@ As with any software, doccano is under continuous development. If you have reque
153280

154281
For help and feedback, please feel free to contact [the author](https://github.com/Hironsan).
155282

156-
**If you are favorite to doccano, please follow my [GitHub](https://github.com/Hironsan) and [Twitter](https://twitter.com/Hironsan13) account.**
283+
**If you like doccano, please follow my [GitHub](https://github.com/Hironsan) and [Twitter](https://twitter.com/Hironsan13) accounts.**

app/server/templates/admin/dataset_upload.html

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -19,9 +19,10 @@
1919
<p>
2020
<b>To annotate texts, you first need to import a set of text items.</b>
2121
</p>
22-
<p>
23-
Each line should contain a text.
24-
</p>
22+
<ul class="is-unstyled">
23+
<li>TXT file: each line should contain a text.</li>
24+
<li>JSON file: each line should contain a JSON object with at least one key 'text', which contains a text.</li>
25+
</ul>
2526
<form action="" method="post" enctype="multipart/form-data">
2627
{% csrf_token %}
2728
<div class="section">

docs/annotation.png

-75 KB

docs/create_project.png

-55.7 KB

docs/export_data.png

166 KB

docs/label_editor.png

16.8 KB

docs/upload.png

70 KB
