No project has been created yet. This brief tutorial walks through doccano using an NER annotation task on science fiction works.
Below is a JSON file containing descriptions of many science fiction works in different languages.
`books.json`
```JSON
{"text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film."}
```
To create your project, make sure you are on the project list page and select the `Create Project` button. You should see the following screen:
In this step, you can choose from three project types: text classification, sequence labeling, and sequence to sequence. Select the type that fits your purpose.
For this tutorial, we name the project `sequence labeling for books`, write a short description, choose the sequence labeling project type, and select the user we created.
### Import Data
After creating a project, you will see the "Import Data" page; you can also open it by clicking the `Import Data` button in the navigation bar. You should see the following screen:
@ -148,9 +135,7 @@ He lives in Newark, Ohio.
```
...
```
Once you have selected a TXT/JSON file on your computer, click the `Upload dataset` button. For this tutorial, we select the JSON format and upload the `books.json` file.
After uploading the dataset file, you will see the `Dataset` page (you can also open it by clicking the `Dataset` button in the left bar). This page displays all the documents uploaded to the project.
### Define labels
@ -158,7 +143,6 @@ Click `Labels` button in left bar to define your own labels. You should see the
You can export the data as a CSV or JSON file by clicking the button. Below is the annotated result for our tutorial project.
`sequence_labeling_for_books.json`
```JSON
{"doc_id": 33, "text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film.", "entities": [[0, 36, "Title"], [63, 67, "Title"], [69, 75, "Title"], [78, 82, "Title"], [89, 111, "Genre"], [130, 143, "Person"], [158, 180, "Genre"], [184, 193, "Other"], [199, 203, "Date"], [254, 265, "Genre"], [267, 273, "Genre"], [275, 286, "Genre"], [290, 294, "Date"], [295, 304, "Genre"], [308, 312, "Date"], [313, 323, "Genre"], [329, 333, "Date"], [334, 346, "Genre"]], "username": "admin"}
```
Congratulations! You have just learned how to use doccano for a sequence labeling project. The export formats for document classification and sequence to sequence projects are described below.
**JSON output**
In the JSON export, each annotated document occupies one line, and each line is a JSON object (a Python dictionary once loaded) with four keys.
* `doc_id`: document id
* `text`: original text
* `labels`, `entities`, or `sentences`: the annotation, depending on the project type
* `username`: annotator name
A JSON export example for *document classification*.
```JSON
{"doc_id": 20, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "labels": ["label1"], "username": "admin"}
```
A JSON export example for *sequence labeling*. Entity positions ignore line breaks.
```JSON
{"doc_id": 23, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "entities": [[0, 20, "PER"], [87, 104, "ORG"], [110, 126, "DATE"], [131, 147, "DATE"]], "username": "admin"}
```
A JSON export example for *sequence to sequence*.
```JSON
{"doc_id": 26, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "sentences": ["バラク・フセイン・オバマ2世は、アメリカの政治家であり、 2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。"], "username": "admin"}
{"doc_id": 27, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统,\n任期从2009月1月20日到2017年1月20。", "sentences": ["Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统, 任期从2009月1月20日到2017年1月20。"], "username": "admin"}
{"doc_id": 28, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "sentences": ["Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017."], "username": "admin"}
```
Because each JSON object is saved on its own line in the JSON file, you should read the file line by line. Here is a simple script that loads this format for your task.
```Python
import json

# Each line of the export file is a standalone JSON object.
with open("export.json", encoding="utf-8") as f:
    jsons = [json.loads(line) for line in f]
```
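Once a record is loaded, the annotated spans can be recovered by slicing `text` with each entity's offsets. The sketch below assumes that every entity is a `[start, end, label]` triple whose offsets are half-open character indices into `text`, which is consistent with the sequence labeling sample above. `extract_entities` is a hypothetical helper name, not part of doccano.

```Python
import json

# One exported line, copied from the sequence labeling sample above.
# A raw string keeps the JSON escape "\n" intact so json.loads can decode it.
line = r'{"doc_id": 23, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "entities": [[0, 20, "PER"], [87, 104, "ORG"], [110, 126, "DATE"], [131, 147, "DATE"]], "username": "admin"}'


def extract_entities(record):
    """Return (label, surface text) pairs for one exported record.

    Assumption: offsets are [start, end) character indices into record["text"].
    """
    text = record["text"]
    return [(label, text[start:end]) for start, end, label in record["entities"]]


record = json.loads(line)
pairs = extract_entities(record)
for label, surface in pairs:
    print(label, repr(surface))
```

Checking the extracted surface strings against what you expect is a quick way to confirm how offsets are counted in your own export before building further processing on top of them.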
**CSV output**
For more details on the export file formats, see [Export File Formats](https://github.com/chakki-works/doccano/wiki/Export-File-Formats).
The CSV export format for *document classification* has four columns: document id, text, label (one label per line), and user name. Below is a multi-label example.
```CSV
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label1,admin
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label2,admin
20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label3,admin
...
```
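Because a multi-label document spans several rows, it can be convenient to regroup the labels per document after loading. The following is a minimal sketch using Python's standard `csv` module. It assumes the four-column layout described above with no header row; the sample rows and the `group_labels` name are illustrative only.

```Python
import csv
import io
from collections import defaultdict

# Illustrative rows in the four-column layout:
# document id, text, label, user name.
sample = '''20,"Barack Obama served as the 44th President.",label1,admin
20,"Barack Obama served as the 44th President.",label2,admin
21,"He lives in Newark, Ohio.",label3,admin
'''


def group_labels(csv_file):
    """Map each document id to the list of labels found in its rows."""
    labels = defaultdict(list)
    for doc_id, _text, label, _user in csv.reader(csv_file):
        labels[doc_id].append(label)
    return dict(labels)


grouped = group_labels(io.StringIO(sample))
print(grouped)
```

To read a real export, pass an open file object (opened with `newline=""`, as the `csv` module recommends) instead of the `io.StringIO` wrapper.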
The CSV export format for *sequence labeling* is the [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) at the character level, with three columns: document id, character, and entity tag.
```CSV
23,B,B-PER
23,a,I-PER
23,r,I-PER
23,a,I-PER
23,c,I-PER
23,k,I-PER
23, ,I-PER
23,H,I-PER
23,u,I-PER
23,s,I-PER
23,s,I-PER
23,e,I-PER
23,i,I-PER
23,n,I-PER
23, ,I-PER
23,O,I-PER
23,b,I-PER
23,a,I-PER
23,m,I-PER
23,a,I-PER
23, ,O
23,I,O
23,I,O
23, ,O
23,i,O
23,s,O
...
```
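The relationship between the span offsets in the JSON export and these character-level rows can be sketched as follows. This is not doccano's exporter, only an illustration under stated assumptions: entities are non-overlapping `[start, end, label]` triples with half-open character offsets, and `spans_to_char_iob` is a hypothetical helper name.

```Python
def spans_to_char_iob(doc_id, text, entities):
    """Return (doc_id, character, tag) rows in character-level IOB.

    Assumption: entities are non-overlapping [start, end, label] triples
    with half-open character offsets into text.
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        if start >= end:
            continue  # skip empty spans
        tags[start] = "B-" + label  # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining characters of the entity
    return [(doc_id, char, tag) for char, tag in zip(text, tags)]


rows = spans_to_char_iob(23, "Barack Hussein Obama II is", [[0, 20, "PER"]])
for row in rows[:3]:
    print(row)
```

Applied to the start of the sample sentence, the first character is tagged `B-PER`, the rest of the name `I-PER`, and everything after it `O`, matching the rows listed above.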
The CSV export format for *sequence to sequence* has four columns: document id, original text, sentence (one sentence per line), and user name. The example below shows an English text translated into Japanese and Chinese.
```CSV
26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",バラク・フセイン・オバマ2世は、アメリカの政治家であり、2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。,admin
26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",贝拉克·侯赛因·奥巴马是一个美国的政治家,曾担任第四十四任美国总统, 任期从2009月1月20日到2017年1月20。,admin
```
### Tutorial
We have prepared an NER annotation tutorial to help you better understand doccano. Please read the README page first, and then take the tutorial: [A Tutorial For Sequence Labeling Project](https://github.com/chakki-works/doccano/wiki/A-Tutorial-For-Sequence-Labeling-Project).