update README: split tutorial and export file format to wiki pages

5 years ago · e44b28046f
1 changed files with 5 additions and 113 deletions
--- a/README.md
+++ b/README.md
@@ -102,25 +102,12 @@ Now, try logging in with the superuser account you created in the previous step.

 <img src="./docs/projects.png" alt="projects" width=600>

-No project has been created yet. Here we use an NER annotation task on science fiction works to give you a brief tutorial on doccano.
-
-Below is a JSON file containing descriptions of several science fiction works in different languages.
-
-`books.json`
-```JSON
-{"text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film."}
-{"text": "《三体》是中国大陆作家刘慈欣于2006年5月至12月在《科幻世界》杂志上连载的一部长篇科幻小说，出版后成为中国大陆最畅销的科幻长篇小说之一。2008年，该书的单行本由重庆出版社出版。本书是三体系列（系列原名为：地球往事三部曲）的第一部，该系列的第二部《三体II：黑暗森林》已经于2008年5月出版。2010年11月，第三部《三体III：死神永生》出版发行。 2011年，“地球往事三部曲”在台湾陆续出版。小说的英文版获得美国科幻奇幻作家协会2014年度“星云奖”提名，并荣获2015年雨果奖最佳小说奖。"}
-{"text": "『銀河英雄伝説』（ぎんがえいゆうでんせつ）は、田中芳樹によるSF小説。また、これを原作とするアニメ、漫画、コンピュータゲーム、朗読、オーディオブック等の関連作品。略称は『銀英伝』（ぎんえいでん）。原作は累計発行部数が1500万部を超えるベストセラー小説である。1982年から2009年6月までに複数の版で刊行され、発行部数を伸ばし続けている。"}
-```
-
-To create your project, make sure you are on the project list page and click the `Create Project` button. You should see the following screen:
+No project has been created yet. To create one, make sure you are on the project list page and click the `Create Project` button. You should see the following screen:

 <img src="./docs/create_project.png" alt="Project Creation" width=400>

 In this step, you can choose among three project types: text classification, sequence labeling, and sequence to sequence. Select the type that fits your purpose.

-For the tutorial, we name the project `sequence labeling for books`, write a short description, choose the sequence labeling project type, and select the user we created.
-
 ### Import Data

 After creating a project, you will see the "Import Data" page; you can also reach it by clicking the `Import Data` button in the navigation bar. You should see the following screen:
@@ -148,9 +135,7 @@ He lives in Newark, Ohio.
 ...
 ```

-Once you have selected a TXT/JSON file on your computer, click the `Upload dataset` button. For the tutorial, we select the JSON format and upload the `books.json` file.
-
-After the upload, you will see the `Dataset` page (also reachable via the `Dataset` button in the left bar). This page displays all the documents uploaded to the project.
+Once you have selected a TXT/JSON file on your computer, click the `Upload dataset` button. After the upload, you will see the `Dataset` page (also reachable via the `Dataset` button in the left bar). This page displays all the documents uploaded to the project.

 ### Define labels

@@ -158,7 +143,6 @@ Click `Labels` button in left bar to define your own labels. You should see the

 <img src="./docs/label_editor.png" alt="Edit label" width=600>

-For the tutorial, we created some entities related to science fiction.

 ### Annotation

@@ -172,103 +156,11 @@ After the annotation step, you can download the annotated data. Click the `Edit

 <img src="./docs/export_data.png" alt="Edit label" width=600>

-You can export the data as a CSV or JSON file by clicking the button. Below is the annotated result for our tutorial project.
-
-`sequence_labeling_for_books.json`
-```JSON
-{"doc_id": 33, "text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film.", "entities": [[0, 36, "Title"], [63, 67, "Title"], [69, 75, "Title"], [78, 82, "Title"], [89, 111, "Genre"], [130, 143, "Person"], [158, 180, "Genre"], [184, 193, "Other"], [199, 203, "Date"], [254, 265, "Genre"], [267, 273, "Genre"], [275, 286, "Genre"], [290, 294, "Date"], [295, 304, "Genre"], [308, 312, "Date"], [313, 323, "Genre"], [329, 333, "Date"], [334, 346, "Genre"]], "username": "admin"}
-{"doc_id": 34, "text": "《三体》是中国大陆作家刘慈欣于2006年5月至12月在《科幻世界》杂志上连载的一部长篇科幻小说，出版后成为中国大陆最畅销的科幻长篇小说之一。2008年，该书的单行本由重庆出版社出版。本书是三体系列（系列原名为：地球往事三部曲）的第一部，该系列的第二部《三体II：黑暗森林》已经于2008年5月出版。2010年11月，第三部《三体III：死神永生》出版发行。 2011年，“地球往事三部曲”在台湾陆续出版。小说的英文版获得美国科幻奇幻作家协会2014年度“星云奖”提名，并荣获2015年雨果奖最佳小说奖。", "entities": [[1, 3, "Title"], [5, 7, "Location"], [11, 14, "Person"], [15, 22, "Date"], [23, 26, "Date"], [28, 32, "Other"], [43, 45, "Genre"], [53, 55, "Location"], [70, 75, "Date"], [126, 135, "Title"], [139, 146, "Date"], [149, 157, "Date"], [162, 172, "Title"], [179, 184, "Date"], [195, 197, "Location"], [210, 212, "Location"], [227, 230, "Other"], [220, 225, "Date"], [237, 242, "Date"], [242, 245, "Other"]], "username": "admin"}
-{"doc_id": 35, "text": "『銀河英雄伝説』（ぎんがえいゆうでんせつ）は、田中芳樹によるSF小説。また、これを原作とするアニメ、漫画、コンピュータゲーム、朗読、オーディオブック等の関連作品。略称は『銀英伝』（ぎんえいでん）。原作は累計発行部数が1500万部を超えるベストセラー小説である。1982年から2009年6月までに複数の版で刊行され、発行部数を伸ばし続けている。", "entities": [[1, 7, "Title"], [23, 27, "Person"], [30, 34, "Genre"], [46, 49, "Genre"], [50, 52, "Genre"], [53, 62, "Genre"], [63, 65, "Genre"], [66, 74, "Genre"], [85, 88, "Title"], [9, 20, "Title"], [90, 96, "Title"], [108, 114, "Other"], [118, 126, "Other"], [130, 135, "Date"], [137, 144, "Date"]], "username": "admin"}
-```
-
-Congratulations! You have just learned how to use doccano for a sequence labeling project. The export formats for document classification and sequence to sequence are described below.
-
-**JSON output**
-
-The JSON export format: each annotated document occupies one line, and each line is a JSON object (a Python dictionary once parsed) with four keys.
-* `doc_id`: document id
-* `text`: original text
-* `labels/entities/sentences`: annotation
-* `username`: annotator name
-
-A JSON export example for *document classification*.
-```JSON
-{"doc_id": 20, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "labels": ["label1"], "username": "admin"}
-{"doc_id": 21, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家，曾担任第四十四任美国总统，\n任期从2009月1月20日到2017年1月20。", "labels": ["label1", "label2"], "username": "admin"}
-{"doc_id": 22, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "labels": ["label1", "label2", "label3"], "username": "admin"}
-```
-
-A JSON export example for *sequence labeling*. Entity positions ignore line breaks.
-```JSON
-{"doc_id": 23, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "entities": [[0, 20, "PER"], [87, 104, "ORG"], [110, 126, "DATE"], [131, 147, "DATE"]], "username": "admin"}
-{"doc_id": 24, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家，曾担任第四十四任美国总统，\n任期从2009月1月20日到2017年1月20。", "entities": [[0, 11, "PER"], [29, 31, "ORG"], [38, 48, "DATE"], [49, 58, "DATE"]], "username": "admin"}
-{"doc_id": 25, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "entities": [[0, 12, "PER"], [16, 20, "ORG"], [29, 39, "DATE"], [41, 51, "DATE"]], "username": "admin"}
-```
-
-A JSON export example for *sequence to sequence*.
-```JSON
-{"doc_id": 26, "text": "Barack Hussein Obama II is an American politician \nwho served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "sentences": ["バラク・フセイン・オバマ2世は、アメリカの政治家であり、  2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。"], "username": "admin"}
-{"doc_id": 27, "text": "贝拉克·侯赛因·奥巴马是一个美国的政治家，曾担任第四十四任美国总统，\n任期从2009月1月20日到2017年1月20。", "sentences": ["Barack Hussein Obama II is an American politician  who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.", "贝拉克·侯赛因·奥巴马是一个美国的政治家，曾担任第四十四任美国总统， 任期从2009月1月20日到2017年1月20。"], "username": "admin"}
-{"doc_id": 28, "text": "バラク・フセイン・オバマ2世は、アメリカの政治家であり、\n2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。", "sentences": ["Barack Hussein Obama II is an American politician  who served as the 44th President of the United States from January 20, 2009, to January 20, 2017."], "username": "admin"}
-```
-
-Because each JSON object is saved as one line in the JSON file, you should read the file line by line. Here is a simple script that loads this format.
-
-```Python
-import json
-
-# Each line of the export is one JSON object, so parse the file line by line.
-with open("export.json", encoding="utf-8") as f:
-    jsons = [json.loads(line) for line in f]
-```
-
-**CSV output**
+You can export the data as a CSV or JSON file by clicking the button. The export file formats are described here: [Export File Formats](https://github.com/chakki-works/doccano/wiki/Export-File-Formats).

-The CSV export format for *document classification* has four columns: document id, text, label (one label per line), user name. Below is a multi-label example.
+### Tutorial

-```CSV
-20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label1,admin
-20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label2,admin
-20,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",label3,admin
-...
-```
-
-The CSV export format for *sequence labeling* is the character-level [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), which has three columns: document id, character, entity.
-
-```CSV
-23,B,B-PER
-23,a,I-PER
-23,r,I-PER
-23,a,I-PER
-23,c,I-PER
-23,k,I-PER
-23, ,I-PER
-23,H,I-PER
-23,u,I-PER
-23,s,I-PER
-23,s,I-PER
-23,e,I-PER
-23,i,I-PER
-23,n,I-PER
-23, ,I-PER
-23,O,I-PER
-23,b,I-PER
-23,a,I-PER
-23,m,I-PER
-23,a,I-PER
-23, ,O
-23,I,O
-23,I,O
-23, ,O
-23,i,O
-23,s,O
-...
-```
-
-The CSV export format for *sequence to sequence* has four columns: document id, original text, sentence (one sentence per line), user name. The example below shows the English text translated into Chinese and Japanese.
-
-```CSV
-26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",バラク・フセイン・オバマ2世は、アメリカの政治家であり、2009年1月20日から2017年1月20日まで、第44代米国大統領を務めた。,admin
-26,"Barack Hussein Obama II is an American politician who served as the 44th President of the United States from January 20, 2009, to January 20, 2017.",贝拉克·侯赛因·奥巴马是一个美国的政治家，曾担任第四十四任美国总统， 任期从2009月1月20日到2017年1月20。,admin
-```
+We have prepared an NER annotation tutorial to help you understand doccano better. Please read the README first, and then work through the tutorial: [A Tutorial For Sequence Labeling Project](https://github.com/chakki-works/doccano/wiki/A-Tutorial-For-Sequence-Labeling-Project).

 I hope you are having a great day!
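
Note on the removed export-format section: since the one-line-per-document JSON description now lives on the wiki, here is a minimal sketch of how a sequence labeling export could be consumed. It assumes the record shape shown in the removed lines (`doc_id`, `text`, `entities`, `username`) and treats each entity as `[start, end, label]` with an end-exclusive character offset, which matches the tutorial sample (`[0, 36, "Title"]`). The helper name `iter_entities` and the sample records are hypothetical, and texts containing line breaks may need extra care, because the removed README states that entity positions ignore them.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_entities(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (doc_id, label, surface text) for every entity span in the export."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        # Assumption: offsets are character-based and end-exclusive.
        for start, end, label in record.get("entities", []):
            yield record["doc_id"], label, text[start:end]


# Hypothetical records in the shape described by the removed section.
sample = [
    '{"doc_id": 1, "text": "Douglas Adams wrote it in 1978.", '
    '"entities": [[0, 13, "Person"], [26, 30, "Date"]], "username": "admin"}',
    "",
    '{"doc_id": 2, "text": "Dune is a novel.", '
    '"entities": [[0, 4, "Title"]], "username": "admin"}',
]

for doc_id, label, surface in iter_entities(sample):
    print(doc_id, label, surface)
```

To read a real export, pass an open file object instead of `sample`, for example `open("export.json", encoding="utf-8")`; the explicit encoding matters because the exported texts may be in Chinese or Japanese.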