# Tutorial

This tutorial demonstrates how to use doccano to complete a named entity recognition annotation task for an example science fiction dataset.

## Dataset

The dataset is a JSON file named `books.json` that contains science fiction book descriptions in several languages, with one JSON object per line. We need to annotate entities such as names, book titles, and dates.

```json
{"text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film."}
{"text": "《三体》是中国大陆作家刘慈欣于2006年5月至12月在《科幻世界》杂志上连载的一部长篇科幻小说,出版后成为中国大陆最畅销的科幻长篇小说之一。2008年,该书的单行本由重庆出版社出版。本书是三体系列(系列原名为:地球往事三部曲)的第一部,该系列的第二部《三体II:黑暗森林》已经于2008年5月出版。2010年11月,第三部《三体III:死神永生》出版发行。 2011年,“地球往事三部曲”在台湾陆续出版。小说的英文版获得美国科幻奇幻作家协会2014年度“星云奖”提名,并荣获2015年雨果奖最佳小说奖。"}
{"text": "『銀河英雄伝説』(ぎんがえいゆうでんせつ)は、田中芳樹によるSF小説。また、これを原作とするアニメ、漫画、コンピュータゲーム、朗読、オーディオブック等の関連作品。略称は『銀英伝』(ぎんえいでん)。原作は累計発行部数が1500万部を超えるベストセラー小説である。1982年から2009年6月までに複数の版で刊行され、発行部数を伸ばし続けている。"}
```
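If you build a similar file yourself, a quick check before importing can catch malformed lines. The following is a minimal sketch, assuming the one-object-per-line layout shown above with a string `text` field. The sample records and the helper names `to_json_lines` and `check_json_lines` are illustrative and are not part of doccano.

```python
import json

# Hypothetical sample records; real descriptions would be much longer.
records = [
    {"text": "The Hitchhiker's Guide to the Galaxy is a comedy science fiction series created by Douglas Adams."},
    {"text": "《三体》是刘慈欣创作的长篇科幻小说。"},
]

def to_json_lines(items):
    """Serialize each record as one JSON object per line."""
    # ensure_ascii=False keeps non-English text readable in the output.
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

def check_json_lines(content):
    """Parse every non-empty line and confirm it has a string 'text' field."""
    parsed = []
    for number, line in enumerate(content.split("\n"), start=1):
        if not line.strip():
            continue
        record = json.loads(line)  # raises ValueError on malformed JSON
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            raise ValueError(f"line {number}: expected an object with a string 'text' field")
        parsed.append(record)
    return parsed

content = to_json_lines(records)
checked = check_json_lines(content)
print(len(checked), "records parsed")
```

To produce a file on disk, write `content` to `books.json` with `encoding="utf-8"`. The check only validates the layout assumed here; it does not guarantee that doccano accepts the file.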
## Create a project

To start, let's create a new project for this task.

1. Log in to doccano with the superuser account.
![Sign in as a superuser.](./images/tutorial/signin.png)
2. To create your project, go to the project list page and click **Create**.
3. Fill out the project details. For this tutorial, name the project `sequence labeling for books`, write a description, and choose the sequence labeling task type.
![Creating a project.](./images/tutorial/create_project.png)
## Import a dataset

After creating a project, the **Dataset** page appears.

To import a dataset:

1. Click **Actions** > **Import Dataset**. You should see the following screen:
![Importing a dataset.](./images/tutorial/import_dataset.png)
2. Choose **JSON** and click **Select a file**.
3. Select `books.json`. The file loads automatically.
## Define labels

Define the labels to use for your annotation project:

1. Click **Labels** in the left side menu. You should see the label editor page.
2. On the label editor page, create labels by specifying label text, a shortcut key, background color, and text color. For this tutorial, let's create some entities related to science fiction, as shown below.
![Defining labels.](./images/tutorial/define_labels.png)
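Before entering labels in the editor, it can help to plan them as data and confirm that no two labels share a shortcut key. This is only a planning sketch: the label names are the ones used in the annotated result at the end of this tutorial, while the dictionary keys, shortcut keys, and colors are made-up assumptions, not doccano's own format.

```python
from collections import Counter

# Label names come from this tutorial's annotated result.
# Shortcut keys and colors are illustrative assumptions.
labels = [
    {"text": "Title", "shortcut": "t", "background_color": "#1f77b4", "text_color": "#ffffff"},
    {"text": "Genre", "shortcut": "g", "background_color": "#2ca02c", "text_color": "#ffffff"},
    {"text": "Person", "shortcut": "p", "background_color": "#d62728", "text_color": "#ffffff"},
    {"text": "Date", "shortcut": "d", "background_color": "#ff7f0e", "text_color": "#000000"},
    {"text": "Other", "shortcut": "o", "background_color": "#7f7f7f", "text_color": "#ffffff"},
]

def duplicate_shortcuts(items):
    """Return the shortcut keys that are assigned to more than one label."""
    counts = Counter(item["shortcut"] for item in items)
    return sorted(key for key, count in counts.items() if count > 1)

print(duplicate_shortcuts(labels))
```

An empty list means every label has its own shortcut key.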
## Add members

Members are users who can participate in labeling activities. To add members:

1. Click **Members** in the left side menu. If you are not the project administrator, this menu item won't appear.
![](images/faq/add_annotator/select_members.png)
2. Click **Add** to display the Add Member form.
![](images/faq/add_annotator/select_user.png)
3. In the form, select the user you want to add to the project and the role to assign. If there is no user to select, you need to create the user first. See the [FAQ](./faq.md) for instructions.
4. Click **Save**.
## Annotation

Next, let's annotate the texts.

Click **Start annotation** in the navigation bar to start annotating the documents.

![Annotating named entities.](./images/tutorial/annotation.png)
## Export the dataset

After finishing the annotation step, let's download the annotated data.

1. Go to the **Dataset** page and click **Actions** > **Export Dataset**.
2. Select an export format. For this tutorial, choose the JSONL format.
3. Click **Export**. You should see this screen:
![Exporting a dataset.](./images/tutorial/export_dataset.png)

Below is the annotated result for this tutorial.

`sequence_labeling_for_books.json`

```json
{"doc_id": 33,
"text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams. Originally a radio comedy broadcast on BBC Radio 4 in 1978, it was later adapted to other formats, including stage shows, novels, comic books, a 1981 TV series, a 1984 video game, and 2005 feature film.",
"labels": [[0, 36, "Title"], [63, 67, "Title"], [69, 75, "Title"], [78, 82, "Title"], [89, 111, "Genre"], [130, 143, "Person"], [158, 180, "Genre"], [184, 193, "Other"], [199, 203, "Date"], [254, 265, "Genre"], [267, 273, "Genre"], [275, 286, "Genre"], [290, 294, "Date"], [295, 304, "Genre"], [308, 312, "Date"], [313, 323, "Genre"], [329, 333, "Date"], [334, 346, "Genre"]],
"username": "admin"}
```
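
Each entry in `labels` reads as `[start, end, label]`. In the record above, the offsets behave as character positions with an inclusive start and an exclusive end, so slicing the text recovers each annotated span. The sketch below assumes that layout and uses a shortened copy of the record (first sentence, first six labels). The helper name `spans_by_label` is illustrative.

```python
import json
from collections import defaultdict

# A shortened copy of the exported record above: first sentence, first six labels.
exported_line = json.dumps({
    "doc_id": 33,
    "text": "The Hitchhiker's Guide to the Galaxy (sometimes referred to as HG2G, HHGTTGor H2G2) is a comedy science fiction series created by Douglas Adams.",
    "labels": [[0, 36, "Title"], [63, 67, "Title"], [69, 75, "Title"],
               [78, 82, "Title"], [89, 111, "Genre"], [130, 143, "Person"]],
    "username": "admin",
})

def spans_by_label(line):
    """Group the annotated text spans of one exported record by label name."""
    record = json.loads(line)
    text = record["text"]
    grouped = defaultdict(list)
    for start, end, label in record["labels"]:
        # Assumes character offsets: start inclusive, end exclusive.
        grouped[label].append(text[start:end])
    return dict(grouped)

entities = spans_by_label(exported_line)
for label, spans in entities.items():
    print(label, spans)
```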
Congratulations! You just explored how to use doccano for a sequence labeling project.