Implement data import from Excel

pull/294/head
Arman Rahman 5 years ago
parent
commit
7b1bbc1aea
16 changed files with 123 additions and 21 deletions
  1. 2
      .gitignore
  2. 42
      README.md
  3. BIN
      app/api/tests/data/example.invalid.1.xlsx
  4. BIN
      app/api/tests/data/example.invalid.2.xlsx
  5. BIN
      app/api/tests/data/example.xlsx
  6. 4
      app/api/tests/data/example_one_column.csv
  7. BIN
      app/api/tests/data/example_one_column.xlsx
  8. BIN
      app/api/tests/data/example_one_column_no_header.xlsx
  9. 45
      app/api/tests/test_api.py
  10. 25
      app/api/utils.py
  11. 4
      app/api/views.py
  12. BIN
      app/server/static/components/examples/upload_seq2seq.xlsx
  13. BIN
      app/server/static/components/examples/upload_text_classification.xlsx
  14. 10
      app/server/static/components/upload_seq2seq.vue
  15. 10
      app/server/static/components/upload_text_classification.vue
  16. 2
      requirements.txt

2
.gitignore

@ -198,3 +198,5 @@ pip-selfcheck.json
node_modules/
bundle/
webpack-stats.json
.vscode

42
README.md

@ -58,20 +58,19 @@ Doccano can be deployed to AWS ([Cloudformation](https://docs.aws.amazon.com/AWS
> Notice: (1) EC2 KeyPair cannot be created automatically, so make sure you have an existing EC2 KeyPair in one region. Or [create one yourself](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). (2) If you want to access doccano via HTTPS in AWS, here is an [instruction](https://github.com/chakki-works/doccano/wiki/HTTPS-setting-for-doccano-in-AWS).
 ## Features
-* Collaborative annotation
-* Multi-Language support
-* Emoji :smile: support
-* (future) Auto labeling
+- Collaborative annotation
+- Multi-Language support
+- Emoji :smile: support
+- (future) Auto labeling
 ## Requirements
-* Python 3.6+
-* Django 2.1.7+
-* Node.js 8.0+
-* Google Chrome(highly recommended)
+- Python 3.6+
+- Django 2.1.7+
+- Node.js 8.0+
+- Google Chrome(highly recommended)
 ## Installation
@ -164,7 +163,9 @@ Finally, to start the server, run the following command:
```bash
python manage.py runserver
```
Optionally, you can change the bind ip and port using the command
```bash
python manage.py runserver <ip>:<port>
```
@ -199,20 +200,26 @@ After creating a project, you will see the "Import Data" page, or click `Import
 <img src="./docs/upload.png" alt="Upload project" width=600>
-You can upload two types of files:
-- `CSV file`: file must contain a header with a `text` column or be one-column csv file.
-- `JSON file`: each line contains a JSON object with a `text` key. JSON format supports line breaks rendering.
+You can upload the following types of files (depending on project type):
+- `Text file`: file must contain one sentence/document per line separated by new lines.
+- `CSV file`: file must contain a header with `"text"` as the first column or be one-column csv file. If using labels the second column must be the labels.
+- `Excel file`: file must contain a header with `"text"` as the first column or be one-column excel file. If using labels the second column must be the labels. Supports multiple sheets as long as format is the same.
+- `JSON file`: each line contains a JSON object with a `text` key. JSON format supports line breaks rendering.
 > Notice: Doccano won't render line breaks in annotation page for sequence labeling task due to the indent problem, but the exported JSON file still contains line breaks.
-`example.txt` (or `example.csv`)
-```python
+`example.txt/csv/xlsx`
+```txt
 EU rejects German call to boycott British lamb.
 President Obama is speaking at the White House.
 He lives in Newark, Ohio.
 ...
 ```
 `example.json`
 ```JSON
 {"text": "EU rejects German call to boycott British lamb."}
 {"text": "President Obama is speaking at the White House."}
@ -220,7 +227,7 @@ He lives in Newark, Ohio.
 ...
 ```
-Any other columns (for csv) or keys (for json) are preserved and will be exported in the `metadata` column or key as is.
+Any other columns (for csv/excel) or keys (for json) are preserved and will be exported in the `metadata` column or key as is.
 Once you select a TXT/JSON file on your computer, click `Upload dataset` button. After uploading the dataset file, we will see the `Dataset` page (or click `Dataset` button list in the left bar). This page displays all the documents we uploaded in one project.
@ -230,7 +237,6 @@ Click `Labels` button in left bar to define your own labels. You should see the
<img src="./docs/label_editor.png" alt="Edit label" width=600>
### Annotation
Now, you are ready to annotate the texts. Just click the `Annotate Data` button in the navigation bar, you can start to annotate the documents you uploaded.
@ -251,11 +257,14 @@ by adding `external_id` to the imported file. For example:
Input file may look like this:
`import.json`
```JSON
{"text": "EU rejects German call to boycott British lamb.", "external_id": 1}
```
and the exported file will look like this:
`output.json`
```JSON
{"doc_id": 2023, "text": "EU rejects German call to boycott British lamb.", "labels": ["news"], "username": "root", "metadata": {"external_id": 1}}
```
@ -272,7 +281,6 @@ As with any software, doccano is under continuous development. If you have reque
Here are some tips might be helpful. [How to Contribute to Doccano Project](https://github.com/chakki-works/doccano/wiki/How-to-Contribute-to-Doccano-Project)
## Contact
For help and feedback, please feel free to contact [the author](https://github.com/Hironsan).

BIN
app/api/tests/data/example.invalid.1.xlsx

BIN
app/api/tests/data/example.invalid.2.xlsx

BIN
app/api/tests/data/example.xlsx

4
app/api/tests/data/example_one_column.csv

@ -0,0 +1,4 @@
text
AAA
BBB
CCC

BIN
app/api/tests/data/example_one_column.xlsx

BIN
app/api/tests/data/example_one_column_no_header.xlsx

45
app/api/tests/test_api.py

@ -759,7 +759,7 @@ class TestUploader(APITestCase):
     def upload_test_helper(self, project_id, filename, file_format, expected_status, **kwargs):
         url = reverse(viewname='doc_uploader', args=[project_id])
-        with open(os.path.join(DATA_DIR, filename)) as f:
+        with open(os.path.join(DATA_DIR, filename), 'rb') as f:
             response = self.client.post(url, data={'file': f, 'format': file_format})
         self.assertEqual(response.status_code, expected_status)
@ -803,6 +803,12 @@ class TestUploader(APITestCase):
                                 file_format='csv',
                                 expected_status=status.HTTP_201_CREATED)
+    def test_can_upload_single_column_csv(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column.csv',
+                                file_format='csv',
+                                expected_status=status.HTTP_201_CREATED)
     def test_cannot_upload_csv_file_does_not_match_column_and_row(self):
         self.upload_test_helper(project_id=self.classification_project.id,
                                 filename='example.invalid.1.csv',
@ -815,6 +821,43 @@ class TestUploader(APITestCase):
                                 file_format='csv',
                                 expected_status=status.HTTP_400_BAD_REQUEST)
+    def test_can_upload_classification_excel(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+    def test_can_upload_seq2seq_excel(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+    def test_can_upload_single_column_excel(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
+    def test_cannot_upload_excel_file_does_not_match_column_and_row(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.invalid.1.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_400_BAD_REQUEST)
+    def test_cannot_upload_excel_file_has_too_many_columns(self):
+        self.upload_test_helper(project_id=self.classification_project.id,
+                                filename='example.invalid.2.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_400_BAD_REQUEST)
+    @override_settings(IMPORT_BATCH_SIZE=1)
+    def test_can_upload_small_batch_size(self):
+        self.upload_test_helper(project_id=self.seq2seq_project.id,
+                                filename='example_one_column_no_header.xlsx',
+                                file_format='excel',
+                                expected_status=status.HTTP_201_CREATED)
     def test_can_upload_classification_jsonl(self):
         self.upload_test_helper(project_id=self.classification_project.id,
                                 filename='classification.jsonl',
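The switch to `'rb'` in `upload_test_helper` matters because an `.xlsx` file is a ZIP archive, not text. The following is a minimal sketch of the difference, assuming a made-up byte payload in place of a real workbook; the temporary file and variable names are illustrative and are not part of the test suite.

```python
import os
import tempfile

# Hypothetical payload: starts with the ZIP signature "PK" and contains bytes
# that are not valid UTF-8, much like the body of a real .xlsx file.
payload = b"PK\x03\x04\xff\xfe\x00binary"

with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode hands back the bytes untouched.
    with open(path, "rb") as f:
        binary_ok = f.read() == payload

    # Text mode has to decode the bytes, which fails on the 0xff byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_failed = False
    except UnicodeDecodeError:
        text_failed = True
finally:
    os.remove(path)
```

Opening every fixture in binary mode also works for the CSV and JSONL fixtures, since the parsers shown in `app/api/utils.py` wrap the upload in `io.TextIOWrapper` themselves.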

25
app/api/utils.py

@ -8,6 +8,7 @@ from random import Random
 from django.db import transaction
 from django.conf import settings
+import pyexcel
 from rest_framework.renderers import JSONRenderer
 from seqeval.metrics.sequence_labeling import get_entities
@ -318,13 +319,32 @@ class CSVParser(FileParser):
     def parse(self, file):
         file = io.TextIOWrapper(file, encoding='utf-8')
         reader = csv.reader(file)
+        yield from ExcelParser.parse_excel_csv_reader(reader)
+class ExcelParser(FileParser):
+    def parse(self, file):
+        excel_book = pyexcel.iget_book(file_type="xlsx", file_content=file.read())
+        # Handle multiple sheets
+        for sheet_name in excel_book.sheet_names():
+            reader = excel_book[sheet_name].to_array()
+            yield from self.parse_excel_csv_reader(reader)
+    @staticmethod
+    def parse_excel_csv_reader(reader):
         columns = next(reader)
         data = []
         if len(columns) == 1 and columns[0] != 'text':
             data.append({'text': columns[0]})
         for i, row in enumerate(reader, start=2):
             if len(data) >= settings.IMPORT_BATCH_SIZE:
                 yield data
                 data = []
-            if len(row) == len(columns) and len(row) >= 2:
+            # Only text column
+            if len(row) == len(columns) and len(row) == 1:
+                data.append({'text': row[0]})
+            # Text, labels and metadata columns
+            elif len(row) == len(columns) and len(row) >= 2:
                 text, label = row[:2]
                 meta = json.dumps(dict(zip(columns[2:], row[2:])))
                 j = {'text': text, 'labels': [label], 'meta': meta}
@ -346,7 +366,6 @@ class JSONParser(FileParser):
                 data = []
             try:
                 j = json.loads(line)
-                #j = json.loads(line.decode('utf-8'))
                 j['meta'] = json.dumps(j.get('meta', {}))
                 data.append(j)
             except json.decoder.JSONDecodeError:
@ -373,6 +392,7 @@ class JSONLRenderer(JSONRenderer):
ensure_ascii=self.ensure_ascii,
allow_nan=not self.strict) + '\n'
class JSONPainter(object):
def paint(self, documents):
@ -406,6 +426,7 @@ class JSONPainter(object):
data.append(d)
return data
class CSVPainter(JSONPainter):
def paint(self, documents):
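The row handling that `CSVParser` and `ExcelParser` now share through `parse_excel_csv_reader` can be tried without Django or pyexcel. The sketch below is an approximation under stated assumptions: `parse_rows` and its `batch_size` argument are hypothetical stand-ins for the static method and `settings.IMPORT_BATCH_SIZE`, and the `else` branch plus the final `yield` are guesses, because the hunk is cut off before that part of the function.

```python
import json


def parse_rows(rows, batch_size=2):
    """Yield batches of documents from an iterable of rows (header row first)."""
    rows = iter(rows)
    columns = next(rows)
    data = []
    # A lone first cell that is not the "text" header is treated as data.
    if len(columns) == 1 and columns[0] != 'text':
        data.append({'text': columns[0]})
    for line_no, row in enumerate(rows, start=2):
        if len(data) >= batch_size:
            yield data
            data = []
        # Only text column
        if len(row) == len(columns) and len(row) == 1:
            data.append({'text': row[0]})
        # Text, labels and metadata columns
        elif len(row) == len(columns) and len(row) >= 2:
            text, label = row[:2]
            meta = json.dumps(dict(zip(columns[2:], row[2:])))
            data.append({'text': text, 'labels': [label], 'meta': meta})
        else:
            # Assumption: the diff does not show how mismatched rows are handled.
            raise ValueError('row {} does not match the header'.format(line_no))
    if data:
        yield data


# Illustrative sheet: text, label, and one extra column that becomes metadata.
sheet = [['text', 'label', 'source'],
         ['AAA', 'pos', 'web'],
         ['BBB', 'neg', 'mail'],
         ['CCC', 'pos', 'web']]
batches = list(parse_rows(sheet, batch_size=2))
```

With a batch size of 2, the three data rows come out as one batch of two documents followed by one batch of one. A header-less single-column sheet such as `[['AAA'], ['BBB']]` keeps its first row as a document.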

4
app/api/views.py

@ -19,7 +19,7 @@ from .models import Project, Label, Document
 from .permissions import IsAdminUserAndWriteOnly, IsProjectUser, IsOwnAnnotation
 from .serializers import ProjectSerializer, LabelSerializer, DocumentSerializer, UserSerializer
 from .serializers import ProjectPolymorphicSerializer
-from .utils import CSVParser, JSONParser, PlainTextParser, CoNLLParser, iterable_to_io
+from .utils import CSVParser, ExcelParser, JSONParser, PlainTextParser, CoNLLParser, iterable_to_io
 from .utils import JSONLRenderer
 from .utils import JSONPainter, CSVPainter
@ -235,6 +235,8 @@ class TextUploadAPI(APIView):
             return JSONParser()
         elif file_format == 'conll':
             return CoNLLParser()
+        elif file_format == 'excel':
+            return ExcelParser()
         else:
             raise ValidationError('format {} is invalid.'.format(file_format))
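The `format` value posted by the upload form is what selects the parser in the branch above. As a rough illustration only, the same dispatch can be written as a lookup table. This is an alternative sketch, not the code in `views.py`: the parser classes are empty stand-ins, `PARSERS` and `select_parser` are hypothetical names, and `ValueError` replaces the `ValidationError` raised by the view.

```python
# Empty stand-ins for the parser classes imported from .utils in the real view.
class CSVParser:
    pass


class ExcelParser:
    pass


class CoNLLParser:
    pass


# Format strings taken from the diff: 'csv' and 'excel' appear in the tests,
# 'conll' and 'excel' appear in the view's if/elif chain.
PARSERS = {
    'csv': CSVParser,
    'excel': ExcelParser,
    'conll': CoNLLParser,
}


def select_parser(file_format):
    """Return a parser instance for the posted format, or reject the format."""
    try:
        return PARSERS[file_format]()
    except KeyError:
        raise ValueError('format {} is invalid.'.format(file_format)) from None


parser = select_parser('excel')
```

A table like this keeps adding a format to a one-line change, while the existing `if`/`elif` chain keeps every case visible in one place; either approach fits a list this short.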

BIN
app/server/static/components/examples/upload_seq2seq.xlsx

BIN
app/server/static/components/examples/upload_text_classification.xlsx

10
app/server/static/components/upload_seq2seq.vue

@ -22,6 +22,16 @@ block select-format-area
)
| JSONL
label.radio
input(
type="radio"
name="format"
value="excel"
v-bind:checked="format === 'excel'"
v-model="format"
)
| Excel
block example-format-area
pre.code-block(v-show="format === 'plain'")
code.plaintext

10
app/server/static/components/upload_text_classification.vue

@ -22,6 +22,16 @@ block select-format-area
)
| JSONL
label.radio
input(
type="radio"
name="format"
value="excel"
v-bind:checked="format === 'excel'"
v-model="format"
)
| Excel
block example-format-area
pre.code-block(v-show="format == 'plain'")
code.plaintext

2
requirements.txt

@ -24,6 +24,8 @@ lockfile==0.12.2
mixer==6.1.3
model-mommy==1.6.0
psycopg2-binary==2.7.7
pyexcel==0.5.14
pyexcel-xlsx==0.5.7
python-dateutil==2.7.3
pytz==2018.4
requests==2.21.0
