Dataset

The organizers asked volunteer data contributors to photograph their own or their friends' readable receipts, yielding a dataset of more than 2,000 receipt images from more than 50 contributors. For each receipt image in the dataset, a human annotator is assigned to transcribe each text line. A text line is considered clean if the annotator can extract its text easily. The number of clean text lines is then used to produce the quality score of the receipt image.
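The exact formula for the quality score is not given above; a natural reading, consistent with the example scores below (e.g. 0.9274), is the fraction of annotated text lines judged clean. The sketch below encodes that assumption:

```python
# Hedged sketch: one plausible way the per-image quality score could be
# derived -- the share of annotated text lines the annotator judged clean.
# The organizers' exact formula is an assumption here.

def quality_score(num_clean_lines: int, num_total_lines: int) -> float:
    """Fraction of text lines that are clean (easily readable)."""
    if num_total_lines == 0:
        return 0.0
    return num_clean_lines / num_total_lines

print(quality_score(46, 50))  # 46 of 50 lines clean -> 0.92
```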

Annotation format

Train data:

  • Field names: img_id, anno_polygons, anno_num, anno_texts, anno_labels, anno_image_quality

  • Example: mcocr_warmup_fde7c60204d91e7974d8543699f5c62e_00237.jpg, [{'category_id': 15, 'segmentation': [[453.3, 340.7, 453.3, 355.4, 302.7, 355.4, 302.7, 340.7]], 'area': 2100, 'bbox': [303, 341, 150, 14], 'width': 768, 'height': 1024}, {'category_id': 16, 'segmentation': [[469.5, 364.3, 469.5, 383.3, 282.6, 383.3, 282.6, 364.3]], 'area': 3553, 'bbox': [283, 364, 187, 19], 'width': 768, 'height': 1024}, {'category_id': 17, 'segmentation': [[562, 870.8, 562, 891.9, 407.7, 891.9, 407.7, 870.8]], 'area': 3234, 'bbox': [408, 871, 154, 21], 'width': 768, 'height': 1024}, {'category_id': 18, 'segmentation': [[274.1, 818.8, 274.1, 839, 191.6, 839, 191.6, 818.8]], 'area': 1640, 'bbox': [192, 819, 82, 20], 'width': 768, 'height': 1024}, {'category_id': 18, 'segmentation': [[565.2, 812.7, 565.2, 829.6, 505.7, 829.6, 505.7, 812.7]], 'area': 1003, 'bbox': [506, 813, 59, 17], 'width': 768, 'height': 1024}], 50, MINIMART ANAN|||Chợ Sủi Phú Thị Gia Lâm|||Số GD: 000AC2212008001023 Ngày: 09/08/2020-09:26, SELLER|||ADDRESS|||TIMESTAMPS, 0.9274
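As the example row shows, `anno_polygons` is a Python-literal list of COCO-style dictionaries, while `anno_texts` and `anno_labels` are `|||`-separated strings. A minimal parsing sketch (field handling inferred from the example row above, not from an official parser):

```python
import ast

# Hedged sketch: turn one train-CSV row (as a dict of strings, e.g. from
# csv.DictReader) into structured fields. The '|||' separator and the
# Python-literal polygon list follow the example row in the documentation.

def parse_train_row(row: dict) -> dict:
    polygons = ast.literal_eval(row["anno_polygons"])  # list of COCO-style dicts
    return {
        "img_id": row["img_id"],
        "polygons": polygons,
        "num_lines": int(row["anno_num"]),
        "texts": row["anno_texts"].split("|||"),
        "labels": row["anno_labels"].split("|||"),
        "quality": float(row["anno_image_quality"]),
    }

# Shortened illustration based on the example row above.
row = {
    "img_id": "mcocr_warmup_fde7c60204d91e7974d8543699f5c62e_00237.jpg",
    "anno_polygons": "[{'category_id': 15, 'bbox': [303, 341, 150, 14]}]",
    "anno_num": "50",
    "anno_texts": "MINIMART ANAN|||Chợ Sủi Phú Thị Gia Lâm",
    "anno_labels": "SELLER|||ADDRESS",
    "anno_image_quality": "0.9274",
}
parsed = parse_train_row(row)
print(parsed["labels"])  # ['SELLER', 'ADDRESS']
```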


Test data:

  • Field names: img_id, anno_texts, anno_image_quality

  • Example: mcocr_warmup_e3e80ba188c7dd6a4f4685486a36b76b_00297.jpg, abc abc abc, 0.5

Dataset details:

(The following information is for the warm-up dataset; the other datasets share the same structure.) Dataset structure:

  • Folder "./warmup_images" contains raw receipts.

  • File "warmup_train.csv" contains annotations as described above.

  • File "warmup_test.csv" contains the list of test receipts and the predicted fields (as described above). Before submission, please rename this file to "results.csv" and zip it.
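The packaging step above (rename the prediction file to "results.csv", then zip it) can be sketched as follows; the filenames come from the text, while the output zip name `results.zip` is an assumption:

```python
import zipfile
from pathlib import Path

# Hedged sketch of the submission packaging described above: copy the
# filled-in prediction CSV to results.csv and zip it. The zip filename
# "results.zip" is an assumption; check the competition page for the
# required archive name.

def make_submission(pred_csv: str = "warmup_test.csv",
                    out_zip: str = "results.zip") -> None:
    results = Path("results.csv")
    results.write_bytes(Path(pred_csv).read_bytes())  # copy under new name
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(results, arcname="results.csv")
```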

  • File "label_dict.json" contains the predefined labels and their IDs.

  • File "README.md" contains brief information about the data and the data license.

License Agreement:

  • This dataset is provided by VNDAG for research purposes only. You must sign the user agreement form and send it to vndag@vietnlp.com to register before using any data from VNDAG.

Download links:

Links are available on the official competition page on CodaLab (after joining, please click the Participate tab to get the data).
