IndoNLI : a Natural Language Inference Dataset for Indonesian

Mahendra, Rahmad; Aji, Alham Fikri; Louvan, Samuel; Rahman, Fahrurrozi; Vania, Clara

Files in this item

Name: Mahendra_2021_IndoNLI_VoR.pdf
Size: 366.8 KB
Format: PDF

Item metadata

dc.contributor.author	Mahendra, Rahmad
dc.contributor.author	Aji, Alham Fikri
dc.contributor.author	Louvan, Samuel
dc.contributor.author	Rahman, Fahrurrozi
dc.contributor.author	Vania, Clara
dc.date.accessioned	2024-04-04T15:30:04Z
dc.date.available	2024-04-04T15:30:04Z
dc.date.issued	2021-11-07
dc.identifier	294047985
dc.identifier	ea54067c-9d75-4123-a77b-60c2fc166eff
dc.identifier.citation	Mahendra, R, Aji, A F, Louvan, S, Rahman, F & Vania, C 2021, IndoNLI: a Natural Language Inference Dataset for Indonesian. in IndoNLI: A Natural Language Inference Dataset for Indonesian. Association for Computational Linguistics, pp. 10511–10527. https://doi.org/10.18653/v1/2021.emnlp-main.821	en
dc.identifier.isbn	9781955917094
dc.identifier.uri	https://hdl.handle.net/10023/29606
dc.description.abstract	We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research.
dc.format.extent	17
dc.format.extent	375650
dc.language.iso	eng
dc.publisher	Association for Computational Linguistics
dc.relation.ispartof	IndoNLI	en
dc.subject	QA75 Electronic computers. Computer science	en
dc.subject	NS	en
dc.subject.lcc	QA75	en
dc.title	IndoNLI : a Natural Language Inference Dataset for Indonesian	en
dc.type	Conference item	en
dc.contributor.institution	University of St Andrews. School of Computer Science	en
dc.identifier.doi	10.18653/v1/2021.emnlp-main.821

This item appears in the following Collection(s)

University of St Andrews Research
