IndoNLI : a Natural Language Inference Dataset for Indonesian
Item metadata
dc.contributor.author | Mahendra, Rahmad | |
dc.contributor.author | Aji, Alham Fikri | |
dc.contributor.author | Louvan, Samuel | |
dc.contributor.author | Rahman, Fahrurrozi | |
dc.contributor.author | Vania, Clara | |
dc.date.accessioned | 2024-04-04T15:30:04Z | |
dc.date.available | 2024-04-04T15:30:04Z | |
dc.date.issued | 2021-11-07 | |
dc.identifier | 294047985 | |
dc.identifier | ea54067c-9d75-4123-a77b-60c2fc166eff | |
dc.identifier.citation | Mahendra, R, Aji, A F, Louvan, S, Rahman, F & Vania, C 2021, IndoNLI : a Natural Language Inference Dataset for Indonesian. in IndoNLI : A Natural Language Inference Dataset for Indonesian. Association for Computational Linguistics, pp. 10511–10527. https://doi.org/10.18653/v1/2021.emnlp-main.821 | en |
dc.identifier.isbn | 9781955917094 | |
dc.identifier.uri | https://hdl.handle.net/10023/29606 | |
dc.description.abstract | We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research. | |
dc.format.extent | 17 | |
dc.format.extent | 375650 | |
dc.language.iso | eng | |
dc.publisher | Association for Computational Linguistics | |
dc.relation.ispartof | IndoNLI | en |
dc.subject | QA75 Electronic computers. Computer science | en |
dc.subject | NS | en |
dc.subject.lcc | QA75 | en |
dc.title | IndoNLI : a Natural Language Inference Dataset for Indonesian | en |
dc.type | Conference item | en |
dc.contributor.institution | University of St Andrews. School of Computer Science | en |
dc.identifier.doi | 10.18653/v1/2021.emnlp-main.821 | |
Items in the St Andrews Research Repository are protected by copyright, with all rights reserved, unless otherwise indicated.