BERT BASED WORD ALIGNMENT MODEL FOR JAPANESE-VIETNAMESE
Abstract
Word alignment plays an important role in many subtasks of natural language processing, and it has been studied extensively for many language pairs. Research on word alignment for the Japanese-Vietnamese pair, however, remains limited. Most existing Japanese-Vietnamese word alignments are produced by tools based on statistical or unsupervised learning methods, which yield low accuracy. In this study, we manually build a Japanese-Vietnamese word-level alignment corpus and then implement and train an automatic word alignment model for Japanese-Vietnamese bilingual sentence pairs. Our model outperforms the GIZA++ tool by 20.06 F1 points. The result is a Japanese-Vietnamese word alignment model that is advanced by current standards.
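The abstract reports its comparison with GIZA++ in terms of F1. As a minimal sketch of how such a score is commonly computed for word alignment, the Python snippet below compares a set of predicted links against a manually annotated gold set. The function name `alignment_prf`, the link representation as index pairs, and the toy data are illustrative assumptions, not the authors' evaluation code; the paper's actual protocol (for example, whether it distinguishes sure and possible links) is not stated in the abstract.

```python
from typing import Set, Tuple

# A link pairs a source-token index with a target-token index (assumed representation).
Link = Tuple[int, int]


def alignment_prf(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return precision, recall, and F1 of predicted links against gold links."""
    correct = len(predicted & gold)  # links present in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up indices; not taken from the paper's corpus.
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 2), (2, 2)}

p, r, f1 = alignment_prf(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Under this definition, a gain of 20.06 F1 points would mean the difference between two such scores when each is expressed as a percentage.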
Keywords
BERT, Japanese-Vietnamese, Parallel corpus, SQuAD, Word alignment model