AN IMPROVED MULTI-VISION CONTEXTUAL ATTENTION MODEL FOR VIETNAMESE VISUAL-BASED QUESTION ANSWERING
Abstract
Visual Question Answering (VQA) lies at the intersection of Computer Vision and Natural Language Processing, offering both scientific significance and practical applications. Integrating VQA models into mobile devices can assist blind and visually impaired individuals in accessing and understanding image content. A common approach extracts features from different image regions to capture local context; however, it often overlooks the global context, which limits the model's ability to aggregate information and make accurate inferences. Recent methods leverage Vision Transformers to extract both global and local features from images, enhancing model performance. In addition, multimodal attention mechanisms optimize the integration of image and question features, allowing the model to focus on key features and better understand the context. While most VQA models are designed for English datasets, research on Vietnamese VQA (ViVQA) remains limited. In this paper, we propose an improved model based on Multi-Vision Contextual Attention that achieves 62.41% accuracy on the ViVQA dataset, a significant improvement over the original model's 60%.
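To make the approach concrete, the following is a minimal PyTorch sketch of the fusion idea the abstract describes: local region features (e.g., from a ResNet), global patch features (e.g., from a Swin Transformer), and question token features (e.g., from PhoBERT) are projected into a shared space and fused with multimodal cross-attention before answer classification. All dimensions, module choices, and the answer-vocabulary size are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class MultiVisionContextualFusion(nn.Module):
    """Illustrative fusion of local, global, and question features.
    A hypothetical sketch, not the paper's exact architecture."""
    def __init__(self, local_dim=2048, global_dim=1024, text_dim=768,
                 hidden=512, num_heads=8, num_answers=1000):
        super().__init__()
        # Project the three feature streams into a shared space.
        self.local_proj = nn.Linear(local_dim, hidden)    # e.g., ResNet regions
        self.global_proj = nn.Linear(global_dim, hidden)  # e.g., Swin patches
        self.text_proj = nn.Linear(text_dim, hidden)      # e.g., PhoBERT tokens
        # Question tokens attend over the combined visual sequence.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads,
                                                batch_first=True)
        # num_answers is a placeholder answer-vocabulary size.
        self.classifier = nn.Sequential(nn.LayerNorm(hidden),
                                        nn.Linear(hidden, num_answers))

    def forward(self, local_feats, global_feats, question_feats):
        # local_feats: (B, R, local_dim), global_feats: (B, P, global_dim),
        # question_feats: (B, T, text_dim)
        visual = torch.cat([self.local_proj(local_feats),
                            self.global_proj(global_feats)], dim=1)
        query = self.text_proj(question_feats)
        fused, _ = self.cross_attn(query, visual, visual)
        # Pool the attended question tokens and classify the answer.
        return self.classifier(fused.mean(dim=1))

if __name__ == "__main__":
    model = MultiVisionContextualFusion()
    # Random tensors stand in for encoder outputs (batch of 2):
    # 36 region vectors, 49 patch vectors, 20 question tokens.
    logits = model(torch.randn(2, 36, 2048),
                   torch.randn(2, 49, 1024),
                   torch.randn(2, 20, 768))
    print(logits.shape)  # torch.Size([2, 1000])

Here random tensors stand in for the encoder outputs so the sketch runs without downloading pretrained weights; in practice the three streams would come from the pretrained encoders named in the keywords.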
Keywords
multimodal, natural language, PhoBERT, ResNet, Swin Transformer, Vietnamese language, visual question answering (VQA)
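Since the keywords name PhoBERT as the question encoder, the short snippet below shows one way the Vietnamese question features consumed by the sketch above might be produced, using the public vinai/phobert-base checkpoint from the Hugging Face Hub. The example question is illustrative, and the authors' preprocessing may differ; note in particular that PhoBERT expects word-segmented input, which this example omits.

import torch
from transformers import AutoTokenizer, AutoModel

# Load the public PhoBERT-base checkpoint and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert = AutoModel.from_pretrained("vinai/phobert-base")

# Illustrative ViVQA-style question: "What color is the car?"
question = "Màu của chiếc xe là gì?"
inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    # (1, T, 768) token features, usable as question_feats above.
    token_feats = phobert(**inputs).last_hidden_state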