INTEGRATING SENTIMENT FACTORS INTO THE CONTEXT OF A MULTIMODAL DIALOGUE SYSTEM

Nguyễn Thuỳ Dương Lê , Ngọc Tuấn Lê , Hồng Bửu Long Nguyễn

Abstract

Text-based dialogue systems built on the seq2seq model have been used extensively in recent research. However, beyond purely textual conversation, images and emotions are also important factors. In 2021, Fei et al. presented MOD, a system that can converse with text and images and classify emotions. Despite MOD's promising performance, its input context does not exploit the emotional element. In this article, we improve MOD by binding the sentiment factor to the other two factors (text and image), enriching the information in the context and helping the model capture the context more deeply. We also incorporate image features extracted by a CNN into the input context to improve the quality of the visual features. Our model improves the BLEU-4 score by 0.19 and the perplexity by 4.6 compared with MOD. These results show that our model, which integrates the sentiment factor into the context, performs better.
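The fusion of the three factors into a single input context can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: the module names, dimensions, emotion-label vocabulary, and the simple concatenation scheme are all assumptions.

```python
import torch
import torch.nn as nn

class MultimodalContextEncoder(nn.Module):
    """Hypothetical sketch: fuse text, image, and sentiment
    features into one context sequence for a seq2seq model."""

    def __init__(self, vocab_size=1000, num_emotions=7,
                 img_feat_dim=1280, d_model=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # text tokens
        self.emo_emb = nn.Embedding(num_emotions, d_model)  # sentiment label
        self.img_proj = nn.Linear(img_feat_dim, d_model)    # CNN feature -> d_model

    def forward(self, token_ids, img_feat, emo_id):
        # token_ids: (B, T); img_feat: (B, img_feat_dim); emo_id: (B,)
        text = self.tok_emb(token_ids)              # (B, T, d_model)
        img = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, d_model)
        emo = self.emo_emb(emo_id).unsqueeze(1)     # (B, 1, d_model)
        # Bind the sentiment factor to the text and image factors by
        # concatenating all three along the sequence dimension.
        return torch.cat([emo, img, text], dim=1)   # (B, T + 2, d_model)

enc = MultimodalContextEncoder()
ctx = enc(torch.randint(0, 1000, (2, 5)),  # 2 utterances of 5 tokens
          torch.randn(2, 1280),            # pooled CNN image features
          torch.tensor([3, 1]))            # emotion labels
```

The fused sequence can then be fed to the seq2seq encoder in place of a text-only context; the image feature dimension (1280) is chosen here to match a pooled EfficientNet-style backbone, but any CNN feature size works with the projection layer.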


References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000-6010.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Fei, Z., Li, Z., Zhang, J., Feng, Y., & Zhou, J. (2021). Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark. ArXiv, abs/2109.01839.
He, W., Li, Z., Lu, D., Chen, E., Xu, T., Huai, B., & Yuan, J. (2020). Multimodal dialogue systems via capturing context-aware dependencies of semantic elements. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 2755-2764. https://doi.org/10.1145/3394171.3413679
Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.
Saha, A., Khapra, M., & Sankaranarayanan, K. (2018, April). Towards building large scale multimodal domain-aware conversation systems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Smith, L. N. (2017, March). Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on Applications of computer vision (WACV) (pp. 464-472). IEEE.
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ArXiv, abs/1905.11946.
Wang, Y., Ke, P., Zheng, Y., Huang, K., Jiang, Y., Zhu, X., & Huang, M. (2020). A Large-Scale Chinese Short-Text Conversation Dataset. NLPCC.
Zhang, Y., Sun, S., Galley, M., Chen, Y. C., Brockett, C., Gao, X.,... Dolan, B. (2019). Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
Zhang, Y., Sun, S., Gao, X., Fang, Y., Brockett, C., Galley, M.,... Dolan, B. (2021). Joint retrieval and generation training for grounded text generation. arXiv preprint arXiv:2105.06597.