INTEGRATING SENTIMENT FACTORS INTO THE CONTEXT OF A MULTIMODAL DIALOGUE SYSTEM
Main Article Content
Abstract
Text-based dialogue systems built on the seq2seq model have been studied extensively in recent research. Beyond purely textual conversation, however, images and emotions are also important factors. In 2021, Fei et al. presented MOD, a system that can converse with text and visuals and classify emotions. Despite MOD's promising performance, its input context does not exploit the emotional element. In this article, we improve MOD by binding the sentiment factor to the other two factors (text and image), enriching the information in the context and helping the model capture it more deeply. We also incorporate image features extracted by a CNN into the input context to improve the quality of the visual features. Our model improves the BLEU-4 score by 0.19 and reduces Perplexity by 4.6 compared with MOD. These results show that our model, which integrates the sentiment factor into the context, performs better.
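The core idea of the abstract, binding a sentiment factor to the text and image factors in the input context, can be sketched minimally as concatenating per-modality embeddings into one context vector. The dimensions and embedding sources below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

# Hypothetical dimensions; the abstract does not specify them.
TEXT_DIM, IMAGE_DIM, SENT_DIM = 768, 512, 64

rng = np.random.default_rng(0)

def build_context(text_emb, image_emb, sentiment_emb):
    """Fuse the three factors into a single context vector by
    concatenation, the simplest form of the binding described here."""
    return np.concatenate([text_emb, image_emb, sentiment_emb])

text_emb = rng.standard_normal(TEXT_DIM)       # utterance embedding
image_emb = rng.standard_normal(IMAGE_DIM)     # CNN-extracted visual features
sentiment_emb = rng.standard_normal(SENT_DIM)  # sentiment-label embedding

context = build_context(text_emb, image_emb, sentiment_emb)
print(context.shape)  # (1344,)
```

In the actual model the fused context would feed a transformer-style dialogue generator; concatenation here simply stands in for whatever fusion the full architecture uses.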
Keywords
context-aware dependency, large language model, multimodal dialogue system, multitask learning, sentiment factor
Article Details
References
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Fei, Z., Li, Z., Zhang, J., Feng, Y., & Zhou, J. (2021). Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark. ArXiv, abs/2109.01839.
Weidong He, Zhi Li, Dongcai Lu, Enhong Chen, Tong Xu, Baoxing Huai, & Jing Yuan. (2020). Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 2755-2764. https://doi.org/10.1145/3394171.3413679
Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.
Saha, A., Khapra, M., & Sankaranarayanan, K. (2018, April). Towards building large scale multimodal domain-aware conversation systems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Smith, L. N. (2017, March). Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on Applications of computer vision (WACV) (pp. 464-472). IEEE.
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ArXiv, abs/1905.11946.
Wang, Y., Ke, P., Zheng, Y., Huang, K., Jiang, Y., Zhu, X., & Huang, M. (2020). A Large-Scale Chinese Short-Text Conversation Dataset. NLPCC.
Zhang, Y., Sun, S., Galley, M., Chen, Y. C., Brockett, C., Gao, X., ... & Dolan, B. (2019). DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
Zhang, Y., Sun, S., Gao, X., Fang, Y., Brockett, C., Galley, M., ... & Dolan, B. (2021). Joint retrieval and generation training for grounded text generation. arXiv preprint arXiv:2105.06597.