ENHANCING EMOTION RECOGNITION THROUGH MULTIMODAL CONTEXTUAL FEATURE INTEGRATION
Abstract
In the digital age, the demand for intelligent systems capable of understanding users’ emotions continues to grow. However, existing emotion recognition methods, whether unimodal or multimodal, often struggle to integrate information from multiple sources cohesively and to exploit contextual cues effectively, leaving models susceptible to noise and incomplete input data. To address this limitation, this research introduces MCFF (Multi-Modal Contextual Feature Fusion), a multimodal deep learning architecture designed to leverage visual, audio, and textual information simultaneously. Experimental results on the IEMOCAP dataset yield an accuracy of 82.89% and an F1-score of 82.86%, demonstrating MCFF’s competitive performance against other state-of-the-art methods. MCFF shows broad potential for application in intelligent interactive systems, from enhancing experiences in online education and virtual assistants to providing support in mental healthcare.
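To make the fusion idea concrete, the sketch below shows one common way to combine utterance-level features from three modality encoders into a single emotion classifier. It is illustrative only, not the paper’s exact MCFF design: the feature dimensions, the projection-plus-attention fusion, and the four-class output (a typical IEMOCAP setup) are all assumptions for the example.

```python
# Illustrative multimodal fusion sketch (hypothetical; not the exact MCFF architecture).
# Assumes text, audio, and visual encoders have already produced fixed-size
# utterance-level feature vectors; the dimensions below are placeholders.
import torch
import torch.nn as nn


class MultimodalFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, visual_dim=1024,
                 hidden_dim=256, num_classes=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Attention over the three modality "tokens" lets each modality
        # attend to contextual cues carried by the others.
        self.fusion = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        # Stack the projected modality features as a length-3 token sequence.
        tokens = torch.stack(
            [self.text_proj(text_feat),
             self.audio_proj(audio_feat),
             self.visual_proj(visual_feat)],
            dim=1)                                    # (batch, 3, hidden_dim)
        fused, _ = self.fusion(tokens, tokens, tokens)
        pooled = fused.mean(dim=1)                    # average over modalities
        return self.classifier(pooled)                # emotion logits


# Example with random tensors standing in for encoder outputs.
model = MultimodalFusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 4])
```

In practice the placeholder inputs would come from pretrained unimodal backbones such as the text, speech, and vision models cited in the references, with the fusion and classifier layers trained end to end on the emotion labels.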
References
Adoma, A. F., Henry, N.-M., & Chen, W. (2020). Comparative analyses of BERT, RoBERTa, DistilBERT, and XLNet for text-based emotion recognition. 2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 117–121. https://ieeexplore.ieee.org/abstract/document/9317379/
Bhosale, S., Chakraborty, R., & Kopparapu, S. K. (2020). Deep encoded linguistic and acoustic cues for attention based end to end speech emotion recognition. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7189–7193. https://ieeexplore.ieee.org/abstract/document/9054621/
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. https://doi.org/10.1007/s10579-008-9076-6
Cheng, Z., Cheng, Z.-Q., He, J.-Y., Wang, K., Lin, Y., Lian, Z., Peng, X., & Hauptmann, A. (2024). Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning. Advances in Neural Information Processing Systems, 37, 110805–110853.
Ding, J., Chen, X., Lu, P., Yang, Z., Li, X., & Du, Y. (2023). DialogueINAB: An interaction neural network based on attitudes and behaviors of interlocutors for dialogue emotion recognition. The Journal of Supercomputing, 79(18), 20481–20514. https://doi.org/10.1007/s11227-023-05439-1
Fu, H., Zhuang, Z., Wang, Y., Huang, C., & Duan, W. (2023). Cross-corpus speech emotion recognition based on multi-task learning and subdomain adaptation. Entropy, 25(1), 124.
Goswami, S. A., Dave, S., & Patel, K. C. (2024). The Need for Emotional Intelligence in Human-Computer Interactions. In Harnessing Artificial Emotional Intelligence for Improved Human-Computer Interactions (pp. 82–106). IGI Global. https://www.igi-global.com/chapter/the-need-for-emotional-intelligence-in-human-computer-interactions/349198
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.
Islam, S., Elmekki, H., Elsebai, A., Bentahar, J., Drawel, N., Rjoub, G., & Pedrycz, W. (2024). A comprehensive survey on applications of transformers for deep learning tasks. Expert Systems with Applications, 241, 122666.
Jia, N., Zheng, C., & Sun, W. (2022). A multimodal emotion recognition model integrating speech, video and MoCAP. Multimedia Tools and Applications, 81(22), 32265–32286. https://doi.org/10.1007/s11042-022-13091-9
Joshi, A., Bhat, A., Jain, A., Singh, A. V., & Modi, A. (2022). COGMEN: COntextualized GNN based Multimodal Emotion recognitioN (No. arXiv:2205.02455). arXiv. https://doi.org/10.48550/arXiv.2205.02455
Khan, M., Gueaieb, W., El Saddik, A., & Kwon, S. (2024). MSER: Multimodal speech emotion recognition using cross-attention with deep fusion. Expert Systems with Applications, 245, 122946.
Khan, M., Tran, P.-N., Pham, N. T., El Saddik, A., & Othmani, A. (2025). MemoCMT: Multimodal emotion recognition using cross-modal transformer-based feature fusion. Scientific Reports, 15(1), 5473.
Li, Z., Tang, F., Zhao, M., & Zhu, Y. (2022). EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition (No. arXiv:2203.13504). arXiv. https://doi.org/10.48550/arXiv.2203.13504
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., & Dong, L. (2022). Swin Transformer V2: Scaling up capacity and resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12009–12019. http://openaccess.thecvf.com/content/CVPR2022/html/Liu_Swin_Transformer_V2_Scaling_Up_Capacity_and_Resolution_CVPR_2022_paper.html
Luria, M., Zoran, A., & Forlizzi, J. (2019). Challenges of Designing HCI for Negative Emotions (No. arXiv:1908.07577). arXiv. https://doi.org/10.48550/arXiv.1908.07577
Ly, D., Tran, N., Nguyen, H. Q., Nguyen, T., Nguyen, L., & Nguyen, H. (2025). A Graph Attention Network-Enhanced Approach to Facial Expression Recognition Using Hybrid Pixel-Geometry Features. International Journal of Intelligent Engineering & Systems, 18(5).
Naderi, N., & Nasersharif, B. (2023). Cross corpus speech emotion recognition using transfer learning and attention-based fusion of wav2vec2 and prosody features. Knowledge-Based Systems, 277, 110814.
Nguyen, C.-V. T., Mai, A.-T., Le, T.-S., Kieu, H.-D., & Le, D.-T. (2023). Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 15154–15167. https://doi.org/10.18653/v1/2023.emnlp-main.937
Nguyen, H., Tran, N., Ly, D., Tran, A., Nguyen, A., & Vo, H. (2024). A Model for Song Recommendation Based on Facial Emotion Analysis and Musical Emotion. International Journal of Intelligent Engineering & Systems, 17(4). https://inass.org/wp-content/uploads/2024/03/2024083177-2.pdf
Patamia, R. A., Santos, P. E., Acheampong, K. N., Ekong, F., Sarpong, K., & Kun, S. (2023). Multimodal speech emotion recognition using modality-specific self-supervised frameworks. 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 4134–4141. https://ieeexplore.ieee.org/abstract/document/10394418/
Roy, A. K., Kathania, H. K., Sharma, A., Dey, A., & Ansari, M. S. A. (2024). ResEmoteNet: Bridging accuracy and loss reduction in facial emotion recognition. IEEE Signal Processing Letters. https://ieeexplore.ieee.org/abstract/document/10812829/
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2020). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter (No. arXiv:1910.01108). arXiv. https://doi.org/10.48550/arXiv.1910.01108
Shayaninasab, M., & Babaali, B. (2024). Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers (No. arXiv:2402.07327). arXiv. https://doi.org/10.48550/arXiv.2402.07327
Ta, P., Tran, N., Nguyen, H., & Nguyen, H. D. (2025). Detecting signs of depression on social media: A machine learning analysis and evaluation. Sustainable Futures, 100827.
Tran, N., Ta, P., Nguyen, H., Nguyen, H. D., & Le, A.-C. (2025). Hybrid contextual and sentiment-based machine learning model for identifying depression risk in social media. Expert Systems with Applications, 291, 128505.
Zhang, X., Fu, X., Qi, G., & Zhang, N. (2024). A multi‐scale feature fusion convolutional neural network for facial expression recognition. Expert Systems, 41(4), e13517. https://doi.org/10.1111/exsy.13517