ENHANCING EMOTION RECOGNITION THROUGH THE INTEGRATION OF MULTIMODAL CONTEXTUAL FEATURES

Nguyễn Viết Hưng; Trần Thanh Nhã; Nguyễn Quốc Hưng; Lý Nguyễn Tiến Đạt; Nguyễn Quốc Trọng; Tạ Công Phi

doi:10.54607/hcmue.js.23.2.5044(2026)


Issue: Vol. 23, No. 2 (2026)

Section: Articles


Published: 27 February 2026


How to cite

Nguyễn, V. H., Trần, T. N., Nguyễn, Q. H., Lý, N. T. Đ., Nguyễn, Q. T., & Tạ, C. P. (2026). TĂNG CƯỜNG NHẬN DIỆN CẢM XÚC THÔNG QUA TÍCH HỢP ĐẶC TRƯNG NGỮ CẢNH ĐA PHƯƠNG THỨC [Enhancing emotion recognition through the integration of multimodal contextual features]. Tạp chí Khoa học Trường Đại học Sư phạm Thành phố Hồ Chí Minh, 23(2), 237-248. https://doi.org/10.54607/hcmue.js.23.2.5044(2026)


Nguyễn Viết Hưng¹, Trần Thanh Nhã¹, Nguyễn Quốc Hưng¹, Lý Nguyễn Tiến Đạt¹, Nguyễn Quốc Trọng¹, Tạ Công Phi¹
¹ Ho Chi Minh City University of Education, Vietnam

Abstract

In the digital era, demand is growing for intelligent systems that can empathize with users' emotions. However, existing emotion recognition methods, whether unimodal or multimodal, often fail to integrate information from multiple sources tightly or to exploit context effectively. As a result, models are easily affected by noise or missing information in the input data. To address this limitation, this study introduces MCFF (Multi-Modal Contextual Feature Fusion), a multimodal deep learning architecture designed to exploit visual, audio, and textual information simultaneously. Experiments on the IEMOCAP dataset yield 82.89% accuracy and an 82.86% F1-score, showing that MCFF is strongly competitive with other state-of-the-art methods. MCFF shows potential for wide application in intelligent interactive systems, from improving the user experience in online education and virtual assistants to providing important support in mental health care.
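The abstract reports Accuracy and F1-score but, on this page, does not state how the F1-score is averaged across emotion classes. As a minimal sketch, assuming a support-weighted average (an assumption, not confirmed by the source), the two metrics could be computed as below. The function names and the toy labels are hypothetical and are not taken from IEMOCAP or from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support.

    Assumption: support-weighted averaging; the source does not say
    which averaging scheme the reported F1-score uses.
    """
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical toy labels for illustration only.
y_true = ["happy", "sad", "angry", "neutral", "sad", "happy"]
y_pred = ["happy", "sad", "neutral", "neutral", "sad", "angry"]

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

Under class imbalance the two numbers can diverge, which may be one reason to report both; here they are computed on the same predictions so they can be compared directly.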

