Artificial intelligence models for detecting emotions in speech conversations show promise but may still be getting mixed signals
Artificial intelligence is getting better at interpreting how we feel — at least in theory. Emotion recognition from speech, once mere science fiction, is now central to emerging tools in mental health, customer service, and virtual assistants. But a new study from Khalifa University suggests these systems may not be as emotionally fluent as we think.
Ghada Alhussein, Ioannis Ziogas, Shiza Saleem and Prof. Leontios Hadjileontiadis, all from Khalifa University’s Department of Biomedical Engineering and Biotechnology, published a systematic review and meta-analysis in , analyzing 51 studies from 2010 to 2023. They found major discrepancies in how emotions are labelled, in the types of data used, and in how algorithms are tested.
The field, known as Speech Emotion Recognition in Conversations (SERC), uses AI to identify emotions in human speech. These tools have potential applications in areas like mental health monitoring, human-computer interaction, and call center analytics. But the research finds that while accuracy has improved — especially with the use of deep learning and self-supervised models — bias and methodological issues remain unresolved.
One major finding is that datasets often use inconsistent or poorly defined emotional labels. Some studies use categorical models (e.g. happy or angry), while others adopt dimensional models based on valence (positive vs. negative) and arousal (calm vs. excited). This inconsistency leads to confusion for algorithms trying to learn emotional patterns. Even the best AI systems can’t outperform the quality of the labels they’re trained on.
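To make the difference concrete, the sketch below shows one way a categorical label set could be projected onto a valence-arousal plane. The label names, coordinate values, and mapping are illustrative assumptions, not drawn from the study or from any particular SERC dataset.

```python
# Hypothetical illustration: reconciling categorical emotion labels with a
# dimensional (valence/arousal) scheme. Label names and coordinate values
# are illustrative assumptions, not taken from any specific SERC dataset.

from dataclasses import dataclass

@dataclass
class EmotionPoint:
    valence: float  # negative (-1.0) to positive (+1.0)
    arousal: float  # calm (-1.0) to excited (+1.0)

# One possible mapping from categorical labels to dimensional coordinates.
CATEGORICAL_TO_DIMENSIONAL = {
    "happy":   EmotionPoint(valence=0.8, arousal=0.5),
    "angry":   EmotionPoint(valence=-0.7, arousal=0.8),
    "sad":     EmotionPoint(valence=-0.6, arousal=-0.4),
    "neutral": EmotionPoint(valence=0.0, arousal=0.0),
}

def to_dimensional(label: str) -> EmotionPoint:
    """Convert a categorical label to valence/arousal, if a mapping exists."""
    try:
        return CATEGORICAL_TO_DIMENSIONAL[label]
    except KeyError:
        raise ValueError(f"No agreed dimensional mapping for label: {label!r}")

if __name__ == "__main__":
    # Two corpora may label the same utterance differently; mapping both onto
    # valence/arousal makes the disagreement explicit rather than hiding it.
    print(to_dimensional("happy"))   # EmotionPoint(valence=0.8, arousal=0.5)
    print(to_dimensional("angry"))   # EmotionPoint(valence=-0.7, arousal=0.8)
```

Projecting differently labelled corpora into the same dimensional space at least makes their disagreements visible, rather than leaving the model to reconcile them implicitly.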
Then, there’s the issue of realism. Many datasets used to train AI on emotion are based on acted speech: people pretending to be angry, sad, or cheerful. While easier to collect and label, these examples don’t always reflect how emotions actually sound in real conversations. The review found that most studies used scripted or acted datasets, which can produce inflated performance metrics that don’t translate to real-world settings.
“Emotion recognition by AI is not just a technical challenge – it’s a question of how we define, experience and communicate feelings. That makes data quality just as important as model architecture.”
— Prof. Leontios Hadjileontiadis, Khalifa University.
Despite these challenges, technical improvements are helping. Newer models based on self-supervised learning show promise in extracting emotion-relevant features from speech, even with limited labelled data. Hybrid approaches that combine hand-crafted and deep features also tend to perform better. Still, this review makes it clear that model sophistication alone isn’t enough. Performance depends heavily on dataset quality, the realism of the speech samples, and the clarity of the emotional categories.
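As a rough sketch of the hybrid idea, the snippet below pools hand-crafted MFCCs alongside embeddings from a self-supervised speech encoder and concatenates them into one utterance-level feature vector. The choice of encoder (a wav2vec 2.0 base model), the 13-coefficient MFCC setup, and mean pooling are assumptions made for illustration; the review does not prescribe this particular recipe.

```python
# A minimal sketch of a "hybrid feature" pipeline: pooling hand-crafted MFCCs
# together with embeddings from a self-supervised speech encoder, then
# concatenating them for a downstream emotion classifier. Model choice and
# pooling strategy are assumptions, not the authors' specified method.

import numpy as np
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def hybrid_features(wav_path: str) -> np.ndarray:
    # Hand-crafted branch: 13 MFCCs averaged over time.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # (13, frames)
    mfcc_vec = mfcc.mean(axis=1)                              # (13,)

    # Self-supervised branch: wav2vec 2.0 hidden states averaged over time.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state          # (1, T, 768)
    ssl_vec = hidden.mean(dim=1).squeeze(0).numpy()           # (768,)

    # Concatenate both views into a single utterance-level feature vector
    # that a simple classifier could be trained on.
    return np.concatenate([mfcc_vec, ssl_vec])                # (781,)
```

A classifier trained on such vectors would draw on both interpretable spectral cues and representations learned without labels, which is roughly the complementarity the review credits for the stronger performance of hybrid approaches.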
“We see strong potential for AI in emotion recognition,” Prof. Hadjileontiadis said. “But the field needs to address foundational issues — particularly in how emotions are defined, labelled, and measured — if we want these systems to be both accurate and applicable.”