FEIM-TTS

Facial Expression-Enhanced TTS:
Combining Face Representation and Emotion Intensity for Adaptive Speech

FEIM-TTS generates speech from a given text script, a face image, and emotion cues with intensity.

Abstract

We propose FEIM-TTS, an innovative zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech, aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS transcends traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without dependence on labeled datasets. To address the scarcity of audio-visual-emotional data, the model is trained using the LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS's unique capability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. Moreover, FEIM-TTS significantly enhances accessibility for individuals with visual impairments. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully. Comprehensive evaluation demonstrates its proficiency in modulating emotion and intensity, advancing emotional speech synthesis and accessibility.

Examples

All results were generated using only unseen speakers, including synthesized face images.

Case 1 (Synthesized face from generated.photos)

Text: I think I've seen this before.

From Face-TTS

From FEIM-TTS (Ours)

Case 2 (Synthesized face from generated.photos)

Text: These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

From Face-TTS

From FEIM-TTS (Ours)

Case 3 (Synthesized face from generated.photos)

Speech generated at different intensity levels for each emotion


FEIM-TTS (Ours)

Text: Yet the public opinion of the whole body seems to have checked dissipation.

Intensity levels: 1, 10, 20
Emotions: Anger, Disgust, Fear, Happy, Neutral, Sad

Case 4

Speech generated using images of animated characters
*All image licenses have been noted below.


Samples: 1, 2

Text: Maybe tomorrow it will be cold.

Judy Hopps

Kristoff Bjorgman