Video summarization remains a huge challenge in computer vision due to t...
Significant advances are being made in speech emotion recognition (SER) ...
Advertisement videos (ads) play an integral part in the domain of Intern...
Traditional music search engines rely on retrieval methods that match na...
Continuously-worn wearable sensors enable researchers to collect copious...
Automatic Speech Understanding (ASU) leverages the power of deep learnin...
Many recent studies have focused on fine-tuning pre-trained models for s...
Recent studies have explored the use of pre-trained embeddings for speec...
This paper presents the approach and results of USC SAIL's submission to...
There is an imminent need for guidelines and standard test sets to allow...
The process of human affect understanding involves the ability to infer ...
Audio event detection is a widely studied audio processing task, with ap...
Interpersonal spoken communication is central to human interaction and t...
Active speaker detection in videos addresses associating a source face, ...
With the similarity between music and speech synthesis from symbolic inp...
Vocal entrainment is a social adaptation mechanism in human interaction,...
The need for emotional inference from text continues to diversify as mor...
Detecting emotions expressed in text has become critical to a range of f...
Human perception and experience of music is highly context-dependent. Co...
Detecting unsafe driving states, such as stress, drowsiness, and fatigue...
In psychotherapy interactions, the quality of a session is assessed by c...
Longform media such as movies have complex narrative structures, with ev...
We present a cross-modal unsupervised framework for active speaker detec...
We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT ...
Papilledema is an ophthalmic neurologic disorder in which increased intr...
Many existing privacy-enhanced speech emotion recognition (SER) framewor...
Speaker clustering is an essential step in conventional speaker diarizat...
Speaker diarization is one of the critical components of computational m...
A variety of recent works have looked into defenses for deep neural netw...
An essential goal of computational media intelligence is to support unde...
Speech emotion recognition (SER) applications are frequently associated wi...
Emotion recognition from text is a challenging task due to diverse emoti...
Societal ideas and trends dictate media narratives and cinematic depicti...
Automatic inference of important paralinguistic information such as age ...
In this paper we investigate speech denoising as a defense against adver...
Computational approaches for assessing the quality of conversation-based...
Key challenges in developing generalized automatic emotion recognition s...
Speech encodes a wealth of information related to human behavior and has...
Instrument separation in an ensemble is a challenging task. In this work...
During a psychotherapy session, the counselor typically adopts technique...
With the growing prevalence of psychological interventions, it is vital ...
A key desideratum for inclusive and accessible speech recognition technol...
Word vector representations enable machines to encode human language for...
Speaker diarization is the task of labeling audio or video recordings with cl...
Robust face clustering is a key step towards computational understanding...
Violent content in the media can influence viewers' perception of the so...
Robust speaker recognition, including in the presence of malicious attac...
Life events can dramatically affect our psychological state and work per...
Neural speaker embeddings trained using classification objectives have d...
Electroencephalography (EEG) signals are promising as a biometric owing ...