Spoken language identification refers to the task of automatically predi...
Video question–answering is a fundamental task in the field of video
und...
We introduce AudioPaLM, a large language model for speech understanding ...
Large Language Models (LLMs) have been applied in the speech domain, oft...
Speech representation learning approaches for non-semantic tasks such as...
This paper introduces a new speech dataset called “LibriTTS-R” designed ...
Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based
...
Recent advances in machine learning-aided lossy compression are incorpor...
We investigate power allocation for the base matrix of a spatially coupl...
We consider high-dimensional MIMO transmissions in frequency division
du...
Speech restoration (SR) is a task of converting degraded speech signals ...
We introduce the Universal Speech Model (USM), a single large model that...
We introduce Noise2Music, where a series of diffusion models is trained ...
Recently, the concept of holographic multiple-input multiple-output (MIM...
Most research on task oriented dialog modeling is based on written text
...
Although deep neural networks have shown well-performance in various tas...
We propose a novel method to accelerate training and inference process o...
Existing multimodal tasks mostly target at the complete input modality
s...
Self-training methods have been explored in recent years and have exhibi...
This paper proposes a simple yet effective interpolation-based data
augm...
With the boom of e-commerce, Multimodal Review Helpfulness Prediction (M...
We consider the semantic rate-distortion problem motivated by task-orien...
We present the Pathways Autoregressive Text-to-Image (Parti) model, whic...
We consider a problem of coding for computing, where the decoder wishes ...
Building inclusive speech recognition systems is a crucial step towards
...
Wireless network capacity is one of the most important performance metri...
In this paper, we study a concatenate coding scheme based on sparse
regr...
In learning action recognition, models are typically pre-trained on obje...
We propose a simple and fast method for providing a high quality solutio...
Holistic object representation-based trackers suffer from performance dr...
Many speech applications require understanding aspects beyond the words ...
The explosively generated micro-videos on content sharing platforms call...
The accelerated convergence of digital and real-world lifestyles has imp...
We summarize the results of a host of efforts using giant automatic spee...
In multimodal sentiment analysis (MSA), the performance of a model highl...
Motivated by the success of masked language modeling (MLM) in pre-traini...
Multimodal sentiment analysis aims to extract and integrate semantic
inf...
Neural network based speech recognition systems suffer from performance
...
Streaming end-to-end automatic speech recognition (ASR) systems are wide...
Although end-to-end automatic speech recognition (e2e ASR) models are wi...
End-to-end (E2E) models have shown to outperform state-of-the-art
conven...
Streaming end-to-end automatic speech recognition (ASR) models are widel...
Streaming automatic speech recognition (ASR) aims to emit each hypothesi...
We employ a combination of recent developments in semi-supervised learni...
Visual Question Answering (VQA) is challenging due to the complex cross-...
Streaming automatic speech recognition (ASR) aims to emit each hypothesi...
Image text carries essential information to understand the scene and per...
Dialogue relation extraction (DRE) aims to detect the relation between t...
This paper argues that to efficiently cope with the high throughput,
rel...
Rate-Splitting Multiple Access (RSMA) is an emerging flexible, robust an...