Music editing primarily entails the modification of instrument tracks or...
The information retrieval community has made significant progress in imp...
We introduce MoviePuzzle, a novel challenge that targets visual narrativ...
We introduce CDBERT, a new learning paradigm that enhances the semantics...
Video-grounded dialogue understanding is a challenging problem that requ...
This paper gives two theoretical results on estimating low-rank paramete...
Multilingual training is effective in improving low-resource ASR, which ...
We improve low-resource ASR by integrating the ideas of multilingual tra...
The front-end is a critical component of English text-to-speech (TTS) sy...
Automatic dubbing, which generates a corresponding version of the input ...
Visual-audio navigation (VAN) is attracting increasing attention from...
Recent studies have shown that using an external Language Model (LM) ben...
Salient object detection (SOD) focuses on distinguishing the most conspi...
We study video-grounded dialogue generation, where a response is generat...
Sound field decomposition predicts waveforms in arbitrary directions usi...
Federated learning is a popular strategy for training models on distribu...
VQA is an ambitious task aiming to answer any image-related question. Ho...
Some recent studies have demonstrated the feasibility of single-stage ne...
Adapting to a continuously evolving environment is a safety-critical cha...
Speech restoration aims to remove distortions in speech signals. Prior m...
Cognitive science has shown that humans perceive videos in terms of even...
Although deep learning and end-to-end models have been widely used and s...
We propose two improvements to target-speaker voice activity detection (...
The Partitioning Min-Max Weighted Matching (PMMWM) problem is an NP-hard...
It is still a pipe dream that AI assistants on phones and AR glasses can ...
Dubbing is a post-production process of re-recording actors' dialogues, ...
The goal in a blind image quality assessment (BIQA) model is to simulate...
With the increasing popularity of speech synthesis products, the industr...
Speech restoration aims to remove distortions in speech signals. Prior m...
Deep neural network based methods have been successfully applied to musi...
This paper describes the ByteDance speaker diarization system for the fo...
Acoustic echo and background noise can seriously degrade the intelligibi...
Separating a song into vocal and accompaniment components is an active r...
This paper presents a novel supervised approach to detecting the chorus ...
A music mashup combines audio elements from two or more songs to create ...
This paper describes our submission to the Third DIH...
We propose a multimodal singing language classification model that uses ...
Speech enhancement is the task of improving the intelligibility and perceptu...
Music source separation (MSS) is the task of separating a music piece in...
Music classification is the task of classifying a music piece into labels suc...
Symbolic music datasets are important for music information retrieval an...
Automatic music transcription (AMT) is the task of transcribing audio re...
This paper proposes the building of Xiaomingbot, an intelligent, multili...
With the popularity of deep neural networks, the speech synthesis task has ac...
Accent conversion (AC) transforms a non-native speaker's accent into a n...
Text style transfer is a hot issue in recent natural language processing...
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achi...
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) sy...
Source separation is the task of separating an audio recording into indivi...
Edit-distance-based string similarity search has many applications such ...