3D visual grounding aims to localize the target object in a 3D point clo...
3D visual grounding involves finding a target object in a 3D scene that
...
Speech Recognition builds a bridge between the multimedia streaming
(aud...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Multi-modal Contrastive Representation (MCR) learning aims to encode
dif...
Multi-media communications facilitate global interaction among people.
H...
Transformers have become the powerhouse of natural language processing a...
Kinodynamic Motion Planning (KMP) aims to find a robot motion subject to
c...
Reliable real-time planning for robots is essential in today's rapidly
e...