We present Composable Diffusion (CoDi), a novel generative model capable...
Action knowledge involves the understanding of textual, visual, and temp...
We present Perceiver-VL, a vision-and-language framework that efficientl...
In this work, we present the Textless Vision-Language Transformer (TVLT)...
Since visual perception can give rich information beyond text descriptio...
Videos convey rich information. Dynamic spatio-temporal relationships be...