Skeleton-based action segmentation requires recognizing composable actio...
This paper introduces InternVid, a large-scale video-centric multimodal
...
With the advance of text-to-image models (e.g., Stable Diffusion) and
co...
Self-supervised video representation learning aimed at maximizing simila...
Spatio-temporal coherency is a major challenge in synthesizing high qual...
We consider the problem of generating musical soundtracks in sync with
r...
Diffusion models have attained impressive visual quality for image synth...
This work focuses on unsupervised representation learning in person
re-i...
Current self-supervised approaches for skeleton action representation
le...
Due to the remarkable progress of deep generative models, animating imag...
Action recognition based on skeleton data has recently witnessed increas...
In this work, we introduce an unconditional video generative model,
InMo...
Annotating identity labels in large-scale datasets is a labour-intensive...
Taking advantage of human pose data for understanding human activities h...
Creating realistic human videos introduces the challenge of being able t...