Omnidirectional images (ODIs) have become increasingly popular, as their...
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that r...
Despite the success in large-scale text-to-image generation and text-con...
The state of the art in vision-language pretraining (VLP) achieves exem...
The incredible generative ability of large-scale text-to-image (T2I) mod...
Both masked image modeling (MIM) and natural language supervision have f...
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCL...
To reproduce the success of text-to-image (T2I) generation, recent works...
Vector-Quantized (VQ-based) generative models usually consist of two bas...
Cross-domain recommendation is an important method to improve recommende...
Existing benchmark datasets for recommender systems (RS) either are crea...
Weakly-supervised action localization aims to localize and classify acti...
Modeling dynamic scenes is important for many applications such as virtu...
Since the development of self-supervised visual representation learning ...
Dominant pre-training work for video-text retrieval mainly adopts the "du...
Finding relevant moments and highlights in videos according to natural l...
Pre-training a model to learn transferable video-text representation for...