Existing emotion prediction benchmarks contain coarse emotion labels whi...
Visual recognition models are prone to learning spurious correlations in...
Long-term activity forecasting is an especially challenging research pro...
Learning with noisy labels (LNL) is challenging as the model tends to
me...
Diffusion models have demonstrated impressive performance in text-guided...
Webpages have been a rich resource for language and vision-language task...
Compositional reasoning is a hallmark of human visual intelligence; yet
...
Webpages have been a rich, scalable resource for vision-language and lan...
Multi-source Domain Generalization (DG) measures a classifier's ability ...
We propose a self-supervised approach for learning to perform audio sour...
We provide a new multi-task benchmark for evaluating text-to-image model...
Human pose transfer aims to synthesize a new view of a person under a gi...
Prior work has shown that Visual Recognition datasets frequently
under-r...
Recent self-supervised approaches have used large-scale image-text datas...
The goal of attribute manipulation is to control specified attribute(s) ...
Movie genre classification has been widely studied in recent years due t...
Image manipulation has attracted a lot of interest due to its wide range...
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a new dat...
Automatically writing long articles is a complex and challenging languag...
Phrase detection requires methods to identify if a phrase is relevant to...
Analyzing the morphology of cells in microscopy images can provide insig...
We introduce the task of spatially localizing narrated interactions in
v...
In recent years, vision-language research has shifted to study tasks whi...
Large-scale dissemination of disinformation online intended to mislead o...
Many self-supervised learning (SSL) methods have been successful in lear...
We present Shapeshifter Networks (SSNs), a flexible neural network frame...
Current multilingual vision-language models either require a large numbe...
Existing unsupervised domain adaptation methods aim to transfer knowledg...
Prior work in multi-task learning has mainly focused on predictions on a...
Given a video and a sentence, the goal of weakly-supervised video moment...
Existing vision-language methods typically support two languages at a ti...
Many real-world tasks require models to compare images along multiple
si...
Shouldn't language and vision features be treated equally in vision-lang...
Explaining a deep learning model can help users understand its behavior ...
Most existing work that grounds natural language phrases in images start...
In this paper, we introduce an attribute-based interactive image search ...
Outfits in online fashion data are composed of items of many different t...
This paper presents an approach for grounding phrases in images which jo...
This paper presents a framework for localization or grounding of phrases...