While speech-enabled teachable agents have some advantages over typing-b...
Named entities are ubiquitous in text that naturally accompanies images,...
Despite recent attention and exploration of depth for various tasks, it ...
Vision-language pretraining to learn a fine-grained, region-word alignme...
The use of large-scale vision-language datasets is limited for object
de...
Human-object interaction (HOI) detection aims to extract interacting
hum...
Contrastive learning has emerged as a competitive pretraining method for...
Speakers build rapport in the process of aligning conversational behavio...
Sometimes the meaning conveyed by images goes beyond the list of objects...
Videos are more well-organized curated data sources for visual concept
l...
Though significant efforts such as removing false claims and promoting
r...
Recently, vision transformers and MLP-based models have been developed i...
Prior work in scene graph generation requires categorical supervision at...
In this work, we present BasisNet which combines recent advancements in
...
The observation that computer vision methods overfit to dataset specific...
Deep learning based object detectors are commonly deployed on mobile dev...
We study the problem of animating images by transferring spatio-temporal...
The abundance of multimodal data (e.g. social media posts) has inspired
...
The news media shape public opinion, and often, the visual bias they con...
Learning to localize and name object instances is a fundamental problem ...
Advertisements are unavoidable in modern society. Times Square is notori...
Computer vision systems currently lack the ability to reliably recognize...
To alleviate the cost of obtaining accurate bounding boxes for training
...
In order to resonate with the viewers, many video advertisements explore...
In this paper, we examine the visual variability of objects across diffe...
Images and text in advertisements interact in complex, non-literal ways....
How would you search for a unique, fashionable shoe that a friend wore a...
In order to convey the most content in their limited space, advertisemen...
There is more to images than their objective physical content: for examp...
Computer vision systems require large amounts of manually annotated data...