The recent wave of large-scale text-to-image diffusion models has
dramat...
We consider the problem of segmenting scenes into constituent entities, ...
This paper explores self-supervised learning of amodal 3D feature
repres...
We propose an action-conditioned dynamics model that predicts scene chan...
We present neural architectures that disentangle RGB-D images into objec...
We propose a system that learns to detect objects and infer their 3D pos...
Consider the utterance "the tomato is to the left of the pot." Humans ca...