Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Structured prediction is concerned with predicting multiple inter-dependent labels simultaneously. Classical methods like CRFs achieve this by maximizing a score function over the set of possible label assignments. Recent extensions use neural networks either to implement the score function or to perform the maximization. The current paper takes an alternative approach, using a neural network to generate the structured output directly, without going through a score function. We take an axiomatic perspective to derive the desired properties and invariances of such a network with respect to certain input permutations, presenting a structural characterization that is provably both necessary and sufficient. We then discuss graph-permutation invariant (GPI) architectures that satisfy this characterization and explain how they can be used for deep structured prediction. We evaluate our approach on the challenging problem of inferring a scene graph from an image, namely, predicting the entities in the image and the relations between them. We obtain state-of-the-art results on the Visual Genome benchmark, outperforming all recent approaches.
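To make the notion of graph-permutation invariance concrete, here is a minimal sketch of a function over node and pairwise features that is unchanged under a consistent relabelling of the nodes. It assumes a simplified sum-of-messages decomposition with three component functions (here called phi, alpha, and rho, standing in for the learned networks of a GPI architecture); the fixed random linear maps, dimensions, and helper names are hypothetical and chosen only so the sketch runs and the invariance can be checked.

```python
import numpy as np

# Hypothetical stand-ins for learned networks phi, alpha, rho:
# fixed random linear maps keep the sketch self-contained while
# still demonstrating permutation invariance.
rng = np.random.default_rng(0)
D_NODE, D_EDGE, D_HID = 4, 3, 8

W_phi = rng.standard_normal((D_NODE + D_EDGE + D_NODE, D_HID))
W_alpha = rng.standard_normal((D_NODE + D_HID, D_HID))
W_rho = rng.standard_normal((D_HID, 1))

def gpi_score(nodes, edges):
    """Graph-permutation-invariant aggregation of node/edge features.

    nodes: (n, D_NODE) array of per-entity features z_i
    edges: (n, n, D_EDGE) array of pairwise features z_ij
    Returns a scalar that is unchanged when the nodes (and edges,
    consistently) are relabelled by any permutation.
    """
    n = nodes.shape[0]
    per_node = []
    for i in range(n):
        # phi aggregates information from node i's neighbourhood
        msgs = [np.tanh(np.concatenate([nodes[i], edges[i, j], nodes[j]]) @ W_phi)
                for j in range(n) if j != i]
        s_i = np.sum(msgs, axis=0)
        # alpha combines node i's own features with its aggregated messages
        per_node.append(np.tanh(np.concatenate([nodes[i], s_i]) @ W_alpha))
    # rho reads out from the order-independent sum over all nodes
    return float(np.sum(per_node, axis=0) @ W_rho)

# Check invariance under a random relabelling of the nodes
n = 5
nodes = rng.standard_normal((n, D_NODE))
edges = rng.standard_normal((n, n, D_EDGE))
perm = rng.permutation(n)
assert np.isclose(gpi_score(nodes, edges),
                  gpi_score(nodes[perm], edges[np.ix_(perm, perm)]))
```

Because the inner and outer aggregations are sums over all nodes, relabelling the nodes only reorders the terms being summed, which is why the assertion at the end holds regardless of the permutation drawn.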