It is well known that most conventional video question answering
...
Natural language modeling with limited training data is a challenging
pr...
Knowledge-based visual question answering (QA) aims to answer a question...
Learning generic joint representations for video and text by a supervise...
Human-Object Interaction (HOI) detection is the task of identifying a se...
Self-supervised learning has drawn attention through its effectiveness i...
The VALUE (Video-And-Language Understanding Evaluation) benchmark is new...
Human-Object Interaction (HOI) detection is the task of identifying "a set...
As a scene graph compactly summarizes the high-level content of an image...
Conventional sequential learning methods such as Recurrent Neural Networ...
While conventional methods for sequential learning focus on interaction
...