Visual Question Answering Using Semantic Information from Image Descriptions
Visual question answering (VQA) is a task that requires AI systems to demonstrate multi-modal understanding: a system must reason over both the question being asked and the image itself to determine reasonable answers. In many cases, reasoning over the image and the question alone is not enough to achieve good performance. To aid the task, external textual knowledge extracted from the image can be used alongside region-based visual features and the natural-language question to generate correct answers. Motivated by this, we propose a deep neural network model with an attention mechanism that combines image features, the natural-language question, and semantic knowledge extracted from the image to produce open-ended answers to the given questions. Combining image features with contextual information about the image enables a model to respond to questions more accurately, and potentially to do so with less training data. We evaluate the proposed architecture on a VQA task against a strong baseline and show that our method achieves excellent results on this task.
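The abstract does not specify implementation details, but the described fusion of three inputs (region-based image features, the encoded question, and semantic knowledge extracted from an image description) can be illustrated with a minimal sketch. All layer choices, dimensions, and names below are assumptions for illustration, not the authors' actual architecture: question-guided attention pools the image regions, and the result is concatenated with the question and description embeddings before an answer classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalAttentionVQA(nn.Module):
    """Illustrative sketch of attention-based fusion of image regions,
    a question embedding, and a semantic (description) embedding."""

    def __init__(self, img_dim=2048, q_dim=1024, sem_dim=1024,
                 hidden=512, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project region features
        self.q_proj = nn.Linear(q_dim, hidden)       # project question embedding
        self.sem_proj = nn.Linear(sem_dim, hidden)   # project description embedding
        self.att = nn.Linear(hidden, 1)              # attention score per region
        self.classifier = nn.Linear(hidden * 3, num_answers)

    def forward(self, img_regions, q_emb, sem_emb):
        # img_regions: (B, R, img_dim) region features, e.g. from a detector
        # q_emb: (B, q_dim) encoded question; sem_emb: (B, sem_dim) encoded description
        v = self.img_proj(img_regions)               # (B, R, H)
        q = self.q_proj(q_emb).unsqueeze(1)          # (B, 1, H)
        scores = self.att(torch.tanh(v + q))         # question-guided scores (B, R, 1)
        alpha = F.softmax(scores, dim=1)             # attention weights over regions
        v_att = (alpha * v).sum(dim=1)               # (B, H) attended image feature
        s = self.sem_proj(sem_emb)                   # (B, H)
        fused = torch.cat([v_att, q.squeeze(1), s], dim=-1)
        return self.classifier(fused)                # logits over candidate answers
```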