Combining Visual and Textual Attention in Neural Models for Enhanced Visual Question Answering

Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1

Abstract

While visual information is essential for humans because it models our environment, language is our main means of communication and reasoning. These two capabilities interact in complex ways, so problems involving both visual and natural language data have been widely explored in recent years. Visual question answering aims at building systems able to answer questions expressed in natural language about images or even videos. Such systems could significantly improve the quality of life of visually impaired people by giving them real-time answers about their surroundings. Unfortunately, the relations between images and questions are complex, and current solutions that exploit recent advances in deep learning for text and image representation are not yet reliable enough. To improve these results, the visual and textual representations must be fused into the same multimodal space. In this paper we present two different solutions to this problem. The first performs reasoning on the image by using soft attention mechanisms computed given the question. The second uses soft attention not just on the image, but on the text as well. Although our models are more lightweight than state-of-the-art solutions for this task, we achieve near top performance with the proposed combination of visual and textual representations.
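The question-guided soft attention the abstract describes can be illustrated with a minimal sketch: region features from an image grid are scored against a question embedding, the scores are normalized with a softmax, and the regions are summed with those weights. All names, dimensions, and the bilinear scoring form below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(regions, question, W):
    # regions:  (num_regions, d_img) visual features, e.g. a CNN feature grid
    # question: (d_q,) question embedding, e.g. the final RNN state
    # W:        (d_q, d_img) learned projection aligning the two spaces
    scores = regions @ (W.T @ question)   # (num_regions,) relevance of each region
    weights = softmax(scores)             # attention distribution over regions
    attended = weights @ regions          # (d_img,) question-conditioned image summary
    return attended, weights

# Toy example with random features (shapes are hypothetical).
rng = np.random.default_rng(0)
regions = rng.standard_normal((49, 512))   # 7x7 grid of 512-d region features
question = rng.standard_normal(300)        # 300-d question embedding
W = rng.standard_normal((300, 512))
attended, weights = soft_attention(regions, question, W)
```

The attended vector can then be fused with the question embedding (e.g. by concatenation or elementwise product) before the answer classifier; the second model in the paper applies the same idea over the question words as well.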

Authors and Affiliations

Cosmin Dragomir, Cristian Ojog, Traian Rebedea

Related Articles

Finding flow: some implications for the utilization of new technologies

As computers and the internet become an increasingly prominent presence in our daily lives, a stronger need to better understand the behavior of the people using these technologies emerges as well. During recent years,...

Approaches in Automatic Usability Evaluation. Comparative Study

Usability testing is a growing field, with more and more companies becoming aware of the importance of ensuring good usability for their products. Specialists conduct the tests and use various kinds of tools to help...

2D graphical interaction in elearning

Sketching is often used by people to express ideas. Some concepts that are hard to explain in words can be easily expressed using a figure or drawing. As pen-based user interfaces became common, many systems that use...

A software model with potential for the development of strategy games

The purpose of this article is to propose and present a software model that could underpin the development of strategy games and, in general, of interactive applications that involve simulating the interactions between...

Researches regarding the acceptance of e-learning technologies

Nowadays, e-learning applications are integrated in most curricular areas and school cycles. In this context, the perceived utility of these applications at the end-user level is an important issue. A series of theor...

  • EP ID EP673498

How To Cite

Cosmin Dragomir, Cristian Ojog, Traian Rebedea (2018). Combining Visual and Textual Attention in Neural Models for Enhanced Visual Question Answering. Revista Romana de Interactiune Om-Calculator, 11(1), 1-27. https://europub.co.uk/articles/-A-673498