Combining Visual and Textual Attention in Neural Models for Enhanced Visual Question Answering

Journal Title: Revista Romana de Interactiune Om-Calculator - Year 2018, Vol 11, Issue 1

Abstract

While visual information is essential for humans because it models our environment, language is our main means of communication and reasoning. These two capabilities interact in complex ways, so problems involving both visual and natural language data have been widely explored in recent years. Visual question answering aims at building systems able to answer questions expressed in natural language about images or even videos. Such systems could significantly improve the quality of life of visually impaired people by giving them real-time answers about their surroundings. Unfortunately, the relations between images and questions are complex, and current solutions that exploit recent advances in deep learning for text and image representation are not yet reliable enough. To improve these results, the visual and textual representations must be fused into the same multimodal space. In this paper we present two different solutions to this problem. The first performs reasoning on the image by using soft attention mechanisms computed given the question. The second uses soft attention not just on the image, but on the text as well. Although our models are more lightweight than state-of-the-art solutions for this task, we achieve near top performance with the proposed combination of visual and textual representations.
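The question-guided soft attention the abstract describes can be illustrated with a minimal sketch: region features from an image grid are scored against a question embedding, the scores are normalized with a softmax, and the regions are summed with those weights. All names, dimensions, and the bilinear scoring form below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(regions, question, W):
    # regions:  (num_regions, d_img) visual features, e.g. a CNN feature grid
    # question: (d_q,) question embedding, e.g. the final RNN state
    # W:        (d_q, d_img) learned projection aligning the two spaces
    scores = regions @ (W.T @ question)   # (num_regions,) relevance of each region
    weights = softmax(scores)             # attention distribution over regions
    attended = weights @ regions          # (d_img,) question-conditioned image summary
    return attended, weights

# Toy example with random features (shapes are hypothetical).
rng = np.random.default_rng(0)
regions = rng.standard_normal((49, 512))   # 7x7 grid of 512-d region features
question = rng.standard_normal(300)        # 300-d question embedding
W = rng.standard_normal((300, 512))
attended, weights = soft_attention(regions, question, W)
```

The attended vector can then be fused with the question embedding (e.g. by concatenation or elementwise product) before the answer classifier; the second model in the paper applies the same idea over the question words as well.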

Authors and Affiliations

Cosmin Dragomir, Cristian Ojog, Traian Rebedea

Related Articles

Finding flow: some implications for the utilization of new technologies

As computers and the internet become an increasingly prominent presence in our daily lives, a stronger need to better understand the behavior of the people using these technologies emerges as well. During recent years,...

Approaches in Automatic Usability Evaluation. Comparative Study

Usability testing is a growing field, with more and more companies becoming aware of the importance of ensuring good usability for their products. Specialists conduct the tests and use various kinds of tools to help...

2D graphical interaction in elearning

Sketching is often used by people to express ideas. Some concepts that are hard to explain in words can be easily expressed using a figure or drawing. As pen-based user interfaces became common, many systems that use...

A software model with potential for the development of strategy games

The purpose of this article is to propose and present a software model that could underpin the development of strategy games and, in general, of interactive applications that involve simulating the interactions between...

Researches regarding the acceptance of e-learning technologies

Nowadays, e-learning applications are integrated in most curricular areas and school cycles. In this context, the perceived utility of these applications at the end-user level is an important issue. A series of theor...

  • EP ID EP673498

How To Cite

Cosmin Dragomir, Cristian Ojog, Traian Rebedea (2018). Combining Visual and Textual Attention in Neural Models for Enhanced Visual Question Answering. Revista Romana de Interactiune Om-Calculator, 11(1), 1-27. https://europub.co.uk/articles/-A-673498