Image Question Answering and Video Question Answering are two tasks that require models to analyze the visual content of an image or a video and produce a meaningful answer to questions about that content. Both tasks involve spatial, frame-level reasoning. Video Question Answering additionally requires temporal, video-level reasoning, which further raises the difficulty of the task. Solving these tasks would demonstrate the ability to train models that jointly analyze and reason about visual and textual content at a human level: such models would learn to isolate and pinpoint objects of interest in a video (or image), and to identify and reason about their interactions in both the spatial and temporal domains. Image and Video Question Answering thus represent a challenging but fundamental task for both the Computer Vision and Natural Language Processing communities.
Call for Papers
Topics of interest are mainly related to Visual (Image and Video) Question Answering including, but not limited to: