Joint Event Detection and Description in Continuous Video Streams
Dense video captioning is a fine-grained video understanding task that involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers and proposes variable-length temporal events based on pooled features. To explicitly model temporal relationships between visual events and their captions within a single video, we propose a two-level hierarchical LSTM module that transcribes the event proposals into captions. Unlike existing dense video captioning approaches, our proposal generation and language captioning networks are trained end-to-end, which allows for improved temporal segmentation. On the large-scale ActivityNet Captions dataset, JEDDi-Net achieves improved results on most language generation metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
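To make the described pipeline concrete, the sketch below outlines the three stages named in the abstract: a 3D-convolutional video encoder, a temporal proposal head operating on pooled features, and a two-level hierarchical LSTM captioner in which a controller LSTM carries video-level context across events while a sentence LSTM decodes each caption. This is a minimal, illustrative sketch only; all module names, layer sizes, and the simple per-timestep proposal scoring are assumptions, not the authors' actual architecture.

```python
# Illustrative sketch of the encoder -> proposal -> hierarchical-captioner pipeline.
# All names and hyperparameters are hypothetical; they do not reproduce JEDDi-Net.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Encodes a video clip with 3D convolutions into a temporal feature sequence."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )
        # Collapse the spatial dimensions, keep the temporal axis.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))

    def forward(self, video):                      # video: (B, 3, T, H, W)
        x = self.pool(self.conv(video))            # (B, C, T', 1, 1)
        return x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T', C)


class ProposalHead(nn.Module):
    """Scores each temporal position and regresses a (center, length) offset."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)        # eventness score per time step
        self.regress = nn.Linear(feat_dim, 2)      # center / length offsets

    def forward(self, feats):                      # feats: (B, T', C)
        return self.score(feats).squeeze(-1), self.regress(feats)


class HierarchicalCaptioner(nn.Module):
    """Two-level captioner: a controller LSTM tracks context across events;
    a sentence LSTM decodes the words of each event's caption."""
    def __init__(self, feat_dim=256, hidden=512, vocab=1000):
        super().__init__()
        self.controller = nn.LSTMCell(feat_dim, hidden)
        self.word_embed = nn.Embedding(vocab, hidden)
        self.sentence_lstm = nn.LSTM(hidden + hidden + feat_dim, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)

    def forward(self, event_feats, captions):
        # event_feats: (B, N_events, C) pooled proposal features, in temporal order
        # captions:    (B, N_events, L) ground-truth word ids (teacher forcing)
        B, N, _ = event_feats.shape
        h = event_feats.new_zeros(B, self.controller.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for i in range(N):
            # Video-level state is updated once per event by the controller LSTM.
            h, c = self.controller(event_feats[:, i], (h, c))
            words = self.word_embed(captions[:, i])            # (B, L, hidden)
            L = words.size(1)
            ctx = torch.cat([h, event_feats[:, i]], dim=-1)    # (B, hidden + C)
            inp = torch.cat([words, ctx.unsqueeze(1).expand(-1, L, -1)], dim=-1)
            out, _ = self.sentence_lstm(inp)                   # word-level decoding
            logits.append(self.word_out(out))
        return torch.stack(logits, dim=1)          # (B, N_events, L, vocab)
```

In this sketch, feeding the controller state into the sentence decoder is what lets each caption condition on previously described events; in an end-to-end setup of this kind, the proposal and captioning losses would be optimized jointly so that caption supervision can also improve the temporal segmentation.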