PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

08/09/2023
by Faegheh Sardari, et al.

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
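The core idea in point (i), injecting relative positional information directly into self-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head setup, the learnable per-offset bias table `rel_bias`, and the function names are illustrative assumptions. It shows one common way a relative positional term can be added to the attention scores before the softmax, so the attention weights depend on the temporal offset between frames rather than on absolute position alone.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_attention(q, k, v, rel_bias):
    """Single-head self-attention with an additive relative positional bias.

    q, k, v  : (T, d) query/key/value features for T time steps.
    rel_bias : (2T-1,) bias table, one scalar per temporal offset
               in [-(T-1), T-1] (a hypothetical learnable parameter).
    """
    T, d = q.shape
    # Standard scaled dot-product scores: (T, T).
    scores = q @ k.T / np.sqrt(d)
    # Map each (query i, key j) pair to its offset index i - j + (T - 1),
    # which lies in [0, 2T-2] and indexes into rel_bias.
    offsets = np.arange(T)[:, None] - np.arange(T)[None, :] + (T - 1)
    scores = scores + rel_bias[offsets]
    # Softmax over keys, then aggregate values.
    return softmax(scores, axis=-1) @ v
```

With `rel_bias` all zeros this reduces to plain scaled dot-product attention; a trained bias table instead lets the model prefer or suppress particular temporal distances, which is the positional signal that plain self-attention discards.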
