Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

03/10/2022
by Boyu Chen, et al.

Replacing hand-wired designs and inductive biases with a general-purpose neural architecture has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, which hinders the development of tracking within a more general system. This paper presents a Simplified Tracking architecture (SimTrack) that leverages a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before a one-branch backbone. Because feature interaction happens inside the backbone, carefully designed interaction modules are no longer needed, which yields a more efficient and effective framework. To reduce the information loss caused by down-sampling in vision transformers, we further propose a foveal window strategy that provides more diverse input patches at an acceptable computational cost. SimTrack improves on the baseline with gains of 2.5 on LaSOT/TNL2K and achieves results competitive with other specialized tracking algorithms, without bells and whistles.
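The serialize-and-concatenate step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`patchify`, `serialize_pair`), the 16-pixel patch size, and the 112x112 template and 224x224 search sizes are all assumptions chosen for illustration. The sketch only shows how two images could be turned into one token sequence for a single-branch backbone; the linear projection, position embeddings, transformer layers, and the foveal window are left out.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into an (N, patch*patch*C) sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c): group pixels by patch.
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def serialize_pair(template, search, patch=16):
    """Serialize both images and concatenate them into one token sequence.

    Returns the joint sequence and the number of template tokens, so the
    search-region tokens can be sliced back out after the backbone.
    """
    z = patchify(template, patch)  # template (exemplar) tokens
    x = patchify(search, patch)    # search-region tokens
    return np.concatenate([z, x], axis=0), len(z)


# Illustrative sizes only (assumed, not taken from the abstract).
template = np.zeros((112, 112, 3), dtype=np.float32)
search = np.ones((224, 224, 3), dtype=np.float32)
tokens, n_template = serialize_pair(template, search)
print(tokens.shape, n_template)
```

With these assumed sizes the template contributes 7x7 = 49 tokens and the search region 14x14 = 196, so the joint sequence has 245 tokens. A self-attention layer applied to this sequence would let every template token attend to every search token and vice versa, which is the sense in which interaction happens inside the backbone rather than in a separate module.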
