Recent video recognition models utilize Transformer models for long-rang...
Large-scale multi-modal training with image-text pairs imparts strong
ge...
Pre-trained vision-language (V-L) models such as CLIP have shown excelle...
Existing open-vocabulary object detectors typically enlarge their vocabu...