Recently, large-scale pre-trained language-image models like CLIP have s...
Point-supervised Temporal Action Localization (PSTAL) is an emerging res...
Modeling and synthesizing low-light raw noise is a fundamental problem f...
Pre-training has emerged as an effective technique for learning powerful...
Current state-of-the-art approaches for few-shot action recognition achi...
Learning from large-scale contrastive language-image pre-training like C...
Since the fully convolutional network has achieved great success in sema...
Human-Object Interaction (HOI) detection aims to learn how human interac...
Recent incremental learning for action recognition usually stores repres...
Standard approaches for video recognition usually operate on the full in...
This technical report presents our first place winning solution for temp...
Currently, for crowd counting, the fully supervised methods via density ...
Recently, many approaches tackle the Unsupervised Domain Adaptive person...
This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) ...
In this work, we present a new method for 3D face reconstruction from mu...
Deep learning-based methods for low-light image enhancement typically re...
The fully convolutional network (FCN) has achieved tremendous success in...
Supervised learning is dominant in person search, but it requires elabor...
The central idea of contrastive learning is to discriminate between diff...
Temporal action localization aims to localize starting and ending time w...
Most recent approaches for online action detection tend to apply Recurre...
Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to rec...
This technical report presents our solution for temporal action detectio...
This technical report analyzes an egocentric video action detection meth...
With the recent surge in the research of vision transformers, they have ...
The existing crowd counting methods usually adopted attention mechanism ...
We present an efficient high-resolution network, Lite-HRNet, for human p...
Self-supervised learning presents a remarkable performance to utilize un...
Temporal action proposal generation aims to estimate temporal intervals ...
The goal of person search is to localize and match query persons from sc...
In this paper, we propose to estimate 3D hand pose by recovering the 3D ...
In the conventional person Re-ID setting, it is widely assumed that crop...
Non-local operation is widely explored to model the long-range dependenc...
Currently, one-stage frameworks have been widely applied for temporal ac...
Human pose estimation is the task of localizing body keypoints from stil...
In this report, we present our solution for the task of temporal action ...
This technical report analyzes a temporal action localization method we ...
Crowd counting is a concerned and challenging task in computer vision. E...
Image dehazing using learning-based methods has achieved state-of-the-ar...
The low-level details and high-level semantics are both essential to the...
Recent works have widely explored the contextual dependencies to achieve...
We propose a Generative Transfer Network (GTNet) for zero shot object de...
Person search aims at localizing and identifying a query person from a g...
Semantic segmentation requires both rich spatial information and sizeabl...
Most existing methods of semantic segmentation still suffer from two asp...
We present an effective blind image deblurring method based on a data-dr...
Designing a robust affinity model is the key issue in multiple target tr...
Traditional single-view object detection methods often perform worse und...
Recently, scene text detection has become an active research topic in co...