Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of ma...
Prompt engineering is a technique that involves augmenting a large pre-t...
Vision-Language Models (VLMs) are expected to be capable of reasoning wi...
Visual Question Answering (VQA) is a multi-discipline research task. To ...