Training deep learning models for video classification from audio-visual...
Zero-shot learning models achieve remarkable results on image classifica...
Neural Persistence is a prominent measure for quantifying neural network...
The visual classification performance of vision-language models such as ...
Cross-modal retrieval methods are the preferred tool to search databases...
Planning an optimal route in a complex environment requires efficient
re...
Audio-visual generalised zero-shot learning for video classification req...
Providing explanations in the context of Visual Question Answering (VQA)...
Learning to classify video data from classes not included in the trainin...
The objectives of this work are cross-modal text-audio and audio-text
re...
We consider the task of retrieving audio using free-form natural languag...
Explaining the decision of a multi-modal decision-maker requires to dete...
Having access to multi-modal cues (e.g. vision and audio) empowers some
...
This work explores how to use self-supervised learning on videos to lear...
We propose a self-supervised framework for learning facial attributes by...
The objective of this paper is a neural network model that controls the ...