We present Kosmos-2.5, a multimodal literate model for machine reading o...
In this work, we use large language models (LLMs) to augment and acceler...
Multimodal signals, including text, audio, image, and video, can be integ...
Next-generation edge intelligence is anticipated to bring huge benefits ...
Machine Learning as a Service (MLaaS) has gained popularity due to advan...
Optical Character Recognition (OCR) enables automatic text extraction fr...
In this work, we propose Retentive Network (RetNet) as a foundation arch...
Semantic communication (SC) is an emerging intelligent paradigm, offerin...
Scaling sequence length has become a critical demand in the era of large...
Deep speech classification has achieved tremendous success and greatly p...
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enablin...
Knowledge Distillation (KD) is a promising technique for reducing the hi...
Existing large language models (LLMs) can only afford fixed-size inputs d...
In-context learning, where pre-trained language models learn to perform ...
Large vision Transformers (ViTs) driven by self-supervised pre-training...
A big convergence of language, multimodal perception, action, and world...
Inductive reasoning is a core component of human intelligence. In the pa...
Large pretrained language models have shown surprising In-Context Learni...
Position modeling plays a critical role in Transformers. In this paper, ...
Pre-trained models have achieved remarkable success in natural language...
Well-designed prompts can guide text-to-image models to generate amazing...
Large language models have exhibited intriguing in-context learning capa...
We propose eXtensible Prompt (X-Prompt) for prompting a large language m...
Large Transformers have achieved state-of-the-art performance across man...
Unmanned aerial vehicles (UAVs) can be applied in many Internet of Thing...
In this paper, we elaborate upon recipes for building multilingual repre...
Masked image modeling has demonstrated great potential to eliminate the...
Contrastive language-image pre-training (CLIP) serves as a de-facto stan...
Named entity recognition (NER) suffers from the scarcity of annotated tr...
A big convergence of model architectures across language, vision, speech...
In conventional backscatter communication (BackCom) systems, time divisi...
Convolutional neural networks can achieve remarkable performance in sema...
A big convergence of language, vision, and multimodal pretraining is eme...
Masked image modeling (MIM) has demonstrated impressive results in self-...
Deep neural networks (DNNs) have been shown to be vulnerable against adv...
Foundation models have received much attention due to their effectivenes...
We introduce a vision-language foundation model called VL-BEiT, which is...
As more and more pre-trained language models adopt on-cloud deployment, ...
Backscatter communication (BackCom), one of the core technologies to rea...
In-context learning of GPT-like models has been recognized as fragile ac...
Human language is grounded in multimodal knowledge including visual know...
Synthetic speech detection is one of the most important research problem...
In this paper, we investigate and analyze full-duplex-based backscatter...
Sparse mixture of experts provides larger model capacity while requiring...
The Mixture-of-Experts (MoE) technique can scale up the model size of Tr...
CLIP has shown a remarkable zero-shot capability on a wide range of visi...
In this paper, we propose a simple yet effective method to stabilize ext...
To guide the generation of large pretrained language models (LMs), previo...
With the increase in model capacity brought by pre-trained language mo...
We introduce Corrupted Image Modeling (CIM) for self-supervised visual p...