Title | Journal | Journal Categories | Citations | Publication Date
---|---|---|---|---
ViLT: Vision-and-language transformer without convolution or region supervision | | | | 2021
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks | | | | 2019
Multi-grained vision language pre-training: Aligning texts with visual concepts | | | |
OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | | | |
Faster R-CNN: Towards real-time object detection with region proposal networks | | | |