Title | Journal | Journal Categories | Citations | Publication Date
---|---|---|---|---
ViLT: Vision-and-language transformer without convolution or region supervision | | | | 2021
ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks | | | | 2019
Multi-grained vision language pre-training: Aligning texts with visual concepts | | | |
OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | | | |
Faster R-CNN: Towards real-time object detection with region proposal networks | | | |