Summary of the paper ‘An Image Is Worth 16x16 Words’

Architecture of ViT
  • ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images
  • ImageNet-21k with 21k classes and 14M images
  • JFT with 18k classes and 303M high-resolution images
ViT model variants used in the experiments
Benchmark Report
Performance vs Dataset
Performance vs Computation

Gareema Dhingra

A final year student learning something new every day about data science and machine learning.