Google’s Gemini Pro and OpenAI’s GPT-4V compete in visual capabilities


Two new papers examine the visual capabilities of Google’s Gemini Pro and OpenAI’s GPT-4V. The two models are roughly on par, with slight advantages for GPT-4V.

Two new papers from Tencent Youtu Lab, the University of Hong Kong, and numerous other universities and institutes comprehensively compare the visual capabilities of Google’s Gemini Pro and OpenAI’s GPT-4V, currently the most capable multimodal large language models (MLLMs).

The research focuses on the specific strengths and capabilities of each model and compares them in detail across multiple dimensions. These include image recognition, text recognition in images, image reasoning, reasoning over text in images, integrated image and text understanding, object localization, temporal video understanding, and multilingual capability.

GPT-4V and Gemini Pro are on par when it comes to visual comprehension and reasoning

Both models showed comparable performance on basic image recognition tasks. They can extract text from images, but need improvement in areas such as recognizing complex formulas, as one of the two papers shows.



Image: Qi et al.

In image understanding, both models showed good common-sense reasoning. However, Gemini performed slightly worse than GPT-4V on pattern-finding tasks (IQ tests).

Image: Fu et al.

Both models also showed a good understanding of humor, emotion, and aesthetic judgment (EQ tests).

Image: Qi et al.

In terms of text comprehension, Gemini performed somewhat worse than GPT-4V on complex tabular reasoning and mathematical problem-solving tasks. Google’s larger model, Gemini Ultra, might do better here.

MME benchmark results. | Image: Fu et al.

In terms of the level of detail and accuracy of the responses, the research teams made exactly opposite observations: One group attributed particularly detailed or concise responses to Gemini, the other to GPT-4V. According to the researchers, Gemini also adds relevant images and links to its responses.

In terms of commercial applications, GPT-4V outperformed Gemini in the areas of embodied agents and GUI navigation. Gemini, in turn, is said to have advantages in multimodal reasoning.


