Google’s latest AI models compared with features and performance

Gemini 3 vs Gemini 3 Pro vs Gemini 3 Deep Think: The rollout of Gemini 3, Google’s latest large language model (LLM), appears to be the tech giant’s strongest AI debut in recent months, drawing positive feedback from users and developers alike.

Early reviews suggest that Gemini 3 is a highly capable foundational AI model, especially when it comes to handling reasoning-heavy tasks. The model was shipped on November 18, with Google promoting its arrival as a ‘new era of intelligence’.

Gemini 3 is designed to provide better answers to more complex questions compared with prior models. It is also said to be the best model that Google has built for ‘vibe-coding’, the controversial practice where users mostly rely on AI tools to generate code and build software.

According to Google, its advances with Gemini 3 are reflected in the model’s performance across several benchmark tests. The company claimed that Gemini 3 outperforms its predecessor on every AI benchmark, topping the LMArena leaderboard as well as earning top marks on Humanity’s Last Exam and GPQA Diamond.

However, public benchmarks have been criticised as unreliable indicators of real-world AI performance because they can be easy to game. Famed AI researcher Andrej Karpathy, for instance, urged caution with such benchmarks. He also pointed out that Gemini 3 refused to believe that it was 2025 since its pre-training data only included information up till 2024, though he acknowledged that his early impression of the model was positive.

I played with Gemini 3 yesterday via early access. Few thoughts –

First I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to…

— Andrej Karpathy (@karpathy) November 18, 2025

As feedback continues to roll in over the next few weeks, let’s take a closer look at the Gemini 3 family of models and what each of them has to offer.

Gemini 3

Gemini 3 is said to possess multimodal reasoning capabilities, meaning that it combines reasoning abilities with vision and spatial understanding as well as multilingual skills. It also has a one million-token context window, allowing users to ask complex and nuanced questions, including lengthy ones.

For developers, Gemini 3 is capable of handling complex prompts and instructions to render richer, more interactive web UI. According to Google, Gemini 3 is ‘exceptional’ at zero-shot generation, which means that it can generate software elements without being explicitly trained on such elements.

In terms of use cases, Google said that users could, for instance, ask Gemini 3 to decipher and translate handwritten recipes in different languages into a shareable family cookbook. “It can even analyse videos of your pickleball match, identify areas where you can improve and generate a training plan for overall form improvements,” the company said.

Gemini 3 has been subjected to several safety tests in order to reduce sycophancy and improve resistance to malicious prompt injection attacks, as per Google.

On the benchmark front, Gemini 3 topped the WebDev Arena leaderboard by scoring an impressive 1487 Elo. It also scored 54.2 per cent on Terminal-Bench 2.0, which tests a model’s ability to use tools to operate a computer via the terminal. It outperformed Gemini 2.5 Pro on SWE-bench Verified (76.2 per cent), a benchmark that measures coding agents. The model further topped the Vending-Bench 2 leaderboard, which tests long-horizon planning by having the model manage a simulated vending machine business.

Gemini 3 Pro

“Gemini 3 Pro demonstrates better long-horizon planning to generate significantly higher returns compared to other frontier models,” Google said.

Its responses are smart, concise, and direct. “It acts as a true thought partner that gives you new ways to understand information and express yourself, from translating dense scientific concepts by generating code for high-fidelity visualizations to creative brainstorming,” the company added.

Gemini 3 Pro outperforms Gemini 2.5 Pro on every major AI benchmark. It topped the LMArena leaderboard with a breakthrough score of 1501 Elo. It received top scores on Humanity’s Last Exam (37.5 per cent without the use of any tools) and GPQA Diamond (91.9 per cent).

Gemini 3 Pro is said to be highly capable at solving complex problems across a vast array of topics like science and mathematics. It set a new high score (23.4 per cent) on MathArena Apex, a benchmark for evaluating frontier models on mathematics. Its multimodal reasoning extends beyond text as the model scored 81 per cent on MMMU-Pro and 87.6 per cent on Video-MMMU.

Gemini 3 Pro’s responses are also more likely to be factually accurate, as it scored 72.1 per cent on SimpleQA Verified.

Gemini 3 Deep Think

Gemini 3 Deep Think is an enhanced reasoning mode that pushes Gemini 3’s multimodal reasoning capabilities even further to help users solve more complex problems.

In testing, Gemini 3 Deep Think outperformed Gemini 3 Pro on Humanity’s Last Exam (41.0 per cent without the use of tools) and GPQA Diamond (93.8 per cent). It also achieved 45.1 per cent on ARC-AGI-2 (with code execution, ARC Prize Verified), demonstrating its ability to solve novel challenges.

However, Google said that the Gemini 3 Deep Think mode is still undergoing safety evaluations and will be made available to Google AI Ultra subscribers after gathering inputs from safety testers in the coming weeks. The company has also said it plans to release additional models in the Gemini 3 series soon.