Google’s Gemini 3 Beats GPT-5.1 Benchmarks, Integrated into Search on Day One

Google’s release of Gemini 3 is more than a routine model upgrade. It marks a redesign of the company’s entire service architecture, starting with Search. It is also Google’s bid for leadership in the second phase of the global AI race.

For the first time, Google connected a new model to Search the moment it was announced. Instead of waiting for extended validation cycles, the company switched the core search engine directly to Gemini 3. This shift signals Google’s belief that selecting models manually is becoming obsolete. The system now determines the best model on its own and users simply search.

Gemini 3 integrated into Google services. Photo courtesy of Google

Search Turns on Gemini 3 the Moment “Thinking Mode” Is Activated

In the United States, AI Pro and Ultra subscribers can activate Gemini 3 Pro instantly by selecting “Thinking” mode in the search bar. The result is a new default experience in which searching and using Gemini 3 effectively become the same action.

Google plans to take this further. Complex queries will be automatically routed to Gemini 3 while lightweight factual questions will be handled by smaller models. The company says the future of Search is not a list of links but a system that interprets intent, context and difficulty, then generates the most appropriate response in real time.
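The routing idea described above can be illustrated with a small sketch. This is not Google's implementation: the model names, the cue-word list and the threshold are all hypothetical placeholders, and a production router would presumably use a learned classifier rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical model tiers; placeholder names, not real API identifiers.
LARGE_MODEL = "large-reasoning-model"
SMALL_MODEL = "small-fast-model"

# Assumed cue words that hint at multi-step reasoning.
REASONING_CUES = ("why", "how", "compare", "explain", "plan", "versus", "step")


@dataclass
class Route:
    model: str
    score: int


def difficulty_score(query: str) -> int:
    """Crude difficulty estimate: one point per cue word, plus one for long queries."""
    words = query.lower().split()
    score = sum(1 for w in words if w.strip("?,.") in REASONING_CUES)
    if len(words) > 12:
        score += 1
    return score


def route(query: str, threshold: int = 1) -> Route:
    """Send queries at or above the threshold to the large model, the rest to the small one."""
    score = difficulty_score(query)
    model = LARGE_MODEL if score >= threshold else SMALL_MODEL
    return Route(model=model, score=score)


if __name__ == "__main__":
    for q in ("capital of France",
              "Compare heat pumps versus gas furnaces and explain which is cheaper"):
        print(q, "->", route(q).model)
```

In this sketch the user never picks a model. The only input is the query, which mirrors the article's point that model selection moves from the user into the system.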

Benchmark Results Show Across-the-Board Gains in Reasoning, Multimodality and Agent Capabilities

■ Advanced Reasoning

Gemini 3 posted record scores on multiple global benchmarks.

Benchmark	Meaning	Score
LMArena	Tournament-style evaluation of reasoning and knowledge	1501 Elo, highest to date
Humanity’s Last Exam	Graduate-level reasoning test	37.5% (Deep Think: 41%)
GPQA Diamond	PhD-level biology, physics and chemistry reasoning	91.9% (Deep Think: 93.8%)
ARC-AGI-2	Abstract visual reasoning puzzle for AGI	Deep Think 45.1%

GPQA Diamond consists of questions that typically only trained researchers can answer. Deep Think’s 93.8 percent score places Gemini 3 near expert-level performance.

■ Multimodal Understanding

Gemini 3 interprets text, images, video and diagrams simultaneously.

Benchmark	Meaning	Score
MMMU-Pro	Multidomain test across medicine, engineering, geography and arts	81%
Video-MMMU	Understanding motion, context and events in video	87.6%
ScreenSpot-Pro	Interpreting computer UI elements, graphs and tables	72.7%
SimpleQA Verified	Source-verified factual QA	72.1%

The Video-MMMU result shows that the model goes beyond basic description. It understands situations, reasons through events and explains meaning across multiple visual inputs.

■ Agent and Coding Performance

Gemini 3 demonstrates end-to-end task execution rather than simple code generation.

Benchmark	Meaning	Score
WebDev Arena	Full workflow for building websites and interfaces	1487 Elo
Terminal-Bench 2.0	Solving problems using command-line interaction	54.2%
SWE-bench Verified	Fixing and verifying real bugs in open-source code	76.2%

These results indicate that the model can understand problems, write code, run it, fix errors and execute the workflow again. Google’s release of the “Antigravity” agent framework aligns with this expanded capability. Gemini 3 is designed to control browsers, terminals and editors directly, operating as a self-contained agent.
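The write, run, fix and rerun cycle described above can be sketched in a few lines. This is only an illustration of the loop's shape and does not reflect Antigravity's or Gemini's actual interfaces. The `propose` callback stands in for a model call, and `stub_propose` is a hypothetical stub that returns a buggy candidate first and a corrected one after seeing the error.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_candidate(source: str) -> Optional[str]:
    """Execute candidate code; return None on success or the traceback text on failure."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc()


def agent_loop(propose: Callable[[Optional[str]], str],
               max_attempts: int = 3) -> Tuple[int, str]:
    """Ask for code, run it, and feed any error back until a candidate succeeds."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = propose(error)
        error = run_candidate(source)
        if error is None:
            return attempt, source
    raise RuntimeError(f"no working candidate after {max_attempts} attempts")


def stub_propose(last_error: Optional[str]) -> str:
    """Hypothetical stand-in for a model: buggy first draft, fixed draft after an error."""
    if last_error is None:
        return "total = sum(['1', '2'])"  # raises TypeError (int + str)
    return "total = sum(int(x) for x in ['1', '2'])\nassert total == 3"


if __name__ == "__main__":
    attempts, final_source = agent_loop(stub_propose)
    print(f"succeeded after {attempts} attempt(s)")
```

A real agent would run candidates in a sandboxed process instead of calling `exec` in the same interpreter. The in-process version is used here only to keep the sketch self-contained.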

A New Standard for AI-Driven Services

The introduction of Gemini 3 signals a shift that goes far beyond model comparisons. Google is treating the model as the core engine of its services, especially Search. The pace of change will accelerate as this new standard spreads across the company’s platforms.

The outcome of the next phase of the AI race will depend on how effectively companies embed these standards into real products.

Ju-baek Shin | jbshin@kmjournal.net
