
Half of these "AI upgrades" are really just pricing tweaks and a fresh UI. I've seen plenty of those.
Gemini 3.1 Pro is different: it's a reasoning-first preview (Feb 19, 2026) with controllable thinking depth, 1M-token multimodal context, and output sizes big enough to ship real artifacts. If teams treat it like a smarter chatbot, they'll miss the actual win: tool-driven workflows that behave like a junior engineer with a calculator, a file system, and a camera.
gemini-3.1-pro-preview with dynamic thinking
A very common need in 2026 is switching between "fast answer" and "slow, careful answer" without swapping models.
```bash
pip install -U google-genai
export GEMINI_API_KEY="[YOUR_API_KEY]"
```
```python
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

resp = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Summarize the risk trade-offs of using long-context LLMs for legal review.",
    config={
        # The google-genai SDK nests the level under thinking_config.
        # Levels as listed in this post: low | medium | high | max.
        # Confirm which values your model version accepts.
        "thinking_config": {"thinking_level": "medium"},
        "temperature": 0.2,
        "max_output_tokens": 1500,
    },
)
print(resp.text)
```
thinking_level is the new control knob that actually matters in production. In my experience, "medium" is the best place to start because it avoids the two classic failure modes: "low" can blow past multi-step constraints, while "max" can jack up latency and cost without improving correctness on straightforward tasks. What I usually see teams do is route requests: low for classification and extraction, medium for planning and synthesis, high/max for hard reasoning, tool loops, and long-context reconciliation.
[!IMPORTANT] Treat thinking_level as part of your API contract. If you change it, you change behavior. Version it like you version prompts.
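One way to make that contract explicit is to bundle the prompt and generation settings into an immutable profile and derive a version tag from its contents. This is a minimal sketch: `GenerationProfile` and its field names are hypothetical and not part of any SDK.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationProfile:
    """Hypothetical bundle of everything that changes model behavior."""
    model: str
    system_prompt: str
    thinking_level: str
    temperature: float

    def version(self) -> str:
        # Hash the canonical JSON form so any field change yields a new tag.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]


v1 = GenerationProfile("gemini-3.1-pro-preview", "You are a reviewer.", "medium", 0.2)
v2 = GenerationProfile("gemini-3.1-pro-preview", "You are a reviewer.", "high", 0.2)

# Same prompt, different thinking level: the tags differ, so logs and evals
# can tell the two behaviors apart.
print(v1.version(), v2.version())
```

Logging the tag next to every response makes "which settings produced this answer?" a lookup instead of an investigation.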
thinking_level)
If Gemini 3.1 Pro's benchmarks (ARC-AGI-2 around 77.1% and top GPQA Diamond reporting) hold in your domain, the practical impact isn't "it's smarter". It's "it stays smart when the prompt gets messy" - and yes, real prompts get messy fast.
Use this routing template to keep latency predictable.
```python
def pick_thinking_level(task: str) -> str:
    task = task.lower()
    if any(k in task for k in ["classify", "extract", "regex", "format", "tag"]):
        return "low"
    if any(k in task for k in ["plan", "design", "trade-off", "summarize", "rewrite"]):
        return "medium"
    if any(k in task for k in ["debug", "prove", "optimize", "root cause", "multi-step"]):
        return "high"
    return "medium"
```
This looks almost too simple, but it prevents a very real production issue: teams run a few impressive demos at "max", then quietly crank everything to "max", then act surprised when p95 latency spikes. A basic router plus per-endpoint budgets is often enough to stabilize cost and UX.
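The "per-endpoint budgets" half of that advice can be sketched as a clamp that runs after the router. The endpoint names and ceilings below are hypothetical placeholders, and the level ordering simply follows the names used in this post.

```python
# Ordered from cheapest to most expensive, using the level names from this post.
LEVELS = ["low", "medium", "high", "max"]

# Hypothetical per-endpoint ceilings; tune these from your own latency data.
ENDPOINT_BUDGETS = {
    "/classify": {"max_level": "low", "max_output_tokens": 256},
    "/summarize": {"max_level": "medium", "max_output_tokens": 1500},
    "/investigate": {"max_level": "high", "max_output_tokens": 4000},
}


def apply_budget(endpoint: str, requested_level: str, requested_tokens: int) -> dict:
    """Clamp a request to the endpoint's ceiling instead of trusting the caller."""
    budget = ENDPOINT_BUDGETS[endpoint]
    level_idx = min(LEVELS.index(requested_level), LEVELS.index(budget["max_level"]))
    return {
        "thinking_level": LEVELS[level_idx],
        "max_output_tokens": min(requested_tokens, budget["max_output_tokens"]),
    }


# A caller asking for "max" on a summarization endpoint gets clamped to "medium".
print(apply_budget("/summarize", "max", 8000))
```

Clamping rather than rejecting keeps the endpoint usable while still preventing the quiet drift to "max" everywhere.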
Contrarian take (but I'll stand by it): for many apps, thinking_level=low plus better retrieval beats max plus a giant prompt. You get more predictable outputs and fewer "creative" leaps.

The headline is up to 1M tokens of input context and up to 64K tokens of output. The less obvious shift is architectural: you can keep documents, code, and transcripts together long enough that cross-references don't get lost halfway through the pipeline.
Start with a "single pass reconciliation" prompt that forces citations to supplied files only.
```text
You are reviewing the provided materials for contradictions and missing requirements.

Rules:
- Use only the provided files. If something is unknown, say "Unknown in provided files".
- Produce a table with columns: Claim, Source file + section, Conflicts with, Proposed resolution.
- After the table, output a final consolidated requirements list with stable IDs like REQ-001.

Materials:
[PASTE OR ATTACH FILES HERE]
```
Long context doesn't remove the need for structure. It just changes where structure lives: less in chunking code, more in document conventions (section headers, stable requirement IDs, consistent naming). If your docs are sloppy, 1M tokens mostly gives the model more ways to contradict itself - well, actually, more ways to sound consistent while being inconsistent, which is worse.
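Document conventions are easier to keep when something checks them before the files reach the model. The sketch below lints one document for duplicated or skipped requirement IDs. It assumes a convention this post does not specify: each requirement is defined on its own line as `REQ-001: ...`.

```python
import re
from collections import Counter

# Assumed convention: each requirement is defined on its own line as "REQ-001: ...".
REQ_DEFINITION = re.compile(r"^(REQ-(\d{3})):", re.MULTILINE)


def lint_requirement_ids(doc_text: str) -> dict:
    """Report IDs defined more than once and gaps in the numbering."""
    matches = REQ_DEFINITION.findall(doc_text)  # list of (full_id, digits)
    counts = Counter(full_id for full_id, _ in matches)
    numbers = sorted({int(digits) for _, digits in matches})
    gaps = []
    if numbers:
        present = set(numbers)
        gaps = [f"REQ-{n:03d}" for n in range(numbers[0], numbers[-1] + 1) if n not in present]
    return {
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
        "gaps": gaps,
    }


# Illustrative document: REQ-002 is defined twice and REQ-003 is skipped.
spec = "\n".join([
    "REQ-001: Checkout retries a failed payment once.",
    "REQ-002: Refunds post within 5 business days.",
    "REQ-002: Refunds post within 10 business days.",
    "REQ-004: Receipts are emailed after capture.",
])
print(lint_requirement_ids(spec))
```

A duplicated ID with two different bodies is exactly the kind of input that produces a confident but merged answer, so it is cheaper to catch it here.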
[!WARNING] Long context increases the chance of "silent contradiction" where the model merges incompatible statements. Always ask for a conflict table before asking for a final answer.
The 2026 pattern is a loop: plan, call tools, observe, refine. Gemini 3.1 Pro is positioned for agentic workflows, so treat it like an orchestrator, not a text generator.
Here's a minimal tool loop skeleton you can adapt to Vertex AI or the Gemini API.
```python
import json
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


def tool_search_tickets(query: str) -> dict:
    # Replace with Jira/Linear/GitHub search
    return {"results": [{"id": "INC-1842", "title": "Checkout 500s", "notes": "Started after deploy 2026-02-18"}]}


def tool_run_sql(sql: str) -> dict:
    # Replace with read-only analytics query
    return {"rows": [{"day": "2026-02-18", "errors": 912}, {"day": "2026-02-19", "errors": 1440}]}


TOOLS = {
    "search_tickets": tool_search_tickets,
    "run_sql": tool_run_sql,
}

system = """
You are an incident analyst.
You may call tools:
- search_tickets(query: string)
- run_sql(sql: string)

To call a tool, reply with exactly one line:
CALL_TOOL: {"name": "<tool name>", "args": {...}}

Rules:
- Call tools when evidence is needed.
- After each tool call, update your hypothesis.
- Final output: root cause candidates ranked, with next actions.
"""

msg = """
Investigate the spike in checkout errors.
Start by finding related incidents and correlating with error counts.
"""

# The Gemini API uses "user" and "model" roles; the system prompt goes in config.
state = [{"role": "user", "parts": [{"text": msg}]}]

for _ in range(6):
    resp = client.models.generate_content(
        model="gemini-3.1-pro-preview",
        contents=state,
        config={
            "system_instruction": system,
            "thinking_config": {"thinking_level": "high"},
            "temperature": 0.1,
            "max_output_tokens": 1200,
        },
    )
    text = resp.text or ""
    if "CALL_TOOL:" not in text:
        print(text)
        break

    # Simple convention: the model outputs a JSON tool request on one line
    request_line = text.split("CALL_TOOL:", 1)[1].strip().splitlines()[0]
    tool_req = json.loads(request_line)
    tool_name = tool_req["name"]
    tool_args = tool_req["args"]
    tool_out = TOOLS[tool_name](**tool_args)

    state.append({"role": "model", "parts": [{"text": text}]})
    state.append({"role": "user", "parts": [{"text": f"TOOL_RESULT {tool_name}: {json.dumps(tool_out)}"}]})
```
This pattern matters because it turns "hallucination risk" into "missing data risk". When the model has to call run_sql to support a claim, your system becomes inspectable. And it makes evals way less hand-wavy: you can replay the same tool results and compare outputs across model versions.
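Replaying tool results can be as simple as a wrapper that records live calls and serves them back later. This is a sketch under stated assumptions: `ToolRecorder` is a hypothetical helper, the recording lives in memory, and `fake_run_sql` stands in for a real query tool.

```python
import json
from typing import Optional


class ToolRecorder:
    """Wraps tool functions; records live results, or replays recorded ones."""

    def __init__(self, tools: dict, recording: Optional[dict] = None):
        self.tools = tools
        self.replay = recording is not None
        self.recording = recording if recording is not None else {}

    def call(self, name: str, **args) -> dict:
        # Key on tool name plus canonical arguments so replays are deterministic.
        key = f"{name}:{json.dumps(args, sort_keys=True)}"
        if self.replay:
            return self.recording[key]  # KeyError means the model asked something new
        result = self.tools[name](**args)
        self.recording[key] = result
        return result


def fake_run_sql(sql: str) -> dict:
    # Stand-in for a real read-only query tool.
    return {"rows": [{"day": "2026-02-18", "errors": 912}]}


live = ToolRecorder({"run_sql": fake_run_sql})
first = live.call("run_sql", sql="SELECT day, errors FROM checkout_errors")

# Later: replay the saved results against another model version, no database needed.
replayed = ToolRecorder({}, recording=live.recording)
second = replayed.call("run_sql", sql="SELECT day, errors FROM checkout_errors")
print(first == second)
```

A `KeyError` during replay is itself a useful eval signal: the new model version asked for evidence the old one never needed.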
For a deeper agentic pattern and how teams are structuring autonomous teammates, see Agentic AI in 2026: Autonomous AI Teammates.

Gemini 3.1 Pro is unusually good at "code-based visuals": editable SVG animations, layout-correct UI scaffolds, and lightweight interactive artifacts. And honestly, this is often more useful than generating pixel video because SVG is diffable, compressible, and reviewable in PRs.
Use this prompt to generate an animated SVG loader that matches your design tokens.
```text
Create a single self-contained SVG animation.

Constraints:
- Output only SVG code, no markdown.
- Size: 240x60 viewBox.
- Use CSS variables for colors: --fg, --muted.
- Animation: 3 dots with staggered scale and opacity, 1.2s loop.
- Must be accessible: include <title> and <desc>.
- Keep it under 6 KB if possible.

Brand:
Primary color: #1A73E8
Muted: #D2E3FC
Background: transparent
```
The real-world consequence is governance (which people forget until it bites them): designers can review the SVG like code, and engineers can tweak timing without re-prompting. Teams that standardize "visual outputs as code" usually iterate faster and ship fewer "looks different on my machine" bugs.
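Reviewing SVG like code also means it can be linted like code. The sketch below checks a generated SVG against the constraints in the prompt above (well-formed XML, 240x60 viewBox, `<title>` and `<desc>`, size budget). `check_svg` is a hypothetical helper, and it assumes the title and description are direct children of the root element.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def check_svg(svg_text: str, max_bytes: int = 6 * 1024) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(svg_text.encode("utf-8")) > max_bytes:
        problems.append("larger than size budget")
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        return problems + [f"not well-formed XML: {exc}"]
    if root.tag != f"{SVG_NS}svg":
        problems.append("root element is not <svg> in the SVG namespace")
    # viewBox is "min-x min-y width height"; only width and height are checked.
    if (root.get("viewBox") or "").split()[2:] != ["240", "60"]:
        problems.append("viewBox is not 240x60")
    for tag in ("title", "desc"):
        if root.find(f"{SVG_NS}{tag}") is None:
            problems.append(f"missing <{tag}>")
    return problems


# Illustrative model output that forgot the <desc> element.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 240 60">'
    "<title>Loading</title><circle cx='30' cy='30' r='6'/></svg>"
)
print(check_svg(sample))
```

Running a check like this in CI turns "the prompt asked for accessibility" into "the build fails without it".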
The model card highlights an "Agentic Vision" style loop: use visual reasoning, then code execution to measure, crop, annotate, and verify. The win is repeatability, not vibes.
```python
from PIL import Image, ImageStat

img = Image.open("checkout-error-modal.png").convert("RGB")

# Quick sanity checks that often catch UI regressions
w, h = img.size
stat = ImageStat.Stat(img)
avg = tuple(int(x) for x in stat.mean)
print({"width": w, "height": h, "avg_rgb": avg})
```
When the model asks for "zoom into top-right" or "measure padding", you can do it with code and feed back the result. That avoids the common failure mode where the model just guesses pixel measurements. And you get an audit trail for design QA, which is gold when someone asks "who changed this?"
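A "measure padding" request might be answered along these lines. To keep the sketch dependency-free, it works on a plain grid of pixel values rather than an image object; in a real pipeline the grid would come from the cropped screenshot. Both function names are hypothetical.

```python
def content_bbox(pixels, background):
    """Inclusive (left, top, right, bottom) box around non-background pixels."""
    rows = [y for y, row in enumerate(pixels) if any(p != background for p in row)]
    cols = [x for x in range(len(pixels[0])) if any(row[x] != background for row in pixels)]
    if not rows:
        return None  # the crop contains only background
    return (cols[0], rows[0], cols[-1], rows[-1])


def measure_padding(pixels, background):
    """Gap between each edge of the crop and the content, in pixels."""
    box = content_bbox(pixels, background)
    if box is None:
        return None
    left, top, right, bottom = box
    height, width = len(pixels), len(pixels[0])
    return {"left": left, "top": top, "right": width - 1 - right, "bottom": height - 1 - bottom}


# Stand-in for a cropped screenshot region: 0 is background, 1 is a button.
crop = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(measure_padding(crop, background=0))
```

Feeding the measured numbers back as a tool result gives the model facts to compare against the spec instead of an estimate.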
Grounding features (including Google Maps grounding in the broader platform) are pushing apps toward traceability. The practical change is product design: users expect "show me where you got that", and they're not wrong.
Use this answer format prompt even if you're not using a built-in grounding tool yet.
```text
Answer using this structure:
1) Direct answer (2-4 sentences)
2) Evidence used (bullets, each item must reference a provided document name, a tool result ID, or "User provided")
3) Assumptions (bullets)
4) What would change the answer (bullets)

Question:
[QUESTION]

Available evidence:
[LIST FILES, DB QUERIES, OR TOOL RESULT IDS]
```
This format tends to reduce support tickets because disagreements become concrete. Instead of "the AI is wrong", you get "it used an outdated policy PDF" or "the address database query returned null" - which is something you can actually fix.
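The "Evidence used" rule is only useful if something enforces it. Here is a minimal sketch of a post-check that flags evidence bullets citing nothing from the supplied evidence list. It assumes the model follows the numbered headings from the prompt above, and the file name and tool result ID in the example are invented for illustration.

```python
import re


def unreferenced_evidence(answer: str, allowed_sources: set) -> list:
    """Return evidence bullets that cite nothing from the allowed sources."""
    # Capture the text between the "2) Evidence used" and "3) ..." headings.
    match = re.search(
        r"^2\)\s*Evidence used(.*?)(?=^3\)|\Z)", answer, re.DOTALL | re.MULTILINE
    )
    if not match:
        return ["<no 'Evidence used' section found>"]
    bullets = [
        line.strip().lstrip("-* ").strip()
        for line in match.group(1).splitlines()
        if line.strip().startswith(("-", "*"))
    ]
    return [b for b in bullets if not any(src in b for src in allowed_sources)]


# Illustrative model answer; the second bullet cites nothing that was supplied.
answer = "\n".join([
    "1) Direct answer",
    "Refunds take 5 business days.",
    "2) Evidence used",
    "- refund-policy-2026.pdf, section 3",
    "- General industry knowledge",
    "3) Assumptions",
    "- Customer is in the US",
])
allowed = {"refund-policy-2026.pdf", "TOOL-17", "User provided"}
print(unreferenced_evidence(answer, allowed))
```

Substring matching is deliberately crude here; a stricter version would require an exact source ID per bullet.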
Contrarian take: grounding isn't only about correctness. It's also about liability. Traceable answers are easier to defend internally, even when they're incomplete.
The fastest way to cut spend usually isn't prompt trimming. It's reusing work.
Gemini's platform features commonly include caching and batch processing, and teams that ignore them end up paying "demo pricing" forever. Here's a simple "prompt cache key" pattern that avoids recomputing stable system instructions and tool schemas.
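```python
import hashlib
import json

# Stand-in for the system prompt defined in the tool loop above.
system = "You are an incident analyst."


def cache_key(model: str, system: str, tool_schema: dict) -> str:
    # sort_keys makes the JSON canonical, so equal inputs always hash the same
    blob = json.dumps({"model": model, "system": system, "tool_schema": tool_schema}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


key = cache_key(
    "gemini-3.1-pro-preview",
    system,
    {"tools": ["search_tickets", "run_sql"]},
)
print(key)
```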
When you key on "things that rarely change", you can cache model setup steps, embeddings, or retrieved context bundles. Batch then handles the rest: nightly doc reconciliation, ticket summarization, policy diffing, and regression test generation.
Benchmarks to use when estimating ROI: from what I've seen, many teams land at 30% to 70% lower unit costs once they move repeated workloads to batch queues and cache stable context. The exact number depends on reuse rate and output length, but the direction is pretty consistent.
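To see how reuse rate and output length drive that range, here is a back-of-the-envelope cost model. Every number in it is a hypothetical placeholder (token prices, cache discount, batch discount), not actual Gemini pricing; substitute the figures from your own bill.

```python
def unit_cost(in_tokens, out_tokens, price_in, price_out,
              cached_fraction=0.0, cache_discount=0.0, batch_discount=0.0):
    """Cost of one request in dollars. Prices are dollars per 1M tokens."""
    cached = in_tokens * cached_fraction
    fresh = in_tokens - cached
    # Cached input tokens are billed at a discount; fresh ones at full price.
    input_cost = (fresh + cached * (1 - cache_discount)) * price_in / 1_000_000
    output_cost = out_tokens * price_out / 1_000_000
    return (input_cost + output_cost) * (1 - batch_discount)


# Hypothetical workload: 100K input tokens, 2K output tokens per request.
baseline = unit_cost(100_000, 2_000, price_in=2.0, price_out=12.0)
optimized = unit_cost(
    100_000, 2_000, price_in=2.0, price_out=12.0,
    cached_fraction=0.5,   # half the input is stable context
    cache_discount=0.5,    # placeholder discount on cached tokens
    batch_discount=0.25,   # placeholder discount for batch jobs
)
print(f"baseline=${baseline:.4f} optimized=${optimized:.4f} "
      f"saving={1 - optimized / baseline:.0%}")
```

With these placeholder inputs the saving lands in the low 40% range. Long outputs shrink it, because output tokens get no caching benefit in this model.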
Apps will expose "Fast" vs "Accurate" modes because users can feel the difference. Internally, that maps to thinking_level plus tool depth limits.
Adoption timeline estimate: 1 to 2 quarters for teams already shipping LLM features. 3 to 4 quarters for regulated orgs that need evaluation sign-off.
1M tokens helps most when your content has stable anchors: headings, IDs, changelogs, and explicit ownership. Without that, the model produces plausible merges that are hard to detect (the worst kind of wrong).
Adoption timeline estimate: immediate for engineering teams, slower for legal and policy groups because they have to change authoring habits.
SVG, HTML, and small interactive canvases will replace many "marketing demo" video generations inside product teams. They're editable, reviewable, and easy to ship.
Adoption timeline estimate: 2 quarters for design systems teams, 4 quarters for marketing orgs that still think in pixels.
As agents call tools, the weakest link becomes your integrations: flaky search, inconsistent permissions, slow databases, and non-idempotent actions. Teams will add "tool SLAs" and treat tool outputs like APIs that need tests (because that's what they are).
Adoption timeline estimate: 2 to 3 quarters after first agent pilots, usually right after the first incident caused by a bad tool call.
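For the non-idempotent actions mentioned above, one common guard is to deduplicate side-effecting tool calls by their arguments, so an agent retry cannot repeat the action. The sketch below keeps results in memory for illustration; a production version would need a durable store. `IdempotentTool` and `issue_refund` are hypothetical names.

```python
import hashlib
import json


class IdempotentTool:
    """Runs a side-effecting tool at most once per distinct argument set."""

    def __init__(self, fn):
        self.fn = fn
        self._results = {}

    def __call__(self, **args):
        # Canonical JSON of the arguments acts as the idempotency key.
        key = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()
        if key not in self._results:
            self._results[key] = self.fn(**args)
        return self._results[key]


refunds_issued = []


def issue_refund(order_id: str, cents: int) -> dict:
    # Stand-in for a real payment call with side effects.
    refunds_issued.append((order_id, cents))
    return {"status": "refunded", "order_id": order_id}


safe_refund = IdempotentTool(issue_refund)
safe_refund(order_id="A-100", cents=1999)
safe_refund(order_id="A-100", cents=1999)  # the agent retried the same call
print(len(refunds_issued))
```

The same assertion that the side effect ran once is a natural first test for the tool layer.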
The winning eval harness will replay tool results, files, and user events, then compare final decisions. This makes model upgrades safer, especially when moving from Gemini 3 Pro to Gemini 3.1 Pro style reasoning.
Adoption timeline estimate: 3 to 6 months for teams with existing test infra, 9 to 12 months for teams starting from scratch.
| Company | Measurable result | What they used it for |
|---|---|---|
| Stripe | Cut support handle time by 14% | LLM-assisted ticket triage and reply drafting with internal knowledge |
| Shopify | Reduced merchant support backlog by 20% | Automated categorization and routing with stricter answer formatting |
| Netflix | Lowered search-related churn by 1% | Ranking and relevance improvements driven by ML and experimentation |
These are useful sanity checks for ROI targets. If a proposal claims "80% fewer tickets in one month", it's probably skipping constraints like compliance review, escalation paths, and data access realities.
Start with a prompt that forces contradictions to surface before it writes anything final.
```text
You are the requirements reconciler.

Input:
- PRD: [ATTACH]
- Tech spec: [ATTACH]
- Support tickets: [ATTACH]
- Analytics notes: [ATTACH]

Output:
1) Contradictions table: ID, Statement A, Statement B, Impact, Recommended decision
2) Missing requirements list: each item must include "who decides" and "deadline"
3) Final consolidated requirements: REQ-### with acceptance criteria in Gherkin

Rules:
- Do not invent requirements.
- If two docs disagree, mark it as "Needs decision".
```
This works because it matches how projects fail: not from missing creativity, but from hidden inconsistencies. Long context helps because the model can keep the PRD and spec in memory at the same time (instead of you hoping the chunking did the right thing).
Feed the model a screenshot and a design spec, then force it to produce measurements and diffs.
```text
Compare the UI screenshot to the design spec.

Output:
- A list of mismatches with approximate pixel deltas (padding, font size, color, alignment).
- A prioritized fix list for engineers.
- If you are unsure, ask for a zoomed crop region by coordinates.

Design spec:
[ATTACH FIGMA EXPORT OR SPEC PDF]

Screenshot:
[ATTACH IMAGE]
```
This is where Gemini 3.1 Pro's spatial reasoning shows up. The "ask for a crop region" line is what turns it into a loop instead of a guess.
Use batch processing for monthly policy diffs and require traceability.
```text
You are reviewing policy changes.

For each policy document:
- Extract obligations into a JSON array with fields: id, obligation, applies_to, effective_date, source_section.
- Output a second JSON array of open questions.

Rules:
- Every obligation must cite a source_section.
- If the source_section is missing, omit the obligation and add an open question.
```
The benefit is audit readiness. If legal asks "where did this obligation come from", you can point to source_section instead of re-running the model and hoping it says the same thing.
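The citation rule from the prompt can be enforced again on the way in, so an obligation without a source never reaches the audit trail even if the model slips. A minimal sketch, assuming the model returns the first JSON array as described; the sample obligations are invented for illustration.

```python
import json

REQUIRED_FIELDS = {"id", "obligation", "applies_to", "effective_date", "source_section"}


def split_obligations(raw_json: str):
    """Separate citable obligations from ones that should become open questions."""
    accepted, open_questions = [], []
    for item in json.loads(raw_json):
        missing = sorted(REQUIRED_FIELDS - set(item))
        has_citation = bool(str(item.get("source_section") or "").strip())
        if missing or not has_citation:
            reason = f"missing fields: {missing}" if missing else "empty source_section"
            open_questions.append({"id": item.get("id"), "reason": reason})
        else:
            accepted.append(item)
    return accepted, open_questions


# Illustrative model output: the second obligation has no citation.
model_output = json.dumps([
    {"id": "OB-1", "obligation": "Retain invoices for 7 years", "applies_to": "Finance",
     "effective_date": "2026-03-01", "source_section": "4.2"},
    {"id": "OB-2", "obligation": "Encrypt backups", "applies_to": "IT",
     "effective_date": "2026-03-01", "source_section": ""},
])
accepted, open_questions = split_obligations(model_output)
print([o["id"] for o in accepted], open_questions)
```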
Start here (your first step)
Run one internal workflow with thinking_level=medium and a forced "Evidence used" section, then measure p95 latency and correction rate over 50 runs.
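Both numbers are simple to compute once each run is logged. The sketch below uses the nearest-rank definition of p95 and treats "correction rate" as the share of runs a reviewer had to fix, which is an assumption about how you label runs. The latencies are synthetic placeholders, not measurements.

```python
def p95(values):
    """Nearest-rank 95th percentile: the smallest value covering 95% of runs."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def correction_rate(needed_fix):
    """Share of runs that a reviewer had to correct."""
    return sum(needed_fix) / len(needed_fix)


# Synthetic log of 50 runs: latency in milliseconds and a reviewer flag.
latencies_ms = [1000 + 100 * i for i in range(50)]
needed_fix = [i % 10 == 0 for i in range(50)]

print({"p95_ms": p95(latencies_ms), "correction_rate": correction_rate(needed_fix)})
```

With only 50 runs, p95 is determined by the slowest three requests, so rerun the measurement before treating a change as real.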
Quick wins (immediate impact)
thinking_level router (low/medium/high) and cap max_output_tokens per endpoint, then compare cost per 1,000 requests.
Deep dive (for those who want more)
thinking_level.
Gemini 3.1 Pro in 2026 isn't "one more model". It's a shift toward controllable reasoning, long-context reconciliation, and multimodal workflows that produce code artifacts you can actually ship.
Teams that win with it will treat it like a workflow engine: tool calls, traceable outputs, caching, and replayable evals. Teams that lose will run everything at thinking_level=max, paste giant prompts, and call the results "agentic" without building the tool layer that makes agents reliable.