From Text-to-Image to Conversation-to-Image
Stop writing prompts. Start having conversations. How Nano Banana 2's multimodal dialogue transforms image generation from a slot machine into a collaborative design process.
Published 2026-02-27
The Prompt Engineering Trap
In 2024, AI image generation was a slot machine.
You pulled the lever—write a prompt, hit generate—and hoped for the jackpot. Most of the time, you got lemons. So you pulled again. And again. And again. Each generation cost money. Each failure cost time.
Meet David. He's a marketing director at a SaaS startup. In October 2024, he needed a hero image for a landing page: "A developer working at a standing desk, modern office, natural lighting, focused expression, minimalist aesthetic."
His workflow:
Generation 1: "The developer looks bored. Can we make them more engaged?"
Generation 2: "Better expression, but the lighting is too harsh."
Generation 3: "Lighting's good, but the desk is the wrong color."
Generation 4: "Desk color fixed, but now the pose is awkward."
Generation 5: "Pose is better, but the background is distracting."
Generations 6-15: Various attempts to fix various issues.
Total cost: $8.50. Total time: 47 minutes. Result: "It's fine. Let's just use this one."
This is the hidden cost of traditional AI image generation. Not the API calls. The iteration. The death by a thousand micro-adjustments.
And the worst part? Each generation was independent. The model didn't "remember" what David liked about Generation 2 when working on Generation 3. It was Groundhog Day, every single time.
The Traditional Fix (And Why It's Broken)
Fix 1: Longer, More Detailed Prompts
The advice everyone gave: "Write better prompts."
So David learned prompt engineering:
- "8k, highly detailed, professional photography"
- "Unsplash style, shot on Canon R5, 50mm lens"
- "soft diffused lighting from window at 2pm, ISO 100, f/2.8"
- "minimalist Scandinavian office interior with Eames chair"
His prompts grew to 200+ words. The results? Marginally better. But now each prompt took 10 minutes to write. And when the client said "Actually, let's try a sitting desk instead of standing," he rewrote the entire novel.
Effort increased 10x. Results improved 20%.
Fix 2: Image-to-Image
Most tools added img2img features. Upload your almost-right image, describe changes, generate variations.
Better, but clunky:
- Download the image
- Upload to img2img interface
- Write a new prompt describing changes
- Adjust strength slider (0.5? 0.7? 0.9?)
- Generate 3-4 variations
- None look right
- Adjust strength again
- Repeat
And img2img had a fatal flaw: it was destructive. Each pass degraded quality. Details blurred. Artifacts appeared. By generation 5, the image looked like a photocopy of a photocopy.
Fix 3: Layered Editing + Inpainting
Photoshop-style workflows. Mask the area you want to change. Describe the change. Generate.
Powerful, but:
- Required technical skill (masking, layers, blending)
- Time-consuming (5 minutes of masking per edit)
- Inconsistent style (new elements didn't always match old)
David needed a designer's help for complex edits. The AI "democratization" didn't feel very democratic.
Nano Banana 2: The Conversation Model
January 2026. Nano Banana 2 changes the game.
Not with better prompts. Not with better img2img. With conversation.
David's new workflow for the same landing page image:
Turn 1:
David: "Generate a developer working at a desk, modern office"
Nano Banana 2: [generates image]
Turn 2:
David: "Make them standing, not sitting, and add a second monitor"
Nano Banana 2: [updates image, same person, now standing, dual monitors]
Turn 3:
David: "The lighting feels too artificial. Make it natural window light, late afternoon"
Nano Banana 2: [updates image, warm golden hour lighting]
Turn 4:
David: "Perfect lighting. Can we add a plant in the corner? A tall fiddle leaf fig"
Nano Banana 2: [adds plant, maintains lighting and composition]
Turn 5:
David: "The plant is too prominent. Make it smaller and move it behind the desk"
Nano Banana 2: [adjusts plant size and position]
Total cost: $0.45 (5 turns). Total time: 6 minutes. Result: "This is exactly what I wanted."
The difference is paradigm-shifting. David isn't writing prompts. He's having a conversation. The model remembers context. Each turn builds on the last. No quality degradation. No starting over.
How Conversation-to-Image Works
The Technical Architecture
Traditional diffusion model:
[Prompt A] → [Generate] → [Image A]
[Prompt B] → [Generate] → [Image B] (unrelated to A)
Nano Banana 2 multimodal conversation:
[Prompt A] → [Generate] → [Image A + Context]
↓
[Prompt B + Image A + Context] → [Generate] → [Image B]
↓
[Prompt C + Image B + Context] → [Generate] → [Image C]
The key: persistent multimodal context. Nano Banana 2 maintains a running understanding of:
- The visual state (current image)
- The conversation history (what's been asked/changed)
- The user's intent (what they're trying to achieve)
It's not regenerating from scratch. It's editing with understanding.
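The data flow above can be sketched as a small session object. This is an illustrative model of "persistent multimodal context," not the real Nano Banana 2 internals or SDK: `call_model` is a hypothetical stand-in for the actual image-model API, and the point is simply that every request carries the full history plus the latest image.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    prompt: str
    image: bytes  # image produced by this turn

@dataclass
class ConversationSession:
    # Accumulates every (prompt, image) pair so each new request
    # carries the full multimodal history plus the latest image,
    # instead of regenerating from scratch.
    history: list = field(default_factory=list)

    def generate(self, prompt: str, call_model) -> bytes:
        base = self.history[-1].image if self.history else None
        # call_model is a hypothetical stand-in for the real API call
        image = call_model(prompt=prompt, history=self.history, base_image=base)
        self.history.append(Turn(prompt, image))
        return image
```

The design choice worth noticing: state lives in the session, not in the individual request, which is exactly what the traditional one-shot diffusion flow lacks.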
What Makes It "Native"
Other tools bolted conversation onto existing models:
- GPT-4V describing images → DALL-E generating new ones
- Multiple API calls, multiple models, context loss at each handoff
Nano Banana 2 is natively multimodal. One model. One context window. True understanding.
The result:
- Coherence: Changes make visual sense, not random mutations
- Memory: "Make the plant smaller" remembers which plant, where it was
- Intent preservation: "Keep the lighting but change the desk" maintains what matters
Conversation Depth
How many turns can you go? Google's documentation suggests effective context for 10-20 turns of back-and-forth. In practice:
| Turn Count | Effectiveness | Best For |
|---|---|---|
| 1-3 | 100% | Quick single changes |
| 4-7 | 95% | Multi-element adjustments |
| 8-12 | 90% | Complex scene building |
| 13-20 | 80% | Extended refinement |
| 20+ | Degradation | Start fresh session |
Pro tip: For complex scenes, do foundational work in 5-7 turns, then save reference images and start a new conversation for fine-tuning.
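The table's bands and the pro tip can be folded into a tiny helper. The thresholds below are the illustrative figures from the table above, not measured behaviour:

```python
def conversation_advice(turn_count: int) -> str:
    # Thresholds mirror the effectiveness table above (illustrative).
    if turn_count <= 7:
        return "keep iterating in this session"
    if turn_count <= 12:
        return "fine for complex scene building"
    if turn_count <= 20:
        return "extended refinement; consider checkpointing soon"
    return "save a reference image and start a fresh session"
```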
You Can Take Action Now
Your First Conversation
Time required: 10 minutes. Cost: ~$0.30.
Step 1: Open Google AI Studio. Select Gemini 3.1 Flash Image.
Step 2: Start simple:
"A coffee cup on a wooden table, morning light"
Generate.
Step 3: Make a change:
"Change the cup to blue ceramic"
Generate. Same table. Same light. Different cup.
Step 4: Add an element:
"Add a notebook and pen next to the cup"
Generate. Blue cup, notebook, pen. Coherent composition.
Step 5: Adjust composition:
"Move the notebook to the left side and open it"
Generate. Layout adjusted. Everything else preserved.
Step 6: Change mood:
"Make it evening with warm lamp light instead of morning"
Generate. Same objects. New lighting. Coherent shadows.
You've just had a 5-turn conversation. Total time: 4 minutes. Try doing that with traditional img2img.
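If you'd rather drive the same tutorial from code, the prompts can be scripted as a loop. `send` here is a hypothetical stand-in for a multimodal chat call (the real SDK method name is an assumption); the loop just shows that each prompt builds on the previous image:

```python
# The tutorial prompts, in order. Context accumulates turn by turn.
PROMPTS = [
    "A coffee cup on a wooden table, morning light",
    "Change the cup to blue ceramic",
    "Add a notebook and pen next to the cup",
    "Move the notebook to the left side and open it",
    "Make it evening with warm lamp light instead of morning",
]

def run_tutorial(send):
    # send(prompt, base_image) stands in for a real chat-style
    # image API call; base_image is the previous turn's output.
    image = None
    for prompt in PROMPTS:
        image = send(prompt, base_image=image)
    return image
```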
Conversation Patterns That Work
Pattern 1: The Sculpting Approach
Start broad. Refine narrow.
T1: "A city street scene"
T2: "Make it a rainy night in Tokyo"
T3: "Add neon signs in Japanese"
T4: "Include a person with an umbrella in the foreground"
T5: "Make the umbrella red"
T6: "Add reflections on the wet pavement"
T7: "The reflections should show the neon signs"
Like sculpting: rough form → medium details → fine details.
Pattern 2: The A/B Testing Approach
Explore variations without losing ground.
T1: "A modern living room, minimalist style"
[Good base]
T2: "Change the couch to blue"
[See option A]
T3: "Actually, go back to the original and make the couch green instead"
[Option B - wait, does it remember "original"?]
Limitation: Nano Banana 2 doesn't have "undo" in the traditional sense. It remembers the conversation, but can't revert to arbitrary previous states.
Workaround: Save reference images at key milestones. If T3 goes wrong, start new conversation with T1 image as reference.
Pattern 3: The Correction Loop
Natural back-and-forth like working with a designer.
T1: "A person hiking in mountains"
[Image generated]
T2: "The person should be wearing hiking boots, not sneakers"
[Fixed]
T3: "Better, but the boots look too new. Make them worn and muddy"
[Fixed]
T4: "Great boots. Now the backpack looks too small. Make it a large hiking pack"
[Fixed]
T5: "Perfect. One last thing—add trekking poles"
[Done]
Each correction is understood in context. No re-explaining. No starting over.
Pattern 4: The Scene Evolution
Build complex scenes progressively.
T1: "An empty classroom"
T2: "Add 6 desks arranged in a circle"
T3: "Put a teacher's desk at the front with a laptop"
T4: "Add a whiteboard with math equations"
T5: "Make it sunny afternoon with light streaming through windows"
T6: "Add shadows from the window frames on the floor"
Traditional approach: Write 200-word prompt describing all this. Hope the model parses it correctly.
Conversation approach: Build it live, verify each element, adjust as needed.
What Works (And What Doesn't)
Conversations That Flow
Spatial adjustments:
- "Move the car to the left"
- "Make the building taller"
- "Add space between the two people"
Attribute changes:
- "Change the color to blue"
- "Make it nighttime instead of day"
- "Add fog/mist"
Element addition/removal:
- "Add a bird in the sky"
- "Remove the logo from the shirt"
- "Put a coffee cup in their hand"
Style transfers (within reason):
- "Make it look like a watercolor painting"
- "Apply a vintage film look"
- "Make it more photorealistic"
Conversations That Struggle
Extreme perspective changes:
- "Rotate the scene 90 degrees"
- "Show this from a bird's eye view"
- "Make it a close-up of just the face"
These often work better as new generations with references.
Adding multiple complex elements at once:
- "Add a crowd, change the lighting to sunset, make it raining, and add a neon sign"
Break into steps:
- "Add a crowd" → verify → "Change lighting to sunset" → verify → etc.
Undoing previous changes:
- "Actually, go back to how it looked 3 turns ago"
Nano Banana 2 doesn't maintain a history tree. Use reference images at milestones.
Contradictory instructions:
- "Make it brighter but also darker"
- "Add more people but keep it minimal"
The model tries its best, but conflicting directions produce confused results.
Production Workflows
Landing Page Hero Images
Traditional:
- Write 50 variants of prompts
- Generate 100 images
- Filter to 10 options
- Client picks 1
- Iterate 5 more times
- Time: 3-4 hours
Conversation Approach:
- Start with concept
- Have 10-turn conversation to refine
- Client watches/advises in real-time
- Lock in final version
- Time: 20-30 minutes
Social Media Campaigns
Need 20 variations of the same scene for A/B testing?
Turns 1-5: Build the base scene through conversation
Turn 6: "Save this as version A"
Turn 7: "Change the headline text color to red" → Version B
Turn 8: "Go back to version A, but change the background image" → Version C
Actually, since there's no "save state," better approach:
- Complete base scene (5 turns)
- Save reference image
- Start 3 new conversations from that reference:
- Convo B: "Change headline color to red"
- Convo C: "Change background to cityscape"
- Convo D: "Add a testimonial quote"
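The fan-out pattern above is mechanical enough to script: one fresh conversation per variant, all seeded with the same saved base image. `start_session` is a hypothetical factory standing in for whatever creates a new chat with an image attached:

```python
VARIANTS = {
    "B": "Change headline color to red",
    "C": "Change background to cityscape",
    "D": "Add a testimonial quote",
}

def branch_variants(base_image, variants, start_session):
    # start_session(base_image) is an assumed stand-in that returns
    # a callable sending one prompt into a new conversation seeded
    # with the base image.
    results = {}
    for label, prompt in variants.items():
        send = start_session(base_image)
        results[label] = send(prompt)
    return results
```

Because each variant starts from the same reference, none of them inherit drift from the others' edits.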
Storyboard Iteration
Film director needs to iterate on scene composition:
T1: "A detective sitting in a dark office, noir style"
T2: "Add Venetian blind shadows from the window"
T3: "Put a whiskey glass on the desk"
T4: "The glass should have ice and be half-full"
T5: "Add a gun next to the glass"
T6: "Make the gun reflect the window light"
T7: "The detective should be looking at the gun, not the camera"
T8: "Add rain outside the window"
Director sees composition evolve. Makes decisions in real-time. No "I'll know it when I see it" generation lottery.
Economics of Conversation
Cost Comparison
Scenario: Refining a marketing image through 10 iterations.
| Method | Iterations | Cost per Iteration | Total Cost | Time |
|---|---|---|---|---|
| Traditional Generation | 10 separate | $0.05 | $0.50 | 30 min |
| img2img | 10 passes | $0.05 | $0.50 | 25 min |
| Nano Banana 2 | 10-turn convo | $0.03 | $0.30 | 10 min |
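The table's totals are simple to reproduce; working in integer cents avoids floating-point drift. The prices are the illustrative figures from the table, not published rates:

```python
# (iterations, cost per iteration in cents) for each method
methods = {
    "traditional": (10, 5),
    "img2img": (10, 5),
    "nano_banana_2": (10, 3),
}
totals_cents = {name: n * c for name, (n, c) in methods.items()}
savings_cents = totals_cents["traditional"] - totals_cents["nano_banana_2"]
```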
Savings aren't just financial. Time and mental bandwidth matter more.
The Hidden Cost: Decision Fatigue
Traditional AI image generation:
- Generate 20 options
- Compare 20 options
- Pick 1
- Doubt the choice
- Generate 20 more
- Never feel satisfied
Conversation approach:
- Build incrementally
- Validate each decision
- Arrive at satisfaction organically
- Know why the final image works
Limitations
No True Undo
Once you go down a path, you can't branch back arbitrarily. Workaround: save reference images at key decision points.
Context Window Limits
After ~20 turns, the model may start forgetting early conversation details. For complex projects, break into multiple conversations with reference images.
Single Image Focus
Each conversation maintains one active image. Can't work on multiple compositions simultaneously. Workaround: multiple browser tabs/conversations.
Language Nuance
"Make it more dynamic" vs "Make it more energetic"—subtle prompt differences still matter. The model understands natural language well, but not perfectly.
The Bigger Picture
Conversation-to-image isn't just a feature. It's a paradigm shift.
Traditional AI image tools treated users like operators of a machine: write precise instructions, get output, repeat.
Nano Banana 2 treats users like collaborators: discuss, iterate, refine together.
This mirrors how human designers actually work:
- "Show me something"
- "Hmm, warmer"
- "Yes, like that, but bigger"
- "Perfect, just add..."
The best creative tools don't just execute commands. They engage in dialogue.
Series Navigation
This is Article 2 of the Nano Banana 2 Masterclass Series.
- Previous: E01: From LoRA to Zero-Training: Character Consistency Revolution
- Next: E03: From Prompt Guessing to Spatial Logic
- Series Overview: Masterclass Index
The conversation revolution is here. Stop pulling the lever. Start talking.
