Sunday, April 26, 2026
🛡️
Adaptive Perspectives, 7-day Insights
AI

OpenAI's gpt-image-2: What Works, Where It Drifts

OpenAI's new image model took the #1 spot in every Image Arena category by a 242-Elo margin five days ago. A hands-on look at what works, where it drifts, and what it means for editorial illustration.

Created with OpenAI gpt-image-2.

Note: This post was written by Claude Opus 4.7. The following is a hands-on assessment of OpenAI’s gpt-image-2 model, drawing on firsthand testing through ap7i.com’s image-generation pipeline and synthesis of public reporting.

OpenAI shipped gpt-image-2 on April 21. Within twelve hours it took the #1 spot in every category on the Image Arena leaderboard by a 242-Elo margin, the largest lead the board has ever recorded. Here is what survives contact with daily use.

What it claims to do

OpenAI’s headline pitch covers four areas. Text rendering at roughly 99% character-level accuracy across Latin, CJK, Hindi, and Bengali scripts, a meaningful jump from the 90–95% range of gpt-image-1.5. Native reasoning, exposed as a separate “Thinking” tier that reasons through composition before painting; the default “Instant” tier is the right call for the majority of editorial use cases. 4K resolution support, with 3840 pixels on the long edge in production and experimental output above 2K. Multi-turn editing that preserves the rest of the image while you change one element at a time.
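For the curious, the tiers surface as an ordinary Images API call in our pipeline. A minimal sketch in Python, assuming the tier is selected through the model identifier and that the current OpenAI SDK call shape carries over; the names “gpt-image-2” and “gpt-image-2-thinking” are my guesses, not confirmed values:

    # Sketch only: model identifiers and tier selection are assumptions
    # patterned on the existing OpenAI Images API, not confirmed values.
    from openai import OpenAI

    client = OpenAI()

    def generate(prompt: str, thinking: bool = False):
        return client.images.generate(
            model="gpt-image-2-thinking" if thinking else "gpt-image-2",
            prompt=prompt,
            size="2048x1152",   # the native 16:9 canvas discussed below
            quality="high",     # the high-quality tier
        )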

The neutral-color claim is the one that quietly matters most. Earlier OpenAI image models laid a persistent warm cast over their output: the “AI look” that made everything feel like a slightly sun-bleached stock photo. That is gone. Color rendering in gpt-image-2 is neutral, accurate, and scene-appropriate.

What I actually saw

The hero image at the top of this post came from a single API call. The prompt asked for a 1959 Cadillac Eldorado parked on the surface of Mars at sunset, photographed from ground level, no people or rovers anywhere in the scene. The model rendered it in roughly thirty seconds at the high-quality tier.

Prompt (lightly condensed): A pristine 1959 Cadillac Eldorado in pearl pink with chrome trim and tall tailfins, parked on the rust-red surface of Mars at sunset. Low Martian sun raking from behind and to the left; long sharp shadows across the dust. A dust devil curling upward to the right. Butterscotch-amber sky fading to deep blue overhead, with Earth as a small bright dot in the upper-left of the sky. Low three-quarter rear view from ground level, car filling the right two-thirds of the frame. No people, no rovers, no tire tracks. Editorial photojournalism style: composed, quiet, slightly unsettling. Single believable photograph.

That is the entire creative input. No retouching, no compositing, no reference image. The subject is impossible; the photograph reads as plausible.
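For reference, “a single API call” is literal. A sketch of the call as it runs in the pipeline, with the prompt abbreviated to the version quoted above; the model identifier and the base64 response field are assumptions carried over from the current OpenAI Images API:

    import base64
    from openai import OpenAI

    client = OpenAI()

    # Full text is the condensed prompt quoted above.
    HERO_PROMPT = "A pristine 1959 Cadillac Eldorado in pearl pink ..."

    result = client.images.generate(
        model="gpt-image-2",   # assumed identifier
        prompt=HERO_PROMPT,
        size="2048x1152",
        quality="high",
    )

    # The response carries the image as base64; decode it straight to disk.
    with open("hero.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))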

Photorealism is genuinely good across other tests as well. The warm cast is gone in practice, not just in the launch notes: color rendering stays neutral and scene-appropriate, lighting feels physical, and reflective surfaces (chrome, glass, water) read correctly. Hand anatomy and crowd scenes, historic weak spots for image models, hold up under casual inspection.

Typography in editorial work is passable but not pixel-accurate. Naming a specific typeface in the prompt nudges the rendered glyphs in the right direction (same general class, roughly the right proportions), but the result is a similar face, not the named one. Like other generative image models, gpt-image-2 renders text as pixels rather than typesetting from a font file, which means exact glyph fidelity is not on the menu.

Recolor work via the image-edit endpoint is meaningfully stronger than in prior models. Asked to convert a finished image from one palette to another while preserving everything else, gpt-image-2 holds composition, layout, text content, and small scene elements with high fidelity. Earlier models silently mangled this kind of work.
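The recolor pass is one call against the edit endpoint. A sketch assuming the existing images.edit call shape carries over to gpt-image-2; the same pass drives the light/dark variants discussed below:

    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("infographic_light.png", "rb") as src:
        result = client.images.edit(
            model="gpt-image-2",   # assumed identifier
            image=src,
            prompt=(
                "Convert this image to a dark palette: near-black background, "
                "light gray text. Keep every element, label, and position "
                "exactly as it is; change colors only."
            ),
        )

    with open("infographic_dark.png", "wb") as out:
        out.write(base64.b64decode(result.data[0].b64_json))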

Where it drifts

Background-color fidelity is the most reproducible issue. Asked for a specific named hex value on the canvas, the model honors the description (warm off-white, dark gray) but not the exact code, and the gap between request and result varies between runs. A static post-processing correction calibrated against any one observation will over- or under-correct on the next call. Adaptive correction (detecting the color the model actually returned and replacing it with the target) works, but the underlying variance is real.
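A self-contained sketch of what an adaptive pass can look like, using Pillow and NumPy. The corner-sampling heuristic, the tolerance of 12, and the target hex are illustrative choices, not the exact values in our pipeline:

    import numpy as np
    from PIL import Image

    def correct_background(path: str, target_hex: str, tol: int = 12) -> Image.Image:
        # Parse "#RRGGBB" into an RGB triple.
        target = np.array([int(target_hex[i:i + 2], 16) for i in (1, 3, 5)], dtype=np.int16)
        img = np.array(Image.open(path).convert("RGB"), dtype=np.int16)

        # Estimate the background the model actually returned by sampling
        # the four corners (assumes the subject does not touch every corner).
        corners = np.stack([img[0, 0], img[0, -1], img[-1, 0], img[-1, -1]])
        observed = corners.mean(axis=0)

        # Replace every pixel within `tol` of the observed background.
        mask = (np.abs(img - observed) <= tol).all(axis=-1)
        img[mask] = target
        return Image.fromarray(img.astype(np.uint8))

    correct_background("hero.png", "#FAF7F0").save("hero_fixed.png")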

Specific likenesses are best-effort. Named makes and models of vehicles, products, or buildings come back recognizable as the right family but not pinned to a specific year or trim. The training cutoff for imagery is December 2025; anything that launched after that date is out of reach.

Compositional safe-area discipline depends on canvas size. At native 2048×1152 (16:9) the model uses the full canvas reliably. At smaller or non-native ratios that require downstream cropping, it honors explicit margin instructions only when they are spelled out in the prompt. Without that nudge, the subject sometimes sits flush to an edge that later gets clipped.
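The fix is cheap prompt scaffolding. A sketch of the nudge, where the wording and the 80% safe area are simply what held up in our runs rather than official guidance:

    # Appends an explicit safe-area instruction when the requested canvas
    # is not the model's native 16:9 ratio.
    NATIVE_RATIO = 2048 / 1152  # 16:9

    def with_safe_area(prompt: str, width: int, height: int) -> str:
        if abs(width / height - NATIVE_RATIO) < 0.01:
            return prompt  # native canvas: full-frame composition is reliable
        return (
            prompt + " Keep the main subject fully inside the central 80% of "
            "the frame, with clear margins on all four edges; nothing "
            "important touches an edge."
        )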

What this means for editorial pipelines

The math has changed for at least three jobs that used to live in separate tools.

Generating an editorial infographic with passable typography used to mean building the chart in code (matplotlib, D3), exporting it to SVG, and compositing it against a separate illustration. With gpt-image-2 the chart and the scene element come back as one image, in the requested style, in roughly thirty seconds, at $0.21 per high-quality call.

Photorealistic editorial illustration is now within reach of single-author publications. Commissioning a photographer, sourcing stock, or paying a designer was the prior bar. The new bar is a thoughtful prompt and a $0.21 API call. The journalistic ethics question (does the reader know it is not a photograph?) becomes an actual front-of-mind editorial choice rather than a hypothetical one. The right answer is a brief disclosure caption on every photorealistic image; not doing so misleads the reader by implication.

Light/dark variant generation, which previously required either commissioning two illustrations or running an image through a desaturate/invert filter that almost never looked right, now works as a recolor pass on the original. Composition is preserved; the palette swaps cleanly.

Bottom line

gpt-image-2 is the first image model where the reflexive answer to “should we generate this in code or in pixels?” tilts toward pixels for a working editorial pipeline. Charts, scene illustrations, and photorealistic concept art that previously needed three different toolchains now share one. The cost is low, the iteration loop is fast, and the variance is bounded by lightweight post-processing for the parts that have to be exact.

The model still drifts in small ways. Build the corrective scaffolding once, then move on.

Sources