The 6-block prompt formula
Anyone who's written prompts for more than a few days has hit this wall:
You type "a girl drinking coffee", hit generate, and out come 8 images: 3 indoor, 3 outdoor, one warm-lit, one cold-lit, half-bodies and close-ups, some with a Japanese palette, some American. Each one looks "okay". None of them are usable. So you tweak one word, run it again, and get 8 totally different versions. Rinse and repeat.
Most people's first reaction is "AI is unstable." Wrong. The AI isn't "unstable" at all. It's painfully honest — it just fills in everything you didn't say. You gave it 3 pieces of info (subject, action, what the action involves). The other 7 pieces (where, how it's shot, what light, what style, what lens, what image quality, what constraints) — the model guessed, using the "average taste" baked into its training data. And "average taste" has huge variance across dimensions, so each guess goes a different direction. That's not a bug. That's the blanks in your prompt being auto-filled.
The official prompting guide in the OpenAI Cookbook recommends an order — Subject → Setting → Style → Composition → Lighting → Technical specs. This chapter unpacks all 6: what each one means, what happens when you skip it, how to write it without falling into traps, and how much time we saved at JR using this formula.
6 Building Blocks Cheat Sheet
| Block | What it is | Example keywords |
|---|---|---|
| Subject | Who / what | A young Asian woman in her 20s holding a latte cup |
| Setting | Where + when | cozy Sydney café by a large window, autumn afternoon |
| Style | What "look" | photorealistic editorial photography / 3D Pixar render / oil painting |
| Composition | How it's framed | medium close-up, slightly angled from below, rule of thirds |
| Lighting | What light, what direction | warm golden hour light through window, soft rim light from left |
| Technical specs | What lens / medium | 50mm prime lens, f/1.8 shallow DOF, slight 35mm film grain |
Think of a prompt like a brief you hand to a photographer. You wouldn't just say "shoot a girl drinking coffee" — you'd tell them where, what light, what lens, what mood. The 6 building blocks are the 6 columns on that brief. Leave any column blank and the photographer (the model) defaults to "however we shot it last time." So of course you get 8 images with 8 different vibes.
The Full Formula Template
[Subject], [Setting].
Composition: [framing / camera angle].
Lighting: [time of day / quality / direction / mood].
Style: [aesthetic / artist reference / medium].
Technical: [lens / film grain / detail level].
[Constraints: no extra text, exact aspect ratio, etc.]
You don't have to follow this exact order line-by-line, but all 6 must be there. The A/B comparison below shows what skipping 3 of them costs you.
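One way to enforce the "all 6 must be there" rule is to assemble the prompt in code instead of by hand. What follows is a minimal Python sketch, not an official tool: `PromptBlocks` and `build_prompt` are hypothetical names made up for illustration, and the output layout simply mirrors the template above.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptBlocks:
    """One field per building block (names are illustrative)."""
    subject: str
    setting: str
    composition: str
    lighting: str
    style: str
    technical: str
    constraints: str = ""  # optional last line of the template


def build_prompt(blocks: PromptBlocks) -> str:
    """Assemble the template; refuse to run if any of the 6 blocks is blank."""
    required = [f.name for f in fields(blocks) if f.name != "constraints"]
    missing = [name for name in required if not getattr(blocks, name).strip()]
    if missing:
        raise ValueError(f"blank blocks: {', '.join(missing)}")

    lines = [
        f"{blocks.subject}, {blocks.setting}.",
        f"Composition: {blocks.composition}.",
        f"Lighting: {blocks.lighting}.",
        f"Style: {blocks.style}.",
        f"Technical: {blocks.technical}.",
    ]
    if blocks.constraints.strip():
        lines.append(blocks.constraints.strip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(PromptBlocks(
        subject="A young Asian woman in her 20s holding a latte cup",
        setting="cozy Sydney café by a large window, autumn afternoon",
        composition="medium close-up, slightly angled from below, rule of thirds",
        lighting="warm golden hour light through window, soft rim light from left",
        style="photorealistic editorial photography",
        technical="50mm prime lens, f/1.8 shallow DOF, slight 35mm film grain",
        constraints="No text overlay, exact 3:4 vertical aspect ratio.",
    )))
```

The point of raising an error rather than silently skipping a blank field is that a blank block is exactly where the model starts guessing.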
A/B Real-World Comparison
Weak prompt (only 3 building blocks)
A girl drinking coffee, café, photorealistic.
What actually comes out: of 8 images, 5 indoor, 3 on a patio, 4 overhead shots, 4 eye-level, half warm-toned, half cool. Each one passes on its own — but the moment you try to stack them into a Xiaohongshu carousel, the whole thing falls apart. Inconsistent style.
Why? Because you only specified Subject, half a Setting (café, but which café, and at what time of day?), and Style (photorealistic, which covers literally every realistic photo). Composition, Lighting, Technical, and Constraints are all blank, so the model freestyles.
Strong prompt (all 6 building blocks)
A young Asian woman in her 20s holding a latte cup,
sitting by a large window in a cozy Sydney café in autumn.
Composition: medium close-up, slightly angled from below,
subject occupies right two-thirds, left third is window light.
Lighting: warm afternoon golden hour through window,
soft rim light on her hair, warm color temperature ~3200K.
Style: photorealistic editorial photography,
slight 35mm film grain, warm autumn palette,
inspired by Annie Leibovitz portraiture.
Technical: 50mm prime lens, f/1.8 shallow depth of field,
crisp foreground subject, dreamy bokeh background.
No text overlay, exact 3:4 vertical aspect ratio.
Output: all 8 images come out warm gold, all medium close-up, all window-lit, the subject's age, makeup, and outfit nearly identical — the whole set is carousel-ready. These 8 are now your "style anchor." Want a 9th image in the same look? Change two words in the prompt (different season, different drink) and you're done.
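The "change two words" step can be sketched by turning the strong prompt into a template with two slots. This is an illustrative Python sketch using the standard library's `string.Template`; the names `STYLE_ANCHOR` and `variation`, and the choice of `$drink` and `$season` as the only slots, are assumptions for the example.

```python
from string import Template

# The strong prompt from above, with two slots: $drink and $season.
# Everything else stays fixed so the look does not drift.
STYLE_ANCHOR = Template(
    "A young Asian woman in her 20s holding a $drink,\n"
    "sitting by a large window in a cozy Sydney café in $season.\n"
    "Composition: medium close-up, slightly angled from below,\n"
    "subject occupies right two-thirds, left third is window light.\n"
    "Lighting: warm afternoon golden hour through window,\n"
    "soft rim light on her hair, warm color temperature ~3200K.\n"
    "Style: photorealistic editorial photography,\n"
    "slight 35mm film grain, warm $season palette,\n"
    "inspired by Annie Leibovitz portraiture.\n"
    "Technical: 50mm prime lens, f/1.8 shallow depth of field,\n"
    "crisp foreground subject, dreamy bokeh background.\n"
    "No text overlay, exact 3:4 vertical aspect ratio."
)


def variation(drink: str, season: str) -> str:
    """Fill the two slots; substitute() raises KeyError if one is missing."""
    return STYLE_ANCHOR.substitute(drink=drink, season=season)


original = variation("latte cup", "autumn")
ninth = variation("cup of hot chocolate", "winter")

if __name__ == "__main__":
    print(ninth)
```

Composition, Lighting, and Technical are byte-for-byte identical between `original` and `ninth`, which is what keeps the ninth image inside the same style anchor.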
Going Deeper on Each Block
Subject: concrete identity beats abstract description by 3x. "a person" or "a girl" makes the model guess from the training-data mean (which, fair warning, skews Western and white). "a young Asian woman in her 20s" locks down ethnicity, age, and gender. Level up: add occupation, expression, and posture, as in "a tired junior developer in her 20s, focused expression, slightly hunched over laptop". That layer of "specific occupation + emotion" turns a person from "extra" into "character in a story."
Setting: don't just say "in a café"; add light direction and time of day. "cozy Sydney café" is vague. "cozy Sydney café by a large window, warm afternoon golden hour through window" locks in light source, time, and city. Specific city names (Sydney / Tokyo / Paris) work better than a generic "city", because the model has seen tons of training images tied to how those specific places look.
Style: this is the master switch for the whole vibe. "editorial photography" (magazine) / "oil painting" / "3D Pixar render" / "Studio Ghibli illustration": swap the medium keyword and the entire feel changes. You can stack artist names: "inspired by Annie Leibovitz" / "in the style of Wes Anderson". The model has learned the visual signatures of famous photographers and directors really well, and dropping their names locks the tone immediately.
Composition: use industry terms. The model gets them: wide shot / medium close-up / extreme close-up / overhead top-down / dutch angle. For position, "rule of thirds, subject on right two-thirds" is way more precise than "subject on the right." "shot from below" gives you hero feel, "shot from above" gives surveillance feel. Angle is emotion.
Lighting: three ingredients: temperature (warm / cool / neutral) + direction (from left / rim light / backlight) + quality (soft / hard / diffused). The most common faceplant is writing "good lighting". That's empty calories: the model falls back to its default (usually flat ambient light, like a straight phone-camera shot). Spelling out color temperature ("warm 3200K") is way more reliable.
Technical specs: "35mm film" / "50mm prime" / "85mm portrait" are all different lens languages, and the model was likely trained on many photos tagged with camera and lens metadata, so it picks up these distinctions. "f/1.8 shallow depth of field" blurs the background and isolates the subject. "8K detail" / "ultra-sharp" pushes toward high-res detail. "slight film grain" gives film texture. Stack all three and you've got the universal "cinematic" recipe.
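The Lighting recipe above (temperature + direction + quality) is easy to turn into a small helper, so nobody can type "good lighting" and move on. This is a hypothetical Python sketch; the `Lighting` tuple, the `lighting_line` function, and the phrase order are illustrative choices, not a rule from any official guide.

```python
from typing import NamedTuple


class Lighting(NamedTuple):
    temperature: str  # e.g. "warm 3200K", "cool", "neutral"
    direction: str    # e.g. "from left", "rim light from behind", "backlight"
    quality: str      # e.g. "soft", "hard", "diffused"


def lighting_line(light: Lighting) -> str:
    """Render the Lighting block, rejecting any blank ingredient."""
    blanks = [name for name, value in light._asdict().items() if not value.strip()]
    if blanks:
        raise ValueError(f"missing lighting ingredients: {', '.join(blanks)}")
    return f"Lighting: {light.quality} {light.temperature} light, {light.direction}."


if __name__ == "__main__":
    print(lighting_line(Lighting("warm 3200K", "rim light from left", "soft")))
```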
What Breaks When You Skip a Block
| Missing block | What actually happens |
|---|---|
| No Subject details | Model guesses the subject — 8 images, 8 different faces. No carousel for you. |
| No Setting light source | Output comes out dim / flat / passport-photo vibes. |
| No Style | Defaults toward "SD 1.5 realism," which may not be the realism you wanted. |
| No Composition | 8 angles for 8 images. Can't reuse anything. |
| No Lighting | Whole image has zero atmosphere. Looks like a phone straight-shot. |
| No Technical specs | Lens language gets jumbled — deep focus wide-angle mixed with shallow close-ups. |
| No Constraints | Text / aspect ratio / random extra elements all go off the rails. |
Real Faceplants
Faceplant 1: Style at the end. First time I wrote a full prompt, I followed Subject → Setting → Composition → Lighting → Technical → Style. The output drifted hard: I'd written "oil painting" and half came back as photorealistic photos. Later I figured out that the first 50 words carry the most weight (next chapter goes deep on this). A keyword like Style, which sets the whole tone, has to be up front. It can't wait until paragraph 5. Now our template stuffs Style keywords into the Subject or Setting block so they hit within the first 50 words.
Faceplant 2: Subject as "a person". Did a Xiaohongshu carousel for a client, 5 prompts all using "a person sitting at desk". Got 5 totally different people: different genders, ages, ethnicities, no narrative coherence. Switched to "a young Chinese woman in her late 20s" and it converged immediately. 4 of 5 looked like the same "character."
Faceplant 3: Filler words like "good lighting". Wrote "studio with good lighting" once. Got flat passport-photo lighting back, neither "good" nor "studio." Changed it to "studio with soft key light from upper left, warm fill light from right, subtle rim light from behind" and suddenly it had real set-piece feel. Vague adjectives carry no information. The model just defaults, and its default is usually the most boring option on the menu.
Faceplant 4: All 6 blocks present, but in the wrong order. Put Technical before Subject ("50mm shot of a woman drinking coffee...") and the weighting flipped: the lens language got emphasized, the subject got de-emphasized, and 8 images came back heavy on lens vibe with stiff facial expressions. Order is information too. Same as LLM prompts: earlier tokens carry more weight.
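Two of these faceplants (filler phrases, and Style landing too late) can be caught mechanically before you hit generate. Below is a rough Python sketch of such a lint. All of it is assumption for illustration: the function names, the starter list of vague phrases (taken from the examples in this chapter), and simple whitespace word counting as a stand-in for however the model actually weights the first 50 words.

```python
import re
from typing import List, Optional

# Starter list drawn from this chapter's examples; extend it for your team.
VAGUE_PHRASES = ["good lighting", "a person", "a girl"]


def find_vague_phrases(prompt: str) -> List[str]:
    """Return the vague phrases that appear in the prompt as whole words."""
    lowered = prompt.lower()
    return [p for p in VAGUE_PHRASES
            if re.search(rf"\b{re.escape(p)}\b", lowered)]


def word_position(prompt: str, phrase: str) -> Optional[int]:
    """1-based index of the word where `phrase` starts, or None if absent.

    Uses plain substring matching and whitespace splitting.
    """
    idx = prompt.lower().find(phrase.lower())
    if idx == -1:
        return None
    return len(prompt[:idx].split()) + 1


def lint_prompt(prompt: str, style_phrase: str, first_n: int = 50) -> List[str]:
    """Collect warnings for filler phrases and a late or missing style phrase."""
    warnings = [f"vague phrase: {p!r}" for p in find_vague_phrases(prompt)]
    pos = word_position(prompt, style_phrase)
    if pos is None:
        warnings.append(f"style phrase {style_phrase!r} not found")
    elif pos > first_n:
        warnings.append(
            f"style phrase starts at word {pos}, after the first {first_n}")
    return warnings


if __name__ == "__main__":
    for warning in lint_prompt("studio with good lighting, oil painting",
                               "oil painting"):
        print(warning)
```

An empty list means the prompt passed both checks. It says nothing about whether the image will be good, only that two known failure modes are absent.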
What We Learned at JR Academy
We were doing the spring KV poster set for the vibe-coding bootcamp. Week one was pure trial-and-error — 4-5 revisions per image before anything was usable. Week two we forced everyone to use the 6-block formula (fill in each row of the cheat sheet before writing — if any block was blank, fill it). The first image passed review. Client said "this looks like a pro team shot it."
What we saved wasn't generation time (generation was always fast, seconds per image). It was communication time. Before, every revision meant going back and forth with the client to confirm "is this the vibe you want?" Vague prompts mean a vague picture in the client's head, and review becomes a fight. Now prompts are specific enough that the client can already imagine what's coming, and review disagreements dropped by 80%. A week of work (8 bootcamps, 5 images each, 40 images total) went from 60 hours to 5 hours. Most of those 55 hours saved were communication overhead, not Photoshop time.
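For a per-image view of those numbers, here is the arithmetic spelled out as a tiny Python sketch. It only uses the figures quoted above; the variable names are made up for the example.

```python
bootcamps, images_each = 8, 5
images = bootcamps * images_each                         # 40 images

hours_before, hours_after = 60, 5
hours_saved = hours_before - hours_after                 # 55 hours

per_image_before_h = hours_before / images               # 1.5 hours per image
per_image_after_min = hours_after / images * 60          # 7.5 minutes per image
reduction = hours_saved / hours_before                   # about 0.92

if __name__ == "__main__":
    print(f"{images} images, {hours_saved} hours saved ({reduction:.0%})")
    print(f"{per_image_before_h} h -> {per_image_after_min} min per image")
```

So each image went from roughly an hour and a half of total effort to under ten minutes.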
Lesson: the 6-block formula isn't about "writing more." It's about "guessing less." Every block you fill in is one dimension the model doesn't have to guess, and the output gets a little more stable.
What's Next
Having all 6 blocks is just the foundation. There's a hidden rule that actually controls output stability — the first-50-words rule. With the same 6 blocks, putting "style" as word 1 versus word 50 changes the weighting by 3x and gives you completely different vibes. In other words: filling in 6 blocks stops the random guessing, but getting the order right is what gets you a stable, reusable style anchor. Chapter 05 unpacks this — why the first 50 words carry the most weight, what info has to fit there, and what can wait.
📷 Long Prompt Case Study (the maxed-out 6-block version)
A real "ultra-detailed prompt" example, from awesome-gpt-image (CC BY 4.0). After this you'll see why "6 building blocks + explicit constraints" reliably ships commercial-grade output.
Case: Japanese Onsen Ryokan Portrait (a 300+ word full prompt)
Prompt (first half of the full version):
35mm film photography, warm vintage Japanese onsen ryokan aesthetic,
soft ambient wooden lantern lighting mixed with gentle natural window light,
subtle film grain, gentle color shift, high atmosphere editorial style,
intimate medium shot, early 20s beautiful Chinese female idol with
ultra-realistic delicate refined Chinese features...
[continues with 200+ more words covering pose, lighting direction,
fabric texture, skin rendering, anti-AI defects clauses]
Notice the structure here. Style opens the prompt on line one (35mm film photography, warm vintage Japanese onsen ryokan aesthetic), locking the tonal foundation. Lighting comes right after on line two (soft ambient wooden lantern lighting mixed with gentle natural window light). Line three adds more style and texture cues (film grain, color shift, editorial style), and Composition and Subject arrive together on line four (intimate medium shot, then the idol description). This is the deluxe version of the 6-block formula from this chapter: a 300+ word prompt that pushes every block to its limit.
Why does a long prompt like this give stable output? Because the model has zero room to "guess" — every variable is locked. This is the standard playbook for commercial photography and high-quality output.
📷 Creator: @BubbleBrain · Curated by: awesome-gpt-image
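❓ FAQ
The most frequently searched questions on this chapter's topic.
What is the gpt-image-2 prompt formula?
The OpenAI Cookbook officially recommends 6 building blocks: Subject + Setting + Style + Composition + Lighting + Technical specs. All 6 present = stable, reusable output.
Which missing block causes the worst faceplant?
No Lighting: the output comes out dim or flat. No Subject: the model guesses blindly (every face is different). No Style: it defaults toward a realistic look. No Constraints: text and aspect ratio go off the rails. Every missing block is one more dimension the model has to guess.
Is a longer prompt always better?
No. Having all 6 blocks beats piling on length. The first 50 words carry the most weight (about 50%), so key elements must go up front. Decorative details can go at the end.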