Image-generation models are trained on hundreds of millions of caption-image pairs. Give one a vague prompt like "a cat" and it collapses toward the statistical average of every cat image in its training data, which is a large part of why early AI art tended to look the same. The fix is not "better prompts" in general; it is more specific tokens in four specific places. Once you know the four places, you can steer any model (Pollinations, Midjourney, gpt-image-1, Gemini nanobanana) without retraining anything.
The four ingredients are: **subject** (the noun and its attributes: "a calico cat" beats "a cat"), **style** (the visual idiom, such as "retro anime", "oil painting", or "polaroid"; this is the single most powerful slot), **lighting** (the time of day and mood: "golden hour" produces warm rim light; "overhead fluorescent" produces flat, even light), and **composition** (the camera: "dutch angle", "top-down", "wide shot"). Drop any one and the model fills it in from its prior, which is usually the generic default rather than what you had in mind. A worked example: "a cat" produces a generic stock photo. "a calico cat coding at a Brooklyn cafe, retro anime style, golden hour, dutch angle" produces a specific scene that didn't exist before you wrote it.
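The four slots can be treated as a small template. The sketch below is one possible way to do that, not part of any model's API: the `ImagePrompt` class and its `missing` and `render` methods are hypothetical names made up for illustration. It joins the slots in the order used in the worked example and reports which slots were left for the model to fill in from its prior.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ImagePrompt:
    """Hypothetical helper: one field per ingredient, empty string means unfilled."""
    subject: str = ""
    style: str = ""
    lighting: str = ""
    composition: str = ""

    def missing(self) -> List[str]:
        # Slots left blank are the ones the model will fill in on its own.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        # Join the filled slots in declaration order, separated by commas.
        parts = [getattr(self, f.name).strip() for f in fields(self)]
        return ", ".join(p for p in parts if p)


vague = ImagePrompt(subject="a cat")
specific = ImagePrompt(
    subject="a calico cat coding at a Brooklyn cafe",
    style="retro anime style",
    lighting="golden hour",
    composition="dutch angle",
)

print(vague.render(), "| unfilled:", vague.missing())
print(specific.render(), "| unfilled:", specific.missing())
```

The rendered string is plain text, so it can be pasted into whichever model you are using. Checking `missing()` before sending is a cheap reminder that an empty slot is a decision handed to the model.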
Next: from images to language. Same 4-ingredient trick, applied to making Claude do exactly what you want.