model-by-model scores.
Same four axes for every model: prompt fidelity, aesthetic, UI/UX, robustness. Definitions live in the methodology section below.
Claude Opus 4.7
Read the prompt the most carefully and translated it into actual shader, physics and camera decisions. The microfacet final still looks like real water after multiple sessions.
GPT-5.5 Pro
The most prompt-adherent of the round. First pass tripped a runtime error, but it locked back onto the visible-water requirement and rebuilt as a self-contained WebGL2 demo. Final lands with clear UI, boats and a believable horizon haze.
Codex CLI GPT-5.5 xhigh
Not a one-shot web pass. The CLI took a queued P0 → P2 → P3 chain and emitted one final HTML. Robustness reads high partly because there was only one artifact to capture, but the result is feature-complete with controls, buoyancy, wakes and boats.
GPT-5.5 Thinking Standard
Same no-water failure as Pro, but the final stayed safer — less depth, less drama, more conventional UI structure.
Claude Design
Bonus visual lane — it does not strictly meet the engineering asks (procedural controls, boats, the full WebGL stack). What it nails is atmosphere: dramatic horizons, color depth, lighting that actually feels cinematic. Treat as visual reference, not a benchmark winner.
Gemini 3.1 Pro
The 3rd pass produced one of the most cinematic frames in the round — horizon, lighting and color all land. But the boats from the brief are barely visible in the final, and several engineering asks were quietly dropped.
DeepSeek web Expert
Loves adding controls and systems — the panel is genuinely complete. The surface, though, kept drifting toward icy crystals or terrain rather than convincing water.
Kimi 2.6 Thinking
Pleasant and runnable, but did not push the harder shader / physics asks. Reads more like a calm lake than the high-pressure ocean shader the brief asked for.
Qwen 3.6 Max Preview
First shader compile failed; the final does run, but a lot of the original prompt scope got dropped to get there. Surface still looks faceted and noisy.
GLM-5.1
First archive had no canvas at all; the second rendered but the surface reads more like jagged terrain than water. Did not reach a clean final this round.
every prompt, and how each model walked through them.
Six prompts cover everything from the original brief to the runtime fixes and the Codex CLI queue. Below the prompt list, the per-model log shows which prompts each lane actually went through.
P0 · Original benchmark prompt (base)
Every model gets this verbatim, fresh chat, no system prompt.
Create an ultra-realistic 3D ocean / water simulation system and deliver it as a single self-contained HTML file. The goal is to stress-test advanced AI capability in real-time 3D graphics, GLSL shader programming, physically inspired water simulation, interactive parameter control, and floating object physics.
Core Requirement:
- Build a real-time 3D water scene, not a 2D canvas ripple effect.
- The final output must be a single complete `.html` file.
- The file must run directly in a modern browser.
- Use WebGL, GLSL shaders, and/or Three.js.
- If Three.js is used, it must be loaded from a CDN inside the HTML file.
- Do not use external image assets, model files, texture files, or build tools.
- All geometry, materials, shaders, boats, sky, UI, and effects must be generated procedurally or inline.
Rendering Requirements:
- Implement a 3D water surface using a high-resolution plane mesh or procedural geometry.
- The water must use custom shader logic or shader-like material behavior.
- Include realistic water effects such as vertex displacement, multi-layer sine/Gerstner waves, wind-driven direction, dynamic amplitude, wave speed controls, surface normal calculation, specular highlights, Fresnel reflection, environment reflection approximation, depth-based water color, foam or whitecaps on steep waves, sun/light controls, and a sky gradient or procedural environment.
- The water surface must visibly behave like a 3D ocean/lake, not a flat animated texture.
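For context on what those rendering requirements actually mean in code, here is a rough sketch of one Gerstner wave layer in GLSL, plus the matching CPU-side height sum a buoyancy system would sample. It is illustrative only, with made-up parameter values, and is not taken from any model's submission.

```js
// Illustrative only: one Gerstner wave layer of the kind the brief asks for.
// Real submissions sum several layers with different directions and lengths.
const gerstnerChunk = /* glsl */ `
  // dir: normalized XZ direction, amp: amplitude, len: wavelength,
  // speed: phase speed, xz: vertex position on the plane, t: time uniform.
  vec3 gerstnerWave(vec2 dir, float amp, float len, float speed,
                    vec2 xz, float t) {
    float k = 6.28318530718 / len;              // wave number, 2*pi / wavelength
    float f = k * (dot(dir, xz) - speed * t);   // phase at this vertex
    float steep = 0.6;                          // 0..1, crest sharpness (made up)
    return vec3(
      steep * amp * dir.x * cos(f),             // pinch vertices toward crests
      amp * sin(f),                             // vertical displacement
      steep * amp * dir.y * cos(f)
    );
  }
`;

// The same sum evaluated on the CPU gives a height that boat buoyancy can
// sample. (Approximate: it ignores the horizontal pinch of the full Gerstner
// form, which is usually good enough for a demo.)
function waveHeight(x, z, t, waves) {
  let y = 0;
  for (const w of waves) {                      // w: { dirX, dirZ, amp, len, speed }
    const k = (2 * Math.PI) / w.len;
    y += w.amp * Math.sin(k * (w.dirX * x + w.dirZ * z - w.speed * t));
  }
  return y;
}
```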
P1 · No-water repair
Triggered when a first run rendered nothing or hid the water. Used on GPT-5.5 Pro, GPT-5.5 Thinking Standard, Qwen 3.6 Max Preview, and GLM-5.1.
i could not see the water.
P2 · Broad quality pass (quality)
Mid-conversation nudge for everyone that already had visible water — generic, intentionally lazy.
enhance the overall quality
P3 · Cinematic final pass (cinematic)
The closing prompt before final HTML capture. Anyone who reached a final ran this last.
Improve the ocean/water scene so it feels significantly more realistic, cinematic, and visually rich. The final result should look like a believable body of water rather than a flat or decorative shader. Focus on making the water feel deep, dynamic, natural, and physically convincing from multiple viewing angles. The surface should have layered motion, subtle variation, convincing highlights, realistic color depth, and a strong sense of scale. Avoid repetitive patterns, plastic-looking shine, overly uniform color, and artificial movement. Make the lighting, reflections, wave behavior, foam, horizon, and overall composition work together as a polished real-time ocean rendering demo. The goal is not just to add more effects, but to make the water feel alive, immersive, and high quality. Return the improved result as a complete, self-contained single HTML file that can run directly in the browser.
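One common way to hit the "realistic color depth" and non-plastic highlights P3 asks for is a Fresnel-weighted blend between an absorption-tinted body color and the reflected sky. The sketch below is a generic fragment-shader snippet with invented colors and constants, not code from any of the entries.

```js
// Generic illustration of depth-tinted water with a Fresnel blend toward the
// sky; colors and constants are invented for the example.
const waterColorChunk = /* glsl */ `
  vec3 shadeWater(vec3 normal, vec3 viewDir, vec3 skyColor, float depth) {
    // Deeper water absorbs red and green first, drifting toward blue-green.
    vec3 shallow = vec3(0.10, 0.45, 0.45);
    vec3 deep    = vec3(0.02, 0.08, 0.15);
    vec3 body    = mix(shallow, deep, clamp(depth / 30.0, 0.0, 1.0));

    // Schlick's approximation: grazing angles reflect the sky, steep angles
    // show the water body, which is what keeps the shine from looking plastic.
    float f0 = 0.02;                            // base reflectance of water
    float fresnel = f0 + (1.0 - f0) *
                    pow(1.0 - max(dot(normal, viewDir), 0.0), 5.0);
    return mix(body, skyColor, fresnel);
  }
`;
```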
P4 · Concrete runtime-error fix (runtime fix)
Only fired when an archived HTML threw a runtime error in the browser. Used once on DeepSeek for `debugSelect is not defined`.
deepseek_html_20260501_d05add.html:1347 Uncaught ReferenceError: debugSelect is not defined at deepseek_html_20260501_d05add.html:1347:9 The error occurs because `debugSelect` is used in `cycleDebugView()` but was never declared. Fix the standalone HTML so the debug selector is defined and the demo runs directly in the browser.
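The fix this prompt asks for is small: declare the selector before `cycleDebugView()` touches it. Only the names `debugSelect` and `cycleDebugView` come from the archived error; the element id and the cycling logic below are guesses at what the surrounding code plausibly looks like.

```js
// Hypothetical shape of the fix: declare debugSelect before it is used.
// The element id "debugView" is a guess; only debugSelect and cycleDebugView
// appear in the actual error message.
const debugSelect = document.getElementById('debugView');

function cycleDebugView() {
  // Step the <select> to its next option, wrapping back to the first.
  debugSelect.selectedIndex =
    (debugSelect.selectedIndex + 1) % debugSelect.options.length;
  debugSelect.dispatchEvent(new Event('change'));
}
```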
P5 · Codex CLI queued condition (codex queue)
Only the Codex CLI lane. The CLI takes a single queued chain rather than back-and-forth turns.
Codex CLI GPT-5.5 xhigh was tested as one queued-message run in ~/Desktop/codexwater:
1. P0 original benchmark prompt
2. P2 enhance the overall quality
3. P3 final cinematic water prompt
The archived result is the final HTML emitted after that queued chain, not three separately inspected intermediate HTML files.
- Claude Opus 4.7 · Claude web · P0 → P2 → P3
P0 first pass already shipped — three iterations cleaned up details, finished on a microfacet final.
- GPT-5.5 Pro · ChatGPT web · P0 → P1 → P2 → P3
P0 hit a runtime error → P1 forced visibility → P2 added controls → P3 closed on a cinematic WebGL2 final.
- Codex CLI GPT-5.5 xhigh · Codex CLI · P0 → P2 → P3
Three prompts queued in one go (P0 → P2 → P3). No back-and-forth, no intermediate captures.
- GPT-5.5 Thinking Standard · ChatGPT web · P0 → P1 → P2 → P3
P0 failed off-camera → archive starts at the P1 recovery → P2 / P3 polished. One archived capture has a typo in its module-import block.
- Claude Design · Claude Design (bonus) · P0 → P3
Treat as visual reference — Claude Design is tuned for art direction, not the engineering brief.
- Gemini 3.1 Pro · Gemini web · P0 → P2 → P3
Three sequential passes, no failures — strongest visual jump on P3, weakest spec adherence of the visible-water lanes.
- DeepSeek web Expert · DeepSeek web · P0 → P2 → P4 → P3
P0 → P2 polish → P4 fixed an undeclared `debugSelect` runtime error → P3 final.
- Kimi 2.6 Thinking · Kimi web · P0 → P3
P0 → P3, no failures. Skipped the harder shader asks rather than attempting them and breaking.
- Qwen 3.6 Max Preview · Qwen web · P0 → P1 → P3
P0 shader compile failed → P1 unblocked rendering → P3 simplified final.
- GLM-5.1 · Z.ai web · P0 → P1
P0 produced no canvas → P1 unblocked rendering — never reached a clean final in this round.
Three prompts queued in one go (P0 → P2 → P3) and exactly one final HTML came back. Read it as multi-prompt execution, not first-shot fidelity.

who won what, without drama.
Short summaries by category. Different rounds and different prompts will rearrange these — see the caveats section.
Claude Opus 4.7 read the prompt the most carefully and translated it into actual shader, physics and camera decisions. The microfacet final still looks like real water after multiple passes.
GPT-5.5 Pro was the strictest about the spec. Its first pass tripped a runtime error, but it locked back onto the visible-water requirement and rebuilt as a self-contained WebGL2 demo. The final ships with clear UI, boats and believable horizon haze.
Gemini 3.1 Pro's third pass produced one of the most cinematic frames of the round. Caveat: the boats from the spec are barely visible and several engineering asks were quietly dropped.
Claude Design is the bonus visual lane. It does not strictly meet the engineering brief, but the atmosphere — horizon, color depth, lighting — actually feels cinematic. Treat it as visual reference, not a benchmark winner.
Codex CLI and DeepSeek both shipped feature-rich builds (controls, buoyancy and wakes for Codex; heavy systems for DeepSeek). DeepSeek's water still reads icy, though — a full UI doesn't guarantee convincing visuals.
GLM rendered no canvas, Qwen's first shader failed to compile, and GPT-5.5 Pro tripped a runtime error. Two of the three recovered with a follow-up prompt; GLM never reached a clean final.
A single desktop, a single round, one prompt set. Different sessions, different timings, different prompt phrasings — any of those can absolutely flip an order. Treat it as a snapshot.
how I scored it.
Each model gets a 1–10 score on four axes. Visuals are not the whole story — prompt fidelity counts the same.
Prompt fidelity: Did it actually meet the hard asks — single HTML, WebGL, procedural geometry, working controls, visible boats? Engineering compliance, not interpretation.
Aesthetic: Water realism, lighting, color depth, atmospheric haze, composition — does the scene feel beautiful and convincing in the browser?
UI/UX: Control panel design, parameter sliders, interactivity, frame rate, ease of use. The bits beyond just rendering pretty water.
Robustness: First-frame water visible, no runtime errors, no manual intervention needed to get a working artifact.
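The page never publishes an aggregation formula for these axes, but since they count the same, the natural single number per model is the unweighted mean of the four 1–10 ratings. A minimal sketch under that assumption:

```js
// Assumes equal weighting across the four axes, which the methodology text
// implies ("prompt fidelity counts the same") but never states as a formula.
function overallScore({ fidelity, aesthetic, uiux, robustness }) {
  return (fidelity + aesthetic + uiux + robustness) / 4;
}

// e.g. overallScore({ fidelity: 9, aesthetic: 8, uiux: 7, robustness: 9 }) === 8.25
```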
ChatGPT web — both GPT-5.5 Pro and GPT-5.5 Thinking Standard.
Codex CLI in ~/Desktop/codexwater. The only command-line lane — it runs a queued P0 → P2 → P3 chain rather than chatting back and forth.
Claude.ai web on Opus 4.7 Adaptive, plus a few bonus runs from Claude Design for visual reference.
Gemini web — three sequential HTML iterations archived.
Z.ai web. The first run produced a blank canvas; the second one rendered.
Qwen Studio. First shader failed to compile, then a simplified final.
DeepSeek web Expert mode — archived from the first pass through the PBR repair after a runtime error.
Kimi Thinking — two ocean stages archived.
this is one snapshot, not a leaderboard.
Subjective evaluation, one desktop, one round of testing. Read it that way.
Most lanes ran on each vendor's web UI — GPT, Claude, Gemini, DeepSeek, Kimi, Qwen, GLM. The only command-line lane is Codex CLI GPT-5.5 xhigh, and it is flagged separately on its model card and prompt log.
Different prompt phrasings, different chat sessions, different run timings can change the order. The point of this page is to make every prompt and every HTML public so you can re-run it yourself.
Read the scores as one person's qualitative read, not a peer-reviewed benchmark. If a result feels off for your use case, it probably is — your use case is different.
