Traditional AI benchmarks and A/B testing platforms are excellent for measuring text generation and static knowledge, but they fall short when evaluating complex, multi-step tactical reasoning in a dynamic environment. Enter LLM Tanks: a full-stack 3D game that doubles as an interactive benchmark for evaluating AI tool-use and reasoning. At its core, LLM Tanks is a tactical artillery combat game that pits large language models directly against each other (e.g., Claude vs. Grok vs. GPT).
Hi Product Hunt! 👋 I’m incredibly excited to launch LLM Tanks today.
Over the last year, we’ve all stared at static AI benchmarks and blind A/B text comparisons. While these are great for measuring raw knowledge, I kept wondering: How do these models actually perform when forced to make multi-step, tactical decisions in a dynamic physical environment? That question led to LLM Tanks.
On the surface, it’s a fun, 3D tactical artillery game built with SvelteKit and Three.js. But under the hood, it is a strict, real-world reasoning benchmark where top models (like GPT, Claude, and Grok) battle each other live.
To make this a genuine apples-to-apples research tool, I instituted a strict "Equal Terms" architecture. Here is how it works:
Zero Scripting: The AI opponents don’t rely on traditional video game logic or pathfinding. Everything you see is a language model actively reasoning on the fly.
Identical Directives: Every model receives the exact same system prompt, physics constants, and JSON tool schemas. Any performance gap therefore reflects the model's own capability, not an advantage baked into the setup.
The Tactical Capability Manifest: Models are given an arsenal of 8 specific tools, ranging from scan_for_enemy and optimize_shot_parameters to plan_movement and check_fuel_cost. They must use these tools to survey the 3D space, calculate ballistics, and maneuver.
Forced Rationale: This is my favorite part. Every single tool call the AI makes must include a strict rationale object containing its intent, reasoning, expected outcome, and continuation. You aren't just seeing the tank move; you are watching the model's exact train of thought unfold as it tries to outsmart its opponent.
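To make the tool-call contract above concrete, here is a minimal sketch of what one call with a forced rationale might look like. The rationale field names come from the post; the interface shapes, the specific argument (sector_degrees), and the example values are my own illustrative assumptions, not the actual LLM Tanks schema.

```typescript
// Hypothetical shape of one tool call with the required rationale object.
interface Rationale {
  intent: string;           // what the model is trying to achieve
  reasoning: string;        // why this tool, why now
  expected_outcome: string; // what the model predicts will happen
  continuation: string;     // the planned next step
}

interface ToolCall {
  tool: string;                   // e.g. "scan_for_enemy" or "plan_movement"
  args: Record<string, unknown>;  // tool-specific arguments
  rationale: Rationale;           // required on every single call
}

// An example call a model might emit (values are illustrative):
const call: ToolCall = {
  tool: "scan_for_enemy",
  args: { sector_degrees: 90 }, // assumed argument, not from the post
  rationale: {
    intent: "Locate the opposing tank before committing fuel to movement",
    reasoning: "No enemy position is known yet; firing blind wastes a turn",
    expected_outcome: "A bearing and distance estimate for the enemy",
    continuation: "Feed the returned position into optimize_shot_parameters",
  },
};
```

Requiring all four rationale fields on every call is what turns the battle log into a readable trace of the model's plan, not just a sequence of actions.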
The result is a persistent global leaderboard powered by an Elo rating system, tracking model performance over time as they fight for tactical supremacy. I also added AI commentary via Inworld TTS so you can hear their cold, mathematical logic play out in real-time, plus a Human vs. AI mode if you want to test yourself against the machines.
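The leaderboard uses standard Elo math, which is worth spelling out: each model has a rating, a win moves points from loser to winner, and the amount depends on how surprising the result was. The formula below is the textbook Elo update; the K-factor of 32 is an assumption for illustration, not a confirmed LLM Tanks parameter.

```typescript
// Standard Elo: probability that a player rated `ratingA`
// beats a player rated `ratingB`.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Apply one match result. K controls rating volatility
// (K = 32 is a common default, assumed here).
function updateElo(winner: number, loser: number, k = 32): [number, number] {
  const eWin = expectedScore(winner, loser);
  const eLose = expectedScore(loser, winner);
  return [winner + k * (1 - eWin), loser + k * (0 - eLose)];
}

// Two equally rated models: at K = 32 the winner gains 16
// points and the loser drops 16.
const [a, b] = updateElo(1500, 1500); // a = 1516, b = 1484
```

An upset (a low-rated model beating a high-rated one) moves more points than an expected win, so sustained leaderboard position reflects consistent tactical performance rather than a lucky match.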
I would love for you to jump in, spectate a few AI battles, or challenge the models yourself.
I’ll be here all day to answer your questions! I’m especially happy to nerd out about the prompt engineering, the OpenRouter integration, the SvelteKit/Cloudflare stack, or the wild differences I’ve seen in how various models approach problems.
Let the battles begin! 💥
About LLM Tanks on Product Hunt
“A 3D tactical artillery game to evaluate LLM reasoning.”
LLM Tanks was submitted on Product Hunt, where it earned 3 upvotes and 1 comment, placing #178 on the daily leaderboard.
LLM Tanks was featured in A/B Testing (20.4k followers), Artificial Intelligence (466.2k followers) and Games (98.5k followers) on Product Hunt. Together, these topics include over 111.5k products, making this a competitive space to launch in.
Who hunted LLM Tanks?
LLM Tanks was hunted by Dallas Gordon. A "hunter" on Product Hunt is the community member who submits a product to the platform, uploading the images and link and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community.