Multi-Language Code Evaluation Pipeline for LeetCode Style Problems
Most evaluator writeups optimize for speed first. Our biggest quality issue was not latency; it was false negatives.
We repeatedly saw “correct-looking” solutions fail across languages due to starter drift, I/O contract mismatch, and comparator inconsistency. So we redesigned the pipeline around one goal:
deterministic, explainable verdicts, with no AI validation.
This work came out of building CodeNexus, a mobile LeetCode-style coding app meant to help form habit loops. The binding constraint was trust: users have to believe the verdicts. That constraint forced us to treat evaluation correctness as a first-class system problem, not just an execution detail.
What we built
Starter quality gate (pre-execution)
- Validates starter templates before users run anything.
- Catches missing/empty templates, TODO-only scaffolds, missing callable signatures, and structural syntax defects.
- Prevents template defects from polluting runtime pass/fail metrics. A minimal sketch of such a gate follows this list.
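To make the gate concrete, here is a minimal sketch of the idea for Python starters, using only the standard-library ast module. The function name check_starter and the defect labels are illustrative assumptions for this sketch, not the pipeline's actual API.

import ast

# Illustrative pre-execution gate for a Python starter template.
# check_starter and its defect labels are assumptions for this sketch.
def check_starter(source: str) -> list[str]:
    if not source.strip():
        return ["missing_or_empty_template"]
    try:
        tree = ast.parse(source)                  # structural syntax defects
    except SyntaxError as exc:
        return [f"syntax_defect_at_line_{exc.lineno}"]
    defects = []
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        defects.append("missing_callable_signature")
    # A scaffold whose callables contain nothing but `pass` is TODO-only:
    # it parses, but it can never produce a real answer at runtime.
    bodies = [stmt for f in funcs for stmt in f.body]
    if funcs and all(isinstance(s, ast.Pass) for s in bodies):
        defects.append("todo_only_scaffold")
    return defects

A starter that fails this gate never reaches the runner, so template defects get reported as starter defects rather than counted against users.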
Starter smoke validation
- Fast per-language smoke runs classify failures as starter quality vs solver quality.
- Surfaces wrapper/parser drift and placeholder runtime crashes early (see the classification sketch below).
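A sketch of the classification idea, assuming starters are executed as subprocesses. The command, timeout, and labels are assumptions, not the actual harness.

import subprocess

# Illustrative smoke classifier: run the untouched starter on one tiny
# input and decide whether a failure indicts the starter or the solver.
def smoke_classify(cmd: list[str], sample_input: str) -> str:
    try:
        proc = subprocess.run(cmd, input=sample_input, capture_output=True,
                              text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return "starter_quality: wrapper hang"
    if proc.returncode != 0:
        # A placeholder starter crashing at runtime is a template defect,
        # not a solver defect: the user has not written anything yet.
        return f"starter_quality: runtime crash ({proc.stderr.strip()[:80]})"
    return "solver_quality: starter executes; wrong answers now indict the solution"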
Contract-driven comparison layer
- Each problem can define indexing, output format, order sensitivity, an unordered-comparison strategy, and an optional semantic validator.
- Comparator sequence:
  1. normalize expected/actual
  2. exact match
  3. unordered match (when allowed)
  4. semantic validation for multi-answer correctness
  5. diagnostic mismatch classification
- Normalization includes whitespace collapsing, boolean canonicalization, JSON normalization, and unordered multiset strategies; a comparator sketch follows this list.
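Here is a minimal sketch of that comparator sequence, assuming line-oriented output. The parameter names (order_sensitive, validator) mirror the contract fields described above but are assumptions, not the real schema.

import json
from collections import Counter

def normalize(raw: str) -> str:
    text = raw.strip()
    if text in ("True", "true", "False", "false"):
        return text.lower()                               # boolean canonicalization
    try:                                                  # JSON normalization
        return json.dumps(json.loads(text),
                          sort_keys=True, separators=(",", ":"))
    except ValueError:
        return " ".join(text.split())                     # whitespace collapse

def compare(expected: str, actual: str,
            order_sensitive: bool = True, validator=None) -> str:
    exp = [normalize(line) for line in expected.strip().splitlines()]
    act = [normalize(line) for line in actual.strip().splitlines()]
    if exp == act:
        return "pass: exact"
    if not order_sensitive and Counter(exp) == Counter(act):
        return "pass: unordered"    # multiset: order free, element counts fixed
    if validator is not None and validator(expected, actual):
        return "pass: semantic"     # multi-answer problems accept any valid answer
    return "fail: mismatch"         # handed to diagnostic mismatch classification

Note the unordered strategy here treats each output line as a multiset element, which fits problems that emit one result per line; order inside a single JSON value is a job for the semantic validator, not the multiset pass.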
Hardened execution path
- Clear separation of compile errors, runtime errors, and infrastructure failures.
- Language-specific execution config is explicit (including TS compiler options).
- Batch submission + polling for throughput, with a sequential fallback for reliability; grading semantics are identical on both paths (sketch below).
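A sketch of the batch-then-fallback scheduling, with a hypothetical executor client (submit_batch, poll, run_one are stand-ins, not a real API). Compile and runtime errors arrive inside verdicts; infrastructure failures trigger the fallback instead of a grade.

import time

def run_suite(client, submissions, poll_interval=0.5, max_polls=60):
    try:
        tokens = client.submit_batch(submissions)     # throughput path
        pending, results = set(tokens), {}
        for _ in range(max_polls):
            for token in list(pending):
                verdict = client.poll(token)          # None while still running
                if verdict is not None:               # CE/RE/AC live in the verdict
                    results[token] = verdict
                    pending.discard(token)
            if not pending:
                return [results[t] for t in tokens]
            time.sleep(poll_interval)
        raise TimeoutError("batch polling exceeded budget")
    except (TimeoutError, ConnectionError):
        # Infrastructure failure: fall back to the reliability path.
        # Same grader, one submission at a time, identical semantics.
        return [client.run_one(s) for s in submissions]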
Artifact-first outputs
- Every run emits a structured JSON artifact with summary + failure details.
- Failures include the problem slug, a failure class, and expected/actual snippets for fast triage, analytics, and replay. A passing artifact looks like this; an illustrative failure entry follows it.
{
"language": "python",
"summary": { "total": 316, "passed": 316, "failed": 0, "errors": 0, "passRate": 100 },
"failures": []
}
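For contrast, a failing run's artifact carries per-failure context. The entry below is illustrative only (fabricated values, and the field names slug/failureClass are assumptions matching the prose), showing the shape triage tooling relies on:

{
  "language": "java",
  "summary": { "total": 316, "passed": 315, "failed": 1, "errors": 0, "passRate": 99.7 },
  "failures": [
    {
      "slug": "two-sum",
      "failureClass": "comparator_mismatch",
      "expected": "[0,1]",
      "actual": "[1,0]"
    }
  ]
}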
Results
Per-language pass rates, before -> after the redesign:
- C++: 19.0% -> 100.0%
- Go: 7.0% -> 100.0%
- Java: 0.9% -> 100.0%
- Non-SQL suite: 316/316 passing across supported languages
Most reliability gains came from evaluation architecture, not algorithm rewrites.
In multi-language judges, deterministic contracts and artifacts matter more than raw execution speed once baseline performance is acceptable.