
Counterintuitive: WSL2 + vllm cannot fit Qwen2.5-7B-1M on 6GB VRAM where Windows transformers can

Originally published on Dev.to

TL;DR: I tried to run Qwen2.5-7B-Instruct-1M on a consumer laptop (RTX 3050 Laptop, 6GB VRAM) and mapped the literal feasibility frontier. All evidence is in JSON, enforced by drift-CI. Three honest findings:

  1. 4k context = the hard ceiling for Windows transformers + bitsandbytes int4 NF4. 5k, 6k, and 8k all OOM at the first attention forward pass. The 4k cell passes only because the Windows WDDM driver's shared-memory overcommit lets CUDA allocations spill over PCIe into system RAM at a ~10x latency tax; peak measured usage was 10.8GB on a 6GB GPU. (Load path sketched after this list.)

  2. WSL2 + vllm cannot even fit the model. vllm 0.7.3's memory profile log, verbatim: "model weights take 5.43GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is -0.94GiB". Zero GPU cache blocks allocated, 0.00x concurrency at 4200 tokens. The Linux NVIDIA driver provides no equivalent shared-memory fallback, so vllm sees only the physical 6GB and refuses to start. The conventional wisdom that "vllm > transformers for memory efficiency" is literally disproven at this hardware tier: vllm fails harder, because the Windows OS, not the inference engine, was the enabler. (Repro sketched after this list.)

  3. Cloud free tier is also capped, and unevenly so. GitHub Models free tier (zero credit card, gh OAuth only): gpt-4.1-mini PASS @ 4k in 8.54s (~30x faster than local); llama-3.3-70b-instruct PASS @ 4k in 5.17s. But gpt-5 returns unavailable_model at any context size on the free tier, DeepSeek-V3 and gpt-5 are capped at 4000 input tokens, and Anthropic Claude is not in the GitHub Models catalog at all: zero CC + Claude = no path. (API call sketched after this list.)
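
The three sketches below reconstruct each finding's code path. They are minimal, hedged approximations, not the repo's actual runners. First, the finding-1 load path: the model id and NF4 quant type come from the post, while device_map, double quantization, and the prompt construction are my assumptions.

```python
# Minimal sketch of the finding-1 load path: Windows transformers +
# bitsandbytes int4 NF4. Model id and quant type come from the post;
# device_map, double quant, and the ~4k prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "Qwen/Qwen2.5-7B-Instruct-1M"

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # int4 NF4, as benchmarked
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,         # assumption, not confirmed by the post
)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=quant,
    device_map="auto",  # on Windows, WDDM overcommit silently spills past 6GB
)

# ~4k tokens survives the first attention forward pass; 5k+ OOMs.
prompt = "lorem ipsum " * 1300  # roughly 4k tokens (assumption)
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```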
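Second, the finding-2 failure mode under WSL2. vllm profiles GPU memory at engine init and aborts when weights plus activation peak leave a negative KV-cache budget; the quantization flags and max_model_len below are assumptions chosen to mirror the 4200-token cell.

```python
# Minimal sketch of the finding-2 failure with vllm 0.7.3 under WSL2.
# The quantization/load_format flags and max_model_len are assumptions;
# the repo's runner is authoritative.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    quantization="bitsandbytes",    # assumed in-flight bnb quantization
    load_format="bitsandbytes",
    max_model_len=4200,
    gpu_memory_utilization=0.95,
)
# On a physical 6GB card this never returns: engine init profiles
# weights (5.43GiB) + activation peak (1.42GiB), computes -0.94GiB
# left for the KV cache, allocates 0 GPU cache blocks, and raises.
# The Linux driver has no WDDM-style shared-memory fallback to absorb it.
```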
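Third, the finding-3 free-tier call. GitHub Models exposes an OpenAI-compatible API; the endpoint URL, model id, and token lookup via gh are my assumptions, so check the repo's runner for the exact values.

```python
# Minimal sketch of the finding-3 GitHub Models call. Endpoint URL,
# model id, and token source are assumptions based on the service's
# OpenAI-compatible API; no credit card, just a GitHub OAuth token.
import os
import subprocess
from openai import OpenAI

token = os.environ.get("GITHUB_TOKEN") or subprocess.check_output(
    ["gh", "auth", "token"], text=True
).strip()

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",  # assumed endpoint
    api_key=token,
)

resp = client.chat.completions.create(
    model="gpt-4.1-mini",                   # PASS @ 4k in the benchmark
    messages=[{"role": "user", "content": "lorem ipsum " * 1300}],  # ~4k tokens
    max_tokens=64,
)
print(resp.choices[0].message.content)
```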

Full numbers + 11 JSON evidence cells + 3 ADRs at: https://github.com/leagames0221-sys/longctx-bench-honest

Hardware: RTX 3050 Laptop 6GB / driver 560.94 / CUDA 12.6 / Windows 11 + WSL2 Ubuntu 24.04. Software: torch 2.5.1+cu124, transformers (5.8.0 Win / 4.48.3 WSL), bitsandbytes 0.49.2, vllm 0.7.3. Everything is fully reproducible: uv.lock is committed and the runners live under examples/.
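
Before re-running anything, a quick version fingerprint (a convenience sketch, not repo code) helps confirm your local environment matches the manifest above:

```python
# Fingerprint the local environment against the manifest above.
import torch, transformers, bitsandbytes

print("torch        :", torch.__version__)    # expect 2.5.1+cu124
print("cuda runtime :", torch.version.cuda)   # built against 12.4; driver reports 12.6
print("transformers :", transformers.__version__)
print("bitsandbytes :", bitsandbytes.__version__)
print("gpu          :", torch.cuda.get_device_name(0))
```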

A sibling repo applies the same constraints to browser RPA (a 5-layer defense-in-depth journey, with 5 honest failures documented in JSON): https://github.com/leagames0221-sys/browser-agent-demo

The cross-repo thesis is "constraint-optimized AI engineering": map the literal feasibility frontier under zero credit card, a consumer laptop, public OSS only, and drift-CI enforcement, then publish both the working zone AND the boundary. Happy to answer questions about the methodology or specific runner code.
