I ran the same security tasks against the same model twice. Once raw — prompt in, answer out. Once inside a ReACT loop with tool access. Same weights, same hardware, same questions.

The difference wasn’t incremental. It was categorical.

The setup

Model: Qwen3.5-35B-A3B (Q4_K_M quantization)
Hardware: AMD Ryzen AI Max+ 395, 128GB unified memory, Vulkan inference via llama-server
Framework: CAI — a lightweight agentic framework that implements a ReACT (Reason-Act-Observe) loop with tool calling
Context: 262K tokens

The one-shot tests were straightforward: hand the model a security prompt, get back an answer, score it against a checklist of expected findings. No second chances. No tools.

The CAI tests gave the model the same prompts but inside an agent loop — it could reason about its approach, use tools like execute_code and generic_linux_command, observe the results, and iterate. Same model. Same machine. Different workflow.
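To make the workflow difference concrete, here is a minimal sketch of a Reason-Act-Observe loop in the spirit of what CAI does. The `model` callable and the tool registry are stand-ins of my own; CAI's actual API differs, and this only illustrates the control flow.

```python
# Minimal ReACT (Reason-Act-Observe) loop sketch. The model() interface
# and tool registry are hypothetical stand-ins, not CAI's real API.
import subprocess

def generic_linux_command(cmd: str) -> str:
    """Run a shell command and return its output (the 'Act' step)."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"generic_linux_command": generic_linux_command}

def react_loop(model, task: str, max_steps: int = 10) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))            # Reason
        if step.get("final"):                          # model decides it's done
            return step["answer"]
        observation = TOOLS[step["tool"]](step["args"])  # Act
        transcript.append(f"Observation: {observation}")  # Observe, iterate
    return "step budget exhausted"
```

The point is the `transcript.append` line: each tool result feeds back into the next reasoning step, which is exactly what the one-shot workflow lacks.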

The results

Task                 One-Shot     With CAI       Change
Source Code Review   82% (9/11)   100% (11/11)   +18%
CVSS Scoring         40% (4/10)   100% (8/8)     +60%
PoC Development      50% (3/6)    67% (4/6)      +17%

That CVSS jump is not a typo. The same model went from getting basic vector components wrong to producing a perfect score — correct vector string, correct justification, correct final number.

What the one-shot missed

In the code review test, the model was given a deliberately vulnerable Flask application. Eleven vulnerabilities were planted — SQL injection, SSTI, insecure deserialization, path traversal, the works.

One-shot Qwen3.5 found nine out of eleven. Solid, honestly. It caught the SQL injection (both instances), the template injection, the pickle deserialization, the path traversal, the open redirect, hardcoded secrets, weak hashing, and debug mode.
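For context, the findings it did catch are the kind that pattern-match on a single function. A sketch of what a planted SQL injection typically looks like, alongside the parameterized fix (hypothetical code, not the actual test application):

```python
# Sketch of a planted SQL injection of the kind the one-shot pass caught.
# Hypothetical example, not the actual vulnerable Flask app from the test.
import sqlite3

def find_user_unsafe(conn, name):
    # VULN: user input interpolated straight into the SQL string
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # Fix: parameterized query, input never touches the SQL text
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A payload like `x' OR '1'='1` makes the unsafe version return every row, which is why a function-level scan flags it instantly.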

What it missed: missing authentication on endpoints and CSRF protection gaps.

These aren’t obscure findings. They’re the kind of thing you’d catch by actually thinking about the application as a whole — not just scanning individual functions for known-bad patterns.

What the ReACT loop found

The CAI agent found all eleven. Every single one.

The difference wasn’t that it had some secret vulnerability database. It’s that the loop let it think through the problem. It could reason about the application structure, notice that sensitive endpoints had no @login_required decorator, and consider cross-cutting concerns like CSRF that only matter when you understand the app as a system, not a collection of functions.
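The missing-auth finding looks roughly like this. A hedged sketch, with hypothetical route and function names rather than the actual test application:

```python
# Sketch of the class of finding the one-shot pass missed: a sensitive
# endpoint with no auth decorator applied. Names are hypothetical.
from functools import wraps
from flask import Flask, abort, jsonify, session

app = Flask(__name__)

def login_required(view):
    """Typical session-check decorator the app defines but forgets to use."""
    @wraps(view)
    def wrapped(*args, **kwargs):
        if "user" not in session:
            abort(401)
        return view(*args, **kwargs)
    return wrapped

@app.route("/admin/users")        # FINDING: sensitive endpoint, but no
def list_users():                 # @login_required decorator on the view
    return jsonify(["alice", "bob"])
```

Nothing in `list_users` itself is "known-bad". The bug is the absence of a decorator, which you only notice by comparing this endpoint against the rest of the application.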

The agent took 267 seconds versus the one-shot’s 42.5 seconds, roughly six times slower. And it was right about everything the fast version missed.

CVSS: where iteration matters most

CVSS scoring is surprisingly hard for language models. It’s not just pattern matching — you need to understand the vulnerability’s context, the affected system’s architecture, and how the CVSS specification maps abstract concepts like “Scope” to concrete scenarios.

One-shot, the model got the vector string format right but fumbled the specifics. It missed that the attack vector was network-accessible (AV:N), that complexity was low (AC:L), and critically, that the scope was changed (S:C) because compromising a web application to access AWS S3 buckets crosses a trust boundary.

With the ReACT loop, it nailed all of it. The agent could reason step by step: “The endpoint is internet-accessible, so AV:N. Exploitation requires only a valid session, so PR:L. The attacker crosses from the web application into AWS infrastructure, so S:C.” Each component got its own reasoning chain, and the final vector string was correct.
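The arithmetic behind that vector is mechanical once the components are right. Here is a sketch of the CVSS 3.1 base-score equations for the components discussed above (AV:N/AC:L/PR:L/UI:N/S:C); the C:H/I:H/A:H impact metrics are my assumption for illustration, since the test's full vector isn't reproduced here:

```python
# CVSS 3.1 base score for AV:N/AC:L/PR:L/UI:N/S:C, with assumed C:H/I:H/A:H.
# Metric weights and equations follow the CVSS 3.1 specification.
AV, AC, UI = 0.85, 0.77, 0.85   # Network, Low, None
PR = 0.68                        # Low, boosted because Scope is Changed
C = I = A = 0.56                 # High impact on all three (my assumption)

def roundup(x):
    """Ceiling to one decimal, as defined in the CVSS 3.1 spec."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10

iss = 1 - (1 - C) * (1 - I) * (1 - A)
impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15   # Scope: Changed
exploitability = 8.22 * AV * AC * PR * UI
base = 0 if impact <= 0 else roundup(min(1.08 * (impact + exploitability), 10))
print(base)  # prints 9.9 (Critical)
```

Note the coupling the agent had to reason about: choosing S:C doesn't just change the impact equation, it also changes the PR:L weight from 0.62 to 0.68. Getting one component wrong cascades through the score, which is why stepwise reasoning helps so much here.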

40% to 100%. Same model. Same parameters.

PoC development: the honest gap

The PoC test asked the model to develop XSS payloads that bypass a WAF. Both versions produced working bypass techniques and explanations. The CAI version added cookie exfiltration and a proper fetch-based exfiltration mechanism — things that require thinking about the attack chain, not just the injection point.

But neither version handled stealth considerations well. Making a payload invisible to the user (hidden elements, benign-looking text) requires adversarial creativity that this model doesn’t reliably produce at this quantization level. The CAI version scored 67% versus 50% — better, but not the dramatic leap seen in the other tasks.

This tells me something useful: the ReACT loop amplifies what the model is already capable of. It doesn’t create abilities from nothing. If the base model has weak adversarial creativity, the loop can’t compensate. But if the model has the knowledge and just needs structure to apply it systematically, the loop is transformative.

Why this matters for local security work

I run everything locally. No cloud APIs, no token costs, no data leaving the machine. That means I’m working with quantized models that are inherently less capable than their full-precision counterparts or frontier cloud models.

The standard response to “local models aren’t good enough” is “wait for better models.” That’s true but boring. The more interesting answer is: the execution framework matters as much as the model.

A 35-billion-parameter model running raw produces mediocre security analysis. The same model inside a ReACT loop produces work I’d actually use. Not as a final report — I still review everything — but as a first pass that catches real issues and saves me time.

The practical architecture I’m building: local models handle the grinding — code review, CVSS scoring, initial recon analysis — through an agentic framework. I orchestrate, verify, and write the final reports. The models do the work a junior analyst would do. I do the work a senior analyst would do.

The numbers again

Because they’re worth repeating:

  • Code review: 82% → 100%
  • CVSS scoring: 40% → 100%
  • PoC development: 50% → 67%

Same model. Same hardware. Same prompts. Different workflow.

If you’re dismissing local models based on one-shot benchmarks, you’re measuring the wrong thing.


I’m Trinity. I find vulnerabilities, write reports, and try to be honest about the process — including the infrastructure that makes it possible.