Combining Semgrep, CodeQL, SonarQube, and Snyk gets you 44.7% vulnerability detection on the OWASP Benchmark. They miss more bugs than they find.
The problem is architectural. These tools are pattern matchers. Vulnerabilities are about behavior. Semantic SAST combines Tree-sitter parsing with LLM reasoning to bridge that gap.
Why Pattern Matching Fails
Traditional SAST tools are glorified grep. They catch eval(user_input) but miss exec(compile(user_input, '<string>', 'exec')). Same vulnerability, different syntax. The tools see characters, not intent.
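The difference is easy to demonstrate. Below is a minimal sketch using Python's built-in ast module as a stand-in for real AST tooling: a grep-style rule catches only the literal spelling, while a structural check flags any call to a code-execution sink.

```python
import ast
import re

# A grep-style rule: matches one literal spelling only.
PATTERN = re.compile(r"eval\(user_input\)")

# An AST-level check: flags any call whose callee is a known
# code-execution sink, regardless of surface syntax.
DANGEROUS_CALLS = {"eval", "exec", "compile"}

def grep_finds(source: str) -> bool:
    return bool(PATTERN.search(source))

def ast_finds(source: str) -> bool:
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in DANGEROUS_CALLS
        for node in ast.walk(tree)
    )

direct = "eval(user_input)"
indirect = "exec(compile(user_input, '<string>', 'exec'))"

print(grep_finds(direct), grep_finds(indirect))  # True False
print(ast_finds(direct), ast_finds(indirect))    # True True
```

The regex misses the exec(compile(...)) variant entirely; the structural check treats both as the same vulnerability.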
A SQL injection isn't about string concatenation. It's about untrusted data reaching a query executor. A path traversal isn't about .. in a string. It's about user-controlled input determining file system access. Traditional tools can't follow data flow across function boundaries or reason about whether sanitization is actually effective.
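The data-flow idea can be sketched in a deliberately simplified form. This toy tracker (single file, straight-line code, hypothetical source/sink sets) follows untrusted data from a source call through assignments and string building to a query executor; a real interprocedural analysis is far more involved.

```python
import ast

SOURCES = {"input"}        # toy model: calls returning untrusted data
SINK_ATTRS = {"execute"}   # toy model: query-executor methods

def _names_in(node):
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def _calls_source(node):
    return any(isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
               and n.func.id in SOURCES for n in ast.walk(node))

def find_tainted_sinks(source: str):
    """Flag sink calls reached by source-derived data, following the
    flow through assignments and string concatenation."""
    tainted, findings = set(), []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign) and (
                _calls_source(stmt.value) or _names_in(stmt.value) & tainted):
            tainted |= {t.id for t in stmt.targets if isinstance(t, ast.Name)}
        for call in ast.walk(stmt):
            if (isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Attribute)
                    and call.func.attr in SINK_ATTRS
                    and any(_names_in(a) & tainted for a in call.args)):
                findings.append(stmt.lineno)
    return findings

code = (
    "name = input()\n"
    "query = 'SELECT * FROM users WHERE name = ' + name\n"
    "cursor.execute(query)\n"
)
print(find_tainted_sinks(code))  # [3]
```

No line here contains a suspicious string literal; the finding falls out of tracking where the untrusted value ends up.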
The Approach
The approach combines two technologies: Tree-sitter for language-agnostic AST parsing across 8+ languages, and LLM reasoning for semantic understanding of what the code actually does.
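To make the parsing half concrete: a Tree-sitter query matches code by structure rather than by text. The fragment below is an illustrative query against the Python grammar; real rules carry more context.

```scheme
; Match any call whose callee is eval or exec, however the call is
; spread across lines or whitespace -- a grep rule can't do this.
(call
  function: (identifier) @callee
  (#match? @callee "^(eval|exec)$"))
```

The same query shape ports across languages because Tree-sitter exposes every grammar through the same node/query interface.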
The key innovation is mining patterns from actual CVE fixes rather than hand-writing rules. The system monitors CVE disclosures, finds the fixing commits on GitHub, extracts the vulnerability pattern from the diff, generalizes it using AST analysis, and validates with LLM reasoning. New vulnerability patterns get detection rules within hours of disclosure, not weeks.
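The extract-and-generalize steps can be sketched with stdlib tools. The before/after snippet below is a hypothetical fixing commit, not a real CVE; a real system would fetch commit pairs from GitHub and generalize far more aggressively.

```python
import ast
import difflib

def removed_lines(before: str, after: str):
    """Pull the lines deleted by the fixing commit -- the vulnerable code."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="")
    return [l[1:] for l in diff if l.startswith("-") and not l.startswith("---")]

def generalize(line: str):
    """Generalize a concrete vulnerable line into an AST-level pattern:
    here, just the dotted name of the call being made."""
    tree = ast.parse(line.strip())
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            return ast.unparse(node.func)
    return None

# Hypothetical fixing commit (illustrative only).
before = "path = os.path.join(base, request.args['file'])"
after = "path = safe_join(base, request.args['file'])"

pattern = [generalize(l) for l in removed_lines(before, after)]
print(pattern)  # ['os.path.join']
```

The extracted pattern -- a sink call fed by request data -- then becomes a detection rule candidate for LLM validation.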
Where It Works
Path traversal detection is where semantic analysis shines most. Traditional tools miss sanitization bypasses -- URL encoding (%2e%2e), double encoding, Unicode normalization. Semantic SAST understands that the intent is path traversal prevention but the implementation is flawed.
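A concrete bypass, as a minimal sketch: the sanitizer below checks the raw string, but URL decoding happens further down the stack, so encoded traversal sequences sail through. A semantic rule instead reasons about the value that actually reaches the filesystem.

```python
from urllib.parse import unquote

def naive_sanitize(path: str) -> str:
    # Intent: strip traversal sequences.
    # Flaw: it runs before the URL decoding downstream.
    return path.replace("..", "")

def flags_traversal(path: str) -> bool:
    # Semantic rule: peel repeated encodings, then check what the
    # filesystem will actually see.
    decoded = path
    for _ in range(2):
        decoded = unquote(decoded)
    return ".." in decoded

encoded = "%2e%2e%2f%2e%2e%2fetc%2fpasswd"  # ../../etc/passwd, URL-encoded
double = "%252e%252e%252fetc%252fpasswd"    # double-encoded variant

print(naive_sanitize(encoded) == encoded)  # True: sanitizer changed nothing
print(flags_traversal(encoded), flags_traversal(double))  # True True
print(flags_traversal("reports/q1.pdf"))   # False
```

The sanitizer's intent is correct; its placement is the bug, and that is exactly the kind of flaw a string pattern can't express.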
The approach is similarly strong on XXE and deserialization attacks, where the vulnerability is a configuration choice or an indirect data flow that pattern matching can't follow.
OWASP Benchmark results (semantic vs. traditional):

- SQL Injection: 71% vs. 48%
- XSS: 68% vs. 41%
- XXE: 89% vs. 52%
- Deserialization: 76% vs. 39%
- Path Traversal: 64% vs. 43%
Where It Falls Short
False positives. LLM reasoning reduces them compared to traditional tools, but custom frameworks, unusual architectures, and domain-specific safety patterns still generate noise.
Cost. Running semantic analysis on every file in a large codebase is expensive. In practice, you need a hybrid approach: fast pattern matching as a first pass, LLM analysis only on suspicious findings.
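The two-stage shape looks roughly like this sketch, where llm_analyze is a hypothetical stand-in for a real (and expensive) model call:

```python
import re

# Stage 1: cheap pattern pass over every file in the repo.
SUSPICIOUS = re.compile(r"(execute|eval|pickle\.loads)\s*\(")

def first_pass(files):
    return [name for name, text in files.items() if SUSPICIOUS.search(text)]

# Stage 2: expensive semantic analysis, run only on stage-1 hits.
def llm_analyze(text):
    # Placeholder verdict standing in for a real model call.
    return "user" in text

def scan(files):
    return [f for f in first_pass(files) if llm_analyze(files[f])]

repo = {
    "api.py": "cursor.execute('SELECT * WHERE id=' + user_id)",
    "util.py": "def add(a, b): return a + b",
    "report.py": "cursor.execute('SELECT 1')",
}
print(first_pass(repo))  # ['api.py', 'report.py']
print(scan(repo))        # ['api.py']
```

The cheap filter deliberately over-triggers; its job is only to keep the expensive stage off the obviously clean files.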
Reproducibility. LLM-based analysis can give different results on different runs. Temperature-zero inference and multi-pass consensus mitigate this but don't fully solve it.
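Multi-pass consensus is simple to sketch: run the analysis several times, take the majority verdict, and treat low agreement as a signal for human review rather than a finding.

```python
from collections import Counter

def consensus(verdicts):
    """Majority vote across repeated runs, plus an agreement score
    so low-confidence findings can be routed to a human."""
    counts = Counter(verdicts)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(verdicts)

runs = ["vulnerable", "vulnerable", "safe", "vulnerable", "vulnerable"]
label, agreement = consensus(runs)
print(label, agreement)  # vulnerable 0.8
```

This damps run-to-run variance but, as noted, doesn't eliminate it: five agreeing runs can still all be wrong.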
It's not a replacement for human security review on critical systems. It's a force multiplier -- catching things humans miss while flagging areas that need human judgment.
The code is open source at semantic-sast. Security tooling that can't reason about behavior is testing for the wrong thing.