Research synthesis · 2026-05-23 · revision 2

LLM-augmented audit for a single repository

Scope: one repo at a time, tooling tailored to its stack. Two outputs: whole-repo audit report and per-MR / diff review packet. Pragmatic recipes, corrected per-axis framing, and a concrete pipeline.

TL;DR

Scope: one repo at a time. Tooling chosen by the repo's stack. Two outputs: whole-repo audit and per-MR review packet.
Reframe: Semgrep is AppSec, not architecture or quality. Use stack-native arch tools (dependency-cruiser, ArchUnitNET, import-linter, Conftest) and quality tools (Sonar, kube-linter, ruff, etc.) on their own axes.
Pipeline: detect stack → run stack-native analyzers → load CLAUDE.md rulebook → Claude-Agent-SDK triages SARIF + reasons about architecture → Markdown out → render to HTML packet.
Recommended path: Extend sw-audit with two per-repo commands: audit and review <mr|range>. Lean on PR-Agent for review prompts; lean on tech-debt-skill / fan-out-audit / metis patterns for whole-repo. Markdown canonical → static HTML.

The reframe — four independent axes

The earlier report leaned on Semgrep across axes. Correction: Semgrep is one input to the security axis; each axis has its own canonical tooling per stack.

Axis | Canonical tooling (cross-stack) | LLM adds
Security (SAST + SCA + secrets + IaC) | Snyk / Trivy / Checkov / Bandit / Semgrep / Roslyn-security / kube-linter | Triage, dedup, exploitability rationale
Architecture / layering | dependency-cruiser, ts-arch, ArchUnitNET, NetArchTest, import-linter, Conftest/OPA | Narrative across modules; cite Clean-Arch rule violated
Quality / smells / debt | Sonar family, ESLint+SonarJS, Roslyn+SonarAnalyzer, ruff+pylint, tflint, kube-linter, CodeScene | Dedup overlapping rules; rank by blast radius
Custom rules / fitness functions | OPA/Conftest (IaC/K8s), Roslyn custom analyzers (.NET/Unity), dep-cruiser/ts-arch rules (TS), import-linter contracts (Py) | Reason about rules too soft for static encoding ("should-not-have-done")
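
One way to keep the axes separate in a driver is to tag every finding with the axis of the tool that produced it. The sketch below is illustrative only: the TOOL_AXIS mapping and the axis_for / count_by_axis helpers are hypothetical names, and tools that the table lists on more than one axis (Conftest, kube-linter) are pinned to a single axis here for simplicity.

```python
from collections import Counter

# Illustrative mapping (an assumption, not a fixed taxonomy):
# lowercased tool name -> the axis its findings are reported under.
TOOL_AXIS = {
    "semgrep": "security",
    "trivy": "security",
    "checkov": "security",
    "bandit": "security",
    "dependency-cruiser": "architecture",
    "import-linter": "architecture",
    "archunitnet": "architecture",
    "conftest": "architecture",
    "ruff": "quality",
    "pylint": "quality",
    "tflint": "quality",
    "kube-linter": "quality",
}


def axis_for(tool_name: str) -> str:
    """Return the axis for a tool, or 'unclassified' if the tool is unknown."""
    return TOOL_AXIS.get(tool_name.strip().lower(), "unclassified")


def count_by_axis(findings: list[dict]) -> Counter:
    """Count findings per axis; each finding needs a 'tool' key."""
    return Counter(axis_for(finding["tool"]) for finding in findings)
```

Unknown tools land in "unclassified" rather than being silently folded into the security axis, which is the failure mode the reframe is meant to avoid.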

Per-stack recipes

For each stack: minimum analyzer set, where to put custom rules, and what the LLM layer adds on top.

NestJS / TypeScript backend

Security
npm audit / Snyk / OSV; eslint-plugin-security, eslint-plugin-nestjs-security; Semgrep TS rules
Architecture
dependency-cruiser (SARIF), ts-arch, Madge (cycles), eslint-plugin-boundaries or Sheriff; Nx tags if monorepo
Quality
ESLint + @typescript-eslint + SonarJS + eslint-plugin-import + eslint-plugin-nestjs-typed; knip (dead exports); tsc --noEmit
Custom rules
dep-cruiser forbidden in .dependency-cruiser.cjs; ts-arch as Jest tests; ESLint rules for Nest decorator misuse
LLM: ingest SARIF (ESLint, SonarJS, Semgrep) + dep-cruiser JSON → cluster by module/controller, explain controller→repo bypasses, rank by blast radius.
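
"Rank by blast radius" can be approximated before the LLM sees anything. The sketch below assumes blast radius means the number of modules that transitively depend on the module a finding sits in, and that the dependency graph has already been reduced to a plain mapping of module to imported modules (extracting that mapping from dependency-cruiser output is not shown). The function names and the "module" key on findings are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(deps: dict[str, list[str]]) -> dict[str, int]:
    """For each module, count how many other modules depend on it transitively."""
    reverse = defaultdict(set)  # module -> modules that import it directly
    modules = set(deps)
    for src, targets in deps.items():
        for dst in targets:
            reverse[dst].add(src)
            modules.add(dst)

    radius = {}
    for mod in modules:
        seen = set()
        queue = deque([mod])
        while queue:  # breadth-first walk over reverse edges
            current = queue.popleft()
            for dependent in reverse[current]:
                if dependent != mod and dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        radius[mod] = len(seen)
    return radius


def rank_findings(findings: list[dict], deps: dict[str, list[str]]) -> list[dict]:
    """Sort findings so those in the most depended-upon modules come first."""
    radius = blast_radius(deps)
    return sorted(findings, key=lambda f: radius.get(f["module"], 0), reverse=True)
```

A finding in a repository module imported by many services then sorts above the same finding in a leaf controller, which gives the LLM layer a starting order to explain or override.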

C# / .NET backend

Security
Security Code Scan, Roslyn security rules, dotnet list package --vulnerable, Snyk/OSV
Architecture
ArchUnitNET (fluent xUnit assertions) and/or NetArchTest; NsDepCop (namespace deps); NDepend (CQLinq, commercial, deep)
Quality
SonarAnalyzer.CSharp, StyleCop.Analyzers, JetBrains InspectCode CLI (SARIF), dotnet format
Custom rules
Custom Roslyn analyzers for project Clean-Arch rules; CQLinq in NDepend; xUnit project hosting ArchUnitNET asserts
LLM: feed Roslyn/Sonar SARIF + ArchUnitNET text + NDepend XML → layered-violation narrative, dedup Sonar smells vs. ArchUnit findings, severity rationale.

Unity client C# / C++ · thin tooling

Security
Limited. Treat as C# (Roslyn security) + asset/secret scanning (Trivy fs, gitleaks)
Architecture
ArchUnitNET / NetArchTest against built assemblies in Edit-mode tests; NDepend on the generated SLN
Quality
Microsoft.Unity.Analyzers (official, active), JetBrains Rider Unity inspections via InspectCode, SonarQube on the generated SLN
Custom rules
Custom Roslyn analyzers for project conventions (no GameObject.Find, no Resources.Load, MonoBehaviour boundary rules); IL post-processors for deeper checks
LLM: cross-reference analyzer output with Unity Profiler/Memory Profiler exports → narrate "perf-architecture" issues (e.g., GC-alloc hotspots tied to MonoBehaviour anti-patterns). No silver bullet here.

Terraform infra

Security
Trivy (trivy config, tfsec merged in) and/or Checkov (broadest rules, native SARIF); KICS alternative
Architecture
Conftest + OPA/Rego on terraform show -json plan.json (module composition, tag/region/network constraints); terraform-compliance for BDD; CUE if schema-first
Quality
tflint (provider plugins), terraform fmt -check, terraform validate
Custom rules
Rego in policy/ via Conftest; Checkov custom YAML/Python; tflint custom rules
LLM: SARIF (Checkov/Trivy) + Conftest JSON + Infracost diff → triage by blast radius (public S3 in prod), explain drift, group remediations per module.

Helm / Kubernetes manifests

Security
kubescape (NSA/CIS/MITRE, SARIF), Trivy k8s/config, Checkov k8s
Architecture
Conftest/OPA Rego against rendered manifests (helm template | conftest test -); required labels, NetworkPolicy, namespace shape. Note: datree is sunset; use Monokle / kubescape.
Quality
kube-linter (StackRox), Polaris (Fairwinds), kube-score, kubeconform (kubeval deprecated), helm lint
Custom rules
Conftest Rego in policy/; kube-linter custom templates; Polaris custom checks
LLM: ingest kube-linter/Polaris/kubescape SARIF + rendered manifest tree → cluster per workload, explain prod-readiness gaps, propose chart-level (not per-manifest) fixes.

Python data platform

Security
bandit (or ruff S-rules), pip-audit / Snyk, Semgrep py rules
Architecture
import-linter (layers/forbidden/independence contracts in .importlinter), pydeps / grimp for graphs; lint-imports in CI
Quality
Ruff (replaces flake8/isort/pyupgrade/pydocstyle/black-format), pyright or mypy, pylint for deeper smells, vulture (dead code)
Custom rules
import-linter contracts; Ruff per-file selectors; Semgrep custom py rules; dbt-checkpoint / sqlfluff if dbt/SQL
LLM: SARIF (Ruff, Bandit, Pyright) + import-linter text → narrate layer violations ("domain/ importing infra/"), dedup Ruff/Pylint overlap, rank by pipeline criticality.

Stack detection (single-repo driver)

  • package.json → Node. dependencies contains @nestjs/core → NestJS recipe. nx.json → Nx variant.
  • *.sln / *.csproj → .NET. Presence of ProjectSettings/ProjectVersion.txt or Assets/ + Packages/manifest.json → Unity recipe.
  • *.tf / terragrunt.hcl → Terraform.
  • Chart.yaml → Helm. Else kustomization.yaml or any *.yaml with apiVersion:/kind: → K8s manifests.
  • pyproject.toml / requirements*.txt → Python. dbt_project.yml → dbt overlay.
  • Tie-break: largest LOC stack wins. Multi-stack repo → run multiple recipes, tag findings with stack.
  • Output: each tool writes to .audit/<stack>/<tool>.sarif; the driver merges them into a single SARIF log with one run per tool.
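
The fingerprint rules above can be sketched as a small detector. This is a minimal illustration, not the sw-audit implementation: the function and stack names are hypothetical, package.json is only checked at the repo root, the Unity markers are checked independently of *.sln / *.csproj (an assumption), and the LOC tie-break is left out because the sketch simply returns every stack it finds.

```python
from pathlib import Path
import json


def _has(repo: Path, pattern: str) -> bool:
    """True if any path under repo matches the glob pattern."""
    return any(repo.rglob(pattern))


def _looks_like_k8s(path: Path) -> bool:
    """Crude check: the file mentions both apiVersion: and kind:."""
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return False
    return "apiVersion:" in text and "kind:" in text


def detect_stacks(repo: Path) -> set[str]:
    """Return the set of stack labels whose fingerprints appear in the repo."""
    stacks: set[str] = set()

    pkg = repo / "package.json"
    if pkg.is_file():
        stacks.add("node")
        try:
            manifest = json.loads(pkg.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            manifest = {}
        deps = manifest.get("dependencies") if isinstance(manifest, dict) else None
        if "@nestjs/core" in (deps or {}):
            stacks.add("nestjs")
        if (repo / "nx.json").is_file():
            stacks.add("nx")

    if _has(repo, "*.sln") or _has(repo, "*.csproj"):
        stacks.add("dotnet")
    if (repo / "ProjectSettings" / "ProjectVersion.txt").is_file() or (
        (repo / "Assets").is_dir() and (repo / "Packages" / "manifest.json").is_file()
    ):
        stacks.add("unity")

    if _has(repo, "*.tf") or _has(repo, "terragrunt.hcl"):
        stacks.add("terraform")

    if _has(repo, "Chart.yaml"):
        stacks.add("helm")
    elif _has(repo, "kustomization.yaml") or any(
        _looks_like_k8s(p) for p in repo.rglob("*.yaml")
    ):
        stacks.add("k8s")

    if _has(repo, "pyproject.toml") or _has(repo, "requirements*.txt"):
        stacks.add("python")
    if _has(repo, "dbt_project.yml"):
        stacks.add("dbt")
    return stacks
```

A real driver would likely skip vendored directories such as node_modules and .git before globbing; that filtering is omitted here to keep the sketch short.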

Canonical pipeline

1. Detect stack: Lockfiles + manifest fingerprints pick the recipe.
2. Run analyzers: Stack-native security / arch / quality tools emit SARIF.
3. Load rulebook: CLAUDE.md + .claude/skills/audit/SKILL.md injected at session start.
4. Index repo: Aider repo-map / Claude Code @codebase for navigation.
5. Agent fan-out: Per-module subagents apply rulebook over small file batches.
6. LLM triage: Dedup SARIF; explain; sketch fixes (Metis pattern).
7. Severity + cite: file:line citations, severity bucket, effort estimate.
8. Emit packet: AUDIT.md canonical → audit.html; audit.sarif sidecar.
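
Before the LLM triage step, a cheap mechanical pass can flatten the SARIF runs into file:line findings and collapse exact location overlaps. The sketch below reads only standard SARIF 2.1.0 fields (runs, tool.driver.name, results, ruleId, level, locations). Deduplicating purely by file and line is a deliberately crude assumption, and the function names are hypothetical; semantic dedup is what the LLM layer is for.

```python
LEVEL_RANK = {"error": 3, "warning": 2, "note": 1, "none": 0}


def flatten_sarif(log: dict) -> list[dict]:
    """Turn a SARIF log (one run per tool) into a flat list of findings."""
    findings = []
    for run in log.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "tool": tool,
                "rule": result.get("ruleId", ""),
                # SARIF treats a missing level as "warning".
                "level": result.get("level", "warning"),
                "file": physical.get("artifactLocation", {}).get("uri", ""),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    return findings


def dedup_by_location(findings: list[dict]) -> list[dict]:
    """Merge findings that share file:line, keeping the highest level seen."""
    merged: dict[tuple, dict] = {}
    for finding in findings:
        key = (finding["file"], finding["line"])
        if key not in merged:
            merged[key] = {**finding, "tools": [finding["tool"]]}
            continue
        kept = merged[key]
        if finding["tool"] not in kept["tools"]:
            kept["tools"].append(finding["tool"])
        if LEVEL_RANK.get(finding["level"], 0) > LEVEL_RANK.get(kept["level"], 0):
            kept.update(
                level=finding["level"],
                rule=finding["rule"],
                message=finding["message"],
            )
    return list(merged.values())
```

Keeping the list of contributing tools on each merged finding lets the triage prompt say "ESLint and Semgrep both flag this line" instead of presenting two unrelated items.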

Whole-repo audit — top picks

1. ksimback/tech-debt-skill · github.com/ksimback/tech-debt-skill
Claude Code skill; produces TECH_DEBT_AUDIT.md with file-cited findings, severity, effort. Per-repo override via .claude/skills/tech-debt-audit/SKILL.md. MIT.
Tradeoff: tech-debt slant; extend the skill for Clean-Arch rules.
2. lachiejames/fan-out-audit · github.com/lachiejames/fan-out-audit
Slash command spawning ~200 parallel Claude subagents over small file batches; explicit precision-vs-context-bloat design. Whole-repo, target-dir scopable.
Tradeoff: raw token cost; needs orchestration around it.
3. arm/metis · github.com/arm/metis
OSS AI security review with native SARIF triage (--triage); validates third-party SAST; supports vLLM / Ollama for local models.
Tradeoff: security-shaped; useful as the SARIF-triage component, not the whole audit.
Also: evilsocket/audit · github.com/evilsocket/audit
8-stage Claude-Agent-SDK pipeline with prompts + JSON schemas + SQLite state. Closest "single-repo audit driver" reference implementation.

MR / diff review packet — top picks

1. Qodo PR-Agent · github.com/qodo-ai/pr-agent · GitLab self-host docs
/describe /review /improve; hierarchical best_practices.md with per-folder/per-group scope; on-prem GitLab; CLI returns Markdown/JSON.
Tradeoff: no first-class standalone HTML — you wrap and template.
2. anthropics/claude-code-security-review · github.com/anthropics/claude-code-security-review
Official GitHub Action + /security-review slash command; CLAUDE.md-driven; reads pending diff.
Tradeoff: GitHub-bound, security-focused; pattern is reusable for GitLab via Claude Agent SDK.
3. GitLab Duo Code Review Flow · docs.gitlab.com/.../code_review/
GA from GitLab 18.8 (Jan 2026), Sonnet 4.6 in 19.1; supports repo-specific review instructions; self-managed via Duo Self-Hosted AI Gateway.
Tradeoff: UI-only output today, no portable review packet, no arbitrary commit range.
Also: codedog · Greptile · CodeRabbit · kodus/Kody
codedog emits a Markdown report; Greptile has rich .greptile/rules.md; CodeRabbit has .coderabbit.yaml learnings; Kody is OSS, model-agnostic.

Rulebook ingestion — pick one

Use CLAUDE.md (root) + .claude/skills/<audit>/SKILL.md

Two reasons:

  • Claude Code's progressive loading: SKILL.md loads only when the audit runs, so the big rulebook doesn't bloat every session; the small CLAUDE.md just dispatches to it.
  • Co-located with the repo, version-controlled, reusable from CLI, GitHub/GitLab Action, or Claude Agent SDK.

For MR review only, Qodo's best_practices.md is more expressive — drive Qodo from the same Markdown via symlink / extra_instructions.

Report format — pick one

Markdown canonical → static HTML render

  • Markdown is what every agent (Claude, Qodo, Greptile, codedog, Metis) emits natively, and what GitLab/GitHub render inline.
  • Pipe through pandoc or a single marked + Prism + diff2html template for the standalone "review packet" HTML/PDF.
  • Generating HTML directly from the LLM wastes tokens, breaks diffability, and loses the GitLab MR-comment path.
  • Pair with audit.sarif sidecar for GitHub/GitLab Code Scanning ingestion.
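
To make "Markdown canonical" concrete, the sketch below renders triaged findings into a Markdown report grouped by severity, with a file:line citation on each item. The finding keys (severity, file, line, message, effort), the three severity buckets, and the function name are assumptions for illustration, not a fixed schema.

```python
SEVERITY_ORDER = ["high", "medium", "low"]


def render_audit_markdown(title: str, findings: list[dict]) -> str:
    """Render findings as Markdown, one section per non-empty severity bucket."""
    lines = [f"# {title}", ""]
    for severity in SEVERITY_ORDER:
        bucket = [item for item in findings if item["severity"] == severity]
        if not bucket:
            continue
        lines.append(f"## {severity.capitalize()} ({len(bucket)})")
        lines.append("")
        # Stable order by location keeps the report diffable between runs.
        for item in sorted(bucket, key=lambda i: (i["file"], i["line"])):
            lines.append(
                f"- `{item['file']}:{item['line']}` {item['message']} "
                f"(effort: {item['effort']})"
            )
        lines.append("")
    return "\n".join(lines)
```

The resulting AUDIT.md can then go through a generic converter, for example `pandoc -s AUDIT.md -o audit.html`, so the LLM never has to emit HTML itself.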

Three ways forward

Aspect | A. Integrate heavy | B. Thin custom on top of OSS pieces (recommended) | C. All-custom from scratch
Whole-repo audit | Qodo self-host "rule system"; or commercial Sonar + CodeScene | tech-debt-skill / fan-out-audit pattern + Claude Agent SDK; analyzers per stack; Metis-style SARIF triage | Hand-rolled prompts + own orchestrator
MR / diff packet | GitLab Duo Code Review Flow inline; abandon HTML packet | GitLab API → PR-Agent CLI → diff2html + Jinja → static HTML | Hand-rolled prompts + UI
Rulebook | Per-tool config (Qodo, Sonar) | CLAUDE.md + SKILL.md (single source); also feeds PR-Agent best_practices.md | Custom prompt template
Engineering cost | Low | Medium | High
Lock-in / recurring $ | High | Low (Apache-2.0 / MIT) | None
Risk | Vendor roadmap | PR-Agent / skill prompts (forkable) | Prompt-engineering bills + time

Recommendation

Path B — thin per-repo CLI on top of OSS pieces

Two new sw-audit commands, both single-repo:

  • sw-audit audit <repo> — detect stack, run stack-native analyzers, load CLAUDE.md rulebook, fan out Claude subagents (tech-debt-skill / fan-out-audit pattern), triage SARIF (Metis-style), emit AUDIT.md → render audit.html.
  • sw-audit review <repo> <mr|range> — fetch diff + commits + before/after trees via GitLab API, run PR-Agent CLI with project rulebook, render via diff2html + Jinja into a self-contained review.html packet.
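
The two commands map naturally onto a subcommand parser. The sketch below shows only the argument surface; how sw-audit is actually implemented is not stated here, so the Python/argparse choice, the build_parser and classify_target names, and the ".." heuristic for telling a commit range from an MR reference are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument surface for the two proposed single-repo commands."""
    parser = argparse.ArgumentParser(prog="sw-audit")
    sub = parser.add_subparsers(dest="command", required=True)

    audit = sub.add_parser("audit", help="whole-repo audit, emits AUDIT.md")
    audit.add_argument("repo", help="path to the repository")

    review = sub.add_parser("review", help="per-MR / diff review packet")
    review.add_argument("repo", help="path to the repository")
    review.add_argument("target", help="MR reference or commit range")
    return parser


def classify_target(target: str) -> str:
    """Treat 'a..b' (or 'a...b') as a commit range, anything else as an MR."""
    return "range" if ".." in target else "mr"


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.command == "review":
        print(f"review {args.repo} ({classify_target(args.target)}: {args.target})")
    else:
        print(f"audit {args.repo}")
```

Keeping the parser free of pipeline logic means the detector, analyzers, and LLM layer stay separately replaceable, in line with the mitigation described below.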

Main tradeoff: you maintain the glue between detector, analyzers, and LLM layer. Mitigation: each piece is replaceable; rulebook lives in the repo, not in the tool.