Mistral 7B for Smart Contract Auditing: Benchmarks

AI-Web3

2024-01-15


Deploy Mistral 7B for smart contract audits in 2024: GPTQ 4-bit quantisation cuts VRAM to 4 GB, vLLM delivers 2-4x throughput, and F1 reaches 0.69 on reentrancy. Benchmark on SmartBugs now.

Frequently Asked Questions

Mistral 7B cannot replace a human smart contract auditor. The model achieves F1 scores in the 0.55 to 0.69 range on known vulnerability benchmarks. Even at the top of that range, with precision and recall roughly balanced, it misses roughly one in three vulnerabilities and generates a meaningful number of false positives. Human auditors catch cross-contract logic errors, business logic flaws specific to protocol tokenomics, and novel vulnerability patterns that are not in the training data. Mistral 7B is a pre-screening productivity tool. Its output should always be treated as a candidate list for human review, not a final deliverable.
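F1 is the harmonic mean of precision and recall, so an F1 figure maps to a miss rate only once the balance between the two is known. The sketch below shows the arithmetic for the balanced case. The function name and the counts are hypothetical, chosen only so that the result lands at the top of the quoted range; they are not benchmark data.

```python
def detection_metrics(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Compute precision, recall, F1 and miss rate from confusion counts."""
    flagged = true_pos + false_pos   # everything the model reported
    actual = true_pos + false_neg    # everything that is really vulnerable
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miss_rate": 1.0 - recall,   # share of real vulnerabilities not flagged
    }


# Hypothetical counts (NOT measured results): 100 real vulnerabilities,
# 69 of them flagged, 31 missed, plus 31 false alarms.
metrics = detection_metrics(true_pos=69, false_pos=31, false_neg=31)
print(metrics)
```

With these illustrative counts, precision, recall and F1 all equal 0.69, and the miss rate is 0.31, which is the "roughly one in three" reading. If precision and recall diverge, the same F1 can hide a noticeably higher or lower miss rate, so recall is worth reporting alongside F1.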

GPU-only batch inference using GPTQ 4-bit quantisation via vLLM requires a minimum of 8 GB VRAM for single-contract queries and 16 to 24 GB VRAM for batch sizes of 16 to 32 concurrent sequences with 4096-token context. An NVIDIA A10G or RTX 3090 with 24 GB VRAM handles production batch workloads. For workstations with less VRAM, GGUF Q4_K_M with llama.cpp partial GPU offload runs on 8 GB VRAM plus 32 GB system RAM at roughly 10 to 15 tokens per second.
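The VRAM figures above can be sanity-checked with a back-of-the-envelope estimate: quantised weights plus the fp16 key/value cache for every concurrent sequence. The sketch below is a rough lower bound, not a measurement. It assumes the published Mistral 7B shape (about 7.24 billion parameters, 32 layers, 8 key/value heads, head dimension 128), treats GPTQ as exactly 4 bits per weight, and ignores activations, CUDA context, quantisation metadata and allocator overhead. All function and class names are hypothetical.

```python
from dataclasses import dataclass

GIB = 2**30


@dataclass(frozen=True)
class ModelShape:
    """Assumed Mistral 7B architecture values (grouped-query attention)."""
    n_params: float = 7.24e9
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128


def weight_gib(shape: ModelShape, bits_per_weight: float = 4.0) -> float:
    """Approximate weight memory, ignoring GPTQ scales and zero points."""
    return shape.n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(shape: ModelShape, batch_size: int, context_tokens: int,
                 bytes_per_value: int = 2) -> float:
    """KV cache size when every sequence fills its whole context (fp16)."""
    # Factor 2 covers one key and one value tensor per layer.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_value
    return per_token * batch_size * context_tokens / GIB


def estimate_vram_gib(shape: ModelShape, batch_size: int, context_tokens: int) -> float:
    """Lower-bound estimate: weights plus KV cache only."""
    return weight_gib(shape) + kv_cache_gib(shape, batch_size, context_tokens)


shape = ModelShape()
for batch in (1, 16, 32):
    total = estimate_vram_gib(shape, batch_size=batch, context_tokens=4096)
    print(f"batch={batch:>2}  estimated VRAM >= {total:.1f} GiB")
```

Under these assumptions the KV cache costs 128 KiB per token, or 0.5 GiB per full 4096-token sequence, so batch size dominates the budget well before the weights do. Real deployments need headroom on top of this estimate. vLLM in particular reserves a configurable fraction of GPU memory up front, so observed usage will typically be higher than the computed floor.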

GPT-4 performs better on complex multi-step vulnerability reasoning: in Ancilar internal benchmarks from Q4 2023, reentrancy recall was approximately 0.79 for GPT-4 versus approximately 0.68 for Mistral 7B zero-shot. However, GPT-4 is closed-source and cannot be deployed on-premises, which creates data privacy concerns for client contract source code. Its per-token cost also makes batch auditing expensive at scale, approximately 50 to 80 times higher per contract than Mistral 7B on local GPU infrastructure. For pipelines where LLM output is one layer among several, that recall gap does not justify the cost and data-handling trade-offs.