Best AI for Legal Documents in 2026

Top 29 AI models benchmarked on legal document evaluation — contract review, e-discovery, multi-jurisdictional compliance, M&A due diligence, and structured recommendations. Gemini 3.1 Pro scored a perfect 100%, but 11 of 29 models failed entirely. Legal AI is high-stakes.

What This Benchmark Tests (and Does Not Test)

Task: Legal document evaluation — assessing AI selection across nuanced legal scenarios
Categories: Privileged contract review (1 test), selection criteria identification (1), general vs. legal-specific AI (1), multi-jurisdictional compliance (1), contract lifecycle capabilities (1), unsuitable tool rejection (1), e-discovery at scale (1), structured JSON recommendation (1), M&A due diligence (1), frontier policy with risk controls (1)
Difficulty: Weighted from 2 to 15 points — later tests require domain-specific knowledge of legal specialization, auditability, and jurisdiction-awareness
Scoring: Deterministic — contains_all, contains_any, exact_match, and json_schema modes with partial credit where applicable
Points: 10 tests with weighted points (2+4+3+8+7+2+10+9+12+15) = 72.0 max score
Models tested: 29 models from 10 providers (18 completed all tests)
Stability: 2 runs per model
Config: Default API configurations, recommended temperature where available
Date: March 2026
Does not test: Actual legal accuracy of outputs, jurisdiction-specific case law knowledge, attorney-client privilege handling, or compliance with specific regulatory frameworks

All models tested with default API configurations. 11 of 29 models scored 0% or failed to complete all tests. This benchmark evaluates whether a model can reason about legal AI tool selection, not whether it can practice law.
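
To make the four scoring modes concrete, here is a minimal Python sketch of how a deterministic grader with partial credit could work. The function, its signature, and the partial-credit rules are illustrative assumptions, not OpenMark's published implementation.

```python
import json

def grade(mode: str, output: str, expected) -> float:
    """Score a model's answer under one of four deterministic modes.

    Illustrative only: the partial-credit rules here are assumptions,
    not OpenMark's actual grader internals.
    """
    text = output.strip().lower()
    if mode == "exact_match":
        return 1.0 if text == expected.strip().lower() else 0.0
    if mode == "contains_any":
        # Full credit if at least one accepted concept appears.
        return 1.0 if any(c.lower() in text for c in expected) else 0.0
    if mode == "contains_all":
        # Partial credit: fraction of required concepts mentioned.
        return sum(c.lower() in text for c in expected) / len(expected)
    if mode == "json_schema":
        # Minimal check: output parses as JSON and has the required keys.
        try:
            obj = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        required = expected.get("required", [])
        hits = sum(k in obj for k in required)
        return hits / len(required) if required else 1.0
    raise ValueError(f"unknown scoring mode: {mode}")

# A contract-review answer naming 2 of 3 required concepts
# earns partial credit under contains_all:
print(grade("contains_all",
            "Prioritize confidentiality and auditability.",
            ["confidentiality", "auditability", "jurisdiction-awareness"]))
# -> 0.666...
```

Each test's fractional score would then be multiplied by its point weight (2 to 15) and summed toward the 72-point maximum.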

Benchmark Results

[Figure: bar chart] Accuracy by model — Legal Document Evaluation: Contract Review, E-Discovery, Compliance, and Due Diligence (March 2026). Gemini 3.1 Pro leads with a perfect 100%.
[Figure: table] Full results — scores, cost, speed, stability, and token usage (March 2026).

Key Findings

🏆 Perfect Score: Gemini 3.1 Pro Aces Every Legal Test

Gemini 3.1 Pro scored 100% (72.0/72.0) with zero variance across runs — the only model to answer every legal evaluation test correctly. Claude Haiku 4.5 followed closely at 97%, then GPT-5.4 and Command-A both at 96%. The top 4 models are tightly clustered, but only one achieved a perfect score.

💰 The Sweet Spot: Claude Haiku 4.5 at 97% for $0.014

Claude Haiku 4.5 scored 97% at $0.0136 per run — roughly 1/8th the cost of Gemini Pro ($0.107) with only 3% lower accuracy. It also showed perfect stability (±0.000) and was 2.6x faster at 44.3 seconds. For most legal document workflows, Claude Haiku offers the strongest balance of accuracy, cost, and reliability. Command-A matched at 96% for $0.0123 with similarly strong performance.

⚡ Budget Stars: Codestral at 88% for Under $0.001

Codestral scored 88% at $0.000978 per run and was the fastest model at 12.3 seconds (310 Acc/min). DeepSeek Chat scored 81% at $0.000615 — the highest accuracy-per-dollar in the benchmark at 94,837. Both are Low-tier models that outperform several High-tier alternatives. For teams needing volume processing at lower cost, these models deliver strong results at roughly 1/100th of Gemini Pro's price.
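
For readers decoding the efficiency columns: assuming Acc/$ means weighted points earned per dollar, DeepSeek Chat's figure checks out as 0.81 × 72 ≈ 58.3 points, divided by $0.000615 ≈ 94,800, in line with the reported 94,837. Codestral's 310 Acc/min follows the same logic per unit time: 0.88 × 72 ≈ 63.4 points over 12.3 seconds ≈ 309 points per minute.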

📉 High Failure Rate: 11 of 29 Models Scored Zero

Legal document evaluation proved to be one of the most demanding benchmark categories — 11 models failed entirely. Qwen models, GPT-5-Nano, GPT-5.4 Pro, DeepSeek Reasoner, Grok-4, GLM-5, and Codex Mini all scored 0%. Claude Opus 4.6 scored only 2.8% with 10% completion. Even Claude Sonnet 4.6, a capable model, reached only 79% with 45% completion. Legal evaluation demands domain-specific reasoning that not every model can handle.

Why These Results May Surprise You

A coding model in the top 8 on a legal benchmark. A budget model outperforming Anthropic's flagship. Here is why:

📐 Legal evaluation tests structured reasoning, not legal knowledge: This benchmark tests whether a model can reason about legal AI tool selection — identifying criteria like confidentiality, auditability, and jurisdiction-awareness. Models that follow instructions precisely and output structured responses (including JSON) score well regardless of their "specialty."
🔄 Codestral (a coding model) scored 88% because structured output is its strength: One test requires generating a valid JSON object with specific schema constraints. Models trained for structured code generation often excel at these format-compliance tasks (see the sketch after this list). Codestral's 88% suggests that instruction-following and format obedience can matter more than domain labeling.
💡 Claude Opus failed while Claude Haiku succeeded: Opus 4.6 scored just 2.8% with 10% completion, while Haiku 4.5 scored 97% with perfect stability. Larger reasoning models sometimes over-think constrained tasks, adding commentary when the prompt says "Return ONLY the final answer." Smaller, instruction-tuned models follow the format constraints more reliably.
🏷️ Within Cohere, Command-A scored 96% while Command-R scored 50%: Same provider, 46-point gap. Command-A is the newer, more capable model. Command-R struggled with the harder tests requiring multiple concepts (e-discovery, M&A due diligence, frontier policy). Model generation within a provider matters enormously.
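
To illustrate the kind of format-compliance test a model like Codestral handles well, here is a hypothetical structured-recommendation check. The schema, field names, and model output below are invented for illustration; the benchmark's actual schema is not published.

```python
# Hypothetical json_schema-style test; schema and output are invented.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "required": ["recommended_tool", "confidence", "reasons"],
    "properties": {
        "recommended_tool": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reasons": {"type": "array", "items": {"type": "string"},
                    "minItems": 2},
    },
    "additionalProperties": False,
}

model_output = """{
  "recommended_tool": "legal-specific AI with audit logging",
  "confidence": 0.9,
  "reasons": ["maintains privilege boundaries",
              "jurisdiction-aware citations"]
}"""

try:
    validate(instance=json.loads(model_output), schema=schema)
    print("valid JSON: full credit")
except (json.JSONDecodeError, ValidationError) as err:
    print(f"invalid: {err}")
```

Note that a model which wraps the object in prose ("Here is the JSON: ...") fails json.loads outright, which is exactly how a verbose flagship can score below a terse instruction-follower.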

The high failure rate (11 of 29 models) combined with the tight top cluster (4 models between 96-100%) makes legal document evaluation a clear capability separator. Models either handle it well or fail entirely — there is little middle ground.

Generic Benchmarks vs. Custom Benchmarks

These results are a directional signal, not a definitive legal AI ranking. This benchmark tests whether models can reason about legal tool selection and produce structured outputs for legal scenarios. It does not test actual legal accuracy, case law retrieval, or compliance with specific jurisdictions.

The practical takeaway: Claude Haiku 4.5 is the standout performer. It scores 97% at 1/8th the cost of the only model that beats it, with perfect stability and reasonable speed. For teams that want the one perfect score on this benchmark, Gemini 3.1 Pro is the only option — but the cost premium is steep.

These results are valid for this task design and scoring setup. Change the legal scenarios, the required concepts, or the scoring criteria, and rankings can change — which is exactly why custom benchmarking matters.

Frequently Asked Questions

Which AI model is best for legal document evaluation in 2026?

On our benchmark, Gemini 3.1 Pro scored a perfect 100% with zero variance. Claude Haiku 4.5 followed at 97% for roughly 1/8th the cost. GPT-5.4 and Command-A both scored 96%. The best choice depends on your budget — Claude Haiku offers the strongest balance for most legal workflows. Run your own legal benchmark to see which model fits your documents and constraints.

Can AI handle complex legal document analysis?

It depends on the model. 18 of 29 models completed all tests, and the top 6 scored above 91%. But 11 models scored 0% or failed entirely. The tests covered privileged contract review, multi-jurisdictional compliance, M&A due diligence, e-discovery, and structured JSON output. Legal AI is demanding, and model selection matters significantly.

What is the cheapest AI for legal document review?

DeepSeek Chat scored 81% at $0.000615 per run — the highest accuracy-per-dollar at 94,837 Acc/$. Codestral scored 88% at $0.000978 and was the fastest model at 12.3 seconds. Both outperform many premium models and cost roughly 1/100th of Gemini Pro's price per run.

Is AI reliable enough for legal work?

Reliability varies by model. Gemini Pro, Claude Haiku, GPT-5.4, and several others showed perfect stability (zero variance). But some models like GPT-5.1 Codex (±10.6) and Command-R (±10.7) had high variance. For legal work, stability matters as much as accuracy. Always verify AI outputs with human review.

How does OpenMark AI benchmark legal document evaluation?

Ten tests covering contract review, e-discovery, compliance, due diligence, and structured output. Scoring uses contains_all, contains_any, exact_match, and json_schema modes with partial credit. Tests require models to include specific legal concepts like confidentiality, auditability, and jurisdiction-awareness. Fully deterministic. Try it yourself for free.

Why Teams Use OpenMark AI

Your task, not a generic benchmark

You define the evaluation in your words, for your use case. Not MMLU, not HumanEval — your actual legal prompts, your actual document scenarios.

Stability scoring built in

Multiple runs per model with variance tracking. A model that scores 90 once and 60 the next is not the same as one that scores 85 every time.
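
As a concrete sketch of what variance tracking means, the snippet below reduces repeated runs to a mean-and-spread pair. Population standard deviation is an illustrative choice; OpenMark's exact stability formula may differ.

```python
from statistics import mean, pstdev

def stability(scores: list[float]) -> tuple[float, float]:
    """Mean accuracy and +/- spread across repeated runs of one model.

    Population standard deviation is an illustrative assumption here;
    the benchmark's exact stability formula may differ.
    """
    return mean(scores), pstdev(scores)

print(stability([97.0, 97.0]))  # ~ (97.0, 0.0): dependable every run
print(stability([90.0, 68.8]))  # ~ (79.4, 10.6): wide swing between runs
```

With only 2 runs per model, as in this benchmark, the spread is a coarse signal; more runs sharpen it.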

Cost efficiency, not just cost

See which model is cheapest for your task — scored against quality, not just raw price-per-token.

No API keys needed

No accounts with OpenAI, Anthropic, or Google required. OpenMark AI handles every API call — just describe your task and run.

Benchmark AI on Your Legal Documents

Test which model handles YOUR contracts, YOUR compliance requirements, YOUR legal workflows.
Build custom benchmarks for any task — text, code, structured output, classification, images, and more.
50 free credits — no API keys, no setup.

Run a Legal AI Benchmark — Free →