DEEPMIND
MATHEMATICS
DeepMind Mathematics Dataset

The Consistency Math Kernel™ is evaluated on the public DeepMind Mathematics dataset spanning arithmetic, algebra, calculus, probability, comparison, and structured reasoning tasks.

Unlike probabilistic models that must produce an answer for every prompt, the Math Kernel returns a result only when correctness can be formally certified. When certification cannot be established under bounded deterministic constraints, the result is NOT CERTIFIED by design.

Across this evaluation:
  • 560,000 problems evaluated
  • 322,694 answered
  • 322,694 certified correct
  • Incorrect: 0
  • Not certified (abstain): 237,306
  • Coverage: 57.62%
  • Runtime: 148.24 seconds
  • Average time per problem: 0.2647 ms
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
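As an illustrative sketch only (not the kernel's implementation), a certify-or-abstain interface can be expressed as a function that either returns a value it has independently verified or abstains. The `certify_sum` helper below is hypothetical; it uses exact rational arithmetic so that "certified" is never a floating-point approximation.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional

@dataclass(frozen=True)
class Result:
    status: str                      # "CERTIFIED" or "NOT_CERTIFIED"
    value: Optional[Fraction] = None

def certify_sum(a: str, b: str) -> Result:
    """Hypothetical fail-closed addition: answer a + b only when both
    operands parse as exact rationals and the result re-verifies."""
    try:
        x, y = Fraction(a), Fraction(b)
    except (ValueError, ZeroDivisionError):
        return Result("NOT_CERTIFIED")   # cannot certify: abstain, never guess
    total = x + y
    # Independent check before release: the sum must invert exactly.
    if total - y == x:
        return Result("CERTIFIED", total)
    return Result("NOT_CERTIFIED")
```

The key design property is that every exit path is explicit: there is no branch that emits an unverified value.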
Updated 3/1/2026
HENDRYCKS
MATHEMATICS
Hendrycks MATH (Geometry Subset)

The Consistency Math Kernel™ is evaluated on the geometry subset of the Hendrycks MATH benchmark to measure deterministic, fail-closed solving under formal verification constraints.

The kernel returns an answer only when correctness can be certified under bounded deterministic rules; otherwise the result is NOT CERTIFIED by design.

Across this evaluation:
  • 479 problems evaluated
  • 257 answered
  • 257 certified correct
  • Incorrect: 0
  • Not certified (abstain): 222
  • Coverage: 53.65%
  • Runtime: 0.684 seconds
  • Average time per problem: 1.428 ms
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 3/1/2026
AsyMOB
MATHEMATICS
AsyMOB (Advanced Symbolic Mathematics Obfuscation Benchmark)

The Consistency Math Kernel™ is evaluated on AsyMOB — a structurally adversarial symbolic benchmark designed to stress-test algebraic manipulation, trigonometric identities, limits, series expansion, and expression equivalence under obfuscation.

These problems are intentionally constructed to challenge pattern recognition systems by altering structure without changing semantic meaning. The kernel releases an answer only when correctness can be certified under deterministic constraints.
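To make the equivalence-under-obfuscation task concrete, here is a minimal numeric screen (an assumption-laden sketch, not the kernel's method): random sampling can *refute* equivalence with certainty, but agreement on samples is only evidence, so a fail-closed system still withholds certification without a symbolic proof.

```python
import math
import random

def equivalence_screen(f, g, trials=64, domain=(-3.0, 3.0), tol=1e-9):
    """Hypothetical numeric screen for expression equivalence.

    Returns "CERTIFIED_DIFFERENT" when a sample point separates f and g
    (a concrete counterexample), and "NOT_CERTIFIED" otherwise, because
    sample agreement alone never proves two expressions equal.
    """
    rng = random.Random(0)  # fixed seed keeps the screen deterministic
    for _ in range(trials):
        x = rng.uniform(*domain)
        try:
            fx, gx = f(x), g(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # point outside a domain tells us nothing
        if not math.isclose(fx, gx, rel_tol=tol, abs_tol=tol):
            return "CERTIFIED_DIFFERENT"
    return "NOT_CERTIFIED"
```

For example, sin²x + cos²x versus 1 survives every sample yet still returns NOT_CERTIFIED, while x² versus x³ is refuted by the first sample away from 0 and 1.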

Across this evaluation:
  • 17,000+ problems evaluated
  • Coverage expansion and validation in progress
  • 0 incorrect results among answered outputs
  • Remaining cases NOT CERTIFIED by design
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/16/2026
DEEPMIND
LOGIC
DeepMind Logical Entailment

The Consistency Logic Kernel™ certifies whether a proposed conclusion logically follows from a set of premises under propositional reasoning.

The system does not score likelihood or assign confidence. It returns a decision only when correctness can be established under strict deterministic rules. When certification cannot be established, the result is NOT CERTIFIED.
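Propositional entailment itself is decidable by exhaustive truth-table search. The checker below is a small illustrative sketch (exact but exponential in the number of variables, so not how a production kernel would scale): premises entail a conclusion exactly when no assignment satisfies all premises while falsifying the conclusion.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Decide premises |= conclusion by truth-table enumeration.

    Each formula is a function from an assignment dict ({var: bool})
    to bool. Returns False as soon as a counterexample assignment is
    found, True if none exists.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: entailment fails
    return True
```

Modus ponens passes: {p, p → q} entails q; the converse {q} does not entail p, and the checker reports the failure rather than a probability.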

Across this evaluation:
  • 1,696 problems evaluated; 1,696 / 1,696 certified correct
  • 0 incorrect results
  • Deterministic decisioning suitable for audit-oriented reasoning pipelines
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/20/2026
SATLIB / DIMACS
LOGIC
Deterministic SAT vs UNSAT Classification on DIMACS CNF

The Logic Kernel is evaluated on SATLIB / DIMACS CNF instances to validate binary SAT/UNSAT correctness under formal propositional constraints.

For each instance, the system determines whether a satisfying assignment exists (SAT) or no assignment exists (UNSAT). Correctness is enforced under a strict invariant: the count of wrong classifications must equal zero.
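The SAT/UNSAT decision can be sketched with a minimal DPLL procedure over CNF in DIMACS literal form (3 means x₃, -3 means ¬x₃). This is an illustrative toy, not the evaluated kernel: real solvers add watched literals, clause learning, and proof logging.

```python
def dpll(clauses):
    """Decide "SAT" or "UNSAT" for CNF given as lists of signed ints
    (DIMACS literal convention). Illustrative sketch only."""
    clauses = [list(c) for c in clauses]
    if not clauses:
        return "SAT"                    # no constraints left
    if any(len(c) == 0 for c in clauses):
        return "UNSAT"                  # an empty clause is unsatisfiable
    for c in clauses:
        if len(c) == 1:                 # unit propagation: forced literal
            return dpll(_assign(clauses, c[0]))
    lit = clauses[0][0]                 # branch on the first literal
    if dpll(_assign(clauses, lit)) == "SAT":
        return "SAT"
    return dpll(_assign(clauses, -lit))

def _assign(clauses, lit):
    """Simplify under lit=True: drop satisfied clauses and remove the
    now-false complementary literal from the rest."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]
```

For example, {x₁ ∨ ¬x₂, x₂} is SAT, while {x₁} together with {¬x₁} is UNSAT; both answers follow from exhaustive case analysis, not sampling.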

Across this evaluation:
  • 2,000 instances evaluated (1,000 SAT + 1,000 UNSAT)
  • 0 incorrect results (SAT: 1000/1000; UNSAT: 1000/1000)
  • Deterministic, reproducible SAT/UNSAT certification
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/20/2026
Next step
Evaluate deterministic certification on your own inputs

Benchmarks show performance under controlled conditions. Developer Access lets you test real prompts, outputs, and fail-closed behavior in your own workflow. For production deployment options, review Licensing.