DEEPMIND
MATHEMATICS
DeepMind Mathematics Dataset

The Consistency Math Kernel™ is evaluated on the public DeepMind Mathematics dataset spanning arithmetic, algebra, calculus, probability, comparison, and structured reasoning tasks.

Unlike probabilistic models that must produce an answer for every prompt, the Math Kernel returns a result only when correctness can be formally certified. When certification cannot be established under bounded deterministic constraints, the result is NOT CERTIFIED by design.

Across this evaluation:
  • 560,000 problems evaluated
  • 322,694 answered
  • 322,694 certified correct
  • Incorrect: 0
  • Not certified (abstain): 237,306
  • Coverage: 57.62%
  • Runtime: 148.24 seconds
  • Average time per problem: 0.2647 ms
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
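As an illustrative sketch only (not the kernel's implementation), a certify-or-abstain interface can be expressed as a function that either returns a value it has independently verified or abstains. The `certify_sum` helper below is hypothetical; it uses exact rational arithmetic so that "certified" is never a floating-point approximation.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional

@dataclass(frozen=True)
class Result:
    status: str                      # "CERTIFIED" or "NOT_CERTIFIED"
    value: Optional[Fraction] = None

def certify_sum(a: str, b: str) -> Result:
    """Hypothetical fail-closed addition: answer a + b only when both
    operands parse as exact rationals and the result re-verifies."""
    try:
        x, y = Fraction(a), Fraction(b)
    except (ValueError, ZeroDivisionError):
        return Result("NOT_CERTIFIED")   # cannot certify: abstain, never guess
    total = x + y
    # Independent check before release: the sum must invert exactly.
    if total - y == x:
        return Result("CERTIFIED", total)
    return Result("NOT_CERTIFIED")
```

The key design property is that every exit path is explicit: there is no branch that emits an unverified value.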
Updated 3/1/2026
HENDRYCKS
MATHEMATICS
Hendrycks MATH (Geometry Subset)

The Consistency Math Kernel™ is evaluated on the geometry subset of the Hendrycks MATH benchmark to measure deterministic, fail-closed solving under formal verification constraints.

The kernel returns an answer only when correctness can be certified under bounded deterministic rules; otherwise the result is NOT CERTIFIED by design.

Across this evaluation:
  • 479 problems evaluated
  • 257 answered
  • 257 certified correct
  • Incorrect: 0
  • Not certified (abstain): 222
  • Coverage: 53.65%
  • Runtime: 0.684 seconds
  • Average time per problem: 1.428 ms
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 3/1/2026
AsyMOB
MATHEMATICS
AsyMOB (Advanced Symbolic Mathematics Obfuscation Benchmark)

The Consistency Math Kernel™ is evaluated on AsyMOB — a structurally adversarial symbolic benchmark designed to stress-test algebraic manipulation, trigonometric identities, limits, series expansion, and expression equivalence under obfuscation.

These problems are intentionally constructed to challenge pattern recognition systems by altering structure without changing semantic meaning. The kernel releases an answer only when correctness can be certified under deterministic constraints.
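To make the equivalence-under-obfuscation task concrete, here is a minimal numeric screen (an assumption-laden sketch, not the kernel's method): random sampling can *refute* equivalence with certainty, but agreement on samples is only evidence, so a fail-closed system still withholds certification without a symbolic proof.

```python
import math
import random

def equivalence_screen(f, g, trials=64, domain=(-3.0, 3.0), tol=1e-9):
    """Hypothetical numeric screen for expression equivalence.

    Returns "CERTIFIED_DIFFERENT" when a sample point separates f and g
    (a concrete counterexample), and "NOT_CERTIFIED" otherwise, because
    sample agreement alone never proves two expressions equal.
    """
    rng = random.Random(0)  # fixed seed keeps the screen deterministic
    for _ in range(trials):
        x = rng.uniform(*domain)
        try:
            fx, gx = f(x), g(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # point outside a domain tells us nothing
        if not math.isclose(fx, gx, rel_tol=tol, abs_tol=tol):
            return "CERTIFIED_DIFFERENT"
    return "NOT_CERTIFIED"
```

For example, sin²x + cos²x versus 1 survives every sample yet still returns NOT_CERTIFIED, while x² versus x³ is refuted by the first sample away from 0 and 1.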

Across this evaluation:
  • 17,000+ problems evaluated
  • Coverage expansion and validation in progress
  • 0 incorrect results among answered outputs
  • Remaining cases NOT CERTIFIED by design
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/16/2026
DEEPMIND
LOGIC
DeepMind Logical Entailment

The Consistency Logic Kernel™ certifies whether a proposed conclusion logically follows from a set of premises under propositional reasoning.

The system does not score likelihood or assign confidence. It returns a decision only when correctness can be established under strict deterministic rules. When certification cannot be established, the result is NOT CERTIFIED.
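Propositional entailment itself is decidable by exhaustive truth-table search. The checker below is a small illustrative sketch (exact but exponential in the number of variables, so not how a production kernel would scale): premises entail a conclusion exactly when no assignment satisfies all premises while falsifying the conclusion.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Decide premises |= conclusion by truth-table enumeration.

    Each formula is a function from an assignment dict ({var: bool})
    to bool. Returns False as soon as a counterexample assignment is
    found, True if none exists.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: entailment fails
    return True
```

Modus ponens passes: {p, p → q} entails q; the converse {q} does not entail p, and the checker reports the failure rather than a probability.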

Across this evaluation:
  • 1,696 problems evaluated; 1,696 / 1,696 certified correct
  • 0 incorrect results
  • Deterministic decisioning suitable for audit-oriented reasoning pipelines
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/20/2026
SATLIB / DIMACS
LOGIC
Deterministic SAT vs UNSAT Classification on DIMACS CNF

The Logic Kernel is evaluated on SATLIB / DIMACS CNF instances to validate binary SAT/UNSAT correctness under formal propositional constraints.

For each instance, the system determines whether a satisfying assignment exists (SAT) or no assignment exists (UNSAT). Correctness is enforced under a strict invariant: the count of wrong classifications must equal zero.
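The SAT/UNSAT decision can be sketched with a minimal DPLL procedure over CNF in DIMACS literal form (3 means x₃, -3 means ¬x₃). This is an illustrative toy, not the evaluated kernel: real solvers add watched literals, clause learning, and proof logging.

```python
def dpll(clauses):
    """Decide "SAT" or "UNSAT" for CNF given as lists of signed ints
    (DIMACS literal convention). Illustrative sketch only."""
    clauses = [list(c) for c in clauses]
    if not clauses:
        return "SAT"                    # no constraints left
    if any(len(c) == 0 for c in clauses):
        return "UNSAT"                  # an empty clause is unsatisfiable
    for c in clauses:
        if len(c) == 1:                 # unit propagation: forced literal
            return dpll(_assign(clauses, c[0]))
    lit = clauses[0][0]                 # branch on the first literal
    if dpll(_assign(clauses, lit)) == "SAT":
        return "SAT"
    return dpll(_assign(clauses, -lit))

def _assign(clauses, lit):
    """Simplify under lit=True: drop satisfied clauses and remove the
    now-false complementary literal from the rest."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]
```

For example, {x₁ ∨ ¬x₂, x₂} is SAT, while {x₁} together with {¬x₁} is UNSAT; both answers follow from exhaustive case analysis, not sampling.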

Across this evaluation:
  • 2,000 instances evaluated (1,000 SAT + 1,000 UNSAT)
  • 0 incorrect results (SAT: 1000/1000; UNSAT: 1000/1000)
  • Deterministic, reproducible SAT/UNSAT certification
Each CERTIFIED result is reproducible; NOT CERTIFIED outcomes are explicit. Reproducibility artifacts are available under NDA.
Updated 2/20/2026
Next step
Evaluate deterministic certification on your own inputs

Benchmarks show performance under controlled conditions. Developer Access lets you test real prompts, outputs, and fail-closed behavior in your own workflow. For production deployment options, review Licensing.