Benchmark

UniADILR

UniADILR evaluates reasoning as a process—Abduction → Deduction → Induction—rather than answers alone. The WGS (Wisdom Graph System) changes reasoning performance and efficiency across different backbone models.

Up to ~2× performance gains for frontier models

varying by model and conditions.

GPT-4o mini

46.27

GPT-4.1

64.95

GPT-5

73.33

Gemini 2.5 Pro

76.45

Wisdom (GPT-4o-mini)

80.38

Wisdom (GPT-5)

90.75

Wisdom (Gemini 2.5 Pro)

91.88

806550

46.27

64.95

73.33

76.45

80.38

90.75

91.88

GPT-4o-mini

GPT-4.1

GPT-5

gemini-2.5-pro

WGS

(GPT-4o-mini)

WGS

(GPT-4.1)

WGS

(gemini-2.5-pro)

Efficiency Gains

In our UniADILR runs, a smaller model equipped with WGS outperformed the prompt-engineered performance of a much larger frontier model (e.g., GPT-5). This result demonstrates structural efficiency advantages—achieving higher quality while significantly reducing cost and latency.

Results Summary

Across backbones, Baseline + WGS consistently outperforms Baseline on UniADILR. A lightweight model also shows a large uplift
(~2× improvement), indicating that the layer can materially improve reasoning quality under practical constraints.

Average Gain

+15 points

Across All Runs

Lightweight

~2x Boost

Low Overhead

Consistency

Robust

Across Backbones

Efficiency

Outperforming

Frontier Models

Logical consistency

What UniADILR Measures

UniADILR scores the quality of an end-to-end reasoning chain. It goes beyond "right vs wrong" by evaluating logical consistency, evidence fit, and reproducibility.

"Reasoning is evaluated as a chain, not a single answer."

WGS Layer

Backbone Model

Measurable reasoning gains

Reasoning layer validation

Why We Ran This

Our goal is to verify whether a dedicated layer can add reasoning performance on top of existing models. We measure how consistently it improves scores across multiple backbones.

"Backbone model + WGS layer measurable reasoning gains."

What This Result Shows

These results are not about ranking models. They show that a well-designed layer can reliably improve reasoning quality on top of strong backbones, and that performance—and efficiency—are not determined by the model alone.

"WGS adds measurable reasoning gains across different backbones, and can unlock multi-fold efficiency in practice."

Why It Matters in Production

In real deployments, teams face constraints—cost, latency, and operational limits—that often require smaller or more affordable models. UniADILR results indicate that the WGS layer can raise reasoning quality even in lightweight settings, improving cost–quality trade-offs without relying solely on bigger models.

Another Benchmark: Graph RAG

See Benchmark Results

Interested in our Wisdom Graph System beyond this benchmark?

Get a demo, explore the full system, or discuss integration.