
How Reliable Is AI Data Analysis? 90% Demo vs. 6% Reality

Text-to-SQL hits 86%+ accuracy in benchmarks — on real enterprise databases GPT-4 drops to 6%. Why AI data analysis fails on multi-filter metrics.

By Thomas Ingenhorst, Co-Founder, oneLake GmbH

How reliable is AI data analysis, really? AI data analysis is only as reliable as the database it works on. In clean benchmark tests (Spider 1.0), language models reach over 86% accuracy at turning a question into a database query. On real enterprise databases with hundreds of columns and actual business rules (Spider 2.0), that same accuracy collapses to 6 to 20% — GPT-4 falls from 86.6% to 6.0%. The cause isn't invented information but mistranslated queries: a forgotten filter, a wrong period, a wrong join. The result looks correct and is wrong anyway. AI data analysis becomes reliable only when a deterministic, rule-based layer handles the calculation and makes every step traceable.

Reading time: about 11 minutes.

How reliable is AI data analysis, really?

In short:

  • On academic benchmarks with small, clean databases (Spider 1.0), text-to-SQL reaches over 86% accuracy. On realistic enterprise schemas (Spider 2.0, averaging around 812 columns per database), it falls to 6 to 20%.
  • In a direct comparison, GPT-4 drops from 86.6% to 6.0% — same model, just real data instead of tidy data.
  • This is not "hallucination" in the sense of invented facts. These are plausible-looking but wrong numbers pulled from real databases — harder to catch precisely because they seem to come from the right sources.
  • For metrics that combine several filters (period, region, product group, booking logic), even the latest language models don't reliably follow every business rule. More checking loops barely change that.
  • AI data analysis becomes reliable only when the language model understands the question but the actual calculation runs deterministically and rule-based, without AI — and every step is traceable.

90% accuracy sounds like a good grade. In a quarterly report it means: every tenth number is wrong. And you don't know which one. This is exactly where the marketing slide parts ways with reality — and that's what this article examines: where the good benchmark numbers come from, why they collapse on real data, and what an answer needs before a CFO can actually trust it.
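A quick back-of-the-envelope calculation makes the point concrete. The sketch below assumes, purely for illustration, that every number in a report is correct with the same probability and that errors are independent of each other; real errors are rarely that tidy. The function name prob_all_correct is made up for this example.

```python
def prob_all_correct(per_number_accuracy: float, numbers_in_report: int) -> float:
    """Chance that every number in a report is right.

    Illustrative assumption: each number is correct with the same
    probability, and errors are independent of each other.
    """
    return per_number_accuracy ** numbers_in_report


if __name__ == "__main__":
    for n in (1, 10, 20, 50):
        share = prob_all_correct(0.90, n)
        print(f"{n:>2} numbers at 90% each: {share:.1%} chance that all are right")
```

Under these assumptions, a report with ten numbers is fully correct only about 35% of the time, and the odds shrink quickly as the report grows.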

Why 90% accuracy in the demo says nothing about your data

When an AI analytics tool advertises "over 90% accuracy," that number almost always comes from an academic benchmark. The best-known is Spider 1.0: a test dataset with tidy databases, cleanly named columns, and clearly phrased questions. On this dataset, top models like GPT-4 reach an execution accuracy of 86.6%, with some frameworks exceeding 90% (Spider 2.0, Lei et al., 2024).

The problem: your production database doesn't look like Spider 1.0. It doesn't have 5 clean tables but hundreds of columns with cryptic names, logic that grew over years, edge cases, and business rules documented nowhere. That's exactly what the successor benchmark Spider 2.0 was built for — databases averaging around 812 columns, exceeding 3,000 in extreme cases (Towards Data Science, 2024). In other words, the complexity that is normal in a real ERP or data warehouse system.

On Spider 2.0, the accuracy of the same models drops dramatically:

[Figure: Text-to-SQL, demo benchmark vs. real enterprise data. Execution accuracy in percent, Spider 1.0 (clean) vs. Spider 2.0 (enterprise).]

Text-to-SQL accuracy (execution accuracy): GPT-4 reaches 86.6% on the clean Spider 1.0 benchmark but only 6.0% on real enterprise data (Spider 2.0). The best specialized agent framework (built on o1-preview) solves just 21.3% of tasks on Spider 2.0. Spider 2.0 databases average around 812 columns. Sources: Spider 2.0 (Lei et al., arXiv:2411.07763); Towards Data Science, 2024.

GPT-4 falls from 86.6% to 6.0%. The best specialized agent framework (built on o1-preview) solves only 21.3% of tasks on the full Spider 2.0 benchmark. This isn't the weakness of a single vendor; it's the underlying dynamic of every purely language-model-based approach once the database is exposed to real complexity.

Put plainly: for a business decision, the solution has to be binary. It works, or it doesn't. "90% of the numbers are correct" is not a property a controller can work with — because the missing 10% aren't flagged.

Is this the same as a hallucination?

No, and the difference matters. A hallucination in the classic sense is a fully invented piece of information: you ask ChatGPT for the managing director of a supplier, and the model makes up a name. We covered that elsewhere in detail — on DACH company data, ChatGPT is wrong 96% of the time (96% Wrong: Why ChatGPT Invents Your Business Data).

The accuracy problem in data analysis is subtler — and precisely for that reason more dangerous. Here the AI doesn't invent anything. It accesses your real database, writes a query, runs it, and returns a number. The number is genuinely computed, looks correct, and demonstrably comes from your data. It's simply wrong — because the AI translated the question into the wrong query:

  • It forgot a filter (e.g. didn't exclude cancelled invoices).
  • It picked the wrong period (booking date instead of service date).
  • It aggregated over the wrong column (gross instead of net).
  • It set a join wrong and counted records twice.
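The four failure modes above can be reproduced on a toy database. The following sketch uses Python's built-in sqlite3 module; the schema, column names, and amounts are invented for illustration and do not come from any real system. The "correct" query stands for one possible definition (Q2 net revenue by service date, excluding cancelled invoices); each of the other queries differs from it in exactly one detail.

```python
import sqlite3

# Hypothetical mini-schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY, net REAL, gross REAL, status TEXT,
    service_date TEXT, booking_date TEXT
);
CREATE TABLE payments (id INTEGER PRIMARY KEY, invoice_id INTEGER, amount REAL);
INSERT INTO invoices VALUES
    (1, 100.0, 119.0, 'booked',    '2024-05-10', '2024-05-12'),
    (2, 200.0, 238.0, 'booked',    '2024-06-28', '2024-07-02'),
    (3,  50.0,  59.5, 'cancelled', '2024-04-15', '2024-04-15');
INSERT INTO payments VALUES (1, 1, 60.0), (2, 1, 59.0), (3, 2, 238.0);
""")

Q2 = "BETWEEN '2024-04-01' AND '2024-06-30'"

QUERIES = {
    # Intended definition: net, not cancelled, Q2 by service date.
    "correct": f"SELECT SUM(net) FROM invoices "
               f"WHERE status <> 'cancelled' AND service_date {Q2}",
    # Cancelled invoices are not excluded.
    "forgotten filter": f"SELECT SUM(net) FROM invoices WHERE service_date {Q2}",
    # Booking date instead of service date.
    "wrong period": f"SELECT SUM(net) FROM invoices "
                    f"WHERE status <> 'cancelled' AND booking_date {Q2}",
    # Gross instead of net.
    "wrong column": f"SELECT SUM(gross) FROM invoices "
                    f"WHERE status <> 'cancelled' AND service_date {Q2}",
    # Invoice 1 has two payments, so the join counts it twice.
    "double-counting join": f"SELECT SUM(i.net) FROM invoices i "
                            f"JOIN payments p ON p.invoice_id = i.id "
                            f"WHERE i.status <> 'cancelled' AND i.service_date {Q2}",
}

results = {label: conn.execute(sql).fetchone()[0] for label, sql in QUERIES.items()}
for label, value in results.items():
    print(f"{label:<22} {value:>8.2f}")
```

All five queries run without error and return an amount in a believable range. Nothing in the output itself signals which one matches the intended definition.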

An invented fact can be exposed with a bit of skepticism. A plausible-looking but miscalculated metric cannot — it has the right format, a believable order of magnitude, and comes from the right system. It only surfaces when someone recalculates it by hand. In daily operations, nobody does.

Why do language models fail specifically on metrics?

I learned this firsthand while developing oneAgent. After 15 years as a BI consultant in the Microsoft ecosystem, I tested the text-to-SQL and LLM-only approach in real customer scenarios and repeatedly ran into its limits. When I tried to analyze and extract company data exclusively through large language models, a pattern emerged quickly: as soon as a metric combines several filters and conditions, the task becomes too complex for a language model. Even the newest models of the day don't reliably follow every defined business rule, and they don't reproduce the same correct result from one run to the next.

A concrete example: "Show me the contribution margin of the Electrical product group in the South region in Q2, excluding intercompany revenue and cancelled orders." That's five nested conditions. Each one has to be translated correctly into database logic — and the definition of "contribution margin" has to match exactly what your controlling team uses. A language model guesses plausibly here. Sometimes right, sometimes not. It isn't reproducible.

The decisive insight, confirmed in practice: no matter how many checking mechanisms you build around a language model, a purely LLM-based approach can never deliver a guaranteed-correct result. Each checking loop reduces the error rate, but none eliminates it. For general text, that's perfectly fine. For a number that goes into the board report, it isn't.
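Why checking loops help but never close the gap can be shown with a toy model. It assumes, hypothetically, that each loop independently catches a fixed share of the errors still present; the starting error rate and the catch rate below are invented numbers, not measurements.

```python
def residual_error(initial_error: float, catch_rate: float, loops: int) -> float:
    """Error rate left after a number of checking loops.

    Toy assumption: every loop catches the same fixed share (catch_rate)
    of whatever errors are still present, independently of earlier loops.
    """
    return initial_error * (1.0 - catch_rate) ** loops


if __name__ == "__main__":
    # Invented example values: 10% initial errors, each loop catches half.
    for loops in (0, 1, 3, 10):
        print(f"{loops:>2} loops: {residual_error(0.10, 0.50, loops):.4%} still wrong")
```

As long as a single loop catches less than 100% of the remaining errors, the residual shrinks with every pass but stays above zero.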

This explicitly does not mean language models are useless. On the contrary: they are excellent at understanding a question phrased in natural language, interpreting ambiguous wording, and structuring an analysis path. Language understanding is exactly the strength you don't want to give up. The weakness lies elsewhere — in the precise, rule-compliant calculation of the actual number.

What reliable AI data analysis has to do differently

This realization is exactly what led to a two-part approach in oneAgent. The language model handles what it's good at — understanding the question and orchestrating the analysis path. The actual resolution of the metric is handled by a deterministic layer that deliberately uses no language model and no AI for the calculation, working strictly rule-based instead.

The difference in practice:

Aspect | Purely LLM-based | LLM + deterministic layer
Understand the question | Language model | Language model
Calculate the metric | Language model (guesses plausibly) | Rule-based engine, no LLM
Metric definition | implicit, model-dependent | defined and validated by the business
Same question, same result? | not guaranteed | yes, reproducible
Traceability | none | every step visible

The core: metrics are not left to the language model but defined and validated by the business. What "contribution margin," "net revenue," or "active customer" precisely means is set by your controlling team — not by a statistical model that decides differently after the next update. The deterministic layer sticks strictly to those definitions. The same question returns the same result on the same data, every time. That's the precondition for a number to be checkable and auditable at all. More on the principle on our page AI without hallucinations.
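As a rough illustration of this split, here is a minimal sketch of a rule-based resolver for the contribution margin question from the earlier example. It is not oneAgent's implementation. All names (MetricDefinition, MetricRequest, REGISTRY, resolve), the column names, and the sample rows are hypothetical, and "net revenue minus variable costs" is only a placeholder for whatever definition a controlling team has actually validated. The language model's job would end at filling in the MetricRequest; everything after that is plain rules.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]  # one booking line; column names are hypothetical


@dataclass(frozen=True)
class MetricDefinition:
    """A metric as the business defined and validated it."""
    name: str
    value: Callable[[Row], float]
    exclusions: Tuple[Tuple[str, Callable[[Row], bool]], ...]


@dataclass(frozen=True)
class MetricRequest:
    """Structured question; in a two-part setup the language model fills this in."""
    metric: str
    product_group: str
    region: str
    quarter: str


REGISTRY: Dict[str, MetricDefinition] = {
    "contribution_margin": MetricDefinition(
        name="contribution_margin",
        # Placeholder definition for illustration only.
        value=lambda r: r["net_revenue"] - r["variable_costs"],
        exclusions=(
            ("intercompany revenue", lambda r: r["intercompany"]),
            ("cancelled orders", lambda r: r["status"] == "cancelled"),
        ),
    )
}


def resolve(request: MetricRequest, rows: List[Row]) -> float:
    """Compute the metric with fixed rules only; no model call anywhere."""
    definition = REGISTRY[request.metric]  # unknown metrics fail loudly (KeyError)
    wanted = (request.product_group, request.region, request.quarter)
    total = 0.0
    for row in rows:
        if (row["product_group"], row["region"], row["quarter"]) != wanted:
            continue
        if any(is_excluded(row) for _, is_excluded in definition.exclusions):
            continue
        total += definition.value(row)
    return total


def make_row(group, region, quarter, net, costs, intercompany=False, status="booked") -> Row:
    return {"product_group": group, "region": region, "quarter": quarter,
            "net_revenue": net, "variable_costs": costs,
            "intercompany": intercompany, "status": status}


# Invented sample data.
ROWS = [
    make_row("Electrical", "South", "Q2", 1000.0, 600.0),
    make_row("Electrical", "South", "Q2", 500.0, 300.0, intercompany=True),
    make_row("Electrical", "South", "Q2", 200.0, 100.0, status="cancelled"),
    make_row("Electrical", "North", "Q2", 800.0, 500.0),
    make_row("Electrical", "South", "Q1", 300.0, 200.0),
    make_row("Electrical", "South", "Q2", 250.0, 150.0),
]

request = MetricRequest("contribution_margin", "Electrical", "South", "Q2")
print(resolve(request, ROWS))
```

Because resolve contains no sampling and no model call, repeating the same request on the same rows returns the same number. A metric nobody has defined raises an error instead of producing a guess.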

The honest caveat stays: "guaranteed correct" refers to the validated business rules in the deterministic layer — not to magical omniscience. If your source data is faulty or the metric is defined wrong, a deterministic system also returns a wrong result. It just returns it consistently and traceably wrong, so the error sits in the definition rather than in the model's guessing — and is therefore findable.

"Looks perfect" is not a quality signal

There's an effect that makes the accuracy problem worse. AI chats answer in a style designed to build trust: they first praise the question ("Great question!"), present the answer in a clean format, with bullet points, bold text, and a crisp conclusion. The result looks so professional that the user assumes it must be correct.

When you then check the numbers closely, they're often completely wrong — and there's no record of how the model arrived at them. Which tables were queried? Which filters applied? Which metric definition assumed? With a plain chat result, that stays a black box. The nice formatting isn't quality then; it's camouflage.

That's why serious AI data analysis needs the second half of the answer: traceability. At oneAgent we call it Output Transparency. For every answer, the user sees exactly which steps were executed, which business rules were considered, and which filters were applied. Instead of a pretty black box, you get the full derivation — and can judge for yourself whether the number rests on the right basis. Whoever sees the calculation path doesn't have to trust blindly. That's the difference between a number you can present and one you merely pass on and hope.
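What such a derivation could look like in its simplest form is sketched below. This is an assumption-laden illustration, not the actual Output Transparency feature: the function compute_with_trace, the step labels, and the sample rows are all invented. The idea is only that every filter is a named step and that the answer carries a record of what each step did.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Step = Tuple[str, Callable[[Row], bool]]  # (human-readable label, keep-this-row rule)


def compute_with_trace(
    rows: List[Row], steps: List[Step], value: Callable[[Row], float]
) -> Tuple[float, List[Dict[str, Any]]]:
    """Apply named filter steps in order and record what each one did."""
    trace: List[Dict[str, Any]] = []
    remaining = list(rows)
    for label, keep in steps:
        rows_in = len(remaining)
        remaining = [row for row in remaining if keep(row)]
        trace.append({"step": label, "rows_in": rows_in, "rows_out": len(remaining)})
    return sum(value(row) for row in remaining), trace


# Invented sample data and rules.
SAMPLE_ROWS = [
    {"region": "South", "status": "booked", "net": 100.0},
    {"region": "South", "status": "cancelled", "net": 50.0},
    {"region": "North", "status": "booked", "net": 70.0},
    {"region": "South", "status": "booked", "net": 30.0},
]
STEPS: List[Step] = [
    ("region = South", lambda r: r["region"] == "South"),
    ("exclude cancelled orders", lambda r: r["status"] != "cancelled"),
]

result, trace = compute_with_trace(SAMPLE_ROWS, STEPS, value=lambda r: r["net"])
print("result:", result)
for entry in trace:
    print(f'{entry["step"]:<26} {entry["rows_in"]} rows in, {entry["rows_out"]} rows out')
```

A reader of the trace can see at a glance whether the cancelled orders were actually removed, which is exactly the kind of check a bare number does not allow.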

This standard is also why some data deliberately shouldn't run through generic chatbots: data protection and traceability are connected here. The underlying risks of pouring company data into public AI tools are broken down in Company Data in ChatGPT. A dedicated article covers when entering customer data into ChatGPT is actually allowed and when it is a GDPR violation.

What this means for the DACH mid-market

The topic lands in a market that barely exploits its data as it is. According to Bitkom, in 2024 only 6% of German companies fully exploit the potential of their available data; 42% use it "rather little," 18% "not at all" (Bitkom, 2024). The reflex to solve this with an AI tool is understandable. It becomes dangerous when the tool delivers plausible but wrong numbers — because then the problem just shifts: from "we can't get to the data" to "we make decisions on numbers we don't check."

For CFOs and Heads of Data, this means concretely: the decisive question to put to an AI analytics tool is not "What's your benchmark accuracy?" but:

  1. On which dataset was this accuracy measured — a clean benchmark, or a schema the size of our production systems?
  2. Who defines the metrics — the model, or our controlling team?
  3. Is the number calculated or estimated — deterministically and reproducibly, or probabilistically?
  4. Can I see the calculation path — tables, filters, rules — or do I get a black box?

These four questions separate a tool you can use in reporting from one that looks good and becomes a liability in the quarterly report. For a broader market overview of the tools, see our comparison: 8 AI analytics tools tested for the mid-market 2026.

Conclusion: accuracy is a property of the architecture, not the model

The Spider 2.0 numbers are not an indictment of language models. They're an indictment of the idea that a language model alone can reliably compute business metrics. 86% in the demo and 6% on real data with the identical model show: accuracy doesn't sit in the model but in the architecture around it.

AI data analysis becomes reliable when three things come together: a language model that understands the question; a deterministic, rule-based layer that calculates the metric exactly per validated business rules; and full transparency over every step. Only then is a number not just nicely formatted but actually dependable.

Frequently Asked Questions about AI data analysis accuracy

How accurate is AI data analysis really?

It depends entirely on the dataset. On academic benchmarks with small, clean databases (Spider 1.0), language models reach over 86% accuracy at translating a question into a database query. On realistic enterprise schemas (Spider 2.0, averaging around 812 columns), that same accuracy falls to 6 to 20%. So the benchmark number says little about performance on real company data.

What does "GPT-4 falls from 86% to 6%" mean?

It describes the same model on two different test datasets. GPT-4 reaches 86.6% execution accuracy on the tidy Spider 1.0 benchmark, but only 6.0% on the more realistic Spider 2.0 benchmark with large, complex databases. This shows: the model isn't the problem — the complexity of real data structures is.

Is wrong AI data analysis the same as a hallucination?

No. A hallucination is a freely invented piece of information. With the accuracy problem in data analysis, the AI accesses real data and computes a real number — it merely translates the question into the wrong database query, for example through a forgotten filter or a wrong join. The result looks correct, comes from the right sources, and is still wrong. That makes it harder to spot than a classic hallucination.

Why do language models fail on metrics with multiple filters?

The more conditions a metric combines — period, region, product group, booking logic, exclusions — the more nested logic must be translated correctly into a database query. Language models work probabilistically and don't reliably follow every defined business rule. Additional checking loops lower the error rate but don't eliminate it. A purely LLM-based approach therefore cannot deliver a guaranteed-correct result.

How does a deterministic layer solve the accuracy problem?

The language model understands the question and orchestrates the analysis path, but the actual calculation of the metric is handled by a rule-based engine without AI. The metrics are defined and validated by the business in advance; the deterministic layer sticks strictly to them. The same question returns the same result on the same data, every time. "Guaranteed correct" here refers to the validated business rules — not to error-free source data.

What is Output Transparency?

Output Transparency makes the calculation path of every answer visible: which steps were executed, which business rules were considered, and which filters were applied. Instead of a nicely formatted number with no derivation, the user sees exactly how the result came about — and can judge whether it rests on the right basis. It replaces blind trust with traceability.

What should I check an AI analytics tool for?

Four things: first, the dataset the accuracy was measured on (clean benchmark or production-like schema); second, who defines the metrics (the model or your controlling team); third, whether the number is calculated deterministically or estimated probabilistically; and fourth, whether you can inspect the calculation path. These questions separate a presentable tool from a reporting-grade one.

Want to see how this looks on your numbers?

The most honest test isn't a benchmark — it's your own database. In a short demo, we'll show you on a typical controlling use case how oneAgent resolves a multi-filter metric, and how Output Transparency reveals every step.

