Why AI makes the data engineer more necessary, not less
Everyone wants an LLM on top of their data. The problem: if the data is a mess, the AI answers confidently and gets it wrong anyway. That's the part nobody wants to hear.
There’s a demo everyone has seen. The CEO caught it at a conference, or someone on the team reproduced it in a Jupyter notebook: you ask a question in plain English, the LLM queries your warehouse, and it responds with a clean table, a specific number, sometimes even a chart.
Impressive.
Now try doing that with your actual company data. Not the demo dataset — your tables, your columns, the queries someone wrote in 2019 that nobody touches because nobody fully understands them anymore.
That’s where things get complicated.
The LLM doesn’t understand your business. It understands language.
That distinction sounds obvious, but the consequences are significant.
When you point an LLM at your warehouse, the model reads table names, column names, whatever comments (probably none) exist in the schema, and the sample data it can access. From that, it builds its interpretation of what everything means.
If your table is called sales_v3_final_USE_THIS, the LLM will try to infer what’s in there. If your amount column can be gross or net depending on context and nobody documented it, the model will pick an interpretation — confidently — and be wrong half the time.
The problem isn’t the AI. The problem is that you’re feeding it data that even you don’t fully understand.
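To make the ambiguity concrete, here is a minimal sketch, with hypothetical rows and column names, of how the same undocumented amount column yields two different answers depending on which interpretation the model happens to pick:

```python
# Hypothetical order rows. Nobody documented whether "amount" is gross or net.
orders = [
    {"order_id": 1, "amount": 121.0, "tax": 21.0},
    {"order_id": 2, "amount": 242.0, "tax": 42.0},
]

# Interpretation A: "amount" already includes tax (gross).
total_if_gross = sum(o["amount"] for o in orders)

# Interpretation B: "amount" is net, so tax must be added on top.
total_if_net = sum(o["amount"] + o["tax"] for o in orders)

# Same question ("how much did we sell?"), two confident, different answers.
print(total_if_gross, total_if_net)  # 363.0 426.0
```

Both numbers are internally consistent; only documentation can tell the model which one the business means.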
What actually happens in practice
A company wants to “chat with their data.” They hire someone to stand up the model, connect the LLM to their warehouse, and within the first week the answers start getting weird.
They ask how much they sold last month. The LLM returns a number that doesn’t match the usual report. Which one is right? Nobody knows — because there are three columns that could be “sales” and none of them have documentation.
They ask for their top 10 customers. The model returns a list that includes test accounts and churned customers. Nobody had filtered those out in the data model — it was tacit knowledge held by whoever used to build the reports manually.
They ask for profitability by product. The LLM produces something, but the business logic for calculating the real margin was buried in a 200-line query written three years ago that nobody fully understands anymore.
In every case, the AI worked perfectly. The problem was what you fed it.
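The “top customers” failure is typical: the filters lived in someone's head. A minimal sketch of codifying that tacit knowledge instead — the field names, flags, and sample rows here are assumptions for illustration:

```python
# Hypothetical customer revenue rows, including the accounts the analyst
# used to exclude "from memory" when building the report by hand.
customers = [
    {"name": "Acme Corp",    "revenue": 50_000, "is_test": False, "churned": False},
    {"name": "QA Test Acct", "revenue": 99_999, "is_test": True,  "churned": False},
    {"name": "Gone Ltd",     "revenue": 40_000, "is_test": False, "churned": True},
    {"name": "Beta SA",      "revenue": 30_000, "is_test": False, "churned": False},
]

def top_customers(rows, n=10):
    """Make the exclusion rules explicit so any consumer — human or LLM —
    gets the same list."""
    active = [r for r in rows if not r["is_test"] and not r["churned"]]
    return sorted(active, key=lambda r: r["revenue"], reverse=True)[:n]

# The raw ranking would put the test account first; the codified one doesn't.
print([c["name"] for c in top_customers(customers)])  # ['Acme Corp', 'Beta SA']
```

Once the rule is in the pipeline instead of in someone's memory, the LLM inherits it for free.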
What someone still has to do
Here’s the point the hype ignores: everything that makes an LLM work well on business data requires data engineering work. Not less than before. More.
Someone still has to:
Design the data model to be interpretable. Not just correct — interpretable. Clear names, a single source of truth for each metric, tables that represent business concepts rather than technical artifacts from legacy systems.
Document what each field means. Not as a formality — as a functional prerequisite. If customer_id can refer to an end customer or a distributor depending on the table, the LLM won’t know that unless someone writes it down explicitly.
Ensure data quality before the AI touches it. Test accounts filtered out, outliers identified, dates in consistent formats, nulls handled with intent. Everything the analyst used to do “from memory” when preparing a report needs to be codified in the pipeline.
Keep all of it current. Because the business changes, systems change, and the LLM doesn’t know that last year you changed the discount logic and there’s a period of data that’s not comparable with the rest.
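One way to make the documentation requirement concrete: keep field descriptions as machine-readable metadata in version control next to the pipeline, so humans and the LLM read the same definitions. A sketch under assumed table and column names:

```python
# Hypothetical schema metadata, versioned alongside the pipeline code.
SCHEMA_DOCS = {
    "sales": {
        "amount": "Net amount in EUR, excluding VAT. For gross, add the 'tax' column.",
        "customer_id": "End customer only; distributors live in the 'partners' table.",
        "sold_at": "UTC timestamp. Rows before 2024-01-01 use the old discount logic.",
    }
}

def schema_context(table: str) -> str:
    """Render the documentation as plain text to prepend to an LLM prompt."""
    lines = [f"Table {table}:"]
    lines += [f"- {col}: {desc}" for col, desc in SCHEMA_DOCS[table].items()]
    return "\n".join(lines)

print(schema_context("sales"))
```

The exact format matters less than the habit: every ambiguity someone writes down here is one the model no longer has to guess at.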
None of this is new. It’s the same data engineering as always. What changed is that before, you could hide technical debt in the analyst’s institutional knowledge. Now, if you want an AI to understand it, it has to be explicit.
What AI does change (to be fair)
It's not all doom. AI genuinely changes some things:
Ad hoc exploration gets faster. A business user can ask questions without writing SQL. That has real value, even if someone needs to validate the answers.
Documentation can be partially generated. A well-prompted model can help write field descriptions, detect anomalies, suggest clearer naming. The data engineer still validates and decides.
Access gets democratized — with guardrails. Business teams can explore data without depending on a technician for every query. But someone has to have built the layer that makes that possible.
In every case, AI amplifies the data engineer — it doesn’t replace them.
The honest question to ask
Before hiring someone to implement an LLM on top of your data, there’s a more honest question:
Could an AI understand your pipelines? Or do you barely understand them yourself?
If the answer is the second, the problem isn’t the model. It’s the foundation. And no LLM, no matter how sophisticated, will fix that for you.
That foundation — clean data, interpretable model, real documentation, a pipeline that stays current — is exactly what we build before any model touches it. If you’re considering AI for your company, let’s talk before you hire the AI team. Order matters.
Do you have this problem at your company?
Book a no-commitment 30-minute call. We'll walk you through how we can help you get your data infrastructure in order.
Book a call →