Why AI makes the data engineer more necessary, not less
Everyone wants an LLM on top of their data. The problem: if the data is a mess, the AI answers confidently and gets it wrong anyway. That's the part nobody wants to hear.
There’s a demo everyone has seen. The CEO caught it at a conference, or someone on the team reproduced it in a Jupyter notebook: you ask a question in plain English, the LLM queries your warehouse, and it responds with a clean table, a specific number, sometimes even a chart.
Impressive.
Now try doing that with your actual company data. Not the demo dataset — your tables, your columns, the queries someone wrote in 2019 that nobody touches because nobody fully understands them anymore.
That’s where things get complicated.
The LLM doesn’t understand your business. It understands language.
That distinction sounds obvious, but the consequences are significant.
When you point an LLM at your warehouse, the model reads table names, column names, whatever comments (probably none) exist in the schema, and the sample data it can access. From that, it builds its interpretation of what everything means.
If your table is called sales_v3_final_USE_THIS, the LLM will try to infer what’s in there. If your amount column can be gross or net depending on context and nobody documented it, the model will pick an interpretation — confidently — and be wrong half the time.
The problem isn’t the AI. The problem is that you’re feeding it data that even you don’t fully understand.
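To make the ambiguity concrete, here is a minimal sketch, with hypothetical rows and column names, of how the same undocumented amount column yields two different answers depending on which interpretation the model happens to pick:

```python
# Hypothetical order rows. Nobody documented whether "amount" is gross or net.
orders = [
    {"order_id": 1, "amount": 121.0, "tax": 21.0},
    {"order_id": 2, "amount": 242.0, "tax": 42.0},
]

# Interpretation A: "amount" already includes tax (gross).
total_if_gross = sum(o["amount"] for o in orders)

# Interpretation B: "amount" is net, so tax must be added on top.
total_if_net = sum(o["amount"] + o["tax"] for o in orders)

# Same question ("how much did we sell?"), two confident, different answers.
print(total_if_gross, total_if_net)  # 363.0 426.0
```

Both numbers are internally consistent; only documentation can tell the model which one the business means.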
What actually happens in practice
A company wants to “chat with their data.” They hire someone to stand up the model, connect the LLM to their warehouse, and within the first week the answers start getting weird.
They ask how much they sold last month. The LLM returns a number that doesn’t match the usual report. Which one is right? Nobody knows — because there are three columns that could be “sales” and none of them have documentation.
They ask for their top 10 customers. The model returns a list that includes test accounts and churned customers. Nobody had filtered those out in the data model — it was tacit knowledge held by whoever used to build the reports manually.
They ask for profitability by product. The LLM produces something, but the business logic for calculating the real margin was buried in a 200-line query written three years ago that nobody fully understands anymore.
In every case, the AI worked perfectly. The problem was what you fed it.
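The “top customers” failure is typical: the filters lived in someone's head. A minimal sketch of codifying that tacit knowledge instead — the field names, flags, and sample rows here are assumptions for illustration:

```python
# Hypothetical customer revenue rows, including the accounts the analyst
# used to exclude "from memory" when building the report by hand.
customers = [
    {"name": "Acme Corp",    "revenue": 50_000, "is_test": False, "churned": False},
    {"name": "QA Test Acct", "revenue": 99_999, "is_test": True,  "churned": False},
    {"name": "Gone Ltd",     "revenue": 40_000, "is_test": False, "churned": True},
    {"name": "Beta SA",      "revenue": 30_000, "is_test": False, "churned": False},
]

def top_customers(rows, n=10):
    """Make the exclusion rules explicit so any consumer — human or LLM —
    gets the same list."""
    active = [r for r in rows if not r["is_test"] and not r["churned"]]
    return sorted(active, key=lambda r: r["revenue"], reverse=True)[:n]

# The raw ranking would put the test account first; the codified one doesn't.
print([c["name"] for c in top_customers(customers)])  # ['Acme Corp', 'Beta SA']
```

Once the rule is in the pipeline instead of in someone's memory, the LLM inherits it for free.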
What someone still has to do
Here’s the point the hype ignores: everything that makes an LLM work well on business data requires data engineering work. Not less than before. More.
Someone still has to:
Design the data model to be interpretable. Not just correct — interpretable. Clear names, a single source of truth for each metric, tables that represent business concepts rather than technical artifacts from legacy systems.
Document what each field means. Not as a formality — as a functional prerequisite. If customer_id can refer to an end customer or a distributor depending on the table, the LLM won’t know that unless someone writes it down explicitly.
Ensure data quality before the AI touches it. Test accounts filtered out, outliers identified, dates in consistent formats, nulls handled with intent. Everything the analyst used to do “from memory” when preparing a report needs to be codified in the pipeline.
Keep all of it current. Because the business changes, systems change, and the LLM doesn’t know that last year you changed the discount logic and there’s a period of data that’s not comparable with the rest.
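One way to make the documentation requirement concrete: keep field descriptions as machine-readable metadata in version control next to the pipeline, so humans and the LLM read the same definitions. A sketch under assumed table and column names:

```python
# Hypothetical schema metadata, versioned alongside the pipeline code.
SCHEMA_DOCS = {
    "sales": {
        "amount": "Net amount in EUR, excluding VAT. For gross, add the 'tax' column.",
        "customer_id": "End customer only; distributors live in the 'partners' table.",
        "sold_at": "UTC timestamp. Rows before 2024-01-01 use the old discount logic.",
    }
}

def schema_context(table: str) -> str:
    """Render the documentation as plain text to prepend to an LLM prompt."""
    lines = [f"Table {table}:"]
    lines += [f"- {col}: {desc}" for col, desc in SCHEMA_DOCS[table].items()]
    return "\n".join(lines)

print(schema_context("sales"))
```

The exact format matters less than the habit: every ambiguity someone writes down here is one the model no longer has to guess at.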
None of this is new. It’s the same data engineering as always. What changed is that before, you could hide technical debt in the analyst’s institutional knowledge. Now, if you want an AI to understand it, it has to be explicit.
What AI does change (to be fair)
It's not all doom. AI genuinely changes some things:
Ad hoc exploration gets faster. A business user can ask questions without writing SQL. That has real value, even if someone needs to validate the answers.
Documentation can be partially generated. A well-prompted model can help write field descriptions, detect anomalies, suggest clearer naming. The data engineer still validates and decides.
Access gets democratized — with guardrails. Business teams can explore data without depending on a technician for every query. But someone has to have built the layer that makes that possible.
In every case, AI amplifies the data engineer — it doesn’t replace them.
The honest question to ask
Before hiring someone to implement an LLM on top of your data, there’s a more honest question:
Could an AI understand your pipelines? Or do you barely understand them yourself?
If the answer is the second, the problem isn’t the model. It’s the foundation. And no LLM, no matter how sophisticated, will fix that for you.
That foundation — clean data, interpretable model, real documentation, a pipeline that stays current — is exactly what we build before any model touches it. If you’re considering AI for your company, let’s talk before you hire the AI team. Order matters.
Do you have this problem at your company?
Book a no-commitment 30-minute call. We'll walk you through how we can help you get your data infrastructure in order.
Book a call →