· 14 min read

Critical Thinking in the Age of AI

How AI tools have changed the way we work, and why your brain is still the most important tool in your job.


The hype is real, but so is the risk

Look, I’m not here to be another voice telling you AI is overhyped or that you should be scared of it taking your job. Honestly, as someone who builds software, infrastructure, and data pipelines day to day, AI tools have genuinely changed how I work, and for the better. I move faster. I prototype ideas I would’ve shelved because the setup cost was too high. I ask questions during my commute, time that would otherwise be a dull, boring waste. The productivity gains are real and I’m not pretending otherwise.

But here’s the thing nobody talks about enough in the conference talks and LinkedIn posts celebrating “10x engineers”: the output still needs you.


The model will be wrong, confidently

One of the more dangerous characteristics of large language models is not that they get things wrong. Every tool gets things wrong. The dangerous part is how they get things wrong.

When a compiler throws an error, it tells you. When a linter flags a potential bug, it highlights it. When a SQL query returns nothing, you know immediately something is off. LLMs do not behave like this. They will generate a beautifully structured, syntactically valid, well-commented piece of code that quietly does the wrong thing. They will explain an infrastructure concept with the confidence of a senior architect and leave out a detail that would’ve saved you three hours of debugging. They will produce a data pipeline transformation that looks exactly right, passes a quick glance review, and then silently corrupts your aggregations in edge cases you didn’t think to test.

I have personally caught generated code that had race conditions baked in, IAM policies that were technically functional but dangerously permissive, and SQL joins that would produce row explosions on any dataset larger than a toy example. None of it came with a warning label. It all looked fine at first glance.

This is not a reason to stop using these tools. It is a reason to never, ever turn your brain off when reviewing what they produce.


A critical point most people miss

LLMs are non-deterministic systems

This one is really important and I think a lot of non-technical people fundamentally misunderstand it, which leads to wildly unrealistic expectations about how these systems should behave.

Sure, you can use an LLM like a calculator, but it isn’t one. It’s not a lookup table. It’s not a database.

Ask a calculator what 2 + 2 is and you will get 4, every single time, forever. Ask an LLM the same technical question twice and you can get two different answers. Ask it on a different day and the answer might shift again. This is not a bug in the traditional sense. It is an intrinsic property of how these models work. They sample from probability distributions over tokens. There is temperature, there is randomness, there are subtle differences in how a question is phrased that can steer the output in completely different directions.
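To make the sampling point concrete, here is a toy sketch in plain Python of how temperature reshapes a probability distribution over next tokens before one is drawn. This is an illustration of the mechanism, not how a real inference stack is implemented; the prompt and logit values are invented for the example.

```python
import math
import random

def sample_token(token_logits, temperature=1.0, seed=None):
    """Sample one token from a logit distribution via softmax with temperature.

    Low temperature sharpens the distribution toward the argmax (near
    deterministic); high temperature flattens it, so the same prompt can
    yield different answers across runs.
    """
    rng = random.Random(seed)
    tokens = list(token_logits)
    # Scale logits by temperature, then apply a numerically stable softmax.
    scaled = [token_logits[t] / temperature for t in tokens]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token according to the resulting probabilities.
    return rng.choices(tokens, weights=probs, k=1)[0]

# Invented "next token" distribution for the prompt "2 + 2 ="
logits = {"4": 5.0, "5": 2.0, "four": 1.0}

print(sample_token(logits, temperature=0.1, seed=1))  # sharply favors "4"
print(sample_token(logits, temperature=2.0, seed=1))  # flattened: any of the three
print(sample_token(logits, temperature=2.0, seed=2))
```

The calculator never surprises you because it has no distribution to sample from; the model always has one, even when the right answer dominates it.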

For someone coming from a background of deterministic systems, this is a jarring concept. A lot of business stakeholders and non-engineers will treat AI output as ground truth, as if it’s pulling from some authoritative database of correct answers. It is not. It never was. What you get is a statistically likely response given your input and the model’s training. Sometimes that’s exactly right. Sometimes it’s subtly wrong. Sometimes it’s completely wrong but written in a way that sounds authoritative.

If your organization is making decisions based on AI generated analysis without human review, you have introduced a non-deterministic, probabilistic system into your decision making pipeline and you are treating it like it’s deterministic. That is a serious problem.


The bigger issue: bias is baked in at every layer

This is where I want to spend a bit more time because I think it’s the most underappreciated risk, especially for teams using AI for data analysis, architecture recommendations, or anything where the “right answer” depends heavily on context.

LLMs are biased, and the bias is structural, not accidental.

Here’s what I mean by that. The model learns from content. It learns from what was written on the internet, in documentation, in books, in code repositories. That content was not neutral. It overrepresents certain programming languages, certain architectural patterns, certain industries, certain time periods, and certain perspectives. If the majority of the training content describes AWS as the default cloud provider, the model will have an implicit lean toward AWS patterns even when you’re asking about a GCP or Azure environment. If most of the ingested data engineering content predates the modern lakehouse architecture, the recommendations you get will reflect that older mental model.

There is also the question of how the content was fed. The curation process, the fine tuning decisions, the RLHF (Reinforcement Learning from Human Feedback) process, the system prompts baked into the product you’re using, all of these shape what the model considers a “good” response. You are not getting an objective view. You are getting a view that was shaped by a lot of human decisions you have no visibility into.

And then there is the cutoff. Every model has a training cutoff date, which means its knowledge of the world is frozen at a specific point in time. For software engineers this is especially critical. A model trained on data up to early 2024 doesn’t know about a CVE disclosed in late 2024. It doesn’t know about breaking changes in a library that released a major version after its cutoff. It doesn’t know about the architectural shift that happened in your specific tools ecosystem six months ago. It will still answer your questions confidently, drawing on what it knows, which may no longer reflect current reality.

The version of AWS Lambda behavior it describes might be outdated. The Kubernetes API it references might be deprecated. The “best practice” it recommends for your data pipeline might have been superseded by a better pattern that emerged after its knowledge ended.

You have to know this. You have to factor this in. And you have to verify anything that matters against current, authoritative sources, not just accept the generated answer because it sounds right.


Non-determinism hits different when you’re talking about data

I want to revisit the non-determinism point from earlier but specifically in the context of data analysis, because this is where I’ve seen it cause the most damage and also where it’s the hardest to explain to stakeholders.

When an LLM generates text, variation across responses is generally tolerable. If you ask it to summarize something and it gives you three slightly different summaries across three runs, that’s fine. A human reader can interpret all of them. I could describe someone’s height as “not short”, “tall”, “very tall”, “above average”, “noticeably tall”, and all of those phrases are technically valid depending on framing. A reader understands the intent. Language has that flexibility built in.

Data does not have that flexibility. Not even a little bit.

If the actual number is 6.5 feet and the model tells you 7 feet, it doesn’t matter that both numbers are objectively tall and well above average. The number is wrong. And the moment someone cross-references it against a source of truth and finds that discrepancy, the trust is gone. Not just for that one data point, but for everything the model has ever told them. The context around the number being “close enough” is completely irrelevant to most people reviewing data. They see a wrong number and they stop trusting the system.

This is one of the more underrated challenges of using LLMs for data analysis in a business context. The model isn’t just writing prose, it’s generating figures, aggregations, metrics, comparisons. And unlike language, numbers are exact. There is no graceful interpretation of a 7.7% churn rate when the actual number is 6.2%. Those are different business realities that lead to different decisions.
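One lightweight guardrail that follows from this: never let a model-reported figure reach a dashboard without recomputing it against the source of truth. A minimal sketch (the function name and the churn numbers are illustrative, not from a real pipeline); the tolerance here exists only to absorb floating-point noise, not to allow “close enough” business numbers:

```python
import math

def verify_metric(name, model_value, source_of_truth_value, rel_tol=1e-9):
    """Compare a model-reported metric against a recomputed value.

    Numbers are exact: anything beyond floating-point noise is a failure,
    not an approximation. Returns True only on an exact-enough match.
    """
    if math.isclose(model_value, source_of_truth_value, rel_tol=rel_tol):
        return True
    print(
        f"MISMATCH in {name!r}: model said {model_value}, "
        f"source of truth says {source_of_truth_value}"
    )
    return False

# The churn example from the text: 7.7% vs 6.2% is a mismatch, not a nuance.
verify_metric("churn_rate", 0.077, 0.062)  # flagged
verify_metric("churn_rate", 0.062, 0.062)  # passes
```

The point of the check is not sophistication, it is that the comparison happens at all, every time, before a human sees the number.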

What makes this especially tricky is that the model isn’t lying. It genuinely doesn’t know the difference. It’s pattern matching on what “revenue” or “active users” or “conversion rate” tends to mean based on its training data, which may have nothing to do with how your organization specifically defines those terms. And this is where it gets really important, because the definition of a metric is not universal. “Revenue” in one company means gross. In another it means net after refunds. In another it means only recognized revenue for that period. The LLM doesn’t know which one you mean unless you tell it, and even then it might not consistently apply it.

How we addressed this: the semantic layer

This is a challenge we’ve been actively working through at our org, and the approach that has made the biggest difference is introducing a semantic layer on top of our data that acts as a set of guardrails for how the LLM interacts with it.

The core idea is simple: instead of letting the model guess what your business terms mean, you define them explicitly and make those definitions part of the context the model operates within. You are essentially giving it a controlled vocabulary for your data.

Here’s a simplified example based on an orders table. Without a semantic layer the model is free to interpret “revenue” however it wants:

Column Name     | Raw Type  | What the LLM Might Assume
order_total     | DECIMAL   | Could be revenue, could include tax
discount_amount | DECIMAL   | May or may not subtract this
refund_amount   | DECIMAL   | Often ignored in naive calculations
status          | VARCHAR   | Unclear which statuses count as “complete”
created_at      | TIMESTAMP | Unclear if this is order date or payment date

With a semantic layer you collapse that ambiguity by explicitly defining what each business concept maps to:

Semantic Term | Definition                                  | Underlying Calculation                        | Filters Applied
revenue       | Recognized net revenue for completed orders | order_total - discount_amount - refund_amount | status IN ('completed', 'fulfilled')
gross_revenue | Total order value before any deductions     | order_total                                   | status != 'cancelled'
order_date    | The date a payment was captured             | created_at where payment_confirmed = true     | Excludes pending
active_orders | Orders currently in fulfillment             | COUNT(order_id)                               | status IN ('processing', 'shipped')

When the LLM is pointed at this semantic layer instead of the raw table, it has a lot less room to hallucinate. It doesn’t need to infer what “revenue” means because you’ve told it. It doesn’t need to guess which statuses to filter on because that’s already encoded. The model is constrained to work within your definitions, and the output becomes significantly more consistent and trustworthy.
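To illustrate the mechanics (the structure and names here are a hypothetical sketch, not our actual implementation), a minimal semantic layer can be as simple as a dictionary of explicit definitions that compiles into SQL, so the model selects a defined term instead of inventing a calculation:

```python
# Hypothetical, minimal semantic layer: each business term maps to an
# explicit calculation and filter, so nothing is left for a model to infer.
SEMANTIC_LAYER = {
    "revenue": {
        "expression": "SUM(order_total - discount_amount - refund_amount)",
        "filters": "status IN ('completed', 'fulfilled')",
    },
    "gross_revenue": {
        "expression": "SUM(order_total)",
        "filters": "status != 'cancelled'",
    },
    "active_orders": {
        "expression": "COUNT(order_id)",
        "filters": "status IN ('processing', 'shipped')",
    },
}

def compile_metric(term, table="orders"):
    """Turn a semantic term into SQL. Unknown terms fail loudly instead of
    letting the caller (or a model) guess at a definition."""
    if term not in SEMANTIC_LAYER:
        raise KeyError(f"'{term}' is not a defined metric")
    metric = SEMANTIC_LAYER[term]
    return (
        f"SELECT {metric['expression']} AS {term} "
        f"FROM {table} WHERE {metric['filters']}"
    )

print(compile_metric("revenue"))
```

The design choice that matters is the KeyError: an undefined term is an error, not an invitation to improvise.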

It’s not a perfect solution and I want to be clear about that. The model can still make mistakes, still misapply a definition, still produce wrong aggregations if the query is complex enough. But the error rate drops substantially when you remove the ambiguity that invites hallucination in the first place. You are essentially narrowing the probability space the model is sampling from, which is the right way to think about making a non-deterministic system more reliable in a data context.

The semantic layer also has a side benefit that has nothing to do with AI: it forces your organization to actually agree on what your metrics mean. And if you’ve worked at more than one company, you know that’s a harder problem than the technical one.


What this looks like in practice

As a lead engineer, the way I think about it is this: AI is a very fast, very well-read junior contributor who has read everything but understood some of it incorrectly, and doesn’t know what happened last quarter. You wouldn’t merge a PR from that person without a thorough review. You wouldn’t deploy infrastructure they designed without validating it against your actual requirements and constraints.

Treat AI output the same way. Use it to move faster. Use it to explore options. Use it to write the boilerplate you’ve written a hundred times so you can focus your energy on the parts that actually require judgment. But never outsource the judgment itself.


When to trust, when to verify

I don’t have a perfect system for this but I have a rough mental model that has served me well enough over the past year or so. Basically I sort AI output into three buckets based on how much damage a wrong answer can do.

Trust and move on:

  • Boilerplate code you’ve written a dozen times before. Config files, test scaffolding, basic CRUD endpoints. You know what correct looks like here so if the output is wrong you’ll catch it instantly. The risk is near zero and the time savings are real.
  • Formatting, refactoring, renaming. Mechanical stuff where the intent is obvious and the blast radius is small.
  • First draft documentation. You’re going to rewrite it anyway.

Trust but verify before merging:

  • Unfamiliar library usage. The model might be referencing an older API or a pattern that technically works but isn’t idiomatic. Skim the actual docs before you commit to it.
  • Unit tests. The model is surprisingly good at generating test cases but it will sometimes test the wrong thing, or write a test that passes for the wrong reason. Read each assertion, don’t just check that it runs green.
  • Database queries on non-trivial schemas. Especially anything with joins across more than two tables. Run it against real data before you trust the output.
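The row-explosion failure mode is easy to reproduce, and worth seeing once: a join key that is unique in a toy example but duplicated in real data silently multiplies rows, and every downstream SUM is inflated. A self-contained sqlite sketch with invented tables:

```python
import sqlite3

# In-memory database with the one-to-many relationship a toy example hides:
# each order can have multiple shipment rows in real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_total REAL);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);

    CREATE TABLE shipments (order_id INTEGER, carrier TEXT);
    -- Order 1 shipped in two parcels: the join key is duplicated.
    INSERT INTO shipments VALUES (1, 'ups'), (1, 'fedex'), (2, 'ups');
""")

# Naive join: looks right, silently double-counts order 1's total.
naive = conn.execute("""
    SELECT SUM(o.order_total)
    FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

# Correct: aggregate before joining (or deduplicate the join key first).
correct = conn.execute("SELECT SUM(order_total) FROM orders").fetchone()[0]

print(naive)    # 250.0 -- inflated by the duplicated join key
print(correct)  # 150.0
```

The naive query runs green, returns plausible numbers, and is wrong. That is exactly the class of mistake a quick-glance review misses.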

Never trust, always verify from scratch:

  • IAM policies, security group rules, anything auth related. A permissive policy that “works” is not the same as a correct policy. The model will happily give you * permissions if that’s the path of least resistance.
  • Data pipeline transformations where accuracy matters. If the numbers feed into dashboards or reports that people make decisions from, you cannot afford a subtle aggregation bug.
  • Architecture decisions. The model doesn’t know your team size, your on-call rotation, your budget constraints, or what your infrastructure actually looks like at 3am when something breaks. It will recommend what’s popular, not what’s right for you.

The general principle is pretty simple: the higher the cost of being wrong, the less you should trust the output without independent verification. It sounds obvious when you write it down but I’ve watched smart engineers skip the verification step on category three stuff because the generated answer “looked right” and they were in a hurry.


What I actually use AI for (and what I don’t)

This is the practical section. No theory, just what my actual day to day looks like with these tools in the mix.

Things I use AI for almost every day:

  • Exploring unfamiliar code. When I’m dropped into a repo I haven’t touched before, asking “what does this module do” or “trace the request flow from this endpoint” is genuinely faster than reading through it cold. It’s not always perfectly accurate but it gets me oriented quickly.
  • Writing the boring parts. Terraform modules, Dockerfile boilerplate, CI pipeline configs, migration scripts. Stuff where the pattern is well established and I just need it done. I’ll review the output but I’m not writing it from scratch anymore.
  • Rubber ducking. Sometimes I describe a problem to the model not because I expect a good answer but because the act of explaining it helps me see the issue. The response is a bonus. The real value is that prompting forces me to articulate what I actually think is wrong.
  • Generating test data. “Give me 50 rows of realistic looking order data with these columns” is a genuinely great use case. Saves me ten minutes of writing factory functions every time.
  • Drafting communications. Design docs, incident postmortems, PR descriptions. First draft goes fast, then I rewrite the parts that sound too generic or miss context the model couldn’t know.
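For the test-data case, the hand-written equivalent is a small factory function, which is exactly why it’s a good thing to delegate. A sketch with hypothetical column names; seeding makes the fixtures reproducible across test runs:

```python
import random
from datetime import datetime, timedelta

def fake_orders(n, seed=42):
    """Generate n realistic-looking order rows for tests.

    Deterministic for a given seed, so fixtures don't drift between runs.
    """
    rng = random.Random(seed)
    statuses = ["completed", "fulfilled", "processing", "shipped", "cancelled"]
    start = datetime(2024, 1, 1)
    rows = []
    for i in range(1, n + 1):
        total = round(rng.uniform(5.0, 500.0), 2)
        rows.append({
            "order_id": i,
            "order_total": total,
            "discount_amount": round(total * rng.choice([0.0, 0.05, 0.1]), 2),
            "status": rng.choice(statuses),
            "created_at": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
        })
    return rows

rows = fake_orders(50)
print(len(rows))  # 50
```

Whether the model writes this or I do, reviewing it takes seconds because the shape of “correct” is obvious, which is the whole point of the trust-and-move-on bucket.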

Things I don’t use AI for:

  • Production incident response. When something is on fire I need to think clearly and move deliberately. The last thing I want is a confidently wrong suggestion pulling my attention in the wrong direction while the pager is going off.
  • Security reviews. I will not let a model tell me whether a permission boundary is correct. I’ll read the policy myself, trace the access paths myself, and verify against the principle of least privilege myself. The stakes are too high for “probably fine.”
  • Performance critical code paths. The model doesn’t know your p99 latency targets, your memory constraints, or what your profiler is actually showing you. It will suggest something that works but “works” and “works at scale under load” are different conversations.
  • Anything where I don’t understand the output well enough to catch a mistake. This is the big one. If you can’t independently verify whether the answer is right, you shouldn’t be using it. Full stop. The whole value proposition breaks down if you can’t tell good output from bad.

The pattern is basically: use AI where you already know what good looks like and where a mistake is cheap to catch. Don’t use it where the cost of a wrong answer is high and you wouldn’t notice until it’s too late.


Final thought

Critical thinking isn’t a soft skill that becomes less important as tooling improves. In an environment where tools can produce convincing-looking wrong answers at scale, it becomes the most important skill you have.

And this goes way beyond code. The internet is being flooded with AI generated articles, images, videos, voice clones. Stuff that looks and sounds real but was never touched by a human who actually knows what they’re talking about. You see a chart on Twitter that confirms what you already believe. Did someone pull that data or did a model generate a plausible looking visualization? You read a blog post that cites three studies. Do those studies actually exist?

The fundamental question we all need to get better at asking is embarrassingly simple: “Is this actually true?”

Not “does this sound right.” Not “does this match what I already believe.” Actually true. Verified against something that isn’t another AI generated source citing another AI generated source in a circle of confident nonsense. Because if we stop asking, we stop thinking. And at that point we’re not informed, we’re just brainwashed with extra steps.

Critical thinking was always important. In a world where anyone can generate unlimited convincing content at zero cost, it’s the whole game.