How Many Databricks Does It Take To Change A Lightbulb?

Data has become expensive. So expensive that "eye-watering" might not quite cut it. And it is not just expensive in monetary terms. There is also the dreary hidden expense of the report that takes data engineering months to produce, or that so-called accelerator stuck in its never-ending professional services integration… and your stakeholders just wanted to know how their business was performing.

I do wonder if accelerator is one of the most ironic terms used in tech.

There are many reasons data has become this expensive; however, to me, one thing is sticking out like a throbbing sore thumb, hit by several hammers (made myself wince there too). I am, of course, talking about the misplaced use of big data tools.

[Image: Cartoon Databricks character scratching its head about how to change a lightbulb]

So, how many Databricks does it take to change a lightbulb?

Well, Databricks doesn't change lightbulbs. It would be terrible at it, just like Snowflake. They just aren't the right tool.

Now, don't get me wrong, these tools can absolutely work with data of any size, but consider this analogy:

If I need to travel 10 miles, that is a fair distance to walk, so using a tool makes sense. I could get a bicycle, very economical. I could drive a car. I could charter a helicopter, maybe see if anyone has built a hyperloop yet, or even better, see if SpaceX have a launch window.

[Image: SpaceX rockets in a taxi rank]

All of these can help me travel the 10 miles, but some of them really don't make a whole lot of sense for such a small journey. They would be heinously uneconomical.

There are many reasons to choose big data tools. Typically, when the reasons are real, you have no choice but to use them, and that usually means you have a good problem. Sometimes it is obvious you will need them soon, and it is worth investing in preparation.

And there are, of course, other valid reasons. Data engineers want to use these tools, and who could blame them? They are incredible tools and prime CV fodder.

On the other side of that is something we have probably all seen a few times now: large teams of data engineers managing a total of 5TB of data for their department or company, with only a few GB running through a medallion-style pipeline architecture each night.

Most of us know this deep down. We could unearth the Raspberry Pi we left somewhere in the office IT cupboard and do the same for the cost of cleaning the dust off it and a little ingress/egress.
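
To make that concrete, here is a minimal sketch of a medallion-style (bronze, silver, gold) nightly run using nothing but the Python standard library. Everything in it is an assumption for illustration: the `RAW_CSV` sample, the column names and the `run_pipeline` function are hypothetical, and a real job would read files from disk rather than an inline string. It is not how any particular platform does it, just the same shape at a size a small machine can handle.

```python
import csv
import io
import sqlite3

# Hypothetical nightly extract; in practice this would be a file on disk.
RAW_CSV = """order_id,region,amount
1,north,10.50
2,south,not_a_number
3,north,4.50
4,south,20.00
"""


def run_pipeline(raw_csv: str) -> dict[str, float]:
    """Run a tiny bronze -> silver -> gold flow in an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")

    # Bronze: land the raw rows untouched, everything stored as text.
    con.execute("CREATE TABLE bronze (order_id TEXT, region TEXT, amount TEXT)")
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    con.executemany(
        "INSERT INTO bronze VALUES (:order_id, :region, :amount)", rows
    )

    # Silver: typed and cleaned; rows whose values do not parse are dropped.
    con.execute("CREATE TABLE silver (order_id INTEGER, region TEXT, amount REAL)")
    bronze_rows = con.execute(
        "SELECT order_id, region, amount FROM bronze"
    ).fetchall()
    for order_id, region, amount in bronze_rows:
        try:
            cleaned = (int(order_id), region, float(amount))
        except ValueError:
            continue
        con.execute("INSERT INTO silver VALUES (?, ?, ?)", cleaned)

    # Gold: the aggregate the stakeholders actually asked for.
    gold = con.execute(
        "SELECT region, SUM(amount) FROM silver GROUP BY region ORDER BY region"
    ).fetchall()
    con.close()
    return dict(gold)


print(run_pipeline(RAW_CSV))
```

The point is not that SQLite is the answer to everything. It is that the layered pattern itself does not require a cluster; the data volume does, when it is genuinely large.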

So why do so many companies end up with 130-core clusters, vast amounts of memory, and millions of pounds in staff costs to shuffle a couple of gigabytes through a pipeline each night? They don't need it. Nobody needs it. It is, to put it mildly, illogical.
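
A back-of-envelope check shows the size of the mismatch. The figures below are assumptions, not measurements: 3 GB for "a couple of gigabytes", and a deliberately modest 50 MB/s of sustained throughput on a single machine. The `nightly_runtime_seconds` helper is hypothetical, and it only models one streaming pass over the data; real pipelines make several passes and spend time on joins and writes, so treat the result as an order of magnitude.

```python
def nightly_runtime_seconds(data_gb: float, throughput_mb_per_s: float) -> float:
    """Seconds to stream data_gb gigabytes once at a sustained throughput."""
    return data_gb * 1000 / throughput_mb_per_s


# Assumed figures, not measurements: 3 GB per night at 50 MB/s sustained.
seconds = nightly_runtime_seconds(3, 50)
print(f"One pass takes roughly {seconds:.0f} seconds")
```

Even if the real job needed ten or twenty passes, that is minutes on one small machine, not a job for a cluster.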

Of course, this is not to say these tools are never the right answer. There are many reasons you should be reaching for them, for example:

You are genuinely operating at scale. Not "we might be one day", but terabytes turning into petabytes, with concurrency, latency and reliability actually mattering.
Your workloads demand distributed compute, complex joins across large datasets, or near real-time processing where a single machine simply will not cut it.
You have multiple teams, multiple use cases, and a platform that needs to serve as a shared, governed layer rather than a single pipeline feeding a single report.
You have hit the limits of simpler systems, repeatedly, and scaling up is no longer a choice but a necessity.

If not, then it is worth asking a slightly uncomfortable question:

Are we solving a data problem, or are we maintaining a data platform that has grown far larger than the problem it was supposed to solve?

[Image: Raspberry Pi beating a huge server in a race, both carrying 2GB of data]

More often than not, the answer isn't technical.

It is habit.

It is vendor influence.

It is "this is how modern data is done".

It is building for a future that may never arrive.

And, occasionally, it is the simple truth that complex systems are more interesting to build.

None of that makes the tools bad. Far from it. They are exceptional at what they are designed for.

But using them everywhere, regardless of need, is a bit like taking that helicopter to travel 10 miles. You will absolutely get there, and it will look impressive while you do it. But first you will need to travel to the helipad (increased overheads), find somewhere to land at the other end (deployment complexity), register the flightpath (governance), file the paperwork with three different authorities (change approval), and burn a lot more fuel than necessary along the way (cloud bill). You may find yourself wondering why the journey felt so expensive for what it was, and why it took so long.

Sometimes, the most effective solution is not the most scalable, or the most modern, or the most impressive.

Sometimes, it is just the one that fits the problem.

On the software side of the fence, we learned these things in depth through the rise of Agile methodologies and the principles that came with them, such as YAGNI (You Aren't Gonna Need It). I think the data world is still catching up with these engineering principles. I just hope it gets through this part of its evolution quickly, with limited financial damage to businesses along the way.