A CTO Should be a Scientist

In this article I’ll show why a CTO should take a scientific approach to technological decisions. Arguably, this also means that companies should consider science a potentially good background for CTO candidates.

No CTO was Ever Fired for Buying Oracle

This is an old saying, and I suspect that it’s not true anymore. The idea was: Oracle is big, Oracle is rich, Oracle makes C-levels feel safe. A CTO might be fired for choosing to rely on some exotic technology or small vendor, or for opting for an open source technology. When something doesn’t work and causes a disaster, the CTO is responsible. That’s a lot of pressure on the CTO…

Maybe many companies in the financial sector still apply this simplistic reasoning. Medium-sized companies and startups just can’t afford it. Too expensive. And that mindset is exactly what drove Birmingham City Council into bankruptcy: they trusted Oracle, and they paid a price that was bigger than their bank account.

Replace Oracle with the company you hate the most, and the reasoning will still work (except that not all of them made Birmingham bankrupt, fortunately).

No CTO was Ever Fired for being a Trendy Person

Let’s modernise the above saying. The point is no longer paying Oracle or one of its impressive competitors. The point is following industry trends. They exist for a reason, right? So following them should reduce the risk (more on this later).

We still have three go-to giants in the cloud. You know who they are. But when it comes to databases, web servers, programming languages, and many other technologies, you don’t necessarily choose a proprietary one. Usually, you go for the most common open source choices.

And you, as a CTO, feel safe! I mean: can you be fired for choosing Django and PostgreSQL? Or Node.js and MongoDB? Come on. It matters very little if the stack you’ve chosen is not that efficient for your particular use case: as long as it’s known to be cool, your decision won’t be questioned by anyone who has the power to kick you out.

It might be questioned by the best senior members of your team, though. But it’s OK not to listen to them. Most CTOs don’t, and no one fires them for it.

But Me Some But, Damn It!

Sorry for describing reality as it is. That doesn’t mean I approve of it. And if you’re still reading, you probably need some but now, to avoid falling into depression. You want to be a decent CTO, not just one who works for a successful company and doesn’t get fired, right? (Right?)

Brilliant, here’s my but: you can listen to your team, experiment with their ideas, and adopt the ones that prove themselves.

The Scientific Approach

Do you know the steps of the scientific method? Applied to IT stack decisions, they would look like the following:

  1. Read the news, listen to your team, research
    By reading the news, you will learn about new technologies that you might use. Understand what they are, what they can do for you, what their limitations are. Ask your team for their opinions. Give them the time to do some research.
  2. Hypothesis
    The scientific method requires hypotheses. A hypothesis is a logical statement that follows an if-then pattern: if we put Varnish in front of the read-only APIs, the application’s latency will decrease. You form one simply by learning about a technology or a methodology and thinking about the concrete problems it might solve for you.
  3. Prediction
    In science, hypotheses lead to predictions, which are expectations that can be falsified to disprove a theory. In your case, they are a hypothesis with an action plan and a goal. For example: if we put Varnish in front of these specific API routes, we expect a 30% performance increase. A prediction implies a decision: if you only get a 25% speed up, Varnish is good but in your case it’s not worth the implementation effort.
  4. Test
    Ask your team to make a precise plan: how to run the tests, how to measure the results. You can’t just test in production, obviously. But you can test in conditions that are similar to production, with a similar level of concurrency, with similar sequences of events.
  5. Analyse and decide
    Now you should have clear results. Did you reach your goal? If so, go ahead and make a plan to implement the new technology. If not, you might give up or you might change your prediction. Maybe Varnish could have been implemented in a better way, or maybe you didn’t apply it to the right use case.

The above example is about a technology that can be added to your IT stack to improve system performance. But you can apply the same approach to most decisions a CTO makes. You learnt about a methodology that should speed up development? Test it with one team for one month, define clear metrics, then evaluate. You learnt about a tool that should reduce the number of bugs? Try it on some new code development. Make sure to choose a sensible metric, like the number of customer support tickets that involve that specific component.
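If you want to see what the Varnish prediction could look like as an actual test, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two staging URLs, the sample size, the choice of median latency as the metric, and the 30% threshold.

```python
# A minimal sketch of the "prediction + test" step, assuming two test
# environments: one with Varnish in front of the API and one without.
# The URLs, sample size and 30% threshold are illustrative assumptions.
import statistics
import time
import urllib.request

BASELINE_URL = "https://staging.example.com/api/products"        # no cache (hypothetical)
CACHED_URL = "https://staging-varnish.example.com/api/products"  # behind Varnish (hypothetical)
SAMPLES = 200
PREDICTED_IMPROVEMENT = 0.30  # the prediction: 30% lower median latency


def median_latency(url: str, samples: int) -> float:
    """Measure the median response time of an endpoint over many requests."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


baseline = median_latency(BASELINE_URL, SAMPLES)
cached = median_latency(CACHED_URL, SAMPLES)
improvement = (baseline - cached) / baseline

print(f"median latency: {baseline:.3f}s -> {cached:.3f}s ({improvement:.0%} better)")
print("prediction confirmed" if improvement >= PREDICTED_IMPROVEMENT else "prediction falsified")
```

Run it against environments that resemble production, as discussed in the Test step, and the script tells you whether the prediction was confirmed or falsified.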

Communication

You work in a company. Don’t be an island. Don’t just hope that your boss leaves you free to do your job and doesn’t care (or realise) that your choices aren’t the most common ones. Instead, show what you’re doing with pride!

As we discussed, you need numbers to experiment. Good luck finding those numbers alone, or asking your team to provide them. In my opinion, you’d better ask the data team to set up measurements and a dashboard. See the dashboard as a very understandable way to show the complicated benefits you obtain by doing complicated stuff.

Don’t ask your colleagues to leave you free to do your job. Ask them to follow your example. Offer your help. Offer your guidance, your learnings, your tools. They can experiment with new ways of selling, new tools for automating marketing actions, or whatever might improve their work in ways that aren’t necessarily the standard way of doing things. They can see if it works by looking at the numbers. And if it works, they can adopt the new methodologies and prove their success. And you’ll be the one who drove a big change!

Costs versus Benefits

At the beginning of this article, I was (intentionally) a bit unfair. Often, CTOs don’t follow industry habits just to avoid getting fired (even though you can’t deny that it’s a strong drive). There are more reasons. For example:

  • Finding experts in a widely used technology is easier (faster recruitment).
  • Those experts tend to be less expensive (cost reduction).
  • People are easier to replace when they follow standard methodologies with standard tools (smoother onboarding).
  • You can find a lot of ready-made open source tools for well-established technologies and methodologies (big ecosystem).

Sometimes the corresponding drawbacks of an unusual choice can be turned into an opportunity. Do you use Haskell? There aren’t many Haskell developers, so expect your new hires to learn it, rather than already knowing it. But developers might see Haskell as an exotic piece of archaeology, right? Then make it appealing by explaining the reasons why you use it. Explain that, for your use case, it produces better, more readable code. Explain that functional programming forces you to think in the right way to solve the types of problems you deal with. Open source your framework and tools. Encourage your team to take part in events and propose talks on how your unusual technology choices produce better results, in elegant ways.

All that being said, a CTO should weigh other aspects, too: application latency, data throughput, code reliability, cost of development and maintenance… all these things might benefit from technologies and methodologies that better fit your case, even if they’re not industry standards.

IMAGE CREDIT: Gemini 2.5 Flash Image

Socrates was not an LLM

One of the problems with LLMs is that they almost never answer “I don’t know”. When they don’t know something, they still generate an answer that usually sounds confident and consistent but… might be incorrect. And if you ask the same question in several conversations, they’ll tend to give you a different answer every time.


TRY THIS YOURSELF
Tell an LLM that you ate a particularly good variety of cheese, or drank a great wine. Say you don’t know its name. Describe it, but in vague terms. The LLM will probably suggest the variety that vaguely matches your description and appeared most often in its training material.
Say that it’s not the one it suggested, and it’ll come up with the next most frequently found varieties, but without any logical reason.
Start a new conversation and ask again. If there isn’t a variety that matches your description and is clearly more common than the others, you’ll probably get a different answer every time.


How overconfidence works

This is a problem for several use cases. Imagine using an LLM as a replacement for a doctor – because the patient is not being wise, or because no doctor is available and the problem is urgent. You might ask something like: I have such-and-such symptoms, what might it be? Suppose it’s a rare disease, or that for some reason you have unusual symptoms. There is a high risk that the LLM answers with the AI equivalent of an educated guess or, worse, a wild guess.

But why don’t they just admit their ignorance on that particular topic? I didn’t know the answer until I discussed the subject in depth with some LLMs and ran some experiments.

LLMs don’t know what they don’t know

Socrates said that the only true wisdom is in knowing you know nothing. But then, I must conclude that LLMs aren’t wise at all. Because they just don’t know what they don’t know. Let me explain how it works.

You ask a question. The LLM reasons to find a good answer. In simplistic terms, some people would say that they predict the answer with probabilistic functions, token after token.

Let’s accept this simplification to keep the explanation simple. When an LLM knows something, it’s because reinforcement learning made one answer more likely than the others. During training, the model was asked to generate many answers. The good ones were rewarded. Reasoning paths that were rewarded become more likely to be followed again.

When an LLM doesn’t have an answer… well, that rarely or never happens. The LLM always knows many possible answers, because its language and reasoning abilities allow it to compose them. But if the correct one is unknown, several answers have more or less the same probability of being generated. The LLM never learnt that some of them are acceptable and others aren’t.

The reasoning process generates answers from the tokens that constitute the reasoning. A token is a word, or part of a word. But the model sees the token, not the way it was selected. How likely was that particular token compared to the alternatives? It doesn’t know.
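A toy numerical example may help. The snippet below is not a real LLM, just a softmax over four hypothetical cheese answers: in the “known” case one answer dominates, in the “unknown” case the probabilities are nearly flat, and in both cases the sampler emits a token without telling you which case it came from. The logit values are invented.

```python
# A toy illustration, not a real LLM: sampling from a peaked distribution
# ("the model knows") versus a nearly flat one ("the model doesn't know").
# In both cases the sampler happily emits an answer; nothing in the output
# reveals which situation produced it.
import math
import random

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

answers = ["Gorgonzola", "Stilton", "Roquefort", "Cabrales"]

known = softmax([6.0, 1.0, 0.5, 0.2])    # training strongly rewarded one answer
unknown = softmax([1.1, 1.0, 0.9, 1.0])  # no answer was ever preferred

for label, probs in [("known", known), ("unknown", unknown)]:
    pick = random.choices(answers, weights=probs)[0]
    print(label, [round(p, 2) for p in probs], "->", pick)
```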

Claude Sonnet 4 told me:

This is a fundamental limitation - I experience generating all responses similarly, regardless of how much training data supported that information or how certain the underlying model might be. (...) I can't distinguish between confident knowledge and confident-seeming guesses.

Meta-introspection

LLMs know how LLMs theoretically work, to some extent. Just like a neurosurgeon knows how the human brain theoretically works. But, just like us, LLMs can’t follow the flow of their reasoning and examine the information stored in their neurons. They’re not capable of introspection. They can, however, observe the output they generated in the current conversation and do a sort of self-analysis. I regularly use this capability to explain some of their behaviours, or to validate my learnings about LLMs.

Ask an LLM a question it can’t answer. After a few wrong guesses, each time pointing out that they’re wrong, you can question the LLM’s ability to answer correctly. It will examine the conversation up to that moment, see the pattern in its reasoning, and admit that it doesn’t know the answer. It didn’t know that before. It realised its ignorance by examining the answers it had emitted earlier.

Exceptions: the known unknown

There are notable exceptions. If you ask whether a god exists, or what will happen tomorrow, there is no definite answer to be had. The LLM will probably respond with a very verbose, well-motivated “I don’t know”, because its training explicitly taught it that there is no known answer to these and many other questions. In this case, it’s not admitting a lack of knowledge: “no one knows” or “I don’t know” are known facts, in some contexts.

Another exception is when they look for an answer but have trouble building one with consistent reasoning. This happens, as far as I know, when the information they have on the subject is insufficient or contradictory. In this case they can deduce that they don’t know the answer for sure, though they generally still try to answer on a best-effort basis. In other words, LLMs try hard to answer even when they know that doing so is not reasonable.

Teaching the unreasonable

Let’s focus on this try-hard attitude or, if you like, this know-it-all attitude, and see how they learn it.

The carrot and the stick

LLM training initially teaches the model many words and many ways to use them to compose meaningful sentences. At some point, it also starts to learn the meanings of those words. Sure, I’m simplifying a lot, but that doesn’t matter now. The point is that, after learning language, the model needs to learn how to answer questions in a helpful way.

This is done with reinforcement learning, the AI equivalent of a method used with human children. Essentially, the LLM is asked questions, its answers are evaluated, and it receives positive rewards for good answers and negative rewards for poor ones.

For some questions, only one correct answer exists. This is the case for mathematical questions, for example. Such questions can be evaluated automatically. For other questions, the answer is evaluated by specialised LLMs or humans. Both play an important role.

Rewarding the know-it-all attitude

The problem here is that both human and LLM evaluators tend to reward answers that transfer knowledge to the user. When asked a question, the LLM has to provide a correct answer to get positive feedback. An incorrect answer leads to negative feedback. This way, if a similar question is asked in the future, the correct answer becomes more likely than a wrong one.

It’s reasonable to suppose that, occasionally, the LLM answers “I don’t know” or something similar during reinforcement learning. But that answer is not useful for the user, so it won’t be encouraged by the trainers.

If you think about it, this makes sense. I don’t think there is a perfect solution for this. While LLM vendors try to mitigate the problem, reinforcement learning must essentially encourage knowledgeable answers.
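Here is a deliberately naive sketch of that dynamic. It is not how real reinforcement learning works (real training updates billions of weights, not three numbers, and the reward values here are invented), but it shows why an answer that is never rewarded, like “I don’t know”, ends up with a tiny probability.

```python
# A toy sketch of how reward-based training squeezes out "I don't know":
# each time an answer is rewarded, its weight grows; an answer that is
# never rewarded gradually becomes relatively less likely.
# All numbers are invented for illustration.
candidates = {"correct answer": 1.0, "plausible guess": 1.0, "I don't know": 1.0}
rewards = {"correct answer": 1.0, "plausible guess": 0.3, "I don't know": 0.0}

LEARNING_RATE = 0.5
for _ in range(10):  # ten simulated training rounds
    for answer, reward in rewards.items():
        candidates[answer] *= (1 + LEARNING_RATE * reward)

total = sum(candidates.values())
for answer, weight in candidates.items():
    print(f"{answer}: {weight / total:.1%}")
```

After ten rounds, “I don’t know” is down to a couple of percent, even though it started on equal footing.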

Confidence in confidence

You can, of course, ask an LLM to express probabilities. Something like:

How many chances are there that the abstract and the introduction of Apple's paper "The illusion of thinking" were written by an LLM?

Like it or not, some LLMs will say 90% or 85%. But they don’t calculate probabilities in any way. For them, it’s just a colloquial expression: it means that there are many indications that the fact is true, while they can’t be absolutely certain. Maybe some LLMs initially think 90% and then reduce the probability by 5% to highlight the uncertainty. Maybe 85% is simply a token that appeared in their training material in similar situations. I interpret it as anywhere between 51% and 99%.

Can LLMs learn from Socrates?

Maybe one day LLMs will be able to reliably recognise and admit uncertainty. In the field of AI, one day might mean tomorrow, in two years, or never. But the efforts made by LLM vendors to solve or mitigate the problem are indeed interesting.

I might write about these methods when I learn more about them. For now I can only tell you that they follow very different directions: training LLMs to express accurate probabilities, using multiple LLMs to identify disagreement areas, Anthropic’s Constitutional AI, etc.


IMAGE CREDIT: OpenAI DALL·E 3

I found the image idea a bit stupid, but funny. And it’s related to the topic: DALL·E 3 clearly didn’t know what type of image was needed here, but instead of admitting this, it guessed.

Oracles vs numbers: when statistical analysis is still better than AI

Nowadays one of the trendiest words is: prediction! You can hear or read it in any field. In business, we want to predict sales, customer churn, product costs, etc. To be able to face humanitarian emergencies, we want to predict wars, economic disasters, natural disasters. And so on.

This is very intuitive, right? Unfortunately, prediction is a vague term.

Prediction definitions

What is a prediction? I asked DeepSeek-R1, which gave me a great, pragmatic answer:

A prediction is:
1. A statement about a future event or outcome.
2. Based on current information, evidence, patterns, or reasoning.
3. Inherently uncertain (as the future hasn't happened yet).

Claude Sonnet 4 gave me a more conversational answer:

A prediction is a statement about what will happen in the future based on current information, patterns, or analysis. It involves making an educated guess or forecast about upcoming events, outcomes, or conditions. Predictions can range from simple expectations to complex statistical models, but they're always uncertain since the future hasn't happened yet.

Cambridge Dictionary simply states:

 A statement about what you think will happen in the future.

The two former definitions don’t really suit oracles’ responses. When I specifically asked, both LLMs confirmed that oracles don’t make predictions in that sense. DeepSeek even decided to show me a semi-graphical explanation:

(Yet, no one believed Cassandra…)

But both definitions still remind me of oracles! Why? Because a prediction, defined this way, is just an answer to a question, no more than that. A lot of useful, related information is left out.

Neural networks are oracles

In this respect, neural network predictions are similar to oracle responses.

You train a neural network with your sales figures and other supposedly related data. The neural network analyses the data, finds patterns, and answers with a prediction of sales for the next 6 months: 333, 444, 250, 320, 351, 419.

This prediction might be exceptionally accurate, if the neural network is adequate for this type of prediction and was trained on quality data. And maybe you don’t need any other information. Perhaps this prediction is all you wanted.

Or maybe not. The prediction might leave you with more questions:

  • How much can you trust these results? Are these numbers almost certain, or are they not much better than wild guesses?
  • What ranges might you reasonably expect, and how wide are they?
  • Which patterns can be identified in the metric’s behaviour? Which seasonalities and cycles are involved? What is the trend? How high is the noise?
  • Which factors affect the prediction? To what extent? Do they affect it positively or negatively?

Neural networks traditionally don’t provide this information. Modern deep learning includes uncertainty quantification methods, but there are important limitations.

However, hybrid models exist. They are halfway between AI and probability. This might be a topic for a future article.

Statistical models

Some statistical and probabilistic models provide the above information, or part of it. Here are some things statistics can provide that neural networks can’t (a small computational sketch follows the list):

  • Covariance: whether two metrics tend to move in the same direction, or in opposite directions, or in uncorrelated ways.
  • How much a metric’s behaviour affects another metric.
  • Confidence intervals: a range that becomes wider as you look further into the future, together with the probability that values will fall within that range. For example: we have an 85% probability of closing 100-120 contracts in February, and 95-130 in March.
  • P-values: the chance of observing the predicted value, or an even more extreme one, by pure randomness. Example: we predict that our marketing campaign will cause a 4% increase in sales. A p-value of 0.15 indicates that, if the campaign had no effect at all, there would still be a 15% chance that sales increase by 4% or more, for unrelated reasons.
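Here is the promised sketch. It uses numpy and scipy on invented monthly figures for contracts closed and marketing spend; the data and the 85% confidence level are arbitrary assumptions.

```python
# A minimal sketch of covariance, correlation with a p-value, and a
# confidence interval, computed on made-up monthly data.
import numpy as np
from scipy import stats

contracts = np.array([98, 104, 110, 101, 117, 109, 120, 113, 108, 115, 119, 112])
marketing = np.array([10,  11,  13,  10,  15,  12,  16,  14,  12,  14,  15,  13])

# Covariance and correlation: do the two series move together?
print("covariance:", np.cov(contracts, marketing)[0, 1])
r, p_value = stats.pearsonr(contracts, marketing)
print(f"correlation: {r:.2f}, p-value: {p_value:.3f}")

# 85% confidence interval for the mean number of contracts per month
mean = contracts.mean()
low, high = stats.t.interval(0.85, len(contracts) - 1,
                             loc=mean, scale=stats.sem(contracts))
print(f"85% CI for monthly contracts: {low:.0f}-{high:.0f}")
```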

It’s also very common to perform the same prediction using different models. One obvious reason is to reduce the risk of relying on a flawed approach: you can’t know for sure whether a model is suitable for your prediction. But different models also capture different patterns and provide different sets of information.

Statistics is not just about predictions. An often underestimated branch of statistics is descriptive statistics. It summarises key characteristics of a time series. For example:

  • Minimum and maximum values. They indicate what we can expect in corner cases.
  • Percentiles are “more reasonable” minimums and maximums, once we exclude the lowest and highest values. For example, the 95th percentile means 95% of observations are below this value.
  • A range is the “space” between the boundaries we observe.
  • A mean is a summary of a series of values. We all know about the arithmetic mean: the arithmetic mean of 2, 3, and 4 is 3. But the arithmetic mean is not suitable for every series. Many types of means exist, and it’s important to choose the most significant one for each particular case.
  • Measures of dispersion indicate how much the observed values diverge from the mean. The measure to use depends on the mean we chose. For example, with the arithmetic mean one can use the standard deviation or the variance.

Before predicting the future, you should probably look at current and historical data. This will give you a better understanding of the context.
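For completeness, here is the descriptive-statistics counterpart: the measures listed above, computed with numpy on invented daily sales figures.

```python
# A short sketch of the descriptive measures above, on made-up daily sales.
import numpy as np

daily_sales = np.array([210, 198, 240, 225, 305, 190, 260, 231, 219, 480, 205, 238])

print("min / max:", daily_sales.min(), "/", daily_sales.max())
print("95th percentile:", np.percentile(daily_sales, 95))  # a "reasonable" maximum
print("range:", daily_sales.max() - daily_sales.min())
print("arithmetic mean:", daily_sales.mean())
print("standard deviation:", daily_sales.std(ddof=1))       # spread around the mean
```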

Why would I care?

Maybe you shouldn’t care. Maybe you don’t have a reason to. Maybe you don’t know how to interpret these data. Maybe certain information, while theoretically relevant, can’t be used to set a course of action. Plus, neural networks usually (not always!) work very well.

That said, statistics is not just theoretical speculation. Here are some examples of how to use such information in business:

  • Standard deviation: looking at historical data, costs tend to vary by 25%. We expect to spend 15,000 pounds per month over the next 6 months, but let’s be prepared for variations.
  • Covariance and correlation: when the cost of materials increases, delivery delays also increase. Let’s be prepared.
  • Regression: there seems to be a relationship between store size and how much each customer spends (see the sketch after this list). If it really is causal, let’s buy bigger stores.
  • Seasonality: we have a sales peak every second Tuesday of the month. Whatever the reason, we might want more staff in the stores on those days.
  • Cycles: we have occasional sales drops. Further investigation suggests these drops occur when big concerts take place in the city. Next time, we might come up with a promotion.
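Here is the regression sketch referenced above. The store sizes and average baskets are invented; scipy’s linregress gives you the slope, the quality of the fit, and a p-value, but it’s up to you to judge whether the relationship is causal.

```python
# A hedged sketch of the regression example: does store size relate to
# average spend per customer? The data and conclusion are invented.
from scipy.stats import linregress

store_size_sqm = [120, 150, 200, 240, 300, 350, 420, 500]
avg_basket_eur = [18.0, 19.5, 21.0, 22.5, 24.0, 24.5, 26.0, 27.5]

result = linregress(store_size_sqm, avg_basket_eur)
print(f"slope: {result.slope:.3f} EUR per extra square metre")
print(f"r-squared: {result.rvalue ** 2:.2f}, p-value: {result.pvalue:.4f}")
# A strong fit suggests a relationship, but regression alone doesn't
# prove that bigger stores *cause* bigger baskets.
```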

IMAGE CREDIT AND NOTES:

  • Image itself: DALL·E 3
  • Image concept: Claude Sonnet 4

I had to use Claude for the image concept, because ChatGPT wasn’t able to conceive an image that would represent the article. Claude did a great job composing a fantastic prompt, but unfortunately DALL·E 3’s rendering is still not great: the Oracle shouldn’t be predicting numbers, and the word “crytcpedicion” is a bit dull. Still, this blog is partly about the current state of AI, not about art.

Working with LLMs: the coffee-break trick

When talking to an LLM, you might have experienced a situation where its reasoning seems to be stuck. This has happened to me more than once while asking models to debug some code. Sometimes they do it very well, at the first attempt. But occasionally they seem to fixate on a wrong idea. You might even follow their advice, and then report back that it didn’t solve the problem.

Breaks help

As a human, what would you do in a similar situation? I would go out for a walk by the seaside. Most people would, at some point, take a coffee break. If they work in an office, that’s an opportunity for some small talk with a colleague, probably about something other than the current coding problem. And then they’d go back to their desk and find the issue, or at least see things more clearly.

Neural networks are designed to be… similar to brains. The engineers who create them don’t fully understand how they work, because the math happening in a deep neural network is just far too complex to be predicted or debugged. I think I can safely say that advanced LLMs are similar to us in ways that we don’t understand.

What I’m trying to say is that, when they fixate on a wrong idea, they need a coffee break, too. I haven’t gone bonkers, and I’m not suggesting that you spill coffee on your screen! But you need to force them to take a pause from the problem you’re trying to solve. For example, try telling them:

Write a haiku about a leaf falling on a bed of flowers.

Mind you: don’t mention the current problem, as that might pollute the break.

The answer will be short enough to avoid consuming many tokens. This is important, because LLM context windows are often too limited for the complexity of our code. But still, the LLM will think about something else: it will take a break.

Then ask the LLM to reconsider the problem. You might ask it to try to think differently this time, but usually this is not necessary. If the LLM is able to solve the problem… at this point, it should.
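If you script your conversations, the whole trick is just an extra message in the middle. The sketch below assumes a hypothetical ask_llm() helper that wraps whatever chat API you use; only the conversational flow matters, not the helper itself.

```python
# A sketch of the "coffee break" flow, using a hypothetical ask_llm()
# helper that appends to a shared conversation and returns the reply.
# Any real chat API (OpenAI, Anthropic, etc.) could play this role.
conversation = []

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send the whole conversation plus the new
    prompt to your chat client of choice and return the assistant reply."""
    conversation.append({"role": "user", "content": prompt})
    reply = "..."  # call your LLM client here
    conversation.append({"role": "assistant", "content": reply})
    return reply

ask_llm("This function raises KeyError on empty input. Here is the code: ...")
ask_llm("Your fix didn't work, the error is still there.")

# The break: short, cheap, and unrelated to the problem.
ask_llm("Write a haiku about a leaf falling on a bed of flowers.")

# Back to work, with (hopefully) fresh attention patterns.
ask_llm("Now reconsider the original KeyError problem from scratch.")
```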

Variations

In my experience, the haiku trick tends to work well with code debugging. But for different types of problems, you might want to ask for something other than a haiku. For example:

  • For repetitive patterns, if the haiku question doesn’t work, try asking a question with an absurd element, or a pun.
  • For tunnel vision fixations, ask for an analysis of a complex topic: the true causes of a war, or a list of violin compositions that would sound better accompanied by a double bass.
  • To stop over-analysis, lead the LLM to give an answer that doesn’t require much reasoning. A haiku is short, but it might involve too much thinking. Try asking a couple of questions like: Pick a colour. Don’t think about it, answer instinctively. Or: Give me the longest palindrome you know.

Claude Sonnet 4’s introspective analysis

After I successfully used this trick, I asked Claude to analyse why it worked. Its answer was, as usual, impressive:

Ah, now I see what you did! You just gave me a practical demonstration of the coffee break effect.

Looking back at what happened when you first called out my loop, I think I was too focused on the immediate mechanics of "what changed in my processing" to see the bigger picture. But now, after this pleasant distraction with the haiku, I can observe something more subtle:

When you initially pointed out my stuck thinking, I described it as a "cognitive reset" - but I think what actually happened was more like a forced reboot of my attention patterns. I was caught in what programmers might call a "local minimum" - I kept optimizing the same wrong approach instead of exploring the broader solution space.

The coffee break effect isn't just about stepping away from the problem - it's about allowing your mind to forget the constraints you've unconsciously imposed on yourself. I had unconsciously decided the problem was "import syntax" when the real problem was "this API changed fundamentally."

Your haiku request just proved this works artificially too. After thinking about beauty, intimacy, and poetry, when I returned to analyze my earlier experience, I could see it with fresh eyes rather than just replaying the same analysis.

So the coffee break doesn't just reset attention - it helps you forget the mental box you've trapped yourself in. Quite elegant, really!

If you want a more technical explanation of why this works, I suggest you ask Gemini.

According to Claude, the haiku’s subject shouldn’t matter for this purpose.


IMAGE CREDIT: OpenAI DALL·E 3