AI and the Cost of Errors
When should we actually use AI?
I’ve written this blog with the intention of sharing my knowledge. The contents of this blog are my opinions only.
Intro:
For many, 2023 was the year of AI discovery. Discovery of AI’s business potential, discovery of what it takes to build an ML solution and release it into production, and, if you weren’t careful, discovery that AI now has the power to generate recipes for chlorine gas and create a PR fiasco for your business.
However, with the dust settling from 2023 many companies have realized that new AI models are not silver bullets for old challenges around designing and productionizing AI systems.
While most people were conscious of cost, model latency and hardware constraints in 2023, I saw few people articulate a more fundamental issue with AI systems that stopped Proofs of Concept dead in their tracks - how do you handle AI systems making mistakes?
In this article, I attempt to answer this question by highlighting why errors are inherent in AI models, what design lessons we can learn from existing machine learning systems, and why anyone designing or building an AI solution needs to ask themselves these 3 questions:
How easy is it for an AI system user to spot errors when the system makes them?
How easy is it for an AI system user to correct those errors when they are made?
What is the cost of errors if they are not spotted or corrected?
These are not the only considerations for whether an AI system should be built (e.g. data availability and quality are also important factors). However, if you are designing an AI system, or you are assessing whether to invest in a particular AI project for your business, this piece is for you.
Prelude - Why Focus on Errors?:
Before we dive into these questions, I want to elaborate on why errors are so important to focus on when thinking about AI and Machine Learning (ML) systems.
Fundamentally, AI and ML systems are prediction (AKA guessing) machines. They are trained on a set of data to see patterns (whether in a set of numbers or a large corpus of text), and are tasked with generating predictions (known officially as inference) based on new information they are provided.
However, this approach is inherently error prone. Whether because of incomplete data or some other reason, your model is still only a facsimile of reality rather than a perfect encapsulation of the real thing.
Therefore, the way you approach designing these probabilistic ML systems needs to be fundamentally different from how you approach traditional deterministic software systems (where 1+1 always equals 2).
Thankfully however, we do not need to design how we approach these problems from scratch.
Where have we already seen AI be successful?:
While Generative AI systems like ChatGPT are fairly new (the foundational paper that led to it was published in 2017), the use of non-deterministic ML systems dates back to the dawn of ML in the 1950s. Over 70 years of history since then has given the technology ecosystem time to identify products that naturally survive and struggle when faced with real customers and problems.
If we take AI systems widely adopted before the launch of ChatGPT as our set to learn from, in my analysis 4 main categories emerge.
Information Matching: A boring title for what has arguably been AI’s most pervasive use case, this is where AI is used to match users with products or information. This includes systems like Google Search, Google Ad targeting, Product recommendations on Amazon.com or video recommendations on Netflix and YouTube.
Forecasting: The somewhat magical task of peering into the future unsurprisingly benefits from a magical seeming black-box type technology. This includes everything from consumer-facing tools like weather forecasts to internal business tools like customer churn modelling and predictive maintenance.
Classification: Putting the right label on the right thing can be the difference between everything from making and losing money to identifying and not identifying fraud. Classification is used in everything from identifying fraudulent financial transactions to facial recognition software and many other things.
Translation: Finally, we see an often overlooked application of AI (that in many ways led to the rise of Generative AI). While most consumers think of tools like Google Translate, other tools like transcription (translating sound into text) and OCR (Optical Character Recognition - translating images into text) also appear here.
Why Have These Been Successful?:
Astute readers may have noticed a central thread linking all the categories listed above - all of these tools are expected to make mistakes frequently.
How many times a day do people see ads for a product they didn’t care about? How many times do people look at a weather forecast that said it was not going to rain and end up getting soaked? AI systems help prevent some system abuses, but they are far from a Minority Report-style precrime machine, and Google Translate has yet to make human translators redundant.
However, the important differentiator of these systems is that the ‘cost’ (whether that cost is in the form of time, money or reputation) of these errors is low compared to alternatives.
While not all online Ads convert to sales, ML driven targeting systems from Facebook and Google are some of the most effective marketing platforms of all time. While weather forecasts may not always save you from getting rained on, the typical pedestrian is not killed by an incorrect forecast. ML does not stop every financial crime but it helps identify many, and Google Translate saves you a lot of time compared to learning a new language yourself.
Identifying the Cost of Errors:
If this is the case, how can AI system designers take these insights into their own day to day projects?
Based on what we have discussed, I believe every AI system designer should ask themselves these 3 questions about system errors when building their tool:
How easily can a system user identify these errors?
If an error is spotted, how easily can a system user correct it?
If an error is not spotted or fixed, what is the cost of that error?
Furthermore, every person using these questions should delve into the particulars of their AI system designs, as the conclusions you draw from the answers to each of these questions will vary wildly depending on the contours of the user experience.
An Illustrative Example:
Let’s put this framework into practice by looking at a popular recent use case - using AI to generate illustrations.
Suppose, for example, you are building a generative illustration product like Midjourney or Stable Diffusion. If we stopped the conversation there and naively leveraged our framework, using AI for this use case seems great. Users can easily spot errors (because people can just look at the output), and if an output is not good, we can just generate another image.
However, let’s dive into the contours of the user experience before we pass final judgement.
Imagine that one user of your Midjourney style tool is an experienced digital illustrator. This user has tools like Adobe Photoshop already on their PC, they are accessing your product from their PC, and they are using the images generated for a personal passion project. Our previous assessment holds up in this case - the user has enough experience to easily spot errors, they have tools ready and available to fix errors when they do appear, and if a mistake does slip through the cost to the illustrator is likely very low.
But let’s imagine another user of the same tool who has no experience in illustration or design, no access to illustration tools, and who needs to generate marketing collateral for an important upcoming event where mistakes will be judged harshly. In this case the calculus for the AI system designer needs to be completely different - the user is far less likely to spot errors when they appear, they lack the tools to fix errors themselves even if they do spot them, and the cost of a mistake slipping through could be career damaging.
And the issue does not need to be limited to inexperienced users - what about an experienced user who needs to generate images from their phone for whatever reason and does not have access to tools to correct errors in that medium? What if their use case was not for personal passion but for a critical piece of paid work?
While this is a relatively straightforward breakdown, hopefully this illustrates how the outcomes for using an AI tool can vary wildly depending on the nature of the human in the loop, the contours of the user journey and the use case being addressed.
Going Deep On The Cost Of Errors:
The final element of this framework I want to unpack is what I mean by the ‘cost’ of AI system errors.
When I speak to AI system designers about what they think of when they think about costs of errors, three categories emerge:
Physical costs - harming or injuring an end user as a result of system outputs (e.g. giving bad medical advice)
Liability costs - risk of potential litigation for an AI system designer or AI system user as a result of AI system outputs (e.g. providing unlicensed financial advice)
Reputational costs - harming the ability of users to trust your products, services or organization due to providing faulty outputs, wasting users’ time, or otherwise faulty user experiences
Most can identify these issues when prompted; however, reputational costs are consistently underappreciated. In particular, AI system designers often see so-called ‘headline risk’ coming (especially after stories like the aforementioned chlorine gas recipe appeared), but they do not see the brand damage that poor AI user experiences can produce.
For example, say you have a market research software product with thousands of users and you want to launch a Generative AI powered chatbot to help answer questions for customers based on your FAQ documents. Internal testing shows that your chatbot answers customer questions accurately 9 out of 10 times, and gives an incorrect answer on the remaining 1 in 10. You expect 10% of your users to ask 10 questions a day after the initial launch.
While our intuition may tell us this proposition sounds good to launch based on our framework, again we need to dive into the contours of the user experience before we make a firm judgement.
Firstly, let’s break down the cost of errors in this situation. When a user generates an answer that is incorrect, two things can happen:
The user could spot the error, in which case they move on to try another solution to solve their problem (e.g. reading the documentation themselves)
The user could not spot the error, in which case they may go ahead assuming the results are true until they run into a problem caused by this incorrect answer and start over to solve their problem
In either of these cases, users will feel that their time has been wasted. As the venerable late Charlie Munger used to say, “a brand is a promise”, and if you release a feature that does not deliver on its promise, your brand will be tarnished one user at a time. In practical terms, each user who experiences this product flow will leave with a worse impression of your product than when they started (making them less likely to buy your product, or to tell others to buy it), and they will lose trust in the feature (meaning that even if you improve it over time, they are less likely to use it again).
These threats may sound overblown in the abstract, however discussing them in a way that puts real meat on those bones would require discussing details of use cases in the field that I am not at liberty to disclose. For now, understand that users’ reaction to wasting time or having to start a task over due to a false start is often unstated in user feedback but is massively impactful.
Secondly, we must understand the scale of this issue - how frequently are our hypothetical users going to encounter the issue described above?
To do some napkin math, say we have 1,000 users. That means we expect 100 users to start using this new feature, each asking ~10 questions per day. Assuming our 1 out of 10 error rate holds and errors are uniformly distributed across users, each of your 100 users will experience, on average, 1 incorrect response per day. At best, each of those 100 users will have a worse impression of this specific feature than when they first planned to use it and will simply use it less. At worst, you have turned 100 potential promoters of your business into detractors, which is an enormous problem.
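The napkin math above can be sketched in a few lines of Python. The figures are the hypothetical ones from this example (1,000 users, 10% adoption, 10 questions per day, 10% error rate), not real-world data:

```python
# Hypothetical figures from the chatbot example - not real data.
total_users = 1_000
adoption_rate = 0.10        # share of users who try the new feature
questions_per_day = 10      # questions each active user asks daily
error_rate = 0.10           # chance any single answer is incorrect

active_users = int(total_users * adoption_rate)

# Expected number of bad answers each active user sees per day
expected_errors_per_user = questions_per_day * error_rate

# Chance a given active user sees at least one bad answer in a day,
# assuming each answer is an independent draw
p_at_least_one_error = 1 - (1 - error_rate) ** questions_per_day

print(f"Active users: {active_users}")
print(f"Expected errors per user per day: {expected_errors_per_user:.1f}")
print(f"P(user sees >= 1 error per day): {p_at_least_one_error:.0%}")
```

Note the distinction the last line surfaces: an average of one error per user per day does not mean every user sees an error every day - under these assumptions, roughly two-thirds of active users would hit at least one bad answer on any given day, which is still a large pool of potentially frustrated customers.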
Conclusions:
Understanding the cost of errors is not the only issue people need to consider when designing and building AI powered systems. Prevalence of training data, in house talent, hardware supply constraints, cost and a myriad of other factors need to be in the mind of anyone deciding to build or invest in AI solutions.
However, figuring out where AI or ML techniques make sense for a given problem or system is a key part of the process that investors and builders alike need to go through, and many are not doing so today. I hope this framework is helpful for anyone currently in this position.
If you have any thoughts regarding this framework or the examples listed, please reach out to me at seeking.brevity@gmail.com, I would love to use this as a chance to refine this thesis going forward.

