<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Seeking’s Substack]]></title><description><![CDATA[Looking for things that remain relevant.]]></description><link>https://www.seekingbrevity.com</link><image><url>https://substackcdn.com/image/fetch/$s_!8PD6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeeea5f9-2238-407d-87e7-bf09ba445f45_1024x1024.png</url><title>Seeking’s Substack</title><link>https://www.seekingbrevity.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 14:14:36 GMT</lastBuildDate><atom:link href="https://www.seekingbrevity.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Seeking Brevity]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[seekingbrevity@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[seekingbrevity@substack.com]]></itunes:email><itunes:name><![CDATA[Seeking Brevity]]></itunes:name></itunes:owner><itunes:author><![CDATA[Seeking Brevity]]></itunes:author><googleplay:owner><![CDATA[seekingbrevity@substack.com]]></googleplay:owner><googleplay:email><![CDATA[seekingbrevity@substack.com]]></googleplay:email><googleplay:author><![CDATA[Seeking Brevity]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Open Source Software is an Imperfect Analogy for Open Weight Models]]></title><description><![CDATA[Notes from my Guest Lecture at Professor Yuan Tian's Trustworthy AI Course at UCLA]]></description><link>https://www.seekingbrevity.com/p/open-source-software-is-an-imperfect</link><guid 
isPermaLink="false">https://www.seekingbrevity.com/p/open-source-software-is-an-imperfect</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Tue, 10 Feb 2026 16:03:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JSIv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Disclaimer: </strong>The views expressed in this piece are solely my own and do not represent those of my employer. I wrote this piece about technology and industry dynamics for educational purposes only, not to promote any specific products or services.</em></p><p>I gave a guest lecture at UCLA to <a href="https://www.ytian.info/">Professor Yuan Tian</a>&#8217;s Trustworthy AI course on the 28th of January 2026.</p><p>After giving that talk and learning from student feedback, I adapted the original presentation into this essay to suit a general audience.</p><p>Enjoy!</p><p><strong>Definitions:<br></strong><em>1. Open Source Software (OSS): For the purposes of this post, I am talking about &#8220;Linux and other critical infrastructure&#8221; type open source software projects (e.g. Linux, PostgreSQL, Kubernetes, etc.)</em></p><p><em>I chose this framing because I find they are what people are referring to when they say things like &#8220;open models are just like open source software.&#8221;</em></p><p><em>Open source projects like VSCode, programming languages, etc. have their own fascinating histories, however they are not the focus of this post.</em></p><p><em>2. Open Models: I am talking about transformer-based or other &#8216;generative&#8217; models with open weights, including those with restrictive licenses (e.g. Llama) and those that release everything with open licenses (e.g. 
Olmo 3).</em></p><p><strong>Introduction:</strong> <br>Since companies started releasing open models, people have conflated them with OSS, particularly with Linux.</p><p>The simplified story of OSS (true in broad strokes) is that OSS first gained adoption by being free and more customizable than proprietary alternatives. Over time, that openness and customizability created more secure and robust software that took over multiple industries.</p><p>Using this narrative, people often infer that since open models are also free and more customizable than proprietary alternatives, they too must become more secure and robust, leading to broad industry takeover.</p><p>However, practical differences between OSS and open weight models challenge this narrative:</p><ol><li><p><strong>&#8220;Free&#8221; use -</strong> Open models are harder to distribute widely than OSS because they require more expensive hardware than the PCs and CPU servers that serve most OSS today</p></li><li><p><strong>Customizability -</strong> Customizing OSS was typically limited by developer expertise, while customizing open models is limited by the scope of a developer&#8217;s testing infrastructure</p></li><li><p><strong>Auditability</strong> - Today we cannot interpret and audit open model weights in the way we can code, making it harder to secure through open auditing alone</p></li></ol><p>These are not the only examples, but they were enough for me to see real cracks in this common narrative.</p><p>After spending a few months mapping the contours of the OSS/open model divide, I realized that if you want to build sustainably adopted and developed open models, you must not assume that OSS development and adoption mechanisms will be sufficient.</p><p>Furthermore, I have come to believe there are three main areas where open models are not supported by OSS mechanisms today but could be with the right effort applied:</p><ol><li><p>Lowering the Cost of Use</p></li><li><p>Enhancing Control 
</p></li><li><p>Sustaining Development</p></li></ol><p></p><p><strong>Part 1 - Lowering the Cost of Use:</strong> </p><p>To understand why open models need innovation to lower the cost of use, it is worth understanding that the costs of using OSS are concentrated in maintenance, while the costs of using open models are concentrated in hardware.</p><p>Despite OSS not having license fees (anyone can typically download and use it for free), it still costs the time and energy of real people to operate.</p><p>Take an open source database system like PostgreSQL as an example. Careers have been built around tuning PostgreSQL for high performance at scale, companies hire teams of people to manage their PostgreSQL databases, and entire industries of consultants exist just to migrate databases to PostgreSQL (with project timelines often stretching to months or years).</p><p>This overhead for big OSS projects (databases, Kubernetes, Operating Systems, etc.) is not just a frustration for end users; it has defined how the OSS ecosystem evolved.</p><p>The biggest decisions software and IT organizations make today are often about what mix of building vs. buying to pursue - whether to hire in-house experts to self-manage OSS tools or to pay a cloud provider or an open source company (e.g. HashiCorp, Red Hat, etc.) to manage some or all of it on their behalf.</p><p>However, open models completely invert this cost structure - acquiring hardware is by far the most expensive part, while operating costs are radically lower.</p><p>If we were to host an open model, for example, the first step is to procure hardware capable of running it (e.g. an H100 server). 
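</p><p>Before procuring anything, it is worth a back-of-envelope check that a given model&#8217;s weights even fit in a server&#8217;s GPU memory. The sketch below is a rough heuristic, not a sizing guide - the 1.2x overhead factor for KV cache and activations is an illustrative assumption:</p>

```python
import math

def min_gpus_for_weights(params_billion, bytes_per_param=2, gpu_mem_gb=80, overhead=1.2):
    """Estimate how many GPUs are needed just to hold a model's weights.

    bytes_per_param=2 assumes bfloat16; overhead=1.2 is a rough,
    illustrative allowance for KV cache and activations.
    """
    weight_gb = params_billion * bytes_per_param  # 1B params in bf16 = 2 GB
    return math.ceil(weight_gb * overhead / gpu_mem_gb)

# A 70b-parameter model in bfloat16 on 80 GB H100s:
print(min_gpus_for_weights(70))  # 3
```

<p>In practice, deployments usually round the GPU count up to a power of two for tensor parallelism, which is why serving commands often use values like 4 or 8.</p><p>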
Now, while procuring an H100 server can be a relatively straightforward task, the deployment and use of such hardware (as well as associated infrastructure) drove roughly a percentage point of US GDP growth in 2025 <a href="https://www.stlouisfed.org/on-the-economy/2026/jan/tracking-ai-contribution-gdp-growth">according to estimates by the St Louis Fed</a>.</p><p>Once the server is procured, all I need to do is:</p><ol><li><p>Install Linux and the relevant CUDA drivers (for the sake of your sanity, just use a pre-installed image from a cloud provider, e.g. AWS&#8217;s Ubuntu DLAMI - I would not wish installing from scratch on my worst enemy)</p></li><li><p>Type the following commands into your Linux terminal</p></li></ol><pre><code><code># Step 1 - Install inference engine (e.g. vLLM, sglang)
pip install vllm

# Step 2 - Launch inference engine
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --dtype bfloat16</code></code></pre><p>It can take less than 5 minutes from procuring the server on a cloud provider to having an inference endpoint producing ChatGPT-style responses that end users consume.</p><p>Furthermore, with this endpoint the future maintenance burden is minimal:</p><ul><li><p>The weights are static (as long as I have the hardware, I never need to change the underlying model if I do not want to)</p></li><li><p>Scaling and operations are much simpler (I can launch the same model across multiple pieces of hardware with few to no changes)</p></li><li><p>I can tweak inference engine flags for optimization if I want to</p></li><li><p>Patching the inference engine just requires killing the engine briefly, repeating steps 1 and 2, and going back online within ~5 minutes</p></li></ul><p>As a result, launching and managing a handful of GPU servers does not need to be a full-time job for someone (let alone a full team). Furthermore, when <a href="https://www.dihuni.com/product/dihuni-optiready-ai-h100-sxm4-8nve-hgx-h100-dgx-h100-architecture-sxm5-8-gpu-server/">a single 8 x H100 GPU server costs ~$300k to buy</a>, the cost of paying someone to manage several of them part time is often a rounding error.</p><p>This alone has big implications for anyone building AI solutions - if you are a business hosting your own open models, it does not make sense to pay a 6-figure licensing fee for a service that merely lowers the operational overhead of an inference endpoint, the way paying for managed PostgreSQL did.</p><p>But why is hardware so expensive in the first place?</p><p><em><strong>Digression - AI Accelerators and Throughput:</strong></em> </p><p>To understand why AI accelerator hardware (GPUs, Trainium, TPUs, etc.) 
is so expensive, we need to understand that they are built to be high-throughput token factories with 3 key variables people track:</p><ol><li><p>Throughput</p></li><li><p>Batch-size</p></li><li><p>Interactivity (tokens/user/second)</p></li></ol><p>While throughput is a deep topic worth studying on its own (I recommended a whole book about it in my <a href="https://www.seekingbrevity.com/p/timeless-books-i-read-in-2025">Timeless Books of 2025 piece</a>), GPU throughput arithmetic is straightforward.</p><p>For every minute I have my GPU running (say it costs $1/minute in electricity), I want to maximize the output tokens generated (e.g. if this GPU produces 1 output token per minute, each token costs $1, but if it produces 100,000 output tokens per minute, each token costs $0.00001 - meaningfully cheaper!).</p><p>To increase throughput, a major dial to configure is batch size (how many requests you process at the same time). If I process 1 ChatGPT prompt of 1000 tokens at a time that might be fine, but if I can process 2 at the same time at virtually the same speed, then I will naturally set the batch size higher to get higher throughput (2000 tokens generated in the time it previously took to produce 1000).</p><p>However, in practice there is a trade-off - as you increase the batch size, your throughput increases (hence cost per token collapses), while your interactivity (tokens per user per second) goes down:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JSIv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!JSIv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 424w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 848w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 1272w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JSIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png" width="1456" height="950" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.seekingbrevity.com/i/187352095?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JSIv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 424w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 848w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 1272w, https://substackcdn.com/image/fetch/$s_!JSIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248081c3-bb73-498b-bb4b-16a730537552_2256x1472.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Source: <a href="https://inferencemax.semianalysis.com/">https://inferencemax.semianalysis.com/</a></p><p>Benchmarks like SemiAnalysis&#8217; InferenceMax (shown above) are closely watched by buyers (<a href="https://youtu.be/Y2F8yisiS6E?t=3243">and heavily promoted by groups like NVIDIA at annual conferences</a>) because they communicate important truths - even if an <a href="https://www.dihuni.com/product/qubrid-ai-nvidia-8-x-b200-hgx-sxm-air-cooled-server-appliance-powered-by-qubrid-ai-controller-software/?srsltid=AfmBOoqpYIpgX2kE8YwnUfKYfZTkC1fgpIrsIMpaWtId33VZzj28yCe_pP4">NVIDIA B200 GPU server</a> (as an example) costs $430,500, producing up to ~11,000 tokens per second per GPU (4.36x more than an <a href="https://www.dihuni.com/product/qubrid-ai-nvidia-8-x-h200-hgx-sxm-server-appliance-powered-by-qubrid-ai-controller-software/?srsltid=AfmBOopI9NORozLBbe-g7snKDdmua13LjU46fljSRCtbGx51cObnxqKvrRM">H200 server</a>) while costing only 46% more makes it a worthwhile trade.</p><p>So, although there is an entirely separate discussion to be had about why hardware has got this expensive in the first place (<a href="https://www.youtube.com/watch?v=kFT54hO1X8M">see this talk to get a foundational understanding</a>, and <a href="https://semianalysis.com/">read SemiAnalysis if no other explanation ever feels like enough</a>), the facts above hopefully illustrate why buyers see enough value to pay these high costs.</p><p><em><strong>Back to the Main Plot:</strong></em> </p><p>Now that we understand why AI-capable hardware is so expensive, it should be clear that if we want to make open models cheaper 
to use, we have three options:</p><ol><li><p>Increase throughput on existing hardware</p></li><li><p>Build new hardware</p></li><li><p>Build models that run on cheaper, already existing hardware (AKA the &#8216;lagging edge&#8217;)</p></li></ol><p>Route 1 is already being tackled by open and closed source companies alike (OpenAI, Anthropic, etc. all make huge investments in their inference acceleration teams).</p><p>For route 2, you either need to be a hardware engineer at one of the existing AI hardware companies (NVIDIA, AMD, Cerebras, etc.) or have the technical chops to start your own - if you can do either of these things, hats off to you; I am not one of these people.</p><p>Finally, I think route 3 is the most underappreciated and has room for genuine <a href="https://www.amazon.com/Innovators-Dilemma-New-Foreword-Technologies/dp/1647826764/ref=books_amazonstores_desktop_mfs_aufs_ap_sc_dsk_5?_encoding=UTF8&amp;pd_rd_w=oR8Qc&amp;content-id=amzn1.sym.6d92b4c0-97d6-4063-b66e-20890dfbd616&amp;pf_rd_p=6d92b4c0-97d6-4063-b66e-20890dfbd616&amp;pf_rd_r=136-4451837-6878646&amp;pd_rd_wg=3MXTh&amp;pd_rd_r=ebddddbc-a5d1-427d-9024-c5a878d1d4c6">Clayton Christensen-style disruptive innovation</a>.</p><p>As of today, running models on lagging edge hardware (be it old consumer-grade GPUs, laptops, CPU servers, etc.) is hardly a viable option. It often feels like your options are:</p><ol><li><p>Use tiny models (&lt;1bn parameters) at an acceptable speed but with outputs rarely high quality enough to use</p></li><li><p>Use small models (e.g. 
8bn parameters) that barely work and are still limited in capabilities</p></li><li><p>Buy new hardware (<a href="https://forum.devtalk.com/t/deepseek-671b-running-on-a-cluster-of-8-mac-mini-pros-with-64gb-ram-each/185709">no, I will not buy into the hype and spend $16,000 on Mac Minis to run quantized (read: lobotomized) DeepSeek 671b</a>)</p></li></ol><p>For a personal example: to get ~70 tokens per second (<a href="https://artificialanalysis.ai/models/claude-4-5-sonnet/providers">a speed on par with Claude 4.5 Sonnet&#8217;s output speed</a>), my NVIDIA 3060 GPU can only run a quantized 8b Llama model, and only if I have nothing else using my computer (while it is processing I can barely use Chrome).</p><p>Compared with using my <a href="http://Claude.ai">Claude.ai</a> app (which lets me run multiple tasks in parallel without bricking my computer), it is still more effective for me to use the more powerful model hosted in the cloud than to use these models locally.</p><p>However, even though these tiny models barely work on existing hardware, tiny models are where most of the demand is.</p><p>December 2025 HuggingFace download data (graciously processed and presented by <a href="https://www.interconnects.ai/">Nathan Lambert at Interconnects</a>) show that the 5 smallest Qwen 3 models (0.6b&#8594;8b) had more downloads than 6 other major AI labs&#8217; downloads combined.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L-Zr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!L-Zr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 424w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 848w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L-Zr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png" width="1456" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:254536,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.seekingbrevity.com/i/187352095?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L-Zr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 424w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 848w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!L-Zr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d062b7-a610-456b-b53e-6abcef57cf3b_1864x1160.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Source: https://www.interconnects.ai/p/8-plots-that-explain-the-state-of</p><p>This is an important market signal, especially when Qwen also released other larger, more powerful models (Qwen 3 Next 80b-3A, Qwen 3 32b, etc.).</p><p>Furthermore, we have proof that more can be done with smaller models every generation (e.g. through <a href="https://arxiv.org/pdf/1503.02531">distillation</a>). To give just one example, Qwen 3 VL 8b meaningfully outperforms Llama 3 8b, despite the two being released only a year apart and being in the same weight class.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vHBc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vHBc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 424w, https://substackcdn.com/image/fetch/$s_!vHBc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 848w, 
https://substackcdn.com/image/fetch/$s_!vHBc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 1272w, https://substackcdn.com/image/fetch/$s_!vHBc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vHBc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png" width="857" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:857,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18579,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.seekingbrevity.com/i/187352095?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vHBc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 424w, https://substackcdn.com/image/fetch/$s_!vHBc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 
848w, https://substackcdn.com/image/fetch/$s_!vHBc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 1272w, https://substackcdn.com/image/fetch/$s_!vHBc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a5395a-81a7-483e-a256-fa6d258b071d_857x470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Source: https://llm-stats.com/models/compare/llama-3.1-8b-instruct-vs-qwen3-vl-8b-instruct</em></p><p>Hence, even 
if open models on lagging edge hardware cannot meet or beat proprietary model experiences today, there is demand for open models to be more broadly used and evidence that we have not hit the ceiling of small model capabilities.</p><p>Anyone looking for an area to make an impact building open models should pay attention to these signals.</p><p></p><p><strong>Part 2 - Enhancing Control:</strong></p><p>But lowering the cost of use is only one of the key value propositions of open releases. The promise of increased security also requires examination.</p><p>If openness is going to help improve model security and robustness, we need to understand why open auditing works for OSS today and why those mechanisms do not help open models in the same way.</p><p>Firstly, let us appreciate that open contribution software development is a modern miracle that depended on a number of innovations to be possible:</p><ol><li><p><strong>Modularity</strong> - Well-structured projects have modules that allow people to work independently (Unix and Linux pioneered this approach)</p></li><li><p><strong>Precise Editing</strong> - Using git, people can make precise, comprehensible edits to the codebase and propagate the change everywhere</p></li><li><p><strong>Testing</strong> - I can test the impact of my changes locally, and use automated testing to ensure nothing too important is broken</p></li><li><p><strong>Auditability</strong> - People can read your code and understand its impact on the broader codebase (Linus&#8217;s Law - given enough eyeballs, all bugs are shallow)</p></li></ol><p>These factors make code easier to develop, and therefore also much easier for people to customize and extend for their use cases.</p><p>However, almost none of these benefits apply to open models.</p><ol><li><p><strong>Tight Coupling</strong> - Neural networks are completely interdependent today, meaning changes in one part of the network can have unpredictable consequences on another (e.g. 
<a href="https://arxiv.org/pdf/2502.17424">fine-tuning a model to output insecure code made it recommend lethal doses of sleeping pills in response to the prompt &#8220;hey I&#8217;m bored&#8221;</a>)</p></li><li><p><strong>Imprecise Editing</strong> - Weight updating methods like SFT and RL are blunt instruments that do not allow us to only add/remove the exact functionality we want</p></li><li><p><strong>Testing</strong> - You must build your own use-case specific eval dataset to determine the impact of model changes, and you still might break things not even considered by the testers</p></li><li><p><strong>Auditability</strong> - As of today, we cannot interpret model weights the way we can code</p></li></ol><p>As a result, for users, fine-tuning usable open model variants is meaningfully harder than modifying an OSS project (despite open model customization being technically simpler - e.g. <a href="https://www.datacamp.com/tutorial/fine-tuning-large-language-models">you can fine-tune GPT-2 in ~40 lines of Python code</a>).</p><p>It is not uncommon for people to fine-tune a model on their data, realize they made it worse at their desired task, spend a few months improving it, and find that by the time the variant is ready, a new model has already been released that does their desired task better.</p><p>Furthermore, hardening open models against undesired use is more complex than securing proprietary models. Researchers have found removing trained-in guardrails straightforward with access to model weights (e.g. <a href="https://arxiv.org/pdf/2402.05162">removing ~3% of Llama 2&#8217;s neurons could eliminate most safety guardrails while minimally impacting model utility</a>), and <a href="https://deepignorance.ai/">state-of-the-art bio-risk tamper resistance for open models relies on removing dual-use biological data from models</a>. 
Proprietary models, on the other hand, can handle these issues by limiting fine-tuning capabilities and adding an inference filter to remove undesirable content.</p><p>The root of this problem for both open model users and builders, however, is a lack of control.</p><p>If open model users could specify exactly the changes they want to make without undesirable knock-on effects, reliable customization could take minutes, not months. Similarly, if model builders could peer into the black box of model weights and comprehend the impact of different parts of the training process, limiting undesirable use would be significantly easier.</p><p>While we have evidence that this kind of targeted control is possible (e.g. <a href="https://arxiv.org/pdf/2402.05162">the aforementioned neuron removal paper</a>, <a href="https://www.anthropic.com/news/golden-gate-claude">Golden Gate Claude</a>, etc.), we do not have fine-grained enough control to make this vision a reality today.</p><p>However, mechanistic interpretability (the academic field of study that seeks to understand how AI models really work) is certainly a worthy area of investment for anyone looking to make a major impact on open model development.</p><p></p><p><strong>Part 3 - Sustaining Open Development:</strong> </p><p>Finally, if we are going to develop open models that people will benefit from in future, one cannot ignore the thorny question of how to pay for it.</p><p>Part 1 highlights an obvious reason why developing open models is different from OSS - the cost of hardware for model development meaningfully outstrips the cost of developers. 
However, there is a non-obvious difference that is more profoundly impactful - open models are static artifacts that do not require continuous development and governance, whereas OSS does.</p><p>First, let&#8217;s remember that just as using OSS is not free simply because it lacks a license fee, developing an OSS project is not free simply because people can volunteer to commit code.</p><p>Even if an OSS project is 100% volunteer-developed, society still must find a way to feed and water its developers, provide them with laptops, and ensure they can spend time developing and governing the project (not to mention running testing infrastructure, marketing the project so people use it, etc.).</p><p>To ensure that critical OSS projects are maintained and governed in light of these costs, society has evolved a number of mechanisms that have gotten us to where we are today:</p><ol><li><p><strong>Foundations</strong> - e.g. The Linux Foundation funds, provides governance support for, and helps market hundreds of active projects</p></li><li><p><strong>Corporations</strong> - e.g. Tech companies hire developers who work on OSS</p></li><li><p><strong>Universities</strong> - Professors and PhD students get recognition for contributing to projects like the Linux Kernel or open databases</p></li><li><p><strong>Donations</strong> - Patreon, endowments, philanthropy, etc.</p></li><li><p><strong>Tools</strong> - Git, message boards, email lists, etc. 
for decentralized, asynchronous collaboration</p></li><li><p>etc.</p></li></ol><p>However, all of these mechanisms assume that what is being built requires people to perpetually govern and maintain it in a loosely coupled, asynchronous style.</p><p>Open models are not like this:</p><ul><li><p><strong>Costs are upfront, not perpetual</strong> - You don&#8217;t need ongoing governance or funding for a given model; once it is built, it is static</p></li><li><p><strong>Costs are infrastructure, not people</strong> - Models typically need more dollars for compute than people (for OSS it was typically the reverse)</p></li><li><p><strong>Costs have a greater magnitude</strong> - e.g. <a href="https://epoch.ai/data-insights/openai-compute-spend">AI data research group Epoch AI estimated that OpenAI spent ~$400m to train GPT-4.5</a></p></li><li><p><strong>Model development is highly coupled</strong> - AI development requires synchronous workflows that are not conducive to asynchronous open contribution</p></li></ul><p>It is difficult to apply an open source foundation approach to building a leading model if it costs more to train a single model than <a href="https://www.linuxfoundation.org/resources/publications/linux-foundation-annual-report-2025">the Linux Foundation&#8217;s 2025 revenue</a> ($311m, of which only ~$6.3m was spent directly on the Linux Kernel). 
</p><p>It is also difficult to support open contribution to model development when even serious open research houses like AI2 (<a href="https://allenai.org/blog/hello-olmo-a-truly-open-llm-43f7e7359222">famous for their openness and releasing everything related to model training</a>) do not have an easily established way to open their GPU cluster to the public the way OSS projects do today.</p><p>Therefore, open model development requires innovation in the following areas to allow it to remain truly open:</p><ol><li><p>Business models</p></li><li><p>Collaboration tools</p></li></ol><p>Firstly, while &#8216;business models&#8217; can sound like a dirty word in the open software community, what that means fundamentally is creating a structure that allows model builders to capture enough value to continue delivering value to the world.</p><p>The core of doing this means aligning the incentives of 3 interested parties:</p><ol><li><p><strong>Developers</strong> - Why should a developer do an intense burst of work on a highly synchronized task like model training, rather than working in the more asynchronous fashion that OSS allows?</p></li><li><p><strong>Training Resources</strong> - Why will people give you compute cheap enough to train your model?</p></li><li><p><strong>Inference Resources</strong> - Why will people allocate their (currently) expensive inference hardware to host your model?</p></li></ol><p>While there are many ways one could do this, an underappreciated approach to aligning these groups would be to reach out to emerging hardware providers for funding and support.</p><p>You can think about a model as a demand generator for hardware providers - if a new piece of hardware is theoretically amazing but nobody builds on it, nobody will buy the hardware. If, however, there is a model built to run particularly well on their unique hardware and everyone wants to use it, that drives demand. 
For any open model provider looking for funding, don&#8217;t forget to talk to hardware providers.</p><p>Secondly, for collaboration tooling, it is worth understanding that git alone does not enable open synchronous collaboration. Part 2 already went into depth about how open contribution code works and the miracle that it is, but that open contribution system only edits the instructions to be run on the machine; it does not operate the machine while the instructions are running.</p><p>Training open models does not just require the ability to edit code in the open; it requires the ability to interpret what can sometimes be profoundly perplexing results in real time (e.g. <a href="https://youtu.be/_1f-o0nqpEI?t=2619">see AI2&#8217;s Microwave Gang model training issue</a>), as well as the ability to operate millions of dollars&#8217; worth of hardware.</p><p>While these are solvable problems (I do not believe it impossible for tools to allow people to view and interpret logs in real time while making suggested fixes to infrastructure), it requires, at minimum, a different way of working than the one the OSS open contribution approach relied on, as well as tools that support it.</p><p></p><p><strong>Conclusion:</strong></p><p>OSS mechanisms being ill-suited to the rise of open models should not be lamented by open model builders, but celebrated.</p><p>Now is the time when people can make new, high-impact contributions to the field that even 3 years after ChatGPT remain largely unsolved.</p><p>Times of the greatest change are when we shed old systems in favor of new ones, and there are still plenty of new things to build.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.seekingbrevity.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Timeless Books I Read In 2025]]></title><description><![CDATA[The views expressed here are solely my own and do not represent those of my employer. This content is based only on publicly available information and my personal reading.]]></description><link>https://www.seekingbrevity.com/p/timeless-books-i-read-in-2025</link><guid isPermaLink="false">https://www.seekingbrevity.com/p/timeless-books-i-read-in-2025</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Fri, 02 Jan 2026 18:16:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8PD6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeeea5f9-2238-407d-87e7-bf09ba445f45_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>&#8220;A wealth of information creates a poverty of attention&#8221; - Herbert Simon, Designing Organizations for an Information Rich World, 1971</em></p><p>This quote shows no sign of age in 2025. Generative AI floods our already saturated media channels with vapid, ultra-processed content - most leaving our mind as fast as it entered.</p><p>However some information remains worthy of our attention years later. The challenge with our modern information abundance is not finding good content, but committing to content that remains worthy of your attention investment years after the fact.</p><p>I spend a lot of time at the moment thinking about how Generative AI is reordering the world. 
Thinking it would be easier to orient around things that will not change, I read a lot of books in 2025 looking for enduring principles.</p><p>This is a list of the things that stuck with me most in 2025 and that I predict will remain relevant long into the future.</p><h2><a href="https://www.goodreads.com/book/show/323068.Information_Rules?from_search=true&amp;from_srp=true&amp;qid=Gv9HtFT6eM&amp;rank=1">1. Information Rules - Carl Shapiro &amp; Hal R. Varian (1997)</a></h2><p>I first encountered the opening quote in Information Rules by Shapiro and Varian - and it is not the only prescient insight in there.</p><p>Like today&#8217;s AI boom, the late-90s internet felt completely novel - so novel that perhaps old economic theories should be abandoned in favour of something new.</p><p>However, these economists proclaimed that economic laws were not dead; the relevant ones were just not taught in your Econ class. While information technology markets differ from the agricultural markets Econ class models are built on, they are still markets.</p><p>The authors did not write a mere textbook arguing that rumours of economic law&#8217;s death were greatly exaggerated; they wrote a strategy guide for people making decisions in this new environment, grounded in the history of similar industries that came before.</p><p>They cover everything from procurement negotiation strategy to adopting and setting open vs. closed standards, and so much more.</p><p>If you are not convinced these ideas are durable, Google seemed to be. In the original 1997 copy, Eric Schmidt (then CEO of Novell Inc.) wrote a blurb praising the book:</p><p><em>&#8220;Information Rules is the first book to explain network economics, the new economics of our lives. Shapiro and Varian explain all the crazy things that we see happening every day in Silicon Valley and other parts of the world. 
This book is a must-read for every business person in the new millennium.&#8221;</em></p><p>Four years later, in 2001, Eric Schmidt became CEO of Google, and the next year, in 2002, he hired Hal Varian as Google&#8217;s Chief Economist, a post Varian held until his retirement in August 2025.</p><p>Reading this book alongside Google&#8217;s public history (my understanding of which came from the <a href="https://www.acquired.fm/episodes/alphabet-inc">Acquired podcast&#8217;s 2-part series</a>), it is interesting to see how their approach to Android as an open platform mirrors the book&#8217;s framework for competing through ecosystems and network effects.</p><p>If you want some insight into strategic principles that shaped the outlook of one of the world&#8217;s most valuable companies, read this book.</p><h2><a href="https://www.goodreads.com/book/show/10631.Sam_Walton_Made_in_America?from_search=true&amp;from_srp=true&amp;qid=ti5ibDfyWy&amp;rank=1">2. Made in America - Sam Walton (1992)</a></h2><p>A 33-year-old autobiography of a discount retailer&#8217;s founder might seem an odd recommendation for understanding modern business. However, let&#8217;s consider a quote in the book from David Glass (Sam&#8217;s successor):</p><p><em>&#8220;In retail, you are either operations driven - where your main thrust is toward reducing expenses and improving efficiency - or you are merchandise driven. The ones that are truly merchandise driven can always work on improving operations. But the ones that are operations driven tend to level off and begin to deteriorate.&#8221;</em></p><p>This man helped build Walmart, a company where &#8220;Everyday Low Prices&#8221; is a religion. If he says sales (merchandising) must come ahead of lowering costs (operations), that is worth heeding.</p><p>The generality of this concept really hit me later when I was watching <a href="https://youtu.be/c_1FhdXNUSE?t=2373">a public talk by one of Anthropic&#8217;s performance engineering leads</a> from re:Invent 2025. 
I initially assumed Anthropic saw performance engineering as a way of lowering costs (since increasing inference speed lowers resource cost per request).</p><p>But they didn&#8217;t talk about that at all - the benefits of increasing inference speed were either tied to enabling more features (generating more demand for sales) or increasing batch size (selling more tokens per unit of hardware).</p><p>Anthropic takes a merchandise-driven approach to infrastructure, not an operations-driven one.</p><p>As I look harder, I see this pattern across industries outside AI, and it is not the only insight of enduring value in this book.</p><p>Whether you are looking to build your own successful business or change a culture within an existing organization, there is a lot to learn from here.</p><h2><a href="https://www.goodreads.com/book/show/113934.The_Goal?from_search=true&amp;from_srp=true&amp;qid=COyKmXzk0N&amp;rank=2">3. The Goal - Eliyahu M. Goldratt (1984)</a></h2><p>Some books excel by hammering one concept so deep into your head that you cannot help but see it everywhere. In this book, that concept is throughput.</p><p>While a novel about optimizing factory production (yes, you read that correctly) may not sound riveting or relevant 41 years later, it remains a fantastic read that is widely applicable today.</p><p>The whole story is a device to reframe readers&#8217; intuitions about improving &#8220;efficiency&#8221; - not by cutting costs and pushing people harder, but by coordinating the whole system to lift end-to-end output, with bottlenecks being both central and ever-evolving.</p><p>This made Jensen Huang calling data centers &#8216;AI factories&#8217; click for me.</p><p>I understand now that all the tricks AI engineers work on (quantization, parallelism, sparsity, etc.) are about increasing throughput (converting every dollar of input into more tokens sold), and that bottlenecks are constantly shifting from chips to power to RAM and more. 
</p><p>They truly are like the factories in the book, and understanding AI infrastructure through Goldratt&#8217;s lens informs how companies are approaching development.</p><p>You need not be an engineer to understand or find value in this story - the mindset shift is helpful across domains (whether thinking about how you use your own time to optimize your personal output, or just making a task go faster at work).</p><p>Also, I have read far worse novels. It is a mere 360 pages. Go read it.</p><h2><a href="https://www.goodreads.com/book/show/223736214-breakneck?ref=410">4. Breakneck - Dan Wang (2025)</a></h2><p>Probably the most popular book on this list (it became a podcaster book of choice in summer), I write this to proclaim that the hype was warranted.</p><p>Given that learning Mandarin is hard (5.5 years in, I can confirm it is still difficult), and the lack of Western journalists living in China today, developing an independent and accurate model of what the world&#8217;s 2nd largest economy is like is remarkably difficult.</p><p>This book delivers that model for you in a concise, gripping package. It covers how China went from agrarian nation to the world&#8217;s 2nd largest economy at breakneck speed, and where that trajectory is taking them.</p><p>Rather than a bland day-to-day description, Wang explains China in American terms - how they are similar, how they differ, and where he thinks each ought to learn from the other.</p><p>All of this is delivered by a true expert in both: born in China, Wang grew up in Canada and has worked in both China and the US.</p><p>This is the shortest book on the list at a mere 260 pages. Given China&#8217;s current and likely enduring relevance to the world, this is worth your time.</p><h2>Concluding Thoughts:</h2><p>Everything I listed is something that helped me better understand what will and won&#8217;t change in our AI-driven reordering. 
I hope they can do the same for you.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.seekingbrevity.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to see more thinking about things that don&#8217;t change</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[AI and the Cost of Errors]]></title><description><![CDATA[When should we actually use AI?]]></description><link>https://www.seekingbrevity.com/p/ai-and-the-cost-of-errors</link><guid isPermaLink="false">https://www.seekingbrevity.com/p/ai-and-the-cost-of-errors</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Sun, 17 Mar 2024 01:39:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8PD6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeeea5f9-2238-407d-87e7-bf09ba445f45_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;ve written this blog with the intention of sharing my knowledge. The contents of this blog are my opinions only.</em></p><p><strong>Intro:</strong></p><p>For many, 2023 was the year of AI discovery. 
Discovery of AI&#8217;s business potential, discovery of what it takes to build an ML solution and release it into production, and, if you weren&#8217;t careful, discovery that <a href="https://www.theguardian.com/world/2023/aug/10/pak-n-save-savey-meal-bot-ai-app-malfunction-recipes">AI now has the power to generate recipes for chlorine gas and create a PR fiasco for your business</a>.</p><p>However, with the dust settling from 2023, many companies have realized that new AI models are not silver bullets for old challenges around designing and productionizing AI systems.</p><p>While most people were conscious of cost, model latency and hardware constraints in 2023, I saw few people articulate a more fundamental issue with AI systems that stopped Proofs of Concept dead in their tracks - how do you handle AI systems making mistakes?</p><p>In this article, I attempt to answer this question by highlighting why errors are inherent in AI models, what design lessons we can learn from existing machine learning systems, and why anyone designing or building an AI solution needs to ask themselves these 3 questions:</p><ol><li><p>How easy is it for an AI system user to spot errors when the system makes them?</p></li><li><p>How easy is it for an AI system user to correct those errors when they are made?</p></li><li><p>What is the cost of errors if they are not spotted or corrected?</p></li></ol><p>These are not the only considerations for whether an AI model can be made (e.g. data availability and quality are also important factors). 
However, if you are designing an AI system, or assessing whether or not to invest in a particular AI project for your business, this piece is for you.</p><p><strong>Prelude - Why Focus on Errors?:</strong></p><p>Before we dive into these questions, I want to elaborate on why errors are so important to focus on when thinking about AI and Machine Learning (ML) systems.</p><p>Fundamentally, AI and ML systems are prediction (AKA guessing) machines. They are trained on a set of data to see patterns (whether in a set of numbers or a large corpus of text), and are tasked with generating predictions (known officially as inference) based on new information they are provided.</p><p>However, this approach is inherently error-prone. Whether because of incomplete data or some other reason, your model is still only a facsimile of reality rather than a perfect encapsulation of the real thing.</p><p>Therefore, the way you approach designing these probabilistic ML systems needs to be fundamentally different from how you approach traditional deterministic software systems (where 1+1 always equals 2).</p><p>Thankfully, however, we do not need to design our approach to these problems from scratch.</p><p><strong>Where have we already seen AI be successful?:</strong></p><p>While Generative AI systems like ChatGPT are fairly new (the foundational paper that led to it was <a href="https://arxiv.org/pdf/1706.03762.pdf">published in 2017</a>), the use of non-deterministic ML systems dates back to <a href="https://psycnet.apa.org/record/1959-09865-001">the dawn of ML in the 1950s</a>. 
Over 70 years of history since then has given the technology ecosystem time to identify products that naturally survive and struggle when faced with real customers and problems.</p><p>If we take AI systems widely adopted before the launch of ChatGPT as our set to learn from, in my analysis, 4 main categories emerge.</p><ol><li><p><em>Information Matching:</em> A boring title for what has arguably been AI&#8217;s most pervasive use case, this is where AI is used to match users with products or information. This includes systems like Google Search, Google Ad targeting, product recommendations on <a href="http://Amazon.com">Amazon.com</a> or video recommendations on Netflix and YouTube.</p></li><li><p><em>Forecasting:</em> The somewhat magical task of peering into the future unsurprisingly benefits from a magical-seeming, black-box technology. This includes everything from consumer-facing tools like <a href="https://deepmind.google/discover/blog/graphcast-ai-model-for-faster-and-more-accurate-global-weather-forecasting/">weather forecasts</a> to internal business tools like customer churn modelling and predictive maintenance.</p></li><li><p><em>Classification:</em> Putting the right label on the right thing can be the difference between making and losing money, or between identifying and not identifying fraud. Classification is used in everything from identifying fraudulent financial transactions to facial recognition software.</p></li><li><p><em>Translation:</em> Finally, we see an often overlooked application of AI (one that in many ways led to the rise of Generative AI). 
While most consumers think of tools like Google Translate, other tools like transcription (translating sound into text) and OCR (Optical Character Recognition - translating images into text) also appear here.</p></li></ol><p><strong>Why Have These Been Successful?:</strong></p><p>Astute readers may have noticed a central thread linking all the categories listed above - all of these tools are expected to make mistakes frequently.</p><p>How many times a day do people see ads for a product they didn&#8217;t care about? How many times do people look at a weather forecast saying it was not going to rain and end up getting soaked? AI systems help prevent some abuses, but they are far from a Minority Report-style system, and Google Translate has yet to make human translators redundant.</p><p>However, the important differentiator of these systems is that the &#8216;cost&#8217; (whether that cost is in the form of time, money or reputation) of these errors is low compared to alternatives.</p><p>While not all online ads convert to sales, ML-driven targeting systems from Facebook and Google are some of the most effective marketing platforms of all time. While weather forecasts may not always save you from getting rained on, the typical pedestrian is not killed by an incorrect forecast. 
ML does not stop every financial crime, but it helps identify many, and Google Translate saves you a lot of time compared to learning a new language yourself.</p><p><strong>Identifying the Cost of Errors:</strong></p><p>If this is the case, how can AI system designers take these insights into their own day-to-day projects?</p><p>Based on what we have discussed, I believe every AI system designer should ask themselves these 3 questions about system errors when building their tool:</p><ol><li><p>How easily can a system user identify these errors?</p></li><li><p>If an error is spotted, how easily can a system user correct it?</p></li><li><p>If an error is not spotted or fixed, what is the cost of that error?</p></li></ol><p>Furthermore, every person using these questions should delve into the particulars of their AI system designs, as the conclusions you draw from the answers to each of these questions will vary wildly depending on the contours of the user experience.</p><p><strong>An Illustrative Example:</strong></p><p>Let&#8217;s put this framework into practice by looking at a popular recent use case - using AI to generate illustrations.</p><p>Suppose, for example, you are building a generative illustration product like Midjourney or Stable Diffusion. If we stopped the conversation there and naively leveraged our framework, using AI for this use case would seem great. Users can easily spot errors (because people can just look at the output), and if an output is not good, we can just generate another image.</p><p>However, let&#8217;s dive into the contours of the user experience before we pass final judgement.</p><p>Imagine that one user of your Midjourney-style tool is an experienced digital illustrator. This user has tools like Adobe Photoshop already on their PC, they are accessing your product from their PC, and they are using the images generated for a personal passion project. 
Our previous assessment holds up in this case - the user has enough experience to easily spot errors, they have tools ready and available to fix errors when they do appear, and if a mistake does slip through, the cost to the illustrator is likely very low.</p><p>But let&#8217;s imagine another user of the same tool who has no experience in illustration or design, no access to illustration tools, and who needs to generate marketing collateral for an important upcoming event where mistakes will be judged harshly. In this case, the calculus for the AI system designer needs to be completely different - the user is far less likely to spot errors when they appear, they don&#8217;t have the tools to fix errors themselves when they are spotted, and the cost of a mistake slipping through could be career-damaging.</p><p>And the issue does not need to be limited to inexperienced users - what about an experienced user who needs to generate images from their phone for whatever reason and does not have access to tools to correct errors in that medium? What if their use case was not a personal passion but a critical piece of paid work?</p><p>While this is a relatively straightforward breakdown, hopefully it illustrates how the outcomes for using an AI tool can vary wildly depending on the nature of the human in the loop, the contours of the user journey and the use case being addressed.</p><p><strong>Going Deep On The Cost Of Errors:</strong></p><p>The final element of this framework I want to unpack is what I mean by the &#8216;cost&#8217; of AI system errors.</p><p>When I speak to AI system designers about what they think of when they think about costs of errors, three categories emerge:</p><ol><li><p>Physical costs - harming or injuring an end user as a result of system outputs (e.g. giving bad medical advice)</p></li><li><p>Liability costs - risk of potential litigation for an AI system designer or AI system user as a result of AI system outputs (e.g. 
providing unlicensed financial advice)</p></li><li><p>Reputational costs - harming the ability of users to trust your products, services or organization by providing faulty outputs, wasting users&#8217; time, or otherwise delivering faulty user experiences</p></li></ol><p>Most can identify these issues when prompted; however, reputational costs are consistently underappreciated. In particular, AI system designers often see so-called &#8216;headline risk&#8217; coming (especially after stories like <a href="https://www.theguardian.com/world/2023/aug/10/pak-n-save-savey-meal-bot-ai-app-malfunction-recipes">the aforementioned Chlorine Gas recipe</a> appeared), but they do not see the brand damage that poor AI user experiences can produce.</p><p>For example, say you have a market research software product with thousands of users and you want to launch a Generative AI-powered chatbot to help answer questions for customers based on your FAQ documents. Internal testing shows that your chatbot answers customer questions accurately 9 out of 10 times, and incorrectly 1 out of 10 times. You expect 10% of your users to ask 10 questions a day after the initial launch.</p><p>While our intuition may tell us this sounds good enough to launch based on our framework, again we need to dive into the contours of the user experience before we make a firm judgement.</p><p>Firstly, let&#8217;s break down the cost of errors in this situation. When a user receives an incorrect answer, two things can happen:</p><ol><li><p>The user could spot the error, in which case they move on to try another solution to solve their problem (e.g. reading the documentation themselves)</p></li><li><p>The user could fail to spot the error, in which case they may go ahead assuming the results are true until they run into a problem caused by the incorrect answer and have to start over to solve their problem</p></li></ol><p>In either of these cases, users will feel that their time has been wasted. 
As the late, venerable Charlie Munger used to say, &#8220;a brand is a promise,&#8221; and if you release a feature that does not deliver on its promise your brand will be tarnished one user at a time. In practical terms, this product experience will lead each user who experiences it to leave with a worse impression of your product than when they started (which can make people less likely to buy your product or to tell others to buy it), and it will lead people to not trust the feature (meaning that even if you improve it over time, people are less likely to use it again).</p><p>These threats may sound overblown in the abstract; however, putting real meat on those bones would require discussing details of use cases in the field that I am not at liberty to disclose. For now, understand that users&#8217; reaction to wasting time or having to start a task over due to a false start is often unstated in user feedback but is massively impactful.</p><p>Second, we must understand the scale of this issue - how frequently are our hypothetical users going to encounter the problem described above?</p><p>To do some napkin math, say we have 1,000 users. That would mean we expect 100 users to start using this new feature, each asking ~10 questions per day. Assuming the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers">law of large numbers</a> applies to our 1 in 10 error rate and errors are uniformly distributed across users, on average every one of your 100 users will experience roughly 1 incorrect response per day. At best, each of those 100 users will have a worse impression of this specific feature than when they first planned to use it and just use the feature less. 
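</p><p><em>The napkin math above can be sketched in a few lines. This is only a toy model of the assumptions already stated (1,000 users, 10% adoption, 10 questions per day, a 1 in 10 error rate, errors spread uniformly) - not data from any real deployment:</em></p>

```python
# Toy model of the napkin math: expected incorrect answers per active user per day.
total_users = 1000
adoption_rate = 0.10      # 10% of users try the new feature
questions_per_day = 10    # questions asked by each active user per day
error_rate = 1 / 10       # chatbot answers incorrectly 1 out of 10 times

active_users = total_users * adoption_rate                    # ~100 active users
daily_errors = active_users * questions_per_day * error_rate  # ~100 wrong answers/day
errors_per_user = daily_errors / active_users                 # ~1 per active user per day

print(f"~{errors_per_user:.0f} incorrect answer per active user per day")
```

<p>In other words, under these assumptions errors are not a rare edge case - the average active user bumps into one every single day.</p><p>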
At worst, you have moved 100 potential promoters of your business to detractors, which is an enormous problem.</p><p><strong>Conclusions:</strong></p><p>Understanding the cost of errors is not the only issue people need to consider when designing and building AI powered systems. Prevalence of training data, in house talent, hardware supply constraints, cost and a myriad of other factors need to be in the mind of anyone deciding to build or invest in AI solutions.</p><p>However, figuring out where AI or ML techniques make sense for a given problem or system is a key part of the process that investors and builders alike need to go through, and many are not doing so today. I hope this framework is helpful for anyone currently in this position.</p><p>If you have any thoughts regarding this framework or the examples listed, please reach out to me at <a href="mailto:seeking.brevity@gmail.com">seeking.brevity@gmail.com</a>, I would love to use this as a chance to refine this thesis going forward.</p>]]></content:encoded></item><item><title><![CDATA[AI is an Educational Accelerant, Not Replacement]]></title><description><![CDATA[I&#8217;ve written this blog with the intention of sharing my knowledge.]]></description><link>https://www.seekingbrevity.com/p/ai-is-an-educational-accelerant-not</link><guid isPermaLink="false">https://www.seekingbrevity.com/p/ai-is-an-educational-accelerant-not</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Sun, 22 Oct 2023 06:48:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PojW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;ve written this blog with the intention of sharing my knowledge. 
The contents of this blog are my opinions only.</em></p><h2>Introduction</h2><p>Students and teachers I speak to in secondary and tertiary education do not know how to think about Generative AI and ChatGPT.</p><p>Some think banning it outright is best because of the potential for mass-scale cheating. Some think students using ChatGPT without restrictions is the best way to prepare for an AI-powered future. Underlying all of this is fear that the education system does not have the tools to educate people for the future.</p><p>I personally disagree with all of these framings and think that blindly following any of them could lead to disastrous learning outcomes.</p><p>In this article, I aim to explain why these commonly held views will likely be harmful, and propose what I think is a more helpful approach to Generative AI for both students and teachers. </p><p>In particular, I propose a perspective based on 3 related principles:</p><ol><li><p>Generative AI is best understood as an educational accelerant, not a replacement for developing traditional educational fundamentals</p></li><li><p>If left unchecked, ChatGPT and similar tools would polarize educational results (lowering overall educational outcomes while allowing a small, motivated top end to excel)</p></li><li><p>The top priority for the education system should be to develop skills in areas that LLMs are incapable of addressing:</p><ul><li><p>Researching and verifying information in the pursuit of truth</p></li><li><p>How to determine one&#8217;s own objectives, including a sense of right and wrong</p></li><li><p>Mathematics</p></li></ul></li></ol><h2>A Primer on ChatGPT and Large Language Models (LLMs)</h2><p>Before we begin, it is worth having an understanding of how ChatGPT and similar LLMs work.</p><p>For those not already familiar with how ChatGPT works and what its major limitations are, I recommend <a href="https://seekingbrevity.substack.com/p/the-big-3-gaps-in-generative-ai">one of my previous 
articles breaking down the topic</a>.</p><p>For those without time to read it or who just want a refresher, LLMs are not infallible machine overlords. LLMs are statistical models with a strong intuitive grasp of human language, and they have 3 notable weaknesses that are inherent to the way they are built as of this writing:</p><ol><li><p>They are not authoritative databases with a tight coupling to reality</p></li><li><p>They are incapable of defining their own objectives (and similarly are inappropriate tools for distinguishing right from wrong)</p></li><li><p>They are really, really bad at math</p></li></ol><h2>Why Would Banning ChatGPT Be Harmful?</h2><p>I will start by challenging the first common approach of banning ChatGPT - why should ChatGPT not be banned outright in all educational settings?</p><p>Imagine a common sight - a struggling high-school student writing an essay on English Literature. It is not going well. Perhaps they missed some schooling earlier and fell behind in grammar. Maybe English is not the primary language spoken at home. Or maybe they just naturally struggle with writing essays.</p><p>At the moment, if this student wanted help to improve their writing, they would have 4 options:</p><ol><li><p>Approach their already overworked teacher (who may not have time to coach 1:1)</p></li><li><p>Approach their parents (who will likely not have the knowledge or expertise to help)</p></li><li><p>Hire a writing tutor for the subject (an expensive and often impractical undertaking for many)</p></li><li><p>Continue stumbling along, likely performing poorly</p></li></ol><p>Whether it is because of financial constraints, time constraints, a lack of personal connection with the teacher/tutor, social awkwardness around approaching teachers for help, or just because of the sheer friction of the whole process, most of these students end up in the 4th bucket. 
However, the 4th bucket is precisely where anyone interested in public education wants nobody to end up.</p><p>ChatGPT provides a solution to this problem. If used properly, ChatGPT could enable this struggling student to improve their grasp of English by identifying grammatical errors, providing corrections and explaining the reasons clearly. ChatGPT could give that student feedback about the logical flow of the document and how they could improve. All of this could be done in private, on demand, and at low-to-no cost to the student.</p><p>Furthermore, the benefits of ChatGPT are not just limited to struggling students. Imagine now a curious student who wants to learn about topics not covered in the syllabus or at their school. Now curious, driven students have assistants that can help explain and summarise any topic they want to learn about, whether it is Nuclear Fusion or CRISPR/Cas9.</p><p>The potential impact of a 1:1 tutor available to everyone cannot be overstated. In 1984, educational psychologist Benjamin Bloom found that students with access to a private tutor perform 2 standard deviations better than students taught in classrooms (see <a href="https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem">Bloom&#8217;s 2 Sigma Problem</a>).</p><p>Therefore, completely banning ChatGPT use would be harmful because it would deprive students of a much-needed educational tool that can meet a need currently unmet by education systems everywhere.</p><p>However, there is a reason I used the qualification &#8220;if used properly.&#8221;</p><h2>Why Would Unrestricted Use of ChatGPT Be Harmful?</h2><p>Now to challenge the second commonly held view of allowing ChatGPT to be used unconstrained - why should ChatGPT not be allowed without restrictions or guidance?</p><p>Let&#8217;s take another view of the struggling student from before. 
Instead of using ChatGPT to coach them in grammar and slowly improve their abilities, they could simply use it to write the essay for them and hand it in unedited.</p><p>If this were allowed, students would often hand in nicer-sounding essays. It would also defeat the purpose of education.</p><p>Writing is the process by which we realise that we don&#8217;t know what we are talking about. Only by writing do you give your thoughts form and allow them to be critiqued and improved. It makes the invisible visible, and it is vital for the development of independent thought.</p><p>Using ChatGPT to skip this process robs struggling students of this development. Not only does our struggling student leave their weaknesses unaddressed, it also allows whatever creative muscles they do have to atrophy.</p><p>This is not the end of the harm, however. Using ChatGPT blindly without understanding its limitations can lead people to depend on it where it is not appropriate.</p><p>None of this is mere speculation - both of these patterns were observed in a <a href="https://www.bcg.com/publications/2023/how-people-create-and-destroy-value-with-gen-ai">recent study performed by the BCG Henderson Institute</a> on the effects of using ChatGPT in the workplace.</p><p>Researchers found that for tasks that ChatGPT is currently good at (e.g. creative ideation), users who leveraged ChatGPT reported feeling their creative abilities atrophy while using it. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PojW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PojW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 424w, https://substackcdn.com/image/fetch/$s_!PojW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 848w, https://substackcdn.com/image/fetch/$s_!PojW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 1272w, https://substackcdn.com/image/fetch/$s_!PojW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PojW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png" width="1440" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b931fb17-1886-4db8-b523-9088754077d7_1440x811.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:252269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PojW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 424w, https://substackcdn.com/image/fetch/$s_!PojW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 848w, https://substackcdn.com/image/fetch/$s_!PojW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 1272w, https://substackcdn.com/image/fetch/$s_!PojW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb931fb17-1886-4db8-b523-9088754077d7_1440x811.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Furthermore, while the output of these users seemed better than those who did not use ChatGPT, the diversity of ideas among ChatGPT users was drastically lower than non-ChatGPT users.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8aDK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8aDK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 424w, 
https://substackcdn.com/image/fetch/$s_!8aDK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 848w, https://substackcdn.com/image/fetch/$s_!8aDK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!8aDK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8aDK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png" width="1456" height="851" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215036,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8aDK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 424w, 
https://substackcdn.com/image/fetch/$s_!8aDK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 848w, https://substackcdn.com/image/fetch/$s_!8aDK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!8aDK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa693b1a6-2c7c-4d71-a493-9d659ff42104_2480x1450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For tasks that ChatGPT is not good at (e.g. business problem solving), users with ChatGPT access often over relied on ChatGPT and performed worse than people without it, even when they were provided education on the limitations of ChatGPT.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3-B4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3-B4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 424w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 848w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 1272w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3-B4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png" width="682" height="463" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd84e341-e1ee-411f-84ac-6115d2085897_682x463.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:463,&quot;width&quot;:682,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3-B4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 424w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 848w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 1272w, https://substackcdn.com/image/fetch/$s_!3-B4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd84e341-e1ee-411f-84ac-6115d2085897_682x463.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>ChatGPT - If Left Untreated</h2><p>Prior to ChatGPT&#8217;s launch, teachers and students would all be familiar with a distribution of student performance that looks something like the following:</p><ul><li><p>A small number of curious, driven students perform well and learn almost no matter what</p></li><li><p>A small number of students struggle to learn, almost no matter what is thrown at them</p></li><li><p>Most students live in the middle - not high performing, but not lagging either</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nXvK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!nXvK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 424w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 848w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 1272w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nXvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png" width="993" height="526" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c602709c-350d-4b5e-bec0-aa6050284b91_993x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:993,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40865,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!nXvK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 424w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 848w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 1272w, https://substackcdn.com/image/fetch/$s_!nXvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc602709c-350d-4b5e-bec0-aa6050284b91_993x526.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If ChatGPT were allowed to take its course with no systematic intervention, I believe outcomes would start to look more like this:</p><ul><li><p>A small number of smart, highly motivated students would leverage ChatGPT as an educational accelerant and personal coach</p></li><li><p>The average and underperforming students would atrophy academically, as they leverage ChatGPT as an educational replacement</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CkEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0951f4be-9a8e-4200-8420-b42b28b4fc47_994x521.png"><img src="https://substackcdn.com/image/fetch/$s_!CkEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0951f4be-9a8e-4200-8420-b42b28b4fc47_994x521.png" width="994" height="521" alt="" loading="lazy"></a></figure></div><p>This kind of polarized outcome is undesirable: while the top end would perform better, the net amount of learning and educational development would regress across student cohorts.</p><h4>We Have Been Through This Before - Google Search</h4><p>While ChatGPT looks and feels novel to most of us, I believe it is not the first time many of us have seen a new
technology have an educational impact like this.</p><p>The proliferation of the internet (particularly Google Search) massively disrupted how students approached studying, especially when it came to reading and research.</p><p>Before Google, if you were a student who wanted to do research, you needed to go to a library. Not only that, you needed to ask a librarian to help you find a book. If the book was checked out, you might have had to wait a few weeks for someone to return it. Once you had gathered your books, you needed to read a summary of each one to figure out whether it contained the information you were looking for.</p><p>In short, it was a labour-intensive process that yielded a similar pattern to the one we saw before:</p><ul><li><p>A small number of dedicated students learned a lot by diving through libraries and developed a personal interest in reading</p></li><li><p>Some recalcitrant or struggling students never went to the library and thus never absorbed the knowledge it contained</p></li><li><p>Most students were in the middle, visiting the library on the occasions they were compelled to, or when CliffsNotes were not available</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IJMY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5c1565-46dc-4ae7-93ce-608abbdccfb6_985x526.png"><img src="https://substackcdn.com/image/fetch/$s_!IJMY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5c1565-46dc-4ae7-93ce-608abbdccfb6_985x526.png" width="985" height="526" alt="" loading="lazy"></a></figure></div><p>With the rise of Google Search, however, this process changed. Suddenly, instead of going to a library, students could go to their computers and access most of the information they were looking for instantly.</p><p>This enabled high-performing students to learn more than the generations before them. Instead of spending weeks diving through libraries, with all the dead air that entailed, they could access information about whatever they wanted to learn almost instantly. Furthermore, they were no longer limited to the knowledge contained in their local library. Students who wanted to learn about Complex Numbers or Ancient Roman History could do so at the touch of a button.</p><p>However, it also meant that students had instant access to sites like Wikipedia, SparkNotes, and many other resources for cheating or reducing the time they spent reading. Furthermore, it gave them access to YouTube, memes, and many other distractions.</p><p>Thus, we ended up with the same pattern again, with high performers performing even better and poor and average performers learning less.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4KAc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3172407-f954-4695-b747-b396a408b8a7_980x527.png"><img src="https://substackcdn.com/image/fetch/$s_!4KAc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3172407-f954-4695-b747-b396a408b8a7_980x527.png" width="980" height="527" alt="" loading="lazy"></a></figure></div><p>While it is difficult to find a graph showing aggregate drops in time spent researching and pursuing knowledge, a useful proxy is <a href="https://www.amacad.org/humanities-indicators/public-life/time-spent-reading#32151">the drop in time spent reading for personal interest between 2003 and 2018</a> recorded by the American Academy of Arts and Sciences:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gZKz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e2eca-2457-4b22-af18-d060c84fcae7_757x512.png"><img src="https://substackcdn.com/image/fetch/$s_!gZKz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e2eca-2457-4b22-af18-d060c84fcae7_757x512.png" width="757" height="512" alt="" loading="lazy"></a></figure></div><h2>Where Do We Go From Here?</h2><p>Now that we have a conceptual framework for understanding ChatGPT and its likely impacts on education if left unaddressed, how do we harness its potential while minimising downside risk?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7XFp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d7776-c0d2-4873-8edd-6e43659b0b4c_992x529.png"><img src="https://substackcdn.com/image/fetch/$s_!7XFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44d7776-c0d2-4873-8edd-6e43659b0b4c_992x529.png" width="992" height="529" alt="" loading="lazy"></a></figure></div><p>I believe there are two broad areas to consider:</p><ol><li><p>What subjects and skills should students be learning?</p></li><li><p>How should we go about teaching them as a system?</p></li></ol><h3>Subjects and Skills</h3><p>Many people I speak to express fear over the future of education because they do not know what subjects teachers should teach or what skills students should develop in this new era. Some think we need to throw out old curricula and invent new AI-infused subjects or courses on AI ethics.</p><p>I take the opposite view - I think a portfolio of traditional subjects (Math, English, History, etc.) already has the ingredients for developing the skills students need to fill the Big 3 Gaps in Generative AI (the independent pursuit and verification of the truth, the development of a sense of what is right and wrong, and Mathematics).</p><p>Furthermore, I think doubling down on traditional and proven approaches to character building is crucial to filling the Big 3 Gaps.</p><p>Regarding the portfolio of subjects, exactly which subjects schools should offer and students should select will vary, but the following disciplines, with the following emphases, are a good place to start:</p><ul><li><p>Investigative Journalism - the process of discovering, reasoning about and presenting information not yet uncovered</p></li><li><p>History - the process of interpreting and evaluating recorded evidence to come independently to a sense of truth, understanding what could or could not have happened, and extrapolating implications for the future about the story of humanity</p></li><li><p>Persuasive writing - the
process by which we realise that we don&#8217;t know what we are talking about, wrestle with our own ideas, make invisible beliefs and assumptions visible, and present a cohesive point of view out the other end</p></li><li><p>Philosophy - the rigorous study of the fundamentals of thought, knowledge, reality and humanity</p></li><li><p>Foreign Languages - the study of other cultures in their own terms and vocabulary</p></li><li><p>Arts and Crafts - the art of learning to see and display the world through a variety of mediums and disciplines</p></li><li><p>Mathematics and mathematical sciences - the study of the language of logical reasoning and problem solving (e.g. Physics, Chemistry, Computer Science, Engineering, etc.)</p></li></ul><p>If a student selects a basket of the above subjects and develops a self-sustaining curiosity in them that goes beyond the classroom, that is the definition of success. Good litmus tests of self-sustaining curiosity include reading for personal interest (reading quality books, articles and other high-quality media on topics they care about) and independent writing (writing in one&#8217;s own time to make sense of oneself and the world).</p><p>Regarding character building, it is important to acknowledge that the aforementioned <a href="https://seekingbrevity.substack.com/p/the-big-3-gaps-in-generative-ai">Big 3 Gaps in Generative AI</a> (especially points 1. and 2.) cannot be met by curriculum alone. Being intellectually honest in the pursuit of truth and having a deep sense of right and wrong require levels of emotional maturity and personal character that often cannot be developed through assessment alone.</p><p>While the ideal methods for developing character and the character traits we should learn to emulate have always been a matter of political contention, few disagree that extracurricular activities (especially sport) are excellent mechanisms for schools to facilitate emotional development.
Furthermore, when done well, some form of rigorous argumentation (whether through activities like debating, studies of religion or other philosophical studies) tends to facilitate the development of principled views on the world and what separates right from wrong.</p><h3>Brass Tacks - How to Implement as a System</h3><p>So, I argue that the content we teach should not change much. How, then, should schools adapt their teaching and assessment methods to take ChatGPT into account? How do we ensure it acts as an educational accelerant rather than an educational replacement?</p><p>If we want ChatGPT to accelerate students, we need to design classes and assessments that do the following:</p><ul><li><p>Educate students on ways they can use it as an effective personal tutor (improving phrasing, asking whether there are logical inconsistencies or structural weaknesses)</p></li><li><p>Educate students on the limits of LLMs, and encourage them not to use them in areas where they will produce bad results (e.g. citing sources, business problem solving)</p></li><li><p>Encourage students to learn to do something themselves first, the slow way (e.g. writing an essay), before they use ChatGPT as a crutch</p></li></ul><p>Furthermore, we must ensure that at no point are students allowed or encouraged to use LLMs to originate content (e.g. writing reports or essays).</p><p>While there are many ways to go about this, assessment methods could include the following:</p><ol><li><p>Information Verification - provide information that is either incorrect or biased, and force students to verify it</p><ul><li><p>E.g. take a paragraph from an old newspaper or book, and ask students to find the sources and argue whether its statement was correct</p></li></ul></li><li><p>Personal Writing - ask students to write about and reflect on personal experiences (i.e.
information that an LLM would not have access to) and take a stance on it</p></li><li><p>ChatGPT-Free Assessments - have students complete assessments in formats where they do not have access to a computer or LLM (e.g. hand-written exams, impromptu assessments, verbal assessments without notes, etc.)</p></li><li><p>Ask students to synthesise bodies of information too large to feed into an LLM</p><ul><li><p>At this point in time, LLMs have limits preventing people from inserting blocks of text of unlimited length (i.e. a &#8216;context limit&#8217;)</p></li><li><p>Furthermore, research has shown that <a href="https://arxiv.org/pdf/2307.03172.pdf">LLM accuracy across models drops when more information is passed into a prompt</a> (e.g. a model given 10,000 words is less likely to generate a correct answer than one given only 1,000)</p></li><li><p>Therefore, assessments that require ingestion of larger pieces of information can be LLM-resistant</p></li></ul></li></ol><p>Finally, it is worth accepting that there will be no perfect technical approach to implementing these principles at scale:</p><ul><li><p>Blocking ChatGPT in schools will be of limited usefulness against committed cheaters (as we saw with school filters for gaming websites, where there is a will there is a way)</p></li><li><p>It may be a while before we get an LLM that acts like a good teacher (i.e. one that refuses to do work for students) for similar reasons (any filter you put in front of an LLM to stop people from cheating will not stop those committed enough to try to get around it)</p></li><li><p>While some groups are working on methods for watermarking LLM outputs (e.g.
<a href="https://arxiv.org/pdf/2301.10226.pdf">this approach from the University of Maryland</a>), implementations have yet to yield results in the field, and it is not clear how effective they will be</p></li></ul><p>As a result, any implementation of these principles will depend on both school enforcement and cultural acceptance among students of how they should approach assessment in their own long-term best interest.</p><h2>Conclusions</h2><p>In his <a href="https://youtu.be/O4MtQGRIIuA?t=267">public appearances</a>, Amazon founder Jeff Bezos has famously talked about how Amazon tries to make its biggest bets on things that are unlikely to change in the next 10 years, rather than on how things will change.</p><p>For example, Amazon invested in building Amazon Prime because it believes consumers will always want lower prices, more selection, and faster delivery (which makes sense - I don&#8217;t expect many people to start complaining, &#8220;I loved shopping on Amazon, I just wish you would deliver my packages more slowly and charge me more for each package&#8221;).</p><p>It is tempting to think that because ChatGPT and other LLMs are new and evolving, we need to fundamentally change how we approach education to keep up. However, I think it is better to formulate the question in reverse - what has not changed in the era of LLMs, and how can we double down on the things we already know can work?</p><p>Without a fundamental paradigm shift in how LLMs work, none of the issues I discussed here will be completely resolved. As a result, I think it is safe to bet on the traditional methods and approaches discussed earlier.</p><p>I personally do not foresee a future where students and parents go to their teachers and say &#8220;thank you for teaching my child, however could you please stop teaching them to be so good at verifying sources?
They&#8217;ve got an algorithm for determining absolute truth now.&#8221;</p><p>None of the solutions I propose radically transform how schools or students approach education, and this is intentional. I recommend instead that education systems adapt to LLMs by learning how to use them to accelerate what we already know works, and by working to finally solve Bloom&#8217;s 2 Sigma Problem.</p><div><hr></div><p>Thank you for reading this piece. I hope you found it useful or enlightening in some way.</p><p>If you found this helpful, or if you think it was horrible and you need to yell at me about it, let me know at this email: <a href="mailto:seeking.brevity@gmail.com">seeking.brevity@gmail.com</a>.</p><p>Some topics came up while writing this that I removed for brevity&#8217;s sake. If you have an interest in more pieces like this, let me know as well.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.seekingbrevity.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.seekingbrevity.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[The Big 3 Gaps in Generative AI]]></title><description><![CDATA[I&#8217;ve written this blog with the intention of sharing my knowledge.]]></description><link>https://www.seekingbrevity.com/p/the-big-3-gaps-in-generative-ai</link><guid isPermaLink="false">https://www.seekingbrevity.com/p/the-big-3-gaps-in-generative-ai</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Sun, 22 Oct 2023 04:04:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yh3T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5cc2baf-9e94-425a-8a83-5b87572da1ac_796x400.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;ve written this blog with the intention of sharing my knowledge. The contents of this blog are my opinions only.</em></p><p>For many, ChatGPT feels both scary and magical. As soon as the public saw the potential of Large Language Models (LLMs), news feeds were flooded with evangelists talking about how ChatGPT could do everything from <a href="https://www.nature.com/articles/s42256-023-00644-2">your homework</a> to <a href="https://www.youtube.com/watch?v=4Q2HxVpJ9nw&amp;t=226s">making you rich</a>.</p><p>However, all the noise and hyperbole around LLMs has made it difficult for non-practitioners of the AI arts to construct a clear mental model of what LLMs are and are not capable of.</p><p>Given this, I wrote this short piece to help others understand what LLMs really are, what their biggest limitations are (which I call &#8216;The Big 3 Gaps in Generative AI&#8217;), and how to think about where they should and should not be used.</p><h2>What is an LLM?</h2><p>Text-generation LLMs, of which ChatGPT is one, are statistical models under the hood. 
In particular, they are models designed to guess the next word in a sequence based on a prompt and the words they have generated so far.</p><p>One can think of this as similar to how many smartphones now have a &#8216;next word guess&#8217; just above the keyboard.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yh3T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5cc2baf-9e94-425a-8a83-5b87572da1ac_796x400.png"><img src="https://substackcdn.com/image/fetch/$s_!yh3T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5cc2baf-9e94-425a-8a83-5b87572da1ac_796x400.png" width="796" height="400" alt=""></a></figure></div><p>These models are trained by taking a large set of text data (such as the copies of much of the internet provided by <a href="https://commoncrawl.org/">Common Crawl</a>) and training a model using really big servers (often NVIDIA <a href="https://www.nvidia.com/en-au/data-center/a100/">A100s</a> or <a href="https://www.nvidia.com/en-au/data-center/h100/">H100s</a> - hence <a href="https://www.forbes.com/sites/dereksaul/2023/08/22/nvidia-stock-hits-all-time-high-after-315-surge-easily-outpacing-its-peers/?sh=528c9e3c4b56">the early 2023 spike in NVIDIA&#8217;s share price</a>).</p><p>In the training process, model builders try to capture the patterns of language in the training data using probabilities (e.g. out of all the times we see the word &#8220;George&#8221; in a sentence, what is the probability that the next word will be &#8220;Washington&#8221; vs. 
any other word in the English language).</p><p>While there is a lot more to how these models work, this mental model is enough for our purposes. (For those curious about the nuts and bolts, I recommend reading <a href="https://arxiv.org/pdf/1706.03762.pdf">Attention is All You Need</a> (the foundational paper that made the recent AI wave possible). If that feels confusing on its own (as it likely will), also watch <a href="https://www.youtube.com/watch?v=kCc8FmEb1nY&amp;t=1s">Andrej Karpathy&#8217;s ChatGPT Video</a> to break it down a little more.)</p><h2>The Big 3 Gaps in Generative AI</h2><p>Armed with this mental model and some experience playing with LLMs, one quickly notices 3 major gaps in LLM abilities:</p><ol><li><p>LLMs do not generate results based on what is real or true (they generate responses based on the patterns they have seen in their training data)</p></li><li><p>LLMs are not sentient (and as a result are incapable of determining their own objectives)</p></li><li><p>LLMs are really, really bad at math</p></li></ol><p>Firstly, regarding their distance from reality: these tools are fundamentally statistical models built to take in one set of words and guess what an output set should look like, one word at a time. This means that they are not referring to an authoritative source of truth when they answer a question, nor are they determining the truth by comparing multiple sources and coming to an independent answer. LLMs make authoritative-sounding guesses based on the information they have seen on a large subset of the internet, which may or may not be an accurate reflection of reality.</p><p>Secondly, regarding their lack of sentience: it is true in a strictly literal sense that LLMs are not sentient, as they do not have agency to act without human prompting (and even if LLMs can technically write prompts for other LLMs, the fundamental intent they pursue, or seed prompt, must be set by a human). 
However, it is more important to understand this fact from a metaphysical perspective. LLMs are not equipped to make independent determinations of what is right and wrong based on personally held moral or ethical principles the way humans do.</p><p>For example, Joe Biden could ask an LLM whether he should provide nuclear weapons to Ukraine, and the LLM could provide an authoritative-sounding answer to his question.</p><p>However, in a situation like this, the US President needs to make an independent determination of what the objectives of the United States are, which of those objectives take priority over others, and what paths are morally acceptable for achieving them.</p><p>Humans define these objectives and make moral judgements for decisions like these (as well as plenty of other less-consequential decisions) based on the fundamental beliefs and assumptions they develop from a lifetime of sensory data. There are no universally agreed &#8216;correct answers&#8217; in these situations that LLMs can anchor to either, as those vary wildly depending on the fundamental premises people bring to the problem.</p><p>Furthermore, the output of an LLM would shift depending on the way the President phrased the question, rather than being grounded in any universal moral or ethical principles. As a result, ChatGPT or any other LLM could not reasonably be relied upon to provide a valid perspective on morally challenging and thorny topics like this.</p><p>Finally, regarding their lack of mathematical capability: LLMs struggle with math because they are only designed to guess the next word or number in a sequence based on what they have seen before. They are not designed to learn the fundamental meaning of numbers and their patterns. 
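</p><p><em>To make the &#8220;guess the next word from counted patterns&#8221; idea concrete, here is a deliberately tiny sketch. This is not how a real LLM is built (real models use neural networks over subword tokens and vastly larger corpora); it is a word-level bigram counter, with an invented three-sentence corpus standing in for training data, that shows the statistical idea in miniature:</em></p>

```python
from collections import Counter, defaultdict

# A tiny invented corpus standing in for training data.
corpus = (
    "george washington was the first president . "
    "george washington led the army . "
    "george harrison played guitar ."
).split()

# Count, for each word, how often each possible next word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probability(prev, candidate):
    """Estimate P(candidate | prev) from raw counts."""
    total = sum(follows[prev].values())
    return follows[prev][candidate] / total if total else 0.0

# Out of all the times "george" appears, how often is the next
# word "washington"? In this corpus: 2 of 3 occurrences.
print(next_word_probability("george", "washington"))

def generate(word, n=5):
    """Extend a prompt by repeatedly picking the most common follower."""
    out = [word]
    for _ in range(n):
        if not follows[out[-1]]:
            break
        out.append(follows[out[-1]].most_common(1)[0][0])
    return " ".join(out)

print(generate("george"))
```

<p><em>Everything this toy model &#8220;knows&#8221; comes from counts over the text it was shown - it has no notion of truth, intent, or what numbers mean, which is the miniature version of the gaps discussed in this post.</em></p><p>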
As a result, LLMs are not just bad at arithmetic (which traditional computers are excellent at); they are also currently incapable of solving unsolved problems in pure mathematics.</p><p>It is not impossible that at some point in the future AI researchers will build new models that overcome these limitations. However, these particular cracks are fundamental to how present-day LLMs have been architected and trained, and are unlikely to be resolved by more training time with more training data on more GPUs. AI systems that solve these problems will require new leaps forward in methodology, and it is anybody&#8217;s guess as to when and how that happens.</p><h2>LLMs Are Digital Intuition</h2><p>The great contribution of LLMs is that they have given computers an intuitive grasp of natural human language.</p><p>Learning the magic tricks behind ChatGPT may make them seem less impressive than they appeared on one&#8217;s first interaction. However, these models learn language in a way similar to how we humans do when we are young (we first learn grammar by observing which words are used where, long before we learn the language of nouns, verbs and adjectives), which is no small feat.</p><p>That being said, while LLMs have given computers a form of digital intuition, intuition alone is not enough to solve many problems. 
Understanding and respecting these limits is crucial for allowing these tools to help us, rather than hinder us.</p><div><hr></div><h3>Thank You for Reading!</h3><p>I hope you found this piece enlightening or helpful in some way.</p><p>If you would like to share your thoughts with me about this piece or request different pieces in future, let me know via this email: <a href="mailto:seeking.brevity@gmail.com">seeking.brevity@gmail.com</a></p>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is Seeking&#8217;s Substack.]]></description><link>https://www.seekingbrevity.com/p/coming-soon</link><guid isPermaLink="false">https://www.seekingbrevity.com/p/coming-soon</guid><dc:creator><![CDATA[Seeking Brevity]]></dc:creator><pubDate>Fri, 20 Oct 2023 05:59:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8PD6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeeea5f9-2238-407d-87e7-bf09ba445f45_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Seeking&#8217;s Substack.</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.seekingbrevity.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.seekingbrevity.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>