Engineering leaders have a new challenge: how to work with, and talk about, large language models when everything is moving so fast. Programmers experiment with models, fine-tune for fun on gaming rigs, and learn what works. Engineering leaders, who live further from the code, have to keep pace anyway.
Building Cased, I've learned about AI for application development the hard way: how do we build useful software on top of deeply enigmatic large language models? In this first post I'll focus on high-level concepts, hopefully helpful to any engineering leader in this new world.
Use LLMs extensively, but only for what they’re good at
Focusing on text, large language models excel at three things: reducing a large set of words into a smaller set while preserving meaning; generating words that fit a desired description; and categorizing or classifying words. This last skill, the most interesting of the three, needs clarification for real-world use. We overestimate a model's intelligence: we imagine it has eyes and can reflect, understand, reason, and conclude.
We are tempted to just give it words and ask "what is all this?", as you might ask a person. But that doesn't work, because the world of words is large and the model can't cut scope well enough on its own. Limiting scope is therefore your job: through careful prompting, by providing contextual data, or with alterations to a base model.
For classification tasks, here is one way to do it: prompt your LLM with a limited set of options from which it can choose, and let it go from there. Know that this limits the expressive power of the model, and that this is the right trade-off. Further, drill that small set of classification options into your system prompt as if THE FUTURE OF THE WORLD DEPENDS UPON IT.
In fact, this particular suggestion is literal: you can prompt your LLM with occasional, carefully selected all-caps phrases to overstate the existential importance of your instructions.
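Here's a minimal sketch of that pattern using the OpenAI Python client; the label set, model name, and prompt wording are illustrative placeholders to adapt for your own domain:

```python
# Sketch: constrain the LLM to a fixed set of classification labels.
from openai import OpenAI

client = OpenAI()

LABELS = ["deploy", "incident", "access-request", "other"]  # placeholder labels

SYSTEM_PROMPT = (
    "You are a classifier. Classify the user's message into EXACTLY ONE "
    f"of these categories: {', '.join(LABELS)}. "
    "Respond with the category name ONLY. "
    "IT IS CRITICALLY IMPORTANT that you never answer outside this list."
)

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you run
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,  # keep the output as deterministic as possible
    )
    label = response.choices[0].message.content.strip()
    # Guard against drift: fall back if the model strays from the list.
    return label if label in LABELS else "other"
```

That final guard matters: even a well-drilled model occasionally strays, and your application should treat that as routine rather than exceptional.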
Use LLMs in a more limited way for other things
LLMs are decent at writing code and at answering questions when given explicit context to answer from. They are mostly acceptable at simple information retrieval from their training data. They are not great at "reasoning" (i.e., applying either formal or informal logic to multi-step processes).
They are terrible at long-lived work, drifting further and further from intent until they get "confused". They are terrible at human-like agentic action in the absence of detailed, explicit guidance. They are bad at math, but think they aren't. For now, be less enthusiastic about applying LLMs to these things.
Function-calling is actually good
Function-calling via LLMs (using a model to select functions that your application then executes) is extremely valuable, and for the same reason that APIs in general are good: a whole world opens up in a more controllable way (at least more controllable than the inherently non-deterministic LLM itself).
You know what your function does. You can do ordinary computations, do math, call services: whatever. The LLM has one simple, tight job: pick the right function. That said, the LLM's selection of the right function, with the right suggested arguments, is not guaranteed. A few things help.
When you write a "description" field for a function-calling LLM, don't merely describe the function: treat that field like an instruction, including when the function should and shouldn't be invoked; you can even reference other functions.
Remember that the description field just ends up as text in a long system prompt, and it should be written that way. Use the LLM to call functions: it is good business. Even a small number of powerful, targeted, highly specific functions made available to a well-prompted function-calling LLM can work well.
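A minimal sketch of that pattern with the OpenAI Python client; the tool itself (`restart_service`), its description text, and the model name are hypothetical examples:

```python
# Sketch: the LLM selects a function; your application executes it.
# Note the description is written as an instruction, including when
# the function should and shouldn't be used.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "restart_service",  # hypothetical example function
        "description": (
            "Restart a named service. Use this ONLY when the user "
            "explicitly asks for a restart. If the user is asking about "
            "service status, do NOT call this function."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "service_name": {
                    "type": "string",
                    "description": "Exact name of the service to restart.",
                },
            },
            "required": ["service_name"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "Please restart the billing service."}],
    tools=tools,
)

tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    args = json.loads(tool_calls[0].function.arguments)
    # Your application, not the model, performs the action.
    print(f"Would restart: {args['service_name']}")
```

The model's only job here is selection; execution, validation, and error handling stay in your deterministic code.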
A limited LLM is a good LLM
It would be cool to see artificial general intelligence in my lifetime, but if you're building AI-based applications in 2024, most of your work is the exact opposite: strongly controlling the problem space for the LLM.
This might mean fine-tuning, or particular types of retrieval operations (the tragically misunderstood retrieval-augmented generation, or "RAG": i.e., putting contextual information in your prompts), or it might just mean properly framing what you want your application to do. Prefer specificity over generality in your operations; prefer tightly-controlled (albeit potentially less "impressive") operations over occasionally doing something incredible.
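At its simplest, RAG is just that: retrieve relevant text, put it in the prompt. A minimal sketch, where `search_docs` and the toy corpus are hypothetical stand-ins for your real retrieval layer (a search index, a vector store, whatever you use):

```python
# Minimal RAG sketch: retrieve context, place it in the prompt.
from openai import OpenAI

client = OpenAI()

# Toy corpus; in practice this is a search index or vector store.
DOCS = [
    "The deploy pipeline runs the test suite before shipping.",
    "Rollbacks are triggered from the deployments dashboard.",
    "On-call rotations are managed in the scheduling tool.",
]

def search_docs(query: str, k: int = 2) -> list[str]:
    # Hypothetical stand-in: rank documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def answer_with_context(question: str) -> str:
    context = "\n".join(search_docs(question))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

The point isn't the retrieval mechanics; it's that the prompt, like the classification and function-calling examples above, hands the model a tightly scoped job.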
An anecdote: early on with Cased we were focused on potentially powerful but too-general solutions, and this led to some intriguing moments. For example, our DevOps agent would decide to create GitHub Issues with its own product suggestions about Cased.
This was wild to see (and somewhat passive-aggressive of it), but we're not in the business (and you shouldn't be either!) of delivering just an occasional whoa moment. So build reliable stuff that people can trust.
And that’s what people will talk about.