I gave the opening talk at our CS faculty retreat at TU Wien in February. The framework: five levels that trace the path from autocomplete to autonomous agents. (Slides)
A good starting point is the METR time-horizon graph, which tracks the duration of human tasks that AI agents can reliably automate. A few years ago, agents could handle maybe 10-minute tasks. Today the frontier is crossing the 1-hour mark. The curve is steep and shows no sign of flattening.
Level 1: Single-Shot LLM
One input, one output. “The cat sits on the ___” goes in, “mat” comes out. This is the basic building block, tracing back to the 2017 paper “Attention Is All You Need” that introduced the Transformer.

Simple, but everything that follows is built on this.
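The single-shot pattern can be sketched in a few lines. This is a toy stand-in, not a real model API: `toy_model` is a hypothetical lookup that plays the role of an LLM.

```python
# Level 1 sketch: one prompt in, one completion out.
# `toy_model` is a stand-in for a real LLM, not an actual API.

def toy_model(prompt: str) -> str:
    """Hypothetical LLM: maps a prompt to a single completion."""
    completions = {
        "The cat sits on the": "mat",
    }
    return completions.get(prompt.strip(), "<unknown>")

def single_shot(prompt: str) -> str:
    # One call, one answer: no tools, no memory, no feedback loop.
    return toy_model(prompt)

print(single_shot("The cat sits on the"))  # -> mat
```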
Level 2: Workflows
Multiple single-shot calls stitched together in a fixed sequence. Two examples: summarizing the Harry Potter series (each chapter gets a summary, then each book, then the whole series) and Retrieval-Augmented Generation (RAG), where a vector database lookup feeds context into the LLM.

ChatGPT itself is a workflow: the chat history is appended to the context window and fed back as a new single-shot call. Workflows became widespread around 2022.
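The Harry Potter example above can be sketched as a fixed pipeline, with a stub `summarize` standing in for one single-shot LLM call. The key property of Level 2 is visible in the code: the control flow is hard-wired, and the model never decides what happens next.

```python
# Level 2 sketch: a fixed sequence of single-shot calls.
# `summarize` is a stub; a real version would be one LLM call.

def summarize(text: str) -> str:
    return f"summary({text})"

def summarize_series(series: list[list[str]]) -> str:
    """`series` is a list of books; each book is a list of chapters."""
    book_summaries = []
    for book in series:
        chapter_summaries = [summarize(ch) for ch in book]    # stage 1: chapters
        book_summaries.append(summarize(" ".join(chapter_summaries)))  # stage 2: books
    return summarize(" ".join(book_summaries))                # stage 3: series
```

The same shape covers RAG: retrieval is one fixed stage whose output feeds the next single-shot call.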
Level 3: Agentic Workflows
This is the big jump. The difference from Level 2 is that the model itself decides when and whether to call a tool. A weather lookup, a calculation, running a test suite. The LLM interacts with its environment, and that is what makes it an agent.

GitHub Copilot (Level 2) completes code as you type. A coding agent (Level 3) writes the code, runs the test, reads the error, fixes the code, and iterates.
This also opens the neurosymbolic world: a neural LLM calling symbolic tools like solvers, proof assistants, or computer algebra systems.
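A minimal agentic loop looks like this. Everything here is illustrative: the `TOOL:`/`ANSWER:` convention, the tool names, and the scripted `model` stub (which hardcodes one calculation instead of actually reasoning). What matters is the structure: the model's output, not the program, decides whether a tool gets called.

```python
# Level 3 sketch: the model's output drives the tool calls.
# All conventions here (TOOL:/RESULT:/ANSWER:) are made up for illustration.

TOOLS = {
    # Toy arithmetic tool; eval is fine for a sketch, never for production.
    "calc": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def model(history: list[str]) -> str:
    # Scripted stub: request one calculation, then answer with the result.
    if not any(h.startswith("RESULT:") for h in history):
        return "TOOL: calc 6*7"
    return "ANSWER: " + history[-1].removeprefix("RESULT: ")

def run_agent(task: str) -> str:
    history = [task]
    while True:
        out = model(history)
        if out.startswith("TOOL:"):          # the model chose to act
            _, name, arg = out.split(" ", 2)
            history.append("RESULT: " + TOOLS[name](arg))
        else:                                # the model chose to answer
            return out.removeprefix("ANSWER: ")

print(run_agent("What is 6*7?"))  # -> 42
```

Replace `calc` with a SAT solver or a test runner and the same loop becomes a neurosymbolic or coding agent.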
Tool calling was introduced around 2023 with GPT-4, but the early ecosystem was fragmented: with N agent architectures and M tools, you needed N × M integrations. Then in late 2024, Anthropic introduced the Model Context Protocol (MCP), often called “the USB port for tool calling.” It standardized the interface, cutting the effort to N + M (each agent speaks MCP once, each tool exposes one MCP server), and thousands of MCP servers have been built since.

On the MCP front, our group has built several servers: MCP-Solver for SAT/CP/SMT solving, DBLP-MCP for literature research, Consult-7 for synthesizing feedback from multiple frontier models, iPython-MCP as a virtual Python coding environment, plus Sage-MCP for computer algebra and Lean-MCP for theorem proving in Lean 4. Each exposes symbolic tools to the LLM, and the agent decides autonomously when and how to use them.
Level 4: Context Engineering
As agents run and call tools, every result fills up the context window. Eventually it is full. Context engineering is the discipline of working within this limit.

Two techniques stand out. Sub-agents let the main agent spawn workers for subtasks. All the intermediate context stays encapsulated inside the sub-agent; only the final result comes back. Skills work differently: instead of loading all instructions at the start, the agent starts with a short index and pulls in detailed instructions only when needed. Both keep the context clean.
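The sub-agent pattern can be sketched as a function boundary (the names and step labels below are illustrative). The worker's intermediate context is a local variable that never escapes; the main agent's context grows by one entry per subtask, not by the whole transcript.

```python
# Level 4 sketch: sub-agents encapsulate intermediate context.

def sub_agent(subtask: str) -> str:
    local_context = [subtask]                 # grows freely inside the worker
    for step in ("plan", "draft", "revise"):  # stand-ins for LLM/tool calls
        local_context.append(f"{step}: ...")
    return f"result for {subtask!r}"          # only this line crosses back

def main_agent(task: str, subtasks: list[str]) -> list[str]:
    context = [task]
    for st in subtasks:
        context.append(sub_agent(st))  # one entry per subtask
    return context

ctx = main_agent("write report", ["gather data", "make plots"])
# ctx holds 3 entries; the ~8 intermediate steps stayed inside the sub-agents
```

Skills apply the same idea to instructions: keep a short index resident and load the detailed text lazily, on demand.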
Level 5: Persistent Memory
A single session generates a lot of context, but once it ends, that context is gone. The next session starts from zero. Persistent memory changes this. The agent learns across sessions, carrying knowledge forward.

Agents can even generate their own skills. The next time they run, they can use what they wrote before. This is where things get close to a self-improvement loop.
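The mechanism can be sketched as a store that outlives the process; the file name and schema below are assumptions for illustration. One session records a skill, and the next session starts with it already loaded.

```python
# Level 5 sketch: memory persisted to disk survives across sessions.
# File name and schema are illustrative, not any real agent's format.

import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")

def load_memory() -> dict:
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return {"skills": {}, "facts": []}

def save_memory(memory: dict) -> None:
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def session(memory: dict) -> dict:
    # The agent writes a skill for itself; future sessions can use it.
    memory["skills"]["parse_logs"] = "instructions the agent wrote itself"
    return memory

# Session 1 writes; session 2 (a fresh process) starts from what it learned:
save_memory(session(load_memory()))
```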
Anthropic has stated that Claude Code now generates all of its own code. And a complete C compiler (100,000 lines of C) was reportedly built by a team of 16 Claude agents in two weeks.
Three Scaling Dimensions
Three things are advancing simultaneously: model capability (doubling roughly every four months according to METR), architecture (MCP, sub-agents, neurosymbolic integration), and self-improvement (agents generating their own skills and code).
The implication: software is becoming cheap. When model capability, architectural reach, and self-improvement all compound, the resulting trajectory is steep.