Insights
AI Systems10 min read

The 6-Month AI Engineer Roadmap

A sequenced, applied learning path from zero to shipping production LLM systems, ordered by what you can build at each stage, not by what sounds impressive.

AI EngineeringLearning PathLLMCareerProduction AI
The 6-Month AI Engineer Roadmap

Most people who want to build AI systems do not know where to start. They are told to learn "machine learning" and spend six months studying gradient descent before discovering that the skills needed to ship production LLM systems are almost entirely different. This roadmap fixes that problem.

It is ordered by what you can apply at each stage. Each month builds on the previous one. By month six, you can architect and ship production AI systems and specialise in the direction that interests you most.

Month 0: The Mental Model

Before touching any code, get the mental model right. Most people who build fragile AI systems have a fundamental misunderstanding of what they are working with.

An LLM is a next-token predictor trained on a vast corpus of text. It does not "think." It does not "know" things in the way a database knows things. It has a strong prior over what text should follow given context. When you prompt an LLM, you are providing a context that makes certain text continuations more or less probable.

This framing matters because it tells you where to invest. You are not training a model to know things. You are engineering the context that makes the model likely to produce the output you need. Everything from prompt design to retrieval to memory is about shaping that context.

The second mental model concept: LLMs are probabilistic systems. Your goal is to move the distribution of possible outputs toward the outputs you want. You do this through prompt engineering, context engineering, fine-tuning, and output validation. You can narrow the distribution significantly, but never guarantee a specific output.

Month 0 deliverable: be able to explain what an LLM is and is not, what a context window is, and what the difference between a prompt and a completion is.

Month 1: Prompting

Month 1 is entirely about prompt engineering. Not because prompting is the most important skill, but because everything else you learn will require you to write prompts, and weak prompting introduces errors at every stage.

By the end of month 1, you should be able to: write a system prompt that reliably produces the output you want for a defined task, use few-shot examples to calibrate model behaviour, implement chain-of-thought prompting, prevent hallucinations using structural techniques, and use output formatting to produce structured data from model outputs.

Work through the Anthropic prompt engineering guide and the OpenAI cookbook. Build something that uses an LLM API for a real task you care about. The learning that sticks is the learning you apply immediately.

Month 1 deliverable: a working LLM-powered tool that solves a real problem using a prompt you engineered yourself.

Month 2: Systems Thinking and RAG Introduction

Month 2 shifts from single-prompt thinking to systems thinking. You are no longer asking "how do I phrase this?" You are asking "how does information flow through this system?"

The central concept is retrieval-augmented generation (RAG). RAG is the pattern of retrieving relevant documents from an external store and injecting them into the model's context before generation. It solves the core limitation of LLMs: they do not have access to your specific data.

Month 2 fundamentals: understand vector embeddings and semantic similarity, build a basic RAG pipeline from scratch (do not use a framework yet), understand chunking strategies and their tradeoffs, and learn basic context engineering.

Month 2 deliverable: a working RAG application that can answer questions about a corpus of documents you provide.

Month 3: RAG at Depth

Month 3 goes deep on RAG quality. A basic RAG pipeline is straightforward to build. A RAG pipeline that reliably produces accurate, high-quality answers is significantly harder.

The failure modes to understand and address: retrieval quality (the right documents are not being retrieved), context quality (the retrieved documents are structured poorly for the model to use), and generation quality (the model is not using the retrieved documents correctly).

Month 3 topics: hybrid search combining semantic and keyword search, re-ranking retrieved results, parent-document retrieval, hypothetical document embeddings, and evaluating RAG quality with a structured eval framework.

Month 3 deliverable: an evaluated RAG system with documented quality metrics and at least one concrete improvement made based on eval findings.

Month 4: Agents

Month 4 introduces agents. An agent is a system where an LLM can take actions, observe results, and iterate. This is the architecture that makes LLMs useful for multi-step, open-ended tasks.

Month 4 fundamentals: implement a basic ReAct agent (reason and act loop), build custom tools the agent can call, implement error handling and retry logic, add an observability layer (trace every tool call), and build a simple human-in-the-loop checkpoint.

The critical lesson of month 4: agents fail in production for reasons that have nothing to do with the model. They fail because tools have unclear descriptions, because loops have no termination conditions, because tool errors have no recovery paths, and because there is no observability to see what went wrong. Engineering discipline matters more than model quality.

Month 4 deliverable: a working agent that completes a multi-step task reliably, with a trace you can inspect for every run.

Month 5: Production and Deployment

Month 5 is about shipping. The gap between a working prototype and a production system is larger in AI than in conventional software, because AI systems have additional failure modes that only appear at scale and over time.

Month 5 topics: evaluation harnesses and regression testing, LLM observability with a production tool, cost tracking and optimisation, rate limiting and error handling for LLM API calls, prompt versioning, and basic security practices (input validation, output sanitisation, avoiding prompt injection).

The month 5 mindset shift: you are not building a demo. You are building infrastructure. Infrastructure has SLAs, monitoring, runbooks, and a plan for when things go wrong.

Month 5 deliverable: a production-deployed AI system with monitoring, cost tracking, and an eval suite that runs on every change.

Month 6: Specialisation

Month 6 is where you choose your direction. The foundation is solid. Now you deepen in the area that aligns with what you want to build:

Agentic systems: multi-agent architectures, MCP, planning and task decomposition, long-horizon agent design.

RAG and knowledge systems: advanced retrieval, knowledge graph integration, document understanding, multi-modal retrieval.

Fine-tuning and alignment: when and how to fine-tune, dataset preparation, RLHF basics, model evaluation.

AI product engineering: product development for non-deterministic systems, eval-driven product development, AI system metrics and dashboards.

Vertical application: go deep on a specific domain with domain-specific evaluation and compliance considerations.

The engineer who completes this roadmap is not an AI researcher. They are an AI systems engineer: someone who can architect, build, evaluate, and operate production LLM systems. That is the role the industry needs and under-supplies.