An Introduction to Generative AI for RSEs

How we got here: principles and tooling — ICCS Summer School 2026

Cordero Core

Tom Meltzer

Matt Archer

In collaboration with

Institutions behind this session

University of Cambridge — Institute of Computing for Climate Science

University of Washington Scientific Software Engineering Center, University of Washington eScience Institute, University of Washington

Plan

How we got here: ~70 years of GenAI as a timeline
Inside a transformer: tokens, embeddings, attention, inference
From next-word prediction to assistants, agents, and MCP
Tooling overview and code demo
Opencode intro and configuration
Create your own tool call with MCP + Skills

How We Got Here

A winding road through computing history, from a 1950s cabinet computer with dials, past a beige workstation and GPU boards, to a modern laptop with a chat bubble and robotic arm.

Every concept in GenAI was an answer to a problem someone hit
So we’ll walk the timeline instead of a glossary
Destination: what is inside the tools you’ll use this afternoon

If You Remember Nothing Else

Every AI model is the same machine.
Your model is only as good as your harness.
Generative AI tools are powerful but fragile.

Before we start the timeline, here’s the whole talk in three sentences. If you drift off after lunch, these are the three things to leave with.

One: every AI model is the same machine. From the 1958 perceptron to the frontier model you’ll use this afternoon, it’s the same unit — inputs, weights, sum, threshold — arranged and scaled, trained by the same loop. Once you see that, none of it is magic.

Two: your model is only as good as your harness. The model just predicts the next token; everything useful — tools, context, MCP servers, skills — comes from the machinery you wrap around it. That’s why the second half of today is about the harness, not the model.

Three: generative AI tools are powerful but fragile. They can do remarkable work and fail silently in the same minute — hallucination, prompt injection, brittle context. Use them, but verify them.

Everything on the timeline is evidence for one of these three.

Act I

1958 – 2012

Learning from data

1958: The Perceptron

1958 › 1969 › 1986 › 2012 › 2017 › 2020 › 2022 › 2024 › now

The 1958 machine you’re still using

Inputs, weights, sum, threshold — no layers, no magic
Weights are dials: learn by nudging them on every mistake
Geometrically: learning to draw a line between two classes
Still the irreducible unit of every model you use

%%{init: {"flowchart": {"nodeSpacing": 12, "rankSpacing": 25}, "theme": "base", "themeVariables": {"edgeLabelBackground": "#ffffff"}}}%%
flowchart LR
    x1([x₁]) -->|w₁| S["Σ"]
    x2([x₂]) -->|w₂| S
    x3([x₃]) -->|w₃| S
    S --> T["&gt; θ ?"]
    T --> O(["0 / 1"])

    classDef io   fill:#6c757d,stroke:#495057,color:#fff
    classDef key  fill:#003b6f,stroke:#001f3f,color:#fff
    classDef step fill:#4b9cd3,stroke:#2c6e9e,color:#fff

    class x1,x2,x3,O io
    class S key
    class T step

Frank Rosenblatt, a psychologist at Cornell, 1958. The Mark I Perceptron was hardware: a 20×20 grid of photocells, and the weights were physical potentiometers turned by electric motors. When Rosenblatt said “neural network”, he meant a wall of motors turning knobs.

The learning rule: show it a labelled example; if it guesses wrong, nudge the offending weights in the direction that would have produced the right answer. Repeat ten thousand times. The Perceptron Convergence Theorem guaranteed that if a correct set of weights exists, this procedure finds it in a finite number of steps.
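
A minimal sketch of that learning rule, assuming a toy AND dataset, a learning rate of 0.1 and an epoch cap. All three are illustrative choices rather than Rosenblatt's settings, and the names predict and train_perceptron are hypothetical.

```python
# Toy perceptron trained with the mistake-driven rule described above.
# Dataset, learning rate and epoch cap are illustrative assumptions.

def predict(weights, bias, inputs):
    """Weighted sum followed by a hard threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=0.1, max_epochs=100):
    """Nudge the weights after every wrong guess; stop after a clean pass."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                # Move each weight in the direction that would have helped.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:
            break
    return weights, bias


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
predictions = [predict(weights, bias, x) for x, _ in AND_DATA]
print(predictions)  # prints [0, 0, 0, 1]
```

AND is linearly separable, so the convergence theorem applies and the loop stops after a handful of passes.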

The New York Times, July 1958: a machine that would eventually “walk, talk, see, write, reproduce itself and be conscious.” The hype cycle is not new.

The single perceptron is to modern AI what the transistor is to modern computing: everything since is this unit, arranged.

1969: The Wall

The book that almost killed neural networks

Minsky & Papert prove a single-layer perceptron cannot compute XOR
The proof was correct — and still is
The lethal move was extrapolation: multi-layer networks declared “sterile”
Funding dried up: the first AI winter

A brick wall blocks a winding path, but a small door in the wall stands ajar with light shining through; a heavy closed book leans against the wall.

1969: Marvin Minsky and Seymour Papert publish “Perceptrons”. The XOR theorem is impeccable mathematics — a single layer can only separate linearly-separable classes. That result has never been overturned.
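
A small sketch of what "linearly separable" means in practice. It brute-forces a coarse grid of weights for a single threshold unit, which is an illustration and not a proof; the grid range and the helper names are assumptions made for this example.

```python
from itertools import product


def classify(w1, w2, b, x1, x2):
    """One threshold unit with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def separable(truth_table, grid):
    """True if some (w1, w2, b) on the grid reproduces the truth table."""
    return any(
        all(classify(w1, w2, b, x1, x2) == y
            for (x1, x2), y in truth_table.items())
        for w1, w2, b in product(grid, repeat=3)
    )


GRID = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5 (assumed)
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_ok = separable(AND, GRID)
xor_ok = separable(XOR, GRID)
print(and_ok, xor_ok)  # prints True False
```

No finer grid would rescue XOR: the theorem says no single line separates its two classes.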

But the damage came from the leap: they called the multi-layer extension “sterile”, and the field read that as “neural networks are a dead end.” The extension they dismissed turned out to be the entire engineering future of AI.

The twist: Minsky was a believer first — he built one of the first neural-net learning machines (SNARC, 1951). And the fight was partly about a finite pot of DARPA funding.

The lesson worth keeping: the right question is never just “is the math correct?” — it’s “what is the gap between the specific limit the math proves and the general claim being built on top of it?” That gap is where decades get lost.

Rosenblatt died in a boating accident in 1971 and never saw the revival. Recurring theme in this history: many of the pioneers never got to read the ending.

1986: Backpropagation

One wrong answer, a billion dials — how a network learns who to blame

Loss function: measures model error
Gradient descent: moves weights down the loss surface
Backpropagation: blame flows backward, layer by layer (chain rule)
Overfitting: memorises training data, fails to generalise

This is the “credit assignment” problem — a term coined by Minsky himself. A model reads a handwritten 2 and confidently says 8. Which of a billion dials do you turn, and by how much?

The answer: measure how sensitive the final error is to each weight — that’s all a derivative is here. Innocent weights barely move the cost; guilty weights send it flying. The error signal flows backward through the network, and each connection computes its own share of the blame locally. You never have to tell the middle layers the right answer.
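
To make "blame flows backward" concrete, here is a minimal sketch with just two weights in a chain. The tiny network, the input values and the function names are assumptions for illustration; the finite-difference check at the end is a standard way to confirm hand-written gradients.

```python
import math


def forward(w1, w2, x):
    hidden = math.tanh(w1 * x)      # layer 1
    return hidden, w2 * hidden      # layer 2 (linear output)


def loss_and_grads(w1, w2, x, target):
    hidden, output = forward(w1, w2, x)
    loss = 0.5 * (output - target) ** 2
    d_output = output - target                   # dLoss/dOutput
    grad_w2 = d_output * hidden                  # blame for the last weight
    d_hidden = d_output * w2                     # blame passed back one layer
    grad_w1 = d_hidden * (1 - hidden ** 2) * x   # through tanh, then w1 * x
    return loss, grad_w1, grad_w2


def numeric_grad(f, value, eps=1e-6):
    """Central finite difference: nudge one dial and watch the loss."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


w1, w2, x, target = 0.5, -1.2, 0.8, 1.0   # made-up values
loss, g1, g2 = loss_and_grads(w1, w2, x, target)
n1 = numeric_grad(lambda w: loss_and_grads(w, w2, x, target)[0], w1)
n2 = numeric_grad(lambda w: loss_and_grads(w1, w, x, target)[0], w2)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)  # prints True True
```

The chain-rule gradients agree with the "nudge and measure" estimates, but need one backward pass instead of one forward pass per weight.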

The engine is nothing exotic: just the chain rule, used with discipline. The wall Minsky pointed at was real, but the door was sitting in a freshman calculus textbook. Werbos worked it out in his 1974 dissertation and offered Minsky a co-authored correction; Minsky declined. It took until Rumelhart, Hinton & Williams’ 1986 Nature paper (four pages) for the field to listen.

Standard ML-basics framing still applies: supervised learning on input-output pairs, differentiable loss, gradient descent on the loss surface.

Overfitting is the textbook failure mode of classical ML. Worth noting: large transformer models largely sidestep it, thanks to vast and diverse data, regularisation (dropout, weight decay), the double-descent phenomenon, and emergent representations that capture structure rather than surface patterns.

Backprop’s gift wasn’t intelligence — it was correctability: the ability to be wrong and to know exactly who to blame.

1986: The Training Loop

%%{init: {"flowchart": {"nodeSpacing": 15, "rankSpacing": 18}, "theme": "base", "themeVariables": {"edgeLabelBackground": "#ffffff"}}}%%
flowchart TD
    A([Training data]) --> B["Forward pass"]
    B --> C[Compute loss]
    C --> D[Backpropagation]
    D --> E[Update weights]
    E -->|repeat| B
    C -->|loss small enough| F([Done ✓])

    classDef io     fill:#6c757d,stroke:#495057,color:#fff
    classDef step   fill:#4b9cd3,stroke:#2c6e9e,color:#fff
    classDef key    fill:#003b6f,stroke:#001f3f,color:#fff

    class A,F io
    class B,C step
    class D,E key
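
The loop in the diagram can be sketched in a few lines. This is a toy under stated assumptions: a one-weight linear model, data generated from y = 2x + 1, mean squared error, and a hand-picked learning rate and tolerance.

```python
# Toy training loop: forward pass, loss, gradients, update, repeat.
# Data, model, learning rate and tolerance are illustrative assumptions.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = 0.0, 0.0
learning_rate = 0.05
tolerance = 1e-8

for step in range(10_000):
    # Forward pass and loss (mean squared error).
    errors = [(w * x + b) - y for x, y in data]
    loss = sum(e * e for e in errors) / len(data)
    if loss < tolerance:        # "loss small enough" exit in the diagram
        break
    # Backpropagation: gradients of the loss with respect to w and b.
    grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, data)) / len(data)
    grad_b = 2 * sum(errors) / len(data)
    # Update weights: one step down the loss surface.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3), step)
```

The same four stages appear, at vastly larger scale, in every model later in the timeline.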

1989–2006: Winters and Stubborn Ideas

1989 — LeCun’s convolutional network reads real ZIP codes: structure is knowledge
1990s — SVMs win on clean math and guarantees; deep nets stall on the vanishing gradient
2006 — Hinton’s layer-wise pretraining revives depth, under a new name: deep learning

Three beats, quickly.

1989: Yann LeCun at Bell Labs trains a convolutional network on real mail from Buffalo — the first neural network with a day job. By the early 2000s its descendant read ~10% of all US checks. The key idea: bake what you already know into the architecture (a pattern worth finding in one place is worth finding everywhere), so the network’s capacity goes to what actually needs learning. “Structure is knowledge” — remember this phrase, it comes back with transformers and with tool use.

1990s: Support Vector Machines (Cortes & Vapnik, 1995; written down the hall from LeCun) beat neural networks fair and square: convex optimisation, one best answer, real theory. Meanwhile deep nets hit the vanishing gradient: with sigmoid activations, the blame signal shrinks at every layer going backward, and after twenty layers only about a millionth is left. The one thing that could make neural nets special (depth) was exactly what wouldn’t train. Second winter.
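
A quick back-of-envelope for the vanishing gradient. The "millionth after twenty layers" figure matches a signal that halves at every layer, which is an assumption used here for illustration. The sigmoid's own slope never exceeds 0.25, so with weights ignored (a further simplification) the shrinkage is steeper still.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid at z."""
    s = sigmoid(z)
    return s * (1.0 - s)


max_slope = sigmoid_slope(0.0)  # 0.25, the steepest a sigmoid ever gets

shrink = {}
for depth in (5, 10, 20):
    halved = 0.5 ** depth             # assumed: signal halves per layer
    slope_only = max_slope ** depth   # sigmoid slope alone, weights ignored
    shrink[depth] = (halved, slope_only)
    print(f"{depth:>2} layers: halving -> {halved:.1e}, "
          f"sigmoid slope alone -> {slope_only:.1e}")
```

ReLU avoids this because its slope is exactly 1 for positive inputs, so the product does not shrink layer after layer.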

2006: Hinton, Osindero & Teh train deep networks one layer at a time — greedy unsupervised pretraining, then fine-tune the whole stack. The trick itself became obsolete within a few years (better initialisation + ReLU + GPUs let you train from scratch), but it granted the field permission to go deep again — and the pretrain-then-fine-tune workflow you use every time you call from_pretrained was born here. They also rebranded: “neural networks” was radioactive, so they called it deep learning.

The pattern in all three: being better and being believed are different things.

2012: AlexNet — The Year the Dam Broke

26% (2011 best, top-5 error) → 15% (AlexNet, 2012) · 2 gaming GPUs

Nothing new in the network — the world finally supplied data + compute
ReLU dodges the vanishing gradient; dropout tames overfitting
“Just make it bigger” is born; scale becomes the moat

2012, the ImageNet challenge: 1.2M labelled images, 1000 categories. Best systems (hand-designed features + SVMs) plateaued around 26% top-5 error. AlexNet — Krizhevsky, Sutskever, Hinton — comes in at ~15%. That’s not an improvement, it’s a different weather system. Within months every serious lab changed direction.

And the architecture? A deep convolutional network — LeCun’s 1989 idea, finally cashing its cheque. What changed wasn’t the idea:

Data: Fei-Fei Li’s ImageNet — 14M hand-labelled images, mocked at the time as too big, too manual, too unglamorous.
Compute: two NVIDIA GTX 580 gaming cards. The compute that unlocked modern AI arrived as hardware a teenager might have under their desk.
Two small tricks: ReLU (pass positives straight through — dodges the vanishing gradient) and dropout (break the network on purpose during training so it holds up in the real world).

The bill: if wins come from data and compute, the question becomes who has the data and the GPUs? At the frontier, scale is the moat — that’s still the industry structure you live in today.

Rosenblatt (d. 1971) and Rumelhart (d. 2011) never saw it.

Act II

2013 – 2020

The machine learns to read

The Next Problem: Language

Vision fell to scale — but language resisted
An image is a grid, all present at once
A sentence is a sequence: “bank” needs memory of what came before
And first: how do you feed text to a machine that only does arithmetic?

LLMs: Tokenisation

LLMs cannot process raw text; it must first be converted to numbers.

Text is split into sub-word tokens using a learned vocabulary
Each token is assigned a unique integer ID
Common words are single tokens; rare words split into pieces

“I went to the”

235285 “I” 3806 “▁went” 576 “▁to” 573 “▁the”

We’re dealing with text, but the model operates on numbers. So the first step is tokenisation: converting raw text into a sequence of token IDs.

A vocabulary with an entry for every whole word would be too large, so we break text into sub-word tokens and use them as building blocks that can represent any word. This keeps the vocabulary manageable and lets the model handle rare or technical words by splitting them into pieces.

Tokenisation is the first step in the pipeline. The tokeniser is a separate component from the LLM — it ships as a vocabulary file and a set of merge rules (BPE or SentencePiece). The vocabulary size for Gemma is ~256k tokens.

Sub-word tokenisation means common words like “the” are a single token, while rare or technical words like “backpropagation” split into pieces: ▁back, prop, agation.
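
A toy sketch of sub-word splitting. It uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification swapped in for the learned merge rules a real tokeniser uses. The first four IDs are copied from the slide; the IDs for the "backpropagation" pieces are made up.

```python
# Hypothetical mini-vocabulary. IDs for "I", "▁went", "▁to", "▁the" come
# from the slide; the last three IDs are invented for illustration.
VOCAB = {"I": 235285, "▁went": 3806, "▁to": 576, "▁the": 573,
         "▁back": 1001, "prop": 1002, "agation": 1003}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def tokenise(text, vocab):
    """Greedy longest-match split; '▁' marks a preceding space."""
    marked = text.replace(" ", "▁")
    pieces, start = [], 0
    while start < len(marked):
        for end in range(len(marked), start, -1):
            if marked[start:end] in vocab:
                pieces.append(marked[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {marked[start:]!r}")
    return pieces


def detokenise(ids):
    """Inverse lookup: IDs back to pieces, then '▁' back to spaces."""
    return "".join(ID_TO_PIECE[i] for i in ids).replace("▁", " ")


pieces = tokenise("I went to the", VOCAB)
ids = [VOCAB[p] for p in pieces]
rare = tokenise(" backpropagation", VOCAB)
print(pieces, ids)
print(rare)
print(detokenise(ids))
```

The common words come out as single tokens, while the rare word falls apart into three reusable pieces.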

2013: Meaning Becomes Geometry

word2vec: king − man + woman ≈ queen

The embedding matrix maps tokens to vectors; directions encode meaning.

A 3D vector space showing that the displacement from E(Japan) to E(Germany) approximately equals the displacement from E(Sushi) to E(Bratwurst), illustrating that directions in embedding space encode meaning.

Attention enriches each vector with context: bank (river) vs bank (finance)

3Blue1Brown, Deep Learning Ch. 5

Timeline stop: 2013, word2vec (Mikolov, Google). The discovery that made language tractable: meaning can become location. Words map to vectors, and directions in that space encode relationships.

We need to convert tokens to vectors before the model can process them — the model operates in a continuous vector space, not discrete token IDs. The embedding matrix maps each token ID to a dense vector.

These vectors capture semantic meaning — similar words have similar embeddings. “king” and “queen” are close; “king” and “carrot” are far apart.

$\text{king} - \text{man} + \text{woman} \approx \text{queen}$

In practice this exact relationship isn’t very accurate (queen has multiple meanings), but the point stands: directions encode meaning.

In this example the displacement between Germany and Japan is approximately the same as between bratwurst and sushi. The food/country direction generalises: pizza/Italy, tacos/Mexico, sushi/Japan. These patterns exist before attention runs. Attention enriches each vector with context, distinguishing “bank” (river) from “bank” (finance).
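
A sketch of "directions encode meaning" with hand-made vectors. The three-dimensional embeddings below are invented so the analogy works exactly; real embeddings are learned, have thousands of dimensions, and only satisfy such relationships approximately.

```python
import math

# Made-up 3-d embeddings, chosen by hand for illustration only.
E = {
    "Japan":     [0.9, 0.1, 0.0],
    "Germany":   [0.1, 0.9, 0.0],
    "sushi":     [0.9, 0.1, 1.0],
    "bratwurst": [0.1, 0.9, 1.0],
    "pizza":     [0.5, 0.5, 1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to E[a] - E[b] + E[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    candidates = [w for w in E if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(E[w], target))


answer = analogy("sushi", "Japan", "Germany")
print(answer)  # prints bratwurst
```

Subtracting Japan and adding Germany moves along the "country" direction while the "food" component stays put.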

File this away for later too: this same “meaning becomes location” trick is what powers RAG and vector databases this afternoon.

2014: Attention — The Bottleneck

Stop making the machine work from memory — let it look back

RNNs read one token at a time, carrying a running “mental note”
seq2seq: the whole sentence crushed into one fixed-size vector
Like translating a paragraph from a single sticky note
Bahdanau 2014: keep everything, let the decoder look back with learned soft weights
But attention was bolted onto RNNs — sequential, GPUs sitting idle

A long scroll of text is squeezed through a funnel, producing a single tiny sticky note that a confused reader stares at.

The 1990s-2014 approach to language: recurrent networks (RNNs). Read a word, update a running mental note (the hidden state), repeat. LSTMs (1997) added learned gates for what to keep and forget — that’s what powered Google Translate’s big 2016 jump.

The wall: encoder-decoder translation crushed the entire source sentence into one fixed-size vector, and the decoder had to rebuild everything from just that. Read a paragraph once, write yourself one sticky note, hand the paragraph away, now translate. Quality fell off a cliff past a certain sentence length. One fixed-size vector is not enough room to hold a long thought.

The fix (Bahdanau, Cho & Bengio, 2014) was almost too simple to trust: keep all the encoder’s notes, and for every output word compute learned soft weights over all the inputs — 70% on this word, 20% on that. The model’s eyes flicking back to the source sentence.
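
The "soft weights over all the inputs" step can be sketched as follows. One loud assumption: scores here are plain dot products, swapped in for the small learned scoring network of the original paper, and the encoder and decoder vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score every encoder note, then blend the notes by those weights."""
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# One made-up "note" per source word, and the decoder's current state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 2) for w in weights], [round(c, 2) for c in context])
```

The decoder gets a fresh blend of the source for every output word, instead of one fixed sticky note.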

But for three years attention was an accessory bolted onto the RNN. And the RNN had to read in order — inherently sequential, impossible to parallelise, so the GPUs that won 2012 sat idle. The best part of the machine was chained to the worst part.

Cliffhanger: what if you kept the accessory and threw away the machine?

2017: Attention Is All You Need

The 2017 paper hiding inside every AI you use

Eight Google researchers, writing a translation paper
The reckless move: keep attention, throw away the RNN
Every token attends to every other token — in parallel
Parallel means GPUs saturate; GPUs saturating means scale

Six word blocks arranged in a circle, every block connected to every other by thin lines, with one connection highlighted — every token attending to every other token.

“Attention Is All You Need”, Vaswani et al., 2017. Eight authors, listed as equal contributors, order randomised. They were trying to improve machine translation — none of them saw what it would become. It is now the blueprint for every AI you talk to: GPT literally stands for Generative Pretrained Transformer.

The move sounds absurd on paper: attention was invented as an accessory to the RNN — keeping it while dropping the RNN is like keeping the GPS and getting rid of the car. But it works, because self-attention alone can capture the relationships that matter, and without recurrence the whole sequence processes simultaneously.

That’s the deep connection back to 2012: AlexNet proved GPUs + scale win, but RNNs couldn’t use that hardware — they read one word at a time. The transformer is the architecture that finally let language soak up all the compute. “Structure is knowledge” again: the structure a language model needs is attention plus a sense of position, and almost nothing else needs to be hard-wired.

Now let’s open the box and walk through exactly what happens at inference time — this machinery is running every time you watch a model answer token by token.

Transformers: Inference I

Convert text to tokens

Predict new tokens

Convert tokens to text

Transformers: Inference II

“I went to the”

I ▁went ▁to ▁the

Transformers: Inference II

“backpropagation”

▁back prop agation

3 tokens

Transformers: Inference II

Token	ID	Embedding (2048 dims)
I	235285	[ 0.21, -0.83, 0.54, 0.12, … ]
▁went	3806	[ -0.44, 0.31, 0.09, -0.77, … ]
▁to	576	[ 0.67, 0.02, -0.51, 0.38, … ]
▁the	573	[ 0.55, -0.19, 0.73, -0.02, … ]

Transformers: Inference II

transformer block

Self-attention

I went to the

“I” ↔︎ “went”: subject-verb | “went” → “to”: verb-preposition

Self-attention

“The bank by the river was steep”

“bank” attends strongly to “river”, so it is read as a riverbank, not a financial institution

Transformers: Inference II

▁the → [ 0.55, -0.19, 0.73, … ]
× unembedding matrix (2048 × 256k)
→ logits for every token in vocab
→ softmax → sample


Token	Probability
▁library	31%
▁store	18%
▁park	12%
▁doctor	8%

Only the last token’s vector (“▁the”) feeds the prediction; attention has already folded the earlier tokens into it.

Transformers: Inference II

[235285] [3806] [576] [573] → 4376 “▁library”
[235285] [3806] [576] [573] [4376] → 736 “▁this”

full sequence fed back in each loop

Transformers: Inference II

[235285, 3806, 576, 573, 4376, 736]
↓ vocab lookup
“I went to the library this”

just a lookup table, the inverse of tokenisation

Tokenise: text to sub-word token IDs
LLM: token IDs enter the model
Embed + pos. encoding: token IDs to dense vectors with position
Self-attention + MLP × N layers: every token attends to every other
Predict & sample: logits to softmax to sample next token
Autoregressive: feed token back, repeat
Decode: token IDs to text

The crucial innovation is self-attention: rather than processing tokens in sequence (like an RNN), every token attends to every other token in parallel. This lets the model capture long-range dependencies and scales well on modern hardware.
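
A compact sketch of single-head scaled dot-product self-attention in plain Python. The token vectors and the identity projection matrices are made up; a real model learns the projections and runs many heads per layer. The optional causal flag reflects how decoder-only models stop a token from attending to later positions.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, causal=False):
    """x: one row per token. Returns (attention weights, new token vectors)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / scale for s in row]
        if causal:  # hide tokens that come after position i
            scaled = [s if j <= i else float("-inf")
                      for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return weights, matmul(weights, v)


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three made-up token vectors
identity = [[1.0, 0.0], [0.0, 1.0]]        # stand-in for learned projections

weights, output = self_attention(x, identity, identity, identity)
causal_weights, _ = self_attention(x, identity, identity, identity, causal=True)
for row in weights:
    print([round(w, 2) for w in row])
```

Every row of weights is computed independently of the others, which is what lets the whole sequence run in parallel on a GPU.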

Sampling strategies: the model doesn’t always pick the most likely next token. Temperature scales the probability distribution (higher = more random). Top-k restricts sampling to the k most likely tokens. Top-p (nucleus sampling) samples from the smallest set of tokens whose cumulative probability exceeds p.
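
The three knobs can be sketched directly. The logits below are made up, and the helper names are hypothetical; libraries implement the same ideas with more care for ties and edge cases.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Higher temperature flattens the distribution; lower sharpens it."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out everything not in keep, then rescale to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)


tokens = ["▁library", "▁store", "▁park", "▁doctor"]
logits = [2.0, 1.5, 1.0, 0.5]   # made-up scores

probs = softmax_with_temperature(logits)
flatter = softmax_with_temperature(logits, temperature=2.0)
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.7)

rng = random.Random(0)          # seeded so the demo is repeatable
choice = rng.choices(tokens, weights=nucleus)[0]
print([round(v, 2) for v in probs], choice)
```

With these numbers both filters keep only the top two tokens, so the sampled token is always "▁library" or "▁store".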

Transformers: Summary

Tokenise: text → sub-word token IDs
Embed: token IDs → dense vectors (static meaning)
Self-attention: enrich each vector with context (dynamic meaning)
MLP: transform each representation; attention + MLP repeat × N layers
Predict & sample: last token’s vector × unembedding matrix → next token ID
Autoregressive loop: append token, feed full sequence back in
Decode: token IDs → text (lookup table)

Transformers: Summary

The model is completely stateless
All context is in the text fed to it; there is no memory
Each forward pass re-processes the full sequence
Longer contexts are more expensive: attention cost grows as O(n²) in sequence length n
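
A sketch of what "stateless" and O(n²) mean for the generation loop. The next_token function is a hypothetical stand-in for a model forward pass, and the cost counter simply tallies n × n token pairs per pass, ignoring layers and heads. Real inference engines cache intermediate results to avoid some of this recomputation, which this toy leaves out.

```python
def next_token(sequence):
    """Hypothetical stand-in for a forward pass: a pure function of the
    sequence it is handed, with no hidden state between calls."""
    return (sum(sequence) * 31 + len(sequence)) % 1000


def generate(prompt_ids, new_tokens):
    sequence = list(prompt_ids)
    attention_pairs = 0
    for _ in range(new_tokens):
        attention_pairs += len(sequence) ** 2   # every token against every token
        sequence.append(next_token(sequence))   # feed the full sequence back in
    return sequence, attention_pairs


prompt = [235285, 3806, 576, 573]               # "I went to the"
sequence, cost = generate(prompt, new_tokens=2)
print(len(sequence), cost)                      # prints 6 41

_, long_cost = generate(prompt * 10, new_tokens=2)
print(long_cost)                                # prints 3281
```

A prompt ten times longer costs about eighty times more here, and calling generate twice with the same prompt gives the same answer because nothing is remembered between calls.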

LLM Hello World

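import os
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM

# Log in to Hugging Face so hosted models can be downloaded.
login(token=os.environ["HF_API_KEY"], add_to_git_credential=True)

# Load the model and the tokenizer that matches it.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Tokenise: text -> tensors (token IDs plus attention mask).
input_text = "I went to the"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 10 new tokens with nucleus sampling, then decode to text.
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))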

huggingface_hub / transformers: The platform and library where the ML community collaborates on models, datasets, and applications.
Login: Register with Hugging Face and obtain a key to download hosted models.
Load tokenizer & model: Downloads Google Gemma-2b and its matching tokenizer.
Tokenise input: Converts your text into a tensor of token IDs the model can read.
Generate & decode: Model predicts tokens autoregressively; tokenizer converts them back to text.

Act III

2018 – now

From prediction to agents

2018: The Internet as a Textbook

117M parameters (GPT-1, 2018) → 1.5B parameters (GPT-2, 2019)

Take half the transformer, train it on one objective: predict the next token
The text is its own answer key — no human labelling, so it scales to the internet
Sounds too dumb to matter — but “The capital of France is ___” requires the fact

A robot sits reading an enormous open book whose pages are made of tiny webpage wireframes, with a stack of more giant books behind it.

The Hello World demo you just saw — “I went to the” → “library” — is this idea. GPT is that exact objective, made enormous.

Why next-token prediction works: predicting the next word well forces understanding. To finish “The capital of France is ___” you need the fact, not just the grammar. Understanding turned out to be the cheapest way to win at prediction.

GPT-1 (Radford et al., 2018) reused the 2006 recipe at transformer scale: pretrain on raw text, fine-tune on the small labelled task. GPT-2 (2019) surprised everyone by doing tasks with no fine-tuning at all — zero-shot translation and summarisation emerged as a byproduct of scale. OpenAI initially declared the full model “too dangerous to release” — the field’s first loud public argument about model release.

But be clear what this was: a brilliant mimic of the shape of writing with no reliable commitment to the substance. It continued text; it couldn’t follow instructions; it stated made-up things with total confidence. Keep that gap in mind — it takes until 2022 to close.

2020: Scaling Laws

Bigger was the whole idea

Plot loss vs. parameters, data, compute → a straight line on log-log axes
You can forecast a model’s quality before you build it
“The graph turned a leap of faith into a line item”
GPT-3: 175B params — new skills appear that nobody trained in

Kaplan et al., January 2020: train 200+ models of different sizes and the loss follows a power law in parameters, data, and compute. A ruler-straight line on a log-log plot. That predictability is what justified spending nine figures on a single training run — you could extrapolate the return before paying.
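
Why a power law is a straight line on log-log axes: if $L(N) = a N^{-\alpha}$ then $\log L = \log a - \alpha \log N$. The sketch below fits that line and extrapolates. The data points are synthetic, generated from made-up constants, and are not measurements from the paper.

```python
import math

# Synthetic (model size, loss) points from a made-up power law.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** -TRUE_ALPHA for n in sizes]


def fit_power_law(xs, ys):
    """Least-squares straight line through (log10 x, log10 y)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x, mean_y = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_power_law(sizes, losses)

# Forecast the loss of a model 100x larger than anything "trained" above.
predicted = 10 ** (intercept + slope * math.log10(1e11))
print(round(slope, 3), round(predicted, 3))
```

The fitted slope recovers the exponent, and the forecast for the unbuilt model comes from extending the line: that is the "line item" in the slide.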

GPT-3 (mid-2020, 175B parameters, 100x GPT-2) revealed in-context learning: show it two English→French pairs in the prompt and leave a third hanging, and it translates. The weights are frozen — the prompt doesn’t retrain anything. The examples act as an address, pointing at a skill already coiled in the weights.

Note the power law is also a statement of diminishing returns — not infinite intelligence, but a reliable exchange rate. And DeepMind’s Chinchilla (2022) corrected the recipe: GPT-3 was lopsided — huge in parameters, starved of data. Data matters as much as size.

The deflating lesson of this era: the winning move wasn’t insight, it was resources. Remember AlexNet’s bill — “who has the data and the GPUs?” — now with nine zeros. That hangs over everything until 2025, when DeepSeek shows the bill is partly optional (MoE routing + distillation — more on that in the notes when we get to open weights).

Limit: GPT-3 was optimised to be plausible, not correct. Fluency and truth are different targets — this is where hallucination becomes a household problem.

Nov 2022: ChatGPT

The week everyone found out — the model wasn’t new, the manners were

GPT-3 had been sitting in an API for two years
RLHF: (1) fine-tune on good answers, (2) humans rank outputs → reward model, (3) optimise against learned taste
Same 1986 training loop — only the blame signal changed

1M users in 5 days → 100M users in 2 months

A human hand ranks three answer cards into podium order while a small robot in a chat bubble watches and learns.

You probably remember the week, even if you can’t put a date on it: 30 November 2022.

The gap between GPT-3 and ChatGPT wasn’t capability — it was intent. A base model continues text; it doesn’t answer you. And you can’t program “be helpful” — it’s a thousand tiny judgment calls. So you show it judgments instead:

Supervised fine-tuning on human-written good answers.
Humans rank several model outputs (comparing is cheap, writing is expensive) → train a separate reward model — a machine that has learned human taste.
Optimise the main model against that reward model.

This is backpropagation wearing a new coat: same loop from 1986, but the signal changed from “was this the right digit?” to “would a human prefer this?”.
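
One common way to turn "humans rank outputs" into a blame signal is a pairwise loss on the reward model's scores. The slides don't say which loss was used, so treat the formula below as an assumption; the reward values are made up.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written as log(1 + e^-margin).

    Small when the reward model already scores the human-preferred answer
    higher, large when it has the order backwards.
    """
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))


agrees = preference_loss(2.0, 0.0)      # model agrees with the human ranking
undecided = preference_loss(0.0, 0.0)   # model cannot tell the answers apart
disagrees = preference_loss(0.0, 2.0)   # model prefers the rejected answer
print(round(agrees, 3), round(undecided, 3), round(disagrees, 3))
```

The loss is differentiable in both scores, so the same gradient-descent loop from 1986 can train the reward model on comparisons.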

The interface was the invention — the world didn’t fall for a smarter machine, it fell for a more agreeable one in a chat box.

The costs of RLHF are the things that annoy you today: sycophancy (rewarded for approval, not truth), reward hacking (Goodhart’s law: when a measure becomes a target, it ceases to be a good measure), and “as an AI language model, I cannot…” — RLHF scar tissue.

Variant worth knowing: Anthropic’s Constitutional AI (also late 2022) — replace much of the human ranking with the model critiquing itself against a short list of principles written in plain English (RLAIF). It makes the values legible — you can read the constitution — but legible still means someone chose. Whose values? That question doesn’t go away.

2023: The Weights Get Out

The year AI escaped the API

Feb 2023: Meta’s Llama weights leak within a week of release
Open weights ≠ open source: you get the baked cake, not the recipe
Quantisation (llama.cpp): frontier-class models on a laptop
This afternoon you’ll use self-hosted models — this is why that’s possible

Hands pass a finished three-layer cake through an open door while the recipe book stays locked in a cabinet — open weights without the recipe.

Until 2023, frontier models lived behind APIs — you could only knock and ask a question through the mail slot. February 2023: Meta releases Llama to researchers; within about a week the weights are torrented to the world. July 2023: Llama 2 is deliberately open for commercial use. The mail slot was gone — people had the whole door.

Two ideas to keep:

Open weights is not open source. The weights are the finished, baked cake — you can run it and fine-tune it, but you don’t get the recipe: training data, code, process. You can’t reproduce or fully audit it.

Quantisation: store each weight in fewer bits — 4 instead of 16 — like a more compressed JPEG. Georgi Gerganov’s llama.cpp put serious models on MacBooks. That plus ollama is why ollama run on your laptop works with the network cable pulled.
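To make "fewer bits per weight" concrete, here is a minimal sketch of symmetric 4-bit quantisation in plain Python. It is an illustration under simplifying assumptions: the function names are hypothetical, and a single shared scale is used for all weights, whereas real schemes such as those in llama.cpp quantise in small blocks with per-block scales.

```python
def quantise_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale (symmetric)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.91, -0.07, 0.33]
codes, scale = quantise_4bit(weights)
restored = dequantise(codes, scale)

# Each restored weight lands within half a quantisation step of the original.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 3), round(worst_error, 3))
```

The trade is the same as with a more compressed JPEG: a quarter of the storage of 16-bit weights, paid for with a small, bounded rounding error per weight.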

This matters directly for you: the hands-on session runs on Cambridge’s self-hosted LLM service — open-weight models (Devstral, Qwen) running on university hardware. Data sovereignty, cost, and reproducibility arguments all flow from this moment. The DeepSeek shock of January 2025 (frontier-competitive open weights, trained with MoE + distillation at a fraction of the assumed cost) pushed the same door further open.

The dark side, honestly stated: open weights means anyone can fine-tune the safety training right back out in a few hours. The same fact wearing two faces.

The Goldfish Problem, Fix #1 — RAG

Giving a frozen model an open book

The model is frozen and stateless — and it bluffs (plausible ≠ correct)
Fine-tuning is the wrong fix: knowledge dissolves into the weights like sugar into water
RAG: chunk → embed → store; at question time retrieve top-k and put it in the context
Don’t change the model — change what it can see

A goldfish in a bowl sits at an exam desk beside a large open reference book — the frozen model taking an open-book exam.

Remember the summary slide: the model is a goldfish — stateless, frozen at its training cutoff, and it will confidently invent a config file that doesn’t exist. The tone is flawless; the content is fiction. This falls straight out of next-token prediction.

RAG (Retrieval-Augmented Generation, Lewis et al. 2020 — the idea predates the chatbot wave) is the closed-book exam turned open-book. The pipeline: split your documents into chunks, embed each chunk (2013’s “meaning becomes location”, industrialised), store the vectors in a vector database. At question time: embed the question, nearest-neighbour search for the most similar chunks, stuff them into the prompt, and instruct the model to answer from them — with citations.
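The pipeline above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, a Python list stands in for the vector database, and the documents and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document, size=12):
    """Split a document into chunks of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question, store, k=2):
    """Return the k stored chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Index time: chunk -> embed -> store.
docs = [
    "The model output is written to results.nc every six hours",
    "Set the timestep in config.yaml before launching the run",
    "Lunch is served in the atrium at noon",
]
store = [(c, embed(c)) for d in docs for c in chunk(d)]

# Question time: retrieve the top-k chunks and place them in the prompt.
question = "Where do I set the timestep"
context = retrieve(question, store, k=1)
prompt = (
    "Answer using only the sources below and cite them.\n"
    + "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    + f"\nQuestion: {question}"
)
print(prompt)
```

Note that the model itself never appears in this code: everything here changes what the model can see, not what it is.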

We didn’t make the model know more — we gave it somewhere to look.

The limit: RAG is only as good as its retrieval. Bad retrieval makes it authoritatively incorrect — the garbage arrives wearing a source link. Similarity is not the same as relevance, chunking is fiddly, and the context window caps how much you can show it.

2023: Tool Use — The Chatbot Grows Hands

The model never runs anything. It writes the ticket.

You hand the model a menu of tools
It emits a structured request — data, not prose: get_weather(city="Cambridge")
Your harness runs the real function
The result goes back into the context; the model writes the answer

A chef robot writes an order ticket and clips it to the kitchen rail; hands on the other side of the pass take the ticket to the stove — the model writes the ticket, the harness cooks.

The model still can’t do anything — tool use is stagecraft built around that limitation. It never executes code; it emits a structured tool call, the surrounding framework executes it, and the result is appended back into the context for the model to re-reason with.
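A minimal sketch of the harness side of that exchange, assuming the model emits its tool call as a JSON object with `name` and `arguments` fields. The JSON shape, the `TOOLS` registry, and the `get_weather` stub are all hypothetical and not tied to any particular vendor's API.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool: a real one would query a weather service."""
    return f"14 C and cloudy in {city}"


# The menu of tools the harness is willing to run.
TOOLS = {"get_weather": get_weather}


def run_tool_call(raw: str) -> str:
    """Parse a structured tool call emitted by the model and execute it."""
    call = json.loads(raw)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        # Hallucinated tool names are reported back, never executed.
        return f"error: unknown tool {name!r}"
    try:
        return str(TOOLS[name](**args))
    except TypeError as exc:
        # Wrong or missing arguments also go back to the model as text.
        return f"error: bad arguments for {name}: {exc}"


# Pretend the model emitted this string: data, not prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Cambridge"}}'

context = ["user: What's the weather in Cambridge?"]
context.append(f"tool_result: {run_tool_call(model_output)}")
print(context)
```

The model only ever produced the string in `model_output`. Everything that touched the outside world happened inside `run_tool_call`, which is code you write and can restrict.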

Division of labour worth memorising: the model is the reasoner, the tools are the hands, the harness is the nervous system connecting them. Like a chef writing a ticket and clipping it to the rail rather than cooking the dish.

Timeline: ReAct (2022) showed that interleaving reasoning and acting cuts hallucination; Toolformer (Feb 2023) let a model teach itself when to call tools; OpenAI shipped function calling in June 2023; Anthropic’s tool use went GA in 2024. In about eighteen months, the chatbot grew hands.

“Structure is knowledge” one more time: a bounded menu of tools tells the model something true about the world, and a tool result yanks it back down to the ground — the closest thing we have to an antidote to hallucination.

The sharp edge arrives on the same wire: we handed real-world power to a system whose defining trait is confident plausibility. Wrong tool, hallucinated arguments, and above all prompt injection — the model reads instructions and data as the same stream of text. Hold that thought for the concerns section.

2023–24: The Agent Loop

An agent is scaffolding and a loop around an LLM

LLM: The reasoning engine. It can plan, evaluate, or decide whether to “act” or “answer.”
System Prompt: Defines the persona, available tools, and operational boundaries.
Working memory: Maintains the state, including history, tool outputs, and the current goal.
Tools: External capabilities like web search, code execution, APIs, or connecting to RAG.

One tool call was never enough — almost nothing worth doing is one command. You find a bug by doing: read, run, read the failure, edit, re-run. Each step depends on what the last one revealed.

So: put the model in a loop. Reason (just the next move), act (pick a tool), observe (reality talking back), reflect — repeat. Plan, act, observe, repeat. It’s cooking while tasting, not following a recipe with your eyes closed.

The LLM is the core reasoning engine. CoT is a simple form of agentic reasoning. More complex agents evaluate their own output and decide whether to “act” (call a tool) or “answer” (respond to the user).
The system prompt constrains the persona — a financial adviser agent shouldn’t give medical advice.
Working memory is held in the context window; may need summarising as it grows.
Tools let the LLM affect the outside world — via the harness, never directly.

Crucially: the loop does not live in the model — a model does one forward pass. The loop lives in application code around it. The model is only half of an agent; the harness is the other half.

Cautionary tale: AutoGPT (2023) — GPT-4 in an unscoped loop, 100k GitHub stars, and it faceplanted (~24% success on simple shopping tasks). Errors compound around the loop. Autonomy isn’t a switch you flip by putting a good model in a loop — agents work when the loop is scoped, with fences and feedback. That’s exactly what a good coding agent harness provides.

The Agent Loop

%%{init: {"flowchart": {"nodeSpacing": 40, "rankSpacing": 80}, "theme": "base", "themeVariables": {"edgeLabelBackground": "#ffffff"}}}%%
flowchart LR
    S([System prompt]) --> C
    H([Prompt]) --> C["Context window"]
    C --> L["LLM: generate text"]
    L --> D{Tool call?}
    D -->|yes| T["Tools (incl. RAG)"]
    T -->|result appended| C
    D -->|no| O([Output])

    classDef io      fill:#6c757d,stroke:#495057,color:#fff
    classDef core    fill:#003b6f,stroke:#001f3f,color:#fff
    classDef support fill:#4b9cd3,stroke:#2c6e9e,color:#fff
    classDef decision fill:#e9c46a,stroke:#f4a261,color:#000

    class S,H,O io
    class L core
    class C,T support
    class D decision

An agentic system wraps an LLM in a loop using lots of scaffolding to enable it to solve complex tasks.

We start with two inputs. The Prompt is what the user wants, but the System Prompt is where we’ve given the agent a persona, ‘You are a researcher with access to these specific tools.’ Both of these get combined into the Context Window.

The context window is limited in size so we have to be quite strategic about what goes in there. We can’t include the entire internet or all our documentation.

The LLM looks at the context window and decides: ‘Do I need a tool to answer this?’

If the answer is ‘Yes,’ the agent doesn’t talk to the user yet. Instead, it outputs a specific string (a Tool Call) that the scaffolding recognises. The scaffolding then executes the tool, such as running a Python script or a RAG search, and appends the result (the data it found or the output of the code it ran) back into the Context Window.

Only when the LLM decides it finally has enough information does it generate the final response for the user.

Even though LLMs are entirely stateless, this loop is how we ‘fake’ memory. The LLM now sees the original prompt, its own decision to use a tool, and the tool’s result. It ‘re-reasons’ with this new information.

The LLM itself is stateless — all state lives in the context window. The “loop” is managed by the surrounding application code, not the model.
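The flowchart can be written out as a short loop. This is a sketch only: `fake_llm` is a scripted stand-in for a real model call, and `count_lines`, `TOOLS`, and the message format are hypothetical. The point it illustrates is that the loop, the state, and the step limit all live in application code.

```python
def fake_llm(context):
    """Stand-in for a model call: one stateless pass over the context."""
    results = [line for line in context if line.startswith("tool_result:")]
    if not results:
        # No tool output yet, so ask the harness to run a tool.
        return {"tool": "count_lines", "arguments": {"text": "a\nb\nc"}}
    return {"answer": f"The file has {results[-1].split(': ')[1]} lines."}


def count_lines(text):
    """Hypothetical tool run by the harness, never by the model."""
    return len(text.splitlines())


TOOLS = {"count_lines": count_lines}


def run_agent(system_prompt, user_prompt, llm, max_steps=5):
    """The agent loop: all state is kept in the context list."""
    context = [f"system: {system_prompt}", f"user: {user_prompt}"]
    for _ in range(max_steps):  # a scoped loop: it cannot run forever
        reply = llm(context)
        if "answer" in reply:  # the "no tool call" branch of the diagram
            return reply["answer"], context
        result = TOOLS[reply["tool"]](**reply["arguments"])
        context.append(f"tool_call: {reply['tool']}")
        context.append(f"tool_result: {result}")  # result appended
    return "Stopped: step limit reached.", context


answer, context = run_agent(
    "You are a researcher with access to count_lines.",
    "How many lines are in the file?",
    fake_llm,
)
print(answer)
```

`fake_llm` keeps no memory between calls. It only appears to remember because the harness passes the growing `context` back in on every step.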

Nov 2024: MCP — A Universal Plug for Agents

1958 › 2012 › 2017 › 2020 › 2022 › 2024 › now

The mess: every agent × every tool = a bespoke, brittle connector (N×M)
MCP: one open protocol between agents and tools (N+M)
Like USB for peripherals — or LSP for editors
Servers expose tools (act), resources (read), prompts (templates)

Left: a chaotic tangle of cables between devices. Right: the same devices connected neatly through one central hub — N times M collapses to N plus M.

By 2024, everyone was writing the same integration over and over: this agent needs GitHub, that agent needs Slack, every pairing a bespoke connector. Not one hard problem — the same medium-hard problem solved N×M times.

Anthropic shipped the Model Context Protocol as an open standard in late November 2024. The sharpest analogy for this audience is LSP: write one language server for Python and every editor that speaks LSP gets Python support for free. Write one MCP server for your instrument, your data store, your NetCDF files — and every MCP-speaking agent can use it. The grid collapses into two lists.

An MCP server wraps one tool or data source; the MCP client lives in the agent. It standardises three primitives: tools (the function calling from 2023), resources (the read-context instinct from RAG), and prompts (reusable templates). Within a year it was adopted by OpenAI, Google DeepMind, and essentially every coding tool.
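On the wire, MCP messages are JSON-RPC 2.0. The sketch below builds one request per primitive to show how uniform they are. The method names follow my reading of the MCP specification; the tool name, resource URI, and prompt name are hypothetical, and a real client would also perform an initialisation handshake that is omitted here.

```python
import json


def make_request(request_id, method, params):
    """Build a JSON-RPC 2.0 request, the envelope MCP messages use."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


# One request per primitive: act, read, template.
call_tool = make_request(
    1, "tools/call", {"name": "add", "arguments": {"a": 4, "b": -1}}
)
read_resource = make_request(
    2, "resources/read", {"uri": "file:///data/simple.nc"}
)
get_prompt = make_request(
    3, "prompts/get", {"name": "summarise", "arguments": {}}
)

for message in (call_tool, read_resource, get_prompt):
    print(json.dumps(message))
```

Because every server speaks this same envelope, a client written once can talk to any of them, which is where the N+M comes from.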

This afternoon you will build one with FastMCP and wire it into opencode, including an MCP server that reads NetCDF files.

Two caveats to carry in: sometimes the CLI the agent already knows (git, grep) beats a server — tool definitions cost context tokens. And a universal socket is a universal attack surface: treat a third-party MCP server the way you’d treat giving a new hire your production credentials.

2024–25: Reasoning Models

The machine that learned to think first

Old knob: more compute at training time
New knob: more compute at inference time — let it think per question
Chain-of-thought as scratch space, baked in via RL on verifiable answers
o1 (Sept 2024), DeepSeek-R1 (Jan 2025, open weights)

A robot works through a problem on a big sheet of scratch paper covered in diagrams, a lightbulb glowing overhead and a crumpled first attempt on the floor.

You’ve watched a model think — the pause, the “Thinking…” label. Same weights as the fast version; the difference is test-time compute.

The mechanism: chain-of-thought (write out intermediate steps) boosts accuracy because the model uses its own output as scratch space — each written step gives it something firmer to stand on. Blurting vs. scratch paper. Reasoning models bake this in with reinforcement learning: reward chains that arrive at correct answers — easy to check automatically for maths and code. Underneath, it’s still the 1986 loop.
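What "easy to check automatically" means can be shown with a toy reward function. This is a sketch under assumptions: the `Answer:` convention and the function names are hypothetical, and real training pipelines are considerably more involved.

```python
import re


def extract_final_answer(trace: str):
    """Pull the number after 'Answer:' from a reasoning trace, if present."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", trace)
    return float(match.group(1)) if match else None


def reward(trace: str, expected: float) -> float:
    """1.0 if the final answer matches the known result, else 0.0."""
    answer = extract_final_answer(trace)
    if answer is None:
        return 0.0
    return 1.0 if abs(answer - expected) < 1e-9 else 0.0


good = "17 * 3 = 51, then 51 + 4 = 55. Answer: 55"
bad = "17 * 3 = 41, then 41 + 4 = 45. Answer: 45"
print(reward(good, 55), reward(bad, 55))
```

Notice that the reward inspects only the final answer, never the steps. Nothing in this signal forces the written chain to be the real route to the answer.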

Why this matters for your work: reasoning models are a tool you aim, not a default. 10–50x the tokens, and they’ll overthink trivial problems. Use them for the hard step, not the boilerplate.

The unsettling limit: the visible trace is not guaranteed to be a faithful record of how the model actually reached its answer. It’s a plausible story told alongside the answer, in a font that looks like honesty. Read it as an argument to check, not a verdict to trust.

(Also in this era: models stopped being just language models — CLIP and vision transformers let image patches be treated as tokens, which is why you can paste a screenshot of a stack trace and have it read. Extraordinary at the gist, unreliable on the exact — miscounts, invented text in dense screenshots.)

2025: Coding Agents

Where the whole timeline lands — and where this afternoon begins

Autocomplete (2021) → chat in the sidebar (2022) → it just goes and does it (2024+)
One tool = transformer + next-token + RLHF + tool use + agent loop + MCP
The harness is the other half: if the model is the brain, the harness is the body
The new cost: verification — code that looks right and isn’t, just as fast

This is where the whole timeline converges — and it’s exactly what you’ll use after the break.

Three moves in the editor: first it finished your sentence (Copilot, 2021 — next-token prediction pointed at a keyboard). Then you could ask it things (instruction-tuned chat, 2022 — a brilliant advisor that never touched the keyboard). Now it goes and does it: opens files, edits, runs the tests, reads the failures, repeats.

Count the stops it took: the transformer (2017), next-token prediction at scale (2018–20), instruction-following (2022), tool use (2023), the agent loop (2023–24), MCP (2024). Six stops on the timeline; one tool in your terminal.

And the half nobody sees: the harness. Context management, tool wiring, scoping the loop, feeding back errors. When an agent fails at a task, the fix is at least as often in the harness as in the model — which is good news, because the harness is the part you control. That’s what the opencode session is really about.

The honest cost: these tools produce code that looks right and isn’t, just as fast as code that is right. Your work shifts from typing to specifying, steering, and verifying — you move from “in the loop” to “on the loop”. The agent’s speed is not the ceiling on what you can build; your judgment is.

The Timeline, In One Slide

1958	Perceptron	learning by nudging dials
1986	Backprop (after the first winter)	learning who to blame
2012	AlexNet (after the second)	structure is knowledge; then scale wins
2017	Transformer	attention, read in parallel
2020	GPT-3 + scaling laws	the internet as a textbook
2022	ChatGPT (RLHF)	the interface was the invention
2023	Open weights, RAG, tool use	the chatbot grows hands
2024	Agent loop, MCP	the loop + a universal plug
2025	Reasoning models, coding agents	think first; meet the work

Self Study Resources

Cordero Core, How We Got Here — the 25-part history of GenAI this talk is based on (Medium, weekly)
3Blue1Brown, Language Models Explained:
- Chapters 4, 5, 6, 7
Meridian Cambridge Transformer architecture, Agents

Gen-AI Concerns

Trust
Safety
Ethical
Environmental

Trust

Capability outran verification

Benchmarks are proxies — and Goodhart applies: a measure made a target stops measuring
Benchmark contamination: the exam leaks into the training data
The weights are grown, not written — interpretability is still a young microscope
The posture: calibrated trust — check the diff, not the green tick

The last stop on the timeline is the unresolved one. A model passes the eval suite, the demo lands, it ships — and two weeks later it does something in production no test caught.

Evals like MMLU are fixed exams, and we read the score table like a report card. But Goodhart’s law bites: models optimised toward benchmarks aren’t the same as good models. Worse, benchmarks live on the internet, so they leak into training data — studies have found substantial contamination, with scores dropping double digits on fresh equivalents.

Alignment (RLHF, Constitutional AI) shapes behaviour but can’t guarantee it: sycophancy, reward hacking, jailbreaks, prompt injection. And mechanistic interpretability — actually reading what’s inside the weights — is real but early: a very good microscope pointed at a cell for the first time.

The 1958 perceptron you could fully understand. The thing writing your pull requests, you cannot — capability outran verification. The practical posture is neither blind adoption nor blanket refusal: calibrated trust. Review the diff, keep a human accountable, scope what you hand over — treat it like a sharp new hire. The discomfort is the correct response; the concerns on the next slides are where it bites.

Safety

Amazon experienced several high-profile incidents
AI-slop crippling maintainers e.g., curl
Supply-chain infection e.g., LiteLLM
Hard to debug semantically correct code

And these are just from a software perspective…

Ethical

Staff layoffs under the guise of genAI
Copyright/piracy e.g., Anthropic $1.5bn settlement
Data sovereignty (who owns your data)

Environmental

Massive increase in water & energy consumption
Accelerating e-waste and hardware churn

Opinion

GenAI usage has parallels to HPC
If genAI can help science – I want to make it:
- greener
- safer
- more ethical

Tools and Workflows

Opencode (CLI)

In this half of the training we will make use of opencode

Concepts can also be applied to similar tools e.g., VSCode, GitHub Copilot CLI etc.

Opencode Installation

Installation instructions here

Linux

curl -fsSL https://opencode.ai/install | bash

macOS (Homebrew)

brew install anomalyco/tap/opencode

Windows (download .exe)

Opencode Configuration

We now need to configure opencode to run self-hosted LLMs

Add API key to .bashrc (or equivalent) e.g.,
export CAMLLM_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXX
Configure opencode (see next slide)

Note

If you already have access to UoC’s LiteLLM (https://llm.hpc.cam.ac.uk) you can create a key from the virtual keys page: Virtual Keys → Create New Key.

Setting the API Key

Add the key to your shell profile so it’s available every session:

macOS (zsh):

echo 'export CAMLLM_API_KEY="your-key-here"' >> ~/.zshrc
source ~/.zshrc

Linux (bash):

echo 'export CAMLLM_API_KEY="your-key-here"' >> ~/.bashrc
source ~/.bashrc

Windows (PowerShell):

[Environment]::SetEnvironmentVariable("CAMLLM_API_KEY", "your-key-here", "User")

Note

Remember to replace "your-key-here" with your actual key.

Where is the Config File?

The opencode config lives at:

Mac / Linux: ~/.config/opencode/opencode.json
- If $XDG_CONFIG_HOME is set, it uses $XDG_CONFIG_HOME/opencode/opencode.json instead
Windows: %USERPROFILE%\.config\opencode\opencode.json
- This is .config in your user profile, not %APPDATA%
- Run opencode once to auto-create the directory, then check with dir %USERPROFILE%\.config\opencode
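If you are unsure which path applies on your machine, the rules above can be checked with a few lines of Python. This is a convenience sketch, not part of opencode: the helper name is hypothetical, and it applies the `XDG_CONFIG_HOME` rule on every platform, which is an assumption beyond what is stated above for Mac and Linux.

```python
import os
from pathlib import Path


def opencode_config_path(env=None, home=None):
    """Return the expected opencode.json location under the rules above."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("XDG_CONFIG_HOME"):
        base = Path(env["XDG_CONFIG_HOME"])
    else:
        # On Windows, Path.home() resolves to the user profile directory.
        base = home / ".config"
    return base / "opencode" / "opencode.json"


path = opencode_config_path()
print(path, "(exists)" if path.exists() else "(not created yet)")
```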

Opencode Configuration

edit/create ~/.config/opencode/opencode.json

~/.config/opencode/opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "cam-llm": {
      "options": {
        "baseURL": "https://llm.science.ai.cam.ac.uk/v1",
        "apiKey": "{env:CAMLLM_API_KEY}"
      },
      "models": {
        "Qwen/Qwen3.6-27B-FP8": {
          "name": "Qwen/Qwen3.6-27B-FP8",
          "modalities": { "input": ["text", "image"], "output": ["text"] }
        }
      }
    }
  },
  "permission": {
    "bash": { "*": "ask" },
    "edit": { "*": "allow" }
  }
}

Context Engineering

LLMs are powerful, but suffer from context bloat
Context window is a finite resource
LOTR + Hobbit ~ 750k tokens / 100k LOC ~ 1M tokens

Model Name	Context Size
`Claude 4.6 Opus`	1M
`Gemini 3.1 Pro`	1M – 10M
`GPT-5.3-Codex`	400k
`Devstral-2-123B-Instruct-2512`	256k

Solution

To resolve this issue, Anthropic open-sourced 2 methods:

Model Context Protocol (MCP) [November 2024]
Agent Skills [December 2025]

MCP

Open-source standard
Connect LLMs to external systems

MCP examples

For example, opencode supports 11 built-in tools (see docs)

Note

LLMs can answer questions, but cannot interact with your system.

MCP example (add)

We will build our own using FastMCP

mcp-numbers.py

from fastmcp import FastMCP

mcp = FastMCP(name="mcp-numbers")

@mcp.tool
def add(a: int, b: int) -> int:
  """Add two numbers"""
  return a + b

if __name__ == "__main__":
  mcp.run()

MCP example (add)

Now let’s add mcp-numbers to our opencode configuration
Follow instructions in mcp/README.md
Running /status in opencode should display
Try: Use mcp tool "numbers_add" to add 4 and -1

MCP examples (netcdf)

What about a more interesting example…
Can we give an LLM the power to inspect NetCDF (.nc) files?
Let’s try with MCP.

MCP examples (netcdf)

Inspect file mcp/mcp-netcdf.py

../mcp/mcp-netcdf.py

# /// script
# dependencies = [
#   "netCDF4",
#   "fastmcp",
# ]
# ///

import netCDF4
from fastmcp import FastMCP

# Initialize the FastMCP server
mcp = FastMCP("nc-mcp")


@mcp.tool()
def get_variables(path: str) -> str:
    """
    Reads a NetCDF file from the given path and returns its variables.

    Args:
        path: The absolute or relative path to the NetCDF (.nc) file.

    Returns:
        A string representation of the NetCDF file's variables.
    """
    try:
        # Open the dataset; the context manager closes it even if reading fails
        with netCDF4.Dataset(path) as dset:
            # Join the variable names into one string to return to the client
            return ", ".join(dset.variables.keys())

    except FileNotFoundError:
        return f"Error: Could not find the file at path: {path}"
    except Exception as e:
        return f"Error reading NetCDF file: {str(e)}"


@mcp.tool()
def get_variable_shape(path: str, variable_name: str) -> dict | str:
    """
    Reads a NetCDF file from the given path and returns the shape of a specific
    variable.

    Args:
        path: The absolute or relative path to the NetCDF (.nc) file.
        variable_name: The name of the variable to get the shape for.

    Returns:
        A dictionary containing the shape of the specified variable.
        Example: {'temperature': (365, 180, 360)}
        Returns an error (as a string) if the variable is not found.
    """
    return dict()


if __name__ == "__main__":
    mcp.run()

MCP examples (netcdf)

mcp/mcp-netcdf.py contains 2 MCP tools
- netcdf_get_variables
- netcdf_get_variable_shape (to be implemented)
Try using netcdf_get_variables on file simple.nc

MCP examples (netcdf)

Implement netcdf_get_variable_shape
See stub in mcp/mcp-netcdf.py

../mcp/mcp-netcdf.py

@mcp.tool()
def get_variable_shape(path: str, variable_name: str) -> dict | str:
    """
    Reads a NetCDF file from the given path and returns the shape of a specific
    variable.
    ...
    """
    return dict()  # stub: replace with your implementation

(15 minutes for exercise)

Skills

Define reusable behavior via SKILL.md definitions
Agent skills let LLMs discover reusable instructions
Skills are loaded on-demand
Skills are “just” markdown files

Anatomy of a Skill

Many genAI tools support skills e.g., Claude Code, opencode, Codex etc.

Note

opencode requires that skills are stored in a specific set of locations (A full list can be found here). We will focus on these:

Project config: .opencode/skills/<name>/SKILL.md
Global config: ~/.config/opencode/skills/<name>/SKILL.md

<name>/               # Required: unique skill name
├── SKILL.md          # Required: instructions + metadata
├── scripts/          # Optional: executable code
├── references/       # Optional: documentation
└── assets/           # Optional: templates, resources
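One way to picture on-demand loading: a harness can read just the front matter of each SKILL.md up front and leave the body unread until the skill is needed. The sketch below is an assumption about how such a reader could work, not opencode's implementation. It handles only flat `key: value` front matter, and the skill shown is hypothetical.

```python
def read_skill_metadata(skill_text: str) -> dict:
    """Parse the simple 'key: value' front matter at the top of a SKILL.md."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' front matter block")
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter: stop reading here
            return metadata
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    raise ValueError("front matter block is never closed")


example = """---
name: example-skill
description: Use this skill when the user asks about example data files.
---

# What I do
Longer instructions live here and are only needed when the skill is used.
"""

meta = read_skill_metadata(example)
print(meta["name"], "->", meta["description"])
```

Only the short name and description need to sit in the context window at all times, which is why a well-written description matters: it is what the model uses to decide whether to open the rest.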

Skills Example (netcdf)

Let’s refactor our netcdf MCP tool as a skill
Follow instructions in skill/README.md:

cd project/root/GenAI-teaching
mkdir -p .opencode/skills/netcdf
ln -sf "$(pwd)/skill/netcdf/SKILL.md" .opencode/skills/netcdf/

Run /skills in opencode to check registration

Skills Example (netcdf)

skill/netcdf/SKILL.md

---
name: netcdf
description: Use this skill for any operations involving NetCDF (.nc) files, including inspecting metadata, reading variable shapes, extracting data slices, or generating new NetCDF datasets.
---

# What I do

This skill provides guidance for inspecting and generating NetCDF files using
standard command-line utilities. Use these commands to understand dataset
structures before writing extraction scripts.

# When to use this skill

Use this skill whenever a user mentions climate data, multidimensional arrays,
.nc files, or atmospheric datasets.

## Workflow Decision Tree

- **Inspecting Schema**: Use `ncdump -h` first to understand dimensions.
- **Data Access**: If the file is large, only request specific variable slices (don't read entire arrays into context).
- **Creating Files**: Use `ncgen` for small CDL templates or `netCDF4` Python scripts for large datasets.

## Viewing Metadata with `ncdump`
`ncdump` is the standard tool for converting NetCDF binary files into
human-readable text (CDL format).

* **View Header Only (Recommended):** Displays dimensions, variables, and attributes without printing raw data.
    ```bash
    ncdump -h filename.nc
    ```
* **View Specific Variable:** Look at the data for a single variable (e.g., 'temperature').
    ```bash
    ncdump -v temperature filename.nc
    ```
* **Coordinate Formatting:** Use `-c` to see the header plus the values of coordinate variables (lat, lon, time).
    ```bash
    ncdump -c filename.nc
    ```

## Creating Files with `ncgen`
`ncgen` takes a text-based CDL file and compiles it into a binary `.nc` file.

* **Generate Binary from CDL:**
    ```bash
    ncgen -o output_file.nc input_text.cdl
    ```

Skills Example (netcdf)

Try running the following command

Note

Disable netcdf MCP server before trying to test the skill. They may conflict.

Skills Exercise

Create your own SKILL.md
Register it in opencode
Try using it

(15 minutes for exercise)

CLI

Do we really need MCP or Skills?…

CLI

Common CLI tools are already in model weights
No authentication
No configuration

CLI

Try asking the model to run CLI commands
Don’t need to be specific e.g., “are there any untracked files in this repo?”
What issues do you foresee?

(5 minutes for exercise)

MCP vs CLI vs Skill

So how do I choose between MCP, CLI and Skills?

	MCP Server	CLI	`SKILL.md` (Instruction)
Primary Purpose	Tool calling – Need auth, permissions, audit trails, or remote access? Is data format important?	Raw commands – Run terminal tools the model already knows (git, grep, docker).	Domain Expertise – Provides workflows, rules, and domain knowledge.
Context/Loading	Loaded immediately into context window (regardless of query) reducing effective context window size.	Zero cost – Knowledge is baked into model weights; no schemas loaded.	Lazy loaded when needed. Will still impact context window.
Timeout	Timeout ~ 1-2 minutes. Ideal for short, quick function calls	No timeout.	No timeout.

Note

For a more in-depth comparison check out Cordero’s article “MCP vs CLI: What Your Agents Should Be Using”

Taking it further

opencode and other genAI tools often support agents/sub-agents (see docs)
Agents are specialized AI assistants that can be configured for specific tasks and workflows
They allow you to create focused tools with custom prompts, models, and tool access
More markdown 👀

Sub-Agent

Let’s create a sub-agent to generate PR messages
Use opencode agent create
Try creating your own
Modify it and see what difference it makes

(15 minutes for exercise)

Thanks for listening

University of Cambridge — Institute of Computing for Climate Science

University of Washington Scientific Software Engineering Center, University of Washington

References

Achiam, Joshua. 2018. Spinning up in Deep Reinforcement Learning. OpenAI. https://spinningup.openai.com.

DeepLearning.AI. 2024. Build and Train an LLM with JAX. Online course, DeepLearning.AI. https://learn.deeplearning.ai/courses/build-and-train-an-llm-with-jax/lesson/gy364z/introduction.

Hugging Face. 2022. Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course.

Karpathy, Andrej. 2022. Neural Networks: Zero to Hero. YouTube playlist. https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ.

Sanderson, Grant. 2017. Neural Networks. YouTube playlist, 3Blue1Brown. https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi.

Stanford University. 2021. CS25: Transformers United. Stanford University Course. https://web.stanford.edu/class/cs25/.
