Karpathy MicroGPT: Implementing Autograd + Transformer from Scratch

Research date

2026-02-13

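What is Karpathy’s MicroGPT?

Karpathy's MicroGPT is a minimal GPT implementation written with nothing but the Python standard library (the math module included). In roughly 243 lines of code it implements:

Autograd (automatic differentiation engine): backpropagation built from scratch
Transformer architecture: a complete GPT model (embeddings, multi-head attention, MLP, residual connections)
Training loop: Adam optimizer, cross-entropy loss, gradient descent

Core value: it is not meant for production but for education, exposing the core algorithms behind modern LLMs.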

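Key features

Zero dependencies: only the Python standard library; no NumPy, PyTorch, or TensorFlow
Complete: autograd, Transformer, and training loop are all written from scratch
Fully transparent: every mathematical operation is visible, with no hidden C++/CUDA code
Runnable: copy, paste, and run; it trains on real data
Education first: understanding the core principles takes priority over performance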

How it works

Core concept: Autograd (automatic differentiation)

Autograd = computation graph (DAG) + chain rule

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data          # value computed in the forward pass
        self.grad = 0             # gradient from the backward pass (∂L/∂self)
        self._children = children          # graph edges (the inputs this node depends on)
        self._local_grads = local_grads    # local derivatives, used by the chain rule

Forward pass (building the graph):

a = Value(2.0)
b = Value(3.0)
c = a + b        # creates a new node and records its dependencies
d = c ** 2       # keeps extending the graph
loss = d         # final node

Backward pass (backpropagation):

loss.backward()   # computes every gradient automatically

# What happens inside:
# 1. Topological sort of the graph
# 2. Initialization: loss.grad = 1
# 3. Reverse traversal, applying the chain rule:
#    child.grad += local_grad × parent.grad

Chain rule example

Example: L = (a + b)²

Computation graph:
a(2) →─┐
       ├── (+) → c(5) ─── (²) → L(25)
b(3) →─┘

Backward:
L.grad = 1
c.grad = 1 × 2 × 5 = 10      (∂L/∂c = 2c)
a.grad = 10 × 1 = 10          (∂L/∂a = ∂L/∂c × ∂c/∂a)
b.grad = 10 × 1 = 10          (∂L/∂b = ∂L/∂c × ∂c/∂b)
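
The worked example above can be reproduced with a few dozen lines. The following is a minimal sketch, not the original file: it keeps the same field names (`data`, `grad`, `_children`, `_local_grads`) but implements only `+` and `**`, which is all that L = (a + b)² needs.

```python
class Value:
    """Scalar node in a computation graph (minimal sketch)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __pow__(self, exponent):
        # exponent is a plain number; d(a**n)/da = n * a**(n-1)
        return Value(self.data ** exponent, (self,),
                     (exponent * self.data ** (exponent - 1),))

    def backward(self):
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad


a, b = Value(2.0), Value(3.0)
c = a + b
loss = c ** 2
loss.backward()
print(loss.data, c.grad, a.grad, b.grad)  # 25.0 10.0 10.0 10.0
```

The printed gradients match the hand computation: ∂L/∂c = 2c = 10, and the addition passes that value unchanged to both a and b.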

Gradient rules for the math operations

Operation	Forward	Backward (∂L/∂a)
`c = a + b`	`c = a + b`	`∂L/∂a = ∂L/∂c × 1`
`c = a * b`	`c = a × b`	`∂L/∂a = ∂L/∂c × b`
`c = a²`	`c = a²`	`∂L/∂a = ∂L/∂c × 2a`
`c = exp(a)`	`c = e^a`	`∂L/∂a = ∂L/∂c × e^a`
`c = log(a)`	`c = ln(a)`	`∂L/∂a = ∂L/∂c × 1/a`
`c = ReLU(a)`	`c = max(0, a)`	`∂L/∂a = ∂L/∂c × (a > 0)`

Implementation examples:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # Forward: c = a + b
    # Local grads: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1
    return Value(self.data + other.data, (self, other), (1, 1))

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # Forward: c = a × b
    # Local grads: ∂(a×b)/∂a = b, ∂(a×b)/∂b = a
    return Value(self.data * other.data, (self, other), (other.data, self.data))

def relu(self):
    # Forward: c = max(0, a)
    # Local grad: ∂(max(0,a))/∂a = 1 if a > 0, else 0
    return Value(max(0, self.data), (self,), (float(self.data > 0),))
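
One way to gain confidence in the local derivatives from the table is to compare each against a central finite difference. The sketch below works on plain floats rather than Value objects; the `RULES` list and helper names are illustrative and not part of the original code.

```python
import math

# (name, forward function, claimed local derivative)
RULES = [
    ("square", lambda a: a ** 2, lambda a: 2 * a),
    ("exp", math.exp, math.exp),
    ("log", math.log, lambda a: 1 / a),
    ("relu", lambda a: max(0.0, a), lambda a: float(a > 0)),
]


def finite_difference(f, a, eps=1e-6):
    """Central difference approximation of f'(a)."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)


def max_rule_error(a):
    """Largest gap between a claimed derivative and its numerical estimate at a."""
    return max(abs(finite_difference(f, a) - df(a)) for _, f, df in RULES)


print(max_rule_error(1.5) < 1e-6)
```

The check is only meaningful away from the kink of ReLU at 0 and for positive inputs to log.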

Transformer Architecture Implementation

Full architecture

Input Tokens
    │
    ▼
┌─────────────────────────────────┐
│  Token Embedding (wte)          │  → [vocab_size × n_embd]
│  +                              │
│  Position Embedding (wpe)       │  → [block_size × n_embd]
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│  RMS Normalization              │
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│  N × Transformer Blocks         │
│  ┌───────────────────────────┐  │
│  │ 1. RMSNorm                │  │
│  │ 2. Multi-Head Attention   │  │
│  │    + Residual             │  │
│  │ 3. RMSNorm                │  │
│  │ 4. MLP (ReLU)             │  │
│  │    + Residual             │  │
│  └───────────────────────────┘  │
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│  LM Head (Output Projection)    │  → [vocab_size × n_embd]
└──────────┬──────────────────────┘
           │
           ▼
      Logits → Softmax → Loss

Multi-Head Self-Attention

Core mechanism:

# For each attention head:
# 1. Compute Q, K, V for the current token and extend the cache
q = linear(x, W_q)   # Query: "what am I looking for?"
k = linear(x, W_k)   # Key: "what do I contain?"
v = linear(x, W_v)   # Value: "what information do I offer?"
keys.append(k)       # the cache grows by one entry per position
values.append(v)

# 2. Attention scores (scaled dot-product) against every cached key
attn_scores = [
    sum(q[j] * keys[t][j] for j in range(head_dim)) / sqrt(head_dim)
    for t in range(len(keys))
]

# 3. Softmax turns scores into weights
attn_weights = softmax(attn_scores)

# 4. Weighted sum of the cached values
head_out = [
    sum(attn_weights[t] * values[t][j] for t in range(len(values)))
    for j in range(head_dim)
]

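Multi-head mechanism:

The embedding dimension is split into n_head subspaces
Each head computes attention independently
The outputs of all heads are concatenated, then projected back to the original dimension
Benefit: different heads can learn different kinds of relationships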
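
The split-and-concatenate step can be sketched with plain floats. This is an illustration under simplifying assumptions: `q`, `keys`, and `values` are already projected, the helper names `attend` and `multi_head` are hypothetical, and the final output projection is left out.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head attention for one query over lists of key/value vectors."""
    d = len(q)
    scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(w[t] * values[t][j] for t in range(len(values))) for j in range(d)]


def multi_head(q, keys, values, n_head):
    """Slice each vector into n_head chunks, attend per chunk, concatenate."""
    head_dim = len(q) // n_head
    out = []
    for h in range(n_head):
        s = slice(h * head_dim, (h + 1) * head_dim)
        out.extend(attend(q[s], [k[s] for k in keys], [v[s] for v in values]))
    return out  # a full model would apply an output projection next


q = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
out = multi_head(q, keys, values, n_head=2)
print(len(out))  # 4
```

Each head sees only its own slice of the embedding, so the two heads here compute their weights independently before the results are joined back into a vector of the original length.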

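Causal masking:

Keys/values are stored only for positions processed so far (never future ones)
Position pos_id attends to positions 0 through pos_id, itself included, because its own key/value is appended to the cache before attention is computed
The mask is implicit: the cache only ever grows by append, so future positions do not exist yet when a token is processed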
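
The effect of the append-only cache can be shown without any model at all. The sketch below assumes the current position's entry is appended before attention reads the cache (it has to be, otherwise position 0 would have nothing to attend to); `visible_positions` is a hypothetical helper, and the integers stand in for key/value vectors.

```python
def visible_positions(seq_len):
    """For each position, list the positions its attention can read."""
    cache = []      # stands in for keys[li] / values[li]
    visible = []
    for pos_id in range(seq_len):
        cache.append(pos_id)         # the current position is appended first
        visible.append(list(cache))  # then attention reads the whole cache
    return visible


print(visible_positions(3))  # [[0], [0, 1], [0, 1, 2]]
```

The result is the lower-triangular pattern that an explicit causal mask would produce.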

Feed-Forward Network (MLP)

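# Expand → ReLU → compress
x = linear(x, fc1)      # n_embd → 4×n_embd (expand)
x = [xi.relu() for xi in x]  # activation
x = linear(x, fc2)      # 4×n_embd → n_embd (compress)

Key points:

4× expansion is the usual choice (same as GPT-2)
ReLU is used here; production models more often use GELU
No bias terms (a simplification for teaching)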
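
To make the shapes concrete, here is a small sketch of the same expand, ReLU, compress pipeline on plain floats. The `linear` helper follows the list-of-rows convention used elsewhere in this document; the weights are arbitrary illustrative numbers, not trained values.

```python
def linear(x, w):
    """Matrix-vector product: w is a list of rows, each as long as x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mlp(x, fc1, fc2):
    h = linear(x, fc1)                # n_embd -> 4*n_embd
    h = [max(0.0, hi) for hi in h]    # ReLU
    return linear(h, fc2)             # 4*n_embd -> n_embd


n_embd = 2
fc1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
       [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]  # 8 x 2
fc2 = [[1.0] * 8, [0.5] * 8]                                 # 2 x 8
y = mlp([1.0, 2.0], fc1, fc2)
print(y)  # [7.0, 3.5]
```

The hidden vector has 8 entries (4 × n_embd); ReLU zeroes the four negative ones before the second matrix maps the result back to 2 dimensions.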

Residual Connections（殘差連接）

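x_residual = x
x = layer(x)
x = [a + b for a, b in zip(x, x_residual)]  # residual connection

Why it matters:

Gives gradients a shortcut: they flow straight through the identity mapping
Mitigates vanishing gradients in deep networks
Makes very deep networks trainable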

Normalization

def rmsnorm(x):
    """Root Mean Square Normalization"""
    ms = sum(xi * xi for xi in x) / len(x)  # mean of squares
    scale = (ms + 1e-5) ** -0.5              # epsilon avoids division by zero
    return [xi * scale for xi in x]

Formula: x_hat[i] = x[i] / sqrt(mean_j(x[j]²) + ε)

Why RMSNorm?

No mean-centering step (simpler than LayerNorm)
Only rescales the vector
Generally performs on par with LayerNorm
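
A quick numerical check illustrates the first two points. This sketch runs the same formula on plain floats (the `eps` parameter is added here for convenience) and shows that the output has a root mean square of about 1 while its mean is left uncentered.

```python
def rmsnorm(x, eps=1e-5):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]


x = [1.0, 2.0, 3.0, 4.0]
r = rmsnorm(x)

rms_after = (sum(v * v for v in r) / len(r)) ** 0.5
mean_after = sum(r) / len(r)
print(round(rms_after, 4), round(mean_after, 4))
```

The RMS comes out essentially equal to 1, while the mean stays clearly positive: RMSNorm rescales the vector but does not shift it.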

Main use cases

1. Learning how autograd works

Case: understanding the chain rule

# Build the computation graph
a = Value(2.0)
b = Value(3.0)
c = a + b
d = c ** 2
loss = d

# Backpropagate
loss.backward()

# Inspect the gradients
print(f"a.grad = {a.grad}")  # 10
print(f"b.grad = {b.grad}")  # 10
print(f"c.grad = {c.grad}")  # 10

# Manual check:
# ∂L/∂a = ∂L/∂c × ∂c/∂a = 2c × 1 = 2×5×1 = 10 ✓

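2. Understanding the Transformer attention mechanism

Case: single-head attention

# Simplified attention computation
def single_head_attention(x):
    # x = [x0, x1, x2], each a vector of dimension d
    # (linear, dot, softmax, sqrt, W_q, W_k, W_v, and d are assumed to be defined elsewhere)

    # 1. Project every position to Q, K, V
    q = [linear(xi, W_q) for xi in x]
    k = [linear(xi, W_k) for xi in x]
    v = [linear(xi, W_v) for xi in x]

    # 2. Attention scores (position 2 looks at positions 0..2)
    # Score_2→0 = (q2 · k0) / √d
    # Score_2→1 = (q2 · k1) / √d
    # Score_2→2 = (q2 · k2) / √d
    scores = [dot(q[2], k[t]) / sqrt(d) for t in range(3)]

    # 3. Softmax turns scores into weights
    weights = softmax(scores)  # e.g. [0.1, 0.7, 0.2]

    # 4. Weighted sum of V, computed per dimension
    output = [sum(weights[t] * v[t][j] for t in range(3)) for j in range(d)]

    return output

Takeaways:

The output at position 2 is a weighted combination of positions 0 through 2 (itself and everything before it)
The weights come from the similarity between Q and K
Each position can learn its own weighting pattern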

3. Training a character-level language model

Full training loop:

# Settings
n_embd = 16
n_head = 4
n_layer = 1
block_size = 16
num_steps = 10000
learning_rate = 0.01
beta1, beta2, eps = 0.9, 0.999, 1e-8  # assumed values (common Adam defaults)

# docs, vocab_size, BOS, tokenize, matrix, flatten_params, gpt, and softmax
# are assumed to be defined elsewhere in the script.

# Initialize parameters (all Value objects, so gradients come for free)
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'wpe': matrix(block_size, n_embd),
    'lm_head': matrix(vocab_size, n_embd),
    # ... per-layer Q, K, V, O, and MLP weights
}
params = flatten_params(state_dict)

# Adam optimizer state
m = [0.0] * len(params)   # first moment
v = [0.0] * len(params)   # second moment

for step in range(num_steps):
    # --- Forward Pass ---
    doc = docs[step % len(docs)]
    tokens = [BOS] + tokenize(doc) + [BOS]

    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []

    for pos_id in range(min(block_size, len(tokens) - 1)):
        token_id = tokens[pos_id]
        target_id = tokens[pos_id + 1]

        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()  # Cross-entropy
        losses.append(loss_t)

    loss = (1 / len(losses)) * sum(losses)

    # --- Backward Pass ---
    loss.backward()  # one line computes every gradient

    # --- Optimizer Update (Adam) ---
    lr_t = learning_rate * (1 - step / num_steps)  # Linear decay

    for i, p in enumerate(params):
        # Update moments
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2

        # Bias correction
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))

        # Parameter update
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps)

        # Zero gradient for next iteration
        p.grad = 0

    # --- Monitor ---
    if step % 1000 == 0:
        print(f"Step {step}: loss = {loss.data:.4f}")

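4. Inference and generation

def generate(prompt, max_len=100):
    """Generate text that continues `prompt`."""
    tokens = [BOS] + tokenize(prompt)
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]

    # Feed every token in order so the KV cache covers the whole prompt.
    # wpe only covers block_size positions, so never go past that.
    for pos_id in range(min(block_size, len(tokens) + max_len - 1)):
        # Forward pass (gradients are not needed here)
        logits = gpt(tokens[pos_id], pos_id, keys, values)
        if pos_id < len(tokens) - 1:
            continue  # still consuming the prompt

        # Sampling (temperature is optional)
        probs = softmax(logits)
        next_token = sample(probs, temperature=0.8)
        if next_token == BOS:  # training uses BOS as the end marker too
            break
        tokens.append(next_token)

    return detokenize(tokens[1:])  # drop the leading BOS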
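
The `sample` helper is not shown in this document, so the following is only a sketch of one way it might look. It assumes `probs` is a list of plain floats (with Value objects you would pass `[p.data for p in probs]`) and that temperature rescaling means raising each probability to the power 1/T and renormalizing, which is equivalent to dividing the logits by T before the softmax.

```python
import math
import random


def sample(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after temperature rescaling (illustrative)."""
    # log(p) / T, then exponentiate relative to the max for numerical safety
    logp = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    m = max(logp)
    weights = [math.exp(lp - m) for lp in logp]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


rng = random.Random(0)
print(sample([0.0, 1.0, 0.0], temperature=0.8, rng=rng))  # 1
```

Temperatures below 1 sharpen the distribution toward the most likely token; temperatures above 1 flatten it.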

Installation and setup

Prerequisites

Python 3.x (standard library only)
No other dependencies

Setup steps

# 1. Download the Gist code
curl -O https://gist.githubusercontent.com/karpathy/8627fe009c40f57531cb18360106ce95/raw/microgpt.py

# 2. Prepare training data
# Create a text file (names.txt is only an example; check the script for the file name it reads)
printf 'Alice\nBob\nCharlie\nDavid\nEmma\n' > names.txt

# 3. Run training
python3 microgpt.py

# Example output (illustrative; actual numbers will differ):
# Step 0: loss = 2.7143
# Step 1000: loss = 1.2345
# Step 2000: loss = 0.8765
# ...
# Generated names: Alice, Bob, Emma, Charlie

Code structure

microgpt.py (~243 lines)
│
├── Value class (autograd)
│   ├── __init__, __add__, __mul__, __pow__
│   ├── exp, log, relu
│   └── backward (topological sort + chain rule)
│
├── Neural Network Operations
│   ├── linear (matrix-vector multiply)
│   ├── rmsnorm (normalization)
│   └── softmax (numerically stable)
│
├── Transformer Architecture
│   ├── gpt (forward pass)
│   │   ├── Embeddings (wte + wpe)
│   │   └── N × Layers
│   │       ├── RMSNorm
│   │       ├── Multi-Head Attention
│   │       ├── Residual
│   │       ├── RMSNorm
│   │       ├── MLP (ReLU)
│   │       └── Residual
│   └── lm_head (output projection)
│
└── Training Loop
    ├── Data loading & tokenization
    ├── Forward pass (loss calculation)
    ├── Backward pass (loss.backward())
    └── Adam optimizer update

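Best practices

1. Numerical stability

Problem: exp(1000) overflows. In pure Python, math.exp(1000) raises OverflowError; in array frameworks it evaluates to inf, which then turns into NaN.

Solution:

def softmax(logits):
    # Subtract the maximum to prevent overflow
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

Why it works:

Naive: exp([1000, 999, 998]) overflows on every element
After subtracting the max: exp([0, -1, -2]) → [1, 0.37, 0.14]
The softmax result is unchanged, because the common factor exp(-max) cancels between numerator and denominator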
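
The difference is easy to observe on plain floats. In this sketch `naive_softmax` and `stable_softmax` are illustrative names; the only change between them is the subtraction of the maximum.

```python
import math


def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


try:
    naive_softmax([1000.0, 999.0, 998.0])
    overflowed = False
except OverflowError:
    overflowed = True  # math.exp(1000.0) is out of range for a float

probs = stable_softmax([1000.0, 999.0, 998.0])
print(overflowed, [round(p, 3) for p in probs])
```

The stable version returns the same probabilities as it would for [0, -1, -2], since only the differences between logits matter.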

2. Avoiding division by zero

# Bad
scale = ms ** -0.5  # raises ZeroDivisionError when ms == 0

# Good
scale = (ms + 1e-5) ** -0.5  # add a small epsilon

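3. Zero the gradients after every iteration

for step in range(num_steps):
    loss = compute_loss(inputs, params)  # forward pass rebuilds the graph (compute_loss is a placeholder)
    loss.backward()

    # Update parameters...
    for p in params:
        p.data -= lr * p.grad
        p.grad = 0  # important: otherwise gradients accumulate across steps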

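4. Testing gradients (numerical check)

def numerical_gradient(f, x, eps=1e-5):
    """Central finite difference: (f(x+ε) - f(x-ε)) / (2ε)"""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def loss_at(value):
    """Recompute the loss with param.data temporarily set to `value`."""
    saved, param.data = param.data, value
    result = compute_loss(inputs, params).data  # compute_loss is a placeholder for the forward pass
    param.data = saved
    return result

# Verify the analytical gradient against the numerical one
grad_analytical = param.grad
grad_numerical = numerical_gradient(loss_at, param.data)
assert abs(grad_analytical - grad_numerical) < 1e-5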
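
For a version that runs on its own, the same check can be applied to a toy loss with a known derivative. Everything here (`DATA`, `loss_fn`, `analytic_grad`) is made up for illustration; the point is that the function passed to `numerical_gradient` must take the perturbed value as its argument and recompute the loss from it.

```python
DATA = [(1.0, 2.0), (2.0, 4.5)]  # (x, y) pairs for a one-parameter model y ≈ w*x


def loss_fn(w):
    """Sum of squared errors as a function of the single weight w."""
    return sum((w * x - y) ** 2 for x, y in DATA)


def analytic_grad(w):
    """d/dw of loss_fn, derived by hand."""
    return sum(2 * (w * x - y) * x for x, y in DATA)


def numerical_gradient(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)


w = 0.7
gap = abs(analytic_grad(w) - numerical_gradient(loss_fn, w))
print(gap < 1e-6)
```

With an autograd engine, `analytic_grad(w)` would be replaced by the `.grad` value read off after `backward()`.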

5. Use operator overloading for readability

# Bad (hard to read)
d = add(mul(a, b), pow(c, 2))

# Good (natural math notation)
d = a * b + c ** 2

6. Keep forward and backward separate

# Forward pass: build the graph
loss = compute_loss(inputs, params)

# Backward pass: compute gradients
loss.backward()

# Update: optimizer
optimizer.step(params)

# Clear separation of responsibilities makes debugging easier

7. Why the topological sort matters

def backward(self):
    # Sort first: a node's gradient must be complete before it is pushed to its children
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)

    build_topo(self)

    # Traverse in reverse (from the output back toward the inputs)
    self.grad = 1
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad

8. Debugging tips

# Print the computation graph
def print_graph(v, indent=0):
    print("  " * indent + f"Value(data={v.data:.2f}, grad={v.grad:.2f})")
    for child in v._children:
        print_graph(child, indent + 1)

# Monitor gradients
def check_gradients(params):
    for i, p in enumerate(params):
        if abs(p.grad) > 100:
            print(f"Warning: Large gradient in param {i}: {p.grad}")
        if abs(p.grad) < 1e-8:
            print(f"Warning: Vanishing gradient in param {i}: {p.grad}")

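Advanced features

1. Adding more activation functions

# These are methods of Value and require `import math`.
def tanh(self):
    t = math.tanh(self.data)  # avoids the overflow of exp(2x) for large x
    return Value(t, (self,), (1 - t**2,))

def sigmoid(self):
    s = 1 / (1 + math.exp(-self.data))
    return Value(s, (self,), (s * (1 - s),))

def gelu(self):
    # GELU ≈ 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))
    x = self.data
    k = math.sqrt(2 / math.pi)
    tanh_val = math.tanh(k * (x + 0.044715 * x**3))
    # Derivative of the approximation above, by the product and chain rules
    local_grad = 0.5 * (1 + tanh_val) + 0.5 * x * (1 - tanh_val**2) * k * (1 + 3 * 0.044715 * x**2)
    return Value(0.5 * x * (1 + tanh_val), (self,), (local_grad,))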

2. Layer Normalization

def layernorm(x, eps=1e-5):
    """Layer Normalization (the more common normalization)"""
    mean = sum(x) / len(x)
    variance = sum((xi - mean) ** 2 for xi in x) / len(x)
    scale = (variance + eps) ** -0.5
    return [(xi - mean) * scale for xi in x]

3. Dropout

def dropout(x, p=0.5, training=True):
    """Dropout regularization (requires `import random`)"""
    if not training:
        return x

    # Inverted dropout: survivors are scaled by 1/(1-p) so the expected value is unchanged
    keep_scale = 1 / (1 - p)
    return [xi * (keep_scale if random.random() > p else 0.0) for xi in x]

4. Gradient Clipping

def clip_gradients(params, max_norm=1.0):
    """Prevent exploding gradients by rescaling to a maximum global norm (requires `import math`)"""
    total_norm = math.sqrt(sum(p.grad ** 2 for p in params))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for p in params:
            p.grad *= scale

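Frequently asked questions

Q: Why not use PyTorch?

A: PyTorch abstracts away many details. Implementing everything from scratch lets you:

See how gradients are computed (rather than treating it as magic)
Follow every step of backpropagation
Understand why certain operations are designed the way they are
Build intuition that makes frameworks more effective to use later

Q: Is this implementation suitable for production?

A: No. It is for learning. Production frameworks add:

GPU acceleration (CUDA)
Memory optimization (gradient checkpointing)
Automatic mixed precision (FP16/FP32)
Distributed training (multi-GPU / multi-machine)
Efficient matrix operations (BLAS/cuBLAS)

Comparison:

Feature	This implementation	PyTorch
Lines of code	~243	Millions
Speed	Slow (pure Python)	Fast (CUDA)
Memory	High (unoptimized)	Optimized
Educational value	★★★★★	★★★
Production value	★	★★★★★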

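Q: Why lists instead of NumPy?

A: Lists keep every operation explicit:

# With lists:
out = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# With NumPy:
out = np.dot(w, x)  # optimized C, details hidden

This is a teaching choice: understanding comes before performance.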

Q: How do I verify that my gradients are correct?

A: Use a numerical gradient check:

def numerical_gradient(f, x, eps=1e-5):
    """Finite difference: ∂f/∂x ≈ (f(x+ε) - f(x-ε)) / (2ε)"""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Compare. loss_at(value) is a placeholder for a function that sets
# param.data = value, reruns the forward pass, and returns loss.data.
analytical_grad = param.grad
numerical_grad = numerical_gradient(loss_at, param.data)

if abs(analytical_grad - numerical_grad) > 1e-5:
    print(f"Gradient mismatch! {analytical_grad} vs {numerical_grad}")

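Q: Why is Adam so complicated?

A: Adam combines ideas from several optimizers:

Component	Origin	Purpose
m (first moment)	Momentum	Smooths gradients, speeds up convergence
v (second moment)	AdaGrad/RMSProp	Per-parameter adaptive learning rate
Bias correction	Adam	Corrects the bias of early estimates toward zero

Simplified formula:

update = lr * m_hat / (sqrt(v_hat) + eps)

m_hat: running average of the gradient direction (momentum)
sqrt(v_hat): normalizes by the recent gradient magnitude (RMS)
Result: parameters with consistently large gradients take proportionally smaller steps, and parameters with small gradients take larger ones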
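
The three components can be seen working together on a one-dimensional problem. The sketch below minimizes f(x) = (x - 3)² with a scalar Adam; the function name `adam_minimize` and the hyperparameter values are illustrative choices, not taken from the original script.

```python
def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Scalar Adam: returns the final iterate after `steps` updates."""
    x, m, v = x0, 0.0, 0.0
    for step in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** step)        # bias correction
        v_hat = v / (1 - beta2 ** step)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x


# Gradient of (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
print(round(x_min, 2))
```

Early on the step size is close to `lr` regardless of the gradient's magnitude, because m_hat / sqrt(v_hat) is roughly ±1; as the gradient shrinks near the minimum, the steps shrink with it and the iterate settles close to 3.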

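Q: Why use residual connections?

A: Residual connections address a problem of deep networks:

Without residuals:

Gradient flow: L → Layer N → Layer N-1 → ... → Layer 1 → Input
If every layer scales the gradient by a factor below 1, it decays exponentially with depth and vanishes

With residuals:

Gradient flow:
L → Layer N + ────────────────┐
    ↓                         ↓
Layer N-1 + ───────┐          ↓
    ↓              │          ↓
    └──────→ direct path ──→ Input

The shortcut lets gradients bypass the intermediate layers

Math: output = layer(input) + input

Gradient:

∂L/∂input = ∂L/∂output × ∂output/∂input
          = ∂L/∂output × (∂layer/∂input + 1)
          = ∂L/∂output × ∂layer/∂input + ∂L/∂output

The extra + ∂L/∂output term is the shortcut.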
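
A small numerical experiment makes the "+ 1" visible. In this sketch each layer is the deliberately weak scalar function 0.01·x (an arbitrary choice for illustration), stacked ten deep with and without the skip connection, and the end-to-end derivative is estimated by finite differences.

```python
def layer(x):
    return 0.01 * x  # a layer whose own derivative is tiny (0.01)


def stack(x, depth, residual):
    for _ in range(depth):
        x = layer(x) + x if residual else layer(x)
    return x


def grad(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)


plain = grad(lambda x: stack(x, 10, residual=False), 1.0)
skip = grad(lambda x: stack(x, 10, residual=True), 1.0)
print(plain, skip)
```

Without residuals the derivative is 0.01¹⁰, effectively zero; with residuals each layer contributes a factor (0.01 + 1), so the derivative stays near 1.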

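Q: Why does scaled dot-product attention divide by √d?

A: To keep the softmax out of its saturated regime, where gradients vanish.

Problem:

# Suppose d=512 and the elements of Q and K are independent with mean 0 and standard deviation 1
# Then Q·K has mean ≈ 0 and standard deviation ≈ √512 ≈ 22.6
# Scores therefore differ by tens of units, e.g. Softmax([22.6, -15.3, 3.1, ...]) → nearly one-hot
# A saturated softmax passes almost no gradient

Solution:

# Divide by √d
scores = Q·K / √512 = Q·K / 22.6
# Standard deviation ≈ 1
# The softmax distribution is smoother → gradients are more stable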

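Q: How is causal masking implemented?

A: Implicitly, by storing only the positions processed so far:

keys[li].append(k)    # append the current position's key
values[li].append(v)  # append the current position's value

# When position pos_id is processed, the cache holds only positions 0..pos_id,
# i.e. keys[li][:pos_id + 1] and values[li][:pos_id + 1].
# Future positions have not been computed yet, so the causal mask comes for free.

Explicit masking (common in PyTorch):

# Build the mask matrix
mask = torch.tril(torch.ones(seq_len, seq_len))  # lower triangular

# Apply the mask
masked_scores = scores.masked_fill(mask == 0, float('-inf'))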

Resources

Official / original resources

Karpathy’s MicroGPT Gist - complete implementation in ~243 lines
micrograd Repository - standalone autograd engine
minGPT Repository - minimal GPT in PyTorch

Tutorials / courses

Neural Networks: Zero to Hero - Karpathy's full course
Let’s Build GPT Video (YouTube, 2 hours) - building GPT step by step
CS231n: Backpropagation - computation graphs and backpropagation

Papers

“Attention Is All You Need” (Vaswani et al., 2017) - the original Transformer paper
“Language Models are Unsupervised Multitask Learners” (GPT-2, 2019)
“Adam: A Method for Stochastic Optimization” (Kingma & Ba, 2014) - the Adam optimizer

Explanatory articles

The Illustrated Transformer (Jay Alammar) - visual explanation of the Transformer
Attention Mechanism Explained (Lilian Weng) - in-depth look at attention
Understanding Neural Networks - Distill.pub, high-quality visual explanations

Framework documentation

PyTorch Autograd Tutorial - official tutorial
PyTorch Transformer Tutorial - Transformer implementation
JAX Autodiff - functional automatic differentiation

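Summary

Key insights

Autograd is simple
- Build the graph (forward pass)
- Topological sort
- Chain rule (backward pass)
Transformers are composable
- GPT = Embedding + N×(Attention + MLP) + Output
- Each component is simple: Attention = Q·K + Softmax + Weighted Sum
Everything is matrix arithmetic
- Linear layer = dot products
- Activation function = element-wise operation
- Softmax = normalization
Numerical stability matters
- Subtract the max before softmax
- Add epsilon before dividing
- Divide attention scores by √d
Education vs. production
- This implementation: clear, readable, built for teaching
- Production frameworks: fast, optimized, built for engineering

Suggested learning path

Stage 1 (1-2 days): understand autograd

Read the Value class implementation
Run the examples and print the computation graph
Implement new operations (sigmoid, tanh)
Verify gradients numerically

Stage 2 (2-3 days): understand neural networks

Build a simple classifier
Implement hidden layers
Train on the XOR problem
Visualize the decision boundary

Stage 3 (2-3 days): understand attention

Implement single-head attention
Implement multi-head attention
Test on simple sequences
Check that the attention weights look reasonable

Stage 4 (1 week): build a mini-Transformer

Combine all the components
Train on a small dataset
Generate samples
Experiment with hyperparameters

Final checklist

All examples run successfully
Understand the Value class
Implement one new operation
Verify gradients numerically
Build and train a simple model
Implement attention from scratch
Read the full analysis
Experiment with hyperparameters

This implementation shows that deep learning is not magic: it is mathematics, carefully implemented. Once you understand these ~243 lines of code, you understand the core of modern AI. 🎓

The complexity comes from:

Engineering optimization (GPUs, parallelism)
Scale (billions of parameters)
Features (dropout, batch norm, mixed precision)

But the ideas remain simple and beautiful.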