Chapter 2: Statistical Learning - 2.2 Assessing Model Accuracy

ISLP §2.2, pp. 37–49 | ★★★☆☆ | Core | Bias-Variance, Bayes Classifier, KNN, Training vs. Test

Textbook: James, Witten, Hastie, Tibshirani & Taylor (2023), An Introduction to Statistical Learning with Applications in Python, Springer.

# === Data loading compatible with Google Drive + Colab (shared by all examples) ===
import numpy as np
import pandas as pd
import matplotlib
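# Agg is a non-interactive backend: under it plt.show() displays nothing.
# Keep this line for headless runs; remove it to see figures inline in Colab or Jupyter.
matplotlib.use('Agg')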
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error

try:
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
    DATA_PATH = '/tmp/'

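I. Measuring Quality of Fit: Training MSE vs. Test MSE

📚 Theoretical basis: MSE traces back to Gauss's (1809) theory of least squares. In decision-theoretic terms, MSE is the risk function under squared-error loss \(L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2\). See Lehmann & Casella (1998), Theory of Point Estimation, §1.6.

1.1 Training MSE

\[ \text{MSE}_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \quad\text{(Equation 2.5)} \]

Training MSE measures the error on data the model has already seen. It is easy to compute, but it does not reflect how well the model generalizes.

1.2 Test MSE: the metric that actually matters

\[ \text{MSE}_{\text{test}} = \text{Ave}(y_0 - \hat{f}(x_0))^2 \quad\text{(Equation 2.6)} \]

Test MSE measures the error on previously unseen observations \((x_0, y_0)\); Ave denotes the average over a large set of such test observations. This is the quantity we actually care about.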
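As a minimal sketch of Equations 2.5 and 2.6, the snippet below fits a least-squares line on a small training set and evaluates it on a separate test set. The true function `f`, the noise level, and the sample sizes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true function, chosen only for illustration
    return 2.0 + 1.5 * x

# Training and test sets come from the same process (noise sd = 1)
x_train = rng.uniform(0, 10, 50)
y_train = f(x_train) + rng.normal(0, 1.0, 50)
x_test = rng.uniform(0, 10, 1000)
y_test = f(x_test) + rng.normal(0, 1.0, 1000)

# Least-squares line fitted on the training data only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

mse_train = mse(y_train, slope * x_train + intercept)  # Equation 2.5
mse_test = mse(y_test, slope * x_test + intercept)     # Equation 2.6, averaged over held-out points
print(f"train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```

Because the line is chosen to minimize the training residuals, the training MSE is usually the more optimistic of the two numbers; only the test MSE estimates performance on new data.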

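🎯 Application: backtesting quantitative trading models
When a Wall Street quantitative fund trains a trading model, judging it only by in-sample performance (the analogue of training MSE, for example the in-sample Sharpe ratio) risks overfitting to historical noise. Gerard & Mi (2019) report that the out-of-sample Sharpe ratio of overfit quantitative strategies decays by more than 50% on average. In practice, strategies are therefore evaluated with walk-forward validation or expanding-window tests.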
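An expanding-window evaluation can be sketched with scikit-learn's `TimeSeriesSplit`, which always trains on the past and tests on the following block. The data here are synthetic and the linear model is a stand-in, not an actual trading strategy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
# Hypothetical time-ordered data: 300 periods, 3 features, weak linear signal
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, -0.2, 0.0]) + rng.normal(0, 1.0, 300)

tscv = TimeSeriesSplit(n_splits=5)  # the training window expands with each split
fold_mse = []
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index, so no future data leaks in
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("out-of-sample MSE per fold:", np.round(fold_mse, 3))
```

An ordinary shuffled split would mix future observations into the training set, which is why time-ordered data need a scheme of this kind.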

1.3 The classic demonstration of training vs. test MSE (Figures 2.9–2.11)

# Figure 2.9–2.11 concept: training vs. test MSE
np.random.seed(42)

def simulate_data(n, f_true, sigma=1.0):
    X = np.sort(np.random.uniform(0, 10, n))
    eps = np.random.normal(0, sigma, n)
    y = f_true(X) + eps
    return X, y, f_true

# Three true functions
f_nonlinear = lambda x: 0.5 * x + 3 * np.sin(x) + 0.2 * x**2 - 5
f_near_linear = lambda x: 2 + 1.5 * x + 0.1 * np.sin(2*x)
f_highly_nonlinear = lambda x: 0.3 * x**2 - x + 10 * np.sin(x)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
titles = ['Figure 2.9: moderately nonlinear f', 'Figure 2.10: nearly linear f', 'Figure 2.11: highly nonlinear f']
funcs = [f_nonlinear, f_near_linear, f_highly_nonlinear]
colors_fit = ['orange', 'blue', 'green']

for idx, (ax, title, f_true) in enumerate(zip(axes, titles, funcs)):
    X, y, _ = simulate_data(100, f_true, sigma=1.5)
    X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
    ax.scatter(X, y, c='grey', s=20, alpha=0.6, label='Training data')
    ax.plot(X_plot, f_true(X_plot), 'k-', lw=2, label='True f')

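    # Three levels of flexibility (polynomial fits of increasing degree, not splines)
    for deg, color, label in [(1, '#ff7f0e', 'Linear (d=1)'),
                                (4, '#1f77b4', 'Polynomial (d=4, moderate)'),
                                (20, '#2ca02c', 'Polynomial (d=20, highly flexible)')]: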
        poly = PolynomialFeatures(degree=deg)
        model = LinearRegression().fit(poly.fit_transform(X.reshape(-1,1)), y)
        y_pred = model.predict(poly.transform(X_plot))
        ax.plot(X_plot, y_pred, color=color, lw=1.5, alpha=0.8, label=label)
    ax.set_title(title, fontsize=11)
    ax.legend(fontsize=7, loc='upper left')

plt.suptitle('Training vs. test MSE: the effect of increasing flexibility', fontsize=14)
plt.tight_layout(); plt.show()

# Training MSE and test MSE curves
np.random.seed(42)
X_train, y_train, f_true = simulate_data(100, f_nonlinear, sigma=1.5)
X_test, y_test, _ = simulate_data(500, f_nonlinear, sigma=1.5)

degrees = np.arange(1, 21)
train_mse, test_mse = [], []

for d in degrees:
    poly = PolynomialFeatures(degree=d)
    model = LinearRegression().fit(poly.fit_transform(X_train.reshape(-1,1)), y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(poly.transform(X_train.reshape(-1,1)))))
    test_mse.append(mean_squared_error(y_test, model.predict(poly.transform(X_test.reshape(-1,1)))))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(degrees, train_mse, 'grey', lw=2, label='Training MSE')
ax.plot(degrees, test_mse, 'r-', lw=2, label='Test MSE')
ax.axhline(y=1.5**2, color='k', linestyle='--', alpha=0.5, label='Var(ε) = lowest achievable expected test MSE')
optimal_d = degrees[np.argmin(test_mse)]
ax.axvline(x=optimal_d, color='b', linestyle=':', alpha=0.7, label=f'Best flexibility d={optimal_d}')
ax.set_xlabel('Polynomial degree (flexibility)'); ax.set_ylabel('Mean Squared Error')
ax.set_title('Figure 2.9 (right panel) concept: training MSE vs. test MSE')
ax.legend(); plt.tight_layout(); plt.show()
print(f"Best polynomial degree: d={optimal_d}, test MSE={min(test_mse):.3f}")
print(f"Training MSE non-increasing in d: {all(train_mse[i] >= train_mse[i+1] for i in range(len(train_mse)-1))}")
print("Test MSE is typically U-shaped in d")

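1.4 Why not rely on training MSE alone?

Phenomenon	Cause	Consequence
Training MSE decreases monotonically	A more flexible model fits the training data more closely	Training MSE cannot be used to choose a model
Test MSE is U-shaped	Bias-variance trade-off	An optimal level of flexibility exists
Overfitting	The model memorizes noise rather than the true pattern	Test error is far larger than training error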

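🎯 Application: election poll forecasting
Nate Silver's FiveThirtyEight model comes in polls-only and polls-plus versions, with out-of-sample validation used to choose between them. In 2012 it called all 50 states correctly; in 2016 the out-of-sample error exposed a weakness in the forecast (white voters without a college degree were underweighted). The key lesson: a low training MSE does not imply accurate predictions in the real world.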

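II. The Bias-Variance Trade-off

📚 Theoretical roots: the bias-variance decomposition was derived in full by Geman, Bienenstock & Doursat (1992) in the context of neural networks. Hastie, Tibshirani & Friedman (2009), ESL §7.3, give a general treatment. The trade-off is one of the most important and least intuitive ideas in statistical learning.

2.1 Mathematical derivation: decomposing the expected test MSE

For a given \(x_0\), the expected test MSE decomposes into three components: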

\[ \underbrace{\mathbb{E}\left[(y_0 - \hat{f}(x_0))^2\right]}_{\text{expected test MSE}} = \underbrace{\text{Var}(\hat{f}(x_0))}_{\text{variance}} + \underbrace{[\text{Bias}(\hat{f}(x_0))]^2}_{\text{bias}^2} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible error}} \quad\text{(Equation 2.7)} \]

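Derivation (for a given \(x_0\)). Assume \(\mathbb{E}[\epsilon] = 0\) and that \(\epsilon\) is independent of the training set, hence of \(\hat{f}(x_0)\). Then the cross term \(2\mathbb{E}[(f(x_0) - \hat{f}(x_0))\epsilon]\) vanishes and \(\mathbb{E}[\epsilon^2] = \text{Var}(\epsilon)\). In the last step the second cross term, \(2\mathbb{E}[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])(\mathbb{E}[\hat{f}(x_0)] - f(x_0))]\), is also zero because \(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\) is a constant and \(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\) has mean zero: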

\[ \begin{aligned} \mathbb{E}[(y_0 - \hat{f}(x_0))^2] &= \mathbb{E}[(f(x_0) + \epsilon - \hat{f}(x_0))^2] \\ &= \mathbb{E}[(f(x_0) - \hat{f}(x_0))^2] + \mathbb{E}[\epsilon^2] + 2\mathbb{E}[(f(x_0) - \hat{f}(x_0))\epsilon] \\ &= \mathbb{E}[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)] + \mathbb{E}[\hat{f}(x_0)] - f(x_0))^2] + \text{Var}(\epsilon) \\ &= \underbrace{\mathbb{E}[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2]}_{\text{Var}(\hat{f}(x_0))} + \underbrace{(\mathbb{E}[\hat{f}(x_0)] - f(x_0))^2}_{[\text{Bias}(\hat{f}(x_0))]^2} + \underbrace{\text{Var}(\epsilon)}_{\sigma^2} \end{aligned} \]

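📚 Key insight: \(\text{Bias}(\hat{f}(x_0)) = \mathbb{E}[\hat{f}(x_0)] - f(x_0)\) measures how far the estimate is from the truth on average (systematic error). If \(\hat{f}\), averaged over different training sets, is close to \(f\), the bias is low.
\(\text{Var}(\hat{f}(x_0)) = \mathbb{E}[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2]\) measures how much \(\hat{f}\) fluctuates from one training set to another. If different training sets produce similar \(\hat{f}\), the variance is low.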
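The decomposition can be checked numerically at a single point. The sketch below, under assumed settings (a true function `sin(x)`, noise sd 0.5, a deliberately inflexible straight-line fit, `x0 = 2`), draws many independent training sets and compares the directly simulated expected test MSE with the sum Var + Bias² + Var(ε).

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)                  # hypothetical true f
sigma, x0, n, reps = 0.5, 2.0, 30, 5000

preds = np.empty(reps)
sq_err = np.empty(reps)
for r in range(reps):
    # A fresh training set in every repetition
    x = rng.uniform(0, 5, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, deg=1)       # straight-line fit: low variance, noticeable bias
    preds[r] = np.polyval(coef, x0)
    y0 = f(x0) + rng.normal(0, sigma)    # independent test response at x0
    sq_err[r] = (y0 - preds[r]) ** 2

variance = preds.var()                   # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2    # [Bias(f_hat(x0))]^2
lhs = sq_err.mean()                      # simulated E[(y0 - f_hat(x0))^2]
rhs = variance + bias_sq + sigma ** 2
print(f"E[(y0 - f_hat)^2] ~ {lhs:.4f};  Var + Bias^2 + sigma^2 ~ {rhs:.4f}")
```

The two numbers agree up to Monte Carlo error, and in this setup most of the reducible error comes from bias, because a line cannot follow the sine curve.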

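2.2 Bias vs. variance: the intuition

Concept	Intuitive explanation	Analogy
Bias	How far the model's average prediction lies from the truth	Archery: the arrows cluster away from the bullseye
Variance	How much predictions fluctuate across different training sets	Archery: the arrows are widely scattered
Irreducible error	Noise in the data itself, which no model can remove	The bullseye itself is wobbling

2.3 How flexibility affects bias and variance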

# Figure 2.12 concept: how bias, variance and test MSE relate
np.random.seed(42)
f_true = lambda x: 0.5 * x + 3 * np.sin(x) + 0.2 * x**2 - 5
n_bootstrap = 100  # number of independently simulated training sets (fresh draws, not bootstrap resamples)
n_train = 50
X_all = np.random.uniform(0, 10, 1000)
degrees = np.arange(1, 16)
bias_sq, variance, test_mse_vals = [], [], []

for d in degrees:
    preds = np.zeros((n_bootstrap, len(X_all)))
    for b in range(n_bootstrap):
        X_boot = np.sort(np.random.uniform(0, 10, n_train))
        y_boot = f_true(X_boot) + np.random.normal(0, 1.5, n_train)
        poly = PolynomialFeatures(degree=d)
        model = LinearRegression().fit(poly.fit_transform(X_boot.reshape(-1,1)), y_boot)
        preds[b] = model.predict(poly.transform(X_all.reshape(-1,1)))

    f_bar = preds.mean(axis=0)
    bias_sq.append(np.mean((f_bar - f_true(X_all))**2))
    variance.append(np.mean(np.var(preds, axis=0)))
    test_mse_vals.append(bias_sq[-1] + variance[-1] + 1.5**2)

# Plot Bias², Variance and Test MSE on a single figure
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(degrees, bias_sq, 'b-', lw=2, label='Bias²')
ax.plot(degrees, variance, 'orange', lw=2, label='Variance')
ax.plot(degrees, test_mse_vals, 'r-', lw=2, label='Test MSE')
ax.axhline(y=1.5**2, color='k', linestyle='--', alpha=0.5, label='Var(ε)')
opt = degrees[np.argmin(test_mse_vals)]
ax.axvline(x=opt, color='grey', linestyle=':', alpha=0.7)
ax.set_xlabel('Polynomial degree (flexibility)')
ax.set_ylabel('Error')
ax.set_title('Figure 2.12: Bias², Variance, Test MSE vs. flexibility')
ax.legend()
plt.tight_layout(); plt.show()
print(f"Best polynomial degree: {opt}")
print(f"Bias² non-increasing: {all(bias_sq[i] >= bias_sq[i+1] for i in range(len(bias_sq)-1))}")
print(f"Variance non-decreasing: {all(variance[i] <= variance[i+1] for i in range(len(variance)-1))}")

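2.4 The bias-variance trade-off at a glance

Model type	Bias	Variance	When it fits	Examples
Simple model (low flexibility)	High	Low	True f is simple, noise is large	Linear regression, Ridge
Moderate model	Medium	Medium	The general case	GAM, small decision trees
Flexible model (high flexibility)	Low	High	True f is complex, data are plentiful	XGBoost, neural networks
Overly flexible	Very low	Very high	Avoid!	Unregularized deep networks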

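🎯 Application: credit scoring models
A FICO-style credit scoring model has to achieve both low bias (accurate default probabilities) and low variance (models trained on different time periods should agree). Regulators such as the US CFPB expect credit models to be robust: if different training samples yield very different scores, fairness problems follow. In practice this favors moderately flexible models (for example Logistic Regression with binned WOE encoding) over deep learning.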

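✅ Advantages of high-bias, low-variance models

Stable and reproducible
Interpretable and easy to regulate
Usable with small samples
Insensitive to noise

❌ Disadvantages of high-bias, low-variance models

Underfitting
Cannot capture complex patterns
Limited predictive accuracy

✅ Advantages of low-bias, high-variance models

High predictive accuracy
Can learn complex patterns
Perform well when data are plentiful

❌ Disadvantages of low-bias, high-variance models

Risk of overfitting
Unstable training
Hard to interpret and regulate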

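III. Error Rate in Classification

📚 ISLP pp. 35–36. Classification uses the 0-1 loss \(L(Y, \hat{Y}) = I(Y \neq \hat{Y})\), whose risk function is exactly the error rate.

Training error rate for classification:

\[ \text{Error}_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) \quad\text{(Equation 2.8)} \]

Test error rate for classification:

\[ \text{Error}_{\text{test}} = \text{Ave}\left(I(y_0 \neq \hat{y}_0)\right) \quad\text{(Equation 2.9)} \]

Here \(I(\cdot)\) is the indicator function (ISLP calls it an indicator variable): it equals 1 when its argument is true and 0 otherwise.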
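As a small worked example of Equation 2.8, the indicator can be written as an elementwise comparison, and the error rate is its mean. The label vectors are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# I(y_i != y_hat_i): 1 for each misclassified observation, 0 otherwise
indicator = (y_true != y_pred).astype(int)
error_rate = indicator.mean()
print(indicator, error_rate)  # two mismatches out of eight -> 0.25
```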

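3.1 Evaluation metrics: classification vs. regression

Problem type	Core metrics	Loss function	Mathematical form
Regression	MSE, MAE, RMSE	Squared loss	\((y - \hat{y})^2\)
Binary classification	Error Rate, Accuracy	0-1 loss	\(I(y \neq \hat{y})\)
Multiclass classification	Top-1 Accuracy	0-1 loss	\(I(y \neq \hat{y})\)
Imbalanced classification	F1, AUC, Precision-Recall	Various	Chosen to suit the problem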

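🎯 Application: medical diagnosis
In breast cancer screening, a false negative (a missed cancer) costs far more than a false positive (an unnecessary follow-up exam). The error rate alone is therefore not enough: sensitivity (true positive rate), specificity (true negative rate) and ROC-AUC are needed. A common reference is the ROC analysis framework of Fawcett (2006).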
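The metrics named above can be computed with scikit-learn as follows. The ten labels and scores are invented for illustration, and the 0.5 threshold is an assumption; a screening programme would typically lower it to trade specificity for sensitivity.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical screening results: 1 = cancer, 0 = healthy
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])
y_pred = (scores >= 0.5).astype(int)     # assumed decision threshold

# For labels {0, 1}, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)             # share of cancers that are detected
specificity = tn / (tn + fp)             # share of healthy patients correctly cleared
error_rate = (fp + fn) / len(y_true)
auc = roc_auc_score(y_true, scores)      # threshold-free ranking quality

print(f"error rate={error_rate:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.3f}, AUC={auc:.3f}")
```

The error rate of 0.20 hides the fact that one in four cancers is missed (sensitivity 0.75), which is the number that matters most in this setting.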

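IV. The Bayes Classifier: the Theoretical Gold Standard

📚 ISLP pp. 36–37. The Bayes classifier is the optimal decision rule of decision theory; see Berger (1985), Statistical Decision Theory and Bayesian Analysis. The Bayes error rate is the lowest error rate that is theoretically attainable in a classification problem.

4.1 Definition

The Bayes classifier assigns an observation \(x_0\) to the class with the largest conditional probability:

\[ \hat{y}_{\text{Bayes}} = \arg\max_j \Pr(Y = j \mid X = x_0) \quad\text{(Equation 2.10)} \]

In a two-class problem:

\[ \hat{y}_{\text{Bayes}} = \begin{cases} 1, & \text{if } \Pr(Y=1 \mid X=x_0) > 0.5 \\ 2, & \text{otherwise} \end{cases} \]

4.2 The Bayes error rate

\[ \text{Bayes Error} = 1 - \mathbb{E}\left[\max_j \Pr(Y = j \mid X)\right] \quad\text{(Equation 2.11)} \]

The Bayes error rate is greater than 0 exactly when the classes overlap in predictor space, that is, when \(\max_j \Pr(Y = j \mid X) < 1\) on a set of positive probability. It is the classification analogue of the irreducible error.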
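A one-dimensional case makes Equation 2.11 concrete. The sketch assumes equal priors with \(X \mid Y=0 \sim N(0,1)\) and \(X \mid Y=1 \sim N(2,1)\); in that setup the Bayes boundary is \(x = 1\) and the Bayes error has the closed form \(\Phi(-1) \approx 0.1587\), which the simulation should reproduce.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Assumed setup: equal priors, X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1)
y = rng.integers(0, 2, n)
x = rng.normal(loc=2.0 * y, scale=1.0)

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Posterior Pr(Y=1 | X=x) from Bayes' theorem (the equal priors cancel)
p1 = normal_pdf(x, 2.0) / (normal_pdf(x, 0.0) + normal_pdf(x, 2.0))

bayes_pred = (p1 > 0.5).astype(int)                  # Equation 2.10
mc_error = np.mean(bayes_pred != y)                  # empirical error of the Bayes rule
plug_in = 1 - np.mean(np.maximum(p1, 1 - p1))        # Equation 2.11, Monte Carlo average
exact = 0.5 * (1 + math.erf(-1 / math.sqrt(2)))      # Phi(-1), closed form for this setup

print(f"simulated={mc_error:.4f}, via Eq. 2.11={plug_in:.4f}, exact={exact:.4f}")
```

No classifier can do better than this figure on data from these two distributions, because the error comes from the overlap of the class densities rather than from the rule.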

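# Figure 2.13 concept: the Bayes decision boundary
from scipy.stats import multivariate_normal

np.random.seed(42)
n = 200
# The two classes are generated from different (known) bivariate normal distributions
mu_blue, cov_blue = [2, 3], [[2, 0.5], [0.5, 1]]
mu_orange, cov_orange = [5, 5], [[1.5, -0.3], [-0.3, 2]]
X1_blue = np.random.multivariate_normal(mu_blue, cov_blue, n//2)
X1_orange = np.random.multivariate_normal(mu_orange, cov_orange, n//2)
X_bayes = np.vstack([X1_blue, X1_orange])
y_bayes = np.hstack([np.zeros(n//2), np.ones(n//2)])

# Bayes classifier: the data are simulated, so the true class densities are known and
# Pr(Orange | X) follows from Bayes' theorem (equal priors).
# GaussianNB would only approximate this, because it ignores the covariance between X1 and X2.
# With real data the true densities are unknown and must be estimated.
def posterior_orange(X):
    p_blue = multivariate_normal(mu_blue, cov_blue).pdf(X)
    p_orange = multivariate_normal(mu_orange, cov_orange).pdf(X)
    return p_orange / (p_blue + p_orange)

# Evaluate the posterior on a grid to draw the decision boundary
x_min, x_max = X_bayes[:, 0].min()-1, X_bayes[:, 0].max()+1
y_min, y_max = X_bayes[:, 1].min()-1, X_bayes[:, 1].max()+1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
Z = posterior_orange(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(9, 7))
ax.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.3, colors=['#1f77b4', '#ff7f0e'])
ax.contour(xx, yy, Z, levels=[0.5], colors='purple', linestyles='--', linewidths=2)
ax.scatter(X_bayes[y_bayes==0, 0], X_bayes[y_bayes==0, 1], c='#1f77b4', s=30, alpha=0.7, label='Class Blue')
ax.scatter(X_bayes[y_bayes==1, 0], X_bayes[y_bayes==1, 1], c='#ff7f0e', s=30, alpha=0.7, label='Class Orange')
ax.set_xlabel('X₁'); ax.set_ylabel('X₂')
ax.set_title('Figure 2.13 concept: Bayes decision boundary\n(purple dashed line: Pr(Orange|X) = 0.5)')
ax.legend()
# Monte Carlo estimate of Equation 2.11 over the simulated sample
post = posterior_orange(X_bayes)
bayes_error = 1 - np.mean(np.maximum(post, 1 - post))
print(f"Estimated Bayes error rate: {bayes_error:.4f} (> 0 because the classes overlap)")
plt.tight_layout(); plt.show()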

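4.3 Key properties of the Bayes classifier

Property	Explanation
Optimality	Lowest possible test error rate; no classifier can beat it
Prerequisite	Requires \(\Pr(Y \mid X)\), which is usually unknown in practice
Role	Theoretical gold standard and benchmark for practical methods
Irreducible error	Even the Bayes classifier may have an error rate > 0 (overlapping classes)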

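🎯 Application: spam filtering
Gmail's spam classifier aims to get close to the Bayes error rate. Suppose, for illustration, that \(\Pr(\text{spam} \mid \text{contains "Viagra"}) \approx 0.99\) and \(\Pr(\text{spam} \mid \text{sent by a known contact}) \approx 0.01\). Some messages still fall in a gray zone (marketing email, for example), where even the optimal classifier makes irreducible errors. Google is reported to combine large-scale Naive Bayes with deep learning, with a measured error rate below 0.1%.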

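V. The K-Nearest Neighbors (KNN) Classifier

📚 ISLP pp. 37–49. KNN was introduced by Fix & Hodges (1951) and is among the simplest nonparametric classification methods. Cover & Hart (1967) showed that the asymptotic error rate of the 1-nearest-neighbor rule is at most twice the Bayes error rate. Consistency, meaning that the KNN error rate converges to the Bayes error rate as \(n \to \infty\), \(K \to \infty\), \(K/n \to 0\), was established in general by Stone (1977).

5.1 The KNN algorithm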

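1. Given a test point \(x_0\), find the K training points closest to it; call this set \(\mathcal{N}_0\).
2. Estimate the conditional probability of class j as the fraction of points in \(\mathcal{N}_0\) whose label is j:

\[ \Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j) \quad\text{(Equation 2.12)} \]

3. Assign \(x_0\) to the class with the highest estimated probability.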
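The three steps translate almost line for line into NumPy. This is a minimal from-scratch sketch with Euclidean distance and a six-point toy data set; the function name `knn_predict_proba` is made up here, and in practice `sklearn.neighbors.KNeighborsClassifier` does the same job.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, K, classes):
    """Estimate Pr(Y = j | X = x0) as the share of the K nearest training points in class j."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # step 1: distance to every training point
    neighbors = np.argsort(dist)[:K]              # indices of the K nearest points (N_0)
    return np.array([np.mean(y_train[neighbors] == j) for j in classes])  # step 2: Eq. 2.12

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x0 = np.array([1.0, 1.0])

proba_k3 = knn_predict_proba(X_train, y_train, x0, K=3, classes=[0, 1])
proba_k5 = knn_predict_proba(X_train, y_train, x0, K=5, classes=[0, 1])
label = int(np.argmax(proba_k3))                  # step 3: most probable class
print(proba_k3, proba_k5, label)
```

With K=3 all neighbors belong to class 0; with K=5 two distant class-1 points are pulled into the neighborhood and the estimate drops to 0.6, which shows how a larger K smooths the estimated probabilities.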

# Figure 2.14–2.16 concept: KNN decision boundaries
from matplotlib.colors import ListedColormap

np.random.seed(42)
n_knn = 150
X_b = np.random.multivariate_normal([2, 3], [[2, 0.5], [0.5, 1]], n_knn//2)
X_o = np.random.multivariate_normal([5, 5], [[1.5, -0.3], [-0.3, 2]], n_knn//2)
X_knn = np.vstack([X_b, X_o])
y_knn = np.hstack([np.zeros(n_knn//2), np.ones(n_knn//2)])

# KNN with different values of K
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
x_min, x_max = X_knn[:, 0].min()-1, X_knn[:, 0].max()+1
y_min, y_max = X_knn[:, 1].min()-1, X_knn[:, 1].max()+1

for ax, K, title in zip(axes, [1, 10, 100],
    ['K=1: too flexible (low bias / high variance)',
     'K=10: moderate (close to the Bayes boundary)',
     'K=100: too smooth (high bias / low variance)']):
    knn = KNeighborsClassifier(n_neighbors=min(K, len(X_knn)))
    knn.fit(X_knn, y_knn)
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#1f77b4', '#ff7f0e']))
    ax.scatter(X_knn[y_knn==0, 0], X_knn[y_knn==0, 1], c='#1f77b4', s=20)
    ax.scatter(X_knn[y_knn==1, 0], X_knn[y_knn==1, 1], c='#ff7f0e', s=20)
    ax.set_title(title, fontsize=10)
    # Training error rate (the classifier is scored on the data it was fit to)
    err = 1 - knn.score(X_knn, y_knn)
    ax.text(0.02, 0.98, f'Train Error = {err:.3f}', transform=ax.transAxes,
            fontsize=9, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

plt.suptitle('Figure 2.16 concept: how K shapes the KNN decision boundary', fontsize=14)
plt.tight_layout(); plt.show()

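5.2 Choosing K: the bias-variance trade-off

K	Model behavior	Bias	Variance	Decision boundary
K = 1	Most flexible; fits the training set perfectly	Very low	Very high	Extremely irregular
K small (3–5)	Flexible; captures local structure	Low	High	Irregular
K moderate (≈√n)	Balanced	Medium	Medium	Smooth
K large (>100)	Oversmoothed	High	Low	Close to linear
K = n	Always predicts the majority class	Very high	Very low	None (constant classifier)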

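# Test error rate vs. K
np.random.seed(42)
# X_knn is stacked class by class, so shuffle before splitting;
# otherwise the last 50 rows (the test set) would all belong to one class
perm = np.random.permutation(len(X_knn))
X_train_knn, y_train_knn = X_knn[perm[:100]], y_knn[perm[:100]]
X_test_knn, y_test_knn = X_knn[perm[100:]], y_knn[perm[100:]]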

Ks = np.arange(1, 51)
train_errs, test_errs = [], []
for K in Ks:
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(X_train_knn, y_train_knn)
    train_errs.append(1 - knn.score(X_train_knn, y_train_knn))
    test_errs.append(1 - knn.score(X_test_knn, y_test_knn))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(Ks, train_errs, 'grey', lw=2, label='Training Error')
ax.plot(Ks, test_errs, 'r-', lw=2, label='Test Error')
opt_K = Ks[np.argmin(test_errs)]
ax.axvline(x=opt_K, color='b', linestyle=':', alpha=0.7, label=f'Best K={opt_K}')
ax.set_xlabel('K (number of neighbors)'); ax.set_ylabel('Error Rate')
ax.set_title('KNN: training/test error rate vs. K')
ax.legend(); plt.tight_layout(); plt.show()
print(f"Best K = {opt_K}, test error rate = {min(test_errs):.4f}")
print("Larger K -> training error rate rises (higher bias)")
print("K too small -> test error rate is high (high variance)")

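5.3 How KNN relates to the Bayes classifier

📚 Consistency: as \(n\to\infty\), \(K\to\infty\), \(K/n\to 0\), \(\mathbb{E}[\text{Error}_{\text{KNN}}] \to \text{Bayes Error}\) (Stone, 1977); for fixed \(K = 1\), Cover & Hart (1967) bound the limiting error by twice the Bayes error. Intuition: with unlimited data, the K nearest neighbors all lie arbitrarily close to \(x_0\), so the class frequencies among them converge to the true conditional probabilities.

Property	Bayes classifier	KNN classifier
Optimality	Theoretically optimal	Asymptotically optimal (converges to Bayes as n→∞)
Practical feasibility	Infeasible (needs the true distribution)	Feasible (needs only training data)
Error rate	Lowest (Bayes error rate)	Somewhat above Bayes, depending on K and n
Parameters	None	K (hyperparameter)
Computational cost	N/A	O(n) per prediction (brute-force search), O(1) training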

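VI. Evaluation Methods Compared

6.1 Training error vs. test error vs. cross-validation

Method	Computational cost	Bias of the error estimate	Variance of the error estimate	Best suited for
Training error	Low	Severely underestimates the true error	Low	Not recommended for model selection
Validation set (hold-out)	Low	Tends to overestimate (the model is fit on only part of the data)	High (depends on the particular split)	Large data sets, quick prototyping
K-fold CV	Moderate (K fits)	Low	Moderate	General use, n < 10,000
LOOCV	High (n fits)	Very low (nearly unbiased)	High	Very small samples
Bootstrap (.632)	High	Low	Moderate	Nonlinear models, complex evaluation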

📚 For the theoretical foundations of cross-validation see Stone (1974) and Geisser (1975). Efron (1983) proposed the .632 bootstrap correction. See ISLP §5.1 for details.
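As a rough sketch of K-fold cross-validation used for model selection, the snippet below estimates the error of KNN classifiers with several neighbor counts by 5-fold CV. The two-cloud data set and the candidate values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical two-class data: two overlapping Gaussian clouds
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

# shuffle=True matters here because the rows are ordered by class
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_error = {}
for K in [1, 5, 15, 45]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=cv)
    cv_error[K] = 1 - acc.mean()       # average error over the 5 held-out folds

best_K = min(cv_error, key=cv_error.get)
print({k: round(v, 3) for k, v in cv_error.items()}, "-> best K:", best_K)
```

Every observation is held out exactly once, so the estimate uses all the data while never scoring a model on points it was trained on.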

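6.2 Classifier comparison: KNN vs. Logistic Regression vs. LDA vs. QDA vs. Naive Bayes

Classifier	Decision boundary	Parametric	Strengths	Weaknesses
KNN	Nonlinear, arbitrary shape	No	Flexible, no distributional assumptions	Curse of dimensionality, costly predictions
Logistic Regression	Linear	Yes	Interpretable, outputs probabilities	Limited to linear boundaries
LDA	Linear	Yes	Simple, well-understood theory	Assumes Gaussian classes with a common covariance
QDA	Quadratic	Yes	More flexible than LDA	More parameters, needs more data
Naive Bayes	Depends on the assumed densities	Yes	Very fast, works well in high dimensions	Conditional independence is often violated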
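The five classifiers in the table can be compared on equal footing with cross-validated error. This sketch uses synthetic two-class Gaussian data with unequal covariances (an assumption, not the ISLP data) and default hyperparameters apart from `n_neighbors=10`.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical data: two Gaussian classes with different covariance matrices
X = np.vstack([rng.multivariate_normal([2, 3], [[2, 0.5], [0.5, 1]], 150),
               rng.multivariate_normal([5, 5], [[1.5, -0.3], [-0.3, 2]], 150)])
y = np.repeat([0, 1], 150)

models = {
    "KNN (K=10)": KNeighborsClassifier(n_neighbors=10),
    "Logistic Regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
}

# The same stratified folds are reused for every model so the errors are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_error = {name: 1 - cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}

for name, err in cv_error.items():
    print(f"{name:20s} CV error = {err:.3f}")
```

On data this simple the methods tend to land close together; the differences in the table show up mainly when the true boundary is strongly nonlinear or the sample is small.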

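VII. Practical Guidelines

7.1 Model selection workflow

1. Split the data: training set (60%) / validation set (20%) / test set (20%)
2. Fit several models of differing flexibility on the training set
3. Choose the best model on the validation set (or use CV)
4. Report final performance on the test set (use it only once!)
5. Never tune hyperparameters on the test set: doing so yields an overly optimistic error estimate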
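The workflow above can be sketched with two calls to `train_test_split`. The data, the KNN model family and the candidate values of K are illustrative assumptions; the point is the order of operations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical two-class data
X = np.vstack([rng.normal(0, 1, (250, 2)), rng.normal(2, 1, (250, 2))])
y = np.repeat([0, 1], 250)

# Step 1: 60% train, then split the remaining 40% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Steps 2-3: fit each candidate on the training set, compare on the validation set
candidates = [1, 5, 15, 45]
val_error = {}
for K in candidates:
    model = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    val_error[K] = 1 - model.score(X_val, y_val)
best_K = min(val_error, key=val_error.get)

# Step 4: touch the test set exactly once, with the chosen model
final_model = KNeighborsClassifier(n_neighbors=best_K).fit(X_train, y_train)
test_error = 1 - final_model.score(X_test, y_test)
print(f"best K = {best_K}, validation error = {val_error[best_K]:.3f}, test error = {test_error:.3f}")
```

The validation error of the winning model is slightly optimistic because it was used for selection, which is why the final figure comes from the untouched test set.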

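7.2 Common pitfalls

Tuning on the test set: the test set becomes a "second training set" and no longer measures generalization
Looking only at training error: an overfit model has very low training error but generalizes poorly
Ignoring the bias-variance trade-off: blindly chasing the most flexible model while variance explodes
Data leakage: using future information or test-set information during training
Using the error rate with imbalanced classes: when the class ratio is extreme, the error rate is uninformative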
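The last pitfall is easy to demonstrate. In this made-up example with 1% positives, a rule that always predicts the majority class looks excellent by error rate and useless by recall or F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 990 negatives, 10 positives
y_true = np.array([0] * 990 + [1] * 10)
y_majority = np.zeros_like(y_true)      # a "classifier" that always predicts the majority class

error_rate = 1 - accuracy_score(y_true, y_majority)
recall = recall_score(y_true, y_majority, zero_division=0)   # share of positives found
f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"error rate = {error_rate:.3f}, recall = {recall:.1f}, F1 = {f1:.1f}")
```

An error rate of 1% here reflects only the class ratio; the classifier never finds a single positive case.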

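Key takeaway of the day

"The expected test MSE always equals squared bias + variance + irreducible error. The art of choosing a statistical learning method therefore lies not in eliminating bias or variance, but in finding the balance point where their sum is smallest."

- ISLP §2.2.2, adapted from Equation 2.7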
