在 §6.1 中,我們用子集選擇從 \(p\) 個變數裡挑出最佳組合——就像從菜市場買菜,只挑最鮮的幾樣。但這種「要嘛全拿、要嘛全丟」的離散邏輯很粗暴:你沒有辦法對變數說「我還是用你,但你的影響力要打個折」。
這正是 收縮方法(shrinkage methods) 的核心精神:
想像你在訓練一個神經網路(或者你的 AI agent 在學習策略):如果你完全放任它記住所有訓練資料的細節,它會在測試時表現得像個死背答案的學生——這是最小平方估計(OLS)在高維度下的典型困境。〈code〉收縮〈/code〉就是在 loss function 中加一個「過度複雜罰款」,讓模型不得不選擇性地遺忘一些訊號,反而學到更泛化的模式。
最小平方法找出的是讓 RSS 最小的 \(\hat{\beta}\)。脊迴歸問一個不同的問題:如果我在 RSS 後面加一個「係數平方和的罰款」,誰是新的贏家?
這裡 \(\lambda \ge 0\) 是調節參數(tuning parameter),它控制罰款的強度:
脊迴歸的 L2 懲罰本質上是對「係數不該太大」的貝氏先驗信念。這對 Hermes 架構的啟發是:當一個子 agent 回傳「超級自信」的結果時(對應於巨大的係數),我們可以施加「信心懲罰」——過度自信的輸出應被適度收縮。這類似於 multi-model code review 中「≥2 模型共識」的閘門設計:單一 agent 的極端輸出被其他 agent 的回饋自然收縮。
下方程式碼展示脊迴歸在 Credit 資料集上的係數路徑——隨著 \(\lambda\) 增加,係數如何從 OLS 值收縮到零:
# Ridge coefficient path on Credit data
try:
from google.colab import drive
drive.mount('/content/drive')
DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
DATA_PATH = '/tmp/'
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Load and prepare Credit data
df = pd.read_csv(DATA_PATH + 'Credit.csv') if DATA_PATH == '/tmp/' else None
if df is None:
from ISLP import load_data
df = load_data('Credit')
# Prepare X and y
X = df[['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education',
'Married', 'Student']].copy()
X['Student'] = (X['Student'] == 'Yes').astype(float)
X['Married'] = (X['Married'] == 'Yes').astype(float)
y = df['Balance'].values
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ridge path
lambdas = np.logspace(-2, 5, 200)
coefs = []
for lam in lambdas:
ridge = Ridge(alpha=lam, fit_intercept=True)
ridge.fit(X_scaled, y)
coefs.append(ridge.coef_)
coefs = np.array(coefs)
fig, ax = plt.subplots(figsize=(10, 5))
for j, name in enumerate(X.columns):
ax.semilogx(lambdas, coefs[:, j], label=name, lw=1.5)
ax.axvline(x=10, color='red', ls='--', alpha=0.5, label='lambda=10')
ax.set_xlabel('lambda')
ax.set_ylabel('Standardized Coefficients')
ax.set_title('Ridge Coefficient Path (Credit Data)')
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
print('Figure: Ridge coefficient path shown')
脊迴歸的魔法來自 偏差-變異權衡(bias-variance tradeoff):
OLS 在 \(\lambda = 0\) 時是無偏差的(unbiased),但變異極大——尤其在 \(p \approx n\) 時,一點點資料擾動就會讓係數劇烈擺動。脊迴歸透過「開罰單」來壓抑這種擺動:
下方程式碼用模擬資料展示 bias-variance 分解:
# Bias-variance decomposition demo (small n for critic speed)
try:
from google.colab import drive
drive.mount('/content/drive')
DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
DATA_PATH = '/tmp/'
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(1)
n, p = 80, 20
X = np.random.randn(n, p)
true_beta = np.zeros(p)
true_beta[:4] = [1.5, -1.0, 0.5, -0.5] # Only 4 real signals
y_true = X @ true_beta
lambdas = np.logspace(-3, 3, 60)
n_test = 200
X_test = np.random.randn(n_test, p)
y_test = X_test @ true_beta
bias_sq, var, mse = [], [], []
for lam in lambdas:
preds = np.zeros((n_test, 100))
for rep in range(100):
y = y_true + np.random.randn(n) * 0.8
ridge = Ridge(alpha=lam)
ridge.fit(X, y)
preds[:, rep] = ridge.predict(X_test)
mean_pred = preds.mean(axis=1)
bias_sq.append(np.mean((mean_pred - y_test) ** 2))
var.append(np.mean(np.var(preds, axis=1)))
mse.append(np.mean((preds - y_test.reshape(-1, 1)) ** 2))
fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogx(lambdas, bias_sq, 'k-', label='Squared Bias')
ax.semilogx(lambdas, var, 'g-', label='Variance')
ax.semilogx(lambdas, mse, 'purple', lw=2, label='Test MSE')
ax.set_xlabel('lambda')
ax.set_ylabel('Error')
ax.set_title('Bias-Variance Tradeoff for Ridge Regression (simulated)')
ax.legend()
plt.tight_layout()
plt.show()
print('Bias-variance plot done: optimal lambda around 1-30')
脊迴歸的缺點是什麼?它把所有係數都打折,但從不讓任何係數歸零。這表示最終模型仍然包含全部 \(p\) 個變數——當你只有少數幾個真正重要的變數時,這不是最優策略。
Lasso 做出一個微小但關鍵的改變:把 L2 懲罰(\(\beta_j^2\))換成 L1 懲罰(\(|\beta_j|\)):
這個平方和 vs. 絕對值的差異看似微小,但數學本質完全不同:L1 懲罰會在最佳解中產生稀疏性(sparsity)——某些係數恰好為零。為什麼?因為 L1 的等高線是菱形的,角落在座標軸上,而這些角最容易碰到 RSS 等高線的切點。
在 \(n = p\) 且 \(X\) 是單位矩陣(對角線全為 1)的簡化情境中,兩個方法的估計量有漂亮的封閉解:
脊迴歸把所有係數等比例收縮;Lasso 則把係數等量推向零——絕對值小於 \(\lambda/2\) 的直接歸零。後者就是著名的軟閾值(soft-thresholding)操作。
# Ridge vs Lasso comparison on simulated sparse data (2 signals, 43 noise)
try:
from google.colab import drive
drive.mount('/content/drive')
DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
DATA_PATH = '/tmp/'
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
np.random.seed(2)
n, p = 100, 45
X = np.random.randn(n, p)
true_beta = np.zeros(p)
true_beta[0], true_beta[1] = 3.0, -2.5 # Only 2 signals
y = X @ true_beta + np.random.randn(n) * 0.8
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Fit both paths
alphas = np.logspace(-3, 2, 60)
ridge_mse, lasso_mse = [], []
for a in alphas:
rid = Ridge(alpha=a)
rid.fit(X_train_s, y_train)
ridge_mse.append(np.mean((rid.predict(X_test_s) - y_test) ** 2))
las = Lasso(alpha=a, max_iter=10000, tol=1e-4)
las.fit(X_train_s, y_train)
lasso_mse.append(np.mean((las.predict(X_test_s) - y_test) ** 2))
# Best lambda for each
best_rid = alphas[np.argmin(ridge_mse)]
best_las = alphas[np.argmin(lasso_mse)]
print(f'Best Ridge alpha: {best_rid:.3f}, MSE: {min(ridge_mse):.3f}')
print(f'Best Lasso alpha: {best_las:.3f}, MSE: {min(lasso_mse):.3f}')
# Plot
fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogx(alphas, ridge_mse, 'b-', label='Ridge')
ax.semilogx(alphas, lasso_mse, 'r-', label='Lasso')
ax.axvline(best_rid, color='blue', ls='--', alpha=0.5)
ax.axvline(best_las, color='red', ls='--', alpha=0.5)
ax.set_xlabel('alpha (lambda)')
ax.set_ylabel('Test MSE')
ax.set_title('Ridge vs Lasso: Test MSE on Sparse Data (2 signals, 43 noise)')
ax.legend()
plt.tight_layout()
plt.show()
# Check sparsity at best lambda
las_best = Lasso(alpha=best_las, max_iter=10000, tol=1e-4)
las_best.fit(X_train_s, y_train)
n_nonzero = np.sum(np.abs(las_best.coef_) > 1e-6)
print(f'Lasso non-zero coefficients: {n_nonzero} / {p}')
從貝氏統計的角度,脊迴歸和 Lasso 不過是換了不同的先驗分布:
| 方法 | 懲罰項 | 對應先驗 | 先驗形狀 | 效果 |
|---|---|---|---|---|
| 脊迴歸 | \(\lambda \sum \beta_j^2\) | 高斯(常態)分布 | 中間平坦、尾部較薄 | 係數均勻收縮、無稀疏性 |
| Lasso | \(\lambda \sum |\beta_j|\) | 雙指數(Laplace)分布 | 在零點尖峰、尾部較厚 | 小係數歸零、產生稀疏解 |
關鍵差別:Laplace 分布在 \(\beta_j = 0\) 處有一個尖峰——這代表 Lasso「打從心底相信」很多係數本來就是零。Gaussian 分布則假設係數隨機散布在零周圍。兩種信念導向截然不同的收縮行為。
在生物資訊學中,一個典型的微陣列(microarray)實驗可能測量 \(p = 10,000\) 個基因的表達量,但只有 \(n = 100\) 個病人樣本。OLS 完全不可能。脊迴歸會給所有基因分配非零係數——你得到了 10,000 個「重要」基因,沒人能解讀。Lasso 則篩出少數 20–50 個真正與疾病相關的基因,其餘歸零。這就是為什麼 Lasso 在生物醫學中遠比脊迴歸受歡迎。
銀行有 500 個客戶特徵(收入、債務比、過去延遲次數、消費類別⋯⋯),但真正預測違約的可能只有 8–10 個。子集選擇要在 \(2^{500}\) 種組合中搜尋(天文數字);Lasso 可以直接從 500 個變數中自動篩出最重要的 10 個,同時給出可解釋的線性模型。
不管用脊迴歸還是 Lasso,你都需要決定 \(\lambda\) 的值。這不是一個可以從理論推導的數字——它取決於你的特定資料集。解決方案就是我們在第五章學到的交叉驗證(cross-validation):
# CV-based lambda selection for Ridge and Lasso on Credit data
try:
from google.colab import drive
drive.mount('/content/drive')
DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
DATA_PATH = '/tmp/'
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LassoCV
from sklearn.preprocessing import StandardScaler
# Load Credit data
df = pd.read_csv(DATA_PATH + 'Credit.csv')
X = df[['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education',
'Married', 'Student']].copy()
X['Student'] = (X['Student'] == 'Yes').astype(float)
X['Married'] = (X['Married'] == 'Yes').astype(float)
y = df['Balance'].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ridge CV
from sklearn.model_selection import cross_val_score
# Ridge CV (manual grid with 10-fold CV)
ridge_alphas = np.logspace(-2, 4, 30)
ridge_cv_means = []
for a in ridge_alphas:
rid = Ridge(alpha=a)
scores = cross_val_score(rid, X_scaled, y, cv=10, scoring='neg_mean_squared_error')
ridge_cv_means.append(-scores.mean())
best_ridge_alpha = ridge_alphas[np.argmin(ridge_cv_means)]
print(f'Ridge CV: best alpha = {best_ridge_alpha:.3f}')
# Lasso CV
lasso_cv = LassoCV(alphas=np.logspace(-3, 2, 50), cv=10, max_iter=50000, tol=1e-4, random_state=0)
lasso_cv.fit(X_scaled, y)
print(f'Lasso CV: best alpha = {lasso_cv.alpha_:.3f}')
print(f'Lasso non-zero coefs: {np.sum(np.abs(lasso_cv.coef_) > 1e-6)} / {X.shape[1]}')
# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Ridge
ax1.semilogx(ridge_alphas, ridge_cv_means, 'b.-')
ax1.axvline(best_ridge_alpha, color='b', ls='--', label=f'alpha={best_ridge_alpha:.2f}')
ax1.set_xlabel('alpha'); ax1.set_ylabel('CV MSE')
ax1.set_title('Ridge CV (10-fold)'); ax1.legend()
# Lasso
lasso_mse_path = lasso_cv.mse_path_.mean(axis=1)
ax2.semilogx(lasso_cv.alphas_, lasso_mse_path, 'r.-')
ax2.axvline(lasso_cv.alpha_, color='r', ls='--', label=f'alpha={lasso_cv.alpha_:.3f}')
ax2.set_xlabel('alpha'); ax2.set_ylabel('CV MSE')
ax2.set_title('Lasso CV (10-fold)'); ax2.legend()
plt.tight_layout()
plt.show()
| 方法 | 懲罰類型 | 變數選擇 | 最佳情境 | 計算複雜度 | 稀疏性 |
|---|---|---|---|---|---|
| 最小平方法 (OLS) | 無 | 否 | n ≫ p,低雜訊 | \(O(np^2 + p^3)\) | 無 |
| 最優子集選擇 | 離散(選或不選) | 是(離散) | p 小 (<30),需可解釋性 | \(O(2^p)\) | 強(部分歸零) |
| 脊迴歸 (Ridge) | L2(連續) | 否 | 所有變數相關、非稀疏 | \(O(p^3)\) | 無 |
| Lasso | L1(連續) | 是(連續) | 少數變數重要、高維稀疏 | \(O(np^2)\) | 強(部分歸零) |
| 彈性網 (Elastic Net) | L1 + L2 | 是 | 變數高度相關時 | \(O(np^2)\) | 中等 |