6.5 Lab: 線性模型選擇與正則化方法

PAGE 課本 p.275–290 難度:★★★★★ TIME 約 45 分鐘
Lab 模型選擇 Ridge Lasso PCR PLS Hitters
THEORY 理論基礎:本章 Lab 將前面 6.1–6.4 的所有理論付諸實作——子集選擇、脊回歸(Ridge)、Lasso、主成分回歸(PCR)與偏最小平方法(PLS)。核心資料集為 Hitters(1986 年大聯盟球員薪資),出自 James, Witten, Hastie, Tibshirani (2023) An Introduction to Statistical Learning with Python

這堂實作課是第 6 章的壓軸——我們不再紙上談兵,而是親手用 Python 實作從 子集選擇脊回歸(Ridge)Lasso、到 PCR 與 PLS 的完整流程。每個程式碼區塊都可獨立執行,建議你在 Colab 或本機 Jupyter 中跟著跑一遍。

6.5.1 子集選擇方法(Subset Selection)

向前選取法(Forward Selection)

目標:從 19 個候選變數中挑出最能預測球員薪資(Salary)的子集。課本採用 Cp 統計量(Mallows' Cp)作為模型選擇準則——Cp 同時考量擬合誤差(RSS)與模型複雜度(懲罰多餘變數),選出「夠準又不冗餘」的模型。

THEORY Mallows' Cp = (RSS + 2p·σ²) / n,其中 p 為變數個數,σ² 用全模型 MSE 估計(課本 6.1.3)

SCENE 應用場景:MLB 薪資談判

球團想知道「哪些打擊數據真正值錢」。向前選取法從零開始,每次加入對薪資預測貢獻最大的變數,最後留下 10 個——包括安打數(Hits)、保送(Walks)、守備(PutOuts)——恰好反映市場對打擊與守備的重視。

# Forward Selection with Cp on Hitters data
try:
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
    DATA_PATH = '/tmp/'

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise, sklearn_selected
from functools import partial

# Load and clean
Hitters = load_data('Hitters')
Hitters = Hitters.dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)

# Estimate sigma^2 from full model
sigma2 = OLS(Y, X).fit().scale

# Negative Cp scorer
def nCp(sigma2, estimator, X, Y):
    n, p = X.shape
    Yhat = estimator.predict(X)
    RSS = np.sum((Y - Yhat)**2)
    return -(RSS + 2 * p * sigma2) / n

neg_Cp = partial(nCp, sigma2)

# Forward stepwise with Cp
strategy = Stepwise.first_peak(design, direction='forward', max_terms=len(design.terms))
hitters_Cp = sklearn_selected(OLS, strategy, scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
selected = hitters_Cp.selected_state_
print(f"Cp selected {len(selected)} variables:")
for v in selected:
    print(f"  {v}")

交叉驗證選模型

除了 Cp,交叉驗證(CV)是更通用的模型選擇方法。課本使用 sklearn_selection_path 擬合整條向前選取路徑,然後用 cross_val_predict 計算每個模型規模的 CV MSE。

# CV-based forward selection path
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
from sklearn.model_selection import cross_val_predict
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise
from ISLP.models import sklearn_selection_path

Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])

# Fit full forward path
strategy = Stepwise.fixed_steps(design, len(design.terms), direction='forward')
full_path = sklearn_selection_path(OLS, strategy)
full_path.fit(Hitters, Y)

# 5-fold CV predictions along the path
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
Yhat_cv = cross_val_predict(full_path, Hitters, Y, cv=kfold, method='predict')

# CV MSE for each model size
n_steps = Yhat_cv.shape[1]
cv_mse = np.mean((Yhat_cv - Y[:, None])**2, axis=0)
cv_se = np.std((Yhat_cv - Y[:, None])**2, axis=0, ddof=1) / np.sqrt(K)

best_step = int(np.argmin(cv_mse))
print(f"Best model size: {best_step} variables, CV MSE = {cv_mse[best_step]:.0f}")

最佳子集選擇(Best Subset Selection)

與貪婪的向前選取不同,最佳子集對每個模型大小 k 窮舉所有組合,保證找到 RSS 最小的 k 變數模型。課本使用 l0bnb 套件實作(需 !pip install l0bnb)。由於 l0bnb 求解的是懲罰路徑而非固定子集大小,實際使用時需搭配 CV 選 λ。

SCENE 應用場景:基因組關聯研究(GWAS)

當你有 50 萬個 SNP 標記卻只有幾千個樣本時,向前選取可能挑出 17 個「看似顯著」的 SNP——但課本特別警告:換一組獨立樣本重跑,選出的 17 個 SNP 可能完全不同!這正是 高維度陷阱:模型有預測力,但選出的變數不一定是「因果」。研究報告必須強調「這只是眾多可能模型之一」。

6.5.2 Ridge 回歸與 Lasso

Ridge 回歸(脊回歸)

Ridge 在最小平方法加上 L&sb2; 懲罰項 λ·Σβ²,將係數「收縮」但不歸零。課本使用 sklearn.linear_model.ElasticNet(設 l1_ratio=0 即為 Ridge),並強調先標準化再擬合——因為 λ 的懲罰效果會受變數量綱影響。

# Ridge Regression with CV on Hitters data
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS

# Prepare data
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)

# Ridge path with cross-validation
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()

scaler = StandardScaler(with_mean=True, with_std=True)
ridgeCV = skl.ElasticNetCV(alphas=lambdas, l1_ratio=0, cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler), ('ridge', ridgeCV)])
pipeCV.fit(X, Y)

tuned_ridge = pipeCV.named_steps['ridge']
best_lambda = tuned_ridge.alpha_
best_mse = np.min(tuned_ridge.mse_path_.mean(1))
n_nonzero = np.sum(np.abs(tuned_ridge.coef_) > 1e-6)
print(f"Best lambda = {best_lambda:.4f}")
print(f"CV MSE = {best_mse:.0f}")
print(f"Nonzero coefs = {n_nonzero} (Ridge does NOT do variable selection)")

Lasso:稀疏解與變數選擇

Lasso 用 L&sb1; 懲罰 λ·Σ|β|,能將部分係數精確壓縮到零——等於同時做模型選擇。課本對比 Hitters 資料:Ridge 保留全部 19 個變數,Lasso 只留 13 個,但 CV MSE 幾乎相同(Ridge 115,527 vs Lasso 114,691)。這就是 Lasso 的核心賣點:準度不減,模型更簡

# Lasso with CV on Hitters data
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS

Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)

K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)

scaler = StandardScaler(with_mean=True, with_std=True)
lassoCV = skl.ElasticNetCV(n_alphas=100, l1_ratio=1, cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler), ('lasso', lassoCV)])
pipeCV.fit(X, Y)

tuned_lasso = pipeCV.named_steps['lasso']
coef = tuned_lasso.coef_
n_nonzero = np.sum(np.abs(coef) > 1e-6)
n_zero = np.sum(np.abs(coef) <= 1e-6)

print(f"Best alpha = {tuned_lasso.alpha_:.4f}")
print(f"CV MSE = {np.min(tuned_lasso.mse_path_.mean(1)):.0f}")
print(f"Nonzero: {n_nonzero}, Zeroed: {n_zero}")
print(f"Lasso drops {n_zero} variables vs Ridge, same accuracy!")

SCENE 應用場景:客戶流失預測

電信公司有 200+ 個客戶行為特徵,但高層只想知道「哪 5 個因素最重要」。Ridge 無法給答案(所有係數都非零),Lasso 卻能直接挑出前 5 個——便在簡報時一句話講清楚。這就是 可解釋性 vs 預測力的取捨

嵌套交叉驗證(Nested CV)

用 CV 選 λ 後,我們還需要估計最終模型的泛化誤差。但不能用同一份資料既選 λ 又評估誤差(資料洩漏!)。解決方案:外層 train/test split + 內層 CV 選 λ

# Nested CV: outer split for evaluation, inner CV for tuning
import numpy as np
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS

Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)

lambdas = 10**np.linspace(8, -2, 100) / Y.std()
scaler = StandardScaler(with_mean=True, with_std=True)

outer_valid = skm.ShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
inner_cv = skm.KFold(n_splits=5, shuffle=True, random_state=2)

ridgeCV = skl.ElasticNetCV(alphas=lambdas, l1_ratio=0, cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler), ('ridge', ridgeCV)])
results = skm.cross_validate(pipeCV, X, Y, cv=outer_valid,
                             scoring='neg_mean_squared_error')
test_mse = -results['test_score'][0]
print(f"Nested CV test MSE = {test_mse:.0f}")
print("This is the unbiased estimate of generalization error!")

6.5.3 主成分回歸(PCR)與偏最小平方法(PLS)

主成分回歸(PCR)

PCR 先用 PCA 將 p 個原始變數壓縮成 M 個主成分,再用這些主成分做線性回歸。核心思想:降維後再做回歸,避免 p ≈ n 時的過擬合。課本用 sklearn.decomposition.PCA + Pipeline,並以 GridSearchCV 選最佳主成分數 M。

# PCR with CV for number of components
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS

Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)

K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
scaler = StandardScaler(with_mean=True, with_std=True)

# PCR with GridSearchCV for n_components
pca = PCA()
linreg = skl.LinearRegression()
pipe = Pipeline([('scaler', scaler), ('pca', pca), ('linreg', linreg)])

param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe, param_grid, cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)

best_M = grid.best_params_['pca__n_components']
best_mse = -grid.best_score_
print(f"Best n_components = {best_M}")
print(f"CV MSE = {best_mse:.0f}")

# Explained variance
pipe_best = grid.best_estimator_
evr = pipe_best.named_steps['pca'].explained_variance_ratio_
print(f"M=1 captures {evr[0]*100:.1f}% variance")
print(f"M=2 cumulative: {sum(evr[:2])*100:.1f}%")

偏最小平方法(PLS)

PLS 與 PCR 同樣做降維,但關鍵差異:PLS 在提取成分時同時考量 X 的變異與 Y 的相關性——不像 PCR 只看 X 的變異。這讓 PLS 在較少成分時通常表現更好,尤其當 Y 與 X 的前幾個主成分關係不大時。

# PLS Regression with CV
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
from sklearn.cross_decomposition import PLSRegression
from ISLP import load_data
from ISLP.models import ModelSpec as MS

Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)

K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)

pls = PLSRegression(scale=True)
param_grid = {'n_components': range(1, 20)}
grid = skm.GridSearchCV(pls, param_grid, cv=kfold,
                        scoring='neg_mean_squared_error')
grid.fit(X, Y)

best_n = grid.best_params_['n_components']
best_mse = -grid.best_score_
print(f"PLS best n_components = {best_n}")
print(f"CV MSE = {best_mse:.0f}")
print("PLS typically needs fewer components than PCR")

SCENE 應用場景:化工製程監控

煉油廠有 200 個感測器(溫度、壓力、流量),要預測最終辛烷值。PCR 會優先保留變異大的感測器——但它們可能跟辛烷值無關(例如環境溫度)。PLS 則會挑出與辛烷值相關的感測器組合,用更少成分達到相同準度,讓現場工程師容易解讀。

方法總比較

方法核心機制變數選擇適用情境課本 MSE
向前選取 + Cp逐步加入,Cp 平衡 RSS 與複雜度YES 是變數不太多(p < 40)
Ridge 回歸L&sb2; 懲罰收縮所有係數NO 否所有變數都可能有貢獻115,527
LassoL&sb1; 懲罰可將係數壓至零YES 是需要稀疏可解釋模型114,691
PCRPCA 降維 → 線性回歸NO 否共線性嚴重、p ≈ n~116,000
PLS監督式降維 → 線性回歸NO 否需 Y 相關性引導降維~116,000

Lab 方法優缺點對照

PRO 優點

CON 注意事項

QUOTE 今日關鍵句:「模型選擇的本質不是找到『唯一正確的模型』,而是在預測力與可解釋性之間找到最適合當下任務的平衡點。Lasso 用 13 個變數做到和 Ridge 19 個變數幾乎一樣的準度,提醒我們:更簡單的模型,往往更值得信任。
DESIGN 系統設計啟發:Lasso 的 L&sb1; 懲罰啟發了 Hermes Agent 的工具選擇策略——與其讓 agent 載入全部 30+ 個 skills,不如給每個任務類型一個「稀疏啟動配置」(只載入相關的 3–5 個 skill)。這降低了 context window 負擔(≈ Ridge 的收縮效果),同時確保關鍵工具不會被遺漏。就像 Lasso 把不必要的係數壓到零,我們也該把不必要的 skill 擋在 prompt 外。
PREV 6.4 高維度資料 HOME 回首頁