這堂實作課是第 6 章的壓軸——我們不再紙上談兵,而是親手用 Python 實作從 子集選擇、脊回歸(Ridge)、Lasso、到 PCR 與 PLS 的完整流程。每個程式碼區塊都可獨立執行,建議你在 Colab 或本機 Jupyter 中跟著跑一遍。
目標:從 19 個候選變數中挑出最能預測球員薪資(Salary)的子集。課本採用 Cp 統計量(Mallows' Cp)作為模型選擇準則——Cp 同時考量擬合誤差(RSS)與模型複雜度(懲罰多餘變數),選出「夠準又不冗餘」的模型。
THEORY Mallows' Cp = (RSS + 2p·σ²) / n,其中 p 為變數個數,σ² 用全模型 MSE 估計(課本 6.1.3)球團想知道「哪些打擊數據真正值錢」。向前選取法從零開始,每次加入對薪資預測貢獻最大的變數,最後留下 10 個——包括安打數(Hits)、保送(Walks)、守備(PutOuts)——恰好反映市場對打擊與守備的重視。
# Forward Selection with Cp on Hitters data
try:
from google.colab import drive
drive.mount('/content/drive')
DATA_PATH = '/content/drive/MyDrive/ISLP_data/'
except ImportError:
DATA_PATH = '/tmp/'
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise, sklearn_selected
from functools import partial
# Load and clean
Hitters = load_data('Hitters')
Hitters = Hitters.dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters)
# Estimate sigma^2 from full model
sigma2 = OLS(Y, X).fit().scale
# Negative Cp scorer
def nCp(sigma2, estimator, X, Y):
n, p = X.shape
Yhat = estimator.predict(X)
RSS = np.sum((Y - Yhat)**2)
return -(RSS + 2 * p * sigma2) / n
neg_Cp = partial(nCp, sigma2)
# Forward stepwise with Cp
strategy = Stepwise.first_peak(design, direction='forward', max_terms=len(design.terms))
hitters_Cp = sklearn_selected(OLS, strategy, scoring=neg_Cp)
hitters_Cp.fit(Hitters, Y)
selected = hitters_Cp.selected_state_
print(f"Cp selected {len(selected)} variables:")
for v in selected:
print(f" {v}")
除了 Cp,交叉驗證(CV)是更通用的模型選擇方法。課本使用 sklearn_selection_path 擬合整條向前選取路徑,然後用 cross_val_predict 計算每個模型規模的 CV MSE。
# CV-based forward selection path
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
from sklearn.model_selection import cross_val_predict
from statsmodels.api import OLS
from ISLP import load_data
from ISLP.models import ModelSpec as MS, Stepwise
from ISLP.models import sklearn_selection_path
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
# Fit full forward path
strategy = Stepwise.fixed_steps(design, len(design.terms), direction='forward')
full_path = sklearn_selection_path(OLS, strategy)
full_path.fit(Hitters, Y)
# 5-fold CV predictions along the path
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
Yhat_cv = cross_val_predict(full_path, Hitters, Y, cv=kfold, method='predict')
# CV MSE for each model size
n_steps = Yhat_cv.shape[1]
cv_mse = np.mean((Yhat_cv - Y[:, None])**2, axis=0)
cv_se = np.std((Yhat_cv - Y[:, None])**2, axis=0, ddof=1) / np.sqrt(K)
best_step = int(np.argmin(cv_mse))
print(f"Best model size: {best_step} variables, CV MSE = {cv_mse[best_step]:.0f}")
與貪婪的向前選取不同,最佳子集對每個模型大小 k 窮舉所有組合,保證找到 RSS 最小的 k 變數模型。課本使用 l0bnb 套件實作(需 !pip install l0bnb)。由於 l0bnb 求解的是懲罰路徑而非固定子集大小,實際使用時需搭配 CV 選 λ。
當你有 50 萬個 SNP 標記卻只有幾千個樣本時,向前選取可能挑出 17 個「看似顯著」的 SNP——但課本特別警告:換一組獨立樣本重跑,選出的 17 個 SNP 可能完全不同!這正是 高維度陷阱:模型有預測力,但選出的變數不一定是「因果」。研究報告必須強調「這只是眾多可能模型之一」。
Ridge 在最小平方法加上 L&sb2; 懲罰項 λ·Σβ²,將係數「收縮」但不歸零。課本使用 sklearn.linear_model.ElasticNet(設 l1_ratio=0 即為 Ridge),並強調先標準化再擬合——因為 λ 的懲罰效果會受變數量綱影響。
# Ridge Regression with CV on Hitters data
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS
# Prepare data
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)
# Ridge path with cross-validation
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
scaler = StandardScaler(with_mean=True, with_std=True)
ridgeCV = skl.ElasticNetCV(alphas=lambdas, l1_ratio=0, cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler), ('ridge', ridgeCV)])
pipeCV.fit(X, Y)
tuned_ridge = pipeCV.named_steps['ridge']
best_lambda = tuned_ridge.alpha_
best_mse = np.min(tuned_ridge.mse_path_.mean(1))
n_nonzero = np.sum(np.abs(tuned_ridge.coef_) > 1e-6)
print(f"Best lambda = {best_lambda:.4f}")
print(f"CV MSE = {best_mse:.0f}")
print(f"Nonzero coefs = {n_nonzero} (Ridge does NOT do variable selection)")
Lasso 用 L&sb1; 懲罰 λ·Σ|β|,能將部分係數精確壓縮到零——等於同時做模型選擇。課本對比 Hitters 資料:Ridge 保留全部 19 個變數,Lasso 只留 13 個,但 CV MSE 幾乎相同(Ridge 115,527 vs Lasso 114,691)。這就是 Lasso 的核心賣點:準度不減,模型更簡。
# Lasso with CV on Hitters data
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
scaler = StandardScaler(with_mean=True, with_std=True)
lassoCV = skl.ElasticNetCV(n_alphas=100, l1_ratio=1, cv=kfold)
pipeCV = Pipeline(steps=[('scaler', scaler), ('lasso', lassoCV)])
pipeCV.fit(X, Y)
tuned_lasso = pipeCV.named_steps['lasso']
coef = tuned_lasso.coef_
n_nonzero = np.sum(np.abs(coef) > 1e-6)
n_zero = np.sum(np.abs(coef) <= 1e-6)
print(f"Best alpha = {tuned_lasso.alpha_:.4f}")
print(f"CV MSE = {np.min(tuned_lasso.mse_path_.mean(1)):.0f}")
print(f"Nonzero: {n_nonzero}, Zeroed: {n_zero}")
print(f"Lasso drops {n_zero} variables vs Ridge, same accuracy!")
電信公司有 200+ 個客戶行為特徵,但高層只想知道「哪 5 個因素最重要」。Ridge 無法給答案(所有係數都非零),Lasso 卻能直接挑出前 5 個——便在簡報時一句話講清楚。這就是 可解釋性 vs 預測力的取捨。
用 CV 選 λ 後,我們還需要估計最終模型的泛化誤差。但不能用同一份資料既選 λ 又評估誤差(資料洩漏!)。解決方案:外層 train/test split + 內層 CV 選 λ。
# Nested CV: outer split for evaluation, inner CV for tuning
import numpy as np
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)
lambdas = 10**np.linspace(8, -2, 100) / Y.std()
scaler = StandardScaler(with_mean=True, with_std=True)
outer_valid = skm.ShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
inner_cv = skm.KFold(n_splits=5, shuffle=True, random_state=2)
ridgeCV = skl.ElasticNetCV(alphas=lambdas, l1_ratio=0, cv=inner_cv)
pipeCV = Pipeline(steps=[('scaler', scaler), ('ridge', ridgeCV)])
results = skm.cross_validate(pipeCV, X, Y, cv=outer_valid,
scoring='neg_mean_squared_error')
test_mse = -results['test_score'][0]
print(f"Nested CV test MSE = {test_mse:.0f}")
print("This is the unbiased estimate of generalization error!")
PCR 先用 PCA 將 p 個原始變數壓縮成 M 個主成分,再用這些主成分做線性回歸。核心思想:降維後再做回歸,避免 p ≈ n 時的過擬合。課本用 sklearn.decomposition.PCA + Pipeline,並以 GridSearchCV 選最佳主成分數 M。
# PCR with CV for number of components
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
import sklearn.linear_model as skl
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
scaler = StandardScaler(with_mean=True, with_std=True)
# PCR with GridSearchCV for n_components
pca = PCA()
linreg = skl.LinearRegression()
pipe = Pipeline([('scaler', scaler), ('pca', pca), ('linreg', linreg)])
param_grid = {'pca__n_components': range(1, 20)}
grid = skm.GridSearchCV(pipe, param_grid, cv=kfold,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
best_M = grid.best_params_['pca__n_components']
best_mse = -grid.best_score_
print(f"Best n_components = {best_M}")
print(f"CV MSE = {best_mse:.0f}")
# Explained variance
pipe_best = grid.best_estimator_
evr = pipe_best.named_steps['pca'].explained_variance_ratio_
print(f"M=1 captures {evr[0]*100:.1f}% variance")
print(f"M=2 cumulative: {sum(evr[:2])*100:.1f}%")
PLS 與 PCR 同樣做降維,但關鍵差異:PLS 在提取成分時同時考量 X 的變異與 Y 的相關性——不像 PCR 只看 X 的變異。這讓 PLS 在較少成分時通常表現更好,尤其當 Y 與 X 的前幾個主成分關係不大時。
# PLS Regression with CV
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import sklearn.model_selection as skm
from sklearn.cross_decomposition import PLSRegression
from ISLP import load_data
from ISLP.models import ModelSpec as MS
Hitters = load_data('Hitters').dropna()
design = MS(Hitters.columns.drop('Salary')).fit(Hitters)
Y = np.array(Hitters['Salary'])
X = design.transform(Hitters).drop('intercept', axis=1)
X = np.asarray(X)
K = 5
kfold = skm.KFold(K, shuffle=True, random_state=0)
pls = PLSRegression(scale=True)
param_grid = {'n_components': range(1, 20)}
grid = skm.GridSearchCV(pls, param_grid, cv=kfold,
scoring='neg_mean_squared_error')
grid.fit(X, Y)
best_n = grid.best_params_['n_components']
best_mse = -grid.best_score_
print(f"PLS best n_components = {best_n}")
print(f"CV MSE = {best_mse:.0f}")
print("PLS typically needs fewer components than PCR")
煉油廠有 200 個感測器(溫度、壓力、流量),要預測最終辛烷值。PCR 會優先保留變異大的感測器——但它們可能跟辛烷值無關(例如環境溫度)。PLS 則會挑出與辛烷值相關的感測器組合,用更少成分達到相同準度,讓現場工程師容易解讀。
| 方法 | 核心機制 | 變數選擇 | 適用情境 | 課本 MSE |
|---|---|---|---|---|
| 向前選取 + Cp | 逐步加入,Cp 平衡 RSS 與複雜度 | YES 是 | 變數不太多(p < 40) | — |
| Ridge 回歸 | L&sb2; 懲罰收縮所有係數 | NO 否 | 所有變數都可能有貢獻 | 115,527 |
| Lasso | L&sb1; 懲罰可將係數壓至零 | YES 是 | 需要稀疏可解釋模型 | 114,691 |
| PCR | PCA 降維 → 線性回歸 | NO 否 | 共線性嚴重、p ≈ n | ~116,000 |
| PLS | 監督式降維 → 線性回歸 | NO 否 | 需 Y 相關性引導降維 | ~116,000 |
sklearn Pipeline,乾淨可複用l0bnb 最佳子集需額外安裝,非標準 sklearnsklearn_selected 封裝了 statsmodels,學習曲線較陡