2.3 Lab: Introduction to Python (Scientific Computing Basics)

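This lab follows ISLP §2.3, Lab: Introduction to Python, and covers the three core packages for statistical learning: NumPy (numerical computing), pandas (data handling), and matplotlib (visualization). The pandas and matplotlib examples use the textbook's Advertising.csv dataset.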

💡 No installation is needed for this chapter: all of the code runs in the browser with Google Colab.

📐 NumPy: the foundation of numerical computing

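NumPy provides an efficient multidimensional array (ndarray) and vectorized operations. In statistical learning, data are typically converted to NumPy arrays before being passed to a model.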

Creating arrays

import numpy as np

# Build from a list
a = np.array([1, 2, 3, 4, 5])       # one-dimensional
b = np.array([[1, 2], [3, 4]])       # two-dimensional (2×2)

# Common constructors
np.zeros((3, 4))      # 3×4 matrix of zeros
np.ones((2, 3))       # 2×3 matrix of ones
np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5)  # [0.  , 0.25, 0.5 , 0.75, 1.  ]
np.random.randn(3, 2) # 3×2 standard normal random draws
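
The random draws above change on every run. A minimal sketch of one way to make them reproducible, assuming NumPy's Generator interface (np.random.default_rng) is acceptable in place of np.random.randn; the seed value 0 is arbitrary:

```python
import numpy as np

# A Generator with a fixed seed gives the same draws on every run.
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal((3, 2))   # 3×2 standard normal draws

# Re-creating the generator with the same seed reproduces the array.
rng_again = np.random.default_rng(seed=0)
noise_again = rng_again.standard_normal((3, 2))

print(noise.shape)                          # (3, 2)
print(np.array_equal(noise, noise_again))   # True
```

Fixing the seed is mainly useful when results need to be compared across runs, for example when checking lab output against a classmate's.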

Basic attributes and indexing

a = np.array([[1, 2, 3], [4, 5, 6]])
a.shape    # (2, 3)
a.ndim     # 2 (number of dimensions)
a.dtype    # dtype('int64') on most platforms

a[0, 1]    # 2 (row 0, column 1)
a[:, :2]   # all rows, first 2 columns
a[a > 3]   # boolean indexing → [4, 5, 6]
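
One detail the indexing examples do not show is that a basic slice is a view onto the original array, while boolean indexing returns a copy. The sketch below illustrates the difference on the same 2×3 array; the variable names are only illustrative:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

first_two_cols = a[:, :2]   # basic slice: a view onto a's data
large_values = a[a > 3]     # boolean indexing: an independent copy

first_two_cols[0, 0] = 100  # writes through to a
large_values[0] = -1        # does not touch a

print(a)
# [[100   2   3]
#  [  4   5   6]]
print(large_values)          # [-1  5  6]
```

If an independent slice is needed, calling .copy() on it avoids modifying the original by accident.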

Vectorized operations

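Vectorization means operating on a whole array at once instead of element by element. For numeric work this is often 10–100 times faster than a Python for loop, depending on the array size and the operation, and it is NumPy's core value.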

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

x + y        # [5, 7, 9]
x * y        # [4, 10, 18] (element-wise)
np.dot(x, y) # 32 (dot product)
np.sqrt(x)   # [1., 1.414, 1.732]
np.sum(x)    # 6
np.mean(x)   # 2.0
np.std(x)    # 0.816 (population standard deviation, ddof=0)
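
The speed difference between a loop and a vectorized call can be measured directly. The following is a rough sketch using the standard-library timeit module; the array size and the two helper functions are arbitrary choices for illustration, and the measured ratio will vary by machine:

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def sum_of_squares_loop(arr):
    """Accumulate element by element in a plain Python loop."""
    total = 0.0
    for v in arr:
        total += v * v
    return total

def sum_of_squares_vectorized(arr):
    """Same quantity computed as a single dot product."""
    return float(np.dot(arr, arr))

# Both approaches agree on the result up to floating-point rounding.
loop_result = sum_of_squares_loop(values)
vec_result = sum_of_squares_vectorized(values)
print(np.isclose(loop_result, vec_result))

# Time five repetitions of each version.
loop_time = timeit.timeit(lambda: sum_of_squares_loop(values), number=5)
vec_time = timeit.timeit(lambda: sum_of_squares_vectorized(values), number=5)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```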

🐼 pandas: the data-handling hub

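pandas provides the DataFrame (a two-dimensional table) and the Series (a one-dimensional sequence). It is the workhorse of statistical analysis: throughout ISLP, pandas is used to load, clean, and explore each dataset.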

Creating a DataFrame

import pandas as pd

# Build from a dict
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 28],
    'score': [88, 72, 95]
})

# Read from a CSV file: the standard opening move in ISLP
df = pd.read_csv('Advertising.csv', index_col=0)
df.head()
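
To see what index_col=0 does, the sketch below reads a small in-memory CSV twice. It assumes the file's first column holds row numbers under an empty header, which is what the index_col=0 argument above suggests; the three rows are made-up values, not the real Advertising data:

```python
import io
import pandas as pd

# Toy rows in the same layout as Advertising.csv; the values are made up.
csv_text = """,TV,radio,newspaper,sales
1,100.0,20.0,30.0,15.0
2,50.0,10.0,40.0,9.0
3,200.0,35.0,5.0,21.0
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.shape)              # (3, 4)
print(without_index.shape)           # (3, 5)
print(list(without_index.columns))   # ['Unnamed: 0', 'TV', 'radio', 'newspaper', 'sales']
```

Without index_col=0 the row numbers become an extra data column, which would then show up in summaries such as df.describe() and df.corr().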

Core operations cheat sheet

Operation	Code	Description
First rows	`df.head()`	5 rows by default
Summary statistics	`df.describe()`	count, mean, std, min, quartiles, max
Column info	`df.info()`	dtypes, non-null counts
Shape	`df.shape`	(200, 4): 200 rows, 4 columns
Select a column	`df['sales']`	Returns a Series
Select several columns	`df[['TV', 'radio']]`	Returns a DataFrame
Filter by condition	`df[df['sales'] > 20]`	Boolean indexing
Add a column	`df['total'] = df['TV'] + df['radio']`	Vectorized assignment
Sort	`df.sort_values('sales', ascending=False)`	Descending order
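
Several of the operations in the table are usually combined. A small sketch on a toy DataFrame, where the column names mirror Advertising.csv but the values are made up:

```python
import pandas as pd

# Toy data with the same column names as Advertising.csv; values are made up.
toy = pd.DataFrame({
    'TV': [100.0, 50.0, 200.0, 150.0],
    'radio': [20.0, 10.0, 35.0, 5.0],
    'sales': [15.0, 9.0, 21.0, 12.0],
})

toy['total'] = toy['TV'] + toy['radio']          # vectorized new column
high_sales = toy[toy['sales'] > 10]              # boolean filter
ranked = high_sales.sort_values('sales', ascending=False)

print(ranked[['total', 'sales']])
```

Filtering and sorting both return new DataFrames, so toy itself keeps all four rows in their original order.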

Handling missing values

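df.isnull().sum()           # number of missing values per column
df.dropna()                 # returns a copy without rows that have missing values
df.fillna(0)                # returns a copy with missing values replaced by 0
df['col'] = df['col'].fillna(df['col'].median())  # fill one column with its median ('col' is a placeholder)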
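
Because Advertising.csv has no gaps, these calls are easier to follow on a toy column. The sketch below uses made-up values and assigns the filled column back explicitly rather than relying on inplace=True:

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries; values are made up.
toy = pd.DataFrame({'col': [1.0, np.nan, 3.0, np.nan, 10.0]})

print(toy.isnull().sum()['col'])      # 2

dropped = toy.dropna()                # new DataFrame; toy is unchanged
print(len(dropped), len(toy))         # 3 5

# median() skips NaN by default, so it is computed from 1, 3, 10.
toy['col'] = toy['col'].fillna(toy['col'].median())
print(toy['col'].tolist())            # [1.0, 3.0, 3.0, 3.0, 10.0]
```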

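⚠️ The textbook's Advertising.csv has no missing values, but real-world data usually do: scraped social-media data, survey responses, and government open data are often full of gaps. Make a habit of running df.isnull().sum() first.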

📊 matplotlib: data visualization

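matplotlib is Python's foundational plotting package, and several higher-level libraries, such as seaborn and plotnine (a ggplot2 port), are built on top of it. plotly is a separate library with its own rendering. ISLP uses matplotlib throughout.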

Basic plotting pattern

import matplotlib.pyplot as plt

# Scatter of points: the 'o' format draws markers only, with no connecting line
plt.plot(df['TV'], df['sales'], 'o')
plt.xlabel('TV budget')
plt.ylabel('Sales')
plt.title('TV vs Sales — Advertising Data')
plt.show()

Side-by-side subplots: a layout ISLP uses often

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(df['TV'], df['sales'])
axes[0].set_title('TV')

axes[1].scatter(df['radio'], df['sales'])
axes[1].set_title('Radio')

axes[2].scatter(df['newspaper'], df['sales'])
axes[2].set_title('Newspaper')

plt.tight_layout()
plt.show()
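
The three near-identical blocks above can also be written as a loop over the axes. This is a sketch on synthetic data (generated here so the block runs without the CSV; it does not reproduce the real values), with sharey=True added as an optional choice so the panels share one vertical scale:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the advertising data; not the real values.
rng = np.random.default_rng(seed=1)
toy = pd.DataFrame({
    'TV': rng.uniform(0, 300, size=50),
    'radio': rng.uniform(0, 50, size=50),
    'newspaper': rng.uniform(0, 100, size=50),
})
toy['sales'] = 5 + 0.05 * toy['TV'] + 0.1 * toy['radio'] + rng.normal(0, 1, size=50)

predictors = ['TV', 'radio', 'newspaper']
fig, axes = plt.subplots(1, len(predictors), figsize=(15, 4), sharey=True)

# One scatter panel per predictor, all against the same response.
for ax, name in zip(axes, predictors):
    ax.scatter(toy[name], toy['sales'])
    ax.set_title(name)
    ax.set_xlabel(name)
axes[0].set_ylabel('sales')

fig.tight_layout()
plt.show()
```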

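Side by side, the three panels show at a glance that TV and radio are positively associated with sales, while newspaper shows at most a weak relationship. This is the central insight of §2.1.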

Common chart types

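Chart type	Function	When to use
Scatter plot	`plt.scatter(x, y)`	Relationship between two variables
Line plot	`plt.plot(x, y)`	Trends, time series
Histogram	`plt.hist(data, bins=20)`	Shape of a distribution
Box plot	`plt.boxplot(data)`	Median, outliers
Heatmap	`plt.imshow(corr)`	Correlation matrix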

Essential for statistical learning: the correlation-matrix heatmap

import seaborn as sns  # seaborn is built on top of matplotlib

corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix — Advertising')
plt.show()
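
The same kind of heatmap can be drawn with matplotlib alone, using the plt.imshow approach listed in the chart table. A minimal sketch on synthetic data, where the columns x, y, and z are invented for illustration and y is constructed to depend on x:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic columns: y depends on x, z is independent noise.
rng = np.random.default_rng(seed=2)
x = rng.normal(size=100)
toy = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(scale=0.5, size=100),
    'z': rng.normal(size=100),
})

corr = toy.corr()   # Pearson correlation by default

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so colors are comparable across plots.
image = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
plt.show()
```

Unlike sns.heatmap with annot=True, this version does not print the coefficients in the cells; the labels and colorbar have to be added by hand.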

Diagnosing a regression model: the residual plot

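The residual plot is an important diagnostic tool for the linear regression of Chapter 3: check whether the residuals show any pattern. If they scatter randomly around zero, the model is a reasonable fit.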

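import numpy as np
import matplotlib.pyplot as plt

# For illustration only: fit a least-squares line of sales on TV to get predictions.
# The predictions of any fitted model can take the place of y_pred.
slope, intercept = np.polyfit(df['TV'], df['sales'], deg=1)
y_true = df['sales']
y_pred = intercept + slope * df['TV']

residuals = y_true - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()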
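
What "a pattern in the residuals" means can be checked numerically on synthetic data. The sketch below fits a straight line to one truly linear and one truly quadratic data set (both generated here for illustration), then uses the correlation between the residuals and x squared as a crude stand-in for spotting a U-shape by eye. This is an illustrative check, not a standard diagnostic from the textbook:

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = np.linspace(-3, 3, 200)

def line_fit_residuals(x, y):
    """Residuals from a least-squares straight-line fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)

y_linear = 2 * x + rng.normal(scale=1.0, size=x.size)    # truly linear
y_curved = x ** 2 + rng.normal(scale=1.0, size=x.size)   # truly quadratic

res_linear = line_fit_residuals(x, y_linear)
res_curved = line_fit_residuals(x, y_curved)

# Correlation of the residuals with x**2 is a crude numeric stand-in for
# "is there a U-shape in the residual plot?"
pattern_linear = np.corrcoef(res_linear, x ** 2)[0, 1]
pattern_curved = np.corrcoef(res_curved, x ** 2)[0, 1]
print(f"linear data: {pattern_linear:+.2f}   curved data: {pattern_curved:+.2f}")
```

A value near zero for the linear data and a value near one for the curved data would correspond to a patternless residual plot and a clearly U-shaped one, respectively.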

🔗 How this connects to later chapters

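Chapter	Python techniques
3.1–3.2 Linear Regression	`sklearn.linear_model.LinearRegression` + matplotlib residual plots
3.3 Other Considerations	pandas `pd.get_dummies()` for categorical variables
3.4 The Marketing Plan	NumPy matrix operations, model selection
4.x Classification	`sklearn.linear_model.LogisticRegression` + confusion matrix
5.x Resampling	`sklearn.model_selection.cross_val_score`
6.x Linear Model Selection	`sklearn.preprocessing.StandardScaler`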
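
As a preview of the pd.get_dummies() entry in the table, here is a minimal sketch of encoding a categorical predictor. The region column and its levels are hypothetical and do not come from Advertising.csv:

```python
import pandas as pd

# Hypothetical categorical predictor; the column and its levels are made up.
toy = pd.DataFrame({
    'region': ['East', 'West', 'South', 'East'],
    'sales': [15.0, 9.0, 21.0, 12.0],
})

# drop_first=True keeps k-1 indicator columns for k levels, so the
# dropped level ('East', first alphabetically) acts as the baseline.
encoded = pd.get_dummies(toy, columns=['region'], drop_first=True)
print(list(encoded.columns))   # ['sales', 'region_South', 'region_West']
```

Dropping one level avoids perfectly collinear indicator columns when the model also includes an intercept.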

ISLP §2.3, adapted
