Deep Learning Notes I - Introduction

Introduction and Neural Network Construction

All modern machine learning algorithms are just nearest neighbors. It's only that neural networks tell you the space in which to compute the distance.

Introduction

Risk: expected loss

\[R(\theta) = \mathbb{E}\left[\mathcal{L}(\theta, X, Y)\right]\]

Empirical Risk: average loss

\[\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^N\mathcal{L}(\theta, x_i, y_i)\]

If the empirical risk is too high, the model is underfitting; if it is much lower than the true risk, the model is overfitting.
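As a minimal sketch (the data, the linear model, and the squared loss here are all made up for illustration), the empirical risk is just the average per-example loss over the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: N = 100 samples, 3 features, linear ground truth.
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

def empirical_risk(theta, X, y):
    """(1/N) * sum_i L(theta, x_i, y_i) with a squared loss."""
    residuals = X @ theta - y
    return 0.5 * np.mean(residuals ** 2)

print(empirical_risk(np.zeros(3), X, y))   # large: the zero model underfits
print(empirical_risk(true_theta, X, y))    # roughly 0.5 * 0.1**2, the noise floor
```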

From MLE to MAP

This is essentially viewing regularization from a Bayesian perspective.

We now want to maximize

\[p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\cdot p(\theta)}{p(\mathcal{D})}\]

This is called MAP, i.e. Maximum A Posteriori estimation.

Taking the logarithm, we obtain

\[\begin{aligned} \arg\max_{\theta}\log p(\theta \mid \mathcal{D}) &= \arg\max_{\theta}\left\{\log p(\mathcal{D} \mid \theta) + \log p(\theta) - \log p(\mathcal{D})\right\} \\ &= \arg\max_{\theta}\left\{\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right\} \end{aligned}\]

Here \(-\log p(\mathcal{D} \mid \theta)\) is just the loss, and \(-\log p(\theta)\) plays the role of the regularization term.

For example, if we assume \(\theta\sim\mathcal{N}(0, \sigma^2I)\), then

\[\log p(\theta) = \sum_{i=1}^n\left[\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2}\left(\frac{\theta_i}{\sigma}\right)^2\right] = -\lambda\Vert\theta\Vert^2 + \text{const}\]

where \(\lambda = \frac{1}{2\sigma^2}\) and the constant does not depend on \(\theta\), so it can be dropped inside the \(\arg\max\).
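A small numerical check of this correspondence (the Gaussian noise scale sigma_noise, the prior scale sigma_prior, and the data are all assumptions made up for this sketch): the negative log posterior and the L2-regularized squared loss agree, once the penalized loss is scaled to match the posterior terms.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
theta_true = np.array([0.3, -1.2])
y = X @ theta_true + rng.normal(scale=0.5, size=50)

sigma_noise, sigma_prior = 0.5, 1.0         # assumed noise and prior scales
lam = sigma_noise ** 2 / sigma_prior ** 2   # the induced regularization strength

def neg_log_posterior(theta):
    # -log p(D | theta) - log p(theta), dropping theta-independent constants
    nll = 0.5 / sigma_noise ** 2 * np.sum((y - X @ theta) ** 2)
    neg_log_prior = 0.5 / sigma_prior ** 2 * np.sum(theta ** 2)
    return nll + neg_log_prior

def penalized_loss(theta):
    # squared-error loss + L2 penalty, scaled to match the posterior terms
    return (0.5 * np.sum((y - X @ theta) ** 2) + 0.5 * lam * np.sum(theta ** 2)) / sigma_noise ** 2

theta = rng.normal(size=2)
print(np.isclose(neg_log_posterior(theta), penalized_loss(theta)))   # True
```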


Single-layer Network: Regression

Linear Regression

\[\hat{y}(x, w) = \sum_{j=0}^{M-1}w_j\phi_j(x) = w^{\top}\phi(x)\]

where the \(\phi_j\) are called basis functions; typically \(\phi_0(x)=1\), so that \(w_0\) acts as a bias.

We then model the observed targets \(y\) as

\[y = \Phi w + \epsilon, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)\]

Given the regularized loss function

\[\mathcal{L}(w) = \frac{1}{2}\left(y - \Phi w\right)^{\top}\left(y - \Phi w\right) + \frac{\lambda}{2}w^{\top}w\]

minimizing it gives

\[\hat{w} = (\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top}y\]
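A sketch of this closed-form solution, assuming a hypothetical polynomial basis \(\phi_j(x) = x^j\) (so \(\phi_0(x) = 1\)) and made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, lam = 200, 4, 0.1

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j.
x = rng.uniform(-1, 1, size=N)
Phi = np.vander(x, M, increasing=True)          # shape (N, M), first column all ones
w_true = np.array([1.0, -0.5, 2.0, 0.3])
y = Phi @ w_true + rng.normal(scale=0.1, size=N)

# w_hat = (Phi^T Phi + lam I)^{-1} Phi^T y, via a linear solve rather than an explicit inverse.
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
print(w_hat)                                     # close to w_true for small lam
```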

Here is a quick review of material from last semester. Without regularization, we have

\[\hat{w} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}y\]

Now define

\[H = \Phi(\Phi^{\top}\Phi)^{-1}\Phi^{\top}\]

Then \(H\) and \(I - H\) are projection matrices, and \(Hy \perp (I - H)y\):

\[\begin{cases} H\Phi = \Phi \\ H^2 = H \\ (I - H)^2 = I - H \\ H^{\top} = H \\ (I - H)^{\top} = I - H \end{cases}\]
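A quick numerical check of these identities with a random \(\Phi\) and \(y\) (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 4
Phi = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T, computed with a linear solve.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

print(np.allclose(H @ Phi, Phi))                          # H Phi = Phi
print(np.allclose(H @ H, H))                              # H^2 = H
print(np.allclose(H, H.T))                                # H^T = H
print(np.isclose((H @ y) @ ((np.eye(N) - H) @ y), 0.0))   # Hy ⟂ (I - H)y
```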

For the error analysis, the picture is: the true mean \(\Phi w\) lies in the column space of \(\Phi\), and Gaussian noise \(\epsilon\) is added on top. The column space of \(\Phi\) is \(M\)-dimensional, so its orthogonal complement, onto which \(I - H\) projects, is \((N - M)\)-dimensional. The residual \(y - \hat{y} = (I - H)\epsilon\) is the projection of \(\epsilon\) onto this complement, so its squared length is a sum of \(N - M\) independent squared \(\mathcal{N}(0, \sigma^2)\) terms, i.e.

\[\frac{1}{\sigma^2}(y - \hat{y})^{\top}(y - \hat{y}) \sim \chi^2_{N-M}\]

Therefore

\[\mathbb{E}\left[(y - \hat{y})^{\top}(y - \hat{y})\right] = \sigma^2(N - M)\]

In other words,

\[s^2 = \frac{1}{N-M}\sum_{i=1}^N(y_i - \hat{y}_i)^2\Longrightarrow\mathbb{E}[s^2] = \sigma^2\]
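A Monte Carlo check that \(s^2\) is unbiased for \(\sigma^2\) (all values here are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, sigma = 30, 5, 2.0
Phi = rng.normal(size=(N, M))
w = rng.normal(size=M)

# Average s^2 over many independent noise draws.
s2_values = []
for _ in range(5000):
    y = Phi @ w + rng.normal(scale=sigma, size=N)
    w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    resid = y - Phi @ w_hat
    s2_values.append(resid @ resid / (N - M))

print(np.mean(s2_values))   # approximately sigma**2 = 4.0
```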

Now the ANOVA table (where \(A = \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\)):

| Source | DF | Sum of Squares | Mean Square | \(\mathcal{F}\)-value |
| --- | --- | --- | --- | --- |
| Regression | \(M - 1\) | \(\mathrm{SSR}=\sum_{i=1}^N(\hat{y}_i - \overline{y})^2 = y^{\top}(H - A)y\) | \(\mathrm{MSR}=\frac{\mathrm{SSR}}{M-1}\) | \(F=\frac{\mathrm{MSR}}{s^2}\) |
| Residual | \(N - M\) | \(\mathrm{SSE}=\sum_{i=1}^N(y_i - \hat{y}_i)^2 = y^{\top}(I - H)y\) | \(s^2=\frac{\mathrm{SSE}}{N-M}\) | |
| Total | \(N - 1\) | \(\mathcal{S}_{yy}=\sum_{i=1}^N(y_i - \overline{y})^2 = y^{\top}(I - A)y\) | | |
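A small numeric check of the table (made-up data; an intercept column is included so that the usual decomposition \(\mathrm{SSR} + \mathrm{SSE} = \mathcal{S}_{yy}\) applies):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, sigma = 40, 3, 1.0

# Design matrix with an intercept column (phi_0 = 1).
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])
y = Phi @ rng.normal(size=M) + rng.normal(scale=sigma, size=N)

H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)    # hat matrix
A = np.ones((N, N)) / N                          # averaging matrix (1/N) 1 1^T

SSR = y @ (H - A) @ y
SSE = y @ (np.eye(N) - H) @ y
Syy = y @ (np.eye(N) - A) @ y

print(np.isclose(SSR + SSE, Syy))                # the sums of squares add up
F = (SSR / (M - 1)) / (SSE / (N - M))
print(F)                                         # to be compared with F_{M-1, N-M}
```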

Since \(\mathrm{SSR}\) and \(\mathrm{SSE}\), each scaled by \(\sigma^2\), are independent \(\chi^2\)-distributed quantities, the ratio of their mean squares follows an \(\mathcal{F}\) distribution.

Therefore

\[F = \frac{\mathrm{MSR}}{s^2}\sim \mathcal{F}_{M - 1, N - M}\]

Separately, for regression with a squared loss we can ask which function \(f\) minimizes the expected loss

\[\mathbb{E}[L] = \mathbb{E}\left[\frac{1}{2}\left(f(x) - y\right)^2\right]\]

Recall the functional derivative: for a functional \(F\) of a function \(y(x)\),

\[F(y(x) + \epsilon \eta(x)) = F(y(x)) + \epsilon\int \frac{\delta F}{\delta y(x)}\eta(x)\,\mathrm{d}x + \mathcal{O}(\epsilon^2)\]

Applying this to the expected loss,

\[\begin{aligned} \frac{\delta\mathbb{E}\left[L\right]}{\delta f(x)}&=\frac{\delta}{\delta f(x)}\mathbb{E}\left[\frac{1}{2}\left(f(x) - y\right)^2\right]\\ &=\frac{\delta}{\delta f(x)}\iint \frac{1}{2}(f(x) - y)^2 p(x, y)\,\mathrm{d}x\,\mathrm{d}y \end{aligned}\]
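As a sketch of the standard conclusion (not derived above): setting this functional derivative to zero and dividing by \(p(x)\) shows that the optimal regression function is the conditional mean,

\[\frac{\delta\mathbb{E}[L]}{\delta f(x)} = \int \left(f(x) - y\right) p(x, y)\,\mathrm{d}y = 0 \quad\Longrightarrow\quad f(x) = \frac{\int y\, p(x, y)\,\mathrm{d}y}{p(x)} = \mathbb{E}\left[y \mid x\right]\]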