Thoughts on out-of-distribution generalization and detection.

The generalization error of a model $\hat{f}$ is defined as $\mathcal{R}(\hat{f}) - \inf_{f \in \mathcal{F}} \mathcal{R}(f)$, where $\mathcal{R}$ denotes the expected (population) loss and $\hat{\mathcal{R}}$ its empirical counterpart. This error can be decomposed into three parts. The first is the optimization error $\epsilon_{opt} := \hat{\mathcal{R}}(\hat{f}) - \inf_{f \in \mathcal{F}_{\delta}} \hat{\mathcal{R}}(f)$, i.e. how far the training procedure falls short of minimizing the empirical risk over the hypothesis class $\mathcal{F}_\delta$. The second is the statistical error $\epsilon_{stat} := \mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})$, the gap between the population and empirical risks of $\hat{f}$. The last is the approximation error $\epsilon_{approx}$ induced by the limited expressiveness of $\mathcal{F}_\delta$ relative to $\mathcal{F}$.
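Writing the approximation term in its mixed empirical/population form, the three errors telescope exactly back to the generalization error:

$$
\mathcal{R}(\hat{f}) - \inf_{f \in \mathcal{F}} \mathcal{R}(f)
= \underbrace{\mathcal{R}(\hat{f}) - \hat{\mathcal{R}}(\hat{f})}_{\epsilon_{stat}}
+ \underbrace{\hat{\mathcal{R}}(\hat{f}) - \inf_{f \in \mathcal{F}_{\delta}} \hat{\mathcal{R}}(f)}_{\epsilon_{opt}}
+ \underbrace{\inf_{f \in \mathcal{F}_{\delta}} \hat{\mathcal{R}}(f) - \inf_{f \in \mathcal{F}} \mathcal{R}(f)}_{\epsilon_{approx}}
$$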

Studies of robustness and generalization mostly concern $\epsilon_{stat}$. Suppose $\hat{f}(x) = \hat{h} \circ \hat{g}(x) = \hat{h}(\hat{z})$, where $\hat{g}$ is the learned representation mapping, $\hat{z} = \hat{g}(x)$ is the learned representation, and $\hat{h}$ is the predictor on top of it. Generally, the problem occurs when $\hat{\mathcal{R}}_{train}(\hat{f})$ and $\hat{\mathcal{R}}_{test}(\hat{f})$ differ substantially. Such a difference originates from the fact that $\hat{\mathbb{P}}_{train}$ is only a finite, partial sampling of the true distribution $\mathbb{P}$.
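As a toy illustration of how that partial sampling shows up as a train/test risk gap, here is a minimal numpy sketch: a linear class fit on inputs from one region does well there but fails once the inputs shift. The quadratic ground truth, the Gaussian covariate shift, and all constants are illustrative assumptions, not anything specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones_like(x), x])

def risk(w, x, y):
    """Empirical squared-error risk of the linear predictor w."""
    return float(np.mean((features(x) @ w - y) ** 2))

f_true = lambda x: x ** 2                         # ground-truth labelling rule
x_train = rng.normal(0.0, 1.0, 1000)              # training inputs
x_test = rng.normal(2.0, 1.0, 1000)               # covariate-shifted test inputs
y_train = f_true(x_train) + rng.normal(0.0, 0.1, 1000)
y_test = f_true(x_test) + rng.normal(0.0, 0.1, 1000)

# Empirical risk minimization over the (misspecified) linear class,
# using the training sample only.
w, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

print("train risk:", risk(w, x_train, y_train))   # small
print("test risk :", risk(w, x_test, y_test))     # much larger under the shift
```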

Robustness and out-of-distribution generalization consider different kinds of distribution shift. We can view the input itself as the original, untransformed representation. From this view, robustness (shift in input space) and bias shift (shift in the space of learned or semantic representations) can be unified as studies of distribution shift in different representation spaces, as sketched below.
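A small numpy sketch of this unification: the same pair of samples can look very shifted in one representation space and much less so in another. The random saturating map `g` stands in for a learned $\hat{g}$, and the mean-distance statistic is only a crude proxy for a proper divergence such as MMD; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_shift(z_a, z_b):
    """Squared distance between sample means: a deliberately crude
    one-number proxy for distribution shift in a given space."""
    return float(np.sum((z_a.mean(axis=0) - z_b.mean(axis=0)) ** 2))

# Stand-in for a learned representation map: a random linear layer
# followed by a saturating nonlinearity that compresses large shifts.
W = rng.normal(size=(2, 2))
g = lambda x: np.tanh(x @ W)

X_train = rng.normal(0.0, 1.0, (500, 2))
X_test = rng.normal(2.0, 1.0, (500, 2))   # shifted inputs

# The shift is large in input space but much smaller after g, so
# "how much shift" is only defined relative to a representation space.
print("input-space shift:         ", mean_shift(X_train, X_test))
print("representation-space shift:", mean_shift(g(X_train), g(X_test)))
```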

Similarly, outlier detection and out-of-distribution detection can be unified as flagging samples that deviate from the training distribution, with the deviation measured in different representation spaces.
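To make the detection side concrete, here is a minimal sketch of one standard recipe of this kind: a Mahalanobis-style score computed in a representation space. The Gaussian features are synthetic placeholders for $\hat{z} = \hat{g}(x)$, and the whole pipeline is an illustrative assumption, not a method specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(Z):
    """Mean and (regularized) precision matrix of in-distribution features."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(z, mu, prec):
    """Squared Mahalanobis distance from the training feature cloud;
    larger scores flag the sample as more likely out-of-distribution."""
    d = z - mu
    return float(d @ prec @ d)

# Synthetic stand-ins for training-set representations.
Z_train = rng.normal(0.0, 1.0, (1000, 4))
mu, prec = fit_gaussian(Z_train)

z_in = rng.normal(0.0, 1.0, 4)    # resembles training data -> low score
z_out = rng.normal(4.0, 1.0, 4)   # shifted sample -> high score
print(ood_score(z_in, mu, prec), ood_score(z_out, mu, prec))
```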