Machine Learning review notes
Fundamentals
Overfitting/Underfitting
- Overfitting = the model fits the training data too closely and cannot perform accurately on unseen data, i.e. it has high generalization error (out-of-sample error/risk)
- Underfitting = the model is unable to capture the relationship between the input and output variables accurately, producing a high error rate on both the training set and unseen data
Prevent Underfitting?
- Try a more complex model (non-linear models, neural networks)
- Decrease regularization
- Increase the duration of training
- Add in more features to provide more information
Prevent Overfitting?
- Early stopping
- Cross validation
- Train with more data
- Data augmentation
- Feature selection
- Regularization
- Ensemble methods
Bias/Variance Trade-Off
- $y$: the observed labels
- We want to model the true function $f(x)$; $f'(x;D)$ is the model learned from the training data $D$, and $x$ is the input
- $y = f(x) + \varepsilon$, where $\varepsilon$ is noise with mean $0$ and variance $\sigma^2$
We decompose its expected error on an unseen sample $x$:
$$ E_D \left[ (y - f'(x;D))^2 \right] = \left( \text{Bias}_D[f'(x;D)] \right)^2 + \text{Var}_D[f'(x;D)] + \sigma^2 $$
- The bias term $\text{Bias}_D[f'(x;D)] = E_D[f'(x;D)] - f(x)$
- And the variance $\text{Var}_D[f'(x;D)] = E_D \left[ (E_D[f'(x;D)] - f'(x;D))^2 \right]$
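A quick derivation sketch (assuming the noise $\varepsilon$ is independent of $D$ with $E[\varepsilon] = 0$; shorthand $f' = f'(x;D)$, $\bar{f'} = E_D[f'(x;D)]$):

$$ E\left[ (y - f')^2 \right] = E\left[ (f(x) + \varepsilon - f')^2 \right] = E\left[ (f(x) - f')^2 \right] + \sigma^2 $$

since the cross term $2E[\varepsilon (f(x) - f')] = 0$. Then, inserting $\bar{f'}$:

$$ E\left[ (f(x) - f')^2 \right] = \left( f(x) - \bar{f'} \right)^2 + E\left[ (\bar{f'} - f')^2 \right] = \text{Bias}^2 + \text{Var} $$

because the other cross term $2(f(x) - \bar{f'})\, E_D[\bar{f'} - f'] = 0$ as well.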
The more complex the model, the more closely it can fit the training points, and the lower the bias. However, a more complex model also “moves” more with the particular training set it sees, and hence its variance is larger.
- Underfitting = High bias, low variance
- Overfitting = Low bias, high variance
Hypothesis Testing on ML Models
Given a set of ground truths, model A, and model B, how can you be confident that one model is better than the other?
- $H_0$: Both models perform equally well
- $H_1$: Model B performs better
Test the models on multiple sets of data, calculate the mean $\mu$ and variance $\sigma^2$ of the accuracy (or another metric), and perform a significance test (z-test, t-test, …).
The more test sets, the better the normal approximation of the mean metric holds (Central Limit Theorem).
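A minimal sketch of such a test, assuming per-split accuracies for both models are already available and using a paired t-test from SciPy (the accuracy numbers below are placeholders):

```python
import numpy as np
from scipy import stats

# Accuracy of each model on the same K evaluation splits (placeholder numbers).
acc_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
acc_b = np.array([0.84, 0.82, 0.85, 0.83, 0.86])

# Paired t-test: H0 says the mean accuracy difference is 0.
# alternative="less" tests H1: mean(acc_a) < mean(acc_b), i.e. model B is better.
# (The alternative argument requires SciPy >= 1.6.)
t_stat, p_value = stats.ttest_rel(acc_a, acc_b, alternative="less")

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: model B performs significantly better at the 5% level.")
else:
    print("Fail to reject H0: no significant difference detected.")
```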
Logistic Regression
Sigmoid
$$ y = \frac{1}{1 + e^{-z}} $$
Log Loss (Logistic Loss/Cross-Entropy Loss)
Log Loss is the negative average, over all instances, of the log of the probability the model assigns to the true class (the “corrected” predicted probability).
$$ L(y, p) = - \frac{1}{N} \sum_{i=1}^{N} \left( y_i\log(p_i) + (1 - y_i)\log(1 - p_i) \right) $$
where the true label is $y_i \in \{0, 1\}$ and $p_i$ is the predicted probability of class 1
We can’t use MSE loss here: with Sigmoid + MSE, the loss as a function of the weights, $L(w)$, becomes non-convex.
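A minimal NumPy sketch of the sigmoid and the log loss above (the clipping epsilon is just a guard against taking log of 0):

```python
import numpy as np

def sigmoid(z):
    # y = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p_pred, eps=1e-15):
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = sigmoid(np.array([2.0, -1.0, 0.5, 3.0]))
print(log_loss(y, p))  # small value -> predictions match the labels well
```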
Regularization
The intuitive difference between L1 and L2: L1 tries to estimate the median of the data while L2 tries to estimate the mean of the data.
L1 Regularization (Lasso Regression)
- Lasso = Least Absolute Shrinkage and Selection Operator
- L1 regularization leads to sparsity (some weights can be exactly 0, so it can be used for feature selection)
L2 Regularization (Ridge Regression)
- L2 regularization does not produce sparsity (weights can get close to 0 but not exactly 0)
Priors of L1 and L2
If we assume the prior of the weights $P(\omega)$ to be
- a Laplace (double exponential) distribution with mean 0, we can derive Lasso Regression via MAP estimation
- a Gaussian distribution with mean 0, we can derive Ridge Regression via MAP estimation (a derivation sketch follows below)
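A derivation sketch for the Laplace case (the Gaussian case is analogous): with likelihood $P(D \mid \omega)$ and an i.i.d. Laplace prior $P(\omega_j) \propto \exp(-|\omega_j| / b)$,

$$ \hat{\omega}_{\text{MAP}} = \arg\max_\omega \left[ \log P(D \mid \omega) + \log P(\omega) \right] = \arg\min_\omega \left[ -\log P(D \mid \omega) + \frac{1}{b} \sum_j |\omega_j| \right] $$

i.e. the usual loss plus an L1 penalty. A Gaussian prior $P(\omega_j) \propto \exp(-\omega_j^2 / 2\tau^2)$ gives $\frac{1}{2\tau^2} \sum_j \omega_j^2$ instead, i.e. an L2 penalty.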
Classic Machine Learning Models
KNN
- Classification and regression
- Non-parametric
- Algorithm (a minimal code sketch follows after this list):
- Given a vector $v$, calculate the distance between it and every vector in the training data
- Sort the distances in ascending order and keep the $k$ samples with the smallest distances
- For classification, the prediction of $v$ is the most common class labels in the $k$ neighbors. For regression, the prediction is the mean value of the neighbors.
- Applications: anomaly detection, search, recommender system
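A minimal NumPy sketch of the classification case (brute-force Euclidean distances, no tie handling):

```python
import numpy as np

def knn_predict(X_train, y_train, v, k=3):
    # Distance between v and every training vector (Euclidean).
    dists = np.linalg.norm(X_train - v, axis=1)
    # Indices of the k nearest neighbors (smallest distances).
    nearest = np.argsort(dists)[:k]
    # Classification: most common label among the k neighbors.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```

For regression, the last step would return the mean of `y_train[nearest]` instead of the majority label.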
k-means
- k-means is an NP-hard problem; “the k-means algorithm” usually refers to Lloyd’s algorithm, a heuristic for solving it
- k-means partitions observations into $k$ clusters in which each observation belongs to the nearest cluster
- These heuristics are not guaranteed to find the global optimum, and the result depends on the initial clusters
- Lloyd’s algorithm (the k-means algorithm; a minimal code sketch follows after this list)
- Determine $k$ and initialize the $k$ cluster centroids $\mu_1, \dots, \mu_k$ in some way
- E-step: Compute the squared distance between each data point and every centroid, and assign each data point to the closest cluster (centroid)
- M-step: Compute the new centroid (mean) of each cluster by averaging all data points assigned to it
- Keep iterating until the end condition is met (for example, max number of iterations finished)
- k-means++ (better initialization)
- Choose one data point as the initial center $c_1$ uniformly at random from the data samples
- For each data point $x$ not chosen, compute the distance $D(x)$ between it and the nearest center that has already been chosen
- Choose one new data point at random as a new center, the probability of a point $x'$ being chosen is $\frac{D(x')^2}{\sum_x D(x)^2}$
- Repeat until $k$ centers have been chosen
- Proceed to the standard k-means algorithm
- Applications: Vector quantization for signal processing (where k-means was originally developed), cluster analysis, feature learning, topic modeling
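A minimal NumPy sketch of Lloyd’s algorithm with random initialization (k-means++ initialization and empty-cluster handling are omitted for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: compute distances to all centroids, assign each point to the nearest one.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # end condition: assignments stabilized
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```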
Bagging
- Bagging = bootstrap aggregating
- Designed to improve the stability and accuracy of ML algorithms. It reduces variance and helps to avoid overfitting
- Sample with replacement to create different data subsets (bootstraps), and train a model on each of these bootstraps (see the sketch after this list)
- Sampling with replacement ensures each bootstrap is independent of its peers
- The final prediction is the majority vote or average of all models’ predictions
- Bagging generally improves unstable methods, such as neural networks, classification and regression trees, and subset selection in linear regression
- It can mildly degrade the performance of stable methods such as KNN
- Example: Random Forest
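A minimal sketch of bagging with decision trees, assuming scikit-learn is available (the helper name `bagging_predict` and the toy dataset are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train n_models trees on bootstrap samples and majority-vote their predictions."""
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        # Bootstrap: sample with replacement, same size as the original training set.
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    all_preds = np.array(all_preds)  # shape: (n_models, n_test)
    # Majority vote per test sample (assumes non-negative integer class labels).
    return np.array([np.bincount(col).argmax() for col in all_preds.T])

X, y = make_classification(n_samples=300, random_state=0)
preds = bagging_predict(X[:250], y[:250], X[250:])
print((preds == y[250:]).mean())  # accuracy of the bagged ensemble
```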
Boosting
- Boosting is a family of iterative ensemble algorithms that convert weak learners to strong ones
- Start by training the first weak classifier on the original dataset
- Samples are re-weighted based on how well the first classifier classifies them: misclassified samples are given higher weight
- Train the second classifier on this re-weighted dataset. The ensemble now includes both classifiers
- Samples are re-weighted based on how well the ensemble classifies them.
- Repeat for as many iterations as needed. The final strong classifier is a weighted combination of all the weak classifiers (classifiers with smaller training errors get higher weights)
- Examples: AdaBoost (the re-weighting scheme described above), Gradient-Boosted Trees (GBT); a minimal AdaBoost-style sketch follows after this list
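A minimal AdaBoost-style sketch, assuming scikit-learn decision stumps as the weak learners and labels in $\{-1, +1\}$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """AdaBoost-style boosting; labels y are assumed to be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)  # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        # Classifier weight: smaller weighted training error -> larger alpha.
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        # Re-weight samples: misclassified samples get higher weight.
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Final strong classifier: sign of the alpha-weighted sum of weak predictions.
    scores = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(scores)
```

With labels mapped to $\{-1, +1\}$, `adaboost_predict(*adaboost_fit(X_train, y_train), X_test)` gives the ensemble prediction.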
Stacking
- Predictions from each model are stacked together and used as input to a final model (usually called a meta-model)
- It’s similar to the weighted average of models, but the weights here are learned, rather than manually assigned
- There is also a term called blending, which trains the meta-model on a separate holdout set rather than on the same training set used for the upstream models
- Blending can prevent information leak but may lead to overfitting if the holdout set is small
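A minimal stacking sketch using scikit-learn’s StackingClassifier (the base models and the logistic-regression meta-model are just illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models whose predictions become the features for the meta-model.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
# The meta-model (final_estimator) learns how to weight the base predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

By default StackingClassifier uses internal cross-validation to generate the base-model predictions that the meta-model trains on, which is what distinguishes it from blending on a single holdout set.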
Standardization and Normalization
- Normalization or min-max scaling is calculated as $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
- Standardization or z-score normalization is calculated as $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation
- Normalization scales the range to [0, 1], while standardization is not bounded to a certain range.
- As normalization deals with min and max values, it can be affected by outliers easily
- If we don’t know the distribution of the data, normalization can often be useful.
- If we already know the distribution to be normal, then standardization can be useful.
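A minimal NumPy sketch of both transforms, applied per feature (column); the example matrix is arbitrary:

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [10.0, 500.0]])

# Min-max scaling: each column is mapped to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: each column gets mean 0 and standard deviation 1.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore)
```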