Machine Learning Notes

Machine Learning review notes

Fundamentals

Overfitting/Underfitting

  • Overfitting = the model fits its training data too closely and fails to generalize to unseen data, so it has high generalization error (also called out-of-sample error or risk)
  • Underfitting = the model is unable to capture the relationship between the input and output variables, producing a high error rate on both the training set and unseen data

Prevent Underfitting?

  • Try a more complex model (e.g., a non-linear model or a neural network)
  • Decrease regularization
  • Increase the duration of training
  • Add in more features to provide more information

Prevent Overfitting?

  • Early stopping
  • Cross validation
  • Train with more data
  • Data augmentation
  • Feature selection
  • Regularization
  • Ensemble methods

Bias/Variance Trade-Off

  • $y$ is the observed label
  • We want to model the true function $f(x)$; $f'(x;D)$ is our model, where $x$ is the input and $D$ is the training data
  • $y = f(x) + \varepsilon$, where $\varepsilon$ is noise with mean $0$ and variance $\sigma^2$

We decompose its expected error on an unseen sample $x$:

\[E_D \left[ (y - f'(x;D))^2 \right] = \left( \text{Bias}_D[f'(x;D)] \right)^2 + \text{Var}_D[f'(x;D)] + \sigma^2\]
  • The bias term \(= E_D[f'(x;D)] - f(x)\)
  • And the variance \(= E_D \left[ (E_D[f'(x;D)] - f'(x;D))^2 \right]\)

The more complex the model is, the more closely it fits the training points, and the lower its bias will be. However, a more complex model also “moves” more from one training set to another to chase those points, so its variance will be larger (see the simulation sketch below).

  • Underfitting = High bias, low variance
  • Overfitting = Low bias, high variance
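
A minimal simulation sketch in Python (not part of the original notes): estimate $\text{Bias}^2$ and $\text{Var}$ at a single test point by retraining the same model class on many independently drawn training sets $D$; the true function, noise level, and polynomial degree below are made-up illustration values.

```python
# Empirically estimate bias^2 and variance of f'(x;D) at a test point x0
# by refitting the model on many freshly sampled training sets D.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # true function f(x) (assumed for illustration)
sigma = 0.3                     # noise std; sigma^2 is the irreducible error
x0, degree, n_train, n_trials = 1.0, 3, 30, 500

preds = []
for _ in range(n_trials):
    x = rng.uniform(0, 2 * np.pi, n_train)          # fresh training set D
    y = f(x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)               # f'(x; D): degree-3 polynomial
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2               # (E_D[f'(x0;D)] - f(x0))^2
variance = preds.var()                              # Var_D[f'(x0;D)]
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
```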

Hypothesis Testing on ML Models

Given a set of ground truths and two models A and B, how can we be confident that one model is better than the other?

  • $H_0$: Both models perform equally well
  • $H_1$: Model B performs better

Test both models on several different test sets, compute the mean $\mu$ and variance $\sigma^2$ of the accuracy (or another metric), and perform a significance test (z-test, t-test, …).

The more test sets the better: by the Central Limit Theorem, the sample mean of the metric approaches a normal distribution as the number of test sets grows, which is what justifies the z-/t-test.
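
A minimal sketch (not from the original notes) of such a test with SciPy, assuming both models were evaluated on the same cross-validation folds; the per-fold accuracies are made up for illustration.

```python
# Paired t-test on per-fold accuracies of models A and B (same folds).
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies on 10 shared CV folds.
acc_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82])
acc_b = np.array([0.84, 0.81, 0.85, 0.83, 0.84, 0.80, 0.83, 0.82, 0.81, 0.85])

# H0: both models perform equally well; H1: model B performs better (one-sided).
t_stat, p_two_sided = stats.ttest_rel(acc_b, acc_a)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t={t_stat:.3f}, one-sided p={p_one_sided:.4f}")   # reject H0 if p < 0.05
```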

Logistic Regression

Sigmoid

\[y = \frac{1}{1 + e^{-z}}\]

Log Loss (Logistic Loss/Cross-Entropy Loss)

Log Loss is the negative average of the log of the probability assigned to the true class of each instance.

\[L(y, p) = - \frac{1}{N} \sum_{i}^{N} (y_i\log(p_i) + (1 - y_i)\log(1 - p_i))\]

True label $y \in \{0, 1\}$ and probability estimate $p$

We can’t use MSE loss here because with Sigmoid + MSE, the loss $L(w)$ as a function of the weights becomes non-convex.
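
A minimal sketch (not from the original notes) of the two formulas above in NumPy; the labels and raw scores are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])              # true labels
z = np.array([2.0, -1.0, 0.5, -0.2])    # raw scores w^T x (made up)
p = sigmoid(z)                          # probability estimates
print(log_loss(y, p))                   # agrees with sklearn.metrics.log_loss(y, p)
```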

Regularization

The intuitive difference between L1 and L2: L1 tries to estimate the median of the data while L2 tries to estimate the mean of the data.

L1 Regularization (Lasso Regression)

  • Lasso = Least Absolute Shrinkage and Selection Operator
  • L1 regularization leads to sparsity (some weight can be 0, so we can use it for feature selection)
\[L = \sum_{i=1}^N (y_i - \boldsymbol{\omega}^T \boldsymbol{x}_i)^2 + \lambda ||\boldsymbol{\omega}||_1\] \[||\boldsymbol{\omega}||_1 = \sum_j |\omega_j|\]

L2 Regularization (Ridge Regression)

  • L2 regularization does not produce sparsity (weights can get close to 0 but not exactly 0); see the comparison sketch after the formulas
\[L = \sum_{i=1}^N (y_i - \boldsymbol{\omega}^T \boldsymbol{x}_i)^2 + \lambda ||\boldsymbol{\omega}||_2^2\] \[||\boldsymbol{\omega}||_2^2 = \sum_j \omega_j^2\]
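
A minimal sketch (not from the original notes) of the sparsity difference using scikit-learn on synthetic data; the dataset and the regularization strength `alpha` are made-up illustration values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 drives many coefficients exactly to zero; L2 only shrinks them toward zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```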

Priors of L1 and L2

If we assume the prior of the weight $P(\omega)$ to be

  • a Laplace (double exponential) distribution with mean 0, then MAP estimation yields Lasso regression
  • a Gaussian distribution with mean 0, then MAP estimation yields Ridge regression
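
A sketch of the MAP argument (assuming i.i.d. Gaussian noise with variance $\sigma^2$ on $y$ and an i.i.d. prior over the weights):

\[\hat{\boldsymbol{\omega}}_{MAP} = \arg\max_{\boldsymbol{\omega}} \left[ \log P(\boldsymbol{y} \mid X, \boldsymbol{\omega}) + \log P(\boldsymbol{\omega}) \right] = \arg\min_{\boldsymbol{\omega}} \left[ \sum_i (y_i - \boldsymbol{\omega}^T \boldsymbol{x}_i)^2 - 2\sigma^2 \log P(\boldsymbol{\omega}) \right]\]

With a Laplace prior $P(\omega_j) \propto e^{-|\omega_j|/b}$, the penalty term becomes $\frac{2\sigma^2}{b}\sum_j |\omega_j|$ (the L1/Lasso penalty); with a Gaussian prior $P(\omega_j) \propto e^{-\omega_j^2/(2\tau^2)}$, it becomes $\frac{\sigma^2}{\tau^2}\sum_j \omega_j^2$ (the L2/Ridge penalty).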

Classic Machine Learning Models

KNN

  • Classification and regression
  • Non-parametric
  • Algorithm (a sketch in code follows this list):
    1. Given a vector $v$, calculate the distance between it and every vector in the training data
    2. Sort the distances in ascending order and keep the $k$ closest samples
    3. For classification, the prediction for $v$ is the most common class label among the $k$ neighbors. For regression, the prediction is the mean value of the neighbors.
  • Applications: anomaly detection, search, recommender system
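
A minimal sketch (not from the original notes) of the KNN classification algorithm above in plain NumPy; the toy data is made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, v, k=3):
    # 1. Distance between v and every training vector (Euclidean).
    dists = np.linalg.norm(X_train - v, axis=1)
    # 2. Keep the k closest samples.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among the k neighbors (use the mean instead for regression).
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))   # -> 1
```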

k-means

  • k-means is an NP-hard problem; the k-means algorithm usually refers to Lloyd’s algorithm, a heuristic for solving it
  • k-means partitions observations into $k$ clusters in which each observation belongs to the nearest cluster
  • These heuristics are not guaranteed to find the global optimum, and the result depends on the initial clusters
  • Lloyd’s algorithm (the k-means algorithm; see the sketch below)
    1. Determine $k$ and initialize $k$ clusters (determining the mean of each cluster $\mu_k$) in some way
    2. E-step: Compute the squared distance between each data point and every centroid, and assign each data point to the closest cluster (centroid)
    3. M-step: Compute each new centroid (mean) by averaging all the data points assigned to that cluster
    4. Keep iterating until the end condition is met (for example, max number of iterations finished)
  • k-means++ (better initialization)
    1. Choose one data point as the initial center $c_1$ uniformly at random from the data samples
    2. For each data point $x$ not chosen, compute the distance $D(x)$ between it and the nearest center that has already been chosen
    3. Choose one new data point at random as a new center; the probability of a point $x'$ being chosen is $\frac{D(x')^2}{\sum_x D(x)^2}$
    4. Repeat until $k$ centers have been chosen
    5. Proceed to the standard k-means algorithm
  • Applications: Vector quantization for signal processing (where k-means was originally developed), cluster analysis, feature learning, topic modeling
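
A minimal sketch (not from the original notes) of Lloyd’s algorithm following the E-step/M-step description above; it uses random data points as initial centroids and ignores the empty-cluster edge case.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k random data points (k-means++ would do better).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```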

Bagging

  • Bagging = bootstrap aggregating
  • Designed to improve the stability and accuracy of ML algorithms. It reduces variance and helps to avoid overfitting
  • Sample with replacement to create different data subsets (bootstraps), and train a model on each of these bootstraps
    • Sampling with replacement ensures each bootstrap is independent of its peers
  • The final prediction is the majority vote or average of all models’ predictions
  • Bagging generally improves unstable methods, such as neural networks, classification and regression trees, and subset selection in linear regression
  • It can mildly degrade the performance of stable methods such as KNN
  • Example: Random Forest
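
A minimal sketch (not from the original notes): bagging decision trees with scikit-learn on a synthetic dataset; a random forest is essentially this plus random feature subsets at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)   # bootstrap sampling is the default

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```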

Boosting

  • Boosting is a family of iterative ensemble algorithms that convert weak learners to strong ones
  • Start by training the first weak classifier on the original dataset
  • Samples are re-weighted based on how well the first classifier classifies them: misclassified samples are given higher weight
  • Train the second classifier on this re-weighted dataset. The ensemble now includes both classifiers
  • Samples are re-weighted based on how well the ensemble classifies them.
  • Repeat for as many iterations as needed. And the final strong classifier is created as a weighted combination of the existing classifiers (classifiers with smaller training errors have higher weights)
  • Examples: AdaBoost (which follows the re-weighting scheme above), Gradient-Boosted Trees (GBT); see the sketch below
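
A minimal sketch (not from the original notes): AdaBoost implements the sample re-weighting described above, while gradient boosting (GBT) instead fits each new tree to the errors of the current ensemble; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)          # sample re-weighting
gbt = GradientBoostingClassifier(n_estimators=100, random_state=0)  # fits residual errors

print("AdaBoost:", cross_val_score(ada, X, y, cv=5).mean())
print("GBT     :", cross_val_score(gbt, X, y, cv=5).mean())
```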

Stacking

  • Predictions from each model are stacked together and used as input to a final model (usually called a meta-model)
  • It’s similar to the weighted average of models, but the weights here are learned, rather than manually assigned
  • There is also a term called blending, which trains the meta-model on a separate holdout set rather than on the same training set used for the upstream models
    • Blending can prevent information leak but may lead to overfitting if the holdout set is small
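
A minimal sketch (not from the original notes): stacking two base models under a logistic-regression meta-model with scikit-learn; the base models and dataset are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # the meta-model; its weights are learned
    cv=5,   # base-model predictions are produced out-of-fold to limit leakage
)
print("stacked accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```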

Standardization and Normalization

  • Normalization, or min-max scaling, is calculated as
\[X' = \frac{X - X_{min}}{X_{max} - X_{min}}\]
  • Standardization, or z-score normalization, is calculated as
\[X' = \frac{X - \mu}{\sigma}\]
  • Normalization scales the range to [0, 1], while standardization is not bounded to a certain range.
  • As normalization deals with min and max values, it can be affected by outliers easily
  • If we don’t know the distribution of the data, normalization can often be useful.
  • If we already know the distribution to be normal, then standardization can be useful.
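
A minimal sketch (not from the original notes) of both scalings, by hand and with scikit-learn’s scalers; the small array with an outlier is made up to show how min-max scaling is affected by extremes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # the outlier 10 dominates min-max scaling

x_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # scaled to [0, 1]
x_zscore = (X - X.mean(axis=0)) / X.std(axis=0)                    # mean 0, std 1

assert np.allclose(x_minmax, MinMaxScaler().fit_transform(X))
assert np.allclose(x_zscore, StandardScaler().fit_transform(X))
print(x_minmax.ravel())
print(x_zscore.ravel())
```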
