Core ML Concepts #
Cross-validation #
Cross-validation is a technique used to assess how a model will generalize to an independent dataset. A closely related idea is backtesting, which evaluates a model on historical data. Common types:
- K-Fold: Split data into k parts, train on k–1, validate on 1.
- Leave-One-Out (LOO): Extreme form of K-fold with k = n.
- Stratified K-Fold: Keeps class distribution constant in each fold.
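A minimal sketch of how K-Fold index splitting could look, assuming contiguous, unshuffled folds (in practice the data is usually shuffled first). The helper name `k_fold_indices` is hypothetical, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train_idx, val_idx))
        start += size
    return splits


splits = k_fold_indices(n_samples=6, k=3)
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Setting `k` equal to `n_samples` gives Leave-One-Out, since every fold then holds exactly one sample.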
Overfitting vs Underfitting #
Overfitting: the model fits the training data too closely, including noise and outliers, which results in poor generalization to new data.
- High training accuracy but low test accuracy.
- Reduce by adding training data, removing features or model complexity, or applying regularization (L1, L2).
Underfitting: the model is too simple to capture the underlying patterns in the data.
- Low training accuracy and low test accuracy.
- Reduce by adding model complexity or features, or by reducing regularization.
Bias and variance trade-off #
Error decomposition: $$ \text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
- Bias: error due to overly simplistic assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- Variance: error due to excessive sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Regularization #
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. Common types of regularization include:
- L1 Regularization (Lasso): adds the absolute value of the magnitude of coefficients as a penalty term to the loss function. It can lead to sparse models where some coefficients are exactly zero.
- L2 Regularization (Ridge): adds the squared magnitude of coefficients as a penalty term to the loss function. It tends to shrink coefficients but does not set them to zero.
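To see why L1 produces exact zeros while L2 only shrinks, here is a sketch for the simplest possible case: a single coefficient whose unpenalized estimate is `z`. The one-dimensional objectives in the docstrings are an assumption chosen for illustration; real Lasso and Ridge solvers work on the full coefficient vector.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


unpenalized = [3.0, 0.4, -0.2]
ridge = [ridge_shrink(z, 0.5) for z in unpenalized]
lasso = [lasso_shrink(z, 0.5) for z in unpenalized]
print("ridge:", ridge)
print("lasso:", lasso)
```

With a penalty strength of 0.5, the two small coefficients become exactly zero under the L1 penalty, while the L2 penalty leaves all three non-zero.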
Feature Selection and Engineering #
Scaling and normalization #
- Scaling: transforming features to a comparable scale. Min-Max Scaling maps each feature to a fixed range such as [0, 1] or [-1, 1].
- Standardization (Z-score normalization): transforming features to have a mean of 0 and a standard deviation of 1. This is often done using StandardScaler in libraries like scikit-learn. The term "normalization" is also used loosely for Min-Max Scaling, so check which one is meant.
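A small standard-library sketch of Min-Max Scaling and Z-score standardization. The function names are hypothetical, and the population standard deviation is used here as an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so that min -> low and max -> high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


feature = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(feature)
z_scores = standardize(feature)
print(scaled)
print(z_scores)
```

In a real pipeline the minimum, maximum, mean, and standard deviation would be computed on the training split only and reused on validation and test data.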
One-hot encoding #
- Converts categorical variables into binary vectors.
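A minimal sketch of one-hot encoding in plain Python. The helper name and the alphabetical column order are assumptions made for the example.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row is a binary indicator vector."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is "hot"
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "red", "blue"])
print(categories)
print(rows)
```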
Target encoding #
- Replaces categorical variable with the mean of the target variable for each category.
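A sketch of plain mean target encoding, with hypothetical names. It deliberately omits smoothing and out-of-fold computation; in practice the category means should come from training folds only, otherwise the target leaks into the feature.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target observed for that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


encoded, category_means = target_encode(["a", "b", "a", "b", "c"], [1, 0, 0, 0, 1])
print(category_means)
print(encoded)
```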
Feature extraction #
- Techniques like PCA (Principal Component Analysis) reduce dimensionality while retaining important information.
Feature selection methods #
- filter methods (correlation, chi-square)
- wrapper methods (recursive feature elimination, forward/backward selection)
- embedded methods (Lasso, decision tree feature importance)
Metrics #
- Regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R²)
- MAE = $$\frac{1}{n} \sum | \text{Actual} - \text{Predicted} |$$
- MSE = $$\frac{1}{n} \sum ( \text{Actual} - \text{Predicted} )^2$$
- RMSE = $$\sqrt{\frac{1}{n} \sum ( \text{Actual} - \text{Predicted} )^2}$$
- R² = 1 - $$\frac{\sum ( \text{Actual} - \text{Predicted} )^2}{\sum ( \text{Actual} - \bar{\text{Actual}} )^2}$$
- Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC
- True positive (TP): predicted positive and actual is positive
- False positive (FP): predicted positive but actual is negative
- True negative (TN): predicted negative and actual is negative
- False negative (FN): predicted negative but actual is positive
- Precision = TP / (TP + FP): how precise positive predictions are
- Recall = TP / (TP + FN): how many actual positives were captured
- F1-score = 2 * (Precision * Recall) / (Precision + Recall): balances precision and recall, useful for class imbalance scenarios
- Accuracy = (TP + TN) / (TP + TN + FP + FN): overall correctness of the model
- Specificity = TN / (TN + FP): how many actual negatives were captured
- Precision-recall curve: y-axis = precision, x-axis = recall. The area under the curve (AUC) summarizes model performance; higher is better. Good for imbalanced datasets.
- ROC curve: y-axis = TPR (recall), x-axis = FPR (1 - specificity). The area under the curve (ROC-AUC) summarizes model performance; higher is better. Good for balanced datasets.
- Time series: Mean Absolute Percentage Error (MAPE), Symmetric Mean Absolute Percentage Error (sMAPE)
- MAPE = $$\frac{1}{n} \sum \left|\frac{\text{Actual} - \text{Forecast}}{\text{Actual}}\right| \times 100$$
- sMAPE = $$\frac{1}{n} \sum \frac{|\text{Forecast} - \text{Actual}|}{(|\text{Actual}| + |\text{Forecast}|)/2} \times 100$$
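The formulas above translate almost line for line into code. This is a sketch with hypothetical function names; it assumes binary labels encoded as 0/1 and that no denominator is zero (for example, no actual value of 0 for MAPE).

```python
import math


def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ss_res / ss_tot,
    }


def classification_metrics(actual, predicted):
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / len(pairs),
        "specificity": tn / (tn + fp),
    }


def mape(actual, forecast):
    return 100 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def smape(actual, forecast):
    return 100 / len(actual) * sum(
        abs(f - a) / ((abs(a) + abs(f)) / 2) for a, f in zip(actual, forecast)
    )


reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
mape_value = mape([100.0, 200.0], [110.0, 180.0])
smape_value = smape([100.0, 200.0], [110.0, 180.0])
print(reg, clf, mape_value, smape_value)
```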
ML Models #
K-Nearest Neighbors (KNN) #
- instance-based learning algorithm
- classification and regression
- Predicts the class or value based on the majority class or average of the k-nearest neighbors in the feature space
- Distance metrics: Euclidean, Manhattan, Minkowski
Pros:
- Simple to implement and understand
- No training phase, making it fast for small datasets
Cons:
- Computationally expensive for large datasets
- Sensitive to irrelevant features and the choice of k
k is the hyperparameter representing the number of nearest neighbors to consider when making predictions.
- small k means more flexible decision boundary but may lead to overfitting
- large k means smoother decision boundary but may lead to underfitting
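A brute-force sketch of KNN classification with Euclidean distance and majority vote. The function name and toy data are made up for illustration; real implementations usually rely on spatial indexes to avoid scanning every point.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

For regression, the vote would be replaced by the average of the neighbors' target values.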
Linear Regression #
- regression algorithm
- Assumes linear relationship between input features and target variable
- Objective: minimize Mean Squared Error (MSE)
- Equation: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon $$
- Coefficients (β) represent the change in the target variable for a one-unit change in the feature, holding other features constant.
Pros:
- Simple to implement and interpret
- Computationally efficient
Cons:
- Assumes linearity, which may not hold in real-world data
- Sensitive to outliers
Assumptions:
- Linearity: relationship between features and target is linear
- Independence: observations are independent of each other
- Homoscedasticity: constant variance of errors
- Normality: errors are normally distributed
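For a single feature, minimizing MSE has a closed-form solution: the slope is the covariance of x and y divided by the variance of x. The sketch below covers only that one-feature case, and the function name is hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Noise-free data generated from y = 1 + 2x.
intercept, slope = fit_simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(intercept, slope)
```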
Logistic Regression #
- classification algorithm
- Models the probability of a binary outcome based on input features
- Uses the logistic (sigmoid) function to map predicted values to probabilities
- Equation: $$ P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} $$
Pros:
- Outputs probabilities, useful for binary classification
- Easy to implement and interpret
Cons:
- Assumes linear decision boundary
- May struggle with complex relationships
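A sketch of the prediction step only: a linear score passed through the sigmoid. The coefficient values are arbitrary placeholders, not fitted parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, intercept, coefficients):
    """P(y = 1 | x) for one sample under a logistic regression model."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)


p = predict_proba([2.0, 1.0], intercept=-1.0, coefficients=[0.5, 1.0])
predicted_class = 1 if p >= 0.5 else 0  # 0.5 is a common default threshold
print(p, predicted_class)
```

The decision boundary is where the linear score equals zero, which is why the boundary is linear in the features.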
Naive Bayes #
- Bayesian classification algorithm
- Based on Bayes’ theorem with strong (naive) independence assumptions between features
- Types: Gaussian, Multinomial, Bernoulli
Pros:
- Fast and efficient for large datasets
- Performs well with high-dimensional data
Cons:
- Assumes feature independence, which may not hold in practice
Decision Tree #
- Classification and regression algorithm
- Leaf nodes: represent class labels (classification) or continuous values (regression) - bottom of tree
- Does not require feature scaling or normalization
- Classification
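- Entropy: measures label disorder in a node; splits are chosen to maximize information gain (the reduction in entropy)
- Gini impurity (also called Gini index): the probability that a randomly chosen sample in a node would be misclassified if labeled according to the node's class distribution; splits are chosen to minimize it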
- Regression
- Predicts value instead of class in each node
- Variance reduction along feature
- Decision boundary
- piecewise linear
- perpendicular to an axis
Pros:
- Easy to interpret and visualize
- Handles both numerical and categorical data
Cons:
- Prone to overfitting, especially with deep trees
- High variance: sensitive to small changes in data
Deep dive #
- Recursive partitioning: repeatedly split data into subsets to make the outcome more homogeneous
- Split value: predictor value that divides data into two groups
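A sketch of how a single split value could be chosen for one numeric predictor: try the midpoints between consecutive distinct values and keep the one with the lowest weighted Gini impurity. The function names are hypothetical, and a full tree would apply this recursively over all features.

```python
import math
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


threshold, score = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
print(threshold, score)
print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))
```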
Ensemble #
Random Forests #
- Combine multiple decision trees to improve performance
- Bagging (Bootstrap Aggregating): trains each tree on a random subset of data with replacement
- Random feature selection: each split considers a random subset of features
- Reduces overfitting and variance compared to single decision tree
- What’s random:
- Random sampling of data points for each tree (bagging)
- Random selection of features for each split
- At each split, a random set of m features is drawn from the full feature set, and only those features are candidates for the split
- Each tree uses a subset of samples
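The two sources of randomness can be sketched without building any trees: a bootstrap sample of row indices per tree, and a random feature subset per split. The helper names and sizes below are hypothetical.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (one draw per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate features (one draw per split)."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)  # seeded only to make the sketch reproducible
rows_for_tree = bootstrap_sample(10, rng)
features_for_split = random_feature_subset(8, 3, rng)
out_of_bag = sorted(set(range(10)) - set(rows_for_tree))
print(rows_for_tree, features_for_split, out_of_bag)
```

Rows never drawn for a given tree are its out-of-bag samples, which can serve as a built-in validation set for that tree.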
Boosting #
- Adaboost, gradient boosting, xgboost
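- Gradient boosting fits each new model to the residuals of the current ensemble; AdaBoost instead reweights the samples that earlier models got wrong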
- Sequentially add models to correct errors of prior models
- Learning rate: controls contribution of each tree to final prediction
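A sketch of the residual-fitting idea for squared error, assuming one-dimensional inputs and decision stumps (single-split trees) as the weak learners. It is not how XGBoost or any specific library implements boosting; all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor: (threshold, left_value, right_value)."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_value) ** 2 for r in left)
        sse += sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Each stump corrects a fraction of the remaining error.
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, predictions = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, predictions)
print(initial_mse, final_mse)
```

A smaller learning rate needs more rounds to reach the same training error but tends to generalize better.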
XGBoost #
- parallel processing at node level
- level-wise tree growth
- regularization to reduce overfitting
LightGBM #
- leaf-wise tree growth with depth limitation
- histogram-based splitting for faster training
CatBoost #
- handles categorical features natively by converting them into numerical representations
- ordered boosting to reduce overfitting
SVM #
- Classification but also can be used for regression (SVR)
- Finds optimal hyperplane that maximizes margin between classes
- Support vectors: data points closest to the hyperplane
- Kernel trick: transforms data into higher dimensions to make it linearly separable
- Common kernels: linear, polynomial, RBF (Gaussian)
Pros:
- Effective in high-dimensional spaces (many features)
- Robust to overfitting, especially with proper regularization
Cons:
- Computationally intensive for large datasets
- Sensitive to choice of kernel and hyperparameters
ML model trade-offs #
Neural network and Deep Learning #
Neural network #
Using neural networks with multiple layers to learn complex patterns in data. Main components:
- Layers: input, hidden, output
- Parameters: weights, biases
- Activation functions: ReLU, sigmoid, tanh
- Loss functions: MSE, cross-entropy
- Optimizers: SGD, Adam
Common models: CNNs, RNNs, Transformers, Autoencoders
Backpropagation: algorithm that computes the gradients of the loss with respect to the network's parameters, which gradient descent then uses to minimize the loss.
Single-layer perceptron (SLP)
- input layer, output layer
Multi-layer perceptron (MLP)
- input layer, hidden layers, output layer
Framework Logic
- Forward pass > predictions
- Compute loss
- Backward pass > gradients
- Optimizer step > update parameters
- Zero gradients (to prevent accumulation or interference between batches, in plain language, clear old gradients before computing new ones)
- Evaluation on validation set (metrics)
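The same loop can be sketched in plain Python for a one-parameter-pair model `y = w * x + b`, with hand-written gradients standing in for what a framework's autograd would compute. The explicit gradient accumulators mimic how frameworks add to stored gradients, which is why they are reset after each step. All names and hyperparameters are assumptions.

```python
def train(train_data, val_data, epochs=500, lr=0.05, batch_size=2):
    w, b = 0.0, 0.0
    grad_w, grad_b = 0.0, 0.0  # gradient accumulators
    for _ in range(epochs):
        for start in range(0, len(train_data), batch_size):
            batch = train_data[start:start + batch_size]
            # 1. Forward pass: predictions
            preds = [w * x + b for x, _ in batch]
            # 2. Compute loss (MSE over the batch; kept for monitoring)
            loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
            # 3. Backward pass: accumulate d(loss)/dw and d(loss)/db
            for p, (x, y) in zip(preds, batch):
                grad_w += 2 * (p - y) * x / len(batch)
                grad_b += 2 * (p - y) / len(batch)
            # 4. Optimizer step: plain gradient descent
            w -= lr * grad_w
            b -= lr * grad_b
            # 5. Zero gradients so the next batch starts clean
            grad_w, grad_b = 0.0, 0.0
    # 6. Evaluation on the validation set
    val_mse = sum((w * x + b - y) ** 2 for x, y in val_data) / len(val_data)
    return w, b, val_mse


train_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1
val_data = [(4.0, 9.0), (5.0, 11.0)]
w, b, val_mse = train(train_data, val_data)
print(w, b, val_mse)
```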
Transformers #
A type of DL model primarily used for sequential data like text, time series, or audio. It uses self-attention to weight all input elements relative to each other, allowing the model to capture contextual relationships. Main components:
- Attention mechanism: self-attention, multi-head attention
- Positional encoding: adds information about the position of tokens in the sequence
- Encoder-decoder architecture: used in seq2seq tasks
Common applications: NLP (BERT, GPT), time series forecasting, image processing (Vision Transformers)
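A stripped-down sketch of scaled dot-product self-attention. As a simplifying assumption, the queries, keys, and values are all the raw input vectors; a real Transformer first applies learned projection matrices and uses several heads.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Return (outputs, weights) with Q = K = V = tokens."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)  # how much this token attends to every token
        weights.append(row)
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
print(outputs[0])
```

Tokens 0 and 2 are identical, so token 0 gives them equal weight and gives less weight to the dissimilar token 1.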
Comparison #
| Type | Best for | Core idea | Limitation |
|---|---|---|---|
| Feedforward NN/Multilayer Perceptron MLP | Tabular or static data | Simple stacked layers (no memory) | Cannot handle sequence or spatial patterns |
| Transformer | Text, time series, vision | Uses attention instead of recurrence or convolution | Requires more compute, large data |
| Convolutional NN (CNN) | Images, spatial data | Uses filters (kernels) that scan input to capture local patterns | Poor at long-term or sequential relations |
| Autoencoder | Dimensionality reduction, anomaly detection | Learns compressed representation of input data | May lose important info in compression |
| Recurrent NN (RNN) | Sequential/time-series data | Maintains internal memory of previous steps | Struggles with long dependencies (vanishing gradients) |
| Long Short-Term Memory (LSTM) | Long sequences | RNN variant with “gates” to keep or forget information | Still sequential (slow on long data) |
Pre-training vs post-training #
Pre-training: training a model on a large dataset before fine-tuning it on a specific task. Common in NLP with models like BERT, GPT.
Post-training: fine-tuning the pre-trained model for task-specific behavior.
Drop out and batch norm #
- Dropout: regularization technique that randomly sets a fraction of input units to 0 during training to prevent overfitting.
- Batch Normalization: normalizes the inputs of each layer over the current mini-batch to have a mean of 0 and a standard deviation of 1, then applies a learnable scale and shift, improving training speed and stability.
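A sketch of both operations on flat lists of activations. Two details are assumptions beyond the text: the dropout variant shown is "inverted" dropout, which rescales surviving units so no change is needed at inference time, and the batch norm covers only the training-time computation for a single feature, with `gamma` and `beta` as the scale and shift.

```python
import random
from statistics import mean, pvariance


def dropout(activations, p, rng, training=True):
    """Zero each unit with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def batch_norm(batch_values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mu, var = mean(batch_values), pvariance(batch_values)
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in batch_values]


rng = random.Random(0)  # seeded only for reproducibility of the sketch
dropped = dropout([1.0] * 1000, p=0.5, rng=rng)
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(sum(1 for v in dropped if v == 0.0), normalized)
```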
BERT vs GPT #
- BERT (Bidirectional Encoder Representations from Transformers): designed for understanding context in text by looking at both left and right context. Used for tasks like question answering, sentiment analysis.
- GPT (Generative Pre-trained Transformer): designed for generating coherent text by predicting the next word in a sequence. Used for tasks like text generation, summarization.
Reinforcement Learning #
Learning via interaction with an environment to maximize cumulative reward.
Core idea: Agent learns policy π(a|s) that maps state → action to maximize expected return
Main components:
- Agent, environment
- State (s), action (a), reward (r), next state (s′)
- Policy π, value function V(s), Q-function Q(s, a)
Training loop:
- Observe state
- Choose action
- Receive reward, new state
- Update policy/value via reward feedback
Pros: Learns optimal sequential decisions
Cons: Slow convergence, unstable training, exploration–exploitation trade-off
Common algorithms: Q-learning, DQN, Policy Gradient, PPO, A3C #
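A sketch of tabular Q-learning on a made-up five-state corridor: the agent starts at state 0, and reaching state 4 ends the episode with reward 1. The environment, hyperparameters, and the purely random behavior policy are all assumptions chosen to keep the example short; Q-learning is off-policy, so it can still learn the greedy policy from random exploration.

```python
import random

N_STATES = 5       # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)                    # choose action
            next_state, reward, done = step(state, action)  # reward, new state
            future = 0.0 if done else gamma * max(q_table[next_state])
            # Move Q(s, a) toward the bootstrapped target.
            q_table[state][action] += alpha * (reward + future - q_table[state][action])
            state = next_state
    return q_table


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```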
Unsupervised Learning #
K-Means Clustering #
- Partition data into k clusters based on feature similarity
- Steps:
- Initialize k centroids randomly
- Assign each data point to the nearest centroid using distance metric
- Update centroids by calculating the mean of assigned points
- Iterate until centroids stabilize
- Elbow method: plot the within-cluster sum of squares (WCSS) against the number of clusters and choose k at the "elbow" where the curve flattens
- Other types: k-medoids
- k-selection: gap statistic (based on WCSS), silhouette score (how well data points fit within their own cluster vs the next closest cluster)
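A sketch of the K-Means loop plus the WCSS value used by the elbow method. One deliberate deviation, made for reproducibility: the centroids are initialized from the first k points rather than randomly. The function names are hypothetical.

```python
import math


def k_means(points, k, max_iters=100):
    centroids = [points[i] for i in range(k)]  # assumption: non-random init
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stabilized
            break
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances."""
    return sum(
        math.dist(point, centroid) ** 2
        for centroid, cluster in zip(centroids, clusters)
        for point in cluster
    )


points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
total_wcss = wcss(centroids, clusters)
print(centroids, total_wcss)
```

Running this for several values of k and plotting `total_wcss` against k gives the elbow curve.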
Hierarchical Clustering #
- Agglomerative (bottom-up) or divisive (top-down) approaches
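- Algorithm (agglomerative):
- Treat each data point as its own cluster
- Merge the two closest clusters
- Repeat until a single cluster (or the desired number of clusters) remains
- The sequence of merges forms a dendrogram, which can be cut at a chosen height to obtain clusters
- Related: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based rather than hierarchical method, though it likewise does not require specifying k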
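A sketch of the agglomerative (bottom-up) approach on one-dimensional points, assuming single linkage (cluster distance = distance between the closest pair of members). The recorded merge history is the information a dendrogram would draw. The function name is hypothetical, and the brute-force search is only suitable for tiny inputs.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), distance))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


clusters, merges = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], n_clusters=3)
print(clusters)
for left, right, distance in merges:
    print("merged", left, "and", right, "at distance", round(distance, 3))
```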