Machine Learning Interview Questions 2026: Algorithms, Metrics and Deep Learning


Machine learning interview questions in 2026 cover statistics, model selection, evaluation metrics, common algorithms, deep learning concepts, and production ML. This guide covers the most commonly asked ML questions for data scientist and ML engineer roles.

Core ML Concepts

1. What is the bias-variance tradeoff?

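Bias: Error from overly simple assumptions. The model is too simple and underfits, so it misses real patterns even in the training data
Variance: Error from sensitivity to the particular training sample. The model is too complex and overfits, fitting noise that does not generalize
Tradeoff: As model complexity grows, bias typically falls and variance rises; the goal is the complexity level that minimizes total error on unseen data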

Issue	Signs	Fix
High bias (underfitting)	Low train AND test accuracy	More features, complex model, longer training
High variance (overfitting)	High train, low test accuracy	More data, regularization, dropout, simpler model
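
One way to see both rows of the table in a single run is k-nearest-neighbours regression, where k acts as the complexity knob. The sketch below is only an illustration under assumed settings: a noisy sine curve as the toy problem, 40 training points, and three hand-picked values of k. The helper names (knn_predict, mse, make_data) are made up for this example.

```python
import math
import random

def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(x_train, y_train, xs, ys, k):
    """Mean squared error of k-NN predictions on the points (xs, ys)."""
    return sum((knn_predict(x_train, y_train, x, k) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Noisy samples from sin(x) on [0, 2*pi] (an assumed toy problem)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(42)
x_train, y_train = make_data(40, rng)
x_test, y_test = make_data(200, rng)

results = {}
for k in (1, 5, 40):  # very flexible, moderate, very rigid
    train_err = mse(x_train, y_train, x_train, y_train, k)
    test_err = mse(x_train, y_train, x_test, y_test, k)
    results[k] = (train_err, test_err)
    print(f"k={k:>2}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k=1 every training point is its own nearest neighbour, so the training error is zero and any gap to the test error is the overfitting signature. With k=40 the model predicts the same global mean everywhere, which is the underfitting signature: the error is high on both sets.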

2. Explain precision, recall, and F1 score

# For binary classification:
# True Positive (TP): correctly predicted positive
# False Positive (FP): predicted positive, actually negative
# False Negative (FN): predicted negative, actually positive
# True Negative (TN): correctly predicted negative

# Precision = TP / (TP + FP)
# "Of all predicted positives, how many were actually positive?"
# Use when false positives are costly (spam filter: don't block legitimate email)

# Recall (Sensitivity) = TP / (TP + FN)
# "Of all actual positives, how many did we catch?"
# Use when false negatives are costly (cancer detection: don't miss cancer)

# F1 = 2 * (Precision * Recall) / (Precision + Recall)
# Harmonic mean — balanced metric when both matter

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.80
print(f"Recall: {recall_score(y_true, y_pred):.2f}")        # 0.80
print(f"F1: {f1_score(y_true, y_pred):.2f}")                # 0.80
print(classification_report(y_true, y_pred))
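
To connect the formulas to the scikit-learn output, the same metrics can be derived by hand from the four confusion-matrix counts. This is a minimal sketch that reuses the y_true and y_pred lists from the snippet above; it skips the zero-division guards that real code would need when a class is never predicted.

```python
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=4 FP=1 FN=1 TN=4
print(f"Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")  # 0.80 each
```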

3. What is cross-validation and why is it needed?

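from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Problem: a single train/test split can be misleading (one lucky or unlucky split)
# Solution: cross-validation. Every sample is used for validation exactly once
# and for training in the other k-1 folds

# Synthetic data so the snippet runs on its own; substitute your own X, y
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=42)

model = RandomForestClassifier(random_state=42)

# 5-fold CV (k=5 and k=10 are common choices)
# Note: with an integer cv and a classifier, scikit-learn uses unshuffled StratifiedKFold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Explicit Stratified K-Fold with shuffling: keeps the class ratio in each fold (important for imbalanced data)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')

# Key insight: cross-validation gives a DISTRIBUTION of scores, not a single number
# Use mean ± std to understand model reliability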
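
Interviewers sometimes ask what K-fold actually does to the data. The sketch below shows the index bookkeeping in plain Python: each sample lands in exactly one validation fold. It is an illustration, not scikit-learn's implementation; k_fold_indices is a hypothetical helper, and real code would usually shuffle or stratify first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} val={val_idx}")

# Every index appears in exactly one validation fold
validated = sorted(idx for _, val_idx in folds for idx in val_idx)
print(validated == list(range(10)))  # True
```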

4. Explain regularization (L1, L2, Dropout)

from sklearn.linear_model import Lasso, Ridge, ElasticNet

# L1 (Lasso): penalty = lambda * sum(|w_i|)
# Effect: sparse weights — some become exactly 0 (feature selection!)
lasso = Lasso(alpha=0.1)  # alpha = lambda

# L2 (Ridge): penalty = lambda * sum(w_i^2)
# Effect: all weights shrink toward 0 but generally do not become exactly 0
ridge = Ridge(alpha=1.0)

# ElasticNet: combination of L1 + L2
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)

# For deep learning: Dropout
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),    # zero each activation with probability 0.3 during training
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 1),
)
# Dropout is only disabled in eval mode: call model.eval() before inference
# (and model.train() to re-enable it when training resumes)
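
Why L1 produces exact zeros while L2 only shrinks is easiest to see for a single weight w with unregularized estimate z. Minimizing 0.5*(w - z)^2 + lam*|w| gives the soft-thresholding rule, and minimizing 0.5*(w - z)^2 + 0.5*lam*w^2 gives proportional shrinkage. The sketch below assumes this one-dimensional setting (equivalently, orthonormal features). The helper names and example values are illustrative, and lam here is not on the same scale as scikit-learn's alpha.

```python
def l1_shrink(z, lam):
    """Soft-thresholding: argmin over w of 0.5*(w - z)**2 + lam*abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # coefficients smaller than lam are set exactly to zero

def l2_shrink(z, lam):
    """Proportional shrinkage: argmin over w of 0.5*(w - z)**2 + 0.5*lam*w**2."""
    return z / (1.0 + lam)

unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]  # assumed example estimates
lam = 0.5

l1 = [l1_shrink(z, lam) for z in unregularized]
l2 = [l2_shrink(z, lam) for z in unregularized]

print("L1:", l1)  # [2.5, -1.0, 0.0, 0.0, 0.0]
print("L2:", [round(w, 3) for w in l2])
```

The three small coefficients vanish under L1, which is the feature-selection effect, while L2 scales every coefficient by the same factor and leaves all of them non-zero.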

5. Explain gradient descent and its variants

# Gradient Descent: update weights in direction that reduces loss
# weights = weights - learning_rate * gradient

# Batch Gradient Descent:
# - Uses ALL training data to compute gradient
# - Stable but slow for large datasets

# Stochastic Gradient Descent (SGD):
# - Uses ONE sample per update
# - Fast but noisy/unstable

# Mini-batch SGD (most common):
# - Uses a batch (32, 64, 128, 256 samples) per update
# - Balance of speed and stability

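# Adam (a common default optimizer):
# - Adaptive learning rates per parameter
# - Combines momentum + RMSprop
import torch.optim as optim
# `model` is any nn.Module, e.g. the nn.Sequential network from question 4
# weight_decay here adds an L2 penalty to the gradient; optim.AdamW applies decoupled weight decay instead
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Learning rate schedulers
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
# Halves the LR once the monitored metric has not improved for more than 5 epochs
# Call scheduler.step(val_loss) once per epoch, after validation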
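
The three variants differ only in how many samples feed each gradient estimate, which a small from-scratch example makes concrete. The sketch below fits y = 2x + 1 with mean squared error in plain Python. The data, learning rate, and epoch count are assumptions chosen so that all three variants converge, and fit_linear is a hypothetical helper rather than a library function.

```python
import random

def fit_linear(xs, ys, lr, epochs, batch_size, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error with (mini-batch) gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= lr * grad_w  # weights = weights - learning_rate * gradient
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free line, so the exact answer is w=2, b=1

w_batch, b_batch = fit_linear(xs, ys, lr=0.1, epochs=2000, batch_size=len(xs))  # batch GD
w_mini, b_mini = fit_linear(xs, ys, lr=0.1, epochs=2000, batch_size=4)          # mini-batch
w_sgd, b_sgd = fit_linear(xs, ys, lr=0.1, epochs=2000, batch_size=1)            # pure SGD

for name, w, b in [("batch", w_batch, b_batch), ("mini-batch", w_mini, b_mini), ("sgd", w_sgd, b_sgd)]:
    print(f"{name:>10}: w={w:.4f} b={b:.4f}")
```

Because the data are noise-free, even single-sample SGD settles on the exact line here. On noisy data its iterates would keep bouncing around the minimum unless the learning rate is decayed.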

6. What is the difference between classification and regression?

Classification: Predict a category (spam/not spam, disease/no disease)
Regression: Predict a continuous value (house price, temperature, stock price)
Binary classification: 2 classes
Multi-class: 3+ classes
Multi-label: Multiple labels per example

7. How do you handle imbalanced datasets?

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# X_train, y_train: the training split only. Never resample validation or test data,
# or the reported metrics will not reflect the real class distribution.

# Option 1: Oversampling (SMOTE synthesizes new minority-class samples)
sm = SMOTE(random_state=42)
X_over, y_over = sm.fit_resample(X_train, y_train)

# Option 2: Undersampling (reduce majority class)
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)

# Option 3: Class weights (penalize mistakes on the minority class more heavily)
classes = np.unique(y_train)  # recent scikit-learn versions expect an ndarray here, not a list
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, weights))

# Many scikit-learn estimators compute the same weights internally:
model = RandomForestClassifier(class_weight='balanced', random_state=42)

# Option 4: Use appropriate metric
# Don't use accuracy for imbalanced data!
# Use: F1, AUC-ROC, Precision-Recall AUC
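
Two of the points above are easy to verify by hand: what the 'balanced' weighting computes, and why accuracy misleads. The sketch below uses an assumed 95/5 label split and the documented 'balanced' heuristic, n_samples / (n_classes * class_count), written out in plain Python instead of calling scikit-learn.

```python
from collections import Counter

y = [0] * 95 + [1] * 5  # assumed toy labels: 95% negative, 5% positive

# 'balanced' heuristic: n_samples / (n_classes * count of that class)
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)
balanced = {c: n_samples / (n_classes * n) for c, n in counts.items()}
print(balanced[1])  # 10.0, i.e. the rare class is weighted 19x more than class 0

# Accuracy paradox: a baseline that always predicts the majority class
y_pred = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, y_pred)) / len(y)
tp = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.95 0.0
```

The baseline scores 95% accuracy while catching none of the positives, which is why recall, F1, or a precision-recall curve is the more honest summary on skewed data.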

8. Explain the transformer architecture

Transformers (introduced in “Attention Is All You Need”, 2017) power BERT, GPT-style LLMs, and many modern NLP and vision models:

Self-attention: Each token attends to every token in the sequence, which captures long-range dependencies
Multi-head attention: Multiple attention heads run in parallel and learn different relationships
Positional encoding: Injects sequence position information, since attention by itself ignores token order
Feed-forward layers: Transform the attended representation at each position independently
Encoder-Decoder: Encoder encodes input; Decoder generates output (translation, summarization)
Decoder-only: GPT-style; generates text autoregressively, with a causal mask so each token sees only earlier tokens
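
Being able to write scaled dot-product attention from memory is a common follow-up. The sketch below is a deliberately stripped-down, single-head version in plain Python. As a simplifying assumption it uses the token vectors directly as queries, keys, and values (no learned projection matrices, no multi-head split, no positional encoding), and the three 2-dimensional "tokens" are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head attention with Q = K = V = x; returns (outputs, attention weights)."""
    d = len(x[0])
    weights, outputs = [], []
    for i, query in enumerate(x):
        scores = []
        for j, key in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # decoder-style mask: hide future positions
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(d))  # scale by sqrt(d) to keep scores moderate
        row = softmax(scores)
        weights.append(row)
        # Output for token i is the attention-weighted average of all value vectors
        outputs.append([sum(row[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
out_causal, attn_causal = self_attention(tokens, causal=True)

print(attn_causal[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Each row of attention weights sums to 1, so every output is a convex combination of the value vectors. Setting causal=True reproduces the masking that lets decoder-only models generate left to right.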

ML interview success: know the bias-variance tradeoff intuitively (underfitting vs overfitting), explain evaluation metrics clearly (when to use precision vs recall vs F1), understand regularization, and demonstrate knowledge of gradient descent variants. Production ML questions cover feature engineering, model monitoring, and MLOps patterns.