Machine learning has never been more accessible. In 2026, scikit-learn, PyTorch, and the Hugging Face ecosystem provide everything you need to build real ML models without a PhD. This beginner's guide covers the core concepts, algorithms, and workflow for getting your first ML model working.
The Machine Learning Workflow
ML Workflow:
1. Define the problem (classification? regression? clustering?)
2. Collect and clean data
3. Explore the data (EDA — visualize, understand distributions)
4. Feature engineering (transform raw data into useful features)
5. Split: train (70%), validation (15%), test (15%)
6. Choose and train model
7. Evaluate on validation set
8. Tune hyperparameters
9. Final evaluation on test set (once!)
10. Deploy and monitor
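Step 5 above can be sketched with two calls to `train_test_split`. This is a minimal illustration on synthetic data, not the churn dataset used later; the array sizes and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 5 features, balanced binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([0, 1] * 500)

# 15% of the rows, as an absolute count, for each held-out set.
n_holdout = round(0.15 * len(X))

# First carve off the test set, then take the validation set from what remains.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

Passing an integer `test_size` asks for an exact row count, which avoids rounding surprises when a fraction such as 0.15 / 0.85 is applied to the remaining rows.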
Setup
pip install scikit-learn pandas numpy matplotlib seaborn
pip install xgboost lightgbm
pip install jupyter # interactive exploration
# Or use Google Colab (free GPU/TPU)
Classification: Predicting Categories
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
df = pd.read_csv("customer_churn.csv")
# Explore
print(df.shape) # (10000, 15)
print(df.dtypes)
print(df.isnull().sum()) # check missing values
print(df["churn"].value_counts(normalize=True)) # class balance
# Feature engineering
df["tenure_months"] = df["tenure"] * 12  # assumes "tenure" is recorded in years
df["high_value"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)
# Encode categorical variables (one encoder per column, so each mapping is kept)
encoders = {}
for col in ["contract", "internet_service"]:
    le = LabelEncoder()
    df[f"{col}_encoded"] = le.fit_transform(df[col])
    encoders[col] = le
# Prepare features and target
FEATURES = ["tenure_months", "monthly_charges", "total_charges",
            "contract_encoded", "internet_service_encoded", "high_value",
            "num_support_tickets"]
TARGET = "churn"
X = df[FEATURES]
y = df[TARGET]
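# Split data first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Handle missing values using medians computed on the training set only
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)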
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train's scaler!
# Train multiple models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    score = model.score(X_test_scaled, y_test)
    roc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
    print(f"{name}: accuracy={score:.3f}, ROC-AUC={roc:.3f}")
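# Best model evaluation (Random Forest is picked here for illustration;
# in a real project, choose the winner on a validation set, not the test set)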
best = models["Random Forest"]
y_pred = best.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
# Feature importance
feat_imp = pd.Series(best.feature_importances_, index=FEATURES).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
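The imports above include `cross_val_score` and `confusion_matrix`, which the walkthrough never calls. The sketch below shows one way they might be used, on synthetic data from `make_classification` rather than the churn CSV (an assumption, since that file is not available here). The scaler is wrapped in a pipeline so that it is refit inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption): 7 numeric features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=7, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
# on that fold's training part only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Fit on the full training set, then inspect errors on the held-out test set.
clf.fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)  # rows are true classes, columns are predicted classes
```

Cross-validated scores give a spread as well as a mean, which makes it easier to judge whether a gap between two models is larger than the fold-to-fold noise.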
Regression: Predicting Numbers
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# House price prediction example
# (separate variable names, so the churn data from the classification example is not overwritten)
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42
)
reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=4,
    random_state=42,
    n_jobs=-1,  # use all CPU cores
)
reg.fit(Xh_train, yh_train)
yh_pred = reg.predict(Xh_test)
mae = mean_absolute_error(yh_test, yh_pred)
rmse = np.sqrt(mean_squared_error(yh_test, yh_pred))
r2 = r2_score(yh_test, yh_pred)
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")
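To make the three regression metrics printed above less opaque, here is a small worked example that computes them by hand in plain Python. The four target and prediction values are made up for illustration.

```python
import math

# Made-up values for illustration (assumption), not taken from the housing data.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
n = len(y_true)

errors = [p - t for p, t in zip(y_pred, y_true)]  # [-0.5, 0.0, 2.0, 1.0]

# MAE: average absolute error, in the same units as the target.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error; penalizes large misses more.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")
# MAE: 0.875, RMSE: 1.146, R2: 0.644
```

RMSE exceeds MAE here because the single error of 2.0 dominates once the errors are squared.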
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define parameter search space
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", 0.3, 0.5],
}
rf = RandomForestClassifier(random_state=42)
# Random search (faster than grid search for large spaces)
search = RandomizedSearchCV(
    rf, param_dist, n_iter=50, cv=5, scoring="roc_auc",
    n_jobs=-1, random_state=42, verbose=1
)
search.fit(X_train_scaled, y_train)
print("Best params:", search.best_params_)
print(f"Best ROC-AUC: {search.best_score_:.3f}")
best_model = search.best_estimator_
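To see what a random search draws from a space like the one above, scikit-learn's `ParameterSampler` can list a few candidate configurations without fitting anything. This is an illustrative sketch with a trimmed-down search space, not part of the original tuning run.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same style of search space as above, trimmed to three entries for brevity.
param_dist = {
    "n_estimators": randint(100, 500),  # integers from 100 to 499 (upper bound excluded)
    "max_depth": randint(5, 30),        # integers from 5 to 29
    "max_features": ["sqrt", "log2", 0.3, 0.5],  # list entries are picked uniformly
}

# Draw five candidate configurations, as a random search would before fitting.
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Note that `scipy.stats.randint(low, high)` excludes `high`, so `randint(1, 10)` for `min_samples_leaf` never yields 10.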
Saving and Loading Models
import joblib
# Save model and preprocessors
joblib.dump(best_model, "churn_model.joblib")
joblib.dump(scaler, "scaler.joblib")
joblib.dump(le, "label_encoder.joblib")  # le is the most recently fitted encoder (internet_service)
# Load and predict
model = joblib.load("churn_model.joblib")
scaler = joblib.load("scaler.joblib")
new_customer = pd.DataFrame([{
    "tenure_months": 24,
    "monthly_charges": 79.99,
    "total_charges": 1919.76,
    "contract_encoded": 1,
    "internet_service_encoded": 2,
    "high_value": 1,
    "num_support_tickets": 2,
}])
new_scaled = scaler.transform(new_customer)
churn_probability = model.predict_proba(new_scaled)[0][1]
print(f"Churn probability: {churn_probability:.1%}")
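Saving the scaler and the model as separate files works, but the two can drift apart if one is retrained without the other. One possible alternative is to bundle them in a single scikit-learn pipeline and save that one object. The sketch below uses synthetic data and a hypothetical file name, `churn_pipeline.joblib`, written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (assumption): 7 numeric columns.
X, y = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)

# Scaler and model travel together as one object.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline applies the same scaling before predicting.
proba_before = pipeline.predict_proba(X[:5])[:, 1]
proba_after = restored.predict_proba(X[:5])[:, 1]
print(proba_after)
```

With this layout, prediction code calls `restored.predict_proba(raw_features)` directly and never has to remember to scale first.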
Next Steps After This Guide
- Learn XGBoost/LightGBM: the gradient boosting libraries behind many winning Kaggle solutions
- Study feature engineering: as a rule of thumb, roughly 80% of ML performance comes from good features
- Learn PyTorch: for deep learning, NLP, and computer vision
- Practice on Kaggle: real datasets, community, competitions
- Read Hands-On ML with Scikit-Learn: the book by Aurélien Géron
Machine learning in 2026 is accessible to any Python developer. Start with scikit-learn for classical ML, get the train/validation/test split right, evaluate several models, and tune hyperparameters. The algorithm matters less than good data and proper validation.