Machine Learning for Beginners 2026: Scikit-learn, Algorithms, and Workflow

Machine learning has never been more accessible. In 2026, scikit-learn, PyTorch, and the Hugging Face ecosystem provide everything you need to build real ML models without a PhD. This beginner's guide covers the core concepts, algorithms, and workflow needed to get your first ML model working.

The Machine Learning Workflow

ML Workflow:
1. Define the problem (classification? regression? clustering?)
2. Collect and clean data
3. Explore the data (EDA — visualize, understand distributions)
4. Feature engineering (transform raw data into useful features)
5. Split: train (70%), validation (15%), test (15%)
6. Choose and train model
7. Evaluate on validation set
8. Tune hyperparameters
9. Final evaluation on test set (once!)
10. Deploy and monitor
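
Step 5 can be sketched with two calls to train_test_split: one to set aside 30% of the rows, and a second to cut that portion in half. The synthetic arrays and variable names below are assumptions for illustration, not part of the churn dataset used later in this guide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 5 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First set aside 30% for validation + test, keeping class proportions
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Then split that 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is what you use for model comparison and tuning (steps 7 and 8); the test set stays untouched until step 9.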

Setup

pip install scikit-learn pandas numpy matplotlib seaborn
pip install xgboost lightgbm
pip install jupyter  # interactive exploration

# Or use Google Colab (free GPU/TPU)

Classification: Predicting Categories

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv("customer_churn.csv")

# Explore
print(df.shape)         # (10000, 15)
print(df.dtypes)
print(df.isnull().sum()) # check missing values
print(df["churn"].value_counts(normalize=True))  # class balance

# Feature engineering
df["tenure_months"] = df["tenure"] * 12  # assumes tenure is recorded in years
df["high_value"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)

# Encode categorical variables with one encoder per column;
# refitting a single LabelEncoder would discard the first column's mapping
encoders = {}
for col in ["contract", "internet_service"]:
    le = LabelEncoder()
    df[f"{col}_encoded"] = le.fit_transform(df[col])
    encoders[col] = le

# Prepare features and target
FEATURES = ["tenure_months", "monthly_charges", "total_charges",
            "contract_encoded", "internet_service_encoded", "high_value",
            "num_support_tickets"]
TARGET = "churn"

X = df[FEATURES]
y = df[TARGET]

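# Split data first so that imputation statistics come from the training set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Handle missing values with training-set medians (avoids leaking test data)
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)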

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use train's scaler!

# Train multiple models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    score = model.score(X_test_scaled, y_test)
    roc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
    print(f"{name}: accuracy={score:.3f}, ROC-AUC={roc:.3f}")

# Best model evaluation
best = models["Random Forest"]
y_pred = best.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Feature importance
feat_imp = pd.Series(best.feature_importances_, index=FEATURES).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
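
A single train/test split gives one noisy score per model. As a complement, here is a minimal sketch of 5-fold cross-validation with a Pipeline, which refits the scaler inside each fold. It runs on synthetic data from make_classification (an assumption standing in for the churn CSV), and the names X_demo, y_demo, and pipe are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)

# The pipeline refits the scaler inside every fold, so validation rows
# never influence the scaling statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")

print(f"ROC-AUC per fold: {cv_scores.round(3)}")
print(f"Mean ROC-AUC: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

The spread across folds is a rough indication of how much a single-split score could move by chance.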

Regression: Predicting Numbers

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# House price prediction example
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target

# Separate names keep the churn split (X_train_scaled, y_train) intact for later sections
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=4,
    random_state=42,
    n_jobs=-1  # use all CPU cores
)
reg.fit(Xh_train, yh_train)
yh_pred = reg.predict(Xh_test)

mae = mean_absolute_error(yh_test, yh_pred)
rmse = np.sqrt(mean_squared_error(yh_test, yh_pred))
r2 = r2_score(yh_test, yh_pred)
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")
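
To make the three regression metrics less of a black box, the sketch below recomputes them by hand with NumPy and compares them with scikit-learn's versions. MAE is the mean absolute error, RMSE is the square root of the mean squared error, and R2 is one minus the ratio of residual to total sum of squares. The four toy values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen for illustration only (assumption)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))           # average absolute miss
rmse_manual = np.sqrt(np.mean(errors ** 2))    # penalizes large misses more
ss_res = np.sum(errors ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The manual results should agree with scikit-learn's implementations
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))

print(f"MAE: {mae_manual:.3f}, RMSE: {rmse_manual:.3f}, R2: {r2_manual:.3f}")
# MAE: 0.750, RMSE: 0.935, R2: 0.724
```

MAE and RMSE are in the units of the target, while R2 is unitless, which is why it is common to report at least one of each.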

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define parameter search space
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", 0.3, 0.5],
}

rf = RandomForestClassifier(random_state=42)

# Random search (faster than grid search for large spaces)
search = RandomizedSearchCV(
    rf, param_dist, n_iter=50, cv=5, scoring="roc_auc",
    n_jobs=-1, random_state=42, verbose=1
)
search.fit(X_train_scaled, y_train)

print("Best params:", search.best_params_)
print(f"Best ROC-AUC: {search.best_score_:.3f}")

best_model = search.best_estimator_
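
Note that best_score_ is a cross-validated score computed on the training folds, not a test score. The sketch below shows the last step of the workflow on a deliberately small search: tune on the training portion, then score the chosen estimator once on held-out rows. The data is synthetic and the names small_search, X_tr, and X_te are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# A small search space and few iterations keep this example fast
small_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": randint(20, 60), "max_depth": randint(2, 8)},
    n_iter=5, cv=3, scoring="roc_auc", random_state=0,
)
small_search.fit(X_tr, y_tr)

# Mean cross-validated ROC-AUC of the best candidate (training folds only)
cv_auc = small_search.best_score_

# Held-out estimate: computed once, after tuning is finished
test_proba = small_search.best_estimator_.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_proba)

print(f"CV ROC-AUC: {cv_auc:.3f}, test ROC-AUC: {test_auc:.3f}")
```

If the test score is much lower than the cross-validated one, that usually points to overfitting during tuning or to a mismatch between the splits.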

Saving and Loading Models

import joblib

# Save model and preprocessors
joblib.dump(best_model, "churn_model.joblib")
joblib.dump(scaler, "scaler.joblib")
joblib.dump(le, "label_encoder.joblib")  # note: le holds only the encoder fitted last (internet_service)

# Load and predict
model = joblib.load("churn_model.joblib")
scaler = joblib.load("scaler.joblib")

new_customer = pd.DataFrame([{
    "tenure_months": 24,
    "monthly_charges": 79.99,
    "total_charges": 1919.76,
    "contract_encoded": 1,
    "internet_service_encoded": 2,
    "high_value": 1,
    "num_support_tickets": 2,
}])

new_scaled = scaler.transform(new_customer)
churn_probability = model.predict_proba(new_scaled)[0][1]
print(f"Churn probability: {churn_probability:.1%}")
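
Saving the scaler and the model as separate files works, but it is easy to load one without the other. One alternative, sketched here under the assumption that all preprocessing fits in a scikit-learn Pipeline, is to persist a single object. The data is synthetic and the file name churn_pipeline.joblib is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

# Scaler and model travel together as one fitted object
bundle = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
bundle.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "churn_pipeline.joblib"  # hypothetical file name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same_output = np.allclose(
    bundle.predict_proba(X_demo[:5]), restored.predict_proba(X_demo[:5])
)
print("Restored pipeline matches original:", same_output)
```

As with any joblib or pickle file, load it only from sources you trust and with a scikit-learn version compatible with the one that saved it.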

Next Steps After This Guide

  1. Learn XGBoost/LightGBM: frequent choices in winning Kaggle solutions
  2. Study feature engineering: a common rule of thumb holds that roughly 80% of ML performance comes from good features
  3. Learn PyTorch: for deep learning, NLP, and computer vision
  4. Practice on Kaggle: real datasets, community, competitions
  5. Read Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: the book by Aurélien Géron

Machine learning in 2026 is accessible to any Python developer. Start with scikit-learn for classical ML, get the train/validation/test split right, evaluate several models, and tune hyperparameters. The algorithm matters less than good data and proper validation.
