Machine Learning for Beginners 2026: scikit-learn, Algorithms, and Workflow


Machine learning has never been more accessible. In 2026, scikit-learn, PyTorch, and the Hugging Face ecosystem provide everything you need to build real ML models without a PhD. This beginner's guide covers the core concepts, algorithms, and workflows needed to get your first ML model running.

The Machine Learning Workflow

ML Workflow:
1. Define the problem (classification? regression? clustering?)
2. Collect and clean data
3. Explore the data (EDA — visualize, understand distributions)
4. Feature engineering (transform raw data into useful features)
5. Split: train (70%), validation (15%), test (15%)
6. Choose and train model
7. Evaluate on validation set
8. Tune hyperparameters
9. Final evaluation on test set (once!)
10. Deploy and monitor
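
Step 5 calls for a 70/15/15 split, but scikit-learn's train_test_split only cuts a dataset in two. One common way to get three parts is to call it twice, as in this minimal sketch. The toy arrays are an assumption made for illustration; they are not the article's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 5 features, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First cut: keep 70% for training, hold out 30% for validation + test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Second cut: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```

The validation part is used for model comparison and tuning (steps 7 and 8); the test part stays untouched until step 9.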

Setup

pip install scikit-learn pandas numpy matplotlib seaborn
pip install xgboost lightgbm
pip install jupyter  # interactive exploration

# Or use Google Colab (free GPU/TPU)

Classification: Predicting Categories

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv("customer_churn.csv")

# Explore
print(df.shape)         # (10000, 15)
print(df.dtypes)
print(df.isnull().sum()) # check missing values
print(df["churn"].value_counts(normalize=True))  # class balance

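# Feature engineering
df["tenure_months"] = df["tenure"] * 12  # assumes "tenure" is recorded in years
df["high_value"] = (df["monthly_charges"] > df["monthly_charges"].median()).astype(int)

# Encode categorical variables.
# Fit one encoder per column: reusing a single LabelEncoder would overwrite
# the first column's mapping when the second column is fitted.
encoders = {}
for col in ["contract", "internet_service"]:
    le = LabelEncoder()
    df[f"{col}_encoded"] = le.fit_transform(df[col])
    encoders[col] = le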

# Prepare features and target
FEATURES = ["tenure_months", "monthly_charges", "total_charges",
            "contract_encoded", "internet_service_encoded", "high_value",
            "num_support_tickets"]
TARGET = "churn"

X = df[FEATURES]
y = df[TARGET]

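# Handle missing values
# Simplification: the medians come from the full dataset. A stricter setup
# computes them on the training split only, so no test information leaks in.
X = X.fillna(X.median())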

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use train's scaler!

# Train multiple models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    score = model.score(X_test_scaled, y_test)
    roc = roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1])
    print(f"{name}: accuracy={score:.3f}, ROC-AUC={roc:.3f}")

# Best model evaluation
best = models["Random Forest"]
y_pred = best.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Feature importance
feat_imp = pd.Series(best.feature_importances_, index=FEATURES).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
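
The imports above include cross_val_score, which the example never uses. As a hedged sketch of where it could fit: wrapping imputation, scaling, and the model in a scikit-learn Pipeline lets cross-validation refit every preprocessing step on each training fold, so fold statistics are never computed on held-out rows. The synthetic data from make_classification is a stand-in, not the churn dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn table (assumption: not the article's data)
X, y = make_classification(n_samples=500, n_features=7, random_state=42)

# Blank out roughly 5% of the values to mimic missing data
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Imputer and scaler are refit inside every cross-validation fold
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"Mean ROC-AUC: {scores.mean():.3f}")
```

Five fold scores give a rough sense of how much the metric varies, which a single train/test split cannot show.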

Regression: Predicting Numbers

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

# House price prediction example
# Separate variable names keep the churn split (X_train_scaled, y_train)
# intact for the tuning section that follows.
data = fetch_california_housing()
X_reg, y_reg = data.data, data.target

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg_model = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=4,
    random_state=42,
    n_jobs=-1  # use all CPU cores
)
reg_model.fit(X_reg_train, y_reg_train)
y_reg_pred = reg_model.predict(X_reg_test)

mae = mean_absolute_error(y_reg_test, y_reg_pred)
rmse = np.sqrt(mean_squared_error(y_reg_test, y_reg_pred))
r2 = r2_score(y_reg_test, y_reg_pred)
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")
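
To make the three metrics less of a black box, here is a small worked example that computes MAE, RMSE, and R2 by hand with NumPy and checks them against scikit-learn. The four target values and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat  # [0.5, 0.0, -2.0, -1.0]

# MAE: average absolute error -> (0.5 + 0 + 2 + 1) / 4
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error -> sqrt(5.25 / 4)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R2: 1 - (residual sum of squares / total sum of squares) -> 1 - 5.25 / 14.75
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: {mae_manual:.3f}, RMSE: {rmse_manual:.3f}, R2: {r2_manual:.3f}")
# MAE: 0.875, RMSE: 1.146, R2: 0.644

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

RMSE is larger than MAE here because squaring weights the single 2.0 miss more heavily than the smaller errors.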

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter search space
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", 0.3, 0.5],
}

rf = RandomForestClassifier(random_state=42)

# Random search (faster than grid search for large spaces)
search = RandomizedSearchCV(
    rf, param_dist, n_iter=50, cv=5, scoring="roc_auc",
    n_jobs=-1, random_state=42, verbose=1
)
search.fit(X_train_scaled, y_train)

print("Best params:", search.best_params_)
print(f"Best ROC-AUC: {search.best_score_:.3f}")

best_model = search.best_estimator_
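
The best_score_ reported by the search is a cross-validated score on the training data. Step 9 of the workflow still applies: score the tuned model once on data the search never saw. The sketch below shows that final check on synthetic data, with a deliberately small search space and n_iter so that it runs quickly; those settings are assumptions for illustration, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the churn dataset)
X, y = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small search so the example finishes in seconds
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(20, 60), "max_depth": randint(2, 8)},
    n_iter=5, cv=3, scoring="roc_auc", random_state=42,
)
search.fit(X_tr, y_tr)  # the test rows are never shown to the search

# Cross-validated score (training folds) vs. one-time score on the test set
cv_auc = search.best_score_
test_auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
print(f"CV ROC-AUC: {cv_auc:.3f}, test ROC-AUC: {test_auc:.3f}")
```

If the test score is far below the cross-validated one, that usually points to overfitting during tuning or to a leak in preprocessing.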

Saving and Loading Models

import joblib

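# Save model and preprocessors
joblib.dump(best_model, "churn_model.joblib")
joblib.dump(scaler, "scaler.joblib")
# Note: le holds only the encoder fitted last (internet_service);
# every encoded column needs its own saved encoder at prediction time.
joblib.dump(le, "label_encoder.joblib")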

# Load and predict
model = joblib.load("churn_model.joblib")
scaler = joblib.load("scaler.joblib")

new_customer = pd.DataFrame([{
    "tenure_months": 24,
    "monthly_charges": 79.99,
    "total_charges": 1919.76,
    "contract_encoded": 1,
    "internet_service_encoded": 2,
    "high_value": 1,
    "num_support_tickets": 2,
}])

new_scaled = scaler.transform(new_customer)
churn_probability = model.predict_proba(new_scaled)[0][1]
print(f"Churn probability: {churn_probability:.1%}")
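
Keeping the scaler and the model in separate files works, but the two can drift out of sync. One alternative, sketched here under the assumption of synthetic data and a hypothetical file name churn_pipeline.joblib, is to bundle both in a single scikit-learn Pipeline and save that one object.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# Scaler and model travel together as one object
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X, y)

# Save to a temporary directory and load it back (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "churn_pipeline.joblib"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts in a single call
same = np.allclose(pipe.predict_proba(X[:5]), restored.predict_proba(X[:5]))
print("Round trip preserved predictions:", same)
```

Loading a joblib file executes pickled code, so only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.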

Next Steps After This Guide

  1. Learn XGBoost/LightGBM: gradient-boosting libraries widely used to win Kaggle competitions
  2. Study feature engineering: by a common rule of thumb, around 80% of ML performance comes from good features
  3. Learn PyTorch: for deep learning, NLP, and computer vision
  4. Practice on Kaggle: real datasets, community, competitions
  5. Read Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron

Machine learning in 2026 is accessible to any Python developer. Start with scikit-learn for classical ML, get the train/validation/test split right, compare several models, and tune hyperparameters. The choice of algorithm matters less than good data and proper validation.
