RAG Tutorial 2026: Build AI Apps with LLMs and Vector Search


Retrieval-Augmented Generation (RAG) is the dominant pattern for building production AI applications in 2026. RAG grounds LLM answers in real data, reduces hallucinations about domain knowledge, and keeps your AI up to date without expensive fine-tuning. This guide builds a production-style RAG system from scratch.

📋 Table of Contents

Why RAG?
RAG Architecture
Setup: Install Dependencies
Step 1: Document Ingestion
Step 2: Retrieval
Step 3: Generation with Claude
FastAPI RAG API
Advanced RAG Patterns
Production Considerations

Why RAG?

LLMs have a knowledge cutoff and cannot access your private data. RAG addresses both problems:

Fewer hallucinations about facts: the model answers from retrieved documents, not from memory
Private data: index your own PDFs, databases, and wikis
Current answers: update the knowledge base, not the model
Cheaper than fine-tuning: no training costs, instant updates
Source attribution: show which documents were used for each answer

RAG Architecture

RAG Pipeline:

[Your Documents]
     ↓ chunk + embed
[Vector Database] (Pinecone, Qdrant, ChromaDB)
     ↑ semantic search
[User Query] → embed → search → [Top-K Chunks]
                                      ↓
                              [LLM Prompt]
                              "Using these documents: {chunks}
                               Answer: {query}"
                                      ↓
                              [Grounded Answer]
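
The flow in the diagram can be sketched end to end without any external services. This is only an illustration, assuming a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database. The names `embed`, `cosine`, `top_k_chunks`, and `build_prompt` are hypothetical and simply mirror the boxes above; the sample sentences are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts (a stand-in for a real embedding model).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # "Semantic search": rank chunks by similarity to the query and keep the best k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Assemble the grounded prompt that would be sent to the LLM.
    context = "\n---\n".join(chunks)
    return f"Using these documents:\n{context}\n\nAnswer: {query}"

# Made-up sample chunks, for illustration only.
chunks = [
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Refunds for returns are issued to the original payment method.",
]
question = "How do returns work?"
selected = top_k_chunks(question, chunks)
prompt = build_prompt(question, selected)
print(prompt)
```

The rest of this guide replaces each toy piece with a real component: sentence-transformers for `embed`, ChromaDB for the search, and Claude for the final answer.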

Setup: Install Dependencies

pip install anthropic chromadb sentence-transformers pypdf langchain langchain-community fastapi uvicorn python-dotenv rank-bm25

Step 1: Document Ingestion

import anthropic
from pathlib import Path
from pypdf import PdfReader
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB (local vector store)
client = chromadb.PersistentClient(path="./chroma_db")

# Use sentence-transformers for embeddings (free, local)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # fast, good quality, 384 dims
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}  # cosine distance, so 1 - distance is a similarity
)

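def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split text into overlapping chunks of up to chunk_size words.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each window starts `step` words after the previous one
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks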

def ingest_pdf(pdf_path: str) -> int:
    # Ingest a PDF into the vector store.
    reader = PdfReader(pdf_path)
    all_chunks = []
    metadatas = []
    ids = []

    for page_num, page in enumerate(reader.pages):
        text = page.extract_text()
        if not text.strip():
            continue

        chunks = chunk_text(text)
        for i, chunk in enumerate(chunks):
            chunk_id = f"{Path(pdf_path).stem}_p{page_num}_c{i}"
            all_chunks.append(chunk)
            metadatas.append({
                "source": pdf_path,
                "page": page_num + 1,
                "chunk": i
            })
            ids.append(chunk_id)

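    # Add to ChromaDB. upsert lets a re-ingested file overwrite its existing chunks;
    # the call is skipped when the PDF had no extractable text.
    if all_chunks:
        collection.upsert(documents=all_chunks, metadatas=metadatas, ids=ids)
    return len(all_chunks)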

# Ingest documents
print(f"Ingested: {ingest_pdf('company_docs.pdf')} chunks")
print(f"Ingested: {ingest_pdf('product_manual.pdf')} chunks")
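
To see what the overlap does, it helps to work through the arithmetic. The sketch below applies the same sliding-window idea to a synthetic word list. The helper name `window_starts` is hypothetical, and the 1,200-word input is made up. With a chunk size of 500 and an overlap of 50, the window advances 450 words at a time.

```python
def window_starts(n_words: int, chunk_size: int = 500, overlap: int = 50) -> list[int]:
    # Start index of each chunk: the window advances by chunk_size - overlap words.
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, n_words, step))

words = [f"w{i}" for i in range(1200)]  # synthetic 1,200-word "page"
starts = window_starts(len(words))
chunks = [words[s:s + 500] for s in starts]

print(starts)                    # [0, 450, 900]
print([len(c) for c in chunks])  # [500, 500, 300]
# The last 50 words of one chunk are the first 50 words of the next one.
print(chunks[0][-50:] == chunks[1][:50])  # True
```

The final chunk is shorter than the others, which is usually harmless but worth knowing when you inspect retrieval results.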

Step 2: Retrieval

def retrieve(query: str, n_results: int = 5) -> list[dict]:
    # Retrieve relevant chunks for a query.
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )

    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "page": meta["page"],
            "similarity": 1 - dist  # similarity = 1 - distance (only valid for cosine distance)
        })

    # Filter low-relevance chunks
    return [c for c in chunks if c["similarity"] > 0.3]
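
The `1 - dist` conversion is only a similarity when the collection measures cosine distance. To my understanding ChromaDB defaults to squared L2 distance unless the collection is configured otherwise, so the difference matters. As a hedged illustration in plain Python (no ChromaDB involved, function names hypothetical), the sketch below computes both distances for the same pair of unit-length vectors and shows that squared L2 is twice the cosine distance in that case.

```python
import math

def normalize(v: list[float]) -> list[float]:
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def squared_l2(a: list[float], b: list[float]) -> float:
    # Squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

d_cos = cosine_distance(a, b)  # 1 - 8/9
d_l2 = squared_l2(a, b)        # 2 - 2 * 8/9 for unit vectors

print(round(1 - d_cos, 4))  # cosine similarity
print(round(d_l2, 4))       # squared L2 distance
```

With squared L2 distances, `1 - dist` can go negative for chunks that are still fairly similar, so a fixed threshold such as 0.3 would behave differently than intended.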

Step 3: Generation with Claude

import anthropic

claude = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

def rag_query(question: str, n_results: int = 5) -> dict:
    # Answer a question using RAG.

    # Retrieve relevant context
    chunks = retrieve(question, n_results)

    if not chunks:
        return {
            "answer": "I couldn't find relevant information in the documents.",
            "sources": [],
            "tokens_used": 0
        }

    # Build context string
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}, Page {c['page']}]\n{c['text']}"
        for c in chunks
    ])

    # Create prompt
    system = (
        "You are a helpful assistant that answers questions based ONLY on "
        "the provided documents. If the answer is not in the documents, say so clearly. "
        "Always cite the source document and page number for your answers."
    )

    user_message = (
        f"Documents:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based only on the provided documents, citing sources."
    )

    # Call Claude
    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_message}]
    )

    return {
        "answer": response.content[0].text,
        "sources": [{"source": c["source"], "page": c["page"]} for c in chunks],
        "tokens_used": response.usage.input_tokens + response.usage.output_tokens
    }

# Use it
result = rag_query("What is the return policy?")
print(result["answer"])
print("Sources:", result["sources"])
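
Retrieved chunks can add up to more text than you want to spend on a prompt. One possible refinement, sketched here under the assumption that chunks arrive sorted best-first and that a character budget is an acceptable rough proxy for tokens, is to add chunks until the budget is used up. `build_context` and `max_chars` are hypothetical names, not part of the code above, and the sample chunks are made up.

```python
def build_context(chunks: list[dict], max_chars: int = 4000) -> str:
    # Join chunks (assumed sorted best-first) until the character budget is used up.
    separator = "\n\n---\n\n"
    parts: list[str] = []
    used = 0
    for c in chunks:
        block = f"[Source: {c['source']}, Page {c['page']}]\n{c['text']}"
        cost = len(block) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(block)
        used += cost
    return separator.join(parts)

# Made-up chunks: each formatted block is 124 characters long.
sample = [
    {"source": "a.pdf", "page": 1, "text": "x" * 100},
    {"source": "b.pdf", "page": 2, "text": "y" * 100},
    {"source": "c.pdf", "page": 3, "text": "z" * 100},
]
context = build_context(sample, max_chars=260)
print(len(context))
```

Here the first two chunks fit within 260 characters and the third is dropped.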

FastAPI RAG API

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG API")

class QueryRequest(BaseModel):
    question: str
    n_results: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list[dict]
    tokens_used: int = 0  # defaults to 0 if the result carries no token count

@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest):
    # Plain def: FastAPI runs it in a thread pool, so the blocking Claude call
    # does not stall the event loop.
    if not request.question.strip():
        raise HTTPException(400, "Question cannot be empty")

    result = rag_query(request.question, request.n_results)
    return QueryResponse(**result)

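@app.post("/ingest")
def ingest_document(file_path: str):
    # file_path is a query parameter naming a PDF on the server's filesystem;
    # restrict it to a trusted directory before exposing this endpoint.
    try:
        count = ingest_pdf(file_path)
    except FileNotFoundError:
        raise HTTPException(404, f"File not found: {file_path}")
    return {"status": "ok", "chunks_indexed": count}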

# Run: uvicorn main:app --reload
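
For completeness, here is a minimal sketch of a client for the /query endpoint that uses only the standard library. It assumes the API is running locally at http://localhost:8000 (uvicorn's default port); `build_query_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

def build_query_request(question: str, n_results: int = 5,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    # Build a POST request whose JSON body matches the QueryRequest model.
    body = json.dumps({"question": question, "n_results": n_results}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question: str) -> dict:
    # Send the request and decode the JSON response (requires the API to be running).
    with urllib.request.urlopen(build_query_request(question), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_query_request("What is the return policy?")
print(req.get_method(), req.full_url)
```

Calling `ask("What is the return policy?")` against a running server should return a dictionary with the `answer`, `sources`, and `tokens_used` fields defined by `QueryResponse`.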

Advanced RAG Patterns

Hybrid Search (Keyword + Semantic)

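# Combine BM25 keyword search with vector search
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, docs: list[str], alpha: float = 0.5, top_k: int = 5) -> list[str]:
    # docs is assumed to hold the same chunk texts that were indexed in the collection.
    if not docs:
        return []

    # Semantic scores keyed by chunk text: similarity = 1 - distance (valid for cosine distance)
    semantic_results = collection.query(
        query_texts=[query], n_results=10, include=["documents", "distances"]
    )
    semantic_scores = {doc: 1 - dist for doc, dist in
                       zip(semantic_results["documents"][0], semantic_results["distances"][0])}

    # BM25 keyword scores, scaled to [0, 1] so they are comparable to the similarities
    bm25 = BM25Okapi([doc.lower().split() for doc in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    max_bm25 = float(max(bm25_scores))
    if max_bm25 <= 0:
        max_bm25 = 1.0

    # Combine with a weighted sum (a simple score fusion, not Reciprocal Rank Fusion)
    combined = {
        doc: alpha * semantic_scores.get(doc, 0.0) + (1 - alpha) * (score / max_bm25)
        for doc, score in zip(docs, bm25_scores)
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_k]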
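
Reciprocal Rank Fusion (RRF), which the hybrid search code mentions, combines rankings rather than raw scores, so it avoids having to put BM25 scores and vector distances on the same scale. Below is a minimal sketch, assuming each retriever returns a best-first list of chunk IDs. The function name, the sample IDs, and the constant k = 60 (a commonly used default) are illustrative choices, not something the code above prescribes.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of document IDs ordered best-first.
    # A document's fused score is the sum of 1 / (k + rank) over all rankings it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c1", "c2", "c3"]  # hypothetical vector-search ranking
keyword = ["c3", "c1", "c4"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that both retrievers rank highly (`c1`, `c3`) rise to the top, while chunks found by only one retriever fall behind.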

Reranking

# Use a cross-encoder to rerank retrieved chunks
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]

Production Considerations

Chunking strategy: experiment with chunk size (200–1000 tokens) and overlap (10–20%)
Embedding model: OpenAI text-embedding-3-small for quality, local models for cost
Vector DB: ChromaDB or Qdrant for self-hosted, Pinecone for managed
Caching: cache embeddings and frequent query results
Evaluation: RAGAS framework for RAG-specific metrics (faithfulness, relevance)
Streaming: stream Claude responses for better UX
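
As one way to approach the caching item, the sketch below keeps embeddings in memory, keyed by a hash of the text, so repeated queries skip the embedding call. It is an illustration only: `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder for a real embedding function.

```python
import hashlib

class EmbeddingCache:
    # In-memory cache keyed by a hash of the text; embed_fn is any callable text -> vector.
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)  # compute once, reuse afterwards
        return self.store[key]

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding call (hypothetical).
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
cache.get("return policy")
cache.get("return policy")   # served from the cache
cache.get("shipping times")
print(cache.hits, cache.misses)  # 1 2
```

A production setup would more likely use a shared store with expiry, but the lookup-then-compute pattern stays the same.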

RAG is the baseline pattern for enterprise AI in 2026. Start with a simple ChromaDB + Claude setup, measure answer quality with RAGAS, and then tune chunking and retrieval. Combining vector search with LLM reasoning works well for knowledge-intensive applications.
