Python Pandas Tutorial 2026: Data Analysis from Basics to Real Projects

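Pandas is one of the most widely used Python libraries for data analysis. Whether you are cleaning CSVs, aggregating sales data, or preparing datasets for machine learning, pandas is usually the first tool to reach for. This tutorial covers the essential operations every data professional needs: creating and selecting data, cleaning, grouping, merging, applying functions, and exporting results.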

📋 Table of Contents

Install and Import
Creating DataFrames
Selecting Data
Cleaning Data
Grouping and Aggregation
Merging DataFrames
Applying Functions
Export Results
Conclusion

Install and Import

pip install pandas numpy

import pandas as pd
import numpy as np

print(pd.__version__)  # examples assume pandas 2.2 or later

Creating DataFrames

# From dict
df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'age':    [25, 30, 35, 28],
    'salary': [70000, 85000, 92000, 78000],
    'dept':   ['Eng', 'Mktg', 'Eng', 'HR'],
})

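# From CSV (data.csv must exist in the working directory)
df_csv = pd.read_csv('data.csv')

# From JSON (data.json must exist in the working directory)
df_json = pd.read_json('data.json')

print(df.head())       # First 5 rows
df.info()              # Column types and non-null counts (prints directly, returns None)
print(df.describe())   # Summary statistics for numeric columns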
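
When loading a CSV, it often helps to state column types up front rather than rely on inference. The sketch below reads from an in-memory string so it runs without a file on disk; the CSV content and the `hired` column are made up for illustration and are not part of the tutorial's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """name,age,salary,dept,hired
Alice,25,70000,Eng,2021-03-01
Bob,30,85000,Mktg,2019-07-15
Carol,35,,Eng,2018-01-10
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'name': 'string', 'dept': 'category'},  # explicit types instead of inference
    parse_dates=['hired'],                          # parse this column as datetime
)

print(df.dtypes)
print(df['salary'].isna().sum())  # number of missing salaries
```

The empty salary field for Carol is read as NaN, which is why `salary` comes back as a float column even though the other values are whole numbers.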

Selecting Data

# Select column
df['name']              # Series
df[['name', 'salary']]  # DataFrame

# Select rows by condition
df[df['salary'] > 80000]
df[(df['dept'] == 'Eng') & (df['age'] < 35)]

# iloc — by position
df.iloc[0]       # First row
df.iloc[1:3]     # Rows at positions 1 and 2 (end is excluded)

# loc — by label
df.loc[df['dept'] == 'Eng', ['name', 'salary']]
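
One detail that trips people up is that slices behave differently in `iloc` and `loc`. A small sketch on the same four-row DataFrame, which also shows `isin()` as a shorter way to match several values:

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'age':    [25, 30, 35, 28],
    'salary': [70000, 85000, 92000, 78000],
    'dept':   ['Eng', 'Mktg', 'Eng', 'HR'],
})

by_position = df.iloc[1:3]   # positions 1 and 2; the end is excluded
by_label = df.loc[1:3]       # labels 1 through 3; the end is included

# isin() replaces a chain of == comparisons joined with |
eng_or_hr = df[df['dept'].isin(['Eng', 'HR'])]

print(len(by_position), len(by_label))
print(eng_or_hr['name'].tolist())
```

With the default integer index the two slices look alike, so the difference is easy to miss until the index holds dates or strings.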

Cleaning Data

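# Count nulls per column
df.isnull().sum()

# Option 1: drop rows with any null
df = df.dropna()

# Option 2: fill nulls instead (pick one; after dropna() there is nothing left to fill)
# Assigning back avoids the chained inplace call, which pandas 2.2 warns about
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Remove duplicates
df = df.drop_duplicates()

# Rename columns (stored separately, since later sections still use 'name')
renamed = df.rename(columns={'name': 'full_name'})

# Change dtype (astype(int) raises if the column still contains NaN)
df['age'] = df['age'].astype(int)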
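
The cleaning steps can also be chained into a single expression that leaves the original DataFrame untouched. This is a minimal sketch on made-up data; the `raw` and `clean` names and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: one duplicate row, missing salaries, one missing name
raw = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Bob', 'Carol', None],
    'age':    [25.0, 30.0, 30.0, 35.0, 41.0],
    'salary': [70000.0, np.nan, np.nan, 92000.0, 60000.0],
})

clean = (
    raw
    .dropna(subset=['name'])   # drop rows only where name is missing
    .drop_duplicates()         # the two Bob rows are identical, keep one
    .assign(
        # fill remaining missing salaries with the mean of the known ones
        salary=lambda d: d['salary'].fillna(d['salary'].mean()),
        age=lambda d: d['age'].astype(int),
    )
    .reset_index(drop=True)
)

print(clean)
```

Using `dropna(subset=...)` keeps rows that are only missing a salary, so there is still something for `fillna` to fill.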

Grouping and Aggregation

# Group by department, get stats
summary = df.groupby('dept').agg(
    count=('name', 'count'),
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
).reset_index()

print(summary)
# Output for the four-row DataFrame built from the dict earlier:
#    dept  count  avg_salary  max_salary
# 0   Eng      2     81000.0       92000
# 1    HR      1     78000.0       78000
# 2  Mktg      1     85000.0       85000
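
`agg` collapses each group to one row. When the group statistic is needed next to every original row, `transform` is a common alternative. A short sketch on the same sample data; the `dept_avg` and `vs_dept_avg` column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'age':    [25, 30, 35, 28],
    'salary': [70000, 85000, 92000, 78000],
    'dept':   ['Eng', 'Mktg', 'Eng', 'HR'],
})

# transform returns a result aligned to the original rows, one value per row
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')

# Difference between each salary and the average of its department
df['vs_dept_avg'] = df['salary'] - df['dept_avg']

print(df[['name', 'dept', 'salary', 'dept_avg', 'vs_dept_avg']])
```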

Merging DataFrames

# Two DataFrames that share an emp_id column
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'emp_id': [1, 2, 4], 'salary': [70000, 85000, 60000]})

# Inner join (only matching keys): keeps emp_id 1 and 2
merged = employees.merge(salaries, on='emp_id', how='inner')

# Left join (keep all employees): Carol has no match, so her salary is NaN
merged = employees.merge(salaries, on='emp_id', how='left')
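
To see which rows matched and which did not, `merge` accepts `indicator=True`, and `validate` can guard against unexpected duplicate keys. A sketch using the same two DataFrames; the `audit` name is just for illustration.

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'emp_id': [1, 2, 4], 'salary': [70000, 85000, 60000]})

audit = employees.merge(
    salaries,
    on='emp_id',
    how='outer',            # keep keys from both sides
    indicator=True,         # adds a _merge column: left_only, right_only, or both
    validate='one_to_one',  # raises MergeError if emp_id repeats on either side
)

print(audit)

# Rows that exist in only one of the two tables
unmatched = audit[audit['_merge'] != 'both']
print(unmatched['emp_id'].tolist())
```

Checking the unmatched rows before switching to an inner join is a cheap way to notice silently dropped data.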

Applying Functions

# Apply function to column
df['salary_band'] = df['salary'].apply(
    lambda s: 'High' if s > 90000 else ('Mid' if s > 75000 else 'Low')
)

# Apply a function to each selected column (here: scale by the column max)
# Returns a new DataFrame; df itself is unchanged
df[['age', 'salary']].apply(lambda col: col / col.max())

# Vectorized operations (faster than apply)
df['salary_k'] = df['salary'] / 1000
df['senior'] = df['age'] >= 30
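
The salary band computed with `apply` above can also be expressed without a Python-level loop. One possible vectorized version uses `pd.cut`, with bin edges chosen to match the same thresholds (above 90000 is High, above 75000 is Mid, otherwise Low):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'salary': [70000, 85000, 92000, 78000],
})

# Bins are right-inclusive by default: (-inf, 75000], (75000, 90000], (90000, inf)
df['salary_band'] = pd.cut(
    df['salary'],
    bins=[-np.inf, 75000, 90000, np.inf],
    labels=['Low', 'Mid', 'High'],
)

print(df)
```

The result is a categorical column rather than plain strings, which keeps the Low, Mid, High ordering available for sorting.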

Export Results

df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)  # requires an Excel engine such as openpyxl
df.to_json('output.json', orient='records')
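
A quick way to check an export is to write it and read it straight back. The sketch below does this in memory, so no files are created; the two-row DataFrame is made up for illustration.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [70000, 85000]})

# CSV round trip through an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With no path given, to_json returns the JSON text as a string
records = json.loads(df.to_json(orient='records'))

print(restored)
print(records)
```

With `index=False` the row index is not written, so the restored DataFrame has the same two columns as the original.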

Conclusion

Pandas covers most everyday data wrangling tasks. Master groupby, merge, and vectorized operations, and datasets with millions of rows become practical to process on a single machine. Combine it with Matplotlib or Plotly for visualization, or hand off cleaned data to scikit-learn for ML.
