NumPy and Pandas: The Two Libraries Every Python Developer Eventually Needs
A practical guide to NumPy arrays and Pandas DataFrames. Learn vectorization, slicing, data loading, filtering, grouping, and real-world data cleaning.
At some point in your Python journey, you'll need to work with data. Maybe it's a CSV export from a database. Maybe it's sensor readings, financial data, or user analytics. You'll try to process it with plain Python lists and loops, and it'll work -- but it'll be slow and verbose. Then someone will tell you about NumPy and Pandas, and you'll wonder how you ever managed without them.
NumPy handles numerical arrays and mathematical operations. Pandas handles tabular data (think spreadsheets). Together, they form the foundation of data work in Python. This guide covers both from the ground up with practical examples.
Part 1: NumPy
What NumPy Does
NumPy's core contribution is the ndarray (N-dimensional array). Unlike Python lists, NumPy arrays:
- Store data in contiguous memory (much faster access)
- Support element-wise operations without explicit loops
- Are implemented in C under the hood (10-100x faster than pure Python loops)
pip install numpy
import numpy as np
# Python list approach (slow)
numbers = list(range(1_000_000))
squared = [x ** 2 for x in numbers] # List comprehension, ~300ms
# NumPy approach (fast)
numbers = np.arange(1_000_000)
squared = numbers ** 2 # Vectorized operation, ~3ms
That's a 100x speedup. And the code is shorter and more readable.
Creating Arrays
import numpy as np
# From a Python list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
# With built-in functions
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((3, 4)) # 3x4 matrix of ones
empty = np.empty((2, 3)) # 2x3 uninitialized (fast; contents are arbitrary leftovers, not random)
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
# Random arrays
random_uniform = np.random.rand(3, 3) # 3x3 uniform [0, 1)
random_normal = np.random.randn(1000) # 1000 normal distribution samples
random_ints = np.random.randint(0, 100, size=10) # 10 random ints [0, 100)
# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape) # (3, 3)
print(matrix.dtype) # int64
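The np.random functions above use NumPy's legacy global random state. Since NumPy 1.17 the Generator API is the recommended alternative; a minimal sketch, assuming a recent NumPy version:

```python
import numpy as np

# Create a Generator; passing a seed makes results reproducible.
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))          # 3x3 uniform [0, 1)
normal = rng.standard_normal(1000)    # 1000 standard-normal samples
ints = rng.integers(0, 100, size=10)  # 10 random ints in [0, 100)

print(uniform.shape)                         # (3, 3)
print(ints.min() >= 0 and ints.max() < 100)  # True
```

The Generator methods mirror the legacy ones (rand → random, randn → standard_normal, randint → integers), so migrating existing code is mostly a rename.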
Vectorized Operations
This is NumPy's killer feature: operations apply element by element, with no explicit loops.
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Arithmetic
print(a + b) # [11 22 33 44 55]
print(a * b) # [10 40 90 160 250]
print(a ** 2) # [1 4 9 16 25]
print(a / 2) # [0.5 1. 1.5 2. 2.5]
# Comparison (returns boolean array)
print(a > 3) # [False False False True True]
# Math functions
print(np.sqrt(a)) # [1. 1.41 1.73 2. 2.24]
print(np.log(a)) # [0. 0.69 1.10 1.39 1.61]
print(np.exp(a)) # [2.72 7.39 20.09 54.60 148.41]
print(np.sin(a)) # Sine of each element
# Aggregations
print(np.mean(a)) # 3.0
print(np.std(a)) # 1.414
print(np.sum(a)) # 15
print(np.min(a)) # 1
print(np.max(a)) # 5
print(np.argmax(a)) # 4 (index of max value)
Indexing and Slicing
a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
# Basic indexing
print(a[0]) # 10
print(a[-1]) # 90
print(a[2:5]) # [30 40 50]
# Boolean indexing (extremely useful)
mask = a > 40
print(mask) # [False False False False True True True True True]
print(a[mask]) # [50 60 70 80 90]
print(a[a > 40]) # Same thing, shorthand
# Fancy indexing
indices = [0, 2, 4]
print(a[indices]) # [10 30 50]
# 2D slicing
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(m[0, :]) # [1 2 3] (first row)
print(m[:, 1]) # [2 5 8] (second column)
print(m[0:2, 0:2]) # [[1 2], [4 5]] (top-left 2x2)
Boolean indexing is one of the most powerful features in NumPy. It lets you filter arrays without loops:
temperatures = np.array([72, 85, 68, 91, 77, 95, 63, 88])
# Find all temperatures above 80
hot_days = temperatures[temperatures > 80]
print(hot_days) # [85 91 95 88]
# Count how many
print(np.sum(temperatures > 80)) # 4
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
moderate = temperatures[(temperatures > 70) & (temperatures < 90)]
print(moderate) # [72 85 77 88]
Reshaping
a = np.arange(12)
print(a) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape to 3x4 matrix
b = a.reshape(3, 4)
print(b)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
# -1 means "figure it out"
c = a.reshape(2, -1) # 2 rows, auto-calculate columns (6)
print(c.shape) # (2, 6)
# Flatten back
print(b.ravel()) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Transpose
print(b.T.shape) # (4, 3)
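One gotcha worth knowing: reshape returns a view of the original data whenever it can, not a copy. A small sketch of what that means in practice:

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)

# b shares memory with a, so writing through b is visible in a.
b[0, 0] = 99
print(a[0])  # 99

# Use .copy() when you need an independent array.
c = a.reshape(3, 4).copy()
c[0, 0] = -1
print(a[0])  # still 99
```

The same applies to ravel() (view when possible) versus flatten() (always a copy).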
Broadcasting
Broadcasting lets NumPy perform operations on arrays of different shapes.
# Add a scalar to every element
a = np.array([1, 2, 3])
print(a + 10) # [11 12 13]
# Add a row vector to each row
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])
print(matrix + row)
# [[11 22 33]
# [14 25 36]
# [17 28 39]]
Broadcasting follows rules about shape compatibility, but the intuition is: NumPy "stretches" the smaller array to match the larger one. A (3,4) matrix plus a (4,) vector works because the vector broadcasts across each row.
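To make the stretching concrete, here is a sketch of one compatible and one incompatible pairing (shapes are compared right to left; each dimension must match or be 1):

```python
import numpy as np

matrix = np.ones((3, 4))           # shape (3, 4)
row = np.arange(4)                 # shape (4,)   -> broadcasts across rows
col = np.arange(3).reshape(3, 1)   # shape (3, 1) -> broadcasts across columns

print((matrix + row).shape)  # (3, 4)
print((matrix + col).shape)  # (3, 4)

# A bare (3,) vector does NOT broadcast against (3, 4):
try:
    matrix + np.arange(3)
except ValueError as e:
    print("incompatible shapes:", e)
```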
Linear Algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Matrix multiplication
print(a @ b) # or np.dot(a, b)
# [[19 22]
# [43 50]]
# Determinant
print(np.linalg.det(a)) # -2.0
# Inverse
print(np.linalg.inv(a))
# Eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(a)
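A common practical task is solving a linear system Ax = b. A short sketch using np.linalg.solve, which is faster and more numerically stable than computing the inverse explicitly:

```python
import numpy as np

# Solve the system:  1x + 2y = 5
#                    3x + 4y = 11
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])

# Prefer solve() over inv(A) @ b.
x = np.linalg.solve(A, b)
print(x)  # [1. 2.]

# Check: A @ x should reproduce b
print(np.allclose(A @ x, b))  # True
```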
Part 2: Pandas
What Pandas Does
Pandas provides two key data structures:
- Series: a labeled 1D array (like a column in a spreadsheet)
- DataFrame: a labeled 2D table (like a spreadsheet or SQL table)
pip install pandas
import pandas as pd
# Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b']) # 20
# DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 35, 42, 31],
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'salary': [75000, 95000, 120000, 85000]
})
print(df)
# name age city salary
# 0 Alice 28 NYC 75000
# 1 Bob 35 SF 95000
# 2 Charlie 42 NYC 120000
# 3 Diana 31 LA 85000
Reading Data
This is where Pandas shines. Real-world data comes from files, databases, and APIs.
# CSV (most common)
df = pd.read_csv('data.csv')
# With options
df = pd.read_csv('data.csv',
                 sep=',',                # Delimiter
                 header=0,               # Row number for the header (0 = first row)
                 index_col='id',         # Use a column as the index
                 parse_dates=['date'],   # Parse date columns automatically
                 na_values=['N/A', ''],  # Treat these as NaN
                 nrows=1000,             # Only read the first 1000 rows
                 usecols=['id', 'date', 'name', 'age', 'salary'])  # Only read these columns
                 # Note: usecols must include the index_col and parse_dates columns
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)
# Clipboard (handy for quick analysis)
df = pd.read_clipboard()
Exploring Data
The first thing you do with any dataset: understand its shape and contents.
df = pd.read_csv('sales_data.csv')
# Basic info
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types of each column
df.info() # Summary: types, non-null counts, memory (info() prints itself; no print() needed)
# First/last rows
print(df.head()) # First 5 rows
print(df.tail(10)) # Last 10 rows
# Statistics
print(df.describe()) # Count, mean, std, min, 25%, 50%, 75%, max
print(df['salary'].mean())
print(df['age'].median())
print(df['city'].value_counts()) # Frequency of each unique value
# Missing values
print(df.isnull().sum()) # Count NaN per column
Selecting and Filtering
# Select a column (returns Series)
names = df['name']
# Select multiple columns (returns DataFrame)
subset = df[['name', 'salary']]
# Filter rows
high_earners = df[df['salary'] > 100000]
nyc_workers = df[df['city'] == 'NYC']
# Multiple conditions
senior_nyc = df[(df['age'] > 35) & (df['city'] == 'NYC')]
# String methods
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('ali', case=False)]
# .loc (label-based) and .iloc (integer-based)
df.loc[0:2, 'name':'city'] # Rows 0-2, columns 'name' through 'city'
df.iloc[0:3, 0:3] # First 3 rows, first 3 columns
# isin for matching multiple values
df[df['city'].isin(['NYC', 'SF'])]
# query method (SQL-like, often more readable)
df.query('salary > 100000 and city == "NYC"')
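A related gotcha: when you filter and then assign, do it in a single .loc step. Chained indexing like `df[df['age'] > 35]['salary'] = ...` may write to a temporary copy and trigger SettingWithCopyWarning. A sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 35, 42, 31],
    'salary': [75000, 95000, 120000, 85000]
})

# One .loc call: boolean mask for rows, label for the column to set.
df.loc[df['age'] > 35, 'salary'] = 130000

print(df.loc[df['name'] == 'Charlie', 'salary'].iloc[0])  # 130000
```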
Adding and Modifying Columns
# New column from calculation
df['salary_monthly'] = df['salary'] / 12
# New column from condition
df['is_senior'] = df['age'] > 35
# Apply a function to a column
df['name_upper'] = df['name'].str.upper()
# Apply a custom function
def categorize_salary(salary):
    if salary > 100000:
        return 'high'
    elif salary > 70000:
        return 'medium'
    return 'low'

df['salary_category'] = df['salary'].apply(categorize_salary)
# Rename columns
df = df.rename(columns={'salary': 'annual_salary', 'city': 'location'})
# Drop columns
df = df.drop(columns=['salary_monthly', 'is_senior'])
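apply with a Python function works, but it runs row by row. For binning a numeric column into categories, pd.cut is a vectorized alternative; a sketch matching the thresholds above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [65000, 75000, 95000, 120000]})

# Bins are right-inclusive by default: (0, 70000], (70000, 100000], (100000, inf)
df['salary_category'] = pd.cut(
    df['salary'],
    bins=[0, 70000, 100000, float('inf')],
    labels=['low', 'medium', 'high']
)
print(df['salary_category'].tolist())  # ['low', 'medium', 'medium', 'high']
```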
Grouping and Aggregating
This is where Pandas replaces hours of loop-based code with one-liners.
# Average salary by city
df.groupby('city')['salary'].mean()
# Multiple aggregations
df.groupby('city')['salary'].agg(['mean', 'median', 'count', 'std'])
# Group by multiple columns
df.groupby(['city', 'is_senior'])['salary'].mean()
# Named aggregations (clean output)
result = df.groupby('city').agg(
    avg_salary=('salary', 'mean'),
    num_employees=('salary', 'count'),
    oldest=('age', 'max'),
    youngest=('age', 'min')
).reset_index()
print(result)
# city avg_salary num_employees oldest youngest
# 0 LA 85000 1 31 31
# 1 NYC 97500 2 42 28
# 2 SF 95000 1 35 35
Sorting
# Sort by one column
df.sort_values('salary', ascending=False)
# Sort by multiple columns
df.sort_values(['city', 'salary'], ascending=[True, False])
# Sort index
df.sort_index()
Handling Missing Data
Real data is messy. Pandas handles NaN (Not a Number) values throughout.
# Check for missing values
print(df.isnull().sum())
# Drop rows with any NaN
df_clean = df.dropna()
# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=['salary', 'age'])
# Fill missing values
df['salary'] = df['salary'].fillna(df['salary'].median())
df['city'] = df['city'].fillna('Unknown')
# Forward fill (use previous value)
df['value'] = df['value'].ffill()
# Interpolate (linear interpolation)
df['temperature'] = df['temperature'].interpolate()
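A self-contained sketch of these strategies side by side, on a small series with gaps:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

print(s.fillna(0).tolist())      # [1.0, 0.0, 0.0, 4.0, 5.0]
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(s.dropna().tolist())       # [1.0, 4.0, 5.0]
```

Which strategy fits depends on the data: ffill suits time series where the last reading stays valid; interpolation suits smoothly varying measurements; dropping rows suits cases where the missing fields are critical.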
Merging DataFrames
Merging is the Pandas equivalent of SQL joins.
# Two DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})
# Inner merge (only matching rows)
merged = employees.merge(departments, on='dept_id')
print(merged)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice       10  Engineering
# 1       2      Bob       20    Marketing
# 2       3  Charlie       10  Engineering
# 3       4    Diana       30        Sales
# Left merge (all employees, even without department)
merged = employees.merge(departments, on='dept_id', how='left')
# Different column names
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer': [1, 2, 1]
})
customers = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob']
})
merged = orders.merge(customers, left_on='customer', right_on='id')
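Two more merge options worth knowing: how='outer' keeps unmatched rows from both sides, and indicator=True adds a column showing where each row came from. A sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

# Outer join: unmatched cells become NaN; _merge records each row's origin.
merged = left.merge(right, on='key', how='outer', indicator=True)
print(merged['_merge'].tolist())
# ['left_only', 'both', 'both', 'right_only']
```

The _merge column is handy for debugging joins: filtering on 'left_only' quickly shows which keys failed to match.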
Practical Example: Cleaning Real Data
Let's clean a messy sales dataset -- the kind you'd actually encounter at work.
import pandas as pd
# Read the messy data
df = pd.read_csv('sales_raw.csv')
# Step 1: Inspect
print(df.shape)
print(df.head())
print(df.dtypes)
print(df.isnull().sum())
# Step 2: Fix column names (lowercase, no spaces)
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Step 3: Parse dates
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
# Step 4: Clean price column (remove $ and commas)
df['price'] = (df['price']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))
# Step 5: Handle missing values
df['customer_name'] = df['customer_name'].fillna('Unknown')
df = df.dropna(subset=['order_date', 'price']) # Drop if critical fields missing
# Step 6: Remove duplicates
df = df.drop_duplicates(subset=['order_id'])
# Step 7: Add derived columns
df['month'] = df['order_date'].dt.month
df['year'] = df['order_date'].dt.year
df['total'] = df['price'] * df['quantity']
# Step 8: Filter out invalid data
df = df[df['price'] > 0]
df = df[df['quantity'] > 0]
# Step 9: Analyze
monthly_sales = (df.groupby(['year', 'month'])
                 .agg(
                     total_revenue=('total', 'sum'),
                     num_orders=('order_id', 'count'),
                     avg_order_value=('total', 'mean')
                 )
                 .reset_index()
                 .sort_values(['year', 'month']))
print(monthly_sales)
# Step 10: Save clean data
df.to_csv('sales_clean.csv', index=False)
This workflow -- load, inspect, clean, transform, analyze, save -- is the bread and butter of data work. Every data project follows this pattern.
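As projects grow, it helps to wrap each cleaning step in a function and chain them with .pipe, which keeps the pipeline readable and each step testable in isolation. A sketch reusing two steps from above, with a small inline DataFrame standing in for pd.read_csv('sales_raw.csv'):

```python
import pandas as pd

def normalize_columns(df):
    df = df.copy()
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    return df

def clean_price(df):
    df = df.copy()
    df['price'] = (df['price']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float))
    return df

# Stand-in for the raw CSV
raw = pd.DataFrame({
    'Order ID': [1, 2],
    'Price': ['$1,200.00', '$350.50'],
})

clean = raw.pipe(normalize_columns).pipe(clean_price)
print(clean['price'].tolist())  # [1200.0, 350.5]
```

Each function takes a DataFrame and returns a new one, so steps can be reordered, removed, or unit-tested independently.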
Performance Tips
# Use appropriate dtypes to reduce memory
df['category'] = df['category'].astype('category') # Big savings for low-cardinality strings
df['small_int'] = df['small_int'].astype('int8') # int8 (-128 to 127) instead of int64
# Check memory usage
print(df.memory_usage(deep=True))
# Use chunking for large files
chunks = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunks:
    processed = chunk[chunk['status'] == 'active']
    results.append(processed)
df = pd.concat(results)
# Avoid iterrows() -- it's extremely slow
# Bad:
for index, row in df.iterrows():
    df.loc[index, 'new'] = row['a'] + row['b']
# Good (vectorized):
df['new'] = df['a'] + df['b']
Saving Data
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
df.to_json('output.json', orient='records')
df.to_parquet('output.parquet') # Compact and fast; great for large data (requires pyarrow or fastparquet)
NumPy + Pandas Together
Pandas is built on top of NumPy. They work seamlessly together:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50]
})
# NumPy operations work on DataFrame columns
df['log_a'] = np.log(df['a'])
df['sqrt_b'] = np.sqrt(df['b'])
# Convert between them
numpy_array = df.to_numpy() # DataFrame → NumPy array (preferred over the older .values)
df_from_array = pd.DataFrame(numpy_array, columns=['a', 'b', 'log_a', 'sqrt_b'])
# Use NumPy's where for conditional logic
df['category'] = np.where(df['a'] > 3, 'high', 'low')
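np.where handles two-way splits; for more than two categories, np.select takes parallel lists of conditions and choices. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# The first matching condition wins; `default` covers everything else.
conditions = [df['a'] >= 4, df['a'] >= 2]
choices = ['high', 'medium']

df['level'] = np.select(conditions, choices, default='low')
print(df['level'].tolist())  # ['low', 'medium', 'medium', 'high', 'high']
```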
When to Use Which
| Task | Use |
|---|---|
| Numerical computation, linear algebra | NumPy |
| Reading CSVs, Excel files | Pandas |
| Data cleaning and transformation | Pandas |
| Statistical analysis on tabular data | Pandas |
| Machine learning feature matrices | NumPy (scikit-learn expects NumPy arrays) |
| Image processing (pixel arrays) | NumPy |
| Time series analysis | Pandas |
| Large-scale numerical simulation | NumPy |
Where to Go from Here
- Visualization: learn Matplotlib and Seaborn to make charts from your DataFrames
- Machine learning: scikit-learn takes NumPy arrays as input -- you now know how to prepare them
- Bigger data: when Pandas gets slow, look at Polars (faster DataFrame library) or Dask (parallel Pandas)
- Practice: download datasets from Kaggle and practice the load-explore-clean-analyze workflow
- SQL comparison: if you know SQL, learning Pandas is fast -- most operations have direct equivalents
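To illustrate that last point, here is a sketch (with made-up data) of how a typical SQL query maps onto pandas operations:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'salary': [75000, 95000, 120000, 85000]
})

# SQL:  SELECT city, AVG(salary) AS avg_salary
#       FROM df
#       WHERE salary > 80000
#       GROUP BY city
#       ORDER BY avg_salary DESC;
result = (df[df['salary'] > 80000]              # WHERE
          .groupby('city', as_index=False)      # GROUP BY
          .agg(avg_salary=('salary', 'mean'))   # SELECT ... AVG(...)
          .sort_values('avg_salary', ascending=False))  # ORDER BY
print(result['city'].tolist())  # ['NYC', 'SF', 'LA']
```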
For more Python tutorials, data science guides, and programming content, check out CodeUp.