NumPy and Pandas: The Two Libraries Every Python Developer Eventually Needs
A practical guide to NumPy arrays and Pandas DataFrames. Learn vectorization, slicing, data loading, filtering, grouping, and real-world data cleaning.
At some point in your Python journey, you'll need to work with data. Maybe it's a CSV export from a database. Maybe it's sensor readings, financial data, or user analytics. You'll try to process it with plain Python lists and loops, and it'll work -- but it'll be slow and verbose. Then someone will tell you about NumPy and Pandas, and you'll wonder how you ever managed without them.
NumPy handles numerical arrays and mathematical operations. Pandas handles tabular data (think spreadsheets). Together, they form the foundation of data work in Python. This guide covers both from the ground up with practical examples.
Part 1: NumPy
What NumPy Does
NumPy's core contribution is the ndarray (N-dimensional array). Unlike Python lists, NumPy arrays:
- Store data in contiguous memory (much faster access)
- Support element-wise operations without explicit loops
- Are implemented in C under the hood (10-100x faster than pure Python loops)
pip install numpy
import numpy as np
# Python list approach (slow)
numbers = list(range(1_000_000))
squared = [x ** 2 for x in numbers] # List comprehension, ~300ms
# NumPy approach (fast)
numbers = np.arange(1_000_000)
squared = numbers ** 2 # Vectorized operation, ~3ms
That's a 100x speedup. And the code is shorter and more readable.
Creating Arrays
import numpy as np
# From a Python list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
# With built-in functions
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((3, 4)) # 3x4 matrix of ones
empty = np.empty((2, 3)) # 2x3 uninitialized (fast; contents are arbitrary leftovers, not random)
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
# Random arrays
random_uniform = np.random.rand(3, 3) # 3x3 uniform [0, 1)
random_normal = np.random.randn(1000) # 1000 normal distribution samples
random_ints = np.random.randint(0, 100, size=10) # 10 random ints [0, 100)
# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape) # (3, 3)
print(matrix.dtype) # int64
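The np.random functions above use NumPy's legacy global random state. Since NumPy 1.17 the Generator API is the recommended alternative; a minimal sketch, assuming a recent NumPy version:

```python
import numpy as np

# Create a Generator; passing a seed makes results reproducible.
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))          # 3x3 uniform [0, 1)
normal = rng.standard_normal(1000)    # 1000 standard-normal samples
ints = rng.integers(0, 100, size=10)  # 10 random ints in [0, 100)

print(uniform.shape)                         # (3, 3)
print(ints.min() >= 0 and ints.max() < 100)  # True
```

The Generator methods mirror the legacy ones (rand → random, randn → standard_normal, randint → integers), so migrating existing code is mostly a rename.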
Vectorized Operations
This is NumPy's killer feature: operations apply element by element, with no explicit loops.
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Arithmetic
print(a + b) # [11 22 33 44 55]
print(a * b) # [10 40 90 160 250]
print(a ** 2) # [1 4 9 16 25]
print(a / 2) # [0.5 1. 1.5 2. 2.5]
# Comparison (returns boolean array)
print(a > 3) # [False False False True True]
# Math functions
print(np.sqrt(a)) # [1. 1.41 1.73 2. 2.24]
print(np.log(a)) # [0. 0.69 1.10 1.39 1.61]
print(np.exp(a)) # [2.72 7.39 20.09 54.60 148.41]
print(np.sin(a)) # Sine of each element
# Aggregations
print(np.mean(a)) # 3.0
print(np.std(a)) # 1.414
print(np.sum(a)) # 15
print(np.min(a)) # 1
print(np.max(a)) # 5
print(np.argmax(a)) # 4 (index of max value)
Indexing and Slicing
a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
# Basic indexing
print(a[0]) # 10
print(a[-1]) # 90
print(a[2:5]) # [30 40 50]
# Boolean indexing (extremely useful)
mask = a > 40
print(mask) # [False False False False True True True True True]
print(a[mask]) # [50 60 70 80 90]
print(a[a > 40]) # Same thing, shorthand
# Fancy indexing
indices = [0, 2, 4]
print(a[indices]) # [10 30 50]
# 2D slicing
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(m[0, :]) # [1 2 3] (first row)
print(m[:, 1]) # [2 5 8] (second column)
print(m[0:2, 0:2]) # [[1 2], [4 5]] (top-left 2x2)
Boolean indexing is one of the most powerful features in NumPy. It lets you filter arrays without loops:
temperatures = np.array([72, 85, 68, 91, 77, 95, 63, 88])
# Find all temperatures above 80
hot_days = temperatures[temperatures > 80]
print(hot_days) # [85 91 95 88]
# Count how many
print(np.sum(temperatures > 80)) # 4
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
moderate = temperatures[(temperatures > 70) & (temperatures < 90)]
print(moderate) # [72 85 77 88]
Reshaping
a = np.arange(12)
print(a) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape to 3x4 matrix
b = a.reshape(3, 4)
print(b)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
# -1 means "figure it out"
c = a.reshape(2, -1) # 2 rows, auto-calculate columns (6)
print(c.shape) # (2, 6)
# Flatten back
print(b.ravel()) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Transpose
print(b.T.shape) # (4, 3)
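One gotcha worth knowing: reshape returns a view of the original data whenever it can, not a copy. A small sketch of what that means in practice:

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)

# b shares memory with a, so writing through b is visible in a.
b[0, 0] = 99
print(a[0])  # 99

# Use .copy() when you need an independent array.
c = a.reshape(3, 4).copy()
c[0, 0] = -1
print(a[0])  # still 99
```

The same applies to ravel() (view when possible) versus flatten() (always a copy).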
Broadcasting
Broadcasting lets NumPy perform operations on arrays of different shapes.
# Add a scalar to every element
a = np.array([1, 2, 3])
print(a + 10) # [11 12 13]
# Add a row vector to each row
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])
print(matrix + row)
# [[11 22 33]
# [14 25 36]
# [17 28 39]]
Broadcasting follows rules about shape compatibility, but the intuition is: NumPy "stretches" the smaller array to match the larger one. A (3,4) matrix plus a (4,) vector works because the vector broadcasts across each row.
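To make the stretching concrete, here is a sketch of one compatible and one incompatible pairing (shapes are compared right to left; each dimension must match or be 1):

```python
import numpy as np

matrix = np.ones((3, 4))           # shape (3, 4)
row = np.arange(4)                 # shape (4,)   -> broadcasts across rows
col = np.arange(3).reshape(3, 1)   # shape (3, 1) -> broadcasts across columns

print((matrix + row).shape)  # (3, 4)
print((matrix + col).shape)  # (3, 4)

# A bare (3,) vector does NOT broadcast against (3, 4):
try:
    matrix + np.arange(3)
except ValueError as e:
    print("incompatible shapes:", e)
```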
Linear Algebra
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
# Matrix multiplication
print(a @ b) # or np.dot(a, b)
# [[19 22]
# [43 50]]
# Determinant
print(np.linalg.det(a)) # -2.0
# Inverse
print(np.linalg.inv(a))
# Eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(a)
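A common practical task is solving a linear system Ax = b. A short sketch using np.linalg.solve, which is faster and more numerically stable than computing the inverse explicitly:

```python
import numpy as np

# Solve the system:  1x + 2y = 5
#                    3x + 4y = 11
A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])

# Prefer solve() over inv(A) @ b.
x = np.linalg.solve(A, b)
print(x)  # [1. 2.]

# Check: A @ x should reproduce b
print(np.allclose(A @ x, b))  # True
```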
Part 2: Pandas
What Pandas Does
Pandas provides two key data structures:
- Series: a labeled 1D array (like a column in a spreadsheet)
- DataFrame: a labeled 2D table (like a spreadsheet or SQL table)
pip install pandas
import pandas as pd
# Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b']) # 20
# DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 35, 42, 31],
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'salary': [75000, 95000, 120000, 85000]
})
print(df)
# name age city salary
# 0 Alice 28 NYC 75000
# 1 Bob 35 SF 95000
# 2 Charlie 42 NYC 120000
# 3 Diana 31 LA 85000
Reading Data
This is where Pandas shines. Real-world data comes from files, databases, and APIs.
# CSV (most common)
df = pd.read_csv('data.csv')
# With options
df = pd.read_csv('data.csv',
                 sep=',',                # Delimiter
                 header=0,               # Row number for the header (0 = first row)
                 index_col='id',         # Use a column as the index
                 parse_dates=['date'],   # Parse date columns automatically
                 na_values=['N/A', ''],  # Treat these as NaN
                 nrows=1000,             # Only read the first 1000 rows
                 usecols=['id', 'date', 'name', 'age', 'salary'])  # Only read these columns
                 # Note: usecols must include the index_col and parse_dates columns
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON
df = pd.read_json('data.json')
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)
# Clipboard (handy for quick analysis)
df = pd.read_clipboard()
Exploring Data
The first thing you do with any dataset: understand its shape and contents.
df = pd.read_csv('sales_data.csv')
# Basic info
print(df.shape) # (rows, columns)
print(df.columns) # Column names
print(df.dtypes) # Data types of each column
df.info() # Summary: types, non-null counts, memory (info() prints itself; no print() needed)
# First/last rows
print(df.head()) # First 5 rows
print(df.tail(10)) # Last 10 rows
# Statistics
print(df.describe()) # Count, mean, std, min, 25%, 50%, 75%, max
print(df['salary'].mean())
print(df['age'].median())
print(df['city'].value_counts()) # Frequency of each unique value
# Missing values
print(df.isnull().sum()) # Count NaN per column
Selecting and Filtering
# Select a column (returns Series)
names = df['name']
# Select multiple columns (returns DataFrame)
subset = df[['name', 'salary']]
# Filter rows
high_earners = df[df['salary'] > 100000]
nyc_workers = df[df['city'] == 'NYC']
# Multiple conditions
senior_nyc = df[(df['age'] > 35) & (df['city'] == 'NYC')]
# String methods
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('ali', case=False)]
# .loc (label-based) and .iloc (integer-based)
df.loc[0:2, 'name':'city'] # Rows 0-2, columns 'name' through 'city'
df.iloc[0:3, 0:3] # First 3 rows, first 3 columns
# isin for matching multiple values
df[df['city'].isin(['NYC', 'SF'])]
# query method (SQL-like, often more readable)
df.query('salary > 100000 and city == "NYC"')
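A related gotcha: when you filter and then assign, do it in a single .loc step. Chained indexing like `df[df['age'] > 35]['salary'] = ...` may write to a temporary copy and trigger SettingWithCopyWarning. A sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 35, 42, 31],
    'salary': [75000, 95000, 120000, 85000]
})

# One .loc call: boolean mask for rows, label for the column to set.
df.loc[df['age'] > 35, 'salary'] = 130000

print(df.loc[df['name'] == 'Charlie', 'salary'].iloc[0])  # 130000
```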
Adding and Modifying Columns
# New column from calculation
df['salary_monthly'] = df['salary'] / 12
# New column from condition
df['is_senior'] = df['age'] > 35
# Apply a function to a column
df['name_upper'] = df['name'].str.upper()
# Apply a custom function
def categorize_salary(salary):
    if salary > 100000:
        return 'high'
    elif salary > 70000:
        return 'medium'
    return 'low'

df['salary_category'] = df['salary'].apply(categorize_salary)
# Rename columns
df = df.rename(columns={'salary': 'annual_salary', 'city': 'location'})
# Drop columns
df = df.drop(columns=['salary_monthly', 'is_senior'])
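apply with a Python function works, but it runs row by row. For binning a numeric column into categories, pd.cut is a vectorized alternative; a sketch matching the thresholds above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [65000, 75000, 95000, 120000]})

# Bins are right-inclusive by default: (0, 70000], (70000, 100000], (100000, inf)
df['salary_category'] = pd.cut(
    df['salary'],
    bins=[0, 70000, 100000, float('inf')],
    labels=['low', 'medium', 'high']
)
print(df['salary_category'].tolist())  # ['low', 'medium', 'medium', 'high']
```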
Grouping and Aggregating
This is where Pandas replaces hours of loop-based code with one-liners.
# Average salary by city
df.groupby('city')['salary'].mean()
# Multiple aggregations
df.groupby('city')['salary'].agg(['mean', 'median', 'count', 'std'])
# Group by multiple columns
df.groupby(['city', 'is_senior'])['salary'].mean()
# Named aggregations (clean output)
result = df.groupby('city').agg(
    avg_salary=('salary', 'mean'),
    num_employees=('salary', 'count'),
    oldest=('age', 'max'),
    youngest=('age', 'min')
).reset_index()
print(result)
# city avg_salary num_employees oldest youngest
# 0 LA 85000 1 31 31
# 1 NYC 97500 2 42 28
# 2 SF 95000 1 35 35
Sorting
# Sort by one column
df.sort_values('salary', ascending=False)
# Sort by multiple columns
df.sort_values(['city', 'salary'], ascending=[True, False])
# Sort index
df.sort_index()
Handling Missing Data
Real data is messy. Pandas handles NaN (Not a Number) values throughout.
# Check for missing values
print(df.isnull().sum())
# Drop rows with any NaN
df_clean = df.dropna()
# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=['salary', 'age'])
# Fill missing values
df['salary'] = df['salary'].fillna(df['salary'].median())
df['city'] = df['city'].fillna('Unknown')
# Forward fill (use previous value)
df['value'] = df['value'].ffill()
# Interpolate (linear interpolation)
df['temperature'] = df['temperature'].interpolate()
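A self-contained sketch of these strategies side by side, on a small series with gaps:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

print(s.fillna(0).tolist())      # [1.0, 0.0, 0.0, 4.0, 5.0]
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(s.dropna().tolist())       # [1.0, 4.0, 5.0]
```

Which strategy fits depends on the data: ffill suits time series where the last reading stays valid; interpolation suits smoothly varying measurements; dropping rows suits cases where the missing fields are critical.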
Merging DataFrames
Merging is the Pandas equivalent of SQL joins.
# Two DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})
# Inner merge (only matching rows)
merged = employees.merge(departments, on='dept_id')
print(merged)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice       10  Engineering
# 1       2      Bob       20    Marketing
# 2       3  Charlie       10  Engineering
# 3       4    Diana       30        Sales
# Left merge (all employees, even without department)
merged = employees.merge(departments, on='dept_id', how='left')
# Different column names
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer': [1, 2, 1]
})
customers = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob']
})
merged = orders.merge(customers, left_on='customer', right_on='id')
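Two more merge options worth knowing: how='outer' keeps unmatched rows from both sides, and indicator=True adds a column showing where each row came from. A sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

# Outer join: unmatched cells become NaN; _merge records each row's origin.
merged = left.merge(right, on='key', how='outer', indicator=True)
print(merged['_merge'].tolist())
# ['left_only', 'both', 'both', 'right_only']
```

The _merge column is handy for debugging joins: filtering on 'left_only' quickly shows which keys failed to match.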
Practical Example: Cleaning Real Data
Let's clean a messy sales dataset -- the kind you'd actually encounter at work.
import pandas as pd
# Read the messy data
df = pd.read_csv('sales_raw.csv')
# Step 1: Inspect
print(df.shape)
print(df.head())
print(df.dtypes)
print(df.isnull().sum())
# Step 2: Fix column names (lowercase, no spaces)
df.columns = df.columns.str.lower().str.replace(' ', '_')
# Step 3: Parse dates
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
# Step 4: Clean price column (remove $ and commas)
df['price'] = (df['price']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))
# Step 5: Handle missing values
df['customer_name'] = df['customer_name'].fillna('Unknown')
df = df.dropna(subset=['order_date', 'price']) # Drop if critical fields missing
# Step 6: Remove duplicates
df = df.drop_duplicates(subset=['order_id'])
# Step 7: Add derived columns
df['month'] = df['order_date'].dt.month
df['year'] = df['order_date'].dt.year
df['total'] = df['price'] * df['quantity']
# Step 8: Filter out invalid data
df = df[df['price'] > 0]
df = df[df['quantity'] > 0]
# Step 9: Analyze
monthly_sales = (df.groupby(['year', 'month'])
                 .agg(
                     total_revenue=('total', 'sum'),
                     num_orders=('order_id', 'count'),
                     avg_order_value=('total', 'mean')
                 )
                 .reset_index()
                 .sort_values(['year', 'month']))
print(monthly_sales)
# Step 10: Save clean data
df.to_csv('sales_clean.csv', index=False)
This workflow -- load, inspect, clean, transform, analyze, save -- is the bread and butter of data work. Every data project follows this pattern.
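As projects grow, it helps to wrap each cleaning step in a function and chain them with .pipe, which keeps the pipeline readable and each step testable in isolation. A sketch reusing two steps from above, with a small inline DataFrame standing in for pd.read_csv('sales_raw.csv'):

```python
import pandas as pd

def normalize_columns(df):
    df = df.copy()
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    return df

def clean_price(df):
    df = df.copy()
    df['price'] = (df['price']
                   .str.replace('$', '', regex=False)
                   .str.replace(',', '', regex=False)
                   .astype(float))
    return df

# Stand-in for the raw CSV
raw = pd.DataFrame({
    'Order ID': [1, 2],
    'Price': ['$1,200.00', '$350.50'],
})

clean = raw.pipe(normalize_columns).pipe(clean_price)
print(clean['price'].tolist())  # [1200.0, 350.5]
```

Each function takes a DataFrame and returns a new one, so steps can be reordered, removed, or unit-tested independently.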
Performance Tips
# Use appropriate dtypes to reduce memory
df['category'] = df['category'].astype('category') # Big savings for low-cardinality strings
df['small_int'] = df['small_int'].astype('int8') # int8 (-128 to 127) instead of int64
# Check memory usage
print(df.memory_usage(deep=True))
# Use chunking for large files
chunks = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunks:
    processed = chunk[chunk['status'] == 'active']
    results.append(processed)
df = pd.concat(results)
# Avoid iterrows() -- it's extremely slow
# Bad:
for index, row in df.iterrows():
    df.loc[index, 'new'] = row['a'] + row['b']
# Good (vectorized):
df['new'] = df['a'] + df['b']
Saving Data
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
df.to_json('output.json', orient='records')
df.to_parquet('output.parquet') # Compact and fast; great for large data (requires pyarrow or fastparquet)
NumPy + Pandas Together
Pandas is built on top of NumPy. They work seamlessly together:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50]
})
# NumPy operations work on DataFrame columns
df['log_a'] = np.log(df['a'])
df['sqrt_b'] = np.sqrt(df['b'])
# Convert between them
numpy_array = df.to_numpy() # DataFrame → NumPy array (preferred over the older .values)
df_from_array = pd.DataFrame(numpy_array, columns=['a', 'b', 'log_a', 'sqrt_b'])
# Use NumPy's where for conditional logic
df['category'] = np.where(df['a'] > 3, 'high', 'low')
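np.where handles two-way splits; for more than two categories, np.select takes parallel lists of conditions and choices. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# The first matching condition wins; `default` covers everything else.
conditions = [df['a'] >= 4, df['a'] >= 2]
choices = ['high', 'medium']

df['level'] = np.select(conditions, choices, default='low')
print(df['level'].tolist())  # ['low', 'medium', 'medium', 'high', 'high']
```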
When to Use Which
| Task | Use |
|---|---|
| Numerical computation, linear algebra | NumPy |
| Reading CSVs, Excel files | Pandas |
| Data cleaning and transformation | Pandas |
| Statistical analysis on tabular data | Pandas |
| Machine learning feature matrices | NumPy (scikit-learn expects NumPy arrays) |
| Image processing (pixel arrays) | NumPy |
| Time series analysis | Pandas |
| Large-scale numerical simulation | NumPy |
Where to Go from Here
- Visualization: learn Matplotlib and Seaborn to make charts from your DataFrames
- Machine learning: scikit-learn takes NumPy arrays as input -- you now know how to prepare them
- Bigger data: when Pandas gets slow, look at Polars (faster DataFrame library) or Dask (parallel Pandas)
- Practice: download datasets from Kaggle and practice the load-explore-clean-analyze workflow
- SQL comparison: if you know SQL, learning Pandas is fast -- most operations have direct equivalents
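To illustrate that last point, here is a sketch (with made-up data) of how a typical SQL query maps onto pandas operations:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'salary': [75000, 95000, 120000, 85000]
})

# SQL:  SELECT city, AVG(salary) AS avg_salary
#       FROM df
#       WHERE salary > 80000
#       GROUP BY city
#       ORDER BY avg_salary DESC;
result = (df[df['salary'] > 80000]              # WHERE
          .groupby('city', as_index=False)      # GROUP BY
          .agg(avg_salary=('salary', 'mean'))   # SELECT ... AVG(...)
          .sort_values('avg_salary', ascending=False))  # ORDER BY
print(result['city'].tolist())  # ['NYC', 'SF', 'LA']
```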
For more Python tutorials, data science guides, and programming content, check out CodeUp.