March 26, 2026 · 12 min read

NumPy and Pandas: The Two Libraries Every Python Developer Eventually Needs

A practical guide to NumPy arrays and Pandas DataFrames. Learn vectorization, slicing, data loading, filtering, grouping, and real-world data cleaning.

numpy pandas python data-science tutorial

At some point in your Python journey, you'll need to work with data. Maybe it's a CSV export from a database. Maybe it's sensor readings, financial data, or user analytics. You'll try to process it with plain Python lists and loops, and it'll work -- but it'll be slow and verbose. Then someone will tell you about NumPy and Pandas, and you'll wonder how you ever managed without them.

NumPy handles numerical arrays and mathematical operations. Pandas handles tabular data (think spreadsheets). Together, they form the foundation of data work in Python. This guide covers both from the ground up with practical examples.

Part 1: NumPy

What NumPy Does

NumPy's core contribution is the ndarray (N-dimensional array). Unlike Python lists, NumPy arrays:

  • Store data in contiguous memory (much faster access)
  • Support element-wise operations without explicit loops
  • Are implemented in C under the hood (10-100x faster than pure Python loops)

pip install numpy

import numpy as np

# Python list approach (slow)
numbers = list(range(1_000_000))
squared = [x ** 2 for x in numbers]  # List comprehension, ~300ms

# NumPy approach (fast)
numbers = np.arange(1_000_000)
squared = numbers ** 2  # Vectorized operation, ~3ms

That's a 100x speedup. And the code is shorter and more readable.
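
Exact timings vary by machine and by NumPy version, so treat the ~100x figure as a ballpark. You can measure the gap yourself with the standard-library `timeit` module -- a minimal sketch:

```python
import timeit

import numpy as np

n = 1_000_000
numbers = list(range(n))
arr = np.arange(n)

# Each approach squares one million numbers; run once and compare
list_time = timeit.timeit(lambda: [x ** 2 for x in numbers], number=1)
numpy_time = timeit.timeit(lambda: arr ** 2, number=1)

print(f"list comprehension: {list_time * 1000:.1f} ms")
print(f"numpy vectorized:   {numpy_time * 1000:.1f} ms")
```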

Creating Arrays

import numpy as np

# From a Python list
a = np.array([1, 2, 3, 4, 5])
print(a)  # [1 2 3 4 5]

# With built-in functions
zeros = np.zeros(5)           # [0. 0. 0. 0. 0.]
ones = np.ones((3, 4))        # 3x4 matrix of ones
empty = np.empty((2, 3))      # 2x3 uninitialized (fast, but contents are arbitrary)
range_arr = np.arange(0, 10, 2)  # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

# Random arrays
random_uniform = np.random.rand(3, 3)     # 3x3 uniform [0, 1)
random_normal = np.random.randn(1000)     # 1000 normal distribution samples
random_ints = np.random.randint(0, 100, size=10)  # 10 random ints [0, 100)

# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print(matrix.shape)  # (3, 3)
print(matrix.dtype)  # int64

Vectorized Operations

The killer feature. Operations apply element-by-element without loops.

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Arithmetic
print(a + b)    # [11 22 33 44 55]
print(a * b)    # [10 40 90 160 250]
print(a ** 2)   # [1 4 9 16 25]
print(a / 2)    # [0.5 1.  1.5 2.  2.5]

# Comparison (returns boolean array)
print(a > 3)    # [False False False  True  True]

# Math functions
print(np.sqrt(a))     # [1.   1.41 1.73 2.   2.24]
print(np.log(a))      # [0.   0.69 1.10 1.39 1.61]
print(np.exp(a))      # [2.72 7.39 20.09 54.60 148.41]
print(np.sin(a))      # Sine of each element

# Aggregations
print(np.mean(a))     # 3.0
print(np.std(a))      # 1.414
print(np.sum(a))      # 15
print(np.min(a))      # 1
print(np.max(a))      # 5
print(np.argmax(a))   # 4 (index of max value)

Indexing and Slicing

a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Basic indexing
print(a[0])      # 10
print(a[-1])     # 90
print(a[2:5])    # [30 40 50]

# Boolean indexing (extremely useful)
mask = a > 40
print(mask)          # [False False False False  True  True  True  True  True]
print(a[mask])       # [50 60 70 80 90]
print(a[a > 40])     # Same thing, shorthand

# Fancy indexing
indices = [0, 2, 4]
print(a[indices])    # [10 30 50]

# 2D slicing
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(m[0, :]) # [1 2 3] (first row)
print(m[:, 1]) # [2 5 8] (second column)
print(m[0:2, 0:2]) # [[1 2], [4 5]] (top-left 2x2)

Boolean indexing is one of the most powerful features in NumPy. It lets you filter arrays without loops:

temperatures = np.array([72, 85, 68, 91, 77, 95, 63, 88])

# Find all temperatures above 80
hot_days = temperatures[temperatures > 80]
print(hot_days)  # [85 91 95 88]

# Count how many
print(np.sum(temperatures > 80))  # 4

# Multiple conditions (use & for AND, | for OR, ~ for NOT)
moderate = temperatures[(temperatures > 70) & (temperatures < 90)]
print(moderate)  # [72 85 77 88]

Reshaping

a = np.arange(12)
print(a)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Reshape to 3x4 matrix
b = a.reshape(3, 4)
print(b)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# -1 means "figure it out"
c = a.reshape(2, -1)  # 2 rows, auto-calculate columns (6)
print(c.shape)  # (2, 6)

# Flatten back
print(b.ravel())  # [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Transpose
print(b.T.shape)  # (4, 3)

Broadcasting

Broadcasting lets NumPy perform operations on arrays of different shapes.

# Add a scalar to every element
a = np.array([1, 2, 3])
print(a + 10)  # [11 12 13]

# Add a column vector to each row
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

row = np.array([10, 20, 30])
print(matrix + row)
# [[11 22 33]
#  [14 25 36]
#  [17 28 39]]

Broadcasting follows rules about shape compatibility, but the intuition is: NumPy "stretches" the smaller array to match the larger one. A (3,4) matrix plus a (4,) vector works because the vector broadcasts across each row.
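
The rules are easy to sanity-check interactively. A small sketch showing one compatible and one incompatible pairing:

```python
import numpy as np

# (3, 4) matrix + (4,) vector: the vector is broadcast across each of the 3 rows
m = np.arange(12).reshape(3, 4)
v = np.array([100, 200, 300, 400])
print((m + v).shape)  # (3, 4)

# A (3,) vector does NOT broadcast against 4-column rows -- shapes must
# be compatible from the trailing dimension backwards
try:
    m + np.array([1, 2, 3])
except ValueError as err:
    print("incompatible shapes:", err)
```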

Linear Algebra

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Matrix multiplication
print(a @ b)           # or np.dot(a, b)
# [[19 22]
#  [43 50]]

# Determinant
print(np.linalg.det(a))  # -2.0

# Inverse
print(np.linalg.inv(a))

# Eigenvalues
eigenvalues, eigenvectors = np.linalg.eig(a)

Part 2: Pandas

What Pandas Does

Pandas provides two key data structures:

  • Series: a labeled 1D array (like a column in a spreadsheet)
  • DataFrame: a labeled 2D table (like a spreadsheet or SQL table)

pip install pandas

import pandas as pd

# Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b'])  # 20

# DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 35, 42, 31],
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'salary': [75000, 95000, 120000, 85000]
})

print(df)
#       name  age city  salary
# 0    Alice   28  NYC   75000
# 1      Bob   35   SF   95000
# 2  Charlie   42  NYC  120000
# 3    Diana   31   LA   85000

Reading Data

This is where Pandas shines. Real-world data comes from files, databases, and APIs.

# CSV (most common)
df = pd.read_csv('data.csv')

# With options
df = pd.read_csv('data.csv',
    sep=',',              # Delimiter
    header=0,             # Row number for header (0 = first row)
    index_col='id',       # Use a column as the index
    parse_dates=['date'], # Parse date columns automatically
    na_values=['N/A', ''],# Treat these as NaN
    nrows=1000,           # Only read first 1000 rows
    usecols=['name', 'age', 'salary']  # Only read specific columns
)

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json')

# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)

# Clipboard (handy for quick analysis)
df = pd.read_clipboard()

Exploring Data

The first thing you do with any dataset: understand its shape and contents.

df = pd.read_csv('sales_data.csv')

# Basic info
print(df.shape)          # (rows, columns)
print(df.columns)        # Column names
print(df.dtypes)         # Data types of each column
print(df.info())         # Summary: types, non-null counts, memory

# First/last rows
print(df.head())         # First 5 rows
print(df.tail(10))       # Last 10 rows

# Statistics
print(df.describe())     # Count, mean, std, min, 25%, 50%, 75%, max
print(df['salary'].mean())
print(df['age'].median())
print(df['city'].value_counts())  # Frequency of each unique value

# Missing values
print(df.isnull().sum())  # Count NaN per column

Selecting and Filtering

# Select a column (returns Series)
names = df['name']

# Select multiple columns (returns DataFrame)
subset = df[['name', 'salary']]

# Filter rows
high_earners = df[df['salary'] > 100000]
nyc_workers = df[df['city'] == 'NYC']

# Multiple conditions
senior_nyc = df[(df['age'] > 35) & (df['city'] == 'NYC')]

# String methods
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('ali', case=False)]

# .loc (label-based) and .iloc (integer-based)
df.loc[0:2, 'name':'city']     # Rows 0-2 and columns 'name' through 'city' (both ends inclusive)
df.iloc[0:3, 0:3]              # First 3 rows, first 3 columns

# isin for matching multiple values
df[df['city'].isin(['NYC', 'SF'])]

# query method (SQL-like, often more readable)
df.query('salary > 100000 and city == "NYC"')
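
One gotcha worth demonstrating: `.loc` slices include both endpoints (they work on labels), while `.iloc` uses Python's usual half-open slicing. A quick illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

print(len(df.loc['a':'c']))   # 3 rows -- the label 'c' is included
print(len(df.iloc[0:2]))      # 2 rows -- position 2 is excluded
```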

Adding and Modifying Columns

# New column from calculation
df['salary_monthly'] = df['salary'] / 12

# New column from condition
df['is_senior'] = df['age'] > 35

# Apply a function to a column
df['name_upper'] = df['name'].str.upper()

# Apply a custom function
def categorize_salary(salary):
    if salary > 100000:
        return 'high'
    elif salary > 70000:
        return 'medium'
    return 'low'

df['salary_category'] = df['salary'].apply(categorize_salary)

# Rename columns
df = df.rename(columns={'salary': 'annual_salary', 'city': 'location'})

# Drop columns
df = df.drop(columns=['salary_monthly', 'is_senior'])

Grouping and Aggregating

This is where Pandas replaces hours of loop-based code with one-liners.

# Average salary by city
df.groupby('city')['salary'].mean()

# Multiple aggregations
df.groupby('city')['salary'].agg(['mean', 'median', 'count', 'std'])

# Group by multiple columns
df.groupby(['city', 'is_senior'])['salary'].mean()

# Named aggregations (clean output)
result = df.groupby('city').agg(
    avg_salary=('salary', 'mean'),
    num_employees=('salary', 'count'),
    oldest=('age', 'max'),
    youngest=('age', 'min')
).reset_index()

print(result)
#   city  avg_salary  num_employees  oldest  youngest
# 0   LA     85000.0              1      31        31
# 1  NYC     97500.0              2      42        28
# 2   SF     95000.0              1      35        35

Sorting

# Sort by one column
df.sort_values('salary', ascending=False)

# Sort by multiple columns
df.sort_values(['city', 'salary'], ascending=[True, False])

# Sort index
df.sort_index()

Handling Missing Data

Real data is messy. Pandas handles NaN (Not a Number) values throughout.

# Check for missing values
print(df.isnull().sum())

# Drop rows with any NaN
df_clean = df.dropna()

# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=['salary', 'age'])

# Fill missing values
df['salary'] = df['salary'].fillna(df['salary'].median())
df['city'] = df['city'].fillna('Unknown')

# Forward fill (use previous value)
df['value'] = df['value'].ffill()

# Interpolate (linear interpolation)
df['temperature'] = df['temperature'].interpolate()

Merging DataFrames

Merging is the Pandas equivalent of SQL joins.

# Two DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

# Inner merge (only matching rows)
merged = employees.merge(departments, on='dept_id')
print(merged)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice       10  Engineering
# 1       3  Charlie       10  Engineering
# 2       2      Bob       20    Marketing
# 3       4    Diana       30        Sales

# Left merge (all employees, even without department)
merged = employees.merge(departments, on='dept_id', how='left')

# Different column names
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer': [1, 2, 1]
})
customers = pd.DataFrame({
    'id': [1, 2],
    'name': ['Alice', 'Bob']
})
merged = orders.merge(customers, left_on='customer', right_on='id')

Practical Example: Cleaning Real Data

Let's clean a messy sales dataset -- the kind you'd actually encounter at work.

import pandas as pd

# Read the messy data
df = pd.read_csv('sales_raw.csv')

# Step 1: Inspect
print(df.shape)
print(df.head())
print(df.dtypes)
print(df.isnull().sum())

# Step 2: Fix column names (lowercase, no spaces)
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Step 3: Parse dates
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')

# Step 4: Clean price column (remove $ and commas)
df['price'] = (df['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float))

# Step 5: Handle missing values
df['customer_name'] = df['customer_name'].fillna('Unknown')
df = df.dropna(subset=['order_date', 'price'])  # Drop if critical fields missing

# Step 6: Remove duplicates
df = df.drop_duplicates(subset=['order_id'])

# Step 7: Add derived columns
df['month'] = df['order_date'].dt.month
df['year'] = df['order_date'].dt.year
df['total'] = df['price'] * df['quantity']

# Step 8: Filter out invalid data
df = df[df['price'] > 0]
df = df[df['quantity'] > 0]

# Step 9: Analyze
monthly_sales = (df.groupby(['year', 'month'])
    .agg(
        total_revenue=('total', 'sum'),
        num_orders=('order_id', 'count'),
        avg_order_value=('total', 'mean')
    )
    .reset_index()
    .sort_values(['year', 'month']))

print(monthly_sales)

# Step 10: Save clean data
df.to_csv('sales_clean.csv', index=False)

This workflow -- load, inspect, clean, transform, analyze, save -- is the bread and butter of data work. Every data project follows this pattern.

Performance Tips

# Use appropriate dtypes to reduce memory
df['category'] = df['category'].astype('category')  # 90%+ memory reduction for low-cardinality strings
df['small_int'] = df['small_int'].astype('int8')     # int8 vs int64

# Check memory usage
print(df.memory_usage(deep=True))

# Use chunking for large files
chunks = pd.read_csv('huge_file.csv', chunksize=100_000)
results = []
for chunk in chunks:
    processed = chunk[chunk['status'] == 'active']
    results.append(processed)
df = pd.concat(results)

# Avoid iterrows() -- it's extremely slow
# Bad:
for index, row in df.iterrows():
    df.loc[index, 'new'] = row['a'] + row['b']

# Good (vectorized):
df['new'] = df['a'] + df['b']

Saving Data

df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
df.to_json('output.json', orient='records')
df.to_parquet('output.parquet')  # Fastest, smallest, best for large data

NumPy + Pandas Together

Pandas is built on top of NumPy. They work seamlessly together:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50]
})

# NumPy operations work on DataFrame columns
df['log_a'] = np.log(df['a'])
df['sqrt_b'] = np.sqrt(df['b'])

# Convert between them
numpy_array = df.values  # DataFrame → NumPy array
df_from_array = pd.DataFrame(numpy_array, columns=['a', 'b', 'log_a', 'sqrt_b'])

# Use NumPy's where for conditional logic
df['category'] = np.where(df['a'] > 3, 'high', 'low')

When to Use Which

  • Numerical computation, linear algebra → NumPy
  • Reading CSVs, Excel files → Pandas
  • Data cleaning and transformation → Pandas
  • Statistical analysis on tabular data → Pandas
  • Machine learning feature matrices → NumPy (scikit-learn expects NumPy arrays)
  • Image processing (pixel arrays) → NumPy
  • Time series analysis → Pandas
  • Large-scale numerical simulation → NumPy

In practice, most data projects start with Pandas (load and clean data) and switch to NumPy when feeding data into machine learning models or doing numerical computation.
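
That hand-off is usually a single method call. A minimal sketch (the columns here are invented for illustration; `to_numpy()` is the modern spelling of the older `.values` attribute):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [28, 35, 42, 31],
    'salary': [75000, 95000, 120000, 85000],
    'is_senior': [False, False, True, False],
})

# Feature matrix and target vector, ready to hand to scikit-learn
X = df[['age', 'salary']].to_numpy(dtype=float)
y = df['is_senior'].to_numpy()

print(X.shape, X.dtype)  # (4, 2) float64
```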

Where to Go from Here

  1. Visualization: learn Matplotlib and Seaborn to make charts from your DataFrames
  2. Machine learning: scikit-learn takes NumPy arrays as input -- you now know how to prepare them
  3. Bigger data: when Pandas gets slow, look at Polars (faster DataFrame library) or Dask (parallel Pandas)
  4. Practice: download datasets from Kaggle and practice the load-explore-clean-analyze workflow
  5. SQL comparison: if you know SQL, learning Pandas is fast -- most operations have direct equivalents

The combination of NumPy and Pandas handles 90% of data tasks you'll encounter as a developer. Master these two libraries and you'll be equipped to work with data in any context.

For more Python tutorials, data science guides, and programming content, check out CodeUp.
