Data Science Roadmap for Beginners: From Python to Machine Learning
A realistic data science learning path covering Python, NumPy, Pandas, statistics, SQL, machine learning, deep learning, and building a portfolio. Includes timelines, tool comparisons, and honest career advice.
Let's get something out of the way immediately: data science is not mostly machine learning. It's mostly data cleaning. The glamorous part -- training models, neural networks, predictions -- is maybe 20% of the actual job. The other 80% is finding data, cleaning it, transforming it, figuring out why column B has 40% null values, and explaining results to people who just want a simple answer.
If that doesn't scare you off, great. Here's how to actually learn this stuff.
The Roadmap
- Python basics (2 weeks)
- NumPy + Pandas (3 weeks)
- Data visualization (2 weeks)
- Statistics fundamentals (3 weeks)
- SQL (2 weeks)
- Machine Learning basics (4 weeks)
- Feature engineering (2 weeks)
- Deep learning intro (4 weeks)
- Real projects + portfolio (ongoing)
Phase 1: Python Basics (2 Weeks)
If you already know Python, skip this. You need: variables, data types, loops, conditionals, functions, list comprehensions, dictionaries, and file I/O. You don't need to become an expert -- just comfortable enough that syntax doesn't slow you down when you're focused on data problems.
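A quick self-check: if you can read something like this without pausing on syntax, you're ready to move on. (The data here is made up.)

```python
# Hypothetical example: summarize made-up exam scores using only
# core Python -- dicts, loops, comprehensions, and functions.
scores = {"ana": 82, "ben": 91, "cara": 78, "dan": 91}

def average(values):
    return sum(values) / len(values)

# List comprehension with a condition
passed = [name for name, s in scores.items() if s >= 80]

print(f"average: {average(scores.values()):.1f}")
print(f"passed: {passed}")
```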
Phase 2: NumPy + Pandas (3 Weeks)
This is where data science actually begins. Pandas is the tool you'll use for everything.
import pandas as pd
df = pd.read_csv("sales_data.csv")
print(df.head()) # First 5 rows
print(df.info()) # Column types, null counts
print(df.describe()) # Statistics for numeric columns
# The operations you'll do on every dataset
df = df.dropna(subset=["revenue"])
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
# Groupby -- the most important Pandas operation
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()
# Filtering
big_deals = df[df["revenue"] > 10000]
Here's the thing about Pandas: you could spend months learning every method. Don't. Learn read_csv, head, info, describe, groupby, merge, apply, fillna, dropna, boolean filtering, and to_csv. That covers 90% of real work.
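Two of those methods, merge and fillna, come up constantly but aren't shown above, so here's a minimal sketch using made-up DataFrames:

```python
import pandas as pd

# Hypothetical tables: orders and a customer lookup
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "revenue": [100, 250, None, 80],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["EU", "US"],
})

# Left join keeps every order, even without a matching customer
df = orders.merge(customers, on="customer_id", how="left")

# Fill missing revenue with 0 and missing region with a sentinel
df["revenue"] = df["revenue"].fillna(0)
df["region"] = df["region"].fillna("unknown")

print(df)
```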
Phase 3: Data Visualization (2 Weeks)
Numbers in a table don't convince anyone. Charts do.
| Library | Best For |
|---|---|
| Matplotlib | Full control, publication quality |
| Seaborn | Statistical plots, great defaults |
| Plotly | Interactive charts, web-ready |
| Altair | Declarative, clean API |
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
sns.histplot(data=df, x="revenue", bins=30, kde=True)
plt.title("Revenue Distribution")
plt.savefig("revenue_dist.png")
Start with Matplotlib + Seaborn. They cover 95% of what you need.
Phase 4: Statistics Fundamentals (3 Weeks)
You cannot do data science without statistics. Period.
- Week 1: mean, median, mode, standard deviation, variance, percentiles, distributions.
- Week 2: probability basics, conditional probability, Bayes' theorem, normal distribution, central limit theorem.
- Week 3: hypothesis testing, p-values, confidence intervals, correlation vs causation, A/B testing.
from scipy import stats
# Did the new feature improve conversion?
control = [0.12, 0.11, 0.13, 0.10, 0.12, 0.11, 0.14, 0.12]
treatment = [0.15, 0.14, 0.16, 0.13, 0.15, 0.17, 0.14, 0.16]
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No significant difference -- could be random chance")
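Confidence intervals, also on the week 3 list, come from the same scipy toolbox. A rough sketch for a treatment group like the one above:

```python
import numpy as np
from scipy import stats

treatment = [0.15, 0.14, 0.16, 0.13, 0.15, 0.17, 0.14, 0.16]

# 95% confidence interval for the mean, using the t-distribution
mean = np.mean(treatment)
sem = stats.sem(treatment)  # standard error of the mean
low, high = stats.t.interval(0.95, len(treatment) - 1, loc=mean, scale=sem)
print(f"mean {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```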
The honest take: most courses rush through statistics to get to ML. This is a mistake. A data scientist who doesn't understand p-values is just someone copying code from StackOverflow.
Phase 5: SQL (2 Weeks)
Your data lives in databases. You need SQL to get it out.
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2025-01-01'
GROUP BY department
HAVING COUNT(*) >= 5
ORDER BY avg_salary DESC;
Learn SELECT, WHERE, GROUP BY, JOINs, subqueries, CTEs, and window functions. PostgreSQL is a solid choice to practice with: it's free, widely used in industry, and its dialect stays close to standard SQL.
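CTEs and window functions are the two items on that list that trip people up most. Here's a small sketch you can run anywhere, using Python's built-in sqlite3 module and a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ana', 'eng', 120), ('ben', 'eng', 100),
        ('cara', 'sales', 90), ('dan', 'sales', 95);
""")

# CTE + window function: rank employees by salary within each department,
# then keep only the top earner per department
rows = conn.execute("""
    WITH ranked AS (
        SELECT name, department,
               RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk
        FROM employees
    )
    SELECT name, department FROM ranked WHERE rk = 1
""").fetchall()

print(rows)
```

Window functions require SQLite 3.25+, which ships with any recent Python.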
Phase 6: Machine Learning Basics (4 Weeks)
Notice we're on phase 6 out of 9 -- ML requires everything that came before it.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
| Algorithm | Type | Use When |
|---|---|---|
| Linear Regression | Regression | Predicting continuous values |
| Logistic Regression | Classification | Binary yes/no predictions |
| Random Forest | Both | General purpose, hard to mess up |
| XGBoost | Both | Tabular data king |
| K-Means | Clustering | Finding groups in unlabeled data |
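K-Means is the only unsupervised entry in that table and works differently from the rest: no labels, no train/test split. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two made-up blobs of points, centered at (0, 0) and (10, 10)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

# Ask for 2 clusters; K-Means finds the blob centers on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.round(1))
```

In practice you rarely know the right number of clusters up front; the elbow method or silhouette score helps you pick one.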
Phase 7: Feature Engineering (2 Weeks)
This separates good data scientists from great ones. Raw data is messy. The features you create determine how well models perform.
df["account_age_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
df["is_weekend"] = df["purchase_date"].dt.dayofweek >= 5
df["price_per_unit"] = df["total_price"] / df["quantity"]
df = pd.get_dummies(df, columns=["category", "region"])
No algorithm can compensate for bad features. A simple model with great features beats a complex model with raw data almost every time.
Phase 8: Deep Learning Intro (4 Weeks)
Deep learning dominates in images, text, and audio. For tabular data, traditional ML (XGBoost) often wins.
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.layers(x)
PyTorch has won the research and education battle. Start there.
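A model class alone trains nothing; the part most tutorials gloss over is the loop. Here's a minimal sketch on made-up random data, using a small nn.Sequential model to keep it self-contained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Made-up data: 256 samples, 20 features, 3 classes
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backpropagate
    optimizer.step()               # update weights

print(f"final loss: {loss.item():.3f}")
```

Real training adds batching via DataLoader, a validation set, and early stopping, but every PyTorch loop reduces to these four steps.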
The Math Question
"Do I need linear algebra and calculus?" Honest answer: they help but aren't day-one blockers. Start learning data science and pick up the math as you encounter it. Basic statistics is essential, linear algebra is very helpful, calculus is nice-to-have.
Data Science vs Data Analytics vs ML Engineer
| Role | Focus | Key Skills |
|---|---|---|
| Data Analyst | Reporting, dashboards | SQL, Excel, Tableau, basic Python |
| Data Scientist | Modeling, experimentation | Python, statistics, ML, communication |
| ML Engineer | Deploying ML at scale | Python, MLOps, Docker, cloud |
Getting Started
Follow the roadmap phase by phase. Each skill builds on the previous one, and skipping to ML without the fundamentals leaves you copying code without understanding it.
If you want to build a solid Python foundation before diving into data science tools, CodeUp covers the programming fundamentals that make everything else in this roadmap click faster.