Data Science Roadmap for Beginners: From Python to Machine Learning
A realistic data science learning path covering Python, NumPy, Pandas, statistics, SQL, machine learning, deep learning, and building a portfolio. Includes timelines, tool comparisons, and honest career advice.
Let's get something out of the way immediately: data science is not mostly machine learning. It's mostly data cleaning. The glamorous part -- training models, neural networks, predictions -- is maybe 20% of the actual job. The other 80% is finding data, cleaning it, transforming it, figuring out why column B has 40% null values, and explaining results to people who just want a simple answer.
If that doesn't scare you off, great. Here's how to actually learn this stuff.
The Roadmap
- Python basics (2 weeks)
- NumPy + Pandas (3 weeks)
- Data visualization (2 weeks)
- Statistics fundamentals (3 weeks)
- SQL (2 weeks)
- Machine Learning basics (4 weeks)
- Feature engineering (2 weeks)
- Deep learning intro (4 weeks)
- Real projects + portfolio (ongoing)
Phase 1: Python Basics (2 Weeks)
If you already know Python, skip this. You need: variables, data types, loops, conditionals, functions, list comprehensions, dictionaries, and file I/O. You don't need to become an expert -- just comfortable enough that syntax doesn't slow you down when you're focused on data problems.
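A quick self-check: if you can read something like this without pausing on syntax, you're ready to move on. (The data here is made up.)

```python
# Hypothetical example: summarize made-up exam scores using only
# core Python -- dicts, loops, comprehensions, and functions.
scores = {"ana": 82, "ben": 91, "cara": 78, "dan": 91}

def average(values):
    return sum(values) / len(values)

# List comprehension with a condition
passed = [name for name, s in scores.items() if s >= 80]

print(f"average: {average(scores.values()):.1f}")
print(f"passed: {passed}")
```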
Phase 2: NumPy + Pandas (3 Weeks)
This is where data science actually begins. Pandas is the tool you'll use for everything.
import pandas as pd
df = pd.read_csv("sales_data.csv")
print(df.head()) # First 5 rows
print(df.info()) # Column types, null counts
print(df.describe()) # Statistics for numeric columns
# The operations you'll do on every dataset
df = df.dropna(subset=["revenue"])
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
# Groupby -- the most important Pandas operation
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()
# Filtering
big_deals = df[df["revenue"] > 10000]
Here's the thing about Pandas: you could spend months learning every method. Don't. Learn read_csv, head, info, describe, groupby, merge, apply, fillna, dropna, boolean filtering, and to_csv. That covers 90% of real work.
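Two of those methods, merge and fillna, come up constantly but aren't shown above, so here's a minimal sketch using made-up DataFrames:

```python
import pandas as pd

# Hypothetical tables: orders and a customer lookup
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "revenue": [100, 250, None, 80],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["EU", "US"],
})

# Left join keeps every order, even without a matching customer
df = orders.merge(customers, on="customer_id", how="left")

# Fill missing revenue with 0 and missing region with a sentinel
df["revenue"] = df["revenue"].fillna(0)
df["region"] = df["region"].fillna("unknown")

print(df)
```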
Phase 3: Data Visualization (2 Weeks)
Numbers in a table don't convince anyone. Charts do.
| Library | Best For |
|---|---|
| Matplotlib | Full control, publication quality |
| Seaborn | Statistical plots, great defaults |
| Plotly | Interactive charts, web-ready |
| Altair | Declarative, clean API |
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
sns.histplot(data=df, x="revenue", bins=30, kde=True)
plt.title("Revenue Distribution")
plt.savefig("revenue_dist.png")
Start with Matplotlib + Seaborn. They cover 95% of what you need.
Phase 4: Statistics Fundamentals (3 Weeks)
You cannot do data science without statistics. Period.
- Week 1: mean, median, mode, standard deviation, variance, percentiles, distributions.
- Week 2: probability basics, conditional probability, Bayes' theorem, normal distribution, central limit theorem.
- Week 3: hypothesis testing, p-values, confidence intervals, correlation vs causation, A/B testing.
from scipy import stats
# Did the new feature improve conversion?
control = [0.12, 0.11, 0.13, 0.10, 0.12, 0.11, 0.14, 0.12]
treatment = [0.15, 0.14, 0.16, 0.13, 0.15, 0.17, 0.14, 0.16]
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No significant difference -- could be random chance")
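Confidence intervals, also on the week 3 list, come from the same scipy toolbox. A rough sketch for a treatment group like the one above:

```python
import numpy as np
from scipy import stats

treatment = [0.15, 0.14, 0.16, 0.13, 0.15, 0.17, 0.14, 0.16]

# 95% confidence interval for the mean, using the t-distribution
mean = np.mean(treatment)
sem = stats.sem(treatment)  # standard error of the mean
low, high = stats.t.interval(0.95, len(treatment) - 1, loc=mean, scale=sem)
print(f"mean {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```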
The honest take: most courses rush through statistics to get to ML. This is a mistake. A data scientist who doesn't understand p-values is just someone copying code from StackOverflow.
Phase 5: SQL (2 Weeks)
Your data lives in databases. You need SQL to get it out.
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2025-01-01'
GROUP BY department
HAVING COUNT(*) >= 5
ORDER BY avg_salary DESC;
Learn SELECT, WHERE, GROUP BY, JOINs, subqueries, CTEs, and window functions. PostgreSQL is a solid choice to practice with: it's free, widely used in industry, and its dialect stays close to standard SQL.
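CTEs and window functions are the two items on that list that trip people up most. Here's a small sketch you can run anywhere, using Python's built-in sqlite3 module and a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ana', 'eng', 120), ('ben', 'eng', 100),
        ('cara', 'sales', 90), ('dan', 'sales', 95);
""")

# CTE + window function: rank employees by salary within each department,
# then keep only the top earner per department
rows = conn.execute("""
    WITH ranked AS (
        SELECT name, department,
               RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk
        FROM employees
    )
    SELECT name, department FROM ranked WHERE rk = 1
""").fetchall()

print(rows)
```

Window functions require SQLite 3.25+, which ships with any recent Python.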
Phase 6: Machine Learning Basics (4 Weeks)
Notice we're on phase 6 out of 9 -- ML requires everything that came before it.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
| Algorithm | Type | Use When |
|---|---|---|
| Linear Regression | Regression | Predicting continuous values |
| Logistic Regression | Classification | Binary yes/no predictions |
| Random Forest | Both | General purpose, hard to mess up |
| XGBoost | Both | Tabular data king |
| K-Means | Clustering | Finding groups in unlabeled data |
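K-Means is the only unsupervised entry in that table and works differently from the rest: no labels, no train/test split. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two made-up blobs of points, centered at (0, 0) and (10, 10)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

# Ask for 2 clusters; K-Means finds the blob centers on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.round(1))
```

In practice you rarely know the right number of clusters up front; the elbow method or silhouette score helps you pick one.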
Phase 7: Feature Engineering (2 Weeks)
This separates good data scientists from great ones. Raw data is messy. The features you create determine how well models perform.
df["account_age_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
df["is_weekend"] = df["purchase_date"].dt.dayofweek >= 5
df["price_per_unit"] = df["total_price"] / df["quantity"]
df = pd.get_dummies(df, columns=["category", "region"])
No algorithm can compensate for bad features. A simple model with great features beats a complex model with raw data almost every time.
Phase 8: Deep Learning Intro (4 Weeks)
Deep learning dominates in images, text, and audio. For tabular data, traditional ML (XGBoost) often wins.
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.layers(x)
PyTorch has won the research and education battle. Start there.
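A model class alone trains nothing; the part most tutorials gloss over is the loop. Here's a minimal sketch on made-up random data, using a small nn.Sequential model to keep it self-contained:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Made-up data: 256 samples, 20 features, 3 classes
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backpropagate
    optimizer.step()               # update weights

print(f"final loss: {loss.item():.3f}")
```

Real training adds batching via DataLoader, a validation set, and early stopping, but every PyTorch loop reduces to these four steps.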
The Math Question
"Do I need linear algebra and calculus?" Honest answer: they help but aren't day-one blockers. Start learning data science and pick up the math as you encounter it. Basic statistics is essential, linear algebra is very helpful, calculus is nice-to-have.
Data Science vs Data Analytics vs ML Engineer
| Role | Focus | Key Skills |
|---|---|---|
| Data Analyst | Reporting, dashboards | SQL, Excel, Tableau, basic Python |
| Data Scientist | Modeling, experimentation | Python, statistics, ML, communication |
| ML Engineer | Deploying ML at scale | Python, MLOps, Docker, cloud |
Getting Started
Follow the roadmap phase by phase. Each skill builds on the previous one, and skipping to ML without the fundamentals leaves you copying code without understanding it.
If you want to build a solid Python foundation before diving into data science tools, CodeUp covers the programming fundamentals that make everything else in this roadmap click faster.