R Programming: The Language Data Scientists Actually Use for Statistics
A practical guide to R programming — data types, tidyverse, ggplot2, statistical analysis, and when R is the right choice over Python.
There's a running debate in data science: R or Python? The Python camp argues that Python is a general-purpose language, has better machine learning libraries, and is easier to integrate into production systems. They're right about all three points. What they consistently underestimate is how much faster R lets you go from raw data to statistical insight.
R was designed by statisticians for statisticians. That's both its greatest strength and its weirdest quirk. The language has syntax decisions that make software engineers cringe — 1-based indexing, the <- assignment operator, vector recycling that silently broadcasts values. But it also has the most expressive data manipulation grammar in any programming language, the best static visualization library (ggplot2, and it's not close), and statistical functions so comprehensive that many exist nowhere else.
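That recycling quirk is worth seeing once. A minimal sketch of the silent broadcasting behavior:

```r
# Recycling: the shorter vector is repeated to match the longer one's length
c(1, 2, 3, 4) + c(10, 20)   # [1] 11 22 13 24
# R warns only when the longer length is not a multiple of the shorter one
```

The second vector becomes 10, 20, 10, 20 with no error at all — convenient for scaling, but a real source of silent bugs when two vectors are accidentally mismatched.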
If your work involves exploratory data analysis, statistical modeling, or creating publication-quality visualizations, R isn't just competitive with Python — it's often better.
The Basics
R is a vectorized language. Almost everything operates on vectors (one-dimensional arrays) by default:
# Vectors — the fundamental data type
ages <- c(25, 30, 35, 28, 42, 31)
names <- c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank")
# Operations apply to the entire vector
ages + 1 # [1] 26 31 36 29 43 32
ages > 30 # [1] FALSE FALSE TRUE FALSE TRUE TRUE
mean(ages) # [1] 31.83333
sd(ages) # [1] 5.981194
summary(ages) # Min, 1st Qu, Median, Mean, 3rd Qu, Max
# Logical subsetting
ages[ages > 30] # [1] 35 42 31
names[ages > 30] # [1] "Carol" "Eve" "Frank"
# Sequences and repetitions
1:10 # [1] 1 2 3 4 5 6 7 8 9 10
seq(0, 1, by = 0.1) # 0.0 0.1 0.2 ... 1.0
rep("A", 5) # [1] "A" "A" "A" "A" "A"
This vectorization is why R code tends to be concise. Where Python needs a list comprehension or a NumPy function, R just applies the operation directly. ages > 30 returns a logical vector. Using that vector as an index filters the original vector. It's terse but powerful once the pattern clicks.
Data Frames — Where R Shines
Data frames are R's tabular data structure — think spreadsheets or SQL tables:
# Create a data frame
employees <- data.frame(
name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
department = c("Engineering", "Marketing", "Engineering", "Sales", "Marketing"),
salary = c(95000, 72000, 105000, 68000, 78000),
years = c(5, 3, 8, 2, 4),
stringsAsFactors = FALSE
)
# Basic operations
nrow(employees) # 5
ncol(employees) # 4
str(employees) # Structure — types and sample values
head(employees, 3) # First 3 rows
# Subsetting
employees[employees$salary > 80000, ] # Rows where salary > 80k
employees[, c("name", "salary")] # Select specific columns
employees$department # Extract one column as vector
# Adding columns
employees$bonus <- employees$salary * 0.1
employees$senior <- employees$years >= 5
Base R data frames work fine, but the tidyverse (which we'll get to shortly) transforms data manipulation from functional to fluent. First, though, let's cover the other core data types.
Data Types
R has types that feel unique if you come from other languages:
# Factors — categorical data with defined levels
status <- factor(c("active", "inactive", "active", "pending"),
levels = c("pending", "active", "inactive"))
table(status) # Counts per level
# pending active inactive
# 1 2 1
# Lists — heterogeneous containers (like Python dicts)
person <- list(
name = "Alice",
age = 30,
scores = c(85, 92, 78),
address = list(city = "NYC", zip = "10001")
)
person$name # "Alice"
person[["scores"]] # c(85, 92, 78)
person$address$city # "NYC"
# Matrices — 2D numeric arrays
m <- matrix(1:12, nrow = 3, ncol = 4)
m %*% t(m) # Matrix multiplication
# NULL vs NA
x <- NULL # Absence of a value — the variable has no content
y <- NA # Missing value — the value exists but is unknown
is.null(x) # TRUE
is.na(y) # TRUE
mean(c(1, 2, NA, 4)) # NA (propagates)
mean(c(1, 2, NA, 4), na.rm = TRUE) # 2.333333
The NA propagation behavior is intentional. In statistical analysis, silently ignoring missing data leads to wrong conclusions. R forces you to explicitly decide how to handle it with na.rm = TRUE or functions like na.omit(), complete.cases(), or imputation.
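A quick sketch of those explicit choices, using a small made-up data frame:

```r
# Hypothetical data with missing values in both columns
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))

complete.cases(df)          # TRUE FALSE FALSE TRUE — rows with no NAs
na.omit(df)                 # drops rows 2 and 3, keeps rows 1 and 4
colMeans(df, na.rm = TRUE)  # per-column means, ignoring NAs
```

Each call makes the missing-data decision visible in the code, which is exactly the point.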
The Tidyverse — Modern R
The tidyverse is a collection of packages that share a common design philosophy. It was created by Hadley Wickham and has become the standard way to do data science in R. If you're learning R in 2026, learn the tidyverse first:
library(tidyverse) # Loads dplyr, ggplot2, tidyr, readr, purrr, stringr, forcats, tibble (plus lubridate since tidyverse 2.0)
# Read data
sales <- read_csv("sales_data.csv")
# The pipe operator |> chains operations
# (older code uses %>% from magrittr — same concept)
sales_summary <- sales |>
filter(year >= 2024) |> # Keep rows
mutate(
revenue = quantity * price, # Add columns
quarter = paste0("Q", ceiling(month / 3))
) |>
group_by(region, quarter) |> # Group
summarize(
total_revenue = sum(revenue),
avg_order = mean(revenue),
n_orders = n(),
.groups = "drop"
) |>
arrange(desc(total_revenue)) # Sort
# The result is a tibble (enhanced data frame)
print(sales_summary)
Each function does one thing: filter selects rows, mutate creates or modifies columns, group_by sets grouping, summarize aggregates, arrange sorts, select picks columns, rename renames them. The pipe |> chains them together left to right.
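select and rename are the two verbs the pipeline above doesn't use; a quick sketch with a made-up tibble (assumes dplyr is installed):

```r
library(dplyr)

df <- tibble(first = c("Alice", "Bob"), yrs = c(5, 3))

df |>
  select(first, yrs) |>               # pick (and order) columns
  rename(name = first, years = yrs)   # new_name = old_name
# A tibble with columns: name, years
```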
This reads almost like English: "Take sales, filter to 2024+, calculate revenue and quarter, group by region and quarter, summarize totals, sort by revenue." Compare this to the equivalent pandas code in Python, and the readability difference is stark.
More tidyverse patterns:
# Reshaping data — wide to long
wide_data <- tibble(
student = c("Alice", "Bob"),
math = c(90, 85),
science = c(88, 92),
english = c(95, 78)
)
long_data <- wide_data |>
pivot_longer(
cols = c(math, science, english),
names_to = "subject",
values_to = "score"
)
# student subject score
# Alice math 90
# Alice science 88
# ...
# Joining data
orders <- tibble(
order_id = 1:5,
customer_id = c(101, 102, 101, 103, 102),
amount = c(50, 75, 30, 120, 45)
)
customers <- tibble(
customer_id = c(101, 102, 103),
name = c("Alice", "Bob", "Carol")
)
orders |>
left_join(customers, by = "customer_id") |>
group_by(name) |>
summarize(total_spent = sum(amount), n_orders = n())
# String operations with stringr
names <- c("John Smith", "Jane Doe", "Bob Johnson")
str_to_upper(names)
str_detect(names, "John") # TRUE FALSE TRUE
str_extract(names, "\\w+$") # "Smith" "Doe" "Johnson"
str_replace(names, " ", "_") # "John_Smith" "Jane_Doe" "Bob_Johnson"
ggplot2 — The Best Visualization Library, Period
ggplot2 implements the "Grammar of Graphics" — the idea that every statistical graphic is composed of data, aesthetic mappings, and geometric objects. This sounds academic. In practice, it means you build visualizations by layering components:
library(ggplot2)
# Basic scatter plot
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
# Add color, size, and a trend line
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Engine Displacement vs Highway MPG",
x = "Engine Displacement (liters)",
y = "Highway MPG",
color = "Vehicle Class"
) +
theme_minimal()
# Faceted plot — split by variable
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(method = "loess") +
facet_wrap(~drv, labeller = labeller(drv = c(
"f" = "Front-wheel", "r" = "Rear-wheel", "4" = "4-wheel"
))) +
theme_bw() +
labs(title = "Fuel Efficiency by Drive Type")
# Bar chart with error bars
experiment_summary <- experiment_data |>
group_by(treatment) |>
summarize(
mean_response = mean(response),
se = sd(response) / sqrt(n()),
.groups = "drop"
)
ggplot(experiment_summary, aes(x = treatment, y = mean_response, fill = treatment)) +
geom_col(width = 0.6) +
geom_errorbar(
aes(ymin = mean_response - 1.96 * se, ymax = mean_response + 1.96 * se),
width = 0.2
) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Treatment Effect on Response Variable",
y = "Mean Response (95% CI)") +
theme_minimal() +
theme(legend.position = "none")
# Density plot
ggplot(diamonds, aes(x = price, fill = cut)) +
geom_density(alpha = 0.5) +
scale_x_log10(labels = scales::dollar) +
labs(title = "Diamond Price Distribution by Cut Quality") +
theme_minimal()
The + operator layers elements. Start with data and aesthetics, add geometric shapes, customize scales, add labels, choose a theme. Every piece is independently modifiable. Want to switch from a scatter plot to a box plot? Change geom_point() to geom_boxplot(). Want to split into panels? Add facet_wrap(). The grammar is composable in a way that matplotlib and seaborn can't match.
For publication-ready figures, ggplot2 exports to PDF, SVG, and PNG at any resolution:
ggsave("figure1.pdf", width = 8, height = 5, dpi = 300)
Statistical Functions
R has statistical functions built in that require third-party libraries in every other language:
# T-test — compare two groups
control <- c(23, 25, 28, 22, 27, 24, 26)
treatment <- c(28, 31, 27, 33, 29, 30, 32)
t.test(treatment, control)
# t = 4.33, df = 12, p-value ≈ 0.001
# 95% CI: [2.48, 7.52]
# Mean difference: 5 (treatment 30 vs. control 25)
# Linear regression
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
# Coefficients, R-squared, F-statistic, p-values all printed
# ANOVA
model_aov <- aov(yield ~ fertilizer * irrigation, data = crop_data)
summary(model_aov)
TukeyHSD(model_aov) # Post-hoc pairwise comparisons
# Chi-squared test
observed <- matrix(c(45, 55, 30, 70), nrow = 2)
chisq.test(observed)
# Correlation matrix
cor(mtcars[, c("mpg", "wt", "hp", "disp")])
# Logistic regression
log_model <- glm(survived ~ age + sex + class,
data = titanic, family = binomial)
summary(log_model)
exp(coef(log_model)) # Odds ratios
# Time series
ts_data <- ts(monthly_sales, start = c(2020, 1), frequency = 12)
decompose(ts_data) # Trend, seasonal, random components
forecast::auto.arima(ts_data) # Automatic ARIMA model selection
The output of summary(model) in R gives you everything a statistician needs: coefficients with standard errors, t-values, p-values, R-squared, adjusted R-squared, F-statistic, and residual standard error. In Python, you need to install statsmodels, configure the model, call summary, and the output is less polished.
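You can also pull those pieces out of a fitted model programmatically instead of reading the printed summary. mtcars ships with R, so this sketch runs as-is:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

coef(model)               # named vector of coefficients
confint(model)            # 95% confidence intervals per coefficient
summary(model)$r.squared  # R-squared as a plain number
residuals(model)[1:3]     # first few residuals
```

This is handy when a script needs a single number (an R-squared, a slope) rather than a report.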
R vs. Python for Data Science
This is the comparison that matters, so let's be specific:
R wins at:
- Exploratory data analysis (tidyverse is faster to write than pandas)
- Statistical modeling (built-in functions + formula syntax)
- Visualization (ggplot2 beats matplotlib/seaborn for static plots)
- Bioinformatics (Bioconductor has 2000+ specialized packages)
- Academic statistics (every statistical test exists in R, often from the method's inventor)
Python wins at:
- Machine learning and deep learning (scikit-learn, PyTorch, TensorFlow)
- Production deployment (Flask/FastAPI for model serving)
- General-purpose programming (web scraping, automation, APIs)
- Integration with software engineering workflows
- Community size and job market
If you're starting from zero and your goal is a data science career, learn Python first — it's more versatile and has more job listings. But learn R too, because for statistical analysis and visualization, nothing else is as expressive.
When R Is the Right Choice
Choose R when:
- Your primary task is statistical analysis or hypothesis testing
- You need publication-quality visualizations
- You're in academia or a research environment
- You're working with bioinformatics data (Bioconductor is unrivaled)
- Your team already has R expertise
- You need to do exploratory data analysis quickly
Choose Python when:
- You're building a production ML pipeline
- You need to deploy models as web services
- Your team is primarily software engineers, not statisticians
- You need to do significant string processing, web scraping, or automation alongside analysis
- Hiring R developers in your area is difficult
Getting Started
Install R from CRAN (cran.r-project.org) and RStudio from posit.co. RStudio is the standard IDE for R — it combines a code editor, console, environment viewer, plot pane, and package manager in one interface.
Then install the tidyverse:
install.packages("tidyverse")
library(tidyverse)
The best learning resource is "R for Data Science" (r4ds.hadley.nz) — it's free, online, and teaches modern R using the tidyverse from page one.
Start with data you care about. Download a CSV from Kaggle, load it with read_csv(), explore it with glimpse() and summary(), clean it with dplyr, and visualize it with ggplot2. R rewards curiosity — the more you explore data, the more natural the language feels.
For building the programming logic and problem-solving skills that make data analysis more effective, work through challenges on CodeUp. Strong fundamentals in any language make learning R (and every other tool) faster.