Logging and Monitoring for Web Apps — A Practical Guide
How to implement logging and monitoring that actually helps you debug production issues. Covers structured logging, log levels, monitoring tools, alerting, and observability patterns.
Your app is going to break in production. Not because you're a bad developer -- because distributed systems are inherently unpredictable. A database connection times out. A third-party API starts returning 500s. Memory usage creeps up until the OOM killer strikes. The question isn't whether it'll happen, but whether you'll know about it before your users tell you.
Most developers add logging as an afterthought. They sprinkle console.log("here") during debugging, delete most of it, and ship. Then something breaks at 2 AM and they're left reading through raw stdout trying to reconstruct what happened.
Good logging and monitoring isn't hard. It just requires thinking about it before things go wrong.
Structured Logging — Stop Using console.log
This is useless in production:
Error processing request
Something went wrong
User not found
You can't search it, filter it, or aggregate it. You don't know when it happened, which endpoint was hit, or which user was affected.
Structured logging outputs JSON instead of free-text:
{
"timestamp": "2026-03-26T14:32:01.234Z",
"level": "error",
"message": "User not found",
"service": "user-api",
"requestId": "req_abc123",
"userId": "usr_456",
"endpoint": "GET /api/users/456",
"duration": 12,
"statusCode": 404
}
Now you can filter: "show me all errors from the user-api service in the last hour where duration exceeded 1000ms." That's the difference between debugging for 5 minutes and debugging for 5 hours.
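To make that claim concrete, here is a tiny sketch of filtering newline-delimited JSON logs in plain Node.js. The field names mirror the example entry above; in practice this query would run in your log aggregator, not in application code:

```javascript
// Parse newline-delimited JSON log lines and filter them like a query.
// Sample entries use the same fields as the example log line above.
const lines = [
  '{"level":"error","service":"user-api","duration":1500,"message":"User not found"}',
  '{"level":"info","service":"user-api","duration":12,"message":"Request completed"}',
  '{"level":"error","service":"billing","duration":2200,"message":"Charge failed"}',
];

// "All errors from user-api where duration exceeded 1000ms":
const slowUserApiErrors = lines
  .map((line) => JSON.parse(line))
  .filter((e) => e.level === "error" && e.service === "user-api" && e.duration > 1000);

console.log(slowUserApiErrors.length); // 1
```

Try doing that with free-text log lines and you're back to fragile regexes.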
Node.js: Pino
Pino is one of the fastest structured loggers for Node.js:
npm install pino pino-pretty
import pino from "pino";
const logger = pino({
level: process.env.LOG_LEVEL || "info",
transport: process.env.NODE_ENV === "development"
? { target: "pino-pretty", options: { colorize: true } }
: undefined,
});
// Basic logging
logger.info("Server started on port 3000");
logger.error({ err: error }, "Failed to connect to database");
// Add context
logger.info({
userId: user.id,
action: "login",
ip: request.ip,
}, "User logged in");
In development, pino-pretty gives you readable colored output. In production, it outputs raw JSON for your log aggregator.
Python: structlog
import structlog
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
]
)
logger = structlog.get_logger()
logger.info("user_login", user_id="usr_456", ip="192.168.1.1")
logger.error("payment_failed", order_id="ord_789", amount=49.99, reason="card_declined")
Express Middleware for Request Logging
import { randomUUID } from "crypto";
function requestLogger(req, res, next) {
const requestId = randomUUID();
const start = Date.now();
// Attach to request for use in handlers
req.requestId = requestId;
req.log = logger.child({ requestId });
res.on("finish", () => {
const duration = Date.now() - start;
const logData = {
method: req.method,
url: req.originalUrl,
statusCode: res.statusCode,
duration,
userAgent: req.get("user-agent"),
ip: req.ip,
};
if (res.statusCode >= 500) {
req.log.error(logData, "Request failed");
} else if (res.statusCode >= 400) {
req.log.warn(logData, "Client error");
} else {
req.log.info(logData, "Request completed");
}
});
next();
}
app.use(requestLogger);
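Conceptually, logger.child just binds fields that get merged into every subsequent entry, which is what makes all log lines for one request correlatable. A dependency-free sketch of that idea (childLogger here is a toy stand-in, not Pino's API):

```javascript
// A toy "child logger": merges bound context into every entry it emits,
// mimicking what pino's logger.child({ requestId }) does.
function childLogger(bound, sink = (entry) => console.log(JSON.stringify(entry))) {
  return {
    info: (fields, message) => sink({ level: "info", ...bound, ...fields, message }),
    error: (fields, message) => sink({ level: "error", ...bound, ...fields, message }),
  };
}

const entries = [];
const reqLog = childLogger({ requestId: "req_abc123" }, (e) => entries.push(e));

reqLog.info({ userId: "usr_456" }, "Fetching user");
reqLog.error({ userId: "usr_456" }, "User not found");

console.log(entries[1].requestId); // req_abc123 -- bound onto every entry
```

With the real middleware above, any handler can call req.log.info(...) and the requestId shows up automatically, letting you pull every line for one request out of the aggregator.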
Log Levels — Use Them Correctly
| Level | When to Use | Example |
|---|---|---|
| fatal | Application cannot continue | Database connection pool exhausted |
| error | Operation failed, needs attention | Payment processing failed |
| warn | Something unexpected but handled | Retry succeeded after timeout |
| info | Normal operations worth recording | User signed up, order placed |
| debug | Detailed diagnostic information | Cache miss for key xyz, query took 45ms |
| trace | Very verbose, step-by-step | Entering function X with args Y |
Default to the info level in production. Bump to debug when investigating issues. Never run trace in production unless you enjoy paying for log storage.
The most common mistake: logging errors at warn level. If something failed and needs fixing, it's an error. Warnings are for "this worked, but something's off."
// Wrong: this failed, it's an error not a warning
logger.warn("Failed to send email to user");
// Right
logger.error({ userId, emailType, err }, "Failed to send email");
// This is a warning: the request succeeded, but via a slow fallback path
logger.warn({ cacheKey, latency: 2300 }, "Cache miss, fell back to database");
What to Log
Always log:
- Incoming requests (method, path, status code, duration)
- Authentication events (login, logout, failed attempts)
- Business-critical operations (orders, payments, signups)
- Errors with full context (stack trace, request data, user ID)
- External service calls (API requests to third parties, with duration)
Never log:
- Passwords, tokens, API keys
- Full credit card numbers
- Personal health information
- Anything covered by GDPR/PII regulations without proper handling
// Scrub sensitive data
function sanitize(body) {
const clean = { ...body };
if (clean.password) clean.password = "[REDACTED]";
if (clean.creditCard) clean.creditCard = "**" + clean.creditCard.slice(-4);
if (clean.ssn) clean.ssn = "[REDACTED]";
return clean;
}
logger.info({ body: sanitize(req.body) }, "Processing payment");
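The sanitize helper above only checks top-level keys, which misses nested payloads. A recursive variant catches those too (the key list is illustrative; Pino users can also look at its built-in redact option):

```javascript
// Recursively redact sensitive keys anywhere in a nested payload.
// The key names here are examples -- adjust to your own schema.
const SENSITIVE = new Set(["password", "token", "apiKey", "ssn"]);

function deepSanitize(value) {
  if (Array.isArray(value)) return value.map(deepSanitize);
  if (value && typeof value === "object") {
    const clean = {};
    for (const [key, v] of Object.entries(value)) {
      if (SENSITIVE.has(key)) {
        clean[key] = "[REDACTED]";
      } else if (key === "creditCard" && typeof v === "string") {
        clean[key] = "****" + v.slice(-4); // keep last 4 digits only
      } else {
        clean[key] = deepSanitize(v);
      }
    }
    return clean;
  }
  return value;
}

const safe = deepSanitize({
  user: { email: "a@b.com", password: "hunter2" },
  payment: { creditCard: "4242424242424242" },
});
console.log(safe.user.password); // [REDACTED]
```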
Log Aggregation
Raw log files on a server are fine for a single-instance hobby project. For anything else, you need a log aggregation service.
| Service | Free Tier | Best For |
|---|---|---|
| Grafana Loki | Self-hosted, unlimited | Teams already using Grafana |
| Datadog | 1 day retention | Enterprise, full observability |
| Axiom | 500GB/month ingest | Generous free tier, simple |
| Better Stack (Logtail) | 1GB/month | Clean UI, easy setup |
| AWS CloudWatch | 5GB ingest + 5GB storage | AWS-native apps |
| ELK Stack | Self-hosted, unlimited | Full control, complex setup |
// Axiom with Pino
import pino from "pino";
const logger = pino({
transport: {
targets: [
{ target: "pino-pretty", level: "debug" },
{
target: "@axiomhq/pino",
options: {
dataset: "my-app",
token: process.env.AXIOM_TOKEN,
},
},
],
},
});
Monitoring — Knowing Before Users Complain
Logging tells you _what happened_. Monitoring tells you _what's happening right now_.
The Four Golden Signals
Google's SRE book defines four metrics that matter most:
- Latency — How long requests take (track p50, p95, p99)
- Traffic — Requests per second
- Errors — Error rate as a percentage of total requests
- Saturation — How full your resources are (CPU, memory, disk, connections)
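The latency percentiles mentioned above (p50, p95, p99) are straightforward to compute over a window of samples. A rough sketch using the nearest-rank method (monitoring systems typically do this for you from histogram buckets):

```javascript
// Nearest-rank percentile over a window of request durations (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const durations = [12, 15, 18, 22, 30, 45, 60, 120, 300, 2500];
console.log(percentile(durations, 50)); // 30
console.log(percentile(durations, 95)); // 2500 -- one slow request dominates the tail
```

This is why averages lie: the mean of those samples is ~312ms, but half your users saw 30ms and the unlucky tail waited 2.5 seconds.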
Health Check Endpoint
Every service needs one:
app.get("/health", async (req, res) => {
const checks = {
uptime: process.uptime(),
timestamp: Date.now(),
database: "unknown",
redis: "unknown",
};
try {
await db.query("SELECT 1");
checks.database = "healthy";
} catch (err) {
checks.database = "unhealthy";
}
try {
await redis.ping();
checks.redis = "healthy";
} catch (err) {
checks.redis = "unhealthy";
}
const isHealthy = checks.database === "healthy" && checks.redis === "healthy";
res.status(isHealthy ? 200 : 503).json(checks);
});
Hit this endpoint from an uptime monitor (UptimeRobot, Better Uptime, Pingdom). Get alerted when it returns non-200.
Prometheus Metrics
For custom metrics, Prometheus is the standard:
import { Counter, Histogram, Registry } from "prom-client";
const register = new Registry();
const httpRequestDuration = new Histogram({
name: "http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
registers: [register],
});
const httpRequestTotal = new Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: ["method", "route", "status_code"],
registers: [register],
});
// Middleware
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer();
res.on("finish", () => {
const route = req.route?.path || req.path;
const labels = { method: req.method, route, status_code: res.statusCode };
end(labels);
httpRequestTotal.inc(labels);
});
next();
});
// Expose metrics endpoint
app.get("/metrics", async (req, res) => {
res.set("Content-Type", register.contentType);
res.end(await register.metrics());
});
Grafana reads from Prometheus and renders dashboards. The combo is free, self-hosted, and used by most of the industry.
Alerting — Signal vs Noise
Bad alerting is worse than no alerting. If your team gets 50 alerts a day and most are false positives, everyone starts ignoring them. Then the real incident gets lost in the noise.
Rules for useful alerts:
Alert on symptoms, not causes. Don't alert when CPU hits 80% -- alert when response time exceeds your SLA. High CPU might be fine during a traffic spike. Slow responses are never fine.

Include context in the alert:

🔴 High error rate on user-api
Error rate: 12.3% (threshold: 5%)
Affected endpoint: POST /api/payments
Started: 14:32 UTC
Dashboard: https://grafana.internal/d/user-api
Tier your alerts:
| Severity | Response Time | Channel | Example |
|---|---|---|---|
| Critical | Immediate | PagerDuty/phone | Service down, data loss |
| Warning | Within 1 hour | Slack channel | Error rate elevated |
| Info | Next business day | Email/dashboard | Disk usage above 70% |
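Tiering like this can be encoded directly in an alert router. A minimal sketch (the channel names and response windows below just mirror the table; wire them to your real integrations):

```javascript
// Route alerts to a channel based on severity, mirroring the tier table.
const ROUTES = {
  critical: { channel: "pagerduty", respondWithin: "immediate" },
  warning:  { channel: "slack",     respondWithin: "1h" },
  info:     { channel: "email",     respondWithin: "next-business-day" },
};

function routeAlert(alert) {
  // Unknown severities fall through to the lowest tier rather than paging.
  const route = ROUTES[alert.severity] || ROUTES.info;
  return { ...alert, ...route };
}

const routed = routeAlert({
  severity: "critical",
  summary: "High error rate on user-api",
  errorRate: 0.123,
});
console.log(routed.channel); // pagerduty
```

Keeping the mapping in one place makes it easy to audit which alerts can actually wake someone up.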
Error Tracking
Generic logging catches errors, but a dedicated error tracker gives you stack traces, deduplication, and release tracking:
import * as Sentry from "@sentry/node";
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
release: process.env.GIT_SHA,
tracesSampleRate: 0.1, // 10% of transactions for performance monitoring
});
// Automatic Express error capture
app.use(Sentry.Handlers.errorHandler());
// Manual capture with context
try {
await processPayment(order);
} catch (err) {
Sentry.withScope((scope) => {
scope.setUser({ id: user.id, email: user.email });
scope.setExtra("orderId", order.id);
scope.setExtra("amount", order.amount);
Sentry.captureException(err);
});
throw err;
}
Sentry groups identical errors, shows which release introduced a bug, and tracks whether errors are increasing or decreasing. The free tier handles 5K errors/month, which is plenty for most projects.
Distributed Tracing
When a single request touches multiple services, logs from each service are disconnected. Distributed tracing connects them with a trace ID:
Request → API Gateway → User Service → Database
→ Payment Service → Stripe API
→ Email Service → SendGrid
OpenTelemetry is the standard:
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: "https://otel-collector.internal/v1/traces",
}),
serviceName: "user-api",
});
sdk.start();
With tracing, you can see that a slow checkout request spent 200ms in the user service, 50ms in inventory, and 3000ms waiting for the payment provider. Without it, you'd be guessing which service was the bottleneck.
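Beyond the SDK setup, services add manual spans around interesting operations (with @opentelemetry/api, via tracer.startActiveSpan). The core idea, timing named sub-operations that share one trace ID, can be sketched without the library:

```javascript
// A toy tracer: records named spans sharing one trace ID so the
// per-service timings described above can be reconstructed afterwards.
// Real code would use @opentelemetry/api's tracer.startActiveSpan instead.
function makeTracer(traceId) {
  const spans = [];
  return {
    spans,
    span(name, fn) {
      const start = Date.now();
      try {
        return fn();
      } finally {
        spans.push({ traceId, name, durationMs: Date.now() - start });
      }
    },
  };
}

const tracer = makeTracer("trace_abc123");
tracer.span("user-service", () => { /* look up user */ });
tracer.span("payment-service", () => { /* call payment provider */ });
console.log(tracer.spans.map((s) => s.name)); // [ 'user-service', 'payment-service' ]
```

The trace ID is what a backend like Jaeger or Tempo uses to stitch spans from different services into one waterfall view.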
A Practical Stack for Most Projects
For a team of 1-10 developers:
- Structured logging with Pino (Node.js) or structlog (Python)
- Error tracking with Sentry (free tier)
- Uptime monitoring with UptimeRobot or Better Uptime (free tier)
- Log aggregation with Axiom or Better Stack (free tier)
- Metrics with Prometheus + Grafana if you need dashboards
The goal is always the same: when something breaks at 2 AM, you should be able to open your dashboard, see exactly what went wrong, and fix it before most users even notice. At CodeUp, we've found that even basic structured logging cuts incident resolution time by more than half compared to grepping through raw log files.