Confusion Matrix Calculator — Precision, Recall, F1, and More
Calculate precision, recall, F1-score, specificity, and MCC from your confusion matrix values. Understand your classifier's real performance.
Accuracy is the most reported metric in ML papers and also one of the most misleading. A fraud detection model that predicts "not fraud" for every transaction achieves 99.9% accuracy on a typical dataset. The confusion matrix is how you catch this kind of deception — in your own models and in other people's claims.
What's in a Confusion Matrix
For a binary classifier, the matrix has four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
The Metrics and When They Matter
| Metric | Formula | Best for |
|---|---|---|
| Accuracy | (TP+TN)/(TP+FP+TN+FN) | Balanced datasets only |
| Precision | TP/(TP+FP) | When false positives are costly (spam filter) |
| Recall (Sensitivity) | TP/(TP+FN) | When false negatives are costly (cancer screening) |
| F1 Score | 2 × (P×R)/(P+R) | Unbalanced classes, balanced P/R tradeoff |
| Specificity | TN/(TN+FP) | How well you identify true negatives |
| MCC | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Skewed datasets; uses all four cells |
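As a sketch, all six metrics can be computed from the four cells in a few lines of plain Python (this is illustrative, not CalcHub's actual implementation):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute standard classification metrics from the four
    confusion-matrix cells of a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC's denominator is the geometric mean of the four marginal totals;
    # if any marginal is zero, MCC is conventionally reported as 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}
```

The zero-denominator guards matter in practice: a model that predicts a single class makes one of the marginals zero, and an unguarded division crashes exactly on the degenerate models you most want to catch.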
Practical Example: Medical Screening Test
A diagnostic test for a disease with 5% prevalence in a population of 10,000:
- TP: 430 (correctly identified sick patients)
- FN: 70 (sick patients missed — very bad)
- FP: 950 (healthy patients flagged — causes unnecessary anxiety and follow-up)
- TN: 8,550 (correctly cleared)
- Accuracy: 89.8% (looks decent, but...)
- Precision: 31.2% (only 1 in 3 positive flags is actually sick)
- Recall: 86.0% (catches most sick patients)
- F1: 0.457
- MCC: 0.48
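Plugging the four cells into the formulas directly is a quick way to sanity-check any calculator's output (pure stdlib):

```python
import math

# Medical screening example: 5% prevalence, n = 10,000
TP, FN, FP, TN = 430, 70, 950, 8550

accuracy = (TP + TN) / (TP + FN + FP + TN)          # 0.898
precision = TP / (TP + FP)                           # ~0.312
recall = TP / (TP + FN)                              # 0.860
f1 = 2 * precision * recall / (precision + recall)   # ~0.457
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # ~0.48
```

Note how accuracy sits near 90% while precision and MCC expose the weakness: most positive flags are false alarms.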
Multi-Class Confusion Matrices
For 3+ classes, the calculator builds a full N×N matrix. You can paste in a matrix as CSV or enter values cell by cell. It computes per-class precision and recall, then aggregates with macro averaging (unweighted mean across classes) or weighted averaging (weighted by class frequency).
When one class has far more samples, weighted F1 tells a different story than macro F1 — the calculator shows both so you can't miss an imbalanced class getting buried.
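The two averaging schemes can be sketched as follows. The matrix convention here (rows = true class, columns = predicted class) is an assumption; the calculator's own layout may differ:

```python
def per_class_f1(matrix):
    """Per-class (F1, support) pairs from an N x N confusion matrix
    where matrix[i][j] = count of true class i predicted as class j."""
    n = len(matrix)
    stats = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # column sum minus TP
        fn = sum(matrix[c]) - tp                        # row sum minus TP
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        stats.append((f1, sum(matrix[c])))              # support = row sum
    return stats

def macro_f1(stats):
    # Unweighted mean: every class counts equally, however rare.
    return sum(f for f, _ in stats) / len(stats)

def weighted_f1(stats):
    # Weighted by support: common classes dominate.
    total = sum(s for _, s in stats)
    return sum(f * s for f, s in stats) / total
```

On a matrix where one class holds 100 of 120 samples and the minority classes score poorly, weighted F1 lands noticeably higher than macro F1, which is exactly the gap the calculator surfaces.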
Tips
- Plot the normalized matrix. Normalizing by row (true class count) makes it easy to spot which classes your model confuses with each other.
- Threshold matters. Moving the classification threshold changes your TP/FP tradeoff. A threshold of 0.3 vs 0.7 on the same model gives very different precision/recall. The ROC curve is this relationship visualized.
- Don't report F1 alone. Show both precision and recall. An F1 of 0.80 from P=0.90, R=0.72 is very different from P=0.70, R=0.93 depending on your use case.
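The threshold tip is easy to see concretely. The scores below are made up for illustration; any model that outputs probabilities behaves the same way:

```python
def confusion_at_threshold(scores, labels, threshold):
    """Binarize scores at `threshold` and count the four cells."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Toy predicted probabilities and true labels for one model.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

lenient = confusion_at_threshold(scores, labels, 0.3)  # catches everything, more FPs
strict = confusion_at_threshold(scores, labels, 0.7)   # cleaner flags, misses positives
```

Here the lenient threshold gives recall 1.0 at precision 0.67, while the strict one gives precision 1.0 at recall 0.5: the same model, two very different confusion matrices.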
What's the difference between macro and weighted F1?
Macro F1 averages per-class F1 scores equally — a rare class counts as much as a common one. Weighted F1 weights each class by its support (sample count). For imbalanced datasets, macro F1 is more sensitive to performance on minority classes.
When should I use MCC over F1?
When your dataset is highly imbalanced and you care about both false positives and false negatives. Unlike F1, which ignores true negatives, MCC uses all four confusion matrix cells symmetrically, so swapping which class you label "positive" doesn't change the score.
Can this calculator handle more than two classes?
Yes, CalcHub supports up to 10 classes in the matrix. Paste in your confusion matrix as comma-separated rows and it will compute per-class and averaged metrics automatically.
Related Calculators
- Dataset Split Calculator — plan train/validation/test splits
- Learning Rate Calculator — tune training after evaluating your model
- Batch Size Calculator — optimize training throughput