Confusion Matrix Calculator — Precision, Recall, F1, and More
Calculate precision, recall, F1-score, specificity, and MCC from your confusion matrix values. Understand your classifier's real performance.
Accuracy is the most reported metric in ML papers and also one of the most misleading. A fraud detection model that predicts "not fraud" for every transaction achieves 99.9% accuracy on a typical dataset. The confusion matrix is how you catch this kind of deception — in your own models and in other people's claims.
What's in a Confusion Matrix
For a binary classifier, the matrix has four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
The Metrics and When They Matter
| Metric | Formula | Best for |
|---|---|---|
| Accuracy | (TP+TN)/(TP+FP+TN+FN) | Balanced datasets only |
| Precision | TP/(TP+FP) | When false positives are costly (spam filter) |
| Recall (Sensitivity) | TP/(TP+FN) | When false negatives are costly (cancer screening) |
| F1 Score | 2 × (P×R)/(P+R) | Unbalanced classes, balanced P/R tradeoff |
| Specificity | TN/(TN+FP) | How well you identify true negatives |
| MCC | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Skewed datasets; uses all four cells |
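As a sketch, all six metrics can be computed from the four cells in a few lines of plain Python (this is illustrative, not CalcHub's actual implementation):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute standard classification metrics from the four
    confusion-matrix cells of a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC's denominator is the geometric mean of the four marginal totals;
    # if any marginal is zero, MCC is conventionally reported as 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}
```

The zero-denominator guards matter in practice: a model that predicts a single class makes one of the marginals zero, and an unguarded division crashes exactly on the degenerate models you most want to catch.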
Practical Example: Medical Screening Test
A diagnostic test for a disease with 5% prevalence in a population of 10,000:
- TP: 430 (correctly identified sick patients)
- FN: 70 (sick patients missed — very bad)
- FP: 950 (healthy patients flagged — causes unnecessary anxiety and follow-up)
- TN: 8,550 (correctly cleared)
- Accuracy: 89.8% (looks decent, but...)
- Precision: 31.2% (only 1 in 3 positive flags is actually sick)
- Recall: 86.0% (catches most sick patients)
- F1: 0.457
- MCC: 0.48
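Plugging the four cells into the formulas directly is a quick way to sanity-check any calculator's output (pure stdlib):

```python
import math

# Medical screening example: 5% prevalence, n = 10,000
TP, FN, FP, TN = 430, 70, 950, 8550

accuracy = (TP + TN) / (TP + FN + FP + TN)          # 0.898
precision = TP / (TP + FP)                           # ~0.312
recall = TP / (TP + FN)                              # 0.860
f1 = 2 * precision * recall / (precision + recall)   # ~0.457
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # ~0.48
```

Note how accuracy sits near 90% while precision and MCC expose the weakness: most positive flags are false alarms.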
Multi-Class Confusion Matrices
For 3+ classes, the calculator builds a full N×N matrix. You can paste in a matrix as CSV or enter values cell by cell. It computes per-class precision and recall, then aggregates with macro averaging (unweighted mean across classes) or weighted averaging (weighted by class frequency).
When one class has far more samples, weighted F1 tells a different story than macro F1 — the calculator shows both so you can't miss an imbalanced class getting buried.
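The two averaging schemes can be sketched as follows. The matrix convention here (rows = true class, columns = predicted class) is an assumption; the calculator's own layout may differ:

```python
def per_class_f1(matrix):
    """Per-class (F1, support) pairs from an N x N confusion matrix
    where matrix[i][j] = count of true class i predicted as class j."""
    n = len(matrix)
    stats = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # column sum minus TP
        fn = sum(matrix[c]) - tp                        # row sum minus TP
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        stats.append((f1, sum(matrix[c])))              # support = row sum
    return stats

def macro_f1(stats):
    # Unweighted mean: every class counts equally, however rare.
    return sum(f for f, _ in stats) / len(stats)

def weighted_f1(stats):
    # Weighted by support: common classes dominate.
    total = sum(s for _, s in stats)
    return sum(f * s for f, s in stats) / total
```

On a matrix where one class holds 100 of 120 samples and the minority classes score poorly, weighted F1 lands noticeably higher than macro F1, which is exactly the gap the calculator surfaces.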
Tips
- Plot the normalized matrix. Normalizing by row (true class count) makes it easy to spot which classes your model confuses with each other.
- Threshold matters. Moving the classification threshold changes your TP/FP tradeoff. A threshold of 0.3 vs 0.7 on the same model gives very different precision/recall. The ROC curve is this relationship visualized.
- Don't report F1 alone. Show both precision and recall. An F1 of 0.80 from P=0.90, R=0.72 is very different from P=0.70, R=0.93 depending on your use case.
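The threshold tip is easy to see concretely. The scores below are made up for illustration; any model that outputs probabilities behaves the same way:

```python
def confusion_at_threshold(scores, labels, threshold):
    """Binarize scores at `threshold` and count the four cells."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Toy predicted probabilities and true labels for one model.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

lenient = confusion_at_threshold(scores, labels, 0.3)  # catches everything, more FPs
strict = confusion_at_threshold(scores, labels, 0.7)   # cleaner flags, misses positives
```

Here the lenient threshold gives recall 1.0 at precision 0.67, while the strict one gives precision 1.0 at recall 0.5: the same model, two very different confusion matrices.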
What's the difference between macro and weighted F1?
Macro F1 averages per-class F1 scores equally — a rare class counts as much as a common one. Weighted F1 weights each class by its support (sample count). For imbalanced datasets, macro F1 is more sensitive to performance on minority classes.
When should I use MCC over F1?
When your dataset is highly imbalanced and you care about both false positives and false negatives. Unlike F1, which ignores true negatives, MCC uses all four confusion matrix cells symmetrically, so swapping which class you label "positive" doesn't change the score.
Can this calculator handle more than two classes?
Yes, CalcHub supports up to 10 classes in the matrix. Paste in your confusion matrix as comma-separated rows and it will compute per-class and averaged metrics automatically.
Related Calculators
- Dataset Split Calculator — plan train/validation/test splits
- Learning Rate Calculator — tune training after evaluating your model
- Batch Size Calculator — optimize training throughput