Train/Validation/Test Split Calculator
Calculate the right dataset split ratios for your ML project. Get sample counts, stratification advice, and split strategies for small and large datasets.
The classic "80/10/10 split" advice is thrown around like a universal truth, but it only makes sense for medium-sized datasets. With 500 samples, a 10% test set is 50 examples — statistically unreliable for drawing conclusions about real-world performance. With 10 million samples, giving 80% to training is often wasteful.
The Three Splits and What They're For
- Training set — what the model learns from. Gradients flow from this data. Bigger is generally better, but only up to the point of diminishing returns.
- Validation set — used during training to tune hyperparameters and monitor overfitting. Never used for gradient updates; this is where you decide when to stop training.
- Test set — the final holdout. Touched exactly once, after all model decisions are made. If you evaluate on the test set more than once, it's no longer an honest estimate of generalization.
How Split Size Should Scale with Dataset Size
| Dataset Size | Recommended Split | Test Set Size |
|---|---|---|
| < 1,000 samples | Use k-fold CV instead | N/A (no reliable holdout) |
| 1,000–10,000 | 70/15/15 | 150–1,500 samples |
| 10,000–100,000 | 80/10/10 | 1,000–10,000 samples |
| 100,000–1M | 90/5/5 | 5,000–50,000 samples |
| > 1M | 98/1/1 | 10,000+ (already plenty) |
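The table's thresholds are easy to encode. A minimal sketch (the `recommend_split` helper is hypothetical, not a library function):

```python
def recommend_split(n_samples):
    """Suggest (train, val, test) sample counts from the size table above.

    Returns None below 1,000 samples, where k-fold CV is the better choice.
    """
    if n_samples < 1_000:
        return None  # no reliable holdout; use k-fold cross-validation
    if n_samples < 10_000:
        fracs = (0.70, 0.15, 0.15)
    elif n_samples < 100_000:
        fracs = (0.80, 0.10, 0.10)
    elif n_samples < 1_000_000:
        fracs = (0.90, 0.05, 0.05)
    else:
        fracs = (0.98, 0.01, 0.01)
    return tuple(round(n_samples * f) for f in fracs)
```

For example, `recommend_split(8_500)` lands in the 70/15/15 band and returns `(5950, 1275, 1275)`.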
Stratification: Why It Matters
If your dataset is imbalanced — say 95% class A, 5% class B — a random split might put all the class B examples in the training set and leave the validation set with none. Stratified splitting ensures each split has the same class ratio as the full dataset.
Always stratify unless you have a specific reason not to. The calculator defaults to stratified splits when you enter class distribution information.
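The mechanics of a stratified split are simple: group indices by class, then split each class in the same proportion. A pure-stdlib sketch (in practice, scikit-learn's `train_test_split(..., stratify=y)` does this for you):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.1, seed=0):
    """Return (train_idx, test_idx) preserving each class's ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # at least one test example per class, even for rare classes
        n_test = max(1, round(len(idxs) * test_frac))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx
```

With a 95/5 imbalanced dataset and `test_frac=0.2`, the minority class is guaranteed representation in the test set instead of being left to chance.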
Time-Series Data Is Different
For time-series or sequential data, never shuffle before splitting. The test set should be the most recent data. If you train on data from 2020–2023 and test on 2020 data, your model has seen the future relative to your test set — that's data leakage and your metrics are meaningless.
The calculator has a time-series mode that splits chronologically and also supports walk-forward validation (rolling origin backtesting).
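Walk-forward validation keeps every test window strictly after its training window, with the training window growing as the origin rolls forward. A sketch of the index generation (indices are assumed to already be in time order; the function name is illustrative):

```python
def walk_forward_splits(n, n_folds=3, test_size=None):
    """Yield (train_idx, test_idx) pairs for rolling-origin backtesting.

    Each fold trains on everything before its test window, so no
    fold ever sees the future relative to its own test set.
    """
    test_size = test_size or n // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = n - (n_folds - k + 1) * test_size
        yield (list(range(train_end)),
               list(range(train_end, train_end + test_size)))
```

With `n=100` and three folds, the training set grows from 25 to 75 samples while each 25-sample test window stays chronologically ahead of it.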
A Realistic Example
You have 8,500 customer churn records across 3 customer segments: Enterprise (500), Mid-market (3,000), SMB (5,000). You want to train a binary classifier.
Running through the calculator:
- Suggested split: 75/12.5/12.5
- Training: ~6,375 samples
- Validation: ~1,063 samples
- Test: ~1,063 samples
- Stratified by segment: Enterprise gets roughly 375/62/63
With 63 Enterprise examples in the test set, results for that segment will have wide confidence intervals. The calculator flags this and suggests collecting more Enterprise data or using stratified k-fold for segment-specific evaluation.
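The arithmetic behind these numbers, per segment (flooring the training count and handing any leftover sample to the test set is one reasonable rounding convention; the calculator may round differently):

```python
segments = {"Enterprise": 500, "Mid-market": 3_000, "SMB": 5_000}

counts = {}
for name, n in segments.items():
    n_train = int(n * 0.75)        # 75% to training
    n_val = (n - n_train) // 2     # split the remaining 25% evenly...
    n_test = n - n_train - n_val   # ...any leftover sample goes to test
    counts[name] = (n_train, n_val, n_test)
```

Summing across segments gives 6,375 training samples and roughly 1,063 each for validation and test — and makes the Enterprise test set's small size (about 63 samples) explicit.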
Tips
- Never tune hyperparameters on the test set. This is the most common way ML results end up not replicating. The test set is sacred.
- Cross-validation for small datasets. 5-fold or 10-fold CV uses all your data for training and gives lower-variance performance estimates than a fixed holdout.
- Group splits for dependent data. If multiple rows belong to the same patient, user, or entity, keep all rows from one entity in the same split. Otherwise you have information leakage between train and test.
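A group split assigns whole entities to one side of the boundary, so no patient or user straddles train and test. A stdlib sketch (scikit-learn's `GroupShuffleSplit` is the production equivalent):

```python
import random

def group_split(groups, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with every group wholly on one side."""
    rng = random.Random(seed)
    uniq = sorted(set(groups))
    rng.shuffle(uniq)
    n_test_groups = max(1, round(len(uniq) * test_frac))
    test_groups = set(uniq[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx
```

Note that the split fraction applies to groups, not rows, so the row-level ratio can drift if group sizes vary.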
What split ratio should I use for fine-tuning a pretrained model?
With a pretrained model, you generally need less training data to converge. A 70/15/15 or even 60/20/20 split works well, since you want more validation data to catch overfitting early during fine-tuning.
How many test samples do I actually need for reliable evaluation?
A rough guide: to detect a 1 percentage point difference in accuracy at 95% confidence, you need roughly 10,000 test samples. For 5 percentage points, about 400 samples suffice. The calculator can compute this power analysis for you.
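Those round numbers follow from the normal-approximation confidence interval for a proportion, n = z² · p(1−p) / d², taken at the worst case p = 0.5. A back-of-envelope sketch (a simplified one-sample heuristic, not the calculator's full power analysis):

```python
import math

def min_test_samples(delta, p=0.5, z=1.96):
    """Samples needed so a binomial 95% CI half-width is at most delta.

    n = z^2 * p * (1 - p) / delta^2, worst case at p = 0.5.
    """
    return math.ceil(z**2 * p * (1 - p) / delta**2)
```

`min_test_samples(0.01)` gives about 9,600 (hence "roughly 10,000"), and `min_test_samples(0.05)` about 385 (hence "about 400").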
Is it okay to combine val and test and use k-fold?
For final model selection and reporting, you should always report on a held-out test set. k-fold is appropriate for model development and hyperparameter tuning, but don't report k-fold performance as your final "test" result — it's optimistic.
Related Calculators
- Confusion Matrix Calculator — evaluate after splitting and training
- Batch Size Calculator — configure training after knowing split sizes
- Training Time Calculator — project training duration from training set size