Fine-Tuning LLMs: When, Why and How (OpenAI, Open-Source Models)
A practical guide to fine-tuning large language models covering when to fine-tune vs when not to, OpenAI's API, open-source LoRA/QLoRA with Hugging Face, dataset preparation, and real cost comparisons.
Fine-tuning is the most overhyped technique in the LLM space right now. Every week someone asks me "should I fine-tune a model for my use case?" and 80% of the time the answer is no. Not because fine-tuning doesn't work — it works incredibly well for the right problems. But most people reach for it before trying simpler, cheaper approaches that would solve their problem just as well.
Here's the thing: fine-tuning adjusts a model's weights on your data. It changes how the model behaves, not what it knows. If you need the model to access your company's documentation, that's retrieval (RAG). If you need it to follow a specific output format, that's prompt engineering. Fine-tuning is for when you need to change the model's style, tone, or reasoning patterns in ways that prompting alone can't achieve.
The Decision Hierarchy
Before spending time and money on fine-tuning, work through this hierarchy in order:
1. Prompt engineering first. Write better prompts. Add examples (few-shot). Use system messages. Specify output format explicitly. This is free and solves more problems than people expect. Give it a serious try — not one attempt, but 10-20 prompt variations.
2. RAG second. If the model needs access to specific knowledge (your docs, your database, your product catalog), implement Retrieval-Augmented Generation. Embed your documents, retrieve relevant chunks, and inject them into the prompt. This is cheaper than fine-tuning and the knowledge stays up-to-date.
3. Fine-tuning last. If you need the model to consistently adopt a specific tone, follow complex formatting rules, perform domain-specific reasoning, or behave in ways that prompting can't reliably achieve — now fine-tuning makes sense.
What Fine-Tuning Actually Does
When you fine-tune, you're running additional training on a pre-trained model using your dataset. The model adjusts its internal weights to better match the input-output patterns in your data.
What fine-tuning IS good for:
- Consistent style/tone (writing like your brand, matching a specific voice)
- Complex output formats (structured JSON, specific XML schemas)
- Domain-specific classification (legal document categorization, medical coding)
- Reducing prompt length (baking instructions into the model instead of repeating them)
- Improving reliability on narrow tasks
What fine-tuning is NOT:
- Adding new knowledge (the model doesn't memorize your training data — use RAG)
- Making a small model as smart as a large one (you can't fine-tune GPT-4o-mini into GPT-4o)
- A magic fix for bad prompts (if your prompts are poorly designed, fix those first)
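Before reaching for fine-tuning, it's worth actually trying step 1 of the hierarchy. A few-shot prompt is nothing more than example exchanges prepended to the real query in the messages list. A minimal sketch — the example exchanges and system message here are hypothetical placeholders, not part of any API:

```python
# Few-shot prompting: show the model a couple of example exchanges
# before the real query. Swap in real examples from your own domain.
few_shot_examples = [
    ("My order hasn't arrived.",
     "I'm sorry about the delay! Could you share your order number so I can track it down?"),
    ("How do I cancel my subscription?",
     "You can cancel any time under Settings > Billing. Want me to walk you through it?"),
]

messages = [{"role": "system", "content": "You are a concise, empathetic support agent."}]
for user_msg, assistant_msg in few_shot_examples:
    messages.append({"role": "user", "content": user_msg})
    messages.append({"role": "assistant", "content": assistant_msg})

# The real query goes last.
messages.append({"role": "user", "content": "I was charged twice this month."})

# Then send `messages` to any chat completions endpoint, e.g.:
# response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```

Ten minutes of iterating on examples like these often closes the gap you were about to spend days fine-tuning away.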
Fine-Tuning with OpenAI
OpenAI makes fine-tuning straightforward. Here's the complete workflow:
Step 1: Prepare Your Dataset
Your data needs to be in JSONL format with conversation structure:
{"messages": [{"role": "system", "content": "You are a customer support agent for TechCo. Be concise, empathetic, and always offer a next step."}, {"role": "user", "content": "My order hasn't arrived and it's been 2 weeks."}, {"role": "assistant", "content": "I'm sorry about the delay with your order. Let me look into this right away. Could you share your order number? I'll track it down and get you an update within the hour."}]}
{"messages": [{"role": "system", "content": "You are a customer support agent for TechCo. Be concise, empathetic, and always offer a next step."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "You can reset your password at techco.com/reset — just enter your email and you'll get a reset link within a minute. If you don't see it, check your spam folder. Want me to walk you through it?"}]}
Quality matters far more than quantity. In my experience, 100-200 high-quality examples often outperform 5000 mediocre ones. Each example should represent exactly how you want the model to respond.
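A malformed line will fail the whole upload, so it's worth sanity-checking the file first. A minimal validator sketch (the structural checks mirror the format shown above; it is not an official OpenAI tool):

```python
import json

def validate_jsonl(path):
    """Check each line parses and follows the chat-format structure."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            msgs = record.get("messages")
            if not isinstance(msgs, list) or not msgs:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            for m in msgs:
                if m.get("role") not in {"system", "user", "assistant"} or not m.get("content"):
                    errors.append(f"line {i}: bad message {m}")
            # Each training example should end with the assistant response
            # you want the model to learn.
            if msgs[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message should be the assistant response")
    return errors
```

Run it before uploading; an empty list means the file is at least structurally sound (it says nothing about quality, which you still have to review by hand).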
Step 2: Upload and Create Fine-Tune Job
from openai import OpenAI

client = OpenAI()

# Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Optional: upload validation file
validation_file = client.files.create(
    file=open("validation_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": "auto",
        "learning_rate_multiplier": "auto"
    }
)

print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
Step 3: Monitor and Use
# Check status
job = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {job.status}")

# Once completed, use your fine-tuned model
if job.status == "succeeded":
    response = client.chat.completions.create(
        model=job.fine_tuned_model,  # ft:gpt-4o-mini-2024-07-18:org::abc123
        messages=[
            {"role": "system", "content": "You are a customer support agent for TechCo."},
            {"role": "user", "content": "I can't log in to my account."}
        ]
    )
    print(response.choices[0].message.content)
The whole process typically takes 15-45 minutes for small datasets. OpenAI charges per token trained plus slightly higher inference costs for fine-tuned models.
Open-Source Fine-Tuning with LoRA
If you want more control, lower inference costs, or need to keep data on-premises, open-source fine-tuning is the way to go. The breakthrough technique is LoRA (Low-Rank Adaptation), which lets you fine-tune large models by only training a tiny fraction of the parameters.
Why LoRA Changes Everything
A 7B parameter model needs 28GB just for its fp16 weights and gradients when fine-tuning fully, and optimizer states push the real requirement far higher. LoRA adds small trainable matrices to specific layers while freezing the rest of the model. Result: you can fine-tune a 7B model on a single consumer GPU (24GB VRAM). QLoRA goes further by quantizing the base model to 4-bit, reducing memory even more.
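The memory gap is easy to estimate from first principles. A back-of-envelope sketch, assuming fp16 mixed-precision training with Adam (fp32 optimizer moments plus an fp32 master copy of the weights) and ignoring activations and framework overhead:

```python
# Rough GPU memory estimate for fine-tuning a 7B model.
params = 7e9

# Full fine-tuning, fp16 mixed precision with Adam:
full_ft_gb = (params * 2      # fp16 weights
              + params * 2    # fp16 gradients
              + params * 8    # Adam moments (two fp32 tensors)
              + params * 4    # fp32 master copy of weights
              ) / 1e9

# QLoRA: 4-bit base weights, plus a tiny set of LoRA weights
# that carry their own gradients and optimizer states.
lora_params = 13e6            # assumption: ~13M trainable LoRA params
qlora_gb = (params * 0.5 + lora_params * (2 + 2 + 8)) / 1e9

print(f"full fine-tune: ~{full_ft_gb:.0f} GB, QLoRA base + adapters: ~{qlora_gb:.1f} GB")
```

Even as a rough estimate, the contrast (on the order of 100 GB versus a handful of GB before activations) explains why QLoRA fits on a single 24GB card.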
Complete LoRA Fine-Tuning with Hugging Face
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# Load base model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=16,             # Rank — higher = more capacity, more memory
    lora_alpha=32,    # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# This prints something like "trainable params: 13M || all params: 8B || 0.16%"
model.print_trainable_parameters()

# Load your dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# Training
training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)

# Note: recent trl releases move max_seq_length into SFTConfig and rename
# tokenizer to processing_class; check the docs for your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
trainer.save_model("./fine-tuned-model")
That's roughly 0.16% of the model's parameters being trained. The rest stays frozen. This is why LoRA is revolutionary — you get most of the benefits of full fine-tuning at a fraction of the compute cost.
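That fraction is easy to verify by hand. A quick sanity check of the LoRA parameter count, assuming the published Llama-3.1-8B configuration (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads, so k/v projections map 4096 to 1024):

```python
# Count LoRA parameters for r=16 on q/k/v/o projections.
# Each adapted projection of shape (out x in) gains two low-rank
# matrices: A (r x in) and B (out x r), i.e. r * (in + out) params.
r = 16
hidden, kv_dim, layers = 4096, 1024, 32

per_layer = (
    r * (hidden + hidden)    # q_proj: 4096 -> 4096
    + r * (hidden + kv_dim)  # k_proj: 4096 -> 1024
    + r * (hidden + kv_dim)  # v_proj: 4096 -> 1024
    + r * (hidden + hidden)  # o_proj: 4096 -> 4096
)
total = per_layer * layers
print(f"trainable LoRA params: {total/1e6:.1f}M")
# ~13.6M, in line with the ~13M / 0.16% reported by print_trainable_parameters()
```

Doubling `r` roughly doubles this count, which is why rank is the first knob to turn when the adapter underfits.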
Dataset Preparation (The Part Everyone Rushes)
Your dataset quality determines your results. Period. Here's what actually matters:
- Diversity over volume. 200 diverse examples beat 2000 variations of the same pattern. Cover the full range of inputs your model will see in production.
- Match production distribution. If 60% of real queries are simple and 40% are complex, your training data should roughly reflect that.
- Consistent formatting. Every example should follow the exact output format you want. Inconsistent examples teach the model inconsistency.
- Clean ruthlessly. Remove duplicates, fix typos, cut examples where the response isn't actually good.
- Hold out a test set. Set aside 10-20% for evaluation. Compare base model vs fine-tuned on these examples. No measurable improvement means your fine-tuning didn't help.
Cost Comparison
| | OpenAI Fine-Tuning | Open-Source (LoRA) |
|---|---|---|
| Training cost | ~$8 per 1M tokens trained | GPU rental: $1-3/hr (A100) |
| Training time | 15-45 min (managed) | 1-4 hours (depends on dataset) |
| Inference cost | Higher per-token rate | Self-hosted: GPU cost only |
| Setup effort | Minimal (API calls) | Significant (infra, code) |
| Data privacy | Data sent to OpenAI | Data stays on your machines |
| Model ownership | OpenAI hosts it | You own the weights |
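Plugging the table's figures into a concrete scenario makes the trade-off clearer. A rough sketch — the example counts, token lengths, and the 2-hour GPU rental are illustrative assumptions, not quotes:

```python
# Break-even sketch for a small fine-tuning run.
examples = 200
tokens_per_example = 600       # assumption: average tokens per training example
epochs = 3
openai_rate = 8 / 1_000_000    # $8 per 1M trained tokens (from the table)

openai_cost = examples * tokens_per_example * epochs * openai_rate

gpu_hourly = 2.0               # assumption: mid-range A100 rental
hours = 2.0                    # assumption: small QLoRA run
gpu_cost = gpu_hourly * hours

print(f"OpenAI training: ${openai_cost:.2f}, GPU rental: ${gpu_cost:.2f}")
```

At this scale both training costs are trivially small; the decision really hinges on the other rows of the table — inference volume, setup effort, privacy, and ownership.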
Practical Use Cases
- Customer support tone. Fine-tune to match your brand's voice — empathetic but concise, never uses certain phrases, always offers a next step. Prompting can get you 80% there; fine-tuning gets you to 95%.
- Code style enforcement. Fine-tune on your codebase's conventions — naming patterns, error handling style, documentation format. The model generates code that looks like it belongs in your repo.
- Domain-specific classification. Legal document categorization, medical report coding, financial sentiment analysis. Fine-tuning on labeled examples in your domain significantly outperforms zero-shot prompting.
- Structured output. If you need the model to consistently output a complex JSON schema with nested objects and specific field names, fine-tuning on examples of correct output is more reliable than describing the schema in a prompt.
The Honest Take
Fine-tuning is a powerful technique that most teams reach for too early. Before fine-tuning, try harder prompts, try few-shot examples, try RAG. If those don't work, fine-tuning is your tool.
When you do fine-tune: start small (100-200 examples), evaluate rigorously (compare against base model), and iterate on your dataset, not your hyperparameters. The data is almost always the bottleneck, not the training configuration.
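"Evaluate rigorously" can be as simple as scoring both models on the same held-out prompts with a task-specific check. A minimal sketch — the JSON-validity check is just one example of a check; swap in whatever defines success for your task:

```python
import json

def score_outputs(responses, check):
    """Fraction of model responses passing a task-specific check.
    Run the same held-out prompts through the base model and the
    fine-tuned model, then compare the two scores."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if check(r)) / len(responses)

def is_valid_json(text):
    """Example check for a structured-output task."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Usage sketch (responses collected from each model beforehand):
# base_score = score_outputs(base_responses, is_valid_json)
# ft_score = score_outputs(finetuned_responses, is_valid_json)
# print(f"base: {base_score:.0%}, fine-tuned: {ft_score:.0%}")
```

If the fine-tuned score isn't clearly higher on held-out data, go back to the dataset, not the hyperparameters.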
If you're learning about LLMs, prompt engineering, and AI development, CodeUp has hands-on tutorials that walk you through the full stack — from basic API calls to fine-tuning workflows. Building this skill set is increasingly valuable regardless of what kind of software you work on.