2026-05-05

Llama 3 Fine-Tuning: Local Data Step-by-Step Guide

Q: Can I fine tune Llama 3 on a CPU or Mac?

While it is technically possible to perform fine-tuning on [Apple Silicon](/posts/how-to-run-local-llms-on-macbook-m3/) (M-series Macs) using MLX, standard PyTorch/CUDA-based fine-tuning requires an NVIDIA GPU. CPU fine-tuning is exponentially too slow to be practical and is highly discouraged.

By Alex Chen Published: 2026-05-05 Updated: 2026-05-05 Read time: 10 min

Learn how to fine tune Llama 3 on local data securely using Unsloth, LoRA, and Hugging Face. Master custom AI model training on consumer hardware today.

Editor summary

Tune Llama Local Data securely using Unsloth and LoRA adapters—a practical workflow that keeps sensitive information off third-party servers entirely. I found the step-by-step approach to fine-tuning on consumer GPUs genuinely valuable, especially the emphasis on curating 500–2,000 high-quality examples rather than dumping massive messy datasets. The trade-off is real: while local fine-tuning eliminates compliance risk, it demands careful hyperparameter balancing—context window length scales memory quadratically, and overfitting remains a persistent pitfall if you repeat prompt structures across your training set. This guide walks you through loading quantized models, applying LoRA adapters, formatting JSONL datasets, and exporting GGUF files for production use.

Fine Tune Llama 3 on Local Data: Step-by-Step Guide

Quick Answer: To fine-tune Llama 3 on local data, you need to format your custom dataset into JSONL using the ChatML or Alpaca format, set up a Python environment with PyTorch and Unsloth (or PEFT), load the quantized Llama 3 8B model, apply Low-Rank Adaptation (LoRA) to freeze core weights while training new adapters, and execute the training loop using Hugging Face’s SFTTrainer. This process allows you to securely teach the model specific domain knowledge without uploading sensitive data to third-party servers.

Data privacy and domain-specific accuracy are the two biggest challenges organizations face when implementing Large Language Models. While base models like Meta’s Llama 3 are exceptionally capable, they only know the generalized data they were trained on. When you need an AI to understand your internal documentation, specific coding styles, or proprietary customer service protocols, off-the-shelf models fall short.

Sending sensitive corporate or personal data to closed-source API providers is often a non-starter due to compliance, security, and IP concerns. The solution is local fine-tuning. By running the training process entirely on your own hardware, your data never leaves your network.

Historically, fine-tuning an 8-billion parameter model required massive, expensive server clusters. Today, thanks to parameter-efficient fine-tuning (PEFT) techniques and highly optimized libraries, you can fine-tune Llama 3 8B locally on a single consumer-grade GPU. This guide provides a rigorous, technical walkthrough on how to fine tune Llama 3 on local data, covering everything from hardware prerequisites to dataset formatting and the execution of the training loop.

Understanding Local Fine-Tuning for Llama 3

Fine-tuning is the process of taking a pre-trained model and continuing its training on a smaller, specialized dataset. For local execution, we do not perform full parameter fine-tuning, which updates every single weight in the model and requires hundreds of gigabytes of VRAM. Instead, we use a technique called Low-Rank Adaptation (LoRA) combined with quantization (QLoRA).

Quantization reduces the precision of the model’s weights (e.g., from 16-bit float to 4-bit integer), drastically shrinking the memory footprint. LoRA freezes these base weights and injects small, trainable “adapter” matrices into the model’s layers. During training, only these tiny adapters are updated.

For Llama 3 8B, the base model in 4-bit precision consumes about 5.5GB to 6GB of VRAM. The LoRA adapters add an additional 1GB to 2GB depending on your rank ($r$) configuration. This means the entire process fits comfortably within the 12GB to 24GB limits of modern consumer GPUs.

Hardware and Software Requirements

Before beginning the process, you must ensure your system meets the necessary specifications and has the correct dependencies installed. Attempting to train without adequate VRAM will result in Out Of Memory (OOM) errors.

Hardware Prerequisites

GPU: An NVIDIA GPU is strictly required due to CUDA dependencies. For Llama 3 8B (4-bit quantized), you need an absolute minimum of 12GB VRAM (e.g., RTX 3060, RTX 4070). For comfortable training with larger batch sizes and higher context lengths, 24GB VRAM (e.g., RTX 3090, RTX 4090) is recommended.
RAM: 32GB of system RAM minimum.
Storage: At least 50GB of free NVMe SSD space for the model weights, dataset, and training checkpoints.

Software Prerequisites

OS: Linux (Ubuntu 22.04 LTS recommended) or Windows using WSL2. Native Windows is prone to dependency compilation issues.
CUDA Toolkit: Version 12.1 or newer.
Python: Version 3.10 or 3.11.

You will rely heavily on the Hugging Face ecosystem and optimized training libraries. The core stack includes torch, transformers, trl (Transformer Reinforcement Learning), peft, bitsandbytes (for quantization), and unsloth. Unsloth is highly recommended for Llama 3 as it provides custom Triton kernels that speed up training by up to 2x while reducing VRAM usage by 30%.

Preparing Your Local Dataset

The quality of your local data dictates the success of your fine-tune. A common mistake is using massive, low-quality datasets. For fine-tuning, 500 to 2,000 highly curated, perfect examples will yield significantly better results than 50,000 messy examples.

Your data must be formatted to match the prompt template Llama 3 expects. Llama 3 Instruct uses a specific syntax with <|begin_of_text|> and <|start_header_id|> tokens. Your data should be structured as a JSONL (JSON Lines) file, typically utilizing an array of messages representing a conversation.

Dataset Structure Example

You need to convert your local PDFs, databases, or text files into JSON objects that look like this:

{
  "messages": [
    {"role": "system", "[content](/posts/how-to-automate-content-with-n8n-and-claude/)": "You are a senior database administrator."},
    {"role": "user", "content": "Write a PostgreSQL query to find duplicate emails in the users table."},
    {"role": "assistant", "content": "Here is the query to find duplicate emails:\n\n```sql\nSELECT email, COUNT(*) \nFROM users \nGROUP BY email \nHAVING COUNT(*) > 1;\n```"}
  ]
}

Save this as local_training_data.jsonl. If you are converting raw documentation, write a script to chunk the text and use an existing local LLM to synthetically generate instruction-response pairs based on those chunks. This is called synthetic data generation and is the most efficient way to build a training set from unstructured local files.

Setting Up the Training Environment

Create an isolated environment to prevent dependency conflicts. Using conda or venv is mandatory.

First, install PyTorch matching your CUDA version. Then, install the Unsloth library, which includes the necessary Hugging Face tools.

conda create --name llama3-ft python=3.10
conda activate llama3-ft

# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Unsloth and dependencies
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes

Ensure your terminal is authenticated with Hugging Face if you plan to pull gated base models. You can do this by running huggingface-cli login and providing your access token.

Step-by-Step: How to Fine Tune Llama 3 on Local Data

With your environment active and dataset ready, create a new Python script (e.g., train.py) or Jupyter Notebook. The following steps break down the code required to execute the fine-tuning process.

Step 1: Loading the Base Model and Tokenizer

We will use Unsloth’s FastLanguageModel to load a pre-quantized version of Llama 3 8B. This saves the time and memory required to load the massive 16-bit model and quantize it on the fly.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Adjust based on your VRAM
dtype = None # Auto-detects float16 for older GPUs, bfloat16 for 30/40 series
load_in_4bit = True # Enforce 4-bit quantization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "YOUR_HF_TOKEN" # Optional if accessing gated models
)

Step 2: Applying LoRA Adapters

Next, we wrap the base model in a PEFT configuration. This freezes the original 8 billion parameters and adds the trainable adapter matrices.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank: 8, 16, 32, 64. Higher = smarter but more VRAM. 16 is optimal.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Optimized to 0 for Unsloth
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Crucial for saving VRAM
    random_state = 3407,
)

Targeting all linear modules (q_proj through down_proj) usually provides the best fine-tuning accuracy compared to only targeting attention layers.

Step 3: Formatting the Dataset

We need to load our local JSONL file and format it using the tokenizer’s chat template so the model sees the correct system tokens during training.

from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Apply Llama 3 specific formatting
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "role", "content" : "content", "user" : "user", "assistant" : "assistant"},
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = load_dataset("json", data_files="local_training_data.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched = True)

Step 4: Configuring the Trainer

The SFTTrainer (Supervised Fine-Tuning Trainer) manages the training loop. You must configure learning rates, batch sizes, and steps.

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Set to True for speed if dataset allows
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Total training steps. Adjust based on dataset size.
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Note on steps: Instead of max_steps = 60, you can use num_train_epochs = 3 to run through your entire dataset exactly 3 times. For small datasets (500 rows), 3 to 5 epochs is standard.

Step 5: Executing the Training Loop

Run the trainer. Depending on your GPU and dataset size, this will take anywhere from 10 minutes to several hours.

trainer_stats = trainer.train()

Watch the training loss output in your terminal. You want to see the loss steadily decreasing. If it drops to exactly 0.0000 immediately, your learning rate is too high or your data is formatted incorrectly (the model is memorizing). If it spikes or does not move, lower your learning rate.

Exporting and Testing Your Custom Model

Once training is complete, the weights currently sit in your VRAM. You need to save the LoRA adapters to your local drive.

model.save_pretrained("llama3-local-finetune") # Saves LoRA adapters
tokenizer.save_pretrained("llama3-local-finetune")

To use the model in applications like LM Studio or Ollama, you cannot just use the adapters. You must merge the adapters back into the 16-bit base model and export it as a GGUF file. Unsloth provides built-in methods for this:

# Save to 16bit GGUF for CPU/GPU inference via llama.cpp or Ollama
model.save_pretrained_gguf("llama3-local-gguf", tokenizer, quantization_method = "q4_k_m")

The resulting .gguf file can be directly dragged and dropped into LM Studio, or loaded into Ollama using a custom Modelfile. Your locally fine-tuned Llama 3 is now ready for production use, completely offline.

Practical Advice for Local Fine-Tuning

Achieving a successful fine-tune requires balancing hardware limits with model hyperparameters. Follow these strict guidelines when configuring your runs:

Batch Size vs. Gradient Accumulation: If you experience OOM (Out of Memory) errors, reduce your per_device_train_batch_size to 1. To compensate and maintain training stability, increase gradient_accumulation_steps. The effective batch size is batch_size * accumulation_steps. Aim for an effective batch size of 8 to 16.
Context Window (max_seq_length): Memory usage scales quadratically with context length. Do not set max_seq_length to 8192 if your training examples are only 500 tokens long. Analyze your dataset and set the sequence length exactly to your longest sequence (e.g., 1024 or 2048) to save massive amounts of VRAM.
Overfitting: A model is overfit when it performs perfectly on training data but outputs garbage or hallucinates in real-world use. Prevent this by keeping epochs low (2-4), keeping the learning rate conservative (2e-4 is standard for LoRA), and ensuring your dataset has high variance. Do not repeat the exact same prompt structure for every row.
Evaluation: Always keep 10% of your data out of the training set to use as an evaluation set. Run prompts from the evaluation set through the model after training. If it cannot answer them correctly, the model has not generalized the knowledge.

Conclusion

Learning how to fine tune Llama 3 on local data provides complete control over your AI deployments, eliminating privacy risks while creating highly specialized assistants. By leveraging 4-bit quantization, LoRA, and the Unsloth library, custom model training is now accessible on standard workstation GPUs. Focus your primary efforts on curating a pristine dataset; the quality of your JSONL file will always dictate the intelligence of your final model far more than minor tweaks to training hyperparameters.

Frequently Asked Questions

Can I fine tune Llama 3 on a CPU or Mac?

While it is technically possible to perform fine-tuning on Apple Silicon (M-series Macs) using MLX, standard PyTorch/CUDA-based fine-tuning requires an NVIDIA GPU. CPU fine-tuning is exponentially too slow to be practical and is highly discouraged.

How much data do I need to fine tune Llama 3 locally?

For instruction fine-tuning to adopt a specific tone or format, 500 to 1,000 high-quality examples are sufficient. For injecting complex new domain knowledge, you will need 2,000 to 10,000 highly diverse examples.

What is the difference between RAG and fine-tuning?

Retrieval-Augmented Generation (RAG) connects an AI to external documents to look up facts, making it better for rapidly changing data. Fine-tuning bakes the knowledge and stylistic behavior directly into the model’s weights, making it faster and better at specific tasks but harder to update.

Why is my fine-tuned model outputting gibberish?

Gibberish output usually results from incorrect chat template formatting during training, using a learning rate that is too high, or failing to merge the LoRA adapters correctly with the exact base model version used during training.

How long does local fine-tuning take?

Training time depends heavily on dataset size and GPU hardware. Fine-tuning Llama 3 8B on 1,000 examples with an RTX 4090 using Unsloth typically takes less than 20 minutes. Older GPUs like the RTX 3060 may take 1 to 2 hours for the same task.