Meta's LLaMA 4 (Large Language Model Meta AI) continues to push the boundaries of open-weight large language models. Its improved performance, longer context window, and multilingual capabilities make fine-tuning LLaMA 4 an attractive option for building custom applications.
In this blog, we'll cover:
• Fine-tuning and why it is important
• Requirements and setup
• A step-by-step guide to fine-tune LLaMA 4
• Full demo project
• Best practices and tips
🔍 What is Fine-Tuning?
Fine-tuning means taking a pretrained large language model and tailoring it to a specific domain, tone, or task using a small, domain-specific dataset. Unlike training from scratch, it saves both time and compute while improving performance on the tasks you care about.
When to Fine-Tune?
• When you want the system to speak your brand language.
• For legal, medical, or domain-specific applications.
• When dealing with low-resource languages or dialects.
• For code generation, customer support, or summarization, etc.
⚙️ Prerequisites
1. Hardware
- GPU: A100 / V100 / RTX 4090 class with at least 24GB VRAM; the Llama 4 Scout walkthrough below uses 3x H200 on RunPod
- Disk: SSD with at least 100GB free
- RAM: 32GB+
2. Software
- Python 3.10+
- PyTorch 2.x
- transformers, peft, accelerate, datasets, bitsandbytes
- Optional: Weights & Biases or TensorBoard for monitoring
3. Model Access
- Download LLaMA 4 weights from Meta (requires approval; a quick access check is shown below)
- Convert them to Hugging Face format if needed
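Since the rest of this guide pulls the gated meta-llama/Llama-4-Scout-17B-16E-Instruct checkpoint from the Hugging Face Hub, it is worth confirming that your account has actually been granted access before you rent a GPU. A minimal sketch using huggingface_hub (it assumes your token is stored in the HF_TOKEN environment variable):

import os
from huggingface_hub import login, model_info

# Log in with the access token you created on Hugging Face (stored in HF_TOKEN here)
login(os.environ.get("HF_TOKEN"))

repo_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
try:
    info = model_info(repo_id)
    print(f"Access OK: {repo_id} (revision {info.sha})")
except Exception as err:  # typically a gated-repo error if access has not been approved yet
    print(f"Could not access {repo_id}: {err}")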
🔧 Fine-Tuning Methods
You can fine-tune LLaMA 4 through:
1. Full fine-tuning – update all model weights (resource-intensive)
2. Parameter-efficient fine-tuning (PEFT) – such as LoRA, QLoRA
3. Instruction tuning – teach the model to follow instructions using prompt-response pairs
We will be using QLoRA in this tutorial to save memory and reduce training time.
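Conceptually, QLoRA is just two pieces working together: the frozen base model is loaded in 4-bit NF4 precision via bitsandbytes, and small trainable LoRA adapters are attached on top of it. The sketch below shows that combination in miniature (the rank and alpha values are illustrative); the full, working Llama 4 setup follows in the step-by-step guide.

import torch
from transformers import BitsAndBytesConfig, Llama4ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# The "Q": quantize the frozen base weights to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# The "LoRA": attach small low-rank adapters; only these are trained
peft_config = LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()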
🛠️ Step-by-Step Guide to Fine-Tuning LLaMA 4
Install Dependencies
pip install transformers peft accelerate datasets bitsandbytes
Setting Up RunPod
Go to the RunPod website and create an account. After that, go to the RunPod Billing menu and add $25 using a credit card. You can also pay with cryptocurrency.
Navigate to the My Pods menu to begin configuring your pod. The pod serves as a virtual server that provides you with the necessary CPUs, GPUs, memory, and storage for your tasks.
We will select 3x H200 SXM GPUs, which will provide sufficient memory to load the model, quantize it, and fine-tune it on the new dataset. Alternatively, you can use the Unsloth framework to run the model on a single H100—however, this approach didn’t work effectively for me.
To set up your pod, follow these steps:
1. Select the H200 SXM GPU.
2. Name your pod.
3. Choose the “RunPod PyTorch 2.8.0” template.
4. Change the GPU count to 3.
5. Click the “Deploy On-Demand” button.
We will edit our pod by increasing the container disk size to 300GB and adding the HF_TOKEN environment variable, which is your Hugging Face access token. This token is essential for loading and saving the model.
It will take some time to set up the container. Once everything is set up, click on the “Connect” button and launch the JupyterLab Instance.
Create a new notebook and start using this new environment, similar to your local setup.
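Before installing anything, it is worth sanity-checking the pod from the new notebook: confirm that all three GPUs are visible and that the HF_TOKEN environment variable made it into the container. A quick check, assuming the pod was configured as described above:

import os
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# The Hugging Face token added in the pod settings should be visible here
print("HF_TOKEN set:", bool(os.environ.get("HF_TOKEN")))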
In this section, we will learn how to address common challenges when fine-tuning the new Llama 4 model, such as out-of-memory issues and bugs in the Transformers library. We will also cover how to seamlessly load, fine-tune, and save the LoRA adapter. By following these steps, you can focus on the fine-tuning process without worrying about technical hurdles.
1. Setting up
We will install the necessary Python packages to fine-tune the large language model.
Note:
The latest version of the Transformers library has an embedding mismatch bug that has been reported on the GitHub repository. To avoid this issue, we will install version 4.51.0.
The Hugging Face Hub also provides the Xet storage integration, which is faster than Git-LFS and can improve download speeds by up to three times.
%%capture
!pip install transformers==4.51.0
%pip install -U datasets
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install -U bitsandbytes
%pip install huggingface_hub[hf_xet]
Load the API key from the environment variable to log in to Hugging Face. By logging in, you can gain access to gated models and also save fine-tuned models and tokenizers.
from huggingface_hub import login
import os

hf_token = os.environ.get("HF_TOKEN")
login(hf_token)
2. Loading the model and tokenizer
Load the Llama-4-Scout-17B-16E-Instruct model with 4-bit quantization for efficient memory usage. Make sure you have set the device_map to auto to use all three H200 GPUs.
Note: Ensure you have access to the model, as it is gated and requires you to fill out the access form on the meta-llama/Llama-4-Scout-17B-16E-Instruct model page.
import os
import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, BitsAndBytesConfig

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
We will also load the tokenizer using the same model ID.
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
By running the command below, you can check how much memory is left for setting up the trainer and fine-tuning the model.
!nvidia-smi
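If you prefer to check memory from Python rather than nvidia-smi, a small helper like the one below (the function name is my own) reports per-GPU usage:

import torch

def print_gpu_memory():
    # Report used vs. total memory for every visible GPU
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        used_gb = (total - free) / 1024**3
        total_gb = total / 1024**3
        print(f"GPU {i}: {used_gb:.1f} GB used / {total_gb:.1f} GB total")

print_gpu_memory()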
3. Loading and processing the data
We will create a prompt style for the model, including a system prompt with placeholders for the question, chain of thought, and response. This prompt will help the model think step-by-step and give clear and accurate answers.
train_prompt_style = """Below is an instruction that describes a task, paired wit Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step ch
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnosti Please answer the following medical question.
### question:
{}
### Response:
<think>
{}
</think>
{}"""
Next, we will create the Python function to generate the “text” column using the training prompt style and columns from the dataset.
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    inputs = examples["Question"]
    complex_cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for question, cot, response in zip(inputs, complex_cots, outputs):
        # Append the EOS token to the response if it's not already there
        if not response.endswith(tokenizer.eos_token):
            response += tokenizer.eos_token
        text = train_prompt_style.format(question, cot, response)
        texts.append(text)
    return {"text": texts}
We will load the first 500 samples from the FreedomIntelligence/medical-o1-reasoning-SFT dataset available on the Hugging Face Hub and then apply the formatting_prompts_func function to create the “text” column.
from datasets import load_dataset

dataset = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    "en",
    split="train[0:500]",
)
dataset = dataset.map(formatting_prompts_func, batched=True)
dataset["text"][0]
The “text” column has a system prompt, instructions, question, chain of thought, and the response.
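Because each formatted sample combines the system prompt, a long chain of thought, and the response, it is worth checking the tokenized lengths before training so nothing gets silently truncated. An optional, quick check:

# Tokenize every formatted sample and inspect the length distribution
lengths = [len(tokenizer(text).input_ids) for text in dataset["text"]]
print(f"min: {min(lengths)}, max: {max(lengths)}, mean: {sum(lengths) / len(lengths):.0f}")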
The new SFT trainer doesn't accept a tokenizer argument, so we will wrap the tokenizer in a data collator and provide the trainer with the data collator instead of the tokenizer.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
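To see what the collator actually produces, you can collate a single tokenized sample: with mlm=False it creates causal-LM labels by copying input_ids and masking padded positions with -100. A small, optional check (it falls back to the EOS token as padding in case the tokenizer defines none):

# If the tokenizer has no pad token, reuse EOS so the collator can pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

sample = tokenizer(dataset["text"][0])
batch = data_collator([sample])
print(batch.keys())                                    # input_ids, attention_mask, labels
print(batch["input_ids"].shape, batch["labels"].shape)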
4. Model inference before fine-tuning
We will now create a testing prompt style that includes everything from the training prompt style except for the chain of thought and response.
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### question:
{}

### Response:
<think>{}"""
We will take the first question from the dataset, convert it into the prompt using the testing prompt style, and then prepare it for the model to generate the response.
question = dataset[0]["Question"]

inputs = tokenizer(
    [prompt_style.format(question, "") + tokenizer.eos_token],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])
The model's chain of thought is long, and the answer it provides is quite brief, differing significantly from the dataset.
<think>assistant
To approach this question, let's break down the key elements provided and analyse
1. **Symptoms**: The patient experiences involuntary urine loss during activities
2. **Diagnostic Tests Mentioned**:
   - **Gynecological Exam**: This is likely performed to assess the pelvic anatom
   - **q-tip Test**: This test is used to assess urethral mobility. A q-tip (cott
3. **Cystometry (Cystometrogram)**: This test measures the pressure within the bl
Given that the patient likely has stress urinary incontinence (SUI) based on her
   - **Residual Volume**: In patients with SUI, the bladder usually functions normal
   - **Detrusor Contractions**: In SUI, the problem primarily lies with the urethral
Based on this analysis, cystometry in this patient would most likely reveal:
   - A **normal residual volume**, as her symptoms do not suggest a problem with bla
   - **Normal detrusor contractions**, as her condition (stress urinary incontinence
Therefore, cystometry would likely show that she has a normal residual volume and
</think>
The final answer is: $\boxed{Normal residual volume and normal detrusor contraction
5. Implementing LoRA
We will now implement LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning and apply it to the model. LoRA is a technique designed to fine-tune large language models by freezing the majority of the model's parameters and training only a small subset of additional parameters.
This approach is memory-efficient, faster, and cost-effective while still maintaining high accuracy comparable to full fine-tuning.
from peft import LoraConfig, get_peft_model

# LoRA config
peft_config = LoraConfig(
    lora_alpha=16,            # Scaling factor for the LoRA updates
    lora_dropout=0.05,        # Add slight dropout for regularization
    r=64,                     # Rank of the low-rank adapter matrices
    bias="none",
    task_type="CAUSAL_LM",    # Task type: Causal Language Modeling
    target_modules=[          # Target modules for LoRA
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

model = get_peft_model(model, peft_config)
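To confirm how small the trainable footprint is after wrapping the model, PEFT can report the parameter counts; with r=64 on the projection layers above, only a small fraction of the weights will be trainable.

# Print trainable vs. total parameters for the LoRA-wrapped model
model.print_trainable_parameters()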
We will now configure and initialize the SFTTrainer (Supervised Fine-Tuning Trainer) by providing it with the dataset, model, data collator, training arguments, and LoRA configuration. The SFTTrainer simplifies the fine-tuning process by integrating all these components into a single, streamlined workflow, making it easier to train large language models like Llama 4 with LoRA.
from trl import SFTTrainer
from transformers import TrainingArguments

# Training Arguments
training_arguments = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    logging_steps=0.2,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="none",
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset,
    peft_config=peft_config,
    data_collator=data_collator,
)
6. Model training
Begin the training process by running the following command:
trainer.train()
If you switch to your Pod dashboard, you will see that the trainer is utilizing all three GPUs for training.
It took us only 7 minutes to fine-tune the model and about 30 minutes overall to run the whole project from start to finish.
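If you want a local copy of the adapter before pushing anything to the Hub (handy on a pod you might terminate), you can save it to the container disk right after training; the directory name below is arbitrary:

# Save the LoRA adapter and tokenizer to the pod's container disk
trainer.save_model("llama4-medical-lora-adapter")
tokenizer.save_pretrained("llama4-medical-lora-adapter")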
7. Model inference after fine-tuning
We will now test the fine-tuned model using the same sample as before to compare its performance after fine-tuning. This will help us evaluate how well the model has improved in reasoning and generating detailed responses.
question = dataset[0]["Question"]

inputs = tokenizer(
    [prompt_style.format(question, "") + tokenizer.eos_token],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])
The model's reasoning process is accurate, and the response is detailed and logical:
<think>
Alright, let's think about what's going on with this 61-year-old woman. She's exp
Now, let's think about the tests she's had. The gynecological exam and the q-tip
Okay, so we've got stress urinary incontinence in mind. What does cystometry tell
For someone with stress incontinence, the cystometry would probably show that her
Also, with stress incontinence, you wouldn't expect to see abnormal detrusor cont
So, putting it all together, the cystometry should show a normal residual volume
</think>
In this scenario, the cystometry would most likely reveal a normal residual volum
Let’s test the model on a different sample from the dataset to further evaluate its performance.
question = dataset[10]["Question"]

inputs = tokenizer(
    [prompt_style.format(question, "") + tokenizer.eos_token],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])
The response is logical, detailed, and demonstrates the model's ability to apply medical reasoning effectively:
<think>
Alright, let's break this down. So, we have a 42-year-old guy who's just recovere
In this syndrome, T3 usually drops first, and T4 might stay normal or even rise a
But wait, what about the other hormones? In sick euthyroid syndrome, we often see
This all makes sense because the clinical picture and lab results are lining up n
</think>
In this scenario, considering the clinical context of a 42-year-old man recoverin
8. Saving the model
We will now push the fine-tuned model (LoRA adapter) and tokenizer to the Hugging Face Hub. This process will automatically create a repository for you and upload all the necessary model files, making the model publicly accessible for further use or sharing.
model.push_to_hub("Llama-4-Scout-17B-16E-Instruct-Medical-ChatBot")
tokenizer.push_to_hub("Llama-4-Scout-17B-16E-Instruct-Medical-ChatBot")
Once the process is complete, you can access the model repository at the following link: kingabzpro/Llama-4-Scout-17B-16E-Instruct-Medical-ChatBot
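To reuse the adapter later, reload the quantized base model exactly as in step 2 and then attach the adapter from the Hub with PEFT. A minimal sketch, assuming the same bnb_config as in step 2 and the repository that was just pushed:

import torch
from peft import PeftModel
from transformers import AutoTokenizer, Llama4ForConditionalGeneration

adapter_id = "kingabzpro/Llama-4-Scout-17B-16E-Instruct-Medical-ChatBot"

base_model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,  # same 4-bit config as in step 2
)
model = PeftModel.from_pretrained(base_model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)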
Here is the companion notebook that contains all the necessary code, outputs, and detailed instructions to help you fine-tune your own Llama 4 model.
✅ Best Practices
• Always use instruction-style prompt formats
• Clean and balance the dataset
• Start fine-tuning with LoRA or QLoRA instead of full fine-tuning
• Watch out for hallucinations and overfitting
• Use evaluation sets and inference testing (a sketch follows below)
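As a concrete example of the last point, you can hold out part of the formatted dataset for evaluation and pass it to the trainer so that eval loss is tracked alongside training loss. A minimal sketch built on the same pieces used earlier (the split size is arbitrary):

# Hold out 10% of the formatted samples for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    peft_config=peft_config,
    data_collator=data_collator,
)
trainer.evaluate()  # run an evaluation pass, e.g. before and after trainer.train()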
Conclusion
Creating this tutorial was a challenging yet rewarding experience. While working with Llama 4, I encountered several issues that highlighted the complexities of using this model. In my opinion, Llama 4 is not yet fully optimized for widespread use, especially for individuals relying on consumer-grade GPUs. Here are some of the key challenges I faced:
1. Out-of-memory issues: Llama 4 requires significant VRAM, making it difficult to run or fine-tune on standard GPUs.
2. Bugs in the Transformers library: There were several compatibility issues and bugs in the library when working with Llama 4, which required troubleshooting and workarounds.
3. Quantization challenges: The pre-quantized Llama 4 models available on Hugging Face were not compatible with this workflow and still required large amounts of VRAM.
4. Setup complexity: Configuring the environment to support the new mechanisms was time-consuming and required a deep understanding of PyTorch and the Hugging Face ecosystem.
Despite these challenges, if you follow this guide step-by-step, you should be able to fine-tune Llama 4 on any new dataset with relative ease.
By fine-tuning LLaMA 4, you give it the chance to perform at full capacity for a specific purpose or domain. Whether you are building a healthcare assistant, a legal summarizer, or a multilingual chatbot, the tools are within your grasp.
Start small, experiment, and iterate: the most powerful AI is the one that fits your own needs.