Image taken from Unsloth
Most of the LLMs available on platforms like Hugging Face are generalist models, meaning they hold a wide range of information across many topics and can perform fairly well in diverse scenarios. These models are often used as starting points for developing specialized versions that excel at particular tasks. This can include adapting the model for question answering or equipping it with more knowledge in fields such as finance or science. This process, known as fine-tuning, involves training the model with additional data related to the new task or domain.
In this tutorial, we will explore how to fine-tune an LLM using Unsloth and Hugging Face. From installation to preparing your dataset and running inference on the newly trained model, we will cover the basics to help you get started. Hugging Face is an online platform where researchers can upload their models for public use. It also provides a software stack for training and running transformer-based machine learning models. Unsloth, on the other hand, is a framework built on Hugging Face that offers faster fine-tuning through mathematical optimizations.
Requirements
In this tutorial we will fine-tune the Llama 3.2 model with 1 billion parameters. For this, you will need a GPU with at least 8 GB of memory, along with its corresponding CUDA drivers.
Installation
The first step to begin your fine-tuning journey is to set up the environment, which means handling the installation of Unsloth and its dependencies. In this tutorial we’ll show you two different installation methods, depending on whether you are running the fine-tuning on Google Colab or locally on your machine.
Google Colab
If you will be using a Google Colab notebook, just run the next command and you’ll be all set:
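At the time of writing, a plain pip install is enough (Unsloth's recommended command occasionally changes between releases, so double-check their documentation):

```bash
!pip install unsloth
```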
Additionally, if you want to install Unsloth’s latest nightly version, run this command too:
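The following mirrors the pattern from Unsloth's documentation; treat it as a sketch, since the exact flags may change between releases:

```bash
!pip install --upgrade --force-reinstall --no-cache-dir "unsloth @ git+https://github.com/unslothai/unsloth.git"
```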
Local Machine
If you’ll be fine-tuning locally, things can get quite tricky, so hold on tight! The key to correctly installing Unsloth is to keep track of your:
CUDA version
PyTorch version
Unfortunately, this can sometimes lead to dependency conflicts. We recommend creating a PyEnv virtual environment to keep your installation isolated. You can follow this guide to set it up. For this tutorial we suggest using a virtual environment with Python 3.10.
Once your virtual environment is activated and ready to go, the first step will be to install the correct PyTorch version for your system. It’s important to check the PyTorch versions supported by Unsloth, as detailed in their README.
Here you can find the latest PyTorch version, along with previous releases. Make sure to choose a version that Unsloth supports and that is compatible with your CUDA version. You can verify your CUDA version with this command:
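```bash
nvidia-smi
```

The supported CUDA version appears in the top-right corner of the output. If you have the CUDA toolkit installed, nvcc --version reports it as well.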
For example, to install PyTorch 2.5.0 for CUDA 12.1:
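Using PyTorch's CUDA-specific wheel index:

```bash
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121
```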
You can check if PyTorch was installed correctly and can utilize CUDA with the following command, which should print True:
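```bash
python -c "import torch; print(torch.cuda.is_available())"
```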
Now that we have a working PyTorch version, we can proceed to installing Unsloth. This command will change depending on your PyTorch and CUDA versions, as well as on whether your GPU features the NVIDIA Ampere architecture.
For instance, use the following commands if you have PyTorch 2.5.0 and CUDA 12.1:
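The command below follows the extras naming scheme from Unsloth's README (CUDA version plus PyTorch version); verify the exact tag for your combination, as the scheme has evolved over time:

```bash
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
```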
Alternatively, if your GPU is based on the Ampere architecture or newer, such as the RTX 3090, A100, or H100:
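```bash
pip install "unsloth[cu121-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
```

The ampere extras pull in additional optimizations (such as Flash Attention) that only these newer GPUs support.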
We’re now ready to start fine-tuning!
Loading the Model
Begin by importing Unsloth into your code:
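```python
from unsloth import FastLanguageModel
```

FastLanguageModel is Unsloth's main entry point for loading models and applying its optimizations.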
Unsloth provides optimized versions of the most popular and widely used LLMs, ready for fine-tuning. It typically offers both the base and instruct versions of each model. Base models are trained on a great variety of texts, giving them a general understanding of many topics, while instruct models are specialized for tasks like question-answering or following instructions. You can check out the full list of available models here.
In this guide, we’ll be using Meta AI’s Llama 3.2 base model with 1 billion parameters. We’ve chosen this model because we’ll be training it to respond in rhymes when presented with a specific text format. Using the base model allows for more control over the fine-tuning, as it doesn’t come with pre-existing instructions or biases, unlike the instruct versions.
Image generated by Midjourney, taken from VentureBeat.
You can choose to load the model in its 4-bit quantized version, which uses less memory at the cost of a slight reduction in accuracy. The maximum sequence length will depend on your use case; it sets the maximum number of tokens the model can handle in a single sequence, prompt and generated output included. The dtype (data type) can be set to None to let Unsloth auto-detect the best option based on your GPU's capabilities (e.g., it will use the bfloat16 format on Ampere or newer GPUs).
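A minimal loading sketch, assuming the unsloth/Llama-3.2-1B checkpoint from Unsloth's model collection and a context of 2048 tokens (adjust both to your use case):

```python
max_seq_length = 2048  # max tokens per sequence (prompt + output)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",  # base model, 1B parameters
    max_seq_length=max_seq_length,
    dtype=None,         # auto-detect (bfloat16 on Ampere or newer)
    load_in_4bit=True,  # 4-bit quantization to reduce memory usage
)
```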
Applying LoRA Adapters
Fine-tuning a model can be even faster with LoRA (Low-Rank Adaptation). Simply put, LoRA works by freezing the pre-trained model's weights and adding a new, thin layer of weights (an adapter). Instead of updating all the model's weights, only the newly added ones are trained. LoRA offers several benefits: it reduces the amount of GPU memory and processing power required for fine-tuning, and it allows you to train multiple adapters with different knowledge and swap them in and out of the base model as needed!
Frozen loras, as imagined by DALL-E 3.
LoRA adapters are highly customizable, and each parameter has a technical background. The theory behind them falls outside the scope of this article, but Unsloth's wiki provides a LoRA Parameters Encyclopedia where you can learn more about each one. We will apply LoRA with a rank of 16 across all the target modules.
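A sketch following the pattern from Unsloth's examples, with rank 16 on the attention and MLP projections (the remaining values are common defaults, not requirements):

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: the width of the adapter matrices
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=16,    # scaling factor for the adapter updates
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # trades compute for memory
    random_state=3407,
)
```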
Preparing the Dataset
Preparing your training data is the final step before fine-tuning. To do this, you’ll need the Hugging Face Datasets library, which processes data into the format expected by the LLM.
This library can handle both local dataset files and datasets from the Hugging Face Hub, but we’ll focus on working with local datasets. A wide range of file types is also supported, including CSV, JSON, Parquet and Arrow; refer to the official documentation for more information.
In this tutorial, we will fine-tune a Q&A model that answers the user’s questions using rhymes. For that we will use CSV files that look like the following:
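```
question,answer
What color is the sky?,"The sky is blue, a lovely hue."
What do bees produce?,"Bees make honey, sweet and sunny."
```

The rows above are hypothetical; any CSV with question and answer columns will work.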
Start by defining the data_files parameter to map your data files to splits such as training, validation and test. Next, load the dataset by specifying the file type and passing the previously defined data files.
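A sketch assuming the splits live in local files named train.csv, validation.csv, and test.csv:

```python
from datasets import load_dataset

data_files = {
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv",
}

dataset = load_dataset("csv", data_files=data_files)
```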
Remember that LLMs like Llama are autoregressive models, meaning they learn to autocomplete the text we provide to them. While they may give the impression of answering questions, they are actually trained to autocomplete a question with an answer. To achieve this effect, we need to transform the dataset into the format we want the model to use for training and learning. To illustrate, if the model will be fine-tuned on a question-answer dataset, an appropriate template for this task is required. Let’s define the following format:
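```
### Question:
{question}

### Answer:
{answer}
```

The exact delimiters are arbitrary; what matters is using the same template consistently during training and inference.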
Once the LLM is trained, it will understand that after ### Answer: the most natural completion is the answer to the question above.
You’ll need to define a function that formats an individual sample from the dataset and returns it as a dictionary. It’s crucial to add the EOS (End of Sequence) token at the end of the formatted sample, otherwise you may encounter infinite generations. In our case, the CSV files contain the question and answer columns. The resulting function is:
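```python
EOS_TOKEN = tokenizer.eos_token  # signals the model where to stop

def formatting_func(example):
    # Fill the question-answer template and append the EOS token
    text = (
        f"### Question:\n{example['question']}\n\n"
        f"### Answer:\n{example['answer']}{EOS_TOKEN}"
    )
    return {"text": text}
```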
Now it’s just a matter of mapping the dataset to the format template. The dataset splits will contain a new text column with the formatted samples.
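```python
dataset = dataset.map(formatting_func)
```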
Fine-Tuning the LLM
We’ve come a long way, and it’s now time to fine-tune our model. We’ll be using the Supervised Fine-tuning Trainer (SFTTrainer) from Hugging Face’s TRL library, but first, let’s customize the training process.
There are numerous arguments available to personalize and tweak your training. Here, we’ll provide a brief overview of some key options, but feel free to explore the documentation for a complete list. An example configuration follows the list below.
auto_find_batch_size: setting this to True will automatically find a suitable batch size for the data based on your available memory.
num_train_epochs: specifies the number of training epochs, or passes, through the entire training dataset.
logging_strategy: defines when to log the loss values during training. Options include "no" for no logging, "steps" for logging at each step, or "epoch" for logging after each epoch.
eval_strategy: sets the frequency for evaluating the model on the validation (also known as evaluation) dataset; options include "no", "steps", and "epoch".
metric_for_best_model: defines the metric used to compare model checkpoints during training.
load_best_model_at_end: determines whether to load the best checkpoint found during training once training ends. The metric chosen in the previous argument sets the criteria for comparing checkpoints.
save_strategy: establishes the model checkpoint saving strategy, with options "no", "steps", and "epoch".
save_total_limit: limits the number of checkpoints saved. When load_best_model_at_end is set to True, the best checkpoint will always be saved, along with the most recent ones. Setting this argument to 1 will save a maximum of two checkpoints: the best one and the most recent (if they are not the same).
output_dir: the output directory where the model checkpoints will be saved.
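A sketch of such a configuration with illustrative values (any argument you omit falls back to its default):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    auto_find_batch_size=True,
    num_train_epochs=3,
    logging_strategy="steps",
    eval_strategy="epoch",             # evaluate after every epoch
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    save_strategy="epoch",             # must match eval_strategy here
    save_total_limit=1,
    output_dir="outputs",
)
```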
Once the training arguments are defined, we can pass them to the trainer and begin fine-tuning. Ensure that the value passed to dataset_text_field matches the key in the dictionary returned by the formatting_func function—in this case, "text".
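A sketch using the classic SFTTrainer signature; in recent trl releases dataset_text_field and max_seq_length moved into SFTConfig, so adjust to your installed version:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    dataset_text_field="text",  # the column created by formatting_func
    max_seq_length=max_seq_length,
    args=training_args,
)
```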
Finally, start the training.
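```python
trainer_stats = trainer.train()
```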
You should see an output similar to the following. Keep in mind that the batch size will depend on your system, and the number of examples will vary based on the size of your dataset.
Running the Model
When the training finishes, you can run the model to see how it performs on new data for your fine-tuning task. The following line is essential, as it activates Unsloth’s fast inference.
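```python
FastLanguageModel.for_inference(model)
```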
For this, the model needs to be fed the prompt up until the delimiter where we intend it to begin generating output. In our question-answering case, that means everything up to and including the ### Answer: delimiter:
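Here is a hypothetical example; note that the prompt ends right after the delimiter, leaving the model to complete it:

```python
prompt = """### Question:
What color is the sky?

### Answer:
"""
```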
Next, tokenize the input text, run it through the model, and decode the outputs:
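A sketch, assuming a CUDA device and an illustrative cap on the generation length:

```python
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```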
Behold, our newly fine-tuned model is now generating text that rhymes based on its knowledge obtained from training:
Saving the Fine-Tuned Model
There are several ways to save the model, either locally or on the Hugging Face Hub, all of which you can study in the Unsloth documentation. We’ll focus on one method: saving the LoRA adapters locally. This allows us to load the adapters on top of the base model to either continue the training or to run inference.
With the next lines, you can save both the model and its tokenizer to a directory of your choice:
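Assuming a hypothetical lora_adapters directory:

```python
save_directory = "lora_adapters"

model.save_pretrained(save_directory)      # saves only the LoRA adapters
tokenizer.save_pretrained(save_directory)
```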
Afterwards, you can load your fine-tuned model from the adapters we just saved and perform inference:
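Pointing from_pretrained at the adapter directory should load the adapters on top of the base model recorded in their config; a sketch:

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_adapters",  # the directory saved above
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```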
And there you go! We successfully prepared a training and evaluation dataset, fine-tuned our model and saved it.
Need Help with your Machine Learning Project? Contact Us!
In this blog, we explored the essential steps to fine-tune a Large Language Model—starting with setting up your environment, preparing and formatting the dataset, training the model, and finally running inference to see its new capabilities in action. RidgeRun offers consulting services for solutions involving Deep Learning and Computer Vision. Reach out to us at contactus@ridgerun.ai and let’s start planning your project!