Drive Customer Success: Supercharging Safaricom’s Product FAQs with Llama 2 Model

Transforming Customer Experiences: How a Finetuned Llama 2 Model can Empower Product FAQs

Herman Wandabwa
12 min read · Sep 8, 2023

Recent advancements in the field of artificial intelligence, with a specific focus on generative AI, have not only piqued the interest of the general public but have also illustrated what experts in this field have understood for a considerable time. These advancements hold the promise of enabling individuals to accomplish remarkable feats, ushering in a fresh era of economic, technological, and social possibilities. They provide new avenues for personal expression, fostering connections among individuals, creators, and businesses.

Large Language Models (LLMs) are pre-trained on a diverse collection of text data. For Llama 2, little is publicly known about the specific composition of its training dataset beyond the fact that it comprises 2 trillion tokens with double the context length of Llama 1, and that the fine-tuned chat models were trained on over 1 million human annotations, as mentioned here. To put this into perspective, the earlier BERT model from 2018 was trained on the BookCorpus (800 million words) and the English Wikipedia (2,500 million words). This pretraining process is not only resource-intensive but also time-consuming, and it is often accompanied by a multitude of hardware-related challenges.

This is the second part of this LLM series. The first part involved building a scraper to collect the data used to fine-tune the model, and I have documented that process here. The dataset comprises 1,750 question-answer pairs covering common themes across Safaricom’s products and services. Safaricom is Kenya’s largest telecom company, boasting about 43.8 million subscribers. It also runs one of the world’s best mobile money platforms, called MPESA.

In this part of the series, I’ll be detailing how an open-source Llama 2 model was trained with Safaricom’s product and service-related FAQ question and answer pairs. Models such as Llama 2 possess the capability to predict the subsequent token within a sequence. However, this predictive ability alone does not render them highly effective virtual assistants, as they do not inherently respond to explicit instructions. To bridge this gap, a technique known as instruction tuning is applied to align their responses more closely with human expectations. This fine-tuning process involves two primary methodologies:

  1. Reinforcement Learning from Human Feedback (RLHF): Here, models learn through interactions with their environment and feedback mechanisms. The training objective is to maximize a reward signal, often utilizing Proximal Policy Optimization (PPO). This reward signal is typically derived from human evaluations of the model’s outputs, enabling the model to adapt and improve its responses based on human preferences and feedback.
  2. Supervised Fine-Tuning (SFT): In this approach, models are subjected to training on a dataset consisting of paired instructions and corresponding responses, as in our case with the Question-Answer pairs. The goal is to optimize the internal model parameters within the LLM to minimize the disparity between the generated answers and the ground-truth responses, which serve as reference labels.

These fine-tuning strategies play a pivotal role in enhancing the utility of auto-regressive models, transforming them into more capable and responsive virtual assistants in alignment with human expectations.

Let’s delve into the details:

  1. Installation and import of required Python packages
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
# Force UTF-8 as the preferred encoding (works around a common Colab locale issue)
import locale

def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding


import os
import torch
import pandas as pd
from datasets import load_dataset, DatasetDict, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    LlamaForCausalLM,
    LlamaTokenizer,
    GenerationConfig,
    pipeline,
    logging,
)

from peft import LoraConfig, PeftModel
from trl import SFTTrainer

2. Prompt Templates

One crucial aspect concerning data quality, especially for Llama 2 models, is the prompt template. The model’s authors suggest the following format when preparing the data.

<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>

The same format was followed in preparing the FAQ dataset for fine-tuning the model. I defined a CONFIG class with just one value for now, as below. Other model-related definitions can also be added to the class later.
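The CONFIG class itself is not shown in the snippets here; a minimal sketch of what it could look like is below, assuming the single value it holds is the base model name (the attribute name is illustrative).

class CONFIG:
    # Base model to fine-tune, pulled from the Hugging Face Hub.
    # Other model-related definitions can be added here later.
    model_name = "meta-llama/Llama-2-7b-chat-hf"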

The dataset is in CSV format, with questions and associated answers as the column values. It can be accessed here.

# Dataset
df = pd.read_csv("Final_FAQs.csv")
df1 = df.drop(['Unnamed: 0'], axis=1)
df1['Answer'] = df1['Answer'].str.replace('\n', ' ')  # replace newlines with spaces
df2 = df1.copy()
df3 = df2.dropna()  # drop rows with missing questions or answers

The function below can then be used to transform the dataframe into the Llama 2 chat prompt template mentioned above.

def transform_dataset_format(df):
    """Transform the dataframe into a specified format."""

    def transform(row):
        user_text = row["Question"]
        assistant_text = row["Answer"]
        return f"<s>[INST] {user_text} [/INST] {assistant_text} </s>"

    transformed_data = df.apply(transform, axis=1)
    transformed_df = transformed_data.to_frame(name="text")

    return transformed_df

df4 = transform_dataset_format(df3)
df4.to_csv("Transformed_data.csv", index=False)

The output of the above function is a transformed dataframe with a single column, “text”, which is written to CSV and then loaded in Hugging Face format as a dataset dictionary with 1,750 non-null rows.

from datasets import load_dataset
dataset = load_dataset('csv', data_files='Transformed_data.csv')
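As a quick sanity check, printing the loaded dataset should show a single train split with one “text” feature, and the row count should match the 1,750 question-answer pairs mentioned earlier:

print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text'],
#         num_rows: 1750
#     })
# })
print(dataset["train"][0]["text"][:120])  # peek at the first formatted prompt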

3. Finetuning Llama 2

The 7-billion-parameter Llama 2 model is fine-tuned on a T4 GPU using parameter-efficient fine-tuning (PEFT) techniques such as LoRA and QLoRA.

To significantly reduce VRAM consumption, the model has to be fine-tuned in 4-bit precision, which is why QLoRA is used in this context.

The finetuning process involves using the llama-2-7b-chat-hf model (the chat model) as the base model and training it on the FAQ dataset as described above.

QLoRA is configured to utilize a rank of 64 alongside a scaling parameter set at 16. The Llama 2 model is then directly loaded in 4-bit precision using the NF4 type and trained in a single epoch. Check the respective links for more information on PeftModel, TrainingArguments and SFTTrainer as well as LoRA parameters.

The code below details the step-by-step setup for the fine-tuning process. The first step is to load the llama-2-7b-chat-hf model and pick a name under which the fine-tuned model will be saved later on. The QLoRA parameters are then defined, along with the GPU selected for processing.

# The model that you want to train from the Hugging Face hub

model_name = "meta-llama/Llama-2-7b-chat-hf" #Need to apply for access
new_model = "llama-2-7b-saf3" # Fine-tuned model name

# QLoRA parameters
################################################################################
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
################################################################################
# bitsandbytes parameters
################################################################################
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
################################################################################
# TrainingArguments parameters
################################################################################
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"
# Number of training epochs
num_train_epochs = 1
# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
# Batch size per GPU for training
per_device_train_batch_size = 4
# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Learning rate schedule
lr_scheduler_type = "cosine"
# Number of training steps (overrides num_train_epochs)
max_steps = -1
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True
# Save checkpoint every X updates steps
save_steps = 0
# Log every X updates steps
logging_steps = 25
################################################################################
# SFT parameters
################################################################################
# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on the GPU 0
device_map = {"": 0}

The code below logs in to the Hugging Face Hub so that the model weights can be pushed there after training and merging.

from huggingface_hub import notebook_login
notebook_login()

Next, bitsandbytes for 4-bit quantization is configured, followed by the loading of the model in 4-bit precision on a GPU with the corresponding tokenizer.

The final bit in the code below loads QLoRA configurations and training parameters and passes everything to the SFTTrainer. The training process took about an hour, and the final model is here. The finetuning time may vary depending on the size of the dataset as well as other parameter settings.

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = LlamaForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    temperature=0.1,
    do_sample=True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
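Since report_to is set to "tensorboard", the training loss (logged every 25 steps) can be inspected during or after training. In a Colab/Jupyter notebook, something along these lines should work, assuming the logs land in the default location under the output directory:

# Visualize the metrics logged by the trainer
%load_ext tensorboard
%tensorboard --logdir results/runs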

The model was tested across a range of questions after training, as below:

  1. Sample Question 1:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Can a non-Kenyan register for MPESA? What options do they have?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Answer:

<s>[INST] Can a non-Kenyan register for MPESA? What options do they have? [/INST]  

Yes, a non-Kenyan can register for MPESA. However, they will need to have a valid passport or national ID number.

Here are the options available for non-Kenyan customers:

1. Register for MPESA as a Non-Kenyan Individual:

Non-Kenyan individuals can register for MPESA by providing their passport or national ID number. They will also need to provide their name, address, and phone number.

2. Register for MPESA as a Business:

Non-Kenyan businesses can register for MPESA by providing their business registration documents, including their certificate of incorporation, VAT registration, and business license.

3. Register for MPESA as a Non-Resident:

2. Sample Question 2:

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "I'm an American and just arrived in Kenya a few minutes ago. I need a SIM card to communicate. What requirememnts are needed for me to register for the SIM card and all services related to it?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Answer:

<s>[INST] I'm an American and just arrived in Kenya a few minutes ago. I need a SIM card to communicate. What requirememnts are  needed for me to register for the SIM card and  all services related to it? [/INST]  

Congratulations on your arrival in Kenya! To get a SIM card, you will need to provide the following documents:

1. Valid passport: You will need to provide your passport as proof of identity.
2. Visa: If you are a non-resident, you will need to obtain a visa before purchasing a SIM card.
3. Proof of address: You will need to provide proof of address, such as a hotel reservation or a letter from your employer.
4. Proof of payment: You will need to pay for the SIM card and any additional services you want to purchase.

3. Sample Question 3:

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is Make Ur Bundle??"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Answer:

<s>[INST] What is Make Ur Bundle?? [/INST]  

Make Ur Bundle is a service that allows you to create a customized bundle of your preferred data, voice and SMS bundles. You can choose the amount of data, voice and SMS you want to purchase and the duration of the bundle. Once you have created your bundle, you can purchase it and use it to make calls, send messages, and browse the internet.

The service is available to all Safaricom customers and can be accessed through the MySafaricom App or by dialing *544#.

Here are some of the benefits of Make Ur Bundle:

1. Customization: You can choose the amount of data, voice and SMS you want to purchase, making it easy to manage your resources.
2. Flexibility: You can purchase a bundle for a duration of 1 day, 3 days, 7 days

4. Sample Question 4:

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "When my Make Ur Bundle expires, what will happen to my unused data?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Answer:

<s>[INST] When my Make Ur Bundle expires, what will happen to my unused data? [/INST]  

If you have unused data in your Make Ur Bundle when it expires, you will be able'}
everybody. You will be able to use the data until it expires. If you do not use the data before the expiration date, it will be lost.

If you have a Make Ur Bundle with a data rollover option, any unused data will roll over to the next month. You will be able to use the rolled over data until it expires.

If you have a Make Ur Bundle with a data cap, any unused data will not roll over to the next month. You will need to purchase a new Make Ur Bundle with a data cap to continue using data.

If you have a Make Ur Bundle with a data rollover option and you do not use the data before the expiration date,

5. Sample Question 5:

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "Can I register two lines to MPESA with the same ID?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Answer:

<s>[INST] Can I register two lines to MPESA with the same ID? [/INST]  

No, you cannot register two lines to MPESA with the same ID. everybody has a unique ID.

If you have two lines, you will need to register them separately.

To register a new line, you will need to have your ID and a valid passport photo.

You can register a new line by following these steps:

1. Dial *544#
2. Select “Register a new line”
3. Enter your ID number
4. Enter your passport photo
5. Select “Next”
6. Enter your PIN
7. Select “Confirm”

Once you have registered your new line, you will be able to access all the features of MPESA, including sending and receiving money, paying bills, and buying airtime.

You can go ahead and play with the model, giving it a few “complicated” scenarios and recording the outputs. Based on the examples above, the model’s answers are logical and substantially true. Don’t act on the model’s output without verification, though, as it has a tendency to hallucinate on some topics; it’s better to double-check every detail on the Safaricom website.

To store llama-2-7b-saf3, the LoRA weights need to be merged with the base model weights. As Maxime mentions here, this is a bit tricky because of memory limitations: the base model has to be reloaded in FP16 precision and PEFT used to merge the weights, as shown in the code below. Therefore, ending the current runtime and re-running the first two steps and the fourth step should work.

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it (padding reuses the EOS token, as during training)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The merged model and its weights can then be pushed to the Hugging Face Hub for reuse in other applications. It can now be loaded just like any other Llama 2 model here. In addition, this model can be further fine-tuned with other datasets to fit related scenarios. You can interact with the model in this Colab notebook via a Gradio chat.
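For reference, pushing the merged model and tokenizer to the Hub and reloading them later looks roughly like the sketch below; replace your-username with your own Hugging Face account (the repository name is simply the new_model name used earlier).

# Push the merged weights and tokenizer to the Hugging Face Hub
# (requires the earlier notebook_login())
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

# Later, the fine-tuned model loads like any other Llama 2 checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer
ft_model = AutoModelForCausalLM.from_pretrained("your-username/llama-2-7b-saf3")
ft_tokenizer = AutoTokenizer.from_pretrained("your-username/llama-2-7b-saf3")

A minimal Gradio chat wrapper around the text-generation pipeline could then look like this (gr.ChatInterface is available in recent Gradio releases; the snippet below is a sketch rather than the exact code in the Colab notebook):

import gradio as gr

def chat(message, history):
    # Wrap the user message in the same Llama 2 instruction template used for training
    result = pipe(f"<s>[INST] {message} [/INST]")
    # Return only the generated answer, dropping the echoed prompt
    return result[0]["generated_text"].split("[/INST]")[-1].strip()

gr.ChatInterface(chat).launch()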

Conclusion

We looked at the process of fine-tuning a Llama 2 7B model on question-and-answer pairs drawn from the product and service-related FAQs of Kenya’s largest telco, Safaricom. The first part gave background on LLMs and showed how the dataset was prepared and loaded with the right prompt template. The second part covered the fine-tuning details.

Lastly, the model was tested on a few questions, and sample answers were given. I hope you enjoyed reading the article, and I wish you all the best as you fine-tune your models. The final part of the series will be a chatbot that Safaricom can deploy to take care of customer queries in a more intuitive way. Stay tuned for the final chatbot.
