Web Scraping Product Driven Question-Answer Pairs

A Comprehensive Guide to Web Scraping Safaricom’s FAQs to Finetune a Large Language Model for Chatbot Development

Herman Wandabwa
6 min read · Sep 6, 2023

I’ve always been fascinated by the workings of Large Language Models and, more specifically, by fine-tuning the models to be better at certain tasks. As a Kenyan, I thought of a few companies that could benefit from leveraging the power of such fine-tuned models. I thought of companies like Safaricom (Kenya’s largest telco by subscriber base) and Kenya Airways, the national airline, that could benefit by deploying such models to even handle their customer queries in a more “intelligent” way compared to traditional chatbots.

This is part one of a two-part series in which I build a scraper to collect most of the FAQs about Safaricom products. I later use this data to fine-tune an open-source Llama 2 Large Language Model and eventually develop a chatbot through which users can interact with the fine-tuned model. The second part is here.

FAQs Scraper:

In the world of data-driven decision-making, access to relevant information is paramount. Web scraping is a valuable technique for extracting that data from websites, and Python offers powerful tools to accomplish the task.

In this guide, I’ll take you through the step-by-step process that I followed to extract FAQs from Safaricom’s website here. This data is what I used to fine-tune the Llama 2 model in the second part of this series.

With Python, it's always easier to install all the required packages in a dedicated environment, and Anaconda makes this process quite easy. Details for creating one can be found here.

1. Setting Up the Environment

Before diving into web scraping, it's essential to set up your Python environment with the necessary libraries. For the scraping environment, I made use of the following Python libraries:

BeautifulSoup: This library helps parse HTML and XML documents, making it easier to extract information from web pages.

Selenium: Selenium is a web testing framework that allows for the automation of interactions with websites. It’s particularly useful for scraping dynamic content and interacting with JavaScript-driven pages.

Pandas: Pandas is a powerful data manipulation and analysis library. We’ll use it to store and manipulate the scraped data efficiently.

Random: We use the random library to add random delays between requests to avoid overloading the target website’s server.

You can install these libraries in a Conda environment using the following commands:

conda create -n webscraper_env python=3.9 # change to the Python version of your liking
conda activate webscraper_env
conda install beautifulsoup4 selenium pandas

Once your environment is set up, you’re ready to start scraping!
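
For reference, the scraper in the next section assumes the following imports (standard module paths for Selenium 4, BeautifulSoup 4, and Pandas; adjust if your versions differ):

import random
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC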

2. How the Code Works

As mentioned earlier, the web scraping code is designed to scrape questions and answers from this URL. The selectors are tied to that page's layout, so they would need to be adjusted before the code can be used on another website. Let's delve into how it functions:

Scraping Architecture

Selenium with a Chrome browser is used to navigate the URL. A webdriver.Chrome object is created, allowing us to interact with web pages programmatically. A User-Agent header is then set to mimic a regular desktop browser, so that the scraping activity looks less suspicious; this is a common practice when building web scrapers.
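
If you would rather not have a browser window pop up while the scraper runs, Chrome can also be started in headless mode by adding one more option before the driver is created. This is a small optional tweak, not part of the code below, which runs with a visible, maximized window:

chrome_options = Options()
# Optional: run Chrome without a visible window ("--headless" on older Chrome versions)
chrome_options.add_argument("--headless=new")
driver = webdriver.Chrome(options=chrome_options)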

I made use of the function scrape_qa_from_url to perform the scraping. The code is below:

# Function to scrape questions and answers from a given URL
def scrape_qa_from_url(url, product_type):
    chrome_options = Options()
    # Set the User-Agent header to mimic a legitimate web browser
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
    driver = webdriver.Chrome(options=chrome_options)
    driver.maximize_window()
    driver.get(url)
    wait = WebDriverWait(driver, 30)
    time.sleep(random.uniform(10, 30))

    # Store the locator of the accordion parent <div> element
    accordion = wait.until(EC.visibility_of_element_located((By.ID, 'faqs')))
    # Parse it into soup
    soup = BeautifulSoup(accordion.get_attribute('outerHTML'), 'html.parser')

    # Find all the questions and answers
    questions = soup.find_all('a', class_='card-title')
    answers = soup.find_all('div', class_='card-body')

    # Create two empty lists to store the questions and answers
    question_list = []
    answer_list = []

    # Iterate over the questions and answers
    for question, answer in zip(questions, answers):
        question_list.append(question.text.strip())

        try:
            # Find all the paragraph elements inside the answer element
            paragraphs = answer.find_all('p')
            ul_list = answer.find('ul')

            # Combine the text from all paragraphs to get the complete answer
            answer_text = ""
            for paragraph in paragraphs:
                answer_text += paragraph.text.strip() + "\n"

            # Check if the answer contains a <ul> tag
            if ul_list:
                # Extract the text from each <li> tag and concatenate it to the answer
                li_list = ul_list.find_all('li')
                for li in li_list:
                    answer_text += li.text.strip() + "\n"

            answer_list.append(answer_text)
        except:
            # If no answer is found, add an empty string
            answer_list.append("")

    # Create a DataFrame with the questions and answers
    df = pd.DataFrame({'Question': question_list, 'Answer': answer_list})
    driver.quit()
    return df

The function above takes two arguments: url (the URL of the FAQ page to scrape) and product_type (a label for the type of product or category being scraped; it is accepted for bookkeeping but not added to the returned DataFrame). Inside scrape_qa_from_url, a Chrome WebDriver is set up with Selenium: the browser window is maximized and a User-Agent header is set to mimic a legitimate web browser.
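
As a quick sanity check, the function can be called on a single FAQ page before wiring it into the full crawl. The URL below is only a placeholder; substitute any product FAQ link collected from the landing page:

# Hypothetical single-page test; replace the URL with a real product FAQ link
sample_url = "https://www.safaricom.co.ke/..."  # placeholder
sample_df = scrape_qa_from_url(sample_url, "Example product")
print(sample_df.head())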

The function opens the provided URL, waits for the page to load, and then parses the HTML of the FAQ accordion (the element with the id faqs) using BeautifulSoup. Inside that section, it locates all the questions and answers: the card titles hold the questions and the card bodies hold the answers, and their text is extracted and formatted. A random delay of up to 30 seconds is added before each page is parsed, mimicking the browsing speed of a human.

The function iterates through the question-answer pairs, extracting the contents of each section. Answer text is pulled from the paragraph elements, and if an answer also contains a bulleted list (a <ul> tag), the individual list items are appended to it. The scraped data is stored in two lists, one for questions and one for answers, and then placed in a Pandas DataFrame for further manipulation. About five answers out of the 1,755 question-answer pairs came back empty, and a simple filter was enough to drop them (a sketch of this, together with saving the result, follows the complete code). Below is the complete code that you can re-use in case you have a related problem.

def scrape_qa_from_url(url, product_type):
    chrome_options = Options()
    # Set the User-Agent header to mimic a legitimate web browser
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
    driver = webdriver.Chrome(options=chrome_options)
    driver.maximize_window()
    driver.get(url)
    wait = WebDriverWait(driver, 30)
    time.sleep(random.uniform(10, 30))

    # Store the locator of the accordion parent <div> element
    accordion = wait.until(EC.visibility_of_element_located((By.ID, 'faqs')))
    # Parse it into soup
    soup = BeautifulSoup(accordion.get_attribute('outerHTML'), 'html.parser')

    # Find all the questions and answers
    questions = soup.find_all('a', class_='card-title')
    answers = soup.find_all('div', class_='card-body')

    # Create two empty lists to store the questions and answers
    question_list = []
    answer_list = []

    # Iterate over the questions and answers
    for question, answer in zip(questions, answers):
        question_list.append(question.text.strip())

        try:
            # Find all the paragraph elements inside the answer element
            paragraphs = answer.find_all('p')
            ul_list = answer.find('ul')

            # Combine the text from all paragraphs to get the complete answer
            answer_text = ""
            for paragraph in paragraphs:
                answer_text += paragraph.text.strip() + "\n"

            # Check if the answer contains a <ul> tag
            if ul_list:
                # Extract the text from each <li> tag and concatenate it to the answer
                li_list = ul_list.find_all('li')
                for li in li_list:
                    answer_text += li.text.strip() + "\n"

            answer_list.append(answer_text)
        except:
            # If no answer is found, add an empty string
            answer_list.append("")

    # Create a DataFrame with the questions and answers
    df = pd.DataFrame({'Question': question_list, 'Answer': answer_list})
    driver.quit()
    return df

# Initialize the WebDriver for the main FAQ landing page
driver = webdriver.Chrome()
driver.maximize_window()
URL = "https://www.safaricom.co.ke/media-center-landing/frequently-asked-questions"  # replace with your URL
driver.get(URL)
wait = WebDriverWait(driver, 30)
time.sleep(random.uniform(10, 30))

# Find all sections with class 'col-sm-12 col-md-4 year'
sections = driver.find_elements(By.CLASS_NAME, 'col-sm-12.col-md-4.year')

# Create an empty DataFrame to store the final results
final_df1 = pd.DataFrame()

# Loop through each section and scrape questions and answers
for section in sections:
    # Extract links from the current section
    links = section.find_elements(By.TAG_NAME, 'a')
    hrefs = [link.get_attribute('href') for link in links]

    for sub_url in hrefs:
        try:
            product_type_elem = driver.find_element(By.TAG_NAME, 'h1')
            product_type = product_type_elem.text.strip()
            df_sub = scrape_qa_from_url(sub_url, product_type)
            final_df1 = pd.concat([final_df1, df_sub], ignore_index=True)
        except Exception as e:
            print(f"Error: {e}")

    # Refresh the sections list to avoid StaleElementReferenceException
    sections = driver.find_elements(By.CLASS_NAME, 'col-sm-12.col-md-4.year')

# Close the WebDriver
driver.quit()

# Display the final DataFrame
print(final_df1)
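
Finally, the handful of empty answers mentioned earlier can be filtered out and the cleaned question-answer pairs written to disk for the fine-tuning step in part two. A minimal sketch is below; the file name is just an example:

# Drop the few rows whose answers came back empty, then save the pairs
clean_df = final_df1[final_df1['Answer'].str.strip() != ""].reset_index(drop=True)
clean_df.to_csv("safaricom_faqs.csv", index=False)  # example file name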

With this guide, you should have a solid foundation for setting up a Python environment for web scraping, understanding how the code works, and saving the scraped data. I'll be keen to tackle more complicated scraping use cases in the future.

Remember to scrape responsibly, respecting website terms of service and legal regulations, and happy web scraping!
