Harnessing GPT-Powered AI to Query and Summarize Multiple Hansard Reports in the Kenyan Parliament

Herman Wandabwa
May 19, 2023

The Hansard and Audio Services Directorate within the Kenyan Parliament is responsible for recording and producing verbatim reports of parliamentary proceedings and committee deliberations. Curious about the topics that Members of Parliament have discussed over time, I set out to explore the Hansard reports for specific sessions or sittings. However, given the length of these reports and the difficulty of identifying the relevant dates, this proved time-consuming and potentially unproductive.

This led me to explore effective methods of querying PDF documents and obtaining insightful information on specific topics. After considering various options, I decided to leverage Large Language Models (LLMs), with OpenAI being my preferred choice.

In this article, I will discuss the following topics:

  1. Sourcing data: Extracting PDFs from the official website, as they are publicly available.
  2. PDF validity check: Verifying that each downloaded file has a valid PDF header.
  3. Setting up dependencies: Configuring the necessary software libraries and tools.
  4. Querying the Hansard reports for 2018: Although there is no particular significance attributed to the 2018 reports, I chose this subset for demonstration purposes.
  5. Summarizing PDFs: Generating a short summary of each report.

Let’s begin our exploration!

  1. Sourcing data

A convenient approach for downloading the files was to inspect the page’s HTML first. My objective was to acquire all the files from 2013 onwards. The reports were spread across 42 pages, each containing multiple files. To simplify the process, I opted for a less dynamic approach and automated the downloads on a page-by-page basis.

To initiate the download, right-click and copy the link for the first page in the pager. It will resemble this: http://www.parliament.go.ke/the-national-assembly/house-business/hansard?title=%20&field_parliament_value=2022&page=0. The last page link looks like this: http://www.parliament.go.ke/the-national-assembly/house-business/hansard?title=%20&field_parliament_value=2022&page=41. The objective, then, was to download everything between these two pages.

For this task, I used the BeautifulSoup and Requests packages. The following code defines the range of pages to traverse, from the first to the last, based on the start and end page numbers, and creates a new folder named “pdfs” to store the downloaded files.

import requests
from bs4 import BeautifulSoup
import os


# Set the base URL and the starting page number
base_url = 'http://www.parliament.go.ke/the-national-assembly/house-business/hansard?title=%20&field_parliament_value=All&page='
start_page = 0
end_page = 41

# Create a directory to store the PDF files
if not os.path.exists('pdfs'):
    os.mkdir('pdfs')

The code below loops through each page, finds all links to PDF files on that page, and downloads them.

# Loop through all the pages and download the PDF files
for page in range(start_page, end_page + 1):
    # Build the URL for the current page
    url = base_url + str(page)

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the response
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all the links to PDF files on the page
    links = soup.find_all('a', href=lambda href: href and href.endswith('.pdf'))

    # Loop through all the links and download the corresponding PDF files
    for link in links:
        # Build the URL for the PDF file
        pdf_url = link['href']

        # Build the file path for the PDF file
        file_path = 'pdfs/' + pdf_url.split('/')[-1]

        # Download the PDF file in chunks
        response = requests.get(pdf_url, stream=True)

        with open(file_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)

I successfully downloaded all of the reports, from 2013 up to the most recent ones, roughly 515 files in total.

2. PDF validity check

PDF files that are not properly constructed will raise an error when read and loaded. To address this, I wrote a small check that inspects each document in the folder for a valid %PDF header. Any file with an invalid header is printed and can then be deleted, as shown in the following code snippet.

import glob  # needed here for file pattern matching (os was imported earlier)

folder_path = "./pdfs/"  # replace with the path to your folder

# Find all PDF files in the folder
pdf_files = glob.glob(os.path.join(folder_path, "*.pdf"))

# Loop over each PDF file and check whether it has a valid header
for pdf_file in pdf_files:
    with open(pdf_file, "rb") as f:
        header = f.read(4)
        if header != b"%PDF":
            print(f"{pdf_file} has an invalid header: {header}")  # if nothing is printed, all files are in order
            # os.remove(pdf_file)  # uncomment to delete the bad PDFs

3. Setting up dependencies

To begin, we will need to install and import the necessary dependencies for Python 3.8. You can set up these dependencies in a Conda environment, or you can directly execute the installation in a Jupyter notebook. Another option is to utilize Google Colab, which is a free alternative.

# !pip install langchain
# !pip install openai
# !pip install chromadb

from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain import OpenAI, PromptTemplate
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders import PyPDFDirectoryLoader

import glob
import os

4. Querying the Hansard reports for 2018

The next step is straightforward. I utilized vector stores and queried their indexes to facilitate interaction with the PDFs. Indexes are used to organize documents in a way that optimizes interaction with language models (LLMs). LangChain offers several modules that assist in loading, structuring, storing, and retrieving documents. In this instance, we will load the documents from a folder. To accommodate computational constraints, I only queried the 2018 Hansard documents, but it is feasible to load all PDFs into the vector store.

The process can be implemented simply using the following code:

# Set up OpenAI API credentials
os.environ["OPENAI_API_KEY"] = "xxxxxx" #Use your API key instead

loader = PyPDFDirectoryLoader("./2018/") #change the folder to your own

docs = loader.load()  # optional: load the raw documents for inspection; the index below is built directly from the loader

# Create the vector store index
index = VectorstoreIndexCreator().from_loaders([loader])

That’s it. Index structures are used to organize files in a manner that facilitates efficient querying.
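As a minimal sketch of how the index can then be queried (the question below is an illustrative placeholder, not one of my original queries), the index object returned by VectorstoreIndexCreator exposes a query method:

# Ask a question against the indexed 2018 Hansard reports
# (the question text is only an illustrative placeholder)
question = "What issues were raised about the national budget in 2018?"
answer = index.query(question)
print(answer)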

Sample queries:

I queried the vector store regarding the information provided below, and the answers were largely coherent. The key factor lies in “prompt engineering,” and I strongly believe that effective prompting can lead to highly accurate initial responses.

Sample queries and respective answers

5. Summarizing PDFs

Each document can be summarized using the following approach: loop through the folder, read each PDF, and collect its summary using the function provided below. The temperature value governs the randomness and, thus, the creativity of the responses.

llm = OpenAI(temperature=0.2)

def summaries(pdf_folder, custom_prompt):
    summaries = []
    for pdf_file in glob.glob(pdf_folder + "/*.pdf"):
        # Load and split each PDF into chunks
        loader = PyPDFLoader(pdf_file)
        docs = loader.load_and_split()

        # Prompt used for both the map and combine steps
        prompt_template = custom_prompt + """
{text}
SUMMARY:"""
        PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

        # Map-reduce summarization chain over the document chunks
        chain = load_summarize_chain(llm, chain_type="map_reduce",
                                     map_prompt=PROMPT, combine_prompt=PROMPT)
        summary_output = chain({"input_documents": docs}, return_only_outputs=True)["output_text"]
        summaries.append(summary_output)

    return summaries
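As a usage sketch (the folder path and prompt wording below are illustrative placeholders rather than the exact values I used), the function can be called like this:

# Summarize the 2018 reports with a custom instruction
# (folder path and prompt wording are illustrative placeholders)
pdf_summaries = summaries("./2018", "Write a concise summary of the following parliamentary proceedings:")

for summary in pdf_summaries:
    print(summary)
    print("-" * 80)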

I suggest selecting only a few documents to summarize, as it will be faster and more cost-effective given that each call consumes your OpenAI API credits.

Below are some sample summarized results from the 2018 reports:

Sample Document summaries

However, a little hallucination does occur, as shown in the summary below: there is a reference to the “House of Commons”, yet it is not mentioned anywhere in the source document.

Hallucination Example

Hallucination in Large Language Models (LLMs) refers to the phenomenon where a model produces output that is syntactically and semantically fluent but is not grounded in the source data. This poses significant ethical concerns and can have detrimental consequences, particularly for users who depend heavily on language models.

Conclusion:

That concludes the walkthrough of building a simple scraper to download the Kenyan Parliament’s Hansard reports in PDF format, creating a vector store from a subset of the reports, and querying it with predominantly relevant results. The experiments can be replicated with LangChain and OpenAI, bringing a degree of intelligence to otherwise static documents.
