Kelly Shreeve

Data Scientist with an M.A. in sociology, a B.A. in environmental sociology, and 5+ years' experience teaching statistics. Completed TripleTen's 10-month data science bootcamp and a real-world data science externship with DataSpeak. Currently accepting data analysis and statistics consulting projects May 2024.

View My LinkedIn Profile

View My GitHub Profile

Generative Answer Chat Bot

Image of a cartoon AI chatbot

Background: An externship project with DataSpeak, a data science consulting firm, to develop an AI customer service chatbot that could be used across multiple clients. The chatbot learns from a dataset and answers questions that require domain-specific knowledge.

Purpose: There were three goals for the chatbot:

  1. Generate answers to user questions
  2. Pull information from a domain-specific dataset
  3. Produce accurate answers

Techniques: RAG, Llama-2, LangChain, Chainlit
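
The retrieval-augmented generation (RAG) idea behind these goals can be sketched in plain Python. This is a minimal illustration and not the project's code: the app itself is built with LangChain, Llama-2, and Chainlit, while this sketch swaps in a simple bag-of-words cosine similarity for retrieval and stops at building the prompt that a language model would receive. The function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the prompt wording, and the sample passages are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the answer in retrieved context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy context passages (invented for illustration).
passages = [
    "use slicing with a step of -1 to reverse a list",
    "a dict maps keys to values",
    "virtual environments isolate packages",
]
prompt = build_prompt("How do I reverse a list?", passages, k=1)
```

In a full pipeline, `prompt` would be passed to the language model, and the word-count retriever would typically be replaced by embedding-based similarity search over a vector store.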

View App Code

Examples

Open-Ended Question Answering

The model accurately responds to open-ended questions with information from the dataset in under 5 minutes on a GPU.

Chainlit App open ended question example

Multiple-Choice Question Answering

The model picks the correct option from a list of multiple-choice answers, demonstrating accuracy when answering customer questions.

Chainlit App multiple choice question example

Data

Data Acquisition

Data for this project came from a public Kaggle dataset of Python questions and answers.

Data Link: https://www.kaggle.com/datasets/stackoverflow/pythonquestions

Data Preparation

  1. Datasets were cleaned of HTML markup and special characters, and all text was converted to lowercase.
  2. Question and answer datasets were merged into one question-answer dataframe on the Id column.
  3. Answers from a sample of 100,000 question-answer pairs were used as a context document for model development.
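
The three preparation steps above can be sketched with the Python standard library alone. This is an illustrative stand-in and not the project's code, which works with dataframes. It assumes each record has `Id` and `Body` fields and that each answer points to its question through a `ParentId` field; the exact merge keys used in the project are not shown here, so treat that join as an assumption. The helper names (`clean_text`, `merge_qa`, `build_context`) and the toy records are hypothetical.

```python
import html
import random
import re

TAG_RE = re.compile(r"<[^>]+>")              # HTML tags such as <p> or <code>
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")      # anything except letters, digits, whitespace
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags, decode entities, lowercase, and drop special characters."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = text.lower()
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def merge_qa(questions, answers):
    """Inner-join answers to their questions (assumed key: answer ParentId = question Id)."""
    by_id = {q["Id"]: q for q in questions}
    pairs = []
    for a in answers:
        q = by_id.get(a["ParentId"])
        if q is None:
            continue  # skip answers whose question is missing
        pairs.append({
            "Id": q["Id"],
            "question": clean_text(q["Body"]),
            "answer": clean_text(a["Body"]),
        })
    return pairs


def build_context(pairs, sample_size, seed=0):
    """Sample up to sample_size pairs and join their answers into one context document."""
    rng = random.Random(seed)
    sample = pairs if len(pairs) <= sample_size else rng.sample(pairs, sample_size)
    return "\n".join(p["answer"] for p in sample)


# Toy records (invented for illustration).
questions = [
    {"Id": 1, "Body": "<p>How do I reverse a List?</p>"},
    {"Id": 2, "Body": "<p>What is a dict?</p>"},
]
answers = [
    {"Id": 10, "ParentId": 1, "Body": "<p>Use <code>my_list[::-1]</code></p>"},
    {"Id": 11, "ParentId": 3, "Body": "<p>Orphaned answer</p>"},
]
pairs = merge_qa(questions, answers)
context = build_context(pairs, sample_size=100_000)
```

Note that stripping every special character also removes code punctuation such as brackets and underscores, which is worth keeping in mind when the answers contain code snippets.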

Further Research and Development

This model should be tested on each domain-specific dataset to ensure it is able to learn accurate answers to common customer questions. Additionally, response times can be reduced by running the app on a GPU and storing vectors in Pinecone. The app can then be deployed as a web service for use by customers.

Full Project on Github