
Question Answering Benchmarking: Paul Graham Essay

Here we go over how to benchmark performance on a question answering task over a Paul Graham essay.

It is highly recommended that you do any evaluation/benchmarking with tracing enabled. See the tracing documentation for an explanation of what tracing is and how to set it up.
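
A common way to enable tracing is to set a couple of environment variables before running any chains. The exact variable names depend on your LangChain version and tracing backend, so treat the ones below as assumptions rather than a definitive setup:

import os

# Assumed environment-variable setup; names vary across LangChain versions.
os.environ["LANGCHAIN_TRACING_V2"] = "true"  # turn tracing on
os.environ["LANGCHAIN_API_KEY"] = "..."      # credentials for the assumed tracing backend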

Loading the data

First, let's load the data.

from langchain.evaluation.loading import load_dataset

dataset = load_dataset("question-answering-paul-graham")
    Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--question-answering-paul-graham-76e8f711e038d742/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
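
Each datapoint is a dict with a question and a reference answer, which the rest of this notebook relies on. As a quick sanity check on the load (a minimal sketch, assuming load_dataset returns a list of such dicts):

print(len(dataset))
print(dataset[0].keys())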

Setting up a chain

Now we need to create some pipelines for doing question answering. The first step is creating an index over the data in question.

from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

# Load the essay and build a vector store index over it.
loader = TextLoader("../../modules/paul_graham_essay.txt")
vectorstore = VectorstoreIndexCreator().from_loaders([loader]).vectorstore
    Running Chroma using direct local API.
    Using DuckDB in-memory for database. Data will be transient.

Now we can create a question answering chain.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    input_key="question",
)

Make a prediction

First, we can make predictions one datapoint at a time. Working at this level of granularity allows us to explore the outputs in detail, and it is also a lot cheaper than running over many datapoints.

chain(dataset[0])
    {'question': 'What were the two main things the author worked on before college?',
     'answer': 'The two main things the author worked on before college were writing and programming.',
     'result': ' Writing and programming.'}

Make many predictions

Now we can make predictions for all the datapoints in the dataset.

predictions = chain.apply(dataset)

Evaluate performance

Now we can evaluate the predictions. The first thing we can do is look at them by eye.

predictions[0]
    {'question': 'What were the two main things the author worked on before college?',
     'answer': 'The two main things the author worked on before college were writing and programming.',
     'result': ' Writing and programming.'}

Next, we can use a language model to score them programmatically.

from langchain.evaluation.qa import QAEvalChain

llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(
    dataset, predictions, question_key="question", prediction_key="result"
)

We can add the graded output to the predictions dict and then get a count of the grades.

for i, prediction in enumerate(predictions):
    prediction["grade"] = graded_outputs[i]["text"]

from collections import Counter

Counter([pred["grade"] for pred in predictions])
    Counter({' CORRECT': 12, ' INCORRECT': 10})
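
As a quick summary metric, we can also turn those counts into an accuracy score (note the leading space in the grader's labels, as seen in the Counter above):

num_correct = sum(pred["grade"] == " CORRECT" for pred in predictions)
print(f"{num_correct / len(predictions):.2%} correct")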

We can also filter the datapoints to the incorrect examples and look at them.

incorrect = [pred for pred in predictions if pred["grade"] == " INCORRECT"]
incorrect[0]
    {'question': 'What did the author write their dissertation on?',
     'answer': 'The author wrote their dissertation on applications of continuations.',
     'result': ' The author does not mention what their dissertation was on, so it is not known.',
     'grade': ' INCORRECT'}