Evaluate RAG with Human Feedback
When dealing with RAG (Retrieval-Augmented Generation) pipeline, your goal is not only evaluating a single LLM response, but also incorporating various assessments of the retrieved documents like contextual and answer relevancy and faithfulness.
In this example, you will create a labeling interface that aims to evaluate:
- Contextual relevancy of the retrieved documents
- Answer relevancy
- Answer faithfulness
For a tutorial on how to use this template with the Label Studio SDK, see Evaluate LLM Responses.
Configure the labeling interface
Create a project with the following labeling configuration:
<View>
<Style>
.htx-text {white - space: pre-wrap;}
.question {
font - size: 120%;
width: 800px;
margin-bottom: 0.5em;
border: 1px solid #eee;
padding: 0 1em 1em 1em;
background: #fefefe;
}
.answer {
font - size: 120%;
width: 800px;
margin-top: 0.5em;
border: 1px solid #eee;
padding: 0 1em 1em 1em;
background: #fefefe;
}
.doc-body {
white - space: pre-wrap;
overflow-wrap: break-word;
word-break: keep-all;
}
.doc-footer {
font - size: 85%;
overflow-wrap: break-word;
word-break: keep-all;
}
h3 + p + p {font - size: 85%;} /* doc id */
</Style>
<View className="question">
<Header value="Question"/>
<Text name="question" value="$question"/>
</View>
<View style="margin-top: 2em">
<Header value="Context"/>
<List name="results" value="$similar_docs" title="Retrieved Documents"/>
<Ranker name="rank" toName="results">
<Bucket name="relevant" title="Relevant"/>
<Bucket name="non_relevant" title="Non Relevant"/>
</Ranker>
</View>
<View className="answer">
<Header value="Answer"/>
<Text name="answer" value="$answer"/>
</View>
<Collapse>
<Panel value="How relevant is the answer to the provided context?">
<Choices name="answer_relevancy" toName="question" showInline="true">
<Choice value="Relevant" html="<div class="thumb-container" style="display: flex; gap: 20px;">
<div class="thumb-box" id="thumb-up" style="width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;">
<span class="thumb-icon" style="font-size: 48px;">&#128077;</span> <!-- Thumbs Up Emoji -->
</div></div>"/>
<Choice value="Non Relevant" html="<div class="thumb-container" style="display: flex; gap: 20px;">
<div class="thumb-box" id="thumb-down" style="width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;">
<span class="thumb-icon" style="font-size: 48px;">&#128078;</span> <!-- Thumbs Down Emoji -->
</div>
</div>"/>
</Choices>
</Panel>
</Collapse>
<Collapse>
<Panel value="If the answer factually aligns with the retrieved context?">
<Choices name="faithfulness" toName="question" showInline="true">
<Choice value="Relevant" html="<div class="thumb-container" style="display: flex; gap: 20px;">
<div class="thumb-box" id="thumb-up" style="width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;">
<span class="thumb-icon" style="font-size: 48px;">&#128077;</span> <!-- Thumbs Up Emoji -->
</div></div>"/>
<Choice value="Non Relevant" html="<div class="thumb-container" style="display: flex; gap: 20px;">
<div class="thumb-box" id="thumb-down" style="width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;">
<span class="thumb-icon" style="font-size: 48px;">&#128078;</span> <!-- Thumbs Down Emoji -->
</div>
</div>"/>
</Choices>
</Panel>
</Collapse>
</View>
This configuration includes the following elements:
<View>- All labeling configurations must include a baseViewtag. In this configuration, theViewtag is used to configure the display of blocks, similar to the div tag in HTML. It helps in organizing the layout of the labeling interface.<Style>- TheStyletag is used to define CSS styles that apply to the elements within theView. In this configuration, it sets styles for various classes various sections of the labeling interface layout.<Header>- TheHeadertag is used to display a header or title within the labeling interface. The text of the header is defined in thevalueparameter.<Text>- TheTexttag is used to display text provided by the input data. Given the example input data below, the text blocks are either displaying information from thequestionoranswerkeys in the source JSON. You will likely want to adjust these variables to match your own JSON data.<List>- List the retrieved documents. Given the example input data below, you are populating the list from thesimilar_docsfield in the source JSON.<Ranker>- TheRankertag creates UI elements that allow you to rank the list items by dragging and dropping them into different buckets.<Bucket>- TheBuckettag defines a category or container within the Ranker where items can be placed.
<Collapse>- TheCollapsetag creates a collapsible section that can be expanded or collapsed by the user.<Panel>- ThePaneltag is used within a Collapse element to define the content that can be expanded or collapsed.<Choices>- TheChoicestag presents a set of options for the annotator to choose from, specified by thenameandtoNameparameters.<Choice>- TheChoicetag defines an individual option within the Choices tag. In this example, choices are stylized to appear as clickable thumbs up and thumbs down icons.
Input data
In this example, you are including the prompt, the response, and the documents used for context.
[
{
"data": {
"question": "Can I use Label Studio for LLM evaluation?",
"answer": "Yes, you can use Label Studio for LLM evaluation.",
"similar_docs": [
{"id": 0, "body": "Label Studio is a data labeling tool."},
{"id": 1, "body": "Label Studio is a data labeling tool for AI projects."}
]
}
}
]
Use LlamaIndex
You can collect such data using the LlamaIndex framework.
pip install llama-index
For example, you can use a script to create a RAG pipeline to answer user queries regarding GitHub issues:
import os
from llama_index.readers.github import GitHubRepositoryIssuesReader, GitHubIssuesClient
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler, CBEventType
reader = GitHubRepositoryIssuesReader(
github_client=GitHubIssuesClient(),
owner="HumanSignal",
repo="label-studio",
)
llama_debug = LlamaDebugHandler()
callback_manager = CallbackManager([llama_debug])
# check if storage already exists
PERSIST_DIR = "./llama-index-storage"
if not os.path.exists(PERSIST_DIR):
# load the documents and create the index
documents = reader.load_data(state=GitHubRepositoryIssuesReader.IssueState.CLOSED)
index = VectorStoreIndex.from_documents(documents, callback_manager=callback_manager)
# store it for later
index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
# load the existing index
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context, callback_manager=callback_manager)
query_engine = index.as_query_engine()
question = "Can I use Label Studio for LLM evaluation?"
answer = query_engine.query(query)
# accessing the list of top retrieved documents from callback
event_pairs = llama_debug.get_event_pairs(CBEventType.RETRIEVE)
retrieved_nodes = list(event_pairs[0][1].payload.values())[0]
retrieved_documents = [node.text for node in retrieved_nodes]
Now you can use the SDK to construct a task that can be directly imported into Label Studio project given the labeling configuration described above:
task = {
"question": question,
"answer": answer,
"similar_docs": [{"id": i, "body": text} for i, text in enumerate(retrieved_documents)]
}