RAG for Legal - is it actually that easy?

We see a lot of advancement in using LLMs for open question answering (mostly in the form of chatbots) out there in the market. Behind the mask, many of these chatbots are built on Retrieval Augmented Generation (RAG) technology, which makes question-answering chatbots across different domains very easy to develop with open-source frameworks such as LlamaIndex, LangChain, and many others. One domain we are particularly interested in in this work is the legal domain, which is intensely localized: knowledge is hardly shared across languages and countries. Thanoy is one example of a product that utilizes RAG for legal question answering. However, amid this RAG craze, hardly anyone in the room has questioned how well these models perform quantitatively, and what limitations RAG can (or can't) overcome.

For a fully immersive experience, we suggest the reader read our 50-page technical report.
We came up with many names: ThaiLegal - which is quite lame; JJKBench - which is totally irrelevant but 10/10 for memes; LawBench - which sounds too generic. So we finally chose NitiBench, which is probably the best name on this list.
To tackle the mystery of RAG performance in the legal domain, we first need a benchmark. Specifically, our benchmark consists of two components: a dataset and metrics - which can later be used to construct tables in our fancy paper.
We curate two datasets for assessing the legal capability of RAG frameworks:
Both datasets use samples in a triple (q, T, y) format, where q denotes the question, T represents the positive (relevant) law sections, and y is the answer to q given T.
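To make the triple format concrete, here is a minimal sketch in Python. The class and field names are our own illustration, not the dataset's actual schema, and the sample contents are placeholders:

```python
from dataclasses import dataclass


@dataclass
class LegalQASample:
    """One benchmark sample in the (q, T, y) format."""
    question: str                  # q: the legal question
    positive_sections: list[str]   # T: the relevant (positive) law sections
    answer: str                    # y: the gold answer to q given T


# A placeholder sample; contents are illustrative only.
sample = LegalQASample(
    question="<some legal question>",
    positive_sections=["<law section A>", "<law section B>"],
    answer="<gold answer grounded in the sections above>",
)
print(len(sample.positive_sections))  # 2
```

Note that T is a list: as discussed below, a single question can require more than one law section, which matters when picking retrieval metrics.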
We also introduce the WangchanX-Legal-ThaiCCL-RAG dataset, which contains both train and test splits. Our NitiBench-CCL is derived from the test split of this dataset. More details can be found in Section 3 of the technical report.
Now that we know what data to use, we need to address how to use that data to measure the effectiveness of the framework. In short, given a dataset and an LLM framework, what values do we use to measure its effectiveness from different perspectives?
First, we decompose RAG evaluation into two separate modules:
Measuring retriever effectiveness

Typically, evaluating a retriever is straightforward: just use hit rate, recall, or MRR. Hit rate and recall describe the accuracy of the retrieval model (a query counts as a "hit" if at least one relevant document is retrieved), while MRR measures correctness weighted by the rank of the retrieved document.
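As a concrete sketch of these standard metrics (function names and signatures are our own, not the paper's), here is how hit rate, recall, and MRR are commonly computed for a single query:

```python
def hit_rate(retrieved: list[str], gold: set[str]) -> float:
    """1.0 if at least one gold document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc in gold for doc in retrieved) else 0.0


def recall(retrieved: list[str], gold: set[str]) -> float:
    """Fraction of gold documents that appear in the retrieved list."""
    return sum(1 for doc in gold if doc in retrieved) / len(gold)


def mrr(retrieved: list[str], gold: set[str]) -> float:
    """Reciprocal rank of the first gold document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


# Example: two gold sections, one retrieved at rank 2.
retrieved = ["section_3", "section_1", "section_7"]
gold = {"section_1", "section_2"}
print(hit_rate(retrieved, gold))  # 1.0
print(recall(retrieved, gold))    # 0.5
print(mrr(retrieved, gold))       # 0.5 (first gold hit at rank 2)
```

Averaging these per-query values over the whole dataset gives the corpus-level scores. Note that standard MRR only rewards the first relevant hit, which already hints at the multi-positive problem discussed next.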
However, this works only when there is a single positive document per query. In the real world (or in the NitiBench tax split), a single query can have more than one relevant law section that must all be retrieved correctly. To handle this, we modified MRR and hit rate, proposing variants that make these metrics compatible with multi-positive setups: