RAG for Legal - is it actually that easy?

We see a lot of advancement in using LLMs for open question answering (mostly in the form of chatbots) out there in the market. Behind the mask, many of these chatbots are built on Retrieval Augmented Generation (RAG) technology, which makes question-answering chatbots across different domains very easy to develop with open-source frameworks such as LlamaIndex, LangChain, and many others. One domain we are particularly interested in in this work is the legal domain, which is intensely localized: knowledge is hardly shared across languages and countries. Thanoy is one example of a product that utilizes RAG for legal question answering. However, amid this RAG craze, hardly anyone in the room has questioned how well these models perform quantitatively, and what limitations RAG can (or can't) overcome.

For a fully immersive experience, we suggest the reader read our 50-page technical report.
We came up with many names: ThaiLegal - which is quite lame; JJKBench - which is totally irrelevant but 10/10 for memes; LawBench - which sounds too generic. So we finally chose NitiBench, which is probably the best name on this list.
To tackle the mystery of RAG performance in the legal domain, we first need a benchmark. Specifically, our benchmark consists of two components: a dataset and metrics - which can later be used to construct tables in our fancy paper.
We curate two datasets for assessing the legal capability of RAG frameworks:
Both datasets use samples in a triple (q, T, y) format, where q denotes the question, T represents the positive (relevant) law sections, and y is the answer to q given T.
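To make the triple format concrete, here is a minimal sketch in Python. The class and field names are our own illustration, not the dataset's actual schema, and the sample contents are placeholders:

```python
from dataclasses import dataclass


@dataclass
class LegalQASample:
    """One benchmark sample in the (q, T, y) format."""
    question: str                  # q: the legal question
    positive_sections: list[str]   # T: the relevant (positive) law sections
    answer: str                    # y: the gold answer to q given T


# A placeholder sample; contents are illustrative only.
sample = LegalQASample(
    question="<some legal question>",
    positive_sections=["<law section A>", "<law section B>"],
    answer="<gold answer grounded in the sections above>",
)
print(len(sample.positive_sections))  # 2
```

Note that T is a list: as discussed below, a single question can require more than one law section, which matters when picking retrieval metrics.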
We also introduce the WangchanX-Legal-ThaiCCL-RAG dataset, which contains both train and test splits. Our NitiBench-CCL is derived from the test split of this dataset. More details can be found in Section 3 of the technical report.
Now that we know what data to use, we need to address how to use that data to measure the effectiveness of the framework. In short, given a dataset and an LLM framework, what values do we use to measure its effectiveness from different perspectives?
First, we decompose RAG evaluation into two separate modules:
Measuring retriever effectiveness

Typically, evaluating a retriever is straightforward: just use hit rate, recall, or MRR. Hit rate and recall describe the accuracy of the retrieval model (a query counts as a "hit" if at least one relevant document is retrieved), while MRR measures correctness weighted by the rank of the retrieved document.
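As a concrete sketch of these standard metrics (function names and signatures are our own, not the paper's), here is how hit rate, recall, and MRR are commonly computed for a single query:

```python
def hit_rate(retrieved: list[str], gold: set[str]) -> float:
    """1.0 if at least one gold document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc in gold for doc in retrieved) else 0.0


def recall(retrieved: list[str], gold: set[str]) -> float:
    """Fraction of gold documents that appear in the retrieved list."""
    return sum(1 for doc in gold if doc in retrieved) / len(gold)


def mrr(retrieved: list[str], gold: set[str]) -> float:
    """Reciprocal rank of the first gold document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


# Example: two gold sections, one retrieved at rank 2.
retrieved = ["section_3", "section_1", "section_7"]
gold = {"section_1", "section_2"}
print(hit_rate(retrieved, gold))  # 1.0
print(recall(retrieved, gold))    # 0.5
print(mrr(retrieved, gold))       # 0.5 (first gold hit at rank 2)
```

Averaging these per-query values over the whole dataset gives the corpus-level scores. Note that standard MRR only rewards the first relevant hit, which already hints at the multi-positive problem discussed next.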
However, this works only when there is a single positive document per query. In the real world (or in the NitiBench tax split), a single query can have more than one relevant law section that must all be retrieved correctly. To handle this, we modified MRR and hit rate, proposing variants that make these metrics compatible with multi-positive setups: