Interpretable LLM-based Table Question Answering

Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, Freddy Lecue

2025

Links: pdf | code | project page

Interpretability for Table Question Answering (Table QA) is critical, particularly in high-stakes industries like finance or healthcare. Although recent approaches using Large Language Models (LLMs) have significantly improved Table QA performance, their explanations of how answers are generated remain ambiguous. To fill this gap, we introduce Plan-of-SQLs (POS), an interpretable Table QA approach designed to improve users' understanding of model decision-making. Through qualitative and quantitative evaluations with human and LLM judges, we show that: First, POS produces the highest-quality explanations, helping human users understand model behavior and verify model predictions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FetaQA), POS achieves QA accuracy that is competitive with or superior to existing methods, while also offering greater efficiency (requiring significantly fewer LLM calls and table database queries) and robust performance on large tables. Finally, we observe high agreement (up to 90%) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating explanations. This finding enables faster, more affordable evaluation of AI explanations, potentially accelerating trustworthy-AI research while maintaining reliable judgments on interpretability.

Acknowledgment: This work is supported by the National Science Foundation under Grant No. 2145767, Adobe Research, and the NaphCare Charitable Foundation.

Figure 1: (a) End-to-End: relies entirely on an LLM to answer the question directly, leaving no room for users to understand the prediction. (b) Text-to-SQL: generates an SQL command to solve the query, which requires domain expertise to understand and becomes unintelligible when the query is complex. (c) Chain-of-Table (CoTable): performs planning with abstract functions and executes them sequentially to arrive at the final answer. However, the function arguments are not justified, and the final answer depends on the LLM's opaque reasoning. (d) Plan-of-SQLs (POS, Ours): plans in natural language, making each step simple and understandable. Each step is then converted into an SQL command, sequentially transforming the input table end-to-end to produce the final answer.
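The POS pipeline in panel (d) can be illustrated with a minimal sketch: assume each natural-language plan step has already been converted into an SQL command, and the commands are then applied one after another, each materializing a new intermediate table that users can inspect. The function name, table schema, and toy question below are hypothetical, not from the paper's implementation.

```python
import sqlite3

def execute_plan(rows, columns, plan_sql):
    """Apply a sequence of SQL transformations to an in-memory table.

    rows: list of tuples; columns: list of column names;
    plan_sql: SQL strings, each reading table `t` and defining the next `t`.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (%s)" % ", ".join(columns))
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    for step in plan_sql:
        # Materialize each intermediate table so every step is inspectable.
        conn.execute("CREATE TABLE t_next AS " + step)
        conn.execute("DROP TABLE t")
        conn.execute("ALTER TABLE t_next RENAME TO t")
    result = conn.execute("SELECT * FROM t").fetchall()
    conn.close()
    return result

# Toy example: "Was any 1996 medal gold?" as two transparent steps.
rows = [("1996", "gold"), ("1996", "silver"), ("2000", "bronze")]
plan = [
    "SELECT * FROM t WHERE year = '1996'",       # Step 1: keep 1996 rows
    "SELECT medal = 'gold' AS answer FROM t "    # Step 2: check for gold
    "ORDER BY answer DESC LIMIT 1",
]
print(execute_plan(rows, ["year", "medal"], plan))  # → [(1,)]
```

Because each intermediate table is produced by an explicit, simple SQL step, a user can follow the chain of tables to verify how the final answer was reached.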


Table 1: Accuracy (%) on TabFact and WikiTQ using GPT-3.5 and GPT-4o-mini. "Breakdown" indicates whether queries are decomposed into sub-problems (see Figure 2, step 1). "Transformed by" indicates whether intermediate tables are transformed by an LLM or a program (see Figure 2, steps 2–3). "Answered by" specifies whether the final answer is generated by an LLM or a program (see Figure 2, step 4). LLM-only approaches produce the final answer without table transformations. Bold values indicate the best performance for each model and dataset.