EUREQA

Are LLMs following the correct reasoning paths?


University of California, Davis · University of Pennsylvania · University of Southern California

We propose a novel probing method and benchmark called EUREQA. EUREQA is an entity-searching task where a model finds a missing entity based on described multi-hop relations with other entities. These deliberately designed multi-hop relations create deceptive semantic associations, and models must follow the correct reasoning path rather than take incorrect shortcuts to find the correct answer. Experiments show that existing LLMs cannot follow correct reasoning paths and resist the temptation of greedy shortcuts. Analyses provide further evidence that LLMs rely on semantic biases to solve the task instead of proper reasoning, questioning the validity and generalizability of current LLMs' high performance.

LLMs make errors when surface-level semantic cues are removed by recursively replacing entities with descriptions, and the errors are likely related to token similarity. GPT-3.5-turbo is used for this example.

The EUREQA dataset

Download the dataset from [Dataset]

In EUREQA, every question is constructed from an implicit reasoning chain. The chain is built by parsing DBpedia. Each layer comprises three components: an entity, a fact about the entity, and a relation between the entity and its counterpart in the next layer. The layers stack up to create chains with different depths of reasoning. We verbalize reasoning chains into natural sentences and anonymize the entity of each layer to create the question. Questions can be solved layer by layer, and each layer is guaranteed to have a unique answer. EUREQA is not a knowledge game: we adopt a knowledge filtering process that ensures that most LLMs have sufficient world knowledge to answer our questions.
EUREQA comprises a total of 2,991 questions of different reasoning depths and difficulties. The entities encompass a broad spectrum of topics, effectively reducing potential bias arising from specific entity categories. These properties make the data well suited for analyzing the reasoning processes of LLMs.
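To make the layered structure concrete, below is a minimal sketch of how a reasoning chain could be represented and verbalized into an anonymized question. The field names, placeholder scheme, and example chain are illustrative assumptions, not the official EUREQA schema or data.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    entity: str    # gold answer for this layer; hidden (anonymized) in the question
    fact: str      # a fact about the entity
    relation: str  # relation linking the entity to the next layer's entity

def verbalize(chain: list[Layer], anchor_entity: str) -> str:
    """Turn a reasoning chain into a multi-hop question whose entities are
    replaced by placeholders; only the final anchor entity stays visible."""
    clauses = []
    for depth, layer in enumerate(chain):
        this_name = f"Entity {chr(ord('A') + depth)}"
        next_name = (f"Entity {chr(ord('A') + depth + 1)}"
                     if depth + 1 < len(chain) else anchor_entity)
        clauses.append(f"{this_name} {layer.fact}, and {this_name} "
                       f"{layer.relation} {next_name}.")
    return "Who or what is Entity A? " + " ".join(clauses)

# A depth-2 toy chain; solving it means resolving Entity B before Entity A.
chain = [
    Layer("UC Davis", "is a public land-grant university", "has its main campus in"),
    Layer("Davis", "is known as a bicycle-friendly city", "is located in"),
]
print(verbalize(chain, anchor_entity="Yolo County, California"))
```

Solving such a question requires resolving Entity B from the visible anchor entity first and only then Entity A, mirroring the layer-by-layer construction described above.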

Categories of entities in EUREQA.
Splits of questions in EUREQA.

Performance

Here we present the accuracy of ChatGPT, Gemini-Pro, and GPT-4 on the hard set of EUREQA across different depths d of reasoning (the number of layers in a question). We evaluate two prompting strategies: a direct zero-shot prompt and in-context learning (ICL) with two examples. In general, when entities are recursively substituted with descriptions from the chained reasoning layers, thereby eliminating surface-level semantic cues, these models generate more incorrect answers. As the reasoning depth increases from one to five on hard questions, all models show a notable decline in performance. This finding underscores the significant impact that semantic shortcuts have on the accuracy of responses, and it also indicates that GPT-4 is considerably more capable of identifying and taking advantage of these shortcuts.

Accuracy (%)   d=1            d=2            d=3            d=4            d=5
               direct  ICL    direct  ICL    direct  ICL    direct  ICL    direct  ICL
ChatGPT        22.3    53.3   7.0     40.0   5.0     39.2   3.7     39.3   7.2     39.0
Gemini-Pro     45.0    49.3   29.5    23.5   27.3    28.6   25.7    24.3   17.2    21.5
GPT-4          60.3    76.0   50.0    63.7   51.3    61.7   52.7    63.7   46.9    61.9
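For concreteness, the following sketch shows how the two prompting setups in the table might be evaluated. `query_model` stands in for any chat-model API call, and the prompt template and substring-match scoring are our assumptions rather than the exact evaluation code used for the paper.

```python
from typing import Callable, Sequence

def build_prompt(question: str,
                 demos: Sequence[tuple[str, str]] = ()) -> str:
    """Direct zero-shot prompt when `demos` is empty; k-shot ICL otherwise."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos]  # in-context examples
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def accuracy(questions: Sequence[dict],
             query_model: Callable[[str], str],
             demos: Sequence[tuple[str, str]] = ()) -> float:
    """Fraction of questions whose gold entity appears in the model's reply."""
    correct = 0
    for item in questions:  # each item: {"question": ..., "answer": ...}
        reply = query_model(build_prompt(item["question"], demos))
        correct += int(item["answer"].lower() in reply.lower())
    return correct / max(len(questions), 1)
```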

Analyses and discussion

Can humans solve EUREQA?

We carried out a human analysis on the hard set with a reasoning depth of five. Two computer science PhD students served as annotators to ensure sufficient expertise. We randomly sampled 50 questions from this set for annotation. The annotators achieve an average accuracy of 95%, with an inter-annotator agreement of Cohen's κ = 0.79.

Do LLMs take shortcuts?

One of our main motivations is to test if language models can follow a simple yet effective reasoning chain instead of taking semantic shortcuts based on entity associations. We design an analytical experiment based on entity similarity to investigate whether LLMs take such semantic shortcuts.
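As one illustration of such an analysis, the sketch below scores how similar a model's answer is to the entities visible on the surface of the question; systematically higher similarity for wrong answers than for gold answers would be consistent with shortcut-taking. The difflib-based token similarity and the example entities are illustrative assumptions, not the exact metric used in the paper.

```python
from difflib import SequenceMatcher

def token_similarity(a: str, b: str) -> float:
    """Character-sequence similarity in [0, 1] between two entity names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def shortcut_score(prediction: str, surface_entities: list[str]) -> float:
    """Maximum similarity between a prediction and any entity visible in the question."""
    return max(token_similarity(prediction, e) for e in surface_entities)

# Hypothetical usage: compare a wrong answer against a correct one.
wrong = shortcut_score("University of California, Berkeley",
                       ["University of California, Davis", "Yolo County"])
right = shortcut_score("Aggie Stadium",
                       ["University of California, Davis", "Yolo County"])
print(f"wrong-answer similarity {wrong:.2f} vs. correct-answer similarity {right:.2f}")
```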

Do open-source LLMs perform better?

To expand the scope of our findings, we experimented with open-source Llama-2 models of different sizes on the hard questions of EUREQA. Similar to our observations on the GPT-series models, the accuracy of the Llama models declines notably as the reasoning depth increases from one to five on the hard set.

Will prompting solve EUREQA?

Although the effectiveness of prompting techniques is beyond the scope of this paper, we additionally tested the Tree of Thoughts (ToT) method (Yao et al., 2023) on ChatGPT with a "propose" strategy, which tries to decompose the questions layer by layer and solve them sequentially. Our results show that this prompting method fails completely, even when we provide human-written examples for question decomposition. In no experiment did the ToT method generate a valid answer in the final response.
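For reference, here is a heavily simplified, sequential caricature of what a "propose"-style decomposition loop could look like. It is not the official Tree of Thoughts implementation from Yao et al. (2023), nor the exact prompts used in our experiments; `query_model` again stands in for any chat-model call.

```python
def propose_and_solve(question: str, depth: int, query_model) -> str:
    """Peel off one layer at a time: extract the innermost description,
    answer it, substitute the answer back, and repeat."""
    partial = question
    for _ in range(depth, 0, -1):
        # 1) Propose: isolate the innermost entity description.
        sub_q = query_model(
            "Extract the innermost entity description from this question, "
            "i.e. the one that can be answered without resolving any other "
            f"placeholder:\n{partial}")
        # 2) Solve the sub-question with a single entity name.
        sub_a = query_model(f"Answer with a single entity name:\n{sub_q}")
        # 3) Substitute the answer back into the question.
        partial = query_model(
            f"Rewrite the question by replacing the description\n'{sub_q}'\n"
            f"with the entity '{sub_a}':\n{partial}")
    # After all layers are resolved, ask for the outermost entity.
    return query_model(f"Answer with a single entity name:\n{partial}")
```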

Will optimal retrieval solve EUREQA?

Although our knowledge filtering process has already removed the knowledge barrier, one could still ask whether a Retrieval-Augmented Generation (RAG) method can solve this task. To address this concern, we tested GPT-4 on 300 randomly sampled 5-layer hard questions. To minimize the impact of the retrieval method itself, we investigate a "performance upper bound" setting, which directly injects into the input question the DBpedia retrieval results for the visible entities and relations. A more detailed explanation can be found in the paper. The accuracy of GPT-4 under this setting is 62.0%, which is close to the 61.9% we report for the original setting. We can therefore hypothesize that simply injecting knowledge into the model can hardly solve the problem, and that the bottleneck remains the reasoning/question-decomposition ability of LLMs. Moreover, this indicates that our knowledge-filtering process effectively estimates parametric knowledge and remains reliable after self-consistency.
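The sketch below illustrates this upper-bound setting under the assumption that the relevant DBpedia facts for the visible entities have already been retrieved as (subject, relation, object) triples; the prompt wording is illustrative and not the exact template used in the paper.

```python
def inject_knowledge(question: str,
                     triples: list[tuple[str, str, str]],
                     query_model) -> str:
    """Prepend retrieved DBpedia triples to the question before querying the model."""
    facts = "\n".join(f"- {s} | {r} | {o}" for s, r, o in triples)
    prompt = (
        "Use the following retrieved facts to answer the question.\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\nAnswer with a single entity name:"
    )
    return query_model(prompt)
```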

Even when given all the information required to answer the question (selected information shown in the figure), GPT-4 still makes mistakes starting from the early layers (highlighted in grey). We only show partial output here. Note that we provide GPT-4 with few-shot prompts.

Acknowledgement

This website is adapted from Nerfies, UniversalNER and LLaVA, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models.

Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, ChatGPT, and the original dataset used in the benchmark. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.