The DQA (aka difficult questions attempted) project uses one or more agents to help large language models (LLMs) perform multi-hop question answering (MHQA). This project is inspired by a tutorial [1] from Dean Sacoransky. While the tutorial builds its agents with the LangGraph framework from LangChain, this project makes use of LlamaIndex Workflows.
The tutorial uses the question "Which David Fincher film that stars Edward Norton does not star Brad Pitt?" as a litmus test for assessing new AI systems. The answer is supposed to be None, but at the time of writing the tutorial (26 August 2024), the author states that ChatGPT's `gpt-4o` model generates the following response.
The David Fincher film starring Edward Norton that does not star Brad Pitt is the "The Game" (1997). Edward Norton appears in an uncredited cameo role in this film.
The author further states that it is impossible "to answer this complex, multi-hop, logical question in one feed-forward pass of a neural network". At the end of the tutorial, the improved response to the question using agents that perform retrieval augmented generation (RAG) is seen to be the following.
None, as there is only one mentioned David Fincher film starring Edward Norton, which is "Fight Club" and it stars Brad Pitt.
This project implements an agent-based framework akin to the one mentioned in the tutorial [1].
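To give a flavour of LlamaIndex Workflows, here is a minimal, self-contained sketch of a single-step workflow. This is not the DQA workflow itself; the class, step and parameter names are illustrative only, and the snippet assumes a recent `llama-index` release in which the workflow API lives under `llama_index.core.workflow`.

```python
import asyncio

from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step


class EchoWorkflow(Workflow):
    """A trivial single-step workflow that echoes the query it receives."""

    @step
    async def answer(self, ev: StartEvent) -> StopEvent:
        # StartEvent carries the keyword arguments passed to run(); the value
        # wrapped in StopEvent is what run() eventually returns.
        return StopEvent(result=f"You asked: {ev.query}")


async def main() -> None:
    result = await EchoWorkflow(timeout=60).run(query="What is multi-hop QA?")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

The actual DQA workflows chain multiple such steps (for example, query decomposition and ReAct-style tool use, as noted in the status table below) rather than a single echo step.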
Let's focus on a slightly simpler test question that nonetheless baffles ChatGPT's `gpt-4o`. Let's ask: "Tell me the result of multiplying the number of 'r's in the word 'strawberry' with the sum of 3 and 4. Explain the process." ChatGPT hopelessly responds to this with the following.
Let's break down the problem step by step:
1. Find the number of 'r's in the word "strawberry": The word "strawberry" contains 2 'r's.
2. Find the sum of 3 and 4: $3 + 4 = 7$.
3. Multiply the number of 'r's by the sum: $2 \times 7 = 14$.
So, the result of multiplying the number of 'r's in "strawberry" by the sum of 3 and 4 is 14.
While ChatGPT did not make mistakes with the basic arithmetic operations, it miscounted the number of 'r's in the word 'strawberry'. In contrast, the DQA workflow gets it right with the following answer, as of September 6, 2024, surprisingly using the `gpt-4o-mini` model.
Final Answer: The result of multiplying the number of 'r's in the word 'strawberry' (which is 3) with the sum of 3 and 4 (which is 7) is 21.
Detailed Explanation: To arrive at the final answer, we first determined the number of 'r's in the word 'strawberry'. The analysis revealed that there are 3 'r's in the word. Next, we calculated the sum of 3 and 4, which is 7. Finally, we multiplied these two results together: 3 (the number of 'r's) multiplied by 7 (the sum of 3 and 4) equals 21. Therefore, the final result is 21.
The `gpt-4o-mini` model is able to count the number of 'r's correctly because DQA lets it use a function that counts the occurrences of a specific character or sequence of characters in a string.
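The actual tool lives in the project source; the following is a minimal sketch, assuming a plain Python function that the LLM can call as a tool, of what such a counting helper might look like (the function name is illustrative).

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count the non-overlapping occurrences of a character or a sequence of
    characters in a string, ignoring case."""
    return text.lower().count(pattern.lower())


# The tool gives the LLM the intermediate result it cannot reliably produce itself:
# count_occurrences("strawberry", "r") -> 3, and 3 * (3 + 4) = 21.
print(count_occurrences("strawberry", "r") * (3 + 4))  # prints 21
```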
With multiple workflows now available, check the workflows page for further details.
Recalling the litmus test question (i.e., "Which David Fincher film that stars Edward Norton does not star Brad Pitt?"), the response from DQA with `gpt-4o-mini` is correct, in that the answer is none, but the response is long-winded.
The David Fincher film that stars Edward Norton but does not feature Brad Pitt is none. The only film directed by David Fincher that includes both Edward Norton and Brad Pitt is Fight Club (1999). In this film, Edward Norton plays the unnamed narrator, while Brad Pitt portrays Tyler Durden. Therefore, there are no David Fincher films starring Edward Norton that exclude Brad Pitt.
To summarize:
- Film featuring both Edward Norton and Brad Pitt: Fight Club (1999)
- Other films directed by David Fincher include:
  - Alien 3 (1992)
  - Se7en (1995)
  - The Game (1997)
  - Panic Room (2002)
  - Zodiac (2007)
  - The Curious Case of Benjamin Button (2008)
  - The Social Network (2010)
  - The Girl with the Dragon Tattoo (2011)
  - Gone Girl (2014)
  - Mank (2020)
The generated responses depend heavily on the underlying LLM, which makes them rather inconsistent. In addition, while the workflow handles the examples shown here, there remains room for improvement with respect to wasteful LLM calls, wasteful tool calls, consistency of answers from the same LLM, and the ability to generate reliable answers from low-parameter quantised models (available on Ollama, for instance), amongst others.
The following table lists some updates on the project status. Note that these do not correspond to specific commits or milestones.
Date | Status | Notes or observations |
---|---|---|
September 21, 2024 | active | Workflows made selectable. |
September 13, 2024 | active | Low parameter LLMs perform badly in unnecessary self-discovery, query refinements and ReAct tool selections. |
September 10, 2024 | active | Query decomposition may generate unnecessary sub-workflows. |
August 31, 2024 | active | Using built-in ReAct agent. |
August 29, 2024 | active | Project started. |
Create and activate a Python virtual environment in the directory where you have cloned this repository. Let us refer to this directory as the working directory or WD (interchangeably) hereafter. You could do that using pyenv, for example. Make sure you use Python 3.12.0 or later. Inside the activated virtual environment, run the following.
python -m pip install -U pip
python -m pip install -U -r requirements.txt
While calling `pip install` with `-U` on `requirements.txt` will install the latest versions of the packages, this may create an environment with unforeseen bugs and incompatibilities. To create a more stable environment, run `pip` on a list of packages that specifies package versions.
python -m pip install -r requirements-frozen.txt
If necessary, you can uninstall everything previously installed by `pip` (in the virtual environment) by running the following.
python -m pip freeze | cut -d "@" -f1 | xargs pip uninstall -y
In addition to the Python dependencies, see the installation instructions for Ollama. You can install it on a separate machine. Download the tool-calling Ollama model that you want to use, e.g., `llama3.1` or `mistral-nemo`.
Following is a list of environment variables that can be used to configure the DQA application. All environment variables should be supplied as quoted strings; they will be interpreted as the correct type as necessary (a rough sketch of this parsing appears after the table below).
For environment variables starting with `GRADIO_`, see the Gradio documentation on environment variables.
Variable | [Default value] and description |
---|---|
`ANTHROPIC_API_KEY` | [None] Check the docs to get an API key. |
`LLM__ANTHROPIC_MODEL` | [claude-3-opus-20240229] See the available models. |
`OPENAI_API_KEY` | [None] Check the docs to get an API key. |
`LLM__OPENAI_MODEL` | [gpt-4o-mini] See the available models. |
`COHERE_API_KEY` | [None] Check the docs to get an API key. |
`LLM__COHERE_MODEL` | [command-r-plus] See the available models. |
`GROQ_API_KEY` | [None] Check the docs to get an API key. |
`LLM__GROQ_MODEL` | [llama-3.1-70b-versatile] See the available models. |
`LLM__OLLAMA_URL` | [http://localhost:11434] URL of your desired Ollama host. |
`LLM__OLLAMA_MODEL` | [mistral-nemo] See the available models. The model must be available on the selected Ollama server and must support tool calling. |
`LLM__PROVIDER` | [Ollama] Select one from the supported list of providers (see `SUPPORTED_LLM_PROVIDERS` below). |
`SUPPORTED_LLM_PROVIDERS` | [Anthropic:Open AI:Cohere:Groq:Ollama] The separator character is ":". A subset of the default set of LLM providers may be used to restrict access in a particular deployment. |
`LLM__TEMPERATURE` | [0.0] Inferred type: float. The temperature setting for the LLM, which controls the randomness of the output. |
`LLM__TOP_P` | [0.4] Inferred type: float. The nucleus sampling hyperparameter that controls the randomness of the LLM output. Only available when using the Ollama LLM provider. |
`LLM__TOP_K` | [40] Inferred type: int. The top-k setting for the LLM, which controls token selection. Only available when using the Ollama LLM provider. |
`LLM__REPEAT_PENALTY` | [1.1] Inferred type: float. A parameter to control repeated sequences in the output. Only available when using the Ollama LLM provider. |
`LLM__SEED` | [1] Inferred type: int. Used to initialise the LLM's sampling process; any fixed value results in a deterministic initialisation. Only available when using the Ollama LLM provider. |
`TAVILY_API_KEY` | [None] Check the docs to get an API key. |
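As a rough illustration of how quoted environment strings might be interpreted as typed values, consider the hypothetical helper below; it is not the project's actual configuration code, and the helper name and defaults are illustrative.

```python
import os


def env_float(name: str, default: float) -> float:
    """Read an environment variable supplied as a quoted string and
    interpret it as a float, falling back to a default value."""
    value = os.getenv(name)
    return float(value) if value is not None else default


# For example, LLM__TEMPERATURE="0.0" in the .env file becomes the float 0.0.
temperature = env_float("LLM__TEMPERATURE", 0.0)
```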
Make a copy of the file `.env.docker` in the working directory as a `.env` file.
cp .env.docker .env
Change all occurrences of `host.docker.internal` to `localhost` or some other host or IP, assuming that you have Ollama listening on port 11434 on your preferred host. Set the Ollama model to the tool-calling model that you have downloaded on your Ollama installation. Set the value of `LLM__PROVIDER` to the provider that you want to use. Supported names are `Anthropic`, `Cohere`, `Groq`, `Ollama` and `Open AI`.
You can use the environment variable `SUPPORTED_LLM_PROVIDERS` to further restrict the supported LLM providers to a subset of the aforementioned, for example by setting the value to `Groq:Ollama` to allow only Groq and Ollama in a particular deployment of this app. Note that the only separating character between LLM provider names is `:`. If you add a provider that is not in the aforementioned set, the app will throw an error and refuse to start, as in the sketch below.
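The sketch below illustrates, under the assumption of a simple startup check rather than the project's actual code, how the ":"-separated value could be split and validated against the default provider set.

```python
import os

# The full set of providers that DQA supports by default.
DEFAULT_PROVIDERS = {"Anthropic", "Open AI", "Cohere", "Groq", "Ollama"}

# e.g., SUPPORTED_LLM_PROVIDERS="Groq:Ollama" restricts a deployment to two providers.
configured = os.getenv("SUPPORTED_LLM_PROVIDERS", "Anthropic:Open AI:Cohere:Groq:Ollama")
providers = [p for p in configured.split(":") if p]

unknown = set(providers) - DEFAULT_PROVIDERS
if unknown:
    # Mirrors the documented behaviour: an unrecognised provider prevents the app from starting.
    raise ValueError(f"Unsupported LLM provider(s): {', '.join(sorted(unknown))}")
```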
Add the API keys for Anthropic, Cohere, Groq or Open AI if you want to use any of these providers. In addition, add an API key for Tavily.
With all of this set up, run the following to start the web server. The web server serves a web user interface as well as a REST API. It is not configured to use HTTPS.
python src/webapp.py
The web UI will be available at http://localhost:7860.
In `.env.docker`, Ollama is expected to be available on port 11434 on your Docker host, i.e., `host.docker.internal`. Set that to some other host if that is where your Ollama server is available. Set the Ollama model to the tool-calling model that you have downloaded on your Ollama installation.
Set the value of `LLM__PROVIDER` to the provider that you want to use, and add the API keys for the Anthropic, Cohere, Groq and Open AI LLM providers, as well as that of Tavily, as mentioned above in the Usage (local) section.
With all of this set up, and assuming that you have Docker installed, you can build an image of the DQA app, create a container and start it as follows.
docker build -f local.dockerfile -t dqa .
docker create -p 7860:7860/tcp --name dqa-container dqa
docker container start dqa-container
You can replace the second line above with the following, in order to use a `.env` file on your Docker host that resides at the absolute path `PATH_TO_YOUR_.env_FILE`.
docker create -v /PATH_TO_YOUR_.env_FILE:/home/app_user/app/.env -p 7860:7860/tcp --name dqa-container dqa
The web server will serve a web user interface as well as a REST API at http://localhost:7860. It is not configured to use HTTPS.
Install `pre-commit` for Git and `ruff`. Then enable `pre-commit` by running the following in the WD.
pre-commit install
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.