[Issue]: <title> Substantial Indexation Speed Degradation v 0.3.2 -> 1.2.0 #1733

AxelSjoberg · 2025-02-24T11:09:44Z

Do you need to file an issue?

I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

Creating a new graph index took 6.5 minutes in v0.3.2 and takes 33 minutes in v1.2.0, roughly a fivefold slowdown.

The increase seems to come from the entity_extraction step, which is very slow for me in the latest version. I tried to replicate the environment variables as closely as possible between the two versions, so far without success.

To check that the increased compute time isn’t due to different prompts from prompt tuning, I ran a pipeline where I copied the tuned prompt from entity_extraction.txt in v0.3.2 into a v1.2.0 pipeline (copied over after the tuning step, of course), but the entity extraction is still slow for me. This leads me to believe something might be broken in the library, or that I'm missing something.

Steps to reproduce

v.1.2.0

```python
# ----------- imports -----------

import subprocess
from pathlib import Path

# ROOT_PATH, domain and capture_output are defined elsewhere in the script (not shown).

# ----------- graphrag init -----------

command_list = [
    "python",
    "-m",
    "graphrag",
    "init",
    "--root", ROOT_PATH,
]

result = subprocess.run(command_list, capture_output=capture_output, text=True)

# Modify settings env file...

# ----------- Prompt Tune -----------

cmd = [
    "python", "-m", "graphrag", "prompt-tune",
    "--root", ROOT_PATH,
    "--config", str(Path(ROOT_PATH, "settings.yaml")),
    "--domain", domain,
    "--no-discover-entity-types",
]

subprocess.run(cmd, check=True)

# ----------- Indexation -----------

cmd = [
    "python",
    "-m",
    "graphrag", "index",
    "--root", ROOT_PATH,
    "--output", str(Path(ROOT_PATH, "output")),
    "--verbose",
]

subprocess.run(cmd, check=True)
```

v.0.3.2

```python
# ----------- imports -----------

import subprocess
from pathlib import Path

# ROOT_PATH is defined elsewhere in the script (not shown).

# ----------- graphrag init -----------

command_setup = [
    "python",
    "-m",
    "graphrag.index",
    "--init",
    "--root",
    ROOT_PATH,
]

subprocess.run(command_setup, capture_output=False, text=False)

# Modify settings env file...

# ----------- Prompt Tune -----------

cmd = [
    "python", "-m", "graphrag.prompt_tune",
    "--root", ROOT_PATH,
    "--config", str(Path(ROOT_PATH, "settings.yaml")),
    "--no-entity-types",
]

subprocess.run(cmd, check=True)

# ----------- Indexation -----------

cmd = [
    "python",
    "-m",
    "graphrag.index",
    "--root",
    ROOT_PATH,
]

subprocess.run(cmd, check=True)
```
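
To see which stage accounts for the difference between the two versions, each `subprocess.run` call in the steps above could be wrapped in a wall-clock timer. This is only a minimal sketch: `timed_run` is a hypothetical helper, not part of GraphRAG, and the demo command is a stand-in so the snippet runs without GraphRAG installed.

```python
import subprocess
import sys
import time


def timed_run(label, cmd):
    """Run cmd to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return elapsed


# Stand-in command; replace it with the prompt-tune or index command list
# from the reproduction steps to time the real stages.
demo_seconds = timed_run("demo", [sys.executable, "-c", "pass"])
```

Timing the init, prompt-tune and index commands separately in both versions would show whether the extra time is spent entirely in the index step.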

GraphRAG Config Used

v.1.2.0

async_mode: threaded
basic_search:
  prompt: prompts/basic_search_system_prompt.txt
cache:
  base_dir: cache
  type: file
chunks:
  group_by_columns:
  - id
  overlap: 100
  size: 1200
claim_extraction:
  description: Any claims or facts that could be relevant to information discovery.
  enabled: false
  max_gleanings: 1
  prompt: prompts/claim_extraction.txt
cluster_graph:
  max_cluster_size: 10
community_reports:
  max_input_length: 8000
  max_length: 2000
  prompt: prompts/community_report.txt
drift_search:
  prompt: prompts/drift_search_system_prompt.txt
  reduce_prompt: prompts/drift_search_reduce_prompt.txt
embed_graph:
  enabled: false
embeddings:
  async_mode: threaded
  llm:
    api_base: ${GRAPHRAG_API_BASE}
    api_key: ${GRAPHRAG_API_KEY}
    api_version: 2024-02-15-preview
    concurrent_requests: 50
    model: text-embedding-3-large
    type: openai_embedding
  vector_store: # is this the reason for the slow down?
    collection_name: default
    db_uri: output/lancedb
    overwrite: true
    type: lancedb
encoding_model: cl100k_base
entity_extraction:
  entity_types:
  - organization
  - person
  - geo
  - event
  max_gleanings: 1
  prompt: prompts/entity_extraction.txt
global_search:
  knowledge_prompt: prompts/global_search_knowledge_system_prompt.txt
  map_prompt: prompts/global_search_map_system_prompt.txt
  reduce_prompt: prompts/global_search_reduce_system_prompt.txt
input:
  base_dir: input
  file_encoding: utf-8
  file_pattern: .*\.csv$
  file_type: csv
  type: file
llm:
  api_base: ${GRAPHRAG_API_BASE}
  api_key: ${GRAPHRAG_API_KEY}
  api_version: 2024-02-15-preview
  batch_size: 16
  concurrent_requests: 50
  max_retries: 3
  model: gpt-4o-mini
  model_supports_json: true
  type: azure_openai_chat
local_search:
  prompt: prompts/local_search_system_prompt.txt
parallelization:
  num_threads: 50
  stagger: 0.3
reporting:
  base_dir: logs
  type: file
skip_workflows: []
snapshots:
  embeddings: false
  graphml: false
  transient: false
storage:
  base_dir: output
  type: file
summarize_descriptions:
  max_length: 500
  prompt: prompts/summarize_descriptions.txt
umap:
  enabled: false
update_index_storage: null

v.0.3.2

async_mode: threaded
cache:
  base_dir: cache
  type: file
chunks:
  group_by_columns:
  - id
  overlap: 100
  size: 1200
claim_extraction:
  description: Any claims or facts that could be relevant to information discovery.
  max_gleanings: 1
  prompt: prompts/claim_extraction.txt
cluster_graph:
  max_cluster_size: 10
community_reports:
  max_input_length: 8000
  max_length: 2000
  prompt: prompts/community_report.txt
embed_graph:
  enabled: false
embeddings:
  async_mode: threaded
  llm:
    api_base: ${GRAPHRAG_API_BASE}
    api_key: ${GRAPHRAG_API_KEY}
    api_version: 2024-02-15-preview
    concurrent_requests: 50
    model: text-embedding-3-large
    type: openai_embedding
encoding_model: cl100k_base
entity_extraction:
  entity_types:
  - organization
  - person
  - geo
  - event
  max_gleanings: 1
  prompt: prompts/entity_extraction.txt
global_search: null
input:
  base_dir: input
  file_encoding: utf-8
  file_pattern: .*\.csv$
  file_type: csv
  type: file
llm:
  api_base: ${GRAPHRAG_API_BASE}
  api_key: ${GRAPHRAG_API_KEY}
  api_version: 2024-02-15-preview
  batch_size: 16
  concurrent_requests: 50
  max_retries: 3
  model: gpt-4o-mini
  model_supports_json: true
  type: azure_openai_chat
local_search: null
parallelization:
  stagger: 0.3
reporting:
  base_dir: logs
  type: file
skip_workflows: []
snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false
storage:
  base_dir: output
  type: file
summarize_descriptions:
  max_length: 500
  prompt: prompts/summarize_descriptions.txt
umap:
  enabled: false
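
Since the two configs are long, a key-by-key diff can make the differences easier to spot. The sketch below flattens nested settings into dotted paths and compares them. It assumes the first config listed above belongs to v1.2.0 and the second to v0.3.2, and it uses small hand-copied excerpts rather than the full files; `flatten` and `diff_configs` are hypothetical helper names.

```python
def flatten(cfg, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} pairs."""
    flat = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old, new):
    """Return (only_in_old, only_in_new, changed) as sorted lists of dotted keys."""
    a, b = flatten(old), flatten(new)
    only_old = sorted(a.keys() - b.keys())
    only_new = sorted(b.keys() - a.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_old, only_new, changed


# Excerpts copied by hand from the configs above (assumed version order).
v032 = {
    "parallelization": {"stagger": 0.3},
    "snapshots": {"graphml": False, "raw_entities": False, "top_level_nodes": False},
    "llm": {"concurrent_requests": 50, "max_retries": 3},
}
v120 = {
    "parallelization": {"num_threads": 50, "stagger": 0.3},
    "snapshots": {"embeddings": False, "graphml": False, "transient": False},
    "llm": {"concurrent_requests": 50, "max_retries": 3},
}

only_old, only_new, changed = diff_configs(v032, v120)
print("only in v0.3.2:", only_old)
print("only in v1.2.0:", only_new)
print("changed values:", changed)
```

For these excerpts no shared key changes value; the differences are keys present in only one version, such as `parallelization.num_threads`.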

Logs and screenshots

No response

Additional Information

GraphRAG Version: 1.2.0
Operating System: MacOS Sequoia
Python Version: 3.11.10
Related Issues: -

natoverse · 2025-02-26T00:11:33Z

We've been working through a number of issues since adopting fnllm for API call management. 2.0.0 was just released, which we believe resolves these issues.
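
Before re-running the benchmark, it may help to confirm which release is actually installed. A minimal sketch using the standard library's `importlib.metadata`; the helper names are hypothetical, and the version parsing assumes plain numeric `X.Y.Z` strings (pre-release suffixes are not handled).

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(text):
    """Turn '1.2.0' into (1, 2, 0); assumes plain numeric X.Y.Z strings."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_at_least(installed, minimum="2.0.0"):
    """True if the installed version string is at or above the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("graphrag")
    status = "ok" if is_at_least(installed) else "older than 2.0.0"
    print(f"graphrag {installed}: {status}")
except PackageNotFoundError:
    print("graphrag is not installed in this environment")
except ValueError:
    print("graphrag version string is not plain X.Y.Z; check it manually")
```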

AxelSjoberg · 2025-02-26T06:48:33Z

Thanks @natoverse, I'll try it out again after we've migrated to v 2.0.0!

AxelSjoberg added the triage label (default label assignment; indicates a new issue needs review by a maintainer) · Feb 24, 2025

natoverse added the awaiting_response label (maintainers or community have suggested solutions or requested info; awaiting filer response) · Feb 26, 2025
