[Issue]: Substantial Indexation Speed Degradation v0.3.2 -> v1.2.0 #1733

Open
3 tasks done
AxelSjoberg opened this issue Feb 24, 2025 · 2 comments
Labels
  • awaiting_response: Maintainers or community have suggested solutions or requested info, awaiting filer response
  • triage: Default label assignment, indicates new issue needs reviewed by a maintainer

Comments

@AxelSjoberg

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

The time for creating a new graph index increased from 6.5 minutes in v0.3.2 to 33 minutes in v1.2.0.

The indexing time increase seems to be related to the entity_extraction process, which is very slow in the latest version for me. I tried to replicate the environment variables as closely as possible but without success so far.

To check that the increased compute time isn’t due to different prompts from prompt tuning, I ran a pipeline where I copied the tuned prompt from entity_extraction.txt in v0.3.2 into a v1.2.0 pipeline (copied over after the tuning step, of course), but the entity extraction is still slow for me. This leads me to believe something might be broken in the library, or that I'm missing something.
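
To make the comparison between versions reproducible, the indexing step can be wrapped in a wall-clock timer. The following is a minimal sketch, not part of the original report: `time_command` and `slowdown` are hypothetical helper names, and the only figures used are the 6.5 and 33 minute durations quoted above.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd to completion and return the wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def slowdown(old_minutes: float, new_minutes: float) -> float:
    """Ratio of the new duration to the old one."""
    return new_minutes / old_minutes


# Durations reported in this issue: 6.5 min (v0.3.2) vs 33 min (v1.2.0).
print(f"slowdown: {slowdown(6.5, 33):.1f}x")  # prints "slowdown: 5.1x"

# Sanity check of the timer with a trivial command; in practice `cmd` would be
# the indexing command list from the reproduction steps.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f}s")
```

Timing each workflow step separately in the same way would help confirm whether entity extraction alone accounts for the difference.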

Steps to reproduce

v.1.2.0

```python
# ----------- imports -----------
import subprocess
from pathlib import Path

# ROOT_PATH, domain, and capture_output are defined elsewhere in the script (not shown).

# ----------- graphrag init -----------
command_list = [
    "python",
    "-m",
    "graphrag",
    "init",
    "--root", ROOT_PATH,
]

result = subprocess.run(command_list, capture_output=capture_output, text=True)

# Modify settings env file...

# ----------- Prompt Tune -----------
cmd = [
    "python", "-m", "graphrag", "prompt-tune",
    "--root", ROOT_PATH,
    "--config", str(Path(ROOT_PATH, "settings.yaml")),
    "--domain", domain,
    "--no-discover-entity-types",
]

subprocess.run(cmd, check=True)

# ----------- Indexation -----------
cmd = [
    "python",
    "-m",
    "graphrag", "index",
    "--root", ROOT_PATH,
    "--output", str(Path(ROOT_PATH, "output")),
    "--verbose",
]

subprocess.run(cmd, check=True)
```

v.0.3.2

```python
# ----------- imports -----------
import subprocess
from pathlib import Path

# ROOT_PATH is defined elsewhere in the script (not shown).

# ----------- graphrag init -----------
command_setup = [
    "python",
    "-m",
    "graphrag.index",
    "--init",
    "--root",
    ROOT_PATH,
]

subprocess.run(command_setup, capture_output=False, text=False)

# Modify settings env file...

# ----------- Prompt Tune -----------
cmd = [
    "python", "-m", "graphrag.prompt_tune",
    "--root", ROOT_PATH,
    "--config", str(Path(ROOT_PATH, "settings.yaml")),
    "--no-entity-types",
]

subprocess.run(cmd, check=True)

# ----------- Indexation -----------
cmd = [
    "python",
    "-m",
    "graphrag.index",
    "--root",
    ROOT_PATH,
]

subprocess.run(cmd, check=True)
```

GraphRAG Config Used

v.1.2.0

async_mode: threaded
basic_search:
  prompt: prompts/basic_search_system_prompt.txt
cache:
  base_dir: cache
  type: file
chunks:
  group_by_columns:
  - id
  overlap: 100
  size: 1200
claim_extraction:
  description: Any claims or facts that could be relevant to information discovery.
  enabled: false
  max_gleanings: 1
  prompt: prompts/claim_extraction.txt
cluster_graph:
  max_cluster_size: 10
community_reports:
  max_input_length: 8000
  max_length: 2000
  prompt: prompts/community_report.txt
drift_search:
  prompt: prompts/drift_search_system_prompt.txt
  reduce_prompt: prompts/drift_search_reduce_prompt.txt
embed_graph:
  enabled: false
embeddings:
  async_mode: threaded
  llm:
    api_base: ${GRAPHRAG_API_BASE}
    api_key: ${GRAPHRAG_API_KEY}
    api_version: 2024-02-15-preview
    concurrent_requests: 50
    model: text-embedding-3-large
    type: openai_embedding
  vector_store: # is this the reason for the slow down?
    collection_name: default
    db_uri: output/lancedb
    overwrite: true
    type: lancedb
encoding_model: cl100k_base
entity_extraction:
  entity_types:
  - organization
  - person
  - geo
  - event
  max_gleanings: 1
  prompt: prompts/entity_extraction.txt
global_search:
  knowledge_prompt: prompts/global_search_knowledge_system_prompt.txt
  map_prompt: prompts/global_search_map_system_prompt.txt
  reduce_prompt: prompts/global_search_reduce_system_prompt.txt
input:
  base_dir: input
  file_encoding: utf-8
  file_pattern: .*\.csv$
  file_type: csv
  type: file
llm:
  api_base: ${GRAPHRAG_API_BASE}
  api_key: ${GRAPHRAG_API_KEY}
  api_version: 2024-02-15-preview
  batch_size: 16
  concurrent_requests: 50
  max_retries: 3
  model: gpt-4o-mini
  model_supports_json: true
  type: azure_openai_chat
local_search:
  prompt: prompts/local_search_system_prompt.txt
parallelization:
  num_threads: 50
  stagger: 0.3
reporting:
  base_dir: logs
  type: file
skip_workflows: []
snapshots:
  embeddings: false
  graphml: false
  transient: false
storage:
  base_dir: output
  type: file
summarize_descriptions:
  max_length: 500
  prompt: prompts/summarize_descriptions.txt
umap:
  enabled: false
update_index_storage: null

v.0.3.2

async_mode: threaded
cache:
  base_dir: cache
  type: file
chunks:
  group_by_columns:
  - id
  overlap: 100
  size: 1200
claim_extraction:
  description: Any claims or facts that could be relevant to information discovery.
  max_gleanings: 1
  prompt: prompts/claim_extraction.txt
cluster_graph:
  max_cluster_size: 10
community_reports:
  max_input_length: 8000
  max_length: 2000
  prompt: prompts/community_report.txt
embed_graph:
  enabled: false
embeddings:
  async_mode: threaded
  llm:
    api_base: ${GRAPHRAG_API_BASE}
    api_key: ${GRAPHRAG_API_KEY}
    api_version: 2024-02-15-preview
    concurrent_requests: 50
    model: text-embedding-3-large
    type: openai_embedding
encoding_model: cl100k_base
entity_extraction:
  entity_types:
  - organization
  - person
  - geo
  - event
  max_gleanings: 1
  prompt: prompts/entity_extraction.txt
global_search: null
input:
  base_dir: input
  file_encoding: utf-8
  file_pattern: .*\.csv$
  file_type: csv
  type: file
llm:
  api_base: ${GRAPHRAG_API_BASE}
  api_key: ${GRAPHRAG_API_KEY}
  api_version: 2024-02-15-preview
  batch_size: 16
  concurrent_requests: 50
  max_retries: 3
  model: gpt-4o-mini
  model_supports_json: true
  type: azure_openai_chat
local_search: null
parallelization:
  stagger: 0.3
reporting:
  base_dir: logs
  type: file
skip_workflows: []
snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false
storage:
  base_dir: output
  type: file
summarize_descriptions:
  max_length: 500
  prompt: prompts/summarize_descriptions.txt
umap:
  enabled: false
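
Since the two configurations are long and nearly identical, a key-by-key diff makes the remaining differences easier to spot. The sketch below is illustrative only: `flatten`, `diff_configs`, and `MISSING` are hypothetical names, and the two dictionaries are short excerpts copied from the configs above. The full files could be loaded into dictionaries with PyYAML's `yaml.safe_load`, assuming PyYAML is installed.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a key that exists on one side only


def flatten(cfg: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"llm": {"model": "x"}} -> {"llm.model": "x"}."""
    flat: dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (old_value, new_value)} for every key whose value differs."""
    flat_old, flat_new = flatten(old), flatten(new)
    return {
        key: (flat_old.get(key, MISSING), flat_new.get(key, MISSING))
        for key in sorted(flat_old.keys() | flat_new.keys())
        if flat_old.get(key, MISSING) != flat_new.get(key, MISSING)
    }


# Excerpts from the v0.3.2 and v1.2.0 configs in this issue.
old_cfg = {
    "llm": {"concurrent_requests": 50},
    "parallelization": {"stagger": 0.3},
    "snapshots": {"graphml": False, "raw_entities": False, "top_level_nodes": False},
}
new_cfg = {
    "llm": {"concurrent_requests": 50},
    "parallelization": {"num_threads": 50, "stagger": 0.3},
    "snapshots": {"embeddings": False, "graphml": False, "transient": False},
}

for key, (old_value, new_value) in diff_configs(old_cfg, new_cfg).items():
    print(f"{key}: {old_value!r} -> {new_value!r}")
```

On these excerpts the diff lists `parallelization.num_threads` and the changed `snapshots` keys, while identical settings such as `llm.concurrent_requests` are omitted.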

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 1.2.0
  • Operating System: macOS Sequoia
  • Python Version: 3.11.10
  • Related Issues: -
@AxelSjoberg added the triage label Feb 24, 2025
@natoverse
Collaborator

We've been working through a number of issues since adopting fnllm for API call management. 2.0.0 was just released, which we believe resolves these issues.
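
Before re-running the benchmark, it can help to confirm which GraphRAG version is actually installed in the environment. This is a minimal sketch using the standard-library `importlib.metadata`; `installed_version` and `is_at_least` are hypothetical helper names, and the simple numeric comparison assumes plain `X.Y.Z` version strings.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_at_least(found: str, minimum: tuple[int, ...]) -> bool:
    """Compare the leading numeric components of a version string such as "1.2.0"."""
    parts = []
    for piece in found.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (e.g. "rc1")
        parts.append(int(piece))
    return tuple(parts) >= minimum


found = installed_version("graphrag")
if found is None:
    print("graphrag is not installed in this environment")
else:
    print(f"graphrag {found}; 2.0.0 or newer: {is_at_least(found, (2, 0, 0))}")
```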

@natoverse added the awaiting_response label Feb 26, 2025
@AxelSjoberg
Author

Thanks @natoverse, I'll try it out again after we've migrated to v 2.0.0!
