Describe the bug
Running GraphRAG in Google Colab, indexing produces an empty artifacts folder with no workflow tasks executed. I tried several different settings.yaml files, including the stock one with a .env file. Once, starting from scratch in a new folder, it partially worked (it errored out before all workflow tasks were done), but since then the problem persists. I can see no pattern for the cause; please see the attached indexing-engine.log.
Steps to reproduce
- Run in Google Colab.
- pip install graphrag. Note the dependency-resolver error it prints:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.0 which is incompatible.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.
- Run indexing with several different settings.yaml files, using combinations of .env variables and directly entered config values, including the stock settings.yaml (a sketch of the commands follows this list).
- Result: an empty artifacts folder and no workflow tasks run.
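For concreteness, the commands I ran look roughly like the following, per the GraphRAG getting-started docs (the ./ragtest root name is illustrative, not my actual Drive path):

pip install graphrag
python -m graphrag.index --init --root ./ragtest   # generates settings.yaml, .env, prompts/
# place .txt sources under ./ragtest/input, then run the indexer:
python -m graphrag.index --root ./ragtest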
Expected Behavior
The workflow list should be fully populated and all tasks should run correctly. At best I have had a few partial runs; now nothing runs at all.
GraphRAG Config Used
encoding_model: cl100k_base
# encoding_model: ${GRAPHRAG_ENCODING_MODEL}
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  # model: gpt-4-turbo-preview
  model: ${GRAPHRAG_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 5 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    # model: text-embedding-3-small
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    # api_base: ${GRAPHRAG_API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
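For reference, the ${...} placeholders above are resolved from a .env file in the root directory. Mine looks roughly like the sketch below; the model names shown are only examples taken from the commented defaults in the config, not necessarily the exact values I used:

GRAPHRAG_API_KEY=<openai_api_key>
GRAPHRAG_MODEL=gpt-4-turbo-preview
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small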
Logs and screenshots
indexing-engine.log
14:57:04,749 graphrag.index.run INFO Running pipeline with config settings.yaml
14:57:04,751 graphrag.config.read_dotenv INFO Loading pipeline .env file
14:57:05,473 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output/20240710-145704/artifacts
14:57:05,482 graphrag.index.input.load_input INFO loading input from root_dir=input
14:57:05,482 graphrag.index.input.load_input INFO using file storage for input
14:57:05,486 graphrag.index.storage.file_pipeline_storage INFO search /content/drive/MyDrive/.../2024-07-10/input for files matching .*\.txt$
14:57:05,488 graphrag.index.input.text INFO found text files from input, found [('wildfly_jira_compact_3.txt', {}), ('wildfly_jira_compact_2.txt', {}), ('wildfly_jira_compact_1.txt', {})]
14:57:05,504 graphrag.index.workflows.load INFO Workflow Run Order: []
14:57:05,505 graphrag.index.run INFO Final # of rows loaded: 3
Additional Information
- GraphRAG Version: 0.1.1
- Operating System: Ubuntu 22.04
- Python Version: 3.10.12
- Related Issues: