-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Add graphrag-storage. #2127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v3/main
Are you sure you want to change the base?
Add graphrag-storage. #2127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extracts storage functionality from the main graphrag package into a new standalone graphrag-storage package to improve modularity and reduce coupling. The key changes involve renaming classes (e.g., PipelineStorage → Storage, FilePipelineStorage → FileStorage), removing the StorageType enum in favor of class name-based type resolution, and updating all imports and references throughout the codebase.
- Introduces
graphrag-storagepackage withStoragebase class and concrete implementations (FileStorage,MemoryStorage,AzureBlobStorage,AzureCosmosStorage) - Replaces enum-based storage type configuration with class name strings (e.g.,
"FileStorage"instead ofStorageType.file) - Updates all imports, tests, and documentation to use the new package and naming conventions
Reviewed Changes
Copilot reviewed 47 out of 49 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Adds graphrag-storage package and updates dependency graph |
| tests/unit/load_config/fixtures/config.yaml | New test fixture file for configuration loading tests |
| tests/unit/indexing/input/test_*.py | Updates imports to use StorageConfig from graphrag_storage |
| tests/unit/indexing/cache/test_file_pipeline_cache.py | Updates FileStorage import and usage |
| tests/unit/config/utils.py | Updates StorageConfig import |
| tests/smoke/test_fixtures.py | Renames BlobPipelineStorage to AzureBlobStorage |
| tests/integration/storage/test_*.py | Updates all storage class names and import paths |
| pyproject.toml | Adds graphrag-storage workspace member and version update task |
| packages/graphrag/pyproject.toml | Adds graphrag-storage dependency |
| packages/graphrag/graphrag/utils/*.py | Updates imports to use Storage from graphrag_storage |
| packages/graphrag/graphrag/storage/* | Removes old factory, updates init to re-export from graphrag_storage |
| packages/graphrag/graphrag/index/**/*.py | Updates Storage type references throughout indexing workflows |
| packages/graphrag/graphrag/config/models/*.py | Removes StorageConfig, imports from graphrag_storage instead |
| packages/graphrag/graphrag/config/defaults.py | Changes StorageType enum to FileStorage class name string |
| packages/graphrag/graphrag/config/init_content.py | Updates default storage type references in template |
| packages/graphrag/graphrag/cache/*.py | Updates storage imports to use new graphrag_storage classes |
| packages/graphrag-storage/** | New package with Storage interface, config, factory, and implementations |
| docs/config/yaml.md | Updates documentation with new storage type naming convention |
Comments suppressed due to low confidence (6)
packages/graphrag-storage/graphrag_storage/azure_cosmos_storage.py:19
azure_cosmos_storage.pyimportsProgressfromgraphrag.logger.progress, which creates a dependency from thegraphrag-storagepackage to thegraphragpackage. This creates a circular dependency sincegraphragdepends ongraphrag-storage. Consider either: 1) movingProgresstographrag-common, or 2) defining a similar class ingraphrag-storage, or 3) removing the Progress usage from this storage implementation.
packages/graphrag-storage/graphrag_storage/storage.py:18- The abstract
__init__method in an ABC is unusual and problematic. Abstract classes cannot enforce__init__signatures since constructors are called differently. Consider removing the@abstractmethoddecorator from__init__- subclasses will still need to implement it, but it won't be enforced as an abstract method. The current pattern may cause issues with instantiation.
packages/graphrag-storage/graphrag_storage/memory_storage.py:22 MemoryStorageinherits fromFileStoragebut doesn't use any of its file-based functionality. It only overrides methods to use an in-memory dictionary. This creates an unnecessary dependency and thebase_dirparameter required byFileStorage.__init__will be ignored but still validated. Consider havingMemoryStoragedirectly inherit fromStorageinstead.
packages/graphrag-storage/graphrag_storage/azure_blob_storage.py:23- This class does not call Storage.init during initialization. (AzureBlobStorage.init may be missing a call to a base class init)
packages/graphrag-storage/graphrag_storage/azure_cosmos_storage.py:29 - This class does not call Storage.init during initialization. (AzureCosmosStorage.init may be missing a call to a base class init)
packages/graphrag-storage/graphrag_storage/file_storage.py:27 - This class does not call Storage.init during initialization. (FileStorage.init may be missing a call to a base class init)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| - `storage` **StorageConfig** | ||
| - `type` **file|blob|cosmosdb** - The storage type to use. Default=`file` | ||
| - `type` **FileStorage|AzureBlobStorage|AzureCosmosStorage** - The storage type to use. Default=`FileStorage` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm iffy on this change. It breaks with current convention, and feels "off" for how you would put values in yaml. I understand the intent from factory registration perspective of course. Maybe worth tagging a couple of other folks to see if there is consensus one way or the other
| storage_account_blob_url: str | None = None, | ||
| container_name: str | None = None, | ||
| encoding: str = "utf-8", | ||
| **kwargs: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably the kwargs in these constructors are just to soak additional unknown args from the factory? My preference is definitely to avoid kwargs.get and instead explicitly name it – would it therefore make sense register lambdas with the factory that pluck exact args and perform this soaking function? It's a bit more code, but could set a reasonable pattern of explicit mapping from the factory init call to instances?
|
|
||
|
|
||
| def validate_blob_container_name(container_name: str): | ||
| def _validate_blob_container_name(container_name: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could all of these checks be collapsed into a single regex?
| raise ValueError(msg) | ||
|
|
||
| self._base_dir = Path(base_dir).resolve() | ||
| self._encoding = encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the get/set include optional encoding, should we take it off the init kwargs? Or maybe the reverse - since it isn't applicable to all storage types, maybe we take it off the get/set and expect any given storage instance to only work with a single encoding via init args?
| class StorageConfig(BaseModel): | ||
| """The default configuration section for storage.""" | ||
|
|
||
| model_config = ConfigDict(extra="allow") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like the right name for this, and maybe should be punted to a broader discussion about how we want to handle open-ended config
Add graphrag-storage package.