Skip to content

Conversation

@dworthen
Copy link
Contributor

Add graphrag-storage package.

@dworthen dworthen requested a review from a team as a code owner November 11, 2025 17:29
@dworthen dworthen requested a review from Copilot November 11, 2025 18:55
Copilot finished reviewing on behalf of dworthen November 11, 2025 19:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR extracts storage functionality from the main graphrag package into a new standalone graphrag-storage package to improve modularity and reduce coupling. The key changes involve renaming classes (e.g., PipelineStorageStorage, FilePipelineStorageFileStorage), removing the StorageType enum in favor of class name-based type resolution, and updating all imports and references throughout the codebase.

  • Introduces graphrag-storage package with Storage base class and concrete implementations (FileStorage, MemoryStorage, AzureBlobStorage, AzureCosmosStorage)
  • Replaces enum-based storage type configuration with class name strings (e.g., "FileStorage" instead of StorageType.file)
  • Updates all imports, tests, and documentation to use the new package and naming conventions

Reviewed Changes

Copilot reviewed 47 out of 49 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
uv.lock Adds graphrag-storage package and updates dependency graph
tests/unit/load_config/fixtures/config.yaml New test fixture file for configuration loading tests
tests/unit/indexing/input/test_*.py Updates imports to use StorageConfig from graphrag_storage
tests/unit/indexing/cache/test_file_pipeline_cache.py Updates FileStorage import and usage
tests/unit/config/utils.py Updates StorageConfig import
tests/smoke/test_fixtures.py Renames BlobPipelineStorage to AzureBlobStorage
tests/integration/storage/test_*.py Updates all storage class names and import paths
pyproject.toml Adds graphrag-storage workspace member and version update task
packages/graphrag/pyproject.toml Adds graphrag-storage dependency
packages/graphrag/graphrag/utils/*.py Updates imports to use Storage from graphrag_storage
packages/graphrag/graphrag/storage/* Removes old factory, updates init to re-export from graphrag_storage
packages/graphrag/graphrag/index/**/*.py Updates Storage type references throughout indexing workflows
packages/graphrag/graphrag/config/models/*.py Removes StorageConfig, imports from graphrag_storage instead
packages/graphrag/graphrag/config/defaults.py Changes StorageType enum to FileStorage class name string
packages/graphrag/graphrag/config/init_content.py Updates default storage type references in template
packages/graphrag/graphrag/cache/*.py Updates storage imports to use new graphrag_storage classes
packages/graphrag-storage/** New package with Storage interface, config, factory, and implementations
docs/config/yaml.md Updates documentation with new storage type naming convention
Comments suppressed due to low confidence (6)

packages/graphrag-storage/graphrag_storage/azure_cosmos_storage.py:19

  • azure_cosmos_storage.py imports Progress from graphrag.logger.progress, which creates a dependency from the graphrag-storage package to the graphrag package. This creates a circular dependency since graphrag depends on graphrag-storage. Consider either: 1) moving Progress to graphrag-common, or 2) defining a similar class in graphrag-storage, or 3) removing the Progress usage from this storage implementation.
    packages/graphrag-storage/graphrag_storage/storage.py:18
  • The abstract __init__ method in an ABC is unusual and problematic. Abstract classes cannot enforce __init__ signatures since constructors are called differently. Consider removing the @abstractmethod decorator from __init__ - subclasses will still need to implement it, but it won't be enforced as an abstract method. The current pattern may cause issues with instantiation.
    packages/graphrag-storage/graphrag_storage/memory_storage.py:22
  • MemoryStorage inherits from FileStorage but doesn't use any of its file-based functionality. It only overrides methods to use an in-memory dictionary. This creates an unnecessary dependency and the base_dir parameter required by FileStorage.__init__ will be ignored but still validated. Consider having MemoryStorage directly inherit from Storage instead.
    packages/graphrag-storage/graphrag_storage/azure_blob_storage.py:23
  • This class does not call Storage.init during initialization. (AzureBlobStorage.init may be missing a call to a base class init)
    packages/graphrag-storage/graphrag_storage/azure_cosmos_storage.py:29
  • This class does not call Storage.init during initialization. (AzureCosmosStorage.init may be missing a call to a base class init)
    packages/graphrag-storage/graphrag_storage/file_storage.py:27
  • This class does not call Storage.init during initialization. (FileStorage.init may be missing a call to a base class init)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


- `storage` **StorageConfig**
- `type` **file|blob|cosmosdb** - The storage type to use. Default=`file`
- `type` **FileStorage|AzureBlobStorage|AzureCosmosStorage** - The storage type to use. Default=`FileStorage`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm iffy on this change. It breaks with current convention, and feels "off" for how you would put values in yaml. I understand the intent from factory registration perspective of course. Maybe worth tagging a couple of other folks to see if there is consensus one way or the other

storage_account_blob_url: str | None = None,
container_name: str | None = None,
encoding: str = "utf-8",
**kwargs: Any,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably the kwargs in these constructors are just to soak additional unknown args from the factory? My preference is definitely to avoid kwargs.get and instead explicitly name it – would it therefore make sense register lambdas with the factory that pluck exact args and perform this soaking function? It's a bit more code, but could set a reasonable pattern of explicit mapping from the factory init call to instances?



def validate_blob_container_name(container_name: str):
def _validate_blob_container_name(container_name: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could all of these checks be collapsed into a single regex?

raise ValueError(msg)

self._base_dir = Path(base_dir).resolve()
self._encoding = encoding
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the get/set include optional encoding, should we take it off the init kwargs? Or maybe the reverse - since it isn't applicable to all storage types, maybe we take it off the get/set and expect any given storage instance to only work with a single encoding via init args?

class StorageConfig(BaseModel):
"""The default configuration section for storage."""

model_config = ConfigDict(extra="allow")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like the right name for this, and maybe should be punted to a broader discussion about how we want to handle open-ended config

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants