Add a DuckDB connector plugin following the same patterns as the Snowflake plugin. DuckDB is an embedded analytical database that runs queries locally and synchronously, so the connector executes queries in create() and get() always returns SUCCEEDED.

Features:
- In-memory and file-based database support
- Parameterized SQL queries with typed inputs
- Extension installation and loading (httpfs, json, etc.)
- Query results returned as pandas DataFrames via temp parquet files
- Automatic cleanup of temporary result files

Signed-off-by: André Ahlert <andre@aex.partners>
47fa959 to 100a36c
count_rows = DuckDB(
    name="count_rows",
    query_template="SELECT COUNT(*) AS total FROM 'data.parquet'",
Where is data.parquet coming from?
Is it an input of type flyte.io.DataFrame?
If so, I would love to support:
count_rows = DuckDB(
    name="count_rows",
    query_template="SELECT COUNT(*) AS total FROM '{input}'",
    plugin_config=config,
    input=DataFrame,  # This can be implicit
    output_dataframe_type=DataFrame,
)
Then you can pass a parquet file, a pandas DataFrame, a Spark DataFrame, or anything else to it.
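One way to read the `'{input}'` placeholder proposed above is as a slot that gets filled with the on-disk location of the materialized input before the query runs. The sketch below is only an illustration under that assumption: `render_query` is a hypothetical helper, not part of flyte-sdk or DuckDB, and the path is made up.

```python
def render_query(query_template: str, inputs: dict[str, str]) -> str:
    """Fill each {name} placeholder with the file path of that input.

    Hypothetical helper for illustration; not a flyte-sdk or DuckDB API.
    """
    # Double any single quotes so a path cannot close the SQL string literal.
    safe = {name: path.replace("'", "''") for name, path in inputs.items()}
    return query_template.format(**safe)


template = "SELECT COUNT(*) AS total FROM '{input}'"

# The path here is an assumed example of where an input was materialized.
sql = render_query(template, {"input": "/tmp/data.parquet"})
quoted = render_query(template, {"input": "/tmp/it's.parquet"})
```

With this shape, the task author never hardcodes a file name such as `data.parquet`; whatever DataFrame is passed in only needs to be written somewhere the query can read it.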
has_output: bool = False


class DuckDBConnector(AsyncConnector):
You don't want a connector. A connector is useful for connecting to remote services. For this, it should just be a DataFrame type or a task plugin.
@@ -0,0 +1,150 @@
import asyncio
extensions: Optional[List[str]] = None


class DuckDB(AsyncConnectorExecutorMixin, TaskTemplate):
No need for the connector mixin. Rather, this should implement the execute method. It should be like ContainerTask (not all of it, just the execute function):
flyte-sdk/src/flyte/extras/_container.py
Line 235 in 28396f2
Once you do this, you can delete the connector. You need it this way because you need guaranteed memory and isolation. A connector is OK for shared resources, which usually means API calls.
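A rough sketch of the shape suggested here, where the query runs inside the task's own `execute` method instead of going through a connector. Everything in it is an assumption for illustration: `LocalQueryTask` is a hypothetical class, not the flyte-sdk `TaskTemplate` or `ContainerTask` API, and the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies.

```python
import asyncio
import sqlite3
from dataclasses import dataclass


@dataclass
class LocalQueryTask:
    """Hypothetical task plugin; not the flyte-sdk TaskTemplate API."""

    name: str
    query_template: str

    async def execute(self, **params):
        # The query runs in the task's own process, so memory and isolation
        # come from the task's resources, not from a shared connector.
        return await asyncio.to_thread(self._run, params)

    def _run(self, params):
        # sqlite3 is a stand-in for the embedded engine (DuckDB in the PR).
        con = sqlite3.connect(":memory:")
        try:
            # Named parameters (:a, :b) are bound from the keyword arguments.
            return con.execute(self.query_template, params).fetchall()
        finally:
            con.close()


task = LocalQueryTask(name="add", query_template="SELECT :a + :b AS total")
rows = asyncio.run(task.execute(a=2, b=3))
```

Because nothing is polled or cleaned up remotely, there is no `get()` or `delete()` step in this shape; the result is simply the return value of `execute`.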
kumare3 left a comment
I think you got it a little wrong
Thanks for the review! You're right, I should have looked at the existing flytekit DuckDB plugin as reference instead of modeling it after Snowflake. I'll rework this to use TaskTemplate with execute() and add DataFrame input type support.
Summary
Features
- DuckDBConfig(database_path=...)
Design
Unlike Snowflake/BigQuery, DuckDB runs locally with no remote service, so the connector pattern is adapted:
- create() (wrapped in run_in_executor for async compat)
- get() always returns SUCCEEDED since queries complete in create()
- delete() cleans up temporary parquet result files
Test plan
- (pytest plugins/duckdb/tests/ -v)
- (ruff check plugins/duckdb/)
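
The create/get/delete lifecycle from the Design notes can be sketched roughly as follows. This is not the PR's code: `SketchConnector` and `_run_query` are hypothetical names, `sqlite3` stands in for DuckDB, and a temporary CSV file stands in for the temporary parquet result file, so that the example runs with only the standard library.

```python
import asyncio
import csv
import os
import sqlite3
import tempfile


def _run_query(sql: str) -> str:
    """Blocking helper: run the query, spill rows to a temp file, return its path."""
    con = sqlite3.connect(":memory:")  # stand-in for the embedded engine
    try:
        rows = con.execute(sql).fetchall()
    finally:
        con.close()
    fd, path = tempfile.mkstemp(suffix=".csv")  # stand-in for a parquet file
    with os.fdopen(fd, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return path


class SketchConnector:
    """Hypothetical connector shape; not the flyte-sdk AsyncConnector API."""

    async def create(self, sql: str) -> str:
        # The query is synchronous, so it is pushed to the default executor
        # to avoid blocking the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, _run_query, sql)

    async def get(self, result_path: str) -> str:
        # The work already finished inside create(), so there is nothing to poll.
        return "SUCCEEDED"

    async def delete(self, result_path: str) -> None:
        # Remove the temporary result file if it is still present.
        if os.path.exists(result_path):
            os.remove(result_path)


async def demo():
    connector = SketchConnector()
    path = await connector.create("SELECT 1, 2")
    existed = os.path.exists(path)
    phase = await connector.get(path)
    await connector.delete(path)
    return existed, phase, os.path.exists(path)


existed, phase, still_there = asyncio.run(demo())
```

The sketch also shows why the review pushes back on this pattern: `get()` carries no information, and the query's memory use lands in whatever process hosts the connector.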