Skip to content

Conversation

paultiq
Copy link
Contributor

@paultiq paultiq commented Oct 3, 2025

Per discussion #84, this PR implements Python Table Valued Functions. (aka: User Defined Table Functions)*.

Table Valued Functions* allow Python callables to be registered as DuckDB Table Functions, returning either (Iterator[Sequence]) or an Arrow table.

This is implemented primarily in python_tvf.cpp. Tuple TVFs scan the py::iter from the Callable. Arrow TVFs delegate to ArrowTableFunction.

* While the main code is ready for review, I'll need a pointer on how to properly add and regenerate the bindings. I added them manually in pyconnection.cpp. As I know some work was being done in the stubs and didn't want to conflict.

** Getting the GIL and reference counting part right took a bit of work, especially around destruction. I found PythonObjectContainer late (after trying other approaches), so please let me know if this is the right approach.

Edits

  • Edit 1: Changed schemas to dictionary, rather than a List[Tuple]

Changes in this PR

  • New functions: create_table_function and unregister_table_function
  • New enum: PythonTVFType
  • Add: case_insensitive_map_t<unique_ptr<ExternalDependency>> registered_table_functions; to PyConnection
  • Change: PyConnection now releases the GIL when materializing.. this was needed to prevent deadlock where the TVF needs the GIL to scan.
  • Tests: Added tests for TUPLES, ARROW_TABLE, along with registration & schema mismatches.

Registration

This implementation adds two new functions:

  • create_table_function
  • unregister_table_function
import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

conn = duckdb.connect()

conn.create_table_function(
    name="gen_function",
    callable=lambda count: [(f"name_{i}", i) for i in range(count)] ,
    parameters=["count"],
    schema={"name": VARCHAR, "value": BIGINT},
    type="tuples", # or "arrow_table" or PythonTVFType.TUPLES or PythonTVFType.ARROW_TABLE
)

table = conn.execute("select * from gen_function(count:=10)").fetch_arrow_table()

Parameters

Parameters are declared as a list of parameter names, such as parameters = ["col1", "col2"].

  • Positional parameters do not need to be declared: parameters=None still allows positional parameters to be called.
  • Parameters called by name need to be declared: myfunction(count:=10)

Schema

The schema of a Table Function must be declared at bind time. This is done by capturing the return type at registration time. *

Schemas are defined as List[Tuple[str,str]], where each tuple is a pair of Column Name and Data Type.

* It's possible, but not implemented, to infer the schema for Arrow Tables rather than requiring it. Or, perhaps, to be more lenient.

Test Failures:

There's two test failures:

  • tests/fast/api/test_duckdb_connection.py::TestDuckDBConnection::test_wrap_coverage : If/when the main code is reviewed, just need help figuring out how to properly do this. I didn't want to interfere with the stubs work being done.
  • tests/fast/arrow/test_polars.py::TestPolars::test_polars_lazy_pushdown_numeric: See Polars Lazy Filter not working with latest Polars (1.34.0) #98

Tuples Example with positional args

import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

def simple_generator(count: int = 10):
    for i in range(count):
        yield (f"name_{i}", i)

with duckdb.connect() as conn:
        conn.create_table_function(
            name="gen_function",
            callable=simple_generator,
            parameters=["count"],
            schema={"name": VARCHAR, "value": BIGINT},
            type="tuples",
        )
	
        result = conn.sql(
            "SELECT * FROM gen_function(?)",
            params=(5,),
        ).fetchall()

Example with Parameters=None

import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

conn = duckdb.connect()

conn.create_table_function(
    name="gen_function",
    callable=lambda x,count=10: [(f"name_{x}", i) for i in range(count)] ,
    parameters=None,
    schema={"name": VARCHAR, "value": BIGINT},
    type="tuples",
)

table = conn.execute("select * from gen_function('foo')").fetch_arrow_table()

Arrow Example

import pyarrow as pa
import duckdb
from duckdb.sqltypes import BIGINT, VARCHAR

conn = duckdb.connect()

def my_function(count=10):
    return pa.table({
        "id": list(range(count)),
        "value": [i * 2 for i in range(count)],
        "name": [f"row_{i}" for i in range(count)],
    })

conn.create_table_function(
    name="somefunction",
    callable=my_function,
    parameters=["count"],
    schema={"id": BIGINT, "value": BIGINT, "name": VARCHAR},
    type="arrow_table",
)

table = conn.execute("select * from somefunction(10)").fetch_arrow_table()

Discussion

Feature Name

What to call this feature?

Some databases refer to the function as Table Valued Functions, and others as User Defined Table Function. Either name works for me... I started with TVF but I think I'm not leaning towards UDTFs.

Deleting / Unregistering

There doesn't appear to be a way to truly "delete" or "unregister" table functions from a connection... so does it make sense to even have an unregister? I chose to require an explicit unregister prior to registering a different function with same name, but this is somewhat arbitrary.

Materializing vs Streaming the Iterator

I didn't notice any significant performance difference between streaming the iterator vs fully materializing it. The benefits outweighed the added complexity.

There is perhaps some room for checking the callable result to see if it has a Length to set the Cardinality initially which may help the optimizer.

Other Callable Types

This PR supports TUPLES and ARROW_TABLES. The TUPLE implementation supports streaming, but the ARROW_TABLES are fully materialized.

ARROW_BATCHED_READER would make sense as a next type. I am assuming this is not required in an initial implementation.

(future) Passing a cursor

For "table in" situations, I'd think we could add a "pass_connection" option to create_table_function. If enabled, a cursor to the current connection would be passed to the TVF as a kwarg.

* I left it out of this initial PR to keep it simple.

Something like:

def myfunction(conn, values=10):
        return conn.execute("select * from range(10)").fetchall()

    conn.create_table_function(
        name="gen_function",
        callable=myfunction,
        parameters=["values"],
        pass_connection=True,  # when true, a cursor is passed to the 'conn' kwarg. 
        schema=[["name", "int"]],
        type=PythonTVFType.TUPLES,
    )
TODO
  • Add support for duckdb.typing types in the schema
  • Consider implementing parameter inference: look at the function signature to populate parameters.

python_import_cache.cpp
python_replacement_scan.cpp
python_udf.cpp
python_tvf.cpp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python_table_udf.cpp has my preference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For types like PythonTVFType, would you prefer PythonTUDF, or PythonUDTF, or PythonTableUDF?

For what it's worth, SnowFlake and DataBrix have gone with UDTF: https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-tabular-functions and https://docs.databricks.com/aws/en/udf/python-udtf

I don't like TUDF as an abbreviation, but UDTF or TableUDF both sound good to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is table udf
So a search for "udf" finds both versions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: renamed files to "table_udf" and in code to TableUDF

throw InvalidInputException("Invalid schema format: expected [name, type] pairs, got string '%s'",
py::str(item).cast<std::string>());
}
if (!py::hasattr(item, "__getitem__") || py::len(item) < 2) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ignores cases where >2 items are given

But I don't get why we are taking schemas as [[name, type], [name, type]] instead of {name: type, name: type} ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At initial design, I wasn't sure if I'd need any other attributes other than Name and Type, so left it as a List of Tuples (or List of Lists).

But, looking back, a mapping makes more sense.

Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • Modified to schema={"x": sqltypes.BIGINT, "y": sqltypes.BIGINT, "name": sqltypes.VARCHAR}.
  • Updated PR examples and test cases to match.

switch (type) {
case PythonTVFType::TUPLES:
tf =
duckdb::TableFunction(name, {}, +PyTVFTuplesScanFunction, +PyTVFTuplesBindFunction, +PyTVFTuplesInitGlobal);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this syntax? +PyTVFTuplesScanFunction ?
I've seen &PyTVFTuplesScanFunction, which makes sense because you're taking the address of the function, but even that is redundant, you can use PyTVFTuplesScanFunction directly afaik

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a holdover from a fight with the linter, it's unnecessary / will remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


connection_module.def("create_table_function", &DuckDBPyConnection::RegisterTableFunction,
"Register a table valued function via Callable", py::arg("name"), py::arg("callable"),
py::arg("parameters") = py::none(), py::arg("schema") = py::none(),
Copy link
Collaborator

@Tishj Tishj Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameters should be a keyword-only argument
making the python signature equivalent to: create_table_function(name, callable, schema, type, *, parameters)
We can even infer the parameters of the function, we have similar logic for scalar udfs

I also feel like type can be a keyword-only argument, defaulting to TUPLES if omitted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • kwargs for type & parameters
  • default for type is TUPLES
  • parameters is optional

I did not (yet?) do the parameter inference yet. I want to think about that a little bit. Added to TODO in PR Comment

	connection_module.def("create_table_function", &DuckDBPyConnection::RegisterTableFunction,
	                      "Register a table user defined function via Callable", py::arg("name"), py::arg("callable"),
	                      py::arg("schema"), py::kw_only(), py::arg("type") = PythonTableUDFType::TUPLES,
	                      py::arg("parameters") = py::none());

throw InvalidInputException("Invalid schema format: each schema item must be a [name, type] pair");
}
names.emplace_back(py::str(item[py::int_(0)]));
types.emplace_back(TransformStringToLogicalType(py::str(item[py::int_(1)])));
Copy link
Collaborator

@Tishj Tishj Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can accept a DuckDBPyType instead, we can extract the logical type from that
see this for example:

		case PythonObjectType::Value: {
			// Extract the internal object and the type from the Value instance
			auto object = ele.attr("object");
			auto type = ele.attr("type");
			shared_ptr<DuckDBPyType> internal_type;
			if (!py::try_cast<shared_ptr<DuckDBPyType>>(type, internal_type)) {
				string actual_type = py::str(type.get_type());
				throw InvalidInputException("The 'type' of a Value should be of type DuckDBPyType, not '%s'",
				                            actual_type);
			}
			return TransformPythonValue(object, internal_type->Type());
		}

Copy link
Contributor Author

@paultiq paultiq Oct 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • schema is a mapping of str=>DuckDBPyType

}
};

struct PyTVFTuplesGlobalState : public GlobalTableFunctionState {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that this only supports single threaded execution, because the virtual function MaxThreads returns 1 by default.
You are using the global state directly so I think you're aware of this, but just double checking.
I think it's correct though, because enabling multi-threaded execution for a Python table UDF sounds like it wouldn't help much. As all time is spent in Python, so the GIL would make it essentially single threaded anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the virtual function MaxThreads returns 1 by default.

Do you think we should make this explicit in table_udf, or is it fine to rely on the default?

enabling multi-threaded execution for a Python table UDF sounds like it wouldn't help much

Agree. Although, with free-threading, the GIL constraint goes away... but I think such complex cases (that need multi-threaded consumption of a Python callable) are best handled in Python-land.

named_parameter_map_t kwargs;
vector<LogicalType> return_types;
vector<string> return_names;
PythonObjectContainer python_objects; // Holds the callable
Copy link
Collaborator

@Tishj Tishj Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like a test where we create a table function, create a view that uses the table function, then unregister the table function, and make sure to exit the scope where the python table function was created.
Then execute the view.

Just to make sure that this is enough to keep the python callable alive

(This message is misplaced, it's related to the callable of the PyTVFInfo)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • tests/fast/table_udf/test_tuples.py::test_callable_lifetime_in_view

static unique_ptr<PyTVFBindData> PyTVFBindInternal(ClientContext &context, TableFunctionBindInput &in,
vector<LogicalType> &return_types, vector<string> &return_names) {
// Disable progress bar to prevent GIL deadlock with Jupyter
// TODO: Decide if this is still needed - was a problem when fully materializing, but switched to streaming
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolved TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

  • disabling progress bar was not needed.

This was a holdover from my first implementation, which materialized everything at init time and never released the GIL.

@paultiq paultiq changed the title Python Table Valued Functions / User Defined Table Functions Python Table UDFs Oct 10, 2025
@paultiq paultiq marked this pull request as draft October 10, 2025 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants