Conversation

raisadz
Contributor

@raisadz raisadz commented Aug 14, 2025

Closes #2930

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz raisadz added the pyspark Issue is related to pyspark backend label Aug 14, 2025
@raisadz raisadz marked this pull request as ready for review August 14, 2025 13:37
Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.

Maybe what we could do is:

  1. Check that separator is not None at the top of the function and raise otherwise.
  2. Then, at each backend level, check if the separator was passed with its specific backend argument name and, if so, raise a more informative error specifying to use separator instead of sep|parse_options.
  3. Disclaimer: I expect no one to be using this feature so far. Yet the problem is that this is actually a regression: parse_options (together with read_options) is the only way for pyarrow to specify arguments in read_csv. Therefore we would be enabling separator to be passed in a standard way while effectively disallowing any other argument.
  4. The long way to do this is something along the following lines, I think:
elif impl is Implementation.PYARROW:
    if "parse_options" in kwargs:
        passed_options = kwargs.pop("parse_options")
        fields = (
            "quote_char",
            "double_quote",
            "escape_char",
            "newlines_in_values",
            "ignore_empty_lines",
            "invalid_row_handler",
        )
        parse_options = csv.ParseOptions(
            delimiter=separator,
            **{field: getattr(passed_options, field) for field in fields},
        )
    else:
        parse_options = csv.ParseOptions(delimiter=separator)

    native_frame = csv.read_csv(source, parse_options=parse_options, **kwargs)
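Steps 1 and 2 above could be sketched as a small shared helper (the exact signature here is an assumption for illustration, not the PR's actual implementation):

```python
def validate_separator(separator, backend_arg, **kwargs):
    """Sketch of steps 1-2. `backend_arg` is the backend-specific alias,
    e.g. "sep" (pandas), "delimiter"/"delim" (duckdb, pyspark)."""
    # Step 1: raise if `separator` is None, at the top of the function.
    if separator is None:
        msg = "`separator` must not be None."
        raise TypeError(msg)
    # Step 2: raise a more informative error if the backend-specific
    # argument name was also passed alongside `separator`.
    if backend_arg in kwargs:
        msg = f"Use `separator` instead of `{backend_arg}`."
        raise TypeError(msg)


validate_separator(",", "sep")  # ok: no backend alias passed
```

Each backend entry point would then call this once with its own alias before forwarding kwargs.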

Comment on lines 607 to 614
+if separator is not None and "parse_options" in kwargs:
+    msg = "Can't pass both `separator` and `parse_options`."
+    raise TypeError(msg)
 from pyarrow import csv  # ignore-banned-import

-native_frame = csv.read_csv(source, **kwargs)
+native_frame = csv.read_csv(
+    source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs
+)
Member


I think this is a bit odd:

  1. separator is not typed to be None.

  2. Even if it were, the following would not error in line 607:

    nw.read_csv(..., separator=None, parse_options=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)

    However, then in line 613, we would call

    csv.read_csv(
        source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ...
    )

    which would end up raising an exception at this point.

  3. Should we handle the same for other backends? i.e. a pandas-like check that sep is not passed, and likewise below for the lazy backends.
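The failure mode in point 2 is the ordinary duplicate-keyword TypeError, which a stand-in function (read_csv_stub, invented here in place of pyarrow's read_csv) reproduces:

```python
def read_csv_stub(source, **kwargs):
    # stand-in for pyarrow.csv.read_csv; just echoes the kwargs
    return kwargs


user_kwargs = {"parse_options": "user-supplied"}
try:
    # passing `parse_options` explicitly *and* via **kwargs duplicates it
    read_csv_stub("file.csv", parse_options="ours", **user_kwargs)
except TypeError as exc:
    print(exc)  # ... got multiple values for keyword argument 'parse_options'
```

So the user gets an error either way, just a less helpful one than an explicit check would give.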

@raisadz
Contributor Author

raisadz commented Aug 18, 2025

@FBruzzesi thank you for the review! I agree about the separator validation. I added some support functions that should check the passed kwargs now. For PyArrow, maybe we shouldn't hardcode all the fields:

fields = (
    "quote_char",
    "double_quote",
    "escape_char",
    "newlines_in_values",
    "ignore_empty_lines",
    "invalid_row_handler",
)

as they might go out of date if PyArrow changes them, and we just need to check delimiter? Please let me know what you think.

Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I am taking a closer look.

I am still not the biggest fan of the hassle for pyarrow users (who, again, I don't expect to be many).

One part of me leans towards suggesting to completely remove **kwargs and allow only explicit parameters we can have full control over. I am honestly not sure.

Comment on lines 718 to 719
validate_separator(separator, "delimiter", **kwargs)
validate_separator(separator, "delim", **kwargs)
Member


Oh wow! TIL: duckdb and pyspark have two ways to pass the separator

-return kwargs
+from pyarrow import csv  # ignore-banned-import
+
+return {"parse_options": csv.ParseOptions(delimiter=separator)}
Member

@FBruzzesi FBruzzesi Aug 18, 2025


Nevermind I completely misread this

Fake panic review

The issue I have with this is that if any other argument was provided in parse_options then it will be silently ignored.

Say someone is calling the following:

nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False))

Then at the end of validate_separator_pyarrow, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True) (i.e. the default value), silently.

I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically look up its __dir__ or use inspect.getmembers and exclude dunder methods, but then we would end up with, for example, validate and equals, which are not attributes to set at instantiation.

Unsuccessful tentatives I tried:

inspect.signature

from inspect import signature
from pyarrow import csv

print(signature(csv.ParseOptions.__init__))

(self, /, *args, **kwargs)

dataclasses.fields

From the stubs I got tricked into thinking that it is a dataclass:

from dataclasses import fields
from pyarrow import csv

print(fields(csv.ParseOptions))

TypeError: must be called with a dataclass type or instance
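The `dir`-based lookup mentioned above can be illustrated with a stand-in class (pyarrow is not required; FakeParseOptions is invented for this example). Note that callables such as validate and equals still have to be filtered out:

```python
class FakeParseOptions:
    # stand-in mimicking the shape of pyarrow.csv.ParseOptions
    def __init__(self, delimiter=",", double_quote=True):
        self.delimiter = delimiter
        self.double_quote = double_quote

    def validate(self):
        pass

    def equals(self, other):
        pass


def settable_fields(obj):
    # keep only non-dunder, non-callable attributes
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and not callable(getattr(obj, name))
    )


print(settable_fields(FakeParseOptions()))  # ['delimiter', 'double_quote']
```

The callable check drops the method-like members, but it remains a heuristic: any future non-callable property that isn't settable at instantiation would still slip through.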

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

Contributor Author


Thanks @FBruzzesi!

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

A user won't need to pass both separator and delimiter: if "parse_options" is not in kwargs, then we return {"parse_options": csv.ParseOptions(delimiter=separator)}.

Member


right, so someone would only need to pass both separator and delimiter if they were specifying another parse option (like double_quote)

tbh I think this is fine

Member

@dangotbanned dangotbanned Aug 25, 2025


Since our default is:

separator: str = ","

and matches pyarrow's default:

delimiter: str = ","

Alternative

ParseOptions.delimiter has higher precedence unless separator overrides the default.

In either case - every other argument is respected

Show definitions

from __future__ import annotations

from pyarrow import csv
from typing import Any


def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
    DEFAULT = ","  # noqa: N806
    if separator != DEFAULT:
        if opts := kwargs.pop("parse_options", None):
            opts.delimiter = separator
        else:
            opts = csv.ParseOptions(delimiter=separator)
        kwargs["parse_options"] = opts
    return kwargs


def display_merge(result: dict[str, Any]) -> None:
    if result and (options := result.pop("parse_options", None)):
        print(f"{options.delimiter=}\n{options.double_quote=}")
        if result:
            print(f"Remaining: {result!r}")
    elif result:
        print(f"Unrelated: {result!r}")
    else:
        print(f"Empty: {result!r}")

Would this behavior not be more ideal?

# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)
>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False
>>> display_merge(merge_options())
Empty: {}
>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True
>>> display_merge(
    merge_options(
        "?",
        parse_options=csv.ParseOptions(double_quote=False),
        read_options=csv.ReadOptions(),
    )
)
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}

Although it is cython, the important part is they're all properties with setters
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543

Member


ParseOptions.delimiter has higher precedence unless separator overrides the default.

hmmm yes, that does sound better actually, thanks!

Member


Thanks Marco

with (#2989 (comment)) in mind ...

Not sure if this is on duckdb or sqlframe, but sep has higher precedence than delim

from pathlib import Path
from typing import Any, Mapping

import polars as pl
from sqlframe.duckdb import DuckDBSession


data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"

pl.DataFrame(data).write_csv(fp, separator="\t")

session = DuckDBSession.builder.getOrCreate()
>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]

Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.

So instead of

>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.

We either:

  • pick one and replace it - leaving everything else unchanged
  • say we'll pick ... then ... and then ...

If any backend raises on non-matching arguments - I say let them - as it saves us the hassle 😅
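The "pick one and replace it" rule could be sketched like this (the alias names and the canonical target name are assumptions for illustration):

```python
ALIASES = ("sep", "delim", "delimiter")


def normalize_separator(separator=",", **kwargs):
    # drop every backend alias silently and let `separator` win,
    # forwarding it under a single canonical name
    for alias in ALIASES:
        kwargs.pop(alias, None)
    kwargs["sep"] = separator
    return kwargs


print(normalize_separator(",", sep="?", delim="\t", delimiter="!"))
# {'sep': ','}
```

Unrelated kwargs pass through untouched, so the backend is free to raise (or not) on whatever else it receives.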

Labels
pyspark Issue is related to pyspark backend

Successfully merging this pull request may close these issues.

enh: add separator argument to read_csv / scan_csv
4 participants