Conversation

raisadz
Contributor

@raisadz raisadz commented Aug 14, 2025

Closes #2930

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz raisadz added the pyspark Issue is related to pyspark backend label Aug 14, 2025
@raisadz raisadz marked this pull request as ready for review August 14, 2025 13:37
Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.

Maybe what we could do is:

  1. Check that separator is not None at the top of the function and raise otherwise.
  2. Then, at each backend level, check if the separator was passed with its specific backend argument name and, if so, raise a more informative error specifying to use separator instead of sep|parse_options.
  3. Disclaimer: I expect no one to be using this feature so far. Yet the problem is that this is actually a regression: parse_options (together with read_options) is the only way for pyarrow to specify arguments in read_csv. Therefore we would be enabling separator to be passed in a standard way while effectively disallowing any other argument.
  4. The long way to do this is something along the following lines, I think:
elif impl is Implementation.PYARROW:
    if "parse_options" in kwargs:
        passed_options = kwargs.pop("parse_options")
        fields = (
            "quote_char",
            "double_quote",
            "escape_char",
            "newlines_in_values",
            "ignore_empty_lines",
            "invalid_row_handler",
        )
        parse_options = csv.ParseOptions(
            delimiter=separator,
            **{field: getattr(passed_options, field) for field in fields},
        )
    else:
        parse_options = csv.ParseOptions(delimiter=separator)

    native_frame = csv.read_csv(source, parse_options=parse_options, **kwargs)
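Steps 1 and 2 above could be sketched as a small shared helper (the exact signature here is an assumption for illustration, not the PR's actual implementation):

```python
def validate_separator(separator, backend_arg, **kwargs):
    """Sketch of steps 1-2. `backend_arg` is the backend-specific alias,
    e.g. "sep" (pandas), "delimiter"/"delim" (duckdb, pyspark)."""
    # Step 1: raise if `separator` is None, at the top of the function.
    if separator is None:
        msg = "`separator` must not be None."
        raise TypeError(msg)
    # Step 2: raise a more informative error if the backend-specific
    # argument name was also passed alongside `separator`.
    if backend_arg in kwargs:
        msg = f"Use `separator` instead of `{backend_arg}`."
        raise TypeError(msg)


validate_separator(",", "sep")  # ok: no backend alias passed
```

Each backend entry point would then call this once with its own alias before forwarding kwargs.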

Comment on lines 607 to 614
+if separator is not None and "parse_options" in kwargs:
+    msg = "Can't pass both `separator` and `parse_options`."
+    raise TypeError(msg)
 from pyarrow import csv  # ignore-banned-import

-native_frame = csv.read_csv(source, **kwargs)
+native_frame = csv.read_csv(
+    source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs
+)
Member


I think this is a bit odd:

  1. separator is not typed to be None.

  2. Even if it were, the following would not error in line 607:

    nw.read_csv(..., separator=None, parse_options=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)

    However, then in line 613, we would call

    csv.read_csv(
        source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ...
    )

    which would end up raising an exception at this point.

  3. Should we handle the same for other backends? i.e. a pandas-like check that sep is not passed, and likewise below for the lazy backends.
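The failure mode in point 2 is the ordinary duplicate-keyword TypeError, which a stand-in function (read_csv_stub, invented here in place of pyarrow's read_csv) reproduces:

```python
def read_csv_stub(source, **kwargs):
    # stand-in for pyarrow.csv.read_csv; just echoes the kwargs
    return kwargs


user_kwargs = {"parse_options": "user-supplied"}
try:
    # passing `parse_options` explicitly *and* via **kwargs duplicates it
    read_csv_stub("file.csv", parse_options="ours", **user_kwargs)
except TypeError as exc:
    print(exc)  # ... got multiple values for keyword argument 'parse_options'
```

So the user gets an error either way, just a less helpful one than an explicit check would give.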

@raisadz
Contributor Author

raisadz commented Aug 18, 2025

@FBruzzesi thank you for the review! I agree about the separator validation. I added some support functions that should check the passed kwargs now. For PyArrow, maybe we shouldn't hardcode all the fields:

fields = (
    "quote_char",
    "double_quote",
    "escape_char",
    "newlines_in_values",
    "ignore_empty_lines",
    "invalid_row_handler",
)

as they might go out of date if PyArrow changes them, and we just need to check delimiter? Please let me know what you think.

Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I am taking a closer look.

I am still not the biggest fan of the hassle for pyarrow users (who, again, I don't expect to be many).

One part of me leans towards suggesting to completely remove **kwargs and allow only explicit parameters we can have full control over. I am honestly not sure.

Comment on lines 718 to 719
validate_separator(separator, "delimiter", **kwargs)
validate_separator(separator, "delim", **kwargs)
Member


Oh wow! TIL: duckdb and pyspark have two ways to pass the separator

-return kwargs
+from pyarrow import csv  # ignore-banned-import
+
+return {"parse_options": csv.ParseOptions(delimiter=separator)}
Member

@FBruzzesi FBruzzesi Aug 18, 2025


Nevermind I completely misread this

Fake panic review

The issue I have with this is that if any other argument was provided in parse_options then it will be silently ignored.

Say someone is calling the following:

nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False))

Then at the end of validate_separator_pyarrow, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True) (i.e. the default value), silently.

I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically look up its __dir__ or use inspect.getmembers and exclude dunder methods, but then we would end up with, for example, validate and equals, which are not attributes to set at instantiation.

Unsuccessful tentatives I tried:

inspect.signature

from inspect import signature
from pyarrow import csv

print(signature(csv.ParseOptions.__init__))

(self, /, *args, **kwargs)

dataclasses.fields

From the stubs I got tricked into thinking that it is a dataclass:

from dataclasses import fields
from pyarrow import csv

print(fields(csv.ParseOptions))

TypeError: must be called with a dataclass type or instance
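The `dir`-based lookup mentioned above can be illustrated with a stand-in class (pyarrow is not required; FakeParseOptions is invented for this example). Note that callables such as validate and equals still have to be filtered out:

```python
class FakeParseOptions:
    # stand-in mimicking the shape of pyarrow.csv.ParseOptions
    def __init__(self, delimiter=",", double_quote=True):
        self.delimiter = delimiter
        self.double_quote = double_quote

    def validate(self):
        pass

    def equals(self, other):
        pass


def settable_fields(obj):
    # keep only non-dunder, non-callable attributes
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and not callable(getattr(obj, name))
    )


print(settable_fields(FakeParseOptions()))  # ['delimiter', 'double_quote']
```

The callable check drops the method-like members, but it remains a heuristic: any future non-callable property that isn't settable at instantiation would still slip through.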

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

Contributor Author


Thanks @FBruzzesi!

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

A user won't need to pass both separator and delimiter: if "parse_options" is not in kwargs, then we return {"parse_options": csv.ParseOptions(delimiter=separator)}.

Member


right, so someone would only need to pass both separator and delimiter if they were specifying another parse option (like double_quote)

tbh I think this is fine

Member

@dangotbanned dangotbanned Aug 25, 2025


Since our default is:

separator: str = ","

and matches pyarrow's default:

delimiter: str = ","

Alternative

ParseOptions.delimiter has higher precedence unless separator overrides the default.

In either case - every other argument is respected

Show definitions

from __future__ import annotations

from pyarrow import csv
from typing import Any


def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
    DEFAULT = ","  # noqa: N806
    if separator != DEFAULT:
        if opts := kwargs.pop("parse_options", None):
            opts.delimiter = separator
        else:
            opts = csv.ParseOptions(delimiter=separator)
        kwargs["parse_options"] = opts
    return kwargs


def display_merge(result: dict[str, Any]) -> None:
    if result and (options := result.pop("parse_options", None)):
        print(f"{options.delimiter=}\n{options.double_quote=}")
        if result:
            print(f"Remaining: {result!r}")
    elif result:
        print(f"Unrelated: {result!r}")
    else:
        print(f"Empty: {result!r}")

Would this behavior not be more ideal?

# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)
>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False
>>> display_merge(merge_options())
Empty: {}
>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True
>>> display_merge(
    merge_options(
        "?",
        parse_options=csv.ParseOptions(double_quote=False),
        read_options=csv.ReadOptions(),
    )
)
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}

Although it is cython, the important part is they're all properties with setters
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543

Member


ParseOptions.delimiter has higher precedence unless separator overrides the default.

hmmm yes, that does sound better actually, thanks!

Member


Thanks Marco

with (#2989 (comment)) in mind ...

Not sure if this is on duckdb or sqlframe, but sep has higher precedence than delim

from pathlib import Path
from typing import Any, Mapping

import polars as pl
from sqlframe.duckdb import DuckDBSession


data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"

pl.DataFrame(data).write_csv(fp, separator="\t")

session = DuckDBSession.builder.getOrCreate()
>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]

Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.

So instead of

>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.

We either:

  • pick one and replace it - leaving everything else unchanged
  • say we'll pick ... then ... and then ...

If any backend raises on non-matching arguments - I say let them - as it saves us the hassle 😅
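The "pick one and replace it" rule could be sketched like this (the alias names and the canonical target name are assumptions for illustration):

```python
ALIASES = ("sep", "delim", "delimiter")


def normalize_separator(separator=",", **kwargs):
    # drop every backend alias silently and let `separator` win,
    # forwarding it under a single canonical name
    for alias in ALIASES:
        kwargs.pop(alias, None)
    kwargs["sep"] = separator
    return kwargs


print(normalize_separator(",", sep="?", delim="\t", delimiter="!"))
# {'sep': ','}
```

Unrelated kwargs pass through untouched, so the backend is free to raise (or not) on whatever else it receives.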

Labels
pyspark Issue is related to pyspark backend

Successfully merging this pull request may close these issues.

enh: add separator argument to read_csv / scan_csv
4 participants