-
Notifications
You must be signed in to change notification settings - Fork 166
feat: add separator
argument in read_csv
/scan_csv
#2989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.
Maybe what we could do is:
- check that separator is not None at the top of the function and raise otherwise
- then at each backend level, check if the separator was passed with it's specific backend argument name and if that's the case raise a more informative error specifying to use
separator
instead ofsep|parse_options
. - Disclaimer: I am expecting no one to use this feature so far. Yet the only problem with that is that this is actually a regression:
parse_options
(together withread_options
) is the only way for pyarrow to specify arguments inread_csv
. Therefore we are enabling to pass separator in a standard way but basically disallowing to pass any other argument. - The long way to do this is something along the following lines I think
elif impl is Implementation.PYARROW:
if "parse_options" in kwargs:
passed_options = kwargs.pop("parse_options")
fields = (
'quote_char',
'double_quote',
'escape_char',
'newlines_in_values',
'ignore_empty_lines',
'invalid_row_handler',
)
parse_options = csv.ParseOptions(
delimiter=separator, **{field: getattr(passed_options, field) for field in fields}
)
else:
parse_options = csv.ParseOptions(delimiter=separator)
native_frame = csv.read_csv(
source, parse_options=parse_options, **kwargs
)
narwhals/functions.py
Outdated
if separator is not None and "parse_options" in kwargs: | ||
msg = "Can't pass both `separator` and `parse_options`." | ||
raise TypeError(msg) | ||
from pyarrow import csv # ignore-banned-import | ||
|
||
native_frame = csv.read_csv(source, **kwargs) | ||
native_frame = csv.read_csv( | ||
source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit odd:
-
separator
is not typed to beNone
-
Even if that was the case, the following would not error in line 607:
nw.read_csv(..., separator=None, parse_option=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)
However, then in line 613, we would call
csv.read_csv( source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ... )
which would end up raising an exception at this point
-
Should we handle the same for other backends? i.e. pandas-like check that
sep
is not passed, and below for lazy backends
Co-authored-by: Francesco Bruzzesi <[email protected]>
@FBruzzesi thank you for the review! I agree about the separators validation. I added some support functions that should check the passed
as they might go out of date if PyArrow changes them and we just need to check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @raisadz - I am taking a closer look.
I am still not the biggest fan of the hustle for pyarrow users (which again, I don't expect to be many).
One part of me leans towards suggesting to completely remove **kwargs
and allow only for explicit parameters we can have full control over. I am honestly not sure
narwhals/functions.py
Outdated
validate_separator(separator, "delimiter", **kwargs) | ||
validate_separator(separator, "delim", **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow! TIL: duckdb and pyspark have two ways to pass the separator
return kwargs | ||
from pyarrow import csv # ignore-banned-import | ||
|
||
return {"parse_options": csv.ParseOptions(delimiter=separator)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind I completely misread this
Fake panic review
The issue I have with this is that if any other argument was provided in parse_options
then it will be silently ignored.
Say someone is calling the following:
nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False)
Then at the end of validate_separator_pyarrow
, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True)
(i.e. the default value), silently.
I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically lookup its __dir__
or use inspect.get_members
, exclude dunder methods, but for example we would end up with validate
and equals
, which are not attribute to set at instantiation.
Unsuccessful tentatives I tried:
inspect.signature
from inspect import signature
from pyarrow import csv
print(signature(csv.ParseOptions.__init__))
(self, /, *args, **kwargs)
dataclasses.fields
From the stubs I got tricked into thinking that is a dataclass:
from dataclasses import fields
from pyarrow import csv
print(fields(csv.ParseOptions))
TypeError: must be called with a dataclass type or instance
I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @FBruzzesi!
I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)
A user won't need to pass both separator
and delimiter
as if "parse_options" not in kwargs:
then we
return {"parse_options": csv.ParseOptions(delimiter=separator)}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, so someone would only need to pass both separator
and delimiter
if they were specifying another parse option (like double_quote
)
tbh I think this is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our default is:
separator: str = ","
and matches pyarrow
's default:
delimiter: str = ","
Alternative
ParseOptions.delimiter
has higher precedence unless separator
overrides the default.
In either case - every other argument is respected
Show definitions
from __future__ import annotations
from pyarrow import csv
from typing import Any
def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
DEFAULT = "," # noqa: N806
if separator != DEFAULT:
if opts := kwargs.pop("parse_options", None):
opts.delimiter = separator
else:
opts = csv.ParseOptions(delimiter=separator)
kwargs["parse_options"] = opts
return kwargs
def display_merge(result: dict[str, Any]) -> None:
if result and (options := result.pop("parse_options", None)):
print(f"{options.delimiter=}\n{options.double_quote=}")
if result:
print(f"Remaining: {result!r}")
elif result:
print(f"Unrelated: {result!r}")
else:
print(f"Empty: {result!r}")
Would this behavior not be more ideal?
# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)
>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False
>>> display_merge(merge_options())
Empty: {}
>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True
>>> display_merge(
merge_options(
"?",
parse_options=csv.ParseOptions(double_quote=False),
read_options=csv.ReadOptions(),
)
)
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}
Although it is cython
, the important part is they're all properties with setters
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ParseOptions.delimiter has higher precedence unless separator overrides the default.
hmmm yes, that does sound better actually, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Marco
with (#2989 (comment)) in mind ...
Not sure if this is on duckdb
or sqlframe
, but sep
has higher precedence than delim
from sqlframe.duckdb import DuckDBSession
import polars as pl
from pathlib import Path
data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"
pl.DataFrame(data).write_csv(fp, separator="\t")
session = DuckDBSession.builder.getOrCreate()
>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]
Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.
So instead of
>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.
We either:
- pick one and replace it - leaving everything else unchanged
- say we'll pick ... then ... and then ...
If any backend raises on non-matching arguments - I say let them - as it saves us the hassle π
Co-authored-by: Francesco Bruzzesi <[email protected]>
β¦t/add-separator-arg
Closes #2930
What type of PR is this? (check all applicable)
Related issues
separator
argument toread_csv
/scan_csv
Β #2930Checklist
If you have comments or can explain your changes, please do so below