
allocates for full sheet width when a cell exists in XFD; read_excel crashes on read: memory allocation of 52428275712 bytes failed #487

Description

@Andrewnolan13

Hi, this is the same exact issue as here

Minimal reproducible example

import os
from pathlib import Path
import time

import fastexcel
from openpyxl import Workbook

os.environ["POLARS_VERBOSE"] = "1"

fp = Path("repro_xfd.xlsx")

wb = Workbook()
ws_explode = wb.active
ws_explode.title = "explodes"
ws_no_explode = wb.create_sheet(title="no_explode")

# Small real dataset in A:C
ws_explode["A1"] = 1
ws_explode["B1"] = 2
ws_explode["C1"] = 3

ws_no_explode["A1"] = 1
ws_no_explode["B1"] = 2
ws_no_explode["C1"] = 3

for row in range(2, 100_000):
    ws_explode.cell(row=row, column=1, value=row)
    ws_explode.cell(row=row, column=2, value=row * 10)
    ws_explode.cell(row=row, column=3, value=row * 100)

    ws_no_explode.cell(row=row, column=1, value=row)
    ws_no_explode.cell(row=row, column=2, value=row * 10)
    ws_no_explode.cell(row=row, column=3, value=row * 100)

# Stray value in the last Excel column (XFD, column 16,384);
# this is the only difference between the two sheets
ws_explode["XFD1"] = "stray"

wb.save(fp)

print("saved workbook")

reader = fastexcel.read_excel(fp)

sheet1 = reader.load_sheet_by_name(
    "no_explode",
    header_row=None,
    use_columns=[0, 1, 2],
    column_names=["A", "B", "C"],
    dtypes="string",
)

df1 = sheet1.to_polars()

print("read no_explode")
print("df1 shape:", df1.shape)
del df1
del sheet1

print("sleeping for 10 seconds")
time.sleep(10)

sheet2 = reader.load_sheet_by_name(
    "explodes",
    header_row=None,
    use_columns=[0, 1, 2],
    column_names=["A", "B", "C"],
    dtypes="string",
)

df2 = sheet2.to_polars()

print("read explodes")
print("df2 shape:", df2.shape)

Log output

saved workbook
async thread count: 8
blocking thread count: 512
read no_explode
df1 shape: (99999, 3)
sleeping for 10 seconds
memory allocation of 52428275712 bytes failed
stack backtrace:
   0:     0x7ffa929fabe0 - PyInit__fastexcel
   1:     0x7ffa92a0b941 - PyInit__fastexcel
   2:     0x7ffa929fe204 - PyInit__fastexcel
   3:     0x7ffa929f73a9 - PyInit__fastexcel
   4:     0x7ffa929f4acc - PyInit__fastexcel
   5:     0x7ffa929f441f - PyInit__fastexcel
   6:     0x7ffa929f743b - PyInit__fastexcel
   7:     0x7ffa929f52e8 - PyInit__fastexcel
   8:     0x7ffa92a23259 - PyInit__fastexcel
   9:     0x7ffa92a23273 - PyInit__fastexcel
  10:     0x7ffa9230cdd6 - PyInit__fastexcel
  11:     0x7ffa92321905 - PyInit__fastexcel
  12:     0x7ffa922843b8 - <unknown>
  13:     0x7ffa922736eb - <unknown>
  14:     0x7ffa922da2e2 - PyInit__fastexcel
  15:     0x7ffa922d0a95 - PyInit__fastexcel
  16:     0x7ffa92338b29 - PyInit__fastexcel
  17:     0x7ffa9233b729 - PyInit__fastexcel
  18:     0x7ffa922df4c2 - PyInit__fastexcel
  19:     0x7ffa922dfb64 - PyInit__fastexcel
  20:     0x7ffab6a81de1 - PyObject_GC_Track
  21:     0x7ffab69e641c - PyObject_Vectorcall
  22:     0x7ffab69e6379 - PyObject_Vectorcall
  23:     0x7ffab6a055c6 - PyEval_EvalFrameDefault
  24:     0x7ffab6a7bca8 - PyEval_EvalCode
  25:     0x7ffab6a7bb5e - PyEval_EvalCode
  26:     0x7ffab6a7ba69 - PyAST_Compile
  27:     0x7ffab6a7b8ec - PyAST_Compile
  28:     0x7ffab6acfd5f - PyUnicode_EqualToUTF8
  29:     0x7ffab6acff48 - PyUnicode_EqualToUTF8
  30:     0x7ffab6acf395 - PyEval_MakePendingCalls
  31:     0x7ffab6acf232 - Py_fopen_obj
  32:     0x7ffab6acf417 - PyEval_MakePendingCalls
  33:     0x7ffab6a63f6b - PyInterpreterState_SetRunningMain
  34:     0x7ffab6a629fc - Py_RunMain
  35:     0x7ffab6a629a3 - Py_Main
  36:     0x7ff6f26d1230 - <unknown>
  37:     0x7ffb1eba7ac4 - BaseThreadInitThunk
  38:     0x7ffb209ba8c1 - RtlUserThreadStart

Issue Description

While reading data from Excel workbooks, I noticed that my Jupyter notebook's kernel would crash, but only on certain sheets of certain workbooks. On further inspection, the sheets it crashed on had no data after, say, column Z, except for a single value in column XFD (the right-most column). It still crashes even when I explicitly select the columns to read.
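
To find which sheets of a workbook are candidates for this, one option is to look at the range each worksheet declares in its XML. The sketch below uses only the standard library and assumes the file's writer recorded a `<dimension ref="..."/>` element near the top of each worksheet part (the element is optional and can be stale or missing, in which case `None` is returned). The helper names `declared_dimensions` and `reaches_last_column` are made up for illustration and are not part of fastexcel or openpyxl.

```python
import re
import zipfile

# Matches e.g. <dimension ref="A1:XFD99999"/> in a worksheet part.
_DIMENSION_RE = re.compile(rb'<dimension\s+ref="([^"]+)"')


def declared_dimensions(xlsx_path):
    """Map each worksheet part in the archive to its declared range, or None."""
    result = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            with archive.open(name) as part:
                # Assumption: the dimension element sits near the top of
                # the part, so a small prefix is enough to find it.
                head = part.read(4096)
            match = _DIMENSION_RE.search(head)
            result[name] = match.group(1).decode("ascii") if match else None
    return result


def reaches_last_column(ref):
    """True if a range such as 'A1:XFD99999' ends in column XFD."""
    if not ref:
        return False
    last_cell = ref.split(":")[-1]
    # Strip the row digits to leave only the column letters.
    return last_cell.rstrip("0123456789").upper() == "XFD"
```

A sheet whose declared range ends in XFD while its real data occupies only a few columns is a likely candidate for the crash described here. The keys are archive part names (for example `xl/worksheets/sheet1.xml`), not the sheet titles shown in Excel.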

My guess is that it materializes all of the empty columns between the actual data and the stray value in XFD before selecting the requested columns.

Notice in the snippet above that both sheets are identical except for one stray value. One reads fine, but the other tries to allocate about 52.4 GB (52,428,275,712 bytes, roughly 48.8 GiB).
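
The size of the failed allocation lines up with a buffer covering every cell from A to XFD for all 99,999 rows. The arithmetic below shows that the number in the log divides evenly into 32 bytes per cell across the full sheet width. The 32-byte figure is only what the division yields; I have not confirmed it against the fastexcel or calamine source, and `column_index` is a helper written for this illustration.

```python
ROWS = 99_999                       # rows per sheet, from the df1 shape in the log
FAILED_ALLOCATION = 52_428_275_712  # bytes, from the "memory allocation ... failed" line


def column_index(letters):
    """1-based index of an Excel column label, e.g. 'A' -> 1, 'Z' -> 26."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


full_width = column_index("XFD")    # width of the sheet once XFD holds a value
cells = ROWS * full_width
bytes_per_cell, remainder = divmod(FAILED_ALLOCATION, cells)

print(full_width)                   # 16384
print(cells)                        # 1638383616
print(bytes_per_cell, remainder)    # 32 0
```

For comparison, the same 32 bytes per cell over only the three requested columns would be 99,999 × 3 × 32 = 9,599,904 bytes, under 10 MB.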

Expected behavior

The sheet with the stray value should be read without crashing, since only columns A:C are requested.

Details

In case it is useful, polars.show_versions() returns the following.

--------Version info---------
Polars:              1.40.1
Index type:          UInt32
Platform:            Windows-2019Server-10.0.17763-SP0
Python:              3.13.13 (tags/v3.13.13:01104ce, Apr  7 2026, 19:25:48) [MSC v.1944 64 bit (AMD64)]
Runtime:             rt32

----Optional dependencies----
Azure CLI            'az' is not recognized as an internal or external command,
operable program or batch file.
<not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.43.5
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.20.2
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.8
numpy                2.4.2
openpyxl             3.1.5
pandas               3.0.0
polars_cloud         <not installed>
pyarrow              23.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.49
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.9
