Polars scan_csv and sink_parquet
The documentation for polars is not the best, and figuring out the snippet below took me over an hour. Here's how to read a headerless CSV file into a LazyFrame using scan_csv and write it to a parquet file using sink_parquet.
The key is to use with_column_names and schema_overrides. Despite what the documentation says, using schema doesn't work as you might imagine, and sink_parquet fails with a cryptic error about the dataframe missing column "a".
This is just a simplified version of what I'm actually trying to do, but it's the best way to drill down to the issue. Maybe the search engines will find this and save someone else an hour of frustration.
import numpy as np
import polars as pl

# Build a small example frame and write it out without a header row.
df = pl.DataFrame(
    {"a": [str(i) for i in np.arange(10)], "b": np.random.random(10)},
)
df.write_csv("/tmp/stuff.csv", include_header=False)

# Lazily scan the headerless CSV, supplying both the column names
# (via with_column_names) and the dtypes (via schema_overrides),
# then stream the result straight to parquet.
lf = pl.scan_csv(
    "/tmp/stuff.csv",
    has_header=False,
    schema_overrides={
        "a": pl.String,
        "b": pl.Float64,
    },
    with_column_names=lambda _: ["a", "b"],
)
lf.sink_parquet("/tmp/stuff.parquet")