There are few formats as ubiquitous as CSV: most applications for which it makes even a smidge of sense to do so support storing their output as CSV. Apart from its ubiquity, the format itself has quite a few other benefits:
- It’s human-readable, making it easy to create, read, and edit in any text editor, over the terminal, etc.
- It’s approachable even for non-technical users; they can make sense of the raw contents easily.
- It can be versioned using Git and other version control systems.
- It’s relatively compact compared to other text-based formats like XML or JSON.
As with all things, CSV has its downsides, too:
- It’s less efficient to store. For example, numbers take up space for each digit instead of being stored as a number.
- Being row-based, it’s quite hard to get data only for certain columns: the whole file needs to be read even if we only care about the first two columns, for example.
- There’s no one universal CSV standard; it has several variants or dialects.
When adding support for CSV files to Longbow – our framework for creating modular data services – it was the last point that was especially challenging. In this article, we describe the approach we took.
Let’s discuss which aspects of the CSV files we need to concern ourselves with when ingesting them into Longbow for further use. For each file, we need to derive the following:
- The encoding used by the file (ASCII, UTF-8, etc.).
- The dialect used by the file (delimiters, quotes, etc.).
- The names of the columns.
- The types of the data in the columns (integer, string, date, etc.).
- A preview of the first several rows so that the user can verify the CSV was parsed correctly.
We will explore the steps we took for each of these items in more detail in the rest of the article.
The ingest process
Before diving into the individual steps, let’s take a look at what the process of adding a new file looks like. First, the user uploads the CSV file they want to use to what we call a staging area. This is so we can run some analysis on the file using Longbow and show the results to the user. The user can review that the file is parsed correctly, and they can tweak some of the settings. Then, if they are satisfied with the results, they can proceed with confirming the file import. Once they do that, the file is moved from the staging area to the production area and is then ready for use.
Storing the metadata
CSV has no dedicated way of storing any kind of metadata in the file itself (other than somehow including it before the actual data), and we also want to support read-only input files. So we had to devise a mechanism to store the metadata detected in the steps described below somewhere. We ended up with dedicated manifest files. The manifests are located right next to the related CSV files and have the same name with the .manifest suffix. They contain JSON-serialized versions of all the configurations we have collected, both from the analysis and from the user. Whenever a particular CSV file is requested, we first check the manifest and use the configuration stored there to read the actual CSV file.
The configuration itself consists of options accepted by the Arrow CSV module (ReadOptions, ParseOptions, and ConvertOptions) that are used as-is when reading the CSV file. We also store information about date formats for any columns that should be interpreted as dates (more on that later).
Detecting the encoding
The very first step when reading an unknown CSV file (or any text file, for that matter) for any analysis is to determine the encoding used by the file. This is to avoid any surprises with non-UTF-8 files being interpreted the wrong way. We use the charset_normalizer package for this purpose. The detected encoding is then used in subsequent reads of the CSV file.
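A minimal sketch of what this detection can look like with charset_normalizer (the file name is just a placeholder):
from charset_normalizer import from_path

# Analyze the raw bytes of the uploaded file and pick the most likely encoding.
best_guess = from_path("uploaded_sample.csv").best()
if best_guess is None:
    raise ValueError("Unable to detect the file encoding.")

encoding = best_guess.encoding  # e.g. "utf_8", "cp1250", ...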
Detecting the dialect and column names
The next step is to detect the so-called dialect of the CSV file. The dialect describes some of the structural properties of the CSV:
- What is the separating character for the individual columns?
- Are there any quotation marks used to escape the separators, and if so, how can they themselves be escaped?
We also need to detect the column names. Some CSV files store the column names in the first row, some don’t store them at all, and we need to generate names ourselves.
We use DuckDB’s sniff_csv function to gather all of this information. It gives us all the structural information about the file, like the delimiters, quotes, etc. It also detects the column headers if there are any, falling back on autogenerated column names. You can read more about DuckDB’s CSV capabilities in their introductory blog post. We also need to make sure that the file we feed into DuckDB is in UTF-8; otherwise, it fails. We make use of the detected encoding and prepare a special copy of the input file just for DuckDB in case the original isn’t in UTF-8 (or ASCII).
def _detect_dialect_and_header_and_column_names(
    sample_filename: str,
    encoding: str,
) -> tuple[CsvDialect, int, list[str]]:
    # DuckDB only reads UTF-8, so a converted copy is used for other encodings
    # (the conversion itself is not shown here).
    needs_conversion = encoding not in ["utf_8", "ascii"]
    if needs_conversion:
        duckdb_input_file = sample_filename + ".utf_8.csv"
    else:
        duckdb_input_file = sample_filename
    try:
        return _run_duckdb_detection(duckdb_input_file)
    finally:
        # Clean up the temporary UTF-8 copy prepared for DuckDB.
        if needs_conversion:
            os.unlink(duckdb_input_file)


def _run_duckdb_detection(
    duckdb_input_file: str,
) -> tuple[CsvDialect, int, list[str]]:
    conn = duckdb.connect(":memory:", config={"threads": 1})
    query = conn.execute(
        "SELECT Delimiter, Quote, Escape, HasHeader, Columns FROM sniff_csv(?)",
        [duckdb_input_file],
    )
    query_result = query.fetchone()
    if not query_result:
        raise ValueError("Unable to detect file dialect.")
    (delimiter, quote, escape, has_header, encoded_columns) = query_result
    # DuckDB reports "\x00" when a given dialect character is not used at all.
    dialect = CsvDialect(
        delimiter=delimiter if delimiter != "\x00" else None,
        quotechar=quote if quote != "\x00" else None,
        escapechar=escape if escape != "\x00" else None,
    )
    # The Columns value is a struct literal; let DuckDB decode the column names for us.
    col_query = conn.execute("SELECT a.* FROM (SELECT " + encoded_columns + " as a)")
    decoded_columns = col_query.fetch_arrow_table().column_names
    if has_header:
        # Multi-line header cells mean the header spans more than one row.
        header_row_count = max(col_name.count("\n") for col_name in decoded_columns) + 1
    else:
        header_row_count = 0
    return dialect, header_row_count, decoded_columns
Before sniff_csv became available, we used the CleverCSV library for this step. However, the DuckDB variant performs better (we observed a ten-fold improvement in the overall time) and allowed us to simplify the code, since it detects the dialect and the column names in a single step.
Detecting the data types
Having a way to read the file with the schema in hand, we can proceed with determining the actual data type of each column. You might ask, “Why not use the types detected by DuckDB?” or “Why not use the automatic detection that Arrow CSV has?”. There are a few reasons, but the most significant one has to do with the various date formats we want to support.
The DuckDB CSV sniffer only supports one date format per file, so if you use one date format in one column and a different format in another column, it will not work. Arrow CSV does support different date formats per column, but the set of date formats it supports is limited. While it works great with ISO 8601 compliant dates, for example, it would not recognize strings like:
- Jan 22, 2023 01:02:03
- 01 22 23 01:02:03
- 20230122
as potentially being dates as well. This is not to say the Arrow detection is wrong (after all, the last example might very well just be an integer). We simply need to support a wider set of formats.
You can specify which date formats you want Arrow to try, but in case of ambiguity, it will always assume that the first matching format is correct. We want our users to disambiguate the date format manually: only they know which format is the right one.
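To illustrate the first-match behavior, here is a small, self-contained example (not our actual code):
import io
import pyarrow.csv as csv

data = io.BytesIO(b"created_at\n01/02/2024\n")
table = csv.read_csv(
    data,
    convert_options=csv.ConvertOptions(
        # Both formats match "01/02/2024"; Arrow simply uses the first one listed.
        timestamp_parsers=["%m/%d/%Y", "%d/%m/%Y"],
    ),
)
print(table["created_at"][0])  # parsed as January 2nd, not February 1st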
Another limitation of the Arrow CSV approach is that you either get the most precise data type detection (but you need to read the whole file into memory, which obviously doesn’t scale that well), or you use the batch-based approach, where only the first batch of the file is used for the data type detection, making it less precise.
We want the most precise detection while conserving memory. To that end, our pipeline is built a bit differently. First, we tell Arrow to read the file batch by batch and to treat every column as a string, so that we avoid any automatic detection performed by Arrow. This is where the column names come in handy: you need them to reference the columns in the Arrow CSV options. Next, we pipe this source into a custom Acero pipeline that allows us to run the analysis extremely quickly on the entire file in a streaming fashion, keeping the memory footprint small.
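A simplified sketch of that first stage might look like this (the file name and column names below are assumed; in the real pipeline they come from the earlier detection steps):
import pyarrow as pa
import pyarrow.csv as csv

column_names = ["id", "created_at", "amount"]  # from the dialect/header detection

# Read the file in batches, forcing every column to be a plain string so that
# Arrow does not attempt any type detection of its own.
reader = csv.open_csv(
    "uploaded_sample.csv",
    read_options=csv.ReadOptions(column_names=column_names, skip_rows=1),
    convert_options=csv.ConvertOptions(
        column_types={name: pa.string() for name in column_names},
    ),
)

for batch in reader:  # each batch is a pyarrow.RecordBatch of string columns
    ...  # feed the batch into the type-detection pipeline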
Acero streaming engine
What is Acero, you might wonder? Acero is an experimental streaming engine for running queries on large data. In Acero, you specify the processing pipeline declaratively, using several building blocks like projections, filters, and aggregations. You can choose from a wide range of predefined compute functions, and crucially, you can also define your own custom functions (User Defined Functions, UDFs for short). The UDFs are fairly easy to write: you only concern yourself with the transformations you want to perform, and Acero figures out the rest. What’s more, you can use several languages to do so: we use Python for the data type detection pipeline and Cython for the pipeline that reads the CSV data using the detected types. If SQL is more up your alley, you can use Substrait to generate the Acero query plan from an SQL query.
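To give a flavor of the declarative style, here is a tiny pipeline using only built-in compute functions (our real pipelines plug UDFs into the projection node):
import pyarrow as pa
import pyarrow.compute as pc
from pyarrow.acero import Declaration, TableSourceNodeOptions, ProjectNodeOptions

table = pa.table({"value": ["1", "2", "3"]})

plan = Declaration.from_sequence([
    # Source node: feed an in-memory table into the pipeline.
    Declaration("table_source", TableSourceNodeOptions(table)),
    # Projection node: run a compute expression over each batch.
    Declaration("project", ProjectNodeOptions(
        [pc.utf8_length(pc.field("value"))],
        names=["value_length"],
    )),
])

result = plan.to_table()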
The type detection pipeline
From a high-level perspective, our type detection pipeline is very simple: it has one source node reading the CSV file and one projection node running the UDF detection algorithm. Ideally, there would also be an aggregation node at the end that would aggregate the results of each projection batch. Unfortunately, Acero doesn’t seem to support UDFs in aggregation nodes yet, so we run the aggregation in pure Python.
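To sketch what registering such a UDF looks like: the snippet below is a heavily simplified stand-in for the detection function; the name, the regex, and the eager invocation at the end are illustrative only, while in the real pipeline the registered function runs inside the Acero projection node.
import pyarrow as pa
import pyarrow.compute as pc

def detect_batch_types(ctx, array):
    # Toy stand-in for the real detection logic: flag values that look like integers
    # (or are empty).
    return pc.match_substring_regex(
        array, pattern=r"^$|^[+-]?\d+$", memory_pool=ctx.memory_pool
    )

pc.register_scalar_function(
    detect_batch_types,
    "detect_batch_types",                      # name under which the UDF is registered
    {"summary": "toy type detection", "description": "illustrative example"},
    {"values": pa.string()},                   # input: one string column
    pa.bool_(),                                # output: per-value boolean flags
)

# Eager invocation for illustration; Acero calls the UDF per batch inside the projection.
flags = pc.call_function("detect_batch_types", [pa.array(["1", "2", "x"])])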
The detection UDF is run in parallel for every column in isolation and works like this. For each batch of values in a column:
- We detect which values are null or empty, using regular expressions such as this one for booleans:
import pyarrow.compute as pc

# Values are boolean-like if they are empty or one of the recognized literals.
is_boolean_vec = pc.match_substring_regex(
    array,
    pattern=r"^$|^(true|false|0|1)$",
    ignore_case=True,
    memory_pool=ctx.memory_pool,
)
- We use regular expressions and the strptime function to detect possible date formats (based on a set of supported date formats).
- We return the following values:
  - All the types the values in the batch conform to, ordered by specificity (e.g., an integer is more specific than a double).
  - All the date formats that can be used to parse every non-empty value in the batch as a valid date.
  - Whether any of the values in the batch are null or empty.
  - Whether all of the values in the batch are null or empty.
We then aggregate the results of all the batches for each column so that we get the final result:
- The most specific type usable for the column.
- All the date formats that can be used to parse all the non-empty values in all the batches.
- A flag indicating whether the column is nullable, i.e., whether it contains at least one value that is null or empty.
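A rough sketch of that pure-Python aggregation; the shape of the per-batch results and the names below are assumptions for the example, not our exact data structures:
# Hypothetical per-batch results: (candidate_types, date_formats, any_empty, all_empty).
batch_results = [
    ({"int64", "float64", "string"}, {"%m/%d/%Y"}, False, False),
    ({"float64", "string"}, {"%m/%d/%Y", "%d/%m/%Y"}, True, False),
]

# Candidate types ordered from most to least specific.
TYPE_SPECIFICITY = ["bool", "int64", "float64", "string"]

candidate_types = set.intersection(*(types for types, _, _, _ in batch_results))
date_formats = set.intersection(*(formats for _, formats, _, _ in batch_results))
nullable = any(any_empty for _, _, any_empty, _ in batch_results)

# The most specific type that every batch agreed on ("float64" for the data above).
column_type = next(t for t in TYPE_SPECIFICITY if t in candidate_types)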
Reading a preview
To allow the user to make an informed decision on whether we “understood” the file properly, and to let them pick the correct date format from those we detected as suitable, we read a small sample of the data using the options we intend to use once the file is confirmed by the user. We return this preview as a part of the response, together with all the options and configurations we detected.
You might wonder, “Why does the user need to pick a date format?”. This is to handle situations where the date values are ambiguous. Imagine a file that only has these two values in a column: 01/01/2024 and 01/02/2024. Do these correspond to January 1st and 2nd? Or are they January 1st and February 1st? Only the user knows which is the case, so in these (admittedly rare) cases, they have to pick the correct date format for us to use.
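Both candidate formats parse the second value just fine, yet they yield different dates, which is exactly why we leave the final choice to the user:
from datetime import datetime

value = "01/02/2024"
print(datetime.strptime(value, "%m/%d/%Y").date())  # 2024-01-02 (January 2nd)
print(datetime.strptime(value, "%d/%m/%Y").date())  # 2024-02-01 (February 1st)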
Using the CSV file as a source of data
Once the user confirms the CSV file is parsed correctly, the file is moved to the production area of the file storage, and a manifest file with all the metadata is created. When a computation run needs to access the CSV data, it uses the metadata in the manifest to prepare a RecordBatchReader that uses another Acero pipeline with another UDF for reading the date columns using the correct date format. The UDF is a thin wrapper around the strptime function, written in Cython, that doesn’t fail on empty values but does fail on invalid non-empty values. The default strptime either fails on empty values or returns null for anything it cannot parse, neither of which is what we want.
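As a rough, pure-Python approximation of that read path (the real date parsing happens in the Cython UDF inside an Acero pipeline, and every option comes from the manifest; the names and formats below are assumed):
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as csv

date_formats = {"created_at": "%m/%d/%Y"}  # user-confirmed formats from the manifest

def open_with_manifest(path: str) -> pa.RecordBatchReader:
    # In reality, all reader options come from the manifest; here we only force the
    # date columns to stay strings so that we can parse them ourselves.
    reader = csv.open_csv(
        path,
        convert_options=csv.ConvertOptions(
            column_types={name: pa.string() for name in date_formats},
        ),
    )
    schema = pa.schema([
        pa.field(f.name, pa.timestamp("s")) if f.name in date_formats else f
        for f in reader.schema
    ])

    def convert(batch: pa.RecordBatch) -> pa.RecordBatch:
        # Unlike our Cython UDF, plain strptime raises on empty values.
        columns = [
            pc.strptime(col, format=date_formats[name], unit="s")
            if name in date_formats else col
            for name, col in zip(batch.schema.names, batch.columns)
        ]
        return pa.RecordBatch.from_arrays(columns, schema=schema)

    return pa.RecordBatchReader.from_batches(schema, (convert(b) for b in reader))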
The resulting RecordBatchReader can then be consumed by the rest of Longbow, business as usual. There is a dedicated article coming about that particular part of Longbow, so stay tuned!
Summary
CSV files are one of the most widely used formats for storing structured data. Their relative looseness and simplicity make them easy to produce, but they are also quite challenging to read and parse automatically. We have outlined the way we do it for Longbow, leveraging the DuckDB CSV sniffing functionality and the Apache Arrow capabilities: its CSV module and the Acero streaming engine.
Want to learn more?
As always, I am eager to hear what you think about the direction we are taking! Feel free to reach out to us on the GoodData community Slack.
Want to try it out for yourself? Consider using our free trial. Want to learn what else we are cooking at GoodData? Join GoodData Labs!