
Is MotherDuck Producktion-Ready?


You may have seen many announcements about integrations with DuckDB/MotherDuck in the past months. I'm fairly conservative, so I'm not the first to jump on the hype train. I want to be sure about a new technology before publicly announcing a new integration.

Thanks to my conservatism, I made time to properly evaluate MotherDuck. Now I can fully tell you whether or not MotherDuck is producktion-ready (so we can use it in GoodData and recommend it to our customers).

Moreover, this article shows you all the important differences between DuckDB/MotherDuck and other databases, together with results from the first iteration of performance tests.

Let’s begin!

What the Heck Is DuckDB/MotherDuck?

If you haven't heard about DuckDB/MotherDuck yet, let me quickly onboard you. The MotherDuck data source is DuckDB-as-a-service, plus a few exciting features.

DuckDB is an in-process OLAP database that’s:

  • Very easy to use in all major programming languages, and it can even run in web browsers via WebAssembly (WASM)
  • Very, very fast (columnar, vectorized execution engine, etc.)
  • Feature-rich (SQL, integrations, etc.), straight-up inventing features we have been missing for decades, such as GROUP BY ALL
  • Open-source and easily extensible

MotherDuck, standing on DuckDB’s shoulders, offers:

  • Serverless deployment in the cloud
  • Users can connect using standard protocols – JDBC, ODBC, …
  • SQL executions are automatically optimized; imagine running SQL on top of CSVs stored across multiple regions
  • Optimized storage for data and caches
  • An interesting new concept – hybrid execution, which can combine source data stored on both the client's and the server's side

Onboarding as Usual

Nothing new here. I always look for documentation and community first.

Generally, MotherDuck excels here! Many of my questions in the community were answered by their CEO 🙂

Questions answered by CEO

Integrating Semantic Layer With a New DB Engine

This is the path GoodData follows.

Our customers appreciate the semantic layer because it:

  • Decouples end users from the physical representation of the data
  • Provides a higher-level metric language on top of SQL
  • Makes it easier to define metrics (in our case, with the MAQL language)
  • Makes it easier to migrate between DB engines

To support a new DB engine, we have to ensure that all metric concepts, once translated to the specific SQL dialect of the DB engine, can be executed and return correct results. Generally, it's a good test of the capabilities of DB engines. This is why I'm confident that a DB engine is producktion-ready when all the tests are passing. Now, let me take you through my path when integrating MotherDuck.

Building a Minimal Viable Produckt (MVP)

Context:

GoodData provides a custom analytical language called MAQL, powered by a Logical Data Model (LDM). You should read this short article by my colleague Jan Kadlec, which briefly describes the basics.

When we want to integrate a new database engine, we have to be able to translate any MAQL metric to the SQL dialect of this new engine.

However, the MVP is quite simple:

  1. Add a new JDBC driver
  2. Register a new DB type in relevant APIs and gRPC interfaces
  3. Inherit behavior from PostgreSQL (metadata scanning, SQL dialect, SQL query execution)

However, I ran into the following two issues.

The first issue is that jdbc:duckdb:md:<db_name> can't be parsed as a valid URL. To resolve this, I turned the URL parsing off, and later, I created a custom parser. We parse it for user convenience.
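
Such a parser can be quite small. Here is a hypothetical sketch (names and structure are illustrative, not our actual code); it only splits the database name off the JDBC prefix:

from dataclasses import dataclass

@dataclass
class MotherDuckUrl:
    database: str

def parse_motherduck_jdbc_url(url: str) -> MotherDuckUrl:
    # jdbc:duckdb:md:<db_name> is not a standard URL, so we match the prefix manually.
    prefix = "jdbc:duckdb:md:"
    if not url.startswith(prefix):
        raise ValueError(f"not a MotherDuck JDBC URL: {url}")
    return MotherDuckUrl(database=url[len(prefix):])

assert parse_motherduck_jdbc_url("jdbc:duckdb:md:my_db").database == "my_db"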

The second issue I encountered was that MotherDuck drivers allow users to access the local filesystem. That's OK if you run the client on your laptop, but when we connect from our SaaS infra, we have to disable it!

And the solution is quite simple! Simply set saas_mode to true.
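
In the Python client, the same idea looks roughly like this (a minimal sketch; I'm assuming the property can be passed in the connection string, so double-check the current MotherDuck docs):

import duckdb

# SaaS mode disables the driver's access to the local filesystem.
con = duckdb.connect("md:my_db?saas_mode=true")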

Funny insert – reading /etc/passwd

Reading password from /etc/passwd

Then I tested the MVP:

  • Registered with MotherDuck (SIGN UP button on the home page)
  • Created an auth token (https://app.motherduck.com/ -> User settings -> Service token)
  • Registered a new data source in GoodData pointing to MotherDuck
  • Scanned and created a logical data model
  • Created a metric, report (insight), and dashboard, and executed them

Did it work out of the box?

MotherDuck suffers from not being fully compatible (SQL dialect) with the PostgreSQL database family (e.g., Redshift, Vertica). But basic reports worked with no issues.

The usual issue you hit: at least some operations with dates/timestamps don't work. That happened here too. Will there ever be one standard for working with dates in SQL? I know it's more science fiction, but I still hope. 🙏

Beta

How do we prove that a DB engine is beta-ready?

DuckDB executes over 2,000 SQL tests to validate any change. Impressive!

We have developed over 700 tests over the last 15 years, laser-focused on analytics use cases that users can meet in GoodData. Generally, they can meet them in all BI platforms, whether they provide a higher-level language such as our MAQL or just raw SQL interfaces.

We validate that SQL queries can be successfully executed against the database and that their results are correct (except for floats – we tolerate differences up to a configurable delta).
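
The float comparison boils down to something like this (a simplified sketch; delta stands for the configurable tolerance):

def floats_equal(expected: float, actual: float, delta: float = 1e-6) -> bool:
    # Floats may differ slightly across engines (precision, order of aggregation),
    # so we accept any difference up to a configurable delta.
    return abs(expected - actual) <= delta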

Use cases worth mentioning:

  • Aggregation functions – SUM, COUNT, MEDIAN, …
  • Logical operators – AND, OR, …
  • Mathematical functions – SQRT, LN, …
  • Running total functions – RUNSUM, RUNSTDEV, …
  • Ranking functions – OVER (PARTITION BY) in SQL, including FRAMING
  • Conditional statements – CASE, IF THEN, …
  • Datetime arithmetic and transformations, including the infamous TIMEZONEs

To illustrate the complexity: the most complex test generates a circa 20-page-long SQL query containing a combination of various edge cases. Although it queries small tables, it doesn't finish under 10 seconds on any supported database!

So, what obstacles did I meet before all the tests passed?

Loading Test Data

Before running the tests, we must be able to load the test datasets into the new database engine. We have developed seven models, each focused on a subset of use cases. Some are synthetic (TPC-H), and some are quite realistic projects, e.g., our internal Sales project.

In the case of MotherDuck, I used the Python duckdb-engine (SQLAlchemy). Later, when I upgraded to 0.9.2, the dependency on the duckdb library was no longer defined, so I had to upgrade it explicitly (and add it to requirements.txt).

Then, I set the MOTHERDUCK_TOKEN env variable and could finally connect to MotherDuck:

from sqlalchemy import create_engine

# The driver picks up MOTHERDUCK_TOKEN from the environment.
self.engine = create_engine(
    f"duckdb:///md:{self._database_name}"
)

However, I ran into some issues:

  • TIMESTAMP WITH TIMEZONE values (in zones other than UTC) are not loaded correctly.
    Solution: execute a SET statement (see the source code below).
  • The FLOAT data type behaves differently. Loaded values are slightly different.
    Solution: I changed the FLOAT data type to DOUBLE (configurable in our tooling).
  • Empty string values are loaded as NULL.

There's a simple solution for the last one. Just use the allow_quoted_nulls property (see the source code below).
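
Together, the two fixes look roughly like this (a minimal sketch with assumed table and file names, not our exact tooling code):

import duckdb

con = duckdb.connect("md:my_db")

# Pin the session timezone so TIMESTAMP WITH TIMEZONE values load consistently.
con.execute("SET TimeZone = 'UTC'")

# Keep quoted empty strings as empty strings instead of NULLs.
con.execute("""
    INSERT INTO my_table
    SELECT * FROM read_csv('data.csv', header = true, allow_quoted_nulls = false)
""")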

Loading test data

Executing Tests

I executed the tests on top of the MVP described above (PostgreSQL dialect generated), and hundreds of tests failed, as usual. 😉

Then I spent circa three days fixing them while heavily communicating with the MotherDuck community. Fun!

Let me recap the most interesting use cases I had to implement, from simple to complex.

Table Sampling

The syntax is different from what we have seen in other databases so far.

Snowflake: FROM <table> TABLESAMPLE(10)

DuckDB: FROM <table> USING SAMPLE 10

Moreover, DuckDB doesn't support decimal percentage values, so I truncate the value before generating the SQL. Once this feature becomes important for our end users, I'll file an issue on DuckDB's GitHub.
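
A quick, self-contained way to try the DuckDB syntax (the integer percentage reflects the truncation mentioned above):

import duckdb

# Sample roughly 10 percent of a 1,000-row table.
duckdb.sql("SELECT COUNT(*) FROM range(1000) t(i) USING SAMPLE 10 PERCENT").show()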

Generate Series

God bless DuckDB for providing the GENERATE_SERIES function. It's so important! Some databases don't provide it. ☹️

It's especially important for generating timestamp series – we outer-join them with data to fill in missing values.

In DuckDB, we need to add an UNNEST function on top of what, for example, PostgreSQL requires.

Generate Series

Why is UNNEST needed here?

Alex from MotherDuck: “The reason is that DuckDB's generate_series can be used in a FROM clause as a table function or in a SELECT clause as a scalar function that produces a list. Unnest converts the list into separate rows!”
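
A minimal sketch of both forms (illustrative dates):

import duckdb

# As a table function in FROM – no UNNEST needed:
duckdb.sql("""
    SELECT * FROM generate_series(TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-07', INTERVAL 1 DAY)
""").show()

# As a scalar function in SELECT – it produces a list, so UNNEST turns it into rows:
duckdb.sql("""
    SELECT UNNEST(generate_series(TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-07', INTERVAL 1 DAY)) AS day
""").show()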

Truncate Numbers

TRUNC() is the opposite of CEIL(): it always rounds down.

Interestingly, only some DB engines accept a second argument, truncating at the given decimal place, always down. Example: TRUNC(1.29, 1) = 1.2

DuckDB doesn't support it either.

MotherDuck CEO Jordan Tigani proposed a nice workaround: CAST(1.29 AS DECIMAL(18, 1)) = 1.2

Unfortunately, it doesn't always work. Example: TRUNC(1.29, TRUNC(1.1))

Don't ask me why we accept decimal values from end users in this case 😉

Anyway, CAST(1.29 AS DECIMAL(18, TRUNC(1.1))) isn't a valid statement.
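
A generic alternative we could generate instead (my sketch, not a workaround anyone proposed) is to scale, truncate with the single-argument TRUNC that DuckDB does support, and scale back:

import duckdb

# TRUNC(1.29, TRUNC(1.1)) emulated as scale -> truncate -> unscale; returns 1.2.
duckdb.sql("""
    SELECT TRUNC(1.29 * POWER(10, TRUNC(1.1))) / POWER(10, TRUNC(1.1)) AS truncated
""").show()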

Date Arithmetic

Date Arithmetic Meme

Date arithmetic in SQL is an absolute nightmare for all data engineers. The syntax is very different in each database family, and some date granularities are defined by highly complex algorithms. The worst-case scenario is ISO weeks. This is a screenshot from the corresponding Wiki page:

Wiki Page Math

It looks like something from a quantum mechanics book, doesn't it?

We already had to implement the arithmetic for many databases (PostgreSQL, Snowflake, BigQuery, MSSQL, etc.), so we grok it. To not make it too easy for ourselves, we provide non-standard use cases to our customers, such as comparing the current ISO week with the same week in the previous year. Really fun to calculate something like this!

Additionally, we must respect US/EU weeks starting on Sunday/Monday. 🙂

So, what did we have to implement in the case of DuckDB?

Date formatting is different from PostgreSQL, but we're able to override it easily:

Date Format Map

There is no TO_CHAR function in DuckDB. Instead, we generate a STRFTIME or STRPTIME function depending on whether the first argument is a literal (e.g. '2023-01-01') or an expression (e.g. column_name).
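
For reference, the two DuckDB functions behave like this (a toy sketch; format strings are illustrative):

import duckdb

# STRFTIME formats a date/timestamp expression into text:
duckdb.sql("SELECT STRFTIME(DATE '2023-01-01', '%Y-%m-%d') AS formatted").show()

# STRPTIME parses a text literal into a TIMESTAMP:
duckdb.sql("SELECT STRPTIME('2023-01-01', '%Y-%m-%d') AS parsed").show()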

Weeks and quarters are tricky. We generate something like this:

Select Year

But unfortunately, it's not enough.

I already mentioned the use case where we need to shift a date from an ISO week to a day in the same ISO week in the previous year.

Example: 2023-01-01 -> 2022-52 -> 2021-52 -> 2021-12-27 (Monday)

We need to end up with the TIMESTAMP data type because we may use it, for example, as a filter in an outer query. There is no way to convert a YEAR-WEEK to the TIMESTAMP data type in DuckDB. The same is valid for quarters. I'm discussing it with the community.
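
One workaround I'm considering (my sketch, not a confirmed solution) anchors on January 4th, which always falls within ISO week 1, and shifts by whole weeks:

import duckdb

# Monday of ISO week 52 of 2021 -> 2021-12-27, matching the example above.
duckdb.sql("""
    SELECT DATE_TRUNC('week', MAKE_DATE(2021, 1, 4)) + (52 - 1) * INTERVAL 1 WEEK AS monday
""").show()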

Producktion

OK, so almost all functional tests are passing now. Are we ready to release the integration to producktion? Unfortunately, not yet. Besides the issues already described above (ISO weeks, TRUNC), I found two more.

Serializing Quantiles with Decimals

Drop Test

I reminded the community, and the response amused me:

Example of Conversation

UPDATE 4.12.2023 – a solution was shipped in 0.9.2. I upgraded the JDBC driver, enabled the related tests, and voilà – it works like a charm! Great job, team!

Cancelling (Long-Running) Queries

I asked the community, and Nick from MotherDuck responded that explicit client-side cancels via SQL are not yet supported. Later, he mentioned that DuckDBPreparedStatement.cancel() should do the job. He also described a race condition, which shouldn't affect us.

We implement the cancellation process by opening a separate DB connection and calling a cancel SQL statement there. It's the safest solution in most databases. For example, you can lose the network connection to the DB server and then can no longer call cancel() using that connection. Eventually, we'll try to implement the proposed solution in our stack and see how reliable it is.
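
To get a feel for the proposed approach, here is a sketch using the Python client, where interrupt() plays the role that the JDBC cancel() would play in our stack (an illustration, not our implementation):

import duckdb
import threading
import time

con = duckdb.connect()

def watchdog(connection: duckdb.DuckDBPyConnection, timeout_s: float) -> None:
    # After the threshold, abort whatever statement is running on `connection`.
    time.sleep(timeout_s)
    connection.interrupt()

threading.Thread(target=watchdog, args=(con, 5.0), daemon=True).start()
try:
    con.execute("SELECT COUNT(*) FROM range(1000000000) a, range(1000) b")
except duckdb.Error:
    print("query cancelled after the configured timeout")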

Why is this use case so important for us?

We provide a self-service analytics experience to end users. Users can create crazy reports, e.g., aggregating huge datasets by primary key. We need to cancel queries running longer than a configurable threshold so as not to overload the database.

Driver Extensions Downloaded at Runtime

Nothing described above blocks the release – we can document it as known limitations.

But there is one blocker, unfortunately, and it relates to the behavior of the client drivers.

Every time a client connects to the server, it downloads extensions to the client side:

Driver Extensions

The files are quite big (178M) and must be downloaded the first time you connect, which slows down connection establishment considerably. But we could live with that.

The real blocker is that the files are downloaded at runtime, bypassing the security scanners we have in our pipelines. This is a no-go for our compliance. ☹️

I discussed it with Jordan in the community. I understand their motivation – they need to innovate the client-server integration rapidly and don't want to release new clients too often.

As a temporary workaround, I'm thinking about incorporating the extensions into our Docker images. As long as the server accepts the extension version, we're OK. Once they release a breaking change to the server and stop accepting our old extensions, we'll release a new Docker image with the new extensions. Not production-ready, but good enough for our prospects/customers to play with MotherDuck in production (PoC, …).

Looking forward to getting stabilized clients in the future and announcing full pro-ducktion readiness!

Performance Tests

In this iteration, I didn't want to run comprehensive performance tests. I plan to do that in the next iteration. Stay tuned!

However, when executing the functional tests, I was curious how MotherDuck performs compared to other supported databases.

I decided to run the following first iteration of performance tests:

  • Compare MotherDuck with Snowflake and PostgreSQL
  • Run the 700 functional tests with various numbers of threads (DB connections, parallelized execution)
  • Run a load of test data (CSV files)

Unfortunately, I hit a bug in our test framework – Kotlin coroutines didn't parallelize executions correctly. So, in the end, I measured the duration of the functional tests running with just one DB connection.

To make the comparison fair:

  • All databases are running in US-EAST-1
  • Functional tests are executed from US-EAST-1
  • I implemented a set of GitLab jobs running on US-EAST-1 workers
  • I executed the load of CSV files from my laptop (EU-CENTRAL; too lazy to implement the corresponding GitLab jobs)

Note: I loaded the results of the performance tests back into MotherDuck and visualized them in GoodData to evaluate the shortest possible onboarding scenario 🙂

Drop Test Data If Exists

The results clearly show a winner – MotherDuck!

To be fair, I executed the tests against two sizes of Snowflake – X-SMALL and LARGE. In this scenario (relatively small data, most queries under 1s), a large Snowflake doesn't bring any benefit. It's even slower in the case of the functional tests and only slightly faster in the case of the CSV load.

Load CSV Files Chart
Functional Tests Chart

Funny insert:

When running the functional tests against the two sizes of Snowflake, one test returned different results – it calculated different numbers of days in a week. I created a support case for that.

Insert Use Case 1
Insert Use Case 2

Update: Snowflake support helped us realize we have a bug in our SQL generator! The SEQ4() function can generate different sequences when the warehouse runs in a cluster. We're going to fix this by generating the ROW_NUMBER() function in this case. A common issue in clustered databases.

What Do We as GoodData Expect from MotherDuck?

That is, compared to the already supported DB engines.

Said differently – what should be the key differentiators of MotherDuck?

We expect significantly lower costs (TCO). Not only because it's a new player trying to penetrate the market but also because of the efficiency of the engine itself. I've already seen lots of very promising benchmarks where DuckDB outperforms all competitors.

We expect better performance. GoodData is focused on analytics use cases querying small/mid-size data volumes prepared by various ELT processes. DuckDB has proved that it's optimized for such use cases. It provides better latency for more parallel users. That's where all traditional data warehouse vendors struggle. Personally, I've worked with Vertica, Snowflake, BigQuery, Redshift, and others. Vertica provided the lowest latency / highest concurrency, but still, with hundreds or more concurrent users, the (clustered) engine eventually stopped scaling.

Finally, dream with me: the DuckDB (or another similarly progressive) SQL dialect becomes a new de facto SQL standard. DuckDB has already introduced very useful and long-missing features such as GROUP BY ALL or SELECT * EXCLUDE(). First, two standards for OLTP and OLAP could be established, and then … I know I'm naive, but sometimes a man needs to dream 😉

What's Next?

Although the issue related to the extensions downloaded at runtime is a blocker for us, I consider MotherDuck pro-duck-tion ready for analytics use cases.

I want to explore MotherDuck's capabilities in the area of ELT. I have already tested the integration with Meltano and dbt, and thanks to help from their communities, I was able to run my complex ELT project successfully. I'll write an article about it.

Second, I want to run real perf tests on top of bigger datasets. Many people have already tested that, but I want to focus on realistic analytics use cases – we execute dashboards, which means executing all reports (insights) in parallel and measuring the overall duration of loading each dashboard. Moreover, we spin up tens or even hundreds of virtual users executing these dashboards. Low latency/high concurrency is what matters in the case of analytics (BI).

Last, I want to focus on costs (TCO). I would like to execute heavy tests and calculate the TCO for MotherDuck compared to other vendors. MotherDuck has already published their pricing ideas here. It is very simple, and it looks very promising. Suppose the TCO of MotherDuck is significantly lower, and at the same time, the performance is significantly better. In that case, we can't do anything other than recommend MotherDuck to our prospects as the default DB engine for analytics. 🙂

By the way, GoodData plays a significant role here. We recently announced a new service called FlexCache. We plan to cache not only report results but also pre-aggregations, significantly reducing the number of queries that must go to the underlying DB engine.

Try It Yourself!

To see GoodData in action, try our free trial.

If you're interested in trying GoodData with MotherDuck, please register for our labs environment here. I also prepared a short doc page dedicated to MotherDuck here.

