DuckDB is an in-process analytical DBMS with a SQL interface and vectorized execution. In system design it is commonly used as an embedded OLAP layer for local analytics, ELT workflows, and data-lake file processing (Parquet/CSV) when a standalone server cluster is unnecessary.
History and context
Public project start
DuckDB emerges in 2019 as an open-source in-process OLAP engine, developed at CWI in Amsterdam, aimed at analytics inside applications and notebooks.
Preparation for 1.0
The 0.10.x line (early 2024) expands SQL capabilities and performance ahead of the stable 1.x branch.
Stable major release
With the 1.0.0 release (June 2024), the project establishes the 1.0 line with a focus on storage-format compatibility and production adoption.
1.x line evolution
Optimizer, SQL features, and ecosystem integrations continue to improve across the 1.x cadence.
LTS branch stabilization
The LTS stream improves upgrade predictability for production environments with longer release cycles.
Core architecture elements
In-process architecture
DuckDB runs as a library inside the host process, without a standalone DB server or network hop.
Vectorized SQL execution
Queries execute in batched pipelines, which is efficient for single-node scan-heavy OLAP workloads.
Columnar storage + open formats
Columnar layout and direct Parquet/CSV/Arrow access make DuckDB a practical SQL layer over data-lake files.
ACID with concurrency limits
DuckDB provides ACID/MVCC/WAL, while write concurrency is centered around a single writer process.
Execution and storage model
The overview below summarizes DuckDB internals: in-process deployment, the vectorized engine, storage layout, transactional semantics, and write-concurrency constraints.
DuckDB execution and storage model
DuckDB combines in-process deployment, vectorized execution, and columnar storage to run analytical workloads without a standalone server cluster.
Why DuckDB is a distinct analytical DB category
- The engine runs as an embedded library inside the host process, reducing operational overhead.
- Vectorized execution plus columnar layout is optimized for scan-heavy OLAP queries.
- ACID transactions, WAL, and checkpoints are available, but concurrency is centered on a single writer process.
- Native Parquet/CSV/Arrow interoperability makes DuckDB practical for embedded ELT and data-lake analytics.
In-process deployment
DuckDB runs as a library in the host application (Python/R/CLI/BI), without a standalone DB server process.
Typical use cases
- Notebook analytics
- Local BI
- Application-side data processing
Example
import duckdb
con = duckdb.connect('warehouse.duckdb')
High-Level Architecture
At a high level, a DuckDB setup comprises client embedding, the SQL/optimizer layer, vectorized execution pipelines, the storage subsystem, and ecosystem integration points.
Read / Write Path through components
This unified flow combines the write and read paths: from a client SQL statement through the optimizer and vectorized execution to storage persistence or analytical result delivery.
Write path
- DuckDB is optimized for bulk writes (`COPY`, batched `INSERT`, `CTAS`) within one process.
- The write plan is optimized and executed by vectorized operators in batches.
- ACID transactions with WAL/`CHECKPOINT` provide durability in persistent mode.
- Indexes and constraints enforce integrity but can slow down heavy bulk ingestion.
When to choose DuckDB
Good fit
- Embedded analytics in applications, notebooks, and local BI tools.
- ELT/EDA pipelines over Parquet/CSV/JSON without standing up a server cluster.
- Feature engineering and ad-hoc analytics close to Python/Pandas/Polars.
- Offline and edge scenarios where simple deployment and low operational overhead matter.
Avoid when
- Multi-tenant OLTP services with heavy concurrent writes from many processes/nodes.
- Requirements for distributed clustering with automatic failover and horizontal storage scaling.
- Systems that depend on row-level locking and long concurrent transactional workflows.
- Workloads requiring a persistent remote DB endpoint for many independent services.
Practice: DDL and DML
Below are practical DuckDB SQL operations: DDL for schema/index setup and DML for ingest, transformation, and analytics queries.
DDL and DML examples in DuckDB
DDL defines schema/indexes, while DML handles ingest and analytical querying.
DuckDB uses standard SQL DDL for tables, constraints, indexes, schemas, and ATTACH for multi-database-file workflows.
Create analytical tables and constraints
CREATE TABLE
DDL defines storage structure and baseline integrity rules.
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
plan VARCHAR,
created_at TIMESTAMP
);
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR,
event_ts TIMESTAMP,
payload JSON,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
Index for selective filters
CREATE INDEX
Indexes help selective lookups; they are not always needed for scan-heavy analytics.
CREATE INDEX idx_events_user_ts
ON events (user_id, event_ts);
ATTACH and build mart table in separate schema
ATTACH + CREATE SCHEMA/TABLE
Raw and mart layers can be isolated across attached databases/schemas.
ATTACH 'warehouse.duckdb' AS wh;
USE wh;
CREATE SCHEMA IF NOT EXISTS mart;
CREATE TABLE mart.daily_events AS
SELECT CAST(event_ts AS DATE) AS dt, event_type, count(*) AS cnt
FROM events
GROUP BY 1, 2;