Merged
24 commits
7fcc1be  Improve content readability across blog posts and query engine pages (Akshay-datazip, Oct 31, 2025)
c6ee307  Add internal linking across homepage and documentation (Akshay-datazip, Oct 31, 2025)
4c922b6  Add additional internal linking - Part 1 (Akshay-datazip, Oct 31, 2025)
10bb244  Add additional internal linking - Part 2 (Akshay-datazip, Oct 31, 2025)
3e8bf22  Add additional internal linking - Part 3 (Akshay-datazip, Oct 31, 2025)
1b9736b  Add internal linking - Part 4: Connectors (Akshay-datazip, Oct 31, 2025)
6322c28  Add internal linking - Part 5: PostgreSQL blog post (Akshay-datazip, Oct 31, 2025)
b01eb0b  Add internal linking - Part 6: MongoDB & MySQL blog posts (Akshay-datazip, Oct 31, 2025)
c9d6ebb  Add internal linking - Part 7: Additional blog posts (25% milestone) (Akshay-datazip, Oct 31, 2025)
04746ff  Add internal linking - Part 8: Remaining blog posts (50% milestone) (Akshay-datazip, Oct 31, 2025)
f6014f5  Add internal linking - Part 9: Query engine docs (75% milestone) (Akshay-datazip, Oct 31, 2025)
0ec74d4  Add internal linking - Part 10: Remaining query engine docs (90% mile… (Akshay-datazip, Oct 31, 2025)
61dc408  Add internal linking - Part 11: Iceberg integration docs (100% complete) (Akshay-datazip, Oct 31, 2025)
a8d7e04  Fix 404 links: Remove links not in CSV and fix iceberg-metadata path (Akshay-datazip, Nov 5, 2025)
880b838  Fix athena.mdx: Remove invalid markdown link from plain string descri… (Akshay-datazip, Nov 5, 2025)
dec5159  Restore AWS Glue Data Catalog link in athena.mdx description (Akshay-datazip, Nov 5, 2025)
e92fc22  Final review: Add missing anchor text links from CSV (Akshay-datazip, Nov 5, 2025)
dfb6da8  Add Trino link in Resources section (CSV line 83) (Akshay-datazip, Nov 5, 2025)
e0feef8  updated backlinks with the new doc (Akshay-datazip, Nov 6, 2025)
5344568  Merge branch 'master' into internal-linking (Akshay-datazip, Nov 6, 2025)
767614a  Fix all 404 internal links - update broken paths to correct documenta… (Akshay-datazip, Nov 6, 2025)
ac79d49  Fix internal links: update colors, remove unwanted links, add time tr… (Akshay-datazip, Nov 14, 2025)
caa78a5  Refactor: Create reusable blue-link CSS class and fix white link styling (Akshay-datazip, Nov 20, 2025)
f60bcf5  Convert markdown links to JSX Link components in Flink and Spark tabl… (Akshay-datazip, Nov 20, 2025)
2 changes: 1 addition & 1 deletion blog/2024-11-22-debezium-vs-olake.mdx
@@ -12,7 +12,7 @@ tags: [debezium]
![OLake platform: Change data from MySQL, MongoDB, PostgreSQL flows to OLake, processed and stored in S3 and Iceberg](/img/blog/cover/debezium-vs-olake-cover.webp)


Change Data Capture (CDC) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Qurkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC.
[Change Data Capture (CDC)](/blog/olake-architecture-deep-dive/#cdc-sync) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Quarkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC.

However, while this combination offers powerful capabilities, it also comes with significant drawbacks that can impact your organization's efficiency and resources.

2 changes: 1 addition & 1 deletion blog/2025-03-18-binlogs.mdx
@@ -13,7 +13,7 @@ tags: [olake]


### What Are Binlogs?
Binary logs in MySQL are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don't log `SELECT` statements or other read-only operations.
Binary logs in [MySQL](/blog/mysql-apache-iceberg-replication) are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don't log `SELECT` statements or other read-only operations.

Binary logs in MySQL are a feature that allows the recording of all changes made to the database in a structured binary format. The binary log files contain a chronological record of SQL statements or row-level changes that modify the data in the database. They are primarily used for tasks such as replication, point-in-time recovery, auditing, and data analysis.
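
As a quick, hedged illustration (run against your own server; the output values will differ), you can confirm binary logging is on and that it uses the row-based format that CDC tools generally require:

```sql
-- Is binary logging enabled on this MySQL server?
SHOW VARIABLES LIKE 'log_bin';

-- CDC readers typically need ROW format rather than STATEMENT or MIXED
SHOW VARIABLES LIKE 'binlog_format';

-- List the binlog files the server currently retains
SHOW BINARY LOGS;
```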

6 changes: 3 additions & 3 deletions blog/2025-03-18-json-vs-bson-vs-jsonb.mdx
@@ -34,10 +34,10 @@ While JSON is perfect for many use cases, it has limitations:
- **Untyped Values**: All values in JSON are strings when transmitted, meaning systems need to parse and interpret types during runtime.

### 2. What is BSON?
**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON's capabilities by adding support for more complex data types and structures.
**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON's capabilities by adding support for more complex data types and structures.

#### Why BSON Exists
As MongoDB began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent. BSON was developed to:
As [MongoDB](/docs/connectors/mongodb/setup/local/) began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent. BSON was developed to:
- **Handle Complex Data Types**: BSON supports more than just strings and numbers. It can store native types like dates, binary data, and embedded arrays or objects efficiently.
- **Optimize for Database Operations**: BSON is designed to be lightweight but still allow for fast queries and indexing inside a database like MongoDB.
- **Better for Large-Scale Data**: BSON was created to offer faster data reads/writes and a more compact size when dealing with large documents.
@@ -59,7 +59,7 @@ As MongoDB began to rise in popularity as a NoSQL document store, the need for a


### 3. What is JSONB?
**JSONB** (Binary JSON) is a format introduced by PostgreSQL to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context.
**JSONB** (Binary JSON) is a format introduced by [PostgreSQL](/docs/connectors/postgres) to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context.
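
To make this concrete, here is a minimal illustrative sketch (table and column names are made up for the example) of storing, querying, and indexing JSONB in PostgreSQL:

```sql
-- Store semi-structured payloads in a JSONB column
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

INSERT INTO events (payload)
VALUES ('{"user": "alice", "action": "login", "ts": "2025-01-01T10:00:00Z"}');

-- ->> extracts a field as text; @> tests JSON containment
SELECT payload->>'user' AS user_name
FROM events
WHERE payload @> '{"action": "login"}';

-- A GIN index speeds up containment queries, something plain JSON columns cannot offer
CREATE INDEX idx_events_payload ON events USING GIN (payload);
```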

#### Why JSONB Exists
JSONB was created to provide a fast, efficient way to store and query JSON-like documents within PostgreSQL. Regular JSON in relational databases comes with several downsides, such as slower queries and no support for indexing. JSONB was introduced to address these problems by offering:
2 changes: 1 addition & 1 deletion blog/2025-07-29-next-gen-lakehouse.mdx
@@ -36,7 +36,7 @@ Query your data "as of" last Tuesday for audits or bug fixes.
**Hidden Partitioning** — Iceberg tracks which files hold which dates, so queries auto-skip irrelevant chunks, no brittle dt='2025-07-21' filters required.
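
A hedged sketch of what hidden partitioning looks like in practice (Spark SQL syntax; the catalog, table, and column names are illustrative):

```sql
-- Partition by day, derived from a timestamp column; Iceberg tracks the mapping
CREATE TABLE lake.db.page_views (
    user_id  BIGINT,
    url      STRING,
    event_ts TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Readers filter on the raw column; Iceberg prunes irrelevant files automatically,
-- no manual dt='2025-07-21' predicate required
SELECT count(*)
FROM lake.db.page_views
WHERE event_ts >= TIMESTAMP '2025-07-21 00:00:00'
  AND event_ts <  TIMESTAMP '2025-07-22 00:00:00';
```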


Most importantly it is **engine-agnostic** you might have heard this term a lot and here it gets a meaning iceberg supports Spark, Trino, Flink, DuckDB, Dremio, and Snowflake all speak to the tables natively
Most importantly, it is **engine-agnostic**. You might have heard this term a lot, and here it gets a concrete meaning: Iceberg supports [Spark](/iceberg/query-engine/spark), [Trino](/iceberg/query-engine/trino), Flink, [DuckDB](/iceberg/query-engine/duckdb), Dremio, and Snowflake, and they all speak to the tables natively.


Before Iceberg, data lakes were basically digital junkyards. You'd dump data files into cloud storage (like Amazon S3), and finding anything useful was like looking for a specific needle in a haystack.
4 changes: 2 additions & 2 deletions blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx
@@ -33,7 +33,7 @@ Performance is often the deciding factor and rightly so. The good news? Both for

### File Layout & Updates

Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new **Deletion Vectors (DVs)** feature is pretty clever, it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks offers DVs as a default for any Delta tables.
Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new [**Deletion Vectors (DVs)**](/blog/iceberg-delta-lake-delete-methods-comparison/#how-deletion-vectors-work-in-iceberg-v3) feature is pretty clever: it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks enables DVs by default for Delta tables.

Iceberg takes a different approach with its **equality** and **position deletes** for V2. The new Format v3 introduces compact binary Deletion Vectors that reduce both read and write amplification, especially helpful for update-heavy tables.
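
For intuition, here is the kind of row-level operation both mechanisms exist to serve (Spark SQL, illustrative table name); whether it lands as position deletes, equality deletes, or deletion vectors depends on the table format and version in use:

```sql
-- Row-level delete on a lakehouse table: instead of rewriting whole data files,
-- the engine records which rows are logically deleted (delete files or deletion vectors)
DELETE FROM lake.db.orders
WHERE order_status = 'CANCELLED'
  AND order_date < DATE '2024-01-01';
```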

@@ -99,7 +99,7 @@ Check out query-engine support matrix [(here)](https://olake.io/iceberg/query-en

### Catalogs & Governance

Catalogs are like the brain (metadata-management + ACID) for lakehouses and its ecosystem is evolving fast. **Apache Polaris** (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines.
Catalogs are like the brain (metadata management + ACID) of a lakehouse, and the catalog ecosystem is evolving fast. [**Apache Polaris**](/blog/apache-polaris-lakehouse) (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines.

REST-based options like **Polaris, Gravitino, Lakekeeper, and Nessie** make Iceberg highly flexible; you can connect multiple warehouses and tools while maintaining a single table format, making multi-tool architectures easy and future-proof if **vendor neutrality** matters to you (you avoid being locked into a single vendor and keep ownership of cost, tooling, and performance in your own hands).

4 changes: 2 additions & 2 deletions blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx
@@ -26,7 +26,7 @@ For this setup, we're going to orchestrate four key components that work togethe
- **MySQL** - Our source database where all the transactional data lives
- **OLake** - The star of our ETL show, handling data replication
- **MinIO** - Our S3-compatible object storage acting as the data lake
- **PrestoDB** - The lightning-fast query engine for analytics
- [**PrestoDB**](/iceberg/query-engine/presto) - The lightning-fast query engine for analytics

What makes this architecture particularly elegant is how these components communicate through the Apache Iceberg table format, ensuring we get ACID transactions, schema evolution, and time travel capabilities right out of the box.
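
As a small, hedged example of what that buys you (Spark SQL syntax shown for illustration; Presto/Trino have their own equivalent time-travel clauses, and the table name is made up):

```sql
-- Query the table as it existed at an earlier point in time (Spark 3.3+ syntax)
SELECT *
FROM lake.db.orders TIMESTAMP AS OF '2025-07-21 00:00:00';
```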

@@ -113,7 +113,7 @@ This is the heart of our setup. Think of it as the conductor of an orchestra - i

**What it handles:**

- Spins up MySQL, MinIO, Iceberg REST catalog, and PrestoDB containers
- Spins up MySQL, MinIO, [Iceberg REST catalog](/docs/writers/iceberg/catalog/rest/?rest-catalog=generic), and PrestoDB containers
- Creates a private network so all services can find each other
- Maps ports so you can access web interfaces from your browser
- Sets up volume mounts for data persistence
2 changes: 1 addition & 1 deletion blog/2025-09-04-creating-job-olake-docker-cli.mdx
@@ -15,7 +15,7 @@ Today, there's no shortage of options—platforms like Fivetran, Airbyte, Debezi

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres Apache Iceberg (Glue catalog) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly.
In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from [Postgres to Apache Iceberg (Glue Catalog)](/iceberg/postgres-to-iceberg-using-glue) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly.

## Two Setup Styles (pick what fits you)

67 changes: 59 additions & 8 deletions blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx
@@ -15,6 +15,14 @@ This comprehensive guide will walk you through everything you need to know about

![OLake stream selection UI with Full Refresh + CDC mode for dz-stag-users table](/img/blog/2025/12/lakehouse-image.webp)

## Key Takeaways

- **Protect Production Performance**: Offload heavy analytical queries to Iceberg tables, keeping your PostgreSQL database responsive for application traffic
- **Real-time Logical Replication**: PostgreSQL WAL-based CDC streams changes to Iceberg with sub-second latency for up-to-date analytics
- **50-75% Cost Reduction**: Organizations report dramatic savings by moving analytics from expensive PostgreSQL RDS to cost-effective S3 + Iceberg architecture
- **Open Format Flexibility**: Store data once and query with any engine (Trino, Spark, DuckDB, Athena) - switch tools without data migration
- **Enterprise-Ready Reliability**: OLake handles schema evolution, CDC recovery, and state management automatically for production deployments

## Why PostgreSQL to Iceberg Replication is Essential for Modern Data Teams

### Unlock Scalable Real-Time Analytics Without Production Impact
@@ -25,11 +33,11 @@ Replicating PostgreSQL to Apache Iceberg transforms how organizations handle ope

**Near Real-Time Reporting Capabilities**: Keep your dashboards, reports, and analytics fresh with near real-time data synchronization, enabling faster decision-making and more responsive business operations.

**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like Apache Iceberg to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities.
**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like [Apache Iceberg](/iceberg/why-iceberg) to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities.

Traditional CDC pipelines that feed cloud data warehouses often become expensive, rigid, and difficult to manage when dealing with schema changes. With Postgres-to-Iceberg replication, you can decouple storage from compute, allowing you to:

- Choose the optimal compute engine for specific workloads (Trino, Spark, DuckDB, etc.)
- Choose the optimal compute engine for specific workloads ([Trino](/iceberg/olake-iceberg-trino), Spark, DuckDB, etc.)
- Store data once in cost-effective object storage and access it from anywhere
- Eliminate vendor lock-in while reducing overall warehouse expenses
- Support both batch and streaming data ingestion patterns
@@ -64,10 +72,10 @@ Apache Iceberg relies on robust metadata management for query performance optimi

### Prerequisites for Setting Up Your Replication Pipeline

Before beginning your PostgreSQL to Apache Iceberg migration, ensure you have the following components configured:
Before beginning your PostgreSQL to [Apache Iceberg](/iceberg/why-iceberg) migration, ensure you have the following components configured:

- Access to a PostgreSQL database with WAL (Write-Ahead Logging) enabled for CDC
- AWS Glue Catalog setup for Iceberg metadata management
- [AWS Glue Catalog](/docs/writers/iceberg/catalog/glue/) setup for Iceberg metadata management
- S3 bucket configured for Iceberg table data storage
- OLake UI deployed (locally or in your cloud environment)
- Docker, PostgreSQL credentials, and AWS S3 access configured
@@ -123,7 +131,7 @@ This begins tracking changes from the current WAL position. Ensure the publicati

OLake UI provides a web-based interface for managing replication jobs, data sources, destinations, and monitoring without requiring command-line interaction.

#### Quick Start Installation
#### [Quick Start Installation](/docs/getting-started/quickstart)

To install OLake UI using Docker and Docker Compose:

@@ -162,7 +170,7 @@ Configure your Apache Iceberg destination in the OLake UI:
- IAM credentials (optional if your instance has appropriate IAM roles)
- S3 bucket selection for Iceberg table storage

OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [OLake Catalogs Documentation](https://olake.io/docs/writers/iceberg/catalog/rest/).
OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [Catalog Compatibility Overview](/docs/understanding/compatibility-catalogs).

![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/12/step-4.webp)

@@ -192,7 +200,7 @@ For each stream, select the appropriate sync mode based on your requirements:

- **Normalization**: Disable for raw JSON data storage
- **Partitioning**: Configure regex patterns for Iceberg table partitioning
- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](https://olake.io/docs/writers/iceberg/partitioning)
- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](/docs/writers/iceberg/partitioning)

![OLake stream selection step with Full Refresh + CDC sync for dz-stag-users table](/img/blog/2025/12/step-5-2.webp)

@@ -285,7 +293,7 @@ The state.json file serves as the single source of truth for replication progres
One of the key advantages of Apache Iceberg's open format is compatibility with multiple query engines. Optimize your analytical workloads by:

- Using Apache Spark for large-scale batch processing and complex transformations
- Implementing Trino for interactive analytics and ad-hoc queries
- Implementing [Trino](/iceberg/olake-iceberg-trino) for interactive analytics and ad-hoc queries
- Deploying DuckDB for fast analytical queries on smaller datasets
- Integrating with AWS Athena for serverless SQL analytics

@@ -324,4 +332,47 @@ With OLake, you gain access to:
- Production-ready monitoring and management capabilities for enterprise deployments

The combination of PostgreSQL's reliability as an operational database and Apache Iceberg's analytical capabilities creates a powerful foundation for data-driven decision making. Whether you're building real-time dashboards, implementing advanced analytics, or developing machine learning pipelines, this replication strategy provides the scalability and flexibility modern organizations require.

## Frequently Asked Questions

### What's the difference between PostgreSQL and Apache Iceberg?

PostgreSQL is an OLTP database designed for transactional application workloads with fast row-based operations. Apache Iceberg is an open table format optimized for large-scale analytics with columnar storage, built for data lakes rather than operational databases.

### How does PostgreSQL logical replication work?

PostgreSQL writes all changes to a Write-Ahead Log (WAL). Logical replication reads this WAL using replication slots and publications, streaming INSERT, UPDATE, and DELETE operations to downstream systems like Iceberg in real-time without impacting database performance.
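
A hedged sketch of the server-side pieces involved (the publication and slot names are illustrative):

```sql
-- Logical replication requires wal_level = logical (a restart is needed after changing it)
SHOW wal_level;

-- A publication defines which tables are exposed to downstream consumers
CREATE PUBLICATION olake_pub FOR TABLE public.users, public.orders;

-- A replication slot makes the server retain WAL until the consumer has read it
SELECT * FROM pg_create_logical_replication_slot('olake_slot', 'pgoutput');
```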

### Do I need PostgreSQL superuser privileges for CDC?

No! While superuser simplifies setup, you only need specific privileges: the REPLICATION attribute and SELECT access on the tables you want to replicate. Cloud providers like AWS RDS and Google Cloud SQL support logical replication with limited-privilege accounts.
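
A minimal sketch of such a limited-privilege role (role name, password, and the managed-service grant are illustrative):

```sql
-- Self-managed PostgreSQL: a dedicated CDC role with only what it needs
CREATE ROLE olake_cdc WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO olake_cdc;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO olake_cdc;

-- On AWS RDS, the REPLICATION attribute is granted through the provider's role instead:
-- GRANT rds_replication TO olake_cdc;
```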

### Can I replicate PostgreSQL without enabling logical replication?

Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you can't modify WAL settings or create replication slots, you can still replicate data using standard PostgreSQL credentials with timestamp-based incremental updates.
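
Conceptually, a bookmark-based incremental sync boils down to a query like this hedged sketch (the table, column, and bookmark value are illustrative):

```sql
-- Fetch only rows changed since the last saved bookmark (e.g. the max updated_at seen so far)
SELECT *
FROM public.orders
WHERE updated_at > TIMESTAMP '2025-01-01 00:00:00'
ORDER BY updated_at;
```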

### How does OLake handle PostgreSQL schema changes?

OLake automatically detects [schema evolution](/docs/features/?tab=schema-evolution). When you add, drop, or modify columns in PostgreSQL, these changes propagate to Iceberg tables without breaking your pipeline. The state management ensures schema and data stay synchronized.
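
For example, an additive source-side change like the following (illustrative column) simply shows up as a new, nullable column in the downstream Iceberg table:

```sql
-- Source-side schema change in PostgreSQL; the Iceberg schema evolves additively
ALTER TABLE public.users ADD COLUMN loyalty_tier text;
```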

### What happens if my PostgreSQL WAL fills up?

Proper replication slot monitoring is crucial. If OLake falls behind, PostgreSQL retains WAL files until they're consumed. OLake provides lag monitoring and automatic recovery to prevent WAL bloat, but you should set appropriate WAL retention limits.
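
A hedged monitoring sketch (the retention cap value is illustrative, and max_slot_wal_keep_size requires PostgreSQL 13 or newer):

```sql
-- How much WAL each replication slot is forcing the server to retain
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Cap slot-driven WAL retention so a stalled consumer cannot fill the disk (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```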

### How do I handle large PostgreSQL databases for initial load?

OLake uses intelligent chunking strategies (CTID-based or batch splits) to load data in parallel without locking tables. A 1TB PostgreSQL database typically loads in 4-8 hours depending on network and storage performance, and the process can be paused/resumed.
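
The CTID-based approach amounts to splitting the table into physical block ranges that independent workers can scan in parallel; here is a hedged sketch of one such chunk (PostgreSQL 14+ supports range comparisons on ctid, and the range bounds are illustrative):

```sql
-- Read one physical block range of the table; other workers take adjacent ranges
SELECT *
FROM public.orders
WHERE ctid >= '(0,0)'::tid
  AND ctid <  '(50000,0)'::tid;
```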

### What query engines work with PostgreSQL-sourced Iceberg tables?

Any Iceberg-compatible engine: [Apache Spark](https://olake.io/iceberg/query-engine/spark) for batch processing, [Trino](https://olake.io/iceberg/query-engine/trino)/[Presto](https://olake.io/iceberg/query-engine/presto) for interactive queries, [DuckDB](https://olake.io/iceberg/query-engine/duckdb) for fast analytical workloads, [AWS Athena](https://olake.io/iceberg/query-engine/athena) for serverless SQL, [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others - all querying the same data.

### Can I replicate specific PostgreSQL tables or schemas?

Yes! OLake lets you select specific tables, schemas, or even filter rows using SQL WHERE clauses. This selective replication reduces storage costs and improves query performance by replicating only the data you need for analytics.

### What's the cost comparison between PostgreSQL RDS and Iceberg on S3?

PostgreSQL RDS storage costs ~$0.115/GB/month plus compute charges that run 24/7. Iceberg on S3 costs ~$0.023/GB/month (5x cheaper) with compute costs only when querying. Organizations typically save 50-75% on analytics infrastructure.
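
As a rough worked example at 1 TB: about 1,000 GB × $0.115 ≈ $115/month for RDS storage alone (with compute billed around the clock on top), versus roughly 1,000 GB × $0.023 ≈ $23/month on S3, with query compute paid only when you actually run queries.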

<BlogCTA/>