Merged
24 commits
7fcc1be  Improve content readability across blog posts and query engine pages (Akshay-datazip, Oct 31, 2025)
c6ee307  Add internal linking across homepage and documentation (Akshay-datazip, Oct 31, 2025)
4c922b6  Add additional internal linking - Part 1 (Akshay-datazip, Oct 31, 2025)
10bb244  Add additional internal linking - Part 2 (Akshay-datazip, Oct 31, 2025)
3e8bf22  Add additional internal linking - Part 3 (Akshay-datazip, Oct 31, 2025)
1b9736b  Add internal linking - Part 4: Connectors (Akshay-datazip, Oct 31, 2025)
6322c28  Add internal linking - Part 5: PostgreSQL blog post (Akshay-datazip, Oct 31, 2025)
b01eb0b  Add internal linking - Part 6: MongoDB & MySQL blog posts (Akshay-datazip, Oct 31, 2025)
c9d6ebb  Add internal linking - Part 7: Additional blog posts (25% milestone) (Akshay-datazip, Oct 31, 2025)
04746ff  Add internal linking - Part 8: Remaining blog posts (50% milestone) (Akshay-datazip, Oct 31, 2025)
f6014f5  Add internal linking - Part 9: Query engine docs (75% milestone) (Akshay-datazip, Oct 31, 2025)
0ec74d4  Add internal linking - Part 10: Remaining query engine docs (90% mile… (Akshay-datazip, Oct 31, 2025)
61dc408  Add internal linking - Part 11: Iceberg integration docs (100% complete) (Akshay-datazip, Oct 31, 2025)
a8d7e04  Fix 404 links: Remove links not in CSV and fix iceberg-metadata path (Akshay-datazip, Nov 5, 2025)
880b838  Fix athena.mdx: Remove invalid markdown link from plain string descri… (Akshay-datazip, Nov 5, 2025)
dec5159  Restore AWS Glue Data Catalog link in athena.mdx description (Akshay-datazip, Nov 5, 2025)
e92fc22  Final review: Add missing anchor text links from CSV (Akshay-datazip, Nov 5, 2025)
dfb6da8  Add Trino link in Resources section (CSV line 83) (Akshay-datazip, Nov 5, 2025)
e0feef8  updated backlinks with the new doc (Akshay-datazip, Nov 6, 2025)
5344568  Merge branch 'master' into internal-linking (Akshay-datazip, Nov 6, 2025)
767614a  Fix all 404 internal links - update broken paths to correct documenta… (Akshay-datazip, Nov 6, 2025)
ac79d49  Fix internal links: update colors, remove unwanted links, add time tr… (Akshay-datazip, Nov 14, 2025)
caa78a5  Refactor: Create reusable blue-link CSS class and fix white link styling (Akshay-datazip, Nov 20, 2025)
f60bcf5  Convert markdown links to JSX Link components in Flink and Spark tabl… (Akshay-datazip, Nov 20, 2025)
2 changes: 1 addition & 1 deletion blog/2024-11-22-debezium-vs-olake.mdx
@@ -12,7 +12,7 @@ tags: [debezium]
![OLake platform: Change data from MySQL, MongoDB, PostgreSQL flows to OLake, processed and stored in S3 and Iceberg](/img/blog/cover/debezium-vs-olake-cover.webp)


Change Data Capture (CDC) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Qurkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC.
[Change Data Capture (CDC)](/blog/olake-architecture-deep-dive/#cdc-sync) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Quarkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC.

However, while this combination offers powerful capabilities, it also comes with significant drawbacks that can impact your organization's efficiency and resources.

2 changes: 1 addition & 1 deletion blog/2025-03-18-binlogs.mdx
@@ -13,7 +13,7 @@ tags: [olake]


### What Are Binlogs?
Binary logs in MySQL are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don't log `SELECT` statements or other read-only operations.
Binary logs in [MySQL](/blog/mysql-apache-iceberg-replication) are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don't log `SELECT` statements or other read-only operations.

Binary logs in MySQL are a feature that allows the recording of all changes made to the database in a structured binary format. The binary log files contain a chronological record of SQL statements or row-level changes that modify the data in the database. They are primarily used for tasks such as replication, point-in-time recovery, auditing, and data analysis.
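
As a quick, hedged illustration (run against your own server; the output values will differ), you can confirm binary logging is on and that it uses the row-based format that CDC tools generally require:

```sql
-- Is binary logging enabled on this MySQL server?
SHOW VARIABLES LIKE 'log_bin';

-- CDC readers typically need ROW format rather than STATEMENT or MIXED
SHOW VARIABLES LIKE 'binlog_format';

-- List the binlog files the server currently retains
SHOW BINARY LOGS;
```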

6 changes: 3 additions & 3 deletions blog/2025-03-18-json-vs-bson-vs-jsonb.mdx
@@ -34,10 +34,10 @@ While JSON is perfect for many use cases, it has limitations:
- **Untyped Values**: All values in JSON are strings when transmitted, meaning systems need to parse and interpret types during runtime.

### 2. What is BSON?
**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON's capabilities by adding support for more complex data types and structures.
**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON's capabilities by adding support for more complex data types and structures.

#### Why BSON Exists
As MongoDB began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent. BSON was developed to:
As [MongoDB](/docs/connectors/mongodb/setup/local/) began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent. BSON was developed to:
- **Handle Complex Data Types**: BSON supports more than just strings and numbers. It can store native types like dates, binary data, and embedded arrays or objects efficiently.
- **Optimize for Database Operations**: BSON is designed to be lightweight but still allow for fast queries and indexing inside a database like MongoDB.
- **Better for Large-Scale Data**: BSON was created to offer faster data reads/writes and a more compact size when dealing with large documents.
@@ -59,7 +59,7 @@ As MongoDB began to rise in popularity as a NoSQL document store, the need for a


### 3. What is JSONB?
**JSONB** (Binary JSON) is a format introduced by PostgreSQL to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context.
**JSONB** (Binary JSON) is a format introduced by [PostgreSQL](/docs/connectors/postgres) to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context.
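
To make this concrete, here is a minimal illustrative sketch (table and column names are made up for the example) of storing, querying, and indexing JSONB in PostgreSQL:

```sql
-- Store semi-structured payloads in a JSONB column
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

INSERT INTO events (payload)
VALUES ('{"user": "alice", "action": "login", "ts": "2025-01-01T10:00:00Z"}');

-- ->> extracts a field as text; @> tests JSON containment
SELECT payload->>'user' AS user_name
FROM events
WHERE payload @> '{"action": "login"}';

-- A GIN index speeds up containment queries, something plain JSON columns cannot offer
CREATE INDEX idx_events_payload ON events USING GIN (payload);
```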

#### Why JSONB Exists
JSONB was created to provide a fast, efficient way to store and query JSON-like documents within PostgreSQL. Regular JSON in relational databases comes with several downsides, such as slower queries and no support for indexing. JSONB was introduced to address these problems by offering:
2 changes: 1 addition & 1 deletion blog/2025-07-29-next-gen-lakehouse.mdx
@@ -36,7 +36,7 @@ Query your data "as of" last Tuesday for audits or bug fixes.
**Hidden Partitioning** — Iceberg tracks which files hold which dates, so queries auto-skip irrelevant chunks, no brittle dt='2025-07-21' filters required.
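
A hedged sketch of what hidden partitioning looks like in practice (Spark SQL syntax; the catalog, table, and column names are illustrative):

```sql
-- Partition by day, derived from a timestamp column; Iceberg tracks the mapping
CREATE TABLE lake.db.page_views (
    user_id  BIGINT,
    url      STRING,
    event_ts TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Readers filter on the raw column; Iceberg prunes irrelevant files automatically,
-- no manual dt='2025-07-21' predicate required
SELECT count(*)
FROM lake.db.page_views
WHERE event_ts >= TIMESTAMP '2025-07-21 00:00:00'
  AND event_ts <  TIMESTAMP '2025-07-22 00:00:00';
```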


Most importantly it is **engine-agnostic** you might have heard this term a lot and here it gets a meaning iceberg supports Spark, Trino, Flink, DuckDB, Dremio, and Snowflake all speak to the tables natively
Most importantly, it is **engine-agnostic**. You might have heard this term a lot, and here it gets a concrete meaning: Iceberg supports [Spark](/iceberg/query-engine/spark), [Trino](/iceberg/query-engine/trino), Flink, [DuckDB](/iceberg/query-engine/duckdb), Dremio, and Snowflake, and they all speak to the tables natively.


Before Iceberg, data lakes were basically digital junkyards. You'd dump data files into cloud storage (like Amazon S3), and finding anything useful was like looking for a specific needle in a haystack.
4 changes: 2 additions & 2 deletions blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx
@@ -33,7 +33,7 @@ Performance is often the deciding factor and rightly so. The good news? Both for

### File Layout & Updates

Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new **Deletion Vectors (DVs)** feature is pretty clever, it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks offers DVs as a default for any Delta tables.
Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new [**Deletion Vectors (DVs)**](/blog/iceberg-delta-lake-delete-methods-comparison/#how-deletion-vectors-work-in-iceberg-v3) feature is pretty clever: it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks enables DVs by default for Delta tables.

Iceberg takes a different approach with its **equality** and **position deletes** for V2. The new Format v3 introduces compact binary Deletion Vectors that reduce both read and write amplification, especially helpful for update-heavy tables.
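
For intuition, here is the kind of row-level operation both mechanisms exist to serve (Spark SQL, illustrative table name); whether it lands as position deletes, equality deletes, or deletion vectors depends on the table format and version in use:

```sql
-- Row-level delete on a lakehouse table: instead of rewriting whole data files,
-- the engine records which rows are logically deleted (delete files or deletion vectors)
DELETE FROM lake.db.orders
WHERE order_status = 'CANCELLED'
  AND order_date < DATE '2024-01-01';
```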

@@ -99,7 +99,7 @@ Check out query-engine support matrix [(here)](https://olake.io/iceberg/query-en

### Catalogs & Governance

Catalogs are like the brain (metadata-management + ACID) for lakehouses and its ecosystem is evolving fast. **Apache Polaris** (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines.
Catalogs are like the brain (metadata management + ACID) of a lakehouse, and the catalog ecosystem is evolving fast. [**Apache Polaris**](/blog/apache-polaris-lakehouse) (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines.

REST-based options like **Polaris, Gravitino, Lakekeeper, and Nessie** make Iceberg highly flexible; you can connect multiple warehouses and tools while maintaining a single table format, making multi-tool architectures easy and future-proof if **vendor neutrality** matters to you (you avoid being locked into a single vendor and keep ownership of cost, tooling, and performance in your own hands).

4 changes: 2 additions & 2 deletions blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx
@@ -26,7 +26,7 @@ For this setup, we're going to orchestrate four key components that work togethe
- **MySQL** - Our source database where all the transactional data lives
- **OLake** - The star of our ETL show, handling data replication
- **MinIO** - Our S3-compatible object storage acting as the data lake
- **PrestoDB** - The lightning-fast query engine for analytics
- [**PrestoDB**](/iceberg/query-engine/presto) - The lightning-fast query engine for analytics

What makes this architecture particularly elegant is how these components communicate through the Apache Iceberg table format, ensuring we get ACID transactions, schema evolution, and time travel capabilities right out of the box.
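
As a small, hedged example of what that buys you (Spark SQL syntax shown for illustration; Presto/Trino have their own equivalent time-travel clauses, and the table name is made up):

```sql
-- Query the table as it existed at an earlier point in time (Spark 3.3+ syntax)
SELECT *
FROM lake.db.orders TIMESTAMP AS OF '2025-07-21 00:00:00';
```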

@@ -113,7 +113,7 @@ This is the heart of our setup. Think of it as the conductor of an orchestra - i

**What it handles:**

- Spins up MySQL, MinIO, Iceberg REST catalog, and PrestoDB containers
- Spins up MySQL, MinIO, [Iceberg REST catalog](/docs/writers/iceberg/catalog/rest/?rest-catalog=generic), and PrestoDB containers
- Creates a private network so all services can find each other
- Maps ports so you can access web interfaces from your browser
- Sets up volume mounts for data persistence
2 changes: 1 addition & 1 deletion blog/2025-09-04-creating-job-olake-docker-cli.mdx
@@ -15,7 +15,7 @@ Today, there's no shortage of options—platforms like Fivetran, Airbyte, Debezi

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres Apache Iceberg (Glue catalog) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly.
In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from [Postgres to Apache Iceberg (Glue Catalog)](/iceberg/postgres-to-iceberg-using-glue) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly.

## Two Setup Styles (pick what fits you)

67 changes: 59 additions & 8 deletions blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx
@@ -15,6 +15,14 @@ This comprehensive guide will walk you through everything you need to know about

![OLake stream selection UI with Full Refresh + CDC mode for dz-stag-users table](/img/blog/2025/12/lakehouse-image.webp)

## Key Takeaways

- **Protect Production Performance**: Offload heavy analytical queries to Iceberg tables, keeping your PostgreSQL database responsive for application traffic
- **Real-time Logical Replication**: PostgreSQL WAL-based CDC streams changes to Iceberg with sub-second latency for up-to-date analytics
- **50-75% Cost Reduction**: Organizations report dramatic savings by moving analytics from expensive PostgreSQL RDS to cost-effective S3 + Iceberg architecture
- **Open Format Flexibility**: Store data once and query with any engine (Trino, Spark, DuckDB, Athena) - switch tools without data migration
- **Enterprise-Ready Reliability**: OLake handles schema evolution, CDC recovery, and state management automatically for production deployments

## Why PostgreSQL to Iceberg Replication is Essential for Modern Data Teams

### Unlock Scalable Real-Time Analytics Without Production Impact
@@ -25,11 +33,11 @@ Replicating PostgreSQL to Apache Iceberg transforms how organizations handle ope

**Near Real-Time Reporting Capabilities**: Keep your dashboards, reports, and analytics fresh with near real-time data synchronization, enabling faster decision-making and more responsive business operations.

**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like Apache Iceberg to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities.
**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like [Apache Iceberg](/iceberg/why-iceberg) to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities.

Traditional CDC pipelines that feed cloud data warehouses often become expensive, rigid, and difficult to manage when dealing with schema changes. With Postgres-to-Iceberg replication, you can decouple storage from compute, allowing you to:

- Choose the optimal compute engine for specific workloads (Trino, Spark, DuckDB, etc.)
- Choose the optimal compute engine for specific workloads ([Trino](/iceberg/olake-iceberg-trino), Spark, DuckDB, etc.)
- Store data once in cost-effective object storage and access it from anywhere
- Eliminate vendor lock-in while reducing overall warehouse expenses
- Support both batch and streaming data ingestion patterns
@@ -64,10 +72,10 @@ Apache Iceberg relies on robust metadata management for query performance optimi

### Prerequisites for Setting Up Your Replication Pipeline

Before beginning your PostgreSQL to Apache Iceberg migration, ensure you have the following components configured:
Before beginning your PostgreSQL to [Apache Iceberg](/iceberg/why-iceberg) migration, ensure you have the following components configured:

- Access to a PostgreSQL database with WAL (Write-Ahead Logging) enabled for CDC
- AWS Glue Catalog setup for Iceberg metadata management
- [AWS Glue Catalog](/docs/writers/iceberg/catalog/glue/) setup for Iceberg metadata management
- S3 bucket configured for Iceberg table data storage
- OLake UI deployed (locally or in your cloud environment)
- Docker, PostgreSQL credentials, and AWS S3 access configured
@@ -123,7 +131,7 @@ This begins tracking changes from the current WAL position. Ensure the publicati

OLake UI provides a web-based interface for managing replication jobs, data sources, destinations, and monitoring without requiring command-line interaction.

#### Quick Start Installation
#### [Quick Start Installation](/docs/getting-started/quickstart)

To install OLake UI using Docker and Docker Compose:

@@ -162,7 +170,7 @@ Configure your Apache Iceberg destination in the OLake UI:
- IAM credentials (optional if your instance has appropriate IAM roles)
- S3 bucket selection for Iceberg table storage

OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [OLake Catalogs Documentation](https://olake.io/docs/writers/iceberg/catalog/rest/).
OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [Catalog Compatibility Overview](/docs/understanding/compatibility-catalogs).

![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/12/step-4.webp)

@@ -192,7 +200,7 @@ For each stream, select the appropriate sync mode based on your requirements:

- **Normalization**: Disable for raw JSON data storage
- **Partitioning**: Configure regex patterns for Iceberg table partitioning
- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](https://olake.io/docs/writers/iceberg/partitioning)
- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](/docs/writers/iceberg/partitioning)

![OLake stream selection step with Full Refresh + CDC sync for dz-stag-users table](/img/blog/2025/12/step-5-2.webp)

@@ -285,7 +293,7 @@ The state.json file serves as the single source of truth for replication progres
One of the key advantages of Apache Iceberg's open format is compatibility with multiple query engines. Optimize your analytical workloads by:

- Using Apache Spark for large-scale batch processing and complex transformations
- Implementing Trino for interactive analytics and ad-hoc queries
- Implementing [Trino](/iceberg/olake-iceberg-trino) for interactive analytics and ad-hoc queries
- Deploying DuckDB for fast analytical queries on smaller datasets
- Integrating with AWS Athena for serverless SQL analytics

@@ -324,4 +332,47 @@ With OLake, you gain access to:
- Production-ready monitoring and management capabilities for enterprise deployments

The combination of PostgreSQL's reliability as an operational database and Apache Iceberg's analytical capabilities creates a powerful foundation for data-driven decision making. Whether you're building real-time dashboards, implementing advanced analytics, or developing machine learning pipelines, this replication strategy provides the scalability and flexibility modern organizations require.

## Frequently Asked Questions

### What's the difference between PostgreSQL and Apache Iceberg?

PostgreSQL is an OLTP database designed for transactional application workloads with fast row-based operations. Apache Iceberg is an open table format optimized for large-scale analytics with columnar storage, built for data lakes rather than operational databases.

### How does PostgreSQL logical replication work?

PostgreSQL writes all changes to a Write-Ahead Log (WAL). Logical replication reads this WAL using replication slots and publications, streaming INSERT, UPDATE, and DELETE operations to downstream systems like Iceberg in real-time without impacting database performance.
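
A hedged sketch of the server-side pieces involved (the publication and slot names are illustrative):

```sql
-- Logical replication requires wal_level = logical (a restart is needed after changing it)
SHOW wal_level;

-- A publication defines which tables are exposed to downstream consumers
CREATE PUBLICATION olake_pub FOR TABLE public.users, public.orders;

-- A replication slot makes the server retain WAL until the consumer has read it
SELECT * FROM pg_create_logical_replication_slot('olake_slot', 'pgoutput');
```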

### Do I need PostgreSQL superuser privileges for CDC?

No! While superuser simplifies setup, you only need specific privileges: the REPLICATION attribute and SELECT access on the tables you want to replicate. Cloud providers like AWS RDS and Google Cloud SQL support logical replication with limited-privilege accounts.
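
A minimal sketch of such a limited-privilege role (role name, password, and the managed-service grant are illustrative):

```sql
-- Self-managed PostgreSQL: a dedicated CDC role with only what it needs
CREATE ROLE olake_cdc WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO olake_cdc;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO olake_cdc;

-- On AWS RDS, the REPLICATION attribute is granted through the provider's role instead:
-- GRANT rds_replication TO olake_cdc;
```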

### Can I replicate PostgreSQL without enabling logical replication?

Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you can't modify WAL settings or create replication slots, you can still replicate data using standard PostgreSQL credentials with timestamp-based incremental updates.
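
Conceptually, a bookmark-based incremental sync boils down to a query like this hedged sketch (the table, column, and bookmark value are illustrative):

```sql
-- Fetch only rows changed since the last saved bookmark (e.g. the max updated_at seen so far)
SELECT *
FROM public.orders
WHERE updated_at > TIMESTAMP '2025-01-01 00:00:00'
ORDER BY updated_at;
```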

### How does OLake handle PostgreSQL schema changes?

OLake automatically detects [schema evolution](/docs/features/?tab=schema-evolution). When you add, drop, or modify columns in PostgreSQL, these changes propagate to Iceberg tables without breaking your pipeline. The state management ensures schema and data stay synchronized.
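
For example, an additive source-side change like the following (illustrative column) simply shows up as a new, nullable column in the downstream Iceberg table:

```sql
-- Source-side schema change in PostgreSQL; the Iceberg schema evolves additively
ALTER TABLE public.users ADD COLUMN loyalty_tier text;
```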

### What happens if my PostgreSQL WAL fills up?

Proper replication slot monitoring is crucial. If OLake falls behind, PostgreSQL retains WAL files until they're consumed. OLake provides lag monitoring and automatic recovery to prevent WAL bloat, but you should set appropriate WAL retention limits.
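
A hedged monitoring sketch (the retention cap value is illustrative, and max_slot_wal_keep_size requires PostgreSQL 13 or newer):

```sql
-- How much WAL each replication slot is forcing the server to retain
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Cap slot-driven WAL retention so a stalled consumer cannot fill the disk (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```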

### How do I handle large PostgreSQL databases for initial load?

OLake uses intelligent chunking strategies (CTID-based or batch splits) to load data in parallel without locking tables. A 1TB PostgreSQL database typically loads in 4-8 hours depending on network and storage performance, and the process can be paused/resumed.
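
The CTID-based approach amounts to splitting the table into physical block ranges that independent workers can scan in parallel; here is a hedged sketch of one such chunk (PostgreSQL 14+ supports range comparisons on ctid, and the range bounds are illustrative):

```sql
-- Read one physical block range of the table; other workers take adjacent ranges
SELECT *
FROM public.orders
WHERE ctid >= '(0,0)'::tid
  AND ctid <  '(50000,0)'::tid;
```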

### What query engines work with PostgreSQL-sourced Iceberg tables?

Any Iceberg-compatible engine: [Apache Spark](https://olake.io/iceberg/query-engine/spark) for batch processing, [Trino](https://olake.io/iceberg/query-engine/trino)/[Presto](https://olake.io/iceberg/query-engine/presto) for interactive queries, [DuckDB](https://olake.io/iceberg/query-engine/duckdb) for fast analytical workloads, [AWS Athena](https://olake.io/iceberg/query-engine/athena) for serverless SQL, [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others - all querying the same data.

### Can I replicate specific PostgreSQL tables or schemas?

Yes! OLake lets you select specific tables, schemas, or even filter rows using SQL WHERE clauses. This selective replication reduces storage costs and improves query performance by replicating only the data you need for analytics.

### What's the cost comparison between PostgreSQL RDS and Iceberg on S3?

PostgreSQL RDS storage costs ~$0.115/GB/month plus compute charges that run 24/7. Iceberg on S3 costs ~$0.023/GB/month (5x cheaper) with compute costs only when querying. Organizations typically save 50-75% on analytics infrastructure.
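
As a rough worked example at 1 TB: about 1,000 GB × $0.115 ≈ $115/month for RDS storage alone (with compute billed around the clock on top), versus roughly 1,000 GB × $0.023 ≈ $23/month on S3, with query compute paid only when you actually run queries.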

<BlogCTA/>