diff --git a/blog/2024-11-22-debezium-vs-olake.mdx b/blog/2024-11-22-debezium-vs-olake.mdx index 359d61dc..d86ff739 100644 --- a/blog/2024-11-22-debezium-vs-olake.mdx +++ b/blog/2024-11-22-debezium-vs-olake.mdx @@ -12,7 +12,7 @@ tags: [debezium] ![OLake platform: Change data from MySQL, MongoDB, PostgreSQL flows to OLake, processed and stored in S3 and Iceberg](/img/blog/cover/debezium-vs-olake-cover.webp) -Change Data Capture (CDC) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Qurkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC. +[Change Data Capture (CDC)](/blog/olake-architecture-deep-dive/#cdc-sync) is essential for modern data architectures that require real-time data replication and synchronization across systems. Debezium (a Java utility based on the Quarkus framework), coupled with Apache Kafka, has become a popular open-source solution for implementing CDC. However, while this combination offers powerful capabilities, it also comes with significant drawbacks that can impact your organization's efficiency and resources. diff --git a/blog/2025-03-18-binlogs.mdx b/blog/2025-03-18-binlogs.mdx index cc522e31..cc09de02 100644 --- a/blog/2025-03-18-binlogs.mdx +++ b/blog/2025-03-18-binlogs.mdx @@ -13,7 +13,7 @@ tags: [olake] ### What Are Binlogs? -Binary logs in MySQL are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don’t log `SELECT` statements or other read-only operations. +Binary logs in [MySQL](/blog/mysql-apache-iceberg-replication) are files that log all changes made to your database. These logs record every operation that modifies data (like `INSERT`, `UPDATE`, `DELETE` statements). They don't log `SELECT` statements or other read-only operations. Binary logs in MySQL are a feature that allows the recording of all changes made to the database in a structured binary format. The binary log files contain a chronological record of SQL statements or row-level changes that modify the data in the database. They are primarily used for tasks such as replication, point-in-time recovery, auditing, and data analysis. diff --git a/blog/2025-03-18-json-vs-bson-vs-jsonb.mdx b/blog/2025-03-18-json-vs-bson-vs-jsonb.mdx index f230d99a..361da907 100644 --- a/blog/2025-03-18-json-vs-bson-vs-jsonb.mdx +++ b/blog/2025-03-18-json-vs-bson-vs-jsonb.mdx @@ -34,10 +34,10 @@ While JSON is perfect for many use cases, it has limitations: - **Untyped Values**: All values in JSON are strings when transmitted, meaning systems need to parse and interpret types during runtime. ### 2. What is BSON? -**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON’s capabilities by adding support for more complex data types and structures. +**BSON** (Binary JSON) is a binary-encoded serialization of JSON-like documents, originally created for MongoDB. BSON extends JSON's capabilities by adding support for more complex data types and structures. #### Why BSON Exists -As MongoDB began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent.
BSON was developed to: +As [MongoDB](/docs/connectors/mongodb/setup/local/) began to rise in popularity as a NoSQL document store, the need for a more efficient, flexible format than plain JSON became apparent. BSON was developed to: - **Handle Complex Data Types**: BSON supports more than just strings and numbers. It can store native types like dates, binary data, and embedded arrays or objects efficiently. - **Optimize for Database Operations**: BSON is designed to be lightweight but still allow for fast queries and indexing inside a database like MongoDB. - **Better for Large-Scale Data**: BSON was created to offer faster data reads/writes and a more compact size when dealing with large documents. @@ -59,7 +59,7 @@ As MongoDB began to rise in popularity as a NoSQL document store, the need for a ### 3. What is JSONB? -**JSONB** (Binary JSON) is a format introduced by PostgreSQL to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context. +**JSONB** (Binary JSON) is a format introduced by [PostgreSQL](/docs/connectors/postgres) to store JSON data in a binary format, combining the benefits of both JSON and BSON in a relational database context. #### Why JSONB Exists JSONB was created to provide a fast, efficient way to store and query JSON-like documents within PostgreSQL. Regular JSON in relational databases comes with several downsides, such as slower queries and no support for indexing. JSONB was introduced to address these problems by offering: diff --git a/blog/2025-07-29-next-gen-lakehouse.mdx b/blog/2025-07-29-next-gen-lakehouse.mdx index a11aad24..06bc5241 100644 --- a/blog/2025-07-29-next-gen-lakehouse.mdx +++ b/blog/2025-07-29-next-gen-lakehouse.mdx @@ -36,7 +36,7 @@ Query your data "as of" last Tuesday for audits or bug fixes. **Hidden Partitioning** — Iceberg tracks which files hold which dates, so queries auto-skip irrelevant chunks, no brittle dt='2025-07-21' filters required. -Most importantly it is **engine-agnostic** you might have heard this term a lot and here it gets a meaning iceberg supports Spark, Trino, Flink, DuckDB, Dremio, and Snowflake all speak to the tables natively +Most importantly, Iceberg is **engine-agnostic**. You might have heard this term a lot, and here it gets a concrete meaning: [Spark](/iceberg/query-engine/spark), [Trino](/iceberg/query-engine/trino), Flink, [DuckDB](/iceberg/query-engine/duckdb), Dremio, and Snowflake all speak to its tables natively. Before Iceberg, data lakes were basically digital junkyards. You'd dump data files into cloud storage (like Amazon S3), and finding anything useful was like looking for a specific needle in a haystack . diff --git a/blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx b/blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx index 67c18abb..61280ee6 100644 --- a/blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx +++ b/blog/2025-07-31-apache-iceberg-vs-delta-lake-guide.mdx @@ -33,7 +33,7 @@ Performance is often the deciding factor and rightly so. The good news? Both for ### File Layout & Updates -Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new **Deletion Vectors (DVs)** feature is pretty clever, it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks offers DVs as a default for any Delta tables.
+Delta Lake uses a **copy-on-write** approach by default for the open-source version. When you need to update data, it creates new files and marks the old ones for deletion. The new [**Deletion Vectors (DVs)**](/blog/iceberg-delta-lake-delete-methods-comparison/#how-deletion-vectors-work-in-iceberg-v3) feature is pretty clever: it marks row-level changes without immediately rewriting entire files, which saves you from write amplification headaches. Databricks offers DVs as a default for any Delta tables. Iceberg takes a different approach with its **equality** and **position deletes** for V2. The new Format v3 introduces compact binary Deletion Vectors that reduce both read and write amplification, especially helpful for update-heavy tables. @@ -99,7 +99,7 @@ Check out query-engine support matrix [(here)](https://olake.io/iceberg/query-en ### Catalogs & Governance -Catalogs are like the brain (metadata-management + ACID) for lakehouses and its ecosystem is evolving fast. **Apache Polaris** (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines. +Catalogs are like the brain (metadata-management + ACID) for lakehouses, and the catalog ecosystem is evolving fast. [**Apache Polaris**](/blog/apache-polaris-lakehouse) (incubating) now unifies Iceberg and Delta Lake tables in one open-source catalog, delivering vendor-neutral management and robust RBAC governance across major query engines. REST-based options like **Polaris, Gravitino, Lakekeeper, and Nessie** make Iceberg highly flexible; you can connect multiple warehouses and tools while maintaining a single table format, making multi-tool architectures easy and future-proof if **vendor neutrality** matters to you (you can avoid being locked-in into one single vendor and take ownership of cost, tools, performance in your own hands.) diff --git a/blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx b/blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx index a612b841..734ef8f9 100644 --- a/blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx +++ b/blog/2025-08-12-building-open-data-lakehouse-from-scratch.mdx @@ -26,7 +26,7 @@ For this setup, we're going to orchestrate four key components that work togethe - **MySQL** - Our source database where all the transactional data lives - **OLake** - The star of our ETL show, handling data replication - **MinIO** - Our S3-compatible object storage acting as the data lake -- **PrestoDB** - The lightning-fast query engine for analytics +- [**PrestoDB**](/iceberg/query-engine/presto) - The lightning-fast query engine for analytics What makes this architecture particularly elegant is how these components communicate through Apache Iceberg table format, ensuring we get ACID transactions, schema evolution, and time travel capabilities right out of the box. @@ -113,7 +113,7 @@ This is the heart of our setup.
Think of it as the conductor of an orchestra - i **What it handles:** -- Spins up MySQL, MinIO, Iceberg REST catalog, and PrestoDB containers +- Spins up MySQL, MinIO, [Iceberg REST catalog](/docs/writers/iceberg/catalog/rest/?rest-catalog=generic), and PrestoDB containers - Creates a private network so all services can find each other - Maps ports so you can access web interfaces from your browser - Sets up volume mounts for data persistence diff --git a/blog/2025-09-04-creating-job-olake-docker-cli.mdx b/blog/2025-09-04-creating-job-olake-docker-cli.mdx index 6ab9390a..3e742b06 100644 --- a/blog/2025-09-04-creating-job-olake-docker-cli.mdx +++ b/blog/2025-09-04-creating-job-olake-docker-cli.mdx @@ -15,7 +15,7 @@ Today, there's no shortage of options—platforms like Fivetran, Airbyte, Debezi That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control. -In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres → Apache Iceberg (Glue catalog) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly. +In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from [Postgres to Apache Iceberg (Glue Catalog)](/iceberg/postgres-to-iceberg-using-glue) with CDC, normalization, filters, partitioning, and scheduling—all running seamlessly. 
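Before wiring up the job, it is worth confirming the Postgres source is actually CDC-ready. A minimal SQL sketch of those checks is below; `olake_publication` and `olake_slot` are placeholder names, and a managed Postgres (RDS, Cloud SQL, etc.) may need parameter-group changes to set `wal_level`:

```sql
-- CDC requires logical decoding; this should return 'logical'
SHOW wal_level;

-- A publication tells Postgres which tables to stream changes for
CREATE PUBLICATION olake_publication FOR ALL TABLES;

-- A logical replication slot (pgoutput plugin) tracks how far the reader has consumed the WAL
SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');

-- Verify the slot exists; watch it later so retained WAL doesn't pile up
SELECT slot_name, plugin, active FROM pg_replication_slots;
```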
## Two Setup Styles (pick what fits you) diff --git a/blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx b/blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx index e07990ef..e1e85c48 100644 --- a/blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx +++ b/blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx @@ -15,6 +15,14 @@ This comprehensive guide will walk you through everything you need to know about ![OLake stream selection UI with Full Refresh + CDC mode for dz-stag-users table](/img/blog/2025/12/lakehouse-image.webp) +## Key Takeaways + +- **Protect Production Performance**: Offload heavy analytical queries to Iceberg tables, keeping your PostgreSQL database responsive for application traffic +- **Real-time Logical Replication**: PostgreSQL WAL-based CDC streams changes to Iceberg with sub-second latency for up-to-date analytics +- **50-75% Cost Reduction**: Organizations report dramatic savings by moving analytics from expensive PostgreSQL RDS to cost-effective S3 + Iceberg architecture +- **Open Format Flexibility**: Store data once and query with any engine (Trino, Spark, DuckDB, Athena) - switch tools without data migration +- **Enterprise-Ready Reliability**: OLake handles schema evolution, CDC recovery, and state management automatically for production deployments + ## Why PostgreSQL to Iceberg Replication is Essential for Modern Data Teams ### Unlock Scalable Real-Time Analytics Without Production Impact @@ -25,11 +33,11 @@ Replicating PostgreSQL to Apache Iceberg transforms how organizations handle ope **Near Real-Time Reporting Capabilities**: Keep your dashboards, reports, and analytics fresh with near real-time data synchronization, enabling faster decision-making and more responsive business operations. -**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like Apache Iceberg to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities. +**Future-Proof Data Lakehouse Architecture**: Embrace open, vendor-agnostic formats like [Apache Iceberg](/iceberg/why-iceberg) to build a modern data lakehouse that avoids vendor lock-in while providing warehouse-like capabilities. Traditional CDC pipelines that feed cloud data warehouses often become expensive, rigid, and difficult to manage when dealing with schema changes. With Postgres-to-Iceberg replication, you can decouple storage from compute, allowing you to: -- Choose the optimal compute engine for specific workloads (Trino, Spark, DuckDB, etc.) +- Choose the optimal compute engine for specific workloads ([Trino](/iceberg/olake-iceberg-trino), Spark, DuckDB, etc.) 
- Store data once in cost-effective object storage and access it from anywhere - Eliminate vendor lock-in while reducing overall warehouse expenses - Support both batch and streaming data ingestion patterns @@ -64,10 +72,10 @@ Apache Iceberg relies on robust metadata management for query performance optimi ### Prerequisites for Setting Up Your Replication Pipeline -Before beginning your PostgreSQL to Apache Iceberg migration, ensure you have the following components configured: +Before beginning your PostgreSQL to [Apache Iceberg](/iceberg/why-iceberg) migration, ensure you have the following components configured: - Access to a PostgreSQL database with WAL (Write-Ahead Logging) enabled for CDC -- AWS Glue Catalog setup for Iceberg metadata management +- [AWS Glue Catalog](/docs/writers/iceberg/catalog/glue/) setup for Iceberg metadata management - S3 bucket configured for Iceberg table data storage - OLake UI deployed (locally or in your cloud environment) - Docker, PostgreSQL credentials, and AWS S3 access configured @@ -123,7 +131,7 @@ This begins tracking changes from the current WAL position. Ensure the publicati OLake UI provides a web-based interface for managing replication jobs, data sources, destinations, and monitoring without requiring command-line interaction. -#### Quick Start Installation +#### [Quick Start Installation](/docs/getting-started/quickstart) To install OLake UI using Docker and Docker Compose: @@ -162,7 +170,7 @@ Configure your Apache Iceberg destination in the OLake UI: - IAM credentials (optional if your instance has appropriate IAM roles) - S3 bucket selection for Iceberg table storage -OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [OLake Catalogs Documentation](https://olake.io/docs/writers/iceberg/catalog/rest/). +OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [Catalog Compatibility Overview](/docs/understanding/compatibility-catalogs). ![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/12/step-4.webp) @@ -192,7 +200,7 @@ For each stream, select the appropriate sync mode based on your requirements: - **Normalization**: Disable for raw JSON data storage - **Partitioning**: Configure regex patterns for Iceberg table partitioning -- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](https://olake.io/docs/writers/iceberg/partitioning) +- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](/docs/writers/iceberg/partitioning) ![OLake stream selection step with Full Refresh + CDC sync for dz-stag-users table](/img/blog/2025/12/step-5-2.webp) @@ -285,7 +293,7 @@ The state.json file serves as the single source of truth for replication progres One of the key advantages of Apache Iceberg's open format is compatibility with multiple query engines. 
Optimize your analytical workloads by: - Using Apache Spark for large-scale batch processing and complex transformations -- Implementing Trino for interactive analytics and ad-hoc queries +- Implementing [Trino](/iceberg/olake-iceberg-trino) for interactive analytics and ad-hoc queries - Deploying DuckDB for fast analytical queries on smaller datasets - Integrating with AWS Athena for serverless SQL analytics @@ -324,4 +332,47 @@ With OLake, you gain access to: - Production-ready monitoring and management capabilities for enterprise deployments The combination of PostgreSQL's reliability as an operational database and Apache Iceberg's analytical capabilities creates a powerful foundation for data-driven decision making. Whether you're building real-time dashboards, implementing advanced analytics, or developing machine learning pipelines, this replication strategy provides the scalability and flexibility modern organizations require. + +## Frequently Asked Questions + +### What's the difference between PostgreSQL and Apache Iceberg? + +PostgreSQL is an OLTP database designed for transactional application workloads with fast row-based operations. Apache Iceberg is an open table format optimized for large-scale analytics with columnar storage, built for data lakes rather than operational databases. + +### How does PostgreSQL logical replication work? + +PostgreSQL writes all changes to a Write-Ahead Log (WAL). Logical replication reads this WAL using replication slots and publications, streaming INSERT, UPDATE, and DELETE operations to downstream systems like Iceberg in real-time without impacting database performance. + +### Do I need PostgreSQL superuser privileges for CDC? + +No! While superuser simplifies setup, you only need specific privileges: REPLICATION permission, and SELECT access on tables you want to replicate. Cloud providers like AWS RDS and Google Cloud SQL support logical replication with limited-privilege accounts. + +### Can I replicate PostgreSQL without enabling logical replication? + +Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you can't modify WAL settings or create replication slots, you can still replicate data using standard PostgreSQL credentials with timestamp-based incremental updates. + +### How does OLake handle PostgreSQL schema changes? + +OLake automatically detects [schema evolution](/docs/features/?tab=schema-evolution). When you add, drop, or modify columns in PostgreSQL, these changes propagate to Iceberg tables without breaking your pipeline. The state management ensures schema and data stay synchronized. + +### What happens if my PostgreSQL WAL fills up? + +Proper replication slot monitoring is crucial. If OLake falls behind, PostgreSQL retains WAL files until they're consumed. OLake provides lag monitoring and automatic recovery to prevent WAL bloat, but you should set appropriate WAL retention limits. + +### How do I handle large PostgreSQL databases for initial load? + +OLake uses intelligent chunking strategies (CTID-based or batch splits) to load data in parallel without locking tables. A 1TB PostgreSQL database typically loads in 4-8 hours depending on network and storage performance, and the process can be paused/resumed. + +### What query engines work with PostgreSQL-sourced Iceberg tables? 
+ +Any Iceberg-compatible engine: [Apache Spark](https://olake.io/iceberg/query-engine/spark) for batch processing, [Trino](https://olake.io/iceberg/query-engine/trino)/[Presto](https://olake.io/iceberg/query-engine/presto) for interactive queries, [DuckDB](https://olake.io/iceberg/query-engine/duckdb) for fast analytical workloads, [AWS Athena](https://olake.io/iceberg/query-engine/athena) for serverless SQL, [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others - all querying the same data. + +### Can I replicate specific PostgreSQL tables or schemas? + +Yes! OLake lets you select specific tables, schemas, or even filter rows using SQL WHERE clauses. This selective replication reduces storage costs and improves query performance by replicating only the data you need for analytics. + +### What's the cost comparison between PostgreSQL RDS and Iceberg on S3? + +PostgreSQL RDS storage costs ~$0.115/GB/month plus compute charges that run 24/7. Iceberg on S3 costs ~$0.023/GB/month (5x cheaper) with compute costs only when querying. Organizations typically save 50-75% on analytics infrastructure. + diff --git a/blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx b/blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx index 75b11ed6..78ca9d4e 100644 --- a/blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx +++ b/blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx @@ -11,12 +11,20 @@ image: /img/blog/cover/setup-sql-iceberg.webp **MySQL** powers countless production applications as a reliable operational database. But when it comes to analytics at scale, running heavy queries directly on MySQL can quickly become expensive, slow, and disruptive to transactional workloads. -That's where **Apache Iceberg** comes in. By replicating MySQL data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries all without burdening your source database. +That's where [**Apache Iceberg**](/iceberg/why-iceberg) comes in. By replicating MySQL data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries all without burdening your source database. -Apache Iceberg is more than an average table format and it's designed for large-scale, cost-effective analytics. With native support for ACID transactions, seamless schema evolution, and compatibility with engines like Trino, Spark, and DuckDB, it's ideal for modern data lakehouses. +[Apache Iceberg](/iceberg/why-iceberg) is more than an average table format and it's designed for large-scale, cost-effective analytics. With native support for ACID transactions, seamless schema evolution, and compatibility with query engines like Trino, Spark, and DuckDB, it's ideal for modern data lakehouses. In this comprehensive guide, we'll walk through setting up a real-time pipeline from MySQL to Apache Iceberg using OLake, covering both UI and CLI approaches. We'll explore why companies like Netflix, Natural Intelligence, and Memed have successfully migrated to Iceberg architectures, achieving dramatic performance improvements and cost savings. 
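Since the CDC path described below rides on MySQL's binary log, a quick sanity check on the source server is worth doing up front. A rough sketch, assuming MySQL 8.x and a user with the usual administrative privileges (exact requirements depend on your hosting setup):

```sql
-- The binary log must be enabled and row-based for reliable CDC
SHOW VARIABLES LIKE 'log_bin';          -- expect ON
SHOW VARIABLES LIKE 'binlog_format';    -- expect ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- FULL captures complete before/after row images

-- Inspect which binlog files exist and how long they are retained
SHOW BINARY LOGS;
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
```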
+## Key Takeaways + +- **Offload Analytics from Production**: Replicate MySQL to Iceberg to run heavy analytical queries without impacting your production database performance +- **Real-time Data Sync**: CDC via [binlog](/blog/binlogs) keeps Iceberg tables up-to-date with sub-second latency for real-time dashboards and reporting +- **Massive Cost Savings**: Companies like Netflix achieved 25% cost reduction and Memed saw 60x faster ETL processing times +- **Open Format Freedom**: Store data once in S3 and query with any engine (Trino, Spark, DuckDB) - no vendor lock-in +- **Enterprise Features Built-in**: Get automatic schema evolution, ACID transactions, time travel, and partitioning without complex engineering + ## The Growing Problem: Why MySQL Analytics Hit Performance Walls ![MySQL limitations vs data lakehouse benefits including streaming and cloud storage](/img/blog/2025/13/comparison.webp) @@ -60,7 +68,7 @@ Replicating MySQL into Apache Iceberg solves these fundamental problems: - **Offload Analytics Workloads**: Keep OLTP performance fast while running complex queries on Iceberg tables without impacting production systems. - **Handle Petabyte Scale**: Iceberg supports petabytes of data on object storage like S3, with no sharding or archiving complexity. -- **Near Real-time Synchronization**: With MySQL CDC from binlogs, Iceberg tables stay continuously up to date with sub-second latency. +- **Near Real-time Synchronization**: With MySQL CDC from [binlog](/blog/binlogs), Iceberg tables stay continuously up to date with sub-second latency. - **Advanced Lakehouse Features**: Partitioning, schema evolution, ACID transactions, and time travel make analytics flexible and reliable. - **Lower Cost, Open Ecosystem**: Store data cheaply in S3, query with engines like Trino or Spark, and avoid vendor lock-in. @@ -146,10 +154,10 @@ Before starting your MySQL to Apache Iceberg replication, ensure you have the fo - Hive Metastore + HDFS/MinIO (alternative) - Other supported catalogs (Nessie, Polaris, Unity) -**Optional Query Engine**: Athena/Trino/Presto or Spark SQL for result validation +**Optional Query Engine**: [Athena](https://olake.io/iceberg/query-engine/athena)/[Trino](/iceberg/olake-iceberg-trino)/[Presto](https://olake.io/iceberg/query-engine/presto) or [Spark](https://olake.io/iceberg/query-engine/spark) SQL for result validation -For comprehensive MySQL setup details, follow this documentation: [MySQL Connector Setup](https://olake.io/docs/connectors/mysql) -For AWS Glue catalog quick setup: [Glue Catalog Configuration](https://olake.io/docs/writers/iceberg/catalog/glue) +For comprehensive MySQL setup details, follow this documentation: [MySQL Connector Setup](/docs/connectors/mysql) +For AWS Glue catalog quick setup: [Glue Catalog Configuration](/writers/iceberg/catalog/glue) ### Step 1: Configure MySQL for Logical Replication @@ -237,8 +245,8 @@ Configure your Iceberg destination in the OLake UI for seamless lakehouse integr **Multi-Catalog Support**: OLake supports multiple catalogs (Glue, Nessie, Polaris, Hive, Unity), providing flexibility for different architectural requirements. 
-**Detailed Configuration Guide**: [Glue Catalog Setup](https://olake.io/docs/writers/iceberg/catalog/glue) -**Alternative Catalogs**: For REST catalogs (Lakekeeper, Polaris) and other options: [Catalog Configuration Documentation](https://olake.io/docs/connectors) +**Detailed Configuration Guide**: See AWS Glue Catalog setup in [Glue Catalog documentation](/docs/writers/iceberg/catalog/glue/) +**Alternative Catalogs**: For REST catalogs (Lakekeeper, Polaris) and other options: [Catalog Compatibility Overview](/docs/understanding/compatibility-catalogs) ![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/13/step-4.webp) @@ -396,6 +404,48 @@ Start your MySQL to Apache Iceberg migration today and unlock the full analytica As the data landscape continues evolving toward open, cloud-native architectures, organizations embracing Apache Iceberg lakehouse patterns position themselves for scalable growth while maintaining operational excellence. The question isn't whether to migrate from MySQL analytics, it's how quickly you can implement this transformation to stay competitive in today's data-driven economy. +## Frequently Asked Questions + +### What is the difference between MySQL and Apache Iceberg? + +MySQL is an OLTP (Online Transaction Processing) database designed for handling live application transactions with fast reads and writes. Apache Iceberg is an open table format designed for large-scale analytics on data lakes, optimized for complex queries and petabyte-scale data storage. + +### How does CDC (Change Data Capture) work with MySQL? + +CDC tracks changes in MySQL by reading the binary log (binlog), which records every insert, update, and delete operation. OLake connects to the binlog and streams these changes in real-time to your Iceberg tables without impacting production performance. + +### Can I replicate MySQL to Iceberg without CDC? + +Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you don't have permissions to enable binlogs, you can start syncing immediately with standard MySQL credentials. + +### What happens to my MySQL schema changes? + +OLake automatically handles schema evolution. When you add, drop, or modify columns in MySQL, these changes are detected and propagated to your Iceberg tables without breaking your pipeline. + +### How much does it cost to store data in Iceberg vs MySQL? + +Iceberg storage on S3 costs approximately $0.023 per GB/month, compared to MySQL RDS storage at $0.115 per GB/month - that's 5x cheaper. Plus, you separate compute from storage, so you only pay for queries when you run them. + +### What query engines can I use with Iceberg tables? + +Apache Iceberg is an open format compatible with: [Trino](https://olake.io/iceberg/query-engine/trino), [Presto](https://olake.io/iceberg/query-engine/presto), [Apache Spark](https://olake.io/iceberg/query-engine/spark), [DuckDB](https://olake.io/iceberg/query-engine/duckdb), [AWS Athena](https://olake.io/iceberg/query-engine/athena), [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others. You can switch engines anytime without rewriting data. + +### How do I handle partitioning for optimal query performance? + +Choose partition columns based on your query patterns: use timestamp fields (created_at, updated_at) for time-series queries, or dimensional fields (customer_id, region) for lookup queries. 
OLake supports regex-based partitioning configuration. + +### Is the initial full load safe for large MySQL databases? + +Yes! OLake uses primary key-based chunking to load data in batches without locking your MySQL tables. The process runs in parallel and can be paused/resumed if needed. + +### What happens if my replication pipeline fails? + +OLake maintains a state.json file that tracks replication progress. If the pipeline fails, it automatically resumes from the last successfully processed position, ensuring no data loss. + +### Can I query both MySQL and Iceberg simultaneously? + +Absolutely! Your MySQL database continues serving production traffic while Iceberg handles analytics. This separation ensures operational workloads never compete with analytical queries for resources. + Happy syncing! 🧊🐘 diff --git a/blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx b/blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx index eb62a25f..2c16c149 100644 --- a/blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx +++ b/blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx @@ -11,12 +11,20 @@ image: /img/blog/cover/setup-mongodb.webp **MongoDB** has become the go-to database for modern applications, handling everything from user profiles to IoT sensor data with its flexible document model. But when it comes to analytics at scale, MongoDB's document-oriented architecture faces significant challenges with complex queries, aggregations, and large-scale data processing. -That's where **Apache Iceberg** comes in. By replicating MongoDB data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries while maintaining MongoDB's operational performance. +That's where [**Apache Iceberg**](/iceberg/why-iceberg) comes in. By replicating MongoDB data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries while maintaining MongoDB's operational performance. Apache Iceberg is designed for large-scale, cost-effective analytics with native support for ACID transactions, seamless schema evolution, and compatibility with engines like Trino, Spark, and DuckDB. It's the perfect complement to MongoDB's operational strengths. In this comprehensive guide, we'll walk through setting up a real-time pipeline from MongoDB to Apache Iceberg using OLake, covering both UI and CLI approaches. We'll explore why companies are successfully migrating to Iceberg architectures, achieving dramatic performance improvements and cost savings. 
+## Key Takeaways + +- **Solve MongoDB Analytics Bottlenecks**: Run complex aggregations and joins on Iceberg without slowing down your MongoDB production workloads +- **Real-time Change Streams**: MongoDB Change Streams provide millisecond-latency CDC to keep Iceberg tables continuously synchronized +- **Handle Flexible Schemas**: OLake automatically manages MongoDB's dynamic schema evolution, converting BSON documents to Iceberg-compatible structures +- **Petabyte-Scale Analytics**: Query terabytes or petabytes of data using columnar storage on S3, with costs 5x lower than operational MongoDB +- **Multi-Engine Freedom**: Access your MongoDB data through [Trino](https://olake.io/iceberg/query-engine/trino), [Spark](https://olake.io/iceberg/query-engine/spark), [DuckDB](https://olake.io/iceberg/query-engine/duckdb), or [Athena](https://olake.io/iceberg/query-engine/athena) using standard SQL - no MongoDB query language required + ## The Growing Problem: Why MongoDB Analytics Hit Performance Walls ![MongoDB vs Apache Iceberg fast analytics queries performance improvement](/img/blog/2025/14/analytics-image.webp) @@ -74,7 +82,7 @@ Moving data from MongoDB into Iceberg sounds simple, but in practice there are s **Change Data Capture Implementation Complexity** -- **MongoDB Change Streams Setup**: Getting MongoDB Change Streams configured correctly requires proper replica set setup, appropriate permissions, and understanding of oplog mechanics. The oplog must be properly configured for reliable CDC. +- **MongoDB Change Streams Setup**: Getting MongoDB Change Streams configured correctly requires proper replica set setup, appropriate permissions, and understanding of [oplog](/docs/understanding/terminologies/general/#26-oplog-mongodb) mechanics. The oplog must be properly configured for reliable CDC. - **Real-time vs. Batch Processing**: Choosing between real-time Change Streams and batch replication affects both data freshness and infrastructure complexity. Real-time offers low latency but requires more sophisticated monitoring and error handling. **Schema Evolution and Data Type Compatibility** @@ -125,7 +133,7 @@ Before we start, make sure you have: **MongoDB instance with:** - Running in replica set mode (--replSet rs0). -- Enabled oplog (automatic in replica sets) +- Enabled [oplog](/docs/understanding/terminologies/general/#26-oplog-mongodb) (automatic in replica sets) - Read access to the tables for the MongoDB user. - Version 4.0 or higher @@ -202,9 +210,9 @@ Configure your Iceberg destination in the OLake UI for seamless lakehouse integr **Multi-Catalog Support**: OLake supports multiple catalogs (Glue, Nessie, Polaris, Hive, Unity), providing flexibility for different architectural requirements. 
-**Detailed Configuration Guide**: [Glue Catalog Setup](https://olake.io/docs/writers/iceberg/catalog/glue) +**Detailed Configuration Guide**: See AWS Glue Catalog setup in [Glue Catalog documentation](/docs/writers/iceberg/catalog/glue/) -**Alternative Catalogs**: For REST catalogs (Lakekeeper, Polaris) and other options: [Catalog Configuration Documentation](https://olake.io/docs/connectors) +**Alternative Catalogs**: For REST catalogs (Lakekeeper, Polaris) and other options: [Catalog Compatibility Overview](/docs/understanding/compatibility-catalogs) ![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/14/step-4.webp) @@ -265,7 +273,7 @@ Your MongoDB to Iceberg replication creates a structured hierarchy in S3 object With this setup, you now have a fully functional MongoDB-to-Iceberg pipeline running with Change Streams support, ready for analytics, lakehouse querying, and downstream consumption by various query engines. -### Step 7 (Optional): Query Iceberg Tables with AWS Athena +### Step 7 (Optional): Query Iceberg Tables with AWS [Athena](/iceberg/query-engine/athena/) Validate your MongoDB to Iceberg migration by configuring AWS Athena for direct querying: @@ -343,6 +351,48 @@ The combination of MongoDB's operational flexibility and Iceberg's analytical ca As the data landscape continues evolving toward open, cloud-native architectures, organizations embracing Apache Iceberg lakehouse patterns position themselves for scalable growth while maintaining operational excellence. The question isn't whether to migrate from MongoDB analytics, it's how quickly you can implement this transformation to stay competitive in today's data-driven economy. +## Frequently Asked Questions + +### Why can't I just run analytics directly on MongoDB? + +MongoDB is optimized for operational workloads with fast document reads/writes. Complex analytical queries (aggregations, joins, large scans) consume significant resources and slow down production applications. Replicating to Iceberg separates analytics from operations, keeping both performant. + +### How does MongoDB Change Streams work for CDC? + +Change Streams tap into MongoDB's oplog (operation log) to capture every insert, update, and delete in real-time. OLake reads these changes continuously and applies them to Iceberg tables without impacting MongoDB performance or requiring application changes. + +### Do I need a MongoDB replica set for replication? + +For real-time CDC with Change Streams, yes - MongoDB requires replica set mode. However, OLake also offers JDBC-based Full Refresh and Bookmark-based Incremental modes that work with standalone MongoDB instances if you have permission limitations. + +### How does OLake handle MongoDB's flexible schemas? + +MongoDB documents in the same collection can have different fields. OLake automatically detects schema changes and evolves your Iceberg tables accordingly, adding new columns when new fields appear while maintaining backward compatibility. + +### What happens to nested MongoDB documents in Iceberg? + +OLake intelligently flattens nested BSON structures into Iceberg-compatible schemas. Complex nested objects become structured columns in Iceberg tables, making them queryable with standard SQL rather than MongoDB's aggregation framework. + +### Can I filter which MongoDB collections to replicate? + +Yes! 
OLake allows you to select specific collections and even apply MongoDB aggregation pipeline filters to replicate only the data you need, reducing storage costs and improving query performance. + +### How long does the initial MongoDB to Iceberg load take? + +Initial load time depends on your data volume and MongoDB performance. OLake processes collections in parallel and can be paused/resumed. For example, a 500GB MongoDB database typically loads in 2-4 hours depending on network and storage speed. + +### What's the difference between Change Streams and binlog CDC? + +Change Streams is MongoDB's native change tracking mechanism (similar to MySQL binlogs). It provides a stream of document-level changes that OLake captures and applies to Iceberg tables in real-time. + +### Can I query both MongoDB and Iceberg simultaneously? + +Absolutely! MongoDB continues serving your application traffic while Iceberg handles analytics. This architecture ensures your operational database never competes with analytical workloads for resources. + +### How much does Iceberg storage cost compared to MongoDB? + +S3 storage for Iceberg costs ~$0.023/GB/month compared to MongoDB Atlas storage at ~$0.25/GB/month (10x cheaper). Plus, Iceberg's columnar format compresses better, and you only pay for compute when running queries. + Happy syncing! diff --git a/blog/2025-09-15-apache-hive-vs-apache-iceberg-comparison.mdx b/blog/2025-09-15-apache-hive-vs-apache-iceberg-comparison.mdx index 494f6350..7ee398a7 100644 --- a/blog/2025-09-15-apache-hive-vs-apache-iceberg-comparison.mdx +++ b/blog/2025-09-15-apache-hive-vs-apache-iceberg-comparison.mdx @@ -10,7 +10,7 @@ image: /img/blog/cover/hive-vs-iceberg.webp # When to Choose Apache Iceberg Over Hive: A Comparison Guide -Apache Hive and Apache Iceberg represent two different generations of the data lake ecosystem. Hive was born in the **Hadoop era** as a SQL abstraction over HDFS, excelling in batch ETL workloads and still valuable for organizations with large Hadoop/ORC footprints. Iceberg, by contrast, emerged in the **cloud-native era** as an open table format designed for multi-engine interoperability, **schema evolution**, and features like **time travel**. If you are running a legacy Hadoop stack with minimal need for engine diversity, Hive remains a practical choice. If you want a **flexible, future-proof data lakehouse** that supports diverse engines, reliable transactions, and governance at scale, Iceberg is the more strategic investment. +Apache Hive and Apache Iceberg represent two different generations of the data lake ecosystem. Hive was born in the **Hadoop era** as a SQL abstraction over HDFS, excelling in batch ETL workloads and still valuable for organizations with large Hadoop/ORC footprints. Iceberg, by contrast, emerged in the **cloud-native era** as an [open table format](/iceberg/move-to-iceberg) designed for multi-engine interoperability, [**schema evolution**](/docs/features/?tab=schema-evolution), and features like [**time travel**](/blog/2025/10/03/iceberg-metadata/#63-time-travel-rollback-and-branching). If you are running a legacy Hadoop stack with minimal need for engine diversity, Hive remains a practical choice. If you want a **flexible, future-proof data lakehouse** that supports diverse engines, reliable transactions, and governance at scale, Iceberg is the more strategic investment. 
## Hive vs Iceberg — Feature Comparison at a Glance @@ -110,7 +110,7 @@ Hive's support for schema evolution has always been **partial and somewhat fragi Iceberg was built with **schema evolution in mind**. It tracks columns by unique IDs rather than by name or position, which means columns can be safely renamed without breaking queries or requiring file rewrites. Dropping or adding columns is similarly lightweight, and even type changes can often be accommodated without rewriting underlying data. This design makes schema evolution a routine part of table management rather than a disruptive event, which is particularly valuable in environments where data models must adapt quickly to business needs. -### Partitioning +### [Partitioning](/iceberg/hive-partitioning-vs-iceberg-partitioning) ![Comparison of Hive folder partitioning and Iceberg metadata-hidden partitions](/img/blog/2025/15/parition.webp) diff --git a/blog/2025-10-03-iceberg-metadata.mdx b/blog/2025-10-03-iceberg-metadata.mdx index 9aa9de9c..19fc5b8f 100644 --- a/blog/2025-10-03-iceberg-metadata.mdx +++ b/blog/2025-10-03-iceberg-metadata.mdx @@ -109,11 +109,11 @@ This final step is a **compare-and-swap (CAS)** operation. The catalog will only Iceberg supports a variety of catalog implementations, each suited for different ecosystems: -**Hive Metastore Catalog**: A popular choice for users migrating from Hive. It repurposes the existing Hive Metastore (HMS) to store the pointer to the Iceberg metadata file, allowing Iceberg tables to be managed alongside legacy Hive tables. +[**Hive Metastore Catalog**](/docs/writers/iceberg/catalog/hive/): A popular choice for users migrating from Hive. It repurposes the existing Hive Metastore (HMS) to store the pointer to the Iceberg metadata file, allowing Iceberg tables to be managed alongside legacy Hive tables. -**AWS Glue Catalog**: The standard choice for users in the AWS ecosystem. It leverages the AWS Glue Data Catalog as the central metastore. +[**AWS Glue Catalog**](/docs/writers/iceberg/catalog/glue/): The standard choice for users in the AWS ecosystem. It leverages the AWS Glue Data Catalog as the central metastore. -**REST Catalog**: A standardized, open protocol for an Iceberg-native catalog service. This is a great option for building a platform-agnostic data architecture, as it decouples you from specific compute or storage vendors. +[**REST Catalog**](/docs/writers/iceberg/catalog/rest/): A standardized, open protocol for an Iceberg-native catalog service. This is a great option for building a platform-agnostic data architecture, as it decouples you from specific compute or storage vendors. **JDBC Catalog**: Stores metadata pointers in a relational database like PostgreSQL, offering strong transactional guarantees. @@ -388,7 +388,7 @@ While Iceberg brilliantly avoids expensive LIST operations in cloud storage, it This is where all the previous points converge to help you manage costs: - **Effective Caching**: This is your first and most important defense, as it dramatically reduces the number of GET requests for repeated queries. -- **Smart Partitioning**: A good partitioning scheme ensures that partition pruning is highly effective, minimizing the number of manifest files that need to be read in the first place. +- **Smart [Partitioning](/iceberg/iceberg-partitioning-and-writing-strategies)**: A good partitioning scheme ensures that partition pruning is highly effective, minimizing the number of manifest files that need to be read in the first place. 
- **Regular Compaction**: Fewer data files result in fewer entries in manifests, which can lead to a smaller number of manifest files overall, reducing GET operations during planning. By actively managing your metadata, you can ensure your data lakehouse is not only performant but also cost-effective. @@ -481,7 +481,7 @@ Understanding Iceberg's metadata is the first step, but translating that knowled ### 12.1 Designing for Performance: Schema, Partitioning, and Sorting -The foundation of any good table is a well-designed schema. While Iceberg makes evolution safe with fast, metadata-only operations, starting with clear column names and appropriate data types prevents future complexity. From there, the most critical design choice is the table's physical layout, which is primarily controlled by partitioning. The cardinal rule of partitioning is to choose columns with low cardinality that are frequently used in query filters. The goal is always to maximize data pruning. Common choices include dates, regions, or categories. Using Iceberg's built-in transforms like day(ts) or month(ts) is crucial for temporal columns to prevent an explosion of partitions. +The foundation of any good table is a well-designed schema. While Iceberg makes evolution safe with fast, metadata-only operations, starting with clear column names and appropriate data types prevents future complexity. From there, the most critical design choice is the table's physical layout, which is primarily controlled by [partitioning](/iceberg/iceberg-partitioning-and-writing-strategies). The cardinal rule of [partitioning](/iceberg/iceberg-partitioning-and-writing-strategies) is to choose columns with low cardinality that are frequently used in query filters. The goal is always to maximize data pruning. Common choices include dates, regions, or categories. Using Iceberg's built-in transforms like day(ts) or month(ts) is crucial for temporal columns to prevent an explosion of partitions. A common pitfall is partitioning by a high-cardinality column like user_id, which creates an unmanageable number of small files and harms performance. For these high-cardinality columns, the correct strategy is data sorting. By defining a sort order in the table's metadata, you instruct writers to physically order data within each file. This allows query engines to skip large blocks of rows inside the files, providing a powerful secondary layer of data skipping that works in harmony with partition pruning. diff --git a/blog/2025-10-09-apache-polaris-lakehouse.mdx b/blog/2025-10-09-apache-polaris-lakehouse.mdx index 0e59da36..c574a1ce 100644 --- a/blog/2025-10-09-apache-polaris-lakehouse.mdx +++ b/blog/2025-10-09-apache-polaris-lakehouse.mdx @@ -14,7 +14,7 @@ image: /img/blog/cover/polaris-blog.webp Modern data teams are moving toward the lakehouse architecture—combining the reliability of data warehouses with the scale and cost-efficiency of data lakes. But building one from scratch can feel overwhelming with so many moving parts. -This guide walks you through building a production-ready lakehouse using four powerful open-source tools: **Apache Iceberg** (table format), **Apache Polaris** (catalog), **Trino** (query engine), and **OLake** (data ingestion). We'll show you exactly what each component does, why it matters, and how they work together. 
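To tie together the partitioning and sorting guidance above, here is a minimal Spark SQL sketch of a table that partitions on a daily transform of the event timestamp and sorts by the high-cardinality `user_id` instead of partitioning by it (all names are hypothetical):

```sql
-- Hidden partitioning on a low-cardinality transform of the event timestamp
CREATE TABLE lake.db.page_views (
  user_id BIGINT,
  url     STRING,
  region  STRING,
  ts      TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(ts), region);

-- Sorting within files gives a second layer of data skipping for user_id lookups
ALTER TABLE lake.db.page_views WRITE ORDERED BY user_id;
```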
+This guide walks you through building a production-ready lakehouse using four powerful open-source tools: **Apache Iceberg** (table format), **Apache Polaris** (catalog), [**Trino**](/iceberg/query-engine/trino) (query engine), and **OLake** (data ingestion). We'll show you exactly what each component does, why it matters, and how they work together. ### Understanding Apache Iceberg: The table format that changes everything @@ -25,7 +25,7 @@ Apache Iceberg reimagines how we structure data lakes. Think of a data lake as a - **ACID transactions on object storage**: Get database-like guarantees on cheap S3/GCS/Azure storage - **Schema evolution made easy**: Add, rename, or drop columns without rewriting terabytes of data - **Hidden partitioning**: Queries automatically prune irrelevant data without users writing complex WHERE clauses -- **Time travel capabilities**: Query your data as it existed at any point in time for audits or debugging +- [**Time travel**](/iceberg/query-engine) capabilities: Query your data as it existed at any point in time for audits or debugging - **Production-grade performance**: Efficiently handle petabyte-scale datasets with fast metadata operations ### Why you need a catalog: Keeping your lakehouse organized @@ -80,7 +80,7 @@ Traditional ETL tools like Debezium + Kafka + Spark require complex pipelines wi What this means in practice: your applications keep writing to operational databases (MySQL, Postgres, MongoDB) as usual. OLake continuously captures those changes and writes them to Iceberg tables that are immediately queryable via Trino or any other Iceberg-compatible engine. -### Trino: Your high-performance query engine +### [Trino](/iceberg/query-engine/trino): Your high-performance query engine With data in Iceberg format and a Polaris catalog managing it all, you need a powerful query engine to actually analyze that data. Trino is perfect for this role. @@ -91,7 +91,7 @@ Trino is a distributed SQL engine designed for fast, interactive analytics on ma - **Blazing fast**: MPP (massively parallel processing) architecture runs queries in seconds, not hours - **Standard SQL**: Use familiar ANSI SQL—no need to learn new query languages - **Federation**: Query across multiple data sources (Iceberg, PostgreSQL, MySQL, Kafka) in a single query -- **Iceberg-native**: Full support for Iceberg features including time travel, schema evolution, and hidden partitioning +- **Iceberg-native**: Full support for Iceberg features including [Time travel](/blog/2025/10/03/iceberg-metadata/#63-time-travel-rollback-and-branching), schema evolution, and hidden partitioning - **Scales horizontally**: Add more workers to handle larger datasets and higher concurrency ### How the pieces mesh diff --git a/docs-iceberg-query-engine/athena.mdx b/docs-iceberg-query-engine/athena.mdx index a30cceea..1befb501 100644 --- a/docs-iceberg-query-engine/athena.mdx +++ b/docs-iceberg-query-engine/athena.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -23,7 +24,11 @@ export const athenaFeatures = [ { title: "AWS Glue Catalog Integration", chip: "AWS-Native Only", - description: "Only AWS Glue Data Catalog supported for Iceberg. Hive, REST, Nessie, or JDBC catalogs not recognized - tight AWS ecosystem integration", + description: ( + <> + Only AWS Glue Data Catalog supported for Iceberg. 
Hive, REST, Nessie, or JDBC catalogs not recognized - tight AWS ecosystem integration + + ), icon: , color: "orange", score: 60, @@ -74,7 +79,8 @@ export const athenaFeatures = [ } ], externalLinks: [ - { label: "AWS Glue Data Catalog", url: "https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html", type: "docs" } + { label: "AWS Glue Data Catalog (OLake Docs)", url: "/docs/writers/iceberg/catalog/glue/", type: "docs" }, + { label: "AWS Glue Data Catalog (AWS Docs)", url: "https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html", type: "docs" } ] } }, @@ -629,7 +635,11 @@ export const athenaTableData = { tooltip: "Limited to AWS Glue Data Catalog only" }, details: { - value: "Only AWS Glue Data Catalog; no Hive, REST, Nessie, or JDBC catalog support", + value: ( + <> + Only AWS Glue Data Catalog; no Hive, REST, Nessie, or JDBC catalog support + + ), tooltip: "Deep AWS integration but limited catalog flexibility" }, version: { value: "v3" } @@ -657,7 +667,7 @@ export const athenaTableData = { badge: { text: "Complete CRUD", variant: "success" } }, details: { - value: "INSERT, UPDATE, DELETE, MERGE INTO with position-delete files (v2)", + value: "[INSERT, UPDATE, DELETE, MERGE](/iceberg/olake-iceberg-trino) INTO with position-delete files (v2)", tooltip: "Complete DML capabilities with efficient row-level operations" }, version: { value: "v3" } @@ -835,7 +845,11 @@ export const athenaUseCases = [ ]; + Amazon Athena (Engine v3) + + } description="Serverless AWS-native query engine with complete DML operations, Lake Formation governance, time travel, and deep AWS ecosystem integration for Iceberg tables" features={athenaFeatures} tableData={athenaTableData} diff --git a/docs-iceberg-query-engine/bigquery.mdx b/docs-iceberg-query-engine/bigquery.mdx index d35555b9..af72d19d 100644 --- a/docs-iceberg-query-engine/bigquery.mdx +++ b/docs-iceberg-query-engine/bigquery.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -23,7 +24,11 @@ export const bigqueryFeatures = [ { title: "Dual Table Model", chip: "Managed + External", - description: "BigQuery-managed Iceberg (internal catalog, full DML) and BigLake external Iceberg (Dataplex/HMS/Glue via GCS, query + limited writes)", + description: ( + <> + BigQuery-managed Iceberg (internal catalog, full DML) and BigLake external Iceberg (Dataplex/HMS/Glue via GCS, query + limited writes) + + ), icon: , color: "blue", score: 80, @@ -360,7 +365,11 @@ export const bigqueryFeatures = [ { title: "Differential Time Travel", chip: "Managed vs External", - description: "Managed tables: FOR SYSTEM_TIME AS OF syntax translating to snapshots. External BigLake tables: no SQL time travel currently", + description: ( + <> + Managed tables: FOR SYSTEM_TIME AS OF syntax translating to snapshots. 
External BigLake tables: no SQL time travel currently + + ), icon: , color: "orange", score: 60, @@ -793,40 +802,40 @@ export const bigqueryUseCases = [ title: "Serverless Data Warehouse", description: "Fully managed Iceberg tables with automatic optimization", scenarios: [ - "Modern data warehouse with zero maintenance overhead", - "Analytics workloads requiring automatic optimization", - "Teams wanting BigQuery's serverless benefits on Iceberg", - "High-frequency update scenarios with background optimization" + "Real-world example: An e-commerce company manages 5TB of customer transaction data in BigQuery managed Iceberg tables. The automatic optimization service continuously compacts small files and optimizes table layout in the background, eliminating the need for manual OPTIMIZE commands. The data engineering team saves 15+ hours per week of maintenance work while queries remain fast", + "Modern data warehouse with zero maintenance overhead for production workloads", + "Analytics workloads requiring automatic optimization without manual intervention", + "High-frequency update scenarios with background optimization and clustering" ] }, { title: "GCP-Native Analytics Platform", description: "Deep integration with Google Cloud ecosystem services", scenarios: [ - "BigQuery ML on Iceberg data for machine learning", - "Dataform transformations on Iceberg tables", - "Cross-cloud analytics with BigQuery Omni", - "End-to-end data lineage through Dataplex" + "Real-world example: A fintech startup uses BigQuery ML to train fraud detection models directly on Iceberg tables containing payment transaction data. They use Dataform to transform raw data into analytics-ready tables, track end-to-end lineage through Dataplex, and use BigQuery Omni to query data across Google Cloud and AWS without moving it", + "BigQuery ML on Iceberg data for machine learning model training and inference", + "Dataform transformations on Iceberg tables for ELT pipelines", + "Cross-cloud analytics with BigQuery Omni for multi-cloud data access" ] }, { title: "Streaming Analytics with Storage Write API", description: "High-throughput streaming ingestion for real-time analytics", scenarios: [ - "Real-time streaming data analysis", - "High-volume ingestion via Dataflow/Beam/Spark", - "Near real-time dashboard and reporting", - "CDC processing with Datastream integration" + "Real-world example: A gaming company ingests player event data from 10 million active users using Dataflow with the Storage Write API into BigQuery managed Iceberg tables. Events become queryable within 2-3 seconds, powering real-time leaderboards and player analytics dashboards. They process 50,000 events per second with near real-time visibility", + "Real-time streaming data analysis with sub-second to second latency", + "High-volume ingestion via Dataflow/Apache Beam/Spark for operational analytics", + "CDC processing with Datastream integration for database replication" ] }, { title: "Multi-Engine Data Lake", description: "Iceberg tables accessible from multiple GCP services", scenarios: [ - "Data shared between BigQuery and Dataproc Spark/Flink", - "Multi-engine analytical workloads", - "Hybrid batch and streaming architectures", - "Open format data lake with BigQuery performance" + "Real-world example: A media company stores video metadata in BigQuery managed Iceberg tables. 
Their data analysts use BigQuery SQL for reporting, while data scientists use Dataproc Spark for ML feature engineering, and streaming engineers use Flink for real-time processing. All three teams access the same Iceberg tables without data duplication or ETL", + "Data shared between BigQuery and Dataproc Spark/Flink for unified analytics", + "Multi-engine analytical workloads with consistent data access", + "Hybrid batch and streaming architectures with open format interoperability" ] } ]; diff --git a/docs-iceberg-query-engine/clickhouse.mdx b/docs-iceberg-query-engine/clickhouse.mdx index 166852db..676b515b 100644 --- a/docs-iceberg-query-engine/clickhouse.mdx +++ b/docs-iceberg-query-engine/clickhouse.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -364,7 +365,11 @@ export const clickhouseFeatures = [ { title: "Time Travel Capabilities", chip: "Since 25.4", - description: "Time-travel since 25.4 with SET iceberg_timestamp_ms= or iceberg_snapshot_id; partition pruning via use_iceberg_partition_pruning=1", + description: ( + <> + Time-travel since 25.4 with SET iceberg_timestamp_ms=<epoch> or iceberg_snapshot_id; partition pruning via use_iceberg_partition_pruning=1 + + ), icon: , color: "green", score: 100, diff --git a/docs-iceberg-query-engine/databricks.mdx b/docs-iceberg-query-engine/databricks.mdx index 36a8d867..bd2e6660 100644 --- a/docs-iceberg-query-engine/databricks.mdx +++ b/docs-iceberg-query-engine/databricks.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { CloudIcon, ShieldCheckIcon, @@ -78,7 +79,11 @@ export const databricksFeatures = [ { title: "UniForm Multi-Format Technology", chip: "Innovative", - description: "UniForm enables the same table to be accessible as both Delta and Iceberg simultaneously, generating Iceberg metadata on every Delta commit", + description: ( + <> + UniForm enables the same table to be accessible as both Delta and Iceberg simultaneously, generating Iceberg metadata on every Delta commit + + ), icon: , color: "purple", score: 95, @@ -303,7 +308,11 @@ export const databricksFeatures = [ { title: "Time Travel & Snapshot Queries", chip: "Full Support", - description: "External engines can time-travel using standard Iceberg syntax with snapshot-ID or timestamp, enhanced with Delta version mapping properties", + description: ( + <> + External engines can time-travel using standard Iceberg syntax with snapshot-ID or timestamp, enhanced with Delta version mapping properties + + ), icon: , color: "green", score: 90, @@ -473,7 +482,7 @@ export const databricksFeatures = [ export const databricksTableData = { title: "Databricks Iceberg Feature Matrix", - description: "Comprehensive breakdown of Iceberg capabilities in Databricks Runtime 14.3 LTS+", + description: "Comprehensive breakdown of Iceberg capabilities in Databricks Runtime 14.3 LTS+. 
The matrix shows feature support levels, implementation details, and minimum version requirements for your lakehouse architecture.", variant: "default", columns: [ { @@ -680,40 +689,40 @@ export const databricksUseCases = [ title: "Multi-Engine Lakehouse", description: "Enable external analytics tools and engines to access Delta tables via standard Iceberg APIs", scenarios: [ - "Business intelligence tools requiring Iceberg connectivity", - "Data science platforms with Iceberg client libraries", - "External Spark clusters needing read access to Delta tables", - "Third-party analytics services and visualization tools" + "Practical example: A data science team at a Fortune 500 company maintains their core data in Databricks Delta tables for internal analytics. When external partners need access to this data, they enable UniForm to expose the tables via Iceberg REST catalog. External partners can now query the data using Apache Spark or Trino in their own infrastructure without requiring Databricks access or data duplication", + "Business intelligence tools requiring Iceberg connectivity for reporting", + "Data science platforms with Iceberg client libraries for ML workflows", + "External Spark clusters needing read access to Delta tables for processing" ] }, { title: "Data Sharing & Federation", description: "Share Delta table data across organizational boundaries with standardized Iceberg access", scenarios: [ - "Cross-team data sharing with different tool preferences", - "Partner organizations requiring standard data access", - "Vendor integrations with Iceberg-based analytics", - "Data marketplace implementations with unified access" + "Practical example: A healthcare consortium shares patient outcome data across 15 hospitals. Each hospital uses different analytics tools (some use Databricks, others use Trino or Presto). By enabling UniForm on Delta tables, the central data warehouse provides Iceberg-compatible access, allowing each hospital to query shared data using their preferred tools while maintaining strict Unity Catalog security controls", + "Cross-team data sharing with different tool preferences and requirements", + "Partner organizations requiring standard data access without vendor dependencies", + "Data marketplace implementations with unified access patterns" ] }, { title: "Migration & Modernization", description: "Gradually migrate from legacy systems while maintaining backward compatibility", scenarios: [ + "Practical example: A retail company is migrating from legacy Hive tables to Databricks Delta. During the 6-month transition period, they enable UniForm to allow their existing Tableau dashboards (which connect via Iceberg) to continue working while they gradually migrate reports to use native Delta connections. 
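For context, turning on UniForm for an existing Delta table is typically a matter of table properties — a minimal sketch, assuming Databricks SQL and a hypothetical table name (exact property names can vary by runtime version, so treat this as illustrative):

```sql
-- Enable Iceberg metadata generation (UniForm) on an existing Delta table.
ALTER TABLE main.reporting.policies SET TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```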
This phased approach eliminates 'big bang' migration risks", "Transitioning from Hive tables to Delta with external tool support", - "Legacy analytics tools requiring Iceberg compatibility", - "Hybrid architectures during platform migrations", - "Risk mitigation during lakehouse transformation" + "Legacy analytics tools requiring Iceberg compatibility during migration", + "Hybrid architectures during platform modernization initiatives" ] }, { title: "Compliance & Governance", description: "Provide auditable, read-only access for compliance and regulatory scenarios", scenarios: [ - "Regulatory reporting with external audit tools", - "Compliance teams requiring independent data access", - "Data governance with Unity Catalog integration", - "Immutable data views for external stakeholders" + "Practical example: A financial institution must provide read-only access to their transaction data for external auditors during quarterly reviews. Using UniForm with Unity Catalog, they expose specific tables via Iceberg REST catalog with time-bound credentials, ensuring auditors can verify data independently without granting write access or exposing sensitive internal systems", + "Regulatory reporting with external audit tools and compliance frameworks", + "Compliance teams requiring independent data access with audit trails", + "Data governance with Unity Catalog integration for fine-grained control" ] } ]; diff --git a/docs-iceberg-query-engine/dreamio.mdx b/docs-iceberg-query-engine/dreamio.mdx index 5e63c828..48aea761 100644 --- a/docs-iceberg-query-engine/dreamio.mdx +++ b/docs-iceberg-query-engine/dreamio.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -711,7 +712,11 @@ export const dremioTableData = { badge: { text: "Arctic Branches + Snapshots", variant: "success" } }, details: { - value: "Arctic/Nessie branches & tags (table@branch); snapshot-based rollback and analysis", + value: ( + <> + Arctic/Nessie branches & tags (table@branch); snapshot-based rollback and analysis. 
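As a rough illustration of the branch- and snapshot-scoped reads described above, Dremio SQL along these lines could be used (catalog, table, branch, and snapshot values are hypothetical):

```sql
-- Read from a Nessie/Arctic branch, then pin a read to a specific snapshot.
SELECT * FROM nessie_catalog.sales.orders AT BRANCH etl_dev;
SELECT * FROM nessie_catalog.sales.orders AT SNAPSHOT '4132119532727284872';
```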
See time travel for details + + ), tooltip: "Git-like versioning plus traditional snapshot management" }, version: { value: "v25+" } diff --git a/docs-iceberg-query-engine/duckdb.mdx b/docs-iceberg-query-engine/duckdb.mdx index 4e2f47a2..77fac89a 100644 --- a/docs-iceberg-query-engine/duckdb.mdx +++ b/docs-iceberg-query-engine/duckdb.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { // DatabaseIcon, BoltIcon, @@ -21,7 +22,11 @@ export const duckdbFeatures = [ { title: "Catalog Support", chip: "Partial Support", - description: "Hadoop (file-system) and Iceberg REST catalogs supported via rest option with bearer/OAuth tokens; no native Hive/Glue catalog yet", + description: ( + <> + Hadoop (file-system) and Iceberg REST catalog supported via rest option with bearer/OAuth tokens; no native Hive/Glue catalog yet + + ), icon: , color: "orange", score: 65, @@ -517,7 +522,11 @@ export const duckdbTableData = { tooltip: "File-system and REST catalogs only" }, details: { - value: "Hadoop (file-system), REST catalog with OAuth tokens; no native Hive/Glue support", + value: ( + <> + Hadoop (file-system), REST catalog with OAuth tokens; no native Hive/Glue support + + ), tooltip: "Can proxy Hive/Glue through REST but no direct catalog integration" }, version: { value: "1.3+" } @@ -706,7 +715,7 @@ export const duckdbUseCases = [ scenarios: [ "Business intelligence report generation", "Data extraction for external systems and tools", - "Historical trend analysis with time travel", + "Historical trend analysis with [time travel](/blog/2025/10/03/iceberg-metadata/#63-time-travel-rollback-and-branching)", "Cross-functional data sharing and exploration" ] }, diff --git a/docs-iceberg-query-engine/flink.mdx b/docs-iceberg-query-engine/flink.mdx index 5717f78f..40f48a33 100644 --- a/docs-iceberg-query-engine/flink.mdx +++ b/docs-iceberg-query-engine/flink.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, // DatabaseIcon, @@ -21,7 +22,11 @@ export const flinkFeatures = [ { title: "Comprehensive Catalog Support", chip: "Full Support", - description: "Hive Metastore, Hadoop catalog, REST catalog (incl. Nessie), AWS Glue, JDBC, plus any custom implementation via catalog-impl", + description: ( + <> + Hive Metastore, Hadoop catalog, REST catalog (incl. Nessie), AWS Glue, JDBC, plus any custom implementation via catalog-impl + + ), icon: , color: "blue", score: 100, @@ -78,7 +83,11 @@ export const flinkFeatures = [ { title: "Streaming & CDC Excellence", chip: "Reference Engine", - description: "Reference engine for CDC → Iceberg: consume Debezium/Kafka changelogs, upsert with exactly-once semantics, FLIP-27 incremental reads", + description: ( + <> + Reference engine for CDC → Iceberg: consume Debezium/Kafka changelogs, upsert with exactly-once semantics, FLIP-27 incremental reads + + ), icon: , color: "green", score: 100, @@ -523,7 +532,11 @@ export const flinkTableData = { tooltip: "Support for all major catalog implementations" }, details: { - value: "Hive, Hadoop, REST (incl. Nessie), AWS Glue, JDBC, custom implementations", + value: ( + <> + Hive, Hadoop, REST (incl. 
Nessie), AWS Glue, JDBC, custom implementations + + ), tooltip: "CREATE CATALOG statements with flexible catalog-type and catalog-impl options" }, version: { value: "1.18+" } diff --git a/docs-iceberg-query-engine/hive.mdx b/docs-iceberg-query-engine/hive.mdx index 93f6191f..d6e8bfa1 100644 --- a/docs-iceberg-query-engine/hive.mdx +++ b/docs-iceberg-query-engine/hive.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, // DatabaseIcon, @@ -148,7 +149,7 @@ export const hiveFeatures = [ description: "Hive provides complete DML capabilities for Iceberg tables when running on the Tez execution engine.", overviewContent: { strengths: [ - "Full DELETE, UPDATE, MERGE INTO support", + "Full DELETE, UPDATE, [MERGE](/iceberg/query-engine/spark) INTO support", "ACID compliance for all operations", "Standard SQL syntax compatibility", "Integration with existing ETL workflows", @@ -515,7 +516,11 @@ export const hiveTableData = { tooltip: "Native Hive Metastore plus configurable backends" }, details: { - value: "Hive Metastore (default), Hadoop, REST/Nessie, AWS Glue, JDBC, custom implementations", + value: ( + <> + Hive Metastore (default), Hadoop, REST/Nessie, AWS Glue, JDBC, custom implementations + + ), tooltip: "HiveIcebergStorageHandler provides native HMS integration" }, version: { value: "4.0+" } diff --git a/docs-iceberg-query-engine/impala.mdx b/docs-iceberg-query-engine/impala.mdx index 4e56e6cd..8bf3e911 100644 --- a/docs-iceberg-query-engine/impala.mdx +++ b/docs-iceberg-query-engine/impala.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { BoltIcon, ShieldCheckIcon, @@ -21,7 +22,11 @@ export const impalaFeatures = [ { title: "HMS-Centric Catalog Integration", chip: "Enterprise Ready", - description: "Deep integration with Hive Metastore, HadoopCatalog, and HadoopTables; other catalog implementations configurable via Hive site-config", + description: ( + <> + Deep integration with Hive Metastore, HadoopCatalog, and HadoopTables; other catalog implementations configurable via Hive site-config + + ), icon: , color: "blue", score: 75, @@ -519,7 +524,11 @@ export const impalaTableData = { tooltip: "HMS, Hadoop catalogs only; no direct cloud catalog support" }, details: { - value: "HiveCatalog (HMS), HadoopCatalog, HadoopTables; other catalog-impl via Hive site-config", + value: ( + <> + Hive Catalog (HMS), HadoopCatalog, HadoopTables; other catalog-impl via Hive site-config + + ), tooltip: "No direct Glue/REST/Nessie support; requires HMS infrastructure" }, version: { value: "4.0+" } @@ -547,7 +556,7 @@ export const impalaTableData = { badge: { text: "MERGE Preview", variant: "warning" } }, details: { - value: "INSERT, DELETE, UPDATE with position deletes; MERGE in CDW 1.5.5 preview", + value: "INSERT, DELETE, UPDATE with [position deletes](/blog/iceberg-delta-lake-delete-methods-comparison/); MERGE in CDW 1.5.5 preview", tooltip: "Row-level operations require format-version=2" }, version: { value: "4.4+" } diff --git a/docs-iceberg-query-engine/presto.mdx b/docs-iceberg-query-engine/presto.mdx index f31cef55..1b53dd98 100644 --- a/docs-iceberg-query-engine/presto.mdx +++ b/docs-iceberg-query-engine/presto.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from 
'@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -23,7 +24,11 @@ export const prestoFeatures = [ { title: "Comprehensive Catalog Support", chip: "REST/Nessie + OAuth2", - description: "Hive Metastore, AWS Glue, REST/Nessie (0.277+ with OAuth2), Hadoop (file-based); JDBC possible via same properties", + description: ( + <> + Hive Metastore, AWS Glue, REST/Nessie (0.277+ with OAuth2), Hadoop (file-based); JDBC possible via same properties + + ), icon: , color: "green", score: 100, @@ -361,7 +366,7 @@ export const prestoFeatures = [ }, { title: "Advanced Time Travel", - chip: "AS OF Syntax", + chip: "[AS OF TIMESTAMP](/blog/apache-polaris-lakehouse) Syntax", description: "AS OF TIMESTAMP / @snapshot_id=... syntax (0.282+). Snapshot procedures: rollback, expire, remove orphan files", icon: , color: "green", @@ -628,7 +633,11 @@ export const prestoTableData = { tooltip: "Comprehensive catalog support with authentication" }, details: { - value: "Hive Metastore, AWS Glue, REST/Nessie (OAuth2), Hadoop file-based, JDBC via properties", + value: ( + <> + Hive Metastore, AWS Glue, REST/Nessie (OAuth2), Hadoop file-based, JDBC via properties + + ), tooltip: "Universal catalog integration with advanced authentication" }, version: { value: "0.277+" } @@ -835,7 +844,11 @@ export const prestoUseCases = [ + Distributed SQL query engine with REST/Nessie catalogs, row-level DELETE, time travel, and configurable MoR/CoW modes for interactive analytics + + } features={prestoFeatures} tableData={prestoTableData} useCases={prestoUseCases} diff --git a/docs-iceberg-query-engine/snowflake.mdx b/docs-iceberg-query-engine/snowflake.mdx index bb45d5d8..375b047e 100644 --- a/docs-iceberg-query-engine/snowflake.mdx +++ b/docs-iceberg-query-engine/snowflake.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -477,7 +478,11 @@ export const snowflakeFeatures = [ { title: "UniForm Interoperability", chip: "External Engine Access", - description: "UniForm exposes Snowflake tables through Iceberg-compatible REST catalog so external engines (Spark, Trino) can read them. Cross-cloud support via External Volumes", + description: ( + <> + UniForm exposes Snowflake tables through Iceberg-compatible REST catalog so external engines (Spark, Trino) can read them. Cross-cloud support via External Volumes + + ), icon: , color: "blue", score: 95, @@ -592,7 +597,7 @@ export const snowflakeFeatures = [ export const snowflakeTableData = { title: "Snowflake Iceberg Feature Matrix", - description: "Comprehensive breakdown of Iceberg capabilities in Snowflake", + description: "Comprehensive breakdown of Iceberg capabilities in Snowflake across catalog integration, DML operations, streaming support, and enterprise security. 
Shows feature support levels, implementation details, and availability status.", variant: "default", columns: [ { @@ -799,47 +804,51 @@ export const snowflakeUseCases = [ title: "Enterprise Data Warehouse", description: "Full-featured data warehouse with native Iceberg integration", scenarios: [ - "Modern data warehouse with zero maintenance optimization", - "Enterprise environments requiring comprehensive RBAC and governance", - "High-frequency update workloads with automatic clustering", - "Multi-tenant deployments with fine-grained security" + "Real-world example: A telecommunications company manages 10TB of customer data in Snowflake Iceberg tables. Snowflake's automatic clustering continuously reorganizes data based on query patterns, while auto-compaction merges small files in the background. The data team focuses on analytics instead of table maintenance, saving 20+ hours per week of manual optimization work", + "Modern data warehouse with zero maintenance optimization for production workloads", + "Enterprise environments requiring comprehensive RBAC and governance controls", + "Multi-tenant deployments with fine-grained security and isolation" ] }, { title: "Real-time Analytics with Snowpipe", description: "Streaming ingestion and change processing workflows", scenarios: [ - "Real-time data warehouse updates with Snowpipe Streaming", - "Change data capture with Streams and Tasks", - "High-throughput streaming analytics", - "Near real-time dashboard and reporting" + "Real-world example: An IoT platform ingests sensor data from 50,000 devices using Snowpipe Streaming into Iceberg tables. Data becomes queryable within seconds of arrival, powering real-time alerting dashboards. When anomalies are detected, Streams and Tasks automatically trigger data quality checks and send notifications, all within Snowflake's Iceberg ecosystem", + "Real-time data warehouse updates with Snowpipe Streaming for operational analytics", + "Change data capture with Streams and Tasks for automated processing pipelines", + "High-throughput streaming analytics with near real-time dashboard updates" ] }, { title: "Multi-Engine Data Architecture", description: "UniForm interoperability for diverse analytical tools", scenarios: [ - "Data sharing between Snowflake and external engines (Spark, Trino)", - "Hybrid analytical architectures with multiple processing engines", - "Cross-cloud and cross-region data access scenarios", - "Open format data lake with Snowflake performance" + "Real-world example: A media company stores video analytics data in Snowflake Iceberg tables. Their data science team uses Snowflake SQL for business intelligence, while their ML engineers use Apache Spark (accessing via UniForm) for model training. 
Both teams work with the same data without ETL pipelines or data duplication, reducing costs and eliminating sync issues", + ( + <> + Data sharing between Snowflake and external engines (Spark, Trino) without duplication + + ), + "Hybrid analytical architectures with multiple processing engines and tools", + "Cross-cloud and cross-region data access scenarios with unified governance" ] }, { title: "Development and Testing Optimization", description: "Zero-Copy Cloning for efficient development workflows", scenarios: [ - "Instant development and testing environments with clones", - "Data science experimentation without storage costs", - "Backup and recovery scenarios with time travel", - "Reproducible analytics with snapshot isolation" + "Real-world example: A SaaS company uses Zero-Copy Cloning to create instant copies of production Iceberg tables for testing. Developers can experiment with schema changes, test new features, and validate data transformations on production-scale data without consuming additional storage or waiting for lengthy copy operations. When testing completes, they simply drop the clones", + "Instant development and testing environments with clones for rapid iteration", + "Data science experimentation without storage costs or data duplication overhead", + "Backup and recovery scenarios with time travel for disaster recovery" ] } ]; + Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, custom plug-ins + + ), tooltip: "Most comprehensive catalog support in the ecosystem" }, version: { value: "3.0+" } diff --git a/docs-iceberg-query-engine/starburst.mdx b/docs-iceberg-query-engine/starburst.mdx index 228a53c0..bbeb6a60 100644 --- a/docs-iceberg-query-engine/starburst.mdx +++ b/docs-iceberg-query-engine/starburst.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { BoltIcon, ShieldCheckIcon, @@ -576,7 +577,11 @@ export const starburstTableData = { badge: { text: "SQL Native", variant: "success" } }, details: { - value: "FOR VERSION/TIMESTAMP AS OF syntax with metadata tables and procedures", + value: ( + <> + FOR VERSION/TIMESTAMP AS OF syntax with metadata tables and procedures. See time travel for details + + ), tooltip: "Industry-leading time travel with comprehensive tooling" }, version: { value: "414-E+" } @@ -687,40 +692,40 @@ export const starburstUseCases = [ title: "Enterprise Data Warehousing", description: "Comprehensive analytics platform with full DML capabilities and enterprise governance", scenarios: [ - "Large-scale data warehousing with complex transformations", - "Multi-tenant environments with strict security requirements", - "Enterprise reporting and business intelligence platforms", - "Data governance scenarios requiring detailed access control" + "Real-world example: A Fortune 500 insurance company uses Starburst Enterprise to manage 100TB of policy and claims data across Iceberg tables. They use UPDATE operations to correct policy details, MERGE statements for claims processing, and materialized views to accelerate executive dashboards. 
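The claims-processing upsert mentioned in this example could be expressed roughly as follows — a sketch in Trino/Starburst SQL with hypothetical catalog, table, and column names:

```sql
-- Upsert staged claim updates into an Iceberg table (illustrative only).
MERGE INTO iceberg.insurance.claims AS t
USING iceberg.staging.claim_updates AS s
  ON t.claim_id = s.claim_id
WHEN MATCHED THEN
  UPDATE SET status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (claim_id, status, updated_at)
  VALUES (s.claim_id, s.status, s.updated_at);
```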
Starburst's built-in access control ensures agents only see their assigned policies while executives get full visibility", + "Large-scale data warehousing with complex transformations and ETL workflows", + "Multi-tenant environments with strict security requirements and isolation", + "Enterprise reporting and business intelligence platforms with advanced features" ] }, { title: "Multi-Cloud & Hybrid Analytics", description: "Unified analytics across diverse catalog and storage environments", scenarios: [ - "Multi-cloud deployments with different catalog systems", - "Hybrid on-premises and cloud data architectures", - "Migration scenarios from legacy systems to modern lakehouses", - "Federation across multiple metadata and storage systems" + "Real-world example: A global logistics company has data in AWS Glue, Azure on-premises Hive Metastore, and GCP. Starburst connects to all three catalogs simultaneously, allowing analysts to write queries that join shipment data from AWS, customer data from on-premises Hive, and route optimization data from GCP - all in one SQL query without data movement", + "Multi-cloud deployments with different catalog systems (Glue, Hive, Nessie, Snowflake)", + "Hybrid on-premises and cloud data architectures during cloud migration", + "Federation across multiple metadata and storage systems for unified access" ] }, { title: "High-Performance Analytics", description: "Performance-critical workloads requiring sub-second response times", scenarios: [ - "Interactive business intelligence with large datasets", - "Real-time dashboards and operational analytics", - "Complex analytical queries with advanced optimizations", - "Performance-sensitive applications with SLA requirements" + "Real-world example: A digital advertising platform uses Starburst with Warp Speed enabled to power real-time bidding dashboards. Queries that previously took 30 seconds now complete in under 2 seconds thanks to intelligent caching and indexing. The marketing team can analyze 500 million ad impressions interactively, adjusting campaigns based on live performance data", + "Interactive business intelligence with large datasets requiring instant responses", + "Real-time dashboards and operational analytics with SLA requirements", + "Complex analytical queries with advanced optimizations (Warp Speed, materialized views)" ] }, { title: "Compliance & Audit Scenarios", description: "Regulatory environments requiring comprehensive audit trails and access control", scenarios: [ - "Financial services regulatory reporting", - "Healthcare data governance and compliance", - "Audit trail requirements with time travel capabilities", - "Data lineage and governance for compliance frameworks" + "Real-world example: A multinational bank uses Starburst for regulatory reporting with strict compliance requirements. They use [time travel](/blog/2025/10/03/iceberg-metadata/#63-time-travel-rollback-and-branching) queries to reconstruct account balances at specific points in time for audit purposes, column-level access control to protect PII data, and LDAP integration to ensure only authorized analysts can access sensitive financial data. 
All queries are logged for regulatory review", + "Financial services regulatory reporting with detailed audit trails", + "Healthcare data governance and compliance (HIPAA) with field-level security", + "Data lineage and governance for compliance frameworks (GDPR, SOX, Basel III)" ] } ]; diff --git a/docs-iceberg-query-engine/starrocks.mdx b/docs-iceberg-query-engine/starrocks.mdx index 0fd16364..4c8ba54a 100644 --- a/docs-iceberg-query-engine/starrocks.mdx +++ b/docs-iceberg-query-engine/starrocks.mdx @@ -6,6 +6,7 @@ hide_table_of_contents: true --- import { QueryEngineLayout } from '@site/src/components/Iceberg/QueryEngineLayout'; +import Link from '@docusaurus/Link'; import { ServerStackIcon, BoltIcon, @@ -707,7 +708,11 @@ export const starrocksTableData = { badge: { text: "v3.4+ Required", variant: "error" } }, details: { - value: "No SQL AS OF in v3.2/3.3; separate catalog workaround; SQL time travel v3.4+", + value: ( + <> + No SQL AS OF in v3.2/3.3; separate catalog workaround; SQL time travel v3.4+. See time travel for details + + ), tooltip: "Limited current capabilities with clear future roadmap" }, version: { value: "3.4+" } @@ -790,47 +795,47 @@ export const starrocksUseCases = [ title: "High-Performance Analytics", description: "Vectorized OLAP engine optimized for analytical workloads", scenarios: [ - "Business intelligence and real-time dashboards", - "Large-scale data warehouse analytics", - "Complex analytical queries on lakehouse data", - "High-performance reporting with sub-second latency" + "Real-world example: A retail company uses StarRocks to power sales dashboards that update every minute, showing current inventory levels, sales trends, and customer behavior across 500+ stores. The vectorized OLAP engine processes millions of transactions in seconds, enabling store managers to make data-driven decisions in real-time", + "Large-scale data warehouse analytics with sub-second query response times", + "Complex analytical queries on lakehouse data with automatic optimization", + "High-performance reporting dashboards for business intelligence teams" ] }, { title: "Materialized View Acceleration", description: "Async materialized views for performance optimization", scenarios: [ - "Low-latency dashboard queries on large datasets", - "Incremental data processing and aggregation", - "Performance acceleration for frequent analytical queries", - "Near-real-time analytics with scheduled refresh patterns" + "Real-world example: An e-commerce platform maintains complex aggregation queries for product recommendations. Instead of recalculating these aggregations on every request, StarRocks uses async materialized views to pre-compute results every 5 minutes, reducing dashboard load times from 30 seconds to under 1 second", + "Low-latency dashboard queries on large datasets with automatic refresh", + "Incremental data processing and aggregation for analytics workloads", + "Near-real-time analytics with scheduled refresh patterns for operational reporting" ] }, { title: "Read-Heavy Lakehouse Analytics", description: "Optimal for analytical workloads consuming data from other engines", scenarios: [ - "Query layer for data written by Spark/Flink/Dremio", - "Cross-engine analytical workloads in multi-engine environments", - "Performance-critical read operations on MoR tables", - "Analytical consumption of frequently updated datasets" + "Real-world example: A financial services company uses Apache Spark to write transaction data to Iceberg tables throughout the day. 
StarRocks acts as the query layer, allowing analysts to run complex SQL queries on this data without waiting for Spark jobs. The MoR read support ensures analysts always see the latest data", + "Query layer for data written by Spark/Flink/Dremio in multi-engine architectures", + "Cross-engine analytical workloads with consistent read performance", + "Performance-critical read operations on frequently updated MoR tables" ] }, { title: "Enterprise OLAP Platform", description: "Comprehensive analytical platform with enterprise security", scenarios: [ - "Multi-tenant analytical platforms with RBAC", - "Enterprise data warehouse modernization projects", - "Cloud-native analytical platforms with catalog integration", - "Compliance-aware analytical environments" + "Real-world example: A healthcare organization uses StarRocks to provide secure access to patient analytics across multiple departments. RBAC integration with external catalogs ensures that doctors only see their patients' data, while administrators can access hospital-wide metrics, all querying the same Iceberg tables", + "Multi-tenant analytical platforms with fine-grained RBAC controls", + "Enterprise data warehouse modernization with cloud-native architecture", + "Compliance-aware analytical environments with audit trail requirements" ] } ]; + hive_metastore, glue, jdbc, rest, nessie, or snowflake catalogs; each exposes same tables once configured in catalog properties + + ), icon: , color: "blue", score: 100, @@ -626,7 +631,11 @@ export const trinoTableData = { tooltip: "Support for 6+ catalog implementations" }, details: { - value: "hive_metastore, glue, jdbc, rest, nessie, snowflake - unified access via catalog properties", + value: ( + <> + hive_metastore, glue, jdbc, rest, nessie, snowflake - unified access via catalog properties + + ), tooltip: "Each catalog configured in etc/catalog/*.properties with static configuration" }, version: { value: "414+" } @@ -793,40 +802,40 @@ export const trinoUseCases = [ title: "Interactive Data Analytics", description: "High-performance ad-hoc queries and data exploration", scenarios: [ - "Business intelligence and reporting dashboards", - "Data science and ML feature engineering", - "Interactive data exploration and analysis", - "Complex analytical queries across large datasets" + "Real-world example: A data science team at a SaaS company uses Trino to explore 50TB of customer behavior data stored in Iceberg tables. Analysts write ad-hoc SQL queries in their notebooks, getting results in seconds thanks to Trino's distributed query engine and partition pruning. They can quickly test hypotheses and build dashboards without waiting for batch jobs", + "Business intelligence and reporting dashboards with sub-second response times", + "Data science and ML feature engineering with interactive exploration", + "Complex analytical queries across large datasets with automatic optimizations" ] }, { title: "Multi-Catalog Data Federation", description: "Unified access to data across heterogeneous systems", scenarios: [ - "Cross-cloud data lake analytics", - "Legacy system integration with modern catalogs", - "Multi-vendor data platform consolidation", - "Federated queries across different storage systems" + "Real-world example: A retail conglomerate has data spread across AWS Glue, on-premises Hive Metastore, and Nessie catalogs. 
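A hedged sketch of the kind of cross-catalog join this setup allows, using hypothetical catalog, schema, and table names:

```sql
-- One Trino query joining tables registered in three different catalogs.
SELECT c.customer_id, o.order_total, i.stock_level
FROM glue_cat.sales.orders AS o
JOIN hive_cat.crm.customers AS c ON c.customer_id = o.customer_id
JOIN nessie_cat.ops.inventory AS i ON i.sku = o.sku;
```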
Using Trino, their analysts write a single SQL query that joins customer data from Glue, product data from Hive, and real-time inventory from Nessie - all without moving or replicating data. Trino federates across all three catalogs seamlessly", + "Cross-cloud data lake analytics without data movement or replication", + "Legacy system integration with modern cloud catalogs for unified access", + "Federated queries across different storage systems and metadata stores" ] }, { title: "Lambda Architecture Query Layer", description: "Batch processing and serving layer for real-time architectures", scenarios: [ - "Analytical queries on streaming-updated tables", - "Historical analysis complementing real-time views", - "Batch aggregation and reporting workflows", - "Data quality validation and reconciliation" + "Real-world example: An IoT platform uses Apache Flink to write real-time sensor data to Iceberg tables. Trino serves as the query layer, allowing business analysts to run complex aggregations and historical analysis on the data that Flink continuously updates. Trino handles batch analytical queries while Flink manages real-time stream processing", + "Analytical queries on streaming-updated tables with batch processing power", + "Historical analysis complementing real-time views from streaming engines", + "Batch aggregation and reporting workflows for operational analytics" ] }, { title: "Enterprise Data Warehouse", description: "Modern cloud-native data warehouse with ACID compliance", scenarios: [ - "Traditional data warehouse modernization", - "Time travel for data auditing and compliance", - "Row-level data corrections and updates", - "Schema evolution for changing business needs" + "Real-world example: A healthcare provider modernizes their legacy Oracle data warehouse by migrating to Iceberg tables queried with Trino. They use time travel queries to audit patient record changes for compliance, UPDATE operations to correct data quality issues, and schema evolution to add new fields as healthcare regulations change. Trino provides warehouse capabilities on open formats at fraction of the cost", + "Traditional data warehouse modernization with open table formats", + "Time travel for data auditing and compliance with regulatory requirements", + "Row-level data corrections and updates with ACID transaction guarantees" ] } ]; @@ -840,6 +849,7 @@ export const trinoUseCases = [ officialDocs="https://trino.io/docs/current/" gettingStarted="https://trino.io/docs/current/connector/iceberg.html" additionalResources={[ + { label: "Trino", url: "/iceberg/olake-iceberg-trino" }, { label: "Iceberg Connector Documentation", url: "https://trino.io/docs/current/connector/iceberg.html" }, { label: "Trino Performance Tuning", url: "https://trino.io/docs/current/admin/tuning.html" }, { label: "Catalog Configuration Guide", url: "https://trino.io/docs/current/connector/iceberg.html#catalog-configuration" }, diff --git a/docs/benchmarks.mdx b/docs/benchmarks.mdx index d534b1dd..85e65d2c 100644 --- a/docs/benchmarks.mdx +++ b/docs/benchmarks.mdx @@ -21,7 +21,7 @@ Use the tabs below to view detailed benchmarks per connector. Each tab has a uni -### PostgreSQL → Apache Iceberg Connector Benchmark +### [PostgreSQL to Apache Iceberg](/iceberg/postgres-to-iceberg-using-glue) Connector Benchmark **Benchmark Environment** @@ -42,7 +42,7 @@ _(OLake vs. Popular Data-Movement Tools)_ |----------------------------------------------------|-------------|-------------------------|--------------------| | **OLake**
(as of 14th Sept 2025) | **4.01 B** | **319,562 RPS** | – | | Fivetran
(as of 30th Apr 2025) | 4.01 B | 46,395 RPS | **6.8 × slower** | -| Debezium (memiiso)
(as of 30th Apr 2025) | 1.28 B | 14,839 RPS | **21.5 × slower** | +| [Debezium](/blog/debezium-vs-olake) (memiiso)
(as of 30th Apr 2025) | 1.28 B | 14,839 RPS | **21.5 × slower** | | Estuary
(as of 30th Apr 2025) | 0.34 B | 3,982 RPS | **80.2 × slower** | | Airbyte Cloud
(as of 30th Apr 2025) | 12.7 M | 457 RPS | **699.2 × slower** | @@ -65,7 +65,7 @@ _(OLake vs. Popular Data-Movement Tools)_ **Key takeaway:** **OLake** now delivers upto **7x faster bulk-load** performance than **Fivetran**, while outpacing every other open-source alternative by **21x** to over **600x**. -#### 2. Speed Comparison – **Change-Data-Capture (CDC)** +#### 2. Speed Comparison – **[Change-Data-Capture (CDC)](/blog/how-to-set-up-postgresql-cdc-on-aws-rds)** | Tool | CDC Window | Throughput (rows / sec) | Relative to OLake | | ------------------ | -----------: | ----------------------: | ----------------- | @@ -178,7 +178,7 @@ CREATE TABLE fhv_trips ( We used AWS Glue as Iceberg catalog and AWS S3 as the storage layer on the destination side for this benchmarks. :::: -> **Bottom line:** If you need to land terabytes of PostgreSQL data into Apache Iceberg quickly—and keep it continually up-to-date—OLake delivers enterprise-grade speed without the enterprise-grade bill. +> **Bottom line:** If you need to land terabytes of [PostgreSQL](/blog/how-to-set-up-postgres-apache-iceberg) data into Apache Iceberg quickly—and keep it continually up-to-date—OLake delivers enterprise-grade speed without the enterprise-grade bill.
@@ -313,7 +313,7 @@ We used AWS Glue as Iceberg catalog and AWS S3 as the storage layer on the desti -### MongoDB Benchmarks +### [MongoDB](/docs/connectors/mongodb/) Benchmarks In the fast-paced world of data management, every second counts. When it comes to syncing massive datasets from MongoDB into a data warehouse or even a lakehouse, you need a tool that is not just reliable but also blazing fast and cost-effective. diff --git a/docs/connectors/mongodb/index.mdx b/docs/connectors/mongodb/index.mdx index 284cc21c..b2975c86 100644 --- a/docs/connectors/mongodb/index.mdx +++ b/docs/connectors/mongodb/index.mdx @@ -15,7 +15,7 @@ The OLake MongoDB Source connector supports multiple synchronization modes. It o - **Full Refresh** - **Full Refresh + Incremental** - **Full Refresh + CDC** -- **CDC Only** +- [**CDC Only**](/docs/features/#2-sync-modes-supported) ## Prerequisites @@ -25,9 +25,9 @@ MongoDB Version 4.0 or higher ### CDC Prerequisites -For CDC mode, MongoDB must meet the following requirements: +For [Change Data Capture (CDC)](/docs/understanding/terminologies/general/#39-change-data-capture) mode, MongoDB must meet the following requirements: - MongoDB must be running in **replica set mode** (`--replSet rs0`) -- **Oplog must be enabled** (automatic in replica sets) +- [**oplog**](/docs/understanding/terminologies/general/#26-oplog-mongodb) must be enabled (automatic in replica sets) :::info diff --git a/docs/connectors/mysql/index.mdx b/docs/connectors/mysql/index.mdx index ac59877f..a1c83494 100644 --- a/docs/connectors/mysql/index.mdx +++ b/docs/connectors/mysql/index.mdx @@ -13,9 +13,9 @@ The OLake MySQL Source connector supports multiple sync modes. It also offers fe ## Sync Modes Supported - **Full Refresh** -- **Full Refresh + CDC** -- **CDC Only** -- **Full Refresh + Incremental** +- [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported) +- [**CDC Only**](/docs/features/#2-sync-modes-supported) +- [**Full Refresh + Incremental**](/docs/features/#2-sync-modes-supported) ## Prerequisites @@ -23,7 +23,7 @@ The OLake MySQL Source connector supports multiple sync modes. It also offers fe MySQL Version: MySQL 5.7+ ### CDC Prerequisite - - Binary Logging (Required): + - [Binary Logging](/blog/binlogs) (Required): - log_bin=ON - binlog_format=ROW - binlog_row_image=FULL diff --git a/docs/connectors/mysql/setup/local.mdx b/docs/connectors/mysql/setup/local.mdx index b99564a4..bcbbea85 100644 --- a/docs/connectors/mysql/setup/local.mdx +++ b/docs/connectors/mysql/setup/local.mdx @@ -200,7 +200,7 @@ Follow the [OLake UI setup guide](../../../getting-started/olake-ui) to run the ## 5. Create a Job -Once your source is created, you can create a job to sync data. See [Create a Job](../../../jobs/create-jobs) for detailed instructions. +Once your source is created, you can create a job to sync data. See [Creating First Pipeline](/docs/getting-started/creating-first-pipeline) for detailed instructions. diff --git a/docs/connectors/postgres/index.mdx b/docs/connectors/postgres/index.mdx index 3ecb943c..bdb59d34 100644 --- a/docs/connectors/postgres/index.mdx +++ b/docs/connectors/postgres/index.mdx @@ -13,8 +13,8 @@ The OLake Postgres Source connector supports multiple synchronization modes. 
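Circling back to the MySQL CDC prerequisites listed above, the required binlog settings can be confirmed with a read-only check — a minimal sketch, assuming a MySQL client session:

```sql
-- Verify the binlog settings OLake's MySQL CDC mode expects.
SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format', 'binlog_row_image');
```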
It ## Sync Modes Supported - **Full Refresh** -- **Full Refresh + CDC** -- **CDC Only** +- [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported) +- [**CDC Only**](/docs/features/#2-sync-modes-supported) - **Full Refresh + Incremental** :::danger **wal2json for CDC Deprecated** diff --git a/docs/connectors/postgres/setup/local.mdx b/docs/connectors/postgres/setup/local.mdx index 1588989f..40a7aa83 100644 --- a/docs/connectors/postgres/setup/local.mdx +++ b/docs/connectors/postgres/setup/local.mdx @@ -170,7 +170,7 @@ Follow the [OLake UI setup guide](../../../getting-started/olake-ui) to run the ## 5. Create a Job -Once your source is created, you can create a job to sync data. See [Create a Job](../../../jobs/create-jobs) for detailed instructions. +Once your source is created, you can create a job to sync data. See [Creating First Pipeline](/docs/getting-started/creating-first-pipeline) for detailed instructions. diff --git a/docs/core/architecture.mdx b/docs/core/architecture.mdx index 39b97b16..5ab97914 100644 --- a/docs/core/architecture.mdx +++ b/docs/core/architecture.mdx @@ -7,7 +7,7 @@ sidebar_position: 1 # Architecture -For an in-depth look at OLake’s architecture, including chunking strategies, concurrency models, and state management, explore our blog: [**A Deep Dive into OLake Architecture and Inner Workings**](/blog/olake-architecture-deep-dive). +For an in-depth look at OLake's architecture, including [chunking strategies](/blog/what-makes-olake-fast), [concurrency models](/blog/what-makes-olake-fast), and [state management](/blog/what-makes-olake-fast), explore our blog: [**A Deep Dive into OLake Architecture and Inner Workings**](/blog/olake-architecture-deep-dive).
![OLake architecture diagram with connectors between user, database, and lakehouse](/img/docs/architecture.webp) diff --git a/docs/core/use-cases.mdx b/docs/core/use-cases.mdx index 23d11664..13d50452 100644 --- a/docs/core/use-cases.mdx +++ b/docs/core/use-cases.mdx @@ -11,7 +11,7 @@ sidebar_position: 3 ### 1. Offloading OLTP Databases for Analytics Running complex analytical queries directly on production **OLTP (Online Transaction Processing) databases** can degrade performance and affect transactional workloads. -OLake addresses this by replicating data from **MySQL**, **PostgreSQL**, **Oracle**, and **MongoDB** into an **Apache Iceberg** based data lake. +OLake addresses this by replicating data from **MySQL**, **PostgreSQL**, **Oracle**, and **MongoDB** into an [**Apache Iceberg**](/iceberg/why-iceberg) based data lake. This approach provides: @@ -26,7 +26,7 @@ This approach provides: With OLake, you can maintain stable transactional systems while enabling scalable and reliable analytics on **Apache Iceberg**. ### 2. Building Open Data Stacks and Scaling Data Engineering -Organizations looking to reduce reliance on proprietary ETL and data warehousing tools can use **OLake** as part of an **open-source data stack**. By standardizing on **Apache Iceberg** as the table format, OLake ensures broad compatibility with query engines like **Trino**, **Presto**, **Spark**, **Dremio**, and **DuckDB**. +Organizations looking to reduce reliance on proprietary ETL and data warehousing tools can use **OLake** as part of an [**open-source data stack**](/blog/building-open-data-lakehouse-with-olake-presto). By standardizing on **Apache Iceberg** as the table format, OLake ensures broad compatibility with query engines like **Trino**, **Presto**, **Spark**, **Dremio**, and **DuckDB**. With its open-source approach, OLake helps teams: @@ -43,7 +43,7 @@ Support multiple query engines across different use cases and teams. This enables a **flexible**, **scalable**, and **future-proof data architecture** built on open standards. ### 3. Enabling Near-Real-Time Analytics -Modern applications need fresh data within minutes, not hours. **OLake** enables near-real-time analytics by continuously replicating data from transactional databases using **log-based CDC**, often achieving **sub-minute** latency for updates to appear in **Iceberg**. +Modern applications need fresh data within minutes, not hours. **OLake** enables near-real-time analytics by continuously replicating data from transactional databases using [**CDC**](/docs/understanding/terminologies/general/#39-change-data-capture), often achieving **sub-minute** latency for updates to appear in **Iceberg**. Key benefits: @@ -73,7 +73,7 @@ Adapt to schema changes seamlessly with Iceberg. ### 5. Powering AI and ML Data Pipelines Building effective AI and ML models requires **fresh**, **reliable**, and **structured data**. **OLake** automates the ingestion of transactional data into an **Iceberg-based lakehouse**, ensuring that pipelines always have access to the latest information. -With continuous updates, ML feature stores and training datasets stay current, while Iceberg’s compatibility with engines like **PySpark** and **DuckDB** makes it easy to plug into existing data science workflows. This supports faster model development and iteration. 
+With continuous updates, [ML feature stores](/blog/apache-iceberg-vs-delta-lake-guide) and training datasets stay current, while Iceberg's compatibility with engines like **PySpark** and **DuckDB** makes it easy to plug into existing data science workflows. This supports faster model development and iteration. Key benefits: @@ -97,7 +97,7 @@ Key benefits: - Dead Letter Queue for dependable error management. ### 7. Reducing Cloud Data Warehouse Costs -Cloud data warehouses can become expensive due to storage and compute costs. **OLake** helps reduce these expenses by offloading raw, historical, or less frequently used data into an **Iceberg lakehouse** on cost-effective object storage. +Cloud data warehouses can become expensive due to storage and compute costs. **OLake** helps reduce these expenses by offloading raw, historical, or less frequently used data into an [**Iceberg lakehouse**](/iceberg/move-to-iceberg) on cost-effective object storage. This lets teams keep their warehouse optimized for active data, while still retaining full access to complete datasets in Iceberg. diff --git a/docs/features/index.mdx b/docs/features/index.mdx index ede8d8b6..dff42b5e 100644 --- a/docs/features/index.mdx +++ b/docs/features/index.mdx @@ -15,7 +15,7 @@ import TabItem from '@theme/TabItem'; ## Source Level Features -### 1. Parallelised Chunking +### 1. [Parallelised Chunking](/blog/what-makes-olake-fast) Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental. @@ -27,7 +27,7 @@ Parallel chunking is a technique that splits large datasets or collections into - Enables **parallel reads**, dramatically reducing the time needed to perform full snapshots or scans of large datasets. - Improves ingestion speed, scalability, and overall system performance. -### 2. Sync Modes Supported +### 2. [Sync Modes](/blog/what-makes-olake-fast) Supported OLake supports following sync modes to provide flexibility across use cases: @@ -74,7 +74,7 @@ Data Deduplication ensures that only unique records are stored and processed : s Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization -- **Iceberg partitioning** → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution. +- [**Iceberg partitioning**](/docs/writers/iceberg/partitioning/) → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution. - **S3-style partitioning** → Traditional folder-based layout (e.g., `year=2025/month=08/day=22/`) for compatibility with external tools. - **Normalization** → Automatically expands **level-1 nested JSON fields** into top-level columns. @@ -89,7 +89,7 @@ Partitioning is the process of dividing large datasets into smaller, more manage - Reduces the need for complex JSON parsing in queries. - Improves readability and downstream analytics efficiency. -### 3. Schema Evolution & Data Types Changes +### 3. [Schema Evolution](/blog/2025/10/03/iceberg-metadata) & Data Types Changes OLake automatically handles changes in your table's schema without breaking downstream jobs. 
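As a hedged illustration of the kind of change Iceberg absorbs without breaking readers, adding a column is a metadata-only operation — a sketch in Spark SQL-style DDL with hypothetical names:

```sql
-- Add a column to an existing Iceberg table; existing snapshots and queries keep working.
ALTER TABLE lakehouse.sales.orders ADD COLUMNS (discount_pct DOUBLE);
```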
Read More [Schema Evolution in OLake](/docs/features?tab=schema-evolution) diff --git a/docs/getting-started/creating-first-pipeline.mdx b/docs/getting-started/creating-first-pipeline.mdx index 63cb85f7..e1cb3eb8 100644 --- a/docs/getting-started/creating-first-pipeline.mdx +++ b/docs/getting-started/creating-first-pipeline.mdx @@ -13,7 +13,7 @@ By the end of this tutorial, you’ll have a complete replication workflow runni ## Prerequisites -Follow the [Quickstart Setup Guide](/docs/getting-started/quickstart) to ensure the OLake UI is running at [localhost:8000](http://localhost:8000) +Follow the [Quickstart Setup Guide](/docs/getting-started/quickstart) to ensure the [OLake UI](/docs/install/olake-ui/) is running at [localhost:8000](http://localhost:8000) ### What is a Job? @@ -58,7 +58,7 @@ Choose **Resource-first** if your source and destination are already configured, In this guide, we'll use the **Job-first workflow** to set up a job from configuring the source and destination to running it. If you prefer video, check out our [video walkthrough](#video-tutorial). First things first, every job needs a source and a destination before it can run. -For this demonstration, we'll use **Postgres** as the source and **Iceberg with Glue catalog** as the destination. +For this demonstration, we'll use [**Postgres**](/docs/connectors/postgres) as the source and [**Apache Iceberg**](/iceberg/why-iceberg) with [**Glue Catalog**](/docs/writers/iceberg/catalog/glue/) as the destination. Let's get started! @@ -171,7 +171,7 @@ Here, you can choose your preferred [sync mode](/docs/understanding/terminologie For this guide, we'll configure the following: - Replicate the `fivehundred` stream (name of the table). -- Use **Full Refresh + CDC** as the sync mode. +- Use [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported) as the sync mode. - Enable **data Normalization**. - Modify Destination Database name (if required). - Replicate only data where `dropoff_datetime` >= `2010-01-01 00:00:00` (basically data from 2010 onward). @@ -180,7 +180,7 @@ For this guide, we'll configure the following: Let's start by selecting the `fivehundred` stream (or any stream from your source) by checking its checkbox to include it in the replication. Click the stream name to open the stream-level settings panel on the right side. -In the panel, set the **sync mode** to **Full Refresh + CDC**, and enable **Normalization** by toggling the switch on. +In the panel, set the **sync mode** to [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported), and enable **Normalization** by toggling the switch on.
![Job streams diff --git a/docs/getting-started/playground.mdx b/docs/getting-started/playground.mdx index c15e46ee..0cdd525f 100644 --- a/docs/getting-started/playground.mdx +++ b/docs/getting-started/playground.mdx @@ -6,7 +6,7 @@ sidebar_position: 3 # OLake Playground -OLake Playground is a self-contained environment for exploring lakehouse architecture using Apache Iceberg. It comes preconfigured with all the required components, allowing you to experience the complete workflow without manual setup. +OLake Playground is a self-contained environment for exploring lakehouse architecture using [Apache Iceberg](/iceberg/why-iceberg). It comes preconfigured with all the required components, allowing you to experience the complete workflow without manual setup. ## Included Components @@ -14,11 +14,11 @@ OLake Playground is a self-contained environment for exploring lakehouse archite - **OLake** – Schema discovery and CDC ingestion via an intuitive UI - **MinIO** – Object store for data storage - **Temporal** – Workflow orchestration for ingestion processes -- **Presto** – Query engine for Iceberg tables +- [**Presto**](/iceberg/query-engine/presto/) – Query engine for Iceberg tables ## Objective -Enable developers to experiment with an end-to-end, Iceberg-native lakehouse in minutes. Simply run a single `docker-compose up` command to launch the full stack — no service stitching, no configuration files required. +Enable developers to experiment with an end-to-end, Iceberg-native lakehouse in minutes. Simply run a single [Docker Compose](/docs/getting-started/quickstart) `docker-compose up` command to launch the full stack — no service stitching, no configuration files required. ## ⚙️ Prerequisites diff --git a/docs/getting-started/quickstart.mdx b/docs/getting-started/quickstart.mdx index 66071bc1..17ce2dd9 100644 --- a/docs/getting-started/quickstart.mdx +++ b/docs/getting-started/quickstart.mdx @@ -6,7 +6,7 @@ sidebar_position: 1 --- # How to get started with OLake -This QuickStart guide helps get started with OLake UI, a web-based interface designed to simplify the management of OLake jobs, sources, destinations, and configurations. +This QuickStart guide helps get started with [OLake UI](/docs/install/olake-ui/), a web-based interface designed to simplify the management of OLake jobs, sources, destinations, and configurations. ## Prerequisites diff --git a/docs/install/olake-ui/index.mdx b/docs/install/olake-ui/index.mdx index 640cef88..c53bef59 100644 --- a/docs/install/olake-ui/index.mdx +++ b/docs/install/olake-ui/index.mdx @@ -66,7 +66,7 @@ The default credentials are: OLake UI Jobs
-For detailed job creation instructions, see [Create Jobs](../jobs/create-jobs). +For detailed job creation instructions, see [Create Jobs](/blog/creating-job-olake-docker-cli) or [Creating First Pipeline](/docs/getting-started/creating-first-pipeline). ## Service Configuration diff --git a/docs/intro.mdx b/docs/intro.mdx index 3a5cca05..5b91ed4d 100644 --- a/docs/intro.mdx +++ b/docs/intro.mdx @@ -28,7 +28,7 @@ slug: / ## What is OLake? -OLake is an open-source ELT framework, fully written in Golang for memory efficiency and high performance. It replicates data from sources like PostgreSQL, MySQL, MongoDB, Oracle and Kafka (WIP) directly into open lakehouse formats such as Apache Iceberg and Parquet. +OLake is an open-source ELT framework, fully written in Golang for memory efficiency and high performance. It replicates data from sources like PostgreSQL, MySQL, MongoDB, Oracle and Kafka (WIP) directly into open lakehouse formats such as [Apache Iceberg](/iceberg/why-iceberg) and Parquet. Using Incremental Sync and Change Data Capture (CDC), OLake keeps data continuously in sync while minimizing infrastructure overhead—offering a simple, reliable, and scalable path to building a modern lakehouse. This allows organizations to: @@ -39,9 +39,9 @@ This allows organizations to: --- ## Why OLake? -- **Fastest Path to a Lakehouse** → Achieve high throughput with **parallelized chunking** and **resumable** historical snapshots and blazing-fast incremental updates, even on massive datasets with **exactly-once** delivery. +- **Fastest Path to a Lakehouse** → Achieve high throughput with [**parallelized chunking**](/docs/features/#1-parallelised-chunking) and **resumable** historical snapshots and blazing-fast incremental updates, even on massive datasets with **exactly-once** delivery. -- **Efficient Data Capture** → Capture data efficiently with a full snapshot of your tables or collections, then keep them in sync through near real-time **CDC** using native database logs (**pgoutput, binlogs, oplogs**). +- **Efficient Data Capture** → Capture data efficiently with a full snapshot of your tables or collections, then keep them in sync through near real-time **CDC** using native database logs (**pgoutput, [binlogs](/blog/binlogs), [oplogs](/docs/understanding/terminologies/general/#26-oplog-mongodb)**). - **Schema-Aware Replication** → Automatically detect schema changes to keep your pipelines consistent and reliable. @@ -64,7 +64,11 @@ This allows organizations to: - **PostgreSQL** → CTID ranges, batch-size splits, next-query paging - **MySQL** → Range splits with LIMIT/OFFSET - **MongoDB** → Split-Vector, Bucket-Auto, Timestamp -- **Oracle** → DBMS Parallel Execute +- **Oracle** → DBMS Parallel Execute + +#### Source Level Features + +- [**oplog**](/docs/understanding/terminologies/general/#26-oplog-mongodb) → MongoDB operation log for CDC --- diff --git a/docs/understanding/compatibility-catalogs.mdx b/docs/understanding/compatibility-catalogs.mdx index 252e7bad..f0f51821 100644 --- a/docs/understanding/compatibility-catalogs.mdx +++ b/docs/understanding/compatibility-catalogs.mdx @@ -6,7 +6,7 @@ sidebar_label: Compatibility to Iceberg Catalogs # Compatibility to Iceberg Catalogs -OLake supports multiple Iceberg catalog implementations, letting you choose the one that best fits your environment. The table below shows the supported catalogs at a glance, with links to their setup guides. 
+OLake supports multiple Iceberg catalog implementations, including [REST catalog](/docs/writers/iceberg/catalog/rest/), [Hive Metastore](/docs/writers/iceberg/catalog/hive/), and [JDBC Catalog](/docs/writers/iceberg/catalog/jdbc/), letting you choose the one that best fits your environment. The table below shows the supported catalogs at a glance, with links to their setup guides.

| | Catalog | Link |
| ----------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------------------ |

diff --git a/docs/understanding/compatibility-engines.mdx b/docs/understanding/compatibility-engines.mdx
index db1655fc..8eca809b 100644
--- a/docs/understanding/compatibility-engines.mdx
+++ b/docs/understanding/compatibility-engines.mdx
@@ -15,7 +15,7 @@ You can query OLake Iceberg tables from multiple engines. The table below shows
| Apache Flink (v1.18+) | ✅ | ✅ | ✅ | ✅ | [Flink Docs](https://iceberg.apache.org/docs/latest/flink-configuration/) |
| Trino (v475 +) | ✅ | ✅ | ✅ | ✅ | [Trino Docs](https://trino.io/docs/current/object-storage/metastores.html) |
| Starburst Enterprise | ✅ | ✅ | ✅ | ✅ | [Starburst Docs](https://docs.starburst.io/latest/object-storage/metastores.html) |
-| Presto (v0.288 +) | ✅ | ✅ | ✅ | ✅ | [Presto Guide](https://ibm.github.io/presto-iceberg-lab/lab-1/) |
+| [Presto](/blog/building-open-data-lakehouse-with-olake-presto) (v0.288 +) | ✅ | ✅ | ✅ | ✅ | [Presto Guide](https://ibm.github.io/presto-iceberg-lab/lab-1/) |
| Apache Hive (v4.0) | ✅ | ✅ | ❌ | ✅ | [Hive Docs](https://iceberg.apache.org/docs/latest/hive/) |
| Apache Impala (v4.4) | ❌ | ✅ | ❌ | ❌ | [Impala Docs](https://impala.apache.org/docs/build/html/topics/impala_iceberg.html) |
| Dremio (v25/26) | ✅ | ✅ | ❌ | ✅ | [Dremio Docs](https://docs.dremio.com/current/release-notes/version-260-release/) |

diff --git a/iceberg/2025-05-08-olake-iceberg-athena.mdx b/iceberg/2025-05-08-olake-iceberg-athena.mdx
index f3f73d03..520fbbf6 100644
--- a/iceberg/2025-05-08-olake-iceberg-athena.mdx
+++ b/iceberg/2025-05-08-olake-iceberg-athena.mdx
@@ -40,15 +40,15 @@ Iceberg's intelligent metadata structure allows query engines to eliminate scann

### 5. Championing Openness:

-Built as an open specification, Iceberg ensures you're never locked into a single vendor or engine. Your Iceberg tables on S3 are accessible by a wide array of tools – ingestion platforms like OLake, processing engines, and query engines like Trino, Athena, and Spark SQL – providing ultimate flexibility.
+Built as an open specification, Iceberg ensures you're never locked into a single vendor or engine. Your Iceberg tables on S3 are accessible by a wide array of tools – ingestion platforms like OLake, processing engines, and query engines like [Trino](/iceberg/query-engine/trino), Athena, and Spark SQL – providing ultimate flexibility.

-This is where the combination of OLake, Apache Iceberg, AWS Glue Data Catalog, and Amazon Athena provides a powerful, simple, and serverless solution.
+This is where the combination of OLake, Apache Iceberg, AWS Glue Data Catalog, and [Amazon Athena](/iceberg/query-engine/athena) provides a powerful, simple, and serverless solution.
- **OLake**: An open-source tool designed for simple, lightweight, and fast data ingestion from databases

- **Apache Iceberg**: A high-performance open table format that brings reliability, schema evolution, and time travel to data files on S3

-- **AWS Glue Data Catalog**: A centralized, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable
+- [**AWS Glue Data Catalog**](/docs/writers/iceberg/catalog/glue/): A centralized, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable

- **Amazon Athena**: A serverless query engine that can directly query data in S3 using metadata from glue, perfect for interactive analytics

@@ -140,7 +140,7 @@ Run the [Discover](https://olake.io/docs/connectors/postgres/overview) command

Run the Sync command to replicate data from your source database into Apache Iceberg tables stored on Amazon S3

-## Step 3: Query Iceberg Data Using Amazon Athena
+## Step 3: Query Iceberg Data Using [Amazon Athena](/iceberg/query-engine/athena)

Once OLake has synced data into Iceberg tables and registered them with Glue, you can query it instantly using Amazon Athena

diff --git a/iceberg/2025-05-08-olake-iceberg-trino.mdx b/iceberg/2025-05-08-olake-iceberg-trino.mdx
index 83ec2ba4..ecd8c18d 100644
--- a/iceberg/2025-05-08-olake-iceberg-trino.mdx
+++ b/iceberg/2025-05-08-olake-iceberg-trino.mdx
@@ -36,7 +36,7 @@ Thanks to its smart metadata, iceberg can quickly figure out which files actuall

### Open and Flexible Architecture

-Iceberg is built as an open standard. You're not locked into a single vendor or toolset. Iceberg tables work with a wide range of technologies—like OLake for ingestion and trino, athena, spark and others for querying. This gives you the freedom to build the architecture that fits your needs.
+Iceberg is built as an open standard. You're not locked into a single vendor or toolset. Iceberg tables work with a wide range of technologies—like OLake for ingestion and [Trino, Athena, Spark](/iceberg/query-engine), and others for querying. This gives you the freedom to build the architecture that fits your needs.

## Why This Combination?

@@ -46,9 +46,9 @@ Iceberg is built as an open standard. You're not locked into a single vendor or

- **Apache Iceberg**: A high-performance open table format that brings reliability, schema evolution, and time travel to data files on S3

-- **AWS Glue Data Catalog**: A centralised, managed metadata repository that can act as the iceberg catalog, making your S3 data discoverable
+- [**AWS Glue Data Catalog**](/docs/writers/iceberg/catalog/glue/): A centralised, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable

-- **Trino**: Fast & distributed SQL query engine. Connects to many data sources
+- [**Trino**](/iceberg/query-engine/trino): Fast, distributed SQL query engine that connects to many data sources

Together, these tools allow you to build an end-to-end pipeline from your database to query-ready data on S3 with minimal infrastructure to manage.
diff --git a/src/components/Iceberg/FeatureCard.tsx b/src/components/Iceberg/FeatureCard.tsx
index ece5dec5..21b53601 100644
--- a/src/components/Iceberg/FeatureCard.tsx
+++ b/src/components/Iceberg/FeatureCard.tsx
@@ -2,6 +2,7 @@ import React, { useState } from 'react';
 import { Dialog, Transition, Tab } from '@headlessui/react';
 import { Fragment } from 'react';
+import Link from '@docusaurus/Link';
 import {
   XMarkIcon,
   CheckCircleIcon,
@@ -50,7 +51,7 @@ export interface FeatureDetail {
 export interface FeatureCardProps {
   title: string;
   chip?: string;
-  description: string;
+  description: string | React.ReactNode;
   icon: React.ReactNode;
   details: FeatureDetail;
   color?: 'blue' | 'green' | 'purple' | 'orange' | 'red' | 'yellow';

diff --git a/src/components/Iceberg/QueryEngineLayout.tsx b/src/components/Iceberg/QueryEngineLayout.tsx
index 6c0d0ba7..a1a659e1 100644
--- a/src/components/Iceberg/QueryEngineLayout.tsx
+++ b/src/components/Iceberg/QueryEngineLayout.tsx
@@ -22,13 +22,13 @@ export interface CodeExample {
 export interface UseCase {
   title: string;
   description: string;
-  scenarios: string[];
+  scenarios: (string | React.ReactNode)[];
   icon?: React.ReactNode;
 }

 export interface QueryEngineLayoutProps {
-  title: string;
-  description: string;
+  title: string | React.ReactNode;
+  description: string | React.ReactNode;
   features: FeatureCardProps[];
   tableData: InteractiveTableProps;
   // codeExamples?: CodeExample[];

diff --git a/src/components/site/FeatureShowcase.tsx b/src/components/site/FeatureShowcase.tsx
index 88da4314..6f801bc1 100644
--- a/src/components/site/FeatureShowcase.tsx
+++ b/src/components/site/FeatureShowcase.tsx
@@ -4,12 +4,14 @@ const FeatureCard = ({
   title,
   description,
   illustration,
-  bgColor
+  bgColor,
+  href
 }: {
   title: string
   description: string
   illustration: React.ReactNode
   bgColor: string
+  href?: string
 }) => {
   // Get the appropriate blur color based on background color
   const getBlurColor = () => {
@@ -19,9 +21,11 @@ const FeatureCard = ({
     return '#bae6fd' // default
   }

-  return (
+  const cardContent = (
@@ -46,6 +50,16 @@ const FeatureCard = ({
   )
+
+  if (href) {
+    return (
+
+        {cardContent}
+
+    )
+  }
+
+  return cardContent
 }

 const FeatureShowcase: React.FC = () => {
@@ -75,6 +89,7 @@ const FeatureShowcase: React.FC = () => {
          }
          bgColor='bg-[#C7ECFF] dark:bg-blue-900/20'
+         href='/docs/features/#3-stateful-resumable-syncs'
        />
        {
          }
          bgColor='bg-[#E9EBFD] dark:bg-indigo-900/20'
+         href='/blog/olake-architecture'
        />
        {
          }
          bgColor='bg-[#E9EBFD] dark:bg-indigo-900/20'
+         href='/blog/olake-architecture-deep-dive/#cdc-sync'
        />
        {
          }
          bgColor='bg-[#DDF3FF] dark:bg-blue-900/20'
+         href='/blog/what-makes-olake-fast'
        />

diff --git a/src/components/site/IcebergHero.tsx b/src/components/site/IcebergHero.tsx
index 1d6db1ea..2ccf4e36 100644
--- a/src/components/site/IcebergHero.tsx
+++ b/src/components/site/IcebergHero.tsx
@@ -18,9 +18,10 @@ interface FeatureCardProps {
   description: string
   image: string
   imageAlt: string
+  learnMoreLink?: string
 }

-const FeatureCard: React.FC = ({ title, description, image, imageAlt }) => {
+const FeatureCard: React.FC = ({ title, description, image, imageAlt, learnMoreLink }) => {
   return (
@@ -30,6 +31,17 @@ const FeatureCard: React.FC = ({ title, description, image, im

{title}

{description}

+        {learnMoreLink && (
+
+            Learn More
+
+
+
+
+        )}
@@ -78,18 +90,21 @@ const IcebergHero: React.FC = () => {
         description='Apache Iceberg enables seamless schema evolution by supporting column additions, deletions, renames, and reordering ensuring reliable analytics on evolving datasets without rewriting historical data.'
         image='/img/site/iceberg-1.svg'
         imageAlt='Schema evolution'
+        learnMoreLink='/docs/features/?tab=schema-evolution'
       />

diff --git a/src/pages/index.jsx b/src/pages/index.jsx
index 9e8f3ca6..31b66e65 100644
--- a/src/pages/index.jsx
+++ b/src/pages/index.jsx
@@ -22,8 +22,15 @@ export default function New3Page() {
   const OLakeFaqs = [
     {
       question: 'How to Get Started?',
-      answer:
-        'Check the Quickstart Guide. With a single Docker command you can spin up OLake and access the UI.'
+      answer: (
+        <>
+          Check the{' '}
+
+            Quickstart Guide
+
+          . With a single Docker command you can spin up OLake and access the UI.
+
+      )
     },
     {
       question: 'Is OLake Really Open Source?',

diff --git a/styles/index.css b/styles/index.css
index 0e8d5781..6e8b88aa 100644
--- a/styles/index.css
+++ b/styles/index.css
@@ -360,4 +360,46 @@ pre code {

 .animate-pulse-dot {
   animation: pulse-dot 2s infinite;
+}
+
+/* Blue links for feature cards and table data */
+.blue-link {
+  color: rgb(37, 99, 235) !important; /* text-blue-600 */
+  text-decoration: underline;
+}
+
+.dark .blue-link {
+  color: rgb(96, 165, 250) !important; /* dark:text-blue-400 */
+}
+
+.blue-link:hover {
+  color: rgb(29, 78, 216) !important; /* hover:text-blue-700 */
+}
+
+.dark .blue-link:hover {
+  color: rgb(147, 197, 253) !important; /* dark:hover:text-blue-300 */
+}
+
+/* White links for QueryEngineLayout descriptions */
+.white-link {
+  color: white !important;
+  text-decoration: underline;
+}
+
+.white-link:hover {
+  color: rgba(255, 255, 255, 0.8) !important;
+}
+
+/* White links in QueryEngineLayout titles */
+h2 .white-link,
+h2 a.white-link {
+  color: white !important;
+  text-decoration: underline !important;
+  text-underline-offset: 2px;
+}
+
+h2 .white-link:hover,
+h2 a.white-link:hover {
+  color: rgba(255, 255, 255, 0.8) !important;
+  text-decoration: underline !important;
 }
\ No newline at end of file
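Taken together, the component and stylesheet changes above widen `description` (and the `QueryEngineLayout` title/description props) from `string` to `string | React.ReactNode` and introduce a `blue-link` utility class, so card copy can carry styled internal links instead of plain strings. Below is a minimal caller-side sketch of that pattern; it is not part of this diff, and the relative import path, variable name, and link target are illustrative assumptions.

```tsx
// Hypothetical usage sketch (not part of this diff): with `description` widened to
// `string | React.ReactNode`, a Docusaurus <Link> styled with the new `blue-link`
// class can be embedded directly in card copy.
import React from 'react';
import Link from '@docusaurus/Link';
// Assumed relative path; FeatureCardProps is exported by the component in this diff.
import type { FeatureCardProps } from '../components/Iceberg/FeatureCard';

// A description value that mixes plain text with an internal, blue-styled link.
const trinoDescription: FeatureCardProps['description'] = (
  <>
    Query OLake-written Iceberg tables with{' '}
    <Link to="/iceberg/query-engine/trino" className="blue-link">
      Trino
    </Link>{' '}
    or any other supported engine.
  </>
);
```

Because the union type keeps `string` as a member, existing call sites that pass plain strings continue to compile unchanged, which appears to be the reason the props were widened rather than replaced.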
@@ -78,18 +90,21 @@ const IcebergHero: React.FC = () => { description='Apache Iceberg enables seamless schema evolution by supporting column additions, deletions, renames, and reordering ensuring reliable analytics on evolving datasets without rewriting historical data.' image='/img/site/iceberg-1.svg' imageAlt='Schema evolution' + learnMoreLink='/docs/features/?tab=schema-evolution' /> diff --git a/src/pages/index.jsx b/src/pages/index.jsx index 9e8f3ca6..31b66e65 100644 --- a/src/pages/index.jsx +++ b/src/pages/index.jsx @@ -22,8 +22,15 @@ export default function New3Page() { const OLakeFaqs = [ { question: 'How to Get Started?', - answer: - 'Check the Quickstart Guide. With a single Docker command you can spin up OLake and access the UI.' + answer: ( + <> + Check the{' '} + + Quickstart Guide + + . With a single Docker command you can spin up OLake and access the UI. + + ) }, { question: 'Is OLake Really Open Source?', diff --git a/styles/index.css b/styles/index.css index 0e8d5781..6e8b88aa 100644 --- a/styles/index.css +++ b/styles/index.css @@ -360,4 +360,46 @@ pre code { .animate-pulse-dot { animation: pulse-dot 2s infinite; +} + +/* Blue links for feature cards and table data */ +.blue-link { + color: rgb(37, 99, 235) !important; /* text-blue-600 */ + text-decoration: underline; +} + +.dark .blue-link { + color: rgb(96, 165, 250) !important; /* dark:text-blue-400 */ +} + +.blue-link:hover { + color: rgb(29, 78, 216) !important; /* hover:text-blue-700 */ +} + +.dark .blue-link:hover { + color: rgb(147, 197, 253) !important; /* dark:hover:text-blue-300 */ +} + +/* White links for QueryEngineLayout descriptions */ +.white-link { + color: white !important; + text-decoration: underline; +} + +.white-link:hover { + color: rgba(255, 255, 255, 0.8) !important; +} + +/* White links in QueryEngineLayout titles */ +h2 .white-link, +h2 a.white-link { + color: white !important; + text-decoration: underline !important; + text-underline-offset: 2px; +} + +h2 .white-link:hover, +h2 a.white-link:hover { + color: rgba(255, 255, 255, 0.8) !important; + text-decoration: underline !important; } \ No newline at end of file