
diff --git a/docs/core/use-cases.mdx b/docs/core/use-cases.mdx
index 23d11664..13d50452 100644
--- a/docs/core/use-cases.mdx
+++ b/docs/core/use-cases.mdx
@@ -11,7 +11,7 @@ sidebar_position: 3
### 1. Offloading OLTP Databases for Analytics
Running complex analytical queries directly on production **OLTP (Online Transaction Processing) databases** can degrade performance and affect transactional workloads.
-OLake addresses this by replicating data from **MySQL**, **PostgreSQL**, **Oracle**, and **MongoDB** into an **Apache Iceberg** based data lake.
+OLake addresses this by replicating data from **MySQL**, **PostgreSQL**, **Oracle**, and **MongoDB** into an [**Apache Iceberg**](/iceberg/why-iceberg) based data lake.
This approach provides:
@@ -26,7 +26,7 @@ This approach provides:
With OLake, you can maintain stable transactional systems while enabling scalable and reliable analytics on **Apache Iceberg**.
### 2. Building Open Data Stacks and Scaling Data Engineering
-Organizations looking to reduce reliance on proprietary ETL and data warehousing tools can use **OLake** as part of an **open-source data stack**. By standardizing on **Apache Iceberg** as the table format, OLake ensures broad compatibility with query engines like **Trino**, **Presto**, **Spark**, **Dremio**, and **DuckDB**.
+Organizations looking to reduce reliance on proprietary ETL and data warehousing tools can use **OLake** as part of an [**open-source data stack**](/blog/building-open-data-lakehouse-with-olake-presto). By standardizing on **Apache Iceberg** as the table format, OLake ensures broad compatibility with query engines like **Trino**, **Presto**, **Spark**, **Dremio**, and **DuckDB**.
With its open-source approach, OLake helps teams:
@@ -43,7 +43,7 @@ Support multiple query engines across different use cases and teams.
This enables a **flexible**, **scalable**, and **future-proof data architecture** built on open standards.
### 3. Enabling Near-Real-Time Analytics
-Modern applications need fresh data within minutes, not hours. **OLake** enables near-real-time analytics by continuously replicating data from transactional databases using **log-based CDC**, often achieving **sub-minute** latency for updates to appear in **Iceberg**.
+Modern applications need fresh data within minutes, not hours. **OLake** enables near-real-time analytics by continuously replicating data from transactional databases using [**log-based CDC**](/docs/understanding/terminologies/general/#39-change-data-capture), often achieving **sub-minute** latency for updates to appear in **Iceberg**.
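To make the latency claim concrete, a change captured from the database log arrives downstream as a discrete event. The sketch below shows a plausible shape for such an event in TypeScript; the field names are illustrative assumptions, not OLake's actual wire format.

```ts
// Illustrative shape of a log-based CDC event; field names are assumptions,
// not OLake's actual wire format.
type ChangeEvent<Row> = {
  op: 'insert' | 'update' | 'delete';
  before: Row | null;     // prior row image (null for inserts)
  after: Row | null;      // new row image (null for deletes)
  sourcePosition: string; // position in the database log, e.g. a WAL LSN
  committedAt: Date;      // commit time, used to measure end-to-end latency
};
```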
Key benefits:
@@ -73,7 +73,7 @@ Adapt to schema changes seamlessly with Iceberg.
### 5. Powering AI and ML Data Pipelines
Building effective AI and ML models requires **fresh**, **reliable**, and **structured data**. **OLake** automates the ingestion of transactional data into an **Iceberg-based lakehouse**, ensuring that pipelines always have access to the latest information.
-With continuous updates, ML feature stores and training datasets stay current, while Iceberg’s compatibility with engines like **PySpark** and **DuckDB** makes it easy to plug into existing data science workflows. This supports faster model development and iteration.
+With continuous updates, [ML feature stores](/blog/apache-iceberg-vs-delta-lake-guide) and training datasets stay current, while Iceberg's compatibility with engines like **PySpark** and **DuckDB** makes it easy to plug into existing data science workflows. This supports faster model development and iteration.
Key benefits:
@@ -97,7 +97,7 @@ Key benefits:
- Dead Letter Queue for dependable error management.
### 7. Reducing Cloud Data Warehouse Costs
-Cloud data warehouses can become expensive due to storage and compute costs. **OLake** helps reduce these expenses by offloading raw, historical, or less frequently used data into an **Iceberg lakehouse** on cost-effective object storage.
+Cloud data warehouses can become expensive due to storage and compute costs. **OLake** helps reduce these expenses by offloading raw, historical, or less frequently used data into an [**Iceberg lakehouse**](/iceberg/move-to-iceberg) on cost-effective object storage.
This lets teams keep their warehouse optimized for active data, while still retaining full access to complete datasets in Iceberg.
diff --git a/docs/features/index.mdx b/docs/features/index.mdx
index ede8d8b6..dff42b5e 100644
--- a/docs/features/index.mdx
+++ b/docs/features/index.mdx
@@ -15,7 +15,7 @@ import TabItem from '@theme/TabItem';
## Source Level Features
-### 1. Parallelised Chunking
+### 1. [Parallelised Chunking](/blog/what-makes-olake-fast)
Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental.
@@ -27,7 +27,7 @@ Parallel chunking is a technique that splits large datasets or collections into
- Enables **parallel reads**, dramatically reducing the time needed to perform full snapshots or scans of large datasets.
- Improves ingestion speed, scalability, and overall system performance.
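As a rough illustration of the technique (OLake itself is written in Go; this TypeScript sketch only mirrors the idea), chunking splits a key range into fixed-size windows and reads them concurrently:

```ts
// Minimal sketch of parallel chunking: split a primary-key range into
// virtual chunks and read them concurrently. Illustrative only.
async function readInChunks(
  minId: number,
  maxId: number,
  chunkSize: number,
  readChunk: (lo: number, hi: number) => Promise<void>,
): Promise<void> {
  const tasks: Promise<void>[] = [];
  for (let lo = minId; lo <= maxId; lo += chunkSize) {
    const hi = Math.min(lo + chunkSize - 1, maxId);
    tasks.push(readChunk(lo, hi)); // each chunk is read independently
  }
  await Promise.all(tasks); // chunks may complete in any order
}
```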
-### 2. Sync Modes Supported
+### 2. [Sync Modes](/blog/what-makes-olake-fast) Supported
OLake supports the following sync modes to provide flexibility across use cases:
@@ -74,7 +74,7 @@ Data Deduplication ensures that only unique records are stored and processed : s
Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization.
-- **Iceberg partitioning** → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
+- [**Iceberg partitioning**](/docs/writers/iceberg/partitioning/) → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
- **S3-style partitioning** → Traditional folder-based layout (e.g., `year=2025/month=08/day=22/`) for compatibility with external tools.
- **Normalization** → Automatically expands **level-1 nested JSON fields** into top-level columns.
@@ -89,7 +89,7 @@ Partitioning is the process of dividing large datasets into smaller, more manage
- Reduces the need for complex JSON parsing in queries.
- Improves readability and downstream analytics efficiency.
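To illustrate what level-1 normalization does to a record, here is a minimal sketch; the `parent_child` column-naming convention is an assumption, not necessarily the exact names OLake emits:

```ts
// Expand level-1 nested JSON fields into top-level columns. Sketch of the
// transformation; the parent_child naming convention is an assumption.
function normalizeLevelOne(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      for (const [nested, nestedValue] of Object.entries(value as Record<string, unknown>)) {
        out[`${key}_${nested}`] = nestedValue; // promote nested field to a column
      }
    } else {
      out[key] = value;
    }
  }
  return out;
}

// normalizeLevelOne({ id: 1, address: { city: 'Pune', zip: '411001' } })
// => { id: 1, address_city: 'Pune', address_zip: '411001' }
```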
-### 3. Schema Evolution & Data Types Changes
+### 3. [Schema Evolution](/blog/2025/10/03/iceberg-metadata) & Data Type Changes
OLake automatically handles changes in your table's schema without breaking downstream jobs. Read more: [Schema Evolution in OLake](/docs/features?tab=schema-evolution)
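Conceptually, handling schema changes means reconciling each incoming batch's schema with the table's current one: new columns are added and existing columns are widened rather than dropped. The sketch below illustrates that merge; the type ladder is an assumption, not OLake's exact promotion rules.

```ts
type ColumnType = 'int' | 'long' | 'float' | 'double' | 'string';

// Wider types get a higher rank; this ordering is an illustrative
// assumption, not OLake's exact promotion rules.
const rank: Record<ColumnType, number> = { int: 0, long: 1, float: 2, double: 3, string: 4 };

// Merge an incoming batch schema into the table schema: add new columns,
// widen existing ones when needed, never drop anything.
function evolve(
  table: Map<string, ColumnType>,
  incoming: Map<string, ColumnType>,
): Map<string, ColumnType> {
  const next = new Map(table);
  for (const [name, type] of incoming) {
    const current = next.get(name);
    next.set(name, current !== undefined && rank[current] >= rank[type] ? current : type);
  }
  return next;
}
```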
diff --git a/docs/getting-started/creating-first-pipeline.mdx b/docs/getting-started/creating-first-pipeline.mdx
index 63cb85f7..e1cb3eb8 100644
--- a/docs/getting-started/creating-first-pipeline.mdx
+++ b/docs/getting-started/creating-first-pipeline.mdx
@@ -13,7 +13,7 @@ By the end of this tutorial, you’ll have a complete replication workflow runni
## Prerequisites
-Follow the [Quickstart Setup Guide](/docs/getting-started/quickstart) to ensure the OLake UI is running at [localhost:8000](http://localhost:8000)
+Follow the [Quickstart Setup Guide](/docs/getting-started/quickstart) to ensure the [OLake UI](/docs/install/olake-ui/) is running at [localhost:8000](http://localhost:8000)
### What is a Job?
@@ -58,7 +58,7 @@ Choose **Resource-first** if your source and destination are already configured,
In this guide, we'll use the **Job-first workflow** to set up a job from configuring the source and destination to running it. If you prefer video, check out our [video walkthrough](#video-tutorial).
First things first, every job needs a source and a destination before it can run.
-For this demonstration, we'll use **Postgres** as the source and **Iceberg with Glue catalog** as the destination.
+For this demonstration, we'll use [**Postgres**](/docs/connectors/postgres) as the source and [**Apache Iceberg**](/iceberg/why-iceberg) with [**Glue Catalog**](/docs/writers/iceberg/catalog/glue/) as the destination.
Let's get started!
@@ -171,7 +171,7 @@ Here, you can choose your preferred [sync mode](/docs/understanding/terminologie
For this guide, we'll configure the following:
- Replicate the `fivehundred` stream (name of the table).
-- Use **Full Refresh + CDC** as the sync mode.
+- Use [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported) as the sync mode.
- Enable **data Normalization**.
- Modify Destination Database name (if required).
- Replicate only data where `dropoff_datetime` >= `2010-01-01 00:00:00` (basically data from 2010 onward).
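The filter in the last bullet is just a row-level predicate; expressed in TypeScript for clarity (OLake evaluates the condition during the sync, so this code is purely illustrative):

```ts
// The dropoff_datetime filter from the list above, as a predicate.
// Purely illustrative; OLake applies the condition during the sync itself.
const cutoff = new Date('2010-01-01T00:00:00Z');

function shouldReplicate(row: { dropoff_datetime: Date }): boolean {
  return row.dropoff_datetime.getTime() >= cutoff.getTime();
}
```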
@@ -180,7 +180,7 @@ For this guide, we'll configure the following:
Let's start by selecting the `fivehundred` stream (or any stream from your source) by checking its checkbox to include it in the replication.
Click the stream name to open the stream-level settings panel on the right side.
-In the panel, set the **sync mode** to **Full Refresh + CDC**, and enable **Normalization** by toggling the switch on.
+In the panel, set the **sync mode** to [**Full Refresh + CDC**](/docs/features/#2-sync-modes-supported), and enable **Normalization** by toggling the switch on.
. It comes preconfigured with all the required components, allowing you to experience the complete workflow without manual setup.
## Included Components
@@ -14,11 +14,11 @@ OLake Playground is a self-contained environment for exploring lakehouse archite
- **OLake** – Schema discovery and CDC ingestion via an intuitive UI
- **MinIO** – Object store for data storage
- **Temporal** – Workflow orchestration for ingestion processes
-- **Presto** – Query engine for Iceberg tables
+- [**Presto**](/iceberg/query-engine/presto/) – Query engine for Iceberg tables
## Objective
-Enable developers to experiment with an end-to-end, Iceberg-native lakehouse in minutes. Simply run a single `docker-compose up` command to launch the full stack — no service stitching, no configuration files required.
+Enable developers to experiment with an end-to-end, Iceberg-native lakehouse in minutes. Simply run a single [Docker Compose](/docs/getting-started/quickstart) command (`docker-compose up`) to launch the full stack: no service stitching, no configuration files required.
## ⚙️ Prerequisites
diff --git a/docs/getting-started/quickstart.mdx b/docs/getting-started/quickstart.mdx
index 66071bc1..17ce2dd9 100644
--- a/docs/getting-started/quickstart.mdx
+++ b/docs/getting-started/quickstart.mdx
@@ -6,7 +6,7 @@ sidebar_position: 1
---
# How to get started with OLake
-This QuickStart guide helps get started with OLake UI, a web-based interface designed to simplify the management of OLake jobs, sources, destinations, and configurations.
+This QuickStart guide helps you get started with [OLake UI](/docs/install/olake-ui/), a web-based interface designed to simplify the management of OLake jobs, sources, destinations, and configurations.
## Prerequisites
diff --git a/docs/install/olake-ui/index.mdx b/docs/install/olake-ui/index.mdx
index 640cef88..c53bef59 100644
--- a/docs/install/olake-ui/index.mdx
+++ b/docs/install/olake-ui/index.mdx
@@ -66,7 +66,7 @@ The default credentials are:
-For detailed job creation instructions, see [Create Jobs](../jobs/create-jobs).
+For detailed job creation instructions, see [Create Jobs](/blog/creating-job-olake-docker-cli) or [Creating First Pipeline](/docs/getting-started/creating-first-pipeline).
## Service Configuration
diff --git a/docs/intro.mdx b/docs/intro.mdx
index 3a5cca05..5b91ed4d 100644
--- a/docs/intro.mdx
+++ b/docs/intro.mdx
@@ -28,7 +28,7 @@ slug: /
## What is OLake?
-OLake is an open-source ELT framework, fully written in Golang for memory efficiency and high performance. It replicates data from sources like PostgreSQL, MySQL, MongoDB, Oracle and Kafka (WIP) directly into open lakehouse formats such as Apache Iceberg and Parquet.
+OLake is an open-source ELT framework, fully written in Golang for memory efficiency and high performance. It replicates data from sources like PostgreSQL, MySQL, MongoDB, Oracle, and Kafka (WIP) directly into open lakehouse formats such as [Apache Iceberg](/iceberg/why-iceberg) and Parquet.
Using Incremental Sync and Change Data Capture (CDC), OLake keeps data continuously in sync while minimizing infrastructure overhead—offering a simple, reliable, and scalable path to building a modern lakehouse.
This allows organizations to:
@@ -39,9 +39,9 @@ This allows organizations to:
---
## Why OLake?
-- **Fastest Path to a Lakehouse** → Achieve high throughput with **parallelized chunking** and **resumable** historical snapshots and blazing-fast incremental updates, even on massive datasets with **exactly-once** delivery.
+- **Fastest Path to a Lakehouse** → Achieve high throughput with [**parallelized chunking**](/docs/features/#1-parallelised-chunking), **resumable** historical snapshots, and blazing-fast incremental updates, even on massive datasets, with **exactly-once** delivery.
-- **Efficient Data Capture** → Capture data efficiently with a full snapshot of your tables or collections, then keep them in sync through near real-time **CDC** using native database logs (**pgoutput, binlogs, oplogs**).
+- **Efficient Data Capture** → Capture data efficiently with a full snapshot of your tables or collections, then keep them in sync through near real-time **CDC** using native database logs (**pgoutput, [binlogs](/blog/binlogs), [oplogs](/docs/understanding/terminologies/general/#26-oplog-mongodb)**).
- **Schema-Aware Replication** → Automatically detect schema changes to keep your pipelines consistent and reliable.
@@ -64,7 +64,11 @@ This allows organizations to:
- **PostgreSQL** → CTID ranges, batch-size splits, next-query paging
- **MySQL** → Range splits with LIMIT/OFFSET
- **MongoDB** → Split-Vector, Bucket-Auto, Timestamp
-- **Oracle** → DBMS Parallel Execute
+- **Oracle** → DBMS Parallel Execute
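For example, the MySQL strategy above reduces to paging through the table in fixed-size windows; here is a sketch of how such split boundaries could be generated (illustrative TypeScript, not OLake's Go implementation):

```ts
// Generate LIMIT/OFFSET windows covering `rowCount` rows, the idea behind
// the MySQL range-split strategy above. Illustrative sketch only.
function limitOffsetSplits(rowCount: number, chunkSize: number): { limit: number; offset: number }[] {
  const splits: { limit: number; offset: number }[] = [];
  for (let offset = 0; offset < rowCount; offset += chunkSize) {
    splits.push({ limit: chunkSize, offset });
  }
  return splits;
}

// limitOffsetSplits(10, 4)
// => [{ limit: 4, offset: 0 }, { limit: 4, offset: 4 }, { limit: 4, offset: 8 }]
```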
+
+#### Source Level Features
+
+- [**oplog**](/docs/understanding/terminologies/general/#26-oplog-mongodb) → MongoDB's operation log, used for CDC
---
diff --git a/docs/understanding/compatibility-catalogs.mdx b/docs/understanding/compatibility-catalogs.mdx
index 252e7bad..f0f51821 100644
--- a/docs/understanding/compatibility-catalogs.mdx
+++ b/docs/understanding/compatibility-catalogs.mdx
@@ -6,7 +6,7 @@ sidebar_label: Compatibility to Iceberg Catalogs
# Compatibility to Iceberg Catalogs
-OLake supports multiple Iceberg catalog implementations, letting you choose the one that best fits your environment. The table below shows the supported catalogs at a glance, with links to their setup guides.
+OLake supports multiple Iceberg catalog implementations, including [REST Catalog](/docs/writers/iceberg/catalog/rest/), [Hive Metastore](/docs/writers/iceberg/catalog/hive/), and [JDBC Catalog](/docs/writers/iceberg/catalog/jdbc/), letting you choose the one that best fits your environment. The table below shows the supported catalogs at a glance, with links to their setup guides.
| | Catalog | Link |
| ----------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------------------ |
diff --git a/docs/understanding/compatibility-engines.mdx b/docs/understanding/compatibility-engines.mdx
index db1655fc..8eca809b 100644
--- a/docs/understanding/compatibility-engines.mdx
+++ b/docs/understanding/compatibility-engines.mdx
@@ -15,7 +15,7 @@ You can query OLake Iceberg tables from multiple engines. The table below shows
| Apache Flink (v1.18+) | ✅ | ✅ | ✅ | ✅ | [Flink Docs](https://iceberg.apache.org/docs/latest/flink-configuration/) |
| Trino (v475 +) | ✅ | ✅ | ✅ | ✅ | [Trino Docs](https://trino.io/docs/current/object-storage/metastores.html) |
| Starburst Enterprise | ✅ | ✅ | ✅ | ✅ | [Starburst Docs](https://docs.starburst.io/latest/object-storage/metastores.html) |
-| Presto (v0.288 +) | ✅ | ✅ | ✅ | ✅ | [Presto Guide](https://ibm.github.io/presto-iceberg-lab/lab-1/) |
+| [Presto](/blog/building-open-data-lakehouse-with-olake-presto) (v0.288 +) | ✅ | ✅ | ✅ | ✅ | [Presto Guide](https://ibm.github.io/presto-iceberg-lab/lab-1/) |
| Apache Hive (v4.0) | ✅ | ✅ | ❌ | ✅ | [Hive Docs](https://iceberg.apache.org/docs/latest/hive/) |
| Apache Impala (v4.4) | ❌ | ✅ | ❌ | ❌ | [Impala Docs](https://impala.apache.org/docs/build/html/topics/impala_iceberg.html) |
| Dremio (v25/26) | ✅ | ✅ | ❌ | ✅ | [Dremio Docs](https://docs.dremio.com/current/release-notes/version-260-release/) |
diff --git a/iceberg/2025-05-08-olake-iceberg-athena.mdx b/iceberg/2025-05-08-olake-iceberg-athena.mdx
index f3f73d03..520fbbf6 100644
--- a/iceberg/2025-05-08-olake-iceberg-athena.mdx
+++ b/iceberg/2025-05-08-olake-iceberg-athena.mdx
@@ -40,15 +40,15 @@ Iceberg's intelligent metadata structure allows query engines to eliminate scann
### 5. Championing Openness:
-Built as an open specification, Iceberg ensures you're never locked into a single vendor or engine. Your Iceberg tables on S3 are accessible by a wide array of tools – ingestion platforms like OLake, processing engines, and query engines like Trino, Athena, and Spark SQL – providing ultimate flexibility.
+Built as an open specification, Iceberg ensures you're never locked into a single vendor or engine. Your Iceberg tables on S3 are accessible by a wide array of tools – ingestion platforms like OLake, processing engines, and query engines like [Trino](/iceberg/query-engine/trino), Athena, and Spark SQL – providing ultimate flexibility.
-This is where the combination of OLake, Apache Iceberg, AWS Glue Data Catalog, and Amazon Athena provides a powerful, simple, and serverless solution.
+This is where the combination of OLake, Apache Iceberg, AWS Glue Data Catalog, and [Amazon Athena](/iceberg/query-engine/athena) provides a powerful, simple, and serverless solution.
- **OLake**: An open-source tool designed for simple, lightweight, and fast data ingestion from databases
- **Apache Iceberg**: A high-performance open table format that brings reliability, schema evolution, and time travel to data files on S3
-- **AWS Glue Data Catalog**: A centralized, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable
+- [**AWS Glue Data Catalog**](/docs/writers/iceberg/catalog/glue/): A centralized, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable
- **Amazon Athena**: A serverless query engine that can directly query data in S3 using metadata from glue, perfect for interactive analytics
@@ -140,7 +140,7 @@ Run the [Discover](https://olake.io/docs/connectors/postgres/overview) command
Run the Sync command to replicate data from your source database into Apache Iceberg tables stored on Amazon S3
-## Step 3: Query Iceberg Data Using Amazon Athena
+## Step 3: Query Iceberg Data Using [Amazon Athena](/iceberg/query-engine/athena)
Once OLake has synced data into Iceberg tables and registered them with Glue, you can query it instantly using Amazon Athena
diff --git a/iceberg/2025-05-08-olake-iceberg-trino.mdx b/iceberg/2025-05-08-olake-iceberg-trino.mdx
index 83ec2ba4..ecd8c18d 100644
--- a/iceberg/2025-05-08-olake-iceberg-trino.mdx
+++ b/iceberg/2025-05-08-olake-iceberg-trino.mdx
@@ -36,7 +36,7 @@ Thanks to its smart metadata, iceberg can quickly figure out which files actuall
### Open and Flexible Architecture
-Iceberg is built as an open standard. You're not locked into a single vendor or toolset. Iceberg tables work with a wide range of technologies—like OLake for ingestion and trino, athena, spark and others for querying. This gives you the freedom to build the architecture that fits your needs.
+Iceberg is built as an open standard. You're not locked into a single vendor or toolset. Iceberg tables work with a wide range of technologies: OLake for ingestion, and [Trino, Athena, Spark](/iceberg/query-engine) and other engines for querying. This gives you the freedom to build the architecture that fits your needs.
## Why This Combination?
@@ -46,9 +46,9 @@ Iceberg is built as an open standard. You're not locked into a single vendor or
- **Apache Iceberg**: A high-performance open table format that brings reliability, schema evolution, and time travel to data files on S3
-- **AWS Glue Data Catalog**: A centralised, managed metadata repository that can act as the iceberg catalog, making your S3 data discoverable
+- [**AWS Glue Data Catalog**](/docs/writers/iceberg/catalog/glue/): A centralised, managed metadata repository that can act as the Iceberg catalog, making your S3 data discoverable
-- **Trino**: Fast & distributed SQL query engine. Connects to many data sources
+- [**Trino**](/iceberg/query-engine/trino): Fast & distributed SQL query engine. Connects to many data sources
Together, these tools allow you to build an end-to-end pipeline from your database to query-ready data on S3 with minimal infrastructure to manage.
diff --git a/src/components/Iceberg/FeatureCard.tsx b/src/components/Iceberg/FeatureCard.tsx
index ece5dec5..21b53601 100644
--- a/src/components/Iceberg/FeatureCard.tsx
+++ b/src/components/Iceberg/FeatureCard.tsx
@@ -2,6 +2,7 @@
import React, { useState } from 'react';
import { Dialog, Transition, Tab } from '@headlessui/react';
import { Fragment } from 'react';
+import Link from '@docusaurus/Link';
import {
XMarkIcon,
CheckCircleIcon,
@@ -50,7 +51,7 @@ export interface FeatureDetail {
export interface FeatureCardProps {
title: string;
chip?: string;
- description: string;
+ description: string | React.ReactNode;
icon: React.ReactNode;
details: FeatureDetail;
color?: 'blue' | 'green' | 'purple' | 'orange' | 'red' | 'yellow';
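Widening `description` to accept a `React.ReactNode` is what lets the docs pages above embed links inside card copy. A usage sketch (prop values are illustrative; `FeatureCard` and `featureDetails` are assumed to be in scope):

```tsx
import React from 'react';
import Link from '@docusaurus/Link';
// FeatureCard and featureDetails are assumed to be in scope; values illustrative.

const ChunkingCard = () => (
  <FeatureCard
    title="Parallelised Chunking"
    description={
      <>
        Split large tables into virtual chunks and read them in parallel. See{' '}
        <Link to="/docs/features/#1-parallelised-chunking">the feature docs</Link>.
      </>
    }
    icon={<span>⚡</span>}
    details={featureDetails}
  />
);
```

Note that `React.ReactNode` already includes `string`, so the `string | React.ReactNode` union is redundant at the type level; it mainly documents that plain strings remain the common case.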
diff --git a/src/components/Iceberg/QueryEngineLayout.tsx b/src/components/Iceberg/QueryEngineLayout.tsx
index 6c0d0ba7..a1a659e1 100644
--- a/src/components/Iceberg/QueryEngineLayout.tsx
+++ b/src/components/Iceberg/QueryEngineLayout.tsx
@@ -22,13 +22,13 @@ export interface CodeExample {
export interface UseCase {
title: string;
description: string;
- scenarios: string[];
+ scenarios: (string | React.ReactNode)[];
icon?: React.ReactNode;
}
export interface QueryEngineLayoutProps {
- title: string;
- description: string;
+ title: string | React.ReactNode;
+ description: string | React.ReactNode;
features: FeatureCardProps[];
tableData: InteractiveTableProps;
// codeExamples?: CodeExample[];
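The same pattern applies to `scenarios`, which can now mix plain strings with JSX entries; a small sketch (values illustrative):

```tsx
import React from 'react';
import Link from '@docusaurus/Link';
// UseCase is the interface defined above; values are illustrative.

const interactiveAnalytics: UseCase = {
  title: 'Interactive analytics',
  description: 'Query OLake-written Iceberg tables with low latency.',
  scenarios: [
    'Ad-hoc SQL over freshly synced tables',
    <Link key="presto" to="/iceberg/query-engine/presto/">Presto on Iceberg</Link>,
  ],
};
```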
diff --git a/src/components/site/FeatureShowcase.tsx b/src/components/site/FeatureShowcase.tsx
index 88da4314..6f801bc1 100644
--- a/src/components/site/FeatureShowcase.tsx
+++ b/src/components/site/FeatureShowcase.tsx
@@ -4,12 +4,14 @@ const FeatureCard = ({
title,
description,
illustration,
- bgColor
+ bgColor,
+ href
}: {
title: string
description: string
illustration: React.ReactNode
bgColor: string
+ href?: string
}) => {
// Get the appropriate blur color based on background color
const getBlurColor = () => {
@@ -19,9 +21,11 @@ const FeatureCard = ({
return '#bae6fd' // default
}
- return (
+ const cardContent = (
@@ -46,6 +50,16 @@ const FeatureCard = ({
)
+
+ if (href) {
+    return (
+      <a href={href}>
+        {cardContent}
+      </a>
+    )
+ }
+
+ return cardContent
}
const FeatureShowcase: React.FC = () => {
@@ -75,6 +89,7 @@ const FeatureShowcase: React.FC = () => {
}
bgColor='bg-[#C7ECFF] dark:bg-blue-900/20'
+ href='/docs/features/#3-stateful-resumable-syncs'
/>
{
}
bgColor='bg-[#E9EBFD] dark:bg-indigo-900/20'
+ href='/blog/olake-architecture'
/>