2 changes: 1 addition & 1 deletion blog/2024-09-16-mongodb-etl-challenges.mdx
@@ -9,7 +9,7 @@ tags: [mongodb,etl ]

# Four Critical MongoDB ETL Challenges and How to tackle them for your Data Lake and Data Warehouse?

-![Mongo db logo showing ETL challenges](/img/blog/cover/mongodb-etl-challenges-cover.webp)
+![Monitor with leaf icon on green grid background, representing MongoDB ETL challenges](/img/blog/cover/mongodb-etl-challenges-cover.webp)


Moving data from MongoDB into a data warehouse or lakehouse for analytics and reporting can be a complex process. 
4 changes: 2 additions & 2 deletions blog/2024-09-24-querying-json-in-snowflake.mdx
@@ -118,7 +118,7 @@ In this query, you're flattening the orders array inside the `customer_data` JSO

**Output:**

-![Snowflake query result showing JSON data extraction with proper formatting](/img/blog/2024/09/querying-json-in-snowflake-3.webp)
+![Database query results table with one row for customer ID "C123", first order "O1001", and total orders as 3.](/img/blog/2024/09/querying-json-in-snowflake-3.webp)

* John doesn't have any orders, so he won't appear in the results.

@@ -1308,7 +1308,7 @@ Now, doing a `SELECT * customer_data`

**OUTPUT**:

-![Database query results table showing a single row with John Doe's customer info (name, age 30, and email) as a JSON object field](/img/blog/2024/09/querying-json-in-snowflake-28.webp)
+![Database query results table showing a single row with John Does customer info (name, age 30, and email) as a JSON object field](/img/blog/2024/09/querying-json-in-snowflake-28.webp)

**Querying the OBJECT**:

2 changes: 1 addition & 1 deletion blog/2024-10-18-flatten-array.mdx
@@ -379,7 +379,7 @@ df = json_normalize( data)

and you’ll be good to go.

-![Terminal showing a row from a DataFrame with id, name, nested projects JSON, and individual contact info columns](/img/blog/2024/11/flatten-array-24.webp)
+![Nested JSON data is transformed so each top-level key maps to a column in a flat table](/img/blog/2024/11/flatten-array-24.webp)

## Method 5: Flattening Nested JSON in PySpark

4 changes: 2 additions & 2 deletions blog/2025-01-07-olake-architecture.mdx
@@ -9,7 +9,7 @@ tags: [olake]

# OLake Architecture, How did we do it?

-![Pipeline diagram: source DB data chunked and routed to Amazon S3, then transformed and written to a lakehouse](/img/blog/cover/olake-architecture-cover.webp)
+![Diagram showing database sync flow: snapshot/CDC extraction, chunking, transform, Amazon S3, and writing to lakehouse](/img/blog/cover/olake-architecture-cover.webp)

update: [18.02.2025]
1. We support S3 data partitioning - refer docs [here](https://olake.io/docs/writers/parquet/partitioning)
@@ -184,7 +184,7 @@ These results prove that with chunk-based parallel loading and direct Writer int

To illustrate how concurrency is handled, here’s a more extended ASCII diagram:

-![Pipeline diagram: source DB data chunked and routed to Amazon S3, then transformed and written to a lakehouse](/img/blog/cover/olake-architecture-cover.webp)
+![Diagram showing database sync flow: snapshot/CDC extraction, chunking, transform, Amazon S3, and writing to lakehouse](/img/blog/cover/olake-architecture-cover.webp)

Each driver/writer pair can independently read chunks from MongoDB and write them directly to the target, while the Core monitors everything centrally.

8 changes: 4 additions & 4 deletions blog/2025-04-23-how-to-set-up-postgresql-cdc-on-aws-rds.mdx
@@ -67,7 +67,7 @@ Access is needed to modify following (please contact your DevOps team who has se
AWS RDS already has a default RDS parameter group as given in the below picture, and you won’t be able to edit the parameters from this group.


-![Amazon RDS parameter groups dashboard showing default MySQL and PostgreSQL parameter groups list](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-1.webp)
+![Apache Airflow logo with text 'with OLake', illustrating integration between Airflow workflow management and OLake platform](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-1.webp)

Hence it is advised to create a new parameter group as suggested below.

@@ -80,7 +80,7 @@ Hence it is advised to create a new parameter group as suggested below.
2. Choose the required postgres version


-![Create parameter group screen for PostgreSQL in AWS RDS, with CDC-enabled production setup fields shown](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-2.webp)
+![Create PostgreSQL parameter group in AWS RDS: prod-cdc-paramgroup for postgres14 with CDC enabled](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-2.webp)

3. Click on Create and parameter group will be created.

@@ -134,7 +134,7 @@ Everything on RDS runs within virtual private networks, which means we need to c
**Backup Retention Period**: Choose a backup retention period of at least 7 days.
:::

-![AWS RDS additional configuration showing selected DB parameter group and backup retention period option](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-5.webp)
+![Additional configuration for AWS RDS PostgreSQL instance showing DB parameter group selection (pg15) and backup retention period set to 1 day](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-5.webp)

* At the bottom, Continue -> Apply immediately -> Modify DB instance.

@@ -149,7 +149,7 @@ select * from pg_settings where name in ('wal_level', 'rds.logical_replication')
```
You should see results like below ( settings , on and logical )

-![SQL query results showing rds.logical_replication set to on and wal_level as logical, with descriptions](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-4.webp)
+![SQL query showing rds.logical_replication set to on and wal_level set to logical, enabling logical decoding for PostgreSQL](/img/blog/2025/04/how-to-set-up-postgresql-cdc-on-aws-rds-4.webp)

Now we could connect to this database using our Postgres root user. However, best practices are to use a dedicated account which has the minimal set of required privileges for CDC. Use this user credentials to connect to the Postgres source

44 changes: 22 additions & 22 deletions blog/2025-09-04-creating-job-olake-docker-cli.mdx
@@ -36,16 +36,16 @@ We'll take the "job-first" approach. It's straightforward and keeps you in one f
From the left nav, go to **Jobs → Create Job**.
You'll land on a wizard that starts with the **source**.

-![OLake jobs dashboard with the Jobs tab, Create Job button, and Create your first Job button highlighted](/img/docs/getting-started/create-your-first-job/job-create.webp)
+![OLake jobs dashboard for new users with option to create first job highlighted](/img/docs/getting-started/create-your-first-job/job-create.webp)

### 2) Configure the Source (Postgres)

Choose **Set up a new source** → select **Postgres** → keep OLake version at the latest stable.
Name it clearly, fill the Postgres endpoint config, and hit **Test Connection**.

-![OLake Create Job step 2 screen, showing source connector options including Postgres, MongoDB, MySQL, and Oracle, with Postgres highlighted](/img/docs/getting-started/create-your-first-job/job-source-connector.webp)
+![OLake create job interface with new source connector selection for MongoDB, Postgres, MySQL, Oracle](/img/docs/getting-started/create-your-first-job/job-source-connector.webp)

-![OLake Create Job with Postgres source configuration fields and a side help panel with setup steps](/img/docs/getting-started/create-your-first-job/job-source-config.webp)
+![OLake create job screen showing Postgres source endpoint and CDC configuration with setup guide](/img/docs/getting-started/create-your-first-job/job-source-config.webp)

> 📝 **Planning for CDC?**
> Make sure a **replication slot** exists in Postgres.
@@ -56,13 +56,13 @@ Name it clearly, fill the Postgres endpoint config, and hit **Test Connection**.
Now we set where the data will land.
Pick **Apache Iceberg** as the destination, and **AWS Glue** as the catalog.

-![OLake Create Job step 3 destination setup, showing connector selection with Amazon S3 and Apache Iceberg options, and Apache Iceberg highlighted](/img/docs/getting-started/create-your-first-job/job-dest-connector.webp)
+![OLake create job destination step showing Apache Iceberg and Amazon S3 connector selection](/img/docs/getting-started/create-your-first-job/job-dest-connector.webp)

-![OLake Create Job destination setup for Apache Iceberg, with Catalog Type dropdown showing AWS Glue, JDBC, Hive, and REST options](/img/docs/getting-started/create-your-first-job/job-dest-catalog.webp)
+![OLake create job destination endpoint config with catalog type selection AWS Glue JDBC Hive REST](/img/docs/getting-started/create-your-first-job/job-dest-catalog.webp)

Provide the connection details and **Test Connection**.

-![OLake Create Job destination config for Apache Iceberg with AWS Glue; right panel shows AWS Glue Catalog Write Guide with setup and prerequisites](/img/docs/getting-started/create-your-first-job/job-dest-config.webp)
+![OLake create job destination setup with Apache Iceberg, AWS Glue catalog, and S3 configuration form](/img/docs/getting-started/create-your-first-job/job-dest-config.webp)

### 4) Configure Streams

@@ -76,50 +76,50 @@ For this walkthrough, we'll:
- **Partitioning:** by **year** extracted from `dropoff_datetime`
- **Schedule:** every day at **12:00 AM**

-![OLake streams selection, employee_data and other tables checked, sync mode set to Full Refresh + CDC](/img/docs/getting-started/create-your-first-job/job-streams.webp)
+![OLake stream selection UI for Postgres to Iceberg job with Full Refresh + CDC mode](/img/docs/getting-started/create-your-first-job/job-streams.webp)

Select the checkbox for `fivehundred`, then click the stream name to open stream settings.
Pick the sync mode and toggle **Normalization**.

-![OLake streams- only five hundred selected, Full Refresh + CDC mode](/img/docs/getting-started/create-your-first-job/job-stream-select.webp)
+![OLake create job stream selection for Postgres to Iceberg with Full Refresh + CDC on fivehundred](/img/docs/getting-started/create-your-first-job/job-stream-select.webp)

Let's make the destination query-friendly. Open **Partitioning** → choose `dropoff_datetime` → **year**.
Want more? Read the [Partitioning Guide](/docs/writers/parquet/partitioning).

-![OLake: fivehundred stream selected, partition by dropoff_datetime and year](/img/docs/getting-started/create-your-first-job/job-stream-partition.webp)
+![OLake partitioning UI for stream fivehundred using dropoff_datetime and year fields in Iceberg](/img/docs/getting-started/create-your-first-job/job-stream-partition.webp)

Add the **Data Filter** so we only move rows from 2010 onward.

-![OLake: fivehundred stream, filter dropoff_datetime >= 2010-01-01](/img/docs/getting-started/create-your-first-job/job-data-filter.webp)
+![OLake create job with data filter for Postgres to Iceberg pipeline on dropoff_datetime column](/img/docs/getting-started/create-your-first-job/job-data-filter.webp)

Click **Next** to continue.

### 5) Schedule the Job

Give the job a clear name, set **Every Day @ 12:00 AM**, and hit **Create Job**.

-![OLake Create Job page showing step 1, with job name, frequency dropdown (Every Day highlighted), and job start time settings](/img/docs/getting-started/create-your-first-job/job-schedule.webp)
+![OLake create job stream filter UI for Postgres to Iceberg pipeline using dropoff_datetime column and operators](/img/docs/getting-started/create-your-first-job/job-schedule.webp)

You're set! 🎉

-![OLake job created successfully for fivehundred stream, Full Refresh + CDC](/img/docs/getting-started/create-your-first-job/job-creation-success.webp)
+![OLake job creation success dialog for Postgres to Iceberg ETL pipeline](/img/docs/getting-started/create-your-first-job/job-creation-success.webp)

Want results right away? Start a run immediately with **Jobs → (⋮) → Sync Now**.

-![Active jobs screen for OLake with job options menu expanded.](/img/docs/getting-started/create-your-first-job/job-sync-now.webp)
+![OLake jobs dashboard with actions menu for sync, edit streams, pause, logs, settings, delete](/img/docs/getting-started/create-your-first-job/job-sync-now.webp)

You'll see status badges on the right (**Running / Failed / Completed**).
For more details, open **Job Logs & History**.

- Running
-![OLake active jobs screen showing a running job](/img/docs/getting-started/create-your-first-job/job-running.webp)
+![OLake jobs dashboard showing active job status as running for Postgres to Iceberg pipeline](/img/docs/getting-started/create-your-first-job/job-running.webp)

- Completed
-![OLake active jobs screen showing a completed job](/img/docs/getting-started/create-your-first-job/job-success.webp)
+![OLake jobs dashboard showing completed status for Postgres to Iceberg pipeline job](/img/docs/getting-started/create-your-first-job/job-success.webp)

Finally, verify that data landed in S3/Iceberg as configured:

-![Amazon S3 folder view showing two Parquet files under dropoff_datetime_year=2011](/img/docs/getting-started/create-your-first-job/job-data-s3.webp)
+![Amazon S3 browser showing parquet files for dropoff_datetime_year=2011 partition folder](/img/docs/getting-started/create-your-first-job/job-data-s3.webp)

### 6) Manage Your Job (from the Jobs page)

@@ -128,29 +128,29 @@ Finally, verify that data landed in S3/Iceberg as configured:
**Edit Streams** — Change which streams are included and tweak replication settings.
Use the stepper to jump between **Source** and **Destination**.

-![Stream selection screen for OLake Postgres Iceberg job, with S3 folder and sync steps shown](/img/docs/getting-started/create-your-first-job/job-edit-streams-page.webp)
+![OLake Postgres Iceberg job UI with stepper showing Job Config, Source, Destination, Streams steps](/img/docs/getting-started/create-your-first-job/job-edit-streams-page.webp)

> By default, source/destination editing is locked. Click **Edit** to unlock.

-![OLake Postgres Iceberg job destination config with AWS Glue setup and edit option](/img/docs/getting-started/create-your-first-job/job-edit-destination.webp)
+![OLake destination config edit screen for Postgres Iceberg job with AWS Glue write guide](/img/docs/getting-started/create-your-first-job/job-edit-destination.webp)

> 🔄 **Need to change Partitioning / Filter / Normalization for an existing stream?**
> Unselect the stream → **Save** → reopen **Edit Streams** → re-add it with new settings.

**Pause Job** — Temporarily stop runs. You'll find paused jobs under **Inactive Jobs**, where you can **Resume** any time.

-![Inactive jobs tab showing a PostgreSQL job with the option to resume in the OLake UI](/img/docs/getting-started/create-your-first-job/job-resume.webp)
+![OLake inactive jobs list with menu showing resume job option for Postgres Iceberg pipeline](/img/docs/getting-started/create-your-first-job/job-resume.webp)

**Job Logs & History** — See all runs. Use **View Logs** for per-run details.

-![Job log history for a Postgres Iceberg job, showing a completed status and option to view logs.](/img/docs/getting-started/create-your-first-job/view-logs.webp)
+![OLake Postgres Iceberg job logs history screen showing completed run and view logs action](/img/docs/getting-started/create-your-first-job/view-logs.webp)

-![OLake Postgres Iceberg job logs showing system info and sync steps with Iceberg writer and Postgres source.](/img/docs/getting-started/create-your-first-job/logs-page.webp)
+![OLake job logs screen displaying detailed execution logs for Postgres to Iceberg sync job](/img/docs/getting-started/create-your-first-job/logs-page.webp)

**Job Settings** — Rename, change frequency, pause, or delete.
Deleting a job moves its source/destination to **inactive** (if not used elsewhere).

-![Active Postgres Iceberg job settings screen; job runs daily at 12 AM UTC with pause and delete options](/img/docs/getting-started/create-your-first-job/job-settings.webp)
+![OLake job settings screen showing scheduling, pause and delete options for Postgres Iceberg job](/img/docs/getting-started/create-your-first-job/job-settings.webp)

## Option B — OLake CLI (Docker)

2 changes: 1 addition & 1 deletion blog/2025-09-04-deletion-formats-deep-dive.mdx
@@ -38,7 +38,7 @@ This metadata layer consists of:
- **Manifest files** that contain information about data files and their statistics
- **Data files** where your actual data lives in formats like Parquet or Avro

-![OLake architecture diagram with connectors between user, database, and lakehouse](/img/blog/2025/11/architecture.webp)
+![MongoDB operational database to Apache Iceberg analytical lakehouse migration](/img/blog/2025/11/architecture.webp)

This layered architecture is what makes Iceberg so powerful. When you want to query your data, the engine doesn't need to scan directories or enumerate files; it simply reads the metadata to understand exactly which data files contain the information you need.

8 changes: 4 additions & 4 deletions blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx
@@ -164,7 +164,7 @@ Configure your Apache Iceberg destination in the OLake UI:

OLake supports multiple Iceberg catalog implementations including Glue, Nessie, Polaris, Hive, and Unity Catalog. For detailed configuration of other catalogs, refer to the [OLake Catalogs Documentation](https://olake.io/docs/writers/iceberg/catalog/rest/).

-![OLake destination setup UI for Apache Iceberg with AWS Glue catalog configuration form](/img/blog/2025/12/step-4.webp)
+![OLake UI create destination screen for Apache Iceberg AWS Glue catalog configuration](/img/blog/2025/12/step-4.webp)

### Step 5: Create and Configure Your Replication Job

@@ -177,7 +177,7 @@ Once source and destination connections are established:
3. Select your existing source and destination configurations
4. In the schema section, choose tables/streams for Iceberg synchronization

-![OLake create job UI selecting existing Postgres data source for pipeline setup](/img/blog/2025/12/step-5-1.webp)
+![OLake create job wizard selecting MongoDB source from existing connectors](/img/blog/2025/12/step-5-1.webp)

#### Choose Synchronization Mode

@@ -194,7 +194,7 @@ For each stream, select the appropriate sync mode based on your requirements:
- **Partitioning**: Configure regex patterns for Iceberg table partitioning
- **Detailed partitioning strategies**: [Iceberg Partitioning Guide](https://olake.io/docs/writers/iceberg/partitioning)

-![OLake stream selection step with Full Refresh + CDC sync for dz-stag-users table](/img/blog/2025/12/step-5-2.webp)
+![OLake job stream selection UI picking tables and configuring CDC sync mode](/img/blog/2025/12/step-5-2.webp)

### Step 6: Execute Your Synchronization

@@ -233,7 +233,7 @@ To validate your replication setup, configure AWS Athena for querying your Icebe
2. Execute SQL queries against your replicated Iceberg tables
3. Verify data consistency and query performance

-![Amazon Athena editor querying olake_test_table with SQL SELECT and results](/img/blog/2025/12/step-7.webp)
+![Amazon Athena query editor showing SQL SELECT on olake_test_table with results](/img/blog/2025/12/step-7.webp)

## Production-Ready Best Practices for PostgreSQL to Iceberg Replication
