diff --git a/pages/docs/tracking-methods/warehouse-connectors.mdx b/pages/docs/tracking-methods/warehouse-connectors.mdx
index 353a4b3098..1371a6dc50 100644
--- a/pages/docs/tracking-methods/warehouse-connectors.mdx
+++ b/pages/docs/tracking-methods/warehouse-connectors.mdx
@@ -6,23 +6,23 @@ import { dwhItems, mirrorItems } from '../../../utils/constants';

# Warehouse Connectors: Sync data from your data warehouse into Mixpanel

Warehouse Connector is a paid add-on available to organizations on a Growth or Enterprise plan. Learn more on our [pricing page](https://mixpanel.com/pricing/).

-With Warehouse Connectors you can sync data from data warehouses like Snowflake, BigQuery, Databricks, and Redshift to Mixpanel. By unifying business data with product usage events, you can answer many more questions in Mixpanel:
+With Warehouse Connectors, you can sync data from data warehouses like Snowflake, BigQuery, Databricks, and Redshift to Mixpanel. By unifying business data with product usage events, you can answer many more questions in Mixpanel:
* What percentage of our Enterprise revenue uses the features we shipped last year?
* Did our app redesign reduce support tickets?
* Which account demographics have the best retention?
* We spent $50,000 on a marketing campaign, did the users we acquired stick around a month later?

-Mixpanel's [Mirror](#mirror) sync mode keeps the data in Mixpanel fully in sync with any changes that occur in the warehouse including updating historical events that are deleted or modified in your warehouse.
+Mixpanel's [Mirror](#mirror) sync mode keeps the data in Mixpanel fully in sync with any changes that occur in the warehouse, including updating historical events that are deleted or modified in your warehouse.

In this guide, we'll walk through how to set up Warehouse Connectors. The integration is completely codeless, but you will need someone with access to your DWH to help with the initial setup.

## Getting Started

- To setup Warehouse Connectors, you must have a admin or owner project role. Learn more about [Roles and Permissions](/docs/orgs-and-projects/roles-and-permissions).
+ To set up Warehouse Connectors, you must have an admin or owner project role. Learn more about [Roles and Permissions](/docs/orgs-and-projects/roles-and-permissions).

### Step 1: Connect a warehouse

@@ -82,12 +82,12 @@ JSON columns mapped in BigQuery containing multiple properties are subject to a

- To connect to Snowflake you will need:
+ To connect to Snowflake, you will need:
 - Your [Snowflake account identifier](https://docs.snowflake.com/user-guide/admin-account-identifier), which you can find in the URL of your Snowflake account (`https://YOUR_ACCOUNT_NAME.snowflakecomputing.com/`).
 - A dedicated Mixpanel user account and role. The user account can use either key-pair or password authentication.
-   If using key-pair authentication Mixpanel will generate a secure key-pair during the connection process. The public
-   key will be provided during the setup process and the private key will be encrypted and stored securely.
+   If using key-pair authentication, Mixpanel will generate a secure key-pair during the connection process. The public
+   key will be provided during the setup process, and the private key will be encrypted and stored securely.
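For reference, registering the Mixpanel-provided public key on the dedicated user is a single statement. This is a sketch only, assuming the `MIXPANEL_USER` name used in the snippet below; the guided setup supplies the exact key and statement:

```jsx
-- Sketch: paste the public key shown in the Mixpanel setup flow.
ALTER USER MIXPANEL_USER SET RSA_PUBLIC_KEY='<public key from Mixpanel>';
```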
```jsx
CREATE ROLE MIXPANEL_ROLE;
# one of
@@ -127,7 +127,7 @@
GRANT CREATE STREAM ON SCHEMA <database>.MIXPANEL TO ROLE MIXPANEL_ROLE;
```
- The Mixpanel user needs the `USAGE` and `SELECT` permissions to have read-only access to any tables and views you plan to sync.
-  Adjust this example to fine tune permissions.
+  Adjust this example to fine-tune permissions.
```jsx
GRANT USAGE ON DATABASE <database> TO ROLE MIXPANEL_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <database> TO ROLE MIXPANEL_ROLE;
@@ -178,7 +178,7 @@ Complete the following steps to get your Databricks connector up and running:

1. Navigate to **Project Settings**, then select **Warehouse Sources**.
2. Click on `+ Add Connection` and select **Databricks**.
-3. You should see a new page to create your databricks connector. In the first view, fill out the following fields before clicking `Create Source`:
+3. You should see a new page to create your Databricks connector. In the first view, fill out the following fields before clicking `Create Source`:
- ** Server Hostname ** - This is the hostname of your Databricks cluster. This can be found in your workspace URL, or by navigating to [JDBC/ODBC connection settings](https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html#step)
- ** HTTP Path ** - This is the HTTP path of the cluster you would like to connect to. This can be found in your cluster [JDBC/ODBC connection settings](https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html#step)
- ** Access Token ** - This is the Personal access token used to authenticate with your Databricks cluster. Here are the instructions on [how to create an access token](https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-tokens-for-workspace-users)
@@ -187,7 +187,7 @@ Complete the following steps to get your Databricks connector up and running:

**Using Service Principals**

If a service principal is used, you'll want to follow these steps:
1. [Create the service principal](https://docs.databricks.com/en/admin/users-groups/service-principals.html#manage-service-principals-in-your-account) if it doesn't exist.
-2. Create a token for the service principal. You can do so through the Databricks cli:
+2. Create a token for the service principal. You can do so through the Databricks CLI:
```bash
databricks configure --token
databricks token-management create-obo-token <application-id> 31536000 --comment "Mixpanel warehouse connector service principal token"
@@ -201,7 +201,7 @@ databricks token-management create-obo-token
GRANT MODIFY ON ANY FILE TO <service principal>
```
-7. During the setup processing Mixpanel, you can now use the token from the service principal when setting up the connection
+7. During the setup process in Mixpanel, you can now use the token from the service principal when setting up the connection

**IP Allowed List**

@@ -246,7 +246,7 @@ Complete the following steps to get your Redshift connector up and running:
 - Then, click Create Source.
5. In the third view, you should see a confirmation that your source was created. To establish the source connection, we need to ping your Redshift instance to actually create the service account user.
- **Grant Access to Schema** - Enter the name of the schema you want to grant Mixpanel access to.
-  - Copy the command generated and run in your Redshift worksheet. Once that command is run successfully, the connection will be established and you will be able to send data from Redshift tables to Mixpanel.
+  - Copy the command generated and run it in your Redshift worksheet. Once that command is run successfully, the connection will be established, and you will be able to send data from Redshift tables to Mixpanel.
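For illustration, the generated command is a schema-level grant to that service account user along these lines. The names here are hypothetical, so run the exact statement from the Mixpanel UI rather than this sketch:

```jsx
-- Hypothetical sketch; the Mixpanel UI generates the real command.
GRANT USAGE ON SCHEMA analytics TO mixpanel_service_user;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO mixpanel_service_user;
```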
### IP Allowed List
If you are using [AWS PrivateLink](https://docs.aws.amazon.com/redshift/latest/mgmt/security-private-link.html) to restrict access to your instance, you might need to add the following IP addresses to the allowed list.
@@ -268,7 +268,7 @@ If you are using [AWS PrivateLink](https://docs.aws.amazon.com/redshift/latest/m

Navigate to [Project Settings → Warehouse Data](https://mixpanel.com/report/settings/%23project%2F%24project_id%24%2Fwarehousedata/) and click +Event Table.

-Select a table or view representing an event from your warehouse and tell Mixpanel about the table. Once satisfied with the preview, click run and we’ll establish the sync. The initial load may take a few minutes depending on the size of the table, we show you progress as it’s happening.
+Select a table or view representing an event from your warehouse and tell Mixpanel about the table. Once satisfied with the preview, click Run, and we’ll establish the sync. The initial load may take a few minutes depending on the size of the table; we show you progress as it’s happening.

🎉 Congrats, you’ve loaded your first warehouse table into Mixpanel! From this point onward, the table will be kept in sync with Mixpanel. You can now use this event throughout Mixpanel’s interface.

@@ -280,7 +280,7 @@ Mixpanel’s [Data Model](/docs/how-it-works/concepts) consists of 4 types: Even

An event is something that happens at a point in time. It’s akin to a “fact” in dimensional modeling or a log in a database. Events have properties, which describe the event. Learn more about Events [here](/docs/data-structure/events-and-properties).

-Here’s an example table that illustrates what can be loaded as events in Mixpanel. The most important fields are the timestamp (when) and the user id (who) — everything else is optional.
+Here’s an example table that illustrates what can be loaded as events in Mixpanel. The most important fields are the timestamp (when) and the user ID (who) — everything else is optional.

| Timestamp | User ID | Item | Brand | Amount | Type |
| --- | --- | --- | --- | --- | --- |

@@ -291,9 +291,9 @@ Here are more details about the schema we expect for events:

| Column | Required | Type | Description |
| --- | --- | --- | --- |
-| Event Name | Yes | String | The name of the event. Eg: Purchase Completed or Support Ticket Filed. Note: you can specify this value statically, it doesn’t need to be a column in the table. |
+| Event Name | Yes | String | The name of the event, e.g., Purchase Completed or Support Ticket Filed. Note: you can specify this value statically; it doesn’t need to be a column in the table. |
| Time | Yes | Timestamp | The time at which the event occurred. |
-| User ID | No | String or Integer | The unique identifier of the user who performed the event. Eg: 12345 or grace@example.com. |
+| User ID | No | String or Integer | The unique identifier of the user who performed the event, e.g., 12345 or grace@example.com. |
| Device ID | No | String or Integer | An identifier for anonymous users, useful for tracking pre-login data. Learn more [here](/docs/tracking-methods/id-management/identifying-users) |
| JSON Properties | No | JSON or Object | A field that contains key-value properties in JSON format. If provided, Mixpanel will flatten this field out into properties. |
| All other columns | No | Any | These can be anything. Mixpanel will auto-detect these columns and attach them to the event as properties. |
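To make the expected shape concrete, here is a sketch of a warehouse view that matches this schema (all table and column names are illustrative, not required by Mixpanel):

```jsx
-- Illustrative sketch: shape source data into the event schema above.
CREATE VIEW analytics.mp_purchase_events AS
SELECT
  'Purchase Completed' AS event_name, -- or set statically in the sync UI
  charged_at           AS event_time, -- mapped to Time
  customer_email       AS user_id,    -- mapped to User ID
  item,                               -- auto-detected as properties
  brand,
  amount
FROM prod.purchases;
```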
@@ -325,11 +325,11 @@ Preview from a sample profile history value table
Source table requirements:

- The source table for user/group history is expected to be modeled as an SCD (Slowly Changing Dimension) Type 2 table. This means that the table must maintain all the history over time that you want to use for analysis.
-- History tables are supported only with mirror sync mode. Follow these [docs](/docs/tracking-methods/warehouse-connectors#mirror) to setup your source table to be mirror-compatible.
+- History tables are supported only with Mirror sync mode. Follow these [docs](/docs/tracking-methods/warehouse-connectors#mirror) to set up your source table to be mirror-compatible.
- The table should have a Timestamp/Date type column signifying the time that the properties on the row become active. This column will need to be supplied as `Start Time` in the sync configuration.
- The following data types are NOT supported:
-  - Lists (eg Snowflake’s ARRAY)
-  - Objects (eg Snowflake’s OBJECT)
+  - Lists (e.g., Snowflake’s ARRAY)
+  - Objects (e.g., Snowflake’s OBJECT)

### Group Profiles

@@ -342,13 +342,13 @@ Here’s an example table that illustrates what can be loaded as group profiles
| 12345 | Notion | notion.so | 45000 | Enterprise |
| 45678 | Linear | linear.so | 2000 | Pro |

-Group Profile History value and setup is similar to the User Profile History section elaborated above
+Group Profile History value and setup are similar to the User Profile History section elaborated above.

-Generally, group profile history values can only be used for queries within that same group. To power user-mode queries that use a group profile history property, the latest value of the ingested property will be used instead. Following ingestion, there may be a delay before the latest value will be available in queries.
+Generally, group profile history values can only be used for queries within that same group. To power user-mode queries that use a group profile history property, the latest value of the ingested property will be used instead. Following ingestion, there may be a delay before the latest value is available in queries.

### Lookup Tables

-A Lookup Table is useful for enriching Mixpanel properties (e.g. content, skus, currencies) with additional metadata. Learn more about Lookup Tables [here](/docs/data-structure/lookup-tables). Note the limits of lookup tables indicated [here](/docs/data-structure/lookup-tables#when-shouldnt--i-use-lookup-tables).
+A Lookup Table is useful for enriching Mixpanel properties (e.g., content, SKUs, currencies) with additional metadata. Learn more about Lookup Tables [here](/docs/data-structure/lookup-tables). Note the limits of lookup tables indicated [here](/docs/data-structure/lookup-tables#when-shouldnt--i-use-lookup-tables).

Here is an example table that illustrates what can be loaded as a lookup table in Mixpanel. The only important column is the ID, which is the primary key of the table that is eventually mapped to a Mixpanel property

@@ -363,7 +363,7 @@ Warehouse Connectors regularly check warehouse tables for changes to load into M
which changes Mixpanel will reflect.

- **Mirror** will keep Mixpanel perfectly in sync with the data in the warehouse. This includes syncing new data,
-  modifying historical data, and deleting data that were removed from the warehouse.
**Mirror** is supported + modifying historical data, and deleting data that was removed from the warehouse. **Mirror** is supported for Snowflake, BigQuery, Databricks, and Redshift. - **Append** will load new rows in the warehouse into Mixpanel, but will ignore modifications to existing rows or rows that were deleted from the warehouse. We recommend using **Mirror** over **Append** for supported @@ -372,7 +372,7 @@ which changes Mixpanel will reflect. syncs are only supported for Lookup Tables, User Profiles, and Group Profiles. - **One-Time** will load the data from your warehouse into Mixpanel _once_ with no ability to send incremental changes later. This is only recommended where the warehouse is being used as a temporary copy of the data being - moved to Mixpanel from some other source and the warehouse copy will not be updated later. + moved to Mixpanel from some other source, and the warehouse copy will not be updated later. ### Mirror @@ -384,21 +384,21 @@ supported for Snowflake, Databricks, BigQuery, and Redshift sources. Mirror takes BigQuery [table snapshots](https://cloud.google.com/bigquery/docs/table-snapshots-intro) and runs queries to compute the -change stream between two snapshot. Snapshots are stored in the `mixpanel` dataset created in [Step 1](#step-1-connect-a-warehouse). +change stream between two snapshots. Snapshots are stored in the `mixpanel` dataset created in [Step 1](#step-1-connect-a-warehouse). **Considerations when using Mirror with BigQuery:** - Mirror is not supported on views in BigQuery. -- If two rows in BigQuery are identical across _all_ columns the checksums Mirror computes for each row will be the same - and Mixpanel will consider them the same row causing only one copy to appear in Mixpanel. We recommend ensuring that one +- If two rows in BigQuery are identical across _all_ columns, the checksums Mirror computes for each row will be the same, + and Mixpanel will consider them the same row, causing only one copy to appear in Mixpanel. We recommend ensuring that one of your columns is a unique row ID to avoid this. - The table snapshots managed by Mixpanel are always created to expire after 21 days. This ensures that the snapshots are deleted even if Mixpanel loses access to them unexpectedly. Make sure that the sync does not go longer than 21 days without - running as each sync run needs access to the previous sync run's snapshot. (Under normal conditions Mirror maintains only one - snapshot per-sync and removes the older run's snapshot as soon as it has been used by the subsequent sync run). + running, as each sync run needs access to the previous sync run's snapshot (under normal conditions, Mirror maintains only one + snapshot per sync and removes the older run's snapshot as soon as it has been used by the subsequent sync run). **How changes are detected:** -Changed rows are detected by checksumming the values of all columns except trailing NULL-valued columns. For example in the following table +Changed rows are detected by checksumming the values of all columns except trailing NULL-valued columns. 
For example, the following table
would use these per-row checksums:

| ID | Song Name | Artist | Genre | **Computed checksum** |
@@ -407,8 +407,8 @@ would use these per-row checksums:
| 45678 | Voyager | Daft Punk | Electronic | `CHECKSUM(45678, 'Voyager', 'Daft Punk', 'Electronic')` |
| 83921 | NULL | NULL | Classical | `CHECKSUM(83921, NULL, NULL, 'Classical')` |

-Trailing NULL-values are excluded from the checksum to ensure that adding new columns does not change the checksum
-of existing rows. For example if a new column is added to the example table:
+Trailing NULL values are excluded from the checksum to ensure that adding new columns does not change the checksum
+of existing rows. For example, if a new column is added to the example table:

```jsx
ALTER TABLE songs ADD COLUMN Tag STRING NULL;
```

@@ -432,7 +432,7 @@ Until values are written to the new column:

**Handling schema changes when using Mirror with BigQuery:**

-Adding new, default-NULL columns to Mirror-tracked tables/views is fully supported as described in the
+Adding new, default-NULL columns to Mirror-tracked tables/views is fully supported, as described in the
previous section.

```jsx
If you have a JSON column in the table/view which you map to `JSON Properties` i

We recommend avoiding other types of schema changes on large tables. Other schema changes may cause the
checksum of every row to change, effectively re-sending the entire table to Mixpanel. For example, if we
-were to remove the Genre column in the example above the checksum of every row would be different:
+were to remove the Genre column in the example above, the checksum of every row would be different:

| ID | Song Name | Artist | Tag | Computed checksum |
| --- | --- | --- | --- | --- |

**Handling partitioned tables:**

@@ -454,10 +454,10 @@
When syncing [time partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables) or
-[ingestion-time partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time) tables Mirror will use partition
+[ingestion-time partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time) tables, Mirror will use partition
metadata to skip processing partitions that have not changed between sync runs. This will make the computation of the change stream
-much more efficient on large partitioned tables where only a small percentage of partitions are update between runs. For example,
-in a day-partitioned table with two years of data, where only the last five days of data are normally updated only five partitions
+much more efficient on large partitioned tables where only a small percentage of partitions are updated between runs. For example,
+in a day-partitioned table with two years of data, where only the last five days of data are normally updated, only five partitions'
worth of data will be scanned each time the sync runs.

@@ -478,17 +478,17 @@ Mixpanel will create and manage the necessary STREAM objects in the `MIXPANEL` s
the larger of [DATA_RETENTION_TIME_IN_DAYS and MAX_DATA_EXTENSION_TIME_IN_DAYS](https://docs.snowflake.com/en/user-guide/streams-manage#avoiding-stream-staleness)
(default is 14 days for MAX_DATA_EXTENSION_TIME_IN_DAYS). Make sure that the Mixpanel sync does not go longer
than this number of days without running.
Mixpanel recommends leaving the default of 14
-  days to ensure that if Mixpanel loses access to the warehouse unexpectedly (e.g. a credentials change), there
+  days to ensure that if Mixpanel loses access to the warehouse unexpectedly (e.g., a credentials change), there
   is time to correct the issue.
-- Snowflake Streams do not work if a table is deleted and re-created with the same name. If using a tool like dbt to model
-  data in Snowflake make sure to use an [incremental model](https://docs.getdbt.com/docs/build/incremental-models) so that
-  dbt does not replace the table each time it runs.
-- Snowflake streams doesn't capture changes when a column is deleted or renamed, so deletion of columns won't be synced to Mixpanel.
+- Snowflake Streams do not work if a table is deleted and re-created with the same name. If using a tool like dbt to model
+  data in Snowflake, make sure to use an [incremental model](https://docs.getdbt.com/docs/build/incremental-models) so that
+  dbt does not replace the table each time it runs.
+- Snowflake Streams don't capture changes when a column is deleted or renamed, so deletion of columns won't be synced to Mixpanel.
- Snowflake has specific requirements when using [Streams on Views](https://docs.snowflake.com/en/user-guide/streams-intro#streams-on-views) that must be met when using Mirror with views.
-- While Snowflake Streams are a very efficient way of tracking changes there are some performance implications
+- While Snowflake Streams are a very efficient way of tracking changes, there are some performance implications
   when using [Streams on VIEWs that contain JOINs](https://docs.snowflake.com/en/user-guide/streams-intro#join-results-behavior).
-  If you find yourself needing such a JOIN in event data we recommends considering if syncing the joined data as [User Profiles](#user-profiles),
+  If you find yourself needing such a JOIN in event data, we recommend considering if syncing the joined data as [User Profiles](#user-profiles),
   [Group Profiles](#group-profiles), or a [Lookup Table](#lookup-tables) would work instead.

**Handling schema changes when using Mirror with Snowflake:**

@@ -497,8 +497,8 @@ Adding new, default-NULL columns to Mirror-tracked tables/views is fully support
ALTER TABLE <table> ADD COLUMN <column> VARCHAR DEFAULT NULL;
```
We recommend avoiding other types of schema changes. Snowflake streams only reflect changes to tables from DML statements. DDL statements
-that logically modify data (e.g. adding new columns with default values, dropping existing columns, or renaming columns) will be reflected
-in future data sent to Mixpanel but the Stream will not update historical data with changes caused by DDL statements.
+that logically modify data (e.g., adding new columns with default values, dropping existing columns, or renaming columns) will be reflected
+in future data sent to Mixpanel, but the Stream will not update historical data with changes caused by DDL statements.

@@ -511,24 +511,24 @@
ALTER TABLE <table>
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

**Considerations when using Mirror with Databricks:**
- Mirror is not supported on views in Databricks
- Databricks Change Data Feed only maintains change history for a limited number of days determined by [delta.logRetentionDuration](https://docs.databricks.com/en/delta/history.html#retrieve-delta-table-history) (default is 30 days). Make sure that the Mixpanel sync does not go longer
-than this number of days without running. Mixpanel recommends leaving the default of 30 days to ensure that if Mixpanel loses access to the warehouse unexpectedly (e.g. a credentials change), there is time to correct the issue.
+than this number of days without running. Mixpanel recommends leaving the default of 30 days to ensure that if Mixpanel loses access to the warehouse unexpectedly (e.g., a credentials change), there is time to correct the issue.
-- Databricks Change Data Feed does not work if a table is deleted and re-created with the same name. If using a tool like dbt to model
-  data in Databricks make sure to use an [incremental model](https://docs.getdbt.com/docs/build/incremental-models) so that
-  dbt does not replace the table each time it runs.
+- Databricks Change Data Feed does not work if a table is deleted and re-created with the same name. If using a tool like dbt to model
+  data in Databricks, make sure to use an [incremental model](https://docs.getdbt.com/docs/build/incremental-models) so that
+  dbt does not replace the table each time it runs.
- While Databricks Change Data Feed works with adding new columns to the table, there are certain limitations when it comes to dropping columns or renaming columns
-which uses [column mappings](https://docs.databricks.com/en/delta/delta-change-data-feed.html#change-data-feed-limitations-for-tables-with-column-mapping-enabled). In such scenarios Mixpanel, recommends deleting the sync (along with the data) and re-creating the sync.
+that use [column mappings](https://docs.databricks.com/en/delta/delta-change-data-feed.html#change-data-feed-limitations-for-tables-with-column-mapping-enabled). In such scenarios, Mixpanel recommends deleting the sync (along with the data) and re-creating the sync.

Mirror takes Redshift table snapshots and runs queries to compute the change stream between two snapshots using the computed MD5 hash value for each row. Snapshots are stored in a staging schema created in [Step 1](#step-1-connect-a-warehouse).

-- If two rows in Redshift table are identical across *all* columns, the md5 hash value Mirror computes for each row will be the same and Mixpanel will consider them the same row causing only one copy to appear in Mixpanel. We recommend ensuring that one of your columns is a unique row ID to avoid this.
-- The table snapshots managed by Mixpanel and are automatically cleaned-up. Under normal conditions, Mirror maintains only one snapshot per-sync and removes the older run's snapshot as soon as it has been used by the subsequent sync run. These snapshots will incur some additional storage cost.
+- If two rows in the Redshift table are identical across *all* columns, the MD5 hash value Mirror computes for each row will be the same, and Mixpanel will consider them the same row, causing only one copy to appear in Mixpanel. We recommend ensuring that one of your columns is a unique row ID to avoid this.
+- The table snapshots managed by Mixpanel are automatically cleaned up.
Under normal conditions, Mirror maintains only one snapshot per sync and removes the older run's snapshot as soon as it has been used by the subsequent sync run. These snapshots will incur some additional storage cost.

**How changes are detected:**

-Changed rows are detected by computing MD5 hash for concatenated values of all columns except trailing NULL-valued columns. For example in the following table would use these per-row MD5:
+Changed rows are detected by computing the MD5 hash of the concatenated values of all columns except trailing NULL-valued columns. For example, the following table would use these per-row MD5 hashes:

| **ID** | **Song Name** | **Artist** | **Genre** | **Computed checksum** |
| --- | --- | --- | --- | --- |
@@ -536,7 +536,7 @@ Changed rows are detected by computing MD5 hash for concatenated values of all c
| 45678 | Voyager | Daft Punk | Electronic | `MD5(45678, 'Voyager', 'Daft Punk', 'Electronic')` |
| 83921 | NULL | NULL | Classical | `MD5(83921, NULL, NULL, 'Classical')` |

-Trailing NULL-values are excluded from the checksum to ensure that adding new columns does not change the checksum of existing rows. For example if a new column is added to the example table:
+Trailing NULL values are excluded from the checksum to ensure that adding new columns does not change the checksum of existing rows. For example, if a new column is added to the example table:

```
ALTER TABLE songs ADD COLUMN Tag STRING NULL;
```

@@ -566,7 +566,7 @@ Adding new, default-NULL columns to Mirror-tracked tables/views is fully support
```
ALTER TABLE <table>
ADD COLUMN <column> STRING NULL;
```

-We recommend avoiding other types of schema changes on large tables. Other schema changes may cause the hash value of every row to change, effectively re-sending the entire table to Mixpanel. For example, if we were to remove the Genre column in the example above the checksum of every row would be different:
+We recommend avoiding other types of schema changes on large tables. Other schema changes may cause the hash value of every row to change, effectively re-sending the entire table to Mixpanel. For example, if we were to remove the Genre column in the example above, the checksum of every row would be different:

| **ID** | **Song Name** | **Artist** | **Tag** | **Computed checksum** |
| --- | --- | --- | --- | --- |

@@ -576,10 +576,10 @@ We recommend avoiding other types of schema changes on large tables. Other schem

**Considerations when using Mirror with Redshift:**

-- Currently only add-column schema update is supported.
-- Geography, Geometry, Varbyte and hllsketch not supported currently. If you have a use-case that requires support any of these columns, please contact us for a feature request.
-- Any new tables created after the permissions are granted, are not automatically accessible by mixpanel. Please re-grant necessary permissions to be able to setup mirror on the tables.
-- `_mp_row_hash`, `_mp_change_type` are reserved column names used for tracking the computed hash value for each row, and update type to the row. Please ensure that this doesn’t conflict with the column names of the table.
+- Currently, only the add-column schema update is supported.
+- The GEOGRAPHY, GEOMETRY, VARBYTE, and HLLSKETCH types are not currently supported. If you have a use case that requires support for any of these columns, please contact us for a feature request.
+- Any new tables created after the permissions are granted are not automatically accessible by Mixpanel. Please re-grant the necessary permissions to be able to set up Mirror on the tables.
+- `_mp_row_hash` and `_mp_change_type` are reserved column names used to track the computed hash value and the update type for each row. Please ensure that these don’t conflict with the column names of the table.

@@ -599,9 +599,9 @@
BigQuery costs we recommend [partitioning the source table by the `<insert_time_column>`](https://cloud.google.com/bigquery/docs/partitioned-tables#date_timestamp_partitioned_tables).
Doing so will ensure that each incremental sync run [only scans the most recent partitions](https://cloud.google.com/bigquery/docs/partitioned-tables).

-To understand the potential savings consider a 100 GB source table with 100 days of data (approximately 1 GB of data per day):
-- If this table is not partitioned and is synced daily the Append sync will scan the whole table (100 GB of data) each time it runs, or 3,000 GB of data per month.
-- If this table is partitioned by day and is synced daily with an Append sync the Append sync only scan the current day and previous day's partitions (2 GB of data) each time it runs, or 60 GB of data per day, a 50x improvement over the un-partitioned table.
+To understand the potential savings, consider a 100 GB source table with 100 days of data (approximately 1 GB of data per day):
+- If this table is not partitioned and is synced daily, the Append sync will scan the whole table (100 GB of data) each time it runs, or 3,000 GB of data per month.
+- If this table is partitioned by day and is synced daily with an Append sync, the Append sync only scans the current day and the previous day's partitions (2 GB of data) each time it runs, or 60 GB of data per month, a 50x improvement over the un-partitioned table.

Note: [BigQuery's ingestion time partitions](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time) **are not supported** in Append mode.
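As a sketch (illustrative names only), a BigQuery source table partitioned this way might be declared as:

```jsx
-- Illustrative sketch: day-partition on the insert-time column so that
-- Append syncs scan only the most recent partitions.
CREATE TABLE analytics.events (
  event_name  STRING,
  event_time  TIMESTAMP,
  user_id     STRING,
  insert_time TIMESTAMP
)
PARTITION BY DATE(insert_time);
```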
@@ -618,7 +618,7 @@ Mixpanel offers a variety of sync frequency options to cater to different data i

### Standard Sync Frequency Options

- GA4 tables support only daily sync frequency.
+ GA4 tables support only a daily sync frequency.

* Hourly: Data is synchronized every hour, providing near real-time updates to your Mixpanel project.

@@ -638,7 +638,7 @@ To use the API trigger option:

This flexibility allows you to maintain precise control over when and how your data is updated in Mixpanel, ensuring your analytics are always based on the latest information.

-Note: If your table sync is set up with Mirror mode, you will need to run a sync job at least every 2 weeks to ensure our snapshots do not get deleted. We rate limit the number of syncs via API to 5 per hour.
+Note: If your table sync is set up with Mirror mode, you will need to run a sync job at least every 2 weeks to ensure our snapshots do not get deleted. We rate-limit the number of syncs via API to 5 per hour.

@@ -652,7 +652,7 @@ Anything that is event-based (has a user_id and timestamp) and that you want to

* Billing: Subscription Created, Subscription Upgraded, Subscription Canceled, Payment Made
* Application Database: Sign-up, Purchased Item, Invited Teammate

-We also recommend loading your user and account tables, to enrich events with demographic attributes about the users and accounts who performed them.
+We also recommend loading your user and account tables to enrich events with demographic attributes about the users and accounts who performed them.

### How fast do syncs transfer data?

@@ -666,15 +666,15 @@ After validating your use case, navigate to the imported table and select "Delet

### I already track data to Mixpanel via SDK or CDP, can I still use Warehouse Connectors?

-Yes! You can send some events (eg: web and app data) directly via our SDKs and send other data (eg: user profiles from CRM or logs from your backend) from your warehouse and analyze them together in Mixpanel.
+Yes! You can send some events (e.g., web and app data) directly via our SDKs and send other data (e.g., user profiles from CRM or logs from your backend) from your warehouse and analyze them together in Mixpanel.

-Please do note that warehouse connectors enforce strict_mode validation by default and any events and historical profiles with time set in future will be dropped.
+Note that warehouse connectors enforce strict_mode validation by default, and any events and historical profiles with time set in the future will be dropped.
We will reject events with time values that are before 1971-01-01 or more than 1 hour in the future as measured on our servers.
-We recommend that customer filter such events and refresh such events when they are no longer set in future.
+We recommend filtering such events out and re-sending them once their timestamps are no longer in the future.
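One way to apply that recommendation is a time guard in the source view, sketched here with illustrative names (adjust the timestamp functions to your warehouse):

```jsx
-- Illustrative sketch: exclude rows Mixpanel would reject for out-of-range times.
SELECT *
FROM analytics.events
WHERE event_time >= TIMESTAMP '1971-01-01'
  AND event_time <= CURRENT_TIMESTAMP + INTERVAL '1 hour';
```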
### How do I filter for events coming to Mixpanel via Warehouse Connector Sync in my reports?

-We add couple of hidden properties `$warehouse_import_id` and `$warehouse_type` on every event ingested through warehouse connectors. You can add filters and breakdowns on that property in any Mixpanel report. You can find the Warehouse import ID of a sync in Sync History tab shown as `Mixpanel Import Id`.
+We add a couple of hidden properties, `$warehouse_import_id` and `$warehouse_type`, on every event ingested through warehouse connectors. You can add filters and breakdowns on these properties in any Mixpanel report. You can find the Warehouse import ID of a sync in the Sync History tab, shown as `Mixpanel Import ID`.

### Does Mixpanel automatically flatten nested data from warehouse tables?

@@ -686,7 +686,7 @@ Consider breaking the data into smaller chunks if you’re working with large da

Note: The 20-hour query limit is a Mixpanel restriction, not a BigQuery one, to help keep the system stable for all users.

### Why is mirror mode required for profile history syncs?
-Mirror mode allows Mixpanel to detect changes in your data warehouse and update historical profile data in Mixpanel accordingly. This is essential for maintaining accurate history of user profiles. When you use the Mirror mode, Mixpanel data automatically syncs with your warehouse by accurately reflecting all changes, including additions, updates, or deletions. You can learn more about the Mirror mode and its benefits in this [blog post](https://mixpanel.com/blog/mirror-product-analytics-data-warehouse-sync/)
+Mirror mode allows Mixpanel to detect changes in your data warehouse and update historical profile data in Mixpanel accordingly. This is essential for maintaining an accurate history of user profiles. When you use Mirror mode, Mixpanel data automatically syncs with your warehouse by accurately reflecting all changes, including additions, updates, and deletions. You can learn more about Mirror mode and its benefits in this [blog post](https://mixpanel.com/blog/mirror-product-analytics-data-warehouse-sync/).

### Why am I seeing events in my project with the name of my profile table?
Events with the same name as the table/view used for historical profile imports are auto-generated by the WH import process. These are hidden by default and are not meant to be queried directly. Billing for historical imports is done using mirror pricing (link to question below).

@@ -712,7 +712,7 @@ The above table applies if your account uses ingestion time billing. If your acc

If you’re planning on backfilling a significant amount of historical events and need help understanding how it will impact your costs, please reach out to your Mixpanel account manager or [contact support](https://mixpanel.com/get-support).

-**Note on Updates**: If you already have an event in Mixpanel, for example, Event A with properties a,b,c,d but want to:
+**Note on Updates**: If you already have an event in Mixpanel, for example, Event A with properties a,b,c,d, but want to:

- **Update the value** of property d, or
- **Add a new property** or column e with a non-NULL value

@@ -723,9 +723,9 @@ If your warehouse workflow **drops and recreates tables**, Mirror will treat thi

If instead your workflow updates existing tables—by appending, updating, or deleting specific rows or columns—only the affected rows will be billed, as shown in the table above.

-**Note on Deletes**: If you delete events using [Mixpanel UI](https://docs.mixpanel.com/docs/data-governance/data-clean-up), those deletes are not counted towards billable events. Only Deletes coming from warehouse using mirror are billable.
+**Note on Deletes**: If you delete events using [Mixpanel UI](https://docs.mixpanel.com/docs/data-governance/data-clean-up), those deletes are not counted towards billable events. Only deletes coming from the warehouse via Mirror are billable.

-**Note on Backfills**: You can also backfill using `Append` mode if you create a new sync. But for a ongoing sync you cannot backfill for older days with in the existing sync once the `insert_time` has moved past.
+**Note on Backfills**: You can also backfill using `Append` mode if you create a new sync. But for an ongoing sync, you cannot backfill for older days within the existing sync once the `insert_time` has moved past them.

**Billing for User/Group Profiles Syncs**

@@ -744,17 +744,17 @@ You can monitor these different operations in your billing page, where they'll a

**Billing for historical table imports:**

-Historical tables can be imported only in mirror mode. Mirror-mode pricing updates apply to all rows imported for profile history tables.This means:
+Historical tables can be imported only in mirror mode. Mirror-mode pricing updates apply to all rows imported for profile history tables. This means:
- Historical profile updates DO count towards billing. Imports through standard profile tables do not.
- Every row counts as a mirror event and is [billed as such](../pricing#i-am-using-the-warehouse-add-on-for-ingesting-data-how-does-this-impact-billing).
-- If you update/delete existing rows in your table, [mirror billing](../pricing#i-am-using-the-warehouse-add-on-for-ingesting-data-how-does-this-impact-billing) will be applied including for backfills.
+- If you update/delete existing rows in your table, [mirror billing](../pricing#i-am-using-the-warehouse-add-on-for-ingesting-data-how-does-this-impact-billing) will be applied, including for backfills.

-For MTU plans, the events and updates/deletes don't directly affect the MTU tally unless the volume of events and updates pushes the threshold for the guardrail (by default 1000 events per user); once the threshold for the guardrail is crossed, events and updates from profile history tables will also count towards billing.
+For MTU plans, the events and updates/deletes don't directly affect the MTU tally unless the volume of events and updates pushes the threshold for the guardrail (by default, 1000 events per user); once the threshold for the guardrail is crossed, events and updates from profile history tables will also count towards billing.

### When should I use Mirror vs. Append mode?

**Use Mirror mode when:**
- You want to maintain an exact replica of your warehouse data in Mixpanel
- You want to automatically sync updates and deletes from your warehouse
- You need to track the history of user profile changes over time (with History mode)

@@ -771,14 +771,14 @@ The DWH cost of using a warehouse connector will vary based on the source wareho

There are 3 aspects of DWH cost: network egress, storage, and compute.

* **Network Egress**: All data is transferred using gzip compression. Assuming an egress rate of \$0.08 per GB and 100 compressed bytes per event, this is a cost of less than \$0.01 per million events. [Mirror](#mirror) and [Append](#append) syncs will only transfer new or modified rows each time they run. [Full](#full) syncs will transfer all rows every time they run. We recommend using Full syncs only for small tables and running them less frequently.
* **Storage**: [Append](#append) and [Full](#full) syncs do not store any additional data in your warehouse, so there are no extra storage costs. [Mirror](#mirror) tracks changes using warehouse-specific functionality that can affect warehouse storage costs:
-  * Snowflake: Mirror uses [Snowflake Streams](https://docs.snowflake.com/en/user-guide/streams-intro). Snowflake Streams will [retain historical data](https://docs.snowflake.com/en/user-guide/streams-intro#billing-for-streams) until it is consumed from the stream. As long as the warehouse connector runs regularly data will be consumed regularly and only retained between runs.
-  * BigQuery: Mirror uses [table snapshots](https://cloud.google.com/bigquery/docs/table-snapshots-intro). Mirror keeps one snapshot per table to track the contents of the table from the last run. BigQuery table snapshots have no cost when they are first created as they share the underlying storage with the source table, however as the source table changes [the cost of storing changes is attributed to the table snapshot](https://cloud.google.com/bigquery/docs/table-snapshots-intro#storage_costs). Each time the connector runs the current snapshot is replaced with a new snapshot of the latest state of the table. The storage cost is the amount of changes being tracked between the snapshot and source table between runs.
-  * Databricks: Mirror uses Databricks [Change Data Feed](https://docs.databricks.com/en/delta/delta-change-data-feed.html) and all the changes are retained in databricks for the [delta.logRetentionDuration](https://docs.databricks.com/en/delta/history.html#retrieve-delta-table-history). Configure that window accordingly to keep storage costs low.
+  * Snowflake: Mirror uses [Snowflake Streams](https://docs.snowflake.com/en/user-guide/streams-intro). Snowflake Streams will [retain historical data](https://docs.snowflake.com/en/user-guide/streams-intro#billing-for-streams) until it is consumed from the stream. As long as the warehouse connector runs regularly, data will be consumed regularly and only retained between runs.
+  * BigQuery: Mirror uses [table snapshots](https://cloud.google.com/bigquery/docs/table-snapshots-intro). Mirror keeps one snapshot per table to track the contents of the table from the last run. BigQuery table snapshots have no cost when they are first created, as they share the underlying storage with the source table. However, as the source table changes, [the cost of storing changes is attributed to the table snapshot](https://cloud.google.com/bigquery/docs/table-snapshots-intro#storage_costs). Each time the connector runs, the current snapshot is replaced with a new snapshot of the latest state of the table. The storage cost is proportional to the amount of change tracked between the snapshot and the source table between runs.
+  * Databricks: Mirror uses Databricks [Change Data Feed](https://docs.databricks.com/en/delta/delta-change-data-feed.html), and all the changes are retained in Databricks for the [delta.logRetentionDuration](https://docs.databricks.com/en/delta/history.html#retrieve-delta-table-history). Configure that window accordingly to keep storage costs low.
* **Compute**:
-  * Mirror on Snowflake: [Snowflake Streams](https://docs.snowflake.com/en/user-guide/streams-intro) natively track changes, the compute cost of querying for these changes is normally proportional to the amount of changed data.
- * Mirror on BigQuery: Each time the connector runs it checksums all rows in the source table and compares them to a [table snapshot](https://cloud.google.com/bigquery/docs/table-snapshots-intro) from the previous run. For large tables we highly recommend [partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables) the source table. When the source table is partitioned the connector will skip checksumming any partitions which have not been modified since the last run. For more details see the BigQuery-specific instructions in [Mirror](#mirror). - * Mirror on Databricks: Databricks [Change Data Feed](https://docs.databricks.com/en/delta/delta-change-data-feed.html) natively tracks changes to the tables or views, the compute cost of querying these changes is normally proportional to the amount of changed data. Mixpanel recommends using a smaller compute cluster and setting Auto Terminate after 10 minutes of idle time on the compute cluster. - * Append: All Append syncs run a query filtered on `insert_time_column > [last-run-time]`, the compute cost is the cost of this query. Partitioning or clustering based on `insert_time_column` will greatly improve the performance of this query. + * Mirror on Snowflake: [Snowflake Streams](https://docs.snowflake.com/en/user-guide/streams-intro) natively track changes; the compute cost of querying for these changes is normally proportional to the amount of changed data. + * Mirror on BigQuery: Each time the connector runs, it checksums all rows in the source table and compares them to a [table snapshot](https://cloud.google.com/bigquery/docs/table-snapshots-intro) from the previous run. For large tables, we highly recommend [partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables) the source table. When the source table is partitioned, the connector will skip checksumming any partitions that have not been modified since the last run. For more details, see the BigQuery-specific instructions in [Mirror](#mirror). + * Mirror on Databricks: Databricks [Change Data Feed](https://docs.databricks.com/en/delta/delta-change-data-feed.html) natively tracks changes to the tables or views, and the compute cost of querying these changes is normally proportional to the amount of changed data. Mixpanel recommends using a smaller compute cluster and setting Auto Terminate after 10 minutes of idle time on the compute cluster. + * Append: All Append syncs run a query filtered on `insert_time_column > [last-run-time]`; the compute cost is the cost of this query. Partitioning or clustering based on `insert_time_column` will greatly improve the performance of this query. * Full: Full syncs are always a full table scan of the source table to export it. ### Will I be charged for failed imports or errors?