[Doc] Update data_migration_tool.md (backport #51161) (#51221)
Co-authored-by: yingtingdong <[email protected]>
mergify[bot] and yingtingdong authored Sep 20, 2024
1 parent e7c91e7 commit 163afa0
Showing 4 changed files with 74 additions and 7 deletions.
Binary file added docs/en/_assets/data_migration_tool-2.png
39 changes: 36 additions & 3 deletions docs/en/administration/data_migration_tool.md
@@ -143,12 +143,14 @@ source_fe_host=
source_fe_query_port=9030
source_cluster_user=root
source_cluster_password=
source_cluster_password_secret_key=
source_cluster_token=
target_fe_host=
target_fe_query_port=9030
target_cluster_user=root
target_cluster_password=
target_cluster_password_secret_key=
# Comma-separated list of database names or table names like <db_name> or <db_name.table_name>
# example: db1,db2.tbl2,db3
@@ -159,6 +161,9 @@ exclude_data_list=
# If there are no special requirements, please maintain the default values for the following configurations.
target_cluster_storage_volume=
target_cluster_replication_num=-1
target_cluster_max_disk_used_percent=80
max_replication_data_size_per_job_in_gb=-1
meta_job_interval_seconds=180
meta_job_threads=4
@@ -167,8 +172,10 @@ ddl_job_batch_size=10
ddl_job_allow_drop_target_only=false
ddl_job_allow_drop_schema_change_table=true
ddl_job_allow_drop_inconsistent_partition=true
ddl_job_allow_drop_partition_target_only=true
replication_job_interval_seconds=10
replication_job_batch_size=10
report_interval_seconds=300
```

The description of the parameters is as follows:
@@ -180,24 +187,30 @@ The description of the parameters is as follows:
| source_fe_query_port | The query port (`query_port`) of the source cluster's FE. |
| source_cluster_user | The username used to log in to the source cluster. This user must be granted the OPERATE privilege on the SYSTEM level. |
| source_cluster_password | The user password used to log in to the source cluster. |
| source_cluster_password_secret_key | The secret key used to encrypt the login user's password for the source cluster. The default value is an empty string, which means the login password is not encrypted. If you want to encrypt `source_cluster_password`, you can obtain the encrypted `source_cluster_password` string by using the SQL statement `SELECT TO_BASE64(AES_ENCRYPT('<source_cluster_password>','<source_cluster_password_secret_key>'))` (see the example after this table). |
| source_cluster_token | Token of the source cluster. For information on how to obtain the cluster token, refer to [Obtain Cluster Token](#obtain-cluster-token) below. |
| target_fe_host | The IP address or FQDN (Fully Qualified Domain Name) of the target cluster's FE. |
| target_fe_query_port | The query port (`query_port`) of the target cluster's FE. |
| target_cluster_user | The username used to log in to the target cluster. This user must be granted the OPERATE privilege on the SYSTEM level. |
| target_cluster_password | The user password used to log in to the target cluster. |
| target_cluster_password_secret_key | The secret key used to encrypt the login user's password for the target cluster. The default value is an empty string, which means the login password is not encrypted. If you want to encrypt `target_cluster_password`, you can obtain the encrypted `target_cluster_password` string by using the SQL statement `SELECT TO_BASE64(AES_ENCRYPT('<target_cluster_password>','<target_cluster_password_secret_key>'))` (see the example after this table). |
| include_data_list | The databases and tables that need to be migrated, with multiple objects separated by commas (`,`). For example: `db1, db2.tbl2, db3`. This item takes precedence over `exclude_data_list`. If you want to migrate all databases and tables in the cluster, you do not need to configure this item. |
| exclude_data_list | The databases and tables that do not need to be migrated, with multiple objects separated by commas (`,`). For example: `db1, db2.tbl2, db3`. `include_data_list` takes precedence over this item. If you want to migrate all databases and tables in the cluster, you do not need to configure this item. |
| target_cluster_storage_volume | The storage volume used to store tables in the target cluster when the target cluster is a shared-data cluster. If you want to use the default storage volume, you do not need to specify this item. |
| target_cluster_replication_num | The number of replicas specified when creating tables in the target cluster. If you want to use the same replica number as the source cluster, you do not need to specify this item. |
| target_cluster_max_disk_used_percent | The disk usage threshold (in percent) for BE nodes of the target cluster when the target cluster is a shared-nothing cluster. Migration is terminated when the disk usage of any BE node in the target cluster exceeds this threshold. The default value is `80`, which means 80%. |
| meta_job_interval_seconds | The interval, in seconds, at which the migration tool retrieves metadata from the source and target clusters. You can use the default value for this item. |
| meta_job_threads | The number of threads used by the migration tool to obtain metadata from the source and target clusters. You can use the default value for this item. |
| ddl_job_interval_seconds | The interval, in seconds, at which the migration tool executes DDL statements on the target cluster. You can use the default value for this item. |
| ddl_job_batch_size | The batch size for executing DDL statements on the target cluster. You can use the default value for this item. |
| ddl_job_allow_drop_target_only | Whether to allow the migration tool to delete databases or tables that exist only in the target cluster but not in the source cluster. The default is `false`, which means they will not be deleted. You can use the default value for this item. |
| ddl_job_allow_drop_schema_change_table | Whether to allow the migration tool to delete tables with inconsistent schemas between the source and target clusters. The default is `true`, meaning they will be deleted. You can use the default value for this item. The migration tool will automatically synchronize the deleted tables during the migration. |
| ddl_job_allow_drop_inconsistent_partition | Whether to allow the migration tool to delete partitions with inconsistent data distribution between the source and target clusters. The default is `true`, meaning they will be deleted. You can use the default value for this item. The migration tool will automatically synchronize the deleted partitions during the migration. |
| ddl_job_allow_drop_partition_target_only | Whether to allow the migration tool to delete partitions that are deleted in the source cluster to keep the partitions consistent between the source and target clusters. The default is `true`, meaning they will be deleted. You can use the default value for this item. |
| replication_job_interval_seconds | The interval, in seconds, at which the migration tool triggers data synchronization tasks. You can use the default value for this item. |
| replication_job_batch_size | The batch size at which the migration tool triggers data synchronization tasks. You can use the default value for this item. |
| max_replication_data_size_per_job_in_gb | The data size threshold at which the migration tool triggers data synchronization tasks. Unit: GB. Multiple data synchronization tasks will be triggered if the size of the partitions to be migrated exceeds this value. The default value is `-1`, meaning no limit is imposed and all partitions in a table will be migrated in a single synchronization task. You can set this parameter to restrict the data size of each task if the table to be migrated has a large data volume. |
| report_interval_seconds | The time interval at which the migration tool prints the progress information. Unit: Seconds. Default value: `300`. You can use the default value for this item. |
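
For example, to generate the encrypted password string described in `source_cluster_password_secret_key` and `target_cluster_password_secret_key`, you can run the statement from the table above in a MySQL client connected to your cluster. The password and secret key below are placeholder values; replace them with your own.

```sql
-- Placeholder values for illustration only.
-- The returned Base64 string is what you set as source_cluster_password (or
-- target_cluster_password), and the secret key itself goes into the
-- corresponding *_password_secret_key item.
SELECT TO_BASE64(AES_ENCRYPT('MyPassword123', 'MySecretKey'));
```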

### Obtain Cluster Token

@@ -279,13 +292,13 @@ After configuring the tool, start the migration tool to initiate the data migration

You can check the migration progress through the migration tool log **log/sync.INFO.log**.

Example 1: View task progress.

![img](../_assets/data_migration_tool-1.png)

The important metrics are as follows:

- `Sync job progress`: The progress of data migration. The migration tool regularly checks whether the data in the target cluster is lagging behind the source cluster. Therefore, a progress of 100% only means that the data synchronization is completed within the current check interval. If new data continues to be loaded into the source cluster, the progress may decrease in the next check interval.
- `total`: The total number of all types of jobs in this migration operation.
- `ddlPending`: The number of DDL jobs pending to be executed.
- `jobPending`: The number of pending data synchronization jobs to be executed.
@@ -295,6 +308,18 @@ The important metrics are as follows:
- `failed`: The number of failed data synchronization jobs. Failed data synchronization jobs will be resent. Therefore, in most cases, you can ignore this metric. If this value is significantly large, please contact our engineers.
- `unknown`: The number of jobs with an unknown status. Theoretically, this value should always be `0`. If this value is not `0`, please contact our engineers.

Example 2: View the table migration progress.

![img](../_assets/data_migration_tool-2.png)

- `Sync table progress`: Table migration progress, that is, the ratio of tables that have been migrated in this migration task to all the tables that need to be migrated.
- `finishedTableRatio`: Ratio of tables with at least one successful synchronization task execution.
- `expiredTableRatio`: Ratio of tables with expired data.
- `total table`: Total number of tables involved in this data migration process.
- `finished table`: Number of tables with at least one successful synchronization task execution.
- `unfinished table`: Number of tables with no synchronization task execution.
- `expired table`: Number of tables with expired data.

### View Migration Transaction Status

The migration tool opens a transaction for each table. You can view the status of the migration for a table by checking the status of its corresponding transaction.
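
A hedged sketch of how you might check transaction status from a MySQL client; the exact `SHOW PROC` path (and whether it takes a database name or a database ID) can vary across StarRocks versions, and `<db_name>` is a placeholder:

```sql
-- Hypothetical example: list running transactions for a database being migrated.
-- Adjust the proc path to match your StarRocks version.
SHOW PROC '/transactions/<db_name>/running';
```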
@@ -336,6 +361,14 @@ WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_NAME;
```

## Limits

The objects that currently support synchronization are listed as follows (objects not listed do not support synchronization):

- Databases
- Internal tables and their data
- Materialized view schemas and their creation statements (the data in materialized views is not synchronized, and if the base tables of a materialized view are not synchronized to the target cluster, the background refresh task of the materialized view will report an error)
- Logical views

## Q&A

Binary file added docs/zh/_assets/data_migration_tool-2.png
