blog post about selecting sharding keys using vexplain and vt #1880

Draft · wants to merge 4 commits into base: prod
60 changes: 60 additions & 0 deletions content/en/blog/2025-01-25-going-sharded.md
@@ -0,0 +1,60 @@
---
author: 'Andrés Taylor'
date: 2025-01-25
slug: 'optimizing-sharding-strategies-vitess'
tags: ['Vitess', 'Sharding', 'MySQL', 'Query Optimization', 'Database Scaling', 'Vindex', 'VExplain', 'Performance Analysis', 'SQL Planning']
title: 'Mastering Sharding in Vitess: Tools, Strategies, and Best Practices'
description: "Explore how to optimize sharding strategies in Vitess for scalable query performance, leveraging tools like `vexplain` and `vt` for deep analysis and schema design."
---

## From Single MySQL to Sharded Vitess: A Hands-On Guide to VSchema Design

So you have a successful application that is using a large database that keeps growing?
Congratulations! That's a nice problem to have.

In this blog post, I'll share how you can go from an existing database and query log to a vschema, and some pitfalls to watch out for.
I'm going to assume you already know what database sharding is and how Vitess does it.
If you haven’t read Ben’s excellent [post about sharding](https://planetscale.com/blog/database-sharding), I recommend checking it out first to get the background—you can always return here afterward.

### Analyzing joins, filtering, grouping and transactions

When the Vitess planner analyses queries, it looks at joins, the `WHERE` clause, and the `GROUP BY` clause to figure out how to split up a query across shards.
Additionally, it's important to make sure that transactions don't span multiple shards. If they do, they are upgraded to distributed atomic transactions, which are much more expensive than single-shard transactions.
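
To make the multi-shard concern concrete, here is a minimal Python sketch. It is not how `vt transactions` works internally: `NUM_SHARDS` and the modulo placement rule are assumptions standing in for a real vindex, and the key values are made up. It only checks whether the rows written in one transaction would land on the same shard.

```python
NUM_SHARDS = 4  # hypothetical shard count, for illustration only


def shard_for(sharding_key: int) -> int:
    """Toy placement rule; a real Vitess vindex is more involved than modulo."""
    return sharding_key % NUM_SHARDS


def shards_touched(sharding_keys: list[int]) -> set[int]:
    """Shards hit by a transaction, given the sharding-key value of each row it writes."""
    return {shard_for(key) for key in sharding_keys}


def is_single_shard(sharding_keys: list[int]) -> bool:
    return len(shards_touched(sharding_keys)) <= 1


# One transaction inserts an order and two order items for customer 42.
# If all the tables are sharded by customer_id, every row uses key 42.
by_customer = [42, 42, 42]

# If each table is sharded by its own primary key, the keys differ per row.
by_primary_key = [1001, 5001, 5002]

print(is_single_shard(by_customer))
print(is_single_shard(by_primary_key))
```

With these sample values the first transaction stays on one shard and the second does not, which is the situation that forces a distributed transaction.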

Using the `vt` tooling, these things are easy to analyze: `vt keys` and `vt transactions` take a query log as input and produce JSON output, which `vt summarize` then turns into a Markdown report.

#### `vt keys`

In Vitess, we have `vexplain keys <query>`, a command that takes a query and analyses it:

```sql
vexplain keys select *
from orders o
join customers c on o.customer_id = c.id
```

This will output columns used by the query that might be interesting to test as sharding keys.

```json
{
  "statementType": "SELECT",
  "joinColumns": [
    "customers.id =",
    "orders.customer_id ="
  ],
  "selectColumns": [
    "customers.`name`",
    "customers.created_at",
    "customers.email",
    "customers.id",
    "orders.`status`",
    "orders.created_at",
    "orders.customer_id",
    "orders.id",
    "orders.total_amount"
  ]
}
```

To do this on a full query log, we use `vt keys`. Without having to start a Vitess cluster, you get the `vexplain keys` output for all queries in a log.
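
As a rough illustration of what aggregating this output across a log can tell you, the sketch below counts how often each column appears in `joinColumns` over several `vexplain keys`-style results. It is not the `vt keys` implementation and does not reproduce its output format; the sample results and the `count_join_columns` helper are made up for this example.

```python
import json
from collections import Counter

# Hypothetical per-query results, shaped like the `vexplain keys` output above.
SAMPLE_RESULTS = """
[
  {"statementType": "SELECT", "joinColumns": ["customers.id =", "orders.customer_id ="]},
  {"statementType": "SELECT", "joinColumns": ["orders.customer_id =", "customers.id ="]},
  {"statementType": "SELECT", "joinColumns": ["order_items.order_id =", "orders.id ="]},
  {"statementType": "SELECT"}
]
"""


def count_join_columns(results: list[dict]) -> Counter:
    """Count how many times each column shows up in a join predicate."""
    counts: Counter = Counter()
    for result in results:
        for entry in result.get("joinColumns", []):
            # Entries look like "table.column <operator>"; keep only the column.
            column = entry.rsplit(" ", 1)[0]
            counts[column] += 1
    return counts


counts = count_join_columns(json.loads(SAMPLE_RESULTS))
for column, n in counts.most_common():
    print(f"{column}: {n}")
```

Columns that show up in many join predicates are the ones worth trying first as sharding keys.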

190 changes: 190 additions & 0 deletions content/en/blog/2025-01-25-how-to-choose-sharding-keys-1.md
@@ -0,0 +1,190 @@
---
author: 'Andrés Taylor'
date: 2025-01-25
slug: 'optimizing-sharding-strategies-vitess'
tags: ['Vitess', 'Sharding', 'MySQL', 'Query Optimization', 'Database Scaling', 'Vindex', 'VExplain', 'Performance Analysis', 'SQL Planning']
title: 'Mastering Sharding in Vitess: Tools, Strategies, and Best Practices'
description: "Explore how to optimize sharding strategies in Vitess for scalable query performance, leveraging tools like `vexplain` and `vt` for deep analysis and schema design."
---

## Introduction to Sharding in Vitess

Effective sharding is essential for database scalability, especially when using an orchestration layer like Vitess.
By analyzing and refining your sharding strategy, you can minimize data transfer, optimize query plans, and improve application performance.
This guide explores practical methodologies and key tools in Vitess, including `vexplain` and the `vt` CLI, to help you design efficient sharding schemes and analyze query behavior at scale.
A portion of this introduction recaps Ben’s excellent [post about sharding](https://planetscale.com/blog/database-sharding).
If you haven’t read it yet, I recommend checking it out first to get the background—you can always return here afterward.

## The Importance of Choosing the Right Sharding Key [^1]
[^1]: For a visual breakdown, refer to Benjamin’s post on the PlanetScale website about sharding keys.

As your database outgrows the capacity of a single MySQL instance, Vitess can distribute the data across multiple instances through sharding.
Sharding splits a large database into smaller, more manageable pieces called shards, with each shard stored on a separate MySQL instance.
Vitess acts as a database proxy, creating the illusion of a single database while routing queries to the appropriate shards.

Choosing an effective sharding key—a column or set of columns that dictates how data is distributed across shards—is crucial for application performance.
The sharding key routes queries to the correct shard, functioning similarly to a primary key but used for data partitioning.
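
The routing idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Vitess code: the SHA-256 hash stands in for whatever vindex function you configure, and the four shard ranges and the `route` helper are invented for the example.

```python
import hashlib
from typing import Optional

# Four hypothetical shards, each owning a contiguous range of the first keyspace-ID byte.
SHARD_RANGES = {
    "-40": range(0x00, 0x40),
    "40-80": range(0x40, 0x80),
    "80-c0": range(0x80, 0xC0),
    "c0-": range(0xC0, 0x100),
}


def keyspace_id(sharding_key: int) -> bytes:
    """Toy stand-in for a vindex: hash the column value to 8 bytes."""
    return hashlib.sha256(str(sharding_key).encode()).digest()[:8]


def shard_for(sharding_key: int) -> str:
    first_byte = keyspace_id(sharding_key)[0]
    for name, byte_range in SHARD_RANGES.items():
        if first_byte in byte_range:
            return name
    raise AssertionError("the ranges cover every byte value")


def route(sharding_key: Optional[int]) -> list[str]:
    """A query filtered on the sharding key goes to one shard; otherwise to all of them."""
    if sharding_key is None:
        return list(SHARD_RANGES)
    return [shard_for(sharding_key)]


print(route(42))    # a single shard name
print(route(None))  # every shard: a scatter query
```

The point of the sketch is the asymmetry: a query that supplies the sharding key touches one shard, while a query that does not has to be sent everywhere.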

When analyzing a query, if Vitess detects a join performed on columns sharded by the same rules, it can push the join down to the shard level.
This is the ideal scenario, minimizing data transfer between shards.

Consider an example with two tables, `orders` and `customers`, each sharded by its primary key (`order_id` and `customer_id`, respectively).
Sharding every table by its primary key is usually not a great choice, but it makes a good starting point for illustrating what happens with a suboptimal sharding key.

```sql
select *
from orders o
join customers c on o.customer_id = c.customer_id
```

Since `orders` is sharded by `order_id` rather than by the join column `customer_id`, matching rows can live on different shards, so Vitess has to perform the join in the vtgate layer, the query router that sits between the application and the MySQL instances.
This is suboptimal: the rows from both tables have to be transferred to vtgate before they can be joined.

If we were to shard the `orders` table by `customer_id` instead of `order_id`, the join could be pushed down to the shards and performed there.
This would be much more efficient: each shard joins the rows it already holds, and only the joined results are sent to vtgate.
The sharding key does not have to be unique. It's just a value used to decide which shard a row should live on, so `customer_id` is perfectly fine as a sharding key for `orders`.
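
The effect of co-locating rows can be checked with a small simulation. The sketch below is only an illustration: the two-shard setup, the modulo placement rule, and the sample rows are assumptions, not anything Vitess does internally. It counts how many joined rows each shard can produce using only the rows it stores.

```python
NUM_SHARDS = 2  # hypothetical shard count

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 2},
]


def place(rows, key):
    """Distribute rows over shards with a toy modulo rule on the given column."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[row[key] % NUM_SHARDS].append(row)
    return shards


def local_join_count(order_shards, customer_shards):
    """Count joined rows when each shard only joins the rows it stores."""
    return sum(
        1
        for o_rows, c_rows in zip(order_shards, customer_shards)
        for o in o_rows
        for c in c_rows
        if o["customer_id"] == c["customer_id"]
    )


customer_shards = place(customers, "customer_id")

# orders sharded by order_id: some orders live apart from their customer.
found_by_order_id = local_join_count(place(orders, "order_id"), customer_shards)

# orders sharded by customer_id: every order sits next to its customer.
found_by_customer_id = local_join_count(place(orders, "customer_id"), customer_shards)

print(found_by_order_id, found_by_customer_id)
```

With this sample data, shard-local joins find only 1 of the 3 matching rows when `orders` is sharded by `order_id`, and all 3 when it is sharded by `customer_id`. That is why the first layout cannot push the join down and the second can.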

## Analyzing How Queries Execute

When you're choosing sharding keys, it's important to understand how your queries will execute.
Vitess provides a tool called `vexplain` that can help you analyze how your queries will be executed.
It's similar to MySQL's `EXPLAIN` command, showing the query plan that Vitess will use to execute the query.

```sql
vexplain plan select *
from orders o
join customers c on o.customer_id = c.customer_id
```

This will output the query plan that Vitess will use to execute the query. It's represented as a JSON tree:

```json
{
  "OperatorType": "Join",
  "Variant": "Join",
  "JoinColumnIndexes": "L:0,L:1,L:2,L:3,L:4,R:0,R:1,R:2,R:3",
  "JoinVars": {
    "o_customer_id": 1
  },
  "TableName": "orders_customers",
  "Inputs": [
    {
      "OperatorType": "Route",
      "Variant": "Scatter",
      "Keyspace": {
        "Name": "ks_derived",
        "Sharded": true
      },
      "FieldQuery": "select o.order_id, o.customer_id, o.`status`, o.total_amount, o.created_at from orders as o where 1 != 1",
      "Query": "select o.order_id, o.customer_id, o.`status`, o.total_amount, o.created_at from orders as o",
      "Table": "orders"
    },
    {
      "OperatorType": "Route",
      "Variant": "EqualUnique",
      "Keyspace": {
        "Name": "ks_derived",
        "Sharded": true
      },
      "FieldQuery": "select c.customer_id, c.email, c.`name`, c.created_at from customers as c where 1 != 1",
      "Query": "select c.customer_id, c.email, c.`name`, c.created_at from customers as c where c.customer_id = :o_customer_id /* INT32 */",
      "Table": "customers",
      "Values": [
        ":o_customer_id"
      ],
      "Vindex": "user_index"
    }
  ]
}
```

This is the execution plan that Vitess will use for the query, with the join at the root.
The join is a nested loop join: for every row returned from the first input, Vitess issues a query to the second input.
The second input is a so-called `EqualUnique` route, which means Vitess knows that each of those queries only needs to be sent to a single shard.

Vitess plans the query this way because each table is sharded by its primary key.
The second query filters `customers` by its sharding key, so Vitess can work out exactly which shard to send it to.
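
To see why this plan gets expensive, here is a minimal Python model of the nested loop join described above. It is a sketch, not vtgate's implementation: the sample rows are invented, the scatter read is counted as a single query even though it fans out to every shard, and a dictionary lookup stands in for the `EqualUnique` route.

```python
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 2},
]
customers_by_id = {
    1: {"customer_id": 1, "name": "Ada"},
    2: {"customer_id": 2, "name": "Grace"},
}


def nested_loop_join(orders, customers_by_id):
    """Join orders to customers one left-hand row at a time, counting queries issued."""
    queries_sent = 1  # the scatter query that reads all orders
    joined = []
    for order in orders:
        # One lookup per left-hand row, routed by customer_id to a single shard.
        queries_sent += 1
        customer = customers_by_id.get(order["customer_id"])
        if customer is not None:
            joined.append({**order, **customer})
    return joined, queries_sent


joined, queries_sent = nested_loop_join(orders, customers_by_id)
print(len(joined), queries_sent)
```

The number of queries grows with the number of rows on the left-hand side of the join: three orders already mean four queries here, and a large `orders` table means one lookup per order.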

If we instead were to shard the `orders` table by `customer_id`, the query plan would look different.

```json
{
  "OperatorType": "Route",
  "Variant": "Scatter",
  "Keyspace": {
    "Name": "ks_derived",
    "Sharded": true
  },
  "FieldQuery": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at, c.id, c.email, c.`name`, c.created_at from orders as o, customers as c where 1 != 1",
  "Query": "select o.id, o.customer_id, o.`status`, o.total_amount, o.created_at, c.id, c.email, c.`name`, c.created_at from orders as o, customers as c where o.customer_id = c.id",
  "Table": "customers, orders"
}
```

Here the join has been pushed down to the shards and is performed there.
vtgate only has to concatenate the results from the shards.

## Analyze Queries With VExplain Keys

Vitess also provides a tool called `vexplain keys` that can help you analyze your queries so you can design your schema.

```sql
vexplain keys select *
from orders o
join customers c on o.customer_id = c.id
```

This will output columns used by the query that might be interesting to test as sharding keys.

```json
{
  "statementType": "SELECT",
  "joinColumns": [
    "customers.id =",
    "orders.customer_id ="
  ],
  "selectColumns": [
    "customers.`name`",
    "customers.created_at",
    "customers.email",
    "customers.id",
    "orders.`status`",
    "orders.created_at",
    "orders.customer_id",
    "orders.id",
    "orders.total_amount"
  ]
}
```

This output shows the columns used in the query. The join columns are the columns used to join the two tables, and the equality sign after each column name indicates which comparison is performed against that column.
The tool also shows columns used for filtering or grouping, which can be useful when choosing sharding keys. For a more complex query, the output can look like this:

```json
{
  "statementType": "SELECT",
  "groupingColumns": [
    "customers.`name`",
    "customers.id"
  ],
  "joinColumns": [
    "customers.id =",
    "order_items.order_id =",
    "orders.customer_id =",
    "orders.id ="
  ],
  "filterColumns": [
    "orders.`status` ="
  ],
  "selectColumns": [
    "customers.`name`",
    "customers.id",
    "order_items.quantity",
    "order_items.unit_price"
  ]
}
```

This tool is very useful when you're designing your schema and trying to figure out which columns to use as sharding keys.
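
If you want to work with this output programmatically, one possible approach is to split each entry into table, column, and operator and group the usages per table. The sketch below assumes the entry format seen in the examples above (`table.column`, optionally followed by a space and an operator) and unquoted table names; the `columns_by_table` helper is hypothetical and not part of Vitess.

```python
import json
from collections import defaultdict

# The more complex `vexplain keys` output shown above, trimmed to the usage lists.
OUTPUT = """
{
  "groupingColumns": ["customers.`name`", "customers.id"],
  "joinColumns": ["customers.id =", "order_items.order_id =", "orders.customer_id =", "orders.id ="],
  "filterColumns": ["orders.`status` ="]
}
"""


def columns_by_table(result: dict) -> dict:
    """Group column usages by table as (column, usage, operator) tuples."""
    usages = defaultdict(list)
    for usage in ("joinColumns", "filterColumns", "groupingColumns"):
        for entry in result.get(usage, []):
            # "orders.`status` =" -> qualified name "orders.`status`", operator "="
            qualified, _, operator = entry.partition(" ")
            table, _, column = qualified.partition(".")
            usages[table].append((column.strip("`"), usage, operator or None))
    return dict(usages)


by_table = columns_by_table(json.loads(OUTPUT))
for table, cols in sorted(by_table.items()):
    print(table, cols)
```

Grouping by table makes it easier to see, for each table, which columns are joined or filtered on and are therefore candidates for its sharding key.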

In a follow-up blog post, I'll show how you can do this at scale with [`vt`](https://github.com/vitessio/vt).