// modules/ROOT/pages/scalability/concepts.adoc
= Concepts
Scalability is a crucial aspect of database management, allowing a system to handle changing demands by adding or removing resources to match the database's workload.
Neo4j supports multiple strategies to achieve scalability, enabling systems to handle larger datasets, more concurrent users, and higher query complexity without compromising performance or availability, i.e. the system's resiliency.
The three main strategies are:
* xref:clustering/setup/analytics-cluster.adoc[Analytics clustering] -- for horizontal read scalability.
* xref:scalability/composite-databases/concepts.adoc[Composite databases] -- for federated queries and distributed data management.
* xref:scalability/sharded-property-databases/overview.adoc[Property sharding] -- for handling massive property-heavy graphs.
== What is scalability?
== What is database scalability?
Database scalability is the ability of a database management system (DBMS) to handle changing demands.
To scale properly, a database must apply strategies that cover all areas: data access, data manipulation in memory, and database computing.
Strategies include:
** *Shared Everything*: All servers share data and memory.
Flexible, but prone to contention. +
In this model, data on disk and in memory is shared among all servers in a cluster.
Requests are satisfied by any combination of servers.
This approach introduces complexity, as the cluster must implement a way to avoid contention when multiple servers try to update the same data simultaneously.
** *Shared Nothing*: Each server manages its own partition (shard).
More fault-tolerant, eliminates single points of failure. +
It includes:
* *Data volume* -- involves ensuring a consistent SLA in both query and administration response times, even as the size of the data for storage and retrieval expands. +
Volume depends on data type(s).
Vectors occupy a large data space.
* *Query volume*
** Read queries + write queries.
** Queries and user concurrency -- the aim is to ensure a linear response time during the execution of concurrent queries against the same database.
** Query complexity -- the aim is to provide response times in line with a query's complexity. The complexity of a query is determined by the combination of:
*** Steps to execute
* *Admin volume*
** Data ingestion/extraction -- When scaling data ingestion/extraction, the goal is to maintain a linear response time when ingesting or extracting an increasing set of data.
This objective remains true regardless of the volume of stored data, provided a similar data structure is used.
** Multi-tenancy -- In SaaS and AaaS environments, the scaling cost for tenants should be linear.
For more general services, such as DBaaS (e.g., Aura), scalability should also be linear, considering all five scalability factors mentioned here.
Single database queries must be modified according to the *sharding rules*. +
Automated shard pruning using sharding functions.
| Parallel execution on shards. +
Single database *queries run as is*. +
Automated shard pruning based on node selection.
| User tools
| Work with Browser and Cypher Shell. +
Tools used on individual shards and Bloom are not supported on composite databases.
| All tools supported.
| Admin tools
A xref:clustering/index.adoc[Neo4j cluster] is a high-availability cluster with multi-DB support.
This means that servers and databases are decoupled: servers provide computation and storage power for databases to use.
Each database relies on its own cluster architecture, organized into primaries (with a minimum of 3) and secondaries (for read scaling).
Scalability, allocation/reallocation, service elasticity, load balancing, and automatic routing are automatically provided (or they can be finely controlled).
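As a sketch of this topology model, the number of primaries and secondaries can be set when a database is created (the database name and counts here are hypothetical; `CREATE DATABASE … TOPOLOGY` is standard Cypher administration syntax):

[source, cypher]
----
// Hypothetical example: three primaries for fault tolerance,
// two secondaries for read scaling
CREATE DATABASE foo TOPOLOGY 3 PRIMARIES 2 SECONDARIES
----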
Sharded property databases are managed similarly to standard Neo4j databases, with some differences in certain administrative operations.
== Managing aliases for sharded databases
When creating an alias for a sharded database, use the virtual database name when specifying it as the alias target.
The following example shows how to create the alias `foo` for the sharded database `foo-sharded`:
[source, cypher]
----
CREATE ALIAS foo FOR DATABASE `foo-sharded`
----
The following example shows how to enable a server and allow allocating the property shard `foo-sharded-p000` to it:
[source, cypher]
----
ENABLE SERVER 'serverId' OPTIONS { allowedDatabases: ['foo-sharded-p000'] }
----
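To check where shard databases end up after enabling servers, one option is to list them with `SHOW DATABASES` (an illustrative sketch; the filter string assumes the `foo-sharded` naming used above):

[source, cypher]
----
// List shard databases and where they are hosted
SHOW DATABASES YIELD name, address, currentStatus
WHERE name STARTS WITH 'foo-sharded'
RETURN name, address, currentStatus
----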
== Resizing and resharding
You can reshard your data via the `neo4j-admin database copy` command.
See xref:scalability/sharded-property-databases/data-ingestion.adoc#splitting-existing-db-into-shards[Splitting an existing database into shards] for more information.
Alternatively, you can select more shards than needed to start with and allow space for their data to grow, as the Neo4j cluster allows databases to be moved based on server availability.
For example, ten property shards can be initially hosted on five servers (two shards per server), and additional servers can be added as needed.
For details on managing databases and servers in a cluster, see xref:clustering/databases.adoc[Managing databases in a cluster] and xref:clustering/servers.adoc[Managing servers in a cluster].
//TODO: We should talk about co-location, adding/removing servers in a cluster and say what is supported and what is not.
Backup chains for each shard are produced using the `neo4j-admin database backup` command.
For the graph shard, its backup chain must contain one full artefact and zero or more differential artefacts.
Each property shard’s backup chain must contain only one full backup and no differential backups.
In practical terms, this means that to back up a sharded property database, you start with a full backup of the graph shard and then all of the property shards; any subsequent differential backups need only be of the graph shard.
This is because the transaction log of the property shards is the same as the graph shard log and is simply filtered when applied, so only the graph shard log is required for a restore.
For example, assume there is a sharded property database called `foo` with a graph shard and two property shards.
A backup must be taken of each shard, for example:
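The concrete commands are not reproduced in this excerpt; a sketch using the documented `neo4j-admin database backup` command might look as follows (the property shard names `foo-p000` and `foo-p001` are taken from this example, while the graph shard name and backup path are hypothetical):

[source, shell]
----
# Full backup of the graph shard first, then every property shard
# (shard names and path illustrative)
neo4j-admin database backup foo-graph --to-path=/backups
neo4j-admin database backup foo-p000 --to-path=/backups
neo4j-admin database backup foo-p001 --to-path=/backups
----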
To form a valid sharded property database backup, you need to:
* Take a full backup of the property shard `foo-p000` so that its store at least includes transaction 5.
* Take a differential backup of the graph shard so that at least transaction 12 is included in its transaction log, so `foo-p001` is included in its range.
Once a valid sharded properties database backup is created, differential backups can be performed by taking differential backups of the graph shard, extending the range of the graph shard chain.
Continuing with the example, the graph chain contains transactions from 11 to 36, property shard 1’s store files are at 13, and property shard 2’s store files are at 30.
You then take a differential backup of the graph shard containing transactions 37 to 50.
At restore time, all databases can be recovered up to transaction 50 and made consistent.
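The transaction-range bookkeeping above can be captured in a small sanity check. This is an illustrative sketch, not a Neo4j API: the hypothetical `can_restore` function encodes the rule that the graph shard's log chain must cover every transaction from each property shard's store position up to the restore target.

```python
def can_restore(graph_chain, prop_shard_positions, target_txn):
    """Illustrative check: a restore to target_txn is possible when the
    graph shard's log chain (lo, hi) contains every transaction that each
    property shard still needs, i.e. from its store position + 1 up to
    target_txn."""
    lo, hi = graph_chain
    if target_txn > hi:
        return False  # the chain does not reach the restore target
    # each shard needs entries (position + 1) .. target_txn from the chain
    return all(pos + 1 >= lo and pos <= target_txn
               for pos in prop_shard_positions)

# Numbers from the example: the chain covers 11-50 after the differential
# backup, and the property shards' stores are at transactions 13 and 30.
print(can_restore((11, 50), [13, 30], 50))  # True
print(can_restore((11, 50), [5, 30], 50))   # False: txns 6-10 were pruned
```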
| By default, the sharded property database is disabled.footnote:[This setting is a feature toggle behind which the sharded property database is developed. See xref:scalability/sharded-property-databases/overview.adoc[Property sharding overview].]
| `db.query.default_language=CYPHER_25`
| Ensures that any database created will use Cypher 25 (unless users specifically override the default version in the `CREATE DATABASE` command).
// modules/ROOT/pages/scalability/sharded-property-databases/limitations-and-considerations.adoc
=== CDC
CDC is not supported in this version.
=== Unsupported procedures
The following procedures are not supported by sharded property databases:
* `cdc.earliest()`
* `cdc.current()`
[NOTE]
====
It is strongly recommended not to use `dbms.setConfigValue()` on sharded property databases. Because they run in a clustered environment, the procedure must be run against each cluster member and is not propagated to other members.
In particular, `dbms.setConfigValue()` cannot be used to set read-only behavior, as the two settings `server.databases.read_only` and `server.databases.writable` are not compatible with sharded property databases.
The correct way of setting read/write access is by using `ALTER DATABASE`.
See xref:scalability/sharded-property-databases/altering-sharded-databases.adoc[Altering sharded property databases] for details.
====
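As a sketch of the `ALTER DATABASE` route mentioned in the note (the database name is hypothetical):

[source, cypher]
----
// Set the sharded database to read-only via ALTER DATABASE,
// not dbms.setConfigValue()
ALTER DATABASE `foo-sharded` SET ACCESS READ ONLY
----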
=== Property-based access control (PBAC)
PBAC is not supported in this version.
=== `USE graph.byElementId()`
Calling `USE graph.byElementId(<element-id>)` with an element of a sharded database is not supported.
=== Queries with `MERGE` clause
`MERGE` queries are very slow at any meaningful scale.
Due to their plan, they are likely to cause a nested loop join, which does not perform well on sharded property databases at the moment.
=== Filtering on properties in paths
Consider the following query, which filters on a relationship property in a path:

[source, cypher]
----
MATCH (n:Person)-[k:KNOWS]->+(m:Person)
WHERE k.creationDate=1268465841718
RETURN n,k,m
----
This could be rewritten to perform better as follows:
[source, cypher]
----
MATCH (n:Person)-[k:KNOWS {creationDate: 1268465841718}]->+(m:Person)
RETURN n,k,m
----
However, not all queries can be rewritten in this way.
=== Call in transactions for batch write operations
Because of the write architecture, batching larger transactions during write operations gives significant performance benefits.
This is also true for single-instance databases, but the performance difference is more pronounced in sharded property databases.
For example, consider the following query:
[source, cypher]
----
FOR each update IN node_updates DO
  MATCH (n:Person {id: update.id})
  SET n.name = update.name,
      n.age = update.age
END FOR
----
It can be rewritten as follows to perform better:
[source, cypher]
----
UNWIND $node_updates AS u
MATCH (n:Person {id: u.id})
SET n.name = u.name,
    n.age = u.age
----
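Given the section title, very large update sets would presumably also be committed in batches with `CALL … IN TRANSACTIONS`; the following is an illustrative sketch (the parameter name, `id` property, and batch size are assumptions):

[source, cypher]
----
// Batch the updates into transactions of 10000 rows (illustrative)
UNWIND $node_updates AS u
CALL (u) {
  MATCH (n:Person {id: u.id})
  SET n.name = u.name,
      n.age = u.age
} IN TRANSACTIONS OF 10000 ROWS
----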
== Other considerations
=== `neo4j-admin database copy` to a sharded property database
When using the `neo4j-admin database copy --property-shard-count > 0` command to split an existing database into shards, it is not possible to copy in place, meaning you cannot replace your existing database with a sharded property database.
Instead, you must specify a new name or set `--to-path-data` and `--to-path-txn` or `--target-location={path|uri}` and `--target-format={database|backup}` to a new DBMS location.
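Put together, an invocation might look like the following sketch (the database names, shard count, and paths are hypothetical; the flags are the ones listed above):

[source, shell]
----
# Copy database neo4j into a new sharded property database with two
# property shards, writing to a new DBMS location (names and paths illustrative)
neo4j-admin database copy neo4j neo4j-sharded \
  --property-shard-count=2 \
  --to-path-data=/var/lib/neo4j-new/data \
  --to-path-txn=/var/lib/neo4j-new/transactions
----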
=== `USE` clause with sharded databases
When targeting a sharded database in a `USE` clause, use its virtual database name or an alias in the graph reference.
Targeting a shard directly is not supported.
For example:
[source, cypher]
----
USE `neo4j-sharded` MATCH (n) RETURN n
----
=== Cypher 5
Cypher 5 is unsupported for sharded property databases.
See xref:configuration/cypher-version-configuration.adoc[Configure the Cypher default version] for more information.
Property shards pull transaction log entries from the graph shard and apply them to their stores.
Thus, the graph shard must not prune an entry from its transaction log until every replica of each property shard has pulled and applied that entry.
Failure to maintain this requirement can render a sharded property database irrecoverable.
To ensure enough transaction logs are kept, you must set xref:configuration/configuration-settings.adoc#config_db.tx_log.rotation.retention_policy[`db.tx_log.rotation.retention_policy`] accordingly.
A suitable heuristic is to ensure that the transaction log kept covers the transactions written between successive full backups of the sharded property database.
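For example, in `neo4j.conf` (the retention window below is illustrative; choose one that spans your full-backup cadence):

[source, properties]
----
# Keep at least seven days of transaction logs (illustrative value)
db.tx_log.rotation.retention_policy=7 days
----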
[NOTE]
====
It is important to ensure that there is space for the transaction logs.
====
=== Controlling the property shard transaction log pull frequency
The interval at which property shards pull transaction log entries from the graph shard is controlled by `internal.dbms.sharded_property_database.property_pull_interval` (defaults to 10ms).
Write performance can often be improved by setting this value lower at the cost of more polling on the graph shard from the property shards, which has unknown consequences at the moment.
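For example, in `neo4j.conf` (the value below is illustrative; note that this is an `internal.` setting, so treat it as unsupported surface area):

[source, properties]
----
# Poll the graph shard for new transaction log entries every 5ms (illustrative)
internal.dbms.sharded_property_database.property_pull_interval=5ms
----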