Commit 0f7f581

Bulk delete sample
Bulk delete example for truncate and custom TTL.
1 parent f1355fc commit 0f7f581

File tree

4 files changed: +443 -0 lines changed

@@ -0,0 +1,85 @@
## Glue Bulk Delete Example

This example provides a Scala script for bulk deleting data in Amazon Keyspaces using AWS Glue. It allows you to bulk delete or truncate a table in Amazon Keyspaces without setting up a Spark cluster.

## Prerequisites

* Set up the Spark Cassandra connector using the provided [setup script](../)

### Setup Bulk delete

The following script sets up an AWS Glue job to bulk delete from a Keyspaces table. The script takes the following parameters:

* PARENT_STACK_NAME is the stack name used to create the Spark Cassandra connector with Glue. [setup script](../)
* DELETE_STACK_NAME is the stack name used to create the Glue job.
* KEYSPACE_NAME and TABLE_NAME together form the fully qualified name of the table you wish to bulk delete from.
* S3URI is the S3 URI where the deleted records will be stored. By default it uses the S3 bucket from the parent stack.
* FORMAT can be json, csv, or parquet. parquet is recommended for ease of use with data loading, transformations, and exports to Athena. The default is parquet.
* DISTINCT_KEYS is a comma-separated list of keys to delete by. If left blank, deletes are made by full primary key. If only some of the keys are specified, the job performs range deletes. A range delete in Keyspaces can delete up to 1,000 rows.
* QUERY_FILTER is a SQL-like filter condition applied to the delete statement. Leave it blank to delete every row and truncate the table.

```shell
./setup-bulkd-delete.sh PARENT_STACK_NAME DELETE_STACK_NAME KEYSPACE_NAME TABLE_NAME S3URI FORMAT DISTINCT_KEYS QUERY_FILTER
```
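
As an illustration, a hypothetical invocation with placeholder values (parent stack `aksglue`, keyspace `mykeyspace`, table `mytable`, an example bucket, and empty strings for the optional key list and filter) might look like this:

```shell
./setup-bulkd-delete.sh aksglue aksglue-bulk-delete mykeyspace mytable s3://mybucket/myexport parquet "" ""
```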

The job will copy data to the S3 bucket if one is provided. You can override or remove the S3 bucket at run time. The structure below is appended to the provided S3 URI and is the final location of the deleted data.

```shell
\--- S3_BUCKET
    \------- jars
    \------- conf
    \------- scripts
    \------- spark-logs
    \------- export
        \----- keyspace_name
            \----- table_name
                \----- bulk-delete
                    \----- year=2025
                        \----- month=01
                            \----- day=02
                                \----- hour=09
                                    \----- minute=22
                                        \--- YOUR DATA HERE
```
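
After a run completes, you can confirm where the exported data landed by listing the export prefix with the AWS CLI (the bucket, keyspace, and table below are placeholders):

```shell
aws s3 ls s3://mybucket/myexport/export/mykeyspace/mytable/bulk-delete/ --recursive
```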

### Running the bulk job from the CLI

You can run the job through the AWS CLI. The following command runs the job created in the previous step, but overrides the number of Glue workers, the worker type, and script arguments such as the table name. You can override any of the Glue job parameters and default arguments at run time.

```shell
aws glue start-job-run --job-name AmazonKeyspacesBulkDelete-aksglue-aksglue-bulk-delete --number-of-workers 8 --worker-type G.2X --arguments '{"--TABLE_NAME":"transactions"}'
```

A full list of AWS CLI arguments is available in the [start-job-run reference](https://docs.aws.amazon.com/cli/latest/reference/glue/start-job-run.html).

### List of script arguments

| argument | definition | default | required |
| :---------------- | :---------------------------------------------- | :------------------ | :------ |
| --KEYSPACE_NAME | Name of the keyspace of the table to delete from | provided at setup | Y |
| --TABLE_NAME | Name of the table to delete from | provided at setup | Y |
| --S3_URI | S3 URI where the root of the bulk delete data will be located. The folder structure is added dynamically by the Scala script | S3 bucket provided when setting up the parent stack or the bulk-delete stack | N |
| --FORMAT | The format of the export. parquet is recommended. You could alternatively use json or other formats supported by the Spark S3 libraries | parquet | N |
| --DRIVER_CONF | The file containing the driver configuration. By default the parent stack sets up one config for Cassandra and one for Keyspaces. You can add additional configurations by dropping them in the same S3 location | keyspaces-application.conf | Y |
| --DISTINCT_KEYS | Comma-separated list of keys. If left blank, the script uses the full primary key from the system schema. If you provide only a portion of the primary key, a range delete is used. A range delete in Keyspaces can delete up to 1,000 rows | full primary key | N |
| --QUERY_FILTER | SQL-like filter condition applied to the delete statement | no query filter | N |
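
For example, a run that keeps the job definition as-is but overrides the export location and format at run time might look like the following (the bucket is a placeholder):

```shell
aws glue start-job-run --job-name AmazonKeyspacesBulkDelete-aksglue-aksglue-bulk-delete \
  --arguments '{"--S3_URI":"s3://mybucket/myexport","--FORMAT":"json"}'
```
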
### Scheduled Trigger (Cron)

If you are building a recurring delete workload, such as a custom TTL or moving data to cold storage, you can set up a cron job using a Glue scheduled trigger. Here is a simple AWS CLI command that creates a Glue trigger to run your bulk delete Glue job once per week (every Monday at 12:00 UTC). The following example deletes rows whose event_date has passed the given cutoff date.

```shell
aws glue create-trigger \
  --name KeyspacesBulkDeleteWeeklyTrigger \
  --type SCHEDULED \
  --schedule "cron(0 12 ? * MON *)" \
  --start-on-creation \
  --actions '[{
    "JobName": "AmazonKeyspacesBulkDelete-bulk-delete",
    "WorkerType": "G.2X",
    "NumberOfWorkers": 8,
    "Arguments": {
      "--TABLE_NAME": "transactions",
      "--KEYSPACE_NAME": "aws",
      "--QUERY_FILTER": "event_date < '2024-05-17'"
    }
  }]'
```
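
Once the trigger is created, you can verify its schedule and job arguments with:

```shell
aws glue get-trigger --name KeyspacesBulkDeleteWeeklyTrigger
```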
@@ -0,0 +1,168 @@
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import scala.collection.JavaConverters._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode._
import com.datastax.spark.connector.cql._
import com.datastax.oss.driver.api.core._
import org.apache.spark.sql.functions.rand
import com.amazonaws.services.glue.log.GlueLogger
import java.time.ZonedDateTime
import java.time.ZoneOffset
import java.time.temporal.ChronoUnit
import java.time.format.DateTimeFormatter

object GlueApp {

  def main(sysArgs: Array[String]) {

    val requiredParams = Seq("JOB_NAME", "KEYSPACE_NAME", "TABLE_NAME", "DRIVER_CONF")

    val optionalParams = Seq("DISTINCT_KEYS", "QUERY_FILTER", "FORMAT", "S3_URI")

    // Build a list of optional parameters that exist in sysArgs
    val validOptionalParams = optionalParams.filter(param => sysArgs.contains(s"--$param") && param.trim.nonEmpty)

    // Combine required and valid optional parameters
    val validParams = requiredParams ++ validOptionalParams

    val args = GlueArgParser.getResolvedOptions(sysArgs, validParams.toArray)

    val driverConfFileName = args("DRIVER_CONF")

    val conf = new SparkConf()
      .setAll(
        Seq(
          ("spark.task.maxFailures", "100"),

          ("spark.cassandra.connection.config.profile.path", driverConfFileName),
          ("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions"),
          ("directJoinSetting", "on"),

          ("spark.cassandra.output.consistency.level", "LOCAL_QUORUM"), //WRITES
          ("spark.cassandra.input.consistency.level", "LOCAL_ONE"),     //READS

          ("spark.cassandra.sql.inClauseToJoinConversionThreshold", "0"),
          ("spark.cassandra.sql.inClauseToFullScanConversionThreshold", "0"),
          ("spark.cassandra.concurrent.reads", "50"),

          ("spark.cassandra.output.concurrent.writes", "3"),
          ("spark.cassandra.output.batch.grouping.key", "none"),
          ("spark.cassandra.output.batch.size.rows", "1"),
          ("spark.cassandra.output.ignoreNulls", "true")
        ))

    val spark: SparkContext = new SparkContext(conf)
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession

    import sparkSession.implicits._

    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val logger = new GlueLogger

    //validation steps for peers and partitioner
    val connector = CassandraConnector.apply(conf)
    val session = connector.openSession()
    val peersCount = session.execute("SELECT * FROM system.peers").all().size()

    val partitioner = session.execute("SELECT partitioner from system.local").one().getString("partitioner")

    logger.info("Total number of peers:" + peersCount)
    logger.info("Configured partitioner:" + partitioner)

    if(peersCount == 0){
       throw new Exception("No system peers found. Check required permissions to read from the system.peers table. If using VPCE check permissions for describing VPCE endpoints. https://docs.aws.amazon.com/keyspaces/latest/devguide/vpc-endpoints.html")
    }

    if(partitioner.equals("com.amazonaws.cassandra.DefaultPartitioner")){
       throw new Exception("Spark requires the use of RandomPartitioner or Murmur3Partitioner. See Working with partitioners in the Amazon Keyspaces documentation. https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-partitioners.html")
    }

    val backupLocation = args.getOrElse("S3_URI", "")
    val backupFormat = args.getOrElse("FORMAT", "parquet")
    val filterCriteria = args.getOrElse("QUERY_FILTER", "")

    val tableName = args("TABLE_NAME")
    val keyspaceName = args("KEYSPACE_NAME")

    // Look up the table's key columns in the system schema
    val query =
      s"""
         |SELECT column_name, kind
         |FROM system_schema.columns
         |WHERE keyspace_name = '$keyspaceName' AND table_name = '$tableName';
         |""".stripMargin

    // Execute the query
    val resultSet = session.execute(query)

    val validKinds = Set("partition_key", "clustering")

    // Extract primary key column names
    val primaryKeyColumnsCSV = resultSet.all().asScala
      .filter(row => validKinds.contains(row.getString("kind")))
      .map(_.getString("column_name"))
      .toList
      .mkString(", ")

    // Output the primary key columns
    logger.info(s"Primary Key Columns for $keyspaceName.$tableName: ${primaryKeyColumnsCSV}")

    // Use the provided DISTINCT_KEYS if supplied, otherwise fall back to the full primary key
    val distinctKeys = args.getOrElse("DISTINCT_KEYS", primaryKeyColumnsCSV).filterNot(_.isWhitespace).split(",")

    // Output the key columns used for the delete
    logger.info(s"Key Columns used for delete on $keyspaceName.$tableName: ${distinctKeys.mkString(", ")}")

    // Read the table through the Spark Cassandra connector
    var tableDf = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> tableName,
                    "keyspace" -> keyspaceName,
                    "pushdown" -> "false")) //set to true when executing against Apache Cassandra, false when working with Keyspaces
      .load()

    if(filterCriteria.trim.nonEmpty){
       tableDf = tableDf.filter(filterCriteria)
    }

    //backup to s3 for data that will be deleted
    if(backupLocation.trim.nonEmpty){
       val now = ZonedDateTime.now( ZoneOffset.UTC )//.truncatedTo( ChronoUnit.MINUTES ).format( DateTimeFormatter.ISO_DATE_TIME )

       //backup location for deletes
       val fullbackuplocation = backupLocation +
                                "/export" +
                                "/" + keyspaceName +
                                "/" + tableName +
                                "/bulk-delete" +
                                "/year=" + "%04d".format(now.getYear()) +
                                "/month=" + "%02d".format(now.getMonthValue()) +
                                "/day=" + "%02d".format(now.getDayOfMonth()) +
                                "/hour=" + "%02d".format(now.getHour()) +
                                "/minute=" + "%02d".format(now.getMinute())

       tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(fullbackuplocation)
    }

    // Delete by the selected key columns; deleting by a subset of the primary key performs range deletes
    tableDf.select(distinctKeys.head, distinctKeys.tail:_*).rdd.deleteFromCassandra(keyspaceName, tableName)

    Job.commit()
  }
}
@@ -0,0 +1,132 @@
AWSTemplateFormatVersion: 2010-09-09
Description: 'Create bulk delete Glue job for Amazon Keyspaces'
Parameters:
  KeyspaceName:
    NoEcho: false
    Description: Cassandra Keyspace name
    Type: String
    #Default: mykeyspace
    MinLength: 3
    MaxLength: 48
  TableName:
    NoEcho: false
    Description: Cassandra Table name
    Type: String
    #Default: mytable
    MinLength: 3
    MaxLength: 48
  DistinctKeys:
    NoEcho: false
    Description: Optional parameter. Comma-separated list of distinct keys (example "id,create_date"). For instance, you could delete by partition by using just the partition keys.
    Type: String
    Default: ""
    MaxLength: 48
  S3URI:
    NoEcho: false
    Description: S3 folder to export deleted records to
    Type: String
    Default: s3://mybucket/myexport
  FORMAT:
    NoEcho: false
    Description: Format used for export
    Type: String
    Default: parquet
    MinLength: 3
    MaxLength: 48
  QueryFilter:
    NoEcho: false
    Description: Optional parameter. Query filter criteria (example "my_column=='somevalue' AND my_othercolumn=='someothervalue'")
    Type: String
    Default: ""
    MaxLength: 48
  ParentStack:
    NoEcho: false
    Description: Stack used to set up the Spark Cassandra connector
    Type: String
    Default: aksglue
    MinLength: 3
    MaxLength: 48

Conditions:
  IsQueryFilterEmpty: !Equals
    - !Ref QueryFilter
    - ""
  IsDistinctKeysEmpty: !Equals
    - !Ref DistinctKeys
    - ""

Resources:
  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: !Sub
          - "s3://${IMPORTBUCKETNAME}/scripts/${ParentStack}-${AWS::StackName}-bulk-delete-sample.scala"
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
      DefaultArguments:
        "--job-language": "scala"
        "--user-jars-first": "true"
        "--extra-jars": !Sub
          - 's3://${IMPORTBUCKETNAME}/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar,s3://${IMPORTBUCKETNAME}/jars/aws-sigv4-auth-cassandra-java-driver-plugin-4.0.9-shaded.jar,s3://${IMPORTBUCKETNAME}/jars/spark-extension_2.12-2.8.0-3.4.jar,s3://${IMPORTBUCKETNAME}/jars/amazon-keyspaces-helpers-1.0-SNAPSHOT.jar'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--extra-files": !Sub
          - 's3://${IMPORTBUCKETNAME}/conf/keyspaces-application.conf'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--enable-metrics": "true"
        "--enable-continuous-cloudwatch-log": "true"
        "--enable-spark-ui": "true"
        "--spark-event-logs-path": !Sub
          - "s3://${IMPORTBUCKETNAME}/spark-logs/"
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--write-shuffle-files-to-s3": "true"
        "--write-shuffle-spills-to-s3": "true"
        "--TempDir": !Sub
          - 's3://${IMPORTBUCKETNAME}/shuffle-space/bulk-delete-sample/'
          - IMPORTBUCKETNAME:
              Fn::ImportValue:
                !Sub 'KeyspacesBucketNameExport-${ParentStack}'
        "--KEYSPACE_NAME": !Sub '${KeyspaceName}'
        "--TABLE_NAME": !Sub '${TableName}'
        "--DRIVER_CONF": "keyspaces-application.conf"
        "--DISTINCT_KEYS": !If
          - IsDistinctKeysEmpty
          - !Ref "AWS::NoValue"
          - !Sub '${DistinctKeys}'
        "--QUERY_FILTER": !If
          - IsQueryFilterEmpty
          - !Ref "AWS::NoValue"
          - !Sub '${QueryFilter}'
        "--FORMAT": !Sub '${FORMAT}'
        "--S3_URI": !Sub '${S3URI}'
        "--class": "GlueApp"
      #Connections:
      #  ConnectionsList
      Description: 'bulk delete rows in a Keyspaces table'
      #ExecutionClass: String
      #ExecutionProperty:
      #  ExecutionProperty
      GlueVersion: "3.0"
      #LogUri: String
      #MaxCapacity: Double
      #MaxRetries: Double
      Name: !Sub ['AmazonKeyspacesBulkDelete-${STACKNAME}', {STACKNAME: !Ref 'AWS::StackName'}]
      #NonOverridableArguments: Json
      #NotificationProperty:
      #  NotificationProperty
      NumberOfWorkers: 2
      Role:
        Fn::ImportValue:
          !Sub 'KeyspacesGlueJobServiceRoleExport-${ParentStack}'
      #SecurityConfiguration: String
      #Tags: Json
      #Timeout: Integer
      WorkerType: G.2X
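
If you prefer to create the bulk delete stack directly with CloudFormation instead of the setup script, a hypothetical invocation might look like the following (the stack name, template file name, and parameter values are placeholders):

```shell
aws cloudformation create-stack --stack-name aksglue-bulk-delete \
  --template-body file://glue-bulk-delete-sample.yaml \
  --parameters ParameterKey=ParentStack,ParameterValue=aksglue \
               ParameterKey=KeyspaceName,ParameterValue=mykeyspace \
               ParameterKey=TableName,ParameterValue=mytable
```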
