import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import scala.collection.JavaConverters._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode._
import com.datastax.spark.connector.cql._
import com.datastax.oss.driver.api.core._
import org.apache.spark.sql.functions.rand
import com.amazonaws.services.glue.log.GlueLogger
import java.time.ZonedDateTime
import java.time.ZoneOffset
import java.time.temporal.ChronoUnit
import java.time.format.DateTimeFormatter


object GlueApp {

  def main(sysArgs: Array[String]): Unit = {

    val requiredParams = Seq("JOB_NAME", "KEYSPACE_NAME", "TABLE_NAME", "DRIVER_CONF")

    val optionalParams = Seq("DISTINCT_KEYS", "QUERY_FILTER", "FORMAT", "S3_URI")

    // Keep only the optional parameters that were actually passed in sysArgs
    val validOptionalParams = optionalParams.filter(param => sysArgs.contains(s"--$param"))

    // Combine required and supplied optional parameters
    val validParams = requiredParams ++ validOptionalParams

    val args = GlueArgParser.getResolvedOptions(sysArgs, validParams.toArray)

    val driverConfFileName = args("DRIVER_CONF")

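    // Notes on the connector settings below (assumptions based on common Amazon Keyspaces guidance; tune for your account):
    // - Amazon Keyspaces accepts only LOCAL_QUORUM for write consistency; LOCAL_ONE is used here for reads.
    // - batch.size.rows = 1 with batch.grouping.key = none effectively disables batching, so each delete is sent as its own request.
    // - concurrent.writes / concurrent.reads limit per-task parallelism to help avoid capacity (rate) errors.
    // - spark.task.maxFailures = 100 lets Spark retry tasks that hit transient throttling.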
    val conf = new SparkConf()
      .setAll(
        Seq(
          ("spark.task.maxFailures", "100"),

          ("spark.cassandra.connection.config.profile.path", driverConfFileName),
          ("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions"),
          ("directJoinSetting", "on"),

          ("spark.cassandra.output.consistency.level", "LOCAL_QUORUM"), // writes
          ("spark.cassandra.input.consistency.level", "LOCAL_ONE"),     // reads

          ("spark.cassandra.sql.inClauseToJoinConversionThreshold", "0"),
          ("spark.cassandra.sql.inClauseToFullScanConversionThreshold", "0"),
          ("spark.cassandra.concurrent.reads", "50"),

          ("spark.cassandra.output.concurrent.writes", "3"),
          ("spark.cassandra.output.batch.grouping.key", "none"),
          ("spark.cassandra.output.batch.size.rows", "1"),
          ("spark.cassandra.output.ignoreNulls", "true")
        ))

    val spark: SparkContext = new SparkContext(conf)
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession

    import sparkSession.implicits._

    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    val logger = new GlueLogger

    // Validation steps for peers and partitioner
    val connector = CassandraConnector(conf)
    val session = connector.openSession()
    val peersCount = session.execute("SELECT * FROM system.peers").all().size()

    val partitioner = session.execute("SELECT partitioner FROM system.local").one().getString("partitioner")

    logger.info("Total number of peers: " + peersCount)
    logger.info("Configured partitioner: " + partitioner)

    if(peersCount == 0){
      throw new Exception("No system peers found. Check required permissions to read from the system.peers table. If using a VPC endpoint, check permissions for describing VPC endpoints. https://docs.aws.amazon.com/keyspaces/latest/devguide/vpc-endpoints.html")
    }

    if(partitioner.equals("com.amazonaws.cassandra.DefaultPartitioner")){
      throw new Exception("Spark requires the use of RandomPartitioner or Murmur3Partitioner. See Working with partitioners in the Amazon Keyspaces documentation. https://docs.aws.amazon.com/keyspaces/latest/devguide/working-with-partitioners.html")
    }

    val backupLocation = args.getOrElse("S3_URI", "")
    val backupFormat = args.getOrElse("FORMAT", "parquet")
    val filterCriteria = args.getOrElse("QUERY_FILTER", "")

    val tableName = args("TABLE_NAME")
    val keyspaceName = args("KEYSPACE_NAME")


    val query =
      s"""
         |SELECT column_name, kind
         |FROM system_schema.columns
         |WHERE keyspace_name = '$keyspaceName' AND table_name = '$tableName';
         |""".stripMargin

    // Execute the query
    val resultSet = session.execute(query)

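    // In system_schema.columns, "kind" is one of partition_key, clustering, regular, or static;
    // only partition_key and clustering columns make up the primary key.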
    val validKinds = Set("partition_key", "clustering")

    // Extract the primary key column names
    val primaryKeyColumnsCSV = resultSet.all().asScala
      .filter(row => validKinds.contains(row.getString("kind")))
      .map(_.getString("column_name"))
      .toList
      .mkString(", ")

    // Output the primary key columns
    logger.info(s"Primary Key Columns for $keyspaceName.$tableName: ${primaryKeyColumnsCSV}")

    // Use the caller-supplied DISTINCT_KEYS if present, otherwise fall back to the full primary key
    val distinctKeys = args.getOrElse("DISTINCT_KEYS", primaryKeyColumnsCSV).filterNot(_.isWhitespace).split(",")

    // Output the key columns used for the delete
    logger.info(s"Key columns used for delete on $keyspaceName.$tableName: ${distinctKeys.mkString(", ")}")

    var tableDf = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> tableName,
                    "keyspace" -> keyspaceName,
                    "pushdown" -> "false")) // set to true when executing against Apache Cassandra, false when working with Keyspaces
      .load()

    if(filterCriteria.trim.nonEmpty){
      tableDf = tableDf.filter(filterCriteria)
    }
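    // QUERY_FILTER, when supplied, is a Spark SQL filter expression evaluated on the Spark side
    // (pushdown is disabled above), e.g. a hypothetical "event_date = '2024-01-01' AND region = 'us-east-1'".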

    // Back up to S3 the data that will be deleted
    if(backupLocation.trim.nonEmpty){
      val now = ZonedDateTime.now( ZoneOffset.UTC )//.truncatedTo( ChronoUnit.MINUTES ).format( DateTimeFormatter.ISO_DATE_TIME )

      // Backup location for deletes
      val fullbackuplocation = backupLocation +
        "/export" +
        "/" + keyspaceName +
        "/" + tableName +
        "/bulk-delete" +
        "/year=" + "%04d".format(now.getYear()) +
        "/month=" + "%02d".format(now.getMonthValue()) +
        "/day=" + "%02d".format(now.getDayOfMonth()) +
        "/hour=" + "%02d".format(now.getHour()) +
        "/minute=" + "%02d".format(now.getMinute())

      tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(fullbackuplocation)
    }

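    // Select only the key columns and issue deletes through the Spark Cassandra Connector's deleteFromCassandra.
    // With the full primary key this targets individual rows; if DISTINCT_KEYS names only the partition key
    // columns, entire partitions are expected to be removed.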
    tableDf.select(distinctKeys.head, distinctKeys.tail:_*).rdd.deleteFromCassandra(keyspaceName, tableName)

    Job.commit()
  }
}
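// Example Glue job arguments (hypothetical values, for illustration only):
//   --JOB_NAME      bulk-delete-job
//   --KEYSPACE_NAME my_keyspace
//   --TABLE_NAME    my_table
//   --DRIVER_CONF   keyspaces-application.conf
//   --QUERY_FILTER  "event_date = '2024-01-01'"   (optional)
//   --FORMAT        parquet                        (optional, default parquet)
//   --S3_URI        s3://my-bucket/backups         (optional; enables the pre-delete backup)
//   --DISTINCT_KEYS id,event_date                  (optional; defaults to the table's primary key)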