
Add data retention detection feature for GDPR/CCPA compliance#42

Draft
Copilot wants to merge 4 commits into main from copilot/add-data-retention-feature

Conversation

Contributor

Copilot AI commented Oct 12, 2025

Overview

This PR implements a comprehensive data retention detection feature that enables organizations to identify and manage customer data exceeding specified retention schedules. This feature helps ensure compliance with data protection regulations like GDPR and CCPA by providing automated tools for detecting, analyzing, and handling retention violations.

Problem Statement

Organizations using Maskala need the ability to:

  • Identify customers whose data exceeds retention policies
  • Generate compliance reports and metrics
  • Automate data lifecycle management
  • Create audit trails for regulatory compliance
  • Scale retention checks across large datasets

Solution

The new RetentionDetector analyzer provides a complete toolkit for retention management:

Core Capabilities

1. Policy-Based Detection

val gdprPolicy = RetentionPolicy(
  name = "GDPR_DEFAULT",
  retentionDays = 365,
  description = Some("GDPR 1-year retention policy")
)

val violations = RetentionDetector.detectViolations(
  data = customerData,
  idColumn = "user_id",
  timestampColumn = "last_activity_date",
  policy = gdprPolicy
)
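Internally, a violation check reduces to comparing each record's last-activity date against a cutoff derived from the policy. A minimal single-record sketch of that logic in plain Scala with `java.time` (the names `Policy`, `daysPastRetention`, and `isViolation` are illustrative stand-ins, not the PR's internals):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Illustrative stand-in for the PR's RetentionPolicy case class
case class Policy(name: String, retentionDays: Int)

// How many days past the retention window this record is (negative = still compliant)
def daysPastRetention(lastActivity: LocalDate, policy: Policy, today: LocalDate): Long =
  ChronoUnit.DAYS.between(lastActivity, today) - policy.retentionDays

// A record violates the policy when its last activity is older than retentionDays
def isViolation(lastActivity: LocalDate, policy: Policy, today: LocalDate): Boolean =
  daysPastRetention(lastActivity, policy, today) > 0

val gdpr = Policy("GDPR_DEFAULT", 365)
val today = LocalDate.of(2025, 10, 12)

val stale  = isViolation(LocalDate.of(2023, 1, 1), gdpr, today)  // old record: violation
val recent = isViolation(LocalDate.of(2025, 6, 1), gdpr, today)  // recent record: compliant
```

The distributed version applies the same comparison as a Spark column expression rather than per row on the driver.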

2. Compliance Monitoring

val (violations, summary) = RetentionDetector.detectViolationsWithSummary(
  customerData, "user_id", "last_activity_date", gdprPolicy
)

println(summary)
// Output:
// Retention Policy: GDPR_DEFAULT
// Total Records: 10000
// Violations Found: 1234
// Compliance Rate: 87.66%

3. Automated Action Planning

val actionPlan = RetentionDetector.generateActionPlan(violations, actionThresholdDays = 30)
// Returns: customerId, daysPastRetention, recommendedAction (Review/Anonymize/Delete), 
//          priority (High/Medium/Low)
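The `recommendedAction` and `priority` columns suggest a simple threshold scheme over days past retention. A hedged sketch of how such a mapping might look (the multiples of `actionThresholdDays` and the 30/90-day priority cut points are assumptions for illustration, not the PR's actual thresholds):

```scala
// Hypothetical mapping from days-past-retention to an action;
// the 1x/3x threshold multiples are illustrative assumptions.
def recommendedAction(daysPast: Long, actionThresholdDays: Int): String =
  if (daysPast > actionThresholdDays * 3L) "Delete"
  else if (daysPast > actionThresholdDays) "Anonymize"
  else "Review"

// Hypothetical priority bucketing; 30/90-day boundaries are assumed.
def priority(daysPast: Long): String =
  if (daysPast > 90) "High"
  else if (daysPast > 30) "Medium"
  else "Low"

val a1 = recommendedAction(10, 30)   // "Review"
val a2 = recommendedAction(50, 30)   // "Anonymize"
val a3 = recommendedAction(120, 30)  // "Delete"
```

In the Spark implementation this would typically be expressed with `when`/`otherwise` column expressions so it runs distributed.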

4. Multi-Policy Support

val policies = Seq(
  RetentionPolicy("SHORT_TERM", 30),
  RetentionPolicy("MEDIUM_TERM", 90),
  RetentionPolicy("LONG_TERM", 365)
)

val allViolations = RetentionDetector.detectMultiplePolicies(
  customerData, "user_id", "last_activity_date", policies
)

Key Features

  • Scalable Processing: Built on Spark DataFrames for distributed processing of large datasets
  • Smart Type Handling: Automatically handles Timestamp, Date, and String column types
  • Violation Analysis: Groups violations by time ranges (0-30 days, 31-90 days, etc.)
  • MerkleTree Integration: Creates cryptographic audit trails for compliance proof
  • Flexible Configuration: Supports both programmatic and YAML-based setup
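The time-range grouping mentioned above can be sketched as a plain bucketing function (the bucket labels follow the ranges named in the bullet; the function and aggregation details are assumptions, not the PR's code):

```scala
// Bucket a violation's days-past-retention into the reported ranges
def ageBucket(daysPast: Long): String =
  if (daysPast <= 30) "0-30 days"
  else if (daysPast <= 90) "31-90 days"
  else "90+ days"

// Counting violations per bucket, as a summary report might
val violationsDaysPast = Seq(5L, 12L, 45L, 200L, 400L)
val byBucket = violationsDaysPast.groupBy(ageBucket).map { case (k, v) => k -> v.size }
// Map("0-30 days" -> 2, "31-90 days" -> 1, "90+ days" -> 2)
```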

Integration Example

The feature integrates seamlessly with existing Maskala components:

// Detect violations
val (violations, summary) = RetentionDetector.detectViolationsWithSummary(
  userData, "user_id", "last_activity_date", gdprPolicy
)

// Generate action plan
val actionPlan = RetentionDetector.generateActionPlan(violations, 30)

// Create audit proof before cleanup
val beforeProof = MerkleTree.apply(userData, Seq("email", "age"), "user_id")

// Perform cleanup
val toDelete = actionPlan.filter(col("recommendedAction") === "Delete")
val cleanedData = userData.join(
  toDelete.withColumnRenamed("customerId", "user_id"),
  Seq("user_id"), "left_anti")

// Verify with audit proof
val afterProof = MerkleTree.apply(cleanedData, Seq("email", "age"), "user_id")
val deletedIds = toDelete.select("customerId")
val deletionProof = MerkleTree.verifyDeletion(userData, cleanedData, deletedIds, ...)
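The MerkleTree steps above amount to fingerprinting the dataset before and after cleanup so the deletion can be evidenced. A pure-Scala sketch of that idea using SHA-256 over sorted row digests (this illustrates the concept only; it is not Maskala's MerkleTree implementation):

```scala
import java.security.MessageDigest

def sha256Hex(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// Dataset-level digest: hash each row, sort for order-independence,
// then hash the concatenation of row digests.
def datasetDigest(rows: Seq[String]): String =
  sha256Hex(rows.map(sha256Hex).sorted.mkString)

val before = Seq("u1,alice@example.com", "u2,bob@example.com", "u3,carol@example.com")
val after  = before.filterNot(_.startsWith("u2"))  // u2 deleted

// Digests differ, evidencing that the data changed...
val changed = datasetDigest(before) != datasetDigest(after)
// ...and the after-digest is reproducible from the surviving rows alone.
val reproducible =
  datasetDigest(after) == datasetDigest(Seq("u1,alice@example.com", "u3,carol@example.com"))
```

A real Merkle tree additionally keeps the intermediate hashes so individual rows can be proven present or absent without rehashing the whole dataset.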

What's Included

New Components

  • RetentionPolicy: Case class for defining retention schedules
  • RetentionDetector: Main analyzer object with detection methods
  • RetentionSummary: Statistical summary with compliance metrics
  • RetentionParams: Configuration support for Anonymiser integration
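The compliance rate shown in the summary output earlier is consistent with a simple ratio of compliant records to total records. A sketch of how `RetentionSummary`'s metric could be derived (the case-class shape and field names are assumptions for illustration):

```scala
// Illustrative stand-in for the PR's RetentionSummary
case class Summary(policyName: String, totalRecords: Long, violations: Long) {
  // Share of records still within the retention window, as a percentage
  def complianceRate: Double =
    if (totalRecords == 0) 100.0
    else (totalRecords - violations).toDouble / totalRecords * 100
}

val s = Summary("GDPR_DEFAULT", 10000, 1234)
val formatted = f"${s.complianceRate}%.2f%%"  // "87.66%", matching the sample output
```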

Testing

  • 16 comprehensive test cases covering all functionality
  • 100% test success rate (63/63 total tests passing)
  • Edge case coverage (empty data, multiple policies, type handling)
  • Integration tests with existing features

Documentation

  • Comprehensive README section with 8+ usage examples
  • Quick start guide (docs/RETENTION_DETECTION_GUIDE.md) with best practices
  • Complete pipeline example (RetentionPipelineExample.scala) showing end-to-end workflow
  • Example YAML configuration for Anonymiser integration
  • Troubleshooting guide and performance tips

Use Cases

This feature enables:

  1. GDPR Compliance: Identify inactive users and generate deletion reports
  2. CCPA Compliance: Track marketing data retention and automate cleanup
  3. Automated Pipelines: Schedule regular retention checks and automated data lifecycle management
  4. Risk Assessment: Analyze violations by severity and prioritize cleanup actions
  5. Audit Reporting: Generate compliance reports with cryptographic proofs

Technical Details

  • No breaking changes - all additions are backward compatible
  • Follows existing Maskala patterns (analyzer design, Spark integration)
  • Clean compilation with no warnings
  • Full ScalaDoc documentation on all public methods
  • Production-ready with proper error handling

Files Changed

Added:

  • src/main/scala/org/mitchelllisle/analysers/RetentionDetector.scala
  • src/test/scala/RetentionDetectorTest.scala
  • src/test/resources/retentionConfig.yaml
  • src/main/scala/org/mitchelllisle/examples/RetentionPipelineExample.scala
  • docs/RETENTION_DETECTION_GUIDE.md

Modified:

  • README.md (added Retention Detection section)

Testing Checklist

  • All new tests pass (16/16)
  • All existing tests pass (47/47)
  • Clean compilation with no warnings
  • Code follows project conventions
  • Documentation is comprehensive
  • Examples are functional and tested
Original prompt

Create a feature to detect customers whose data exceeds specified retention schedules in Spark datasets. This should:

  • Allow users to configure custom retention schedules
  • Scan datasets for retention-relevant fields (e.g., timestamps, retention flags)
  • Report customers who are overdue for data deletion
  • Provide guidance or automated options for removal/anonymisation
  • Integrate with Spark workflows for scalable processing
  • Document usage and provide example pipelines

This will help ensure compliance and improve data lifecycle management.

This pull request was created from the prompt above via Copilot chat.



Copilot AI and others added 3 commits October 12, 2025 00:54
Co-authored-by: mitchelllisle <18128531+mitchelllisle@users.noreply.github.com>
Copilot AI changed the title [WIP] Add feature to detect overdue customer data retention Add data retention detection feature for GDPR/CCPA compliance Oct 12, 2025
Copilot AI requested a review from mitchelllisle October 12, 2025 01:04