Summary
Sparkdantic hardcodes the Python `datetime` type to map to Spark's `TimestampType` (TIMESTAMP WITH SESSION TIME ZONE). This causes unwanted timezone conversion when working with UTC timestamps, making it impossible to preserve exact UTC values.
Problem Description
When using `datetime` fields in a SparkModel, sparkdantic automatically maps them to Spark's timestamp type via this hardcoded mapping:
```python
_type_mapping = MappingProxyType(
    {
        # ... other mappings ...
        datetime: 'timestamp',  # Maps to TimestampType (WITH SESSION TIME ZONE)
        # ... other mappings ...
    }
)
```
This causes the following issues:
Issue 1: Unintended Timezone Conversion
Spark's TimestampType is "TIMESTAMP WITH SESSION TIME ZONE". When parsing ISO-8601 UTC timestamps (e.g., "2025-01-01T10:00:00Z"):
- Spark correctly parses the UTC indicator (Z suffix)
- Stores the timestamp internally as UTC
- But converts it to the session timezone when displayed/accessed
Example:
```python
from datetime import datetime

from pyspark.sql.functions import from_json
from sparkdantic import SparkModel

class EventData(SparkModel):
    timestamp: datetime  # Maps to TimestampType

# Session timezone: Europe/Berlin (UTC+1)
json_string = '{"timestamp":"2025-01-01T10:00:00Z"}'
df = spark.createDataFrame([(json_string,)], ["data"])
result_df = df.select(
    from_json("data", EventData.model_spark_schema()).getField("timestamp")
)
row = result_df.collect()[0]
print(row[0])  # Output: 2025-01-01 11:00:00 (converted to Berlin time!)
# Expected:    2025-01-01 10:00:00 (UTC unchanged)
```
Issue 2: Environment-Dependent Behavior
- Local testing: Timestamps appear in your machine's timezone
- Databricks: Timestamps may appear in a different timezone
- Makes tests fragile and environment-dependent
Issue 3: No Way to Opt-in to TimestampNTZ
Spark 3.4+ introduced TimestampNTZType (TIMESTAMP WITHOUT TIME ZONE) which:
- Preserves exact timestamp values without conversion
- Perfect for UTC timestamps that should remain unchanged
- Ideal for multi-timezone environments
But sparkdantic provides no way to use it.
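The behavioral difference between the two Spark timestamp types can be mimicked with the stdlib `datetime` alone (a sketch only; Spark's internal representation differs, but the observable wall-clock effect is the same):

```python
from datetime import datetime, timezone, timedelta

# The ISO-8601 UTC timestamp from the example above.
utc_ts = datetime.fromisoformat("2025-01-01T10:00:00+00:00")

# TIMESTAMP (with session time zone) behaves like converting to the
# session zone on access; Europe/Berlin is UTC+1 in winter.
berlin = timezone(timedelta(hours=1))
as_session = utc_ts.astimezone(berlin)
print(as_session.hour)  # 11 -- the wall-clock value shifts

# TIMESTAMP_NTZ behaves like keeping the wall-clock value as written:
ntz_like = utc_ts.replace(tzinfo=None)
print(ntz_like.hour)  # 10 -- value preserved unchanged
```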
Root Cause
The hardcoded `_type_mapping` doesn't support Spark 3.4+ `TimestampNTZType`. There's no mechanism to:
- Map `datetime` fields to `TimestampNTZType` instead of `TimestampType`
- Specify custom Spark types per field
- Configure a global timestamp type preference
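As a sketch of the third option, the mapping could be built by a factory instead of being a module-level constant. Note `build_type_mapping` is an illustrative name, not part of sparkdantic's actual API:

```python
from datetime import date, datetime
from types import MappingProxyType

# Current hardcoded behavior, abbreviated.
DEFAULT_MAPPING = {
    date: 'date',
    datetime: 'timestamp',
}

def build_type_mapping(timestamp_type: str = 'timestamp') -> MappingProxyType:
    """Return the type mapping, letting callers opt in to 'timestamp_ntz'."""
    if timestamp_type not in ('timestamp', 'timestamp_ntz'):
        raise ValueError(f'unsupported timestamp type: {timestamp_type}')
    mapping = dict(DEFAULT_MAPPING)
    mapping[datetime] = timestamp_type
    return MappingProxyType(mapping)

# Default keeps today's behavior; opting in switches datetime fields.
print(build_type_mapping()[datetime])                 # timestamp
print(build_type_mapping('timestamp_ntz')[datetime])  # timestamp_ntz
```

The default argument preserves backward compatibility, so existing models would see no change unless they opt in.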
Impact
- Severity: High - affects any UTC timestamp handling
- Use Cases Affected:
  - IoT/telemetry systems with UTC timestamps
  - Multi-timezone data processing
  - Systems that need exact timestamp preservation
Minimal Reproducible Example
```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from sparkdantic import SparkModel

spark = SparkSession.builder.appName("test").getOrCreate()

class TestModel(SparkModel):
    timestamp: datetime

# Create test data with a UTC timestamp
json_data = '{"timestamp":"2025-01-01T10:00:00Z"}'
df = spark.createDataFrame([(json_data,)], ["data"])

# Parse using the sparkdantic schema
result_df = df.select(from_json("data", TestModel.model_spark_schema()))
row = result_df.collect()[0][0]

print(f"Result:   {row.timestamp}")
print("Expected: 2025-01-01 10:00:00 (UTC)")
# In the Europe/Berlin timezone:
# Actual:   2025-01-01 11:00:00 ❌
```
Expected Behavior
UTC timestamps should remain unchanged (10:00) regardless of session timezone.
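Until the mapping is configurable, one partial workaround (a deployment-side mitigation, not a fix) is pinning the session time zone to UTC so that TIMESTAMP values are at least displayed consistently, e.g. in `spark-defaults.conf`:

```
spark.sql.session.timeZone UTC
```

This only normalizes the conversion target across environments; it does not give the without-time-zone semantics that `TimestampNTZType` provides, and it affects every timestamp in the session.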
Additional Context
- Spark 3.4+ has `TimestampNTZType` for this exact use case