
No Support for TimestampNTZ (Timestamp without Timezone) #825

@RolandWolman

Description

Summary

Sparkdantic hardcodes the Python datetime type to map to Spark's TimestampType (TIMESTAMP WITH SESSION TIME ZONE). This causes unwanted timezone-conversion behavior when working with UTC timestamps: there is no way to preserve exact UTC values without the session-timezone conversion.

Problem Description

When using datetime fields in a SparkModel, sparkdantic automatically maps them to Spark's timestamp type via this hardcoded mapping:

_type_mapping = MappingProxyType(
    {
        # ... other mappings ...
        datetime: 'timestamp',  # Maps to TimestampType (WITH SESSION TIME ZONE)
        # ... other mappings ...
    }
)

This causes the following issues:

Issue 1: Unintended Timezone Conversion

Spark's TimestampType is "TIMESTAMP WITH SESSION TIME ZONE". When parsing ISO-8601 UTC timestamps (e.g., "2025-01-01T10:00:00Z"):

  1. Spark correctly parses the UTC indicator (Z suffix)
  2. Stores the timestamp internally as UTC
  3. But converts it to the session timezone when displayed/accessed
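The same shift can be reproduced with the standard library alone (no Spark needed), which makes the arithmetic explicit: the instant is unchanged, but the wall-clock reading moves by the session offset.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Parse the ISO-8601 UTC timestamp ("+00:00" is equivalent to the "Z" suffix)
utc_ts = datetime.fromisoformat("2025-01-01T10:00:00+00:00")

# Conceptually what Spark's session-timezone display does:
# convert the stored UTC instant to the session zone for presentation.
berlin_ts = utc_ts.astimezone(ZoneInfo("Europe/Berlin"))

print(utc_ts)     # 2025-01-01 10:00:00+00:00
print(berlin_ts)  # 2025-01-01 11:00:00+01:00 (same instant, shifted wall clock)
```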

Example:

from datetime import datetime
from sparkdantic import SparkModel
from pyspark.sql.functions import from_json

# Assumes an active SparkSession named `spark`
# with session timezone Europe/Berlin (UTC+1 in winter)

class EventData(SparkModel):
    timestamp: datetime  # Maps to TimestampType

json_string = '{"timestamp":"2025-01-01T10:00:00Z"}'
df = spark.createDataFrame([(json_string,)], ["data"])
result_df = df.select(from_json("data", EventData.model_spark_schema()).getField("timestamp"))

row = result_df.collect()[0]
print(row[0])  # Output: 2025-01-01 11:00:00 (converted to Berlin time!)
# Expected:    2025-01-01 10:00:00 (UTC unchanged)

Issue 2: Environment-Dependent Behavior

  • Local testing: Timestamps appear in your machine's timezone
  • Databricks: Timestamps may appear in a different timezone
  • Makes tests fragile and environment-dependent

Issue 3: No Way to Opt-in to TimestampNTZ

Spark 3.4+ introduced TimestampNTZType (TIMESTAMP WITHOUT TIME ZONE) which:

  • Preserves exact timestamp values without conversion
  • Perfect for UTC timestamps that should remain unchanged
  • Ideal for multi-timezone environments

But sparkdantic provides no way to use it.

Root Cause

The hardcoded _type_mapping doesn't support Spark 3.4+ TimestampNTZType. There's no mechanism to:

  1. Map datetime fields to TimestampNTZType instead of TimestampType
  2. Specify custom Spark types per field
  3. Configure global timestamp type preference

Impact

  • Severity: High - affects any UTC timestamp handling
  • Use Cases Affected:
    • IoT/telemetry systems with UTC timestamps
    • Multi-timezone data processing
    • Systems that need exact timestamp preservation

Minimal Reproducible Example

from datetime import datetime
from sparkdantic import SparkModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json

spark = SparkSession.builder.appName("test").getOrCreate()

class TestModel(SparkModel):
    timestamp: datetime

# Create test data with UTC timestamp
json_data = '{"timestamp":"2025-01-01T10:00:00Z"}'
df = spark.createDataFrame([(json_data,)], ["data"])

# Parse using sparkdantic schema
result_df = df.select(
    from_json("data", TestModel.model_spark_schema())
)

row = result_df.collect()[0][0]
print(f"Result: {row.timestamp}")
print(f"Expected: 2025-01-01 10:00:00 (UTC)")
# In Europe/Berlin timezone:
# Actual:   2025-01-01 11:00:00 ❌

Expected Behavior

UTC timestamps should remain unchanged (10:00) regardless of session timezone.
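Until TimestampNTZ is supported, pinning the session timezone to UTC at least makes the conversion a no-op. This is standard Spark configuration, not a sparkdantic feature, and only a workaround: the type is still session-zone dependent.

```python
# Config fragment: assumes an active SparkSession named `spark`.
# With a UTC session timezone, TimestampType values display unchanged.
spark.conf.set("spark.sql.session.timeZone", "UTC")
```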

Additional Context

  • Spark 3.4+ has TimestampNTZType for this exact use case
