Datetime sampler column strips out data when generating small datasets #484

@jeremyjordan

Priority Level

Medium (Annoying but has workaround)

Describe the bug

DatetimeFormatMixin.postproc uses a heuristic cascade to auto-detect the output format based on the variability of sampled values. The first data-dependent branch checks series.dt.month.nunique() == 1 and, when true, returns only the year:

# data_designer/engine/sampling_gen/data_sources/base.py, lines 101-102
if series.dt.month.nunique() == 1:
    return series.apply(lambda dt: dt.year).astype(str)

With num_records=1, nunique() is always 1 regardless of the actual sampled datetime, so the output is e.g. '2026' instead of '2026-03-15T10:00:00'. The same branch fires for any larger batch whose records all share one month number; because dt.month ignores the year, records from March 2024 and March 2025 count as a single month.
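
The trigger condition can be illustrated without the library. The sketch below is a stand-in for the pandas check, not the project's code: month_nunique is a hypothetical helper that mimics series.dt.month.nunique() over a plain list of datetimes.

```python
from datetime import datetime


def month_nunique(values):
    """Hypothetical stand-in for series.dt.month.nunique() on a list of datetimes."""
    return len({dt.month for dt in values})


single = [datetime(2026, 3, 15, 10, 0)]
same_month_number = [datetime(2024, 3, 1, 8, 0), datetime(2025, 3, 20, 17, 0)]
mixed = [datetime(2024, 1, 5, 0, 0), datetime(2025, 7, 9, 12, 0)]

print(month_nunique(single))             # one record, so one distinct month
print(month_nunique(same_month_number))  # different years, same month number
print(month_nunique(mixed))              # two distinct month numbers
```

Only the last list would get past the year-only branch.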

The bare year string breaks datetime.fromisoformat() and any downstream code that expects an ISO-8601 timestamp.
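
A quick standard-library check of the parsing failure described above. parses_as_iso is a hypothetical helper introduced only for this illustration; the two input strings are the examples from this report.

```python
from datetime import datetime


def parses_as_iso(value):
    """Return True if datetime.fromisoformat accepts the string."""
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


print(parses_as_iso("2026-03-15T10:00:00"))  # full timestamp
print(parses_as_iso("2026"))                 # bare year, as produced by the heuristic
```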

Workaround: Set convert_to="%Y-%m-%dT%H:%M:%S" on the SamplerColumnConfig to bypass the heuristic entirely.
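
As a sanity check on the workaround's format string, the sketch below confirms that it yields a string datetime.fromisoformat() can read back. It assumes convert_to is applied with strftime semantics, which the format string suggests but this report does not confirm; the sampled value is illustrative.

```python
from datetime import datetime

# Format string from the workaround; in the reproduction config it would be
# passed as convert_to=ISO_FORMAT on dd.SamplerColumnConfig (not executed here).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"

sampled = datetime(2026, 3, 15, 10, 0, 0)  # illustrative value, not real sampler output
formatted = sampled.strftime(ISO_FORMAT)   # assumption: convert_to behaves like strftime

print(formatted)
# The formatted string round-trips through the ISO parser.
assert datetime.fromisoformat(formatted) == sampled
```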

Suggested fix: The heuristic branches in DatetimeFormatMixin.postproc (lines 101-108) are fragile for small sample sizes. Consider defaulting to ISO-8601 output when convert_to is not set, or at minimum requiring len(series) > 1 before applying the adaptive formatting.
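
A minimal sketch of the first option (default to ISO-8601 unless convert_to is set). format_datetimes is a hypothetical function over a plain list, not the actual DatetimeFormatMixin.postproc signature, and it assumes convert_to is a strftime-style format string.

```python
from datetime import datetime


def format_datetimes(values, convert_to=None):
    """Hypothetical replacement for the adaptive formatting step.

    An explicit convert_to wins; otherwise every value is emitted as a full
    ISO-8601 string, regardless of how many records were sampled.
    """
    if convert_to is not None:
        return [dt.strftime(convert_to) for dt in values]
    return [dt.isoformat() for dt in values]


one_record = [datetime(2026, 3, 15, 10, 0)]
print(format_datetimes(one_record))                   # full timestamp by default
print(format_datetimes(one_record, convert_to="%Y"))  # year only when requested
```

The second option, a len(series) > 1 guard, would cover the single-record case but would still return a bare year for a multi-record batch whose values share one month number.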

Steps/Code to reproduce bug

Here's a small script that demonstrates the behavior:

import data_designer.config as dd
from data_designer.interface import DataDesigner

builder = dd.DataDesignerConfigBuilder()
builder.add_column(
    dd.SamplerColumnConfig(
        name="ts",
        sampler_type=dd.SamplerType.DATETIME,
        params=dd.DatetimeSamplerParams(
            start="2024-01-01",
            end="2026-06-30",
            unit="h",
        ),
    ),
)

designer = DataDesigner()
result = designer.preview(builder, num_records=1)

ts_value = result.dataset["ts"].iloc[0]
print(f"ts value: {ts_value!r}")

# With 1 record, month.nunique() == 1, so postproc returns just the year.
assert len(ts_value) == 4, f"Expected bare year, got {ts_value!r}"
print(f"\nBug confirmed: DATETIME sampler returned bare year '{ts_value}' for a single-record preview.")
print("❌ This breaks datetime.fromisoformat() and any downstream ISO-8601 parsing.")

print("\n" + "-" * 80 + "\n")

# With 2+ records that span different months, postproc falls through to isoformat.
result_multi = designer.preview(builder, num_records=10)
values = result_multi.dataset["ts"].tolist()
print(f"\nWith 10 records: {values[:3]} ...")
assert any(len(v) > 4 for v in values), "Expected full ISO strings with multiple records"
print("✅ With enough records the months vary, so postproc returns full ISO strings.")

Expected behavior

I would expect that we always return either (1) a Python datetime object or (2) an ISO-8601 formatted timestamp string. When the postproc strips away elements of the datetime string, it breaks downstream parsers.

Additional context

No response

Labels: bug (Something isn't working)
