Datetime sampler column strips out data when generating small datasets #484
Description
Priority Level
Medium (Annoying but has workaround)
Describe the bug
DatetimeFormatMixin.postproc uses a heuristic cascade to auto-detect the output format based on the variability of sampled values. The first data-dependent branch checks series.dt.month.nunique() == 1 and, when true, returns only the year:
# data_designer/engine/sampling_gen/data_sources/base.py, lines 101-102
if series.dt.month.nunique() == 1:
    return series.apply(lambda dt: dt.year).astype(str)
With num_records=1, nunique() is always 1 regardless of the actual sampled datetime, so the output is e.g. '2026' instead of '2026-03-15T10:00:00'. This also triggers for any batch where all records happen to land in the same calendar month.
The bare year string breaks datetime.fromisoformat() and any downstream code that expects an ISO-8601 timestamp.
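To make the failure mode concrete without installing the library, here is a minimal stdlib-only sketch. The function `year_only_if_single_month` is a hypothetical stand-in that mimics only the branch quoted above on a plain list of datetimes; it is not the library's actual code.

```python
from datetime import datetime


def year_only_if_single_month(values):
    """Hypothetical stand-in for the quoted branch, using a list instead of a pandas Series."""
    months = {dt.month for dt in values}
    if len(months) == 1:
        # Same effect as the quoted branch: everything but the year is dropped.
        return [str(dt.year) for dt in values]
    return [dt.isoformat() for dt in values]


# A single record always has exactly one unique month.
single = year_only_if_single_month([datetime(2026, 3, 15, 10)])
print(single)

# The bare year is not accepted by datetime.fromisoformat().
try:
    datetime.fromisoformat(single[0])
    parsed = True
except ValueError:
    parsed = False
print(parsed)
```

The same collapse happens for any input whose datetimes all share a month, however many records there are.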
Workaround: Set convert_to="%Y-%m-%dT%H:%M:%S" on the SamplerColumnConfig to bypass the heuristic entirely.
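As a quick sanity check on the workaround's format string, the stdlib sketch below assumes that convert_to is applied as a strftime-style pattern (which the `%Y-%m-%d` syntax suggests, but which is not confirmed here). The sampled value is made up for illustration.

```python
from datetime import datetime

# Format string from the workaround; assumed to behave like a strftime pattern.
CONVERT_TO = "%Y-%m-%dT%H:%M:%S"

sampled = datetime(2026, 3, 15, 10)  # illustrative value, not real sampler output
formatted = sampled.strftime(CONVERT_TO)
roundtrip = datetime.fromisoformat(formatted)

print(formatted)
print(roundtrip == sampled)
```

Under that assumption, strings produced with this pattern parse back with datetime.fromisoformat() at second precision.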
Suggested fix: The heuristic branches in DatetimeFormatMixin.postproc (lines 101-108) are fragile for small sample sizes. Consider defaulting to ISO-8601 output when convert_to is not set, or at minimum requiring len(series) > 1 before applying the adaptive formatting.
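A rough sketch of the first option (default to ISO-8601 when convert_to is not set) might look like the following. `format_datetimes` is a hypothetical helper operating on a plain list, not the real `DatetimeFormatMixin.postproc`, and it assumes convert_to is a strftime-style pattern.

```python
from datetime import datetime


def format_datetimes(values, convert_to=None):
    """Hypothetical replacement logic: no data-dependent format detection."""
    if convert_to is not None:
        # An explicit format always wins (assumed strftime-style).
        return [dt.strftime(convert_to) for dt in values]
    # Default: full ISO-8601 strings, independent of sample size or variability.
    return [dt.isoformat() for dt in values]


one_record = format_datetimes([datetime(2026, 3, 15, 10)])
print(one_record)
print(format_datetimes([datetime(2026, 3, 15, 10)], convert_to="%Y"))
```

With this shape the output format depends only on the configuration, so a one-record preview and a large batch produce strings of the same form. The second option, guarding the adaptive branches with len(series) > 1, would still collapse batches that happen to fall within one month.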
Steps/Code to reproduce bug
Here's a small script that demonstrates the behavior:
import data_designer.config as dd
from data_designer.interface import DataDesigner

builder = dd.DataDesignerConfigBuilder()
builder.add_column(
    dd.SamplerColumnConfig(
        name="ts",
        sampler_type=dd.SamplerType.DATETIME,
        params=dd.DatetimeSamplerParams(
            start="2024-01-01",
            end="2026-06-30",
            unit="h",
        ),
    ),
)

designer = DataDesigner()
result = designer.preview(builder, num_records=1)
ts_value = result.dataset["ts"].iloc[0]
print(f"ts value: {ts_value!r}")

# With 1 record, month.nunique() == 1, so postproc returns just the year.
assert len(ts_value) == 4, f"Expected bare year, got {ts_value!r}"
print(f"\nBug confirmed: DATETIME sampler returned bare year '{ts_value}' for a single-record preview.")
print("❌ This breaks datetime.fromisoformat() and any downstream ISO-8601 parsing.")
print("\n" + "-" * 80 + "\n")

# With 2+ records that span different months, postproc falls through to isoformat.
result_multi = designer.preview(builder, num_records=10)
values = result_multi.dataset["ts"].tolist()
print(f"\nWith 10 records: {values[:3]} ...")
assert any(len(v) > 4 for v in values), "Expected full ISO strings with multiple records"
print("✅ With enough records the months vary, so postproc returns full ISO strings.")
Expected behavior
I would expect that we always return either (1) a Python datetime object or (2) an ISO-8601 formatted timestamp string. When the postproc strips away elements of the datetime string, it breaks downstream parsers.
Additional context
No response