
Implement JSON type support #330


Merged: 16 commits, Jun 3, 2025

Conversation

wudidapaopao
Contributor

@wudidapaopao wudidapaopao commented May 22, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PR introduces support for mapping Python objects to the ClickHouse JSON type. The mapping rule depends on the data source:

  • Pandas DataFrame: Columns of object type are automatically sampled. If all sampled values are of type dict, the column is mapped to JSON type.

  • Python Dict: If the first row of a column is a dict, the column is mapped to JSON type.

  • PyArrow Table: If a column is of struct type in PyArrow, it will be mapped to JSON type.

  • Custom PyReader: Users can explicitly specify JSON as the schema type for a given column; the column is then treated as JSON.

  • Numpy: JSON type is currently not supported for Numpy arrays.

  • Output Formats: When using output formats such as Arrow, Protobuf, or Parquet, JSON type is temporarily disabled due to ClickHouse limitations.
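
The detection rules for DataFrame and dict inputs listed above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the helper names (`sample_values`, `infer_object_column`, `infer_dict_column`), the `sample_size` default, the evenly spaced sampling strategy, and the `"String"` fallback are all assumptions made for the example.

```python
from typing import Any, Sequence


def sample_values(values: Sequence[Any], sample_size: int = 10) -> list:
    """Pick up to sample_size evenly spaced values (hypothetical strategy)."""
    if not values:
        return []
    step = max(1, len(values) // sample_size)
    return list(values[::step])[:sample_size]


def infer_object_column(values: Sequence[Any], sample_size: int = 10) -> str:
    """DataFrame-style rule: JSON only if every sampled value is a dict."""
    sampled = sample_values(values, sample_size)
    if sampled and all(isinstance(v, dict) for v in sampled):
        return "JSON"
    return "String"  # assumed fallback, not taken from the PR


def infer_dict_column(values: Sequence[Any]) -> str:
    """Python-dict rule: only the first row decides the column type."""
    if values and isinstance(values[0], dict):
        return "JSON"
    return "String"  # assumed fallback, not taken from the PR


if __name__ == "__main__":
    mixed = [{"a": 1}, "plain text"]
    print(infer_object_column(mixed))  # sampled rule sees the non-dict value
    print(infer_dict_column(mixed))    # first-row rule sees only the dict
```

The two rules can disagree on a mixed column: the sampled rule rejects it as soon as one sampled value is not a dict, while the first-row rule looks only at row 0.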

Additionally, this PR supports SQL queries that involve multiple Python objects within the same query.

The detection of Pandas DataFrame objects is now done via an import-based check, improving compatibility and reliability.
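
One plausible reading of an "import-based check" is to import pandas and test the object with `isinstance`, rather than comparing class or module names as strings. The sketch below shows that idea only; the function name `is_pandas_dataframe` is hypothetical, and the check in this PR may be implemented differently.

```python
import importlib
from typing import Any


def is_pandas_dataframe(obj: Any) -> bool:
    """Return True only if pandas is importable and obj is a pandas.DataFrame."""
    try:
        pandas = importlib.import_module("pandas")
    except ImportError:
        # Without pandas installed, no object can be a DataFrame.
        return False
    return isinstance(obj, pandas.DataFrame)


if __name__ == "__main__":
    print(is_pandas_dataframe({"a": [1, 2, 3]}))  # a plain dict is not a DataFrame
```

Because the check relies on the real class object, it also accepts DataFrame subclasses and is not fooled by unrelated classes that happen to be named `DataFrame`.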

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings

NOTE: If you merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Run these jobs only (required builds will be added automatically):

  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with aarch64
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • All with Azure
  • Add your option here

Deny these jobs:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@wudidapaopao wudidapaopao marked this pull request as draft May 22, 2025 02:01
…Frame, dict, and PyReader

- Implemented support for querying JSON columns across various data sources including pyarrow Table, DataFrame, dictionary, and PyReader.
- Added corresponding test cases to validate the querying functionality for each data type.
- Enhanced the display format of run_all.py.
@wudidapaopao wudidapaopao changed the title [WIP] Implement JSON type support Implement JSON type support May 26, 2025
@wudidapaopao wudidapaopao marked this pull request as ready for review May 26, 2025 20:05
@wudidapaopao wudidapaopao requested a review from auxten May 26, 2025 20:18
@auxten auxten merged commit ffc395f into chdb-io:main Jun 3, 2025
14 checks passed
2 participants