Conversation
Hi Carlo, this is a great draft! I am Ajay Boddeda, a GSoC 2026 applicant interested in the DataFrames API project. I have hands-on experience with PySpark DataFrames and I am very excited about this direction. I especially like the Row + Schema abstraction idea. Would love to contribute to this and build on your work!
Hi Carlo, I have been studying your draft code carefully. The Row class with a list of generic objects and the Schema mapping column names to types is a clean design. I have one question — for the filter() operation, are you planning to use expression-based filtering like Spark's Column expressions, or a simpler predicate approach first? I ask because in PySpark I use df.filter(df.age > 21) daily in production and I'm thinking about how to map that cleanly to Wayang's execution plan.
Hi Carlo, I noticed you pushed a new refining commit after our discussion — exciting to see the draft evolving! I cloned the Wayang repository locally and have been studying the wayang-api-scala-java structure to understand where the DataFrame API would best fit. Looking forward to seeing the updated design!
Hi Carlo, I studied the new commits carefully — this is excellent progress! I noticed you used Java Records for both Row and Schema which is exactly the direction I suggested on issue #514. The SparkSelectOperator using Dataset[Row] with functions::col is a clean implementation. |
Hi Ajay, I am glad you find the design clean. To address your question, I suggest you look at SelectOperator's comment, where you can find an exhaustive explanation; if you still have doubts, please let me know.
Hi, I am glad that you also think that using a record class might be a good choice.
Hi Carlo, thank you for the detailed responses! I read SelectOperator's comment carefully — the explanation about untyped expressions vs UDFs is very clear and makes perfect sense for the DataFrame abstraction. |
Hi Ajay, reading the last part of your message, I fear there is a misunderstanding. I am not a mentor for the project; instead, I am also a GSoC applicant interested in the project :) |
Hi Carlo, thank you for clarifying! That actually makes our discussion even more interesting — it's great to connect with another applicant who is equally passionate about this project. Your draft has been really helpful in understanding the design direction. Looking forward to seeing how this project evolves. Best of luck with your proposal!
Hi everyone.
The aim of this draft is not to provide a ready-to-use toy DF API; instead, the goal is to share my ideas regarding the API with the help of code and comments which (hopefully) help communicate the core ideas.
NOTE: THE CURRENT IMPLEMENTATION IS NOT UP TO DATE WITH THE OFFICIAL GSOC APPLICATION, WHICH IS THE AUTHORITATIVE REFERENCE (THERE IS NO NEED FOR A NEW ROW CLASS; THE EXISTING RECORD CLASS WILL BE LEVERAGED INSTEAD).
IMPLEMENTATION IN BRIEF
Currently:
Wayang users leverage the DataQuanta abstraction to build execution plans in an object-oriented style. In other words, users write strongly-typed lambda functions that operate on typed objects.
In accordance with the DataFrame paradigm, instead of writing opaque functions that operate on typed objects, users employ SQL-like declarative expressions that operate on tables with a schema.
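To make the contrast concrete, here is a minimal, self-contained Java sketch. Everything in it (the Person record, filterWithLambda, filterWithExpression) is a hypothetical illustration invented for this note, not Wayang code: the point is only that a typed lambda is opaque to the engine, while a filter expressed against a named column can be inspected and, in principle, optimized or pushed down.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: Person, filterWithLambda, and filterWithExpression
// are hypothetical names, not part of the Wayang API.
public class ParadigmContrast {
    record Person(String name, int age) {}

    // Current DataQuanta style: the engine receives an opaque, strongly-typed lambda
    // and cannot look inside the predicate.
    static List<Person> filterWithLambda(List<Person> people) {
        return people.stream().filter(p -> p.age() > 21).toList();
    }

    // DataFrame style: the predicate is expressed against a named column, so the
    // engine sees ("age", ">", threshold) as data it could inspect and optimize.
    static List<Person> filterWithExpression(List<Person> people, String column, int threshold) {
        Predicate<Person> predicate = switch (column) {
            case "age" -> p -> p.age() > threshold;
            default -> throw new IllegalArgumentException("Unknown column: " + column);
        };
        return people.stream().filter(predicate).toList();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(new Person("Ada", 36), new Person("Tim", 19));
        System.out.println(filterWithLambda(people));            // [Person[name=Ada, age=36]]
        System.out.println(filterWithExpression(people, "age", 21));
    }
}
```

Both calls return the same rows; the difference is what the engine can see about the filter before running it.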
Project key points:
Wayang will provide the DataFrame abstraction by leveraging the existing DataQuanta class. Specifically, a Wayang DataFrame will be a wrapper around a DataQuanta[Record], where Record is the existing Wayang type representing a list of objects with a schema, and can therefore provide the abstraction of a table row.
To provide a proper tabular abstraction for the user, a Schema class will be created to represent the structure of the table (column names and types, and possibly other metadata). Retrieving the schema triggers execution of the plan, so a SchemaSink and its platform-specific implementations (e.g., SparkSchemaSink) will be introduced.
The API will expose standard DataFrame operations such as projection, filtering, and aggregation. In accordance with Wayang's design, new operations will be associated with new operators along with their implementations. For operations that already have a corresponding operator (e.g., filtering), existing classes can be refactored to prefer a DataFrame backend where possible (the existing ParquetSink leverages a similar principle).
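As a rough illustration of how these pieces could fit together, here is a self-contained sketch of a schema-carrying table with projection and filtering. All names here (MiniDataFrame, Schema, select, filter) are assumptions for illustration only; in the real design the rows would be Wayang Record objects wrapped by a DataQuanta[Record], and these operations would build an execution plan rather than run eagerly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the Wayang implementation: an eager in-memory
// table that pairs rows (lists of generic objects) with a Schema record.
public class MiniDataFrame {
    // Schema: column names and their types, modeled as a Java record.
    record Schema(List<String> columns, List<Class<?>> types) {
        int indexOf(String column) { return columns.indexOf(column); }
    }

    private final Schema schema;
    private final List<List<Object>> rows; // each row: a list of generic objects

    MiniDataFrame(Schema schema, List<List<Object>> rows) {
        this.schema = schema;
        this.rows = rows;
    }

    // Projection: keep only the named columns, producing a narrower schema.
    MiniDataFrame select(String... columns) {
        List<String> names = List.of(columns);
        List<Class<?>> types =
                names.stream().map(c -> schema.types().get(schema.indexOf(c))).toList();
        List<List<Object>> projected = new ArrayList<>();
        for (List<Object> row : rows) {
            projected.add(names.stream().map(c -> row.get(schema.indexOf(c))).toList());
        }
        return new MiniDataFrame(new Schema(names, types), projected);
    }

    // Filtering: a declarative predicate over a named column; the schema stays unchanged.
    MiniDataFrame filter(String column, Predicate<Object> predicate) {
        int idx = schema.indexOf(column);
        List<List<Object>> kept =
                rows.stream().filter(r -> predicate.test(r.get(idx))).toList();
        return new MiniDataFrame(schema, kept);
    }

    List<List<Object>> collect() { return rows; }

    public static void main(String[] args) {
        Schema schema = new Schema(List.of("name", "age"), List.of(String.class, Integer.class));
        MiniDataFrame df = new MiniDataFrame(schema, List.of(
                List.<Object>of("Ada", 36),
                List.<Object>of("Tim", 19)));
        System.out.println(df.filter("age", v -> (int) v > 21).select("name").collect()); // [[Ada]]
    }
}
```

In the proposed design, filter and select would map to Wayang operators (with Spark, Java, etc. implementations) instead of running in place, but the user-facing shape of the API could look much like this.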
I hope to get some feedback to improve my proposal.