Conversation
Hi Carlo, this is a great draft! I am Ajay Boddeda, a GSoC 2026 applicant interested in the DataFrames API project. I have hands-on experience with PySpark DataFrames and I am very excited about this direction. I especially like the Row + Schema abstraction idea. Would love to contribute to this and build on your work!
Hi Carlo, I have been studying your draft code carefully. The Row class with a list of generic objects and the Schema mapping column names to types is a clean design. I have one question — for the filter() operation, are you planning to use expression-based filtering like Spark's Column expressions, or a simpler predicate approach first? I ask because in PySpark I use df.filter(df.age > 21) daily in production and I'm thinking about how to map that cleanly to Wayang's execution plan.
Hi Carlo, I noticed you pushed a new refining commit after our discussion — exciting to see the draft evolving! I cloned the Wayang repository locally and have been studying the wayang-api-scala-java structure to understand where the DataFrame API would best fit. Looking forward to seeing the updated design!
Hi Carlo, I studied the new commits carefully — this is excellent progress! I noticed you used Java Records for both Row and Schema which is exactly the direction I suggested on issue #514. The SparkSelectOperator using Dataset[Row] with functions::col is a clean implementation. |
Hi Ajay, I am glad you find the design clean. To address your question, I suggest you look at SelectOperator's comment, where you can find an exhaustive explanation; if you still have doubts, please let me know.
Hi, I am glad that you also think that using a record class might be a good choice.
Hi Carlo, thank you for the detailed responses! I read SelectOperator's comment carefully — the explanation about untyped expressions vs UDFs is very clear and makes perfect sense for the DataFrame abstraction. |
Hi Ajay, reading the last part of your message, I fear there is a misunderstanding. I am not a mentor for the project; instead, I am also a GSoC applicant interested in the project :) |
Hi Carlo, thank you for clarifying! That actually makes our discussion even more interesting — it's great to connect with another applicant who is equally passionate about this project. Your draft has been really helpful in understanding the design direction. Looking forward to seeing how this project evolves. Best of luck with your proposal!
Hi everyone.
The aim of this draft is not to provide a ready-to-use toy DF API; instead, the goal is to share my ideas regarding the API with the help of code and comments which (hopefully) help communicate the core ideas.
NOTE: THE CURRENT IMPLEMENTATION IS NOT UP TO DATE WITH THE OFFICIAL GSOC APPLICATION, WHICH IS THE AUTHORITATIVE REFERENCE (THERE IS NO NEED FOR A NEW ROW CLASS; THE EXISTING RECORD CLASS WILL BE LEVERAGED INSTEAD).
IMPLEMENTATION IN BRIEF
Currently:
Wayang users leverage the DataQuanta abstraction to build execution plans in an object-oriented style. In other words, users write strongly-typed lambda functions that operate on typed objects.
In accordance with the DataFrame paradigm, instead of writing opaque functions that operate on typed objects, users employ SQL-like declarative expressions that operate on tables with a schema.
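To make the contrast concrete, here is a minimal, self-contained Java sketch. Everything in it (the Person record, filterWithLambda, filterWithExpression) is a hypothetical illustration invented for this note, not Wayang code: the point is only that a typed lambda is opaque to the engine, while a filter expressed against a named column can be inspected and, in principle, optimized or pushed down.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: Person, filterWithLambda, and filterWithExpression
// are hypothetical names, not part of the Wayang API.
public class ParadigmContrast {
    record Person(String name, int age) {}

    // Current DataQuanta style: the engine receives an opaque, strongly-typed lambda
    // and cannot look inside the predicate.
    static List<Person> filterWithLambda(List<Person> people) {
        return people.stream().filter(p -> p.age() > 21).toList();
    }

    // DataFrame style: the predicate is expressed against a named column, so the
    // engine sees ("age", ">", threshold) as data it could inspect and optimize.
    static List<Person> filterWithExpression(List<Person> people, String column, int threshold) {
        Predicate<Person> predicate = switch (column) {
            case "age" -> p -> p.age() > threshold;
            default -> throw new IllegalArgumentException("Unknown column: " + column);
        };
        return people.stream().filter(predicate).toList();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(new Person("Ada", 36), new Person("Tim", 19));
        System.out.println(filterWithLambda(people));            // [Person[name=Ada, age=36]]
        System.out.println(filterWithExpression(people, "age", 21));
    }
}
```

Both calls return the same rows; the difference is what the engine can see about the filter before running it.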
Project key points:
Wayang will provide the DataFrame abstraction by leveraging the existing DataQuanta class. Specifically, a Wayang DataFrame will be a wrapper around a DataQuanta[Record], where Record is the existing Wayang type representing a list of objects with a schema, and can therefore provide the abstraction of a table row.
To provide a proper tabular abstraction for the user, a Schema class will be created to represent the structure of the table (column names and types, and possibly other metadata). Retrieving the schema triggers execution of the plan, so a SchemaSink and its platform-specific implementations (e.g., SparkSchemaSink) will be introduced.
The API will expose standard DataFrame operations such as projection, filtering, and aggregation. In accordance with Wayang's design, new operations will be associated with new operators along with their implementations. For operations that already have a corresponding operator (e.g., filtering), existing classes can be refactored to prefer a DataFrame backend where possible (the existing ParquetSink leverages a similar principle).
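As a rough illustration of how these pieces could fit together, here is a self-contained sketch of a schema-carrying table with projection and filtering. All names here (MiniDataFrame, Schema, select, filter) are assumptions for illustration only; in the real design the rows would be Wayang Record objects wrapped by a DataQuanta[Record], and these operations would build an execution plan rather than run eagerly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not the Wayang implementation: an eager in-memory
// table that pairs rows (lists of generic objects) with a Schema record.
public class MiniDataFrame {
    // Schema: column names and their types, modeled as a Java record.
    record Schema(List<String> columns, List<Class<?>> types) {
        int indexOf(String column) { return columns.indexOf(column); }
    }

    private final Schema schema;
    private final List<List<Object>> rows; // each row: a list of generic objects

    MiniDataFrame(Schema schema, List<List<Object>> rows) {
        this.schema = schema;
        this.rows = rows;
    }

    // Projection: keep only the named columns, producing a narrower schema.
    MiniDataFrame select(String... columns) {
        List<String> names = List.of(columns);
        List<Class<?>> types =
                names.stream().map(c -> schema.types().get(schema.indexOf(c))).toList();
        List<List<Object>> projected = new ArrayList<>();
        for (List<Object> row : rows) {
            projected.add(names.stream().map(c -> row.get(schema.indexOf(c))).toList());
        }
        return new MiniDataFrame(new Schema(names, types), projected);
    }

    // Filtering: a declarative predicate over a named column; the schema stays unchanged.
    MiniDataFrame filter(String column, Predicate<Object> predicate) {
        int idx = schema.indexOf(column);
        List<List<Object>> kept =
                rows.stream().filter(r -> predicate.test(r.get(idx))).toList();
        return new MiniDataFrame(schema, kept);
    }

    List<List<Object>> collect() { return rows; }

    public static void main(String[] args) {
        Schema schema = new Schema(List.of("name", "age"), List.of(String.class, Integer.class));
        MiniDataFrame df = new MiniDataFrame(schema, List.of(
                List.<Object>of("Ada", 36),
                List.<Object>of("Tim", 19)));
        System.out.println(df.filter("age", v -> (int) v > 21).select("name").collect()); // [[Ada]]
    }
}
```

In the proposed design, filter and select would map to Wayang operators (with Spark, Java, etc. implementations) instead of running in place, but the user-facing shape of the API could look much like this.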
I hope to get some feedback to improve my proposal.