
RFC: Combine Historical and Incremental Data #85

Conversation

@fuyufjh (Member) commented Jan 12, 2024

@st1page (Contributor) commented May 20, 2024

We encountered an interesting problem: what happens when users want to define a watermark on this table? If we directly apply the ideas from this RFC, the data inserted into the table by batch queries is unordered, so records are likely to be unexpectedly expired and deleted by the watermark. I can think of a few possibilities:

  1. (Completely infeasible) Define a watermark only on the Kafka source and sink it into the table. This cannot achieve the semantics users want: the table after the source erases the watermark semantics, so the watermark cannot be used downstream.
  2. Use insert ... select ... order by time asc to insert data into a table with a watermark definition, in time order (see the sketch after this list). In this case, we may need to ensure the following:
    • In the shuffle from BatchInsertExecutor to StreamDMLExecutor, we need to adopt a sort-merge or no-shuffle strategy so that the order of the insertion stream matches the output order of the batch query.
    • (Optimization) We need to implement a sorted scan from the Iceberg source (push-down of order by).
  3. Support the ALTER TABLE ... ADD WATERMARK ... syntax. With this method, users first import historical data into a table without a watermark, then add a watermark to the table, and only afterwards construct the upstream (subscribing to changes from Kafka) and the downstream (stream processing logic). The problem with this approach is that when we have a materialized view on the table, we cannot send watermarks downstream during the backfill of historical data, which is inefficient or even unacceptable for some stream operators that rely on watermarks. If users use features related to EOWC (emit-on-window-close), they may also have to accept a large burst of data being sent downstream when the first watermark arrives.
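
For concreteness, here is a minimal sketch of proposal 2. The schema and the iceberg_history source name are hypothetical; the WATERMARK clause follows RisingWave's syntax for append-only tables.

-- A table whose watermark is derived from the event-time column.
CREATE TABLE orders (
    order_id INT,
    event_time TIMESTAMP,
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) APPEND ONLY;

-- Backfill historical data in time order, so the watermark generated from
-- the insertion stream never expires rows that are still on the way.
INSERT INTO orders
SELECT order_id, event_time
FROM iceberg_history              -- hypothetical batch-readable source
ORDER BY event_time ASC;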

@fuyufjh (Member, Author) commented May 20, 2024

The 3rd proposal, ALTER TABLE ... ADD WATERMARK ..., sounds very limited. It requires that the table have no downstream streaming jobs, which in effect asks users to prepare everything before running it. It works, but it is not very friendly.

I slightly prefer the 2nd proposal, insert ... select ... [order by time asc]. In my mind, the order by clause here is not enforced. For example, if the users ingest data from a Kafka source first, e.g. a topic named historical_events, the events will be almost ordered naturally. Furthermore, if Iceberg can provide such almost-ordered reading, it's also doable. (According to the Iceberg - Flink queries docs, there may be such a method.)

@xxchan (Member) left a comment

Some new ideas by @st1page from https://risingwave-labs.slack.com/archives/C07CU2YBKCG/p1721184731055789

Since we are adding the batch read function (risingwavelabs/risingwave#17673), we can combine a batch query with a connector. Then we don't need a batch source, and there is no need for ALTER TABLE any more.

tentative syntax:

CREATE TABLE orders (
    order_id INT,
    customer_name VARCHAR,
    data JSONB,
    PRIMARY KEY (order_id, customer_name)
) INITIAL WITH SELECT * FROM file_scan(
  'parquet',
  's3',
  'ap-southeast-2',
  'xxxxxxxxxx',
  'yyyyyyyy',
  's3://your-bucket/path/to/*'
) WITH (
  connector = 'kinesis',
  stream = 'wkx-dynamo-orders',
  scan.startup.mode='earliest',
  aws.region = 'us-east-1',
  kinesis.credentials.access = 'ABCDEFG',
  kinesis.credentials.secret = 'abcdefg',
) FORMAT DYNAMODB_CDC ENCODE JSON;

@xiangjinwu: Can be achieved by pause_on_create + insert into t select + resume

@fuyufjh (Member, Author) commented Jul 18, 2024

we can combine a batch query with a connector. Then we don't need a batch source, and there is no need for ALTER TABLE any more.

Is it just CREATE TABLE AS <select-query>?

Here, taking your example, the columns in the table definition and the columns in SELECT * actually duplicate each other. If we eliminate the duplication, we end up with something like CREATE TABLE AS.
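
For illustration, here is a hypothetical CTAS-like form the earlier example could collapse into (illustrative syntax only, not implemented):

CREATE TABLE orders
WITH (
    connector = 'kinesis',
    stream = 'wkx-dynamo-orders',
    scan.startup.mode = 'earliest',
    aws.region = 'us-east-1'
    -- credentials elided
) FORMAT DYNAMODB_CDC ENCODE JSON
AS SELECT * FROM file_scan(
    'parquet',
    's3',
    'ap-southeast-2',
    'xxxxxxxxxx',
    'yyyyyyyy',
    's3://your-bucket/path/to/*'
);

The column list is inferred from SELECT *, so the table definition and the batch query no longer repeat each other.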

@xxchan (Member) commented Jul 18, 2024

Is it just CREATE TABLE AS <select-query>?

Yes, I asked the same question. 😄 @st1page feels that, for this specific need, the CTAS syntax is weird, so he wants to introduce a separate syntax.

Specifically,

  • INITIAL WITH has an order
  • We can do INITIAL WITH for a source. The CTAS syntax is reasonable, but CSAS looks weird (to him)

@fuyufjh (Member, Author) commented Jul 18, 2024

  • INITIAL WITH has an order

Order is not that important when processing historical data. In particular, with multiple parallelisms involved, ordering might be less useful to users.

  • We can do INITIAL WITH for a source. The CTAS syntax is reasonable, but CSAS looks weird (to him)

I feel CREATE SOURCE + INITIAL WITH becomes more confusing once you take backfilling into consideration.


Hmmm, overall, I feel this is not better than the idea of CREATE TABLE/SOURCE with pause_on_create = true 😀

@st1page (Contributor) commented Jul 18, 2024

  • INITIAL WITH has an order

Order is not that important when processing historical data. In particular, with multiple parallelisms involved, ordering might be less useful to users.

Hmmm, overall, I feel this is not better than the idea of CREATE TABLE/SOURCE with pause_on_create = true 😀

LGTM. In detail, we need:

  • a WITH option pause_on_create
  • a new statement RESUME table_name;

And we might need to discuss the INSERT statement on sources later. A sketch of the resulting flow follows.
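
A minimal sketch of the agreed flow, reusing the earlier example (pause_on_create and RESUME are proposals here, not existing syntax):

-- 1. Create the table with its streaming connector, but paused on creation.
CREATE TABLE orders (
    order_id INT,
    customer_name VARCHAR,
    data JSONB,
    PRIMARY KEY (order_id, customer_name)
) WITH (
    connector = 'kinesis',
    stream = 'wkx-dynamo-orders',
    scan.startup.mode = 'earliest',
    aws.region = 'us-east-1',
    pause_on_create = 'true'      -- proposed WITH option
) FORMAT DYNAMODB_CDC ENCODE JSON;

-- 2. Backfill the historical data with a batch query.
INSERT INTO orders
SELECT * FROM file_scan(
    'parquet',
    's3',
    'ap-southeast-2',
    'xxxxxxxxxx',
    'yyyyyyyy',
    's3://your-bucket/path/to/*'
);

-- 3. Start consuming incremental changes from the connector.
RESUME orders;                    -- proposed statement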
