-
We actually do an async flush of segments, it should not block the writes. Do you have an estimate of how many requests/records per second your deployment was able to process? In our experiments, uploading 20 million vectors takes about an hour +- 30 minutes (depending on the number of machines), so 12 days looks unrealistic to me.
-
Hi,
First, thank you to the creators of this great product! I love the value proposition compared to more common solutions like FAISS, Annoy or ScaNN, where you don't have any API to help you with the daily management of your data.
I am trying to create a collection with 20 million vectors of size 512. Our vectors are stored in BigQuery.
My first naïve attempt was to create a simple script that executes the query and fetches the data batch by batch. Each batch is then uploaded directly to Qdrant with the upload_collection method of the Python Qdrant client. I made it work and it runs smoothly, but the problem is that it would take more than 12 days to complete, which is not acceptable.
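Roughly, the script looks like this (a simplified sketch; the host, table, column and collection names are placeholders, not my real ones):

```python
from google.cloud import bigquery
from qdrant_client import QdrantClient

bq = bigquery.Client()
qdrant = QdrantClient(url="http://my-qdrant-host:6333")  # placeholder host

# Fetch the vectors page by page from BigQuery.
rows = bq.query("SELECT id, embedding FROM my_dataset.my_table").result(page_size=1000)

ids, vectors = [], []
for row in rows:
    ids.append(row["id"])
    vectors.append(list(row["embedding"]))
    if len(ids) == 1000:
        # Upload the current batch to Qdrant.
        qdrant.upload_collection(collection_name="my_collection", vectors=vectors, ids=ids)
        ids, vectors = [], []

if ids:  # flush the last partial batch
    qdrant.upload_collection(collection_name="my_collection", vectors=vectors, ids=ids)
```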
So I switched to a big data tool, more precisely Dataflow (Google's managed version of Apache Beam). With the Dataflow API it is very easy to create a data pipeline that executes my query on BigQuery, fetches the data and distributes it to as many workers as I want. Each worker instantiates its own Python Qdrant client, creates batches from the distributed collection and calls upload_collection. The pipeline runs smoothly when I restrict it to only a few thousand vectors, but when I try to create the collection with the full 20M vectors I run into timeouts or 'Broken pipe' errors on the Qdrant side. I have tried with 100, 20 and 8 workers. The pipeline completes in about 3 to 6 hours, but around 30% of my queries always fail.
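In essence the pipeline looks like this (again a simplified sketch; host, table, column and collection names are placeholders, and the batch sizes are just examples):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UploadToQdrant(beam.DoFn):
    def setup(self):
        # Each worker creates its own Qdrant client once.
        from qdrant_client import QdrantClient
        self.client = QdrantClient(url="http://my-qdrant-host:6333")  # placeholder host

    def process(self, batch):
        # `batch` is a list of BigQuery rows grouped by BatchElements.
        self.client.upload_collection(
            collection_name="my_collection",
            vectors=[row["embedding"] for row in batch],
            ids=[row["id"] for row in batch],
        )


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | beam.io.ReadFromBigQuery(
            query="SELECT id, embedding FROM my_dataset.my_table",
            use_standard_sql=True,
        )
        | beam.BatchElements(min_batch_size=256, max_batch_size=1000)
        | beam.ParDo(UploadToQdrant())
    )
```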
For now, I have only used a single Qdrant node. What I think is happening is that there are too many concurrent HTTP requests; the Qdrant node is simply overwhelmed. I also noticed that the DB accepts insertion queries at a high rate during roughly the first hour, and then the rate at which it processes them drops a lot. I suspect this is related to the write-ahead log mechanism: writing to the WAL must be quite fast, but at some point the DB tries to flush to the segments, and once it starts doing so the rate at which the upload queries time out increases a lot (this is pure speculation though, I don't understand the internals well enough yet).
In any case, what would you recommend to get rid of these timeout issues?
I tried to set the timeout parameter when instantiating the Qdrant client (QdrantClient(..., timeout=Timeout(timeout=300.0))); it helped, but not enough.
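Concretely, what I did is roughly this (assuming the client forwards httpx's Timeout to the underlying HTTP client; the host is a placeholder):

```python
from httpx import Timeout
from qdrant_client import QdrantClient

# Raise the request timeout to 5 minutes to give slow upserts more room.
client = QdrantClient(url="http://my-qdrant-host:6333", timeout=Timeout(timeout=300.0))
```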
I am thinking about 3 options: