-
We actually do an async flush of segments, it should not block the writes. Do you have an estimate of how many requests/records per second your deployment was able to process? In our experiments, uploading 20 million vectors takes about an hour +- 30 minutes (depending on the number of machines), so 12 days looks unrealistic to me.
-
Hi,
First, thank you to the creators of this great product! I love the value proposition compared to more common solutions like FAISS, Annoy or ScaNN, where you don't have any API to help you with the daily management of your data.
I am trying to create a collection with 20 million vectors of size 512. Our vectors are stored in BigQuery.
My first naïve attempt was to create a simple script that executes the query and fetches the data batch by batch. Each batch is then uploaded directly to Qdrant with the upload_collection method of the Python Qdrant client. I made it work and it runs smoothly, but the problem is that it would take more than 12 days to complete, which is not acceptable.
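Roughly, the script looks like this (a simplified sketch; the host, table, column and collection names are placeholders, not my real ones):

```python
from google.cloud import bigquery
from qdrant_client import QdrantClient

bq = bigquery.Client()
qdrant = QdrantClient(url="http://my-qdrant-host:6333")  # placeholder host

# Fetch the vectors page by page from BigQuery.
rows = bq.query("SELECT id, embedding FROM my_dataset.my_table").result(page_size=1000)

ids, vectors = [], []
for row in rows:
    ids.append(row["id"])
    vectors.append(list(row["embedding"]))
    if len(ids) == 1000:
        # Upload the current batch to Qdrant.
        qdrant.upload_collection(collection_name="my_collection", vectors=vectors, ids=ids)
        ids, vectors = [], []

if ids:  # flush the last partial batch
    qdrant.upload_collection(collection_name="my_collection", vectors=vectors, ids=ids)
```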
So I switched to a big data tool, more precisely Dataflow (Google's managed version of Apache Beam). With the Dataflow API it is very easy to create a data pipeline that executes my query on BigQuery, fetches the data and distributes it to as many workers as I want. Each worker instantiates its own Python Qdrant client, creates batches from the distributed collection and calls upload_collection. The pipeline runs smoothly when I restrict it to only a few thousand vectors, but when I try to create the collection with the full 20M vectors I run into timeouts or 'Broken pipe' errors on the Qdrant side. I have tried with 100, 20 and 8 workers. The pipeline completes in about 3 to 6 hours, but around 30% of my queries always fail.
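In essence the pipeline looks like this (again a simplified sketch; host, table, column and collection names are placeholders, and the batch sizes are just examples):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UploadToQdrant(beam.DoFn):
    def setup(self):
        # Each worker creates its own Qdrant client once.
        from qdrant_client import QdrantClient
        self.client = QdrantClient(url="http://my-qdrant-host:6333")  # placeholder host

    def process(self, batch):
        # `batch` is a list of BigQuery rows grouped by BatchElements.
        self.client.upload_collection(
            collection_name="my_collection",
            vectors=[row["embedding"] for row in batch],
            ids=[row["id"] for row in batch],
        )


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | beam.io.ReadFromBigQuery(
            query="SELECT id, embedding FROM my_dataset.my_table",
            use_standard_sql=True,
        )
        | beam.BatchElements(min_batch_size=256, max_batch_size=1000)
        | beam.ParDo(UploadToQdrant())
    )
```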
For now, I have only used a single Qdrant node. What I think is happening is that there are too many concurrent HTTP requests; the Qdrant node is simply overwhelmed. I also noticed that the DB accepts insertion queries at a high rate during roughly the first hour, and then the rate at which it processes them drops a lot. I suspect this is related to the write-ahead log mechanism: writing to the WAL must be quite fast, but at some point the DB tries to flush to the segments, and once it starts doing so the rate at which the upload queries time out increases a lot (this is pure speculation though, I don't understand the internals well enough yet).
In any case, what would you recommend to get rid of these timeout issues?
I tried to set the timeout parameter when instantiating the Qdrant client (QdrantClient(..., timeout=Timeout(timeout=300.0))); it helped, but not enough.
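Concretely, what I did is roughly this (assuming the client forwards httpx's Timeout to the underlying HTTP client; the host is a placeholder):

```python
from httpx import Timeout
from qdrant_client import QdrantClient

# Raise the request timeout to 5 minutes to give slow upserts more room.
client = QdrantClient(url="http://my-qdrant-host:6333", timeout=Timeout(timeout=300.0))
```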
I am thinking about 3 options: