Enhancement: Ability to specify which bricks to read from in an EC setup #4494
@xhernandez What do you think of this approach? @sbk173 is one of my colleagues here at PhonePe. It definitely reduces the number of combinations we need to consider to validate that a file is okay.
We will post the exact changes we plan to make if you are fine with the approach.
We already have something similar from commit 8c68787. It requires two different mounts, but I think that's good enough for a data integrity check process. IMO, reading an xattr before each read is an unnecessary overhead that will increase latency significantly.

Another thing to consider is how to deal with data fragments already marked as bad by Gluster. This can cause both reads to fail even though the data is still accessible from another combination of bricks. Example (G = Good, B = Bad):

Bricks: GGGGGGBGGGGG
Reading from the first 8 will fail: GGGGGGBG

This means that using just two combinations is not enough in all cases. In the worst case (i.e. 4 bad bricks), only 1 of the 495 possible combinations will work. This also applies even if Gluster has not marked the bricks as bad but the data has been corrupted: to find the good data you may need to try all 495 combinations in the worst case.

If you want a more efficient solution and you are comfortable with linear algebra, you can look at the Berlekamp-Welch algorithm (though there are others) to efficiently identify the bad data from a single read of all bricks at once. Some algorithms have requirements that the EC implementation doesn't satisfy, but it has been a long time since I checked and I don't remember the details (I would say Berlekamp-Welch should be applicable, but no guarantees), so you will need to make sure that the algorithm you choose is really applicable to EC. Another issue to solve is the way data is encoded in EC, which will require pre-processing and post-processing to apply any of these algorithms (or transforming the algorithm to work with the current encoding; that will be hard, but the result will be much faster).
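To make the combinatorics concrete, here is a small illustrative Python sketch (not GlusterFS code; the 8+4 layout and brick indices are assumptions for the example) counting how many of the 495 possible 8-brick subsets remain readable for a given set of bad bricks:

```python
from itertools import combinations
from math import comb

DATA, PARITY = 8, 4
BRICKS = DATA + PARITY  # 12 bricks in an 8+4 setup

def readable_subsets(bad_bricks):
    """Return all 8-brick subsets that avoid every bad brick.

    Any 8 healthy fragments suffice to reconstruct the data in an
    8+4 layout, so a subset is readable iff it contains no bad brick.
    """
    good = set(range(BRICKS)) - set(bad_bricks)
    return [s for s in combinations(range(BRICKS), DATA) if set(s) <= good]

print(comb(BRICKS, DATA))                   # 495 possible subsets in total
print(len(readable_subsets({6})))           # 1 bad brick: 165 subsets still work
print(len(readable_subsets({0, 1, 2, 3})))  # 4 bad bricks: exactly 1 subset works

# The two fixed subsets from the proposal both contain brick 6, so with
# GGGGGGBGGGGG both reads fail despite the data being recoverable.
first8, last8 = set(range(8)), set(range(4, 12))
print({6} & first8, {6} & last8)
```

With the GGGGGGBGGGGG example above, brick 6 sits in both fixed subsets, which is exactly why the two-combination scheme can fail while 165 other subsets would still succeed.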
Correct, @xhernandez. We need to do this in an automated manner, per file, for all new files being uploaded during the async upload workflow. It is totally fine to fail the upload; it will be retried later. Hence we are trying to change the granularity of this feature so that it can be applied per file. Using two mounts would increase the memory requirements of the instances we use, so we are trying to optimise it this way so that it can be done on a single mount.
Another option is to store this read_mask in memory per inode and apply it during reads, just as we already do with ec->read_mask. This would be done using a virtual internal setxattr that travels down to the EC xlator. We have the control to stop this workflow during add/replace-brick, so the following should also work.

Unfortunately FUSE doesn't expose fsetxattr, so we have to open the file to keep the inode in memory and then issue the setxattr on the path; the open fd makes sure the inode stays in memory while the mask is applied.
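A minimal client-side sketch of that flow, assuming a hypothetical virtual xattr key (the actual key recognised by the EC xlator would be part of this change and does not exist today):

```python
import os

# Hypothetical virtual xattr name; the real key the EC xlator would
# recognise is part of the proposed change, not an existing interface.
EC_READ_MASK_XATTR = "glusterfs.ec.read-mask"

def read_with_mask(path, mask, chunk=1 << 20):
    """Sketch: pin the inode with an open fd, set a per-inode read mask
    via a virtual setxattr, then read the whole file back."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # FUSE has no fsetxattr, so the xattr is set on the path; the
        # open fd keeps the inode (and the in-memory mask) alive.
        os.setxattr(path, EC_READ_MASK_XATTR, mask.to_bytes(2, "little"))
        chunks = []
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        os.close(fd)
```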
The problem statement we are trying to solve is to verify that the data we write to EC is the same data we uploaded. At verification time we want the read to fail if the data is unavailable on even a single brick. (This can be relaxed based on the read_mask the user sends; for our purposes we will start with the first 8 bricks and the last 8 bricks, and if we want to relax it we can, for example, change the read_mask to consider the first 10 and the last 10 in an 8+4 setup.) Once the verification is marked successful, normal EC rules apply in terms of redundancy. This is needed for end-to-end integrity. The data itself is stored on ZFS, which has its own integrity checks, so for now the scope is only up to GlusterFS. We are not going with bit-rot detection because we use ZFS.
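As an illustration, the two masks for an 8+4 setup might be expressed as brick bitmaps like this (the bit-per-brick encoding is an assumption for the example, not EC's actual read_mask wire format):

```python
DATA, PARITY = 8, 4
BRICKS = DATA + PARITY

def brick_mask(bricks):
    """Build a bitmap with one bit per brick index."""
    mask = 0
    for b in bricks:
        mask |= 1 << b
    return mask

first8 = brick_mask(range(0, 8))   # 0x0FF: bricks 0-7
last8  = brick_mask(range(4, 12))  # 0xFF0: bricks 4-11

# Every brick is covered by at least one of the two masks, so if both
# checksummed reads succeed and match, every fragment has been exercised.
assert first8 | last8 == (1 << BRICKS) - 1
```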
Hi,
I want to add functionality that allows clients to verify that, in an erasure-coded setup, the data present on all bricks is correct.
To achieve this, GlusterFS needs functionality that allows clients to specify which bricks to read from during a read operation. With that, we can perform two read operations and compare the data from each read against the original data to detect corruption. For example, in an 8+4 erasure-coded setup, we can store the MD5 checksum of the file before writing it to GlusterFS, then perform a read using the first 8 bricks and compute the checksum of the data obtained. In the next read operation we use the last 8 bricks and compute the checksum again. If both checksums match the original, the data on all bricks is correct; if not, we know that the data on some brick is corrupt.
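A minimal sketch of that verification flow, assuming a helper like the hypothetical read_with_mask sketched earlier that can direct a read at a chosen brick subset:

```python
import hashlib

def verify_file(path, expected_md5, first8_mask=0x0FF, last8_mask=0xFF0):
    """Sketch: read the file through two overlapping 8-brick subsets
    that together cover all 12 bricks, and compare each read's checksum
    against the checksum recorded at upload time.

    read_with_mask is the hypothetical helper sketched earlier; it does
    not exist in GlusterFS today.
    """
    for mask in (first8_mask, last8_mask):
        data = read_with_mask(path, mask)
        if hashlib.md5(data).hexdigest() != expected_md5:
            return False  # some brick in this subset holds bad data
    return True  # both subsets reproduce the original checksum
```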
To implement this, one possible approach would be to store an extended attribute on disk, fetch it before a read call is made, and use it to specify which bricks are to be used in the read operation.
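A hedged sketch of that on-disk variant, using an illustrative user-namespace xattr to persist the mask (the attribute name and encoding are assumptions; in GlusterFS the lookup before each read would happen inside the EC xlator, which is the latency concern raised above):

```python
import os

MASK_XATTR = "user.ec.read-mask"  # illustrative name, not an existing key

def store_read_mask(path, mask):
    """Persist the brick mask as an on-disk extended attribute."""
    os.setxattr(path, MASK_XATTR, mask.to_bytes(2, "little"))

def fetch_read_mask(path, default=None):
    """Fetch the stored mask before issuing the read, if present."""
    try:
        return int.from_bytes(os.getxattr(path, MASK_XATTR), "little")
    except OSError:
        return default
```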