
Add quoted values to the test CSV and possibly some larger blocks. #42

Open
electricessence opened this issue Aug 15, 2021 · 13 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

@electricessence
Contributor

Just a suggestion, but having a column with quoted values and quote escapes would round out the tests.
I know it might break someone's code, but I realized when diving in that a 'row' is not a 'line' in the file.
A 'row' can span multiple lines if a quoted value contains a newline, so adding a newline character inside a quoted value somewhere would be a good test case.
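For example, here is an illustrative sample (not one of this repo's test files) where three physical lines form only two CSV rows, because the quoted field contains a newline and a doubled-quote escape:

// Illustrative only: three physical lines, two CSV rows.
const string sample =
    "id,comment\n" +
    "1,\"first line\n" +
    "second line with \"\"embedded quotes\"\"\"";
// A conforming RFC 4180 parser yields:
//   row 1: id | comment
//   row 2: 1  | first line\nsecond line with "embedded quotes"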

@electricessence electricessence changed the title You need to add quoted values to the test CSV and possibly some larger blocks. Add quoted values to the test CSV and possibly some larger blocks. Aug 15, 2021
@electricessence
Contributor Author

I might be looking at the wrong CSV for your tests, if so, please close.

@joelverhagen
Owner

You're right, the current tests don't include multi-line data or even quote-escaped fields. I don't remember my original motivation for this (perhaps it was just a coincidence), but I think the likelihood of diverging behavior/support across the various libraries goes up as more esoteric CSV features are used.

Aside from the performance perspective, it would be cool to test the various libraries' handling and support of some different nasty CSV files. I have been thinking about a big table with "CSV features" as columns and "CSV libraries" as rows, where each cell would be a red x or a green check. A data set we could use is https://github.com/maxogden/csv-spectrum.

Great idea :). If you're interested in adding a benchmark for these additional cases, you're more than welcome to it.

@joelverhagen joelverhagen added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Aug 17, 2021
@electricessence
Contributor Author

electricessence commented Aug 17, 2021

I think it's important to show the performance differences, because you cannot simply capture a Span<char> of an escaped sequence; you have to reconstruct it as a result. You might have one lib that smokes when nothing is escaped but gets crushed when dealing with a column that has quotes.
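To make that concrete, here is a hypothetical sketch (not taken from any of the benchmarked libraries): the unquoted path can hand back a slice of the input buffer, while the escaped path has to copy characters to collapse the doubled quotes.

// Hypothetical sketch: 'content' is a field's text with the surrounding quotes already stripped.
static ReadOnlySpan<char> GetFieldValue(ReadOnlySpan<char> content)
{
    // Fast path: no embedded quotes, so the value is just a window into the source buffer.
    if (content.IndexOf('"') < 0)
        return content;

    // Slow path: "" must be collapsed to ", which forces a copy; the throwaway string here
    // is exactly the per-field cost that shows up in a hot loop.
    return content.ToString().Replace("\"\"", "\"").AsSpan();
}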

@Kittoes0124

you cannot simply capture a Span<char> of an escaped sequence; you have to reconstruct it as a result

This is true, but one can use the ArrayPoolBufferWriter<> class to reconstruct the result and eliminate most, if not all, allocations.
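A rough sketch of how that could look, assuming the ArrayPoolBufferWriter<T> from the CommunityToolkit.HighPerformance package (formerly Microsoft.Toolkit.HighPerformance) and well-formed RFC 4180 input where embedded quotes always come doubled:

using System.Buffers; // for the IBufferWriter<T>.Write(ReadOnlySpan<T>) extension
using CommunityToolkit.HighPerformance.Buffers;

// Sketch only: unescape a quoted field into a pooled, reusable buffer instead of a new string.
static ReadOnlySpan<char> Unescape(ReadOnlySpan<char> content, ArrayPoolBufferWriter<char> writer)
{
    writer.Clear(); // the same writer is reused for every field, so steady-state allocations stay near zero
    while (!content.IsEmpty)
    {
        int quote = content.IndexOf('"');
        if (quote < 0) { writer.Write(content); break; }

        writer.Write(content[..(quote + 1)]); // copy up to and including the first quote of the "" pair
        content = content[(quote + 2)..];     // skip the second quote of the pair
    }
    return writer.WrittenSpan; // only valid until the next Clear()/reuse
}

Disposing the writer when you are done returns its rented array to the pool.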

@jzabroski

Related to this is benchmarking OpenSerDe definitions. I hate Amazon Athena / DynamoDB serialization, but it is what it is.

@electricessence
Contributor Author

@Kittoes0124 I'm curious how you are using ArrayPoolBufferWriter<char>. I'm not seeing how I would implement it if I'm already using a memory buffer. I even experimented with replacing StringBuilder for merging a sequence of characters, but StringBuilder (if reused) is fast for concatenating chars together.
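The reuse pattern I'm referring to is roughly this (illustration only):

using System.Text;

// Reusing one StringBuilder avoids constructing a new builder per field.
var sb = new StringBuilder();

string BuildField(ReadOnlySpan<char> chars)
{
    sb.Clear();           // resets the contents so the builder can be reused
    sb.Append(chars);     // StringBuilder.Append(ReadOnlySpan<char>) copies without intermediate strings
    return sb.ToString(); // a single allocation for the final value
}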

@Kittoes0124

Kittoes0124 commented Oct 1, 2021

@electricessence Like so. The idea is that, when possible, we're always yielding a window into the original buffer. When "stitching" is required, we write into the ArrayPoolBufferWriter<>, yield its buffer, clear it, and then reuse it. My implementation is not strictly compatible with CSVs as defined in the RFC; even so, neither is string.Split, and it appears in the benchmarks. ReadDelimitedAsync is intended to be used as a building block for handling the concept of delimited files more generally.

As the kids are saying these days it uh... slaps.

(benchmark result screenshots)

// Note: EnumerateAsync, DecodeAsync, and ReadDelimited2dAsync are the author's custom extensions
// (linked later in this thread), not BCL APIs.
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

[MemoryDiagnoser]
[SimpleJob(RunStrategy.Throughput)]
public class ReaderBenchmark
{
    const string InputFileName = ".../PackageAssets.csv";

    [Params(4096, 16384, 65536)]
    public int BufferSize { get; set; }

    [Benchmark(Baseline = true)]
    public async Task Simple() {
        // BufferSize = 1 disables FileStream's own buffering; the PipeReader below handles the buffering.
        using var inputStream = new FileStream(
            options: new FileStreamOptions {
                Access = FileAccess.Read,
                BufferSize = 1,
                Mode = FileMode.Open,
                Options = FileOptions.SequentialScan,
                Share = FileShare.Read,
            },
            path: InputFileName
        );

        // Drain the whole file as (column, row)-delimited cells; the empty loop body keeps the
        // measurement focused on the reader itself.
        await foreach (var line in PipeReader.Create(inputStream, new StreamPipeReaderOptions(bufferSize: BufferSize))
            .EnumerateAsync()
            .DecodeAsync()
            .ReadDelimited2dAsync(
                xDelimiter: ',',
                yDelimiter: '\n'
            )
            .WithCancellation(cancellationToken: default)
            .ConfigureAwait(continueOnCapturedContext: false)
        ) {
        }
    }
}

@Kittoes0124

Kittoes0124 commented Oct 1, 2021

Also, adding massive CSVs with degenerate properties might be beneficial. I use the latest version of the NPPES downloadable file to really stress-test my stuff because everything about it is unreasonable: it's 8 GB unzipped, it has 330 fields per row, every single field is quoted, and most of them are sparsely populated. Oh, and it's real-world data, which is a nice bonus!

@electricessence
Contributor Author

@Kittoes0124 where's this .EnumerateAsync() extension?

@Kittoes0124

Kittoes0124 commented Oct 1, 2021

Where's this .EnumerateAsync() extension?

Right here.

@electricessence
Contributor Author

Maybe we should continue this discussion somewhere else.
@Kittoes0124 Did you write those extensions or get them from somewhere else? If you wrote them, I'm impressed! I would have written them exactly the same. :) I'm confused why they're not built in! Or why there isn't a published lib for them.
If you are a fan of Channels: https://github.com/Open-NET-Libraries/Open.ChannelExtensions
Maybe there needs to be a similar extension lib for Pipelines.

That said... Pipeline overhead is not zero:
https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/

@electricessence
Contributor Author

electricessence commented Oct 3, 2021

To add insult to injury: the overhead is not zero, and it may not be great even under ideal conditions.

| Method | BufferSize | UseAsync | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StreamReader_Read | 4096 | False | 143.19 ms | 2.651 ms | 2.604 ms | 142.73 ms | 1.41 | 0.15 |
| StreamReader_ReadAsync | 4096 | False | 234.40 ms | 4.021 ms | 3.761 ms | 234.32 ms | 2.32 | 0.23 |
| PipeReader_EnumerateAsync | 4096 | False | 252.06 ms | 4.658 ms | 4.575 ms | 252.21 ms | 2.48 | 0.25 |
| FileStream_Read | 4096 | False | 105.96 ms | 3.356 ms | 9.736 ms | 104.35 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | False | 156.22 ms | 3.113 ms | 5.769 ms | 156.54 ms | 1.50 | 0.13 |
| StreamReader_Read | 4096 | True | 384.48 ms | 10.109 ms | 29.647 ms | 384.46 ms | 1.01 | 0.10 |
| StreamReader_ReadAsync | 4096 | True | 401.89 ms | 12.897 ms | 37.416 ms | 407.07 ms | 1.05 | 0.11 |
| PipeReader_EnumerateAsync | 4096 | True | 441.99 ms | 8.837 ms | 24.190 ms | 440.23 ms | 1.17 | 0.09 |
| FileStream_Read | 4096 | True | 382.73 ms | 8.951 ms | 26.110 ms | 380.21 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | True | 359.61 ms | 7.141 ms | 17.110 ms | 361.55 ms | 0.96 | 0.07 |
| StreamReader_Read | 16384 | False | 58.95 ms | 2.092 ms | 6.136 ms | 58.61 ms | 1.71 | 0.15 |
| StreamReader_ReadAsync | 16384 | False | 93.71 ms | 1.845 ms | 3.971 ms | 94.38 ms | 2.61 | 0.15 |
| PipeReader_EnumerateAsync | 16384 | False | 125.33 ms | 2.882 ms | 8.270 ms | 123.85 ms | 3.62 | 0.33 |
| FileStream_Read | 16384 | False | 34.78 ms | 0.764 ms | 2.180 ms | 34.41 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | False | 49.56 ms | 1.313 ms | 3.681 ms | 48.65 ms | 1.43 | 0.12 |
| StreamReader_Read | 16384 | True | 88.48 ms | 1.758 ms | 4.693 ms | 88.05 ms | 1.18 | 0.11 |
| StreamReader_ReadAsync | 16384 | True | 139.60 ms | 2.732 ms | 4.411 ms | 140.68 ms | 1.85 | 0.17 |
| PipeReader_EnumerateAsync | 16384 | True | 171.27 ms | 3.242 ms | 5.763 ms | 172.72 ms | 2.26 | 0.17 |
| FileStream_Read | 16384 | True | 75.93 ms | 2.369 ms | 6.603 ms | 73.44 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | True | 101.14 ms | 1.927 ms | 4.106 ms | 102.07 ms | 1.35 | 0.10 |

@Kittoes0124

Did you write those extensions or get them from somewhere else? [...] Maybe there needs to be a similar extension lib for Pipelines. That said... Pipeline overhead is not zero: https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/

Thanks mate, that means a lot, as I did indeed write them myself! I feel the same way about their lack of inclusion in the base lib. I'll continue our discussion in your Reddit post.
