@blowekamp
Member

PR Checklist

  • No API changes were made (or the changes have been approved)
  • No major design changes were made (or the changes have been approved)
  • Added test (or behavior not changed)
  • Updated API documentation (or API not changed)
  • Added license to new files (if any)
  • Added Python wrapping to new files (if any) as described in ITK Software Guide Section 9.5
  • Added ITK examples for all new major features (if any)

Refer to the ITK Software Guide for
further development details if necessary.

Comparing performance when using the cast image filter to convert
between different vector pixel types.

There are two options to compare:
  - Using ImageRegionRange appears to have performance benefits
  compared to the ImageScanlineIterators.
  - Using NumericTraits::GetLength is nearly a constant/compile-time
  expression, which is also performant. The challenge is choosing the input
  vs. output pixel type when only one has a compile-time size.
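
The choice between input and output pixel type described above can be illustrated with a minimal sketch. It assumes `std::array` as a stand-in for a fixed-size pixel and `std::vector` as a stand-in for a variable-length pixel; `StaticLength`, `componentCount`, and `castPixel` are hypothetical names for illustration, not ITK API.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical trait: compile-time component count, or 0 if only known at run time.
template <typename TPixel>
struct StaticLength
{
  static constexpr std::size_t value = 0;
};

template <typename T, std::size_t N>
struct StaticLength<std::array<T, N>>
{
  static constexpr std::size_t value = N;
};

// Prefer whichever pixel type has a compile-time size; otherwise use the run-time size.
template <typename TIn, typename TOut>
std::size_t
componentCount([[maybe_unused]] const TIn & in)
{
  if constexpr (StaticLength<TOut>::value != 0)
  {
    return StaticLength<TOut>::value;
  }
  else if constexpr (StaticLength<TIn>::value != 0)
  {
    return StaticLength<TIn>::value;
  }
  else
  {
    return in.size();
  }
}

// Component-wise cast from one pixel type to another.
template <typename TIn, typename TOut>
void
castPixel(const TIn & in, TOut & out)
{
  const std::size_t n = componentCount<TIn, TOut>(in);
  for (std::size_t i = 0; i < n; ++i)
  {
    out[i] = static_cast<typename TOut::value_type>(in[i]);
  }
}
```

When either side is fixed-size, the count is a constant the compiler can usually fold after inlining; only the case where both sides are dynamic has to query the size at run time.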
@blowekamp blowekamp requested a review from N-Dekker December 1, 2025 19:40
@blowekamp blowekamp changed the title from "Performance comparison of iterators to copy vector images" to "Performance comparison of iterators and ranges to copy vector images" Dec 1, 2025
@github-actions github-actions bot added the labels type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots), type:Testing (Ensure that the purpose of a class is met/the results on a wide set of test cases are correct), and area:Filtering (Issues affecting the Filtering module) Dec 1, 2025
@blowekamp
Member Author

Here is sample timing from my system in release mode:

ITK Image Iteration Performance Test
=====================================

--- Test Suite 1: Image<Vector<float,3>> to Image<RGBPixel<double>> ---

=== Medium Image (256^3) ===
Image size: 256x256x256
Scanline Iterator:     242.631 Mpixels/sec (0.0691471 seconds)
ImageRegionRange:      415.586 Mpixels/sec (0.04037 seconds)
Scanline NumericTraits:336.575 Mpixels/sec (0.0498469 seconds)
Range NumericTraits:   1298.24 Mpixels/sec (0.012923 seconds)
Results match:         YES

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     245.42 Mpixels/sec (0.546889 seconds)
ImageRegionRange:      434.034 Mpixels/sec (0.309233 seconds)
Scanline NumericTraits:358.434 Mpixels/sec (0.374456 seconds)
Range NumericTraits:   1319.51 Mpixels/sec (0.101718 seconds)
Results match:         YES

--- Test Suite 2: VectorImage<float> to Image<Vector<double,3>> ---

=== Medium Image (256^3) ===
Image size: 256x256x256
Scanline Iterator:     310.057 Mpixels/sec (0.0541101 seconds)
ImageRegionRange:      488.19 Mpixels/sec (0.0343661 seconds)
Scanline NumericTraits:346.702 Mpixels/sec (0.0483909 seconds)
Range NumericTraits:   1103.18 Mpixels/sec (0.015208 seconds)
Results match:         YES

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     280.592 Mpixels/sec (0.478337 seconds)
ImageRegionRange:      411.867 Mpixels/sec (0.325876 seconds)
Scanline NumericTraits:324.574 Mpixels/sec (0.413519 seconds)
Range NumericTraits:   803.358 Mpixels/sec (0.167071 seconds)
Results match:         YES

--- Test Suite 3: Image<Vector<float,3>> to VectorImage<double> ---

=== Medium Image (256^3) ===
Image size: 256x256x256
Scanline Iterator:     432.469 Mpixels/sec (0.038794 seconds)
ImageRegionRange:      39.9437 Mpixels/sec (0.420021 seconds)
Scanline NumericTraits:433.308 Mpixels/sec (0.0387189 seconds)
Range NumericTraits:   415.719 Mpixels/sec (0.0403571 seconds)
Results match:         YES

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     363.675 Mpixels/sec (0.36906 seconds)
ImageRegionRange:      38.9987 Mpixels/sec (3.4416 seconds)
Scanline NumericTraits:377.825 Mpixels/sec (0.355238 seconds)
Range NumericTraits:   316.502 Mpixels/sec (0.424066 seconds)
Results match:         YES

--- Test Suite 4: VectorImage<float> to VectorImage<double> ---

=== Medium Image (256^3) ===
Image size: 256x256x256
Scanline Iterator:     415.938 Mpixels/sec (0.0403359 seconds)
ImageRegionRange:      39.5075 Mpixels/sec (0.424659 seconds)
Scanline NumericTraits:417.581 Mpixels/sec (0.0401771 seconds)
Range NumericTraits:   614.708 Mpixels/sec (0.027293 seconds)
Results match:         YES

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     458.723 Mpixels/sec (0.29259 seconds)
ImageRegionRange:      38.4984 Mpixels/sec (3.48632 seconds)
Scanline NumericTraits:452.832 Mpixels/sec (0.296396 seconds)
Range NumericTraits:   522.439 Mpixels/sec (0.256906 seconds)
Results match:         YES

@blowekamp blowekamp force-pushed the cast_image_region_range branch from f530e9c to 236cb22 Compare December 1, 2025 21:52
@N-Dekker
Contributor

N-Dekker commented Dec 2, 2025

Thanks @blowekamp, interesting, I'll have a look!

In general, I wonder, should a performance benchmark be part of the regular unit test set? Or should it have its own project? Specifically, should your benchmark be a part of https://github.com/InsightSoftwareConsortium/ITKPerformanceBenchmarking ?

If your benchmark would stay here in InsightSoftwareConsortium/ITK I would suggest to make it a GoogleTest! (Although maybe a benchmark suite like https://github.com/google/benchmark would even be nicer.)

const unsigned int componentsPerPixel = NumericTraits<OutputPixelType>::GetLength(outputPixel);
while (inputIt != inputEnd)
{
  InputPixelType inputPixel = *inputIt;
Contributor

Why not const InputPixelType & inputPixel? (The original code also has const InputPixelType & inputPixel.)

Member Author

Yes. Your reference code times better on my system.

while (inputIt != inputEnd)
{
  InputPixelType inputPixel = *inputIt;
  OutputPixelType outputPixel(componentsPerPixel);
Contributor

OutputPixelType outputPixel(componentsPerPixel) is expensive for VariableLengthVector. Please consider moving this declaration outside the while (inputIt != inputEnd) loop (just like you proposed for CastImageFilter).

Contributor

@N-Dekker N-Dekker Dec 2, 2025

I just ran the last part of your benchmark:

--- Test Suite 4: VectorImage<float> to VectorImage<double> ---

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     243.729 Mpixels/sec (0.550685 seconds)
ImageRegionRange:      35.8598 Mpixels/sec (3.74284 seconds)
Scanline NumericTraits:245.443 Mpixels/sec (0.546839 seconds)
Range NumericTraits:   290.972 Mpixels/sec (0.461273 seconds)
Results match:         YES

That looks very bad for ImageRegionRange. However, after moving the declaration of outputPixel outside the while (inputIt != inputEnd) loop, it said:

=== Large Image (512^3) ===
Image size: 512x512x512
Scanline Iterator:     246.817 Mpixels/sec (0.543794 seconds)
ImageRegionRange:      175.812 Mpixels/sec (0.763417 seconds)
Scanline NumericTraits:246.84 Mpixels/sec (0.543744 seconds)
Range NumericTraits:   293.25 Mpixels/sec (0.45769 seconds)
Results match:         YES

So now ImageRegionRange still doesn't seem as fast as Scanline Iterator, but it's only 0.76 versus 0.54 seconds now.
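
The effect of moving the declaration out of the loop can be sketched without ITK. The example below assumes a hypothetical `DynamicPixel` that allocates on construction, as a stand-in for a variable-length pixel; it only counts allocations and says nothing about actual timings.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a variable-length pixel: constructing one with a length allocates.
struct DynamicPixel
{
  static inline std::size_t allocationCount = 0; // counts constructions, for illustration
  std::vector<double>       data;

  explicit DynamicPixel(std::size_t length)
    : data(length)
  {
    ++allocationCount;
  }
};

// Casts an interleaved float buffer, constructing a fresh pixel in every iteration.
void
castAllocatingPerPixel(const std::vector<float> & in, std::vector<double> & out, std::size_t length)
{
  for (std::size_t p = 0; p < in.size() / length; ++p)
  {
    DynamicPixel pixel(length); // one allocation per pixel
    for (std::size_t c = 0; c < length; ++c)
    {
      pixel.data[c] = static_cast<double>(in[p * length + c]);
      out[p * length + c] = pixel.data[c];
    }
  }
}

// Same cast, but the pixel is constructed once and reused for every iteration.
void
castAllocatingOnce(const std::vector<float> & in, std::vector<double> & out, std::size_t length)
{
  DynamicPixel pixel(length); // single allocation, hoisted out of the loop
  for (std::size_t p = 0; p < in.size() / length; ++p)
  {
    for (std::size_t c = 0; c < length; ++c)
    {
      pixel.data[c] = static_cast<double>(in[p * length + c]);
      out[p * length + c] = pixel.data[c];
    }
  }
}
```

Both functions produce the same output; the second performs one allocation instead of one per pixel.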

Member Author

Yes, it is expensive. In my posted results, it's the reason why it's 10x slower when the output is a VectorImage.

Interestingly, when OutputPixelType outputPixel(componentsPerPixel) is outside the loop it's ~2x slower. And if it's OutputPixelType outputPixel{ *outputIt };, the same issue of unexpectedly modified output reported in #5668 occurs.

Comment on lines 331 to 333
std::cout << "CTEST_FULL_OUTPUT" << std::endl;
std::cout << "ITK Image Iteration Performance Test" << std::endl;
std::cout << "=====================================" << std::endl;
Contributor

If you want this to become a regular unit test, please limit the amount of output! The ITK tests already produce an excessive amount of output, in my opinion.

Member Author

The additional full output is intentional so that the performance results from the CI can be seen.

I don't know what the long-term utility of this performance comparison will be. Currently some systems ran out of memory, so a couple of tweaks are needed for that: likely taking the image size from the command line and reducing the number of images held in memory.

Comment on lines +335 to +369
// Test different image sizes and types
constexpr unsigned int Dimension = 3;
Contributor

Please consider testing 2D as well. Maybe do one small 2D and 1 large 3D.

const auto inputEnd = inputRange.end();


OutputPixelType outputPixel{ *outputIt };
Contributor

@N-Dekker N-Dekker Dec 2, 2025

At this point, the value of *outputIt may not yet be initialized, right? The output pixel values may still be uninitialized at this point. If so, better just declare OutputPixelType outputPixel; without retrieving the value from *outputIt.

Member Author

You are correct that the values in outputPixel are uninitialized. However, for the VectorImage cast, the outputPixel will point to the data in the output Image's buffer. So this removes the need to allocate memory and reduces the memory locations used (at least for variable-length vectors).

Member Author

This is also the "BUG" case where the reference to the first pixel is maintained and will be assigned all values in the range.
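
The aliasing described in this thread can be modeled with a small standalone sketch. It assumes a hypothetical non-owning `PixelView` in place of a pixel that points into the image buffer; it is only a model of the behavior discussed here, not ITK's actual VariableLengthVector implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning pixel: views `length` components inside some storage.
struct PixelView
{
  double *    data;
  std::size_t length;
};

// Copies the viewed components into pixel `p` of the interleaved buffer.
void
storePixel(std::vector<double> & buffer, std::size_t p, const PixelView & pixel)
{
  for (std::size_t c = 0; c < pixel.length; ++c)
  {
    buffer[p * pixel.length + c] = pixel.data[c];
  }
}

// Buggy pattern: the view created before the loop stays bound to pixel 0,
// so every iteration also overwrites pixel 0 with the current value.
void
fillBuggy(std::vector<double> & buffer, std::size_t length)
{
  PixelView pixel{ buffer.data(), length }; // analogous to outputPixel{ *outputIt }
  for (std::size_t p = 0; p < buffer.size() / length; ++p)
  {
    for (std::size_t c = 0; c < length; ++c)
    {
      pixel.data[c] = static_cast<double>(p + 1); // lands in pixel 0 of the buffer
    }
    storePixel(buffer, p, pixel);
  }
}

// Safe pattern: the temporary pixel views its own storage, allocated once.
void
fillCorrect(std::vector<double> & buffer, std::size_t length)
{
  std::vector<double> storage(length);
  PixelView           pixel{ storage.data(), length };
  for (std::size_t p = 0; p < buffer.size() / length; ++p)
  {
    for (std::size_t c = 0; c < length; ++c)
    {
      pixel.data[c] = static_cast<double>(p + 1);
    }
    storePixel(buffer, p, pixel);
  }
}
```

In the buggy variant the first pixel ends up holding the last value written, while the safe variant pays for one allocation outside the loop.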


Are you implying that #5661 didn't fix a bug? To be honest, the time it took me to understand the origin of this bug is a clear indication to me that this is not the expected behavior of the Get function of an iterator. And I guess this is not consistent with how it behaves with scalar pixels.

Member Author

The merged patch fixed the issue in the cast filter. But this new code block with the ImageRangeIterator exhibits the same behavior as the original CastImageFilter.

Some tricks have been needed to get operations on the VectorImageFilter to perform reasonably. But they have been this way for over 10 years, and crippling filters by making them run 10x slower may not be a reasonable option.


I don't think it's reasonable either, and I don't have the solution. But that does not mean it's not a bug. BTW, if you implement the (slow) version, a recursive Gaussian test fails, which probably indicates a bug there too.

@blowekamp blowekamp force-pushed the cast_image_region_range branch from 236cb22 to a720403 Compare December 2, 2025 14:53
@blowekamp
Member Author

I think the Benchmark repo may be a better place for this.

I updated the benchmark with many of your changes. Here are my current results:


--- Test Suite 1: Image<Vector<float,3>> to Image<RGBPixel<double>> ---

=== Image Size: [512, 512, 512] ===
Scanline Iterator:     248.675 Mpixels/sec (0.539732 seconds)
ImageRegionRange:      512.322 Mpixels/sec (0.261979 seconds)
Scanline NumericTraits:348.479 Mpixels/sec (0.385153 seconds)
Range NumericTraits:   1259.01 Mpixels/sec (0.106606 seconds)
Range NT (BUGGY):      1311.76 Mpixels/sec (0.102319 seconds)
All methods validated.

--- Test Suite 2: VectorImage<float> to Image<Vector<double,3>> ---

=== Image Size: [512, 512, 512] ===
Scanline Iterator:     278.474 Mpixels/sec (0.481976 seconds)
ImageRegionRange:      380.225 Mpixels/sec (0.352995 seconds)
Scanline NumericTraits:322.343 Mpixels/sec (0.416382 seconds)
Range NumericTraits:   785.763 Mpixels/sec (0.170812 seconds)
Range NT (BUGGY):      831.41 Mpixels/sec (0.161434 seconds)
All methods validated.

--- Test Suite 3: Image<Vector<float,3>> to VectorImage<double> ---

=== Image Size: [512, 512, 512] ===
Scanline Iterator:     366.2 Mpixels/sec (0.366515 seconds)
ImageRegionRange:      503.896 Mpixels/sec (0.26636 seconds)
Scanline NumericTraits:358.11 Mpixels/sec (0.374795 seconds)
Range NumericTraits:   493.767 Mpixels/sec (0.271824 seconds)
Range NT (BUGGY):      241.063 Mpixels/sec (0.556774 seconds)
  WARNING: Results differ from reference!
All methods validated.

--- Test Suite 4: VectorImage<float> to VectorImage<double> ---

=== Image Size: [512, 512, 512] ===
Scanline Iterator:     434.072 Mpixels/sec (0.309206 seconds)
ImageRegionRange:      441.136 Mpixels/sec (0.304255 seconds)
Scanline NumericTraits:421.92 Mpixels/sec (0.318112 seconds)
Range NumericTraits:   448.332 Mpixels/sec (0.299371 seconds)
Range NT (BUGGY):      240.716 Mpixels/sec (0.557576 seconds)
  WARNING: Results differ from reference!
All methods validated.

=====================================
All tests completed successfully.
  • With the changes, it looks like the RangeIterator with the constexpr NumericTraits is the best performer.
  • The issue "Undefined behavior of iterators Get for VariableLengthVector" (#5668) also occurs with the Range iterators. Is this a bug or a feature?
  • The best performance relies on the "reference" returned by both iterators to prevent extra copies (2x slower for one dynamic allocation outside the loop, 10x slower for a copy with dynamic allocation inside the loop).

Member

@dzenanz dzenanz left a comment

As image region range seems more performant than scanline iterator, do you mind issuing a separate PR which updates the cast image filter with the refactoring?

@blowekamp
Member Author

As image region range seems more performant than scanline iterator, do you mind issuing a separate PR which updates the cast image filter with the refactoring?

We have determined which is the most performant; however, it relies on the behavior of the unexpected-assignment bug to gain that performance. The suggestion (in that issue) of making the return value "const PixelType" would change the performance characteristics in these examples, since it would cause memory allocation for the vector pixel types.

How should we classify the reference returned for VectorImages by the Iterators and Range iterators?

@dzenanz
Member

dzenanz commented Dec 2, 2025

The Value() should provide modifiable reference, Get() should provide const reference (preferable) if possible, otherwise a copy. An example where choices had to be made is RLEImage. See:
https://github.com/KitwareMedical/ITKRLEImage/blob/d7a59ec1bb2edfd55d96cb664b73e7b1e29ed410/include/itkRLEImageConstIterator.h#L314-L330
https://github.com/KitwareMedical/ITKRLEImage/blob/d7a59ec1bb2edfd55d96cb664b73e7b1e29ed410/include/itkRLEImageIterator.h#L87-L106

@dzenanz
Member

dzenanz commented Dec 2, 2025

Correspondingly, CastImageFilter should prefer the more performant Value() method, if we need to degrade the performance of Get() to avoid the buggy behavior.
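
The accessor contract proposed in the two comments above can be sketched with a toy iterator over a plain buffer. `ToyIterator` is a hypothetical class written for illustration only and is unrelated to ITK's iterator classes; it assumes pixels are stored directly, so a const reference is always available from `Get()`.

```cpp
#include <cstddef>
#include <vector>

// Toy iterator illustrating the proposed contract:
// Get() yields a const reference, Value() yields a modifiable reference.
template <typename TPixel>
class ToyIterator
{
public:
  explicit ToyIterator(std::vector<TPixel> & buffer)
    : m_Buffer(&buffer)
  {}

  // Read-only access; no copy is made.
  const TPixel &
  Get() const
  {
    return (*m_Buffer)[m_Index];
  }

  // Read-write access to the stored pixel itself.
  TPixel &
  Value()
  {
    return (*m_Buffer)[m_Index];
  }

  // Assigns a new value to the current pixel.
  void
  Set(const TPixel & value)
  {
    (*m_Buffer)[m_Index] = value;
  }

  ToyIterator &
  operator++()
  {
    ++m_Index;
    return *this;
  }

  bool
  IsAtEnd() const
  {
    return m_Index >= m_Buffer->size();
  }

private:
  std::vector<TPixel> * m_Buffer;
  std::size_t           m_Index = 0;
};
```

When pixels are not stored directly, as with an adaptor or an interleaved vector buffer, there may be no stored object to reference, which is where a copy or a proxy becomes necessary.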

@blowekamp
Member Author

The Value() should provide modifiable reference, Get() should provide const reference (preferable) if possible, otherwise a copy. An example where choices had to be made is RLEImage. See: https://github.com/KitwareMedical/ITKRLEImage/blob/d7a59ec1bb2edfd55d96cb664b73e7b1e29ed410/include/itkRLEImageConstIterator.h#L314-L330 https://github.com/KitwareMedical/ITKRLEImage/blob/d7a59ec1bb2edfd55d96cb664b73e7b1e29ed410/include/itkRLEImageIterator.h#L87-L106

Unfortunately, Value() does not work for VectorImages nor Adaptors. I'll have to do some type of search to determine how often the Iterators' Value() methods are used.

The Range style iterators have similar behavior with the *iter for VectorImages. However, this dereference also supports assignment. I'm not sure what the expectations are for this style of iterator.

I'd like to think of this more as a documentation/optimization issue than clearly a bug, as I'm not sure how big the performance impact on numerous filters will be. These iterators have been this way for over 10 years; it may be best to just leave them as is and document it. It may be better, less disruptive, and more productive to move on to the Range iterators if their behavior is more defined and consistent.

Comment on lines +172 to +173
auto inputRange = ImageRegionRange<const TInputImage>(*inputPtr, inputRegionForThread);
auto outputRange = ImageRegionRange<TOutputImage>(*outputPtr, outputRegionForThread);
Contributor

Nitpick: technically this is OK, of course, but in ITK we don't really do AAA ("auto almost always") everywhere. (We just follow clang-tidy modernize-use-auto, but that's still not AAA everywhere). So I would just do:

    ImageRegionRange<const TInputImage> inputRange(*inputPtr, inputRegionForThread);
    ImageRegionRange<TOutputImage>      outputRange(*outputPtr, outputRegionForThread);


const unsigned int componentsPerPixel = outputPtr->GetNumberOfComponentsPerPixel();

#if 0
Contributor

I guess the #if 0 code may be removed when this PR is ready, right? Otherwise if constexpr (false) might be nicer... 😺 But I would prefer the code to just be removed, eventually.

Member Author

I made PR #5670 with just the change to the cast filter.

This change below fails the testing due to the aliasing/reference of the first outputPixel, just like #5668. I'll update that issue with the Image Range iterator to hopefully focus the discussion of what the correct behavior should be.

Comment on lines +379 to +384
if (argc < 2)
{
  std::cerr << "Missing parameters." << std::endl;
  std::cerr << "Usage: " << itkNameOfTestExecutableMacro(argv) << " imageSize" << std::endl;
  return EXIT_FAILURE;
}
Contributor

When I run ITKImageFilterBaseTestDriver, it now says:

Available tests:
  0. itkNeighborhoodOperatorImageFilterTest
  1. itkImageToImageFilterTest
  2. itkVectorNeighborhoodOperatorImageFilterTest
  3. itkMaskNeighborhoodOperatorImageFilterTest
  4. itkCastImageFilterTest
  5. itkImageIterationPerformanceTest
To run a test, enter the test number: 5
CTEST_FULL_OUTPUT
Missing parameters.
Usage: <itkImageIterationPerformanceTest executable> imageSize

D:\X\Bin\ITK\bin\Release\ITKImageFilterBaseTestDriver.exe (process 46316) exited with code 1 (0x1).
Press any key to close this window . . .

Please consider a default imageSize, so that the test can be run that way as well... 🙏 (Of course I like GoogleTest even better!)

Member Author

I'm just running ./bin/ITKImageFilterBaseTestDriver itkImageIterationPerformanceTest 512 easy peasy.

Contributor

OK, easy peasy. 👍 Still I think a default imageSize would make it even easier peasier 😄
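
One way the suggested default could look is sketched below. The helper name `parseImageSize` and the default of 64 are assumptions made for illustration; neither comes from the PR.

```cpp
#include <string>

// Returns the image size from argv[1], or `defaultSize` when no argument is given.
// The default of 64 is an arbitrary illustrative value.
unsigned int
parseImageSize(int argc, char * argv[], unsigned int defaultSize = 64)
{
  if (argc < 2)
  {
    return defaultSize;
  }
  // std::stoul throws std::invalid_argument or std::out_of_range on bad input.
  return static_cast<unsigned int>(std::stoul(argv[1]));
}
```

With such a helper the test would still honor an explicit size such as 512, while running it from the interactive test driver menu would fall back to the default instead of exiting with an error.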

@blowekamp
Member Author

This has been a useful exercise to gain understanding of the behavior of the iterators with the VectorImage. I'm looking into moving the performance test to the ITK benchmarking repo, and summarizing some things in issue #5668.

@N-Dekker
Contributor

N-Dekker commented Dec 3, 2025

@blowekamp Very glad to see ImageRegionRange doing so well! I was afraid that it would be beaten by ImageScanlineIterator!

For regular images I saw that ImageRegionRange would be at least as fast as ImageRegionIterator, and much faster than ImageRegionIteratorWithIndex. (But of course, if you need both the pixel values and their index values, you might still want to use ImageRegionIteratorWithIndex.)

Of course, if you want to iterate over the entire image buffer, use ImageBufferRange instead.

Those VariableLengthVector/VectorImage pixels really behave a bit irregularly (or even counter-intuitively), so thank you for your investigation!

4 participants