Speed up SSE4.1 test by splitting individual unrolled blocks into their own functions. #24401

Merged
juj merged 1 commit into emscripten-core:main from further_optimize_sse4_1_test_suite on May 29, 2025

Conversation

juj
Collaborator

juj commented May 22, 2025

Before:

test_sse4_1 (test_core.core_2gb) ... ok (5.89s)
test_sse4_1 (test_core.lsan) ... ok (5.94s)
test_sse4_1 (test_core.minimal0) ... ok (5.98s)
test_sse4_1 (test_core.strict_js) ... ok (5.98s)
test_sse4_1 (test_core.strict) ... ok (6.13s)
test_sse4_1 (test_core.bigint) ... ok (6.13s)
test_sse4_1 (test_core.core0) ... ok (6.17s)
test_sse4_1 (test_core.core1) ... ok (6.36s)
test_sse4_1 (test_core.instance) ... ok (6.41s)
test_sse4_1 (test_core.asan) ... ok (9.62s)
test_sse4_1 (test_core.wasmfs) ... ok (10.01s)
test_sse4_1 (test_core.core2) ... ok (10.12s)
test_sse4_1 (test_core.corez) ... ok (11.15s)
test_sse4_1 (test_core.cores) ... ok (37.72s)
test_sse4_1 (test_core.core3) ... ok (140.27s)

Total core time: 273.854s. Wallclock time: 140.702s. Parallelization: 1.95x.

After:

test_sse4_1 (test_core.strict) ... ok (7.11s)
test_sse4_1 (test_core.strict_js) ... ok (7.16s)
test_sse4_1 (test_core.bigint) ... ok (7.18s)
test_sse4_1 (test_core.minimal0) ... ok (7.44s)
test_sse4_1 (test_core.core_2gb) ... ok (7.46s)
test_sse4_1 (test_core.core0) ... ok (7.49s)
test_sse4_1 (test_core.instance) ... ok (7.53s)
test_sse4_1 (test_core.lsan) ... ok (7.64s)
test_sse4_1 (test_core.core1) ... ok (8.15s)
test_sse4_1 (test_core.asan) ... ok (9.63s)
test_sse4_1 (test_core.wasmfs) ... ok (10.54s)
test_sse4_1 (test_core.core2) ... ok (10.44s)
test_sse4_1 (test_core.corez) ... ok (11.38s)
test_sse4_1 (test_core.cores) ... ok (11.69s)
test_sse4_1 (test_core.core3) ... ok (50.80s)

Total core time: 171.622s. Wallclock time: 51.223s. Parallelization: 3.35x.

@@ -547,9 +550,10 @@ __m128 ExtractIntInRandomOrder(unsigned int *arr, int i, int n, int prime) {
char str[256]; tostr(&m1, str); \
char str2[256]; tostr(&ret, str2); \
printf("%s(%s, 0x%08X, %d) = %s\n", #func, str, interesting_ints[j], Tint, str2); \
}
} \
}();
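The trailing `}();` in the hunk above closes a lambda and invokes it immediately, which is how each unrolled block ends up in its own function. The following is a minimal sketch of that pattern, not the test suite's actual macros: `RUN_BLOCK`, `g_blocks_run`, and `run_all_blocks` are hypothetical names, and the loop body is a placeholder for the real SSE4.1 intrinsic calls.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical counter used only to show that every block ran.
static int g_blocks_run = 0;

// Hypothetical macro: each expansion wraps its block in a capture-less lambda
// and calls it on the spot, so the compiler sees one small function per block
// instead of a single very large function body.
#define RUN_BLOCK(imm) \
  [] { \
    for (int j = 0; j < 4; ++j) \
      std::printf("block(%d, %d) = %d\n", imm, j, imm * j); \
    ++g_blocks_run; \
  }();

// Driver: four expansions, hence four separate lambda bodies.
int run_all_blocks() {
  g_blocks_run = 0;
  RUN_BLOCK(0)
  RUN_BLOCK(1)
  RUN_BLOCK(2)
  RUN_BLOCK(3)
  assert(g_blocks_run == 4);  // every block was invoked exactly once
  return g_blocks_run;
}
```

Because a lambda is an expression, the definition and the call can both come from a single macro expansion at the point of use.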
Collaborator

Can we use simple C functions instead of C++ lambdas, perhaps? That would also help with debuggability, as it would yield meaningful backtraces. If it's not easy, then this change is still better, of course. lgtm either way.

Collaborator Author

juj May 23, 2025

I don't know how to do that easily. The function bodies would then need to be emitted in one place by one set of macro expansions, and the calls to those functions would need to be emitted separately in another place.
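For illustration only, the two-place emission described above could look roughly like the following X-macro sketch. This is not what the PR does, and every name here (`TEST_CASES`, `DEFINE_TEST`, `CALL_TEST`, `run_named_tests`, and the `blend`/`extract` entries) is hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical list of test cases; each entry is (name, immediate).
#define TEST_CASES(X) \
  X(blend, 0) \
  X(blend, 1) \
  X(extract, 0)

// Expansion 1, at namespace scope: emit one named function per entry.
#define DEFINE_TEST(name, imm) \
  static int test_##name##_##imm() { \
    std::printf("%s(%d)\n", #name, imm); \
    return 1; \
  }
TEST_CASES(DEFINE_TEST)

// Expansion 2, inside the driver: emit one call per entry.
#define CALL_TEST(name, imm) ran += test_##name##_##imm();

int run_named_tests() {
  int ran = 0;
  TEST_CASES(CALL_TEST)
  assert(ran == 3);  // one call was emitted for each list entry
  return ran;
}
```

Named functions such as `test_blend_0` would appear in backtraces, but the list of cases has to be factored out so that it can be expanded twice, which is the extra restructuring the reply refers to.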

juj force-pushed the further_optimize_sse4_1_test_suite branch from 2be37fa to d375a38, May 23, 2025 15:04
juj enabled auto-merge (squash), May 23, 2025 15:04
juj force-pushed the further_optimize_sse4_1_test_suite branch from d375a38 to 0524c96, May 23, 2025 16:41
Collaborator Author

juj commented May 29, 2025

Ping - is this ok to land?

juj merged commit a7cdef6 into emscripten-core:main May 29, 2025
30 checks passed