@jeremylong
Collaborator

  • Move string validation and encoding functions to the new StringUtil class
  • Remove the deprecated public @Nullable String uriDecode(final @Nullable String source)
  • Update @since 1.6.0 to @since 2.0.0 (this was missed in build: bump major version #219)

@jeremylong changed the title from "BREAKING CHANGE: refactor PackageURL by moving String functions to StringUtil" to "chore!: refactor PackageURL by moving String functions to StringUtil" on Mar 22, 2025
@ppkarwasz
Contributor

+1 for refactoring String-related methods into their own class, but I don't understand why the breaking change is necessary. The only breaking change is the removal of PackageURL#uriDecode, I don't think it is worth scaring users with a major release.

* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
package com.github.packageurl.utils;
Contributor

Since these classes are not meant to be used by third-party libraries, I would suggest:

  • Using com.github.packageurl.internal or similar as package name.
  • Documenting that the package is internal in package-info.java.

Due to #201, this package will not be exported through JPMS, since package-info.java is not annotated with @Export.

@ppkarwasz
Contributor

The only breaking change is the removal of PackageURL#uriDecode, I don't think it is worth scaring users with a major release.

Sorry, I just saw the change to PackageURL#getQualifiers.

@jeremylong
Collaborator Author

This PR also made minor changes to percentDecode and percentEncode. The benchmarks before and after show an improvement in percentEncode. However, I suspect there is an issue with the percentDecode benchmark: the times didn't change, and they are fast enough that I'm guessing the decoder only searched for an encoded sequence and never found one.

Benchmark of updated percent decode/encode:

Benchmark                          (nonAsciiProb)  Mode  Cnt     Score    Error  Units
StringUtilBenchmark.baseline                    0  avgt   25   689.715 ±  7.937  us/op
StringUtilBenchmark.baseline                  0.1  avgt   25  1347.950 ±  3.308  us/op
StringUtilBenchmark.baseline                  0.5  avgt   25  2609.872 ±  5.754  us/op
StringUtilBenchmark.percentDecode               0  avgt   25   191.887 ±  0.347  us/op
StringUtilBenchmark.percentDecode             0.1  avgt   25   192.141 ±  0.208  us/op
StringUtilBenchmark.percentDecode             0.5  avgt   25   192.031 ±  0.190  us/op
StringUtilBenchmark.percentEncode               0  avgt   25  1023.150 ±  3.235  us/op
StringUtilBenchmark.percentEncode             0.1  avgt   25  3269.102 ±  7.088  us/op
StringUtilBenchmark.percentEncode             0.5  avgt   25  7179.152 ±  7.827  us/op

Benchmark                               (nonAsciiProb)  Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.baseline                    0  avgt   25   686.520 ±  6.698  us/op
PercentEncodingBenchmark.baseline                  0.1  avgt   25  1344.912 ±  3.942  us/op
PercentEncodingBenchmark.baseline                  0.5  avgt   25  2614.673 ±  3.389  us/op
PercentEncodingBenchmark.percentDecode               0  avgt   25   191.987 ±  0.319  us/op
PercentEncodingBenchmark.percentDecode             0.1  avgt   25   192.025 ±  0.227  us/op
PercentEncodingBenchmark.percentDecode             0.5  avgt   25   191.950 ±  0.293  us/op
PercentEncodingBenchmark.percentEncode               0  avgt   25  1158.468 ±  3.644  us/op
PercentEncodingBenchmark.percentEncode             0.1  avgt   25  2172.666 ±  6.813  us/op
PercentEncodingBenchmark.percentEncode             0.5  avgt   25  4432.998 ± 12.272  us/op

@jeremylong
Collaborator Author

I figured out the problem with the benchmark and I'm re-running it now.

@ppkarwasz
Contributor

I figured out the problem with the benchmark and I'm re-running it now.

The setup() method has a bug (encodedData = encodeData(encodedData)); I'll post an improved benchmark soon.

@jeremylong
Collaborator Author

jeremylong commented Mar 23, 2025

I can update the benchmark as part of this PR. I'd push the code - but I'm running the benchmark now and I'd like to see the results in another 2 hours after it runs on both pre and post my updates.

@ppkarwasz
Contributor

I can update the benchmark as part of this PR. I'd push the code - but I'm running the benchmark now and I'd like to see the results in another 2 hours after it runs on both pre and post my updates.

I fixed and extended the benchmark in #222.

@ppkarwasz
Contributor

Since nonAsciiProb == 0 (i.e. there are no characters to encode or to decode) is in practice the most common case, we should probably aggressively optimize for it. percentDecode can be easily optimized by skipping all processing if there are no % characters:

if (source.indexOf(PERCENT_CHAR) == -1) {
    return source;
}

I am not sure if percentEncode can get much better.
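
For context, here is a minimal sketch of how that early return could sit inside a complete decoder. It is not the library's actual implementation: the class name PercentDecodeSketch, the hexValue helper, and the error handling are assumptions made for illustration; only the indexOf check and the PERCENT_CHAR name come from the snippet above.

```java
import java.nio.charset.StandardCharsets;

final class PercentDecodeSketch {
    private static final char PERCENT_CHAR = '%';

    /** Decodes %XX sequences as UTF-8 (illustrative only, not the library code). */
    static String percentDecode(String source) {
        // Fast path: without a '%' there is nothing to decode, so return the input as-is.
        if (source.indexOf(PERCENT_CHAR) == -1) {
            return source;
        }
        byte[] in = source.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[in.length]; // decoded output is never longer than the input
        int length = 0;
        for (int i = 0; i < in.length; i++) {
            if (in[i] == PERCENT_CHAR) {
                if (i + 2 >= in.length) {
                    throw new IllegalArgumentException("Incomplete percent sequence at index " + i);
                }
                out[length++] = (byte) ((hexValue(in[i + 1]) << 4) | hexValue(in[i + 2]));
                i += 2; // skip the two hex digits just consumed
            } else {
                out[length++] = in[i];
            }
        }
        return new String(out, 0, length, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: converts one ASCII hex digit to its numeric value.
    private static int hexValue(byte b) {
        int value = Character.digit((char) (b & 0xFF), 16);
        if (value < 0) {
            throw new IllegalArgumentException("Invalid hex digit in percent sequence");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(percentDecode("a%20b")); // prints "a b"
    }
}
```

In this sketch, a string without any % is returned as the same instance, so the common case allocates nothing.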

@jeremylong
Collaborator Author

After updating to use the new benchmark, you'll notice that there isn't much change in percentDecode. Again, I think I have the solution to this and I'll include it in this PR. I'll be back in 4+ hours (I really need to go buy a better dev machine ;)).

With my changes:

Benchmark                           (nonAsciiProb)  Mode  Cnt     Score    Error  Units
StringUtilBenchmark.baseline                     0  avgt   25    44.430 ±  0.247  us/op
StringUtilBenchmark.baseline                   0.1  avgt   25    44.335 ±  0.371  us/op
StringUtilBenchmark.baseline                   0.5  avgt   25    44.474 ±  0.256  us/op
StringUtilBenchmark.percentDecode                0  avgt   25   191.667 ±  0.247  us/op
StringUtilBenchmark.percentDecode              0.1  avgt   25   191.876 ±  0.110  us/op
StringUtilBenchmark.percentDecode              0.5  avgt   25   191.632 ±  0.216  us/op
StringUtilBenchmark.percentEncode                0  avgt   25  1012.234 ±  8.376  us/op
StringUtilBenchmark.percentEncode              0.1  avgt   25  1001.798 ±  4.785  us/op
StringUtilBenchmark.percentEncode              0.5  avgt   25   993.283 ±  3.431  us/op
StringUtilBenchmark.toLowerCase                  0  avgt   25    97.897 ±  0.134  us/op
StringUtilBenchmark.toLowerCase                0.1  avgt   25    97.759 ±  0.232  us/op
StringUtilBenchmark.toLowerCase                0.5  avgt   25    98.030 ±  0.302  us/op
StringUtilBenchmark.toLowerCaseJre               0  avgt   25   910.749 ±  3.538  us/op
StringUtilBenchmark.toLowerCaseJre             0.1  avgt   25   911.569 ±  3.500  us/op
StringUtilBenchmark.toLowerCaseJre             0.5  avgt   25   907.451 ±  2.940  us/op

Legacy version:

Benchmark                                (nonAsciiProb)  Mode  Cnt     Score    Error  Units
PercentEncodingBenchmark.baseline                     0  avgt   25    44.774 ±  0.299  us/op
PercentEncodingBenchmark.baseline                   0.1  avgt   25    44.258 ±  0.478  us/op
PercentEncodingBenchmark.baseline                   0.5  avgt   25    44.443 ±  0.321  us/op
PercentEncodingBenchmark.percentDecode                0  avgt   25   191.934 ±  0.276  us/op
PercentEncodingBenchmark.percentDecode              0.1  avgt   25   192.135 ±  0.171  us/op
PercentEncodingBenchmark.percentDecode              0.5  avgt   25   191.946 ±  0.349  us/op
PercentEncodingBenchmark.percentEncode                0  avgt   25  1161.316 ±  7.246  us/op
PercentEncodingBenchmark.percentEncode              0.1  avgt   25  1147.482 ±  4.225  us/op
PercentEncodingBenchmark.percentEncode              0.5  avgt   25  1149.934 ±  5.693  us/op
PercentEncodingBenchmark.toLowerCase                  0  avgt   25    97.918 ±  0.352  us/op
PercentEncodingBenchmark.toLowerCase                0.1  avgt   25    97.995 ±  0.168  us/op
PercentEncodingBenchmark.toLowerCase                0.5  avgt   25    98.058 ±  0.355  us/op
PercentEncodingBenchmark.toLowerCaseJre               0  avgt   25   912.258 ±  1.517  us/op
PercentEncodingBenchmark.toLowerCaseJre             0.1  avgt   25   912.719 ±  1.942  us/op
PercentEncodingBenchmark.toLowerCaseJre             0.5  avgt   25   914.154 ±  3.604  us/op

@ppkarwasz
Contributor

ppkarwasz commented Mar 23, 2025

I'll be back in 4+ hours (I really need to go buy a better dev machine ;)).

JMH tests have a duration expressed in seconds and do not depend on the machine. 😉
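
A rough back-of-the-envelope sketch of that point: the wall-clock time of a run follows from the run configuration, not from the hardware. The figures below are assumptions, since the thread does not state the configuration: 5 forks with 5 warmup and 5 measurement iterations of 10 seconds each (the Cnt of 25 in the tables is at least consistent with 5 forks × 5 measurement iterations), applied to the 5 benchmark methods and 3 nonAsciiProb values of the extended benchmark. The class and method names are hypothetical.

```java
final class JmhDurationEstimate {
    /** Estimated wall-clock seconds for one full run, ignoring JVM startup overhead. */
    static long estimateSeconds(int benchmarkMethods, int paramValues, int forks,
                                int warmupIterations, int warmupSeconds,
                                int measurementIterations, int measurementSeconds) {
        // Time spent inside a single fork: warmup plus measurement.
        long perFork = (long) warmupIterations * warmupSeconds
                + (long) measurementIterations * measurementSeconds;
        // Every method/parameter combination is run in every fork.
        return (long) benchmarkMethods * paramValues * forks * perFork;
    }

    public static void main(String[] args) {
        // Assumed configuration, see the note above.
        long seconds = estimateSeconds(5, 3, 5, 5, 10, 5, 10);
        System.out.println(seconds + " s"); // prints "7500 s", a little over two hours
    }
}
```

Under these assumptions one run takes about two hours on any machine, so running both the old and the new code takes about four.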

@ppkarwasz
Contributor

In #224 I expanded on this PR by optimizing the case when no percent encoding is needed.

Profiling has shown that shouldEncode is the slowest method.
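
A minimal sketch of what such an optimization could look like, not the code from #224: shouldEncode becomes a table lookup, and the encoder returns the input untouched when no character needs encoding. The class name, the UNRESERVED table, and the choice of unreserved characters (ASCII letters, digits, and "-._~") are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Lookup table for ASCII: true means the character is emitted unchanged (assumed set).
    private static final boolean[] UNRESERVED = new boolean[128];

    static {
        for (char c = 'a'; c <= 'z'; c++) UNRESERVED[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) UNRESERVED[c] = true;
        for (char c = '0'; c <= '9'; c++) UNRESERVED[c] = true;
        for (char c : "-._~".toCharArray()) UNRESERVED[c] = true;
    }

    /** One bounds check and one array read instead of a chain of range comparisons. */
    static boolean shouldEncode(int c) {
        return c >= 128 || !UNRESERVED[c];
    }

    static String percentEncode(String source) {
        // Fast path: locate the first character that needs encoding, if any.
        int first = -1;
        for (int i = 0; i < source.length(); i++) {
            if (shouldEncode(source.charAt(i))) {
                first = i;
                break;
            }
        }
        if (first == -1) {
            return source; // nothing to encode, no allocation
        }
        StringBuilder sb = new StringBuilder(source.length() + 16);
        sb.append(source, 0, first); // copy the untouched prefix in one step
        for (byte b : source.substring(first).getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (shouldEncode(v)) {
                sb.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            } else {
                sb.append((char) v);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("a b")); // prints "a%20b"
    }
}
```

The prefix before the first encodable character is pure ASCII, so splitting the string at that index cannot cut through a surrogate pair.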

@jeremylong force-pushed the scratch/refactor-stringutils branch from 006e58a to 8f17eee on March 23, 2025 21:10
jeremylong and others added 2 commits March 23, 2025 17:12
* feat: Improve benchmark (#222)

Fixes a bug in the benchmark initialization and adds a `toLowerCase` benchmark.

* fix: Benchmark initialization

The benchmark **must** be initialized in a `@Setup` method, otherwise `nonAsciiProb` will always be `0.0`.

* fix: Improve encoding/decoding performance for ASCII strings

Since strings that don't require **any** percent encoding are in practice the rule, the encoding/decoding code should be optimized for this case.
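
The initialization fix in the commit message above can be illustrated without JMH. The following plain-Java sketch shows one plausible way the problem arises: data derived from a parameter field in a field initializer is computed during construction, before a framework has had a chance to inject the parameter value. All class and method names here are hypothetical, and the manual field assignment only stands in for the injection that JMH performs on @Param fields.

```java
final class SetupOrderSketch {
    /** Broken: the initializer of data runs in the constructor, before injection. */
    static final class BrokenState {
        double nonAsciiProb;                         // injected after construction
        final String data = describe(nonAsciiProb);  // evaluated too early, always sees 0.0
    }

    /** Fixed: data is derived in an explicit setup step that runs after injection. */
    static final class FixedState {
        double nonAsciiProb;
        String data;

        void setup() {                               // plays the role of a @Setup method
            data = describe(nonAsciiProb);
        }
    }

    static String describe(double nonAsciiProb) {
        return "nonAsciiProb=" + nonAsciiProb;
    }

    public static void main(String[] args) {
        BrokenState broken = new BrokenState();
        broken.nonAsciiProb = 0.5;                   // stand-in for parameter injection
        System.out.println(broken.data);             // prints "nonAsciiProb=0.0"

        FixedState fixed = new FixedState();
        fixed.nonAsciiProb = 0.5;
        fixed.setup();
        System.out.println(fixed.data);              // prints "nonAsciiProb=0.5"
    }
}
```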
@jeremylong jeremylong marked this pull request as ready for review March 24, 2025 12:02
@jeremylong jeremylong requested a review from ppkarwasz March 24, 2025 12:02
@jeremylong
Collaborator Author

@ppkarwasz I think this is good to go. I don't think we can get much more optimization out of the encode/decode and moving the string functions to their own class helps clean up the PackageURL class.

Contributor

@ppkarwasz ppkarwasz left a comment

LGTM

Around 100 ns per operation on a 256-character String looks good enough to me.

Maybe we could split the toLowerCase and toLowerCaseJre benchmark methods into a separate benchmark class: right now these methods use the test strings for encoding, so there are no favorable test strings (e.g. a string with only lowercase characters).

@stevespringett stevespringett merged commit 9b9cde4 into master Mar 24, 2025
5 checks passed
@stevespringett stevespringett deleted the scratch/refactor-stringutils branch March 24, 2025 19:48