-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8358533: Improve performance of java.io.Reader.readAllLines #25863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The throughput of the implementation as measured by the included benchmark appears to hover around 13% greater than that of the existing method. The updated method should also have a smaller memory footprint for streams of non-trivial length as it does not first create a single intermediate |
👋 Welcome back bpb! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
if (c == '\r' || c == '\n') | ||
break; | ||
term++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth adding a test of unconventional sequences or \r and \n, including \r\r and \n\n, \r.
The current ReadAll test cover the conventional sequences on Linux and Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should treat "\r\n" as a single line terminator? for example
"hello\r\nworld".lines().forEach(line -> out.format("[%s]\n", line));
=>
[hello]
[world]
instead of (the current impl)
[hello]
[]
[world]
or I misread the impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should treat "\r\n" as a single line terminator?
You are correct: that needs to be fixed:
jshell> Reader r = new StringReader("hello\r\nworld")
r ==> java.io.StringReader@480bdb19
jshell> r.readAllLines()
$3 ==> [hello, , world]
Thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scanner
seems to scan for even more characters:
private static final String LINE_SEPARATOR_PATTERN = |
Would it make sense to resemble this? Would it make sense to simply use Scanner
directly? 🤔
if (pos >= n) { | ||
// Buffer content consumed so reload it. | ||
if ((n = read(cb, 0, cb.length)) < 0) { | ||
eos = eol = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eos = eol = true; | |
eos = true; |
The local variable eol assignment here is not used and can be removed.
} | ||
|
||
eol = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
eol = false; | |
} |
Same as above, the local variable eol is not used after being assigned and can be removed.
return readAllCharsAsString().lines().toList(); | ||
char[] cb = new char[TRANSFER_BUFFER_SIZE]; | ||
int pos = 0; | ||
List<String> lines = new ArrayList<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<String> lines = new ArrayList<String>(); | |
List<String> lines = new ArrayList<>(); |
int pos = 0; | ||
List<String> lines = new ArrayList<String>(); | ||
|
||
StringBuilder sb = new StringBuilder(82); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for this pre-allocation? If the whole content is smaller than 8192 in size, this allocation would be redundant because we are going through the string constructor path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for this pre-allocation?
What would you suggest? Start with a smaller allocation and increase it if needed? There is no possibility of knowing the length of the stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this PR explicitly targets performance and as the aim of this method is to keep all content in-memory anyways, I wonder if it would be acceptable and even faster to pre-allocate new StringBuilder(TRANSFER_BUFFER_SIZE)
? In the end, this allocation is just temporary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to call new StringBuilder(0)
as it is possible this is completely unused because we always hit the eol && sb.length() == 0
path below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change is motivated by performance, but there will be many inputs that are less than the transfer buffer size and those will not use the StringBuilder, so creating it before it is needed could be avoided.
When a partial line is left in the transfer buffer, copy it to the beginning of the buffer and read more characters for the remaining size of the buffer. It would save some copying into and out of the SB.
You might still need a fallback for really long lines (> transfer buffer size), but that might be more easily handled by reallocating the transfer buffer to make it larger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resizing/newCapacity is always expensive and tricky, string builder included. so maybe we should decide if 'long lines' (> transfer buffer size) is the goal of this pr. if not, it might be reasonable/make sense (???) to simply go with "string" + the built-in string concatenation -> we don't care the scenario that most of the 'lines' > buffer.size. i do agree we probably want to avoid paying the cost of copying in & out of the sb, but tweaking the transfer buffer resizing might also be tricky and potentially out of the scope as well. yes, it's always a trade off.
If we want better performance, we should go a step further and overload the readAllLines method in the Reader implementation class. For example, in the most commonly used InputStreamReader, overload readAllLines through StreamDecoder and make special optimizations for UTF8/ISO_8859_1 encoding. In StringReader, special overload methods can also be used for optimization. |
Perhaps, but not in this request. A separate issue should be filed and addressed subsequently. |
int pos = 0; | ||
List<String> lines = new ArrayList<String>(); | ||
|
||
StringBuilder sb = new StringBuilder(82); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this PR explicitly targets performance and as the aim of this method is to keep all content in-memory anyways, I wonder if it would be acceptable and even faster to pre-allocate new StringBuilder(TRANSFER_BUFFER_SIZE)
? In the end, this allocation is just temporary.
if (c == '\r' || c == '\n') | ||
break; | ||
term++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scanner
seems to scan for even more characters:
private static final String LINE_SEPARATOR_PATTERN = |
Would it make sense to resemble this? Would it make sense to simply use Scanner
directly? 🤔
// Current position is terminator so skip it. | ||
pos++; | ||
} else { // term > pos | ||
if (eol && sb.length() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for sb.length() == 0
instead of sb.isEmpty()
?
eol = false; | ||
} | ||
|
||
return lines; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to return a mutable ArrayList
here? In earlier discussions about this very API I was told that it deliberately returns String
instead of CharSequence
due to intended immutability, even if that potentially implied slower performance. Following this logic, it would be just straightforward to return Collections.unmodifiableList(lines);
here. 🤔
@@ -499,7 +547,13 @@ public List<String> readAllLines() throws IOException { | |||
* @since 25 | |||
*/ | |||
public String readAllAsString() throws IOException { | |||
return readAllCharsAsString(); | |||
StringBuilder result = new StringBuilder(); | |||
char[] cbuf = new char[TRANSFER_BUFFER_SIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this PR explicitly targets performance and as the aim of this method is to keep all content in-memory anyways, I wonder if it would be acceptable and even faster to pre-allocate new StringBuilder(TRANSFER_BUFFER_SIZE)
? In the end, this allocation is just temporary.
Replaces the implementation
readAllCharsAsString().lines().toList()
with reading into a temporarychar
array which is then processed to detect line terminators and copy non-terminating characters into strings which are added to the list.Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25863/head:pull/25863
$ git checkout pull/25863
Update a local copy of the PR:
$ git checkout pull/25863
$ git pull https://git.openjdk.org/jdk.git pull/25863/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25863
View PR using the GUI difftool:
$ git pr show -t 25863
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25863.diff
Using Webrev
Link to Webrev Comment