Add unicode handling capability #106
The current behaviour is to withhold a half-completed unicode char from the output buffer. If the client program outputs EOF while the unicode char buffer is incomplete, the char is swallowed. However, one problem is that if the client program outputs an invalid unicode char followed by valid unicode chars, the output buffer stalls and no more chars are piped into it.
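For readers following along, the withholding behaviour described above can be sketched roughly as below. This is only an illustration built on the standard library's `std::str::from_utf8` and `Utf8Error::valid_up_to`; the `Utf8Accumulator` type and its `push` method are hypothetical names and not the code in this PR.

```rust
/// Hypothetical helper for illustration; not the crate's actual implementation.
/// It buffers raw bytes and only releases the prefix that is valid UTF-8.
struct Utf8Accumulator {
    pending: Vec<u8>,
}

impl Utf8Accumulator {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed raw bytes and return whatever text is complete so far.
    /// Bytes after the valid prefix stay in `pending` until more input arrives.
    fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                // Everything buffered is valid: release it all.
                let out = s.to_owned();
                self.pending.clear();
                out
            }
            Err(e) => {
                // Release only the valid prefix and withhold the rest.
                let valid = e.valid_up_to();
                let out = String::from_utf8_lossy(&self.pending[..valid]).into_owned();
                self.pending.drain(..valid);
                out
            }
        }
    }
}
```

With this sketch, feeding `[0xE2, 0x82]` returns an empty string and a following `[0xAC]` returns `€`. An invalid byte such as `0xFF` is never released, so everything pushed after it stays withheld, which is the stall described in the comment above.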
Thanks for your patches and also for your bug report! ❤️ Highly appreciated! The CI failure should be fixed after #107 is merged, don't worry about that. I am not sure what the way forward is yet. I like the first option you proposed in #105 most:
But that might break some downstream users for non-UTF8 outputting programs, as I understand it!? @petreeftime maybe you have some ideas what's the best way to go here? Maybe make the whole thing configurable (a flag that can be passed to the library for each call)?
I don't think the first option is the best one. The solution in this PR is kind of a compromise between the first two options. This is because the underlying datatype piped by the client program is always TLDR: Unicode encoding necessarily runs the risk of stalling the buffer, and this happens either in the read thread or the write thread. You're correct that this may break non-UTF8 outputting programs, but if the program restricts its output to the ASCII range there shouldn't be any compatibility issue. I left a flag in the code to represent some external option which toggles between unicode and non-unicode encoding. Do you think it would be sensible to add it as a crate-level feature? Or embed the option in every instance of
If anything I'd have it as an option on the data type, not a crate-level feature or compile-time option, because of the simple fact that someone might have UTF-8 outputting programs and non-UTF-8 outputting programs in one project. In general I think compile-time features should never be used for "either-or" functionality 😆 So a flag on the type would be totally fine with me, with UTF-8 compatible reading as the default. I'd like to see what @petreeftime thinks as well though.
If you have any idea where to add such a flag, I can do it in this PR. By the way, why is Pull Request Checks failing? Is it because I didn't sign my commits?
It's because you didn't sign off your commits (the
We merged #107, which should make your compile issues go away; please rebase onto the latest master.
fd91788 to f5a333d
Rebased. I added a field for I think the problem about the client program stalling the buffer should not be an issue that needs handling. If the client program outputs an invalid unicode char followed by valid unicode, then it is not outputting unicode anyway, so UTF8 mode is inadequate.
Overall this looks good.
I think having the encoding be an argument on `new`, but also providing two convenience helpers `NBReader::ascii()` and `NBReader::utf8()`, which call `new` with the respective argument, would be nice and not that much of a hassle.
@petreeftime pinging you again, hope you can have a look as well.
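A minimal sketch of the shape suggested in this review, assuming a per-instance encoding argument with two convenience constructors. `SketchReader`, `Encoding` and `read_all` are hypothetical names chosen for illustration; the real `NBReader` takes other arguments and is non-blocking, which is left out here. The byte-to-char mapping in ASCII mode is likewise an assumption about what non-unicode reading would mean.

```rust
use std::io::Read;

/// Hypothetical per-instance encoding option (not the crate's real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Encoding {
    Ascii,
    Utf8,
}

/// Hypothetical stand-in for the reader; blocking and simplified.
struct SketchReader<R: Read> {
    inner: R,
    encoding: Encoding,
}

impl<R: Read> SketchReader<R> {
    /// General constructor: the encoding is chosen per instance.
    fn new(inner: R, encoding: Encoding) -> Self {
        Self { inner, encoding }
    }

    /// Convenience helper that forwards to `new` with `Encoding::Ascii`.
    fn ascii(inner: R) -> Self {
        Self::new(inner, Encoding::Ascii)
    }

    /// Convenience helper that forwards to `new` with `Encoding::Utf8`.
    fn utf8(inner: R) -> Self {
        Self::new(inner, Encoding::Utf8)
    }

    /// Read everything and decode it according to the chosen encoding.
    fn read_all(&mut self) -> std::io::Result<String> {
        let mut bytes = Vec::new();
        self.inner.read_to_end(&mut bytes)?;
        Ok(match self.encoding {
            // Assumption: each byte becomes one char, so multi-byte
            // sequences come out garbled.
            Encoding::Ascii => bytes.iter().map(|&b| b as char).collect(),
            // Invalid sequences are replaced with U+FFFD in this sketch.
            Encoding::Utf8 => String::from_utf8_lossy(&bytes).into_owned(),
        })
    }
}
```

Keeping the option on the instance means one project can read from UTF-8 and non-UTF-8 programs side by side, which is the argument made earlier in this thread against a crate-level feature.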
Signed-off-by: Leni Aniva <[email protected]>
Have you decided that this is the way to go? I need this crate in production
I've merged
it seems like bash is generating some extra outputs? This also occurs on the master branch.
Previously, if the client program output unicode, the unicode output would be garbled when it was piped through `NBReader`. For more detail, see #105. Writing doesn't seem to have any trouble, as demonstrated by `examples/cat.rs`.
Right now the unicode encoder is always on. In pexpect, an encoding option is available to toggle between the two, which I'm not sure where to put here.