Using non-ASCII characters in start_text crashes sample.lua #47
Can you use KOI8?
Yeah, I thought about using KOI8-R or CP1251 datasets and doing the character re-encoding for the argument and output myself (perhaps with a Python script). But I tried it with a KOI8-R dataset and KOI8-R encoded start_text and got the same error. It looks like it doesn't like any non-ASCII characters, even in single-byte encodings.
Can you check the input_json file? Does it have correct indexes/tokens in both the idx_to_token and token_to_idx arrays?
Edit: Yes, it seems to store Unicode codepoints in the arrays for Cyrillic characters, which the preprocessor decodes from KOI8 codepoints. I think they're correct, but since JSON uses Unicode codepoints, sample.lua has no way of knowing which encoding is used for its string arguments, hence the error. It simply can't map non-ASCII characters in the arguments to Unicode codepoints, because the arguments arrive in a variable-byte encoding while the tokens are stored as double-byte Unicode values. It looks like something like a -start_text_encoding option is required to resolve the issue.
You can use char-rnn for the time being, since char-rnn seems to handle Unicode fine.
Another solution: #12
I don't quite see how PR #12 solves this problem.
It depends on whether this issue is Lua-only, or whether it sits somewhere between how Python creates token_to_idx and how Lua interprets the start_text. PR #12 gets rid of Python and moves everything into a single Lua domain.
No, it's clearly a Lua-only issue. Tokens are stored as double-byte Unicode values; Lua just has no way of knowing how to map UTF-8 encoded command-line arguments onto those token values. All it sees are single bytes outside the ASCII range (>127), and it can't figure out that UTF-8 bytes must be interpreted together rather than on a single-byte basis: it only understands ASCII natively. So moving everything to Lua would make no difference. But thanks to @maraoz, this problem is resolved: by using a UTF-8 library, torch-rnn is now capable of correctly interpreting UTF-8 encoded command-line arguments. All that's left is to merge his pull request.
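For illustration, here is a minimal sketch (not the code from the pull request) of what "interpreting UTF-8 bytes together" means: the lead byte of each character says how many bytes belong to it, so slicing the string byte by byte splits characters apart.

-- Minimal UTF-8 grouping sketch; real code should use a proper UTF-8 library.
local function utf8_char_len(lead)
  if lead < 0x80 then return 1       -- plain ASCII
  elseif lead >= 0xF0 then return 4  -- 4-byte sequence
  elseif lead >= 0xE0 then return 3  -- 3-byte sequence
  elseif lead >= 0xC0 then return 2  -- 2-byte sequence
  else error('unexpected continuation byte') end
end

local s = "тест"                     -- 4 Cyrillic letters, 8 bytes in UTF-8
local i = 1
while i <= #s do
  local len = utf8_char_len(s:byte(i))
  print(s:sub(i, i + len - 1), len)  -- whole characters: т 2, е 2, с 2, т 2
  i = i + len
end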
Not a fix, but if you don't really mind losing the Unicode, you can quickly strip it with this command:
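A minimal Lua stand-in for that idea, assuming you simply want to drop every non-ASCII byte from a string (this is not the commenter's original command, which isn't quoted above):

local s = "тест test"
local ascii_only = s:gsub("[\128-\255]", "")  -- drop every byte >= 128
print(ascii_only)                             -- -> " test"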
I'm having this same problem. I don't know if anyone else has this use case, but I'm trying to encode a list of 160 possible actions as individual characters. This is bigger than the ASCII range, so I use UTF-8. The model trains, but like @tmp6154 I can't initialize the start_text param with UTF-8 input to sample it, either from the command line or by reading from a file in sample.lua. I still don't understand the solution proposed above, but it would be nice if it were merged in. It seems odd that I can train the model with UTF-8 chars but can't initialize a sample with the same encoding.
Actually, it's pretty simple if you think about it. An encoding such as UTF-8 maps characters to numbers. The neural network doesn't know anything about characters; it operates on numbers. preprocess.py takes text in any encoding Python supports (including UTF-8) and essentially creates a brand new, reduced encoding on top of the base encoding, containing only the characters that occur in the corpus (in the order they were encountered in the text). At that point, all the information required for training and sampling is available: everything is stored in the output JSON, and the "text" (that is, the sequence of tokens from the JSON) is stored in the h5 file. Each UTF-8 character gets turned into a single Unicode codepoint (which, I believe, is different from Karpathy's char-rnn, which treated separate UTF-8 bytes as individual characters, leading to invalid UTF-8 sequences in samples) and is mapped to some integer.

During the training phase, the neural network receives only integers representing the characters; it doesn't know anything about the encoding they came from. For example, in one of my datasets the letter "a" has code 85, while in UTF-8 it's 97. In other words, UTF-8 is not used at all during training; what matters is our derived encoding, produced by the preprocessor. So when training sees the letter "a", the input neuron for 85 is activated. The same applies to the sampling phase: to generate samples, sample.lua doesn't need to know UTF-8. It just does a forward pass of the neural network, checks which character number comes next, and maps it back to the original character (as found in the JSON file).

The reason start_text couldn't be initialized from UTF-8 is that Lua (unlike Python) doesn't come "with batteries included": to handle UTF-8 text, a third-party library is needed. It tries to interpret start_text as an 8-bit encoding and, obviously, can't find matching Unicode codepoints to map those bytes to numbers. Justin is probably busy for now; in the meantime, you could either apply the patch from PR #52 or check out the fork I made, where I've already applied that (and a few other) patches. Hopefully these patches will be merged into the mainline in the future.
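A rough Lua sketch of that sampling-side mapping; the table contents here are made up for illustration (the real idx_to_token comes from the preprocessor's JSON):

local idx_to_token = { [85] = "a", [86] = "б", [87] = " " }  -- assumed values
local sampled = { 85, 87, 86 }       -- indices produced by the forward passes
local pieces = {}
for i, idx in ipairs(sampled) do
  pieces[i] = idx_to_token[idx]      -- plain table lookup, no UTF-8 knowledge needed
end
print(table.concat(pieces))          -- -> "a б"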
I found the fork in your repo, and it works. Thanks a lot for this. I was stuck on this problem for days.
No problem, enjoy! Actually, the thanks for fixing that particular problem go to @maraoz: he made the pull request, which I applied against my repo.
I ran into the same problem and found an easy hack to overcome it. You can replace LM:encode_string(s) with:

function LM:encode_string(s)
  local encoded = torch.LongTensor(#s)
  local token = ''
  local ei = 1
  for i = 1, #s do
    -- Accumulate bytes until the accumulated string matches a known token
    token = token .. s:sub(i, i)
    local idx = self.token_to_idx[token]
    if idx ~= nil then
      encoded[ei] = idx
      token = ''
      ei = ei + 1
    elseif #token == 4 or i == #s then
      -- No match after 4 bytes (the longest UTF-8 sequence) or at end of input:
      -- the input contains a character the model doesn't know, so bail out.
      assert(idx ~= nil, 'Got invalid idx')
    end
  end
  encoded:resize(ei - 1)
  return encoded
end

It scans for multi-byte keys in the token_to_idx table. I'm not sure if it is clean enough, but maybe @jcjohnson will take a look and incorporate it.
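For illustration, here is the same accumulation trick in isolation, with a made-up token_to_idx table (the real one is loaded from the preprocessor's JSON):

local token_to_idx = { ["t"] = 1, ["т"] = 2 }  -- "т" is a 2-byte UTF-8 key
local s = "tт"
local token = ''
for i = 1, #s do
  token = token .. s:sub(i, i)       -- keep appending bytes...
  local idx = token_to_idx[token]    -- ...until they form a known token
  if idx ~= nil then
    print(token, idx)                -- prints: t 1, then: т 2
    token = ''
  end
end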
We are rewriting the preprocess.py script to be much more robust and to handle encodings up to UTF-32.
Having non-ASCII characters in arguments, particularly UTF-8, which is my terminal encoding, breaks sample.lua's start_text functionality. Here's sample output where I try to initialize the network with the Russian word for "test":
This is very unfortunate, since most of the datasets I train the network on consist mostly of Russian UTF-8 encoded text, and I'm unable to pre-seed the network. My guess is that it treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices.
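That guess matches how stock Lua (without a UTF-8 library) sees the string; a quick illustration, not actual sample.lua output:

local s = "тест"                     -- Russian for "test", 4 characters
print(#s)                            -- 8: Lua counts bytes, not characters
print(s:sub(1, 1))                   -- a lone lead byte, not a valid character
-- Looking up such single-byte slices in token_to_idx never matches a stored
-- token, so the resulting index is invalid.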