Using non-ASCII characters in start_text crashes sample.lua #47

Closed
ostrosablin opened this issue Mar 25, 2016 · 16 comments

Comments

@ostrosablin

Having non-ASCII characters in the arguments, particularly UTF-8 (my terminal encoding), breaks sample.lua's start_text functionality. Here's sample output, where I try to initialize the network with the Russian word for "test":

th sample.lua -checkpoint models/test/checkpoint_27350.t7 -length 1000 -sample 1 -gpu -1 -temperature 1 -start_text тест
/home/vostrosa/torch/install/bin/luajit: ./LanguageModel.lua:129: Got invalid idx
stack traceback:
        [C]: in function 'assert'
        ./LanguageModel.lua:129: in function 'encode_string'
        ./LanguageModel.lua:174: in function 'sample'
        sample.lua:41: in main chunk
        [C]: in function 'dofile'
        ...ator/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405ec0

This is very unfortunate, since the datasets I train the network on consist mostly of Russian UTF-8-encoded text, and I'm unable to seed the network. My guess is that it treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices.

@AlekzNet

Can you use KOI8?

@ostrosablin
Author

Yeah, I thought about using KOI8-R or CP1251 datasets and doing the character re-encoding of the argument and output myself (perhaps with a Python script).

But I just tried it with a KOI8-R dataset and a KOI8-R-encoded start_text and got the same error. It looks like it doesn't accept any non-ASCII characters, even in single-byte encodings.

@AlekzNet

Can you check the input_json file? Does it have correct indexes/tokens in both idx_to_token and token_to_idx arrays?

@ostrosablin
Author

Edit: Yes, it seems to store Unicode codepoints in the arrays for Cyrillic characters, decoded from the KOI8 codepoints by the preprocessor. I think they're correct, but since the JSON uses Unicode codepoints, sample.lua has no way of knowing which encoding is used for string arguments, hence the error. It simply can't map the non-ASCII characters in the arguments to Unicode codepoints, because the arguments arrive in a variable-length byte encoding while the tokens are stored as two-byte Unicode values. It looks like something like a -start_text_encoding option is needed to resolve the issue.
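For the record, here's a minimal sketch of how the vocabulary can be inspected, assuming the lua-cjson rock (which, if I remember right, torch-rnn already depends on) is installed; the JSON path is just a placeholder:

local cjson = require 'cjson'

-- Read the JSON file produced by preprocess.py (path is an example).
local f = assert(io.open('data/my_dataset.json', 'r'))
local vocab = cjson.decode(f:read('*a'))
f:close()

-- JSON object keys are strings, so idx_to_token comes back with string keys.
for idx, tok in pairs(vocab.idx_to_token) do
  print(idx, tok, #tok)  -- #tok > 1 means the token is a multibyte UTF-8 string
end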

@cgadgil

cgadgil commented Mar 31, 2016

You can use char-rnn for the time being since char-rnn seems to honor Unicode fine.

https://github.com/karpathy/char-rnn

@AlekzNet

AlekzNet commented Apr 3, 2016

Another solution: #12

@ostrosablin
Author

I don't quite see how PR #12 solves this problem.

@AlekzNet

AlekzNet commented Apr 4, 2016

It depends on whether this issue is Lua-only, or whether it sits somewhere between how Python creates token_to_idx and how Lua interprets the start_text. PR #12 gets rid of Python and moves everything into a single Lua domain.

@ostrosablin
Author

No, it's clearly a Lua-only issue. The tokens are stored as double-byte Unicode values; Lua just has no way of knowing how to map UTF-8-encoded command-line arguments onto those values. All it sees are single bytes outside the ASCII range (>127), and it can't figure out that the bytes of a UTF-8 sequence must be interpreted together rather than one at a time; natively it understands only ASCII. So moving everything to Lua would make no difference.
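To illustrate the difference (just a sketch in plain Lua, not code from the project): iterating a string byte by byte splits each Cyrillic letter into two values, while the well-known UTF-8 character pattern keeps the bytes of each letter together:

local s = 'тест'  -- four Cyrillic letters, eight bytes in UTF-8

-- What stock Lua (and the current encode_string) effectively sees: raw bytes.
for i = 1, #s do
  io.write(string.byte(s, i), ' ')  -- eight numbers, all above 127
end
print()

-- Grouping bytes into UTF-8 characters with a common Lua pattern.
for ch in s:gmatch('[%z\1-\127\194-\253][\128-\191]*') do
  print(ch)  -- four two-byte tokens: т, е, с, т
end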

But thanks to @maraoz, this problem is resolved: by using a UTF-8 library, torch-rnn is now capable of correctly interpreting UTF-8-encoded command-line arguments. All that's left is to merge his pull request.

@jtippett

Not a fix, but if you don't really mind losing the Unicode, you can quickly strip it with this command: iconv -c -f utf-8 -t ascii corpus.txt > corpus-ascii.txt, and then run the RNN on the stripped file.

@hooande

hooande commented Jul 1, 2016

I'm having the same problem. I don't know if anyone else has this use case, but I'm trying to encode a list of 160 possible actions as individual characters. That's bigger than the ASCII range, so I use UTF-8. The model trains, but like @tmp6154 I can't initialize the start_text param with UTF-8 input when sampling, either from the command line or by reading from a file in sample.lua.

I still don't understand the solution proposed above, but it would be nice if it were merged in. It seems odd that I can train the model with UTF-8 characters but can't initialize a sample with the same encoding.

@ostrosablin
Author

ostrosablin commented Jul 1, 2016

Actually, it's pretty simple if you think about it. An encoding such as UTF-8 maps characters to numbers. The neural network doesn't know anything about characters; it operates on numbers. preprocess.py takes text in any encoding Python supports (including UTF-8-encoded text) and basically creates a brand-new, reduced encoding derived from the base encoding, one that contains only the characters found in the corpus (in the order they were encountered in the text).

Now all the information required for training and sampling is available: the vocabulary is stored in the output JSON, and the "text" (that is, the sequence of tokens from the JSON) is stored in the h5 file. Each UTF-8 character actually gets turned into a single Unicode codepoint (which, I believe, is different from Karpathy's char-rnn, which treated the separate UTF-8 bytes as individual characters, leading to invalid UTF-8 sequences in samples) and is mapped to some integer number.

During the training phase, the neural network receives only integer numbers representing the characters. It doesn't know anything about the encoding they came from. For example, in one of my datasets the letter "a" has code 85, while in UTF-8 it's 97. In other words, UTF-8 is not used at all during training; what is used is our derived encoding, produced by the preprocessor. So when training sees the letter "a", the input neuron for 85 is activated.
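As a toy illustration (these indices are made up, not taken from any real checkpoint), the derived encoding is nothing more than a pair of lookup tables:

-- Hypothetical derived encoding: indices follow the order of first appearance
-- in the corpus, so they are unrelated to UTF-8/ASCII codes.
local idx_to_token = { 'H', 'e', 'l', 'o', ' ', 'a' }
local token_to_idx = {}
for idx, tok in ipairs(idx_to_token) do
  token_to_idx[tok] = idx
end

print(token_to_idx['a'])  -- 6 in this toy vocabulary (85 in the dataset mentioned above)
print(string.byte('a'))   -- 97, the ASCII/UTF-8 code, which the network never sees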

The same applies to the sampling phase, actually. To generate samples, sample.lua doesn't need to know UTF-8. It just does a forward pass of the neural network, checks which character number comes next, and maps it back to the original character (as found in the JSON file).

The reason start_text couldn't be initialized from UTF-8 is that Lua (unlike Python) doesn't come "with batteries included": to support UTF-8 text, a third-party library is needed. It tries to interpret start_text as an 8-bit encoding and, obviously, can't find matching Unicode codepoints to map those bytes to numbers.

Justin is probably busy for now; in the meantime, you could either apply the patch from PR #52 or check out the fork I made (https://github.com/tmp6154/torch-rnn), where I've already applied that patch (and a few others). Hopefully these patches will be merged into the mainline in the future.

@hooande

hooande commented Jul 1, 2016

I found the fork in your repo; it works.

Thanks a lot for this. I was stuck on this problem for days and now I'm happily making predictions. Cheers!

@ostrosablin
Author

No problem, enjoy! Actually, the thanks for fixing that particular problem go to @maraoz; he made the pull request, which I applied against my repo.

@smartynov

I ran into the same problem and found an easy hack to work around it. You can replace the LM:encode_string method in LanguageModel.lua with the following version:

function LM:encode_string(s)
  local encoded = torch.LongTensor(#s)
  local token = ''
  local ei = 1
  for i = 1, #s do
    -- Accumulate bytes until the accumulated string matches a key in
    -- token_to_idx; multibyte UTF-8 characters are stored there as whole strings.
    token = token .. s:sub(i, i)
    local idx = self.token_to_idx[token]
    if idx ~= nil then
      encoded[ei] = idx
      token = ''
      ei = ei + 1
    elseif #token == 4 or i == #s then
      -- No match after 4 bytes (the maximum UTF-8 sequence length) or at the
      -- end of the input: the pending bytes are not in the vocabulary.
      assert(idx ~= nil, 'Got invalid idx')
    end
  end
  -- The tensor was sized for one token per byte; shrink it to the actual count.
  encoded:resize(ei - 1)
  return encoded
end

It scans for multibyte keys in the token_to_idx table. I'm not sure whether it's clean enough, but maybe @jcjohnson will take a look and incorporate it.
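For anyone who wants to try it, a rough usage sketch; the checkpoint path is the one from the original report, and the loading mirrors what sample.lua does (so treat the exact requires as an assumption):

require 'torch'
require 'nn'
require 'LanguageModel'

-- Load a trained checkpoint the same way sample.lua does.
local checkpoint = torch.load('models/test/checkpoint_27350.t7')
local model = checkpoint.model

-- With the patched encode_string, a UTF-8 start_text maps to one index per
-- character instead of failing on bytes above 127.
print(model:encode_string('тест'))  -- a LongTensor of four token indices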

@dgcrouse

We are rewriting the preprocess.py script to be much more robust and to handle encodings up to UTF-32.
