Using non-ASCII characters in start_text crashes sample.lua #47

Closed
ostrosablin opened this issue Mar 25, 2016 · 16 comments

Comments

@ostrosablin

Having non-ASCII characters in the arguments, particularly UTF-8 (my terminal encoding), breaks sample.lua's start_text functionality. Here's sample output, where I try to initialize the network with the Russian word for "test":

th sample.lua -checkpoint models/test/checkpoint_27350.t7 -length 1000 -sample 1 -gpu -1 -temperature 1 -start_text тест
/home/vostrosa/torch/install/bin/luajit: ./LanguageModel.lua:129: Got invalid idx
stack traceback:
        [C]: in function 'assert'
        ./LanguageModel.lua:129: in function 'encode_string'
        ./LanguageModel.lua:174: in function 'sample'
        sample.lua:41: in main chunk
        [C]: in function 'dofile'
        ...ator/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405ec0

This is very unfortunate, since the datasets I train the network on consist mostly of Russian UTF-8-encoded text, and I'm unable to seed the network. My guess is that it treats UTF-8 as a single-byte encoding, which would explain why it yields invalid indices.

@AlekzNet

Can you use KOI8?

@ostrosablin
Author

Yeah, I thought about using KOI8-R or CP1251 datasets and doing the character re-encoding of the argument and output myself (perhaps with a Python script).

But I just tried it with a KOI8-R dataset and a KOI8-R-encoded start_text and got the same error. It looks like it doesn't accept any non-ASCII characters, even in single-byte encodings.

@AlekzNet

Can you check the input_json file? Does it have correct indexes/tokens in both idx_to_token and token_to_idx arrays?

@ostrosablin
Author

Edit: Yes, it seems to store Unicode codepoints in the arrays for Cyrillic characters, decoded from the KOI8 codepoints by the preprocessor. I think they're correct, but since the JSON uses Unicode codepoints, sample.lua has no way of knowing which encoding is used for string arguments, hence the error. It simply can't map the non-ASCII characters in the arguments to Unicode codepoints, because the arguments arrive in a variable-length byte encoding while the tokens are stored as two-byte Unicode values. It looks like something like a -start_text_encoding option is needed to resolve the issue.
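For the record, here's a minimal sketch of how the vocabulary can be inspected, assuming the lua-cjson rock (which, if I remember right, torch-rnn already depends on) is installed; the JSON path is just a placeholder:

local cjson = require 'cjson'

-- Read the JSON file produced by preprocess.py (path is an example).
local f = assert(io.open('data/my_dataset.json', 'r'))
local vocab = cjson.decode(f:read('*a'))
f:close()

-- JSON object keys are strings, so idx_to_token comes back with string keys.
for idx, tok in pairs(vocab.idx_to_token) do
  print(idx, tok, #tok)  -- #tok > 1 means the token is a multibyte UTF-8 string
end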

@cgadgil

cgadgil commented Mar 31, 2016

You can use char-rnn for the time being since char-rnn seems to honor Unicode fine.

https://github.com/karpathy/char-rnn

@AlekzNet

AlekzNet commented Apr 3, 2016

Another solution: #12

@ostrosablin
Author

I don't quite see how PR #12 solves this problem.

@AlekzNet

AlekzNet commented Apr 4, 2016

It depends on whether this issue is Lua-only, or whether it sits somewhere between how Python creates token_to_idx and how Lua interprets the start_text. PR #12 gets rid of Python and moves everything into a single Lua domain.

@ostrosablin
Author

No, it's clearly a Lua-only issue. The tokens are stored as double-byte Unicode values; Lua just has no way of knowing how to map UTF-8-encoded command-line arguments onto those values. All it sees are single bytes outside the ASCII range (>127), and it can't figure out that the bytes of a UTF-8 sequence must be interpreted together rather than one at a time; natively it understands only ASCII. So moving everything to Lua would make no difference.
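To illustrate the difference (just a sketch in plain Lua, not code from the project): iterating a string byte by byte splits each Cyrillic letter into two values, while the well-known UTF-8 character pattern keeps the bytes of each letter together:

local s = 'тест'  -- four Cyrillic letters, eight bytes in UTF-8

-- What stock Lua (and the current encode_string) effectively sees: raw bytes.
for i = 1, #s do
  io.write(string.byte(s, i), ' ')  -- eight numbers, all above 127
end
print()

-- Grouping bytes into UTF-8 characters with a common Lua pattern.
for ch in s:gmatch('[%z\1-\127\194-\253][\128-\191]*') do
  print(ch)  -- four two-byte tokens: т, е, с, т
end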

But thanks to @maraoz, this problem is resolved: by using a UTF-8 library, torch-rnn is now capable of correctly interpreting UTF-8-encoded command-line arguments. All that's left is to merge his pull request.

@jtippett

Not a fix, but if you don't really mind losing the Unicode, you can quickly strip it with this command: iconv -c -f utf-8 -t ascii corpus.txt > corpus-ascii.txt, and then run the RNN on the stripped file.

@hooande

hooande commented Jul 1, 2016

I'm having the same problem. I don't know if anyone else has this use case, but I'm trying to encode a list of 160 possible actions as individual characters. That's bigger than the ASCII range, so I use UTF-8. The model trains, but like @tmp6154 I can't initialize the start_text param with UTF-8 input when sampling, either from the command line or by reading from a file in sample.lua.

I still don't understand the solution proposed above, but it would be nice if it were merged in. It seems odd that I can train the model with UTF-8 characters but can't initialize a sample with the same encoding.

@ostrosablin
Author

ostrosablin commented Jul 1, 2016

Actually, it's pretty simple if you think about it. An encoding such as UTF-8 maps characters to numbers. The neural network doesn't know anything about characters; it operates on numbers. preprocess.py takes text in any encoding Python supports (including UTF-8-encoded text) and basically creates a brand-new, reduced encoding derived from the base encoding, one that contains only the characters found in the corpus (in the order they were encountered in the text).

Now all the information required for training and sampling is available: the vocabulary is stored in the output JSON, and the "text" (that is, the sequence of tokens from the JSON) is stored in the h5 file. Each UTF-8 character actually gets turned into a single Unicode codepoint (which, I believe, is different from Karpathy's char-rnn, which treated the separate UTF-8 bytes as individual characters, leading to invalid UTF-8 sequences in samples) and is mapped to some integer number.

During the training phase, the neural network receives only integer numbers representing the characters. It doesn't know anything about the encoding they came from. For example, in one of my datasets the letter "a" has code 85, while in UTF-8 it's 97. In other words, UTF-8 is not used at all during training; what is used is our derived encoding, produced by the preprocessor. So when training sees the letter "a", the input neuron for 85 is activated.
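As a toy illustration (these indices are made up, not taken from any real checkpoint), the derived encoding is nothing more than a pair of lookup tables:

-- Hypothetical derived encoding: indices follow the order of first appearance
-- in the corpus, so they are unrelated to UTF-8/ASCII codes.
local idx_to_token = { 'H', 'e', 'l', 'o', ' ', 'a' }
local token_to_idx = {}
for idx, tok in ipairs(idx_to_token) do
  token_to_idx[tok] = idx
end

print(token_to_idx['a'])  -- 6 in this toy vocabulary (85 in the dataset mentioned above)
print(string.byte('a'))   -- 97, the ASCII/UTF-8 code, which the network never sees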

The same applies to the sampling phase, actually. To generate samples, sample.lua doesn't need to know UTF-8. It just does a forward pass of the neural network, checks which character number comes next, and maps it back to the original character (as found in the JSON file).

The reason start_text couldn't be initialized from UTF-8 is that Lua (unlike Python) doesn't come "with batteries included": to support UTF-8 text, a third-party library is needed. It tries to interpret start_text as an 8-bit encoding and, obviously, can't find matching Unicode codepoints to map those bytes to numbers.

Justin is probably busy for now; in the meantime, you could either apply the patch from PR #52 or check out the fork I made (https://github.com/tmp6154/torch-rnn), where I've already applied that patch (and a few others). Hopefully these patches will be merged into the mainline in the future.

@hooande

hooande commented Jul 1, 2016

I found the fork in your repo; it works.

Thanks a lot for this. I was stuck on this problem for days and now I'm happily making predictions. Cheers!

@ostrosablin
Author

No problem, enjoy! Actually, the thanks for fixing that particular problem go to @maraoz; he made the pull request, which I applied against my repo.

@smartynov

I ran into the same problem and found an easy hack to work around it. You can replace the LM:encode_string method in LanguageModel.lua with the following version:

function LM:encode_string(s)
  local encoded = torch.LongTensor(#s)
  local token = ''
  local ei = 1
  for i = 1, #s do
    -- Accumulate bytes until the accumulated string matches a key in
    -- token_to_idx; multibyte UTF-8 characters are stored there as whole strings.
    token = token .. s:sub(i, i)
    local idx = self.token_to_idx[token]
    if idx ~= nil then
      encoded[ei] = idx
      token = ''
      ei = ei + 1
    elseif #token == 4 or i == #s then
      -- No match after 4 bytes (the maximum UTF-8 sequence length) or at the
      -- end of the input: the pending bytes are not in the vocabulary.
      assert(idx ~= nil, 'Got invalid idx')
    end
  end
  -- The tensor was sized for one token per byte; shrink it to the actual count.
  encoded:resize(ei - 1)
  return encoded
end

It scans for multibyte keys in the token_to_idx table. I'm not sure whether it's clean enough, but maybe @jcjohnson will take a look and incorporate it.
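For anyone who wants to try it, a rough usage sketch; the checkpoint path is the one from the original report, and the loading mirrors what sample.lua does (so treat the exact requires as an assumption):

require 'torch'
require 'nn'
require 'LanguageModel'

-- Load a trained checkpoint the same way sample.lua does.
local checkpoint = torch.load('models/test/checkpoint_27350.t7')
local model = checkpoint.model

-- With the patched encode_string, a UTF-8 start_text maps to one index per
-- character instead of failing on bytes above 127.
print(model:encode_string('тест'))  -- a LongTensor of four token indices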

@dgcrouse

We are rewriting the preprocess.py script to be much more robust and to handle encodings up to UTF-32.
