
This library doesn't work for large embeddings #11

@ruanchaves

Issue description

I tried to execute a slightly modified version of this script (no significant changes) on an embedding with a large vocabulary and 600 dimensions:

import numpy as np
from nncompress import EmbeddingCompressor

# Load my embedding matrix
matrix = np.load("data/glove.6B.300d.npy")

# Initialize the compressor
compressor = EmbeddingCompressor(32, 16, "data/mymodel")

# Train the quantization model
compressor.train(matrix)

# Evaluate
distance = compressor.evaluate(matrix)
print("Mean euclidean distance:", distance)

# Export the codes and codebook
compressor.export(matrix, "data/mymodel")

This is what I got:

Traceback (most recent call last):
  File "compress.py", line 82, in <module>
    pipe\
  File "compress.py", line 70, in train
    compressor.train(matrix)
  File "/home/user/summer/smallnilc/nncompress/embed_compress.py", line 159, in train
    word_ids_var, loss_op, train_op, maxp_op = self.build_training_graph(embed_matrix)
  File "/home/user/summer/smallnilc/nncompress/embed_compress.py", line 114, in build_training_graph
    input_matrix = tf.constant(embed_matrix, name="embed_matrix")
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 180, in constant_v1
    allow_broadcast=False)
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 284, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 537, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

TensorFlow developers have answered similar issues by saying that the only solution is to rewrite your code so that it doesn't hit the hard 2 GB limit imposed by protobuf.

Steps to reproduce the issue

Try to compress an embedding with more than 300 dimensions (e.g. 600 or 1000).
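To see where the limit bites: the tensor proto stores the raw matrix bytes, so vocabulary size × dimensions × bytes per value must stay under 2^31. A quick back-of-the-envelope check, assuming float64 storage (the vocabulary cap per dimension count is the point, not any specific dataset):

```python
import numpy as np

PROTOBUF_LIMIT = 2 ** 31  # hard cap on a serialized tensor proto, in bytes

def tensor_proto_bytes(vocab_size, dims, dtype=np.float64):
    """Raw byte size of the matrix content that tf.constant would
    serialize into the graph proto."""
    return vocab_size * dims * np.dtype(dtype).itemsize

for dims in (300, 600, 1000):
    max_vocab = PROTOBUF_LIMIT // tensor_proto_bytes(1, dims)
    print(f"{dims} dims: at most {max_vocab:,} float64 vectors under 2 GB")
```

At 600 dimensions the cap is roughly 447,000 float64 vectors, so any reasonably large vocabulary overflows it; at 1000 dimensions it drops below 270,000.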
