Minimum number of corpus files for TokenHandler #14

@am-gz

For a reason I haven't been able to identify, the TokenHandler class seems to need at least 4 corpus (.conll) files in order to retrieve tokens correctly. If the corpus path contains only 1-3 corpus files, the transform_nodes_to_matrix function in mxutils.py ends up with an empty matrix and raises the ValueError shown in the log below, even when those 1-3 files hold exactly the same data that works when split across 4 or more files. When I split my single large corpus file into 10 files, everything worked smoothly.
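The workaround of splitting one large corpus file into several smaller ones can be sketched roughly as follows. This is only an illustration, not necessarily how the split was done here: the helper name `split_conll` is hypothetical, and it assumes that sentences in the file are separated by blank lines, which may not hold for every corpus format.

```python
from pathlib import Path


def split_conll(src, out_dir, n_parts=10):
    """Split one corpus file into at most n_parts files without cutting a sentence.

    Hypothetical helper; assumes sentences are separated by blank lines.
    """
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    text = src.read_text(encoding="utf-8")
    sentences = [s for s in text.split("\n\n") if s.strip()]

    # Ceiling division, so every sentence ends up in some part.
    per_part = max(1, -(-len(sentences) // n_parts))

    written = []
    for start in range(0, len(sentences), per_part):
        part_no = start // per_part + 1
        part = out_dir / f"{src.stem}_{part_no:02d}{src.suffix}"
        chunk = sentences[start:start + per_part]
        part.write_text("\n\n".join(chunk) + "\n", encoding="utf-8")
        written.append(part)
    return written
```

Calling `split_conll("corpus.conll", "corpus_split", n_parts=10)` returns the list of files it wrote; the corpus path in the settings would then need to point at the output directory. Depending on how the sentence count divides, the function may write fewer than `n_parts` files.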

I'm including a minimal example and the error log I got before switching to the 10 corpus files.

After importing all relevant packages and settings:

ifhan = ItemFreqHandler(settings=settings)
vocab = ifhan.build_item_freq()  # uses multiprocessing by default, which is overkill for the toy corpus

vocab_fname = f"{output_path}/{corpus_name}.nfreq"
vocab.save(vocab_fname)

query = vocab.subvocab(["girl/N"])

tokhan = TokenHandler(query, settings=settings)
tokens = tokhan.retrieve_tokens()
tokens

Log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 2
      1 tokhan = TokenHandler(query, settings=settings)
----> 2 tokens = tokhan.retrieve_tokens()
      3 tokens

File /nephosem/models/typetoken.py:531, in TokenHandler.retrieve_tokens(self, fnames)
    529 logger.info("Creating matrix...")
    530 start = time.time()
--> 531 mtx = mxutils.transform_nodes_to_matrix(res, self.formatter.colloc_format)
    532 logger.info(f"Finished matrix after {time.time()-start} seconds.")
    533 return mtx

File /nephosem/specutils/mxutils.py:188, in transform_nodes_to_matrix(type2toks, colloc_fmt)
    185 print(row_items) #AMGZ
    186 print(col_items) #AMGZ
--> 188 tokmx = transform_dict_to_spmatrix(tok2collocs, row_items, col_items)
    189 return matrix.TypeTokenMatrix(tokmx, row_items, col_items)

File /nephosem/specutils/mxutils.py:73, in transform_dict_to_spmatrix(dict_mtx, rowid2item, colid2item, verbose)
     71         data.append(value)
     72 if len(data) == 0:
---> 73     raise ValueError("Error: the input matrix is empty!!!")
     74 dt = type(data[0])
     75 ''' TODO: check data types
     76 dtypes = {
     77     int: np.int32,
   (...)
     88 data = np.array(data, dtype=dtypes[dt])
     89 '''

ValueError: Error: the input matrix is empty!!!
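Judging from the traceback, the error is raised because no values at all were collected before the sparse matrix was built, which suggests the problem lies upstream, in the token retrieval step, rather than in the conversion itself. The sketch below is a simplified stand-in for that kind of check, not the nephosem implementation: the function name `dict_to_triplets` and the nested-dict input shape are assumptions made for illustration.

```python
def dict_to_triplets(dict_mtx, row_items, col_items):
    """Flatten {row_item: {col_item: value}} into (rows, cols, data) lists.

    Simplified stand-in for illustration; not the library's own code.
    """
    row_index = {item: i for i, item in enumerate(row_items)}
    col_index = {item: j for j, item in enumerate(col_items)}

    rows, cols, data = [], [], []
    for row_item, collocs in dict_mtx.items():
        for col_item, value in collocs.items():
            rows.append(row_index[row_item])
            cols.append(col_index[col_item])
            data.append(value)

    # Same kind of guard as in the traceback: nothing was collected at all.
    if len(data) == 0:
        raise ValueError("Error: the input matrix is empty!!!")
    return rows, cols, data
```

With an input such as `{"girl/N/f1/3": {"the/D": 1}}` the function returns one triplet, while an empty dict (or one whose inner dicts are all empty) triggers the same ValueError. The token ID and collocate in that example are made up.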
