Minimum number of corpus files for TokenHandler #14

For a reason I haven't been able to identify, the ```TokenHandler``` class seems to need at least 4 corpus (.conll) files to retrieve tokens correctly. If the corpus path contains only 1-3 corpus files, the ```transform_nodes_to_matrix``` function in ```mxutils.py``` ends up with an empty matrix and a ```ValueError``` is raised, even when those 1-3 files hold the same amount of data as a corpus split into 4 or more files. When I split my single large corpus file into 10 files, everything worked smoothly.
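To check quickly whether a corpus directory falls under this threshold, something like the following could be used. It is only a sketch: ```count_corpus_files```, ```has_enough_files``` and ```MIN_FILES``` are hypothetical names and not part of nephosem, the value 4 is just the threshold observed in this issue, and the sketch assumes the .conll files sit directly in the corpus directory (no subfolders).

```python
from pathlib import Path

# Threshold observed in this issue; NOT a documented nephosem constant.
MIN_FILES = 4


def count_corpus_files(corpus_path, suffix=".conll"):
    """Count the files ending in `suffix` directly under `corpus_path`."""
    return sum(
        1
        for p in Path(corpus_path).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


def has_enough_files(corpus_path, minimum=MIN_FILES):
    """Return True if the directory holds at least `minimum` corpus files."""
    return count_corpus_files(corpus_path) >= minimum
```

Calling ```has_enough_files(corpus_path)``` before ```retrieve_tokens()``` would at least tell whether the file count is a likely cause of the empty matrix.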

I'm including a minimal example and the error log I got before splitting the corpus into 10 files.

After importing all relevant packages and settings:

```
ifhan = ItemFreqHandler(settings = settings)
vocab = ifhan.build_item_freq() # by default it uses multiprocessing, which is overkill for the toy corpus

vocab_fname = f"{output_path}/{corpus_name}.nfreq"
vocab.save(vocab_fname)

query = vocab.subvocab(["girl/N"])

tokhan = TokenHandler(query, settings=settings)
tokens = tokhan.retrieve_tokens()
tokens
```

Log:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 2
      1 tokhan = TokenHandler(query, settings=settings)
----> 2 tokens = tokhan.retrieve_tokens()
      3 tokens

File /nephosem/models/typetoken.py:531, in TokenHandler.retrieve_tokens(self, fnames)
    529 logger.info("Creating matrix...")
    530 start = time.time()
--> 531 mtx = mxutils.transform_nodes_to_matrix(res, self.formatter.colloc_format)
    532 logger.info(f"Finished matrix after {time.time()-start} seconds.")
    533 return mtx

File /nephosem/specutils/mxutils.py:188, in transform_nodes_to_matrix(type2toks, colloc_fmt)
    185 print(row_items) #AMGZ
    186 print(col_items) #AMGZ
--> 188 tokmx = transform_dict_to_spmatrix(tok2collocs, row_items, col_items)
    189 return matrix.TypeTokenMatrix(tokmx, row_items, col_items)

File /nephosem/specutils/mxutils.py:73, in transform_dict_to_spmatrix(dict_mtx, rowid2item, colid2item, verbose)
     71         data.append(value)
     72 if len(data) == 0:
---> 73     raise ValueError("Error: the input matrix is empty!!!")
     74 dt = type(data[0])
     75 ''' TODO: check data types
     76 dtypes = {
     77     int: np.int32,
   (...)
     88 data = np.array(data, dtype=dtypes[dt])
     89 '''

ValueError: Error: the input matrix is empty!!!
```
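The workaround mentioned above (splitting one large corpus file into several smaller ones) could be sketched roughly as follows. This is an illustration, not the exact script used here, and it makes assumptions that may not hold for every corpus: sentences are assumed to be separated by blank lines, as in standard CoNLL, and ```split_sentences``` and ```split_corpus``` are hypothetical helpers, not nephosem functions. If the corpus marks sentences differently (for example with tags), the boundary check would need to be adapted.

```python
from pathlib import Path


def split_sentences(lines):
    """Group lines into sentence blocks, treating blank lines as separators."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:
        yield block


def split_corpus(src, out_dir, n_parts=10):
    """Split one corpus file into `n_parts` files without cutting sentences."""
    src, out_dir = Path(src), Path(out_dir)
    with src.open(encoding="utf-8") as fh:
        sentences = list(split_sentences(fh))
    if len(sentences) < n_parts:
        raise ValueError("fewer sentences than requested parts")

    out_dir.mkdir(parents=True, exist_ok=True)
    # Contiguous chunks; the first `extra` parts get one more sentence.
    base, extra = divmod(len(sentences), n_parts)
    written, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        out = out_dir / f"{src.stem}_{i:02d}{src.suffix}"
        with out.open("w", encoding="utf-8") as fh:
            for sent in sentences[start:start + size]:
                fh.write("\n".join(sent) + "\n\n")
        written.append(out)
        start += size
    return written
```

For example, ```split_corpus("corpus.conll", "corpus_split", n_parts=10)``` would write ```corpus_00.conll``` to ```corpus_09.conll``` into ```corpus_split```, which can then be used as the corpus path.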

