Open
Description
For a reason I haven't been able to identify, the TokenHandler class seems to require a minimum of 4 corpus (.conll) files to retrieve tokens correctly. If the corpus path contains only 1-3 corpus files, the transform_nodes_to_matrix function in mxutils.py ends up with an empty matrix and raises a ValueError, even when those 1-3 files hold exactly the same data as a set of 4 or more files. When I split my single large corpus file into 10 files, everything worked smoothly.
Below is a minimal example, followed by the error log from the run before I split the corpus into 10 files.
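For anyone who needs the same workaround, here is a minimal sketch of how a single corpus file could be split into several smaller ones. It is not part of nephosem: the names `split_at_boundaries` and `split_corpus_file` are hypothetical, and it assumes that sentences end with a blank line and that the file is UTF-8 encoded. If your corpus marks sentences differently (for example with a closing tag line), pass a different `is_boundary` function.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def is_blank(line: str) -> bool:
    """Assumed sentence boundary: a blank line. Adjust for your corpus format."""
    return line.strip() == ""


def split_at_boundaries(
    lines: Iterable[str],
    n_parts: int,
    is_boundary: Callable[[str], bool] = is_blank,
) -> List[List[str]]:
    """Split lines into at most n_parts chunks, cutting only after a boundary line."""
    lines = list(lines)
    target = max(1, -(-len(lines) // n_parts))  # ceiling division
    parts: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        current.append(line)
        # Close a chunk once it is large enough and we are at a sentence boundary;
        # the last chunk stays open so it collects whatever remains.
        if len(current) >= target and is_boundary(line) and len(parts) < n_parts - 1:
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts


def split_corpus_file(src: str, out_dir: str, n_parts: int = 10) -> List[Path]:
    """Write the chunks of src into out_dir as <stem>_01<suffix>, <stem>_02<suffix>, ..."""
    src_path, out_path = Path(src), Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    with src_path.open(encoding="utf-8") as fh:
        parts = split_at_boundaries(fh, n_parts)
    written = []
    for i, part in enumerate(parts, start=1):
        target_file = out_path / f"{src_path.stem}_{i:02d}{src_path.suffix}"
        target_file.write_text("".join(part), encoding="utf-8")
        written.append(target_file)
    return written
```

Because chunks are only cut at sentence boundaries, the function may return fewer than `n_parts` files when sentences are long relative to the file, so it is worth checking the length of the returned list.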
After importing all relevant packages and settings:
ifhan = ItemFreqHandler(settings=settings)
vocab = ifhan.build_item_freq()  # uses multiprocessing by default, which is overkill for the toy corpus
vocab_fname = f"{output_path}/{corpus_name}.nfreq"
vocab.save(vocab_fname)
query = vocab.subvocab(["girl/N"])
tokhan = TokenHandler(query, settings=settings)
tokens = tokhan.retrieve_tokens()
tokens
Log:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[21], line 2
1 tokhan = TokenHandler(query, settings=settings)
----> 2 tokens = tokhan.retrieve_tokens()
3 tokens
File /nephosem/models/typetoken.py:531, in TokenHandler.retrieve_tokens(self, fnames)
529 logger.info("Creating matrix...")
530 start = time.time()
--> 531 mtx = mxutils.transform_nodes_to_matrix(res, self.formatter.colloc_format)
532 logger.info(f"Finished matrix after {time.time()-start} seconds.")
533 return mtx
File /nephosem/specutils/mxutils.py:188, in transform_nodes_to_matrix(type2toks, colloc_fmt)
185 print(row_items) #AMGZ
186 print(col_items) #AMGZ
--> 188 tokmx = transform_dict_to_spmatrix(tok2collocs, row_items, col_items)
189 return matrix.TypeTokenMatrix(tokmx, row_items, col_items)
File /nephosem/specutils/mxutils.py:73, in transform_dict_to_spmatrix(dict_mtx, rowid2item, colid2item, verbose)
71 data.append(value)
72 if len(data) == 0:
---> 73 raise ValueError("Error: the input matrix is empty!!!")
74 dt = type(data[0])
75 ''' TODO: check data types
76 dtypes = {
77 int: np.int32,
(...)
88 data = np.array(data, dtype=dtypes[dt])
89 '''
ValueError: Error: the input matrix is empty!!!
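To check up front whether a corpus directory falls below the apparent four-file threshold, something like the following could be run before calling retrieve_tokens. This is only a sketch: the helper names are hypothetical, the threshold of 4 comes from my observation above rather than from the nephosem code, and it only looks at the top level of the directory (whether nephosem searches subdirectories is not something I have verified).

```python
from pathlib import Path

# Observed threshold from this report, not a documented nephosem constant.
OBSERVED_MIN_FILES = 4


def count_corpus_files(corpus_path: str, suffix: str = ".conll") -> int:
    """Count files with the given suffix directly inside corpus_path."""
    return sum(1 for p in Path(corpus_path).glob(f"*{suffix}") if p.is_file())


def has_enough_files(corpus_path: str, min_files: int = OBSERVED_MIN_FILES) -> bool:
    """Return True if corpus_path holds at least min_files corpus files."""
    return count_corpus_files(corpus_path) >= min_files
```

For example, `has_enough_files(corpus_path)` returning False would suggest splitting the corpus into more files before building the token matrix.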