Releases: lanl/T-ELF
v0.0.38
New Modules
Adds Penguin, Bunny, Peacock, and SeaLion modules:
- Penguin: Text storage tool.
- Bunny: Dataset generation tool for documents and their citations/references.
- Peacock: Data visualization and generation of actionable statistics.
- SeaLion: Generic report generation tool.
Bugs
- Fixes query index issue in Cheetah
v0.0.37
Adds three new modules for pre-processing text (Orca and iPenguin), and post-processing text (Wolf).
- Wolf: Graph centrality and ranking tool.
- iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
- Orca: Duplicate author detector for text mining and information retrieval.
v0.0.36
HNMFk graph post-processing & root node naming
- Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.
  - New functions:
    - `model.traverse_tiny_leaf_topics(threshold: int)`: Identifies outlier clusters where the number of documents is below the given threshold.
    - `model.get_tiny_leaf_topics()`: Retrieves tiny leaf nodes (processed separately).
    - `model.process_tiny_leaf_topics(threshold: int)`: Processes the graph to separate tiny nodes based on the given threshold.
  - Resetting the graph by setting `threshold=None` restores the tiny nodes.
- Added option to specify a root node name in HNMFk using `root_node_name="Root"`.
  - Default is now `"Root"` instead of `"*"` to resolve Windows compatibility issues.
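The thresholding idea behind the tiny-leaf functions can be sketched on a plain dictionary of leaf names and document counts. This is only an illustration: `split_tiny_leaves` and the leaf names are hypothetical and not part of T-ELF, and the actual `model.process_tiny_leaf_topics` may behave differently.

```python
from typing import Dict, Optional, Tuple


def split_tiny_leaves(
    leaf_doc_counts: Dict[str, int], threshold: Optional[int]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split leaves into (kept, tiny) by document count.

    Hypothetical helper, not a T-ELF function. A leaf counts as "tiny"
    when its document count is strictly below the threshold.
    threshold=None keeps every leaf, mirroring the reset behavior
    described in the release note.
    """
    if threshold is None:
        return dict(leaf_doc_counts), {}
    kept = {name: n for name, n in leaf_doc_counts.items() if n >= threshold}
    tiny = {name: n for name, n in leaf_doc_counts.items() if n < threshold}
    return kept, tiny


# Made-up leaf names and document counts, for illustration only.
leaves = {"leaf_a": 120, "leaf_b": 3, "leaf_c": 45, "leaf_d": 1}
kept, tiny = split_tiny_leaves(leaves, threshold=5)
print("kept:", sorted(kept))
print("tiny:", sorted(tiny))
```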
Bug(s)
- Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.
v0.0.35
- Fixes a bug with Cheetah on setting the default to empty string.
- Adds Logistic Matrix Factorization (LMF).
- Adds developer script to change versioning automatically.
- Updates documentation.
v0.0.34
Fast-tracking to v0.0.34 from v0.0.20
Enhancements
Pruning Support:
- Enabled pruning in `bnmf`, `wnmf`, and `nmf_recommender`.
- Added pruning of additional matrices, e.g., `MASK`, based on `X`.
- Included `pruned_cols` and `pruned_rows` in saved outputs.
Matrix Factorization:
- Introduced new submodule `BNMFk` under `NMFk` with `nmf_method='bnmf'`.
- Added `WEIGHT` and `MASK` keys for `WNMFk` and `BNMFk`.
- Implemented matrix deletion in subroutines to reduce memory consumption.
- Added `factor_thresholding` parameter to perform thresholding over `NMFk` factors, making them boolean. Options include:
  - `coord_desc_thresh`
  - `WH_thresh`
- Introduced `factor_thresholding_obj_params` for configuring thresholding subroutines.
- Added `clustering_method` parameter with options:
  - `kmeans`
  - `bool` or `boolean` (both are equivalent).
- Introduced `clustering_obj_params` to configure clustering subroutines.
- Added new perturbation type for boolean matrices: `perturb_type='boolean'` or `perturb_type='bool'`.
- Updated examples to reflect new boolean-specific features.
- Path compatibility using `os.path.join`.
Thresholding and Clustering:
- Added `factor_thresholding_H_regression` with options:
  - `otsu_thresh`
  - `coord_desc_thresh`
  - `kmeans_thresh`
- Default `factor_thresholding_H_regression` set to `kmeans_thresh`.
- Default `factor_thresholding` set to `otsu_thresh`.
- Introduced `factor_thresholding_H_regression_obj_params` to configure parameters.
- Added K-means-based boolean thresholding for `W` and `H` matrices:
  - Clusters values in each row of `W` and `H` into two groups; then the boolean threshold is the midpoint of cluster centroids.
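The K-means-based thresholding described above can be sketched with a small two-cluster, one-dimensional k-means per row. This is a minimal illustration in plain NumPy under stated assumptions (centroids initialised at the row minimum and maximum, strict `>` comparison against the threshold); the function names are hypothetical and the T-ELF `kmeans_thresh` routine may differ in its details.

```python
import numpy as np


def kmeans_row_threshold(row, n_iter=50):
    """Two-cluster 1-D k-means on one row; returns the centroid midpoint."""
    row = np.asarray(row, dtype=float)
    lo, hi = row.min(), row.max()  # assumed initialisation
    if lo == hi:
        return lo  # constant row: no meaningful split
    for _ in range(n_iter):
        # Assign each value to the nearer of the two centroids.
        to_hi = np.abs(row - hi) < np.abs(row - lo)
        new_lo, new_hi = row[~to_hi].mean(), row[to_hi].mean()
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def booleanize(M):
    """Threshold every row of M at its own k-means midpoint."""
    M = np.asarray(M, dtype=float)
    thresholds = np.array([kmeans_row_threshold(r) for r in M])
    return M > thresholds[:, None]


H = np.array([[0.1, 0.2, 0.9, 1.0],
              [0.0, 5.0, 0.1, 4.9]])
print(booleanize(H))
```

Because each row gets its own threshold, rows with very different scales (as in the second row here) are still split into a low and a high group.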
Hardware and Device Management:
- Added `device` parameter to `NMFk` for GPU management:
  - `device=-1`: Use all GPUs.
  - `device=0`: Use the GPU with ID 0.
  - `device=[0,1,...]`: Use a specific list of GPUs.
  - Negative values other than `-1`: Use `(number of GPUs + device + 1)`.
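One way to read the `device` rules is sketched below. It assumes that a negative value yields a *count* of GPUs (so `-1` gives all of them, `-2` gives all but one) and that the first GPUs by ID are the ones selected. Both are assumptions that the notes do not confirm, and `resolve_devices` is a hypothetical helper, not a T-ELF function.

```python
from typing import List, Sequence, Union


def resolve_devices(device: Union[int, Sequence[int]], n_gpus: int) -> List[int]:
    """Translate a `device` setting into a list of GPU IDs (illustrative)."""
    if isinstance(device, (list, tuple)):
        return list(device)          # explicit list of GPU IDs
    if device >= 0:
        return [device]              # a single GPU ID
    count = n_gpus + device + 1      # -1 -> n_gpus, -2 -> n_gpus - 1, ...
    if count <= 0:
        raise ValueError(f"device={device} leaves no GPUs out of {n_gpus}")
    return list(range(count))        # assumption: take the first `count` IDs


print(resolve_devices(-1, n_gpus=4))
print(resolve_devices(0, n_gpus=4))
print(resolve_devices([0, 2], n_gpus=4))
print(resolve_devices(-2, n_gpus=4))
```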
Hierarchical NMFk (HNMFk) Improvements:
- Added new variables for nodes:
  - `parent_node_factors_path`
  - `parent_node_k`
  - `factors_path`
- Enabled dynamic renaming of paths when loading HNMFk models from different directories.
- Improved decomposition behavior:
- Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
- Added signature, centroid, and probabilities from parent nodes to child nodes.
- Introduced graph iterator methods for navigating to specific nodes by name.
- Updated node naming conventions to use ancestor-based indexing.
Result Storage:
- Added `W_all` to saved outputs of `NMFk`.
Installation and Documentation
- Migrated to a new installation system using pip and Poetry.
- Added a post-installation script for simplifying setup on different systems.
- Updated documentation for:
- New installation methods on Chicoma and Darwin.
Bug Fixes
- Corrected HNMFk behavior to return total data indices instead of indices of indices.
- Corrected naming inconsistencies in pruning variables in `NMFk`.
- Fixed error calculation to consider only known locations when masking is applied.
- Resolved GPU transfer conflicts when using `MASK`.
- Fixed default `device` parameter in `NMFk` to be `-1` (use all devices).
- Addressed issues in `WNMFk` and `BNMFk` examples.
- Fixed checkpointing bugs:
  - Made saving checkpoints true by default.
  - Resolved issues when loading an HNMFk model during an ongoing process.
- Fixed scalar addition error with sparse matrices in `kl_mu`.
- Resolved dependency conflicts with `numpy` and `numba`.
. - Updated HPC documentation for T-ELF installation.
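To illustrate the masked error fix listed above (error computed only over known locations), here is a minimal sketch. It assumes a relative Frobenius error and a boolean mask where `True` marks a known entry; neither the error formula nor the mask convention is stated in the notes, and `masked_relative_error` is a hypothetical name.

```python
import numpy as np


def masked_relative_error(X, W, H, known):
    """Relative reconstruction error restricted to known entries of X."""
    known = np.asarray(known, dtype=bool)
    diff = (X - W @ H)[known]          # residuals at known locations only
    return np.linalg.norm(diff) / np.linalg.norm(X[known])


W = np.array([[1.0], [2.0]])
H = np.array([[1.0, 2.0, 3.0]])
X = W @ H
X[0, 0] = 100.0                        # corrupt one entry ...
known = np.ones_like(X, dtype=bool)
known[0, 0] = False                    # ... and mark it as unknown

print(masked_relative_error(X, W, H, known))                 # ignores the bad entry
print(masked_relative_error(X, W, H, np.ones_like(known)))   # includes it
```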
v0.0.20
Fixes a bug on HNMFk where the original indices were wrong.
v0.0.19
- Fixes a bug with HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes would be free on the job queue.
- Fixes a bug with BST post-order search where the order was incorrect.
- Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
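The three binary-search visiting orders can be sketched by treating the sorted `Ks` as an implicit balanced binary search tree whose root is the middle element. That tree construction is an assumption made for illustration, and `bst_visit_order` is a hypothetical helper; T-ELF's internal ordering may differ.

```python
def bst_visit_order(Ks, method="pre"):
    """Return the order in which ranks would be visited.

    The sorted Ks are treated as a balanced BST (middle element as root).
    method is one of "pre", "in", "post".
    """
    ks = sorted(Ks)

    def walk(lo, hi):
        if lo > hi:
            return []
        mid = (lo + hi) // 2
        left, right = walk(lo, mid - 1), walk(mid + 1, hi)
        if method == "pre":
            return [ks[mid]] + left + right
        if method == "in":
            return left + [ks[mid]] + right
        if method == "post":
            return left + right + [ks[mid]]
        raise ValueError(f"unknown method: {method}")

    return walk(0, len(ks) - 1)


Ks = range(1, 8)
for method in ("pre", "in", "post"):
    print(method, bst_visit_order(Ks, method))
```

Pre-order tries the middle rank first, so a good rank found early can prune the lower half before any of it is visited; in-order over a sorted tree visits ranks in ascending order.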
v0.0.18
- Fixes a bug where Ks were not organized correctly for BST post and pre order.
- Fixes a bug in `H_sill_thresh`; the threshold can now also be set to negative values.
- Adds option to use either W sill for k prediction, H sill for k prediction, or both. Selection of the `predict_k_method` also changes how the BST search is done with `k_search_method`. The NMFk hyper-parameters below are modified accordingly:
predict_k_method : str, optional
Method to use when performing automatic k prediction. Default is "WH_sill".
predict_k_method='pvalue' # will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' # will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' # will use Silhouette scores from W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' # will use Silhouette scores from H latent factor for estimating the number of latent factors.
predict_k_method='sill' # will default to ``predict_k_method='WH_sill'``.
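How the silhouette-based options select a score can be sketched as follows. The sketch assumes that the predicted k is the largest rank whose score meets a threshold, and the `sill_thresh=0.6` default is arbitrary; the `pvalue` option is not covered. `predict_k` here is a hypothetical stand-in, not the T-ELF routine.

```python
import numpy as np


def predict_k(Ks, sil_W, sil_H, method="WH_sill", sill_thresh=0.6):
    """Pick the largest k whose silhouette score meets the threshold.

    sil_W / sil_H hold one (minimum) silhouette score per rank in Ks.
    """
    sil_W, sil_H = np.asarray(sil_W), np.asarray(sil_H)
    if method in ("WH_sill", "sill"):   # 'sill' falls back to 'WH_sill'
        score = np.minimum(sil_W, sil_H)
    elif method == "W_sill":
        score = sil_W
    elif method == "H_sill":
        score = sil_H
    else:
        raise ValueError(f"unsupported method: {method}")
    good = [k for k, s in zip(Ks, score) if s >= sill_thresh]
    return max(good) if good else None


# Made-up silhouette scores for illustration.
Ks = [2, 3, 4, 5]
sil_W = [0.9, 0.5, 0.7, 0.2]
sil_H = [0.9, 0.7, 0.3, 0.1]

for method in ("WH_sill", "W_sill", "H_sill", "sill"):
    print(method, predict_k(Ks, sil_W, sil_H, method))
```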
v0.0.17
New Features
- Introduces a new Vulture subclass `VocabularyConsolidator`, under `TELF.pre_processing.Vulture.tokens_analysis`, designed to consolidate vocabularies and textual terms.
- Refactors NMFk, RESCALk, HNMFk, and SymNMFk to enhance modularity. Helper functions are created under `TELF.factorization.utilities` to modularize the code.
- Adds a new search criterion for identifying the optimal rank, or K, to NMFk, HNMFk, WNMFk, and RNMFk. This enhancement introduces a significant speedup to each algorithm. The new criterion utilizes a Binary Search Tree to streamline the process of determining the optimal rank, drastically reducing the search space and the time needed for factorization. Additionally, this K search feature is compatible with High Performance Computing (HPC) systems, ensuring that changes in the K search space by any node are synchronized across all nodes. NMFk has been updated to include new hyper-parameters tailored to these search settings.
k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".
* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found with ``min(W silhouette, H silhouette) >= sill_thresh``, all lower ranks are pruned from the search space.
H_sill_thresh : float, optional
Setting for removing higher ranks from the search space. The default is -1.
When searching for the optimal rank with binary search using ``k_search='bst_post'`` or ``k_search='bst_pre'``, this hyper-parameter can be used to cut off higher ranks from search space.
The cut-off of higher ranks from the search space is based on threshold for H silhouette. When a H silhouette below ``H_sill_thresh`` is found for a given rank or K, all higher ranks are removed from the search space.
If ``H_sill_thresh=-1``, it is not used.
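The two pruning rules described above (drop lower ranks once an ideal rank is found, drop higher ranks once the H silhouette falls below `H_sill_thresh`) can be sketched on a fixed visiting order. This is a single-process illustration with made-up silhouette values; `search_with_pruning` is a hypothetical helper and does not model the HPC synchronization mentioned in the notes.

```python
def search_with_pruning(order, sil_W, sil_H, sill_thresh, H_sill_thresh=-1):
    """Visit ranks in `order`, shrinking the search space as we go.

    sil_W / sil_H map each rank k to its silhouette score. In real use
    these would be computed by factorizing at rank k; here they are given.
    Returns (best rank found or None, list of ranks actually visited).
    """
    k_min, k_max = float("-inf"), float("inf")
    best, visited = None, []
    for k in order:
        if k < k_min or k > k_max:
            continue                      # rank was pruned earlier
        visited.append(k)
        if min(sil_W[k], sil_H[k]) >= sill_thresh:
            k_min, best = k, k            # ideal rank: prune all lower ranks
        if H_sill_thresh != -1 and sil_H[k] < H_sill_thresh:
            k_max = k                     # poor H silhouette: prune higher ranks
    return best, visited


# Made-up scores: ranks 1..3 are "good", higher ranks degrade.
sil_W = {k: (0.9 if k <= 3 else 0.4) for k in range(1, 8)}
sil_H = {1: 0.9, 2: 0.9, 3: 0.8, 4: 0.5, 5: 0.2, 6: 0.1, 7: 0.1}
pre_order = [4, 2, 1, 3, 6, 5, 7]         # a pre-order visit of Ks = 1..7

print(search_with_pruning(pre_order, sil_W, sil_H, sill_thresh=0.7))
print(search_with_pruning(pre_order, sil_W, sil_H, sill_thresh=0.7, H_sill_thresh=0.3))
```

In this example the upper cut-off saves the visit to rank 7, and the lower pruning skips rank 1 once rank 2 is found to be ideal.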
Bugs
- Fixes a bug in RESCALk plotting where plotting function was expecting W and H silhouettes.
- Fixes a bug where k prediction would not work if none of the `W` or `H` silhouettes are above the `sill_thresh` hyper-parameter. In that case, a new `sill_thresh` is now selected based on the rule: `self.sill_thresh = min([max(sils_min_W), max(sils_min_H)])`.
- Fixes a bug in document substitutions of Vulture where an error is raised if no corpus substitutions are passed.
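The fallback rule for `sill_thresh` can be written out as a small standalone function. The rule itself is quoted from the note above; the surrounding condition (using `>=` for "above") and the function name `effective_sill_thresh` are assumptions for illustration.

```python
def effective_sill_thresh(sils_min_W, sils_min_H, sill_thresh):
    """Return the threshold to use for k prediction.

    If at least one W or H silhouette reaches sill_thresh, keep it.
    Otherwise fall back to min(max(sils_min_W), max(sils_min_H)).
    """
    if max(sils_min_W) >= sill_thresh or max(sils_min_H) >= sill_thresh:
        return sill_thresh
    return min([max(sils_min_W), max(sils_min_H)])


# Made-up silhouette scores, one per candidate rank.
print(effective_sill_thresh([0.5, 0.4, 0.3], [0.45, 0.2, 0.1], sill_thresh=0.6))
print(effective_sill_thresh([0.9, 0.4, 0.3], [0.45, 0.2, 0.1], sill_thresh=0.6))
```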
v0.0.16
- Fixes a bug in the HPC HNMFk capability where checkpoints would not be saved when using custom callback functionality.
- Fixes a bug in the stopwords option in Vulture Clean that excludes hyphens from stop word checks (a boolean was used in place of an iterable).
- Fixes a bug to flatten the output dictionary in the Vulture Acronyms module (a dictionary iteration bug).
- Fixes a bug where the `itertools` import was missing in Vulture material permutations.
- Fixes a bug in Vulture materials permutations for the `save_path` definition.
definition. - Adds Ks range and X shape checks for HNMFk to make sure the decomposition can still be done if using a callback functionality.
- Adds a feature to include lowercased materials in permutations.
- Adds future for material permutations.
- Adds multithreaded string consolidation in Levenshtein.
- Changes the Levenshtein consolidation criterion from the shortest string to the most common string.
- Moves HNMFk leaf node termination, based on sample threshold, to after factorization to obtain the latent factors W and H even for nodes where the number of samples is less than the threshold.