
Releases: lanl/T-ELF

v0.0.38

26 Mar 15:01
9531315

New Modules

Adds Penguin, Bunny, Peacock, and SeaLion modules:

  • Penguin: Text storage tool.
  • Bunny: Dataset generation tool for documents and their citations/references.
  • Peacock: Data visualization and generation of actionable statistics.
  • SeaLion: Generic report generation tool.

Bugs

  • Fixes query index issue in Cheetah

v0.0.37

18 Mar 16:10
02c0c7b

Adds three new modules: two for pre-processing text (Orca and iPenguin) and one for post-processing text (Wolf).

  • Wolf: Graph centrality and ranking tool.
  • iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
  • Orca: Duplicate author detector for text mining and information retrieval.

v0.0.36

04 Mar 18:32
4561828

HNMFk graph post-processing & root node naming

  • Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.

    • New functions:
      • model.traverse_tiny_leaf_topics(threshold: int): Identifies outlier clusters where the number of documents is below the given threshold.
      • model.get_tiny_leaf_topics(): Retrieves tiny leaf nodes (processed separately).
      • model.process_tiny_leaf_topics(threshold: int): Processes the graph to separate tiny nodes based on the given threshold.
      • Resetting the graph by setting threshold=None restores the tiny nodes.
  • Added option to specify a root node name in HNMFk using root_node_name="Root".

    • Default is now "Root" instead of "*" to resolve Windows compatibility issues.
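The tiny-leaf post-processing described above can be pictured with a toy tree. This is a minimal sketch, not the T-ELF implementation: the `Node` class, the `find_tiny_leaves` helper, and the node names are hypothetical and only illustrate separating leaves whose document count falls below a threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an HNMFk graph node (not the T-ELF class)."""
    name: str
    num_docs: int
    children: List["Node"] = field(default_factory=list)


def find_tiny_leaves(root: Node, threshold: Optional[int]) -> List[str]:
    """Return names of leaf nodes holding fewer than `threshold` documents.

    A threshold of None is treated as "no filtering", so nothing is reported.
    """
    if threshold is None:
        return []
    tiny, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Push in reverse so children are visited left to right.
            stack.extend(reversed(node.children))
        elif node.num_docs < threshold:
            tiny.append(node.name)
    return tiny


# Toy graph; the names below are illustrative only.
root = Node("Root", 100, [
    Node("Root_0", 60, [Node("Root_0_0", 57), Node("Root_0_1", 3)]),
    Node("Root_1", 40),
])
tiny_names = find_tiny_leaves(root, threshold=5)
```

With a threshold of 5, only the 3-document leaf is flagged; passing `None` flags nothing, mirroring the reset behavior mentioned above.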

Bug(s)

  • Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.

v0.0.35

27 Jan 23:03
a84b855
  • Fixes a bug in Cheetah related to setting the default to an empty string.
  • Adds Logistic Matrix Factorization (LMF).
  • Adds developer script to change versioning automatically.
  • Updates documentation.

v0.0.34

07 Jan 01:20
ff683c8

Fast-tracking to v0.0.34 from v0.0.20

Enhancements

Pruning Support:

  • Enabled pruning in bnmf, wnmf, and nmf_recommender.
  • Added pruning of additional matrices, e.g., MASK, based on X.
  • Included pruned_cols and pruned_rows in saved outputs.
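To make the pruning items above concrete, here is a minimal sketch that assumes pruning means dropping rows and columns of X that contain only zeros and applying the same cut to a companion matrix such as MASK. The `prune` helper is hypothetical, and how T-ELF encodes `pruned_rows` and `pruned_cols` is not stated in these notes; the sketch simply returns the kept indices.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def prune(X: Matrix, extra: Matrix) -> Tuple[Matrix, Matrix, List[int], List[int]]:
    """Drop all-zero rows/columns of X and apply the same cut to `extra`.

    Returns the pruned X, the pruned companion matrix, and the indices of
    the rows and columns that were kept (an assumed bookkeeping format).
    """
    n_cols = len(X[0])
    kept_rows = [i for i, row in enumerate(X) if any(v != 0 for v in row)]
    kept_cols = [j for j in range(n_cols) if any(row[j] != 0 for row in X)]
    X_pruned = [[X[i][j] for j in kept_cols] for i in kept_rows]
    extra_pruned = [[extra[i][j] for j in kept_cols] for i in kept_rows]
    return X_pruned, extra_pruned, kept_rows, kept_cols


X = [[1, 0, 2],
     [0, 0, 0],
     [3, 0, 4]]
MASK = [[1, 1, 0],
        [1, 1, 1],
        [0, 1, 1]]
X_p, MASK_p, rows, cols = prune(X, MASK)
```

Keeping the index lists alongside the pruned matrices is what allows factors computed on the pruned data to be mapped back to the original shape.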

Matrix Factorization:

  • Introduced new submodule BNMFk under NMFk with nmf_method='bnmf'.
  • Added WEIGHT and MASK keys for WNMFk and BNMFk.
  • Implemented matrix deletion in subroutines to reduce memory consumption.
  • Added factor_thresholding parameter to perform thresholding over NMFk factors, making them boolean. Options include:
    • coord_desc_thresh
    • WH_thresh
  • Introduced factor_thresholding_obj_params for configuring thresholding subroutines.
  • Added clustering_method parameter with options:
    • kmeans
    • bool or boolean (both are equivalent).
  • Introduced clustering_obj_params to configure clustering subroutines.
  • Added new perturbation type for boolean matrices: perturb_type='boolean' or perturb_type='bool'.
  • Updated examples to reflect new boolean-specific features.
  • Improved path compatibility by using os.path.join.

Thresholding and Clustering:

  • Added factor_thresholding_H_regression with options:
    • otsu_thresh
    • coord_desc_thresh
    • kmeans_thresh
  • Default factor_thresholding_H_regression set to kmeans_thresh.
  • Default factor_thresholding set to otsu_thresh.
  • Introduced factor_thresholding_H_regression_obj_params to configure parameters.
  • Added K-means-based boolean thresholding for W and H matrices:
    • Clusters values in each row of W and H into two groups; then the boolean threshold is the midpoint of cluster centroids.
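The K-means-based thresholding above can be sketched in plain Python. This is an illustration under stated assumptions, not the T-ELF routine: `two_means_threshold` and `binarize_rows` are hypothetical helpers, the two centroids are initialized at the row minimum and maximum, and values strictly above the threshold are mapped to 1.

```python
from typing import List


def two_means_threshold(values: List[float], n_iter: int = 100) -> float:
    """Cluster 1-D values into two groups; return the centroid midpoint."""
    lo, hi = min(values), max(values)  # initial centroids: the extremes
    for _ in range(n_iter):
        # Assign each value to its nearest centroid (ties go to the low one).
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical, nothing to split
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def binarize_rows(M: List[List[float]]) -> List[List[int]]:
    """Threshold every row independently with its own two-means midpoint."""
    out = []
    for row in M:
        t = two_means_threshold(row)
        out.append([1 if v > t else 0 for v in row])
    return out


W = [[1, 9, 2, 8], [0.1, 0.2, 0.9, 1.0]]
W_bool = binarize_rows(W)
```

Computing the threshold per row lets each latent factor use its own scale, which a single global cut-off would not.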

Hardware and Device Management:

  • Added device parameter to NMFk for GPU management:
    • device=-1: Use all GPUs.
    • device=0: Use the GPU with ID 0.
    • device=[0,1,...]: Use a specific list of GPUs.
    • Negative values other than -1: Use (number of GPUs + device + 1).
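One possible reading of the device rules above is sketched below. The `resolve_devices` helper is hypothetical, and the handling of negative values is an assumption: the notes give the formula (number of GPUs + device + 1) but not whether it denotes a count, so the sketch treats it as a count of GPUs taken from the lowest IDs. That reading is at least consistent with device=-1 meaning all GPUs.

```python
from typing import List, Union


def resolve_devices(device: Union[int, List[int]], n_gpus: int) -> List[int]:
    """Map a device setting to a list of GPU IDs (illustrative only)."""
    if isinstance(device, list):
        return list(device)  # explicit list of GPU IDs
    if device >= 0:
        return [device]      # a single GPU ID
    # Negative values: ASSUMPTION that (n_gpus + device + 1) is a GPU count
    # and that the lowest-numbered IDs are the ones used.
    count = n_gpus + device + 1
    if count < 1:
        raise ValueError(f"device={device} is out of range for {n_gpus} GPUs")
    return list(range(count))


all_gpus = resolve_devices(-1, n_gpus=4)
all_but_one = resolve_devices(-2, n_gpus=4)
```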

Hierarchical NMFk (HNMFk) Improvements:

  • Added new variables for nodes:
    • parent_node_factors_path
    • parent_node_k
    • factors_path
  • Enabled dynamic renaming of paths when loading HNMFk models from different directories.
  • Improved decomposition behavior:
    • Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
  • Added signature, centroid, and probabilities from parent nodes to child nodes.
  • Introduced graph iterator methods for navigating to specific nodes by name.
  • Updated node naming conventions to use ancestor-based indexing.

Result Storage:

  • Added W_all to saved outputs of NMFk.

Installation and Documentation

  • Migrated to a new installation system using pip and Poetry.
  • Added a post-installation script for simplifying setup on different systems.
  • Updated documentation for:
    • New installation methods on Chicoma and Darwin.

Bug Fixes

  • Corrected HNMFk behavior to return total data indices instead of indices of indices.
  • Corrected naming inconsistencies in pruning variables in NMFk.
  • Fixed error calculation to consider only known locations when masking is applied.
  • Resolved GPU transfer conflicts when using MASK.
  • Fixed default device parameter in NMFk to be -1 (use all devices).
  • Addressed issues in WNMFk and BNMFk examples.
  • Fixed checkpointing bugs:
    • Made saving checkpoints true by default.
    • Resolved issues when loading an HNMFk model during an ongoing process.
  • Fixed scalar addition error with sparse matrices in kl_mu.
  • Resolved dependency conflicts with numpy and numba.
  • Updated HPC documentation for T-ELF installation.

v0.0.20

24 Jul 19:59
581cceb

Fixes a bug in HNMFk where the original indices were wrong.

v0.0.19

04 May 22:10
309eb02
  • Fixes a bug in HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes would be free on the job queue.
  • Fixes a bug with BST post-order search where the order was incorrect.
  • Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:

k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".

* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
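The difference between the search methods above comes down to the order in which candidate ranks are visited. The sketch below shows the textbook pre-, in-, and post-order traversals of a balanced binary search tree built over the sorted Ks, with no pruning applied. How T-ELF builds its tree is not specified in these notes, so the midpoint-as-root construction and the `visit_order` helper are assumptions.

```python
from typing import List


def visit_order(Ks: List[int], method: str = "linear") -> List[int]:
    """Order in which ranks would be visited, ignoring any pruning."""
    ks = sorted(Ks)
    if method == "linear":
        return ks
    if method not in ("bst_pre", "bst_in", "bst_post"):
        raise ValueError(f"unknown k_search_method: {method!r}")

    def walk(lo: int, hi: int) -> List[int]:
        # Balanced BST over ks[lo:hi]: the midpoint acts as the subtree root.
        if lo >= hi:
            return []
        mid = (lo + hi) // 2
        root, left, right = [ks[mid]], walk(lo, mid), walk(mid + 1, hi)
        if method == "bst_pre":
            return root + left + right
        if method == "bst_in":
            return left + root + right
        return left + right + root  # bst_post

    return walk(0, len(ks))


Ks = [1, 2, 3, 4, 5, 6, 7]
pre = visit_order(Ks, "bst_pre")
post = visit_order(Ks, "bst_post")
```

Pre-order evaluates the middle rank first, so a single good result can rule out a large part of the search space early. A plain in-order traversal of a binary search tree returns the ranks in sorted order.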

v0.0.18

29 Apr 23:42
a0b2be7
  • Fixes a bug where Ks were not organized correctly for BST post-order and pre-order search.
  • Fixes a bug in H_sill_thresh; the threshold can now also be set to negative values.
  • Adds the option to use the W silhouette, the H silhouette, or both for k prediction. The selected predict_k_method also changes how the BST search is done with k_search_method. The following NMFk hyper-parameters are modified accordingly:

predict_k_method : str, optional
Method to use when performing automatic k prediction. Default is "WH_sill".

predict_k_method='pvalue' # will use L-Statistics with column-wise error for automatically estimating the number of latent factors.
predict_k_method='WH_sill' # will use Silhouette scores from minimum of W and H latent factors for estimating the number of latent factors.
predict_k_method='W_sill' # will use Silhouette scores from W latent factor for estimating the number of latent factors.
predict_k_method='H_sill' # will use Silhouette scores from H latent factor for estimating the number of latent factors.
predict_k_method='sill' # will default to ``predict_k_method='WH_sill'``.
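The silhouette-based options above differ only in which score is compared against the threshold. The sketch below illustrates that choice. It assumes the predicted k is the largest rank whose score reaches `sill_thresh`, and it omits the 'pvalue' option. The `predict_k` helper and its arguments are hypothetical, not the T-ELF API.

```python
from typing import Optional, Sequence


def predict_k(Ks: Sequence[int], sil_W: Sequence[float], sil_H: Sequence[float],
              sill_thresh: float, method: str = "WH_sill") -> Optional[int]:
    """Pick a rank from per-k silhouette scores (illustrative only)."""
    if method == "sill":
        method = "WH_sill"  # 'sill' is an alias for 'WH_sill'
    if method == "WH_sill":
        scores = [min(w, h) for w, h in zip(sil_W, sil_H)]
    elif method == "W_sill":
        scores = list(sil_W)
    elif method == "H_sill":
        scores = list(sil_H)
    else:
        raise ValueError(f"method not covered by this sketch: {method!r}")
    # ASSUMPTION: choose the largest k whose score reaches the threshold.
    passing = [k for k, s in zip(Ks, scores) if s >= sill_thresh]
    return max(passing) if passing else None


Ks = [2, 3, 4, 5]
sil_W = [0.95, 0.90, 0.85, 0.40]
sil_H = [0.90, 0.85, 0.60, 0.90]
k_wh = predict_k(Ks, sil_W, sil_H, sill_thresh=0.8, method="WH_sill")
k_w = predict_k(Ks, sil_W, sil_H, sill_thresh=0.8, method="W_sill")
k_h = predict_k(Ks, sil_W, sil_H, sill_thresh=0.8, method="H_sill")
```

On this toy data the three options disagree, which shows why the choice of predict_k_method also affects which ranks a binary search would prune.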

v0.0.17

27 Apr 00:34
014900f

New Features

  • Introduces a new Vulture subclass VocabularyConsolidator, under TELF.pre_processing.Vulture.tokens_analysis, designed to consolidate vocabularies and textual terms.

  • Refactors NMFk, RESCALk, HNMFk, and SymNMFk to enhance modularity. Helper functions are created under TELF.factorization.utilities to modularize the code.

  • Adds a new search criterion for identifying the optimal rank, or K, to NMFk, HNMFk, WNMFk, and RNMFk. This enhancement introduces a significant speedup to each algorithm. The new criterion utilizes a Binary Search Tree to streamline the process of determining the optimal rank, drastically reducing the search space and the time needed for factorization. Additionally, this K search feature is compatible with High Performance Computing (HPC) systems, ensuring that changes in the K search space by any node are synchronized across all nodes. NMFk has been updated to include new hyper-parameters tailored to these search settings.

    k_search_method : str, optional
    Which approach to use when searching for the rank or k. The default is "linear".

    • k_search_method='linear' will linearly visit each K given in Ks hyper-parameter of the fit() function.
    • k_search_method='bst_post' will perform post-order binary search. When an ideal rank is found with min(W silhouette, H silhouette) >= sill_thresh, all lower ranks are pruned from the search space.
    • k_search_method='bst_pre' will perform pre-order binary search. When an ideal rank is found with min(W silhouette, H silhouette) >= sill_thresh, all lower ranks are pruned from the search space.

    H_sill_thresh : float, optional
    Setting for removing higher ranks from the search space. The default is -1.

    When searching for the optimal rank with binary search using k_search_method='bst_post' or k_search_method='bst_pre', this hyper-parameter can be used to cut off higher ranks from the search space.
    The cut-off is based on a threshold for the H silhouette: when an H silhouette below H_sill_thresh is found for a given rank or K, all higher ranks are removed from the search space.
    If H_sill_thresh=-1, it is not used.
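The two pruning rules described above (drop lower ranks once an ideal rank is found, drop higher ranks once the H silhouette falls below H_sill_thresh) can be combined in a short pre-order search. This is a sketch under stated assumptions rather than the T-ELF code: `bst_pre_search` and the `evaluate` callback, which returns a (W silhouette, H silhouette) pair for a rank, are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def bst_pre_search(Ks: List[int],
                   evaluate: Callable[[int], Tuple[float, float]],
                   sill_thresh: float,
                   H_sill_thresh: float = -1.0) -> Tuple[Optional[int], List[int]]:
    """Pre-order search over a balanced BST of ranks, with pruning."""
    ks = sorted(Ks)
    min_k, max_k = float("-inf"), float("inf")  # open interval still in play
    best: Optional[int] = None
    visited: List[int] = []

    def walk(lo: int, hi: int) -> None:
        nonlocal min_k, max_k, best
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        k = ks[mid]
        if min_k < k < max_k:  # skip ranks that were pruned earlier
            w_sil, h_sil = evaluate(k)
            visited.append(k)
            if min(w_sil, h_sil) >= sill_thresh:
                best, min_k = k, k  # ideal rank: prune all lower ranks
            # Silhouettes are never below -1, so the default of -1
            # can never trigger this cut-off.
            if h_sil < H_sill_thresh:
                max_k = k           # poor H silhouette: prune higher ranks
        walk(lo, mid)
        walk(mid + 1, hi)

    walk(0, len(ks))
    return best, visited


def toy_scores(k: int) -> Tuple[float, float]:
    """Made-up silhouettes: stable up to k=5, unstable beyond."""
    return (0.9, 0.9) if k <= 5 else (0.2, 0.2)


best_k, visited_ks = bst_pre_search(list(range(2, 9)), toy_scores,
                                    sill_thresh=0.8, H_sill_thresh=0.5)
```

On the toy scores, only three of the seven candidate ranks are evaluated.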

Bugs

  • Fixes a bug in RESCALk plotting where plotting function was expecting W and H silhouettes.
  • Fixes a bug where k prediction would not work if none of the W or H silhouettes were above the sill_thresh hyper-parameter. In that case, a new sill_thresh is now selected by the rule self.sill_thresh = min([max(sils_min_W), max(sils_min_H)]).
  • Fixes a bug in document substitutions of Vulture where an error is raised if no corpus substitutions are passed.
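The fallback rule for sill_thresh quoted in the list above can be checked with a small worked example. The `effective_sill_thresh` helper is hypothetical. The condition is taken to mean that no W or H silhouette reaches the threshold at any rank, which is one reading of the note and may differ from the exact test used in the library.

```python
from typing import Sequence


def effective_sill_thresh(sils_min_W: Sequence[float],
                          sils_min_H: Sequence[float],
                          sill_thresh: float) -> float:
    """Apply the fallback rule when no silhouette reaches the threshold."""
    all_scores = list(sils_min_W) + list(sils_min_H)
    if all(s < sill_thresh for s in all_scores):
        # Rule quoted in the release note.
        return min([max(sils_min_W), max(sils_min_H)])
    return sill_thresh


# No score reaches 0.8, so the threshold is lowered to min(0.6, 0.7).
lowered = effective_sill_thresh([0.6, 0.5], [0.7, 0.4], sill_thresh=0.8)
# At least one score reaches 0.8, so the threshold is left unchanged.
unchanged = effective_sill_thresh([0.9, 0.5], [0.7, 0.4], sill_thresh=0.8)
```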

v0.0.16

22 Apr 18:28
3207768
  • Fixes a bug in the HPC HNMFk capability where checkpoints would not be saved when using custom callback functionality.
  • Fixes a bug in the stopwords option of Vulture Clean that excludes hyphens from stop word checks (a boolean was used in place of an iterable).
  • Fixes a dictionary iteration bug when flattening the output dictionary in the Vulture Acronyms module.
  • Fixes a bug where the itertools import was missing in Vulture material permutations.
  • Fixes a bug in Vulture materials permutations for the save_path definition.
  • Adds Ks range and X shape checks to HNMFk to make sure the decomposition can still be done when using callback functionality.
  • Adds a feature to include lowercased materials in permutations.
  • Adds future for material permutations.
  • Adds multithreaded string consolidation in Levenshtein.
  • Changes the Levenshtein consolidation criterion from the shortest string to the most common string.
  • Moves the HNMFk leaf node termination check, based on the sample threshold, to after factorization, so that the latent factors W and H are obtained even for nodes where the number of samples is below the threshold.