
Releases: lanl/T-ELF

v0.0.41

08 May 23:20
52a836a

New Pre-Processing Module: Squirrel 🚀

Version 0.0.41 introduces a new pre-processing module called Squirrel, designed for automated document pruning. Squirrel streamlines the process of accepting or rejecting documents by applying predefined rules and thresholds, eliminating the need for manual review. Squirrel supports multiple pruning strategies; this release includes both embedding-based and LLM-based pruning:

  • Embedding-Based Pruning: This method filters documents based on their distance from a reference centroid in embedding space. Only documents within a specified threshold are retained, ensuring higher data quality.
  • LLM-Based Pruning: Squirrel leverages large language models to further refine the pruning process. It conducts multiple voting trials using LLM evaluations to determine whether a document should be accepted or rejected.
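A minimal sketch of the embedding-based idea (a hypothetical helper, not Squirrel's actual API): keep only documents whose cosine distance to the centroid of all document embeddings falls within a threshold.

```python
import numpy as np

def embedding_prune(embeddings, threshold):
    """Return indices of documents whose cosine distance to the
    centroid of all embeddings is at most `threshold`."""
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    # Cosine similarity of each document to the centroid
    sims = (X @ centroid) / (np.linalg.norm(X, axis=1) * np.linalg.norm(centroid))
    # Cosine distance = 1 - similarity; keep documents within the threshold
    return np.where(1.0 - sims <= threshold)[0]
```

Documents far from the centroid (here, the third one) are rejected; the threshold controls how aggressive the pruning is.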

v0.0.40

30 Apr 19:34
ce17cfa

🔧 Vulture Enhancements

  • Expanded Standard Cleaning Functions:
    Added the following to Vulture’s standard text cleaning pipeline:
    • remove_numbers: Removes stand-alone numbers.
    • remove_alphanumeric: Removes mixed alphanumeric terms (e.g., abc123).
    • remove_roman_numerals: Removes Roman numeral listings.
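Simplified regex sketches of what these three cleaners do (illustrative only; Vulture's actual implementations may differ):

```python
import re

def remove_numbers(text):
    # Drop stand-alone numbers ("42"), keeping digits embedded in words
    return re.sub(r"\b\d+\b", " ", text)

def remove_alphanumeric(text):
    # Drop tokens that mix letters and digits, e.g. "abc123"
    return re.sub(r"\b(?=\w*\d)(?=\w*[a-zA-Z])\w+\b", " ", text)

def remove_roman_numerals(text):
    # Naive: drop tokens composed only of Roman-numeral letters,
    # with an optional trailing dot as found in listings ("ii.", "IV.")
    return re.sub(r"\b[ivxlcdm]+\b\.?", " ", text, flags=re.IGNORECASE)
```

Note the Roman-numeral sketch is deliberately naive: it would also strip single-letter words like "I", which a production cleaner has to handle more carefully.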

🐆 Cheetah Additions

  • term_generator:

    • Extracts top keywords from cleaned text using TF-IDF.
    • Pairs keywords with nearby support terms based on a co-occurrence matrix.
    • Saves output as a structured markdown file of search terms.
  • CheetahTermFormatter:

    • Parses markdown search term files into structured blocks with optional filters (e.g., positives, negatives).
    • Supports plain string output or category-based filtering.
    • Can generate substitution maps to convert multi-word phrases into underscored versions and back.
  • convert_txt_to_cheetah_markdown:

    • Converts plain .txt files or structured term dictionaries into Cheetah-compatible markdown format.
    • Facilitates easier programmatic creation and editing of search term files.
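The co-occurrence pairing step in term_generator can be illustrated with a small sketch (hypothetical helper, not Cheetah's implementation): count how often each keyword shares a document with other terms, then report its strongest companions as support terms.

```python
from collections import Counter

def support_terms(docs, keywords, top_n=1):
    """For each keyword, return the top_n terms that most often
    co-occur with it across the given documents."""
    cooc = {k: Counter() for k in keywords}
    for doc in docs:
        tokens = set(doc.lower().split())
        for k in keywords:
            if k in tokens:
                # Count every other term in the same document
                cooc[k].update(tokens - {k})
    return {k: [t for t, _ in cooc[k].most_common(top_n)] for k in keywords}
```

A real pipeline would use a TF-IDF-weighted co-occurrence matrix rather than raw document-level counts, but the pairing logic is the same.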

🧹 Refactoring and Fixes

  • Code Refactoring:

    • Consolidated several duplicated functions across modules into shared helper utilities at a higher level.
  • Bug Fixes:

    • Vulture:
      • Fixed path-saving logic in operator pipelines.
      • Fixed bugs in the NER and Vocabulary Consolidator operators.
    • Beaver:
      • Resolved a file-saving issue that also affected Wolf’s visualization routines.
    • Fixed the README under examples to use the correct module links.
  • .gitignore Updates:

    • Added more output files and example notebook directories to .gitignore.

📁 New Example: NM Law Data Pipeline

Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)

  • 00_data_collection/:
    Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.

  • 01_hnmfk_operation/:
    Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).

  • 02_benchmarking/:
    Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.

  • 03_visualizations/:
    Visualizes legal trends, knowledge graphs, and model evaluation results.

v0.0.39

02 Apr 21:24
bf34fd5

🚀 New Features

New Modules Added

  • Fox: Report generation tool for text data from NMFk using OpenAI
  • ArcticFox: Report generation tool for text data from HNMFk using local LLMs
  • SPLIT: Joint NMFk factorization of multiple datasets via SPLIT
  • SPLITTransfer: Supervised transfer learning method via SPLIT and NMFk

Beaver Enhancements

  • Added support for automatically creating the directory specified by save_path when saving objects.

🐛 Bug Fixes

Beaver: Highlighting & Vocabulary Logic

  • Fixed an issue where tokens used in highlighting but not present in the provided vocabulary would fail to trigger re-vectorization.
  • The logic now ensures that the vocabulary is properly expanded and documents are re-vectorized accordingly.

Beaver: Trailing Newline in Output Files

  • Fixed a bug where an extra newline was added at the end of output text files such as Vocabulary.txt.

HNMFk: Model Loading Path

  • Fixed a bug with incorrect handling of the model path and name when loading an existing model.

Vulture: Module Imports

  • Resolved inconsistent module imports.

Conda Installation

  • Fixed .yml files by adding missing dependencies for proper conda installation.

v0.0.38

26 Mar 15:01
9531315

New Modules

Adds Penguin, Bunny, Peacock, and SeaLion modules:

  • Penguin: Text storage tool.
  • Bunny: Dataset generation tool for documents and their citations/references.
  • Peacock: Data visualization and generation of actionable statistics.
  • SeaLion: Generic report generation tool.

Bugs

  • Fixes a query index issue in Cheetah.

v0.0.37

18 Mar 16:10
02c0c7b

Adds three new modules: two for pre-processing text (Orca and iPenguin) and one for post-processing text (Wolf).

  • Wolf: Graph centrality and ranking tool.
  • iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
  • Orca: Duplicate author detector for text mining and information retrieval.

v0.0.36

04 Mar 18:32
4561828

HNMFk graph post-processing & root node naming

  • Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.

    • New functions:
      • model.traverse_tiny_leaf_topics(threshold: int): Identifies outlier clusters where the number of documents is below the given threshold.
      • model.get_tiny_leaf_topics(): Retrieves tiny leaf nodes (processed separately).
      • model.process_tiny_leaf_topics(threshold: int): Processes the graph to separate tiny nodes based on the given threshold.
      • Resetting the graph by setting threshold=None restores the tiny nodes.
  • Added option to specify a root node name in HNMFk using root_node_name="Root".

    • Default is now "Root" instead of "*" to resolve Windows compatibility issues.
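The tiny-leaf behavior can be illustrated with a small sketch (hypothetical helper, not HNMFk's internal code): partition leaf topics by document count, with threshold=None mirroring the reset behavior that restores all nodes.

```python
def split_tiny_leaves(leaf_doc_counts, threshold):
    """Partition leaf topics into (kept, tiny) by document count.
    threshold=None keeps everything, mirroring the reset behavior."""
    if threshold is None:
        return dict(leaf_doc_counts), {}
    tiny = {n: c for n, c in leaf_doc_counts.items() if c < threshold}
    kept = {n: c for n, c in leaf_doc_counts.items() if c >= threshold}
    return kept, tiny
```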

Bug(s)

  • Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.

v0.0.35

27 Jan 23:03
a84b855
  • Fixes a bug in Cheetah when setting the default to an empty string.
  • Adds Logistic Matrix Factorization (LMF).
  • Adds developer script to change versioning automatically.
  • Updates documentation.

v0.0.34

07 Jan 01:20
ff683c8

Fast-tracking to v0.0.34 from v0.0.20

Enhancements

Pruning Support:

  • Enabled pruning in bnmf, wnmf, and nmf_recommender.
  • Added pruning of additional matrices, e.g., MASK, based on X.
  • Included pruned_cols and pruned_rows in saved outputs.

Matrix Factorization:

  • Introduced new submodule BNMFk under NMFk with nmf_method='bnmf'.
  • Added WEIGHT and MASK keys for WNMFk and BNMFk.
  • Implemented matrix deletion in subroutines to reduce memory consumption.
  • Added factor_thresholding parameter to perform thresholding over NMFk factors, making them boolean. Options include:
    • coord_desc_thresh
    • WH_thresh
  • Introduced factor_thresholding_obj_params for configuring thresholding subroutines.
  • Added clustering_method parameter with options:
    • kmeans
    • bool or boolean (both are equivalent).
  • Introduced clustering_obj_params to configure clustering subroutines.
  • Added new perturbation type for boolean matrices: perturb_type='boolean' or perturb_type='bool'.
  • Updated examples to reflect new boolean-specific features.
  • Improved path compatibility by using os.path.join.
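Taken together, a boolean run might combine these options as below. This is a hypothetical parameter set assembled from the options listed above; the actual NMFk constructor signature may differ.

```python
# Sketch of a boolean-NMFk configuration; treat the key names as
# illustrative, not as the exact constructor signature.
bnmf_params = {
    "nmf_method": "bnmf",                  # new BNMFk submodule
    "factor_thresholding": "otsu_thresh",  # or coord_desc_thresh, WH_thresh
    "factor_thresholding_obj_params": {},
    "clustering_method": "bool",           # "kmeans", "bool"/"boolean"
    "clustering_obj_params": {},
    "perturb_type": "boolean",             # boolean-matrix perturbations
}
```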

Thresholding and Clustering:

  • Added factor_thresholding_H_regression with options:
    • otsu_thresh
    • coord_desc_thresh
    • kmeans_thresh
  • Default factor_thresholding_H_regression set to kmeans_thresh.
  • Default factor_thresholding set to otsu_thresh.
  • Introduced factor_thresholding_H_regression_obj_params to configure parameters.
  • Added K-means-based boolean thresholding for W and H matrices:
    • Clusters values in each row of W and H into two groups; then the boolean threshold is the midpoint of cluster centroids.
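The midpoint rule above can be sketched for a single row with a 1-D two-centroid Lloyd iteration (illustrative only; the library's implementation may differ):

```python
import numpy as np

def boolean_threshold(row, iters=10):
    """Cluster a row's values into two groups (k=2 K-means),
    then binarize at the midpoint of the two centroids."""
    row = np.asarray(row, dtype=float)
    c = np.array([row.min(), row.max()])          # initial centroids
    for _ in range(iters):
        assign = np.abs(row[:, None] - c[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(assign == j):
                c[j] = row[assign == j].mean()
    t = c.mean()                                  # midpoint of centroids
    return row >= t
```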

Hardware and Device Management:

  • Added device parameter to NMFk for GPU management:
    • device=-1: Use all GPUs.
    • device=0: Use the GPU with ID 0.
    • device=[0,1,...]: Use a specific list of GPUs.
    • Negative values other than -1: Use (number of GPUs + device + 1) GPUs.
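The rules above can be sketched as a small resolver (a hypothetical helper, not NMFk's internal code; which specific GPUs are chosen for negative values is an assumption here):

```python
def resolve_devices(device, n_gpus):
    """Translate the `device` parameter into a list of GPU IDs."""
    if isinstance(device, list):
        return device                             # explicit list of GPU IDs
    if device == -1:
        return list(range(n_gpus))                # all GPUs
    if device < -1:
        # e.g. device=-2 with 4 GPUs -> use 3 GPUs
        return list(range(n_gpus + device + 1))
    return [device]                               # single GPU ID
```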

Hierarchical NMFk (HNMFk) Improvements:

  • Added new variables for nodes:
    • parent_node_factors_path
    • parent_node_k
    • factors_path
  • Enabled dynamic renaming of paths when loading HNMFk models from different directories.
  • Improved decomposition behavior:
    • Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
  • Added signature, centroid, and probabilities from parent nodes to child nodes.
  • Introduced graph iterator methods for navigating to specific nodes by name.
  • Updated node naming conventions to use ancestor-based indexing.

Result Storage:

  • Added W_all to saved outputs of NMFk.

Installation and Documentation

  • Migrated to a new installation system using pip and Poetry.
  • Added a post-installation script for simplifying setup on different systems.
  • Updated documentation for:
    • New installation methods on Chicoma and Darwin.

Bug Fixes

  • Corrected HNMFk behavior to return total data indices instead of indices of indices.
  • Corrected naming inconsistencies in pruning variables in NMFk.
  • Fixed error calculation to consider only known locations when masking is applied.
  • Resolved GPU transfer conflicts when using MASK.
  • Fixed default device parameter in NMFk to be -1 (use all devices).
  • Addressed issues in WNMFk and BNMFk examples.
  • Fixed checkpointing bugs:
    • Made saving checkpoints true by default.
    • Resolved issues when loading an HNMFk model during an ongoing process.
  • Fixed scalar addition error with sparse matrices in kl_mu.
  • Resolved dependency conflicts with numpy and numba.
  • Updated HPC documentation for T-ELF installation.

v0.0.20

24 Jul 19:59
581cceb

Fixes a bug on HNMFk where the original indices were wrong.

v0.0.19

04 May 22:10
309eb02
  • Fixes a bug with HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes on the job queue would be freed.
  • Fixes a bug with BST post-order search where the order was incorrect.
  • Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:

k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".

* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
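The difference between the three BST orders can be sketched by generating the visiting order over a sorted list of candidate ranks via a balanced midpoint tree (illustrative only; this shows the traversal order, not the pruning logic driven by `predict_k_method`):

```python
def bst_orders(Ks):
    """Return the pre-, in-, and post-order visiting sequences
    of a balanced BST built over the sorted candidate ranks Ks."""
    def build(lo, hi, order, kind):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        if kind == "pre":
            order.append(Ks[mid])     # visit root first
        build(lo, mid - 1, order, kind)
        if kind == "in":
            order.append(Ks[mid])     # visit root between subtrees
        build(mid + 1, hi, order, kind)
        if kind == "post":
            order.append(Ks[mid])     # visit root last
    out = {}
    for kind in ("pre", "in", "post"):
        order = []
        build(0, len(Ks) - 1, order, kind)
        out[kind] = order
    return out
```

In-order visits ranks in ascending order like `linear`, but retains the BST structure needed to prune all lower ranks once an ideal rank is found.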