
Releases: lanl/T-ELF

v0.0.41

08 May 23:20
52a836a

New Pre-Processing Module: Squirrel 🚀

Version 0.0.41 introduces a new pre-processing module called Squirrel, designed for automated document pruning. Squirrel streamlines the process of accepting or rejecting documents by applying predefined rules and thresholds, eliminating the need for manual review. Squirrel supports multiple pruning strategies; this release includes both embedding-based and LLM-based pruning:

  • Embedding-Based Pruning: This method filters documents based on their distance from a reference centroid in embedding space. Only documents within a specified threshold are retained, ensuring higher data quality.
  • LLM-Based Pruning: Squirrel leverages large language models to further refine the pruning process. It conducts multiple voting trials using LLM evaluations to determine whether a document should be accepted or rejected.
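A minimal sketch of the embedding-based idea (a hypothetical helper, not Squirrel's actual API): keep only documents whose cosine distance to the centroid of all document embeddings falls within a threshold.

```python
import numpy as np

def embedding_prune(embeddings, threshold):
    """Return indices of documents whose cosine distance to the
    centroid of all embeddings is at most `threshold`."""
    X = np.asarray(embeddings, dtype=float)
    centroid = X.mean(axis=0)
    # Cosine similarity of each document to the centroid
    sims = (X @ centroid) / (np.linalg.norm(X, axis=1) * np.linalg.norm(centroid))
    # Cosine distance = 1 - similarity; keep documents within the threshold
    return np.where(1.0 - sims <= threshold)[0]
```

Documents far from the centroid (here, the third one) are rejected; the threshold controls how aggressive the pruning is.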

v0.0.40

30 Apr 19:34
ce17cfa

🔧 Vulture Enhancements

  • Expanded Standard Cleaning Functions:
    Added the following to Vulture’s standard text cleaning pipeline:
    • remove_numbers: Removes stand-alone numbers.
    • remove_alphanumeric: Removes mixed alphanumeric terms (e.g., abc123).
    • remove_roman_numerals: Removes Roman numeral listings.
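Simplified regex sketches of what these three cleaners do (illustrative only; Vulture's actual implementations may differ):

```python
import re

def remove_numbers(text):
    # Drop stand-alone numbers ("42"), keeping digits embedded in words
    return re.sub(r"\b\d+\b", " ", text)

def remove_alphanumeric(text):
    # Drop tokens that mix letters and digits, e.g. "abc123"
    return re.sub(r"\b(?=\w*\d)(?=\w*[a-zA-Z])\w+\b", " ", text)

def remove_roman_numerals(text):
    # Naive: drop tokens composed only of Roman-numeral letters,
    # with an optional trailing dot as found in listings ("ii.", "IV.")
    return re.sub(r"\b[ivxlcdm]+\b\.?", " ", text, flags=re.IGNORECASE)
```

Note the Roman-numeral sketch is deliberately naive: it would also strip single-letter words like "I", which a production cleaner has to handle more carefully.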

🐆 Cheetah Additions

  • term_generator:

    • Extracts top keywords from cleaned text using TF-IDF.
    • Pairs keywords with nearby support terms based on a co-occurrence matrix.
    • Saves output as a structured markdown file of search terms.
  • CheetahTermFormatter:

    • Parses markdown search term files into structured blocks with optional filters (e.g., positives, negatives).
    • Supports plain string output or category-based filtering.
    • Can generate substitution maps to convert multi-word phrases into underscored versions and back.
  • convert_txt_to_cheetah_markdown:

    • Converts plain .txt files or structured term dictionaries into Cheetah-compatible markdown format.
    • Facilitates easier programmatic creation and editing of search term files.
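The co-occurrence pairing step in term_generator can be illustrated with a small sketch (hypothetical helper, not Cheetah's implementation): count how often each keyword shares a document with other terms, then report its strongest companions as support terms.

```python
from collections import Counter

def support_terms(docs, keywords, top_n=1):
    """For each keyword, return the top_n terms that most often
    co-occur with it across the given documents."""
    cooc = {k: Counter() for k in keywords}
    for doc in docs:
        tokens = set(doc.lower().split())
        for k in keywords:
            if k in tokens:
                # Count every other term in the same document
                cooc[k].update(tokens - {k})
    return {k: [t for t, _ in cooc[k].most_common(top_n)] for k in keywords}
```

A real pipeline would use a TF-IDF-weighted co-occurrence matrix rather than raw document-level counts, but the pairing logic is the same.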

🧹 Refactoring and Fixes

  • Code Refactoring:

    • Consolidated several duplicated functions across modules into shared helper utilities at a higher level.
  • Bug Fixes:

    • Vulture:
      • Fixed path-saving logic in operator pipelines.
      • Fixed bugs in the NER and Vocabulary Consolidator operators.
    • Beaver:
      • Resolved a file-saving issue that also affected Wolf’s visualization routines.
    • Fixed the README under examples to use the correct module links.
  • .gitignore Updates:

    • Added more output files and example notebook directories to .gitignore.

📁 New Example: NM Law Data Pipeline

Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)

  • 00_data_collection/:
    Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.

  • 01_hnmfk_operation/:
    Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).

  • 02_benchmarking/:
    Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.

  • 03_visualizations/:
    Visualizes legal trends, knowledge graphs, and model evaluation results.

v0.0.39

02 Apr 21:24
bf34fd5

🚀 New Features

New Modules Added

  • Fox: Report generation tool for text data from NMFk using OpenAI
  • ArcticFox: Report generation tool for text data from HNMFk using local LLMs
  • SPLIT: Joint NMFk factorization of multiple datasets via SPLIT
  • SPLITTransfer: Supervised transfer learning method via SPLIT and NMFk

Beaver Enhancements

  • Added support for automatically creating the directory specified by save_path when saving objects.

🐛 Bug Fixes

Beaver: Highlighting & Vocabulary Logic

  • Fixed an issue where tokens used in highlighting but not present in the provided vocabulary would fail to trigger re-vectorization.
  • The logic now ensures that the vocabulary is properly expanded and documents are re-vectorized accordingly.

Beaver: Trailing Newline in Output Files

  • Fixed a bug where an extra newline was added at the end of output text files such as Vocabulary.txt.

HNMFk: Model Loading Path

  • Fixed a bug with incorrect handling of the model path and name when loading an existing model.

Vulture: Module Imports

  • Resolved inconsistent module imports.

Conda Installation

  • Fixed .yml files by adding missing dependencies for proper conda installation.

v0.0.38

26 Mar 15:01
9531315

New Modules

Adds Penguin, Bunny, Peacock, and SeaLion modules:

  • Penguin: Text storage tool.
  • Bunny: Dataset generation tool for documents and their citations/references.
  • Peacock: Data visualization and generation of actionable statistics.
  • SeaLion: Generic report generation tool.

Bugs

  • Fixes a query index issue in Cheetah.

v0.0.37

18 Mar 16:10
02c0c7b

Adds three new modules: two for pre-processing text (Orca and iPenguin) and one for post-processing text (Wolf).

  • Wolf: Graph centrality and ranking tool.
  • iPenguin: Online information retrieval tool for Scopus, SemanticScholar, and OSTI.
  • Orca: Duplicate author detector for text mining and information retrieval.

v0.0.36

04 Mar 18:32
4561828

HNMFk graph post-processing & root node naming

  • Added the ability to post-process HNMFk graphs based on the number of documents in leaf nodes.

    • New functions:
      • model.traverse_tiny_leaf_topics(threshold: int): Identifies outlier clusters where the number of documents is below the given threshold.
      • model.get_tiny_leaf_topics(): Retrieves tiny leaf nodes (processed separately).
      • model.process_tiny_leaf_topics(threshold: int): Processes the graph to separate tiny nodes based on the given threshold.
      • Resetting the graph by setting threshold=None restores the tiny nodes.
  • Added option to specify a root node name in HNMFk using root_node_name="Root".

    • Default is now "Root" instead of "*" to resolve Windows compatibility issues.
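The tiny-leaf behavior can be illustrated with a small sketch (hypothetical helper, not HNMFk's internal code): partition leaf topics by document count, with threshold=None mirroring the reset behavior that restores all nodes.

```python
def split_tiny_leaves(leaf_doc_counts, threshold):
    """Partition leaf topics into (kept, tiny) by document count.
    threshold=None keeps everything, mirroring the reset behavior."""
    if threshold is None:
        return dict(leaf_doc_counts), {}
    tiny = {n: c for n, c in leaf_doc_counts.items() if c < threshold}
    kept = {n: c for n, c in leaf_doc_counts.items() if c >= threshold}
    return kept, tiny
```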

Bug(s)

  • Fixed a bug in Beaver where mismatched indexes caused incorrect highlighting.

v0.0.35

27 Jan 23:03
a84b855
  • Fixes a bug in Cheetah when setting the default to an empty string.
  • Adds Logistic Matrix Factorization (LMF).
  • Adds developer script to change versioning automatically.
  • Updates documentation.

v0.0.34

07 Jan 01:20
ff683c8

Fast-tracking to v0.0.34 from v0.0.20

Enhancements

Pruning Support:

  • Enabled pruning in bnmf, wnmf, and nmf_recommender.
  • Added pruning of additional matrices, e.g., MASK, based on X.
  • Included pruned_cols and pruned_rows in saved outputs.

Matrix Factorization:

  • Introduced new submodule BNMFk under NMFk with nmf_method='bnmf'.
  • Added WEIGHT and MASK keys for WNMFk and BNMFk.
  • Implemented matrix deletion in subroutines to reduce memory consumption.
  • Added factor_thresholding parameter to perform thresholding over NMFk factors, making them boolean. Options include:
    • coord_desc_thresh
    • WH_thresh
  • Introduced factor_thresholding_obj_params for configuring thresholding subroutines.
  • Added clustering_method parameter with options:
    • kmeans
    • bool or boolean (both are equivalent).
  • Introduced clustering_obj_params to configure clustering subroutines.
  • Added new perturbation type for boolean matrices: perturb_type='boolean' or perturb_type='bool'.
  • Updated examples to reflect new boolean-specific features.
  • Improved path compatibility by using os.path.join.
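Taken together, a boolean run might combine these options as below. This is a hypothetical parameter set assembled from the options listed above; the actual NMFk constructor signature may differ.

```python
# Sketch of a boolean-NMFk configuration; treat the key names as
# illustrative, not as the exact constructor signature.
bnmf_params = {
    "nmf_method": "bnmf",                  # new BNMFk submodule
    "factor_thresholding": "otsu_thresh",  # or coord_desc_thresh, WH_thresh
    "factor_thresholding_obj_params": {},
    "clustering_method": "bool",           # "kmeans", "bool"/"boolean"
    "clustering_obj_params": {},
    "perturb_type": "boolean",             # boolean-matrix perturbations
}
```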

Thresholding and Clustering:

  • Added factor_thresholding_H_regression with options:
    • otsu_thresh
    • coord_desc_thresh
    • kmeans_thresh
  • Default factor_thresholding_H_regression set to kmeans_thresh.
  • Default factor_thresholding set to otsu_thresh.
  • Introduced factor_thresholding_H_regression_obj_params to configure parameters.
  • Added K-means-based boolean thresholding for W and H matrices:
    • Clusters values in each row of W and H into two groups; then the boolean threshold is the midpoint of cluster centroids.
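The midpoint rule above can be sketched for a single row with a 1-D two-centroid Lloyd iteration (illustrative only; the library's implementation may differ):

```python
import numpy as np

def boolean_threshold(row, iters=10):
    """Cluster a row's values into two groups (k=2 K-means),
    then binarize at the midpoint of the two centroids."""
    row = np.asarray(row, dtype=float)
    c = np.array([row.min(), row.max()])          # initial centroids
    for _ in range(iters):
        assign = np.abs(row[:, None] - c[None, :]).argmin(axis=1)
        for j in (0, 1):
            if np.any(assign == j):
                c[j] = row[assign == j].mean()
    t = c.mean()                                  # midpoint of centroids
    return row >= t
```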

Hardware and Device Management:

  • Added device parameter to NMFk for GPU management:
    • device=-1: Use all GPUs.
    • device=0: Use the GPU with ID 0.
    • device=[0,1,...]: Use a specific list of GPUs.
    • Negative values other than -1: Use (number of GPUs + device + 1) GPUs.
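The rules above can be sketched as a small resolver (a hypothetical helper, not NMFk's internal code; which specific GPUs are chosen for negative values is an assumption here):

```python
def resolve_devices(device, n_gpus):
    """Translate the `device` parameter into a list of GPU IDs."""
    if isinstance(device, list):
        return device                             # explicit list of GPU IDs
    if device == -1:
        return list(range(n_gpus))                # all GPUs
    if device < -1:
        # e.g. device=-2 with 4 GPUs -> use 3 GPUs
        return list(range(n_gpus + device + 1))
    return [device]                               # single GPU ID
```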

Hierarchical NMFk (HNMFk) Improvements:

  • Added new variables for nodes:
    • parent_node_factors_path
    • parent_node_k
    • factors_path
  • Enabled dynamic renaming of paths when loading HNMFk models from different directories.
  • Improved decomposition behavior:
    • Nodes with fewer samples than the sample threshold no longer decompose unnecessarily.
  • Added signature, centroid, and probabilities from parent nodes to child nodes.
  • Introduced graph iterator methods for navigating to specific nodes by name.
  • Updated node naming conventions to use ancestor-based indexing.

Result Storage:

  • Added W_all to saved outputs of NMFk.

Installation and Documentation

  • Migrated to a new installation system using pip and Poetry.
  • Added a post-installation script for simplifying setup on different systems.
  • Updated documentation for:
    • New installation methods on Chicoma and Darwin.

Bug Fixes

  • Corrected HNMFk behavior to return total data indices instead of indices of indices.
  • Corrected naming inconsistencies in pruning variables in NMFk.
  • Fixed error calculation to consider only known locations when masking is applied.
  • Resolved GPU transfer conflicts when using MASK.
  • Fixed default device parameter in NMFk to be -1 (use all devices).
  • Addressed issues in WNMFk and BNMFk examples.
  • Fixed checkpointing bugs:
    • Made saving checkpoints true by default.
    • Resolved issues when loading an HNMFk model during an ongoing process.
  • Fixed scalar addition error with sparse matrices in kl_mu.
  • Resolved dependency conflicts with numpy and numba.
  • Updated HPC documentation for T-ELF installation.

v0.0.20

24 Jul 19:59
581cceb

Fixes a bug on HNMFk where the original indices were wrong.

v0.0.19

04 May 22:10
309eb02
  • Fixes a bug with HNMFk checkpointing where, when continuing from a checkpoint on an HPC system, not all nodes on the job queue would be freed.
  • Fixes a bug with BST post-order search where the order was incorrect.
  • Adds BST in-order search capability. NMFk hyper-parameter changed accordingly:

k_search_method : str, optional
Which approach to use when searching for the rank or k. The default is "linear".

* ``k_search_method='linear'`` will linearly visit each K given in ``Ks`` hyper-parameter of the ``fit()`` function.
* ``k_search_method='bst_post'`` will perform post-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_pre'`` will perform pre-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
* ``k_search_method='bst_in'`` will perform in-order binary search. When an ideal rank is found, determined by the selected ``predict_k_method``, all lower ranks are pruned from the search space.
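The difference between the three BST orders can be sketched by generating the visiting order over a sorted list of candidate ranks via a balanced midpoint tree (illustrative only; this shows the traversal order, not the pruning logic driven by `predict_k_method`):

```python
def bst_orders(Ks):
    """Return the pre-, in-, and post-order visiting sequences
    of a balanced BST built over the sorted candidate ranks Ks."""
    def build(lo, hi, order, kind):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        if kind == "pre":
            order.append(Ks[mid])     # visit root first
        build(lo, mid - 1, order, kind)
        if kind == "in":
            order.append(Ks[mid])     # visit root between subtrees
        build(mid + 1, hi, order, kind)
        if kind == "post":
            order.append(Ks[mid])     # visit root last
    out = {}
    for kind in ("pre", "in", "post"):
        order = []
        build(0, len(Ks) - 1, order, kind)
        out[kind] = order
    return out
```

In-order visits ranks in ascending order like `linear`, but retains the BST structure needed to prune all lower ranks once an ideal rank is found.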