tweetopic

⚡ Blazing Fast topic modelling over short texts in Python

Features

Fast ⚡
Scalable 💥
High consistency and coherence 🎯
High quality topics 🔥
Easy visualization and inspection 👀
Full scikit-learn compatibility 🔩

New in version 0.4.0 ✨

You can now pass random_state to topic models to make your results reproducible.

from tweetopic import DMM

model = DMM(10, random_state=42)

🛠 Installation

Install from PyPI:

pip install tweetopic

👩‍💻 Usage (documentation)

Train a topic model on a corpus of short texts:

from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
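
The `min_df=15, max_df=0.1` settings above prune the vocabulary by document frequency: an integer `min_df` is an absolute number of documents, and a float `max_df` is a proportion of the corpus. The sketch below illustrates that filtering rule in plain Python on a toy corpus with smaller thresholds. `filter_vocabulary` is a hypothetical helper written for this illustration, not part of scikit-learn or tweetopic.

```python
from collections import Counter


def filter_vocabulary(docs, min_df, max_df):
    """Keep terms that occur in at least `min_df` documents (absolute count)
    and in at most a `max_df` proportion of all documents.

    Hypothetical helper that imitates the pruning rule; not a library function.
    """
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]

# "the" is in every document (too common); "dog", "ran", "bird" and "flew"
# are each in one document only (too rare).
vocabulary = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(vocabulary)  # ['cat', 'sat']
```

On short texts this kind of pruning matters a great deal, since each document contributes only a handful of tokens and very rare or very common terms carry little topical signal.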

You may fit the model with a stream of short texts:

pipeline.fit(texts)
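
For intuition about what fitting a Dirichlet Multinomial Mixture involves, here is a minimal pure-Python sketch of the collapsed Gibbs sampler described by Yin and Wang (2014), in which every document is assigned to exactly one topic. It is an illustration under simplifying assumptions, not tweetopic's actual implementation: `fit_dmm` and its `seed` argument are hypothetical names, while `n_components`, `n_iterations`, `alpha` and `beta` are only meant to mirror the parameters used above.

```python
import math
import random
from collections import Counter


def fit_dmm(docs, n_components, n_iterations=30, alpha=0.1, beta=0.1, seed=0):
    """Illustrative collapsed Gibbs sampler for a Dirichlet Multinomial Mixture.

    `docs` is a list of token lists; returns one topic label per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    docs_in_topic = [0] * n_components   # number of documents per topic
    words_in_topic = [0] * n_components  # number of tokens per topic
    word_counts = [Counter() for _ in range(n_components)]  # per-topic term counts

    def move(d, z, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic z."""
        docs_in_topic[z] += sign
        words_in_topic[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            word_counts[z][w] += sign * c

    # Random initial assignment.
    labels = [rng.randrange(n_components) for _ in docs]
    for d, z in enumerate(labels):
        move(d, z, +1)

    for _ in range(n_iterations):
        for d in range(len(docs)):
            move(d, labels[d], -1)
            log_p = []
            for z in range(n_components):
                # Unnormalized log-probability of topic z given all other documents.
                lp = math.log(docs_in_topic[z] + alpha)
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(word_counts[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(words_in_topic[z] + vocab_size * beta + i)
                log_p.append(lp)
            # Normalize in a numerically stable way, then sample a new topic.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_components), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


docs = [
    "cat dog pet".split(),
    "dog pet vet".split(),
    "stock market price".split(),
    "market price trade".split(),
]
labels = fit_dmm(docs, n_components=2, seed=42)
```

Because all randomness flows through a single seeded generator, calling `fit_dmm` again with the same `seed` returns the same labels, which is the kind of reproducibility a `random_state` argument is meant to provide.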

To investigate the internal structure of topics and their relations to words and individual documents, we recommend using topicwizard.

Install it from PyPI:

pip install topic-wizard

Then visualize your topic model:

import topicwizard

topicwizard.visualize(pipeline=pipeline, corpus=texts)

🎓 References

Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.
