Cross-discipline paper clustering #8

@akuligowski9

Summary

Group related papers that appear across different disciplines based on topic similarity, rather than just source-level deduplication.

Context

The app already deduplicates papers within a discipline using URL matching (PaperDeduplicator), but conceptually related papers across disciplines (e.g., a neuroscience paper and a CS/AI paper about the same neural network technique) are not connected. Clustering would surface these cross-discipline connections.

Possible approaches

  1. Keyword/TF-IDF clustering — Extract keywords from titles and abstracts, cluster by similarity. Simple, no external API needed.
  2. Embedding-based clustering — Use an embedding model (Gemini, OpenAI, or local) to compute paper embeddings, then cluster with k-means or DBSCAN.
  3. LLM-assisted grouping — Send all paper titles to the AI provider and ask it to identify groups. Simplest to implement but uses API quota.
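
As a rough illustration of approach 1, the sketch below builds TF-IDF vectors from titles and abstracts using only the standard library, then reports pairs of papers from different disciplines whose cosine similarity clears a threshold. The paper fields (`title`, `abstract`, `discipline`), the stopword list, and the `0.3` threshold are assumptions made for the example, not part of the existing app.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real one would be longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so terms present in every document keep a nonzero weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cross_discipline_pairs(papers, threshold=0.3):
    """Return (i, j, score) for similar papers from different disciplines."""
    vectors = tfidf_vectors([p["title"] + " " + p["abstract"] for p in papers])
    pairs = []
    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            if papers[i]["discipline"] == papers[j]["discipline"]:
                continue  # same-discipline matches are not the goal here
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


papers = [
    {"discipline": "neuroscience", "title": "Spiking neural network plasticity",
     "abstract": "We study spiking neural network dynamics in cortex"},
    {"discipline": "cs", "title": "Training spiking neural network models",
     "abstract": "A spiking neural network trained with gradients"},
    {"discipline": "biology", "title": "Soil microbiome diversity",
     "abstract": "Survey of forest soil microbes"},
]
pairs = cross_discipline_pairs(papers)
```

The pairwise comparison is quadratic in the number of papers, which is probably acceptable for a single digest but would need revisiting for larger inputs.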

What to build

  1. A PaperClusterer service that takes the full digest and returns cluster assignments
  2. A UI section showing "Related across disciplines" with grouped papers
  3. Configurable: opt-in via a checkbox or setting (clustering adds latency)
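
One possible shape for the service in item 1, with the opt-in flag from item 3, is sketched below. Everything here is hypothetical: the `Paper` fields, the flat-list digest, the `enabled` flag, and the default `title_jaccard` similarity are placeholders, since the issue does not specify the digest structure or the similarity measure. The similarity function is injectable so that any of the approaches above could be swapped in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical fields; the real digest model may differ.
    url: str
    title: str
    discipline: str


def title_jaccard(a, b):
    """Placeholder similarity: word overlap between two titles."""
    wa, wb = set(a.title.lower().split()), set(b.title.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


class PaperClusterer:
    def __init__(self, similarity=title_jaccard, threshold=0.3, enabled=False):
        self.similarity = similarity
        self.threshold = threshold
        self.enabled = enabled  # opt-in: clustering adds latency

    def cluster(self, digest):
        """Return groups of papers that span at least two disciplines."""
        if not self.enabled:
            return []
        parent = list(range(len(digest)))

        def find(i):
            # Union-find root lookup with path halving.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Link every pair whose similarity clears the threshold.
        for i in range(len(digest)):
            for j in range(i + 1, len(digest)):
                if self.similarity(digest[i], digest[j]) >= self.threshold:
                    parent[find(i)] = find(j)

        groups = {}
        for i, paper in enumerate(digest):
            groups.setdefault(find(i), []).append(paper)
        # Keep only groups that actually cross discipline boundaries.
        return [g for g in groups.values() if len({p.discipline for p in g}) > 1]


digest = [
    Paper("u1", "spiking neural network plasticity", "neuroscience"),
    Paper("u2", "training spiking neural network models", "cs"),
    Paper("u3", "soil microbiome diversity", "biology"),
]
clusters = PaperClusterer(enabled=True).cluster(digest)
```

Connected components give each paper exactly one cluster, which keeps the UI grouping simple but does not cover the multi-cluster case.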

Considerations

  • This is a significant feature — consider starting with approach 1 (keyword-based) as a proof of concept
  • The digest is generated per session, so clustering only needs to run once, after generation completes
  • Papers may belong to multiple clusters
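
To illustrate the last point, the following sketch lets a paper appear in more than one cluster by treating each keyword shared across disciplines as its own cluster. The `(paper_id, discipline, keywords)` tuples and the keyword-as-cluster-label scheme are assumptions for illustration only; keyword extraction itself is left out.

```python
from collections import defaultdict


def overlapping_clusters(papers):
    """Map each cross-discipline keyword to the ids of papers that use it.

    papers: iterable of (paper_id, discipline, keywords) tuples (hypothetical shape).
    """
    by_keyword = defaultdict(list)
    for paper_id, discipline, keywords in papers:
        # sorted(set(...)) removes duplicates and keeps the order deterministic.
        for kw in sorted(set(keywords)):
            by_keyword[kw].append((paper_id, discipline))
    clusters = {}
    for kw, members in by_keyword.items():
        # Only keywords seen in two or more disciplines form a cluster.
        if len({d for _, d in members}) > 1:
            clusters[kw] = [pid for pid, _ in members]
    return clusters


def assignments(clusters):
    """Invert the clusters into paper_id -> list of cluster labels."""
    result = defaultdict(list)
    for label, ids in clusters.items():
        for pid in ids:
            result[pid].append(label)
    return dict(result)


papers = [
    ("p1", "neuroscience", ["attention", "spiking"]),
    ("p2", "cs", ["attention"]),
    ("p3", "physics", ["spiking"]),
    ("p4", "cs", ["compilers"]),
]
clusters = overlapping_clusters(papers)
membership = assignments(clusters)
```

Here `p1` ends up in both the `attention` and `spiking` clusters, while `p4` is in none because its only keyword never crosses a discipline boundary.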

Labels: ambitious (Large scope, multiple components)
