Skip to content

Define optimal chunking strategy for HN posts #13

@EndlessReform

Description

@EndlessReform

Feature Summary

  • Category: Data Collection
  • Priority: High

Background and Objective

  • Problem Statement: HN posts need to be represented optimally for later unsupervised segmentation
  • Objective: Determine what additional information besides post title and URL, if any, should be encoded

Technical Requirements

  • Languages and Frameworks: Python - sklearn, PyTorch
  • Dependencies:
    • representative extract of data to test
    • Embedding model selected

Tests and Evaluation Metrics

Evaluation Metrics: (e.g., Inertia, Silhouette Score for $k$-means clustering):

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions