From 1b4ec9e836c9d1e87ef4fa5d9a9466c25f64920e Mon Sep 17 00:00:00 2001
From: yichuan520030910320
Date: Tue, 2 Jun 2026 11:41:18 -0700
Subject: [PATCH] Add Data Curation subsection to Training

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/README.md b/README.md
index 5920aff..00b73cb 100644
--- a/README.md
+++ b/README.md
@@ -193,6 +193,13 @@
 
 We also release the full training set ([`Chrisyichuan/screenshot-training-natural-filtered-v2`](https://huggingface.co/datasets/Chrisyichuan/screenshot-training-natural-filtered-v2)), so you can adapt other backbones yourself — a larger Qwen, or any other embedding model.
 
+### Data Curation
+
+Visualization of some very early version of the training data:
+[early training data viewer](https://yichuan-w.github.io/share/blog-review-first100-light/)
+
+Reproduce: TBD
+
 ## Acknowledgments
 
 Thanks to [Rulin Shao](https://rulinshao.github.io/) for support.