SB-0MNBSEY9G0055O5N: Document content extraction strategy and alternatives #20

Merged
SorraTheOrc merged 1 commit into main from sb-0mnbsey9g0055o5n-content-extraction-documentation
Apr 12, 2026

Conversation

@SorraTheOrc
Member

Summary

  • Document current @extractus/article-extractor pipeline
  • Create comparison matrix of alternative libraries (@mozilla/readability, @postlight/parser, cheerio, JSDOM, Puppeteer, Playwright)
  • Document when to use each approach
  • Add extraction performance benchmarks
  • Document tiered fallback strategies
  • Create ADR for extraction approach
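For readers who have not opened the strategy document, a tiered fallback of the kind listed above could look roughly like the sketch below. This is an illustration only, not the pipeline described in the PR: `Tier`, `extractWithFallback`, the `minLength` threshold, and the three stand-in tiers are all hypothetical names, and tiers are modelled as synchronous functions over already-fetched HTML for brevity (a browser-based tier would need an async variant).

```typescript
// Hypothetical types and names throughout; none of this is taken from the PR.
interface Extracted {
  title: string;
  content: string;
}

// A tier takes raw HTML and returns a result, or null when it cannot extract.
type Tier = { name: string; run: (html: string) => Extracted | null };

interface Outcome {
  tier: string;
  result: Extracted;
}

// Try each tier in order. A tier is skipped when it throws, returns null,
// or yields less than minLength characters of trimmed content.
function extractWithFallback(
  html: string,
  tiers: Tier[],
  minLength = 200
): Outcome | null {
  for (const tier of tiers) {
    try {
      const result = tier.run(html);
      if (result !== null && result.content.trim().length >= minLength) {
        return { tier: tier.name, result };
      }
    } catch {
      // A failing tier should not abort the chain; fall through to the next one.
    }
  }
  return null;
}

// Stand-in tiers for illustration; real tiers would wrap actual extractors.
const tiers: Tier[] = [
  { name: "primary", run: () => { throw new Error("parse failure"); } },
  { name: "secondary", run: () => ({ title: "t", content: "short" }) },
  {
    name: "last-resort",
    // Crude tag stripping, used here only so the example is self-contained.
    run: (html) => ({ title: "untitled", content: html.replace(/<[^>]+>/g, " ") }),
  },
];
```

Ordering tiers from cheapest to most expensive keeps the common case fast, and returning the tier name alongside the result makes it possible to record how often each fallback is actually used.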

Changes

  • Created docs/feature-requests/content-extraction-strategy.md

Testing

  • Documentation only, no code changes

…tives

- Document current @extractus/article-extractor pipeline
- Create comparison matrix of alternative libraries
- Document when to use each approach
- Add extraction performance benchmarks
- Document fallback strategies with tiered approach
- Create ADR for extraction strategy
- Include related work references

@SorraTheOrc merged commit 6366a29 into main on Apr 12, 2026
@SorraTheOrc deleted the sb-0mnbsey9g0055o5n-content-extraction-documentation branch on April 12, 2026 at 22:43