@mishig25
Contributor

Description

This PR adds a Python script to strip HTML tags from markdown documentation files and convert docstrings to clean markdown format.

Features

  • HTML stripping: Removes all HTML tags (divs, docstrings, tips, etc.) while preserving content
  • Docstring formatting: Converts complex docstring blocks to clean markdown with:
    • Level 4 headers in anchor format: #### ClassName[[anchor]]
    • class and def prefixes removed from names
    • Source links in markdown format
    • Parameters formatted without bullets, using a : separator
    • Blank lines between parameters for readability
    • Return types and descriptions
  • Flexible processing: Can process single files or entire directories recursively
  • Preserves markdown: Maintains code blocks, links, and other markdown formatting
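
As a rough illustration of the "strip HTML but preserve code blocks" idea, here is a minimal sketch. It is not the code in strip_html_from_md.py; the function name `strip_html_outside_code` and the regexes are assumptions made for this example only.

```python
import re

# Build the fence marker programmatically so this snippet can itself sit
# inside a fenced block.
_FENCE = "`" * 3
# Capturing group: re.split keeps the fenced blocks in the result list.
FENCED_BLOCK = re.compile("(" + _FENCE + ".*?" + _FENCE + ")", flags=re.DOTALL)
# Simplified tag pattern: "<" or "</", a letter, then anything up to ">".
# Limitation: this also matches autolinks such as <https://example.com>
# and does not protect inline code spans.
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html_outside_code(text: str) -> str:
    """Remove HTML tags from markdown text, leaving fenced code blocks untouched.

    Hypothetical helper, not part of the actual script.
    """
    parts = FENCED_BLOCK.split(text)
    # With one capturing group, odd indices are the fenced blocks.
    return "".join(
        part if index % 2 else HTML_TAG.sub("", part)
        for index, part in enumerate(parts)
    )
```

Tag contents are kept and only the tags themselves are dropped, so `<Tip>note</Tip>` becomes `note`, while anything between code fences passes through unchanged.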

Usage

# Single file
python3 src/scripts/strip_html_from_md.py input.md -o output.md

# Directory (recursive)
python3 src/scripts/strip_html_from_md.py docs/ -o clean_docs/ --recursive
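
For readers wondering how a single path argument can cover both cases above, one possible approach is sketched below. The helper `iter_markdown_files` is hypothetical and only illustrates the file-versus-directory branching; the real script may select files differently.

```python
from pathlib import Path


def iter_markdown_files(path, recursive=False):
    """Return the markdown files to process for a file or directory path.

    Hypothetical helper, not taken from strip_html_from_md.py.
    """
    path = Path(path)
    if path.is_file():
        # Single-file mode: process exactly the given file.
        return [path]
    # Directory mode: "**/*.md" descends into subdirectories,
    # "*.md" only looks at the top level.
    pattern = "**/*.md" if recursive else "*.md"
    return sorted(path.glob(pattern))
```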

Files Added

  • src/scripts/strip_html_from_md.py - Main script
  • README_strip_html.md - Documentation and usage examples

Example Transformation

Before:

<div class="docstring...">
<docstring><name>class transformers.BertConfig</name>
<anchor>transformers.BertConfig</anchor>
<paramsdesc>- **vocab_size** (`int`) -- Description</paramsdesc>
</docstring>
This is a config class.
</div>

After:

#### transformers.BertConfig[[transformers.BertConfig]]

This is a config class.

**Parameters:**

vocab_size (`int`) : Description
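
The transformation above could be reproduced with something like the following sketch. It is written only against the example shown here, assuming the `<name>`, `<anchor>`, and `<paramsdesc>` tags appear as in the "Before" block; `convert_docstring_block` is a hypothetical name and this is not the script's actual implementation.

```python
import re


def convert_docstring_block(block: str) -> str:
    """Convert one docstring <div> block to markdown (illustrative only)."""

    def tag(name: str) -> str:
        # Text inside <name>...</name>, or "" if the tag is absent.
        match = re.search(rf"<{name}>(.*?)</{name}>", block, flags=re.DOTALL)
        return match.group(1).strip() if match else ""

    # Drop a leading "class " or "def " from the object name.
    name = re.sub(r"^(class|def)\s+", "", tag("name"))
    anchor = tag("anchor")
    params = tag("paramsdesc")

    # Description: whatever is left once the <docstring> element and
    # any remaining tags are removed.
    body = re.sub(r"<docstring>.*?</docstring>", "", block, flags=re.DOTALL)
    body = re.sub(r"<[^>]+>", "", body).strip()

    lines = [f"#### {name}[[{anchor}]]", "", body]
    if params:
        lines += ["", "**Parameters:**"]
        for item in params.splitlines():
            # "- **name** (type) -- description" -> "name (type) : description"
            match = re.match(r"-\s+\*\*(.+?)\*\*\s*(.*?)\s*--\s*(.*)", item.strip())
            if match:
                pname, ptype, desc = match.groups()
                lines += ["", f"{pname} {ptype} : {desc}"]
    return "\n".join(lines)
```

Each parameter is preceded by a blank line, which gives the "blank lines between parameters" layout described in the features list.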

- Add strip_html_from_md.py script to convert HTML-heavy docs to clean markdown
- Strip HTML tags from docstrings and convert to level 4 headers
- Format class/function names with anchors: #### ClassName[[anchor]]
- Remove 'class' and 'def' prefixes from names
- Format parameters without bullets, using : separator with blank lines
- Support processing single files or directories recursively
- Add comprehensive README with usage examples and documentation
@mishig25 mishig25 merged commit d2d6a20 into main Nov 12, 2025
4 checks passed
@mishig25 mishig25 deleted the add-html-stripper-script branch November 12, 2025 13:53
@stevhliu
Member

stevhliu commented Nov 12, 2025

It looks like this is breaking the build_pr_documentation check on PRs such as huggingface/diffusers#12642. The specific error message is:

Run source .venv/bin/activate
  source .venv/bin/activate
  echo "Stripping HTML from markdown files in build_dir"
  python3 doc-builder/src/scripts/strip_html_from_md.py build_dir/ --recursive
  echo "HTML stripping complete"
  shell: sh -e {0}
  env:
    DIFFUSERS_SLOW_IMPORT: yes
    UV_HTTP_TIMEOUT: 900
    ROOT_APT_GET: apt-get
    PIP_OR_UV: pip
    doc_folder: diffusers/docs/source
    package_name: diffusers
/__w/_temp/7b978111-60b1-414a-a873-3f48fbb8177b.sh: 1: source: not found
Error: Process completed with exit code 127.

@mishig25
Contributor Author

@stevhliu thanks for letting me know. Pushed fix #684

huggingface/diffusers#12642 doc action is building correctly now ✅
