Commit d2d6a20

Add HTML stripper script for markdown documentation (#682)
* Add HTML stripper script for markdown documentation
  - Add strip_html_from_md.py script to convert HTML-heavy docs to clean markdown
  - Strip HTML tags from docstrings and convert to level 4 headers
  - Format class/function names with anchors: #### ClassName[[anchor]]
  - Remove 'class' and 'def' prefixes from names
  - Format parameters without bullets, using : separator with blank lines
  - Support processing single files or directories recursively
  - Add comprehensive README with usage examples and documentation
* ruff fix
* format
* add to jobs
1 parent 1b39f4d commit d2d6a20

4 files changed: 457 additions, 0 deletions

.github/workflows/build_main_documentation.yml

Lines changed: 7 additions & 0 deletions
@@ -234,6 +234,13 @@ jobs:

           cd ..

+      - name: Strip HTML from built markdown files
+        run: |
+          source .venv/bin/activate
+          echo "Stripping HTML from markdown files in build_dir"
+          python3 src/scripts/strip_html_from_md.py build_dir/ --recursive
+          echo "HTML stripping complete"
+
       - name: Push to repositories
         run: |
           source .venv/bin/activate

.github/workflows/build_pr_documentation.yml

Lines changed: 7 additions & 0 deletions
@@ -219,6 +219,13 @@ jobs:
           fi
           cd ..

+      - name: Strip HTML from built markdown files
+        run: |
+          source .venv/bin/activate
+          echo "Stripping HTML from markdown files in build_dir"
+          python3 src/scripts/strip_html_from_md.py build_dir/ --recursive
+          echo "HTML stripping complete"
+
       - name: Save commit_sha & pr_number
         run: |
           echo ${{ inputs.commit_sha }} > ./build_dir/commit_sha

README_strip_html.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# HTML Stripper for Markdown Documentation

This script strips HTML tags from markdown files and converts docstrings to clean markdown format.

## Features

- Removes all HTML tags while preserving content
- Converts docstring blocks to clean markdown with level 4 headers (`####`)
- Extracts and formats:
  - Class/function names as headers (removes `class` and `def` prefixes)
  - Anchors in double square brackets format: `[[anchor]]`
  - Source links
  - Parameter descriptions (removes bullets and bold, uses `:` separator, adds blank lines between params)
  - Return types and descriptions
- Preserves markdown code blocks and structure
- Can process single files or entire directories

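The header conversion listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `strip_html_from_md.py` (which is not shown in this diff); the function name `docstring_to_markdown` and the regex-based extraction are assumptions made for the example.

```python
import re


def docstring_to_markdown(block: str) -> str:
    """Turn one <docstring>...</docstring> block into a level 4 header plus a source link.

    Hypothetical helper: the real script may extract these fields differently.
    """

    def tag(name: str) -> str:
        # Return the text inside <name>...</name>, or "" if the tag is absent.
        match = re.search(rf"<{name}>(.*?)</{name}>", block, flags=re.DOTALL)
        return match.group(1).strip() if match else ""

    # Drop a leading "class " or "def " prefix from the name.
    name = re.sub(r"^(class|def)\s+", "", tag("name"))
    lines = [f"#### {name}[[{tag('anchor')}]]"]
    source = tag("source")
    if source:
        lines += ["", f"[Source]({source})"]
    return "\n".join(lines)


block = (
    "<docstring><name>class transformers.BertConfig</name>"
    "<anchor>transformers.BertConfig</anchor>"
    "<source>https://github.com/huggingface/transformers/blob/v4.57.1/"
    "src/transformers/models/bert/configuration_bert.py#L29</source></docstring>"
)
print(docstring_to_markdown(block))
```

The sketch only covers the name, anchor, and source fields; parameter and return formatting would need their own handling.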
## Usage

### Single File

```bash
python3 src/scripts/strip_html_from_md.py input.md -o output.md
```

Or overwrite the input file:

```bash
python3 src/scripts/strip_html_from_md.py input.md
```

### Directory Processing

Process all markdown files in a directory:

```bash
python3 src/scripts/strip_html_from_md.py docs/ -o clean_docs/
```

Process recursively:

```bash
python3 src/scripts/strip_html_from_md.py docs/ -o clean_docs/ --recursive
```

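One plausible way to select files for the single-file, directory, and recursive modes is sketched below. The helper names `collect_markdown_files` and `output_path_for` are hypothetical, and mirroring the input layout under the output directory is an assumption about how `-o` behaves for directories, not something this README states.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def collect_markdown_files(input_path: Path, recursive: bool = False) -> List[Path]:
    """Return the markdown files to process for a file or directory input."""
    if input_path.is_file():
        return [input_path]
    # "**/*.md" also matches files directly inside input_path.
    pattern = "**/*.md" if recursive else "*.md"
    return sorted(input_path.glob(pattern))


def output_path_for(src: Path, input_root: Path, output_root: Optional[Path]) -> Path:
    """Mirror src under output_root, or return src itself to overwrite in place."""
    if output_root is None:
        return src
    return output_root / src.relative_to(input_root)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.md").write_text("top level")
    (root / "sub" / "b.md").write_text("nested")
    flat = [p.name for p in collect_markdown_files(root)]
    deep = [p.name for p in collect_markdown_files(root, recursive=True)]
    print(flat, deep)
```

Without `--recursive`, only the top-level `a.md` is picked up in this sketch; with it, the nested `b.md` is included as well.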
## Examples

### Before (with HTML)

```markdown
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5">

<docstring><name>class transformers.BertConfig</name><anchor>transformers.BertConfig</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/models/bert/configuration_bert.py#L29</source><paramsdesc>- **vocab_size** (`int`, *optional*, defaults to 30522) --
Vocabulary size of the BERT model.</paramsdesc></docstring>

This is the configuration class to store the configuration of a BertModel.

</div>
```

### After (clean markdown)

```markdown
#### transformers.BertConfig[[transformers.BertConfig]]

[Source](https://github.com/huggingface/transformers/blob/v4.57.1/src/transformers/models/bert/configuration_bert.py#L29)

This is the configuration class to store the configuration of a BertModel.

**Parameters:**

vocab_size (`int`, *optional*, defaults to 30522) : Vocabulary size of the BERT model.

hidden_size (`int`, *optional*, defaults to 768) : Dimensionality of the encoder layers and the pooler layer.
```

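The parameter reformatting shown in the Before/After pair could be approximated as below. This is a sketch under stated assumptions: `format_params` is a hypothetical name, and it assumes every entry in `<paramsdesc>` starts on its own line with `- **name**` and separates the type from the description with `--`.

```python
import re


def format_params(paramsdesc: str) -> str:
    """Rewrite '- **name** (type) -- description' entries as 'name (type) : description'."""
    # Split before every line that starts a new "- **name**" entry.
    entries = re.split(r"\n(?=- \*\*)", paramsdesc.strip())
    formatted = []
    for entry in entries:
        match = re.match(r"- \*\*(.+?)\*\*\s*(.*?)\s*--\s*(.*)", entry, flags=re.DOTALL)
        if not match:
            formatted.append(entry)  # leave unrecognized entries untouched
            continue
        name, type_info, desc = match.groups()
        desc = " ".join(desc.split())  # collapse the line break after "--"
        head = f"{name} {type_info}".strip()
        formatted.append(f"{head} : {desc}")
    # A blank line between parameters, as in the "After" example.
    return "\n\n".join(formatted)


sample = (
    "- **vocab_size** (`int`, *optional*, defaults to 30522) --\n"
    "Vocabulary size of the BERT model.\n"
    "- **hidden_size** (`int`, *optional*, defaults to 768) --\n"
    "Dimensionality of the encoder layers and the pooler layer."
)
print(format_params(sample))
```

The sample input carries both parameters so that the blank-line separation between entries is visible in the output.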
## Command-line Options

- `input` - Input markdown file or directory (required)
- `-o, --output` - Output file or directory (optional, defaults to overwriting input)
- `-r, --recursive` - Process directory recursively (optional)

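An `argparse` definition consistent with the options above might look like the following. The option names come from this README; the function name `build_parser`, the help strings, and the description text are illustrative assumptions, and the actual script may define its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Strip HTML tags from markdown files.")
    parser.add_argument("input", help="Input markdown file or directory")
    parser.add_argument(
        "-o", "--output", default=None,
        help="Output file or directory (defaults to overwriting input)",
    )
    parser.add_argument(
        "-r", "--recursive", action="store_true",
        help="Process directory recursively",
    )
    return parser


# Same arguments as the workflow step: no -o, so files are overwritten in place.
args = build_parser().parse_args(["build_dir/", "--recursive"])
print(args.input, args.output, args.recursive)
```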
## What Gets Stripped

The script removes:
- `<div>` tags and their attributes
- `<docstring>` and nested tags (`<name>`, `<anchor>`, `<source>`, etc.)
- Component tags: `<Tip>`, `<ExampleCodeBlock>`, `<hfoptions>`, `<hfoption>`, etc.
- `<EditOnGithub>` links
- HTML comments
- Any other HTML tags

93+
## What Gets Preserved
94+
95+
- Markdown syntax (headers, lists, code blocks, links, etc.)
96+
- Text content from within HTML tags
97+
- Code blocks (backtick-fenced)
98+
- Link URLs and formatting
99+
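The split between what is stripped and what is preserved can be illustrated with a fence-aware pass like the one below. It is a deliberately naive sketch: the regexes and the name `strip_html` are assumptions, and the tag pattern would also remove tag-like text inside inline code spans, which the real script may treat differently.

```python
import re

# A fenced block: a line starting with ``` up to the next line starting with ```.
FENCE_RE = re.compile(r"(^```.*?^```[^\n]*$)", re.DOTALL | re.MULTILINE)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html(markdown: str) -> str:
    """Remove HTML comments and tags outside backtick-fenced code blocks."""
    # With one capturing group, split() alternates: text, fence, text, fence, ...
    parts = FENCE_RE.split(markdown)
    for i in range(0, len(parts), 2):  # even indices are text outside fences
        text = COMMENT_RE.sub("", parts[i])
        parts[i] = TAG_RE.sub("", text)  # tag text content is kept
    return "".join(parts)


doc = (
    '<div class="docstring">\n'
    "<Tip>Use the config.</Tip>\n"
    "<!-- hidden -->\n"
    "</div>\n\n"
    "```html\n<div>kept</div>\n```\n"
)
print(strip_html(doc))
```

Here the text inside `<Tip>` survives, the comment disappears, and the `<div>` inside the fenced block is left untouched.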
