Universal website content extraction and conversion tool. Downloads any website and converts it to structured markdown or JSON format.
# Import a website as markdown files
npm run import https://www.example.com
# Import as JSON
npm run import https://www.example.com -- --format=json
For detailed usage instructions, see USAGE.md.
site-importer/
├── import.js # Entry point - handles downloads and orchestration
├── index.js # Main converter orchestrator
├── converters/
│ ├── index.js # Exports all converters
│ ├── page-converter.js # Converts static pages
│ ├── blog-converter.js # Converts blog posts to news
│ ├── product-converter.js # Converts product pages
│ ├── category-converter.js # Converts category pages
│ ├── special-pages-converter.js # Generates special pages from data
│ ├── home-converter.js # Converts homepage content
│ ├── blog-index-converter.js # Generates blog index
│ └── reviews-index-converter.js # Generates reviews index
├── utils/
│ ├── base-converter.js # Base class for all converters
│ ├── category-scanner.js # Scans for product categories
│ ├── cli-args.js # Command-line argument parsing
│ ├── content-processor.js # Content cleaning and extraction
│ ├── directory-cleaner.js # Output directory management
│ ├── favicon-extractor.js # Extracts favicon files
│ ├── filesystem.js # File operations
│ ├── frontmatter-generator.js # YAML frontmatter generation
│ ├── html-patterns.js # HTML selector patterns
│ ├── html-table-extractor.js # Extracts data from HTML tables
│ ├── image-downloader.js # Downloads embedded images
│ ├── json-exporter.js # Exports data as JSON
│ ├── markdown-table-parser.js # Parses markdown tables
│ ├── markdown-writer.js # Writes markdown files
│ ├── metadata-extractor.js # HTML metadata extraction
│ ├── pandoc-converter.js # HTML to Markdown via pandoc
│ ├── results-tracker.js # Tracks conversion results
│ ├── site-downloader.js # Downloads websites using wget
│ └── test-runner.js # Test execution framework
└── tests/
    ├── data-structure-tests.js # Validates internal data structure
    └── markdown-output-tests.js # Validates markdown output files
- Download Phase (import.js)
  - Checks if old_site/ directory exists
  - If not, downloads the website using wget with mirror settings
  - Caches downloaded site for future runs (much faster)
- Conversion Phase (index.js)
  - Extracts favicons and assets
  - Processes homepage content
  - Converts static pages, blog posts, products, and categories
  - Generates special pages and index pages
  - Builds internal data structure
- Validation Phase (tests/)
  - Validates data structure integrity
  - Checks for duplicate slugs, missing fields
  - Validates frontmatter format
  - Verifies image references (markdown mode only)
- Output Phase
  - Markdown mode: Writes individual .md files with YAML frontmatter
  - JSON mode: Writes single content.json file with all data
  - Downloads and organizes images
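The cache check in the download phase could look roughly like the sketch below. It is an illustration only, not the code in import.js: the names ensureSiteCached and wgetMirror are hypothetical, and the wget options shown are standard flags chosen as one plausible reading of "mirror settings".

```javascript
const fs = require('fs')
const { spawnSync } = require('child_process')

// Hypothetical default downloader: mirrors the site into cacheDir with wget.
// The flags the real tool passes are not documented here; these are assumptions.
function wgetMirror(url, cacheDir) {
  const result = spawnSync(
    'wget',
    ['--mirror', '--page-requisites', '--adjust-extension', '--no-parent',
      '--directory-prefix', cacheDir, url],
    { stdio: 'inherit' }
  )
  // Only treat "wget could not be started" as fatal in this sketch.
  if (result.error) {
    throw new Error(`wget command failed: ${result.error.message}`)
  }
}

// Download only when the cache directory is missing; otherwise reuse it.
function ensureSiteCached(url, cacheDir = 'old_site', download = wgetMirror) {
  if (fs.existsSync(cacheDir)) {
    return { cacheDir, downloaded: false }
  }
  download(url, cacheDir)
  return { cacheDir, downloaded: true }
}
```

Passing the downloader as a parameter keeps the cache logic testable without network access.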
Markdown mode:
output/
├── pages/ # Static pages with frontmatter
├── news/ # Blog posts converted to news items
├── products/ # Product pages with pricing
├── categories/ # Category landing pages
├── images/ # Downloaded and organized images
└── assets/
    └── favicon/ # Favicon files
JSON mode:
output/
├── content.json # Single file with all structured data
├── images/ # Downloaded images
└── assets/
    └── favicon/ # Favicon files
The JSON structure includes pages, news, products, categories, home content, and metadata.
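As a quick way to inspect a JSON export, the sketch below counts the entries in each collection. The key names (pages, news, products, categories, home, metadata.format_version) follow the sample output shown in this README; the function name summarizeContent is hypothetical.

```javascript
// Count the entries in each top-level collection of an exported content object.
// Key names follow the README's sample output; the helper itself is illustrative.
function summarizeContent(data) {
  const collections = ['pages', 'news', 'products', 'categories']
  const summary = {}
  for (const key of collections) {
    summary[key] = Array.isArray(data[key]) ? data[key].length : 0
  }
  summary.hasHome = data.home != null
  summary.formatVersion = data.metadata ? data.metadata.format_version : undefined
  return summary
}

// Usage (assumes a JSON-mode import has already written output/content.json):
// const data = JSON.parse(require('fs').readFileSync('output/content.json', 'utf8'))
// console.log(summarizeContent(data))
```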
import.js is the main entry point. It:
- Parses command-line arguments (--format=json)
- Manages the old_site/ download cache
- Downloads websites using wget when needed
- Cleans and prepares the output/ directory
- Orchestrates the conversion process
- Runs validation tests
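Argument handling of this kind can be sketched in a few lines. The parser below is hypothetical and may differ from utils/cli-args.js; it assumes the first non-flag argument is the URL and that markdown is the default format, as the basic usage example suggests.

```javascript
// Hypothetical parser: the real cli-args.js may accept more options.
function parseArgs(argv) {
  const options = { url: null, format: 'markdown' }
  for (const arg of argv) {
    if (arg.startsWith('--format=')) {
      const value = arg.slice('--format='.length)
      if (value !== 'markdown' && value !== 'json') {
        throw new Error(`Unsupported format: ${value}`)
      }
      options.format = value
    } else if (!arg.startsWith('--') && options.url === null) {
      // First positional argument is taken as the site URL
      options.url = arg
    }
  }
  return options
}

// Usage: const options = parseArgs(process.argv.slice(2))
```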
index.js coordinates all converters in the correct order:
- Favicon extraction
- Homepage content
- Static pages
- Special pages
- Blog posts
- Products and categories
- Index pages
Handles both markdown and JSON output modes.
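The sequential ordering can be illustrated with a small runner. This is a sketch under assumptions, not the project's orchestrator: runSteps is a hypothetical name, and the only thing taken from the project is the { converted, failed } result shape returned by converters.

```javascript
// Run conversion steps one after another and record each result.
// `steps` is a list of [label, asyncFunction] pairs; the functions passed in
// are stand-ins, not the project's real converter exports.
async function runSteps(steps) {
  const results = []
  for (const [label, step] of steps) {
    try {
      const { converted = 0, failed = 0 } = (await step()) || {}
      results.push({ label, converted, failed })
    } catch (error) {
      // A throwing step is recorded as one failure and the run continues
      results.push({ label, converted: 0, failed: 1, error: error.message })
    }
  }
  return results
}

// Usage with stub steps:
// runSteps([
//   ['Static pages', async () => ({ converted: 3, failed: 0 })],
//   ['Blog posts', async () => ({ converted: 5, failed: 1 })]
// ]).then(console.log)
```

Awaiting each step before starting the next keeps the order deterministic, which matters when later steps (such as index pages) depend on data collected by earlier ones.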
All converters extend BaseConverter which provides:
- HTML file loading and parsing
- Metadata extraction
- Content processing
- Result tracking
Content Type Converters:
- page-converter.js - Static pages from the site
- blog-converter.js - Blog posts to news articles
- product-converter.js - Product pages with pricing and specs
- category-converter.js - Category landing pages
- home-converter.js - Homepage banners, features, sections
- special-pages-converter.js - Generates special pages from data
- blog-index-converter.js - Creates blog listing page
- reviews-index-converter.js - Creates reviews listing page
Content Processing:
- content-processor.js - Removes navigation, cleans HTML, normalizes whitespace
- metadata-extractor.js - Extracts titles, descriptions, prices, dates, og:tags
- pandoc-converter.js - HTML to Markdown conversion wrapper
- frontmatter-generator.js - Generates YAML frontmatter
Data Extraction:
- html-table-extractor.js - Extracts structured data from HTML tables
- markdown-table-parser.js - Parses markdown table syntax
- category-scanner.js - Scans for product categories
- html-patterns.js - Common HTML selector patterns
I/O Operations:
- filesystem.js - File reading, writing, directory management
- site-downloader.js - Website downloading via wget
- image-downloader.js - Downloads and organizes images
- favicon-extractor.js - Extracts favicon files
- directory-cleaner.js - Cleans output directories
Export & Output:
- json-exporter.js - Manages JSON export with data collection
- markdown-writer.js - Writes markdown files with frontmatter
Testing & Validation:
- test-runner.js - Test execution framework
- results-tracker.js - Tracks conversion statistics
- cli-args.js - Command-line argument parsing
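To make the frontmatter step concrete, here is a minimal sketch of serializing a flat metadata object as YAML frontmatter without third-party packages. The function name generateFrontmatter is hypothetical, and the real frontmatter-generator.js may format values differently (for example, leaving some strings unquoted).

```javascript
// Serialize a flat metadata object as a YAML frontmatter block.
// Handles strings, numbers, booleans and arrays of strings only.
// Strings are written as JSON string literals, which are valid YAML
// double-quoted scalars.
function generateFrontmatter(fields) {
  const lines = ['---']
  for (const [key, value] of Object.entries(fields)) {
    if (value === undefined || value === null) continue
    if (Array.isArray(value)) {
      const items = value.map(item => JSON.stringify(String(item))).join(', ')
      lines.push(`${key}: [${items}]`)
    } else if (typeof value === 'string') {
      lines.push(`${key}: ${JSON.stringify(value)}`)
    } else {
      lines.push(`${key}: ${value}`)
    }
  }
  lines.push('---')
  return lines.join('\n')
}
```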
Data Structure Tests (tests/data-structure-tests.js):
- Validates all required fields present
- Checks for duplicate slugs
- Verifies frontmatter structure
- Ensures content has H1 headings
- Validates metadata completeness
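Checks like these reduce to simple passes over the collected items. The sketch below is illustrative only: validateItems is a hypothetical helper, and the required field names are assumptions based on the sample JSON output rather than the actual test suite.

```javascript
// Return a list of human-readable problems found in a collection of items.
// The required field names are illustrative; the real tests may check others.
function validateItems(items, requiredFields = ['title', 'slug', 'content']) {
  const problems = []
  const seenSlugs = new Set()
  items.forEach((item, index) => {
    for (const field of requiredFields) {
      if (!item[field]) problems.push(`item ${index}: missing ${field}`)
    }
    if (item.slug) {
      if (seenSlugs.has(item.slug)) {
        problems.push(`item ${index}: duplicate slug "${item.slug}"`)
      }
      seenSlugs.add(item.slug)
    }
    // An H1 is a line starting with a single "#" followed by a space and text
    if (item.content && !/^# \S/m.test(item.content)) {
      problems.push(`item ${index}: content has no H1 heading`)
    }
  })
  return problems
}
```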
Markdown Output Tests (tests/markdown-output-tests.js):
- Verifies file structure
- Validates YAML frontmatter syntax
- Checks image references exist
- Ensures proper filename formatting
- Verifies no duplicate files
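An image-reference check can be sketched as follows. The helper names are hypothetical, and the sketch assumes that site-absolute paths such as /images/... resolve relative to the output directory; the real tests may resolve paths differently.

```javascript
const fs = require('fs')
const path = require('path')

// Collect image paths referenced with Markdown image syntax: ![alt](path "title")
function findImageReferences(markdown) {
  const pattern = /!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)/g
  return Array.from(markdown.matchAll(pattern), match => match[1])
}

// Return the local image references that do not exist under outputDir.
// Remote (http/https) references are skipped.
function findMissingImages(markdown, outputDir) {
  return findImageReferences(markdown)
    .filter(ref => !/^https?:\/\//.test(ref))
    .filter(ref => !fs.existsSync(path.join(outputDir, ref)))
}
```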
- Create a converter in converters/:
const BaseConverter = require('../utils/base-converter')

class NewTypeConverter extends BaseConverter {
  async convert() {
    const files = this.loadHtmlFiles('new-type')
    // Process files...
    return { converted: files.length, failed: 0 }
  }
}

// Export a function so the orchestrator can call convertNewType() directly
// (assumes BaseConverter needs no constructor arguments)
module.exports = () => new NewTypeConverter().convert()
- Export from converters/index.js:
const convertNewType = require('./new-type-converter')
module.exports = { convertNewType, /* ... */ }
- Add to orchestrator in index.js:
tracker.add('New Type', await convertNewType())
Edit utils/content-processor.js to:
- Remove additional HTML elements
- Clean specific patterns
- Normalize content structure
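As an illustration of this kind of cleanup, the sketch below removes whole elements and tidies whitespace using only regular expressions. It is not the code in utils/content-processor.js: cleanHtml and its default tag list are assumptions, and regex-based removal does not cope with nested elements of the same tag.

```javascript
// Strip whole elements such as <nav> or <footer>, then tidy whitespace.
// Regex-based removal is a simplification; the tag list is an assumption.
function cleanHtml(html, tagsToRemove = ['nav', 'footer', 'script', 'style']) {
  let cleaned = html
  for (const tag of tagsToRemove) {
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, 'gi')
    cleaned = cleaned.replace(element, '')
  }
  return cleaned
    .replace(/\sstyle="[^"]*"/gi, '') // drop inline style attributes
    .replace(/[ \t]+/g, ' ')          // collapse runs of spaces and tabs
    .replace(/\n\s*\n+/g, '\n\n')     // collapse runs of blank lines
    .trim()
}
```

Removing another element type then amounts to adding its tag name to the list, for example cleanHtml(html, ['nav', 'footer', 'aside']).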
To add a new metadata field:
- Extract it in utils/metadata-extractor.js
- Add it to the frontmatter in utils/frontmatter-generator.js
- Update the relevant converter to use the new field
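A new extraction rule might look like the sketch below, which reads a meta tag by its property or name attribute. The function names are hypothetical and the regular expression assumes the identifying attribute comes directly before content="..."; that is common but not guaranteed, so treat this as a starting point rather than the project's actual extractor.

```javascript
// Read the content of a <meta> tag identified by a property or name attribute.
// Assumes the identifying attribute appears directly before content="...".
function extractMeta(html, key) {
  const escaped = key.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  const pattern = new RegExp(
    `<meta\\s+(?:property|name)=["']${escaped}["']\\s+content=(["'])([\\s\\S]*?)\\1`,
    'i'
  )
  const match = html.match(pattern)
  return match ? match[2] : null
}

// Read the text of the <title> element, if present.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i)
  return match ? match[1].trim() : null
}
```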
Add patterns to utils/html-patterns.js for reusable selectors across converters.
- Node.js 14+ - JavaScript runtime
- pandoc - HTML to Markdown conversion
- wget - Website downloading
Ubuntu/Debian:
apt-get install nodejs pandoc wget
macOS:
brew install node pandoc wget
Windows:
- Install Node.js from nodejs.org
- Install pandoc from pandoc.org
- Install wget from gnu.org/software/wget
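Before running an import, it can help to confirm that the external tools are on the PATH. The following is a minimal sketch with hypothetical helper names; it relies only on the fact that both pandoc and wget accept a --version flag.

```javascript
const { spawnSync } = require('child_process')

// Return true when `command --version` can be started and exits successfully.
function isInstalled(command) {
  const result = spawnSync(command, ['--version'], { stdio: 'ignore' })
  return !result.error && result.status === 0
}

// List the required external tools that cannot be run.
function missingTools(tools = ['pandoc', 'wget']) {
  return tools.filter(tool => !isInstalled(tool))
}

// Usage:
// const missing = missingTools()
// if (missing.length > 0) console.error(`Please install: ${missing.join(', ')}`)
```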
No npm packages are required: this project uses only Node.js built-in modules for simplicity and portability.
First run - Downloads the website:
npm run import https://www.example.com -- --format=markdown
Subsequent runs - Reuses cached site (much faster):
npm run import https://www.example.com -- --format=markdown
Force re-download - Delete the cache first:
rm -rf old_site
npm run import https://www.example.com -- --format=markdown
Test changes without download:
# After first download, just run the converter directly
node index.js
Tests run automatically after each conversion. They validate:
- Data structure integrity (all modes)
  - Required fields present
  - No duplicate slugs
  - Valid frontmatter format
- Markdown file structure (markdown mode only)
  - Image references exist
To run tests manually:
# Run data structure tests
node tests/data-structure-tests.js
# Run markdown output tests (after markdown conversion)
node tests/markdown-output-tests.js
Example markdown output (a product page):
---
title: "Product Name"
description: "Product description for SEO"
layout: product
price: "£199.99"
categories: ["category-name"]
image: "/images/products/product-image.jpg"
---
# Product Name
Product content in markdown format...
Example JSON output (content.json):
{
"pages": [
{
"title": "About Us",
"slug": "about",
"content": "# About Us\n\nOur story...",
"description": "Learn about our company"
}
],
"products": [...],
"news": [...],
"categories": [...],
"home": {...},
"metadata": {
"exported_at": "2025-10-17T12:00:00.000Z",
"format_version": "1.0"
}
}
- Universal website support - Works with any HTML website
- Smart caching - Downloads once, converts many times
- Dual output formats - Markdown files or single JSON file
- Automatic validation - Built-in tests ensure quality output
- Image downloading - Extracts and downloads all images
- Favicon extraction - Captures all favicon variants
- Metadata preservation - Extracts SEO metadata, prices, dates
- Clean conversion - Removes navigation, footers, inline styles
- Product support - Extracts pricing, categories, specifications
- Blog support - Converts blog posts with dates and authors
- No npm dependencies - Uses only Node.js built-in modules
"pandoc is not installed"
- Install pandoc using your system's package manager
- Verify with: pandoc --version
"wget command failed"
- Ensure wget is installed
- Check the URL is accessible
- Try downloading manually first to test
"No files found in old_site/"
- Check if the download completed successfully
- Verify the URL structure is supported by wget
- Look for HTML files in old_site/ subdirectories
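A quick way to see what was actually downloaded is to list the HTML files in the cache. The helper below is a hypothetical sketch using only Node.js built-in modules.

```javascript
const fs = require('fs')
const path = require('path')

// Recursively collect .html/.htm files below a directory.
// Returns an empty list when the directory does not exist.
function findHtmlFiles(dir) {
  if (!fs.existsSync(dir)) return []
  const found = []
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name)
    if (entry.isDirectory()) {
      found.push(...findHtmlFiles(fullPath))
    } else if (/\.html?$/i.test(entry.name)) {
      found.push(fullPath)
    }
  }
  return found
}

// Usage: console.log(findHtmlFiles('old_site'))
```

An empty result for old_site suggests the download did not complete or the site serves its pages without .html extensions.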
Tests failing
- Review test output for specific validation errors
- Check that all required fields are extracted
- Verify frontmatter YAML syntax is valid
ISC - Terragon Labs