A Playwright-powered website mirroring tool for the JavaScript age
Smippo is a modern alternative to HTTrack that uses Playwright to capture websites as they appear after JavaScript execution. Unlike traditional web crawlers that fetch raw HTML, Smippo renders pages in a real Chromium browser, waits for network activity to settle, then captures the fully rendered DOM and all network artifacts.
A browser downloads a page by:
- Fetching the initial HTML
- Executing JavaScript
- Making subsequent network requests (images, fonts, API calls, etc.)
- Waiting for the network to settle
Playwright gives us exactly this capability. Smippo leverages it to create faithful offline copies of modern web applications.
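A minimal sketch of that sequence with Playwright (only the rendering step; Smippo layers crawling, filtering, and link rewriting on top of this):

```js
import { chromium } from 'playwright';

// Fetch, execute JavaScript, wait for follow-up requests to settle,
// then read the DOM as the browser finally renders it.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com', { waitUntil: 'networkidle' });
const renderedHtml = await page.content(); // post-JavaScript HTML
await browser.close();
```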
## Feature comparison with HTTrack

| Feature | HTTrack | Smippo |
|---|---|---|
| JavaScript execution | ❌ No | ✅ Full browser rendering |
| SPA support | ❌ Limited | ✅ Native |
| CSS-in-JS capture | ❌ No | ✅ Yes |
| Dynamic content | ❌ No | ✅ Yes |
| Recursive crawling | ✅ Yes | ✅ Yes |
| Depth control | ✅ Yes | ✅ Yes |
| URL filters | ✅ Yes | ✅ Yes |
| MIME type filters | ✅ Yes | ✅ Yes |
| File size filters | ✅ Yes | ✅ Yes |
| robots.txt support | ✅ Yes | ✅ Yes |
| Cookie support | ✅ Yes | ✅ Yes |
| Custom headers | ✅ Yes | ✅ Yes |
| Proxy support | ✅ Yes | ✅ Yes |
| HAR file generation | ❌ No | ✅ Yes |
| Resume/continue | ✅ Yes | ✅ Yes |
| Cache system | ✅ ZIP | ✅ JSON + files |
| Link rewriting | ✅ Yes | ✅ Yes |
| Concurrent connections | ✅ Yes | ✅ Yes (browser tabs) |
| Authentication | ✅ Basic | ✅ Full (incl. form-based) |
| Performance | ✅ Fast | ⚡ Moderate (real browser) |
## Usage

### Single page

```bash
smippo https://example.com
```

Captures a single page with all its rendered content and network artifacts.
Output:
- `index.html` - Fully rendered HTML
- `assets/` - All fetched resources (images, CSS, JS, fonts, etc.)
- `network.har` - HAR file for debugging/replay
### Recursive crawling

```bash
smippo https://example.com --depth 3
```

Crawls the website following internal links up to the specified depth.
### Filtering

By URL pattern:

```bash
smippo https://example.com --include "*.jpg" --exclude "*tracking*"
```

By MIME type:

```bash
smippo https://example.com --mime-include "image/*" --mime-exclude "video/*"
```

By file size:

```bash
smippo https://example.com --max-size 5MB --min-size 1KB
```

### Scope control

```bash
# Stay on same domain (default)
smippo https://www.example.com --scope domain
# Stay on same subdomain
smippo https://www.example.com --scope subdomain
# Follow all links (dangerous!)
smippo https://www.example.com --scope all --depth 2
# Stay in same directory tree
smippo https://example.com/docs/ --stay-in-dir
```

### Authentication

Basic auth:

```bash
smippo https://user:pass@example.com
```

Cookie-based:

```bash
smippo https://example.com --cookies cookies.json
```
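The cookie file format is not pinned down in this README; a plausible `cookies.json`, assuming entries map directly onto Playwright's `browserContext.addCookies()` fields:

```json
[
  {
    "name": "session_id",
    "value": "abc123",
    "domain": "example.com",
    "path": "/",
    "expires": 1767225600,
    "httpOnly": true,
    "secure": true
  }
]
```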
Form-based (interactive capture):

```bash
smippo https://example.com --capture-auth
```

Opens a browser window for manual login, then captures the session.
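One way `--capture-auth` could work with Playwright is a headed browser plus `storageState()`. The sketch below is illustrative only; the prompt-driven handoff and the `auth-state.json` filename are assumptions, not Smippo's actual flow:

```js
import { chromium } from 'playwright';
import readline from 'node:readline/promises';

// Open a visible browser so the user can log in by hand.
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://example.com/login');

// Wait for the user to finish logging in before capturing the session.
const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
await rl.question('Log in in the browser window, then press Enter here... ');
rl.close();

// Persist cookies + localStorage for later headless captures.
await context.storageState({ path: 'auth-state.json' });
await browser.close();
```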
## CLI reference

```text
smippo <url> [options]
Commands:
smippo <url> Capture/mirror a website
smippo capture <url> Take a screenshot of a URL
smippo serve [dir] Serve captured site locally
smippo continue Resume an interrupted capture
smippo update Update an existing mirror
smippo help Show detailed help
Options:
Core:
-o, --output <dir> Output directory (default: ./site)
-d, --depth <n> Recursion depth (default: 0 = single page)
--no-crawl Disable link following (same as -d 0)
--dry-run Show what would be captured without downloading
Scope:
-s, --scope <type> Link scope: subdomain|domain|tld|all (default: domain)
--stay-in-dir Only follow links in same directory or subdirs
--external-assets Capture assets from external domains
Filters:
-I, --include <glob> Include URLs matching pattern (can repeat)
-E, --exclude <glob> Exclude URLs matching pattern (can repeat)
--mime-include <type> Include MIME types (can repeat)
--mime-exclude <type> Exclude MIME types (can repeat)
--max-size <size> Maximum file size (e.g., 10MB)
--min-size <size> Minimum file size (e.g., 1KB)
Browser:
--wait <strategy> Wait strategy: networkidle|load|domcontentloaded (default: networkidle)
--wait-time <ms> Additional wait time after network idle
--timeout <ms> Page load timeout (default: 30000)
--user-agent <string> Custom user agent
--viewport <WxH> Viewport size (default: 1920x1080)
--device <name> Emulate device (e.g., "iPhone 13")
Network:
--proxy <url> Proxy server URL
--cookies <file> Load cookies from JSON file
--headers <json> Custom headers as JSON
--capture-auth Interactive authentication capture
Output:
--structure <type> Output structure: original|flat|domain (default: original)
--har Generate HAR file (default: true)
--no-har Disable HAR file generation
--screenshot Take screenshot of each page
--pdf Save PDF of each page
Performance:
-w, --workers <n> Parallel workers (default: 8)
--max-pages <n> Maximum pages to capture
--max-time <seconds> Maximum total time
--rate-limit <ms> Delay between requests
Robots:
--ignore-robots Ignore robots.txt
--respect-robots Respect robots.txt (default)
Cache:
--no-cache Don't use cache
--cache-only Only serve from cache (update mode)
Logging:
-v, --verbose Verbose output
-q, --quiet Minimal output
--log-file <path> Write logs to file
--debug Debug mode with full browser
Misc:
--version Show version
--help Show help
```
## Output structures

### original (default)

Preserves the URL path structure:

```text
site/
├── example.com/
│ ├── index.html
│ ├── about/
│ │ └── index.html
│ └── assets/
│ ├── style.css
│ └── logo.png
├── .smippo/
│ ├── cache.json # Metadata cache
│ ├── network.har # HAR file
│ ├── manifest.json # Capture manifest
│ └── log.txt # Capture log
└── index.html # Entry point
```
### flat

All files in one directory with hashed names:

```text
site/
├── index.html
├── about-index.html
├── style-abc123.css
├── logo-def456.png
└── .smippo/
└── ...
```
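The hashing scheme is not documented here; a simple possibility (purely illustrative, not Smippo's actual algorithm) is a short URL hash appended to the original basename:

```js
import { createHash } from 'node:crypto';
import path from 'node:path';

// e.g. https://example.com/assets/style.css -> style-a1b2c3.css
function flatName(url) {
  const { pathname } = new URL(url);
  const ext = path.extname(pathname) || '.html';
  const base = path.basename(pathname, ext) || 'index';
  const hash = createHash('sha1').update(url).digest('hex').slice(0, 6);
  return `${base}-${hash}${ext}`;
}
```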
### domain

Organized by domain:

```text
site/
├── www.example.com/
│ └── ...
├── cdn.example.com/
│ └── ...
└── .smippo/
└── ...
```
The capture manifest (`.smippo/manifest.json`):

```json
{
"version": "0.0.1",
"created": "2024-01-15T10:30:00Z",
"updated": "2024-01-15T11:45:00Z",
"rootUrl": "https://example.com",
"options": {
"depth": 3,
"scope": "domain",
"filters": {
"include": ["*"],
"exclude": ["*tracking*"]
}
},
"stats": {
"pagesCapt": 42,
"assetsCapt": 156,
"totalSize": 15728640,
"duration": 180000,
"errors": 2
},
"pages": [
{
"url": "https://example.com/",
"localPath": "example.com/index.html",
"status": 200,
"captured": "2024-01-15T10:30:05Z",
"size": 45678,
"title": "Example Domain"
}
],
"assets": [
{
"url": "https://example.com/style.css",
"localPath": "example.com/style.css",
"mimeType": "text/css",
"size": 12345
}
]
}
```

The cache (`.smippo/cache.json`):

```json
{
"etags": {
"https://example.com/": "\"abc123\"",
"https://example.com/style.css": "\"def456\""
},
"lastModified": {
"https://example.com/": "2024-01-10T00:00:00Z"
},
"contentTypes": {
"https://example.com/api/data": "application/json"
}
}
```
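How `smippo update` uses these validators is not spelled out here; presumably they drive conditional requests so unchanged resources can be skipped. A sketch of that idea (the `HEAD`-request approach is an assumption, and `cache` mirrors the shape of the example above):

```js
// Decide whether a cached URL needs re-capturing, using cache.json validators.
async function isStale(url, cache) {
  const headers = {};
  if (cache.etags?.[url]) headers['If-None-Match'] = cache.etags[url];
  if (cache.lastModified?.[url]) {
    headers['If-Modified-Since'] = new Date(cache.lastModified[url]).toUTCString();
  }

  const res = await fetch(url, { method: 'HEAD', headers });
  return res.status !== 304; // 304 Not Modified -> the local copy is still current
}
```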
## Link rewriting

Smippo rewrites all links in captured HTML/CSS to point to local files.

Original:

```html
<link href="https://example.com/style.css" rel="stylesheet" />
<img src="/images/logo.png" />
<a href="https://example.com/about">About</a>
```

Rewritten:

```html
<link href="./style.css" rel="stylesheet" />
<img src="./images/logo.png" />
<a href="./about/index.html">About</a>
```

Rewritten references include:

- `<a href>`
- `<link href>`
- `<script src>`
- `<img src>`, `<img srcset>`
- `<video src>`, `<audio src>`
- `<source src>`, `<source srcset>`
- `<iframe src>`
- `<object data>`
- CSS `url()` references
- CSS `@import`
- Inline styles
- JavaScript string URLs (best effort)
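A condensed illustration of the HTML side of this using cheerio (one of Smippo's dependencies); `toLocalPath` is a hypothetical URL-to-file mapper, and the real rewriter also covers `srcset`, CSS, and inline styles:

```js
import * as cheerio from 'cheerio';

// Rewrite absolute and root-relative references to local file paths.
// `toLocalPath` is a hypothetical mapper from absolute URL -> captured file.
function rewriteLinks(html, baseUrl, toLocalPath) {
  const $ = cheerio.load(html);
  const targets = [['a', 'href'], ['link', 'href'], ['script', 'src'], ['img', 'src'], ['iframe', 'src']];

  for (const [tag, attr] of targets) {
    $(`${tag}[${attr}]`).each((_, el) => {
      const absolute = new URL($(el).attr(attr), baseUrl).href;
      const local = toLocalPath(absolute);
      if (local) $(el).attr(attr, local);
    });
  }
  return $.html();
}
```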
## Examples

```bash
# Archive a blog, skipping comment pages
smippo https://myblog.com --depth 5 --exclude "*/comments*" -o ~/archives/myblog

# Mirror a documentation site
smippo https://docs.library.com --depth 10 --scope subdomain -o ./docs

# Capture a JavaScript-heavy app, with extra settle time and screenshots
smippo https://myapp.com --wait-time 5000 --screenshot

# Mirror a site behind a login
smippo https://private.site.com --capture-auth --depth 3

# Skip video and anything larger than 10 MB
smippo https://media.site.com --max-size 10MB --mime-exclude "video/*"

# Update an existing mirror
cd my-mirror/
smippo update
```

## Screenshots with `smippo capture`

```bash
# Basic screenshot
smippo capture https://example.com
# Full-page screenshot
smippo capture https://example.com --full-page
# Mobile device screenshot
smippo capture https://example.com --device "iPhone 13" -O mobile.png
# Screenshot with dark mode
smippo capture https://example.com --dark-mode
# Capture specific element
smippo capture https://example.com --selector ".hero-section"
```
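These flags map naturally onto Playwright's screenshot APIs; a sketch of a plausible mapping (the device descriptors come from Playwright's built-in `devices` registry, and the output filenames are placeholders):

```js
import { chromium, devices } from 'playwright';

// Roughly what `smippo capture --device "iPhone 13" --full-page --dark-mode` could do.
const browser = await chromium.launch();
const context = await browser.newContext({
  ...devices['iPhone 13'],  // --device
  colorScheme: 'dark',      // --dark-mode
});
const page = await context.newPage();
await page.goto('https://example.com', { waitUntil: 'networkidle' });

await page.screenshot({ path: 'mobile.png', fullPage: true });        // --full-page
await page.locator('.hero-section').screenshot({ path: 'hero.png' }); // --selector

await browser.close();
```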
## Serving a mirror with `smippo serve`

```bash
# Serve with auto port detection
smippo serve ./site
# Specify port and open browser
smippo serve ./site --port 3000 --open
# Verbose logging
smippo serve ./site --verbose
```

## Performance tuning

```bash
# Default: 8 parallel workers
smippo https://large-site.com --depth 5
# Limit workers for rate-limited sites
smippo https://api-heavy-site.com --workers 2 --rate-limit 1000
# Maximum speed (use with caution)
smippo https://your-server.com --workers 16
```

## Dependencies

```json
{
"dependencies": {
"@clack/prompts": "^0.11.0",
"chalk": "^5.3.0",
"cheerio": "^1.0.0-rc.12",
"cli-progress": "^3.12.0",
"commander": "^12.0.0",
"figlet": "^1.9.4",
"fs-extra": "^11.2.0",
"glob": "^10.3.10",
"gradient-string": "^3.0.0",
"mime-types": "^2.1.35",
"minimatch": "^10.1.1",
"ora": "^8.0.1",
"p-queue": "^8.0.1",
"playwright": "^1.41.0",
"robots-parser": "^3.0.1"
}
}
```

## Project structure

```text
smippo/
├── bin/
│ └── smippo.js # CLI entry point
├── src/
│ ├── index.js # Main export
│ ├── cli.js # CLI argument parsing
│ ├── crawler.js # Main crawler logic
│ ├── page-capture.js # Single page capture
│ ├── link-extractor.js # Extract links from HTML/CSS
│ ├── link-rewriter.js # Rewrite links for offline
│ ├── resource-saver.js # Save resources to disk
│ ├── filter.js # URL/MIME/size filtering
│ ├── robots.js # robots.txt parsing
│ ├── cache.js # Cache management
│ ├── manifest.js # Manifest management
│ ├── har.js # HAR file generation
│ └── utils/
│ ├── url.js # URL utilities
│ ├── path.js # Path utilities
│ └── logger.js # Logging
├── package.json
└── README.md
```
## How it works

Simplified core flow (`saveResource`, `extractLinks`, `rewriteLinks`, `saveHtml`, and `shouldFollow` live in the `src/` modules listed above):

```js
import { chromium } from 'playwright';

// Simplified: launches a fresh browser per page, shown for clarity.
async function capture(url, options) {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    recordHar: { path: options.harPath }, // record every request/response
  });
  const page = await context.newPage();

  // Intercept and save resources as the page fetches them
  page.on('response', async (response) => {
    await saveResource(response, options);
  });

  // Navigate and wait for the configured strategy (default: networkidle)
  await page.goto(url, { waitUntil: options.wait });

  // Get the rendered, post-JavaScript HTML
  const html = await page.content();

  // Extract links for crawling
  const links = await extractLinks(page, options);

  // Rewrite links for offline viewing, then save the page
  const rewrittenHtml = rewriteLinks(html, options);
  await saveHtml(rewrittenHtml, url, options);

  // Continue crawling while depth remains
  if (options.depth > 0) {
    for (const link of links) {
      if (shouldFollow(link, options)) {
        await capture(link, { ...options, depth: options.depth - 1 });
      }
    }
  }

  await browser.close();
}
```
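The scope part of `shouldFollow` can be sketched with nothing more than the URL API. This is a simplified illustration, not the actual `filter.js`, which also applies include/exclude globs, MIME and size filters; a real implementation would use a public-suffix list for the domain scope rather than the last two hostname labels:

```js
// Simplified --scope check: subdomain | domain | tld | all
function inScope(linkUrl, rootUrl, scope) {
  const link = new URL(linkUrl);
  const root = new URL(rootUrl);
  const apex = (host) => host.split('.').slice(-2).join('.');

  switch (scope) {
    case 'all':       return true;
    case 'subdomain': return link.hostname === root.hostname;
    case 'domain':    return apex(link.hostname) === apex(root.hostname);
    case 'tld':       return link.hostname.split('.').pop() === root.hostname.split('.').pop();
    default:          return false;
  }
}
```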
## HTTrack option mapping

| HTTrack Option | Smippo Equivalent |
|---|---|
| `-rN` (depth) | `--depth N` |
| `-O` (output path) | `-o, --output` |
| `-w` (mirror) | default behavior |
| `-g` (get files) | `--no-crawl` |
| `-i` (continue) | `smippo continue` |
| `-P` (proxy) | `--proxy` |
| `-F` (user-agent) | `--user-agent` |
| `-c` (connections) | `--workers` |
| `-T` (timeout) | `--timeout` |
| `-m` (max size) | `--max-size` |
| `-E` (max time) | `--max-time` |
| `-b` (cookies) | `--cookies` |
| `-s` (robots.txt) | `--respect-robots` / `--ignore-robots` |
| `-a` (stay on address) | `--scope subdomain` |
| `-d` (stay on domain) | `--scope domain` |
| `-e` (go everywhere) | `--scope all` |
| `+pattern` / `-pattern` | `--include` / `--exclude` |
| `-mime:type` | `--mime-include` / `--mime-exclude` |
| `-N` (structure) | `--structure` |
HTTrack features not carried over:

- Java class parsing - not needed; JS executes natively
- FTP support - out of scope for a browser-based tool
- HTTP/1.0 mode - modern browsers handle this
- 8.3 filename conversion - obsolete
- proxytrack - different architecture
And what Smippo adds:

- Full JavaScript execution and SPA support
- Device emulation
- PDF/screenshot capture via the `smippo capture` command
- Interactive auth capture
- Native HAR file generation
- JSON-based cache (more portable than ZIP)
- Modern filter syntax (glob patterns)
- Built-in server via the `smippo serve` command
- Interactive guided mode with beautiful CLI
- Parallel workers (default: 8) for faster crawling (see the sketch after this list)
- Static mode (`--static`) for offline viewing without JS
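The worker and rate-limit model lines up well with p-queue from the dependency list above; a sketch of how `--workers 2 --rate-limit 1000` could translate (`capturePage` and the URL list are placeholders, not the actual `crawler.js` scheduling):

```js
import PQueue from 'p-queue';

// Roughly `--workers 2 --rate-limit 1000`: two captures in flight,
// and at most one new capture started per second.
const queue = new PQueue({ concurrency: 2, interval: 1000, intervalCap: 1 });

const urls = ['https://example.com/', 'https://example.com/about'];
for (const url of urls) {
  queue.add(() => capturePage(url)); // capturePage: placeholder per-page capture task
}
await queue.onIdle(); // resolves once the crawl queue is drained
```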
## Installation

Via npm:

```bash
npm install -g smippo
smippo https://example.com
```

Or run it without installing:

```bash
npx smippo https://example.com
```

Via Homebrew:

```bash
brew install smippo
smippo https://example.com
```

The Homebrew formula:

```ruby
class Smippo < Formula
desc "Modern website copier powered by Playwright"
homepage "https://github.com/username/smippo"
url "https://registry.npmjs.org/smippo/-/smippo-0.0.1.tgz"
license "MIT"
depends_on "node"
def install
system "npm", "install", *Language::Node.std_npm_install_args(libexec)
bin.install_symlink Dir["#{libexec}/bin/*"]
# Install Playwright browsers
system "npx", "playwright", "install", "chromium"
end
end
```

## Roadmap

- Single page capture
- Recursive crawling with depth control
- Resource saving (images, CSS, JS, fonts)
- Link rewriting for offline viewing
- HAR file generation
- Basic filtering (URL patterns)
- CLI interface with interactive mode
- npm package
- MIME type filtering
- File size filtering
- Scope control (subdomain, domain, TLD)
- External asset handling
- robots.txt support
- Parallel workers (default: 8)
- Screenshot capture command
- Built-in server command
- Static mode (strip JS)
- Beautiful CLI with progress indicators
- Cache system refinement
- Continue/resume interrupted captures
- Update existing mirrors
- Error recovery improvements
- Cookie support
- Proxy support
- Custom headers
- Interactive auth capture
- Device emulation improvements
- Homebrew formula
- Docker image
- GitHub Actions integration
- Full programmatic API
- Comprehensive test suite
MIT License