eml2pdf includes security features to protect users from potentially malicious
content embedded in email messages. These features are implemented in
eml2pdf/security.py and are enabled by default during PDF generation.
The primary security mechanism is the sanitize_html() function in
eml2pdf/security.py, which removes or neutralizes potentially dangerous
HTML elements and attributes before PDF generation.
The sanitizer removes tags that can execute code or load external active content:
- <script>: JavaScript execution
- <iframe>: Embedded frames that can load external content
- <object>: Embedded objects (Flash, Java applets, etc.)
- <embed>: Alternative embedding mechanism
- <video>: Video elements with potential codec vulnerabilities
- <audio>: Audio elements with potential codec vulnerabilities
Forms can be used to send data to external servers:
- <form>: Form elements that could submit data
Remote resources can be used to track email opens and gather information:
- Images with src attributes starting with http:// or https:// are removed
  - This prevents tracking pixels and web beacons
  - Only embedded images (using data URIs or CID references) are allowed
- Inline style attributes containing url() functions are cleared
  - Prevents loading of remote stylesheets, fonts, or background images
- Link href attributes starting with http:// or https:// are replaced with #
  - Prevents clickable links that could lead to malicious sites
  - Note: This doesn't remove the link text, it only neutralizes the destination
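The remote-resource rules above can be expressed as a few small predicates. The following is a minimal sketch, not the code in eml2pdf/security.py: the helper names (`is_remote`, `keep_image`, `clean_style`, `neutralize_href`) are hypothetical, and treating "cleared" as "set to an empty string" is an assumption.

```python
from typing import Optional

# Prefixes the document names as "remote".
REMOTE_PREFIXES = ("http://", "https://")


def is_remote(url: str) -> bool:
    """Return True when the URL starts with http:// or https://."""
    return url.startswith(REMOTE_PREFIXES)


def keep_image(src: Optional[str]) -> bool:
    """Keep an image unless its src points at a remote resource."""
    return not is_remote(src or "")


def clean_style(style: str) -> str:
    """Clear the whole inline style if it contains a url() function."""
    return "" if "url(" in style else style


def neutralize_href(href: str) -> str:
    """Replace a remote link target with '#', leave other targets alone."""
    return "#" if is_remote(href) else href
```

Under these assumptions, `keep_image("cid:logo")` is true while `keep_image("https://tracker.example/p.gif")` is false, and a `mailto:` link passes through `neutralize_href` unchanged because only http:// and https:// targets are rewritten.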
Event handlers and data attributes can be exploited for XSS attacks:
- All attributes starting with on are removed (e.g., onclick, onload, onerror)
  - Prevents JavaScript execution via event handlers
- All attributes starting with data- are removed
  - Prevents storage of potentially malicious data for later exploitation
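As an illustration of the attribute rules above, the filter can be written as a single dictionary comprehension. This is a sketch under assumptions: `filter_attributes` is a hypothetical name, and the lower-casing step is a defensive choice of this example rather than documented eml2pdf behavior.

```python
from typing import Dict


def filter_attributes(attrs: Dict[str, str]) -> Dict[str, str]:
    """Drop event-handler (on*) and data-* attributes, keep everything else."""
    return {
        name: value
        for name, value in attrs.items()
        # HTML parsers usually lower-case attribute names already;
        # lower() here is only a safeguard for hand-built dictionaries.
        if not name.lower().startswith(("on", "data-"))
    }
```

For example, passing `{"src": "cid:img1", "onerror": "steal()", "data-id": "42", "alt": "logo"}` leaves only the `src` and `alt` entries.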
Additional tags that can affect page behavior or load external resources:
- <meta>: Meta tags that could redirect or refresh
- <link>: Linked resources like stylesheets, icons, or prefetch hints
By default, all HTML content is sanitized before PDF generation:

    from eml2pdf import libeml2pdf

    # Sanitization is enabled by default (unsafe=False)
    libeml2pdf.process_eml(eml_path, output_dir)

Command-line:

    python -m eml2pdf input/ output/

The --unsafe flag disables HTML sanitization. This mode should only be used when:
- You completely trust the source of the EML files
- You understand the security risks involved
Warning: Unsafe mode may expose sensitive user information through tracking pixels, external resources, and other privacy-invasive techniques. If you need unsafe mode, consider running eml2pdf on an air-gapped machine.

    python -m eml2pdf input/ output/ --unsafe

Sanitization proceeds in four steps:
- HTML content is parsed using BeautifulSoup.
- Risky tags are found and completely removed with .decompose().
- Attributes are selectively filtered or modified.
- The sanitized HTML is converted back to a string.
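The steps above can be sketched with BeautifulSoup as follows. This is an illustrative reduction, not the actual sanitize_html(): the function name `sanitize_sketch` and the `RISKY_TAGS` constant are hypothetical, the tag list is assembled from the lists in this document, and only some of the attribute rules are shown.

```python
from bs4 import BeautifulSoup

# Tag names collected from the lists in this document.
RISKY_TAGS = ["script", "iframe", "object", "embed", "video", "audio",
              "form", "meta", "link"]


def sanitize_sketch(html: str) -> str:
    """Illustrative four-step sanitizer; not the real sanitize_html()."""
    # Step 1: parse the HTML.
    soup = BeautifulSoup(html, "html.parser")

    # Step 2: remove risky tags and everything inside them. Searching again
    # after each removal avoids touching tags that were already destroyed
    # as children of an earlier match.
    while (tag := soup.find(RISKY_TAGS)) is not None:
        tag.decompose()

    # Step 3: filter or modify attributes on the remaining tags.
    for tag in soup.find_all(True):
        for name in list(tag.attrs):
            if name.startswith("on") or name.startswith("data-"):
                del tag.attrs[name]
        if tag.name == "a" and tag.get("href", "").startswith(
                ("http://", "https://")):
            tag["href"] = "#"

    # Step 4: serialize the cleaned tree back to a string.
    return str(soup)
```

Calling `sanitize_sketch` on markup that contains a `<script>` block, an `onclick` handler, and a remote link returns HTML in which the script is gone, the handler is stripped, and the link points to `#` while its text is preserved.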
The sanitization occurs in the generate_pdf() function in eml2pdf/libeml2pdf.py:

    def generate_pdf(html_content: str, outfile_path: Path, infile: Path,
                     debug_html: bool = False, page: str = 'a4',
                     unsafe: bool = False):
        """Convert HTML to PDF."""
        # Sanitize unless the caller explicitly opted out with unsafe=True.
        if not unsafe:
            html_content = security.sanitize_html(html_content)
        # ... PDF generation continues ...

Unless unsafe is set, this ensures that all HTML content is sanitized immediately before PDF rendering, regardless of the source.
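To show how a command-line flag could feed the `unsafe` parameter, here is a minimal sketch. Everything in it is an assumption for illustration: the argument names `input_dir` and `output_dir`, the helpers `build_parser`, `fake_sanitize`, and `prepare_html`, and the use of argparse are not taken from eml2pdf/eml2pdf.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; the real one lives in eml2pdf/eml2pdf.py."""
    parser = argparse.ArgumentParser(prog="eml2pdf")
    parser.add_argument("input_dir")
    parser.add_argument("output_dir")
    # store_true means sanitization stays on unless --unsafe is passed.
    parser.add_argument("--unsafe", action="store_true",
                        help="disable HTML sanitization")
    return parser


def fake_sanitize(html: str) -> str:
    """Stand-in for security.sanitize_html(): removes one hard-coded
    script snippet so the effect of the gate is visible."""
    return html.replace("<script>alert(1)</script>", "")


def prepare_html(html: str, unsafe: bool = False) -> str:
    """Same gate as generate_pdf(): sanitize unless unsafe is True."""
    if not unsafe:
        html = fake_sanitize(html)
    return html
```

With this layout, `build_parser().parse_args(["input/", "output/"]).unsafe` is false, so the secure path is taken unless the user types `--unsafe` explicitly.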
The list above describes what the sanitizer does, and the code is available for review. Judge for yourself and use eml2pdf at your own discretion. Suggestions are more than welcome!
- Keep Default Security: Always use the default secure mode unless you have a specific, trusted use case
- Verify Sources: Only process EML files from trusted sources
- Sandbox Processing: Consider running eml2pdf in a sandboxed environment with no Internet access when processing untrusted emails
- Update Dependencies: Keep BeautifulSoup, WeasyPrint, and other dependencies up to date
- Security implementation: eml2pdf/security.py
- Security integration: eml2pdf/libeml2pdf.py
- CLI unsafe flag: eml2pdf/eml2pdf.py