bug(gmail): ISO-2022-JP emails garbled — decodeBodyCharset re-decodes UTF-8 bytes #446

@Kazuaki-Tanaka

Summary

gog gmail read garbles emails whose MIME header declares charset="iso-2022-jp" (and likely other non-UTF-8 charsets).

Environment: gog v0.12.0, Linux arm64

Steps to Reproduce

  1. Receive an email with Content-Type: text/plain; charset="iso-2022-jp".
  2. Run gog gmail read <threadId> -a <account>.
  3. Body text is garbled (\ufffd replacement characters).

Root Cause

In internal/cmd/gmail_thread.go, decodeBodyCharset() checks the MIME charset and, if it's not UTF-8, re-decodes the bytes using that charset.

However, the Gmail API (format=full) always normalizes body.data to UTF-8 before base64url-encoding, while preserving the original MIME headers verbatim. So after decodeBase64URLBytes() the bytes are already valid UTF-8, but decodeBodyCharset re-interprets them as ISO-2022-JP via golang.org/x/text/encoding/ianaindex, producing garbage.
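The double decoding described above can be reproduced outside gog. The following is a minimal sketch in Python, not the Go code path itself: the sample string and variable names are made up for illustration, and the base64url payload only simulates what this issue reports the API returns (UTF-8 bytes, regardless of the declared charset).

```python
import base64

original = "こんにちは、世界"

# Simulated body.data: UTF-8 bytes, base64url-encoded (assumption taken from this issue).
api_data = base64.urlsafe_b64encode(original.encode("utf-8")).decode("ascii")

# Step 1: base64url-decode, as decodeBase64URLBytes() would.
raw = base64.urlsafe_b64decode(api_data)

# Correct handling: the bytes are already UTF-8.
correct = raw.decode("utf-8")

# Buggy handling: trust the MIME header and decode the same bytes as ISO-2022-JP.
# Bytes >= 0x80 are illegal in that 7-bit encoding, so they become U+FFFD.
garbled = raw.decode("iso2022_jp", errors="replace")

print(correct)
print(garbled)
```

The second print shows a run of replacement characters, which matches the symptom in step 3 of the reproduction.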

Proposed Fix

Add a utf8.Valid() guard before charset conversion:

import "unicode/utf8"

func decodeBodyCharset(data []byte, contentType string) []byte {
    charsetLabel := charsetLabelFromContentType(contentType)
    normalized := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(charsetLabel), "_", "-"))
    if charsetLabel == "" || normalized == "utf-8" || normalized == "utf8" {
        return data
    }
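    // Gmail API body.data is already UTF-8 even when the header declares
    // another charset; skip conversion so it is not decoded a second time.
    if utf8.Valid(data) {
        return data
    }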
    if decoded, ok := decodeWithCharsetLabel(data, charsetLabel); ok {
        return decoded
    }
    return data
}

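This is safe for Gmail API JSON responses (the normal path): the bytes are always valid UTF-8, so the redundant re-decoding is correctly skipped. On non-API paths, genuine Shift-JIS / EUC-JP raw bytes are almost never valid UTF-8, so the guard won't fire for them. One caveat: ISO-2022-JP is a 7-bit encoding (escape sequences plus ASCII-range bytes), so raw ISO-2022-JP bytes always pass utf8.Valid() and the guard would skip decoding them too. That only matters if some code path hands decodeBodyCharset() undecoded ISO-2022-JP bytes.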
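How the guard behaves on raw, non-API bytes can be checked per encoding. This is a small sketch in Python rather than Go; `is_valid_utf8` is a hypothetical stand-in for `utf8.Valid`, and the sample text is arbitrary.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for Go's utf8.Valid: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "日本語"

# Encode the same text in each charset and ask whether the guard would fire.
results = {
    name: is_valid_utf8(text.encode(name))
    for name in ("shift_jis", "euc_jp", "iso2022_jp", "utf-8")
}

for name, valid in results.items():
    print(f"{name:12} valid UTF-8: {valid}")
```

Shift-JIS and EUC-JP bytes fail the check, so they would still be converted. ISO-2022-JP bytes pass it because the encoding uses only 7-bit bytes, so raw ISO-2022-JP input would be returned unconverted by the guarded function.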

Affected Encodings

Confirmed: iso-2022-jp. Likely also: shift_jis, euc-jp, gb2312, gbk, euc-kr, windows-1252, iso-8859-1.
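The "likely also" list can be sanity-checked by simulating the same mistake for each label. This sketch uses Python's built-in codecs as an approximation of the Go decoders, so the exact garbage may differ from what gog prints; the sample text is arbitrary.

```python
text = "日本語のメール"
utf8_bytes = text.encode("utf-8")

# Charset labels as listed in this issue; Python accepts them as codec names.
labels = [
    "iso-2022-jp", "shift_jis", "euc-jp", "gb2312",
    "gbk", "euc-kr", "windows-1252", "iso-8859-1",
]

# Decode already-UTF-8 bytes with the charset named in the MIME header.
mangled = {label: utf8_bytes.decode(label, errors="replace") for label in labels}

for label, out in mangled.items():
    status = "garbled" if out != text else "intact"
    print(f"{label:13} {status}: {out!r}")
```

Single-byte charsets such as windows-1252 and iso-8859-1 tend to produce mojibake instead of replacement characters, since nearly every byte maps to some character.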

Workaround (Python script)

Use gog gmail read -j <threadId> (JSON mode) and decode the base64 body as UTF-8 manually:

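#!/usr/bin/env python3
"""Print the text/plain body of each message in a thread, decoded as UTF-8."""
import base64, json, subprocess, sys

def find_text(part):
    """Return the first text/plain part in a MIME tree, or None."""
    if 'text/plain' in part.get('mimeType', ''):
        return part
    for sub in part.get('parts', []):
        found = find_text(sub)
        if found:
            return found
    return None

# Usage: script.py <threadId> [account]
cmd = ['gog', 'gmail', 'read', '-j', sys.argv[1]]
if len(sys.argv) > 2:
    cmd += ['-a', sys.argv[2]]

# check=True surfaces a gog failure instead of a confusing JSON parse error.
data = json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)
for msg in data.get('thread', data).get('messages', []):
    hdrs = {h['name']: h['value'] for h in msg['payload'].get('headers', [])}
    print(f"From: {hdrs.get('From','')}  Date: {hdrs.get('Date','')}")
    print(f"Subject: {hdrs.get('Subject','')}\n")
    tp = find_text(msg['payload'])
    if tp and tp.get('body', {}).get('data'):
        # Extra '=' padding is harmless and covers unpadded base64url input.
        body = base64.urlsafe_b64decode(tp['body']['data'] + '==')
        print(body.decode('utf-8', errors='replace'))
    print('-' * 40)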
