Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupt entry using CharsetConverter on a large CSV with SJIS -> UTF-8 #553

Closed
earthiverse opened this issue Feb 5, 2025 · 6 comments
Closed
Assignees

Comments

@earthiverse
Copy link

Bug Report

Information Description
Version 9.20.1
PHP version 8.3.16
OS Platform Docker (Alpine) running on MacOS

Summary

$reader = Reader::createFromPath($srcFilepath);
CharsetConverter::addTo($reader, 'SJIS', 'UTF-8');

Has a corrupt entry for the CSV in this zip file from Japan Post after correctly processing the previous ~37664 rows.

Array
(
    [postcode] => 1030016
    [prefecture] => 東京都
    [city] => ??寞?,日本橋小網町"
    [district] => TOKYO TO
    [prefecture_romaji] => CHUO KU
    [city_romaji] => NIHOMBASHIKOAMICHO
    [district_romaji] => 
)

Expected result

$reader = Reader::createFromPath($srcFilepath);
$reader->addStreamFilter('convert.iconv.SJIS/UTF-8');

Does not produce a corrupt entry

Array
(
    [postcode] => 1030016
    [prefecture] => 東京都
    [city] => 中央区
    [district] => 日本橋小網町
    [prefecture_romaji] => TOKYO TO
    [city_romaji] => CHUO KU
    [district_romaji] => NIHOMBASHIKOAMICHO
)
@nyamsprod
Copy link
Member

@earthiverse thanks for using the library. Did you try with the latest release version 9.21.0 had a overhaul update around stream filters. I believe this is fixed in the last version.
Give it a try and let me know if the issue is still present.

@earthiverse
Copy link
Author

earthiverse commented Feb 10, 2025

Upgraded to League CSV 21.0, but CharsetConverter::addTo($reader, 'SJIS', 'UTF-8'); is still failing at the same entry.

$reader->addStreamFilter('convert.iconv.SJIS/UTF-8'); is still working, though 👍

Also, this is unrelated to the previous issue,
but if anyone is finding this through a search engine in the future, I had a different problem with the JP Post business postal codes, and solved it by using SHIFT_JIS-2004 instead of SJIS (one of the entries had an encoded roman numeral).


I found another project where we were doing something similar, but it has an additional mb_check_encoding

public function filter($in, $out, &$consumed, $closing): int
{
    $buffer = $this->buffer ?: '';

    while ($in_bucket = stream_bucket_make_writeable($in)) {
        $buffer .= $in_bucket->data;
        $consumed += $in_bucket->datalen;
    }

    $out_bucket = stream_bucket_new($this->tempStream, '');
    if (false === $out_bucket) {
        return PSFS_ERR_FATAL;
    }

    // If the buffer does not contain a valid string in the from encoding, we need to request more data
    // This can happen, for example, if the stream is chunked across a character boundary and only half
    // of the character is in the buffer.
    if (!mb_check_encoding($buffer, $this->from) && !$closing) {
        $this->buffer = $buffer;
        return PSFS_FEED_ME;
    }

    $out_bucket->data = @mb_convert_encoding($buffer, $this->to, $this->from);
    stream_bucket_append($out, $out_bucket);

    $this->buffer = '';
    return PSFS_PASS_ON;
}

@nyamsprod
Copy link
Member

@earthiverse I would appreciate if you could share the exact snippet that is triggering the issue on your end. As I was unable to reproduce your issue with both league versions using the following snippet 🤔

<?php

declare(strict_types=1);

use League\Csv\CharsetConverter;
use League\Csv\Reader;

require_once __DIR__ . '/vendor/autoload.php';

$dest = tmpfile();
$source = fopen("zip://path/to/japan_geo.zip#KEN_ALL.CSV", "r");
stream_copy_to_stream($source, $dest);

$document = Reader::createFromStream($dest);
CharsetConverter::addTo($document, 'SJIS', 'UTF-8');

$record = $document
    ->filter(fn (array $record): bool => ($record[2] ?? '') === '1030016')
    ->first();

var_dump($record);

It does not mean that there is no room to improve the stream filter it's just that I need a real case to validate the improvement I may add to the feature.

@earthiverse
Copy link
Author

I'm so sorry, I thought that was the file that was giving me errors, but it was a different CSV (still from JP Post).

https://www.post.japanpost.jp/zipcode/dl/roman/KEN_ALL_ROME.zip contains the CSV that fails to parse.

This script reproduces the issue for me:

<?php

use League\Csv\CharsetConverter;
use League\Csv\Reader;

require_once __DIR__ . '/../vendor/autoload.php';

// From https://www.post.japanpost.jp/zipcode/dl/roman/KEN_ALL_ROME.zip
$reader = Reader::createFromPath('KEN_ALL_ROME.CSV');

//$reader->appendStreamFilterOnRead('convert.iconv.SJIS/UTF-8');
CharsetConverter::addTo($reader, 'SJIS', 'UTF-8');

$num = 0;
foreach (
    $reader->getRecords([
        'postcode',
        'prefecture',
        'city',
        'district',
        'prefecture_romaji',
        'city_romaji',
        'district_romaji',
    ]) as $row
) {
    if ($row['district_romaji'] === null) {
        print_r($row);
        throw new Exception('Unable to parse district romaji for ' . $row['postcode'] . '!');
    }
    $num++;
    if ($num % 10000 === 0) {
        print_r('Parsed ' . $num . ' records...' . PHP_EOL);
    }
}

print_r('Parsed all ' . $num . ' records!' . PHP_EOL);

@nyamsprod
Copy link
Member

@earthiverse thanks for the new file. The fix has landed on the master branch you can test it there to validate if all is OK now! If it is you may close this ticket

@earthiverse
Copy link
Author

I can confirm that everything works using dev-master, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants