-
-
Notifications
You must be signed in to change notification settings - Fork 339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Corrupt entry using CharsetConverter on a large CSV with SJIS -> UTF-8 #553
Comments
@earthiverse thanks for using the library. Did you try with the latest release version 9.21.0 had a overhaul update around stream filters. I believe this is fixed in the last version. |
Upgraded to League CSV 21.0, but
Also, this is unrelated to the previous issue, I found another project where we were doing something similar, but it has an additional public function filter($in, $out, &$consumed, $closing): int
{
$buffer = $this->buffer ?: '';
while ($in_bucket = stream_bucket_make_writeable($in)) {
$buffer .= $in_bucket->data;
$consumed += $in_bucket->datalen;
}
$out_bucket = stream_bucket_new($this->tempStream, '');
if (false === $out_bucket) {
return PSFS_ERR_FATAL;
}
// If the buffer does not contain a valid string in the from encoding, we need to request more data
// This can happen, for example, if the stream is chunked across a character boundary and only half
// of the character is in the buffer.
if (!mb_check_encoding($buffer, $this->from) && !$closing) {
$this->buffer = $buffer;
return PSFS_FEED_ME;
}
$out_bucket->data = @mb_convert_encoding($buffer, $this->to, $this->from);
stream_bucket_append($out, $out_bucket);
$this->buffer = '';
return PSFS_PASS_ON;
} |
@earthiverse I would appreciate if you could share the exact snippet that is triggering the issue on your end. As I was unable to reproduce your issue with both league versions using the following snippet 🤔 <?php
declare(strict_types=1);
use League\Csv\CharsetConverter;
use League\Csv\Reader;
require_once __DIR__ . '/vendor/autoload.php';
$dest = tmpfile();
$source = fopen("zip://path/to/japan_geo.zip#KEN_ALL.CSV", "r");
stream_copy_to_stream($source, $dest);
$document = Reader::createFromStream($dest);
CharsetConverter::addTo($document, 'SJIS', 'UTF-8');
$record = $document
->filter(fn (array $record): bool => ($record[2] ?? '') === '1030016')
->first();
var_dump($record); It does not mean that there is no room to improve the stream filter it's just that I need a real case to validate the improvement I may add to the feature. |
I'm so sorry, I thought that was the file that was giving me errors, but it was a different CSV (still from JP Post). https://www.post.japanpost.jp/zipcode/dl/roman/KEN_ALL_ROME.zip contains the CSV that fails to parse. This script reproduces the issue for me: <?php
use League\Csv\CharsetConverter;
use League\Csv\Reader;
require_once __DIR__ . '/../vendor/autoload.php';
// From https://www.post.japanpost.jp/zipcode/dl/roman/KEN_ALL_ROME.zip
$reader = Reader::createFromPath('KEN_ALL_ROME.CSV');
//$reader->appendStreamFilterOnRead('convert.iconv.SJIS/UTF-8');
CharsetConverter::addTo($reader, 'SJIS', 'UTF-8');
$num = 0;
foreach (
$reader->getRecords([
'postcode',
'prefecture',
'city',
'district',
'prefecture_romaji',
'city_romaji',
'district_romaji',
]) as $row
) {
if ($row['district_romaji'] === null) {
print_r($row);
throw new Exception('Unable to parse district romaji for ' . $row['postcode'] . '!');
}
$num++;
if ($num % 10000 === 0) {
print_r('Parsed ' . $num . ' records...' . PHP_EOL);
}
}
print_r('Parsed all ' . $num . ' records!' . PHP_EOL); |
@earthiverse thanks for the new file. The fix has landed on the master branch you can test it there to validate if all is OK now! If it is you may close this ticket |
I can confirm that everything works using |
Bug Report
Summary
Has a corrupt entry for the CSV in this zip file from Japan Post after correctly processing the previous ~37664 rows.
Expected result
Does not produce a corrupt entry
The text was updated successfully, but these errors were encountered: