This repository has been archived by the owner on May 26, 2022. It is now read-only.

Forever to parse 150,000 row file #585

Closed
Kryptonit3-zz opened this issue Sep 3, 2018 · 9 comments

@Kryptonit3-zz

Kryptonit3-zz commented Sep 3, 2018

Is there any way I can optimize reading this file?

By the way, this is a Laravel application. data_fill() is a Laravel-specific helper; it's just part of my PHP array-building logic.

https://github.com/laravel/framework/blob/3414dcfcbe27cf0f4deee0670f022983e8016392/src/Illuminate/Support/helpers.php#L427
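
For anyone not familiar with the helper: data_fill() writes a value at a dot-notation path only when nothing exists there yet. A minimal standalone illustration (the array contents here are invented for the example):

// data_fill() only writes when the dot-notation path is missing;
// existing values are left untouched.
$data = ['techs' => ['9448' => ['counts' => ['7 Status' => ['MODEM' => 3]]]]];

// Path already exists: the existing count of 3 is preserved.
data_fill($data, 'techs.9448.counts.7 Status.MODEM', 1);

// Path is missing: intermediate arrays are created and the value is set.
data_fill($data, 'techs.9448.counts.T Status.RECEIVER', 1);

// $data['techs']['9448']['counts'] is now:
// ['7 Status' => ['MODEM' => 3], 'T Status' => ['RECEIVER' => 1]]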

I took the 150,000-row, 16-column file and chopped it down to a much smaller sample of 4,261 rows, which takes about 10-15 seconds to process. The complete file takes minutes (I had to heavily modify nginx and PHP to allow for this).

Here is a blackfire.io report of the smaller file - https://blackfire.io/profiles/c4087f40-dd5c-42ed-9258-3c6d5a1ace51/graph

It looks like it is reading the smaller file's 68,176 cells (4,261 rows × 16 columns) multiple times: 349,380 reads in total (the darker red boxes).

Spreadsheets attached.

spreadsheets.zip

Here is the code I am using to process the file(s). I have hard-coded some parameters for testing; normally these would be request variables to allow for different column selection based on the header row.

public function report02()
    {
        ini_set('max_execution_time', '400');

        // Time how long script takes to run
        $executionStartTime = microtime(true);

        // Imports assumed at the top of the file:
        // use Box\Spout\Reader\ReaderFactory;
        // use Box\Spout\Common\Type;
        $reader = ReaderFactory::create(Type::XLSX); // for XLSX files

        $reader->open(storage_path('smaller.xlsx'));

        $headers = [];
        $header_tech = strtoupper('INSTALLER NBR');
        $header_tech_index = 0;
        $header_equipment_type = strtoupper('ITEM');
        $header_equipment_type_index = 0;
        $header_equipment_sn = strtoupper('SERIAL NUMBER');
        $header_equipment_sn_index = 0;
        $header_equipment_status = strtoupper('EQUIP STS');
        $header_equipment_status_index = 0;
        $header_equipment_age = strtoupper('DAYS ASSIGNED');
        $header_equipment_age_index = 0;
        $show_old_equipment_limit_in_days = 21;
        // convert comma-delimited tech numbers to an array and trim whitespace
        $techs = array_map('trim', explode(",", '9448,69091,69165,69327,69430,69445,69449,69711,70056'));
        $data = [
            'old_equipment' => [
                'total' => 0
            ]
        ];

        foreach ($reader->getSheetIterator() as $sheet) {
            foreach ($sheet->getRowIterator() as $row => $values) {
                // Skip empty rows (e.g. the blank row at the end of the file)
                $values2 = array_map('trim', $values);
                if (! empty($values2)) {
                    // Header row position
                    if ($row == 2) {
                        $headers = $values2;
                        $header_tech_index = array_search($header_tech, $headers);
                        $header_equipment_type_index = array_search($header_equipment_type, $headers);
                        $header_equipment_sn_index = array_search($header_equipment_sn, $headers);
                        $header_equipment_status_index = array_search($header_equipment_status, $headers);
                        $header_equipment_age_index = array_search($header_equipment_age, $headers);
                    }

                    // Data after header
                    if ($row > 2) {
                        // This row contains data for a tech we requested
                        if (in_array($values2[$header_tech_index], $techs)) {
                            // 7 Status
                            if ($values2[$header_equipment_status_index] == 7) {
                                if (array_has($data, 'techs.' . $values2[$header_tech_index] . '.counts.7 Status.' . $values2[$header_equipment_type_index])) {
                                    $data['techs'][$values2[$header_tech_index]]['counts']['7 Status'][$values2[$header_equipment_type_index]]++;
                                } else {
                                    data_fill($data, 'techs.' . $values2[$header_tech_index] . '.counts.7 Status.' . $values2[$header_equipment_type_index], 1);
                                }

                                // All tech equipment
                                data_fill($data, 'techs.' . $values2[$header_tech_index] . '.equipment.7 Status.' . $values2[$header_equipment_sn_index], [
                                    'type' => $values2[$header_equipment_type_index],
                                    'age' => $values2[$header_equipment_age_index],
                                ]);
                            }

                            // T Status
                            if ($values2[$header_equipment_status_index] == 'T') {
                                if (array_has($data, 'techs.' . $values2[$header_tech_index] . '.counts.T Status.' . $values2[$header_equipment_type_index])) {
                                    $data['techs'][$values2[$header_tech_index]]['counts']['T Status'][$values2[$header_equipment_type_index]]++;
                                } else {
                                    data_fill($data, 'techs.' . $values2[$header_tech_index] . '.counts.T Status.' . $values2[$header_equipment_type_index], 1);
                                }

                                // All tech equipment
                                data_fill($data, 'techs.' . $values2[$header_tech_index] . '.equipment.T Status.' . $values2[$header_equipment_sn_index], [
                                    'type' => $values2[$header_equipment_type_index],
                                    'age' => $values2[$header_equipment_age_index],
                                ]);
                            }

                            // X Days or older
                            if ($values2[$header_equipment_age_index] >= $show_old_equipment_limit_in_days && ($values2[$header_equipment_status_index] !== 'L')) {
                                data_fill($data, 'old_equipment.techs.' . $values2[$header_tech_index] . '.' . $values2[$header_equipment_sn_index], [
                                        'type' => $values2[$header_equipment_type_index],
                                        'status' => $values2[$header_equipment_status_index],
                                        'age' => $values2[$header_equipment_age_index],
                                    ]);

                                $data['old_equipment']['total']++;
                            }
                        }
                    }
                }
                unset($values2);
            }
        }

        $reader->close();

        $executionEndTime = microtime(true);
        $data['runtime_in_seconds'] = round($executionEndTime - $executionStartTime, 2);

        return response()->json($data, 200);
    }

Screenshot of smaller file output


Thanks for any help I receive.

@Kryptonit3-zz
Author

I also made a custom function bypassing my array-building logic. Same slow times.

ini_set('max_execution_time', '400');

// Time how long the script takes to run
$executionStartTime = microtime(true);

$reader = ReaderFactory::create(Type::XLSX); // for XLSX files

$reader->open(storage_path('spreadsheets/smaller.xlsx'));

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row => $values) {
        // empty loop body: isolates the reader's own iteration cost
    }
}

$reader->close();

$executionEndTime = microtime(true);
$data = round($executionEndTime - $executionStartTime, 2);

return json_encode(['runtime_in_seconds' => $data]);

Smaller: 4.55 seconds
Larger (original): 184.76 seconds

@Kryptonit3-zz
Author

I would love to figure out why this library takes so long, but while perusing solutions I found another library - https://github.com/akeneo-labs/spreadsheet-parser - with which I was able to parse my larger file in 60-80 seconds. Any idea what that library is doing differently? It is dramatically faster.
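
For reference, usage of that library looks roughly like this, going by its README (the exact method names are from memory, so treat them as an assumption, and the file path is hypothetical):

use Akeneo\Component\SpreadsheetParser\SpreadsheetParser;

// Same empty-loop benchmark as above, but with akeneo's parser.
$workbook = SpreadsheetParser::open(storage_path('spreadsheets/smaller.xlsx'));

// Iterate the first worksheet; $values is a plain array of cell values,
// much like Spout's row iterator yields.
foreach ($workbook->createRowIterator(0) as $rowIndex => $values) {
    // empty loop body, for timing only
}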

@adrilo
Collaborator

adrilo commented Sep 4, 2018

Hi @Kryptonit3,

This is indeed strange behavior. It should definitely not take that long to read the large file...
The 349,380 calls are normal: they correspond to the number of XML nodes the reader stepped through, and these nodes are not only "value nodes".

I'll try to investigate why it takes so long.
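
To make that concrete, here is a rough way to count total nodes versus value nodes yourself. This is not Spout's code, just a sketch using PHP's XMLReader against the standard XLSX sheet layout (the file path is hypothetical):

$reader = new XMLReader();
$reader->open('zip:///path/to/smaller.xlsx#xl/worksheets/sheet1.xml');

$totalNodes = 0;
$valueNodes = 0;

// XMLReader::read() advances one XML node at a time; a worksheet's XML
// contains many nodes besides cell values (row/cell elements, text,
// whitespace).
while ($reader->read()) {
    $totalNodes++;
    // In an XLSX sheet, actual cell values live inside <v> elements.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'v') {
        $valueNodes++;
    }
}

$reader->close();

// $totalNodes comes out several times larger than $valueNodes, which is
// consistent with the 349,380 calls in the Blackfire profile.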

@Kryptonit3-zz
Author

Thanks. I was looking at your code and theirs, and it looks like they use XMLReader::open while I think you use ZipArchive. Could that explain the speed difference?

@adrilo
Collaborator

adrilo commented Jan 13, 2019

Spout uses a combination of ZipArchive and XMLReader::open. The former is used to get handles to the XML files within the XLSX, while the latter is used to actually read their content. So nothing weird here.
Sorry, I don't have much time to investigate, but I'll keep the issue open in case someone wants to take a look (or for when I have more time).
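
In sketch form, that division of labor looks something like this (not Spout's actual implementation, and the path is hypothetical):

$xlsxPath = '/path/to/smaller.xlsx';

// ZipArchive: enumerate the archive to locate the worksheet entries.
$zip = new ZipArchive();
$zip->open($xlsxPath);

$sheetEntries = [];
for ($i = 0; $i < $zip->numFiles; $i++) {
    $entryName = $zip->getNameIndex($i);
    if (strpos($entryName, 'xl/worksheets/') === 0) {
        $sheetEntries[] = $entryName;
    }
}
$zip->close();

// XMLReader: stream each worksheet's XML node by node, without
// inflating the whole file into memory.
foreach ($sheetEntries as $entryName) {
    $reader = new XMLReader();
    $reader->open('zip://' . $xlsxPath . '#' . $entryName);
    while ($reader->read()) {
        // row/cell handling happens here
    }
    $reader->close();
}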

@agolovenkin

Looks like #617

@neverender24

Any update on this?

@adrilo
Collaborator

adrilo commented May 14, 2021

With #763 being merged, you may now get better results.
Please re-open the issue if that's still not the case.

@adrilo adrilo closed this as completed May 14, 2021
@yo8yo

yo8yo commented Jun 11, 2021

A file with ~38,000 rows and 79 columns takes ~400s with just an empty loop like the one @Kryptonit3-zz posted above.
Is that normal, or does this issue persist?
