This repository has been archived by the owner on May 26, 2022. It is now read-only.

Forever to parse 150,000 row file #585

Closed
Kryptonit3-zz opened this issue Sep 3, 2018 · 9 comments

@Kryptonit3-zz

Kryptonit3-zz commented Sep 3, 2018

Is there any way I can optimize reading this file?

By the way, this is a Laravel application. data_fill() is a Laravel-specific helper; it's just part of my PHP array-building logic.

https://github.com/laravel/framework/blob/3414dcfcbe27cf0f4deee0670f022983e8016392/src/Illuminate/Support/helpers.php#L427
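
For anyone not familiar with the helper: data_fill() writes a value at a dot-notation path only when nothing exists there yet. A minimal standalone illustration (the array contents here are invented for the example):

// data_fill() only writes when the dot-notation path is missing;
// existing values are left untouched.
$data = ['techs' => ['9448' => ['counts' => ['7 Status' => ['MODEM' => 3]]]]];

// Path already exists: the existing count of 3 is preserved.
data_fill($data, 'techs.9448.counts.7 Status.MODEM', 1);

// Path is missing: intermediate arrays are created and the value is set.
data_fill($data, 'techs.9448.counts.T Status.RECEIVER', 1);

// $data['techs']['9448']['counts'] is now:
// ['7 Status' => ['MODEM' => 3], 'T Status' => ['RECEIVER' => 1]]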

I took the 150,000-row, 16-column file and chopped it down to a much smaller sample of 4,261 rows, which takes about 10-15 seconds to process. The complete file takes minutes (I had to heavily modify nginx and PHP to allow for this).

Here is a blackfire.io report of the smaller file - https://blackfire.io/profiles/c4087f40-dd5c-42ed-9258-3c6d5a1ace51/graph

It looks like it is reading the smaller file's 68,176 cells (4,261 rows × 16 columns) multiple times: 349,380 reads in total (the darker red boxes).

Spreadsheets attached.

spreadsheets.zip

Here is the code I am using to process the file(s). I have hard-coded some parameters for testing; normally these would be request variables to allow for different column selection based on the header row.

public function report02()
    {
        ini_set('max_execution_time', '400');

        // Time how long script takes to run
        $executionStartTime = microtime(true);

        // Imports assumed at the top of the file:
        // use Box\Spout\Reader\ReaderFactory;
        // use Box\Spout\Common\Type;
        $reader = ReaderFactory::create(Type::XLSX); // for XLSX files

        $reader->open(storage_path('smaller.xlsx'));

        $headers = [];
        $header_tech = strtoupper('INSTALLER NBR');
        $header_tech_index = 0;
        $header_equipment_type = strtoupper('ITEM');
        $header_equipment_type_index = 0;
        $header_equipment_sn = strtoupper('SERIAL NUMBER');
        $header_equipment_sn_index = 0;
        $header_equipment_status = strtoupper('EQUIP STS');
        $header_equipment_status_index = 0;
        $header_equipment_age = strtoupper('DAYS ASSIGNED');
        $header_equipment_age_index = 0;
        $show_old_equipment_limit_in_days = 21;
        // convert comma-delimited tech numbers to an array and trim whitespace
        $techs = array_map('trim', explode(",", '9448,69091,69165,69327,69430,69445,69449,69711,70056'));
        $data = [
            'old_equipment' => [
                'total' => 0
            ]
        ];

        foreach ($reader->getSheetIterator() as $sheet) {
            foreach ($sheet->getRowIterator() as $row => $values) {
                // Skip empty rows (e.g. the blank row at the end of the file)
                $values2 = array_map('trim', $values);
                if (! empty($values2)) {
                    // Header row position
                    if ($row == 2) {
                        $headers = $values2;
                        $header_tech_index = array_search($header_tech, $headers);
                        $header_equipment_type_index = array_search($header_equipment_type, $headers);
                        $header_equipment_sn_index = array_search($header_equipment_sn, $headers);
                        $header_equipment_status_index = array_search($header_equipment_status, $headers);
                        $header_equipment_age_index = array_search($header_equipment_age, $headers);
                    }

                    // Data after header
                    if ($row > 2) {
                        // This row contains data for a tech we requested
                        if (in_array($values2[$header_tech_index], $techs)) {
                            // 7 Status
                            if ($values2[$header_equipment_status_index] == 7) {
                                if (array_has($data, 'techs.' . $values2[$header_tech_index] . '.counts.7 Status.' . $values2[$header_equipment_type_index])) {
                                    $data['techs'][$values2[$header_tech_index]]['counts']['7 Status'][$values2[$header_equipment_type_index]]++;
                                } else {
                                    data_fill($data, 'techs.' . $values2[$header_tech_index] . '.counts.7 Status.' . $values2[$header_equipment_type_index], 1);
                                }

                                // All tech equipment
                                data_fill($data, 'techs.' . $values2[$header_tech_index] . '.equipment.7 Status.' . $values2[$header_equipment_sn_index], [
                                    'type' => $values2[$header_equipment_type_index],
                                    'age' => $values2[$header_equipment_age_index],
                                ]);
                            }

                            // T Status
                            if ($values2[$header_equipment_status_index] == 'T') {
                                if (array_has($data, 'techs.' . $values2[$header_tech_index] . '.counts.T Status.' . $values2[$header_equipment_type_index])) {
                                    $data['techs'][$values2[$header_tech_index]]['counts']['T Status'][$values2[$header_equipment_type_index]]++;
                                } else {
                                    data_fill($data, 'techs.' . $values2[$header_tech_index] . '.counts.T Status.' . $values2[$header_equipment_type_index], 1);
                                }

                                // All tech equipment
                                data_fill($data, 'techs.' . $values2[$header_tech_index] . '.equipment.T Status.' . $values2[$header_equipment_sn_index], [
                                    'type' => $values2[$header_equipment_type_index],
                                    'age' => $values2[$header_equipment_age_index],
                                ]);
                            }

                            // X Days or older
                            if ($values2[$header_equipment_age_index] >= $show_old_equipment_limit_in_days && ($values2[$header_equipment_status_index] !== 'L')) {
                                data_fill($data, 'old_equipment.techs.' . $values2[$header_tech_index] . '.' . $values2[$header_equipment_sn_index], [
                                        'type' => $values2[$header_equipment_type_index],
                                        'status' => $values2[$header_equipment_status_index],
                                        'age' => $values2[$header_equipment_age_index],
                                    ]);

                                $data['old_equipment']['total']++;
                            }
                        }
                    }
                }
                unset($values2);
            }
        }

        $reader->close();

        $executionEndTime = microtime(true);
        $data['runtime_in_seconds'] = round($executionEndTime - $executionStartTime, 2);

        return response()->json($data, 200);
    }

Screenshot of smaller file output


Thanks for any help I receive.

@Kryptonit3-zz
Author

I also made a custom function bypassing my array-building logic. Same slow times.

ini_set('max_execution_time', '400');

// Time how long the script takes to run
$executionStartTime = microtime(true);

$reader = ReaderFactory::create(Type::XLSX); // for XLSX files

$reader->open(storage_path('spreadsheets/smaller.xlsx'));

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row => $values) {
        // empty loop body: isolates the reader's own iteration cost
    }
}

$reader->close();

$executionEndTime = microtime(true);
$data = round($executionEndTime - $executionStartTime, 2);

return json_encode(['runtime_in_seconds' => $data]);

Smaller: 4.55 seconds
Larger (original): 184.76 seconds

@Kryptonit3-zz
Author

I would love to figure out why this library takes so long, but while perusing solutions I found another library - https://github.com/akeneo-labs/spreadsheet-parser - with which I was able to parse my larger file in 60-80 seconds. Any idea what that library is doing differently? It is dramatically faster.
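
For reference, usage of that library looks roughly like this, going by its README (the exact method names are from memory, so treat them as an assumption, and the file path is hypothetical):

use Akeneo\Component\SpreadsheetParser\SpreadsheetParser;

// Same empty-loop benchmark as above, but with akeneo's parser.
$workbook = SpreadsheetParser::open(storage_path('spreadsheets/smaller.xlsx'));

// Iterate the first worksheet; $values is a plain array of cell values,
// much like Spout's row iterator yields.
foreach ($workbook->createRowIterator(0) as $rowIndex => $values) {
    // empty loop body, for timing only
}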

@adrilo
Collaborator

adrilo commented Sep 4, 2018

Hi @Kryptonit3,

This is indeed strange behavior. It should definitely not take that long to read the large file...
The 349,380 calls are normal: they correspond to the number of XML nodes the reader stepped through, and these nodes are not only "value nodes".

I'll try to investigate why it takes so long.
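
To make that concrete, here is a rough way to count total nodes versus value nodes yourself. This is not Spout's code, just a sketch using PHP's XMLReader against the standard XLSX sheet layout (the file path is hypothetical):

$reader = new XMLReader();
$reader->open('zip:///path/to/smaller.xlsx#xl/worksheets/sheet1.xml');

$totalNodes = 0;
$valueNodes = 0;

// XMLReader::read() advances one XML node at a time; a worksheet's XML
// contains many nodes besides cell values (row/cell elements, text,
// whitespace).
while ($reader->read()) {
    $totalNodes++;
    // In an XLSX sheet, actual cell values live inside <v> elements.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'v') {
        $valueNodes++;
    }
}

$reader->close();

// $totalNodes comes out several times larger than $valueNodes, which is
// consistent with the 349,380 calls in the Blackfire profile.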

@Kryptonit3-zz
Author

Thanks. I was looking at your code and theirs, and it looks like they use XMLReader::open while I think you use ZipArchive. Could that explain the speed difference?

@adrilo
Collaborator

adrilo commented Jan 13, 2019

Spout uses a combination of ZipArchive and XMLReader::open. The former is used to get handles to the XML files within the XLSX, while the latter is used to actually read their content. So nothing weird here.
Sorry, I don't have much time to investigate, but I'll keep the issue open in case someone wants to take a look (or for when I have more time).
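
In sketch form, that division of labor looks something like this (not Spout's actual implementation, and the path is hypothetical):

$xlsxPath = '/path/to/smaller.xlsx';

// ZipArchive: enumerate the archive to locate the worksheet entries.
$zip = new ZipArchive();
$zip->open($xlsxPath);

$sheetEntries = [];
for ($i = 0; $i < $zip->numFiles; $i++) {
    $entryName = $zip->getNameIndex($i);
    if (strpos($entryName, 'xl/worksheets/') === 0) {
        $sheetEntries[] = $entryName;
    }
}
$zip->close();

// XMLReader: stream each worksheet's XML node by node, without
// inflating the whole file into memory.
foreach ($sheetEntries as $entryName) {
    $reader = new XMLReader();
    $reader->open('zip://' . $xlsxPath . '#' . $entryName);
    while ($reader->read()) {
        // row/cell handling happens here
    }
    $reader->close();
}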

@agolovenkin

Looks like #617

@neverender24

Any update on this?

@adrilo
Collaborator

adrilo commented May 14, 2021

With #763 being merged, you may now get better results.
Please re-open the issue if that's still not the case.

@adrilo adrilo closed this as completed May 14, 2021
@yo8yo

yo8yo commented Jun 11, 2021

A file with ~38,000 rows and 79 columns takes ~400s with just an empty loop like the one @Kryptonit3-zz posted above.
Is that normal, or does this issue persist?
