Skip to content

Fix handling of global discussions files from planet.osm.org #388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 3, 2025

Conversation

yarray
Copy link
Contributor

@yarray yarray commented Jul 1, 2025

This PR fixes critical issues when processing global discussions files distributed by planet.osm.org, which contain changesets with comments. While libosmium can handle a large portion of the changesets, two issues would happen for some changesets.

Problem

Global discussions files (changesets with comments) from planet.osm.org such as discussions-20250519.osm.bz2 were not being processed successfully due to two issues:

  1. Incomplete file reading: These files use multi-stream bzip2 compression, but libosmium was stopping after the first stream, causing incomplete XML tag error at some point.
  2. Memory corruption during parsing: Comments in changesets could cause invalid memory reading when the number of comments goes high.

The tested inpiut file is https://planet.osm.org/planet/2025/discussions-250519.osm.bz2. The first issue happened at about 21% of the processing. The second happened for changeset 62278154's 9th comment on my computer after resolving the first issue. Both osmium::apply and osmium::io::make_input_iterator_range present the same behavior.

Solution

1. Fix bzip2 multi-stream support

  • Modified bzip2 decompression in include/osmium/io/bzip2_compression.hpp to handle concatenated streams
  • When one stream ends, automatically attempt to open the next stream in the file
  • Only mark file as complete when no additional streams are available

2. Improve memory-safety of ChangesetDiscussionBuilder

  • Replaced unsafe pointer storage with offset-based tracking in include/osmium/builder/osm_object_builder.hpp
  • Added get_comment_ptr() method that safely recalculates pointers after buffer reallocations

Impact

This fix enables reliable processing of planet.osm.org global discussions files by:

  • Reading 100% of the global discussions data successfully (tested 20250519.osm.bz2).
  • Maintaining backward compatibility with existing code

Testing

Tested with all current unit tests to ensure the fix does not break anything.

yarray added 2 commits July 1, 2025 23:13
… as osm.bz2 files of all changesets with discussions (e.g. discussions-250519.osm.bz2 from planet.osm.org)
…for comment management to prevent corrupted memory access when the size of comments go larger and triggers buffer reallocation handling (e.g. changeset with id 62278154).
@joto
Copy link
Member

joto commented Jul 2, 2025

When I look at the discussions-250519.osm.bz2 I do get an error message. But your changes don't fix that for me.

What tool did you run and what error message did you get?

@yarray
Copy link
Contributor Author

yarray commented Jul 3, 2025

@joto The code (simplified from my project and tested):

#include <iostream>
#include <string>
#include <cstdio>
#include <functional>

#include "osmium/handler.hpp"
#include "osmium/visitor.hpp"
#include "osmium/io/any_input.hpp"
#include "osmium/osm/entity_bits.hpp"
#include "osmium/util/progress_bar.hpp"

class FastChangesetHandler : public osmium::handler::Handler {
private:
    volatile long long counter = 0;
    volatile long long sum_ids = 0;
    volatile long long sum_changes = 0;
    volatile long long total_comments = 0;
    volatile long long total_comment_length = 0;
    std::function<void()> progress_callback;
    
public:
    void changeset(const osmium::Changeset& changeset) {
        counter++;
        
        // Do actual work to prevent optimization
        sum_ids += changeset.id();
        sum_changes += changeset.num_changes();
        
        // Process discussions to prevent optimization
        long long comment_count = 0;
        for (auto &comment : changeset.discussion()) {
            // Prevent optimization by accessing comment properties
            total_comment_length += strlen(comment.user());
            total_comment_length += strlen(comment.text());
            comment_count++;
            
            // Safety check to prevent segfault as mentioned in original code
            if (comment_count >= changeset.num_comments()) {
                break;
            }
        }
        total_comments += comment_count;
        
        // Update progress bar
        if (progress_callback) {
            progress_callback();
        }
    }
    
    void attach_progress_callback(std::function<void()> callback) {
        progress_callback = callback;
    }
    
    void print_summary() {
        printf("\nFinal Summary:\n");
        printf("==============\n");
        printf("Processed changesets: %lld\n", counter);
        printf("Sum of changeset IDs: %lld\n", sum_ids);
        printf("Sum of changes: %lld\n", sum_changes);
        printf("Total discussion comments: %lld\n", total_comments);
        printf("Total comment text length: %lld characters\n", total_comment_length);
        if (counter > 0) {
            printf("Average comments per changeset: %.2f\n", (double)total_comments / counter);
            printf("Average comment length: %.2f characters\n", 
                   total_comments > 0 ? (double)total_comment_length / total_comments : 0.0);
        }
    }
};

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <osm-file>\n", argv[0]);
        return 1;
    }

    std::string filename = argv[1];
    
    try {
        // Disable stdio synchronization for faster operation
        std::ios_base::sync_with_stdio(false);
        
        // Open the OSM file
        osmium::io::File infile(filename);
        
        // Create reader - only read changesets
        osmium::io::Reader reader(infile, osmium::osm_entity_bits::all);
        
        // Setup progress bar
        osmium::ProgressBar progress_bar{reader.file_size(), true};
        
        // Create our fast handler
        auto handler = std::make_unique<FastChangesetHandler>();
        
        long long processed = 0;
        handler->attach_progress_callback([&processed, &progress_bar, &reader]() {
            if (++processed % 100== 0) {
                progress_bar.update(reader.offset());
            }
        });
        
        printf("Processing changesets from: %s\n", filename.c_str());

        // Apply the handler to iterate through changesets
        osmium::apply(reader, *handler);
        
        progress_bar.done();
        reader.close();
        
        // Print final summary
        handler->print_summary();
        
    } catch (const std::exception& e) {
        fprintf(stderr, "Error: %s\n", e.what());
        return 1;
    }

    return 0;
}

With the current version of libosmium:

Processing changesets from: /mnt/d/data/osm/discussions-250519.osm.bz2
Error: XML parsing error at line 5846904, column 1: unclosed token

After patching bzip2_compression.hpp:

Processing changesets from: /mnt/d/data/osm/discussions-250519.osm.bz2
[1]    128536 segmentation fault (core dumped)  ./build/importer /mnt/d/data/osm/discussions-250519.osm.bz2

After patching osm_object_builder.hpp the file can be processed correctly.

Copy link
Member

@joto joto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now been able to verify this. The bzip2 fixes are fine, but I have some comments for the ChangesetDiscussionBuilder.

@yarray
Copy link
Contributor Author

yarray commented Jul 3, 2025

@joto Thanks for your comments! The code has been updated accordingly.

@joto joto merged commit d2e2bcc into osmcode:master Jul 3, 2025
34 checks passed
@joto
Copy link
Member

joto commented Jul 3, 2025

Thanks @yarray for finding and fixing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants