-
-
Notifications
You must be signed in to change notification settings - Fork 119
Fix handling of global discussions files from planet.osm.org #388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… as osm.bz2 files of all changesets with discussions (e.g. discussions-250519.osm.bz2 from planet.osm.org)
…for comment management to prevent corrupted memory access when the size of comments go larger and triggers buffer reallocation handling (e.g. changeset with id 62278154).
When I look at the discussions-250519.osm.bz2 I do get an error message. But your changes don't fix that for me. What tool did you run and what error message did you get? |
@joto The code (simplified from my project and tested): #include <iostream>
#include <string>
#include <cstdio>
#include <functional>
#include "osmium/handler.hpp"
#include "osmium/visitor.hpp"
#include "osmium/io/any_input.hpp"
#include "osmium/osm/entity_bits.hpp"
#include "osmium/util/progress_bar.hpp"
class FastChangesetHandler : public osmium::handler::Handler {
private:
volatile long long counter = 0;
volatile long long sum_ids = 0;
volatile long long sum_changes = 0;
volatile long long total_comments = 0;
volatile long long total_comment_length = 0;
std::function<void()> progress_callback;
public:
void changeset(const osmium::Changeset& changeset) {
counter++;
// Do actual work to prevent optimization
sum_ids += changeset.id();
sum_changes += changeset.num_changes();
// Process discussions to prevent optimization
long long comment_count = 0;
for (auto &comment : changeset.discussion()) {
// Prevent optimization by accessing comment properties
total_comment_length += strlen(comment.user());
total_comment_length += strlen(comment.text());
comment_count++;
// Safety check to prevent segfault as mentioned in original code
if (comment_count >= changeset.num_comments()) {
break;
}
}
total_comments += comment_count;
// Update progress bar
if (progress_callback) {
progress_callback();
}
}
void attach_progress_callback(std::function<void()> callback) {
progress_callback = callback;
}
void print_summary() {
printf("\nFinal Summary:\n");
printf("==============\n");
printf("Processed changesets: %lld\n", counter);
printf("Sum of changeset IDs: %lld\n", sum_ids);
printf("Sum of changes: %lld\n", sum_changes);
printf("Total discussion comments: %lld\n", total_comments);
printf("Total comment text length: %lld characters\n", total_comment_length);
if (counter > 0) {
printf("Average comments per changeset: %.2f\n", (double)total_comments / counter);
printf("Average comment length: %.2f characters\n",
total_comments > 0 ? (double)total_comment_length / total_comments : 0.0);
}
}
};
int main(int argc, char* argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: %s <osm-file>\n", argv[0]);
return 1;
}
std::string filename = argv[1];
try {
// Disable stdio synchronization for faster operation
std::ios_base::sync_with_stdio(false);
// Open the OSM file
osmium::io::File infile(filename);
// Create reader - only read changesets
osmium::io::Reader reader(infile, osmium::osm_entity_bits::all);
// Setup progress bar
osmium::ProgressBar progress_bar{reader.file_size(), true};
// Create our fast handler
auto handler = std::make_unique<FastChangesetHandler>();
long long processed = 0;
handler->attach_progress_callback([&processed, &progress_bar, &reader]() {
if (++processed % 100== 0) {
progress_bar.update(reader.offset());
}
});
printf("Processing changesets from: %s\n", filename.c_str());
// Apply the handler to iterate through changesets
osmium::apply(reader, *handler);
progress_bar.done();
reader.close();
// Print final summary
handler->print_summary();
} catch (const std::exception& e) {
fprintf(stderr, "Error: %s\n", e.what());
return 1;
}
return 0;
} With the current version of libosmium:
After patching bzip2_compression.hpp:
After patching osm_object_builder.hpp the file can be processed correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now been able to verify this. The bzip2 fixes are fine, but I have some comments for the ChangesetDiscussionBuilder.
@joto Thanks for your comments! The code has been updated accordingly. |
Thanks @yarray for finding and fixing this. |
This PR fixes critical issues when processing global discussions files distributed by planet.osm.org, which contain changesets with comments. While libosmium can handle a large portion of the changesets, two issues would happen for some changesets.
Problem
Global discussions files (changesets with comments) from planet.osm.org such as discussions-20250519.osm.bz2 were not being processed successfully due to two issues:
The tested inpiut file is https://planet.osm.org/planet/2025/discussions-250519.osm.bz2. The first issue happened at about 21% of the processing. The second happened for changeset 62278154's 9th comment on my computer after resolving the first issue. Both
osmium::apply
andosmium::io::make_input_iterator_range
present the same behavior.Solution
1. Fix bzip2 multi-stream support
include/osmium/io/bzip2_compression.hpp
to handle concatenated streams2. Improve memory-safety of ChangesetDiscussionBuilder
include/osmium/builder/osm_object_builder.hpp
get_comment_ptr()
method that safely recalculates pointers after buffer reallocationsImpact
This fix enables reliable processing of planet.osm.org global discussions files by:
Testing
Tested with all current unit tests to ensure the fix does not break anything.