JessicaCai-jca421/353

Project Topic: OSM, Photos, and Tours

The OpenStreetMap project collects community-provided map data that is free to use. The full data dump, known as planet.osm, provides all of the data from their maps in a slightly ugly XML format, ready for analysis.

The Idea

The OSM data set has a huge collection of things you might have seen while walking around the city: Canada Place, The Steam Clock, a bench, etc. Maybe you have walked by these and not even noticed. Wouldn't it be nice if they were pointed out to you?

We have previously worked with GPX data: the files produced by fitness trackers, GPS systems, or anything else that tracks movement with GPS signals (or related technology). The problem with GPX is that you have to create it. You are more likely to find geographic information in photographs naturally: Exif data in JPEG images can contain latitude and longitude as well, and most phones add it automatically.
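Exif stores each GPS coordinate as three values (degrees, minutes, seconds) plus a reference letter (N/S for latitude, E/W for longitude), so the first step with a photo is usually converting that to signed decimal degrees. A minimal sketch of that conversion follows. Reading the tags out of the JPEG itself needs a library such as Pillow or exifread and is not shown here; the function name and sample values are hypothetical.

```python
def dms_to_decimal(dms, ref):
    """Convert Exif-style (degrees, minutes, seconds) to signed decimal degrees.

    dms: three numbers (ints, floats, or Fraction-like rationals).
    ref: 'N', 'S', 'E', or 'W' from the matching Exif reference tag.
    """
    degrees, minutes, seconds = (float(x) for x in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in the usual decimal convention.
    return -decimal if ref in ("S", "W") else decimal


# Hypothetical values, roughly downtown Vancouver.
lat = dms_to_decimal((49, 17, 4.2), "N")
lon = dms_to_decimal((123, 6, 32.4), "W")
```

The resulting lat/lon pair is in the same form as the latitude and longitude fields in the amenities data, so the two can be compared directly.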

The challenge: Take a collection of geotagged photos representing my walk/tour/vacation. Give me a tour of the things I should have seen. Try to guess what is in the photos.

Provided Data

I have downloaded the planet.osm data and done some work to turn the monster of an XML file into more usable data, amenities-vancouver.json.gz in the provided data and code. The OSM data was turned into what I gave you with the code in that archive:

  1. Turned the monolithic XML file into a split file with its top-level elements one-per-line so they can sensibly be approached with Spark. (With disassemble-osm.py, producing the data in /courses/datasets/OpenStreetMap on our cluster's HDFS.)
  2. Processed the fragmented XML, keeping only nodes that are an "amenity" and saving them as a more appropriate JSON format. (With osm-amenities.py as a Spark job on the cluster.)
  3. Extracted only data that was roughly within Greater Vancouver. (With just-vancouver.py on the cluster.)

If you want to work with a different subset of the data, you can modify the code and repeat steps 2 or 3. You'd have to be insane to repeat step 1.

The data set is in JSON format, with fields for latitude, longitude, timestamp (when the node was edited), the amenity type (like "restaurant", "bench", "pharmacy", etc.), the name (like "White Spot", often missing), and a dictionary of any other tags in the entry.

In Pandas, the tags field will be loaded as a Python dictionary (mapping keys to values). In Spark, it will be loaded as a MapType string to string (see just-vancouver.py for a schema).
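As a rough illustration of the file's shape, here is a sketch that reads it with only the standard library. It assumes the file holds one JSON object per line (the usual form of Spark JSON output), which the text above does not state explicitly. The function name is hypothetical.

```python
import gzip
import json


def load_amenities(path):
    """Read a gzipped file of one-JSON-object-per-line into a list of dicts.

    Assumption: the file is line-delimited JSON; blank lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `load_amenities("amenities-vancouver.json.gz")` would then give a list of plain dictionaries, with the tags field as a nested dictionary. Under the same one-object-per-line assumption, `pandas.read_json(path, lines=True)` should produce the equivalent DataFrame.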

Other Data

The problem with the OSM data is probably that it's too complete. I don't need to know about every park bench I walked by. You must find the interesting things I passed to make this useful.

As a result, the provided data is likely insufficient to get good results on its own. You might be able to combine it with Wikidata information (e.g. the Steam Clock has the tag "wikidata" with the value "Q477663", referring to its Wikidata entry). Or with Wikipedia data (e.g. the Steam Clock has a "wikipedia" tag referring to its Wikipedia entry).

You may be able to do some clever processing on the OSM data to guess what attractions would be interesting to the user. It's possible that you can apply some heuristic to find interesting points (more-complete entries get more attention) or exclude boring ones (like park benches and infrastructure).
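One possible shape for such a heuristic is sketched below: drop a short list of boring amenity types, then score the rest by how complete the entry is. The boring list, the weights, and the field names ("amenity", "name", "tags") are all assumptions made for illustration, not values taken from the data or the course.

```python
# Hypothetical list of amenity types to ignore; tune against the real data.
BORING = {"bench", "waste_basket", "parking", "bicycle_parking"}


def interest_score(entry):
    """Return a rough 'interestingness' score for one amenity dict.

    Heuristic (an assumption, not a validated model):
      - boring amenity types score 0
      - each extra tag adds 1 (more-complete entries got more attention)
      - a 'wikidata' or 'wikipedia' tag adds 10
      - having a name adds 2
    """
    if entry.get("amenity") in BORING:
        return 0
    tags = entry.get("tags") or {}
    score = len(tags)
    if "wikidata" in tags or "wikipedia" in tags:
        score += 10
    if entry.get("name"):
        score += 2
    return score


# Hypothetical entries for illustration.
steam_clock = {
    "amenity": "clock",
    "name": "Steam Clock",
    "tags": {"wikidata": "Q477663", "tourism": "attraction"},
}
bench = {"amenity": "bench", "tags": {"backrest": "yes"}}
```

Sorting entries by this score and keeping the top few near each photo would be one way to separate likely attractions from street furniture, though the weights would need checking against real examples.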

Notes

Remember, this is supposed to be a larger-scale and more independent project. If you plan to re-purpose the idea from the exercises and do something like "find nearby stuff", that's not much of a project, and your mark will probably reflect that. We expect more creativity here to attack a more open-ended problem like this.
