User:Danysan/Sandbox/Opinionated Planet.osm
This page aims to be a brainstorming space for an opinionated distribution of Planet.osm.
Why
The openness, freedom and community focus of OSM are its strengths but they can make life harder for data consumers:
- inexperienced (and sometimes experienced) mappers often introduce errors into the map unintentionally
- it's easy to sneak in vandalism; while other mappers usually find and fix such edits quickly, the exceptions can be very problematic
- Deprecated features are usually not mass-updated to their suggested replacements right away, forcing consumers to check multiple tags to find the same data
- OSM's loose schema can make life harder for consumers, forcing them to check multiple undocumented and often non-homogeneous tags
- Good practice rules are suggested but not always enforced, which can lead to non-homogeneous data
It would be useful to simplify data consumers' lives by making available a distribution of the data that has been checked, cleaned, schema-normalised (and possibly enhanced) through opinionated filters and actions.
Who
This proposal was born from this OSM community thread and takes inspiration from Meta's Daylight Map Distribution Planet file. Given that a clean and safe dataset derived from OSM is in the best interest not only of Meta but of the whole OSM community and all OSM data consumers, this proposal aims to explore the feasibility, opportunities and obstacles of an in-house opinionated distribution of OSM data, in which all stakeholders of such a project could join forces.
What
Brainstorming of possible activities to execute on the data:
Wrong element removal
- removing unintentionally broken or intentionally vandalized elements
- identified based on name/description, geometry and latest editor track record
- Daylight Map Distribution already does significant work in this area (implementation details described here)
- Some research resources and datasets available:
- Nicolas Tempelmeier; Elena Demidova. “Attention-Based Vandalism Detection in OpenStreetMap”.
- Yinxiao Li; Jennings Anderson; Yiqi Niu. “Vandalism Detection in OpenStreetMap via User Embeddings”.
- Nicolas Tempelmeier; Elena Demidova. “Ovid: A Machine Learning Approach for Automated Vandalism Detection in OpenStreetMap”.
- OSM Name Vandalism Corpus (1k vandalic + 1M non-vandalic verified changeset comments) by Meta
- remove elements whose source=* tag suggests the use of invalid sources with incompatible licenses
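As a far weaker complement to the ML approaches cited above, part of this removal could start from naive lexical heuristics over name values. A minimal sketch, in which the patterns, the length threshold and the word list are purely illustrative assumptions:

```python
import re

# Placeholder word list; a real pipeline would use a curated corpus such as
# the OSM Name Vandalism Corpus mentioned above.
BLOCKLIST = {"xxx"}

def suspicious_name(name: str) -> bool:
    """Naive lexical red flags for likely name=* vandalism."""
    if re.search(r"(.)\1{4,}", name):     # long runs of a repeated character
        return True
    if name.isupper() and len(name) > 6:  # long all-caps names
        return True
    return any(word in name.lower() for word in BLOCKLIST)
```

Such heuristics only catch the crudest cases; anything subtler needs the classifier-based approaches from the papers above.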
Element editing to fixup tagging errors
- Remove tags with clearly impossible values (e.g. if the maximum speed limit in a country is 120 km/h, a maxspeed=300 on a highway=residential is likely to contain an extra 0)
- remove broken links in website=*, wikipedia=*, wikidata=* and wikimedia_commons=*
- remove wikidata=* links that are clearly wrong because they point to a person (the mapper likely used wikidata=* instead of subject:wikidata=* or something similar), a tree species, …
- fix coastlines to prevent the “flooding” effect when they get broken
- Evaluate using OSMCoastline
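A minimal sketch of the "impossible value" rule above, assuming per-highway-class speed caps; the cap values and the function name are illustrative, not an agreed policy (a real pipeline would derive caps per country):

```python
# Illustrative caps in km/h, NOT an agreed policy
MAXSPEED_CAP = {"residential": 120, "motorway": 150}

def drop_impossible_maxspeed(tags: dict) -> dict:
    """Return a copy of the tags with a clearly impossible maxspeed removed."""
    fixed = dict(tags)
    cap = MAXSPEED_CAP.get(fixed.get("highway", ""))
    value = fixed.get("maxspeed", "")
    if cap is not None and value.isdigit() and int(value) > cap:
        del fixed["maxspeed"]  # likely a typo such as an extra trailing 0
    return fixed
```

Dropping the tag (rather than guessing a corrected value) is the conservative choice here: the pipeline can detect implausibility reliably, but not the intended value.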
Schema normalization
- Move values from officially deprecated keys or tags to the respective substitution tag
- for example, manufacturer:type=foo=>model=foo and emergency=aed=>emergency=defibrillator
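These substitutions could be driven by replacement tables, one for key renames and one for whole-tag renames. A minimal sketch built from the two examples above (in practice the tables would be generated from the wiki's deprecated-features list; the function name is hypothetical):

```python
# Key renames: manufacturer:type=foo => model=foo
KEY_RENAMES = {"manufacturer:type": "model"}
# Whole-tag renames: emergency=aed => emergency=defibrillator
TAG_RENAMES = {("emergency", "aed"): ("emergency", "defibrillator")}

def normalize_tags(tags: dict) -> dict:
    """Apply key renames first, then whole-tag renames."""
    out = {}
    for key, value in tags.items():
        key = KEY_RENAMES.get(key, key)
        key, value = TAG_RENAMES.get((key, value), (key, value))
        out[key] = value
    return out
```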
Data enhancement
- restore elements removed by changesets highly likely to be vandalism
- an algorithm would be needed to decide for how long a suspect changeset should be held back before it is finally applied
- a very complex task: this would mean selectively reverting changesets; tools like osmium apply-changes would help, but it would still be complex and computationally expensive
- integration with OSM's schema of data from Wikidata in elements where wikidata=* is available (Wikidata entities are CC-0 licensed, compatible with ODBL)
- compilation of missing name=* languages (name:*=*) with internationalized labels from Wikidata
- similar to what Mapbox already does to its map data
- compilation of missing image=* or wikimedia_commons=* with Wikidata's images and/or category links to Wikimedia Commons
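The name:*=* compilation step could be sketched as follows: query the WDQS SPARQL endpoint for all labels of the linked entity and merge them in without ever overwriting mapper-set names. The endpoint is real, but the query template, User-Agent string and merge policy are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

def label_query(qid: str) -> str:
    # All labels of one entity, with their language codes
    return (
        f"SELECT ?lang ?label WHERE {{ wd:{qid} rdfs:label ?label . "
        "BIND(LANG(?label) AS ?lang) }"
    )

def fetch_labels(qid: str) -> dict:
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": label_query(qid), "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "opinionated-planet-poc/0.1 (illustrative)"})
    with urllib.request.urlopen(req) as resp:
        rows = json.load(resp)["results"]["bindings"]
    return {r["lang"]["value"]: r["label"]["value"] for r in rows}

def merge_names(tags: dict, labels: dict) -> dict:
    """Add name:<lang> only for missing languages; never overwrite mapper data."""
    out = dict(tags)
    for lang, label in labels.items():
        out.setdefault(f"name:{lang}", label)
    return out
```

One query per element would not scale to planet size; batching entities per request or falling back to a dump (see the access options below in the "How" section's Wikidata discussion) would be needed.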
How
Most basic rule-based checks could be executed with libraries like Osmium.
To handle the computing load more efficiently, a parallel MapReduce approach could be more appropriate, for example with libraries for Apache Spark such as Atlas (https://github.com/osmlab/atlas).
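As an illustration, such a rule can be written as a pure predicate over an element's tags, which plugs equally well into an Osmium handler callback or a Spark map/filter step. The -5..5 range below follows the OSM wiki convention for layer=*; the function name is hypothetical:

```python
def has_impossible_layer(tags: dict) -> bool:
    """Flag layer=* values outside the conventional -5..5 range."""
    layer = tags.get("layer", "0")
    try:
        return abs(int(layer)) > 5
    except ValueError:
        return True  # non-numeric layer values are suspect too
```

Keeping rules as side-effect-free functions like this makes the same rule set reusable across the single-machine (Osmium) and distributed (Spark) implementations.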
For some of the above tasks rule-based processing will not be enough and AI-powered tools will be needed (machine-learning-based classification, NLP models, ...). Daylight Map Distribution has publicly described some details of its ML-powered vandalism prevention pipeline (see its wiki page for details); given Meta's involvement in the OSMF, it would be great to see its participation in this project.
For tasks that require the intersection of OSM data with Wikidata or other resources, other libraries will need to be used (hypothesis: wikibrain). In general, Wikidata data can be accessed in one of three ways:
- Download a dump of the DB and do anything you want with it [1]
- high client cost (requires a lot of space, more than OSM), high availability, high bandwidth (once downloaded, access is extremely fast)
- Wikidata Query Service (WDQS), Wikidata's own SPARQL endpoint[2]
- very powerful query language, low client cost (no need to download the full DB), high server cost, low availability, low bandwidth (unfeasible for very large quantities of data; pagination would be needed)
- Linked Data Fragments (LDF) endpoint [3] [4]
- somewhere in between the other two options: low client cost (no need to download the full DB dump), extremely basic query language, high bandwidth
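Since LDF (Triple Pattern Fragments) only supports single triple patterns, a client request reduces to URL construction plus server-side pagination. A minimal sketch; the endpoint URL is an assumption based on Wikidata's documented LDF service:

```python
import urllib.parse

# Assumed Wikidata LDF endpoint; verify before relying on it
LDF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"

def tpf_url(subject: str = "", predicate: str = "", obj: str = "") -> str:
    """Build a URL for a single triple pattern; the server paginates results."""
    params = [(k, v) for k, v in
              (("subject", subject), ("predicate", predicate), ("object", obj))
              if v]
    return LDF_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

For example, `tpf_url(subject="http://www.wikidata.org/entity/Q64")` would request all triples about one entity; any join logic (the part SPARQL does server-side) has to be implemented client-side across many such requests, which is the trade-off behind "extremely basic query language, high bandwidth".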
A Proof of Concept implementation can be found at https://github.com/Danysan1/opinionated-planet .
When
TBD
Where
OSM infrastructure, details TBD