Google Summer of Code/2017/Project Ideas/Nominatim postcode project
Objective
Improve Nominatim database table structure which will result in better search results for queries containing postcodes.
Also tracked in github issue #403 where further issues and examples are linked.
How it works currently
During database initial import Nominatim checks addresses for postcode values and generates postcode places (table record) in the placex
table (osm_type=P, class=place, type=postcocde
). Each place is linked to a parent place, this builds a hierarchy for every address. Streets are linked to postcodes and postcodes to cities (parent_place_id
column).
This approach has drawbacks
- often the postcode exists only as single node (point)
- postcode records are not updated when addresses change in the database
- in most countries postcodes follow their own hierarchy defined by the country's post office administration. It's not usual that a postcode goes about city boundaries. Postcodes can be smaller than a city or contain multiple cities.
- streets (in some countries even buildings) can belong to more than one postcode
The new approach should be
- place postcode records in a separate table
- make it possible to update that table
- don't link places to postcodes. Instead during query-time ignore postcodes in the query and check all found addresses against the postcode (distance) later
Remember: OpenStreetMap data can be imperfect, absent, unclean (typos) and we support 200 countries and 100+ languages. Try to improve Nominatim's algorithms without changing OSM data itself.
Rough outline of steps involved
Create separate table
Postcode records are currently in the placex
table
osm_type=p class=place type=postcode osm_id=seq_postcodes always only centroid
A new table could look like
country_code postcode prefix (boolean) centroid geometry
Centroid and geometry columns are types of the PostgreSQL extension.
The maximum length of postcodes needs to be determined. United States has up to 10 characters (12345-6789
) for example.
Write to new table during initial import
Currently that happens in the calculate-postcodes
step in utils/setup.php
. It is run once during the initial import.
The logic should move into a (testable) PHP library.
Write the postcode, but also the derived prefixes into the table.
Change forward search logic to query that table
Currently that happens in lib/Geocode.php
where a postcode is another phrase gets checked in the words table. Addresses are linked to postcodes and postcodes are treated as having a certain place rank.
Instead we should ignore the postcode in the query (remove it, treat it different), try to find places (place candidates) as before and do a final check which of the places is near the postcodes.
Ideally if the postcode cannot be found we look at its prefixes as well. For example for the GB postcode SW14 4RX
it would be great to check the nearby SW14 4
.
Write tests
Apart from the PHP test suite Nominatim also uses Python behave test suite to evaluate results of queries against the API.
Write during incremental update
Currently not done.
We were thinking about a PHP script that regularly updates the data. Scale (number of database transactions, data size) needs to be considered. Eventually that logic should move into PL/SQL (see functions.sql
). PL/SQL is to complex (and harder to test) to be part of this project proposal.
Optional: Update GB postcodes
On nominatim.org we supply a pre-generated list of GB postcodes. It hasn't been updated in a while. Write a PHP script that converts Ordnance Survey Code-Point Open into an intermediate format, e.g. postcode|lat|lon
or JSON/GeoJSON, and as a second step into SQL files that write to the new table.
Ideas for PHP methods to handle postcodes
After the generic table and updates works it makes sense to move logic into testable methods. Below ideas how that could be structured. A lot of the logic will be country specific and should be easy to change by local interested mappers. Ideally via configuration files per country. Don't write logic for each country yourself, too time-consuming.
- A method to normalize postcodes. For example in
NL-1000
theNL
is the country and can be removed. For other countries there might be prefixes, optional spaces, dashes.
- A method to deal with OSM values containing multiple postcodes. The usual though rare convention is split by semicolon, e.g.
1067 JR;1067JR
(that's a special case because both should be normalized to the same string and thus be treated as duplicate).
- A method to check if a string follows the postcode format of that country. E.g. French postcodes are 5 digits. Anything not 5 digits (after normalizing) should be refused.
- A method that recognizes a postcode inside a string. Given a full address
17 Rue du Médecin-Colonel Calbairac, 31000 Toulouse, France
it would find31000
. We may be able to recognize the country already (or the user has specified that) so that can be given as hint to the method.
- A method to derive a prefix from a postcode. Sometimes called area postcode or district. For example for
1234
in The Netherlands that's12
. In Spain it's the first two digits. In the United Kingdom the part before the space (actually there are more complicated rules). Some postcodes might have zero, others two prefixes.
- A method to combine multiple postcode coordinates into an area (geometry, polygon). For example if 10 addresses share the same postcode we want to create a boundary and a center point for that postcode. Ideally this has outlier detection, we don't want a typo in an address to create a boundary that covers half the country. Checks what is considered an outlier should probably be its own method so it can be tweaked for different countries.
- PHP tests for each. Make it easy to add more test cases with little to no coding (e.g. configuration files).
Postcode coverage in worldwide Nominatim database
SELECT calculated_country_code AS iso, count(*) AS c FROM placex WHERE class='place' AND type='postcode' AND osm_type='P' GROUP BY calculated_country_code ORDER BY c DESC LIMIT 10;
iso | c ------------ nl | 462088 gb | 114690 (excluding the 'gb_postcodes' table) us | 80433 ca | 26967 ro | 23586 br | 21897 ru | 18358 pl | 16682 ua | 15791 jo | 11140 de | 8475 jp | 7950 fr | 7174 in | 7117 pt | 6277 se | 5583 es | 5148 sg | 4924 it | 4015 mx | 3786