Address import from RUIAN

From OpenStreetMap Wiki

About

The purpose of the import is to add missing address data addr=* and correct existing address data in the territory of the Czech Republic.
The data source for this import is described on the RUIAN page, where the import procedure is discussed as well.
As this is interesting mainly for Czech users, the main content of that page is in Czech only. Sorry.

Here are some links to the discussion about this import in the Czech OSM community (in Czech again):

https://lists.openstreetmap.org/pipermail/talk-cz/2014-February/009238.html
https://lists.openstreetmap.org/pipermail/talk-cz/2014-February/009284.html
https://lists.openstreetmap.org/pipermail/talk-cz/2014-February/009399.html

Goals

The import has these goals:

  • creating new address nodes that do not yet exist in OSM
  • correcting existing OSM addresses (mainly adding the sorely missing addr:place=*, but also adding or correcting all other addr=* tags)
  • establishing a link between each address point and RUIAN, the official Czech address database, by adding ref:ruian:addr=* and source:addr=cuzk:ruian, so that future corrections can be applied automatically
  • removing the ill-defined tag is_in=*, which has a strange, hard-to-use structure and seems to be obsolete. It also seems that 70% of all is_in=* tags on address nodes are in the Czech Republic.
  • (probably) adding addr:country=CZ to all addresses. This is still under discussion: some claim it wastes database space, others say it simply belongs in a proper address from an international point of view.

Schedule

The start was planned for January/February 2014. Due to an unexpectedly long debate over the import content, the import actually started in April 2014. There are almost 3,000,000 addresses in the RUIAN registry that need to be processed. Not all of them are correct, and those cannot be imported automatically. Approximately 2-5% of address nodes will need manual verification. The estimated duration of the import is several months; it depends heavily on the number and activity of our volunteers. As RUIAN is updated constantly, the bot will continue working after the main import is finished and will update the address nodes in OSM according to changes in RUIAN.

You can watch the progress of the import at http://ruian.poloha.net/czaddr/.

In 6/2019 the last municipality was imported: the village Kozojedy in Praha-Východ county. There were a lot of significant errors in the government database, which is why the import took years.

Currently the import continues as updates once or twice a month. About 3000 address places are updated every month (this includes created and deleted ones).

Import data

Background

The data, supplied by CUZK (Czech Cadastre), are in XML format.

They are available at RUIAN VDP.

You can download an example XML file for the town of Ceske Budejovice.

The XML files are then loaded into a PostgreSQL database using the ruian2pgsql utility.

The data are freely available to anyone by law, as they were created by a government agency using taxpayers' money. There is no copyright on the data. The basic registers were created under Law No. 111/2009 Sb. The possibility of extracting and using the data for OSM is based on § 62 of Law 111/2009 Sb. Sorry, we are not aware of any English translations of Czech laws.

OSM data

We create only address nodes. However, existing addresses (provided that we can match them automatically or manually to address nodes in RUIAN) are corrected on all OSM primitives: nodes, ways and relations. The position of already existing primitives will not be changed.

Import type

We will first import all the addresses in the Czech Republic as address nodes. After that, addresses will be added and updated continuously, following changes in our data source, RUIAN.

Data preparation

Generally, the manager of the import creates an OSM file with updates from RUIAN and a separate .csv file in which the suspicious places are highlighted. The volunteer operator then checks the OSM file in JOSM, makes changes if necessary and returns it to the manager. The manager uploads the OSM file to the OSM server and updates his internal records on the local server.

As there are many addresses in OSM already (on nodes, ways and relations), it is necessary to find, for every address node in RUIAN (referred to below as AM), the matching addresses on primitives in OSM, if there are any. There are mistakes and omissions on both sides, in RUIAN as well as in OSM. Therefore the probability of a correct match depends on the quality of the data in the given location. There are duplicated nodes in RUIAN (several AMs at one location or just a few centimeters apart). In some cases it is really just a duplication of data; in others these are different AMs placed at the same wrong location.

The whole script is written in plpgsql/postgis and is very dependent on the configuration of my own server and its content. It uses the RUIAN data schema (thanks to FordFrog for ruian2pgsql), the APIDB schema and the schema for Mapnik, which I use for preprocessing geometry. The APIDB contains complete OSM data. There is a special table on my server containing the geometries of the OSM primitives with addresses (ST_Union).

The first stage of pairing is searching for all OSM primitives with an address matching an AM from RUIAN. The script tries to find the best OSM primitive for each AM. As there is duplicated information in OSM as well as in RUIAN, the matching process has six stages. I search for addresses in a 100 m radius; for large buildings or complexes it is normal to find the AM even 80 m away.

  1. The AM and the OSM primitive must match in addr:conscriptionnumber=* (or addr:provisionalnumber=*), in addr:streetnumber=* and in addr:street=*. A bit of fuzzy logic is used in street name matching to compensate for different ways of writing names (e.g. Wolkerova vs. J. Wolkera).
  2. Same as 1., but without street name matching; the street name must be NULL in at least one of the two tables.
  3. Match in addr:conscriptionnumber=* (or addr:provisionalnumber=*). The street names in both tables must be NULL.
  4. Same as 3., but at least one street name must be NULL.
  5. Same as 3., but the street names are different; searching in an 8 m radius only.
  6. Any match of house number will do, but only in a 3 m radius.
  • Postprocessing I - searches for duplicated nodes and nodes which are suspiciously close to each other. I automatically eliminate from the import those AMs which are duplicated within a 5 m radius (matching house numbers and street names).
  • Postprocessing II - searches for OSM primitives which contain any address and are very close to the nodes in the data meant for import. These primitives are then included in the OSM file as well, so the operator has to decide in JOSM what to do with them.
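The six matching stages can be sketched in Python as follows. This is only an illustration, not the actual import script (which is written in plpgsql/postgis). The `Address` class, the function names, the 0.8 similarity threshold and the use of `difflib` as a stand-in for the fuzzy street name logic are all assumptions. Stage 6 is interpreted here as "either number component matches", which is an assumption too.

```python
import unicodedata
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


@dataclass
class Address:
    """Hypothetical minimal address record used for both RUIAN AMs and OSM primitives."""
    number: str                   # conscription or provisional number, e.g. "1367" or "ev.21"
    streetnumber: Optional[str]   # None when no street number is assigned
    street: Optional[str]         # None corresponds to NULL in the database


def _normalize(street: str) -> str:
    # Lower-case, strip diacritics and drop initials such as "J."
    text = unicodedata.normalize("NFKD", street.casefold())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(word for word in text.split() if not word.endswith("."))


def streets_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in for the fuzzy logic; the threshold is an assumed value.
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio() >= threshold


def match_stage(am: Address, osm: Address, distance_m: float) -> Optional[int]:
    """Return the first stage (1-6) at which the pair matches, or None."""
    if distance_m > 100:          # general search radius
        return None
    same_number = am.number == osm.number
    same_streetnumber = am.streetnumber == osm.streetnumber
    both_streets = am.street is not None and osm.street is not None
    no_streets = am.street is None and osm.street is None
    if same_number and same_streetnumber:
        if both_streets and streets_similar(am.street, osm.street):
            return 1
        if not both_streets:      # at least one street name is NULL
            return 2
    if same_number:
        if no_streets:
            return 3
        if not both_streets:
            return 4
        if not streets_similar(am.street, osm.street) and distance_m <= 8:
            return 5
    streetnumber_hit = am.streetnumber is not None and same_streetnumber
    if distance_m <= 3 and (same_number or streetnumber_hit):
        return 6
    return None


def best_match(am: Address, candidates: Iterable[Tuple[Address, float]]):
    """Pick the candidate with the lowest stage, then the smallest distance."""
    scored = []
    for osm, dist in candidates:
        stage = match_stage(am, osm, dist)
        if stage is not None:
            scored.append((stage, dist, osm))
    if not scored:
        return None
    stage, _, osm = min(scored, key=lambda item: item[:2])
    return stage, osm
```

Preferring the lowest stage first and the shortest distance second is a design choice of this sketch; the text above only states that the script tries to find the best OSM primitive for each AM.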

The outputs of Postprocessing I and Postprocessing II are tables with suspicious data which need to be checked manually. Finally, an OSM file is generated which can be loaded into the JOSM editor. When the operator has finished the verifications and corrections on the OSM file, he returns it to the manager of the import.
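As a rough illustration of the duplicate check in Postprocessing I, the following Python sketch flags address points that share a house number and street name and lie within 5 m of each other. The real check runs in PostGIS; the dictionary keys, the haversine distance and the pairwise comparison are assumptions made for this example, and flagging every member of a duplicated pair is one possible reading of the rule.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, adequate for distances of a few meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_duplicates(points, radius_m=5.0):
    """Return the ids of all points that have a same-address twin within radius_m.

    points: list of dicts with the (hypothetical) keys
            id, lat, lon, housenumber, street.
    """
    duplicated = set()
    for a, b in combinations(points, 2):
        if (a["housenumber"], a["street"]) != (b["housenumber"], b["street"]):
            continue
        if distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= radius_m:
            duplicated.update((a["id"], b["id"]))
    return duplicated
```

The pairwise loop is quadratic, which is fine for a single village; for a whole county a spatial index (as PostGIS provides) would be the natural choice.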

The manager then uploads the data to the OSM server. Of course, he records which data were uploaded to the OSM server and on what date. This makes it possible to watch for updates of these data in RUIAN and to update the data in OSM accordingly and automatically in the future.

There is a sample file available for download here [1]. The data in the sample file are in the state before the mandatory verification by a volunteer operator. They can be edited in an appropriate editor, e.g. JOSM.

Data reduction and simplification

As the new addresses are just nodes, no simplification is planned, except for possible duplicated addresses, which will be handled by the volunteer before the upload. We will remove some obsolete tags as well - see below.

Tagging plans

We will add new address nodes from RUIAN with the following tags (or replace their contents on already existing nodes, ways or relations), where relevant for the given address node:

Key Description Note
addr:conscriptionnumber=* Conscription number of a building Just a number
addr:provisionalnumber=* Provisional number of building Just a number
addr:streetnumber=* Street number of a building Number, optionally followed by a letter
addr:housenumber=* House number Format: conscriptionNo (or ev.provisionalNo) / streetNo if a street number is assigned to the building, otherwise only conscriptionNo (or ev.provisionalNo).
Examples: 1367, 1367/67, 2238/1a, ev.21, ev.21/1
addr:street=* Street Name of the street, if there is one.
addr:place=* Small village / City quarter Name of a small village (one without its own village council, belonging administratively to another village tagged addr:city=*) or of a city quarter. RUIAN and the Czech cadastre call this část obce.
Should never be used if the street has a name.
Essential tag for searching with Nominatim in places where there are no streets. Nominatim ignores addr:place=* anyway if the street name is specified.
Examples: Lhotka, Libív, Vysočany.
It is not possible to omit this tag, as addresses in small villages are not defined by boundaries but by a simple list of addresses.
Attention! In places where one cadastral area contains several districts (for example, Horšovský Týn), this tag was removed, which caused further damage (for example, duplicate pairs of addresses). In such cases we recommend importing this tag as soon as possible.
addr:suburb=* City suburb Administrative division of a city. Exists in large towns only.
Examples: Praha 9 or Plzeň 2-Slovany
It is not necessary to add this tag to all addresses, as we will create boundary polygons in OSM in the near future. This tag is added only in the rare cases where the address node lies outside its respective suburb boundary polygon.
addr:city=* Village / City Name of city, town, small town, military area or village (having a village council).
Examples: Brno, Jince, Brdy, Lhota.
It is not necessary to add this tag to all addresses, as there are boundary polygons in OSM (boundary=administrative + admin_level=8). This tag is added only in the rare cases where the address node lies outside its respective city boundary polygon.
addr:postcode=* ZIP code Number in format nnnnn without space.
Examples: 19000
addr:country=* Country CZ
It is unnecessary to add this tag as there are valid country border polygons in OSM.
ref:ruian:addr=*
or older ref:ruian=*
RUIAN id of address node Used for establishing a link with RUIAN for possible future automatic corrections of addresses by bot based on the correction data published by RUIAN
You can view the details using link: http://vdp.cuzk.cz/vdp/ruian/adresnimista/<ref:ruian:addr>
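The tagging rules in the table can be illustrated with a short Python sketch. It assembles addr:housenumber=* in the format described above, applies the rule that addr:place=* is not used when the street has a name, and strips the space from the postcode. The function names and parameters are hypothetical, and addr:suburb=*, addr:city=* and addr:country=* are left out because the table says they are added only in rare cases or not needed.

```python
RUIAN_URL = "http://vdp.cuzk.cz/vdp/ruian/adresnimista/{ref}"


def build_housenumber(conscription=None, provisional=None, streetnumber=None):
    """Compose addr:housenumber, e.g. 1367, 1367/67, 2238/1a, ev.21, ev.21/1."""
    if conscription is not None:
        base = str(conscription)
    elif provisional is not None:
        base = f"ev.{provisional}"
    else:
        raise ValueError("either a conscription or a provisional number is required")
    if streetnumber is None:
        return base
    return f"{base}/{streetnumber}"


def build_tags(ruian_id, conscription=None, provisional=None, streetnumber=None,
               street=None, place=None, postcode=None):
    """Return the tag dictionary for one address node (a sketch, not the bot's code)."""
    tags = {
        "addr:housenumber": build_housenumber(conscription, provisional, streetnumber),
        "ref:ruian:addr": str(ruian_id),
        "source:addr": "cuzk:ruian",
    }
    if conscription is not None:
        tags["addr:conscriptionnumber"] = str(conscription)
    elif provisional is not None:
        tags["addr:provisionalnumber"] = str(provisional)
    if streetnumber is not None:
        tags["addr:streetnumber"] = str(streetnumber)
    if street:
        tags["addr:street"] = street      # a named street excludes addr:place
    elif place:
        tags["addr:place"] = place
    if postcode:
        tags["addr:postcode"] = str(postcode).replace(" ", "")  # nnnnn without space
    return tags


def ruian_url(ruian_id):
    """Link to the RUIAN detail page of an address point."""
    return RUIAN_URL.format(ref=ruian_id)
```

For example, `build_tags(12345678, provisional=21, place="Lhotka")` yields `addr:housenumber=ev.21` together with `addr:place=Lhotka`; the id 12345678 is a made-up value used only for illustration.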

We will not modify any existing nodes, ways or relations with an address if the address content is correct. Therefore we will not erase any of the tags below from such primitives either.

We will erase these keys from existing address nodes, ways, relations:

We will erase these tags from existing address nodes, ways, relations:

Changeset tags

The changesets will be uploaded using the account CzechAddress.

The changeset will be tagged like this:

Data transformation

To transform the data from the RUIAN data exchange format (VDP) we will use the ruian2pgsql software and a PostgreSQL database server with the PostGIS extension.

Team approach

Teamwork will be a very important factor as mentioned in Schedule.

Workflow

The area for import can be defined by a village/town, a set of villages/towns, a county, a bounding box or generally anything that can be written as an SQL query. The script will create two files, data.osm and data.csv. data.osm is used by the volunteer operator to create a changeset in the JOSM editor. data.csv is a list of potentially problematic locations. The operator has to check the problematic locations in the editor and fix them if necessary. After this stage the data will be returned to the manager of the import, who will upload them to the OSM server. The list of already imported nodes will be stored in his local database.
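To give an idea of what a data.osm file with new address nodes might look like, here is a minimal Python sketch that writes OSM XML readable by JOSM. It is not the import script itself. The input structure and the `generator` value are assumptions, and the sketch relies on the convention that objects with negative ids are treated as new, not yet uploaded objects.

```python
import xml.etree.ElementTree as ET


def build_osm_xml(addresses):
    """Serialize address points as new OSM nodes.

    addresses: list of dicts with the (hypothetical) keys lat, lon and tags,
               where tags is a dict of OSM key/value strings.
    """
    root = ET.Element("osm", version="0.6", generator="ruian-import-sketch")
    for index, addr in enumerate(addresses, start=1):
        node = ET.SubElement(
            root,
            "node",
            id=str(-index),                      # negative id marks a new object
            lat=format(addr["lat"], ".7f"),
            lon=format(addr["lon"], ".7f"),
        )
        for key, value in sorted(addr["tags"].items()):
            ET.SubElement(node, "tag", k=key, v=value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "lat": 49.0,
        "lon": 14.0,
        "tags": {"addr:housenumber": "1367/67", "addr:street": "Wolkerova"},
    }]
    with open("data.osm", "w", encoding="utf-8") as handle:
        handle.write(build_osm_xml(sample))
```

The coordinates and tags in the sample are invented for the example; a real file would be generated from the RUIAN tables after the matching and postprocessing steps.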

The size of changeset should not exceed 50,000 address nodes.

There is a sample file available for download here [2]. The data in the sample file are in the state before the mandatory verification by a volunteer operator. They can be edited in an appropriate editor, e.g. JOSM.

In case of a faulty import we have the OSMTOOLS scripts by Frederik Ramm, which can revert parts of changesets or whole changesets.

Conflation

This project handles only address nodes. Existing ones were imported from other sources in the past or created by users based on their own knowledge or the cadastre map, and they are now replaced or complemented by addresses from the new main official source, RUIAN. We therefore see no opportunity to improve upon those data or add anything new.
The one possible exception is that we could add a link from the address node to the id of the building it belongs to. We will consider this.

Another possibility is to add the name of the post office serving the given address. Sometimes a village has several postcodes and each of them is assigned to a different post office. But we do not know which tag to use for this, or whether we want to include this information at all. Update: This seems to be just an addition of not really useful data.

QA

Errors which we find during the import (or which we create during it) will be reported on this page.
We will correct our own mistakes manually.
Errors in RUIAN will be reported to the help desk of the government agency responsible for RUIAN, CUZK (Czech Cadastre Office). We hope that they will fix them someday, but it is a government agency, so we cannot expect miracles.

Q: I found a very suspicious edit. How can I stop the bot from making changes?