SQLite/fid to tag schema

From OpenStreetMap Wiki
Jump to navigation Jump to search

ogr2ogr stores the OSM tags in SQLite or GeoPackage files in suboptimal ways (JSON, HSTORE, ogrconf.ini's attributes). There is a feature request to store in two columns key and value[1] to allow for a query that is performant, readable and flexible at the same time. For now, you can implement that schema by hand with these steps.

The problem

When ogr2ogr makes a SQLite or GeoPackage file, the OSM tags go into

  1. Table columns, for a limited set of keys that you explicitly list in attributes in ogrconf.ini.
    • You cannot query for any possible key or search for a value among multiple related keys. If you query for an unlisted key, you get error column not found instead of 0 records: this is inconvenient when doing QGis or sqlite interactively, and forces you to handle special cases when querying the DB programmatically (user script, cron job).
  2. A catch-all column (all_tags or other_tags), in JSON ('{"key":"value"}')[2] or HSTORE ('"key"=>"value"') formats.
    • You can query for any key / value.
    • The query is very slow, because JSON / HSTORE are not native SQL data types, they are plain strings and must be parsed for each row.
    • You must learn the tool-specific JSON / HSTORE functions from QGis, GDAL, SQLite
    • This is a violation of first normal form.

We want something like

SELECT (feature identification number)
  FROM (data source)
 WHERE key = 'bench'
       AND value LIKE '%wood%'
    -- LIKE is to handle multi-valued tags
    -- and leading / trailing semicolons, such as ';wood;metal;'

So we make a table for feature identification number, key, value out of the JSON or HSTORE column. [3]

Outline of the steps

  • This tutorial assumes you have a table points(ogc_fid INTEGER PRIMARY KEY, all_tags VARCHAR).[4] Replace the names accordingly, based on your DB. See the ogr2ogr schema for info about how to find the names of the feature identifiers column and the OSM tags column in your DB.
  • We'll only work on table points for simplicity, but you can repeat this work for each of the tables listed in the SQLite's vector_layers, so lines, multilinestrings, multipolygons, other_relations, points.
  • Indices can be optionally defined on the columns of the new table, to improve performance further.

Extract tags from JSON

If the tags column is like {"key":"value"}, run this to make the wanted table.[5]

CREATE TABLE points_fid_to_tag(
             ogc_fid INTEGER,
             key VARCHAR,
             value VARCHAR,
             FOREIGN KEY(ogc_fid) REFERENCES points(ogc_fid));

 INSERT INTO points_fid_to_tag
      SELECT ogc_fid, key, value
        FROM points, json_tree(all_tags)
       WHERE key IS NOT NULL;

Compatibility with old versions:

  • GDAL can write JSON into a SQLite DB since GDAL 3.7. See OGR#ogr2ogr for details.
  • GDAL < 3.9.0 emits wrong (unparseable) JSON, that SQLite will refuse to work with, if the OSM tags contain special characters. See OGR#ogr2ogr for details.
  • SQLite >= 3.38.0 can always read JSON. Before that version, redistributors of SQLite were allowed to disable JSON support, in which case you must use HSTORE.[6]

Extract tags from HSTORE

If the tags column is like "key"=>"value", do this to make the wanted table.

You do not need to install PostgreSQL, but you need SqlAlchemy. For example, on Debian, install the package python3-sqlalchemy.

Save into a file a.py.

#! /usr/bin/env python3

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--target")
parser.add_argument("--table")
args = parser.parse_args()
target = args.target
table = args.table

# Per <https://stackoverflow.com/a/56152470>
from sqlalchemy.dialects.postgresql import HSTORE
hstore_to_dict = HSTORE().result_processor(None, None)

create = (
   f"CREATE TABLE {table}_fid_to_tag("
    "ogc_fid INTEGER, key VARCHAR, value VARCHAR,"
   f" FOREIGN KEY(ogc_fid) REFERENCES {table}(ogc_fid));"
)
select = f"SELECT ogc_fid, all_tags FROM {table} WHERE all_tags IS NOT NULL;"
insert = f"INSERT INTO {table}_fid_to_tag VALUES(?, ?, ?);"

import sqlite3
conn = sqlite3.connect(target)

conn.execute(create)
for (fid, tags) in conn.execute(select).fetchall():
    for key, value in hstore_to_dict(tags).items():
        conn.execute(insert, (fid, key, value)) and None

conn.commit()  # required by close()
conn.close()

Then do

    chmod +x a.py
    ./a.py --target /path/to/file.sqlite --table points

Create indices

Run this to improve performance even further.[7]

CREATE UNIQUE INDEX index_points_fid_to_tag_by_id_key
           ON points_fid_to_tag(ogc_fid, key);
CREATE  INDEX index_points_fid_to_tag_by_all
           ON points_fid_to_tag(key, value, ogc_fid);
CREATE  INDEX index_points_fid_to_tag_by_value
           ON points_fid_to_tag(value);

If you want to learn how to make your custom indices and troubleshoot indices, see SQLite#Indices.

How to use

Remember:

  • to compare multi-valued tags (separated by semicolon or other) you want LIKE '%value%'.
  • Put strings within ' not within ".

Get the OpenStreetMap ids of features:

SELECT osm_id
  FROM points, points_fid_to_tag
 WHERE points.ogc_fid = points_fid_to_tag.ogc_fid
       AND key = 'amenity'
       AND value = 'bench';

Get all the tags of a feature:

SELECT key, value
  FROM points, points_fid_to_tag
 WHERE points.ogc_fid = points_fid_to_tag.ogc_fid
       AND osm_id = ...;

Intersect queries:

SELECT COUNT(*)
  FROM (
       SELECT ogc_fid
         FROM points_fid_to_tag
        WHERE key = 'amenity'
              AND value = 'bench'
  ) AS benches,
       (
       SELECT ogc_fid
         FROM points_fid_to_tag
        WHERE key = 'material'
              AND value LIKE '%wood%'
  ) AS wooden
 WHERE benches.ogc_fid = wooden.ogc_fid;

Notes

  1. https://github.com/OSGeo/gdal/issues/7552
  2. Note that GDAL supports JSON since version 3.7. Before that, ogr2ogr can only write HSTORE.
  3. Ideally, we would make a materialized view, so that a change to the original table points (= add or remove a feature, or change a feature's tags) would auto-update our new table. SQLite does not have materialized views, so we make a table instead; if the original table is changed, this new table will not be auto-updated (if you make a trigger, please add instructions). We do not make a STORED generated column, because generating two columns would need twice the parsing.
  4. Having the primary key of exact type "INTEGER" (not "INT", not another type) exploits an internal SQLite optimization called rowid tables.
  5. If you want to learn more about json_tree, see the docs or run .mode box and then SELECT * FROM json_tree('{"outer":{"inner":"value"}}');.
  6. Details
  7. These indices were returned by passing the queries on this page to .expert. They should be enough.