SQLite/fid to tag schema
ogr2ogr stores the OSM tags in SQLite or GeoPackage files in suboptimal ways (JSON, HSTORE, ogrconf.ini
's attributes
). There is a feature request to store in two columns key
and value
[1] to allow for a query that is performant, readable and flexible at the same time. For now, you can implement that schema by hand with these steps.
The problem
When ogr2ogr makes a SQLite or GeoPackage file, the OSM tags go into
- Table columns, for a limited set of keys that you explicitly list in
attributes
inogrconf.ini
.- You cannot query for any possible key or search for a value among multiple related keys. If you query for an unlisted key, you get error
column not found
instead of 0 records: this is inconvenient when doing QGis or sqlite interactively, and forces you to handle special cases when querying the DB programmatically (user script, cron job).
- You cannot query for any possible key or search for a value among multiple related keys. If you query for an unlisted key, you get error
- A catch-all column (
all_tags
orother_tags
), in JSON ('{"key":"value"}'
)[2] or HSTORE ('"key"=>"value"'
) formats.- You can query for any key / value.
- The query is very slow, because JSON / HSTORE are not native SQL data types, they are plain strings and must be parsed for each row.
- You must learn the tool-specific JSON / HSTORE functions from QGis, GDAL, SQLite
- This is a violation of first normal form.
We want something like
SELECT (feature identification number)
FROM (data source)
WHERE key = 'bench'
AND value LIKE '%wood%'
-- LIKE is to handle multi-valued tags
-- and leading / trailing semicolons, such as ';wood;metal;'
So we make a table for feature identification number, key, value
out of the JSON or HSTORE column.
[3]
Outline of the steps
- This tutorial assumes you have a table
points(ogc_fid INTEGER PRIMARY KEY, all_tags VARCHAR)
.[4] Replace the names accordingly, based on your DB. See the ogr2ogr schema for info about how to find the names of the feature identifiers column and the OSM tags column in your DB.
- We'll only work on table
points
for simplicity, but you can repeat this work for each of the tables listed in the SQLite'svector_layers
, solines, multilinestrings, multipolygons, other_relations, points
.
- Indices can be optionally defined on the columns of the new table, to improve performance further.
Extract tags from JSON
If the tags column is like {"key":"value"}
, run this to make the wanted table.[5]
CREATE TABLE points_fid_to_tag(
ogc_fid INTEGER,
key VARCHAR,
value VARCHAR,
FOREIGN KEY(ogc_fid) REFERENCES points(ogc_fid));
INSERT INTO points_fid_to_tag
SELECT ogc_fid, key, value
FROM points, json_tree(all_tags)
WHERE key IS NOT NULL;
Compatibility with old versions:
- GDAL can write JSON into a SQLite DB since GDAL 3.7. See OGR#ogr2ogr for details.
- GDAL < 3.9.0 emits wrong (unparseable) JSON, that SQLite will refuse to work with, if the OSM tags contain special characters. See OGR#ogr2ogr for details.
- SQLite >= 3.38.0 can always read JSON. Before that version, redistributors of SQLite were allowed to disable JSON support, in which case you must use HSTORE.[6]
Extract tags from HSTORE
If the tags column is like "key"=>"value"
, do this to make the wanted table.
You do not need to install PostgreSQL, but you need SqlAlchemy. For example, on Debian, install the package python3-sqlalchemy
.
Save into a file a.py
.
#! /usr/bin/env python3
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--target")
parser.add_argument("--table")
args = parser.parse_args()
target = args.target
table = args.table
# Per <https://stackoverflow.com/a/56152470>
from sqlalchemy.dialects.postgresql import HSTORE
hstore_to_dict = HSTORE().result_processor(None, None)
create = (
f"CREATE TABLE {table}_fid_to_tag("
"ogc_fid INTEGER, key VARCHAR, value VARCHAR,"
f" FOREIGN KEY(ogc_fid) REFERENCES {table}(ogc_fid));"
)
select = f"SELECT ogc_fid, all_tags FROM {table} WHERE all_tags IS NOT NULL;"
insert = f"INSERT INTO {table}_fid_to_tag VALUES(?, ?, ?);"
import sqlite3
conn = sqlite3.connect(target)
conn.execute(create)
for (fid, tags) in conn.execute(select).fetchall():
for key, value in hstore_to_dict(tags).items():
conn.execute(insert, (fid, key, value)) and None
conn.commit() # required by close()
conn.close()
Then do
chmod +x a.py
./a.py --target /path/to/file.sqlite --table points
Create indices
Run this to improve performance even further.[7]
CREATE UNIQUE INDEX index_points_fid_to_tag_by_id_key
ON points_fid_to_tag(ogc_fid, key);
CREATE INDEX index_points_fid_to_tag_by_all
ON points_fid_to_tag(key, value, ogc_fid);
CREATE INDEX index_points_fid_to_tag_by_value
ON points_fid_to_tag(value);
If you want to learn how to make your custom indices and troubleshoot indices, see SQLite#Indices.
How to use
Remember:
- to compare multi-valued tags (separated by semicolon or other) you want
LIKE '%value%'
. - Put strings within
'
not within"
.
Get the OpenStreetMap ids of features:
SELECT osm_id
FROM points, points_fid_to_tag
WHERE points.ogc_fid = points_fid_to_tag.ogc_fid
AND key = 'amenity'
AND value = 'bench';
Get all the tags of a feature:
SELECT key, value
FROM points, points_fid_to_tag
WHERE points.ogc_fid = points_fid_to_tag.ogc_fid
AND osm_id = ...;
Intersect queries:
SELECT COUNT(*)
FROM (
SELECT ogc_fid
FROM points_fid_to_tag
WHERE key = 'amenity'
AND value = 'bench'
) AS benches,
(
SELECT ogc_fid
FROM points_fid_to_tag
WHERE key = 'material'
AND value LIKE '%wood%'
) AS wooden
WHERE benches.ogc_fid = wooden.ogc_fid;
Notes
- ↑ https://github.com/OSGeo/gdal/issues/7552
- ↑ Note that GDAL supports JSON since version 3.7. Before that,
ogr2ogr
can only write HSTORE. - ↑
Ideally, we would make a materialized view, so that a change to the original table
points
(= add or remove a feature, or change a feature's tags) would auto-update our new table. SQLite does not have materialized views, so we make a table instead; if the original table is changed, this new table will not be auto-updated (if you make a trigger, please add instructions). We do not make a STORED generated column, because generating two columns would need twice the parsing. - ↑ Having the primary key of exact type "INTEGER" (not "INT", not another type) exploits an internal SQLite optimization called rowid tables.
- ↑
If you want to learn more about
json_tree
, see the docs or run.mode box
and thenSELECT * FROM json_tree('{"outer":{"inner":"value"}}');
. - ↑ Details
- ↑ These indices were returned by passing the queries on this page to
.expert
. They should be enough.