Multilingual maps Wikipedia project/Final Report
This is the final report for the project outlining what we did.
Changes to the Mapquest Render Stack
We made many changes to the Mapquest Render Stack tile server software. Most of them are directly relevant to this project, but along the way we also fixed some bugs. All changes are in the master branch of a fork ([1]) of the original Mapquest GitHub repository ([2]). We are in contact with the maintainer of the original software, but it is currently unclear whether these changes will be pulled upstream. Where possible we have kept the changes small to make later merging easier.
Style Parameters and Flexible Tile URL Parsing
A normal tile in the render stack is defined by a style name, a zoom level, and x and y coordinates. We added provisions for additional parameters which can be added depending on the style. Initially this is only used for a "lang" parameter describing the language or languages requested, but it could be used for other parameters as well. The protocol used between the different sub-systems of the render stack can carry these parameters, and the memcached storage is able to use them.
The format of the tile URLs was hardcoded in the render stack. We made the format configurable in a config file. This was needed because a request for an overlay tile must contain the language(s) for the labels, whereas a "normal" tile server doesn't need this information.
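To illustrate the idea of a configurable URL format with optional style parameters, here is a minimal Python sketch. The template syntax, the field patterns and the function names are assumptions made for this example; the actual config file syntax of the render stack may look quite different.

```python
import re

# Hypothetical URL templates, as they might appear in a config file.
URL_TEMPLATES = [
    "/{style}/{z}/{x}/{y}.png",         # "normal" tile server
    "/{style}/{lang}/{z}/{x}/{y}.png",  # overlay with language parameter
]

# Regex fragment for each field name (assumed, not taken from the render stack).
FIELD_PATTERNS = {
    "style": r"[a-z0-9_]+",
    "lang": r"[a-z,_-]+",
    "z": r"\d+",
    "x": r"\d+",
    "y": r"\d+",
}


def compile_template(template):
    """Turn a template such as '/{style}/{z}/{x}/{y}.png' into a regex."""
    # re.split with a capture group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += "(?P<%s>%s)" % (part, FIELD_PATTERNS[part])
    return re.compile("^" + regex + "$")


def parse_tile_url(path, templates=URL_TEMPLATES):
    """Return style, z, x, y plus any extra parameters, or None if no match."""
    for template in templates:
        match = compile_template(template).match(path)
        if match:
            fields = match.groupdict()
            tile = {
                "style": fields.pop("style"),
                "z": int(fields.pop("z")),
                "x": int(fields.pop("x")),
                "y": int(fields.pop("y")),
            }
            # Whatever is left over is a style-dependent parameter.
            tile["params"] = fields
            return tile
    return None
```

With these assumed templates, "/labels/de,en/5/16/10.png" yields a tile with params {"lang": "de,en"}, while "/osm/5/16/10.png" yields a tile with empty params.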
Memcached Storage
The render stack allows flexible storage backends. We added a memcached-based storage for the short-term storage of label overlay tiles. Memcached can be configured to use as much RAM as needed. A tile in the labels overlay needs about 2 kB on average; this can probably be reduced by compressing tiles better or by special-casing empty tiles. When label tiles are stored in memcached they are given a configurable expiry time, for instance 24 hours, so we can be sure labels are never older than this.
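As a rough back-of-the-envelope sketch of what these numbers mean for cache sizing, the following Python snippet estimates how many label tiles fit into a given amount of RAM. It ignores memcached's per-item overhead, and the key layout is purely hypothetical; neither is taken from the render stack code.

```python
AVG_TILE_BYTES = 2 * 1000      # "about 2 kB" per label tile, from the text
EXPIRE_SECONDS = 24 * 60 * 60  # example expiry time of 24 hours


def tile_key(style, lang, z, x, y):
    """Hypothetical cache key that includes the language parameter."""
    return "%s:%s:%d:%d:%d" % (style, lang, z, x, y)


def tiles_that_fit(ram_bytes, avg_tile_bytes=AVG_TILE_BYTES):
    """Approximate number of tiles that fit, ignoring per-item overhead."""
    return ram_bytes // avg_tile_bytes
```

Under these assumptions, 4 GB of RAM given to memcached would hold roughly two million label tiles.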
Cassandra Storage
We added a storage backend using Cassandra. Because we only had one test server available, and it looks like we will get only one production server rather than several, this backend was never thoroughly tested. It is currently not in the "master" branch, but only in the "cassandra_storage" branch.
Combining Background and Overlay Tiles
In the design document we planned to deliver overlay and background tiles not only as separate tiles, but also together, by combining the tiles on the fly on the server. The Mapquest Render Stack supports this, but not in a way that is immediately usable in our case. To keep the changes to the upstream code to a minimum, we have chosen not to implement this. The tiles can be used without this functionality, and both solutions have advantages and disadvantages anyway.
Prioritization of Requests
Requests to render a tile in the render stack used to have one of a fixed set of priorities depending on whether they were for existing tiles, non-existing tiles that need re-rendering, or background rendering jobs. We added a priority setting to the protocol spoken between the sub-systems of the render stack to make this more flexible. A configuration option allows the priority of a job to be boosted depending on the map style. This is used to give jobs for the label overlay a higher priority than jobs for the background tiles. Overlay jobs are normally much quicker to handle and should not be blocked by the long-running jobs for the background tiles. Jobs can still be blocked when all rendering processes are busy, though: unlike Tirex, the Mapquest Render Stack cannot reserve rendering processes for high-priority jobs.
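The effect of a per-style priority boost can be sketched with a small priority queue in Python. The numeric priorities, the style names and the JobQueue class are assumptions for illustration only and do not reflect the actual protocol or configuration of the render stack.

```python
import heapq
import itertools

# Hypothetical per-style boost, as it might be set in a configuration file.
STYLE_BOOST = {"labels": 100}


class JobQueue:
    """Queue that hands out the job with the highest priority first."""

    def __init__(self, style_boost=STYLE_BOOST):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order for equal priorities
        self._style_boost = style_boost

    def push(self, style, base_priority, tile):
        priority = base_priority + self._style_boost.get(style, 0)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), style, tile))

    def pop(self):
        _, _, style, tile = heapq.heappop(self._heap)
        return style, tile


queue = JobQueue()
queue.push("osm", 50, (5, 16, 10))     # background tile, queued first
queue.push("labels", 10, (5, 16, 10))  # overlay tile, boosted to 110
queue.push("osm", 50, (5, 16, 11))
first = queue.pop()                    # the overlay job jumps ahead
```

Note that such a queue only reorders waiting jobs. As described above, a boosted job still has to wait if every rendering process is already busy.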
Support for Language Dependent Rendering
The render stack uses a Python script to configure and call the Mapnik renderer, which does the actual rendering of the tile. We augmented this script to read the language parameter of the style, if it is available, and to change the Mapnik configuration before calling Mapnik to render the style. Depending on the language we want different OSM tags to be used, such as "name:de", "name:en", or "name". So we basically change the SQL query that is embedded in the Mapnik configuration to query for the right tags. The details are a bit more complicated and depend on the language(s) the user wants and the type of label (on points vs. on linestrings). We have implemented flexible code that allows for all sorts of language combinations, but the script will probably need changing for different use cases.
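One possible way to rewrite the query is sketched below in Python. It assumes a simple fallback chain (first requested language, then the next, then the plain "name" column), an hstore column called "tags", and a hypothetical "{name}" placeholder in the query template. The real script handles more cases than this and may work quite differently.

```python
import re

# Accept only simple language codes such as "de" or "zh-hans", so that the
# language parameter can never inject arbitrary SQL into the query.
LANG_RE = re.compile(r"[a-z]{2,3}(-[a-z]+)?")


def name_expression(langs):
    """Build an SQL expression that prefers name:<lang> tags over 'name'."""
    columns = []
    for lang in langs:
        if not LANG_RE.fullmatch(lang):
            raise ValueError("invalid language code: %r" % lang)
        columns.append("tags->'name:%s'" % lang)
    columns.append("name")  # fall back to the default name
    return "COALESCE(%s)" % ", ".join(columns)


def localize_query(query_template, lang_param):
    """Replace the hypothetical {name} placeholder in a query template."""
    langs = [lang for lang in lang_param.split(",") if lang]
    return query_template.replace("{name}", name_expression(langs))
```

For example, localize_query("SELECT way, {name} AS name FROM planet_osm_point", "de,en") selects COALESCE(tags->'name:de', tags->'name:en', name) as the label text.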
Debian Package
We added all that is necessary to build a 'rendermq' Debian package that contains all that is needed to run the Mapquest Render Stack on a Debian wheezy system. The package is less than perfect, but it makes it much easier to deploy the render stack. See the 'debian' directory in the source code.
Demo Site
We set up a demo site ([3]) where users can try out the new functionality. It shows the usual tiled world map using OpenLayers. The user can choose any language or combination of languages for the overlay, which is rendered on demand if it is not available yet. The demo was very well received by the OSM community. The tile generation runs on a server of the FOSSGIS e.V. (bessel.openstreetmap.de).
Map Style
The demo site uses a style derived from the "German" map style, which is in turn derived from the main OSM map style. The main difference is that the style has been split into two styles, one for the background and one for the labels overlay. All labels and highway shields have their own layer. This is currently a "quick and dirty hack", but as there is a major effort underway to re-do the whole OSM style in a different format (Carto instead of Mapnik XML) ([4]), it did not seem worth spending effort on a cleaner approach. The style files are available on request.
PostGIS Optimizations
The render stack and map style use a normal osm2pgsql-fed PostGIS database with hstore to get the OSM data. To make label rendering fast enough, several indexes were added to the database:
CREATE INDEX placenames_large_idx ON planet_osm_point USING GIST (way)
  WHERE place IN ('country','state','continent');
CREATE INDEX placenames_capital_idx ON planet_osm_point USING GIST (way)
  WHERE place IN ('city','metropolis','town') AND tags->'capital' = 'yes';
CREATE INDEX placenames_medium1_idx ON planet_osm_point USING GIST (way)
  WHERE place IN ('city','metropolis')
  AND ((tags->'capital') IS NULL OR tags->'capital' != 'yes');
CREATE INDEX placenames_medium2_idx ON planet_osm_point USING GIST (way)
  WHERE place IN ('city','metropolis','town','large_town','small_town')
  AND ((tags->'capital') IS NULL OR tags->'capital' != 'yes');
CREATE INDEX amenity_symbols_labels_idx ON planet_osm_point USING GIST (way)
  WHERE aeroway in ('airport','aerodrome','helipad');
CREATE INDEX text_idx ON planet_osm_point USING GIST (way)
  WHERE amenity is not null
  or shop in ('supermarket','bakery','clothes','fashion','convenience','doityourself','hairdresser','department_store','butcher','car','car_repair','bicycle','florist')
  or leisure is not null or landuse is not null or tourism is not null
  or "natural" is not null or man_made in ('lighthouse','windmill')
  or place='island' or military='danger_area' or aeroway='gate' or waterway='lock'
  or historic in ('memorial','archaeological_site','castle');
CREATE INDEX text_poly_idx ON planet_osm_polygon USING GIST (way)
  WHERE amenity is not null
  or shop in ('supermarket','bakery','clothes','fashion','convenience','doityourself','hairdresser','department_store','butcher','car','car_repair','bicycle')
  or leisure is not null or landuse is not null or tourism is not null
  or "natural" is not null or man_made in ('lighthouse','windmill')
  or place='island' or military='danger_area'
  or historic in ('memorial','archaeological_site','castle');
With these indexes and the database on an SSD, rendering of the label tiles is very fast. This obviates the need for any special label or rendering hint database as we had envisioned when starting this project, which makes the setup of the whole system much easier.
Benchmarking
Benchmarking the current setup is a bit difficult because it runs on a virtual server, and the server is in production use and has other things to do, too. Because the Internet connection to the server is rate-limited, benchmarking had to be done on the server itself and not from a remote system or systems. Benchmarking was done with the 'siege' software running on the host system and the render stack running on the virtual server. Because of this the benchmarking results can only be seen as rough approximations.
The server has an Intel Xeon E5620 Quad Core processor running at 2.4 GHz and 48 GB RAM. The PostGIS database is installed on an SSD, background tiles are stored on disk, label overlay tiles in memcached.
First a test with pre-rendered tiles. The file labels-10000.txt contains a collection of tile URLs for the labels overlay layer taken from the log file. We hit the server with 800 concurrent clients for 20 seconds with random URLs from the file as fast as we can. We observe about 1300 tiles delivered per second.
This is the siege command line used:
siege --concurrent 800 --internet --benchmark --time=20s --file=labels-10000.txt
If we do the same thing, but empty the memcached cache before each run, we get (with 8 worker processes doing the rendering) about 50 tiles per second delivered (average over 5 runs).
These are very rough measurements, but they show that under a realistic load where the most often used tiles will be cached we can expect the server to deliver well over 100 tiles per second.
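To get a feeling for how much caching is needed to reach that figure, the two measured rates can be combined in a very simple model. The sketch below assumes that the average time per tile is a weighted mix of the cached rate (about 1300 tiles per second) and the uncached rate (about 50 tiles per second). This model is an assumption made for illustration; it ignores queueing and concurrency effects and is not part of the benchmark itself.

```python
CACHED_RATE = 1300.0   # tiles per second with pre-rendered tiles (measured)
UNCACHED_RATE = 50.0   # tiles per second with an empty cache (measured)


def mixed_throughput(hit_ratio, cached=CACHED_RATE, uncached=UNCACHED_RATE):
    """Tiles per second if a fraction hit_ratio of requests is cached."""
    time_per_tile = hit_ratio / cached + (1.0 - hit_ratio) / uncached
    return 1.0 / time_per_tile


def hit_ratio_for(target_rate, cached=CACHED_RATE, uncached=UNCACHED_RATE):
    """Cache hit ratio needed to reach target_rate under the same model."""
    # Solve 1/target = h/cached + (1 - h)/uncached for h.
    return (1.0 / uncached - 1.0 / target_rate) / (1.0 / uncached - 1.0 / cached)
```

Under this model a cache hit ratio of a little over one half (0.52) is already enough for 100 tiles per second, and higher hit ratios push the rate up quickly.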
Next Steps
The project itself is finished now, but of course there is more to do:
- When hardware becomes available for a production server at Wikipedia we intend to install the software there and switch the current limited multilingual maps over to the new software.
- Now that OSM mappers can see where names are available in which language, we expect name:* tags to be used more often and people to improve the OSM database.
- We expect more discussion in the OSM community on how exactly the different tags like "name", "int_name", "old_name", "name:*", etc. are used and which combinations of tags can and should be used for different kinds of maps. Now that there is a map available for people to look at, these discussions are no longer so theoretical. The use of transliteration is another issue that needs more thought.