PBF Format

From OpenStreetMap Wiki

PBF format ("Protocolbuffer Binary Format") is primarily intended as an alternative to the XML format. It is about half of the size of a gzipped planet and about 30% smaller than a bzipped planet. It is also about 5x faster to write than a gzipped planet and 6x faster to read than a gzipped planet. The format was designed to support future extensibility and flexibility.

The underlying file format is chosen to support random access at the 'fileblock' granularity. Each fileblock is independently decodable and contains a series of encoded PrimitiveGroups, with each PrimitiveGroup containing ~8k OSM entities in the default configuration. There is no tag hardcoding used; all keys and values are stored in full as opaque strings. For future scalability, 64-bit node/way/relation IDs are assumed. The current serializer (Osmosis) preserves the order of OSM entities, and tags on OSM entities. To flexibly handle multiple resolutions, the granularity, or resolution, used for representing locations and timestamps is adjustable in multiples of 1 millisecond and 1 nanodegree. The default scaling factor is 1000 milliseconds and 100 nanodegrees, corresponding to about 1 cm at the equator. This is the current resolution of the OSM database.

Files have the extension *.osm.pbf.

At present, the reference implementation of PBF is the Osmosis implementation, split into two parts, the Osmosis-specific part, contained in the Osmosis repository at [1], and an application-generic part at [2]. This application-generic part is used to build the osmpbf.jar (as used in Osmosis and other Java-based PBF readers) and also contains the master definition of the PBF protocol buffer definitions (*.proto files).

Software support for PBF

A lot of software used in the OSM project already supports PBF in addition to the original XML format, plus there are several tools to convert from PBF to OSM XML and vice versa.

See PBF/Software Compliance for details about which kinds of PBF files are supported by the various programs.

Design

Low level encoding

Google's Protocol Buffers are used for the low-level store. Given a specification file of one or more messages, the protocol buffer compiler writes low-level serialization code. Messages may contain other messages, forming hierarchical structures. Protocol buffers also support extensibility; new fields can be added to a message and old clients can read those messages without recompiling. For more details, please see https://github.com/protocolbuffers/protobuf/ or read the respective article on the Google Open Source Blog. Google officially supports C++, Java, Python, Objective-C, C#, Ruby, Go, PHP, Dart, and JavaScript and compilers exist for other languages. An example message specification is:

message Node {
  required sint64 id = 1;
  // Parallel arrays.
  repeated uint32 keys = 2 [packed = true]; // String IDs.
  repeated uint32 vals = 3 [packed = true]; // String IDs.
  optional Info info = 4; // May be omitted in omitmeta
  required sint64 lat = 8;
  required sint64 lon = 9;
}

Protocol Buffers use a variable-bit encoding for integers. An integer is encoded at 7 bits per byte, where the high bit indicates whether or not the next byte is to be read. This minimizes the file size when messages contain small integers. Two encodings exist, one intended for mostly positive integers, and one for signed integers. In the standard encoding, integers [0,127] require one byte, [128,16383] require two bytes, etc. In the signed number encoding, the sign bit is placed in the least significant position; numbers [-64,63] require one byte, [-8192,8191] require two bytes, and so forth. For further details of the serialized format of protocol buffer messages, please see their website.
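
The two integer encodings described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used by any PBF implementation; the function names are chosen here for the example, and real readers would rely on a protocol buffer library instead.

```python
def encode_varint(value):
    """Encode a non-negative integer at 7 bits per byte, lowest group first."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: another byte follows
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf, pos=0):
    """Return (value, position just after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_encode(n):
    """Map a signed 64-bit integer to unsigned, sign in the lowest bit."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)
```

With these helpers, 127 encodes to one byte and 128 to two, while the signed values -64 and 63 each fit in one byte once zigzag-encoded, matching the ranges given above.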

The generated files use a Java package of crosby.binary. In other languages, the generated files are in package OSMPBF.

File format

A file contains a header followed by a sequence of fileblocks. The design is intended to allow future random-access to the contents of the file and skipping past not-understood or unwanted data.

The format is a repeating sequence of:

  • int4: length of the BlobHeader message in network byte order
  • serialized BlobHeader message
  • serialized Blob message (size is given in the header)

A BlobHeader is currently defined as:

message BlobHeader {
  required string type = 1;
  optional bytes indexdata = 2;
  required int32 datasize = 3;
}
  • type contains the type of data in this block message.
  • indexdata is some arbitrary blob that may include metadata about the following blob, (e.g., for OSM data, it might contain a bounding box.) This is a stub intended to enable the future design of indexed *.osm.pbf files.
  • datasize contains the serialized size of the subsequent Blob message.

(Please note that BlobHeader used to be called BlockHeader. It was renamed in v1.1 to avoid confusion with HeaderBlock, below.)

A Blob is used to store an arbitrary blob of data, either uncompressed or in compressed format.

message Blob {
  optional int32 raw_size = 2; // When compressed, the uncompressed size

  oneof data {
    bytes raw = 1; // No compression

    // Possible compressed versions of the data.
    bytes zlib_data = 3;

    // For LZMA compressed data (optional)
    bytes lzma_data = 4;

    // Formerly used for bzip2 compressed data. Deprecated in 2010.
    bytes OBSOLETE_bzip2_data = 5 [deprecated=true]; // Don't reuse this tag number.

    // For LZ4 compressed data (optional)
    bytes lz4_data = 6;

    // For ZSTD compressed data (optional)
    bytes zstd_data = 7;
  }
}

All readers and writers must support uncompressed and zlib-compressed data. The other compression formats are optional and currently not widely used.

In order to robustly detect illegal or corrupt files, I limit the maximum size of BlobHeader and Blob messages. The length of the BlobHeader should be less than 32 KiB (32*1024 bytes) and must be less than 64 KiB. The uncompressed length of a Blob should be less than 16 MiB (16*1024*1024 bytes) and must be less than 32 MiB.
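
As a rough sketch of the framing described in this section, the following Python reads the length prefix, the BlobHeader and the Blob of each fileblock and enforces the hard size limits. It assumes only the field numbers from the BlobHeader and Blob definitions above; the names read_fileblocks and parse_fields are invented for this example, the hand-written protobuf walk handles just the two wire types these messages need, and only raw and zlib data are supported. A real reader would normally use classes generated from the .proto files.

```python
import io
import struct
import zlib

MAX_HEADER_SIZE = 64 * 1024        # hard limit for a BlobHeader
MAX_BLOB_SIZE = 32 * 1024 * 1024   # hard limit for an uncompressed Blob


def read_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf):
    """Minimal protobuf walk: varint (0) and length-delimited (2) fields only."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
        fields[number] = value
    return fields


def read_fileblocks(stream):
    """Yield (type, uncompressed payload) for each fileblock in the stream."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of file
        if len(prefix) != 4:
            raise ValueError("truncated length prefix")
        (header_len,) = struct.unpack(">I", prefix)  # network byte order
        if header_len >= MAX_HEADER_SIZE:
            raise ValueError("BlobHeader too large")
        header = parse_fields(stream.read(header_len))
        block_type = header[1].decode("utf-8")       # BlobHeader.type
        blob = parse_fields(stream.read(header[3]))  # BlobHeader.datasize
        if 1 in blob:                                # Blob.raw
            data = blob[1]
        elif 3 in blob:                              # Blob.zlib_data
            data = zlib.decompress(blob[3])
            if 2 in blob and len(data) != blob[2]:   # Blob.raw_size
                raise ValueError("raw_size does not match decompressed size")
        else:
            raise ValueError("unsupported compression")
        if len(data) >= MAX_BLOB_SIZE:
            raise ValueError("Blob too large")
        yield block_type, data
```

A caller would open the file in binary mode and iterate, for example `for block_type, data in read_fileblocks(open(path, "rb"))`, where path is whatever *.osm.pbf file is at hand. The sketch does not check for short reads inside a block and does not apply the softer "should" limits.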

Encoding OSM entities into fileblocks

There are currently two fileblock types for OSM data. These textual type strings are stored in the type field in the BlobHeader:

  • OSMHeader: The Blob contains a serialized HeaderBlock message (see osmformat.proto). Every file must have one of these blocks before the first OSMData block.
  • OSMData: The Blob contains a serialized PrimitiveBlock message (see osmformat.proto). These contain the entities.

This design lets other software extend the format to include fileblocks of additional types for their own purposes. Parsers should ignore and skip fileblock types that they do not recognize.

Definition of the OSMHeader fileblock

message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;

  optional string writingprogram = 16;
  optional string source = 17; // From the bbox field.

  /* Tags that allow continuing an Osmosis replication */

  // replication timestamp, expressed in seconds since the epoch,
  // otherwise the same value as in the "timestamp=..." field
  // in the state.txt file used by Osmosis
  optional int64 osmosis_replication_timestamp = 32;

  // replication sequence number (sequenceNumber in state.txt)
  optional int64 osmosis_replication_sequence_number = 33;

  // replication base URL (from Osmosis's configuration.txt file)
  optional string osmosis_replication_base_url = 34;
}

To offer forward and backward compatibility, a parser needs to know if it is able to parse a file. This is done by required features. If a file contains a required feature that a parser does NOT understand, it must reject the file with an error, and report which required features it does not support.

Currently the following features are defined:

  • "OsmSchema-V0.6" — File contains data with the OSM v0.6 schema.
  • "DenseNodes" — File contains dense nodes and dense info.
  • "HistoricalInformation" — File contains historical OSM data.
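
The rejection rule for required features might be sketched as follows. The set of supported features belongs to a hypothetical reader and is an assumption of this example, as is the function name.

```python
# Features this hypothetical reader implements (an assumption for the example).
SUPPORTED_FEATURES = {"OsmSchema-V0.6", "DenseNodes"}


def check_required_features(required_features):
    """Raise if the file demands a feature this reader does not implement."""
    unsupported = sorted(set(required_features) - SUPPORTED_FEATURES)
    if unsupported:
        # Report which required features are not supported, as the rule asks.
        raise ValueError(
            "unsupported required features: " + ", ".join(unsupported)
        )
```

The argument would be the required_features list taken from the HeaderBlock; a history file listing "HistoricalInformation" would be rejected by this reader with a message naming that feature.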

In addition, a file may have optional properties that a parser can exploit. For instance, the file may be pre-sorted, and not need sorting before being used. Or, the ways in the file may have bounding boxes precomputed. If a program encounters an optional feature it does not know, it can still safely read the file. If a program expects an optional feature that is not there, it can error out. The following features have been proposed:

  • "Has_Metadata" – The file contains author and timestamp metadata.
  • "Sort.Type_then_ID" – Entities are sorted by type then ID.
  • "Sort.Geographic" – Entities are in some form of geometric sort. (presently unused)
  • "timestamp=2011-10-16T15:45:00Z" – Interim solution for storing a file timestamp. Please use osmosis_replication_timestamp instead.
  • "LocationsOnWays" — File has lat/lon values on each way. See Ways and Relations below.

What are the replication fields for?

The osmosis_replication_* fields are intended to allow the consumer of a PBF file to append data from an Osmosis-managed update server in order to keep the file current. Osmosis is the software used to produce the daily, hourly, and minutely diffs on planet.openstreetmap.org. To append updates to a PBF file, one has to know which replication state the file represents so that the right synchronisation point can be found.

  • osmosis_replication_timestamp - the replication timestamp (as Unix epoch value), taken from the state.txt file written by Osmosis (where it is contained not as Unix epoch value but as an ISO time string). Technically this is the internal database timestamp of the last transaction fully contained in the file; it does not necessarily mean that every object with a timestamp smaller than or equal to this timestamp is contained in the file.
  • osmosis_replication_sequence_number - the sequence number of the last database transaction contained in the file. This usually matches the timestamp - if you know one you can find out the other, however it makes things easier for the consumer to know both.
  • osmosis_replication_base_url - the base URL for replication diffs, e.g. https://planet.openstreetmap.org/replication/minute/, so that the consumer knows which server (and therefore, which database) the given IDs relate to.

When processing a PBF file, you would usually keep these fields intact (i.e. copy them from input to output) just like you would copy the bbox block, unless the particular kind of processing that you apply makes it impossible or useless to want to apply updates to the file later.

Definition of OSMData fileblock

To encode OSM entities into protocol buffers, I collect a series of PrimitiveGroup entities to form a PrimitiveBlock, which is serialized into the Blob portion of an 'OSMData' fileblock.

message PrimitiveBlock {
  required StringTable stringtable = 1;
  repeated PrimitiveGroup primitivegroup = 2;

  // Granularity, units of nanodegrees, used to store coordinates in this block
  optional int32 granularity = 17 [default=100]; 

  // Offset value between the output coordinates and the granularity grid, in units of nanodegrees.
  optional int64 lat_offset = 19 [default=0];
  optional int64 lon_offset = 20 [default=0]; 

  // Granularity of dates, normally represented in units of milliseconds since the 1970 epoch.
  optional int32 date_granularity = 18 [default=1000]; 


  // Proposed extension:
  //optional BBox bbox = XX;
}

When creating a PBF file, you need to extract all strings (key, value, role, user) into a separate string table. Thereafter, strings are referred to by their index into this table, except that index=0 is used as a delimiter when encoding DenseNodes. This means that you cannot safely store a useful string in that slot. Therefore an empty string is stored at index=0 and that slot is never used. Sorting the string table is not required, but it may improve performance if frequently used strings get small indexes. You may also improve deflate compressibility of the stringtable if you sort strings that have the same frequency lexicographically.
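
One possible way to build such a table is sketched below: slot 0 holds the empty string, the remaining strings are ordered by descending frequency, and ties are broken lexicographically. The function name and return shape are assumptions of this example, and it does not deal with tags whose key or value is itself an empty string.

```python
from collections import Counter


def build_string_table(strings):
    """Return (table, index): "" at slot 0, frequent strings at small indexes."""
    counts = Counter(s for s in strings if s != "")
    # Most frequent first; equal frequencies are sorted lexicographically.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    table = [""] + ordered
    index = {s: i for i, s in enumerate(table)}
    return table, index
```

A writer would pass in every key, value, role and user name of the block, then store `index[s]` wherever the string s is referenced.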

Each PrimitiveBlock is independently decompressable, containing all of the information needed to decompress the entities it contains. It contains a string table and also encodes the granularity for both position and timestamps.

A block may contain any number of entities, as long as the size limits for a block are obeyed. It will result in small file sizes if you pack as many entities as possible into each block. However, for simplicity, certain programs (e.g. osmosis 0.38) limit the number of entities in each block to 8000 when writing PBF format.

In addition to granularity, the primitive block also encodes a latitude and longitude offset value. These values, measured in units of nanodegrees, must be added to each coordinate.

latitude = .000000001 * (lat_offset + (granularity * lat))
longitude = .000000001 * (lon_offset + (granularity * lon))

Where latitude is the latitude in degrees, granularity is the granularity given in the PrimitiveBlock, lat_offset is the offset given in the PrimitiveBlock, and lat/lon are encoded in a Node or delta-encoded in a DenseNode. The explanation of the equation for longitude is analogous.

The reason that lat_offset and lon_offset exist is for concisely representing isohypsis data (contour lines) or other data that occurs in a regular grid. Say we wished to represent such data that was at a 100 microdegree grid. We would like to use a granularity of 100000 nanodegrees for the highest compression, but that could only represent points of the form (.0001*x, .0001*y), when the real gridded data may be of the form (.00003345+.0001*x, .00008634+.0001*y). Because the offsets are measured in nanodegrees, using lat_offset=33450 and lon_offset=86340 lets us represent this 100 microdegree grid exactly.
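
The coordinate formula can be written as a small helper. This is only a sketch; the function name is chosen for the example, and the defaults mirror the defaults declared in PrimitiveBlock.

```python
def decode_coordinate(raw, granularity=100, offset=0):
    """Convert a stored lat or lon integer to degrees.

    raw         -- the integer from a Node (or the running sum of DenseNodes deltas)
    granularity -- PrimitiveBlock.granularity, in nanodegrees
    offset      -- PrimitiveBlock.lat_offset or lon_offset, in nanodegrees
    """
    return 1e-9 * (offset + granularity * raw)
```

With the default granularity, a stored value of 530751234 decodes to roughly 53.0751234 degrees. For the gridded case, an offset of .00003345 degrees is 33450 nanodegrees, so `decode_coordinate(2, granularity=100000, offset=33450)` gives roughly .00023345 degrees, that is, grid point x=2. Results are floating point and therefore approximate.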

For datestamps,

millisec_stamp = timestamp * date_granularity

Where timestamp is the timestamp encoded in an Info or delta encoded in a DenseInfo, date_granularity is given in the PrimitiveBlock, and millisec_stamp is the date of the entity, measured in number of milliseconds since the 1970 Unix epoch. To get the date measured in seconds since the 1970 epoch, divide millisec_stamp by 1000.
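
In Python, the timestamp conversion could look like the sketch below. The function name is an assumption of this example; the default matches the date_granularity default in PrimitiveBlock.

```python
from datetime import datetime, timezone


def decode_timestamp(raw, date_granularity=1000):
    """Convert a stored timestamp to a timezone-aware UTC datetime."""
    millisec_stamp = raw * date_granularity
    # Divide by 1000 to get seconds since the 1970 epoch.
    return datetime.fromtimestamp(millisec_stamp / 1000, tz=timezone.utc)
```

With the default granularity of 1000, the stored value is simply the number of seconds since the epoch.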

Within each PrimitiveBlock, I then divide entities into primitive groups that contain up to 8k OSM entities that are all of the same type (node/way/relation).

message PrimitiveGroup {
  repeated Node     nodes = 1;
  optional DenseNodes dense = 2;
  repeated Way      ways = 3;
  repeated Relation relations = 4;
  repeated ChangeSet changesets = 5;
}

A PrimitiveGroup MUST NEVER contain different types of objects. So either it contains many Node messages, or a DenseNodes message, or many Way messages, or many Relation messages, or many ChangeSet messages. But it can never contain any mixture of those. The reason is that, because of the way Protocol Buffer encoding works, it would be impossible to get the objects out in the same order they were written into the file. This could be rather confusing to users.

After being serialized into a string, each PrimitiveBlock is optionally compressed individually (typically with zlib/deflate) when stored in the Blob of the fileblock.

Ways and Relations

For ways and relations, which contain the IDs of other nodes in the field refs, I exploit the tendency of consecutive nodes in a way or relation to have nearby node IDs by using delta compression, resulting in small integers. (I.e., instead of encoding x_1, x_2, x_3, I encode x_1, x_2-x_1, x_3-x_2, ...). Except for that, ways and relations are mostly encoded in the way one would expect. Tags are encoded as two parallel arrays, one array of string-IDs of the keys, and the other of string-IDs of the values.

message Way {
  required int64 id = 1;
  // Parallel arrays.
  repeated uint32 keys = 2 [packed = true];
  repeated uint32 vals = 3 [packed = true];

  optional Info info = 4;

  repeated sint64 refs = 8 [packed = true];  // DELTA coded

  // The following two fields are optional. They are only used in a special
  // format where node locations are also added to the ways. This makes the
  // files larger, but allows creating way geometries directly.
  //
  // If this is used, you MUST set the optional_features tag "LocationsOnWays"
  // and the number of values in refs, lat, and lon MUST be the same.
  repeated sint64 lat = 9 [packed = true]; // DELTA coded, optional
  repeated sint64 lon = 10 [packed = true]; // DELTA coded, optional
}
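
The delta coding used for refs (and for the other fields marked DELTA coded) amounts to a running difference on write and a running sum on read. A minimal sketch, with function names chosen for this example:

```python
from itertools import accumulate


def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Recover the original values as a running sum of the deltas."""
    return list(accumulate(deltas))
```

For example, node IDs 1000, 1002, 1001 are stored as 1000, 2, -1. The negative difference is why these fields are declared sint64.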

Relations use an enum to represent member types.

message Relation {
  enum MemberType {
    NODE = 0;
    WAY = 1;
    RELATION = 2;
  }
   required int64 id = 1;

   // Parallel arrays.
   repeated uint32 keys = 2 [packed = true];
   repeated uint32 vals = 3 [packed = true];

   optional Info info = 4;

   // Parallel arrays
   repeated int32 roles_sid = 8 [packed = true];
   repeated sint64 memids = 9 [packed = true]; // DELTA encoded
   repeated MemberType types = 10 [packed = true];
}

Metadata includes non-geographic information about an object, such as:

message Info {
   optional int32 version = 1 [default = -1];
   optional int32 timestamp = 2;
   optional int64 changeset = 3;
   optional int32 uid = 4;
   optional int32 user_sid = 5; // String IDs

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   // If visible is set to false, this element has been deleted.
   optional bool visible = 6;
}

Nodes

Nodes can be encoded in one of two ways: as a Node (defined above) or in a special dense format. In the dense format, I store the group 'columnwise', as an array of IDs, array of latitudes, and array of longitudes. Each column is delta-encoded. This reduces header overheads and allows delta-coding to work very effectively.

Keys and values for all nodes are encoded as a single array of stringIDs. Each node's tags are encoded in alternating <keyid> <valid>. We use a single stringid of 0 to delimit where the tags of one node end and the tags of the next node begin. The storage pattern is: ((<keyid> <valid>)* '0' )*. As an exception, if no node in the current block has any key/value pairs, this array does not contain any delimiters, but is simply empty.

message DenseNodes {
   repeated sint64 id = 1 [packed = true]; // DELTA coded

   //repeated Info info = 4;
   optional DenseInfo denseinfo = 5;

   repeated sint64 lat = 8 [packed = true]; // DELTA coded
   repeated sint64 lon = 9 [packed = true]; // DELTA coded

   // Special packing of keys and vals into one array. May be empty if all nodes in this block are tagless.
   repeated int32 keys_vals = 10 [packed = true];
}
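
Unpacking the keys_vals array could be sketched as follows. The function name and its arguments are assumptions of this example: stringtable stands for the already decoded list of strings of the block, and node_count for the number of entries in the id array.

```python
def split_keys_vals(keys_vals, node_count, stringtable):
    """Return one {key: value} dict per node from a DenseNodes keys_vals array."""
    if not keys_vals:
        # Special case: no node in the block has tags, so there are no delimiters.
        return [{} for _ in range(node_count)]
    tags_per_node = []
    current = {}
    ids = iter(keys_vals)
    for key_id in ids:
        if key_id == 0:
            # Delimiter: the tags of this node are complete.
            tags_per_node.append(current)
            current = {}
        else:
            val_id = next(ids)  # string IDs come in <keyid> <valid> pairs
            current[stringtable[key_id]] = stringtable[val_id]
    return tags_per_node
```

A node without tags in an otherwise tagged block shows up as a lone 0 and yields an empty dict.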

DenseInfo does a similar delta coding on metadata.

message DenseInfo {
   repeated int32 version = 1 [packed = true];
   repeated sint64 timestamp = 2 [packed = true]; // DELTA coded
   repeated sint64 changeset = 3 [packed = true]; // DELTA coded
   repeated sint32 uid = 4 [packed = true]; // DELTA coded
   repeated sint32 user_sid = 5 [packed = true]; // String IDs for usernames. DELTA coded

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   // If visible is set to false, this element has been deleted.
   repeated bool visible = 6 [packed = true];
}

Format example

In the following, we will have a look into the bytes of an OSM PBF file. The small regional extract bremen.osm.pbf (geofabrik.de, 2011-01-13) is used as an example.

Every field is preceded by a key (a varint identifier). This key consists of a wire type and a field id: bits 0 through 2 hold the type, bits 3 and above the id. These types may be used:

  • 0: V (Varint) int32, int64, uint32, uint64, sint32, sint64, bool, enum
  • 1: D (64-bit) fixed64, sfixed64, double
  • 2: S (Length-delimited) string, bytes, embedded messages, packed repeated fields
  • 5: I (32-bit) fixed32, sfixed32, float
00000000  00 00 00 0d - length in bytes of the BlobHeader in network-byte order
00000000  __ __ __ __ 0a - S 1 'type'
00000000  __ __ __ __ __ 09 - length 9 bytes
00000000  __ __ __ __ __ __ 4f 53  4d 48 65 61 64 65 72 - "OSMHeader"
00000000  __ __ __ __ __ __ __ __  __ __ __ __ __ __ __ 18 - V 3 'datasize'
00000010  7c - 124 bytes long
00000010  __ 10 - V 2 'raw_size'
00000010  __ __ 71 - 113 bytes long
00000010  __ __ __ 1a - S 3 'zlib_data'
00000010  __ __ __ __ 78 - length 120 bytes

--- compressed section:
00000010  __ __ __ __ __ 78 9c e3  92 e2 b8 70 eb da 0c 7b  ||.q.xx.....p...{|
00000020  81 0b 7b 7a ff 39 49 34  3c 5c bb bd 9f 59 a1 61  |..{z.9I4<\...Y.a|
00000030  ce a2 df 5d cc 4a 7c fe  c5 b9 c1 c9 19 a9 b9 89  |...].J|.........|
00000040  ba 61 06 7a 66 4a 5c 2e  a9 79 c5 a9 7e f9 29 a9  |.a.zfJ\..y..~.).|
00000050  c5 4d 8c fc c1 7e 8e 01  c1 1e fe 21 ba 45 46 26  |.M...~.....!.EF&|
00000060  96 16 26 5d 8c 2a 19 25  25 05 56 fa fa e5 e5 e5  |..&].*.%%.V.....|
00000070  7a f9 05 40 a5 25 45 a9  a9 25 b9 89 05 7a f9 45  |z..@.%E..%...z.E|
00000080  e9 fa 89 05 99 fa 40 43  00 c0 94 29 0c
--- decompressed --->
00000000  0a - S 1 'bbox'
00000000  __ 1a - length 26 bytes
00000000  __ __ 08 d0 da d6 98 3f  10 d0 bc 8d fe 42 18 80
00000010  e1 ad b7 8f 03 20 80 9c  a2 fb 8a 03 - BBOX (4*Varint)
00000010  __ __ __ __ __ __ __ __  __ __ __ __ 22 - S 4 'required_features'
00000010  __ __ __ __ __ __ __ __  __ __ __ __ __ 0e - length 14 bytes
00000010  __ __ __ __ __ __ __ __  __ __ __ __ __ __ 4f 73
00000020  6d 53 63 68 65 6d 61 2d  56 30 2e 36 - "OsmSchema-V0.6"
00000020  __ __ __ __ __ __ __ __  __ __ __ __ 22 - S 4 'required_features'
00000020  __ __ __ __ __ __ __ __  __ __ __ __ __ 0a - length 10 bytes
00000020  __ __ __ __ __ __ __ __  __ __ __ __ __ __ 44 65
00000030  6e 73 65 4e 6f 64 65 73 - "DenseNodes"
00000030  __ __ __ __ __ __ __ __  82 01 - S 16 'writingprogram'
00000030  __ __ __ __ __ __ __ __  __ __ 0f - length 15 bytes
00000030  __ __ __ __ __ __ __ __  __ __ __ 53 4e 41 50 53
00000040  48 4f 54 2d 72 32 34 39  38 34 - "SNAPSHOT-r24984"
00000040  __ __ __ __ __ __ __ __  __ __ 8a 01 - S 17 'source'
00000040  __ __ __ __ __ __ __ __  __ __ __ __ 24 - length 36 bytes
00000040  __ __ __ __ __ __ __ __  __ __ __ __ __ 68 74 74
00000050  70 3a 2f 2f 77 77 77 2e  6f 70 65 6e 73 74 72 65
00000060  65 74 6d 61 70 2e 6f 72  67 2f 61 70 69 2f 30 2e
00000070  36 - "http://www.openstreetmap.org/api/0.6"
<--- decompressed ---

00000080  __ __ __ __ __ __ __ __  __ __ __ __ __ 00 00 00
00000090  0d - length in bytes of the BlobHeader in network-byte order
00000090  __ 0a - S 1 'type'
00000090  __ __ 07 - length 7 bytes
00000090  __ __ __ 4f 53 4d 44 61  74 61 - "OSMData"
00000090  __ __ __ __ __ __ __ __  __ __ 18 - V 3 'datasize'
00000090  __ __ __ __ __ __ __ __  __ __ __ 90 af 05 - 87952 bytes long
00000090  __ __ __ __ __ __ __ __  __ __ __ __ __ __ 10 - V 2 'raw_size'
00000090  __ __ __ __ __ __ __ __  __ __ __ __ __ __ __ 8f
000000a0  84 08 - 131599 bytes long
000000a0  __ __ 1a - S 3 'zlib_data'
000000a0  __ __ __ 88 af 05 - length 87944 bytes

--- compressed section:
000000a0  __ __ __ __ __ __ 78 9c  b4 bc 09 5c 14 57 ba 28  |......x....\.W.(|
000000b0  5e 75 aa ba ba ba ba 69  16 11 d1 b8 90 b8 1b 41  |^u.....i.......A|
000000c0  10 11 97 98 c4 2d 31 1a  27 b9 9a 49 ee 64 ee 8c  |.....-1.'..I.d..|
000000d0  69 a0 95 8e 40 9b 06 62  32 f7 dd f7 5c 00 01 11  |i...@..b2...\...|
000000e0  11 05 11 11 71 43 45 44  40 05 44 54 14 17 44 44  |....qCED@.DT..DD|
000000f0  40 16 15 dc 00 37 50 44  05 c4 05 7d df 39 55 dd  |@....7PD...}.9U.|
etc.
--- decompressed --->
00000000  0a - S 1 'stringtable'
00000000  __ d4 2e - length 5972 bytes
00000000  __ __ __ 0a - S 1
00000000  __ __ __ __ 00 - length 0 bytes
00000000  __ __ __ __ __ 0a - S 1
00000000  __ __ __ __ __ __ 07 - length 7 bytes
00000000  __ __ __ __ __ __ __ 44  65 65 6c 6b 61 72 - "Deelkar"
00000000  __ __ __ __ __ __ __ __  __ __ __ __ __ __ 0a 0a  |.......Deelkar..|
00000010  63 72 65 61 74 65 64 5f  62 79 0a 04 4a 4f 53 4d  |created_by..JOSM|
00000020  0a 0b 45 74 72 69 63 43  65 6c 69 6e 65 0a 04 4b  |..EtricCeline..K|
00000030  6f 77 61 0a 05 55 53 63  68 61 0a 0d 4b 61 72 74  |owa..UScha..Kart|
00000040  6f 47 72 61 70 48 69 74  69 0a 05 4d 75 65 63 6b  |oGrapHiti..Mueck|
etc.
--- decompressed part from offset 5975 --->
00000000  12 - S 2 'primitivegroup'
00000000  __ ad d5 07 - length 125613 bytes
00000000  __ __ __ __ 12 - S 2 -- Tag #2 in a 'PrimitiveGroup', containing a serialized DenseNodes.
00000000  __ __ __ __ __ a9 d5 07 - of length 125609 bytes
00000000  __ __ __ __ __ __ __ __  0a - S 1 - Tag #1 in a DenseNodes, which is an array of packed varints.
00000000  __ __ __ __ __ __ __ __  __ df 42 - of length 8543 bytes
00000000  __ __ __ __ __ __ __ __  __ __ __ ce ad 0f 02 02  |..........B.....|
00000010  02 02 04 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
00000020  02 02 02 c6 8b ef 13 02  02 02 02 02 02 02 02 f0  |................|
00000030  ea 01 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|

    Each varint is stored consecutively.
    We process until we have read 8543 bytes worth, then resume parsing the DenseNodes.
    The varints are delta-encoded node id numbers.

00000040  02 02 02 04 02 04 02 02  04 02 02 02 02 02 02 02  |................|
00000050  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
00000060  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 04  |................|
00000070  02 02 06 44 02 02 02 02  02 02 02 02 02 02 02 02  |...D............|
00000080  68 02 02 02 02 02 02 02  02 02 02 02 04 02 02 06  |h...............|
00000090  02 02 0c 02 02 02 0a 02  02 02 02 06 0c 06 02 04  |................|
000000a0  02 06 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
000000b0  02 02 02 02 02 02 02 02  04 04 02 06 04 04 10 02  |................|
000000c0  04 02 04 18 0a 02 02 02  02 02 02 02 02 02 02 02  |................|
000000d0  04 06 02 02 04 02 02 02  02 04 02 02 02 02 08 02  |................|
000000e0  02 02 02 02 02 02 02 02  02 02 02 cc 06 02 02 02  |................|
000000f0  02 02 02 02 02 02 02 02  04 02 02 02 02 02 02 02  |................|
00000100  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
00000110  02 02 02 02 02 02 02 02  36 02 02 04 04 04 02 02  |........6.......|
00000120  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
etc.
<--- decompressed ---

Of course, the protocol buffer library handles all of these low-level encoding details.
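
For readers who want to check the annotations in the dump by hand, the key and length bytes can be decoded with a few lines of Python. This is an illustrative sketch; the function names and the WIRE_TYPES table are made up for this example, with the one-letter codes taken from the list of types above.

```python
WIRE_TYPES = {0: "V", 1: "D", 2: "S", 5: "I"}


def read_varint(buf, pos=0):
    """Return (value, position just after the varint)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def describe_key(buf, pos=0):
    """Split a field key into (type letter, field id, next position)."""
    key, pos = read_varint(buf, pos)
    return WIRE_TYPES[key & 0x7], key >> 3, pos
```

Applied to bytes from the dump, `describe_key(bytes([0x0a]))` yields type S and id 1, `describe_key(bytes([0x82, 0x01]))` yields type S and id 16 ('writingprogram'), and `read_varint(bytes([0x90, 0xaf, 0x05]))` yields the datasize 87952.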

The code

The codebase is split into two pieces. Common code that is application-independent is hosted on GitHub. This includes the protocol buffer definitions and Java code that operates at the fileblock level. https://github.com/openstreetmap/OSM-binary

Unfortunately, osmosis, mkgmap, and the splitter each use a different internal representation for OSM entities. This means that I have to reimplement the entity serialization and parsing code for each program and there is less common code between the implementations than I would like. The serializer and deserializer for osmosis are in trunk.

A deserializer for an older version of the mkgmap splitter (circa 5/2010) is available on GitHub at http://github.com/scrosby/OSM-splitter

Miscellaneous notes

Downloads

The complete (current state, no history) OSM planet in PBF format is available at https://planet.openstreetmap.org/pbf/

Other places to download OSM extracts in PBF format: Planet.osm

See also

  • PBF Perl Parser
  • Osmium C++ library for working with OSM files with bindings for Python and JavaScript
  • Osm4j Java framework for working with OSM files

External links

  • Protocol Buffers at Wikipedia
  • pyrosm Python library that parses OpenStreetMap data in PBF format. Uses region files available online from GeoFabrik
  • libosmpbfreader A simple C++ library to read OpenStreetMap binary files
  • osm-read node.js library for parsing OpenStreetMap data in XML and PBF format
  • pbf_parser A Ruby gem for parsing PBF files easily
  • osm4scala High performance scala library to iterate over osm element from a pbf file.
  • parallelpbf Java multi-threaded PBF format reader
  • tiny-osmpbf JavaScript library which is optimized for small code footprint.
  • Node Locations on Ways PBF Format extension "LocationsOnWays" to include lat/lon values on ways. Used e.g. by libosmium.
  • osmpbf Rust library for parsing OSM PBF files
  • Detailed explanation of the PBF format by Mapbox
  • osm-go Go library provides a writer for encoding large OSM PBF files.