Talk:O5m
Clarity of the Format Description
Nice format. I can see the protobuf inspiration in it. Could you give performance benchmarks for the time to encode and decode for each format? Also, what is the o5m block size you're using in your benchmarks? --Nutscrape 02:54, 17 May 2011 (BST)
- Hi! Yes, you're right, some of the ideas came from decoding the .pbf format. I did not want to use the libraries, so I decoded a .pbf file manually – without analyzing the source code. This gave me deeper insight into the .pbf format – an excellent format!
- O5m block size? There is no block size for the .o5m format. In contrast to .pbf, .o5m is not structured in blocks; it's a kind of stream format. So far I have not done any benchmarks, but if you like, I will try... What would be an appropriate scenario to compare .osm, .pbf and .o5m? --Marqqs 21:02, 17 May 2011 (BST)
- I was referring to the 0xff/reset command. --Nutscrape
- The Reset byte must be stored at every change of the object type: between nodes and ways, and between ways and relations. In addition, a Reset byte may be stored at any position in the file (not inside other objects, of course). The more Reset bytes you store, the larger the file will be, and the easier direct access becomes – if you want to pick out just one OSM object. --Marqqs 16:24, 18 May 2011 (BST)
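(For illustration, a minimal sketch of what a decoder might do on a Reset byte – assuming, as the direct-access use above suggests, that all delta counters and the string reference table are invalidated; the identifiers are hypothetical, not taken from osmconvert.)

 #include <stdint.h>

 /* Hypothetical decoder state; field names are illustrative only. */
 struct o5m_decoder {
     int64_t delta_id, delta_lon, delta_lat;    /* delta counters */
     int64_t delta_timestamp, delta_changeset;
     int strtab_fill;                           /* used string-table entries */
 };

 /* On reading the 0xff Reset byte, all reference state is dropped,
    so decoding can start fresh at this position in the file. */
 static void on_reset(struct o5m_decoder *d) {
     d->delta_id = d->delta_lon = d->delta_lat = 0;
     d->delta_timestamp = d->delta_changeset = 0;
     d->strtab_fill = 0;   /* earlier string references become invalid */
 }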
I removed the text on PBF not offering a choice of compression. It offers uncompressed or deflate. Uncompressed is twice as fast to write and twice the size compared to the deflate default. --Nutscrape 02:54, 17 May 2011 (BST)
- OK! Maybe this was a misunderstanding. I just wanted to point out that, at present, .pbf allows only one method of compression: zlib, whereas .osm and .o5m are formats without a predefined compression method. Thus, you will find .osm and .osc files bz2-compressed, gzip-compressed and lzop-compressed. The same applies to .o5m. --Marqqs 21:02, 17 May 2011 (BST)
You also have an underspecification in the design. When you say "If there is no more space in the table to hold a new string pair, the oldest string pair must be deleted.", you need to be very, very specific here. The encoder and decoder have to have synchronized string tables for decoding to work. You must give your exact LRU replacement algorithm, including the exact string-table maximum size, the order in which tags are processed (and whether they're processed before or after metadata such as user names) and, possibly, how ties are broken. In addition, this LRU replacement might be very fragile if the metadata on an entity is increased in the future. --Nutscrape 02:54, 17 May 2011 (BST)
- I must confess, I'm not sure if I follow you correctly. The table's size is defined at O5m#Strings: 15,000 strings (or string pairs, respectively).
- The issue is that at all points during compression, the compressor's and decompressor's string tables must be identical. Say you implement a faster compressor that does an approximate LRU algorithm, while my decompressor uses an exact LRU algorithm. Then, when string "#15001" is inserted, your algorithm thinks that slot #53 was approximately the least recently used and places "#15001" into that slot. My decompressor, however, knows that slot #96 was the least recently used, so it places "#15001" into that different slot. The next string says to use the contents of slot 96. My exact-LRU decompressor will thus record a different output than what was passed to your compressor. There are other related synchronization problems if I define LRU differently from you (e.g., say your compressor treats usernames before tags, and my decompressor goes in the opposite order). --Nutscrape
- Maybe the description on the Wiki page isn't clear enough. I will try to put it in other words. I will describe the decoding only; encoding must be programmed in such a way that the subsequent decoding will operate correctly.
- Every string (or string pair) you read as plain text (not as a reference) must be stored in a string table (exception: string length > 250 characters). If the table is full (15,000 elements), the oldest element is discarded before the newest is stored. References are relative; for example, "1" means the newest stored string, "2" means the string which was stored before the newest, and 15000 refers to the oldest string in the table. There is no LRU algorithm applied. This means that you have to re-store every popular string pair (such as "highway=residential"), even if it has been referred to by string references several hundred times lately. With LRU you would get smaller files, because popular strings would not have to be repeated at all, and references < 128 would be coded in one byte instead of two (see pbf encoding). The decoding would need a bit more CPU power, and the algorithm would be a bit more complicated. Could be worth a try. Do you know a fast algorithm for that? All I can think of is a linked list, whose elements you would not have direct access to by reference number. This slows down the decoding. An alternative would be PLRU, but this still needs 14 table accesses to get the right element out of 16384 (binary tree). --Marqqs 16:24, 18 May 2011 (BST)
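(For illustration, a minimal sketch of such a FIFO string table as a circular buffer; the names and the exact buffer layout are assumptions, not osmconvert's code.)

 #include <string.h>

 #define TABSIZE 15000
 #define MAXLEN  250     /* longer strings/pairs must not be stored */

 static char tab[TABSIZE][MAXLEN + 2];  /* pairs keep their inner 0x00 separator */
 static int  newest = -1;               /* index of the most recently stored entry */

 /* Store a string (pair) that was read as plain text; once the table
    is full, each store silently overwrites the oldest entry. */
 static void tab_store(const char *s, int len) {
     if (len > MAXLEN) return;          /* too long: do not store */
     newest = (newest + 1) % TABSIZE;
     memcpy(tab[newest], s, len);
     tab[newest][len] = 0;
 }

 /* Resolve a relative reference: 1 = newest, 2 = stored before that, ... */
 static const char *tab_get(int ref) {
     return tab[(newest - (ref - 1) + TABSIZE) % TABSIZE];
 }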
Could you clarify what you mean with: "Note that strings are always stored as string pairs, even if defined as single string. The only time this difference takes effect is when writing to or reading from an .o5m file. In this case, for a single string only two zero bytes are used instead of three. " --ChristianVetter 14:02, 26 June 2011 (BST)
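(For illustration, assuming the layout that sentence describes – a string pair is delimited by three zero bytes, a single string by only two:)

 string pair  "oneway"="yes":  0x00 'o' 'n' 'e' 'w' 'a' 'y' 0x00 'y' 'e' 's' 0x00
 single string "abc":          0x00 'a' 'b' 'c' 0x00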
Compression Algorithms
I'm surprised that using back-references for string pairs is a net win after compression: since that technique is basic stuff for a compressor, you'd expect the compressor to do a better job at it (reach further back in the file, code using a fraction of a byte, join more than one item together, etc.) than the base format. Has this been benchmarked? --Vincent De Phily 19:26, 17 May 2011 (BST)
- Hi Vincent, no, there is no proper benchmark so far. One of my goals was to create a format which does not need to be compressed. Using the format should lead to small file sizes even if no compression is applied. And although I never tested this, I am very sure that the format can be encoded and decoded very fast. --Marqqs 21:02, 17 May 2011 (BST)
- Well, we could argue about that endlessly, but my opinion is that there's no point in using a non-compressed file: the CPU savings are dwarfed by the bandwidth/disk space wastage (at least for quick compressors like gzip), and your app can read/write the compressed stream seamlessly anyway. From that premise, if a simpler format leads to better compression, it's a net gain. Note also that gzip's back-reference implementation (for example) has been tuned and optimized for years, whereas you're starting yours from scratch. --Vincent De Phily 23:36, 17 May 2011 (BST)
- There are several advantages to handling these back-references for strings. First, gzipping more data is slower than gzipping less data, so doing a pre-encode helps. In addition, gzip has a rather small back-buffer from which it can copy; pre-encoding back-references makes that buffer effectively extend over more entities. --Nutscrape
Also, you can help the compressor, for example by sorting the tags: if the tags always come in the same order, the compressor is more likely to be able to represent a few of them as a single repetition. --Vincent De Phily 19:26, 17 May 2011 (BST)
- Yes, you're right. The new format goes halfway there: it uses string pairs. Therefore highway=residential is one unit and is coded in no more than 2 bytes. Your suggestion would result in further clustering, and the file size would decrease a bit. At the moment I'm not sure if this would be worth the effort, because it would slow down the encoding process. Maybe we should try and find out. :-) --Marqqs 21:02, 17 May 2011 (BST)
- This was a small ~0.5% effect in my experiments. --Nutscrape
No libraries shall be needed?
I cannot follow the argument that no library is required to use O5m. From the programmer's view I need a way to parse the input file. I have two choices for that:
- Implementing a parser
- Using a library that provides the parser
This is valid for all formats: for osm you need an XML parser provided as a library; for pbf you need the PBF parser provided as a library; and for O5m you need a new parser, which will also be provided as a library.
So where's the advantage? Or am I missing something? --WanMil 15:30, 21 May 2011 (BST)
- Yes, you're right! It's a POV thing. Some parsers are easier to implement than others. A huge effort would be necessary to implement the zlib functions, but everything else could be managed by an average software engineer. I deleted the lines in question. --Marqqs 20:44, 21 May 2011 (BST)
Random access
As far as I can tell, there is no mechanism for random access in a (large) .o5m file. This could easily be added simply by appending a unique string to the reset code. I'd suggest that the 0xff reset byte be followed by 0x00 0x00 0x00 0x00 0x00 0x00 0x00 (7 bytes). This can easily be detected by simply scanning in 32-bit integers and checking for "0" (the 7 bytes ensure that there is a 32-bit integer 0 on every 4-byte boundary). Also, I believe this string would not occur anywhere in the file under normal circumstances.
That said, any code which can be used to resync would be useful.
- Interesting suggestion. With this Sync element, OSM data would be "streamable". Also, you could try to perform a binary search in an .o5m file. At present I cannot think of any application which would use this feature, but who knows... :-) By definition, every start byte between 0xf0 and 0xff must be a single byte and does not start a data object. Therefore I would suggest this sequence:
0xee 0x07 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff
- "0xee" starts a new object of length "0x07" which is unknown to existing .o5m programs, hence they will ignore it and jump over it. New programs would recognize the seven "0x00" and start decoding at the "0xff". --Marqqs 12:29, 27 May 2011 (BST)
- P.S.: I wouldn't say "no random access" at present. There is at least one program (o5mfilter) which uses random access to .osm and .o5m files: the way section and the relation section are processed more than once. Maybe "no stream access" or "not streamable" would be more accurate. --Marqqs 17:17, 27 May 2011 (BST)
- That sounds good. Thanks.
What's the advantage of random access as described here?
I think it would be great to be able to jump directly to the relations block or to the ways block, so I could first parse all relations, then all ways, and finally all nodes. With the current formats one has to parse the files the other way round.
As far as I understand the format description, o5m does not help very much here (please correct me if I am wrong!). Why?
- The format description is open: some elements may, but need not, be contained in the o5m file. So I cannot rely on the existence of the special sync element, nor on where the sync elements are located (one sync element before the node section, one before the way section, one before the relations section – or one every 2,000 elements – or one every 5 MB).
- To obtain smaller file sizes, o5m files will be compressed externally. So jumping from A to B means that you still have to read and decompress the whole file. Jumping back is not possible.
As a result I cannot use random access in my software. Am I missing something? --WanMil 12:29, 4 June 2011 (BST)
Move version number to first part of object header
It has been suggested that the version number should become a mandatory part of OSM objects. This would result in a few changes:
- The node data will have to be resequenced in this order: 0x10, length, id, version; if version!=0: timestamp; if timestamp!=0: changeset/uid/user; longitude, latitude.
- The file will start with 0xff 0xe0 0x04 0x6f 0x35 0x6d 0x32 ("o5m2"). --Marqqs 11:40, 31 May 2011 (BST)
- I think in this early stage of development it's OK to change the format definition without caring about the .o5m files which have already been written. The programs osmconvert and o5mfilter will soon be adjusted to the changes. I will update the description on the Wiki page. --Marqqs 15:10, 31 May 2011 (BST)
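(For illustration, a sketch that checks the new file start described above; the magic bytes are exactly those given – 0xff Reset, header object 0xe0, length 4, "o5m2" – while the function itself is hypothetical.)

 #include <stdio.h>
 #include <string.h>

 /* Check whether a file begins with the "o5m2" header sequence. */
 int has_o5m2_header(FILE *f) {
     static const unsigned char magic[7] =
         { 0xff, 0xe0, 0x04, 'o', '5', 'm', '2' };
     unsigned char buf[7];
     if (fread(buf, 1, sizeof buf, f) != sizeof buf)
         return 0;
     return memcmp(buf, magic, sizeof magic) == 0;
 }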
File Size Comparisons
I'd be interested whether you could add .pbf.gz and .pbf.7z to the comparisons table -- EdLoach 16:28, 31 May 2011 (BST)
- This could take hours on my weak computer – nevertheless I will try. But I am not sure if it makes sense to compress an already compressed file, since .pbf is compressed internally with zlib. Sometimes such a second compression even increases the file size. But you are right, let's try it anyway. --Marqqs 16:49, 31 May 2011 (BST)
- Unfortunately, the result was as I suspected: a .pbf file increases in size when compressed again. I tried the bz2 and 7zip algorithms (see article page) and will do another test with gzip. --Marqqs 17:56, 31 May 2011 (BST)
- Could you please redo the PBF benchmarks for gz and 7z with the uncompressed PBF format? If somebody wants to apply 7zip to a PBF file he would turn off the internal zlib compression. --ChristianVetter 13:52, 26 June 2011 (BST)
Reason for new format?
Can you explain why the new format is needed? Compared to the already established PBF, I don't see anything really new. You mention the need for zlib as a drawback of PBF, but from your numbers it's obvious that you need zlib (or something like it) to get the same kind of compression that PBF has. You mention no support for random access as a drawback, but your format doesn't have that either. The same goes for readability or editability for users. One of your goals is "processable as data stream", but PBF is processable as a stream (yes, it has blocks, but do you see a problem with that?). So what's left? -- Joto 08:04, 3 June 2011 (BST)
- Hi Joto – good questions. In short: if you don't like it, don't use it. ;-) In my opinion it is better than PBF – for certain applications. I would never demand that o5m should completely replace PBF. I will try to answer your questions.
- Data stream: If you have a continuous data stream, it's necessary to be able to synchronize on it. PBF does not have such a mechanism, and it's not possible to implement one because, due to the dense-nodes structure, any byte sequence is legal; you cannot define a legal byte sequence as a Sync object.
- Random access: In .o5m, if you need direct access to every OSM object, you may precede every dataset with a Reset byte. This is not possible in PBF due to its block structure: there you have to decode the whole block first, and even decompress it (in standard PBF).
- Compression: There's only one compression method defined for PBF. If you want a smaller file, you cannot simply recompress it with a different algorithm, because this would increase the file size (see o5m#File_Size_Comparison). On the other hand, if you want to compress your data as fast as possible (e.g. with lzop), this is not defined for PBF either.
- There are a few measurable advantages (I took germany.osm as a base for my experiments):
- .o5m (uncompressed) can be read 12% faster than .pbf (standard compressed). If you need to read some sections more than once (e.g. to process dependencies between relations), this value can easily increase up to 50%.
- .o5m (compressed) is up to 11.5% smaller than .pbf (compressed).
- .o5m (uncompressed) is 16% smaller than .pbf (uncompressed).
- Standards: You can convert .osc files into .o5c – and back. There are o5m standards for even smaller file sizes: no history and no version.
- All this is, of course, my personal opinion, and I really did not want to start a competition between the PBF and o5m formats. If you decide to develop an application, you will choose the format which fits your needs; this may be PBF or o5m or any other format. For o5mfilter, for example, I chose o5m because it made the program much faster. --Marqqs 14:21, 3 June 2011 (BST)
Hello, I was wondering whether this file format is still considered "experimental". I find the format simply clever, but all the big planet dump sites are just using .osm or .pbf. So is this file format going to replace the others anytime soon? axjacosta 12:50, 7 November 2011
- Hi, this format has indeed left the experimental stage. More than a few people use it to keep their local planet files up to date. Others choose it because of some speed advantages with osmfilter. But I don't think that it's going to replace .osm or PBF. The .osm XML format is easy to handle and human-readable. PBF is a bit complicated if you try to understand it byte by byte, but it's not a bad format, and it's well established. Of course, I would appreciate it if someone offered planet downloads in .o5m format. But then the files would have to be compressed to minimize traffic, so the format would lose most of its advantages. In the end it's not that bad to simply download the planet as PBF and convert it locally to .o5m. I guess that the program osmconvert needs about the same time for the conversion as it would take to decompress a compressed .o5m file. --Marqqs 16:09, 7 November 2011 (UTC)
Merging?
I don't understand why o5m is said to support easy merging of files. What exactly do you mean by merging, and by easy merging? I take merging to mean reading two input files and creating a new file that contains the union of both.
Isn't that also possible with the pbf format? --WanMil 11:59, 4 June 2011 (BST)
- Yes, of course, this can be done with PBF too. It's just easier to use o5m because of its flat hierarchy (no data blocks). This "easy" does not refer to the usage of programs; it refers to software development. --Marqqs 12:37, 4 June 2011 (BST)
- I believe merging PBF files is as easy as appending one to the other – as long as you do not have to remove duplicate objects. --ChristianVetter 14:03, 26 June 2011 (BST)
- Exactly right. But you definitely will have duplicate objects if you're merging two neighboring regions. And if you just concatenate two files, you will destroy the ascending order of objects. Not to mention that you would get a second header object in the middle of the file. --Marqqs 16:13, 26 June 2011 (BST)
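(To illustrate the point: a sketch of a real merge over two id-sorted inputs – reduced here to plain id lists – that keeps the ascending order and drops duplicates. Real code would compare (type, id) and keep the newer version on a tie.)

 #include <stdio.h>
 #include <stdint.h>
 #include <stddef.h>

 /* Merge two ascending id lists into their union, in order. */
 void merge_ids(const int64_t *a, size_t na, const int64_t *b, size_t nb) {
     size_t i = 0, j = 0;
     while (i < na || j < nb) {
         if (j == nb || (i < na && a[i] < b[j]))
             printf("%lld\n", (long long)a[i++]);
         else if (i == na || b[j] < a[i])
             printf("%lld\n", (long long)b[j++]);
         else {                       /* same id in both inputs */
             printf("%lld\n", (long long)a[i]);
             i++; j++;
         }
     }
 }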
String format
I think the description of the string format needs a hint which character encoding is used. Is it UTF-8? --WanMil 13:40, 11 June 2011 (BST)
- You're right, this information had to be added. Thanks! --Marqqs 20:09, 11 June 2011 (BST)
Definition of the bounding box
I would suggest you define the bounding box not as (minlat,minlon) (maxlat,maxlon), but as (lat,lon) (latheight,lonwidth). That way you can express boxes that cross the international date line concisely. --Nutscrape 05:28, 13 June 2011 (BST)
- Hi! Good idea. However, most programs use absolute coordinates to define the box rather than relative extents (latheight, lonwidth). It could be a compromise to define the box by its corners: lower left and upper right. --Marqqs 15:50, 13 June 2011 (BST)
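(The two representations under discussion, as structs with illustrative names; a box crossing the date line, e.g. from lon 179 to lon -179, is ambiguous as (min,max) but not as origin plus extent.)

 struct bbox_minmax { double minlat, minlon, maxlat, maxlon; };
 struct bbox_extent { double lat, lon, latheight, lonwidth; };

 /* Example: a box 2 degrees wide across the date line */
 struct bbox_extent across = { -10.0, 179.0, 20.0, 2.0 };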
Why are jump sections not used by default before Nodes, Ways and Relations?
Hi, how about adding jump sections before the Nodes, Ways and Relations blocks? I want to read the file in reverse order: relations, ways and after that nodes. When I read the description I thought that jump sections would be written before them and that I could easily skip unnecessary file parts. As it is, I need to skip every node section and every way section to seek to the relations. Dkiselev (talk) 23:04, 8 February 2013 (UTC)
- I would like to have that too. --WanMil (talk) 13:07, 11 February 2013 (UTC)
- I think that's a good idea. The program osmfilter could also benefit from this. However, this would mean that random access to the o5m output file would be needed, because the forward-jump distances would have to be written to the file at a later point in time. To make this possible I just extended the format a bit and inserted a spacer (see the latest change in the o5m format description). Now we just have to write a program that uses this new feature and writes o5m data with Jump information. ;-) --Marqqs (talk) 17:09, 16 February 2013 (UTC)
Hi Marqqs. Have you already implemented this behaviour? Dkiselev (talk) 11:04, 19 February 2014 (UTC)
- Hi Dkiselev, sorry, the feature has been specified, but it has not been implemented in osmconvert. Maybe there is another program which generates these jump marks? --Marqqs (talk) 16:03, 29 April 2014 (UTC)
How to calculate the length of the Uid
For the normal tag/value string pairs it is clear that if the combined length is over 250 characters, the string pair should not be saved for later. What about a Uid/User string pair? The Uid is stored as a binary number; should its length be measured as the number of characters it uses in binary format, or as converted into a string?
In the example, uid=1020,"John" is coded as 0x00 0xfc 0x07 0x00 0x4a 0x6f 0x68 0x6e 0x00. Should the Uid length be measured as 4 characters, because "1020" contains 4 characters, or as 2 characters, because 0xfc 0x07 is 2 bytes? --RocketMan (talk) 08:03, 25 April 2014 (UTC)
- Hi RocketMan, user IDs are varint-encoded as in PBF. osmconvert does this with the procedure o5_uvar32buf(). The elements of a string pair are zero-separated. The Uid length is 2 characters, as you correctly assumed. --Marqqs (talk) 16:03, 29 April 2014 (UTC)
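(For illustration, a sketch of this unsigned varint encoding – seven payload bits per byte, high bit set on all but the last byte. This is not the osmconvert procedure o5_uvar32buf() itself, just an equivalent demonstration.)

 #include <stdint.h>
 #include <stdio.h>

 /* Encode v as an unsigned varint; returns the number of bytes. */
 int uvar32(uint32_t v, unsigned char *buf) {
     int n = 0;
     while (v >= 0x80) { buf[n++] = (v & 0x7f) | 0x80; v >>= 7; }
     buf[n++] = (unsigned char)v;
     return n;
 }

 int main(void) {
     unsigned char buf[5];
     int n = uvar32(1020, buf);           /* prints: 0xfc 0x07 */
     for (int i = 0; i < n; i++) printf("0x%02x ", buf[i]);
     putchar('\n');
     return 0;
 }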
Questions for a Perl implementation
The format description is clear enough for me to do a Perl implementation. I just have one question:
Are the reference numbers for ways and relations treated as one or two values (stored in one or two variables)? If a reset is performed before the relation section, this is irrelevant; but if not, the refs in the relations might be wrong.
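(Whatever the answer, each delta counter defined by the spec keeps its own accumulator. A sketch of o5m's signed-number convention – the lowest bit carries the sign – and of applying one delta; whether way refs and relation refs share a counter is deliberately left open here.)

 #include <stdint.h>

 /* Decode a signed o5m number: lowest bit 0 = positive, 1 = negative.
    E.g. 0x0a decodes to +5, 0x05 decodes to -3. */
 int64_t svar64(uint64_t u) {
     return (u & 1) ? -(int64_t)(u >> 1) - 1 : (int64_t)(u >> 1);
 }

 /* Apply a raw delta to one counter; how many such counters exist
    for references is exactly the question asked above. */
 int64_t apply_delta(int64_t *counter, uint64_t raw) {
     *counter += svar64(raw);
     return *counter;
 }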