Talk:Import/Catalogue/PSMA Admin Boundaries
Trial attempt at uploading
I had a go at seeing how long it would take to upload these using JOSM. After a quick clean-up of the NSW dataset (about 2.8 million changes) and about 1h 45m of processing, it took 18h 45m to upload. The only problem is that for about 17 hours there is nothing but unconnected nodes; all it would take is one mapper deleting one of those nodes to cause a problem.
I tried bulk_upload.py but couldn't get it to work. After a small modification, the upload.py scripts do work. Uploading with the scripts takes just under 19h (see https://master.apis.dev.openstreetmap.org/user/gumikeze_bulk/history).
--Adavidson (talk) 07:59, 14 October 2018 (UTC)
- Good job testing! Do you think it's just slower due to it being the dev API? Any idea why JOSM doesn't upload the relations? BTW, it's likely NSW won't be imported since we have imported data already. Aharvey (talk) 12:48, 14 October 2018 (UTC)
- I chose NSW simply because it was the largest dataset and if anything was going to make JOSM choke that would be it.
- What I meant by the comment about the relations is that JOSM uploads everything in the order nodes -> ways -> relations. So if you do a huge upload, most of the changesets will contain only bare nodes, the last couple will contain the ways, and then one final changeset will contain all of the relations (see https://master.apis.dev.openstreetmap.org/changeset/134807). That means that for quite a long time you have only bare nodes on the map, and it would only take one mapper deleting one of them to cause trouble; I'm not sure how JOSM would recover (I don't know if you can re-do a multi-part upload if it breaks somewhere along the way).
- I tried using bulk_upload but it uses libraries that no longer seem to be available on Ubuntu. It was easy to modify upload.py to work (code is here https://github.com/FrakGart/bulkupload). It is a bit clunky in the sense that you have to pre-process with osm2change, smarter-sort and split, and it produces a lot of temp files while it is working. However, it does work, and it has the nice feature that it can create changeset comments with a progress counter (i.e. 1 of n). The script is still not perfect, but it does mean that the bare nodes are only on the map for a maximum of two to four minutes, which will hopefully cut down on the opportunity for mappers to "clean" them up. It is also far more robust in the case of disruption from a power cut or network, server, or mapper problems.
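- For reference, a minimal sketch of that workflow, assuming the script names mentioned above; the input file name, intermediate file names, and exact arguments are illustrative (see the repo for the real options), but the leftover files-sorted-part*.diff.xml files are the same ones used in the next section:
# convert the cleaned-up .osm file into an osmChange diff (file names are examples)
python osm2change.py files.osm
# sort the diff so objects upload in a dependency-safe order
python smarter-sort.py files.osc
# split the sorted diff into changeset-sized parts (files-sorted-part*)
python split.py files-sorted.osc
# upload each part in turn; the changeset comment carries a "1 of n" counter and
# the server's diff responses are left behind as files-sorted-part*.diff.xml
python upload.py files-sorted-part*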
- I don't know whether the production server will be faster or slower. On the one hand, the test API might be running on a Raspberry Pi for all we know; on the other, I was the only user uploading at the time.
- Adavidson (talk) 01:16, 15 October 2018 (UTC)
Getting data back out of dev server for checking
As there is no Overpass server and there are no planet dumps for the dev API server, it's quite hard to get the uploaded data back out to check in JOSM. I ended up extracting each relation through the API and then merging them using osmium. Since upload.py leaves all of the diff XML files behind at the end, we can use them to work out what to download:
grep relation files-sorted-part*.diff.xml | cut -d '"' -f 4 | parallel --progress -j5 wget https://master.apis.dev.openstreetmap.org/api/0.6/relation/{}/full -q -O testresults/{}.osm
Then merge them together using:
osmium merge -o combined.pbf testresults/*.osm
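As an optional sanity check (not part of the workflow above), osmium can print extended file information, including object counts, before the merged file is opened in JOSM:
osmium fileinfo -e combined.pbf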
Once we have the final candidates for import, we can do a final round-trip test with the dev server.
What to do with Wooroonooran (QLD)?
The dataset has two "Wooroonooran" localities next to each other in QLD. Should we combine them? Adavidson (talk) 09:51, 10 November 2018 (UTC)
Typo in "Miniyeri" (NT)
Should be spelt Miniyerri (http://www.ntlis.nt.gov.au/placenames/view.jsp?id=22441). Adavidson (talk) 23:24, 10 November 2018 (UTC)
Duplicate locality names within same state/territory
There are a number of localities with the same name within the same state, e.g. Hillside, Ascot, and Reedy Creek, all in Victoria. These are normally written as Hillside (Greater Melbourne) and Hillside (East Gippsland). Do we need to factor this in? Ewen Hill (talk) 03:39, 5 November 2019 (UTC)
Duplicate island names within NT (and elsewhere)
Within the PSMA data, there can be duplicate island names within the same island group, with the islets around an island given the same name as the main island. I don't see this as an issue, and it could be a huge benefit for mapping missing islands, but it is something to take into consideration. Ewen Hill (talk) 03:39, 5 November 2019 (UTC)
What is in the PSMA data set are bounded localities, so a bounded locality carrying an island's name doesn't necessarily cover only the island with that name.
Adavidson (talk) 03:59, 5 November 2019 (UTC)
First Nation Names, localities with dual names, foreign language names
A number of contributors have already added foreign and Indigenous names to localities. How do we test and transfer all of the current legitimate additional tags to the new process?
This is an interesting question. Are there any bounded localities with dual names? Remembering that a bounded locality is not the same as a locality. Adavidson (talk) 03:58, 5 November 2019 (UTC)
Thanks for all the replies. I am not certain on this question; however, there have been a number of changes to names in the NT and WA (mainly in Arnhem Land), for example Port Keats/Wadeye, and whilst this appears to be a replacement, I was more noting the edge cases we should be careful of. Ewen Hill (talk) 05:25, 5 November 2019 (UTC)
Are states too big for a single import?
We are possibly pushing the boundaries (pardon the pun) with uploading these large chunks. Would you consider a couple of hundred localities at a time, perhaps starting smaller and expanding as lessons are learnt? The first upload will be a one-off, as the quarterly updates will only be relatively small, so why build something large for a one-off event? Ewen Hill (talk) 03:39, 5 November 2019 (UTC)
Given how relations work in OSM, it is easier to do all of a state at once rather than doing part of it, trying to merge the boundaries together, then doing another bit and merging those, and so on. How to do an upload on a state-by-state basis has been figured out; it's just waiting for the last of the remaining questions to be discussed and a decision made:
- Do we use the simplified boundaries or the originals?
- Do we delete the outer boundaries from the data before uploading or upload and then delete once completeness is tested?