Talk:HIFLD
A start
Well, this started in late July 2022 and, as I write this in early October 2022, it is barely two months old. It is being talked about on the Imports mail-list as ambitious and "chewy." Even if this is actually "only" the 92 as specified at HIFLD, there truly are differing standards for, say, HIFLD#Marine_Transportation, like "Daymark Locations," where the column specifying "Quality" says "Moderate," but following the link shows that the errors could be so great that the publisher admonishes these are not to be used for navigation (per the Code of Federal Regulations). That's a pretty blatant reason not to import this one of the 92 (not hundreds) of subsets of HIFLD: if the data are admittedly not good enough for the federal government (the data's publisher) to use for navigation, by law, they can easily be argued to not be good enough quality for OSM. There may be many more of these, some already also identified, further whittling down the number of subsets from 92 to "even fewer than that." So, "a good start," but a lot more vetting to go. Also, the CPAD-US data seems to ignore that some statewide data ARE better and "fresher," e.g. CPAD (California/Using CPAD data) than CPAD-US, a large component of HIFLD. Please keep working on this, knowing that critical review of the larger, smaller and really any subset of these data remains incomplete and not especially well-vetted among the wider OSM-US community. But that does seem to be growing a bit with the early-October 2022 responses (Greg Troxel, Mike Thompson, myself...). Stevea (talk) 04:42, 3 October 2022 (UTC)
Quality values
Sometimes quality values are "good" or "moderate" and sometimes (as in sub-pages) they are "90/100." How are these determined? (They look like a guess or estimate and lack specifications). Stevea (talk) 23:12, 23 October 2022 (UTC)
Hi Steve,
I wrote an explanation of the quality assessment on the mailing list, and added more specific details on how I determined the results to the wiki page as well. The quality values that use a word ("good," "moderate") were my initial estimates, made with a quick and non-scientific approach of just examining a few objects in the dataset and making up a value. I am currently replacing these according to the standard below. I'll paste what I wrote here in case you didn't see it on the page.
Quality:
Quality is assessed using the following method:
Create a random extract of 100 objects from the dataset in QGIS, examine each object individually, and determine whether a mapper could reasonably assign the object to a real object for OpenStreetMap. If I can't find a corresponding object near the point, it counts as a point against the dataset.
Thanks for the comments and continued interest. --SherbetS (talk) 23:18, 23 October 2022 (UTC)
- Are you checking for a real object, or the real object? I know that the fire stations have an issue where the dot is sometimes on the wrong building, and particularly in rural areas, it's sometimes hard to tell if a given garage is a fire station or something else. --Carnildo (talk) 01:37, 24 October 2022 (UTC)
- I am checking to see if the node is placed accurately enough to determine the correct building confidently. If the state that the imported node is in has address data, then I can line up the point with the matching building address and I would count that object as valid. If the point is on the road centerline or some sort of similar situation where I can’t confidently determine which building is being referred to, I deduct a point. If it’s not on the right building, I’m not currently able to catch that error.
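For anyone who wants to reproduce a score like "90/100," here is a minimal sketch of the arithmetic described above. It is not the script actually used for the wiki page (the extract was made in QGIS); the function names and the fixed seed are assumptions made for illustration. Each reviewed object gets a True/False verdict from a human reviewer, and the score is simply the share of True verdicts scaled to 100.

```python
import random


def sample_features(features, k=100, seed=0):
    """Draw a reproducible random extract of up to k features.

    `seed` is a hypothetical convenience so two reviewers can
    look at the same extract; it is not part of the method above.
    """
    rng = random.Random(seed)
    return rng.sample(features, min(k, len(features)))


def quality_score(verdicts):
    """Return the share of reviewed features judged matchable, out of 100.

    `verdicts` holds one boolean per reviewed feature: True if a mapper
    could confidently assign it to a real object, False if not (for
    example, the point sits on a road centerline).
    """
    if not verdicts:
        raise ValueError("no features reviewed")
    return round(100 * sum(verdicts) / len(verdicts))
```

Note that, as said above, a point that sits confidently on the wrong building would still be counted as True here, so the score is an upper bound on positional accuracy rather than a measure of it.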
Localizing and documenting refs
As this import campaign covers a given world region (here USA), local refs will be useful for conflation and quality assurance in the long run.
ref:ornl=* is mentioned at the bottom of the page and there is no doubt many more references will be used for each data category.
May I ask you to use a localized name like ref:US, please? We already do this in France, the UK and some other countries to ensure there won't be any collision in the future.
We can't demonstrate that the ornl acronym isn't in use elsewhere in the world and won't ever be in the future. Using ref:US:ornl=* instead of ref:ornl=* is future-proof and won't cause any problem in case of conflict. Fanfouer (talk) 11:42, 30 April 2023 (UTC)
- What's the advantage of this? For objects where there's one valid, verifiable code, I don't see why we wouldn't use ref=*. SherbetS (talk) 13:35, 30 April 2023 (UTC)
- Several advantages:
- Ability to document each ref nature with its own definition and validation rules.
- Prevent any conflict between different regions using the same acronym for different things.
- Distinguish raw references read in place with ref=* and qualified/refined ones in dedicated subkeys. Fanfouer (talk) 14:44, 30 April 2023 (UTC)
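As a small illustration of the key change being proposed, the sketch below renames ref:ornl to ref:US:ornl in a plain dictionary of tags during conversion. The function name, the default country code argument and the example values are hypothetical; this only shows the renaming, not any agreed tagging scheme.

```python
def namespace_ref(tags, acronym, country="US"):
    """Return a copy of `tags` with ref:<acronym> moved to ref:<country>:<acronym>."""
    old_key = f"ref:{acronym}"
    new_key = f"ref:{country}:{acronym}"
    if old_key not in tags:
        # Nothing to rename; hand back an independent copy.
        return dict(tags)
    if new_key in tags and tags[new_key] != tags[old_key]:
        # Refuse to silently overwrite a conflicting value.
        raise ValueError(f"{new_key} already set to a different value")
    renamed = {k: v for k, v in tags.items() if k != old_key}
    renamed[new_key] = tags[old_key]
    return renamed


# Hypothetical example values, for illustration only.
example = {"amenity": "fire_station", "ref:ornl": "12345"}
converted = namespace_ref(example, "ornl")
```

A generic ref=* read on the ground is left untouched, which keeps the distinction between raw and qualified references described in the last bullet.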
Source of Coordinates
The locations in many of these datasets may have been determined through "automatic geocoding", that is translating a postal address to a latitude and longitude. This can yield erroneous results. Some possible reasons are house numbers that are mis-entered, postal codes that are mis-entered, poor reference data (a lot of times the reference data is just an address range for each block), etc. In many cases no one actually visited the site to survey it, or even looked at satellite imagery to see if the location was reasonable. --Tekim (talk) 00:04, 8 July 2023 (UTC)
- Yes, this is an issue I've faced while analyzing this data. The best approach I've come up with going forward is for the datasets curated in this manner to be made into MapRoulette projects, where users can review points one by one and do any necessary sleuthing to uncover the true location of the facility. --SherbetS (talk) 00:41, 8 July 2023 (UTC)
- I am not sure what you mean by "curated in this manner"; I am going to assume that you mean datasets whose locations were determined in this manner (through automatic geocoding). In any event, how do you know how the locations were determined? I checked the metadata for some of the datasets and it isn't clear. I think it is safe to assume that, except for very small specialized datasets, some features in each dataset that carries addresses had their location determined through automatic geocoding; thus all features in nearly all datasets should be individually reviewed with a MapRoulette challenge, as you suggest. It would be nice if the datasets had feature-level metadata indicating how the coordinates were determined, but I haven't seen that yet (although perhaps it is hidden in one of the fields that isn't explained in the dataset-level metadata). --Tekim (talk) 22:46, 8 July 2023 (UTC)
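A rough sketch of how records could be turned into one-point-per-task GeoJSON, which is an input MapRoulette challenges can be built from. The column names (NAME, ADDRESS, LATITUDE, LONGITUDE) are assumptions for illustration; each HIFLD dataset has its own schema, so these would need to be adjusted per dataset.

```python
import json


def records_to_geojson(records):
    """Build a GeoJSON FeatureCollection with one Point feature per record."""
    features = []
    for rec in records:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [float(rec["LONGITUDE"]), float(rec["LATITUDE"])],
            },
            # Keep the address with the task so a reviewer can compare it
            # to the point location, which may have been geocoded.
            "properties": {
                "name": rec.get("NAME", ""),
                "address": rec.get("ADDRESS", ""),
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up record, for illustration only.
sample = [{"NAME": "Example Station", "ADDRESS": "1 Main St",
           "LATITUDE": "40.0", "LONGITUDE": "-105.0"}]
geojson_text = json.dumps(records_to_geojson(sample), indent=2)
```

Keeping the original address in the task properties is the point of the exercise: it is the only thing a reviewer has to judge whether the point landed on the right building.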
Need to See Converted Data
Prior to actually doing an import of each specific HIFLD dataset, the importer should make the converted data available (.osm format) to the community for review. In addition, if a script or program is used to do the conversion, the source code should also be made available to the community to review. --Tekim (talk) 14:16, 9 July 2023 (UTC)
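To make the request concrete, here is a minimal sketch of what a reviewable conversion script might look like, using only the Python standard library. The record layout (lat, lon, tags) and the generator string are assumptions; the tags themselves would come from whatever mapping is agreed for each dataset. New nodes get negative ids, which is the usual convention for objects that do not yet exist in OSM.

```python
import xml.etree.ElementTree as ET


def records_to_osm_xml(records):
    """Serialize records as OSM XML nodes for community review."""
    # "hifld-review-sketch" is a made-up generator name.
    root = ET.Element("osm", version="0.6", generator="hifld-review-sketch")
    for i, rec in enumerate(records, start=1):
        node = ET.SubElement(
            root,
            "node",
            id=str(-i),  # negative id: object not yet in OSM
            lat=f"{rec['lat']:.7f}",
            lon=f"{rec['lon']:.7f}",
        )
        # Sort tags so repeated conversions produce identical output.
        for key, value in sorted(rec["tags"].items()):
            ET.SubElement(node, "tag", k=key, v=str(value))
    return ET.tostring(root, encoding="unicode")


# Made-up record, for illustration only.
osm_text = records_to_osm_xml(
    [{"lat": 40.0, "lon": -105.0, "tags": {"amenity": "hospital", "name": "Example"}}]
)
```

Publishing both a file like this and the script that produced it lets reviewers check the tag mapping and the geometry independently.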
Individual Datasets Need to Be Discussed Prior to Import
The datasets that are part of the HIFLD data vary in schema, quality, and size. Prior to the import of any individual dataset, the community should have a chance to review and discuss. A blanket discussion of "HIFLD" is not sufficient. --Tekim (talk) 13:14, 13 July 2023 (UTC)
- I see the page has been edited to indicate that individual datasets will be discussed prior to import. Thanks! --Tekim (talk) 13:14, 13 July 2023 (UTC)
Use Original Source Rather than HIFLD
Many of the HIFLD datasets are taken directly from another source (e.g. EPA, SEC), or compiled from a few other sources (e.g. from the states). For the most up-to-date and highest quality data, consider taking data from these "primary" sources rather than HIFLD. A good example is hospitals: hospitals are generally regulated at the state level, and thus every state should have a list of all licensed hospitals in that state. --Tekim (talk) 19:17, 23 July 2023 (UTC)