Frederick County, Virginia, Building Import

From OpenStreetMap Wiki
Jump to navigation Jump to search

Import building footprints for Virgina (USA) from the VGIN. The import is currently IN REVIEW until Nov 1 2024

Import Proposal Available Here (Shapefile, 8MB)

Goals

Add missing FredCo buildings to OSM to massively help local mappers by adding 59k missing buildings to this county. Buildings that already exist in OSM for Frederick County (5.7k) will NOT be modified.

This import will not conflate with the Virginia Address Points dataset as this introduces potential for errors, would not affect the 5.7k existing buildings, and can be done later in a distinct import.

This import will not include the independent cities nearby, notable Winchester VA, as it's my first import and a rural region seems simpler as a learning exercise

Schedule

  • 13 Oct - Download dataset
  • 18 Oct - Finish data prep and share with community for review
  • Upload schedule TBD

Import Data

Data source: Virginia Geographic Information Network (see Virginia/VBMP)

ODbL Compliance verified: Yes. VGIN public domain data has been used in multiple prior works. Currently per the VGIN website, "The reference layers are defined as non-secure, public domain datasets that are authored and created by governmental and other entities."

Data Files

Background

VGIN is directed by Code of Virginia to maintain statewide orthoimagery refreshed on a four year cycle. From 2002 to 2024 imagery was updated on a ~3year cycle with 1/3 of the state updated each year.

The Building Footprint layer is part of the Virginia Base Mapping (VBMP) layers and is published quarterly. It contains outlines of structures remotely rendered through digitizing of VBMP digital imagery, local government subdivision plats, and other sources. Building footprints carry no addressing, ownership, resident information, or construction specifications.

Known Dataset Traits & Issues

  • Major Positive: VGIN imagery for Frederick County was collected in winter after the leaves have fallen but before snow. This makes it excellent for aerial tracing as structures are readily visible. See Example 6 below

  • Moderate: Shed retailers (sellers of garden sheds, storage sheds, etc) often leave their structures in a lot. VBMP counts these as buildings, which is correct technically, but as they are moved frequently a human would not map them until they are "semi-permanent" e.g. installed on someone's property. See example for Valley Structures below. I have manually removed all instances I found (there are about 10 shed retailers in the Fred. Co region which all shared the same issue). Similar issue fixed for some tractor trailer parking lots

  • Major: VBMP is known to be of highly varying quality region-by-region. To address this, I am focusing on one county which (so far) appears high-quality

  • Moderate: VBMP typically includes decks and patios in building footprints leading to excess vertex counts and odd building outlines
  • Moderate: VBMP counts most covered structures as buildings e.g. covered walkways, lean shelters which results in larger-than-normal buildings when you are dealing with a 'campus' as the individual building polys are often "joined" by covered walkways
  • Minor: VBMP is missing some buildings, especially industrial buildings with white roofs. Not of higher concern as these are few in number and often already present on OSM due to being readily visible in most imagery
  • Minor: VBMP does not always break row homes into one 'building' polygon per residence, sometimes there is a single polygon for all homes in the same physical building. Same as a human mapper would do
  • Minor: VBMP data is from ~2022 and some structures may have been (re)moved. In manual QA I've seen a few temporary sheds. Future work could consider building removal, but IMO that's better left for human verifiers not automated imports given the low instance count
  • Minor: The building footprints dataset and the VBMP most recent imagery dataset are not always from the same time window, leading to inaccuracies when the buildings change location. This is mainly an issue for sheds, construction trailers, trailer homes, etc. There are occasional not-actually-present buildings due to these errors. IMO most issues have been found & remediated during the data preparation phase.

Import Size

There are 5,739 buildings in OSM for Frederick County as of 13 Oct. Added the cleaned VBMP dataset would result in 64,808, an 11x increase.

Preparation Stage Building Counts
01 - VBMP All 4,395,463
02 - VBMP Fred Co only 64,759
03 - VBMP Buildings NOT in OSM 59,291
04 - VBMP Post Preparation

(ready for import)

59,069

Import Type

One Time Import

New data is released quarterly, so future contributors may choose to repeat this process

Data Preparation

Explaining how the initial data was loaded, processes, and prepared for import

  1. Loaded VirginiaBuildingFootprint -- 2024 Q3 (entire dataset) into QGIS
  2. Filtered to LOCALITY = 'Frederick County'
  3. Used QuickOSM plugin to download OSM buildings in Frederick County, Virginia

    [out:xml] [timeout:25];
    {{geocodeArea:Frederick County, Virginia}} -> .area_0;
    (
        node["building"](area.area_0);
        way["building"](area.area_0);
        relation["building"](area.area_0);
    );
    (._;>;);
    out body;

    Note: QuickOSM generates 3 layers - ways, nodes (typically entrances/exits referenced by the building way) and lines (relations referenced by building ways). We will focus on the ways as those are the footprints of interest. There is only one relation

    Note: VGIN uses EPSG:3968 - NAD83 / Virginia Lambert while OSM uses EPSG:4326 - WGS 84. QGIS recommended (and I accepted) the Inverse of NAD83 to WGS 84 (40) + Virginia Lambert Conic Conformal transform identified by Identifiers: INVERSE(DERIVED_FROM(EPSG)):1736, EPSG:3967

  4. Join OSM and VGIN data

    Using VGIN as the target layer, add attributes from all spatially joined OSM buildings

  5. Filter items already present in OSM

    Filter to features where "osm_id" is NULL to remove all OSM buildings

  6. Update labels

    Remove all labels, add building=yes

  7. Manually fix of tiny polygons

    Searched for buildings smaller than 20m2 and manually removed inaccurate entries such as for-sale movable sheds sitting in sale lots. Manually merged inaccurate entries such as single-family homes incorrectly split into two polygons. Most issues were in structures under 13m2

  8. Run QGIS Topology Checker & manually fix issues

    • Check for invalid geoms OR duplicates OR multi-parts. No issues detected

    • Check for gaps. 149 issues found, fixed by manually editing features. Most issues were pools (see below). VGIN often incorrectly perceives a pool/deck/exterior stairs and any surrounding walkway as being part of the building yet (correctly) understands the pool is not a building thus leaves a gap in the footprint polygon. Note that not all 'gaps' are immediately incorrect. The San Damiano Monastery in OSM one such example - there are about 15 cases of 'correct' gaps in the final dataset. small number of gaps were due to digitization errors, seemingly often caused by incorrect building detection such as a car parked close to a small house or potentially attempting to overlay "old" building footprints with "newer" footprints when aerial imagery was re-digitized. The fix is remove the offending poly and repair the correct one such that it follows the correct outline from the latest imagery

    • Check for overlaps: There were ~5. My QGIS skills are limited and the issues were all townhomes, so I just merged the individual units into a unified poly where needed
VGIN often incorrectly digitizes a pool/deck/exterior stairs and any surrounding walkway as being part of the building yet (correctly) understands the pool is not a building. Example below - red indicated the detected gap. Building footprint is not visualized here - it wraps both the building and the entire fenced area
Gap caused by pool (building footprint not visualized - ot wraps both the building and the entire fenced area)
Example of digitization error in VGIN building footprint data
Example of digitization error in VGIN building footprint data
San Damiano Monastery in VGIN aerial imagery with VGIN building footprint highlighted. Courtyard results in a (correct) gap in the building footprint
  1. Simplify Geometry

    Union nearby nodes using a tolerance of 1-ft. Reduced nodes from 582,599 to 549,797, or 5.6%. Re-run topology checks to ensure no issues

  2. Check for excessive points

    Added field for num_points($geometry) and manually reviewed highest point-count polygons. No issues found

  3. Manual QA

Tagging Plans

Only building=yes as no other useful fields are present

Workflow

Extents for each of the 60 upload chunks
Extents for each of the 60 upload chunks
  1. Used QGIS to split import layer into 60 sub-layers with 1k entries each. Sub-layers started with NW features then proceeded towards SE, see graphic for visualizing of resulting chunk extents. 1k chunk size ensures we don't overload OSM servers. See collapsed section below for code
  2. Load each chunk into JOSM
    1. Download existing OSM data for the chunk extent Note: Be sure to download OSM data into the same layer as the chunk data so they are combined into one dataset. You will need to also uncheck 'Discourage Upload'
    2. Run validity checker and resolve issues. Typically these are when non-building data intersects the newly-added building data, such as existing OSM roads
    3. Upload with advanced settings to ensure we get one changeset per chunk with about 500 objects per request (typically, 5 to 10 requests per changeset)
  3. Update the 'Import Status Per Upload Chunk' table below

Note: Initially ran into the rate limit for my import user - waiting for ~50 minutes then re-uploading allowed me to complete the first changeset without need a revert, and I have since emailed the DWG asking for the `importer` role to be added to my import account.

import os
from qgis.core import QgsVectorFileWriter, QgsCoordinateReferenceSystem, QgsProcessingException

from qgis.core import QgsWkbTypes


# Set target CRS to EPSG:4326 (WGS 84)
target_crs = QgsCoordinateReferenceSystem("EPSG:4326")

# Define the input layer and output directory
# Use the currently selected layer in QGIS
layer = iface.activeLayer()
output_dir = "/Users/hamiltont/Downloads/export_test"

# Number of features per chunk
chunk_size = 1000

# List to store chunk extents for eventual output
chunk_extents = []

if not layer.isValid():
    print("Error: Original layer is invalid.")
else:
    print("Original layer is valid.")
geom_type = layer.geometryType()
print(f"Original layer geometry type: {geom_type}")
print(f"Original layer geometry WKB type: {layer.wkbType()}")
print(f"Original layer geometry: {QgsWkbTypes.displayString(layer.wkbType())}")
print(f"Original layer CRS: {layer.crs().authid()}")


# Function to get the centroid coordinates of a feature
def get_centroid_coords(feature):
    geom = feature.geometry().centroid()
    return (geom.asPoint().y(), geom.asPoint().x())  # (Y, X) coordinates

# Sort all features by Y (north-south) first, and by X (west-east) second
sorted_features = sorted(features, key=get_centroid_coords, reverse=True)

print(f"Total features: {len(features)}")
print(f"Total features after sorting: {len(sorted_features)}")


# Create and export each chunk
for i in range(0, total_features, chunk_size):
    chunk_number = i // chunk_size + 1
    print(f"\nProcessing chunk {chunk_number}...")
    
    # Create a memory layer for the chunk
    sub_layer = QgsVectorLayer(f"MultiPolygonZ?crs={layer.crs().authid()}", 
                                f"chunk_{chunk_number}", "memory")
    geom_type = sub_layer.geometryType()
    print(f"Sub layer geometry type: {geom_type}")
    print(f"Sub layer geometry WKB type: {sub_layer.wkbType()}")
    print(f"Sub layer geometry type: {QgsWkbTypes.displayString(sub_layer.wkbType())}")
    print(f"Sub layer CRS: {sub_layer.crs().authid()}")

    sub_layer_data = sub_layer.dataProvider()

    # Add fields and features to the memory layer
    sub_layer_data.addAttributes(layer.fields())
    sub_layer.updateFields()
    sub_features = sorted_features[i:i + chunk_size]


    # Check if the chunk has valid features
    if not sub_features:
        print(f"Chunk {chunk_number} is empty. Skipping export.")
        continue
    print(f"Chunk {chunk_number}: Extracted {len(sub_features)} features.")

    success = sub_layer_data.addFeatures(sub_features)
    if not success:
        print(f"Warning: Some features could not be added to chunk {chunk_number}.")

    sub_layer.updateExtents()
    
    for feature in sub_features:
        if not feature.geometry().isGeosValid():
            print(f"Invalid geometry detected in feature ID: {feature.id()}")

    if not sub_layer.isValid():
        print(f"Error: Input layer {sub_layer.name()} is invalid.")
        break

    # Reproject the chunk to EPSG:4326
    try:
        print(f"Reprojecting chunk {chunk_number} to EPSG:4326...")
        reprojected_layer = processing.run(
            "native:reprojectlayer",
            {
                'INPUT': sub_layer,
                'TARGET_CRS': target_crs,
                'OUTPUT': 'memory:'  # Use an in-memory layer for efficiency
            }
        )['OUTPUT']
        print(f"Reprojection successful for chunk {chunk_number}.")

    except QgsProcessingException as e:
        print(f"Error during reprojection of chunk {chunk_number}: {e}")
        raise  # Stop execution if reprojection fails
        
    # Get the extent of the reprojected chunk (in lat/lon)
    reprojected_layer.updateExtents()
    extent = reprojected_layer.extent()
    chunk_extents.append({
        "chunk": chunk_number,
        "features": len(sub_features),
        "min_lon": extent.xMinimum(),
        "min_lat": extent.yMinimum(),
        "max_lon": extent.xMaximum(),
        "max_lat": extent.yMaximum(),
    })
    print(f"Repo Layer CRS: {reprojected_layer.crs().authid()}")
    print(f"Repo Layer Chunk {chunk_number} extent: "
          f"Min Lon {extent.xMinimum()}, Min Lat {extent.yMinimum()}, "
          f"Max Lon {extent.xMaximum()}, Max Lat {extent.yMaximum()}")

    # Define the output path for the GeoJSON
    output_path = os.path.join(output_dir, f"chunk_{chunk_number}.geojson")
    print(f"Exporting chunk {chunk_number} to: {output_path}")

    # Export the reprojected chunk as GeoJSON
    error = QgsVectorFileWriter.writeAsVectorFormat(
        reprojected_layer, output_path, "UTF-8", reprojected_layer.crs(), "GeoJSON"
    )

    # Check for export errors
    if error == QgsVectorFileWriter.NoError:
        print(f"Successfully exported chunk {chunk_number}.")
    else:
        print(f"Error exporting chunk {chunk_number}: ({error})")

# Print all chunk extents in CSV format at the end
print("\nCSV Output of Chunk Extents:")
print("chunk,feature_count,min_lon,min_lat,max_lon,max_lat,url")
for extent in chunk_extents:
    min_lon = extent['min_lon']
    min_lat = extent['min_lat']
    max_lon = extent['max_lon']
    max_lat = extent['max_lat']
    f_count = extent['features']

    # Create the OSM URL for the bounding box
    osm_url = f"https://www.openstreetmap.org/?minlon={min_lon}&minlat={min_lat}&maxlon={max_lon}&maxlat={max_lat}"

    # Print the CSV row with the OSM URL
    print(f"{extent['chunk']},{f_count},{min_lon},{min_lat},{max_lon},{max_lat},{osm_url}")

print("\nAll chunks processed.")

Dedicated Import Account

Will use hamiltont_nova_Import

Changeset Tags

source=VGIN

Import Status Per Upload Chunk

Note: Per the DWG's advice, import is paused until 1 Nov to allow a longer period for pre-import review

Status chunk feature_count min_lon min_lat max_lon max_lat
DONE - 21 Oct 1 1000 -78.35550324 39.38628836 -78.23639107 39.45100915
DONE - 22 Oct 2 1000 -78.36486372 39.3562779 -78.18446043 39.38701276
DONE - 22 Oct 3 1000 -78.36232016 39.33417316 -78.14383743 39.35820365
DONE - 22 Oct 4 1000 -78.35916379 39.31658864 -78.12153098 39.33574577
DONE - 22 Oct 5 1000 -78.36629139 39.3037176 -78.09449648 39.31896178
6 1000 -78.38055957 39.29479093 -78.08685682 39.30619942
7 1000 -78.38631535 39.28951937 -78.07920111 39.29772042
8 1000 -78.39133519 39.28177215 -78.0662739 39.29207375
9 1000 -78.3987386 39.27463163 -78.06035894 39.28536153
10 1000 -78.40381729 39.26752527 -78.03916332 39.27742435
11 1000 -78.41157923 39.25928657 -78.03517969 39.27121217
12 1000 -78.41559956 39.25383288 -78.03824747 39.26291478
13 1000 -78.41657955 39.24732356 -78.04050137 39.25756661
14 1000 -78.40125911 39.24054574 -78.04188646 39.25089048
15 1000 -78.40427716 39.23522245 -78.0471561 39.24374494
16 1000 -78.40464456 39.23107838 -78.04942125 39.23890116
17 1000 -78.40776549 39.22584345 -78.05139517 39.23474042
18 1000 -78.40998346 39.22219806 -78.06075301 39.22901605
19 1000 -78.41528008 39.2164628 -78.06367693 39.22534856
20 1000 -78.42096011 39.21052094 -78.06494283 39.219668
21 1000 -78.42473085 39.20592805 -78.06986175 39.21303849
22 1000 -78.43014612 39.20236712 -78.07478176 39.20929534
23 1000 -78.43456088 39.19701165 -78.07067478 39.20613869
24 1000 -78.43175999 39.19387641 -78.07350933 39.20002774
25 1000 -78.43348559 39.1913623 -78.07545182 39.19740007
26 1000 -78.4186674 39.18919444 -78.07826241 39.19459834
27 1000 -78.42187212 39.1864137 -78.0770293 39.19213158
28 1000 -78.42841321 39.18311966 -78.08173155 39.18942092
29 1000 -78.42754527 39.18068565 -78.08206382 39.18618401
30 1000 -78.42706656 39.17835389 -78.07515185 39.18413886
31 1000 -78.42378204 39.17518108 -78.07536611 39.18191237
32 1000 -78.41485794 39.17242027 -78.08691814 39.17857648
33 1000 -78.40855336 39.16907011 -78.09038342 39.17512478
34 1000 -78.39771669 39.16527983 -78.08833045 39.171926
35 1000 -78.4024012 39.16260584 -78.08885609 39.1682223
36 1000 -78.40672575 39.15993417 -78.08985429 39.16547766
37 1000 -78.40149198 39.15621413 -78.08337495 39.16317633
38 1000 -78.4110847 39.14899646 -78.09153487 39.15917937
39 1000 -78.41869993 39.14237712 -78.09006146 39.15214547
40 1000 -78.42307718 39.13745269 -78.09737781 39.14529157
41 1000 -78.42532803 39.1319263 -78.10509044 39.14020876
42 1000 -78.43563655 39.12425453 -78.10133758 39.13509912
43 1000 -78.45183931 39.11290444 -78.10374787 39.12714409
44 1000 -78.46382574 39.10687765 -78.10302921 39.11587823
45 1000 -78.47360731 39.1031298 -78.10369237 39.11056484
46 1000 -78.4815121 39.09818099 -78.11132441 39.10652915
47 1000 -78.49514054 39.09403615 -78.10945963 39.10155333
48 1000 -78.49243737 39.0912623 -78.11494465 39.09755431
49 1000 -78.49470675 39.08852621 -78.12385917 39.09481076
50 1000 -78.49655629 39.0862345 -78.13055767 39.0916529
51 1000 -78.45415932 39.08328985 -78.12936486 39.0891412
52 1000 -78.45451099 39.07911906 -78.14002908 39.08642785
53 1000 -78.5019145 39.07350801 -78.13474757 39.08227542
54 1000 -78.45918014 39.06751839 -78.13963556 39.07685453
55 1000 -78.47642295 39.06092414 -78.14172856 39.07071946
56 1000 -78.47197529 39.04863938 -78.14559808 39.06324825
57 1000 -78.49549195 39.03490493 -78.1502462 39.05147232
58 1000 -78.48602808 39.02662178 -78.15381819 39.03809128
59 1000 -78.31411696 39.01393555 -78.16297631 39.02796566
60 69 -78.31451995 39.00698526 -78.22753749 39.01466277

QA

Green is OSM, Burgundy (or bright red) is to-be-added from VGIN

Example 1: Old Robert E. Aylor Middle School

The main school building (green) was added to OSM in Jan 2023. The covered walkways and equipment sheds (red) are not present on OSM

OSM link

Region around Old Robert E. Aylor Middle School after joining OSM and VGIN data
Region around Old Robert E. Aylor Middle School after joining OSM and VGIN data

Existing Aerial Imagery status

Looking at existing imagery gives a good example of how VBMP is one of the best options for this region. As a reminder, "Clarity" imagery is not time-based it uses a mix of time ranges to ensure the best possible resolution.

Main Building Only: NAIP, Bing, USGS

Construction Shown: Esri, MapBox

All Buildings Done: VBMP, Esri Clarity

Example 2: Headley's Antiques

Rural business building missing from OSM. As VBMP imagery was collected in winter, we can also see two sheds which are not visible in any other imagery. Bottom-left shows OSM-present data (new buildings added by me) that are not yet shown in VBMP dataset as the construction was completed after VBMP aerial imagery.

OSM Link

Rural business building missing from OSM. As VGIN imagery was collected in winter, we can also see two sheds which are not visible in any other imagery. Bottom-left shows OSM-present data (new buildings added by me) that are not yet shown in VBMP dataset as the construction was completed after VBMP aerial imagery.
green=existing in OSM ; red=newly added

Example 3: Pack Horse Road

Remote rural homes - completely unmapped on OSM. Adding these is critical for emergency response. Use 'Esri Clarity' maps to see houses in ID editor

Note: For better visibility, VGIN data is yellow in this image

OSM Link

Remote rural homes - completely unmapped on OSM. Adding these is critical for emergency response.
Newly added homes (via VGIN data) are yellow in this image

Example 4: North Winchester Industrial District

Urban region within the Winchester VA beltway of I-81 and VA-37. Mix of industrial and row homes. We see that OSM only has major industrial centers mapped, no one has yet traced smaller buildings. In most cases, the VGIN dataset is of high enough quality that it contains distinct polygons for individual residences in a unified townhome/row house building

OSM Link

Urban region within the Winchester beltway of I-81 and VA-37. Mix of industrial and row homes. We see that OSM only has major industrial centers mapped, no one has traced the individual buildings. In most cases, the VGIN dataset is of high enough quality that it creates distinct polygons for individual residences in a unified townhome/row house building
green=OSM; red=Added by VGIN

Example 5: Valley Structures

Note: This is BEFORE the Data Prep Step "Removal of tiny polygons". Post removal none of the small sheds shown here are in the data

Valley Structures is a local shed retailer. These for-sale shed structures are moved regularly while in the sale lot - a human would likely compare different imagery datasets and determine which buildings are permanent and ignore the for-sale structures. This import instead would (incorrectly IMO) load in the for-sale sheds as buildings. It would also (correctly) load the permanent structures into OSM. Note: Once the for-sale structures are in someone's back yard we WOULD like them classified as buildings, which is a distinction a human would likely make easily. See also: Shenandoah Sheds, LLC - 1518 Fairfax Pike, White Post, VA 22663 for a similar issue. I've fixed 3-4 of these in the dataset

OSM Link

Valley Structures.png

Example 6: Comparison of Bing & VBMP

One major positive is that VBMP imagery for Frederick County was collected in winter after the leaves have fallen but before snow.

This makes it by far the best imagery for aerial tracing, as many structures are readily visible that are hidden by greenery on all other imagery I can locate.

OSM Link

Bing Imagery of Laurel Hill
Bing Imagery
VBMP Imagery of Laurel Hill
VBMP Imagery

References