Santa Clara County, California/Social distancing protocol import
Volunteers affiliated with Open Source San José (formerly Code for San José) are carefully importing tens of thousands of business facilities based on social distancing protocols that business owners have filed with the Santa Clara County Public Health Department under COVID-19 public health orders.
Goal
This import will revamp OSM's coverage of retail, commercial, and industrial points of interest in the South Bay. We envision bringing fresher, more comprehensive POI coverage to OSM than most proprietary datasets currently offer.
Until now, local mappers have largely collected POIs ad-hoc through field surveying and armchair mapping from Mapillary footage. A 2017 analysis found this coverage to be uneven across business categories compared to the local business telephone directory. There are also concerns that this coverage underrepresents minority-owned businesses and small businesses.
Since the COVID-19 pandemic began, most POI data has been at risk of going stale due to temporary or permanent closures or changes in opening hours or services. Various city and neighborhood business associations have compiled listings of members that are open for business, but these listings are skewed toward certain kinds of businesses, and the copyright situation is unclear (or at least not clear enough to rely on in OSM).
The Santa Clara County Public Health Department is keeping track of businesses that are open during the pandemic, along with their contact information and self-declared compliance with local COVID-19 safety orders. We expect these business names and addresses to have high accuracy. These POIs will form a solid foundation that we can continue to build upon incrementally after the pandemic, through field surveying and other means.
Schedule
We have not yet developed a timetable for completing this import. Our only concrete deadline is that we want to complete this import before the pandemic subsides to the point where the public health department no longer requires businesses to submit SDPs.
- September 12, 2020: Project kickoff as part of Code for San José's local edition of Code for America's National Day of Civic Hacking
- September 17: Initial test scrape of SDP directory
- October 13: California Department of Public Health moves Santa Clara County to Tier 3; existing SDPs become invalid within 14 days
- October 15: Draft mapping of business categories to iD presets and/or tags
- November 9: Proposal drafted on the wiki
- November 16: Scrape of SDP directory at high water mark before move to Tier 1
- November 17: California Department of Public Health moves Santa Clara County to Tier 1; many businesses with SDPs must close, but unclear if remaining essential businesses will file revised SDPs
- November 19: Request for comments posted to the talk-us-sfbay, imports-us, and imports mailing lists and the #imports channel on OSMUS Slack
- November 26: Smallest MapRoulette challenges open for mapping
- December 3: Largest MapRoulette challenges open for mapping
The import may take weeks to complete, depending on the number of participants.
Source
This import pulls together three data sources:
Social distancing protocols
The Santa Clara County Public Health Department has created a non-machine-readable database of businesses and institutions that have submitted social distancing protocols (SDPs). Under an October 5 public health order, all businesses and institutions must file a social distancing protocol with the department by October 27 to stay open. As of November 16, 2020 (the last day at Tier 3), the database includes 20,682 SDPs, with caveats:
- Many establishments are either on-site services that lack a physical address or home businesses that do not accept customer visits. We plan to identify and omit these establishments.
- An establishment can file multiple SDPs, each SDP superseding the previous one. The database generally excludes superseded SDPs, but it is up to the business to mark replacement SDPs; the database does not guarantee uniqueness.
Excluding establishments without physical addresses, 14,866 SDPs had enough information to import when the import began. Since then, many more SDPs have been added to the dataset.
Other datasets
The county also maintains an address point dataset containing addresses throughout the county. We are using this dataset to geocode the address in each SDP.
The Santa Clara Valley Transportation Authority, a special district covering Santa Clara County, publishes a land use dataset that we are using to provide a hint to mappers as to whether a geocoded POI is in a residential or nonresidential area.
License
The SDP listing, address point dataset, and land use dataset are all compiled by California local government agencies and are therefore in the public domain. (The County of Santa Clara was the defendant in a landmark case before a California appellate court that resulted in such works being in the public domain.)
These datasets are not copyrighted because, lacking a statutory exception like those for works of the Department of Toxic Substances Control or works of certain colleges established by statute, "unrestricted disclosure is required".
Preparation
Scraping
The county has not published a structured dataset corresponding to this listing, so we are resorting to scraping the SDP website to reconstruct the most relevant parts of the dataset. We have also scraped HTTP response headers from the submitted form PDFs, which contain additional fields that aren't displayed in HTML format. These additional fields will help us choose precise tags.
The SDP site is being updated every day. After the initial import kicks off, we will periodically rescrape the website for new listings. Business owners have the option to resubmit an SDP, replacing the previous submission. We will deduplicate submissions by hashing the name and address of each submission.
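The deduplication step could be sketched roughly as follows. This is a minimal illustration, not the import's actual script: the record keys `name` and `address`, the normalization rules, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re


def submission_key(name: str, address: str) -> str:
    """Return a stable hash for a (name, address) pair."""

    def normalize(text: str) -> str:
        # Lowercase, drop punctuation, and collapse runs of whitespace so
        # that trivially different spellings hash to the same key.
        text = re.sub(r"[^\w\s]", "", text.lower())
        return " ".join(text.split())

    payload = normalize(name) + "\n" + normalize(address)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(submissions):
    """Keep the last submission seen for each key.

    Assumes the input is ordered oldest to newest, so a resubmitted SDP
    replaces the earlier one. The "name" and "address" keys are hypothetical.
    """
    latest = {}
    for sub in submissions:
        latest[submission_key(sub["name"], sub["address"])] = sub
    return list(latest.values())
```

Normalizing before hashing only catches resubmissions that differ in case, punctuation, or spacing; a business that changes its name or moves would still appear as a separate listing.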
Geocoding
The SDP listing includes hand-entered addresses but no coordinates. To forward-geocode the addresses to coordinates, we loaded addresses from the county's address point dataset into Pelias. The address point dataset may be related to the dataset that we are currently using in the San José building import.
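As a rough sketch of forward geocoding against a Pelias instance, the following builds a `/v1/search` request and reads the first result from the GeoJSON response. The base URL, the restriction to the `address` layer, and the helper names are assumptions for illustration; the import's actual Pelias configuration is not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a self-hosted Pelias API reachable at this address.
PELIAS_BASE = "http://localhost:4000"


def search_url(address: str, base: str = PELIAS_BASE) -> str:
    """Build a Pelias forward-geocoding URL for a free-text address."""
    query = urlencode({"text": address, "size": 1, "layers": "address"})
    return f"{base}/v1/search?{query}"


def first_coordinates(response: dict):
    """Return (lat, lon) of the first feature, or None if nothing matched."""
    features = response.get("features", [])
    if not features:
        return None
    # GeoJSON stores coordinates in longitude, latitude order.
    lon, lat = features[0]["geometry"]["coordinates"][:2]
    return lat, lon


def geocode(address: str):
    """Query Pelias over HTTP and return (lat, lon) or None."""
    with urlopen(search_url(address)) as resp:
        return first_coordinates(json.load(resp))
```

Taking only the top result keeps the sketch short; hand-entered addresses may well need a confidence check or manual review before a point is trusted.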
Post-processing
To assist mappers, extra fields are added to the geocoded points. A QGIS script:
- correlates the points with a zoning map, to indicate which locations may actually be in residential zones;
- measures the distance of each point to nearby relevant OSM features, to judge how likely the neighborhood is already well-mapped (and thus lower-priority); and
- splits the list into multiple layers depending on the type of business, so each category can be made into a separate MR challenge.
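The distance measurement can be illustrated outside QGIS with a plain great-circle calculation. This is a simplified stand-in for the QGIS script rather than a copy of it: the `lat` and `lon` keys and the function names are hypothetical, and a real script would more likely use a spatial index than a linear scan.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS 84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(point, osm_features):
    """Distance from a geocoded point to the closest existing OSM feature.

    Both arguments use hypothetical "lat"/"lon" keys. Returns None when
    there are no features to compare against.
    """
    distances = [
        haversine_m(point["lat"], point["lon"], f["lat"], f["lon"])
        for f in osm_features
    ]
    return min(distances) if distances else None
```

A small nearest distance suggests the surrounding area already has POIs mapped, which is the signal used to deprioritize a task.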
Tagging
This table summarizes how each category in the SDP listing corresponds to one or more iD presets and feature tags. Some categories are quite broad, so participants in this import will need to choose between multiple presets on a case-by-case basis.
The "Other, please specify" category is particularly challenging. Business owners have the opportunity to clarify the line of business in a freeform field, which we have scraped from metadata attached to the submitted PDFs. We have not yet chosen an efficient way to associate the freeform responses with tags.
Aside from feature tags, the following secondary tags will generally appear on imported features:
- name=* – fictitious business name
- official_name=*, operator=*, or owner=* – business name if accompanied by a fictitious business name
- addr:housenumber=*, addr:unit=*, addr:street=*, addr:city=*, addr:state=*, addr:postcode=*
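The secondary tags listed above might be assembled from a scraped record along these lines. The record keys (`dba`, `legal_name`, `housenumber`, and so on) are hypothetical, and two choices are assumptions of this sketch: falling back to the business name for name=* when no fictitious name is given, and defaulting to official_name=* where a mapper might instead choose operator=* or owner=*.

```python
# Hypothetical record keys mapped to OSM address tag keys.
ADDRESS_KEYS = {
    "housenumber": "addr:housenumber",
    "unit": "addr:unit",
    "street": "addr:street",
    "city": "addr:city",
    "state": "addr:state",
    "postcode": "addr:postcode",
}


def secondary_tags(record: dict) -> dict:
    """Build name and address tags for one SDP record."""
    tags = {}
    dba = record.get("dba")           # fictitious business name, if any
    legal = record.get("legal_name")  # business name as filed
    if dba:
        tags["name"] = dba
        if legal and legal != dba:
            # A mapper may prefer operator=* or owner=* on a case-by-case basis.
            tags["official_name"] = legal
    elif legal:
        # Assumption: with no fictitious name, the business name is the name.
        tags["name"] = legal
    for field, key in ADDRESS_KEYS.items():
        value = record.get(field)
        if value:  # skip empty or missing address parts
            tags[key] = value
    return tags
```

Empty address parts are omitted rather than written as blank tags, so a business without a unit number simply has no addr:unit=*.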
We previously considered the following tags but decided against them:
- opening_hours:covid19=open – Month-by-month fluctuation between tiers means data consumers could not be confident about the accuracy of this tag.
- If "Facility/Worksite visited by public" is "NO", some categories like "Construction" could be tagged access=private; for others, that will be a signal that the business should not be mapped. However, the responses seemed to be unreliable in a spot-check, and extracting the field from the PDFs would have been challenging.
- safety:hand_sanitizer:covid19=* – The "Hand sanitizer and/or soap and water are available at or near the site entrance…" checkbox corresponds well to this tag, but we expect most POIs to provide hand sanitizer, and extracting the field from the PDFs would have been challenging.
We do not plan to tag POIs with the phone numbers listed on the SDP website. At a glance, many of the phone numbers appear to be personal cell phones of store managers or compliance staff.
Results
We will upload this series of GeoJSON files to MapRoulette, one challenge per category. Before we upload the GeoJSON files, we will join them with the business type descriptions.
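Joining the points with their business type descriptions and splitting them by category could look roughly like this. It is a sketch under assumptions: the point keys (`id`, `category`, `lat`, `lon`), the `business_type` property name, and the shape of the `descriptions` lookup are all invented for illustration.

```python
import json
from collections import defaultdict


def challenges_by_category(points, descriptions):
    """Group geocoded points into one GeoJSON FeatureCollection per category.

    `points` is a list of dicts with hypothetical keys "id", "category",
    "lat", and "lon"; `descriptions` maps an id to its freeform business
    type description.
    """
    grouped = defaultdict(list)
    for p in points:
        # Carry every non-coordinate field through as a feature property.
        props = {k: v for k, v in p.items() if k not in ("lat", "lon")}
        props["business_type"] = descriptions.get(p["id"], "")
        grouped[p["category"]].append({
            "type": "Feature",
            # GeoJSON positions are longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [p["lon"], p["lat"]]},
            "properties": props,
        })
    return {
        category: {"type": "FeatureCollection", "features": features}
        for category, features in grouped.items()
    }


def write_challenges(collections, prefix="sdp_"):
    """Write each category's collection to its own GeoJSON file."""
    for category, collection in collections.items():
        filename = f"{prefix}{category.lower().replace(' ', '_')}.geojson"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(collection, f)
```

Each resulting file then corresponds to one challenge, with one feature per task.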
Workflow
This MapRoulette project contains one challenge per business category. Each challenge consists of one business facility per task. The task's instructions will suggest presets or tags to choose from. This document provides more detailed guidance. The largest challenges will be hidden initially while we make sure the workflow runs smoothly with the smaller categories.
Some challenges require more manual mapping because businesses in the category are not reliably mappable. For example, the "Alternative Non-hotel Guest Accommodations" category included many home Airbnbs before the SDP form was revised in October. Many of the listings in the "Construction" category are for work sites that may normally be a different business (example).
The mapper is responsible for conflating the imported POI with nearby existing data and spot-checking the business against available aerial or street-level imagery to verify that the business is not obviously a home office. We are investigating providing additional signals as part of the per-task instructions for cases where street-level imagery is unavailable. Since Pelias's handling of unit numbers is limited, the mapper should also try to refine the business's location if it lies inside a strip mall or office building.
Changesets will credit source=Santa Clara County Public Health Department along with any imagery_used=* added by iD. Changeset comments will include the hashtags #c4sj, #South-Bay-OSM, and #maproulette.
Various followup tasks will be possible outside the import. For example, we can review POIs last edited before the beginning of the import to see if they still exist (even in pre-pandemic street-level imagery).
Participants
The following Open Source San José volunteers are leading the effort to import SDPs:
- impiaaa
- Minh Nguyen
- kevinmasd
- stgibson GitHub
This MapRoulette leaderboard shows who has contributed to the import's challenges. We encourage anyone in the local community to help us import the SDPs.
Statistics
A spot-check of the SDP website as of November 23, compared to the latest County Business Patterns data from the Census Bureau, shows that importing the SDP database could significantly improve our POI coverage in the South Bay:
See also
External links
- GitHub issue tracking this import's technical and logistical planning in more detail
- December 2020 diary post calling for volunteers
- August 2023 forum topic describing the import's privacy precautions