This short series (go to Part 1 or Part 2) arises from the recently published paper, “The Evolving Role of Humans in Weather Prediction and Communication”. Please read the paper first.
Objective verification of forecasts will remain hugely important, and the authors duly note that. But one factor not discussed (perhaps due to space limitations?) is the quality of the verification data. That matters…perhaps not to bureaucrats, who tend to overlook components of the verification sausage that provide context. But flawed verification datasets give you flawed verification numbers, even if the calculations are completely mathematically correct!
As someone who has analyzed U.S. tornado, wind, and hail data for most of my career, and published research rooted in those data, I can say two things with confidence:
1. It’s the most complete, precise and detailed data in the world, but
2. Precision is not necessarily accuracy. The data remain suffused with blobs of rottenness and grossly estimated or even completely fudged magnitudes, potentially giving misleading impressions of how “good” a forecast is.
How? Take the convective wind data, for example. More details can be found in my formally published paper on the subject, but suffice it to say, the data are deeply contaminated, questionably accurate, and surprisingly imprecise; I’m amazed they have generated as much useful research as they have. For example: trees and limbs can fall down in severe (50 kt, or 58 mph, by NWS definition) wind, subsevere wind, light wind, or no wind at all. Yet reports of downed trees and tree damage, when used to verify warnings, are bogused to severe numeric wind values by policy (as noted and cited in the paper). A patently unscientific and intellectually dishonest policy!
For another example, estimated winds tend to be overestimates, by a factor of about 1.25 in bulk, based on human wind-tunnel exposure (same paper). Yet four years after that research was published, estimated gusts continue to be treated exactly like measured ones for verification (and now ML-informing) purposes. Why? Either estimated winds should be thrown out, or a pre-verification reduction factor should be applied to account for human overestimation. The secular increase in wind reports over the last few decades since the WSR-88D came online also should be normalized. Either approach would be far more scientifically justifiable than using the reports as-is, with no quality control or temporal detrending.
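To make that arithmetic concrete, here is a minimal Python sketch of the kind of pre-verification adjustment argued for above: deflate estimated gusts by the bulk overestimation factor before comparing them to the 50-kt severe criterion, and express annual report counts relative to a fitted trend. The table layout, column names, and the simple linear-trend choice are illustrative assumptions on my part, not anything prescribed in the paper.

```python
# Minimal sketch (illustrative only): adjust estimated gusts and detrend
# annual report counts before verification. Column names ("gust_kt",
# "measured", "year") are hypothetical placeholders.
import numpy as np
import pandas as pd

SEVERE_KT = 50.0            # NWS severe-gust criterion (50 kt, ~58 mph)
OVERESTIMATE_FACTOR = 1.25  # bulk human overestimation factor noted above

def adjust_estimated_gusts(reports: pd.DataFrame) -> pd.DataFrame:
    """Deflate estimated (non-measured) gusts by the bulk overestimation
    factor, then flag which reports still meet the severe criterion."""
    out = reports.copy()
    out["gust_kt"] = out["gust_kt"].astype(float)
    estimated = ~out["measured"].astype(bool)
    out.loc[estimated, "gust_kt"] = out.loc[estimated, "gust_kt"] / OVERESTIMATE_FACTOR
    out["verifies_severe"] = out["gust_kt"] >= SEVERE_KT
    return out

def detrend_annual_counts(reports: pd.DataFrame) -> pd.Series:
    """Express each year's report count relative to a fitted linear trend,
    one simple way to keep the secular rise in reporting from masquerading
    as a real increase in severe wind."""
    counts = reports.groupby("year").size().astype(float)
    years = counts.index.to_numpy(dtype=float)
    slope, intercept = np.polyfit(years, counts.to_numpy(), 1)
    trend = slope * years + intercept
    return pd.Series(counts.to_numpy() / trend, index=counts.index,
                     name="count_relative_to_trend")
```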
For one more example, which we discussed just a little in the paper, all measured winds are treated the same, even though an increasing proportion come from non-AWOS, non-ASOS, non-mesonet instruments such as school and home weather stations. These are of questionable scientific validity in terms of proper exposure and calibration. The same can be said for storm-chaser and -spotter instrumentation, which may not be well-calibrated at a base level, and which may be either handheld at unknown height and exposure, or recording the slipstream if mounted on a vehicle.
Yet all of those collectively populate the “severe” gust verification datasets that also are used to train machine-learning algorithms, to the extent that winds actually measured by scientific-grade, calibrated, verifiably well-sited instruments make up a tiny minority of reports. With regard to wind reports, national outlooks, local warnings, and machine-learning training data all use excess, non-severe wind data for verification; but because they all do, comparisons among them still may be useful, even if misleading.
Several of us severe-storms forecasters have noticed operationally that some ML-informed algorithms for generating calibrated wind probabilities put bull’s-eyes over CWAs (county warning areas) and small parts of the country (mainly in the East) that are known to heavily use “trees down” to verify warnings, and that have much less actual severe thunderstorm wind (based on peer-reviewed studies of measured gusts, such as mine and this one by Bryan Smith) than the central and western U.S. This has little to do with meteorology, and much to do with inconsistent and unscientific verification practices.
To improve the training data, the report-gathering and verification practices that inform it must improve, and/or those who employ the training data must apply objective filters. Will they?
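As one illustration of what such an objective filter could look like, here is a minimal sketch that keeps only measured gusts from scientific-grade networks for training purposes. The column names, the trusted-network list, and the threshold are hypothetical placeholders, not an official list or anyone’s operational practice.

```python
# Minimal sketch of an objective report filter for ML training data.
# Column names ("gust_kt", "measured", "network") and the trusted-network
# list are hypothetical placeholders.
import pandas as pd

TRUSTED_NETWORKS = {"ASOS", "AWOS", "MESONET"}  # calibrated, properly sited
SEVERE_KT = 50.0                                # NWS severe-gust criterion

def filter_training_reports(reports: pd.DataFrame) -> pd.DataFrame:
    """Keep only measured gusts from trusted networks that meet the severe
    criterion, so training data are not diluted by estimated gusts or
    bogus-valued reports (e.g., "trees down" assigned a default wind)."""
    keep = (
        reports["measured"].astype(bool)
        & reports["network"].isin(TRUSTED_NETWORKS)
        & (reports["gust_kt"] >= SEVERE_KT)
    )
    return reports.loc[keep].copy()
```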
This concludes the three-part series stimulated by Neil’s excellent paper. Great gratitude goes to Neil and his coauthors, and to the handful who ever will read this far.