Step 8 of 9
Verification and Calibration
Comparing predictions against outcomes and using the results to improve the model
Why Verification Matters
A forecast method that cannot be tested is not a forecast method. It's a belief system.
The Jones method is testable. We make predictions before the period they cover, lock them in, and compare them against what actually happened. No retrospective adjustments. No "well, if you interpret the forecast this way, it was actually correct." Either the direction was right or it wasn't. Either the value fell in the stated range or it didn't.
Verification is also how you improve. A model that is never wrong is not learning. A model that is occasionally wrong and knows exactly where it failed is getting somewhere. The errors are the signal. Being willing to look at them carefully is the whole point of running this kind of method over decades rather than producing one forecast and moving on.
We are still waiting on enough years of data to draw firm conclusions on some configurations. That's fine. The process keeps running.
What We Measure
For each forecast month, we compare three variables against the actual recorded outcome at the Lyndhurst Hill station.
- MinTemp: the predicted departure from the long run monthly mean versus the actual departure recorded
- MaxTemp: same comparison
- Precipitation occurrence (POP): the fraction of days with recorded rainfall versus the forecast tendency (wetter, drier, or average)
We also note any significant events the forecast either correctly anticipated or missed. A forecast that said "higher likelihood of storm activity in the third week" and got it right is worth noting. A forecast that said "drier than average" and coincided with a 200mm event is also worth noting. Both go in the record.
The station data is the reference. Not the Samford BoM station, not the online forecast aggregate. The Lyndhurst Hill station, because that's what our historical record is built from and it's the only consistent comparison point we have going back to 2004.
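For concreteness, here is a minimal sketch of how one month's verification record could be structured. It is illustrative only: the field names and types are assumptions, not the site's actual code, and the real record keeps whatever the locked-in forecast stated.

```python
# Illustrative sketch of one month's verification record. Field names are
# assumptions; the real record holds whatever the locked-in forecast stated.
from dataclasses import dataclass, field

@dataclass
class MonthRecord:
    month: str                            # e.g. "2024-03"
    max_temp_range: tuple[float, float]   # forecast anomaly range, °C from long run mean
    max_temp_actual: float                # recorded anomaly at Lyndhurst Hill
    min_temp_range: tuple[float, float]
    min_temp_actual: float
    rain_range_mm: tuple[float, float]    # stated analogue range, e.g. (80.0, 140.0)
    rain_actual_mm: float                 # station total for the month
    rain_tendency: str                    # "wetter", "drier", or "average"
    notable_events: list[str] = field(default_factory=list)  # hits and misses worth recording
```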
The Scoring Method
Each month gets a simple score. Three possible outcomes:
- Good: the forecast direction was correct (above, below, or near average) AND the actual value fell within the analogue range stated in the forecast
- Partial: the forecast direction was correct but the magnitude was outside the stated range (we said wetter than average, it was, but well above even the upper end of our range)
- Miss: the forecast direction was wrong
We don't use a numeric scoring system because the precision would be false. A "good" month where the station recorded 128mm against a forecast of 80–140mm is just as good as one where it recorded 115mm. Both fell inside the stated range. Treating them differently by scoring them numerically would imply a level of discrimination the method doesn't actually have.
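As a sketch, the three-way score reduces to two checks: direction first, then range. The function below is illustrative, not the site's code; it borrows the ±0.3°C near-average band from the example scorecard below as a default tolerance.

```python
# Illustrative scoring sketch: check direction first, then the stated range.
# The 0.3 degree near-average band is taken from the example scorecard.
def direction(anomaly: float, tolerance: float = 0.3) -> str:
    """Classify an anomaly as above, below, or near average."""
    if anomaly > tolerance:
        return "above"
    if anomaly < -tolerance:
        return "below"
    return "near"

def score_variable(forecast_direction: str,
                   stated_range: tuple[float, float],
                   actual: float) -> str:
    """Return 'Good', 'Partial', or 'Miss' for one variable in one month."""
    if direction(actual) != forecast_direction:
        return "Miss"      # wrong direction: no partial credit
    low, high = stated_range
    if low <= actual <= high:
        return "Good"      # right direction, inside the stated range
    return "Partial"       # right direction, magnitude outside the range

# Matches the example scorecard: a +0.9 degree anomaly against a forecast
# of "above average, +0.5 to +1.2" scores Good.
print(score_variable("above", (0.5, 1.2), 0.9))  # -> Good
```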
Example month scorecard
| Variable | Forecast | Actual (Lyndhurst Hill) | Direction correct | Score |
|---|---|---|---|---|
| MaxTemp anomaly | +0.5 to +1.2°C above mean | +0.9°C above mean | Yes | Good |
| MinTemp anomaly | Near average (±0.3°C) | +0.1°C | Yes | Good |
| Rainfall | 80–140mm (mean 110mm) | 94mm | Yes | Good |
| Dominant air mass | North east tendency | Mixed, north east dominant | Yes | Good |
The example above is a good month. Not every month looks like that. The accuracy page on dayboro.au shows the full record.
Finding Systematic Errors
One or two bad months tell you very little. Six to twelve months of results start to show patterns in the errors. Specific types of systematic error to look for:
- Consistent undercalling rainfall in La Niña years. The ENSO signal is stronger than the base planetary pattern match predicts in those years.
- Getting winter temperatures right reliably but summer MaxTemp wrong. That suggests the method handles cool and dry patterns better than warm and wet ones.
- Consistently overestimating rainfall from north east air mass configurations but underestimating from east to south east flows. That's an orographic issue specific to Dayboro's position at the base of the range.
Systematic errors are useful. Random errors are just noise. A systematic error tells you something real about the model's blind spots, and those blind spots can be corrected with a calibration factor derived from the same data you've been collecting.
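One way to make this concrete: tally miss rates grouped by a condition such as season, ENSO phase, or dominant air mass. The sketch below assumes scores are already recorded as (condition, score) pairs; the condition labels are illustrative.

```python
# Sketch: surface systematic errors by comparing miss rates across conditions
# (season, ENSO phase, dominant air mass). Condition labels are illustrative.
from collections import defaultdict

def miss_rate_by_condition(results: list[tuple[str, str]]) -> dict[str, float]:
    """results: (condition_label, score) pairs, e.g. ("la_nina", "Miss")."""
    totals: dict[str, int] = defaultdict(int)
    misses: dict[str, int] = defaultdict(int)
    for condition, outcome in results:
        totals[condition] += 1
        misses[condition] += outcome == "Miss"
    return {c: misses[c] / totals[c] for c in totals}
```

A miss rate that spikes under one label while staying low everywhere else is a calibration target. Roughly equal rates across all labels is just noise.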
Calibration in Practice
Calibration means adjusting the model based on observed errors. It does not mean changing the forecast retroactively to match what happened.
Three calibrations we've applied here at Dayboro over the years:
ENSO adjustment
During strong La Niña or El Niño events, the observed rainfall signal is consistently stronger or weaker than the base planetary pattern match predicts. Our analogues are drawn from across the full ENSO cycle, so they average out the signal. We now apply an ENSO overlay: when BoM's ENSO outlook shows a strong active event, we adjust the rainfall forecast up (La Niña) or down (El Niño) by a factor derived from the historical outcomes for ENSO years in our library.
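A minimal sketch of that overlay, assuming per-phase factors derived from the library's ENSO-year outcomes. The 1.15 and 0.85 values are placeholders, not the calibrated figures used here.

```python
# Sketch of the ENSO overlay. Factors are placeholders, not calibrated values;
# the real ones would come from historical ENSO-year outcomes in the library.
ENSO_FACTOR = {"la_nina": 1.15, "el_nino": 0.85, "neutral": 1.0}

def apply_enso_overlay(rain_range_mm: tuple[float, float],
                       phase: str, strong_event: bool) -> tuple[float, float]:
    """Scale the base rainfall range only when BoM shows a strong active event."""
    if not strong_event:
        return rain_range_mm
    f = ENSO_FACTOR.get(phase, 1.0)
    return (rain_range_mm[0] * f, rain_range_mm[1] * f)
```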
Orographic adjustment
Dayboro sits at the base of the D'Aguilar Range. East to north east airflows lift against the range and produce significantly more rainfall here than anywhere on the flat to the east. We apply a +10% to +15% uplift to months with a dominant easterly air mass signal. That percentage came from comparing our station record against the Samford BoM station across 15 years of easterly dominant months.
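In code terms the adjustment is a conditional multiplier. The sketch below takes the midpoint of the +10% to +15% band, which is an illustrative choice, not the site's rule.

```python
# Sketch of the orographic uplift: applied only when the dominant air mass
# signal is easterly. Using the midpoint of +10% to +15% is an assumption.
EASTERLY = {"E", "ENE", "NE"}

def apply_orographic_uplift(rain_range_mm: tuple[float, float],
                            dominant_air_mass: str) -> tuple[float, float]:
    if dominant_air_mass in EASTERLY:
        uplift = 1.125  # midpoint of the +10% to +15% band
        return (rain_range_mm[0] * uplift, rain_range_mm[1] * uplift)
    return rain_range_mm
```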
Sunspot cycle strength
Solar Cycle 25 came in stronger than the long run average. Most of our historical analogues come from weaker cycles. The historical pattern matching was therefore underestimating the solar signal. We now weight analogues from stronger past cycles more heavily when Cycle 25's activity is above average for its phase. This is a small adjustment but it's made the temperature forecasts more consistent over the past two years.
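A sketch of what that weighting could look like, assuming each analogue year carries a cycle-amplitude value. The linear, clamped scheme here is an illustrative choice; only the principle (up-weight strong-cycle analogues when the current cycle runs above average for its phase) is from the text.

```python
# Sketch of analogue weighting by solar cycle strength. The linear scheme
# clamped to [0.5, 2.0] is an assumption, not the method's actual rule.
def analogue_weight(analogue_amplitude: float,
                    current_amplitude: float,
                    phase_average: float) -> float:
    if current_amplitude <= phase_average:
        return 1.0  # current cycle not above average: weight analogues equally
    ratio = analogue_amplitude / phase_average
    return min(max(ratio, 0.5), 2.0)  # clamp so no single analogue dominates
```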
What Calibration Is Not
The temptation to fudge is real when you've put 20 years into a method and a month comes in badly. I understand the temptation. I've felt it. The record is the record. The miss gets published.
The Public Record
Every month's forecast and outcome is published on our Forecast Accuracy page. Good months and bad months, both. We don't curate for the strong results.
If you're evaluating this method, the accuracy page is the evidence. Not our description of the method. Not our explanation of why it should work in theory. The numbers. Scroll through them. Form your own view.
That's the only honest way to present it.