Step 2 of 9

Data Collection and Preparation

What you need, where it comes from, and how to organise it

The four data categories

Before you can run any calculations, you need data in four distinct categories. Each one does a different job in the forecast model.

  • Astronomical: planetary positions, lunar phases, and solar activity. This is the core of the Jones method. You're computing where Jupiter, Saturn, Uranus, and Neptune are at any given date, and what phase the Moon is in.
  • Historical weather: past records for the location you're forecasting. You need a long, continuous series. The longer the better, though quality matters more than length past a certain point.
  • Solar: sunspot data and solar cycle information. This is partly astronomical and partly climate record, and the two sources need to be aligned.
  • Local: your own station data if you have it, or proxy data from the nearest reliable station if you don't. For Dayboro we use 21 years of records from Lyndhurst Hill. If you're starting fresh, the nearest long-running BoM station will do.

You don't need all of this perfectly assembled before you start. But you do need to know where each dataset comes from, and you need a plan for how it fits together. That's what this step covers.

Astronomical data sources

Jean Meeus, Astronomical Algorithms (2nd edition, 1998)

This is the calculation reference. It contains the algorithms for computing planetary positions, lunar phases, and solar events to high precision using nothing more than a date and a scientific calculator (or a spreadsheet). It is still in print, available from bookshops and online. You need it.

The chapters you'll use most: chapter 47 (position of the Moon), chapter 25 (solar coordinates), and the planetary chapters (31 to 33), which give you the longitudes of Jupiter, Saturn, Uranus, and Neptune. Meeus gives coefficients for polynomial approximations that are accurate to within a few arc minutes for dates within a few centuries of J2000.0 (1 January 2000, 12:00 TT). That's more than precise enough for this work.

I won't pretend the book is easy reading. It assumes you know what an ecliptic longitude is and what a Julian Date is. If those are new concepts, spend an hour on Wikipedia first. Once you've built the spreadsheet, you won't need to revisit the maths regularly.
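If the Julian Date is new to you, it may help to see it computed. The sketch below follows the standard calendar-to-Julian-Date conversion for Gregorian dates and then expresses the result as Julian centuries from J2000.0, which is the time variable the polynomial approximations take. The function names are our own for illustration, and the small difference between UTC and TT (about a minute) is ignored, which should not matter at the precision we need here.

```python
from datetime import datetime, timezone

J2000 = 2451545.0  # Julian Date of 1 January 2000, 12:00


def julian_date(dt: datetime) -> float:
    """Julian Date for a timezone-aware datetime (Gregorian calendar dates)."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    dt = dt.astimezone(timezone.utc)
    year, month = dt.year, dt.month
    # Day of month including the fraction of the day that has elapsed.
    day = dt.day + (dt.hour + dt.minute / 60 + dt.second / 3600) / 24
    if month <= 2:
        # January and February count as months 13 and 14 of the previous year.
        year -= 1
        month += 12
    a = year // 100
    b = 2 - a + a // 4  # Gregorian calendar correction
    return (int(365.25 * (year + 4716)) + int(30.6001 * (month + 1))
            + day + b - 1524.5)


def julian_centuries(jd: float) -> float:
    """Julian centuries elapsed since J2000.0."""
    return (jd - J2000) / 36525.0


noon_2000 = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
print(julian_date(noon_2000))                    # 2451545.0
print(julian_centuries(julian_date(noon_2000)))  # 0.0
```

In a spreadsheet the same arithmetic fits in one formula per row. The point is that every date becomes a single number you can subtract and scale.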

NASA JPL Horizons

Available at https://ssd.jpl.nasa.gov/horizons/. This is a web-based tool for computing precise positions of any solar system body for any date, using NASA's DE421 ephemeris (which is what we use). You can request tabular output of planetary longitudes in ecliptic coordinates, at whatever time interval you choose.

Horizons is most useful for verification and for building your historical position tables going back 60 to 178 years. Computing 178 years of monthly planetary positions by hand from Meeus is possible but tedious. Horizons does it in about 30 seconds. Download the output, paste it into your spreadsheet, and move on.

At the moment the Horizons web interface is free and requires no registration. It exports to plain text, which is easy to work with in any spreadsheet application.
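If you'd rather not paste by hand, the plain-text output can be split programmatically. Horizons brackets the data table with the marker lines $$SOE and $$EOE. The sketch below pulls out the rows between them, assuming you requested comma-separated (CSV) output. The function name is ours, and which columns you get depends entirely on the quantities you selected, so treat the column positions as something to check against your own download.

```python
def parse_horizons_table(text: str) -> list[list[str]]:
    """Return the data rows found between the $$SOE and $$EOE markers."""
    rows = []
    in_table = False
    for line in text.splitlines():
        marker = line.strip()
        if marker == "$$SOE":
            in_table = True
        elif marker == "$$EOE":
            break
        elif in_table and marker:
            fields = [field.strip() for field in line.split(",")]
            if fields and fields[-1] == "":
                fields.pop()  # CSV rows usually end with a trailing comma
            rows.append(fields)
    return rows


# Placeholder text in the general shape of a Horizons export.
# The numbers are made up for illustration, not real positions.
sample = """header lines we do not need
$$SOE
 2000-Jan-01 00:00, 25.0, -1.2,
 2000-Feb-01 00:00, 27.5, -1.1,
$$EOE
footer lines we do not need
"""

for row in parse_horizons_table(sample):
    print(row)
```

Everything before $$SOE and after $$EOE is header and footer text. Keep the original download anyway, since it records the settings you used.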

US Naval Observatory

Available at https://aa.usno.navy.mil/. Moon phase tables, rising and setting times, and the MICA software (free download) for full astronomical almanac calculations. MICA is overkill for most of what we do, but the moon phase tables on the USNO site are the simplest way to get accurate phase dates and times for a given location and year.

Solar and sunspot data

SILSO (Royal Observatory of Belgium)

The authoritative source for sunspot numbers. Available at https://www.sidc.be/SILSO/datafiles.

Download SN_d_tot_V2.0. That's the daily total sunspot number file, going back to 1818. It's a space-separated text file. Column 5 is the number you want: daily total sunspot count. A value of -1 means no observation was made that day. The file also includes a definitive/provisional flag in the last column, so you can tell which recent data points may still be revised.
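To make the column layout concrete, here is a minimal sketch of reading that file. It assumes whitespace-separated fields in the order year, month, day, decimal year, sunspot number, standard deviation, number of observations, then the flag. The exact flag encoding is an assumption on our part: the code treats a trailing "*" or "0" as provisional and anything else as definitive, so check that against the file you actually download. The names SunspotDay and parse_silso_daily are ours.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class SunspotDay(NamedTuple):
    day: date
    sunspot_number: Optional[float]  # None when no observation was made
    definitive: bool


def parse_silso_daily(lines: Iterable[str]) -> list[SunspotDay]:
    """Parse lines of the daily total sunspot number file."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) < 7:
            continue  # skip blank or malformed lines
        value = float(parts[4])  # column 5: daily total sunspot number
        flag = parts[7] if len(parts) > 7 else ""
        records.append(SunspotDay(
            day=date(int(parts[0]), int(parts[1]), int(parts[2])),
            sunspot_number=None if value < 0 else value,  # -1 means missing
            definitive=flag not in ("*", "0"),
        ))
    return records


# Illustrative lines in the assumed layout (values are made up).
sample = [
    "1818 01 01 1818.001  -1  -1.0    0 1",
    "2020 06 15 2020.455  12   1.5   30 1",
    "2024 03 02 2024.169  95   9.0   25 0",
]
for record in parse_silso_daily(sample):
    print(record)
```

Turning -1 into an explicit missing value at this stage means it can never be averaged in as a real count later.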

Version 2.0 was released in 2015 to correct a calibration error that had been in the original dataset since the 1940s. If you're using any older sunspot dataset, throw it out and start again with V2.0. The correction is not trivial.

NOAA Space Weather Prediction Center

Monthly solar cycle progression chart at https://www.swpc.noaa.gov/products/solar-cycle-progression. Updated monthly. Shows where we are in the current cycle and the official consensus prediction from here. Solar Cycle 25 has tracked significantly above the original prediction. The consensus peak estimate has been revised upwards multiple times since the cycle started in December 2019. For long range forecasting, the phase of the cycle (ascending, near peak, descending, near minimum) matters more than the exact monthly count.

Historical weather data for Queensland

BoM climate data portal

Available at http://www.bom.gov.au/climate/data/. Select Daily Data, then choose your nearest station. For Dayboro we use station 040056. Some temperature records go back to the 1890s. Daily rainfall for Queensland has good coverage from about the 1900s onwards for the more populated areas.

Download, at minimum: daily minimum temperature, daily maximum temperature, daily rainfall. Monthly totals are useful for a quick correlation overview, but you'll want the daily data for the pattern matching work later. BoM lets you download station data in CSV format. Save the full file, not just the years you think you need.
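As a rough illustration of getting a station file into a usable shape, the sketch below reads a daily CSV and keys each value by a YYYY-MM-DD date. It assumes the file has separate Year, Month and Day columns and that the value column is named as shown. Those column names are our assumption, so check them against the header row of your own download. The sample rows are made up.

```python
import csv
import io
from datetime import date
from typing import Optional


def read_bom_daily(csv_text: str, value_column: str) -> dict[str, Optional[float]]:
    """Map ISO dates (YYYY-MM-DD) to daily values; blank cells become None."""
    series = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date(int(row["Year"]), int(row["Month"]), int(row["Day"]))
        raw = (row.get(value_column) or "").strip()
        series[day.isoformat()] = float(raw) if raw else None
    return series


# Made-up rows in the assumed layout; not real observations.
sample = """Year,Month,Day,Rainfall amount (millimetres)
2011,1,9,12.4
2011,1,10,
2011,1,11,3.0
"""

rainfall = read_bom_daily(sample, "Rainfall amount (millimetres)")
for iso_day, value in rainfall.items():
    print(iso_day, value)
```

Blank cells are kept as explicit missing values rather than zeros. A day with no reading is not the same as a dry day.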

ACORN-SAT

The homogenised long term temperature dataset that BoM maintains for climate analysis. The nearest ACORN-SAT station to Dayboro is Caboolture (station 040214). ACORN-SAT adjusts for non-climatic breaks in the record caused by station moves, instrument changes, and time of observation shifts. For temperature trend analysis it's more reliable than the raw station data. For rainfall, ACORN-SAT doesn't help much since rainfall isn't homogenised the same way.

Historical severe weather events

The BoM Queensland Office publishes records of significant events. Flood years (1893, 1974, 2011, 2022), major droughts (1902, 1982, 2019), and heatwave years are the anchor points for pattern matching. When you look at which analogue years match your current cycle configuration, you want to flag which ones also produced the strongest climate anomalies. Those are the years that do the most work in the model.
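Flagging anchor years is easy to do by eye, but a small lookup keeps it consistent. This sketch uses the flood and drought years listed above. The candidate list is hypothetical, standing in for whatever analogue years your own matching produces later, and the function name is ours.

```python
from typing import Optional

FLOOD_YEARS = {1893, 1974, 2011, 2022}
DROUGHT_YEARS = {1902, 1982, 2019}


def flag_anchor_years(analogue_years: list[int]) -> dict[int, Optional[str]]:
    """Label each candidate year as 'flood', 'drought', or None."""
    flags = {}
    for year in analogue_years:
        if year in FLOOD_YEARS:
            flags[year] = "flood"
        elif year in DROUGHT_YEARS:
            flags[year] = "drought"
        else:
            flags[year] = None  # no major event recorded in our lists
    return flags


# Hypothetical candidates, for illustration only.
print(flag_anchor_years([1960, 1974, 1982, 2011]))
```

Extend the two sets with heatwave years or other events as you compile them.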

Organising your data

The setup that works for us: a spreadsheet with separate tabs for sunspot data, planetary positions, historical weather, and the forecast model itself. Each tab has a clear header row and a consistent date format. We use YYYY-MM-DD everywhere. Don't mix date formats. You'll regret it at row 15,000.

You need a continuous time series. Gaps in the weather data are a problem. A missing week of rainfall during a flood event can make that year look drier than it was. Either note the gap explicitly, interpolate cautiously (and flag the interpolated values), or exclude that year from the analogue set. Don't silently carry the gap into your model.
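Finding the gaps is the first half of the job. A minimal sketch, assuming your daily series is held as a mapping from YYYY-MM-DD dates to values, with None (or an absent date) meaning no reading. The function name and sample values are ours.

```python
from datetime import date, timedelta
from typing import Mapping, Optional


def find_gaps(observations: Mapping[str, Optional[float]]) -> list[tuple[str, str]]:
    """Return (first_missing_day, last_missing_day) for each run of missing days."""
    present = sorted(
        date.fromisoformat(day)
        for day, value in observations.items()
        if value is not None
    )
    gaps = []
    one_day = timedelta(days=1)
    for earlier, later in zip(present, present[1:]):
        if (later - earlier).days > 1:
            gaps.append(((earlier + one_day).isoformat(),
                         (later - one_day).isoformat()))
    return gaps


# Made-up values for illustration.
rainfall = {
    "2011-01-08": 12.0,
    "2011-01-09": None,   # reading missing
    "2011-01-10": None,   # reading missing
    "2011-01-11": 3.2,
    "2011-01-12": 0.0,
    "2011-01-14": 1.0,    # 13 January absent from the file altogether
}
print(find_gaps(rainfall))
# [('2011-01-09', '2011-01-10'), ('2011-01-13', '2011-01-13')]
```

This only reports gaps that fall between two valid readings. What you do with each one (note it, interpolate and flag, or drop the year) is still a judgment call.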

Convert all event times from UTC to local time for your location (AEST, which is UTC+10, for Queensland). This matters for moon phases and solar events that fall near midnight. A Full Moon that occurs at 23:45 UTC falls on the next calendar date in AEST. Get it right before you build the chart in Step 5.
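The midnight problem is easiest to see with an actual conversion. This sketch uses a fixed UTC+10 offset for AEST, which is a reasonable assumption for Queensland because it does not observe daylight saving. The event time is hypothetical, chosen only to show the date rollover, and the function name is ours.

```python
from datetime import datetime, timedelta, timezone

AEST = timezone(timedelta(hours=10), "AEST")  # fixed offset, no daylight saving


def utc_to_aest(event_utc: datetime) -> datetime:
    """Convert a UTC event time to AEST; naive datetimes are treated as UTC."""
    if event_utc.tzinfo is None:
        event_utc = event_utc.replace(tzinfo=timezone.utc)
    return event_utc.astimezone(AEST)


# Hypothetical event at 23:45 UTC (not a real Full Moon time).
event = datetime(2024, 1, 25, 23, 45, tzinfo=timezone.utc)
local = utc_to_aest(event)
print(event.date().isoformat(), "UTC ->", local.strftime("%Y-%m-%d %H:%M"), "AEST")
# 2024-01-25 UTC -> 2024-01-26 09:45 AEST
```

If you forecast for a location outside Queensland, a fixed offset may not be enough, and you would need a proper time zone database instead.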

Planetary positions can be computed at weekly or fortnightly intervals since the outer planets move slowly (Jupiter covers about 30 degrees per year, Saturn about 12). Daily sunspot numbers are worth keeping daily for the early processing, but you'll aggregate them into monthly means for most of the model work.
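Aggregating daily sunspot numbers into monthly means is a plain average, as long as the missing days are left out first. A minimal sketch, assuming the daily data is a list of (YYYY-MM-DD, value) pairs in which a missing day is either None or the -1 used in the SILSO file. The function name and sample values are ours.

```python
from collections import defaultdict
from typing import Iterable, Optional


def monthly_means(daily: Iterable[tuple[str, Optional[float]]]) -> dict[str, float]:
    """Average daily values per YYYY-MM month, ignoring missing days."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum, count]
    for iso_day, value in daily:
        if value is None or value < 0:
            continue  # skip days with no observation
        month = iso_day[:7]  # 'YYYY-MM' prefix of the ISO date
        totals[month][0] += value
        totals[month][1] += 1
    return {month: total / count
            for month, (total, count) in sorted(totals.items())}


# Made-up values for illustration.
daily = [
    ("2020-01-01", 10),
    ("2020-01-02", -1),    # no observation
    ("2020-01-03", 20),
    ("2020-02-01", None),  # no observation
    ("2020-02-02", 6),
]
print(monthly_means(daily))  # {'2020-01': 15.0, '2020-02': 6.0}
```

A month with no valid days simply does not appear in the result, which is safer than reporting it as zero.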

What we don't share

Our raw data files are not public. This is not about hiding the method. Every data source we use is listed above and is freely available. We don't publish our assembled datasets because the conversation then shifts to whether our numbers are correct rather than whether the method works. Build your own dataset from the sources in this step. You'll reach the same starting position we did, and you can verify every number independently.

Step 3 covers the actual download process in detail: specific files, exact column formats, and what to check before you trust the data.
