2026.06 / research method

The RailSide Ratings Evidence Ladder

RailSide Ratings is not meant to be another “today’s nap” site dressed up with charts. The useful version of the product is a research workbench: it should show what the model thinks, what evidence supports it, and where the evidence is still too weak to trust.

The central idea is an evidence ladder. A racing angle should not jump straight from “looked good in a backtest” to “use this with real money”. It should move through stricter layers of proof, and it should be blocked or labelled research-only when it fails.

Nothing here is betting advice or a promise of profit. The whole point of the ladder is to slow decisions down until the evidence is better.

Level 1 — real racecards, no mock fallback

The lowest rung is still important: the app has to be using real racecards. If a scraper breaks, an API is unavailable, or a dependency fails, RailSide Ratings should not quietly invent runners just to keep the dashboard looking full.

Fake fallback data is dangerous because it makes a broken app look healthy. The safer rule is simple: real data or a visible failure.
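
As a minimal sketch of that rule, the loader below either returns real runners or raises a visible error. Everything named here (RacecardUnavailable, Runner, load_racecard, broken_scraper) is hypothetical and made up for illustration; it is not the RailSide Ratings codebase.

```python
from dataclasses import dataclass
from typing import Callable


class RacecardUnavailable(Exception):
    """Raised when no real racecard could be loaded."""


@dataclass(frozen=True)
class Runner:
    name: str
    race_id: str


def load_racecard(fetch: Callable[[], list[Runner]]) -> list[Runner]:
    """Return real runners, or raise. There is deliberately no mock fallback."""
    try:
        runners = fetch()
    except Exception as exc:
        # Surface the failure instead of hiding it behind invented runners.
        raise RacecardUnavailable(f"racecard source failed: {exc}") from exc
    if not runners:
        # An empty card is treated as a failure too, not padded with samples.
        raise RacecardUnavailable("racecard source returned no runners")
    return runners


def broken_scraper() -> list[Runner]:
    # Stand-in for a scraper or API call that has stopped working.
    raise ConnectionError("upstream timed out")


try:
    load_racecard(broken_scraper)
    status = "ok"
except RacecardUnavailable as err:
    # The dashboard shows this message rather than a full-looking table.
    status = f"DATA UNAVAILABLE: {err}"
```

The design choice is that the only two outcomes are a real list of runners or an exception the interface has to display.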

Level 2 — transparent scoring

Once the app has real race data, it can score runners. But scoring is not the same as evidence. A rating is only a structured opinion until it has been tested.

This is why the scoring layer needs to stay explainable. Users should be able to see the ingredients: recent form, class context, race profile, distance suitability, market clues where available, and warning labels when information is missing.

Racecard
  ↓
Runner factors
  ↓
Transparent score
  ↓
Confidence and risk labels
  ↓
Research status, not blind certainty
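
One way to keep a score explainable is to return the ingredients alongside the number. The sketch below assumes each factor is already scaled to 0..1 and uses made-up weights. The factor names come from the list above, but WEIGHTS, ScoreCard and score_runner are illustrative assumptions, not the real scoring model.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "recent_form": 0.4,
    "class_context": 0.3,
    "distance_suitability": 0.3,
}


@dataclass
class ScoreCard:
    score: float
    contributions: dict[str, float] = field(default_factory=dict)
    warnings: list[str] = field(default_factory=list)


def score_runner(factors: dict[str, Optional[float]]) -> ScoreCard:
    """Score a runner and keep every ingredient and every gap visible."""
    contributions: dict[str, float] = {}
    warnings: list[str] = []
    for name, weight in WEIGHTS.items():
        value = factors.get(name)
        if value is None:
            # Missing information is labelled, not silently guessed.
            warnings.append(f"missing: {name}")
            continue
        contributions[name] = weight * value
    return ScoreCard(sum(contributions.values()), contributions, warnings)


card = score_runner(
    {"recent_form": 0.5, "class_context": 1.0, "distance_suitability": None}
)
```

In this sketch a missing factor adds nothing to the score and produces a warning, so the user can see both why the number is what it is and what it was computed without.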

Level 3 — historical smoke tests

Historical tests are useful for killing bad ideas early. If a rule cannot survive a broad historical smoke test, it probably does not deserve live attention.

But this layer can also flatter weak ideas. Starting-price or proxy-only results can look tidy while hiding whether a user could actually have got a sensible price. So historical smoke tests are a filter, not a finish line.
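
A smoke test in this sense can be very small. The sketch below computes a level-stakes return at a proxy price and returns a label that never goes higher than research-only. The min_bets default of 200 and all the names are assumptions for illustration, not figures from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalBet:
    decimal_price: float  # proxy price, for example starting price
    won: bool


def level_stakes_roi(bets: list[HistoricalBet]) -> float:
    """Profit per unit staked, assuming one unit on every selection."""
    if not bets:
        raise ValueError("no bets to test")
    profit = sum((b.decimal_price - 1.0) if b.won else -1.0 for b in bets)
    return profit / len(bets)


def smoke_test(bets: list[HistoricalBet], min_bets: int = 200) -> str:
    """Filter only: the best outcome is still research-only."""
    if len(bets) < min_bets:
        return "insufficient sample"
    if level_stakes_roi(bets) <= 0:
        return "rejected"
    return "survived smoke test (research-only)"
```

Note that no return value means "validated". Passing only earns the idea a place on the next rung.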

Level 4 — official and executable-price checks

The next rung is stricter: compare the idea against price sources that are closer to what the market actually offered. That means separating source types rather than blending them into one flattering headline.

Proxy results are useful for early shape-checking.
Official historical prices are a stronger test of whether the angle survives real settlement assumptions.
Exchange-style or delayed market captures help show whether the signal still makes sense around executable prices.

If an angle only works against proxy prices, the weakest of these source types, it stays weak. The app should say that plainly.
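
Keeping the source types apart can be sketched as a per-source summary plus a plain label. The source names ("proxy", "official", "exchange") and the label wording below are assumptions that mirror the three source types described above. Profits are per unit staked.

```python
from collections import defaultdict
from typing import Iterable

# Illustrative source names, ordered from weakest to strongest evidence.
SOURCES = ("proxy", "official", "exchange")


def roi_by_source(results: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Average profit per unit stake, reported separately for each source."""
    profits: dict[str, list[float]] = defaultdict(list)
    for source, profit in results:
        if source not in SOURCES:
            raise ValueError(f"unknown price source: {source}")
        profits[source].append(profit)
    return {source: sum(p) / len(p) for source, p in profits.items()}


def evidence_label(rois: dict[str, float]) -> str:
    """Label the angle by the strongest source it survives, never a blend."""
    stronger = [rois[s] for s in ("official", "exchange") if s in rois]
    if stronger and all(r > 0 for r in stronger):
        return "survives stronger sources"
    if rois.get("proxy", 0.0) > 0:
        return "proxy-only: weak"
    return "no support"


sample = [
    ("proxy", 1.5),
    ("proxy", -1.0),
    ("official", -1.0),
    ("official", 0.2),
]
label = evidence_label(roi_by_source(sample))
```

There is deliberately no combined figure in this sketch. A blended headline is exactly what would let a strong proxy result hide a failing official one.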

Level 5 — forward paper tracking

Forward paper tracking is the product guardrail. It tests whether a signal can be generated live, captured consistently, priced realistically, and settled without rewriting the rules after the result.

This is deliberately slower than publishing a daily tip list. It means accepting that most signals will stay in a boring status like “collecting evidence” for a long time. That is a feature, not a flaw.
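
The "no rewriting after the result" part can be enforced in the data structure itself. This is a sketch under assumed names (PaperSignal, capture, settle, rule_version): the signal is frozen at capture time, has to be captured before the off, and can be settled exactly once.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PaperSignal:
    runner: str
    rule_version: str       # which version of the rules produced the signal
    captured_at: datetime
    race_off: datetime
    captured_price: float   # price seen at capture time, never edited later
    result: Optional[str] = None  # "won" or "lost" once settled


def capture(runner: str, rule_version: str, captured_at: datetime,
            race_off: datetime, price: float) -> PaperSignal:
    if captured_at >= race_off:
        raise ValueError("signal must be captured before the off")
    return PaperSignal(runner, rule_version, captured_at, race_off, price)


def settle(signal: PaperSignal, result: str) -> PaperSignal:
    """Return a settled copy; the captured fields are carried over unchanged."""
    if signal.result is not None:
        raise ValueError("already settled")
    if result not in ("won", "lost"):
        raise ValueError("result must be 'won' or 'lost'")
    return replace(signal, result=result)


off = datetime(2026, 6, 1, 14, 30, tzinfo=timezone.utc)
seen = datetime(2026, 6, 1, 14, 0, tzinfo=timezone.utc)
signal = capture("Example Runner", "v1", seen, off, 5.0)
settled = settle(signal, "lost")
```

Because the dataclass is frozen, any attempt to adjust the price or the rule version after the race raises an error instead of quietly improving the record.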

Level 6 — promotion guardrails

The final rung is promotion, and most ideas should never reach it. A candidate needs enough settled signals, tolerable drawdown, positive forward evidence, and performance that does not depend on one fragile price source.

For now, the useful default is caution. Candidate slices can be tracked, compared, and reported, but they should not be treated as validated until sample size and executable-price evidence justify it.

Interesting backtest
  ↓
Source-by-source validation
  ↓
Forward paper capture
  ↓
Separate settlement by mode
  ↓
Sample-size and drawdown checks
  ↓
Only then consider promotion
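
The promotion checks can be written as a function that lists reasons to block rather than one that returns a yes. The thresholds below (MIN_SETTLED, MAX_DRAWDOWN_UNITS, the two-source requirement) are placeholders invented for this sketch. The real values would have to come from the evidence, not from a blog post.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
MIN_SETTLED = 300
MAX_DRAWDOWN_UNITS = 30.0
MIN_POSITIVE_SOURCES = 2


@dataclass(frozen=True)
class CandidateStats:
    settled_signals: int
    max_drawdown_units: float
    forward_roi: float
    roi_by_source: dict[str, float]


def promotion_blockers(stats: CandidateStats) -> list[str]:
    """Every reason this candidate should stay research-only."""
    blockers: list[str] = []
    if stats.settled_signals < MIN_SETTLED:
        blockers.append("not enough settled signals")
    if stats.max_drawdown_units > MAX_DRAWDOWN_UNITS:
        blockers.append("drawdown too deep")
    if stats.forward_roi <= 0:
        blockers.append("no positive forward evidence")
    positive = [s for s, roi in stats.roi_by_source.items() if roi > 0]
    if len(positive) < MIN_POSITIVE_SOURCES:
        blockers.append("depends on a single price source")
    return blockers


def promotion_status(stats: CandidateStats) -> str:
    # The default answer is caution; promotion needs an empty blocker list.
    return "research-only" if promotion_blockers(stats) else "eligible for review"


candidate = CandidateStats(
    settled_signals=120,
    max_drawdown_units=12.0,
    forward_roi=0.04,
    roi_by_source={"proxy": 0.08, "official": -0.01},
)
reasons = promotion_blockers(candidate)
```

Even the best outcome here is "eligible for review", not "promoted", which keeps a human decision between the numbers and any public claim.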

Why this matters for the product

The evidence ladder changes what RailSide Ratings is trying to be. It is not just a list of picks. It is a system for making claims harder to publish and easier to challenge.

That is the product direction I trust more: fewer bold claims, clearer labels, more separation between research modes, and public writing that explains what is still unproven. In racing, a tool that blocks weak ideas is more valuable than a tool that always finds something to say.

Live project: railsideratings.co.uk.