import Pkg;

Multisite rainfall HMM based SWG (paper)

This tutorial describes the numerical applications described in the paper Interpretable Seasonal Multisite Hidden Markov Model for stochastic rain generation in France by Emmanuel Gobet (CMAP - École Polytechnique), David Métivier (MISTEA – INRAE) and Sylvie Parey (R&D – EDF). It shows a fully reproducible example on how to use the package StochasticWeatherGenerators.jl to reproduce, step-by-step, exactly (almost) all the figures of the paper.

The paper describes the construction of a Stochastic Weather Generator with an Autoregressive Seasonal Hidden Markov Model (SHMM). The SHMM is trained with French weather stations, and the hidden states are interpreted as weather regimes. The model is validated with simulations, especially for its ability to reproduce extreme weather, e.g. droughts. In the paper, the model is also used with Climate Change RCP scenarios (not shown here).

Schematic of the Autoregressive Seasonal Hidden Markov Model

Set up

Package and functions

For Julia new user

There are several ways to add a package before using, one way is for this tutorial to copy-paste (it might take a while):

import Pkg
Pkg.add(["CSV", "JLD", "DelimitedFiles", "DataFrames", "DataFramesMeta", "StatsBase", "Random", "Distributions", "StatsPlots", "LaTeXStrings"])

using CSV, JLD, DelimitedFiles # File Read/Load/Save

using DataFrames, DataFramesMeta # DataFrames

using Dates

using StatsBase, Random

using Distributions

The main package is StochasticWeatherGenerators.jl, which provides the interface to the models and data. SmoothPeriodicStatsModels.jl contains the smooth periodic models (fit and sampling methods) used in this tutorial, namely the Seasonal Hidden Markov Model (SHMM), and the seasonal mixture models for rainfall amounts.

using StochasticWeatherGenerators # interface to use with SmoothPeriodicStatsModels.jl

using SmoothPeriodicStatsModels # Name might change. Small collection of smooth periodic models e.g. AR, HMM

Random.seed!(1234)

Random.TaskLocalRNG()

Settings for plotting

Some settings and packages to have nice plots.

using StatsPlots, LaTeXStrings
using StatsPlots.PlotMeasures # To play with margin in Plots

gr() # plotly() # for interactive plots
default(fontfamily="Computer Modern")
cur_colors = get_color_palette(:auto, 100);
my_palette(K) = palette(vcat(cur_colors[1], [cur_colors[c] for c in 3:4], cur_colors[2]), K)

file_for_plot_utilities = download("https://raw.githubusercontent.com/dmetivie/StochasticWeatherGenerators.jl/master/examples/utilities_plot.jl")
include(file_for_plot_utilities)

cyclic (generic function with 1 method)

To plot maps, we use GeoMakie.jl + NaturalEarth.jl. Note that using cartopy with PyCall.jl also works very well.

For the following code to work you will need to add the following packages

import Pkg
Pkg.add("NaturalEarth", "GeoMakie", "CairoMakie")

file_for_maps_with_geomakie = download("https://raw.githubusercontent.com/dmetivie/StochasticWeatherGenerators.jl/master/examples/utilities_geo_makie_features.jl") # download file from a GitHub repo
include(file_for_maps_with_geomakie)

savefigcrop (generic function with 4 methods)

Global Parameters

The period of the Seasonal HMM is chosen as the number of days in a year. Our choice is 366 however we carefully skip February 29 when needed.

T = 366

Define the French area for map (Longitude and latitude) plot and the precision of the map precision_scale

precision_scale = 50 # meter

LON_min = -5 # West

LON_max = 10 # East

LAT_min = 41 # South

LAT_max = 52 # North

conversion_factor for rainfall amounts RR in 0.1 mm to mm

conversion_factor = 0.1 # 0.1 mm -> mm

0.1

HMM Hyperparameters

Number of hidden states

K = 4

my_pal = my_palette(K); # just colors I like for plotting weather regime!

Degree 𝐃𝐞𝐠 of the trigonometric expansion It could be an array different for each station and variable. Not implemented yet though.

𝐃𝐞𝐠 = 1

Local memory order i.e. at station $j$, $\mathbb{P}(Y_n^{(j)} = y_n^{(j)} \mid Z = k, Y_{n-1:n-\texttt{local memory}}^{(j)} = y_{n-1:n-\texttt{local memory}}^{(j)})$

local_order = 1

Note

The local_order and/or 𝐃𝐞𝐠 could be a vector/matrix of size D and different for each station, and also different depending on wet or dry past. Not yet implemented.

size_order = 2^local_order

println("K = $K, ", "degree = $𝐃𝐞𝐠, ", "local_order = $local_order")

K = 4, degree = 1, local_order = 1

Data

Select relevant stations from the `station.txt` file

Here we

Remove white space at the right of the CN, STANAME which is caused by imperfect CVS importation
Select only the stations with 100% valid data for the period Date(1955,12,31) .≤ :DATE .≤ Date(2019,12,31)
Shorten station names

begin
    station_file = Base.download("https://raw.githubusercontent.com/dmetivie/StochasticWeatherGenerators.jl/master/weather_files/stations.txt")
    station_all = CSV.read(station_file, DataFrame, header=18, normalizenames=true, ignoreemptyrows=true)
    station_all = @chain station_all begin
        @transform(:CN = rstrip.(:CN), :STANAME = rstrip.(:STANAME))
        # @subset(:CN .∈ tuple(["FR", "BE", "LU", "CH"])) # Choose that if you want to look at all stations in France, Belgium, Luxembourg and Switzerland.
        @subset(:STAID .∈ tuple([32, 33, 34, 36, 39, 203, 322, 323, 434, 736, 737, 738, 740, 742, 745, 749, 750, 755, 756, 757, 758, 786, 793, 2192, 2203, 2205, 2207, 2209, 11244, 11245, 11247, 11249]))
        @transform(:STANAME = shortname.(:STANAME))
    end
end

selected_station_name = ["BOURGES", "TOULOUSE", "MARIGNANE", "LUXEMBOURG", "LILLE", "EMBRUN", "BASTIA", "LA HAGUE", "CHASSIRON", "ORLY"]

10-element Vector{String}:
 "BOURGES"
 "TOULOUSE"
 "MARIGNANE"
 "LUXEMBOURG"
 "LILLE"
 "EMBRUN"
 "BASTIA"
 "LA HAGUE"
 "CHASSIRON"
 "ORLY"

Hypothesis: Conditional Independence of Rain Occurrences

You can change the selected stations. However, keep in mind that for the model to work, the conditional independence hypothesis must hold between stations i.e. $\mathbb{P}(Y_1 = y_1, \cdots, Y_S = y_s\mid Z = k) = \prod_{s=1}^S \mathbb{P}(Y_s = y_s)$. Hence stations must be sufficiently far apart. Check out this MNIST example to see Bernoulli mixtures in action!

station = @subset(station_all, :STANAME .∈ tuple(selected_station_name))

STAID = station.STAID #[32, 33, 39, 203, 737, 755, 758, 793, 11244, 11249];

station_name = station.STANAME

10-element Vector{SubString{String}}:
 "BOURGES"
 "TOULOUSE"
 "MARIGNANE"
 "LUXEMBOURG"
 "LILLE"
 "EMBRUN"
 "BASTIA"
 "LA HAGUE"
 "CHASSIRON"
 "ORLY"

Sort stations (index) by latitude. It is useful for plotting from North to South.

staid_lat = sortperm(station.LAT, rev=true);

Station number

D = length(STAID)

Date range

date_start = Date(1956)

1956-01-01

Date including the previous days used in the initial condition (in case local_memory > 0)

date_start_w_memory = date_start - Day(local_order)

date_end = Date(2020) - Day(1)

every_year = date_start:Day(1):date_end

every_year_w_memory = date_start_w_memory:Day(1):date_end

n2t = dayofyear_Leap.(every_year)

N = length(n2t)

Treat data

Load into a DataFrame the (ECA) RR files (rain). It filters by date and valid data. It also adds a column :RO for rain occurrences (0: dry, 1: wet).

begin
    data_stations = collect_data_ECA.(STAID, date_start_w_memory, date_end, "https://raw.githubusercontent.com/dmetivie/StochasticWeatherGenerators.jl/master/weather_files/ECA_blend_rr/RR_", portion_valid_data=1, skipto=22, header=21, url=true)
    for i = eachindex(data_stations)
        @transform!(data_stations[i], :RO = onefy.(:RR))
    end
end

Binary matrix version of the rain event at the D stations.

RR = reduce(hcat, [data_stations[j].RR[1+local_order:end] for j = 1:D]) * 0.1

Yall = BitMatrix(reduce(hcat, [data_stations[j].RO for j = 1:D]))

Y_past = BitMatrix(Yall[1:local_order, :]) # rand(Bool, local_order, D)

ξ = [1; zeros(K - 1)];  # 1 jan 1956 was most likely a type Z = 1 wet day all over France

Y = Yall[1+local_order:end, :]

23376×10 BitMatrix:
 1  1  1  1  1  1  1  1  1  1
 1  1  0  1  1  0  1  0  1  0
 0  0  0  0  0  0  1  0  0  0
 0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0
 0  1  1  0  0  0  0  0  0  1
 1  1  1  0  1  0  1  1  1  1
 1  1  1  1  1  1  0  0  1  1
 0  1  1  1  1  1  1  1  1  0
 1  0  0  1  1  1  0  1  1  1
 ⋮              ⋮           
 1  1  0  1  1  1  0  1  1  1
 1  0  0  1  1  0  0  0  0  0
 0  0  0  1  0  0  0  1  1  0
 1  0  0  1  1  0  0  1  0  1
 1  0  0  1  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  1  0
 0  0  0  0  1  0  0  0  1  0
 0  0  0  0  0  0  0  0  1  0
 0  0  0  0  0  0  0  0  1  0

Map of stations

Convert LAT/LON coordinates from DMS to DD which seems most widely accepted format.

LAT_idx = dms_to_dd.(station.LAT)

LON_idx = dms_to_dd.(station.LON)

long_spell = [longuest_spell(y) for y in eachcol(Y)]

FR_map_spell = map_with_stations(LON_idx, LAT_idx, long_spell; station_name=station_name, show_value=true, colorbar_show=true, precision_scale = precision_scale, colorbar_label = "Days")

Fit the seasonal HMM

Hypothesis: Smooth parameter evolution

We assume all models e.g. HMM, rain mixture to have parameters evolving smoothly with periodicity $T$ for $t \in [1, T]$. For example a Bernoulli parameter will write

\[p(t) = \dfrac{1}{1 + e^{P(t)}} \in [0, 1],\]

with

\[ P_c(t) = c_0 + \sum_{j=1}^{\texttt{Deg}} \left(c_{2j-1}\cos\left(\dfrac{2\pi}{T}j t\right) + c_{2j}\sin\left(\dfrac{2\pi}{T}j t\right)\right).\]

Fit slice: naive estimation

Note

Before inferring the HMM parameters with the EM (Baum-Welch) algorithm, we do a first naive inference that will be used as initial condition for the EM.

The reference station ref_station is used to sort the hidden states obtained via the slide initialization Here we choose j=1 $\to$ STAID=32 $\to$ BOURGES because it is a central station for France

ref_station = 1

This generates a random Periodic HMM that we then fit slice by slice (day by day). See paper.

hmm_random = randARPeriodicHMM(K, T, D, local_order; ξ=ξ, ref_station=ref_station);

@time "FitMLE SHMM (Slice)" hmm_slice = fit_mle_all_slices(hmm_random, Y, Y_past; n2t=n2t, robust=true, rand_ini=true, Dirichlet_α=0.8, history=false, n_random_ini=1, Yₜ_extanted=[-12, -7, 0, 6, 13]);

θᴬ_slice, θᴮ_slice = fit_θ!(hmm_slice, 𝐃𝐞𝐠);

FitMLE SHMM (Slice): 26.948115 seconds (344.14 M allocations: 29.835 GiB, 14.23% gc time, 15.66% compilation time)

Fit with Baum Welch using the slice estimate as a starting point

With the Slice estimate as a good starting point for the full (seasonal) Baum Welch EM algorithm we fit the model!

Tip

To accelerate the fitting procedure (especially for larger models or when testing various model hyperparameters), one can do

using Distributed
addprocs(10) # number of worker to add
@everywhere using SmoothPeriodicStatsModels # load the pkg on each worker

Then the fitting loop inside fit_mle will be distributed. See the official Julia doc for more info. On smaller models it does not worth it since adding workers add some compilation and communication time.

@time "FitMLE SHMM (Baum Welch)" hmm_fit, θq_fit, θy_fit, hist, histo_A, histo_B = fit_mle(hmm_slice, θᴬ_slice, θᴮ_slice, Y, Y_past, maxiter=10000, robust=true; display=:iter, silence=true, tol=1e-3, θ_iters=true, n2t=n2t);

Iteration 0: logtot = -122032.50888854622
Iteration 1: logtot = -119978.989709, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.5859 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.75979
Iteration 2: logtot = -118851.217736, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.32943 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.65147
Iteration 3: logtot = -117984.404189, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.27123 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.76473
Iteration 4: logtot = -117530.910729, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.23744 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.68865
Iteration 5: logtot = -117334.429045, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.1803 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.49175
Iteration 6: logtot = -117243.473606, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.12904 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.31446
Iteration 7: logtot = -117196.530696, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.11048 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.19502
Iteration 8: logtot = -117170.63798, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.09375 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.11894
Iteration 9: logtot = -117155.713125, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.07686 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.07913
Iteration 10: logtot = -117146.773266, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.06192 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.05726
Iteration 11: logtot = -117141.209453, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.04946 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.04106
Iteration 12: logtot = -117137.610304, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.03938 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.02995
Iteration 13: logtot = -117135.19217, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.03137 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.02801
Iteration 14: logtot = -117133.508369, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.02507 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.02579
Iteration 15: logtot = -117132.297044, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.02012 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.02346
Iteration 16: logtot = -117131.400073, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.01624 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.02115
Iteration 17: logtot = -117130.719034, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.0132 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01895
Iteration 18: logtot = -117130.190784, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.01147 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01691
Iteration 19: logtot = -117129.773603, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.01014 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01505
Iteration 20: logtot = -117129.439134, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00893 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01338
Iteration 21: logtot = -117129.167591, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00784 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01191
Iteration 22: logtot = -117128.944818, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00687 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.01061
Iteration 23: logtot = -117128.760462, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00602 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00948
Iteration 24: logtot = -117128.606792, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00527 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00849
Iteration 25: logtot = -117128.477925, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00461 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00763
Iteration 26: logtot = -117128.369315, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00404 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00688
Iteration 27: logtot = -117128.277391, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00355 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00623
Iteration 28: logtot = -117128.199315, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00312 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00565
Iteration 29: logtot = -117128.132804, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00275 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00515
Iteration 30: logtot = -117128.076002, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00243 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.0047
Iteration 31: logtot = -117128.027389, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00216 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00431
Iteration 32: logtot = -117127.985708, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.002 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00396
Iteration 33: logtot = -117127.949915, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00185 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00364
Iteration 34: logtot = -117127.919136, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00172 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00336
Iteration 35: logtot = -117127.89264, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.0016 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.0031
Iteration 36: logtot = -117127.869805, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00148 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00287
Iteration 37: logtot = -117127.85011, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00138 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00265
Iteration 38: logtot = -117127.833108, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00128 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00246
Iteration 39: logtot = -117127.81842, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00119 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00228
Iteration 40: logtot = -117127.805725, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00111 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00212
Iteration 41: logtot = -117127.794745, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00103 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00196
Iteration 42: logtot = -117127.785244, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00096 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00183
Iteration 43: logtot = -117127.777018, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.0009 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.0017
Iteration 44: logtot = -117127.769893, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00084 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00158
Iteration 45: logtot = -117127.76372, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00078 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00147
Iteration 46: logtot = -117127.758369, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00073 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00136
Iteration 47: logtot = -117127.753729, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00068 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00127
Iteration 48: logtot = -117127.749705, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00063 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00118
Iteration 49: logtot = -117127.746213, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00059 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.0011
Iteration 50: logtot = -117127.743182, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00055 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00102
Iteration 51: logtot = -117127.740551, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00051 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00095
Iteration 52: logtot = -117127.738266, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00048 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00088
Iteration 53: logtot = -117127.736282, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00044 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00082
Iteration 54: logtot = -117127.734558, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00041 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00076
Iteration 55: logtot = -117127.733059, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00039 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00071
Iteration 56: logtot = -117127.731757, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00036 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00066
Iteration 57: logtot = -117127.730625, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00034 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00061
Iteration 58: logtot = -117127.72964, max(|θᴬᵢ-θᴬᵢ₋₁|) = 0.00031 & max(|θᴮᵢ-θᴮᵢ₋₁|) = 0.00057
EM converged in 58 iterations, logtot = -117127.72964034643
FitMLE SHMM (Baum Welch): 32.665494 seconds (162.53 M allocations: 16.584 GiB, 4.97% gc time, 12.63% compilation time)

Run the following code to load a saved hmm

hmm_infos = load(joinpath(save_tuto_path,"hmm_fit_K_$(K)_d_$(𝐃𝐞𝐠)_m_$(local_order).jld"))
hmm_fit = hmm_infos["hmm"]
hist = hmm_infos["hist"]
θq_fit = hmm_infos["Q_param"]
θy_fit = hmm_infos["Y_param"]

Visualization of the HMM parameters

Transition matrix

begin
    pA = [plot(legendfont=14, foreground_color_legend=nothing, background_color_legend=nothing, legend_columns=4, tickfont=12, legendfontsize=16) for k in 1:K]
    for k in 1:K
        [plot!(pA[k], hmm_fit.A[k, l, :], c=my_color(l, K), label=L"Q_{%$(k)\to %$(l)}", legend=:top, lw=1.75) for l in 1:K]
        hline!(pA[k], [0.5], c=:black, label=:none, s=:dot)
        xticks!(pA[k], vcat(dayofyear_Leap.(Date.(2000, 1:12)), 366), vcat(string.(monthabbr.(1:12)), ""), xlims=(0, 367), ylims=(0, 1))
    end
    pallA = plot(pA..., size=(1000, 500))
end

Rain probabilities

begin
    mm = 1
    jt = D
    pB = [plot(legendfont=14, title="$(station_name[j])", titlefontsize=17, tickfont=14, legendfontsize = 16) for j in 1:jt]
    for j in 1:jt
        [plot!(pB[j], succprob.(hmm_fit.B[k, :, j, mm]), c=my_color(k, K), label=islabel(j, 3, L"\mathbb{P}(Y = \textrm{wet}\mid Z = %$k, H = \textrm{dry})"), lw=2) for k in 1:K]
        hline!(pB[j], [0.5], c=:black, label=:none, s=:dot)
        xticks!(
            pB[j],
            vcat(dayofyear_Leap.(Date.(2000, 1:12)), 366),
            vcat(string.(first.(monthabbr.(1:12))))
        )
        xlims!(pB[j], (0, 367))
        ylims!(pB[j], (0, 1))
    end
    pallB = plot(pB[staid_lat]..., size=(3000 / 2, 1000 / 1), layout=(2, 5))
end

Spatial Rain probability

memory_past_cat = 1

h = 1 (day before dry) or 2 (day before wet) $\mathbb{P}(Y = \text{Rain}\mid Z = k, H = h)$ with h = memory_past_cat

For now there are some scale rendering issues due to an GeoMakie.jl issue so it might be tiny.

p_FR_map_mean_prob = map_with_stations(LON_idx, LAT_idx, [[mean(succprob.(hmm_fit.B[k, :, j, memory_past_cat])) for j in 1:length(STAID)] for k in 1:K], colorbar_show=true, colorbar_label = L"\mathbb{P}(Y = \text{Rain}\mid Z = k, H = 1)", precision_scale = precision_scale)

Inference of the historical hidden states

Viterbi algorithm

ẑ = viterbi(hmm_fit, Y, Y_past; n2t=n2t)

data_stations_z = map(data_stations) do df
    @transform(df, :z = [fill(missing, local_order); ẑ])
end

ẑ_per_cat = [findall(ẑ .== k) for k in 1:K]

4-element Vector{Vector{Int64}}:
 [1, 7, 8, 12, 17, 23, 29, 44, 80, 81  …  23356, 23357, 23358, 23359, 23362, 23364, 23365, 23366, 23367, 23368]
 [2, 10, 11, 14, 18, 20, 21, 22, 25, 26  …  23328, 23333, 23351, 23352, 23355, 23360, 23361, 23369, 23370, 23371]
 [6, 9, 13, 15, 16, 24, 27, 30, 31, 40  …  23325, 23330, 23331, 23334, 23335, 23339, 23345, 23346, 23350, 23363]
 [3, 4, 5, 19, 32, 33, 34, 38, 46, 48  …  23295, 23332, 23347, 23348, 23349, 23372, 23373, 23374, 23375, 23376]

Visualization of the historical sequences of hidden states

year_range = unique(year.(data_stations[1][1+local_order:end, :DATE]));

idx_year = [findall(x -> year.(x) == m, data_stations[1][1+local_order:end, :DATE]) for m in year_range];

select_year = unique(sort([4:10:length(year_range); 21; 48; 64]))

begin
    year_nb = length(select_year)
    z_hat_mat = zeros(year_nb, 366)

    for (i, y) in enumerate(select_year)
        if isleapyear(year_range[y])
            z_hat_mat[i, :] = ẑ[idx_year[y]]
        else
            z_hat_mat[i, :] = [ẑ[idx_year[y]]; 0]
        end
    end
    thick = 1
    heatmap(z_hat_mat, colorbar=:none, c=my_palette(K), minorticks=:false, framestyle=:xbox, grid=:none, thickness_scaling=thick)
    xticks!(vcat(dayofyear_Leap.(Date.(2000, 1:12)), 366), vcat(string.(monthabbr.(1:12)), ""), xlims=(0, 367), xtickfontsize=14 / thick, ytickfontsize=14 / thick)
    hline!((1:year_nb) .+ 0.5, c=:black, legend=:none, lw=4)
    ylims!(0.5, year_nb + 0.5)
    pviterbi = yticks!(1:year_nb, string.(year_range[select_year]), size=(1000, 600))
end

Adding Rainfall amounts to the model

Marginal distribution

We fit the marginals of the rainfall amount $R>0$ at each station $s$ and for each hidden state $Z$ independently. We use a mixture of exponential functions

\[g(r) = w \dfrac{e^{-{\frac {r}{\vartheta_1}}}}{\vartheta_1} + (1-w) \dfrac{e^{-{\frac {r}{\vartheta_2}}}}{\vartheta_2}.\]

whose parameters $w(t)$, $\vartheta_1(t)$ and $\vartheta_2(t)$ are smooth periodic functions of the day of the year.

@time "FitMLE RR" mix_allE = fit_mle_RR.(data_stations_z, local_order, mix₀=mix_ini(T));

FitMLE RR: 69.864497 seconds (330.73 M allocations: 30.791 GiB, 3.92% gc time, 1.57% compilation time)

Thanks to Distributions.jl PR #1389 (September 2nd, 2021) and Julia multiple dispatch, the quantile function of Mixtures can be very efficiently computed.

Rain correlations

We fit a Gaussian copula to each pair of stations for joint rainy days only.

Warning

For some hidden states corresponding to dry weather, it might happen that for some pair of stations, there are not enough simultaneous rain occurrences in a rain category $Z = k$ to estimate a correlation coefficient. In that case a missing coefficient is returned by cov_RR or it returns the value impute_missing if specified (to avoid missing). The option force_PosDef (enabled by default) ensures having positive definite matrix. This is necessary to use gaussian copula.

Σ²RR = cov_RR(data_stations_z, K)

4-element Vector{Matrix{Float64}}:
 [3879.3410237455955 643.1969973888454 … 1044.722558290334 1422.1412369837526; 643.1969973888454 3906.8065809319683 … 421.21726685970293 261.80966034462176; … ; 1044.722558290334 421.21726685970293 … 3782.5212727685685 797.3469589896836; 1422.1412369837526 261.80966034462176 … 797.3469589896836 3100.798265766617]
 [2301.9313610952054 485.18478831568194 … 770.578089634074 863.0300234394578; 485.18478831568194 2006.8630718718769 … 428.13538892508905 271.67257203655475; … ; 770.578089634074 428.13538892508905 … 2440.5330122895875 592.8827865026185; 863.0300234394578 271.67257203655475 … 592.8827865026185 1893.078927735636]
 [2948.5928743209097 750.2824568975376 … 621.8223025654515 827.3791552074431; 750.2824568975376 5838.066822339216 … 369.18199793741627 352.11239948672414; … ; 621.8223025654515 369.18199793741627 … 2871.8548379173185 652.5033493795553; 827.3791552074431 352.11239948672414 … 652.5033493795553 2070.968133316776]
 [2578.6677735877342 939.3858870148969 … 533.7956218371148 -108.18202437342076; 939.3858870148969 3052.544041679537 … 895.8296379061048 212.7174426012282; … ; 533.7956218371148 895.8296379061048 … 1706.7038943029056 -415.34586196345674; -108.18202437342076 212.7174426012282 … -415.34586196345674 803.7842493374762]

Simulation

Now we are ready to generate samples from the SWG model.

Nb is the number of realization. In (Gobet et al., Sep 2024) Nb = 5_000 was used.

Nb = 1_000

Sample the (seasonal) HMM model and output the sequence of hidden states and multi-site dry/wet.

begin
    zs = zeros(Int, N, Nb)
    ys = zeros(Bool, N, D, Nb)
    @time "Simulations Z, Y" for i in 1:Nb
        zs[:, i], ys[:, :, i] = rand(hmm_fit, n2t; y_ini=Yall[1:local_order, :], z_ini=1, seq=true)
    end
end

Simulations Z, Y: 18.593620 seconds (374.84 M allocations: 23.093 GiB, 17.63% gc time, 6.87% compilation time)

Given the hidden states and dry/wet, it generates the rainfall amounts at each station (correlated with a Gaussian Copula).

begin
    rs = zeros(D, N, Nb)
    @time "Simulations RR>0" for i in 1:Nb
        rs[:, :, i] = rand_RR(mix_allE, n2t, zs[:, i], ys[:, :, i]', Σ²RR)
    end
end

Simulations RR>0: 230.267012 seconds (412.87 M allocations: 33.651 GiB, 1.92% gc time, 0.23% compilation time)

WGEN model

We will compare to the WGEN model that propose Markov chain of order 4 for rain occurrences (fitted monthly) and laten gaussian model for multisite occurrences (fitted monthly).

Wilks, D. S. "Multisite generalization of a daily stochastic precipitation generation model". Journal of Hydrology, (1998). https://doi.org/10.1016/S0022-1694(98)00186-3.
Srikanthan, Ratnasingham, et Geoffrey G. S. Pegram. "A nested multisite daily rainfall stochastic generation model". Journal of Hydrology 2009. https://doi.org/10.1016/j.jhydrol.2009.03.025.

wgen_order = 4
idx_months = [findall(x -> month.(x) == m, data_stations[1][1+local_order:end, :DATE]) for m in 1:12]
wgen4_model = fit_wgen(Y, idx_months, wgen_order)

ys_wgen = similar(ys)
@time "Simulation Y wgen 4" for i in 1:Nb
    ys_wgen[:, :, i] = rand(wgen4_model, 1956:2019; Y_ini=vcat(rand(Bool, wgen_order - 1, D), Y_past))
end

Simulation Y wgen 4: 75.133796 seconds (1.90 G allocations: 66.297 GiB, 11.54% gc time, 1.37% compilation time)

Results

Spell distribution

select_month to choose the month where to compute the spell distributions (summer month, winter, etc.) select_month = 1:12 corresponds to all months.

select_month = 1:12

idx_months = [findall(x -> month.(x) == m, data_stations[1][1+local_order:end, :DATE]) for m in 1:12]

idx_month_vcat = vcat(idx_months[select_month]...)

idx_all = [intersect(yea, mon) for yea in idx_year, mon in idx_months];

Historic spells

len_spell_hist = [pmf_spell(Y[idx_month_vcat, j], dw) for j in 1:D, dw in 0:1];

Simulation spells

len_spell_simu = [pmf_spell(ys[idx_month_vcat, j, i], dw) for i in 1:Nb, j in 1:D, dw in 0:1];

Dry spell

make_range(y, step=1) = range(extrema(y)..., step=step)

begin
    dry_or_wet = 1 # dry
    p_spell_dry = [plot(ylims=(1e-4, 1e-0), tickfont=11, legendfontsize=13) for j = 1:D]
    for j = 1:D
        all_spells = len_spell_simu[:, j, dry_or_wet]
        spell_range = 1:1:(1+maximum(vcat(reduce(vcat, all_spells), len_spell_hist[j, dry_or_wet])))

        errorlinehist!(p_spell_dry[j], all_spells, groupcolor=:grey, legend=:topright, label=islabel(j, staid_lat[[1]], L"Simu $q_{0,100}$"), norm=:probability, bins=spell_range, errortype=:percentile, percentiles=[0, 100], fillalpha=0.4, centertype=:median)

        errorlinehist!(p_spell_dry[j], all_spells, groupcolor=:red, label=islabel(j, staid_lat[[1]], L"Simu $q_{25,75}$"), norm=:probability, bins=make_range(reduce(vcat, all_spells)), errortype=:percentile, percentiles=[25, 75], fillalpha=0.5, centertype=:median)

        histo_spell = len_spell_hist[j, dry_or_wet]
        errorlinehist!(p_spell_dry[j], [histo_spell], label=islabel(j, staid_lat[[1]], "Obs"), groupcolor=:blue, lw=1.5, norm=:probability, bins=spell_range, errortype=:percentile, alpha = 0.8)
        xlims!(p_spell_dry[j], 0, 2 + maximum(1.5maximum.(histo_spell)))
        yaxis!(:log10)
    end

    [xlabel!(p_spell_dry[j], "Nb of days", xlabelfontsize=12) for j in staid_lat[6:10]]
    [ylabel!(p_spell_dry[j], "PMF", ylabelfontsize=12) for j in staid_lat[[1, 6]]]
    [title!(p_spell_dry[j], station_name[j], titlefontsize=13) for j = 1:D]
    pall_spell_dry = plot(p_spell_dry[staid_lat]..., size=(3000 / 2.5, 1000 / 1.5), layout=(2, 5), left_margin=0.5cm, bottom_margin=0.275cm)
end

Wet spell

begin
    dry_or_wet = 2 # wet
    p_spell_wet = [plot(ylims=(1e-4, 1e-0), tickfont=11, legendfontsize=13) for j = 1:D]
    for j = 1:D
        all_spells = len_spell_simu[:, j, dry_or_wet]
        spell_range = 1:1:(1+maximum(vcat(reduce(vcat, all_spells), len_spell_hist[j, dry_or_wet])))

        errorlinehist!(p_spell_wet[j], all_spells, groupcolor=:grey, legend=:topright, label=islabel(j, staid_lat[[1]], L"Simu $q_{0,100}$"), norm=:probability, bins=spell_range, errortype=:percentile, percentiles=[0, 100], fillalpha=0.4, centertype=:median)

        errorlinehist!(p_spell_wet[j], all_spells, groupcolor=:red, label=islabel(j, staid_lat[[1]], L"Simu $q_{25,75}$"), norm=:probability, bins=spell_range, errortype=:percentile, percentiles=[25, 75], fillalpha=0.5, centertype=:median)

        histo_spell = len_spell_hist[j, dry_or_wet]
        errorlinehist!(p_spell_wet[j], [histo_spell], label=islabel(j, staid_lat[[1]], "Obs"), groupcolor=:blue, lw=1.5, norm=:probability, bins=spell_range, errortype=:percentile, alpha = 0.8)
        xlims!(p_spell_wet[j], 0, 2 + maximum(1.5maximum.(histo_spell)))
        yaxis!(:log10)
    end

    [xlabel!(p_spell_wet[j], "Nb of days", xlabelfontsize=12) for j in staid_lat[6:10]]
    [ylabel!(p_spell_wet[j], "PMF", ylabelfontsize=12) for j in staid_lat[[1, 6]]]
    [title!(p_spell_wet[j], station_name[j], titlefontsize=13) for j = 1:D]
    pall_spell_wet = plot(p_spell_wet[staid_lat]..., size=(3000 / 2.5, 1000 / 1.5), layout=(2, 5), left_margin=0.5cm, bottom_margin=0.275cm)
end

Seasonal areal dry spells

RRmax = 0
ROR = [mean(r .> RRmax) for r in eachrow(RR)]
RORs = [[mean(r .> RRmax) for r in eachrow(rr)] for rr in eachslice(ys, dims=3)]
RORswgen = [[mean(r .> RRmax) for r in eachrow(rr)] for rr in eachslice(ys_wgen, dims=3)]

JJA = [6, 7, 8]
MAM = [3, 4, 5]
SON = [9, 10, 11]
DJF = [12, 1, 2]
SEASONS = [DJF, MAM, JJA, SON]
seasonname = ["DJF", "MAM", "JJA", "SON"]

idx_seasons = [findall(month.(data_stations[1][1+local_order:end, :DATE]) .∈ tuple(season)) for season in SEASONS]

let
    perc = 0.2
    QQ = [5, 95]

    p_spell_rors = [plot(ylims=(5e-4, 1e-0), xlims=(-0.01,25), tickfont=11, legendfontsize=13, legend=:left) for i in eachindex(idx_seasons)]
    xlabel!.(p_spell_rors[3:end], "Nb of days", xlabelfontsize=12)
    ylabel!.(p_spell_rors[[1, 3]], "PMF", ylabelfontsize=12)
    for m in eachindex(idx_seasons)
        len_ror_hist = pmf_spell(ROR[idx_seasons[m]] .≤ perc, 1)
        len_ror_simu = [pmf_spell(RORs[i][idx_seasons[m]] .≤ perc, 1) for i in 1:Nb]
        len_ror_simuwgen = [pmf_spell(RORswgen[i][idx_seasons[m]] .≤ perc, 1) for i in 1:Nb]

        errorlinehist!(p_spell_rors[m], [len_ror_hist], groupcolor=:blue, lw=2, norm=:probability, bins=make_range(len_ror_hist), errortype=:percentile,
        label=label = islabel(m, 1, "Obs"),
        legend=:bottom)
        yaxis!(:log10)

        sim_range = make_range(reduce(vcat, len_ror_simuwgen))
        errorlinehist!(p_spell_rors[m], len_ror_simuwgen, groupcolor=:green, legend=:topright,
            label=islabel(m, 1, "WGEN 4"),
            norm=:probability, bins=sim_range, errortype=:percentile, percentiles=QQ, fillalpha=0.25, centertype=:median, linewidth=2)

        sim_range = make_range(reduce(vcat, len_ror_simu))
        errorlinehist!(p_spell_rors[m], len_ror_simu, groupcolor=:grey, legend=:topright,
            label=islabel(m, 1, "SHHMM"),
            norm=:probability, bins=sim_range, errortype=:percentile, percentiles=QQ, fillalpha=0.3, centertype=:median, alpha=1, linewidth=2)
        annotate!(p_spell_rors[m], median(sim_range), 1.5, seasonname[m])
        yticks!(10.0 .^ (-4:-0))
    end

    pall = plot(p_spell_rors..., layout=(2, 2), size=(1000, 600), top_margin=0.34cm, left_margin=0.3cm, bottom_margin=0.22cm)
    file_name = "ROR_spell_season_perc_$(perc)_Q_$(QQ[1])_$(QQ[2])_no_inset"
    file_name = replace(file_name, "." => "p")
    pall
end

Rain

Interpretation: Mean Rain per weather regime $R > 0 \mid Z = k$.

We plot the empirical (strictly) positive mean rainfall amounts per weather regime. The results are smoothed using a cyclic_moving_average with a time window of $\pm 15$ days and the Epanechnikov kernel.

begin
    p_rainpercat = [plot(tickfont=12, ylabelfontsize=14, titlefontsize=14, legendfontsize=13) for j = 1:D]
    for j = 1:D
        df_j = @chain data_stations_z[j] begin
            dropmissing
            @transform(:day = dayofyear_Leap.(:DATE))
            @subset(:RR .> 0)
            @by([:day, :z], :MEAN_RR = mean(:RR))
            groupby(:z)
        end
        # Uncomment to see how the double exponential mixtures compare to the empirical data.
        # [plot!(p_rainpercat[j], 1:T, t -> conversion_factor * mean(mix_allE[j][k, t]), label=:none, c=my_color(k, K), lw=1.5, legend = :topleft) for k in 1:K]
        for k in 1:K
            cycle_avg = replace(cyclic_moving_average(df_j[k].MEAN_RR, df_j[k].day, T, 15), 0 => missing)
            @df df_j[k] plot!(p_rainpercat[j], 1:T, conversion_factor * cycle_avg, c=my_color(k, K), alpha=1, label=islabel(j, staid_lat[[4]], L"Z = %$k"), lw=1.5)
        end
        ylims!(p_rainpercat[j], 0, Inf)
    end
    [ylabel!(p_rainpercat[j], "Rain (mm)") for j in staid_lat[[1, 6]]]
    [xticks!(
        p_rainpercat[j],
        vcat(dayofyear_Leap.(Date.(2000, 1:12)), 366),
        vcat(string.(first.(string.(monthabbr.(1:12)))))
    ) for j in 1:D]
    [title!(p_rainpercat[j], station_name[j]) for j = 1:D]
    plt_rain_cat_mix = plot(p_rainpercat[staid_lat]..., size=(3000 / 2.2, 1000 / 1.5), layout=(2, 5), left_margin=25px)
end

Univariate Rain distributions

Historical vs Nb simulations distribution

begin
    p_uniR = [plot(yaxis=:log10, ylims=(0.2e-4, 1e-0), tickfont=11, legendfontsize=13, titlefontsize=13) for j = 1:D]
    for j = 1:D
        dists_RR_positive_j = conversion_factor * [filter(!iszero, rs[j, :, i]) for i in 1:Nb]
        Rmax = ceil(max(dists_RR_positive_j .|> maximum |> maximum, conversion_factor * filter(!iszero, data_stations[j].RR) |> maximum))
        BINS = 0:2:Rmax # fixing the bins is very important to ensure fair comparison. Note that changing the bin step changes the aspect of the distributions.
        errorlinehist!(p_uniR[j], dists_RR_positive_j, groupcolor=:grey, legend=:topright, label=islabel(j, staid_lat[[1]], L"Simu $q_{0,100}$"), norm=:pdf, errortype=:percentile, percentiles=[0, 100], fillalpha=0.4, centertype=:median, bins = BINS)

        errorlinehist!(p_uniR[j], dists_RR_positive_j, groupcolor=:red, label=islabel(j, staid_lat[[1]], L"Simu $q_{25,75}$"), norm=:pdf, errortype=:percentile, percentiles=[25, 75], fillalpha=0.5, centertype=:median, bins = BINS)

        errorlinehist!(p_uniR[j], [conversion_factor * filter(!iszero, data_stations[j].RR)], label=islabel(j, staid_lat[[1]], "Obs"), groupcolor=:blue, lw=1.5, norm=:pdf, errortype=:percentile, bins = BINS, alpha = 0.7)

        xlims!(p_uniR[j], 0.0, Inf)
    end
    [plot!(p_uniR[j], xlabel="Rain (mm)") for j in staid_lat[6:10]]
    [plot!(p_uniR[j], ylabel="PDF") for j in staid_lat[[1, 6]]]

    [title!(p_uniR[j], station_name[j]) for j = 1:D]

    pall_R = plot(p_uniR[staid_lat]..., size=(3000 / 2.5, 1000 / 1.5), layout=(2, 5), bottom_margin=11px, left_margin=15px)
end

Aggregated 5 days RR distribution

agg_window = 5
df_res = [
    @chain df[1+local_order:end, :] begin
    @transform(:agg = first.(divrem.(1:N, agg_window)))
    @by(:agg, :AGG = sum(:RR) * 0.1)
    end
        for df in data_stations_z
            ]


agg_i_full = first.(divrem.(1:N, agg_window))
idx_agg = [findall(agg_i_full .== val) for val in unique((agg_i_full))]
agg_rs = [[sum(rs[j, ii, i]) for ii in idx_agg] for j in 1:D, i in 1:Nb] * conversion_factor
agg_RR = [[sum(RR[ii, j]) for ii in idx_agg] for j in 1:D]
begin
    p_uniR = [plot(yaxis=:log10, ylims=(3e-5, 2e-1), tickfont=11, legendfontsize=13, titlefontsize=13, legend=:bottom) for j = 1:D]
    [plot!(p_uniR[j], xlabel="$(agg_window) days rain (mm)") for j in staid_lat[6:10]]
    [plot!(p_uniR[j], ylabel="PDF") for j in staid_lat[[1, 6]]]
    [title!(p_uniR[j], station_name[j]) for j = 1:D]
    for j = 1:D
        dists_RR_positive_j = agg_rs[j, :]
        Rmax = ceil(max(agg_rs[j, :] .|> maximum |> maximum, agg_RR[j] |> maximum))
        errorlinehist!(p_uniR[j], dists_RR_positive_j, groupcolor=:grey, legend=:topright, label=islabel(j, 20, L"Simu $q_{0,100}$"), norm=:pdf, errortype=:percentile, percentiles=[0, 100], fillalpha=0.4, centertype=:median, bins=0:3:Rmax)

        errorlinehist!(p_uniR[j], dists_RR_positive_j, groupcolor=:red, label=islabel(j, 20, L"Simu $q_{25,75}$"), norm=:pdf, errortype=:percentile, percentiles=[25, 75], fillalpha=0.5, centertype=:median, bins=0:3:Rmax)

        errorlinehist!(p_uniR[j], [agg_RR[j]], label=islabel(j, 20, "Obs"), groupcolor=:blue, lw=1.5, norm=:pdf, errortype=:percentile, bins=0:3:Rmax, fillalpha=0.8)

        xlims!(p_uniR[j], 0.0, Inf)
        yticks!(10.0 .^ (-5:-1))
        lens!(p_uniR[j], [0, 25], [0.0, 0.2], inset=(1, bbox(0.48, -0, 0.475, 0.25)), linewidth=0)
    end

    pall_aggR = plot(p_uniR[staid_lat]..., size=(3000 / 2.5, 1000 / 1.5), layout=(2, 5), bottom_margin=14px, left_margin=15px)
    plot!(pall_aggR, (1:3)', inset=(1, bbox(2.4, 0.42, 0.6, 0.3)), subplot=2D + 1, legendfontsize=14, framestyle=:none, label=[L"Simu $q_{0,100}$" L"Simu $q_{25,75}$" "Obs"], c=[:gray :red :blue], foreground_color_legend=nothing, lw=2)
end

Autocorrelation

acfrange = 0:15
@views aa = [autocor(rs[j, :, i], acfrange) for j in 1:D, i in 1:Nb]

begin
    p_spell_wet = [plot(xlabelfontsize=16, ylabelfontsize=16, tickfont=11, legendfontsize=16) for j = 1:D]
    for j = 1:D
        errorline!(p_spell_wet[j], acfrange, stack(aa[:, j], dims=1)', groupcolor=:gray, label=islabel(j, staid_lat[[1]], L"Simu $q_{0,100}$"), errortype=:percentile, percentiles=[0, 100], fillalpha=0.8, lw=2, centertype=:median)
        plot!(p_spell_wet[j], acfrange, autocor(RR[:, j], acfrange), label=islabel(j, staid_lat[[1]], "Obs"), lw=2.0, c=1, markers=:circle, alpha=0.8)
    end

    [xlabel!(p_spell_wet[j], "Lag", xlabelfontsize=12) for j in staid_lat[6:10]]
    [ylabel!(p_spell_wet[j], "ACF", ylabelfontsize=12) for j in staid_lat[[1, 6]]]
    [title!(p_spell_wet[j], station_name[j], titlefontsize=13) for j = 1:D]
    pall_ACF = plot(p_spell_wet[staid_lat]..., size=(3000 / 2.5, 1000 / 1.5), layout=(2, 5), left_margin=0.42cm, bottom_margin=0.32cm)
end

Monthly quantile amount

rh = reduce(hcat, [df[1+local_order:end, :RR] for df in data_stations])

month_rain_simu = [monthly_agg(rs[j, :, i], idx_all) for j in 1:D, i in 1:Nb];

month_rain_histo = [monthly_agg(rh[:, j], idx_all) for j in 1:D]

qs = [0.9, 0.5, 0.1]

@time "Plot monthly quantile" begin
    p_month_RR = [scatter(xtickfontsize=10, ytickfontsize=11, ylabelfontsize=12, legendfontsize = 12, foreground_color_legend=nothing) for j = 1:D]
    for j = 1:D
        for (α, per) in enumerate([[0, 100], [25, 75]])
            for (cc, q) in enumerate(qs)
                errorline!(p_month_RR[j], [quantile(month_rain_simu[j, i][:, m], q) * conversion_factor for m in 1:12, i in 1:Nb], label=(α == 1 ? islabel(j, 9,L"Simu  $q_{%$(Int(q*100))}$") : :none), fillalpha=0.18 * α^2, centertype=:median, errortype=:percentile, percentiles=per, groupcolor=my_palette(length(qs))[cc])
            end
        end
        for q in qs
            scatter!(p_month_RR[j], m -> quantile(month_rain_histo[j][:, m], q) * conversion_factor, 1:12, label=(q == qs[1] ? islabel(j, 3,"Obs") : :none), legend = :topleft, ms=2.5, c=:blue)
            plot!(p_month_RR[j], m -> quantile(month_rain_histo[j][:, m], q) * conversion_factor, 1:12, label=:none, c=:blue, lw=1.75)
        end
        xticks!(p_month_RR[j], 1:12, string.(first.(monthabbr.(1:12))))
        ylims!(p_month_RR[j], 0, Inf)
    end
    [ylabel!(p_month_RR[j], "Rain (mm)") for j in staid_lat[[1, 6]]]

    [title!(p_month_RR[j], station_name[j], titlefontsize=12) for j = 1:D]
    pall_month_RR = plot(p_month_RR[staid_lat]..., size=(1190, 500), layout=(2, 5), left_margin=19px)
end

Correlations

Rain event dry/wet

cor_bin_hist = cor(reduce(hcat, [df.RO for df in data_stations]));

cor_bin_mean_simu = mean(cor(ys[:, :, i]) for i in 1:Nb);


begin
    plots_cor_bin = [plot(-0.1:0.1:0.8, -0.1:0.1:0.8, aspect_ratio=true, label=:none, xlabelfontsize=16, ylabelfontsize=16, tickfont=11, legendfontsize=13) for _ in 1:1]
    scatter!(plots_cor_bin[1], vec_triu(cor_bin_hist), vec_triu(cor_bin_mean_simu), label="Correlations", xlabel="Observations", ylabel="Simulations", c=2)
    [xlims!(plots_cor_bin[i], -0.1, 1) for i in 1:1]
    [ylims!(plots_cor_bin[i], -0.1, 1) for i in 1:1]
    annotate!(0.2, 0.7, "MSE ≃ $(round(mean(abs2, vec_triu(cor_bin_hist) - vec_triu(cor_bin_mean_simu)), digits = 4))")
    plot_cor_bin = plot(plots_cor_bin...)
end

The largest pair correlation error for rain occurrence comes from the pair

println("$(station_name[findmax(cor_bin_hist - cor_bin_mean_simu)[2][1]]) and $(station_name[findmax(cor_bin_hist - cor_bin_mean_simu)[2][2]])")

CHASSIRON and LA HAGUE

Rainfall amount

cor_hist = cor(reduce(hcat, [df.RR for df in data_stations]));

corT_hist = corTail(reduce(hcat, [df.RR for df in data_stations]));

cor_mean_simu = mean(cor(rs[:, :, i]') for i in 1:Nb);

corT_mean_simu = mean(corTail(rs[:, :, i]') for i in 1:Nb);

begin
    plots_cor = [plot(-0.1:0.1:0.8, -0.1:0.1:0.8, aspect_ratio=true, label=:none, xlabelfontsize=16, ylabelfontsize=16, tickfont=11, legendfontsize=13) for _ in 1:2]
    scatter!(plots_cor[1], vec_triu(cor_hist), vec_triu(cor_mean_simu), label="Correlations", xlabel="Observations", ylabel="Simulations", c=2)
    annotate!(plots_cor[1], 0.3, 0.7, "MSE ≃ $(round(mean(abs2, vec_triu(cor_hist) - vec_triu(cor_mean_simu)), digits = 4))")

    scatter!(plots_cor[2], vec_triu(corT_hist), vec_triu(corT_mean_simu), label="Tail index", xlabel="Observations", ylabel="Simulations", c=3)
    annotate!(plots_cor[2], 0.3, 0.7, "MSE ≃ $(round(mean(abs2, vec_triu(corT_hist) - vec_triu(corT_mean_simu)), digits = 4))")

    [xlims!(plots_cor[i], -0.1, 1) for i in 1:2]
    [ylims!(plots_cor[i], -0.1, 1) for i in 1:2]
    plot_cor_all = plot(plots_cor..., size=(800, 400), left_margin=15px, right_margin = 8px)
end

The largest pair correlation error for rain (zero and non zero amounts) comes from the pair

println("$(station_name[findmax(cor_hist - cor_mean_simu)[2][1]]) and $(station_name[findmax(cor_hist - cor_mean_simu)[2][2]])")

EMBRUN and MARIGNANE

Gaussian copula hypothesis

For a pair of stations, we transform the marginal $R_s>0$ to $\mathcal{N}(0,1)$. We compare the obtained bi-variate Normal distribution with the Mahalanobis distance to the theoretical $\chi^2(2)$-distriubtion.

corΣ = cov2cor.(Σ²RR)
begin
    j1 = 10
    j2 = 8
    plt_qqp_copula = plot(0:25, 0:25, aspect_ratio=:equal, legendfontsize=14, c=:black, label=:none, tickfont=12, ylabelfontsize=13, xlabelfontsize=13)
    df_12 = leftjoin(data_stations_z[j1], data_stations_z[j2], on=:DATE, makeunique=true)
    @subset!(df_12, :RR .> 0, :RR_1 .> 0)
    for k in 1:K
        df_X = @chain df_12 begin
            @subset(:z .== k)
            dropmissing
            @aside u = StochasticWeatherGenerators.Copulas.pseudos(permutedims(hcat(_.RR, _.RR_1)))
            @transform(:X = quantile(Normal(), u[1,:]), :X_1 = quantile(Normal(), u[2,:]))
        end
        X = hcat(df_X.X, df_X.X_1)
        cor_sigma = [1 corΣ[k][j1,j2]; corΣ[k][j1,j2] 1]
        Σ⁻¹ = inv(cor_sigma)

        X2 = [(x' * Σ⁻¹ * x) for x in eachrow(X)] |> sort
        ecdfX2 = ecdf(X2)(X2) * length(X2) / (length(X2) + 1)

        plot!(quantile(Chisq(2), ecdfX2), X2, xlabel=L"$\chi^2(2)$-quantile", c=my_color(k, K), ylabel="Observed squared Mahalanobis distance", label=L"Z = %$k ", legend=:topleft, lw=2)
    end
    title!("$(station_name[j1])-$(station_name[j2]) $(ifelse(j1 == 10 && j2 == 8, "(334 km)", ""))")
    xlims!(0, 22)
    ylims!(0, 22)
end

Reproducibility

using InteractiveUtils
InteractiveUtils.versioninfo()

Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_DEBUG = Documenter

Package list and version

import Pkg; Pkg.status()

Status `~/work/StochasticWeatherGenerators.jl/StochasticWeatherGenerators.jl/docs/Project.toml`
  [336ed68f] CSV v0.10.15
⌃ [13f3f980] CairoMakie v0.13.10
  [a93c6f00] DataFrames v1.7.0
  [1313f7d8] DataFramesMeta v0.15.4
  [8bb1440f] DelimitedFiles v1.9.1
  [31c24e10] Distributions v0.25.120
  [e30172f5] Documenter v1.14.1
  [daee34ce] DocumenterCitations v1.4.0
  [db073c08] GeoMakie v0.7.13
  [4138dd39] JLD v0.13.5
  [b964fa9f] LaTeXStrings v1.4.0
  [98b081ad] Literate v2.20.1
  [436b0209] NaturalEarth v0.1.0
  [bac558e1] OrderedCollections v1.8.1
  [3d8e1505] SmoothPeriodicStatsModels v2.0.3
  [2913bbd2] StatsBase v0.34.5
  [f3b207a7] StatsPlots v0.15.7
  [3eadb0b9] StochasticWeatherGenerators v1.3.1 `~/work/StochasticWeatherGenerators.jl/StochasticWeatherGenerators.jl`
  [fd094767] Suppressor v0.2.8
  [ade2ca70] Dates v1.11.0
  [b77e0a4c] InteractiveUtils v1.11.0
  [9a3f8284] Random v1.11.0
Info Packages marked with ⌃ have new versions available and may be upgradable.

This page was generated using Literate.jl.