
#quarto #rstats friends who use GitHub Actions to publish articles:

it's currently taking github actions ~30 mins to publish my little #mgcv help site (https://calgary.converged.yt/). This seems to be because it's installing a lot of R packages from source.

What's the current state-of-the-art to get these things to render quickly? (And using minimal power.)

(I'd like to not use github but I would also like to encourage PRs etc from folks without a huge overhead from them, so let's stick to github-based solutions for now.)

new (out for a while but sitting in my browser from before Christmas) paper in Biometrika from Benjamin Säfken, Thomas Kneib and Simon Wood on smoothing parameter degrees of freedom

Green OA @ Edinburgh https://www.pure.ed.ac.uk/ws/portalfiles/portal/475921820/asae052.pdf

#mgcvchat #mgcv

#mgcv mini-lifehack:

(assuming you have multithreading enabled) you can get a rough idea of what's happening when fitting a big model by looking at your CPU usage. If only 1 core is being used, the model is still "building" (assembling the design/penalty matrices). Once you switch to all cores, you're actually fitting the model. If that first construction phase takes a long time (as it can with a very big model), the fit itself will probably take a very, very long time. So buckle up.

#mgcvchat

#Poisson regression with #mgcv and #glmmTMB in #rstats just rocks

spending some more time thinking about neighbourhood cross-validation in #mgcv (see original post here: https://calgary.converged.yt/articles/ncv.html), but for time series.

Pretty nice to be able to get back to a yearly trend here without needing to specify an autoregressive structure. We just need to specify a cross-validation scheme and the autocorrelation is "dealt with" during fitting.

Full post on this soon. #mgcvchat #rstats

Plot showing volumetric water content over time. The raw data are very noisy. A simple GAM smooth overfits the data, whereas a GAMM with an AR structure gives a more seasonal pattern and ignores the noise. The GAM fitted using NCV gives a very similar fit.

Ok, a more *specific* #mgcv #GAM question: When using tensor product interaction terms with `ti()`, do the knots have to match? E.g. do I have to do `ti(x, k = 10) + ti(y, k = 20) + ti(x, y, k = c(10, 20))`? Or can the knots in the interaction term be whatever? Would I want them to be different for some reason?

#rstats
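A minimal sketch of the two options in the question above, using simulated data (the variable names, sample size and the `k = c(5, 5)` choice are illustrative assumptions, not from the post). As far as I know, mgcv fits both without complaint; whether a smaller interaction basis is a good idea depends on how wiggly you expect the interaction to be.

```r
library(mgcv)

set.seed(1)
n <- 400
dat <- data.frame(x = runif(n), y = runif(n))
dat$z <- sin(2 * pi * dat$x) + cos(2 * pi * dat$y) +
  dat$x * dat$y + rnorm(n, sd = 0.3)

# Interaction reuses the basis sizes of the main effects
m_match <- gam(z ~ ti(x, k = 10) + ti(y, k = 20) + ti(x, y, k = c(10, 20)),
               data = dat, method = "REML")

# Same main effects, but a smaller basis for the interaction
m_diff <- gam(z ~ ti(x, k = 10) + ti(y, k = 20) + ti(x, y, k = c(5, 5)),
              data = dat, method = "REML")

# Compare the two fits and their coefficient counts
AIC(m_match, m_diff)
c(matched = length(coef(m_match)), different = length(coef(m_diff)))
```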

A unifying modelling approach for hierarchical distributed lag models, by Theo Economou et al:

https://doi.org/10.48550/arXiv.2407.13374

code: https://zenodo.org/records/10458640

#rstats #mgcv

Preprint from Simon Wood on the new cross-validation smoothness estimation in #mgcv: https://arxiv.org/abs/2404.16490. It's a neat performant + data-efficient way to estimate GAMs based on complex CV splits (like spatial/temporal/phylo ones).

See ?NCV in latest {mgcv} for examples (https://cran.r-universe.dev/mgcv/doc/manual.html#NCV)

I might write a helper to convert {rsample}/{spatialsample} objects into mgcv's funny CV indexing structure.

#rstats #ml #tidymodels #mgcvchat @MikeMahoney218 @gavinsimpson @ericJpedersen @millerdl

Figure 7 from the linked paper. It shows two grids of predictions from a spatio-temporal model. In the left grid the topographic lines are less wiggly and more widely spaced. The caption reads: "Forest health model space time interaction effect estimates. Left uses NCV accounting for the possibility of spatial and temporal autocorrelation as described in the text, and right uses REML assuming independence. Broadly, the NCV estimate is smoother in space, but less so in time."

#Rstats folks: what’s the best way to parse a formula that has a mix of linear covariates and #mgcv smoothers? I need a design matrix that I can send to Stan. Handling penalty terms separately for now.

@millerdl any tips?
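One possible route, sketched under assumptions (simulated data; the formula and variable names are made up for illustration): calling `gam()` with `fit = FALSE` sets the model up without estimating it and returns a pre-fit object whose `X` element is the design matrix and whose `S` element holds the penalty matrices.

```r
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = runif(200))
dat$y <- rpois(200, exp(0.3 * dat$x1 + sin(2 * pi * dat$x2)))

# fit = FALSE builds the model structure but stops before estimation
G <- gam(y ~ x1 + s(x2, k = 10), family = poisson(), data = dat, fit = FALSE)

X <- G$X  # design matrix: intercept, x1, then the s(x2) basis columns
S <- G$S  # list of penalty matrices for the smooth terms

dim(X)
length(S)
```

The smooth's columns already have the identifiability constraint absorbed, so `s(x2, k = 10)` contributes 9 columns rather than 10.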

@cameronpat I have wondered about this too! Especially since GAMs seem like a natural progression from "ordinary" linear models. Is it the choosing of bases or interpretation of coefficients that's a turn off? But those aren't decisions specific to #biostatistics. Perhaps it's just a lack of awareness? I've found #mgcv in #rstats super easy to use.

#RStats issues I'm struggling with that seem impossible to Google: Building a {brms} model within the {tidymodels} framework using {bayesian}.

The formula is inherently too complex (including splines and random effects) for the typical tidymodels workflow that involves recipes &c., so it must be added in at a later step. Three things:

1. Complex {brms} multivariate formulas seem to not be possible using {tidymodels}. E.g., literally multivariate or including phi after my formula via brms::bf(). It simply errors. :( This may just need some tweaking of {bayesian}'s scripts or waiting for an update since it's still fairly young.

2. Using {mgcv} random effect syntax like s(cat1, cat2, bs = "re") seems to not pick up as random effects in the model...I think? And I have never figured out if this is creating hierarchical random effects or not -- or if multilevel random effects just aren't possible in this syntax(?).

3. Using {lme4} random effect like (1 | cat1 / cat2) to ensure the hierarchy is preserved *does* retain random effects I can pull out of the model later using `ranef`, but for some absurd reason I cannot run this model through cross-validation or a myriad of other steps later because it seems to force-create a complex web of interacting factor levels that don't exist. E.g., if my random effects are '(1 | realm / biome)', this eventually fails because it'll look for tundra biome types in Africa for some absurd reason.*

Noticed this while trying to solve *separate* issues within broom.mixed:::tidy.brmsfit() -- that it seems to delete the names of all the fixed effects and return them as 'NULL' character strings (???), and its reliance on 'ranef' means it doesn't find the random effects using {mgcv} syntax.

That's my rambling mess of an essay for the day. Not sure how many of these are real issues or me simply not understanding how these packages differ or wot.

#brms #mgcv #tidymodels

* Almost wondering if this might even be a separate {tidymodels} issue right now. Every recipe no matter what seems to factor every single character column regardless of how the recipe is built. Hmmmm.

Struggling to identify random effects using {lme4} syntax and {mgcv} syntax models.

Absolutely gaga over this new preprint by Nick Clark and the @weecology group. So many methodological threads - long-term ecological monitoring, an open data system, careful semi-parametric models, simulation-based inference and forecasting rigor - combine into predicting complex multispecies dynamics while learning about their relationships + drivers

https://ecoevorxiv.org/repository/view/5143/, code at https://github.com/nicholasjclark/portal_VAR

Thread from Nick at: https://twitter.com/nj_clark/status/1635417591157260288

#ecology #forecasting #EFI #mgcv #rstats

Figure from paper described in post showing three time series of multiple rodent species monitored populations, model fits to those populations, and forward forecasts.

#Rstats #mgcv question:

I'm working with very large data, and am fitting smoothing splines with the `bam()` function, and `discrete = TRUE` (which is an amazing speed boost!)

When I want to predict new data from the fitted model, is there any reason why I can't set `discrete = FALSE` in `predict.bam()`? That is, the fitted bam model is still just a gam fitted model, right?

(I have *many* levels for a random effect, and the discretized predictions are erroring out with "data is too long")
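A small sketch of what the question describes, on simulated data (the sizes, names and number of random-effect levels are assumptions for illustration). `predict.bam()` has a `discrete` argument, so a model fitted with `discrete = TRUE` can be asked for ordinary, non-discretized predictions.

```r
library(mgcv)

set.seed(1)
n <- 5000
dat <- data.frame(x = runif(n), g = factor(sample(50, n, replace = TRUE)))
re <- rnorm(nlevels(dat$g))
dat$y <- sin(2 * pi * dat$x) + re[as.integer(dat$g)] + rnorm(n, sd = 0.5)

# Fast fit using discretized covariates
fit <- bam(y ~ s(x) + s(g, bs = "re"), data = dat, discrete = TRUE)

newd <- data.frame(x = seq(0, 1, length.out = 20),
                   g = factor(rep(levels(dat$g)[1], 20), levels = levels(dat$g)))

p_disc  <- predict(fit, newd)                    # default for a discrete fit
p_exact <- predict(fit, newd, discrete = FALSE)  # standard prediction code path

# Size of the difference between the two prediction methods
max(abs(p_disc - p_exact))
```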

#rstats friends 🎉💻🐈

two bots of potential interest

@mgcv_updates tells you about what's new in #mgcv

and

@rverbsr is a silly bot that toots "verb that noun" phrases where the verbs are functions in R base and the nouns are R types

enjoy!

OK, a first convening of team #gams #mgcv here: @ericJpedersen @gavinsimpson @millerdl .

If I want to fit a spline but constrain it to go through certain points (e.g., the start and end of an epicurve should be zero), what's the best way? I'm thinking of adding points to the data at the ends of the range with very high weights. Not sure what the consequences of that would be. #rstats

#mgcv
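A rough sketch of the weighting idea from the post, on a simulated Gaussian "epicurve" (the data, the anchor locations and the weight of 1e4 are all illustrative assumptions). This is a soft constraint: the fit is pulled very close to zero at the anchors, not forced exactly through them.

```r
library(mgcv)

set.seed(1)
dat <- data.frame(t = 1:60)
dat$cases <- 20 * exp(-(dat$t - 30)^2 / 128) + rnorm(60, sd = 1)

# Pseudo-observations pinning the curve to zero at either end of the range
anchors <- data.frame(t = c(0, 61), cases = c(0, 0))
aug <- rbind(dat, anchors)
aug$w <- c(rep(1, nrow(dat)), rep(1e4, nrow(anchors)))

fit <- gam(cases ~ s(t, k = 15), data = aug, weights = w, method = "REML")

# Fitted values at the anchor points
end_fit <- predict(fit, data.frame(t = c(0, 61)))
end_fit
```

One consequence worth checking is that heavily weighted pseudo-data also feed into the smoothing parameter and scale estimates, so uncertainty intervals near the ends may look narrower than they should. `s()` also has a `pc` argument for a point constraint on a single smooth, which may be worth a look, though it applies to the smooth term rather than to the whole linear predictor.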
