#crossValidation

2025-07-03

Cross-validation on time series: how not to shuffle time

Hi, Habr! Today we look at what most often breaks even impressive-looking time series models: incorrect cross-validation. We will cover why KFold does not work here, how easy it is to pick up leakage from the future, which splitters are actually honest with respect to time, and how to validate features built from lags and aggregates.
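
A minimal sketch of a time-respecting splitter, to make the idea concrete. It is not the linked article's code: the function name `expanding_window_splits` and its parameters are hypothetical, and the `gap` argument is one assumed way to keep lag or rolling-window features from looking into the test period. scikit-learn's `TimeSeriesSplit` provides a comparable scheme.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists in which every training index
    comes strictly before every test index (no shuffling)."""
    if n_splits < 1 or test_size < 1 or gap < 0:
        raise ValueError("n_splits and test_size must be >= 1, gap must be >= 0")
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # The training window grows with each split and stops `gap` steps
        # before the test window begins.
        train = list(range(0, test_start - gap))
        test = list(range(test_start, test_start + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in expanding_window_splits(10, n_splits=3, test_size=2, gap=1):
        print("train:", train, "test:", test)
```

Unlike a shuffled KFold, no split here trains on observations that come after the ones it is tested on.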

habr.com/ru/companies/otus/art

#временные_ряды #time_series #машинное_обучение #прогнозирование #кроссвалидация #crossvalidation

InterData VN interdatavn
2025-05-03

What is Cross-Validation in Machine Learning? A-Z

Cross-Validation is a key technique in Machine Learning that helps check a model's performance and its ability to generalize to new data. As a result, the model avoids biased learning and behaves more stably. This article explores Cross-Validation in detail, why it matters, and the validation methods commonly used today.

Read the full article here: interdata.vn/blog/cross-valida
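
A minimal sketch of plain k-fold cross-validation, for readers who want to see the mechanics. It is an illustration only, not the linked article's code: the helper names `k_fold_indices` and `cross_validate_mean_predictor` are hypothetical, the "model" is just the training-set mean, and the folds are contiguous and unshuffled (for i.i.d. data one would usually shuffle first).

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def cross_validate_mean_predictor(y: Sequence[float], n_folds: int) -> List[float]:
    """Return the per-fold mean squared error of a predictor that always
    outputs the mean of its training targets."""
    scores = []
    for train, test in k_fold_indices(len(y), n_folds):
        prediction = sum(y[i] for i in train) / len(train)
        mse = sum((y[i] - prediction) ** 2 for i in test) / len(test)
        scores.append(mse)
    return scores


if __name__ == "__main__":
    print(cross_validate_mean_predictor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_folds=3))
```

Each sample is used for testing exactly once and is never in the training set of the fold that tests it.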

Dr Mircea Zloteanu ☀️ 🌊🌴 mzloteanu
2024-05-29

#103 On the marginal likelihood and cross-validation

Thoughts: Can't say I can follow much of this, so I'll open it up to the community for input. Seems important though.

doi.org/10.1093/biomet/asz077

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

6) thankfully, Wager (2020) doi.org/10.1080/01621459.2020. shows that cross-validation is asymptotically consistent for model selection, so while what we're doing gives us poor estimates of generalization error and bad error bars, at least it's valid for model selection.

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

5) Bates et al. (2023) doi.org/10.1080/01621459.2023. propose a nested cross-validation estimator of generalization error that's unbiased and has an unbiased mean squared error estimator. It's computationally quite intensive. I played a bit with it, and in my high-dimensional setups (large p, small n) I got error bars that did indeed have good coverage of the generalization error, but that also covered most of the [0, 1] interval, which is less helpful.

⬇️

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

4) in any case, error bars are wrong, because it's impossible to get an unbiased estimator of the mean squared error of an estimator that's based on a single fold of cross-validation, as shown by Bengio & Grandvalet (2004) dl.acm.org/doi/10.5555/1005332

⬇️

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

3) cross-validation estimators are better estimators of *expected test error* (across all possible training sets) than of *generalization error* of a model.

This has been known for a while and even appears in The Elements of Statistical Learning, so I should have known about this much earlier. Bates et al. (2023) doi.org/10.1080/01621459.2023. show why this is for linear models.
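
In symbols, the distinction in point 3 can be written as follows. The notation is an assumption made here, roughly following The Elements of Statistical Learning, and is not taken from the thread: \( \mathcal{T} \) is the training set, \( \hat f_{\mathcal{T}} \) the model fit on it, \( \ell \) the loss, and \( \kappa(i) \) the fold containing sample \( i \).

\[
\mathrm{Err}_{\mathcal{T}} = \mathbb{E}_{(X,Y)}\!\left[\ell\big(Y, \hat f_{\mathcal{T}}(X)\big) \,\middle|\, \mathcal{T}\right],
\qquad
\mathrm{Err} = \mathbb{E}_{\mathcal{T}}\!\left[\mathrm{Err}_{\mathcal{T}}\right]
\]

\[
\widehat{\mathrm{Err}}_{\mathrm{CV}} = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)
\]

Here \( \mathrm{Err}_{\mathcal{T}} \) is the generalization error of the one model actually trained, and \( \mathrm{Err} \) is the expected test error across training sets. Because each \( \hat f^{-\kappa(i)} \) is fit on a different subset of the data, the average \( \widehat{\mathrm{Err}}_{\mathrm{CV}} \) tends to track \( \mathrm{Err} \) more closely than \( \mathrm{Err}_{\mathcal{T}} \).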

⬇️

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

2) (not a surprise, but worth remembering): cross-validation error bars can be very large when sample sizes are small (unsurprisingly, due to the \( \frac{1}{\sqrt{n}} \) factor).
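
A quick back-of-the-envelope sketch of that \( \frac{1}{\sqrt{n}} \) factor. It assumes, optimistically, that the n test predictions are independent draws, so that accuracy follows a binomial model; the function name `accuracy_standard_error` is hypothetical, and the 0.8 accuracy and sample sizes are made-up illustration values, not results from the cited papers.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test
    independent test samples: sqrt(p * (1 - p) / n)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


if __name__ == "__main__":
    for n in (25, 100, 400, 10000):
        # 1.96 standard errors gives an approximate 95% interval half-width.
        half_width = 1.96 * accuracy_standard_error(0.8, n)
        print(f"n={n:>5}: accuracy 0.80 +/- {half_width:.3f}")
```

Quadrupling the sample size only halves the error bar, and with a few dozen samples the interval spans well over ten accuracy points. Cross-validation folds are not independent, so real uncertainty can be larger than this formula suggests.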

This is discussed for example regarding microarray studies in Braga-Neto & Dougherty (2004) doi.org/10.1093/bioinformatics and @GaelVaroquaux (2018) regarding brain image analysis doi.org/10.1016/j.neuroimage.2

⬇️

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

⬆️

Reading the discussion of the paper by other statisticians is enlightening as to how the tone of scientific discourse has mercifully changed in 50 years.

Also, "The term 'assessment' is preferred to 'validation' which has a ring of excessive confidence about it."

⬇️

#machineLearning #statistics #crossValidation

Chloé Azencott cazencott@lipn.info
2024-05-29

We were discussing cross-validation estimates of model performance recently with colleagues, and I dug a bit in the literature to better understand where we're at.

This is not my topic of expertise, but here are a few tidbits I'd like to share.

1) cross-validation has been the topic of much discussion for many decades. Stone (1974) jstor.org/stable/2984809 gives a good overview of what came before.

⬇️

#machineLearning #statistics #crossValidation

Christophe Bousquet KrisAnathema@fediscience.org
2024-05-21

#Interpersonal #HeartRate #synchrony predicts effective #information #processing in a #naturalistic #group #DecisionMaking task

@PNASNews

"Heart rate synchrony predicted the probability that groups would reach the correct #consensus with >70% #CrossValidation accuracy, thus, providing a #biomarker of interpersonal engagement that facilitates adaptive #learning and effective information #sharing during #collective decision-making"

pnas.org/doi/10.1073/pnas.2313

Martin Modrák modrak_m@bayes.club
2024-04-04

New on the blog: I explore the connection between Bayes factors and cross-validation and explain why I think it does not justify the use of Bayes factors in most cases. martinmodrak.cz/2024/03/23/cro

#bayesian #BayesFactors #stats #CrossValidation

2023-07-25

Enjoying the discussion of cross-validation methods for use of sensor data for air quality applications at the EPA air sensor QA workshop. It’s easy to overestimate how well you are doing with sensor data corrections or fusion applications unless a rigorous independent test approach is used #airquality #airpollution #crossvalidation #lowcostsensors @dwestervelt epa.gov/amtic/2023-air-sensors

ipofanes ipofanes
2023-07-11

I see "pre-season preparation" and can't tell whether it was a training match or a test match. As a statistician, though, I should be able to keep the two strictly apart.

[Image: a match ticket held up, showing the club crests of VfL Bochum and Kickers Emden, with sports facilities and the edge of a forest in the background.]

Daniele de Rigo dderigo@hostux.social
2023-06-10

3/
#Feynman: "it doesn’t make any sense to calculate after the event. You see, you found the peculiarity, and so you selected the peculiar case"
archive.org/details/meaningofi

Special trending case: #CrossValidation (where data for selecting/tuning a model are also used to test it, with allegedly "clever" methods to avoid fooling oneself) and other #MachineLearning math. tricks where many dimensions/parameters are tuned by using much less data

Without a deep understanding, black-box tools lead astray

Tiago F. R. Ribeiro tiago_ribeiro
2023-02-01

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

arxiv.org/pdf/1811.12808.pdf
