#LearningStatistics

2025-08-17

So, to recap:

- sample means and standard deviations just happen to be optimal estimators of the parameters of a Gaussian distribution
- Gaussian distributions arise naturally (Central Limit Theorem), especially when several causes add up to one effect, so we can often fall back on them
- to construct a CI one has to build a probability around something independent of the very thing we're trying to estimate (otherwise circular dep!)
- it's easy when sigma is known (literally the CLT), but to extract something without both sigma and mu we need a bit more elbow grease (Student t)
- when not Gaussian we need moar math
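
A minimal sketch of the "independent of the thing we're estimating" point, assuming Gaussian data. The populations, sample size and seeds are made up for illustration. It computes the ratio (mean - mu) / (s / sqrt(n)) for two very different populations and compares an upper quantile of each:

```python
import math
import random


def pivot_samples(mu, sigma, n, trials, seed):
    """Draw `trials` Gaussian samples of size n and return
    (mean - mu) / (s / sqrt(n)) for each of them."""
    rng = random.Random(seed)
    pivots = []
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(xs) / n
        # Sample standard deviation, 1/(n-1) version.
        s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        pivots.append((mean - mu) / (s / math.sqrt(n)))
    return pivots


def upper_quantile(values, q=0.975):
    """Empirical q-quantile, by sorting and indexing."""
    ordered = sorted(values)
    return ordered[int(q * len(ordered))]


# Two hypothetical populations with nothing in common but the sample size.
q_a = upper_quantile(pivot_samples(mu=0.0, sigma=1.0, n=5, trials=40_000, seed=1))
q_b = upper_quantile(pivot_samples(mu=250.0, sigma=40.0, n=5, trials=40_000, seed=2))
print(q_a, q_b)  # both should land near 2.776, the Student t value for 4 d.o.f.
```

If the two quantiles agree, the ratio's distribution does not care about mu or sigma, which is what lets us turn it into an interval for mu.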

#statistics #LearningStatistics

2025-08-17

stats.libretexts.org/Bookshelv

"Suppose that Z has the standard normal distribution, V has the chi-squared distribution with n∈(0,∞) degrees of freedom, and that Z and V are independent. Random variable
T=Z/√(V/n) has the student t distribution with n degrees of freedom."

This formula is very reminiscent of the one used to construct CIs of Gaussian samples with known std. dev., just with the sample estimate of sigma instead of an a priori fixed sigma.
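
A rough simulation of that definition, as a sanity check rather than a proof. It assumes an integer n (so that V can be built as a sum of n squared standard normals), and the choice n = 10 and the seed are arbitrary:

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10                  # degrees of freedom (arbitrary choice for the demo)


def draw_t(n):
    """One draw of T = Z / sqrt(V / n), built straight from the definition."""
    z = rng.gauss(0.0, 1.0)
    # For integer n, a chi-squared variable with n d.o.f. is a sum of
    # n squared independent standard normals.
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    return z / math.sqrt(v / n)


draws = [draw_t(n) for _ in range(100_000)]
# T is centred on 0, so the average of T^2 estimates its variance.
sample_var = sum(t * t for t in draws) / len(draws)
print(sample_var, n / (n - 2))  # theory: Var(T) = n / (n - 2) for n > 2
```

The variance coming out larger than 1 is the heavier tails of Student t showing up, compared with the standard normal.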

#statistics #LearningStatistics

2025-08-17

Buried in en.m.wikipedia.org/wiki/Studen is this quote, which explains a lot:

"Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t distribution. These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining."

#statistics #LearningStatistics

2025-08-17

A second, more involved realization: I wish people writing pages/articles/courses said up front why statistics textbooks are so full of more complex distributions like Student t and chi-squared, instead of harping on about their properties for 20 pages.

I now understand that:
- the sample mean often follows a Gaussian distribution
- the sample variance (suitably scaled) often follows a chi-squared distribution (I think this really needs a good visualization)
- when sigma is known a priori, CIs for the mean of a Gaussian variable come from a Gaussian distribution; when it is not, they come from a Student t distribution (its ratio cancels both the mean and the std. dev.)
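
To see the last point in numbers, here is a small coverage simulation. It is a sketch under assumptions: the population (mu = 10, sigma = 3) and the sample size of 5 are invented, and the two critical values are the usual table entries for 95% intervals.

```python
import math
import random

rng = random.Random(42)
MU, SIGMA, N = 10.0, 3.0, 5   # hypothetical population and sample size
Z_CRIT = 1.960                # 97.5% quantile of the standard normal
T_CRIT = 2.776                # 97.5% quantile of Student t with N - 1 = 4 d.o.f.

trials = 20_000
hits = {"z, sigma known": 0, "z, sigma estimated": 0, "t, sigma estimated": 0}
for _ in range(trials):
    xs = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(xs) / N
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (N - 1))
    miss = abs(mean - MU)  # how far the sample mean is from the truth
    # Each interval is mean +/- half-width; it covers MU when miss <= half-width.
    hits["z, sigma known"] += miss <= Z_CRIT * SIGMA / math.sqrt(N)
    hits["z, sigma estimated"] += miss <= Z_CRIT * s / math.sqrt(N)
    hits["t, sigma estimated"] += miss <= T_CRIT * s / math.sqrt(N)

coverage = {name: count / trials for name, count in hits.items()}
print(coverage)
```

The first and last entries should sit near 0.95, while plugging the estimated sigma into the Gaussian formula should come out visibly short of that for such a small sample.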

#statistics #LearningStatistics

2025-08-10

The last page is also part of a bunch of wiki pages that are... surely technically correct but difficult to grasp intuitively.

Note the difference between population (1/N) and sample (1/(N-1)) stats. The first has a better mean squared error but is biased with respect to the population; the second has a worse MSE but is unbiased with respect to the population.
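
A quick simulation of that trade-off for the variance, assuming Gaussian data with a true variance of 1 and tiny samples of 5 (both numbers picked just for the demo):

```python
import random

rng = random.Random(7)
SIGMA2 = 1.0            # true population variance (hypothetical)
N, TRIALS = 5, 50_000

div_by_n, div_by_n_minus_1 = [], []
for _ in range(TRIALS):
    xs = [rng.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
    mean = sum(xs) / N
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    div_by_n.append(ss / N)                # "population" formula
    div_by_n_minus_1.append(ss / (N - 1))  # "sample" formula


def average(values):
    return sum(values) / len(values)


def mse(estimates, truth):
    """Mean squared error of the estimates around the true value."""
    return average([(e - truth) ** 2 for e in estimates])


mean_n = average(div_by_n)
mean_n_minus_1 = average(div_by_n_minus_1)
mse_n = mse(div_by_n, SIGMA2)
mse_n_minus_1 = mse(div_by_n_minus_1, SIGMA2)
print("1/N:     mean", mean_n, "MSE", mse_n)
print("1/(N-1): mean", mean_n_minus_1, "MSE", mse_n_minus_1)
```

With N = 5 the 1/N estimate should average around 0.8 instead of 1 (the bias), yet its MSE should still be the smaller of the two.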

I spent some time trying to grasp that, and came to the conclusion that in practical terms it's not actionable for me yet: I either have large N, or my problem is small but more complex than a mean/var/std and I have no clue how to get an unbiased estimator for that. 🧵

#statistics #LearningStatistics

2025-08-10

Some of the answers in the last link do point out interesting results: sample mean and variance are optimal for a Gaussian distribution.

en.wikipedia.org/wiki/Unbiased adds that the midrange ((min+max)/2) would be optimal for bounded distributions with unknown bounds? 🧵
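
A small check of the midrange idea, assuming the bounded distribution is uniform (my reading of the claim, not something I have verified for other shapes). The bounds 2 and 8 and the sample size are invented:

```python
import random

rng = random.Random(3)
LOW, HIGH = 2.0, 8.0        # bounds the estimators never get to see (hypothetical)
CENTER = (LOW + HIGH) / 2   # the quantity both estimators are after
N, TRIALS = 20, 20_000

sq_err_mean = 0.0
sq_err_midrange = 0.0
for _ in range(TRIALS):
    xs = [rng.uniform(LOW, HIGH) for _ in range(N)]
    sq_err_mean += (sum(xs) / N - CENTER) ** 2
    sq_err_midrange += ((min(xs) + max(xs)) / 2 - CENTER) ** 2

mse_mean = sq_err_mean / TRIALS
mse_midrange = sq_err_midrange / TRIALS
print(mse_mean, mse_midrange)  # the midrange error should be several times smaller
```

So "mean and variance are optimal" really is a statement about the Gaussian case; change the distribution and another estimator can win.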

#statistics #LearningStatistics

2025-08-10

An intuition I haven't yet verified: when we qualify samples using means and standard deviations, a hidden assumption is often made of a normal (Gaussian) distribution.

This might be what we want (the central limit theorem applies in a lot of cases, and is essentially "throw enough distributions together in a big bowl, mix them up and you end up with a normally distributed smoothie") but this is not always the case.
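
The smoothie can be sketched in a few lines. The recipe here is an assumption for illustration: each observed effect is the sum of 30 independent causes, none of them Gaussian.

```python
import math
import random

rng = random.Random(11)


def one_effect():
    """One observed effect = sum of 30 independent, non-Gaussian causes."""
    total = 0.0
    for _ in range(10):
        total += rng.uniform(0.0, 1.0)    # flat cause
        total += rng.expovariate(1.0)     # skewed cause
        total += rng.choice((0.0, 1.0))   # coin-flip cause
    return total


effects = [one_effect() for _ in range(20_000)]
mean = sum(effects) / len(effects)
std = math.sqrt(sum((e - mean) ** 2 for e in effects) / (len(effects) - 1))

# For a Gaussian, about 95% of values fall within 1.96 std. dev. of the mean.
within = sum(abs(e - mean) <= 1.96 * std for e in effects) / len(effects)
print(within)
```

A fraction close to 0.95 suggests the mix already behaves a lot like a Gaussian, even though none of its ingredients does.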

stats.stackexchange.com/questi has more to say on this, but I'm not fully satisfied because it focuses on the pure theoretical math side of things, not on how people actually interpret it. 🧵

#statistics #LearningStatistics

2025-08-10

What made my gears turn a little was: what if, instead of adding more data, you can only take subsets of your samples? For example, say you're writing a color picker tool for a photo. Different subsets of equal pixel count (the size of the picker tool) come out of a Poisson distribution (plus extra after processing)... and their mean will change even if the color is supposed to be uniform.

The sample mean is itself a random variable, which means we can do statistics on it. For example, we can compute its mean (whose difference from the actual mean is the bias) and its standard deviation (which is called the standard error of the estimator of the mean). 🧵
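
A toy version of the color picker, assuming pure Poisson noise on a uniform patch (no processing on top). The expected count per pixel, picker size and number of picker positions are all hypothetical:

```python
import math
import random

rng = random.Random(5)
LAM = 20.0        # expected count per pixel for the "uniform" colour (hypothetical)
PICKER = 25       # pixels under the picker, e.g. a 5x5 square
PATCHES = 2_000   # how many different picker positions we try


def poisson(lam):
    """Knuth's multiplication method; fine for a small lam like this one."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


patch_means = []
for _ in range(PATCHES):
    pixels = [poisson(LAM) for _ in range(PICKER)]
    patch_means.append(sum(pixels) / PICKER)

mean_of_means = sum(patch_means) / PATCHES
std_error = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in patch_means) / (PATCHES - 1)
)
print(mean_of_means - LAM)                 # estimated bias, should be near 0
print(std_error, math.sqrt(LAM / PICKER))  # spread of the mean vs. sigma / sqrt(n)
```

Every picker position gives a slightly different mean, and the spread of those means should shrink like one over the square root of the picker size.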

#statistics #LearningStatistics

2025-08-10

So, first, maybe means? I know, the things below might have been evident, I've been starting from a very low bar okay?

I hear about means and averages all the time.

One thing that surprised me a few years ago was that the sample mean of a random variable is itself a random variable.

This was not very obvious to me; my naive viewpoint was, I think, colored by the fact that if you're just looking at tests where you decide the sampling (e.g. do a poll on 100 people), well, you can just add 100 more and get better results, and the law of large numbers says you should get better as you add more, right? 🧵

#statistics #LearningStatistics

2025-08-10

One big motivation for this is that people often shove (pseudo-)statistical results under my nose. In some cases it "looks" like they did due diligence but more often than not the significance results look fuzzily fishy — and I can't argue why with confidence because I don't have enough statistical literacy.

A secondary motivation is that many fundamental results, papers, standards and recommendations on #ColourScience are based on statistics around psych and physical tests, whether for good or bad. But these results still elude me for the most part. 🧵

#statistics #LearningStatistics

2025-08-10

So, I'm going to start a thread on trying to better understand #statistics, in case anybody is interested. Boosts and/or clarifications welcome and appreciated!

It's been bugging me for a while that I don't seem to have a good intuitive grasp of statistics, and this is despite graduating from an engineering school — while I did get courses on probability theory and stuff like Markov chains or EM algorithms and whatnot these were engineering-focused. Case in point, I can't say I "get" confidence intervals. Neither do I understand statistical tests and the p-value outputs that are often presented as "obvious" in other fields. 🧵

#LearningStatistics

2025-03-26

The graphs and statistics in this UN report are fascinating. It might provide good exercises for a data or statistics class: redesign the pie charts, find the original figures and design tables... Just learn from it. It's a shame the text was garbled when I tried to copy directly from the .pdf.
unodc.org/documents/data-and-a
#HomicideRates #MurderRate #StatisticsClass #LearningStatistics #UncertaintyGraphs #UncertaintyVisualization

A screenshot of FIG. 1 on page 150 of the UN .pdf linked in the post. The figure shows the "Rates of homicide, suspects brought into contact with the police and people convicted of homicide per 100,000 population..>" for the "selected regions": Americas, Asia, Europe, Global. The data is from "2021 or latest year available".
The bar chart (nice thin bars, a decent ink-to-info ratio I guess) allows easy comparison of the "Victims of Homicide", "Suspects brought into formal contact with police", and "People Convicted" for each region. The figures, in that order, are:
Americas (25 countries): 17.9, 7.7, 3.4
Asia (17 countries): 2.6, 4.9, 1.5
Europe (35 countries): 2.5, 2.8, 2.0
Global (82 countries): 4.4, 4.7, 1.8
The Americas seem like an outlier region. Eduardo Galeano probably points us towards a convincing explanation...

Another screenshot of a FIG. from the UN .pdf linked in the post. This graph on page 132 shows "Regional shares of homicides by type of known mechanism, 2021" for the regions Europe, Asia, Americas, and World. My attention was drawn to the visualization of an "uncertainty range": a blue line spanning a certain distance around the pink dot and number for each observation (well, up to it in the case of the 75% firearm murders in the Americas). This graph could motivate learners of statistics working with confidence intervals and inferring uncertainty. Other parts of the document mention OLS regression ("Pooled cross-sectional OLS regression estimates predicting the (ln) homicide rate..") that could help motivate people to keep learning statistics in order to better understand documents like this, and to think of policy with greater confidence...

2022-11-19

An #introduction post: I love #Statistics and #Learning, which means that I love both #StatisticalLearning and #LearningStatistics. Currently, most of my professional focus is on #teaching #ResearchMethods and Statistics to #undergraduate #psychmajors. I also love #photography, #travel, and #food (who doesn’t), and have recently figured out a way to combine all of these loves into a three-week #StudyAbroad trip to #Japan where I get to teach a class on #PsychologyOfLanguage.
