Lmst

@data @datadon 🧵

Accuracy! One way to counter regression dilution is to add a constraint to the statistical model.
Regression Redress restrains bias by segregating the residual values.
My article: http://data.yt/kit/regression-redress.html

#bias #modeling #dataDev #AIDev #modelEvaluation #regression #modelling #dataLearning #linearRegression #probability #probabilities #statistics #stats #correctionRatio #ML #distributions #accuracy #RegressionRedress #Python #RStats

@data @datadon 🧵

How to assess a statistical model?
How to choose between variables?

Pearson's #correlation only measures linear association, so it is a poor guide if you suspect that the relationship is not a straight line.

If the relationship is monotonic:
"#Spearman’s rho is particularly useful for small samples where weak correlations are expected, as it can detect subtle monotonic trends." It is "widespread across disciplines where the measurement precision is not guaranteed".
"#Kendall’s Tau-b is less affected [than Spearman’s rho] by outliers in the data, making it a robust option for datasets with extreme values."
Ref: https://statisticseasily.com/kendall-tau-b-vs-spearman/

#normality #normalDistribution #modeling #dataDev #AIDev #ML #modelEvaluation #regression #modelling #dataLearning #featureEngineering #linearRegression #probability #probabilities #statistics #stats #correctionRatio #Pearson #bias #regressionRedress #distributions

@data @datadon 🧵

Redressing #Bias: "Correlation Constraints for Regression Models":
Treder et al (2021) https://doi.org/10.3389/fpsyt.2021.615754

#dataDev #linearRegression #modeling #probability #probabilities #statistics #stats #modelling #regression #correctionRatio #skLearn #scikitLearn #python #AIDev

"A generalized linear model or #GLM consists of three components:
1. A random component, specifying the conditional distribution of the response variable, Yi (for the ith of n independently sampled observations). […]
2. A linear predictor—that is a linear function of regressors,
ηi = α + Σj Xij*βj
3. A smooth and invertible link function g(·), which transforms the expectation of the response variable, μi ≡ E(Yi), to the linear predictor:
g(μi) = ηi"

https://www.sagepub.com/sites/default/files/upm-binaries/21121_Chapter_15.pdf
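
To make components 2 and 3 concrete, here is a small sketch assuming a log link (as in a Poisson GLM). The coefficients and the regressor values are invented for illustration, and the function names are not from any library.

```python
import math


def linear_predictor(alpha, betas, x_row):
    """eta_i = alpha + sum_j X_ij * beta_j"""
    return alpha + sum(x * beta for x, beta in zip(x_row, betas))


def log_link(mu):
    """g(mu) = log(mu): maps the expected response to the linear predictor."""
    return math.log(mu)


def inverse_log_link(eta):
    """g^-1(eta) = exp(eta): maps the linear predictor back to the mean."""
    return math.exp(eta)


# Hypothetical fitted coefficients and one observation's regressors.
alpha = 0.5
betas = [0.2, -0.1]
x_row = [3.0, 4.0]

eta = linear_predictor(alpha, betas, x_row)  # 0.5 + 0.6 - 0.4
mu = inverse_log_link(eta)                   # expected response E(Y_i)

print(f"eta = {eta:.3f}, mu = {mu:.3f}, g(mu) = {log_link(mu):.3f}")
```

Because the link is invertible, applying g to μ recovers η, which is the third component of the definition.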

#dataDev #regression

@data @datadon

#DataViz on two requirements:
* zooming, panning and rescaling
* shareable dashboards

"Plotly vs. Bokeh: Interactive Python Visualisation Pros and Cons", by Dr Paul Iacomi: https://pauliacomi.com/2020/06/07/plotly-v-bokeh.html

#dataDev #retrieval #dataMining #plotly #Dash #Bokeh #python #dataInteraction #data #dataDon #widgets #ipython #jupyter #dashboards #businessIntelligence

#DataViz Decision-Making Guide

"How do you decide between #Plotly and #Seaborn?
* If you need interactive and dynamic visualizations, especially for dashboards or 3D data, Plotly is the way to go.
* If you’re focused on statistical analysis, creating publication-ready visuals, or conducting exploratory data analysis, Seaborn is likely your best choice."
by Amit Yadav: https://medium.com/@amit25173/plotly-vs-seaborn-f7207dd3e642

#dataDev #retrieval #dataMining

"Technical people are blind to the fact they automatically solve dozens of problems every day in their regular workflow, any single one big enough to block another user for a few hours. Without even thinking about it."

"There are usually two kinds of coders giving advises. A fresh one that has no idea how complex things really are, yet. Or an experienced one, that forgot it."

@bitecode https://www.bitecode.dev/p/why-not-tell-people-to-simply-use 🧵

#dev #dataDev #install #anaconda #packages #Python #tech #packaging #complexity

"The #gamma GLM is a relatively assumption-light means of #modeling non-negative data, given gamma's flexibility.
[…]
Explaining what is used and what is not used, despite merits and demerits […]: Loosely, the larger the internal literature in any field on modelling techniques, the less inclined people in that field seem to be to try something different."

Nick Cox, 2013: https://stats.stackexchange.com/questions/67547/when-to-use-gamma-glms

#normality #normalDistribution #Γ #modelling #dataDev #AIDev #ML #AIEvaluation #logNormal

@datadon

#Lasso #LinearRegression "is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent"

https://scikit-learn.org/stable/modules/linear_model.html#lasso 🧵
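
A small sketch of that tendency, assuming synthetic data in which only the first two of ten features matter. The data, the seed and `alpha=0.1` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: 10 features, but y depends only on the first two.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha sets the strength of the L1 penalty; larger values zero out more coefficients.
model = Lasso(alpha=0.1).fit(X, y)

n_nonzero = int(np.count_nonzero(model.coef_))
print("coefficients:", np.round(model.coef_, 2))
print("non-zero coefficients:", n_nonzero, "of", X.shape[1])
```

The penalty also shrinks the coefficients it keeps, so they come out somewhat smaller in magnitude than the values used to generate the data.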

#dataDev #AIDev #ML #sklearn #python #interpretability

@data "practitioners can leverage #LASSO regression to construct more interpretable and predictive models that excel in scenarios involving high-dimensional data and intricate feature relationships."

https://datasciencedecoded.com/posts/12_LASSO_Regression_Feature_Selection_Predictive_Models

#dataDev #interpretability #AIDev

How to identify and handle duplicate values: https://stackabuse.com/handling-duplicate-values-in-a-pandas-dataframe/
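
A short sketch of the usual pandas calls, on a made-up frame (the column names `id` and `value` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 3],
        "value": ["a", "b", "b", "c", "d"],
    }
)

# Identify: True marks a row that repeats an earlier row in every column.
dup_mask = df.duplicated()

# Handle: drop fully duplicated rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Handle: treat rows as duplicates when "id" repeats, keeping the last one.
by_id = df.drop_duplicates(subset="id", keep="last")

print(dup_mask.tolist())
print(deduped)
print(by_id)
```

Whether to keep the first row, the last row, or none (`keep=False`) depends on what the repeated rows mean in the data at hand.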

#dataDev #Python #Pandas #dataAnalysis #statistics #stats #dataScience

@datadon

Unfilled cells influence models.
"Handling Missing Data in Machine Learning": https://ml-nn.eu/a1/51.html by Calin Sandu @mlnn

#missingData #bias #wealth #dataQuality #complexity #dataDev #machineLearning #dataPrep #EDA #dataWrangling
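
A tiny sketch of why the handling choice matters, on invented numbers: the same column gives different summaries depending on how its empty cells are treated.

```python
import pandas as pd

# Hypothetical column with two unfilled cells.
scores = pd.Series([10.0, None, 30.0, None, 50.0])

n_missing = int(scores.isna().sum())

# pandas skips missing values by default when averaging.
mean_skipping = scores.mean()

# Filling with zero drags the average down.
mean_zero_filled = scores.fillna(0).mean()

# Filling with the median keeps the centre but shrinks the spread.
median_filled = scores.fillna(scores.median())

print(n_missing, mean_skipping, mean_zero_filled)
print(scores.std(), median_filled.std())
```

Any statistic or model fitted afterwards inherits whichever of these choices was made.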

A categorical variable takes on a limited number of values.
The categorical #dataType is useful in the following cases:
- A string variable with only a few distinct values: converting it with df[["label"]].astype("category") saves memory.
- The lexical order is not the same as the logical order (“one”, “two”, “three”). Sorting and min/max will use the logical order.
- As a signal to other libraries to treat as a category.

More: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
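
A brief sketch of the first two cases, assuming a made-up `label` column; the exact byte counts depend on the pandas version, so only the comparison is shown.

```python
import pandas as pd

df = pd.DataFrame({"label": ["one", "three", "two", "one", "three"] * 1000})

# Case 1: few distinct values, so the categorical column is smaller.
bytes_as_text = df["label"].memory_usage(deep=True)
bytes_as_category = df["label"].astype("category").memory_usage(deep=True)
print(bytes_as_category < bytes_as_text)

# Case 2: give the categories their logical order explicitly.
ordered = pd.Series(
    pd.Categorical(df["label"], categories=["one", "two", "three"], ordered=True)
)

print(df["label"].max())  # lexical order: "two" sorts last
print(ordered.max())      # logical order: "three" is the largest
```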

#dataDev #Python #Pandas #dataAnalysis #statistics #stats

#PythonGotcha 🧵

It is often useful to make a #copy of a given list before performing operations that would mutate the elements.

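When you make a shallow copy of an existing list, you create a new list object whose items are references to the same element objects as the original. (It saves memory, but mutating a shared element is visible through both lists.)

On the other hand, if you make a deep copy, you create a new list whose elements are themselves copied recursively, so it is fully independent of the original.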

In other words: https://realpython.com/python-mutable-vs-immutable-types/#making-copies-of-lists
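
A minimal sketch with a nested list, using the standard `copy` module:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)   # new outer list, same inner lists
deep = copy.deepcopy(original)  # new outer list and new inner lists

shallow[0].append(99)  # mutates an inner list that original also holds
deep[1].append(-1)     # only touches the deep copy

print(original)  # [[1, 2, 99], [3, 4]]
print(shallow)   # [[1, 2, 99], [3, 4]]
print(deep)      # [[1, 2], [3, 4, -1]]
print(shallow[0] is original[0], deep[0] is original[0])  # True False
```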

#learning #objects #python #dev #CS #DataScience #memory #dataDev

"Extract Year from a datetime column", by Piyush Raj: https://datascienceparichay.com/article/pandas-extract-year-from-datetime-column/

#dataDev #Python #Pandas #timeSeries #data #inference #dataAnalysis #statistics #stats
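
A short sketch of the extraction, assuming a hypothetical `joined` column that arrives as text:

```python
import pandas as pd

df = pd.DataFrame({"joined": ["2019-03-15", "2021-11-02", "2020-07-30"]})

# Parse the strings into datetimes first; the .dt accessor needs a datetime column.
df["joined"] = pd.to_datetime(df["joined"])

# Extract the year as an integer column.
df["year"] = df["joined"].dt.year

print(df)
```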

#dataDev
