#dataengineering

pipTrends @piptrends
2025-07-17

The Astral team published a guide on moving from a pip and pip‑tools workflow based on requirements files to uv’s project workflow using pyproject.toml and uv.lock.

docs.astral.sh/uv/guides/migra

2025-07-17

Discover how CocoIndex transforms data orchestration with a pure Data Flow Programming model — ensuring traceable, immutable, and declarative pipelines for know hackernoon.com/redefining-data #dataengineering

pipTrends @piptrends
2025-07-16

@hynek released another great video on uv, where he explained how he uses the just tool to store commands in a cross‑platform, portable way for everyday tasks like installing or refreshing virtual environments, running tests and code checks and even development tasks like sending requests.

youtube.com/watch?v=TiBIjouDGuI

2025-07-15

Ever wonder about the mind behind Pandas & Apache Arrow? 🤔 Ep. 2 of #TheTestSet (Part 1!) unpacks Wes McKinney's journey – including his speedrunning past! What makes good tools good?

🎧 Listen at thetestset.co, on Spotify, or Apple Podcasts

#DataStack #DataEngineering #Pandas #OpenSource #PodcastLaunch #Python

Will Hopkins 🌈📸 @willhopkins@a2mi.social
2025-07-15

#dataengineering If you needed to use a data lake with Redshift, would you use Iceberg, given some native support, over Delta Lake, which is arguably a better format?

Asking for a friend who is me

blaze.email @blazeemail
2025-07-15

🔍 Excited about AXLearn for modular ML training, Pinterest's Moka for massive data processing, and PromiseTune for causal configuration tuning!

blaze.email/Machine-Learning-E

N-gated Hacker News @ngate
2025-07-14

Ah, yes, the riveting saga of cramming "user-defined indexes" into Apache Parquet files. 😴 Because who doesn’t love a story about exploiting footer metadata to do something nobody asked for? Next time, tell us how to alphabetize your sock drawer using ForestDB. 🧦📚
datafusion.apache.org/blog/202

Recce - Trust, Verify, Ship @DataRecce
2025-07-14

Excited to sponsor the inaugural Data Debug SF happy hour! 🍻

Data Debug SF is building a data engineering community for people who get the "why doesn't this data add up" struggle

Tues July 29 • 5:30-7:30pm • details: lu.ma/g92ckftj

The Data Channel @theDataChannel
2025-07-12

Data dose of the Day: Day 12

🔹 Tip: Use data profiling tools like Deequ, Great Expectations, or Spark describe() before writing any transformation logic.

🔸 Why?: Avoid incorrect assumptions (nulls, cardinality, outliers). A bad assumption = a bad model or broken report.
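
A minimal sketch of the kind of check the tip describes, using only the Python standard library rather than Deequ, Great Expectations, or Spark. The `profile_column` helper, the IQR outlier rule, and the `order_amounts` sample are illustrative assumptions, not taken from the post.

```python
from statistics import quantiles


def profile_column(values):
    """Summarize nulls, cardinality, and IQR-based outliers for one column."""
    non_null = [v for v in values if v is not None]
    summary = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "outliers": [],
    }
    # Outlier detection only makes sense for numeric data with a few points.
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if len(numeric) >= 4:
        q1, _, q3 = quantiles(numeric, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        summary["outliers"] = [v for v in numeric if v < low or v > high]
    return summary


# Hypothetical column: two missing values and one suspicious amount.
order_amounts = [12.5, 14.0, None, 13.2, 12.9, 950.0, 13.7, None, 14.1]
print(profile_column(order_amounts))
```

Running a summary like this before writing transformations surfaces the nulls and the extreme value early, which is the point of the tip.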

pipTrends @piptrends
2025-07-12

This week's pip Trends newsletter is out, covering interesting work by Vivis Dev, Jacob Padilla, Adrien, the Dash0 team, and Simon Willison.

newsletter.piptrends.com/p/mak

pipTrends @piptrends
2025-07-10

Logs are crucial for understanding what's happening inside your application, especially in production. This article by the Dash0 team is a must-read for mastering logging in Python. It covers structured JSON logs, centralised logging configuration, enriching logs with contextvars, and other modern observability practices.

dash0.com/guides/logging-in-py
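
A rough sketch of two of the ideas mentioned above, structured JSON logs and enrichment via contextvars, using only the standard library. It is not the article's code: the names `request_id`, `ContextFilter`, `JsonFormatter`, and `build_logger` are hypothetical, and the in-memory stream is used only to keep the example self-contained.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context; any ContextVar works the same way.
request_id = contextvars.ContextVar("request_id", default=None)


class ContextFilter(logging.Filter):
    """Copy the current request_id onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream):
    """Configure one logger in one place, writing JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(ContextFilter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("orders", buffer)
request_id.set("req-123")
log.info("order %s accepted", 42)
print(buffer.getvalue().strip())
```

In a real service the stream would typically be `sys.stdout` or `sys.stderr`, and the context variable would be set once per incoming request.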

pipTrends @piptrends
2025-07-08

In this article, Jacob Padilla walked through building a simple HTTP server from scratch using asyncio Protocols - no external libraries. It's a great way to understand how an HTTP server works under the hood.

jacobpadilla.com/articles/asyn
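
For a sense of what an HTTP server built on asyncio Protocols can look like, here is a deliberately tiny sketch. It is not Jacob Padilla's implementation: the `HelloHttpProtocol` class and the `fetch_once` helper are hypothetical, the server ignores request bodies and headers, and it always answers 200.

```python
import asyncio


class HelloHttpProtocol(asyncio.Protocol):
    """Answer every request with a plain-text greeting, then close."""

    def connection_made(self, transport):
        self.transport = transport
        self.buffer = b""

    def data_received(self, data):
        self.buffer += data
        if b"\r\n\r\n" not in self.buffer:
            return  # headers not complete yet; wait for more bytes
        request_line = self.buffer.split(b"\r\n", 1)[0].decode("latin-1")
        parts = request_line.split(" ")
        path = parts[1] if len(parts) > 1 else "/"
        body = f"Hello from {path}\n".encode()
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain; charset=utf-8\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n\r\n"
        ).encode()
        self.transport.write(head + body)
        self.transport.close()


async def fetch_once(path="/ping"):
    """Start the server on a free local port, send one GET, return the reply."""
    loop = asyncio.get_running_loop()
    server = await loop.create_server(HelloHttpProtocol, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"GET {path} HTTP/1.1\r\nHost: localhost\r\n\r\n".encode())
    await writer.drain()
    response = await reader.read()  # server closes, so read until EOF
    writer.close()
    await writer.wait_closed()
    server.close()
    return response.decode()


if __name__ == "__main__":
    print(asyncio.run(fetch_once()))
```

The protocol callbacks (`connection_made`, `data_received`) are the whole server; everything in `fetch_once` exists only to exercise it from the same process.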

Recce - Trust, Verify, Ship @DataRecce
2025-07-08

How do you review data changes in PRs? 📊

A) Auto-diff everything
B) Explore impact then validate
C) Manual spot checks
D) No review

We're seeing a big shift from A → B, see datarecce.io/blog/recce-vs-dat
But curious what's working or not

2025-07-07

AI-Ready Data: how to fine-tune an LLM painlessly and with maximum payoff

In recent months I keep running into the same conclusion: adoption of LLM systems (especially those using the RAG approach) is held back not by the model itself but by the lack of high-quality data. The most expensive part of the process is not launching the pipeline or choosing the architecture, but preparing structured, cleaned, and correct data suitable for training or fine-tuning models. This approach is increasingly called AI-Ready Data.

habr.com/ru/companies/naumen/a

#AIReadyData #LLM #DataEngineering #RubyOnRails #RAG #LowCode

Rami Krispin :unverified: @ramikrispin@mstdn.social
2025-07-05

My weekly newsletter is out! 🚀

This week's agenda:
🔹 Open Source of the Week - The dagster project
🔹 New learning resources - Forecasting with linear regression, multi-model LLM, multiprocessing with Python
🔹 Book of the week - Visualization for Social Data Science by Roger Beecham

📌 Join 29k subscribers and subscribe to get weekly updates 🗞️👇🏼
ramikrispin.substack.com/p/the

#DataScience #DataEngineering #Python #RStats #AI #OpenSource

pipTrends @piptrends
2025-07-04

In this video, @koaning and Johnny demoed some of the powerful features of Marimo Notebook and walked through the amazing things you can build with it. It’s a great showcase of how Marimo stands out from traditional notebooks.

youtube.com/watch?v=V77fXADveo0

Praveen Kumar @Praveen323
2025-07-04

📢 Join our Azure Data Engineer with Data Factory training!
🗓️ 10th July | ⏰ 6:30 PM IST
👨‍🏫 Led by Mr. Gareth – Get real-world exposure in Azure Pipelines, ETL tools & Synapse
🔗 tr.ee/AZD10JU

Sébastien Stormacq @sebsto
2025-07-04

🎧 New episode in the AWS Developers podcast

Join Poonam Pratik for a practical discussion on serverless ETL, AWS Glue, and data pipeline monitoring. Real examples, tested solutions.
