
📢📢📢 The wait is almost over! Vaulted Backup for Azure Data Lake Storage is now in Public Preview. #backup #datalake #Azure https://azure.microsoft.com/en-us/updates

Spark Connect: Are Changes Really Needed?

Hi, Habr! I'm Stanislav Gabdulgaziev, an architect in the sales support department at Arenadata. Apache Spark has long held a firm place as one of the key tools in the arsenal of engineers and data scientists working with big data. Its ability to process huge volumes of information quickly, its flexibility thanks to support for multiple languages (Python, Scala, Java, SQL), and its ability to handle a wide variety of tasks, from complex ETL to machine learning and streaming, make it an indispensable tool in the world of data analysis.

https://habr.com/ru/companies/arenadata/articles/921246/

#spark_connect #apache #datalake #lakehouse #платформа_данных #bigdata #dataframe #интеграция_сервисов #apache_arrow #spark

Spark 4.0 on the Horizon: Prepare to Upgrade or Stay on the Proven 3.0?

Hi, Habr! I'm Stanislav Gabdulgaziev, an architect in the sales support department at Arenadata. It seems like only yesterday we were celebrating the capabilities of Apache Spark 3.0, figuring out Adaptive Query Execution, and enjoying the improvements to the Pandas API. But the world of big data doesn't stand still, and Apache Spark 4.0 is already on its way. A new major release is always an event: it promises new features, performance gains, and, of course, new migration challenges. Apache Spark has become the de facto standard for distributed data processing. From classic ETL pipelines and SQL analytics to complex machine learning and streaming, Spark is involved in one way or another in many modern data platforms. That's why every new release draws keen interest from the community: what's under the hood? Which problems have been solved? Will something that has worked for years break?

https://habr.com/ru/companies/arenadata/articles/921252/

#spark #data_science #data_engineering #bigdata #sql #lakehouse #datalake #хранение_данных #hadoop #производительность

📺 #Netflix has introduced a new engineering specialization: Media ML Data Engineering - powered by a Media Data Lake designed to handle video, audio, text, and image assets at scale.

The impact so far:
✅ Richer ML models trained on standardized media
✅ Faster evaluation cycles
✅ Deeper insights into creative workflows

🔗 Learn more: https://bit.ly/4oWM3T3

#InfoQ #DataLake #AI

The Small Files Problem: Assessing S3 Slowdown and the Issues HDFS and Greenplum Have with Small Files

Not long ago, the Arenadata company blog published the results of testing how various distributed file systems behave when working with small files (~2 MB). The short conclusion: according to the tests, good old HDFS handles small files best, degrading by a factor of 1.5; S3 on MinIO can't cope, slowing down 8x; the S3 API over Ozone degrades 4x; and the most preferable system for working with small files, according to our colleagues, is Greenplum, including for companies in the "exabyte club". Our colleagues also did a huge amount of work searching for "theoretical confirmations of the unexpected figures". Our team found the S3 MinIO test results unconvincing, and we suspected they might stem from:
- insufficient practical experience operating SQL compute over S3, and S3 in general;
- a lack of experience with MinIO clusters, particularly in a high-load production environment with 200+ TB of compressed columnar Iceberg/Parquet data, especially in scenarios where the small files problem quickly becomes pressing;
- the specifics of the distribution builds.
We are grateful to our colleagues for the idea and the inspiration to run a similar test. Let's dig in.

https://habr.com/ru/companies/datasapience/articles/941046/

#s3 #minio #hdfs #greenplum #bigdata #lakehouse #datalake #dwh

The WAP Pattern in Data Engineering

Despite the rapid development of data engineering, the WAP pattern has long been undeservedly overlooked. Some have heard of it but don't use it. Others use it, but only intuitively. In this article I want to use an example to describe in detail a data-handling pattern that is almost 8 years old, yet in all that time not a single article has been written explaining how it works.

https://habr.com/ru/articles/937738/

#data_engineering #bigdata #big_data #data_warehouse #data_quality #warehouse #datalake #etl

We’re excited to partner with Greptime to teach you how to set up a fully #FOSS observability stack, complete with a Prometheus-compatible #datalake and real-time incident insights! https://t.ly/JNmvQ

#kubernetes #databases #devops #sre #freesoftware #sql #observability #ebpf #sysadmin #linux

📊 Your customer journeys are telling you something.
Are you listening or just watching clicks and opens?

Microsoft Fabric just changed the game for Customer Insights – Journeys users.
Now, every journey interaction, every click, every goal hit lives in OneLake, ready for real-time analysis.

👉 Curious, how are you analyzing journey drop-offs today?

#MicrosoftFabric #CustomerInsights #PowerPlatform #Dynamics365 #MarketingAnalytics #DataLake #PowerBI #MarketingOps

http://mytrial365.com/2025/08/14/customer-insights-fabric-the-marketing-analytics-match-you-didnt-know-you-needed/

There's a lot of talk about "ZeroDisk" infrastructure backed by S3. The pitch is "move your data from locally attached NVMe storage to S3 and your applications will scale easier and be more performant!"

Maybe I'm getting too old for this shit, but I swear to dog this is the 4th such cycle in my career:

1. NFS
2. iSCSI / Fibrechannel
3. Hadoop / HDFS
4. ZeroDisk with S3

Am I the only one that's like: "wait, move TBs of data to S3 from NVMe to increase performance? Are you high?"

It doesn't work, so you scale up. Now you're back to local NVMe "cache disks" running instances as expensive as the locally attached NVMe instances when you add those costs to your S3 bill. The performance is worse because of course it is.

It always comes back to the two hard problems in computer science: naming things, cache invalidation, and off-by-one errors. 😂

#zerodisk #s3 #hadoop #cache #datalake #GetOffMyLawn

It is not possible to eliminate the risk of failures, but it is possible to mitigate them by making failures explainable, detectable, and manageable. https://hackernoon.com/diving-deep-into-data-lake-observability-why-it-matters-more-than-ever #datalake

Microsoft Unveils Sentinel Data Lake to Power AI Defenses and Cut Security Costs

#Cybersecurity #Microsoft #MicrosoftSentinel #AI #CloudSecurity #SIEM #DataLake

https://winbuzzer.com/2025/07/22/microsoft-unveils-sentinel-data-lake-to-power-ai-defenses-and-cut-security-costs-xcxwbn

Simplified #metadata definition with the Data Catalog Schema Wizard

Data Fabric Cheat Sheet: #DataFabric #DataLake #InforOS.

https://quadexcel.com/wp/simplified-metadata-definition-with-the-data-catalog-schema-wizard/

⬆️ Data volumes continue to rise. In fact, within industries like #engineering and #finance, the volume and volatility of log data have even outpaced the capacity of traditional #SIEM and analytics tools. 😰 What this means is... with orgs facing high costs and fatigue, the ones that thrive will be the ones that treat storage and retrieval as distinct functions. 🤔

This is where selective retrieval comes in—the ability to triage, park, and later selectively ingest high-volume data from a centralized repository for forensic or compliance-driven investigation. 🙌

Read this excellent article by #Graylog's Adam Abernethy in BigDATAwire to learn about:
🌏 Selective retrieval examples in the real world
⚠️ Risk coverage without always-on cost
🔒 Flexibility without architectural lock-in
💻 The technological shifts that are converging to make selective retrieval possible and necessary
↔️ How selective retrieval bridges the gap between data engineering complexity and #security usability
💼 The business case for selective retrieval, especially for mid-size IT teams
🛂 Regaining control over data sprawl
➕ More

https://www.bigdatawire.com/2025/07/14/rethinking-risk-the-role-of-selective-retrieval-in-data-lake-strategies/ #datalake #logdata #datamanagement @bigabe @bigdatawirenews

New project alert! Comparqter, a tool that compacts Parquet files and optimises file sizes.

https://codeberg.org/unticks/comparqter

#rust #parquet #s3 #datalake

🎉 Huge thanks to the LanceDB CEO / cofounder Chang She for delivering an incredible talk on "Search, Retrieval, Training, and Analytics with Modern AI Data Lake" at #DataAndAIEngineering #SanFrancisco #meetup !

📹 Great news - the recording is now available! Check it out if you missed it or want to revisit the key concepts. 👇

https://watch.softinio.com/w/mVkLgtcQw8Qv5vA4v8SDHB

#DataEngineering #AIEngineering #SanFrancisco #LanceDB #DataLake #MachineLearning #VectorDB #Database #AI #ArtificialIntelligence

"Centralize Your Data Lake: Apache Polaris Supports Apache Iceberg and Now Delta Lake"

BTW 'Polaris' used to be the name of the UK nuclear deterrent pre 1996. 😬

https://snowflake.com/en/engineering-blog/apache-polaris-supports-iceberg-delta-lake/

#ApacheIceberg #ApachePolaris #DataLake

First I thought I'd found the Loch Ness Monster...turns out to be Nessie instead. 🦕

Project Nessie: Transactional Catalog for Data Lakes with Git-like semantics
"Nessie is to Data Lakes what Git is to source code repositories..."

https://projectnessie.org/

#ProjectNessie #LochNessMonster #DataEngineering #DataLake

Ah, the $10/month Lakehouses: because who wouldn't want a bargain-basement data lake with all the charm of a timeshare in purgatory? 🤔💸 Just add a sprinkle of buzzwords like "DuckLake" and "time travel" and voilà, you've got a tech article that feels like a 2-hour #infomercial for something you'll never use. 📈🔮
https://tobilg.com/the-age-of-10-dollar-a-month-lakehouses #Lakehouses #DuckLake #DataLake #TechTrends #HackerNews #ngated

Apache Iceberg Deep Dive | Part 1 | Crash Course

Lakehouse #iceberg #Apache_Iceberg #datalake #data ...

https://quadexcel.com/wp/apache-iceberg-deep-dive-part-1-crash-course/

#TBT... to an entire week ago at #RSAC where Seth Goldhammer had the chance to demo Graylog's data telemetry pipeline management! 🖥️ ⭐

Join Seth as he talks about data lakes, data lake previews, getting your data back when you need it, and more.

Wanna learn more about this topic? Here you go: https://graylog.org/post/security-data-lake-strategy/ #RSA #RSAC2025 #datalake #datamanagement #datapipeline

#datalake
