No response yet to my #syslog_ng #HDFS destination question:

https://www.syslog-ng.com/community/b/blog/posts/deprecating-java-based-drivers-from-syslog-ng-is-hdfs-next

Most likely it means that we can drop #Hadoop support from syslog-ng without any complaints. But I'd rather repeat my question a few more times on my #socialmedia accounts...

The Impact of Small Files on Big Data: HDFS vs S3

Hi, Habr! I'm Stanislav Gabdulgaziev, an architect in the presales department at Arenadata. In this article we look at how a large number of small files affects the performance of different storage systems, such as HDFS and object stores with an S3 API. We examine which storage technologies are best suited to small files in Data Lake and Lakehouse architectures, and compare the performance of HDFS and S3-API object stores. Using concrete tests, we show why HDFS handles large numbers of small files more efficiently. We also discuss cases where small files are not merely undesirable but unavoidable, for example in approaches such as Change Data Capture (CDC). Tests, charts, insights

https://habr.com/ru/companies/arenadata/articles/915684/

#bigdata #hdfs #s3 #hadoop #data_lake #lakehouse #impala #spark #хранение #minio
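
To give a feel for why file size matters in the comparison above, here is a minimal sketch of the file-count arithmetic. The 1 TiB dataset and the 128 MiB and 1 MiB average file sizes are illustrative assumptions, not figures from the article; `file_count` is a hypothetical helper.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def file_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Files needed to hold total_bytes at the given average file size (rounded up)."""
    return -(-total_bytes // avg_file_bytes)  # integer ceiling division


# Assumed example: the same 1 TiB stored as large files vs. small files.
large_files = file_count(TIB, 128 * MIB)
small_files = file_count(TIB, 1 * MIB)

print(f"128 MiB files: {large_files}")
print(f"1 MiB files:   {small_files}")
print(f"ratio:         {small_files // large_files}x more objects to track")
```

Every one of those files is a separate metadata entry and a separate open/read request for whichever storage system holds it, which is the overhead the article's tests set out to measure.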

# What Is Hadoop? Overview and Role #Hadoop #BigData #PhânTíchDữLiệu #HệThốngPhânTán #LưuTrữDữLiệu

General introduction to Hadoop: Hadoop is an open-source software platform designed to store and process large volumes of data (Big Data) efficiently and reliably. It is not a single technology but a collection of modules that work together, providing a…

https://bietduoc.io.vn/2025/06/03/hadoop-la-gi-tong-quan-va-vai-tro-hadoop-bigdata-phantichdulieu-hethongphantan-luutrudulieu/

SortMergeJoin in Apache Spark

Let's look at how SortMergeJoin is implemented in Apache Spark, and take a peek at the source code on GitHub along the way. Spark is written in Scala, and all of the operator's logic is available in the project's open repository. Right here :) The first thing we'll look at is the case class constructor. 1. The SortMergeJoinExec constructor

https://habr.com/ru/companies/gnivc/articles/914932/

#spark #join #hadoop #bigdata #mapreduce
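
For readers who have not met the algorithm before, the following is a minimal sketch of a generic sort-merge inner join in plain Python. It is not Spark's `SortMergeJoinExec`; the function name, the in-memory lists, and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple


def sort_merge_join(
    left: Sequence[Any],
    right: Sequence[Any],
    left_key: Callable[[Any], Any],
    right_key: Callable[[Any], Any],
) -> List[Tuple[Any, Any]]:
    """Inner join: sort both inputs by key, then merge them with two cursors."""
    ls = sorted(left, key=left_key)
    rs = sorted(right, key=right_key)
    out: List[Tuple[Any, Any]] = []
    i = j = 0
    while i < len(ls) and j < len(rs):
        lk, rk = left_key(ls[i]), right_key(rs[j])
        if lk < rk:
            i += 1  # left key is too small, advance the left cursor
        elif lk > rk:
            j += 1  # right key is too small, advance the right cursor
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(rs) and right_key(rs[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against the whole right run.
            while i < len(ls) and left_key(ls[i]) == lk:
                out.extend((ls[i], r) for r in rs[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical sample data: (id, name) joined with (user_id, item).
users = [(2, "bob"), (1, "ann"), (3, "cy")]
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (4, "ink")]
for row in sort_merge_join(users, orders, lambda u: u[0], lambda o: o[0]):
    print(row)
```

Because both sides are sorted first, the merge phase only moves forward through each input, which is why this join style suits large inputs that are already sorted or cheap to sort.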

Command-line Tools can be 235x Faster than your Hadoop Cluster
"This find | xargs mawk | mawk pipeline gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation."

#complexity #ShellTools #RightToolForTheRightJob #Hadoop #computing
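
The numbers in the quote can be unpacked with a little arithmetic. This is a rough sketch that uses only the three quoted figures (12 seconds, 270 MB/sec, 235x) and assumes the 235x ratio compares throughput; the function names are hypothetical.

```python
def implied_data_mb(runtime_s: float, throughput_mb_s: float) -> float:
    """Data volume implied by a runtime and a sustained throughput."""
    return runtime_s * throughput_mb_s


def baseline_throughput_mb_s(throughput_mb_s: float, speedup: float) -> float:
    """Throughput of the slower system, assuming the speedup is a throughput ratio."""
    return throughput_mb_s / speedup


runtime_s = 12          # quoted shell pipeline runtime
throughput_mb_s = 270   # quoted shell pipeline throughput
speedup = 235           # quoted ratio versus the Hadoop implementation

data_mb = implied_data_mb(runtime_s, throughput_mb_s)
hadoop_mb_s = baseline_throughput_mb_s(throughput_mb_s, speedup)

print(f"implied data volume: about {data_mb:.0f} MB")
print(f"implied Hadoop throughput: about {hadoop_mb_s:.2f} MB/sec")
```

Under that assumption, the pipeline chewed through roughly 3.2 GB while the Hadoop job was moving a little over 1 MB per second.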

How I deleted the clickstream, and how it was brought back from oblivion

Hi everyone! I'm Dmitry Nemchin from T-Bank. This is a not-very-successful story about how I deleted data and what came of it. I've been in IT for more than 12 years; I started out as a DBA and developer in hardcore enterprise with Oracle. In 2015 I met Greenplum at T, and I've stayed here ever since. In 2017 I began leading a team; then things got a bit more complicated and there was more than one team. You may have seen me as the organizer of Greenplum meetups in Russia. But teams and management aside, my hands still itch to do the work..

https://habr.com/ru/companies/tbank/articles/910030/

#parquet #удаление_данных #fail_story #hadoop

Spark secrets in Arenadata Hadoop: how we sped up building data marts for ML tasks

Hi, Habr! I'm Dmitry Zhikharev, CPO of the RAISA Artificial Intelligence Platform at the AI Lab of RSHB-Intech. In this article, our platform's architect Alexander Ryndin @aryndin9999 and I describe how we built the interaction between the AI Platform and the Data Lake for working with machine learning model data marts using Spark.

https://habr.com/ru/companies/rshb/articles/904072/

#spark #arenadata #hadoop #datalake #витрина_данных #ai #платформа #livy

Methods for extending the attribute set of database tables

Imagine a scene from an ideal data world where everything is stable, nothing changes, and no changes are on the horizon. The analyst has fully agreed the data mart requirements with the customer, designed the solution, and handed it over to development. The developers have deployed the data mart to production, users are happy, everything works correctly, and no support from developers or analysts is needed. Got the picture? But as we know, "IT" and "change" are synonyms, so into this ideal world, like a bolt from the blue, come new requirements: build a tool for regularly adding new attributes to the data mart, in a number that is currently unknown. I should note up front that the solutions and estimates discussed here were chosen for working with big data on the Apache Hadoop technology stack, with Apache Spark as the data processing framework, Apache Hive as the DBMS for data analysis, Airflow as the orchestrator, and data stored in the columnar Parquet format.

https://habr.com/ru/companies/T1Holding/articles/903546/

#hadoop #spark #airflow #hive #HDFS #Apache_Parquet #ddl #sql #eav #json

I wrote up how data management with OpenZFS works for us: https://kicb.pl/adaptacja-strategii-zarzadzania-danymi/ I may regret it, but it is my first post after all - criticism welcome ;) #OpenZFS #zfs #Debian #OpenSource #Linux #server #NAS #hardware #Hadoop #BigData 📖

OpenZFS GNU/Linux Debian trixie/sid Apache Hadoop HDFS NAS by marcin ^^ gnulinux ^^ pl

Unlock the potential of #Hadoop for large-scale data processing. Niklas Lang's comprehensive guide covers Hadoop's architecture, installation in different environments, and essential commands.

https://towardsdatascience.com/mastering-hadoop-part-2-getting-hands-on-setting-up-and-scaling-hadoop/

Any hadoop experts out there looking for some consulting? Got a hadoop cluster that needs some expert TLC.

#hadoop #bigdata #fedijobs

Ah yes, let's compare #Iceberg to #Hadoop, because nothing says "modern" like reminiscing about decade-old #tech 📅💾. This is the part where we pretend everything old is new again, while grabbing our #vintage floppy disks to join the 'modern' #data #revolution 🚀😂.
https://blog.det.life/apache-iceberg-the-hadoop-of-the-modern-data-stack-c83f63a4ebb9 #modern #data #HackerNews #ngated

Apache iceberg the Hadoop of the modern-data-stack? — https://blog.det.life/apache-iceberg-the-hadoop-of-the-modern-data-stack-c83f63a4ebb9
#HackerNews #ApacheIceberg #ModernDataStack #Hadoop #DataEngineering #BigData

How not to drown in data: choosing between DWH, Data Lake, and Lakehouse

Hi, Habr! My name is Alexey Struchenko, and I work as an information systems architect at Arenadata. Today I'd like to talk about data stores: their types, their key features, and how to choose a suitable solution. In the era of digital transformation, data has become one of the most valuable assets for companies of any size and in any field. Efficient storage, processing, and analysis of large volumes of data help organizations make informed decisions, improve operational efficiency, and build competitive advantages. However, as data volumes grow and data structures become more complex, traditional storage methods run into limitations. In this article we take a detailed look at three approaches to storing data: Data Warehouse (DWH), Data Lake, and the relatively new Lakehouse concept. We cover their features, differences, advantages, and drawbacks, and offer recommendations on when to choose each approach. Come up for air

https://habr.com/ru/companies/arenadata/articles/885722/

#dwh #data_lake #lakehouse #хранение_данных #big_data #администрирование_бд #базы_данных #озеро_данных #spark #hadoop

Hadoop on microservices, or the story of one pet project

When I came across the concept of Big Data some time ago, I had an obvious question: how can I get hands-on with it, where and how can I look at the software that makes up this concept, understand its configuration, and, since I am an information security specialist, "poke it with a stick": check how well it is protected and whether unauthorized access is possible. Because of their nature, systems of this kind are rather hard to deploy as a learning project on your own personal computer. The programs of this kind used in an organization are, to put it mildly, also not really meant to be picked at, broken into, and otherwise pushed out of normal operation. The project presented in this article is intended to deploy, inside Docker containers distributed across several computers, a maximally secured Hadoop environment (including Ranger and Knox) and to provide access to its interfaces for testing and configuration. In short, that is all there is to it: "git clone", "docker compose up -d" with some preliminary setup, and you are all set. The code (mostly shell scripts and Docker configuration) is documented as thoroughly as possible with links to the Internet resources it was taken from and where it is all described in detail. The technologies are all well known; I did not invent any new patterns here. If something is unclear, or the Docker containers do not come up on the first try, you will have to read on, where I will try to describe everything in more detail. So, let's go…

https://habr.com/ru/articles/885646/

#hadoop #ranger #knox #docker #dockerswarm

Help needed! I'm writing a media theory book chapter about Hadoop, BigTable, ACID, BASE and CAP for a non-technical audience. I'm looking for an _expert_ and practitioner from the field to doublecheck the 11 pages I wrote. Thank you for boosting. #phd #academia #Hadoop, #BigTable

Job opening: Senior #Systeembeheerder (systems administrator) #Linux at the @aivd:

https://www.werkenvoornederland.nl/vacatures/senior-systeembeheerder-linux-AIVD-2025-0006

#Elasticsearch #NiFi #Hadoop #Ansible #Kafka #devops #OpenSource #werkenvoornederland

This is a customer-facing role, so if that's not your thing, keep scrolling.

TLDR: If you know Hadoop and live close enough to Belfast to commute, you should apply.

I've posted this before, but it's been a little while #fedihire. Also, adding some additional information this time. This is my team. We are already on three continents and 6 timezones, but #Belfast is a new location for the team. I know literally nothing about the office.

I know that in a lot of places Hadoop is the past, and sure, we see a ton of #Spark (I do not understand why that is not listed in the job description, but maybe it's because they want to emphasize that we need Hadoop expertise?). You can see all the projects we support at https://www.openlogic.com/supported-technology

It depends on how you count, as I was on two teams during the transition, but I've been on this team for over 5 years now. It's a great team. I've been with the company right at 7 years now. I cannot say how we compare to Belfast employers, but this is well more than double the time I have stayed at any other employer (even if you count UNC-CH as a single employer rather than the different departments, I've beaten them by well over a year at this point).

My manager has been on this team for almost 15 years. His manager has been with this team for almost as long as me, but with the company much longer. His manager has been here almost as long as me (I actually did orientation with him). His manager is a her and she's been here almost as long as me. So, obviously, this is a place where people want to stay!

Our team has a lot of testosterone, but when I started, our CEO was a woman. The GM for the division is a woman.

My manager is black. The manager of our sister team is black.

I think you'll find our team and company is concerned about your work product and not how you dress, what bathroom you use, or the color of your skin.

If you take a look at our careers page, you'll see this:

Work Should Be Fun
There’s always something to look forward to as a Perforce employee: scavenger hunts, community lunches, summer events, virtual games, and year-end celebrations just to name a few.

We take that shit seriously. Nauseatingly so sometimes, lol.

Actually, we take everything on the careers page seriously, but I know from experience that some places treat support like they are a shoe sole to be worn down. Not so here. It's not all rainbows and sunshine, of course. The whole point is that the customer is having an issue! Our customers treat us with respect because management demands that they do.

------

The Director of Product Development at Perforce is searching for an Enterprise Architect (#BigData Solutions) to join the team. We are looking for an individual who loves data solutions, views technology as a lifestyle, and has a passion for open source software. In this position, you’ll get hands-on experience building, configuring, deploying, and troubleshooting our big data solutions, and you’ll contribute to our most strategic product offerings.

At OpenLogic we do #opensource right, and our people make it happen. We provide the technical expertise required for maintaining healthy implementations of hundreds of integrated open source software packages. If your skills meet any of the specs below, now is the time to apply to be a part of our passionate team.
Responsibilities:

Troubleshoot and conduct root cause analysis on enterprise-scale big data systems operated by third-party clients, assisting them in resolving complex issues in mission-critical environments.
Install, configure, validate, and monitor a bundle of open source packages that deliver a cohesive world class big data solution.
Evaluate existing Big Data systems operated by third-party clients and identify areas for improvement.
Administer automation for provisioning and updating our big data distribution.

Requirements:

Demonstrable proficiency in #Linux command-line essentials
Strong #SQL and #NoSQL background required
Demonstrable experience designing or testing disaster recovery plans, including backup and recovery
Must have a firm understanding of the #Hadoop ecosystem, including the various open source packages that contribute to a broader solution, as well as an appreciation for the turmoil and turf wars among vendors in the space
Must understand the unique use cases and requirements for platform specific deployments, including on-premises vs cloud vs hybrid, as well as bare metal vs virtualization
Demonstrable experience in one or more cloud-based technologies (AWS or Azure preferred)
Experience with #virtualization and #containerization at scale
Experience creating architectural blueprints and best practices for Hadoop implementations
Some programming experience required
#Database administration experience very desirable
Experience working in enterprise/carrier production environments
Understanding of #DevOps and automation concepts
#Ansible playbook development very desirable
Experience with #Git-based version control
Be flexible and willing to support occasional after-hours and weekend work
Experience working with a geographically dispersed virtual team

https://jobs.lever.co/perforce/479dfdd6-6e76-4651-9ddb-c4b652ab7b74

top 10 data analytics tools #dataanalytics #data #data #hadoop

https://quadexcel.com/wp/top-10-data-analytics-tools-dataanalytics-data-data-hadoop/

come work on my team! #fedihire

Position Summary:

The Director of Product Development at Perforce is searching for an Enterprise Architect (Big Data Solutions) to join the team. We are looking for an individual who loves data solutions, views technology as a lifestyle, and has a passion for open source software. In this position, you’ll get hands-on experience building, configuring, deploying, and troubleshooting our big data solutions, and you’ll contribute to our most strategic product offerings.

Requirements:

Demonstrable proficiency in #Linux command-line essentials
Strong #SQL and #NoSQL background required
Demonstrable experience designing or testing disaster recovery plans, including backup and recovery
Must have a firm understanding of the #Hadoop ecosystem, including the various open source packages that contribute to a broader solution, as well as an appreciation for the turmoil and turf wars among vendors in the space
Must understand the unique use cases and requirements for platform specific deployments, including on-premises vs cloud vs hybrid, as well as bare metal vs #virtualization
Demonstrable experience in one or more cloud-based technologies (#AWS or #Azure preferred)
Experience with virtualization and containerization at scale
Experience creating architectural blueprints and best practices for Hadoop implementations
Some programming experience required
#Database administration experience very desirable
Experience working in enterprise/carrier production environments
Understanding of #DevOps and automation concepts
#Ansible playbook development very desirable
Experience with #Git-based version control
Be flexible and willing to support occasional after-hours and weekend work
Experience working with a geographically dispersed virtual team

Apply at https://jobs.lever.co/perforce/479dfdd6-6e76-4651-9ddb-c4b652ab7b74

#hadoop
