
How we replaced hundreds of joins with a single real-time processing pipeline at 1M RPS

How are discounts, user journeys, and huge data volumes connected in Yandex Advertising? Hi, Habr! My name is Maksim Statsenko. I have been working with databases, and digging into them with a passion, since 2010, and with Big Data since 2016. I now work at Yandex in the search and advertising DWH. We work with VERY large data. Every day millions of users see Yandex ads, and our systems process enormous volumes of data. For advertising to work effectively, we need the most complete information possible about the life history of an ad at every moment in time, which means we have to pass data somehow from one event to the next within the advertising funnel. I'll describe how we solved this problem.

https://habr.com/ru/companies/oleg-bunin/articles/884560/

#ytsaurus #mapreduce #olap #oltp #антифрод #распределенные_системы #оптимизация #обработка_данных #хранилища_данных

The SortMergeJoin join in Apache Spark

Let's look at how SortMergeJoin is implemented in Apache Spark and, along the way, peek into the source code on GitHub. Spark is written in Scala, and all of the operator's logic is available in the project's open repository. Right here :) The first thing we'll look at is the case class constructor. 1. The SortMergeJoinExec constructor

https://habr.com/ru/companies/gnivc/articles/914932/

#spark #join #hadoop #bigdata #mapreduce
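For readers who want the general idea before opening the Scala source, here is a minimal Python sketch of a sort-merge inner join. It is not Spark's SortMergeJoinExec code; the function name `sort_merge_join` and the row layout (tuples keyed by their first element) are assumptions made only for illustration.

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Inner-join two lists of rows by sorting both sides and merging them."""
    left_sorted = sorted(left, key=key)
    right_sorted = sorted(right, key=key)
    result = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        left_key = key(left_sorted[i])
        right_key = key(right_sorted[j])
        if left_key < right_key:
            i += 1  # left row has no match yet, advance the left side
        elif left_key > right_key:
            j += 1  # right row has no match yet, advance the right side
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right_sorted) and key(right_sorted[j_end]) == left_key:
                j_end += 1
            # Pair every left row with this key against the whole right run.
            while i < len(left_sorted) and key(left_sorted[i]) == left_key:
                for right_row in right_sorted[j:j_end]:
                    result.append((left_sorted[i], right_row))
                i += 1
            j = j_end
    return result


if __name__ == "__main__":
    orders = [(2, "b"), (1, "a"), (2, "c")]
    payments = [(2, "x"), (3, "y"), (2, "z")]
    for pair in sort_merge_join(orders, payments):
        print(pair)
```

Both inputs are sorted once, after which a single forward pass over each side finds all matches; duplicate keys are handled by buffering the matching run on the right side.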

YTsaurus: two years in open source, what we've achieved and where we're heading

On March 20 we held a meetup for users of YTsaurus, the main platform for storing and processing big data at Yandex, built by developers from Yandex Infrastructure, which has already proven itself outside the company. This text is largely based on my talk at the meetup: I'll briefly cover what we've achieved, what improvements we've made, and what users can expect in the near future.

https://habr.com/ru/companies/yandex/articles/901290/

#ytsaurus #map_reduce #mapreduce #большие_данные #big_data

Leveraging map-reduce and LLMs for enhanced cybersecurity network detection: https://corelight.com/blog/map-reduce-llms-cybersecurity-network-detection

#ndr #mapreduce #llm

#YouTube might've "fixed" their #JavaScript workload.

A few weeks ago, i've mentioned how i've grown quite fond of the #Firefox process manager (about:processes), because i can just "unload" groups of tabs.

You might recall; that whole tirade about #MapReduce. 😅

meanwhile, a few software updates later

…for a few days now, i don't seem to have to employ the process manager, quite as often as before. (100% single-core spikes reduced drastically)

Who-ever made it happen, thanks, mate! 🖖 😉 🍻

so, gonna write some stuff on #HDFS #MapReduce #yarn and maybe clustering. Also, #machinelearning was suggested but I think that may be too broad of a topic for this. I did cover Machine Learning in a blog back in 2023, but this time is for KB, not blog: https://www.openlogic.com/blog/using-cassandra-kafka-and-spark-ai

Hmm, perhaps some sort of ML performance (as in disk io, etc not accuracy) document would be good but still, where to even start.

If anyone has beginner resources, I'll likely be pointing folks to some resources

How many seconds did it take to read this #MapReduce thread?

Yeah, Google made an hour-long presentation out of it.

"you get your shiz together, and you deal with it" ← that's MapReduce! 🤣

…of course, you get to do some extra steps, when you want to massively parallelise processing. 😅

#MapReduce, as far as i understood it; you "do work", acquire a MAP of your workload (this is me, opening a bunch of tabs), and then you REDUCE it, by processing the heap.

When i 1st heard of the "MapReduce" thing, it was more impressive than it actually is! Thinking about it, it's kinda basic, actually. 😅

I've grown quite fond of the #Firefox Process Manager (about:processes)!

Using lots of #YouTube tabs tends to accumulate 1-3gig of "stuff" in memory, per group of 6-ish. A 100% single-threaded load, for several seconds!

…everything grinds to a halt 🐌

Technically, i really don't need all the tabs to be "active" - i just nuke 'em (click the tab-group's "X") …the tabs remain, but they become "unloaded", memory is being freed-up, CPU-load reduced!

My use-case is what Google taught; #MapReduce! 😅

"Given the psychology of geekdom, the charm of #mapreduce is understandable" -- Orri Erling (#Virtuoso PM @OpenLink). LOL! #linkeddata #qotd

Nice post about #dbms technology and the many misconceptions re. #rdbms esp. in light of #nosql world views: http://bit.ly/aEbHYp #mapreduce

New dynamic tables: secondary indexes, WebAssembly, and many more improvements in YTsaurus 24.1.0

Dynamic tables are a distributed database whose key-value pairs are combined into the kind of tables familiar to users of relational DBMSs. In YTsaurus they can store huge volumes of data that can still be read quickly, which is why almost all Yandex services use YTsaurus: Advertising, Market, Taxi, even Search when building the search index, and others. I lead the dynamic tables development team at Yandex Infrastructure and have previously written about how we optimized reads, improved row selection in SQL queries, and protected against overload. Today a new version, YTsaurus 24.1.0, was released, in which dynamic tables received several more long-awaited improvements. In this article I'll cover them in more detail.

https://habr.com/ru/companies/yandex/articles/857708/

#ytsaurus #mapreduce #map_reduce #инфраструктура #большие_данные #big_data #алгоритмы

The actual work the build is doing isn't that interesting, but it has to run your functions against 150+ combinations of inputs to define production environments. Think #terraform, but built in-house.

Since this is basically a #MapReduce I'm thinking I can make this much faster just by parallelizing those maps. But again, Ruby is single-threaded.
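The idea in this post, mapping one function over many independent input combinations and then merging the results, can be sketched in Python roughly as follows. The poster's build is in Ruby and its details aren't given, so `render_environment`, `merge`, `build_all`, and the region/tier inputs are all hypothetical stand-ins.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import reduce
from itertools import product


def render_environment(combo):
    """Map step (hypothetical): turn one input combination into a config fragment."""
    region, tier = combo
    return {f"{region}-{tier}": {"region": region, "tier": tier}}


def merge(accumulated, fragment):
    """Reduce step: fold one fragment into the combined result."""
    return {**accumulated, **fragment}


def build_all(combos, make_executor=ProcessPoolExecutor):
    """Run the map step in a worker pool, then reduce in the parent process."""
    with make_executor() as pool:
        fragments = list(pool.map(render_environment, combos))
    return reduce(merge, fragments, {})


if __name__ == "__main__":
    combos = list(product(["eu", "us", "ap"], ["dev", "staging", "prod"]))
    environments = build_all(combos)
    print(len(environments), "environments built")
```

Because each combination is independent, the map step needs no coordination between workers. Passing `make_executor=ThreadPoolExecutor` swaps in threads, which may be enough when the per-combination work is mostly I/O.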

Exploring where to use MapReduce: the relationship between data structures and problems
https://qiita.com/Tadataka_Takahashi/items/1883da776b5ab670cd47?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #bigdata #MapReduce #データ構造 #分散処理 #使いどころ

Getting started with MapReduce data processing in Python: for intermediate users
https://qiita.com/Tadataka_Takahashi/items/997f4e215663a355937a?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #Python #MapReduce #分散処理 #GoogleColab #大規模データ処理

The Hadoop ecosystem comprises various tools and frameworks designed to handle large-scale data processing and analytics. Let's discuss the core components, namely Hadoop, HBase, and Hive, along with other significant tools such as Pig, Sqoop, Flume, Oozie, and Zookeeper.

https://linuxexpert.org/so-you-wanna-do-big-data/

#Hadoop #HBase #Hive #BigData #HadoopEcosystem #HDFS #MapReduce #YARN #Pig #Sqoop #Flume #Oozie #Zookeeper #DataProcessing #DataAnalytics #DataWarehousing #ETL #DataIngestion #Security

Speeding up Python 2x with multiprocessing, async, and MapReduce

Python can indeed be considered a relatively slow programming language compared with some other languages, such as C++ or Java. However, there are various libraries and tools that make it possible to speed up compute-heavy tasks in Python. Let's look at how to speed up data analysis by a factor of two!

https://habr.com/ru/articles/825206/

#python3 #python #asyncio #async/await #multiprocessing #mapreduce

#ITByte: #MapReduce is a programming model and framework designed for processing large datasets in a parallel and distributed manner.

It's particularly useful for tasks that can be broken down into smaller, independent pieces.

https://knowledgezone.co.in/posts/What-is-MapReduce-6677bf67af6322731de3b7e9
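To make the "smaller, independent pieces" point concrete, here is a single-process Python sketch of the three classic phases (map, shuffle, reduce) applied to word counting. It only illustrates the programming model; a real framework would distribute each phase across machines, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases; each document is mapped independently."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["lorem ipsum lorem", "ipsum dolor"]))
```

The map calls share no state, and each reduce call sees only one key's values, which is what lets a framework run them in parallel on separate workers.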

It's 5am. I've set up Hadoop on Arch Linux in WSL2.

There... were issues.

But it works. And now nobody can stop me from performing word counts on lorem ipsum text using MapReduce.

Nobody!!

#Hadoop #MapReduce #BigData #Python

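🌘 MapReduce, TensorFlow, Vertex: Google's bet on not repeating its past mistakes in AI
➤ Google has historically created a number of revolutionary technologies, only to let competitors seize the lead after launch; this time Google hopes to keep that tragedy from repeating in artificial intelligence.
✤ https://www.supervised.news/p/mapreduce-tensorflow-bard-also-hello
Google launched Vertex AI, an AI infrastructure platform intended to win back Google's leadership in artificial intelligence. The launch of Vertex AI also makes Google a direct competitor to Azure AI Studio and Amazon Bedrock. In addition, Google partnered with Nvidia to launch PaxML, a new framework for developing language models, which is built on Goo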
#Google #人工智慧 #TensorFlow #Vertex AI #MapReduce
