
Last week, the 2025 Edition of our “Current Data Science for Business Students Meet Alumni” Event took place at the Faculty of Economics and Business Administration (Ghent University). #ORMS #DataScience #DataAnalytics #Python #ApacheSpark #SQL

https://www.linkedin.com/pulse/current-ds4b-students-meet-alumni-2025-edition-dirk-van-den-poel-cdsbe

🚀 From 24h to 20min – A Small Change, Huge Impact!

A Spark query ran almost a full day on a large dataset. Stats showed 300GB traffic between worker nodes! 🔍 The Explain Plan revealed the culprit: a costly JOIN causing shuffles.

The fix? No JOIN needed! A simple filter replaced it—resulting in a 20-minute runtime instead of 24h.

💡 Lesson: Always check the Explain Plan!
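To make the lesson above concrete, here is a minimal PySpark sketch. It is not the original query from the post: the tables (`events`, `wanted`), the column `country_id`, and the helper names are all hypothetical. It captures what `explain()` prints and counts the shuffle `Exchange` operators for a JOIN versus an equivalent filter.

```python
import contextlib
import io
import re

# Matches shuffle operators such as "Exchange hashpartitioning(...)" in a
# physical plan, but not "BroadcastExchange" or "ReusedExchange".
_SHUFFLE = re.compile(r"(?<![A-Za-z])Exchange\b")


def count_shuffles(plan: str) -> int:
    """Count shuffle Exchange operators in the text of a physical plan."""
    return len(_SHUFFLE.findall(plan))


def explain_text(df) -> str:
    """Capture the plan that df.explain() prints to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def demo() -> None:
    """Compare a JOIN with an equivalent filter (hypothetical toy data)."""
    # Imported lazily so the helpers above work without a Spark installation.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[2]").appName("explain-demo").getOrCreate()
    )
    # Assumption: disable broadcast joins so the toy tables are planned
    # like large ones (a shuffle join instead of a broadcast join).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    events = spark.range(1_000_000).withColumn("country_id", F.col("id") % 50)
    wanted = spark.createDataFrame([(1,), (7,)], ["country_id"])

    joined = events.join(wanted, "country_id")                # JOIN version
    filtered = events.filter(F.col("country_id").isin(1, 7))  # filter version

    print("shuffles with JOIN:  ", count_shuffles(explain_text(joined)))
    print("shuffles with filter:", count_shuffles(explain_text(filtered)))
    spark.stop()
```

Calling `demo()` on a machine with PySpark installed prints both counts. Whether a filter can really replace a JOIN depends on the query; it works here only because the joined table is used purely to select keys.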

#BigData #ApacheSpark #PerformanceTuning #DataEngineering

New post about how to write data from an Apache Spark DataFrame into an Elasticsearch/OpenSearch database #datascience #databricks #elasticsearch #opensearch #bigdata #apachespark #spark #tech #programming #python:

https://pedro-faria.netlify.app/posts/2025/2025-03-16-spark-elasticsearch/en/

Easier to use: DuckDB gets local web user interface

As of version 1.2.1, the DuckDB in-process database can be conveniently operated via a local UI, which is installed as an extension, as an alternative to the CLI.

https://www.heise.de/en/news/Easier-to-use-DuckDB-gets-local-web-user-interface-10316323.html?wt_mc=sm.red.ho.mastodon.mastodon.md_beitraege.md_beitraege&utm_source=mastodon

#ApacheSpark #Datenbanken #SQL #news

Easier to use: DuckDB gets a local web user interface

As of version 1.2.1, the in-process database DuckDB can be conveniently operated through a local UI, installed as an extension, as an alternative to the CLI.

https://www.heise.de/news/Einfacher-bedienen-DuckDB-erhaelt-lokale-Web-Benutzeroberflaeche-10316264.html?wt_mc=sm.red.ho.mastodon.mastodon.md_beitraege.md_beitraege&utm_source=mastodon

#ApacheSpark #Datenbanken #SQL #news

TIL: You can get a list of Spark-enabled GATK tools with the command

gatk --list | grep Spark

(The website doesn't seem to have a list anywhere)

#bioinformatics #GATK #ApacheSpark

Overall, leveraging StreamingQueryListener is vital for optimizing streaming workloads. #ApacheSpark #OpenTelemetry #StreamingData

More details and code examples: https://devblogs.microsoft.com/ise/spark_job_otel/

PySpark Tutorial for Beginners

#SparkTutorial #pysparkTutorial #ApacheSpark

https://quadexcel.com/wp/pyspark-tutorial-for-beginners-2/

In the world of data science, raw data serves as the foundation for generating actionable insights. However, managing, processing, and transforming this data into a usable format requires specialized tools.

#ApacheSpark #ApacheHadoop #Kafka #Airflow #DBT #Presto #SQL #Python #PySpark #Snowflake #Redshift #BigQuery #Kubernetes #Docker #Terraform #Databricks

Spark Connect is revolutionizing the way we run Spark applications. With version 3.4 and beyond, remote client applications written in Scala or Python can now run on a Spark cluster, offering more flexibility than ever before. Read Sergey Kotlov's latest article now.

#ApacheSpark #DataEngineering

https://towardsdatascience.com/adopting-spark-connect-cdd6de69fa98

🎃The October issue of #CheckpointChronicle is now out 🌟

It covers Ververica's Fluss, #ApacheFlink 2.0, Iggy.rs, Strimzi's support for #ApacheKafka 4.0, tons of OTF material from @vanlightly, Christian Hollinger's write up of ngrok's data platform, nice detail of how SmartNews use #ApacheIceberg with Flink and #ApacheSpark, a good writeup from Sudhendu Pandey on #ApachePolaris, notes from Kir Titievsky on Kafka's Avro serialisers, and much more!

https://dcbl.link/cc-oct242

▶️ Data Engineering: building and maintaining data infrastructures (#Dateninfrastrukturen), including databases (#Datenbanken) and data pipelines (SQL, #Hadoop, #ApacheSpark, #AWS, #Azure, #Kafka) 🖥

More on this in our #Blog at: https://www.vioffice.de/de/blog/data-science-analytics-engineering/ 🇩🇪🇬🇧

2/2

What is Databricks? Why is it Gaining Popularity? – Quick Guide

https://zurl.co/Lfpx

#Databricks
#BigData
#MachineLearning
#CloudComputing
#ApacheSpark
#DataAnalytics
#TechInnovation
#DataScience
#CloudPlatform
#DataProcessing

I'm getting back into #VizierDB development after a lengthy hiatus with an experiment in polyglot IDEs. Although the experiment was not (yet) successful, it's opened up several ideas for Vizier, including ways to improve Vizier's state model, and decouple it from #ApacheSpark to also allow lighter-weight SQL engines like #DuckDB. I'm also inspired to explore #Curses as an alternative frontend to Vizier.

For now, just some maintenance with Vizier's plugin architecture.
https://github.com/VizierDB/vizier-scala/issues/288

Apache Spark 3 - Spark Programming in Python for Beginners

Data Engineering using PySpark

https://couponfrogg.com/coupons/apache-spark-programming-in-python-for-beginners/

#ApacheSpark #Python

anybody know if it is ok to run #apachespark and #apachehive on the same box? I have 969 #java processes on this #centos box, which seems like a lot, but not sure if it is actually a problem.

Something is certainly a problem.

#bigdata

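"Ente gut, alles gut?" DuckDB is a special kind of database

(The headline is a German pun on "Ende gut, alles gut", meaning "all's well that ends well"; "Ente" means "duck".)

DuckDB has been released in version 1.0. What is this database about, and what does it do differently from other databases?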

https://www.heise.de/blog/Ente-gut-alles-gut-DuckDB-ist-eine-besondere-Datenbank-9753854.html?wt_mc=sm.red.ho.mastodon.mastodon.md_beitraege.md_beitraege&utm_source=mastodon

#ApacheSpark #Datenbanken #SQL #news

Latest version of my Whisky clustering using Apache projects talk:
https://speakerdeck.com/paulk/groovy-whiskey
Tickets are still available for CoCEU. #apachecon #communityovercode #apachewayang #ApacheFlink #ApacheSpark #ApacheBeam #ApacheIgnite #ApacheCommons @ApacheGroovy #opensource #machinelearning #groovylang

🛍️ Unlock the power of personalized shopping with Apache Spark! 🌟 Dive into data transformation and machine learning to craft tailored experiences for your customers. Spark revolutionizes retail analytics, predicting preferences with precision.
Read the full article: https://squads.com/blog/making-shopping-personal-with-apache-spark
#ApacheSpark #RetailAnalytics #Personalization 🚀🛒

The s390x open source team at IBM confirms the latest versions of various software packages run well on #Linux on #IBMZ.

In March 2024 validation was maintained for over 30 projects, including: #ApacheSolr #WildFly & #ApacheSpark

Full report: https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/elizabeth-k-joseph1/2024/04/05/linuxone-open-source-report-march-2024 🐧
