Last week, the 2025 Edition of our “Current Data Science for Business Students Meet Alumni” Event took place at the Faculty of Economics and Business Administration (Ghent University). #ORMS #DataScience #DataAnalytics #Python #ApacheSpark #SQL
🚀 From 24h to 20min – A Small Change, Huge Impact!
A Spark query ran for almost a full day on a large dataset. Stats showed 300 GB of traffic between worker nodes! 🔍 The Explain Plan revealed the culprit: a costly JOIN causing shuffles.
The fix? No JOIN needed! A simple filter replaced it, cutting the runtime from 24h to 20 minutes.
💡 Lesson: Always check the Explain Plan!
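The post does not show the query, so here is only a rough sketch of the idea under an assumed scenario: the join existed solely to restrict rows to one value, which a plain filter can do as well. It uses Python's built-in sqlite3 as a stand-in engine, not Spark, and the tables `events` and `wanted` are made up. In PySpark the plan would come from `DataFrame.explain()` instead of `EXPLAIN QUERY PLAN`.

```python
import sqlite3

# Hypothetical data: "events" is the big table, "wanted" is a lookup
# table that the original query joined against only to restrict rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, country TEXT, amount REAL);
    CREATE TABLE wanted (country TEXT);
    INSERT INTO events (country, amount) VALUES
        ('BE', 10.0), ('NL', 5.0), ('BE', 7.5), ('DE', 3.0);
    INSERT INTO wanted VALUES ('BE');
""")

JOIN_SQL = (
    "SELECT e.id, e.amount FROM events e "
    "JOIN wanted w ON e.country = w.country ORDER BY e.id"
)
FILTER_SQL = "SELECT id, amount FROM events WHERE country = 'BE' ORDER BY id"


def plan(sql):
    """Return the plan steps the engine reports for a query."""
    # Each plan row ends with a human-readable description of the step.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]


join_rows = con.execute(JOIN_SQL).fetchall()
filter_rows = con.execute(FILTER_SQL).fetchall()

# Same result, but the join plan has to touch two tables.
print("same rows:", join_rows == filter_rows)
print("join plan:", plan(JOIN_SQL))
print("filter plan:", plan(FILTER_SQL))
```

Both queries return the same rows here. The plan output is what shows that the join does extra work, which is the habit the post recommends.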
New post about how to write data from an Apache Spark DataFrame into an Elasticsearch/Opensearch database #datascience #databricks #elasticsearch #opensearch #bigdata #apachespark #spark #tech #programming #python:
https://pedro-faria.netlify.app/posts/2025/2025-03-16-spark-elasticsearch/en/
Easier to use: DuckDB gets local web user interface
As of version 1.2.1, the DuckDB in-process database can be conveniently operated via a local UI, which is installed as an extension, as an alternative to CLI.
TIL: You can get a list of Spark-enabled GATK tools with the command
gatk --list | grep Spark
(The website doesn't seem to have a list anywhere)
Overall, leveraging StreamingQueryListener is vital for optimizing streaming workloads. #ApacheSpark #OpenTelemetry #StreamingData
More details and code examples: https://devblogs.microsoft.com/ise/spark_job_otel/
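As a rough illustration of what a listener can feed into monitoring (not the code from the linked post): in PySpark, `onQueryProgress` receives an event whose `progress.json` is a JSON string. The helper `summarize_progress`, the sample payload, and the "falling behind" heuristic below are assumptions made for this sketch.

```python
import json


def summarize_progress(progress_json: str) -> dict:
    """Pick a few throughput metrics out of one progress report."""
    p = json.loads(progress_json)
    input_rate = p.get("inputRowsPerSecond", 0.0)
    processed_rate = p.get("processedRowsPerSecond", 0.0)
    return {
        "query": p.get("name") or p.get("id"),
        "batch_id": p.get("batchId"),
        "input_rows": p.get("numInputRows", 0),
        "input_rows_per_s": input_rate,
        "processed_rows_per_s": processed_rate,
        # Heuristic: data arrives faster than it is being processed.
        "falling_behind": input_rate > processed_rate,
    }


# Made-up sample, shaped like the JSON a progress event carries.
sample = json.dumps({
    "name": "orders_stream",
    "batchId": 7,
    "numInputRows": 1200,
    "inputRowsPerSecond": 400.0,
    "processedRowsPerSecond": 300.0,
})

print(summarize_progress(sample))

# Wiring it up in PySpark (needs a running Spark session, not run here):
#
#   from pyspark.sql.streaming import StreamingQueryListener
#
#   class MetricsListener(StreamingQueryListener):
#       def onQueryStarted(self, event): pass
#       def onQueryProgress(self, event):
#           metrics = summarize_progress(event.progress.json)
#           # export metrics to your telemetry backend here
#       def onQueryTerminated(self, event): pass
#
#   spark.streams.addListener(MetricsListener())
```

Keeping the parsing in a plain function makes it easy to try out without a cluster. The listener itself then only forwards the numbers.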
PySpark Tutorial for Beginners #SparkTutorial #pysparkTutorial #ApacheSpark
In the world of data science, raw data serves as the foundation for generating actionable insights. However, managing, processing, and transforming this data into a usable format requires specialized tools.
#ApacheSpark #ApacheHadoop #Kafka #Airflow #DBT #Presto #SQL #Python #PySpark #Snowflake #Redshift #BigQuery #Kubernetes #Docker #Terraform #Databricks
Spark Connect is revolutionizing the way we run Spark applications. With version 3.4 and beyond, remote client applications written in Scala or Python can now run on a Spark cluster, offering more flexibility than ever before. Read Sergey Kotlov's latest article now.
https://towardsdatascience.com/adopting-spark-connect-cdd6de69fa98
🎃The October issue of #CheckpointChronicle is now out 🌟
It covers Ververica's Fluss, #ApacheFlink 2.0, Iggy.rs, Strimzi's support for #ApacheKafka 4.0, tons of OTF material from @vanlightly, Christian Hollinger's write up of ngrok's data platform, nice detail of how SmartNews use #ApacheIceberg with Flink and #ApacheSpark, a good writeup from Sudhendu Pandey on #ApachePolaris, notes from Kir Titievsky on Kafka's Avro serialisers, and much more!
▶️ Data Engineering: building and maintaining #DataInfrastructure, including #Databases and data pipelines (SQL, #Hadoop, #ApacheSpark, #AWS, #Azure, #Kafka) 🖥
More on our #Blog at: https://www.vioffice.de/de/blog/data-science-analytics-engineering/ 🇩🇪🇬🇧
2/2
What is Databricks? Why is it Gaining Popularity? – Quick Guide
#Databricks
#BigData
#MachineLearning
#CloudComputing
#ApacheSpark
#DataAnalytics
#TechInnovation
#DataScience
#CloudPlatform
#DataProcessing
I'm getting back into #VizierDB development after a lengthy hiatus with an experiment in polyglot IDEs. Although the experiment was not (yet) successful, it's opened up several ideas for Vizier, including ways to improve Vizier's state model, and decouple it from #ApacheSpark to also allow lighter-weight SQL engines like #DuckDB. I'm also inspired to explore #Curses as an alternative frontend to Vizier.
For now, just some maintenance with Vizier's plugin architecture.
https://github.com/VizierDB/vizier-scala/issues/288
Apache Spark 3 - Spark Programming in Python for Beginners
Data Engineering using PySpark
https://couponfrogg.com/coupons/apache-spark-programming-in-python-for-beginners/
anybody know if it is ok to run #apachespark and #apachehive on the same box? I have 969 #java processes on this #centos box, which seems like a lot, but not sure if it is actually a problem.
Something is certainly a problem.
All's well that ends well? DuckDB is a special kind of database
DuckDB has been released in version 1.0. What is the story behind this database, which does some things differently from other databases?
Latest version of my Whisky clustering using Apache projects talk:
https://speakerdeck.com/paulk/groovy-whiskey
Tickets are still available for CoCEU. #apachecon #communityovercode #apachewayang #ApacheFlink #ApacheSpark #ApacheBeam #ApacheIgnite #ApacheCommons @ApacheGroovy #opensource #machinelearning #groovylang
🛍️ Unlock the power of personalized shopping with Apache Spark! 🌟 Dive into data transformation and machine learning to craft tailored experiences for your customers. Spark revolutionizes retail analytics, predicting preferences with precision.
Read the full article: https://squads.com/blog/making-shopping-personal-with-apache-spark
#ApacheSpark #RetailAnalytics #Personalization 🚀🛒
The s390x open source team at IBM confirms the latest versions of various software packages run well on #Linux on #IBMZ.
In March 2024 validation was maintained for over 30 projects, including: #ApacheSolr #WildFly & #ApacheSpark
Full report: https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/elizabeth-k-joseph1/2024/04/05/linuxone-open-source-report-march-2024 🐧