#ZoomCamp

2025-05-22

πŸš•πŸ’‘ The model is up and running! It predicts ride durations for NY Yellow Taxi trips, and I’m loving the MLOps journey. Now focusing on deploying the model and automating the process.

2025-05-22

πŸ“ŠπŸ’» Just completed the linear regression model to predict ride durations based on data from Jan-Feb 2023. Now on to tuning the model and packaging it in a Docker container. Next steps ahead!
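
A minimal sketch of that baseline, assuming the standard NYC TLC parquet schema (pickup/dropoff timestamps and location IDs) and local copies of the monthly files; the file names and the 1-60 minute clipping are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

def load_month(path):
    df = pd.read_parquet(path)
    # Target: trip duration in minutes, clipped to a sane range.
    df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)]
    df[["PULocationID", "DOLocationID"]] = df[["PULocationID", "DOLocationID"]].astype(str)
    return df

train = load_month("yellow_tripdata_2023-01.parquet")  # assumed local file names
val = load_month("yellow_tripdata_2023-02.parquet")

# One-hot encode pickup/dropoff zones, then fit a plain linear regression.
dv = DictVectorizer()
X_train = dv.fit_transform(train[["PULocationID", "DOLocationID"]].to_dict(orient="records"))
X_val = dv.transform(val[["PULocationID", "DOLocationID"]].to_dict(orient="records"))
lr = LinearRegression().fit(X_train, train.duration)

rmse = ((lr.predict(X_val) - val.duration) ** 2).mean() ** 0.5
print("validation RMSE (minutes):", rmse)
```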

2025-05-22

πŸ—½πŸš– Starting with the NY Yellow Taxi dataset from Jan-Feb 2023! Preparing to build a regression model to predict ride durations. Time to dive into the data and start exploring!

2025-03-12

🌟 Just wrapped up the homework for Module 5 (Batch Processing) of the Zoomcamp!
I processed and analyzed the yellow_tripdata_2024-10.parquet and taxi_zone_lookup.csv datasets using PySpark and Spark SQL. Feels great to finish a hands-on project! πŸ†
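
The core of it is a join between the two files; a sketch assuming the usual TLC column names (PULocationID in the trip data, LocationID in the zone lookup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zoomcamp-module5").getOrCreate()

trips = spark.read.parquet("yellow_tripdata_2024-10.parquet")
zones = spark.read.option("header", True).option("inferSchema", True).csv("taxi_zone_lookup.csv")

# Enrich each trip with its pickup borough, then count trips per borough.
enriched = trips.join(zones, trips.PULocationID == zones.LocationID, "left")
enriched.groupBy("Borough").count().orderBy("count", ascending=False).show()
```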

2025-03-12

πŸ“ˆ Spark SQL is amazing!
Today I worked on SQL queries within PySpark to analyze and transform large datasets. This is such a powerful tool for data engineering! πŸš€
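
For example, registering the trip data as a temp view lets you aggregate in plain SQL; a self-contained sketch, with the same assumed file and column names as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()
spark.read.parquet("yellow_tripdata_2024-10.parquet").createOrReplaceTempView("trips")

# Trip counts and average duration (minutes) per pickup zone, in pure SQL.
spark.sql("""
    SELECT PULocationID,
           COUNT(*) AS n_trips,
           AVG((unix_timestamp(tpep_dropoff_datetime)
                - unix_timestamp(tpep_pickup_datetime)) / 60) AS avg_minutes
    FROM trips
    GROUP BY PULocationID
    ORDER BY n_trips DESC
""").show(10)
```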

2025-03-12

πŸ’₯ Today, I started using Spark on GCP with PySpark.
Worked with yellow_tripdata_2024-10.parquet and taxi_zone_lookup.csv to process data. Learning how Spark handles big data in the cloud is incredible! πŸš—
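
The cloud part mostly changes where the data lives; a sketch assuming the files were uploaded to a GCS bucket (the bucket name is made up) and the job runs on Dataproc, where the gs:// connector is preconfigured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-gcp").getOrCreate()

# Same reads as locally, just pointed at the bucket.
trips = spark.read.parquet("gs://my-zoomcamp-bucket/yellow_tripdata_2024-10.parquet")
zones = spark.read.option("header", True).csv("gs://my-zoomcamp-bucket/taxi_zone_lookup.csv")
print(trips.count(), "trips,", zones.count(), "zones")
```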

2025-03-12

πŸš€ I’ve just started the Data Engineering Zoomcamp by @DataTalksClub!
This module focuses on ETL processing with Spark, Spark SQL, and DataFrames. Excited to dive into big data processing and learn how to use Spark at scale! πŸ”₯

2025-02-21

πŸ”§ Now we’re building the pipeline! πŸ› οΈ Transforming the data into something useful by processing it with DLT and sending it to DuckDB. πŸš‚πŸ’Ύ Stay tuned as we turn raw data into insights!
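
The skeleton of such a pipeline is tiny; a sketch where a toy resource stands in for the real extraction step:

```python
import dlt

@dlt.resource(name="rides", write_disposition="replace")
def rides():
    # Toy rows standing in for the transformed records.
    yield from [{"ride_id": 1, "duration_min": 12.5},
                {"ride_id": 2, "duration_min": 7.0}]

# dlt creates the DuckDB file, schema, and table automatically.
pipeline = dlt.pipeline(pipeline_name="ny_taxi",
                        destination="duckdb",
                        dataset_name="rides_data")
print(pipeline.run(rides()))
```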

2025-02-21

πŸ“₯ Next step: pulling the dataset using RESTClient. πŸ—½πŸ”„ It's all about getting that raw data into our system so we can transform it into valuable insights. Excited to see how DLT can streamline this process! πŸš€
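
A sketch of that extraction step with dlt's RESTClient helper; the base URL is a placeholder, and the paginator choice depends on how the real API reports pages:

```python
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

@dlt.resource(name="rides")
def rides():
    client = RESTClient(
        base_url="https://example.com/api",  # placeholder endpoint
        paginator=PageNumberPaginator(base_page=1, total_path=None),
    )
    # paginate() keeps requesting pages until the API stops returning data.
    for page in client.paginate("rides"):
        yield page

pipeline = dlt.pipeline(pipeline_name="rest_demo", destination="duckdb")
print(pipeline.run(rides()))
```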

2025-02-21

πŸš€ Kicking off the dlt workshop by @DataTalksClub today! We’re diving into data ingestion, starting with a hands-on session. The journey begins by watching the video to understand the fundamentals and get ready for some serious data work! πŸ“ŠπŸ’»

2025-02-10

πŸ”š Final Results & Lessons Learned
πŸ† 4th (Public LB) – RMSE: 12.2324
πŸ… 5th (Private LB) – RMSE: 9.5624
Key takeaways:
βœ” Feature engineering & selection are crucial
βœ” Encoding strategies impact model performance
βœ” Hyperparameter tuning makes a real difference! πŸš€

2025-02-10

πŸ”§ Hyperparameter Optimization
Tuned XGBoost using Optuna, a powerful Bayesian optimization library. Finding the best hyperparameters helped lower RMSE and improve generalization! πŸ”₯⚑
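
The loop itself is short; a sketch with synthetic data standing in for the real feature matrix, and an illustrative (not the actual) search space:

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic regression data so the snippet runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=2000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 800),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params).fit(X_train, y_train)
    pred = model.predict(X_val)
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))  # validation RMSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```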

2025-02-10

βš™οΈ Model Choice: XGBoost
Why?
βœ… Handles missing data well
βœ… Great with tabular data
βœ… Efficient and highly tunable
XGBoost was the perfect choice for this structured dataset! πŸ“ˆπŸ’‘

2025-02-10

πŸ› οΈ Preprocessing
For high-cardinality categorical features, I used Target Encoding.
For low-cardinality categorical features, I applied Ordinal Encoding.
Missing values? Used SimpleImputer (most_frequent) to fill them efficiently. πŸš€
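
A sketch of that split, using scikit-learn's TargetEncoder (available from 1.3) and OrdinalEncoder; the column names and toy frame are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder  # sklearn >= 1.3

# Toy frame: product_id has many levels, store_type only a few.
df = pd.DataFrame({
    "product_id": ["a", "b", np.nan, "a", "c", "b", "a", "c", "b", "a"],
    "store_type": ["small", np.nan, "medium", "medium", "small",
                   "large", "small", "medium", "large", "small"],
})
y = np.arange(10, dtype=float)  # toy target

# Impute most-frequent first, then encode each group of columns differently.
preprocess = ColumnTransformer([
    ("high_card", make_pipeline(SimpleImputer(strategy="most_frequent"),
                                TargetEncoder()), ["product_id"]),
    ("low_card", make_pipeline(SimpleImputer(strategy="most_frequent"),
                               OrdinalEncoder()), ["store_type"]),
])
print(preprocess.fit_transform(df, y))
```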

2025-02-10

🎯 Feature Selection
Used XGBoost feature importance to filter out low-impact variables.
Dropped features with more than 60% missing values to improve model stability. Less noise, better predictions! πŸ“‰πŸ”
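
Both filters together, sketched on synthetic data (the 60% threshold is from the post; the importance cutoff below is just illustrative):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Synthetic frame: f0 drives the target, f5 is mostly missing.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
df.loc[rng.random(500) < 0.7, "f5"] = np.nan
y = 2 * df["f0"] + rng.normal(size=500)

# 1) Drop columns with more than 60% missing values.
X = df.loc[:, df.isna().mean() <= 0.60]

# 2) Rank the rest by XGBoost importance and drop the low-impact tail.
model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
selected = importance[importance >= 0.05].index  # illustrative cutoff
print(importance.sort_values(ascending=False), list(selected), sep="\n")
```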

2025-02-10

πŸ“Š Feature Engineering
To capture the cyclical nature of time, I transformed day of the week, month, and week of the year into sine and cosine features. This helps models understand seasonality better! πŸŒπŸ“…
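
A sketch of the transform on a toy date column; treating the year as 52 weeks is an approximation (some ISO years have 53):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=10, freq="D")})

def cyclical(values, period):
    """Map a 0-based cycle position onto the unit circle."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# Day of week, month, and ISO week as sine/cosine pairs.
df["dow_sin"], df["dow_cos"] = cyclical(df.date.dt.dayofweek, 7)
df["month_sin"], df["month_cos"] = cyclical(df.date.dt.month - 1, 12)
df["week_sin"], df["week_cos"] = cyclical(df.date.dt.isocalendar().week.astype(int) - 1, 52)
print(df.head())
```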

2025-02-10

πŸš€ Just wrapped up a Kaggle competition on sales forecasting! The challenge? Predicting product demand based on historical sales data, price changes, promotions, and product details. πŸ“Š

I used XGBoost as my main model and focused on:
βœ… Feature Engineering
βœ… Feature Selection
βœ… Preprocessing
βœ… Hyperparameter Optimization

2025-02-10

βš™οΈ Hybrid Approach: The Hybrid method combines the strengths of both Filter and Wrapper approaches, offering a balance between speed and accuracy. By using a filter to narrow down the features and a wrapper for fine-tuning, it provides an effective and efficient feature selection process.

2025-02-10

πŸ€– Ensemble Approach: This technique combines multiple models to select the best features. By using multiple algorithms and aggregating their results, it improves robustness and accuracy. Common methods include Random Forest and Gradient Boosting.
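
A sketch of that aggregation, averaging the importance vectors of a Random Forest and Gradient Boosting on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average the two importance vectors and keep the top 5 features.
avg = (rf.feature_importances_ + gb.feature_importances_) / 2
top5 = np.argsort(avg)[::-1][:5]
print("selected feature indices:", sorted(top5.tolist()))
```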

2025-02-10

🎯 Wrapper Approach: Unlike the Filter approach, the Wrapper method evaluates feature subsets by training a model. It iteratively adds or removes features to find the optimal set. While more computationally expensive, it tends to provide better results when combined with powerful models.
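
A minimal wrapper in scikit-learn is SequentialFeatureSelector, which scores each candidate subset with cross-validated model fits; synthetic data again:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=4, random_state=0)

# Forward selection: start empty, greedily add the feature that helps most.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=3)
sfs.fit(X, y)
print("selected feature indices:", list(sfs.get_support(indices=True)))
```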
