#ZoomCamp

2025-05-22

πŸš•πŸ’‘ The model is up and running! It predicts ride durations for NY Yellow Taxi trips, and I’m loving the MLOps journey. Now focusing on deploying the model and automating the process.

2025-05-22

πŸ“ŠπŸ’» Just completed the linear regression model to predict ride durations based on data from Jan-Feb 2023. Now on to tuning the model and packaging it in a Docker container. Next steps ahead!
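
A minimal sketch of that baseline, assuming the standard NYC TLC parquet schema (pickup/dropoff timestamps and location IDs) and local copies of the monthly files; the file names and the 1-60 minute clipping are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

def load_month(path):
    df = pd.read_parquet(path)
    # Target: trip duration in minutes, clipped to a sane range.
    df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)]
    df[["PULocationID", "DOLocationID"]] = df[["PULocationID", "DOLocationID"]].astype(str)
    return df

train = load_month("yellow_tripdata_2023-01.parquet")  # assumed local file names
val = load_month("yellow_tripdata_2023-02.parquet")

# One-hot encode pickup/dropoff zones, then fit a plain linear regression.
dv = DictVectorizer()
X_train = dv.fit_transform(train[["PULocationID", "DOLocationID"]].to_dict(orient="records"))
X_val = dv.transform(val[["PULocationID", "DOLocationID"]].to_dict(orient="records"))
lr = LinearRegression().fit(X_train, train.duration)

rmse = ((lr.predict(X_val) - val.duration) ** 2).mean() ** 0.5
print("validation RMSE (minutes):", rmse)
```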

2025-05-22

πŸ—½πŸš– Starting with the NY Yellow Taxi dataset from Jan-Feb 2023! Preparing to build a regression model to predict ride durations. Time to dive into the data and start exploring!

2025-03-12

🌟 Just wrapped up the homework for Module 5 (Batch Processing) of the Zoomcamp!
I processed and analyzed the yellow_tripdata_2024-10.parquet and taxi_zone_lookup.csv datasets using PySpark and Spark SQL. Feels great to finish a hands-on project! πŸ†
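
The core of it is a join between the two files; a sketch assuming the usual TLC column names (PULocationID in the trip data, LocationID in the zone lookup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zoomcamp-module5").getOrCreate()

trips = spark.read.parquet("yellow_tripdata_2024-10.parquet")
zones = spark.read.option("header", True).option("inferSchema", True).csv("taxi_zone_lookup.csv")

# Enrich each trip with its pickup borough, then count trips per borough.
enriched = trips.join(zones, trips.PULocationID == zones.LocationID, "left")
enriched.groupBy("Borough").count().orderBy("count", ascending=False).show()
```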

2025-03-12

πŸ“ˆ Spark SQL is amazing!
Today I worked on SQL queries within PySpark to analyze and transform large datasets. This is such a powerful tool for data engineering! πŸš€
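
For example, registering the trip data as a temp view lets you aggregate in plain SQL; a self-contained sketch, with the same assumed file and column names as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()
spark.read.parquet("yellow_tripdata_2024-10.parquet").createOrReplaceTempView("trips")

# Trip counts and average duration (minutes) per pickup zone, in pure SQL.
spark.sql("""
    SELECT PULocationID,
           COUNT(*) AS n_trips,
           AVG((unix_timestamp(tpep_dropoff_datetime)
                - unix_timestamp(tpep_pickup_datetime)) / 60) AS avg_minutes
    FROM trips
    GROUP BY PULocationID
    ORDER BY n_trips DESC
""").show(10)
```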

2025-03-12

πŸ’₯ Today, I started using Spark on GCP with PySpark.
Worked with yellow_tripdata_2024-10.parquet and taxi_zone_lookup.csv to process data. Learning how Spark handles big data in the cloud is incredible! πŸš—
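
The cloud part mostly changes where the data lives; a sketch assuming the files were uploaded to a GCS bucket (the bucket name is made up) and the job runs on Dataproc, where the gs:// connector is preconfigured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-gcp").getOrCreate()

# Same reads as locally, just pointed at the bucket.
trips = spark.read.parquet("gs://my-zoomcamp-bucket/yellow_tripdata_2024-10.parquet")
zones = spark.read.option("header", True).csv("gs://my-zoomcamp-bucket/taxi_zone_lookup.csv")
print(trips.count(), "trips,", zones.count(), "zones")
```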

2025-03-12

πŸš€ I’ve just started the Data Engineering Zoomcamp by @DataTalksClub!
This module focuses on ETL processing with Spark, Spark SQL, and DataFrames. Excited to dive into big data processing and learn how to use Spark at scale! πŸ”₯

2025-02-21

πŸ”§ Now we’re building the pipeline! πŸ› οΈ Transforming the data into something useful by processing it with DLT and sending it to DuckDB. πŸš‚πŸ’Ύ Stay tuned as we turn raw data into insights!
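
The skeleton of such a pipeline is tiny; a sketch where a toy resource stands in for the real extraction step:

```python
import dlt

@dlt.resource(name="rides", write_disposition="replace")
def rides():
    # Toy rows standing in for the transformed records.
    yield from [{"ride_id": 1, "duration_min": 12.5},
                {"ride_id": 2, "duration_min": 7.0}]

# dlt creates the DuckDB file, schema, and table automatically.
pipeline = dlt.pipeline(pipeline_name="ny_taxi",
                        destination="duckdb",
                        dataset_name="rides_data")
print(pipeline.run(rides()))
```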

2025-02-21

πŸ“₯ Next step: pulling the dataset using RESTClient. πŸ—½πŸ”„ It's all about getting that raw data into our system so we can transform it into valuable insights. Excited to see how DLT can streamline this process! πŸš€
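
A sketch of that extraction step with dlt's RESTClient helper; the base URL is a placeholder, and the paginator choice depends on how the real API reports pages:

```python
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

@dlt.resource(name="rides")
def rides():
    client = RESTClient(
        base_url="https://example.com/api",  # placeholder endpoint
        paginator=PageNumberPaginator(base_page=1, total_path=None),
    )
    # paginate() keeps requesting pages until the API stops returning data.
    for page in client.paginate("rides"):
        yield page

pipeline = dlt.pipeline(pipeline_name="rest_demo", destination="duckdb")
print(pipeline.run(rides()))
```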

2025-02-21

πŸš€ Kicking off the dlt workshop by @DataTalksClub today! We’re diving into data ingestion, starting with a hands-on session. The journey begins by watching the video to understand the fundamentals and get ready for some serious data work! πŸ“ŠπŸ’»

2025-02-10

πŸ”š Final Results & Lessons Learned
πŸ† 4th (Public LB) – RMSE: 12.2324
πŸ… 5th (Private LB) – RMSE: 9.5624
Key takeaways:
βœ” Feature engineering & selection are crucial
βœ” Encoding strategies impact model performance
βœ” Hyperparameter tuning makes a real difference! πŸš€

2025-02-10

πŸ”§ Hyperparameter Optimization
Tuned XGBoost using Optuna, a powerful Bayesian optimization library. Finding the best hyperparameters helped lower RMSE and improve generalization! πŸ”₯⚑
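
The loop itself is short; a sketch with synthetic data standing in for the real feature matrix, and an illustrative (not the actual) search space:

```python
import numpy as np
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic regression data so the snippet runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=2000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 800),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params).fit(X_train, y_train)
    pred = model.predict(X_val)
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))  # validation RMSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```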

2025-02-10

βš™οΈ Model Choice: XGBoost
Why?
βœ… Handles missing data well
βœ… Great with tabular data
βœ… Efficient and highly tunable
XGBoost was the perfect choice for this structured dataset! πŸ“ˆπŸ’‘

2025-02-10

πŸ› οΈ Preprocessing
For high-cardinality categorical features, I used Target Encoding.
For low-cardinality categorical features, I applied Ordinal Encoding.
Missing values? Used SimpleImputer (most_frequent) to fill them efficiently. πŸš€
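
A sketch of that split, using scikit-learn's TargetEncoder (available from 1.3) and OrdinalEncoder; the column names and toy frame are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder  # sklearn >= 1.3

# Toy frame: product_id has many levels, store_type only a few.
df = pd.DataFrame({
    "product_id": ["a", "b", np.nan, "a", "c", "b", "a", "c", "b", "a"],
    "store_type": ["small", np.nan, "medium", "medium", "small",
                   "large", "small", "medium", "large", "small"],
})
y = np.arange(10, dtype=float)  # toy target

# Impute most-frequent first, then encode each group of columns differently.
preprocess = ColumnTransformer([
    ("high_card", make_pipeline(SimpleImputer(strategy="most_frequent"),
                                TargetEncoder()), ["product_id"]),
    ("low_card", make_pipeline(SimpleImputer(strategy="most_frequent"),
                               OrdinalEncoder()), ["store_type"]),
])
print(preprocess.fit_transform(df, y))
```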

2025-02-10

🎯 Feature Selection
Used XGBoost feature importance to filter out low-impact variables.
Dropped features with more than 60% missing values to improve model stability. Less noise, better predictions! πŸ“‰πŸ”
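
Both filters together, sketched on synthetic data (the 60% threshold is from the post; the importance cutoff below is just illustrative):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Synthetic frame: f0 drives the target, f5 is mostly missing.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
df.loc[rng.random(500) < 0.7, "f5"] = np.nan
y = 2 * df["f0"] + rng.normal(size=500)

# 1) Drop columns with more than 60% missing values.
X = df.loc[:, df.isna().mean() <= 0.60]

# 2) Rank the rest by XGBoost importance and drop the low-impact tail.
model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
selected = importance[importance >= 0.05].index  # illustrative cutoff
print(importance.sort_values(ascending=False), list(selected), sep="\n")
```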

2025-02-10

πŸ“Š Feature Engineering
To capture the cyclical nature of time, I transformed day of the week, month, and week of the year into sine and cosine features. This helps models understand seasonality better! πŸŒπŸ“…
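
A sketch of the transform on a toy date column; treating the year as 52 weeks is an approximation (some ISO years have 53):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=10, freq="D")})

def cyclical(values, period):
    """Map a 0-based cycle position onto the unit circle."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# Day of week, month, and ISO week as sine/cosine pairs.
df["dow_sin"], df["dow_cos"] = cyclical(df.date.dt.dayofweek, 7)
df["month_sin"], df["month_cos"] = cyclical(df.date.dt.month - 1, 12)
df["week_sin"], df["week_cos"] = cyclical(df.date.dt.isocalendar().week.astype(int) - 1, 52)
print(df.head())
```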

2025-02-10

πŸš€ Just wrapped up a Kaggle competition on sales forecasting! The challenge? Predicting product demand based on historical sales data, price changes, promotions, and product details. πŸ“Š

I used XGBoost as my main model and focused on:
βœ… Feature Engineering
βœ… Feature Selection
βœ… Preprocessing
βœ… Hyperparameter Optimization

2025-02-10

βš™οΈ Hybrid Approach: The Hybrid method combines the strengths of both Filter and Wrapper approaches, offering a balance between speed and accuracy. By using a filter to narrow down the features and a wrapper for fine-tuning, it provides an effective and efficient feature selection process.

2025-02-10

πŸ€– Ensemble Approach: This technique combines multiple models to select the best features. By using multiple algorithms and aggregating their results, it improves robustness and accuracy. Common methods include Random Forest and Gradient Boosting.
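
A sketch of that aggregation, averaging the importance vectors of a Random Forest and Gradient Boosting on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average the two importance vectors and keep the top 5 features.
avg = (rf.feature_importances_ + gb.feature_importances_) / 2
top5 = np.argsort(avg)[::-1][:5]
print("selected feature indices:", sorted(top5.tolist()))
```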

2025-02-10

🎯 Wrapper Approach: Unlike the Filter approach, the Wrapper method evaluates feature subsets by training a model. It iteratively adds or removes features to find the optimal set. While more computationally expensive, it tends to provide better results when combined with powerful models.
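
A minimal wrapper in scikit-learn is SequentialFeatureSelector, which scores each candidate subset with cross-validated model fits; synthetic data again:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=4, random_state=0)

# Forward selection: start empty, greedily add the feature that helps most.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                direction="forward", cv=3)
sfs.fit(X, y)
print("selected feature indices:", list(sfs.get_support(indices=True)))
```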
