The presented code is based on real car data collected from the Divar website up to the spring of 1402 (spring 2023). The data was consolidated into a single comprehensive file, which was then cleaned with several techniques to reduce noise and errors and improve its quality, making it better suited for analysis and model training. The resulting dataset and the cleaning techniques are available for use and further study.
To run this code, you need to have the following dependencies installed:
- matplotlib
- numpy
- pandas
- seaborn
- pandasgui (optional)
- scikit-learn
You can install these dependencies using pip:

```shell
pip install matplotlib numpy pandas seaborn pandasgui scikit-learn
```
Clone the repository or download the code files, make sure the required dependencies are installed, and run the code in a Python IDE or Jupyter Notebook.

The data used in this project is stored in a CSV file named car_final.csv. It contains information about various cars, including their features and prices.
The code begins with an exploratory data analysis to gain insights into the dataset. It includes checking for null values, displaying DataFrame information, and creating a correlation matrix heatmap to visualize the relationships between different features.
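A minimal sketch of this exploratory step is shown below. In the repository the frame comes from `pd.read_csv("car_final.csv")`; here a tiny stand-in DataFrame with illustrative column names (`brand`, `year`, `mileage`, `price` are assumptions, not the dataset's real schema) keeps the example self-contained:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for: df = pd.read_csv("car_final.csv")
# Column names are illustrative placeholders.
df = pd.DataFrame({
    "brand": ["Pride", "Peugeot 206", "Pride", "Samand"],
    "year": [1390, 1395, 1388, 1392],
    "mileage": [120_000, 60_000, 150_000, 90_000],
    "price": [150_000_000, 400_000_000, 120_000_000, 250_000_000],
})

# Null counts and general DataFrame structure
print(df.isnull().sum())
df.info()

# Correlation heatmap over numeric columns only
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.show()
```

Restricting the correlation matrix to numeric columns avoids errors from string-valued features such as the brand name.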
Before training the models, the data is preprocessed to transform categorical variables into numerical representations using label encoding. The dataset is then split into training and testing sets.
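The encoding and splitting step can be sketched as follows. The target column name `price` is an assumption about the CSV schema, and the small inline DataFrame again stands in for the real `car_final.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stand-in for pd.read_csv("car_final.csv"); columns are illustrative.
df = pd.DataFrame({
    "brand": ["Pride", "Peugeot 206", "Pride", "Samand", "Dena", "Samand"],
    "gearbox": ["manual", "manual", "manual", "automatic", "automatic", "manual"],
    "year": [1390, 1395, 1388, 1392, 1399, 1391],
    "price": [150, 400, 120, 250, 700, 230],
})

# Label-encode every string column; keeping one fitted encoder per
# column makes it possible to map codes back to the original labels.
encoders = {}
for col in df.select_dtypes(include="object").columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# "price" as the target name is an assumption -- adjust to the real schema.
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```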
A Random Forest regressor is trained using the training data. The code specifies the number of estimators and maximum depth for the Random Forest model.
The performance of the trained model is evaluated using various metrics, including R-squared, mean squared error (MSE), and mean absolute error (MAE). The code calculates these metrics and prints the results.
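The training and evaluation steps above can be sketched together. The synthetic feature matrix stands in for the encoded dataset, and the `n_estimators`/`max_depth` values are illustrative rather than the repository's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and price target.
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = X @ [3.0, 1.5, -2.0, 0.5] + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter values here are placeholders for the ones set in the code.
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R-squared: {r2:.3f}")
print(f"MSE: {mse:.3e}")
print(f"MAE: {mae:.3e}")
```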
Residual analysis is performed to examine the distribution of residuals, which represent the differences between the predicted and actual car prices. The code generates histograms and density plots to visualize the distribution of residuals and identify potential outliers.
Outliers with extremely high residual values are removed from the dataset. The code filters out rows where the absolute difference between the actual and predicted prices exceeds a threshold.
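A sketch of the residual analysis and the threshold-based filtering, using synthetic actual/predicted prices in place of `y` and the model's predictions; the threshold value is an assumption, not the repository's:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative actual vs. predicted prices; in the repository these are
# the target column and the Random Forest's predictions.
rng = np.random.default_rng(0)
actual = rng.uniform(1e8, 1e9, 500)
predicted = actual + rng.normal(0, 2e7, 500)

residuals = actual - predicted

# Histogram and density view of the residual distribution
pd.Series(residuals).plot(kind="hist", bins=50, density=True, alpha=0.5)
pd.Series(residuals).plot(kind="density")
plt.title("Residual distribution")
plt.show()

# Keep only rows whose absolute residual is within the threshold.
threshold = 5e7  # placeholder value; tune to the price scale of the data
mask = np.abs(residuals) <= threshold
actual_clean, predicted_clean = actual[mask], predicted[mask]
```

Rows failing the mask are the extreme outliers removed before the model is retrained.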
The importance of each feature in predicting car prices is analyzed using the trained Random Forest model. The code calculates feature importances and visualizes them in a bar plot. The plot provides insights into the relative importance of different features in determining the car prices.
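The feature-importance step can be sketched like this; the feature names are placeholders for the dataset's real columns, and the synthetic target deliberately depends on two of the three features:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: "brand" is pure noise, so it should rank lowest.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((300, 3)), columns=["year", "mileage", "brand"])
y = 5 * X["year"] - 3 * X["mileage"] + rng.normal(0, 0.1, 300)

model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# feature_importances_ is normalized to sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh")
plt.title("Feature importances")
plt.tight_layout()
plt.show()
```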
Before removing outliers and cleaning the data, the model achieved the following results:
- R-squared: 0.891
- MSE: 1.155e+17
- MAE: 1.001e+08
After training and evaluating the Random Forest regression model, the following results were obtained:
- R-squared: 0.983
- MSE: 7.697e+15
- MAE: 3.358e+07
The improved model shows significant performance gains over the initial one. The R-squared value increased from 0.891 to 0.983, indicating that the refined model explains a larger portion of the variance in car prices. The mean squared error (MSE) decreased from 1.155e+17 to 7.697e+15, a substantial reduction in prediction error, and the mean absolute error (MAE) dropped from 1.001e+08 to 3.358e+07, indicating better accuracy in predicting car prices.
These results demonstrate the effectiveness of the data cleaning process and the refinement of the Random Forest regression model in improving its predictive performance for car prices.
Based on these results, we can conclude that the Random Forest regression model, trained on the cleaned dataset, is effective in predicting car prices. It captures important relationships between the features and target variable, leading to accurate price predictions. Further improvements could be made by exploring other regression algorithms, feature engineering techniques, or fine-tuning the model's hyperparameters.
Feel free to modify the code and experiment with different parameters to further improve the performance or explore additional aspects of the data.
- Preprocessing
- Visualizing
- Machine-Learning
- Maybe some deep-learning