Data Scientist with M.A. in sociology, B.A. in environmental sociology, and 5+ years' experience teaching statistics. Completed TripleTen's 10-month data science bootcamp and a real-world data science externship with DataSpeak. Currently accepting data analysis and statistics consulting projects as of May 2024.
Purpose
Build a multi-output regression model that minimizes symmetric Mean Absolute Percentage Error (sMAPE) when predicting gold recovery from ore. After identifying the best model, determine which factors are unprofitable.
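As a point of reference, sMAPE can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own code: the function names `smape` and `smape_per_target` are assumptions, and so is the choice to count a row as zero error when both the actual and predicted values are zero.

```python
import numpy as np


def smape_per_target(y_true, y_pred):
    """Return sMAPE (in percent) separately for each output column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Assumption: when actual and predicted are both 0, the error is 0.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100 * ratio.mean(axis=0)


def smape(y_true, y_pred):
    """Return a single sMAPE (in percent), averaged over all outputs."""
    return float(np.mean(smape_per_target(y_true, y_pred)))
```

With two targets, `smape_per_target` returns one value per column, so the targets can be inspected separately or combined however the project requires.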
Techniques
Linear Regression, Decision Tree Regression, and Random Forest Regression were tuned and cross-validated to determine which model best minimizes sMAPE on the training set.
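The comparison described above might look roughly like the following scikit-learn sketch. The synthetic data, the hyperparameter values, the 5-fold split, and the `smape` helper are all assumptions made for illustration; the notebook's actual features, tuning grid, and fold setup may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def smape(y_true, y_pred):
    """sMAPE in percent, averaged over all rows and outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100 * ratio.mean()


# Lower sMAPE is better, so scikit-learn negates the score internally.
smape_scorer = make_scorer(smape, greater_is_better=False)

# Synthetic stand-in for the training data: 4 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = 50 + X @ rng.normal(size=(4, 2)) + rng.normal(scale=0.5, size=(200, 2))

# Hypothetical hyperparameters; a real run would tune these.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, Y, cv=5, scoring=smape_scorer)
    results[name] = -scores.mean()  # flip the sign back to a positive sMAPE

for name, score in results.items():
    print(f"{name}: mean CV sMAPE = {score:.2f}")
```

All three estimators accept a two-column target directly, so no wrapper is needed for the multi-output case.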
Three data sets contain information on the full set, training set, and test set:
gold_recovery_full.csv
gold_recovery_train.csv
gold_recovery_test.csv
Datasets contain information on input and output measures for each stage in the gold recovery process. Observations close to each other in time are often similar. Column names follow the pattern:
[stage]_[parameter_type].[parameter_name]
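A column name in this pattern could be split into its three parts as sketched below. The helper `parse_column` and the example column names are hypothetical, and the sketch assumes the parameter type itself contains no underscore; if the actual columns use a different separator, the split would need adjusting.

```python
from typing import NamedTuple


class ColumnParts(NamedTuple):
    stage: str
    parameter_type: str
    parameter_name: str


def parse_column(name: str) -> ColumnParts:
    """Split '[stage]_[parameter_type].[parameter_name]' into its parts."""
    # Everything after the first '.' is the parameter name.
    head, dot, parameter_name = name.partition(".")
    # The last '_' in the head separates the stage from the parameter type,
    # so stages that contain underscores stay intact.
    stage, underscore, parameter_type = head.rpartition("_")
    if not (dot and underscore and stage and parameter_type and parameter_name):
        raise ValueError(f"Column name does not match the pattern: {name!r}")
    return ColumnParts(stage, parameter_type, parameter_name)


# Hypothetical column names, for illustration only.
print(parse_column("rougher_input.feed_au"))
print(parse_column("primary_cleaner_state.floatbank8_a_air"))
```

Grouping columns by `stage` or `parameter_type` this way makes it easier to select, for example, all inputs for a single stage.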
The data were provided by TripleTen’s Data Science bootcamp.
Data were checked for missing values and duplicates. Missing values were imputed with the average of forward and backward fill. There were no duplicates.
├── LICENSE
├── README.md
│
├── images
│   ├── final_regression.png
│   ├── gold_stages.png
│   └── recovery_process.png
│
└── notebooks
    └── gold-recovery-analysis.ipynb
Concentration of gold ore tends to increase through each stage of the recovery process.
Linear Regression achieved the best cross-validated sMAPE on the training set (12.30), and this was confirmed with a final sMAPE of 7.30 on the test set. In the regression output, some floatbank states were not significant, though all of the chemical inputs were. The company may be able to loosen control over the floatbank states but should continue to monitor the chemical inputs closely during the process.
Thanks to TripleTen’s Data Science bootcamp for providing the project and the data.