From aa104a3e9269a68edb7dd67141fbfeec1abd30fe Mon Sep 17 00:00:00 2001
From: mwatid
Date: Fri, 13 Dec 2024 01:23:10 -0500
Subject: [PATCH 1/2] submission of final notebook

---
 .../mwatid_finalProject.Rmd | 307 ++++++++++++++++++
 1 file changed, 307 insertions(+)
 create mode 100644 StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd

diff --git a/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd b/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd
new file mode 100644
index 0000000..1a48da1
--- /dev/null
+++ b/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd
@@ -0,0 +1,307 @@
+---
+title: "Examining Campaign Prediction Using Machine Learning on the PIXL/LIBS Combined Dataset"
+author: "Dante Mwatibo"
+date: "December 13, 2024"
+output:
+  html_document:
+    toc: yes
+    toc_depth: 3
+    toc_float: yes
+    number_sections: yes
+    theme: united
+  html_notebook: default
+  pdf_document:
+    toc: yes
+    toc_depth: '3'
+---
+
+# DAR Project and Group Members
+
+* Project name: Mars
+* Project team members:
+    - **Dante Mwatibo**
+    - Ashton Compton
+    - Aadi Lahiri
+    - CJ Marino
+    - Nicolas Morawski
+    - Charlotte Peterson
+    - Doña Roberts
+    - Margo VanEsselstyn
+    - David Walczyk
+
+
+# 0.0 Preliminaries.
+
+*R Notebooks are meant to be dynamic documents. Provide any relevant technical guidance for users of your notebook. Also take care of any preliminaries, such as required packages. Sample text:*
+
+This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be suppressed from the output for readability using the option `{r, echo=show}` in the code block header. If `show <- FALSE` the code block will be suppressed; if `show <- TRUE` the code will be shown.
+
+```{r}
+# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks
+show <- TRUE
+```
+
+
+Executing this R notebook requires some subset of the following packages:
+
+* `devtools`
+* `ggplot2`
+* `knitr`
+* `gplots`
+* `heatmaply`
+* `dplyr`
+* `plotly`
+* `caret`
+* `kableExtra`
+* `rpart`
+* `rpart.plot`
+
+These will be installed and loaded as necessary (code suppressed).
+
+
+```{r, include=FALSE}
+# This code will install required packages if they are not already installed
+if (!require("devtools")) {
+  install.packages("devtools")
+  library(devtools)
+}
+if (!require("ggplot2")) {
+  install.packages("ggplot2")
+  library(ggplot2)
+}
+if (!require("knitr")) {
+  install.packages("knitr")
+  library(knitr)
+}
+if (!require("gplots")) {
+  install.packages("gplots")
+  library(gplots)
+}
+if (!require("heatmaply")) {
+  install.packages("heatmaply")
+  library(heatmaply)
+}
+if (!require("dplyr")) {
+  install.packages("dplyr")
+  library(dplyr)
+}
+if (!require("plotly")) {
+  install.packages("plotly")
+  library(plotly)
+}
+if (!require("caret")) {
+  install.packages("caret")
+  library(caret)
+}
+if (!require("kableExtra")) {
+  install.packages("kableExtra")
+  library(kableExtra)
+}
+if (!require("rpart")) {
+  install.packages("rpart")
+  library(rpart)
+}
+if (!require("rpart.plot")) {
+  install.packages("rpart.plot")
+  library(rpart.plot)
+}
+
+```
+
+# 1.0 Project Introduction
+
+The Mars Project, headed by NASA, collects, studies, and analyzes data gathered by the 2020 Mars Perseverance Mission rover. Its goal is to explore Mars and look for any signs of life it can find, primarily through studying the geographical landscape of Mars.
+On the Perseverance rover there are many tools that help it take samples of the Martian surface or scan it. The data analyzed in this notebook come from two of those instruments: the Planetary Instrument for X-Ray Lithochemistry (PIXL) and the Laser-Induced Breakdown Spectroscopy (LIBS) instrument.
+
+This notebook uses machine learning tools to learn more about the different minerals detected in the LIBS data and whether it is possible to predict the campaign recorded in the PIXL data. The two campaigns are the "Delta Front" and the "Crater Floor", and they represent two different geographical regions on Mars. By attempting to predict the campaign each sample belongs to, we hope to uncover which minerals might be more closely associated with life-bearing environments. For example, the existence of a delta is evidence in favor of water (and therefore a higher chance of life) having once been present on Mars. If we can determine which minerals are more likely to appear at the Delta Front, we might hope to extrapolate: if similar minerals are seen elsewhere, we can more confidently assume there is potential for water, and therefore potential for life, in that area.
+
+
+# 2.0 Organization of Report
+
+This report is organized as follows:
+
+
+* Section 3.0: Finding 1: Predicting Campaign using Random Forest Classification -- using the combined LIBS+PIXL dataset, I train a Random Forest model to predict the campaign of each sample from the LIBS mineral data.
+
+* Section 4.0: Finding 2: Predicting Campaign using Decision Tree Modeling -- using the combined LIBS+PIXL dataset, I use a decision tree model to analyze which minerals are most important in predicting the Campaign of any given sample.
+
+* Bibliography: references and the R packages used in this analysis.
+
+* Appendix: additional material supporting the main content of the notebook.
+
+
+# 3.0 Finding 1: Predicting Campaign using Random Forest Classification
+
+_Give a high-level overview of the major finding. What questions were you trying to address, what approaches did you employ, and what happened?_
+
+It is possible to predict the Campaign of a LIBS sample with a high degree of certainty. The question I was trying to address was whether this could be done at all, and the answer is yes. To test this, I trained and tested a Random Forest classifier and then validated it. Finally, for more insight and to ensure interpretability, I performed a feature-importance analysis on the LIBS mineral data.
+
+## 3.1 Data, Code, and Resources
+
+1. PIXL_LIBS_Combined.Rds is in the StudentData folder and contains a dataset combining LIBS and PIXL into one dataset, matching the campaigns in the PIXL dataset with the mineral data in the LIBS dataset. This dataset was created by team members Charlotte Peterson and Margo VanEsselstyn:
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/PIXL_LIBS_Combined.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/PIXL_LIBS_Combined.Rds)
+
+
+The dataset contains ~1400 samples, each with mineral data obtained from LIBS, location (lat/lon) columns from both PIXL and LIBS, the Sol (a Mars day) and point columns from LIBS, and a Distance column. Of these, this analysis uses only the LIBS mineral data and the PIXL Campaign data.
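+
+Before modeling, it can help to confirm the structure of the combined data and the balance of the two campaigns. The chunk below is an optional, illustrative sketch and is not part of the original analysis; the file path and column positions simply mirror the loading code used in the next chunk, and the object name is hypothetical.
+
+```{r, eval=FALSE}
+# Illustrative sketch only: peek at the columns used for modeling and the class balance
+pixl.libs.preview <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/PIXL_LIBS_Combined.Rds")
+str(pixl.libs.preview[, 10:18])          # columns used for modeling (see the subsetting code below)
+table(pixl.libs.preview$PIXL.Campaign)   # number of samples in each campaign
+```
+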
+Additionally, I needed to convert the Campaign column to a factor so that Random Forest performs classification rather than regression. Finally, I set the seed for reproducibility of results.
+
+```{r}
+### loading in the PIXL_LIBS_Combined data
+pixl.libs.combined.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/PIXL_LIBS_Combined.Rds")
+
+# setting the seed for the analyses
+set.seed(9554)
+
+# getting a subset of the data for Random Forest classification
+pixl.libs.combined.df.4rf <- pixl.libs.combined.df[,10:18]
+pixl.libs.combined.df.4rf$PIXL.Campaign <- factor(pixl.libs.combined.df$PIXL.Campaign, labels = c("Crater Floor", "Delta Front"))
+```
+
+
+## 3.2 Contribution
+
+_State if this section is sole work or joint work. If joint work describe who you worked with and your contribution. You can also indicate any work by others that you reused._
+
+This section is the joint work of myself, Margo, and Charlotte. Their contribution to this section is the PIXL_LIBS_Combined dataset. My contributions are all transformations to the dataset they provided, model generation, model verification, model interpretation, and data visualization.
+
+
+
+## 3.3 Methods Description
+
+
+_Describe the data analytics methods you used and why you chose them.
+Discuss your data analytics "pipeline" from *data preparation* and *experimental design*, to *methods*, to *results*. Were you able to use pre-existing implementations? If the techniques required user-specified parameters, how did you choose what parameter values to use?_
+
+The first thing I had to do was remove some columns, because I did not believe every column in the dataset would be valuable to the analysis I was trying to do, namely seeing whether mineral data can predict the Campaign of a sample. For this reason, I removed every column in the PIXL_LIBS_Combined dataset except the LIBS mineral data and the PIXL Campaign column.
+
+Then I split the data into training and test sets, randomly selecting 1000 samples for the training set and leaving ~400 for the test set. I generated a Random Forest model using the training data, used that model to predict the Campaign of the test samples, validated the model using a confusion matrix and some common classification metrics, and finally performed a feature-importance analysis on the model.
+
+
+## 3.4 Result and Discussion
+
+
+```{r}
+########################## RF Model Generation ##############################
+rf.s <- sample(nrow(pixl.libs.combined.df.4rf), 1000)
+rf.train <- pixl.libs.combined.df.4rf[rf.s,]
+rf.test <- pixl.libs.combined.df.4rf[-rf.s,]
+rf.mod <- train(PIXL.Campaign ~ ., data = rf.train, method = "rf", prox = TRUE)
+
+########################## RF Verification ##############################
+rf.pred <- predict(rf.mod, rf.test)
+cm.class = as.matrix(table(Actual = pixl.libs.combined.df.4rf$PIXL.Campaign[-rf.s], Predicted = rf.pred))
+
+cm.class %>%
+  kable(caption="Confusion Matrix of the Predicted vs. Actual Campaigns Using Random Forest") %>%
+  add_header_above(c(" " = 1, "Predicted" = 2)) %>%
+  pack_rows("Actual", 1, nrow(cm.class))
+
+n = sum(cm.class) # number of instances
+nc = nrow(cm.class) # number of classes
+diag = diag(cm.class) # number of correctly classified instances per class
+rowsums = apply(cm.class, 1, sum) # number of instances per class
+colsums = apply(cm.class, 2, sum) # number of predictions per class
+p = rowsums / n # distribution of instances over the actual classes
+q = colsums / n # distribution of instances over the predicted classes
+
+recall.class = diag / rowsums
+precision.class = diag / colsums
+f1.class = 2 * precision.class * recall.class / (precision.class + recall.class)
+
+rf.verification <- data.frame(precision.class, recall.class, f1.class)
+kable(rf.verification, caption="Performance Verification Metrics for Random Forest Model")
+```
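+
+For reference, the per-class metrics computed above follow the standard definitions, where $TP_c$, $FP_c$, and $FN_c$ are the true positive, false positive, and false negative counts for class $c$:
+
+$$
+\text{precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad
+\text{recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad
+F1_c = \frac{2 \cdot \text{precision}_c \cdot \text{recall}_c}{\text{precision}_c + \text{recall}_c}
+$$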
+
+The results of the Random Forest model predicting the Campaign of the testing data are positive. A precision of roughly 90% for both campaigns means that when the model predicts a sample to be Crater Floor or Delta Front, it is correct about 90% of the time. Recall measures how many of the true members of each class the model correctly identifies. It is interesting to note that the model correctly identifies true Delta Front samples noticeably more often than true Crater Floor samples. I believe this is a good thing, as we are likely more interested in samples from the Delta Front anyway. With recall values of ~85% and ~94% for the Crater Floor and Delta Front campaigns respectively, one can be assured that true samples of both campaigns are very likely being captured. Finally, F1 is the harmonic mean of precision and recall. With F1 values of ~88% and ~92% for Crater Floor and Delta Front predictions respectively, one can be confident in the campaign this model predicts for a given sample. It is worth mentioning that these statistics provide a more easily understood interpretation of prediction results than a confusion matrix, although the confusion matrix is still worth examining to see generally how the model performed.
+
+
+
+```{r}
+# feature analysis
+# plot the feature importance
+# get the variable importance scores from the model
+mineral.importance.rf <- varImp(rf.mod)
+# graph all relevant features
+ggplot(mineral.importance.rf$importance, aes(x=Overall, y=rownames(mineral.importance.rf$importance))) +
+  geom_bar(stat = "identity", fill = "lightblue") +
+  labs(title = "Random Forest Classifier Mineral Importance",
+       x = "Importance Score",
+       y = "Minerals",
+       caption = "The importance scores of each mineral in the model's classification of Campaign") +
+  theme_minimal()
+```
+
+The plot shows that one mineral is of utmost importance to the Random Forest model's classification of Campaign: MgO. The next two variables that are still important, though much less so, are Na2O and FeOT.
+
+
+## 3.5 Conclusions, Limitations, and Future Work.
+
+We have generated a model which can predict the Campaign of a given sample with roughly 90% confidence. In particular, the model correctly identified ~95% of the samples from the Delta Front, and mislabeled only ~10% of samples as being part of the Delta Front campaign. These results are important as they give us confidence in the model, as well as confidence in the correctness of its feature analysis.
+The feature analysis showed that the abundance of MgO is significantly more important to the prediction of Campaign than any other mineral. It also showed that Na2O and FeOT are the second and third most important minerals, respectively, for predicting the Campaign of a given sample. It is worth researching these minerals to see whether there is any connection between them and signs of water or life.
+
+# 4.0 Finding 2: Predicting Campaign using Decision Tree Modeling
+
+Using the combined LIBS+PIXL dataset, I use a decision tree model to analyze which minerals are most important in predicting the Campaign of any given sample.
+
+_These sections can be duplicated for each finding as needed._
+
+## 4.1 Data, Code, and Resources
+
+1. PIXL_LIBS_Combined.Rds is in the StudentData folder and contains a dataset combining LIBS and PIXL into one dataset, matching the campaigns in the PIXL dataset with the mineral data in the LIBS dataset. This dataset was created by team members Charlotte Peterson and Margo VanEsselstyn:
+[https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/PIXL_LIBS_Combined.Rds](https://github.rpi.edu/DataINCITE/DAR-Mars-F24/blob/main/StudentData/PIXL_LIBS_Combined.Rds)
+
+
+The dataset contains ~1400 samples, each with mineral data obtained from LIBS, location (lat/lon) columns from both PIXL and LIBS, the Sol (a Mars day) and point columns from LIBS, and a Distance column. Of these, this analysis uses only the LIBS mineral data and the PIXL Campaign data. Additionally, I needed to convert the Campaign column to a factor so that the model performs classification rather than regression. Finally, I set the seed for reproducibility of results.
+
+```{r}
+### loading in the PIXL_LIBS_Combined data
+pixl.libs.combined.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/StudentData/PIXL_LIBS_Combined.Rds")
+
+# setting the seed for the analyses
+set.seed(9554)
+
+# getting a subset of the data for classification
+pixl.libs.combined.df.4rf <- pixl.libs.combined.df[,10:18]
+pixl.libs.combined.df.4rf$PIXL.Campaign <- factor(pixl.libs.combined.df$PIXL.Campaign, labels = c("Crater Floor", "Delta Front"))
+```
+
+## 4.2 Contribution
+
+_State if this section is sole work or joint work. If joint work describe who you worked with and your contribution. You can also indicate any work by others that you reused._
+
+This section is the joint work of myself, Margo, and Charlotte. Their contribution to this section is the PIXL_LIBS_Combined dataset. My contributions are all transformations to the dataset they provided, model generation, model verification, model interpretation, and data visualization.
+
+
+
+## 4.3 Methods Description
+
+
+_Describe the data analytics methods you used and why you chose them.
+Discuss your data analytics "pipeline" from *data preparation* and *experimental design*, to *methods*, to *results*. Were you able to use pre-existing implementations? If the techniques required user-specified parameters, how did you choose what parameter values to use?_
+
+I had to remove some columns. This is because I did not believe every column in the dataset would be valuable to the analysis I was trying to do, namely seeing if mineral data could predict the Campaign of a sample. Because of this, I removed every column in the PIXL_LIBS_Combined dataset except the LIBS mineral data and the PIXL Campaign column.
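+
+The main user-specified parameter in this analysis is the complexity parameter (`cp`) passed to `rpart()`, which controls how much the tree is pruned; the value `cp = 0.03` used in the next section was chosen by inspecting trees of varying complexity. As an illustrative sketch (not part of the original analysis), `rpart`'s built-in cross-validation output could be used to double-check such a choice:
+
+```{r, eval=FALSE}
+# Illustrative sketch: grow a deliberately large tree, then inspect rpart's
+# internal cross-validation results to guide the choice of cp.
+cp.check <- rpart(PIXL.Campaign ~ ., data = pixl.libs.combined.df.4rf,
+                  method = "class", cp = 0.001)
+printcp(cp.check)  # table of cp values vs. cross-validated error (xerror)
+plotcp(cp.check)   # plot of cross-validated error against tree size/cp
+```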
+
+## 4.4 Result and Discussion
+
+```{r}
+# tree 1
+minerals.dt.mod <- rpart(PIXL.Campaign~ ., data=pixl.libs.combined.df.4rf, method = "class", cp= 0.03)
+rpart.plot(minerals.dt.mod, extra=104)
+```
+In order to produce this decision tree I used the rpart function. This function allows me to create a decision tree model to predict a categorical variable and to customize the tree that is displayed. I then used the rpart.plot function to display the tree. Each node in the decision tree contains a subset of the samples in the dataset. The topmost line in a node is the most common Campaign of the samples in that node. The two numbers in the row underneath are the percentages of samples in the node that belong to the Crater Floor and Delta Front campaigns respectively. The number on the bottom of the node is the percentage of the entire dataset contained in that node. The tree shown is the result of tweaking some parameters, as some trees are complicated to the point of not being useful and others are so simplistic that conclusions are difficult to draw from them.
+
+What we are able to tell from this is that MgO appears to be an incredibly important variable in whether or not a sample is from the Crater Floor: ~82% of samples containing less than 7.1% MgO are from the Crater Floor. Conversely, samples containing at least 7.1% MgO are very likely to be from the Delta Front (~77%). We also see that the next most discriminating mineral is FeOT, regardless of whether MgO abundance is high or low. Samples with high FeOT content are likely to be found in the Delta Front, while samples with low FeOT content are likely to be from the Crater Floor. Finally, a sample containing at least 42% SiO2, assuming its MgO content is over 7.1% and its FeOT content is less than 25%, has an 80% chance of being from the Delta Front according to the decision tree.
+
+## 4.5 Conclusions, Limitations, and Future Work.
+
+In Section 3.0 we determined the three variables the Random Forest model considers most important for predicting the class of a given sample. With the decision tree analysis, we can now describe not only which variables are most important, but also which threshold values of those variables best split the data by Campaign. With both the decision tree and the Random Forest models in agreement on the importance of MgO and FeOT, we can be more confident in the importance of those two variables. Additionally, with the added information of the critical thresholds for these variables, scientists can make better determinations about which minerals, and in what abundance, to look for when deciding which areas to explore further.
+
+
+
+# Bibliography
+
+Provide a listing of references and other sources.
+
+* Citations from literature. Give each reference a unique name combining first author last name, year, and additional letter if required, e.g. [Bennett22a]. If there is no known author, make something reasonable up.
+* Significant R packages used
+
+
+```{r}
+citation("rpart")
+citation("caret")
+citation("kableExtra")
+```
+
+
+# Appendix
+
+*Include here whatever you think is relevant to support the main content of your notebook. For example, you may have included only example figures above in your main text but include additional ones here.
+Or you may have done a more extensive investigation, and want to put more results here to document your work in the semester. Be sure to divide the appendix into appropriate sections and make the contents clear to the reader using the approaches discussed above.*
+

From 17f565c47b9bc3f157b75b0277901531f0aeaab8 Mon Sep 17 00:00:00 2001
From: mwatid
Date: Fri, 13 Dec 2024 02:10:10 -0500
Subject: [PATCH 2/2] Final Project submission

---
 .../mwatid_finalProject.Rmd | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd b/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd
index 1a48da1..245eb1b 100644
--- a/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd
+++ b/StudentNotebooks/Assignment08_FinalProjectNotebook/mwatid_finalProject.Rmd
@@ -3,16 +3,16 @@ title: "Examining Campaign Prediction Using Machine Learning on the PIXL/LIBS Combine
 author: "Dante Mwatibo"
 date: "December 13, 2024"
 output:
+  pdf_document:
+    toc: yes
+    toc_depth: '3'
+  html_notebook: default
   html_document:
     toc: yes
     toc_depth: 3
     toc_float: yes
     number_sections: yes
     theme: united
-  html_notebook: default
-  pdf_document:
-    toc: yes
-    toc_depth: '3'
 ---
 
 # DAR Project and Group Members
@@ -92,6 +92,10 @@ if (!require("caret")) {
   install.packages("caret")
   library(caret)
 }
+if (!require("tinytex")) {
+  install.packages("tinytex")
+  library(tinytex)
+}
 if (!require("kableExtra")) {
   install.packages("kableExtra")
   library(kableExtra)
 }
@@ -276,7 +280,7 @@ I had to remove some columns. This is because I did not believe every column in
 
 ```{r}
 # tree 1
 minerals.dt.mod <- rpart(PIXL.Campaign~ ., data=pixl.libs.combined.df.4rf, method = "class", cp= 0.03)
-rpart.plot(minerals.dt.mod, extra=104)
+rpart.plot(minerals.dt.mod, extra=104, main="Decision Tree for Determining Campaign of Samples")
 ```
 In order to produce this decision tree I used the rpart function. This function allows me to create a decision tree model to predict a categorical variable and to customize the tree that is displayed. I then used the rpart.plot function to display the tree. Each node in the decision tree contains a subset of the samples in the dataset. The topmost line in a node is the most common Campaign of the samples in that node. The two numbers in the row underneath are the percentages of samples in the node that belong to the Crater Floor and Delta Front campaigns respectively. The number on the bottom of the node is the percentage of the entire dataset contained in that node. The tree shown is the result of tweaking some parameters, as some trees are complicated to the point of not being useful and others are so simplistic that conclusions are difficult to draw from them.