This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be suppressed from output for readability. If `show <- FALSE`, the code blocks will be suppressed; if `show <- TRUE`, the code will be shown.
# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks
show <- TRUE
Executing this R notebook requires the following packages: `ggplot2`, `tidyverse`, `pheatmap`, and `reshape2`. These will be installed and loaded as necessary.
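The notebook's own setup chunk is not shown in this report, so the following is only a minimal sketch of one way to install and load packages as necessary. The helper names `missing_packages` and `install_and_load` and the choice of CRAN mirror are assumptions made for illustration.

```r
# Packages named in this report
required.packages <- c("ggplot2", "tidyverse", "pheatmap", "reshape2")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install anything missing, then attach every package
# (the default mirror below is an assumption; any CRAN mirror would do)
install_and_load <- function(pkgs, repos = "https://cloud.r-project.org") {
  to.install <- missing_packages(pkgs)
  if (length(to.install) > 0) {
    install.packages(to.install, repos = repos)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage (left commented out because it may download packages):
# install_and_load(required.packages)
```

Checking with `requireNamespace` first avoids reinstalling packages that are already available.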
This report is for the Mars 2024 Data Analytics Research group. Our goal is to create an app (“Mission Minder”) that shows unique dynamic analysis of the Mars Perseverance data.
This particular report mainly focuses on how the data analyzed in “Mission Minder” is cleaned and organized. It also takes a brief look at how the two types of features in the sample data (mineral presence and elemental compound concentration) relate, and explains why we won’t be using Lithology going forward.
This report is organized as follows:
- Section 3.0: “Finding 1: Data Organization”
- Section 4.0: “Finding 2: Correlation Between Elemental Compounds and Minerals”
- Section 5.0: “Lithology and SHERLOC match”
- Section 6.0: “Overall conclusions and suggestions”
- Section 7.0: “Bibliography”
- Section 8.0: “Appendix”
The original data for Perseverance was inconsistently organized and not always in a usable form. This section describes how the data was reorganized, re-classed, and relabeled to be more intuitive and to minimize the data manipulation needed for each individual analysis.
Here is a list of data sets, code, and resources that relate to this section. Note that clicking on a file name will take you to that file on GitHub.
- `v1_Data_Introuduction.md`
The reordering and data cleaning was done by me, Doña Roberts.
The data set `v1_libs_to_sample.Rds`, the Type meta data for `v1_libs.Rds`, and the Sol, Lat, and Lon meta data for `v1_pixl.Rds` came from data sets created by Margo VanEsselstyn and Charlotte Peterson. Their final notebooks, which document where they got the new data, can be found on GitHub by following links 3.1.d.1 and 3.1.d.2.
Additionally, David created a data set, `aqueous.Rds`, that describes the details of each of the minerals in SHERLOC. His final notebook, which documents how he created this data set, can also be found on GitHub by following link 3.1.d.3.
For the description of the resulting data sets, see the wiki page: https://github.rpi.edu/DataINCITE/DAR-Mars-F24/wiki/Dataset-Description
Since the wiki doesn’t go into detail on what changes were made (just what the end result is), we’ll summarize them now. Note that the actual file implementing these changes (with comments) is `v1_consistent_data_naming.Rmd`, which can be found on GitHub by following link 3.1.a.1.
- `libs_typed.Rds` (to be used later in separating out the earth references)
- `PIXL_LIBS_Combined.Rds` (now called “libs_to_sample”) so that it matches the naming convention of PIXL/LIBS and to make explicit whether a column came from PIXL or came from LIBS
- `libs_to_sample` columns such that LIBS meta data was first, distance between, PIXL meta data next, and LIBS elemental compound data last.
- `character` types were changed to `factor`
- `numeric` or `integer` and labels were changed to `factor`
- `v1_` with the intention that updates to these data sets will be saved with the prefix `vn_`, where “n” denotes the version. This way, if any errors occur in the updating process, the Mission Minder can still run on the older data sets while the new ones are fixed.
While I attempted to organize the data in the most useful ways, some of the changes may not be ideal, may be inconvenient, or had a valid alternative.
Here is a table of the most notable ones:
Table of Decisions
File | Part in question | Decision | Pros | Cons | Alternative |
---|---|---|---|---|---|
`v1_sherloc` | The class of mineral presences | Setting the class as `factor` | Accurate to what it is | Inconvenient to use; see 8.1 for how to convert to `numeric` | 1) Leaving class as `character` 2) Setting class as `numeric` |
`v1_sample_meta` | The existence of a separate meta data set | Separating meta data out | Cleaner, neater, less repetitive | One more data set to import | Repeating meta data in each sample data set |
`v1_libs_earth_references` | The existence of a separate earth reference data set | Separating `scct` (Earth reference) targets out | We aren’t accidentally looking at Earth rocks when trying to analyze Mars ones | One more data set to import; harder to compare `scct` data to other types of LIBS data | Keeping `scct` targets in the `v1_libs` data set |
`v1_libs_to_sample` | The inclusion of non-identifier PIXL and LIBS data | Include the extra meta and elemental compound data | Convenience. A change would come too late in the process; you’d have to go through and change any code depending on the “fluff’s” inclusion | Makes the data set more complex, more repetitive, and less clean | Removing non-identifier data from `v1_libs_to_sample` |
`v1_consistent_data_naming` | Line 153, the inclusion of all samples in `v1_lithology` | Limiting Lithology to only the first 16 samples (by several requests) | Don’t need to remove partially received samples for every analysis including PIXL data | Problematic when we receive new samples! | 1) Don’t limit it, but people need to manually remove extra rows for analysis 2) Define a variable “n” at start of file, limit to “n” samples. Increment “n” every time we get a new sample |
`v1_libs` | The inclusion of a “Cluster” column | Include cluster column | For analysis using the 4 clusters, it’s easier. Unifies clustering between people’s analyses | Doesn’t really make sense with the app, where the number of clusters is changeable and the seed is already consistent between analyses | Don’t include cluster column; instead rely on a “Cluster” variable in the app which is dependent on the number of clusters requested |
Many more decisions were made while cleaning the data, but the above table details the most important (questionable) ones.
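The second alternative listed for `v1_consistent_data_naming` (a variable “n” defined at the start of the file) could be sketched roughly as follows. The toy `lithology.df` and the name `n.samples` are stand-ins for illustration, and the sketch keeps all columns because the column selection on the real line 153 is not reproduced in this report.

```r
# Toy stand-in for lithology.df: 18 rows, where the last 2 represent
# hypothetical partially received samples
lithology.df <- data.frame(Sample = 1:18, Mineral.A = rep(c(0, 1), 9))

# Number of fully received samples; increment when a new sample arrives
n.samples <- 16

# Keep only the first n.samples rows (all columns kept in this sketch)
lithology.df <- lithology.df[seq_len(n.samples), ]

nrow(lithology.df)
```

Using `seq_len(n.samples)` rather than `1:n.samples` avoids surprises if the count is ever zero.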
In the future, we expect to receive more Sample and LIBS data from NASA. When this happens, there needs to be a way to integrate it.
Below are the steps required, and the files to reference, in order to add and clean the new data. For some of the steps, the code to complete them isn’t public in the Data or StudentData GitHub folders, so the person responsible for that code, or the place where it is documented, is identified.
For new Sample data:
1. Get the new data from NASA to update `mineral_data_static.Rds`, `abrasions_sherloc_samples.Rds`, and `sample_pixl_wide.Rds` (which is used in the creation of `pixl_sol_coordinates.Rds`).
   - For `mineral_data_static.Rds`, look at the “Lithology” report at the start of each new sample’s Sample Report. As of December 14th 2024, the specific code to create the Rds file isn’t public, so contact Dr. Erickson.
   - For `abrasions_sherloc_samples.Rds`, look at the “SHERLOC” table in each new sample’s Sample Report. Talk to Karen Rogers for details.
   - For `sample_pixl_wide.Rds`, as of December 14th 2024, there is no documentation available on GitHub explaining how this `Rds` file was created. Check with Dr. Erickson to see where this data came from and how to update it.
2. Follow instructions in Margo and Charlotte’s notebooks to update `pixl_sol_coordinates.Rds`. The Sol/Lat/Lon information they added comes from the Analyst Notebook.
3. Follow instructions in Margo and Charlotte’s notebooks to update `PIXL_LIBS_Combined.Rds` for the new LIBS points.
4. Change line 153 of `v1_consistent_data_naming.Rmd` from `lithology.df <- lithology.df[1:16,~~~]` to `lithology.df <- lithology.df[1:n,~~~]`, where `n` is the new number of samples.
5. Run `v1_consistent_data_naming.Rmd` to update `v1_sample_meta.Rds`, `v1_pixl.Rds`, `v1_sherloc.Rds`, and `v1_libs_to_sample.Rds`.
   - `v1_lithology.Rds`.
For new LIBS data:
1. Download the new moc csv file from LIBS, and update `supercam_libs_moc_loc.Rds`. As of December 14th 2024, the specific code to create the Rds file isn’t public, so contact Dr. Erickson.
2. Follow instructions in Margo and Charlotte’s notebooks to create `libs_typed.Rds` for the new LIBS points.
3. Follow instructions in Margo and Charlotte’s notebooks to update `PIXL_LIBS_Combined.Rds` for the new LIBS points.
4. Run `v1_consistent_data_naming.Rmd` to update `v1_libs.Rds`, `v1_libs_earth_references.Rds`, and `v1_libs_to_sample.Rds`.
Some recommendations for updating the data sets:
- Instead of updating “v1” to contain the new data, create a copy of `v1_consistent_data_naming.Rmd` and replace all references to “v1” with “v2” (including the title of the file). This way, instead of updating the “v1” data sets, you are creating new “v2” data sets. If something goes wrong and it breaks, you can always fall back to the previous version while you figure out how to fix the break.
- When creating `v2_consistent_data_naming.Rmd`, go through and find Margo and Charlotte’s code to update their data sets and add it to the start of the new `v2` code, so that in the future there is only one file to run when updating. I was going to do this but, as they were still improving and updating it, the code wasn’t stable enough for the transfer to make sense.
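The fall-back idea behind the `v1_`/`v2_` prefixes could be supported in code along the following lines. This is only a sketch under assumptions: the helper `read_latest_version` is hypothetical (it is not part of Mission Minder), and it assumes files are named `vN_<name>.Rds` and live in a single directory.

```r
# Hypothetical helper: read the newest "vN_<name>.Rds" that loads cleanly
read_latest_version <- function(name, dir = ".") {
  pattern <- paste0("^v[0-9]+_", name, "\\.Rds$")
  files <- list.files(dir, pattern = pattern)
  # Extract the version number N from each file name
  versions <- as.integer(sub("^v([0-9]+)_.*$", "\\1", files))
  # Try the highest version first, then fall back to older ones
  for (f in files[order(versions, decreasing = TRUE)]) {
    result <- tryCatch(readRDS(file.path(dir, f)), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("No readable version of ", name, " found in ", dir)
}

# Usage (assuming v1_libs.Rds and possibly v2_libs.Rds exist in "Data"):
# libs.df <- read_latest_version("libs", dir = "Data")
```

Sorting on the integer version rather than the file name keeps `v10_` ahead of `v2_`.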
This section looks at how the concentration of elemental compounds (PIXL data) relates to/indicates the presence of minerals (SHERLOC data) at a sample location.
We are looking at the question of how PIXL and SHERLOC relate to each other, and thus the [Exploration] –> [PIXL vs Sherloc] section of the Mission Minder.
Here is a list of data sets, code, and resources that are used in this work. Click on the file name to be sent to the file on GitHub.
First, we need to import the data sets.
Then, before we can actually calculate the correlations we need, we must combine the two data sets into a matrix with only numeric data.
## Creating pixl matrix
# removing the sample number column
pixl.matrix <- as.matrix(pixl.df[,-1])
## Creating sherloc matrix
# removing the sample number column
# converting from factor to numeric
sherloc.matrix <- as.matrix(as.data.frame(lapply(lapply(
  sherloc.df[,-1],as.character),as.numeric)))
## Combining pixl and sherloc into one matrix
# removing the column(s) with no deviation, in this case "Hydrated Carbonates"
combined.matrix <- cbind(pixl.matrix,sherloc.matrix[,-16])
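The hard-coded index `-16` above depends on the current column order. As a hedged alternative, columns with no deviation could be found programmatically. The toy matrix and its column names below are made up for illustration and do not come from the SHERLOC data.

```r
# Toy stand-in for sherloc.matrix; "Mineral.C" never varies
toy.matrix <- cbind(Mineral.A = c(0, 0.5, 1, 0.25),
                    Mineral.B = c(1, 0, 0.75, 0),
                    Mineral.C = c(0, 0, 0, 0))

# Flag columns whose standard deviation is zero
is.constant <- apply(toy.matrix, 2, sd) == 0

# Keep only the columns that vary
varying.matrix <- toy.matrix[, !is.constant, drop = FALSE]
colnames(varying.matrix)
# "Mineral.A" "Mineral.B"
```

This sketch assumes there are no missing values; with `NA`s present, `sd` would need `na.rm = TRUE`.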
With the data from the combined matrix, we calculate the Pearson correlation between all features, and then look at only the correlations between features in PIXL and features in SHERLOC.
# Calculating Pearson feature correlation
combined.cormat <- round(cor(combined.matrix),2)
# Selecting only the correlations between pixl features and sherloc features
combined.cormat <- combined.cormat[colnames(pixl.matrix),colnames(sherloc.matrix[,-16])]
This section is solely my own work.
In order to get from raw SHERLOC and PIXL data to their correlation, I simply used the `cor` function from base R to calculate the correlation between each feature in PIXL (elemental compounds) and SHERLOC (minerals) and then selected only the correlations between features of PIXL and features of SHERLOC (instead of correlations within PIXL or within SHERLOC) to find my desired correlations.
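As a side note, `cor` also accepts two matrices and then returns only the cross-correlations, which gives the same result as computing the full correlation matrix and subsetting it. The sketch below demonstrates this on random toy data. The toy matrices and the mineral column names are assumptions for illustration, not the real PIXL or SHERLOC data.

```r
set.seed(1)
# Toy stand-ins for pixl.matrix and sherloc.matrix (random values)
pixl.toy <- matrix(rnorm(30), nrow = 10,
                   dimnames = list(NULL, c("SO3", "Cr2O3", "Cl")))
sherloc.toy <- matrix(runif(20), nrow = 10,
                      dimnames = list(NULL, c("Mineral.A", "Mineral.B")))

# Approach used in this report: correlate everything, then subset
full.cormat <- cor(cbind(pixl.toy, sherloc.toy))
subset.cormat <- full.cormat[colnames(pixl.toy), colnames(sherloc.toy)]

# Equivalent shortcut: cor(x, y) correlates columns of x with columns of y
cross.cormat <- cor(pixl.toy, sherloc.toy)

all.equal(subset.cormat, cross.cormat)
# TRUE
```

Either route yields a matrix with PIXL features as rows and SHERLOC features as columns.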
In the next section, I simply use `pheatmap` to hierarchically cluster and output a heatmap of correlations.
# Creating title for heatmap
heatmap.title <- "Correlation between Elemental Concentration & Mineral Presence"
# Printing heatmap of correlation
pheatmap(combined.cormat,
         scale="none",
         treeheight_row = 10,treeheight_col = 10,
         main = heatmap.title)
Caption: The vertical axis is PIXL elemental compounds and the horizontal axis is SHERLOC minerals. Colored squares indicate the correlation between the elemental compound in that row and the mineral in that column. That is to say, if the square is red, then a high concentration of that elemental compound makes it highly likely that the mineral is present, and if the square is dark blue, then a high concentration of that elemental compound makes it highly unlikely that the mineral is present.
Here we can clearly see which elemental compounds correlate to which minerals.
For example, high SO3 concentration strongly correlates to the presence of Hydrated Mg-Fe sulfate, Mg-sulfate, Kaolinite, & Fe-Mg-clay minerals.
Similarly, high Cr2O3 concentration strongly correlates to the presence of Apatite, Spinels, Zircon/Baddeleyite, Chromite, & Ilmenite.
Overall we see that:
There is likely a chemical reason for the strong correlations found above. Some correlations are obvious even to a non-geologist, such as Chromite & Cr2O3 or several Sulfates & SO3, but other results aren’t immediately obvious, such as the low correlation between Sulfate & SO3 or Perchlorates & Cl. There are likely many more connections to be seen here by someone with more geological insight.
It would also be interesting to look at this heatmap in conjunction with the chemical formulas for the minerals listed in David’s `aqueous.Rds` file and see if that provides any new insights.
Sadly, since we don’t have data for mineral presence at LIBS targets, we can’t do a similar heatmap for that, but we may be able to use the relation between concentration of elemental compounds and presence of minerals found through sample data to hypothesize what minerals are likely present at the LIBS targets based on the elemental compound concentrations found there.
This section looks at the relationship between the Lithology data and the SHERLOC data.
We started by looking at two data sets, Lithology and SHERLOC, that both describe the presence of minerals at each sample site.
Lithology is made up of binary numeric values representing present or not present, and SHERLOC is made up of discrete numeric values representing the level of presence.
If these two match, then we only really need SHERLOC, as Lithology can be derived by checking whether a SHERLOC value is “= 0” (then “0”) or “> 0” (then “1”).
For this reason, we want to confirm that SHERLOC and Lithology match.
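The derivation described above (“= 0” gives “0”, “> 0” gives “1”) can be sketched in one line. The toy matrix and its column names are stand-ins for `sherloc.matrix`, not real data.

```r
# Toy stand-in for sherloc.matrix with levels of presence
sherloc.toy <- matrix(c(0, 0.25, 1, 0.75, 0, 0.5), nrow = 3,
                      dimnames = list(NULL, c("Mineral.A", "Mineral.B")))

# 1 where the mineral is present at any level, 0 where it is absent
binary.toy <- (sherloc.toy > 0) * 1
binary.toy
```

Multiplying the logical matrix by 1 converts `TRUE`/`FALSE` to `1`/`0` while keeping the matrix shape and column names.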
Here is a list of data sets, code, and resources that are used in this work. The data sets below can be found on GitHub by clicking on them.
First we import the Lithology data set. Note that we don’t need to import SHERLOC, since we imported that in section 4.1.
Then, we convert Lithology into a matrix, excluding the “Sample” column. Again, note that we don’t need to do this for SHERLOC, as we also did this in section 4.1.
## Creating lithology matrix
# removing the sample number column
# converting from factor to numeric
lithology.matrix <- as.matrix(as.data.frame(lapply(lapply(
  lithology.df[,-1],as.character),as.numeric)))
This section is solely my own work.
There are many ways to confirm that Lithology and SHERLOC match, but the simplest one is probably to just subtract the SHERLOC matrix from the Lithology matrix and look at the minimum and maximum outputs.
## Calculating Difference matrix
diff <- lithology.matrix - sherloc.matrix
## Finding minimum, should be 0
min(diff)
## [1] 0
## Finding maximum, should be 0.75
max(diff)
## [1] 0.75
After doing this, we get that the minimum of the difference is “0” and the maximum of the difference is “0.75”.
Since the minimum of the difference between them is “0”, we know that if Lithology is “0” then SHERLOC is also “0”, since if SHERLOC were > “0” then the difference would be < “0”.
Since the maximum of the difference between them is “0.75”, we know that if Lithology is “1” then SHERLOC is \(\geq\) “0.25”, since if SHERLOC were < “0.25” then the difference would be > “0.75”.
This means that if Lithology claims a mineral is present then SHERLOC also claims the mineral is present, and if Lithology claims a mineral is absent then SHERLOC also claims the mineral is absent.
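One of the other ways to confirm the match is to compare presence directly, cell by cell, rather than reasoning from the minimum and maximum of the difference. The sketch below uses toy matrices as stand-ins for `lithology.matrix` and `sherloc.matrix`. It is an illustrative alternative, not the check used in this report.

```r
# Toy stand-ins for lithology.matrix and sherloc.matrix
lithology.toy <- matrix(c(0, 1, 1, 1, 0, 1), nrow = 3)
sherloc.toy   <- matrix(c(0, 0.25, 1, 0.75, 0, 0.5), nrow = 3)

# TRUE only if presence (> 0) in SHERLOC agrees with Lithology everywhere
matches <- all((sherloc.toy > 0) == (lithology.toy == 1))
matches
# TRUE

# Row and column positions of any disagreements (none in this toy data)
which((sherloc.toy > 0) != (lithology.toy == 1), arr.ind = TRUE)
```

The `which(..., arr.ind = TRUE)` call is useful when the check fails, since it points to the exact sample and mineral that disagree.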
Since we can show that SHERLOC and Lithology match, we have decided to only use the SHERLOC data set in Mars Mission Minder, since Lithology is simply a less detailed version of SHERLOC.
If there are any carry-over references to “Lithology” within the Mars Mission Minder, they are actually referring to the binary version of SHERLOC.
In the future, as more samples are added, I would suggest streamlining the process of integrating new data. That is, I would recommend creating one master file that pulls in all the data in their raw form (csv’s or manually added data from sample reports), cleans and organizes it, and then outputs the finalized data sets. The current system has many intermediary steps, some of which aren’t available for viewing and editing, making it hard to integrate new data.
When the Mission Minder gets extended to the other Mars Missions, I would recommend (at the start!) sitting down and finding some uniform naming, organizing, and updating system that can work for ALL the missions. This way, when analyses between missions start to be created, the data will already line up neatly without extra manipulation required. I would recommend making this a priority, done before any analysis is created, so that the data is ready and finalized before it starts being used. Many of the things that make our current cleaning process and our finalized data sets less than ideal were caused by people using and changing the pre-existing (uncleaned) data.
If you attempt to directly convert Lithology or SHERLOC from factor to numeric using `as.numeric`, you run into a problem. The “number” factor data gets changed into the wrong “number” numeric data, as seen in the table below.
Factor # | Lithology Basic Conversion | SHERLOC Basic Conversion | Desired Conversion |
---|---|---|---|
“0” | 1 | 1 | 0 |
“0.25” | | 2 | 0.25 |
“0.5” | | 3 | 0.5 |
“0.75” | | 4 | 0.75 |
“1” | 2 | 5 | 1 |
The code chunk displayed below shows my method for getting around this.
# Converting factor "numbers" to equivalent numeric data
lithology.df[,-1] <- as.data.frame(lapply(lapply(
  lithology.df[,-1],as.character),as.numeric))
sherloc.df[,-1] <- as.data.frame(lapply(lapply(
  sherloc.df[,-1],as.character),as.numeric))
Code explained:
- The first column is excluded (`[,-1]`), since that is the “Sample” ID and it should remain as `integer`.
- We use `lapply(dataset,as.newclass)` since we are working with a data frame and are trying to individually change the class of each element (column) of the data frame. The `lapply` function (part of base R) is a convenient way to do this.
- We convert to `character` before converting to `numeric` because `factor` smoothly converts to the identical `character` (e.g., if you have the string “Blarg” as a factor, the character version will be the identical string “Blarg”) but it doesn’t smoothly convert to `numeric` (e.g., if it is the string “Blarg” as a factor, the numeric version would be `1`, assuming “Blarg” is the first factor level listed. Similarly, if the string “0” is the first factor level, it will convert to the numeric “1”). Then, since `character` smoothly converts to `numeric` if the character is a number, we can now convert without complications.
- We wrap the result in `as.data.frame`, since `lapply`’s normal output is a list. This isn’t really needed here, because R interprets the `[,-1]` to mean we want a data frame, not a list, and will give us that on its own, but for safety it’s best to include it, as there may be another case where you are trying to convert the entire data frame, and you’ll thus need to use `as.data.frame`.
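The conversion problem described in this appendix can be reproduced on a small vector. The sketch below uses a made-up factor called `presence`, not the real Lithology or SHERLOC columns, and also shows an equivalent alternative that converts the levels once and indexes by the factor.

```r
# A factor whose levels look like numbers
presence <- factor(c("0", "0.25", "1", "0.5"))

# Direct conversion returns the internal level codes, not the values
as.numeric(presence)
# 1 2 4 3

# Converting through character recovers the intended values
as.numeric(as.character(presence))
# 0.00 0.25 1.00 0.50

# Equivalent alternative: convert the levels once, then index by the factor
as.numeric(levels(presence))[presence]
# 0.00 0.25 1.00 0.50
```

The codes `1 2 4 3` arise because the levels are sorted as “0”, “0.25”, “0.5”, “1”, so “1” is the fourth level and “0.5” the third.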