---
title: "RPI github and Mars 2020 PIXL"
subtitle: "DAR Assignment 1"
author: "Doña Roberts"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document: default
  html_document:
    toc: true
    number_sections: true
    df_print: paged
---
```{r setup, include=FALSE}
# REQUIRE R PACKAGE INSTALLATIONS
# This section installs packages if they are not already installed.
# This block will not be shown in the knitted file.

# RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!

# Set the default CRAN repository
local({r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos=r)
})

if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}

if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

# Required packages for M20 PIXL analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}

if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}

if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}

if (!require("ggrepel")) {
  install.packages("ggrepel")
  library(ggrepel)
}

if (!require("farver")) {
  install.packages("farver")
  library(farver)
}

if (!require("labeling")) {
  install.packages("labeling")
  library(labeling)
}

knitr::opts_chunk$set(echo = TRUE)

```

# Introductory Data Analytics Research Notebook

This notebook is broken into two main parts:

* Part 1: A basic introduction to github and RStudio Server
* Part 2: An introduction to the Mars 2020 PIXL dataset

The RPI github repository for all the code and data required for this notebook may be found at:

* https://github.rpi.edu/DataINCITE/DAR-Mars-F24


## BEFORE YOU BEGIN: github account setup

To contribute to any RPI github repository or read private repos you _must_ validate your RPI github ID and send a confirmation email to John Erickson at `erickj4@rpi.edu`. Please do the following **now**:

**Enabling 2FA on the RPI github and saving personal access tokens, etc.**

* Browse to http://github.rpi.edu
* Log in using your RPI credentials
* Enable github two-factor authentication (2FA)
    * Under "Settings" -> "Password and authentication"
    * Select "Authenticator app" (Duo or Google Authenticator are recommended)
    * Follow the steps to set up the authenticator app (may involve scanning a QR code)
    * See directions for 2FA at https://itssc.rpi.edu/hc/en-us/articles/360004801811-GitHub-Enterprise-Overview#2fa
    * **CRITICAL:** Make sure to save your **recovery codes** in a safe place! Recovery codes can be used to access your account in the event you lose access to your device and cannot receive two-factor authentication codes.
* Create and save a *personal access token*
    * Under "Settings" -> "Developer settings"
    * Select "Personal access tokens"
    * Click on "Generate new token (classic)"
    * Set an expiration period for the end of the Fall 2024 term
    * Enable everything (check the left-most boxes)
    * Generate (green button)
    * SAVE THE RESULT! You won't be able to see it again...
    * _Use this token when command-line git asks you for a password_
* **PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!**

# DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY

In this assignment we're asking you to

* clone the `DAR-Mars-F24` github repository,
* create a personal branch using git,
* create a new notebook that includes your answers to questions in this notebook,
* make additions to the repository by adding your notebook to the repository.

_The instructions which follow explain how to accomplish this._

**For DAR Fall 2024** you *must* be using RStudio Server on the IDEA Cluster. Instructions for accessing "The Cluster" appear at the end of this notebook. Don't forget to validate your RPI github ID as above and email `erickj4@rpi.edu`.

## Cloning an RPI github repository

The recommended procedure for cloning and using this repository is as follows:

* Access the RPI network via VPN
    * See https://itssc.rpi.edu/hc/en-us/articles/360008783172-VPN-Connection-and-Installation for information

* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/
    * You must be on the RPI VPN!!
* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel).
    * You now see the Linux shell on the IDEA Cluster
    * `cd` (change directory) to enter your home directory using: `cd ~`
    * Type `pwd` to confirm
    * NOTE: Advanced users may use `ssh` to directly access the Linux shell from a macOS or Linux command line
* Type `git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24` from within your `home` directory
    * Enter your RCS ID and your saved personal access token when asked
    * This will create a new directory `DAR-Mars-F24`
* In the Linux shell, `cd` to `DAR-Mars-F24/StudentNotebooks/Assignment01`
    * Type `ls -al` to list the current contents
    * Don't be surprised if you see many files!
* In the Linux shell, type `git checkout -b dar-yourrcs` where `yourrcs` is your RCS ID
    * For example, if your RCS ID is `erickj4`, your new branch should be `dar-erickj4`
    * It is _critical_ that you include your RCS ID in your branch name!
* Back in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment01` directory via the **Files** panel (lower right panel)
    * Under the **More** menu, set this to be your R working directory
    * Setting the correct working directory is essential for interactive R use!

## REQUIRED FOR ASSIGNMENT 1

1. In RStudio, make a **copy** of `dar-f24-assignment1-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!**
    * Open `darf24-assignment1-template.Rmd`
    * **Save As...** using a new filename that includes your RCS ID
    * Example filename for user `erickj4`: `erickj4-assignment1-f24.Rmd`
    * POINTS OFF IF:
        * You don't create a new filename!
        * You don't include your RCS ID!
        * You include `template` in your new filename!
2. Edit your new notebook using RStudio and save
    * Change the `title:` and `subtitle:` headers (at the top of the file)
    * Change the `author:`
    * Don't bother changing the `date:`; it should update automagically...
    * **Save** your changes
3. Use the RStudio `Knit` command to create an HTML file; repeat as necessary
    * Use the down arrow next to the word `Knit` and select **Knit to HTML**
    * You may also knit to PDF...
4. In the Linux terminal, use `git add` to add each new file you want to add to the repository
    * Type: `git add yourfilename.Rmd`
    * Type: `git add yourfilename.html` (created when you knitted)
    * Add your PDF if you also created one...
5. Continue making changes to your personal notebook
    * Add code where specified
    * Answer questions where indicated.
6. When you're ready, commit your changes in the Linux terminal:
    * Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes
    * This commits your changes to your local repo, and sets the stage for your next operation.
7. Finally, push your commits to the RPI github repo
    * Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in)
    * Enter your RCS ID and personal access token (as a password) when asked.
    * Your changes are now safely on the RPI github.
8. **REQUIRED:** On the RPI github, submit a pull request.
    * In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git and log in using 2FA
    * In the branch selector drop-down (by default says **main**), select your branch
    * **Submit a pull request for your branch**
    * One of the DAR instructors will merge your branch, and your new files will be added to the main branch of the repo.

Please also see these handy github "cheatsheets":

* https://education.github.com/git-cheat-sheet-education.pdf

# DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset

This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment.

PIXL is a microfocus X-ray fluorescence (XRF) instrument that measures elemental chemistry at sub-millimeter scales. This is achieved by focusing an X-ray beam to a small spot (~150 µm), scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of XRF measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument.

In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps.

* More about the PIXL instrument: https://an.rsl.wustl.edu/help/Content/About%20the%20mission/M20/Instruments/M20%20PIXL.htm
* Raw PIXL data bundle: https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/

## Load the PIXL Data and display summary

Here is the Mars PIXL data. Take note of the variables, their types, and distributions.

```{r}
# Saved PIXL data with locations added
pixl.df <- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds")

# Convert location to a number
pixl.df$location <- as.numeric(pixl.df$location)

# Automatically convert all character columns to factors
pixl.df[sapply(pixl.df, is.character)] <-
  lapply(pixl.df[sapply(pixl.df,
                        is.character)], as.factor)

# Show summary of the data
summary(pixl.df)

```
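
The string-to-factor conversion used above can be tried out on a small made-up data frame first. The following is only a sketch: `toy.df` and its columns are hypothetical and are not part of the PIXL data. It applies the same `sapply()`/`lapply()` idiom and then checks the resulting column types.

```{r}
# Hypothetical toy data frame; the values are invented and are NOT PIXL data
toy.df <- data.frame(
  name     = c("a", "b", "c"),
  type     = c("x", "y", "x"),
  location = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Convert one column to numeric, as is done for `location` above
toy.df$location <- as.numeric(toy.df$location)

# Find the remaining character columns and convert each one to a factor
is_chr <- sapply(toy.df, is.character)
toy.df[is_chr] <- lapply(toy.df[is_chr], as.factor)

# Check the resulting column types
str(toy.df)
```

Converting `location` before the factor step matters: once a column is numeric it is no longer picked up by `is.character`.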


Create a matrix containing the measurements without any metadata to prepare for clustering. Here we deliberately do not scale the data to get preliminary results.

```{r}
# Prepare dataset for clustering by selecting specific columns of interest and putting them in a matrix
pixl_trim.mat <- pixl.df %>%
  dplyr::select(c("Na20","Mgo","Al203","Si02",
                  "P205","S03","Cl","K20","Cao","Ti02",
                  "Cr203","Mno","FeO-T")) %>% as.matrix()
summary(pixl_trim.mat)
```
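
To get a feel for what leaving the data unscaled implies, the sketch below compares column variances before and after `scale()` on an invented two-column matrix. The matrix `toy.mat` and its column names are assumptions made for illustration only. They are not PIXL measurements.

```{r}
# Invented two-column matrix: one wide-ranging column and one narrow one
set.seed(1)
toy.mat <- cbind(big   = rnorm(20, mean = 40, sd = 10),
                 small = rnorm(20, mean = 0.5, sd = 0.1))

# Unscaled: the column with the larger spread dominates Euclidean distances
apply(toy.mat, 2, var)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy.scaled <- scale(toy.mat)
apply(toy.scaled, 2, var)
```

With unscaled data, k-means and PCA are driven mostly by the columns with the largest spread. After scaling, every column contributes on an equal footing.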

# Clustering

Our first analysis goal is to cluster the mineralogy data using K-means and pick the appropriate number of clusters.

Here we recall the function `wssplot` we created in MATP-4400 (IDM) to examine cluster sizes in order to perform the "elbow" test. The function takes as its arguments a matrix, the maximum number of clusters and a random seed. It runs k-means for each value of k from 1 up to that maximum and plots the k-means objective function (the total within-cluster sum of squares) against k.

NOTE: The basic syntax for creating a user-defined function in R is:

`output <- function(arguments){ do stuff }`
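
As a concrete instance of that syntax, here is a minimal sketch of a user-defined function with a default argument. The name `value_range` is hypothetical and is not used elsewhere in this notebook.

```{r}
# Hypothetical helper: the range (max minus min) of a numeric vector
value_range <- function(x, na.rm = TRUE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

value_range(c(2, 7, 4))   # returns 5
```

The last expression evaluated inside the braces is the function's return value, so no explicit `return()` is needed here.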

The following plot shows the K-Means objective value for up to eight clusters.

```{r}
# A user-defined function to examine clusters and plot the results
wssplot <- function(data, nc=15, seed=10){
  wss <- data.frame(cluster=1:nc, quality=c(0))
  for (i in 1:nc){
    set.seed(seed)
    wss[i,2] <- kmeans(data, centers=i)$tot.withinss
  }
  ggplot(data=wss,aes(x=cluster,y=quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

# Apply `wssplot()` to our PIXL data
wssplot(pixl_trim.mat, nc=8, seed=2)
```


Based on where the "elbow" occurs, it looks like `5` might be a good `k` choice for k-means clustering. But for the sake of simplifying this assignment and because there are few data points, we are going to examine the solution with k=3 clusters instead.
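
The quantity plotted by `wssplot()` is `tot.withinss`, the sum of squared distances from each point to the center of its assigned cluster. The sketch below recomputes that value by hand. It uses invented two-dimensional points (`toy.pts`, not PIXL data) so that it can be run on its own.

```{r}
# Invented data: three loose groups in two dimensions (NOT PIXL data)
set.seed(10)
toy.pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2),
                 matrix(rnorm(40, mean = 10), ncol = 2))

toy.km <- kmeans(toy.pts, centers = 3)

# Recompute the objective: squared distance from each point to its own center
assigned.centers <- toy.km$centers[toy.km$cluster, , drop = FALSE]
wss.by.hand <- sum((toy.pts - assigned.centers)^2)

# The two values should agree up to floating-point error
c(by_hand = wss.by.hand, kmeans = toy.km$tot.withinss)
```

Because adding clusters can only shrink these distances, the curve always falls as k grows. The "elbow" is where the improvement starts to level off.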

## k-means Clustering

We create the final clustering with 3 clusters.

```{r}
# Use our chosen 'k' to perform k-means clustering
set.seed(2)
k <- 3
km <- kmeans(pixl_trim.mat,k)

```
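
k-means starts from randomly chosen centers, so a single run can settle on a local optimum that depends on the seed. `kmeans()` also accepts an `nstart` argument that tries several random starts and keeps the best result. The sketch below, which uses invented data (`toy.x`) rather than the PIXL matrix, compares the objective for one start against 25 starts. It is an optional variation, not the method used in this assignment.

```{r}
# Invented data (NOT PIXL data): 60 points in 3 dimensions
set.seed(2)
toy.x <- matrix(rnorm(180), ncol = 3)

# One random start versus the best of 25 random starts
set.seed(2)
single.km <- kmeans(toy.x, centers = 3, nstart = 1)
set.seed(2)
multi.km <- kmeans(toy.x, centers = 3, nstart = 25)

# Compare the objective values; a lower value means tighter clusters
c(single = single.km$tot.withinss, multi = multi.km$tot.withinss)
```

If the two values differ noticeably on your own data, the clustering is sensitive to initialization and a larger `nstart` is probably worth the extra computation.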

## Examine cluster means

Below is a heat map of the cluster centers with rows and columns clustered. We keep the scale the same as in the original data.

```{r}
pheatmap(km$centers,scale="none")
```

Notice how the means of the clusters vary.

## Perform PCA on PIXL Data

We're now ready to perform PCA. Because we deliberately left the data unscaled for k-means, we also set `scale=FALSE` here so that both analyses use the same values.

We first show a [Scree plot](https://en.wikipedia.org/wiki/Scree_plot) to understand the explained variance by principal component. Note that the elbow in the Scree plot should roughly match the one you saw in k-means.

```{r}
# Perform the PCA on the matrix `pixl_trim.mat` we created earlier

pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale=FALSE)

# Generate the Scree plot
ggscreeplot(pixl_trim.mat.pca)
```
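
The proportions shown in a Scree plot can also be computed directly from a `prcomp` result, since the variance of each principal component is the square of its `sdev` entry. The sketch below uses the built-in `iris` measurements as a stand-in so that it runs without the PIXL file. The object names are illustrative only.

```{r}
# Built-in `iris` measurements stand in for the PIXL matrix here
toy.pca <- prcomp(as.matrix(iris[, 1:4]), scale. = FALSE)

# Variance of each principal component is the square of its standard deviation
pc.var <- toy.pca$sdev^2

# Proportion of total variance explained by each component, and the running total
explained <- pc.var / sum(pc.var)
round(rbind(explained, cumulative = cumsum(explained)), 3)
```

Applying the same three lines to `pixl_trim.mat.pca` would give the numbers behind the plot above.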

Make a table indicating how many samples are in each cluster.

```{r}
# Cluster sizes are in the km object produced by kmeans
cluster.df <- data.frame(cluster = seq_along(km$size), size = km$size)
kable(cluster.df,caption="Samples per cluster")
```
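
The same counts can be derived from the per-sample cluster labels with `table()`. The sketch below uses an invented label vector (`toy.cluster`) in place of `km$cluster`.

```{r}
# Invented cluster labels standing in for `km$cluster`
toy.cluster <- c(1, 3, 2, 3, 3, 1, 2, 3)

# table() counts how many samples carry each label
toy.counts <- table(cluster = toy.cluster)
toy.counts

# The same counts as a data frame, ready for kable()
as.data.frame(toy.counts, responseName = "size")
```

This form is handy when you want to cross-tabulate clusters against another column, for example `table(km$cluster, pixl.df$type)`.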


## Create a PCA Biplot using ggbiplot

Now we'll create a biplot of the data colored by cluster and labeled by rock type.

```{r message=FALSE, warning=FALSE}
# For this lab we'll create a PCA biplot the easy way using ggbiplot!
ggbiplot::ggbiplot(pixl_trim.mat.pca,
                   labels = pixl.df$type,
                   groups = as.factor(km$cluster)) +
  xlim(-2,2) + ylim(-2,2)

```

## ANSWER THESE QUESTIONS!

Add a description of each cluster here in your own words.

**Important note about the heat maps below:** The color scales are not the same across heat maps: the same specific shade of red corresponds to 50 in the first, 30 in the second, and 40 in the third.

Describe Cluster 1: Cluster 1 is igneous rock that has a much higher concentration of Si02 and a mildly higher concentration of Al203 than the rest of the igneous rock, and a lower concentration of FeO-T and Mgo. All 3 samples in Cluster 1 have identical collected data, so they overlap on the plot above.

```{r}
# Heatmap of samples in cluster 1, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 1),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```

Describe Cluster 2: Cluster 2 contains all of the sedimentary rock. This cluster has a *much* lower concentration of Si02 and a mildly higher concentration of S03 and Mgo. Interestingly enough, even though this cluster is made up of a different kind of rock than Clusters 1 and 3, its detected amount of FeO-T falls between those two clusters.

```{r}
# Heatmap of samples in cluster 2, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 2),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```

Describe Cluster 3: Cluster 3 is again igneous rock, but this time with a *lower* concentration of Si02 and Al203 and a *higher* concentration of FeO-T and Mgo. Additionally, this cluster is a *lot* more varied than the igneous rock in cluster 1. For example, while the range of sampled FeO-T is 0 in cluster 1, it is 16.81 in cluster 3.

```{r}
# Heatmap of samples in cluster 3, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 3),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```
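
Since each heat map above picks its own color scale, one way to make them directly comparable is to pass the same `breaks` and `color` vectors to every `pheatmap()` call. The sketch below shows the idea on two invented matrices (`toy.a` and `toy.b`, not PIXL data). The palette is an arbitrary choice, and the plotting step is skipped if `pheatmap` is not installed.

```{r}
# Invented matrices standing in for the per-cluster subsets (NOT PIXL data)
set.seed(3)
toy.a <- matrix(runif(12, min = 0, max = 50), nrow = 3,
                dimnames = list(NULL, paste0("v", 1:4)))
toy.b <- matrix(runif(12, min = 0, max = 30), nrow = 3,
                dimnames = list(NULL, paste0("v", 1:4)))

# One set of breaks spanning the range of both matrices
shared.range  <- range(toy.a, toy.b)
shared.breaks <- seq(shared.range[1], shared.range[2], length.out = 51)

# One color per interval between consecutive breaks
shared.colors <- colorRampPalette(c("navy", "white", "firebrick"))(50)

# Draw both heat maps on the same scale if pheatmap is installed
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(toy.a, breaks = shared.breaks, color = shared.colors,
                     scale = "none", cluster_cols = FALSE, cluster_rows = FALSE)
  pheatmap::pheatmap(toy.b, breaks = shared.breaks, color = shared.colors,
                     scale = "none", cluster_cols = FALSE, cluster_rows = FALSE)
}
```

For the PIXL data, the shared range could be taken from `range(pixl_trim.mat)` so that one shade means the same value in all three cluster heat maps.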

What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment?

The clustering results tell us which locations were similar in terms of which molecules were present and in what quantities. This was discussed in more depth in the descriptions of the clusters above.

The PCA results tell us in what ways the ratios of molecules tended to vary, both relative to each other and overall.

By relative to each other I mean that molecules with similar coefficients in PC1 or PC2 tend to vary in similar ways. For example, Mgo and FeO-T both have large positive coefficients in PC1, so we can assume that when one increases the other tends to increase as well. Si02, however, tends to vary inversely to Mgo, since its coefficient in PC1 is very negative. That is to say, PC1 suggests that samples with higher Mgo tend to have higher FeO-T and lower Si02. Looking at the heatmap below (which has been scaled by *column* to show how a particular molecule's presence varies between sample sites), we see this does indeed appear to generally be the case. It's important to note that they don't *always* vary as the coefficients would suggest. This is because we are *only* looking at the coefficients in PC1 for this example; the other PCs would help to describe the rest of the variation.

```{r}
# Looking at how the presence of molecules varies in relation to each other
pheatmap(pixl_trim.mat[,colnames(pixl_trim.mat) %in% c("Mgo","FeO-T","Si02")],
         scale="column",cluster_cols=FALSE,cluster_rows=FALSE)
```

By overall I mean that molecules with larger (farther from 0) coefficients in the early PCs are more important than molecules with smaller coefficients (closer to 0) in telling the samples apart. If you look at the range in how much of each molecule is present between sample sites, you'll notice that the molecules with a larger range have a larger coefficient in PC1. Below we'll calculate the range of the molecule with the largest and the molecule with the smallest coefficient in PC1.

```{r}
# Calculating range of molecules with largest and smallest coefficients in PC1
print(max(pixl_trim.mat[,"Si02"]) - min(pixl_trim.mat[,"Si02"]))
print(max(pixl_trim.mat[,"Mno"]) - min(pixl_trim.mat[,"Mno"]))

```

We can see Si02, which has a coefficient of -0.747, the farthest from 0, in PC1, has a range of 34.5 while Mno, which has a coefficient of 0.001, the closest to 0, in PC1, has a range of 0.59.
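
The link between a column's spread and its PC1 coefficient can be checked on made-up data. The sketch below builds an invented matrix (`toy.chem`, not PIXL data) with one wide-ranging column and two narrow ones, runs an unscaled PCA, and lists the PC1 coefficients next to each column's range. The coefficients are read from the `rotation` element of the `prcomp` result.

```{r}
# Invented data (NOT PIXL data): one wide-ranging column and two narrow ones
set.seed(4)
toy.chem <- cbind(wide   = rnorm(30, mean = 45, sd = 8),
                  medium = rnorm(30, mean = 10, sd = 2),
                  narrow = rnorm(30, mean = 0.4, sd = 0.1))

toy.chem.pca <- prcomp(toy.chem, scale. = FALSE)

# PC1 coefficients (loadings), ordered from largest to smallest magnitude
pc1 <- toy.chem.pca$rotation[, "PC1"]
pc1[order(abs(pc1), decreasing = TRUE)]

# Range of each column, for comparison with the loadings
apply(toy.chem, 2, function(x) max(x) - min(x))
```

This pattern is a consequence of leaving the data unscaled. With `scale. = TRUE` every column would have unit variance, and the coefficients would reflect correlations rather than raw spread.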

## SAVE, COMMIT and PUSH YOUR CHANGES!

When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using **steps 4-8** in **Section 2.2**, summarized here:

**In the Linux terminal:**

* `git branch`
    * To double-check that you are in your working branch
* `git add <your changed files>`
    * Your Rmd and knitted HTML (plus the PDF, if you created one)
* `git commit -m "Some useful comments"`
* `git push origin <your branch name>`

**On github:**

* Log in at https://github.rpi.edu/DataINCITE/DAR-Mars-F24
* Select your branch from the drop-down (default is **main**)
* Submit a "pull request" for your branch
    * DO NOT MERGE!!!

# APPENDIX: Accessing RStudio Server on the IDEA Cluster

The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores) plus one storage server.

* The Cluster requires RCS credentials, enabled via registration in class
    * Email John Erickson at `erickj4@rpi.edu` if you have problems
* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and compute
* Access via RPI physical network or VPN only

# More info about RStudio on our Cluster

## RStudio GUI Access:

* Use:
    * http://lp01.idea.rpi.edu/rstudio-ose/
    * http://lp01.idea.rpi.edu/rstudio-ose-3/
    * http://lp01.idea.rpi.edu/rstudio-ose-6/
    * http://lp01.idea.rpi.edu/rstudio-ose-7/
* Linux terminal accessible from within RStudio "Terminal" or via ssh (below)

## Shared Data on Cluster:

* Users enrolled in DAR have access to `/academics/MATP-4910-F24`
* Usually DAR users will see a symbolic ("soft") link in their home directories
    * If you do not, type the following in the **Terminal** via RStudio: `ln -s /academics/MATP-4910-F24/ MATP-4910-F24`
* All idea_users have access to shared storage via `/data` ("data" in your home directories)
    * You might wish to use this for data sharing in team projects...
    * ...but we recommend using github for shared code development
* Shell access to nodes: You must access the "landing pad" first, then the compute node:
    * `ssh your_rcs@lp01.idea.rpi.edu` For example: `ssh erickj4@lp01.idea.rpi.edu`
    * Then, `ssh` to the desired compute node, e.g.: `ssh idea-node-02`