
---
title: "RPI github and Mars 2020 PIXL"
subtitle: "DAR Assignment 1"
author: "Doña Roberts"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document: default
  html_document:
    toc: true
    number_sections: true
    df_print: paged
---

```{r setup, include=FALSE}
# REQUIRE R PACKAGE INSTALLATIONS
# This section installs packages if they are not already installed.
# This block will not be shown in the knitted file.
# RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!

# Set the default CRAN repository
local({r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos=r)
})

if (!require("pandoc")) {
  install.packages("pandoc")
  library(pandoc)
}

if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

# Required packages for M20 PIXL analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}

if (!require("ggbiplot")) {
  install.packages("ggbiplot")
  library(ggbiplot)
}

if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}

if (!require("ggrepel")) {
  install.packages("ggrepel")
  library(ggrepel)
}

if (!require("farver")) {
  install.packages("farver")
  library(farver)
}

if (!require("labeling")) {
  install.packages("labeling")
  library(labeling)
}

knitr::opts_chunk$set(echo = TRUE)
```

# Introductory Data Analytics Research Notebook

This notebook is broken into two main parts:

* Part 1: A basic introduction to github and RStudio Server

* Part 2: An introduction to the Mars 2020 PIXL dataset

The RPI github repository for all the code and data required for this notebook may be found at:

* https://github.rpi.edu/DataINCITE/DAR-Mars-F24

## BEFORE YOU BEGIN: github account setup

To contribute to any RPI github repository or read private repos you _must_ validate your RPI github.com ID and send a confirmation email to John Erickson at `erickj4@rpi.edu`. Please do the following **now**:

**Enabling 2FA on the RPI github and saving personal access tokens, etc.**

* Browse to http://github.rpi.edu
* Log in using your RPI credentials
* Enable github two-factor authentication (2FA)
    * Under "Settings" -> "Password and authentication"
    * Select "Authenticator app" (Duo or Google Authenticator are recommended)
    * Follow the steps to set up the authenticator app (this may involve scanning a QR code)
    * See directions for 2FA at https://itssc.rpi.edu/hc/en-us/articles/360004801811-GitHub-Enterprise-Overview#2fa
    * **CRITICAL:** Make sure to save your **recovery codes** in a safe place! Recovery codes can be used to access your account in the event you lose access to your device and cannot receive two-factor authentication codes.
* Create and save a *personal access token*
    * Under "Settings" -> "Developer settings"
    * Select "Personal access tokens"
    * Click on "Generate new token (classic)"
    * Set an expiration period for the end of the Fall 2024 term
    * Enable everything (check the left-most boxes)
    * Generate (green button)
    * SAVE THE RESULT! You won't be able to see it again...
    * _Use this token when command-line git asks you for a password_
* **PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!**

# DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY

In this assignment we're asking you to

* clone the `DAR-Mars-F24` github repository,
* create a personal branch using git,
* create a new notebook that includes your answers to questions in this notebook, and
* add your notebook to the repository.

_The instructions which follow explain how to accomplish this._

**For DAR Fall 2024** you *must* be using RStudio Server on the IDEA Cluster. Instructions for accessing "The Cluster" appear at the end of this notebook. Don't forget to validate your RPI github ID as above and email `erickj4@rpi.edu`.

### Cloning an RPI github repository

The recommended procedure for cloning and using this repository is as follows:

* Access the RPI network via VPN
    * See https://itssc.rpi.edu/hc/en-us/articles/360008783172-VPN-Connection-and-Installation for information
* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/
    * You must be on the RPI VPN!!
* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel).
    * You now see the Linux shell on the IDEA Cluster
    * `cd` (change directory) to enter your home directory using: `cd ~`
    * Type `pwd` to confirm
    * NOTE: Advanced users may use `ssh` to directly access the Linux shell from a macOS or Linux command line
* Type `git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24` from within your `home` directory
    * Enter your RCS ID and your saved personal access token when asked
    * This will create a new directory `DAR-Mars-F24`
* In the Linux shell, `cd` to `DAR-Mars-F24/StudentNotebooks/Assignment01`
    * Type `ls -al` to list the current contents
    * Don't be surprised if you see many files!
* In the Linux shell, type `git checkout -b dar-yourrcs` where `yourrcs` is your RCS id
    * For example, if your RCS is `erickj4`, your new branch should be `dar-erickj4`
    * It is _critical_ that you include your RCS id in your branch id!
* Back in the RStudio Server UI, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment01` directory via the **Files** panel (lower right panel)
    * Under the **More** menu, set this to be your R working directory
    * Setting the correct working directory is essential for interactive R use!

## REQUIRED FOR ASSIGNMENT 1

1. In RStudio, make a **copy** of the `dar-f24-assignment1-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!**
    * Open `dar-f24-assignment1-template.Rmd`
    * **Save As...** using a new filename that includes your RCS ID
    * Example filename for user `erickj4`: `erickj4-assignment1-f24.Rmd`
    * POINTS OFF IF:
        * You don't create a new filename!
        * You don't include your RCS ID!
        * You include `template` in your new filename!
2. Edit your new notebook using RStudio and save
    * Change the `title:` and `subtitle:` headers (at the top of the file)
    * Change the `author:`
    * Don't bother changing the `date:`; it should update automagically...
    * **Save** your changes
3. Use the RStudio `Knit` command to create an HTML file; repeat as necessary
    * Use the down arrow next to the word `Knit` and select **Knit to HTML**
    * You may also knit to PDF...
4. In the Linux terminal, use `git add` to add each new file you want to add to the repository
    * Type: `git add yourfilename.Rmd`
    * Type: `git add yourfilename.html` (created when you knitted)
    * Add your PDF if you also created one...
5. Continue making changes to your personal notebook
    * Add code where specified
    * Answer questions where indicated.
6. When you're ready, in Linux commit your changes:
    * Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes
    * This commits your changes to your local repo, and sets the stage for your next operation.
7. Finally, push your commits to the RPI github repo
    * Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in)
    * Enter your RCS ID and personal access token (as a password) when asked.
    * Your changes are now safely on the RPI github.
8. **REQUIRED:** On the RPI github, submit a pull request.
    * In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git and log in using 2FA
    * In the branch selector drop-down (by default says **main**), select your branch
    * **Submit a pull request for your branch**
    * One of the DAR instructors will merge your branch, and your new files will be added to the main branch of the repo.

Please also see these handy github "cheatsheets":

* https://education.github.com/git-cheat-sheet-education.pdf

# DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset

This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment.

PIXL is a microfocus X-ray fluorescence (XRF) instrument that measures elemental chemistry at sub-millimeter scales. This is achieved by focusing an X-ray beam to a small spot (~150 µm), scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of XRF measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument.

In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps.

* More about the PIXL instrument: https://an.rsl.wustl.edu/help/Content/About%20the%20mission/M20/Instruments/M20%20PIXL.htm
* Raw PIXL data bundle: https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/

## Load the PIXL Data and display summary

Here is the Mars PIXL data. Take note of the variables, their types, and distributions.

```{r}
# Saved PIXL data with locations added
pixl.df <- readRDS("~/DAR-Mars-F24/Data/samples_pixl_wide.Rds")

# Convert location to a number
pixl.df$location <- as.numeric(pixl.df$location)

# Convert all character (string) columns to factors
pixl.df[sapply(pixl.df, is.character)] <-
  lapply(pixl.df[sapply(pixl.df, is.character)], as.factor)

# Show summary of the data
summary(pixl.df)
```
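The string-to-factor conversion used above can be tried on a small made-up data frame. This is only a sketch: `toy.df` and its columns are hypothetical and are not part of the PIXL data.

```{r}
# Hypothetical toy data frame (NOT the PIXL data)
toy.df <- data.frame(
  sample = c("a", "b", "c"),
  type   = c("Igneous", "Sedimentary", "Igneous"),
  value  = c(45.1, 20.3, 50.2),
  stringsAsFactors = FALSE
)

# Find the character columns once, then convert only those columns
toy.is_chr <- vapply(toy.df, is.character, logical(1))
toy.df[toy.is_chr] <- lapply(toy.df[toy.is_chr], as.factor)

# Character columns are now factors; the numeric column is unchanged
str(toy.df)
```

Computing the column mask once, as here, avoids evaluating `is.character` twice; the result should be the same as the two-call form.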

Create a matrix containing the measurements without any metadata to prepare for clustering. Here we deliberately do not scale the data, to get preliminary results.

```{r}
# Prepare dataset for clustering by selecting specific columns of interest and putting them in a matrix
pixl_trim.mat <- pixl.df %>%
  dplyr::select(c("Na20","Mgo","Al203","Si02",
                  "P205","S03","Cl","K20","Cao","Ti02",
                  "Cr203","Mno","FeO-T")) %>% as.matrix()
summary(pixl_trim.mat)
```

# Clustering

Our first analysis goal is to cluster the mineralogy data using k-means and pick the appropriate number of clusters.

Here we recall the function `wssplot` we created in MATP-4400 (IDM) to examine candidate numbers of clusters in order to perform the "elbow" test. The function takes as its arguments a matrix, the maximum number of clusters, and a random seed. It creates clusters for each possible value of k and plots the k-means objective function.

NOTE: The basic syntax for creating a user-defined function in R is:

`output <- function(arguments){ do stuff }`

The following plot shows the k-means objective value for up to eight clusters.

```{r}
# A user-defined function to examine clusters and plot the results
wssplot <- function(data, nc=15, seed=10){
  wss <- data.frame(cluster=1:nc, quality=c(0))
  for (i in 1:nc){
    # Reset the seed so every value of k starts from the same random state
    set.seed(seed)
    wss[i,2] <- kmeans(data, centers=i)$tot.withinss
  }
  ggplot(data=wss,aes(x=cluster,y=quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

# Apply `wssplot()` to our PIXL data
wssplot(pixl_trim.mat, nc=8, seed=2)
```


Based on where the "elbow" occurs, it looks like `5` might be a good `k` choice for k-means clustering. But for the sake of simplifying this assignment and because there are few data points, we are going to examine the solution with k=3 clusters instead.
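To make the "elbow" idea concrete, here is a minimal sketch on synthetic data, assuming three well-separated groups. The names `toy.mat`, `toy.wss` and `toy.drop` are hypothetical, and the data are random numbers rather than PIXL measurements.

```{r}
set.seed(1)
# Synthetic data (illustrative only): three well-separated 2-D groups of 30 points each
toy.mat <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares (the k-means objective) for k = 1..6
toy.wss <- sapply(1:6, function(k) {
  set.seed(1)
  kmeans(toy.mat, centers = k, nstart = 25)$tot.withinss
})

# Drop in the objective from each k to the next;
# the big drops come before the elbow, the small ones after it
toy.drop <- -diff(toy.wss)

round(toy.wss, 1)
round(toy.drop, 1)
```

With groups this cleanly separated, the objective should fall sharply up to k=3 and flatten afterwards. Real data such as the PIXL samples rarely give an elbow this sharp, which is why the choice of k remains a judgment call.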


## k-means Clustering

We create the final clustering with 3 clusters.

```{r}
# Use our chosen 'k' to perform k-means clustering
set.seed(2)
k <- 3
km <- kmeans(pixl_trim.mat,k)
```

## Examine cluster means

Below is a heat map of the cluster centers with rows and columns clustered. We keep the scale the same as in the original data.

```{r}
pheatmap(km$centers,scale="none")
```

Notice how the means of the clusters vary.

## Perform PCA on PIXL Data

We're now ready to perform PCA. Note that we did not scale the data above, so we set `scale=FALSE` to stay consistent with the clustering.

We first show a [Scree plot](https://en.wikipedia.org/wiki/Scree_plot) to understand the explained variance by principal component. Note the elbow in the Scree plot should roughly match the one you saw in k-means.

```{r}
# Perform the PCA on the matrix `pixl_trim.mat` we created earlier
pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale=FALSE)

# Generate the Scree plot
ggscreeplot(pixl_trim.mat.pca)
```
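The quantity a scree plot displays can also be computed by hand from the `sdev` component that `prcomp` returns. The sketch below uses synthetic data (the `toy.*` names are hypothetical) and also shows, as an illustration only, how much the scaling choice can change the picture when one column has a far larger spread than the others.

```{r}
set.seed(1)
# Synthetic data (illustrative only): three independent columns with very different spreads
toy.x <- cbind(
  big   = rnorm(50, sd = 10),
  mid   = rnorm(50, sd = 1),
  small = rnorm(50, sd = 0.1)
)

toy.pca.raw    <- prcomp(toy.x, scale. = FALSE)
toy.pca.scaled <- prcomp(toy.x, scale. = TRUE)

# Proportion of variance explained by each principal component
toy.pve.raw    <- toy.pca.raw$sdev^2 / sum(toy.pca.raw$sdev^2)
toy.pve.scaled <- toy.pca.scaled$sdev^2 / sum(toy.pca.scaled$sdev^2)

round(toy.pve.raw, 3)
round(toy.pve.scaled, 3)
```

Without scaling, the first component should be dominated by the high-variance column; with scaling, the variance is spread much more evenly. Which behavior is preferable depends on whether differences in spread between variables are meaningful for the analysis.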

Make a table indicating how many samples are in each cluster.

```{r}
# Cluster sizes are in the km object produced by kmeans
cluster.df <- data.frame(cluster = 1:3, size = km$size)
kable(cluster.df, caption = "Samples per cluster")
```
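Cluster sizes can also be recovered directly from a vector of cluster assignments with `table()`, and the same function can cross-tabulate clusters against a label such as rock type. The vectors below are made up for illustration and are not the PIXL assignments.

```{r}
# Hypothetical cluster assignments and labels (NOT the PIXL results)
toy.cluster <- c(1, 2, 2, 3, 3, 3)
toy.type    <- c("Igneous", "Sedimentary", "Sedimentary",
                 "Igneous", "Igneous", "Igneous")

# Number of samples per cluster
toy.sizes <- table(cluster = toy.cluster)

# Clusters (rows) against labels (columns)
toy.cross <- table(cluster = toy.cluster, type = toy.type)

toy.sizes
toy.cross
```

A cross-tabulation like this is one quick way to check whether clusters line up with a known category.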

## Create a PCA Biplot using ggbiplot

Now we'll create a biplot of the data colored by cluster and labeled by rock type.

```{r message=FALSE, warning=FALSE}
# For this lab we'll create a PCA biplot the easy way using ggbiplot!
ggbiplot::ggbiplot(pixl_trim.mat.pca,
                   labels = pixl.df$type,
                   groups = as.factor(km$cluster)) +
  xlim(-2,2) + ylim(-2,2)
```

## ANSWER THESE QUESTIONS!

Add a description of each cluster here in your own words.

**Important Note about Heat maps below:** The color scale isn't the same across heatmaps. In the first, a specific shade of red corresponds to 50; in the second, that same shade corresponds to 30; and in the third, to 40.
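If a shared color scale were wanted, one possible approach (a sketch, not what this notebook does) is to pass the same `breaks` and `color` vectors to every `pheatmap` call. The matrices `toy.a` and `toy.b` below are hypothetical stand-ins for per-cluster subsets.

```{r}
# Hypothetical per-cluster matrices (NOT the PIXL data)
toy.a <- matrix(c(50, 10, 5, 45, 12, 6), nrow = 2, byrow = TRUE)
toy.b <- matrix(c(30, 20, 8, 28, 22, 9), nrow = 2, byrow = TRUE)

# One set of breaks spanning the values of both matrices
toy.limits <- range(toy.a, toy.b)
toy.breaks <- seq(toy.limits[1], toy.limits[2], length.out = 101)

# pheatmap expects one more break than there are colors
toy.colors <- colorRampPalette(c("navy", "white", "firebrick3"))(100)

# Draw both heatmaps with the shared scale (only if pheatmap is installed)
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(toy.a, breaks = toy.breaks, color = toy.colors,
                     cluster_rows = FALSE, cluster_cols = FALSE)
  pheatmap::pheatmap(toy.b, breaks = toy.breaks, color = toy.colors,
                     cluster_rows = FALSE, cluster_cols = FALSE)
}
```

With shared breaks, a given shade maps to the same value in every heatmap, at the cost of less contrast within a subset whose values span only a small part of the range.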

Describe Cluster 1: Cluster 1 is igneous rock with a much higher concentration of Si02 and a mildly higher concentration of Al203 than the rest of the igneous rock, and a lower concentration of FeO-T and Mgo. All 3 samples in Cluster 1 have identical measurements, so they overlap on the plot above.

```{r}
# Heatmap of samples in cluster 1, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 1),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```

Describe Cluster 2: Cluster 2 is all the sedimentary rock. This cluster has a *much* lower concentration of Si02 and a mildly higher concentration of S03 and Mgo. Interestingly, even though this cluster is made up of a different kind of rock than Clusters 1 and 3, its detected amount of FeO-T falls in between those two clusters.

```{r}
# Heatmap of samples in cluster 2, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 2),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```

Describe Cluster 3: Cluster 3 is again igneous rock, but this time with a *lower* concentration of Si02 and Al203 and a *higher* concentration of FeO-T and Mgo. Additionally, this cluster is a *lot* more varied than the igneous rock in Cluster 1. For example, while the range of sampled FeO-T is 0 in Cluster 1, it is 16.81 in Cluster 3.

```{r}
# Heatmap of samples in cluster 3, unscaled
pheatmap(subset(pixl_trim.mat,km$cluster == 3),scale="none",cluster_cols=FALSE,cluster_rows=FALSE)
```

What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment?

The clustering results tell us which locations were similar in which molecules were present and in what quantities. This was discussed in more depth in the descriptions of the clusters above.

The PCA results tell us in what ways the ratios of molecules tended to vary, both relative to each other and overall.

By relative to each other I mean that molecules with similar coefficients in PC1 or PC2 tend to vary in similar ways. For example, Mgo and FeO-T both have large positive coefficients in PC1, so we can assume that when one increases the other tends to increase as well. Si02, on the other hand, tends to vary inversely to Mgo, since its coefficient in PC1 is very negative. That is to say, PC1 suggests that samples with higher Mgo tend to have higher FeO-T and lower Si02. Looking at the heatmap below (which has been scaled by *column* to show how each molecule's presence varies between sample sites), we see this does indeed appear to generally be the case. It's important to note that they don't *always* vary as the coefficients would suggest. This is because we are *only* looking at the coefficients in PC1 for this example; the other PCs would help to describe the rest of the variation.

```{r}
# Looking at how the presence of molecules varies in relation to each other
pheatmap(pixl_trim.mat[,colnames(pixl_trim.mat) %in% c("Mgo","FeO-T","Si02")],
         scale="column",cluster_cols=FALSE,cluster_rows=FALSE)
```

By overall I mean that molecules with larger coefficients (farther from 0) in the early PCs are more important than molecules with smaller coefficients (closer to 0) in telling the samples apart. If you look at the range in how much of each molecule is present between sample sites, you'll notice that the molecules with a larger range have a larger coefficient in PC1. Below we'll calculate the range of the molecule with the largest and the molecule with the smallest coefficient in PC1.

```{r}
# Calculating range of molecules with largest and smallest coefficients in PC1
print(max(pixl_trim.mat[,"Si02"]) - min(pixl_trim.mat[,"Si02"]))
print(max(pixl_trim.mat[,"Mno"]) - min(pixl_trim.mat[,"Mno"]))
```

We can see Si02, which has a coefficient of -0.747, the farthest from 0, in PC1, has a range of 34.5 while Mno, which has a coefficient of 0.001, the closest to 0, in PC1, has a range of 0.59.
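The link between a column's spread and its PC1 coefficient in an unscaled PCA can be reproduced on synthetic data. This is a sketch under stated assumptions: the `toy.*` names are hypothetical and the two columns are independent random draws, not PIXL measurements.

```{r}
set.seed(2)
# Synthetic data (illustrative only): one wide-ranging column and one narrow one
toy.y <- cbind(
  wide   = runif(40, min = 20, max = 55),
  narrow = runif(40, min = 0,  max = 0.6)
)

toy.pca <- prcomp(toy.y, scale. = FALSE)

# PC1 coefficients (loadings) and the range of each column
toy.load  <- toy.pca$rotation[, "PC1"]
toy.range <- apply(toy.y, 2, function(v) diff(range(v)))

round(toy.load, 3)
round(toy.range, 2)
```

Because unscaled PCA works on raw variances, the wide-ranging column should take almost all of the PC1 weight here. The sign of a loading is arbitrary, so it is the magnitude that should be compared.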


## SAVE, COMMIT and PUSH YOUR CHANGES!

When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using **steps 4-8** in **Section 2.2**, summarized here:

**In the Linux terminal:**

* `git branch`
    * To double-check that you are in your working branch
* `git add <your changed files>`
    * Your Rmd and knitted PDF
* `git commit -m "Some useful comments"`
* `git push origin <your branch name>`

**On github:**

* Log in at https://github.rpi.edu/DataINCITE/DAR-Mars-F24
* Select your branch from drop-down (default is **main**)
* Submit a "pull request" for your branch
    * DO NOT MERGE!!!

# APPENDIX: Accessing RStudio Server on the IDEA Cluster

The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server)

* The Cluster requires RCS credentials, enabled via registration in class
    * email John Erickson for problems `erickj4@rpi.edu`
* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and compute
* Access via RPI physical network or VPN only

# More info about RStudio on our Cluster

## RStudio GUI Access:

* Use:
    * http://lp01.idea.rpi.edu/rstudio-ose/
    * http://lp01.idea.rpi.edu/rstudio-ose-3/
    * http://lp01.idea.rpi.edu/rstudio-ose-6/
    * http://lp01.idea.rpi.edu/rstudio-ose-7/
* Linux terminal accessible from within RStudio "Terminal" or via ssh (below)

## Shared Data on Cluster:

* Users enrolled in DAR have access to `/academics/MATP-4910-F24`
    * Usually DAR users will see a symbolic ("soft") link in their home directories
    * If you do not, type the following in the **Terminal** via RStudio: `ln -s /academics/MATP-4910-F24/ MATP-4910-F24`
* All idea_users have access to shared storage via `/data` ("data" in your home directories)
    * You might wish to use this for data sharing in team projects...
    * ...but we recommend using github for shared code development
* Shell access to nodes: You must access "landing pad" first, then compute node:
    * `ssh your_rcs@lp01.idea.rpi.edu`  For example: `ssh erickj4@lp01.idea.rpi.edu`
    * Then, `ssh` to the desired compute node, e.g.: `ssh idea-node-02`