DAR-Mars-F24/Instructors/dar-f24-assignment1-template-v3.Rmd
---
title: 'RPI github and Mars 2020 PIXL Example Notebook:'
author: "Your Name Here"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document: default
  html_document:
    toc: true
    number_sections: true
    df_print: paged
subtitle: DAR Assignment 1 (Fall 2024)
---
```{r setup, include=FALSE}
# Set the default CRAN repository
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos = r)
})

# Required packages for M20 PIXL analysis
# Load required packages; install if necessary
# CAUTION: DO NOT interrupt R as it installs packages!!
required_pkgs <- c("tidyverse", "stringr", "BBmisc", "pheatmap")
for (pkg in required_pkgs) {
  # Install the package only if it is not already available
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}
```
# Introductory Data Analytics Research Notebook | |
This notebook is broken into two main parts: | |
* Part 1: A basic introduction to github and RStudio Server | |
* Part 2: An introduction to the Mars 2020 PIXL dataset | |
The RPI github repository for all the code and data required for this notebook may be found at: | |
* https://github.rpi.edu/DataINCITE/DAR-Mars-F24 | |
## BEFORE YOU BEGIN | |
To contribute to any RPI github repository or read private repos you _must_ validate your RPI github.com ID and send a confirmation email to John Erickson at `erickj4@rpi.edu`. Please do the following **now**: | |
* Browse to http://github.rpi.edu | |
* Login using your RPI credentials (RCS ID) | |
* Enable github two-factor authentication (2FA) | |
* Under "Settings" -> "Password and authentication" | |
* Select "Authenticator app" (Duo or Google authenticator are recommended) | |
* _2FA is now enabled..._ | |
* Create and save a *personal access token* | |
* Under "Settings" -> "Developer settings" | |
* Select "Personal access tokens" | |
* Click on "Generate new token (classic)" | |
* Enable everything (check the left-most boxes) | |
* Select "Generate" (green button) | |
* SAVE THE RESULT! You won't be able to see it again... | |
* _Use this token when command-line git asks you for a password_ | |
* **PLEASE DO THIS IMMEDIATELY BEFORE READING ANY FURTHER!!** | |
# DAR ASSIGNMENT 1 (Part 1): CLONING A NOTEBOOK AND UPDATING THE REPOSITORY | |
In this assignment we're asking you to | |
* clone the `DAR-Mars-F24` github repository, | |
* create a personal working branch using git, | |
* copy the template and create a new notebook that includes your answers to questions in this notebook, | |
* make additions to the repository by "pushing" your notebook to the repository. | |
The instructions which follow explain how to accomplish this. | |
**For DAR Fall 2024** you *must* be using RStudio Server on the IDEA Cluster. Instructions for accessing "The Cluster" appear at the end of this notebook. Don't forget to validate your RPI github ID as above and email `erickj4@rpi.edu` | |
## Cloning an RPI github repository | |
The recommended procedure for cloning and using this repository is as follows: | |
* Access the RPI network via VPN | |
* See https://itssc.rpi.edu/hc/en-us/articles/360008783172-VPN-Connection-and-Installation for information | |
* Access RStudio Server on the IDEA Cluster at http://lp01.idea.rpi.edu/rstudio-ose/ | |
* You must be on the RPI VPN!! | |
* Access the Linux shell on the IDEA Cluster by clicking the **Terminal** tab of RStudio Server (lower left panel). | |
* You now see the Linux shell on the IDEA Cluster | |
* `cd` (change directory) to enter your home directory using: `cd ~` | |
* Type `pwd` to confirm | |
* NOTE: Advanced users may use `ssh` to directly access the Linux shell from a macOS or Linux command line | |
* Type `git clone https://github.rpi.edu/DataINCITE/DAR-Mars-F24.git` from within your `home` directory | |
* NOTE: At this point you will be asked to authenticate using your RCS ID. You have two ways to do this: | |
* Using your saved personal access token:
* Username: Enter your RCS ID
* Password: Enter the personal access token you saved (above)
* Using an authenticator app (such as Duo or Google) | |
* Username: Enter your RCS ID | |
* Password: Enter a passcode you obtain from the app | |
* After the `git clone` instruction completes, you will see a new directory: `DAR-Mars-F24` | |
* In the Linux shell, `cd` into `DAR-Mars-F24/StudentNotebooks/Assignment01` | |
* Type `ls -al` to list the current contents | |
* Don't be surprised if you see many files! | |
* In the Linux shell, type `git checkout -b dar-yourrcs` where `yourrcs` is your RCS id | |
* For example, `git checkout -b dar-erickj4` would create a new working branch for user `erickj4` | |
* This command creates a new branch using the unique name you provide | |
* It is _critical_ that you include your RCS id in your branch id | |
* _Your RCS ID is not erickj4..._ | |
* Now in the **RStudio Server UI**, navigate to the `DAR-Mars-F24/StudentNotebooks/Assignment01` directory via the **Files** panel (lower right panel) | |
* Under the **More** menu, select "Set as working directory" to set this to be your R working directory | |
* Setting the correct working directory is essential for interactive R use! | |
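A quick way to confirm the working directory from the R console is sketched below. The relative data path is the one used later in this notebook; whether it resolves depends on where you cloned the repository, so treat the second check as illustrative rather than definitive.

```{r}
# Print the current R working directory; after the steps above it should
# end in DAR-Mars-F24/StudentNotebooks/Assignment01
wd <- getwd()
print(wd)

# Check whether the data file read later in this notebook is reachable
# from here (FALSE suggests the working directory or clone location differs)
data_found <- file.exists("../Data/samples_pixl_wide.Rds")
print(data_found)
```

If the path printed is not the one you expect, repeat the "Set as working directory" step in the **Files** panel.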
## REQUIRED FOR ASSIGNMENT 1
1. In RStudio, make a **copy** of the `dar-f24-assignment1-template.Rmd` file using a *new, original, descriptive* filename that **includes your RCS ID!**
* Open `dar-f24-assignment1-template.Rmd` | |
* **Save As...** using a new filename that includes your RCS ID | |
* Example filename for user `erickj4`: `erickj4-assignment1-f24.Rmd` | |
* POINTS OFF IF: | |
* You don't create a new filename! | |
* You don't include your RCS ID! | |
* You include `template` in your new filename! | |
2. Edit your new notebook using RStudio and save | |
* Change the `title:` and `subtitle:` headers (at the top of the file) | |
* Change the `author:` | |
* Don't bother changing the `date:`; it should update automagically... | |
* **Save** your changes | |
3. Use the RStudio `Knit` command to create an HTML file; repeat as necessary | |
* Use the down arrow next to the word `Knit` and select **Knit to HTML** | |
* You may also knit to PDF... | |
4. In the Linux terminal, use `git add` to add each new file you want to add to the repository | |
* Type: `git add yourfilename.Rmd` | |
* Type: `git add yourfilename.html` (created when you knitted) | |
* Add your PDF if you also created one... | |
5. When you're ready, commit your changes in the Linux terminal:
* Type: `git commit -m "some comment"` where "some comment" is a useful comment describing your changes | |
* This commits your changes to your local repo, and sets the stage for your next operation. | |
6. Finally, push your commits to the RPI github repo | |
* Type: `git push origin dar-yourrcs` (where `dar-yourrcs` is the branch you've been working in) | |
* Your changes are now safely on the RPI github. | |
7. **REQUIRED:** On the RPI github, submit a pull request. | |
* In a web browser, navigate to https://github.rpi.edu/DataINCITE/DAR-Mars-F24 | |
* In the branch selector drop-down (by default says **master**), select your branch | |
* **Submit a pull request for your branch** | |
* One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. | |
## Confirm what you just did! | |
For this assignment you will be asked in LMS to confirm the following; for convenience, _copy them here_: | |
* The location of the github: | |
* Your github ID: | |
* The name of your new branch: | |
* The name of your new (copied) notebook: | |
Please also see these handy github "cheatsheets": | |
* https://education.github.com/git-cheat-sheet-education.pdf | |
# DAR ASSIGNMENT 1 (Part 2): Exploring the Mars 2020 (M20) PIXL Dataset | |
This part of the notebook demonstrates some basic analysis of data from the M20 PIXL (Planetary Instrument for X-ray Lithochemistry) experiment. | |
PIXL is a microfocus X-ray fluorescence (XRF) instrument that measures elemental chemistry at sub-millimeter scales. It does this by focusing an X-ray beam to a small spot (~150 µm), scanning the surface with this beam, and then measuring the induced X-ray fluorescence. PIXL observations consist of a suite of XRF measurements, context images, and metadata. The XRF measurements can be executed in a variety of geometries depending on target type and available observation time, and are accompanied by a set of images documenting the target and its position relative to the instrument.
In this notebook we will be looking at pre-processed PIXL data that is ready for your next steps. | |
* More about the PIXL instrument: https://an.rsl.wustl.edu/help/Content/About%20the%20mission/M20/Instruments/M20%20PIXL.htm | |
* Raw PIXL data bundle: https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/ | |
## Load the PIXL Data and display summary | |
```{r}
# Saved PIXL data with locations added
#samples_pixl_wide <- readRDS("/academics/MATP-4910-F24/Mars-F24/Data/samples_pixl_wide.Rds")
samples_pixl_wide <- readRDS("../Data/samples_pixl_wide.Rds")

# Let's take a look at the dataset we're analyzing:
summary(samples_pixl_wide)

# Prepare the matrix for clustering and PCA
# by selecting specific columns of interest
pixl_trim <- samples_pixl_wide %>%
  select(c(sample, name, "Na20", "Mgo", "Al203", "Si02",
           "P205", "S03", "Cl", "K20", "Cao", "Ti02",
           "Cr203", "Mno", "FeO-T"))

# Columns 3 to 14 hold the features Na20 through Mno
# (column 15, FeO-T, is not included in the matrix)
pixl_trim.mat <- as.matrix(pixl_trim[3:14])

# Label the rows with the sample index
row.names(pixl_trim.mat) <- seq_len(nrow(pixl_trim.mat))
```
# Clustering | |
The first analysis goal is to cluster the data using k-means and pick an appropriate number of clusters.
Here we recall a function we created in the MATP 4400 IDM course to run the "elbow" test. The function below takes as its arguments a matrix, the maximum number of clusters, and a random seed. It runs k-means for each value of k from 1 to the maximum and plots the k-means objective function (the total within-cluster sum of squares).
The basic syntax for creating a user-defined function in R is: | |
`output <- function(arguments){ do stuff}` | |
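As a minimal illustration of this syntax, the sketch below defines and calls a small function. The name `scale_to_percent` and its arguments are hypothetical, chosen only for this example.

```{r}
# Define a function with one required argument and one default argument
scale_to_percent <- function(x, digits = 1) {
  # Convert each value to its share of the total, in percent
  round(100 * x / sum(x), digits)
}

# Call it on a small numeric vector
scale_to_percent(c(2, 3, 5))
```

The value of the last expression in the body is what the function returns, so no explicit `return()` is needed here.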
```{r}
# A user-defined function to examine clusters and plot the results
wssplot <- function(data, nc = 15, seed = 10) {
  wss <- data.frame(cluster = 1:nc, quality = c(0))
  for (i in 1:nc) {
    set.seed(seed)
    wss[i, 2] <- kmeans(data, centers = i)$tot.withinss
  }
  ggplot(data = wss, aes(x = cluster, y = quality)) +
    geom_line() +
    ggtitle("Quality of k-means by Cluster")
}

# Apply this to our PIXL data
wssplot(pixl_trim.mat, nc = 8)
```
Based on where the "elbow" occurs, it looks like `4` might be a good `k` choice for k-means clustering. | |
NOTE: PCA and k-means should use same features, so we're using the same data structure for both. | |
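Reading an elbow off a plot can be subjective, so it may help to tabulate the same quantity. The sketch below does this on toy data (an assumption, not the PIXL matrix); all `toy_*` names are hypothetical.

```{r}
# Toy data (an assumption, not the PIXL matrix): 30 points in 3 separated groups
set.seed(10)
toy_mat <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 4),
  matrix(rnorm(40, mean = 5), ncol = 4),
  matrix(rnorm(40, mean = 10), ncol = 4)
)

# Total within-cluster sum of squares for 1 to 6 clusters
toy_wss <- sapply(1:6, function(n_clusters) {
  set.seed(10)
  kmeans(toy_mat, centers = n_clusters, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by adding each further cluster;
# the elbow is where these reductions become small
toy_elbow <- data.frame(k = 1:6,
                        wss = toy_wss,
                        drop = c(NA, -diff(toy_wss)))
print(toy_elbow)
```

The same table could be built for `pixl_trim.mat` by swapping it in for the toy matrix.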
## k-means Clustering | |
Let's determine the final clustering using _four_ clusters. | |
```{r} | |
# Use our chosen 'k' to perform k-means clustering | |
set.seed(10) | |
k <- 4 | |
km <- kmeans(pixl_trim.mat,k) | |
``` | |
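The object returned by `kmeans()` carries the cluster assignments and sizes. The sketch below shows how one might inspect them, using toy data (an assumption, not the PIXL matrix) and hypothetical `toy_*` names.

```{r}
# Toy data (an assumption, not the PIXL matrix): two well-separated groups
set.seed(10)
toy_pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)
toy_km <- kmeans(toy_pts, centers = 2, nstart = 10)

# Number of points assigned to each cluster
toy_km$size

# Cluster label for each row, in row order
toy_km$cluster

# Count rows per cluster label; this matches toy_km$size
table(toy_km$cluster)
```

Applied to the notebook's own result, `km$size` and `km$cluster` give the same information for the PIXL samples.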
## Examine cluster means
Generate a heat map of the PIXL matrix, scaled within each feature (row) and with the samples (columns) clustered.
```{r}
# pheatmap command to plot the scaled PIXL data.
# Pass the transposed PIXL matrix (features as rows) to the pheatmap function
pheatmap(t(pixl_trim.mat),
         scale = "row",
         cluster_rows = FALSE,
         main = "Basic Heatmap of PIXL data")
```
Notice how the feature values differ between groups of samples; these differences are what separate the cluster means.
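To look at the cluster means directly, one option is to plot the `centers` matrix returned by `kmeans()`. The sketch below does this on toy data (an assumption, not the PIXL matrix); the `toy_*` names and feature labels are hypothetical.

```{r}
library(pheatmap)

# Toy data (an assumption, not the PIXL matrix): 3 groups, 5 named features
set.seed(10)
toy_feat <- rbind(
  matrix(rnorm(50, mean = 0), ncol = 5),
  matrix(rnorm(50, mean = 4), ncol = 5),
  matrix(rnorm(50, mean = 8), ncol = 5)
)
colnames(toy_feat) <- paste0("feature_", 1:5)
toy_fit <- kmeans(toy_feat, centers = 3, nstart = 10)

# One row per cluster, one column per feature
toy_centers <- toy_fit$centers
rownames(toy_centers) <- paste("Cluster", 1:3)

# Heat map of the centers; values are left unscaled so the colors
# show the actual feature means of each cluster
pheatmap(toy_centers,
         scale = "none",
         main = "Cluster centers (toy data)")
```

For the PIXL analysis the analogous call would pass `km$centers`. Whether to scale is a judgment call that depends on how different the feature ranges are.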
## Perform PCA on PIXL Data | |
We perform a PCA on our PIXL matrix from earlier: | |
```{r} | |
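# Calculate our PCA using the PIXL matrix created above;
# scale. = TRUE standardizes each feature to unit variance first
pixl_trim.mat.pca <- prcomp(pixl_trim.mat, scale. = TRUE)
# Scree plot: variance captured by each principal component
plot(pixl_trim.mat.pca)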
``` | |
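The scree plot shows the component variances graphically. The sketch below shows how the proportion of variance explained and the loadings can be read off a `prcomp` result, using toy data (an assumption, not the PIXL matrix) and hypothetical `toy_*` names.

```{r}
# Toy data (an assumption, not the PIXL matrix): 20 samples, 4 features
set.seed(10)
toy_x <- matrix(rnorm(80), ncol = 4)
toy_pca <- prcomp(toy_x, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Running total, useful for deciding how many components to keep
round(cumsum(var_explained), 3)

# Loadings: how strongly each original feature contributes to each component
toy_pca$rotation
```

The same expressions applied to `pixl_trim.mat.pca` would show which PIXL features dominate the first few components.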
## Create a PCA Biplot using ggbiplot | |
```{r message=FALSE, warning=FALSE} | |
# For this lab we'll create a PCA biplot the easy way using ggbiplot! | |
ggbiplot::ggbiplot(pixl_trim.mat.pca, | |
# labels = pixl_trim$name, | |
groups = as.factor(km$cluster)) | |
``` | |
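If `ggbiplot` is not available, a similar view of the samples can be built with `ggplot2` alone by plotting the first two principal component scores colored by cluster. This sketch omits the loading arrows a biplot would draw. It uses toy data (an assumption, not the PIXL matrix) and hypothetical `toy_*` names.

```{r}
library(ggplot2)

# Toy data (an assumption, not the PIXL matrix): two groups, 4 features
set.seed(10)
toy_m <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 4),
  matrix(rnorm(40, mean = 5), ncol = 4)
)
toy_pc <- prcomp(toy_m, scale. = TRUE)
toy_cl <- kmeans(toy_m, centers = 2, nstart = 10)

# Combine the first two principal component scores with cluster labels
toy_scores <- data.frame(
  PC1 = toy_pc$x[, 1],
  PC2 = toy_pc$x[, 2],
  cluster = as.factor(toy_cl$cluster)
)

# Scatter plot of the samples in PC space, colored by cluster
toy_plot <- ggplot(toy_scores, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(size = 2) +
  ggtitle("PC1 vs PC2 by k-means cluster (toy data)")
toy_plot
```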
## Question to be completed | |
What do the clustering and PCA results tell us about the data detected by the M20 PIXL experiment? Specifically, describe how the clusters vary in terms of the features.
Put your description of each cluster here. Feel free to add graphs or analysis to verify your analysis. | |
Cluster 1 Description: | |
Cluster 2 Description: | |
Cluster 3 Description: | |
Cluster 4 Description: | |
## SAVE, COMMIT and PUSH YOUR CHANGES! | |
When you are satisfied with your edits and your notebook knits successfully, remember to push your changes to the repo using **steps 4-7** in **Section 2.2**, summarized here: | |
* `git branch` | |
* To double-check that you are in your working branch | |
* `git add <your changed files>` | |
* `git commit -m "Some useful comments"` | |
* `git push origin <your branch name>` | |
# APPENDIX: Accessing RStudio Server on the IDEA Cluster | |
The IDEA Cluster provides seven compute nodes (4x 48 cores, 3x 80 cores, 1x storage server) | |
* The Cluster requires RCS credentials, enabled via registration in class | |
* Email John Erickson at `erickj4@rpi.edu` if you have problems
* RStudio, Jupyter, MATLAB, GPUs (on two nodes); lots of storage and compute
* Access via RPI physical network or VPN only | |
# More info about RStudio on our Cluster
## RStudio GUI Access: | |
* Use: | |
* http://lp01.idea.rpi.edu/rstudio-ose/ | |
* http://lp01.idea.rpi.edu/rstudio-ose-3/ | |
* http://lp01.idea.rpi.edu/rstudio-ose-6/ | |
* http://lp01.idea.rpi.edu/rstudio-ose-7/ | |
* Linux terminal accessible from within RStudio "Terminal" or via ssh (below) | |
## Shared Data on Cluster: | |
* Users enrolled in DAR have access to `/academics/MATP-4910-F24` | |
* Usually DAR users will see a symbolic ("soft") link in their home directories | |
* If you do not, type the following in the **Terminal** via RStudio: `ln -s /academics/MATP-4910-F24/ MATP-4910-F24`
* All idea_users have access to shared storage via `/data` ("data" in your home directories) | |
* You might wish to use this for data sharing in team projects... | |
* ...but we recommend using github for shared code development | |
* Shell access to nodes: You must access "landing pad" first, then compute node: | |
* `ssh your_rcs@lp01.idea.rpi.edu` For example: `ssh erickj4@lp01.idea.rpi.edu` | |
* Then, `ssh` to the desired compute node, e.g.: `ssh idea-node-02` |