From e09623626a64546546991fdf441d99e22d418cce Mon Sep 17 00:00:00 2001
From: changk2
Date: Wed, 11 Sep 2024 23:26:29 -0400
Subject: [PATCH] assignment2 PIXL dataset update

---
 .../Assignment02/changk2-assignment2-f24.Rmd  |  59 +++++++-
 .../Assignment02/changk2-assignment2-f24.html | 137 ++++++++----------
 .../Assignment02/changk2-assignment2-f24.pdf  | Bin 136941 -> 140509 bytes
 3 files changed, 114 insertions(+), 82 deletions(-)

diff --git a/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd b/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd
index b030acd..1c54b03 100644
--- a/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd
+++ b/StudentNotebooks/Assignment02/changk2-assignment2-f24.Rmd
@@ -1,15 +1,15 @@
 ---
 title: "Mars 2020 Mission Data Notebook:"
 subtitle: "DAR Assignment 2 (Fall 2024)"
-author: "Your Name Here"
+author: "Kaiyang Chang"
 date: "`r format(Sys.time(), '%d %B %Y')`"
 output:
+  pdf_document:
+    latex_engine: xelatex
   html_document:
     toc: true
     number_sections: true
     df_print: paged
-  pdf_document:
-    latex_engine: xelatex
 ---

 ```{r setup, include=FALSE}
@@ -202,6 +202,30 @@ str(lithology.matrix)

 The PIXL data provides summaries of the mineral compositions measured at selected sample sites by the PIXL instrument. Note that here we scale pixl.mat so features have mean 0 and standard deviation so results will be different than in Assignment 1.

+### Declaration:
+I used GPT-4o for part of the analysis.
+I used it to generate the analysis of the quality of k-means by cluster plot.
+
+For the heatmap and PCA biplot:
+I used GPT to help interpret some features of the heatmap and PCA biplot,
+but I did the analysis and comparison myself and used GPT only to polish the wording.
+
+### Description for data set B:
+The dataset contains 16 sample points, and 13 mineral components were measured at each sample point, including sodium oxide (Na2O), magnesium oxide (MgO), aluminum oxide (Al2O3), silicon dioxide (SiO2), phosphorus pentoxide (P2O5), sulfur trioxide (SO3), chlorine (Cl), potassium oxide (K2O), calcium oxide (CaO), titanium dioxide (TiO2), chromium(III) oxide (Cr2O3), manganese oxide (MnO) and total iron oxide (FeO-T). There are three metadata features: name (sample name), type (sample type, such as igneous or sedimentary rock), and campaign (geographic area where the sample was taken).
+
+The samples are mainly igneous or sedimentary rocks: 8 igneous, 7 sedimentary, and 1 of undefined type. The sample locations are also recorded in detail, including locations such as crater floors and delta fronts.
+
+### Scaling
+The PIXL team’s goal is to understand and explain how scaling changes the results from Assignment 1. The matrix is scaled here, but it was not scaled in Assignment 1.
+
+### Clustering data method
+I clustered the data with the k-means method and chose the optimal number of clusters. Because the results must be compared with the plots in Assignment 1, every analysis step is kept the same except for the scaling of the data.
+
+I plotted cluster quality with the wssplot() function, which shows how the total within-cluster sum of squares changes with the number of clusters, and chose three cluster centers (as in Assignment 1). The curve begins to level off after three cluster centers, which indicates that adding more centers does little to improve cluster quality. I then ran k-means clustering and began the comparison.
+
+### Creative Analysis!!!!
+Let's start!
+
 ```{r}
 # Load the saved PIXL data with locations added
 pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide.Rds")
@@ -210,12 +234,10 @@ pixl.df <- readRDS("/academics/MATP-4910-F24/DAR-Mars-F24/Data/samples_pixl_wide
 pixl.df[sapply(pixl.df, is.character)] <- lapply(pixl.df[sapply(pixl.df, is.character)], as.factor)

-# Review our dataframe
-summary(pixl.df)
-
 # Make the matrix of just mineral percentage measurements
 pixl.matrix <- pixl.df[,2:14] %>% scale()

+
 # Review the structure
 str(pixl.matrix)

@@ -233,13 +255,14 @@ wssplot <- function(data, nc=15, seed=10){
 # Apply `wssplot()` to our PIXL data
 wssplot(pixl.matrix, nc=8, seed=2)
 ```
+Comparison with the quality of k-means by cluster plot in Assignment 1:

 Smaller quality values: After scaling, the quality values (or WSS values) are significantly lower, starting at around 200. This indicates that the features have been normalized, ensuring that all features contribute equally to the clustering process. This prevents high-value features from dominating the distance calculations and ensures a more balanced approach.

 Smoother WSS curve: The reduction in quality values is much more smooth, indicating that the features have a more balanced influence on the clustering. Each feature contributes equally after scaling, leading to more consistent and reasonable clustering results.

 ```{r}
 set.seed(2)
-k <- 5
+k <- 3

 km <- kmeans(pixl.matrix,k)

@@ -254,6 +277,10 @@ More balanced color distribution: After scaling, all features are normalized to

 More reasonable clustering results: By standardizing the features to the same scale, the clustering results are no longer dominated by high-value features. Each feature contributes more equally to the cluster centers, leading to smoother transitions in the feature values and a more reliable clustering outcome.

+In the Assignment 1 heatmap, certain features with larger raw values, like SiO2, dominate the clustering process. These features disproportionately influence the cluster centers, overshadowing the impact of smaller-value features. As a result, the cluster centers are unbalanced, reflecting more of the high-value features and less of the overall data structure.
+
+In the Assignment 2 heatmap, the color distribution is more balanced, indicating that features with previously smaller values now play a more significant role in determining the cluster centers. The clustering results reflect the combined influence of all features, leading to a more representative and fair clustering outcome.
+
 ```{r}
 pca_result <- prcomp(pixl.matrix, scale=TRUE)
 pca_data <- data.frame(pca_result$x[, 1:2], cluster=km$cluster)
@@ -274,7 +301,7 @@ ggscreeplot(pixl.matrix.pca)

 ```{r}
 # clusters sizes are in the km object produced by kmeans
-cluster.df<-data.frame(cluster= 1:5, size=km$size)
+cluster.df<-data.frame(cluster= 1:3, size=km$size)

 ggbiplot::ggbiplot(pixl.matrix.pca,
                    labels = pixl.df$type,
@@ -283,6 +310,22 @@ ggbiplot::ggbiplot(pixl.matrix.pca,
 ```

 Variable Direction and Distribution: The variables (e.g., SiO2, FeO-T, Mno, etc.) are spread out in a balanced radial pattern. The arrows representing the variables have relatively equal lengths, indicating that each feature has a balanced influence on the results. Since the data has been scaled, each variable contributes similarly in the PCA calculation, preventing certain features from dominating the clustering due to their value range..
+In the Assignment 1 PCA biplot, certain features with larger raw values dominate the distribution of the principal components. As a result, these features heavily influence the direction and length of the arrows, skewing the PCA results.
+
+In the Assignment 2 PCA biplot, all features contribute more equally to the principal components, indicating a balanced influence of all features on the PCA results. This provides a more accurate and holistic view of the relationships between the different features and the resulting cluster centers.
+
+Analysis of new graphs:
+
+Combining the PCA and heatmap analyses, we can infer significant chemical composition differences within the data, which are likely related to rock types. The scree plot and k-means clustering analysis indicate that the variance in the dataset is primarily concentrated in the first few principal components, and that fewer clusters can effectively categorize the samples. The clustering results support the use of fewer clusters for data classification and complement the PCA findings. This helps to better understand the chemical composition structure and classification of the rock samples.
+
+New results from the final PCA biplot in Assignment 2:
+
+Principal Component Interpretation: The X-axis of the PCA graph (standardized PC1) explains 49.2% of the data variance, while the Y-axis (standardized PC2) explains 18.9%. Together, these two principal components explain about 68.1% of the data variance. This is much less variance than in Assignment 1.
+
+Variable Vector Orientation: The orientation of the vectors for individual components such as FeO-T, MgO, and Na2O shows their correlation with the principal components. For example, FeO-T and MgO have a strong positive correlation with the first principal component, while Al2O3, SiO2 and K2O have a strong negative correlation with the second principal component.
+However, a large number of elements could not be analyzed in Assignment 1.
+
+Rock type classification: The points labeled “igneous” are mainly concentrated along the direction of the FeO-T and MgO vectors, while the points labeled “sedimentary” lean toward the direction of the Al2O3 and SiO2 vectors. This may indicate that the igneous rocks are higher in FeO-T and MgO, while the sedimentary rocks may be rich in Al2O3 and SiO2.
 ## Data Set C: Load the LIBS Data
diff --git a/StudentNotebooks/Assignment02/changk2-assignment2-f24.html b/StudentNotebooks/Assignment02/changk2-assignment2-f24.html
index 9f4ac13..88c548e 100644
--- a/StudentNotebooks/Assignment02/changk2-assignment2-f24.html
+++ b/StudentNotebooks/Assignment02/changk2-assignment2-f24.html
@@ -9,9 +9,9 @@
-
+
-
+
 Mars 2020 Mission Data Notebook:
@@ -166,8 +166,8 @@
 Mars 2020 Mission Data Notebook:
 DAR Assignment 2 (Fall 2024)
-Your Name Here
-10 September 2024
+Kaiyang Chang
+11 September 2024
+
@@ -182,7 +182,14 @@
10 September 2024
  • 3 DAR ASSIGNMENT 2 (Part 2): Loading the Mars 2020 (M20) Datasets