Introduction
For this assignment we were asked to experiment with clustering. The MASS library contains the data set biopsy, which records nine attributes for the breast tumors of 699 patients. The first column of the data is the patient ID number and the last column is the classification (“benign” or “malignant”) of the tumor.
Scenario
Our task is as follows: remove the ID and classification columns and replace any missing values with zero; select two clustering techniques and determine the number of clusters; cross-tabulate the two methods against each other and compute Cramér's V; and then cross-tabulate one clustering method against the true class.
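The preparation step listed above could be sketched roughly as follows. This is only an illustration, not the code submitted for the assignment; the names `features` and `true_class` are my own choices.

```r
library(MASS)  # recommended package shipped with R; provides the biopsy data

data(biopsy)

# Drop the first column (patient ID) and the last column (class),
# which leaves the nine attribute columns V1..V9.
features <- biopsy[, -c(1, ncol(biopsy))]

# Replace missing values with zero, as the assignment asks.
features[is.na(features)] <- 0

# Keep the true labels aside for the final cross-tabulation.
true_class <- biopsy$class

dim(features)
```

Setting the class labels aside at this point keeps them out of the clustering while leaving them available for the comparison against the true class at the end.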
Data Preparations
In order to properly analyze the biopsy data for …
Hence the name elbow method, which is considered a reasonable indicator of the number of clusters. The elbow in the plot marks the point beyond which adding another cluster gives only a small further reduction in the total within-cluster sum of squares, so the clusters are about as compact as they can usefully be. This may not always give the best answer, but I believe it is better than picking an arbitrary value for k.
Now that I have established a reasonable number of clusters, I run the two clustering techniques, kmeans and pam.
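An elbow plot of this kind might be produced along the following lines. This is a minimal sketch, assuming the prepared attribute columns are held in a data frame I call `features`; the range of k (1 to 10), the seed, and the nstart and iter.max values are arbitrary choices for illustration.

```r
library(MASS)

# Attribute columns V1..V9 with missing values set to zero.
features <- biopsy[, 2:10]
features[is.na(features)] <- 0

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Total within-cluster sum of squares for k = 1, ..., 10.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```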
Kmeans Method
With this method I followed the process shown in class. I set nstart to a large number to make it more likely that the result is at or near the global optimum. Because kmeans starts from random centers, separate runs can give different results, so I ran it twice. I then set up a two-way table of the two runs to see how close I had come to a global optimum. The R code is below, along with Figure 2 showing the results:
It appears that both kmeans runs agree, which suggests that clustering into two groups is fairly close to optimal.
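A minimal sketch of such a two-run comparison is given here as an illustration only. The value nstart = 50 and the names `km1`, `km2` and `agreement` are assumptions of mine.

```r
library(MASS)

# Attribute columns V1..V9 with missing values set to zero.
features <- biopsy[, 2:10]
features[is.na(features)] <- 0

# Two independent kmeans runs, each trying many random starts.
km1 <- kmeans(features, centers = 2, nstart = 50)
km2 <- kmeans(features, centers = 2, nstart = 50)

# Two-way table of the cluster labels from the two runs.
agreement <- table(run1 = km1$cluster, run2 = km2$cluster)
print(agreement)

# The objective values of the two runs can be compared directly as well.
c(km1$tot.withinss, km2$tot.withinss)
```

Cluster labels are arbitrary, so full agreement can show up either on the diagonal of the table or on the anti-diagonal if the two runs happen to swap labels 1 and 2.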
Pam Method
Based on the number of clusters chosen above, I simply used the pam function to cluster into two groups. The following R code was used to cluster with
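As a hedged sketch of the pam step and the comparisons the assignment calls for, something like the following could be used. It is not the original listing. The helper `cramers_v` is a hypothetical function written here from the usual definition, V = sqrt(chi-squared / (n * (min(rows, columns) - 1))), and is not taken from any particular package.

```r
library(MASS)
library(cluster)  # recommended package shipped with R; provides pam()

# Attribute columns V1..V9 with missing values set to zero.
features <- biopsy[, 2:10]
features[is.na(features)] <- 0

set.seed(1)  # arbitrary seed for the kmeans starts
km <- kmeans(features, centers = 2, nstart = 50)
pm <- pam(features, k = 2)

# Cross-tabulate the two clustering methods against each other.
cross <- table(kmeans = km$cluster, pam = pm$clustering)
print(cross)

# Hypothetical helper: Cramer's V from the uncorrected chi-squared statistic.
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

cramers_v(cross)

# Cross-tabulate one method (pam here) against the true class.
table(pam = pm$clustering, truth = biopsy$class)
```

A value of V near 1 would indicate that the two methods assign patients to nearly the same groups, while a value near 0 would indicate almost no association.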