Question

This question demonstrates the effect of rescaling input variables on the clustering results. We will discover clusters using all of the observations in the TwoFeatures.csv file, with the following specifications.

• The input interval variables are x1 and x2

• The metric is the Manhattan distance

• The minimum number of clusters is 1

• The maximum number of clusters is 8

• Use the Elbow value for choosing the optimal number of clusters

Since the sklearn.cluster.KMeans class works only with the Euclidean distance, you will need to write custom Python code to implement the K-Means algorithm with the Manhattan distance.
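A minimal sketch of such a custom implementation appears below. It assumes the observations are in a NumPy array X of shape (n, 2). The cluster centers are updated as component-wise medians, which minimize the total Manhattan distance (the K-Medians variant); if your course materials update centers with the mean instead, swap np.median for np.mean. The function name kmeans_manhattan and the seed are arbitrary choices.

import numpy as np

def kmeans_manhattan(X, n_clusters, n_iter=100, seed=20230101):
    """K-Means with the Manhattan (L1) distance; centers updated as medians."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling distinct observations
    centers = X[rng.choice(X.shape[0], size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to the nearest center in L1 distance
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update each center as the component-wise median of its members;
        # keep the old center if a cluster happens to be empty
        new_centers = np.array([
            np.median(X[labels == k], axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers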

A- Plot x2 (vertical axis) versus x1 (horizontal axis). Add gridlines to both axes. Let the graph engine choose the tick marks. How many clusters do you see in the graph?
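A plotting sketch for part A, assuming TwoFeatures.csv sits in the working directory and that its columns are named x1 and x2:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('TwoFeatures.csv')
fig, ax = plt.subplots()
ax.scatter(df['x1'], df['x2'])
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.grid(True)   # gridlines on both axes; default tick marks are kept
plt.show()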

B- Discover the optimal number of clusters without any transformation. List the number of clusters, the Total Within-Cluster Sum of Squares (TWCSS), and the Elbow values in a table. Plot the Elbow values versus the number of clusters. How many clusters do you find? What are the centroids of your optimal clusters?
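One way to tabulate the diagnostics for part B is sketched below, reusing df and plt from the part A sketch and kmeans_manhattan from above. TWCSS is computed as the sum of squared Euclidean distances from each observation to its cluster centroid; the Elbow value shown is only a placeholder ratio, so replace it with the definition given in your course notes.

X = df[['x1', 'x2']].to_numpy()   # df from the part A sketch

rows = []
for k in range(1, 9):             # minimum 1 to maximum 8 clusters
    labels, centers = kmeans_manhattan(X, k)
    # TWCSS: squared Euclidean distance of each observation to its centroid
    twcss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    elbow = twcss / k             # placeholder definition; substitute your course's formula
    rows.append((k, twcss, elbow))
    print(f'{k}\t{twcss:.4f}\t{elbow:.4f}')

plt.plot([r[0] for r in rows], [r[2] for r in rows], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Elbow value')
plt.grid(True)
plt.show()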

C- Linearly rescale x1 so that the resulting variable has a minimum of zero and a maximum of ten; likewise rescale x2. Discover the optimal number of clusters from the transformed observations. List the number of clusters, the Total Within-Cluster Sum of Squares (TWCSS), and the Elbow values in a table. Plot the Elbow values versus the number of clusters. How many clusters do you find? What are the centroids of your optimal clusters in the original scale of x1 and x2?
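For part C, here is a sketch of the min-max rescaling and of mapping the resulting centroids back to the original units, again reusing df and kmeans_manhattan from the earlier sketches. The value of k_opt is hypothetical and stands in for whatever the Elbow plot on the transformed data suggests.

# Min-max rescale each variable to the interval [0, 10]
x_min = df[['x1', 'x2']].min()
x_max = df[['x1', 'x2']].max()
Z = 10.0 * (df[['x1', 'x2']] - x_min) / (x_max - x_min)

k_opt = 4                                   # hypothetical; read it off your Elbow plot
labels, centers_scaled = kmeans_manhattan(Z.to_numpy(), k_opt)

# Invert the transformation to express the centroids in the original units
centers_original = centers_scaled / 10.0 * (x_max.to_numpy() - x_min.to_numpy()) + x_min.to_numpy()
print(centers_original)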

D- If you have done everything correctly, you should discover two different optimal cluster solutions. In your own words, how do you explain the difference?