a) Use K-means to group stations into 4 clusters with respect to their hydraulic geometry components. (4 points)
b) Use GMM to group stations into 4 clusters with respect to their hydraulic geometry components. For this case only, make PDFs for each variable in each cluster. (4 points)
c) Use the Silhouette Score and Davies-Bouldin performance measures to study which clustering approach performs better. Is this consistent with a visual inspection of the clustering results? If not, what would be the reason? (2 points)
d) Bonus: Enhancing K-means Performance
Explore techniques to increase the performance of the K-means model to a level comparable to the GMM. Please provide a Python script detailing the results. (5 points)

Contents
Clustering
6.1. Clustering
6.2. Mean-Shift Clustering
6.3. K-Means Clustering
6.4. Gaussian Mixture Model (GMM) Clustering
6.5. Performance Measures for Clustering
6.6. Homework #6 (10 points)
6.1. Clustering
Some parts of this lecture follow the notes from Jeremy Watt
Clustering is a popular unsupervised learning technique that involves grouping similar data points
together. In clustering, the algorithm identifies patterns and structures within the data without
any predefined labels or categories. The primary goal is to find natural groupings in the data based
on the inherent similarities between data points.
The Mean Shift and K-Means algorithms share similarities as clustering techniques, both utilizing
mean vector operations to extract information from data. Despite the broad popularity of the K-
Means algorithm, the Mean Shift algorithm has seen more constrained applications, primarily in
areas such as image segmentation. In this note, a brief comparison of these two algorithms will be
presented.
6.2. Mean-Shift Clustering
Mean-Shift clustering is a non-parametric clustering algorithm that aims to discover the modes or
peaks of a probability density function representing the underlying distribution of data points. It is
often used for image segmentation, but it can be applied to general clustering problems. Mean-
Shift is particularly useful when the shape of the clusters is not well-defined or when the number
of clusters is not known in advance. See Comaniciu & Meer (2002) for more information about this approach. The following figure is from that article.
[Figure: normalized density surface over the L*-u* feature space, from Comaniciu & Meer (2002); the peaks of the surface correspond to the modes sought by mean-shift.]
Mean-shift builds on the foundation of kernel density estimation (KDE). Consider the scenario
where the aforementioned data is sampled from a probability distribution. KDE serves as a
technique for approximating the underlying distribution, also known as the probability density
function, for a given dataset. This method involves placing a kernel at each point within the
dataset. Here, a "kernel" is simply a mathematical term for a weighting function
commonly employed in convolution. While various types of kernels exist, the Gaussian kernel
stands out as the most widely used. The cumulative effect of summing up all individual kernels
yields a probability surface, exemplified by a density function. The resulting density function is
contingent upon the chosen kernel bandwidth parameter, leading to variations in the outcome.
To illustrate, suppose we are given a data set $\{u_i\}$ of $n$ points in $d$-dimensional space, sampled from some larger population, and that we have chosen a kernel $K$ with bandwidth parameter $h$. Together, the data and the kernel function yield the following kernel density estimator for the full population's density function:
$$\hat{f}_K(u) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{u - u_i}{h}\right)$$
The kernel function here is required to satisfy the following two conditions:
$$\int K(u)\, du = 1$$

$$K(u) = K(|u|)$$
The first condition is required to guarantee that our estimate is normalized; the second is associated with the symmetry of our space. Two well-known kernel functions that satisfy these conditions are:
Flat/uniform kernel:

$$K(u) = \begin{cases} \dfrac{1}{2} & |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Gaussian distribution:
1
K(u)
=
2πα/2
e-1/2
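As a quick check of the two conditions above, here is a minimal sketch (an illustrative snippet, not part of the lecture's code; the integration grid is an arbitrary choice) that implements both kernels in one dimension and verifies numerically that each integrates to one.

import numpy as np

# Flat/uniform kernel: constant 1/2 on [-1, 1], zero elsewhere
def flat_kernel(u):
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

# Gaussian kernel in one dimension (d = 1)
def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# Integrate each kernel numerically over a wide grid
u = np.linspace(-10, 10, 100001)
print(np.trapz(flat_kernel(u), u))      # ~1.0
print(np.trapz(gaussian_kernel(u), u))  # ~1.0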
Here, we illustrate an example in one dimension, employing the Gaussian kernel to estimate the density of a population along the x-axis. Each sample point contributes a small Gaussian to our estimate, centered on that point. While the equations above may appear somewhat daunting, the accompanying graphic shows that the underlying concept is quite straightforward. The following figure is from this page.
[Figure: a 1-D kernel density estimate along the x-axis; small Gaussians centered at each sample point sum to the smooth estimated density curve.]
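To make the figure concrete, the following sketch reproduces the idea; the sample values and bandwidth are arbitrary choices for illustration. A scaled Gaussian is placed at each sample and the kernels are summed according to the estimator above.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative 1-D samples (arbitrary values) and bandwidth
samples = np.array([1.0, 2.0, 3.5, 4.0, 7.0, 8.5])
h = 0.8
u = np.linspace(-2, 11, 500)

# One Gaussian kernel per sample, each scaled by 1/h
kernels = [np.exp(-0.5 * ((u - ui) / h) ** 2) / (h * np.sqrt(2 * np.pi))
           for ui in samples]

# Sum the kernels and divide by n to obtain the density estimate
density = np.sum(kernels, axis=0) / len(samples)

# Plot the individual (scaled) kernels and their sum, the KDE
for k in kernels:
    plt.plot(u, k / len(samples), color='gray', lw=0.8)
plt.plot(u, density, 'b', lw=2, label='KDE')
plt.plot(samples, np.zeros_like(samples), 'k|', ms=15, label='samples')
plt.legend()
plt.show()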
Key Concepts of Mean-Shift Clustering:
Kernel Density Estimation: Mean-Shift is based on the concept of kernel density estimation. It
represents the distribution of data points as a smooth, continuous probability density function.
Mean-Shift Iterative Procedure: The algorithm starts with a set of data points and a kernel function
that defines the neighborhood around each point. The mean-shift procedure iteratively shifts each
data point towards the mode of the underlying distribution until convergence.
Mode Seeking: At each iteration, the algorithm computes the mean shift vector for each data
point, indicating the direction and magnitude of the shift needed to move towards a higher density
region. Points are then updated by shifting them in the direction of the mean shift vector.
Convergence: The algorithm iterates until convergence, which occurs when the mean shift vectors
become small or negligible. At this point, data points are assigned to the mode or peak where they
have converged.
import numpy as np
import matplotlib.pyplot as plt

# Function to calculate the Euclidean distance between two points
def euclidean_dist(p1, p2):
    return np.sqrt(np.sum((np.array(p1) - np.array(p2)) ** 2))

# Function to define the Gaussian kernel
def kernel(dist, bandwidth):
    return np.exp(-0.5 * (dist / bandwidth) ** 2) / (bandwidth *
                                                     np.sqrt(2 * np.pi))

# Shift a point toward the kernel-weighted mean of the data
def shift(p, original_points, kernel_bandwidth):
    shift_coords = np.zeros(len(p))
    scale_factor = 0.0
    for p_temp in original_points:
        # Numerator: kernel-weighted sum of neighbor coordinates
        dist = euclidean_dist(p, p_temp)
        weight = kernel(dist, kernel_bandwidth)
        shift_coords += weight * np.array(p_temp)
        # Accumulate the denominator
        scale_factor += weight
    # New position is the weighted mean of the neighborhood
    return list(shift_coords / scale_factor)

# Generate synthetic data for clustering: three 2-D Gaussian blobs
np.random.seed(42)
data = np.concatenate([np.random.normal(loc=[1, 1], size=(100, 2)),
                       np.random.normal(loc=[5, 5], size=(100, 2)),
                       np.random.normal(loc=[9, 1], size=(100, 2))])

# Perform mean-shift clustering
kernel_bandwidth = 2.0        # Adjust the bandwidth as needed
max_iterations = 100          # Adjust the maximum number of iterations as needed
convergence_criteria = 1e-3   # Adjust the convergence criterion as needed

# Shift a copy of each point until it converges to a mode of the density
copied_points = data.tolist()
for i, p in enumerate(copied_points):
    converged = False
    iterations = 0
    while not converged and iterations < max_iterations:
        old_position = p.copy()
        p = shift(p, data, kernel_bandwidth)
        # Check for convergence by comparing the new position with the old one
        if euclidean_dist(p, old_position) < convergence_criteria:
            converged = True
        iterations += 1
    copied_points[i] = p

# Assign points whose shifted positions coincide (within a tolerance)
# to the same cluster
modes = []
cluster_assignments = np.zeros(len(copied_points), dtype=int)
for i, p in enumerate(copied_points):
    for j, mode in enumerate(modes):
        if euclidean_dist(p, mode) < kernel_bandwidth / 2:
            cluster_assignments[i] = j
            break
    else:
        modes.append(p)
        cluster_assignments[i] = len(modes) - 1

# Plot the original data and the converged points colored by cluster
copied_points = np.array(copied_points)
plt.scatter(data[:, 0], data[:, 1], label='Original Data', alpha=0.5)
plt.scatter(copied_points[:, 0], copied_points[:, 1],
            c=cluster_assignments, cmap='viridis', marker='x')
plt.legend()
plt.title('Mean-Shift Clustering')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show the figure
plt.show()
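In practice, scikit-learn provides an optimized implementation of the same algorithm. Below is a minimal usage sketch on the same synthetic data (assuming scikit-learn is installed; the bandwidth value mirrors the script above).

from sklearn.cluster import MeanShift
import numpy as np

np.random.seed(42)
data = np.concatenate([np.random.normal(loc=[1, 1], size=(100, 2)),
                       np.random.normal(loc=[5, 5], size=(100, 2)),
                       np.random.normal(loc=[9, 1], size=(100, 2))])

# Fit mean-shift with a fixed bandwidth and report the detected modes
ms = MeanShift(bandwidth=2.0)
labels = ms.fit_predict(data)
print("Number of clusters found:", len(ms.cluster_centers_))
print("Cluster centers:\n", ms.cluster_centers_)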