Question

a) Use K-means to group the stations into 4 clusters with respect to their hydraulic geometry components. (4 points)
b) Use GMM to group the stations into 4 clusters with respect to their hydraulic geometry components. For this case only, make PDFs for each variable in each cluster. (4 points)
c) Use the Silhouette Score and Davies-Bouldin performance measures to study which clustering approach performs better. Is this consistent with a visual inspection of the clustering results? If not, what would be the reason? (2 points)
d) Bonus: Enhancing K-means Performance. Explore techniques to raise the performance of the K-means model to a level comparable to the GMM. Please provide a Python script detailing the results. (5 points)

Contents
Clustering
6.1. Clustering
6.2. Mean-Shift Clustering
6.3. K-Means Clustering
6.4. Gaussian Mixture Model (GMM) Clustering
6.5. Performance Measures for Clustering
6.6. Homework #6 (10 points)

6.1. Clustering
Some parts of this lecture follow the notes from Jeremy Watt.

Clustering is a popular unsupervised learning technique that involves grouping similar data points together. In clustering, the algorithm identifies patterns and structures within the data without any predefined labels or categories. The primary goal is to find natural groupings in the data based on the inherent similarities between data points.

The Mean-Shift and K-Means algorithms share similarities as clustering techniques, both using mean-vector operations to extract information from data. Despite the broad popularity of the K-Means algorithm, the Mean-Shift algorithm has seen more constrained applications, primarily in areas such as image segmentation. In this note, a brief comparison of these two algorithms is presented.

6.2. Mean-Shift Clustering
Mean-Shift clustering is a non-parametric clustering algorithm that aims to discover the modes, or peaks, of a probability density function representing the underlying distribution of data points. It is often used for image segmentation, but it can be applied to general clustering problems. Mean-Shift is particularly useful when the shape of the clusters is not well defined or when the number of clusters is not known in advance. Please see Comaniciu & Meer (2002) for more information about this approach. The following figure is from that article.

[Figure: normalized density surface estimated over the L*-u* feature space, from Comaniciu & Meer (2002).]

Mean-shift builds on the foundation of kernel density estimation (KDE). Consider the scenario where the data above are sampled from a probability distribution. KDE is a technique for approximating the underlying distribution, also known as the probability density function, of a given dataset. The method involves placing a kernel at each point in the dataset; here, a "kernel" is simply a mathematical term for a weighting function commonly employed in convolution. While various types of kernels exist, the Gaussian kernel is the most widely used. Summing all the individual kernels yields a probability surface, i.e., a density function. The resulting density estimate depends on the chosen kernel bandwidth parameter, and different bandwidths lead to different outcomes.
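To make the bandwidth effect concrete, here is a minimal one-dimensional sketch that builds a Gaussian KDE by hand: a small Gaussian is centered on every sample and the kernels are summed, for several bandwidth choices (the sample locations and bandwidth values below are arbitrary choices for illustration only).

import numpy as np
import matplotlib.pyplot as plt

# A minimal 1-D Gaussian KDE built by hand: one small Gaussian per sample, summed
def gaussian_kde_1d(x_grid, samples, h):
    # Kernel matrix: rows = grid points, columns = samples
    u = (x_grid[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # Sum the kernels and rescale by the 1/(n*h) factor of the KDE formula
    return kernels.sum(axis=1) / (len(samples) * h)

np.random.seed(0)
samples = np.concatenate([np.random.normal(2, 1, 50), np.random.normal(7, 1, 50)])
x_grid = np.linspace(-2, 11, 400)

# A narrow bandwidth gives a spiky estimate; a wide one smooths the two modes together
for h in (0.3, 1.0, 3.0):
    plt.plot(x_grid, gaussian_kde_1d(x_grid, samples, h), label=f'h = {h}')
plt.scatter(samples, np.zeros_like(samples), marker='|', color='k', label='samples')
plt.legend()
plt.title('Effect of bandwidth on a Gaussian KDE')
plt.show()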
To illustrate, suppose we are given a data set $\{u_i\}$ of $n$ points in $d$-dimensional space, sampled from some larger population, and that we have chosen a kernel $K$ with bandwidth parameter $h$. Together, the data and the kernel function yield the following kernel density estimator for the full population's density function:

$$\hat{f}_K(u) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{u - u_i}{h}\right)$$

The kernel function here is required to satisfy the following two conditions:

$$\int K(u)\, du = 1, \qquad K(u) = K(|u|)$$

The first requirement guarantees the normalization of our estimate; the second is linked to the symmetry of our space. Two very well-known kernel functions that satisfy these conditions are the flat/uniform kernel,

$$K(u) = \begin{cases} 1/2 & -1 \le u \le 1 \\ 0 & \text{else,} \end{cases}$$

and the Gaussian kernel,

$$K(u) = \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}\|u\|^2}.$$

Here, we illustrate an example in one dimension, employing the Gaussian kernel to estimate the density of a population along the x-axis. Notably, each sample point contributes a small Gaussian to our estimate, centered on that point. While the equations above may appear somewhat daunting, the accompanying graphic shows that the underlying concept is rather straightforward. The following figure is from this page.

[Figure: a 1-D Gaussian KDE; small Gaussians centered on each sample along the x-axis sum to the overall density estimate.]

Key Concepts of Mean-Shift Clustering:
- Kernel Density Estimation: Mean-Shift is based on the concept of kernel density estimation. It represents the distribution of data points as a smooth, continuous probability density function.
- Iterative Procedure: The algorithm starts with a set of data points and a kernel function that defines the neighborhood around each point. The mean-shift procedure iteratively shifts each data point towards the mode of the underlying distribution until convergence.
- Mode Seeking: At each iteration, the algorithm computes the mean shift vector for each data point, indicating the direction and magnitude of the shift needed to move towards a higher-density region. Points are then updated by shifting them in the direction of the mean shift vector (see the update formula below).
- Convergence: The algorithm iterates until convergence, which occurs when the mean shift vectors become small or negligible. At this point, each data point is assigned to the mode or peak at which it has converged.
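Concretely, with a kernel $K$, one mean-shift iteration moves a point $x$ to the kernel-weighted mean of all data points,

$$x \leftarrow m(x) = \frac{\sum_{i=1}^{n} K(\|x - u_i\|)\, u_i}{\sum_{i=1}^{n} K(\|x - u_i\|)},$$

so the mean shift vector is $m(x) - x$. The numerator and denominator of this expression are exactly the weighted sums accumulated in the shift function of the script below.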
The following script is a from-scratch NumPy implementation of this procedure, applied to synthetic two-dimensional data drawn from three Gaussian blobs.

import numpy as np
import matplotlib.pyplot as plt

# Function to calculate the Euclidean distance between two points
def euclidean_dist(p1, p2):
    return np.sqrt(np.sum((np.array(p1) - np.array(p2)) ** 2))

# Gaussian kernel: the weight of a neighbor decays with its distance
def kernel(dist, bandwidth):
    return np.exp(-0.5 * (dist / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# Shift a point to the kernel-weighted mean of all original points
def shift(p, original_points, kernel_bandwidth):
    shift_coords = np.zeros(len(p))
    scale_factor = 0.0
    for p_temp in original_points:
        # Accumulate the numerator: weight * coordinates
        dist = euclidean_dist(p, p_temp)
        weight = kernel(dist, kernel_bandwidth)
        shift_coords += weight * np.array(p_temp)
        # Accumulate the denominator: sum of weights
        scale_factor += weight
    # The weighted mean is the point's new position
    return shift_coords / scale_factor

# Generate synthetic data for clustering: three Gaussian blobs
np.random.seed(42)
data = np.concatenate([np.random.normal(loc=[1, 1], size=(100, 2)),
                       np.random.normal(loc=[5, 5], size=(100, 2)),
                       np.random.normal(loc=[9, 1], size=(100, 2))])

# Perform mean-shift clustering
kernel_bandwidth = 2.0       # Adjust the bandwidth as needed
max_iterations = 100         # Adjust the maximum number of iterations as needed
convergence_criteria = 1e-3  # Adjust the convergence criterion as needed

# Move every point uphill until it converges to a mode of the density
shifted_points = []
for p in data:
    p = p.copy()
    for _ in range(max_iterations):
        old_position = p.copy()
        p = shift(p, data, kernel_bandwidth)
        # Converged when the new position barely differs from the old one
        if euclidean_dist(p, old_position) < convergence_criteria:
            break
    shifted_points.append(p)
shifted_points = np.array(shifted_points)

# Assign clusters: points whose converged positions are close share a mode
modes = []
cluster_assignments = np.zeros(len(shifted_points), dtype=int)
for i, p in enumerate(shifted_points):
    for k, mode in enumerate(modes):
        if euclidean_dist(p, mode) < kernel_bandwidth / 2:
            cluster_assignments[i] = k
            break
    else:
        modes.append(p)
        cluster_assignments[i] = len(modes) - 1

# Plot the original data and the converged points colored by cluster
plt.scatter(data[:, 0], data[:, 1], label='Original Data', alpha=0.5)
plt.scatter(shifted_points[:, 0], shifted_points[:, 1],
            c=cluster_assignments, cmap='viridis', marker='x')
plt.legend()
plt.title('Mean-Shift Clustering')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
# Show the figure
plt.show()

Fig. 1: Mean-shift clustering of the synthetic data; crosses show the converged points, colored by cluster.
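In practice, one rarely implements mean-shift by hand; scikit-learn provides a ready-made MeanShift estimator. Here is a minimal usage sketch on the same data array as in the script above (the quantile value is an illustrative choice, not a prescribed one):

from sklearn.cluster import MeanShift, estimate_bandwidth

# Let scikit-learn suggest a bandwidth from pairwise distances in the data
bandwidth = estimate_bandwidth(data, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(data)    # cluster label for each point
centers = ms.cluster_centers_    # the discovered modes

print(f'Estimated bandwidth: {bandwidth:.2f}')
print(f'Clusters found: {len(centers)}')

Note that mean-shift infers the number of clusters from the bandwidth alone, which is the main practical difference from K-Means, where the number of clusters k must be specified up front.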