
2. In this problem you will write code to implement a 2-class or 3-class nearest-means classifier (using Euclidean distance) for data that has 2 features. Part of the code is already written for you: a plotting routine that also finds the decision regions and decision boundaries, given the class means.

For parts (a)-(d) below, you are given 3 datasets, each of which is divided into a training set and a test set:

dataset1_train.csv, dataset1_test.csv
dataset2_train.csv, dataset2_test.csv
dataset3_train.csv, dataset3_test.csv

Each csv file has one row for each data point and one column for each feature, except that the last column contains the class labels (1 or 2).

Note that for a nearest-means classifier, the training (learning) phase consists merely of computing the class means; each data point can then be classified by finding which class mean is closer.

Tips:

• For plotting, the plotDecBoundaries.py or plotDecBoundaries.m function helps you plot the data points, means, decision boundary, and decision regions. (Or you may write your own plotting code if you prefer.) The PlotDecBoundaries descriptions below are subject to possible minor revision when the Problem 2 resources are posted.

PlotDecBoundaries() in PlotDecBoundaries.py requires three arguments:
i. Training data points of all classes (of shape [n_train, n_feats])
ii. Corresponding training labels (of shape [n_train])
iii. Class sample means (the kth row must be set to the sample mean of the kth class)

The PlotDecBoundaries.m function in MATLAB requires three arguments:
i. Training data points of all classes (of size [n_train, n_feats])
ii. Corresponding training labels (of size [n_train, 1])
iii. Class sample means (the kth row must be set to the sample mean of the kth class)

• As good coding practice, it is recommended that your code be generalized to work for any given number of classes and for data with any number of features. Accordingly, you can derive the number of classes from the training labels using the unique() function in NumPy or MATLAB.
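The training and classification steps described above (compute per-class means, then assign each point to the nearest mean) can be sketched in NumPy as follows. This is a minimal sketch, not the provided starter code; the function names train_nearest_means and classify_nearest_means are my own:

```python
import numpy as np

def train_nearest_means(X, y):
    """'Training' phase: compute the sample mean of each class.

    Returns (classes, means), where means[k] is the mean of class classes[k].
    """
    classes = np.unique(y)  # derive the number of classes from the labels
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def classify_nearest_means(X, classes, means):
    """Assign each point to the class whose mean is nearest (Euclidean)."""
    # dists[i, k] = Euclidean distance from point i to the mean of class k
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```

Because the class list comes from np.unique() and the distance computation broadcasts over all feature columns, the same sketch works unchanged for the 3-class datasets in part (e).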
The number of features can be derived from the number of columns in the csv file (the number of features is one less than the number of columns). This will also ensure the same code works for part (e). But if you're new to Python, it is probably better to keep things as simple as possible and code for just 2 (or 3) classes.

(a) Learning (training) and classification. Use the unnormalized data as supplied in the datasets. For each dataset (1, 2, and 3), compute the class means on the training data, then:
(i) Plot the training data (using different colors or symbols for the different classes), the class means, the decision boundary, and the decision regions. Classify all data points in the training set and in the test set, using the class means computed above.
(ii) Report the classification error rate on the training set, and separately on the test set:
classification error rate = (number of points misclassified) / (total number of points tried), expressed as a percentage.

(b) Compare and comment on the results: how do the test error rates on datasets 1, 2, and 3 compare? Try to explain why.

(c) Preprocessing: normalization. Standardize the data (so that each feature, across both classes combined, has sample mean = 0 and sample variance = 1). For each dataset, compute the normalizing parameters from the training data, and then use those parameter values to standardize both the training data and the test data. The result is the (standardized) data you will use for this part. Repeat part (a), except on the standardized data.

(d) Compare and comment on the results of part (a) versus those of part (c): how do the error rates on normalized (standardized) and unnormalized data compare for each given dataset? Try to explain why.

(e) For this part, use the following datasets, which have 3 classes (the first 2 classes of which are the same as in datasets 1, 2, and 3):

dataset4_train.csv, dataset4_test.csv
dataset5_train.csv, dataset5_test.csv
dataset6_train.csv, dataset6_test.csv
Repeat parts (a)-(d) except using these 3-class datasets.
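The standardization in part (c) and the error-rate formula in part (a)(ii) can be sketched as below. This is a minimal sketch with hypothetical helper names; note that the normalizing parameters are computed from the training data only and then applied to both sets (np.std with its default ddof=0 is used here for the sample standard deviation):

```python
import numpy as np

def standardize_params(X_train):
    # Normalizing parameters come from the training data only.
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mu, sigma):
    # Apply the training-set parameters to any dataset (train or test),
    # so each training feature ends up with mean 0 and variance 1.
    return (X - mu) / sigma

def error_rate(y_true, y_pred):
    # (number of points misclassified) / (total number of points tried),
    # expressed as a percentage
    return 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))
```

Reusing the training-set parameters on the test set (rather than recomputing them from the test data) is what the problem statement asks for in part (c).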

Fig. 1

Fig. 2