Question

Each student is required to submit (1) the Orange workflow file or Python/R script files (s)he created, using the format HW1PID.OWS (.py and .r respectively for Python/R), where PID is your PID number, and (2) a PDF file with answers using the format HW3PID.pdf. All files listed below were posted in the Homeworks sub-folder of the Data Sets folder on Canvas.

Homework III - Questions

The purpose of this homework is to use textual analysis of WSJ news to predict financial market outcomes. In particular, we will rely on a data set measuring the state of the economy via textual analysis of business news. From the full text content of 800,000 Wall Street Journal articles for 1984-2017, Bybee et al. estimate a model that summarizes business news as easily interpretable topical themes and quantifies the proportion of news attention allocated to each theme at each point in time. These news attention estimates are inputs into the models we want to estimate. The data source is described at http://structureofnews.com/. The data was standardized and prepared for this assignment. Please use the data file ML_MBA_UNC_Processed.xlxs from the HW3 material folder (sub-folder of the Assignments folder) on Canvas (it is also posted in the Homeworks sub-folder of the Data Sets folder).

Click on the News Taxonomy tab of the aforementioned website and you will find a taxonomy of news themes in The Wall Street Journal. Estimated with hierarchical agglomerative clustering, the dendrogram illustrates how 180 topics cluster into an intuitive hierarchy of increasingly broad metatopics. The list of topics is reproduced below; further details appear on the website:

Natural disasters, Internet, Soft drinks, Mobile devices, Profits, M&A, Changes, Police / crime, Research, Executive pay, Mid-size cities, Scenario analysis, Economic ideology, Middle east, Savings & loans, IPOs, Restraint, Electronics, Record high, Connecticut, Steel, Bond yields, Small business, Cable, Fast food, Disease, Activists, Competition, Music industry, Short sales, Nonperforming loans, Key role, News conference, US defense, Political contributions, Revised estimate, Economic growth, Justice Department, Credit ratings, Broadcasting, Problems, Announce plan, Federal Reserve, Job cuts, Chemicals / paper, Regulation, Environment, Small caps, Unions, C-suite, Control stakes, Mutual funds, Venture capital, European sovereign debt, Mining, Company spokesperson, Private / public sector, Pharma, Schools, Russia, Programs / initiatives, Health insurance, Drexel, Trade agreements, Treasury bonds, Challenges, People familiar, Sales call, Publishing, Financial crisis, Aerospace / defense, Recession, Latin America, Cultural life, SEC, Earnings losses, Phone companies, Computers, Marketing, Japan, Nuclear / North Korea, NY politics, Tobacco, Product prices, Biology / chemistry / physics, Movie industry, Automotive, Machinery, Bankruptcy, Arts, International exchanges, Accounting, Space program, Immigration, Small changes, Small possibility, Agreement reached, Oil drilling, Rail / trucking / shipping, Indictments, Positive sentiment, Canada / South Africa, Airlines, California, Corporate governance, China, Investment banking, Spring/summer, Software, Pensions, Humor / language, Systems, Clintons, Major concerns, Mid-level executives, US Senate, Agriculture, Bank loans, Takeovers, State politics, Real estate, Futures / indices, Southeast Asia, Optimism, Corrections / amplifications, Government budgets, Exchanges / composites, Currencies / metals, Mortgages, Financial reports, Germany, Rental properties, Committees, Subsidiaries, Management changes, Share payouts, France / Italy, Acquired investment banks, Credit cards, Bear / bull market, Earnings forecasts,
Terrorism, Watchdogs, Oil market, Couriers, Commodities, Utilities, Foods / consumer goods, Convertible / preferred, Macroeconomic data, Courts, Safety administrations, Reagan, Bush / Obama / Trump, Fees, Gender issues, Trading activity, Microchips, Insurance, Earnings, Luxury / beverages, Iraq, National security, Buffett, Taxes, Options / VIX, Casinos, Elections, Private equity / hedge funds, Negotiations, European politics, Size, NASD, Mexico, Retail, Long / short term, Wide range, Lawsuits, UK, Revenue growth

1.a [4 points] For the exercise below select ALL features EXCEPT FUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE, and use as target: SGNFUTSP500.

Your first task is to predict the direction of the market (S&P 500 index) over the next month, i.e. SGNFUTSP500 up or down, which is a classification prediction problem, using the importance of news topics. Use the logistic regression and neural network widgets of the Orange software to compute the following models:

• logistic regression with LASSO regularization with C = 0.007
• logistic regression with LASSO regularization with C = 0.80
• neural network with 100 hidden layers, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

You should get something along the following lines with 10-fold cross-validation:

Model                       AUC     CA
Logistic LASSO C = 0.80     0.583   0.649
Neural Net                  0.566   0.619
Logistic LASSO C = 0.007    0.500   0.371

Explain why the logistic regression with LASSO C = 0.007 does so poorly (hint: look at the coefficients of the model). When you look at the ROC curve, explain the curve for the LASSO C = 0.007 model.

1.b [4 points] List the top ten features which have the most negative impact on next month's market direction and the top ten with the most positive impact.

2.a [4 points] We are turning now to a linear regression instead of a classification problem, predicting actual market returns (continuous) rather than direction (binary). For the exercise below select ALL features EXCEPT SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE, and use as target: FUTSP500.

Use the linear regression and neural network widgets of the Orange software to compute the following models:

• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the ℓ₂ regularization. Note that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden layers, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

The Test and Score widget now records MSE, RMSE, MAE, and R². What do you learn from the R² results?

2.b [4 points] In the previous case we were trying to predict the return next month with current news. Now we will try to explain current returns with current news. For the exercise below select ALL features EXCEPT FUTSP500, SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE, and use as target: SP500.

Use the linear regression and neural network widgets of the Orange software to compute the following models:

• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the ℓ₂ regularization. Recall that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between the role of C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden layers, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

The Test and Score widget now records MSE, RMSE, MAE, and R². What do you learn from the R² results?

2.c [2 points] List the top ten features which have the most negative impact on next month's market returns and the top ten with the most positive impact. How is it different from the results in 1.b?

2.d [2 points] Explain the difference between your answers in 1.a, 2.a and 2.b.
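Since the assignment accepts Python scripts in place of the Orange workflow, a minimal scikit-learn sketch of the 1.a setup is shown below. The filename extension, the assumption that SGNFUTSP500 is a binary up/down label, and the reading of the neural-network setting as 100 neurons in a hidden layer (how Orange's widget exposes it) are all interpretations, not part of the assignment text.

```python
# Hedged sketch of 1.a in scikit-learn (not the required Orange workflow).
# Assumes the Excel file has one column per news topic plus the columns named
# in the assignment; adjust the filename if the posted file differs.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate

df = pd.read_excel("ML_MBA_UNC_Processed.xlsx")
drop_cols = ["FUTSP500", "APPL", "SGNAPPL", "FUTAPPL",
             "SGNFUTAPPL", "SGNSP500", "DATE", "SGNFUTSP500"]
X = df.drop(columns=drop_cols)
y = df["SGNFUTSP500"]                          # assumed binary up/down target

models = {
    "Logistic LASSO C=0.007": LogisticRegression(penalty="l1", C=0.007, solver="liblinear"),
    "Logistic LASSO C=0.80":  LogisticRegression(penalty="l1", C=0.80, solver="liblinear"),
    "Neural net": MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                                solver="sgd", alpha=5, max_iter=200),
}

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=10, scoring=["roc_auc", "accuracy"])
    print(f"{name}: AUC={cv['test_roc_auc'].mean():.3f}  CA={cv['test_accuracy'].mean():.3f}")

# For 2.a/2.b, one reading of the Elastic Net settings is ElasticNet(alpha=0.007,
# l1_ratio=0.2) plus MLPRegressor with the same neural-network settings, scored
# with neg_mean_squared_error and r2.
```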


Most Viewed Questions Of Machine Learning

3. For the data shown in the attached figure (dark circles are one class, white circles another), solve the classification problem with a neuron by hand. That is, find the appropriate weights of the required linear discriminant.
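The figure is not reproduced here, so the sketch below uses hypothetical linearly separable 2-D points; it illustrates how the perceptron update rule arrives at weights of a linear discriminant of the kind the question asks for.

```python
# Hypothetical linearly separable data standing in for the missing figure;
# +1 = dark circles, -1 = white circles (assumed labels).
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [-1.0, -1.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b, eta = np.zeros(2), 0.0, 1.0            # weights, bias, learning rate
for _ in range(20):                          # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:           # misclassified (or on boundary) -> update
            w += eta * yi * xi
            b += eta * yi

print("discriminant: %.1f*x1 + %.1f*x2 + %.1f = 0" % (w[0], w[1], b))
```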


Q1 Consider the problem where we want to predict the gender of a person from a set of input parameters, namely height, weight, and age.
a) Using Cartesian distance, Manhattan distance and Minkowski distance of order 3 as the similarity measurements, show the results of the gender prediction for the Evaluation data listed below, using the generated training data, for values of K of 1, 3, and 7. Include the intermediate steps (i.e., distance calculation, neighbor selection, and prediction).
b) Implement the KNN algorithm for this problem. Your implementation should work with different training data sets as well as different values of K, and should allow a data point to be input for the prediction.
c) To evaluate the performance of the KNN algorithm (using the Euclidean distance metric), implement a leave-one-out evaluation routine for your algorithm. In leave-one-out validation, we repeatedly evaluate the algorithm by removing one data point from the training set, training the algorithm on the remaining data set, and then testing it on the point we removed to see if the label matches or not. Repeating this for each of the data points gives us an estimate of the percentage of erroneous predictions the algorithm makes and thus a measure of the accuracy of the algorithm for the given data. Apply your leave-one-out validation with your KNN algorithm to the dataset for Question 1 c) for values of K of 1, 3, 5, 7, 9, and 11 and report the results. For which value of K do you get the best performance?
d) Repeat the prediction and validation you performed in Question 1 c) using KNN when the age data is removed (i.e. when only the height and weight features are used as part of the distance calculation in the KNN algorithm). Report the results and compare the performance without the age attribute with the ones from Question 1 c). Discuss the results. What do the results tell you about the data?
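The training and evaluation tables are not reproduced here, so the rows below are hypothetical (height, weight, age, gender). The sketch shows one way to structure the Minkowski-distance KNN of part a)/b) and the leave-one-out loop of part c).

```python
# Hedged sketch of KNN with Manhattan/Euclidean/order-3 Minkowski distances and
# leave-one-out validation, on hypothetical data.
import numpy as np

train = np.array([[170, 70, 30], [160, 55, 28], [180, 85, 35], [155, 50, 40]], dtype=float)
labels = np.array(["M", "W", "M", "W"])

def minkowski(a, b, p):
    # p=1 Manhattan, p=2 Euclidean ("Cartesian"), p=3 Minkowski of order 3
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

def knn_predict(x, X, y, k, p=2):
    d = np.array([minkowski(x, xi, p) for xi in X])
    nearest = y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]           # majority vote; ties -> first label

def leave_one_out_error(X, y, k, p=2):
    wrong = sum(knn_predict(X[i], np.delete(X, i, 0), np.delete(y, i), k, p) != y[i]
                for i in range(len(X)))
    return wrong / len(X)

for k in (1, 3):                             # extend to 5, 7, 9, 11 with the real data
    print(k, leave_one_out_error(train, labels, k))
```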


2. Perform K-means clustering with K = 2 using the Euclidean norm. Toss a coin 7 times to initialise the algorithm.
3. Cluster the data using hierarchical clustering with complete linkage and the Euclidean norm. Draw the resulting dendrogram.
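The data points for this question are not shown, so the sketch below uses seven hypothetical values; it mimics the coin-toss initialisation of K-means and draws the complete-linkage dendrogram with SciPy.

```python
# Hedged sketch: random 0/1 initial assignments stand in for the 7 coin tosses.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0], [9.5]])   # 7 hypothetical points

# 2. K-means, K = 2, Euclidean norm
assign = rng.integers(0, 2, size=len(X))
while len(set(assign)) < 2:                 # re-toss if every point landed in one cluster
    assign = rng.integers(0, 2, size=len(X))
for _ in range(10):                         # alternate centroid update and reassignment
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
    assign = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
print("assignments:", assign, "centres:", centers.ravel())

# 3. Hierarchical clustering with complete linkage and Euclidean distance
Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)
plt.show()
```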


Q2. Using the data from Problem 2, build a Gaussian Naive Bayes classifier for this problem. For this you have to learn Gaussian distribution parameters for each input data feature, i.e. for p(height|W), p(height|M), p(weight|W), p(weight|M), p(age|W), p(age|M).
a) Learn/derive the parameters for the Gaussian Naive Bayes Classifier for the data from Question 2 a) and apply them to the same target as in problem 1a).
b) Implement the Gaussian Naive Bayes Classifier for this problem.
c) Repeat the experiment in parts 1 c) and 1 d) with the Gaussian Naive Bayes Classifier. Discuss the results, in particular with respect to the performance difference between using all features and using only height and weight.
d) Same as 1 d) but with Naive Bayes.
e) Compare the results of the two classifiers (i.e., the results from 1 c) and 1 d)) with the ones from 2 c) and 2 d) and discuss reasons why one might perform better than the other.
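The Problem 2 data table is not reproduced here, so the rows below are hypothetical (height, weight, age) examples for classes W and M. The sketch shows how the per-class, per-feature Gaussian parameters of parts a)/b) can be learned and used for prediction.

```python
# Hedged Gaussian Naive Bayes sketch on hypothetical data.
import numpy as np

data = {
    "W": np.array([[160, 55, 28], [155, 50, 40], [165, 60, 33]], dtype=float),
    "M": np.array([[175, 80, 30], [180, 85, 35], [170, 75, 45]], dtype=float),
}

total = sum(len(arr) for arr in data.values())
# One (mean, variance) pair per feature and per class: p(height|W), p(weight|W), ...
params = {c: (arr.mean(axis=0), arr.var(axis=0, ddof=1)) for c, arr in data.items()}
priors = {c: len(arr) / total for c, arr in data.items()}

def log_gaussian(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(x):
    # Naive Bayes: sum per-feature log-likelihoods and add the log prior
    scores = {c: np.sum(log_gaussian(x, mu, var)) + np.log(priors[c])
              for c, (mu, var) in params.items()}
    return max(scores, key=scores.get)

print(predict(np.array([168.0, 65.0, 31.0])))    # hypothetical query point
```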


1. Introduction

In this assignment you will build on your knowledge of classification; in particular, you will solve an image classification problem using a convolutional neural network. This assignment aims to guide you through the process by following the four fundamental principles:

• Data: data import, preprocessing, and augmentation.
• Model: designing a convolutional neural network model for classifying the images of the parts.
• Fitting: training the model using stochastic gradient descent.
• Validation: checking the model's accuracy on the reserved test data set and investigating where the most improvement could be found. Additionally, looking into the uncertainty in the predictions.

This is not necessarily a linear process; after you have fit and/or validated your model, you may need to go back to earlier steps and adjust your processing of the data or your model structure. This may need to be done several times to achieve a satisfactory result. This assignment is worth 35% of your course grade and is graded from 0 to 35 marks. An additional two bonus marks are available to the student whose model performs best on a previously unseen data set.
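As a rough illustration of the Model and Fitting steps, a minimal Keras CNN is sketched below. The 64x64 grayscale input size, the 5 classes, and the training call are assumptions, not the assignment's actual specifications.

```python
# Hedged CNN sketch for an image classification task of this kind.
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 5
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.RandomFlip("horizontal"),                 # simple augmentation layer
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

# Fitting with stochastic gradient descent, as the assignment outline suggests
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```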


(a) What is meant by feature engineering in machine learning?
(b) You are given a classification problem with one feature, x, and the following training set. As usual, y is the label. This is a multi-class classification problem with possible labels A, B, and C. The test samples are 0, 1, and -5. Find the 1-Nearest Neighbour prediction for each of the test samples. Use the standard Euclidean metric. If you have encountered any ties, discuss briefly your tie-breaking strategy. [5 marks]
(c) Engineer an additional feature for this dataset, namely x². Therefore, your new training set still has 6 labelled samples in its training set and 3 unlabelled samples in its test set, but there are two features, x and x². Find the 1-Nearest Neighbour prediction for each of the test samples in the new dataset. [16 marks]
(d) What is meant by a kernel in machine learning?
(e) How can the distance between the images of two samples in the feature space be expressed via the corresponding kernel? [2 marks]
(f) You are given the same training set as before, and only one test sample, 1. The learning problem is still multi-class classification with possible labels A, B, or C. Using the kernelized Nearest Neighbours algorithm with kernel K(x₁, x₂) = (1 + x₁x₂)², compute the 3-Nearest Neighbours prediction for the test sample. If applicable, describe your tie-breaking strategy. [10 marks]
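The exam's training table is not reproduced here, so the six labelled samples below are hypothetical. The sketch shows the mechanics of parts (b) and (c): 1-NN on the raw feature x and then on the engineered pair (x, x²).

```python
# Hedged 1-NN sketch on hypothetical data; labels and x-values are assumptions.
import numpy as np

x_train = np.array([-4.0, -2.0, -1.0, 2.0, 3.0, 6.0])
y_train = np.array(["A", "A", "B", "B", "C", "C"])
x_test = np.array([0.0, 1.0, -5.0])

def one_nn(test_points, train_points, labels):
    preds = []
    for t in test_points:
        d = (np.linalg.norm(train_points - t, axis=1)
             if train_points.ndim > 1 else np.abs(train_points - t))
        preds.append(labels[np.argmin(d)])   # ties broken by the lowest-index sample
    return preds

print(one_nn(x_test, x_train, y_train))      # part (b): raw feature x

X2_train = np.column_stack([x_train, x_train ** 2])   # part (c): features (x, x^2)
X2_test = np.column_stack([x_test, x_test ** 2])
print(one_nn(X2_test, X2_train, y_train))

# For (e)/(f): with a kernel K, the squared feature-space distance is
#   d(x, x')^2 = K(x, x) - 2*K(x, x') + K(x', x'),
# so kernelized k-NN only needs kernel evaluations, never explicit features.
```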


For this programming assignment you will implement the Naive Bayes algorithm from scratch, together with the functions to evaluate it with k-fold cross-validation (also from scratch). You can use the code in the following tutorial to get started and to get ideas for your implementation of the Naive Bayes algorithm, but please enhance it as much as you can (there are many things you can do to enhance it, such as those mentioned at the end of the tutorial):
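A minimal sketch of the from-scratch k-fold cross-validation harness is shown below; `naive_bayes_fit` and `naive_bayes_predict` are hypothetical placeholders for your own implementation, and the dataset is whatever you load for the assignment.

```python
# Hedged sketch: k-fold cross-validation without scikit-learn.
import random

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]          # k roughly equal folds

def cross_validate(X, y, k, fit, predict):
    accuracies = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [v for i, v in enumerate(y) if i not in held_out]
        model = fit(X_tr, y_tr)                   # train on the remaining folds
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# Usage with your own functions (placeholders):
# cross_validate(X, y, k=5, fit=naive_bayes_fit, predict=naive_bayes_predict)
```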


Question 1 Download the SGEMM GPU kernel performance dataset from the link below. https://archive.ics.uci.edu/ml/datasets/SGEMM+GPU+kernel+performance Understand the dataset by performing exploratory analysis. Prepare the target parameter by taking the average of the THREE (3) runs with long performance times. Design a linear regression model to estimate the target using only THREE (3) attributes from the dataset. Discuss your results, relevant performance metrics and the impact of normalizing the dataset.
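A minimal sketch of one way to approach this is shown below. The column names follow the UCI description of the SGEMM dataset (14 kernel parameters plus Run1 (ms)..Run4 (ms)); reading "the THREE runs with long performance times" as the three slowest runs, and picking MWG, NWG, KWG as the three attributes, are assumptions you should adapt to your own analysis.

```python
# Hedged sketch: linear regression on three attributes, with and without scaling.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("sgemm_product.csv")                     # file from the UCI archive
run_cols = ["Run1 (ms)", "Run2 (ms)", "Run3 (ms)", "Run4 (ms)"]
df["target"] = df[run_cols].apply(lambda r: r.nlargest(3).mean(), axis=1)

features = ["MWG", "NWG", "KWG"]                          # example choice of three attributes
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["target"], random_state=0)

for normalise in (False, True):
    Xtr, Xte = X_tr, X_te
    if normalise:
        scaler = StandardScaler().fit(X_tr)
        Xtr, Xte = scaler.transform(X_tr), scaler.transform(X_te)
    pred = LinearRegression().fit(Xtr, y_tr).predict(Xte)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print("normalised" if normalise else "raw       ",
          f"RMSE={rmse:.2f}  R2={r2_score(y_te, pred):.3f}")
```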