tutorbin

data mining homework help

Boost your journey with 24/7 access to skilled experts, offering unmatched data mining homework help


Trusted by 1.1 M+ Happy Students

Place An Order and save time
*Get instant homework help from top tutors—just a WhatsApp message away. 24/7 support for all your academic needs!

Recently Asked data mining Questions

Expert help when you need it
  • Q1:Programming Assignment Explanation • Fortune Cookie Classifier¹ You will build a binary fortune cookie classifier. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future (class 1) and messages that just contain a wise saying (class 0). For example, "Never go in against a Sicilian when death is on the line" would be a message in class 0. "You will get an A in Machine learning class" would be a message in class 1. Files Provided There are three sets of files. All words in these files are lower case and punctuation has been removed. 1) The training data: traindata.txt: This is the training data consisting of fortune cookie messages. trainlabels.txt: This file contains the class labels for the training data. 2) The testing data: testdata.txt: This is the testing data consisting of fortune cookie messages. testlabels.txt: This file contains the class labels for the testing data.See Answer
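A minimal sketch of one way to approach Q1 above, using a bag-of-words representation. The assignment's files (traindata.txt, trainlabels.txt, etc.) are replaced here by a few in-memory example messages, and naive Bayes is just one baseline choice; the assignment does not prescribe a particular classifier.

```python
# Sketch of a bag-of-words fortune cookie classifier (class 1 = prediction,
# class 0 = wise saying). In the real assignment you would read the messages
# from traindata.txt and the labels from trainlabels.txt instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = [
    "you will get an a in machine learning class",               # class 1
    "a journey of a thousand miles begins today",                # class 1
    "never go in against a sicilian when death is on the line",  # class 0
    "a wise man listens more than he speaks",                    # class 0
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_msgs)   # word-count features

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# An unseen future-predicting message should land in class 1.
pred = clf.predict(vec.transform(["you will pass the exam"]))
```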
  • Q2:Q2. (10 points) Consider the following setting. You are provided with n training examples: (x₁, y₁, h₁), (x₂, y₂, h₂), …, (xₙ, yₙ, hₙ), where xᵢ is the input example, yᵢ is the class label (+1 or -1), and hᵢ > 0 is the importance weight of the example. The teacher gave you some additional information by specifying the importance of each training example. How will you modify the perceptron algorithm to be able to leverage this extra information? Please justify your answer.See Answer
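One standard answer to Q2 is to scale each mistake-driven perceptron update by the example's importance weight, so heavily weighted examples move the decision boundary more. A sketch, with an illustrative toy dataset:

```python
import numpy as np

def weighted_perceptron(X, y, h, epochs=20):
    """Perceptron whose mistake-driven update is scaled by the
    importance weight h[i] (one natural modification for Q2)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi, hi in zip(X, y, h):
            if yi * (xi @ w + b) <= 0:   # mistake (or on the boundary)
                w += hi * yi * xi        # importance-weighted update
                b += hi * yi
    return w, b

# Tiny linearly separable example; the weights h are illustrative.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
h = np.array([1.0, 2.0, 1.0, 0.5])
w, b = weighted_perceptron(X, y, h)
```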
  • Q3:Q4. You were just hired by MetaMind. MetaMind is expanding rapidly, and you decide to use your machine learning skills to assist them in their attempts to hire the best. To do so, you have the following available to you for each candidate i in the pool of candidates I: (i) Their GPA, (ii) Whether they took Data Mining course and achieved an A, (iii) Whether they took Algorithms course and achieved an A, (iv) Whether they have a job offer from Google, (v) Whether they have a job offer from Facebook, (vi) The number of misspelled words on their resume. You decide to represent each candidate i ∈ I by a corresponding 6-dimensional feature vector f(i). You believe that if you just knew the right weight vector w ∈ R⁶ you could reliably predict the quality of a candidate i by computing w · f(i). To determine w your boss lets you sample pairs of candidates from the pool. For a pair of candidates (k, l) you can have them face off in a "DataMining-fight." The result is score(k > l), which tells you that candidate k is at least score(k > l) better than candidate l. Note that the score will be negative when l is a better candidate than k. Assume you collected scores for a set of pairs of candidates P. Describe how you could use a perceptron based algorithm to learn the weight vector w. Make sure to describe the basic intuition; how the weight updates will be done; and pseudo-code for the entire algorithm.See Answer
  • Q4:Please create a K-means Clustering and Hierarchical Clustering with the line of code provided. The line of code should include a merger of the excel files. The excel files will also be provided See Answer
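For Q4, a sketch of running both clusterings on a merger of two files with pandas and scikit-learn. The data below is an invented stand-in; with the real Excel files you would load each one with pd.read_excel before concatenating.

```python
# Sketch for Q4: merge two sources, then run K-means and hierarchical
# clustering. With real files: df1 = pd.read_excel("file1.xlsx"), etc.
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

df1 = pd.DataFrame({"x": [1.0, 1.2, 0.8], "y": [1.0, 0.9, 1.1]})
df2 = pd.DataFrame({"x": [8.0, 8.3, 7.9], "y": [8.1, 8.0, 7.8]})
data = pd.concat([df1, df2], ignore_index=True)   # the "merger" of the files

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
hc = AgglomerativeClustering(n_clusters=2).fit(data)   # Ward linkage by default
```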
  • Q5:Discussion - Data Mining, Text Mining, and Sentiment Analysis Explain the relationship between data mining, text mining, and sentiment analysis. Provide situations where you would use each of the three techniques. Respond to the following in a minimum of 230 words:See Answer
  • Q6:Assignment #3: DBSCAN, OPTICS, and Clustering Evaluation 1. If Epsilon is 2 and minpoint is 2 (including the centroid itself), what are the clusters that DBScan would discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Use the Euclidean distance. Draw the 10 by 10 space and illustrate the discovered clusters. What if Epsilon is increased to sqrt(10)? (30 pts)See Answer
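The hand calculation in Q6 can be cross-checked with scikit-learn's DBSCAN, whose min_samples counts the point itself (matching the question's convention) and which marks noise points with the label -1:

```python
# Cross-check of the Q6 DBSCAN answer with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)])   # A1..A8

labels_eps2 = DBSCAN(eps=2, min_samples=2).fit_predict(points)
labels_eps_sqrt10 = DBSCAN(eps=np.sqrt(10), min_samples=2).fit_predict(points)
```

With eps = 2 this yields clusters {A3, A5, A6} and {A4, A8}, with A1, A2 and A7 as noise; raising eps to sqrt(10) merges A1 into {A4, A8} and makes {A2, A7} a cluster. Points at distance exactly eps are included in the neighborhood, though such exact ties can be sensitive to floating point in general.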
  • Q7: 2. Use OPTICS algorithm to output the reachability distance and the cluster ordering for the dataset provided, starting from Instance 1. Use the following parameters for discovering the cluster ordering: minPts = 2 and epsilon = 2. Use epsilonprime = 1.2 to generate clusters from the cluster ordering and their reachability distance. Don't forget to record the core distance of a data point if it has a dense neighborhood. You don't need to include the core distance in your result but you may need to use them in generating clusters. (45 pts) [Dataset visualization: scatter plot of the instances; figure omitted.] Below are the first few lines of the calculation. You need to complete the remaining lines and generate clusters based on the given epsilonprime value:
Instance       (X, Y)   Reachability Distance
Instance 1:    (1, 1)   Undefined (or infinity)
Instance 2:    (0, 1)   1.0
Instance 3:    (1, 0)   1.0
Instance 16:   (5, 9)   Undefined
Instance 13:   (9, 2)   Undefined
Instance 12:   (8, 2)   1See Answer
  • Q8:3. Use F-measure and the Pairwise measures (TP, FN, FP, TN) to measure the agreement between a clustering result (C1, C2, C3) and the ground truth partitions (T1, T2, T3) as shown below. Show details of your calculation. (25 pts) [Contingency table of clusters C1-C3 against ground-truth partitions T1-T3 omitted.]See Answer
  • Q9:1. We will use the Flower classification dataset: a. https://www.kaggle.com/competitions/tpu-getting-started 2. Your goal is improving the average accuracy of classification. a. You SHOULD use Google Colab as the main computing environment. (Using Kaggle is okay.) b. You SHOULD create a GitHub repository for the source code. i. Put a readme file with execution instructions. c. You SHOULD explain your source code in the blog. d. Try experimenting with various hyperparameters: i. Network topology: 1. Number of neurons per layer (for example, 100 x 200 x 100, 200 x 300 x 100, ...) 2. Number of layers (for example, 2 vs 3 vs 4, ...) 3. Shape of conv2d. ii. While doing experiments, make sure you record your performance so that you can create a bar chart of the performance. iii. An additional graph idea might be a training time comparison. iv. Do some research on ideas for improving this. e. You can refer to code or tutorials on the internet. But the main question you have to answer is what improvement you made over the existing reference. i. Make sure it is very clear which lines of code are yours or not. When you copy source code, add a reference. 3. Documentation is half of your work. Write a good blog post for your work and a step-by-step how-to guide. a. A good example is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Add references: a. Add a citation number in the contents and put the reference in a separate reference section.See Answer
  • Q10:This tutorial will guide you how to do homework in this course. 1. Go to https://www.kaggle.com/c/titanic and follow the walkthrough at https://www.kaggle.com/alexisbcook/titanic-tutorial 2. Submit your result to the Kaggle challenge. 3. Post your Jupyter notebook to your homepage as a blog post. A good example of a blog post is https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/ 4. Submit your homepage link and a screenshot PDF in Canvas. 5. Doing 1-4 will give you 8 points. To get the additional 2 points, create a section called "Contribution" and try to improve the performance. I expect one or two paragraphs minimum (the longer the better). Show the original score and improved score.See Answer
  • Q11:1. Use the insurance fraud dataset. Consider the data quality issues (e.g., missing data) and preprocess the data. Split the data into a 10% train and 90% test set using random_state = 1. Create a decision tree with a max depth of 3 using the gini measure. Print the accuracy on the test set and the tree. Is this a good approach? Why or why not? 2. Create a decision tree on the same data with a max depth of 3 and the entropy measure. Does the accuracy change? Does the tree change? Discuss which measure you think is better. 3. Now split the data into 70% train and 30% test using random_state = 1. Redo 1 and 2. Have the trees and accuracy changed? Are the trees more or less similar now? Discuss which split you think is better and why. 4. Evaluate how the accuracy changes with the depth of the tree with the 70-30 data. Look at the accuracy for a max depth of 1, 2, 3, ... 10, 15, 20. Plot the curve of how the accuracy changes. Do you see underfitting? Do you see overfitting? 5. What variable provides the most information gain in the insurance fraud data (for the 70-30 split)? 6. Decision trees are a "white box" method. What do you observe about the insurance fraud data using decision trees?See Answer
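A sketch of the setup in parts 1-2 of Q11. The insurance fraud dataset is not reproduced here, so a synthetic stand-in from make_classification is used; the split sizes and parameters (max_depth=3, random_state=1) follow the question.

```python
# Gini vs entropy decision trees on a 10% train / 90% test split,
# with synthetic data standing in for the insurance fraud dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
# 10% train leaves very little data to learn from -- part of the point of Q1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=1)

acc = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(max_depth=3, criterion=criterion,
                                  random_state=1).fit(X_tr, y_tr)
    acc[criterion] = accuracy_score(y_te, tree.predict(X_te))
```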
  • Q12:Natural Language Processing Project Welcome to the NLP Project for this section of the course. In this NLP project you will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews. This will be a simpler procedure than the lecture, since we will utilize the pipeline methods for more complex tasks. We will use the Yelp Review Data Set from Kaggle. Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review. The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business. The "useful" and "funny" columns are similar to the "cool" column. Let's get started! Just follow the directions below!
5. Use the corr() method on that groupby dataframe to produce this dataframe: (5 points)
                 cool    useful     funny  text length
cool         1.000000 -0.743329 -0.944939    -0.857664
useful      -0.743329  1.000000  0.894506     0.699881
funny       -0.944939  0.894506  1.000000     0.843461
text length -0.857664  0.699881  0.843461     1.000000
NLP Classification Task. Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.
6. Create a dataframe called yelp_class that contains the columns of the yelp dataframe, but only for the 1 or 5 star reviews. (5 points)
7. Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels) (10 points)
9. Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X: X = cv.fit_transform(X) (10 points)
Train Test Split. Let's split our data into training and testing data.
10. Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101 (10 points)
Training a Model. Time to train a model!
11. Import DecisionTreeClassifier, create an instance of the estimator and call it tree, then fit the model using the training set. (10 points)See Answer
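A compressed, runnable version of the Q12 pipeline (steps 9-11), with a handful of invented toy reviews standing in for the Yelp data; test_size and random_state follow the assignment.

```python
# Minimal CountVectorizer -> train_test_split -> DecisionTreeClassifier
# pipeline, as in Q12. The reviews below are toy stand-ins for Yelp data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_text = ["great food and friendly staff", "loved it amazing place",
          "terrible service never again", "awful food cold and late",
          "amazing experience will return", "worst meal of my life"] * 5
y = [5, 5, 1, 1, 5, 1] * 5          # star ratings (1 or 5 only)

cv = CountVectorizer()
X = cv.fit_transform(X_text)         # step 9: overwrite X with count features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)   # step 10

tree = DecisionTreeClassifier().fit(X_train, y_train)   # step 11
score = tree.score(X_test, y_test)
```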
  • Q13:Task - Correlation Rule Extraction and Outlier Detection. Question 1. Suppose we have ten product codes (from 0 to 9) and the following table of transactions:
Identifier   Products
1            0, 1, 3, 4
2            1, 2, 3
3            1, 2, 4, 5
4            1, 3, 4, 5
5            2, 3, 4, 5
6            2, 4, 5
7            3, A
8            1, 2, 3
9            1, 4, 5
10           B, 4
where A and B are the penultimate and last digits of your registration number divided by 2 and rounded to the nearest unit. That is, if your registration number ends in 12, then you will set A = 1/2 = 0.5, that is 1, and B = 2/2 = 1. But if the penultimate digit of your registration number is either 5 or 6, you will set A = 4. Accordingly, if the last digit of your registration number is either 3 or 4, you will set B = 2. Please answer the following questions, showing your calculations: 1. Compute the support of the sets {4, 5} and {3, 5}. 2. Compute the support and confidence of the rules {4, 5} → {3} and {3, 5} → {2}. 3. Run the Apriori algorithm with the F_{k-1} x F_{k-1} method for a support threshold equal to 2 and report which frequent sets you find at each step.See Answer
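Support and confidence in Q13 reduce to simple counting. Generic helpers, demonstrated on the fully specified transactions only (the placeholder items A and B in transactions 7 and 10 depend on the student's registration number, so they are left out of the demo and the counts below would change once they are filled in):

```python
# Support and confidence helpers for association rules (Q13).
def support_count(transactions, itemset):
    """Number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def confidence(transactions, antecedent, consequent):
    """conf(A -> B) = supp(A ∪ B) / supp(A)."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0

# Transactions 1-6, 8, 9 from the question (7 and 10 omitted: they
# contain the registration-number placeholders A and B).
transactions = [
    {0, 1, 3, 4}, {1, 2, 3}, {1, 2, 4, 5}, {1, 3, 4, 5},
    {2, 3, 4, 5}, {2, 4, 5}, {1, 2, 3}, {1, 4, 5},
]
s45 = support_count(transactions, {4, 5})      # transactions containing 4 and 5
conf_45_3 = confidence(transactions, {4, 5}, {3})
```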
  • Q14:Question 2. Suppose we have the following one-dimensional data set, consisting of 10 points: x1 = 59, x2 = 30, x3 = 64, x4 = 72, x5 = 35, x6 = 21, x7 = 66, x8 = 53, x9 = 56, x10 = A, where A is the last two digits of your registration number (i.e. if your registration number ends in 12 you will set A = 12). If the last two digits of your registration number are already present in the data set you will set A = 47. Detect outliers using: 1. The Grubbs test. 2. The one-nearest-neighbor technique, using as distance function the absolute value of the difference between two points. 3. The relative density for two nearest neighbors, using as distance function the absolute value of the difference between two points.See Answer
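For part 1 of Q14, the Grubbs statistic is G = max|xᵢ - x̄| / s, compared against a critical value derived from the t distribution. A sketch; the tenth value depends on the registration number, so the fallback value 47 mentioned in the question is used purely for illustration:

```python
# Two-sided Grubbs test sketch for the Q14 data set.
import numpy as np
from scipy import stats

def grubbs_statistic(x):
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

def grubbs_critical(n, alpha=0.05):
    # Two-sided critical value built from the Student t distribution.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

x = [59, 30, 64, 72, 35, 21, 66, 53, 56, 47]   # 47 = illustrative x10
G = grubbs_statistic(x)
is_outlier = G > grubbs_critical(len(x))
```

With these values G ≈ 1.75 (attained by x6 = 21), which is below the 5% critical value of about 2.29 for n = 10, so no point is flagged.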
  • Q15:ASSIGNMENT: Project 2 A trojan horse traffic dataset is provided alongside the test dataset in the following link: https://www.kaggle.com/datasets/neelabhkshetry/trojan-classification The dataset has 85 features with 1 target column "Class". This is a binary classification problem with the classes being "Trojan" or "Benign". You will perform all your tasks on the training dataset, which has almost 160k rows and an output class that is divided almost equally. Your task is to build classification models to predict the "Class" column. First, become familiar with the data set, and research some background information on trojan horse detection in a network. 1. First, research some background information on applying data mining in related topics by reviewing published research papers (minimum 3!) on similar topics. Make sure each review is at least one paragraph and you cite your sources (example format see appendix). 2. Go through the detailed process of data preparation on train_data.csv. Apply all preprocessing and data reduction techniques you assume are necessary and explain why. For every preprocessing technique: a. Explain what the technique is; b. Explain how it is applied; c. Show the summary results of the preprocessing; d. Do not include raw code or raw output or raw screen capture. Perform dimensionality reduction in at least two steps (once by applying your understanding of the features and once by using feature selection or extraction methods). 3. When the data set is of sufficient quality, apply several predictive techniques (minimum 3 techniques!) and create appropriate predictive models. For every predictive technique you applied: a. Explain what the technique is and how the technique works; b. Explain what the parameters of the technique are and how the parameters are chosen and tuned; c. Explain and discuss the predictive results and performance of the technique. Analyze different aspects of the result, including but not limited to ROC curve, F score, accuracy, etc.; d. Do not include raw code or raw output or raw screen capture. 4. After you have built three predictive models, test your models so that you can compare the three data mining techniques you've chosen. You should include the following in the comparison discussion: a. The performance comparison among your techniques; b. Visualization or table showing the performance differences. Make sure you explain and discuss the visualization or table; c. Explain the probable reason behind the performance differences; d. Explain which technique is the best for the dataset. Comparison of data mining techniques (and obtained predictive models) with additional discussion and interpretation of results will be a very important part of your report. Do not include raw code or raw output. A conclusion section will be used to summarize your findings throughout the report. Give examples of data samples, and discuss how you would use your model on your samples and what results can be expected from your model. The final grade will be 90% report and 10% class competition, explained next. A separate competition dataset (test_data.csv) will be provided for additional evaluation. We will compare your prediction results against the true label for additional unbiased evaluation as a component of your project. Up to 2 points in the final grade of the project will depend upon your model's performance with this dataset. The performance is calculated using Area under ROC. The entire class score will be calculated and ranked. If you achieve the top 20% of scores compared to your peers, you get the full 2 points. The grade distribution is the following:
Score Percentile   Score Range
Top 20%            2
50% - 80%          1.5 - 2
20% - 50%          1 - 1.5
10% - 20%          0.5 - 1
0% - 10%           0 - 0.5
Please read "Required Structure for the Project 2 Documentation", "Required Structure for the Test Data Result" and "Required Materials for Project 2 Submission" below. See Answer
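Project 2 above grades models by area under the ROC curve. A sketch of comparing two candidate classifiers that way with scikit-learn; synthetic data stands in for the trojan traffic set, and the two model choices are illustrative, not prescribed:

```python
# Compare two classifiers by ROC AUC, as in the Project 2 competition.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # AUC needs scores, not labels
    aucs[name] = roc_auc_score(y_te, scores)
```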
  • Q16:1. If Epsilon is 2 and minpoint is 2 (including the centroid itself), what are the clusters that DBScan would discover with the following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Use the Euclidean distance. Draw the 10 by 10 space and illustrate the discovered clusters. What if Epsilon is increased to sqrt(10)? (30 pts)See Answer
  • Q17:2. Use OPTICS algorithm to output the reachability distance and the cluster ordering for the dataset provided, starting from Instance 1. Use the following parameters for discovering the cluster ordering: minPts = 2 and epsilon = 2. Use epsilonprime = 1.2 to generate clusters from the cluster ordering and their reachability distance. Don't forget to record the core distance of a data point if it has a dense neighborhood. You don't need to include the core distance in your result but you may need to use them in generating clusters. (45 pts) [Dataset visualization figure omitted.] Below are the first few lines of the calculation. You need to complete the remaining lines and generate clusters based on the given epsilonprime value:
Instance       (X, Y)   Reachability Distance
Instance 1:    (1, 1)   Undefined (or infinity)
Instance 2:    (0, 1)   1.0
Instance 3:    (1, 0)   1.0
Instance 16:   (5, 9)   UndefinedSee Answer
  • Q18:3. Use F-measure and the Pairwise measures (TP, FN, FP, TN) to measure the agreement between a clustering result (C1, C2, C3) and the ground truth partitions (T1, T2, T3) as shown below. Show details of your calculation. (25 pts) [Contingency table of clusters C1-C3 against ground-truth partitions T1-T3 omitted.]See Answer
  • Q19: Each student is required to submit (1) the Orange workflow file or Python/R script files (s)he created using the format HW1PID.OWS (.py and .r respectively for Python/R) where PID is your PID number, and (2) a PDF file with answers using the format HW3PID.pdf. All files listed below were posted on the Canvas sub-folder Homeworks of the Data sets folder. Homework III - Questions. The purpose of this homework is to use textual analysis of WSJ news in predicting financial market outcomes. In particular we will rely on a data set measuring the state of the economy via textual analysis of business news. From the full text content of 800,000 Wall Street Journal articles for 1984-2017, Bybee et al. estimate a model that summarizes business news as easily interpretable topical themes and quantifies the proportion of news attention allocated to each theme at each point in time. These news attention estimates are inputs into the models we want to estimate. The data source is described in http://structureofnews.com/. The data was standardized and prepared for this assignment. Please use the data file ML_MBA_UNC_Processed.xlxs from the HW3 material folder (sub-folder of the Assignments folder) on CANVAS (it is also posted in the Data Sets folder - Homework sub-folder). Click on the News Taxonomy tab of the aforementioned website and you will find a taxonomy of news themes in The Wall Street Journal. Estimated with hierarchical agglomerative clustering, the dendrogram illustrates how 180 topics cluster into an intuitive hierarchy of increasingly broad metatopics. 
The list of topics is reproduced below - further details appear on the website: Natural disasters, Internet, Soft drinks, Mobile devices, Profits, M&A, Changes, Police / crime, Research, Executive pay, Mid- size cities, Scenario analysis, Economic ideology, Middle east, Savings & loans, IPOs, Restraint, Electronics, Record high, Connecticut, Steel, Bond yields, Small business, Cable, Fast food, Disease, Activists, Competition, Music industry, Short sales, Nonperforming loans, Key role, News conference, US defense, Political contributions, Revised estimate, Economic growth, Justice Department, Credit ratings, Broadcasting, Problems, Announce plan, Federal Reserve, Job cuts, Chemicals / paper, Regulation, Environment, Small caps, Unions, C-suite, Control stakes, Mutual funds, Venture capital, European sovereign debt, Mining, Company spokesperson, Private / public sector, Pharma, Schools, Russia, Programs / initiatives, Health insurance, Drexel, Trade agreements, Treasury bonds, Challenges, People familiar, Sales call, Publishing, Financial crisis, Aerospace / defense, Recession, Latin America, Cultural life, SEC, Earnings losses, Phone companies, Computers, Marketing, Japan, Nuclear / North Korea, NY politics, Tobacco, Product prices, Biology / chemistry / physics, Movie industry, Automotive, Machinery, Bankruptcy, Arts, International exchanges, Accounting, Space program, Immigration, Small changes, Small possibility, Agreement reached, Oil drilling, Rail / trucking / shipping, Indictments, Positive sentiment, Canada / South Africa, Airlines, California, Corporate governance, China, Investment banking, Spring/summer, Software, Pensions, Humor / language, Systems, Clintons, Major concerns, Mid-level executives, US Senate, Agriculture, Bank loans, Takeovers, State politics, Real estate, Futures / indices, Southeast Asia, Optimism, Corrections / amplifications, Government budgets, Exchanges / composites, Currencies / metals, Mortgages, Financial reports, Germany, 
Rental properties, Committees, Subsidiaries, Management changes, Share payouts, France / Italy, Acquired investment banks, Credit cards, Bear / bull market, Earnings forecasts, Terrorism, Watchdogs, Oil market, Couriers, Commodities, Utilities, Foods / consumer goods, Convertible / preferred, Macroeconomic data, Courts, Safety administrations, Reagan, Bush / Obama / Trump, Fees, Gender issues, Trading activity, Microchips, Insurance, Earnings, Luxury / beverages, Iraq, National security, Buffett, Taxes, Options / VIX, Casinos, Elections, Private equity / hedge funds, Negotiations, European politics, Size, NASD, Mexico, Retail, Long / short term, Wide range, Lawsuits, UK, Revenue growth. 1.a [4 points] For the exercise below select ALL features EXCEPT FUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SGNFUTSP500. Your first task is to predict the direction of the market (S&P 500 index) over the next month, i.e. SGNFUTSP500 up or down - which is a classification prediction problem - using the importance of news topics. Use the logistic regression and neural network widgets of the Orange software to compute the following models:
• logistic regression with LASSO regularization with C = 0.007
• logistic regression with LASSO regularization with C = 0.80
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
You should get something along the following lines with 10-fold cross-validation:
Model                        AUC     CA
Logistic LASSO C = 0.80      0.583   0.649
Neural Net                   0.566   0.619
Logistic LASSO C = 0.007     0.500   0.371
Explain why the Logistic regression with LASSO C = 0.007 does so poorly (hint: look at the coefficients of the model). When you look at the ROC curve, explain the curve for the LASSO C = 0.007 model. 1.b [4 points] List the top ten features which have the most negative impact on next month's market direction and the top ten with the most positive impact. 
2.a [4 points] We are turning now to a linear regression instead of a classification problem, predicting actual market returns (continuous) rather than direction (binary). For the exercise below select ALL features EXCEPT SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: FUTSP500. Use the linear regression and neural net widgets of the Orange software to compute the following models:
• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the l₂ regularization. Note that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
The Test and Score widget now records MSE, RMSE, MAE, and R2. What do you learn from the R2 results? 2.b [4 points] In the previous case we were trying to predict the return next month with current news. Now, we will try to explain current returns with current news. For the exercise below select ALL features EXCEPT FUTSP500, SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SP500. Use the linear regression widget of the Orange software to compute the following models:
• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the l₂ regularization. Recall that the logistic regression and linear regression widgets use a different way of writing the penalty function (although there is a mapping between the role of C and α via something called the Lagrangian multiplier, not covered in the course).
• neural network with 100 hidden-layer neurons, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5
The Test and Score widget now records MSE, RMSE, MAE, and R2. 
What do you learn from the R² results?

2.c [2 points] List the top ten features which have the most negative impact on next month's market returns and the top ten with the most positive impact. How is it different from the results in 1.b?

2.d [2 points] Explain the difference between your answers in 1.a, 2.a and 2.b.
2. In PC Tech's product mix problem, assume there is another PC model, the VXP, that the company can produce in addition to Basics and XPs. Each VXP requires eight hours for assembling, three hours for testing, $275 for component parts, and sells for $560. At most 50 VXPs can be sold.
a. Modify the spreadsheet model to include this new product, and use Solver to find the optimal product mix.

4. Again continuing Problem 2, suppose that you want to force the optimal solution to be integers. Do this in Solver by adding a new constraint. Select the decision variable cells for the left side of the constraint, and in the middle dropdown list, select the "int" option. How does the optimal integer solution compare to the optimal noninteger solution in Problem 2? Are the decision variable cell values rounded versions of those in Problem 2? Is the objective value more or less than in Problem 2?

26. A furniture company manufactures desks and chairs. Each desk uses four units of wood, and each chair uses three units of wood. A desk contributes $250 to profit, and a chair contributes $145. Marketing restrictions require that the number of chairs produced be at least four times the number of desks produced. There are 2000 units of wood available.
a. Use Solver to maximize the company's profit.

46. During each four-hour period, the Smalltown police force requires the following number of on-duty police officers: four from midnight to 4 A.M.; four from 4 A.M. to 8 A.M.; seven from 8 A.M. to noon; seven from noon to 4 P.M.; eight from 4 P.M. to 8 P.M.; and ten from 8 P.M. to midnight. Each police officer works two consecutive four-hour shifts.
a. Determine how to minimize the number of police officers needed to meet Smalltown's daily requirements.

66. United Steel manufactures two types of steel at three different steel mills. During a given month, each steel mill has 240 hours of blast furnace time available.
Because of differences in the furnaces at each mill, the time and cost to produce a ton of steel differ for each mill, as listed in the file P04_66.xlsx. Each month, the company must manufacture at least 700 tons of steel 1 and 600 tons of steel 2. Determine how United Steel can minimize the cost of manufacturing the desired steel.
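The furniture problem (26) is small enough to verify outside a spreadsheet. A sketch using scipy.optimize.linprog (the variable names are mine; linprog minimizes, so the profit objective is negated):

```python
from scipy.optimize import linprog

# Decision variables: x[0] = desks, x[1] = chairs.
# Maximize 250*desks + 145*chairs  ->  minimize the negated profit.
c = [-250, -145]

# Wood:      4*desks + 3*chairs <= 2000
# Marketing: chairs >= 4*desks, rewritten as 4*desks - chairs <= 0
A_ub = [[4, 3],
        [4, -1]]
b_ub = [2000, 0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
desks, chairs = res.x
print("desks = %.1f, chairs = %.1f, profit = $%.2f"
      % (desks, chairs, -res.fun))
# -> desks = 125.0, chairs = 500.0, profit = $103750.00
```

This is the same model a Solver setup would use: decision cells, one objective cell, and the two ≤ constraints; adding an "int" constraint on the decision cells (as in Problem 4) turns it into the integer variant.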

TutorBin Testimonials

I found TutorBin Data Mining homework help when I was struggling with complex concepts. Experts provided step-wise explanations and examples to help me understand concepts clearly.

Rick Jordon

5

TutorBin experts resolve your doubts without making you wait long. They are responsive and available 24/7 whenever you need Data Mining subject guidance.

Andrea Jacobs

5

I trust TutorBin for assisting me in completing Data Mining assignments with quality and 100% accuracy. Experts are polite, listen to my problems, and have extensive experience in their domain.

Lilian King

5

I got my Data Mining homework done on time. My assignment was proofread and edited by professionals, with zero plagiarism, as the experts developed it from scratch. I feel relieved and super excited.

Joey Dip

5

TutorBin helping students around the globe

TutorBin believes that distance should never be a barrier to learning. Over 500,000 orders and 100,000+ happy customers show why TutorBin has become the name that keeps learning fun in the UK, USA, Canada, Australia, Singapore, and UAE.