Each student is required to submit (1) the Orange workflow file or Python/R script files (s)he created using the format HW1PID.OWS (.py and .r respectively for Python/R) where PID is
your PID number, and (2) a PDF file with answers using the format HW3PID.pdf. All files listed below are posted on Canvas, in the Homeworks sub-folder of the Data sets folder.

Homework III - Questions

The purpose of this homework is to use textual analysis of WSJ news to predict financial market outcomes. In particular, we will rely on a data set that measures the state of the economy via textual analysis of business news. From the full text content of 800,000 Wall Street Journal articles for 1984-2017, Bybee et al. estimate a model that summarizes business news as easily interpretable topical themes and quantifies the proportion of news attention allocated to each theme at each point in time. These news attention estimates are the inputs to the models we want to estimate. The data source is described at http://structureofnews.com/.

The data was standardized and prepared for this assignment. Please use the data file ML_MBA_UNC_Processed.xlsx from the HW3 material folder (sub-folder of the Assignments folder) on Canvas (it is also posted in the Data Sets folder, Homework sub-folder).

Click on the News Taxonomy tab of the aforementioned website and you will find a taxonomy of news themes in The Wall Street Journal. Estimated with hierarchical agglomerative clustering, the dendrogram illustrates how 180 topics cluster into an intuitive hierarchy of increasingly broad metatopics.

The list of topics is reproduced below; further details appear on the website: Natural disasters, Internet, Soft drinks, Mobile devices, Profits, M&A, Changes, Police / crime, Research, Executive pay, Mid-size cities, Scenario analysis, Economic ideology, Middle east, Savings & loans, IPOs, Restraint, Electronics, Record high, Connecticut, Steel, Bond yields, Small business, Cable, Fast food, Disease, Activists, Competition, Music industry, Short sales, Nonperforming loans, Key role, News conference, US defense, Political contributions, Revised estimate, Economic growth, Justice Department, Credit ratings, Broadcasting, Problems, Announce plan, Federal Reserve, Job cuts, Chemicals / paper, Regulation, Environment, Small caps, Unions, C-suite, Control stakes, Mutual funds, Venture capital, European sovereign debt, Mining, Company spokesperson, Private / public sector, Pharma, Schools, Russia, Programs / initiatives, Health insurance, Drexel, Trade agreements, Treasury bonds, Challenges, People familiar, Sales call, Publishing, Financial crisis, Aerospace / defense, Recession, Latin America, Cultural life, SEC, Earnings losses, Phone companies, Computers, Marketing, Japan, Nuclear / North Korea, NY politics, Tobacco, Product prices, Biology / chemistry / physics, Movie industry, Automotive, Machinery, Bankruptcy, Arts, International exchanges, Accounting, Space program, Immigration, Small changes, Small possibility, Agreement reached, Oil drilling, Rail / trucking / shipping, Indictments, Positive sentiment, Canada / South Africa, Airlines, California, Corporate governance, China, Investment banking, Spring/summer, Software, Pensions, Humor / language, Systems, Clintons, Major concerns, Mid-level executives, US Senate, Agriculture, Bank loans, Takeovers, State politics, Real estate, Futures / indices, Southeast Asia, Optimism, Corrections / amplifications, Government budgets, Exchanges / composites, Currencies / metals, Mortgages, Financial reports, Germany,
Rental properties, Committees, Subsidiaries, Management changes, Share payouts, France / Italy, Acquired investment banks, Credit cards, Bear / bull market, Earnings forecasts, Terrorism, Watchdogs, Oil market, Couriers, Commodities, Utilities, Foods / consumer goods, Convertible / preferred, Macroeconomic data, Courts, Safety administrations, Reagan, Bush / Obama / Trump, Fees, Gender issues, Trading activity, Microchips, Insurance, Earnings, Luxury / beverages, Iraq, National security, Buffett, Taxes, Options / VIX, Casinos, Elections, Private equity / hedge funds, Negotiations, European politics, Size, NASD, Mexico, Retail, Long / short term, Wide range, Lawsuits, UK, Revenue growth

1.a [4 points] For the exercise below, select ALL features EXCEPT FUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SGNFUTSP500

Your first task is to predict the direction of the market (S&P 500 index) over the next month, i.e. whether SGNFUTSP500 is up or down, using the importance of news topics. This is a classification problem. Use the logistic regression and neural network widgets of the Orange software to compute the following models:

• logistic regression with LASSO regularization with C = 0.007
• logistic regression with LASSO regularization with C = 0.80
• neural network with 100 neurons in the hidden layer, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

You should get something along the following lines with 10-fold cross-validation:

Models                      AUC     CA
Logistic LASSO C = 0.80     0.583   0.649
Neural Net                  0.566   0.619
Logistic LASSO C = 0.007    0.500   0.371

Explain why the logistic regression with LASSO C = 0.007 does so poorly (hint: look at the coefficients of the model). Then look at the ROC curve and explain its shape for the LASSO C = 0.007 model.

1.b [4 points] List the top ten features which have the most negative impact on next month's market direction and the top ten with the most positive impact.
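
For students working in Python rather than Orange, the ranking asked for in 1.b can be sketched as below. This is only an illustration under stated assumptions: the helper name rank_coefficients and the topic names and coefficient values in the example dictionary are hypothetical and made up for the demonstration; they are not results from the assignment data. The sketch assumes you have already exported the fitted coefficients as a name-to-value mapping.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def rank_coefficients(coefs: Dict[str, float], k: int = 10) -> Tuple[Ranked, Ranked]:
    """Return (k most negative, k most positive) coefficients.

    Coefficients that are exactly zero (features dropped by LASSO)
    are ignored, since they have no impact on the prediction.
    """
    nonzero = [(name, value) for name, value in coefs.items() if value != 0.0]
    ordered = sorted(nonzero, key=lambda item: item[1])  # ascending by value
    most_negative = [item for item in ordered if item[1] < 0][:k]
    most_positive = [item for item in reversed(ordered) if item[1] > 0][:k]
    return most_negative, most_positive


# Hypothetical coefficients, made up for illustration only.
example = {
    "topic_a": -0.42,
    "topic_b": 0.31,
    "topic_c": -0.05,
    "topic_d": 0.12,
    "topic_e": 0.0,
}

negative, positive = rank_coefficients(example, k=2)
print("most negative:", negative)
print("most positive:", positive)
```

With k left at its default of 10, the same call returns the two lists requested in the question. If fewer than ten coefficients of a given sign are nonzero, the corresponding list is simply shorter.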

2.a [4 points] We now turn to a regression problem instead of a classification problem, predicting actual market returns (continuous) rather than direction (binary). For the exercise below, select ALL features EXCEPT SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: FUTSP500

Use the linear regression and neural network widgets of the Orange software to compute the following models:

• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the ℓ₂ regularization. Note that the logistic regression and linear regression widgets write the penalty function differently (although there is a mapping between C and α via something called the Lagrange multiplier, which is not covered in the course).
• neural network with 100 neurons in the hidden layer, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

The Test and Score widget now records MSE, RMSE, MAE, and R2. What do you learn from the R2 results?

2.b [4 points] In the previous case we were trying to predict the return next month with current news. Now we will try to explain current returns with current news. For the exercise below, select ALL features EXCEPT FUTSP500, SGNFUTSP500, APPL, SGNAPPL, FUTAPPL, SGNFUTAPPL, SGNSP500, DATE and use as target: SP500

Use the linear regression and neural network widgets of the Orange software to compute the following models:

• regression with Elastic Net with α = 0.007 for the LASSO/Ridge regularization and weight 0.80 on the ℓ₂ regularization. Recall that the logistic regression and linear regression widgets write the penalty function differently (although there is a mapping between the role of C and α via something called the Lagrange multiplier, which is not covered in the course).
• neural network with 100 neurons in the hidden layer, ReLU activation function, SGD optimization, max 200 iterations, and regularization α = 5

The Test and Score widget now records MSE, RMSE, MAE, and R2.
What do you learn from the R2 results?

2.c [2 points] List the top ten features which have the most negative impact on next month's market returns and the top ten with the most positive impact. How is it different from the results in 1.b?

2.d [2 points] Explain the difference between your answers in 1.a, 2.a and 2.b.
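
As a reading aid for the four regression scores mentioned in 2.a and 2.b, here is a minimal sketch of their textbook definitions in plain Python. It assumes the Test and Score widget follows these standard formulas; the function name regression_scores and the toy numbers are hypothetical and are not taken from the assignment data.

```python
import math
from typing import Dict, Sequence


def regression_scores(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, RMSE, MAE and R2 from their standard definitions."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n   # mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    # R2 compares the model's squared error with that of always
    # predicting the mean of the actual values.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined when all actual values are identical")
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}


# Toy numbers, made up for illustration only.
toy_actual = [1.0, 2.0, 3.0, 4.0]
model_scores = regression_scores(toy_actual, [1.0, 2.0, 3.0, 2.0])
baseline_scores = regression_scores(toy_actual, [2.5, 2.5, 2.5, 2.5])
print(model_scores)
print(baseline_scores)
```

Under these definitions, always predicting the mean of the actual values gives R2 = 0, and a model whose squared error is larger than that baseline gives a negative R2. Keeping this scale in mind may help when interpreting the values reported by Orange.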