
course-work
March 23, 2024

1 Course work

0. Import the libraries that you will need

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

1. Get the data - in the cell below run:

Note: you only need to run this command the first time you do the exercise. If you save, go away and come back, you can skip straight to step 2.

    !python get-my-data.py

2. Read in the csv:

    df = pd.read_csv('coursework-data.csv')

3. Perform some exploratory data analysis to clean up the dataset. The code needed for this part is found in the first set of exercises that you did.

• Remove outliers
• If any pairs of variables are highly correlated, remove one of the pair - highly correlated in this case means a correlation > 0.99

4. Fit a baseline model, linear regression, to map the control parameters (all parameters except gllbsc_gap) to the dependent parameter gllbsc_gap. Summarise its performance. To set up the data use:

    x = df.loc[:, df.columns != "gllbsc_gap"].values
    y = df.loc[:, df.columns == "gllbsc_gap"].values

The rest of the code you need for this is found in the second set of exercises that you did.

• From looking at the linear regression model, which features have the greatest influence on the band gap?

5. Develop a gradient boosted regressor for the same data. Summarise its performance.

1.1 Important notes

1.1.1 Submitting the coursework

When you are finished with the coursework, use File > Save and Export Notebook As > pdf to download a pdf of the completed notebook. Submit this pdf via the portal on QMplus. The deadline for submission is Sunday, 24 March 2024, 11:59 PM.

1.1.2 Text explanations

Please please please add text to explain what you are doing in the code. Adding text boxes is easy: just add a new cell as normal, then change the type to Markdown with the dropdown menu at the top of the cells. Adding text will make sure that markers can give you proper grades even if you make a small slip in your code.
If you have no text explanation and still have a small slip, you will likely get no marks!

1.1.3 Datasets

All of your datasets are generated randomly, so do not expect the same answers as your friends. If you compare answers and find that you have something very different, do not worry.

1.1.4 Warnings from the code

Don't worry if the code throws some warnings sometimes. If it keeps running then it is fine. Warnings usually just alert you to future planned changes in the code you are using.

1.1.5 Long run times

There is a certain part of the exercise where a grid search is required. It could take quite a long time with this code. I have tested it and it took about 15 minutes for a 10-fold cross validation on a 5x5 grid search. Don't worry if it seems to be running for a long time, that's okay.
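
Steps 3 and 4 of the coursework (outlier removal, dropping one of each highly correlated pair, and the linear regression baseline) could look roughly like the sketch below. It is not the code from the exercise sets, which is not reproduced here. The data frame is a synthetic stand-in for coursework-data.csv, the feature names feat_a, feat_b and feat_c are hypothetical, and the 3-standard-deviation outlier rule and the 80/20 train/test split are assumptions. Only the 0.99 correlation threshold and the x/y setup come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for coursework-data.csv (hypothetical feature names).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"feat_a": rng.normal(size=n), "feat_b": rng.normal(size=n)})
df["feat_c"] = 2.0 * df["feat_a"] + rng.normal(scale=0.001, size=n)  # near-copy of feat_a
df["gllbsc_gap"] = 1.5 * df["feat_a"] - 0.5 * df["feat_b"] + rng.normal(scale=0.1, size=n)
df.loc[0, "feat_b"] = 50.0  # one deliberately extreme value

# Step 3a (assumed rule): keep rows where every column lies within
# 3 standard deviations of that column's mean.
z = (df - df.mean()) / df.std()
clean = df[(z.abs() <= 3).all(axis=1)]

# Step 3b: look at each feature pair once (upper triangle of the correlation
# matrix) and drop the second member of any pair with |correlation| > 0.99.
# The target column is left out of this check so it cannot be dropped.
corr = clean.drop(columns="gllbsc_gap").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
clean = clean.drop(columns=to_drop)

# Step 4: set up x and y as in the coursework text (y flattened to 1-D).
x = clean.loc[:, clean.columns != "gllbsc_gap"].values
y = clean.loc[:, clean.columns == "gllbsc_gap"].values.ravel()

# Hold out 20% of the rows (assumed split) to measure performance.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(x_train, y_train)
pred = model.predict(x_test)
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")

# Print one fitted coefficient per remaining feature.
feature_names = clean.columns[clean.columns != "gllbsc_gap"]
for name, coef in zip(feature_names, model.coef_):
    print(name, round(coef, 3))
```

The coefficients are only a fair guide to which features influence the band gap most if the features are on comparable scales. If they are not, standardising the features before fitting makes the comparison meaningful.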
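
The grid search described under long run times could be set up along the lines below for the gradient boosted regressor in step 5. This is only a sketch. The arrays are synthetic stand-ins for the cleaned x and y, and the choice of n_estimators and learning_rate as the two searched hyperparameters, along with their values, is an assumption. Only the 5x5 grid size and the 10-fold cross validation come from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned x and y arrays (not the coursework data).
rng = np.random.default_rng(0)
x = rng.normal(size=(150, 3))
y = np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# A 5x5 grid. Which two hyperparameters to search, and over which values,
# is an assumption made for this sketch.
param_grid = {
    "n_estimators": [10, 25, 50, 75, 100],
    "learning_rate": [0.01, 0.03, 0.1, 0.3, 1.0],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=10,         # 10-fold cross validation
    scoring="r2",  # rank parameter combinations by R^2
)
search.fit(x_train, y_train)  # 25 combinations x 10 folds, then one refit

print("best parameters:", search.best_params_)
print("best CV R^2:", round(search.best_score_, 3))
print("test R^2:", round(search.score(x_test, y_test), 3))
```

The run time grows with the number of grid points times the number of folds, here 25 x 10 = 250 model fits plus a final refit on all the training data. Passing n_jobs=-1 to GridSearchCV runs the fits in parallel on the available cores.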