identify the best sales prospects for an intensive sales campaign. In particular, the manufacturer is
interested in classifying households as prospective owners or nonowners on the basis of Income (in
$1000s) and Lot Size (in 1000 ft²). The marketing expert looked at a random sample of 24 households,
given in the file Riding Mowers.csv.
a. Using ggplot() in R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome variable
owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels and a legend, etc.).
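As a starting point for part a, here is a minimal sketch rather than a reference solution. It assumes the data frame has columns named Income, Lot_Size, and Ownership (hypothetical names; check names() on the data you actually load). It uses a few made-up rows in place of the file so that it runs on its own.

```r
library(ggplot2)

# Made-up stand-in rows for illustration only (NOT the textbook sample).
# Column names Income, Lot_Size, Ownership are assumptions.
mowers <- data.frame(
  Income    = c(60, 85, 65, 110, 50, 45),
  Lot_Size  = c(18, 17, 21, 19, 20, 16),
  Ownership = c("Owner", "Owner", "Owner", "Nonowner", "Nonowner", "Nonowner")
)

# Lot Size (y) vs. Income (x), colored by the outcome variable
p_mowers <- ggplot(mowers, aes(x = Income, y = Lot_Size, color = Ownership)) +
  geom_point(size = 3) +
  labs(
    title = "Lot Size vs. Income by Ownership",
    x     = "Income ($1000s)",
    y     = "Lot Size (1000 sq ft)",
    color = "Ownership"
  ) +
  theme_minimal()

print(p_mowers)
```

With the real file, replace the stand-in data frame with a read.csv() call on the supplied CSV and adjust the column names in aes() to match.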
3. Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The file
LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in London in January 2008.
This is a subset of the full dataset that includes data for the entire year.
a. Using ggplot() in R, create a histogram and a density plot of the average retail price. Overlay the
histogram and density plot with a normal density curve. Does the price data look normally distributed?
b. Create a Q-Q plot of the price data. Does the Q-Q plot confirm your finding (in part a.) about the
normality of the data? Are there any outliers?
c. Create a bar chart, showing the average retail price by store postcode (StorePostcode). Which store
postcode has the highest average retail price? Which has the lowest? Hint: For better readability, feel free
to rotate the x-axis labels. You can do so by adding the following to the ggplot() call:
+ theme(axis.text.x = element_text(angle = 90)). Also, to zoom in on the relevant price
range, add the following to the ggplot() call: + coord_cartesian(ylim = c(480, 500)).
d. Using the filter() function of the dplyr package, reduce your laptop data frame to only the two
store postcodes identified in part c (highest and lowest average retail price). Using ggplot2, create
side-by-side violin plots of the retail prices at the two stores. Be
sure to jitter the markers for better visibility. Does there seem to be a large difference between their prices?
e. To better compare retail prices across postcodes, create side-by-side boxplots of retail prices of the two
postcodes and compare the price distribution in the two postcodes. Does there seem to be a difference
between their price distributions?
f. Suppose you are interested in what specific technical features greatly impact computer prices. Using
the cut() function of the base package, create a new categorical variable in your main laptop sales
data frame that contains three RetailPrice categories: "low", "medium", and "high". Call the variable
PriceCat and make sure that its class is factor. Subsequently, create another data frame that contains
this PriceCat variable and all the columns that describe laptop features (such as BatteryLife_Hrs,
ScreenSize_In, etc.). Finally, create a boxplot-enhanced parallel coordinate plot with all the features
on the horizontal axis and PriceCat on the vertical axis. Which feature(s) seem to be the most
important determinants of PriceCat?
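The calls named in parts a through f can be sketched as follows. This is a hedged outline, not a worked solution. It runs on a small simulated stand-in rather than the real sales file. The column names RetailPrice, StorePostcode, and BatteryLife_Hrs are taken from the exercise text but should be checked against the loaded data. The exercise does not specify cut points for PriceCat, so the tercile breaks used here are an assumption. The parallel coordinate plot itself is left out.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in for illustration only (NOT the real laptop sales data).
set.seed(1)
laptops <- data.frame(
  RetailPrice     = round(rnorm(300, mean = 490, sd = 30)),
  StorePostcode   = sample(c("Store A", "Store B", "Store C"), 300, replace = TRUE),
  BatteryLife_Hrs = sample(4:6, 300, replace = TRUE)
)

# Part a: histogram on the density scale, kernel density, and a normal curve
# with the sample mean and standard deviation
p_hist <- ggplot(laptops, aes(x = RetailPrice)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", color = "white") +
  geom_density(color = "steelblue") +
  stat_function(fun = dnorm,
                args = list(mean = mean(laptops$RetailPrice, na.rm = TRUE),
                            sd   = sd(laptops$RetailPrice, na.rm = TRUE)),
                color = "red", linetype = "dashed") +
  labs(x = "Retail price", y = "Density")

# Part b: normal Q-Q plot with a reference line
p_qq <- ggplot(laptops, aes(sample = RetailPrice)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")

# Part c: average price per store postcode, with the two statements from the hint
avg_price <- laptops %>%
  group_by(StorePostcode) %>%
  summarise(AvgPrice = mean(RetailPrice, na.rm = TRUE))

p_bar <- ggplot(avg_price, aes(x = StorePostcode, y = AvgPrice)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  coord_cartesian(ylim = c(480, 500)) +
  labs(x = "Store postcode", y = "Average retail price")

# Part d: keep only the postcodes with the highest and lowest average price
extremes <- avg_price$StorePostcode[c(which.max(avg_price$AvgPrice),
                                      which.min(avg_price$AvgPrice))]
two_stores <- laptops %>% filter(StorePostcode %in% extremes)

p_violin <- ggplot(two_stores, aes(x = StorePostcode, y = RetailPrice)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.3)

# Part e: side-by-side boxplots for the same two postcodes
p_box <- ggplot(two_stores, aes(x = StorePostcode, y = RetailPrice)) +
  geom_boxplot()

# Part f: three price categories as a factor (tercile breaks are an assumption)
laptops$PriceCat <- cut(
  laptops$RetailPrice,
  breaks = quantile(laptops$RetailPrice, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)

# Feature data frame; with the real data, list every feature column here
laptop_features <- laptops %>% select(PriceCat, BatteryLife_Hrs)
```

Setting include.lowest = TRUE keeps the minimum price inside the "low" bin. Without it, cut() would return NA for that observation. Equal-width bins or domain-based thresholds are equally reasonable choices for the breaks.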