Part I: Downloading the US Census Income Dataset
We will use one of the datasets from the UCI data repository. The repository hosts many
datasets that researchers and data scientists use to evaluate new data mining and
machine learning methods.
1. Go to the US Census Income dataset
page: https://archive.ics.uci.edu/ml/datasets/Census+Income
2. Click on the Data Folder
3. Download the adult.data, adult.names and adult.test files.
4. adult.names is a text file. It can be opened using Sublime or any text editor.
5. adult.data and adult.test are CSV files. They can be opened in Sublime, Excel, or any
text editor.
6. The adult.names file contains a description of the dataset and its attributes.
Each line in the dataset represents one person. Each column represents one attribute of
that person, such as age, work class, education, marital status, etc.
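As a quick way to check this row/column layout, the sketch below parses a few lines in the adult.data format with Python's csv module. The sample lines are illustrative, and the column names are an assumption based on the attribute list in adult.names ("income" is a name chosen here for the final column); verify both against your downloaded copy.

```python
import csv

# Column names as listed in adult.names; "income" is a label chosen here
# for the final class column (assumption: check against your copy).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def parse_rows(lines):
    """Turn raw text lines into one dict per person, skipping blank lines."""
    # skipinitialspace drops the space that follows each comma in the file.
    reader = csv.reader(lines, skipinitialspace=True)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Illustrative lines in the file's "value, value, ..." format.
sample = [
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K",
    "",
    "52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial, "
    "Wife, White, Female, 15024, 0, 40, United-States, >50K",
]

people = parse_rows(sample)
for person in people:
    print(person["age"], person["education"], person["income"])
```

To run this on the real file, pass an open file object instead of the sample list, for example with open("adult.data", newline="") as f: people = parse_rows(f).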
Part II: Creating a Bar Chart
The last column in the dataset holds the income. It is a categorical label that has the
two values "<=50K" and ">50K". These indicate whether the person's annual income is
at most $50K or greater than $50K, respectively.
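To make the two labels concrete, here is a minimal sketch that tallies the last column with a plain dictionary. The sample lines are shortened, hypothetical rows (the real file has more columns), and count_income_labels is a helper name introduced only for this illustration.

```python
def count_income_labels(lines):
    """Count how often each income label appears in the last column."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, such as one at the end of the file
            continue
        # The label is the last comma-separated field; strip the leading space.
        label = line.split(",")[-1].strip()
        counts[label] = counts.get(label, 0) + 1
    return counts


# Shortened, hypothetical rows: only the last field matters here.
sample = [
    "39, State-gov, Bachelors, <=50K",
    "52, Self-emp-inc, HS-grad, >50K",
    "28, Private, Bachelors, <=50K",
    "",
]

label_counts = count_income_labels(sample)
print(label_counts)
```

With the real dataset you would pass the open adult.data file object in place of sample.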
Write a Python script that uses the Bokeh library to draw a bar chart showing the
number of people with income greater than 50K and the number of people with income
less than or equal to 50K.
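As a rough sketch of the plotting half of such a script, the code below copies a dictionary of counts into x and y lists and hands them to vbar(). The counts are placeholders rather than real dataset totals, the helper names are hypothetical, and the figure arguments assume a recent Bokeh release (3.x), where the size keywords are width and height.

```python
def dict_to_arrays(counts):
    """Copy dictionary keys and values into parallel x and y lists."""
    x, y = [], []
    for label, n in counts.items():
        x.append(label)
        y.append(n)
    return x, y


def make_bar_chart(x, y):
    """Build a Bokeh figure with one vertical bar per income label."""
    # Imported here so dict_to_arrays can be used without Bokeh installed.
    from bokeh.plotting import figure

    # x_range=x makes the x-axis categorical, one slot per label.
    p = figure(x_range=x, height=350, title="People per income label")
    p.vbar(x=x, top=y, width=0.9)
    p.y_range.start = 0
    p.xgrid.grid_line_color = None
    return p


# Placeholder counts for illustration only, not the real dataset totals.
counts = {"<=50K": 3, ">50K": 1}
x, y = dict_to_arrays(counts)
print(x, y)
```

To display the chart, import show from bokeh.plotting and call show(make_bar_chart(x, y)); this opens the plot in a browser.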
1. Start by using the basic bar chart code from the Bokeh
documentation:
https://docs.bokeh.org/en/latest/docs/gallery/bar_basic.html
2. Write code to create the x and y arrays that will be passed to the vbar()
function. (P.S. If you are going to use file handling, you can follow the steps
below. However, you are free to use pandas dataframes or any other
methods/libraries in your script.)
1. First, create a dictionary and use it to count the number of
instances for each income label.
2. Create a for-loop that scans the dataset file, extracts the last
column, and updates the dictionary accordingly.
3. Once the for-loop is done, write another for loop that copies the
keys and values of the dictionary into two arrays, call them x & y.
These are the arrays that will be passed to the vbar() plotting
function.
Part III: Creating a Line Chart for Time Data
The kobe_braynt_points.csv dataset contains the
points scored by player Kobe Bryant over the different years/seasons.
Write a script that plots the points over the years. Use a line chart.
1. Start with the code for single line
glyphs/plots: https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html
2. Modify this code by adding code that reads the years and points from the dataset
file. Make sure to use the comma "," as the separator in the split function.
1. You should read the years and points into two arrays to represent
xlabels & y. That is, the x-axis labels and y values.
2. Make sure to show the seasons as xlabels on the x-axis.
1. One way to do it is by creating a third array to hold x
values. The x values should be 0, 1, 2, ... with one value for
each year in the file.
2. The three arrays xlabels, x & y will be populated by the
for loop that reads the data from the file.
3. Change the figure creation line from
p = figure(plot_width=400, plot_height=400)
to
p = figure(plot_width=400, plot_height=400, x_range=xlabels)
This way you are passing the xlabels to be shown on the x-axis.
(In Bokeh 3.0 and later, plot_width and plot_height are named width and height.)
4. Since the x-labels are too long, it is better to write them vertically rather than
horizontally. You can do so by adding this line of code after creating the
figure:
p.xaxis.major_label_orientation = pi/2
This requires importing pi at the top of the script: from math import pi
5. Finally, pass the x & y arrays to the p.line() function.
6. What does the curve look like? What can you say about the point scoring of
Kobe Bryant from the plot?
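The steps above can be sketched roughly as follows. This assumes each line of the file has the form season,points and that there may be a header row; both are assumptions about kobe_braynt_points.csv, so check them against the actual file. The sample rows are hypothetical placeholders, not real scores. Note one substitution: the sketch passes the season labels themselves to p.line() rather than the numeric x list, because with a categorical x_range Bokeh positions glyphs by label. The x list is still built to mirror the steps.

```python
from math import pi


def read_points(lines):
    """Read 'season,points' lines into xlabels, x and y lists."""
    xlabels, x, y = [], [], []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            points = float(fields[1])
        except ValueError:
            continue  # skip a header row, if the file has one
        xlabels.append(fields[0].strip())
        x.append(len(x))  # 0, 1, 2, ... one position per season
        y.append(points)
    return xlabels, x, y


def make_line_chart(xlabels, y):
    """Build a line chart with the seasons as vertical x-axis labels."""
    # Imported here so read_points can be used without Bokeh installed.
    from bokeh.plotting import figure

    # width/height are the Bokeh 3.x names for plot_width/plot_height.
    p = figure(width=400, height=400, x_range=xlabels)
    p.xaxis.major_label_orientation = pi / 2
    p.line(xlabels, y, line_width=2)
    return p


# Hypothetical rows for illustration only, not real scores.
sample = ["season,points", "S1,10", "S2,25.5", "", "S3,18"]
xlabels, x, y = read_points(sample)
print(xlabels, x, y)
```

To display the chart, import show from bokeh.plotting and call show(make_line_chart(xlabels, y)). With the real file, pass the open file object to read_points instead of sample.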
Part IV: Creating a Scatter Plot
Repeat the exercise in Part III using a scatter plot instead of a line chart.
The only difference is that instead of calling the p.line() function, you will call the
p.circle() function. For this function, you should specify the size of the circles, color and
opacity (alpha). Use the scatter markers code example at:
https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html
• Try using the p.square() function instead of p.circle(). How does that change
the visualization?
• In your opinion, is it the line chart or the scatter plot that better represents the
data?
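One possible shape for the scatter version is sketched below. It uses p.scatter() with a marker argument in place of p.circle() and p.square(); that is a substitution made here, since newer Bokeh releases appear to favor scatter() over the per-marker methods, and switching between circles and squares then only changes one string. The helper names, the style values and the sample data are all hypothetical.

```python
from math import pi


def scatter_style(marker="circle"):
    """Return the keyword arguments shared by the circle and square variants."""
    if marker not in ("circle", "square"):
        raise ValueError(f"unsupported marker: {marker}")
    # Size, color and opacity (alpha) are the settings the exercise asks for.
    return {"marker": marker, "size": 10, "color": "navy", "alpha": 0.5}


def make_scatter(xlabels, y, marker="circle"):
    """Build a scatter plot with the seasons as vertical x-axis labels."""
    # Imported here so scatter_style can be used without Bokeh installed.
    from bokeh.plotting import figure

    # width/height are the Bokeh 3.x names for plot_width/plot_height.
    p = figure(width=400, height=400, x_range=xlabels)
    p.xaxis.major_label_orientation = pi / 2
    p.scatter(xlabels, y, **scatter_style(marker))
    return p


# Hypothetical values for illustration only, not real scores.
xlabels = ["S1", "S2", "S3"]
y = [10.0, 25.5, 18.0]
circle_kwargs = scatter_style("circle")
square_kwargs = scatter_style("square")
print(circle_kwargs)
print(square_kwargs)
```

To compare the two marker shapes, import show from bokeh.plotting and call show(make_scatter(xlabels, y, "circle")) and then show(make_scatter(xlabels, y, "square")).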
Lab Submission:
Insert your code and screenshots of the obtained visualizations into ONE Word file.