Question

Part I: Downloading the US Census Income Dataset

We will use one of the datasets of the UCI data repository. It contains many datasets

that are used by researchers and data scientists to evaluate new data mining and

machine learning methods.

1. Go to the US Census Income dataset page: https://archive.ics.uci.edu/ml/datasets/Census+Income

2. Click on the Data Folder

3. Download the adult.data, adult.names and adult.test files.

4. adult.names is a text file. It can be opened using Sublime or any text editor.

5. adult.data and adult.test are CSV files. They can be opened in Sublime, Excel, or any text editor.

6. The adult.names file contains a description of the dataset and its attributes. Each row of the data files contains information for one person. Each column is an attribute of a person such as age, work class, education, etc.

Each line in this dataset represents a person. Each column represents one feature for a

person such as age, education, marital status, etc.
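To make the row/column layout concrete, here is a minimal sketch that splits one line of the data into named attributes. The column names follow the attribute list in adult.names as best understood here, so check them against your copy; "income" is a name chosen in this sketch for the final label column, and parse_row is a made-up helper. The sample line is only an illustration of the file's format.

```python
# Attribute names in file order (assumed from adult.names); "income" is a
# name chosen in this sketch for the last column, which holds the label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def parse_row(line):
    """Split one comma-separated line into a dict of attribute -> string."""
    values = [field.strip() for field in line.strip().split(",")]
    return dict(zip(COLUMNS, values))


# One line in the format used by adult.data (values separated by ", ").
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
person = parse_row(sample)
print(person["age"], person["education"], person["income"])
# prints: 39 Bachelors <=50K
```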

Part II: Creating a Bar Chart

The last column in the dataset holds the income. It is a categorical label that takes the two values "<=50K" and ">50K". These indicate whether the person's annual income is at most $50K or greater than $50K, respectively.

Write a Python script that uses the Bokeh library to draw a bar chart showing the number of people with income greater than 50K and the number of people with income less than or equal to 50K.
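A minimal sketch of such a script, following the file-handling route described in the steps of this part. It assumes adult.data is in the working directory and that the income label is the last comma-separated field on each line. The helper names count_income_labels, counts_to_arrays and plot_income_bars are made up for illustration, and the figure() call assumes a recent Bokeh release that accepts height (older releases use plot_height).

```python
def count_income_labels(lines):
    """Count how many lines carry each income label (the last column)."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label = line.split(",")[-1].strip()
        counts[label] = counts.get(label, 0) + 1
    return counts


def counts_to_arrays(counts):
    """Copy the dictionary's keys and values into two parallel lists."""
    x, y = [], []
    for label, count in counts.items():
        x.append(label)
        y.append(count)
    return x, y


def plot_income_bars(x, y):
    """Draw the bar chart; x holds the labels, y the counts."""
    # Imported here so the counting helpers can be tried without Bokeh.
    from bokeh.plotting import figure, show

    p = figure(x_range=x, height=350, title="Income counts")
    p.vbar(x=x, top=y, width=0.9)
    p.y_range.start = 0
    show(p)


# Typical use (needs adult.data and Bokeh):
# with open("adult.data") as f:
#     x, y = counts_to_arrays(count_income_labels(f))
# plot_income_bars(x, y)
```

A pandas version would replace the two helper functions with a value count on the last column; the plotting call stays the same.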

1. Start by using the basic bar chart code from the Bokeh documentation: https://docs.bokeh.org/en/latest/docs/gallery/bar_basic.html

2. Write code to create the x and y arrays that will be passed to the vbar()

function. (P.S. If you are going to use file handling, you can follow the steps

below. However, you are free to use pandas dataframes or any other

methods/libraries in your script.)

1. First, create a dictionary and use it to count the number of

instances for each income label.

2. Create a for-loop that scans the dataset file, extracts the last

column, and updates the dictionary accordingly.

3. Once the for-loop is done, write another for loop that copies the

keys and values of the dictionary into two arrays, call them x & y.

These are the arrays that will be passed to the vbar() plotting

function.

Part III: Creating a Line Chart for Time Data

The kobe_braynt_points.csv dataset contains the points scored by player Kobe Bryant over the different years/seasons.

Write a script that plots the points over the years. Use a line chart.

1. Start with the code for single line glyphs/plots: https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html

2. Modify this by adding code for reading the years and points from the dataset file. Make sure to use the comma "," as the separator in the split function.

1. You should read the years and points into two arrays to represent

xlabels & y. That is, the x-axis labels and y values.

2. Make sure to show the seasons as xlabels on the x-axis.

1. One way to do it is by creating a third array to hold x

values. The x values should be 0,1,2,... and go as many as

the number of years.

2. The three arrays xlabels, x & y will be populated by the for loop that reads the data from the file.

3. Change the figure creation line from

p = figure(plot_width=400, plot_height=400)

to

p = figure(plot_width=400, plot_height=400, x_range=xlabels)

This way you are passing the xlabels to be shown on the x-axis.

4. Since the x-labels are long, it is better to write them vertically rather than horizontally. You can do so by adding this line of code after creating the figure (pi comes from Python's math module, so put "from math import pi" at the top of the script):

p.xaxis.major_label_orientation = pi/2

5. Finally, pass the x & y arrays to the p.line() function.

6. What does the curve look like? What can you say about the point scoring of Kobe Bryant from the plot?
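Steps 1 to 5 of this part can be sketched as follows. This is only one possible shape: it assumes each line of kobe_braynt_points.csv looks like "season,points" and that the first line is a header, neither of which is stated in the handout, so adjust skip_header and the field positions to match your file. The function names read_season_points and plot_points_line are made up here, and figure() is called with width and height on the assumption of a recent Bokeh release (the handout's plot_width and plot_height are the older spelling).

```python
from math import pi


def read_season_points(lines, skip_header=True):
    """Return (xlabels, x, y) from lines of the assumed form 'season,points'."""
    xlabels, x, y = [], [], []
    for line_number, line in enumerate(lines):
        if skip_header and line_number == 0:
            continue  # assumed header row
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        xlabels.append(fields[0].strip())  # season text shown on the x-axis
        x.append(len(x))                   # 0, 1, 2, ... one per season
        y.append(float(fields[1]))         # points scored that season
    return xlabels, x, y


def plot_points_line(xlabels, x, y):
    """Draw the line chart with the seasons as vertical x-axis labels."""
    # Imported here so the file-reading helper can be tried without Bokeh.
    from bokeh.plotting import figure, show

    p = figure(width=400, height=400, x_range=xlabels)
    p.xaxis.major_label_orientation = pi / 2
    p.line(x, y, line_width=2)
    show(p)


# Typical use (needs the CSV file and Bokeh):
# with open("kobe_braynt_points.csv") as f:
#     xlabels, x, y = read_season_points(f)
# plot_points_line(xlabels, x, y)
```

If the line looks shifted relative to the season labels, passing xlabels instead of x as the first argument of p.line() is worth trying.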

Part IV: Creating a Scatter Plot

Repeat the exercise in part III using a scatter plot instead of a line chart.

The only difference is that instead of calling the p.line() function, you will call the

p.circle() function. For this function, you should specify the size of the circles, color and

opacity (alpha). Use the scatter markers code example at:

https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html

• Try using the p.square() function instead of p.circle(). How does that change

the visualization?

• In your opinion, is it the line chart or the scatter plot that better represents the data?
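For the scatter plot in Part IV, one way to try both marker shapes with the same code is sketched below. It reuses the xlabels, x and y arrays from Part III. Note the substitution: the sketch calls p.scatter() with a marker argument in place of p.circle() and p.square(), so that one function covers both shapes; that is a choice made here, not part of the handout. The names marker_options and plot_points_scatter, and the default size, color and alpha values, are also made up for illustration.

```python
from math import pi

SUPPORTED_MARKERS = ("circle", "square")


def marker_options(marker="circle", size=10, color="navy", alpha=0.5):
    """Collect and sanity-check the glyph settings for the scatter plot."""
    if marker not in SUPPORTED_MARKERS:
        raise ValueError("marker must be one of %s" % (SUPPORTED_MARKERS,))
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha (opacity) must be between 0 and 1")
    return {"marker": marker, "size": size, "color": color, "alpha": alpha}


def plot_points_scatter(xlabels, x, y, **options):
    """Same figure setup as the line chart, with markers instead of a line."""
    # Imported here so marker_options can be tried without Bokeh installed.
    from bokeh.plotting import figure, show

    p = figure(width=400, height=400, x_range=xlabels)
    p.xaxis.major_label_orientation = pi / 2
    p.scatter(x, y, **marker_options(**options))
    show(p)


# Typical use, with xlabels, x and y read as in Part III:
# plot_points_scatter(xlabels, x, y)                   # circles
# plot_points_scatter(xlabels, x, y, marker="square")  # squares
```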

Lab Submission:

Insert your code and screenshots of the obtained visualizations into ONE Word file.
