CIS 6930, Spring 2024 Assignment 2
Augmenting Data
Introduction
In assignment zero, you wrote code to extract records from a public police department
website. Each PDF allows people to view incidents. The records our code extracted are in a
structured format and are helpful for analysis. For further downstream purposes, we need to
perform data augmentation on the extracted records. While performing augmentation, we will
need to keep fairness and bias issues in mind.
Task Overview
In this assignment, we will perform a subsequent task for the data pipeline. Using the
submission from assignment 0, you will take records from several PDF files and augment the
data. You will also create a datasheet for the dataset you are creating. Review the
discussion from the literature to guide your creation of the datasheet.
Your code should be executable via the command line. The main entry point of the code should
be in a file named assignment2.py. It should take one parameter, --urls <filename>, which
points to a file with a list of incident files. Each line contains only a URL and no other
information:
pipenv run python assignment2.py --urls files.csv
Your code should read each URL listed in the file passed in and extract the records from the
corresponding PDF. Then you are going to perform data augmentation to make the data easier to
pass on to another process in the pipeline. The output, tab-separated content, should be
printed to stdout.
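A minimal sketch of the command-line entry point described above, assuming argparse from the standard library. The names read_urls, parse_args, and main are hypothetical, and the download, extraction, and augmentation steps are left as a placeholder.

```python
import argparse


def read_urls(path):
    """Return the non-empty lines of the --urls file, in listed order."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def parse_args(argv=None):
    """Parse the single required --urls <filename> parameter."""
    parser = argparse.ArgumentParser(description="Augment incident records.")
    parser.add_argument("--urls", required=True,
                        help="file with one incident PDF URL per line")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for url in read_urls(args.urls):
        # Placeholder: download the PDF at `url`, extract its records,
        # augment them, and print one tab-separated row per record.
        pass
```

In assignment2.py, main() would be called under the usual `if __name__ == "__main__":` guard so that the pipenv command above runs it.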
Below we describe the output format.
Data Augmentation
For each file, you are to produce the following tab-separated rows. Files should be
processed in the order they are listed in the --urls file that is passed in, and each record
should be ordered by its appearance in the corresponding PDF.
Column             Type
Day of the Week    integer
Time of Day        integer
Weather            integer
Location Rank      integer
Side of Town       string
Incident Rank      integer
Nature             string
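As an illustration of the row layout, the following sketch joins one record's fields with tabs. The function name format_row and the sample values are hypothetical and not taken from real data.

```python
def format_row(day, hour, weather, location_rank, side, incident_rank, nature):
    """Join one augmented record into a single tab-separated line."""
    fields = (day, hour, weather, location_rank, side, incident_rank, nature)
    return "\t".join(str(field) for field in fields)


# Made-up values, only to show the column order.
print(format_row(5, 0, 3, 1, "NW", 2, "Traffic Stop"))
```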
Day of the Week
The day of the week is a numeric value in the range 1-7, where 1 corresponds to Sunday and 7
corresponds to Saturday.
Time of Day
The time of day is a numeric code from 0 to 24 describing the hour of the incident.
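Both the day of the week and the time of day can be derived from the incident timestamp. The sketch below assumes the timestamp has already been parsed into a datetime; the function names and the sample timestamp format are assumptions, so adjust them to match your extracted records.

```python
from datetime import datetime


def day_of_week(when):
    """Map a datetime to 1 (Sunday) through 7 (Saturday)."""
    # datetime.weekday() is 0 for Monday and 6 for Sunday, so shift by one.
    return (when.weekday() + 1) % 7 + 1


def time_of_day(when):
    """Return the hour of the incident (datetime.hour yields 0 through 23)."""
    return when.hour


# Assumed timestamp format, for illustration only.
stamp = datetime.strptime("2/4/2024 13:05", "%m/%d/%Y %H:%M")
print(day_of_week(stamp), time_of_day(stamp))
```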
Weather
Determine the weather at the time and location of the incident. The weather is represented by
the WMO code, an integer that identifies a weather condition.
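The assignment does not name a weather source. One possible approach, sketched below, assumes the Open-Meteo historical archive API and its hourly weather_code variable; the endpoint choice, the function names, and the response shape used here are all assumptions to verify against that service's documentation. The sketch only builds the request URL and reads a code out of an already decoded JSON response, so it makes no network call.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed endpoint: Open-Meteo's historical archive API (not named in the assignment).
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_weather_url(lat, lon, when):
    """Build a request URL for the hourly WMO weather codes on the incident date."""
    day = when.strftime("%Y-%m-%d")
    query = urlencode({
        "latitude": lat,
        "longitude": lon,
        "start_date": day,
        "end_date": day,
        "hourly": "weather_code",
    })
    return f"{ARCHIVE_URL}?{query}"


def weather_code_at(payload, when):
    """Pick the WMO code for the incident hour from a decoded JSON response."""
    # Assumed response shape: {"hourly": {"time": [...], "weather_code": [...]}}
    hourly = payload["hourly"]
    key = when.strftime("%Y-%m-%dT%H:00")
    index = hourly["time"].index(key)
    return hourly["weather_code"][index]
```

The timestamps in the response may be in a different time zone than the incident records, so check how the service handles time zones before matching hours.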
Location Rank
Sort all listed locations by frequency. Give each location an integer rank based on how often
it appears, with ties preserved. For instance, if there is a three-way tie for the most
popular location, each location will be ranked 1; the next most popular location should be
ranked 4. You can use the exact text of the location.
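The tie-preserving ranking described above can be sketched with collections.Counter. The function name frequency_ranks is hypothetical; the same approach applies to any list of strings.

```python
from collections import Counter


def frequency_ranks(values):
    """Map each distinct value to its frequency rank, with ties sharing a rank."""
    counts = Counter(values)
    ranks = {}
    previous_count = None
    current_rank = 0
    # most_common() yields (value, count) pairs from most to least frequent.
    for position, (value, count) in enumerate(counts.most_common(), start=1):
        if count != previous_count:
            # A new, lower count starts a new rank equal to its position.
            current_rank = position
            previous_count = count
        ranks[value] = current_rank
    return ranks


# Three-way tie for first place, so the next location is ranked 4.
print(frequency_ranks(["A", "B", "C", "A", "B", "C", "D"]))
```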
Side of Town
The side of town is one of eight values {N, S, E, W, NW, NE, SW, SE}. Side of town is
determined by the approximate direction of the location from the center of town,
35.220833, -97.443611. You can use the geopy library for assistance.
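One way to turn a location into a side of town is to compute the compass bearing from the center of town and bucket it into eight 45-degree sectors. The sketch below assumes the location has already been geocoded to latitude and longitude (for example with geopy, which is not shown here); the function names are hypothetical.

```python
import math

CENTER = (35.220833, -97.443611)  # center of town from the assignment
SIDES = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def bearing_degrees(origin, point):
    """Initial compass bearing from origin to point, in degrees clockwise from north."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, point)
    delta_lon = lon2 - lon1
    x = math.sin(delta_lon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(delta_lon))
    return math.degrees(math.atan2(x, y)) % 360


def side_of_town(lat, lon, center=CENTER):
    """Bucket the bearing from the center into one of eight sides."""
    bearing = bearing_degrees(center, (lat, lon))
    # Each sector spans 45 degrees and is centered on its compass direction.
    return SIDES[int((bearing + 22.5) // 45) % 8]
```

How to label a location that sits exactly at the center is left open by the assignment; this sketch would report N in that case, which is an assumption worth noting in your README.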
Incident Rank
Sort all of the Natures. Give an integer ranking of the frequency of natures with ties preserved.
For instance, if there is a three-way tie for the most popular incident, each incident will be
ranked 1; the next most popular nature should be ranked 4.
Nature
The Nature is the direct text of the Nature from the source record.
EMSSTAT
This is a boolean value that is True in two cases: first, if the Incident ORI was EMSSTAT;
second, if one of the next one or two records contains an EMSSTAT at the same time and
location.
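The two EMSSTAT cases can be sketched as a look-ahead over the ordered records. The record layout here (dictionaries with time, location, and ori keys) and the function name are assumptions for illustration; use whatever structure your assignment 0 code produces.

```python
def emsstat_flags(records):
    """Return one boolean per record, following the two EMSSTAT cases."""
    flags = []
    for i, record in enumerate(records):
        # Case 1: the record's own Incident ORI is EMSSTAT.
        flag = record["ori"] == "EMSSTAT"
        if not flag:
            # Case 2: one of the next two records is an EMSSTAT entry
            # at the same time and location.
            for later in records[i + 1:i + 3]:
                if (later["ori"] == "EMSSTAT"
                        and later["time"] == record["time"]
                        and later["location"] == record["location"]):
                    flag = True
                    break
        flags.append(flag)
    return flags
```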
Submission
DATASHEET.md
Use the template from the Datasheets for Datasets paper, or from a more recent source, to
create the datasheet for this dataset. Your answers should be completed to the best of your
ability. Be sure you work on this portion individually because we will examine submissions for
academic dishonesty. We understand that not all questions can be answered, but you should
still fill out each question as much as possible.
README.md
The README file name should be uppercase with an .md extension. It should include your
name, directions on how to install and use the code, an example of how to run it, and a list
of any web or external resources or people you used for help. The README file should also
describe any known bugs and any assumptions made while writing the program. Note that you
should not be copying code from any website not provided by the instructor.
COLLABORATORS file
This file should contain a comma-separated list describing who you worked with and a small
text description of the nature of the collaboration. This information should be listed in
three fields, as in the example below:
Katherine Johnson, kj@nasa.gov, Helped me understand calculations
Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management
Assignment Descriptions
Your code structure should be in a directory with something similar to the following format:
cis6930sp24-assignment2/
|-- COLLABORATORS
|-- DATASHEET
|-- LICENSE
|-- README
|-- Pipfile
|-- src/
|-- docs/
|-- assignment2.py
|-- setup.cfg
|-- setup.py
`-- tests/
    |-- test_time.py
    |-- test_geo.py
    `-- test_nature.py
setup.py
from setuptools import setup, find_packages

setup(
    name='assignment2',
    version='1.0',
    author='Your Name',
    author_email='your ufl email',
    packages=find_packages(exclude=('tests', 'docs')),
    setup_requires=['pytest-runner'],
    tests_require=['pytest']
)
Note, the setup.cfg file should have at least the following text inside:
[aliases]
test=pytest
[tool:pytest]
norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv
Grading
Grades will be assessed according to the following distribution:
60%: Correctness.
- This will be assessed by giving your code a range of inputs and checking the output.
- Use the creation of tests to prove correctness.
20%: Datasheet.
- The datasheet will be assessed on appropriateness and completeness.
20%: Documentation.
- Your README file should fully explain your process for developing your code.
  Cover the points below in the README file.
  2.1 Running instructions
  2.2 Bugs & Assumptions
  2.3 Function Description - Fetch/Download
  2.4 Function Description - Parse/Extract
  2.5 Function Description - Create
  2.6 Function Description - Populate/Insert
  2.7 Function Description - Status/Print
  2.8 Test Function Descriptions
- All other commands should be well-documented.
Note that we will be running your code in batch, so it is important that you follow
directions closely.