Question

CIS 6930, Spring 2024
Assignment 2: Augmenting Data

Introduction

In assignment zero, you wrote code to extract records from a public police department website. Each PDF allows people to view incidents. The records that code extracts are in a structured format that is helpful for analysis. For downstream purposes, we need to perform data augmentation on the extracted records. To perform augmentation, we will need to keep fairness and bias issues in mind.

Task Overview

In this assignment, we will perform a subsequent task for the data pipeline. Using your submission from assignment 0, you will take records from several PDF files and augment the data. You will also create a datasheet for the dataset you are creating. Review the discussion from the literature to guide your creation of the datasheet.

Your code should be executable via the command line. The main entry point of the code should be in a file named assignment2.py. It should take one parameter, --urls <filename>, which points to a file with a list of incident URLs. Each line contains only a URL and no other information.

    pipenv run python assignment2.py --urls files.csv

Your code should read each URL listed in the file that is passed in and extract the records from the corresponding PDF. Then you are going to perform data augmentation to increase the ability of the data to be passed on to another process in the pipeline. The output, tab-separated content, should be printed to stdout. Below we describe the output format.

Data Augmentation

Given each file, you are to produce tab-separated rows in the format described in this section. Each file should be processed in the order it is listed in the --urls file that is passed, and each record should be ordered by its appearance in the corresponding PDF.
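Before the column definitions, here is a minimal sketch of how the command-line entry point described above might be wired up. It only covers argument parsing, reading the URL list in order, and tab-separated output. The helper names (parse_args, read_urls, format_row, augment_url, main) are hypothetical and not part of the assignment, and augment_url is a placeholder for the fetch, parse, and augment steps carried over from assignment 0.

```python
import argparse


def parse_args(argv=None):
    """Parse the single required --urls parameter."""
    parser = argparse.ArgumentParser(description="Augment incident records.")
    parser.add_argument("--urls", required=True,
                        help="file containing one incident PDF URL per line")
    return parser.parse_args(argv)


def read_urls(path):
    """Return the URLs in file order, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def format_row(fields):
    """Join one record's fields with tabs for printing to stdout."""
    return "\t".join(str(field) for field in fields)


def augment_url(url):
    """Hypothetical placeholder: download the PDF at `url`, extract its
    records, and return one list of output fields per record, in PDF order."""
    return []


def main(argv=None):
    args = parse_args(argv)
    # Files are processed in the order they are listed in the --urls file.
    for url in read_urls(args.urls):
        for fields in augment_url(url):
            print(format_row(fields))
```

In assignment2.py, main() would typically be called under the usual `if __name__ == "__main__":` guard so that the tests can import the helpers without running the pipeline.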
Each output row has the following columns:

Day of the Week | Time of Day | Weather | Location Rank | Side of Town | Incident Rank | Nature
integer | integer | integer | integer | string | integer | string

Day of the Week
The day of the week is a numeric value in the range 1-7, where 1 corresponds to Sunday and 7 corresponds to Saturday.

Time of Day
The time of day is a numeric code from 0 to 24 describing the hour of the incident.

Weather
Determine the weather at the time and location of the incident. The weather is given by the WMO code, an integer that represents a weather condition.

Location Rank
Sort all listed locations. Give an integer ranking of the frequency of locations with ties preserved. For instance, if there is a three-way tie for the most popular location, each location will be ranked 1; the next most popular location should be ranked 4. You can use the exact text of the location.

Side of Town
The side of town is one of eight items: {N, S, E, W, NW, NE, SW, SE}. Side of town is determined by the approximate orientation of the location relative to the center of town, 35.220833, -97.443611. You can use the geopy library for assistance.

Incident Rank
Sort all of the natures. Give an integer ranking of the frequency of natures with ties preserved. For instance, if there is a three-way tie for the most popular incident, each incident will be ranked 1; the next most popular nature should be ranked 4.

Nature
The nature is the direct text of the Nature field from the source record.

EMSSTAT
This is a boolean value that is True in two cases: first, if the Incident ORI was EMSSTAT; second, if the subsequent record or two contain an EMSSTAT at the same time and location.

Submission

DATASHEET.md
Use the template from the Datasheets for Datasets paper, or from a more recent source, to create the datasheet for this dataset. Your answers should be completed to the best of your ability.
Be sure you work on this portion individually because we will examine submissions for academic dishonesty. We understand that not all answers are possible, but you should still fill out each question as much as possible.

README.md
The README file name should be uppercase with an .md extension. You should write your name in it, an example of how to run it, and a list of any web or external resources that you used for help. The README file should also contain a list of any bugs or assumptions made while writing the program. Note that you should not be copying code from any website not provided by the instructor. You should include directions on how to install and use the code. You should describe any known bugs and cite any sources or people you used for help. Be sure to include any assumptions you make for your solution.

COLLABORATORS file
This file should contain a comma-separated list describing who you worked with and a small text description describing the nature of the collaboration. This information should be listed in three fields, as in the example below:

    Katherine Johnson, kj@nasa.gov, Helped me understand calculations
    Dorothy Vaughan, doro@dod.gov, Helped me with multiplexed time management

Assignment Descriptions
Your code structure should be in a directory with something similar to the following format:

    cis6930sp24-assignment2/
    |-- COLLABORATORS
    |-- DATASHEET
    |-- LICENSE
    |-- README
    |-- Pipfile
    |-- docs/
    |-- assignment2.py
    |-- setup.cfg
    |-- setup.py
    `-- tests/
        |-- test_time.py
        |-- test_geo.py
        `-- test_nature.py

setup.py

    from setuptools import setup, find_packages

    setup(
        name='assignment2',
        version='1.0',
        author='Your Name',
        author_email='your ufl email',
        packages=find_packages(exclude=('tests', 'docs')),
        setup_requires=['pytest-runner'],
        tests_require=['pytest']
    )

Note, the setup.cfg file should have at least the following text inside:

    [aliases]
    test=pytest

    [tool:pytest]
    norecursedirs = .*, CVS, _darcs, {arch}, *.egg, venv

Grading
Grades will be assessed according to the following distribution:

60%: Correctness.
- This will be assessed by giving your code a range of inputs and checking the output. Write tests to demonstrate correctness.

20%: Datasheet.
- The datasheet will be assessed on appropriateness and completeness.

20%: Documentation.
- Your README file should fully explain your process for developing your code. Explain the following points in the README file:
    2.1 Running instructions
    2.2 Bugs & Assumptions
    2.3 Function Description - Fetch/Download
    2.4 Function Description - Parse/Extract
    2.5 Function Description - Create
    2.6 Function Description - Populate/Insert
    2.7 Function Description - Status/Print
    2.8 Test Function Descriptions
- All other commands should be well-documented.

Note that we will be running your code in batch, so it is important that you follow directions closely.
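As a rough illustration of three of the Data Augmentation fields defined earlier (Day of the Week, the frequency ranks with ties preserved, and Side of Town), the sketch below shows one way they could be computed. All function names are hypothetical. It assumes each record already has a parsed datetime and geocoded latitude and longitude, and it computes the compass bearing with plain trigonometry instead of geopy, which is a swapped-in approach and not a requirement of the assignment.

```python
import math
from collections import Counter

# Center of town, taken from the assignment text.
TOWN_CENTER = (35.220833, -97.443611)


def day_of_week(dt):
    """Map a datetime to 1-7, where 1 is Sunday and 7 is Saturday."""
    # datetime.weekday() returns Monday=0 ... Sunday=6, so shift by one day.
    return (dt.weekday() + 1) % 7 + 1


def rank_by_frequency(values):
    """Rank values by descending frequency, preserving ties.

    Tied values share a rank and the next rank skips ahead, so a
    three-way tie for first place is followed by rank 4.
    """
    ranks = {}
    previous_count = None
    current_rank = 0
    ordered = Counter(values).most_common()  # sorted by count, descending
    for position, (value, count) in enumerate(ordered, start=1):
        if count != previous_count:
            current_rank = position
            previous_count = count
        ranks[value] = current_rank
    return ranks


def side_of_town(lat, lon, center=TOWN_CENTER):
    """Return one of N, NE, E, SE, S, SW, W, NW for a point relative to center."""
    lat1, lon1 = math.radians(center[0]), math.radians(center[1])
    lat2, lon2 = math.radians(lat), math.radians(lon)
    dlon = lon2 - lon1
    # Initial great-circle bearing from the center to the point.
    x = math.sin(dlon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    bearing = (math.degrees(math.atan2(x, y)) + 360) % 360
    # Split the compass into eight 45-degree sectors centered on each direction.
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return sectors[int((bearing + 22.5) // 45) % 8]
```

Calling rank_by_frequency once over all locations, and once over all natures, gives lookup tables for the Location Rank and Incident Rank columns. A point exactly at the center has no defined direction; this sketch reports it as N, which is an assumption worth listing in the README.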