Question

Submit a ZIP file that includes a Word document with a cover page listing the names of your team members, followed by each of the steps outlined below, each clearly identified with a title. Also include your data sources in the ZIP file. Comment your steps and work thoroughly. Failure to comply with the submission guidelines will result in penalties.

1. Identify a data source of your choice (see: https://donnees.montreal.ca/) and provide the link to it in your Word document. Describe the data source in your Word document. Verify the data and assess its quality. Identify and perform any necessary data preprocessing. (20 points)

2. Add your data source to HDFS in your Hadoop environment. Include your steps in your Word document. (20 points)

3. Identify a first processing task for this data source. Create and test your MapReduce code in your Hadoop environment. Use comments to clearly identify each step of your MapReduce code. Describe your processing task in one to two sentences and include the description in your Word document. (30 points)

4. Identify a second processing task (different from the first processing task in step 3) for this data source. Create and test your Spark SQL code in your Hadoop environment. The use of temporary tables is not allowed in your project. Use comments to clearly identify each step of your Spark SQL code. Describe your processing task in one to two sentences and include the description in your Word document. (30 points)

You will be evaluated on the consistency of your processing tasks, on the completeness and level of detail of your Word document relative to the specifications, and on the optimality and quality of your code. To keep your work consistent, draw inspiration from the exercises done in class, but do not simply replicate the examples covered in them.
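As an illustration of the kind of processing task step 3 asks for, here is a minimal sketch of a "count records per value of one column" job, written as plain Python functions that mimic the map, shuffle, and reduce phases so it can be tested locally before moving to the cluster. The assignment does not say which language or API the MapReduce code must use, so treat this as one possible shape rather than the expected solution. The column name `category` and the sample rows are hypothetical, not taken from any real dataset. In an actual Hadoop Streaming job, the mapper and reducer would read lines from standard input and write tab-separated key/value lines to standard output instead of yielding tuples.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical column name; replace it with a real column from your dataset.
KEY_COLUMN = "category"


def mapper(lines, key_column=KEY_COLUMN):
    """Map step: emit (key, 1) for every CSV record with a non-empty key."""
    reader = csv.DictReader(lines)
    for row in reader:
        key = (row.get(key_column) or "").strip()
        if key:  # skip records whose key is missing (basic data cleaning)
            yield key, 1


def shuffle(pairs):
    """Shuffle step: sort by key, as Hadoop does between map and reduce."""
    return sorted(pairs, key=itemgetter(0))


def reducer(sorted_pairs):
    """Reduce step: sum the counts for each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_job(lines, key_column=KEY_COLUMN):
    """Run map -> shuffle -> reduce locally and return {key: count}."""
    return dict(reducer(shuffle(mapper(lines, key_column))))


if __name__ == "__main__":
    # Made-up sample data, only to exercise the three steps.
    sample = io.StringIO(
        "id,category\n"
        "1,parks\n"
        "2,roads\n"
        "3,parks\n"
        "4,\n"
    )
    for key, total in sorted(run_job(sample).items()):
        print(f"{key}\t{total}")
```

Keeping the three phases as separate functions makes each one easy to comment and to describe in the Word document, and the record with the empty key shows where a preprocessing decision from step 1 (dropping incomplete rows) would take effect.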
