containing the names of your team members and each of the steps
outlined below, clearly identified with a title. Also, include your data
sources in the Zip file for submission.
Please provide thorough comments on your steps and work.
Failure to comply with the submission guidelines will result in penalties.
1. Identify a data source of your choice (see https://donnees.montreal.ca/) and
provide the link to it in your Word document.
Describe your data source in your Word document. Verify the data and assess
its quality, then identify and perform any necessary preprocessing. (20 points)
2. Add your data source to HDFS in your Hadoop environment. Include your steps in
your Word document. (20 points)
3. Identify a first processing task for this data source. Create and test your
MapReduce code in your Hadoop environment. Use comments to clearly identify
each step of your MapReduce code. Describe your processing task in one to two
sentences and include it in your Word document. (30 points)
4. Identify a second processing task (different from the first processing task in step
3) for this data source. Create and test your Spark SQL code in your Hadoop
environment. The use of temporary tables is not allowed in your project. Use
comments to clearly identify each step of your Spark SQL code. Describe your
processing task in one to two sentences and include it in your Word document.
(30 points)
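As an illustration of the kind of processing task step 3 asks for, the sketch below counts records per category using a map step and a reduce step, written so that it can be tested locally before being run on Hadoop. It is only a sketch under stated assumptions: it assumes a Python (Hadoop Streaming style) job rather than the language used in class, the column name `arrondissement` is hypothetical, and the sample rows are invented for the example. The helper names `mapper`, `reducer`, and `run_local` are also illustrative.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical column name; replace it with a real column from your dataset.
KEY_COLUMN = "arrondissement"


def mapper(lines, key_column=KEY_COLUMN):
    """Map step: emit (key, 1) for every CSV row that has a non-empty key."""
    reader = csv.DictReader(lines)
    for row in reader:
        key = (row.get(key_column) or "").strip()
        if key:  # basic quality check: skip rows whose key is missing
            yield key, 1


def reducer(pairs):
    """Reduce step: sum the counts per key (pairs must be sorted by key)."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    mapped = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(mapped))


if __name__ == "__main__":
    # Invented sample rows, used only to exercise the code locally.
    sample = io.StringIO(
        "id,arrondissement\n"
        "1,Verdun\n"
        "2,Outremont\n"
        "3,Verdun\n"
        "4,\n"
    )
    print(run_local(sample))
```

Keeping the map and reduce logic in separate, commented functions mirrors the requirement to identify each step of the MapReduce code, and the local run gives a quick check of the logic on a small extract of the data before submitting the job to the cluster.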
You will be evaluated on the consistency of your processing tasks, on how
completely and in how much detail your Word document meets the
specifications, and on the optimality and quality of your code. To
produce consistent work, draw inspiration from the practice sessions
done in class, but do not simply replicate the examples covered in
those sessions.