[SOLVED] CSE511 Project 2: Hot Spot Analysis

75.00 $

Category: Tags: , , , ,

Description

5/5 - (1 vote)

Purpose

In this project, you will perform numerous spatial queries on a huge database that contains geographic information and the real-time locations of the customers of a well-known taxi company.

Objectives

Learners will be able to:

  • Set up Apache Spark and use dataframes within it to manage data operations.
  • Write some simple Scala code then identify how to run it.
  • Interact with geospatial data and run queries on it.

Technology Requirements

  • Apache Spark
  • SparkSQL
  • Scala
  • Java
  • Hadoop

Project Description

You will be utilizing Scala and Apache Spark in this project to create a solution that can extract crucial data from the provided dataset, which can then be used to make operational and strategic choices. The technology gives the client access to statistically significant geographic locations, which it can utilize to plan its operations in advance and improve customer service.

Directions

In addition to the directions below, please review the “Introductionvideo located in the “Week 8: Overview” section.

In this project, you are required to do spatial hot spot analysis. In particular, you need to complete two different hot spot analysis tasks.

Tasks:

  1. Hot Zone Analysis

This task will need to perform a range join operation on a rectangle datasets and a point dataset. For each rectangle, the number of points located within the rectangle will be obtained. The hotter rectangle means that it includes more points. So this task is to calculate the hotness of all the rectangles.

Download the required templates (attached in the Coursera project description).

  1. Hot Cell Analysis

This task will focus on applying spatial statistics to spatio-temporal big data in order to identify statistically significant spatial hot spots using Apache Spark. The topic of this task is from ACM SIGSPATIAL GISCUP 2016.

As stated in the Problem Definition page, in this task, you are asked to implement a Spark program to calculate the Getis-Ord statistic of NYC Taxi Trip datasets. We call it “hot cell analysis

To reduce the computation power need, we made the following changes:

  1. The input will be “yellow_trip_sample_100000.csv” dataset.
  2. Each cell unit size is 0.01 * 0.01 in terms of latitude and longitude degrees.
  3. You only need to consider Pick-up Location.
  4. We don’t use Jaccard similarity to check your answer. However, you don’t need to worry about how to decide the cell coordinates because the code template generates cell coordinates. You just need to write the rest of the task.

Coding Template Specification:

Input Parameters:

  1. Output path (Mandatory)
  2. Task name: “hotzoneanalysis” or “hotcellanalysis”
  3. Task parameters: (1) Hot zone (2 parameters): nyc taxi data path, zone path(2) Hot cell (1 parameter): nyc taxi data path
  4. The number/order of tasks do not matter.
  5. But, the first 7 of our final test cases will be hot zone analysis, the last 8 will be hot cell analysis.

Input Data Format:

The main function/entrance is “cse512.Entrance” scala file.

  1. Point data: The input point dataset is the pickup point of New York Taxi trip datasets. The data format of this phase is the original format of NYC taxi trip which is similar from Phase 2. But the coding template already parsed it for you. Find the data in the .zip file (attached in the Coursera project description, titled “…Yellow Trip Sample 100000”).
  2. Zone data (only for hot zone analysis): at “src/resources/zone-hotzone” of the template

Hot Zone Analysis:

The input point data can be any small subset of the NYC taxi dataset.

Hot Cell Analysis:

The input point data is a sample dataset like “yellow_trip_sample_100000.csv.”

Output Data Format:

Hot Zone Analysis:

All zones with their count, sorted by “rectangle” string in an ascending order.

The coordinates of top 50 hottest cells sorted by their G score in a descending order. ●    Note: Do not output G score.

An example input and answer are put in “testcase” folder of the coding template

Where you need to make changes:

Do not delete any existing code in the coding template unless you see this “You need to change this part.

Hot Zone Analysis:

In the code template:

  1. You need to change “scala” and “HotzoneUtils.scala“.
  2. The coding template has loaded the data and written the first step, range join query, for you. Please finish the rest of the task.
  3. The output DataFrame should be sorted by you according to the “rectangle” string.

Hot Cell Analysis:

In the code template,

  1. You need to change “scala” and “HotcellUtils.scala“.
  2. The coding template has loaded the data and decided the cell coordinate, x, y, z and their min and max. Please finish the rest of the task.
  3. The output DataFrame should be sorted by you according to G-score. The coding template will take the first 50 to output. Do not output G-score.

Submission Directions for Project Deliverables

Submission Files:

Submit a .jar file. Please title your file with “Last Name_First Name_Fall B 2022_Project 2 Hot Spot Analysis.”

  1. Go to “Programming Assignment: Project 2: Hot Spot Analysis”.
  2. Click on the “My submissions” tab (located at the top of the page under the deadline date).
  3. Click on “Create submission.”
  4. Upload your project jar package and click “Submit.”
  5. If there is an error in your assignment, you may make corrections and resubmit.
  6. You need to make sure your code can compile and package by entering sbt clean assembly.

We will run the compiled package on our cluster directly using “spark-submit” with parameters. If your code cannot compile and package, you will not receive any points.

**Important: Grading will take about 10~20 minutes.**

Tips (Optional):

This section is the same as that in Phase 2.

How to debug your code in IDE:

If you are using the Scala template,

  1. Use IntelliJ Idea with Scala plug-in or any other Scala IDE.
  2. Replace the logic of User Defined Functions ST_Contains and ST_Within in SpatialQuery.scala.
  3. Append .master(“local[*]”) after .config(“spark.some.config.option”, “some-value”) to tell IDE the master IP is localhost.
  4. In some cases, you may need to go to “build.sbt” file and change % “provided” to % “compile” in order to debug your code in IDE
  5. Run your code in IDE.
  6. You must revert Step 3 and 4 above and recompile your code before use of spark-submit!

How to submit your code to Spark:

If you are using the Scala template

  1. Go to the project root folder.
  2. Run sbt clean assembly. You may need to install sbt in order to run this command.
  3. Find the packaged jar in

“./target/scala-2.11/CSE511-Project-Hotspot-Analysis-Template-assembly-0.1.0.jar”

  1. Submit the jar to Spark using Spark command “./bin/spark-submit”. A pseudo code example:

./bin/spark-submit

~/GitHub/CSE511-Project-Hotspot-Analysis-Template/target/scala-2.11/CSE511-Project-Hotsp ot-Analysis-Template-assembly-0.1.0.jar test/output hotzoneanalysis src/resources/point-hotzone.csv src/resources/zone-hotzone.csv hotcellanalysis src/resources/yellow_trip_sample_100000.csv

Report Submission:

For this project deliverable, you must write a 2-3 page report detailing your work on this project. Your report should include the following elements:

  1. Reflection: How did you approach the project? What did you specifically do?
  2. Lessons Learned: What did you learn by doing this project?
  3. Implementation: You need to include the following parts in your Project 2 report:
    1. What hot zone analysis is doing:
      1. def ST_Contains(queryRectangle: String, pointString: String )
    2. What hot cell analysis is doing:
      1. write the multiple steps required before calculating the z-score.

This is a manually graded task and you will get credit for the report if you covered all the mentioned parts.

Your report should be in pdf. Please title your report file with “Last Name_First Name_Fall B

2022_Project 2 Hot Spot Analysis Report.”

When you are ready to submit the report:

Double check that the content of your report is complete, your file is in pdf and titled with “Last Name_First Name_Fall B 2022_Project 2 Hot Spot Analysis Report.” Then submit your report:

  1. Go to “Graded Assignment: Project 2: Hot Spot Analysis Report Submission”.
  2. Click “Start Submission“.
  3. Click “Upload a File.”
  4. Locate and select your report file.
  5. Run your file through Plagiarism Detection.
  6. Click “Submit.”

Evaluation

There are 15 test cases for a total of 15 points. The first 7 of the final test cases will be hot zone analysis, the last 8 will be hot cell analysis. If the submission fails, you will see the corresponding error logs that indicate where the error occurred.

Grading Rubric for Report

Rubrics communicate specific criteria for evaluation. Prior to starting any graded coursework, learners are expected to read through the rubric, so they know how they will be assessed. You are encouraged to self-assess your responses and make informed revisions before submitting your final report. Engaging in this learning practice will support you in developing your best work. Points may be deducted at the discretion of the faculty for disorganized submissions that convolute the grading process.

Component 0 1 2
Reflection There is no reflection included. The reflection attempts to demonstrate thinking about learning but is vague and/or unclear about the personal learning process. The reflection explains the student’s own thinking and learning processes, as well as implications for future learning.
Analysis There is no analysis included. The reflection attempts to analyze the learning experience but the value of the learning to the student or others is vague and/or unclear. The reflection is an in-depth analysis of the learning experience, the value of the derived learning to self or others, and the enhancement of the student’s appreciation for the discipline.

Learner Checklist

Prior to submitting, read through the Learner Checklist to ensure you are ready to submit your best work.

Did you title your file correctly and convert it into a single .jar file?

Did you include your legal first and last name in the designated area on the report?

Did you title your report document correctly and convert it into a single pdf?

○   Last Name_First Name_Semester Year_Name of Project Name

Did you answer all of the questions to the best of your ability?

Did you make sure your answers directly address the prompt(s) in an organized manner that is easy to follow?

Did you self-assess your open-ended responses using the rubric and make any necessary revisions?

Did you proofread your work?