Repository of the code produced for the Scalable and Cloud Programming course.
- solutions folder: all the explored solutions
- resources folder: a reduced version of the dataset, used to compare the solutions locally
- data_analysis folder: Python scripts used for data analysis
The program takes four parameters:
- the path of the input file
- the path of the output folder
- the solution code
- the number of nodes
Only BestSolutionWithPartitions uses the number-of-nodes parameter; all other solutions ignore it, so it can be omitted.
Each explored solution has a solution_id that is passed as a parameter when running the project (see the local run example after the table below).
The available solutions are:
| Solution Id | Solution Name |
|---|---|
| 0 | FirstSolution |
| 1 | GroupByKey |
| 2 | NewPairsMapping |
| 3 | MergeTwoStages |
| 4 | MergeTwoStagesNoFilter |
| 5 | BestSolutionWithPartitions |
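
For a quick local comparison of the solutions (for example on the reduced dataset in the resources folder), you can build the jar and run it with a local Spark installation. This is a minimal sketch, assuming a Spark distribution built for Scala 2.12 is installed and that Main does not hard-code a cluster master; the sample-file name and the output path are placeholders:
```
# Build the jar, then run solution 5 locally on a sample file (placeholder name)
sbt clean package
spark-submit \
  --class Main \
  --master "local[*]" \
  target/scala-2.12/copurchaseanalysis_2.12-0.1.0-SNAPSHOT.jar \
  resources/<sample-dataset>.csv ./output/ 5 1
```
The four trailing arguments follow the parameter order listed above.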
- Create a new project on Google Cloud Platform
- Enable Dataproc and the required APIs
- Create and download the service account keys
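
The three steps above can also be performed from the command line. This is a rough sketch (the project id and the service-account name copurchase-sa are placeholders; billing and the required IAM roles still have to be configured separately):
```
# Create the project, enable the APIs, and create a service account with a key file
gcloud projects create <project-id>
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com --project=<project-id>
gcloud iam service-accounts create copurchase-sa --project=<project-id>
gcloud iam service-accounts keys create key.json \
  --iam-account=copurchase-sa@<project-id>.iam.gserviceaccount.com
```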
- Export useful variables:
```
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
export PROJECT=<project-id>
export BUCKET_NAME=<bucket-name>
export CLUSTER=<cluster-name>
export REGION=<region>
```
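If you want the gcloud CLI itself to authenticate with the service account instead of your user credentials, the key exported above can be activated directly (optional):
```
# Make gcloud use the service-account key instead of a user login
gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
```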
- Clone this repository:
```
git clone https://github.com/taglioIsCoding/Co-purchaseAnalysis/
```
- Initialize the Google Cloud project:
```
gcloud init
```
- Create the bucket:
```
gcloud storage buckets create gs://${BUCKET_NAME} --location=${REGION}
```
- Create the cluster:
  - With a single node:
```
gcloud dataproc clusters create ${CLUSTER} \
  --project=${PROJECT} \
  --region=${REGION} \
  --single-node \
  --master-boot-disk-size 240
```
  - With multiple nodes:
```
gcloud dataproc clusters create ${CLUSTER} \
  --region=${REGION} \
  --num-workers=<number-of-workers> \
  --master-boot-disk-size 240 \
  --worker-boot-disk-size 240 \
  --project=${PROJECT}
```
- Put the input file in the bucket:
```
gcloud storage cp </path/to/dataset.csv> gs://${BUCKET_NAME}/input/input.csv
```
- Build the project:
```
sbt clean package
```
- Put the project jar in the bucket:
```
gcloud storage cp ./target/scala-2.12/copurchaseanalysis_2.12-0.1.0-SNAPSHOT.jar gs://${BUCKET_NAME}/scala/project.jar
```
- Submit a job:
You can select any available solution; the best one is number 5 (BestSolutionWithPartitions).
```
gcloud dataproc jobs submit spark --cluster=${CLUSTER} \
  --class=Main \
  --jars=gs://${BUCKET_NAME}/scala/project.jar \
  --region=${REGION} \
  -- gs://${BUCKET_NAME}/input/input.csv gs://${BUCKET_NAME}/output/ <solution-id> <number-of-nodes>
```
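Once the job has finished, you may want to copy the results to a local folder before deleting the cluster; for example:
```
# Download everything the job wrote under output/ (local destination is a placeholder)
gcloud storage cp -r gs://${BUCKET_NAME}/output/ ./results/
```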
- Delete the cluster:
```
gcloud dataproc clusters delete ${CLUSTER} --region=${REGION} --project=${PROJECT}
```
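To clean up completely, the bucket and everything in it can also be removed (optional, and irreversible):
```
# Delete the bucket together with all the objects it contains
gcloud storage rm --recursive gs://${BUCKET_NAME}
```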