Repository of the code produced for the Scalable and Cloud Programming course.
- solutions folder: all the explored solutions
- resources folder: a reduced version of the dataset, used to compare the solutions locally
- data_analysis folder: Python scripts used for data analysis
The program takes four parameters:
- the path of the input file
- the path of the output folder
- the solution code
- the number of nodes
Only BestSolutionWithPartitions uses the number-of-nodes parameter; all other solutions ignore it, so it can be omitted.
Each explored solution has a solution_id that is passed as a parameter when running the project (see the local run example after the table below).
The available solutions are:
| Solution Id | Solution Name |
|---|---|
| 0 | FirstSolution |
| 1 | GroupByKey |
| 2 | NewPairsMapping |
| 3 | MergeTwoStages |
| 4 | MergeTwoStagesNoFilter |
| 5 | BestSolutionWithPartitions |
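
For a quick local comparison of the solutions (for example on the reduced dataset in the resources folder), you can build the jar and run it with a local Spark installation. This is a minimal sketch, assuming a Spark distribution built for Scala 2.12 is installed and that Main does not hard-code a cluster master; the sample-file name and the output path are placeholders:
```
# Build the jar, then run solution 5 locally on a sample file (placeholder name)
sbt clean package
spark-submit \
  --class Main \
  --master "local[*]" \
  target/scala-2.12/copurchaseanalysis_2.12-0.1.0-SNAPSHOT.jar \
  resources/<sample-dataset>.csv ./output/ 5 1
```
The four trailing arguments follow the parameter order listed above.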
- Create a new project on Google Cloud Platform
- Enable Dataproc and the required APIs
- Create and download the service account keys
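
The three steps above can also be performed from the command line. This is a rough sketch (the project id and the service-account name copurchase-sa are placeholders; billing and the required IAM roles still have to be configured separately):
```
# Create the project, enable the APIs, and create a service account with a key file
gcloud projects create <project-id>
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com --project=<project-id>
gcloud iam service-accounts create copurchase-sa --project=<project-id>
gcloud iam service-accounts keys create key.json \
  --iam-account=copurchase-sa@<project-id>.iam.gserviceaccount.com
```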
- Export useful variables:
```
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
export PROJECT=<project-id>
export BUCKET_NAME=<bucket-name>
export CLUSTER=<cluster-name>
export REGION=<region>
```
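If you want the gcloud CLI itself to authenticate with the service account instead of your user credentials, the key exported above can be activated directly (optional):
```
# Make gcloud use the service-account key instead of a user login
gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}"
```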
- Clone this repository:
```
git clone https://github.com/taglioIsCoding/Co-purchaseAnalysis/
```
- Initialize the Google Cloud project:
```
gcloud init
```
- Create the bucket:
```
gcloud storage buckets create gs://${BUCKET_NAME} --location=${REGION}
```
- Create the cluster:
  - With a single node:
```
gcloud dataproc clusters create ${CLUSTER} \
  --project=${PROJECT} \
  --region=${REGION} \
  --single-node \
  --master-boot-disk-size 240
```
  - With multiple nodes:
```
gcloud dataproc clusters create ${CLUSTER} \
  --region=${REGION} \
  --num-workers=<number-of-workers> \
  --master-boot-disk-size 240 \
  --worker-boot-disk-size 240 \
  --project=${PROJECT}
```
- Put the input file in the bucket:
```
gcloud storage cp </path/to/dataset.csv> gs://${BUCKET_NAME}/input/input.csv
```
- Build the project:
```
sbt clean package
```
- Put the project jar in the bucket:
```
gcloud storage cp ./target/scala-2.12/copurchaseanalysis_2.12-0.1.0-SNAPSHOT.jar gs://${BUCKET_NAME}/scala/project.jar
```
- Submit a job:
You can select any available solution; the best one is number 5 (BestSolutionWithPartitions).
```
gcloud dataproc jobs submit spark --cluster=${CLUSTER} \
  --class=Main \
  --jars=gs://${BUCKET_NAME}/scala/project.jar \
  --region=${REGION} \
  -- gs://${BUCKET_NAME}/input/input.csv gs://${BUCKET_NAME}/output/ <solution-id> <number-of-nodes>
```
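Once the job has finished, you may want to copy the results to a local folder before deleting the cluster; for example:
```
# Download everything the job wrote under output/ (local destination is a placeholder)
gcloud storage cp -r gs://${BUCKET_NAME}/output/ ./results/
```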
- Delete the cluster:
```
gcloud dataproc clusters delete ${CLUSTER} --region=${REGION} --project=${PROJECT}
```
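To clean up completely, the bucket and everything in it can also be removed (optional, and irreversible):
```
# Delete the bucket together with all the objects it contains
gcloud storage rm --recursive gs://${BUCKET_NAME}
```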