---
title: Cluster Analysis
author: Sebastian Robledo
date: "Nov 20, 2022"
format:
    html:
        code-fold: true
editor:
  render-on-save: true
---

# Loading Python libraries

```{python}
# Load pandas as "pd" for data handling
import pandas as pd
# Load the k-means implementation from scikit-learn
from sklearn.cluster import KMeans
# Load NumPy as "np" for numerical arrays
import numpy as np
```

## Data loading

```{python}
# Read the data from the Google Sheet (exported as CSV)
cluster_data_inst = pd.read_csv("https://docs.google.com/spreadsheets/d/1SxlV8oasFsBP8KbBuyomyCJyaS3b3Z78yDxNPHSqRMY/export?format=csv&gid=1099331023")

# Drop the "instituciones" column from the working copy; cluster_data_inst keeps it for labeling later
cluster_data = cluster_data_inst.drop("instituciones", axis=1)
```

# Sebas analysis

```{python}
# Papers per researcher: divide each paper category by "researchers_total" and store the result in a new "<category>_res_total" column
paper_categories = ["Paper_A1", "Paper_A2", "Paper_B", "Paper_C", "Paper_Sin_categoria"]
for category in paper_categories:
    cluster_data[f"{category}_res_total"] = cluster_data[category] / cluster_data["researchers_total"]
```

## Centering and scaling

```{python}
# Center every column whose name ends in "_res_total" on its mean and save the result in cluster_data_sebas
res_total_cols = [col for col in cluster_data.columns if col.endswith("_res_total")]
cluster_data_sebas = cluster_data[res_total_cols] - cluster_data[res_total_cols].mean()
```

```{python}
# Autoscale (unit variance): divide each centered column by its standard deviation.
# Note: Pareto scaling would divide by the square root of the standard deviation instead.
cluster_data_sebas_sc = cluster_data_sebas / cluster_data_sebas.std()
```
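
The scaling step above divides by the standard deviation, which is usually called autoscaling, whereas Pareto scaling divides by its square root. The following is a minimal sketch of the difference on toy data; the `toy` values and variable names are hypothetical and do not come from the project's spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the project's spreadsheet)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

# Center each column on its mean
centered = toy - toy.mean()

# Autoscaling (unit variance): divide by the sample standard deviation
autoscaled = centered / toy.std()

# Pareto scaling: divide by the square root of the standard deviation
pareto_scaled = centered / np.sqrt(toy.std())

# After autoscaling every column has standard deviation 1;
# after Pareto scaling it equals the square root of the original one
print(autoscaled.std())
print(pareto_scaled.std())
```

Autoscaling gives every feature the same weight in distance-based clustering, while Pareto scaling keeps part of the original difference in spread between features.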

## Hierarchical clustering

```{python}
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the hierarchical clustering tree with Ward linkage
Z = linkage(cluster_data_sebas_sc, 'ward')

# Plot the dendrogram
dendrogram(Z)
plt.show()

```
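
A dendrogram shows the merge tree but does not assign rows to clusters. One possible way to obtain flat cluster labels from a Ward linkage is SciPy's `fcluster`. The sketch below uses hypothetical toy data (`toy`, `Z_toy`) rather than the project's data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two well-separated groups of points (hypothetical values)
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
])

# Ward linkage on the toy data
Z_toy = linkage(toy, "ward")

# Cut the tree so that at most 2 flat clusters remain (labels start at 1)
flat_labels = fcluster(Z_toy, t=2, criterion="maxclust")
print(flat_labels)
```

The resulting labels can be compared with the k-means labels to check whether both methods group the rows similarly.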

## K-means clustering

```{python}
# Run k-means (scikit-learn) on cluster_data_sebas_sc
# Create a KMeans object with 5 clusters
kmeans = KMeans(n_clusters=5)
# Fit the data to the KMeans object
kmeans.fit(cluster_data_sebas_sc)
# Get the cluster labels
labels = kmeans.predict(cluster_data_sebas_sc)
# Create a new column in cluster_data_sebas_sc named "cluster" with the cluster labels
cluster_data_sebas_sc["cluster"] = labels

# Count the rows in each cluster
cluster_data_sebas_sc["cluster"].value_counts()

# Add column "instituciones" from cluster_data_inst to cluster_data_sebas_sc
cluster_data_sebas_sc["instituciones"] = cluster_data_inst["instituciones"]
```

```{python}
# Print the size of each cluster in cluster_data_sebas_sc
for i in range(0,5):
    print("Cluster", i, "size:", cluster_data_sebas_sc[cluster_data_sebas_sc["cluster"] == i].shape[0])
```

```{python}
# List the scaled feature columns in cluster_data_sebas_sc
scaled_features = [col for col in cluster_data_sebas_sc.columns if col.endswith("_res_total")]

# Plot cluster centers to visualize the clusters
cluster_data_sebas_sc.groupby('cluster')[scaled_features].mean().plot(legend=True, kind='bar')
plt.show()
```
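
The k-means chunks fix the number of clusters at 5. One common way to check such a choice is to compare silhouette scores over several values of k. This is only a sketch on hypothetical toy data (`toy`, three synthetic blobs); `n_init` and `random_state` are set here for reproducibility and are not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: three well-separated blobs (hypothetical values)
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers])

# Fit k-means for several values of k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = model.fit_predict(toy)
    scores[k] = silhouette_score(toy, labels_k)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

A higher silhouette score indicates tighter, better separated clusters, so the same loop could be run on the scaled feature columns before the "cluster" column is added.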


# Vivi analysis

```{python}
# Add column "tiempo": 2022 minus the year in column "anio"
cluster_data["tiempo"] = 2022 - cluster_data["anio"]

# Yearly rates: divide each group and paper count by "tiempo" and store the result in a new "*_time" column
time_columns = {
    "Group_A": "Group_A_time",
    "Group_B": "Group_B_time",
    "Group_C": "Group_C_time",
    "Group_no_category": "Group_no_category_time",
    "Paper_A1": "Paper_A1_time",
    "Paper_A2": "Paper_A2_time",
    "Paper_B": "Paper_B_time",
    "Paper_C": "Paper_C_time",
    "Paper_Sin_categoria": "Paper_no_category_time",
}
for source, target in time_columns.items():
    cluster_data[target] = cluster_data[source] / cluster_data["tiempo"]

# Shares by education level: divide each count by "researchers_total".
# These columns are overwritten in place (no new "*_time" columns are created).
education_levels = ["doctorate", "magister", "medical_specialization", "undergrad"]
for level in education_levels:
    cluster_data[level] = cluster_data[level] / cluster_data["researchers_total"]

# Select the clustering features and save them in cluster_data_vivi.
# "Paper_no_category_time" is computed above but is not part of this selection.
cluster_data_vivi = cluster_data[["Group_A_time", "Group_B_time", "Group_C_time", "Group_no_category_time", "Paper_A1_time", "Paper_A2_time", "Paper_B_time", "Paper_C_time", "doctorate", "magister", "medical_specialization", "undergrad"]]
```

## Centering and scaling

```{python}
# Center the data in cluster_data_vivi
cluster_data_vivi = cluster_data_vivi - cluster_data_vivi.mean()
```

```{python}
# Autoscale (unit variance): divide each centered column by its standard deviation.
# Note: Pareto scaling would divide by the square root of the standard deviation instead.
cluster_data_vivi_sc = cluster_data_vivi / cluster_data_vivi.std()
```

## Hierarchical clustering

```{python}
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Perform the linkage
Z = linkage(cluster_data_vivi_sc, 'ward')

# Plot the dendrogram
dendrogram(Z)
plt.show()

```

## K-means clustering

```{python}
# Run k-means (scikit-learn) on cluster_data_vivi_sc
# Create a KMeans object with 5 clusters
kmeans = KMeans(n_clusters=5)
# Fit the data to the KMeans object
kmeans.fit(cluster_data_vivi_sc)
# Get the cluster labels
labels = kmeans.predict(cluster_data_vivi_sc)
# Create a new column in cluster_data_vivi_sc named "cluster" with the cluster labels
cluster_data_vivi_sc["cluster"] = labels

# Count unique values in column "cluster" of cluster_data_vivi_sc
cluster_data_vivi_sc["cluster"].value_counts()

# Add column "instituciones" from cluster_data_inst to cluster_data_vivi_sc
cluster_data_vivi_sc["instituciones"] = cluster_data_inst["instituciones"]
```

```{python}
# Print the size of each cluster in cluster_data_vivi_sc
for i in range(0,5):
    print("Cluster", i, "size:", cluster_data_vivi_sc[cluster_data_vivi_sc["cluster"] == i].shape[0])
```

```{python}
# List the scaled feature columns in cluster_data_vivi_sc
scaled_features = ["Group_A_time", "Group_B_time", "Group_C_time", "Group_no_category_time", "Paper_A1_time", "Paper_A2_time", "Paper_B_time", "Paper_C_time", "doctorate", "magister", "medical_specialization", "undergrad"]

# Plot cluster centers to visualize the clusters
cluster_data_vivi_sc.groupby('cluster')[scaled_features].mean().plot(legend=True, kind='bar')
plt.show()
```

# Data cleaning

```{python}
# Row-wise normalization: divide the values from position 3 onward by the value at position 1 of the same row
def divi(x):
    const = x.iloc[1]
    # Avoid division by zero
    if const == 0:
        const = 1
    df = x.iloc[3:]
    df_n = df.div(const)
    return df_n

cluster_data_nor = cluster_data.apply(divi, axis=1)
# cluster_data_nor['instituciones']= cluster_data_inst['instituciones']

# Min-max scale the data in cluster_data to be between 0 and 1
cluster_data_sc = (cluster_data - cluster_data.min()) / (cluster_data.max() - cluster_data.min())

# Center the data in cluster_data_sc by subtracting each column's mean
cluster_data_sc_cent = cluster_data_sc - cluster_data_sc.mean()

# Divide every column in cluster_data by the value in the column "researchers_total" of the same row
cluster_data_nor_res = cluster_data.div(cluster_data["researchers_total"], axis=0)

# Remove researchers_total from cluster_data_nor_res
cluster_data_nor_res = cluster_data_nor_res.drop("researchers_total", axis=1)

# Min-max scale the data in cluster_data_nor_res to be between 0 and 1
cluster_data_nor_res_sc = (cluster_data_nor_res - cluster_data_nor_res.min()) / (cluster_data_nor_res.max() - cluster_data_nor_res.min())

# Center the data in cluster_data_nor_res_sc by subtracting each column's mean
cluster_data_nor_res_sc_cent = cluster_data_nor_res_sc - cluster_data_nor_res_sc.mean()
```
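
Min-max scaling as written above divides by `max - min`, which is zero for a constant column and then yields NaN values that k-means cannot handle. A minimal sketch of one way to guard against this is shown below; the helper `min_max_scale` and the toy data are hypothetical, not part of the original analysis.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1]; constant columns become 0 instead of NaN."""
    col_range = df.max() - df.min()
    # Replace a zero range with 1 so the division is defined
    safe_range = col_range.replace(0, 1)
    return (df - df.min()) / safe_range

# Toy data: column "b" is constant (hypothetical values)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = min_max_scale(toy)
print(scaled)
```

Mapping constant columns to 0 is a design choice for this sketch; dropping such columns before clustering would be an equally reasonable alternative.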


# Insert Python chunk


Program type?
Add the product types (books, innovations, software).

https://drive.google.com/drive/u/1/folders/1Vh-72HiNoVF6oDlDpgwv_U32RSV22i0h


# Clustering Analysis with k-means

K-means with k = 5 on the data normalized by `researchers_total`, min-max scaled, and mean-centered.

```{python}
# Number of clusters
k = 5

# Create KMeans model
kmeans = KMeans(n_clusters=k)

# Fit the model to the data
kmeans.fit(cluster_data_nor_res_sc_cent)

# Get cluster labels
labels = kmeans.labels_

# Get cluster centers
cluster_centers = kmeans.cluster_centers_

# Add labels to cluster_data_nor_res_sc_cent as a new column "Cluster"
cluster_data_nor_res_sc_cent["Cluster"] = labels

# Add "instituciones" from cluster_data_inst to cluster_data_nor_res_sc_cent as a new column
cluster_data_nor_res_sc_cent["instituciones"] = cluster_data_inst["instituciones"]

# Count the unique values in column "Cluster"
cluster_data_nor_res_sc_cent["Cluster"].value_counts()
```
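
The chunk above stores `kmeans.cluster_centers_` in `cluster_centers` but does not inspect it. As a sketch, the centers can be wrapped in a DataFrame with the feature names so each cluster's profile is readable. The toy data and the column names `feature_1` and `feature_2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: two groups of points with named features (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    np.vstack([
        rng.normal(loc=0.0, scale=0.2, size=(10, 2)),
        rng.normal(loc=3.0, scale=0.2, size=(10, 2)),
    ]),
    columns=["feature_1", "feature_2"],
)

# Fit k-means; n_init and random_state are set for reproducibility
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# One row per cluster, one column per feature
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)
centers.index.name = "Cluster"
print(centers)
```

The column names must be taken from the feature matrix as it was when the model was fitted, before label or name columns are added to it.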


```{python}
# Create a DataFrame with simulated sales data for a product in 3 stores (A, B, and C) over 4 days (L, M, X, and J)
# and with 3 product categories (A, B, and C)
data = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}, index=['L', 'M', 'X', 'J'])
```