Skip to content

ClickHouse version of import users script#1317

Open
jamesqo wants to merge 7 commits into
masterfrom
import-users-ch
Open

ClickHouse version of import users script#1317
jamesqo wants to merge 7 commits into
masterfrom
import-users-ch

Conversation

@jamesqo
Copy link
Copy Markdown
Contributor

@jamesqo jamesqo commented Apr 8, 2026

Summary of changes

  • Removed all read/write interactions with MySQL
  • Moved all read operations with MySQL -> ClickHouse
  • Replace usage of clickhouse CLI with clickhouse-connect library (will be useful in the future when this script is moved to the Airflow server)
  • Update users / authorities table in a single batched SQL transaction
  • Removed --ssl-ca flag which no longer applies (this was only relevant for RDS)

The portal.properties files also need to be updated in order to include clickhouse creds -- will submit a PR to pipelines-configuration later

cc @averyniceday @sheridancbio

@jamesqo
Copy link
Copy Markdown
Contributor Author

jamesqo commented Apr 8, 2026

still need to test changes here

@jamesqo
Copy link
Copy Markdown
Contributor Author

jamesqo commented Apr 8, 2026

Did some testing. Further changes:

  • Converted script to Python3
  • Added more logging
  • Speed it up by doing a batched query for authorities in 1 swoop

added_user_rows.append([user_email, user_name, user.enabled])
added_authority_rows += [[user_email, authority] for authority in user.authorities]
if added_user_rows:
ch_client.insert('users', added_user_rows, column_names=['email', 'name', 'enabled'])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add error checking like we were doing before for SQL here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes .. if the insert fails I wonder what happens. I suppose ch_client will throw an exception? At the very least we should confirm that if this happens we will get a stacktrace or some kind of indication in the run logs. A notice to our slack channel (letting us know that we failed an attempt to add a user to the user table) should probably also be made.

Copy link
Copy Markdown
Contributor Author

@jamesqo jamesqo Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added back code that will email the failure message to users if there is an exception thrown during inserts.

Comment thread import-scripts/importUsers.py
Copy link
Copy Markdown
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General changes look good - couple suggestions /questions regarding readiability and organization

Comment thread import-scripts/importUsers.py
Comment thread import-scripts/importUsers.py
Comment thread import-scripts/importUsersClickhouse.py
Copy link
Copy Markdown
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I breezed over the parts which seemed to be clear language conversions to accommodate python 3 syntax. So long as this works in production I approve. I suppose the inherent retry is still present because each time this runs we re-compute the list of users who are missing from the database and try to import them again.

Comment thread import-scripts/import-portal-users.sh
Comment thread import-scripts/importUsers.py
added_user_rows.append([user_email, user_name, user.enabled])
added_authority_rows += [[user_email, authority] for authority in user.authorities]
if added_user_rows:
ch_client.insert('users', added_user_rows, column_names=['email', 'name', 'enabled'])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes .. if the insert fails I wonder what happens. I suppose ch_client will throw an exception? At the very least we should confirm that if this happens we will get a stacktrace or some kind of indication in the run logs. A notice to our slack channel (letting us know that we failed an attempt to add a user to the user table) should probably also be made.

Comment thread import-scripts/importUsers.py
@jamesqo
Copy link
Copy Markdown
Contributor Author

jamesqo commented Apr 9, 2026

@sheridancbio I'm wary of adding code for Slack notifications -- if something goes wrong with the network for example, we will be getting pinged every minute by this script. I did improve the error handling for failed DB operations however

@jamesqo
Copy link
Copy Markdown
Contributor Author

jamesqo commented Apr 14, 2026

@copilot resolve the merge conflicts in this pull request

@jamesqo
Copy link
Copy Markdown
Contributor Author

jamesqo commented Apr 14, 2026

We decided that we will be leaving the behavior on failure as-is for now but we should probably address it in the future so that users don't get spammed with emails if something goes wrong with our networking, for example.

SMTP_SERVER=`cat $MAIL_SMTP_SERVER`
GENIE_BLUE_DATABASE_PROPERTIES_FILENAME="portal.properties.genie.blue"
GENIE_GREEN_DATABASE_PROPERTIES_FILENAME="portal.properties.genie.green"
GENIE_BLUE_CLICKHOUSE_CLIENT_CONFIG_FILEPATH="/data/portal-cron/pipelines-credentials/clickhouse_client_genie_blue_config.yaml"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these lines as we're no longer using the clickhouse CLI

Copy link
Copy Markdown
Collaborator

@averyniceday averyniceday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me - only one change -- let's just replace the contents of the existing importUsers.py script instead of having a separate importUsersClickhouse.py script. That way we do cleanup as part of this rollout, thanks!

@sheridancbio
Copy link
Copy Markdown
Contributor

sheridancbio commented Apr 21, 2026

Overall, I'm good with merging this PR. But one thing I'm not sure I understand ... why did we switch from using the command line executable "clickhouse client" and switch to using the python third party library "clickhouse_connect"?

I'm not against this ... but it means that now we have some parts of our code using clickhouse client and some parts using "clickhouse_connect". Both libraries come from the same organization ... but having 2 dependencies is worse than having 1. Perhaps we should stop using clickhouse client and do all clickhouse interactions through clickhouse_connect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants