Optimize raw data job queries. #1780

Merged
aaronweeden merged 1 commit into ubccr:xdmod11.0 from aaronweeden:optimize-jobs-raw-data-query
Dec 14, 2023

Conversation

Contributor

@aaronweeden aaronweeden commented Sep 20, 2023

Description

This PR optimizes raw data queries in the Jobs realm by using WHERE conditions on the end_day_id field of modw.job_tasks instead of the end_time_ts field. This PR also removes the DISTINCT modifier for single-day queries.
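
The change of filter column can be illustrated with a minimal sketch. It assumes a simplified stand-in for modw.job_tasks in an in-memory SQLite database and a hypothetical day key of the form year * 100000 + day-of-year computed in UTC; the actual XDMoD schema, day ID encoding, and time zone handling may differ. It only shows that, for a single day, an equality test on the day key selects the same rows as a pair of timestamp bounds. It does not reproduce the timings reported in this PR.

```python
import sqlite3
from datetime import datetime, timezone


def ts(year, month, day, hour=0, minute=0, second=0):
    """Unix timestamp for a UTC wall-clock time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def day_id(timestamp):
    """Hypothetical day key (an assumption): year * 100000 + day of year, in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.year * 100000 + moment.timetuple().tm_yday


conn = sqlite3.connect(":memory:")
# Simplified stand-in for modw.job_tasks; only the columns named in the PR.
conn.execute(
    "CREATE TABLE job_tasks (job_id INTEGER, end_time_ts INTEGER, end_day_id INTEGER)"
)
end_times = [
    (1, ts(2023, 9, 19, 23, 59, 59)),  # day before
    (2, ts(2023, 9, 20, 0, 0, 0)),     # first second of the day
    (3, ts(2023, 9, 20, 12, 30, 0)),
    (4, ts(2023, 9, 20, 23, 59, 59)),  # last second of the day
    (5, ts(2023, 9, 21, 0, 0, 0)),     # day after
]
conn.executemany(
    "INSERT INTO job_tasks VALUES (?, ?, ?)",
    [(job, t, day_id(t)) for job, t in end_times],
)

start_ts = ts(2023, 9, 20)
end_ts = ts(2023, 9, 20, 23, 59, 59)

# Old style: two bounds on the timestamp column.
by_timestamp = [row[0] for row in conn.execute(
    "SELECT job_id FROM job_tasks"
    " WHERE end_time_ts >= ? AND end_time_ts <= ? ORDER BY job_id",
    (start_ts, end_ts),
)]

# New style: one equality test on the day key.
by_day_id = [row[0] for row in conn.execute(
    "SELECT job_id FROM job_tasks WHERE end_day_id = ? ORDER BY job_id",
    (day_id(start_ts),),
)]

print(by_timestamp, by_day_id)
```

Both queries return the same job IDs here. Whether the day-key form is faster in practice depends on the indexes and table layout of the real database.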

There are also corresponding PRs in xdmod-xsede: https://github.com/ubccr/xdmod-xsede/pull/433 and https://github.com/ubccr/xdmod-xsede/pull/448.

Motivation and Context

The new query is much faster (<10 seconds per single-day query compared to ~200 seconds), and in conjunction with #1792 and ubccr/xdmod-data#19, it makes the data analytics framework more responsive.

Tests performed

In addition to ensuring the regression tests from #1788 still pass:

  1. Run the script from https://github.com/ubccr/xdmod-xsede/pull/436 with the --time argument.
    1. Set 'new': 'https://xdmod-dev.ccr.xdmod.org' since https://github.com/ubccr/xdmod-xsede/pull/433 is already merged.
    2. Set 'old': 'https://xdmod-dev.ccr.xdmod.org:9001' with the version of classes/DataWarehouse/Query/Jobs/JobDataset.php from #1779 (Use the portal's time zone for raw data query dates instead of UTC).
  2. Run the script again with the --time argument and the variable NUM_DAYS_PER_REQUEST set to 1:
    1. Confirm the new Jobs realm queries take <10 seconds compared to ~200 seconds for the old queries.
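
For readers unfamiliar with the test script, the effect of setting NUM_DAYS_PER_REQUEST to 1 can be pictured with the sketch below. Only the variable name comes from the steps above; the `date_chunks` helper and its chunking logic are assumptions made for illustration and are not taken from the script in xdmod-xsede pull request 436.

```python
from datetime import date, timedelta

# Name taken from the test steps; how the real script uses it is assumed.
NUM_DAYS_PER_REQUEST = 1


def date_chunks(start, end, days_per_request=NUM_DAYS_PER_REQUEST):
    """Yield inclusive (chunk_start, chunk_end) date pairs covering start..end."""
    if days_per_request < 1:
        raise ValueError("days_per_request must be at least 1")
    current = start
    while current <= end:
        # The last chunk is clipped so it never runs past the requested end date.
        chunk_end = min(current + timedelta(days=days_per_request - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)


single_day_requests = list(date_chunks(date(2023, 9, 20), date(2023, 9, 22)))
two_day_requests = list(date_chunks(date(2023, 9, 20), date(2023, 9, 22), 2))
```

With one day per request, every request has the same start and end date, which is the single-day case timed in step 2.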

Checklist:

  • The pull request description is suitable for a Changelog entry
  • The milestone is set correctly on the pull request
  • The appropriate labels have been added to the pull request

@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch 2 times, most recently from 5c0e330 to a03a932 Compare September 22, 2023 17:00
@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch from a03a932 to fc9c13e Compare October 19, 2023 13:36
@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch 2 times, most recently from df4badf to 6e0faf3 Compare November 7, 2023 18:51
@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch from 6e0faf3 to f1d789b Compare November 21, 2023 01:53
@aaronweeden aaronweeden added this to the 11.0.0 milestone Nov 21, 2023
@aaronweeden aaronweeden added the maintenance / code quality, Category:Data Warehouse Export, Category: Data Analytics Framework, and enhancement labels and removed the maintenance / code quality label Nov 21, 2023
@aaronweeden aaronweeden marked this pull request as ready for review November 21, 2023 19:30
@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch 2 times, most recently from 46c1389 to 9948af7 Compare December 7, 2023 16:17
Comment on lines +124 to +118
} else {
    $this->addPdoWhereCondition(
        new WhereCondition(
            new TableField($factTable, 'end_time_ts'),
            '>=',
            $startDateTs
        )
    );
    $this->addPdoWhereCondition(
        new WhereCondition(
            new TableField($factTable, 'end_time_ts'),
            '<=',
            $endDateTs
        )
    );
}
Contributor

Is there a reason why either the day_id or end_day_id field wouldn't be better for multiple day queries as well?

Contributor Author

Good idea, I have changed it to use end_day_id.
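
A small check of why a day-key column can also serve multi-day ranges: if the key increases with calendar time, lower and upper bounds on the key select the same rows as timestamp bounds aligned to day boundaries, even across a year boundary. The sketch assumes a hypothetical key of year * 100000 + day-of-year and UTC day boundaries; the encoding and time zone that XDMoD actually uses are not confirmed by this thread.

```python
from datetime import datetime, timedelta, timezone


def day_id(moment):
    """Hypothetical day key (an assumption): year * 100000 + day of year."""
    return moment.year * 100000 + moment.timetuple().tm_yday


# A multi-day range that crosses a year boundary, aligned to whole UTC days.
start = datetime(2022, 12, 30, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, 23, 59, 59, tzinfo=timezone.utc)

# Made-up job end times every six hours, from two days before the range
# to two days after it.
samples = []
moment = start - timedelta(days=2)
while moment <= end + timedelta(days=2):
    samples.append(moment)
    moment += timedelta(hours=6)

# Filter by timestamp bounds, as in the reviewed else-branch.
by_timestamp = [
    m for m in samples if start.timestamp() <= m.timestamp() <= end.timestamp()
]

# Filter by bounds on the day key instead.
by_day_id = [m for m in samples if day_id(start) <= day_id(m) <= day_id(end)]

print(by_timestamp == by_day_id)
```

The equivalence holds only when the requested range starts and ends on day boundaries in the same time zone used to compute the day key.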

@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch 3 times, most recently from 9b0c253 to 82c5429 Compare December 13, 2023 22:08
@aaronweeden aaronweeden changed the title Optimize single-day raw data job queries. Optimize raw data job queries. Dec 13, 2023
Co-Authored-By: Joe White <jpwhite4@buffalo.edu>
@aaronweeden aaronweeden force-pushed the optimize-jobs-raw-data-query branch from 82c5429 to 5ad42fc Compare December 14, 2023 14:34
@aaronweeden aaronweeden merged commit d570408 into ubccr:xdmod11.0 Dec 14, 2023
@aaronweeden aaronweeden deleted the optimize-jobs-raw-data-query branch December 14, 2023 15:53
Labels

Category: Data Analytics Framework, Category:Data Warehouse Export, enhancement
