Code structure
The code has been developed in two distinct areas: monitoring and analysis. The data produced by the monitoring tools is consumed offline by the analysis tools. Below we outline the structure and implementation of these tools.
The Ranger system has 3936 nodes, and a job never shares a node with another job. This allows a set of monitor collectors to run on each host and give a consistent view of what a job is doing.
Data is collected on each host at one of three times:
- job start,
- job end, or
- when a cron script runs (currently every 10 minutes)
This gives us a 10-minute window for each data point and results in at least 144 collections per host per day.
The collectors write data to a local ramdisk (/var/log/tacc_stats/{current,<time_since_epoch_at_start_of_day>}). Each collection writes a block of measurements of the form:

    time jobid
    monitor_type device measurements
    ...
Each collection file begins with a set of schemas describing the measurements of each particular monitor_type.
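A minimal sketch of parsing such a collection file follows. The field layout is taken from the description above; the schema-line marker (`!`) and all function names are assumptions for illustration, not the actual tacc_stats code.

```python
# Sketch of a parser for a collection file with the layout described
# above: schema lines, then blocks starting with "time jobid" followed
# by "monitor_type device measurements" lines. Names are illustrative.

def parse_collection(lines):
    """Return a list of records: (time, jobid, monitor_type, device, values)."""
    records = []
    time = jobid = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('!'):  # assume '!' marks schema lines
            continue
        fields = line.split()
        if len(fields) == 2 and all(f.replace('.', '').isdigit() for f in fields):
            # Block header: time jobid
            time, jobid = float(fields[0]), fields[1]
        else:
            # Measurement line: monitor_type device values...
            monitor_type, device, *values = fields
            records.append((time, jobid, monitor_type, device,
                            [float(v) for v in values]))
    return records

sample = "1300000000 123456\ncpu 0 10 20 30\ncpu 1 11 21 31\n"
print(parse_collection(sample.splitlines()))
```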
At midnight these files are rotated. Between 2 and 4 am they are archived to /scratch/projects/tacc_stats/.../host/<time_since_epoch>.tar.gz.
While these files make collection simple, which we want so as to impact the system as little as possible, they require some manipulation to extract the per-job data needed for viewing. To capture this information, a Python script reads the monitor data and produces a nested set of dictionaries containing the data. Summary statistics are uploaded to an SQL database, and each job object is pickled and stored in the filesystem.
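The nested-dictionary and pickling step could look roughly like this; the dictionary layout and function names are assumptions for illustration, not the actual script.

```python
import os
import pickle
import tempfile

# Hypothetical sketch: gather a job's records into a nested dict keyed
# as stats[monitor_type][device] -> list of (time, values), then pickle
# the per-job object, as the text describes.

def build_job(jobid, records):
    stats = {}
    for time, rec_jobid, monitor_type, device, values in records:
        if rec_jobid != jobid:
            continue
        stats.setdefault(monitor_type, {}) \
             .setdefault(device, []) \
             .append((time, values))
    return {'jobid': jobid, 'stats': stats}

def store_job(job, path):
    with open(path, 'wb') as f:
        pickle.dump(job, f)

records = [(1300000000.0, '123456', 'cpu', '0', [10.0, 20.0])]
job = build_job('123456', records)
path = os.path.join(tempfile.mkdtemp(), '123456.pickle')
store_job(job, path)
```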
Currently there are a few routines for viewing data but nothing for more interesting analytics. Once the data views are finished and internal users can see the system, we will return to adding more analytics.
The current data views are all built on Django, which provides views and models of the data. There are two types of pages in the current focus:
- Bulk data pages: These pages provide a quick summary of data statistics. For example, a homepage for the HPC analyst can show the number of jobs run, the distribution of memory usage, and so forth.
- Drill-down data pages: These pages present finer-grained metrics, allowing a user to see individual contributions. For example, a list of jobs, with each job providing graphics on its particular run.
The bulk data pages query the SQL database for their views. The drill-down pages may fetch a collection of statistics from the SQL database or read the individual job files produced by the system monitors.
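The two access paths can be sketched with standard-library tools, with sqlite3 standing in for the summary database and pickle for the per-job files; the table and column names are assumptions, not the actual schema.

```python
import os
import pickle
import sqlite3
import tempfile

# Sketch of the two data paths: bulk pages aggregate over the SQL
# summary table; drill-down pages load an individual pickled job.
# Table/column names are illustrative.

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE job_summary (jobid TEXT, mem_gb REAL, runtime_s REAL)')
conn.executemany('INSERT INTO job_summary VALUES (?, ?, ?)',
                 [('1', 2.0, 100.0), ('2', 6.0, 300.0)])

# Bulk page: one aggregate query over the summary table.
njobs, avg_mem = conn.execute(
    'SELECT COUNT(*), AVG(mem_gb) FROM job_summary').fetchone()

# Drill-down page: unpickle the full per-job object.
path = os.path.join(tempfile.mkdtemp(), '1.pickle')
with open(path, 'wb') as f:
    pickle.dump({'jobid': '1', 'stats': {'cpu': {}}}, f)
with open(path, 'rb') as f:
    job = pickle.load(f)

print(njobs, avg_mem, job['jobid'])
```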