In some situations, using an online calculator such as the Green Algorithms one isn’t very practical, e.g. when many different jobs are run. Ideally, a tool would automatically collect the details of all the jobs run and estimate the corresponding energy usage and carbon footprint. GA4HPC is a first step in this direction.

High Performance Computing (HPC) clusters tend to log information about all the jobs run on them for accounting purposes, and this information can be pulled.
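
For reference, this is the kind of accounting data that can be queried with SLURM’s sacct command; a minimal sketch is shown below (the date range and the accounting fields are illustrative, not necessarily the exact ones GA4HPC uses):

    # List your jobs for January 2024, with runtime, cores, CPU time, memory and state
    sacct --starttime 2024-01-01 --endtime 2024-01-31 \
          --format=JobID,Elapsed,NCPUS,TotalCPU,ReqMem,MaxRSS,State \
          --parsable2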

Who is it for?

At this stage, the script works on any HPC cluster that uses SLURM as its workload manager. It can be adapted to other workload managers; see here for how to add one.

How to install it

It doesn’t require any particular permissions: you just need to clone the GitHub repository onto your HPC drive, enter some information about your data centre, and you’re good to go! Tutorial here
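
As a rough sketch of what this looks like in practice (the repository URL and the configuration step below are assumptions; the linked tutorial has the authoritative instructions):

    # Clone the repository onto the cluster, e.g. into a shared directory
    # (repository URL assumed; use the one from the tutorial if it differs)
    git clone https://github.com/GreenAlgorithms/GreenAlgorithms4HPC.git
    cd GreenAlgorithms4HPC
    # Enter your data centre's details (e.g. PUE, hardware available) in the
    # repository's configuration file, then check that the script runs:
    ./myCarbonFootprint.sh -h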

How to use it

Anyone with access to the_shared_directory where the script is located can run the calculator with the same command; various options are available (an example invocation is shown after the help text below):

usage: myCarbonFootprint.sh      [-h] [-S STARTDAY] [-E ENDDAY] [--filterCWD]
                                 [--filterJobIDs FILTERJOBIDS]
                                 [--filterAccount FILTERACCOUNT]
                                 [--customSuccessStates CUSTOMSUCCESSSTATES]
                                 [--reportBug] [--reportBugHere]
                                 [--useCustomLogs USECUSTOMLOGS]

Calculate your carbon footprint on CSD3.

optional arguments:
  -h, --help            show this help message and exit
  -S STARTDAY, --startDay STARTDAY
                        The first day to take into account, as YYYY-MM-DD
                        (default: 2022-01-01)
  -E ENDDAY, --endDay ENDDAY
                        The last day to take into account, as YYYY-MM-DD
                        (default: today)
  --filterCWD           Only report on jobs launched from the current
                        location.
  --filterJobIDs FILTERJOBIDS
                        Comma-separated list of Job IDs you want to filter on.
  --filterAccount FILTERACCOUNT
                        Only consider jobs charged under this account
  --customSuccessStates CUSTOMSUCCESSSTATES
                        Comma-separated list of job states. By default, only
                        jobs that exit with status CD or COMPLETED are
                        considered successful (PENDING, RUNNING and REQUEUED
                        are ignored). Jobs with states listed here will be
                        considered successful as well (best to list both
                        2-letter and full-length codes). Full list of job
                        states:
                        https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES
  --reportBug           In case of a bug, this flag logs job information so
                        that we can fix it. Note that this will write out some
                        basic information about your jobs, such as runtime,
                        number of cores and memory usage.
  --reportBugHere       Similar to --reportBug, but exports the output to your
                        home folder
  --useCustomLogs USECUSTOMLOGS
                        This bypasses the workload manager, and enables you to
                        input a custom log file of your jobs. This is mostly
                        meant for debugging, but can be useful in some
                        situations. An example of the expected file can be
                        found at `example_files/example_sacctOutput_raw.tsv`.
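
For example, a typical invocation restricting the report to a date range and a billing account might look like this (MYPROJECT is a placeholder account name):

    # Footprint of all jobs charged to MYPROJECT between 1 March and 30 June 2023
    ./myCarbonFootprint.sh -S 2023-03-01 -E 2023-06-30 --filterAccount MYPROJECT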

Limitations to keep in mind

  • The workload manager doesn’t always log the exact CPU usage time, and when this information is missing, we assume that all cores are used at 100%.
  • For now, we assume that GPU jobs only use 1 GPU and the GPU is used at 100%, as the information needed for more accurate measurement is not always available.

(Both of these assumptions may lead to slightly overestimated carbon footprints, although the order of magnitude is likely to be correct.)

  • Conversely, the wasted energy due to memory overallocation may be largely underestimated, as the information needed is not always logged.

Report bugs

If you spot any bugs, or would like new features, just open a new issue on GitHub.

How to modify the script for my cluster?

See the “Edit code and contribute” page on how to modify the code and share your improvements with other users.

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License.
