A curated list of open technology projects to sustain a stable climate, energy supply, biodiversity and natural resources.

Recent Releases of Zeus

Zeus - Zeus Daemon v0.2.0

Change Highlights

CPU and DRAM energy measurements

Zeus daemon now also supports CPU and DRAM energy measurement with RAPL, which requires root privileges even just to read measurements. Zeus daemon has also been integrated into the Zeus Python library, so as long as the daemon is deployed and the ZEUSD_SOCK_PATH environment variable is set, you'll be all set!
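
As a rough illustration of the environment-variable setup described above, the sketch below reads ZEUSD_SOCK_PATH and reports whether it is set. The helper name resolve_zeusd_socket is hypothetical and not part of Zeus; how the library itself detects the daemon may differ.

```python
import os
from typing import Mapping, Optional


def resolve_zeusd_socket(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the socket path stored in ZEUSD_SOCK_PATH, or None if unset or blank.

    Hypothetical helper for illustration only; not part of the Zeus API.
    """
    path = env.get("ZEUSD_SOCK_PATH", "").strip()
    return path or None


sock = resolve_zeusd_socket()
if sock is None:
    print("ZEUSD_SOCK_PATH is not set, so no daemon socket is configured.")
else:
    # Only checks that the path exists; it does not talk to the daemon.
    print(f"Daemon socket configured at {sock} (exists: {os.path.exists(sock)})")
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.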

Consumption - Computation and Communication - Python
Published by jaywonchung 3 months ago

Zeus - Zeus v0.11.0

Change Highlights

Renamed to zeus!

Until now we used zeus-ml because the name zeus was taken on PyPI, but now we're finally able to move to zeus:

pip install zeus

Prometheus Metrics

Zeus power and energy measurements can now be exported as Prometheus metrics! We currently support three metrics:

  • Energy consumption of a fixed code range (Histogram)
  • Power draw over time (Gauge)
  • Cumulative energy consumption over time (Counter)
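
To make the three metric types listed above more concrete, here is a minimal, dependency-free sketch of their semantics. The classes are toy stand-ins written for this illustration; they are neither the Prometheus client library nor the Zeus exporter, and the sample numbers are made up.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List


@dataclass
class Counter:
    """Monotonically increasing total, e.g. cumulative energy in joules."""
    value: float = 0.0

    def inc(self, amount: float) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Value that may go up or down, e.g. current power draw in watts."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Bucketed observations, e.g. energy of each run of a fixed code range.

    Buckets here are non-cumulative for readability; Prometheus itself
    exposes cumulative bucket counts.
    """
    bounds: List[float]  # sorted upper bounds; a final +Inf bucket is implied
    counts: List[int] = field(init=False)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value


energy_total = Counter()
power_now = Gauge()
window_energy = Histogram(bounds=[10.0, 100.0, 1000.0])

# Illustrative (power in W, energy in J) samples, not real measurements.
for power_w, energy_j in [(250.0, 50.0), (300.0, 150.0), (120.0, 5.0)]:
    power_now.set(power_w)
    energy_total.inc(energy_j)
    window_energy.observe(energy_j)
```

The counter only ever grows, the gauge tracks the latest reading, and the histogram records how per-window energy is distributed.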

We wrote up a detailed metric monitoring guide and integration examples.

AMD GPU enhancements

We created an official distribution of ROCm AMDSMI Python bindings (GitHub, PyPI) and integrated it with Zeus. Before this, users had to cd into their ROCm installation's AMDSMI distribution directory and run pip install, which isn't very convenient.

Carbon Emission Estimations

The new zeus.monitor.carbon.CarbonEmissionMonitor (https://ml.energy/zeus/reference/monitor/carbon/#zeus.monitor.carbon.CarbonEmissionMonitor) takes in a carbon intensity provider (e.g., from ElectricityMaps) and provides an estimate for operational carbon emissions. The window-based API is essentially the same as ZeusMonitor.
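
The arithmetic behind an operational carbon estimate can be sketched as energy multiplied by carbon intensity. The function below is an illustration under that assumption, with hypothetical names and made-up numbers; it is not the CarbonEmissionMonitor implementation.

```python
from typing import Iterable, Tuple

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def estimate_emissions_g(energy_joules: float, intensity_g_per_kwh: float) -> float:
    """Estimate emissions in gCO2eq for energy used at a fixed carbon intensity."""
    return energy_joules / JOULES_PER_KWH * intensity_g_per_kwh


def estimate_total_emissions_g(segments: Iterable[Tuple[float, float]]) -> float:
    """Sum emissions over (energy_joules, intensity_g_per_kwh) segments.

    Useful when carbon intensity changes over time, e.g. hour by hour.
    """
    return sum(estimate_emissions_g(e, i) for e, i in segments)


# 7.2 MJ (2 kWh) at 400 gCO2eq/kWh
single = estimate_emissions_g(7.2e6, 400.0)

# Two segments with different grid intensities
total = estimate_total_emissions_g([(3.6e6, 300.0), (3.6e6, 500.0)])
```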

Consumption - Computation and Communication - Python
Published by jaywonchung 3 months ago

Zeus - Zeus v0.10.1

This is a maintenance release aimed at enhancing usability and fixing small bugs.

Full Changelog: https://github.com/ml-energy/zeus/compare/zeus-v0.10.0...zeus-v0.10.1

Consumption - Computation and Communication - Python
Published by jaywonchung 8 months ago

Zeus - Zeus v0.10.0: Broader support

What's New

CPU and DRAM energy measurement

We implemented support for Intel RAPL, which allows CPU and DRAM energy measurement on supported CPUs.
Generally speaking, most Intel CPUs support both, and some AMD CPUs support RAPL as well, albeit only for CPU measurement.

JAX support

We added preliminary JAX support. Check out our full example here.

API usage is mostly identical:

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(sync_execution_with="jax")  # JAX!

monitor.begin_window("computations")
# Run computation
measurement = monitor.end_window("computations")

Zeus Daemon

Our energy optimizers require changing settings on the GPU, including the power limit and frequency. This requires admin privileges. More details in our docs.

Zeus Daemon lets you work around this: it runs as a standalone daemon process on the node and performs privileged operations on your behalf, so that you don't have to give the entire Zeus-integrated application admin privileges.

We wrote the Zeus Daemon in Rust: Check out the source code and crates.io for details.

Breaking Changes

ZeusMonitor.begin_window and ZeusMonitor.end_window's second parameter sync_cuda was renamed to sync_execution.
This is because JAX asynchronously runs CPU code as well, and we would like to synchronize both CUDA and CPU computations. This created the need to generalize sync_cuda to sync_execution.

Full Changelog: https://github.com/ml-energy/zeus/compare/v0.9.1...zeus-v0.10.0

Consumption - Computation and Communication - Python
Published by jaywonchung 9 months ago

Zeus - v0.9.1

What's new

  • For GPU power draw, we use nvmlDeviceGetFieldValues, which gives us instant power draw (instead of average power draw) for any microarchitecture.

Consumption - Computation and Communication - Python
Published by jaywonchung 12 months ago

Zeus - v0.9.0: Batch size optimizer and big cleanups

What's new

  • The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
  • GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
  • Completely revamped documentation under https://ml.energy/zeus.

Deprecated

  • See #20 (ZeusDataLoader, ZeusMaster, and the C++ Zeus monitor)

Consumption - Computation and Communication - Python
Published by jaywonchung 12 months ago

Zeus - v0.8.0: Energy-efficient large model training

This release features Perseus, an optimizer for energy-efficient large model training.

See the Perseus docs for details.

Consumption - Computation and Communication - Python
Published by jaywonchung over 1 year ago

Zeus - v0.7.1: Moved to under `ml-energy`!

We moved our repository under ml-energy. No feature changes :)

Consumption - Computation and Communication - Python
Published by jaywonchung over 1 year ago

Zeus - v0.7.0: Python-based power monitor

What's New

  • We used to have a C++ power monitor under zeus_monitor, but we've deprecated that. There's no need for high-speed polling because NVML power counters do not update that quickly anyway.
    • In order to poll power consumption programmatically, use zeus.monitor.power.PowerMonitor.
  • CLI power & energy monitor:
    • python -m zeus.monitor power
    • python -m zeus.monitor energy
  • We switched from the old setup.py to the new package metadata standard pyproject.toml.
  • Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but now it's down to 2.71 GB.

Consumption - Computation and Communication - Python
Published by jaywonchung over 1 year ago

Zeus - v0.6.1: `approx_instant_energy`

What's New

approx_instant_energy in ZeusMonitor

  • Sometimes, the NVML energy counter update period is longer than the measurement window, in which case energy consumption may be returned as 0.0. In this case, when approx_instant_energy=True, ZeusMonitor will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window:
    \textrm{Energy} = \int_0^T \textrm{Power}(t) dt \approx \textrm{Power}(T) \cdot T
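
The fallback described above can be sketched as follows. This is a simplified illustration with hypothetical names and made-up numbers, not the ZeusMonitor code: it assumes the caller already has the energy counter readings, the power reading at the end of the window, and the window duration.

```python
def window_energy_j(
    counter_start_j: float,
    counter_end_j: float,
    power_at_end_w: float,
    duration_s: float,
    approx_instant_energy: bool = False,
) -> float:
    """Return the energy of a measurement window in joules.

    If the energy counter did not advance during the window and the
    approximation is enabled, fall back to Power(T) * T.
    """
    measured = counter_end_j - counter_start_j
    if measured == 0.0 and approx_instant_energy:
        return power_at_end_w * duration_s
    return measured


# Counter did not update during a 50 ms window while drawing 200 W.
without_approx = window_energy_j(1000.0, 1000.0, 200.0, 0.05)
with_approx = window_energy_j(1000.0, 1000.0, 200.0, 0.05, approx_instant_energy=True)

# Counter did update, so the measured difference is used either way.
normal = window_energy_j(1000.0, 1350.0, 200.0, 2.0, approx_instant_energy=True)
```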
    

Consumption - Computation and Communication - Python
Published by jaywonchung over 1 year ago

Zeus - v0.6.0: `OptimumSelector`

What's New

OptimumSelector

  • Until now, the optimal power limit for GlobalPowerLimitOptimizer was the one that minimizes the Zeus time-energy cost. Not everyone would want that.
  • Now, OptimumSelector is an abstract base class with which you can implement your own optimal power limit selection policy.
  • The pre-implemented ones are Time, Energy, ZeusCost, and MaxSlowdownConstraint. These are thoroughly tested.
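
The idea of a pluggable selection policy can be sketched with a small abstract base class. Everything below is a hypothetical illustration: the class names, the Profile record, and the method signature are not the Zeus API, and the weighted cost formula (eta * energy + (1 - eta) * max power * time) is an assumption about how a time-energy cost with an eta_knob could be computed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Profile:
    """Profiling result for one power limit (illustrative numbers only)."""
    power_limit_w: int
    time_s: float
    energy_j: float


class Selector(ABC):
    """Hypothetical policy interface: pick one power limit from profiles."""

    @abstractmethod
    def select(self, profiles: List[Profile]) -> int:
        ...


class MinTime(Selector):
    def select(self, profiles: List[Profile]) -> int:
        return min(profiles, key=lambda p: p.time_s).power_limit_w


class MinEnergy(Selector):
    def select(self, profiles: List[Profile]) -> int:
        return min(profiles, key=lambda p: p.energy_j).power_limit_w


class WeightedCost(Selector):
    """Assumed cost: eta * energy + (1 - eta) * max_power * time."""

    def __init__(self, eta_knob: float, max_power_w: int) -> None:
        self.eta_knob = eta_knob
        self.max_power_w = max_power_w

    def select(self, profiles: List[Profile]) -> int:
        def cost(p: Profile) -> float:
            return (self.eta_knob * p.energy_j
                    + (1 - self.eta_knob) * self.max_power_w * p.time_s)
        return min(profiles, key=cost).power_limit_w


profiles = [
    Profile(power_limit_w=300, time_s=10.0, energy_j=2800.0),
    Profile(power_limit_w=200, time_s=11.0, energy_j=2100.0),
    Profile(power_limit_w=150, time_s=14.0, energy_j=2000.0),
]
fastest = MinTime().select(profiles)
greenest = MinEnergy().select(profiles)
balanced = WeightedCost(eta_knob=0.5, max_power_w=300).select(profiles)
```

With these numbers the three policies disagree, which is the point of making the policy swappable.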

wait_steps

  • Now, you can specify wait_steps in GlobalPowerLimitOptimizer, and it'll wait for the specified number of steps before profiling and optimizing.
  • wait_steps is set to 1 by default because users may have torch.backends.cudnn.benchmark = True, and DataLoader workers usually need time to warm up before ramping up to their normal fetch throughput.
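
The effect of wait_steps on a training run can be pictured as three phases: waiting, profiling, and running with the chosen power limit. The sketch below only labels steps and uses a hypothetical profile_steps parameter; it is not how GlobalPowerLimitOptimizer is implemented.

```python
from typing import List


def plan_steps(total_steps: int, wait_steps: int, profile_steps: int) -> List[str]:
    """Label each training step with the phase it would fall into.

    Hypothetical helper: the first wait_steps steps are skipped so warm-up
    effects do not distort profiling, the next profile_steps steps are
    profiled, and the remainder runs with the selected power limit.
    """
    phases = []
    for step in range(total_steps):
        if step < wait_steps:
            phases.append("wait")
        elif step < wait_steps + profile_steps:
            phases.append("profile")
        else:
            phases.append("optimized")
    return phases


phases = plan_steps(total_steps=5, wait_steps=1, profile_steps=2)
```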

Breaking Changes

  • GlobalPowerLimitOptimizer now takes an instance of OptimumSelector in its constructor, instead of eta_knob. If you want to recover the functionality of v0.5.0, modify your code like this:
    # Before
    plo = GlobalPowerLimitOptimizer(..., eta_knob=0.5, ...)
    
    # After
    from zeus.optimizer.power_limit import ZeusCost
    
    plo = GlobalPowerLimitOptimizer(..., optimum_selector=ZeusCost(eta_knob=0.5), ...)
    

Consumption - Computation and Communication - Python
Published by jaywonchung almost 2 years ago

Zeus - v0.5.0: Big refactor, `GlobalPowerLimitOptimizer`

What's New

Callback-based architecture

  • zeus.callback.Callback is the new backbone for Zeus components
  • GlobalPowerLimitOptimizer is the shiny new way to online-profile and optimize the power limit of DNN training.
  • EarlyStopController monitors and manages all sorts of conditions to determine whether training should stop.

Extensive testing

  • tests/ is richer than ever. With deep component tests with exhaustive parametrization, there are now around 1500 test cases.
  • In particular, zeus.util.testing.ReplayZeusMonitor exposes the same public API as ZeusMonitor but replays the measurement window logs produced by ZeusMonitor, instead of doing actual measurement. With this, Zeus can now be tested without any actual GPUs.
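
The replay approach can be illustrated with a small stand-in that returns previously logged windows instead of measuring anything. The class and record below are hypothetical and much simpler than ReplayZeusMonitor; the log format and the strict in-order matching are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowRecord:
    """One logged measurement window (illustrative fields)."""
    name: str
    time_s: float
    energy_j: float


class ReplayMonitor:
    """Replays logged windows in order; performs no real measurement."""

    def __init__(self, log: List[WindowRecord]) -> None:
        self._log = list(log)
        self._next = 0
        self._open = set()

    def begin_window(self, name: str) -> None:
        if name in self._open:
            raise ValueError(f"window {name!r} is already open")
        self._open.add(name)

    def end_window(self, name: str) -> WindowRecord:
        if name not in self._open:
            raise ValueError(f"window {name!r} was never opened")
        if self._next >= len(self._log):
            raise ValueError("no more logged windows to replay")
        record = self._log[self._next]
        if record.name != name:
            raise ValueError(f"expected window {record.name!r}, got {name!r}")
        self._open.remove(name)
        self._next += 1
        return record


log = [WindowRecord("epoch", 12.5, 3100.0), WindowRecord("epoch", 12.1, 3050.0)]
monitor = ReplayMonitor(log)

monitor.begin_window("epoch")
first = monitor.end_window("epoch")
monitor.begin_window("epoch")
second = monitor.end_window("epoch")
```

Because the stand-in offers the same begin_window and end_window calls as the code under test expects, components can be exercised deterministically on a machine with no GPU.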

Consumption - Computation and Communication - Python
Published by jaywonchung almost 2 years ago

Zeus - v0.4.0: `ZeusMonitor`

Consumption - Computation and Communication - Python
Published by jaywonchung almost 2 years ago

Zeus - v0.3.0: `ZeusMonitorContext` for in-training-loop profiling

What's New

  • ZeusMonitorContext allows users to profile their per-iteration energy and time consumption.
    • It's aimed at those who would like to get a feel for the energy consumption of their DNN training job with a couple of additional lines (as opposed to modified lines).
    • Documentation and integration example: here

Consumption - Computation and Communication - Python
Published by jaywonchung over 2 years ago

Zeus - v0.2.2

Bug Fix

  • Fixed a bug that made all Zeus monitors monitor the same GPU (index 0) in DP mode (#10)

Consumption - Computation and Communication - Python
Published by jaywonchung over 2 years ago

Zeus - v0.2.1

Bug fix

  • Fixed a bug where power limit profiling did not carry over to the next epoch when the dataset has only a small number of batches (#7).

Consumption - Computation and Communication - Python
Published by jaywonchung over 2 years ago

Zeus - v0.2.0: Single-Node Data Parallel Support

New Features

  • Single-node multi-GPU data parallel training support added (#2)
  • zeus_monitor is built at Docker image build time and baked into the image (#6)

Breaking Changes

  • ZeusDataLoader's profile window for each power limit is now based on the number of iterations, not time. (#2)
    • This was done to ease synchronization between GPUs while profiling power limits.
    • The ZEUS_PROFILE_PARAMS environment variable is now parsed as a comma separated string of the number of warmup and measure iterations.
    • ZeusMaster's constructor now takes arguments profile_warmup_iters and profile_measure_iters.
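
Parsing a comma-separated pair of iteration counts, as described for ZEUS_PROFILE_PARAMS above, might look like the sketch below. The helper name, the example value "10,40", and the validation rules (non-negative warmup, positive measure count) are assumptions for illustration, not the actual Zeus parsing code.

```python
from typing import Tuple


def parse_profile_params(raw: str) -> Tuple[int, int]:
    """Parse "<warmup_iters>,<measure_iters>" into a pair of integers.

    Hypothetical helper; raises ValueError on malformed input.
    """
    parts = [part.strip() for part in raw.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated integers, got {raw!r}")
    warmup_iters, measure_iters = int(parts[0]), int(parts[1])
    if warmup_iters < 0 or measure_iters <= 0:
        raise ValueError("warmup must be >= 0 and measure must be > 0")
    return warmup_iters, measure_iters


warmup, measure = parse_profile_params("10,40")
```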

Consumption - Computation and Communication - Python
Published by jaywonchung over 2 years ago

Zeus - v0.1.0

First official release of Zeus!

  • Support for single-GPU training is stable.

Consumption - Computation and Communication - Python
Published by jaywonchung over 2 years ago