v2.0 - Major performance & multi-GPU update (nightly branch)
Key new features:
- Added torch.compile support (reduce-overhead + dynamic=True) for ~30–100% faster inference after initial compilation warmup
- Explicit SDPA (Scaled Dot-Product Attention) backend for better speed and memory efficiency on Ampere/Ada GPUs
- Multi-GPU support via device_map="auto" – new toggle input "use_multi_gpu" (default True). Turn it off for single-GPU setups (e.g. users with only cuda:0 visible)
- Modern dtype options (bf16 default, fp16, fp32, auto)
- Better logging, error handling, and model unloading when keep_loaded=False
This version focuses on stability, speed, and compatibility with multi-GPU setups while keeping single-GPU rock-solid.
Repo: link
Tested on RTX 3090 (single GPU) with PyTorch 2.7–2.9.
Feedback/PRs welcome for multi-GPU testing!