# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective training.

## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Dataset validation, preprocessing, testing scripts

**Not recommended for training:** Too slow for any meaningful training

### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |

### TPU Options

| Flavor | Type | Use Case |
|--------|------|----------|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |

**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.

## Selection Guidelines

### By Model Size

**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 4-8
- **Training time:** 1-2 hours for 1K examples

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 2-4
- **Training time:** 2-4 hours for 10K examples

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 1-2 (or 4-8 with LoRA)
- **Training time:** 4-8 hours for 10K examples

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B, Mixtral-8x7B (with LoRA)
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
- **Training time:** 6-12 hours for 10K examples
- **Note:** Always use LoRA/PEFT

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large` with LoRA
- **Example:** Llama-2-13B, Llama-3-70B (LoRA only)
- **Batch size:** 1-2 with LoRA
- **Training time:** 8-24 hours for 10K examples
- **Note:** Full fine-tuning is not feasible; use LoRA/PEFT

### By Budget

**Minimal Budget (<$5 total)**
- Use `t4-small`
- Train on a subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use a small model (<1B)

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA

### By Training Type

**Quick Demo/Experiment**
- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes

**Development/Iteration**
- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes

**Production Training**
- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours

**Research/Experimentation**
- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours
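Each of these profiles maps directly to a job submission. Below is a minimal sketch for the development/iteration profile, using the same `hf_jobs("uv", ...)` call shown in the Multi-GPU Training section later in this guide; the script name and timeout are placeholders to adapt to your own run.

```python
# Development/iteration run: ~1K examples, 1 epoch on t4-medium.
# "train.py" is a placeholder for your TRL training script.
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "t4-medium",                # small model, fast iteration
    "timeout": "2h",                      # expect ~30-60 min; leave headroom
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

For a production run, swap the flavor for `a10g-large` or `a100-large` and raise the timeout to match the 4-12 hour estimate above.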
## Memory Considerations

### Estimating Memory Requirements

**Full fine-tuning:**

```
Memory (GB) ≈ (Model params in billions) × 20
```

**LoRA fine-tuning:**

```
Memory (GB) ≈ (Model params in billions) × 4
```

**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits t4-small
- Qwen2.5-1.5B full: ~30GB ❌ exceeds most single GPUs
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits t4-small
- Qwen2.5-7B full: ~140GB ❌ not feasible
- Qwen2.5-7B LoRA: ~28GB ⚠️ exceeds a10g-large (24GB); use a100-large or the optimizations below

### Memory Optimization

If hitting memory limits:

1. **Use LoRA/PEFT**
   ```python
   peft_config=LoraConfig(r=16, lora_alpha=32)
   ```
2. **Reduce batch size**
   ```python
   per_device_train_batch_size=1
   ```
3. **Increase gradient accumulation**
   ```python
   gradient_accumulation_steps=8  # Effective batch size = 1 × 8 = 8
   ```
4. **Enable gradient checkpointing**
   ```python
   gradient_checkpointing=True
   ```
5. **Use mixed precision**
   ```python
   bf16=True  # or fp16=True
   ```
6. **Upgrade to a larger GPU**
   - t4 → a10g → a100

## Cost Estimation

### Formula

```
Total Cost = (Hours of training) × (Cost per hour)
```

### Example Calculations

**Quick demo:**
- Hardware: t4-small ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.19

**Development training:**
- Hardware: a10g-small ($3.50/hour)
- Time: 2 hours
- Cost: $7.00

**Production training:**
- Hardware: a10g-large ($5/hour)
- Time: 6 hours
- Cost: $30.00

**Large model with LoRA:**
- Hardware: a100-large ($10/hour)
- Time: 8 hours
- Cost: $80.00

### Cost Optimization Tips

1. **Start small:** Test on t4-small with a data subset
2. **Use LoRA:** 4-5x cheaper than full fine-tuning
3. **Optimize hyperparameters:** Use fewer epochs where possible
4. **Set an appropriate timeout:** Don't waste compute on stalled jobs
5. **Use checkpointing:** Resume if a job fails
6. **Monitor costs:** Check running jobs regularly

## Multi-GPU Training

TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs

**When to use:**
- Models >13B parameters
- Need faster training (near-linear speedup)
- Large datasets (>50K examples)

**Example:**

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

No code changes are needed; TRL/Accelerate handles distribution automatically.

## Choosing Between Options

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time is not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need the fastest training
- Memory requirements are high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)"
}
```
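The memory rule of thumb and the flavor table can also be folded into a small helper. This is an illustrative sketch only; the function names are made up for this guide, and the GPU memory figures come from the table above rather than from any library.

```python
# Rough flavor suggestion from the rule of thumb above:
#   full fine-tuning ≈ params (B) × 20 GB, LoRA ≈ params (B) × 4 GB.
# Memory figures are the single-GPU rows of the flavor table in this guide.
GPU_MEMORY_GB = {
    "t4-small": 16, "t4-medium": 16, "l4x1": 24,
    "a10g-small": 24, "a10g-large": 24, "a100-large": 40,
}

def estimate_memory_gb(params_billions: float, lora: bool = True) -> float:
    """Estimate training memory (GB) using the guide's rule of thumb."""
    return params_billions * (4 if lora else 20)

def suggest_flavor(params_billions: float, lora: bool = True) -> str:
    """Return the smallest single-GPU flavor whose memory covers the estimate."""
    needed = estimate_memory_gb(params_billions, lora)
    for flavor, mem in sorted(GPU_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem >= needed:
            return flavor
    return "multi-GPU flavor (e.g. a10g-largex4) with LoRA"

# A 7B model with LoRA needs ~28GB, so a single 24GB card is tight:
print(suggest_flavor(7))                # -> a100-large
print(suggest_flavor(0.5, lora=False))  # -> t4-small
```

Treat the output as a starting point: actual usage depends on sequence length, batch size, and the techniques listed under Memory Optimization.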