Axion1-350K-A250K

DeepSeek-V3 architecture scaled to 344k total parameters (160k active/token), running entirely on CPU.

Built from scratch as a proof-of-concept that the real DeepSeek-V3 architectural innovations (MLA + DeepSeekMoE + auxiliary-loss-free load balancing) work correctly even at extreme miniaturization.


Architecture

This is not a distilled or quantized version of DeepSeek. Every component was implemented from scratch in pure PyTorch, faithfully following the DeepSeek-V3 technical report (arXiv:2412.19437).

Component            DeepSeek-V3                          Axion1
Attention            MLA (Multi-head Latent Attention)    ✅ Identical MLA
FFN                  DeepSeekMoE (256 routed experts)     ✅ MoE (4 routed, top-2)
Load balancing       Auxiliary-loss-free (dynamic bias)   ✅ Section 2.3.2
Position             RoPE                                 ✅ RoPE
Normalization        RMSNorm                              ✅ RMSNorm
Activation           SwiGLU                               ✅ SwiGLU
Total params         671B                                 344k
Active params/token  37B                                  ~160k
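
The auxiliary-loss-free load balancing row can be illustrated with a small pure-Python sketch. It is not the code in model.py: the function names, the step size gamma, and the toy router logits are assumptions made for illustration. What it does follow from the DeepSeek-V3 report is the rule itself: a per-expert bias is added to the affinity score only when choosing the top-k experts, the gate weights use the unbiased scores, and after each step the bias is lowered for overloaded experts and raised for underloaded ones.

```python
import math
import random

def route_top_k(scores, bias, k=2):
    """Select k experts by score + bias; gate weights use the unbiased scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    gates = {i: scores[i] / total for i in chosen}
    return chosen, gates

def update_bias(bias, load, gamma=0.01):
    """Nudge each bias down if its expert is overloaded, up if underloaded."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

def simulate(steps=300, tokens_per_step=512, n_experts=4, k=2, gamma=0.01, seed=0):
    """Toy run in which the router logits favour expert 0 (an assumed skew)."""
    rng = random.Random(seed)
    bias = [0.0] * n_experts
    history = []
    for _step in range(steps):
        load = [0] * n_experts
        for _tok in range(tokens_per_step):
            logits = [rng.gauss(1.5 if e == 0 else 0.0, 1.0) for e in range(n_experts)]
            scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid affinities
            chosen, _gates = route_top_k(scores, bias, k)
            for e in chosen:
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)
    return history, bias

history, final_bias = simulate()
print("load at first step:", history[0])
print("load at last step: ", history[-1])
print("final bias:", [round(b, 2) for b in final_bias])
```

In this toy run the favoured expert starts out heavily overloaded and its bias drifts negative until the per-expert loads are close to even, without any balancing term being added to the loss.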

Model Details

d_model           : 64
n_layers          : 4
n_heads           : 4   (MLA)
d_head            : 16
kv_lora_rank      : 8   (MLA KV compression)
q_lora_rank       : 16  (MLA Q compression)
n_shared_experts  : 1
n_routed_experts  : 4   (top-2 activated)
d_ff              : 64  (per expert)
vocab_size        : 1024 (BPE, trained on GSM8K)
max_seq_len       : 512
total_params      : 343,616
active_params/tok : ~160,000
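
The stated total can be reproduced from the configuration above with a short parameter count. The breakdown is one possible accounting, not taken from model.py, and every structural choice in it is an assumption: bias-free linear layers, an output head tied to the embedding table, no separate RoPE projection, two RMSNorms per layer plus a final one, and a router bias kept as a non-trainable buffer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AxionConfig:
    d_model: int = 64
    n_layers: int = 4
    n_heads: int = 4
    d_head: int = 16
    kv_lora_rank: int = 8
    q_lora_rank: int = 16
    n_shared_experts: int = 1
    n_routed_experts: int = 4
    top_k: int = 2
    d_ff: int = 64
    vocab_size: int = 1024

def count_params(cfg: AxionConfig) -> dict:
    inner = cfg.n_heads * cfg.d_head
    embedding = cfg.vocab_size * cfg.d_model       # assumed tied with the output head
    attention = (
        cfg.d_model * cfg.q_lora_rank              # Q down-projection
        + cfg.q_lora_rank * inner                  # Q up-projection
        + cfg.d_model * cfg.kv_lora_rank           # shared KV down-projection
        + 2 * cfg.kv_lora_rank * inner             # K and V up-projections
        + inner * cfg.d_model                      # output projection
    )
    expert = 3 * cfg.d_model * cfg.d_ff            # SwiGLU: gate, up, down
    router = cfg.d_model * cfg.n_routed_experts
    norms = 2 * cfg.d_model                        # pre-attention and pre-FFN RMSNorm
    all_experts = cfg.n_shared_experts + cfg.n_routed_experts
    used_experts = cfg.n_shared_experts + cfg.top_k
    fixed = attention + router + norms
    total = embedding + cfg.n_layers * (fixed + all_experts * expert) + cfg.d_model
    active = embedding + cfg.n_layers * (fixed + used_experts * expert) + cfg.d_model
    return {"total": total, "active": active, "embedding": embedding, "expert": expert}

counts = count_params(AxionConfig())
print(counts)
```

Under these assumptions the total comes to 343,616, matching total_params. The same accounting gives 245,312 parameters touched per token when the embedding table is included, which is closer to the A250K in the model name than to the ~160,000 listed; the card's active figure may follow a different accounting.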

Training

  • Dataset: GSM8K (grade school math), converted to plain text in a question / reasoning / answer format
  • Tokenizer: BPE trained from scratch, vocab size 1024
  • Hardware: AMD Ryzen 5 5600G, CPU only, 12 threads, 32 GB RAM
  • Speed: ~1,000–1,100 tokens/sec on CPU
  • Epochs: 20 | Final val loss: ~3.2 | Total time: ~115 minutes
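
For readers unfamiliar with how a BPE vocabulary is built from scratch, the core training loop can be sketched as follows. This is a character-level toy, not the repository's BPETokenizer: the function names are hypothetical, and a real tokenizer would add special tokens, pre-tokenization, and a target vocabulary size (1024 here) rather than a fixed number of merges.

```python
from collections import Counter

def merge_pair(tokens, a, b):
    """Replace each non-overlapping occurrence of (a, b) with the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Repeatedly merge the most frequent adjacent pair, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((a, b))
        tokens = merge_pair(tokens, a, b)
    return merges, tokens

merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

Each merge adds one entry to the vocabulary, so the number of merges controls the final vocabulary size.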

Training Curve

Epoch   Val Loss
1       5.49
2       4.59
3       4.30
5       3.88
7       3.66
9       3.54
20      ~3.2
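
To put these numbers in context, the losses can be converted to perplexity and compared with a uniform guess over the vocabulary. The sketch assumes the reported value is mean per-token cross-entropy in nats, which the card does not state; the epoch 20 entry is the approximate figure from the table.

```python
import math

VOCAB_SIZE = 1024

# Validation losses from the table above (epoch -> loss); epoch 20 is approximate.
val_loss = {1: 5.49, 2: 4.59, 3: 4.30, 5: 3.88, 7: 3.66, 9: 3.54, 20: 3.2}

def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(loss_nats)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

print(f"uniform baseline: loss {uniform_loss:.2f}, perplexity {VOCAB_SIZE}")
for epoch, loss in val_loss.items():
    print(f"epoch {epoch:>2}: loss {loss:.2f}, perplexity {perplexity(loss):.1f}")
```

Under that assumption, a loss of about 3.2 corresponds to a perplexity of roughly 24.5, against 1024 for the uniform baseline.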

Usage

from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-official/Axion1-350k-A250k",
    trust_remote_code=True
)
model.eval()

tok = BPETokenizer.load("model.vocab", "model.model")

# Block EOS and PAD for the first min_tokens generated tokens.
# The counter lives on the instance, so create a new MinNewTokens for each generate() call.
class MinNewTokens(LogitsProcessor):
    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
        self.min_tokens = min_tokens
        self.bad = [eos_id, pad_id]  # token ids to suppress
        self.generated = 0           # decoding steps seen so far

    def __call__(self, input_ids, scores):
        if self.generated < self.min_tokens:
            for bid in self.bad:
                scores[:, bid] = float("-inf")
        self.generated += 1
        return scores

eos_id = tok.token2id["<eos>"]
pad_id = tok.token2id["<pad>"]

# Prompt in Portuguese: "# Question: How much is 5 + 3? -- # Answer:"
prompt = "# Pergunta:\nQuanto é 5 + 3?\n--\n# Resposta:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
input_ids = torch.tensor([ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=80,
        temperature=0.9,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        eos_token_id=eos_id,
        pad_token_id=pad_id,
        use_cache=False,
        logits_processor=LogitsProcessorList([
            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
        ]),
    )

# Keep only the tokens generated after the prompt
new_tokens = output[0][len(ids):].tolist()
# Strip the trailing EOS token if present
if new_tokens and new_tokens[-1] == eos_id:
    new_tokens = new_tokens[:-1]

print("Answer:", tok.decode(new_tokens))

Scaling Roadmap

Version              Params   Status
Axion1-v0.1 (this)   344k     ✅ Released
Axion1-v0.2          ~1.5M    🔜 Next
Axion1-v0.3          ~6M      📅 Planned
Axion1-v0.4          ~24M     📅 Planned
Axion1-v0.5          ~100M    📅 Planned

Files

├── model.py             # Full DeepSeek-V3 architecture (MLA + MoE)
├── modeling_axion.py    # HuggingFace wrapper
├── config.json          # Model configuration
├── model.safetensors    # Trained weights
├── model.vocab          # BPE vocabulary
└── model.model          # BPE merge rules

Limitations

With only 344k parameters, the model has learned mathematical vocabulary and co-occurrence patterns from GSM8K but cannot reliably solve problems or maintain syntactic coherence. This is expected: the purpose of this release is to demonstrate that the DeepSeek-V3 architectural components work correctly even at this scale, and to serve as a foundation for the scaling roadmap above.


Citation

@article{deepseekv3,
  title  = {DeepSeek-V3 Technical Report},
  author = {DeepSeek-AI},
  year   = {2024},
  url    = {https://arxiv.org/abs/2412.19437}
}

License

MIT: free to use, modify, and build upon.


Made by AxionLab
