FaceForge Generator: Vision Transformer-based Face Manipulation
252M Parameters | ViT-Based | Baseline Training Complete
⚠️ RESEARCH USE ONLY - This model is for academic research and developing detection systems.
Model Description
FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. It combines dual ViT encoders, a cross-attention fusion mechanism, a transformer decoder, and a CNN upsampler to generate high-quality facial manipulations.
Key Features:
- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300ms inference time per image
- Achieved 0.204 validation loss after 3 epochs
Model Architecture
```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
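For orientation, the ViT encoder figure can be sanity-checked with timm: each ViT-B/16 backbone holds roughly 86M parameters, so the two encoders account for about 172M of the total. The snippet below is an illustrative check, not part of the training code.

```python
import timm

# Count the parameters of a single ViT-B/16 backbone (classification head removed).
encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
n_params = sum(p.numel() for p in encoder.parameters())
print(f"ViT-B/16 encoder parameters: {n_params / 1e6:.1f}M")  # ~86M per encoder, ~172M for the pair
```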
Training Progress
Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|---|---|---|---|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
Total Training Time: 11.5 hours (687.5 minutes)
Loss Reduction
- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
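These figures follow directly from the epoch table; the short check below reproduces them.

```python
# Loss-reduction arithmetic, using the exact values from the training table.
train_reduction = (0.2873 - 0.2143) / 0.2873 * 100  # ~25.4%
val_reduction = (0.2804 - 0.2043) / 0.2804 * 100    # ~27.1%
train_val_gap = 0.2143 - 0.2043                     # 0.010
print(f"train: {train_reduction:.1f}%  val: {val_reduction:.1f}%  gap: {train_val_gap:.3f}")
```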
Usage
Installation
```bash
pip install torch torchvision timm pillow numpy
```
Loading the Model
```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention module, transformer decoder, and CNN upsampler are
        # placeholders here (implement your architecture; see the full
        # specification in the paper and the hedged sketch after this block).

    def forward(self, source_face, target_face):
        # Encode both faces into patch-token sequences
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to 224×224
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        generated = model(source, target)
    # Denormalize and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
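The cross-attention, transformer-decoder, and upsampler blocks above are left as placeholders. The sketch below shows one way they could be filled in, matching the dimensions in the architecture overview (768-dim tokens, 2 cross-attention layers with 8 heads and a 768 → 3072 → 768 FFN, 256 learnable decoder queries over 6 layers, and a 4-stage transpose-convolution upsampler). The class names, initialization details, and final resize to 224×224 are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion block: source tokens attend to target tokens."""
    def __init__(self, dim=768, heads=8, ffn_dim=3072, dropout=0.1, layers=2):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                'norm1': nn.LayerNorm(dim),
                'ffn': nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                                     nn.Dropout(dropout), nn.Linear(ffn_dim, dim)),
                'norm2': nn.LayerNorm(dim),
            }) for _ in range(layers)
        ])

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for blk in self.blocks:
            attn_out, _ = blk['attn'](query=x, key=target_tokens, value=target_tokens)
            x = blk['norm1'](x + attn_out)
            x = blk['norm2'](x + blk['ffn'](x))
        return x

class SpatialDecoder(nn.Module):
    """Hypothetical decoder: 256 learnable queries -> 16x16 feature map
    (2D positional embeddings omitted for brevity)."""
    def __init__(self, dim=768, heads=8, layers=6, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, memory):
        b = memory.size(0)
        out = self.decoder(self.queries.expand(b, -1, -1), memory)   # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, -1, 16, 16)            # (B, 768, 16, 16)

def make_upsampler():
    """Hypothetical upsampler: 768x16x16 -> 3x224x224."""
    chans = [768, 512, 256, 128, 64]
    layers = []
    for c_in, c_out in zip(chans[:-1], chans[1:]):
        layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
    # Four stride-2 stages give 256x256; a final resize to 224x224 is assumed
    # here because the exact output head is not specified.
    layers += [nn.Upsample(size=(224, 224), mode='bilinear', align_corners=False),
               nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
               nn.Conv2d(32, 3, 3, padding=1), nn.Tanh()]
    return nn.Sequential(*layers)
```

Wiring these into `FaceForgeGenerator` as `self.cross_attention`, `self.transformer_decoder`, and `self.cnn_upsampler` makes the forward pass above executable end to end, although the released checkpoint will only load if the module layout matches the original training code.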
Training Details
Dataset
- Source: FaceForensics++ (c40 compression)
- Training: 7,000 face images (triplets: source, target, ground truth)
- Validation: 1,500 face images
- Resolution: 224×224 RGB
Hyperparameters
```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
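A minimal training loop under these hyperparameters might look like the sketch below; `train_loader` and `val_loader` are assumed to yield (source, target, ground-truth) triplets, and this is not the released training script.

```python
import torch
import torch.nn as nn

# Assumes `model`, `train_loader`, and `val_loader` are already defined.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.L1Loss()  # mean absolute error, as in the baseline
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
num_epochs = 3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6)  # anneal 1e-4 -> 1e-6

best_val = float('inf')
for epoch in range(num_epochs):
    model.train()
    for source, target, ground_truth in train_loader:
        source, target = source.to(device), target.to(device)
        ground_truth = ground_truth.to(device)
        optimizer.zero_grad()
        loss = criterion(model(source, target), ground_truth)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # Validation pass; keep the checkpoint with the lowest validation loss.
    model.eval()
    val_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for source, target, ground_truth in val_loader:
            source, target = source.to(device), target.to(device)
            ground_truth = ground_truth.to(device)
            val_loss += criterion(model(source, target), ground_truth).item()
            n_batches += 1
    val_loss /= max(n_batches, 1)
    if val_loss < best_val:
        best_val = val_loss
        torch.save({'model_state_dict': model.state_dict()}, 'generator_best.pth')
```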
Training Configuration
- Hardware: CPU
- Throughput: ~32 samples/minute
- Batch Processing: 219 train batches, 47 val batches per epoch
- Best Model: Saved at epoch 3
Current Status
⚠️ Baseline Training: This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.
Current Capabilities:
- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows a clear convergence trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Would benefit from extended training
Use Cases
Research Applications
- Detector Training: Generate challenging samples for deepfake detection
- Adversarial Training: Min-max game with detector
- Understanding Manipulation: Study how synthetic faces are created
- Benchmark Creation: Generate test sets for evaluation
Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Show attention mechanism visualization
Limitations
- Training Duration: Only 3 epochs completed; extended training needed for photo-realism
- Blur: Generated faces show some blur at baseline stage
- Dataset Scale: Trained on roughly 8.5K images (7,000 train / 1,500 validation); larger datasets would improve quality
- Single Frame: Doesn't consider temporal consistency for video
- Compute: Large model (252M params) requires significant memory
Ethical Guidelines
⚠️ Responsible Use Required
This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

Prohibited uses:
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation
Recommendations:
- Watermark generated content (a minimal example follows this list)
- Maintain audit logs
- Require user consent
- Implement content filters
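As a minimal example of the watermarking recommendation, the sketch below overlays a visible text label with Pillow; the label text and placement are arbitrary choices.

```python
from PIL import Image, ImageDraw, ImageFont

def add_watermark(img: Image.Image, text: str = "AI-GENERATED") -> Image.Image:
    """Overlay a simple visible text watermark on a generated face."""
    out = img.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    font = ImageFont.load_default()  # default bitmap font keeps the example dependency-free
    draw.text((8, out.height - 20), text, fill=(255, 255, 255), font=font)
    return out

watermarked = add_watermark(result)  # `result` from the usage example above
watermarked.save("generated_watermarked.jpg")
```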
Future Improvements
Planned enhancements:
- Extended training (15-20 epochs)
- Perceptual loss functions (VGG, LPIPS; see the sketch after this list)
- GAN-based adversarial training
- Multi-scale architecture
- Attention visualization
- Video temporal consistency
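To illustrate the planned perceptual-loss enhancement, the sketch below combines the baseline L1 term with LPIPS from the `lpips` package; the 0.1 weighting is an assumption, not a tuned value.

```python
import torch
import lpips  # pip install lpips

# LPIPS expects inputs in [-1, 1], which matches the Tanh output of the generator.
perceptual = lpips.LPIPS(net='vgg')
l1 = torch.nn.L1Loss()

def combined_loss(generated, ground_truth, perceptual_weight=0.1):
    # Hypothetical weighting; the baseline model was trained with L1 only.
    return l1(generated, ground_truth) + perceptual_weight * perceptual(generated, ground_truth).mean()
```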
Citation
```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```
Links
- Paper: https://doi.org/10.5281/zenodo.18530439
- Code: https://github.com/Huzaifanasir95/FaceForge
- Detector Model: https://huggingface.co/Huzaifanasir95/faceforge-detector
- Notebooks: See the repository for training/inference notebooks
Architecture Details
Vision Transformer Encoder
- Patch Size: 16×16
- Patches: 196 + 1 CLS token
- Embedding Dim: 768
- Layers: 12
- Attention Heads: 12
- MLP Ratio: 4.0
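As a quick check of these numbers, the snippet below (illustrative, not part of the released code) prints the token shape produced by the timm encoder: 196 patch tokens plus 1 CLS token, each 768-dimensional.

```python
import torch
import timm

encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]) -> 196 patches + 1 CLS token, 768-dim each
```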
Cross-Attention Mechanism
- Query: Source features
- Key/Value: Target features
- Attention: Multi-head (8 heads)
- FFN Expansion: 4× (768 → 3072 → 768)
CNN Upsampler
- Input: 768×16×16
- Output: 3×224×224
- Stages: 4 transpose convolutions
- Kernel: 4×4, Stride: 2, Padding: 1
- Activation: ReLU → Tanh (output)
License
This model is released under the CC BY 4.0 license. Use it responsibly and ethically.
Author
Huzaifa Nasir
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
Email: nasirhuzaifa95@gmail.com
Acknowledgments
- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community