FaceForge Generator: Vision Transformer-based Face Manipulation

Paper GitHub License

🎨 252M Parameters | ViT-Based | Baseline Training Complete

⚠️ RESEARCH USE ONLY - This model is for academic research and developing detection systems.

Model Description

FaceForge Generator is a sophisticated Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, cross-attention mechanisms, transformer decoders, and CNN upsamplers to generate high-quality facial manipulations.

Key Features:

  • 🏗️ 252 million trainable parameters
  • 🔄 Dual encoder architecture for source and target faces
  • 🎯 Cross-attention fusion mechanism
  • 🖼️ Generates 224×224 RGB face images
  • ⚡ ~300ms inference time per image
  • 📉 Achieved 0.204 validation loss after 3 epochs

Model Architecture

FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
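
To relate these counts to an instantiated model, trainable parameters can be tallied directly. A minimal sketch, assuming a FaceForgeGenerator instance (from the Usage section below) with all submodules defined:

import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameter count in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# model = FaceForgeGenerator()                 # see Usage below
# print(f"{count_parameters_m(model):.1f}M")   # expected ~252.5M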

Training Progress

Baseline Training (3 Epochs)

Epoch | Train Loss | Val Loss | Time (min)
1     | 0.2873     | 0.2804   | 227.5
2     | 0.2432     | 0.2304   | 231.2
3     | 0.2143     | 0.2043   | 228.8

Total Training Time: 11.5 hours (687.5 minutes)

Loss Reduction

  • Training loss: 0.287 → 0.214 (25.4% reduction)
  • Validation loss: 0.280 → 0.204 (27.1% reduction)
  • Minimal overfitting (train-val gap: 0.010)

Usage

Installation

pip install torch torchvision timm pillow numpy

Loading the Model

import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and Target ViT Encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        
        # Cross-attention fusion, transformer decoder, and CNN upsampler
        # must be defined here for the forward pass and checkpoint loading
        # to work (see the paper for the full architecture; a hedged sketch
        # of these modules follows this code block).
    
    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        
        # Decode to spatial map
        spatial_features = self.transformer_decoder(fused_features)
        
        # Upsample to 224×224
        generated_face = self.cnn_upsampler(spatial_features)
        
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
    
    with torch.no_grad():
        generated = model(source, target)
    
    # Denormalize and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    generated = transforms.ToPILImage()(generated)
    
    return generated

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
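
The class above deliberately leaves the fusion and decoding submodules unimplemented. Below is a hedged sketch of what they could look like, derived only from the layer counts and dimensions listed in this card; the module names (CrossAttentionFusion, SpatialDecoder) are illustrative, and the released checkpoint's state-dict keys and exact layer ordering may differ.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Source tokens attend to target tokens: 2 layers, 8 heads, FFN 768 -> 3072 -> 768."""
    def __init__(self, dim=768, heads=8, num_layers=2, dropout=0.1):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Dropout(dropout), nn.Linear(4 * dim, dim)),
                "norm2": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for blk in self.blocks:
            attn_out, _ = blk["attn"](query=x, key=target_tokens, value=target_tokens)
            x = blk["norm1"](x + attn_out)
            x = blk["norm2"](x + blk["ffn"](x))
        return x

class SpatialDecoder(nn.Module):
    """256 learnable queries decoded against the fused tokens, reshaped to a 16x16 map.
    The learnable queries double as (flattened) 2D positional embeddings in this simplification."""
    def __init__(self, dim=768, heads=8, num_layers=6, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, memory):
        b = memory.size(0)
        out = self.decoder(self.queries.expand(b, -1, -1), memory)  # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, -1, 16, 16)           # (B, 768, 16, 16)

# Inside FaceForgeGenerator.__init__ these would be wired as, for example:
#   self.cross_attention = CrossAttentionFusion()
#   self.transformer_decoder = SpatialDecoder()
#   self.cnn_upsampler = CNNUpsampler()  # sketched under "Architecture Details" below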

Training Details

Dataset

  • Source: FaceForensics++ (c40 compression)
  • Training: 7,000 face images (triplets: source, target, ground truth)
  • Validation: 1,500 face images
  • Resolution: 224×224 RGB
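
The on-disk layout of the triplets is not documented here; the sketch below assumes one plausible arrangement (source/, target/, gt/ subfolders with matching file names), so treat the paths and class name as placeholders:

import os
from PIL import Image
from torch.utils.data import Dataset

class FaceTripletDataset(Dataset):
    """Yields (source, target, ground_truth) image tensors at 224x224."""
    def __init__(self, root, transform):
        self.root = root
        self.transform = transform
        self.names = sorted(os.listdir(os.path.join(root, "source")))

    def __len__(self):
        return len(self.names)

    def _load(self, split, name):
        img = Image.open(os.path.join(self.root, split, name)).convert("RGB")
        return self.transform(img)

    def __getitem__(self, idx):
        name = self.names[idx]
        return (self._load("source", name),
                self._load("target", name),
                self._load("gt", name))

# train_loader = torch.utils.data.DataLoader(
#     FaceTripletDataset("data/train", transform), batch_size=16, shuffle=True)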

Hyperparameters

optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
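
A minimal PyTorch sketch of this configuration, assuming a model instance and a train_loader of (source, target, ground_truth) batches as above; stepping the cosine schedule once per epoch over 3 epochs is an assumption, not a detail taken from the paper:

import torch
import torch.nn as nn

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3, eta_min=1e-6)
criterion = nn.L1Loss()  # mean absolute error

for epoch in range(3):
    model.train()
    for source, target, ground_truth in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(source, target), ground_truth)
        loss.backward()
        optimizer.step()
    scheduler.step()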

Training Configuration

  • Hardware: CPU
  • Throughput: ~32 samples/minute
  • Batch Processing: 219 train batches, 47 val batches per epoch
  • Best Model: Saved at epoch 3

Current Status

⚠️ Baseline Training: This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

Current Capabilities:

  • ✅ Learns pose transfer
  • ✅ Captures facial structures
  • ✅ Shows convergence trend
  • ⏳ Some blur in generated images (expected at baseline)
  • ⏳ Benefits from extended training

Use Cases

Research Applications

  1. Detector Training: Generate challenging samples for deepfake detection (see the sketch after this list)
  2. Adversarial Training: Min-max game with detector
  3. Understanding Manipulation: Study how synthetic faces are created
  4. Benchmark Creation: Generate test sets for evaluation
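
For the detector-training use case, generated faces can be paired with pristine frames to build a labelled set; a small sketch reusing the generate_face_swap helper and transform from the Usage section (file names and the 1 = fake / 0 = real convention are illustrative):

from PIL import Image

pairs = [("source_001.jpg", "target_001.jpg"),
         ("source_002.jpg", "target_002.jpg")]          # placeholder file names

detector_samples = []
for src, tgt in pairs:
    fake = generate_face_swap(src, tgt)
    detector_samples.append((transform(fake), 1))                            # manipulated
    detector_samples.append((transform(Image.open(tgt).convert("RGB")), 0))  # pristine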

Educational Uses

  • Demonstrate face generation techniques
  • Teach computer vision concepts
  • Illustrate transformer architectures
  • Show attention mechanism visualization

Limitations

  1. Training Duration: Only 3 epochs completed; extended training needed for photo-realism
  2. Blur: Generated faces show some blur at baseline stage
  3. Dataset Scale: Trained on roughly 8.5K images (7,000 train + 1,500 validation); larger datasets would improve quality
  4. Single Frame: Doesn't consider temporal consistency for video
  5. Compute: Large model (252M params) requires significant memory

Ethical Guidelines

⚠️ Responsible Use Required

This model is intended for:

  • ✅ Academic research
  • ✅ Deepfake detection development
  • ✅ Educational demonstrations
  • ✅ Ethical AI studies

Prohibited uses:

  • ❌ Creating misinformation
  • ❌ Identity theft or impersonation
  • ❌ Non-consensual face manipulation
  • ❌ Malicious content creation

Recommendations:

  • Watermark generated content (see the sketch after this list)
  • Maintain audit logs
  • Require user consent
  • Implement content filters
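
Visible watermarking of generated frames can be done with Pillow before saving; a minimal sketch (label text and placement are arbitrary, and a robust or invisible watermark would need a dedicated scheme):

from PIL import ImageDraw

def watermark(img, text="SYNTHETIC - FaceForge"):
    """Stamp a visible label onto a generated PIL image; returns a copy."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    draw.text((8, out.height - 18), text, fill=(255, 255, 255))
    return out

# result = generate_face_swap("source.jpg", "target.jpg")
# watermark(result).save("generated_watermarked.jpg")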

Future Improvements

Planned enhancements:

  • Extended training (15-20 epochs)
  • Perceptual loss functions (VGG, LPIPS; sketched after this list)
  • GAN-based adversarial training
  • Multi-scale architecture
  • Attention visualization
  • Video temporal consistency
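
As an illustration of the planned perceptual loss, the L1 objective can be combined with distances in VGG feature space. The sketch below uses torchvision's pretrained VGG16 (torchvision >= 0.13); the layer cut-off and 0.1 weighting are arbitrary choices, not the authors' final design:

import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class L1PlusPerceptual(nn.Module):
    def __init__(self, perceptual_weight=0.1):
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.l1 = nn.L1Loss()
        self.w = perceptual_weight
        # VGG expects ImageNet normalisation; generator outputs live in [-1, 1].
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def _to_vgg(self, x):
        return ((x * 0.5 + 0.5) - self.mean) / self.std

    def forward(self, generated, target):
        pixel = self.l1(generated, target)
        perceptual = self.l1(self.vgg(self._to_vgg(generated)),
                             self.vgg(self._to_vgg(target)))
        return pixel + self.w * perceptual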

Citation

@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}


Architecture Details

Vision Transformer Encoder

  • Patch Size: 16×16
  • Patches: 196 + 1 CLS token
  • Embedding Dim: 768
  • Layers: 12
  • Attention Heads: 12
  • MLP Ratio: 4.0
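
These figures can be checked against the token shapes timm actually returns; a quick sanity check (the exact output layout can vary slightly between timm versions):

import timm
import torch

encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # e.g. torch.Size([1, 197, 768]): 196 patch tokens + 1 CLS, 768-dim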

Cross-Attention Mechanism

  • Query: Source features
  • Key/Value: Target features
  • Attention: Multi-head (8 heads)
  • FFN Expansion: 4× (768 → 3072 → 768)
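
This role assignment maps directly onto PyTorch's multi-head attention; a minimal shape-level illustration (random tensors, not the released weights):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
source_tokens = torch.randn(1, 197, 768)   # queries: source-face tokens
target_tokens = torch.randn(1, 197, 768)   # keys/values: target-face tokens
fused, weights = attn(source_tokens, target_tokens, target_tokens)
print(fused.shape, weights.shape)          # (1, 197, 768) (1, 197, 197)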

CNN Upsampler

  • Input: 768×16×16
  • Output: 3×224×224
  • Stages: 4 transpose convolutions
  • Kernel: 4×4, Stride: 2, Padding: 1
  • Activation: ReLU → Tanh (output)
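
A hedged sketch of an upsampler matching these specifications. Note that four stride-2 stages take 16×16 to 256×256, so this version adds a bilinear resize to 224×224 at the end; how the released model reconciles these sizes is not stated, so treat this as one possible reading:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNUpsampler(nn.Module):
    def __init__(self):
        super().__init__()
        channels = [768, 512, 256, 128, 64]
        stages = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            stages += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*stages)
        self.head = nn.Sequential(nn.Conv2d(64, 32, kernel_size=3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(32, 3, kernel_size=3, padding=1),
                                  nn.Tanh())

    def forward(self, x):                        # x: (B, 768, 16, 16)
        x = self.up(x)                           # (B, 64, 256, 256)
        x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
        return self.head(x)                      # (B, 3, 224, 224), values in [-1, 1]

# print(CNNUpsampler()(torch.randn(1, 768, 16, 16)).shape)  # torch.Size([1, 3, 224, 224])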

License

This model is released under CC BY 4.0 license. Use responsibly and ethically.

Author

Huzaifa Nasir
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com

Acknowledgments

  • Vision Transformer (Dosovitskiy et al.)
  • FaceForensics++ dataset
  • PyTorch and timm libraries
  • Open-source AI community