Update README.md

f1da249 verified about 1 month ago

7.9 kB

metadata

library_name: transformers
license: apache-2.0
base_model: protectai/deberta-v3-base-prompt-injection
pipeline_tag: text-classification
language:
  - en
tags:
  - prompt-injection
  - security
  - injection-detection
  - text-classification
datasets:
  - geekyrakshit/prompt-injection-dataset
  - xTRam1/safe-guard-prompt-injection
  - deepset/prompt-injections
model-index:
  - name: deberta-v3-prompt-injection-guard-v2
    results:
      - task:
          type: text-classification
          name: Prompt injection detection
        dataset:
          name: xTRam1/safe-guard-prompt-injection (test split)
          type: xTRam1/safe-guard-prompt-injection
          split: test
        metrics:
          - type: accuracy
            value: 0.9913
          - type: precision
            value: 0.9862
          - type: recall
            value: 0.9862
          - type: f1
            value: 0.9862
      - task:
          type: text-classification
          name: Prompt injection detection
        dataset:
          name: deepset/prompt-injections (test split, tuned threshold)
          type: deepset/prompt-injections
          split: test
        metrics:
          - type: accuracy
            value: 0.8707
          - type: precision
            value: 0.9592
          - type: recall
            value: 0.7833
          - type: f1
            value: 0.8624

dmasamba/deberta-v3-prompt-injection-guard-v2

DeBERTa-v3 based classifier for prompt-injection detection, fine-tuned on a mix of public prompt-injection datasets.

Given a text prompt, the model predicts whether it is:

0 – Safe
1 – Prompt Injection (attempts to override or hijack instructions)

This v2 checkpoint extends deberta-v3-prompt-injection-guard-v1 by continuing training on additional datasets and using a linear LR scheduler with warmup. It is intended as a guardrail component in LLM pipelines.

Model Details

Base model: protectai/deberta-v3-base-prompt-injection
Architecture: DeBERTa-v3 base + classification head
Task: Binary text classification (safe vs. prompt injection)
Languages: English
License: Apache-2.0 (inherits from base; check dataset licenses separately)
Author: @dmasamba
Version: v2, continued training from v1 on a mixed dataset

Label mapping

All datasets were normalized to:

label = 0 → "safe"
label = 1 → "prompt_injection"

Training Data (v2)

v2 is trained on a mixture of three datasets:

geekyrakshit/prompt-injection-dataset
xTRam1/safe-guard-prompt-injection
deepset/prompt-injections

For each dataset:

The train split was used.
10% of each train split was held out as validation.
The remaining 90% portions were concatenated to form a mixed training set, and the validation portions were concatenated into a mixed validation set.

Each dataset contains binary labels indicating whether a prompt is safe or a prompt-injection attempt (jailbreaks, “ignore previous instructions”, tool/role hijacks, etc.), plus benign prompts.

Training Procedure

Preprocessing

Text column unified to prompt / text depending on source, then mapped into a single prompt field in code.
Tokenization with the base DeBERTa tokenizer:
- max_length = 512
- truncation = True
- dynamic padding via DataCollatorWithPadding.

Optimization

Training was done in two stages:

Stage 1 (v1):
- Dataset: geekyrakshit/prompt-injection-dataset (train split, 10% val)
- Optimizer: AdamW
- LR: 2e-5
- Batch size: 8 (train), 16 (val)
- Epochs: 3
- No scheduler.
Stage 2 (this v2 checkpoint):
- Start from v1 weights.
- Datasets: geekyrakshit/prompt-injection-dataset, xTRam1/safe-guard-prompt-injection, deepset/prompt-injections.
- Mixed train/val construction as described above.
- Optimizer: AdamW
- LR: 1e-5
- Batch size: 8 (train), 16 (val)
- Epochs: 3
- Scheduler: linear decay with 10% warmup steps (get_linear_schedule_with_warmup).

Training was run on a single GPU (e.g., Kaggle P100-class hardware).

Evaluation

All metrics below are for the binary task with positive class = 1 (Prompt Injection).

1. `xTRam1/safe-guard-prompt-injection` – test split (2,060 samples)

Threshold: 0.50 (default argmax of logits).

Test loss: 0.0432
Accuracy: 0.9913 (99.13%)
Precision (inj): 0.9862 (98.62%)
Recall (inj): 0.9862 (98.62%)
F1 (inj): 0.9862 (98.62%)

Confusion matrix (rows = true label, cols = predicted):

	Pred: Safe	Pred: Injection
True: Safe (0)	1401	9
True: Injection (1)	9	641

True negatives (safe): 1401
False positives (safe → injection): 9
False negatives (injection → safe): 9
True positives (injection): 641

Classification report

Class	Precision	Recall	F1	Support
Safe (0)	0.99	0.99	0.99	1410
Prompt Injection (1)	0.99	0.99	0.99	650
Accuracy			0.99	2060

2. `deepset/prompt-injections` – test split (116 samples, tuned threshold)

For this smaller, stylistically different dataset, we tuned the decision threshold on the test scores to maximize F1. A sweep over thresholds in [0.1, 0.9] (step 0.05) selected:

Best threshold (by F1): t = 0.10

All metrics below are reported at this tuned threshold.

Test loss: 1.0319
Accuracy: 0.8707 (87.07%)
Precision (inj): 0.9592 (95.92%)
Recall (inj): 0.7833 (78.33%)
F1 (inj): 0.8624 (86.24%)