Delta-Vector committed on
Commit d5318e0 · verified · 1 Parent(s): 0bbc517

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,207 @@
1
+ ---
2
+ base_model: Qwen/Qwen3-4B-Instruct-2507
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:Qwen/Qwen3-4B-Instruct-2507
7
+ - lora
8
+ - transformers
9
+ ---
10
+
11
+ # Model Card for Model ID
12
+
13
+ <!-- Provide a quick summary of what the model is/does. -->
14
+
15
+
16
+
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+
21
+ <!-- Provide a longer summary of what this model is. -->
22
+
23
+
24
+
25
+ - **Developed by:** [More Information Needed]
26
+ - **Funded by [optional]:** [More Information Needed]
27
+ - **Shared by [optional]:** [More Information Needed]
28
+ - **Model type:** [More Information Needed]
29
+ - **Language(s) (NLP):** [More Information Needed]
30
+ - **License:** [More Information Needed]
31
+ - **Finetuned from model [optional]:** [More Information Needed]
32
+
33
+ ### Model Sources [optional]
34
+
35
+ <!-- Provide the basic links for the model. -->
36
+
37
+ - **Repository:** [More Information Needed]
38
+ - **Paper [optional]:** [More Information Needed]
39
+ - **Demo [optional]:** [More Information Needed]
40
+
41
+ ## Uses
42
+
43
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
+
45
+ ### Direct Use
46
+
47
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
+
49
+ [More Information Needed]
50
+
51
+ ### Downstream Use [optional]
52
+
53
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
+
55
+ [More Information Needed]
56
+
57
+ ### Out-of-Scope Use
58
+
59
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
+
61
+ [More Information Needed]
62
+
63
+ ## Bias, Risks, and Limitations
64
+
65
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
+
67
+ [More Information Needed]
68
+
69
+ ### Recommendations
70
+
71
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
+
73
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
+
75
+ ## How to Get Started with the Model
76
+
77
+ Use the code below to get started with the model.
78
+
79
+ [More Information Needed]
80
+
81
+ ## Training Details
82
+
83
+ ### Training Data
84
+
85
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
+
87
+ [More Information Needed]
88
+
89
+ ### Training Procedure
90
+
91
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
+
93
+ #### Preprocessing [optional]
94
+
95
+ [More Information Needed]
96
+
97
+
98
+ #### Training Hyperparameters
99
+
100
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
+
102
+ #### Speeds, Sizes, Times [optional]
103
+
104
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
+
106
+ [More Information Needed]
107
+
108
+ ## Evaluation
109
+
110
+ <!-- This section describes the evaluation protocols and provides the results. -->
111
+
112
+ ### Testing Data, Factors & Metrics
113
+
114
+ #### Testing Data
115
+
116
+ <!-- This should link to a Dataset Card if possible. -->
117
+
118
+ [More Information Needed]
119
+
120
+ #### Factors
121
+
122
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
+
124
+ [More Information Needed]
125
+
126
+ #### Metrics
127
+
128
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
+
130
+ [More Information Needed]
131
+
132
+ ### Results
133
+
134
+ [More Information Needed]
135
+
136
+ #### Summary
137
+
138
+
139
+
140
+ ## Model Examination [optional]
141
+
142
+ <!-- Relevant interpretability work for the model goes here -->
143
+
144
+ [More Information Needed]
145
+
146
+ ## Environmental Impact
147
+
148
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
+
150
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
+
152
+ - **Hardware Type:** [More Information Needed]
153
+ - **Hours used:** [More Information Needed]
154
+ - **Cloud Provider:** [More Information Needed]
155
+ - **Compute Region:** [More Information Needed]
156
+ - **Carbon Emitted:** [More Information Needed]
157
+
158
+ ## Technical Specifications [optional]
159
+
160
+ ### Model Architecture and Objective
161
+
162
+ [More Information Needed]
163
+
164
+ ### Compute Infrastructure
165
+
166
+ [More Information Needed]
167
+
168
+ #### Hardware
169
+
170
+ [More Information Needed]
171
+
172
+ #### Software
173
+
174
+ [More Information Needed]
175
+
176
+ ## Citation [optional]
177
+
178
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
+
180
+ **BibTeX:**
181
+
182
+ [More Information Needed]
183
+
184
+ **APA:**
185
+
186
+ [More Information Needed]
187
+
188
+ ## Glossary [optional]
189
+
190
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
+
192
+ [More Information Needed]
193
+
194
+ ## More Information [optional]
195
+
196
+ [More Information Needed]
197
+
198
+ ## Model Card Authors [optional]
199
+
200
+ [More Information Needed]
201
+
202
+ ## Model Card Contact
203
+
204
+ [More Information Needed]
205
+ ### Framework versions
206
+
207
+ - PEFT 0.17.1
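The "How to Get Started with the Model" section of the card above is still a placeholder. A minimal sketch of how an adapter like this one is typically loaded with PEFT, assuming `transformers` and `peft` are installed. `ADAPTER_PATH` is a hypothetical placeholder, since the commit does not name the repository, and `load_adapter` is an illustrative helper, not something the card defines:

```python
# Base model id taken from `base_model` in the card metadata above.
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
# Hypothetical: a local folder or Hub repo id containing the files from this commit.
ADAPTER_PATH = "path/to/this/adapter"


def load_adapter(adapter_path: str = ADAPTER_PATH):
    """Load the base model and attach the LoRA adapter on top of it."""
    # Imported lazily so the module can be inspected without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # The commit ships tokenizer files, so the tokenizer can come from the adapter folder.
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_path)
    return tokenizer, model
```

Calling `load_adapter()` downloads the base weights, so it is best run on a machine with enough memory for a 4B-parameter model.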
adapter_config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "Qwen/Qwen3-4B-Instruct-2507",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 64,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.0,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "qalora_group_size": 16,
24
+ "r": 1,
25
+ "rank_pattern": {},
26
+ "revision": null,
27
+ "target_modules": [
28
+ "q_proj",
29
+ "down_proj",
30
+ "v_proj",
31
+ "up_proj",
32
+ "k_proj",
33
+ "o_proj",
34
+ "gate_proj"
35
+ ],
36
+ "target_parameters": null,
37
+ "task_type": "CAUSAL_LM",
38
+ "trainable_token_indices": null,
39
+ "use_dora": false,
40
+ "use_qalora": false,
41
+ "use_rslora": false
42
+ }
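The config above sets `r` to 1, `lora_alpha` to 64, and `use_rslora` to false. A minimal sketch of what those values imply for the LoRA scaling factor and the trainable parameter count. The projection shapes and layer count below are assumptions about the base model and are not part of this commit:

```python
import math

# Values copied from adapter_config.json above.
LORA_ALPHA = 64
RANK = 1
USE_RSLORA = False

# ASSUMED base-model shapes as (in_features, out_features); not stated in this commit.
ASSUMED_SHAPES = {
    "q_proj": (2560, 4096),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (4096, 2560),
    "gate_proj": (2560, 9728),
    "up_proj": (2560, 9728),
    "down_proj": (9728, 2560),
}
ASSUMED_NUM_LAYERS = 36  # assumption


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Standard LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def lora_param_count(shapes: dict, r: int, num_layers: int) -> int:
    """Each adapted linear layer adds A (r x in) and B (out x r) matrices."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


print(lora_scaling(LORA_ALPHA, RANK, USE_RSLORA))
print(lora_param_count(ASSUMED_SHAPES, RANK, ASSUMED_NUM_LAYERS))
```

Under these assumed shapes the count is 2,064,384 parameters. At two bytes per parameter (also an assumption) that is about 4.13 MB, the same order as the 4,194,640-byte `adapter_model.safetensors` in this commit.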
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:febee3feb5f0c8e065e204cb15c426e336c7b73d04ff4408b69bd5c3fc5ea5ee
3
+ size 4194640
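The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. A small sketch of reading such a pointer; `parse_lfs_pointer` is a hypothetical helper written for illustration:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field is "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:febee3feb5f0c8e065e204cb15c426e336c7b73d04ff4408b69bd5c3fc5ea5ee
size 4194640
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```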
added_tokens.json ADDED
@@ -0,0 +1,28 @@
 
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
chat_template.jinja ADDED
@@ -0,0 +1,61 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {{- messages[0].content + '\n\n' }}
5
+ {%- endif %}
6
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
7
+ {%- for tool in tools %}
8
+ {{- "\n" }}
9
+ {{- tool | tojson }}
10
+ {%- endfor %}
11
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
12
+ {%- else %}
13
+ {%- if messages[0].role == 'system' %}
14
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
15
+ {%- endif %}
16
+ {%- endif %}
17
+ {%- for message in messages %}
18
+ {%- if message.content is string %}
19
+ {%- set content = message.content %}
20
+ {%- else %}
21
+ {%- set content = '' %}
22
+ {%- endif %}
23
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
24
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
25
+ {%- elif message.role == "assistant" %}
26
+ {{- '<|im_start|>' + message.role + '\n' + content }}
27
+ {%- if message.tool_calls %}
28
+ {%- for tool_call in message.tool_calls %}
29
+ {%- if (loop.first and content) or (not loop.first) %}
30
+ {{- '\n' }}
31
+ {%- endif %}
32
+ {%- if tool_call.function %}
33
+ {%- set tool_call = tool_call.function %}
34
+ {%- endif %}
35
+ {{- '<tool_call>\n{"name": "' }}
36
+ {{- tool_call.name }}
37
+ {{- '", "arguments": ' }}
38
+ {%- if tool_call.arguments is string %}
39
+ {{- tool_call.arguments }}
40
+ {%- else %}
41
+ {{- tool_call.arguments | tojson }}
42
+ {%- endif %}
43
+ {{- '}\n</tool_call>' }}
44
+ {%- endfor %}
45
+ {%- endif %}
46
+ {{- '<|im_end|>\n' }}
47
+ {%- elif message.role == "tool" %}
48
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
49
+ {{- '<|im_start|>user' }}
50
+ {%- endif %}
51
+ {{- '\n<tool_response>\n' }}
52
+ {{- content }}
53
+ {{- '\n</tool_response>' }}
54
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
55
+ {{- '<|im_end|>\n' }}
56
+ {%- endif %}
57
+ {%- endif %}
58
+ {%- endfor %}
59
+ {%- if add_generation_prompt %}
60
+ {{- '<|im_start|>assistant\n' }}
61
+ {%- endif %}
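For conversations without tools, the template above reduces to plain ChatML framing. A rough Python sketch of that branch only; it ignores the `tools`, `tool_calls`, and `tool` role handling, and `render_chatml` is a hypothetical name, not part of the commit:

```python
def render_chatml(messages, add_generation_prompt=False):
    """Mirror the no-tools path of the Jinja template for plain chat messages."""
    out = []
    # A leading system message is emitted once, before the main loop.
    if messages and messages[0]["role"] == "system":
        out.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
    for i, msg in enumerate(messages):
        # Non-string content is rendered as an empty string, as in the template.
        content = msg["content"] if isinstance(msg.get("content"), str) else ""
        role = msg["role"]
        if role == "user" or (role == "system" and i > 0):
            out.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
        elif role == "assistant":
            out.append(f"<|im_start|>assistant\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)


prompt = render_chatml(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

In practice the tokenizer's own `apply_chat_template` method renders the real template; the sketch is only meant to make the control flow easier to follow.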
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
3
+ size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,239 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "clean_up_tokenization_spaces": false,
231
+ "eos_token": "<|im_end|>",
232
+ "errors": "replace",
233
+ "extra_special_tokens": {},
234
+ "model_max_length": 1010000,
235
+ "pad_token": "<|endoftext|>",
236
+ "split_special_tokens": false,
237
+ "tokenizer_class": "Qwen2Tokenizer",
238
+ "unk_token": null
239
+ }
trainer_state.json ADDED
@@ -0,0 +1,2034 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 0.2,
6
+ "eval_steps": 50,
7
+ "global_step": 100,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "advantage/absmean": 0.0,
14
+ "entropy": 0.49213120341300964,
15
+ "epoch": 0.002,
16
+ "grad_norm": 0.0,
17
+ "importance_ratio": 0.9995924234390259,
18
+ "learning_rate": 0.0,
19
+ "loss": 0.0,
20
+ "mismatch_kl": 0.0013128521386533976,
21
+ "reward": 0.009999999776482582,
22
+ "reward/refusal_reward_func": 0.009999999776482582,
23
+ "reward/std": 0.0,
24
+ "step": 1,
25
+ "timing/generation_ms": 3254.1263923048973,
26
+ "timing/scoring_ms": 25275.689974427223,
27
+ "timing/total_ms": 28529.81636673212,
28
+ "tokens/completion": 196.125,
29
+ "tokens/masked_fraction": 0.0,
30
+ "wall_clock/generate_s": 37.95693778991699
31
+ },
32
+ {
33
+ "advantage/absmean": 0.17499999701976776,
34
+ "entropy": 0.6319499611854553,
35
+ "epoch": 0.004,
36
+ "grad_norm": 0.21545597111492612,
37
+ "importance_ratio": 0.9992015957832336,
38
+ "learning_rate": 1e-05,
39
+ "loss": -0.0015,
40
+ "mismatch_kl": 0.0010142761748284101,
41
+ "reward": 0.7100000381469727,
42
+ "reward/refusal_reward_func": 0.7100000381469727,
43
+ "reward/std": 0.26457512378692627,
44
+ "step": 2,
45
+ "timing/generation_ms": 5644.344195723534,
46
+ "timing/scoring_ms": 31337.847255170345,
47
+ "timing/total_ms": 36982.19145089388,
48
+ "tokens/completion": 642.9375,
49
+ "tokens/masked_fraction": 0.0,
50
+ "wall_clock/generate_s": 92.69584393501282
51
+ },
52
+ {
53
+ "advantage/absmean": 0.08203125,
54
+ "entropy": 0.4765586853027344,
55
+ "epoch": 0.006,
56
+ "grad_norm": 0.2892202194064109,
57
+ "importance_ratio": 1.000748872756958,
58
+ "learning_rate": 2e-05,
59
+ "loss": 0.0023,
60
+ "mismatch_kl": 0.001345986733213067,
61
+ "reward": 0.06593750417232513,
62
+ "reward/refusal_reward_func": 0.06593750417232513,
63
+ "reward/std": 0.15905697643756866,
64
+ "step": 3,
65
+ "timing/generation_ms": 2269.3659961223602,
66
+ "timing/scoring_ms": 26262.165516614914,
67
+ "timing/total_ms": 28531.531512737274,
68
+ "tokens/completion": 240.15625,
69
+ "tokens/masked_fraction": 0.0,
70
+ "wall_clock/generate_s": 143.94128561019897
71
+ },
72
+ {
73
+ "advantage/absmean": 0.0237890612334013,
74
+ "entropy": 0.5391862988471985,
75
+ "epoch": 0.008,
76
+ "grad_norm": 0.03657768868171977,
77
+ "importance_ratio": 1.0003156661987305,
78
+ "learning_rate": 3e-05,
79
+ "loss": 0.0003,
80
+ "mismatch_kl": 0.0014281703624874353,
81
+ "reward": 0.02812499925494194,
82
+ "reward/refusal_reward_func": 0.02812499925494194,
83
+ "reward/std": 0.028986798599362373,
84
+ "step": 4,
85
+ "timing/generation_ms": 3297.7539896965027,
86
+ "timing/scoring_ms": 18404.439702630043,
87
+ "timing/total_ms": 21702.193692326546,
88
+ "tokens/completion": 328.96875,
89
+ "tokens/masked_fraction": 0.0,
90
+ "wall_clock/generate_s": 30.997375011444092
91
+ },
92
+ {
93
+ "advantage/absmean": 0.37681642174720764,
94
+ "entropy": 0.6411616206169128,
95
+ "epoch": 0.01,
96
+ "grad_norm": 0.3743397229309058,
97
+ "importance_ratio": 1.0011004209518433,
98
+ "learning_rate": 4e-05,
99
+ "loss": -0.0173,
100
+ "mismatch_kl": 0.00182775326538831,
101
+ "reward": 0.37406250834465027,
102
+ "reward/refusal_reward_func": 0.37406250834465027,
103
+ "reward/std": 0.3808777630329132,
104
+ "step": 5,
105
+ "timing/generation_ms": 4316.511310636997,
106
+ "timing/scoring_ms": 31669.483192265034,
107
+ "timing/total_ms": 35985.99450290203,
108
+ "tokens/completion": 485.3125,
109
+ "tokens/masked_fraction": 0.0,
110
+ "wall_clock/generate_s": 72.83462309837341
111
+ },
112
+ {
113
+ "advantage/absmean": 0.007929688319563866,
114
+ "entropy": 0.5038114786148071,
115
+ "epoch": 0.012,
116
+ "grad_norm": 0.02765669861934211,
117
+ "importance_ratio": 0.999994158744812,
118
+ "learning_rate": 5e-05,
119
+ "loss": -0.0013,
120
+ "mismatch_kl": 0.0035878715571016073,
121
+ "reward": 0.014374999329447746,
122
+ "reward/refusal_reward_func": 0.014374999329447746,
123
+ "reward/std": 0.015398356132209301,
124
+ "step": 6,
125
+ "timing/generation_ms": 3028.248645365238,
126
+ "timing/scoring_ms": 25548.00620675087,
127
+ "timing/total_ms": 28576.254852116108,
128
+ "tokens/completion": 334.125,
129
+ "tokens/masked_fraction": 0.0,
130
+ "wall_clock/generate_s": 36.018136739730835
131
+ },
132
+ {
133
+ "advantage/absmean": 0.0,
134
+ "entropy": 0.29107579588890076,
135
+ "epoch": 0.014,
136
+ "grad_norm": 0.0,
137
+ "importance_ratio": 1.0008047819137573,
138
+ "learning_rate": 6e-05,
139
+ "loss": 0.0,
140
+ "mismatch_kl": 0.003742748638615012,
141
+ "reward": 0.009999999776482582,
142
+ "reward/refusal_reward_func": 0.009999999776482582,
143
+ "reward/std": 0.0,
144
+ "step": 7,
145
+ "timing/generation_ms": 1139.9633809924126,
146
+ "timing/scoring_ms": 19651.204399764538,
147
+ "timing/total_ms": 20791.16778075695,
148
+ "tokens/completion": 100.375,
149
+ "tokens/masked_fraction": 0.0,
150
+ "wall_clock/generate_s": 23.77466917037964
151
+ },
152
+ {
153
+ "advantage/absmean": 0.006562499795109034,
154
+ "entropy": 0.3522435426712036,
155
+ "epoch": 0.016,
156
+ "grad_norm": 0.013866793274081895,
157
+ "importance_ratio": 1.0039043426513672,
158
+ "learning_rate": 7e-05,
159
+ "loss": 0.0029,
160
+ "mismatch_kl": 0.022596202790737152,
161
+ "reward": 0.013749999925494194,
162
+ "reward/refusal_reward_func": 0.013749999925494194,
163
+ "reward/std": 0.009921567514538765,
164
+ "step": 8,
165
+ "timing/generation_ms": 784.0555533766747,
166
+ "timing/scoring_ms": 19302.494660019875,
167
+ "timing/total_ms": 20086.55021339655,
168
+ "tokens/completion": 56.875,
169
+ "tokens/masked_fraction": 0.0,
170
+ "wall_clock/generate_s": 24.927189350128174
171
+ },
172
+ {
173
+ "advantage/absmean": 0.17390625178813934,
174
+ "entropy": 0.8274978995323181,
175
+ "epoch": 0.018,
176
+ "grad_norm": 0.11705006346082461,
177
+ "importance_ratio": 1.0001081228256226,
178
+ "learning_rate": 8e-05,
179
+ "loss": 0.0021,
180
+ "mismatch_kl": 0.0024926774203777313,
181
+ "reward": 0.6775000095367432,
182
+ "reward/refusal_reward_func": 0.6775000095367432,
183
+ "reward/std": 0.24893523752689362,
184
+ "step": 9,
185
+ "timing/generation_ms": 9839.449286460876,
186
+ "timing/scoring_ms": 40201.816976070404,
187
+ "timing/total_ms": 50041.26626253128,
188
+ "tokens/completion": 1103.59375,
189
+ "tokens/masked_fraction": 0.0,
190
+ "wall_clock/generate_s": 94.76600408554077
191
+ },
192
+ {
193
+ "advantage/absmean": 0.0035156249068677425,
194
+ "entropy": 0.24158422648906708,
195
+ "epoch": 0.02,
196
+ "grad_norm": 0.005773329550661251,
197
+ "importance_ratio": 0.9983059763908386,
198
+ "learning_rate": 9e-05,
199
+ "loss": 0.0009,
200
+ "mismatch_kl": 0.05829961970448494,
201
+ "reward": 0.011874999850988388,
202
+ "reward/refusal_reward_func": 0.011874999850988388,
203
+ "reward/std": 0.007261843420565128,
204
+ "step": 10,
205
+ "timing/generation_ms": 553.7310987710953,
206
+ "timing/scoring_ms": 18607.969902455807,
207
+ "timing/total_ms": 19161.701001226902,
208
+ "tokens/completion": 26.25,
209
+ "tokens/masked_fraction": 0.0,
210
+ "wall_clock/generate_s": 23.25968861579895
211
+ },
212
+ {
213
+ "advantage/absmean": 0.027421876788139343,
214
+ "entropy": 0.5188111066818237,
215
+ "epoch": 0.022,
216
+ "grad_norm": 0.030197590581975912,
217
+ "importance_ratio": 0.9986535310745239,
218
+ "learning_rate": 0.0001,
219
+ "loss": -0.0035,
220
+ "mismatch_kl": 0.005903073586523533,
221
+ "reward": 0.026874996721744537,
222
+ "reward/refusal_reward_func": 0.026874996721744537,
223
+ "reward/std": 0.046599194407463074,
224
+ "step": 11,
225
+ "timing/generation_ms": 2835.02546697855,
226
+ "timing/scoring_ms": 24443.98508220911,
227
+ "timing/total_ms": 27279.01054918766,
228
+ "tokens/completion": 308.53125,
229
+ "tokens/masked_fraction": 0.0,
230
+ "wall_clock/generate_s": 145.36076593399048
231
+ },
232
+ {
233
+ "advantage/absmean": 0.12355469167232513,
234
+ "entropy": 0.44281068444252014,
235
+ "epoch": 0.024,
236
+ "grad_norm": 0.42446378552193,
237
+ "importance_ratio": 0.9987350106239319,
238
+ "learning_rate": 0.0001,
239
+ "loss": -0.0423,
240
+ "mismatch_kl": 0.006952627561986446,
241
+ "reward": 0.09437499195337296,
242
+ "reward/refusal_reward_func": 0.09437499195337296,
243
+ "reward/std": 0.21499907970428467,
244
+ "step": 12,
245
+ "timing/generation_ms": 2469.3235754966736,
246
+ "timing/scoring_ms": 35024.46338534355,
247
+ "timing/total_ms": 37493.786960840225,
248
+ "tokens/completion": 273.28125,
249
+ "tokens/masked_fraction": 0.0,
250
+ "wall_clock/generate_s": 154.79029417037964
251
+ },
252
+ {
253
+ "advantage/absmean": 0.09375,
254
+ "entropy": 0.5496101379394531,
255
+ "epoch": 0.026,
256
+ "grad_norm": 0.1257011193040082,
257
+ "importance_ratio": 0.9992591738700867,
258
+ "learning_rate": 0.0001,
259
+ "loss": -0.0035,
260
+ "mismatch_kl": 0.0034747051540762186,
261
+ "reward": 0.7599999904632568,
262
+ "reward/refusal_reward_func": 0.7599999904632568,
263
+ "reward/std": 0.19364915788173676,
264
+ "step": 13,
265
+ "timing/generation_ms": 8158.617563545704,
266
+ "timing/scoring_ms": 36215.20960330963,
267
+ "timing/total_ms": 44373.827166855335,
268
+ "tokens/completion": 930.6875,
269
+ "tokens/masked_fraction": 0.0,
270
+ "wall_clock/generate_s": 122.23734498023987
271
+ },
272
+ {
273
+ "advantage/absmean": 0.1844140589237213,
274
+ "entropy": 0.6212608814239502,
275
+ "epoch": 0.028,
276
+ "grad_norm": 0.26809699141136345,
277
+ "importance_ratio": 1.0004856586456299,
278
+ "learning_rate": 0.0001,
279
+ "loss": -0.0125,
280
+ "mismatch_kl": 0.0054090130142867565,
281
+ "reward": 0.1978124976158142,
282
+ "reward/refusal_reward_func": 0.1978124976158142,
283
+ "reward/std": 0.21153803169727325,
284
+ "step": 14,
285
+ "timing/generation_ms": 3771.018899977207,
286
+ "timing/scoring_ms": 32557.9876229167,
287
+ "timing/total_ms": 36329.006522893906,
288
+ "tokens/completion": 423.1875,
289
+ "tokens/masked_fraction": 0.0,
290
+ "wall_clock/generate_s": 76.63339400291443
291
+ },
292
+ {
293
+ "advantage/absmean": 0.09375,
294
+ "entropy": 0.5119910836219788,
295
+ "epoch": 0.03,
296
+ "grad_norm": 0.2293179923709544,
297
+ "importance_ratio": 1.0004829168319702,
298
+ "learning_rate": 0.0001,
299
+ "loss": 0.0032,
300
+ "mismatch_kl": 0.0062421830371022224,
301
+ "reward": 0.05999999865889549,
302
+ "reward/refusal_reward_func": 0.05999999865889549,
303
+ "reward/std": 0.19364915788173676,
304
+ "step": 15,
305
+ "timing/generation_ms": 3333.340108394623,
306
+ "timing/scoring_ms": 27093.5076251626,
307
+ "timing/total_ms": 30426.847733557224,
308
+ "tokens/completion": 374.25,
309
+ "tokens/masked_fraction": 0.0,
310
+ "wall_clock/generate_s": 143.25073266029358
311
+ },
312
+ {
313
+ "advantage/absmean": 0.023906249552965164,
314
+ "entropy": 0.7846541404724121,
315
+ "epoch": 0.032,
316
+ "grad_norm": 0.03020597167190001,
317
+ "importance_ratio": 0.9995452165603638,
318
+ "learning_rate": 0.0001,
319
+ "loss": 0.0001,
320
+ "mismatch_kl": 0.006196495145559311,
321
+ "reward": 0.04218750074505806,
322
+ "reward/refusal_reward_func": 0.04218750074505806,
323
+ "reward/std": 0.030489176511764526,
324
+ "step": 16,
325
+ "timing/generation_ms": 6338.90475332737,
326
+ "timing/scoring_ms": 35081.99892938137,
327
+ "timing/total_ms": 41420.90368270874,
328
+ "tokens/completion": 725.5625,
329
+ "tokens/masked_fraction": 0.0,
330
+ "wall_clock/generate_s": 89.63312363624573
331
+ },
332
+ {
333
+ "advantage/absmean": 0.2109375,
334
+ "entropy": 0.4431617856025696,
335
+ "epoch": 0.034,
336
+ "grad_norm": 0.11997986504420295,
337
+ "importance_ratio": 1.0006847381591797,
338
+ "learning_rate": 0.0001,
339
+ "loss": -0.0079,
340
+ "mismatch_kl": 0.0046984292566776276,
341
+ "reward": 0.6850000023841858,
342
+ "reward/refusal_reward_func": 0.6850000023841858,
343
+ "reward/std": 0.2904737591743469,
344
+ "step": 17,
345
+ "timing/generation_ms": 4871.280819177628,
346
+ "timing/scoring_ms": 27726.198948919773,
347
+ "timing/total_ms": 32597.4797680974,
348
+ "tokens/completion": 541.3125,
349
+ "tokens/masked_fraction": 0.0,
350
+ "wall_clock/generate_s": 72.54996109008789
351
+ },
352
+ {
353
+ "advantage/absmean": 0.020624998956918716,
354
+ "entropy": 0.005482906475663185,
355
+ "epoch": 0.036,
356
+ "grad_norm": 2.4822412849126587e-05,
357
+ "importance_ratio": 0.9996475577354431,
358
+ "learning_rate": 0.0001,
359
+ "loss": 0.0,
360
+ "mismatch_kl": 3.828452292964357e-07,
361
+ "reward": 0.023749999701976776,
362
+ "reward/refusal_reward_func": 0.023749999701976776,
363
+ "reward/std": 0.026896795257925987,
364
+ "step": 18,
365
+ "timing/generation_ms": 466.4832055568695,
366
+ "timing/scoring_ms": 18893.90940964222,
367
+ "timing/total_ms": 19360.39261519909,
368
+ "tokens/completion": 13.0,
369
+ "tokens/masked_fraction": 0.0,
370
+ "wall_clock/generate_s": 24.94274139404297
371
+ },
372
+ {
373
+ "advantage/absmean": 0.06175781413912773,
374
+ "entropy": 0.5270810723304749,
375
+ "epoch": 0.038,
376
+ "grad_norm": 0.19852274538781398,
377
+ "importance_ratio": 1.0000227689743042,
378
+ "learning_rate": 0.0001,
379
+ "loss": 0.0024,
380
+ "mismatch_kl": 0.005923233926296234,
381
+ "reward": 0.06562499701976776,
382
+ "reward/refusal_reward_func": 0.06562499701976776,
383
+ "reward/std": 0.13811086118221283,
384
+ "step": 19,
385
+ "timing/generation_ms": 4183.5604682564735,
386
+ "timing/scoring_ms": 30921.414978802204,
387
+ "timing/total_ms": 35104.97544705868,
388
+ "tokens/completion": 476.46875,
389
+ "tokens/masked_fraction": 0.0,
390
+ "wall_clock/generate_s": 69.38085293769836
391
+ },
392
+ {
393
+ "advantage/absmean": 0.0018164062639698386,
394
+ "entropy": 0.004832141101360321,
395
+ "epoch": 0.04,
396
+ "grad_norm": 0.00028270733132418016,
397
+ "importance_ratio": 0.9995356202125549,
398
+ "learning_rate": 0.0001,
399
+ "loss": -0.0,
400
+ "mismatch_kl": 1.2251906582605443e-06,
401
+ "reward": 0.010937499813735485,
402
+ "reward/refusal_reward_func": 0.010937499813735485,
403
+ "reward/std": 0.005219778511673212,
404
+ "step": 20,
405
+ "timing/generation_ms": 469.4804549217224,
406
+ "timing/scoring_ms": 18075.80190896988,
407
+ "timing/total_ms": 18545.2823638916,
408
+ "tokens/completion": 13.0,
409
+ "tokens/masked_fraction": 0.0,
410
+ "wall_clock/generate_s": 22.75140905380249
411
+ },
412
+ {
413
+ "advantage/absmean": 0.13593749701976776,
414
+ "entropy": 0.6113055944442749,
415
+ "epoch": 0.042,
416
+ "grad_norm": 0.14856988549338257,
417
+ "importance_ratio": 0.999966561794281,
418
+ "learning_rate": 0.0001,
419
+ "loss": -0.0022,
420
+ "mismatch_kl": 0.005093984771519899,
421
+ "reward": 0.7350000143051147,
422
+ "reward/refusal_reward_func": 0.7350000143051147,
423
+ "reward/std": 0.23318447172641754,
424
+ "step": 21,
425
+ "timing/generation_ms": 6330.163478851318,
426
+ "timing/scoring_ms": 28551.29039287567,
427
+ "timing/total_ms": 34881.45387172699,
428
+ "tokens/completion": 715.9375,
429
+ "tokens/masked_fraction": 0.0,
430
+ "wall_clock/generate_s": 84.07915687561035
431
+ },
432
+ {
433
+ "advantage/absmean": 0.01318359375,
434
+ "entropy": 0.5851391553878784,
435
+ "epoch": 0.044,
436
+ "grad_norm": 0.040096615934913975,
437
+ "importance_ratio": 1.0004969835281372,
438
+ "learning_rate": 0.0001,
439
+ "loss": -0.0018,
440
+ "mismatch_kl": 0.009555388242006302,
441
+ "reward": 0.017812499776482582,
442
+ "reward/refusal_reward_func": 0.017812499776482582,
443
+ "reward/std": 0.020575225353240967,
444
+ "step": 22,
445
+ "timing/generation_ms": 2347.38065302372,
446
+ "timing/scoring_ms": 26964.800156652927,
447
+ "timing/total_ms": 29312.180809676647,
448
+ "tokens/completion": 255.15625,
449
+ "tokens/masked_fraction": 0.0,
450
+ "wall_clock/generate_s": 150.09117031097412
451
+ },
452
+ {
453
+ "advantage/absmean": 0.01718750037252903,
454
+ "entropy": 0.5837696194648743,
455
+ "epoch": 0.046,
456
+ "grad_norm": 0.027639162796687714,
457
+ "importance_ratio": 1.0036453008651733,
458
+ "learning_rate": 0.0001,
459
+ "loss": -0.0037,
460
+ "mismatch_kl": 0.017472539097070694,
461
+ "reward": 0.022499999031424522,
462
+ "reward/refusal_reward_func": 0.022499999031424522,
463
+ "reward/std": 0.021650634706020355,
464
+ "step": 23,
465
+ "timing/generation_ms": 2000.5059093236923,
466
+ "timing/scoring_ms": 24497.070513665676,
467
+ "timing/total_ms": 26497.57642298937,
468
+ "tokens/completion": 222.25,
469
+ "tokens/masked_fraction": 0.0,
470
+ "wall_clock/generate_s": 39.751105070114136
471
+ },
472
+ {
473
+ "advantage/absmean": 0.09375,
474
+ "entropy": 0.6268512010574341,
475
+ "epoch": 0.048,
476
+ "grad_norm": 0.18640619697968583,
477
+ "importance_ratio": 0.9990081787109375,
478
+ "learning_rate": 0.0001,
479
+ "loss": -0.0019,
480
+ "mismatch_kl": 0.005103914998471737,
481
+ "reward": 0.7599999904632568,
482
+ "reward/refusal_reward_func": 0.7599999904632568,
483
+ "reward/std": 0.19364915788173676,
484
+ "step": 24,
485
+ "timing/generation_ms": 5919.098302721977,
486
+ "timing/scoring_ms": 32965.33615142107,
487
+ "timing/total_ms": 38884.43445414305,
488
+ "tokens/completion": 674.9375,
489
+ "tokens/masked_fraction": 0.0,
490
+ "wall_clock/generate_s": 105.22681593894958
491
+ },
492
+ {
493
+ "advantage/absmean": 0.007929687388241291,
494
+ "entropy": 0.657754123210907,
495
+ "epoch": 0.05,
496
+ "grad_norm": 0.025336375406596286,
497
+ "importance_ratio": 1.0006119012832642,
498
+ "learning_rate": 0.0001,
499
+ "loss": -0.0006,
500
+ "mismatch_kl": 0.006333181634545326,
501
+ "reward": 0.014374999329447746,
502
+ "reward/refusal_reward_func": 0.014374999329447746,
503
+ "reward/std": 0.015398357063531876,
504
+ "step": 25,
505
+ "timing/generation_ms": 4377.142012119293,
506
+ "timing/scoring_ms": 25255.830891430378,
507
+ "timing/total_ms": 29632.97290354967,
508
+ "tokens/completion": 501.125,
509
+ "tokens/masked_fraction": 0.0,
510
+ "wall_clock/generate_s": 47.89186096191406
511
+ },
512
+ {
513
+ "advantage/absmean": 0.0,
514
+ "entropy": 0.17220205068588257,
515
+ "epoch": 0.052,
516
+ "grad_norm": 0.0,
517
+ "importance_ratio": 1.0002461671829224,
518
+ "learning_rate": 0.0001,
519
+ "loss": 0.0,
520
+ "mismatch_kl": 0.0026462471578270197,
521
+ "reward": 0.009999999776482582,
522
+ "reward/refusal_reward_func": 0.009999999776482582,
523
+ "reward/std": 0.0,
524
+ "step": 26,
525
+ "timing/generation_ms": 497.48579412698746,
526
+ "timing/scoring_ms": 18527.822844684124,
527
+ "timing/total_ms": 19025.30863881111,
528
+ "tokens/completion": 19.40625,
529
+ "tokens/masked_fraction": 0.0,
530
+ "wall_clock/generate_s": 22.794724941253662
531
+ },
532
+ {
533
+ "advantage/absmean": 0.17499999701976776,
534
+ "entropy": 0.546366810798645,
535
+ "epoch": 0.054,
536
+ "grad_norm": 0.22199732245019957,
537
+ "importance_ratio": 0.999916672706604,
538
+ "learning_rate": 0.0001,
539
+ "loss": -0.0027,
540
+ "mismatch_kl": 0.005502623040229082,
541
+ "reward": 0.7100000381469727,
542
+ "reward/refusal_reward_func": 0.7100000381469727,
543
+ "reward/std": 0.26457512378692627,
544
+ "step": 27,
545
+ "timing/generation_ms": 7701.031573116779,
546
+ "timing/scoring_ms": 28537.479266524315,
547
+ "timing/total_ms": 36238.510839641094,
548
+ "tokens/completion": 881.84375,
549
+ "tokens/masked_fraction": 0.0,
550
+ "wall_clock/generate_s": 68.64078164100647
551
+ },
552
+ {
553
+ "advantage/absmean": 0.06175781041383743,
554
+ "entropy": 0.6991814374923706,
555
+ "epoch": 0.056,
556
+ "grad_norm": 0.04213571392317473,
557
+ "importance_ratio": 0.9988569617271423,
558
+ "learning_rate": 0.0001,
559
+ "loss": -0.0026,
560
+ "mismatch_kl": 0.00713223684579134,
561
+ "reward": 0.07656250149011612,
562
+ "reward/refusal_reward_func": 0.07656250149011612,
563
+ "reward/std": 0.1120995506644249,
564
+ "step": 28,
565
+ "timing/generation_ms": 4908.71948748827,
566
+ "timing/scoring_ms": 35273.585848510265,
567
+ "timing/total_ms": 40182.305335998535,
568
+ "tokens/completion": 557.53125,
569
+ "tokens/masked_fraction": 0.0,
570
+ "wall_clock/generate_s": 92.86418867111206
571
+ },
572
+ {
573
+ "advantage/absmean": 0.2201562523841858,
574
+ "entropy": 0.8785261511802673,
575
+ "epoch": 0.058,
576
+ "grad_norm": 0.3194279303513244,
577
+ "importance_ratio": 0.998691737651825,
578
+ "learning_rate": 0.0001,
579
+ "loss": 0.0079,
580
+ "mismatch_kl": 0.006111000664532185,
581
+ "reward": 0.5774999856948853,
582
+ "reward/refusal_reward_func": 0.5774999856948853,
583
+ "reward/std": 0.2741691768169403,
584
+ "step": 29,
585
+ "timing/generation_ms": 4181.9010972976685,
586
+ "timing/scoring_ms": 29512.903429567814,
587
+ "timing/total_ms": 33694.80452686548,
588
+ "tokens/completion": 478.875,
589
+ "tokens/masked_fraction": 0.0,
590
+ "wall_clock/generate_s": 72.91563320159912
591
+ },
592
+ {
593
+ "advantage/absmean": 0.20125000178813934,
594
+ "entropy": 0.6921765208244324,
595
+ "epoch": 0.06,
596
+ "grad_norm": 0.4406053771041607,
597
+ "importance_ratio": 0.9964888095855713,
598
+ "learning_rate": 0.0001,
599
+ "loss": -0.0386,
600
+ "mismatch_kl": 0.01204030029475689,
601
+ "reward": 0.17999999225139618,
602
+ "reward/refusal_reward_func": 0.17999999225139618,
603
+ "reward/std": 0.23286262154579163,
604
+ "step": 30,
605
+ "timing/generation_ms": 3321.3287368416786,
606
+ "timing/scoring_ms": 29757.627181708813,
607
+ "timing/total_ms": 33078.95591855049,
608
+ "tokens/completion": 382.8125,
609
+ "tokens/masked_fraction": 0.0,
610
+ "wall_clock/generate_s": 53.77872610092163
611
+ },
612
+ {
613
+ "advantage/absmean": 0.0018164062639698386,
614
+ "entropy": 0.5210146903991699,
615
+ "epoch": 0.062,
616
+ "grad_norm": 0.01658749077570669,
617
+ "importance_ratio": 0.9975528120994568,
618
+ "learning_rate": 0.0001,
619
+ "loss": 0.0,
620
+ "mismatch_kl": 0.024219391867518425,
621
+ "reward": 0.010937499813735485,
622
+ "reward/refusal_reward_func": 0.010937499813735485,
623
+ "reward/std": 0.005219778511673212,
624
+ "step": 31,
625
+ "timing/generation_ms": 1940.1119500398636,
626
+ "timing/scoring_ms": 23180.305778980255,
627
+ "timing/total_ms": 25120.41772902012,
628
+ "tokens/completion": 190.5625,
629
+ "tokens/masked_fraction": 0.0,
630
+ "wall_clock/generate_s": 33.192246198654175
631
+ },
632
+ {
633
+ "advantage/absmean": 0.04218750074505806,
634
+ "entropy": 0.7747635245323181,
635
+ "epoch": 0.064,
636
+ "grad_norm": 0.05856219259172721,
637
+ "importance_ratio": 1.0011621713638306,
638
+ "learning_rate": 0.0001,
639
+ "loss": 0.0001,
640
+ "mismatch_kl": 0.008931240066885948,
641
+ "reward": 0.0456249974668026,
642
+ "reward/refusal_reward_func": 0.0456249974668026,
643
+ "reward/std": 0.06082441285252571,
644
+ "step": 32,
645
+ "timing/generation_ms": 6628.670156002045,
646
+ "timing/scoring_ms": 29644.853502511978,
647
+ "timing/total_ms": 36273.52365851402,
648
+ "tokens/completion": 758.0625,
649
+ "tokens/masked_fraction": 0.0,
650
+ "wall_clock/generate_s": 44.213383197784424
651
+ },
652
+ {
653
+ "advantage/absmean": 0.005097656510770321,
654
+ "entropy": 0.7032718062400818,
655
+ "epoch": 0.066,
656
+ "grad_norm": 0.004871385581572528,
657
+ "importance_ratio": 0.9980704188346863,
658
+ "learning_rate": 0.0001,
659
+ "loss": -0.0002,
660
+ "mismatch_kl": 0.017929747700691223,
661
+ "reward": 0.012812498956918716,
662
+ "reward/refusal_reward_func": 0.012812498956918716,
663
+ "reward/std": 0.008744417689740658,
664
+ "step": 33,
665
+ "timing/generation_ms": 2495.2172189950943,
666
+ "timing/scoring_ms": 23352.533906698227,
667
+ "timing/total_ms": 25847.75112569332,
668
+ "tokens/completion": 271.1875,
669
+ "tokens/masked_fraction": 0.0,
670
+ "wall_clock/generate_s": 146.50593042373657
671
+ },
672
+ {
673
+ "advantage/absmean": 0.20359376072883606,
674
+ "entropy": 0.7547333240509033,
675
+ "epoch": 0.068,
676
+ "grad_norm": 0.2936937114262587,
677
+ "importance_ratio": 1.0006296634674072,
678
+ "learning_rate": 0.0001,
679
+ "loss": -0.0116,
680
+ "mismatch_kl": 0.011033565737307072,
681
+ "reward": 0.47718751430511475,
682
+ "reward/refusal_reward_func": 0.47718751430511475,
683
+ "reward/std": 0.26260992884635925,
684
+ "step": 34,
685
+ "timing/generation_ms": 4908.003121614456,
686
+ "timing/scoring_ms": 30874.776013195515,
687
+ "timing/total_ms": 35782.77913480997,
688
+ "tokens/completion": 551.65625,
689
+ "tokens/masked_fraction": 0.0,
690
+ "wall_clock/generate_s": 60.00288248062134
691
+ },
692
+ {
693
+ "advantage/absmean": 0.11302733421325684,
694
+ "entropy": 0.7300074696540833,
695
+ "epoch": 0.07,
696
+ "grad_norm": 0.32493885105188386,
697
+ "importance_ratio": 0.9997804164886475,
698
+ "learning_rate": 0.0001,
699
+ "loss": -0.0011,
700
+ "mismatch_kl": 0.016147281974554062,
701
+ "reward": 0.10593750327825546,
702
+ "reward/refusal_reward_func": 0.10593750327825546,
703
+ "reward/std": 0.16747872531414032,
704
+ "step": 35,
705
+ "timing/generation_ms": 3645.879790186882,
706
+ "timing/scoring_ms": 28717.37616509199,
707
+ "timing/total_ms": 32363.255955278873,
708
+ "tokens/completion": 418.03125,
709
+ "tokens/masked_fraction": 0.0,
710
+ "wall_clock/generate_s": 58.10710334777832
711
+ },
712
+ {
713
+ "advantage/absmean": 0.015136717818677425,
714
+ "entropy": 0.6982847452163696,
715
+ "epoch": 0.072,
716
+ "grad_norm": 0.04051473715939973,
717
+ "importance_ratio": 0.9992015957832336,
718
+ "learning_rate": 0.0001,
719
+ "loss": -0.0006,
720
+ "mismatch_kl": 0.01328412164002657,
721
+ "reward": 0.019687499850988388,
722
+ "reward/refusal_reward_func": 0.019687499850988388,
723
+ "reward/std": 0.02113710716366768,
724
+ "step": 36,
725
+ "timing/generation_ms": 3033.673010766506,
726
+ "timing/scoring_ms": 29509.141087532043,
727
+ "timing/total_ms": 32542.81409829855,
728
+ "tokens/completion": 339.375,
729
+ "tokens/masked_fraction": 0.0,
730
+ "wall_clock/generate_s": 151.12143540382385
731
+ },
732
+ {
733
+ "advantage/absmean": 0.013593749143183231,
734
+ "entropy": 0.6699934005737305,
735
+ "epoch": 0.074,
736
+ "grad_norm": 0.02768782924464237,
737
+ "importance_ratio": 1.0014734268188477,
738
+ "learning_rate": 0.0001,
739
+ "loss": -0.0003,
740
+ "mismatch_kl": 0.00942978449165821,
741
+ "reward": 0.019062498584389687,
742
+ "reward/refusal_reward_func": 0.019062498584389687,
743
+ "reward/std": 0.017741085961461067,
744
+ "step": 37,
745
+ "timing/generation_ms": 4092.4242958426476,
746
+ "timing/scoring_ms": 29670.215159654617,
747
+ "timing/total_ms": 33762.639455497265,
748
+ "tokens/completion": 466.8125,
749
+ "tokens/masked_fraction": 0.0,
750
+ "wall_clock/generate_s": 77.67575216293335
751
+ },
752
+ {
753
+ "advantage/absmean": 0.09351562708616257,
754
+ "entropy": 0.6306953430175781,
755
+ "epoch": 0.076,
756
+ "grad_norm": 0.2799348001151443,
757
+ "importance_ratio": 0.9985800981521606,
758
+ "learning_rate": 0.0001,
759
+ "loss": -0.0037,
760
+ "mismatch_kl": 0.009972751140594482,
761
+ "reward": 0.06187500059604645,
762
+ "reward/refusal_reward_func": 0.06187500059604645,
763
+ "reward/std": 0.19330088794231415,
764
+ "step": 38,
765
+ "timing/generation_ms": 5255.660645663738,
766
+ "timing/scoring_ms": 29534.79740768671,
767
+ "timing/total_ms": 34790.45805335045,
768
+ "tokens/completion": 602.75,
769
+ "tokens/masked_fraction": 0.0,
770
+ "wall_clock/generate_s": 56.81032872200012
771
+ },
772
+ {
773
+ "advantage/absmean": 0.021894531324505806,
774
+ "entropy": 0.6860240697860718,
775
+ "epoch": 0.078,
776
+ "grad_norm": 0.021804850572780036,
777
+ "importance_ratio": 0.999793291091919,
778
+ "learning_rate": 0.0001,
779
+ "loss": 0.0008,
780
+ "mismatch_kl": 0.008842560462653637,
781
+ "reward": 0.028437498956918716,
782
+ "reward/refusal_reward_func": 0.028437498956918716,
783
+ "reward/std": 0.02670549787580967,
784
+ "step": 39,
785
+ "timing/generation_ms": 5519.4277837872505,
786
+ "timing/scoring_ms": 28062.6777485013,
787
+ "timing/total_ms": 33582.10553228855,
788
+ "tokens/completion": 635.6875,
789
+ "tokens/masked_fraction": 0.0,
790
+ "wall_clock/generate_s": 49.89069700241089
791
+ },
792
+ {
793
+ "advantage/absmean": 0.04843749850988388,
794
+ "entropy": 0.7330471873283386,
795
+ "epoch": 0.08,
796
+ "grad_norm": 0.23560813153480506,
797
+ "importance_ratio": 1.0003836154937744,
798
+ "learning_rate": 0.0001,
799
+ "loss": 0.0014,
800
+ "mismatch_kl": 0.008081368170678616,
801
+ "reward": 0.03500000014901161,
802
+ "reward/refusal_reward_func": 0.03500000014901161,
803
+ "reward/std": 0.13919411599636078,
804
+ "step": 40,
805
+ "timing/generation_ms": 3926.6494438052177,
806
+ "timing/scoring_ms": 27580.46282082796,
807
+ "timing/total_ms": 31507.11226463318,
808
+ "tokens/completion": 445.78125,
809
+ "tokens/masked_fraction": 0.0,
810
+ "wall_clock/generate_s": 70.10888338088989
811
+ },
812
+ {
813
+ "advantage/absmean": 0.12544921040534973,
814
+ "entropy": 0.5206725597381592,
815
+ "epoch": 0.082,
816
+ "grad_norm": 0.5033391726424777,
817
+ "importance_ratio": 1.0012822151184082,
818
+ "learning_rate": 0.0001,
819
+ "loss": -0.0225,
820
+ "mismatch_kl": 0.007961519993841648,
821
+ "reward": 0.1446875035762787,
822
+ "reward/refusal_reward_func": 0.1446875035762787,
823
+ "reward/std": 0.1963731199502945,
824
+ "step": 41,
825
+ "timing/generation_ms": 1945.6753060221672,
826
+ "timing/scoring_ms": 24715.371668338776,
827
+ "timing/total_ms": 26661.046974360943,
828
+ "tokens/completion": 203.96875,
829
+ "tokens/masked_fraction": 0.0,
830
+ "wall_clock/generate_s": 41.8072030544281
831
+ },
832
+ {
833
+ "advantage/absmean": 0.0018164062639698386,
834
+ "entropy": 0.82820063829422,
835
+ "epoch": 0.084,
836
+ "grad_norm": 0.0008894764900618824,
837
+ "importance_ratio": 0.9987862706184387,
838
+ "learning_rate": 0.0001,
839
+ "loss": 0.0002,
840
+ "mismatch_kl": 0.008028813637793064,
841
+ "reward": 0.010937499813735485,
842
+ "reward/refusal_reward_func": 0.010937499813735485,
843
+ "reward/std": 0.005219778511673212,
844
+ "step": 42,
845
+ "timing/generation_ms": 6800.906598567963,
846
+ "timing/scoring_ms": 31876.1548101902,
847
+ "timing/total_ms": 38677.06140875816,
848
+ "tokens/completion": 785.28125,
849
+ "tokens/masked_fraction": 0.0,
850
+ "wall_clock/generate_s": 59.81179714202881
851
+ },
852
+ {
853
+ "advantage/absmean": 0.050859373062849045,
854
+ "entropy": 0.829409658908844,
855
+ "epoch": 0.086,
856
+ "grad_norm": 0.1901939415206169,
857
+ "importance_ratio": 1.0000019073486328,
858
+ "learning_rate": 0.0001,
859
+ "loss": 0.003,
860
+ "mismatch_kl": 0.009588975459337234,
861
+ "reward": 0.04312499612569809,
862
+ "reward/refusal_reward_func": 0.04312499612569809,
863
+ "reward/std": 0.1388217806816101,
864
+ "step": 43,
865
+ "timing/generation_ms": 6013.470813632011,
866
+ "timing/scoring_ms": 28884.994342923164,
867
+ "timing/total_ms": 34898.465156555176,
868
+ "tokens/completion": 691.1875,
869
+ "tokens/masked_fraction": 0.0,
870
+ "wall_clock/generate_s": 51.95971155166626
871
+ },
872
+ {
873
+ "advantage/absmean": 0.01318359375,
874
+ "entropy": 0.9210640788078308,
875
+ "epoch": 0.088,
876
+ "grad_norm": 0.03744765208460247,
877
+ "importance_ratio": 0.9966821670532227,
878
+ "learning_rate": 0.0001,
879
+ "loss": -0.0019,
880
+ "mismatch_kl": 0.018160995095968246,
881
+ "reward": 0.017812499776482582,
882
+ "reward/refusal_reward_func": 0.017812499776482582,
883
+ "reward/std": 0.020575225353240967,
884
+ "step": 44,
885
+ "timing/generation_ms": 2924.579069018364,
886
+ "timing/scoring_ms": 20441.766560077667,
887
+ "timing/total_ms": 23366.34562909603,
888
+ "tokens/completion": 343.75,
889
+ "tokens/masked_fraction": 0.0,
890
+ "wall_clock/generate_s": 84.02694988250732
891
+ },
892
+ {
893
+ "advantage/absmean": 0.04843749850988388,
894
+ "entropy": 0.8891170024871826,
895
+ "epoch": 0.09,
896
+ "grad_norm": 0.12978747422641476,
897
+ "importance_ratio": 0.9995278120040894,
898
+ "learning_rate": 0.0001,
899
+ "loss": -0.0021,
900
+ "mismatch_kl": 0.004716904368251562,
901
+ "reward": 0.7849999666213989,
902
+ "reward/refusal_reward_func": 0.7849999666213989,
903
+ "reward/std": 0.13919411599636078,
904
+ "step": 45,
905
+ "timing/generation_ms": 12784.472778439522,
906
+ "timing/scoring_ms": 41995.29768526554,
907
+ "timing/total_ms": 54779.77046370506,
908
+ "tokens/completion": 1424.53125,
909
+ "tokens/masked_fraction": 0.0,
910
+ "wall_clock/generate_s": 159.0035264492035
911
+ },
912
+ {
913
+ "advantage/absmean": 0.19917967915534973,
914
+ "entropy": 1.1817245483398438,
915
+ "epoch": 0.092,
916
+ "grad_norm": 0.18725081241592767,
917
+ "importance_ratio": 1.0018072128295898,
918
+ "learning_rate": 0.0001,
919
+ "loss": -0.0256,
920
+ "mismatch_kl": 0.008767618797719479,
921
+ "reward": 0.5290625095367432,
922
+ "reward/refusal_reward_func": 0.5290625095367432,
923
+ "reward/std": 0.2553595304489136,
924
+ "step": 46,
925
+ "timing/generation_ms": 11124.341294169426,
926
+ "timing/scoring_ms": 37177.75782942772,
927
+ "timing/total_ms": 48302.099123597145,
928
+ "tokens/completion": 1282.71875,
929
+ "tokens/masked_fraction": 0.0,
930
+ "wall_clock/generate_s": 67.16122078895569
931
+ },
932
+ {
933
+ "advantage/absmean": 0.02968750149011612,
934
+ "entropy": 0.9758523106575012,
935
+ "epoch": 0.094,
936
+ "grad_norm": 0.038752578543884246,
937
+ "importance_ratio": 1.0004377365112305,
938
+ "learning_rate": 0.0001,
939
+ "loss": 0.0042,
940
+ "mismatch_kl": 0.007721519563347101,
941
+ "reward": 0.03500000014901161,
942
+ "reward/refusal_reward_func": 0.03500000014901161,
943
+ "reward/std": 0.04690415784716606,
944
+ "step": 47,
945
+ "timing/generation_ms": 8879.09684330225,
946
+ "timing/scoring_ms": 37631.5980181098,
947
+ "timing/total_ms": 46510.69486141205,
948
+ "tokens/completion": 1009.46875,
949
+ "tokens/masked_fraction": 0.0,
950
+ "wall_clock/generate_s": 73.34895396232605
951
+ },
952
+ {
953
+ "advantage/absmean": 0.13593749701976776,
954
+ "entropy": 0.918021559715271,
955
+ "epoch": 0.096,
956
+ "grad_norm": 0.19107099447712278,
957
+ "importance_ratio": 0.9999390244483948,
958
+ "learning_rate": 0.0001,
959
+ "loss": -0.0005,
960
+ "mismatch_kl": 0.007433234713971615,
961
+ "reward": 0.7350000143051147,
962
+ "reward/refusal_reward_func": 0.7350000143051147,
963
+ "reward/std": 0.23318448662757874,
964
+ "step": 48,
965
+ "timing/generation_ms": 10426.40034854412,
966
+ "timing/scoring_ms": 37421.48996144533,
967
+ "timing/total_ms": 47847.89030998945,
968
+ "tokens/completion": 1180.71875,
969
+ "tokens/masked_fraction": 0.0,
970
+ "wall_clock/generate_s": 93.57710862159729
971
+ },
972
+ {
973
+ "advantage/absmean": 0.17765624821186066,
974
+ "entropy": 1.0576484203338623,
975
+ "epoch": 0.098,
976
+ "grad_norm": 0.16558979067438845,
977
+ "importance_ratio": 1.0013474225997925,
978
+ "learning_rate": 0.0001,
979
+ "loss": -0.006,
980
+ "mismatch_kl": 0.006344192661345005,
981
+ "reward": 0.6915624737739563,
982
+ "reward/refusal_reward_func": 0.6915624737739563,
983
+ "reward/std": 0.24835848808288574,
984
+ "step": 49,
985
+ "timing/generation_ms": 8644.713327288628,
986
+ "timing/scoring_ms": 35074.74631816149,
987
+ "timing/total_ms": 43719.459645450115,
988
+ "tokens/completion": 984.625,
989
+ "tokens/masked_fraction": 0.0,
990
+ "wall_clock/generate_s": 63.12021732330322
991
+ },
992
+ {
993
+ "advantage/absmean": 0.12312500178813934,
994
+ "entropy": 0.8099650144577026,
995
+ "epoch": 0.1,
996
+ "grad_norm": 0.1253616039663461,
997
+ "importance_ratio": 1.0017279386520386,
998
+ "learning_rate": 0.0001,
999
+ "loss": -0.0052,
1000
+ "mismatch_kl": 0.008134027011692524,
1001
+ "reward": 0.14249999821186066,
1002
+ "reward/refusal_reward_func": 0.14249999821186066,
1003
+ "reward/std": 0.1628841608762741,
1004
+ "step": 50,
1005
+ "timing/generation_ms": 8775.396101176739,
1006
+ "timing/scoring_ms": 40764.19186592102,
1007
+ "timing/total_ms": 49539.58796709776,
1008
+ "tokens/completion": 1000.1875,
1009
+ "tokens/masked_fraction": 0.0,
1010
+ "wall_clock/generate_s": 70.7229483127594
1011
+ },
1012
+ {
1013
+ "advantage/absmean": 0.04843749850988388,
1014
+ "entropy": 0.8692086338996887,
1015
+ "epoch": 0.102,
1016
+ "grad_norm": 0.016399389830222273,
1017
+ "importance_ratio": 1.001124620437622,
1018
+ "learning_rate": 0.0001,
1019
+ "loss": -0.0013,
1020
+ "mismatch_kl": 0.010322043672204018,
1021
+ "reward": 0.7849999666213989,
1022
+ "reward/refusal_reward_func": 0.7849999666213989,
1023
+ "reward/std": 0.13919411599636078,
1024
+ "step": 51,
1025
+ "timing/generation_ms": 14720.608927309513,
1026
+ "timing/scoring_ms": 41765.84377884865,
1027
+ "timing/total_ms": 56486.45270615816,
1028
+ "tokens/completion": 1616.15625,
1029
+ "tokens/masked_fraction": 0.0,
1030
+ "wall_clock/generate_s": 89.23512148857117
1031
+ },
1032
+ {
1033
+ "advantage/absmean": 0.2948242425918579,
1034
+ "entropy": 1.1107399463653564,
1035
+ "epoch": 0.104,
1036
+ "grad_norm": 0.18323432452024488,
1037
+ "importance_ratio": 1.0023913383483887,
1038
+ "learning_rate": 0.0001,
1039
+ "loss": 0.0017,
1040
+ "mismatch_kl": 0.0076919617131352425,
1041
+ "reward": 0.32218748331069946,
1042
+ "reward/refusal_reward_func": 0.32218748331069946,
1043
+ "reward/std": 0.3130169212818146,
1044
+ "step": 52,
1045
+ "timing/generation_ms": 15175.773054361343,
1046
+ "timing/scoring_ms": 57348.27160835266,
1047
+ "timing/total_ms": 72524.044662714,
1048
+ "tokens/completion": 1662.0,
1049
+ "tokens/masked_fraction": 0.0,
1050
+ "wall_clock/generate_s": 182.7874722480774
1051
+ },
1052
+ {
1053
+ "advantage/absmean": 0.15343749523162842,
1054
+ "entropy": 0.9647155404090881,
1055
+ "epoch": 0.106,
1056
+ "grad_norm": 0.1470849270360251,
1057
+ "importance_ratio": 1.0000090599060059,
1058
+ "learning_rate": 0.0001,
1059
+ "loss": 0.0102,
1060
+ "mismatch_kl": 0.00862339697778225,
1061
+ "reward": 0.6565625071525574,
1062
+ "reward/refusal_reward_func": 0.6565625071525574,
1063
+ "reward/std": 0.22976359724998474,
1064
+ "step": 53,
1065
+ "timing/generation_ms": 10986.380942165852,
1066
+ "timing/scoring_ms": 38820.41800022125,
1067
+ "timing/total_ms": 49806.798942387104,
1068
+ "tokens/completion": 1229.09375,
1069
+ "tokens/masked_fraction": 0.0,
1070
+ "wall_clock/generate_s": 157.4189648628235
1071
+ },
1072
+ {
1073
+ "advantage/absmean": 0.04843749850988388,
1074
+ "entropy": 0.8174250721931458,
1075
+ "epoch": 0.108,
1076
+ "grad_norm": 0.017013624591187354,
1077
+ "importance_ratio": 1.000490427017212,
1078
+ "learning_rate": 0.0001,
1079
+ "loss": -0.0008,
1080
+ "mismatch_kl": 0.004199406132102013,
1081
+ "reward": 0.7849999666213989,
1082
+ "reward/refusal_reward_func": 0.7849999666213989,
1083
+ "reward/std": 0.13919411599636078,
1084
+ "step": 54,
1085
+ "timing/generation_ms": 13110.579743981361,
1086
+ "timing/scoring_ms": 39618.385925889015,
1087
+ "timing/total_ms": 52728.96566987038,
1088
+ "tokens/completion": 1453.71875,
1089
+ "tokens/masked_fraction": 0.0,
1090
+ "wall_clock/generate_s": 159.44240832328796
1091
+ },
1092
+ {
1093
+ "advantage/absmean": 0.07984375208616257,
1094
+ "entropy": 0.8634753227233887,
1095
+ "epoch": 0.11,
1096
+ "grad_norm": 0.06408238305252761,
1097
+ "importance_ratio": 1.0001016855239868,
1098
+ "learning_rate": 0.0001,
1099
+ "loss": -0.0018,
1100
+ "mismatch_kl": 0.006974226329475641,
1101
+ "reward": 0.7643749713897705,
1102
+ "reward/refusal_reward_func": 0.7643749713897705,
1103
+ "reward/std": 0.15140874683856964,
1104
+ "step": 55,
1105
+ "timing/generation_ms": 13839.490927755833,
1106
+ "timing/scoring_ms": 43094.41144019365,
1107
+ "timing/total_ms": 56933.902367949486,
1108
+ "tokens/completion": 1539.5625,
1109
+ "tokens/masked_fraction": 0.0,
1110
+ "wall_clock/generate_s": 162.71979093551636
1111
+ },
1112
+ {
1113
+ "advantage/absmean": 0.22667968273162842,
1114
+ "entropy": 0.7791604399681091,
1115
+ "epoch": 0.112,
1116
+ "grad_norm": 0.16725001444914697,
1117
+ "importance_ratio": 1.0006935596466064,
1118
+ "learning_rate": 0.0001,
1119
+ "loss": -0.0115,
1120
+ "mismatch_kl": 0.007046831306070089,
1121
+ "reward": 0.47468751668930054,
1122
+ "reward/refusal_reward_func": 0.47468751668930054,
1123
+ "reward/std": 0.2708203196525574,
1124
+ "step": 56,
1125
+ "timing/generation_ms": 16662.443839013577,
1126
+ "timing/scoring_ms": 45676.41341686249,
1127
+ "timing/total_ms": 62338.857255876064,
1128
+ "tokens/completion": 1816.03125,
1129
+ "tokens/masked_fraction": 0.0,
1130
+ "wall_clock/generate_s": 124.58920621871948
1131
+ },
1132
+ {
1133
+ "advantage/absmean": 0.11601562052965164,
1134
+ "entropy": 0.7439562678337097,
1135
+ "epoch": 0.114,
1136
+ "grad_norm": 0.13019703908354843,
1137
+ "importance_ratio": 1.0008944272994995,
1138
+ "learning_rate": 0.0001,
1139
+ "loss": -0.0026,
1140
+ "mismatch_kl": 0.009258158504962921,
1141
+ "reward": 0.7171875238418579,
1142
+ "reward/refusal_reward_func": 0.7171875238418579,
1143
+ "reward/std": 0.16097815334796906,
1144
+ "step": 57,
1145
+ "timing/generation_ms": 12729.440599679947,
1146
+ "timing/scoring_ms": 44226.92193090916,
1147
+ "timing/total_ms": 56956.362530589104,
1148
+ "tokens/completion": 1423.09375,
1149
+ "tokens/masked_fraction": 0.0,
1150
+ "wall_clock/generate_s": 158.57960319519043
1151
+ },
1152
+ {
1153
+ "advantage/absmean": 0.15726563334465027,
1154
+ "entropy": 0.8287293910980225,
1155
+ "epoch": 0.116,
1156
+ "grad_norm": 0.14055232526730846,
1157
+ "importance_ratio": 1.001518726348877,
1158
+ "learning_rate": 0.0001,
1159
+ "loss": -0.0132,
1160
+ "mismatch_kl": 0.008714662864804268,
1161
+ "reward": 0.6956250071525574,
1162
+ "reward/refusal_reward_func": 0.6956250071525574,
1163
+ "reward/std": 0.2274579405784607,
1164
+ "step": 58,
1165
+ "timing/generation_ms": 10148.803442716599,
1166
+ "timing/scoring_ms": 45419.282242655754,
1167
+ "timing/total_ms": 55568.08568537235,
1168
+ "tokens/completion": 1146.34375,
1169
+ "tokens/masked_fraction": 0.0,
1170
+ "wall_clock/generate_s": 88.32079148292542
1171
+ },
1172
+ {
1173
+ "advantage/absmean": 0.18738281726837158,
1174
+ "entropy": 0.6767469644546509,
1175
+ "epoch": 0.118,
1176
+ "grad_norm": 0.1800028526601889,
1177
+ "importance_ratio": 0.9996235370635986,
1178
+ "learning_rate": 0.0001,
1179
+ "loss": -0.0004,
1180
+ "mismatch_kl": 0.00572241609916091,
1181
+ "reward": 0.6946874856948853,
1182
+ "reward/refusal_reward_func": 0.6946874856948853,
1183
+ "reward/std": 0.26609423756599426,
1184
+ "step": 59,
1185
+ "timing/generation_ms": 5857.890740036964,
1186
+ "timing/scoring_ms": 28364.16070908308,
1187
+ "timing/total_ms": 34222.051449120045,
1188
+ "tokens/completion": 674.78125,
1189
+ "tokens/masked_fraction": 0.0,
1190
+ "wall_clock/generate_s": 153.28042459487915
1191
+ },
1192
+ {
1193
+ "advantage/absmean": 0.04843749850988388,
1194
+ "entropy": 0.48612505197525024,
1195
+ "epoch": 0.12,
1196
+ "grad_norm": 0.017636908815215933,
1197
+ "importance_ratio": 1.001957893371582,
1198
+ "learning_rate": 0.0001,
1199
+ "loss": -0.0003,
1200
+ "mismatch_kl": 0.0077150240540504456,
1201
+ "reward": 0.7849999666213989,
1202
+ "reward/refusal_reward_func": 0.7849999666213989,
1203
+ "reward/std": 0.13919411599636078,
1204
+ "step": 60,
1205
+ "timing/generation_ms": 13039.215676486492,
1206
+ "timing/scoring_ms": 37324.29302483797,
1207
+ "timing/total_ms": 50363.50870132446,
1208
+ "tokens/completion": 1435.25,
1209
+ "tokens/masked_fraction": 0.0,
1210
+ "wall_clock/generate_s": 155.66002011299133
1211
+ },
1212
+ {
1213
+ "advantage/absmean": 0.033906251192092896,
1214
+ "entropy": 0.8923248648643494,
1215
+ "epoch": 0.122,
1216
+ "grad_norm": 0.01138046317613093,
1217
+ "importance_ratio": 1.0010781288146973,
1218
+ "learning_rate": 0.0001,
1219
+ "loss": -0.0001,
1220
+ "mismatch_kl": 0.009470692835748196,
1221
+ "reward": 0.7925000190734863,
1222
+ "reward/refusal_reward_func": 0.7925000190734863,
1223
+ "reward/std": 0.09743587672710419,
1224
+ "step": 61,
1225
+ "timing/generation_ms": 20387.244410812855,
1226
+ "timing/scoring_ms": 49000.582568347454,
1227
+ "timing/total_ms": 69387.82697916031,
1228
+ "tokens/completion": 2047.65625,
1229
+ "tokens/masked_fraction": 0.0,
1230
+ "wall_clock/generate_s": 179.17891383171082
1231
+ },
1232
+ {
1233
+ "advantage/absmean": 0.14195312559604645,
1234
+ "entropy": 0.6803052425384521,
1235
+ "epoch": 0.124,
1236
+ "grad_norm": 0.1396860758416004,
1237
+ "importance_ratio": 1.0004676580429077,
1238
+ "learning_rate": 0.0001,
1239
+ "loss": 0.0005,
1240
+ "mismatch_kl": 0.009484711103141308,
1241
+ "reward": 0.7112500071525574,
1242
+ "reward/refusal_reward_func": 0.7112500071525574,
1243
+ "reward/std": 0.19915054738521576,
1244
+ "step": 62,
1245
+ "timing/generation_ms": 20013.910226523876,
1246
+ "timing/scoring_ms": 51556.99533224106,
1247
+ "timing/total_ms": 71570.90555876493,
1248
+ "tokens/completion": 2029.46875,
1249
+ "tokens/masked_fraction": 0.0,
1250
+ "wall_clock/generate_s": 281.4015655517578
1251
+ },
1252
+ {
1253
+ "advantage/absmean": 0.14208984375,
1254
+ "entropy": 0.7867326736450195,
1255
+ "epoch": 0.126,
1256
+ "grad_norm": 0.07770627410364793,
1257
+ "importance_ratio": 1.0014797449111938,
1258
+ "learning_rate": 0.0001,
1259
+ "loss": -0.0003,
1260
+ "mismatch_kl": 0.014644050039350986,
1261
+ "reward": 0.6584374904632568,
1262
+ "reward/refusal_reward_func": 0.6584374904632568,
1263
+ "reward/std": 0.21947535872459412,
1264
+ "step": 63,
1265
+ "timing/generation_ms": 19855.214461684227,
1266
+ "timing/scoring_ms": 52655.76823055744,
1267
+ "timing/total_ms": 72510.98269224167,
1268
+ "tokens/completion": 2026.1875,
1269
+ "tokens/masked_fraction": 0.0,
1270
+ "wall_clock/generate_s": 168.5356569290161
1271
+ },
1272
+ {
1273
+ "advantage/absmean": 0.09375,
1274
+ "entropy": 0.722605288028717,
1275
+ "epoch": 0.128,
1276
+ "grad_norm": 0.10541623752837016,
1277
+ "importance_ratio": 1.0017896890640259,
1278
+ "learning_rate": 0.0001,
1279
+ "loss": 0.0002,
1280
+ "mismatch_kl": 0.009318462572991848,
1281
+ "reward": 0.7599999904632568,
1282
+ "reward/refusal_reward_func": 0.7599999904632568,
1283
+ "reward/std": 0.19364915788173676,
1284
+ "step": 64,
1285
+ "timing/generation_ms": 20329.02915775776,
1286
+ "timing/scoring_ms": 43974.1270840168,
1287
+ "timing/total_ms": 64303.15624177456,
1288
+ "tokens/completion": 2046.09375,
1289
+ "tokens/masked_fraction": 0.0,
1290
+ "wall_clock/generate_s": 69.69067120552063
1291
+ },
1292
+ {
1293
+ "advantage/absmean": 0.13593751192092896,
1294
+ "entropy": 0.6431602239608765,
1295
+ "epoch": 0.13,
1296
+ "grad_norm": 0.04996864946933639,
1297
+ "importance_ratio": 1.0007299184799194,
1298
+ "learning_rate": 0.0001,
1299
+ "loss": 0.0042,
1300
+ "mismatch_kl": 0.01128534134477377,
1301
+ "reward": 0.7350000143051147,
1302
+ "reward/refusal_reward_func": 0.7350000143051147,
1303
+ "reward/std": 0.23318447172641754,
1304
+ "step": 65,
1305
+ "timing/generation_ms": 17292.446829378605,
1306
+ "timing/scoring_ms": 48005.3500905633,
1307
+ "timing/total_ms": 65297.7969199419,
1308
+ "tokens/completion": 1868.75,
1309
+ "tokens/masked_fraction": 0.0,
1310
+ "wall_clock/generate_s": 137.745671749115
1311
+ },
1312
+ {
1313
+ "advantage/absmean": 0.043593745678663254,
1314
+ "entropy": 0.6788095831871033,
1315
+ "epoch": 0.132,
1316
+ "grad_norm": 0.1104672773690437,
1317
+ "importance_ratio": 1.0008015632629395,
1318
+ "learning_rate": 0.0001,
1319
+ "loss": -0.0,
1320
+ "mismatch_kl": 0.011132912710309029,
1321
+ "reward": 0.7875000238418579,
1322
+ "reward/refusal_reward_func": 0.7875000238418579,
1323
+ "reward/std": 0.1252746880054474,
1324
+ "step": 66,
1325
+ "timing/generation_ms": 20402.18196809292,
1326
+ "timing/scoring_ms": 42341.174609959126,
1327
+ "timing/total_ms": 62743.356578052044,
1328
+ "tokens/completion": 2048.0,
1329
+ "tokens/masked_fraction": 0.0,
1330
+ "wall_clock/generate_s": 162.58158588409424
1331
+ },
1332
+ {
1333
+ "advantage/absmean": 0.05214843899011612,
1334
+ "entropy": 0.5562156438827515,
1335
+ "epoch": 0.134,
1336
+ "grad_norm": 0.016212117765327168,
1337
+ "importance_ratio": 0.9997291564941406,
1338
+ "learning_rate": 0.0001,
1339
+ "loss": 0.0,
1340
+ "mismatch_kl": 0.008888973854482174,
1341
+ "reward": 0.7821874618530273,
1342
+ "reward/refusal_reward_func": 0.7821874618530273,
1343
+ "reward/std": 0.1277872771024704,
1344
+ "step": 67,
1345
+ "timing/generation_ms": 20475.26439279318,
1346
+ "timing/scoring_ms": 46698.052957654,
1347
+ "timing/total_ms": 67173.31735044718,
1348
+ "tokens/completion": 2048.0,
1349
+ "tokens/masked_fraction": 0.0,
1350
+ "wall_clock/generate_s": 151.99999165534973
1351
+ },
1352
+ {
1353
+ "advantage/absmean": 0.08964844048023224,
1354
+ "entropy": 0.6966589093208313,
1355
+ "epoch": 0.136,
1356
+ "grad_norm": 0.1577321206922815,
1357
+ "importance_ratio": 1.0010697841644287,
1358
+ "learning_rate": 0.0001,
1359
+ "loss": -0.0002,
1360
+ "mismatch_kl": 0.00940707977861166,
1361
+ "reward": 0.7621874809265137,
1362
+ "reward/refusal_reward_func": 0.7621874809265137,
1363
+ "reward/std": 0.18551842868328094,
1364
+ "step": 68,
1365
+ "timing/generation_ms": 20498.32931160927,
1366
+ "timing/scoring_ms": 54109.15730148554,
1367
+ "timing/total_ms": 74607.4866130948,
1368
+ "tokens/completion": 2048.0,
1369
+ "tokens/masked_fraction": 0.0,
1370
+ "wall_clock/generate_s": 395.203547000885
1371
+ },
1372
+ {
1373
+ "advantage/absmean": 0.27099609375,
1374
+ "entropy": 0.5135282278060913,
1375
+ "epoch": 0.138,
1376
+ "grad_norm": 0.2079259512896872,
1377
+ "importance_ratio": 1.0020484924316406,
1378
+ "learning_rate": 0.0001,
1379
+ "loss": -0.0,
1380
+ "mismatch_kl": 0.012067537754774094,
1381
+ "reward": 0.6365625262260437,
1382
+ "reward/refusal_reward_func": 0.6365625262260437,
1383
+ "reward/std": 0.32783961296081543,
1384
+ "step": 69,
1385
+ "timing/generation_ms": 20592.435374855995,
1386
+ "timing/scoring_ms": 59319.189973175526,
1387
+ "timing/total_ms": 79911.62534803152,
1388
+ "tokens/completion": 2048.0,
1389
+ "tokens/masked_fraction": 0.0,
1390
+ "wall_clock/generate_s": 397.43536710739136
1391
+ },
1392
+ {
1393
+ "advantage/absmean": 0.0,
1394
+ "entropy": 0.4793013036251068,
1395
+ "epoch": 0.14,
1396
+ "grad_norm": 0.0,
1397
+ "importance_ratio": 1.0002073049545288,
1398
+ "learning_rate": 0.0001,
1399
+ "loss": 0.0,
1400
+ "mismatch_kl": 0.01224527694284916,
1401
+ "reward": 0.8100000023841858,
1402
+ "reward/refusal_reward_func": 0.8100000023841858,
1403
+ "reward/std": 0.0,
1404
+ "step": 70,
1405
+ "timing/generation_ms": 20546.609550714493,
1406
+ "timing/scoring_ms": 42320.83362340927,
1407
+ "timing/total_ms": 62867.443174123764,
1408
+ "tokens/completion": 2048.0,
1409
+ "tokens/masked_fraction": 0.0,
1410
+ "wall_clock/generate_s": 161.7110676765442
1411
+ },
1412
+ {
1413
+ "advantage/absmean": 0.053593751043081284,
1414
+ "entropy": 0.35620445013046265,
1415
+ "epoch": 0.142,
1416
+ "grad_norm": 0.06178490851261073,
1417
+ "importance_ratio": 1.0008577108383179,
1418
+ "learning_rate": 0.0001,
1419
+ "loss": 0.0001,
1420
+ "mismatch_kl": 0.007034921087324619,
1421
+ "reward": 0.7793750166893005,
1422
+ "reward/refusal_reward_func": 0.7793750166893005,
1423
+ "reward/std": 0.08525467664003372,
1424
+ "step": 71,
1425
+ "timing/generation_ms": 20559.83528494835,
1426
+ "timing/scoring_ms": 50445.407539606094,
1427
+ "timing/total_ms": 71005.24282455444,
1428
+ "tokens/completion": 2047.65625,
1429
+ "tokens/masked_fraction": 0.0,
1430
+ "wall_clock/generate_s": 165.77110528945923
1431
+ },
1432
+ {
1433
+ "advantage/absmean": 0.06621094048023224,
1434
+ "entropy": 0.34386953711509705,
1435
+ "epoch": 0.144,
1436
+ "grad_norm": 0.06068478204359807,
1437
+ "importance_ratio": 1.0005115270614624,
1438
+ "learning_rate": 0.0001,
1439
+ "loss": 0.0001,
1440
+ "mismatch_kl": 0.006632550619542599,
1441
+ "reward": 0.7746874690055847,
1442
+ "reward/refusal_reward_func": 0.7746874690055847,
1443
+ "reward/std": 0.14985378086566925,
1444
+ "step": 72,
1445
+ "timing/generation_ms": 20651.6492664814,
1446
+ "timing/scoring_ms": 53449.473068118095,
1447
+ "timing/total_ms": 74101.1223345995,
1448
+ "tokens/completion": 2048.0,
1449
+ "tokens/masked_fraction": 0.0,
1450
+ "wall_clock/generate_s": 396.06008672714233
1451
+ },
1452
+ {
1453
+ "advantage/absmean": 0.08964844048023224,
1454
+ "entropy": 0.3689025640487671,
1455
+ "epoch": 0.146,
1456
+ "grad_norm": 0.028592002076965557,
1457
+ "importance_ratio": 1.0001220703125,
1458
+ "learning_rate": 0.0001,
1459
+ "loss": 0.0,
1460
+ "mismatch_kl": 0.009937528520822525,
1461
+ "reward": 0.7621874809265137,
1462
+ "reward/refusal_reward_func": 0.7621874809265137,
1463
+ "reward/std": 0.18551842868328094,
1464
+ "step": 73,
1465
+ "timing/generation_ms": 20704.90287989378,
1466
+ "timing/scoring_ms": 56059.35876071453,
1467
+ "timing/total_ms": 76764.26164060831,
1468
+ "tokens/completion": 2048.0,
1469
+ "tokens/masked_fraction": 0.0,
1470
+ "wall_clock/generate_s": 395.15700674057007
1471
+ },
1472
+ {
1473
+ "advantage/absmean": 0.10025390982627869,
1474
+ "entropy": 0.3531518578529358,
1475
+ "epoch": 0.148,
1476
+ "grad_norm": 0.0966128922725595,
1477
+ "importance_ratio": 0.9996236562728882,
1478
+ "learning_rate": 0.0001,
1479
+ "loss": -0.0001,
1480
+ "mismatch_kl": 0.00807525310665369,
1481
+ "reward": 0.754687488079071,
1482
+ "reward/refusal_reward_func": 0.754687488079071,
1483
+ "reward/std": 0.19453445076942444,
1484
+ "step": 74,
1485
+ "timing/generation_ms": 20194.012761116028,
1486
+ "timing/scoring_ms": 45799.67290908098,
1487
+ "timing/total_ms": 65993.68567019701,
1488
+ "tokens/completion": 2048.0,
1489
+ "tokens/masked_fraction": 0.0,
1490
+ "wall_clock/generate_s": 150.22299551963806
1491
+ },
1492
+ {
1493
+ "advantage/absmean": 0.08964844048023224,
1494
+ "entropy": 0.3910990059375763,
1495
+ "epoch": 0.15,
1496
+ "grad_norm": 0.14646197667696922,
1497
+ "importance_ratio": 0.999146580696106,
1498
+ "learning_rate": 0.0001,
1499
+ "loss": -0.0003,
1500
+ "mismatch_kl": 0.008608575910329819,
1501
+ "reward": 0.7621874809265137,
1502
+ "reward/refusal_reward_func": 0.7621874809265137,
1503
+ "reward/std": 0.18551842868328094,
1504
+ "step": 75,
1505
+ "timing/generation_ms": 20237.79760301113,
1506
+ "timing/scoring_ms": 51105.7443395257,
1507
+ "timing/total_ms": 71343.54194253683,
1508
+ "tokens/completion": 2048.0,
1509
+ "tokens/masked_fraction": 0.0,
1510
+ "wall_clock/generate_s": 394.9270164966583
1511
+ },
1512
+ {
1513
+ "advantage/absmean": 0.13537108898162842,
1514
+ "entropy": 0.2564987540245056,
1515
+ "epoch": 0.152,
1516
+ "grad_norm": 0.10767344052989877,
1517
+ "importance_ratio": 1.000230073928833,
1518
+ "learning_rate": 0.0001,
1519
+ "loss": 0.0002,
1520
+ "mismatch_kl": 0.00910898856818676,
1521
+ "reward": 0.7353124618530273,
1522
+ "reward/refusal_reward_func": 0.7353124618530273,
1523
+ "reward/std": 0.23228463530540466,
1524
+ "step": 76,
1525
+ "timing/generation_ms": 20231.063432991505,
1526
+ "timing/scoring_ms": 64505.957297980785,
1527
+ "timing/total_ms": 84737.02073097229,
1528
+ "tokens/completion": 2048.0,
1529
+ "tokens/masked_fraction": 0.0,
1530
+ "wall_clock/generate_s": 394.6906681060791
1531
+ },
1532
+ {
1533
+ "advantage/absmean": 0.04843749850988388,
1534
+ "entropy": 0.35127270221710205,
1535
+ "epoch": 0.154,
1536
+ "grad_norm": 0.05430162481667564,
1537
+ "importance_ratio": 1.001720666885376,
1538
+ "learning_rate": 0.0001,
1539
+ "loss": -0.0112,
1540
+ "mismatch_kl": 0.02907688170671463,
1541
+ "reward": 0.7849999666213989,
1542
+ "reward/refusal_reward_func": 0.7849999666213989,
1543
+ "reward/std": 0.13919411599636078,
1544
+ "step": 77,
1545
+ "timing/generation_ms": 2299.7111305594444,
1546
+ "timing/scoring_ms": 24145.363181829453,
1547
+ "timing/total_ms": 26445.074312388897,
1548
+ "tokens/completion": 256.90625,
1549
+ "tokens/masked_fraction": 0.0,
1550
+ "wall_clock/generate_s": 43.695470094680786
1551
+ },
1552
+ {
1553
+ "advantage/absmean": 0.17824219167232513,
1554
+ "entropy": 0.2108859121799469,
1555
+ "epoch": 0.156,
1556
+ "grad_norm": 0.1343157239601839,
1557
+ "importance_ratio": 1.000138521194458,
1558
+ "learning_rate": 0.0001,
1559
+ "loss": -0.0061,
1560
+ "mismatch_kl": 0.006567842327058315,
1561
+ "reward": 0.7043750286102295,
1562
+ "reward/refusal_reward_func": 0.7043750286102295,
1563
+ "reward/std": 0.26504644751548767,
1564
+ "step": 78,
1565
+ "timing/generation_ms": 19559.65828895569,
1566
+ "timing/scoring_ms": 56106.24121129513,
1567
+ "timing/total_ms": 75665.89950025082,
1568
+ "tokens/completion": 2012.8125,
1569
+ "tokens/masked_fraction": 0.0,
1570
+ "wall_clock/generate_s": 397.83462166786194
1571
+ },
1572
+ {
1573
+ "advantage/absmean": 0.14109376072883606,
1574
+ "entropy": 0.2118162214756012,
1575
+ "epoch": 0.158,
1576
+ "grad_norm": 0.04314766069392161,
1577
+ "importance_ratio": 0.999754011631012,
1578
+ "learning_rate": 0.0001,
1579
+ "loss": 0.0,
1580
+ "mismatch_kl": 0.005713644903153181,
1581
+ "reward": 0.7293750047683716,
1582
+ "reward/refusal_reward_func": 0.7293750047683716,
1583
+ "reward/std": 0.23431998491287231,
1584
+ "step": 79,
1585
+ "timing/generation_ms": 20274.492114782333,
1586
+ "timing/scoring_ms": 53302.132822573185,
1587
+ "timing/total_ms": 73576.62493735552,
1588
+ "tokens/completion": 2048.0,
1589
+ "tokens/masked_fraction": 0.0,
1590
+ "wall_clock/generate_s": 396.22464632987976
1591
+ },
1592
+ {
1593
+ "advantage/absmean": 0.13593749701976776,
1594
+ "entropy": 0.32609474658966064,
1595
+ "epoch": 0.16,
1596
+ "grad_norm": 0.10083540436117842,
1597
+ "importance_ratio": 1.0001963376998901,
1598
+ "learning_rate": 0.0001,
1599
+ "loss": -0.0002,
1600
+ "mismatch_kl": 0.008770663291215897,
1601
+ "reward": 0.7350000143051147,
1602
+ "reward/refusal_reward_func": 0.7350000143051147,
1603
+ "reward/std": 0.23318448662757874,
1604
+ "step": 80,
1605
+ "timing/generation_ms": 20281.54794126749,
1606
+ "timing/scoring_ms": 45572.31470942497,
1607
+ "timing/total_ms": 65853.86265069246,
1608
+ "tokens/completion": 2048.0,
1609
+ "tokens/masked_fraction": 0.0,
1610
+ "wall_clock/generate_s": 157.70197463035583
1611
+ },
1612
+ {
1613
+ "advantage/absmean": 0.13197265565395355,
1614
+ "entropy": 0.349658727645874,
1615
+ "epoch": 0.162,
1616
+ "grad_norm": 0.15801403288716265,
1617
+ "importance_ratio": 0.9997016191482544,
1618
+ "learning_rate": 0.0001,
1619
+ "loss": 0.0003,
1620
+ "mismatch_kl": 0.007918323390185833,
1621
+ "reward": 0.7371875047683716,
1622
+ "reward/refusal_reward_func": 0.7371875047683716,
1623
+ "reward/std": 0.22671890258789062,
1624
+ "step": 81,
1625
+ "timing/generation_ms": 19729.195773601532,
1626
+ "timing/scoring_ms": 52965.313747525215,
1627
+ "timing/total_ms": 72694.50952112675,
1628
+ "tokens/completion": 2031.71875,
1629
+ "tokens/masked_fraction": 0.0,
1630
+ "wall_clock/generate_s": 395.87234902381897
1631
+ },
1632
+ {
1633
+ "advantage/absmean": 0.19093748927116394,
1634
+ "entropy": 0.30307242274284363,
1635
+ "epoch": 0.164,
1636
+ "grad_norm": 0.05658978916858572,
1637
+ "importance_ratio": 0.9997415542602539,
1638
+ "learning_rate": 0.0001,
1639
+ "loss": 0.0,
1640
+ "mismatch_kl": 0.009000571444630623,
1641
+ "reward": 0.6924999952316284,
1642
+ "reward/refusal_reward_func": 0.6924999952316284,
1643
+ "reward/std": 0.2553306818008423,
1644
+ "step": 82,
1645
+ "timing/generation_ms": 20152.134649455547,
1646
+ "timing/scoring_ms": 55147.81706035137,
1647
+ "timing/total_ms": 75299.95170980692,
1648
+ "tokens/completion": 2048.0,
1649
+ "tokens/masked_fraction": 0.0,
1650
+ "wall_clock/generate_s": 276.00095558166504
1651
+ },
1652
+ {
1653
+ "advantage/absmean": 0.13197265565395355,
1654
+ "entropy": 0.2418041229248047,
1655
+ "epoch": 0.166,
1656
+ "grad_norm": 0.09124961991568047,
1657
+ "importance_ratio": 0.9993461966514587,
1658
+ "learning_rate": 0.0001,
1659
+ "loss": 0.0,
1660
+ "mismatch_kl": 0.008319162763655186,
1661
+ "reward": 0.7371875047683716,
1662
+ "reward/refusal_reward_func": 0.7371875047683716,
1663
+ "reward/std": 0.22671890258789062,
1664
+ "step": 83,
1665
+ "timing/generation_ms": 20235.445871949196,
1666
+ "timing/scoring_ms": 60683.41539800167,
1667
+ "timing/total_ms": 80918.86126995087,
1668
+ "tokens/completion": 2048.0,
1669
+ "tokens/masked_fraction": 0.0,
1670
+ "wall_clock/generate_s": 394.98225951194763
1671
+ },
1672
+ {
1673
+ "advantage/absmean": 0.09492187201976776,
1674
+ "entropy": 0.36278918385505676,
1675
+ "epoch": 0.168,
1676
+ "grad_norm": 0.10428837192295014,
1677
+ "importance_ratio": 1.0009691715240479,
1678
+ "learning_rate": 0.0001,
1679
+ "loss": 0.0001,
1680
+ "mismatch_kl": 0.006821990944445133,
1681
+ "reward": 0.7593749761581421,
1682
+ "reward/refusal_reward_func": 0.7593749761581421,
1683
+ "reward/std": 0.19606979191303253,
1684
+ "step": 84,
1685
+ "timing/generation_ms": 20079.659663140774,
1686
+ "timing/scoring_ms": 62456.88313245773,
1687
+ "timing/total_ms": 82536.5427955985,
1688
+ "tokens/completion": 2047.4375,
1689
+ "tokens/masked_fraction": 0.0,
1690
+ "wall_clock/generate_s": 395.080442905426
1691
+ },
1692
+ {
1693
+ "advantage/absmean": 0.16296875476837158,
1694
+ "entropy": 0.2741187810897827,
1695
+ "epoch": 0.17,
1696
+ "grad_norm": 0.05559178148711944,
1697
+ "importance_ratio": 1.0001330375671387,
1698
+ "learning_rate": 0.0001,
1699
+ "loss": 0.0002,
1700
+ "mismatch_kl": 0.008196860551834106,
1701
+ "reward": 0.7168750166893005,
1702
+ "reward/refusal_reward_func": 0.7168750166893005,
1703
+ "reward/std": 0.2492668777704239,
1704
+ "step": 85,
1705
+ "timing/generation_ms": 20191.432282328606,
1706
+ "timing/scoring_ms": 66640.07867872715,
1707
+ "timing/total_ms": 86831.51096105576,
1708
+ "tokens/completion": 2048.0,
1709
+ "tokens/masked_fraction": 0.0,
1710
+ "wall_clock/generate_s": 395.3853757381439
1711
+ },
1712
+ {
1713
+ "advantage/absmean": 0.06513672322034836,
1714
+ "entropy": 0.3744433522224426,
1715
+ "epoch": 0.172,
1716
+ "grad_norm": 0.02027358693373687,
1717
+ "importance_ratio": 0.9997438192367554,
1718
+ "learning_rate": 0.0001,
1719
+ "loss": 0.0001,
1720
+ "mismatch_kl": 0.0079119261354208,
1721
+ "reward": 0.7740625143051147,
1722
+ "reward/refusal_reward_func": 0.7740625143051147,
1723
+ "reward/std": 0.1449754238128662,
1724
+ "step": 86,
1725
+ "timing/generation_ms": 20130.67189604044,
1726
+ "timing/scoring_ms": 63243.81287395954,
1727
+ "timing/total_ms": 83374.48476999998,
1728
+ "tokens/completion": 2048.0,
1729
+ "tokens/masked_fraction": 0.0,
1730
+ "wall_clock/generate_s": 394.553094625473
1731
+ },
1732
+ {
1733
+ "advantage/absmean": 0.06621094048023224,
1734
+ "entropy": 0.2802242636680603,
1735
+ "epoch": 0.174,
1736
+ "grad_norm": 0.11007226741752094,
1737
+ "importance_ratio": 1.0002408027648926,
1738
+ "learning_rate": 0.0001,
1739
+ "loss": -0.0,
1740
+ "mismatch_kl": 0.010727161541581154,
1741
+ "reward": 0.7746875286102295,
1742
+ "reward/refusal_reward_func": 0.7746875286102295,
1743
+ "reward/std": 0.14985376596450806,
1744
+ "step": 87,
1745
+ "timing/generation_ms": 20197.582133114338,
1746
+ "timing/scoring_ms": 60626.82098895311,
1747
+ "timing/total_ms": 80824.40312206745,
1748
+ "tokens/completion": 2048.0,
1749
+ "tokens/masked_fraction": 0.0,
1750
+ "wall_clock/generate_s": 394.6186418533325
1751
+ },
1752
+ {
1753
+ "advantage/absmean": 0.15476563572883606,
1754
+ "entropy": 0.15260648727416992,
1755
+ "epoch": 0.176,
1756
+ "grad_norm": 0.09113868718318835,
1757
+ "importance_ratio": 0.9991167187690735,
1758
+ "learning_rate": 0.0001,
1759
+ "loss": -0.0,
1760
+ "mismatch_kl": 0.007635745219886303,
1761
+ "reward": 0.7215625047683716,
1762
+ "reward/refusal_reward_func": 0.7215625047683716,
1763
+ "reward/std": 0.2370404750108719,
1764
+ "step": 88,
1765
+ "timing/generation_ms": 20186.022453010082,
1766
+ "timing/scoring_ms": 65319.87015157938,
1767
+ "timing/total_ms": 85505.89260458946,
1768
+ "tokens/completion": 2048.0,
1769
+ "tokens/masked_fraction": 0.0,
1770
+ "wall_clock/generate_s": 394.704843044281
1771
+ },
1772
+ {
1773
+ "advantage/absmean": 0.09375,
1774
+ "entropy": 0.2925874888896942,
1775
+ "epoch": 0.178,
1776
+ "grad_norm": 0.10900817571461202,
1777
+ "importance_ratio": 1.0003485679626465,
1778
+ "learning_rate": 0.0001,
1779
+ "loss": 0.0003,
1780
+ "mismatch_kl": 0.01015115063637495,
1781
+ "reward": 0.7599999904632568,
1782
+ "reward/refusal_reward_func": 0.7599999904632568,
1783
+ "reward/std": 0.19364915788173676,
1784
+ "step": 89,
1785
+ "timing/generation_ms": 19985.6186658144,
1786
+ "timing/scoring_ms": 50727.93058305979,
1787
+ "timing/total_ms": 70713.54924887419,
1788
+ "tokens/completion": 2039.15625,
1789
+ "tokens/masked_fraction": 0.0,
1790
+ "wall_clock/generate_s": 292.40805864334106
1791
+ },
1792
+ {
1793
+ "advantage/absmean": 0.24515625834465027,
1794
+ "entropy": 0.24920716881752014,
1795
+ "epoch": 0.18,
1796
+ "grad_norm": 0.12621044924209393,
1797
+ "importance_ratio": 1.000746250152588,
1798
+ "learning_rate": 0.0001,
1799
+ "loss": 0.0,
1800
+ "mismatch_kl": 0.008628414012491703,
1801
+ "reward": 0.6162500381469727,
1802
+ "reward/refusal_reward_func": 0.6162500381469727,
1803
+ "reward/std": 0.2840307056903839,
1804
+ "step": 90,
1805
+ "timing/generation_ms": 20263.45807313919,
1806
+ "timing/scoring_ms": 69397.22065627575,
1807
+ "timing/total_ms": 89660.67872941494,
1808
+ "tokens/completion": 2048.0,
1809
+ "tokens/masked_fraction": 0.0,
1810
+ "wall_clock/generate_s": 394.5488703250885
1811
+ },
1812
+ {
1813
+ "advantage/absmean": 0.10718750208616257,
1814
+ "entropy": 0.3780635893344879,
1815
+ "epoch": 0.182,
1816
+ "grad_norm": 0.041883076718277436,
1817
+ "importance_ratio": 1.0010097026824951,
1818
+ "learning_rate": 0.0001,
1819
+ "loss": 0.0001,
1820
+ "mismatch_kl": 0.009139418601989746,
1821
+ "reward": 0.7487499713897705,
1822
+ "reward/refusal_reward_func": 0.7487499713897705,
1823
+ "reward/std": 0.1976384073495865,
1824
+ "step": 91,
1825
+ "timing/generation_ms": 20657.530024647713,
1826
+ "timing/scoring_ms": 65376.88625603914,
1827
+ "timing/total_ms": 86034.41628068686,
1828
+ "tokens/completion": 2048.0,
1829
+ "tokens/masked_fraction": 0.0,
1830
+ "wall_clock/generate_s": 395.1356108188629
1831
+ },
1832
+ {
1833
+ "advantage/absmean": 0.13974609971046448,
1834
+ "entropy": 0.3699057102203369,
1835
+ "epoch": 0.184,
1836
+ "grad_norm": 0.06598040996573111,
1837
+ "importance_ratio": 0.9998457431793213,
1838
+ "learning_rate": 0.0001,
1839
+ "loss": 0.0,
1840
+ "mismatch_kl": 0.0077532450668513775,
1841
+ "reward": 0.7271875143051147,
1842
+ "reward/refusal_reward_func": 0.7271875143051147,
1843
+ "reward/std": 0.2168990969657898,
1844
+ "step": 92,
1845
+ "timing/generation_ms": 20616.587534546852,
1846
+ "timing/scoring_ms": 47124.867990612984,
1847
+ "timing/total_ms": 67741.45552515984,
1848
+ "tokens/completion": 2048.0,
1849
+ "tokens/masked_fraction": 0.0,
1850
+ "wall_clock/generate_s": 168.30030918121338
1851
+ },
1852
+ {
1853
+ "advantage/absmean": 0.19189453125,
1854
+ "entropy": 0.34180349111557007,
1855
+ "epoch": 0.186,
1856
+ "grad_norm": 0.14232699624625103,
1857
+ "importance_ratio": 0.999165952205658,
1858
+ "learning_rate": 0.0001,
1859
+ "loss": 0.0003,
1860
+ "mismatch_kl": 0.008044413290917873,
1861
+ "reward": 0.6871874928474426,
1862
+ "reward/refusal_reward_func": 0.6871874928474426,
1863
+ "reward/std": 0.2622169256210327,
1864
+ "step": 93,
1865
+ "timing/generation_ms": 20529.827870428562,
1866
+ "timing/scoring_ms": 60052.588775753975,
1867
+ "timing/total_ms": 80582.41664618254,
1868
+ "tokens/completion": 2048.0,
1869
+ "tokens/masked_fraction": 0.0,
1870
+ "wall_clock/generate_s": 396.85295939445496
1871
+ },
1872
+ {
1873
+ "advantage/absmean": 0.20988282561302185,
1874
+ "entropy": 0.315336674451828,
1875
+ "epoch": 0.188,
1876
+ "grad_norm": 0.10320818511601894,
1877
+ "importance_ratio": 1.000422477722168,
1878
+ "learning_rate": 0.0001,
1879
+ "loss": -0.0,
1880
+ "mismatch_kl": 0.007602104917168617,
1881
+ "reward": 0.6856250166893005,
1882
+ "reward/refusal_reward_func": 0.6856250166893005,
1883
+ "reward/std": 0.28907111287117004,
1884
+ "step": 94,
1885
+ "timing/generation_ms": 20579.548463225365,
1886
+ "timing/scoring_ms": 51010.259330272675,
1887
+ "timing/total_ms": 71589.80779349804,
1888
+ "tokens/completion": 2048.0,
1889
+ "tokens/masked_fraction": 0.0,
1890
+ "wall_clock/generate_s": 395.95818734169006
1891
+ },
1892
+ {
1893
+ "advantage/absmean": 0.15345704555511475,
1894
+ "entropy": 0.5262030363082886,
1895
+ "epoch": 0.19,
1896
+ "grad_norm": 0.05870104388608926,
1897
+ "importance_ratio": 0.9999489188194275,
1898
+ "learning_rate": 0.0001,
1899
+ "loss": 0.0002,
1900
+ "mismatch_kl": 0.0076672472059726715,
1901
+ "reward": 0.7190625071525574,
1902
+ "reward/refusal_reward_func": 0.7190625071525574,
1903
+ "reward/std": 0.2384108603000641,
1904
+ "step": 95,
1905
+ "timing/generation_ms": 20614.634588360786,
1906
+ "timing/scoring_ms": 65994.11156028509,
1907
+ "timing/total_ms": 86608.74614864588,
1908
+ "tokens/completion": 2048.0,
1909
+ "tokens/masked_fraction": 0.0,
1910
+ "wall_clock/generate_s": 395.0603678226471
1911
+ },
1912
+ {
1913
+ "advantage/absmean": 0.23779296875,
1914
+ "entropy": 0.3904465436935425,
1915
+ "epoch": 0.192,
1916
+ "grad_norm": 0.1935352935904031,
1917
+ "importance_ratio": 1.000299334526062,
1918
+ "learning_rate": 0.0001,
1919
+ "loss": 0.0,
1920
+ "mismatch_kl": 0.008834589272737503,
1921
+ "reward": 0.6578124761581421,
1922
+ "reward/refusal_reward_func": 0.6578124761581421,
1923
+ "reward/std": 0.2928708493709564,
1924
+ "step": 96,
1925
+ "timing/generation_ms": 20604.460656642914,
1926
+ "timing/scoring_ms": 67186.7751404643,
1927
+ "timing/total_ms": 87791.23579710722,
1928
+ "tokens/completion": 2048.0,
1929
+ "tokens/masked_fraction": 0.0,
1930
+ "wall_clock/generate_s": 394.96749925613403
1931
+ },
1932
+ {
1933
+ "advantage/absmean": 0.2852538824081421,
1934
+ "entropy": 0.32257264852523804,
1935
+ "epoch": 0.194,
1936
+ "grad_norm": 0.11248909611986616,
1937
+ "importance_ratio": 0.9994723796844482,
1938
+ "learning_rate": 0.0001,
1939
+ "loss": -0.0,
1940
+ "mismatch_kl": 0.008222612552344799,
1941
+ "reward": 0.6115624904632568,
1942
+ "reward/refusal_reward_func": 0.6115624904632568,
1943
+ "reward/std": 0.3214528560638428,
1944
+ "step": 97,
1945
+ "timing/generation_ms": 20211.18316054344,
1946
+ "timing/scoring_ms": 60143.2975307107,
1947
+ "timing/total_ms": 80354.48069125414,
1948
+ "tokens/completion": 2048.0,
1949
+ "tokens/masked_fraction": 0.0,
1950
+ "wall_clock/generate_s": 395.2164263725281
1951
+ },
1952
+ {
1953
+ "advantage/absmean": 0.09375,
1954
+ "entropy": 0.45740675926208496,
1955
+ "epoch": 0.196,
1956
+ "grad_norm": 0.12144158106340411,
1957
+ "importance_ratio": 0.9998034834861755,
1958
+ "learning_rate": 0.0001,
1959
+ "loss": 0.0001,
1960
+ "mismatch_kl": 0.008089970797300339,
1961
+ "reward": 0.7599999904632568,
1962
+ "reward/refusal_reward_func": 0.7599999904632568,
1963
+ "reward/std": 0.19364915788173676,
1964
+ "step": 98,
1965
+ "timing/generation_ms": 20260.59687882662,
1966
+ "timing/scoring_ms": 44623.64313751459,
1967
+ "timing/total_ms": 64884.24001634121,
1968
+ "tokens/completion": 2048.0,
1969
+ "tokens/masked_fraction": 0.0,
1970
+ "wall_clock/generate_s": 131.37106108665466
1971
+ },
1972
+ {
1973
+ "advantage/absmean": 0.04843749850988388,
1974
+ "entropy": 0.5326197147369385,
1975
+ "epoch": 0.198,
1976
+ "grad_norm": 0.12835217845348193,
1977
+ "importance_ratio": 0.9994455575942993,
1978
+ "learning_rate": 0.0001,
1979
+ "loss": -0.0,
1980
+ "mismatch_kl": 0.009172793477773666,
1981
+ "reward": 0.7849999666213989,
1982
+ "reward/refusal_reward_func": 0.7849999666213989,
1983
+ "reward/std": 0.13919411599636078,
1984
+ "step": 99,
1985
+ "timing/generation_ms": 20207.647144794464,
1986
+ "timing/scoring_ms": 49733.94272476435,
1987
+ "timing/total_ms": 69941.58986955881,
1988
+ "tokens/completion": 2048.0,
1989
+ "tokens/masked_fraction": 0.0,
1990
+ "wall_clock/generate_s": 161.61080026626587
1991
+ },
1992
+ {
1993
+ "advantage/absmean": 0.10171875357627869,
1994
+ "entropy": 0.5042125582695007,
1995
+ "epoch": 0.2,
1996
+ "grad_norm": 0.08743211269759214,
1997
+ "importance_ratio": 1.0003973245620728,
1998
+ "learning_rate": 0.0001,
1999
+ "loss": -0.0002,
2000
+ "mismatch_kl": 0.01178868766874075,
2001
+ "reward": 0.7518749833106995,
2002
+ "reward/refusal_reward_func": 0.7518749833106995,
2003
+ "reward/std": 0.17614690959453583,
2004
+ "step": 100,
2005
+ "timing/generation_ms": 20365.172304213047,
2006
+ "timing/scoring_ms": 65265.56546241045,
2007
+ "timing/total_ms": 85630.7377666235,
2008
+ "tokens/completion": 2048.0,
2009
+ "tokens/masked_fraction": 0.0,
2010
+ "wall_clock/generate_s": 394.74688720703125
2011
+ }
2012
+ ],
2013
+ "logging_steps": 1,
2014
+ "max_steps": 500,
2015
+ "num_input_tokens_seen": 0,
2016
+ "num_train_epochs": 1,
2017
+ "save_steps": 100,
2018
+ "stateful_callbacks": {
2019
+ "TrainerControl": {
2020
+ "args": {
2021
+ "should_epoch_stop": false,
2022
+ "should_evaluate": false,
2023
+ "should_log": false,
2024
+ "should_save": true,
2025
+ "should_training_stop": false
2026
+ },
2027
+ "attributes": {}
2028
+ }
2029
+ },
2030
+ "total_flos": 0.0,
2031
+ "train_batch_size": 4,
2032
+ "trial_name": null,
2033
+ "trial_params": null
2034
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:faac6e2dc8c57c18fbf4e65139a1a87a4ce991484f1c480da9878818cb07d4fe
3
+ size 9681
vocab.json ADDED
The diff for this file is too large to render.