Reproduce HumanEval pass@10 and pass@100

#38

by sh0416 - opened Jul 14, 2023

Jul 14, 2023

I am trying to reproduce Table 7 using the MultiPL-E framework.
I used temperature 0.8 and followed the instructions in the paper.
The results I got are below, and the metrics are much worse than the reported ones.
Applying strip to the prompts in the MultiPL-E dataset seems to improve performance, but not enough to reach the reported metric (0.49 for pass@100).
Does anyone know further details about the evaluation? Is there any difference between gpt_santacoder and this checkpoint?

Dataset,Pass@k,Estimate,NumProblems,MinCompletions,MaxCompletions
ground_truth,1,0.9813664596273292,161,1,1
santacoder,10,0.05375377190333478,161,200,200
santacoder,100,0.19456654232712897,161,200,200
santacoder-strip-prompt-fp32,10,0.07665345298421541,161,200,200
santacoder-strip-prompt-fp32,100,0.30245030981097853,161,200,200
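
For anyone comparing numbers: I am assuming the Estimate column is the usual unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems, where n is the number of completions per problem (200 here) and c is the number that pass. A minimal sketch of that computation follows; the function names are my own and are not taken from MultiPL-E.

```python
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k).

    n: completions sampled, c: completions that passed, k: budget.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    # C(n-c, k) / C(n, k) written as a product to avoid huge binomials.
    ratio = 1.0
    for i in range(n - c + 1, n + 1):
        ratio *= 1.0 - k / i
    return 1.0 - ratio


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With 200 completions per problem, a problem with no passing completion contributes 0 and a problem where every completion passes contributes 1, so a low average means most problems have very few passing samples.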

sh0416

Jul 14, 2023

For reference, I used this dataset:
https://huggingface.co/datasets/nuprl/MultiPL-E/viewer/humaneval-py/test

sh0416

Jul 14, 2023

I think I found it. Before passing the prompt to the model, remove the newline from the prompt. After getting the completion, prepend a newline to the completion.

Dataset,Pass@k,Estimate,NumProblems,MinCompletions,MaxCompletions
ground_truth,1,0.9813664596273292,161,1,1
santacoder,10,0.05375377190333478,161,200,200
santacoder,100,0.19456654232712897,161,200,200
santacoder-strip-prompt-fp32,10,0.07665345298421541,161,200,200
santacoder-strip-prompt-fp32,100,0.30245030981097853,161,200,200
santacoder-strip-prompt-fp32-add-newline,10,0.27866654269616764,161,200,200
santacoder-strip-prompt-fp32-add-newline,100,0.48030915497938453,161,200,200
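
Roughly, the change looks like the sketch below. This is only an illustration of the idea: `generate` is a hypothetical stand-in for whatever function calls the model and returns the completion text, and `strip_and_complete` is a name I made up for this post, not part of MultiPL-E.

```python
from typing import Callable, Tuple


def strip_and_complete(
    prompt: str, generate: Callable[[str], str]
) -> Tuple[str, str]:
    """Strip the prompt before generation, then restore the newline.

    `generate` is a hypothetical callable: prompt text in, completion out.
    """
    # Remove the newline (and any other surrounding whitespace) so the
    # model does not see a prompt that ends in "\n".
    stripped = prompt.strip()
    # Prepend the newline to the completion so that prompt + completion
    # is still a well-formed program.
    completion = "\n" + generate(stripped)
    return stripped, completion


def fake_generate(text: str) -> str:
    # Stand-in for the model; a real run would sample from the checkpoint.
    return "    return a + b"


prompt, completion = strip_and_complete("def add(a, b):\n", fake_generate)
program = prompt + completion
```

The evaluated program is then the stripped prompt followed by the newline-prefixed completion.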

sh0416 changed discussion status to closed Jul 14, 2023
