How does cross-domain fine-tuning on security-specific code corpora affect the F1-score of Llama3 and Codestra
Description
Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed
Research goal: How does cross-domain fine-tuning on security-specific code corpora affect the F1-score of Llama3 and Codestral in zero-shot vulnerability classification across unseen programming languages compared to general code pre-training?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:236b59b0818857efc1c9cd9d199ba726) | 78.0 kB |