10-20-23 Dataset Update
Challanges with realistic code datasets
Realistic C code tend to be large
- LLM have limited context window
- LLAMA-2: 4k tokens
- GPT-3 2K
- GPT-4 32k
- Most C libraries have much more tokens
- [Impossible] Transpiling and testing part by part
- Rewrite one part of C program with rust code and run unit tests
- Possible for unsafe rust using Rust FFI. However previous C -> Unsafe Rust attempts exist and are successful.
- For safe Rust, many "unsafe" C behaviors effects program globally(e.g. dangling pointers in some queue implementation), so impossible to transpile part of C program to Safe Rust
- Rewrite one part of C program with rust code and run unit tests
- Thus, our method must learn to translate code larger than their context window.
- We test by translating full C programs/libraries, and run all test cases with Rust FFI.
Code could be seen by LLM during training
- All code not handwritten specifically to create such a dataset could be uploaded on line and seen by LLM during training.
- Past NLP benchmarks either ask people to write questions, or just ignore the possibility of questions being seen by LLM
- We may need to find ways to verify testing code is equivalent to handwritten code to LLM
- Potentially by checking LLM confidence in given code or wether LLM has memorized given code
Potential Next Steps
- Start with testing using minimal C libraries like:
- Begin brainstorming method to process code larger than LLM context window.
Examples: Queue implementation
From cdsa
#include <assert.h>
#include <stddef.h>
struct Queue;
struct QueueNode;
typedef struct Queue Queue;
typedef struct QueueNode QueueNode;
struct Queue {
QueueNode *head;
QueueNode *tail;
size_t size;
};
struct QueueNode {
QueueNode *next;
};
void queue_init(Queue *queue) {
assert(queue);
queue->head = NULL; // Init with dangling pointer
queue->tail = NULL;
queue->size = 0;
}
QueueNode* queue_peek(const Queue *queue) {
assert(queue);
return queue->head;
}
size_t queue_size(const Queue *queue) {
assert(queue);
return queue->size;
}
int queue_empty(const Queue *queue) {
assert(queue);
return queue->size == 0;
}
void queue_push(Queue *queue, QueueNode *node) {
assert(queue && node);
if (!queue->head) {
queue->head = node;
} else {
queue->tail->next = node;
}
queue->tail = node;
node->next = NULL;
++queue->size;
}
QueueNode *queue_pop(Queue *queue) {
QueueNode *n;
assert(queue);
n = queue->head;
if (!n) {
return NULL;
}
if (!n->next) {
queue->tail = NULL;
}
queue->head = n->next;
n->next = ((QueueNode*) 0x100);
--queue->size;
return n;
}
void queue_remove_all(Queue *queue) {
assert(queue);
if (queue->head) {
queue->head->next = ((QueueNode*) 0x100);
queue->tail->next = ((QueueNode*) 0x100);
}
queue->head = NULL;
queue->tail = NULL;
queue->size = 0;
}
Safe Rust Implementation From Blog
#![allow(unused)] fn main() { struct Queue<T> { queue: Vec<T>, } impl<T> Queue<T> { fn new() -> Self { Queue { queue: Vec::new() } } fn queue_size(&self) -> usize { self.queue.len() } fn queue_push(&mut self, item: T) { self.queue.push(item) } fn queue_pop(&mut self) -> T { self.queue.remove(0) } fn queue_empty(&self) -> bool { self.queue.is_empty() } fn queue_peek(&self) -> Option<&T> { self.queue.first() } } }
Rust Translation by Chat-GPT-3.5
Given cdsa code and "Translate this c code to safe rust, keep the function names"
Resulting function names are wrong and unsafe.
#![allow(unused)] fn main() { use std::ptr; pub struct Queue { head: *mut QueueNode, tail: *mut QueueNode, size: usize, } pub struct QueueNode { next: *mut QueueNode, } impl Queue { pub fn new() -> Self { Queue { head: ptr::null_mut(), tail: ptr::null_mut(), size: 0, } } pub fn peek(&self) -> Option<*mut QueueNode> { if self.head.is_null() { None } else { Some(self.head) } } pub fn size(&self) -> usize { self.size } pub fn is_empty(&self) -> bool { self.size == 0 } pub fn push(&mut self, node: *mut QueueNode) { if self.head.is_null() { self.head = node; } else { unsafe { (*self.tail).next = node; } } self.tail = node; unsafe { (*node).next = ptr::null_mut(); } self.size += 1; } pub fn pop(&mut self) -> Option<*mut QueueNode> { if self.head.is_null() { None } else { let n = self.head; if unsafe { (*n).next.is_null() } { self.tail = ptr::null_mut(); } self.head = unsafe { (*n).next }; unsafe { (*n).next = 0x100 as *mut QueueNode; } self.size -= 1; Some(n) } } pub fn remove_all(&mut self) { if !self.head.is_null() { let null_ptr = 0x100 as *mut QueueNode; unsafe { (*self.head).next = null_ptr; (*self.tail).next = null_ptr; } } self.head = ptr::null_mut(); self.tail = ptr::null_mut(); self.size = 0; } } }