10-20-23 Dataset Update

Challanges with realistic code datasets

Realistic C code tend to be large

  • LLM have limited context window
    • LLAMA-2: 4k tokens
    • GPT-3 2K
    • GPT-4 32k
  • Most C libraries have much more tokens
  • [Impossible] Transpiling and testing part by part
    • Rewrite one part of C program with rust code and run unit tests
      • Possible for unsafe rust using Rust FFI. However previous C -> Unsafe Rust attempts exist and are successful.
      • For safe Rust, many "unsafe" C behaviors effects program globally(e.g. dangling pointers in some queue implementation), so impossible to transpile part of C program to Safe Rust
  • Thus, our method must learn to translate code larger than their context window.
    • We test by translating full C programs/libraries, and run all test cases with Rust FFI.

Code could be seen by LLM during training

  • All code not handwritten specifically to create such a dataset could be uploaded on line and seen by LLM during training.
  • Past NLP benchmarks either ask people to write questions, or just ignore the possibility of questions being seen by LLM
  • We may need to find ways to verify testing code is equivalent to handwritten code to LLM
    • Potentially by checking LLM confidence in given code or wether LLM has memorized given code

Potential Next Steps

  • Start with testing using minimal C libraries like:
    • klib: implementation for basic data structure & functionalities (e.g. hashing, sorting, dynamic array, & string manipulation)
    • cdsa: implementation of simple data structures including list, queue, stack.
  • Begin brainstorming method to process code larger than LLM context window.

Examples: Queue implementation

From cdsa

#include <assert.h>
#include <stddef.h>

struct Queue;
struct QueueNode;

typedef struct Queue Queue;
typedef struct QueueNode QueueNode;

struct Queue {
    QueueNode *head;
    QueueNode *tail;
    size_t size;
};

struct QueueNode {
    QueueNode *next;
};

void queue_init(Queue *queue) {
    assert(queue);

    queue->head = NULL; // Init with dangling pointer
    queue->tail = NULL;
    queue->size = 0;
}

QueueNode* queue_peek(const Queue *queue) {
    assert(queue);

    return queue->head;
}

size_t queue_size(const Queue *queue) {
    assert(queue);

    return queue->size;
}

int queue_empty(const Queue *queue) {
    assert(queue);

    return queue->size == 0;
}

void queue_push(Queue *queue, QueueNode *node) {
    assert(queue && node);

    if (!queue->head) {
        queue->head = node;
    } else {
        queue->tail->next = node;
    }

    queue->tail = node;
    node->next = NULL;

    ++queue->size;
}

QueueNode *queue_pop(Queue *queue) {
    QueueNode *n;

    assert(queue);

    n = queue->head;

    if (!n) {
        return NULL;
    }

    if (!n->next) {
        queue->tail = NULL;
    }

    queue->head = n->next;
    n->next = ((QueueNode*) 0x100);

    --queue->size;

    return n;
}

void queue_remove_all(Queue *queue) {
    assert(queue);

    if (queue->head) {
        queue->head->next = ((QueueNode*) 0x100);
        queue->tail->next = ((QueueNode*) 0x100);
    }

    queue->head = NULL;
    queue->tail = NULL;
    queue->size = 0;
}

Safe Rust Implementation From Blog

#![allow(unused)]
fn main() {
struct Queue<T> {
  queue: Vec<T>,
}

impl<T> Queue<T> {
  fn new() -> Self {
    Queue { queue: Vec::new() }
  }

  fn queue_size(&self) -> usize {
    self.queue.len()
  }

  fn queue_push(&mut self, item: T) {
    self.queue.push(item)
  }

  fn queue_pop(&mut self) -> T {
    self.queue.remove(0)
  }
  fn queue_empty(&self) -> bool {
    self.queue.is_empty()
  }

  fn queue_peek(&self) -> Option<&T> {
    self.queue.first()
  }
}
}

Rust Translation by Chat-GPT-3.5

Given cdsa code and "Translate this c code to safe rust, keep the function names"

Resulting function names are wrong and unsafe.

#![allow(unused)]
fn main() {
use std::ptr;

pub struct Queue {
    head: *mut QueueNode,
    tail: *mut QueueNode,
    size: usize,
}

pub struct QueueNode {
    next: *mut QueueNode,
}

impl Queue {
    pub fn new() -> Self {
        Queue {
            head: ptr::null_mut(),
            tail: ptr::null_mut(),
            size: 0,
        }
    }

    pub fn peek(&self) -> Option<*mut QueueNode> {
        if self.head.is_null() {
            None
        } else {
            Some(self.head)
        }
    }

    pub fn size(&self) -> usize {
        self.size
    }

    pub fn is_empty(&self) -> bool {
        self.size == 0
    }

    pub fn push(&mut self, node: *mut QueueNode) {
        if self.head.is_null() {
            self.head = node;
        } else {
            unsafe {
                (*self.tail).next = node;
            }
        }

        self.tail = node;
        unsafe {
            (*node).next = ptr::null_mut();
        }

        self.size += 1;
    }

    pub fn pop(&mut self) -> Option<*mut QueueNode> {
        if self.head.is_null() {
            None
        } else {
            let n = self.head;

            if unsafe { (*n).next.is_null() } {
                self.tail = ptr::null_mut();
            }

            self.head = unsafe { (*n).next };
            unsafe {
                (*n).next = 0x100 as *mut QueueNode;
            }

            self.size -= 1;
            Some(n)
        }
    }

    pub fn remove_all(&mut self) {
        if !self.head.is_null() {
            let null_ptr = 0x100 as *mut QueueNode;
            unsafe {
                (*self.head).next = null_ptr;
                (*self.tail).next = null_ptr;
            }
        }

        self.head = ptr::null_mut();
        self.tail = ptr::null_mut();
        self.size = 0;
    }
}

}