10-20-23 Dataset Update

Challanges with realistic code datasets

Realistic C code tend to be large

LLM have limited context window
- LLAMA-2: 4k tokens
- GPT-3 2K
- GPT-4 32k
Most C libraries have much more tokens
[Impossible] Transpiling and testing part by part
- Rewrite one part of C program with rust code and run unit tests
  - Possible for unsafe rust using Rust FFI. However previous C -> Unsafe Rust attempts exist and are successful.
  - For safe Rust, many "unsafe" C behaviors effects program globally(e.g. dangling pointers in some queue implementation), so impossible to transpile part of C program to Safe Rust
Thus, our method must learn to translate code larger than their context window.
- We test by translating full C programs/libraries, and run all test cases with Rust FFI.

Code could be seen by LLM during training

All code not handwritten specifically to create such a dataset could be uploaded on line and seen by LLM during training.
Past NLP benchmarks either ask people to write questions, or just ignore the possibility of questions being seen by LLM
We may need to find ways to verify testing code is equivalent to handwritten code to LLM
- Potentially by checking LLM confidence in given code or wether LLM has memorized given code

Potential Next Steps

Start with testing using minimal C libraries like:
- klib: implementation for basic data structure & functionalities (e.g. hashing, sorting, dynamic array, & string manipulation)
- cdsa: implementation of simple data structures including list, queue, stack.
Begin brainstorming method to process code larger than LLM context window.

Examples: Queue implementation

From cdsa

#include <assert.h>
#include <stddef.h>

struct Queue;
struct QueueNode;

typedef struct Queue Queue;
typedef struct QueueNode QueueNode;

struct Queue {
    QueueNode *head;
    QueueNode *tail;
    size_t size;
};

struct QueueNode {
    QueueNode *next;
};

void queue_init(Queue *queue) {
    assert(queue);

    queue->head = NULL; // Init with dangling pointer
    queue->tail = NULL;
    queue->size = 0;
}

QueueNode* queue_peek(const Queue *queue) {
    assert(queue);

    return queue->head;
}

size_t queue_size(const Queue *queue) {
    assert(queue);

    return queue->size;
}

int queue_empty(const Queue *queue) {
    assert(queue);

    return queue->size == 0;
}

void queue_push(Queue *queue, QueueNode *node) {
    assert(queue && node);

    if (!queue->head) {
        queue->head = node;
    } else {
        queue->tail->next = node;
    }

    queue->tail = node;
    node->next = NULL;

    ++queue->size;
}

QueueNode *queue_pop(Queue *queue) {
    QueueNode *n;

    assert(queue);

    n = queue->head;

    if (!n) {
        return NULL;
    }

    if (!n->next) {
        queue->tail = NULL;
    }

    queue->head = n->next;
    n->next = ((QueueNode*) 0x100);

    --queue->size;

    return n;
}

void queue_remove_all(Queue *queue) {
    assert(queue);

    if (queue->head) {
        queue->head->next = ((QueueNode*) 0x100);
        queue->tail->next = ((QueueNode*) 0x100);
    }

    queue->head = NULL;
    queue->tail = NULL;
    queue->size = 0;
}

Safe Rust Implementation From Blog

#![allow(unused)]
fn main() {
struct Queue<T> {
  queue: Vec<T>,
}

impl<T> Queue<T> {
  fn new() -> Self {
    Queue { queue: Vec::new() }
  }

  fn queue_size(&self) -> usize {
    self.queue.len()
  }

  fn queue_push(&mut self, item: T) {
    self.queue.push(item)
  }

  fn queue_pop(&mut self) -> T {
    self.queue.remove(0)
  }
  fn queue_empty(&self) -> bool {
    self.queue.is_empty()
  }

  fn queue_peek(&self) -> Option<&T> {
    self.queue.first()
  }
}
}

Rust Translation by Chat-GPT-3.5

Given cdsa code and "Translate this c code to safe rust, keep the function names"

Resulting function names are wrong and unsafe.

#![allow(unused)]
fn main() {
use std::ptr;

pub struct Queue {
    head: *mut QueueNode,
    tail: *mut QueueNode,
    size: usize,
}

pub struct QueueNode {
    next: *mut QueueNode,
}

impl Queue {
    pub fn new() -> Self {
        Queue {
            head: ptr::null_mut(),
            tail: ptr::null_mut(),
            size: 0,
        }
    }

    pub fn peek(&self) -> Option<*mut QueueNode> {
        if self.head.is_null() {
            None
        } else {
            Some(self.head)
        }
    }

    pub fn size(&self) -> usize {
        self.size
    }

    pub fn is_empty(&self) -> bool {
        self.size == 0
    }

    pub fn push(&mut self, node: *mut QueueNode) {
        if self.head.is_null() {
            self.head = node;
        } else {
            unsafe {
                (*self.tail).next = node;
            }
        }

        self.tail = node;
        unsafe {
            (*node).next = ptr::null_mut();
        }

        self.size += 1;
    }

    pub fn pop(&mut self) -> Option<*mut QueueNode> {
        if self.head.is_null() {
            None
        } else {
            let n = self.head;

            if unsafe { (*n).next.is_null() } {
                self.tail = ptr::null_mut();
            }

            self.head = unsafe { (*n).next };
            unsafe {
                (*n).next = 0x100 as *mut QueueNode;
            }

            self.size -= 1;
            Some(n)
        }
    }

    pub fn remove_all(&mut self) {
        if !self.head.is_null() {
            let null_ptr = 0x100 as *mut QueueNode;
            unsafe {
                (*self.head).next = null_ptr;
                (*self.tail).next = null_ptr;
            }
        }

        self.head = ptr::null_mut();
        self.tail = ptr::null_mut();
        self.size = 0;
    }
}

}

Research Log of Han