10-6-23 Pivot

Two real-world motivations

  1. C to Rust for memory safety
  2. Web front-end framework migration. New web front-end frameworks appear every few months, and old frameworks keep getting major updates. The same webpage can be implemented with different frameworks, each with its own advantages. As a web project grows, the framework it was originally built on may no longer be the best fit, so it could be very helpful to translate the codebase from one framework to another while keeping the same functionality.

Current state of LLM Code Translation

  • "Understanding the Effectiveness of Large Language Models in Code Translation" seems to be a good exploration of current LLM use in this field. From this paper we can see:
    • LLM usage is very basic. Mostly, people just prompt with a task description plus minimal prompt engineering. This can also be seen from the leaderboards of some code translation datasets. Generally, code translation serves as a showcase for multi-lingual code generation models, and not much work focuses on optimizing LLM sampling/generation/prompt engineering for this specific task.
    • LLMs make uncharacteristically trivial errors. One of the paper's findings is that most failed translation attempts fail with compilation errors, even for models that otherwise perform extremely well, like GPT-4 (~47% success rate). They make mistakes like violating semantic restrictions and dependency errors, which I rarely saw when trying ChatGPT on ordinary coding questions. The paper traces most bugs back to differences between the languages: by following the source code too closely, the LLM can forget the rules of the target language. (e.g. ll = 1 works in Go, but ll ll = 1 fails in C++ when the common typedef long long ll; is in scope, because the variable name collides with the type name)
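
As a quick way to reproduce this failure analysis, we could pipe candidate translations through the target compiler before any functional testing. A minimal Python sketch, assuming g++ is installed; the snippet and file names are placeholders:

```python
# Check whether an LLM-produced C++ translation compiles at all,
# mirroring the paper's finding that most failures are compile errors.
import os
import subprocess
import tempfile

def cpp_compiles(code: str) -> tuple[bool, str]:
    """Return (ok, compiler stderr) for a C++ snippet via g++ syntax checking."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "translated.cpp")
        with open(src, "w") as f:
            f.write(code)
        result = subprocess.run(
            ["g++", "-std=c++17", "-fsyntax-only", src],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr

# The kind of literal translation flagged above: the Go-style name ll
# collides with the typedef already present in the target file.
snippet = "typedef long long ll;\nll ll = 1;\n"
ok, err = cpp_compiles(snippet)
print("compiles:", ok)  # expected: False, with a conflicting-declaration error
```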

Potential method

  • Based on the findings from this paper, stochastic sampling guided by a high-level planner seems like a promising direction.
  • We could use MCMC-like sampling methods to condition each LLM translation on (original code + high-level plan + semantic rules). This would let the LLM keep following the source code, but not so blindly that it makes trivial errors in the target language. A rough sketch of such a sampler follows.
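
A minimal sketch of that sampler, as an assumed design rather than anything from the paper: llm_translate is a hypothetical wrapper around whatever model we pick, and score is a stand-in energy function (e.g. compile success, tests passed, agreement with the plan):

```python
# Simplified Metropolis loop over independently sampled candidate translations.
import math
import random

def llm_translate(source: str, plan: str, rules: str) -> str:
    """Hypothetical: sample one candidate translation conditioned on
    (original code + high-level plan + target-language semantic rules)."""
    raise NotImplementedError("wire up a real LLM call here")

def score(candidate: str) -> float:
    """Stand-in energy: higher is better, e.g. +1 if it compiles,
    plus the fraction of tests passed, plus agreement with the plan."""
    raise NotImplementedError

def mh_translate(source: str, plan: str, rules: str,
                 steps: int = 50, temp: float = 1.0) -> str:
    current = llm_translate(source, plan, rules)
    current_score = score(current)
    for _ in range(steps):
        proposal = llm_translate(source, plan, rules)
        proposal_score = score(proposal)
        # Metropolis rule: always keep improvements, occasionally keep a
        # worse candidate (scaled by temp) to escape local optima.
        accept_logprob = min(0.0, (proposal_score - current_score) / temp)
        if random.random() < math.exp(accept_logprob):
            current, current_score = proposal, proposal_score
    return current
```

A fancier chain could mutate the current candidate instead of redrawing from scratch, e.g. by feeding the specific compiler error back to the LLM as a repair prompt.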

Dataset

  • There are some existing datasets, like CodeXGLUE
  • We could also use any benchmark, dataset, or codebase with test cases and simply run the tests in the target language (see the harness sketch below). The "Understanding" paper used datasets like "EvalPlus", plus projects with APIs for command-line processing, including Apache Commons CLI (Java) and Click (Python)
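
For benchmarks that ship input/output test cases, the target-language check can be as simple as compile-and-diff. A sketch assuming C++ as the target and (stdin, expected stdout) pairs; all names here are placeholders:

```python
# Run a translated C++ program against reference (stdin, expected stdout) cases.
import os
import subprocess
import tempfile

def passes_tests(cpp_code: str, cases: list[tuple[str, str]]) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "t.cpp")
        exe = os.path.join(tmp, "t")
        with open(src, "w") as f:
            f.write(cpp_code)
        build = subprocess.run(["g++", "-std=c++17", "-O2", src, "-o", exe],
                               capture_output=True, text=True)
        if build.returncode != 0:
            return False  # the dominant failure mode per the paper
        for given_stdin, expected in cases:
            try:
                run = subprocess.run([exe], input=given_stdin, timeout=5,
                                     capture_output=True, text=True)
            except subprocess.TimeoutExpired:
                return False
            if run.returncode != 0 or run.stdout.strip() != expected.strip():
                return False
    return True
```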

Proposed Next Steps

  • Expand the literature search on LLM code translation; see what methods the current state of the art uses.
  • Try LLM code translation with ChatGPT and Llama, and see whether they indeed make these trivial errors.
  • Test whether conditioning the LLM on a high-level plan, by simply adding a high-level description to the translation prompt, improves translation results (see the prompt sketch after this list).
  • Find a suitable C or web codebase/dataset that can be used as a benchmark.
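
For the plan-conditioning test above, the two arms only need to differ in whether a plan is prepended. A sketch with hypothetical template text:

```python
# Two prompt arms: identical except that one prepends a high-level plan.
BASE = ("Translate the following {src} code to {dst}. "
        "Output only the {dst} code.\n\n{code}")
WITH_PLAN = "Here is a high-level plan of what the code does:\n{plan}\n\n" + BASE

def make_prompts(code: str, plan: str, src: str = "C", dst: str = "Rust"):
    """Return (baseline prompt, plan-conditioned prompt) for one problem."""
    return (BASE.format(src=src, dst=dst, code=code),
            WITH_PLAN.format(plan=plan, src=src, dst=dst, code=code))

# Comparing pass rates of the two arms over the same benchmark problems
# tells us whether plan conditioning alone moves the needle.
```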