
The Codex Paper Has Been Published: the Idea Behind GitHub Copilot


 

The Codex paper was published yesterday. Codex is a GPT language model fine-tuned on publicly available code from GitHub, with Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. The paper focuses on the work leading to the early Codex models. The main task is the generation of standalone Python functions from docstrings, and the automated evaluation of the correctness of code samples through unit tests (in contrast to natural language generation, where samples are typically evaluated by heuristics or by human evaluators). To solve a problem in the test set, the authors generate multiple samples from the models and check whether any of them passes the unit tests.
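This "does at least one of k samples pass?" criterion is the pass@k metric, and the paper gives a numerically stable unbiased estimator for it. Here is a minimal sketch of that estimator in pure Python, computing the probability that at least one of k samples drawn without replacement from n generated samples is correct, given that c of the n samples pass the tests:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: number of samples drawn
    """
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset:
        # every draw of k samples must contain a correct one.
        return 1.0
    # 1 - probability that all k drawn samples are incorrect,
    # expanded as a product to avoid huge binomial coefficients.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail
```

For example, with 10 samples of which exactly 1 passes, `pass_at_k(10, 1, 5)` gives 0.5: half of the possible 5-sample subsets contain the single correct sample.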

The raw training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. It was then filtered by removing files which were likely auto-generated, had an average line length greater than 100, had a maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, the final dataset totaled 159 GB.
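The line-length and alphanumeric heuristics above can be sketched as a simple predicate. Note that the 0.25 alphanumeric threshold is my own illustrative assumption (the paper only says "a small percentage"), and the auto-generation check is omitted here:

```python
def keep_file(source: str, max_avg_line: int = 100,
              max_line: int = 1000, min_alnum_frac: float = 0.25) -> bool:
    """Return True if a Python file survives the length/character filters.

    The 0.25 alphanumeric-fraction threshold is a hypothetical value;
    the auto-generated-file heuristic is not modeled here.
    """
    lines = source.splitlines()
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    if avg_len > max_avg_line:
        return False
    if max(len(line) for line in lines) > max_line:
        return False
    alnum = sum(ch.isalnum() for ch in source)
    if alnum / len(source) < min_alnum_frac:
        return False
    return True
```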

Along with the paper, HumanEval, an evaluation framework, has been released in a GitHub repo.

I am not going through the details of the technical work behind the presented results, as they are clearly explained in the paper, so anyone can read them there. Instead, I want to highlight some important considerations after reading the comprehensive limitations and risks sections of this paper.

Codex is not sample-efficient to train. The selected training dataset comprises a significant fraction of the publicly available Python code on GitHub, totaling hundreds of millions of lines of code. Yet even seasoned developers do not encounter anywhere near this amount of code over their careers. Indeed, a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of the problems than the final Codex model.

Codex has the potential to be useful in a range of ways, such as helping onboard users to new codebases, reducing context switching for experienced coders, enabling non-programmers to write specifications and have Codex draft implementations, or aiding in education and exploration. However, Codex also raises significant safety challenges: it does not always produce code that is aligned with user intent, and it has the potential to be misused.

Code generation and associated capabilities have several possible economic and labor market impacts. While Codex at its current capability level may somewhat reduce the cost of producing software by increasing programmer productivity, the size of this effect may be limited by the fact that engineers don't spend their full day writing code: important daily tasks include conferring with colleagues, writing design specifications, upgrading existing software stacks, and so on.

Because Codex can produce vulnerable or misaligned code, qualified operators should, absent appropriate precautions, review its generations before executing or trusting them. Future code generation models may eventually be trained to produce more secure code than the average developer, but in my opinion this is still far from being achieved.

Codex, like any other technology or ML/AI model, could also be misused to aid cybercrime. Although this is worthy of concern, the authors believe that, at their current level of capability, Codex models do not materially lower the barrier to entry for malware development. However, more powerful code generation models will bring further advancements, and therefore research into mitigations and continued study of model capabilities are mandatory.

At the same time, the non-deterministic nature of systems like Codex could enable more advanced malware. This non-determinism makes it easier to create diverse software that accomplishes the same tasks. While software diversity can sometimes aid defenders, it presents unique challenges for traditional malware detection and antivirus systems that rely on fingerprinting and signature-matching against previously sampled binaries. A more capable code generation model could conceivably advance techniques for generating polymorphic malware. Application security and model deployment strategies, including rate-limiting access and abuse monitoring, can manage this threat in the near term; however, the efficacy of these mitigations may scale sublinearly as more capable models are developed.
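A toy illustration of why exact fingerprinting struggles with such diversity: two functionally equivalent code variants (both hypothetical snippets) produce entirely different hashes, so a signature database built from one misses the other:

```python
import hashlib

def signature(code: str) -> str:
    """Naive byte-level fingerprint, as used in
    signature-based detection of known samples."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

# Two functionally identical implementations of the same task.
variant_a = "def add(a, b):\n    return a + b\n"
variant_b = "def add(x, y):\n    total = x + y\n    return total\n"

# A database of known signatures built from variant_a alone
# will not match variant_b, despite identical behavior.
known_signatures = {signature(variant_a)}
detected = signature(variant_b) in known_signatures  # False
```

Real antivirus engines are of course far more sophisticated (heuristics, behavioral analysis), but the core limitation of exact matching against generated diversity is the same.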

Codex, like other large generative models, has an energy footprint from both training and inference. The original training of GPT-3-12B consumed hundreds of petaflop/s-days of compute, while fine-tuning it to create Codex-12B consumed a similar amount. Looking longer term, the compute demands of code generation could grow to be much larger than Codex's training if significant inference is used to tackle challenging problems.
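To put "hundreds of petaflop/s-days" into absolute terms, here is a quick back-of-the-envelope conversion (the figure of 500 is purely illustrative, not a number from the paper):

```python
# One petaflop/s-day = 10^15 floating-point operations per second,
# sustained for one day (86,400 seconds).
PFS_DAY_FLOPS = 1e15 * 86_400  # 8.64e19 FLOPs

# Illustrative assumption: "hundreds" taken as 500 petaflop/s-days.
training_flops = 500 * PFS_DAY_FLOPS  # 4.32e22 FLOPs
```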

Conclusions:

  • According to the paper's authors, models like this should be developed and used, and their capabilities explored carefully, with an eye towards maximizing their positive social impacts and minimizing the intentional or unintentional harms that their use might cause. This has been exactly my point of view since I first heard of Copilot. A contextual approach is critical to effective hazard analysis and mitigation, though a few broad categories of mitigations are important to consider in any deployment of code generation models (and, in general, of any generative model).
  • Don't forget that coding is a broad activity which involves much more than synthesizing code from docstrings. So software engineers and data scientists won't be replaced so easily just yet.

