Googlielmo's blog

Posts

The Codex Paper Has Been Published: the Idea Behind GitHub Copilot

The Codex paper was published yesterday. Codex is a GPT language model fine-tuned on publicly available code from GitHub, which gives it Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. This paper focuses on the work leading to the early Codex models. The main task is the generation of standalone Python functions from docstrings, with the correctness of code samples evaluated automatically through unit tests (in contrast to natural language generation, where samples are typically evaluated by heuristics or by human evaluators). To solve a problem in the test set, the authors generate multiple samples from the models and check whether any of them passes the unit tests. The raw training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. It was then filtered by removing files which were likely auto-generated, had average line lengt

Python Calculations in Jupyter with Handcalcs

Jupyter notebooks allow LaTeX rendering inside markdown. This way you can write complex math equations within a notebook. While LaTeX is the de facto standard for scientific documents, its syntax isn't very friendly or intuitive. handcalcs is an Open Source library for converting Python calculations into rendered LaTeX: just write the symbolic formula, followed by numeric substitutions, and that's it. After installing it (it is available through PyPI), in the simplest case you just need to import the render module and use the %%render magic command to render the content of a cell: Here is another example of equation rendering and numeric substitution: It is also possible to render just the symbolic equation: or to generate the corresponding LaTeX code: By default handcalcs renders code vertically, but it is possible to use the %%render params magic to save space by rendering in a single line or showing just the result of a calculation: handcalcs allows you to adjust precision, use Gr

Generating Meaningful Mock Data with Faker

Faker is an Open Source Python package that generates synthetic data that can be used for many things, such as populating a database, load testing, or anonymizing production data for development or ML purposes. Generating fully random data isn't a good choice: with Faker you can drive the generation process and tailor the generated data to your specific needs, which is the greatest value it provides. The package comes with 23 built-in data providers, and other providers are available from the community. The available data providers cover the majority of data types and cases, but it is still possible to make the generated data more meaningful by implementing a custom provider. Faker supports Python 3.6+ and is available for installation through PyPI or Anaconda. Here's a code example that shows how to implement a custom provider to generate synthetic data following the structure and constraints of this Kaggle dataset related to restaurant data with consumer ratin

Diagrams as Code with Python

In my career I have noticed that organizations are often reluctant to provide Engineering teams with licenses for diagramming software. In the best-case scenario MS Visio is usually the only option available, which isn't the best experience when trying to draw modern software architectures. Several online options are available, but they require sharing project details that cannot leave your organization's network, so often they cannot be considered. Also, when treating everything as code, it would be nice to have diagrams as code too. All these needs can be satisfied by adopting Diagrams. It is an Open Source Python package that allows you to draw cloud system architecture diagrams programmatically and then put them under version control (in the end they are regular Python files). It supports cloud (AWS, Azure, GCP, Alibaba, Oracle) and on-prem system architecture diagrams. The Diagrams nodes also include Kubernetes, programming languages and frameworks. D

TagUI: an Excellent Open Source Option for RPA - Introduction

Photo by Dinu J Nair on Unsplash. Today I want to introduce TagUI, an Open Source RPA (Robotic Process Automation) tool I am using to automate test scenarios for web applications. It is developed and maintained by the AI Singapore national programme. It allows writing flows to automate repetitive tasks, such as regression testing of web applications. Flows are written in natural language: English and 20 other languages are currently supported. It works on Windows, Linux and macOS. The TagUI official documentation can be found here. The tool doesn't require installation: just go to the official GitHub repository and download the archive for your specific OS (ZIP for Windows, tar.gz for Linux or macOS). After the download is completed, unpack its content to the local hard drive. The executable is named tagui (.cmd on Windows, .sh on other OSes) and is located in the <destination_folder>/tagui/src directory. In order to use it, the Google Chrome web browser ne

Googlielmo's Blog 2.0: a Fresh Restart

After a 7-month hiatus I have decided to go back to posting on this blog. A lot of things happened across 2020 and 2021 that left me with little or no time at all to share my thoughts and findings. In this long period of time I have been involved in challenging ML/AI projects, managing them and interacting with people 100% remotely because of the COVID-19 pandemic. I had a chance to experiment with many new DL architectures and Python Open Source libraries, in some cases applying them successfully, but I also had to adjust my own and my family's personal life to all the lifestyle changes imposed by the pandemic. The reasons that led me to restart the blog are mostly the following: I have accumulated tons of technical topics that are worth sharing with a wider audience. During the past months I have shared some through social networks such as LinkedIn and Twitter or in a few virtual meetups or conferences, but they need a deeper dive. This week I gave a workshop at the ODSC Europe 2021 conference and I