Fine-Tuning an LLM on Purr-Data Source Code Examples

Fine-tuning Gemma 2b for Purr-Data: An Experiment

Recently, I decided to experiment with fine-tuning Google's Gemma 2b instruct model to generate source code for Purr-Data patches.

Purr-Data, a visual programming language for creating multimedia applications, takes a genuinely unique approach: programs are patches built by wiring graphical objects together. But it's a niche tool, so there is very little Purr-Data source code floating around online, and that scarcity makes the language hard for general-purpose language models to pick up.

Building a Dataset of Purr-Data Patch Source Code Examples

I created a dataset with the goal of evaluating how well large language models like Google's Gemma 2B can be fine-tuned for Purr-Data source code generation.
It focuses specifically on patches that output a particular message when a "bang" object is clicked.

Dataset Characteristics:

Content: Each data point consists of two parts: a natural-language prompt describing the desired patch, and the Purr-Data source code of a patch that implements it.

Focus: The dataset is restricted to examples where the patch functionality centers around printing a specific message on a bang click.
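
To make this concrete, a single data point might look roughly like the sketch below (shown as a Python dict; the field names and the bng GUI parameters are illustrative rather than the dataset's exact schema). The patch itself is ordinary Pd/Purr-Data file syntax: a bang ([bng]) wired to a message box wired to [print].

    # Hypothetical data point; field names are illustrative, so check the
    # dataset card for the actual schema.
    example = {
        "prompt": ("Write a Purr-Data patch that prints 'hello world' "
                   "to the console when the bang is clicked."),
        "code": (
            "#N canvas 0 0 450 300 10;\n"
            "#X obj 50 40 bng 15 250 50 0 empty empty empty 17 7 0 10"
            " -262144 -1 -1;\n"
            "#X msg 50 80 hello world;\n"
            "#X obj 50 120 print;\n"
            "#X connect 0 0 1 0;\n"   # bng outlet 0 -> message box inlet 0
            "#X connect 1 0 2 0;\n"   # message box outlet 0 -> print inlet 0
        ),
    }

Clicking the bng sends a bang to the message box, which sends "hello world" to [print], which writes it to the Purr-Data console.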

Link to the dataset: https://huggingface.co/datasets/ParZiVal04/Purr-Data_example_source_codes
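
For readers who want a starting point, a LoRA fine-tune of Gemma 2B on this dataset might look roughly like the sketch below, using the Hugging Face transformers, datasets, and peft libraries. The column names ("prompt" and "code") and the hyperparameters are illustrative rather than the exact settings from my runs, so check the dataset card and adjust.

    # Rough sketch of a LoRA fine-tune of Gemma 2B instruct on the dataset above.
    # Column names and hyperparameters are illustrative, not the exact settings
    # used in this experiment.
    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_id = "google/gemma-2b-it"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 torch_dtype=torch.bfloat16)

    # LoRA adapters keep the fine-tune affordable on a single GPU.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    dataset = load_dataset("ParZiVal04/Purr-Data_example_source_codes",
                           split="train")

    def tokenize(example):
        # Wrap each (prompt, patch source) pair in Gemma's chat-turn format.
        text = (f"<start_of_turn>user\n{example['prompt']}<end_of_turn>\n"
                f"<start_of_turn>model\n{example['code']}<end_of_turn>"
                f"{tokenizer.eos_token}")
        return tokenizer(text, truncation=True, max_length=1024)

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gemma-2b-purr-data",
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=4,
                               num_train_epochs=3,
                               learning_rate=2e-4,
                               logging_steps=10),
        train_dataset=tokenized,
        # mlm=False makes the collator emit causal-LM labels (a copy of
        # input_ids with padding masked out).
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()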

Video Demo


View on YouTube if the player above doesn't work.

A Proof of Concept for Niche Languages

This experiment showed that fine-tuning a large language model can be a viable approach for working with niche visual languages like Purr-Data. It's a small step, but one that paves the way for further exploration.

The Future

There's still a lot to explore. I'd love to expand the dataset to include more complex Purr-Data patches and see how the model performs. Ideally, I'd also like human programmers to evaluate the quality of the code the model generates.