How I created my own Programming Language

In this article, I walk through the main steps involved in building your own programming language.

Dec 18, 2022 · 5 mins

Introduction#

Programming languages are nothing but programs written by other programmers for you to program your programs. These days developers take programming languages for granted. Not everyone is interested in understanding the complexity behind a simple "Hello world" program until a complex program of their own runs into an issue, and they have to dig into the language's source code to find out why a weird Null Pointer Exception or memory leak is happening.

Early programmers did not have fancy IDEs or IntelliSense-style tooling that could identify errors before you even run the program. They wrote their programs on physical punch cards and then ran them through machines that could take several hours to execute them. An operator would work on creating and processing these cards on the machine.

Designing a programming language is similar to writing any other program, such as a web API or an automation script. The main job of a programming language is to let you write instructions and then convert them into a form your machine understands. There are various ways to design your own language. In this article, I will talk about how I designed two programming languages of my own using the concept of a Tree Walking Interpreter.

The Beginning#

Before starting on the design, you need to decide which programming language constructs you want to support. Almost all languages cover these basics:

  1. Variable Declarations
  2. Expression evaluation statements like mathematical operations, function calls, object declarations etc.
  3. Control Statements like if/else, for, while, etc.
  4. Object Oriented Programming (OOP)
  5. Primitive data structures like Array, Maps/Hash, Set etc.
  6. Standard Library for Input/Output Processing, Handling File System reads etc.

Tokenization#

Based on these choices, you can decide what the syntax of your language should look like and start working on the first stage of compiler or interpreter design: the Tokenizer. Tokenization is the first step in any source-code processing engine; its responsibility is to take in source code and generate Tokens.

Tokens in the C programming language are as simple as "if", "int", "printf", "(", ")" etc. A Tokenizer can generate tokens with the following information:

  1. Type of Token (Function, Keyword, Identifier, etc)
  2. Line number of the Token (Useful when showing errors)
  3. Literal value of the Token (Raw string content of text, "if", "variable_name" etc)
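As a rough illustration of the three fields above, here is a minimal tokenizer sketch in Python. It is not the LoxPy tokenizer; the `Token` class, the `KEYWORDS` set, the token type names and the regular expressions are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    type: str     # e.g. KEYWORD, IDENTIFIER, NUMBER
    literal: str  # raw text taken from the source code
    line: int     # 1-based line number, useful for error messages


# Hypothetical keyword list for this sketch.
KEYWORDS = {"var", "if", "else", "for", "while"}

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[+\-*/=]"),
    ("PUNCTUATION", r"[();{}]"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Turn source text into a list of Token objects."""
    tokens = []
    line = 1
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind == "SKIP":
            continue
        elif kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character {text!r} on line {line}")
        else:
            # Identifiers that match a reserved word become keywords.
            if kind == "IDENTIFIER" and text in KEYWORDS:
                kind = "KEYWORD"
            tokens.append(Token(kind, text, line))
    return tokens
```

Calling `tokenize("var value = 2 + 3;")` yields seven tokens, starting with a `KEYWORD` token for `var` and ending with a `PUNCTUATION` token for `;`.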

Parsing#

In the second stage, we read the tokens generated by the tokenizer and try to build a Syntax tree. This tree contains expressions and statements arranged in, well, a tree structure. The tree is built in such a way that it satisfies the grammar rules of our language. If a user has written code that is syntactically wrong, this is the stage where we can alert them, because what they wrote breaks a grammar rule. Most compilers will refuse to build an executable until your code is syntactically correct. In interpreters, we typically alert the user about the issue and enter a recovery mode where we continue parsing to find more errors.

Following is an example in my programming language LoxPy:

    var value = 2 + 3;

This generates the following syntax tree in LoxPy:

    Var.Statement(
        name = Identifier(
            name = "value"
        ),
        value = Expression(
            value = Add (
                left = Value(2),
                right = Value(3)
            )
        )
    )
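To make the step from tokens to tree more concrete, here is a small recursive-descent parser sketch in Python that handles only this one statement shape. The node classes mirror the names in the tree above, but they are assumptions for illustration and not LoxPy's actual classes; tokens are assumed to arrive as plain `(type, literal)` pairs with hypothetical type names.

```python
from dataclasses import dataclass


@dataclass
class Identifier:
    name: str


@dataclass
class Value:
    value: int


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Expression:
    value: object


@dataclass
class VarStatement:
    name: Identifier
    value: Expression


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens  # list of (type, literal) pairs
        self.pos = 0

    def expect(self, kind, literal=None):
        """Consume the next token or raise if it breaks the grammar rule."""
        wanted = literal or kind
        if self.pos >= len(self.tokens):
            raise SyntaxError(f"Expected {wanted} but reached end of input")
        tok_kind, tok_literal = self.tokens[self.pos]
        if tok_kind != kind or (literal is not None and tok_literal != literal):
            raise SyntaxError(f"Expected {wanted} but found {tok_literal!r}")
        self.pos += 1
        return tok_literal

    def peek_is(self, kind, literal):
        return self.pos < len(self.tokens) and self.tokens[self.pos] == (kind, literal)

    def var_statement(self):
        # Grammar rule: "var" IDENTIFIER "=" addition ";"
        self.expect("KEYWORD", "var")
        name = Identifier(self.expect("IDENTIFIER"))
        self.expect("OPERATOR", "=")
        value = Expression(self.addition())
        self.expect("PUNCTUATION", ";")
        return VarStatement(name, value)

    def addition(self):
        # Grammar rule: NUMBER ( "+" NUMBER )*
        node = Value(int(self.expect("NUMBER")))
        while self.peek_is("OPERATOR", "+"):
            self.pos += 1
            node = Add(node, Value(int(self.expect("NUMBER"))))
        return node
```

Feeding it the tokens for `var value = 2 + 3;` produces a `VarStatement` whose value is `Expression(Add(Value(2), Value(3)))`, and leaving out the final `;` raises a `SyntaxError`. This sketch stops at the first error; the recovery mode mentioned earlier is not implemented here.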

This completes the frontend of the programming language. It is called the frontend because all the rules and syntax of your language are handled in these stages.

Optimization#

Program performance is critical when it comes to building fast applications. Early systems were not as powerful as the personal computers or mobile phones we have today. Programs would sometimes take hours or days to evaluate. Programmers had to fit their complete program into limited memory and use hacks to optimize their code.

For my programming languages LoxPy and Talion, I haven't used any optimization techniques. The AST (abstract syntax tree) is the intermediate representation (IR) for my languages. You can either take the AST and optimize it to generate another IR, or directly convert the AST to a machine-level program that can then be optimized. There are different strategies for code optimization, like Dead Code Elimination and Loop Optimization, and they can target speed or resource usage.
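Since neither LoxPy nor Talion implements an optimizer, the following is only a sketch of what Dead Code Elimination over an AST could look like. It assumes a made-up AST in which each statement is a tuple whose first item is its kind, and `if` statements have the shape `("if", condition, then_block, else_block)`.

```python
def eliminate_dead_code(statements):
    """Return a new statement list with unreachable statements removed.

    Two cases are handled in this sketch:
      * an `if` whose condition is the constant True or False keeps
        only the branch that can actually run;
      * statements that follow a `return` in the same block are dropped.
    """
    result = []
    for stmt in statements:
        kind = stmt[0]
        if kind == "if":
            _, condition, then_block, else_block = stmt
            then_block = eliminate_dead_code(then_block)
            else_block = eliminate_dead_code(else_block)
            if condition is True:
                result.extend(then_block)  # else branch can never run
            elif condition is False:
                result.extend(else_block)  # then branch can never run
            else:
                result.append(("if", condition, then_block, else_block))
        else:
            result.append(stmt)
        if result and result[-1][0] == "return":
            break  # anything after a return is unreachable
    return result
```

Inlining a branch into its parent block is only safe if blocks do not introduce their own variable scope, so a real implementation for a block-scoped language would need to keep the branch wrapped in a block node.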

Evaluation#

At this stage you need to decide how you want to run the program. There are two ways you can go about this for your programming language:

  1. Compile: You can take the AST and compile it to a program the target machine can execute. The machine can be your actual system (as with C, C++, Go) or a virtual machine (as with Python, Java). When the target is a virtual machine, the compiled output is called byte code.

  2. Interpret: You can directly start evaluating the AST in an underlying backend program. Python is an example of a language that is both compiled and interpreted: the Python code is first compiled to byte code, which is then interpreted by the Python Virtual Machine.

One of the easiest interpreters to implement is a Tree Walking Interpreter. AST nodes are evaluated node by node, at the level of each instruction.

Tree Walking Interpreter stages: https://imantung.medium.com/tree-walk-interpreter-b33fe5c19a63
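The node-by-node evaluation can be sketched in a few lines of Python. The node classes below reuse the names from the syntax tree shown in the Parsing section, but they are assumptions for this example and not the actual LoxPy implementation; variables are kept in a plain dictionary called `env`.

```python
from dataclasses import dataclass


@dataclass
class Identifier:
    name: str


@dataclass
class Value:
    value: int


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Expression:
    value: object


@dataclass
class VarStatement:
    name: Identifier
    value: Expression


def evaluate(node, env):
    """Walk the tree recursively and evaluate one node at a time."""
    if isinstance(node, Value):
        return node.value
    if isinstance(node, Identifier):
        if node.name not in env:
            raise NameError(f"Undefined variable '{node.name}'")
        return env[node.name]
    if isinstance(node, Add):
        # Children are evaluated first, then combined by the parent node.
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Expression):
        return evaluate(node.value, env)
    if isinstance(node, VarStatement):
        env[node.name.name] = evaluate(node.value, env)
        return None
    raise TypeError(f"Unknown node type: {type(node).__name__}")
```

Evaluating the tree for `var value = 2 + 3;` with an empty `env` leaves `env` holding `{"value": 5}`. A chain of `isinstance` checks keeps the sketch short; the visitor pattern is a common alternative once the number of node types grows.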

Early implementations of languages like Ruby and R used this evaluation technique. Because a reference to each node has to be kept in memory, this was expensive. If you want to add closures or classes to your language, you need to handle resolution through environments (looking up identifier, function and class names), which again leads to storing a complete copy of, or a reference to, a set of variables in the body of these objects.
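One common way to handle this kind of name resolution is a chain of environments, where each scope holds its own variables plus a reference to the scope that encloses it. The sketch below is a simplified Python illustration under that assumption; the `Environment` and `Function` classes are hypothetical, and the function body is a Python callable standing in for an AST body.

```python
class Environment:
    """A single scope: its own variables plus a link to the enclosing scope."""

    def __init__(self, enclosing=None):
        self.values = {}
        self.enclosing = enclosing

    def define(self, name, value):
        self.values[name] = value

    def get(self, name):
        # Look in this scope first, then walk outward through the chain.
        if name in self.values:
            return self.values[name]
        if self.enclosing is not None:
            return self.enclosing.get(name)
        raise NameError(f"Undefined variable '{name}'")


class Function:
    """A user-defined function that remembers the scope it was declared in."""

    def __init__(self, params, body, closure):
        self.params = params
        self.body = body        # callable taking an Environment (stand-in for an AST)
        self.closure = closure  # reference to the declaring scope, not a copy

    def call(self, *args):
        # Each call gets a fresh scope whose parent is the captured one.
        local = Environment(enclosing=self.closure)
        for param, arg in zip(self.params, args):
            local.define(param, arg)
        return self.body(local)
```

Because `Function` stores a reference to its declaring `Environment`, every closure keeps that whole scope alive, which is the memory cost described above.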

Crafting Interpreters: Want to build your own language?

My programming languages:

LoxPy: Github

talion: Github