A quick overview of the Python Virtual Machine — Pt. 1

This is part one of a series covering aspects of virtual machines and how an interpreted programming language executes code.

Over these articles I will introduce some concepts of virtual machine design. Today I will briefly explain the execution path of Python code and introduce a very basic virtual machine concept, the Program Counter.

Python, or more precisely its reference implementation CPython, is just a C program. This program is also called the Python Virtual Machine, and it has several tasks to do before your Python code is actually executed. The virtual machine cannot execute Python code as you write it in your Python files. It must first parse the entire source into something it can understand.

  1. The Python Virtual Machine loads all the necessary dependencies and sets up its initial state. This is done by the Py_Initialize function.
  2. After this initialization, a function called run_file is executed to load the Python script.
  3. Py_Main executes PyRun_SimpleFileExFlags and generates the __main__ namespace of the Python program.
  4. PyParser_ASTFromFileObject is called, which in turn calls PyParser_ParseFileObject. Together they build a parse tree, a structure that will later be translated into instructions. As the name tree suggests, its objects hold pointers to other objects.
  5. Finally, PyAST_FromNodeObject is called as the last step of parsing. It produces the AST (Abstract Syntax Tree). You can also access the AST from Python code, which is a powerful meta-programming technique for Python programs, similar to what is available in other languages such as Julia and Lua. As the Python documentation notes, the abstract syntax may change with each Python release.
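
The AST from the last step can be inspected from Python itself through the standard ast module. Here is a minimal sketch, assuming a made-up one-line source string. The exact text printed by ast.dump depends on the Python release.

```python
import ast

# A made-up one-line program, used only for illustration.
source = "total = 1 + 2"

# ast.parse runs the parser and returns the root of the abstract syntax tree.
tree = ast.parse(source)

# The root is a Module node whose body is a list of statement nodes.
first_statement = tree.body[0]
print(type(tree).__name__)             # Module
print(type(first_statement).__name__)  # Assign

# ast.dump gives a textual view of the whole tree; its layout varies by release.
print(ast.dump(tree))
```

Each node holds references to its child nodes, which is the tree structure described in the steps above.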

These are the initial steps of the virtual machine. After that, it must create code objects from the tree built in the previous steps.

Now we are inside the run_mod function, which handles the execution of the code.

  1. run_mod calls PyAST_CompileObject, which compiles the AST into a code object containing the bytecode. This code object represents your Python code, but now in a form that the virtual machine can understand and execute.
  2. run_mod then passes the code object to PyEval_EvalCode, which hands it to the virtual machine for evaluation.
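
Python exposes a similar compile-then-evaluate split through the built-in compile and exec functions. The sketch below uses them as Python-level stand-ins for the C functions named above, not as the same code path. The source string and the "<example>" filename are made up for illustration.

```python
# A made-up one-line program, used only for illustration.
source = "answer = 6 * 7"

# compile turns source text into a code object (parse + compile in one call).
code = compile(source, "<example>", "exec")
print(type(code).__name__)  # code

# The raw bytecode lives in co_code; its exact bytes differ between releases.
print(code.co_code)

# exec evaluates the code object inside the namespace we pass in.
namespace = {}
exec(code, namespace)
print(namespace["answer"])  # 42
```

Nothing runs until exec is called. Before that point the code object is just data describing the program.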

All the steps needed to get Python code running are now complete. We now have the main thread, the namespace, the builtins, the internals … everything needed to run a Python program. So the final step is the execution loop.

Almost all virtual machines have something called an Execution Loop. It is basically the execution environment of a program, and to execute anything the virtual machine needs the help of several components. One important component is the Program Counter.

The Program Counter, as the name suggests, marks the next instruction to be executed, and after every instruction this counter is “incremented”. You can imagine that your code is indexed and the Program Counter holds the index of the next instruction, so the virtual machine always knows what comes next. In Python, as in other programming languages, the Program Counter is not always simply incremented, and plain incrementing is only the most basic case. Python is a complex language, and programming languages have branches, jumps and other control-flow constructs, so the virtual machine can set the Program Counter to a different position. The main principle remains the same: mark the next instruction to be executed.
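
To make the idea concrete, here is a toy execution loop with a program counter. It is a sketch only: the instruction names (INCREMENT, JUMP_IF_LESS, HALT) and the run function are invented for this example and have nothing to do with CPython's real opcodes.

```python
def run(program):
    """Run a toy program; return the final counter and the visited indexes."""
    pc = 0        # program counter: index of the next instruction
    counter = 0   # the only piece of state this toy machine has
    trace = []    # indexes of the instructions, in execution order

    while pc < len(program):
        op, arg = program[pc]
        trace.append(pc)
        pc += 1  # default behaviour: move on to the next instruction

        if op == "INCREMENT":
            counter += arg
        elif op == "JUMP_IF_LESS":
            limit, target = arg
            if counter < limit:
                pc = target  # a jump overwrites the program counter
        elif op == "HALT":
            break

    return counter, trace


program = [
    ("INCREMENT", 1),          # index 0
    ("JUMP_IF_LESS", (3, 0)),  # index 1: go back to index 0 while counter < 3
    ("HALT", None),            # index 2
]

counter, trace = run(program)
print(counter)  # 3
print(trace)    # [0, 1, 0, 1, 0, 1, 2]
```

The trace shows both behaviours: most of the time the counter simply advances, and the jump instruction is the one place where it is set to an earlier index.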

Several other characteristics of a virtual machine will be described in the next part. This is just the beginning.

Now that the code objects are ready, the virtual machine points the Program Counter at the first instruction. A function called PyEval_EvalFrameEx (which, in CPython 3.6 and later, runs _PyEval_EvalFrameDefault unless a custom frame evaluator is installed) is called to run the code objects, and the execution of the program begins. The opcodes in these code objects manipulate data, and in Python an opcode manipulates data through the Stack (there is also the Stack Pointer, which will be covered in the next article).
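
The standard dis module lets you see the instructions inside a code object together with their byte offsets, which are the positions a program counter would step through. This is a small sketch with a made-up two-line source. The opcode names and the offsets printed differ between Python releases.

```python
import dis

# A made-up two-line program, used only for illustration.
code = compile("x = 1\ny = x + 2\n", "<example>", "exec")

# get_instructions yields one Instruction per opcode, in bytecode order.
instructions = list(dis.get_instructions(code))

for instruction in instructions:
    # offset is the position of the instruction inside code.co_code.
    print(instruction.offset, instruction.opname, instruction.argrepr)
```

The first instruction sits at offset 0, which is where execution of the code object starts.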

The Python virtual machine follows the design of stack-based virtual machines, which basically means that every piece of data handled by an opcode comes from the Stack. This differs from register-based virtual machines, and from real machines, which have both registers and a stack. Keep in mind that every instruction causes push and pop operations on the Stack, which tends to produce more instructions and can in some cases be slower, or require more work, than a register-based virtual machine.
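
A tiny stack machine shows how this works in practice. This is an illustrative sketch: the PUSH, ADD and MUL instructions and the evaluate function are invented for the example and are not CPython opcodes.

```python
def evaluate(instructions):
    """Evaluate a list of (op, arg) pairs on a toy stack machine."""
    stack = []
    for op, arg in instructions:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            # Operands come from the stack; the result goes back onto it.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()


# The expression 2 + 3 * 4 written for the stack machine.
program = [
    ("PUSH", 2),
    ("PUSH", 3),
    ("PUSH", 4),
    ("MUL", None),  # pops 4 and 3, pushes 12
    ("ADD", None),  # pops 12 and 2, pushes 14
]

result = evaluate(program)
print(result)  # 14
```

Note that a single arithmetic expression needed five instructions and several pushes and pops. That is the overhead mentioned above.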

As you probably know, when you compile a C program it is translated into the machine code of the architecture you are compiling for. Something similar happens with Python, except that the target is bytecode rather than machine code: during the generation of the opcodes, several optimizations are performed in order to produce as few instructions as possible. Keep in mind that even a simple Hello World produces several instructions to be handled later. That is why efficient code is faster and why optimization at the high-level layer matters.
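
Both points can be checked with the dis module. The sketch below lists the instructions generated for a one-line Hello World and then looks at one optimization, constant folding, on a made-up expression. The exact opcode names and counts depend on the Python release, so none are shown here.

```python
import dis

# A one-line Hello World still compiles to several instructions.
hello = compile('print("Hello, World!")', "<example>", "exec")
hello_ops = [instruction.opname for instruction in dis.get_instructions(hello)]
print(len(hello_ops), hello_ops)

# Constant folding: the compiler can compute 60 * 60 * 24 ahead of time,
# so the result is stored as a constant instead of being multiplied at run time.
folded = compile("seconds_per_day = 60 * 60 * 24", "<example>", "exec")
print(folded.co_consts)
```

On CPython the constants of the second code object include 86400, the already computed product.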

Ok, my program was executed, now what?

The virtual machine still has some tasks to do. An important function, Py_FinalizeEx, is called to clean up everything the virtual machine left behind. After all this work (and believe me, this is a very brief explanation), the Python virtual machine can die, finishing the execution of the program.

The next article will cover the parser module of the Python Virtual Machine in more depth. We are going to understand what the parse tree objects are and how the parser understands the syntax of Python. We are also going to cover another virtual machine concept, the Stack Pointer.