A quick overview of the Python Virtual Machine — Pt. 2

Caio Cozza
6 min read · Sep 20, 2019

This is part two of our series. In this article I'm going to briefly talk about the Python Parser, Tokenization and the beginning of the execution.

To understand what a Parser is, we don't need any code. If we look at the definition of Parsing on Wikipedia, we get a clear and concise explanation.

Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. https://en.wikipedia.org/wiki/Parsing.

With this understanding we can safely say that every programming language has its own parser, because each language is grammatically different. In interpreted programming languages, several techniques are applied to the unparsed content during Parse time, and one of the most important of these is called Tokenization.

Tokens are the format the virtual machine expects in order to generate more complex structures: code objects, instructions …

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
https://en.wikipedia.org/wiki/Lexical_analysis

Python has a very nice, well-engineered and optimized parser. Now we are going to take a quick overview of the Parsing time.

By looking at the Python source code, more precisely at Python/pythonrun.c, we can find the function I mentioned in the previous article, PyParser_ASTFromFileObject. This function is responsible for creating the AST out of the parse tree, which in turn is the result of another function called PyParser_ParseFileObject. Let's go inside this function; this is where the parsing magic begins.
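As a rough Python-level analogue of that source-to-AST step (just an illustration using the ast module, not the C call chain itself), we can parse a tiny piece of source and dump the resulting tree:

import ast

# Parse a small snippet into an abstract syntax tree and print it.
# The exact dump format varies between Python versions.
tree = ast.parse("a = 1", filename="<example>", mode="exec")
print(ast.dump(tree))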

When we write a piece of code, for example a = 1, we can easily read and understand this expression, because we have a natural parser and interpreter in our head. Python also has a very nice parser (a bit slower than our brain), and the first step of this parser is to break the expression into tokens. Every single element in this expression will be converted into a token; we are now going to break this sentence into understandable Python tokens to see what it looks like.

So, we have a = 1; basically we have 3 tokens here (the short sketch after this list reproduces them with the tokenize module).

  1. NAME, which is ‘a’
  2. OP (OPERATOR), which is the assign ‘=’
  3. NUMBER, which represents a numeric value ‘1’
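Here is a minimal sketch that reproduces exactly those token types using the standard tokenize module (the names below assume Python 3):

import io
import tokenize

# Tokenize the expression "a = 1" and print each token's type and text.
source = "a = 1\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

Besides NAME, OP and NUMBER, the tokenizer also emits a NEWLINE and an ENDMARKER token, which mark the end of the logical line and the end of the input.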

The Python parser has an array describing all the tokens that the Python programming language supports; you can find this array in the file Parser/token.c, and the constant array is called _PyParser_TokenNames. If you are curious to see your own Python code tokenized, you can do so by running:

python -m tokenize your_file.py

And here is an example of it.

file.py
def sayHello(to):
    print "Hello", to

sayHello("Donald, The Duck")
====================================================================
1,0-1,3: NAME 'def'
1,4-1,12: NAME 'sayHello'
1,12-1,13: OP '('
1,13-1,15: NAME 'to'
1,15-1,16: OP ')'
1,16-1,17: OP ':'
1,17-1,18: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,9: NAME 'print'
2,10-2,17: STRING '"Hello"'
2,17-2,18: OP ','
2,19-2,21: NAME 'to'
2,21-2,22: NEWLINE '\n'
3,0-3,1: NL '\n'
4,0-4,0: DEDENT ''
4,0-4,8: NAME 'sayHello'
4,8-4,9: OP '('
4,9-4,27: STRING '"Donald, The Duck"'
4,27-4,28: OP ')'
4,28-4,29: NEWLINE '\n'
5,0-5,0: ENDMARKER ''

We can briefly analyze these tokens. Look at the first one, 1,0-1,3: NAME 'def': it holds the line number plus where the token begins and ends on that line. This identification is important, and you can see it at work when you debug Python code and the debugger knows exactly where it should stop in the ".py" file being debugged. Another interesting feature of Python is the INDENT token. It is a very special token and sometimes causes a lot of trouble for new Python programmers, since Python requires correct indentation to be properly parsed and executed.
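As a small illustration of how indentation problems surface (a minimal sketch, not CPython's internal code path), compiling a badly indented snippet fails before anything runs:

# The function body is not indented, so compilation stops with an IndentationError.
bad = "def f():\nprint('hi')\n"
try:
    compile(bad, "<example>", "exec")
except IndentationError as error:
    print(type(error).__name__, error.msg)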

Tokens are only the first step towards instructions. The Virtual Machine needs something more standardized to actually generate instructions that can be executed, and the tokens must be tied together in a way that makes sense. That's why the Parser creates something called a parse tree, in which tokens are related to each other in a way that reflects the structure of the original Python code.
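If you want to look at such a parse tree from Python itself, one option is the old parser and symbol modules. Note that both were deprecated in Python 3.8/3.9 and removed in 3.10, so this sketch assumes an older interpreter:

import parser   # access to CPython's concrete parse trees (removed in Python 3.10)
import symbol
import token

def to_names(node):
    # Translate the numeric node ids into readable grammar/token names.
    num, rest = node[0], node[1:]
    name = symbol.sym_name.get(num) or token.tok_name.get(num)
    if isinstance(rest[0], str):        # leaf node: (token_id, token_string)
        return (name, rest[0])
    return (name, [to_names(child) for child in rest])

print(to_names(parser.suite("a = 1").tolist()))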

The Tokenization time is also where syntax errors are handled. For example, the code a ;;; 1 is not valid Python: the sequence ';;;' does not exist in the rule set, nor does the grammar permit that kind of relationship between a NAME and a NUMBER.
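A minimal sketch of that rejection, using the compile() built-in to trigger the same parsing machinery:

# The parser rejects the snippet before any bytecode is produced.
try:
    compile("a ;;; 1", "<example>", "exec")
except SyntaxError as error:
    print(error.msg)   # e.g. "invalid syntax"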

When the function PyParser_ParseFileObject ends, its return statement calls another very important function of the parsing time: parsetok.

The parsetok function shares its name with its file, Parser/parsetok.c, and is responsible for building the parse tree from the given tokenizer state pointer. Inside this function we find the Tokenization Loop, a plain for(;;), and here is the proof of what I said before: syntax error analysis is handled here, and an error is raised whenever something falls outside the expected rules. This is a very critical function; it allocates and deallocates memory many times. It begins by grabbing the type of each token and applying the necessary relationships, and in some cases it can also create or modify tokens. It also checks whether a given token requires continuation. For example, a '(' should be closed, so a future ')' is expected; if parsetok cannot find it, an error is raised. This is a simple demonstration of how the parse tree expresses relationships between tokens.
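The same continuation behaviour can be observed from Python through the tokenize module (a small sketch; the exact error message may differ between versions): an unclosed '(' makes the tokenizer keep expecting more input, and it fails when it hits the end of the file instead.

import io
import tokenize

# An opening parenthesis with no closing one: the tokenizer reaches EOF
# while still inside a continuation and raises a TokenError.
try:
    list(tokenize.generate_tokens(io.StringIO("sayHello(\n").readline))
except tokenize.TokenError as error:
    print(error)   # e.g. ('EOF in multi-line statement', (2, 0))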

Let's step back: we already have our tokens. Do you remember that the function that initiates everything is PyParser_ASTFromFileObject? This function ends the Parsing time; the next steps are generating the Symbol Tables (this is close to execution time and will be covered in the next part) and the Bytecode.

Execution comes right after the parsing time, and it happens after the creation of the Python modules. The function that executes your Python code is run_mod (run module); it is called after PyParser_ASTFromFileObject, the function that returns the module. The execution is then handled by run_eval_code_obj, which calls the most important function of the execution, PyEval_EvalCode. Keep in mind that to get to PyObjects it is first crucial to get to the Python modules; the function PyRun_FileExFlags shows this process in a very clear and easy way, and it has only 23 lines of C code. Below you can see an example of the final product of a Python file, its compiled Bytecode.
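As a quick aside before that listing, the hand-off described above can be approximated in pure Python (an assumed analogy built on the compile() and exec() built-ins, not the actual C call chain):

# compile() runs the parser/compiler and returns a code object (a PyCodeObject
# underneath); exec() then evaluates it, roughly what PyEval_EvalCode does in C.
source = (
    'def sayHello(to):\n'
    '    print("Hello", to)\n'
    '\n'
    'sayHello("Donald, The Duck")\n'
)
code_object = compile(source, "file.py", "exec")
exec(code_object, {})   # prints: Hello Donald, The Duck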

file.py
def sayHello(to):
    print "Hello", to

sayHello("Donald, The Duck")
====================================================================
python -m dis file.py
====================================================================
  1           0 LOAD_CONST               0 (<code object sayHello at 0x1048673b0, file "file.py", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (sayHello)

  4           9 LOAD_NAME                0 (sayHello)
             12 LOAD_CONST               1 ('Donald, The Duck')
             15 CALL_FUNCTION            1
             18 POP_TOP
             19 LOAD_CONST               2 (None)
             22 RETURN_VALUE

This is the human-readable form of Python Bytecode, and it is very similar to many of the assembly instruction sets out there. You can find more opcodes of the Python Bytecode Instruction Set here: https://docs.python.org/3/library/dis.html#python-bytecode-instructions.
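You can also get this disassembly programmatically through the dis module (a minimal sketch; opcodes and offsets vary between CPython versions):

import dis

# Compile a tiny snippet to a code object and disassemble it.
dis.dis(compile("a = 1", "<example>", "exec"))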

We have now reached the Run time. All the work of the Parsing time, passing through the PyParser files, Tokenization, parsetok, …, ends here in the execution. We are going to handle the execution of Python code in the next part of this series. Execution involves several virtual machine concepts, and we will go deep into the stack-based nature of Python, so be ready to see virtual machine design that is much more generic than Python itself.

References:
https://docs.python.org/3/contents.html
https://github.com/python/cpython
https://tech.blog.aknin.name/2010/04/02/pythons-innards-introduction/
