Python Does Not Support Non-ASCII Keywords

June 19, 2023(updated on June 23, 2023)

Introduction

It may sound like one, but this post is not a complaint or a criticism. Python does not claim to support non-ASCII keywords. All of its standard keywords contain only ASCII characters, as do the keywords of all other mainstream programming languages. This post is just a very small tour of Python’s source, grammar, tokenizer, and parser.

I was working on something else when I ended up at this question: can a Python keyword contain non-ASCII characters?

For reference, there are PEP 263 - Defining Python Source Code Encodings and PEP 3131 - Supporting Non-ASCII Identifiers. It is possible to use Unicode in literals (strings) and identifiers (variable and function names etc.), and UTF-8 is the default source file encoding. So, it is perfectly fine to have a Python source like this, with the wonderful Turkish letter ğ (Latin small letter g with breve):

import sys

def ğğğ():
    if len(sys.argv) < 2:
        print("ğ: 0")
    else:
        ğ = len(sys.argv[1])
        print("ğ: %d" % ğ)

if __name__ == "__main__":
    ğğğ()

which runs as expected:

$ python3 test.py 
ğ: 0

$ python3 test.py ğğğ
ğ: 3

Problem

So, I thought it would probably be OK to modify the grammar (Grammar/python.gram) and add a new keyword called eğer, which means if in Turkish. However, it did not work. I changed it to eger, with the ASCII g, and it worked. Why was it not working? After a few hours, I think I found the reason.

Python’s Parser

Python’s parser (starting with 3.9, the new PEG-based parser), I think, runs concurrently with the lexer. That is, there are no isolated lexical analysis and parsing steps; the parser runs, and while running, it requests lexical analysis as needed. The lexer basically reads the input character by character and emits tokens. For example, consecutive digits such as 123 are emitted as a NUMBER token (a token with type NUMBER) with the value 123. To simplify a bit, anything quoted is a STRING token, and anything not quoted is a NAME. So, a NAME can be a keyword or an identifier.
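
As a small illustration, the standard tokenize module (a Python-level view of the tokenizer, not the C tokenizer discussed below, but it reports the same token types) shows numbers, strings, and names coming out as NUMBER, STRING, and NAME tokens:

import io
import tokenize

src = 'if x == 123:\n    print("hello")\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # print the token type name and the raw text of each token
    print(tokenize.tok_name[tok.type], repr(tok.string))

Note that both if and x are reported as NAME here.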

Debugging with Python 3.11.4

After 3.11, at some point, the code below changed slightly. It looks like my point is still valid, but I have not had time to verify it yet.

I thought that the lexer emits identifiers and keywords as two different token types, and in fact emits each keyword as its own token type. It is not like that: the lexer only emits NAME, but the token type reported to the parser is modified if the value of the NAME token is a keyword. The effect is as if the lexer gave a different token for each keyword, but it is achieved with this workaround in the initialize_token function in Parser/pegen.c:

static int
initialize_token(Parser *p, Token *token, const char *start, const char *end, int token_type) {
    assert(token != NULL);

    token->type = (token_type == NAME) ? _get_keyword_or_name_type(p, start, (int)(end - start)) : token_type;
...

For NAME tokens, the actual type is returned by the _get_keyword_or_name_type function, which is:

static int
_get_keyword_or_name_type(Parser *p, const char *name, int name_len)
{
    assert(name_len > 0);
    if (name_len >= p->n_keyword_lists ||
        p->keywords[name_len] == NULL ||
        p->keywords[name_len]->type == -1) {
        return NAME;
    }
    for (KeywordToken *k = p->keywords[name_len]; k != NULL && k->type != -1; k++) {
        if (strncmp(k->str, name, name_len) == 0) {
            return k->type;
        }
    }
    return NAME;
}

The actual bytes of the token are pointed to by name and have a length of name_len. This interesting implementation seems to be an optimization. p->keywords is equal to reserved_keywords, which is defined in Parser/parser.c as:

static const int n_keyword_lists = 9;
static KeywordToken *reserved_keywords[] = {
    (KeywordToken[]) {{NULL, -1}},
    (KeywordToken[]) {{NULL, -1}},
    (KeywordToken[]) {
        {"if", 639},
        {"as", 637},
        {"in", 648},
        {"or", 574},
        {"is", 582},
        {NULL, -1},
    },
    (KeywordToken[]) {
        {"del", 603},
        {"def", 649},
        {"for", 647},
        {"try", 621},
        {"and", 575},
        {"not", 581},
        {NULL, -1},
    },
...

Above I only showed the first 4 entries; the array has 9 entries (as indicated by n_keyword_lists), because the longest keywords, continue and nonlocal, are 8 characters long.

The index into reserved_keywords is the length of the keyword; the keywords are grouped by their length. I guess it is implemented like this for performance reasons: instead of searching all the keywords, the parser only searches for the name among the keywords of the same length.
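
The same grouping can be reproduced from Python with the standard keyword module; this is only an illustration of the layout, not the actual generator code:

import keyword

# Group the keywords by their length, mirroring the layout of
# reserved_keywords in Parser/parser.c.
groups = {}
for kw in keyword.kwlist:
    groups.setdefault(len(kw), []).append(kw)

for length in sorted(groups):
    print(length, groups[length])

The last group printed has length 8 and contains continue and nonlocal, matching the 9 entries (indices 0 to 8) of reserved_keywords.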

You might have already realized the problem. Suppose a keyword with a non-ASCII character, such as eğer, is in this list. Since the source is UTF-8 encoded, a non-ASCII character takes more than one (2 to 4) bytes, so eğer takes 5 bytes. If this keyword is used in the code, the function above is called with name_len=5, because the token is 5 bytes long, and the keyword is searched for in p->keywords[5]. However, it is stored in p->keywords[4], because Parser/parser.c is generated from the grammar file Grammar/python.gram rather than written by hand; the generator code is written in Python, it uses Python’s len function, and len('eğer') is 4.
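
The mismatch is easy to see from Python itself:

s = 'eğer'
print(len(s))                   # 4: code points, what the Python-based generator counts
print(len(s.encode('utf-8')))   # 5: bytes, what the C parser passes as name_len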

So, the problem is either:

  • the name_len argument of _get_keyword_or_name_type is ambiguous. Is it the byte length or the string/code-point length? The function has to work correctly, and it has to be called correctly, according to whichever it actually is.

or:

  • the index of reserved_keywords is ambiguous. Is it the byte length or the string/code-point length? The generator code should work according to whichever it actually is.

or:

  • this design is not correct. If reserved_keywords were just a flat list and the search were done over the whole list, the name_len argument would simply be the byte length, the index of reserved_keywords would have no special meaning, and the problem would not happen. For performance, the list could be kept sorted and searched with a binary search.

Solution

In order to fix reserved_keywords (the second option above), the PEG generator code has to be modified. However, it might be a bit strange to see, for example, eğer grouped together with while just because their byte lengths are the same.

If there is no performance penalty, I think the third option above is the ideal one. A group currently contains at most 8 keywords, so in the worst case the current linear search does 8 comparisons. There are fewer than 64 keywords in total, so a binary search over a flat sorted list would do at most 6 comparisons in the worst case. This is worth a try, but it requires a change in the PEG generator and also a major change in the _get_keyword_or_name_type implementation.
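
As a rough sketch of that idea (in Python, over the standard keyword list; the real change would of course be in C and in the generator), the flat sorted table with a binary search could look like this:

import bisect
import keyword

# Hypothetical flat table: all keywords as UTF-8 bytes, sorted once.
table = sorted(kw.encode('utf-8') for kw in keyword.kwlist)

def keyword_index(name_bytes):
    """Return the index of name_bytes in the table, or -1 if it is not a keyword."""
    i = bisect.bisect_left(table, name_bytes)
    if i < len(table) and table[i] == name_bytes:
        return i
    return -1

print(keyword_index(b'if'))      # index of 'if' in the sorted table
print(keyword_index(b'spam'))    # -1, not a keyword

Because such a table is keyed only by the byte strings themselves, there is no length-based index to get wrong, and eğer would be found exactly like any other keyword once added to the table.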

The simplest solution is to fix the first option above. At the moment, name_len is the byte length. If I keep this value as bytes_name_len but recompute name_len as the correct string (code-point) length and use it for most things in that function, it should be enough. This also needs a strlen-like function that works on UTF-8 encoded strings, but that is not difficult.

So, I added this function just above _get_keyword_or_name_type:

static int
_utf8_strlen(const char *s, int len)
{
    int r = 0;
    /* stop after len bytes, or rely only on NUL termination when len < 0 */
    for (int i = 0; (len < 0) || (i < len); i++) {
        unsigned char c = (unsigned char) s[i];
        if (c == 0) return r;
        if ((c & 0x80) == 0x00) i += 0;      /* 1-byte (ASCII) sequence */
        else if ((c & 0xE0) == 0xC0) i += 1; /* 2-byte sequence */
        else if ((c & 0xF0) == 0xE0) i += 2; /* 3-byte sequence */
        else if ((c & 0xF8) == 0xF0) i += 3; /* 4-byte sequence */
        else return 0;                       /* invalid leading byte */
        r++;
    }
    return r;
}

In addition to the len argument, the function also checks for NUL termination (the c == 0 check), as this will be needed later. It is also possible to not use len at all (e.g., by sending -1), in which case only the terminating NUL stops the scan.

Then, I changed _get_keyword_or_name_type to this:

static int
_get_keyword_or_name_type(Parser *p, const char *name, int name_len)
{
    assert(name_len > 0);
    int bytes_name_len = name_len;
    name_len = _utf8_strlen(name, name_len);
    assert(name_len > 0);
    if (name_len >= p->n_keyword_lists ||
        p->keywords[name_len] == NULL ||
        p->keywords[name_len]->type == -1) {
        return NAME;
    }
    for (KeywordToken *k = p->keywords[name_len]; k != NULL && k->type != -1; k++) {
        if (strlen(k->str) != bytes_name_len) {
            continue;
        }
        if (strncmp(k->str, name, bytes_name_len) == 0) {
            return k->type;
        }
    }
    return NAME;
}

So I keep the incoming name_len argument as bytes_name_len, then call the new function to calculate the actual string length according to UTF-8 encoding and store it back in name_len. name_len is then used almost everywhere except in strncmp. Additionally, another check is needed, because the elements of p->keywords[name_len] may now have different byte lengths.

Test

To test the solution, I added a keyword called importğ that works like import (because this is simpler than changing if). This requires a change in Grammar/python.gram like this:

simple_stmt[stmt_ty] (memo):
    | assignment
    | &"type" type_alias
    | e=star_expressions { _PyAST_Expr(e, EXTRA) }
    | &'return' return_stmt
    | &('import' | 'from') import_stmt
    | &('importğ') import_stmt
    | &'raise' raise_stmt
    | 'pass' { _PyAST_Pass(EXTRA) }
    | &'del' del_stmt
    | &'yield' yield_stmt
    | &'assert' assert_stmt
    | 'break' { _PyAST_Break(EXTRA) }
    | 'continue' { _PyAST_Continue(EXTRA) }
    | &'global' global_stmt
    | &'nonlocal' nonlocal_stmt

import_name[stmt_ty]:
    | 'import' a=dotted_as_names { _PyAST_Import(a, EXTRA) }
    | 'importğ' a=dotted_as_names { _PyAST_Import(a, EXTRA) }

and then running make regen-pegen to regenerate Parser/parser.c.

Before the change, it results in an error:

$ ./python -c 'import sys'

$ ./python -c 'importğ sys'
ModuleNotFoundError: No module named 'unicodedata'

After the change:

$ ./python -c 'import sys'

$ ./python -c 'importğ sys'

it gives no errors.

There is one additional problem which I am still investigating: the keyword has to start with an ASCII character, so importğ works but ğğğ does not, and the latter gives an error at the make regen-pegen step.