dev/lang/parser/ SimpleTokeniserInPython1


Explanation

V1

Here we have a single input file, input.txt, which looks like this:

function hello {
  print "hello";
}

We tokenise by defining, for each character class, a string of the characters that belong to it:

alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha+numeric
whitespace = " \t\r\n"
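
Classifying a character is then just a membership test against these strings:

>>> "h" in alpha
True
>>> "7" in alphanumeric
True
>>> "{" in alphanumeric
False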

We have what is known as a state machine. The tokeniser starts in the scan state. It then proceeds, character by character, and decides, based on the character it sees, what to do and which state to move into next. Tools like ANTLR automate this process, but it is instructive to write your own tokeniser and parser without assisting tools. Once you understand what a parser generator is doing for you, you are better equipped to understand how computer languages work at the grammatical level.
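
To make the state-machine idea concrete, here is a small sketch of the same transitions written as an explicit lookup table. This is illustrative only: V1 below encodes its transitions as if/elif chains, and for brevity digits and punctuation are lumped into "other" here. It assumes the character classes defined above.

# Illustrative sketch: the scan/identifier/whitespace transitions as a
# lookup table. Anything not in the table falls back to the scan state.
def char_class(c):
  if c in alpha:
    return "alpha"
  if c in whitespace:
    return "whitespace"
  return "other"

transitions = {
  ("scan", "alpha"):            "identifier",
  ("scan", "whitespace"):       "whitespace",
  ("identifier", "alpha"):      "identifier",  # still inside the word
  ("identifier", "whitespace"): "whitespace",  # word ended: token boundary
  ("whitespace", "alpha"):      "identifier",  # spaces ended: token boundary
  ("whitespace", "whitespace"): "whitespace",
}

state = "scan"
for c in "function hello {":
  state = transitions.get((state, char_class(c)), "scan")
  print(c, "->", state)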

We shall flesh out this language's grammar and then write a script that parses it.

The source:

V1

The input:

function hello {
  print "hello";
}

The Python:

input_text = open("input.txt").read()

# step 1: tokenise
state = "scan"
i = 0
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha+numeric
whitespace = " \t\r\n"
tokens = []
src = input_text
while i < len(src):
  c = src[i]
  print(f"scan <{state}> i={i} c={c}")
  if state == "scan":
    if c in alpha:
      i0 = i                  # remember where the identifier started
      state = "identifier"
    elif c in whitespace:
      i0 = i                  # remember where the whitespace run started
      state = "whitespace"
    elif c == "{":
      tokens.append(("lbrace","{"))
      state = "scan"
    elif c == "}":
      tokens.append(("rbrace","}"))
      state = "scan"
    elif c == ";":
      tokens.append(("semicolon",";"))
      state = "scan"
  elif state == "identifier":
    if c in alphanumeric:
      pass                    # still inside the identifier
    else:
      tokens.append((state,src[i0:i]))
      state = "scan"
      continue                # re-examine this character in the scan state
  elif state == "whitespace":
    if c in whitespace:
      pass                    # still inside the whitespace run
    else:
      tokens.append((state,src[i0:i]))
      state = "scan"
      continue                # re-examine this character in the scan state
  i += 1
# note: a token still in progress when the input ends is never emitted in V1
print(tokens)

The output (skipping the debug prints):

[('identifier', 'function'), ('whitespace', ' '), ('identifier', 'hello'), ('whitespace', ' '), ('lbrace', '{'), ('whitespace', '\n  '), ('identifier', 'print'), ('whitespace', ' '), ('identifier', 'hello'), ('semicolon', ';'), ('whitespace', '\n'), ('rbrace', '}')]
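
Note that the output contains ('identifier', 'hello') where the input had the string literal "hello": V1 has no rule for the double quote, so in the scan state a " matches no branch and is simply stepped over, and the characters between the quotes are lexed as an ordinary identifier. Here is a sketch of how a dedicated string state could handle it; this is not part of V1, tokenise_strings is a made-up name, and the state is isolated so the snippet runs on its own.

# Sketch only: a minimal "string" state, isolated from the rest of the
# tokeniser. A later version could fold these branches into the main loop.
def tokenise_strings(src):
  tokens = []
  state = "scan"
  i = 0
  while i < len(src):
    c = src[i]
    if state == "scan":
      if c == '"':
        i0 = i + 1        # the string body starts after the opening quote
        state = "string"
    elif state == "string":
      if c == '"':        # closing quote: emit the body as one token
        tokens.append(("string", src[i0:i]))
        state = "scan"
    i += 1
  return tokens

print(tokenise_strings('print "hello";'))  # [('string', 'hello')]

Before the parsing step, the whitespace tokens are usually discarded, e.g. tokens = [t for t in tokens if t[0] != "whitespace"], so that the parser only has to deal with significant tokens.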