dev/lang/parser/ SimpleTokeniserInPython1
Explanation
V1
Here we have a single input file, input.txt, which looks like this:
function hello {
print "hello";
}
We tokenise by first defining a string of characters for each character class:
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha+numeric
whitespace = " \t\r\n"
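As a quick illustration of how these class strings get used, a membership test with `in` is enough to classify a single character. The helper name `classify` is hypothetical and not part of the tokeniser itself; this is only a sketch, assuming it is always called with a one-character string.

```python
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
whitespace = " \t\r\n"

def classify(c):
    """Return the name of the character class c belongs to (hypothetical helper)."""
    if c in alpha:
        return "alpha"
    if c in numeric:
        return "numeric"
    if c in whitespace:
        return "whitespace"
    return "other"  # e.g. braces, semicolons, quote marks

print([classify(c) for c in "a1 {"])
# prints ['alpha', 'numeric', 'whitespace', 'other']
```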
We have what is known as a state machine. The tokeniser starts in the scan
state.
It then proceeds character by character and decides, based on the character it sees, what to do
and which state to go into next. Tools like ANTLR automate this process, but to
understand it, it is instructive to write your own tokeniser and parser without assisting tools.
Once you understand what a parser generator is doing for you, you're better equipped to understand
how computer languages work at the grammatical level.
We shall flesh out this language grammar and then write a script that parses input according to that grammar.
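The "which state next" decision can be sketched on its own as a small function. The name `next_state` is hypothetical; it only mirrors the three states used on this page (scan, identifier, whitespace) and leaves out token collection entirely.

```python
alpha = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphanumeric = alpha + "0123456789"
whitespace = " \t\r\n"

def next_state(state, c):
    """Return the state the machine moves to after seeing c (hypothetical helper)."""
    if state == "scan":
        if c in alpha:
            return "identifier"
        if c in whitespace:
            return "whitespace"
        return "scan"  # single-character tokens are handled without leaving scan
    if state == "identifier":
        return "identifier" if c in alphanumeric else "scan"
    if state == "whitespace":
        return "whitespace" if c in whitespace else "scan"
    raise ValueError(f"unknown state: {state}")

print(next_state("scan", "f"))        # prints identifier
print(next_state("identifier", "1"))  # prints identifier
print(next_state("identifier", " "))  # prints scan
print(next_state("scan", "{"))        # prints scan
```

In the full tokeniser, falling back from identifier or whitespace to scan does not consume the character: the same character is looked at again from the scan state.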
The source:
V1
The input:
function hello {
print "hello";
}
The Python:
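with open("input.txt") as f:
    input_text = f.read()

# step 1: tokenise
state = "scan"  # current state of the state machine
i = 0           # index of the character being examined
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha + numeric
whitespace = " \t\r\n"
tokens = []     # list of (token_type, text) pairs
src = input_text

while i < len(src):
    c = src[i]
    print(f"scan <{state}> i={i} c={c}")
    if state == "scan":
        if c in alpha:
            # start of an identifier: remember where it begins
            i0 = i
            state = "identifier"
        elif c in whitespace:
            # start of a run of whitespace
            i0 = i
            state = "whitespace"
        elif c == "{":
            # single-character tokens are emitted at once; we stay in "scan"
            tokens.append(("lbrace", "{"))
        elif c == "}":
            tokens.append(("rbrace", "}"))
        elif c == ";":
            tokens.append(("semicolon", ";"))
        # any other character (such as the quote marks in the input)
        # is silently skipped
    elif state == "identifier":
        if c in alphanumeric:
            pass  # still inside the identifier
        else:
            # the identifier ended just before this character
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue  # re-examine the same character in the scan state
    elif state == "whitespace":
        if c in whitespace:
            pass  # still inside the whitespace run
        else:
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue  # re-examine the same character in the scan state
    i += 1

# note: an identifier or whitespace run still open when the input ends
# is not emitted
print(tokens)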
The output (skipping the debug prints):
[('identifier', 'function'), ('whitespace', ' '),
 ('identifier', 'hello'), ('whitespace', ' '),
 ('lbrace', '{'), ('whitespace', '\n '),
 ('identifier', 'print'), ('whitespace', ' '),
 ('identifier', 'hello'), ('semicolon', ';'),
 ('whitespace', '\n'), ('rbrace', '}')]
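Note that the quote marks around "hello" are dropped, so the string comes out as an identifier. One possible next step, sketched here and not part of V1, is to add a string state and to flush whatever token is still open when the input ends. The function name `tokenise`, the `punctuation` table, the "string" token type and the decision to raise `ValueError` on unknown characters are all assumptions made for this sketch.

```python
alpha = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphanumeric = alpha + "0123456789"
whitespace = " \t\r\n"
punctuation = {"{": "lbrace", "}": "rbrace", ";": "semicolon"}

def tokenise(src):
    """Sketch of the V1 state machine with an extra "string" state."""
    tokens = []
    state = "scan"
    i = 0
    i0 = 0
    while i < len(src):
        c = src[i]
        if state == "scan":
            if c in alpha:
                i0, state = i, "identifier"
            elif c in whitespace:
                i0, state = i, "whitespace"
            elif c == '"':
                i0, state = i, "string"
            elif c in punctuation:
                tokens.append((punctuation[c], c))
            else:
                # assumption: report unknown characters instead of skipping them
                raise ValueError(f"unexpected character {c!r} at index {i}")
        elif state == "identifier":
            if c not in alphanumeric:
                tokens.append((state, src[i0:i]))
                state = "scan"
                continue  # re-examine c in the scan state
        elif state == "whitespace":
            if c not in whitespace:
                tokens.append((state, src[i0:i]))
                state = "scan"
                continue  # re-examine c in the scan state
        elif state == "string":
            if c == '"':
                # closing quote: keep the text between the quotes
                tokens.append((state, src[i0 + 1:i]))
                state = "scan"
        i += 1
    # flush a token that was still open when the input ended
    if state in ("identifier", "whitespace"):
        tokens.append((state, src[i0:]))
    elif state == "string":
        raise ValueError("unterminated string literal")
    return tokens

print(tokenise('print "hello";\n'))
# prints [('identifier', 'print'), ('whitespace', ' '), ('string', 'hello'), ('semicolon', ';'), ('whitespace', '\n')]
```

This sketch does not handle escape sequences inside strings; a quote mark always ends the string.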