Previous | Next | |
Parsing_03_what_is_parsing | Parsing_04_tokenization | Parsing_05_parsing_grammars_and_asts |
Parsing: (04) Tokenization
Context Free Grammars, EBNF, and Abstract Syntax Trees
More on Parsing:
- Parsing: (00) Some sample code to refer to—Some code you can run, compile, and consult as go through the tutorial
- Parsing: (01) Background—What is this parsing stuff all about?
- Parsing: (02) Formal Language Specifications—How do we write the rules of a language? (FSAs, CFGs, etc.)
- Parsing: (03) What is parsing?—The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
- Parsing: (04) Tokenization—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (05) Parsing, Grammars and ASTs—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (06) Evaluation Phase—The specific phase where we apply grammar productions and construct an AST
- Parsing: (07) Syntax vs. Semantic Errors—The formal definition of the difference
Acknowledgments: this series or articles is joint work, a collaboration between Kyle Dewey and Phill Conrad.
Tokenization
Tokenization involves converting larger sequences of characters into meaningful units (integers, operators, variables, keywords, etc.). Each of these units is called a “token”. To illustrate what tokens are, consider again the aforementioned Java code snippet, repeated below for convenience:
if (x > 7) { // magic number
foo = 10;
} else {
/* TODO: this needs a better solution
right now this is broken */
System.err.println( "FIX THIS CODE" );
}
This code snippet contains over 100 characters. However, it contains only 24 tokens, namely (put into a table to save space):
if |
( |
x |
> |
7 |
) |
{ |
foo |
= |
10 |
; |
} |
else |
{ |
System |
. |
err |
. |
println |
( |
"FIX THIS CODE" |
) |
; |
} |
As shown, tokens separate semantically meaningful groupings of characters into individual units.
For example, if
, else
, 10
, "FIX THIS CODE"
, and others are considered individual, distinct tokens.
Additionally, components extraneous to the code, such as unnecessary whitespace and comments, are completely stripped out.
That said, there are still some “extra” components in here, such as the braces ({
and }
).
This is left for the next phase to handle, namely parsing.
More on Parsing:
- Parsing: (00) Some sample code to refer to—Some code you can run, compile, and consult as go through the tutorial
- Parsing: (01) Background—What is this parsing stuff all about?
- Parsing: (02) Formal Language Specifications—How do we write the rules of a language? (FSAs, CFGs, etc.)
- Parsing: (03) What is parsing?—The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
- Parsing: (04) Tokenization—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (05) Parsing, Grammars and ASTs—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (06) Evaluation Phase—The specific phase where we apply grammar productions and construct an AST
- Parsing: (07) Syntax vs. Semantic Errors—The formal definition of the difference
Previous | Next | |
Parsing_03_what_is_parsing | Parsing_04_tokenization | Parsing_05_parsing_grammars_and_asts |