Previous		Next
Parsing_03_what_is_parsing	Parsing_04_tokenization	Parsing_05_parsing_grammars_and_asts

Parsing: (04) Tokenization

Context Free Grammars, EBNF, and Abstract Syntax Trees

Tokenization

Tokenization involves converting larger sequences of characters into meaningful units (integers, operators, variables, keywords, etc.). Each of these units is called a “token”. To illustrate what tokens are, consider again the aforementioned Java code snippet, repeated below for convenience:

if (x > 7) { // magic number
  foo = 10;
} else {
  /* TODO: this needs a better solution
     right now this is broken */
  System.err.println( "FIX THIS CODE" );
}

This code snippet contains over 100 characters. However, it contains only 24 tokens, namely (put into a table to save space):


`if`	`(`	`x`	`>`	`7`
`)`	`{`	`foo`	`=`	`10`
`;`	`}`	`else`	`{`	`System`
`.`	`err`	`.`	`println`	`(`
`"FIX THIS CODE"`	`)`	`;`	`}`

As shown, tokens separate semantically meaningful groupings of characters into individual units. For example, if, else, 10, "FIX THIS CODE", and others are considered individual, distinct tokens. Additionally, components extraneous to the code, such as unnecessary whitespace and comments, are completely stripped out. That said, there are still some “extra” components in here, such as the braces ({ and }). This is left for the next phase to handle, namely parsing.

More on Parsing:

Parsing: (00) Some sample code to refer to—Some code you can run, compile, and consult as go through the tutorial
Parsing: (01) Background—What is this parsing stuff all about?
Parsing: (02) Formal Language Specifications—How do we write the rules of a language? (FSAs, CFGs, etc.)
Parsing: (03) What is parsing?—The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
Parsing: (04) Tokenization—Context Free Grammars, EBNF, and Abstract Syntax Trees
Parsing: (05) Parsing, Grammars and ASTs—Context Free Grammars, EBNF, and Abstract Syntax Trees
Parsing: (06) Evaluation Phase—The specific phase where we apply grammar productions and construct an AST
Parsing: (07) Syntax vs. Semantic Errors—The formal definition of the difference

Previous		Next
Parsing_03_what_is_parsing	Parsing_04_tokenization	Parsing_05_parsing_grammars_and_asts