Previous | Next | |
Parsing_02_formal_spec | Parsing_03_what_is_parsing | Parsing_04_tokenization |
Parsing: (03) What is parsing?
The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
More on Parsing:
- Parsing: (00) Some sample code to refer to—Some code you can run, compile, and consult as go through the tutorial
- Parsing: (01) Background—What is this parsing stuff all about?
- Parsing: (02) Formal Language Specifications—How do we write the rules of a language? (FSAs, CFGs, etc.)
- Parsing: (03) What is parsing?—The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
- Parsing: (04) Tokenization—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (05) Parsing, Grammars and ASTs—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (06) Evaluation Phase—The specific phase where we apply grammar productions and construct an AST
- Parsing: (07) Syntax vs. Semantic Errors—The formal definition of the difference
Acknowledgments: this series or articles is joint work, a collaboration between Kyle Dewey and Phill Conrad.
What is parsing?
The general and specific sense of the word parsing
The word parsing can be confusing, because it is used in two different ways—a general sense, and a specfic sense.
In Computer Science, in a general, sense, the word parsing refers to any computation that:
- as input, takes sequences of characters in some formally specified language
- as output, produces an organized data structure that can be the basis of some computation.
In this general sense, parsing refers to all three phases: tokenizing, parsing, and interpretation
The more specific sense of parsing is a particular “phase” of this process, the middle part of three phases:
- tokenization (turning sequence of characters into sequence of tokens)
- parsing (applying a context free grammar and producing an AST)
- evaluation/translation (processing the AST that resulting from parsing)
Before we get into the three phases, the how of parsing, let’s look at what we need to be able to do from a big picture standpoint.
What we need to be able to do when we parse.
As an example of the concerns that arise in parsing, suppose our language is restricted to arithmetic expressions involving:
- the integer data type (e.g.
int
as in C/C++/Java), including integer constants (both positive and negative) - four basic arithmetic operators,
+
,-
,*
,/
with the usual meanings (integer division for/
), and usual precedence and associativity - a unary minus operator
- a comparison operator,
==
, that returns1
for true and 0 for false. - parentheses for grouping
We want to be able to do three things. Here’s a quick overview, followed by a more detailed explanation with specific examples:
- Check Syntax. We want to be able to distinguish between
- well-formed expressions (those that follow the rules of the language), and
- errors (strings of characters that do not follow the rules)
- Produce an Abstract Syntax Tree (AST).
- An AST is a data structure corresponding to the string that shows its structure
- Examples will help illustrate this point.
- In the case of an arithmetic expression, the AST is a tree we can traverse to compute the result
- In the case of a programming language, the AST is traversed either to produce machine code (in the case of C/C++), Java bytecodes (for Java), or for some interpreted languages, it is traversed directly to carry out the computation.
- Evalute ASTs to produce a result
- Evaluating the AST is not necessarily part of parsing per se; it is typically considered separate concern
- However, ASTs rarely exist outside the context of parsing, so explore them as part of the same tutorial.
Here is a bit more on each of these topics.
Checking Syntax
As an example of checking syntax, we want to be able to distinguish between well-formed expressions and those with errors.
Let’s use the example of integer arithmetic expressions with five operators: +
,-
,*
,/
,==
and parentheses (
, )
- Later, we’ll discuss how to formally specify the rules for such expressinos
- For, we’ll assume you are already familiar with those rules, at least informally, from using them in C/C++/Java code.
Here are some examples of well-formed expressions, and expressions with errors.
Well formed | Errors |
2+2 |
+ 2 2 |
123 + 345 * 678 |
3 45 + 6 |
(1+2)*678 |
(( 1+2) * 678 |
0 |
45 67 |
-7 + -8 |
9 % 5 |
--3 |
3-- |
1 + 2 == 3 |
3 = = 6 |
Note that we sometimes call these “legal” and “illegal” expressions, though in this case the “law” is simply the rules for the language, and nothing to do with civil or criminal laws.
Various kinds of errors
We’ll see later than there is a distinction between errors that occur at the level of individual characters and at the level of higher structures. This distinction will become more clear as we progress, but for now, let’s point out simply that these two errors are different from the others in the list above:
Tokenizer Errors
Here are two examples of errors that are detected during tokenizing:
-
The expression ` 9 % 5
is not well-formed because in this *particular* arithmetic language, we did not include the
%operator. We *could have*, but we *didn't.* Therefore, the
%` character is an “illegal character”. -
The expression
3 = = 6
is not legal because a==
operator is not allowed to have a space in the middle of it.
When we cover finite state automata in more detail, we’ll see how a finite state automata handles these cases.
Semantic Errors
Also note that from the standpoint of syntax, the following expressions are well-formed, even though they will result in an error (division by zero) when evaluated. These are not syntax errors, but errors of meaning, i.e. semantic errors or evalutor errors.
7/0 |
7/(3-3) |
123/( 4*5 - 20) |
As you can infer from the escalating complexity of the examples, expressions involving division by zero can get arbitrarily complex, which is why we don’t try to detect such problems as part of the “syntax checking”.
For more on syntax vs. semantic errors, see: Parsing: Syntax vs. Semantic Errors
Parsing: Grammars and ASTs
A parser typically has two jobs. The first, which we already covered above, is to determine whether an input string follows the rules of the language
The second job is to product an Abstract Syntax Tree (AST). An AST is a tree structure that represents the meaning of some expression in a formal language.
We’ll cover that in a later section of this tutorial.
More on Parsing:
- Parsing: (00) Some sample code to refer to—Some code you can run, compile, and consult as go through the tutorial
- Parsing: (01) Background—What is this parsing stuff all about?
- Parsing: (02) Formal Language Specifications—How do we write the rules of a language? (FSAs, CFGs, etc.)
- Parsing: (03) What is parsing?—The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation
- Parsing: (04) Tokenization—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (05) Parsing, Grammars and ASTs—Context Free Grammars, EBNF, and Abstract Syntax Trees
- Parsing: (06) Evaluation Phase—The specific phase where we apply grammar productions and construct an AST
- Parsing: (07) Syntax vs. Semantic Errors—The formal definition of the difference
Previous | Next | |
Parsing_02_formal_spec | Parsing_03_what_is_parsing | Parsing_04_tokenization |