Parsing: (03) What is parsing?

The general and specific sense of the word, and explanation of three phases: tokenization, parsing, evaluation

More on Parsing:

Acknowledgments: this series or articles is joint work, a collaboration between Kyle Dewey and Phill Conrad.

What is parsing?

The general and specific sense of the word parsing

The word parsing can be confusing, because it is used in two different ways—a general sense, and a specfic sense.

In Computer Science, in a general, sense, the word parsing refers to any computation that:

In this general sense, parsing refers to all three phases: tokenizing, parsing, and interpretation

The more specific sense of parsing is a particular “phase” of this process, the middle part of three phases:

Before we get into the three phases, the how of parsing, let’s look at what we need to be able to do from a big picture standpoint.

What we need to be able to do when we parse.

As an example of the concerns that arise in parsing, suppose our language is restricted to arithmetic expressions involving:

We want to be able to do three things. Here’s a quick overview, followed by a more detailed explanation with specific examples:

  1. Check Syntax. We want to be able to distinguish between
    • well-formed expressions (those that follow the rules of the language), and
    • errors (strings of characters that do not follow the rules)
  2. Produce an Abstract Syntax Tree (AST).
    • An AST is a data structure corresponding to the string that shows its structure
    • Examples will help illustrate this point.
    • In the case of an arithmetic expression, the AST is a tree we can traverse to compute the result
    • In the case of a programming language, the AST is traversed either to produce machine code (in the case of C/C++), Java bytecodes (for Java), or for some interpreted languages, it is traversed directly to carry out the computation.
  3. Evalute ASTs to produce a result
    • Evaluating the AST is not necessarily part of parsing per se; it is typically considered separate concern
    • However, ASTs rarely exist outside the context of parsing, so explore them as part of the same tutorial.

Here is a bit more on each of these topics.

Checking Syntax

As an example of checking syntax, we want to be able to distinguish between well-formed expressions and those with errors.

Let’s use the example of integer arithmetic expressions with five operators: +,-,*,/,== and parentheses (, )

Here are some examples of well-formed expressions, and expressions with errors.

Well formed Errors
2+2 + 2 2
123 + 345 * 678 3 45 + 6
(1+2)*678 (( 1+2) * 678
0 45 67
-7 + -8 9 % 5
--3 3--
1 + 2 == 3 3 = = 6

Note that we sometimes call these “legal” and “illegal” expressions, though in this case the “law” is simply the rules for the language, and nothing to do with civil or criminal laws.

Various kinds of errors

We’ll see later than there is a distinction between errors that occur at the level of individual characters and at the level of higher structures. This distinction will become more clear as we progress, but for now, let’s point out simply that these two errors are different from the others in the list above:

Tokenizer Errors

Here are two examples of errors that are detected during tokenizing:

When we cover finite state automata in more detail, we’ll see how a finite state automata handles these cases.

Semantic Errors

Also note that from the standpoint of syntax, the following expressions are well-formed, even though they will result in an error (division by zero) when evaluated. These are not syntax errors, but errors of meaning, i.e. semantic errors or evalutor errors.

7/0 7/(3-3) 123/( 4*5 - 20)

As you can infer from the escalating complexity of the examples, expressions involving division by zero can get arbitrarily complex, which is why we don’t try to detect such problems as part of the “syntax checking”.

For more on syntax vs. semantic errors, see: Parsing: Syntax vs. Semantic Errors

Parsing: Grammars and ASTs

A parser typically has two jobs. The first, which we already covered above, is to determine whether an input string follows the rules of the language

The second job is to product an Abstract Syntax Tree (AST). An AST is a tree structure that represents the meaning of some expression in a formal language.

We’ll cover that in a later section of this tutorial.

More on Parsing: