Designing a Syntax Highlighter with Finite State Machines

All Notes

Technology

Extreme close-up of computer code displaying various programming terms and elements.

Designing a Syntax Highlighter with Finite State Machines

The Art of Seeing Code: Beyond Plain Text

Think about your favorite code editor or IDE. As you type, keywords glow in vibrant colors, strings stand out in a different hue, and comments recede into a muted tone. This visual transformation isn’t just aesthetic; it’s a fundamental utility known as syntax highlighting. It dramatically improves readability, helps spot typos, and gives immediate feedback on the structure of your code.

But how does an editor know that if is a keyword, "hello" is a string, and 42 is a number? It’s not magic. Behind the scenes, a sophisticated process of understanding and categorizing text is at play. At the heart of this process for syntax highlighting lies a concept from theoretical computer science: the Finite State Machine (FSM).

The Core Problem: Understanding Code Structure

At its most basic level, a source code file is just a sequence of characters. To a computer, there’s no inherent difference between 'f', 'o', 'r' (which might form a loop keyword) and 'f', 'o', 'o' (which might be a variable name). The editor needs a way to parse this stream of characters and assign meaning to distinct groups.

This initial phase of compiler design is called lexical analysis, or tokenization. The goal is to break down the raw input stream into a sequence of meaningful units called tokens. Each token represents a single logical unit of the program, such as a keyword, an identifier, a literal (number, string), an operator, or a punctuation mark.

For example, the line int count = 0; might be tokenized as:

int (Keyword)
count (Identifier)
= (Operator)
0 (Integer Literal)
; (Delimiter)

Once the code is represented as a stream of tokens, it becomes much easier for a syntax highlighter to assign styles based on the token type. And it’s here that Finite State Machines step into the spotlight.

Demystifying Finite State Machines (FSMs)

A Finite State Machine (FSM), or Finite Automaton, is an abstract machine that can be in exactly one of a finite number of states at any given time. It can change from one state to another in response to some external input; this change is called a transition.

Imagine a simple traffic light:

States: Red, Green, Yellow.
Inputs: Timer expiration.
Transitions: Green -> Yellow -> Red -> Green.

More formally, an FSM is defined by:

A finite set of states (Q).
A finite set of input symbols (Σ), representing the characters it can read.
A transition function (δ) that maps a state and an input symbol to a next state.
A start state (q₀ ∈ Q).
A set of final (or accepting) states (F ⊆ Q), which indicate successful recognition of a pattern.

FSMs can be categorized as:

Deterministic Finite Automata (DFAs): For any given state and input symbol, there is exactly one transition to a next state. These are generally simpler to implement and reason about.
Nondeterministic Finite Automata (NFAs): For a given state and input symbol, there might be multiple possible transitions, or no transition at all. NFAs can be converted into equivalent DFAs.

For the purpose of syntax highlighting, DFAs are perfectly suitable and widely used due to their straightforward implementation.

Designing an FSM for Syntax Highlighting

The core idea is to design a specific FSM for each type of token we want to recognize. The “lexer” (the part of the compiler performing lexical analysis) then uses these FSMs to consume characters and identify tokens.

Let’s walk through examples for common token types.

Example 1: Recognizing Integer Literals

Consider a simple integer like 123, 0, or 98765.

It starts with a digit.
It continues with zero or more digits.

An FSM for this could look like:

States:

S0 (Start State): The initial state, waiting for the first digit.
S1 (In_Integer): We’ve seen at least one digit and are currently forming an integer.

Transitions:

From S0:
- On a digit (0-9) → S1 (append digit to current token)
- On any other character → Error or push back character and try another FSM.
From S1:
- On a digit (0-9) → S1 (append digit)
- On any other character (e.g., space, operator, end of line) → We’ve reached the end of the integer. S1 is an accepting state. Push back the non-digit character for the next token’s FSM to process, and emit the recognized integer token.

This FSM guarantees that we correctly identify a sequence of digits as an integer.

Example 2: Recognizing String Literals

String literals are often enclosed in quotes (e.g., "hello world"). They can also contain escape sequences (e.g., \n for newline, \" for an escaped quote).

States:

S0 (Start State): Initial state.
S1 (In_String): We’ve seen the opening quote and are inside the string.
S2 (In_Escape): We’ve seen a backslash and expect an escape character next.

Transitions:

From S0:
- On " (double quote) → S1 (start buffering characters for the string)
- On any other character → Error or push back.
From S1:
- On " (double quote) → S0 (end of string, S1 is an accepting state. Emit string token.)
- On \ (backslash) → S2 (handle escape sequence)
- On any other character (except newline/EOF, which are often errors in strings) → S1 (append character to string buffer)
From S2: (after a backslash)
- On n, t, r, \, ", etc. (valid escape chars) → S1 (append the escaped character, e.g., \n becomes a single newline char)
- On any other character → S1 (append the backslash and the character, potentially an error or specific language behavior)

Example 3: Recognizing Multi-line Comments

Multi-line comments often start with /* and end with */.

States:

S0 (Start State): Initial state.
S1 (Potential_Comment_Start): We saw a /.
S2 (In_Comment): We’re inside the comment.
S3 (Potential_Comment_End): We saw a * inside the comment.

Transitions:

From S0:
- On / → S1
- On any other character → Push back, try another FSM.
From S1:
- On * → S2 (We found /*, now buffering comment content)
- On any other character → Push back both / and the current character, return to S0 (e.g., / followed by something else might be a division operator).
From S2:
- On * → S3
- On any other character (including /) → S2 (continue buffering comment content)
From S3:
- On / → S0 (We found */, S3 is an accepting state. Emit comment token.)
- On * → S3 (e.g., /**/)
- On any other character → S2 (The * was not the end of the comment, just a character inside. Return to In_Comment state.)

Handling Ambiguity and the Longest Match Rule

What if if is a keyword, but ifdef is an identifier? If we have an FSM that recognizes if and stops, ifdef would be tokenized as if (keyword) and then def (identifier). This is incorrect.

Lexers typically employ the “longest match” rule. When multiple FSMs could potentially recognize a prefix of the input, the one that recognizes the longest possible token is chosen. After that, the lexer looks at the character that caused the FSM to stop (the first character not part of the token) and “pushes it back” onto the input stream so it can be processed as the start of the next token.

For example, when seeing ifdef:

The FSM for if matches if.
The FSM for identifier also matches ifdef.
Since ifdef is longer than if, ifdef is chosen and emitted as an identifier.

The Lexer in Action: An Implementation Approach

A syntax highlighter’s lexer operates as a loop, continually trying to match patterns until the end of the input is reached.

Here’s a high-level algorithm:

Initialize: Set current_position to the start of the input stream.
Loop: While current_position is not at the end of the input: a. Lookahead: Start an “attempt” for each known token FSM. For each FSM, feed characters from current_position one by one, keeping track of the longest successful match found so far and the FSM that achieved it. b. Consume and Transition: For the current FSM being tried, read a character. Based on the character and the current state, transition to the next state. c. Buffer: Append the read character to a temporary buffer for the potential token. d. Recognize/Fail: * If the FSM enters an accepting state, mark the potential token and its length. * If no valid transition is possible from the current state (i.e., the character doesn’t fit the pattern), the current FSM fails for this input. e. Select Longest Match: After attempting all FSMs (or intelligently trying them in priority order), identify the FSM that produced the longest valid token. f. Emit Token: * If a token was successfully recognized (e.g., an integer, keyword, string): * Create a token object containing its type (e.g., TokenType.INTEGER, TokenType.KEYWORD), its lexeme (the actual text, e.g., “123”), and its position (start index, length). * Add this token to a list of recognized tokens. * Advance current_position by the length of the recognized token. * If no FSM recognized a valid token (e.g., @ in a language where it’s not defined, or an unexpected character sequence): * Handle as an UNKNOWN or ERROR token. * Advance current_position by at least one character to avoid infinite loops. * Note: For a simple highlighter, this might just mean skipping the character or coloring it as default.
Return: The list of tokens.

Data Structures for Implementation:

State Transition Table/Map: A common way to implement FSMs is using a table where rows represent current states, columns represent input characters, and cells contain the next state.
Input Buffer: The source code itself, often treated as a character stream.
Token Buffer: A temporary buffer to accumulate characters for the current token being recognized.

From Tokens to Colors: Applying the Style

Once the lexer has done its job and produced a stream of tokens, the rest is relatively straightforward for a syntax highlighter:

Map Token Types to Styles: Each TokenType (e.g., KEYWORD, STRING, COMMENT, IDENTIFIER) is mapped to a specific visual style (color, font weight, background color, etc.). In a web context, this might be a CSS class name.
Render Text with Styles:
- Iterate through the list of tokens.
- For each token, wrap its lexeme (the actual text) in a container (e.g., a <span> in HTML) that applies the corresponding style.
- Concatenate these styled segments to form the final highlighted output.

For example, int count = 0; might become:

<span class="keyword">int</span>
<span class="whitespace"> </span>
<span class="identifier">count</span>
<span class="whitespace"> </span>
<span class="operator">=</span>
<span class="whitespace"> </span>
<span class="literal-number">0</span>
<span class="delimiter">;</span>

This HTML (or similar structure in a native UI framework) is then rendered, giving you the familiar colorful code display.

Beyond the Basics: Limitations and Advanced Considerations

While FSMs are powerful for lexical analysis, they have limitations when it comes to understanding the full context of code:

Context-Sensitivity: FSMs are inherently context-free. They can recognize patterns, but they can’t remember arbitrary information or determine if a variable has been declared before being used. For example, they can’t distinguish between foo as a type definition and foo as a variable instance. This level of understanding requires parsing (syntax analysis) and semantic analysis, which are beyond the scope of a simple FSM-based lexer. A lexer just identifies what a word is (e.g., identifier); a parser determines how that word fits into the language’s grammar.
Performance: For very large files, repeatedly scanning from the beginning can be slow. Real-world editors often use techniques like incremental highlighting, where only the changed lines or sections are re-tokenized and re-rendered.
Regular Expressions: Many lexer generators (like Lex/Flex) use regular expressions to define token patterns. Under the hood, regular expressions are typically compiled into NFAs, which are then converted to DFAs for efficient pattern matching. So, using regex still relies on the same FSM principles.
Unicode and Complex Scripts: Handling international characters, emojis, and right-to-left languages adds layers of complexity that need careful consideration beyond basic ASCII character sets.

Conclusion: The Unsung Heroes of Our IDEs

The next time you gaze upon your beautifully colored code, spare a thought for the humble yet mighty Finite State Machine. It’s the silent workhorse, tirelessly crunching characters and transforming raw text into meaningful tokens. This fundamental process, leveraging the elegant simplicity of FSMs, forms the bedrock of not just syntax highlighting but nearly every tool that interacts with source code, from compilers to linters and refactoring engines.

Understanding these underlying mechanisms not only demystifies our development tools but also provides a deeper appreciation for the structured beauty of computer science.

References & Further Reading:

“Compilers: Principles, Techniques, and Tools” (The Dragon Book) by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. This is the definitive textbook on compiler design, with extensive coverage of lexical analysis and FSMs.
Wikipedia - Lexical analysis: https://en.wikipedia.org/wiki/Lexical_analysis
Wikipedia - Finite-state machine: https://en.wikipedia.org/wiki/Finite-state_machine
Tutorials on Compiler Design/Lexing: Many university computer science courses offer excellent online notes and lectures on these topics, often with practical examples. Searching for “compiler design lexical analysis tutorial” can yield valuable resources.