
An Introduction to Compilers for Software Engineers

Introduction

A compiler is a computer program that translates code written in one programming language (the source language) into another language (the target language). For most programmers, the source language is a high-level programming language like C, C++, or Java, while the target language is assembly or machine code that can ultimately be executed by the CPU.

Compilers play a crucial role in software development by allowing developers to write human-readable code without worrying about low-level details like memory management and instruction sets. Learning about how compilers work gives software engineers a better mental model of the programming process and exposes them to core computer science concepts like formal languages, graphs, and algorithms.

The compilation process consists of three essential phases: analysis, transformation, and code generation. In the analysis phases (lexical, syntax, semantic), the compiler parses and validates the source code. In the transformation phase, the compiler converts the source into intermediate representations. Finally, in code generation, the compiler produces target assembly or machine code.

By demystifying these phases, I hope this empowers software engineers to write more efficient, thoughtful code with a clearer picture of how it executes. Let's get started!

Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the stream of characters making up the source code and groups them into meaningful tokens.

The lexical analyzer, or lexer, reads the source code character by character, matches the characters against token patterns, and outputs a series of tokens. Token patterns are typically described with regular expressions; for example, [a-zA-Z_][a-zA-Z0-9_]* matches an identifier. In effect, the lexer works like a regex engine that classifies each matched string as a particular kind of token.

Some examples of tokens include identifiers (e.g. variable names), keywords (e.g. if, while, return), operators (+, -, *, /), delimiters (parentheses, braces, semicolons), literals (numeric constants, strings), and comments. An identifier token would match a variable name like my_variable, the keyword if is another token type, and an integer literal like 42 is yet another. The lexer recognizes each of these based on its defined regex patterns.
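
To make this concrete, here is a minimal regex-based lexer sketch in Python. The token names and patterns are illustrative rather than taken from any particular language specification:

```python
import re

# Illustrative token patterns; more specific patterns (keywords) come first.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|while|return)\b"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("DELIMITER",  r"[(){};]"),
    ("SKIP",       r"\s+"),  # whitespace is matched but discarded
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_type, lexeme) pairs for a source string."""
    # A real lexer would also report an error for characters that match no pattern.
    for match in MASTER_RE.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

# KEYWORD 'if', DELIMITER '(', IDENTIFIER 'x', OPERATOR '>', NUMBER '42', ...
print(list(tokenize("if (x > 42) { return x; }")))
```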

The lexer consumes the input character stream and chunks it into these token types. It does not interpret meaning, only recognizing matching character sequences. Things like variable scoping, data types, and other semantics are not considered in lexical analysis.

The stream of tokens emitted by the lexer is then passed to the syntax analysis phase. Here the tokens are arranged into a parse tree based on hierarchical grammatical rules. The lexer outputs the raw materials for this structure by breaking the input into fundamental units.

Lexical analysis is like splitting words from a piece of text. The syntax analysis phase then uses these words to construct structured sentences and paragraphs. The lexer provides an important first step in breaking down source code into analyzed components.

Symbol Tables

A key data structure used throughout the compilation process is the symbol table. The symbol table tracks identifiers declared in the source code like variable and function names, along with information like types and scopes.

Symbol table entries are often created as early as lexical analysis, cataloging the identifiers tokenized by the lexer. Entries contain metadata like the identifier's lexeme and source location. The parser and semantic analyzer add type information and scope levels as declarations are encountered.

In semantic analysis, the symbol table is heavily used for type checking and scope resolution. The analyzer cross-references variable uses against their declared types and scopes in the table. It also detects undeclared identifiers.
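
As a rough sketch, the toy symbol table below keeps a stack of scopes and looks identifiers up from the innermost scope outward. A real compiler stores far more metadata per entry, but the lookup pattern is the same:

```python
class SymbolTable:
    """Toy symbol table: a stack of scopes mapping names to their metadata."""

    def __init__(self):
        self.scopes = [{}]  # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_name):
        current = self.scopes[-1]
        if name in current:
            raise NameError(f"redeclaration of '{name}' in the same scope")
        current[name] = {"type": type_name, "scope_level": len(self.scopes) - 1}

    def lookup(self, name):
        # Search the innermost scope first, then the enclosing scopes.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"use of undeclared identifier '{name}'")

table = SymbolTable()
table.declare("x", "int")
table.enter_scope()
table.declare("y", "float")
print(table.lookup("x"))  # found in the enclosing (global) scope
table.exit_scope()
```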

Symbol tables also aid in name mangling, where identifiers are encoded with type information to support function overloading. The mangled name then resolves to the right declaration based on the argument types.
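
The exact encoding is defined by the platform's ABI (the Itanium C++ ABI used by GCC and Clang is one well-known scheme). The toy function below is not a real mangling scheme; it only illustrates the idea of folding parameter types into the symbol name so that each overload gets a distinct name:

```python
def mangle(name, param_types):
    """Toy name mangling: just the name plus encoded parameter types, not a real ABI."""
    return name + "__" + "_".join(param_types)

# Two overloads of 'add' get distinct low-level symbols.
print(mangle("add", ["int", "int"]))      # add__int_int
print(mangle("add", ["float", "float"]))  # add__float_float
```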

During optimization, symbol table data facilitates analysis of variable lifetimes for register allocation. The table provides information about scope and usage to efficiently allocate registers.

Some declarative information from the symbol table appears in debugging information in the final binary. This maps optimized machine code back to the original source variables and types.

The symbol table is a vital data structure flowing through all compiler stages, enabling tasks like semantic checks, overload resolution, register allocation, and debugging. It contains essential information about identifiers in an accessible table format.

Syntax Analysis

Syntax analysis, or parsing, is the second compilation phase after lexical analysis. It takes the stream of tokens from the lexer and imposes structure based on the syntactic rules of the language.

The parser creates a parse tree, which is usually condensed into an abstract syntax tree (AST), arranging the tokens hierarchically based on the grammar. The highest-level nodes represent major structures like functions or classes; lower levels group statements, expressions, variables, and so on.

The parser uses the language grammar to determine how tokens can legally combine. For example, an if statement must be followed by a condition in parentheses. The parser uses productions like if_statement -> if (expression) { body } to arrange tokens.

These syntax rules are defined formally using a context-free grammar: a collection of productions that show how non-terminals (structures) are composed from terminals (tokens) and other non-terminals. Because productions can refer to themselves, the grammar naturally captures recursive constructs like nested expressions. Two common parsing techniques are top-down (recursive descent) and bottom-up (shift-reduce).

Top-down parsing is one technique that recursively processes productions to match tokens against grammar rules. At each level, it selects a production and attempts to match the input. Backtracking occurs on failed matches.
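
Here is a minimal recursive-descent sketch for a toy grammar (expr -> term (("+" | "-") term)*, term -> NUMBER), assuming (kind, lexeme) token pairs like those the lexer sketch above produces. Each grammar rule becomes a function that consumes tokens and returns an AST node:

```python
class Parser:
    """Toy recursive-descent parser for: expr -> term (("+" | "-") term)* ; term -> NUMBER."""

    def __init__(self, tokens):
        self.tokens = tokens  # list of (kind, lexeme) pairs
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def expect(self, kind):
        tok_kind, lexeme = self.peek()
        if tok_kind != kind:
            raise SyntaxError(f"expected {kind}, found {tok_kind}")
        self.pos += 1
        return lexeme

    def parse_expr(self):
        # expr -> term (("+" | "-") term)*
        node = self.parse_term()
        while self.peek() in (("OPERATOR", "+"), ("OPERATOR", "-")):
            op = self.expect("OPERATOR")
            node = ("binop", op, node, self.parse_term())
        return node

    def parse_term(self):
        # term -> NUMBER
        return ("number", int(self.expect("NUMBER")))

tokens = [("NUMBER", "1"), ("OPERATOR", "+"), ("NUMBER", "2"), ("OPERATOR", "-"), ("NUMBER", "3")]
print(Parser(tokens).parse_expr())
# ('binop', '-', ('binop', '+', ('number', 1), ('number', 2)), ('number', 3))
```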

Bottom-up parsing instead builds the tree from the bottom by shifting/reducing tokens into grammar production matches. The final result is the same parse tree but the process differs.

Syntax analysis ensures the tokens obey the language's rules. It detects and rejects incorrectly structured programs by failing to produce a valid parse. The parser also performs disambiguation, choosing between grammar productions when multiple parses are possible.

The completed syntax tree fully reflects the syntactic structure of the program in a detailed format convenient for further processing. The parser consumes tokens and organizes them into a meaningful representation.

Semantic Analysis

While syntax analysis focuses on the form and structure of the code, semantic analysis focuses on the meaning by checking that the syntactically correct program makes sense. It ensures proper usage of data and operations.

The semantic analyzer takes the AST as input and traverses it, verifying semantic rules and decorating nodes with additional information like types and scopes, cross-referencing the symbol table as it goes.

One key task is type checking - verifying expression and variable types are compatible and conversions are valid. For example, the analyzer checks that variables are the right type for arithmetic operations.

Also, the semantic analyzer does scope resolution by linking variable references to their declarations. Each identifier is marked with its declared type and scope level. Undeclared or misused variables are detected.
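
As an illustration, the toy checker below walks tuple-shaped expression nodes like those built by the parser sketch above, resolves variable references against a symbol mapping, and rejects mismatched operand types. Real analyzers handle far more node kinds, implicit conversions, and error recovery:

```python
def check_types(node, symbols):
    """Toy type checker: return the type of an expression node or raise an error.

    Nodes are tuples like ("number", 1), ("var", "x"), or ("binop", "+", left, right);
    `symbols` maps declared variable names to their types.
    """
    kind = node[0]
    if kind == "number":
        return "int"
    if kind == "var":
        name = node[1]
        if name not in symbols:
            raise NameError(f"use of undeclared identifier '{name}'")
        return symbols[name]  # scope resolution: link the use to its declaration
    if kind == "binop":
        _, op, left, right = node
        left_type = check_types(left, symbols)
        right_type = check_types(right, symbols)
        if left_type != right_type:
            raise TypeError(f"operands of '{op}' have mismatched types: {left_type} vs {right_type}")
        return left_type
    raise ValueError(f"unknown node kind: {kind}")

symbols = {"x": "int", "ratio": "float"}
print(check_types(("binop", "+", ("var", "x"), ("number", 2)), symbols))  # int
# check_types(("binop", "+", ("var", "x"), ("var", "ratio")), symbols) raises TypeError
```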

Name resolution involves matching call sites and function references to the correct declaration. Overloaded functions are disambiguated based on the number and types of their arguments.

The analyzer inserts implicit type conversions where necessary, like converting integers to floating point values before division. It also catches illegal conversions that would lose data or precision.

In addition to typing and scopes, semantic checks can include validating class inheritance hierarchies, checking imported namespaces, and enforcing access modifiers.

At the end of analysis, the syntax tree is fully annotated with semantic information. Each node is enriched with types, scopes, and any inferred attributes. This semantic-rich abstract syntax tree can then be used for subsequent compilation phases.

Ultimately, the semantic analyzer ensures that while code may be syntactically valid, it also must make logical sense according to language rules before being transformed and compiled.

Intermediate Representations

After syntax and semantic analysis, the abstract syntax tree fully represents the source code structure and meaning. The compiler then begins transforming the code through a series of intermediate representations (IRs).

IRs are alternate ways of encoding the program information that are more convenient for analysis and optimization. The AST directly represents original source syntax, while IRs transform it for compiler purposes.

Some common IRs include:

  • Abstract syntax tree (AST) - A tree structure retaining the parsed program syntax.
  • Control flow graph (CFG) - A graph modeling the control flow of the program. Each node is a basic block: a straight-line sequence of instructions with branches only at the end. Edges represent control flow between blocks.
  • Three-address code - A linear IR in which each statement has at most three operands (typically two sources and a destination), like x = y + z.
  • Static single assignment (SSA) form - A representation where each variable is assigned exactly once. Helpful for optimizations.

IRs can be platform-independent, providing a generic assembly-like language to optimize before targeting a specific machine. Or they may be low-level, exposing details like registers and pointers for a particular architecture.

Optimization and analysis algorithms are defined over IRs, not raw syntax trees. For example, dead code elimination removes unreachable nodes in the control flow graph. Constant folding substitutes constant values in three-address instructions.
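
As a sketch, the pass below folds constants in a list of three-address instructions represented as (dest, op, arg1, arg2) tuples; the instruction format here is invented for illustration:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(instructions):
    """Toy constant-folding pass over three-address code.

    Each instruction is (dest, op, arg1, arg2); arguments are ints or variable names.
    Once a destination is known to hold a constant, later uses are replaced by its value.
    """
    known = {}   # variable name -> constant value
    folded = []
    for dest, op, a, b in instructions:
        a = known.get(a, a)  # substitute already-known constants
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = OPS[op](a, b)  # fold: dest now holds a known constant
            folded.append((dest, "const", known[dest], None))
        else:
            folded.append((dest, op, a, b))
    return folded

# t1 = 2 * 3; t2 = t1 + x   becomes   t1 = const 6; t2 = 6 + x
print(fold_constants([("t1", "*", 2, 3), ("t2", "+", "t1", "x")]))
```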

IR transformations simplify the program and expose optimization opportunities. They remove syntactic complexities and ambiguities, allowing the compiler to focus just on program semantics and logic flow.

Frontends emit IR from source code, while backends consume IR to generate machine-dependent code. IR provides a well-defined representation to link these phases. It also enables retargetability - adapting the compiler to new platforms by just swapping the backend.

After machine-independent optimizations are applied, the transformed IR moves to code generation. Here it is mapped to real machine instructions while retaining the optimizations and abstractions from earlier phases.

Code Generation

The code generator is the final phase of the compiler backend that transforms the optimized intermediate representation into target machine code.

One major task is register allocation. The IR has unlimited temporary variables, while real architectures have a fixed set of physical registers. The allocator must efficiently map IR variables to available registers.
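
A simple strategy is linear-scan allocation over variable live intervals. The sketch below hands out a fixed pool of register names and spills when none are free; the interval format and register names are illustrative, and a production allocator would use a smarter spilling heuristic:

```python
def linear_scan(intervals, registers):
    """Toy linear-scan register allocation.

    `intervals` maps each variable to its (start, end) live range;
    `registers` is the pool of physical register names available.
    Returns a mapping from variable to a register name or the string "spill".
    """
    free = list(registers)
    active = []      # (end, var) pairs for variables currently holding a register
    allocation = {}
    for var, (start, end) in sorted(intervals.items(), key=lambda item: item[1][0]):
        # Expire intervals that ended before this one starts, freeing their registers.
        for old_end, old_var in list(active):
            if old_end < start:
                active.remove((old_end, old_var))
                free.append(allocation[old_var])
        if free:
            allocation[var] = free.pop()
            active.append((end, var))
        else:
            allocation[var] = "spill"  # no register available: keep the value in memory
    return allocation

intervals = {"a": (0, 4), "b": (1, 2), "c": (3, 6), "d": (3, 5)}
print(linear_scan(intervals, ["r1", "r2"]))  # 'd' gets spilled with only two registers
```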

Instruction selection converts IR operations into equivalent machine instructions. The target instruction set architecture defines these instructions. For example, a multiply operator becomes a MUL instruction on x86.
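
Continuing the toy three-address form from earlier, instruction selection can be sketched as a mapping from IR operators to target instruction templates. The assembly strings below are x86-flavored but simplified for illustration:

```python
# Simplified, x86-flavored instruction templates keyed by IR operator.
TEMPLATES = {
    "+": "ADD {dest}, {src}",
    "-": "SUB {dest}, {src}",
    "*": "IMUL {dest}, {src}",
}

def select_instructions(instruction):
    """Toy instruction selection for one three-address instruction (dest, op, a, b)."""
    dest, op, a, b = instruction
    return [
        f"MOV {dest}, {a}",                      # place the first operand in the destination
        TEMPLATES[op].format(dest=dest, src=b),  # apply the operation with the second operand
    ]

# t1 = x * y  ->  MOV t1, x ; IMUL t1, y   (register names assigned by the allocator)
for line in select_instructions(("t1", "*", "x", "y")):
    print(line)
```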

Instruction scheduling orders instructions to maximize parallel execution and pipeline usage. Dependencies and hazards may require rearranging instructions from their original IR order.

Throughout code generation, the compiler must preserve semantic equivalence. Optimization and transformations cannot alter program behavior, only improve performance.

After scheduling, the compiler emits the actual assembly code, encoding each scheduled operation with its allocated registers and memory operands.

For native compilation, target code is tailored to the host architecture. For cross-compilers, separate backends support multiple target platforms from the same frontend.

The operating system ABI (application binary interface) defines conventions for function calling, stack frames, and register usage. Code generators adhere to these conventions.

Optimizations like register allocation are critical for generating efficient code. Simple one-to-one IR mapping would produce slow, unoptimized assembly. Intelligent allocation, ordering, and instruction choices are necessary.

The end result of compilation is target platform executable code ready to be loaded and executed. The code generator outputs binaries like .exe or object files containing this ready-to-run machine code.

Optimization

Optimization is the process of transforming code to make it faster, smaller, and more efficient without changing its functional behavior. Optimizations occur after initial analysis and transformations to IR.

Common optimizations include:

  • Constant propagation - Replacing variables with known constant values.
  • Dead code elimination - Removing unreachable code and unused variables.
  • Loop optimizations - Improving loop performance through techniques like unrolling and loop fusion (jamming).
  • Inlining - Inserting function body at call site instead of calling.
  • Common subexpression elimination - Caching redundant sub-computations.

Optimizations operate on IR by applying semantic-preserving transformations. For example, dead code elimination removes nodes from the control flow graph that cannot be reached.
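
For instance, unreachable-block elimination boils down to a reachability walk over the CFG. The sketch below represents the CFG as a plain adjacency mapping with made-up block names:

```python
def eliminate_unreachable(cfg, entry):
    """Toy dead-code elimination: drop CFG blocks not reachable from the entry block.

    `cfg` maps each block name to the list of blocks it can branch to.
    """
    reachable = set()
    worklist = [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        worklist.extend(cfg.get(block, []))
    return {block: successors for block, successors in cfg.items() if block in reachable}

cfg = {
    "entry": ["loop"],
    "loop": ["loop", "exit"],
    "exit": [],
    "orphan": ["exit"],  # never branched to, so it is removed
}
print(eliminate_unreachable(cfg, "entry"))  # 'orphan' is gone
```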

The compiler contains multiple passes that progressively apply optimizations. Passes are often interleaved with additional analysis passes to enable further optimization opportunities.

Optimization levels control how many passes are applied (for example, -O0 through -O3 in GCC and Clang). Lower levels optimize minimally for fast compilation, while higher levels apply more aggressive transformations that produce faster, and often larger, code.

Optimizations should improve performance according to workload profiles. Compiler writers analyze common workloads and identify key hot spots to focus optimizations on, like hash table access patterns.

Optimizations increase compile time, sometimes significantly at higher levels. There is a tradeoff between optimization time and improved run time performance. Just-in-time compilers optimize sparingly due to compile latency.

Optimized code can be harder to debug because variables may be eliminated or held only in registers and statements may be reordered. Debug information must be generated to map the optimized code back to the original source.

The optimized IR code finally goes to the code generator to produce efficient assembly instructions. Without optimization, code generation would be limited in potential efficiency gains.

Conclusion

This article has provided an overview of the key phases and components involved in compiling source code into executable machine code. While compilers are complex programs, gaining a high-level understanding of the process helps demystify what happens behind the scenes when code is compiled.

We saw that compilation begins with analysis of the source text, first scanning characters into tokens, then organizing tokens into a parse tree based on syntax rules. Semantic analysis enriches the syntax tree with deeper meaning by associating types and scopes.

The compiler then transforms the code through a series of intermediate representations and optimizations. These representations expose more opportunities for analysis and optimization while removing specifics of the source language syntax.

The final code generator produces efficient target machine code, taking into account constraints like registers and instruction sets. The result is optimized, executable machine code ready to be run directly on the hardware.

Learning about compilers gives programmers perspective into how high-level languages are systematically compiled down to low-level machine code. This knowledge also reveals all the work compilers do to generate fast, efficient programs.

While introductory, this article covers the core ideas and phases of this complex process, providing software engineers insight into the magic happening behind the scenes whenever code is compiled and run.