Family Ties part 2: A Tale of Three ASTs

2016-05-08 familyties

This is the second of a series of articles on what I’ve learned about Erlang (and Elixir) from writing Erl2ex, an Erlang-to-Elixir transpiler. This week we take a look at the heart of each language, its abstract syntax tree, or AST, and what each tells us about its language.

AST of the AST

Visualizing the language

Before I began working with Elixir, ASTs were just an esoteric internal data structure used by compilers. One of the interesting innovations of Elixir has been to make access to the AST commonplace via its metaprogramming features. Every Elixir program has a canonical representation using Elixir’s on data structures, a representation that you can access and manipulate at compile time, providing powerful code generation capabilities. Certainly, Elixir is not the first language to have done this—Lisps have been manipulating their own program structure for decades—but Elixir has done much in a very short time to bring these capabilities into mainstream software development.

Erlang, of course, also has its own internal AST, but it is not as obvious to the everyday programmer. Still, Erlang’s compiler is part of its standard library, and you can use those tools to parse and compile Erlang code from Erlang. This access to the AST in both languages underlies the basic design of Erl2ex. Erlang provides libraries to parse Erlang source to Erlang AST, and Elixir provides libraries to reconstruct Elixir source from Elixir AST. Erl2ex simply needs to fit in the middle, performing a data transform from Erlang AST to Elixir AST. (At least, it’s simple in concept, though perhaps not so much in practice.)

To get a feel for what ASTs actually look like, let’s return to the count example we saw in part 1.

In Elixir, we defined the count function in two clauses:

def count([]), do: 0
def count([_head | tail]), do: 1 + count(tail)

If we parse this into AST (we’ll see how to do this below), we get the following two data structures:

{:def, [context: Elixir, import: Kernel], [
  {:count, [context: Elixir], [[]]},
  [do: 0]
]},
{:def, [context: Elixir, import: Kernel], [
  {:count, [context: Elixir],
    [[{:|, [], [{:_head, [], Elixir}, {:tail, [], Elixir}]}]]
  },
  [do:
    {:+, [context: Elixir, import: Kernel], [
      1,
      {:count, [], [{:tail, [], Elixir}]}
    ]}
  ]
]}

We’ll take a closer look at a few of the important features of this data below, but for now, just note two things. First, you can kind of see the program structure—the function definitions, the arguments and variables, and the expressions—in the AST. And second, the AST uses Elixir’s data structures; it’s built out of tuples, arrays, and atoms. Thus, Elixir code can interpret and manipulate it.

The corresponding Erlang code:

count([]) -> 0;
count([_Head | Tail]) -> 1 + count(Tail).

…looks like this in Erlang AST. (I’m using Elixir syntax to display the AST to make it easier to compare with the Elixir AST above. But since the languages have the same data types, showing this in Erlang syntax wouldn’t look much different.)

{:function, 1, :count, 1, [
  {:clause, 1,
    [{nil, 1}], [],
    [{:integer, 1, 0}]
  },
  {:clause, 2,
    [{:cons, 2, {:var, 2, :_Head}, {:var, 2, :Tail}}], [],
    [{:op, 2, :+,
      {:integer, 2, 1},
      {:call, 2, {:atom, 2, :count}, [{:var, 2, :Tail}]}
    }]
  }
]}

Again, you can kind of see the program structure there, although it’s clear there are some differences from the Elixir example. Now let’s take a closer look.

Elixir: a Lisp?

Elixir’s metaprogramming features provide easy access to the AST and some pretty powerful tools for manipulating it, often without even having to parse the AST itself. For example, to see the AST for a piece of code, you can use the quote Kernel function:

iex(1)> quote do: foo(:an_atom, a_variable, [1, 2])
{:foo, [], [:an_atom, {:a_variable, [], Elixir}, [1, 2]]}
iex(2)> quote do
...(2)>   def count([]), do: 0
...(2)>   def count([_head | tail]), do: 1 + count(tail)
...(2)> end
{:__block__, [],
 [{:def, [context: Elixir, import: Kernel],
   [{:count, [context: Elixir], [[]]}, [do: 0]]},
  {:def, [context: Elixir, import: Kernel],
   [{:count, [context: Elixir],
     [[{:|, [], [{:_head, [], Elixir}, {:tail, [], Elixir}]}]]},
    [do: {:+, [context: Elixir, import: Kernel],
      [1, {:count, [], [{:tail, [], Elixir}]}]}]]}]}

Let’s take a closer look at some characteristics of Elixir’s AST. Simple values—generally, atoms, numbers, lists, and some tuples (specifically, two-element tuples)—are represented literally as their value. For example:

iex(1)> quote do: :an_atom
:an_atom
iex(2)> quote do: [1, 2, {:a, :b}]
[1, 2, {:a, :b}]

All other program structures are considered function calls, and are represented as a three-element tuple comprising the function name, a rather mysterious list (which we’ll get to later), and what looks like a list of the function arguments:

iex(1)> quote do: foo(:an_atom, [1, 2])
{:foo, [], [:an_atom, [1, 2]]}

What is probably not immediately obvious is that some structures that you wouldn’t expect to be considered a “function call” are nonetheless represented as functions. Literal tuples with any length other than 2 are represented by something that you could think of as a “function that creates a tuple”.

iex(1)> quote do: {:one}
{:{}, [], [:one]}
iex(2)> quote do: {:one, :two, :three}
{:{}, [], [:one, :two, :three]}
iex(2)> quote do: {}
{:{}, [], []}

Operators similarly appear as functions. Here is a simple expression with the “+” operator:

iex(1)> quote do: 1 + 2
{:+, [context: Elixir, import: Kernel], [1, 2]}

This is where we start to see something interesting in that second tuple element. The “+” operator is actually a built-in function imported from the Kernel module. The Elixir AST supplies that meta-information in a keyword list as the second element of the tuple. Similarly, when you parse an Elixir file, the line number and other parse information also appears in the keyword list.

A qualified function name has, as its name, an expression with a dot operator, e.g. “Enum.count”.

iex(1)> quote do: Enum.count([1, 2])
{{:., [], [{:__aliases__, [alias: false], [:Enum]}, :count]}, [], [[1, 2]]}

Notice how the name of the function being called is itself represented as a function call—in this case, the dot operator applied to the Enum alias and the “count” name. Similarly, the Enum alias itself is also represented as a function call.

Interestingly variable references are also represented as pseudo function calls:

iex(1)> quote do: aVariable
{:aVariable, [], Elixir}

The Elixir alias as the “parameter” specifies the context within which the variable is assigned. (Although I’m still unsure what that actually means.)

So Elixir AST is basically a tree with two kinds of nodes: literals, and function calls in which the function and its arguments are, recursively, also nodes. This is in fact reminiscent of the structure of another language we might have heard of before: Lisp. No, Elixir is not (really) a Lisp, but I think it’s telling that the AST structure resembles Lisp code. It is designed for simplicity and orthogonality with a minimum of primitives.

Let’s see how that compares with Erlang:

Cracking open Erlang

Obtaining the AST in Erlang is a bit harder than in Elixir; I’m not aware of any way to generate Erlang AST that is quite as simple as Elixir’s quote function. Generally, you need to use the erl_scan module and erl_parse module in the Erlang standard library to parse AST from source code. If you’re interested in playing with Erlang AST, I’ll leave implementing this as an exercise. One gotcha to bear in mind is that erl_parse generally works on one “form” at a time. Erlang forms are generally delimited by periods (a “dot” token), so you probably need to scan source into tokens, split the list of tokens into a list of sublists delineated by dots, and then run erl_parse on each sublist. This logic, for example, can be found in the Erl2ex source.

Erlang’s libraries aside, let’s take a closer look at some examples of Erlang’s AST structure. (Again, I’m using Elixir syntax to display the data structure, to make it easier to compare with the Elixir examples above.)

# The integer 42
{:integer, 1, 42}
# The list [4, 2]
{:cons, 1, {:integer, 1, 4}, {:cons, 1, {:integer, 1, 2}, {nil, 1}}}
# The expression 3 + 4
{:op, 1, :+, {:integer, 1, 3}, {:integer, 1, 4}}
# A call to the function math:pi()
{:call, 1, {:remote, 1, {:atom, 1, :math}, {:atom, 1, :pi}}, []}
# The function definition: tri(X) -> X * (X + 1) / 2.
{:function, 1, :tri, 1, [
  {:clause, 1,
    [{:var, 1, :X}], [],
    [
      {:op, 1, :/,
        {:op, 1, :*,
          {:var, 1, :X},
          {:op, 1, :+,
            {:var, 1, :X},
            {:integer, 1, 1}}},
        {:integer, 1, 2}}
    ]
  },
]}

Every Erlang construct—that is, every node in an Erlang AST—is represented by a tuple which may have between two and six elements. The first element is a node type, an atom describing the kind of construct. The second is the line number from which the construct was parsed. The remaining elements are “arguments” to the construct and are dependent on the type. For example, an operator construct has three arguments, the operator identity, the lhs, and the rhs. A function definition also has three arguments: the name of the function, its arity, and a list of clauses. Each clause, in turn, is a construct represented by a five-element tuple: the “clause” node type, the line number, a list of arguments, a list of guards, and a list of expressions making up the function body.

Unlike in Elixir, there are no literals in Erlang AST. Even the simplest values, such as individual numbers or atoms, are represented by a node. We can also see that pattern matching against Erlang AST becomes a bit more challenging because its tuples may have different cardinalities depending on the type. Elixir AST tuples always have three elements; anything other than a three-element tuple is a literal.

Going a bit deeper, we can make an observation that Erlang AST represents a somewhat higher level of abstraction than Elixir AST. Erlang AST is meaningful; it represents, to some degree, the semantics of an expression. For example, the Erlang representation of a call to the remote function “math:pi” explicitly knows it is a remote function call. Elixir, on the other hand, represents only the syntax; it represents the “dot” operator without saying anything about its meaning, e.g. that it denotes a remote function rather than, say, a field of a struct.

# A call to the function math:pi() in Erlang AST
{:call, 1, {:remote, 1, {:atom, 1, :math}, {:atom, 1, :pi}}, []}
# The same call in Elixir AST
{{:., [], [:math, :pi]}, [], []}

As a result, it is possible to generate valid Elixir AST that does not correspond to valid Elixir code; it is much harder to do so in Erlang AST.

This has an important implication: the Elixir language is extensible at the AST level. In Erlang, if you want to add a new language construct or modify an existing one, you need to create a new AST type and/or extend the arguments list, to reflect the meaning of the new construct. Thus you would need to modify Erlang’s parser and compiler. Elixir AST format, on the other hand, does not need to change when modifications are made to the language. This property is a crucial part of Elixir’s extensibility story.

An alternate Erlang AST

Well into my work on Erl2ex, I discovered that Erlang has a second AST. To be precise, the two are somewhat related—this “second” AST is actually a superset of the Erlang parse tree described above—but they have very different goals and designs, enough so that I think it’s worthwhile to think of them as distinct representations.

This second AST is defined by the module erl_syntax. Where the normal Erlang AST generated by erl_parse focuses on expression semantics, the erl_syntax tree more closely represents syntax—similar to Elixir AST. As a result, it is a much more lenient format than erl_parse. As with Elixir AST, it is possible to create erl_syntax that does not correspond to valid Erlang code.

An important characteristic of erl_syntax is that the data structure format itself is opaque, and in fact is subject to change. Instead, the erl_syntax module provides functions for constructing and querying the tree.

The erl_syntax form is useful for applications (such as syntax highlighting and other code analysis features) where it is important for parsing to be somewhat forgiving. In Erl2ex, it also becomes important when dealing with the preprocessor. As we shall explore in a later article, the Erlang preprocessor is an enormous headache for the transpiler because the preprocessor syntax itself is not valid Erlang code, and thus fails to parse using erl_parse. I have to perform an analysis of the erl_syntax form, along with a number of hacks, to interpret preprocessor directives. Erlang’s parser that generates erl_syntax form is, interestingly enough, called epp_dodger, a parser that “dodges” the Erlang Preprocessor.

Where to go from here

Elixir makes it very easy to explore its AST via the quote function. I also recommend Chris McCord’s book Metaprogramming Elixir which discusses Elixir AST in the context of its main purpose, hooking into the compiler and generating code.

I’m not aware of really good resources for becoming familiar with Erlang AST. (Leave me a comment if you know of some.) You can peruse the documentation of erl_scan and erl_parse for information on the normal AST, and the syntax tools for information on the alternate AST. The parsing code from erl2ex provides some more examples of how they can be used.

Next time, we’ll dive into the naming conventions of the two languages, and how to get them to interoperate. Feel free to browse the index of articles in this series, and stay tuned for more on Erlang and Elixir’s family ties.