Family Ties part 8: Rise of the Preprocessor

2016-06-20 familyties

This is the eighth of a series of articles on what I’ve learned about Erlang (and Elixir) from writing Erl2ex, an Erlang-to-Elixir transpiler. This week we study the Erlang preprocessor, probably the most significant Erlang feature that is not supported by Elixir.

This is a fairly big topic, so I’ve split it into two articles. This week we’ll focus on the Erlang preprocessor itself, from the point of view of an Elixir developer. We’ll look at what it is, how it works, and how it relates to the rest of the language.

The preprocessor is just parsed that way

Module attributes in Erlang and Elixir

As I was learning Elixir, one concept I found perpetually confusing was the @ sigil, signifying a module attribute. Elixir books and documentation presented these as filling several different, seemingly unrelated, roles. On one hand, they can be used as constants, private to a module. On the other hand, they are used to annotate documentation and certain other module meta-information. To make matters more confusing, Erlang also has module attributes that behave more like module-based storage. And both languages seem to have an abundance of special cases, specific attributes that have special meanings to the compiler.

Coming from Ruby, I was tempted to think of module attributes as similar to Ruby constants, which are essentially properties of a module or class. But that picture doesn’t match well with Elixir module attributes, and it was quite a while before I came to what seemed to be a satisfactory understanding of just what these things are: They are best understood as compiler variables, temporary storage that is accessible during module compilation but not at runtime.

But wait! you might say. Not accessible at runtime? What about when I reference an attribute in my code?

# This is legal Elixir
defmodule Mod1 do
  @attr1 :hello
  def foo, do: @attr1
  @attr1 :world
  def bar, do: @attr1
end

Although it looks like the “foo” and “bar” functions above are looking up the attribute, that’s not exactly what’s going on. When an attribute appears in your code, the compiler looks up the attribute and replaces it with its value. That is, the compiler effectively transforms the above code into the following:

defmodule Mod1 do
  # The @attr1 goes away, and is replaced by the value.
  def foo, do: :hello
  def bar, do: :world
end

The hard-coded values themselves are compiled into the BEAM file, which is how attributes can function as “constants”.
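As a minimal sketch of the same idea, the following shows that the right-hand side of an attribute is an ordinary expression evaluated during compilation, and that the attribute can be read back while the module is still being compiled. The module name CompileTimeDemo and the attribute names @table and @snapshot are made up for illustration.

```elixir
defmodule CompileTimeDemo do
  # The right-hand side is evaluated once, while the module is compiled.
  @table Enum.map(1..3, fn n -> n * n end)

  # The compiler pastes the computed list into the function body.
  def squares, do: @table

  # While the module is still being compiled, the "compiler variable"
  # can be read back with Module.get_attribute/2.
  @snapshot Module.get_attribute(__MODULE__, :table)
  def snapshot, do: @snapshot
end

IO.inspect(CompileTimeDemo.squares())
IO.inspect(CompileTimeDemo.snapshot())
```

Both functions return the same precomputed list; nothing is looked up when they are called.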

But because attributes live at compile time, their real power comes from being able to “configure” the behavior of the compiler. For example, you may know already that adding a @doc attribute before a function provides documentation for that function. How does that work? When compiling a function, the Elixir compiler effectively looks up the current value of the @doc attribute. If one is found, it is applied as the function’s documentation.

defmodule Mod2 do
  # Set the "@doc" attribute
  @doc "This function returns the value 0"
  # The compiler uses the current "@doc" value as the documentation
  # when defining the function.
  def zero, do: 0
end

Because of these semantics, I find the term “module attribute” to be a bit misleading. They aren’t really attributes of the module. I prefer to think of them as “compiler directives” or “compiler variables”.

That said, Elixir does allow you to turn these “compiler variables” into something more like a “module attribute” by “persisting” them. This is accomplished by calling Module.register_attribute/3 at compile time with the persist: true option. A persisted attribute is part of the compiled module definition, and can be retrieved at runtime by calling the function “module_info”, which is automatically added to every module.

# Without registering the @attr1 attribute
defmodule Unregistered do
  @attr1 :hello
end

# Registering the @attr1 attribute
defmodule Registered do
  Module.register_attribute __MODULE__, :attr1, persist: true
  @attr1 :hello
end

The results:

iex(1)> Unregistered.module_info(:attributes)
[vsn: [217778150367665195312818088209879323369]]
iex(2)> Registered.module_info(:attributes)
[vsn: [224579266478158539223178516435841507614], attr1: [:hello]]

This distinction between the concepts of “compiler directive” and “module attribute” is also important for understanding Erlang’s module attributes. For the most part Elixir’s attributes are not persisted by default and so behave as compiler directives; whereas, for the most part, Erlang’s attributes are persisted by default, and so actually behave like properties of the module.

-module(registered).
% Most attributes are automatically "persisted" in Erlang.
-attr1(hello).

% So you can access them at runtime
1> registered:module_info(attributes).
[{vsn,[224579266478158539223178516435841507614]}, {attr1,[hello]}]

However, in Erlang, as in Elixir, certain attributes are recognized and handled specially, treated as directives rather than “persisted” as module attributes. Some of these “special” attributes are handled by the compiler: for example, the “module” attribute specifying the module name, or the “record” attribute for defining a record data structure. Others are intercepted before the compiler even sees them, by a tool that doesn’t exist at all in Elixir. That tool is the preprocessor.

What is a preprocessor?

If you’ve ever programmed in C, you probably know that the typical way to create a named constant is with the #define directive.

#define RETURN_VALUE 123
int foo() {
  return RETURN_VALUE;
}

You might also know that evaluating these defines is done not by the C compiler, but by a separate tool, “cpp”, the C Preprocessor. This tool tokenizes the file, parses a few directives that it understands (such as #define), and then does a simple textual replacement. In the example above, it would read the file, replace the token “RETURN_VALUE” with the token “123”, and output the following:

int foo() {
  return 123;
}

Only after the preprocessor has done its transformations does the compiler itself start working.

The C Preprocessor also supports “parameterized” defines, which syntactically look like function calls, but aren’t real function calls. The preprocessor replaces them inline.

// C code, input to the preprocessor
#define IS_ZERO(x) (x == 0)
int foo(int value) {
  return IS_ZERO(value);
}

// Output from the preprocessor
int foo(int value) {
  return (value == 0);
}

Finally, the C preprocessor offers a few additional features such as conditionals (i.e. selective removal of parts of the source code by the preprocessor) and file inclusion (i.e. copying another file, usually a “header”, into your source file).

Erlang has a preprocessor that works similarly. It tokenizes the file, and then “hijacks” certain module attributes, such as “define”, “ifdef”, and “include”. The feature set is very similar to that offered by the C preprocessor. You can perform simple token replacement:

% Erlang code, input to the preprocessor
-define(RETURN_VALUE, 123).
foo() -> ?RETURN_VALUE.

% "Output" from the preprocessor, passed on to the compiler
foo() -> 123.

You can also perform “parameterized” replacement:

% Erlang code, input to the preprocessor
-define(IS_ZERO(X), X == 0).
foo(Value) -> ?IS_ZERO(Value).

% Output from the preprocessor
foo(Value) -> Value == 0.

These replacements are collectively called “macros” in Erlang. (Not to be confused with Elixir macros, which work quite differently as we shall see in next week’s article.) And finally, Erlang, like C, provides conditional directives and file inclusion. Here are a few examples:

% Conditional compilation in Erlang
-define(A_MACRO, 1).
-ifdef(A_MACRO).
% This is compiled because A_MACRO is defined at this point
foo() -> ?A_MACRO.
-else.
% The compiler would flag an error on this function, but the
% preprocessor removes it before the compiler sees it, so in the end
% no error is raised.
foo(X) -> UnknownVariable.
-endif.

% File inclusion. Erlang "header" files typically define records and
% types, and have names ending with ".hrl".
-include("my_header.hrl").

So far this seems pretty straightforward. Then why is this feature missing from Elixir?

The preprocessor and the parser

To approach this question, we have to highlight an important property: The preprocessor is wholly separate from the compiler. The C preprocessor runs as a separate binary, transforming source code before the compiler even sees it. The Erlang preprocessor behaves similarly: it has its own module and runs as a separate (BEAM) process, transforming source code before the Erlang compiler sees it. This has two important implications.

1. The preprocessor hijacks certain syntax as its own.
2. The preprocessor cannot interact with the compiler.

Let’s unpack these two items so we can understand their repercussions.

As we saw earlier, preprocessor directives “look like” module attributes in Erlang. They share the same syntactic shape: a hyphen sigil, followed by a name atom and parameters. But they are not module attributes. If you include the following line in your Erlang source:

-define(A_MACRO, 1).

It does not create a module attribute named “define”. Indeed, the compiler will never even see this line because it is consumed by the preprocessor. Furthermore, even if the compiler could see this line, it would raise a syntax error because Erlang module attributes generally do not support multiple parameters.

To take it even further, preprocessor definitions do not even have to conform to correct Erlang syntax. This is because the “replacement” is simply a block of text (or, more precisely, a series of tokens) that could make up a “fragment” of syntax that doesn’t stand on its own. Take this example:

-define(and_op(T1, T2), T1 == $&, T2 == $&).

If you’re not familiar with Erlang macros, what would you think this is doing? Normally, the “define” directive has two arguments: a macro and its definition. But here it looks like there are three arguments. What could that mean? Does the macro “and_op” have two definitions?

Actually, this “define” directive does have only two arguments. The entire expression T1 == $&, T2 == $&, including the comma, is the definition of the macro. Remember from part 6 that Erlang guard expressions may be combined using commas, which are a logical conjunction (i.e. effectively the same as the “andalso” operator). This macro generates such a guard. For example, it could be used in a guard clause like this:

tokenize([T1, T2 | Rest]) when ?and_op(T1, T2) -> handle_and_op().

Which the preprocessor would transform into:

tokenize([T1, T2 | Rest]) when T1 == $&, T2 == $& -> handle_and_op().

Notice how, in order to achieve this effect, the macro “abuses” Erlang syntax, turning what might “normally” be parsed as two separate arguments, into a single argument with an embedded comma. The preprocessor handles this as a special case, so it doesn’t confuse the compiler. But it may confuse us, unless we understand that it is not (yet) valid Erlang code.

Significantly, you cannot (in general) feed Erlang source code directly into the Erlang compiler without running it through the preprocessor first. Erlang’s parser will likely reject some of the code as invalid. If you want to parse real Erlang code that includes preprocessor directives, you cannot use the normal parser erl_parse. You’ll need to find an alternative, such as epp_dodger, as we discussed in part 2.

(Incidentally, that was not a contrived example. It’s an actual macro from the Elixir tokenizer, one of many similar macros in that file.)
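To watch the parser reject source that has not been preprocessed, here is a small sketch that hands Erlang code straight to the scanner and parser. It is written in Elixir and calls Erlang’s standard erl_scan and erl_parse modules. The module name ParseSketch and its helper parse/1 are hypothetical, and the exact error term may differ between OTP releases, so the code only distinguishes success from failure.

```elixir
defmodule ParseSketch do
  # Tokenize one Erlang form and hand the tokens directly to erl_parse,
  # skipping the preprocessor entirely.
  def parse(source) when is_binary(source) do
    {:ok, tokens, _end_location} =
      source |> String.to_charlist() |> :erl_scan.string()

    case :erl_parse.parse_form(tokens) do
      {:ok, _form} -> :ok
      {:error, _reason} -> :error
    end
  end
end

# Code that needs no preprocessing parses fine.
IO.inspect(ParseSketch.parse("foo() -> 123.\n"))

# Here the macro call is still in the token stream, and the parser
# has no grammar rule for the "?" token.
IO.inspect(ParseSketch.parse("foo() -> ?RETURN_VALUE.\n"))
```

The first call succeeds and the second fails, which is why tools that need to read real-world Erlang source reach for something other than erl_parse alone.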

Second, we claimed that the preprocessor cannot interact with the compiler. What do we mean by that?

A preprocessor supports a primitive sort of metaprogramming by generating syntax for the compiler to consume. However, because it does not run in the compiler, the interaction is one-way. The compiler cannot pass information back into the preprocessor to inform its code generation. And significantly, a preprocessor cannot run custom code; it is limited to its own built-in expressions.

But those are all features of Elixir macros. Like a preprocessor, Elixir macros can generate code. But unlike a preprocessor, Elixir can feed compiler output back into the code generation, and it can compile custom code, written in Elixir, to run in the compiler to perform that code generation. It supports the capabilities of a preprocessor, but greatly enhances them by integrating closely with the compiler itself.
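As a rough illustration of that overlap, here is a sketch of how the IS_ZERO macro from earlier might look as an Elixir macro. The module names MacroSketch and MacroSketchUser are invented for this example, and this is not necessarily the translation that Erl2ex produces.

```elixir
defmodule MacroSketch do
  # Hypothetical counterpart of -define(IS_ZERO(X), X == 0).
  # The macro receives the argument as syntax and returns new syntax.
  defmacro is_zero(x) do
    quote do
      unquote(x) == 0
    end
  end
end

defmodule MacroSketchUser do
  # Macros must be required before use, so the compiler can expand them.
  require MacroSketch

  def foo(value), do: MacroSketch.is_zero(value)

  # The expansion is a plain comparison, so it is also allowed in a guard.
  def describe(value) when MacroSketch.is_zero(value), do: :zero
  def describe(_value), do: :nonzero
end

IO.inspect(MacroSketchUser.foo(0))
IO.inspect(MacroSketchUser.describe(5))
```

Unlike the preprocessor version, the body of is_zero/1 is ordinary Elixir code that runs inside the compiler, so it could inspect its argument or call other functions before deciding what code to generate.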

So it sounds like, through its metaprogramming features, Elixir superseded and thus eliminated the need for a preprocessor. Is it really that simple?

Well, not quite.

Elixir macros can “simulate” a preprocessor for many cases, but there remain some edge cases that are quite difficult. Next week we’ll dive more deeply into Erlang’s preprocessor and Elixir’s macros, and explore what makes each unique. We’ll also see some of the techniques used by Erl2ex to emulate Erlang preprocessor definitions in Elixir code.

Where to go from here

Erlang’s reference manual documents the preprocessor pretty well, and you can also learn more about how Erlang treats module attributes.

Next time, we’ll study how things traditionally done with preprocessors can be reproduced using Elixir features such as macros. Until then, feel free to browse the index of articles in this series, and stay tuned for more on Erlang and Elixir’s family ties.
