Geo-Rails Part 3: Spatial Data Types with RGeo

2011-12-05 georails

RGeo is a library and framework for handling spatial data in a Ruby application. It's currently designed more for completeness than ease of use, so there's a bit of an initial learning curve. This article is an attempt to smooth that learning curve a bit. It contains a tutorial introduction to RGeo, covering the basics that every RGeo user needs to know, and a bit of discussion of where the library came from. Included are:

An introduction to the industry standard spatial data types
Working with spatial data objects in RGeo
Factories: why RGeo uses them and what they're for
A comparison with GeoRuby
A guide to the RDocs

RGeo includes a number of advanced features which I'll cover in future articles. But for now, I think these are the important topics that will get you started.

This is part 3 of my series of articles on geospatial programming in Ruby and Rails. For a list of the other installments, please visit http://daniel-azuma.com/articles/georails.

Standard Spatial Data Types

Most serious geospatial systems operate on a standard set of spatial data types specified by a standard known as the Simple Feature Access Specification, which is maintained by the Open Geospatial Consortium. This spec (which I'll abbreviate SFS) defines a suite of seven concrete data types capable of representing points and piecewise linear objects in two-dimensional space, along with a set of standard operations that can be performed on them.

The SFS has gone through several iterations. Most current production systems are based on version 1.1 of the SFS, although newer versions have added a few more data subtypes. Since 1.1 is the most commonly supported revision, it is what RGeo implements and what I will cover here.

The seven data types defined by the SFS include three geometric types, and four collection types. They are as follows.

Point. This is a simple point in two-dimensional space, identified by an x and a y coordinate. Often, Points are used to represent locations on the surface of the earth, and sometimes (but not always) the x and y coordinates are interpreted as longitude and latitude, respectively. In other cases, a Point could simply represent a point on the X-Y plane.

LineString. This is a set of one or more straight line segments connected end to end. A common use for a LineString might be a set of driving directions. LineStrings may be self-intersecting, and some special LineStrings may be closed loops where the start point is the same as the end point. Below are a few examples of LineStrings. (I lifted this diagram straight out of the SFS document.)

Polygon. This is a contiguous area in the plane, with piecewise linear borders. Polygons can also have holes. A common use for a Polygon might be a city or country boundary. Below are a few examples of Polygons (again lifted out of the SFS document).

For each of the above three types, there is a corresponding collection type that can represent zero or more of that type of object. So MultiPoint may include zero or more Points, MultiLineString may include zero or more separate LineStrings, and MultiPolygon may include zero or more nonoverlapping Polygons.

Finally, there is a generic GeometryCollection type that may contain zero or more of any type of object, without any restrictions.

The SFS arranges these spatial types in a class hierarchy. A number of operations (such as intersection and distance) are defined across all types, but a few (such as area) are specific to certain types. The operations are defined in such a way as to make them more or less language-agnostic. RGeo, at its heart, can be thought of as a Ruby implementation of these SFS types.

Working with Spatial Data in RGeo

Now we'll go through some basic examples of handling spatial data in RGeo. This assumes you have RGeo installed along with Geos and Proj4. Please refer to part 1 (as well as the RGeo README) for instructions on installing RGeo if you are having difficulty.

In these examples, we'll work with simple planar data. RGeo refers to planar data as "Cartesian", and provides a factory object for creating planar objects.

factory = RGeo::Cartesian.factory

Factories are discussed in more detail below; for now, you simply create spatial data objects using the factory. Let's create some Points:

point1 = factory.point(1, 0)
point2 = factory.point(1, 4)
point3 = factory.point(-2, 0)
point4 = factory.point(-2, 4)

Four points plotted on the X-Y plane

You can extract the coordinates of a point.

point1.x # => 1.0
point1.y # => 0.0

You can also perform a rich set of spatial operations. Distance is a pretty common one:

point2.distance(point3) # => 5.0

Create a LineString object by providing a series of points marking the endpoints of its segments. This first example has two segments specified using three points:

line_string1 = factory.line_string([point1, point2, point3])

You can extract the individual points that make up the LineString.

line_string1.num_points # => 3
line_string1.point_n(0) == point1 # => true
line_string1.end_point == point3 # => true

Here we create a new LineString and determine whether the two LineStrings intersect:

point5 = factory.point(0, 1)
line_string2 = factory.line_string([point4, point5])
line_string1.intersects?(line_string2) # => true

LineString 2, in green, intersects LineString 1, in blue.
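
If you're curious what an intersection test involves, here is a minimal pure-Ruby sketch using the classic orientation (cross product) test on each pair of segments. This is not how RGeo or Geos implements it; the helper names (orientation, within_bounds?, segments_intersect?, paths_intersect?) are my own for illustration, and points are plain [x, y] arrays rather than RGeo objects.

```ruby
# Cross product of (b - a) and (c - a): positive when a -> b -> c turns
# counter-clockwise, negative when clockwise, zero when collinear.
def orientation(a, b, c)
  (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
end

# True if p lies within the bounding box of segment a-b.
# Only meaningful when p is already known to be collinear with a and b.
def within_bounds?(a, b, p)
  p[0].between?(*[a[0], b[0]].minmax) && p[1].between?(*[a[1], b[1]].minmax)
end

def segments_intersect?(p1, p2, q1, q2)
  o1 = orientation(p1, p2, q1)
  o2 = orientation(p1, p2, q2)
  o3 = orientation(q1, q2, p1)
  o4 = orientation(q1, q2, p2)

  # General case: the endpoints of each segment lie strictly on opposite
  # sides of the other segment.
  return true if (o1 <=> 0) * (o2 <=> 0) < 0 && (o3 <=> 0) * (o4 <=> 0) < 0

  # Touching or collinear cases: an endpoint lies on the other segment.
  (o1.zero? && within_bounds?(p1, p2, q1)) ||
    (o2.zero? && within_bounds?(p1, p2, q2)) ||
    (o3.zero? && within_bounds?(q1, q2, p1)) ||
    (o4.zero? && within_bounds?(q1, q2, p2))
end

# True if any segment of one path intersects any segment of the other.
def paths_intersect?(path_a, path_b)
  path_a.each_cons(2).any? do |a1, a2|
    path_b.each_cons(2).any? { |b1, b2| segments_intersect?(a1, a2, b1, b2) }
  end
end

line1 = [[1, 0], [1, 4], [-2, 0]] # same coordinates as line_string1
line2 = [[-2, 4], [0, 1]]         # same coordinates as line_string2
puts paths_intersect?(line1, line2) # prints true
```

In practice you would let the library do this for you; the sketch is only meant to show that the answer for these two LineStrings can be checked by hand.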

To create a Polygon object, provide the boundary as a LineString.

large_triangle = factory.polygon(line_string1)

To create a polygon with holes, provide the boundaries of the holes in the optional second argument.

point6 = factory.point(0, 2)
point7 = factory.point(-1, 1)
line_string3 = factory.line_string([point5, point6, point7])
triangle_with_hole = factory.polygon(line_string1, [line_string3])

The polygon triangle_with_hole

You can also create that triangle with a hole using a spatial operation, by subtracting the small triangle from the larger one.

small_triangle = factory.polygon(line_string3)
triangle_with_hole = large_triangle - small_triangle
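
As a sanity check on the geometry, here is a small pure-Ruby sketch that computes the area of the triangle with a hole using the shoelace formula: the area of the outer ring minus the area of the hole. The ring_area helper is hypothetical and works on plain [x, y] arrays; it assumes simple, non-self-intersecting rings and is not RGeo's implementation.

```ruby
# Area of a simple (non-self-intersecting) ring via the shoelace formula.
# The ring is a list of [x, y] vertices; the closing edge back to the first
# vertex is implied.
def ring_area(points)
  twice_signed_area = points.each_index.sum do |i|
    x1, y1 = points[i]
    x2, y2 = points[(i + 1) % points.length]
    x1 * y2 - x2 * y1
  end
  twice_signed_area.abs / 2.0
end

outer = [[1, 0], [1, 4], [-2, 0]] # same coordinates as line_string1
hole  = [[0, 1], [0, 2], [-1, 1]] # same coordinates as line_string3

puts ring_area(outer) - ring_area(hole) # prints 5.5
```

The large triangle has base 3 and height 4, so its area is 6; the small triangle has area 0.5, leaving 5.5 for the polygon with the hole.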

To create a collection, provide the elements as an enumeration. MultiPoint, MultiLineString, and MultiPolygon restrict the types of their elements; GeometryCollection has no restriction.

four_points = factory.multi_point([point1, point2, point3, point4])
general_collection = factory.collection([line_string1, point5])

In addition to the basic spatial operations, collections implement Enumerable:

four_points.each{ |p| ... }

There's a lot of depth in the SFS spatial classes and the operations and analysis you can perform on them. I'll cover more advanced topics in later articles. But first, we should address a burning question.

RGeo Factories

In the example above, we created geometric objects using a factory. Now, for some of us with a Java background, this might conjure up some less-than-pleasant memories. Factories? How un-Ruby-like!

I must admit, I struggled with this while designing RGeo. But in the end, in RGeo's case, I decided they were appropriate. (Or at least a necessary evil.)

In the above examples, we were working with points on the Cartesian X-Y plane. The geometric objects we worked with follow the rules of Euclidean geometry that you're probably familiar with from high school mathematics classes. The distance between two points, for example, can be determined using the Pythagorean Theorem.
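
For instance, the distance of 5.0 between point2 and point3 can be checked by hand. Here is a minimal sketch in plain Ruby, using [x, y] arrays instead of RGeo objects; the planar_distance helper is hypothetical.

```ruby
# Euclidean distance between two [x, y] pairs via the Pythagorean theorem.
def planar_distance(a, b)
  Math.hypot(b[0] - a[0], b[1] - a[1])
end

# Coordinates of point2 (1, 4) and point3 (-2, 0): a 3-4-5 right triangle.
puts planar_distance([1, 4], [-2, 0]) # prints 5.0
```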

However, we're not always going to be handling Cartesian objects, especially when we're working with location data. Location is generally measured across the surface of the earth, and the surface of the earth is not flat. This means our familiar theorems and formulas for Euclidean geometry may not work, especially for objects covering large areas.
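
To get a feel for how much this matters, here is a rough sketch in plain Ruby that measures one degree of longitude at two different latitudes using the haversine formula. It assumes a perfectly spherical earth with a mean radius of 6371 km, which is a simplification, and the great_circle_km helper is hypothetical; it is not what RGeo does internally.

```ruby
EARTH_RADIUS_KM = 6371.0 # assumed mean radius; the earth is not a perfect sphere

def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Great-circle distance in kilometers between two [longitude, latitude]
# pairs given in degrees, using the haversine formula.
def great_circle_km(a, b)
  lon1, lat1 = a.map { |d| to_radians(d) }
  lon2, lat2 = b.map { |d| to_radians(d) }
  h = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h))
end

at_equator  = great_circle_km([0, 0], [1, 0])   # one degree of longitude at the equator
at_60_north = great_circle_km([0, 60], [1, 60]) # one degree of longitude at 60 degrees north

puts at_equator.round(1)
puts at_60_north.round(1)
```

Under these assumptions, a degree of longitude at 60 degrees north covers only about half the ground that it does at the equator, whereas treating the coordinates as planar x and y would report the same distance of 1 in both cases.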

So when RGeo measures a distance, computes an intersection, or performs almost any kind of spatial operation, it needs to know the context: whether you're working with points on an X-Y plane or with latitude and longitude. And even in the latter case, it actually needs to know which latitude-longitude system, since there are in fact a number of different ways to define latitude and longitude.

A factory provides this context. It knows whether the coordinate system is an X-Y Cartesian coordinate system, or whether it is latitude and longitude, or something else. It is basically a set of preferences directing how RGeo handles data and performs computations. All the spatial objects created by a factory inherit its preferences.

Or here's another way to put it. A point may have coordinates (2, 3). The factory tells you what the "2" and the "3" actually mean and how they relate to the real world. Are they degrees, feet, or light years? Which direction are they? And what assumptions about the nature of reality do they imply?

Another aspect controlled by RGeo's factories is the implementation. When RGeo works with Cartesian coordinates, its factory calls into the Geos library to handle most of the computational geometry. However, sometimes Geos may not be available on your system. In this case, you can use a different factory that also computes Cartesian geometry but uses a pure Ruby implementation. This alternate factory is not as fast as Geos and is currently missing a number of capabilities, but it is available in case you cannot install Geos.

You can obtain the factory object providing the context for any geographic object by calling its "factory" method.

triangle_with_hole.factory # => factory

Generally, when you cause two objects to interact by comparing them or performing some binary operation on them, they must have the same factory and live in the same context. It makes sense to find the distance between two points -- say, (2, 3) and (4, 5) -- on the Cartesian X-Y plane, but it doesn't make sense to find the "distance" between the point (2, 3) on the X-Y plane, and the point at latitude 47.606, longitude -122.332.
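
To make the idea concrete, here is a toy sketch in plain Ruby of a factory that carries context and of objects that refuse to interact across contexts. ToyFactory and ToyPoint are hypothetical classes invented for this illustration; RGeo's real factories carry far more information, and I am not claiming RGeo reports a mismatch in exactly this way.

```ruby
# A toy factory whose only "context" is the unit of its coordinates.
class ToyFactory
  attr_reader :units

  def initialize(units)
    @units = units
  end

  # Every point remembers the factory that created it.
  def point(x, y)
    ToyPoint.new(self, x, y)
  end
end

class ToyPoint
  attr_reader :factory, :x, :y

  def initialize(factory, x, y)
    @factory = factory
    @x = x.to_f
    @y = y.to_f
  end

  # Distance only makes sense when both points share the same context.
  def distance(other)
    unless other.factory.equal?(factory)
      raise ArgumentError, "cannot mix #{factory.units} with #{other.factory.units}"
    end
    Math.hypot(other.x - x, other.y - y)
  end
end

meters = ToyFactory.new("meters")
feet   = ToyFactory.new("feet")

a = meters.point(2, 3)
b = meters.point(4, 5)
puts a.distance(b) # same factory, so the computation goes ahead

begin
  a.distance(feet.point(4, 5))
rescue ArgumentError => e
  puts "refused: #{e.message}"
end
```

The coordinates (4, 5) are identical in both calls; only the factory, and therefore the meaning of the numbers, differs.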

I will say more about coordinate systems and the different factories available in RGeo in later articles. For now, two factories you will probably use often are the Cartesian factory we saw above and the "spherical" geographic factory. The latter handles latitudes and longitudes, and supports basic spatial operations but is currently missing some of the more complex operations.

geographic_factory = RGeo::Geographic.spherical_factory

RGeo Factories and Rails

In part 2 of this series, we saw that activerecord-postgis-adapter exposes spatial column values as RGeo objects. Each of these objects, of course, has a factory that provides its coordinate system and context. Now we can look a little more closely at this process.

In the tutorial for part 2, we added this line to the Location model:

class Location < ActiveRecord::Base
  set_rgeo_factory_for_column(:latlon,
    RGeo::Geographic.spherical_factory(:srid => 4326))
end

What this does is provide a specific factory for the latlon attribute of the Location model. In this case, we use the spherical geographic factory discussed above. When you get a Location from the database and read the latlon attribute, it returns a Point created by that factory. The "srid" argument controls the Spatial Reference ID, which must be set to 4326 for the PostGIS geographic type. We will cover SRIDs in a later article; for now, just think of it as a required parameter.

You can ask the model for the factory as follows:

latlon_factory = Location.rgeo_factory_for_column(:latlon)

Now you can use this factory to create values. In particular, if you do not want to use WKT to set a latlon value, you can set it directly from a point object created from this factory.

loc3 = Location.create(:name => 'Columbia Tower')
loc3.latlon = latlon_factory.point(-122.330779, 47.604828)
loc3.save

RGeo vs. GeoRuby

One question I am asked quite a bit is how RGeo compares with GeoRuby. GeoRuby is an older Ruby library that provides classes for the SFS geometry objects. It is considerably smaller than RGeo, and somewhat easier to get started with. Indeed, I also started off using GeoRuby once upon a time, but I quickly decided that a fundamental redesign was necessary in order to support the functionality I needed. Among the reasons:

GeoRuby provides a small subset of the spatial operations defined by the SFS. For example, it computes distance between points but not distances involving lines or polygons, and it doesn't do intersections or other such geometric operations. RGeo implements the entire SFS -- every single operation. To accomplish this, it uses Geos, the same industry standard computational geometry library that PostGIS uses internally, so you can be confident of its speed and stability.
GeoRuby assumes most objects are in a flat Cartesian coordinate system; it generally does not handle different coordinate systems. The sole exception is that it provides specialized methods to measure distance across the globe, but they require that you keep track of the coordinate system yourself. RGeo automatically ensures that computations take place in the right coordinate system, and provides rich tools for managing and converting coordinate systems.
The original GeoRuby project has not been updated for a long time. There is a recent fork that is being maintained somewhat more actively. But even the fork doesn't look like it is likely to have the basic capabilities many non-trivial applications require, at least not anytime soon.

That said, some of the early inspiration for RGeo did come from GeoRuby. Although RGeo's design is markedly different, it was created to solve some of the same basic problems and so is something of a spiritual descendant.

RGeo Documentation

One thing missing right now with RGeo is a really good tutorial and/or user's guide. However, the RDocs are fairly extensive and should often provide you with enough information to get started.

Most of the APIs that you will work with are documented as modules within the RGeo::Feature namespace. Factories should follow the API defined by RGeo::Feature::Factory, which specifies a method for constructing each type of spatial object. Each object type, in turn, has its own corresponding interface -- for example, RGeo::Feature::Point defines the interface for point objects. All these interfaces inherit from the base interface RGeo::Feature::Geometry, which defines methods common to all spatial objects.

One important thing to note is that the interface modules in RGeo::Feature may not necessarily be included in the objects themselves. That is, it is not necessarily true that:

point1.is_a?(RGeo::Feature::Point) # may not be true
factory.is_a?(RGeo::Feature::Factory) # also may not be true

However, the objects will still "duck-type" (that is, implement the same methods as) the interface modules, so to find documentation on a particular object, you need only look at the RDocs for the relevant interface modules.
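
If duck typing against an interface module is unfamiliar, here is a small plain-Ruby sketch of the idea. PointLike, PlainPoint, and quacks_like? are hypothetical names made up for this illustration and are not part of RGeo.

```ruby
# A hypothetical interface module that exists only to document which
# methods a point-like object is expected to provide.
module PointLike
  def x
    raise NotImplementedError
  end

  def y
    raise NotImplementedError
  end
end

# A class that implements the same methods WITHOUT including the module.
class PlainPoint
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

# True if the object responds to every method the interface declares.
def quacks_like?(object, interface)
  interface.instance_methods(false).all? { |name| object.respond_to?(name) }
end

pt = PlainPoint.new(1, 0)
puts pt.is_a?(PointLike)         # prints false
puts quacks_like?(pt, PointLike) # prints true
```

The is_a? check fails because the module was never included, yet the object still answers every method the interface describes, which is what matters when you read the documentation.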

Here's a map of the important interfaces. First, the factory interface:

RGeo::Feature::Factory

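Next, the interfaces corresponding to the types and subtypes defined by the SFS:

RGeo::Feature::Geometry
  RGeo::Feature::Point
  RGeo::Feature::Curve
    RGeo::Feature::LineString
      RGeo::Feature::Line
      RGeo::Feature::LinearRing
  RGeo::Feature::Surface
    RGeo::Feature::Polygon
  RGeo::Feature::GeometryCollection
    RGeo::Feature::MultiPoint
    RGeo::Feature::MultiCurve
      RGeo::Feature::MultiLineString
    RGeo::Feature::MultiSurface
      RGeo::Feature::MultiPolygon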

How do you create a factory in the first place? For the most part, you will use class methods provided for this purpose. These modules contain methods for getting factories:

RGeo::Cartesian
RGeo::Geographic
RGeo::Geos

Where to go from here

We have taken a whirlwind tour of the basic features of RGeo. RGeo provides a deep set of tools, and at this point you should have enough background to do some pretty interesting geospatial analysis.

In addition to the RDocs for RGeo, I think the actual Simple Features Spec is essential background reading. It's quite accessible and provides a useful overview of the data types and computations available.

The next topic for this series is likely to be an introduction to coordinate systems and projections. Stay tuned!
