Originally published Jan 2, 2012
Scaling, scaling, scaling. Can Rails really scale? It's been a source of FUD and the butt of running jokes. But scaling is a serious matter when it comes to large data sets, and it's something we need to pay attention to in the geospatial realm where big data is commonplace.
In this week's article, I'll go over the basic issues every geospatial programmer should know about scaling, and provide tips for writing your geospatial Rails application so it doesn't fall over when you go national.
This is part 6 of my continuing series of articles on geospatial programming in Ruby and Rails. For a list of the other installments, please visit http://daniel-azuma.com/articles/georails.
Scaling is a high-profile issue. We all notice when our blog gets slashdotted, and when Twitter or Amazon goes down it makes national news. Failures happen to everyone, and seem almost inevitable. Are they?
Now, I don't want to downplay the complexity of the scaling task, but we do have to start with an important observation. Scaling, in its essence, is a solved problem. The techniques involved have been well-understood for decades. We all learned about logarithmic searching, branch and bound, and similar algorithms in our computer science classes. And as web developers, we should already know what these algorithms look like in practice: database indexes, sharding, caching, replication, load balancing, and so forth.
So why the hoopla?
Because scaling, though a solved problem, is not an automatically solved problem. It requires our attention. More to the point, it requires that we understand every aspect of the system we are building, how the various components work and how they interact. If we need our database to scale, at some level we need to understand how it works, how we're using it, and thus what we need to do to make it scale.
This is probably the main reason why Rails has historically had a negative reputation about scaling. Rails purports to make web development simple. But that's a bit misleading. If you think about it, a website is a complex system with a lot of moving parts: the network, the server, the MVC control flow, databases, caches, client-side code, control flow from one page to another, interactions with external systems, security layers..., and that doesn't even include your application logic with all of its complexities. It's hardly simple. Rails tries to deal with this complexity by hiding elements, at least partly. It hides the database behind ActiveRecord, and in doing so trains us not to think about the database. But that's a deceit. We have to think about the database if we're going to scale it.
And that is one of the key motivations behind this entire series. I've heard some comments that this material is difficult, that there's no TL;DR. Yes, the material is difficult, and I'm not going to try to hide or gloss over that fact. I could try to make geospatial programming really easy, creating a one-size-fits-all tool or recipe for everyone to cargo-cult. But that is the wide road that leads to destruction. Eventually, you'll need to figure out how to scale, and at that time, if you don't have some understanding of what's going on under the hood, you'll get very stuck very quickly.
Now, that said, dealing with geospatial features does not fundamentally change the scaling task. Scaling is still a solved problem. As we prepare to scale our applications, there is a well-known, systematic process we all go through. We measure, find the bottlenecks, apply well-understood techniques to address those bottlenecks, and repeat. It can be a tedious process, and (believe me, I know) it is sometimes difficult to sell our business partners on the fact that we need to spend time on it before it's too late. But it's not like we don't know what we're doing. Scaling is complex, but not difficult. It simply requires that we have a general understanding of how spatial data works.
So, sermon over, let's dive in.
Spatial data is often big data, and as with any big data, our basic scaling task involves making it smaller.
In a database, this is generally accomplished by judicious use of indexes. An index provides a fast way to look up data by some criteria, without having to read and compare against every single row. For example, if your table has a million rows, each identified by a numeric ID, you can generally speed up ID lookups by creating an index on that column.
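To make the idea concrete, here's a toy sketch in plain Ruby (no database involved; the data and helper names are made up for illustration). A sorted array plus binary search stands in for a B-tree index:

```ruby
# Toy illustration: an index lets us find a row without comparing
# against every one. A sorted array plus binary search stands in
# for a real B-tree index here.

rows = (1..100_000).map { |id| { id: id, name: "row #{id}" } }

# Without an index: linear scan, O(n) comparisons.
def find_by_scan(rows, id)
  rows.find { |row| row[:id] == id }
end

# With an "index": the rows are ordered by id, so binary search
# finds the row in O(log n) comparisons.
def find_by_index(rows, id)
  rows.bsearch { |row| id <=> row[:id] }
end

find_by_scan(rows, 87_654)[:name]   # tens of thousands of comparisons
find_by_index(rows, 87_654)[:name]  # about 17 comparisons
```

The same row comes back either way; the index changes only how much work it takes to find it.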
Similarly, spatial database indexes can accelerate queries that include spatial criteria. If your million-row table also contains a latitude-longitude coordinate column, and you want to find rows whose coordinates fall within a certain region, you should consider building a spatial index on that column. This allows your spatial search to avoid checking every row in the database, thus speeding up your queries.
The important thing to understand about spatial indexes is that although they are conceptually the same as "standard" database indexes, they are implemented differently under the hood. In most databases, a simple index on an ID column will use an algorithm known as a B-tree. Such an index relies on a global ordering of the data, and builds a balanced search tree, which, as we remember from computer science, lets us do lookups in logarithmic time.
Spatial data, however, has some important differences from normal scalar data. A simple numeric ID is an infinitesimal point on a one-dimensional number line, whereas a polygon is a finite area on a two-dimensional surface. For data that covers finite areas or lives in more than one dimension, a B-tree does not work. We have to resort to a more complex algorithm, usually a variant on what is known as an R-tree.
I'll save the gory details on R-trees for a later article on spatial index design, but there is one upshot you'll need to understand: spatial indexes are heavier and more expensive than standard indexes. An R-tree takes up more space in memory and on disk than a similarly-sized B-tree. Queries against an R-tree can be a little slower than against a B-tree, and R-tree updates can be considerably slower. However, R-trees still provide logarithmic-time queries, and so will still give you speed-ups in many situations. So the usual database mantra still applies, and indeed goes double for spatial indexes: Index your common queries but don't index everything. And of course, Measure, Measure, Measure.
Because R-tree updates can be slow, it is also usually a good idea to remove or disable a spatial index if you are going to be loading a lot of data, and then turn it back on once you are done. In this way, you pay the cost of building the index only once at the end, rather than having to incrementally update it on every insert.
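A toy analogy in plain Ruby illustrates why (a sorted array again stands in for the index; a real R-tree rebuild behaves analogously):

```ruby
# Toy illustration of why bulk loads go faster without a live index:
# keeping a sorted array (our stand-in for an index) up to date on
# every insert costs work each time, while ordering the data once at
# the end pays that cost a single time.

data = (1..20_000).to_a.shuffle

# Index maintained incrementally: each insert must find its slot and
# shift everything after it -- analogous to updating the index per row.
incremental = []
data.each do |value|
  slot = incremental.bsearch_index { |v| v >= value } || incremental.size
  incremental.insert(slot, value)
end

# Index dropped during the load and rebuilt once afterward.
bulk = data.dup
bulk.sort!

incremental == bulk  # => true; same "index", built far more cheaply
```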
Because a spatial index is constructed differently from most indexes, creating one usually requires a special syntax. For a Rails project, you can usually let RGeo's ActiveRecord adapters handle this for you. Create a spatial index in a migration simply by providing the :spatial => true option to the index declaration:
create_table :counties do |t|
  t.string :name
  t.polygon :poly, :srid => 2285
end

change_table :counties do |t|
  t.index :poly, :spatial => true
end
If you are managing your schema manually, you'll need to use the database's particular syntax. In PostGIS, spatial indexes use the GIST framework, so you denote a spatial index with "USING gist":
CREATE INDEX "counties_poly_idx" ON "counties" USING gist ("poly");
MySQL's spatial extension defines a separate index type:
CREATE SPATIAL INDEX `counties_poly_index` ON `counties` (`poly`);
In SpatiaLite, a spatial index is actually a separate table that you must join to. Creating a spatial index involves calling a special function provided by the SpatiaLite library:
SELECT CreateSpatialIndex('counties', 'poly');
Once you've created a spatial index, it is usually a good idea to verify that your queries will take advantage of it. You'll want to make sure the database's query planner, the component that analyzes a query and decides how to attack it, is producing an optimal plan. This is generally good practice for all your database design, but more so for spatial queries because they are less commonly used, and query planners do not always do as good a job with them as we would like.
Your best tool for interacting with the query planner, whether or not you're using a spatial database, is the EXPLAIN command.
For most databases, you can invoke the EXPLAIN command simply by prefixing your query with "EXPLAIN":
EXPLAIN SELECT "name" FROM "counties" WHERE ST_Intersects("poly", ST_GeomFromEWKT('SRID=2285;POINT(1266457.58 230052.50)'));
Postgres will return a query plan that looks something like this:
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Index Scan using counties_poly_idx on counties  (cost=0.00..8.52 rows=1 width=68)
   Index Cond: (poly && '0101000020ED08000048E17A94195333410000000024150C41'::geometry)
   Filter: _st_intersects(poly, '0101000020ED08000048E17A94195333410000000024150C41'::geometry)
Note that it's using the "counties_poly_idx" index we created, so the database can narrow the search without scanning the whole table.
Unfortunately, there are a few cases when the query planner won't be able to figure out by itself that an index is useful. For example, suppose we want to perform a sanity check of our counties database, making sure we don't have any overlapping polygons. More precisely, while we expect that county polygons will "touch"---that is, share boundaries---we don't want counties to actually share interior points. That could mean a problem in our data.
Unfortunately, PostGIS doesn't provide this kind of "interior intersection" test out of the box. The standard ST_Intersects predicate won't do: it returns true even when two geometries merely touch at their boundaries, so it would flag every pair of neighboring counties.
But we can build "interior intersection" using the function ST_Relate, which tests two geometries against a DE-9IM intersection matrix pattern. The pattern 'T********' matches exactly when the interiors of the two geometries intersect. Let's see what EXPLAIN says:
EXPLAIN SELECT c1.name, c2.name FROM counties AS c1 INNER JOIN counties AS c2 ON c1.id != c2.id AND ST_Relate(c1.poly, c2.poly, 'T********');
                               QUERY PLAN
----------------------------------------------------------------------
 Nested Loop  (cost=0.00..10372.17 rows=229633 width=64)
   Join Filter: st_relate(c1.poly, c2.poly, 'T********'::text)
   ->  Seq Scan on counties c1  (cost=0.00..18.30 rows=830 width=68)
   ->  Materialize  (cost=0.00..22.45 rows=830 width=68)
         ->  Seq Scan on counties c2  (cost=0.00..18.30 rows=830 width=64)
Ouch! That's an unfortunate query plan. It does nested sequential scans, comparing every county polygon with every other county polygon, an n-squared operation. In a table with thousands of counties, this can be slow.
But it turns out we can do better. The "interior intersection" operation actually can be optimized using the index. The query planner doesn't realize this, so we need to give it some help.
Here a bit of trivia about spatial indexes will help us. In general, the "native" operation for an R-tree index is bounding box intersection. It can take the bounding box of an input geometry, and determine which geometries in the table have bounding boxes that intersect the input. At the most basic level, the query planner for PostGIS works by looking for opportunities to apply this native operation. It asks, "How can I reduce the search space by applying a bounding box intersection?"
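To make the bounding-box operation concrete, here is a minimal sketch in plain Ruby. The Box type is a hypothetical simplification for illustration, not an RGeo or PostGIS API:

```ruby
# A minimal sketch of the bounding-box intersection test that an
# R-tree uses as its native operation. Boxes are axis-aligned
# rectangles given as min/max coordinates.

Box = Struct.new(:min_x, :min_y, :max_x, :max_y) do
  # Two boxes intersect unless one lies entirely to one side of the other.
  def intersects?(other)
    !(other.min_x > max_x || other.max_x < min_x ||
      other.min_y > max_y || other.max_y < min_y)
  end
end

county = Box.new(0.0, 0.0, 10.0, 10.0)
query  = Box.new(8.0, 8.0, 12.0, 12.0)    # overlaps the corner
far    = Box.new(20.0, 20.0, 25.0, 25.0)  # disjoint

county.intersects?(query)  # => true
county.intersects?(far)    # => false
```

This cheap rectangle comparison is what the index can answer quickly; the exact (and expensive) geometric predicate then needs to run only on the rows that survive it.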
In our first example above, when we used ST_Intersects, the query planner automatically rewrote our query from:
SELECT "name" FROM "counties" WHERE ST_Intersects("poly", ST_GeomFromEWKT('SRID=2285;POINT(1266457.58 230052.50)'));
...to:
SELECT "name" FROM "counties" WHERE "poly" && ST_GeomFromEWKT('SRID=2285;POINT(1266457.58 230052.50)') AND ST_Intersects("poly", ST_GeomFromEWKT('SRID=2285;POINT(1266457.58 230052.50)'));
...using the PostGIS operator for bounding box intersection: "&&". You can see it in the resulting plan:
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Index Scan using counties_poly_idx on counties  (cost=0.00..8.52 rows=1 width=68)
   Index Cond: (poly && '0101000020ED08000048E17A94195333410000000024150C41'::geometry)
   Filter: _st_intersects(poly, '0101000020ED08000048E17A94195333410000000024150C41'::geometry)
See the bounding box intersection "&&" in the Index Condition? That wasn't in our original query, but PostGIS rewrote our query and put it there so it could use the index. Pretty clever, PostGIS is.
Well, sometimes PostGIS isn't quite clever enough, and we have to give it some help. In our "interior intersection" example, we can improve the query plan by going through this process manually. We reason thus: PostGIS doesn't realize this, but every time two geometries have an "interior intersection" using ST_Relate, their bounding boxes also intersect. So I can add a bounding box intersection and not change the result of the query. Bounding box intersections are good because they let me use the index.
So let's manually rewrite our query from:
SELECT c1.name, c2.name FROM counties AS c1 INNER JOIN counties AS c2 ON c1.id != c2.id AND ST_Relate(c1.poly, c2.poly, 'T********');
...to:
SELECT c1.name, c2.name FROM counties AS c1 INNER JOIN counties AS c2 ON c1.poly && c2.poly AND c1.id != c2.id AND ST_Relate(c1.poly, c2.poly, 'T********');
Now we give this to PostGIS, and voila! The query planner now uses the index:
EXPLAIN SELECT c1.name, c2.name FROM counties AS c1 INNER JOIN counties AS c2 ON c1.poly && c2.poly AND c1.id != c2.id AND ST_Relate(c1.poly, c2.poly, 'T********');
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..296.81 rows=1 width=64)
   Join Filter: st_relate(c1.poly, c2.poly, 'T********'::text)
   ->  Seq Scan on counties c1  (cost=0.00..18.30 rows=830 width=68)
   ->  Index Scan using counties_poly_idx on counties c2  (cost=0.00..0.32 rows=1 width=68)
         Index Cond: (c1.poly && c2.poly)
This improved plan still does one full sequential scan, because it still has to check every county in the database. But the nested scan, which checks whether that county overlaps any other county, is now accelerated using the index. We've reduced the n-squared query to an n log n query. The computer scientist in us rejoices!
Going through this process does require some creativity, and it helps to have a bit of experience. The good news is that PostGIS is smart enough to handle most cases automatically. But you should still make liberal use of the EXPLAIN tool and look carefully at the query plan that is generated, to see if it's doing as well as you think it ought. There may be opportunities to improve your query performance dramatically just by giving it a little bit of help.
Generally, I recommend PostGIS as an open source spatial database. But there are others out there that you may need to use from time to time, and each one will have its quirks.
As we've seen, the spatial extensions to MySQL do support spatial indexes. However, there are some significant limitations in comparison with PostGIS.
First, spatial indexes currently work only on MyISAM tables. This means you can't use spatial indexes and get the transaction safety benefits of InnoDB on the same table. Ugh.
Second, MySQL supports only a very limited set of spatial relationship functions. In particular, nearly all of MySQL's functions work on bounding boxes (which MySQL calls Minimum Bounding Rectangles, or MBRs) rather than on the geometry itself. So for example, MySQL's Intersects() function doesn't actually test whether two geometries intersect; it tests only whether their bounding boxes do, making it equivalent to MBRIntersects().
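To see how much coarser an MBR-only predicate is, here's a toy sketch in plain Ruby. The Circle type is hypothetical, chosen because the gap between a shape and its bounding box is easy to see:

```ruby
# An MBR-based predicate answers a coarser question than true geometry
# intersection. These two circles never touch, but their bounding
# boxes overlap at a corner.

Circle = Struct.new(:cx, :cy, :r) do
  def mbr
    [cx - r, cy - r, cx + r, cy + r]  # [min_x, min_y, max_x, max_y]
  end

  # What an MBR-only predicate checks: do the boxes overlap?
  def mbr_intersects?(other)
    a, b = mbr, other.mbr
    a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3]
  end

  # The exact predicate: circles intersect when their centers are
  # within the sum of their radii.
  def intersects?(other)
    Math.hypot(cx - other.cx, cy - other.cy) <= r + other.r
  end
end

a = Circle.new(0.0, 0.0, 1.0)
b = Circle.new(1.8, 1.8, 1.0)

a.mbr_intersects?(b)  # => true  (boxes overlap at the corner)
a.intersects?(b)      # => false (centers ~2.55 apart, radii sum to 2.0)
```

If exact answers matter, a bounding-box predicate can only be a first-pass filter; the precise test still has to run somewhere.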
I generally don't recommend using MySQL Spatial unless you're already using MySQL. But then I typically don't recommend using MySQL in general either...
SpatiaLite is a set of spatial extensions to the popular SQLite database. I haven't used SpatiaLite much. It does seem to have a fairly complete feature set, at least in comparison with MySQL Spatial, though it doesn't compare in maturity with PostGIS.
Spatial indexes in SpatiaLite are a bit of a pain, however. They are implemented as a separate set of managed join tables tied to your main table using triggers. All this is handled fairly transparently, except for queries. When you want to write a query that takes advantage of a spatial index in SpatiaLite, you must explicitly join to the index table.
For the sake of space, I won't go into the details here. Instead, I highly recommend an excellent online publication by the author of SpatiaLite, the SpatiaLite Cookbook, which serves as the user's manual for SpatiaLite, and provides a number of very helpful examples.
But wait---there's more!
Remember that the basic scaling task is to make big data smaller. If we have big data, we try to do clever things, such as applying indexes, so that we don't have to analyze all the data at once.
Now, there are two ways in which spatial data can be "big". First, there may be a lot of objects, lots of rows. In this case, we can often speed up queries by adding a spatial index, as we have seen.
However, individual objects can also be "big", particularly when you're dealing with polygons. Take our table of county polygons. Some county boundaries are simple polygons with just a few sides, but many others have complex, crinkly boundaries that follow rivers, coastlines, mountain divides, or other natural features. The number of sides in such polygons can quickly rise into the thousands or more. When you want to compute, say, an intersection with such a polygon, it can be slow.
There are several different strategies you can use to address this problem. I'm just going to summarize a couple of the important ones here. But first I need to emphasize one thing. There is no one-size-fits-all solution. Each approach has its pros and cons, and your choice will depend on the requirements of your particular application.
To this end, measurement is absolutely critical. Before, during, and after applying any optimization technique, run a benchmark and make sure that (1) you're addressing the right problem, and (2) the performance is going in the right direction. This is doubly important when dealing with spatial data, because the algorithms involved are somewhat more complex, and it may surprise you what's fast and what's slow.
There are two general techniques for dealing with large polygons: simplification and segmentation.
Simplification can be applied when you're more concerned with speed than accuracy. For example, you might have a polygon with a thousand sides, but if you're going to be displaying it in a relatively small area on a map, or you're running some spatial queries where you don't care too much if you're a little off, then you can probably get away with an approximation of the polygon with fewer sides.
There are a number of polygon simplification techniques out there, useful for different circumstances. I don't have space here for a full discussion, but I may write a specialized article on simplification at a later time, because it's an interesting (and sometimes tricky) problem.
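That said, one classic approach worth knowing is the Ramer-Douglas-Peucker algorithm, which keeps the vertex that deviates most from the current approximation and discards the rest. Here's a sketch in plain Ruby, with coordinates as [x, y] pairs; in production you would more likely reach for a library implementation such as PostGIS's ST_Simplify:

```ruby
# A sketch of Ramer-Douglas-Peucker line simplification. Vertices
# that deviate from the simplified shape by less than a tolerance
# are dropped; the overall outline is preserved.

def perpendicular_distance(pt, a, b)
  # Distance from pt to the infinite line through a and b.
  dx, dy = b[0] - a[0], b[1] - a[1]
  len = Math.hypot(dx, dy)
  return Math.hypot(pt[0] - a[0], pt[1] - a[1]) if len.zero?
  (dy * pt[0] - dx * pt[1] + b[0] * a[1] - b[1] * a[0]).abs / len
end

def simplify(points, tolerance)
  return points if points.size < 3
  # Find the vertex farthest from the chord joining the endpoints.
  index, dmax = 0, 0.0
  (1...points.size - 1).each do |i|
    d = perpendicular_distance(points[i], points.first, points.last)
    index, dmax = i, d if d > dmax
  end
  if dmax > tolerance
    # Keep that vertex and recurse on both halves.
    left  = simplify(points[0..index], tolerance)
    right = simplify(points[index..-1], tolerance)
    left[0...-1] + right  # drop the duplicated pivot point
  else
    [points.first, points.last]
  end
end

jagged = [[0, 0], [1, 0.1], [2, -0.1], [3, 5], [4, 6],
          [5, 7], [6, 8.1], [7, 9], [8, 9], [9, 9]]
simplify(jagged, 1.0)
# => [[0, 0], [2, -0.1], [3, 5], [7, 9], [9, 9]]
```

Note how the tolerance directly trades accuracy for size: a larger tolerance drops more vertices and distorts the shape more.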
Segmentation is often useful for speeding up queries against big polygons, when you care not about the shape of the polygon itself but merely whether you're intersecting it. Segmentation, for example, might be useful in our county boundary example.
The idea is to break up large polygons into smaller polygons that can be stored in separate rows in your database. That is, we trade "width" of the data (i.e. how big each object is, in terms of number of vertices) for "length" (i.e. how many objects there are). Since we have spatial indexes that mitigate big "length", this trade-off can be a win for us.
There are many ways to break up a large polygon. The simplest approach, usually good enough in practice, is to perform recursive four-to-one subdivision. Don't be scared off by the name; it's actually quite straightforward. The idea is to take your large polygon with many sides, and split it down the middle horizontally and vertically. This will typically result in four polygons, each covering about a quarter of the area and containing about a quarter of the number of sides.
Now, if any of the four polygons still has too many sides, you can do the same thing again, recursively, and so forth until you reach a number of sides that you're comfortable with. Once you're done, the union of all the resulting polygons will still be your original polygon. So, in our counties example, a county now corresponds to several rows rather than one, and a spatial query tests against several small polygons instead of one enormous one.
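The recursive structure can be sketched in plain Ruby. A real implementation would clip the polygon against each quadrant (for example with PostGIS's ST_Intersection); this toy version just partitions the polygon's vertices by quadrant, which is enough to show the four-to-one recursion:

```ruby
# A sketch of recursive four-to-one subdivision. Instead of true
# polygon clipping, this toy partitions the vertices of a ring by
# quadrant, recursing until every piece is small enough.

def subdivide(points, max_points)
  return [points] if points.size <= max_points

  xs = points.map(&:first)
  ys = points.map(&:last)
  mid_x = (xs.min + xs.max) / 2.0
  mid_y = (ys.min + ys.max) / 2.0

  # Split down the middle horizontally and vertically: four quadrants.
  quadrants = points.group_by { |x, y| [x <= mid_x, y <= mid_y] }.values

  # Recurse on any quadrant that is still too big.
  quadrants.flat_map { |quad| subdivide(quad, max_points) }
end

# A 64-vertex ring approximating a circle.
ring = 64.times.map do |i|
  angle = i / 64.0 * 2 * Math::PI
  [Math.cos(angle), Math.sin(angle)]
end

pieces = subdivide(ring, 20)
pieces.sum(&:size)  # => 64; every vertex lands in exactly one piece
```

Each piece ends up with at most 20 vertices, mirroring how the real technique trades one wide polygon for several narrow ones.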
So how deeply should you segment a polygon? What's the "sweet spot" in the number of sides? That's where you have to test and measure, because it will depend on many factors. In one recent project in which I did some polygon segmentation, I measured the optimal number of sides between three and five hundred. But your mileage will vary.
It is worth diving into the manual for your spatial database for tips on the effective use of spatial indexes. The PostGIS manual is online.
EXPLAIN is a very powerful tool for studying and optimizing your database performance in general, not only when you're working with spatial data. I highly recommend getting familiar with using it in your database. For PostgreSQL, a good place to start is the Using Explain page in the manual. SQLite and MySQL also have sections in their manuals. Additionally, it looks like Rails 3.2 will include some useful EXPLAIN-based tools out of the box.
This week's article didn't include a lot of code. This was because I had a lot of material to get through, and I decided it was more important to cover the concepts at this stage, rather than encourage code cargo-culting. In next week's article, however, I'll go through a fully worked example, with code, that mirrors an actual task I recently had to do for my job at Pirq.
Until then, have fun and let's bring Rails down to earth!