Geo-Rails Part 5: Spatial Data Formats

2011-12-19 georails

The location revolution is a revolution of data. Ubiquitous data, from mobile GPS and user input as well as from census and other datasets, is what makes location-aware applications possible. And so the first task of many geospatial projects is to determine how to find and utilize (and, in some cases, produce) external data.

In this article, we will survey some of the important spatial data formats, including serialization, file formats, and API-oriented formats. Specifically, we will look at:

Basic serialization using WKT and WKB
Variants on WKT and WKB
Reading public datasets from shapefiles
Web service oriented formats such as GeoJSON
XML-based formats commonly used in web services

We will also go over a few quick examples using Ruby and RGeo. This will be a fairly high-level overview and we won't go into a lot of detail. We'll take deeper looks at some of these formats in future articles.

This is part 5 of my continuing series of articles on geospatial programming in Ruby and Rails. For a list of the other installments, please visit http://daniel-azuma.com/articles/georails.

The standard OGC serialization formats

If, after reading part 3, you looked through the Simple Feature interfaces (or the corresponding RGeo interfaces), you may have noticed two serialization methods provided for geometries: as_text and as_binary. These methods respectively output the "Well-Known Text" and "Well-Known Binary" representations of the geometry. These two standard serialization formats are defined by the OGC Simple Feature Access specification, and are supported by most GIS software.

Well-Known Text (often abbreviated WKT) is a human-readable and parseable text-based format for all geometry objects. You can read the exact format specification in the Simple Features Spec, but a few examples are probably sufficient to get the general hang of it.

Point(-122.1 47.2)
LineString(2 4, 5 4, 5 8, 2 4)
Polygon((0 0, 5 0, 5 5, 0 5, 0 0), (2 2, 2 3, 3 3, 3 2, 2 2))
MultiPoint((-122.1 47.2), (-93.5 39.4))
GeometryCollection(Point(3 5), LineString(-2 0, -3 -4))
MultiLineString EMPTY

Don't confuse the simple features WKT format with the coordinate system WKT format we covered in part 4. Unfortunately, both are commonly known as Well-Known Text (WKT), but they are distinct formats: one represents geometric objects whereas the other represents coordinate systems.

Well-Known Binary (often abbreviated WKB) is a binary format that uses numeric codes and IEEE floating-point representations. It is not human-readable but is much more compact than WKT.
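
To make the binary layout a little more concrete, here is a minimal sketch in plain Ruby (no RGeo involved) that packs and unpacks a two-dimensional point as WKB. The helper names are my own, invented for illustration. The layout it assumes is a one-byte byte-order flag, a 32-bit geometry type code, and then two 64-bit IEEE doubles, and it only handles little-endian points.

```ruby
# Illustrative helpers (not part of RGeo): WKB for a 2D point, little-endian only.
WKB_LITTLE_ENDIAN = 1  # byte-order flag: 1 = little-endian, 0 = big-endian
WKB_POINT = 1          # geometry type code for Point

def point_to_wkb(x, y)
  # C = 8-bit unsigned, V = 32-bit unsigned little-endian,
  # E = 64-bit double little-endian
  [WKB_LITTLE_ENDIAN, WKB_POINT, x, y].pack('CVEE')
end

def wkb_to_point(wkb)
  byte_order, type, x, y = wkb.unpack('CVEE')
  unless byte_order == WKB_LITTLE_ENDIAN
    raise ArgumentError, 'only little-endian WKB is handled in this sketch'
  end
  raise ArgumentError, 'not a Point' unless type == WKB_POINT
  [x, y]
end

wkb = point_to_wkb(1.0, 2.0)
puts wkb.bytesize             # 21
puts wkb.unpack('H*').first   # 0101000000000000000000f03f0000000000000040
p wkb_to_point(wkb)           # [1.0, 2.0]
```

The whole point fits in 21 bytes regardless of how many decimal digits the coordinates carry, which is where the compactness relative to WKT comes from.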

Using RGeo, you can obtain the WKT and WKB representations of a geometric object by calling as_text and as_binary, respectively. Factory objects will provide methods to parse WKT and WKB format and recover the geometric object.

require 'rgeo'

factory = RGeo::Cartesian.factory
point = factory.point(1, 2)
wkt = point.as_text   # => "Point(1 2)"
point2 = factory.parse_wkt(wkt)
point == point2       # => true

Variants on WKT and WKB

As simple and well-supported as they are, WKT and WKB have several important weaknesses that have caused headaches for spatial databases and applications. In part 4, we saw that to properly interpret a geometric object, you need to know the coordinate system, which is usually specified by a spatial reference ID (SRID). Unfortunately, neither WKT nor WKB includes a way to represent the SRID. They expect the SRID to be specified or implied elsewhere, which is sometimes, but not always, the case.

Furthermore, some applications use additional coordinates in their geometric data. Applications that store altitude or other third-dimensional data may include a "Z" coordinate in their geometries. Other applications may include a measurement (such as temperature or population) stored in an "M" coordinate. Version 1.1 of the Simple Features Spec (and the corresponding WKT and WKB specifications) does not directly support these extra coordinates, although version 1.2 does address this, as we will see.

Finally, neither WKT nor WKB by themselves provide a way to associate metadata, such as object names or other properties, with geometric objects. This limits their usefulness as a complete format for data transmission.

Because of these limitations, several variants have appeared that you should be aware of. The PostGIS database supports an extension to WKT called "EWKT", which supports SRID as well as "Z" and "M" coordinates. The SRID, if present, appears at the front of the EWKT string:

SRID=4326;Point(-122.34978 47.62058)
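
If you ever need to pull the SRID off the front of an EWKT string without a full parser, the prefix is easy to split by hand. The following is a small sketch in plain Ruby. The split_ewkt helper is hypothetical (it is not an RGeo or PostGIS API), and it assumes the SRID prefix, when present, has exactly the "SRID=number;" shape shown above.

```ruby
# Illustrative helper (not an RGeo or PostGIS API): split an EWKT string
# into its optional SRID and the remaining WKT text.
def split_ewkt(ewkt)
  match = ewkt.match(/\ASRID=(-?\d+);(.*)\z/mi)
  return [nil, ewkt] unless match   # plain WKT: no SRID prefix
  [Integer(match[1]), match[2]]
end

srid, wkt = split_ewkt('SRID=4326;Point(-122.34978 47.62058)')
p srid  # 4326
p wkt   # "Point(-122.34978 47.62058)"
```

A string without the prefix comes back unchanged with a nil SRID, so the same helper can be used on plain WKT.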

EWKT supports "Z" and "M" coordinates by letting you append them to each pair of coordinates as third and fourth coordinate values. When both "Z" and "M" are present (i.e. four coordinate values per point), the third coordinate is used for "Z" while the fourth is used for "M". If only one is used (i.e. three coordinate values per point), you must specify whether it is "Z" or "M". Here are some examples:

Point(-122.34978 47.62057 20.0 -3)  # X,Y,Z,M in EWKT
PointM(-122.34978 47.62057 -3)      # X,Y,M
PointZ(-122.34978 47.62057 20.0)    # X,Y,Z

PostGIS also defines a corresponding "EWKB" format with appropriate extensions to the binary format to support SRID as well as Z and M. EWKB is (or at least appears to be) the native internal format used by PostGIS to represent geometric data.

More recent versions of the OGC Simple Features Spec (version 1.2 and later) also provide support for Z and M. However, beware that the OGC format is not the same as the PostGIS EWKT and EWKB. The WKT update expects a space between the geometry type and the Z/M specifier, and it also requires the modifier in the "four-dimensional" ZM case:

Point ZM(-122.34978 47.62057 20.0 -3)  # X,Y,Z,M in WKT 1.2
Point M(-122.34978 47.62057 -3)        # X,Y,M
Point Z(-122.34978 47.62057 20.0)      # X,Y,Z

Furthermore, the updated WKT format still does not support an SRID. The updated WKB similarly supports Z and M (but not SRID), but uses different binary codes than those used by EWKB. Hence, these two extensions are not fully compatible with each other.
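
To illustrate how the binary codes differ, here is a rough sketch of how each variant marks a geometry type as having Z or M. It reflects my understanding of the two formats rather than either specification verbatim: EWKB sets flag bits in the high end of the 32-bit type code (including one signalling that an SRID follows), while version 1.2 WKB adds 1000 for Z, 2000 for M, and 3000 for both. The constant and method names are made up for this example.

```ruby
# Illustrative only: names are invented; values reflect my reading of
# PostGIS EWKB and of Simple Features 1.2 WKB.
EWKB_Z_FLAG    = 0x80000000
EWKB_M_FLAG    = 0x40000000
EWKB_SRID_FLAG = 0x20000000

def ewkb_type_code(base, z: false, m: false, srid: false)
  code = base
  code |= EWKB_Z_FLAG if z
  code |= EWKB_M_FLAG if m
  code |= EWKB_SRID_FLAG if srid
  code
end

def wkb12_type_code(base, z: false, m: false)
  base + (z ? 1000 : 0) + (m ? 2000 : 0)
end

POINT = 1  # base type code for Point in both variants
printf("EWKB    Point Z: 0x%08x\n", ewkb_type_code(POINT, z: true))  # 0x80000001
printf("WKB 1.2 Point Z: %d\n", wkb12_type_code(POINT, z: true))     # 1001
```

A reader written for one variant will see an unknown type code when handed the other, which is why the two cannot be mixed freely.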

Because of this fragmentation, neither of these extensions is, in practice, used frequently for long-term serialization. However, you will likely need to work with EWKT at some point if you use PostGIS, so it is important to be familiar with it.

RGeo provides support for parsing and generating both variants in the RGeo::WKRep module. See the rdocs for more details. Here's a really quick code example as a starting point:

parser = RGeo::WKRep::WKTParser.new(nil, :support_ewkt => true)
point = parser.parse('SRID=4326;Point(-122.1 47.3)')
point.srid   # => 4326

Shapefiles and public datasets

Location is driven by data, and a lot of the data you will need to work with will likely come in the form of shapefiles. The shapefile is a flat file format for geospatial data originally developed by ESRI for storing sets of geographic features. It supports certain vector shapes (points, lines, and polygons) along with associated attributes. Although the shapefile began as a proprietary format, its specification is readily available, and it is now a de facto standard for large datasets, including those provided by government agencies such as the US Census Bureau.

A shapefile actually consists of three (and sometimes more) related files, each with the same base filename but different extensions. The main file has the extension ".shp" and contains the geometric data itself in a binary format. An auxiliary ".shx" file provides a simple flat index allowing random access into the shapefile. A second auxiliary ".dbf" file provides the attribute data in dBASE format. All shapefiles should have those three core files, although some shapefiles may include additional files containing coordinate system, spatial index, or other related information.
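
To give a feel for the binary layout of the ".shp" file, here is a minimal sketch in plain Ruby that decodes the fixed 100-byte header at the start of the file. It is not the rgeo-shapefile API. The method name and the returned hash keys are invented for illustration, the field offsets follow my reading of the ESRI specification, and 'myfile.shp' is a placeholder path.

```ruby
# Illustrative sketch (not the rgeo-shapefile API): decode the fixed
# 100-byte header at the start of a ".shp" file.
SHAPE_TYPES = {
  0 => 'Null', 1 => 'Point', 3 => 'PolyLine', 5 => 'Polygon', 8 => 'MultiPoint'
}.freeze

def parse_shp_header(bytes)
  raise ArgumentError, 'header must be at least 100 bytes' if bytes.bytesize < 100
  # N = 32-bit big-endian, V = 32-bit little-endian, E = little-endian double.
  # x20 skips the five unused 32-bit fields after the file code.
  file_code, length_words, version, shape_type, *bbox = bytes.unpack('Nx20NVVE8')
  raise ArgumentError, 'not a shapefile (bad file code)' unless file_code == 9994
  {
    file_bytes: length_words * 2,   # the length is stored in 16-bit words
    version: version,
    shape_type: SHAPE_TYPES.fetch(shape_type, "other (#{shape_type})"),
    xmin: bbox[0], ymin: bbox[1], xmax: bbox[2], ymax: bbox[3]
  }
end

# 'myfile.shp' is a placeholder; nothing happens if the file does not exist.
path = 'myfile.shp'
p parse_shp_header(File.binread(path, 100)) if File.exist?(path)
```

Note the mix of big-endian and little-endian fields in the same header, one of the quirks that makes a dedicated reader library worthwhile.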

Most Rails applications will not read a shapefile directly, but will instead transfer the data to a spatial database such as PostGIS for rapid query and data retrieval. In Ruby, you can use the rgeo-shapefile gem to help with this task. This gem does the heavy lifting involved with parsing and analyzing a shapefile, and exposes the data to you as RGeo geometric objects. You should also install the dbf gem, which lets you read the dBASE attributes in the shapefile.

% gem install rgeo-shapefile
% gem install dbf

Once you have the gems installed, and you've downloaded and unpacked a shapefile, use the RGeo::Shapefile::Reader class to open and read the file. The following example reads objects sequentially:

require 'rgeo/shapefile'

factory = RGeo::Geographic.spherical_factory(:srid => 4326)
RGeo::Shapefile::Reader.open('myfile.shp', :factory => factory) do |file|
  file.each do |record|
    geom = record.geometry
    # geom is now an RGeo geometry object.
    name = record['Name']
    # You can read any other attribute similarly.
    # Now, you can do whatever you want with the data,
    # such as inserting rows into your database...
  end
end

Notice that we provide a factory for the objects being read. Shapefiles generally do not provide an SRID, so we must supply that. The above example assumes the shapefile contains latitude-longitude coordinates in WGS84.

The RGeo::Shapefile::Reader class also lets you do random access reads, and get other information about the shapefile's contents. See the rdocs for more details. The gem does not currently support writing shapefiles, but that feature is on the roadmap.

For more information on the shapefile format itself, you can find the original ESRI specification at http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf. Another common (C-based) implementation of shapefile is Shapelib, which you can find at http://shapelib.maptools.org/.

Web services and GeoJSON

Another way to obtain location data is to call a web service such as Google Places, SimpleGeo, or Factual. These services do the heavy lifting of curating, deduping, and managing location data, and generally provide an HTTP REST API letting you query for location information of interest.

There are a number of different types of web services, including geocoders, point of interest search, location properties, and others. I'll write up a survey of useful location-oriented web services in a later article. For this current article, however, we are interested in data formats that would typically be returned from a point of interest search. When you make a query, what sort of data can you expect to get?

In many cases, the web service will define its own schema for the returned data. You must then parse the returned document yourself to extract the information you want. There are well-known gems available for this task, such as json for parsing JSON, and nokogiri for parsing XML. There are also, however, a few semi-standard schemas commonly used by a number of web services. Here we will take a quick tour of some of these formats and how you can go about using them.

GeoJSON is an important emerging standard commonly used by SimpleGeo and similar modern APIs. It provides a standard JSON representation for each geometric type, as well as support for bounding boxes, coordinate systems, and a set of properties. Following is an example of GeoJSON, lifted out of the specification:

{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
      },
    { "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
          ]
        },
      "properties": {
        "prop0": "value0",
        "prop1": 0.0
        }
      },
    { "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0] ]
          ]
      },
      "properties": {
        "prop0": "value0",
        "prop1": {"this": "that"}
        }
      }
    ]
  }

The core object type in GeoJSON is the Feature, which consists of a geometry and a set of properties. The geometry can be any of the OGC types, and its internal representation is closely modeled on WKT. Properties are simply named key-value pairs whose values can be any JSON object.
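
To see how closely the GeoJSON geometry representation tracks WKT, here is a small sketch in plain Ruby that uses only the json standard library. The geojson_geometry_to_wkt helper is my own illustration, not part of any gem, and it handles only Point, LineString, and Polygon.

```ruby
require 'json'

# Illustrative helper (not the rgeo-geojson API): render a GeoJSON geometry
# hash as WKT. Only Point, LineString, and Polygon are handled.
def geojson_geometry_to_wkt(geometry)
  coords = geometry.fetch('coordinates')
  position = ->(pos) { pos.join(' ') }
  ring = ->(positions) { "(#{positions.map(&position).join(', ')})" }
  case geometry.fetch('type')
  when 'Point'      then "Point(#{position.call(coords)})"
  when 'LineString' then "LineString#{ring.call(coords)}"
  when 'Polygon'    then "Polygon(#{coords.map(&ring).join(', ')})"
  else raise ArgumentError, "unsupported geometry type: #{geometry['type']}"
  end
end

feature = JSON.parse('{"type":"Feature","geometry":{"type":"LineString",' \
  '"coordinates":[[102.0,0.0],[103.0,1.0]]},"properties":{"prop0":"value0"}}')
puts geojson_geometry_to_wkt(feature['geometry'])  # LineString(102.0 0.0, 103.0 1.0)
puts feature['properties']['prop0']                # value0
```

The nesting of the coordinate arrays mirrors the nesting of the parentheses in WKT: a position for a point, a list of positions for a line string, and a list of rings for a polygon.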

GeoJSON is simple and highly versatile, and is often an ideal format both for consuming and producing geospatial data. From Ruby, you can use the rgeo-geojson gem to read and write GeoJSON. Here are some quick examples to get you started:

require 'rgeo/geo_json'

str1 = '{"type":"Point","coordinates":[1,2]}'
geom = RGeo::GeoJSON.decode(str1, :json_parser => :json)
geom.as_text              # => "POINT(1.0 2.0)"

str2 = '{"type":"Feature","geometry":{"type":"Point","coordinates":' +
  '[2.5,4.0]},"properties":{"color":"red"}}'
feature = RGeo::GeoJSON.decode(str2, :json_parser => :json)
feature['color']          # => 'red'
feature.geometry.as_text  # => "POINT(2.5 4.0)"

hash = RGeo::GeoJSON.encode(feature)
hash.to_json == str2      # => true
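
When a web service returns its own schema rather than GeoJSON, one option is to normalize the response into GeoJSON as soon as you receive it. Here is a rough sketch using only the json standard library. The response body and its field names ("results", "name", "lat", "lng") are entirely hypothetical, as is the places_to_feature_collection helper; a real service will have its own schema.

```ruby
require 'json'

# Hypothetical response from a point-of-interest search. The field names
# and the place itself are invented for this example.
response_body = '{"results":[{"name":"Example Place","lat":47.62058,"lng":-122.34978}]}'

# Illustrative helper: convert the hypothetical schema above into a
# GeoJSON FeatureCollection (as a plain Ruby hash).
def places_to_feature_collection(body)
  features = JSON.parse(body).fetch('results').map do |place|
    {
      'type' => 'Feature',
      # GeoJSON positions are [x, y], so longitude comes first.
      'geometry' => {
        'type' => 'Point',
        'coordinates' => [place['lng'], place['lat']]
      },
      'properties' => { 'name' => place['name'] }
    }
  end
  { 'type' => 'FeatureCollection', 'features' => features }
end

puts JSON.generate(places_to_feature_collection(response_body))
```

The one detail worth watching is coordinate order: many services report latitude first, whereas GeoJSON expects longitude first.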

For more information on GeoJSON, see http://geojson.org/. The actual spec hosted on the website is quite short and very readable. You can find more information on the rgeo-geojson gem from its rdocs.

XML-based formats

Although JSON is often a format of choice for many modern web services because of its simplicity and its close affinity with JavaScript and similar high-level languages, XML is still the established standard in many fields and applications. GIS services, in particular, have a long tradition of XML-based representation, and there are a number of XML-based geospatial formats you may encounter when writing location-aware applications. Among them:

GeoRSS is a family of RSS extensions for embedding geospatial data into RSS or Atom feeds, often used to spatially tag feed entries. It comes in two flavors, Simple GeoRSS and GML GeoRSS. Simple GeoRSS is designed for simplicity, and supports a limited set of features. Notably, not all the OGC geometric types can be represented, and coordinate system is limited to WGS84 latitude/longitude. GML GeoRSS is a more full-featured but much more complex format, essentially a profile of GML, which we will cover below. Most actual implementations of GeoRSS are of the Simple flavor.

Below are a couple of examples of a basic GeoRSS element from an RSS feed, first in the Simple flavor and then in the GML flavor.

<georss:point>47.604828 -122.330779</georss:point>

<GeoRSS:where>
  <gml:Point>
    <gml:pos>47.604828 -122.330779</gml:pos>
  </gml:Point>
</GeoRSS:where>
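
One thing to watch for in both flavors is coordinate order: the point is written latitude first, whereas WKT and GeoJSON put X (longitude) first. Here is a small sketch of the conversion in plain Ruby. It assumes you have already extracted the element's text using an XML parser such as nokogiri, and the georss_point_to_wkt helper is my own illustration, not part of any gem.

```ruby
# Illustrative helper: convert the text of a Simple GeoRSS point element
# ("latitude longitude") into WKT, which puts X (longitude) first.
def georss_point_to_wkt(text)
  values = text.split.map { |value| Float(value) }
  raise ArgumentError, 'expected "lat lon"' unless values.size == 2
  lat, lon = values
  "Point(#{lon} #{lat})"
end

puts georss_point_to_wkt('47.604828 -122.330779')  # Point(-122.330779 47.604828)
```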

As of this writing, the georss.org website appears to be unmaintained and possibly hacked. The best starting point I can recommend for GeoRSS is an OGC whitepaper at http://portal.opengeospatial.org/files/?artifact_id=15755.

I'm not currently aware of an RGeo-based Ruby implementation of GeoRSS. The older GeoRuby gem, however, does have basic support for GeoRSS.

Geography Markup Language (or GML) is an XML-based object model intended to describe geographic information. Its specification is maintained by the Open Geospatial Consortium. GML by itself is a highly general and flexible model that can represent not only geometric objects and coordinate systems such as we have looked at so far in this article series, but also observations, topological information, temporal information, and various other related entities.

You generally don't work with GML directly, but instead use an application XML schema that references GML internally. Furthermore, most application schemas don't utilize the entire GML specification, but a relevant subset, known as a GML profile. For example, GML GeoRSS is an application schema referencing a GML profile relevant to geotagging feed entries.

Another common GML-based XML schema is CityGML (http://www.citygml.org/), which is designed to model urban objects. CityGML is commonly used, for example, to model 3D visualizations of cities.

For more information on GML as a whole, you can review the OGC spec at http://www.opengeospatial.org/standards/gml.

I'm not currently aware of any specific Ruby support for GML or its various dialects.

KML (or Keyhole Markup Language) is an XML schema that originated at Google for describing features in Google Earth, but was later standardized by the OGC. Although it does have some overlap with GML, KML is often seen as complementary because of its particular emphasis on visualization. Its intended use is to describe how to display features within a Google Earth style application. You can, for example, open a KML file with Google Earth to display its contents.

For more information on KML, see the Google documentation at http://code.google.com/apis/kml/documentation/ or the OGC specification at http://www.opengeospatial.org/standards/kml.

I'm not currently aware of any specific Ruby support for KML.

Where to go from here

This article has covered just a few of the most common and/or promising major spatial data formats. There are a number of others currently in use, including many locale- or application-specific forms. But as you can see, Ruby support for even the major formats is currently rather thin. We still have much work to do on our tools.

As the principal author of RGeo, I'm looking for help in this area. I released the rgeo-geojson and rgeo-shapefile gems based on work I've done to integrate my own applications with those formats. However, I haven't yet had the need to actually use one of the XML formats, and as a result I haven't written any tools to help with them. There is currently quite a bit of room to contribute to the community in this area.

Next week I'm going to take a break for the holidays, but I expect to release the next planned article on scaling spatial applications with the new year. Stay tuned, and let's bring Rails down to earth!
