Extensible Data Types based on Associated Types

The Problem

Algebraic data types are notorically hard to extend. To see why, consider the following declarations:

data Bird = Duck | Emu | Seagull

canFly :: Bird -> Bool
canFly Duck    = True
canFly Emu     = False
canFly Seagull = True

It is hard to add another bird, such as Penguin, not only because the definition of the data type would have to be extended, but especially because all functions that consume this data type, such as canFly, need to be extended, too. Obvious attempts to solve the problem, such as requiring a default case, all have serious drawbacks.

In case we know a priori that we might want to extend a data type, we can define it such that it is more flexible.

data Bird ext = Duck | Emu | Seagull | Other ext

However, this requires to define all functions with more foresight, too.

canFly :: (ext -> Bool) -> Bird -> Bool
canFly canOtherFly Duck        = True
canFly canOtherFly Emu         = False
canFly canOtherFly Seagull     = True
canFly canOtherFly (Other ext) = canOtherFly ext

At least, we can now define

data OtherBird = Penguin   -- This could of course have more than one alternative

type ExtBird = Bird OtherBird

otherCanFly :: OtherBird -> Bool
otherCanFly Penguin = False

extCanFly :: ExtBird -> Bool
extCanFly = canFly otherCanFly

If all functions that use canFly accept the canOtherFly argument, we can use them on the extended data type without modification, simply by passing in otherCanFly.

This approach certainly has a number of drawbacks. Firstly, explicitly threading the additional argument to canFly through all functions that directly or indirectly use canFly is tedious to begin with, but if we have multiple extensible data types, it can quickly get impractical. Secondly, we might want to extend a data type multiple times; i.e., we might modify the definition of OtherBird to also get an extra type parameter to plug in an another extension. This makes the explicit threading of the extra arguments a notch more tedious. Thirdly, we have to prepare a data type and all its consuming functions suitably to be extensible.

We can not hope to overcome the third point as long as we want to have both standard data types and extensible data types. Having only extensible data types has clear disadvantages as long as they are more work to define and use then vanilla data types. Moreover, they might also be undesirable for formally reasoning about programs. Nevertheless, we might hope to eliminate the first two drawbacks.

Using Associated Types

The above example can be implemented as follows with associated types:

class Birds a where
  data Bird a
  canFly :: Bird a -> Bool

data Std
instance Birds Std where
  data Bird Std = Duck | Emu | Seagull
  canFly Duck    = True
  canFly Emu     = False
  canFly Seagull = True

We can extend the Bird data type and all class methods consuming that type by simply adding more instances, such as

data Other
instance Birds Other where
  data Bird Other = Penguin
  canFly Penguin = False

Here the explicit threading of the manual method to extensible data types is replaced by the dictionary passing mechanism used by the compiler to implement type classes. Moreover, multiple extensions of one data type require no additional work from the programmer. However, we still need to decide a priori whether to define the data type in this manner and we need to put all functions over the data type that we like to be extensible into the class definition. (We'll come back to that last point below.) In essence, we collect all data types whose variants we want to be alternatives for each other into one indexed type family.

Whether functions using methods, such as canFly work transparently in the face of extensions depends on their signature. The following function will obviously only work for standard birds:

swimsAndFlys :: Bird Std -> Bool
swimsAndFlys Duck    = canFly Duck
swimsAndFlys Emu     = False
swimsAndFlys Seagull = canFly Seagull

In contrast, the next function works fine for any bird we ever include in the class:

tooFatToFly :: Birds a => Bird a -> Bool
tooFatToFly bird = not . canFly

Caveat: We cannot easily put the (extended) elements of a data type into a homogenous structure, such as a list; so, [Emu, Penguin] leads to a type error, as it would require Bird Std and Bird Other to unify. Note that, when using the manual encoding outlined initially, [Emu, Penguin] would be ill-typed, too, but [Emu, Other Penguin] would be fine.

OO Languages

It is well known that algebraic data types in functional languages and classes in OO languages have orthogonal extension properties. In a functional language, it is easy to add a new function scrutinising a data type, but hard to add a new variant. In an OO language, it is easy to add a new variant (i.e., instance) to a class, but hard to add a new member function.

Interestingly, our proposed solution to extensible data types based on associated types inherits the properties of OO languages that we can easily add new variants, but have a closed set of class methods operating on these variants. Functions defined outside the class can operate on the variants defined in one instance (such as swimsAndFlys) above, but not on variants spanning multiple instances.

Type Classes with Existential Types

SeanSeefried rightly points out that we can achieve a very similar encoding with a combination of existential types and type classes using an encoding suggested by [WWW]Konstantin Läufer. The basic definition would be

class Birds a where
  canFly :: a -> Bool

data BIRD = forall a.Birds a => MakeBird a

instance Birds BIRD where
  canFly (MakeBird b) = canFly b

data Bird = Duck | Emu | Seagull

instance Birds Bird where
  canFly Duck    = True
  canFly Emu     = False
  canFly Seagull = True

Now, we can extend this with

data ExtBird = Penguin

instance Birds ExtBird where
  canFly _ = False

This leads to the question of what the difference between the associated types and the existential types solution is. There seem to be two main differences:

In the associated types variant, the type family is denoted Bird a and its members are obtained by suitable instantiation of the type variable a; i.e., we get Bird Std and Bird Other. In the variant using existential types, the type family is denoted BIRD and its members are Bird and ExtBird. More precisely, with associated types, the family/family-member relationship is embedded in the type structure as it corresponds to type variable instantation. However, with existential types that relationship is a meta property of the program that is not reflected in the type structure (i.e., apart from a suitable choice of identifiers the denotation of the family and it's members bears no resemblence). In other words, associated types allow a deep encoding of type families, whereas the encoding with existantial types is shallow.

The disadvantage of the shallow encoding is that we have to coerce between values of the type family BIRD and its members Bird and ExtBird by adding or stripping the data constructor MakeBird. That's not required for the associated types encoding. Nevtherless, the shallow encoding also has an advantage. We can easily compile a homogenous list from an arbitrary collection of BIRDs, we can't do that with the associated type Bird a. Which of this is a more serious concern probably depends on the requirements of the concrete application.

The construction of the data type BIRD (in the existential types variant) uses a type context, namely Birds a on the existential type. In terms of the implementation, this means that the dictionary representing instances of Birds at runtime will be included into every value of that type. If such a value is decomposed - by pattern matching on MakeBird (as done in the BIRD instance of Birds) - it is this dictionary which is used when class methods are applied to the value embedded in MakeBird. This means that any optimisations that the compiler performs to use type specialisation to eliminate the overhead of dictionary passing will be in vain for values embedded in MakeBird. This is not the case when associated types are used, but whether that's a concern will again depend on the concrete application.

Please email comments to Manuel Chakravarty. [Last update 16 Feb 5]