January 04, 2005
Experiences using package ghc
package ghc seems, at first glance, like a great way to save a whole bunch of programming effort, but it's not as simple as that. I'll illustrate this through the use of an example. I wanted to write a tool that would, for a given Haskell module, output a call-graph. That is, a graph that showed what functions each function depended on.
The first problem I encountered was that of needing to write a parser. This begged the question, why should I write a parser when there is clearly one in GHC already? This was what made be first consider using package ghc. A little more though revealed it was just what I was looking for. I want to use this call-graph tool to better understand the tangled GHC sources. And since GHC contains the only extant parser that is guaranteed to understand all the GHC extensions (which are often used in its code base) it was the ideal candidate.
Thus, having steeled myself for some code diving, I attempted to use the parseModule function from package ghc. It is telling that although I got this to work without too much effort it was not entirely straightforward. I expected parseModule to have a type something like:
String -> HsModule RdrName
But instead I had to create a PState (parser state) data structure first and feed that into the parser. This necessitated the create of something called a StringBuffer which was essentially an array holding the contents of the source file. Many modules had to be imported. This is not really worth whinging about.
Now I had a data structure, of type HsModule RdrName, with which to construct the call graph. At first glance it seems like a fairly easy transformation over the data structure. The general idea is to first, find a function, then traverse through its body looking for calls to other functions. But what happens if there is a call to a function but it's defined in a let expression of where clause? Also, at some point you're going to have to traverse through sub-declarations - if they call other functions then this information must be propagated to the top-level.
I also had a hunch that GHC already did a similar analysis in creating so-called binding groups, which are collections of mutually recursive declarations. This led me to scour the sources looking for functions I could re-use. Imagine my surprise when I unearthed a def-use analysis! This seemed to be exactly what I wanted and so I set about writing a function to retrieve this information from the abstract syntax tree yielded by the parser.
I found some functions somewhere in the middle of the renamer that looked useful, but because of their position (i.e. in the middle) they weren't exported by package ghc. My only recourse was to recompile ghc and export them or use the top-level renaming function tcRnModule. But this spawned way too much computation, including the loading of interface files for the imported modules! If I was to use this function I would be completely breaking the spirit of the tool - it's simply meant to parse a single module - not to mention the fact that this tool was no longer stand-alone; it needed access to GHC packages (libraries).
So these are my conclusions. Package ghc is useful but it must be used with care. What modularity there is in the package seems to be divided along the phase boundaries - i.e. The renamer, type checker, desugarer and core to core passes work pretty much independently of each other. Yet within a particular phase the code is heavily tangled. There seems to be no easy way to use parts of the code in the middle of a phase.
This is because many of the phases start with the initialisation of data structures which are then threaded through the functions of the phase. They are modified as they pass through. If you do initialise such a data structure but then start using a function halfway through the phase there is no guarantee that the function will yield correct results as often it depends on what was already in the data structure. Thus, to achieve any reuse of the code one must first understand the code. Lack of documentation makes this difficult. Normally I'd launch into a tirade about APIs and documentation quality, but this seems unfair in this instance - GHC was never intended to be an API.
Summary:
- Modularity good along phase boundaries. Bad within phase.
- Documentation poor. This makes sense - it was never meant to be an API.