Abstraction is one of the most ubiquitous concepts in Computer Science, and yet it is also one of the most poorly taught[1]. Programmers (and even computer scientists) often confuse semantic notions of abstraction and refinement in terms of models with the syntactic notions of abstraction that exist within a particular programming language. In this post, I'll give an overview of these two different notions of abstraction, and how they might be related.
Let's be clear: abstraction is not about indirection, nor is it the process of giving things names, nor is it even the packaging of code into reusable modules. Informally, abstraction is merely the elimination of detail.
When we discuss the semantics of a particular program, we are really discussing what is called the semantic model of a program. A model is a mathematical object that captures the interesting aspects of what the program means. While there are a variety of models for various domains, languages, and objectives, a common choice for a semantic model of an imperative program is some kind of state transformer, which describes the transitions possible from a given initial state to some final state(s).
To eliminate detail on a model such as this requires a way to be imprecise about some aspects of a program – usually, this takes the form of non-determinism. Non-determinism can be hard for beginners to grasp, but it typically has to be employed when modelling real programs. For example, suppose we had a greeting program that behaved differently depending on the physical location of the computer[2]:
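A minimal sketch of such a program (my own illustration; the `country` value stands in for some assumed geolocation lookup) might be:

```python
# Hypothetical sketch: the greeting depends on where the machine is.
# In a real program, `country` would come from some geolocation lookup.
def greet(country: str) -> str:
    if country == "France":
        return "Bonjour!"
    else:
        return "Hello!"
```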
If we wanted to mathematically model the behaviour of this program, it would be frightfully inconvenient to include the geography of Earth and the computer's physical location in our model. That's where non-determinism comes in. If we abstract away from the geographical details, and instead regard the program as choosing between the two options based on some unspecified criteria, then we can get away with modelling less, at the cost of some detail:
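Concretely, one way to picture the abstracted program (again a sketch of my own) is as a function returning the set of all outcomes the unspecified choice could produce:

```python
# After abstracting away geography, the program is modelled as a
# non-deterministic choice: the set of all outcomes it might produce.
def greet_abstract() -> set:
    return {"Hello!", "Bonjour!"}
```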
Such underspecified conditionals are usually called non-deterministic choice; the choice between two programs P and Q is written simply as P ⊓ Q.
While we tend to view our physical machines as deterministic automata, the state upon which each decision is deterministically made includes a number of external things which are tedious to model mathematically. We can also use non-determinism to ignore details that we don't care about for our particular domain – a common example is memory allocation, where it is very convenient (for some programs) to think of memory as infinite, and allocation as an operation that can potentially fail, without specifying exactly when and how this failure can occur. This is normally modelled as a non-deterministic choice between successful allocation and a failure state.
In a total, deterministic setting, we might model the semantics of a program as a total function – given an initial state, there will be exactly one final state determined entirely by the initial state. But, with non-determinism, each use of the choice operator potentially doubles the number of final states[3]. So, with non-determinism in our language, the semantics of a program are given as a binary relation on states: a mapping from initial states to every possible final state. For our purposes, we will define a state as just a mapping from variable names to their values. We shall call the set of all states Σ.
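To make this concrete, here is a small illustrative sketch (names of my own invention), representing states as frozen mappings and the semantics of a single assignment as a set of state pairs:

```python
# A state is a mapping from variable names to integers; we freeze it
# into a sorted tuple of pairs so that states can be stored in sets.
def state(**vars):
    return tuple(sorted(vars.items()))

# The semantics of "x := x + 1" over a given universe of states:
# a set of (initial, final) pairs -- here, exactly one final state each.
def incr_x(universe):
    rel = set()
    for s in universe:
        d = dict(s)
        d["x"] += 1
        rel.add((s, state(**d)))
    return rel

universe = [state(x=0), state(x=1), state(x=2)]
sem = incr_x(universe)
```

Because this program is deterministic, the relation pairs each initial state with exactly one final state; a non-deterministic program would pair some initial state with several.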
A Toy Language
Before we go any further, let's define a little language that we can use for our programs. For simplicity, we will assume that all our variables contain integers. First I'll introduce the syntax, and then I'll discuss the semantics of each form separately.
Here we use b and e to denote simple boolean propositions and arithmetic expressions respectively. These expressions may mention our program variables, so we will assume the existence of a simple semantics for them. Arithmetic expressions are interpreted as a function that, given a state, will produce a resultant integer:

⟦e⟧ : Σ → ℤ
For boolean propositions, the semantics are simply the set of states where the proposition holds:

⟦b⟧ ⊆ Σ
As mentioned in the previous section, the semantics of a given program P will be a binary relation on states:

⟦P⟧ ⊆ Σ × Σ
For an assignment statement x := e, the final state is the same as the initial state, save that the updated variable is replaced with the result of evaluating the expression with respect to the initial state:

⟦x := e⟧ = { (σ, σ[x ↦ ⟦e⟧ σ]) | σ ∈ Σ }

Here σ[x ↦ v] denotes the state σ updated so that x maps to v.
For non-deterministic choice, seeing as ⟦P⟧ contains all the possible state transitions of P, and ⟦Q⟧ contains all the possible state transitions of Q, the semantics of P ⊓ Q is just their union:

⟦P ⊓ Q⟧ = ⟦P⟧ ∪ ⟦Q⟧
We also have a familiar sequential composition operator, written as a semicolon as in P ; Q, which behaves much like the semicolon in C and Pascal: first execute P, and then subsequently execute Q. Formally, this means that a transition can only be made through P ; Q if there exists an intermediate state resulting from P that leads to the final state via Q:

⟦P ; Q⟧ = ⟦P⟧ ⨾ ⟦Q⟧
Where ⨾ is an operator for forward composition of relations, defined as:

R ⨾ S = { (σ, σ″) | ∃σ′. (σ, σ′) ∈ R ∧ (σ′, σ″) ∈ S }
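As an executable illustration (my own), forward composition can be written directly as a set comprehension:

```python
# Forward composition: (a, c) is in compose(r, s) precisely when some
# intermediate b satisfies (a, b) in r and (b, c) in s.
def compose(r, s):
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

r = {(1, 2), (2, 3)}
s = {(2, 10), (3, 20)}
```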
We also have guards, written [b], which are programs that do not change the state, but only permit execution when the given boolean condition holds:

⟦[b]⟧ = { (σ, σ) | σ ∈ ⟦b⟧ }
We can use the above building blocks to regain the familiar if statement:

if b then P else Q ≜ ([b] ; P) ⊓ ([¬b] ; Q)
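Here is an illustrative sketch of that encoding (my own, with invented helper names), interpreting a small conditional as guards and a union over a toy state space:

```python
# Sketch: encoding "if x > 0 then y := 1 else y := -1"
# with guards and non-deterministic choice over a tiny state space.
def compose(r, s):
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def guard(pred, universe):
    # A guard relates each state satisfying the predicate to itself.
    return {(s, s) for s in universe if pred(s)}

def set_y(value, universe):
    # Semantics of "y := value": keep x, overwrite y.
    return {(s, (("x", dict(s)["x"]), ("y", value))) for s in universe}

universe = [(("x", -1), ("y", 0)), (("x", 1), ("y", 0))]
pos = lambda s: dict(s)["x"] > 0

# if b then P else Q  =  ([b]; P) choice ([not b]; Q), choice = union
ite = compose(guard(pos, universe), set_y(1, universe)) \
    | compose(guard(lambda s: not pos(s), universe), set_y(-1, universe))
```

Note that although we built it from non-deterministic pieces, the guards are exhaustive and mutually exclusive, so the resulting relation is deterministic: one outcome per initial state.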
Exercise: Devise a direct semantic definition for if statements. Prove that your semantics are equivalent to that of the translation into non-deterministic choice and guards.
Lastly, in any real programming language, we need some mechanism for loops or recursion. For our toy language, we add the very simple Kleene star, written P*, which runs P a non-deterministic number of times. A good intuition is to think of this recursive expansion[4]:

P* = skip ⊓ (P ; P*)
Semantically, this is the reflexive, transitive closure of the semantics of P:

⟦P*⟧ = ⋃_{n ≥ 0} ⟦P⟧ⁿ
Where superscripting a relation denotes n-fold self-composition:

R⁰ = id
Rⁿ⁺¹ = R ⨾ Rⁿ
Here id is the identity relation, i.e. id = { (σ, σ) | σ ∈ Σ }.
We can recover the traditional while loop using our Kleene star and some carefully placed guards: one in the loop body, to ensure the loop is only run while the guard is true; and one after the loop, to ensure that the loop only finishes when the guard is false:

while b do P ≜ ([b] ; P)* ; [¬b]
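The same machinery can be sketched executably (my own illustration): the star is a reflexive-transitive closure, computed here as a fixpoint, which terminates because the toy state space is finite:

```python
# Sketch: recovering "while x > 0 do x := x - 1" from the
# Kleene star and guards, over states x in {0, 1, 2, 3}.
def compose(r, s):
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def guard(pred, universe):
    return {(s, s) for s in universe if pred(s)}

def star(r, universe):
    # Reflexive-transitive closure, computed as a fixpoint
    # (fine here, since the state space is finite).
    closure = {(s, s) for s in universe}   # the identity relation
    while True:
        bigger = closure | compose(closure, r)
        if bigger == closure:
            return closure
        closure = bigger

universe = [(("x", n),) for n in range(4)]
positive = lambda s: dict(s)["x"] > 0
decr = {(s, (("x", dict(s)["x"] - 1),)) for s in universe if positive(s)}

body = compose(guard(positive, universe), decr)   # [x > 0]; x := x - 1
loop = compose(star(body, universe), guard(lambda s: not positive(s), universe))
```

The trailing guard is what turns the "run any number of times" star into a genuine loop: only runs that happen to stop exactly when x reaches 0 survive.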
Exercise: Devise a direct semantic definition for while loops. Prove that your semantics are equivalent to that of the translation into the Kleene star and guards.
Getting back to abstraction
When we transformed our simple greeting program into a nondeterministic choice, we reduced the size of our state model, but doubled the number of possible outcomes for a given initial state. Instead of being able to determine which greeting would be printed, we must now account for both greetings.
This means that the more abstract a program is, the bigger its semantic relation is. We can say that a program Q is an abstraction of a program P iff:

⟦P⟧ ⊆ ⟦Q⟧
Equivalently, we also say that P is a refinement of Q. Refinement is the inverse of abstraction.
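Returning to the greeting example, the abstraction relationship can be checked as a plain subset test (a sketch of my own):

```python
# The deterministic greeting has one outcome from the initial state;
# the abstracted version has two. Abstraction = superset of transitions.
concrete = {("start", "printed Hello!")}
abstract = {("start", "printed Hello!"), ("start", "printed Bonjour!")}

def abstracts(q, p):
    # q is an abstraction of p iff p's transitions are a subset of q's
    return p <= q
```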
Because refinement is just the subset relation on semantics, it forms a bounded lattice, giving us a greatest and least element. The greatest element is the relation that contains all state transitions:

Σ × Σ
This greatest element is an abstraction of every program, because it is so non-specific that it contains every possible outcome the program could produce.
Conversely, the least element is the relation that does not contain any transitions – representable syntactically with the [false] guard, or the infinite loop:

∅
Specifications as abstractions
One common use for abstraction in computer programming is for the specification, verification and derivation of programs.
If we define a specification of a program as a pair of a pre- and a post-condition, we could specify something like a factorial program as follows:

⟨n ≥ 0, r = n!⟩
Here we are using specification statements of the form ⟨φ, ψ⟩, where φ, the pre-condition, and ψ, the post-condition, are referred to collectively as assertions.
The specification statement describes a program that, assuming that the pre-condition is true of the initial state, will ensure that the post-condition is true of the final state. Exactly how the program gets from the initial state to the final state is left unspecified. We can make these specification statements bona-fide statements in our toy language, and give them a semantics:

⟦⟨φ, ψ⟩⟧ = { (σ, σ′) | σ ∈ ⟦φ⟧ ⟹ σ′ ∈ ⟦ψ⟧ }
Our semantics for a specification statement include every possible transition that satisfies the specification. Therefore, our specification is an abstraction of every possible implementation of that specification.
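An executable sketch of this idea (my own illustration): the spec's relation is every transition compatible with the pre/post pair, and an implementation refines it precisely when its relation is a subset:

```python
# Sketch: the relation denoted by a specification <pre, post> contains
# every transition that maps a pre-state to a post-state (transitions
# from non-pre-states are unconstrained).
def spec(pre, post, universe):
    return {(s, t) for s in universe for t in universe
            if (not pre(s)) or post(t)}

universe = [(("x", n),) for n in range(4)]
# Specify: "assuming x >= 1, establish x == 0".
spec_rel = spec(lambda st: dict(st)["x"] >= 1,
                lambda st: dict(st)["x"] == 0, universe)

# One implementation: "x := 0", which works from every initial state.
impl = {(st, (("x", 0),)) for st in universe}
```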
A common technique for the derivation of programs is to build a syntactic refinement calculus, allowing us to incrementally derive a program from its specification into a less and less abstract version, until we at last have a version suitable for implementation. This process proceeds via formally justified refinement rules. Because they are proven to be sound, a correct application of these rules from the specification yields a correct program by construction.
Let us define a miniature refinement calculus for use with our toy language. For a calculus that is actually useful for more real-world programming scenarios, I recommend consulting Carroll Morgan's great book, Programming from Specifications, a copy of which is available online.
To start with, we will define a syntactic abstraction relation, ⊑, which is defined like this[5]:

P ⊑ Q ⟺ ⟦Q⟧ ⊆ ⟦P⟧

That is, P ⊑ Q means that Q refines P.
Now, we can give rules for introducing each of our language constructs[6]:

⟨φ, ψ⟩ ⊑ x := e                 provided φ ⟹ ψ[e/x]
⟨φ, ψ⟩ ⊑ skip                   provided φ ⟹ ψ
⟨φ, ψ⟩ ⊑ ⟨φ, χ⟩ ; ⟨χ, ψ⟩
⟨φ, ψ⟩ ⊑ ⟨φ, ψ⟩ ⊓ ⟨φ, ψ⟩
⟨φ, ψ⟩ ⊑ [b] ; ⟨b ∧ φ, ψ⟩
⟨φ, φ⟩ ⊑ (⟨φ, φ⟩)*
Exercise: By translating the above rules into semantics, show that the rules are sound (that is, that the semantics of the RHS is a subset of the semantics of the LHS).
We can also derive rules for our trusty if statements and while loops:

⟨φ, ψ⟩ ⊑ if b then ⟨b ∧ φ, ψ⟩ else ⟨¬b ∧ φ, ψ⟩
⟨φ, ¬b ∧ φ⟩ ⊑ while b do ⟨b ∧ φ, φ⟩
Exercise: Show that these rules are indeed derivable, using the translations provided in the previous section.
Lastly, it is also sometimes necessary to apply logical reasoning to transform assertions during the derivation process. The consequence rule, given below, allows us to swap out our assertions for more convenient ones, provided they remain a refinement of the original assertions:

⟨φ, ψ⟩ ⊑ ⟨φ′, ψ′⟩               provided φ ⟹ φ′ and ψ′ ⟹ ψ
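The soundness of this rule can be spot-checked semantically. In this sketch (my own), weakening the precondition and strengthening the postcondition yields a smaller (more refined) relation:

```python
# Semantic spot-check of the consequence rule on a tiny state space.
def spec(pre, post, universe):
    return {(s, t) for s in universe for t in universe
            if (not pre(s)) or post(t)}

universe = list(range(5))
orig_spec    = spec(lambda s: s >= 2, lambda t: t >= 1, universe)
refined_spec = spec(lambda s: s >= 1, lambda t: t >= 2, universe)
# s >= 2 implies s >= 1 (weaker pre), and t >= 2 implies t >= 1
# (stronger post), so refined_spec refines orig_spec.
```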
Using our refinement calculus, let's derive an implementation for our factorial specification:
Firstly, we have to split the specification into two parts: the first to initialise variables and establish the loop invariant (r = i! ∧ i ≤ n), and the second to actually contain the loop:

⟨n ≥ 0, r = n!⟩ ⊑ ⟨n ≥ 0, r = i! ∧ i ≤ n⟩ ; ⟨r = i! ∧ i ≤ n, r = n!⟩
Next, we must use the consequence rule to get the spec statement into the right form for use with the while loop rule. After introducing the loop, we can fill in the body a bit by incrementing the counter:

⟨r = i! ∧ i ≤ n, r = n!⟩
  ⊑ ⟨r = i! ∧ i ≤ n, r = i! ∧ i ≤ n ∧ ¬(i ≠ n)⟩
  ⊑ while i ≠ n do ⟨i ≠ n ∧ r = i! ∧ i ≤ n, r = i! ∧ i ≤ n⟩
  ⊑ while i ≠ n do (i := i + 1 ; ⟨r = (i − 1)! ∧ i ≤ n, r = i! ∧ i ≤ n⟩)
Here we must use the consequence rule in order to get the meat of the loop body into the right form for the assignment rule:

⟨r = (i − 1)! ∧ i ≤ n, r = i! ∧ i ≤ n⟩ ⊑ r := r × i
Lastly, we just initialise our variables in the obvious way to ensure the loop invariant holds initially:

⟨n ≥ 0, r = i! ∧ i ≤ n⟩ ⊑ i := 0 ; r := 1

Putting it all together, we obtain:

i := 0 ; r := 1 ; while i ≠ n do (i := i + 1 ; r := r × i)
Treating specifications as abstractions of their implementations is a powerful idea. It gives a semantic framework for the gradual, step-by-step derivation of a correct program from its correctness definition.
Moreover, it shows that a common informal definition of abstraction that is bandied about by programmers – the separation of a specification from an implementation – is just an instance of the more general notion of semantic abstraction. If we interpret types as a particularly weak form of specification, then we can view type systems as an instance of this technique as well[7].
One of the most common techniques for managing complexity in software engineering is that of data abstraction. Data abstraction is the process of hiding some particular piece of state behind an interface or signature of abstract operations. This allows for a neat separation of concerns. For example, consider this program that only succeeds if a string of parentheses and brackets is balanced:
This version makes use of an abstract stack type and four operations: init, an initialiser which sets up an empty stack; is_empty, a simple predicate which is true iff the stack is empty; push, the familiar operation that adds a new element to the top of the stack; and pop, the inverse of push, which removes the top element from the stack and returns it. Certainly, the version making use of abstract operations is far more readable than the concrete alternative, which swaps the abstract stack for an (infinite-sized) array and an index to the top of the stack[8]:
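Here is a sketch of what the two versions might look like (my own reconstruction; the abstract operations are realised with Python list methods, and the concrete version uses a fixed-size array with a top index):

```python
# Abstract version: a Python list stands in for the abstract stack,
# with append/pop playing the roles of push/pop.
def balanced(text):
    stack = []                       # init: empty stack
    pairs = {")": "(", "]": "["}
    for c in text:
        if c in "([":
            stack.append(c)          # push
        elif c in ")]":
            if not stack or stack.pop() != pairs[c]:
                return False
    return not stack                 # succeed only if is_empty

# Concrete version: a (large) fixed array plus a top-of-stack index.
def balanced_concrete(text):
    arr, top = [None] * 1024, 0
    pairs = {")": "(", "]": "["}
    for c in text:
        if c in "([":
            arr[top] = c
            top += 1
        elif c in ")]":
            if top == 0 or arr[top - 1] != pairs[c]:
                return False
            top -= 1
    return top == 0
```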
Mathematically justifying the above translation is a process called data refinement, and a variety of techniques exist for it. One of the simplest is Reynolds' method. Starting with the abstract program, it proceeds in the following steps:
Add variables to represent the concrete state (in this case, an array arr and an index top).
Define a coupling invariant – an assertion that relates the abstract and the concrete variables. In our example, if we assume a stack model given by the following grammar:

s ::= nil | push(x, s)
Then the coupling invariant relating the abstract stack s to the concrete variables arr and top can be defined as a recursive predicate like so:

CI(nil, arr, top)        ⟺ top = 0
CI(push(x, s), arr, top) ⟺ top > 0 ∧ arr[top − 1] = x ∧ CI(s, arr, top − 1)
For each operation that writes to abstract variables, such as push, add code to perform the corresponding updates to the concrete variables, such that the coupling invariant is re-established. This step can be formally justified using a program logic such as Hoare logic, which is analogous to the refinement calculus used above, except designed for post-hoc verification rather than derivation of correct programs from specifications.
Each operation that reads from abstract variables, such as is_empty, is replaced with code that reads the same information from the concrete variables. This step should be justified as a direct consequence of the coupling invariant.
With all abstract read operations replaced with concrete ones, the abstract write operations are now completely superfluous, and can be removed.
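The coupling invariant at the heart of these steps can be sketched executably. Here is a version of my own, using a Python list for the abstract stack and an assumed array arr with index top for the concrete state:

```python
# Sketch of a coupling invariant relating an abstract stack s (a list)
# to a concrete array plus top-of-stack index: the first `top` cells of
# the array hold exactly the elements of s.
def coupled(s, arr, top):
    return top == len(s) and all(arr[i] == s[i] for i in range(top))

# Abstract and concrete pushes, performed in lockstep, preserve it:
s, arr, top = [], [None] * 8, 0
before = coupled(s, arr, top)
s.append("(")        # abstract push
arr[top] = "("       # concrete push...
top += 1             # ...and bump the index
after = coupled(s, arr, top)
```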
Following the above steps with our original stack-based program will yield the concrete program we devised in terms of arrays. So the method appears to work, but what do data abstraction and data refinement have to do with the notions of abstraction we saw in the previous section?
Semantic Data Types
To be able to talk about data abstraction in terms of semantics, we need a semantic model of a data type. Formally, we consider a data type to consist of:
- A set of representation variables, containing the data of the data type. We write Σ⁺ to represent the state space extended with these additional variables.
- An initialiser (or a constructor if you prefer), init ⊆ Σ × Σ⁺, which augments the state with a new instance of our data type, introducing our representation variables.
- A finaliser (or a destructor if you prefer), fin ⊆ Σ⁺ × Σ, which eliminates our representation variables from the state.
- For each operation name o, a relation ⟦o⟧ ⊆ Σ⁺ × Σ⁺ – simply the semantics of the operation.
Let's define data types for our abstract stack and our concrete implementation. To make specification easier, we annotate the names of the operations with the external variables they may touch. More elaborate refinement calculi include frames, which make this technique a good deal more rigorous.
For our abstract stack, we never explicitly provide an implementation, merely providing specifications. Because, as we discussed before, specifications are in the same semantic domain as our programs, we can use them to provide our abstract data type.
For the concrete data type, we just take the semantics of the code we use to implement each operation.
With both data types, we can start to devise a definition of abstraction between data types.
Any consumer of our data type, such as the bracket-matching program above, can be viewed as the sequential composition of the initialiser, some sequence of operations, followed by the finaliser. A data type is a refinement of another if all such sequences are a refinement of the corresponding abstract sequence.
Thus, to show that a concrete data type C is a refinement of an abstract data type A, we must show that, for any operation sequence o₁, …, oₙ:

init_C ⨾ ⟦o₁⟧_C ⨾ ⋯ ⨾ ⟦oₙ⟧_C ⨾ fin_C ⊆ init_A ⨾ ⟦o₁⟧_A ⨾ ⋯ ⨾ ⟦oₙ⟧_A ⨾ fin_A
That is, data refinement is "just" program refinement, but for an arbitrary program. Next, we'll look at common ways to prove this statement, and how they generalise syntactic approaches such as Reynolds' method. For a more detailed introduction to this model-oriented version of data refinement, and comparisons to many more refinement techniques, I recommend the great book Data Refinement: Model-Oriented Proof Methods and their Comparison by W. P. de Roever and Kai Engelhardt (who was one of my teachers).
We would like to prove the above subset obligation by induction on the length of the sequence of operations, but the presence of the initialisers and finalisers makes the induction hypothesis useless: it is of the form init_C ⨾ ⟦o₁⟧_C ⨾ ⋯ ⨾ ⟦oₙ₋₁⟧_C ⨾ fin_C ⊆ init_A ⨾ ⟦o₁⟧_A ⨾ ⋯ ⨾ ⟦oₙ₋₁⟧_A ⨾ fin_A, which does not refer to a subexpression of our goal.
One technique to resolve this is so-called downward simulation, where we define a refinement relation R between abstract and concrete states, and split the above obligation into three parts:
The initialiser establishes the refinement relation:

init_C ⊆ init_A ⨾ R
Each operation preserves the refinement relation:

R ⨾ ⟦o⟧_C ⊆ ⟦o⟧_A ⨾ R
Finalisers will converge from R-related states:

R ⨾ fin_C ⊆ fin_A
The second part can be generalised into an analogous theorem about sequences, via a neat induction on the length of the sequence:

R ⨾ ⟦o₁⟧_C ⨾ ⋯ ⨾ ⟦oₙ⟧_C ⊆ ⟦o₁⟧_A ⨾ ⋯ ⨾ ⟦oₙ⟧_A ⨾ R
From here, one can straightforwardly use the first and third lemmas to show that the concrete data type is indeed a refinement of the abstract one. In this way, we remove those pesky initialisers and finalisers so that we can do induction, and then just tack them on again after the induction is complete.
So, for our stack example, what would our refinement relation look like? It turns out to be merely a relational form of our coupling invariant from Reynolds' method: the relation holds between an abstract stack and a concrete array-and-index pair exactly when the coupling invariant holds of them.
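The per-operation simulation condition can be spot-checked exhaustively on tiny stacks; here is a sketch of my own (invented names, checking only a push of the value 1):

```python
# Downward-simulation spot-check for "push 1" over a few tiny stacks.
# R is the refinement relation: the concrete (array, top) pair
# represents exactly the abstract stack.
def R(abs_stack, conc):
    arr, top = conc
    return top == len(abs_stack) and arr[:top] == list(abs_stack)

def abs_push(s):
    return s + (1,)

def conc_push(conc):
    arr, top = conc
    new = arr.copy()
    new[top] = 1
    return (new, top + 1)

# Condition: whenever R relates the states, it still relates them
# after performing the push on both sides.
preserved = all(
    R(abs_push(a), conc_push((list(a) + [None] * (4 - len(a)), len(a))))
    for a in [(), (1,), (1, 1)]
)
```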
In fact, all of Reynolds' method is just an instance of this downward simulation technique.
It turns out that downward simulation, and thus Reynolds' method, is not complete, in that one can construct a pair of data types where one refines the other, but where no such refinement relation can be constructed between them. Upward simulation, the mirror image of downward simulation, instead relies on an abstraction relation and performs induction from the back of the sequence rather than the front. The combination of upward and downward simulation is complete. The proof of this is presented in de Roever and Engelhardt's book.
Many programming languages provide features that are commonly called abstraction. The most common is the module, consisting of one or more types (usually left abstract in the sense that their implementations are hidden) coupled with operations on those types. We can consider a module's signature or interface to be an abstract data type in the semantic sense, where any type-correct implementation can be considered a refinement. In this sense, module systems in programming languages make it substantially easier to do the kind of data abstraction I discuss above, as both abstract and concrete versions are in a machine-readable structure. However, the presence of a module system is neither necessary nor sufficient for data abstraction to be possible.
A perhaps more common use of the word abstraction in the Haskell community refers to λ-abstraction[9]. Seeing as λ-calculus-based languages have a very different semantic domain, based on Scott domains, I can't directly relate the notion of λ-abstraction to the kind of semantic abstraction I present here. I'd be very interested to see an explanation of whether there is a solid connection between the very syntactic notion of abstraction we see in functional languages, where "abstraction" essentially refers to parameterisation, and the kinds of semantic abstractions we see elsewhere.
If you enjoyed this article and you're a UNSW student, note that this article is a whirlwind tour of the second-year COMP2111 course, taught by Kai Engelhardt along with yours truly. The course goes into substantially more detail on the specification and derivation components, including a detailed study of Hoare logic and Carroll Morgan's refinement calculus. Feel free to enrol if you're interested[10].
[1] In my undergraduate years, I remember thinking that data abstraction had something to do with header files or object-oriented programming.
[2] Further internationalisation is left as an exercise.
[3] This is why deterministically simulating a non-deterministic program takes exponential time in the worst case.
[4] Here skip is just sugar for the program that does not change the state and always executes successfully, equivalent to the trivially true guard [true], or an assignment x := x.
[5] Because all of the semantic relational operators (∪, ⨾, and so on) are monotone with respect to ⊆, this relation enjoys all the usual congruence properties. You can refine a small part of a program, and the resultant program will be a refinement of the original whole program.
[6] The notation ψ[e/x] is a substitution, substituting the expression e for the variable x.
[8] Doing the refinement to a dynamically-expanding array is too much pain for this article, but feel free to do it as an exercise.
[9] A lot of Haskell programmers don't seem to value semantic abstraction anyway. Perhaps this is a case of anti-modular language features such as type classes making real abstraction fall out of favour. Or perhaps Haskell is already so abstract there's not much point in further abstraction.
[10] Assuming UNSW hasn't gone to hell, the course isn't cancelled, and the teaching staff aren't driven out due to poor management – a big assumption.