# What is abstraction anyway?

Abstraction is one of the most ubiquitous concepts in Computer Science, and yet it is also one of the most poorly taught^{1}. Programmers (and even computer scientists) often confuse *semantic* notions of abstraction and refinement in terms of *models* with the *syntactic* notions of abstraction that exist within a particular programming language. In this post, I’ll give an overview of these two different notions of abstraction, and how they might be related.

Let’s be clear: abstraction is not about indirection, nor is it the process of giving things names, nor is it even the packaging of code into reusable modules. Informally, abstraction is merely the *elimination of detail*.

## Semantic Abstraction

When we discuss the semantics of a particular program, we are really discussing what is called the semantic *model* of a program. A model is a mathematical object that captures the interesting aspects of what the program *means*. While there are a variety of models for various domains, languages, and objectives, a common choice for a semantic model of an imperative program is some kind of *state transformer*, which describes the *transitions* possible from a given initial state to some final state(s).

To eliminate detail on a model such as this requires a way to be imprecise about some aspects of a program — usually, this takes the form of non-determinism. Non-determinism can be hard for beginners to grasp, but it typically has to be employed when modelling real programs. For example, suppose we had a greeting program that differed depending on the physical location of the computer^{2}:
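The example program itself isn't reproduced here; a minimal sketch of the idea in Python, with a hypothetical `current_country()` helper standing in for reading the machine's physical location, might look like:

```python
def current_country():
    # Hypothetical stand-in for querying the computer's physical location.
    return "Australia"

def greet():
    # The greeting produced depends on where the computer physically is.
    if current_country() == "Australia":
        return "G'day!"
    else:
        return "Hello!"
```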

If we wanted to mathematically model the behaviour of this program, it would be frightfully inconvenient to include the geography of Earth and the computer’s physical location in our model. That’s where non-determinism comes in. If we *abstract* away from the geographical details, and instead regard the program as choosing between the two options based on some *unspecified criteria*, then we can get away with modelling less, at the cost of some detail:
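The abstracted version likewise isn't reproduced here; one way to sketch it is to represent the non-deterministic program directly by the *set* of greetings it may produce (an illustrative encoding, not standard notation):

```python
def greet_abstract():
    # Non-deterministic choice: the criterion for picking a branch is
    # left unspecified, so the meaning is the set of possible outcomes.
    return {"G'day!", "Hello!"}
```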

Such underspecified conditionals are usually called *non-deterministic choice*. The choice between two programs $P$ and $Q$ is written simply as $P \sqcap Q$.

While we tend to view our physical machines as deterministic automata, the state upon which each decision is deterministically made includes a number of external things which are tedious to model mathematically. We can also use non-determinism to ignore details that we don’t care about for our particular domain — a common example is memory allocation, where it is very convenient (for some programs) to think of memory as infinite, and allocation as an operation that can potentially fail, without specifying exactly when and how this failure can occur. This is normally modelled as a non-deterministic choice between successful allocation and a failure state.

In a total, deterministic setting, we might model the semantics of a program as a total function — given an initial state, there will be exactly one final state determined entirely by the initial state. But, with non-determinism, each use of the choice operator potentially doubles the number of final states^{3}. So, with non-determinism in our language, the semantics of a program are given as a binary *relation* on states: a mapping from initial states to *every possible* final state. For our purposes, we will define a *state* as just a mapping from variable names to their values. We shall call the set of all states $\Sigma$.
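To make this concrete, here is one way to encode states and relational semantics in Python (an illustrative encoding, not from the original text): a state is a mapping from variable names to integers, and a program's semantics is a set of (initial, final) state pairs. Since dicts aren't hashable, we represent each state as a frozenset of its items.

```python
# A state maps variable names to integer values. To store states inside
# sets (relations), we use an immutable representation: a frozenset of
# (name, value) pairs.
def state(**vars):
    return frozenset(vars.items())

def lookup(sigma, x):
    return dict(sigma)[x]

def update(sigma, x, v):
    d = dict(sigma)
    d[x] = v
    return frozenset(d.items())

# The semantics of the assignment x := x + 1 over a tiny state space,
# as a binary relation: a set of (initial, final) state pairs.
states = [state(x=n) for n in range(3)]
incr = {(s, update(s, "x", lookup(s, "x") + 1)) for s in states}
```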

## A Toy Language

Before we go any further, let’s define a little language that we can use for our programs. For simplicity, we will assume that all our variables contain integers. First I’ll introduce the syntax, and then I’ll discuss the semantics of each form separately:

$$P \;::=\; x := e \;\mid\; P \sqcap P \;\mid\; P \mathbin{;} P \;\mid\; [b] \;\mid\; P^{\star}$$

Here we use $b$ and $e$ to denote simple boolean propositions and arithmetic expressions respectively. These expressions may mention our program variables, so we will assume the existence of a simple semantics for them. Arithmetic expressions are interpreted as a function that, given a state, will produce a resultant integer:

$$\llbracket e \rrbracket : \Sigma \to \mathbb{Z}$$

For boolean propositions, the semantics are simply the set of states where the proposition holds:

$$\llbracket b \rrbracket \subseteq \Sigma$$

As mentioned in the previous section, the semantics of a given program will be a binary *relation* on states:

$$\llbracket P \rrbracket \subseteq \Sigma \times \Sigma$$

For an assignment statement, the final state is the same as the initial state, save that the updated variable is replaced with the result of evaluating the expression with respect to the initial state:

$$\llbracket x := e \rrbracket = \{\, (\sigma,\ \sigma[x \mapsto \llbracket e \rrbracket\,\sigma]) \mid \sigma \in \Sigma \,\}$$

For non-deterministic choice, seeing as $\llbracket P \rrbracket$ contains all the possible state transitions of $P$, and $\llbracket Q \rrbracket$ contains all the possible state transitions of $Q$, the semantics of $P \sqcap Q$ is just their union:

$$\llbracket P \sqcap Q \rrbracket = \llbracket P \rrbracket \cup \llbracket Q \rrbracket$$

We also have a familiar *sequential composition* operator, written as a semicolon as in $P; Q$, which behaves much like the semicolon in C and Pascal: first execute $P$, and then subsequently execute $Q$. Formally, this means that a transition can only be made through $P; Q$ if there exists an intermediate state resulting from $P$ that leads to the final state via $Q$:

$$\llbracket P; Q \rrbracket = \llbracket P \rrbracket \fatsemi \llbracket Q \rrbracket$$

Where $\fatsemi$ is an operator for forward-composition of relations, defined as:

$$R \fatsemi S = \{\, (\sigma, \sigma'') \mid \exists \sigma'.\ (\sigma, \sigma') \in R \wedge (\sigma', \sigma'') \in S \,\}$$
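The relational operations just described — choice as union, and forward composition — translate directly into code; a sketch over integer-valued states, using the set-of-pairs view of relations (illustrative code, not from the original text):

```python
def choice(R, S):
    # Non-deterministic choice: the union of the two relations.
    return R | S

def compose(R, S):
    # Forward composition: (a, c) is a transition of R ; S iff some
    # intermediate state links an R-transition to an S-transition.
    return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

# Example: P adds one to the state, Q doubles it.
P = {(n, n + 1) for n in range(5)}
Q = {(n, 2 * n) for n in range(10)}
seq = compose(P, Q)   # first add one, then double
```

For instance, `seq` relates 2 to 6: first `P` takes 2 to 3, then `Q` takes 3 to 6.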

We also have *guards*, written $[b]$, which are programs that do not change the state, but only permit execution when the given boolean condition holds:

$$\llbracket [b] \rrbracket = \{\, (\sigma, \sigma) \mid \sigma \in \llbracket b \rrbracket \,\}$$

We can use the above building blocks to regain the familiar *if* statement:

$$\mathbf{if}\ b\ \mathbf{then}\ P\ \mathbf{else}\ Q \;\triangleq\; ([b];\ P) \sqcap ([\neg b];\ Q)$$

**Exercise**: Devise a direct semantic definition for *if* statements. Prove that your semantics are equivalent to that of the translation into non-deterministic choice and guards.

Lastly, in any real programming language, we need some mechanism for loops or recursion. For our toy language, we add the very simple *Kleene star*, written $P^{\star}$, which runs $P$ a nondeterministic number of times. A good intuition is to think of this recursive expansion^{4}:

$$P^{\star} = (P;\ P^{\star}) \sqcap \mathbf{skip}$$

Semantically, this is the *reflexive, transitive closure* of the semantics of $P$:

$$\llbracket P^{\star} \rrbracket = \bigcup_{n \geq 0} \llbracket P \rrbracket^{n}$$

Where superscripting a relation denotes self-composition:

$$R^{0} = \mathit{id} \qquad R^{n+1} = R \fatsemi R^{n}$$

Here $\mathit{id}$ is the identity relation, i.e. $\mathit{id} = \{\, (\sigma, \sigma) \mid \sigma \in \Sigma \,\}$.

We can recover the traditional *while* loop using our Kleene star and some carefully placed guards: one in the loop body, to ensure the loop is only run while the guard is true; and one after the loop, to ensure that the loop only finishes when the guard is false:

$$\mathbf{while}\ b\ \mathbf{do}\ P \;\triangleq\; ([b];\ P)^{\star};\ [\neg b]$$
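Guards, the Kleene star as an iterated union, and the while encoding can all be implemented and tested on finite relations; a sketch with integer states (illustrative code, not from the original text):

```python
def compose(R, S):
    # Forward composition of relations (sets of pairs).
    return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

def guard(states, pred):
    # A guard changes nothing, but only permits states satisfying pred.
    return {(s, s) for s in states if pred(s)}

def star(states, R):
    # Reflexive, transitive closure: union of all finite self-compositions.
    closure = {(s, s) for s in states}  # the identity relation
    while True:
        bigger = closure | compose(closure, R)
        if bigger == closure:
            return closure
        closure = bigger

def while_loop(states, pred, body):
    # while b do P  is encoded as  ([b]; P)* ; [not b]
    step = compose(guard(states, pred), body)
    return compose(star(states, step), guard(states, lambda s: not pred(s)))

# Count down to zero: while x > 0 do x := x - 1, over states 0..5.
states = set(range(6))
decr = {(n, n - 1) for n in states if n > 0}
countdown = while_loop(states, lambda n: n > 0, decr)
```

Every initial state is related only to the final state 0, as we'd expect of a countdown loop.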

**Exercise**: Devise a direct semantic definition for *while* loops. Prove that your semantics are equivalent to that of the translation into the Kleene star and guards.

## Getting back to abstraction

When we transformed our simple greeting program into a nondeterministic choice, we reduced the size of our state model, but doubled the number of possible outcomes for a given initial state. Instead of being able to determine which greeting would be printed, we must now account for both greetings.

This means that the more *abstract* a program is, the *bigger* the semantic relation is. We can say that a program $Q$ is an *abstraction* of a program $P$ iff:

$$\llbracket P \rrbracket \subseteq \llbracket Q \rrbracket$$

Equivalently, we also say that $P$ is a *refinement* of $Q$. Refinement is the inverse of abstraction.

Because refinement is just the subset relation on semantics, it forms a bounded lattice, giving us a greatest and a least element. The greatest element is the relation that contains all state transitions:

$$\Sigma \times \Sigma$$

This greatest element is an abstraction of every program, because it is so non-specific that it contains every possible outcome the program could produce.

Conversely, the least element is the relation that does not contain any transitions — representable syntactically with the $[\mathit{false}]$ guard, or the infinite loop $\mathbf{while}\ \mathit{true}\ \mathbf{do}\ \mathbf{skip}$:

$$\emptyset$$

## Specifications as abstractions

One common use for abstraction in computer programming is for the *specification*, *verification* and *derivation* of programs.

If we define a *specification* of a program as a pair of a pre- and a post-condition, we could specify something like a factorial program as follows:
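The original specification is not reproduced here. Written as a pair of pre- and post-condition (the bracket notation and the variable names $n$ and $r$ are assumptions of this sketch), it might read:

```latex
[\, n \geq 0,\ \ r = n! \,]
```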

Here we are using *specification statements* of the form $[\varphi,\ \psi]$, where $\varphi$, the pre-condition, and $\psi$, the post-condition, are referred to collectively as *assertions*.

The specification statement describes a program that, assuming that the pre-condition is true of the initial state, will ensure that the post-condition is true of the final state. Exactly *how* the program gets from the initial state to the final state is left unspecified. We can make these specification statements bona fide statements in our toy language, and give them a semantics:

$$\llbracket [\varphi,\ \psi] \rrbracket = \{\, (\sigma, \sigma') \mid \sigma \in \llbracket \varphi \rrbracket \Rightarrow \sigma' \in \llbracket \psi \rrbracket \,\}$$

Our semantics for a specification statement include every possible transition that satisfies the specification. Therefore, our specification is an *abstraction* of every possible *implementation* of that specification.
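We can check this containment concretely; a sketch with integer states and a hypothetical "squaring" specification (none of this is from the original text):

```python
def spec(states, pre, post):
    # All transitions satisfying the specification: if the initial state
    # satisfies the pre-condition, the final state must satisfy the
    # post-condition. Initial states outside the pre-condition may go
    # anywhere at all.
    return {(s, t) for s in states for t in states
            if not pre(s) or post(s, t)}

states = set(range(-3, 10))
# Specification: assuming x >= 0 initially, finally x holds its square.
squared = spec(states, lambda s: s >= 0, lambda s, t: t == s * s)

# One possible implementation: the deterministic squaring relation.
impl = {(s, s * s) for s in states if s * s in states}
```

Here `impl <= squared` holds: the specification, being more abstract, contains every transition of the implementation, and many more besides.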

A common technique for the derivation of programs is to build a syntactic *refinement calculus*, allowing us to incrementally derive a program from its specification into a less and less abstract version, until we at last have a version suitable for implementation. This process proceeds via formally justified *refinement rules*. Because they are proven to be sound, a correct application of these rules from the specification yields a correct program by construction.

Let us define a miniature refinement calculus for use with our toy language. For a calculus that is actually useful for more real-world programming scenarios, I recommend consulting Carroll Morgan’s great book, *Programming from Specifications*, an online copy of which is available here.

To start with, we will introduce a syntactic abstraction relation, $\sqsupseteq$, which is defined like this^{5}:

$$P \sqsupseteq Q \;\iff\; \llbracket Q \rrbracket \subseteq \llbracket P \rrbracket$$

Now, we can give rules for introducing each of our language constructs^{6}:
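The original rules are not reproduced here. For flavour, two representative rules of this kind — written in the style of Morgan's refinement calculus, which is an assumption about the notation — are the sequencing and assignment introduction rules:

```latex
% Sequencing: split a specification at any intermediate assertion \chi.
[\varphi,\ \psi] \;\sqsupseteq\; [\varphi,\ \chi];\ [\chi,\ \psi]

% Assignment introduction, provided that \varphi \Rightarrow \psi[e/x].
[\varphi,\ \psi] \;\sqsupseteq\; x := e
```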

**Exercise**: By translating the above rules into semantics, show that the rules are sound (that is, that the semantics of the RHS is a subset of the semantics of the LHS).

We can also derive rules for our trusty *if* statements and *while* loops:

$$[\varphi,\ \psi] \;\sqsupseteq\; \mathbf{if}\ b\ \mathbf{then}\ [\varphi \wedge b,\ \psi]\ \mathbf{else}\ [\varphi \wedge \neg b,\ \psi]$$

$$[\varphi,\ \varphi \wedge \neg b] \;\sqsupseteq\; \mathbf{while}\ b\ \mathbf{do}\ [\varphi \wedge b,\ \varphi]$$

**Exercise**: Show that these rules are indeed derivable, using the translations provided in the previous section.

Lastly, it is also sometimes necessary to apply logical reasoning to transform assertions during the derivation process. The *consequence* rule, given below, allows us to swap out our assertions for more convenient ones — weakening the pre-condition or strengthening the post-condition — provided the result remains a refinement of the original:

$$[\varphi,\ \psi] \;\sqsupseteq\; [\varphi',\ \psi'] \quad \text{provided}\ \varphi \Rightarrow \varphi'\ \text{and}\ \psi' \Rightarrow \psi$$

Using our refinement calculus, let’s derive an implementation for our factorial specification:

First, we have to split the specification into two parts: the first to initialise variables and establish the loop invariant, and the second to actually contain the loop.

Next, we must use the consequence rule to get the spec statement into the right form for use with the while loop rule. After introducing the loop, we can fill in the body a bit by incrementing the counter:

Here we must use the consequence rule in order to get the meat of the loop body into the right form for the assignment rule.

Lastly, we just initialise our variables in the obvious way to ensure the loop invariant holds initially:
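The intermediate derivation steps aren't reproduced here, but the final derived program, transcribed into Python, might look like the following sketch (the variable names, and the invariant that the accumulator always holds the factorial of the counter, are assumptions about the derivation):

```python
def factorial(n):
    # Initialise so that the loop invariant r = i! holds: 0! = 1.
    i, r = 0, 1
    # while i != n do (i := i + 1 ; r := r * i)
    while i != n:
        i = i + 1
        r = r * i
    # On exit i = n, so the invariant gives r = n!.
    return r
```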

Treating specifications as abstractions of their implementations is a powerful idea. It gives a semantic framework for the gradual, step-by-step derivation of a correct program from its correctness definition.

Moreover, it shows that a common informal definition of abstraction that is bandied about by programmers — the separation of a specification from an implementation — is just an instance of the more general notion of semantic abstraction. If we were to interpret types as a particularly weak form of specification, then we can view type systems as an instance of this technique as well^{7}.

## Data Abstraction

One of the most common techniques for managing complexity in software engineering is that of *data abstraction*. Data abstraction is the process of *hiding* some particular piece of state behind an *interface* or *signature* of abstract *operations*. This allows for a neat separation of concerns. For example, consider this program that only succeeds if a string of parentheses and brackets is balanced:
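The original program isn't reproduced here; a sketch of such a program in Python, using an abstract stack interface with hypothetical operation names *init*, *is_empty*, *push* and *pop* (realised here by a Python list), might look like:

```python
def balanced(s):
    # Succeeds (returns True) iff parentheses and brackets are balanced.
    stack = []                    # init: set up an empty stack
    pairs = {")": "(", "]": "["}
    for c in s:
        if c in "([":
            stack.append(c)       # push the opener
        elif c in ")]":
            # is_empty check, then pop and match against the closer
            if not stack or stack.pop() != pairs[c]:
                return False
    return not stack              # is_empty: nothing left unclosed
```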

This version makes use of an abstract *stack* type and four operations: *init*, an initialiser which sets up an empty stack; *is_empty*, a simple predicate which is true iff the stack is empty; *push*, the familiar operation that adds a new element to the top of the stack; and *pop*, the inverse of *push*, which removes the top element from the stack and returns it. Certainly, the version making use of abstract operations is far more readable than the concrete alternative, which swaps the abstract stack for an (infinite-sized) array and an index to the top of the stack^{8}:
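A sketch of the concrete version, with the abstract stack swapped for an array and a top-of-stack index (again illustrative; we sidestep the "infinite-sized" array by allocating one as large as the input):

```python
def balanced_concrete(s):
    # The abstract stack becomes an array plus an index to its top.
    arr = [None] * len(s)   # at most len(s) pushes can ever occur
    top = 0                 # top == 0 plays the role of is_empty
    pairs = {")": "(", "]": "["}
    for c in s:
        if c in "([":
            arr[top] = c    # push: write at the index, then bump it
            top += 1
        elif c in ")]":
            if top == 0:    # popping an empty stack: unbalanced
                return False
            top -= 1        # pop: decrement the index, read the element
            if arr[top] != pairs[c]:
                return False
    return top == 0
```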

Mathematically justifying the above translation is a process called *data refinement*, and a variety of techniques exist for it. One of the simplest is Reynolds's method. Starting with the abstract program, it proceeds in four steps:

- Add variables to represent the *concrete* state (in this case, the array and the top-of-stack index).

- Define a *coupling invariant* — an assertion that relates the abstract and the concrete variables. In our example, if we model the abstract stack with a simple grammar (a stack is either empty, or an element pushed onto a smaller stack), then the coupling invariant relating the abstract stack to the array and index can be defined as a recursive predicate over that grammar.

- For each operation that *writes* to abstract variables, *add* code to perform the corresponding updates to the concrete variables, such that the coupling invariant is re-established. This step can be formally justified using a program logic such as Hoare logic, which is analogous to the refinement calculus used above, except designed for post-hoc verification rather than derivation of correct programs from specifications.

- Each operation that *reads* from abstract variables is *replaced* with code that reads the same information from the concrete variables. This step should be justified as a direct consequence of the coupling invariant. With all abstract read operations replaced with concrete ones, the abstract write operations are now completely superfluous, and can be removed.
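The coupling invariant for the stack example can be written down and tested directly; a sketch, modelling the abstract stack as a Python list (top element last) and the concrete state as an array together with a top index (the representation and names are assumptions of this sketch):

```python
def coupled(stack, arr, top):
    # Coupling invariant, defined recursively over the stack grammar:
    #  - the empty stack couples with top == 0;
    #  - a stack with x on top couples with (arr, top) iff top >= 1,
    #    arr[top - 1] == x, and the rest couples with (arr, top - 1).
    if not stack:
        return top == 0
    return top >= 1 and arr[top - 1] == stack[-1] \
        and coupled(stack[:-1], arr, top - 1)
```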

Following the above steps with our original stack-based program will yield the concrete program we devised in terms of arrays. So the method appears to work — but what do data abstraction and data refinement have to do with the notions of abstraction we saw in the previous section?

### Semantic Data Types

To be able to talk about data abstraction in terms of semantics, we need a semantic model of a data type. Formally, we consider a data type to consist of:

- A set of *representation variables*, containing the data of the data type. The state space is *extended* with these additional variables.
- An *initialiser* (or a *constructor* if you prefer), which augments the state with a new instance of our data type, introducing our representation variables.
- A *finaliser* (or a *destructor* if you prefer), which eliminates our representation variables from the state.
- For each *operation name*, a relation on the extended state space — simply the semantics of the operation.

Let’s define data types for our abstract stack and our concrete implementation. To make specification easier, we annotate the names of the operations with the external variables they may touch. More elaborate refinement calculi include *frames*, which make this technique a good deal more rigorous.

For our abstract stack, we never explicitly provide an implementation — we give only specifications. Because, as we discussed before, specifications live in the same semantic domain as our programs, they suffice to define our abstract data type.

For the concrete data type, we just take the semantics of the code we use to implement each operation.

With both data types defined, we can devise a definition of abstraction between data types.

Any consumer of our data type, such as the bracket-matching program above, can be viewed as the sequential composition of the initialiser, some sequence of operations, followed by the finaliser. A data type is a *refinement* of another if all such sequences are a refinement of the corresponding abstract sequence.

Thus, to show refinement, we must show that, for any sequence of operations, the concrete composition — initialiser, then each operation in turn, then finaliser — denotes a subset of the corresponding abstract composition.

That is, data refinement is “just” program refinement, but for an *arbitrary* program. Next, we’ll look at common ways to prove this statement, and how they generalise syntactic approaches such as Reynolds's method. For a more detailed introduction to this model-oriented version of data refinement, and comparisons to many more refinement techniques, I recommend the great book *Data Refinement: Model-Oriented Proof Methods and their Comparison* by W. P. de Roever and Kai Engelhardt (who was one of my teachers).

### Simulation

We would like to prove the above subset obligation using induction on the length of the sequence of operations, but the presence of the initialisers and finalisers makes the induction hypothesis useless: it refers to the full composition with the initialiser and finaliser attached, which is not a subexpression of our goal.

One technique to resolve this is so-called *downward simulation*, where we define a *refinement relation* between concrete and abstract states, and split the above obligation into three parts:

1. The initialiser establishes the refinement relation.
2. Each operation preserves the refinement relation.
3. From states related by the refinement relation, the finalisers converge on the same final states.

The second part can be generalised into an analogous theorem about sequences of operations, via a neat induction on the length of the sequence.

From here, one can straightforwardly use the first and third lemmas to show that the concrete data type is indeed a refinement of the abstract one. In this way, we remove those pesky initialisers and finalisers so that we can do induction, and then just tack them on again after the induction is complete.

So, for our stack example, what would our refinement relation look like? It turns out to be merely a relational form of our coupling invariant from Reynolds's method.

In fact, all of Reynolds's method is just an instance of this downward simulation technique.

It turns out that downward simulation, and thus Reynolds's method, is not *complete*: one can construct a pair of data types where one refines the other, but no refinement relation can be constructed between them. *Upward simulation*, the mirror image of downward simulation, relies instead on an *abstraction relation* and performs induction from the back of the sequence rather than the front. The combination of upward and downward simulation *is* complete. The proof of this is presented in de Roever and Engelhardt’s book.

## Syntactic Abstraction

Many programming languages provide features that are commonly called *abstraction*. The most common is the *module*, consisting of one or more *types* (usually left *abstract* in the sense that their implementations are hidden) coupled with *operations* on those types. We can consider a module’s *signature* or interface to be an abstract data type in the semantic sense, where any type-correct implementation can be considered a refinement. In this sense, module systems in programming languages make it substantially easier to do the kind of data abstraction I discuss above, as both abstract and concrete versions are in a machine-readable structure. However, the presence of a module system is neither necessary nor sufficient for data abstraction to be possible.

A perhaps more common use of the word *abstraction* in the Haskell community refers to the *λ*-abstraction^{9}. Seeing as *λ*-calculus-based languages have a very different semantic domain, based on Scott domains, I can’t directly relate the notion of *λ*-abstraction to the kind of semantic abstraction I present here. I’d be very interested to see an explanation of whether there is a solid connection between the very *syntactic* notion of abstraction we see in functional languages, where “abstraction” essentially refers to *parameterisation*, and the kinds of semantic abstraction we see elsewhere.

If you enjoyed this article and you’re a UNSW student, this article is a whirlwind tour of the second-year COMP2111 course, taught by Kai Engelhardt along with yours truly. The course goes into substantially more detail on the *specification* and *derivation* components, including a detailed study of Hoare Logic and Carroll Morgan’s refinement calculus. Feel free to enrol if you’re interested^{10}.

In my undergraduate years, I remember thinking that data abstraction had something to do with header files or object-oriented programming.↩

Further internationalisation is left as an exercise.↩

This is why deterministically simulating a non-deterministic program has exponential complexity in the worst case.↩

Here $\mathbf{skip}$ is just sugar for the program that does not change the state and always executes successfully, equivalent to the trivially true guard $[\mathit{true}]$, or an assignment $x := x$.↩

Because all of the semantic relational operators ($\cup$, $\fatsemi$, etc.) are $\subseteq$-monotone, this relation enjoys all the usual congruence properties. You can refine a small part of a program, and the resultant program will be a refinement of the original whole program.↩

The notation $\psi[e/x]$ denotes substitution, replacing the variable $x$ with the expression $e$ in $\psi$.↩

The view of types as abstract interpretations is expounded in great detail in Cousot’s paper.↩

Doing the refinement to a dynamically-expanding array is too much pain for this article, but feel free to do it as an exercise.↩

A lot of Haskell programmers don’t seem to value semantic abstraction anyway. Perhaps this is a case of anti-modular language features such as type classes making real abstraction fall out of favour. Or perhaps Haskell is already so abstract there’s not much point in further abstraction.↩

Assuming UNSW hasn’t gone to hell, the course isn’t cancelled, and the teaching staff aren’t driven out due to poor management — a big assumption.↩