From: www.itworld.com

Testing GUI applications, part 1

by Cameron Laird and Kathryn Soraiz

May 1, 2001

 

What programs can test graphical applications? As consultants, we recognize that a large part of our business is decoding what clients really want, but we are wary of this particular question, as it has a history of obscuring true goals. This two-part series will explore the quality assurance (QA) of programs with GUIs.

First, we need a brief theory of quality. There's an extensive amount of literature on quality, much of it applicable to computing. In fact, quality assurance is one of the few well-grounded domains of computing, in the sense that it has yielded replicable experimental data. Evidence about the effectiveness of object orientation or the correct philosophy for chip sets (CISC versus RISC?) remains ambiguous and generally open to interpretation. Quality researchers, though, proudly point to the principles they've established beyond reasonable doubt:

What's the significance for the construction of a high-quality GUI application? Well, quality matters. In general, it matters more than any testing program bandaged on a project at the last minute. Does your project have the resources to manage a QA phase that only takes action when the product is almost ready? Then it has more than enough resources to document requirements explicitly, design the application carefully, and engineer the software throughout development. We've never seen a case where investing in end-phase testing and beta programs pays off more than that kind of up-front engineering.

Even when the above commandments are observed, there's generally more important work for a QA staff than the often disorganized grunt work traditionally assigned them. This month, we'll describe the activities a successful QA department undertakes as part of a development project. Next month, we'll look at specific testing frameworks for common GUI applications.

QA's contribution

Too many organizations try to paste QA onto the end of their development process. When they think they're nearly ready to go to market, they toss their newest employees, and sometimes others who don't fit in anywhere else, in a room, stir in a poorly defined beta program, and ask for a miracle. This is the intellectual equivalent of solving security for a rock concert by hiring a biker gang on the final weekend before the show. Both are recipes for spilt blood.

Real quality requires involvement from a QA engineer from the start. A good product has sound finances, a manufacturing plan, and so on, from the start, to focus on achievable objectives that meet customer needs. It also has a QA plan from the beginning.

From the first planning session, a good QA engineer asks, "What does it mean to have an 'adequate' response? Does the screen need to flash within three seconds? Thirty? Do you realize that writing custom widgets rather than using the native ones has a history of soaking up weeks of tinkering and just confuses end users? Will the documentation we're planning satisfy all three of the market segments we're targeting? Why have we budgeted for a smaller customer support staff, on a proportionate basis, than our competitors?"

The first job is to assess feasibility. A plan that calls for even elite coders to turn out hundreds of lines of source daily, with all but one in a thousand flawless, is unrealistic.

In the best projects, documentation and testing protocols are written at the beginning and are constantly available to guide source development. While this is one of the precepts of extreme programming (XP), it's a principle that was known long before XP's current popularity. It's certainly standard procedure in more mature domains of engineering, including aerospace.

Perhaps most indispensably, QA engineers hunt for testability.

Testing only proves programs wrong, not right

Testing never proves a program right. That proposition is well-known dogma in QA circles, and neglected or not even understood outside them. Here's a stark example. Can you test your way to confidence that:


     0.1 * (a + b) = (0.1 * a) + (0.1 * b)
   

How many values of a and b would you have to try to make you feel good about the formula?

Mathematicians utterly reject proof by anecdote, of course. The fact is that most computer arithmetics are subject to rounding error and wraparound, which means they do not respect even this simplest example, at least for select values of a and b. A sufficiently determined, or negligent, development crew can embed nuttiness beyond the ability of any black box QA team to detect.
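To make the arithmetic point concrete, here is a minimal Python sketch. It assumes IEEE 754 double-precision floats, which is what Python's float uses on mainstream platforms, and the helper names are ours for illustration. It searches small integer values of a and b for pairs where the two sides of the formula disagree.

```python
from itertools import product


def distributes(a, b):
    """Return True when 0.1 * (a + b) equals (0.1 * a) + (0.1 * b) exactly."""
    return 0.1 * (a + b) == (0.1 * a) + (0.1 * b)


def find_counterexamples(limit):
    """List every integer pair (a, b) below `limit` for which the formula fails."""
    return [
        (a, b)
        for a, b in product(range(limit), repeat=2)
        if not distributes(float(a), float(b))
    ]


counterexamples = find_counterexamples(10)
print(len(counterexamples), "of 100 pairs break the formula")
print("first few:", counterexamples[:5])
```

With a = 1 and b = 5, for instance, the left side rounds to 0.6000000000000001 while the right side is 0.6. A tester who happened to try only pairs such as (1, 2), where both sides agree, would never see the discrepancy.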

Black box means behavioral, as psychologists sometimes use it. It was news this winter that the Interbase database management system had been vulnerable to a security exploit for most of the last decade. It was finally made public only when the source code became widely available for inspection. At that point, experts could look inside Interbase and were not limited to external black box investigations.

So what is the value of testing? It's a discipline for finding faults, and finding them earlier rather than later. Applications have faults, and testing never finds all of them. However, systematic testing finds many of them before they've done much damage.

Smart testing also yields quantitative results on the incidence of faults. That leads to estimates on time and other costs to achieve specific levels of quality, exactly the kind of information necessary to manage a project rationally.

In search of testability

Tests themselves exhibit quality. Perhaps the one most consistent responsibility of QA engineers is to insist on testability. Testability requires explicit expression. The expression, "Our program has to be fast," might be OK for an early planning meeting, but a test protocol needs propositions more along the lines of, "The application must display the selected sequence within three and a half seconds."

While that sort of objective, scientific language is necessary for testability, it isn't sufficient. Replicability and cost are also elements of testability. It's generally better to have a test that runs quickly and with few resources, instead of requiring a special machine in a special location that can be accessed only once daily.
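An explicit proposition of this kind translates almost directly into an automated check. The following is a rough Python sketch, not a definitive harness: display_selected_sequence is a hypothetical stand-in for the application code, and the 3.5-second budget is simply the figure from the example requirement.

```python
import time

RESPONSE_BUDGET_SECONDS = 3.5  # the figure from the example requirement


def display_selected_sequence(sequence):
    """Hypothetical stand-in for the application code under test."""
    return " ".join(str(item) for item in sequence)


def measure_seconds(action, *args):
    """Run `action` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = action(*args)
    return result, time.perf_counter() - start


def test_display_is_fast_enough():
    """The explicit requirement, stated as a pass/fail check."""
    result, elapsed = measure_seconds(display_selected_sequence, range(1000))
    assert result.startswith("0 1 2")
    assert elapsed < RESPONSE_BUDGET_SECONDS, f"took {elapsed:.2f}s"


test_display_is_fast_enough()
```

Because the check runs in a fraction of a second on an ordinary machine, it is also cheap to repeat, which is the replicability half of testability.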

Calendrical calculations are notorious for illustrating a different sort of infeasibility that impairs testability. Suppose a product has a test plan that requires a date that is one month in the future to be displayed. It would be prudent to test that display under various conditions, especially in a case where a leap year is a factor. Does your testing environment require you to wait three years until the next real one rolls around?
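One common way around the waiting problem is to pass the current date into the calculation rather than reading the system clock inside it. Below is a minimal Python sketch under stated assumptions: one_month_later is a hypothetical function, and clamping to the last day of a shorter month is our own choice of rule, since the requirement as worded doesn't define "one month in the future".

```python
import calendar
from datetime import date


def one_month_later(today):
    """Return the date one month after `today`.

    Assumption: when the target month is shorter, clamp to its last day.
    """
    year = today.year + (today.month // 12)   # December rolls into next year
    month = today.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def test_one_month_later():
    # Leap years are tested by supplying the date, not by waiting for one.
    assert one_month_later(date(2004, 1, 31)) == date(2004, 2, 29)
    assert one_month_later(date(2001, 1, 31)) == date(2001, 2, 28)
    assert one_month_later(date(2100, 1, 31)) == date(2100, 2, 28)  # not a leap year
    assert one_month_later(date(2003, 12, 15)) == date(2004, 1, 15)


test_one_month_later()
```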

Good test plans also deal with exceptions. This is an area we particularly emphasize in our own practice. Important as it is to test "When the user double-clicks on the image of any province in the map, then...", it's at least equally vital to make explicit that "When the user double-clicks outside the boundary of all provinces on the map, then...". If the latter proposition isn't explicit, an application is likely to do nothing, react as it would if the user had clicked on a nearby province, exit, or put an inscrutable diagnostic on the screen. There are circumstances in which any of those actions could be disastrous. A good test plan makes explicit not only what happens when everything is going well, but also what happens when the user asks for something silly or makes an impossible selection, or when the network goes down, or when a system administrator forgets to renew an authentication value in a permissions table.
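As an illustration of pairing the normal case with its exception, here is a small Python sketch. Everything in it is hypothetical: the rectangular provinces, the names, and the message shown for an outside click are assumptions made only to keep the example self-contained.

```python
from typing import NamedTuple, Optional


class Province(NamedTuple):
    """Hypothetical province drawn as a rectangle, in map pixels."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


PROVINCES = [
    Province("North", 0, 0, 100, 50),
    Province("South", 0, 50, 100, 100),
]


def province_at(x, y) -> Optional[str]:
    """Return the name of the province under (x, y), or None if there is none."""
    for province in PROVINCES:
        if province.contains(x, y):
            return province.name
    return None


def on_double_click(x, y):
    """Handle a double-click; the outside-the-map case is explicit, not accidental."""
    name = province_at(x, y)
    if name is None:
        return "Please double-click inside a province."
    return f"Showing details for {name}"


def test_double_click():
    assert on_double_click(10, 10) == "Showing details for North"
    assert on_double_click(10, 60) == "Showing details for South"
    # The exceptional case gets its own explicit expectation.
    assert on_double_click(500, 500) == "Please double-click inside a province."


test_double_click()
```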

What about GUIs?

This is all abstract. Let's agree that projects should have test plans from the beginning, and that the elements of the test plans should be testable. Even this consensus is of limited value, though, because the same can be said for military campaigns and breakfast cereal rollouts. Now it's time to talk about bits and bytes.

Suppose we have a first draft of a test plan that collects elements such as:

When an end user selects a piece of equipment from the inventory listbox, then...

or

An administrator can put a rubber band around an area of the display by dragging with the first button. Within a second of completion of such an operation, a pop-up...

The fundamental difficulty with GUIs, in design as well as in testing, is that computers model none of these items in a canonical way.

Computers represent command-line applications completely. Mass-market OSs all have an unambiguous way to report "The user just typed e, x, i, t, on the keyboard." The individual characters in "exit" arrive in sequence and there's no particular difficulty in distinguishing such a sequence from, for example, "3X1T".

The fundamental problem

GUIs are different, though. Computer implementations do not model "pick the third item from a listbox" in the same discrete way. The fundamental representation of all major modern OSs is more like: at a particular time, the first button went down in the vicinity of pixel (x1,y1), then, at a particular later time, the button came up near (x2,y2).

This is a large mismatch from our draft test plan, of course. Many organizations despair of fitting the two together coherently. They have no way to mechanize selection of "pick the third item from a listbox" except to command entry-level employees to sit at a workstation and exercise the mouse. Those employees act as translators between the languages of human-readable test plans and system-level GUI event sequences.

This incompatibility is a profound one. There's no cosmetic fix for it because it's in a dual relation to the true role of GUIs. GUIs exist to limit choice. Computers are constructed in such a way that every pixel and delay between events has potential significance. To achieve simplicity for users, though, a specific GUI collapses all those potential distinctions into a small number of aggregates: a buttonlike widget that can be pushed, a display panel for read-only information, and so on. It's very difficult to move back to the original pixel-based domain and write our tests in that language.

There are alternatives, of course. That's the real subject of this series: how can we write test plans for GUI applications that are testable automatically?

Here are the realistic answers:

  1. Fix the screen layout of the application so invariantly that there's a well-defined mapping between a human-readable test plan and the raw user interface (UI) logs
  2. Decompose testing into functional and UI parts, automate the former separately, and test the latter with human labor
  3. Synthesize higher-order UI actions and use them to realize the test plan directly
  4. Combine elements of 1, 2, and 3 to achieve your goal

This month, we'll explain what the first two of these mean and apply them in a few examples.
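The first answer can be sketched in a few lines of Python. This is only an illustration under assumed names and geometry (ListboxLayout, MouseEvent, and the pixel values are all hypothetical): if the layout is truly fixed, "pick the third item" maps to a well-defined pair of raw button events, and the same arithmetic shows why the mapping breaks when the layout shifts.

```python
from typing import NamedTuple


class ListboxLayout(NamedTuple):
    """Hypothetical fixed geometry of a listbox, in screen pixels."""
    left: int
    top: int
    width: int
    row_height: int


class MouseEvent(NamedTuple):
    kind: str  # "down" or "up"
    x: int
    y: int


def click_item(layout, index):
    """Translate 'pick item `index`' (0-based) into raw events at the row's center."""
    x = layout.left + layout.width // 2
    y = layout.top + index * layout.row_height + layout.row_height // 2
    return [MouseEvent("down", x, y), MouseEvent("up", x, y)]


def item_at(layout, event):
    """Inverse mapping: which row does a raw event land on?"""
    return (event.y - layout.top) // layout.row_height


layout = ListboxLayout(left=20, top=100, width=200, row_height=16)
events = click_item(layout, 2)  # "pick the third item"
print(events)

# Replaying the same raw events against a layout with taller rows
# lands on a different item, which is why the layout must not vary.
taller = layout._replace(row_height=24)
print(item_at(layout, events[0]), item_at(taller, events[0]))
```

Here the recorded click at (120, 140) selects the third item under the original layout but the second item once each row is 24 pixels tall.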

Commercial answers

Think back to the question that began this article. The sort of answer questioners expect is, "Mercury Interactive's XRunner." Don't be misled, by the way, when you notice that Mercury's homepage advertises only Web-testing tools. The company's distinguished history of working under X and Windows should survive the latest marketing fashions.

Products like XRunner fundamentally work this way: a QA engineer configures a workstation, begins a recording session, and then performs a keyboard-and-mouse sequence. The testing product records the action sequence, as well as a screen image. This record is available for later replay and comparison, so that the engineer can do a regression test to confirm there's been no change.

Regression tests are essential for many products, and vendors have quite a few satisfied customers, despite their steep licensing fees (often over $10,000 to start) and uneven technical support. However, this is only a minor part of proper testing.

The problem is the fundamental one introduced in the last section: the test records are difficult or impossible to manage in human terms. Because the display reference is expressed in pixels, several products that automate this sort of regression testing reject comparisons if there's been a pixel-twiddling change in font. Because the keyboard selections are recorded in physical rather than logical expressions, a sequence typically must be rerun if the screen layout changes even trivially.

Better products advertise their scriptability as an enhancement to their record-and-replay capabilities. A script captures a sequence in textual form. This is a powerful idea, because it's much faster, more flexible, and less error-prone to create new or modified texts than to simulate new clicking-and-typing runs.

Scripting to the rescue

Management of a test suite is so different in the two modes that they generally should be considered as two distinct categories: record-and-playback testing tools and scriptable testing tools. In our experience, the former are useful only for very conservative projects with rigid requirements. Most organizations are better off improving their designs and process than investing in the specialized benefits a record-and-playback tool affords.

Scripting is very different. With sufficient scripting power, it's even possible to write tests without the target product. An example might look like:


   test end-of-list-selection {Boundary condition} {
       prepare_standard_listbox
       .listbox activate end
       # ...
       get_text_of_result
   } {last-item}
   

This is just what we need for a healthy QA process.

Getting it takes some work, though. Remember that an OS doesn't know how to .listbox activate end; without particular knowledge that we'll interpret that as a listbox selection, it can only move the visual cursor to a specific spot on the screen.

Therefore, the first solution we generally recommend is to make the GUI application itself scriptable. That means exposing all its functionality through the keyboard the way a conventional command-line application does. The industry has become quite expert at testing command-line applications; many products are known to receive daily exercise from testing protocols of up to a million lines of source.

Two ways to build in scriptability

There are at least a couple of distinct architectures that lend themselves to this functionality. We just finished a three-part series on the construction of GUI wrappers for standalone command-line applications. In this approach, the command-line versions are available for conventional testing. Testing the GUI deliverable still suffers from what we're calling the fundamental problem of record-and-playback. However, this architecture makes it a smaller problem. Suppose the original test plan includes:

Test that correct results are displayed when {zero,all,first,last,...} item(s) are selected.

By construction of the application as a thin GUI wrapper, this collection of items transforms to:

Test that the GUI displays the command-line results correctly.

Test that the command-line application returns the correct results with {zero,all,first,...} item(s) selected.

The first of these is much simpler than the original collection and the second is purely within the command-line realm.
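The split can be pictured with a short Python sketch. All names here are hypothetical (select_items stands in for the command-line core, ThinGui for the wrapper), and a real GUI would paint a widget rather than store a string; the point is only that each half gets its own kind of test.

```python
def select_items(inventory, selection):
    """Hypothetical command-line core: return the items at the selected indices."""
    return [inventory[i] for i in selection]


class ThinGui:
    """Hypothetical thin wrapper: it only formats whatever the core returns."""

    def __init__(self, core):
        self.core = core
        self.displayed = ""

    def on_selection(self, inventory, selection):
        self.displayed = "\n".join(self.core(inventory, selection))


INVENTORY = ["drill", "lathe", "press"]


def test_core_selections():
    # The {zero, all, first, last} cases live entirely in the core's tests.
    assert select_items(INVENTORY, []) == []
    assert select_items(INVENTORY, [0, 1, 2]) == INVENTORY
    assert select_items(INVENTORY, [0]) == ["drill"]
    assert select_items(INVENTORY, [-1]) == ["press"]


def test_gui_displays_core_results():
    # One GUI test: whatever the core returns is what gets shown.
    gui = ThinGui(lambda inventory, selection: ["stub-a", "stub-b"])
    gui.on_selection(INVENTORY, [0])
    assert gui.displayed == "stub-a\nstub-b"


test_core_selections()
test_gui_displays_core_results()
```

Substituting a stub for the core in the GUI test keeps that test independent of the selection logic, so a failure points at one layer only.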

The egg model is a label we've heard applied to a slight variation of this architecture. The egg has a scriptable yolk -- perhaps scriptable purely through command-line processes, perhaps through a different mechanism. The scriptable yolk is wrapped in a thin GUI shell.

Building eggs this way is a significant change for many developers. Visual Basic and allied products encourage developers to attach low-level functionality directly to GUI actions. The layering in an egg has the reputation in many circles of being just a performance burden. What we've seen, though, is that designs which clearly distinguish internal state and functionality from display are more robust, maintainable, and testable. The popularity of the model-view-controller (MVC) pattern is one instance of this principle.
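For readers who prefer code to metaphor, here is one hedged Python sketch of an egg. The class names, the tiny command vocabulary ("select", "get"), and the label attribute are all invented for illustration; the article does not prescribe any particular mechanism for scripting the yolk.

```python
import shlex


class Yolk:
    """Hypothetical scriptable core: all state changes go through text commands."""

    def __init__(self, items):
        self.items = list(items)
        self.selected = None

    def execute(self, command_line):
        """Run one command such as 'select last' or 'get' and return its result."""
        verb, *args = shlex.split(command_line)
        if verb == "select":
            self.selected = self._index(args[0])
            return self.items[self.selected]
        if verb == "get":
            return "" if self.selected is None else self.items[self.selected]
        raise ValueError(f"unknown command: {verb}")

    def _index(self, word):
        if word == "first":
            return 0
        if word == "last":
            return len(self.items) - 1
        return int(word)


class Shell:
    """Hypothetical thin GUI shell: event handlers only forward commands."""

    def __init__(self, yolk):
        self.yolk = yolk
        self.label_text = ""

    def on_listbox_click(self, row):
        self.label_text = self.yolk.execute(f"select {row}")


yolk = Yolk(["drill", "lathe", "press"])
print(yolk.execute("select last"))  # a test script drives the yolk directly
shell = Shell(yolk)
shell.on_listbox_click(0)           # the GUI takes the same path
print(shell.label_text, yolk.execute("get"))
```

Because the shell contains no logic of its own, a text script that exercises the yolk covers nearly everything the GUI can do.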

The second major scriptable architecture calls for development personnel to design and implement as they usually do, without particular regard for testability. However, they're required to use a GUI toolkit that is itself scriptable.

Our yearlong series on GUI toolkits mentioned scriptability only in passing. In the next installment we'll look in more detail at toolkit scriptability and an open source testing tool that combines the best of several approaches. With the theory of this article in place, we can also show executable code.

Halfway through

So far, we've introduced the principles of testing, described the fundamental problem of GUI testing, and looked at the traditional answers to it. One final point is important for now. We've written about the contribution good QA engineering makes to a testing program. That's the scope of this series; however, QA has several other potential responsibilities beyond testing products. QA, for example, is often in a good position professionally to construct or review organizational disaster recovery plans. QA should know about proof techniques for software, static analysis of sources, white box testing, the managerial significance of test results, and much more. See the Resources below for a fuller picture of what QA does. This series focuses only on the contribution that QA makes to testing in a narrow sense.

Resources