Gestures, the iPhone, and Standards: A Developer's Questions
A discussion of the nature and use of gestures in a computer controlled by screen contact, and some of the issues with regard to developing standards.

The release of the Apple iPhone (and now the iPod Touch) has renewed interest in computing devices controlled by gestures. Over the years, I've been writing about different devices with direct hand input (either by touch and/or pen) and it's time to do it again.

Many have written about the iPhone, so why should you listen to what I have to say? In addition to my general experience with user interfaces and "tool" design, I have experience directly related to such systems. Back in the early 1990's I worked on a variety of such systems at Slate Corporation and was heavily involved in the developments at that time. A few months ago I attended (and gave a presentation to) a small conference at Brown University about current pen computing research. In preparation, I looked over some of the old videos from my Slate days, and even played a bit with some of the old pen computers in my personal collection. More recently, I've spent some time with the Apple iPhone and have started doing a little experimental development with my newly-acquired iPod Touch (which shares the same interface and programming environment). Here are some thoughts.

Introduction
This essay addresses the issue of gesture-based interfaces.

In the "real" world, a gesture is a motion of the limbs or an act made to express a thought or as a symbol of intent. You gesture to your waiter to come over and see the fly in your soup, or wave an oncoming car past your stopped car. You make hand gestures to express disgust and anger of others, or to signal approval and disapproval (thumbs up and thumbs down).

These gestures are often shortcuts or silent, non-verbal alternatives for expression. In other cases, especially when we want a richer vocabulary, we may use spoken or written language to express ourselves more explicitly.

These gestures and their meaning are usually learned. While they may have a relationship to the idea being expressed (waving a car around your stopped car), they often have more obscure symbolism and take longer to learn (handshaking, giving a "high-five"). They may be a bit ambiguous and very context dependent: To order a hot dog from a vendor walking the aisles at a ball game, a variety of gestures and meanings are often necessary: You get the vendor's attention with the "choose me to start a conversation or transaction" gesture (a raised hand with palm facing the other person); signal the desired action with a raised index finger ("one, please"); tell him not to put on relish with a "stop" gesture (the raised hand with facing palm gesture again); and then tell him to wait a minute while you borrow some money from the person next to you (the raised index finger, again).

In the realm of computing, we have other gestures. When using a mouse, we indicate position on the computer screen by sliding our hands over the desk in a relative (not absolutely positioned) motion. We use finger gestures, including some we call "pushing buttons", that are really more akin to the dexterity of playing a note on a clarinet than to playing one on a piano or pushing a button to choose something from a vending machine. The same gesture can have different meanings when we press a second button on the mouse or use our other hand to hold down a modifier key on the keyboard.

Again, the computer gestures are learned, sometimes with an obvious symbolic connection to the operation desired (clicking while "pointing" to an object displayed on the screen) and sometimes not (using the "control-key" along with clicking to mean "also select this one"). Basic mouse use was helped by the addition to Microsoft Windows of the Solitaire program, a brilliant move (in terms of training).

The keyboard is an interesting situation. When entering text, the symbols printed on the keys make their meaning very clear. Pressing the "a" key enters the letter "a". Pressing the "Del" key deletes a character or selected item. Sometimes, we have additional meanings to learn and use, such as two-key combinations like pressing "control" along with the arrow keys for different types of cursor motion, or with the letter keys to format text (e.g., ctrl-B for bold). Again, these are gestures we make that the computer "understands". They are gestures that have varying levels of symbolic connection to the actual operation being performed. They are gestures that must be learned through experimentation, reading documentation, or other training.

Once you learn a gesture and its meaning, it becomes a "natural" way of expression. In your mind, you start thinking of waving a car around yours or stopping one approaching a crosswalk by using an upraised hand as directly "controlling" something else. You start to think of the gesture like a lever that is mechanically effecting what you want, and may think that is the only gesture that could have that meaning, just as in the physical world only that particular lever could actually control the thing to which it was connected.

However, without learning, many gestures probably have little intrinsic meaning. A gesture that might be common and "obvious" in one culture, perhaps being so offensive as to lead to fights for honor in that culture, might be completely ignored in another, or have another, polite and commonly used meaning. The "ctrl-" modifier of the Windows world (such as "ctrl-B" for bold) is used differently in other systems, where, for example, the Mac more commonly uses the Command/Apple key (such as "cmd-B" for bold).

I remember learning to hitchhike (that is, standing by the side of the road asking the passing motorists to give me a ride) by facing traffic, stretching out my arm, making a fist with the thumb sticking out, and pointing behind me in the direction traffic was going. That's the symbol in the United States, I guess meaning "I want to go that direction". One summer, I was in another country, and there the symbol was an outstretched arm with a closed hand and the index finger pointing to the road by my side. I guess that came from "please stop here for me". Who knows? In both cases, the effective message was the same, but the gesture was different and certainly needed to be learned. After using it for a while, it became second nature. (Wikipedia lists a variety of different gestures for hitchhiking used in different parts of the world.)

Touch-screen and pen gestures
In the area of computers with screen-contact interaction, such as those using a pen or a touch-sensitive display, we become very much dependent on gestures for communicating with the computer. Among the systems that achieved widespread popular use, the first were "kiosk"-style systems, such as banking machines and some control panels. These systems, as I recall, mainly used images of buttons which you could "press" -- a very simple, somewhat obvious, and easy-to-learn gesture.

The first really popular (and long-lived) system for the general public was the Palm Pilot. In addition to "tapping" gestures for selection (using a finger, fingernail, or the included stylus) and operating some of the "controls" that functioned in the manner of the already-common mouse-based GUI systems, the device used a large set of special gestures for entering text, known as "Graffiti". From the Pilot Handbook: "Graffiti is a system where simple strokes you write with the stylus are instantly recognized as letters or numbers...The strokes recognized by Graffiti are designed to closely resemble those of the regular alphabet."

While the gestures for some characters were almost the same as normally writing the letter, such as the letter "L", others required some learning, such as "A" (an upside-down "V") and "T" (an "L" rotated 180 degrees). The space character was entered as a horizontal line written left to right, while backspace was one written right to left. A period was entered by tapping twice. While most letters were related to their printed uppercase selves, "H" was a lowercase "h" and "Y" was a lowercase script "y". "V" was either a "V" with an additional horizontal tail or a "V" written right to left, distinguishing it from a "U".

After some practice, for many people (millions of Palm PDAs were sold) the gestures became mentally associated with the characters, and writing became natural.

Some gestures are "easy" for the computer to recognize reliably, such as tapping on a button image. Others, like handwritten characters, are much harder.

A very famous and extensive use of gestural control of a handheld computer was the PenPoint operating system from GO Corporation. PenPoint used "handwritten" gestures for all input and control, including text entry. It could distinguish among a wide variety of seemingly identical gestures by interpreting them in context. A drawn vertical line could variously be interpreted as a drawn line, the letter "I", a "flick" gesture to control scrolling, and more.

PenPoint-based computers were, in many ways, on par with full traditional laptops of the time, with word processing, spreadsheet, drawing, scheduling, custom applications, program development, and more. I have posted a copy of a GO promotional video aimed at developers that includes a very extensive demonstration by PenPoint architect Robert Carr of the system and its use of gestures. The 59 minute video is available as "PenPoint Demonstration 1991" on Google Video.

I've also written an essay about the state of pen-based computing in the 1990's: "About Tablet Computing Old and New". It lists a variety of products and patents. The patents are especially valuable for their descriptions of the thinking of those days, no matter what actually ended up in the patent claims themselves or the validity of those claims in light of today's reading of the law. (Note from a layperson: In a patent, the long section called the description is written before it is clear exactly which claims will be allowed by the Patent Office. Only the claims are what is "patented", not everything in the description. The extra material in the description is often quite interesting for learning, and itself is one of the forms of prior art used by patent examiners.)

More specifics about gestures
Feedback to the user while making the gestures is important. Just the fact that their finger or pen touches the screen is one type of feedback. With a pen, the "feel" of the stylus point against the screen matters. Visually, "touched" objects often respond by highlighting, morphing, moving, and so on. With added computing power, objects could be "dragged", and now, on the iPhone, even more "realistic" responses can be displayed, making the operation even clearer once you learn the gesture and making the illusion of direct connection to a "machine" more complete.

Gestures on these screen-contact computers have a variety of variables to distinguish them from each other. One is the shape of the gesture, determined by the path the finger or stylus takes while in contact with the screen. Another is the position of the gesture and the parts of its path, if any. Finally, there is the timing, both within the gesture itself and relative to other events. Sometimes the operating system generically interprets the gesture and sometimes a particular application interprets the user input with varying degrees of common assistance from the system.

For example, a "tap" is usually just a brief contact of the screen in one position. The path is very small, if any, and the shape doesn't matter, since to the user it's supposed to be thought of as a single "dot". If the tap is over an image of a button, it often means to "press" the button and do whatever that would do. If the tap is over an object of some sort, it may mean to select that object, either for operation immediately or perhaps at a later time, such as selecting an image for display. If the tap is close in time to a previous tap, and within a specified distance from that first tap, it may be a different command, such as the iPhone browser's use of tap to click a link and double tap for zooming in and then zooming out. In the AtHand spreadsheet, described in one of the patent descriptions, the relative position of the second tap in a double-tap gesture indicated which direction a cell range selection should be extended, akin to the End key shortcuts in Lotus 1-2-3.

Another gesture is the "flick" gesture. This is basically a horizontal or vertical line of contact with the screen. In PenPoint, the direction you draw (left to right, top to bottom, etc.) determines whether or not the gesture is interpreted as a Page Up, Page Down, Page Left, or Page Right command, and then performed accordingly by the underlying program. Some programs may ignore the recognition, and just use the tracking of the pen motions to control the motion of something being displayed on the screen. Sometimes, holding down the pen in one position before moving it in a direction is used to turn a Page Down gesture into a "drag" operation. Again, location of the gesture (on something that may be dragged), and timing can determine exactly what the gesture does.

On the iPhone/iPod Touch browser, dragging horizontally or vertically on a page seems to enter a "flick" mode, where the screen scrolls in pretty much direct response to continued motion of the finger along that axis (and that axis only). The speed of motion at release determines some visual "momentum", giving a nice, smooth feel that suggests a direct connection to a physical object and also lets each flick scroll a bit further than your finger actually moves. Motion that starts out on a diagonal, though, can continue in any direction until you stop touching the screen. Once zoomed in on a photo in the iPhone photo viewer application, finger motion works equally in all directions, except that scrolling sideways stops at a photo boundary (the photos are displayed horizontally in sequence) unless certain speed and sequencing criteria are met, in a way that makes it feel like you have to coax it over the boundary.
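The "momentum" part can be sketched roughly like this, assuming a hypothetical redrawAt function and a friction constant chosen only for illustration; the real behavior is undoubtedly more refined.

var FRICTION = 0.95;     // assumed fraction of velocity kept on each step

function startMomentum(scrollY, releaseVelocityY) {
  // releaseVelocityY is the finger's speed, in pixels per step, at the
  // moment it left the screen.
  var velocityY = releaseVelocityY;
  var timer = setInterval(function () {
    scrollY += velocityY;              // keep moving after the finger lifts
    velocityY *= FRICTION;             // slow down a little each step
    if (Math.abs(velocityY) < 0.5) {   // close enough to stopped
      clearInterval(timer);
    }
    redrawAt(scrollY);                 // hypothetical redraw of the content
  }, 16);                              // roughly 60 steps per second
}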

As you can see, the set of gestures and the definition of their functionality can be quite extensive and detailed.

Choice of gestures
Both PenPoint and the iPhone use a flick gesture of some sort (they both assign the same name to it) for paging through data on the screen. Unlike a lock, which requires a specific style of key turned a specific amount in a specific direction, there is nothing inherent in scrolling that requires that particular gesture. Other systems have used sliding scrollbars and page up and down "buttons". Even mimicking a physical object does not require that gesture. While a scroll of paper may respond well to being slid, or to the turning of a knob, pages of "real" paper are also advanced physically by turning the pages one at a time. The iPhone has an orientation sensor of some sort and could possibly respond to physical "turning" as page turns just as well as it responds to switching from portrait to landscape.

In a computer system, as with hitchhiking, the choice of gestures often leaves a lot of room for variation. The gestures used for particular operations (and the visual feedback provided) may be chosen from a range of options. While some may be easier to guess or learn than others, many will serve the task. As with any mapping of functions to input options, be they choices of keys or menu locations, there is technically a lot of choice. For human interface design purposes, though, there are other factors that may dictate the choices.

Product developers have found that there are advantages to keeping the general manner of operation consistent within the various genres of computing devices. The GUI interface of point-and-click, drop-down menus, scrollbars, etc., makes it easier to learn new applications and to switch between multiple applications on traditional personal computers. To paraphrase Jakob Nielsen from his old essay "Do Interface Standards Stifle Design Creativity?": "Users spend most of their time [using other applications and devices]. Thus, anything that is a convention and used on the majority of other [applications and devices] will be burned into the users' brains and you can only deviate from it on pain of major usability problems."

This style of product design, of using commonly-accepted user interface conventions, has served us well repeatedly in the past. As Jakob points out, it makes it easier to go from web site to web site doing e-commerce, with familiar components and terminology. Once you learn how to buy from Amazon or eBay, buying from Lands' End or Joe's Cellular Accessories becomes straightforward. Once you learned Lotus 1-2-3's moving cursor-style menu and "F1 for Help", many other non-spreadsheet applications that followed those conventions seemed "natural".

The world now that we have the iPhone
In today's world, we have graphic manipulation ability that greatly outstrips the technology available even a decade ago, with larger handheld screens with multi-touch, motion and position sensors, increasingly inexpensive memory to hold photos, audio, video, and other forms of media, high-resolution-but-tiny cameras, and various forms of wireless connectivity. The general public is accustomed to carrying a cellphone, digital camera, and perhaps an MP3 player. WiFi and other forms of connectivity are becoming quite ubiquitous. These conditions are opening up new opportunities for the user interface.

The excitement around the iPhone for its dazzling interface and design polish, and the desirability of pocket-sized devices with as much screen area as possible, make it highly likely that we will be deluged with applications (and devices) that use a contact-with-screen gestural interface. A question that arises then is: What should be the standard interface on such devices?

I'm running into this problem as I contemplate programming for the iPhone/iPod Touch. At present, the only non-Apple programming for these devices allowed by Apple is through the browser. While in some senses the browser in the iPhone is the "same" as the Safari browser on a Mac or PC, in many ways it is quite different -- much more different than Safari is from other browsers (like Firefox, Internet Explorer, or Opera). The relationship between the physical screen and the virtual page on which the HTML is rendered is different than in a traditional browser. For an optimal experience, this requires coding the HTML page with the characteristics of the iPhone browser specifically in mind.

While iPhone Safari's operation with a page like the New York Times homepage shown in the TV ads looks quite usable, in practice many web pages are much less smooth to use on the iPhone than you would want. For quick operation on the go, this can be a problem. Web developers are finding that they have to make major changes, perhaps with dedicated URLs, to give iPhone users the support they deserve.
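For the "dedicated URL" approach, one common technique of the period is to check the browser's user agent string and redirect; the destination address below is just a placeholder, not a real URL.

// Send iPhone and iPod Touch visitors to a page designed for their screen.
// The destination URL is hypothetical.
if (navigator.userAgent.indexOf("iPhone") != -1 ||
    navigator.userAgent.indexOf("iPod") != -1) {
  window.location.replace("http://example.com/iphone/");
}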

There is nothing wrong with needing to program specifically for the iPhone, especially since doing so is likely to teach us a lot about tuning applications to similarly sized screens. We did it before for the more minimal screens of earlier mobile devices (such as the special mobile portals for Google, the airlines, some news sites, etc.). It would be helpful, though, if we didn't need different code for different manufacturers (remember the "best viewed in Netscape" and later "best viewed in IE" days).

Another challenge is that the iPhone version of Safari does not fully implement all of the input functionality that JavaScript in a browser normally expects. For example, the tracking of finger contact (which would correspond to mouse movements) is currently reserved for the operating system and not passed through to your program. Basically, only the tap gesture is provided to a non-Apple program, and then only at the moment the finger breaks contact. The flicking and zooming gestures perform their operations without much coordination with the JavaScript. Any web-based application that depends upon that missing functionality can have compatibility and usability issues.
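To make the limitation concrete, here is a minimal sketch of what does and does not reach a web application under those constraints (the element ID and function names are hypothetical):

// This works: a tap on the element arrives as an ordinary click event,
// delivered once the finger has lifted.
document.getElementById("zoomButton").onclick = function () {
  zoomInOneStep();                     // hypothetical application function
};

// This does not work as hoped: no stream of move events is delivered while
// the finger drags across the page, so the handler never sees the finger's
// path the way mousemove would show the mouse's path on a desktop browser.
// The flick is handled by the system's own scrolling instead.
document.onmousemove = function (event) {
  trackFingerPath(event.clientX, event.clientY);   // hypothetical
};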

This means that, for developers outside of Apple looking to develop compatible software, there is a much more limited repertoire of gestures to choose from than you would expect, and you are likely to end up with an application whose operation seems foreign to the rest of the system. There is a lot that can be done with tapping, from multiple taps in various configurations to pop-up button pads, but the "soul" of the iPhone includes the smooth animation of its response to drag gestures of various sorts.

What do we do now? What are some of the issues?
To follow up on my original question: What should an iPhone programmer do today?

Here are some factors to consider. We'll start with which gestures to use.

As I hope I've demonstrated, gestures are learned and any apparent direct connection between the gesture and the operation being accomplished is usually something we also learn and later internalize. There is usually not one "right" gesture for most operations. For example, the show-stopping "pinch" gesture used for zooming in and out on the iPhone could also have been a single-finger drag in or out to a corner, much like sizing images in many existing programs. Both types, two-fingered pinching/stretching and single-fingered corner-dragging, need to be learned and have mnemonic value.
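To see that either mapping is workable, notice that both gestures can be reduced to the same zoom factor. Here is a sketch with hypothetical helper names, reflecting neither Apple's implementation nor any particular product's:

// Two-finger pinch/stretch: scale by the ratio of the current distance
// between the two contact points to their starting distance.
function pinchScale(start1, start2, now1, now2) {
  return distance(now1, now2) / distance(start1, start2);
}

// Single-finger corner drag: scale by how far the dragged corner has moved
// relative to the opposite, fixed corner.
function cornerDragScale(fixedCorner, startCorner, nowCorner) {
  return distance(fixedCorner, nowCorner) / distance(fixedCorner, startCorner);
}

function distance(a, b) {
  var dx = a.x - b.x, dy = a.y - b.y;
  return Math.sqrt(dx * dx + dy * dy);
}

Either number could drive the same zoom operation; the choice between them is about learnability, feel, and what the hardware can sense, not about one being inherently "the" zoom gesture.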

Historically, users "vote" for various preferences by their purchases and feedback, and software developers try different approaches or mimic existing products as they see fit, sometimes getting people "trained" on their approach because of the desirability of other features of the product or its ubiquity for other reasons. Over time, commonly accepted standards seem to develop, often aided by explicitly documented style-guides from the "winning" developers. Apple seems to be "campaigning" for its choices, and doing pre-purchase training, though heavy television advertising.

There are also legal issues.

Historically, some vendors have sought to lock in their advantages by precluding others from "copying" their interface standards. There were "look and feel wars" in the 1990's using copyright law. There is now more and more use of patent law to try to keep an interface style proprietary. In the early days of the popular GUI, Apple, after "borrowing" a lot from Xerox, attempted to keep elements of its particular expression from Microsoft, but contract and other legal issues got in the way, and common use by everybody of the mouse and GUI proliferated. Apple appears to be signaling a desire to keep some user interface elements of the iPhone to itself when it refers to the "revolutionary" multi-touch interface and through the reported filing of patent applications. Microsoft seems to be signaling its intention to dispute those claims with actions such as the posting by its researcher Bill Buxton of his very interesting essay "Multi-Touch Systems that I Have Known and Loved". (This essay is also a great introduction to some of the advantages and disadvantages of a variety of input means. As you will see, the "touch" interface is not a "perfect" solution.)

Apple has released some very detailed and helpful documentation about the current state of the iPhone browser. This documentation looks like it will help a developer wring the most out of the capabilities Apple is providing. Apple has also stated that they will be providing a more extensive SDK to give developers even more access to the device's capabilities, but, as of this writing, they have not stated exactly which capabilities. The legal notice at the beginning of the released documentation states:

No licenses, express or implied, are granted with respect to any of the technology described in this document. Apple retains all intellectual property rights associated with the technology described in this document. This document is intended to assist application developers to develop applications only for Apple-labeled or Apple-licensed computers.

From what I've seen as a non-lawyer, over the past few years patents have become a major battleground, and invalidating patents has been very difficult and expensive (see "Thoughts on Patent Litigation in 2006"). Also, because of how the patent courts have interpreted the word "obvious", proving to the patent office that your idea was "non-obvious" (and thereby patentable) has been easier than many laypeople would think. "Prior art" that would disqualify an application needed to be much more explicitly descriptive of what was being patented than most laypeople appear to assume.

I am not a lawyer, but as a layperson it appears to me that the recent U.S. Supreme Court ruling on patents in "KSR International Co. v. Teleflex Inc. et al." will narrow the definition of "non-obvious" and change the dynamics. Exactly how is yet to be seen. Here are some excerpts (and you can see how Buxton's essay fits in here):

Common sense teaches, however, that familiar items may have obvious uses beyond their primary purposes, and in many cases a person of ordinary skill will be able to fit the teachings of multiple patents [(DanB:) and/or existing known technology] together like pieces of a puzzle...

A person of ordinary skill is also a person of ordinary creativity, not an automaton.

[The] Court of Appeals [concluded in error] that a patent claim cannot be proved obvious merely by showing that the combination of elements was "obvious to try." ...When there is a design need or market pressure to solve a problem and there are a finite number of identified, predictable solutions, a person of ordinary skill has good reason to pursue the known options within his or her technical grasp. If this leads to the anticipated success, it is likely the product not of innovation but of ordinary skill and common sense. In that instance the fact that a combination was obvious to try might show that it was obvious under §103 [and therefore not patentable]...

We build and create by bringing to the tangible and palpable reality around us new works based on instinct, simple logic, ordinary inferences, extraordinary ideas, and sometimes even genius. These advances, once part of our shared knowledge, define a new threshold from which innovation starts once more. And as progress beginning from higher levels of achievement is expected in the normal course, the results of ordinary innovation are not the subject of exclusive rights under the patent laws. Were it otherwise patents might stifle, rather than promote, the progress of useful arts. See U. S. Const., Art. I, §8, cl. 8.

From what we can see here, for the ordinary small developer with little money, the legal landscape is unclear and perhaps perilous.

Putting all this together, what we as developers need to do is figure out where we should standardize and how, and where we should encourage experimentation. As we start programming for the iPhone, we need to decide where we will follow Apple, where we will use more common or legally-clear gestures, and where additional innovation is needed.

Leadership needed
The world of handheld screen-contact computing looks like it will continue to blossom. We need leadership that will help us proceed with the commonality we have used to advantage repeatedly in the past to benefit all.

Who will step forward with that leadership and be followed? Will Apple try to maintain a sole position as a platform, or will it encourage the whole industry to follow its lead? Will Microsoft go the open route, following its previous examples of evangelizing XML and other very open standards, or will it try to create its own proprietary following? Will some members of the academic or FOSS community do the legal legwork, interface design, and initial coding to mimic the success of Berners-Lee and later the W3C versus proprietary systems such as those from AOL, CompuServe, and Microsoft? Who will fund that? Google? Nokia? Will there be inward-looking greed or industry leadership?

As Bill Buxton points out in his essay, the iPhone interface has some important drawbacks. Unlike physical button-based interfaces, it is hard to use with one hand or without looking at the screen for feedback. For those who are visually impaired, its operation is difficult. In the early GUI world, Microsoft (knowing that few computers had a mouse installed, and knowing the value of keeping your hands on the keyboard during data entry) made sure that there were keyboard equivalents for almost all operations and encouraged that as a standard. The original Mac didn't even have a full complement of cursor movement keys. Eventually, good elements of both the Mac and Windows approaches became common.

As part of our "common" system, we will probably need some physical actuators (buttons and/or sliders?), maybe more than the very few on the iPhone or iPod Touch. (A Tablet PC usually has a few input buttons available when closed and they are quite useful, I've found.) We will need alternative (but pretty complete) input means for people with disabilities or other special situations, perhaps through means such as wired or wireless connection to other input devices, and these means must be commonly supported without too much extra work on the part of developers. Continued use and experimentation with today's systems will lead us to understand what other additions should be "standard".

As a software developer, I await signals from those with the resources to make things happen. In the meantime, I'll experiment with what we have and continue to hone my skills on other platforms.

-Dan Bricklin, 24 October 2007

© Copyright 1999-2018 by Daniel Bricklin
All Rights Reserved.