Dan Bricklin's Web Site: www.bricklin.com
Revenge of the CLI: A transformation through voice and AI
Applying a framework for choosing an application interface to today's Voice User Interfaces, comparing and contrasting with GUI and Command Line Interfaces.
Bottom line: Today's popular Voice User Interfaces are worth experimenting with for enterprise applications that benefit from hands-free operation and speaking out loud, and that consist of choosing commands and values from easily remembered or easily composed words and phrases or certain values with a rich verbal vocabulary like dates and times. This is especially the case when the scope of the choices is a very large, unordered set of discrete values, a property shared with Command Line Interfaces.
With all of the hype around voice, mobile, touch, web, and other interfaces, and the cost in time and money to create many apps, developers within the enterprise are challenged to decide which interfaces to use for controlling a given application. The microservice architecture that has become common and popular makes it much easier to build a variety of different user interfaces that accomplish the same tasks. While it may be "best" to say "support them all", that is often not an option, nor needed.
For mission-critical applications, where savings from efficiency and usability can be very important, and canned, general-purpose, lowest-common-denominator solutions are not appropriate, a framework for choosing an interface to focus on can be helpful. This essay provides one such framework and uses it to find characteristics of applications that lend themselves to voice user interfaces.
In the business world, the options for user interfaces have been evolving over many years. With the current growing popularity of voice input, coupled with various forms of vastly improved artificial intelligence techniques for reliably extracting the intent and details of that input, a new option for user interface has entered the business computing world for use by employees. This follows closely on the heels of the entry of touch-enabled mobile user interfaces in the last decade that have taken the spotlight from the GUI (mouse/keyboard/graphical screen) of the two decades before, and the simple forms-based keyboard and character-only screens before that. In the earliest days that I remember (and going back to the first half of the 20th century), the main interface was punched card input and a mixture of print and punched card output (see Wikipedia's "Tabulating Machine"). Later, magnetic tapes could simulate very large decks of cards.
Throughout, for developers, computer operations staff, and some users of mission-critical applications given to extensive training, the command-line interface (CLI) that started in the days of teletype-style input of the 1950s and 1960s has been used. The "shell" style CLI developed for the Multics operating system in the mid-1960s (see "Multics Command Language") still lives on in Unix, Linux, MacOS, and even Windows. Even today, many developers at times prefer to (or are even required to) use the CLI version of program control as part of their development or control of the system. CLI control is available for Git, Android development, Cordova hybrid-app building, and many more situations, as well as for many operating system operations. For some new systems, CLI control is often the only form available.
The continuing popularity of CLI-style interfaces among "professionals" may seem strange. Proponents of other forms, such as GUI, have often denigrated such use as "clinging to the past". I (and others) maintain, though, that there are measurable traits of particular applications for which one can understand why CLI would be a popular choice.
I have been struck by similarities between CLI and the voice input typified by Amazon Alexa (which I'll call "VUI" for Voice User Interface, a common choice). I see how many of the strong points about CLI are shared with this new voice style, and some of the weaknesses are lessened. Thinking about that led to this essay.
The growing popularity of VUI in the home, and the large number of people who are becoming accustomed to using it and even preferring it for certain applications, may relate to these strong points. Business application developers need to at least consider VUI as a possible addition to their choices when it makes sense but should be more circumspect when it does not.
To help provide an example of an enterprise use of VUI, you may want to see the Amazon Alexa access to Rhino Fleet Tracking's truck tracking and assignment system: "GPS Tracking with Amazon Alexa".
To help differentiate the various user interface methods, let us describe a few attributes for evaluation. They are: choice of operation, specification of parameters, context, level of precision, and skills needed for implementation.
Choice of operation refers to choosing the general operation to perform. The actual operation may require a variety of options and parameters, but the user usually thinks in terms of the operation. For example, "Print" or "List records" are common operations.
Specification of parameters refers to providing the option choices and data values, if any, needed to execute the operation. For example, specifying which of the available printers, the range of pages to print, and whether to perform one-sided vs. two-sided printing, are common parameters.
Context refers to information that is not part of a command that is used to help recognize the command itself and evaluate its arguments. For example, the login ID of a user or physical location of the device may determine which list of printers to display or which default values to use.
Level of precision refers to the tolerance for exactness that is required for specifying elements of the user interface. For example, in a touch interface, larger buttons require less precision, and in an interface using typed input, automatic proposed entry completion makes it easier to enter data without remembering exact spelling.
Skills needed for implementation refers to the skills and levels of expertise with various technologies needed by the developer to implement a user interface style for an application. For example, the skills needed to implement a mouse/keyboard GUI on a browser-based web application running on a computer with a large screen are different than those needed to implement a CLI-based interface on a personal computer, or a native-code, touch-enabled, standalone mobile application on a small phone screen.
Choosing what to compare and contrast
To simplify things here, I will just look at GUI, CLI, and VUI.
I am using GUI (Graphical User Interface) to refer to user interfaces making use of a pointing device like a mouse (including a trackpad or some use of a high-resolution pen with "hover") along with a physical keyboard and bitmapped graphics display. These go back to at least the Xerox Alto, and were first popularized to general computer users by Apple Macintosh and Microsoft Windows. This style has also been adopted by almost all HTML-content browsers, such as Firefox, Safari, Internet Explorer, and Chrome, starting with Tim Berners-Lee's first browser running on the GUI environment of the NeXT Computer.
The user can enter position information using the mouse, usually with a single point indicated through a larger cursor graphic that changes position on the screen. This is a pair of values (X, Y) linearly chosen from a range more or less determined by the size of the screen. The user can also enter discrete information through typing keys on the keyboard (well suited to entering text and some discrete command operations), clicking mouse buttons (well suited for initiating certain very general command operations made more specific by context or the position of the cursor), and using modifiers like the shift keys and the mouse buttons. To add even more options for input, a set of event directives (like keys) is now provided by a mouse wheel. Additional directives are more recently coming into vogue through use of touch events on the trackpad or mouse surface. (Does this start sounding like playing a complex pipe organ with a multitude of keyboards, stops, and pedals?)
In most cases, the visual display is designed so as to give the impression that there is a natural connection between the manipulation of the mouse, etc., and what happens on the screen. For example, clicking and holding followed by mouse movement results in "dragging" something on the screen, and the mouse wheel initiates vertical scrolling.
The design of the display can be such that position and the visual state of items can serve as a means for indicating context, in addition to any remembered data values or state.
The many means of doing input give the user interface designer many options, and it is not always clear which would be the most obvious to a new user or most effective to use. Training has been going on for many years, starting, for some, with playing Solitaire on Windows to learn clicking and precision dragging (the reason it was included in the system), and continuing with the gradual addition of new affordances that are slowly learned as popular applications start depending upon them for operation.
For many systems, a GUI interface is a built-in part of the operating system. Various system commands may be accessed through it, and parts of that UI may be extended programmatically by application designers, for example to add their own menu items to a header menu. Application designers may also make use of the underlying GUI mechanisms to create interfaces within their applications, sometimes with parts quite different than the system style.
For our purposes here, I will not be including touch-based systems without a separate, high-resolution pointing device and without a separate keyboard. It should be straightforward, though, how to extend the concepts behind this essay to evaluate those systems, too. For example, the linear precision and resolution of a finger on a phone (and the fact that it covers the touched location on the screen from view) provides a smaller range of values than a mouse on a large screen, making it quite different for many applications.
A Command Line Interface makes use of just a normal computer keyboard and a display device capable of displaying lines of text. It is derived from the earliest control languages used to interact with computer systems. Use is usually in an immediate mode, with command lines typed and immediately acted upon, though saved sequences of the same commands are often also used. (In fact, such saved sequences are often used as additional CLI "commands" themselves, and are one of the commonly-cited useful features of CLI interfaces.)
On many systems, the system comes preconfigured with a set of CLI commands. Application developers can easily add their own commands to the system extending the repertoire, sometimes just by adding files with the appropriate name and attributes. Implementing various forms of CLI-style control within an application that takes text input is usually simple for most programmers, especially compared to implementing their own forms of GUI without specialized libraries.
The common, basic syntax for CLI is as follows: The command lines are normally composed of one or more parts separated by space characters. For example, the command to display a detailed listing of text files "ls -l *.txt" has the parts "ls", "-l", and "*.txt". Normally, the first part indicates the command to invoke (the listing command "ls" in our example), with the parts that follow specifying the parameters to the command. Some parameters are values by themselves ("*.txt") while others choose command options or affect the interpretation of parts that follow. The position of the parameters, sometimes in positional relation to options, is used to distinguish between different optional parameters.
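The breakdown of a command line into parts can be sketched in a few lines of Python. This is only an illustration of the common convention described above, not how any particular shell is implemented; the function name parse_command_line and the rule "anything starting with a dash is an option" are simplifying assumptions made for the example.

```python
import shlex

def parse_command_line(line):
    """Split a command line into (command, options, values).

    Assumption for this sketch: any part beginning with "-" is an option,
    and every other part after the command is a value parameter.
    """
    parts = shlex.split(line)            # split on spaces, honoring quotes
    command, rest = parts[0], parts[1:]
    options = [p for p in rest if p.startswith("-")]
    values = [p for p in rest if not p.startswith("-")]
    return command, options, values

print(parse_command_line("ls -l *.txt"))
# prints ('ls', ['-l'], ['*.txt'])
```

In a real Unix shell the wildcard "*.txt" would normally be expanded into matching file names before the command sees it; the sketch leaves it untouched.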
The syntax used by most CLI implementations is very terse, akin to a natural language sentence reduced to just key nouns and verbs. While not natural for conversation, it is similar to notes one may take to concisely document a process and therefore "natural" in that regard.
When entering a command line using the keyboard, various action keys on the keyboard may be used to speed input. For example, the Tab key may autocomplete a parameter based on the context and what has been typed so far, and the Up Arrow may populate the line with a previously typed command and leave it ready for immediate re-execution or inline editing.
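The Tab-key behavior can be approximated with a simple prefix match. The sketch below assumes the "context" is just a list of candidate names (here, hypothetical file names in the current directory) and that completion extends the typed text as far as it can without guessing between candidates; real shells add many refinements.

```python
import os.path

def complete(prefix, candidates):
    """Extend prefix to the longest text shared by all candidates that start with it."""
    matches = [c for c in candidates if c.startswith(prefix)]
    if not matches:
        return prefix                    # nothing to complete; leave input alone
    return os.path.commonprefix(matches)

# Hypothetical directory contents used as context.
files = ["report.txt", "report_final.txt", "notes.txt"]

print(complete("n", files))    # prints notes.txt
print(complete("rep", files))  # prints report
```

The second call stops at "report" because two files share that beginning, which is the point at which the user must type another character to disambiguate.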
The voice input user interface typified by the Amazon Alexa system takes spoken words as input.
Not everything spoken is treated as a command -- much is usually completely ignored. The general listening most of the time is usually only for a triggering word or phrase (e.g., "Alexa", "OK, Google", "Hey, Siri", "Hey, Cortana"). Once that is detected, the sounds that follow are recognized and processed. Many built-in commands are pre-configured in these systems. Application developers can add their own commands to the system, extending the repertoire, through special data structures.
The processing of the sound data uses designer-created sample templates along with various forms of AI to identify the command and the parameters. In addition to the simple values that one might find in a traditional CLI for these parts, the processing uses other words and features of what was said to identify which command was requested and which parameters were provided. For example, "Alexa alert me in 10 seconds" and "Alexa in 10 seconds alert me" both perform the command of setting an alert with the time being 10 seconds from now. "Intelligence" built into the system can turn "tomorrow", "next Friday", and "January 19th" all into dates without any complex work on the part of the application developer making use of a VUI system. The technology needed to determine many classes of values (dates, numbers, city names, and even street addresses) as well as to handle different languages and accents are built-in. Adding a new class of values that consists of a list of words or phrases (like product names) is relatively simple. The values of "missing" parameters can be requested automatically in a natural-feeling dialog.
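To make the idea of designer-created sample templates concrete, here is a deliberately simplified sketch. It works on already-transcribed text after the trigger word, and it matches the samples literally with regular expressions; actual VUI systems use trained models that generalize far beyond the samples. The intent name SetAlert, the slot name seconds, and the data layout are all made up for illustration and are not any vendor's format.

```python
import re

# Hypothetical sample phrasings for one command. "{seconds}" marks a parameter slot.
SAMPLES = {
    "SetAlert": [
        "alert me in {seconds} seconds",
        "in {seconds} seconds alert me",
    ],
}

def template_to_regex(template):
    """Turn a sample phrase into a regex with one named numeric group per slot."""
    parts = []
    for token in template.split():
        if token.startswith("{") and token.endswith("}"):
            parts.append(r"(?P<%s>\d+)" % token[1:-1])   # slot: digits only here
        else:
            parts.append(re.escape(token))               # literal word
    return re.compile(r"\s+".join(parts), re.IGNORECASE)

def recognize(utterance):
    """Return (intent, slots) for the first sample that matches, else (None, {})."""
    text = utterance.strip()
    for intent, templates in SAMPLES.items():
        for template in templates:
            match = template_to_regex(template).fullmatch(text)
            if match:
                return intent, {k: int(v) for k, v in match.groupdict().items()}
    return None, {}

print(recognize("alert me in 10 seconds"))  # prints ('SetAlert', {'seconds': 10})
print(recognize("in 10 seconds alert me"))  # prints ('SetAlert', {'seconds': 10})
```

Both word orders resolve to the same command and the same parameter value, which is the property described above. Unlike a CLI, the position of the parameter within the utterance does not matter to the application code.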
The style of input used by VUI usually mimics normal interpersonal verbal communication. Many of the methods for organizing the parts of commands as well as the words used are the natural ways already second-nature to most people.
One characteristic of communication is feedback that a command was understood correctly and executed. GUI systems often depend on visual feedback, such as changes to WYSIWYG visualizations, or less frequently dialog boxes that must be dismissed. CLI systems are often silent, or extremely terse in their response. The CLI input, though, is usually left visible for validation. VUI systems, where it is harder for the person doing input to validate the input as they are doing it, often depend upon an audio response from the VUI system, with the application designer crafting the response to provide the feedback in ways that help catch errors and also give confidence and perhaps guidance to the user. Of course, if the result of the command is apparent through other means (e.g., turning on a machine, changing a visual display) then that may suffice.
The place of VUI in the world of user interface has recently changed. While previously it was usually part of a few specific application areas built around specific libraries and hardware (such as special headsets with quality microphones) tuned and dedicated to the application and requiring tedious training to recognize the words of each user, it has now broken out into the mainstream. Hardware can now be the ubiquitous smartphone or personal computer, and "smart speakers" have made it useful, reliable, common, and inexpensive in most any room, and able to participate in multiple applications. The use of modern AI technologies with huge volumes of training data has made it usable by multiple individuals with no training to their voices and has expanded the scope of variations understood as well as the ease of "scripting" new interactions. Often, the developer can provide just a few sample phrases that are then extrapolated into a wide range of allowed variations. The use of modern microservice architecture and the availability of cloud-based language processing has made it easy for capable programmers with backgrounds in other user interface methods to add basic VUI to a system without a long, steep learning curve. Services such as "Alexa for Business" are making it easier to tailor VUI to the needs of the enterprise.
Comparing and contrasting
To evaluate our three methods of user interface with respect to the attributes of choice of operation, specification of parameters, context, level of precision, and skills needed for implementation, let us look at some of the strengths and weaknesses of each.
GUI makes it easy to choose from items presented on the screen. The user just looks at the screen and clicks on the item they desire. However, the actual size of the screen, and the speed with which you want a user to choose an item, affects how large a set of items may be displayed at once.
For example, in drop-down menus, there is often a reasonable limit of a dozen or two items in a column. Helping the user quickly find a particular item, distinguished from others, may involve local grouping and layout techniques, as well as visual cues like icons and text. This can further limit how many items may be displayed at once. That then necessitates either having scrolling lists, or needing to "drill-down" to sub-menus or property sheets, slowing down operation. Finding an item in a long list can be very slow, especially when the ordering of the list (such as alphabetical) has little relationship to the meaning to the user. Hierarchical data can be easy to represent but tedious to navigate with special motion at each level. Some values that must be input lend themselves well to an ordered list or a 1 or 2-dimensional layout, such as alphabetized names (that you know how to spell) or the positioning of a visual item relative to another at screen resolution or finding an approximate color on a palette, while others do not (address values, numbers, exact color values).
To make it easier to make sure that the user chooses the item they intend, use of extra space (e.g., larger buttons) can help, as well as redundancy with combinations of text, icons and other visual cues, and grouping. This, though, reduces the space available for additional items.
So, GUI is good for applications that involve choices from a well-defined set that is not too broad at each level. It is good for communicating values that lend themselves to expression as a position (in one to two dimensions) at a resolution within perhaps 200 steps in each direction. It is not as good for choosing from a wide non-ordered space of discrete values. (That is, not an order in one or two dimensions that makes direct navigation easy.) At that point, it may degenerate into requiring the user to switch to the keyboard, switching back and forth to the pointing device. Many applications have a limited number of top-level commands, so this isn't a major restriction for that, but for many parameter values it can be a source of user friction.
GUI can provide context through visual and spatial cues. An especially good cue, and a major design feature of GUI, is being able to tie the command "buttons" and parameter settings to a visual representation of data, especially as WYSIWYG. In typing, it can provide context through auto-complete and proposed values. GUI systems can be helpful for "just in time learning" requiring less extensive training than many other methods. However, providing pre-stored commands and parameters is not usually a strength of GUI systems.
Developing applications for GUI systems is usually specific to the underlying system. Being fluent in developing for one (for example, Windows) does not necessarily help with another, such as a browser, or a mobile device.
Contrasted to GUI, CLI systems use typed input for commands and values, so the number of different values for commands or parameters can be arbitrarily large without any change to the interface. (For example, Wikipedia lists 160 top-level commands for Unix, with an average length of only a little over 4 characters each.) The main speed up is having shorter names, such as for common commands, options, or values. CLI is very good for supporting wide, non-ordered spaces of discrete values. It also pays little penalty for hierarchical data (for example, requiring just a "/" or "." to switch levels).
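The small penalty for hierarchical data can be illustrated with a sketch in which one typed path replaces a series of drill-down steps. The nested catalog and the printer name in it are invented for the example, and resolve is a hypothetical helper, not part of any system mentioned here.

```python
def resolve(tree, path, sep="/"):
    """Walk a nested dictionary using a single typed path such as 'a/b/c'."""
    node = tree
    for name in path.strip(sep).split(sep):
        node = node[name]                # one level per separator
    return node

# Hypothetical hierarchy of printers by location.
catalog = {"printers": {"floor2": {"laser": "HP-2F-01"}}}

print(resolve(catalog, "printers/floor2/laser"))           # prints HP-2F-01
print(resolve(catalog, "printers.floor2.laser", sep="."))  # prints HP-2F-01
```

Each additional level costs the user one separator character and a name, where a GUI would typically need another menu or panel to be opened.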
CLI systems lend themselves very well to pre-stored commands and sequences of commands, including with parameterization of those sequences and the chaining of commands, passing the results of one to the input of another. This lends itself well to controlling repeated complex, multi-step operations.
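As a rough sketch of what a stored, parameterized, chained sequence amounts to, the Python below mimics something like the shell pipeline "grep error | sort | head -n 2" using plain functions. The stage functions and the sample log lines are stand-ins written for this example; they only imitate the behavior of the like-named Unix commands on lists of strings.

```python
from functools import reduce

def grep(pattern):
    """Stage: keep only the lines containing pattern."""
    return lambda lines: [line for line in lines if pattern in line]

def sort_lines():
    """Stage: sort the lines."""
    return lambda lines: sorted(lines)

def head(n):
    """Stage: keep the first n lines."""
    return lambda lines: lines[:n]

def pipeline(*stages):
    """Chain stages so each one's output becomes the next one's input."""
    return lambda lines: reduce(lambda data, stage: stage(data), stages, lines)

def first_matches(pattern, n):
    """A stored sequence with two parameters, akin to a small shell script."""
    return pipeline(grep(pattern), sort_lines(), head(n))

log = ["error: disk", "ok", "error: cpu", "error: net"]
print(first_matches("error", 2)(log))
# prints ['error: cpu', 'error: disk']
```

Once defined, first_matches behaves like a new command in its own right, which is the feature that makes saved sequences so useful for repeated multi-step operations.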
CLI, though, has an issue in requiring the specification of commands and parameters as exact text. For many values, and for large sets of infrequently used values, this is cumbersome and error-prone. Having a "current directory" in the file system as context, and the feature of auto-complete using that context, helps. CLI also often has, at best, the context aid of a "help" feature that gives separately displayed written guidance about parameters, their positions, and options. This requirement of exactness is not as much of a problem when stored command sequences are used since they may be slowly constructed with care and with reference to documentation and then tested. It does, though, limit use of CLI interfaces to users who are willing to, or have good motivation to, learn the specifics. It does not lend itself to casual use.
VUI uses voice to input commands, parameters, and options. As sort of a spoken version of CLI, the number of different values for commands or parameters can be arbitrarily large without any change to the interface. It is very good for supporting wide, non-ordered spaces of discrete values. It probably pays a higher penalty than CLI for hierarchical data, depending upon how that is represented in a spoken fashion. It is not very good for positional data.
VUI could lend itself to creating stored command sequences, but that is not too common at present, and the means of doing parameterization within the sequences has not been standardized. Chaining of commands, with passed context, is now common.
One area that VUI can excel at is in the tolerance for variations of spelling, typing, and even phrasing compared to CLI. Spoken words often have redundancy not available to text in common CLI systems, especially with the addition of AI systems that have been extensively trained and usage instructions that encourage wording that adds additional information for clarity (as a person would use when speaking to another). There are often multiple ways to express the same ideas that make it more natural to vary exactly what is said. The AI component, and the scripting variations that application developers put in the system to aid those algorithms for their particular applications, can often handle these variations without error.
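The difference between exact matching and tolerant matching can be shown with a text-only stand-in. The sketch uses approximate string matching from Python's standard difflib module in place of the trained speech and language models a VUI system would actually use, so it illustrates the effect rather than the mechanism. The product list and the cutoff value are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of allowed values, such as product names.
PRODUCTS = ["widget", "gadget", "sprocket", "flange"]

def exact_lookup(word):
    """CLI-style lookup: the text must match a known value exactly."""
    return word if word in PRODUCTS else None

def tolerant_lookup(word, cutoff=0.6):
    """Return the closest known value, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), PRODUCTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(exact_lookup("sproket"))     # prints None
print(tolerant_lookup("sproket"))  # prints sprocket
```

The exact lookup rejects the slightly wrong input, while the tolerant lookup recovers the intended value. Raising the cutoff makes the matching stricter, at the cost of rejecting more near misses.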
VUI is different than GUI and CLI in that the user's hands and eyes are free to be doing other things. The user can be sitting down or standing up, or even walking around. A plus and a minus is that others around the user can hear exactly what they are saying. This is good for transparency, so that others do not feel left out when the user is executing commands, but it can be annoying to others and provide transparency when it is not desired.
It was pointed out to me that besides hands-free and transparency, another aspect to look at is whether or not the user is creating something through use of the commands. That is, whether the user has a "conversation" with the system as the process of creation. This is especially true with WYSIWYG applications. The precision, as measured by the user's vision, of making a change and determining exactly when to start and stop it (such as positioning or sizing something) can be much better and quicker with GUI than voice. CLI often isn't applied there. (However, an extension of CLI in the spreadsheet interface (when controlled strictly by keyboard, as in the old PC days) lets you construct a formula with a mixture of typing and "pointing" with arrow keys on a tabular grid.) Voice as a simple trigger, together with visual feedback, such as to say "Now!" when conditions being viewed are as desired, can be good, especially when hands are busy. However, the cliche verbal "Move it on the wall... a little lower... a little lower... no, a bit higher..." to control and trigger can be cumbersome.
Choices for business applications
When should a business application be implemented using VUI? From what I've written above, it seems that applications that benefit from hands-free operation and speaking out loud, and that consist of choosing commands and values from easily remembered or easily composed words and phrases or certain values with a rich verbal vocabulary like dates and times, are good candidates for VUI. This is especially the case when the scope of the choices is a very large, unordered set of discrete values. Sequences of commands, where the response to one command by the system is then followed on with the user giving another command using that context, are also good candidates.
In today's world, if the underlying system being controlled is implemented with a good microservice or API easily accessible from a microservice, building a simple VUI for that system as a prototype is not very difficult for many experienced application developers even if they have little experience with VUI systems themselves. The development environments provided by Amazon and others can shorten the learning curve, with proofs of concept buildable in no more than a few days, if that. However, crafting an efficient and robust voice interface for a particular application should take much more additional time, with careful studying of actual usage and functionality provided by the particular VUI system being used.
I tried this myself recently. I programmed Amazon Alexa to interface with a simple data capture system with mobile forms input to get totals and subtotals of sales. It was surprisingly easy to get results that many people have found instructive with regard to the potential. I made a one and a half minute video titled "Simple business app with Alexa":
Video on YouTube
For a demonstration, then, it may be worthwhile trying to build a simple VUI if the application warrants it. For some applications this form of control could be the dominant one for some users. I hope that some of the ideas in this essay can help you determine the ones for which it might be worth the try.
- Dan Bricklin, 26 January 2018
© Copyright 1999-2018 by Daniel Bricklin
All Rights Reserved.