How does Merlyn Mind’s AI turn natural language utterances into actions? A previous post explored the first step in this process: an automatic speech recognition (ASR) model that produces text transcriptions from raw audio. The next step is to turn those text transcriptions into more abstract, machine-readable representations that help Merlyn’s client applications, like the Symphony Classroom device and browser extension, determine the appropriate action(s) to take.
What information must these representations contain in order to specify a given action? And how can that information be identified from natural language expressions? In this post, we’ll illustrate how these general questions apply to some of our specific use cases at Merlyn Mind, using common classroom commands as examples. Without going into too much technical detail, we’ll put several different types of commands under the microscope and discover that they all share underlying structural similarities once we look at them through the right lens!
Let’s begin by considering a familiar real-world example. Imagine you are driving on the highway when you see a sign that says (1). A short while later, you encounter another sign, which says (2).
(1) “REDUCE SPEED BY 10 MPH”
(2) “REDUCE SPEED TO 10 MPH”
You’ve probably realized immediately, without even thinking about it, that these two signs are making very different requests. Let’s consider how to represent this difference in a generalizable way. Both signs ask you to change your speed, but the specific change is determined by some combination of three pieces of information:
• A direction of change (in both signs, the word “REDUCE” is telling you to go slower, not faster)
• An interval of change (in the first sign, “10 MPH” tells you by how much to change your speed)
• A target point (in the second sign, “10 MPH” tells you the speed you should reach after the change)
Natural languages encode these pieces of information in various ways. In English, intervals of change are often (but not always) marked with the preposition “by.” Target points are often (but not always) marked with the preposition “to” or “at.” So, if you hear, “Reduce your speed by 10 mph to 30 mph,” you know that “10 mph” indicates the interval of change and “30 mph” indicates the target, not the other way around.
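To make this concrete, here is a minimal sketch of how the three ingredients might be stored in a small data structure and filled in from the “by” and “to”/“at” cues described above. The names `SpeedChange`, `DIRECTION_WORDS`, and `parse_speed_request` are hypothetical, and the regex-based approach is an assumption chosen for illustration; it is not a description of Merlyn Mind’s actual implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeedChange:
    """The three ingredients; any of them may be missing."""
    direction: Optional[int] = None   # +1 = faster, -1 = slower
    interval: Optional[float] = None  # how much to change, in mph
    target: Optional[float] = None    # speed to reach, in mph


# Hypothetical, deliberately tiny lexicon of direction words.
DIRECTION_WORDS = {"reduce": -1, "decrease": -1, "increase": +1}


def parse_speed_request(text: str) -> SpeedChange:
    """Fill in whichever ingredients the text supplies."""
    lowered = text.lower()
    change = SpeedChange()

    for word, sign in DIRECTION_WORDS.items():
        if re.search(rf"\b{word}\b", lowered):
            change.direction = sign
            break

    # "by N mph" marks an interval; "to N mph" or "at N mph" marks a target.
    by_match = re.search(r"\bby (\d+(?:\.\d+)?) mph\b", lowered)
    to_match = re.search(r"\b(?:to|at) (\d+(?:\.\d+)?) mph\b", lowered)
    if by_match:
        change.interval = float(by_match.group(1))
    if to_match:
        change.target = float(to_match.group(1))
    return change
```

With this sketch, sign (1) yields a direction and an interval but no target, sign (2) yields a direction and a target but no interval, and “Reduce your speed by 10 mph to 30 mph” yields all three. Because the prepositions are only “often (but not always)” reliable, a rule this simple would miss many real phrasings.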
If you speak a language besides English, try formulating these sentences in that language — you may find that targets and intervals are expressed with different prepositions, different case markers, different word orders, or something else entirely!
Note that neither the interval nor the target is strictly necessary. For instance, the sign might have said simply, “REDUCE SPEED,” in which case it would be left to you, the driver, to infer the requested degree of reduction, perhaps on the basis of context or convention. (And if there were an opportunity to communicate with the sign, you might ask it to provide the interval or target!)
How do targets, directions, and intervals apply to the utterances we hear in the classroom? As it turns out, these same three ingredients form the “skeleton” of many different types of commands handled by the Symphony Classroom device and browser extension. Let’s explore several such cases.
Case 1: Navigating Video and Audio Files
Suppose you are watching a video file and want to jump to a different location in the playback. With our three ingredients in mind, consider the various ways you might express this command. You might use:
A direction plus an interval: “Skip ahead by two minutes.”
A direction plus a target point: “Rewind to 35:10.”
Just a target point: “Go to 35:10.”
Just a direction: “Go back,” or, “Skip forward.”
(You may recognize this last case, where only a direction is supplied, as analogous to the “REDUCE SPEED” example above. What might be an appropriate action to take in this case?)
The natural language expression of these commands varies widely, but our abstract representations will contain some combination of these same three ingredients. To identify these ingredients from the natural language input, we’ll generally consider the following questions:
• What are the terms that can be used to denote a positive or negative direction of change for this action? For audio-video navigation, these include (fast-)forward, rewind, ahead, back, etc.
• Are there special terms that denote units in which the target and interval are expressed? In the case of audio-video navigation, these are generally units of time: seconds, minutes, hours, etc.
• Are there any special terms for certain points on the scale? In this case, these might include beginning and end, and maybe even things like credits, third movement, scene five, etc.
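The three questions above can be sketched as three small lexicons plus a function that looks for each ingredient in turn. Everything here is an assumption made for illustration: the word lists are far from complete, the names (`DIRECTIONS`, `UNIT_SECONDS`, `NUMBER_WORDS`, `SPECIAL_POINTS`, `parse_navigation`) are hypothetical, and the sketch treats any “number plus time unit” phrase as an interval, which a production system could not safely do.

```python
import re

# Assumed lexicons; a real system would need much broader coverage.
DIRECTIONS = {"forward": +1, "ahead": +1, "back": -1, "rewind": -1}
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10}
SPECIAL_POINTS = {"beginning": "START", "end": "END"}


def clock_to_seconds(clock: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    total = 0
    for part in clock.split(":"):
        total = total * 60 + int(part)
    return total


def parse_navigation(text: str) -> dict:
    """Return the direction, interval (seconds), and target found in text."""
    lowered = text.lower()
    result = {"direction": None, "interval": None, "target": None}

    # Question 1: which words signal a direction of change?
    for word, sign in DIRECTIONS.items():
        if re.search(rf"\b{word}\b", lowered):
            result["direction"] = sign
            break

    # Question 2: which units express amounts? Here, "<number> <time unit>".
    m = re.search(r"\b(\d+|[a-z]+) (second|minute|hour)s?\b", lowered)
    if m:
        raw = m.group(1)
        amount = int(raw) if raw.isdigit() else NUMBER_WORDS.get(raw)
        if amount is not None:
            result["interval"] = amount * UNIT_SECONDS[m.group(2)]

    # Question 3: a clock time after "to", or a named point on the scale.
    m = re.search(r"\bto (\d+(?::\d{2}){1,2})\b", lowered)
    if m:
        result["target"] = clock_to_seconds(m.group(1))
    else:
        for word, point in SPECIAL_POINTS.items():
            if re.search(rf"\b{word}\b", lowered):
                result["target"] = point
                break
    return result
```

For example, “Skip ahead by two minutes.” produces a direction of +1 and an interval of 120 seconds, while “Rewind to 35:10.” produces a direction of -1 and a target of 2110 seconds. “Go back” produces only a direction, leaving the client application to decide how far to move.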
Case 2: Adjusting the Volume
Suppose you want to adjust the volume on the Symphony Classroom device. Consider some of the ways you might express this command:
A direction plus an interval: “Turn up the volume by two.”
A direction plus a target: “Lower the volume to level five.”
Just a target: “Set the volume to seven.”
Just a direction: “Turn up the volume,” or, “Quieter, please!”
Our same three ingredients show up! To identify these ingredients in the natural language input, we’ll consider the same questions as above, though the answers differ:
• What are the terms that can be used to denote a positive or negative direction of change? In the case of volume adjustment, these will include terms like increase, up, loud(er), raise, and decrease, reduce, lower, quiet(er), down.
• Are there special terms that denote units in which the target and interval are expressed? In principle, we might expect something precise like decibels, but in practice, we use numerals that stand in for points on our sparser volume scale.
• Are there any special terms for certain points on the scale? In the case of volume, these might include maximum, minimum, and perhaps mute.
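Once the ingredients have been identified, something still has to turn them into a new volume level. The sketch below shows one way that step might work. The 0 to 10 scale, the default step of 1 for direction-only commands, the treatment of “mute” as the minimum level, and the name `resolve_volume` are all assumptions made for illustration; the post does not specify how the device actually behaves.

```python
from typing import Optional, Union

MIN_VOLUME, MAX_VOLUME = 0, 10  # assumed scale
DEFAULT_STEP = 1                # assumed step when only a direction is given

# Assumed mapping from special terms to points on the scale.
SPECIAL_POINTS = {
    "minimum": MIN_VOLUME,
    "maximum": MAX_VOLUME,
    "mute": MIN_VOLUME,
}


def resolve_volume(current: int,
                   direction: Optional[int] = None,
                   interval: Optional[int] = None,
                   target: Optional[Union[int, str]] = None) -> int:
    """Combine whichever ingredients are present into a new volume level."""
    if target is not None:
        # An explicit target wins; the direction is not needed to reach it.
        new_level = SPECIAL_POINTS[target] if isinstance(target, str) else target
    elif direction is not None:
        # No target: move by the interval, or by a default step if none given.
        step = interval if interval is not None else DEFAULT_STEP
        new_level = current + direction * step
    else:
        # Nothing usable was supplied, so leave the volume unchanged.
        new_level = current
    # Keep the result on the scale.
    return max(MIN_VOLUME, min(MAX_VOLUME, new_level))
```

Under these assumptions, “Turn up the volume by two” at level 4 gives 6, “Set the volume to seven” gives 7 regardless of the current level, and “Turn up the volume” at level 4 gives 5. The sketch ignores a direction that conflicts with its target (for example, “Lower the volume to level five” when the volume is already at 3); a real system might want to ask for clarification instead.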
Now that you know what to look for, let’s briefly examine several other classes of commands where we find these same three ingredients at work.
Case 3: Setting and Adjusting Timers
If you want to set or adjust the timer on the Symphony Classroom device, you might use one of the following combinations of ingredients (these may start to look quite familiar by now!):
A direction plus an interval: “Shorten the timer by two minutes.”
A direction plus a target: “Extend the timer to 10 minutes.”
Just a target: “Set the timer to five minutes.”
Just a direction: “Extend the timer.”
Case 4: Assigning and Deducting Points
The Merlyn Mind browser extension supports an integration with ClassCraft that allows teachers to assign points to classes or teams. Consider the commands that might be used as part of this functionality.
Case 5: Scrolling Within a Document
Finally, consider the commands you might use to scroll up and down within a document.
For each of the previous three cases, see if you can answer these questions yourself:
• What are the terms that are used to denote the positive and negative directions of change?
• What are the terms used to denote the units in which targets and intervals are expressed?
• Are there any special terms for certain points on the scale?
And now that you know all about directions, targets, and intervals, what other types of commands can you think of that might use some of these same ingredients?
Jeremy Hartman is a Principal Engineer at Merlyn Mind. He works on natural language understanding.