#9: How Alexa, Siri and Co. are learning to understand us better

Whether it’s to say “OK Google, tell me what the weather will be like”, “Hey Siri, set a timer for 15 minutes” or “Alexa, play music” – in our everyday lives it’s perfectly normal to talk to voice assistants and, when it works well, for them to answer us. Communication problems, however, still trouble the relationship between human and machine, often resulting in the frustrating sentence “Sorry, I didn’t understand that.” Junior professor Ingo Siegert of the Institute for Information Technology and Communications at the University of Magdeburg is researching how communication between humans and their voice assistants can become more successful. In the new episode of “Wissen, wann du willst” (“Know When You Want”), he discusses where the problems in communication lie, how voice assistants have been exploited for advertising, and whether Alexa and Co. really are listening all the time.

Today’s guest

Jun.-Prof. Dr. Ingo Siegert researches and teaches at the Institute for Information Technology and Communications within the Faculty of Electrical Engineering and Information Technology. He studied information technology at the University of Magdeburg, was awarded a doctorate in 2015, and has been a junior professor since 2018. One focus of his research is human-machine interaction. As part of his lectures, he has developed the “Sprich mit der Uni Magdeburg!” (“Speak with the University of Magdeburg!”) voice skill with students, which is designed to help with online course orientation.

 

*the audio file is only available in German

 

The Podcast to Read

Intro voice: "Wissen, wann du willst." The podcast about research at the University of Magdeburg.

Lisa Baaske: Whether it’s to say “OK Google, tell me what the weather will be like”, “Hey Siri, set a 15-minute timer” or “Alexa, play music” – in our everyday lives it’s perfectly normal to talk to so-called voice assistance systems and, when it works well, for them to answer us. Often, however, the communication is not that perfect. We still hear “Sorry, I didn’t understand that” much too often. Junior professor Ingo Siegert of the Institute for Information Technology and Communications at the University of Magdeburg is researching how communication between humans and machines can be more successful. And with that, let’s give today’s guest a warm welcome. Before we get technical, the first and most burning question is of course: Do you often use voice assistance systems privately and what for?

Prof. Ingo Siegert: Well, I hardly use them privately. At most, to quickly dictate messages. But if I speak too fast or use certain expressions, the system doesn’t always understand me.

Lisa Baaske: That effectively brings us to the basic problem that we’re talking about today ...

Prof. Ingo Siegert: Exactly.

Lisa Baaske: What do you ultimately find so fascinating about speech dialog systems? So, why did you decide to pursue this research area?

Prof. Ingo Siegert: Because it brings together so much. It’s engineering to even be able to receive the acoustic signals properly in the first place. It’s computer science to then train, so to speak, the artificial intelligence behind it, but psychology obviously plays a big role too: user acceptance of the technology. It all comes together. It’s a very interdisciplinary research field, and it’s simply exciting to work at all these interfaces.

Lisa Baaske: OK, that certainly sounds relatable, I would say. But how exactly do voice assistants work? How do the machines manage to communicate with us?

Prof. Ingo Siegert: There are lots of small subproblems that need to be solved. The most important thing, I believe, is that the systems are capable of transforming our language – which in the end is just speech acoustics, i.e. waves that hit microphones or come out of speakers – into understandable units. So first of all, acoustics must be converted into sounds. That is modeled closely on the biological patterns of how humans produce speech. The aim is to capture that in models, and at the end you have a kind of sound soup in which the next step is to identify meaningful words as units. A lot can go wrong there, because we humans have a strong tendency, particularly when we’re talking, to slur words, to swallow endings, to not speak clearly, and afterwards an attempt must be made to make sense of the words that have now been strung together. Mostly it’s straightforward. If I say “Alexa, play music” or “Siri, save such and such a reminder!” then there aren’t many variations on what is said and how it can be said. But even with the weather it varies greatly. I can ask: “Siri, will it rain tomorrow? What will the weather be like in Berlin tomorrow? Do I need my umbrella tomorrow? Do I need sunscreen? Can I wear shorts?” There are so many variations that we humans can come up with when asking a question that the system – which at this point is also programmed by humans – sometimes can’t follow and doesn’t understand. That’s when we hear that well-known phrase: “I’m sorry, I didn’t understand!”
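To make that last point concrete, here is a toy sketch (my illustration, not anything from the interview; all intent names and phrasings are hypothetical) of a naive pattern-based intent matcher. It only understands utterances its developers anticipated, which is exactly where the well-known apology comes from:

```python
# Toy sketch of a naive pattern-based intent matcher.
# All intent names and phrasings are hypothetical examples.

INTENT_PATTERNS = {
    "weather": [
        "what will the weather be like",
        "will it rain",
        "do i need my umbrella",
    ],
    "timer": ["set a timer for"],
}

def match_intent(utterance):
    """Return the first intent whose pattern occurs in the utterance, else None."""
    text = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(p in text for p in patterns):
            return intent
    return None  # -> "I'm sorry, I didn't understand!"

print(match_intent("Will it rain tomorrow in Berlin?"))  # weather
print(match_intent("Can I wear shorts tomorrow?"))       # None: phrasing nobody anticipated
```

Real systems use statistical language understanding rather than literal patterns, but the failure mode – phrasings the developers never modeled – is the same.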

Lisa Baaske: So, you just touched upon it: We all know that often Alexa doesn’t understand us. This causes great despair and is also exasperating. So, is it that Alexa effectively doesn’t understand us because we speak too indistinctly, because we have a dialect? Why doesn’t she understand us?

Prof. Ingo Siegert: All of those things. For one thing, because each of us speaks a little differently and that needs to be modeled. Then because we speak certain dialects or use certain words that aren’t so well known. Then because we don’t speak the way the developers of Alexa, Siri and Co. envisaged. For instance, early on with Siri you could add an appointment and at the end you had to confirm it. The question Siri then asked was: “Should I add the appointment?” As a normal user I would say “Yes”, and so probably would you. But on the GUI, the graphical user interface, the two buttons were “Confirm” and “Cancel”. So there you had to say “Confirm” to confirm the appointment. This discrepancy between what the system expects and what I as a user would say in that situation comes up again and again. And then there’s the fact that voice assistants are seen as knowing everything, so people ask all kinds of questions from every imaginable direction. I can ask: How is the weather? Add an appointment! Set a timer! But also: Who is the current chancellor of Germany? How old is the chancellor? Anything. Being able to represent this variety in language is not easy at all. And even though there can be various reasons why Alexa isn’t understanding us – we’re speaking indistinctly, or the acoustics weren’t captured, or we didn’t use the expression that was expected, or we jumped from one topic to another in the dialog, or Alexa really doesn’t know the answer – she mostly just says, “Sorry, I cannot help you with that,” without giving a reason. It’s always difficult when you don’t know why something failed in a specific case. And that’s frustrating.

Lisa Baaske: Yes, definitely (laughs). In one project you’re investigating the extent to which communication between humans differs from interaction with technical systems. How can something like that actually be studied?

Prof. Ingo Siegert: By doing experiments in which users speak both with other human beings and with technical systems. Ideally in the same type of dialog, because it makes a difference whether I’m talking with my best friend about sports or arranging an appointment with Alexa. Those will differ. So you have to find a setting in which the type of dialog is similar. Then you make the recordings accordingly and examine them to see whether something differs prosodically. Whether, for instance, the user has greater variability in their intonation when talking with another human, whether there are more variations at the end of statements – the intonation sometimes going down or up – and whether these things also occur when the user is talking to a technical system. And obviously you can also look at the meaning of the words, how many words are used, and how many different variations there are in the formulation of certain statements.
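As a rough sketch of the kind of measurement described here (my reconstruction, not the project’s actual pipeline; the file names are hypothetical), one could extract the fundamental frequency contour of each recording with a library such as librosa and compare its variability across the two conditions:

```python
# Sketch: quantify intonation variability as the spread of the F0 contour.
import numpy as np
import librosa

def f0_variability(path):
    y, sr = librosa.load(path, sr=None)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return float(np.nanstd(f0))  # standard deviation over voiced frames, in Hz

# Hypothesis from the experiments: more variability when talking to humans.
print("human-directed:", f0_variability("talking_to_friend.wav"))
print("device-directed:", f0_variability("talking_to_alexa.wav"))
```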

Lisa Baaske: So you recorded many people talking to other people and then, effectively, how they would talk to Alexa too?

Prof. Ingo Siegert: Exactly. The question behind it for us was: Is it possible, based on prosody – i.e. how someone talks – to distinguish whether the user is talking to a technical system or to a human being? Even though it sounds like a banal question, it can bring very great advantages, because at the moment I have to say “Alexa”, “Hey Google”, “Hey Siri …” for the device to come on. On the one hand that’s relatively unnatural, because I have to say it every single time, and a longer conversation is something these systems simply can’t do. On the other hand, there are still a lot of erroneous detections. The classic one with Alexa is when somebody called Alexa is with me. But other words that sound phonetically similar to Alexa – Alex, whatever – also activate these devices in situations when they shouldn’t be activated. If the devices could, in addition to the language content, i.e. the sounds and the words, also identify how something was said, and distinguish whether users are speaking with a system or with a human, then this additional information could serve to reduce these erroneous detections. So if users speak to a system with less modulation – the intonation very monotonous, or often not strongly accented – then the systems could detect that and only come on when they are really intended to.
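A minimal sketch of how such prosodic cues could gate activation (an illustration with invented feature values and labels, not the actual research system): a simple classifier decides device-directed versus human-directed, so the wake word alone is no longer enough.

```python
# Sketch: gate wake-word activation with a device-directed-speech classifier.
# Feature values and training labels below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per utterance: [F0 std (Hz), F0 range (Hz), mean energy]
X_train = np.array([
    [12.0,  40.0, 0.02],   # monotonous, flat -> device-directed
    [15.0,  55.0, 0.03],
    [48.0, 180.0, 0.05],   # lively intonation -> human-directed
    [52.0, 200.0, 0.04],
])
y_train = np.array([1, 1, 0, 0])  # 1 = device-directed

clf = LogisticRegression().fit(X_train, y_train)

def should_activate(wake_word_heard, prosody_features):
    """Activate only if the wake word was heard AND the speech sounds device-directed."""
    device_directed = clf.predict(np.atleast_2d(prosody_features))[0] == 1
    return wake_word_heard and device_directed

print(should_activate(True, [14.0, 50.0, 0.02]))   # True: sounds like a command
print(should_activate(True, [50.0, 190.0, 0.05]))  # False: probably talking to Alex, the human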

Lisa Baaske: The thought has just occurred to me: If people are listening to the podcast now and we keep saying Alexa and obviously they have an Alexa at home, then it might also now keep coming on in all probability. Sorry about that. (laughs)

Prof. Ingo Siegert: Exactly, and others have exploited that. There was a famous Super Bowl commercial where Burger King bought a slot – I believe only 15 seconds rather than 30 – and in that time the actor in the commercial simply asked Alexa to tell them about the Whopper burger.

Lisa Baaske: (laughs) Oh, clever.

Prof. Ingo Siegert: And then this Wikipedia entry that’s read out by Alexa was manipulated accordingly. And then they had another 30 seconds or so of extra advertising for all customers who had that. In response Amazon obviously had to adjust the algorithms and they now have a kind of ad recognition ...

Lisa Baaske: Ah!

Prof. Ingo Siegert: … which works by pre-screening all commercials to see where the Alexa keyword occurs and what is said afterwards. If the device then recognizes exactly what follows the Alexa keyword in such a commercial, it marks it as advertising and doesn’t respond to it.
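Amazon hasn’t published how this actually works; as a rough sketch of the pre-screening idea he describes, one could fingerprint the audio that follows the wake word in known commercials and suppress matching activations. Everything below, including the file name, is a toy illustration:

```python
# Toy sketch of the pre-screening idea: fingerprint the audio after the
# wake word in known commercials and ignore matching activations.
import hashlib
import numpy as np
import librosa

def fingerprint(y, sr):
    """Very coarse hash: the quantized average MFCC vector of the clip."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return hashlib.sha256(np.round(mfcc.mean(axis=1)).astype(int).tobytes()).hexdigest()

# Offline: pre-screen every known commercial once (hypothetical clip).
ad_clip, sr = librosa.load("whopper_ad_after_wakeword.wav", sr=16000)
AD_BLOCKLIST = {fingerprint(ad_clip, sr)}

def handle_activation(y, sr):
    if fingerprint(y, sr) in AD_BLOCKLIST:
        return None  # recognized as advertising: stay silent
    return "route to the cloud for normal processing"
```

A production system would need a far more robust fingerprint (e.g. spectral-peak hashing as used in audio search), but the control flow is the point: known ad audio is matched before the assistant is allowed to respond.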

Lisa Baaske: Ah, OK (laughs), very interesting in any case. But what then is the aim of your project generally? So ultimately Alexa and Co. should understand us completely and communicate with us as if they were also human?

Prof. Ingo Siegert: No. I see technical systems always as technical assistants for specific situations in which we users could use support to make everyday life better. They shouldn’t replace other people; rather, they can act as a type of assistant in certain situations. Obviously, it would also be nice if these systems understood a little more about human communication. At the moment, when we humans speak, we don’t just use language content to communicate with the other person, we also use prosody. What was said and how it was said are therefore always both important. So if I mean something sarcastically, or as a command, or as a request, I can convey that through the content and also through the prosody. Voice assistants only understand the content; the prosody is totally immaterial to them. In most cases that’s OK. If I just want to know a fact, it’s enough to simply state it as a command. But in many cases it would perhaps be good if the assistants understood a little more of this prosody.

One very mundane example: I’m sitting in my car in the morning, programming my navigation system by voice because I have to get to an important appointment, and I’m already slightly rushed because I’m running late. If the assistant noticed “Oh, he’s in a rush,” then instead of silently taking the preferred route from my default settings, it could make a suggestion right away: “I notice that you’re in a bit of a rush today. If we take the fastest route, we’ll arrive 10 minutes sooner.” That’s the kind of support I’d get from a human passenger beside me who was navigating for me with a road map. They would probably notice my state and respond accordingly.

Looking at it the other way around: if I’m driving on vacation, I have time, and I’ve already been driving for six hours. There are tiredness detection systems in cars that display a coffee cup when your eyes slowly start to close. But it would be much nicer if at that point there were perhaps also a voice interaction like: “Oh, you’ve already been driving for six hours. In half an hour there’s a nice rest stop with great ratings. Shall we stop there?” It’s little things like that where technical systems can be a little more helpful in everyday life – without outright anthropomorphizing them, still viewing them as technical systems with a specific task, but ones that could facilitate communication a little. And that, I believe, is an aim I’m pursuing.


Lisa Baaske: I understand. Sounds cool at any rate. It would be really great if that were possible one of these days. You’ve been researching this for quite a while now. Was there ever a point where you said: I’ve reached a giant dead-end here? That was a huge challenge! Something along those lines?

Prof. Ingo Siegert: Yes. The big problem we have in studying how users speak is that we need examples of users speaking in a certain way in certain situations. And that’s a bit of a chicken-and-egg problem, because if I have a technical system that can’t detect any emotions, how do I get users to speak to it as if it could? That was the big issue: why should people speak with a system in an emotional way – so that I can analyze how emotionally they speak with it – without them knowing whether the system can actually do that?

There is a vast amount of data that other researchers have recorded to try to reconstruct how users speak with technical systems. However, those recordings are all relatively short: three or four commands are given and that’s it. That makes it relatively difficult to observe how a user’s way of speaking with a system unfolds over a longer interaction, so we first had to generate data ourselves as part of several research projects, including in association with the big group where I then went on to work as a PhD student. And once you have the data, you still have to evaluate it and see: OK, here we had such and such emotionality and there such and such emotionality. How do you work with it so that, at the end, you have valid statements? You first have to do a lot of reading and consider how that works.


Lisa Baaske: Was there ever a moment where you said: Now that’s perhaps not my area? That’s too challenging for me now. Perhaps I should do something else?

Prof. Ingo Siegert: Nope, I’ve always found it exciting. This interdisciplinarity has always appealed to me. And the nice thing about it, which for me is what it’s all about, is gaining knowledge. I have certain ideas of how things could be. Then I look at the literature to see what has been done. Then I construct a hypothesis and want to find out whether I can confirm it. Obviously, it also happens that I construct a hypothesis that is not confirmed. That happens quite a lot. And then you have to look at it and ask: OK, why is that? Was the experimental design that I ran at least correct? Could it be that I made an error somewhere else?

But that’s also good to know, so that you can say: OK, that’s not the way it is. Admittedly, all you can then show is an example of something that didn’t work, but that also adds to the body of knowledge. Sometimes that’s really important. And I find it a bit of a pity that academia at present is always just about producing new knowledge – getting better and better, faster and faster systems – while confirming other researchers’ experiments has been pushed into the background a little. If, for instance, academics in Japan have found that robots can provide support in the care of the elderly – because it delays the onset of dementia if the residents of a care home can interact with robots, which touches their emotional world on some level – then these experiments need to be repeated in Europe or in America to see how people respond there, because socialization is obviously always different. And it’s fascinating to compare notes on that too.

Lisa Baaske: Yes, I definitely understand that. And as you’ve already said, we also learn from failure, which is why it’s very exciting. So have you found out how communication between humans differs from communication with technical systems?

Prof. Ingo Siegert: Yes, there are certain characteristics in the speech that differ. For one, the inflections of speech: when we talk to other people, these are more pronounced and there is more variance than when we are talking to technical systems. The variance of the fundamental frequency also differs. Those are the main differences. But what we’ve discovered is that it’s very individual. There are certain speakers who also speak expressively with technical systems and strongly vary their prosody; other users don’t. That means that later, when you have a technical system that wants to detect these inflections of speech, it will always need to be adapted to the individual user. But I actually think that’s a good thing.

Lisa Baaske: You’re dealing with voice assistants on a daily basis. What do you think: how do people perceive technical systems like Alexa? We clearly speak with them somewhat differently than with humans.

Prof. Ingo Siegert: Yes. But that will change a lot once again. About eighteen months ago I had a class from an elementary school visiting us in the lab; I showed them around and then let them interact with Alexa. And they did that without any inhibitions. Then I asked whether they have voice assistants at home, and how many. Apart from two of them – one being the teacher – they all had at least one voice assistant at home, and sometimes even a second one in their room to listen to stories, to set an alarm, whatever. So they are growing up with it quite differently. That might sound frightening, but I believe it’s a normal development. Smartphones are already ubiquitous for us, but not yet for our parents. The same happened with the television 30 or 40 years ago, and just the same with the radio. If you go back through the history of this development and look at old newspaper reports or even old research findings, the same discussions take place every time: The children will be desensitized. There will be no more interpersonal contact. This was triggered by the radio, the television, and the internet. It happens time and again. But I believe that people are social, communicative beings who need human contact, and a voice assistant cannot fully replace that. Nor should it. But obviously the way we interact with technology will be different, and our understanding of how things work will change too.

Lisa Baaske: Yes, I also believe it’s very much a generational issue. I don’t personally use a voice assistant either, simply because I haven’t seen a need for it yet. And many people who could use voice assistants don’t, because they are afraid of being listened to. Is it true that Alexa and Co. are always listening?

Prof. Ingo Siegert: Yes and no. Inside every voice assistant there are actually two voice assistants. One is the local voice assistant, which is only there to detect activation. Its microphones are always on, waiting for the trigger. If there is an acoustic expression that sounds similar to the set wake word – “Hey Siri”, “Hey Google”, “Alexa” or whatever – then the device is activated, records what is said, and routes it to the cloud, where it’s analyzed in more detail. That means they’re not actually always eavesdropping; they’re waiting for the trigger. And even once this trigger fires and the expression is analyzed, there is usually still a second step that checks whether it really was the trigger and whether what came afterwards really was a command to be analyzed.
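A schematic of that two-stage design (a simulated sketch; all components are stand-ins, not any vendor’s actual API):

```python
# Schematic sketch of the two-stage wake-word design: a small always-on
# local detector, with cloud-side verification only after a local trigger.
# All components here are simulated stand-ins.
import random

THRESHOLD = 0.8

def local_wake_word_score(frame):
    """Stand-in for a tiny on-device model: 'does this sound like the wake word?'"""
    return random.random()

def cloud_verify_and_parse(utterance):
    """Stand-in for the cloud stage: re-check the trigger, then parse the command."""
    return "play music" if random.random() > 0.5 else None  # None = false trigger

def run_assistant(frames):
    for frame in frames:
        if local_wake_word_score(frame) < THRESHOLD:
            continue  # no trigger: nothing ever leaves the device
        command = cloud_verify_and_parse(frame)
        if command is not None:
            print("executing:", command)

run_assistant([b"simulated audio frame"] * 10)
```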

Lisa Baaske: OK, good, that will perhaps now reassure some people. Who knows? So, besides the concerns, there are also a lot of people who are really enthusiastic about voice assistants. I was also very surprised when it was announced at the start of the year that we now have the opportunity at the University to use voice assistants for course orientation. How did that come about?

Prof. Ingo Siegert: That was an idea that came about in one of my lectures. I give a lecture on dialog systems, which is also about what a dialog with a technical system actually is and what you can do with it. How close do today’s systems, which are already on the market, come to conducting real dialogs with people? Mostly, today’s devices conduct one-shot dialogs: I make a request, receive the answer, and I’m done. Dialogs, by contrast, are about obtaining information through multiple intermediate steps. I considered with students what we could do to investigate that, and we thought: let’s implement that in Alexa for course orientation and see how far we can get and what happens. It was also just then, in the initial stages of the pandemic, that we thought this might be a good option: if students or prospective students cannot come to the University, maybe we can send the University to them. And many of them use voice assistants too. That was the initial idea, to implement something like that.

Lisa Baaske: That could be precisely the generation who have more or less grown up with it and that’s why it’s perhaps also of strong interest to them. How can one imagine it then in general? What was the step that took this from the idea to really being able to now say, “OK Google, talk to the University of Magdeburg”?

Prof. Ingo Siegert: First I had to consider how such a dialog can be conducted. We worked with the General Student Advisory Service to learn a little about how they conduct those initial conversations to find out where someone’s interests lie, and from that we developed a very simple dialog that can also be implemented in a voice assistant and that actually finds out what direction someone wants to go in – without the system just asking, “Which field of study are you interested in?” Because if you ask a prospective student that question, they probably won’t be able to give a good answer. So, to steer things a little, we focused on anticipating what users might say at each point. Sometimes you steer by asking classic yes/no questions, or by making users decide between certain things. We had: “Are you more interested in natural sciences, engineering or social sciences?” These kinds of either/or options can then be given as answers. We had to implement that, then test, do a lot of testing, and then – difficult in the pandemic situation – find other users who could test the system under real conditions. As developers, we know what we need to say at certain points. You may still have some imagination about what else could be said, but somewhere that stops, and then you have to look at how to get it right. Luckily there were one or two events in the fall last year where it was once again possible to test everything and see how users interact with it.
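A minimal sketch of such a guided either/or dialog (my reconstruction; the questions and suggested subjects are illustrative, not the actual skill’s content):

```python
# Minimal sketch of a guided either/or dialog for course orientation.
# Questions, options and suggestions are illustrative only.

DIALOG_TREE = {
    "question": "Are you more interested in natural sciences, engineering or social sciences?",
    "options": {
        "natural sciences": {
            "question": "More mathematics or more biology?",
            "options": {"mathematics": "Mathematics", "biology": "Biology"},
        },
        "engineering": {
            "question": "More electrical engineering or more mechanical engineering?",
            "options": {"electrical engineering": "Electrical Engineering",
                        "mechanical engineering": "Mechanical Engineering"},
        },
        "social sciences": {
            "question": "More psychology or more German studies?",
            "options": {"psychology": "Psychology", "german studies": "German Studies"},
        },
    },
}

def run_dialog(node, ask=input):
    while isinstance(node, dict):
        answer = ask(node["question"] + " ").strip().lower()
        if answer not in node["options"]:
            print("Could you pick one of the offered options?")  # steer, never dead-end
            continue
        node = node["options"][answer]
    print(f"How about a degree in {node}?")

run_dialog(DIALOG_TREE)
```

The either/or questions keep the space of expected answers small, which is exactly the steering he describes.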

Lisa Baaske: I remember – I was definitely at one of those events. It was still relatively new at the time, and a lot of it was about what people could say to keep the dialog going. Was that the biggest problem: covering all the possibilities for how the 18-year-olds might respond?

Prof. Ingo Siegert: Yes, that is the big problem with such things. Above all, you don’t want to lose them. That means being able to somehow respond to all eventualities. The University of Magdeburg now has a voice skill for interacting with prospective students, and the implementation is deliberately a little playful in order to generate interest in the first place. If users then get something like “I couldn’t understand you,” that’s super frustrating, and they won’t perceive it as an Amazon or Google problem but as a University of Magdeburg problem. That’s what we wanted to avoid. That was the main thing.

Lisa Baaske: I know for sure that it works. I actually tried it once, and it even suggested German Studies – which is what I studied.


Prof. Ingo Siegert: That’s great.


Lisa Baaske: So, everything worked well. (laughs) At the moment, the voice skill only works for bachelor’s degree courses. Will there be further development? Is more work being done on that?

Prof. Ingo Siegert: Not at the moment – I don’t have the capacity to implement that. But in principle it’s in place and can be extended. We also had a couple of other ideas, for instance an FAQ area where prospective students could ask: When do I have to send in the application forms? How long does it take? Does such and such a course of study have restricted admission? And then you could maybe point people to certain events: come on down, there’s a summer picnic on such and such a date, or this department is having an open day. Something like that. But that requires maintenance, and I don’t have the capacity for it at the moment.

Lisa Baaske: Maybe you’ll come back to it. It definitely sounds very exciting, and I believe we’d all be pleased with that. How do you think the development of voice assistance systems will evolve generally? What could they be in 10 years?

Prof. Ingo Siegert: I believe in 10 years there will be better use cases. Compare it with Apple and the App Store, which came out in about 2009, when at some point it became possible to download the first apps. The first apps were things like a virtual lighter or some sound generators. Among the skills or actions – the voice apps you can use with Alexa and Google – I believe fart generators currently occupy the top one to five spots in most European countries.

Lisa Baaske: OK. (laughs)

Prof. Ingo Siegert: So, at the moment it’s very playful, but that will change. There will be certain use cases where voice assistants can really offer help, and those will be the first things to come. For instance, submitting a meter reading via voice assistant so that you no longer have to make a phone call. Or getting help faster in cases where you would otherwise have to boot up a computer to write an email or work with a chatbot – being able to do that via a voice skill as well. Then certainly also integration into many mobile applications. In the car, for example, there’s a lot of scope to do much more with voice, because pressing buttons and taking your hands off the wheel to look at something is always a distraction, and almost all cars now have a built-in hands-free setup. Creating more opportunities there to operate things by voice will simply come, also because it can often be so much more intuitive. At the moment, when you get in your car and want to change something – play music, play a CD, change the radio station, turn on route navigation – you first have to go through three or four menu levels to get to it. With good dialog design you can do all of that considerably more easily by voice, because at some point you can simply say: “Please start navigation. I want to drive the fastest route from Magdeburg to Berlin.” Effectively one utterance, and you’ve jumped straight through all the menu levels at once – you don’t even need to know where in the menu you are. It’s like the abstractions we already have with smartphones: you open an app and don’t even know whether it’s a website or a proper app, where the data are stored, what’s actually happening internally. The whole interface we’re familiar with from the standard desktop PC, where you have to go to the start menu and click through, disappears in this app world. There will be a similar evolution with voice apps, which will be able to demonstrate their full strength particularly when you can speak naturally and raise more complex requests in one go.

Lisa Baaske: And what would you want them to be able to do in 10 years – even if that’s an unrealistic question?

Prof. Ingo Siegert: To provide better assistance in everyday situations without being obtrusive. To let them capture complex relationships in device control, so that users effectively only have to know what they want and what they would normally say to get it. The differences between specific device manufacturers or specific car brands could then effectively disappear, because nothing is worse than trying to operate different technical devices that do the same thing but have unclear menu structures, and not knowing where I actually need to go. So that a little of this fear of “I’m now operating technology” falls away.

Lisa Baaske: So now I have actually got to my last question. Please give us another insight into the future. What’s on your agenda next? What do you want to research in addition? What do you still want to do?

Prof. Ingo Siegert: At the moment I’m about to start a research project on increasing data protection for the people speaking. When I operate a voice assistant, at the moment all that matters is what is said; it’s not terribly important who said it. In future, if prosodic information also becomes important, then how something is said obviously matters too. But it’s still not important that it was me who said it in order to operate the voice assistant. So we’re currently trying to launch a project in which we want to develop technology that masks the speaker’s identity while retaining the language content and the prosody, or the emotionality. So that I can still interact with voice assistants and control them, and they can still react to my prosodic information – am I currently annoyed because the system has failed to understand me five times in a row? Am I really sure that I want to do this? – without the system knowing that it’s me. That can be a huge advantage for many use cases, for example interacting with systems by voice in public places without the systems really knowing who operated them. That’s the idea, so to speak. And obviously much, much more can be done with it in other cases. Think of anonymous advice services: you could perhaps lower the barriers still further to using services such as hotlines for violence against women or violence against children, so that victims can use them anonymously, without having to be afraid of being recognized, and can nevertheless be helped, because the content and emotion of what they are saying is understood.
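One published baseline for this kind of speaker anonymization – not necessarily the technique his project will use – warps the formant positions of each speech frame via the McAdams coefficient (a baseline in the VoicePrivacy Challenge): the vocal-tract “fingerprint” shifts while the excitation signal, and with it the words and the pitch contour, stays largely intact. A rough sketch, with a hypothetical input file:

```python
# Rough sketch of McAdams-coefficient speaker anonymization: warp the
# LPC pole angles (formants) per frame, keep the excitation signal.
import numpy as np
import scipy.signal
import librosa

def mcadams_anonymize(y, sr, alpha=0.8, order=16, frame=1024, hop=512):
    out = np.zeros(len(y))
    win = np.hanning(frame)
    for start in range(0, len(y) - frame, hop):
        seg = y[start:start + frame] * win
        if np.max(np.abs(seg)) < 1e-5:
            continue  # skip near-silent frames
        a = librosa.lpc(seg, order=order)               # vocal-tract (formant) model
        residual = scipy.signal.lfilter(a, [1.0], seg)  # excitation (carries the pitch)
        poles = np.roots(a)
        angles = np.angle(poles)
        cplx = np.abs(np.imag(poles)) > 1e-8            # warp only the formant poles
        angles[cplx] = np.sign(angles[cplx]) * np.abs(angles[cplx]) ** alpha
        a_new = np.real(np.poly(np.abs(poles) * np.exp(1j * angles)))
        out[start:start + frame] += scipy.signal.lfilter([1.0], a_new, residual)
    return out  # 50% Hann overlap-add sums roughly to unity

y, sr = librosa.load("command.wav", sr=16000)  # hypothetical recording
anonymized = mcadams_anonymize(y, sr)
```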

Lisa Baaske: OK, that certainly sounds fascinating. I wish you all the very best with that. Thank you. Other than that, we really have now reached the end. Many, many thanks for taking the time to talk to us. Thank you for the fascinating insights. Maybe in the near future I will have the courage to let a voice assistant move in with me. And thanks to you for listening. I hope that you will join us again for the next episode in November.

Intro voice: "Wissen, wann du willst." The podcast about research at the University of Magdeburg.
