Rohit Prasad, Senior Vice President and Head Scientist, Amazon Alexa.
When Amazon’s first Alexa-enabled smart speaker hit the market in 2014, it was a novelty: a voice-activated interface for processing natural language that could perform a number of simple tasks.
Fast forward to today, and the internet-connected platform has rapidly expanded to become its own electronic ecosystem. With tens of thousands of Alexa-enabled devices and hundreds of millions of units sold, Alexa is almost ubiquitous as a virtual assistant.
But while Alexa is now integrated into everything from televisions to microwaves to headphones, Amazon’s vision of ambient computing is still in its infancy. Tremendous strides have been made in natural language processing and other areas of artificial intelligence to serve a potential market of billions of users, yet there is still much room for improvement.
Ultimately, Amazon wants these devices to understand and support users almost as well as a human assistant does. Achieving this, however, will require significant progress in several areas, including contextual decision-making and reasoning.
To delve deeper into the potential of Alexa and ambient computing in general, I asked Rohit Prasad, Alexa’s senior vice president and head scientist, about the future of the platform and Amazon’s goals for the increasingly intelligent virtual assistant.
Richard Yonck: Alexa is sometimes referred to as “Ambient Computing”. What are some examples or use cases for Ambient AI?
Rohit Prasad: Ambient computing is technology that is there when you need it and takes a back seat when you don’t. It anticipates your needs and makes life easier by always being available without being intrusive. For example, with Alexa you can use routines to automate your home, e.g.
Yonck: During your last CogX presentation, you mentioned that Alexa “goes into reasoning and autonomy on your behalf”. What are some near-term examples of this, compared to where we are today?
Prasad: Today we have features like Hunches, where Alexa suggests actions to take in response to anomalous sensor data. More recently, Ring Video Doorbell Pro owners can choose to have Alexa act on their behalf to greet visitors and offer to take a message, or to provide package delivery instructions.
Overall, we’ve moved to more contextual decision-making and have taken first steps toward reasoning and autonomy through self-learning, that is, Alexa’s ability to improve and expand its skills without human intervention. Last year we took another step with a new Alexa feature that can infer a customer’s latent goal. If a customer asks about the weather at the beach, for example, Alexa could use the request in combination with other contextual information to infer that the customer might be interested in a trip to the beach.
The new Echo Show 10. (Amazon Photo)
Yonck: Edge computing is a means of doing some of the computing on or near the device rather than in the cloud. Do you believe that at some point Alexa will be able to do enough processing at the edge to sufficiently reduce latency, support federated learning, and address privacy concerns?
Prasad: Since the launch of Echo and Alexa in 2014, our approach has combined processing in the cloud, on the device, and at the edge. The relationship is symbiotic. Where the computing takes place depends on several factors, including connectivity, latency, and customer privacy.
As an example, we understood that customers want basic functionality to work even if they lose network connectivity. For this reason, we introduced a hybrid mode in 2018, in which smart home intents, including control of lights and switches, continue to work even if connectivity is lost. This also applies to taking Alexa with you on the go, even in the car, where connectivity may be interrupted.
Over the past few years we have pursued various techniques to make neural networks efficient enough to run on the device, minimizing memory and computation requirements without sacrificing accuracy. With neural accelerators such as our AZ1 Neural Edge processor, we are now pioneering new experiences for customers. One example is natural turn-taking, a feature we will be offering our customers this year, which uses on-device algorithms to fuse acoustic and visual cues and infer whether participants in a conversation are interacting with each other or with Alexa.
Yonck: In your AI Pillars for the Future, you have described several capabilities that we need in our social bots and task bots. Can you share projected timelines for any of these, even if they are broad?
Prasad: Open-domain, multi-turn conversations remain an unsolved problem. However, I’m excited to see university students advancing conversational AI through the Alexa Prize competition tracks. Participating teams have improved the state of the art by developing better natural language understanding and dialogue policies that result in more engaging conversations. Some have even worked on recognizing humor and generating humorous responses or selecting contextually relevant jokes.
These are tough AI problems that will take time to solve. Although I believe we are still 5 to 10 years away from meeting the goals of these challenges, one area of conversational AI I am particularly excited about is one for which the Alexa team recently received a best paper award: infusing commonsense knowledge graphs, both explicitly and implicitly, into large pre-trained language models to give machines greater intelligence. This work will make Alexa more intuitive and intelligent for our customers.
Yonck: For open domain conversations, you mentioned that transformer-based neural response generators are combined with knowledge selection to generate more engaging responses. In a nutshell, how is knowledge selection carried out?
Prasad: We are pushing the boundaries of open-domain conversations, including as part of the Alexa Prize SocialBot Challenge, where we are constantly inventing on behalf of the participating university teams. One such innovation is a transformer-based neural language generator (i.e. a neural response generator, or NRG). We have extended NRG to generate even better responses by integrating a dialogue policy and fusing in world knowledge. The policy determines the optimal form of the response: for example, whether the AI’s next turn should acknowledge the previous turn, where appropriate, and then ask a question. To integrate knowledge, we index publicly available knowledge on the web and retrieve the sentences that are most relevant to the context of the dialogue. The goal of the NRG is to produce optimal responses that conform to the policy decision and incorporate that knowledge.
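As a rough illustration of the retrieval step Prasad describes (index a set of sentences, then pull back the ones most relevant to the dialogue context), here is a minimal Python sketch. It is not Amazon's method: the interview does not say how relevance is scored, so TF-IDF weighting with cosine similarity is assumed here purely as a stand-in, and every name in it (`tokenize`, `build_index`, `select_knowledge`, the sample sentences) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sentences):
    """Return (idf, vectors): IDF weights plus one TF-IDF vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(sentences)
    # Smoothed inverse document frequency, so every weight stays positive.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_knowledge(dialogue_context, sentences, top_k=1):
    """Return the top_k sentences most similar to the dialogue context.

    Assumption: relevance is plain TF-IDF cosine similarity. A production
    system would more likely use learned (neural) retrieval.
    """
    idf, vectors = build_index(sentences)
    # Context terms that never occur in the index carry no signal; drop them.
    counts = Counter(t for t in tokenize(dialogue_context) if t in idf)
    query = {t: c * idf[t] for t, c in counts.items()}
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine(query, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:top_k]]


if __name__ == "__main__":
    facts = [
        "The Pacific Ocean is the largest ocean on Earth.",
        "Mount Everest is the highest mountain above sea level.",
        "Espresso is brewed by forcing hot water through ground coffee.",
    ]
    context = "I went hiking last weekend and now I want to climb a mountain."
    print(select_knowledge(context, facts))
```

In this toy example only the Mount Everest sentence shares a term with the dialogue context, so it ranks first. In the pipeline Prasad outlines, the selected sentences would then be passed to the response generator along with the policy decision; that generation step is not modeled here.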
Yonck: For naturalness, ideally you want conversations to have a rich contextual basis. That means learning, storing and accessing a large amount of personal information and preferences in order to provide personalized responses to each user. This feels very compute- and memory-intensive. Where is Amazon’s hardware now, compared to where it needs to be to ultimately achieve this?
Prasad: This is where processing at the edge comes into play. To provide the best customer experience, certain processing, such as computer vision to determine who in the room is addressing the device, can be carried out locally. This is an active area of research and invention, and our teams are hard at work making machine learning, both inference and model updates, more efficient on the device. In particular, I’m excited about large pre-trained deep learning models that can be distilled for efficient processing at the edge.
Yonck: What do you think is the biggest challenge to achieving a fully developed ambient AI as you have described it?
Prasad: The biggest challenge in making our vision a reality is moving from reactive responses to proactive support, where Alexa can spot anomalies and alert you (e.g. a hunch that you left the garage door open) or anticipate your needs to help you achieve your latent goals. While AIs can be preprogrammed for such proactive support, that approach will not scale given the myriad use cases.
Hence, we need to move towards more general intelligence, i.e. the ability of an AI to 1) perform multiple tasks without requiring significant task-specific intelligence, 2) adapt itself to variability within a set of known tasks, and 3) learn entirely new tasks.
In the context of Alexa, this means becoming more self-learning, without the need for human supervision; more self-service, by making it easier to integrate Alexa into new devices, drastically reducing the burden on developers building conversational experiences, and even allowing customers to customize Alexa and teach it new concepts and personal preferences directly; and more self-aware of the state of the environment, so that it can proactively anticipate customer needs and help them seamlessly.