Amazon begins moving Alexa's cloud AI to its own silicon




Video: Amazon engineers discuss the migration of 80% of Alexa's workload to Inferentia ASICs (three-minute clip).

On Thursday, an Amazon AWS blog post reported that the company has moved most of the cloud processing for its Alexa personal assistant off Nvidia GPUs and onto its own Inferentia application-specific integrated circuit (ASIC). Amazon developer Sébastien Stormacq describes the hardware design of Inferentia as follows:

AWS Inferentia is a custom chip, designed by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic matrix multiplication engine, which massively accelerates typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps reduce access to external memory, significantly reducing latency and increasing throughput.
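
That description boils down to two ideas: a dedicated matrix-multiply engine, plus enough on-chip memory that operands don't have to be re-fetched from DRAM for every operation. The NumPy sketch below is not Amazon's design; it only illustrates the tiling/data-reuse pattern that such a cache makes possible (the tile size here is made up).

```python
import numpy as np

TILE = 128  # illustrative tile size; real NeuronCore tile/cache sizes are not public

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Block (tiled) matrix multiply.

    Each (tile x tile) block of A and B is loaded once and reused for a whole
    block of partial products -- the same locality trick that an on-chip cache
    next to a systolic matrix engine exploits to cut trips to external memory.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # These slices stand in for blocks staged in fast on-chip memory.
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

# Quick check against NumPy's reference matrix multiply.
a = np.random.rand(256, 384).astype(np.float32)
b = np.random.rand(384, 512).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```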

When an Amazon customer – usually someone who owns an Echo or Echo Dot – uses the Alexa personal assistant, very little of the processing is done on the device itself. The workload for a typical Alexa request looks like this (a simplified code sketch follows the list):

  1. A human speaks to an Amazon Echo saying, “Alexa, what’s the special ingredient in Earl Grey tea?”
  2. The Echo detects the wake word – Alexa – using its own built-in processing
  3. The Echo transmits the request to Amazon data centers
  4. In the Amazon data center, the voice stream is converted to phonemes (Inference AI workload)
  5. Also in the data center, phonemes are converted into words (Inference AI workload)
  6. Words are assembled into sentences (Inference AI workload)
  7. Sentences are distilled into intent (Inference AI workload)
  8. Intent is routed to an appropriate execution service, which returns a response as a JSON document
  9. The JSON document is parsed, including the text of Alexa’s response
  10. The text of Alexa’s response is converted to natural-sounding speech (Inference AI workload)
  11. The synthesized speech audio is sent back to the Echo device to be played – “It’s bergamot orange oil.”
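
Here is that pipeline condensed into a runnable Python sketch. Every function and field name is hypothetical – Amazon has not published this internal API – and the stub bodies exist only so the flow of steps 4 through 11 can be read (and run) top to bottom.

```python
import json

# All names below are hypothetical stand-ins for Amazon's internal services.
# Steps marked "inference" are the ones the post says run on AI accelerators.

def speech_to_phonemes(audio: bytes) -> list[str]:        # step 4: inference
    return ["ER", "L", "G", "R", "EY"]

def phonemes_to_words(phonemes: list[str]) -> list[str]:  # step 5: inference
    return ["what's", "the", "special", "ingredient", "in", "earl", "grey", "tea"]

def assemble_sentence(words: list[str]) -> str:           # step 6: inference
    return " ".join(words) + "?"

def extract_intent(sentence: str) -> dict:                # step 7: inference
    return {"intent": "GetIngredient", "subject": "earl grey tea"}

def route_to_service(intent: dict) -> str:                # step 8: execution service
    return json.dumps({"text": "It's bergamot orange oil"})

def text_to_speech(text: str) -> bytes:                   # step 10: inference
    return text.encode("utf-8")                           # stand-in for audio

def handle_alexa_request(audio_stream: bytes) -> bytes:
    phonemes = speech_to_phonemes(audio_stream)
    words    = phonemes_to_words(phonemes)
    sentence = assemble_sentence(words)
    intent   = extract_intent(sentence)
    response = json.loads(route_to_service(intent))       # step 9: parse the JSON reply
    return text_to_speech(response["text"])               # step 11: audio goes back to the Echo

print(handle_alexa_request(b"\x00\x01"))  # -> b"It's bergamot orange oil"
```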

As you can see, almost all of the actual work done to answer an Alexa request happens in the cloud, not in an Echo or Echo Dot device itself. And the vast majority of this cloud work isn’t done by traditional if-then logic, but by inference – which is the responding side of neural network processing.
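
For readers who don't live in the machine-learning world, "inference" simply means pushing new input through an already-trained network and reading off its output; no training is involved. The PyTorch snippet below shows that forward pass in its generic form – the model and input are placeholders, not anything Alexa actually runs.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for whatever trained network a real service loads.
model = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()                          # inference mode: no training-time behavior like dropout

features = torch.rand(1, 40)          # e.g. one frame of audio features

with torch.no_grad():                 # no gradients needed when only answering
    scores = model(features)          # the forward pass -- this is "inference"

print(scores.argmax(dim=1))           # the model's answer, e.g. a class index
```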

According to Stormacq, moving this inference workload from Nvidia GPU hardware to Amazon’s own Inferentia chip resulted in a 30% cost reduction and a 25% improvement in end-to-end latency for Alexa’s text-to-speech workloads. Amazon isn’t the only company using the Inferentia processor – the chip also powers Amazon AWS Inf1 instances, which are available to the general public and compete with Amazon’s GPU-powered G4 instances.

Amazon’s AWS Neuron SDK enables machine learning developers to use Inferentia as a target for popular frameworks, including TensorFlow, PyTorch, and MXNet.
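
As a rough illustration of what targeting Inferentia looks like in practice, the sketch below follows the torch-neuron flow AWS documents for PyTorch: trace a trained model against an example input, then save the compiled artifact for deployment on an Inf1 instance. Exact package and argument names can differ between Neuron SDK releases, and ResNet-50 is only a stand-in model.

```python
import torch
import torch_neuron          # AWS Neuron SDK extension for PyTorch (Inf1 targets)
from torchvision import models

# Load a trained model; ResNet-50 is just a stand-in for whatever model you serve.
model = models.resnet50(pretrained=True)
model.eval()

# Trace/compile the model for Inferentia using an example input of the right shape.
example = torch.rand(1, 3, 224, 224)
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# The compiled artifact is saved like any TorchScript module and later loaded on an
# Inf1 instance, where its compiled subgraphs execute on NeuronCores.
model_neuron.save("resnet50_neuron.pt")
```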

Listing Image by Amazon
