What just happened? Amazon has announced that it is migrating its artificial intelligence processing to custom AWS Inferentia chips. This means that Amazon's largest inference workloads, such as the Alexa virtual assistant, will run on fast, specialized silicon instead of more general-purpose GPUs.
Amazon has already shifted around 80% of Alexa processing to Elastic Compute Cloud (EC2) Inf1 instances, which use the new AWS Inferentia chips. Compared to G4 instances, which use traditional GPUs, Inf1 instances deliver 30% higher throughput at 45% lower cost. Amazon believes these are the best instances on the market for natural language and speech processing inference workloads.
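For developers, targeting Inferentia mostly means recompiling an existing model rather than rewriting it. A minimal sketch, assuming the PyTorch flavor of the AWS Neuron SDK (the `torch-neuron` package, available on Inf1 instances); the stock ResNet here is just a stand-in for a real speech or language model:

```python
import torch
import torch_neuron  # AWS Neuron SDK plugin for PyTorch (Inf1 instances only)
from torchvision.models import resnet50

# A placeholder pre-trained model; Alexa's actual models are not public.
model = resnet50(pretrained=True)
model.eval()

# Compile for the Inferentia NeuronCores. The example input fixes the
# tensor shapes the Neuron compiler optimizes for.
example = torch.zeros(1, 3, 224, 224)
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# The compiled artifact saves and loads like any TorchScript module;
# inference calls are then dispatched to the Inferentia chip.
model_neuron.save("resnet50_neuron.pt")
```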
Alexa works like this: the speaker enclosure (or cylinder, as the case may be) does next to nothing, while processors in the AWS cloud do all the work. To put it more technically, the system springs into action once the Echo's built-in chip detects the wake word. The device then streams audio to the cloud in real time. Off in a data center somewhere, the audio is turned into text (one example of inference). Next, the meaning is extracted from the text (another example of inference). Any required actions are then carried out, such as retrieving today's weather forecast.
After Alexa completes your request, she needs to give you an answer. What she says is assembled from a modular script, which is then turned into an audio file (another example of inference) and sent to your Echo device. The Echo plays the file, and you decide to bring an umbrella to work with you.
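Put together, the round trip is three inference calls wrapped in a little plumbing. A minimal sketch of that flow; the function names (`transcribe`, `extract_intent`, `run_action`, `synthesize_speech`) are hypothetical stand-ins for Alexa's actual models, with fixed return values so the sketch runs end to end:

```python
from dataclasses import dataclass

@dataclass
class Intent:
    action: str   # e.g. "get_weather"
    slots: dict   # e.g. {"city": "Seattle"}

def transcribe(audio: bytes) -> str:
    # Inference #1: speech-to-text on the streamed audio (model elided).
    return "what is the weather today"

def extract_intent(text: str) -> Intent:
    # Inference #2: extract the meaning (intent and slots) from the text.
    return Intent(action="get_weather", slots={"city": "Seattle"})

def run_action(intent: Intent) -> str:
    # Non-inference work: fulfil the request, e.g. query a weather service.
    return "light rain, 12 degrees"

def synthesize_speech(reply: str) -> bytes:
    # Inference #3: turn the scripted reply into an audio file for the Echo.
    return reply.encode()  # placeholder for real TTS audio

def handle_utterance(audio: bytes) -> bytes:
    text = transcribe(audio)
    intent = extract_intent(text)
    result = run_action(intent)
    # The reply is assembled from a modular script, then synthesized.
    reply = f"Today in {intent.slots['city']}: {result}. Take an umbrella."
    return synthesize_speech(reply)

print(handle_utterance(b"<streamed audio after wake word>"))
```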
Obviously, inference is a big part of the job. It’s no surprise that Amazon has invested millions of dollars in making the perfect inference chips.
Speaking of which, each Inferentia chip is made up of four NeuronCores. Each one implements a "high-performance systolic matrix multiplication engine". Roughly speaking, a NeuronCore consists of a very large number of small data processing units (DPUs) that process data in a linear, independent fashion. Each Inferentia chip also has a huge on-chip cache, which cuts latency.
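A systolic array computes a matrix product by pumping operands through a grid of tiny multiply-accumulate cells, each holding one running sum, so no cell touches main memory mid-computation. A toy software model of the accumulation pattern (real NeuronCores are far wider and fully pipelined, and the operand skewing between cells is elided here):

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Emulate an output-stationary systolic array: cell (i, j) keeps its
    own accumulator and, on each 'clock tick' t, multiplies the A value
    arriving from the left by the B value arriving from above."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((n, m))        # one accumulator per processing cell
    for t in range(k):            # one tick per wavefront of operands
        for i in range(n):
            for j in range(m):
                acc[i, j] += A[i, t] * B[t, j]
    return acc

# Sanity check against the ordinary matrix product.
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point of the layout is that each cell only ever talks to its neighbors, which is why the design scales to the "very large number" of DPUs described above.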