Artificial intelligence (AI) is the next big thing in the market, and tech companies are investing heavily to secure a clear lead in the future AI-driven market. And who doesn’t know that Tesla values AI more than anything? The company’s entire EV business depends on the AI capabilities of its vehicles.
In its latest presentation, Tesla said it believes AI has limitless possibilities and that its systems are getting smarter than the average human. To speed up its AI software workloads, the company announced the D1, a custom application-specific integrated circuit (ASIC) for AI training, which it unveiled today as part of its Dojo effort.
As we know, many companies are building ASICs for AI workloads, and the list includes not only start-ups but also big names like Amazon, Baidu, Intel, and NVIDIA. However, not everyone has the right formula, and no single design satisfies every workload perfectly, meaning there is still room in the AI training market. That is why Tesla opted to develop its own ASIC for AI training purposes.
The chip, called the D1, forms part of the Dojo supercomputer used to train AI models inside Tesla HQ. It is manufactured by TSMC on a 7 nm semiconductor node, reportedly packs over 50 billion transistors, and boasts a huge die size of 645 mm².
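As a quick back-of-the-envelope check on those reported figures (our own arithmetic, not a Tesla-stated number), the transistor density works out to roughly 77 million transistors per square millimeter:

```python
# Back-of-the-envelope density check using the figures reported for the D1.
transistors = 50e9    # reported: over 50 billion transistors
die_area_mm2 = 645    # reported die size in mm^2

density_m_per_mm2 = transistors / die_area_mm2 / 1e6  # millions per mm^2
print(f"{density_m_per_mm2:.1f} M transistors/mm^2")  # -> 77.5 M transistors/mm^2
```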
According to Tesla, its latest AI chip delivers some impressive performance: it can output as much as 362 TeraFLOPs at FP16/CFP8 precision, or about 22.6 TeraFLOPs for single-precision FP32 tasks. Remarkably, in terms of optimized FP16 data types, Tesla has even managed to beat the current leader in compute power, NVIDIA. As we know, the green team’s A100 Ampere GPU is capable of producing “only” 312 TeraFLOPs at FP16 workloads.
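Putting those numbers side by side (a simple sketch using only the figures quoted above), the D1’s FP16 throughput is about sixteen times its FP32 throughput, and its claimed FP16 lead over the A100 is around sixteen percent:

```python
# Throughput comparison using the figures quoted in the article.
d1_fp16_tflops = 362.0    # D1 at FP16/CFP8
d1_fp32_tflops = 22.6     # D1 at FP32
a100_fp16_tflops = 312.0  # NVIDIA A100 at FP16

# FP16 runs roughly 16x faster than FP32 on the D1 per these figures.
print(f"FP16/FP32 ratio: {d1_fp16_tflops / d1_fp32_tflops:.1f}x")       # -> 16.0x
# The D1's claimed FP16 lead over the A100, as a percentage.
print(f"Lead over A100: {(d1_fp16_tflops / a100_fp16_tflops - 1) * 100:.0f}%")  # -> 16%
```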
Tesla has built the chip as a mesh of functional units (FUs) interconnected to form one massive die. Each FU contains a 64-bit CPU with a custom ISA; the CPU is a superscalar implementation with 4-wide scalar and 2-wide vector pipelines. According to reports, each functional unit can perform one TeraFLOP of BF16 or CFP8 compute and 64 GigaFLOPs of FP32 compute, and has 512 GB/s of bandwidth in any direction across the mesh.
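The per-FU and chip-level figures let us estimate how many functional units the mesh contains. This is our own inference from the reported numbers, not an official Tesla count:

```python
# Implied functional-unit (FU) count from the reported per-FU and chip-level
# throughput; an estimate derived from the article's numbers, not an official figure.
chip_bf16_tflops = 362.0   # reported chip-level BF16/CFP8 throughput
fu_bf16_tflops = 1.0       # reported per-FU BF16/CFP8 throughput
fu_fp32_gflops = 64.0      # reported per-FU FP32 throughput

implied_fus = chip_bf16_tflops / fu_bf16_tflops
print(f"Implied FU count: ~{implied_fus:.0f}")  # -> ~362

# Cross-check: that many FUs at 64 GFLOPs FP32 each would give the chip total.
print(f"Implied FP32 total: {implied_fus * fu_fp32_gflops / 1000:.1f} TFLOPs")  # -> 23.2 TFLOPs
```

The FP32 cross-check lands near, but slightly above, the reported 22.6 TeraFLOPs, which suggests the actual FU count sits a bit below 362 (about 22.6 TFLOPs ÷ 64 GFLOPs ≈ 353).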