Biren BR100 GPU for Datacenter Compute and AI Workloads





HC34 Biren BR100 GPU Cover

A cool feature of Hot Chips 34 is that we again get to look at some chips that we would normally not get to see (unless we bought and somehow imported them). One of those at HC34 is the Biren BR100. The BR100 is a GPU designed for data center compute, and it uses a number of advanced features that make the design very interesting.

Note: We are doing this piece live at HC34 during the presentation so please excuse typos.


This is certainly something different. The BR100 is a 77B transistor chip built on 7nm. Many next-gen GPUs are moving below 7nm, so staying on 7nm makes this a big chip at 1074mm². It sounded like the company said that the compute tile is 477mm². For comparison, the NVIDIA H100 is an 80B transistor chip. One other interesting aspect is that this is a 64GB HBM2E part. That is less memory than the Ponte Vecchio, the H100, and the AMD MI250X. It is about half of the MI250X's capacity. This is likely done to lower cost. At the same time, we should be comparing this to the H100 and Ponte Vecchio since it is a PCIe Gen5 GPU with CXL.

HC34 Biren BR100 GPU Overview

Just for some reference, the Intel Ponte Vecchio at 1.6GHz was just disclosed at 839TFLOPS for XMX BF16, but that may or may not be the operating frequency when it ships. Biren says its chip is much faster than the NVIDIA A100 in deep learning.

HC34 Biren BR100 GPU AI Performance

Here are the five C’s that Biren is focusing on. Cost is always a big deal when companies look to compete with NVIDIA.

HC34 Biren BR100 GPU Five C

Here is how Biren is building its chiplet-based GPU. We can see two GPU compute tiles. There are then two HBM2E packages. The 896GB/s die-to-die interconnect has more than 2x the bandwidth of the die-to-die link on the MI250X.

HC34 Biren BR100 GPU Tiles

Biren has the BR100 for its OAM platform and the BR104 for PCIe. Because the design is chiplet-based and modular, Biren can offer the BR104 without taping out a new design. To us, the Biren Hearten server is very interesting.

HC34 Biren BR100 GPU Product Line

Folks are going to wonder what the Hearten server is. Since the BR100 GPUs are OAM modules, and we see eight of them with external connectivity, it is likely that Biren is using the OCP OAM Universal Baseboard, or UBB. That effort now falls under OCP's Open Accelerator Infrastructure, or OAI.

OCP OAM UBB Breakaway

The server itself looks like a variation of the Inspur accelerator platform, but with a UBB tray instead of an NVIDIA A100 tray. You can see the top section in our Inspur NF5488A5 8x NVIDIA A100 HGX and Inspur NF5488M5 8x NVIDIA V100 reviews. Our best sense is that it is a version of these platforms but with the OCP OAM UBB inside.

Inspur NF5488A5 GPU Tray Coming Out

OAM is designed to have multiple connections. BLink is Biren’s interconnect between GPUs.

HC34 Biren BR100 GPU OAM Interconnect

Biren has a Streaming Processing Center (SPC) with 8MB of L2 cache plus the compute resources. Also important in this diagram is that the GPU has video encoders and decoders. That is a big deal for AI since so much of it is based on processing video.

HC34 Biren BR100 GPU Architecture Diagram

Each GPU compute tile has 16 SPCs. Each SPC has 8MB of L2 cache, so there is 128MB on each tile and 256MB across the BR100.

HC34 Biren BR100 GPU SPC Architecture

Each SPC also has 16 EUs.

HC34 Biren BR100 CU Scaling

Biren designed a SIMT GPU for general-purpose compute. With the BR100, the compute unit, or CU, can be split into chunks of 4, 8, or 16 EUs.

HC34 Biren BR100 GPU V Core GP SIMT Processor
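
To keep the hierarchy straight, here is a minimal Python sketch that tallies the figures quoted so far: two compute tiles, 16 SPCs per tile, 8MB of L2 and 16 EUs per SPC. The `BR100Topology` class and its method names are hypothetical, made up for illustration. The CU count per SPC assumes that a CU is carved out of the EUs of a single SPC and that every SPC is enabled, neither of which was confirmed in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BR100Topology:
    """Hypothetical tally of the BR100 figures quoted in this article."""
    tiles: int = 2            # GPU compute tiles per package
    spcs_per_tile: int = 16   # SPCs per compute tile
    l2_mb_per_spc: int = 8    # MB of L2 cache per SPC
    eus_per_spc: int = 16     # EUs per SPC

    def l2_mb_per_tile(self) -> int:
        return self.spcs_per_tile * self.l2_mb_per_spc

    def l2_mb_total(self) -> int:
        return self.tiles * self.l2_mb_per_tile()

    def eus_total(self) -> int:
        # Assumption: all SPCs on both tiles are enabled.
        return self.tiles * self.spcs_per_tile * self.eus_per_spc

    def cus_per_spc(self, eus_per_cu: int) -> int:
        # Assumption: a CU is built only from EUs inside one SPC.
        if eus_per_cu not in (4, 8, 16):
            raise ValueError("CU size must be 4, 8, or 16 EUs")
        return self.eus_per_spc // eus_per_cu


br100 = BR100Topology()
print("L2 per tile (MB):", br100.l2_mb_per_tile())
print("L2 total (MB):", br100.l2_mb_total())
print("EUs total:", br100.eus_total())
for size in (4, 8, 16):
    print(f"CUs per SPC at {size} EUs each:", br100.cus_per_spc(size))
```

The L2 totals match the 128MB per tile and 256MB per GPU figures above. The EU and CU counts are simply what those per-SPC numbers imply under the stated assumptions.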

Here is Biren’s detail on the V-Core Warp Control.

HC34 Biren BR100 GPU V Core Warp Control

The T-core seems to be Biren’s tensor core that deals with matrix math.

HC34 Biren BR100 GPU T Core Overview

Here are the specs on the T-core, which is how Biren is hitting the previously mentioned performance figures.

HC34 Biren BR100 GPU SPC Scale 2.5D GEMM Architecture

Biren has a new data type called TF32+. It sits between TF32 and FP32.

HC34 Biren BR100 GPU TF32+ Tensor Data Type
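
As a rough way to see what "between TF32 and FP32" could mean numerically, here is a minimal Python sketch that truncates an FP32 value to a chosen number of explicit mantissa bits. TF32 keeps 10 mantissa bits and FP32 keeps 23. The 15-bit figure used for TF32+ below is an assumption for illustration only, since the exact layout is not given in this piece. `truncate_mantissa` is a hypothetical helper that truncates rather than rounds, so it is not a model of how the T-core actually converts values.

```python
import struct


def truncate_mantissa(value: float, mantissa_bits: int) -> float:
    """Truncate (toward zero) an FP32 value to `mantissa_bits` mantissa bits."""
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("FP32 has 23 explicit mantissa bits")
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - mantissa_bits
    # Clear the lowest `drop` mantissa bits; sign and exponent are untouched.
    mask = (0xFFFFFFFF >> drop) << drop
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


# Mantissa widths: TF32 and FP32 are standard; TF32+ is an ASSUMED value.
FORMATS = {"TF32": 10, "TF32+ (assumed)": 15, "FP32": 23}

sample = 1.0 + 2.0 ** -11  # needs 11 mantissa bits to represent exactly
for name, width in FORMATS.items():
    print(f"{name:>16}: {truncate_mantissa(sample, width)!r}")
```

The sample value needs 11 mantissa bits, so it collapses to 1.0 at TF32 width but survives at any width of 11 bits or more. That is the kind of extra precision a format sitting between TF32 and FP32 would buy.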

Biren also has the Tensor Data Accelerator or TDA. NVIDIA just discussed its asynchronous data mover, the TMA, at Hot Chips 34 as well.

HC34 Biren BR100 GPU Tensor Data Accelerator TDA

Here is Biren’s memory scheme, which uses NUMA or UMA depending on where the data sits.

HC34 Biren BR100 GPU Memory Scheme NUMA And UMA

Biren has a Near Memory Engine to make L2 cache use more efficient.

HC34 Biren BR100 GPU Near Memory Computing

Along with building a GPU in three years, Biren also built its own software platform, BIRENSUPA.

HC34 Biren BR100 GPU BIRENSUPA Software Platform

Here is the company’s summary slide.

HC34 Biren BR100 GPU Summary

There is a lot of really interesting technology here.

Final Words

This is a cool-looking chip. At the same time, perhaps the most amazing aspect is that Biren was able to create the BR100 in less than three years. The company said that it set its targets in 2019. In under three years, it made a multi-tile GPU with all of the ancillary features. That is an amazing engineering effort when you compare it to established players like NVIDIA, AMD, and Intel and their large engineering teams.

Biren Technology Established

The company said that it has already submitted MLPerf Inference performance numbers, and we should see the results in a few weeks. The company also said it is waiting for Intel Xeon Sapphire Rapids for CXL compatibility testing, but it believes its part is ready.



