



### **TIPP'17**

### Acceleration of a particle identification algorithm used for the LHCb Upgrade with the new Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA



Christian Färber, CERN openlab Fellow, LHCb Online group



On behalf of the LHCb Online group and the HTC Collaboration

TIPP2017, Beijing 25.05.2017







### HTCC

- High Throughput Computing Collaboration
- Members from Intel and CERN LHCb/IT
- Test Intel technology for use in trigger and data acquisition (TDAQ) systems
- Projects
  - Intel<sup>®</sup> KNL computing accelerator
  - Intel<sup>®</sup> Omni-Path 100 Gbit/s network
  - Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA computing accelerator







### Detector Example: LHCb



- Single-arm spectrometer designed to search for new physics by measuring CP violation and rare decays of heavy-flavour mesons.
- 40 MHz proton-proton collisions
- Trigger at 1 MHz, upgrade to 40 MHz
- Bandwidth after upgrade up to 40 Tbit/s







### Future challenges

- Higher luminosity from LHC
- Upgraded sub-detector Front-Ends
- Removal of hardware trigger
- Software trigger has to handle:
  - Larger event size (50 kB to 100 kB)
  - Larger event rate (1 MHz to 40 MHz)
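
As a rough cross-check of these figures, the readout bandwidth follows from event rate times event size. The sketch below assumes 1 kB = 1000 bytes and reads the two pairs of numbers as current and upgrade values; the function name is illustrative only.

```python
def readout_bandwidth_tbit_per_s(event_rate_hz, event_size_bytes):
    """Return the raw data rate in Tbit/s for a given event rate and size."""
    return event_rate_hz * event_size_bytes * 8 / 1e12


# Assumed reading of the slide: (1 MHz, 50 kB) today, (40 MHz, 100 kB) after the upgrade.
current = readout_bandwidth_tbit_per_s(1e6, 50e3)
upgrade = readout_bandwidth_tbit_per_s(40e6, 100e3)
print(f"current: {current:.1f} Tbit/s, upgrade: {upgrade:.1f} Tbit/s")
```

With the upgrade values this gives about 32 Tbit/s, the same order of magnitude as the up to 40 Tbit/s quoted for the upgraded readout.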









### **Upgrade Readout Schematic**

- Raw data input ~ 40 Tbit/s
- The event filter farm (EFF) needs fast processing of trigger algorithms; different technologies are being explored.
- Test FPGA compute accelerators for use in:
  - Event building
    - Decompressing and re-formatting packed binary data from detector
  - Event filtering
    - Tracking
    - Particle identification
- Compare with: GPUs, Intel<sup>®</sup> Xeon<sup>®</sup>-Phi and other compute accelerators

Christian Färber, TIPP2017, Beijing – 25.05.2017








### FPGAs as Compute accelerators

- Microsoft Catapult and Bing
  - Improve performance, reduce power consumption



- Reduce the number of von Neumann abstraction layers
  - Bit-level operations
- Only the logic cells and registers actually needed are powered
- Current test devices in LHCb
  - Nallatech PCIe with OpenCL
  - Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA







## Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA

- Two-socket system:
  - First: Intel<sup>®</sup> Xeon<sup>®</sup> E5-2680 v2
  - Second: Altera Stratix V GX A7 FPGA
    - 234'720 ALMs, 940'000 registers, 256 DSPs
- Host interface: high bandwidth and low latency
- Memory: cache-coherent access to main memory
- Programming model: Verilog and OpenCL





## New Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA with Arria10 FPGA

- Multichip package including:
  - Intel<sup>®</sup> Xeon<sup>®</sup> E5-2600 v4
  - Intel<sup>®</sup> Arria10 GX 1150 FPGA
    - 427'200 ALMs, 1'708'800 registers, 1'518 DSPs
- Hardened floating-point add/mult blocks (HFB)!
- Host interface: bandwidth 5x higher than in the Stratix V version
- Memory: cache-coherent access to main memory
- Programming model: Verilog, with OpenCL to follow soon





## Sort with Xeon<sup>®</sup> and FPGA

### Sorting of INT arrays with 32 elements

- Implemented pipeline with 32 array stages
- FPGA sort is up to 117 times faster than a single Xeon<sup>®</sup> thread
- Bandwidth to the FPGA is the bottleneck

Figure: ratio of the sorting time with Xeon only to the sorting time with Xeon plus FPGA.
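
The slides do not say which sorting scheme the 32 stages implement. One scheme that maps naturally onto a fixed-depth hardware pipeline is odd-even transposition sort: each stage is a row of independent compare-and-swap units, and n stages are enough to sort n elements. The software model below is a sketch under that assumption, not the design used on the FPGA.

```python
def odd_even_transposition_sort(values):
    """Sort a sequence with len(values) stages of neighbour compare-and-swap.

    Each outer iteration models one pipeline stage; all swaps inside a
    stage touch disjoint pairs, so hardware could do them in parallel.
    """
    data = list(values)
    n = len(data)
    for stage in range(n):
        start = stage % 2  # even stages pair (0,1),(2,3)...; odd stages pair (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


# A 32-element integer array passes through 32 stages.
array = [(i * 37 + 11) % 101 for i in range(32)]
print(odd_even_transposition_sort(array) == sorted(array))
```

Because every stage has the same fixed structure, a new array can enter the pipeline each clock cycle, which is consistent with the transfer bandwidth rather than the sort itself being the limit.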







### Test case: RICH PID Algorithm

- RICH PID is not processed for every event because the processing time is too long!



#### Calculations:

- solve quartic equation
- cube root
- complex square root
- rotation matrix
- scalar/cross products
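
To make the listed building blocks concrete, here is a minimal software sketch of each primitive except the quartic solver. The function names and the choice of Rodrigues' formula for the rotation are assumptions for illustration; the slides do not specify how the blocks are implemented.

```python
import math


def cube_root(x):
    """Real cube root that also handles negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def complex_sqrt(z):
    """Principal square root of a complex number using real square roots only."""
    r = abs(z)
    re = math.sqrt(max((r + z.real) / 2.0, 0.0))
    im = math.copysign(math.sqrt(max((r - z.real) / 2.0, 0.0)), z.imag)
    return complex(re, im)


def dot(a, b):
    """Scalar product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(v, axis, angle):
    """Rotate v about a unit-length axis by angle (radians), Rodrigues' formula."""
    c, s = math.cos(angle), math.sin(angle)
    k_cross_v = cross(axis, v)
    k_dot_v = dot(axis, v)
    return tuple(v[i] * c + k_cross_v[i] * s + axis[i] * k_dot_v * (1.0 - c)
                 for i in range(3))


print(cube_root(-27.0), complex_sqrt(3 + 4j), rotate((1, 0, 0), (0, 0, 1), math.pi / 2))
```

Cube roots and complex square roots typically appear when a quartic is solved in closed form, which is presumably why they are listed next to the quartic equation.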







## Implementation of Cherenkov Angle reconstruction on Stratix V

- 748-clock-cycle-long pipeline written in Verilog
  - Additional blocks developed: cube root, complex square root, rot. matrix, cross/scalar product,...
    - Lengthy task in Verilog with all test benches (implementation took 2.5 months)
- Pipeline running at 200 MHz $\rightarrow$ 5 ns per photon

- FPGA resources:

| FPGA Resource Type | FPGA Resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 88                      | 30                     |
| DSPs               | 67                      | 0                      |
| Registers          | 48                      | 5                      |
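
The 200 MHz clock and the 748-cycle depth quoted above fix both the throughput and the latency of the pipeline. The sketch below works through that arithmetic, assuming the pipeline is fully pipelined and accepts one photon per clock cycle (which the "5 ns per photon" figure implies); the function name is illustrative.

```python
def pipeline_timing(clock_hz, depth_cycles, n_items):
    """Return (period_ns, latency_ns, total_ns) for a fully pipelined design.

    Assumes one new item enters per clock cycle, so the first result
    appears after depth_cycles and one more result follows every cycle.
    """
    period_ns = 1e9 / clock_hz
    latency_ns = depth_cycles * period_ns
    total_ns = (depth_cycles + n_items - 1) * period_ns
    return period_ns, latency_ns, total_ns


period, latency, total = pipeline_timing(200e6, 748, 1_000_000)
print(f"{period:.1f} ns per photon, {latency / 1e3:.2f} us latency, "
      f"{total / 1e6:.2f} ms for 1e6 photons")
```

For large photon counts the pipeline depth is negligible: the total time is dominated by the 5 ns per photon, not by the 748-cycle fill latency.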





### Implementation of Cherenkov Angle reconstruction on Arria 10

- 259-clock-cycle-long pipeline written in Verilog
  - Stratix V blocks ported using HFB: complex square root, rot. matrix, cross/scalar product,...
- Pipeline running at 200 MHz $\rightarrow$ 5 ns per photon
  - 400 MHz possible with an Arria 10 GT FPGA

- FPGA resources:

| FPGA Resource Type | FPGA Resources used [%] | Used for interface [%] |
|--------------------|-------------------------|------------------------|
| ALMs               | 32                      | 18                     |
| DSPs               | 15                      | 0                      |
| Registers          | 12                      | 5                      |





### Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA Results



- Acceleration by a factor of up to 35 with the Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA relative to a single Intel<sup>®</sup> Xeon<sup>®</sup> thread
- Theoretical limit of the photon pipeline: a factor of 64 for the Stratix V FPGA and a factor of ~300 for the Arria 10 FPGA
- Bottleneck: data transfer bandwidth to the FPGA; caching can improve this, and tests are ongoing
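
The gap between the measured factor and the theoretical limit can be illustrated with a simple bound: if transfers and computation overlap perfectly, the time per photon is set by the slower of the two. In the sketch below, the 5 ns pipeline time comes from the slides and the 320 ns CPU time is what a limit of 64 would imply at 5 ns per photon; the bytes per photon and the link bandwidths are purely hypothetical placeholders, not measured values.

```python
def effective_speedup(cpu_ns_per_item, fpga_ns_per_item,
                      bytes_per_item, link_gbyte_per_s):
    """Speedup over the CPU when the FPGA is limited by compute or by transfer.

    bytes / (GB/s) conveniently comes out in nanoseconds. Assumes transfer
    and compute overlap fully, so the slower one sets the pace.
    """
    transfer_ns = bytes_per_item / link_gbyte_per_s
    return cpu_ns_per_item / max(fpga_ns_per_item, transfer_ns)


CPU_NS = 64 * 5.0   # implied by the theoretical limit of 64 at 5 ns per photon
FPGA_NS = 5.0       # from the slides: 200 MHz pipeline
BYTES = 64          # hypothetical payload per photon

for bandwidth in (3.2, 6.4, 12.8, 25.6):  # hypothetical link bandwidths in GB/s
    print(bandwidth, round(effective_speedup(CPU_NS, FPGA_NS, BYTES, bandwidth), 1))
```

In this model the speedup grows with link bandwidth until the transfer time drops below the pipeline time, after which it saturates at the theoretical limit.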




| FPGA Resource Type | FPGA Resources used [%] | FPGA Resources used [%] |
|--------------------|-------------------------|-------------------------|
| ALMs               | 88                      | 63                      |
| DSPs               | 67                      | 82                      |
| Registers          | 48                      | 24                      |

Similar resource usage.




### Compare PCIe – QPI Interconnect



- Nallatech 385 PCIe vs. Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA QPI
- Both Stratix V A7 with 256 DSPs
- Programming model: OpenCL



Figure: RICH kernel, reconstruction of 1'000'000 photons. Comparison of the acceleration achieved with the Nallatech 385 and with the Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA.







### **Future Tests**

- Implement additional LHCb HLT algorithms
  - Kalman filter, decompressing and re-formatting packed binary data from detector, ...
- Compare performance with the Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA system with Skylake + Arria 10 FPGA
  - Skylake + Arria 10 has now arrived in our lab 🙂

- Measurements of Arria10 PCIe accelerators
- Compare Verilog vs. OpenCL
- Power measurements
  - Compare with GPUs, ...







### Summary



- Results are very encouraging for the use of FPGA acceleration in the HEP field
- The Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA accelerator performs better than the Nallatech PCIe board using the same FPGA
- The OpenCL programming model is very attractive and convenient for the HEP field
- Other experiments also want to test the Intel<sup>®</sup> Xeon<sup>®</sup>-FPGA with Arria10!
- The high-bandwidth interconnect and the modern Arria10 FPGA promise high performance and high performance per joule for HEP algorithms! Don't forget Stratix10!






# Backup





### Nallatech 385 Board

- FPGA: Altera Stratix V GX A7
  - 234'720 ALMs, 940'000 Registers
  - 256 DSPs
- Programming model: OpenCL
- Host interface: 8-lane PCIe Gen3
  - Up to 7.5 GB/s
- Memory: 8 GB DDR3 SDRAM
- Network enabled with two SFP+ 10GbE ports
- Power usage: ≤ 25 W (GPU up to 300 W)

