





#### Imperial College London











# FULL CHAIN DEMONSTRATOR



**Tom James, Imperial College London** TMTT: L1 Tracking Review 08/Dec/2016



#### SYSTEM ARCHITECTURE - THE TRACK FINDING PROCESSOR (TFP) - RECAP

- Track Finding Processor (TFP) boards receive links from adjacent detector (DTC) octants, corresponding to a single processing octant
- Each TFP processes $1/8$ in $\varphi$ and $1/\text{TMP}$ in time




No further duplication or sharing between regions is required downstream

- ► System is compartmentalised
  - One processing octant becomes the demonstrator slice unit



# **SYSTEM ARCHITECTURE - MAPPING TO DEMONSTRATOR SLICE**

- TFP is internally divided into logical elements, each on separate boards (MP7s)
- Simplifies division of labour and algorithm development/testing
- Presently available FPGA resources are not necessarily a limit on the scale/performance of the algorithms we want to implement
- Can extrapolate FPGA resources to those of a future processing card, demonstrating what could be final system performance with technology currently in hand








### **DEMONSTRATOR HARDWARE - THE MASTER PROCESSOR 7 (MP7)**

- The MP7 is a generic high-performance data-stream processing double width AMC card, developed for use in the Phase I calorimeter trigger upgrade
- Equipped with a Xilinx Virtex-7 690-T FPGA

- 12 Avago Technologies MiniPOD optical transmitters/receivers
  - Each provides 12 optical links running at up to 10.3 Gbps
  - Total optical bandwidth 0.74 Tbps each way



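The quoted optical bandwidth can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes the 12 MiniPODs split evenly into 6 transmitters and 6 receivers, which is what the 0.74 Tbps each-way figure implies; the constant names are illustrative only.

```python
# Assumption: 12 MiniPODs = 6 transmitters + 6 receivers (implied by the
# "0.74 Tbps each way" figure, not stated explicitly on the slide).
MINIPODS_PER_DIRECTION = 6
LINKS_PER_MINIPOD = 12      # from the slide
LINE_RATE_GBPS = 10.3       # from the slide (maximum line rate)

links_per_direction = MINIPODS_PER_DIRECTION * LINKS_PER_MINIPOD
bandwidth_tbps = links_per_direction * LINE_RATE_GBPS / 1000.0

print(f"{links_per_direction} links each way, {bandwidth_tbps:.2f} Tbps each way")
```

This is the raw line rate; usable payload bandwidth will be lower once link encoding overhead is taken into account.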











# **DEMONSTRATOR HARDWARE - THE MASTER PROCESSOR 7 (MP7)**

- Comes with well supported firmware and software infrastructure
- Firmware already provided for core tasks such as transceiver buffering, I/O formatting & external communication and configuration
  - Simplifies demonstration process greatly -> can focus on the algorithms
- Using well understood hardware means that details of timing and synchronisation were not an issue - we benefit from a great deal of work previously done








#### **DEMONSTRATOR HARDWARE**

- Location: Tracker Integration Facility (TIF) B186, CERN
- **CERN Blue rack -** Turbine, 3-phase power, air deflector, water cooling/heat exchangers
- **Schroff** MicroTCA crate powered by external PowerOne 48V PSU & Vadatech power modules
- Equipped with NAT-MCH (Gigabit Ethernet communication via backplane) & AMC13 (synchronisation, timing & control)
- PowerEdge R620 CMS rack PC
- 11 MP7-XE's installed in Schroff crate
  - 5 for algorithm demonstration (one TFP)
  - 3 as large buffers for source & sink
  - 3 spares/backup
- Boards daisy-chained with optical fibres to meet required demonstrator configuration






#### **DEMONSTRATOR OVERVIEW - DATA TAKING**

- 8 daisy-chained MP7 boards
- Five boards emulate one Track Finding Processor
- Processes Monte Carlo stubs for any one octant in $\varphi$, and all of $\eta$, at once
- We take data for all eight octants to generate hardware results for the entire tracker








#### **DEMONSTRATOR OVERVIEW - SOURCE & SINK**

- The source represents up to 36 virtual DTCs, covering a  $\varphi$ -octant in both z+ and z-
- **Source** stores ~30 events for playback
- The **sink** stores output of ~30 events



A virtual DTC is a time-multiplexed stream of data, formatted as if it came from a real DTC.





#### **DEMONSTRATOR OVERVIEW - SOURCE & SINK**

- Same firmware for both source & sink
- 72 big buffers, each a 16k-deep dual-port RAM
- Acts like a **FIFO**
- 16k-deep RAMs hold 8k 32-bit half-stubs, as 2 bits must be used for data valid & strobe
- Read/write via **IPbus**



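The step from 16k RAM words to 8k half-stubs can be sketched as below. The 18-bit RAM word width (16 payload bits plus the data-valid and strobe flags) is an assumption chosen because it reproduces the factor of two; the slides do not state the word width.

```python
RAM_DEPTH_WORDS = 16 * 1024   # from the slide: 16k-deep RAMs
RAM_WORD_BITS = 18            # assumption: word width is not given in the slides
FLAG_BITS = 2                 # from the slide: data valid + strobe
HALF_STUB_BITS = 32           # from the slide

# Payload bits left in each RAM word once the two flag bits are reserved.
payload_bits_per_word = RAM_WORD_BITS - FLAG_BITS

# Number of RAM words needed to hold one 32-bit half-stub.
words_per_half_stub = HALF_STUB_BITS // payload_bits_per_word

half_stubs_per_buffer = RAM_DEPTH_WORDS // words_per_half_stub
print(f"{half_stubs_per_buffer} half-stubs per buffer")
```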



# **DEMONSTRATOR OVERVIEW**

- Objective: to run standard physics samples through a hardware demonstrator to ensure that the expected performance, as seen in simulation results, is realistic
- Full MC events are passed through hardware, and the tracks found are compared with those found by our CMSSW simulation/emulation software
- We compare hardware output **directly** with the CMSSW simulation software, so performance can be measured directly with hardware

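A hardware-versus-software comparison of this kind could look like the sketch below. The `Track` fields and the exact-match criterion are hypothetical; the slides do not define how a hardware track is matched to a software track.

```python
from typing import NamedTuple


class Track(NamedTuple):
    # Hypothetical digitised track summary; field names are illustrative only.
    event: int
    q_over_pt_bin: int
    phi0_bin: int
    eta_region: int


def matched_fraction(hw_tracks, sw_tracks):
    """Fraction of hardware tracks that have an identical software track."""
    if not hw_tracks:
        return 0.0
    sw_set = set(sw_tracks)
    matched = sum(1 for track in hw_tracks if track in sw_set)
    return matched / len(hw_tracks)


# Toy data: two of the three hardware tracks also appear in software.
hw = [Track(0, 3, 17, 5), Track(0, -2, 40, 8), Track(1, 7, 12, 1)]
sw = [Track(0, 3, 17, 5), Track(1, 7, 12, 1), Track(1, 0, 30, 2)]
print(f"matched: {100 * matched_fraction(hw, sw):.1f}%")
```

Exact matching on digitised values only makes sense if the software emulates the hardware bit for bit; a looser criterion would need a tolerance on each parameter.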





| TTbar + 200 PU | Efficiency (%) | Av. rate (tracks/event) | Matched tracks (%) |
|----------------|----------------|-------------------------|--------------------|
| hw             | 94.5           | 76.5                    |                    |
| SW             | 94.8           | 79.4                    | 90.7               |

### DEMONSTRATOR RESULTS - MUONS 8-100 GEV + 200 PU (1100 EVENTS)

- Well-matched efficiency in CMSSW and the demonstrator allows results to be extrapolated to the higher statistics currently available in software

|    | Efficiency (%) | Matched tracks (%) |
|----|----------------|--------------------|
| hw | 97.1           |                    |
| SW | 97.1           | 99.2               |



# DEMONSTRATOR RESULTS - ELECTRONS IN TTBAR + 200 PU (1800 EVENTS)

- Particle-gun electron samples were not available until yesterday
  - Using electrons in ttbar + 200 PU samples instead
- Performance matches CMSSW simulation

|    | Efficiency (%) | Matched tracks (%) |
|----|----------------|--------------------|
| hw | 81.4           |                    |
| SW | 81.8           | 90.7               |





# DEMONSTRATOR RESULTS - DIGITISED RESOLUTIONS (TTBAR + 200PU)

- Resolution of track helix parameters is good in both simulation and hardware
- Realised very recently that the demonstrator z<sub>0</sub> resolution was degraded by our choice of 12 and 10 bit encoding of the r and z stub coordinates
- Simulation shows we can recover optimal resolution in the demonstrator by using 2 more bits to encode:
  - r of stubs in the barrel
  - z in endcaps
- Can trivially accommodate this change in the demonstrator without degrading the resolution of other helix parameters, but did not have time before the review
- The software configuration for the demonstrator comparison plots uses the smaller number of bits for better matching

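As a rough illustration of why two extra bits help, the sketch below assumes a uniform quantiser: the least significant bit (LSB) is the coordinate range divided by $2^n$, and the RMS rounding error is LSB/$\sqrt{12}$. The 128 cm range is a hypothetical placeholder, not the demonstrator's actual encoding range.

```python
import math


def lsb(full_range, n_bits):
    """Step size of a uniform quantiser covering full_range with n_bits."""
    return full_range / (2 ** n_bits)


def rms_quantisation_error(full_range, n_bits):
    """RMS rounding error of a uniform quantiser (LSB / sqrt(12))."""
    return lsb(full_range, n_bits) / math.sqrt(12.0)


COORD_RANGE_CM = 128.0  # hypothetical range, for illustration only

for bits in (10, 12):
    step = lsb(COORD_RANGE_CM, bits)
    rms = rms_quantisation_error(COORD_RANGE_CM, bits)
    print(f"{bits} bits: LSB = {step:.5f} cm, RMS error = {rms:.5f} cm")
```

Whatever the true range, each added bit halves the step, so two extra bits reduce the digitisation contribution to the resolution by a factor of four.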



# DEMONSTRATOR RESULTS - TTBAR + 200PU RESOLUTIONS (1800 EVENTS)

Excellent helix parameter resolutions measured in demonstrator

Plots shown:

- $q/p_T$ resolution [1/GeV]
- $\varphi_0$ resolution [rad]
- $z_0$ resolution [cm]
- $\eta$ resolution




# DEMONSTRATOR RESULTS - RATE FOR TTBAR + 200PU

- The average rate of tracks out of the duplicate removal stage is measured in hardware as ~76 tracks per event
- Duplicate removal was recently integrated into the processor chain and shows good results
- Small discrepancies in duplicate rate will be debugged over the next couple of weeks

| TTbar + 200 PU | Av. rate (tracks/event) |
|----------------|-------------------------|
| hw             | 76.5                    |
| SW             | 79.4                    |

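To put the size of the hardware-software rate discrepancy in context, the relative difference can be computed directly from the two quoted averages. This is plain arithmetic on the table values, with no assumption about its cause.

```python
hw_rate = 76.5  # average tracks per event measured in hardware (from the table)
sw_rate = 79.4  # average tracks per event in software (from the table)

relative_difference = (hw_rate - sw_rate) / sw_rate
print(f"hw relative to SW: {100 * relative_difference:+.1f}%")
```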



# DEMONSTRATOR RESULTS - DEAD COOLING LOOP SCENARIO - TTBAR + 200 PU (900 EVENTS)

- As seen in CMSSW simulation, the demonstrator can also be configured to recover performance in a dead cooling loop scenario
- Online configuration of the Hough Transform over IPbus is all that is required
- In this dead cooling loop example, eta regions 5-8 have been configured to accept HT candidates with only 4 stubs
- Average efficiency is preserved at 94.6% in hardware

|    | Efficiency (%) | Matched tracks (%) |
|----|----------------|--------------------|
| hw | 94.2           |                    |
| SW | 94.5           | 70.4               |

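The relaxed acceptance described above can be pictured as a per-eta-region threshold. The sketch below is illustrative only: the nominal threshold of 5 stubs is an assumption (the slides give only the relaxed value of 4), and the function names are hypothetical rather than the demonstrator's configuration interface.

```python
DEFAULT_MIN_STUBS = 5              # assumption: nominal threshold is not stated in the slides
RELAXED_MIN_STUBS = 4              # from the slides
RELAXED_ETA_REGIONS = range(5, 9)  # eta regions 5-8, from the slides


def min_stubs(eta_region):
    """Stub threshold for a Hough Transform candidate in a given eta region."""
    if eta_region in RELAXED_ETA_REGIONS:
        return RELAXED_MIN_STUBS
    return DEFAULT_MIN_STUBS


def accept_candidate(eta_region, n_stubs):
    """True if a candidate with n_stubs passes the threshold for its region."""
    return n_stubs >= min_stubs(eta_region)


print(accept_candidate(6, 4))  # region affected by the dead cooling loop
print(accept_candidate(2, 4))  # unaffected region keeps the nominal threshold
```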



# **DEMONSTRATOR RESULTS - FPGA RESOURCES**

- Although we are using 5 MP7s to demonstrate one Track Finding Processor, the actual resource usage of the system is much smaller than what we have available
- The GP+HT and the KF+DR could each fit inside an Ultrascale or Ultrascale+ generation chip

| Per tracker octant (one TFP) | LUTs  | BRAM (36Kb) | DSPs |
|------------------------------|-------|-------------|------|
| GP + HT                      | 412k  | 1566        | 1560 |
| KF + DR                      | 382k  | 1750        | 5040 |
| Virtex 7 690                 | 433k  | 1470        | 3600 |
| Kintex Ultrascale 115        | 663k  | 2160        | 5520 |
| Virtex Ultrascale+ 11P       | 1296k | 1970        | 9216 |

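The fit claim can be checked mechanically against the resource table. The sketch below is a naive comparison of totals under the assumption that a block "fits" when every resource count is at or below the device total; the chip names follow the most likely reading of the partly truncated table labels. It ignores routing, timing closure and infrastructure firmware.

```python
# (LUTs in thousands, 36Kb BRAMs, DSPs), copied from the resource table.
USAGE = {
    "GP + HT": (412, 1566, 1560),
    "KF + DR": (382, 1750, 5040),
}
CHIPS = {
    "Virtex 7 690": (433, 1470, 3600),
    "Kintex Ultrascale 115": (663, 2160, 5520),
    "Virtex Ultrascale+ 11P": (1296, 1970, 9216),
}


def fits(usage, capacity):
    """True if every resource requirement is within the device capacity."""
    return all(used <= available for used, available in zip(usage, capacity))


for block, usage in USAGE.items():
    for chip, capacity in CHIPS.items():
        verdict = "fits" if fits(usage, capacity) else "does not fit"
        print(f"{block} {verdict} in {chip}")
```

On these numbers each block exceeds a single Virtex 7 690 in BRAM, which is consistent with spreading one TFP over several MP7s, while each block fits within either of the two newer devices.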



# **DTC REQUIREMENTS**

- TMTT requirements of the DTC FPGA
  - Conversion to global coordinates (48 bits)
  - Sorting data by event into N time-multiplexed streams (demonstrator N = 36)
  - Duplicating data at our processing node boundaries, so that no cross-node data flow is required downstream

Figure: reminder of DTC input

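Sorting by event into N time-multiplexed streams can be sketched as a round-robin assignment. The modulo rule below is an assumption about how events map to streams; the slides only state that the demonstrator uses N = 36. Function names are illustrative.

```python
N_STREAMS = 36  # demonstrator value, from the slides


def stream_for_event(event_number, n_streams=N_STREAMS):
    """Round-robin stream index for an event (assumed mapping)."""
    return event_number % n_streams


def time_multiplex(stubs_by_event, n_streams=N_STREAMS):
    """Group each event's stubs into its time-multiplexed stream, in event order."""
    streams = [[] for _ in range(n_streams)]
    for event_number in sorted(stubs_by_event):
        index = stream_for_event(event_number, n_streams)
        streams[index].append((event_number, stubs_by_event[event_number]))
    return streams


# Toy input: 72 consecutive events, each with one placeholder stub.
example = {n: [f"stub-{n}"] for n in range(72)}
streams = time_multiplex(example)
print([event for event, _ in streams[0]])
```

With this mapping every stream receives one complete event every N events, so a downstream processor sees whole events rather than fragments.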

# **PROPOSED DTC - IMPLEMENTATION & LATENCY ESTIMATE**

- We have proposed a DTC solution that provides us with our time-multiplexed streams, but also avoids large fan-outs and fan-ins at all costs
- DTC latency estimate: 60 clocks at 240 MHz = 250 ns
- Experience delivering a time-multiplexed calorimeter trigger and a track-trigger demonstrator gives us confidence that this is realistic

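The conversion from clock cycles to nanoseconds behind the DTC estimate is shown below as a small helper. The function name is illustrative.

```python
def clocks_to_ns(n_clocks, clock_mhz):
    """Convert a number of clock cycles at clock_mhz into nanoseconds."""
    return n_clocks * 1000.0 / clock_mhz


dtc_latency_ns = clocks_to_ns(60, 240.0)
print(f"DTC latency estimate: {dtc_latency_ns:.0f} ns")
```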












## **DEMONSTRATOR LATENCY - MEASUREMENTS**

Latency of all parts of the demonstrator chain is **fixed**: it is independent of pileup and of the event.

Latency is measured for each block and each set of links independently, and also for the total chain as validation.







# **DEMONSTRATOR LATENCY MEASUREMENTS**

Demonstrator target processing latency of 4 µs has been achieved

- Latency has been tuned for the worst-case scenario (ttbar+200PU, flat tracker geometry)
- However, the final system latency must also include the DTC, but will have fewer serdes & optical links within the Track Finding Processor

| Demonstrator chain                | Latency (ns) |
|-----------------------------------|--------------|
| Serdes & optical length 1         | 143          |
| Geometric Processor               | 310          |
| Serdes & optical length 2         | 144          |
| Hough Transform                   | 1025         |
| Serdes & optical length 3         | 129          |
| Kalman Filter + Duplicate removal | 1658         |
| Serdes & optical length 4         | 129          |
| Total (first out - first in)      | 3538         |
| Last out - first out              | 225          |
| Total (last out - first in)       | 3763         |





### **DEMONSTRATOR LATENCY MEASUREMENTS**

- Final system latency must also include the DTC, but will have fewer serdes & optical links within the *Track Finding Processor*
- Have already explored ways we could reduce the latency further
  - All parts of the system are currently clocked at 240 MHz
  - Accumulation periods, as we wait for all data to arrive in the HT and KF, could be reduced with faster links or a smaller TMUX period

| System latency               | Latency (ns) |
|------------------------------|--------------|
| DTC estimate                 | 250          |
| Serdes & optical length x 3  | 450          |
| Geometric Processor          | 310          |
| Hough Transform              | 1025         |
| Kalman Filter                | 1620         |
| Duplicate removal            | 38           |
| Total (first out - first in) | 3693         |
| Last out - first out         | 225          |
| Total (last out - first in)  | 3918         |

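Both latency budgets can be re-derived by summing the per-stage values quoted in the two latency tables. The sketch below assumes all entries are in nanoseconds, as implied by the 250 ns DTC estimate; the dictionary and function names are illustrative.

```python
# Per-stage latencies in ns, copied from the demonstrator chain table.
DEMONSTRATOR_NS = {
    "Serdes & optical length 1": 143,
    "Geometric Processor": 310,
    "Serdes & optical length 2": 144,
    "Hough Transform": 1025,
    "Serdes & optical length 3": 129,
    "Kalman Filter + Duplicate removal": 1658,
    "Serdes & optical length 4": 129,
}

# Per-stage latencies in ns, copied from the final-system estimate table.
SYSTEM_NS = {
    "DTC estimate": 250,
    "Serdes & optical length x 3": 450,
    "Geometric Processor": 310,
    "Hough Transform": 1025,
    "Kalman Filter": 1620,
    "Duplicate removal": 38,
}

LAST_OUT_MINUS_FIRST_OUT_NS = 225  # same value in both tables


def totals(stages):
    """Return (first out - first in, last out - first in) in ns."""
    first_out = sum(stages.values())
    return first_out, first_out + LAST_OUT_MINUS_FIRST_OUT_NS


for name, stages in (("demonstrator", DEMONSTRATOR_NS), ("system", SYSTEM_NS)):
    first_out, last_out = totals(stages)
    print(f"{name}: first out {first_out} ns, last out {last_out} ns")
```

Both totals stay under the 4 µs target, and the Kalman Filter and duplicate removal entries in the system table add up to the combined value measured in the demonstrator.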



#### SUMMARY

- Have built a track-finding & fitting hardware demonstrator with currently available MicroTCA boards
- Capable of finding and fitting real physics tracks
  - over the entire $2\pi$ and $|\eta| < 2.4$ solid angle
  - one octant in $\varphi$ at a time
- Have demonstrated
  - high efficiency and rate reduction in Monte Carlo physics events
  - including TTbar + 200 PU
- With a fixed processing latency < 4.0 µs





#### **BACKUP - DEMONSTRATOR RESULTS - RATE REDUCTION**

► Hough Transform does the vast majority of the rate reduction








