



# Readout Unit Overview

J. Schambach, University of Texas at Austin, July 19–20, 2018

MVTX Directors Review



- Based on ALICE ITS Upgrade electronics
- Requirement: Less than 5% additional dead time @ 15 kHz average trigger rate and simulated occupancy
- Same Readout Units as ITS, but different Data Aggregation Module (ATLAS FELIX)
- Trigger Rate: 15 kHz
- Comparison ITS vs. MVTX:
  - Peak Au+Au collision rate: 200 kHz (sPHENIX) : 50 kHz (ALICE)
  - Peak p+p collision rate: 13 MHz (sPHENIX) : 200 kHz (ALICE)
  - Event size, dN/dy: sPHENIX = 1/3 ALICE (pp), 1/5 ALICE (AA MB), 1.1 ALICE (AA central)
  - Trigger/readout rate: 15 kHz (sPHENIX) : 50 kHz (ALICE)
  - Expected average data rate: MVTX = 30% of ALICE (mostly central AA events, lower trigger rate)





### Hardware

- 48 ALICE ALPIDE Staves + Interface Cables
- 48 Front End Electronics (ALICE RUv2)
- 6 Back End Electronics (ATLAS FELIX v2.x)
- 6 EBDC servers
- 24 Power Boards + Supplies + Cables
- 48 Stave to RU cables
- 144 data fiber optic cables (3 fibers x 48 FEE)
- Electronics Spares ~20%
- Stave Spares ~75%





# Readout Unit Design























# Scrubbing


- Scrubbing is an error correction technique that uses a background task to periodically inspect/correct errors in a data memory.
  - Data memory = Config mem of Xilinx Ultrascale
  - Errors caused by single event upsets
- Relevant scrubbing techniques for the RU:
  - Xilinx Soft Error Mitigation Core (SEM IP)<sup>1</sup>
    - Supported by Xilinx
    - Detection and correction
    - Fast
    - Black box design
    - SEM IP core itself is only partially mitigated
    - Scan starts from zero upon an upset
  - External scrubbing network
    - Proven solution (ALICE TPC RCU1)<sup>2</sup>
    - Full control of design
    - No support by Xilinx
    - Substantially slower than SEM IP
    - Mitigation of Flash and aux FPGA is needed



<sup>2</sup> <u>https://cds.cern.ch/record/1141616</u>, chapter 4
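As a toy illustration of the scrubbing idea described above, the sketch below models a configuration memory as a list of frames, injects one bit flip, and runs a single scrub pass that compares each frame with a golden copy and rewrites any mismatch. The 32-bit frame width, the golden-copy comparison, and the names `inject_upset` and `scrub_pass` are illustrative assumptions; this is not how the SEM IP or the RU's external scrubber is implemented.

```python
import random

def inject_upset(memory, rng):
    """Flip one random bit in one random frame to mimic a single event upset."""
    frame = rng.randrange(len(memory))
    bit = rng.randrange(32)  # assumed 32-bit frames, purely illustrative
    memory[frame] ^= 1 << bit
    return frame, bit

def scrub_pass(memory, golden):
    """Compare every frame with the golden copy and rewrite corrupted frames.

    Returns the indices of the frames that were corrected.
    """
    corrected = []
    for index, reference in enumerate(golden):
        if memory[index] != reference:
            memory[index] = reference
            corrected.append(index)
    return corrected

rng = random.Random(2018)
golden = [rng.getrandbits(32) for _ in range(16)]  # hypothetical configuration image
memory = list(golden)

hit_frame, hit_bit = inject_upset(memory, rng)
fixed = scrub_pass(memory, golden)
print(f"upset in frame {hit_frame}, bit {hit_bit}; scrub corrected frames {fixed}")
```

In a real system the pass would run periodically in the background, so the time an upset stays in memory is bounded by the scrub period.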











- The RU must comply with MFT and sPHENIX specifications, which require using 25 and 27 GTX lanes, respectively, for reading the sensors. To solve the routing, an interchangeable mezzanine card solution (**Transition Board**) has been adopted.





### Readout Unit – Detailed Block Diagram

























### Power Unit Control via GBT or CANbus



- Main: CRU => GBT => US(+) => Power Board (2\*16SE + 2\*I2C)













# Prototypes and Tests




- Separate Boards for FPGA and GBT
- Based on Xilinx Kintex-7 FPGA
- Integrated USB Interface for debug
- Separate Fiber connection for emulation of the CRU
- Integrated Power Supply to power sensors independent of Power Unit
- GBT components on separate board connected via FMC
- Various sensor connector configurations to prototype both Inner & Outer Barrel
- VME 6U form factor
- Used for initial verification of electrical interfaces
- Initial experience with GBT parts
- Used for radiation testing of fundamental firmware components and initial testing of mitigation techniques





## Readout Unit Prototype Version 1: "RUv1"

- 6U VME size
- Based on Xilinx Kintex Ultrascale FPGA (better radiation performance, more resources)
- Eliminate all extra components from RUv0
- Use close to final component types and numbers
- Layout close to final design
- 8 RUv1.0 boards produced
  - 2 in August 2017
  - 6 in September 2017
- 13 RUv1.1 boards produced (minor modifications and bug fixes)
  - 6 in November 2017
  - 7 in March 2018
















- RU power section
  - 7 DCDC regulators (accuracy, noise, ripple, inrush, sequencing)
  - I<sub>SENSE</sub> offset problem solved by replacing the AD626 amplifier with the AD8418
- Communication with sensors/ALPIDE using transition board & FireFly
  - CLOCK, CONTROL, DATA @ 1.2 Gbps and 400 Mbps
- Communication via 4.8 Gbps GBT links (GBTx/1/2, VTRx/VTTx)
  - Access & program GBTx using CERN USBI2C dongle (connector bug)
- Communication with the SCA chip
  - monitor voltage/current, temperature, VTRx optical power
  - I2C, JTAG, GPIO interface
- Clock distribution: jitter/levels on crystal oscillators, jitter cleaner and clock buffer
- USB3 (using FX3 chip) communication & boot from I2C PROM
- JTAG chain (primary US & PA3 program, secondary FX3 & GBTx/1/2)
- Program Xilinx US with PA3 via SelectMAP interface
- Communication with the PU using front panel I2C
- FLASH PROM (access, read-ID, read & write)
- CAN transceiver



### Radiation Validation – Latest Beam Tests



| Facility                  | Date          | Goal                               | DUT            | Particles            | <b>Flux<sup>1</sup></b><br>[p cm <sup>-2</sup> s <sup>-1</sup> ] | Fluence <sup>2</sup><br>[p cm <sup>-2</sup> ]       | Duration<br>[hours] | Notes                                                                                                                             |
|---------------------------|---------------|------------------------------------|----------------|----------------------|------------------------------------------------------------------|-----------------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| Prague                    | 2015-<br>2017 | Firmware protection                | RUv0           | Protons<br>30 MeV    | 1×10<sup>7</sup>                                                 | 4×10 <sup>12</sup>                                  | 4×48                | Beam limited to the FPGA only. Most effective way to check firmware design.                                                       |
| CHARM                     | Oct<br>2017   | Cavern spectrum overall evaluation | RUv1           | Mixed<br>field       | 5 kRad/day<br>(≈5×10 <sup>7</sup> n/day)                         | 10 kRad<br>(≈1×10 <sup>8</sup> n cm <sup>-2</sup> ) | 48 ÷ 96             | CHARM gives ≈2×10 <sup>7</sup> HEH (30% to 80% being neutron) for each delivered kRad of TID. <u>Destructive test</u> due to TID. |
| Louvain                   | Nov<br>2017   | High statistics<br>DCDC test       | DCDC<br>boards | Neutron<br>23 MeV    | 1×10 <sup>8</sup>                                                | ≈1×10 <sup>13</sup>                                 | 8                   | Extended reliability test for SEL/SEU only (no TID effects at first order).<br>Neutron spectrum > 10 MeV. <u>NON destructive.</u> |
| Prague                    | Dec<br>2017   | Firmware protection                | RUv1           | Protons<br>30 MeV    | 1×10<sup>7</sup>                                                 | 1×10 <sup>12</sup>                                  | 24                  | Beam limited to the FPGA only. Most effective way to check firmware design.                                                       |
| Prague                    | Jan<br>2018   | Firmware protection                | RUv1           | Protons<br>30 MeV    | 1×10<sup>7</sup>                                                 | 1×10 <sup>12</sup>                                  | 24                  | Beam limited to the FPGA only. Most effective way to check firmware design.                                                       |
| <b>ChipIR</b><br>(Oxford) | March<br>2018 | SEE with no TID<br>on PA3          | RUv1           | Neutron<br>≤ 500 MeV | ≈1×10 <sup>6</sup>                                               | ≈1×10 <sup>11</sup>                                 | 12                  | Further verification of the system without destroying PA3 programmability (which dies around 10 kRad).                            |
| GIF<br>(CERN)             | 2018          | TID, scrubbing with realistic flux | RUv2           | Mixed                | 1×10 <sup>4</sup>                                                | ≈1×10 <sup>6</sup>                                  | 62                  | Easily accessible verification tool for TID and scrubbing behavior testing/verification.                                          |
| Prague                    | 2018          | Further FLASH /<br>Firmware tests  | RUv2           | Protons<br>30 MeV    | 1×10 <sup>7</sup>                                                | ≈1×10 <sup>12</sup>                                 | 24                  | Further investigation about the FLASH (flux threshold effect?).<br>Benchmarking of extra TMR in key firmware blocks.              |

\* The Readout Unit number is the one registered in the WP10 material inventory. Units are rotated to allow for appropriate cool-down periods and/or damage.

<sup>1</sup> Realistic flux obtainable at the specific facility

<sup>2</sup> Integrated flux over the irradiation time AND number of DUTs
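Footnote 2 defines fluence as the flux integrated over the irradiation time and the number of DUTs. A minimal sketch of that relation, assuming a constant flux and (for the example) a single DUT, applied to the December 2017 Prague row; the function name `fluence` and the single-DUT assumption are illustrative and not taken from the slides.

```python
def fluence(flux_per_cm2_s, duration_hours, n_duts=1):
    """Integrated flux in particles per cm^2, assuming a constant flux."""
    return flux_per_cm2_s * duration_hours * 3600 * n_duts

# Prague, Dec 2017: 1e7 p cm^-2 s^-1 for 24 hours, one DUT assumed
prague_dec_2017 = fluence(1e7, 24)
print(f"{prague_dec_2017:.2e} p/cm^2")
```

The result is of the same order as the 1×10<sup>12</sup> quoted in the table for that campaign.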





- The system will operate in the ALICE cavern radiation environment, with a total TID of about **10 kRad** (safety factor 10) and a high energy ionizing particle flux of **1 kHz cm**<sup>-2</sup>. Upsets in the logic are the main concern, while TID is extensively tested.
- The Readout Electronics has been designed to incorporate hardware protection (hardware TMR and scrubbing subsystem) and firmware protection (TMR, ECC, etc.) against SEEs.
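To make the TMR idea concrete, the sketch below shows a bitwise majority vote over three redundant copies of a word, which masks an upset in any single copy. It is a software illustration only: the 8-bit example value and the name `tmr_vote` are assumptions, and the RU firmware itself implements such protection in FPGA logic rather than in Python.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)

stored = 0b1011_0010
# One of the three copies suffers a single bit flip
copies = [stored, stored ^ 0b0000_1000, stored]
voted = tmr_vote(*copies)
print(f"voted value matches original: {voted == stored}")
```

The vote only fails if the same bit is upset in two copies before it is repaired, which is why TMR is usually combined with scrubbing.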

### So far tested in

- <u>6 specific test beams</u> at the Rez facility in Prague (30 MeV protons) for firmware and scrubbing verification.
- <u>1 test beam at the CHARM</u> facility (including the Power Unit) in a realistic mixed radiation field.
- <u>1 neutron test beam in Louvain</u> (23 MeV neutrons) specifically intended to further verify the DCDC (LMZ31710RVQ).
- <u>1 neutron test beam at Oxford ChipIR</u> (up to 500 MeV neutrons) for whole system testing.
- Scrubbing from auxiliary FPGA working and effective.
- In any case, even without using the scrubbing, the firmware proved resilient enough to SEEs to operate on average for 1300 hours (29 h full IB or 10 h full OB) before experiencing any upset which would require recovery.
- <u>Data interruption in case of firmware upset has been measured and lasts on average one second</u> for the affected lane, or a few seconds if a reset of the FPGA is necessary.
- Errors on clock resources or other key subsystems are negligible (too rare to gather any significant statistics).
- <u>Commercial DCDC</u> proved compliant with the task, one power glitch every 72 hours foreseen within the whole ITS.





Estimated occurrence in ITS operations (average MTBF) per failure mode:

| Failure mode                   | Affected section                   | MTBF, IB   | MTBF, OB   | MTBF, whole ITS | Corrective action                                    | Downtime per occurrence |
|--------------------------------|------------------------------------|------------|------------|-----------------|------------------------------------------------------|-------------------------|
| Sensor data lane               | 1 sensor for IB<br>½ module for OB | 22 - 40 h  | 4 - 6 h    | 3 - 5 h         | Self-repairing                                       | < 1 s >                 |
| GBT data*                      | 1 full stave                       | 29 h       | 10 h       | 7 h             | 30% self-repairing, 70% reset by slow control        | < 5 s >                 |
| Clock resources                | 1 full stave                       | Negligible | Negligible | Negligible      | Reset by slow control                                | < 5 s >                 |
| Transceiver settings           | 1 sensor, IB only                  | > 932 h    | -          | -               | Reset by slow control                                | < 5 s >                 |
| Flash memory                   | 1 full stave                       | Negligible | Negligible | Negligible      | FLASH reprogramming (30 s beam off, 30 min beam on)  | 30 s – 30 min           |
| PA3<sup>1</sup>                | 1 full stave                       | 172 h      | 58 h       | 43 h            | Reset by slow control                                | < 0 s >                 |
| DCDC power glitch<sup>2</sup>  | 1 full stave                       | 294 h      | 98 h       | 71 h            | Power cycle                                          | < 10 s >                |

\* <u>Data for the non-TMR block</u>; the final version will use a TMR-protected block. This failure mode also includes sensor control and clock failures.

<sup>1</sup> When the PA3 gets stuck the main FPGA is not compromised, and therefore no downtime occurs. TMR of key blocks in the PA3 firmware will further improve that.

<sup>2</sup> Considering 200 RU with 8 DCDC each (20% overestimation)
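The whole-ITS column for the GBT data and PA3 rows can be reproduced from the IB and OB columns under a simple assumption: the two barrels fail independently at constant rates, so their failure rates (1/MTBF) add. The sketch below shows that calculation and also estimates the average fraction of time lost to one failure mode. The function names and the independence assumption are illustrative, not taken from the slides, and the DCDC row, which carries its own overestimation factor, is not modelled here.

```python
def combined_mtbf(*mtbf_hours):
    """MTBF of several independent sections failing at constant rates."""
    return 1.0 / sum(1.0 / m for m in mtbf_hours)

def downtime_fraction(mtbf_hours, downtime_seconds):
    """Average fraction of time lost to one failure mode."""
    return downtime_seconds / (mtbf_hours * 3600.0)

gbt_whole = combined_mtbf(29, 10)   # IB and OB values from the GBT data row
pa3_whole = combined_mtbf(172, 58)  # IB and OB values from the PA3 row
gbt_loss = downtime_fraction(7, 5)  # 5 s lost on average every 7 h

print(f"GBT data, whole ITS: {gbt_whole:.1f} h")
print(f"PA3, whole ITS: {pa3_whole:.1f} h")
print(f"GBT data downtime fraction: {gbt_loss:.2e}")
```

Rounded to whole hours, the two combined values agree with the 7 h and 43 h listed in the table.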



## 4-Alpide Telescope with RUv1 at Fermilab Beam Test







#### **Highlights:**

- Primarily 120 GeV proton beam; also low energy pion beams
- Beam trigger rate ~7 kHz
- Tested high ALPIDE occupancy runs, with 10 cm lead bricks in front of the sensors
- See Sho's talk for details







# RU Production Version "RUv2"



# Readout Unit Version 2 – 3D rendering of the Production Version









| # | Item                                                                                     | RUv1_x                             | RUv2                                                                                                                           |
|---|------------------------------------------------------------------------------------------|------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| 1 | Dimensions                                                                               | 160x233 mm                         | 220x233 mm                                                                                                                     |
| 2 | Power connector                                                                          | J0 (Weidmuller BL/SL 5.08) on back | J0 (MOLEX 172316) to front<br>(J1 stays as it is + switch to select between J0 & J1)                                           |
| 3 | Transition board connector                                                               | Samtec QFS/QMS type                | 2 * Samtec ERF8-50 (USB3 connector also moved)                                                                                 |
| 4 | Power converters                                                                         | Only COTS DCDC                     | Besides COTS also FEASTMP_CLP placeholder<br>Updated DCDC placement for improved PI<br>Use of blind vias to further improve PI |
| 5 | Removal of high components from the cold plate area, aiming for same cold plate power board |                                 |                                                                                                                                |
| 6 | Change clock distribution, with PA3 clock independent from jitter cleaner                |                                    |                                                                                                                                |
| 7 | Remove secondary JTAG chain                                                              |                                    |                                                                                                                                |

Logic and layout of the board are mostly the same as RUv1.1.











# Firmware



### Firmware Overview





Firmware to accomplish these tasks and to mitigate SEEs is mostly complete (see backup slides for an overview of the various firmware components); it has been used in various radiation campaigns and readout tests.

Handle radiation upsets in programmable logic (& sensors)





# MVTX Production & QA





The RUv2 CERN tendering document is ready, including testing procedures (see following slides). Production is foreseen in batches and will last approximately 3-4 months. Four initial prototypes have been produced and are currently being verified.

### Batches

- A pre-series totaling 10 boards;
- A first batch totaling 106 series production modules;
- A second batch totaling 106 series production modules;
- A third batch totaling 88 series production modules.
- Option for 80 more boards for sPHENIX.

#### Hardware components

- 1 Xilinx XCKU060-1FFVA1156C (1156 pin BGA package)
- 1 A3PE600L-FGG484M FP (484 pin BGA package)
- 3 CERN custom GBTx chips (434 pin BGA package)
- 1 CERN custom made SCA chip (196 pin LFBGA package)
- 3 SFP+ pod connectors
- 2 ERF8-050-05.0-L-DV-TR connectors (100 pins, 0.8mm pitch)
- Miniaturized passive components (0201 minimum)

#### Activities at the Contractor's premises

- Quality control of the PCBs.
- Ordering of passive and active components.
- Input quality control of all components.
- Assembly of the components on the PCBs and soldering.
- Quality control of the assembled boards.
- Packing, and shipping.

#### Key technical PCB parameters

- 10 layers low-loss material (Er < 3.7 at 5 GHz, Df < 0.012 at 5 GHz)
- Maximum overall thickness of (1.57±0.13) mm
- Copper Outer layer: 35 μm, Copper Inner layer: 17.5 μm
- Holes per PCB: 5000
- Minimum hole diameter: 0.63 mm
- Blind vias (layers): 32 (1)
- Minimum track width (outer layer): 110 μm, (inner layer): 100 μm
- Minimum spacing (outer layer): 110 μm, (inner layer): 100 μm



- A total of **222** boards will be manufactured for ITS, 88 for MFT, option of 80 boards for MVTX
- Testing will be done in a 2+ stage approach:
  - 1) Hardware testing at the manufacturer (only functional smoke test).
  - 2) Board bring-up, initial hardware verification and short-term functional verification at Nikhef/Utrecht.
  - +) Long-term functional testing at collaborator sites (sampling) & during commissioning.

For MVTX, the second stage of testing will be performed by UT Austin in combination with LANL.
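The board counts quoted above can be cross-checked against the production batches with a few lines of arithmetic. The grouping used here (pre-series plus the first two batches for ITS, the third batch for MFT) is an assumption inferred from the numbers matching, not something the slides state explicitly.

```python
batches = {
    "pre-series": 10,
    "first batch": 106,
    "second batch": 106,
    "third batch": 88,
}
mvtx_option = 80  # optional extra boards for MVTX

# Assumed grouping: pre-series + first two batches go to ITS
its_boards = batches["pre-series"] + batches["first batch"] + batches["second batch"]
base_total = sum(batches.values())
with_option = base_total + mvtx_option

print(f"ITS: {its_boards}, base order: {base_total}, with MVTX option: {with_option}")
```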







- The contractor shall inspect all parts individually according to the applicable standards (see backup slides). All component defects and assembly errors shall be eliminated.
- The contractor shall test 10% of the boards manufactured for ionic residues.
- The dimensions of the mechanical parts shall be checked to ensure conformity. Items which are outside tolerance for straightness, flatness, position of holes or other reasons shall be rejected.
- The contractor shall have an approved and formal process designed to monitor and record each phase of the manufacturing of the supply, such that complete conformity with the requirements of this specification is achieved.
- All specified tests and measurements carried out during all stages of production, from material procurement up to delivery shall be recorded. The contractor shall provide these records in electronic form (Microsoft Word, Excel, Project or OpenDocument and in PDF).



# Summary



- ITS Readout Unit satisfies the requirements of the MVTX Readout
- ITS Readout Unit prototyping is successfully completed for electrical, functional and radiation performance
- A Manufacturing Plan that includes RU production and assembly for MVTX is completed
- Tender document will be posted after initial RUv2 prototype testing
- A testing plan is in place and test components are being designed and fabricated
- UT Austin will take responsibility for organizing the board manufacture and testing with ITS
- Board testing after delivery will be performed by UT Austin in combination with LANL