MacroPlacement

MacroPlacement is an open, transparent effort to provide a public, baseline implementation of Google Brain’s Circuit Training (Morpheus) deep RL-based placement method. We will provide (1) testcases in open enablements, along with multiple EDA tool flows; (2) implementations of missing or binarized elements of Circuit Training; (3) reproducible example macro placement solutions produced by our implementation; and (4) post-routing results obtained by full completion of the synthesis-place-and-route flow using both proprietary and open-source tools.

Materials for the Broad Audience

Table of Contents

FAQs

1. Why are you doing this?

2. What can others contribute?

3. What is your timeline?

New FAQs after the release of our ISPD-2023 paper (here and on arXiv)

4. How was the UCSD replication of CT validated?

We obtained two separate confirmations from Google engineers that our running of CT was correct. These were received on August 10, 2022 and October 3, 2022.

The above-mentioned matches between our CT runs and Google engineers’ CT runs provided confirmation as of last Fall that our environment is correct. All of our code has been open-sourced and unchanged since mid-January 2023. There have been no suggestions that it is incorrect in any way.

5. Was Circuit Training intended by Google to provide the code that was used in the Nature paper?

Google has stated this on a number of occasions. Of course, a key motivation for our MacroPlacement work has been that code to reproduce Nature has been only partially open-sourced in Circuit Training, and that the data used in Nature has not yet been made public.

6. Did you use pre-trained models? How much does pre-training matter?

We did not use pre-trained models in our study. Note that it is impossible to replicate the pre-training described in the Nature paper, for two reasons: (1) the data set used for pre-training consists of 20 TPU blocks which are not open-sourced, and (2) the code for pre-training is not released either.

7. What are the runtimes (wall times) of different macro placers that you studied?

8. In your experiments how do the results of Simulated Annealing (SA) and Reinforcement Learning (i.e., Circuit Training) compare?

Testcases Proxy cost Wirelength (WL)
ICCAD04 (IBM) SA wins over CT 17/17 SA wins over CT 16/17 (HPWL)
Modern IC designs SA wins over CT 4/6 SA wins over CT 5/6 (routed WL)

9. Did the work by Prof. David Pan show that Google open-source code was sufficient?

10. Which conclusions did you confirm from the Nature paper and from Stronger Baselines?

11. Did it matter that Circuit Training used an initial placement from a physical synthesis tool?

Yes. Circuit Training benefits substantially from its use of the placement locations that it obtains from physical synthesis.

12. Are the benchmarks (testcases) that you use adequate to test modern macro placement techniques?

We believe so. We developed new, modern testcases that are mapped to modern, open technologies with full routing and timing information. The table below summarizes the numbers of flip-flops, macros, distinct macro sizes, and standard-cell instances in these testcases.

BlackParrot and MemPool Group are larger and have multiple sizes of macros. They are significantly more challenging than the Ariane testcase used by Google, as confirmed by a shuffling experiment described in Section 5.2.6 of our paper.

We also use the ICCAD04 academic benchmarks studied by Stronger Baselines; these are heavily used and well-known in the academic literature. All the ICCAD04 and modern benchmarks are fully available for download. We welcome additional testcases that target criteria not covered by our existing testcases.

13. Are the resources used to run Circuit Training good enough to reproduce the Nature result?

We believe the answer is Yes. We refer to the ISPD-2022 paper by Google authors S. Yu, E. Songhori, W. Jiang, T. Boyd, A. Goldie, A. Mirhoseini and S. Guadarrama, “Scalability and Generalization of Circuit Training for Chip Floorplanning”.

14. The ISPD-2023 paper includes results from Cadence’s Concurrent Macro Placer (in Innovus 21.1). What is the reasoning behind your use of CMP 21.1, which was not available to Google engineers when they wrote the Nature paper?

We used Innovus version 21.1 since it was the latest version of our place-and-route evaluator of macro placement solutions. CMP 21.1 is part of Innovus 21.1.

15. What are the outcomes of CT when the training is continued until convergence?

To put this question in perspective, training “until convergence” is not described in any of the guidelines provided by the CT GitHub repo for reproducing the results in the Nature paper. For the ISPD 2023 paper, we adhere to the guidelines given in the CT GitHub repo, use the same number of iterations for Ariane as Google engineers demonstrate in the CT GitHub repo, and obtain results that closely align with Google’s outcomes for Ariane. (See FAQs #4 and #13.)

CT code does not guarantee convergence. This said, we have run CT training for an extended number (= 600, which is three times our default value of 200) of iterations, for each of Ariane, BlackParrot and MemPool Group, on NG45. For MemPool Group, CT diverges (tensorboard link).

When convergence can be attained, the impact on key chip metrics is mixed. For instance, for Ariane, the chip metrics remain similar. In the case of BlackParrot, the routed wirelength significantly improves, but the TNS and WNS degrade. For Ariane and BlackParrot, the proxy cost improves significantly, but does not correlate with timing metrics. For more details, see here.

In sum, training until convergence worsens some key chip metrics while improving others, highlighting the poor correlation between proxy cost and chip metrics. Overall, training until convergence does not qualitatively change comparisons to results of Simulated Annealing and human macro placements reported in the ISPD 2023 paper.

Note: We have not studied what happens if SA is given triple the runtime used in our reported experiments.

16. The ISPD-2023 paper (Section 5.2.1, and Slide 17 of the ISPD-2023 presentation) concludes that CT benefits significantly from its use of initial placement. What is the reasoning behind giving CT “impossible” initial placements, where all instances are placed at the same location?

CT requires (x,y) locations – i.e., a placement – to run its grouping flow. Section 5.2.1 of our ISPD-2023 paper discusses the advantage that CT derives from its use of initial placement information from a commercial EDA tool. To measure this advantage, we study what happens when CT is deprived of this placement information.

Note: To be clear, in our ISPD-2023 paper, all CT runs are given the benefit of an initial placement generated by CMP + Genus iSpatial flow. In Section 5.2.1 of the paper, vacuous (referred to as “impossible” in recent comments) placements are used solely to study the effect of the commercial initial placement on CT outcomes.

Testcases

The list of available testcases is as follows.

In the Nature Paper, the authors report results for an Ariane design with 133 memory (256x16, single ported SRAM) macros. We observe that synthesizing from the available Ariane RTL in the lowRISC GitHub repository using 256x16 memories results in an Ariane design that has 136 memory macros. We outline the steps to instantiate the memories for Ariane 136 here and we show how we convert the Ariane 136 design to an Ariane 133 design that matches Google’s memory macros count here.

We provide flop count, macro type and macro count for all the testcases in the following table.

Testcase Flop Count Macro Details (macro type x macro count)
Ariane136 19839 (256x16-bit SRAM) x 136
Ariane133 19807 (256x16-bit SRAM) x 133
MemPool tile 18278 (256x32-bit SRAM) x 16 + (64x64-bit SRAM) x 4
MemPool group 360724 (256x32-bit SRAM) x 256 + (64x64-bit SRAM) x 64 + (128x256-bit SRAM) x 2 + (128x32-bit SRAM) x 2
NVDLA 45295 (256x64-bit SRAM) x 128
BlackParrot 214441 (512x64-bit SRAM) x 128 + (64x62-bit SRAM) x 32 + (32x32-bit SRAM) x 32 + (64x124-bit SRAM) x 16 + (128x16-bit SRAM) x 8 + (256x48-bit SRAM) x 4

All the testcases are available in the Testcases directory. Details of the sub-directories are

Enablements

The list of available enablements is as follows.

Open-source enablements NanGate45, ASAP7 and SKY130HD are utilized in our SP&R flow. All the enablements are available under the Enablements directory. Details of the sub-directories are:

We also provide the steps to generate the fakeram models for each of the enablements based on the required memory configurations.

Flows

We provide multiple flows for each of the testcases and enablements. They are: (1) a logical synthesis-based SP&R flow using Cadence Genus and Innovus (Flow-1), (2) a physical synthesis-based SP&R flow using Cadence Genus iSpatial and Innovus (Flow-2), (3) a logical synthesis-based SP&R flow using Yosys and OpenROAD (Flow-3), and (4) creation of input data for Physical synthesis-based Circuit Training using Genus iSpatial (Flow-4).

The details of each flow are given in the following.

In the following table, we provide the status details of each testcase on each of the enablements for the different flows.

Test Cases Nangate45 ASAP7 SKY130HD FakeStack
Flow-1 Flow-2 Flow-3 Flow-4 Flow-1 Flow-2 Flow-3 Flow-4 Flow-1 Flow-2 Flow-3 Flow-4
Ariane 136 Link Link Link Link Link Link N/A Link Link Link Link Link
Ariane 133 Link Link Link Link Link Link N/A Link Link Link Link Link
MemPool tile Link Link Link Link Link Link N/A Link Link Link Link Link
MemPool group Link Link N/A Link N/A N/A N/A N/A N/A N/A N/A N/A
NVDLA Link Link N/A Link Link Link N/A Link Link Link N/A Link
BlackParrot Link Link N/A Link N/A N/A N/A N/A N/A N/A N/A N/A

The directory structure is : ./Flows/<enablement>/<testcase>/<constraint|def|netlist|scripts|run>/. Details of the sub-directories for each testcase on each enablement are as follows.

Code Elements

The code elements below are the most crucial undocumented portions of Circuit Training. We thank Google engineers for Q&A in a shared document, as well as live discussions on May 19, 2022, that have explained aspects of several of the following code elements used in Circuit Training. All errors of understanding and implementation are the authors’. We will rectify such errors as soon as possible after being made aware of them.

A Human Baseline for Circuit Training

We provide a human-generated baseline for Google Brain’s Circuit Training by placing macros manually following similar (grid-restricted location) rules as the RL agent. The example for Ariane133 implemented on NanGate45 is shown here. We generate the manual macro placement in two steps:
(1) we call the gridding scripts to generate grid cells (27 x 27 in our case); (2) we manually place macros on the centers of grid cells.