Metering by counting cycles

# Metering by counting cycles The current Ethereum metering model is "cost and sum", which is nice because it is simple. Unfortunately, it does not capture hardware acceleration on common hardware, which includes operations on registers, register renaming, multi-issue, wide pipelines, out-of-order execution, superscalar with hardware queues, branch prediction, memory hierarchy tricks, transaction lookaside buffer, vectorization, multi-core, simultaneous multithreading, and others. As a result, cost and sum may be over-conservative, limiting throughput. We hope that metering which captures common hardware acceleration will give much better approximations of execution time, and better approximations will allow more throughput. We propose extending the Ewasm specificaiton to allow a special constrained form of Wasm which resembles a universal register assembly language. This language can be assembled to native register assembly of major hardware architectures such as `x86_64`, `aarch64`, and `riscv64`. A programmer concerned with execution speed can hand-write critical parts of code in this universal register assembly, just as is done with native assembly, and non-critical parts can remain in regular Wasm. It is a monumental task to specify such a universal register assembly language -- some instruction set manuals are thousands of pages long! We start with [prototypes](https://github.com/poemm/assembles) of some bigint and hashing, from which we took the below samples of `mul256` in universal forms of `x86_64`, `aarch64`, `riscv64`, and `wasm32` assembly. We propose metering this universal register assembly by counting cycles on an open hardware description with reasonably competitive instructins/cycle. We don't need the actual hardware, just [a cycle-accurate simulator of hardware](https://en.wikipedia.org/wiki/List_of_HDL_simulators) which is based on our hardware description. We execute code and count cycles. As a first step for Ethereum adoption, we propose metering opcodes like `mul256`, establishing a toolchain for such metering. With lessons learned, we will evaluate allowing users to deploy this universal register assembly for their domain-specific hand-optimizations -- i.e. user-deployed pre-compiles, or perhaps "pre-assembles". # Code samples Below are "equivalent" code samples for `mul256x256->256`. From `mul256x256_256.wasm32.wat`. ``` ;; a0*b0 (local.set 5 (local.get 6)) ;; bring a to workspace (i32.wrap/i64 (local.get 5)) (i64.extend_i32_u) (local.set 5) ;; get lower 32-bits (local.set 4 (local.get 10)) ;; bring b to workspace (i32.wrap/i64 (local.get 4)) (i64.extend_i32_u) (local.set 4) ;; get lower 32-bits (i64.mul (local.get 4) (local.get 5)) (local.set 4) ;; multiply (local.set 3 (local.get 4)) ;; remove dependancy from local 4 (i64.add (local.get 3) (local.get 1)) (local.set 1) ;; add to accumulator t1 (i64.gt_u (local.get 3) (local.get 1)) (local.set 3 (i64.extend_i32_u)) ;; compare l1 and t3 (i64.add (local.get 2) (local.get 3)) (local.set 2) ;; add to overflow, even if add zero ``` From `mul256x256_256.x86_64.s`. ``` # a0*b0 movq %r8, %rcx # bring a to workspace movl %ecx, %ecx # get lower 32-bits movq %r12, %rdx # bring b to workspace movl %edx, %edx # get lower 32-bits imulq %rcx, %rdx # multiply movq %rdx, %rbx # remove dependancy from rdx addq %rbx, %rax # add to accumulator rax cmpq %rax, %rbx # compare rax and rbx seta %bl # set register bl to 1 or 0 based on cmpq movzx %bl, %rbx # extend bl to rbx addq %rbx, %rsi # add to overflow, even if add zero ``` From `mul256x256_256.aarch64.s`. ``` // a0*b0 mov x4, x6 // bring a to workspace mov w4, w4 // get lower 32-bits mov x5, x10 // bring b to workspace mov w5, w5 // get lower 32-bits mul x4, x4, x5 // multiply mov x3, x4 // remove dependancy from x4 add x1, x1, x3 // add to accumulator x1 cmp x3, x1 // compare x1 and x3 cset x3, hi // set register x1 to 1 or 0 based on cmp add x2, x2, x3 // add to overflow, even if add zero ``` From `mul256x256_256.riscv64.s`. ``` # a0*b0 mv t5, a0 # bring a to workspace and t5, t5, t6 # get lower 32-bits mv t4, a4 # bring b to workspace and t4, t4, t6 # get lower 32-bits mul t4, t4, t5 # multiply mv t3, t4 # remove dependancy from t4 add t1, t1, t3 # add to accumulator t1 sltu t3, t1, t3 # compare l1 and t3 add t2, t2, t3 # add to overflow, even if add zero ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.