owned this note
owned this note
Published
Linked with GitHub
# Metering by counting cycles
The current Ethereum metering model is "cost and sum", which is nice because it is simple. Unfortunately, it does not capture hardware acceleration on common hardware, which includes operations on registers, register renaming, multi-issue, wide pipelines, out-of-order execution, superscalar with hardware queues, branch prediction, memory hierarchy tricks, transaction lookaside buffer, vectorization, multi-core, simultaneous multithreading, and others. As a result, cost and sum may be over-conservative, limiting throughput.
We hope that metering which captures common hardware acceleration will give much better approximations of execution time, and better approximations will allow more throughput. We propose extending the Ewasm specificaiton to allow a special constrained form of Wasm which resembles a universal register assembly language. This language can be assembled to native register assembly of major hardware architectures such as `x86_64`, `aarch64`, and `riscv64`. A programmer concerned with execution speed can hand-write critical parts of code in this universal register assembly, just as is done with native assembly, and non-critical parts can remain in regular Wasm.
It is a monumental task to specify such a universal register assembly language -- some instruction set manuals are thousands of pages long! We start with [prototypes](https://github.com/poemm/assembles) of some bigint and hashing, from which we took the below samples of `mul256` in universal forms of `x86_64`, `aarch64`, `riscv64`, and `wasm32` assembly.
We propose metering this universal register assembly by counting cycles on an open hardware description with reasonably competitive instructins/cycle. We don't need the actual hardware, just [a cycle-accurate simulator of hardware](https://en.wikipedia.org/wiki/List_of_HDL_simulators) which is based on our hardware description. We execute code and count cycles.
As a first step for Ethereum adoption, we propose metering opcodes like `mul256`, establishing a toolchain for such metering. With lessons learned, we will evaluate allowing users to deploy this universal register assembly for their domain-specific hand-optimizations -- i.e. user-deployed pre-compiles, or perhaps "pre-assembles".
# Code samples
Below are "equivalent" code samples for `mul256x256->256`.
From `mul256x256_256.wasm32.wat`.
```
;; a0*b0
(local.set 5 (local.get 6)) ;; bring a to workspace
(i32.wrap/i64 (local.get 5)) (i64.extend_i32_u) (local.set 5) ;; get lower 32-bits
(local.set 4 (local.get 10)) ;; bring b to workspace
(i32.wrap/i64 (local.get 4)) (i64.extend_i32_u) (local.set 4) ;; get lower 32-bits
(i64.mul (local.get 4) (local.get 5)) (local.set 4) ;; multiply
(local.set 3 (local.get 4)) ;; remove dependancy from local 4
(i64.add (local.get 3) (local.get 1)) (local.set 1) ;; add to accumulator t1
(i64.gt_u (local.get 3) (local.get 1)) (local.set 3 (i64.extend_i32_u)) ;; compare l1 and t3
(i64.add (local.get 2) (local.get 3)) (local.set 2) ;; add to overflow, even if add zero
```
From `mul256x256_256.x86_64.s`.
```
# a0*b0
movq %r8, %rcx # bring a to workspace
movl %ecx, %ecx # get lower 32-bits
movq %r12, %rdx # bring b to workspace
movl %edx, %edx # get lower 32-bits
imulq %rcx, %rdx # multiply
movq %rdx, %rbx # remove dependancy from rdx
addq %rbx, %rax # add to accumulator rax
cmpq %rax, %rbx # compare rax and rbx
seta %bl # set register bl to 1 or 0 based on cmpq
movzx %bl, %rbx # extend bl to rbx
addq %rbx, %rsi # add to overflow, even if add zero
```
From `mul256x256_256.aarch64.s`.
```
// a0*b0
mov x4, x6 // bring a to workspace
mov w4, w4 // get lower 32-bits
mov x5, x10 // bring b to workspace
mov w5, w5 // get lower 32-bits
mul x4, x4, x5 // multiply
mov x3, x4 // remove dependancy from x4
add x1, x1, x3 // add to accumulator x1
cmp x3, x1 // compare x1 and x3
cset x3, hi // set register x1 to 1 or 0 based on cmp
add x2, x2, x3 // add to overflow, even if add zero
```
From `mul256x256_256.riscv64.s`.
```
# a0*b0
mv t5, a0 # bring a to workspace
and t5, t5, t6 # get lower 32-bits
mv t4, a4 # bring b to workspace
and t4, t4, t6 # get lower 32-bits
mul t4, t4, t5 # multiply
mv t3, t4 # remove dependancy from t4
add t1, t1, t3 # add to accumulator t1
sltu t3, t1, t3 # compare l1 and t3
add t2, t2, t3 # add to overflow, even if add zero
```