The last two arguments (fixup responses for finite values) are
neg-pos, not pos-neg. Found this out while reusing this function for
some math work. Thankfully, nothing currently uses this fixup response.
PAGE_SIZE is a kernel symbol and, depending on the libc in use, it will
"leak". In this case dynarmic was using its own PAGE_SIZE, and in
combination with musl libc the compiler would complain that it was
overwriting the kernel symbol.
* We failed to invalidate entries if no patches were required for a location descriptor.
* Bug in A64 hashing code (rbx instead of rbp).
* Bug in A32 and A64 lookup code (inconsistent choice of key: PC vs IR::LocationDescriptor).
* Test case added.
`MConst` is refactored into `XmmConst` to clearly communicate the
addressable space of the newly allocated 16-byte memory constant.
`GetVectorOf` is elevated into a globally available `XmmBConst` function
that "broadcasts" bits of the input value into n-bit elements spanning
the width of the Xmm constant.
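As a rough sketch of the broadcast computation (illustrative only; the real helper also places the result in the 16-byte constant pool, and this name is not the actual signature):

```cpp
#include <cstddef>
#include <cstdint>

// Replicate an esize-bit pattern across a 64-bit half; the Xmm constant is
// then simply two copies of this value. esize is assumed to be 8/16/32/64.
constexpr std::uint64_t BroadcastToU64(std::size_t esize, std::uint64_t value) {
    std::uint64_t result = value;
    for (std::size_t width = esize; width < 64; width *= 2) {
        result |= result << width;
    }
    return result;
}

// e.g. the f32 sign mask: every 32-bit lane becomes 0x80000000.
static_assert(BroadcastToU64(32, 0x80000000) == 0x8000000080000000);
```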
`emit_x64_floating_point` will utilize the same 16-byte broadcast
constants to encourage more cache hits within the constant pool between
vector and non-vector code.
`vfpclassp* k, xmm, i8` has better latency (4 -> 3) and executes on a
better port (ports 0/1 -> port 5), out of the way of the ALU ports,
compared to `vcmpunordp* xmm, xmm, xmm` (`vcmpp* xmm, xmm, xmm, i8`),
and removes the pipeline dependency on `xmm0` in favor of AVX512
`k`-mask registers.
`vblendmp* xmm, k, xmm, mem` has about the same throughput and latency
as `blendvp* xmm, mem`, but has the benefit of embedded broadcasts,
which reduce memory bandwidth (a 32/64-bit read rather than a 128-bit
one) and lend themselves to a future size-optimization feature of
`constant_pool`.
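A hedged sketch of the resulting pattern in xbyak syntax (register choices and constant layout are illustrative, not dynarmic's actual emitter):

```cpp
#include <xbyak/xbyak.h>

struct NanFixupSketch : Xbyak::CodeGenerator {
    NanFixupSketch() {
        Xbyak::Label default_nan;
        // Flag SNaN/QNaN lanes into k1; runs on port 5, away from the ALU ports.
        vfpclassps(k1, xmm0, 0b1000'0001);
        // Blend the default NaN into flagged lanes only, via a 32-bit
        // embedded-broadcast load instead of a full 128-bit constant read.
        vblendmps(xmm0 | k1, xmm0, ptr_b[rip + default_nan]);
        ret();
        L(default_nan);
        dd(0x7FC00000);  // canonical quiet NaN bit pattern
    }
};
```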
Both single- and double-precision floating point numbers, as well as
the packed and scalar versions of this instruction, are able to use the
same memory constant. This takes advantage of the fact that `VFIXUPIMM*`
doesn't just copy from the source: it will convert to `0.0` if the value
turns out to be a denormal and the `MXCSR.DAZ` flag is set.
```
tsrc[31:0] ← ((src1[30:23] = 0) AND (MXCSR.DAZ = 1)) ? 0.0 : src1[31:0]
...
CASE(token_response[3:0]) {
    ...
    0001: dest[31:0] ← tsrc[31:0]    ; pass through src1 normal input value, denormal as zero
    ...
```
There is an important subtlety that should be documented here: all
`FpFixup` responses that read from the `Src` register actually perform
a `DAZ` operation if `MXCSR.DAZ` is set.
Intended to be used by library users wishing to implement accurate memory watchpoints.
* A32: optionally make memory instructions the end of basic blocks
* A64: optionally make memory instructions the end of basic blocks
* Make memory halt checking user-configurable
* oops
AVX512 adds an additional **16** SIMD registers, for a total of 32 SIMD
registers, accessible by utilizing EVEX-encoded instructions. Rather
than using the `ScratchXmm` function, which adds register pressure and
spilling, AVX512-enabled contexts can directly use the `xmm{16-31}`
registers as intermediate scratch registers.
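A minimal sketch of the idea in xbyak syntax (not the actual emitter code):

```cpp
#include <xbyak/xbyak.h>

struct ScratchSketch : Xbyak::CodeGenerator {
    ScratchSketch() {
        // xmm16 is EVEX-only, so the register allocator never hands it out;
        // it can serve as scratch with no ScratchXmm() call and no spilling.
        vmovaps(xmm16, xmm0);       // stash the operand
        vaddps(xmm0, xmm16, xmm1);  // use it as a third source
        ret();
    }
};
```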
* Provide a reason for halting and update it atomically (see the sketch below).
* Allow user to specify a halt reason and return this information on halt.
* Check if halt was requested prior to starting execution.
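A hedged sketch of the mechanism (names and flag values are assumptions, not dynarmic's public API):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> halt_reason{0};

// Reasons are bit flags OR'd in atomically, so concurrent requesters
// cannot lose each other's reason.
void RequestHalt(std::uint32_t reason) {
    halt_reason.fetch_or(reason, std::memory_order_relaxed);
}

std::uint32_t Run() {
    // Check whether a halt was requested prior to starting execution.
    if (const auto reason = halt_reason.exchange(0); reason != 0) {
        return reason;
    }
    // ... execute JITed code until halt_reason becomes nonzero ...
    return halt_reason.exchange(0);
}
```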
`map` is an ordered structure with O(log n) searches.
`unordered_map` has O(1) average-time searches and O(n) in the worst
case, where a bucket holds colliding hashes and has to start chaining.
The unordered version should speed up our general case when looking up
constants.
I've added a trivial order-dependent (_(0,1) and (1,0) will return
different hashes_) hash that combines a 128-bit constant into a
64-bit hash which generally will not collide, using a bit-rotate to
preserve entropy.
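A sketch of such a combine (the exact rotate amount and operand order here are assumptions):

```cpp
#include <bit>
#include <cstdint>

// Order-dependent: CombineHash(0, 1) != CombineHash(1, 0), because only one
// half is rotated before the XOR; a rotate preserves all 64 bits of entropy
// where a shift would discard some.
constexpr std::uint64_t CombineHash(std::uint64_t lower, std::uint64_t upper) {
    return upper ^ std::rotl(lower, 1);
}

static_assert(CombineHash(0, 1) != CombineHash(1, 0));
```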
In MSVC, having files with identical filenames results in massive slowdowns when compiling.
The approach I have taken to resolve this is renaming the identically named files in frontend/(A32, A64) to (a32, a64)_filename.cpp/h.
This makes dynarmic installable and also adds a CMake package config
file that allows projects to use `find_package(dynarmic)` to import the
library.
I know #636 adds the same thing, but while experimenting with the
different install options in
https://github.com/merryhime/dynarmic/pull/636#discussion_r725656034
I ended up with a working patch, so I'm proposing this as well. This
implements solution 2.
This adds versioning information to the built library.
When building the shared library on Linux systems, a new object will
be created: libdynarmic.so.5
This is really useful when reasoning about ABI compatibility.
The variables `dynarmic_VERSION` and `dynarmic_VERSION_MAJOR`
are implicitly created when calling `project(dynarmic VERSION x.y.z)`.
Adds all elements of a vector and puts the result into the lowest element.
Accelerates the `addv` instruction with a vectorized implementation
rather than a serial one.
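A plain-C++ model of the reduction (illustrative; the actual change emits SIMD shuffles and adds):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Fold the upper half onto the lower half log2(N) times instead of adding
// the N lanes serially. N is assumed to be a power of two.
template <std::size_t N>
constexpr std::uint32_t AddV(std::array<std::uint32_t, N> lanes) {
    for (std::size_t half = N / 2; half >= 1; half /= 2) {
        for (std::size_t i = 0; i < half; ++i) {
            lanes[i] += lanes[i + half];  // maps to e.g. a shuffle + vpaddd
        }
    }
    return lanes[0];  // the result lands in the lowest element
}

static_assert(AddV<4>({1, 2, 3, 4}) == 10);
```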
The lane-splatting variants of `FMUL` and `FMLA` are very
common in instruction streams when implementing things like
matrix multiplication. When they appear, they appear very densely.
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-3-matrix-multiplication
The way this is currently implemented is by grabbing the particular lane
into a general purpose register and then broadcasting it into a simd
register through `VectorGetElement` and `VectorBroadcast`.
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcast(esize, v.ir.VectorGetElement(esize, v.V(idxdsize, Vm), index));
```
What could be done instead is to keep it within
the vector-register and use a permute/shuffle to "splat" the particular
lane across all other lanes, removing the GPR-round-trip.
This is implemented as the new IR instruction `VectorBroadcastElement`:
```cpp
const IR::U128 operand2 = v.ir.VectorBroadcastElement(esize, v.V(idxdsize, Vm), index);
```
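For 32-bit lanes on x64, for example, this can lower to a single in-register shuffle (a hedged sketch in xbyak syntax, not the actual emitter):

```cpp
#include <xbyak/xbyak.h>

struct SplatSketch : Xbyak::CodeGenerator {
    explicit SplatSketch(int index) {
        // 0x55 * index repeats the 2-bit lane selector four times, so every
        // destination lane reads source lane `index` -- no GPR round-trip.
        pshufd(xmm0, xmm1, 0x55 * index);
        ret();
    }
};
```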
Recursive calls to `Replicate` beyond the first call might
cause an unintentional up-cast to an `int` type due
to `|` and `<<` operations on types such as `uint8_t` and `uint16_t`.
This makes sure calls such as `Replicate<u8>` stay as the `u8` type
throughout.
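A minimal sketch of the fix (the helper's exact signature is assumed):

```cpp
#include <cstddef>
#include <cstdint>

// Without the static_cast, `(value << element_size) | value` on a
// std::uint8_t is promoted to int, and the recursion silently widens.
template <typename T>
constexpr T Replicate(T value, std::size_t element_size) {
    if (element_size >= sizeof(T) * 8) {
        return value;
    }
    const T doubled = static_cast<T>((value << element_size) | value);
    return Replicate(doubled, element_size * 2);
}

static_assert(Replicate<std::uint8_t>(0b01, 2) == 0b01010101);
```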
xbyak is intended to be installed in /usr/local/include/xbyak.
Since we don't want to have to install xbyak before using it, we copy
the headers into the appropriate directory structure and use that instead.
AVX512 introduces _unsigned_ variants of the float-to-integer conversion
functions via `vcvttp{s,d}2u{dq,qq}`. In the case that a value is not
representable as an unsigned integer, the result is `0xFFFF...`,
which can be utilized to get "free" saturation when the floating point
value exceeds the unsigned range, after masking away negative values.
https://www.felixcloutier.com/x86/vcvttps2udq
https://www.felixcloutier.com/x86/vcvttpd2uqq
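A scalar model of the resulting saturation semantics for one f32 lane (illustrative; the real code operates on whole vectors and masks negative lanes):

```cpp
#include <cstdint>

// Out-of-range inputs make vcvttps2udq produce 0xFFFFFFFF (the unsigned
// "integer indefinite"), which is exactly saturation at the top end;
// masking negative (and NaN) lanes to zero completes the clamp.
std::uint32_t SaturatingF32ToU32(float x) {
    if (!(x > 0.0f)) {
        return 0;  // negative, zero, or NaN: masked to zero
    }
    if (x >= 4294967296.0f) {  // 2^32
        return 0xFFFFFFFFu;
    }
    return static_cast<std::uint32_t>(x);
}
```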
This PR also speeds up the _signed_ conversion function for fp64->int64
https://www.felixcloutier.com/x86/vcvttpd2qq