SHAKTI-MS: a RISC-V processor for memory safety in C

In this era of IoT devices, security is very often traded off for smaller device footprint and low power consumption. Considering the exponentially growing security threats of IoT and cyber-physical systems, it is important that these devices have built-in features that enhance security. In this paper, we present Shakti-MS, a lightweight RISC-V processor with built-in support for both temporal and spatial memory protection. At run time, Shakti-MS can detect and stymie memory misuse in C and C++ programs, with minimum runtime overheads. The solution uses a novel implementation of fat-pointers to efficiently detect misuse of pointers at runtime. Our proposal is to use stack-based cookies for crafting fat-pointers instead of having object-based identifiers. We store the fat-pointer on the stack, which eliminates the use of shadow memory space, or any table to store the pointer metadata. This reduces the storage overheads by a great extent. The cookie also helps to preserve control flow of the program by ensuring that the return address never gets modified by vulnerabilities like buffer overflows. Shakti-MS introduces new instructions in the microprocessor hardware, and also a modified compiler that automatically inserts these new instructions to enable memory protection. This co-design approach is intended to reduce runtime and area overheads, and also provides an end-to-end solution. The hardware has an area overhead of 700 LUTs on a Xilinx Virtex Ultrascale FPGA and 4100 cells on an open 55nm technology node. The clock frequency of the processor is not affected by the security extensions, while there is a marginal increase in the code size by 11% with an average runtime overhead of 13%.


Introduction
With the advent of IoT, there has been a rapid increase in the use of low-power embedded devices. These devices are deployed in wide and diverse applications that are connected to the Internet. While these devices are becoming more pervasive, large scale attacks involving compromised embedded devices such as the Mirai botnet [21] are becoming commonplace. In the absence of robust secure environments, vulnerabilities introduced in these devices due to programming flaws can allow attackers to take control of systems with ease.
Several of these vulnerabilities arise from illegal memory accesses. Today, these memory access vulnerabilities rank among the top 25 vulnerabilities in system software [24]. Vulnerabilities like buffer overflows [34], use-after-free (UAF) [36,43], and double-free [16] are some of the major security threats. These vulnerabilities persist because of the predominant use of the C and C++ programming languages, which offer features like explicit pointer manipulation, flexible type casting constructs and ease of interfacing with the hardware. These features make them favorable for the development of operating systems, virtual machine monitors, embedded systems, and database management software. However, these features come with the risk of illegal memory access and have led to many attacks in the past. Rewriting all existing code in memory safe languages is not feasible and hence we are left with the difficult task of retrofitting security into existing systems.
Figure 1.b: Storage of metadata along with the malloc'd region.
There have been many studies relating to spatial and temporal attacks due to illegal memory uses [1–3, 5–8, 25, 26, 28–31, 35], and many have proposed methods to prevent one or both of these attacks. Some of the approaches focus only on software solutions [1–3, 5, 8, 28–31], while others rely on support from the hardware to enforce memory safety [6,7,25,26,35]. Many of the existing software solutions either fail to provide complete temporal and spatial safety or incur high runtime overheads. Pure software solutions like [28] and [29] can be combined to tackle most kinds of spatial and temporal attacks, but this approach leads to high code size and runtime overheads of around 116% [29]. On the other hand, hardware solutions like [23,25] reduce the runtime overhead at the cost of hardware complexity. Although [23] enhances a RISC-V processor to efficiently implement memory checks, the software support required for [23] is extremely complex. Watchdog [25] is a compiler plus hardware solution for memory safety. It uses a shadow memory space to maintain the metadata used for memory checks. This shadow memory can result in considerable memory overheads. For complete spatial and temporal safety, 56% of the system memory would be inaccessible [25]. Gandalf [18] is also a hardware plus software solution, with minimal hardware complexity and compiler modifications. However, it does not provide temporal safety.
In this paper, we introduce Shakti-MS, a RISC-V processor providing both spatial and temporal memory safety. It requires modifications in the compiler to insert certain instructions that enable the hardware to perform the required memory checks at runtime. Further, unlike [25], we are not using any separate shadow memory space and unlike [23], there are no additional tables or tag bits that are required in the processor to store pointer metadata. These features reduce the hardware complexities and storage overhead to a great extent. Another significant benefit of our approach is that the hardware is fully compliant to the RISC-V spec. Any binary compiled with an unmodified compiler toolchain can still run on the modified processor, and can coexist with protected programs. One program itself may have protected and non-protected sections by selectively building static and dynamic libraries with protection enabled. In our approach to prevent spatial and temporal attacks, each derived data type object (pointers, arrays, structures) is associated with a fat-pointer as shown in Figure 1.a. In addition to the memory pointer, the fat-pointer also contains the base, bound and id_hash fields. The base and bound are used to enforce spatial safety, whereas the id_hash field is used to enforce temporal safety. To protect against illegal memory operations in stacks, each stack frame is associated with a cookie to help craft fat-pointers for objects within the current stack frame. The idea of having a stack based cookie instead of an object based identifier helps reduce the storage overheads, as temporal metadata is associated with stack frames rather than individual variables. It also simplifies invalidating pointers to a stack frame when it goes out of context. The cookie also helps to preserve the control flow of the program by ensuring that the return address never gets modified. 
Moreover, the cookie not only helps to prevent temporal attacks in stacks but also serves as the lower bound to prevent any overflows beyond this region.
To provide memory safety in heaps, each allocated region is associated with a unique 64-bit value used to craft fat-pointers, as shown in Figure 1.b. The base address of a pointer referencing this malloc'd region points to this unique random number, which also acts as the cookie for the allocated region. All pointers pointing to the same allocated region use the cookie value to craft their fat-pointers. Moreover, when any one of these fat-pointers is freed, the cookie value is randomised so that all other fat-pointers pointing to the same allocated region become invalid. The storage overhead for the proposed solution is (128 * n + 64) bits. In the case of stacks, n indicates the number of derived data type objects in the current stack frame, and for heaps, it indicates the number of aliased pointers, i.e. pointers pointing to the same allocated region of memory. We introduce two new instructions, namely "hash" and "val", which are added to the RISC-V ISA to support memory safety checks. These instructions are inserted by the compiler at the desired places to ensure that all pointer accesses are validated before being performed. To achieve the compiler modifications, a transformation pass is developed in RISC-V LLVM [19], and the hardware support is developed in Bluespec-System-Verilog [4].
The rest of the paper is organized as follows. Section 2 defines some of the terminologies used in Shakti-MS. It also discusses the key idea that prevents temporal and spatial attacks on stacks and heaps. Section 3 elaborates the architecture and implementation of Shakti-MS. This section also describes the compiler modification along with the details of Micro-architectural implementation of Shakti-MS. Section 4 presents some of the case studies demonstrating how the compiler changes were made, and how these changes help thwart certain attacks. Section 5 reports some of the analysis like runtime overheads and code size overheads of Shakti-MS. Section 6 concludes the paper.

Shakti-MS: The Crux
This section describes the proposed solution and explains how the system is protected against various memory related attacks. Before diving into the solution, we first define the terminologies that will be used subsequently. We will then describe how spatial and temporal attacks are prevented on stacks and heaps.

1. Stack Frame Cookie (SFC) : It is a unique 64-bit random number that is placed on the stack frame below all the variables of the current function, as shown in Figure 2. The SFC is unique for each function call and is used to provide temporal safety for all variables and objects within the current stack frame. Moreover, the SFC is destroyed once the function goes out of scope, ensuring that all pointers into the frame are invalid once the function returns.
2. ROData Cookie (RODC) : Just like the SFC, it is a unique 64-bit random number, but unlike the SFC, it is placed in the .bss region of the program's memory. It is used to protect the read only segment of memory, thereby preventing any kind of over-reads or invalid pointer accesses. The RODC is used to provide temporal safety for both global and static variables.
3. ID_HASH : It is a 32-bit unsigned number computed from the cookie of either a stack frame or a heap region.
4. BASE : It is the 32-bit address of the respective cookie. It is one of the four fields of the fat-pointer and is used for computing hash values and for checking lower bounds.
5. BOUND : It is a 32-bit address indicating the absolute bound of the object. It is the maximum permissible range that the fat-pointer can access.
6. Safe Malloc : It is a wrapper function (similar to the malloc function) that allocates 8 more bytes than the requested size, as shown in Figure 1.b, and returns a fat-pointer corresponding to the allocated region. In these extra 8 bytes we store a unique 64-bit random number (a cookie), which helps us to protect against temporal attacks. This cookie is used to craft fat-pointers for all pointers pointing to the allocated region of memory.
7. Safe Free : It is a wrapper function (similar to the free function) that accepts fat-pointers instead of normal memory pointers as its input. The method safefree first validates the fat-pointer. On successful validation, it converts the fat-pointer into a normal pointer and passes it to the free function, which deallocates the corresponding memory region. This method also randomises the 64-bit value stored along with the allocated region of memory, so that any further reference to that region results in an invalid memory access.
8. Craft : It is a function that is used to craft fat-pointers. It accepts four 32-bit numbers, i.e. base, bound, id_hash, and the pointer itself, and returns a 128-bit object, the fat-pointer. Figure 1.a shows the structure of the fat-pointer returned by the craft function.
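The fat-pointer and the craft function defined above can be pictured with a minimal C sketch. The text fixes only that the fat-pointer holds four 32-bit fields in 128 bits; the field order, the struct name `fat_pointer`, the helper `collapse`, and the sanity check inside `craft` are assumptions made for illustration and may not match the layout in Figure 1.a.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit fat-pointer: four 32-bit fields, as described for craft.
 * The field ORDER is an assumption made for this sketch. */
typedef struct {
    uint32_t base;    /* address of the cookie guarding the object */
    uint32_t bound;   /* absolute upper limit the pointer may reach */
    uint32_t id_hash; /* hash derived from the cookie */
    uint32_t ptr;     /* the ordinary memory pointer */
} fat_pointer;

_Static_assert(sizeof(fat_pointer) == 16, "fat-pointer must occupy 128 bits");

/* Packs the four 32-bit inputs into one 128-bit object. */
fat_pointer craft(uint32_t base, uint32_t bound, uint32_t id_hash, uint32_t ptr) {
    assert(base <= bound);  /* sanity check added for the sketch only */
    fat_pointer fp = { base, bound, id_hash, ptr };
    return fp;
}

/* Hypothetical helper: collapses a fat-pointer back to a plain pointer
 * value, e.g. before handing it to code that was not instrumented. */
uint32_t collapse(fat_pointer fp) {
    return fp.ptr;
}
```

For instance, `craft(0x1000, 0x1028, h, 0x1008)` would describe a pointer at 0x1008 into an object whose cookie sits at 0x1000 and whose bound is 0x1028, with `h` standing for the hash of that cookie.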

Preventing Temporal and Spatial Attacks on Stacks
The stack is often the primary target of spatial attacks because overflowing local variables could potentially allow the return address to be altered, thereby changing the control flow of the program. Temporal attacks on the stack are not as prevalent as spatial attacks, though they are not unheard of [12]. A dangling pointer to the stack is a pointer of one function that is not deleted when the function returns. This pointer could then be used to modify the stack of another function that later uses the same region of memory. Moreover, spatial attacks like buffer overflow have evolved over time and given rise to more sophisticated techniques like return-to-libc [38] or Return Oriented Programming (ROP) [37]. Many of these attacks have mitigations in place, such as making the stack non-executable [40] or adding stack canaries [11] to detect tampering of the return address. One of the most promising and widely used solutions is Address Space Layout Randomization (ASLR) [39]. Although ASLR has proven to be one of the most successful solutions for preventing ROP, it does not address the underlying issue of buffer overflow. There have also been attacks that bypass ASLR [22]. Another, less prevalent solution is the use of fat-pointers, where every pointer is associated with some metadata that is used to prevent various memory corruption attacks. In our proposed solution, we use the concept of fat-pointers but differ in the implementation from other existing fat-pointer solutions.

Preventing Spatial Attacks
To prevent spatial attacks on the stack, each derived data type object is associated with a base and a bound. The bound represents the maximum range that a pointer to the object can access, whereas the base represents the base address of the SFC. Although the base here is not a strict lower bound for the object, it prevents all overflows provided that there are no pointer decrements. Even if there are pointer decrements, the pointer can never move beyond the SFC. Moreover, a slightly looser lower bound is chosen because it allows the same SFC to be used for temporal checks. To understand how spatial attacks are prevented, let us take a look at the example below:

    int x, a[10];
    int *ptr = a;
    x = *(ptr + 5);
    x = *(ptr + 10); // spatial check violation

Figure 2 shows the stack frame with and without the fat-pointers. The shaded region represents the metadata for the stack frame. The SFC is the "Stack Frame Cookie", which is used to protect the current stack frame and to craft fat-pointers for the objects within it. The objects placed below the SFC, namely fpr_a and fpr_ptr, represent the fat-pointers for their respective objects a and ptr. These fat-pointers are placed below the SFC to ensure that the metadata cannot be tampered with through pointer decrements. Moreover, when the pointer is assigned to point to the array on the stack, the metadata of the array is copied to the metadata of the pointer. Also, every load and store instruction on this object is preceded by a validity check to ensure memory safety. So in the above example, the pointer ptr can access (ptr + 5) but fails the check at (ptr + 10).
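The base-and-bound check on the example above can be modelled in software as a minimal sketch. It assumes that the bound is one past the last byte of the object and that an access must fit entirely below it; the names `spatial_meta`, `spatial_ok` and `check_index`, and the local struct that places a cookie below the array, are hypothetical and only mimic the layout of Figure 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t base;   /* address of the stack frame cookie (loose lower bound) */
    uintptr_t bound;  /* one past the last byte of the object (assumed) */
} spatial_meta;

/* True if an access of access_size bytes at addr stays within [base, bound). */
bool spatial_ok(spatial_meta m, uintptr_t addr, size_t access_size) {
    assert(m.base <= m.bound);
    return addr >= m.base && addr <= m.bound && access_size <= m.bound - addr;
}

/* Mirrors the example: returns 1 if reading a[index] passes the check. */
int check_index(size_t index) {
    /* The cookie is placed below the array, as the SFC is in the frame. */
    struct { uint64_t sfc; int a[10]; } frame = { 0, { 0 } };
    spatial_meta m = { (uintptr_t)&frame.sfc, (uintptr_t)(frame.a + 10) };
    uintptr_t addr = (uintptr_t)frame.a + index * sizeof(int);
    return spatial_ok(m, addr, sizeof(int));
}
```

Under these assumptions `check_index(5)` passes and `check_index(10)` fails, matching the two dereferences in the example. In Shakti-MS itself the comparison is performed by the hardware rather than by a function call.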

Preventing Temporal Attacks
Temporal attacks on the stack occur when a pointer to a local variable of a function is not deleted after the function returns, allowing it to overwrite the stack of any other function that later occupies the same region of the stack. Consider the following code snippet, which demonstrates how a temporal attack can occur on stacks and how the proposed solution mitigates such attacks:

    ... = *q; //temporal check violation
    }

As mentioned before, each function has its own unique SFC, which is used to derive the id_hash of each pointer in the current stack frame. Moreover, every pointer to an object on the stack has its own metadata associated with it. Relating this to the above example, there are two functions, each with its own unique SFC. When the global pointer q points to the variable a in function foo, its id_hash is derived from the SFC of function foo. As soon as foo goes out of scope, the SFC of that function is randomised. Therefore, when q is dereferenced in main after returning from foo, the dereference results in a validation error.
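The scenario described above (a global pointer q that still refers to foo's variable a after foo has returned) can be modelled with the following minimal sketch. It is not the original listing. The hash function, the deterministic `random64` stand-in, and the static variable that plays the role of foo's SFC slot are all assumptions, chosen so that the model is reproducible and free of undefined behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Deterministic stand-in for a random source (assumption). The low 32 bits
 * of the returned value change on every call. */
static uint64_t rng_state;
uint64_t random64(void) {
    rng_state += 0x9E3779B97F4A7C15ULL;
    return rng_state;
}

/* Placeholder hash: the hash used by the hardware is not given in the text. */
uint32_t toy_hash(uint64_t cookie) {
    return (uint32_t)((uint32_t)cookie * 2654435761u);
}

typedef struct {
    const uint64_t *base;  /* location of the frame's cookie */
    uint32_t id_hash;      /* hash of the cookie when the pointer was crafted */
} temporal_meta;

/* Temporal check: the stored key must still match the cookie at base. */
bool temporal_ok(temporal_meta m) {
    assert(m.base != NULL);
    return toy_hash(*m.base) == m.id_hash;
}

/* Static slot standing in for foo's SFC, so that reading it after the
 * modelled return stays well-defined in C. */
static uint64_t foo_sfc;
temporal_meta q_meta;   /* metadata of the global pointer q */

/* Models foo: returns whether *q was valid while foo was still active. */
bool foo_model(void) {
    foo_sfc = random64();                /* prologue: place a fresh SFC */
    q_meta.base = &foo_sfc;              /* q = &a: metadata derived from SFC */
    q_meta.id_hash = toy_hash(foo_sfc);
    bool valid_inside = temporal_ok(q_meta);
    foo_sfc = random64();                /* epilogue: randomise the SFC */
    return valid_inside;
}
```

After `foo_model()` returns, `temporal_ok(q_meta)` no longer holds, which corresponds to the validation error raised when q is dereferenced in main.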

Preventing Temporal and Spatial Attacks on Heaps
The heap is a dynamically allocated region of memory that the program typically uses at runtime to store program data. A heap overflow or overrun is a type of buffer overflow that occurs in this heap data area. Exploitation is performed by corrupting the heap data in specific ways to cause programs to overwrite internal structures such as a linked list or malloc metadata. Moreover, overflowing buffers in the heap can also change pointers that point to important data. Attacks like use-after-free [43] and double-free [16] are quite common in heaps and have led to critical system failures.

Preventing Temporal Attacks
There have been many solutions proposed to mitigate temporal attacks [2,3,9,10,15,29,31,32,35,41,42]. They can be classified into two categories, namely "location based" and "identifier based". Location based approaches [15,17,31,41] use an extra data structure, such as a tree or a hashtable/trie-based implementation of shadow memory space, to keep track of the allocated and deallocated memory regions. This approach prevents most dangling pointers but fails to protect against stale pointers that point to reallocated memory, as it uses the object's address to determine whether the pointer is valid. In contrast, the identifier based approach [25,27,29,35,42] uses metadata associated with each pointer, or a lock-and-key mechanism, to prevent the exploitation of dangling pointers.
In Shakti-MS we use a lock-and-key based approach to mitigate all variants of temporal attacks. To protect against dangling pointers, double-free and other temporal attacks, all calls to malloc and free are replaced with safemalloc and safefree, which are wrapper functions that add metadata to heap-allocated objects. malloc and free are the basic function calls in C that provide low level access to memory, so protecting them is of prime importance. The safemalloc function call allocates an extra 8 bytes of memory, stores a unique 64-bit random number in it and returns a 128-bit fat-pointer carrying the pointer metadata. In the metadata, the id_hash field represents the key and the base represents the lock location. Every pointer pointing to the allocated region is transformed into a fat-pointer with its respective id_hash and base. To ensure temporal safety, the following check is performed on dereferencing a pointer:

    abort();//dangling pointers detected

The hash function is introduced because even if the lock location is compromised due to implementation flaws or by any other means, the hash function still remains unknown. This introduces an extra level of difficulty for an attacker trying to craft arbitrary fat-pointers. Furthermore, all subsequent loads and stores on the fat-pointers are prefaced by temporal safety checks. The safefree method randomizes the 64-bit value stored at the start of the allocated region, which further ensures that any dereference of a pointer to that region after it has been deallocated results in a validation error. The other function that might cause a problem with dangling pointers is realloc. To ensure safe handling of reallocations, the saferealloc method replaces all realloc calls in the program. Saferealloc accepts a 128-bit fat-pointer to an object and the reallocation size as parameters. It validates the accepted fat-pointer, allocates a new region in the heap, copies the data over, frees the old region, and returns a valid fat-pointer for the newly allocated region.

Preventing Spatial Attacks
Spatial attacks, as the name suggests, involve accessing regions of memory beyond the legitimate and intended scope of the code. These attacks are often accomplished by overflowing buffers in memory or reading beyond the limits of an object. To ensure spatial safety on heaps, we use fat-pointers to restrict the memory accesses of a pointer to the range between a base and a bound address. As discussed in the previous section, each malloc'd region is now associated with a unique 64-bit random number and every pointer pointing to the malloc'd region has been transformed into a fat-pointer. Every pointer now has its own base and bound associated with it, where the base points to the start of the allocated region and the bound points to the end of the allocated region, i.e. the absolute limit of the memory addresses the pointer can access. All pointer dereferences undergo a base and bound check, ensuring spatial safety. For better clarity on the proposed solution, consider the following sample code:

    1. int *p,*q,*r;
    2. p = malloc(10*sizeof(int));
    3. q = r = p ;
    4. int value = *(r+10); // spatial safety violation
    5. free(p);
    6. ... = *q; //temporal safety violation

The pseudo code equivalent of the above block after the compiler transforms it:

    1. __int128 fpr_p,fpr_q,fpr_r;
    2. fpr_p = safemalloc(10*sizeof(int));
       //safemalloc returns a __int128 object
       //consisting of base,bound,id_hash and pointer
    3. fpr_q = fpr_r = fpr_p ;
    4.1 validate (fpr_r+10); //Spatial safety violation
    4.2 int value = *(fpr_r+10);
    5 safefree(fpr_p);
    6.1 validate fpr_q; //Temporal safety violation
    6.2 ... = *fpr_q;

As shown above in line numbers 2 and 5, the compiler replaces every malloc/free call with the safemalloc and safefree wrapper functions. Moreover, the compiler also inserts validity checks before every pointer dereference, as shown in line numbers 4.1 and 6.1, ensuring both temporal and spatial safety.
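The heap example above can be replayed with a host-side software model, given here as a minimal sketch under stated assumptions. A static arena with a bump allocator stands in for malloc and free, so that reading the cookie of a freed region stays well-defined in the model. The fat-pointer uses native pointers rather than the 32-bit fields of the hardware format. The hash, the deterministic `random64`, and the names `heap_fp`, `safemalloc_model`, `validate` and `safefree_model` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { ARENA_WORDS = 256 };
static uint64_t arena[ARENA_WORDS];  /* toy heap standing in for malloc */
static size_t arena_used;            /* 8-byte words handed out so far */

/* Deterministic stand-in for a random source and a placeholder hash. */
static uint64_t rng_state;
static uint64_t random64(void) { rng_state += 0x9E3779B97F4A7C15ULL; return rng_state; }
static uint32_t toy_hash(uint64_t c) { return (uint32_t)((uint32_t)c * 2654435761u); }

typedef struct {
    uint64_t      *base;    /* lock location: cookie stored before the payload */
    unsigned char *bound;   /* one past the last payload byte (assumed) */
    uint32_t       id_hash; /* key */
    unsigned char *ptr;     /* current pointer value */
} heap_fp;

/* Allocates size bytes plus 8 bytes for the cookie; base is NULL on failure. */
heap_fp safemalloc_model(size_t size) {
    heap_fp fp = { NULL, NULL, 0, NULL };
    assert(arena_used <= ARENA_WORDS);
    if (size > sizeof arena) return fp;
    size_t words = 1 + (size + 7) / 8;
    if (words > ARENA_WORDS - arena_used) return fp;
    uint64_t *cookie = &arena[arena_used];
    arena_used += words;
    *cookie = random64();
    fp.base = cookie;
    fp.ptr = (unsigned char *)(cookie + 1);
    fp.bound = fp.ptr + size;
    fp.id_hash = toy_hash(*cookie);
    return fp;
}

/* Temporal check (key matches lock) followed by the base and bound check. */
bool validate(heap_fp fp, size_t access_size) {
    if (fp.base == NULL) return false;
    if (toy_hash(*fp.base) != fp.id_hash) return false;
    uintptr_t lo = (uintptr_t)fp.base, hi = (uintptr_t)fp.bound;
    uintptr_t p = (uintptr_t)fp.ptr;
    return p >= lo && p <= hi && access_size <= hi - p;
}

/* Randomises the cookie so that every alias of the region becomes invalid. */
bool safefree_model(heap_fp fp) {
    if (!validate(fp, 0)) return false;  /* dangling pointer or double free */
    *fp.base = random64();
    return true;
}
```

In this model, a copy of the fat-pointer advanced by `10 * sizeof(int)` fails `validate` (line 4.1 of the pseudo code), and once `safefree_model` has run, every alias of the region fails it as well (line 6.1), as does a second call to `safefree_model`.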

Architecture and Implementation of Shakti-MS
In the previous section, we looked at mechanisms to prevent spatial and temporal attacks on the stack and the heap. In this section, we look deeper into the compiler and hardware instrumentation aspects of Shakti-MS.
In order to provide security guarantees, the given code might need to be instrumented. This can be done at the binary, compiler or source code level. Additionally, new hardware instructions can be added to accelerate memory safety checks. Binary level instrumentation modifies the code after compilation. This approach, however, limits the flexibility of code instrumentation. For example, one cannot insert a new instruction without affecting the branch instructions that work on relative addresses. Source code transformation, on the other hand, does not provide enough information to apply the transformations. For example, the stack organisation is not visible at the source code level.
However, compiler based instrumentation does not have any of the above mentioned drawbacks. Therefore, in our solution, we use compiler based transformations to achieve code instrumentation. Also, new hardware instructions are added to have minimal performance overheads while performing these security checks. The details of the implementation are given below.

ISA Extensions
This section describes the details of the two new instructions that are added to the RISC-V ISA in order to support memory safety.
Here, Check 1 and Check 2 ensure temporal safety by verifying that the id_hash stored along with the fat-pointer is equal to the hash computed from the value stored in the memory location pointed to by base. Check 3 ensures spatial safety by verifying that every pointer access is within the base and bound. This prevents all manifestations of spatial and temporal memory attacks.

Compiler Based Instrumentation
The compiler based instrumentation needed in Shakti-MS is implemented using the RISC-V LLVM [19] compiler infrastructure. The LLVM toolchain converts the C-code to an intermediate representation (IR), runs certain passes on the intermediate representation (for instrumentation or optimization) and finally compiles the IR into machine code. The ability to write transformation passes that operate at the IR level provides great flexibility in terms of making changes at a logical level. To ensure spatial and temporal safety from the compiler perspective, we wrote compiler passes to analyze programs and understand the program behavior. We then wrote transformation passes that operate on the IR to add metadata to pointers and insert runtime checks. In our solution, we have added support for two new machine instructions, namely "hash" and "val", in RISC-V LLVM with the help of intrinsics. These intrinsics are represented as function calls at the LLVM-IR level but will be translated into machine instructions at the assembly level. As per LLVM's documentation [20] adding a new instruction directly changes the bit code format, and would require a considerable amount of effort to maintain compatibility with the previous versions. Thus, we have proceeded with adding a new intrinsic to the compiler instead of an instruction. The code for adding a new intrinsic in RISC-V LLVM is shown in Figure 3.
To explain the process of implementing the IR transformation pass, we divide it into a set of tasks (not necessarily in an order) that were performed.

1. Handling global variables and global pointers : This part of the transformation pass deals with global variables and global pointers, which might pose a potential threat. Since these variables reside neither on the stack nor on the heap, we cannot directly craft a fat-pointer using the SFC or the metadata stored along with the malloc'd region. These variables lie in the read only section of the memory known as .bss. To prevent overflow or read-past-bounds attacks on global pointers, we craft fat-pointers using the RODC instead.
2. Replacing malloc calls and free calls with safemalloc and safefree : This part of the transformation pass replaces all malloc calls with safemalloc, free with safefree and realloc with saferealloc. It also mutates the return type of the malloc and realloc functions to fat-pointers.
3. Adding the stack frame cookie : This transformation pass inserts instructions into the first basic block of every function in the IR. The inserted IR code generates an SFC and places it at the bottom of the stack frame every time the function is called at runtime. Once the SFC is placed on the stack, its hash value is computed using the hash instruction and stored in a temporary register in LLVM. This hash value is used to create fat-pointers for derived data type objects on the stack. The transform also adds code to the last basic block of the function, which randomizes the SFC before the function returns. This ensures that once the function goes out of scope and returns to the calling function, any attempt to use pointers to that stack frame raises an exception.
4. Handling pointer arguments and returns for function calls within, and outside a module : This part of the pass operates on function calls and function prototypes within the module and converts every pointer in the arguments or return values to a fat-pointer. However, system calls like scanf and printf, and function calls outside the module, are left untouched. Any fat-pointers passed to them as arguments are first validated and collapsed into pointers. Additionally, to protect against overflows caused by special library functions like memcpy and strcpy, explicit checks are added to ensure that the destination buffer length is greater than the source buffer length.
5. Crafting fat-pointers : Since our solution is based on fat-pointers, this part of the transformation pass transforms every derived data type object on the stack into a fat-pointer. The fat-pointer is created by calling the function named craft. All uses of the existing object are then replaced with the newly created fat-pointer to ensure memory safety.
6. Transformations for various LLVM instructions : This is the most important part of the transformation pass. It converts pointers to fat-pointers and handles the type mismatches of all LLVM instructions. It is also responsible for adding a val instruction before every load and store instruction, so that wherever a pointer is dereferenced, a validity check is inserted just before it. The validity checks are only inserted by the compiler in the form of a val instruction; the actual check is performed by the hardware at runtime.
7. Warnings for pointer decrements : This is an analysis pass used for identifying any decrement operation performed on pointers within the program. As stated earlier, pointer decrements on stacks might cause illegal access to other objects within the current stack frame, below the base element of the pointer. Thus, this pass of the compiler throws a warning whenever a pointer decrement is encountered in the desired function or module. Currently, no automated solution has been put in place to fix this problem; therefore, it is the responsibility of the programmer to handle this scenario. A possible solution is to replace the pointer decrement with a pointer plus offset for compiler-enforced safety, or to manually validate the safety and suppress the warning.
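The explicit check added around library functions such as memcpy (task 4 above) could look like the following minimal sketch. The text states only that the destination length is compared against the source length. This sketch instead compares the requested copy length against the space remaining in both buffers, which is an assumed, slightly stricter variant. The names `bounded_ptr`, `remaining` and `checked_memcpy` are hypothetical, and the bound is assumed to be one past the last byte of the object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *ptr;    /* current pointer into the object */
    unsigned char *bound;  /* one past the last byte of the object (assumed) */
} bounded_ptr;

/* Number of bytes between the pointer and its bound. */
size_t remaining(bounded_ptr p) {
    assert(p.ptr != NULL && p.bound != NULL);
    return p.ptr <= p.bound ? (size_t)(p.bound - p.ptr) : 0;
}

/* Copies n bytes only if both buffers are large enough; returns false and
 * leaves the destination untouched otherwise. */
bool checked_memcpy(bounded_ptr dst, bounded_ptr src, size_t n) {
    if (n > remaining(dst) || n > remaining(src)) return false;
    memcpy(dst.ptr, src.ptr, n);
    return true;
}
```

In the instrumented program a failed check would presumably raise an exception instead of returning false; the boolean return value is used here only to keep the sketch easy to exercise.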

Micro Architecture
The hardware and ISA extensions proposed for Shakti-MS have been implemented over an existing baseline processor in order to provide a fair comparison of the incurred area and performance overheads. We have used the 64-bit 6-stage in-order Shakti C-64 design [14] as our baseline processor whose micro-architecture is shown in Figure 4. This processor was designed using Bluespec System Verilog (BSV) [4].
Following is a brief outline of the functioning of Shakti C-64:
1. PC Generate Stage: This is the first stage of the pipeline. This stage is responsible for generating the value of the Program Counter (PC). The Branch Prediction Unit (BPU) sends out the value of PC and the prediction bits. If the prediction bits indicate that the branch is taken, the next PC is the value that is given out by the BPU; else, the next PC is computed as current PC + 4. This PC is then sent to the instruction cache.
ISB. However, if the instruction executed did not require any memory accesses then the result from the EXE-MEM ISB is simply buffered into the MEM-WB ISB.
6. Writeback Stage: This stage is responsible for writing the results back to the register file if no exception was generated. Also, the result is forwarded via the operand forwarding path. In case of an exception, a complete pipeline flush is initiated, and the processor jumps to the exception handler routine. Also, for instructions like store and branch, no operations are performed in this stage.
As far as the hardware implementation of the two new instructions is concerned, the PC Generate, Fetch, and Decode and Operand Fetch stages for both of them work like those of any other arithmetic instruction. Actions performed in the remaining stages are described below:

Hash Instruction
• Execute stage: In the execute stage, the hash instruction is treated like a load instruction. Here the effective address is resolved by retrieving the address from the rs1 register. Moreover, an extra piece of information is passed on to the memory stage to distinguish between normal loads and the hash instruction.
• Memory stage: In this stage, the memory access is performed and the pipeline stalls until a response is obtained from the memory subsystem. Once a response is obtained, it is checked for exceptions. If the response is valid, the hash of the read value is computed and written to the MEM-WB ISB. Additionally, the computed hash value is forwarded via the operand forwarding path.
• Writeback stage: The writeback stage of a hash instruction is similar to that of a load instruction. The results are forwarded via the operand forwarding path and also written into the register file, provided no exception was generated.

Val Instruction
• Execute stage: In this stage, two actions are performed by the processor. First, the effective address is resolved, i.e. the address of the base is extracted from the operands. To be more precise, the address is the lower 32 bits of the rs1 register. This address is used in the memory stage to issue a load request to get the cookie. The second action is to compare the value of the pointer with the values of base and bound that are present in the fat-pointer. If the pointer lies within its permissible limits, no exception is raised; else an exception bit is set and the result is forwarded to the subsequent stages.
• Memory stage: The val instruction performs three operations in this stage, provided that the exception bit was not set in the execute stage. Initially, it issues a load request to the effective address that was computed in the execute stage. Once the response is obtained, the necessary checks for exceptions are performed, and only on a valid response is the hash of the returned value computed. Finally, the computed hash is compared with the id_hash stored along with the fat-pointer (the id_hash is stored in the upper 32 bits of rs2). If these values match, the access is treated as valid; else an exception bit is set to indicate an invalid memory access by the pointer. The results obtained in this stage are then passed on to the writeback stage.
• Writeback stage: This stage reads the data from the previous stage and checks whether the exception bit is set. If so, an Invalid_Pointer exception is raised; else no operation is performed, thereby indicating that the subsequent load or store instruction is indeed a valid access.
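The stage-by-stage behaviour of val described above can be summarised in a small software model, offered as a minimal sketch. The text states that base occupies the lower 32 bits of rs1 and id_hash the upper 32 bits of rs2; placing bound in the upper half of rs1 and the pointer in the lower half of rs2 is an assumption, as are the exclusive bound, the placeholder hash, the toy byte-array memory and the names `val_model`, `store64` and `pack`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MEM_SIZE = 256 };
static uint8_t mem[MEM_SIZE];   /* toy memory addressed by 32-bit offsets */

typedef enum { VAL_OK, VAL_INVALID_POINTER, VAL_LOAD_FAULT } val_result;

/* Placeholder hash: the hash computed by the hardware is not specified. */
static uint32_t toy_hash(uint64_t v) {
    return (uint32_t)((uint32_t)v * 2654435761u);
}

/* Helper: stores a 64-bit cookie into the toy memory. */
void store64(uint32_t addr, uint64_t value) {
    assert(addr <= MEM_SIZE - 8);
    memcpy(&mem[addr], &value, sizeof value);
}

/* Helper: builds a 64-bit register value from two 32-bit halves. */
uint64_t pack(uint32_t upper, uint32_t lower) {
    return ((uint64_t)upper << 32) | lower;
}

val_result val_model(uint64_t rs1, uint64_t rs2) {
    uint32_t base    = (uint32_t)rs1;          /* stated in the text */
    uint32_t bound   = (uint32_t)(rs1 >> 32);  /* assumed position */
    uint32_t ptr     = (uint32_t)rs2;          /* assumed position */
    uint32_t id_hash = (uint32_t)(rs2 >> 32);  /* stated in the text */

    /* Execute stage: spatial check of the pointer against base and bound. */
    if (ptr < base || ptr >= bound) return VAL_INVALID_POINTER;

    /* Memory stage: load the cookie from base and check the response. */
    if (base > MEM_SIZE - 8) return VAL_LOAD_FAULT;
    uint64_t cookie;
    memcpy(&cookie, &mem[base], sizeof cookie);

    /* Memory stage: temporal check of the computed hash against id_hash. */
    if (toy_hash(cookie) != id_hash) return VAL_INVALID_POINTER;

    /* Writeback stage: nothing to do when no exception bit is set. */
    return VAL_OK;
}
```

The early return after the spatial comparison mirrors the statement that the memory stage operations happen only if the exception bit was not set in the execute stage.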

Case Study
This section sketches the LLVM IR that Clang generates for different parts of a simple C program, and shows how the transformation pass modifies that IR. Some examples of the transformation pass are given below.
1. Handling the SFC: Given below is the LLVM IR code that inserts the SFC and collapses it just before the function exits. The keyword "alloca" is used to allocate memory on the stack, and all names with a '%' sign represent temporary registers. LLVM uses static single assignment form and assumes an unlimited number of registers for computation.

    ;insert this at the end of all
    ;alloca calls in a function
    %stack_cookie = alloca i64
    %2 = call i64 @random64()
    store i64 %2, i64* %stack_cookie
    ;body of the function call
    ...
    ;insert this at the end of the function
    %4 = call i64 @random64()
    store i64 %4, i64* %stack_cookie
    %stack_cookie_burn = call i64 @llvm.RISCV.hash(i64* %stack_cookie)

Here @llvm.RISCV.hash represents the intrinsic call to our function hash.
2. Crafting fat-pointers: A fat-pointer is crafted by calling a function named craft with four parameters, namely base, bound, id_hash and the pointer itself. The craft function is a few lines of assembly code inserted during code lowering. The call to the craft function below is used to create a fat-pointer to a character array of size 10.
Here a is an array of size 10 and ptr is a pointer pointing to the array. The code below is the LLVM IR representation of the said line:

The corresponding LLVM IR code would look something like this:

    %1 = alloca i8*, align 8
    %2 = call i8* @malloc(i64 zeroext 10)
    store i8* %2, i8** %1, align 8

where %1 refers to the allocation of variable q. Malloc allocates 10 bytes of memory, and assigns it to q. The modified IR code is given below:

Results
Shakti-MS has two implementation aspects, namely, hardware design and compiler transformations. Hardware additions cause an increase in the area of the chip, and also may increase the critical path length. The compiler transformations, on the other hand, may cause an increase in the code size and runtime overheads. This section discusses the overheads in terms of all these aspects, and also quantifies the effectiveness of the proposed solution.

Runtime Overheads
To measure the runtime overheads, we used those SPEC benchmarks that compiled successfully with the RISC-V LLVM toolchain. We also used some of the buffer overflow benchmarks from SARD-dataset-88 [13], along with some commonly used programs that perform intensive pointer operations and arithmetic. The average runtime overhead is approximately 13%. Figure 5 shows the cycle count overhead for some of the benchmarks and other programs. The white bars in the graph indicate the cycle counts when the programs are compiled with vanilla RISC-V Clang, while the black bars indicate the cycle counts when the programs are compiled with the modified compiler toolchain.

Figure 6. Graph demonstrating overheads of different programs with respect to code size.

Overheads in the Code
To estimate the overheads in terms of code size, the hexdumps of the same set of programs were compared with and without the LLVM transformation pass. Figure 6 shows the code size of the different programs. The white bars indicate the code size without the transformation pass applied, whereas the black bars indicate the code size with the transformation pass applied. The average increase in code size is about 11%.
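As a worked illustration of how such percentages can be derived, the sketch below computes the relative overhead of a protected build over a baseline build, for either cycle counts or code sizes. The function names are hypothetical, and taking the arithmetic mean of the per-program percentages is an assumption; this section does not state how the reported averages were aggregated.

```c
#include <assert.h>
#include <stddef.h>

/* Relative overhead of `modified` over `baseline`, in percent.
   Works for cycle counts and for code sizes alike. */
static double overhead_percent(double baseline, double modified) {
    assert(baseline > 0.0);
    return (modified - baseline) / baseline * 100.0;
}

/* Arithmetic mean of the per-program overheads (an assumed aggregation). */
static double mean_overhead_percent(const double *baseline,
                                    const double *modified, size_t n) {
    assert(baseline != NULL && modified != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += overhead_percent(baseline[i], modified[i]);
    return sum / (double)n;
}
```

For example, a program that grows from 200 to 250 units has an overhead of 25% under this definition; these numbers are invented for the illustration and are not measurements from the evaluation.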

Hardware Overheads
The modified microprocessor was synthesized on UMCIP's open 55nm technology node, and also on a Virtex Ultrascale FPGA (part number xcvu0095-ffva2104-2-e) using Xilinx Vivado 2016.1. The RAMs of caches were treated as a black box for the ASIC synthesis due to unavailability of required SRAM cuts. Shakti-MS has an overhead of 4100 cells on ASIC, and 700 LUTs on the FPGA.
The critical path, which lies in the execute stage, did not change, because the base and bounds check and the hash computation are done in parallel with the existing circuit in that stage. The extra logic in the memory stage also does not fall on the critical path.

Effectiveness
To check the effectiveness of the solution against out-of-bounds accesses and use-after-free attacks, we used SARD-dataset-81 and SARD-dataset-89, downloaded from the SAMATE-NIST [13] website. We also developed our own test cases for more obscure memory corruption attacks. Our test cases were developed to target the different uses of dangling pointers for temporal safety checks, whereas the SARD dataset was used for spatial safety checks. The SARD dataset has around 1100 programs, consisting of both correct and vulnerable ones. Our solution was able to detect all the vulnerabilities it was designed to address. The false negatives that were observed were due to multi-threaded programs and nested sub-object protection. However, sub-objects are handled so that an overflow never extends beyond the object's scope. Since these are very small tests written only to check the effectiveness of a solution, they have not been included in Figure 5. Nevertheless, the runtime overheads observed for these programs were negligible.

Conclusion
In this paper, we propose Shakti-MS, a RISC-V processor supporting both spatial and temporal memory safety. It is a light-weight co-design approach in which the compiler is responsible for inserting new instructions that perform memory checks and the hardware is responsible for executing them. Table 1 shows a comparative study of the runtime overhead of the proposed solution with the existing works. The low runtime overhead is achieved because the work is divided between the compiler and the hardware. The major contribution of the paper is the use of stack-based cookies instead of object-based identifiers. The proposed implementation of fat-pointers prevents both spatial and temporal attacks on stacks and heaps with minimal storage and runtime overheads. Another major advantage of Shakti-MS is that it allows existing RISC-V software and binaries to run unmodified. This means that any program compiled with an unmodified compiler toolchain can still run on the modified processor and co-exist with the protected programs.
Although Shakti-MS works well in protecting against both spatial and temporal attacks, observing its effectiveness for sub-object protection and in multithreaded environments would be interesting future work. Moreover, since our code transformation relies on the compiler to insert instructions, different optimisation passes can be applied before and after our own transformation pass. One example would be a pass that statically determines which pointers need to be transformed into fat-pointers and transforms only those pointers, so as to minimise runtime and code size overheads.

A.1 Abstract
Our artifact primarily consists of a modified LLVM toolchain. Some sample programs have been included, which can be used to check the robustness of the implementation of the proposed solution. The artifacts are provided in a docker image. The image contains a pre-built LLVM toolchain, the riscv64 GNU toolchain, the spike RISC-V ISA simulator, and sample C programs. Links to the repositories containing the source code of the modified LLVM toolchain and the modified Shakti microprocessor have also been provided.

A.2 Artifact Check-list (Meta-information)
• Program: Various C programs that exhibit different manifestations of spatial and temporal vulnerabilities have been included to verify the implementation. The artifacts are bundled into a docker image, which can be found at illustris/shakti-ms-artefacts on dockerhub. The source code of the modified LLVM toolchain can be found at https://github.com/illustris/riscv-llvm-toolchain. The source code of the modified Shakti microprocessor can be found at https://bitbucket.org/arjunmenon/sec-c.

A.3.2 Hardware Dependencies
The docker image needs to be run on an amd64/x86_64 machine (VM or host).

A.3.3 Software Dependencies
Docker should be installed on the host system. In addition, the Bluespec compiler is required in order to compile the hardware source code.

A.3.4 Data Sets
The data set primarily consists of sample C-programs which have various manifestations of spatial and temporal memory vulnerabilities.

A.4 Installation
Run the following command on a terminal:
    $ sudo docker run --rm -it illustris/shakti-ms-artefacts

A.5 Experiment Workflow
Run the demo to test esoteric C functionality:
    $ cd /root/demos/C_functionality
    $ make
To test a new C program (named filename.c):
    $ cd /root/demos/size_and_cycles
    $ make CODE=filename.c cyclecount
Follow the instructions printed by make. Moreover, to measure the time taken for the C program to execute, insert a rdcycle instruction at the beginning and end of the program (as done in the example demo program). The actual cycles taken will be the difference between these two values.
Print the code size of the program before and after applying the transformation:
    $ make CODE=filename.c codesize
Run the provided buffer overflow exploit programs:
    $ cd ~/demos/exploits/buffer_overflow
    $ make
The above command will build and run two versions of the code: one with, and one without, a buffer overflow vulnerability. The hardware implementation of the val instruction raises an exception to terminate the process when access violations are detected, but this demo uses a software-simulated version of val that prints a debug message and resumes.
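The rdcycle measurement described above can be sketched as follows. This is not the code of the example demo program: the helper names and the `workload` function are hypothetical, and the `clock()` fallback exists only so that the sketch also compiles on a non-RISC-V development host. It also assumes that the execution environment permits reading the cycle counter from the program's privilege level.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the cycle counter. On RISC-V targets this uses the rdcycle
   pseudo-instruction; on other hosts it falls back to clock(), which is
   NOT a cycle count and only keeps the sketch portable. */
static uint64_t read_cycles(void) {
#if defined(__riscv)
    unsigned long cycles;
    __asm__ volatile ("rdcycle %0" : "=r"(cycles));
    return (uint64_t)cycles;
#else
    return (uint64_t)clock();
#endif
}

/* Cycles taken are the difference between the two readings. */
static uint64_t cycles_elapsed(uint64_t start, uint64_t end) {
    return end - start;
}

/* Hypothetical workload standing in for the program under test:
   sums the integers 0..999. */
static uint64_t workload(void) {
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 1000; i++)
        sum += i;
    return sum;
}

/* Take one reading before and one after the code being measured. */
static uint64_t measure_workload(void) {
    uint64_t start = read_cycles();
    (void)workload();
    uint64_t end = read_cycles();
    return cycles_elapsed(start, end);
}
```

In a real measurement the two `read_cycles` calls would bracket the body of the program being evaluated rather than the placeholder `workload`.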

A.6 Evaluation and Expected Results
• C functionality demo passes all tests
• Buffer overflow demo detects and warns of buffer overflows
• Code size demos create modified binaries with code sizes falling under specified thresholds
• Cycle counts from the demo are not representative of the reported results, as the demo uses software-emulated hash and val instructions that expand to multiple instructions at runtime