Assembly-Level View on C++ Development

When I started learning C/C++ programming, I only had a very vague idea of what a computer was doing when it executed a program. In languages like Java, JavaScript, or Python, you don't necessarily care about what your lines of code mean in terms of low-level program execution, memory, or performance.

In this article, I will explain the concepts of C++ on the implementation level. I will give examples, show memory dumps, and assembly code, to illustrate my points. This will give you an understanding of what is going on under the hood, and maybe why the C++ language is designed as it is. The assembly code in the examples is for Intel x64 CPUs.

This article will not teach you C++ language features or the basics of programming. There are much better tutorials for that.

Do you really need to know all the things in this article? From the perspective of a hobby programmer, probably not. From the perspective of a software engineer, you should be very familiar with the language and the generated assembly code of your compiler, including its implications on performance and memory. Otherwise there is no point in using a language like C or C++.

Please consider that the article only scratches the tip of the iceberg. I don't cover compiler, operating system and CPU design, which play a major role in how your code executes. Knowing what virtual memory is, how data is cached, or what out-of-order execution is, will give you further insight into the inner workings of your computer.

Please check out the Further Reading section for related topics that are not covered in this article.

Overview

Memory

Memory can be addressed in byte sized increments. To help you visualize it, open the memory view of your IDE while debugging your application. You will see something like this:

                  00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
000000000044C3D0  2b 2b 5c 77 69 6e 33 32 5c 63 70 70 74 65 73 74  ++\win32\cpptest
000000000044C3E0  5c 62 75 69 6c 64 5c 52 65 6c 57 69 74 68 44 65  \build\RelWithDe
000000000044C3F0  62 49 6e 66 6f 5c 63 70 70 74 65 73 74 2e 65 78  bInfo\cpptest.ex
000000000044C400  65 00 00 00 00 00 00 00 72 35 16 22 66 fc 00 10  e.......r5."fü..

On the far left side is the memory address of the first byte in the row. It is followed by 16 bytes of data. Right next to it is the interpretation of the data as ASCII characters. You can switch the view to different representations to see what integers or floating-point numbers are written in memory.

The memory is not initialized by default. It can consist of random data. You have to decide if and when to initialize the memory. However, failing to initialize memory correctly can lead to undefined behavior in your application.

Endianness

There are different ways to store values in memory - particularly values that span over multiple bytes. This is also referred to as the byte-order of a system. The two common ones are big-endian and little-endian.

If you store a 4-byte integer on a little-endian system, the bytes will be stored from lowest to highest:

                  00 01 02 03
0000004C848FF550  78 56 34 12

Whereas on a big-endian system, the bytes will be stored from highest to lowest:

                  00 01 02 03
0000004C848FF550  12 34 56 78

The endianness is very important when manipulating bytes in memory, serializing data, sending data over the network, or storing data in files. Since platforms can have a different byte-order, it can easily lead to compatibility issues. You cannot expect that a memory dump from one system can be correctly interpreted by another system.

Intel x64 CPUs work exclusively with the little-endian format. All memory dumps in this article are in the little-endian format.

Stack

The stack is a data structure with two operations: push and pop. To push an element to the stack means that it is put on top of the other elements. To pop an element from the stack means that the top element is removed. Think of it as a stack of plates. The implementation of a stack typically has an index or pointer that points to the top element of the stack. This pointer is modified whenever push or pop are called.

The processor has push and pop operations, as well as a stack pointer. The stack pointer is a memory address that points to the block of memory that is used by the processor as stack memory. When push is executed by the processor, the stack pointer is decremented, and data is written to the stack. When pop is executed by the processor, data is read from the stack, and the stack pointer is incremented. pop does not actually remove the data from memory. The data stays written in memory until it is overwritten, either by subsequent push operations or direct memory access.

As a C++ programmer you cannot call push or pop directly on the processor, unless your compiler supports inline assembly. Instead, you can influence how and when data is pushed and popped with the tools of the language. There is nothing that keeps you from manipulating the stack memory directly.

In some cases, your application can run out of stack memory and potentially overwrite data in other parts of the memory. This is also known as a stack overflow.

Heap

The heap is dynamic memory on which you can perform read and write operations. You have to explicitly allocate the memory that you want to use in your program, and deallocate the memory when you are done using it. If you don't deallocate the memory that you have allocated, you can potentially run out of memory.

As with the stack, there is nothing that keeps you from reading or writing outside of the bounds of the allocated memory.

Static

Static memory is memory whose size and location is determined at compile time. The lifetime of this memory is the entire execution of the program. You don't have to explicitly allocate or deallocate the memory yourself. Static memory is initialized to a default value prior to code execution.

Numbers, string literals, and even brace-enclosed lists of values that are used to initialize arrays or structs, are all stored in the executable.

Data

;int a = 42;
mov dword ptr [a],2Ah

const char* str = "Hello World!";

000000000000BBB0  48 65 6c 6c 6f 20 57 6f 72 6c 64 21 00 00 00 00  Hello World!....

When the executable is loaded into memory, the string is always at the same memory address. You are always referring to the string by its memory address. Since the string is loaded into read-only memory, you have to copy the string to modify it.

Registers

Registers are the memory that the CPU uses for calculations. The number of registers vary from CPU to CPU. The size of a register determines the range of memory that can be addressed. For instance, a 16-bit register can only address 65535 bytes of memory.^[2]

The state of all registers during program execution can give you a lot of useful information. You will find integer values that are used for calculation, memory addresses of objects, the stack pointer, or the instruction pointer. CPUs that support 32bit/64bit floating-point arithmetic might have special registers and instructions that can process multiple floating-point numbers in parallel (see SIMD).

Registers only play a secondary role when programming in C++. You as a programmer have very little influence over how the registers are utilized, unless your compiler supports inline assembly.

Pointers

struct Test
{
    int a;
    int b;
    int* ptr; // pointer to an integer
    int c;
};

Test test;

test.a = 1;
test.b = 2;
test.ptr = &test.b;
test.c = *test.ptr;

By writing Test test, the object is stored in stack memory. We write two integer values, a pointer to an integer value, and a third integer to the memory.

We specify the memory address to a variable by using the ampersand character (&). We specify the value that is stored at a given memory address by using the asterisk character (*).

000000000026FB28  01 00 00 00
000000000026FB2C  02 00 00 00
000000000026FB30  2c fb 26 00
000000000026FB34  00 00 00 00
000000000026FB38  02 00 00 00

At address 0x26FB28 and 0x26FB2C there are two 4 byte long numbers, our integers. At 0x26FB30 there is what looks like an address. If we read it correctly, it is the memory address 0x26FB2C. This is our pointer. It points to the second integer in memory. On a 64bit machine the pointer is 8 bytes long, this explains the 4 bytes of zeros at 0x26FB34.

At address 0x26FB38 there is another number, our third integer. By writing test.c = *test.ptr, we tell the compiler to copy the value that is located at address 0x26FB2C to address 0x26FB38. Lucky for us, both values are 4 byte integers - no conversion is needed.

Notice how the structure is represented value by value in memory. The values are stored in the little-endian format. Since the compiler knows the layout of our structure, it can use offsets in memory to locate each of its values.

;test.a = 1;
mov dword ptr [test],1

;test.b = 2;
mov dword ptr [rsp+2Ch],2

;test.ptr = &test.b;
lea rax,[rsp+2Ch]               ;RAX=0x26FB2C (&test.b)
mov qword ptr [rsp+30h],rax     ;test.ptr=RAX

;test.c = *test.ptr;
mov rax,qword ptr [rsp+30h]     ;RAX=0x2CFB26 (test.ptr)
mov eax,dword ptr [rax]         ;EAX=2        (*)
mov dword ptr [rsp+38h],eax     ;test.c=EAX   (test.c=)

Pointer types

I showed that the compiler writes pointers as plain memory addresses to memory. But how does the compiler know the type of the object, that the address points to?

If we write int*, the compiler knows that the pointer points to an integer value. This means that the compiler will interpret the bytes at the address that the pointer points to as integers.

If you change the type of a pointer, the memory will be interpreted differently.

int a = 8;
int* ptrToInt = &a;
float* ptrToFloat = (float*)ptrToInt;
float b = *ptrToFloat;

000000004A38F8A0  08 00 00 00
000000004A38F8A4
000000004A38F8A8  b8 f8 38 4a
000000004A38F8AC  00 00 00 00
000000004A38F8B0  b8 f8 38 4a
000000004A38F8B4  00 00 00 00
000000004A38F8B8  08 00 00 00

The pointers at 0x4A38F8A8 and 0x4A38F8B0 point to the same address, which is 0x4A38F8B8. The values at 0x4A38F8A0 and 0x4A38F8B8 appear to be equal because the memory contents is exactly the same. Yet, to the compiler, they are two totally different things.

The variable a (0x4A38F8B8) holds the integer value 8. The variable b (0x4A38F8A0) holds the floating-point value 1.121e-44#DEN. If you try to do calculations with these two numbers, the result will be very different.

This applies to all other types in C++, including structures and classes. You can cast any pointer type to another pointer type and access the memory as if the object was created with that type. That said, this behavior is undefined in C++ except for a few cases that are outlined in the strict aliasing rules.^[3]

Pointer arithmetic

I showed how pointers are stored in memory, and how the compiler knows what type of object the pointer points to. But what happens when we do calculations with memory addresses?

The key to understanding pointer arithmetic is that the result of the calculation depends entirely on the pointer type. If your type is a 4 byte integer, then the integer pointer will change in 4 byte increments.

int a = 4;
int* ptrA = &a;
int* ptrB = ptrA + 1;

000000000015FCB8  a4 fc 15 00 00 00 00 00  ¤ü......
000000000015FCC0  a8 fc 15 00 00 00 00 00  ¨ü......

Although we have incremented ptrA by 1, the difference between ptrA (0x15FCA4) and ptrB (0x15FCA8) is exactly 4 bytes.

The same exact rule applies to any other type. If a struct or class has a size of 16 bytes, incrementing a pointer to that structure by 1 will result in an offset of 16 bytes. If you want to step through memory in 1 byte increments, you have to chose the appropriate type, like a char.

There are situations in which you can work with memory and memory addresses without knowing what the contents is. You don't need to know that you are copying floating-point values, as long as you know how many bytes you are copying. It is possible to rewrite algorithms to be more generic by writing them in terms of memory operations, instead of operations on types. Alternatively, you can use templates if code readability and maintainability is a priority, and executable size is not an issue.

Subscript operator

In languages like Java, the subscript operator ([]) is only used for arrays. In C++ however, there is no difference between a pointer and an array, in terms of how the data is accessed.

int* array = ...;
int value = array[2];

Why the third and not the second? The number 2 is an offset, relative to the address of the array. The code increments the integer pointer by 2. It returns the value at the memory address by implicitly dereferencing the pointer. This is equivalent to writing:

int* array = ...;
int value = *(array + 2);

Adding 2 to the integer pointer gives us an offset of 8 bytes (sizeof(int) * 2) in memory. We have to explicitly dereference the pointer to get the value.

;int val0 = arr0[2];
mov  eax,4
imul rax,rax,2                ;RAX=8
mov  rcx,qword ptr [arr0]
mov  eax,dword ptr [rcx+rax]
mov  dword ptr [val0],eax

;int val1 = *(arr1 + 2);
mov  rax,qword ptr [arr1]
mov  eax,dword ptr [rax+8]
mov  dword ptr [val1],eax

This is also the reason why array[0] is the first value of the array. The offset is 0. You can write *array instead.

Function pointers

Many programming languages have a mechanism that allows you to treat a function like a data type. We can create objects of it, and assign it to variables. This means that we can pass it to other functions that can in turn call that function. Or we can store the function pointer and call it at a later time. This is typically how callback-mechanisms are implemented.

;foo()
call foo (013F4C1294h)

;bar(foo)
lea  rcx,[foo (013F4C1294h)]
call bar (013F4C128Fh)

;void bar(void(*fn)(void)) {
mov  qword ptr [rsp],rcx
;fn()
call qword ptr [rsp]

Bear in mind that the CPU will jump to the address of the function pointer regardless of what data is stored there. If the memory happens to contain random data, this data will be executed, and will most likely lead to undefined behavior in your program.

Structs

Before I get to object oriented programming, I want to touch on some properties of C++ structs. This will be essential to understanding how inheritance and polymorphism is implemented.

struct A
{
    int a = 0x10;
    int b = 0x20;
    int c = 0x30;
};

000000000021FE28  10 00 00 00  ....
000000000021FE2C  20 00 00 00  ....
000000000021FE30  30 00 00 00  ....

There are cases in which we want to put a structure inside another structure. When I started learning C++, coming from Java, I would go for this approach:

struct B
{
    int b = 0x20;
    int c = 0x30;
};

struct A
{
    int a = 0x10;
    B* b = nullptr;
};

The intention was to include struct B into struct A by using a pointer. It presents itself in memory like this:

000000000025FD10  20 00 00 00   ...
000000000025FD14  30 00 00 00  0...

000000000025FD18  10 00 00 00  ....
000000000025FD1C  00 00 00 00  ....
000000000025FD20  10 fd 25 00  .ý%.
000000000025FD24  00 00 00 00  ....

Oddly, the size of struct A (0x25FD18) is 16 bytes. We only store an integer of 4 bytes and a pointer of 8 bytes, which is 12 bytes in total.

Since I'm compiling this code on a 64-bit machine, the compiler has chosen to pad the structure in order to align the values to specific addresses. This is known as data structure alignment. The alignment allows the CPU to read the memory more efficiently.

struct B
{
    int b = 0x20;
    int c = 0x30;
};

struct A
{
    int a = 0x10;
    B b;
};

000000000017FDD8  10 00 00 00  ....
000000000017FDDC  20 00 00 00   ...
000000000017FDE0  30 00 00 00  0...

We told the compiler that struct B is part of struct A and it complies. The size of struct A is 12 bytes.

Both definitions carry certain implications when it comes to memory consumption, memory access, and object lifetime. I will discuss this in a later section.

Member fields and functions

The member fields and member functions of structs and classes are only abstract concepts.

Accessing member fields and calling member functions results in assembly instructions that operate on relative addresses. They take the base address of the object as a parameter. As a result, the same instructions can be used regardless of the object.

struct A
{
    int value;

    void setValue(int value) { this.value = value; }
};

A a;
a.setValue(0x10);

;a.setValue(0x10);
mov  edx,10h  
lea  rcx,[a]  
call A::setValue (013F07125Dh)

In C, where you cannot define member functions, you would write something like this:

void setValue(struct A *this, int size);

struct A a;
setValue(&a, 0x10);

You always have to pass a pointer of the object that you want to modify to the function. Calling a member function in C++ does this implicitly.

Functions

Functions are blocks of compiled code that have a name and a memory address associated with them. I already talked about function pointers in the section about Pointers. In this section, I want to examine the mechanisms behind function calls.

Function Definition

When you define a function in C++, the compiler generates assembly code or machine code for the instructions that are inside the function body. It associates the memory address of the generated code with a unique name. This name is a combination of the namespace the function is defined in, the function name, and the number and types of parameters (see name mangling).

One of the key features of the C++ programming language is function overloading. This means that you can define functions with the same name but different sets of input parameters. Because the name mangling also encodes the number of input parameters and their types, C++ is able to find the correct function it needs to call during compilation. You cannot overload functions by only changing the return type of the function. The name mangling does not encode the return type of the function.
This is different to C, where there is no name mangling, and no function overloading.

Once the code has been compiled, the function names have been resolved to addresses.

Function Call

However, this description leaves out a lot of details about how a function call is implemented on the actual hardware.

To answer these questions, compiler vendors and operating system developers have come up with many different calling conventions for their platforms throughout the years. These calling conventions are usually part of a broader application binary interface.

A calling convention defines how a function call should be performed, on a machine code level. It guarantees that programs that are written by different people, with different compilers, on different machines, are compatible and can call each other.
This also extends to programs written in different programming languages.

Microsoft x64 Calling Convention

Data Types

It defines what type of data exists. This includes signed and unsigned integers of various sizes, floating-point values of various sizes, pointers, as well as arrays, structs and unions. It also defines the alignment of data in memory.

This guarantees that two different programs know how to access the memory correctly, when they pass data between each other.

Function Calls

It defines how parameters are passed to and from functions. The Microsoft x64 calling convention uses a mixture of registers and stack memory. Four parameters are stored in registers, the rest is written to stack memory. The return value is stored in a register.

There are also rules for what registers need to be stored and restored by either the person that calls the function (caller), or the person who's function is called (callee).

For instance, if you are using a set of registers to do a mathematical calculation, and you call a function, you may need to store these registers to stack memory, otherwise they will be overwritten by the function. If you don't store the registers and continue to work with them after the function call, your result of your calculation might be wrong.

Stack Frame

It defines the contents and structure of the so called stack frame. For each function call, a stack frame is created. The stack frame contains the return address, register values that need to be restored after the call, as well as the local variables.

The information in the stack frame is what makes recursive function calls possible.

Example

In the following example, the function foo is called with the value 13. foo performs some calculations using the function parameter, as well as the value of a variable that is stored in stack memory.

int foo(int num)
{
    int d = 42;
    return num * num + d;
}

int main()
{ 
    return foo(13);
}

; int foo(int):
mov  DWORD PTR [rsp+8],ecx
sub  rsp,24
; int d = 42;
mov  DWORD PTR d$[rsp],42
; return num * num + d;
mov  eax,DWORD PTR num$[rsp]
imul eax,DWORD PTR num$[rsp]
add  eax,DWORD PTR d$[rsp]
add  rsp,24
ret  0

; int main()
sub  rsp,40
mov  ecx,13
call int foo(int)
add  rsp,40
ret  0

Conclusion

As you can see, there are a lot of details that need to be taken into account when a function is called. It becomes even more complicated with features like variadic functions, and C++ exception handling. Thankfully the compiler takes care of all these details for us.

Object-Oriented Programming

There are different ways how object-oriented programming and its main concepts are taught. One of the commonly used examples is that of animals. A cat is an animal. A dog is an animal. As animals, both share certain properties. They can walk, they can emit some kind of sound, they hunt prey, they have fur, etc. However, they exhibit these properties in different ways.
This is to illustrate inheritance and polymorphism based on how humans categorize abstract concepts in a hierarchical structure, with is-a relations. A Cat is-a Mammal, a Mammal is-a Animal, etc.

In practice however, this example is very misleading. The main advantage of inheritance and polymorphism can be illustrated much better when you think about interfaces and abstractions. For instance, if you have a console, a file, a printer, and an internet connection, each of them requires very specific I/O code. This code is based on the hardware, the operating system, and communication protocols.

As a programmer, you find yourself in situations in which you don't care what the implementation is. You just want to read and write data. Thus, one possible abstraction to all the four types of I/O devices is a data stream. The stream has read and write, as well as open and close operations. What actually happens is hidden behind the abstraction. You don't have to know how TCP/IP works in order to use a data stream.

Inheritance is also a tool for extending the functionality of a class, by adding new functions and properties. On a low level, it is implemented exactly like an extension.

Inheritance

If you have read the section about structures, you will understand how inheritance is implemented immediately. Lets take the following example:

struct A
{
    int a = 0x10;

    void test();
    void testA();
};

struct B : public A
{
    int b = 0x20;
    int c = 0x30;

    void test();
};

B b;

00000000002AFAF8  10 00 00 00  ....
00000000002AFAFC  20 00 00 00   ...
00000000002AFB00  30 00 00 00  0...

Look familiar? It's no coincidence! You are telling the compiler that you are extending the definition of struct A by the definition of struct B, and it complies.

Since the memory of struct A is contained in the memory of struct B, you can call functions defined in struct A with an object of type struct B. For instance:

B b;
b.test();
b.testA();

;B b;
lea  rcx,[b]
call B::B (013F1612DFh)

;b.test();
lea  rcx,[b]
call B::test (013F1612EEh)

;b.testA();
lea  rcx,[b]
call A::testA (013F1612F3h)

Since struct B defines a function named B::test, it is called instead of A::test. [TODO why? give source]

As you can see in the assembly code, when calling b.testA(), it calls the implementation of A::testA with a pointer to object b. This is possible because struct B contains struct A at the beginning of the memory - the offsets to its values are the same. The implementation of A::testA simply assumes that there is a struct A in memory. It doesn't know about struct B.

Polymorphism

Going back to the example of the data stream: if the file implementation and the printer implementation have a function called write, how do we know what implementation to call when we only have the abstract data stream to work with?

struct Stream
{
    virtual int read() = 0; // pure virtual function
    virtual int write() = 0;
};

struct File : public Stream
{
    int v = 0x10;

    int read() override {
        return v;
    }

    int write() override {
        return v;
    }
};

struct Printer : public Stream
{
    int v = 0x20;

    int read() override {
        return v;
    }

    int write() override {
        return v;
    }
};

void operate(Stream* stream)
{
    std::cout << stream->write() << std::endl;
}

File file;
Printer printer;

operate(&file);
operate(&printer);

If we look at how struct File is represented in memory, we notice something odd:

000000000024F8B8  78 9d dd 3f  x."?
000000000024F8BC  01 00 00 00  ....
000000000024F8C0  10 00 00 00  ....

struct Stream doesn't contain any members, so why are there 8 bytes before our 0x10?

As you can see from the source code, the compiler allows us to pass a pointer to an object of type struct File to a function that takes a pointer to a struct Stream. When Stream::write is called, the only way for the processor to know what implementation to call, is to look at the so called virtual function table of the object.

The VFT is an array of function pointers. We give the compiler a hint what functions to put into the VFT by using the keyword virtual.

The first element of the memory listing is in fact a pointer to the VFT of the type struct File. The following assembly code shows when and how the address is written to the object:

;Stream::Stream ()
mov  qword ptr [rsp+8],rcx
push rdi
mov  rax,qword ptr [this]
lea  rcx,[Stream::vftable]
mov  qword ptr [rax],rcx
mov  rax,qword ptr [this]
pop  rdi
ret

;File::File ()
mov  qword ptr [rsp+8],rcx
push rdi
sub  rsp,20h
mov  rdi,rsp
mov  ecx,8
mov  eax,0CCCCCCCCh
rep stos dword ptr [rdi]
mov  rcx,qword ptr [this]
mov  rcx,qword ptr [this]
call Stream::Stream
mov  rax,qword ptr [this]
lea  rcx,[File::vftable]
mov  qword ptr [rax],rcx
mov  rax,qword ptr [this]
mov  dword ptr [rax+8],10h
mov  rax,qword ptr [this]
add  rsp,20h
pop  rdi
ret

You can see how the struct File constructor calls the struct Stream constructor, because of the inheritance. Then the address of the struct File VFT is written to the beginning of the object. After that, the fields of the object are initialized.

All instances of struct File share the same VFT, which happens to be located at 0x13FDD9D78 in my example. Lets see what we can find there:

000000013FDD9D78  4f 11 dd 3f 01 00 00 00  O..?....
000000013FDD9D80  62 12 dd 3f 01 00 00 00  b..?....

Since struct Stream has two virtual functions, and we override both in struct File, there are two function pointers in the VFT. One pointing to the implementation of File::read at 0x13FDD114F, and the other one to the implementation of File::write at 0x13FDD1262.

Now that we know what a VFT is and how it gets there, it's time to take a look at the virtual function call mechanism:

;void operate(Stream* stream) {
mov  rax,qword ptr [stream]  ;RAX=0x24F8B8
mov  rax,qword ptr [rax]     ;RAX=0x13FDD9D78
mov  rcx,qword ptr [stream]  ;RCX=0x24F8B8
call qword ptr [rax+8]       ;call 0x13FDD1262

As you will see by the console output, the processor manages to call the correct implementation. For the file object, the number 0x10 is printed out. For the printer object, the number 0x20 is printed out.

Memory Management

Copying

When reading and writing C++ programs, you should develop a sense when memory is being copied, and in doubt, check the implementation.

struct A
{
    int a;
    int b;
    int c;
};

A createStruct();
A* createStructPtr();

A a = { 1, 2, 3 };
A b = a;
A c = createStruct();
A* d = createStructPtr();
A e = *d;

When you create a class or structure, the following constructors and operators are created for you by default. They control how objects are moved or copied in memory.

By overloading them, you can specify exactly what needs to happen, when an object is copied. For instance:

struct Array
{
    Array(int size) : size(size) { data = new int[size]; }
    ~Array() { delete data; }

    const int* data;
    const int size;
};

Array a(0x20);
Array b = a;

The first struct Array object is created on the stack. The constructor allocates enough memory on the heap to store 32 integer values. The pointer to this memory is written to our object. So far so good, but what happens when we assign array a to array b?

;Array a(0x20);
mov  edx,20h  
lea  rcx,[a]  
call Array::Array (013FE71294h)

;Array b = a;
lea  rax,[b]  
lea  rcx,[a]  
mov  rdi,rax  
mov  rsi,rcx  
mov  ecx,10h  
rep movs byte ptr [rdi],byte ptr [rsi]  ;memcpy

The constructor for b is not called. Instead, the contents of a is simply copied to b, byte by byte. As a result, b contains the pointer to the same memory as a. The data on the heap is not copied.

This is very risky. The destructor of struct Array deallocates the buffer on the heap. If the other array tries to access it, it will result in undefined behavior. If the other array object is destructed, the code will attempt to delete the memory for a second time, which is also undefined behavior.

If you overload the copy assignment operator, you can tell the compiler to allocate new memory and make a copy of the data on the heap. Alternatively, you can delete the copy constructor and copy assignment operator. You will get a compile error whenever you try to use the array in a way that was not intended.

This is one of many examples in which it is not clear what the assignment operator actually does. If you are in doubt, always check the implementation. Otherwise, you could be copying a lot of data without even knowing it.

Note: Another solution is to overload the move assignment operator and use std::move to copy the object a to b and invalidate object a. If implemented correctly, this will eliminate the undefined behavior (dangling pointers, double-frees).

Scope

Every variable you declare between two curly braces {} is only valid inside that scope. Once the program exists the scope, the variable becomes invalid.

This mechanism is strongly tied to the stack. If you create an object on the stack, its lifetime is defined by the scope in which it was created. Once the program exits the scope, the destructor of the object is called.

Object* createInt()
{
    Object a;

    return &a;
}

If you create an object on the stack, inside a function scope, and decide to return a pointer to that object, you are returning a pointer to a memory address that will be invalidated and probably overwritten. This can result in undefined behavior if you try to read from this pointer at a later time.

If you want a value to survive the lifetime of a scope, you have to either copy it or allocate it on the heap.

{
    int* arrA = new int[1000];

    if (conditionAIsFalse)
    {
        delete arrA;
        return -1;
    }

    int* arrB = new int[2000];

    if (conditionBIsFalse)
    {
        delete arrB;
        delete arrA;
        return -2;
    }

    delete arrB;
    delete arrA;

    return 0;
}

We use dynamic memory allocation to allocate memory on the heap. For the program to run correctly, and not leak memory, we have to delete the allocated memory when we exit the function. Since we have multiple exits, it becomes difficult to track what needs to be deleted and what not.

An alternative is to only have one exit point and nest the if-statements. You have to be very careful to set the return value correctly. The readability suffers the more nested if-statements you have.

{
    int returnValue = 0;

    int* arrA = new int[1000];

    if (conditionAIsTrue)
    {
        int* arrB = new int[2000];

        if (conditionBIsTrue)
        {
            returnValue = 0;
        }
        else
        {
            returnValue = -2;
        }

        delete arrB;
    }
    else
    {
        returnValue = -1;
    }

    delete arrA;

    return returnValue;
}

Or if you feel very adventurous, you can even use the goto statement to jump to the clean-up code whenever an error occurs.

In any case, you have to be very careful about deallocating memory. If you forget, your program will leak memory. If you delete memory more than once, you end up with undefined behavior.

In C++, we can hide some of the memory management details by using the scope of variables to our advantage, and write the following code:

struct Array
{
    Array(int size) { data = new int[size]; }
    ~Array() { delete data; }

    int* data;
};

{
    Array arrA(0x10);

    if (conditionAIsFalse)
        return -1;

    Array arrB(0x20);

    if (conditionBIsFalse)
        return -2;

    return 0;
}

Notice how we create arrA and arrB on the stack. They only contain a pointer to a buffer. The actual integer buffers are created dynamically on the heap when the constructor is called. The moment the program leaves the scope, the destructor of struct Array is called. The compiler figures out when and where to do this for us. We can simply use the array and not worry about the memory management.

;Array arrB(0x20);
mov   edx,20h
lea   rcx,[arrB]
call  Array::Array (013FA910D7h)

;if (conditionBIsFalse)
movzx eax,byte ptr [conditionBIsFalse]
test  eax,eax
je    main+0A6h (013FA918E6h)

    ;return -2;
mov   dword ptr [rsp+68h],0FFFFFFFEh
lea   rcx,[arrB]
call  Array::~Array (013FA91127h)
nop
lea   rcx,[arrA]
call  Array::~Array (013FA91127h)
mov   eax,dword ptr [rsp+68h]
jmp   main+0C7h (013FA91907h)

Before the struct Array constructor is called, the array size is stored in EDX, and the memory address of arrB is stored in RCX.

Next the value of the bool is checked. If it is false, we jump forward to the code that continues our function. If it is true, we continue executing the return code, which stores the value -2 in stack memory for later reference, and calls the struct Array destructor on arrA and arrB. Last but not least, the return value is copied to EAX, and the program jumps to some function wrap up code.

Note how arrA and arrB are destructed in reverse. This adheres to the Last-In-First-Out behavior of the stack data structure.

Using the scoped lifetime of objects in order to allocate and deallocate resources automatically is known as RAII. RAII is used throughout the standard library. For example, a std::vector is a container that dynamically allocates an array on the heap, and automatically cleans up its resources when it goes out of scope. RAII is also used to close files and streams, release locks, etc.

Note: Please use std::unique_ptr in combination with std::make_unique. Using std::unique_ptr will ensure that the allocated memory is cleaned up when the pointer object goes out of scope.

Should I use stack, heap, or static memory?

Code Generation

A major feature of the C++ programming language are templates. Let me quickly explain what problem templates solve.

Imagine you are writing a piece of software and you need a mathematical function that adds two integer numbers and returns the result. You write the following code:

int sum(int a, int b)
{
    return a + b;
}

At some point, you decide to support floating-point numbers in your application. However, you cannot just use the already existing function because if you passed floating-point values to it, they would be implicitly converted to integer values, thus truncating them. You decide to add another function that handles single-precision floating-point numbers:

int sum(int a, int b)
{
    return a + b;
}

float sum(float a, float b)
{
    return a + b;
}

The algorithm is the same in both functions. The only difference is the types. If you wanted to modify the algorithm, you would have to change the implementation in both functions. This might not seem like a problem now, but if you had dozens of functions, and they each consisted of a few hundred lines of code, it would become a problem.

Templates let you express an algorithm in a generic way by adding placeholders into the code that will be substituted with concrete types or values when the template is used.

// template definition
template<typename T>
T sum(T a, T b)
{
    return a + b;
}

// template instantiation
sum<int>(5, 10);
sum(0.3f, 0.1f); // the type is deduced

sum<int> instructs the compiler that you instantiate the function template with type T=int. This means that anywhere T is used in the code, substitute it with int.

The second function call is valid because the compiler can deduce which type to use for the function template, based on the type of the arguments. Since both arguments are of the float, the function template is instantiated with T=float.

Generation by type substitution

Given the last example, lets dive a little bit deeper and look at the assembly code that is being generated. Since C++20 you can define templates by using the auto keyword as a type placeholder.

auto sum(auto a, auto b)
{
    return a + b;
}

sum(5, 10);
sum(0.3f, 0.1f);
sum(0.8, 5);

To a C++ beginner, the code might look like we are reusing the same function three times. However, when we look at the assembly, we see that this code has generated three different versions of the same function:

;auto sum<int, int>(int, int):
push  rbp
mov   rbp, rsp
mov   dword ptr [rbp - 4], edi
mov   dword ptr [rbp - 8], esi
mov   eax, dword ptr [rbp - 4]
add   eax, dword ptr [rbp - 8]
pop   rbp
ret

;auto sum<float, float>(float, float):
push  rbp
mov   rbp, rsp
movss dword ptr [rbp - 4], xmm0
movss dword ptr [rbp - 8], xmm1
movss xmm0, dword ptr [rbp - 4]
addss xmm0, dword ptr [rbp - 8]
pop   rbp
ret

;auto sum<double, int>(double, int):
push  rbp
mov   rbp, rsp
movsd qword ptr [rbp - 8], xmm0
mov   dword ptr [rbp - 12], edi
movsd xmm0, qword ptr [rbp - 8]
cvtsi2sd xmm1, dword ptr [rbp - 12]
addsd xmm0, xmm1
pop   rbp
ret

The generated code depends on the type of the value that is passed to the function. This also means that there is no type conversion required to call these functions:

; sum(5, 10);
mov   edi, 5
mov   esi, 10
call  auto sum<int, int>(int, int)

; sum(0.3f, 0.1f);
movss xmm0, dword ptr [rip + .LCPI1_1]
movss xmm1, dword ptr [rip + .LCPI1_2]
call  auto sum<float, float>(float, float)

; sum(0.8, 5)
movsd xmm0, qword ptr [rip + .LCPI1_0]
mov   edi, 5
call  auto sum<double, int>(double, int)

The same name mangling mechanism that is used for function overloading is also used for templates.

If you enable compiler optimizations, the compiler will try to inline these functions whenever possible. That said, it does not change the fact that you are using a code generation facility.

Generation by value substitution

We have seen how you can generate code by type substitution. C++ also allows you to generate code based on value substitution.

template<int A, int B>
int sum()
{
    return A + B;
}

sum<8, 10>();
sum<42, 8>();

;int sum<8, 10>():
push rbp
mov  rbp, rsp
mov  eax, 18
pop  rbp
ret

;int sum<42, 8>():
push rbp
mov  rbp, rsp
mov  eax, 50
pop  rbp
ret

The compiler has generated two sum implementations, one for each distinct template parameterization. Since the values of the template parameters are known at compile time, the compiler has calculated the sum of both values and written the result directly into the code.

; sum<8, 10>();
call int sum<8, 10>()

; sum<42, 8>();
call int sum<42, 8>()

Both functions are called directly, without passing any arguments. As stated before, the template parameters are used in the name mangling, making both function names unique.

Recursive code generation

template<int N>
int sum()
{
    return N + sum<N-1>();
}

// define stop condition with template specialization
template<>
int sum<1>()
{
    return 1;
}

sum<3>();

;int sum<3>():
push rbp
mov  rbp, rsp
call int sum<2>()
add  eax, 3
pop  rbp
ret

;int sum<2>():
push rbp
mov  rbp, rsp
call int sum<1>()
add  eax, 2
pop  rbp
ret

;int sum<1>():
push rbp
mov  rbp, rsp
mov  eax, 1
pop  rbp
ret

Keep in mind that the optimizer may inline all the calls and calculate the result at compile-time.

Optimization

The assembly code that you have seen so far was all unoptimized. Optimized code can look entirely different from what you would expect. When the compiler optimizes your code, it makes decisions based on the context in which the code is used. It can only make optimizations for things that it knows at compile-time. The compiler cannot predict what the value of a variable will be at runtime.

This section deals with different compiler optimizations. This might give you some ideas about how you could structure your code, so that it is easier to optimize. Of course you need to check the assembly code, to see if your efforts were fruitful. And you need to benchmark your code before and after the optimization, to see if it has any significant effect on the performance of your program.

SIMD & Auto-Vectorization

SIMD stands for Single Instruction Multiple Data. SIMD instructions are used to speed up computation by processing multiple values at a time. For instance, you can add or multiply a packet of 4 or 8 floats in parallel. This can have advantages in areas that are computation heavy, like image and audio processing.

Each CPU supports different SIMD instruction sets. The most common are variants of SSE and AVX. The main difference being the size of the registers and the operations that you can perform. You have to be able to recognize which instruction set and registers are being used in the generated assembly.

Compilers will try to auto-vectorize loops in which the compiler detects certain memory access patterns. There are however calculations that cannot be easily auto-vectorized, either due to data dependencies, unpredictable loop stride, complex loop body, or other compiler-specific reasons.

Unoptimized Loop

Let's start with a simple example. We want to process an array of floats by multiplying the input with a value, and writing the result to the output.

void process(float* in, float* out, int len, float value)
{
    for (int i = 0; i < len; i++)
    {
        out[i] = in[i] * value;
    }
}

$LN2@process:
    mov     eax, DWORD PTR i$1[rsp]
    inc     eax                          ; i++
    mov     DWORD PTR i$1[rsp], eax
$LN4@process:
    mov     eax, DWORD PTR len$[rsp]
    cmp     DWORD PTR i$1[rsp], eax
    jge     SHORT $LN3@process           ; if i >= len then leave loop
    movsxd  rax, DWORD PTR i$1[rsp]
    mov     rcx, QWORD PTR in$[rsp]
    movss   xmm0, DWORD PTR [rcx+rax*4]  ; XMM0 = in[i]
    mulss   xmm0, DWORD PTR value$[rsp]  ; XMM0 = XMM0 * value
    movsxd  rax, DWORD PTR i$1[rsp]
    mov     rcx, QWORD PTR out$[rsp]
    movss   DWORD PTR [rcx+rax*4], xmm0  ; out[i] = XMM0
    jmp     SHORT $LN2@process

The register xmm0 and the instructions movss and mulss point to the use of SIMD instructions. However, if you check the individual instructions in the Intel Intrinsics Guide, you will see that the code is processing each individual float value. The ss part of movss implies that one single-precision element is moved from memory. xmm0 is a 128-bit register that could technically hold 4 floats. The DWORD also gives a hint that we are accessing single elements.

Successful Loop Vectorization with MSVC

To optimize the version, we will use the compiler option /O2. To check that the code was actually auto-vectorized, we use the option /Qvec-report:2. This will print out a auto-vectorizer report during compilation. Check the links below for more information.

The generated assembly consists of many different parts. I will point out the most relevant to you.

    movaps  xmm2, xmm3                      ; XMM2 = XMM3 = (value,0,0,0)
    shufps  xmm2, xmm2, 0                   ; XMM2 = (value,value,value,value)
    ...
$LL4@process:
    movups  xmm0, XMMWORD PTR [rdx+rax-16]  ; XMM0 = in[?]
    add     r11d, 16
    add     r9, 16                          ; increment by 16 byte (4 floats)
    movups  xmm1, XMMWORD PTR [rax+rdx]     ; XMM1 = in[?]
    lea     rax, QWORD PTR [rax+64]
    mulps   xmm0, xmm2                      ; XMM0 = XMM0 * XMM2
    mulps   xmm1, xmm2                      ; XMM1 = XMM1 * XMM2
    movups  XMMWORD PTR [rax-80], xmm0      ; out[?] = XMM0
    movups  xmm0, XMMWORD PTR [rdx+rax-48]  ; XMM0 = in[?]
    movups  XMMWORD PTR [rax-64], xmm1      ; out[?] = XMM1
    mulps   xmm0, xmm2                      ; XMM0 = XMM0 * XMM2
    movups  xmm1, XMMWORD PTR [rdx+rax-32]  ; XMM1 = in[?]
    mulps   xmm1, xmm2                      ; XMM1 = XMM1 * XMM2
    movups  XMMWORD PTR [rax-48], xmm0      ; out[?] = XMM0
    movups  XMMWORD PTR [rax-32], xmm1      ; out[?] = XMM1
    cmp     r9, rbx
    jl      SHORT $LL4@process

The first thing you should see is that movss became movups. The ups in movups indicates that we are moving a pack of 4 single-precision floating-point values, from a memory address that might be unaligned^[4]. We are moving the floats into XMM registers. As we've established, XMM are 128-bit registers that can hold 4 floats. The XMMWORD in the assembly also hints at the fact that we are reading 128-bits from memory.

One thing to note here is that the assembly shows 4 multiplications, 4 movups reading from memory, and 4 movups writing to memory. This is referred to as loop unrolling. It is processing 4 packs of 4 floats for a total of 16 floats per iteration. The goal of loop unrolling is better processor utilization by increasing the throughput.

But what happens if the length of our array is not divisible by 16? What if we are processing 21 values? There would be a remainder of 5 values that we cannot process with this assembly code. The answer lies in the next part of the generated assembly.

$LL18@process:
    lea     rax, QWORD PTR [rax+16]
    movaps  xmm0, xmm3                    ; XMM0 = value
    mulss   xmm0, DWORD PTR [rcx+rax-20]  ; XMM0 = XMM0 * in[?]
    movaps  xmm1, xmm3                    ; XMM1 = value
    movss   DWORD PTR [rax-20], xmm0      ; out[?] = XMM0
    movaps  xmm0, xmm3                    ; XMM0 = value
    mulss   xmm1, DWORD PTR [rax+rcx-16]  ; XMM1 = XMM1 * in[?]
    movss   DWORD PTR [rax-16], xmm1      ; out[?] = XMM1
    movaps  xmm1, xmm3                    ; XMM1 = value
    mulss   xmm0, DWORD PTR [rcx+rax-12]  ; XMM0 = XMM0 * in[?]
    movss   DWORD PTR [rax-12], xmm0      ; out[?] = XMM0
    mulss   xmm1, DWORD PTR [rcx+rax-8]   ; XMM1 = in[?]
    movss   DWORD PTR [rax-8], xmm1       ; out[?] = XMM1
    sub     r8, 1                         ; decrease iterator
    jne     SHORT $LL18@process

The compiler has switched back to single floating-point values, and decided to unroll the loop. We are now processing 4 floats per iteration. We had 21 floats, we already processed 16, this left us with a remainder of 5 floats. This loop processes another 4 floats, leaving us with 1 float. Maybe you can already guessed what follows.

$LC8@process:
    lea     rax, QWORD PTR [rax+4]
    movaps  xmm0, xmm3                   ; XMM0 = value
    mulss   xmm0, DWORD PTR [rax+rcx-4]  ; XMM0 = XMM0 * in[i]
    movss   DWORD PTR [rax-4], xmm0      ; out[i] = XMM0
    sub     rdx, 1                       ; decrease iterator
    jne     SHORT $LC8@process

We are back to processing single floats, one float per iteration. We have processed our last float and are finally done.

Conclusion

To summarize, the compiler generated three versions for this particular function:

You have to appreciate how much work the compiler does for us. This is not code that you want to write yourself. Therefore, whenever you write processing code that needs to run fast, check if the compiler is able to auto-vectorize it. A few simple changes might already be enough. If that's not the case, look into SIMD optimization with compiler intrinsics.

Conclusion

I hope that I could illustrate in this article that checking the generated assembly code is a good way to prove that the instructions you write, do what you expect them to do.

This will not only give you more insight into the language and its implementation, but also stronger arguments for or against the use of certain language features or framework functions, when you are discussing your particular case with colleagues.

Knowing your way around assembly code and memory dumps will also help you in investigating crashes. You will start to recognize certain implementation patterns like loops, switch-statements, virtual function calls, compiler optimizations, and others, which will give you some intuition about what the program is doing and what kind of code your compiler is generating.

There are also cases in which a program will crash due to unsupported instructions, i.e. when executing AVX instructions on a CPU that does not support the AVX instruction set. There you have no other choice than to look at the assembly, otherwise you won't know what is causing the error. The source code alone can't tell you.

The downside to this is that every compiler generates different assembly for the same C++ code, and the factors that influence this are plentiful:

If you want to quickly check the generated assembly for a piece of code with different compilers, Godbolt's Compiler Explorer has you covered.

That said, you will always have to profile your application because the generated assembly code is not the only factor that determines the behavior of your program during execution.

Introduction

Overview

Memory

Endianness

Stack

Heap

Static

Data

Registers

Pointers

Pointer types

Pointer arithmetic

Subscript operator

Function pointers

Structs

Member fields and functions

Functions

Function Definition

Function Call

Microsoft x64 Calling Convention

Data Types

Function Calls

Stack Frame

Example

Conclusion

Object-Oriented Programming

Inheritance

Polymorphism

Memory Management

Copying

Scope

Should I use stack, heap, or static memory?

Code Generation

Generation by type substitution

Generation by value substitution

Recursive code generation

Optimization

SIMD & Auto-Vectorization

Unoptimized Loop

Successful Loop Vectorization with MSVC

Conclusion

Conclusion

Further Reading