When I started learning C/C++ programming, I only had a very vague idea of what a computer was doing when it executed a program. In languages like Java, JavaScript, or Python, you don't necessarily care about what your lines of code mean in terms of low-level program execution, memory, or performance.
In this article, I will explain the concepts of C++ on the implementation level. I will give examples, show memory dumps, and assembly code, to illustrate my points. This will give you an understanding of what is going on under the hood, and maybe why the C++ language is designed as it is. The assembly code in the examples is for Intel x64 CPUs.
This article will not teach you C++ language features or the basics of programming. There are much better tutorials for that.
Do you really need to know all the things in this article? From the perspective of a hobby programmer, probably not. From the perspective of a software engineer, you should be very familiar with the language and the generated assembly code of your compiler, including its implications on performance and memory. Otherwise there is no point in using a language like C or C++.
Please consider that the article only scratches the surface. I don't cover compiler, operating system, and CPU design, which play a major role in how your code executes. Knowing what virtual memory is, how data is cached, or what out-of-order execution is, will give you further insight into the inner workings of your computer.
Please check out the Further Reading section for related topics that are not covered in this article.
Memory can be addressed in byte sized increments. To help you visualize it, open the memory view of your IDE while debugging your application. You will see something like this:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
000000000044C3D0 2b 2b 5c 77 69 6e 33 32 5c 63 70 70 74 65 73 74 ++\win32\cpptest
000000000044C3E0 5c 62 75 69 6c 64 5c 52 65 6c 57 69 74 68 44 65 \build\RelWithDe
000000000044C3F0 62 49 6e 66 6f 5c 63 70 70 74 65 73 74 2e 65 78 bInfo\cpptest.ex
000000000044C400 65 00 00 00 00 00 00 00 72 35 16 22 66 fc 00 10 e.......r5."fü..
On the far left side is the memory address of the first byte in the row. It is followed by 16 bytes of data. Right next to it is the interpretation of the data as ASCII characters. You can switch the view to different representations to see what integers or floating-point numbers are written in memory.
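If your debugger doesn't offer a memory view, you can dump a region of memory yourself. A minimal sketch (the function name is mine):

#include <cstdio>

void dump(const void* p, int len)
{
    const unsigned char* bytes = (const unsigned char*)p;
    for (int i = 0; i < len; i++)
        printf("%02x%c", bytes[i], (i % 16 == 15) ? '\n' : ' '); // 16 bytes per row
}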
The memory is not initialized by default. It can consist of random data. You have to decide if and when to initialize the memory. However, failing to initialize memory correctly can lead to undefined behavior in your application.
There are different ways to store values in memory - particularly values that span over multiple bytes. This is also referred to as the byte-order of a system. The two common ones are big-endian and little-endian.
If you store a 4-byte integer on a little-endian system, the bytes will be stored from lowest to highest:
int i = 0x12345678;
00 01 02 03
0000004C848FF550 78 56 34 12
Whereas on a big-endian system, the bytes will be stored from highest to lowest:
int i = 0x12345678;
00 01 02 03
0000004C848FF550 12 34 56 78
The endianness is very important when manipulating bytes in memory, serializing data, sending data over the network, or storing data in files. Since platforms can have a different byte-order, it can easily lead to compatibility issues. You cannot expect that a memory dump from one system can be correctly interpreted by another system.
Intel x64 CPUs work exclusively with the little-endian format. All memory dumps in this article are in the little-endian format.
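You can observe the byte order of your own machine by inspecting the bytes of an integer through an unsigned char pointer (one of the permitted aliasing cases). A minimal sketch:

#include <cstdio>

int main()
{
    int i = 0x12345678;
    const unsigned char* bytes = (const unsigned char*)&i;
    for (int n = 0; n < 4; n++)
        printf("%02x ", bytes[n]); // prints "78 56 34 12" on a little-endian system
}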
The stack is a data structure with two operations: push and pop. To push an element to the stack means that it is put on top of the other elements. To pop an element from the stack means that the top element is removed. Think of it as a stack of plates. The implementation of a stack typically has an index or pointer that points to the top element of the stack. This pointer is modified whenever push or pop are called.
The processor has push and pop operations, as well as a stack pointer. The stack pointer is a memory address that points to the block of memory that is used by the processor as stack memory. When push is executed by the processor, the stack pointer is decremented, and data is written to the stack. When pop is executed by the processor, data is read from the stack, and the stack pointer is incremented. pop does not actually remove the data from memory. The data stays written in memory until it is overwritten, either by subsequent push operations or direct memory access.
As a C++ programmer you cannot call push or pop directly on the processor, unless your compiler supports inline assembly. Instead, you can influence how and when data is pushed and popped with the tools of the language. There is nothing that keeps you from manipulating the stack memory directly.
In some cases, your application can run out of stack memory and potentially overwrite data in other parts of the memory. This is also known as a stack overflow.
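A simple way to provoke this is unbounded recursion: each call adds a stack frame until the stack memory is exhausted. A deliberately broken sketch (the function name is mine):

int recurse(int n)
{
    int buffer[1024];                  // each call reserves 4 KB of stack memory
    buffer[0] = n;
    return recurse(n + 1) + buffer[0]; // no stop condition: the stack eventually overflows
}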
The heap is dynamic memory on which you can perform read and write operations. You have to explicitly allocate the memory that you want to use in your program, and deallocate the memory when you are done using it. If you don't deallocate the memory that you have allocated, you can potentially run out of memory.
As with the stack, there is nothing that keeps you from reading or writing outside of the bounds of the allocated memory.
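A minimal sketch of that allocation lifecycle (illustrative):

int* data = new int[16]; // explicitly allocate 64 bytes on the heap
data[0] = 42;            // valid write
// data[16] = 0;         // out of bounds: nothing stops you, but the behavior is undefined
delete[] data;           // explicitly deallocate, otherwise the memory leaks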
Static memory is memory whose size and location is determined at compile time. The lifetime of this memory is the entire execution of the program. You don't have to explicitly allocate or deallocate the memory yourself. Static memory is initialized to a default value prior to code execution.
Numbers, string literals, and even brace-enclosed lists of values that are used to initialize arrays or structs, are all stored in the executable.
In x64 assembly, numbers become part of CPU instructions[1]:
;int a = 42;
mov dword ptr [a],2Ah
A string literal:
const char* str = "Hello World!";
Is written to the executable file at an address known at compile time:
000000000000BBB0 48 65 6c 6c 6f 20 57 6f 72 6c 64 21 00 00 00 00 Hello World!....
When the executable is loaded into memory, the string is always at the same memory address. You are always referring to the string by its memory address. Since the string is loaded into read-only memory, you have to copy the string to modify it.
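If you need to modify the text, one option is to initialize an array from the literal, which copies the characters to writable memory. A minimal sketch:

const char* ro = "Hello World!"; // points to the literal in read-only memory
char rw[] = "Hello World!";      // the literal is copied into a writable stack array
rw[0] = 'J';                     // fine: modifies the copy
// ro[0] = 'J';                  // undefined behavior: writes to read-only memory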
Registers are the memory that the CPU uses for calculations. The number of registers varies from CPU to CPU. The size of a register determines the range of memory that can be addressed. For instance, a 16-bit register can only address 65,536 bytes of memory.[2]
The state of all registers during program execution can give you a lot of useful information. You will find integer values that are used for calculation, memory addresses of objects, the stack pointer, or the instruction pointer. CPUs that support 32bit/64bit floating-point arithmetic might have special registers and instructions that can process multiple floating-point numbers in parallel (see SIMD).
Registers only play a secondary role when programming in C++. You as a programmer have very little influence over how the registers are utilized, unless your compiler supports inline assembly.
A pointer is a data type which holds a memory address.
Take a look at the following code:
struct Test
{
int a;
int b;
int* ptr; // pointer to an integer
int c;
};
Test test;
test.a = 1;
test.b = 2;
test.ptr = &test.b;
test.c = *test.ptr;
By writing Test test, the object is stored in stack memory. We write two integer values, a pointer to an integer value, and a third integer to the memory. We take the memory address of a value by using the ampersand character (&). We specify the value that is stored at a given memory address by using the asterisk character (*).
A look at the memory reveals our structure:
000000000026FB28 01 00 00 00
000000000026FB2C 02 00 00 00
000000000026FB30 2c fb 26 00
000000000026FB34 00 00 00 00
000000000026FB38 02 00 00 00
At address 0x26FB28 and 0x26FB2C there are two 4 byte long numbers, our integers. At 0x26FB30 there is what looks like an address. If we read it correctly, it is the memory address 0x26FB2C. This is our pointer. It points to the second integer in memory. On a 64-bit machine the pointer is 8 bytes long, which explains the 4 bytes of zeros at 0x26FB34.

At address 0x26FB38 there is another number, our third integer. By writing test.c = *test.ptr, we tell the compiler to copy the value that is located at address 0x26FB2C to address 0x26FB38. Lucky for us, both values are 4 byte integers - no conversion is needed.
Notice how the structure is represented value by value in memory. The values are stored in the little-endian format. Since the compiler knows the layout of our structure, it can use offsets in memory to locate each of its values.
To illustrate this mechanism, here is the generated code from the compiler:
;test.a = 1;
mov dword ptr [test],1
;test.b = 2;
mov dword ptr [rsp+2Ch],2
;test.ptr = &test.b;
lea rax,[rsp+2Ch] ;RAX=0x26FB2C (&test.b)
mov qword ptr [rsp+30h],rax ;test.ptr=RAX
;test.c = *test.ptr;
mov rax,qword ptr [rsp+30h] ;RAX=0x26FB2C (test.ptr)
mov eax,dword ptr [rax] ;EAX=2 (*)
mov dword ptr [rsp+38h],eax ;test.c=EAX (test.c=)
- The register RSP holds the base address of our struct Test object, which is located on the stack.
- An offset of 0x2C is needed to write the number 2 to the second integer of our object.
- The pointer is written at offset 0x30, exactly 4 bytes away from our second integer at offset 0x2C.
- To dereference the pointer, the address at offset 0x30 is read and stored in RAX, and the value RAX points to is copied into EAX.
- The third integer is written at offset 0x38. This is exactly 8 bytes away from the offset that is needed to access the pointer (0x30).

I showed that the compiler writes pointers as plain memory addresses to memory. But how does the compiler know the type of the object that the address points to?

If we write int*, the compiler knows that the pointer points to an integer value. This means that the compiler will interpret the bytes at the address that the pointer points to as integers.
If you change the type of a pointer, the memory will be interpreted differently.
Take this code for example:
int a = 8;
int* ptrToInt = &a;
float* ptrToFloat = (float*)ptrToInt;
float b = *ptrToFloat;
This is the stack memory:
000000004A38F8A0 08 00 00 00
000000004A38F8A4
000000004A38F8A8 b8 f8 38 4a
000000004A38F8AC 00 00 00 00
000000004A38F8B0 b8 f8 38 4a
000000004A38F8B4 00 00 00 00
000000004A38F8B8 08 00 00 00
The pointers at 0x4A38F8A8 and 0x4A38F8B0 point to the same address, which is 0x4A38F8B8. The values at 0x4A38F8A0 and 0x4A38F8B8 appear to be equal because the memory contents are exactly the same. Yet, to the compiler, they are two totally different things.

The variable a (0x4A38F8B8) holds the integer value 8. The variable b (0x4A38F8A0) holds the floating-point value 1.121e-44#DEN. If you try to do calculations with these two numbers, the results will be very different.
This applies to all other types in C++, including structures and classes. You can cast any pointer type to another pointer type and access the memory as if the object was created with that type. That said, this behavior is undefined in C++ except for a few cases that are outlined in the strict aliasing rules.[3]
The only pointer that is a pure memory address is the void pointer (void*).
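If you do need to reinterpret the bits of one type as another, the portable approach is to copy the bytes, for example with std::memcpy. A minimal sketch:

#include <cstring>

int a = 8;
float b;
std::memcpy(&b, &a, sizeof(b)); // well-defined: copies the bit pattern of a into b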
I showed how pointers are stored in memory, and how the compiler knows what type of object the pointer points to. But what happens when we do calculations with memory addresses?
The key to understanding pointer arithmetic is that the result of the calculation depends entirely on the pointer type. If your type is a 4 byte integer, then the integer pointer will change in 4 byte increments.
For instance:
int a = 4;
int* ptrA = &a;
int* ptrB = ptrA + 1;
A look at the memory reveals the following picture:
000000000015FCB8 a4 fc 15 00 00 00 00 00 ¤ü......
000000000015FCC0 a8 fc 15 00 00 00 00 00 ¨ü......
ptrA is stored at 0x15FCB8. ptrB is stored at 0x15FCC0.

Although we have incremented ptrA by 1, the difference between the value of ptrA (0x15FCA4) and the value of ptrB (0x15FCA8) is exactly 4 bytes.
The same exact rule applies to any other type. If a struct or class has a size of 16 bytes, incrementing a pointer to that structure by 1 will result in an offset of 16 bytes. If you want to step through memory in 1 byte increments, you have to choose the appropriate type, like a char.
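You can verify this relationship directly. A minimal sketch:

#include <cstdio>

int main()
{
    int arr[2] = { 1, 2 };
    int*  pi = arr;
    char* pc = (char*)arr;
    // incrementing an int* advances by sizeof(int) bytes, a char* by 1 byte
    printf("%d\n", (char*)(pi + 1) == pc + sizeof(int)); // prints 1
}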
There are situations in which you can work with memory and memory addresses without knowing what the contents are. You don't need to know that you are copying floating-point values, as long as you know how many bytes you are copying. It is possible to rewrite algorithms to be more generic by writing them in terms of memory operations, instead of operations on types. Alternatively, you can use templates if code readability and maintainability is a priority, and executable size is not an issue.
In languages like Java, the subscript operator ([]) is only used for arrays. In C++ however, there is no difference between a pointer and an array in terms of how the data is accessed.
The following code will get the third element of the array:
int* array = ...;
int value = array[2];
Why the third and not the second? The number 2 is an offset, relative to the address of the array. The code increments the integer pointer by 2. It returns the value at the memory address by implicitly dereferencing the pointer. This is equivalent to writing:
int* array = ...;
int value = *(array + 2);
Adding 2 to the integer pointer gives us an offset of 8 bytes (sizeof(int) * 2) in memory. We have to explicitly dereference the pointer to get the value.
You can see the equivalence when comparing the generated assembly code:
;int val0 = arr0[2];
mov eax,4
imul rax,rax,2 ;RAX=8
mov rcx,qword ptr [arr0]
mov eax,dword ptr [rcx+rax]
mov dword ptr [val0],eax
;int val1 = *(arr1 + 2);
mov rax,qword ptr [arr1]
mov eax,dword ptr [rax+8]
mov dword ptr [val1],eax
This is also the reason why array[0] is the first value of the array. The offset is 0. You can write *array instead.
Many programming languages have a mechanism that allows you to treat a function like a data type. We can create objects of it, and assign it to variables. This means that we can pass it to other functions that can in turn call that function. Or we can store the function pointer and call it at a later time. This is typically how callback-mechanisms are implemented.
The type system allows us to use the function call operator on function pointers.
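The assembly below presumably corresponds to code along these lines. This is a reconstruction to match the names in the assembly comments, not the author's original listing:

void foo() {}

void bar(void (*fn)(void))
{
    fn(); // call through the function pointer
}

int main()
{
    foo();    // direct call
    bar(foo); // pass the address of foo as a function pointer
}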
;foo()
call foo (013F4C1294h)
;bar(foo)
lea rcx,[foo (013F4C1294h)]
call bar (013F4C128Fh)
;void bar(void(*fn)(void)) {
mov qword ptr [rsp],rcx
;fn()
call qword ptr [rsp]
Bear in mind that the CPU will jump to the address stored in the function pointer regardless of what data is located there. If the memory happens to contain random data, this data will be executed, and will most likely lead to undefined behavior in your program.
Before I get to object oriented programming, I want to touch on some properties of C++ structs. This will be essential to understanding how inheritance and polymorphism are implemented.
The following struct definition:
struct A
{
int a = 0x10;
int b = 0x20;
int c = 0x30;
};
Presents itself in memory like this:
000000000021FE28 10 00 00 00 ....
000000000021FE2C 20 00 00 00 ....
000000000021FE30 30 00 00 00 ....
The structure has a size of 12 bytes.
There are cases in which we want to put a structure inside another structure. When I started learning C++, coming from Java, I would go for this approach:
struct B
{
int b = 0x20;
int c = 0x30;
};
struct A
{
int a = 0x10;
B* b = nullptr;
};
The intention was to include struct B in struct A by using a pointer. It presents itself in memory like this:
000000000025FD10 20 00 00 00 ...
000000000025FD14 30 00 00 00 0...
000000000025FD18 10 00 00 00 ....
000000000025FD1C 00 00 00 00 ....
000000000025FD20 10 fd 25 00 .ý%.
000000000025FD24 00 00 00 00 ....
Oddly, the size of struct A (at 0x25FD18) is 16 bytes, although we only store an integer of 4 bytes and a pointer of 8 bytes, which is 12 bytes in total.
Since I'm compiling this code on a 64-bit machine, the compiler has chosen to pad the structure in order to align the values to specific addresses. This is known as data structure alignment. The alignment allows the CPU to read the memory more efficiently.
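You can verify the padding with sizeof and alignof. A small sketch (the numbers assume a typical 64-bit platform):

static_assert(sizeof(int) == 4 && sizeof(B*) == 8, "typical 64-bit sizes");
static_assert(alignof(A) == 8, "A is aligned to its largest member, the pointer");
static_assert(sizeof(A) == 16, "4 bytes for a + 4 bytes padding + 8 bytes for b");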
But there is also another way to define struct B as being part of struct A:
struct B
{
int b = 0x20;
int c = 0x30;
};
struct A
{
int a = 0x10;
B b;
};
A look at the memory reveals:
000000000017FDD8 10 00 00 00 ....
000000000017FDDC 20 00 00 00 ...
000000000017FDE0 30 00 00 00 0...
We told the compiler that struct B is part of struct A, and it complies. The size of struct A is 12 bytes.
Both definitions carry certain implications when it comes to memory consumption, memory access, and object lifetime. I will discuss this in a later section.
The member fields and member functions of structs and classes are only abstract concepts.
Accessing member fields and calling member functions results in assembly instructions that operate on relative addresses. They take the base address of the object as a parameter. As a result, the same instructions can be used regardless of the object.
Take this definition for instance:
struct A
{
int value;
void setValue(int value) { this->value = value; }
};
A a;
a.setValue(0x10);
In assembly, you will see the following access pattern:
;a.setValue(0x10);
mov edx,10h
lea rcx,[a]
call A::setValue (013F07125Dh)
The value 0x10 is stored in EDX, and the address of the object a is loaded into RCX. When A::setValue is called, the function uses the memory address saved in RCX as the base address for all of its instructions. RCX is effectively holding the address of this.

In C, where you cannot define member functions, you would write something like this:
void setValue(struct A* this, int value);
struct A a;
setValue(&a, 0x10);
You always have to pass a pointer to the object that you want to modify to the function. Calling a member function in C++ does this implicitly.
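To close the loop, here is roughly what the body of A::setValue looks like without optimizations. This is an illustrative MSVC-style sketch, not output copied from a compiler, and the stack offsets are made up:

;void A::setValue(int value) {
mov dword ptr [rsp+10h],edx ; spill the value parameter to the stack
mov qword ptr [rsp+8],rcx   ; spill the this pointer to the stack
mov rax,qword ptr [rsp+8]   ; RAX = this
mov ecx,dword ptr [rsp+10h] ; ECX = value
mov dword ptr [rax],ecx     ; this->value = value (value lives at offset 0)
ret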
Functions are blocks of compiled code that have a name and a memory address associated with them. I already talked about function pointers in the section about Pointers. In this section, I want to examine the mechanisms behind function calls.
When you define a function in C++, the compiler generates assembly code or machine code for the instructions that are inside the function body. It associates the memory address of the generated code with a unique name. This name is a combination of the namespace the function is defined in, the function name, and the number and types of parameters (see name mangling).
One of the key features of the C++ programming language is function overloading. This means that you can define functions with the same name but different sets of input parameters. Because the name mangling also encodes the number of input parameters and their types, C++ is able to find the correct function it needs to call during compilation. You cannot overload functions by only changing the return type of the function. The name mangling does not encode the return type of the function.
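For example:

int   sum(int a, int b);      // ok
float sum(float a, float b);  // ok: the parameter types differ
// double sum(int a, int b);  // error: differs only in the return type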
This is different from C, where there is no name mangling and no function overloading.
Once the code has been compiled, the function names have been resolved to addresses.
A function call generally consists of the following steps:

- The arguments are passed to the function.
- Control is transferred to the function's code.
- The function executes its body.
- The result is returned, and execution continues at the instruction after the call.
However, this description leaves out a lot of details about how a function call is implemented on the actual hardware. Where are the arguments stored? In which order are they passed? Which registers is the function allowed to overwrite? Who cleans up the stack afterwards?

To answer these questions, compiler vendors and operating system developers have come up with many different calling conventions for their platforms throughout the years. These calling conventions are usually part of a broader application binary interface.
A calling convention defines how a function call should be performed, on a machine code level. It guarantees that programs that are written by different people, with different compilers, on different machines, are compatible and can call each other.
This also extends to programs written in different programming languages.
The Microsoft x64 Calling Convention is used on Windows. The System V AMD64 Calling Conventions are used on Unix systems.
To give you a concrete example, let's have a look at the Microsoft x64 Calling Convention.
It defines what type of data exists. This includes signed and unsigned integers of various sizes, floating-point values of various sizes, pointers, as well as arrays, structs and unions. It also defines the alignment of data in memory.
This guarantees that two different programs know how to access the memory correctly, when they pass data between each other.
It defines how parameters are passed to and from functions. The Microsoft x64 calling convention uses a mixture of registers and stack memory. The first four parameters are passed in registers, the rest is written to stack memory. The return value is stored in a register.
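As a rough illustration (not compiler output): integer and pointer arguments go in RCX, RDX, R8, and R9, further arguments go on the stack, and an integer result comes back in RAX. A call like f(1, 2, 3, 4, 5) is compiled along these lines:

mov ecx,1                 ; 1st argument -> RCX
mov edx,2                 ; 2nd argument -> RDX
mov r8d,3                 ; 3rd argument -> R8
mov r9d,4                 ; 4th argument -> R9
mov dword ptr [rsp+20h],5 ; 5th argument -> stack, above the 32-byte shadow space
call f
; the return value is now in EAX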
There are also rules for which registers need to be saved and restored, either by the function that makes the call (the caller) or by the function that is called (the callee).
For instance, if you are using a set of registers for a mathematical calculation, and you call a function, you may need to store these registers to stack memory, otherwise they will be overwritten by the function. If you don't store the registers and continue to work with them after the function call, the result of your calculation might be wrong.
It defines the contents and structure of the so called stack frame. For each function call, a stack frame is created. The stack frame contains the return address, register values that need to be restored after the call, as well as the local variables.
The information in the stack frame is what makes recursive function calls possible.
In the following example, the function foo is called with the value 13. foo performs some calculations using the function parameter, as well as the value of a variable that is stored in stack memory.
int foo(int num)
{
int d = 42;
return num * num + d;
}
int main()
{
return foo(13);
}
This results in the following assembly code:
; int foo(int):
mov DWORD PTR [rsp+8],ecx
sub rsp,24
; int d = 42;
mov DWORD PTR d$[rsp],42
; return num * num + d;
mov eax,DWORD PTR num$[rsp]
imul eax,DWORD PTR num$[rsp]
add eax,DWORD PTR d$[rsp]
add rsp,24
ret 0
; int main()
sub rsp,40
mov ecx,13
call int foo(int)
add rsp,40
ret 0
Inside the main function:

- Stack memory is reserved.
- The argument 13 is stored in ECX.
- The call instruction pushes the return address onto the stack and jumps to foo. The return address points to the first instruction after the function call.

Inside the foo function:

- The parameter num is written from ECX to stack memory.
- Space is reserved for the local variable d in stack memory.
- The value 42 is assigned to d by writing it to the appropriate address in stack memory.
- The calculation is performed, and the result is left in EAX, where the caller expects the return value.
- The reserved stack memory is released, and ret jumps back to the return address.

As you can see, there are a lot of details that need to be taken into account when a function is called. It becomes even more complicated with features like variadic functions, and C++ exception handling. Thankfully the compiler takes care of all these details for us.
There are different ways how object-oriented programming and its main concepts are taught. One of the commonly used examples is that of animals. A cat is an animal. A dog is an animal. As animals, both share certain properties. They can walk, they can emit some kind of sound, they hunt prey, they have fur, etc. However, they exhibit these properties in different ways.
This is to illustrate inheritance and polymorphism based on how humans categorize abstract concepts in a hierarchical structure, with is-a relations. A Cat is-a Mammal, a Mammal is-a Animal, etc.
In practice however, this example is very misleading. The main advantage of inheritance and polymorphism can be illustrated much better when you think about interfaces and abstractions. For instance, if you have a console, a file, a printer, and an internet connection, each of them requires very specific I/O code. This code is based on the hardware, the operating system, and communication protocols.
As a programmer, you find yourself in situations in which you don't care what the implementation is. You just want to read and write data. Thus, one possible abstraction to all the four types of I/O devices is a data stream. The stream has read and write, as well as open and close operations. What actually happens is hidden behind the abstraction. You don't have to know how TCP/IP works in order to use a data stream.
Inheritance is also a tool for extending the functionality of a class, by adding new functions and properties. On a low level, it is implemented exactly like an extension.
If you have read the section about structures, you will understand how inheritance is implemented immediately. Let's take the following example:
struct A
{
int a = 0x10;
void test();
void testA();
};
struct B : public A
{
int b = 0x20;
int c = 0x30;
void test();
};
B b;
Which is represented in memory as:
00000000002AFAF8 10 00 00 00 ....
00000000002AFAFC 20 00 00 00 ...
00000000002AFB00 30 00 00 00 0...
Look familiar? It's no coincidence! You are telling the compiler that you are extending the definition of struct A by the definition of struct B, and it complies.
Since the memory of struct A is contained in the memory of struct B, you can call functions defined in struct A with an object of type struct B. For instance:
B b;
b.test();
b.testA();
This is the generated assembly code:
;B b;
lea rcx,[b]
call B::B (013F1612DFh)
;b.test();
lea rcx,[b]
call B::test (013F1612EEh)
;b.testA();
lea rcx,[b]
call A::testA (013F1612F3h)
Notice how each time the memory address of b is written to the register RCX.
Since struct B defines a function named B::test, it is called instead of A::test. The functions are not virtual, so the compiler selects the function based on the static type of the object at compile time; during name lookup, B::test hides A::test.
As you can see in the assembly code, when calling b.testA(), it calls the implementation of A::testA with a pointer to object b. This is possible because struct B contains struct A at the beginning of its memory - the offsets to its values are the same. The implementation of A::testA simply assumes that there is a struct A in memory. It doesn't know about struct B.
Going back to the example of the data stream: if the file implementation and the printer implementation have a function called write, how do we know which implementation to call when we only have the abstract data stream to work with?
Let's look at the following example:
struct Stream
{
virtual int read() = 0; // pure virtual function
virtual int write() = 0;
};
struct File : public Stream
{
int v = 0x10;
int read() override {
return v;
}
int write() override {
return v;
}
};
struct Printer : public Stream
{
int v = 0x20;
int read() override {
return v;
}
int write() override {
return v;
}
};
void operate(Stream* stream)
{
std::cout << stream->write() << std::endl;
}
File file;
Printer printer;
operate(&file);
operate(&printer);
If we look at how struct File is represented in memory, we notice something odd:
000000000024F8B8 78 9d dd 3f x."?
000000000024F8BC 01 00 00 00 ....
000000000024F8C0 10 00 00 00 ....
struct Stream doesn't contain any members, so why are there 8 bytes before our 0x10?
As you can see from the source code, the compiler allows us to pass a pointer to an object of type struct File to a function that takes a pointer to a struct Stream. When Stream::write is called, the only way for the processor to know which implementation to call is to look at the so-called virtual function table (VFT) of the object.

The VFT is an array of function pointers. We tell the compiler which functions to put into the VFT by using the keyword virtual.
The first element of the memory listing is in fact a pointer to the VFT of the type struct File. The following assembly code shows when and how the address is written to the object:
;Stream::Stream ()
mov qword ptr [rsp+8],rcx
push rdi
mov rax,qword ptr [this]
lea rcx,[Stream::vftable]
mov qword ptr [rax],rcx
mov rax,qword ptr [this]
pop rdi
ret
;File::File ()
mov qword ptr [rsp+8],rcx
push rdi
sub rsp,20h
mov rdi,rsp
mov ecx,8
mov eax,0CCCCCCCCh
rep stos dword ptr [rdi]
mov rcx,qword ptr [this]
mov rcx,qword ptr [this]
call Stream::Stream
mov rax,qword ptr [this]
lea rcx,[File::vftable]
mov qword ptr [rax],rcx
mov rax,qword ptr [this]
mov dword ptr [rax+8],10h
mov rax,qword ptr [this]
add rsp,20h
pop rdi
ret
You can see how the struct File constructor calls the struct Stream constructor, because of the inheritance. Then the address of the struct File VFT is written to the beginning of the object. After that, the fields of the object are initialized.

All instances of struct File share the same VFT, which happens to be located at 0x13FDD9D78 in my example. Let's see what we can find there:
000000013FDD9D78 4f 11 dd 3f 01 00 00 00 O..?....
000000013FDD9D80 62 12 dd 3f 01 00 00 00 b..?....
Since struct Stream has two virtual functions, and we override both in struct File, there are two function pointers in the VFT. One points to the implementation of File::read at 0x13FDD114F, and the other one to the implementation of File::write at 0x13FDD1262.
Now that we know what a VFT is and how it gets there, it's time to take a look at the virtual function call mechanism:
;void operate(Stream* stream) {
mov rax,qword ptr [stream] ;RAX=0x24F8B8
mov rax,qword ptr [rax] ;RAX=0x13FDD9D78
mov rcx,qword ptr [stream] ;RCX=0x24F8B8
call qword ptr [rax+8] ;call 0x13FDD1262
- The address of the stream object is read from stack memory into RAX.
- The first 8 bytes of the object, the address of the VFT, are read into RAX.
- The address of the object is loaded into RCX, which becomes the this pointer inside the member function.
- To call File::write, which is the second address in the VFT, 8 is added to the address of the VFT. This gives us the function pointer of File::write. We jump to the address that the function pointer points to, and continue the execution.

As you will see by the console output, the processor manages to call the correct implementation. For the file object, the number 0x10 is printed out. For the printer object, the number 0x20 is printed out.
When reading and writing C++ programs, you should develop a sense of when memory is being copied, and when in doubt, check the implementation.
Take this code for example:
struct A
{
int a;
int b;
int c;
};
A createStruct();
A* createStructPtr();
A a = { 1, 2, 3 };
A b = a;
A c = createStruct();
A* d = createStructPtr();
A e = *d;
- A a = { 1, 2, 3 }; creates an instance of struct A in stack memory. Each value is copied to the stack.
- When assigning a to b, the entire contents of the object is copied to a new object in memory, byte by byte. If the object is very big, this can take a considerable amount of time.
- A c = createStruct(); copies the object returned by the function into c.
- A* d = createStructPtr(); copies only the pointer, 8 bytes, not the object it points to.
- A e = *d; dereferences the pointer and copies the entire object into e.

When you create a class or structure, the following constructors and operators are created for you by default: the default constructor, the copy constructor, the copy assignment operator, the move constructor, the move assignment operator, and the destructor. They control how objects are moved or copied in memory.
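In code, these implicitly generated special member functions correspond to the following declarations:

struct A
{
    A();                     // default constructor
    A(const A&);             // copy constructor
    A& operator=(const A&);  // copy assignment operator
    A(A&&);                  // move constructor
    A& operator=(A&&);       // move assignment operator
    ~A();                    // destructor
};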
By overloading them, you can specify exactly what needs to happen when an object is copied. For instance:
struct Array
{
Array(int size) : size(size) { data = new int[size]; }
~Array() { delete[] data; }
const int* data;
const int size;
};
Array a(0x20);
Array b = a;
The first struct Array object is created on the stack. The constructor allocates enough memory on the heap to store 32 integer values. The pointer to this memory is written to our object. So far so good, but what happens when we assign array a to array b?
The assembly code holds the answer:
;Array a(0x20);
mov edx,20h
lea rcx,[a]
call Array::Array (013FE71294h)
;Array b = a;
lea rax,[b]
lea rcx,[a]
mov rdi,rax
mov rsi,rcx
mov ecx,10h
rep movs byte ptr [rdi],byte ptr [rsi] ;memcpy
The constructor for b is not called. Instead, the contents of a is simply copied to b, byte by byte. As a result, b contains the pointer to the same memory as a. The data on the heap is not copied.
This is very risky. The destructor of struct Array deallocates the buffer on the heap. If the other array tries to access it, it will result in undefined behavior. If the other array object is destructed, the code will attempt to delete the memory for a second time, which is also undefined behavior.
If you overload the copy assignment operator, you can tell the compiler to allocate new memory and make a copy of the data on the heap. Alternatively, you can delete the copy constructor and copy assignment operator. You will get a compile error whenever you try to use the array in a way that was not intended.
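A sketch of what a deep-copying copy constructor could look like for this struct (illustrative, error handling omitted):

Array(const Array& other) : size(other.size)
{
    int* copy = new int[other.size];  // allocate a separate buffer
    for (int i = 0; i < other.size; i++)
        copy[i] = other.data[i];      // duplicate the contents on the heap
    data = copy;                      // each object now owns its own memory
}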
This is one of many examples in which it is not clear what the assignment operator actually does. If you are in doubt, always check the implementation. Otherwise, you could be copying a lot of data without even knowing it.
Note: Another solution is to overload the move assignment operator and use std::move to transfer the contents of object a into b and invalidate object a. If implemented correctly, this eliminates the undefined behavior (dangling pointers, double-frees).
Every variable you declare between two curly braces {} is only valid inside that scope. Once the program exits the scope, the variable becomes invalid.
This mechanism is strongly tied to the stack. If you create an object on the stack, its lifetime is defined by the scope in which it was created. Once the program exits the scope, the destructor of the object is called.
Consider the following code:
Object* createObject()
{
Object a;
return &a;
}
If you create an object on the stack, inside a function scope, and decide to return a pointer to that object, you are returning a pointer to a memory address that will be invalidated and probably overwritten. This can result in undefined behavior if you try to read from this pointer at a later time.
If you want a value to survive the lifetime of a scope, you have to either copy it or allocate it on the heap.
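Both alternatives in code (a minimal sketch):

Object createByValue()
{
    Object a;
    return a;            // returned by value: copied or moved out (often elided)
}

Object* createOnHeap()
{
    return new Object(); // lives beyond the scope, until explicitly deleted
}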
Consider the following code:
{
int* arrA = new int[1000];
if (conditionAIsFalse)
{
delete[] arrA;
return -1;
}
int* arrB = new int[2000];
if (conditionBIsFalse)
{
delete[] arrB;
delete[] arrA;
return -2;
}
delete[] arrB;
delete[] arrA;
return 0;
}
We use dynamic memory allocation to allocate memory on the heap. For the program to run correctly, and not leak memory, we have to delete the allocated memory when we exit the function. Since we have multiple exits, it becomes difficult to track what needs to be deleted and what not.
An alternative is to only have one exit point and nest the if-statements. You have to be very careful to set the return value correctly. The readability suffers the more nested if-statements you have.
{
int returnValue = 0;
int* arrA = new int[1000];
if (conditionAIsTrue)
{
int* arrB = new int[2000];
if (conditionBIsTrue)
{
returnValue = 0;
}
else
{
returnValue = -2;
}
delete[] arrB;
}
else
{
returnValue = -1;
}
delete[] arrA;
return returnValue;
}
Or if you feel very adventurous, you can even use the goto
statement to jump to the clean-up code whenever an error occurs.
In any case, you have to be very careful about deallocating memory. If you forget, your program will leak memory. If you delete memory more than once, you end up with undefined behavior.
In C++, we can hide some of the memory management details by using the scope of variables to our advantage, and write the following code:
struct Array
{
Array(int size) { data = new int[size]; }
~Array() { delete[] data; }
int* data;
};
{
Array arrA(0x10);
if (conditionAIsFalse)
return -1;
Array arrB(0x20);
if (conditionBIsFalse)
return -2;
return 0;
}
Notice how we create arrA and arrB on the stack. They only contain a pointer to a buffer. The actual integer buffers are created dynamically on the heap when the constructor is called. The moment the program leaves the scope, the destructor of struct Array is called. The compiler figures out when and where to do this for us. We can simply use the array and not worry about the memory management.
Here is an excerpt from the generated assembly code:
;Array arrB(0x20);
mov edx,20h
lea rcx,[arrB]
call Array::Array (013FA910D7h)
;if (conditionBIsFalse)
movzx eax,byte ptr [conditionBIsFalse]
test eax,eax
je main+0A6h (013FA918E6h)
;return -2;
mov dword ptr [rsp+68h],0FFFFFFFEh
lea rcx,[arrB]
call Array::~Array (013FA91127h)
nop
lea rcx,[arrA]
call Array::~Array (013FA91127h)
mov eax,dword ptr [rsp+68h]
jmp main+0C7h (013FA91907h)
Before the struct Array constructor is called, the array size is stored in EDX, and the memory address of arrB is stored in RCX.
Next the value of the bool is checked. If it is false, we jump forward to the code that continues our function. If it is true, we continue executing the return code, which stores the value -2 in stack memory for later reference, and calls the struct Array destructor on arrB and then on arrA. Last but not least, the return value is copied to EAX, and the program jumps to some function wrap-up code.
Note how arrA and arrB are destructed in reverse order. This adheres to the Last-In-First-Out behavior of the stack data structure.
Using the scoped lifetime of objects in order to allocate and deallocate resources automatically is known as RAII. RAII is used throughout the standard library. For example, a std::vector is a container that dynamically allocates an array on the heap, and automatically cleans up its resources when it goes out of scope. RAII is also used to close files and streams, release locks, etc.
Note: Please use std::unique_ptr in combination with std::make_unique. Using std::unique_ptr will ensure that the allocated memory is cleaned up when the pointer object goes out of scope.
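A minimal sketch of the array example with std::unique_ptr:

#include <memory>

{
    auto data = std::make_unique<int[]>(0x20); // allocates 32 ints on the heap
    data[0] = 42;
} // the unique_ptr destructor calls delete[] automatically here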
Should your data live on the stack, on the heap, or in static memory? You should base your decision on the properties of each memory type.
A major feature of the C++ programming language is templates. Let me quickly explain what problem templates solve.
Imagine you are writing a piece of software and you need a mathematical function that adds two integer numbers and returns the result. You write the following code:
int sum(int a, int b)
{
return a + b;
}
At some point, you decide to support floating-point numbers in your application. However, you cannot just use the already existing function because if you passed floating-point values to it, they would be implicitly converted to integer values, thus truncating them. You decide to add another function that handles single-precision floating-point numbers:
int sum(int a, int b)
{
return a + b;
}
float sum(float a, float b)
{
return a + b;
}
The algorithm is the same in both functions. The only difference is the types. If you wanted to modify the algorithm, you would have to change the implementation in both functions. This might not seem like a problem now, but if you had dozens of functions, and they each consisted of a few hundred lines of code, it would become a problem.
Templates let you express an algorithm in a generic way by adding placeholders into the code that will be substituted with concrete types or values when the template is used.
// template definition
template<typename T>
T sum(T a, T b)
{
return a + b;
}
// template instantiation
sum<int>(5, 10);
sum(0.3f, 0.1f); // the type is deduced
The expression sum<int> instructs the compiler that you instantiate the function template with type T=int. This means that anywhere T is used in the code, it is substituted with int.
The second function call is valid because the compiler can deduce which type to use for the function template, based on the type of the arguments. Since both arguments are of type float, the function template is instantiated with T=float.
In essence, templates are a way to control code generation.
Given the last example, let's dive a little bit deeper and look at the assembly code that is being generated. Since C++20 you can define templates by using the auto keyword as a type placeholder.
auto sum(auto a, auto b)
{
return a + b;
}
sum(5, 10);
sum(0.3f, 0.1f);
sum(0.8, 5);
To a C++ beginner, the code might look like we are reusing the same function three times. However, when we look at the assembly, we see that this code has generated three different versions of the same function:
;auto sum<int, int>(int, int):
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov dword ptr [rbp - 8], esi
mov eax, dword ptr [rbp - 4]
add eax, dword ptr [rbp - 8]
pop rbp
ret
;auto sum<float, float>(float, float):
push rbp
mov rbp, rsp
movss dword ptr [rbp - 4], xmm0
movss dword ptr [rbp - 8], xmm1
movss xmm0, dword ptr [rbp - 4]
addss xmm0, dword ptr [rbp - 8]
pop rbp
ret
;auto sum<double, int>(double, int):
push rbp
mov rbp, rsp
movsd qword ptr [rbp - 8], xmm0
mov dword ptr [rbp - 12], edi
movsd xmm0, qword ptr [rbp - 8]
cvtsi2sd xmm1, dword ptr [rbp - 12]
addsd xmm0, xmm1
pop rbp
ret
- In sum<int, int>, the parameters arrive in the integer registers EDI and ESI, and the addition is performed with the integer instruction add.
- In sum<float, float>, the parameters are processed in xmm0, using the single-precision floating-point instructions movss and addss.
- In sum<double, int>, the double is loaded into xmm0 using the double-precision floating-point instruction movsd. The signed 32-bit integer is converted to a double-precision (64-bit) floating-point value and loaded into xmm1 with cvtsi2sd.

The generated code depends on the types of the values that are passed to the function. This also means that there is no type conversion required to call these functions:
; sum(5, 10);
mov edi, 5
mov esi, 10
call auto sum<int, int>(int, int)
; sum(0.3f, 0.1f);
movss xmm0, dword ptr [rip + .LCPI1_1]
movss xmm1, dword ptr [rip + .LCPI1_2]
call auto sum<float, float>(float, float)
; sum(0.8, 5)
movsd xmm0, qword ptr [rip + .LCPI1_0]
mov edi, 5
call auto sum<double, int>(double, int)
The same name mangling mechanism that is used for function overloading is also used for templates.
If you enable compiler optimizations, the compiler will try to inline these functions whenever possible. That said, it does not change the fact that you are using a code generation facility.
We have seen how you can generate code by type substitution. C++ also allows you to generate code based on value substitution.
template<int A, int B>
int sum()
{
return A + B;
}
sum<8, 10>();
sum<42, 8>();
This compiles to:
;int sum<8, 10>():
push rbp
mov rbp, rsp
mov eax, 18
pop rbp
ret
;int sum<42, 8>():
push rbp
mov rbp, rsp
mov eax, 50
pop rbp
ret
The compiler has generated two sum implementations, one for each distinct template parameterization. Since the values of the template parameters are known at compile time, the compiler has calculated the sum of both values and written the result directly into the code.
; sum<8, 10>();
call int sum<8, 10>()
; sum<42, 8>();
call int sum<42, 8>()
Both functions are called directly, without passing any arguments. As stated before, the template parameters are used in the name mangling, making both function names unique.
Taking it one step further, you can use templates recursively to generate code.
template<int N>
int sum()
{
return N + sum<N-1>();
}
// define stop condition with template specialization
template<>
int sum<1>()
{
return 1;
}
sum<3>();
This will result in the following assembly:
;int sum<3>():
push rbp
mov rbp, rsp
call int sum<2>()
add eax, 3
pop rbp
ret
;int sum<2>():
push rbp
mov rbp, rsp
call int sum<1>()
add eax, 2
pop rbp
ret
;int sum<1>():
push rbp
mov rbp, rsp
mov eax, 1
pop rbp
ret
Keep in mind that the optimizer may inline all the calls and calculate the result at compile-time.
The assembly code that you have seen so far was all unoptimized. Optimized code can look entirely different from what you would expect. When the compiler optimizes your code, it makes decisions based on the context in which the code is used. It can only make optimizations for things that it knows at compile-time. The compiler cannot predict what the value of a variable will be at runtime.
This section deals with different compiler optimizations. This might give you some ideas about how you could structure your code, so that it is easier to optimize. Of course you need to check the assembly code, to see if your efforts were fruitful. And you need to benchmark your code before and after the optimization, to see if it has any significant effect on the performance of your program.
SIMD stands for Single Instruction Multiple Data. SIMD instructions are used to speed up computation by processing multiple values at a time. For instance, you can add or multiply a packet of 4 or 8 floats in parallel. This can have advantages in areas that are computation heavy, like image and audio processing.
Each CPU supports different SIMD instruction sets. The most common are variants of SSE and AVX. The main difference being the size of the registers and the operations that you can perform. You have to be able to recognize which instruction set and registers are being used in the generated assembly.
Compilers will try to auto-vectorize loops in which the compiler detects certain memory access patterns. There are however calculations that cannot be easily auto-vectorized, either due to data dependencies, unpredictable loop stride, complex loop body, or other compiler-specific reasons.
Let's start with a simple example. We want to process an array of floats by multiplying the input with a value, and writing the result to the output.
void process(float* in, float* out, int len, float value)
{
for (int i = 0; i < len; i++)
{
out[i] = in[i] * value;
}
}
In the unoptimized version you might see something like this.
$LN2@process:
mov eax, DWORD PTR i$1[rsp]
inc eax ; i++
mov DWORD PTR i$1[rsp], eax
$LN4@process:
mov eax, DWORD PTR len$[rsp]
cmp DWORD PTR i$1[rsp], eax
jge SHORT $LN3@process ; if i >= len then leave loop
movsxd rax, DWORD PTR i$1[rsp]
mov rcx, QWORD PTR in$[rsp]
movss xmm0, DWORD PTR [rcx+rax*4] ; XMM0 = in[i]
mulss xmm0, DWORD PTR value$[rsp] ; XMM0 = XMM0 * value
movsxd rax, DWORD PTR i$1[rsp]
mov rcx, QWORD PTR out$[rsp]
movss DWORD PTR [rcx+rax*4], xmm0 ; out[i] = XMM0
jmp SHORT $LN2@process
The register xmm0 and the instructions movss and mulss point to the use of SIMD instructions. However, if you check the individual instructions in the Intel Intrinsics Guide, you will see that the code is processing each individual float value. The ss part of movss implies that one single-precision element is moved from memory. xmm0 is a 128-bit register that could technically hold 4 floats. The DWORD also gives a hint that we are accessing single elements.
To optimize the code, we will use the compiler option /O2. To check that the code was actually auto-vectorized, we use the option /Qvec-report:2. This will print out an auto-vectorizer report during compilation. Check the links below for more information.
Once compiled, the vectorization report reads:
-- Analyzing function: void __cdecl process(float * __ptr64,float * __ptr64,int,float)
<source>(4) : info C5001: loop vectorized
The generated assembly consists of many different parts. I will point out the most relevant to you.
movaps xmm2, xmm3 ; XMM2 = XMM3 = (value,0,0,0)
shufps xmm2, xmm2, 0 ; XMM2 = (value,value,value,value)
...
$LL4@process:
movups xmm0, XMMWORD PTR [rdx+rax-16] ; XMM0 = in[?]
add r11d, 16
add r9, 16 ; increment by 16 byte (4 floats)
movups xmm1, XMMWORD PTR [rax+rdx] ; XMM1 = in[?]
lea rax, QWORD PTR [rax+64]
mulps xmm0, xmm2 ; XMM0 = XMM0 * XMM2
mulps xmm1, xmm2 ; XMM1 = XMM1 * XMM2
movups XMMWORD PTR [rax-80], xmm0 ; out[?] = XMM0
movups xmm0, XMMWORD PTR [rdx+rax-48] ; XMM0 = in[?]
movups XMMWORD PTR [rax-64], xmm1 ; out[?] = XMM1
mulps xmm0, xmm2 ; XMM0 = XMM0 * XMM2
movups xmm1, XMMWORD PTR [rdx+rax-32] ; XMM1 = in[?]
mulps xmm1, xmm2 ; XMM1 = XMM1 * XMM2
movups XMMWORD PTR [rax-48], xmm0 ; out[?] = XMM0
movups XMMWORD PTR [rax-32], xmm1 ; out[?] = XMM1
cmp r9, rbx
jl SHORT $LL4@process
The first thing you should see is that movss became movups. The ups in movups indicates that we are moving a pack of 4 single-precision floating-point values, from a memory address that might be unaligned[4]. We are moving the floats into XMM registers. As we've established, XMM registers are 128 bits wide and can hold 4 floats. The XMMWORD in the assembly also hints at the fact that we are reading 128 bits from memory.
One thing to note here is that the assembly shows 4 multiplications, 4 movups reading from memory, and 4 movups writing to memory. This is referred to as loop unrolling. It is processing 4 packs of 4 floats for a total of 16 floats per iteration. The goal of loop unrolling is better processor utilization by increasing the throughput.
But what happens if the length of our array is not divisible by 16? What if we are processing 21 values? There would be a remainder of 5 values that we cannot process with this assembly code. The answer lies in the next part of the generated assembly.
$LL18@process:
lea rax, QWORD PTR [rax+16]
movaps xmm0, xmm3 ; XMM0 = value
mulss xmm0, DWORD PTR [rcx+rax-20] ; XMM0 = XMM0 * in[?]
movaps xmm1, xmm3 ; XMM1 = value
movss DWORD PTR [rax-20], xmm0 ; out[?] = XMM0
movaps xmm0, xmm3 ; XMM0 = value
mulss xmm1, DWORD PTR [rax+rcx-16] ; XMM1 = XMM1 * in[?]
movss DWORD PTR [rax-16], xmm1 ; out[?] = XMM1
movaps xmm1, xmm3 ; XMM1 = value
mulss xmm0, DWORD PTR [rcx+rax-12] ; XMM0 = XMM0 * in[?]
movss DWORD PTR [rax-12], xmm0 ; out[?] = XMM0
mulss xmm1, DWORD PTR [rcx+rax-8]  ; XMM1 = XMM1 * in[?]
movss DWORD PTR [rax-8], xmm1 ; out[?] = XMM1
sub r8, 1 ; decrease iterator
jne SHORT $LL18@process
The compiler has switched back to single floating-point values, and decided to unroll the loop. We are now processing 4 floats per iteration. We had 21 floats, and we already processed 16, which left us with a remainder of 5 floats. This loop processes another 4 floats, leaving us with 1 float. Maybe you have already guessed what follows.
$LC8@process:
lea rax, QWORD PTR [rax+4]
movaps xmm0, xmm3 ; XMM0 = value
mulss xmm0, DWORD PTR [rax+rcx-4] ; XMM0 = XMM0 * in[i]
movss DWORD PTR [rax-4], xmm0 ; out[i] = XMM0
sub rdx, 1 ; decrease iterator
jne SHORT $LC8@process
We are back to processing single floats, one float per iteration. We have processed our last float and are finally done.
To summarize, the compiler generated three versions of the loop for this particular function: a vectorized version that processes 16 floats per iteration, an unrolled scalar version that processes 4 floats per iteration, and a scalar version that processes the remaining floats one at a time.
You have to appreciate how much work the compiler does for us. This is not code that you want to write yourself. Therefore, whenever you write processing code that needs to run fast, check if the compiler is able to auto-vectorize it. A few simple changes might already be enough. If that's not the case, look into SIMD optimization with compiler intrinsics.
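If you do go down the intrinsics route, the core of the loop could look something like this. An SSE sketch that processes one pack of 4 floats per iteration; it assumes len is a multiple of 4 and handles no remainder:

#include <immintrin.h>

void process(const float* in, float* out, int len, float value)
{
    __m128 factor = _mm_set1_ps(value);                // (value,value,value,value)
    for (int i = 0; i < len; i += 4)
    {
        __m128 x = _mm_loadu_ps(in + i);               // load 4 floats (unaligned ok)
        _mm_storeu_ps(out + i, _mm_mul_ps(x, factor)); // multiply and store 4 floats
    }
}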
Vectorized code can increase the performance of your code significantly.
I hope that I could illustrate in this article that checking the generated assembly code is a good way to verify that the instructions you write do what you expect them to do.
This will not only give you more insight into the language and its implementation, but also stronger arguments for or against the use of certain language features or framework functions, when you are discussing your particular case with colleagues.
Knowing your way around assembly code and memory dumps will also help you in investigating crashes. You will start to recognize certain implementation patterns like loops, switch-statements, virtual function calls, compiler optimizations, and others, which will give you some intuition about what the program is doing and what kind of code your compiler is generating.
There are also cases in which a program will crash due to unsupported instructions, i.e. when executing AVX instructions on a CPU that does not support the AVX instruction set. There you have no other choice than to look at the assembly, otherwise you won't know what is causing the error. The source code alone can't tell you.
The downside to this is that every compiler generates different assembly for the same C++ code, and the factors that influence this are plentiful: the compiler vendor and version, the optimization level, the target architecture, and the platform's calling convention, to name a few.
If you want to quickly check the generated assembly for a piece of code with different compilers, Godbolt's Compiler Explorer has you covered.
That said, you will always have to profile your application because the generated assembly code is not the only factor that determines the behavior of your program during execution.
Thanks to the redditors of the r/cpp subreddit for feedback and suggestions.
[4] Instructions that can operate on unaligned memory are marked with a u, i.e. movups. The compiler cannot know at compile-time if the memory address is aligned or unaligned, unless there are annotations in the code that give the compiler hints.