[DotNetBook] Type Instance Structure and VMT

With this article I continue publishing a series of articles that will eventually become a book on the internals of the .NET CLR and of .NET as a whole.

The entire book will be available on GitHub: CLR Book. Issues and Pull Requests are welcome :)

This is an excerpt from the chapter on the structure of types and their VMT.

Structure of objects in memory

Until now, when discussing the difference between value types and reference types, we looked at the topic from the application developer's point of view. That is, we did not look at how they are actually arranged at the CLR level or how their mechanics are implemented internally; we only looked at the end result. However, to understand things more deeply and to get rid of the last remaining thoughts about any magic happening inside the CLR, it is worth looking at its internals.

Note

The chapter published on Habr is not updated and may already be somewhat outdated. For a more recent text, please refer to the original:

CLR Book: GitHub, table of contents
CLR Book: GitHub, chapter
Release 0.5.2 of the book, PDF: GitHub Release

Internal structure of type instances

If we talk about classes as data types, it is enough to recall their basic layout. Let's start with the `object` type, which is the base type and defines the structure of all reference types:

System.Object

    ----------------------------------------------
    |  SyncBlkIndx |    VMTPtr    |     Data     |
    ----------------------------------------------
    |    4 / 8     |    4 / 8     |    4 / 8     |
    ----------------------------------------------
    |  0xFFF..FFF  |  0xXXX..XXX  |      0       |
    ----------------------------------------------
                   ^
                   | References to the object point here: not to the start of the object, but to the VMT pointer

    Sum size = 12 (x86) .. 24 (x64)

That is, the actual size depends on the target platform the application runs on.

Now, to gain further insight into what we are dealing with, let's follow the VMTPtr pointer. For the entire type system this pointer is the most important one: inheritance, interface implementation, type casts and many other things work through it. This pointer is the link into the .NET CLR type system.

Virtual Methods Table

The description of the table itself is available in the CoreCLR repository on GitHub, and if you drop everything unnecessary (and there are 4381 lines! The guys from the CoreCLR team are not shy), it looks like this:

This is the CoreCLR version. In the .NET Framework, the fields are laid out in a different order.

    // Low WORD is component size for array and string types (HasComponentSize() returns true).
    // Used for flags otherwise.
    DWORD m_dwFlags;

    // Base size of instance of this class when allocated on the heap
    DWORD m_BaseSize;

    WORD m_wFlags2;

    // Class token if it fits into 16-bits. If this is (WORD)-1, the class token is stored in the TokenOverflow optional member.
    WORD m_wToken;

    // <NICE> In the normal cases we shouldn't need a full word for each of these </NICE>
    WORD m_wNumVirtuals;
    WORD m_wNumInterfaces;

Agree, it looks scary. And it is scary not because there are only 6 fields here (where are all the rest?), but because we had to skip 4,100 lines of logic to get to them. But let's not be discouraged and try to get some benefit from this right away: we have no idea yet what the other fields mean, but the `m_BaseSize` field looks tempting. As the comment tells us, this is the actual size of an instance of the type. Shall we try it in battle?

To get the VMT address we can go two ways. We can take the hard way and obtain the address of the object, and from it the address of the VMT (part of this code has already appeared in this book, but don't scold me: I don't want you to have to look for it):

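    using System;
    using System.Runtime.InteropServices;

    class Program
    {
        public static unsafe void Main()
        {
            Union x = new Union();
            x.Reference.Value = "Hello!";

            // The reference points to the location that holds
            // the pointer to the VMT
            // - (IntPtr*)x.Value.Value - treat the number as a pointer (change the type for the compiler)
            // - *(IntPtr*)x.Value.Value - read the VMT address stored at the object's address
            // - (void *)*(IntPtr*)x.Value.Value - convert it to a pointer
            void *vmt = (void *)*(IntPtr*)x.Value.Value;

            // print the VMT address to the console
            Console.WriteLine((ulong)vmt);
        }

        // Both fields share offset 0, so the same reference can be read
        // either as Holder<object> or as Holder<IntPtr>
        [StructLayout(LayoutKind.Explicit)]
        public class Union
        {
            public Union()
            {
                Value = new Holder<IntPtr>();
                Reference = new Holder<object>();
            }

            [FieldOffset(0)]
            public Holder<IntPtr> Value;

            [FieldOffset(0)]
            public Holder<object> Reference;
        }

        public class Holder<T>
        {
            public T Value;
        }
    }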

Or we can get the same address through the .NET FCL API:

  var vmt = typeof(string).TypeHandle.Value;

The second way is of course simpler (although it runs slower). However, knowing the first one is very important for understanding the structure of a type instance. The second way does add a sense of confidence: if we call an API method, we seem to be using a documented way of working with the VMT, whereas going through pointers is not documented. But do not forget that storing a `VMT *` is standard for almost any OOP language and for the .NET platform as a whole: it is always in the same place.
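As a quick sanity check of the API route, here is a minimal sketch that compares the handle obtained from `typeof` with the handle obtained from a live instance. It assumes only that `RuntimeTypeHandle.Value` identifies the type within a running process; the class and method names (`TypeHandleDemo`, `FromType`, `FromInstance`) are made up for the illustration.

```csharp
using System;

public static class TypeHandleDemo
{
    // Handle of a type known at compile time.
    public static IntPtr FromType(Type type) => type.TypeHandle.Value;

    // Handle of the run-time type of a live instance.
    public static IntPtr FromInstance(object instance) => Type.GetTypeHandle(instance).Value;

    public static void Main()
    {
        IntPtr viaTypeof = FromType(typeof(string));
        IntPtr viaInstance = FromInstance("Hello!");

        // Both routes lead to the same type, so the handles are equal.
        Console.WriteLine(viaTypeof == viaInstance);

        // Different types have different handles.
        Console.WriteLine(viaTypeof == FromType(typeof(int)));
    }
}
```

The numeric value itself changes from run to run, so only comparisons within one process are meaningful.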

Let's explore the structure of types in terms of the size of their instances. We will not just study it abstractly (that would be plain boring); we will also try to extract a benefit from it that cannot be obtained in the usual way.

Why does `sizeof` exist for value types but not for reference types? In fact the question is open, because nothing prevents calculating the size of a reference type. The only stumbling block is that two reference types do not have a fixed size: `Array` and `String`. The same goes for the `Generic` group, whose size depends entirely on the concrete type arguments. That is, the `sizeof(..)` operator would not be enough: you need to work with specific instances. However, nothing prevents creating a method like `static int System.Object.SizeOf(object obj)` that would simply return what we need. So why didn't Microsoft implement such a method? One idea is that, in their understanding, .NET is not a platform where the developer worries much about specific bytes: if need be, you can simply add another memory module to the motherboard. Moreover, most of the data types we implement do not take up much space. And those who really need it will calculate all the sizes themselves. The latter, of course, is debatable.

But let's not get distracted. To get the size of an instance of any class whose instances have a fixed size, it is enough to write the following code:

    unsafe int SizeOf(Type type)
    {
        MethodTable *pvmt = (MethodTable *)type.TypeHandle.Value.ToPointer();
        return pvmt->Size;
    }

    [StructLayout(LayoutKind.Explicit)]
    public struct MethodTable
    {
        [FieldOffset(4)]
        public int Size;
    }

    class Sample
    {
        int x;
    }

    class GenericSample<T>
    {
        T fld;
    }

    // ...

    Console.WriteLine(SizeOf(typeof(Sample)));

So what did we just do? First we obtained a pointer to the virtual method table. Then we cast it to a pointer to our (very simplified) description of that table. After that we read the size and get `12`: the sum of the sizes of `SyncBlockIndex + VMT_Ptr + field x` on a 32-bit platform. If we play around with different types, we get something like the following table:

Type or its definition	Size	Comment
Object	12	SyncBlk + VMT + empty field
Int16	12	Boxed Int16: SyncBlk + VMT + Data (aligned to 4 bytes on x86)
Int32	12	Boxed Int32: SyncBlk + VMT + Data
Int64	16	Boxed Int64: SyncBlk + VMT + Data
Char	12	Boxed Char: SyncBlk + VMT + Data (aligned to 4 bytes on x86)
Double	16	Boxed Double: SyncBlk + VMT + Data
IEnumerable	0	An interface has no size: you have to take obj.GetType()
List<T>	24	No matter how many items are in the List<T>, it takes the same amount of space, because it stores its data in an array that is not counted here
GenericSample<int>	12	As you can see, generics are calculated just fine. The size has not changed, because the data sits in the same place as in a boxed int. Total: SyncBlk + VMT + Data = 12 bytes (x86)
GenericSample<Int64>	16	Similarly
GenericSample<IEnumerable>	12	Similarly
GenericSample<DateTime>	16	Similarly
string	14	This value is returned for any string, because the real size has to be calculated dynamically. However, it does fit the size of an empty string. Note that the size is not aligned to the word size: essentially, this field is not supposed to be used
int[]{1}	24554	For arrays this location holds entirely different data, and their size is not fixed, so it has to be calculated separately

As you can see, when the system stores the size of an instance of a type, it actually stores it for the reference form of the type (including the boxed form of a value type). Let's draw some conclusions:

If you want to know how much space a value type takes as a value, use `sizeof(TType)`.
If you want to calculate what boxing will cost you, round `sizeof(TType)` up to the processor word size (4 or 8 bytes) and add 2 more words. Or take this value from the `VMT` of the type.
If you need to understand how much a heap allocation will cost, there are three cases: fixed-size reference types (covered above), strings and arrays.
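The boxing rule from the conclusions above (round the value size up to a word, then add two words) can be sketched as plain arithmetic. This is only an illustration of that rule, not a runtime API: `BoxingCost`, `AlignUp` and `BoxedSize` are hypothetical names, and the word size is passed in explicitly so that the x86 numbers from the table can be reproduced on any machine.

```csharp
using System;

public static class BoxingCost
{
    // Rounds size up to the nearest multiple of wordSize (4 on x86, 8 on x64).
    public static int AlignUp(int size, int wordSize) =>
        (size + wordSize - 1) / wordSize * wordSize;

    // Estimated heap size of a boxed value: the value rounded up to a word,
    // plus two words for the sync block index and the VMT pointer.
    public static int BoxedSize(int valueSize, int wordSize) =>
        AlignUp(valueSize, wordSize) + 2 * wordSize;

    public static void Main()
    {
        const int x86 = 4;
        Console.WriteLine(BoxedSize(sizeof(short), x86));  // 12
        Console.WriteLine(BoxedSize(sizeof(int), x86));    // 12
        Console.WriteLine(BoxedSize(sizeof(long), x86));   // 16
        Console.WriteLine(BoxedSize(sizeof(char), x86));   // 12
        Console.WriteLine(BoxedSize(sizeof(double), x86)); // 16
    }
}
```

The results match the Int16, Int32, Int64, Char and Double rows of the table; reading `m_BaseSize` from the VMT remains the way to get the value the runtime itself uses.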

System.String

We will talk about strings in practice separately: this relatively small class deserves a whole chapter. Within this chapter on the structure of the VMT, we will only look at how strings are laid out at a low level. Strings are stored in UTF-16, which means each character (`char`) takes 2 bytes. In addition, a null terminator is stored at the end of each string (a value that marks the end of the string). The length of the string is also stored as an Int32, so that it does not have to be counted every time it is needed. We will talk about encodings separately; for now this information is enough.

    // For .NET Framework 4 and later
    --------------------------------------------------------------------
    | SyncBlkIndx |   VMTPtr   | Length |  char  |  char  |  Term  |
    --------------------------------------------------------------------
    |    4 / 8    |   4 / 8    |   4    |   2    |   2    |   2    |
    --------------------------------------------------------------------
    |     -1      | 0xXXXXXXXX |   2    |   a    |   b    |  nil   |
    --------------------------------------------------------------------
    Term - null terminator
    Sum size = (12 (24) + 2 + (Len*2)) -> rounded up to the word size (20 bytes in the example)

    // For .NET Framework 3.5 and earlier
    ----------------------------------------------------------------------------------
    | SyncBlkIndx |   VMTPtr   | ArrayLength | Length |  char  |  char  |  Term  |
    ----------------------------------------------------------------------------------
    |    4 / 8    |   4 / 8    |      4      |   4    |   2    |   2    |   2    |
    ----------------------------------------------------------------------------------
    |     -1      | 0xXXXXXXXX |      3      |   2    |   a    |   b    |  nil   |
    ----------------------------------------------------------------------------------
    Term - null terminator
    Sum size = (16 (32) + 2 + (Len*2)) -> rounded up to the word size (24 bytes in the example)

Let's rewrite our method and teach it to calculate the size of strings:

    unsafe int SizeOf(object obj)
    {
        var majorNetVersion = Environment.Version.Major;
        var type = obj.GetType();
        var href = Union.GetRef(obj).ToInt64();
        var DWORD = sizeof(IntPtr);
        var baseSize = 3 * DWORD;

        if (type == typeof(string))
        {
            if (majorNetVersion >= 4)
            {
                var length = (int)*(int*)(href + DWORD /* skip vmt */);
                return DWORD * ((baseSize + 2 + 2 * length + (DWORD-1)) / DWORD);
            }
            else
            {
                // on 1.0 -> 3.5 string have additional RealLength field
                var arrlength = *(int*)(href + DWORD /* skip vmt */);
                var length = *(int*)(href + DWORD /* skip vmt */ + 4 /* skip length */);
                return DWORD * ((baseSize + 2 + 2 * length + (DWORD -1)) / DWORD);
            }
        }
        else if (type.BaseType == typeof(Array) || type == typeof(Array))
        {
            return ((ArrayInfo*)href)->SizeOf();
        }
        return SizeOf(type);
    }

Here `SizeOf(type)` calls the old implementation, for fixed-size reference types.

Let's check the code in practice:

    Action<string> stringWriter = (arg) =>
    {
        Console.WriteLine($"Length of `{arg}` string: {SizeOf(arg)}");
    };

    stringWriter("a");
    stringWriter("ab");
    stringWriter("abc");
    stringWriter("abcd");
    stringWriter("abcde");
    stringWriter("abcdef");

    -----

    Length of `a` string: 16
    Length of `ab` string: 20
    Length of `abc` string: 20
    Length of `abcd` string: 24
    Length of `abcde` string: 24
    Length of `abcdef` string: 28

The calculations show that the size of a string does not grow linearly but in steps: every two characters. This happens because each character is 2 bytes and they follow one another, but the final size must be divisible by the processor word size without a remainder. That is, some strings get another 2 bytes on top. The result of our work is wonderful: we can calculate what any given string costs us. The last step is to find out how to calculate the size of arrays in memory, and to make the task even more practical, let's write a method that answers the question: how large can an array be so that it still fits in the SOH (small object heap)? It may seem that using the `Length` property would be more reasonable and faster; in reality, however, it would be slower because of additional overhead.
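The step pattern in the string sizes can be reproduced without any pointer tricks. The sketch below hardcodes the x86 layout for .NET Framework 4 and later as described in this chapter (a 12-byte header, a 2-byte terminator, alignment to 4 bytes); `StringSizeX86` and `Estimate` are hypothetical names, and the constants are assumptions that hold only for that layout.

```csharp
using System;

public static class StringSizeX86
{
    const int Header = 12;     // SyncBlkIndx + VMTPtr + Length, 4 bytes each on x86
    const int Terminator = 2;  // UTF-16 null terminator
    const int WordSize = 4;    // x86 word

    // Estimated size in bytes of a string with the given number of chars.
    public static int Estimate(int length)
    {
        int raw = Header + Terminator + 2 * length;
        return (raw + WordSize - 1) / WordSize * WordSize; // round up to a whole word
    }

    public static void Main()
    {
        foreach (var s in new[] { "a", "ab", "abc", "abcd", "abcde", "abcdef" })
            Console.WriteLine($"Length of `{s}` string: {Estimate(s.Length)}");
        // Prints 16, 20, 20, 24, 24, 28: the same steps as the measured output.
    }
}
```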

Arrays

The structure of arrays is somewhat more complicated, because arrays come in several variants:

They can store value types or reference types.
Arrays can have one or several dimensions.
Each dimension can start at `0` or at any other number (in my opinion a very controversial feature: it saves a lazy programmer from writing `arr[i - startIndex]`, at the cost of supporting this at the FCL level).
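The last variant is rarely seen in everyday C#, so here is a minimal sketch of an array whose first index is not `0`, built with `Array.CreateInstance`. The helper name `LowerBoundDemo.Create` is made up for the illustration.

```csharp
using System;

public static class LowerBoundDemo
{
    // Creates a one-dimensional int array whose first valid index is lowerBound.
    public static Array Create(int length, int lowerBound) =>
        Array.CreateInstance(typeof(int), new[] { length }, new[] { lowerBound });

    public static void Main()
    {
        Array arr = Create(3, 5);                // valid indexes: 5, 6, 7
        arr.SetValue(42, 5);

        Console.WriteLine(arr.GetLowerBound(0)); // 5
        Console.WriteLine(arr.GetUpperBound(0)); // 7
        Console.WriteLine(arr.GetValue(5));      // 42

        // A regular multidimensional array reports its number of dimensions via Rank.
        Console.WriteLine(new int[2, 3].Rank);   // 2
    }
}
```

Such an array cannot be cast to `int[]`, which is one reason the runtime has to keep the bounds in the array instance itself.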

These variations cause some confusion in the implementation of arrays and make it impossible to predict the size of the resulting array exactly: multiplying the number of elements by their size is not enough. Of course, in most cases that will be more or less sufficient. Size becomes important when we are afraid of ending up in the LOH. Even then we have an option: we can simply add some constant (for example, 100) to the size calculated on the back of an envelope to understand whether we have crossed the 85,000-byte boundary. However, the task of this section is different: to understand the structure of types. So let's look at it:

    // Header
    --------------------------------------------------------------------------------
    |  SBI  | VMTPtr | Total | Len_1 | Len_2 |  ..  | Len_N |   Term   | VMT_Child |
    --------------------------opt-------opt------------opt-------opt--------opt-----
    | 4 / 8 | 4 / 8  |   4   |   4   |   4   |      |   4   |    4     |   4 / 8   |
    --------------------------------------------------------------------------------
    |0xFF.FF|0xXX.XX |   ?   |   ?   |   ?   |      |   ?   | 0x00.00  | 0xXX..XX  |
    --------------------------------------------------------------------------------

    - opt: optional
    - SBI: Sync Block Index
    - VMT_Child: present only if the array stores elements of a reference type
    - Total: total number of elements in the array, across all dimensions
    - Len_2..Len_N + Term: present only if the array has more than 1 dimension (controlled by bits in VMT->Flags)

As we can see, the array header stores data about the array dimensions: there can be just one or quite a lot of them, and in fact their number is limited only by the null terminator, which marks the end of the list. The full example is available in the file [GettingInstanceSize](./samples/GettingInstanceSize.linq); below I only give its most important part:

    public int SizeOf()
    {
        var total = 0;
        int elementsize;

        fixed (void* entity = &MethodTable)
        {
            var arr = Union.GetObj<Array>((IntPtr)entity);
            var elementType = arr.GetType().GetElementType();

            if (elementType.IsValueType)
            {
                var typecode = Type.GetTypeCode(elementType);

                switch (typecode)
                {
                    case TypeCode.Byte:
                    case TypeCode.SByte:
                    case TypeCode.Boolean:
                        elementsize = 1;
                        break;
                    case TypeCode.Int16:
                    case TypeCode.UInt16:
                    case TypeCode.Char:
                        elementsize = 2;
                        break;
                    case TypeCode.Int32:
                    case TypeCode.UInt32:
                    case TypeCode.Single:
                        elementsize = 4;
                        break;
                    case TypeCode.Int64:
                    case TypeCode.UInt64:
                    case TypeCode.Double:
                        elementsize = 8;
                        break;
                    case TypeCode.Decimal:
                        elementsize = 16; // sizeof(decimal) is 16 bytes
                        break;
                    default:
                        var info = (MethodTable*)elementType.TypeHandle.Value;
                        elementsize = info->Size - 2 * sizeof(IntPtr); // sync blk + vmt ptr
                        break;
                }
            }
            else
            {
                elementsize = IntPtr.Size;
            }

            // Header
            total += 3 * sizeof(IntPtr); // sync blk + vmt ptr + total length
            total += elementType.IsValueType ? 0 : sizeof(IntPtr); // MethodsTable for refTypes
            total += IsMultidimentional ? Dimensions * sizeof(int) : 0;
        }

        // Contents
        total += (int)TotalLength * elementsize;

        // align size to IntPtr
        if ((total % sizeof(IntPtr)) != 0)
        {
            total += sizeof(IntPtr) - total % (sizeof(IntPtr));
        }
        return total;
    }

This code takes all variations of array types into account and can be used to calculate their size:

    Console.WriteLine($"size of int[]{{1,2}}: {SizeOf(new int[2])}");
    Console.WriteLine($"size of int[2,1]{{1,2}}: {SizeOf(new int[1,2])}");
    Console.WriteLine($"size of int[2,3,4,5]{{...}}: {SizeOf(new int[2, 3, 4, 5])}");

    ---

    size of int[]{1,2}: 20
    size of int[2,1]{1,2}: 32
    size of int[2,3,4,5]{...}: 512
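The printed numbers can be cross-checked with a small arithmetic model of the header described above. This is a sketch for x86 only, and it rests on one assumption that is inferred from the output rather than stated in the text: for a multidimensional array the header holds one 4-byte slot per dimension plus one for the terminator. The names `ArraySizeX86`, `Estimate` and `StaysInSoh` are hypothetical.

```csharp
using System;

public static class ArraySizeX86
{
    const int WordSize = 4;          // x86 word
    const int LohThreshold = 85000;  // bytes; larger objects are placed in the LOH

    // rank: number of dimensions; elementCount: total elements across all dimensions;
    // elementSize: bytes per element (4 for references on x86).
    public static int Estimate(int rank, int elementCount, int elementSize, bool isReferenceElement)
    {
        int total = 3 * WordSize;                          // SBI + VMTPtr + Total
        if (isReferenceElement) total += WordSize;         // VMT_Child
        if (rank > 1) total += (rank + 1) * sizeof(int);   // Len_1..Len_N + Term (assumed)
        total += elementCount * elementSize;               // contents
        return (total + WordSize - 1) / WordSize * WordSize;
    }

    // True if an object of the estimated size stays below the LOH threshold.
    public static bool StaysInSoh(int estimatedSize) => estimatedSize < LohThreshold;

    public static void Main()
    {
        Console.WriteLine(Estimate(1, 2, 4, false));    // 20, as for new int[2]
        Console.WriteLine(Estimate(2, 2, 4, false));    // 32, as for new int[1,2]
        Console.WriteLine(Estimate(4, 120, 4, false));  // 512, as for new int[2,3,4,5]
        Console.WriteLine(StaysInSoh(Estimate(1, 21000, 4, false))); // True  (84,012 bytes)
        Console.WriteLine(StaysInSoh(Estimate(1, 21250, 4, false))); // False (85,012 bytes)
    }
}
```

The model is useful for back-of-the-envelope estimates; for exact numbers on a given runtime, measuring real instances as in the code above remains the reliable route.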

Conclusions to the section

At this stage we have learned a few fairly important things. First, we divided reference types into three groups: fixed-size reference types, generic types, and variable-size reference types. We also learned to understand the structure of an instance of any type (for the time being I am silent about the structure of the VMT itself: so far we have understood only one of its fields, which is also a great achievement), whether it is a fixed-size reference type (everything is very simple there) or a reference type of indefinite size: an array or a string. Indefinite, because its size is determined when the instance is created. With generic types everything is in fact simple: for each concrete generic type its own VMT is created, which contains a concrete size.

Methods Table

VMT classes

Explaining how the Methods Table works is mostly of academic interest: after all, crawling into such a jungle is like digging your own grave. On the one hand, such depths conceal something exciting and interesting and hold data that further deepens the understanding of what is going on. On the other hand, we all understand that Microsoft gives us no guarantee that the runtime will stay unchanged and that they will not, for example, suddenly shift the method table by one field.

All right, you have been warned. Now let's dive, as they say, through the looking glass. Until now the whole looking glass came down to knowing the structure of objects, which in theory we should already know. In essence that knowledge is not the looking glass itself but rather its entrance. Let's return to the `MethodTable` structure described in CoreCLR:

    // Low WORD is component size for array and string types (HasComponentSize() returns true).
    // Used for flags otherwise.
    DWORD m_dwFlags;

    // Base size of instance of this class when allocated on the heap
    DWORD m_BaseSize;

    WORD m_wFlags2;

    // Class token if it fits into 16-bits. If this is (WORD)-1, the class token is stored in the TokenOverflow optional member.
    WORD m_wToken;

    // <NICE> In the normal cases we shouldn't need a full word for each of these </NICE>
    WORD m_wNumVirtuals;
    WORD m_wNumInterfaces;

Specifically, to the fields `m_wNumVirtuals` and `m_wNumInterfaces`. These two fields answer the question “how many virtual methods and interfaces does the type have?”. The structure contains no information about ordinary methods, fields or properties (which are built from methods) and **is in no way connected with reflection**. By its nature and purpose, this structure exists to make method calls work in the CLR (and, in fact, in any OOP language, be it Java, C++, Ruby or something else; only the layout of the fields will differ somewhat). Let's look at the code:

    public class Sample
    {
        public int _x;

        public void ChangeTo(int newValue)
        {
            _x = newValue;
        }

        public virtual int GetValue()
        {
            return _x;
        }
    }

    public class OverridedSample : Sample
    {
        public override int GetValue()
        {
            return 666;
        }
    }

No matter how meaningless these classes may seem, they are perfectly suited for describing their VMT. For this we need to understand how the base type and the derived type differ with respect to the `ChangeTo` and `GetValue` methods.

The `ChangeTo` method is present in both types, and it cannot be overridden. This means it can be rewritten as follows:

    public class Sample
    {
        public int _x;

        public static void ChangeTo(Sample self, int newValue)
        {
            self._x = newValue;
        }

        // ...
    }

    // Or, if it were a struct
    public struct Sample
    {
        public int _x;

        public static void ChangeTo(ref Sample self, int newValue)
        {
            self._x = newValue;
        }

        // ...
    }

Apart from the architectural meaning, nothing changes: believe me, after compilation both versions work the same way, because for instance methods `this` is just the first parameter of the method, passed implicitly.
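One way to observe this from ordinary C# is an open instance delegate: `Delegate.CreateDelegate` can bind an instance method to a delegate type whose first parameter is the instance itself. The sketch below is only an illustration of the idea that `this` is an ordinary first argument; `OpenDelegateDemo` and `BindChangeTo` are made-up names.

```csharp
using System;
using System.Reflection;

public class Sample
{
    public int _x;

    public void ChangeTo(int newValue) { _x = newValue; }
}

public static class OpenDelegateDemo
{
    // Binds the instance method ChangeTo to a delegate that takes
    // the target object as an explicit first argument.
    public static Action<Sample, int> BindChangeTo()
    {
        MethodInfo method = typeof(Sample).GetMethod(nameof(Sample.ChangeTo));
        return (Action<Sample, int>)Delegate.CreateDelegate(typeof(Action<Sample, int>), method);
    }

    public static void Main()
    {
        Action<Sample, int> changeTo = BindChangeTo();
        var sample = new Sample();

        changeTo(sample, 42);          // "this" is passed explicitly
        Console.WriteLine(sample._x);  // 42
    }
}
```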

Let me explain in advance why all the explanations around inheritance are built on examples with static methods: in fact, all methods are static, both instance methods and non-instance ones. There is no separate copy of the compiled methods in memory for each instance of a class. That would take a huge amount of memory: it is easier to pass the same method a reference to the instance of the structure or class it works with every time.

For the `GetValue` method everything is completely different. We cannot simply override the method by redefining a *static* `GetValue` in the derived type: only the code that works with the variable as `OverridedSample` would get the new method, while code that works with it as a variable of the base type `Sample` could only call the base type's `GetValue`, because it has no idea what type the object really is. To find out the actual type of the object behind a variable and, as a result, which method exactly should be called, we can proceed as follows:

    void Main()
    {
        var sample = new Sample();
        var overrided = new OverridedSample();

        Console.WriteLine(sample.Virtuals[Sample.GetValuePosition].DynamicInvoke(sample));
        // each object is passed to its own table entry as "self"
        Console.WriteLine(overrided.Virtuals[Sample.GetValuePosition].DynamicInvoke(overrided));
    }

    public class Sample
    {
        public const int GetValuePosition = 0;

        public Delegate[] Virtuals;

        public int _x;

        public Sample()
        {
            Virtuals = new Delegate[1] {
                new Func<Sample, int>(GetValue)
            };
        }

        public static void ChangeTo(Sample self, int newValue)
        {
            self._x = newValue;
        }

        public static int GetValue(Sample self)
        {
            return self._x;
        }
    }

    public class OverridedSample : Sample
    {
        public OverridedSample() : base()
        {
            // replace the base entry in the same slot
            Virtuals[0] = new Func<Sample, int>(GetValue);
        }

        public static new int GetValue(Sample self)
        {
            return 666;
        }
    }

In this example we actually build a virtual method table by hand and call methods by their position in this table. If you understand the essence of the example, you understand how inheritance is built at the level of compiled code: methods are called by their index in the virtual method table. When an instance of a derived type is created, the slots that hold the base type's virtual methods are filled by the compiler with pointers to the overriding methods. Thus the only difference between our example and the real VMT is that when the compiler builds this table it knows in advance what it is dealing with and creates a table of the right size, whereas in our example we would have to sweat a lot to build the table for types that make it larger by adding new methods. But our task is different, so we will not engage in such perversions.
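For comparison, here is the same pair of classes written with the language's own `virtual` and `override`, where the runtime maintains the table for us. It is a minimal sketch; `VirtualDispatchDemo` and `Read` are illustrative names.

```csharp
using System;

public class Sample
{
    public int _x;

    public virtual int GetValue() { return _x; }
}

public class OverridedSample : Sample
{
    public override int GetValue() { return 666; }
}

public static class VirtualDispatchDemo
{
    // The parameter is typed as the base class; which GetValue runs
    // is decided by the actual type of the object at run time.
    public static int Read(Sample sample) => sample.GetValue();

    public static void Main()
    {
        Console.WriteLine(Read(new Sample { _x = 1 }));  // 1
        Console.WriteLine(Read(new OverridedSample()));  // 666
    }
}
```

The call site in `Read` knows only about `Sample`, yet the override is invoked for the derived instance, which is the behaviour the hand-built `Virtuals` array imitates.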

Link to the whole book

CLR Book: GitHub
Release 0.5.0 of the book, PDF: GitHub Release

Source: https://habr.com/ru/post/344556/
