Demystifying Arm Cortex-M33 Bare Metal: Compile, Assembly and Link

November 01, 2023 (updated on November 03, 2023)

Introduction

This is the second post of this series. The first post was Demystifying Arm Cortex-M33 Bare Metal: Startup. If you have not already read it, I recommend reading at least its Introduction and Questions sections, which introduce this series; I am not repeating that introduction in subsequent posts.

I will not explain the specs, nano and nosys in this post because I have written about them before in Demystifying Arm GNU Toolchain Specs: nano and nosys. If you are not familiar with this topic, I highly recommend you check that post before or after reading this post.

As in the previous post, I have created an STM32 project with project type empty.

The build settings (Project Properties > C/C++ Build > Settings) show all the options passed to the GNU Compiler Collection driver gcc for compiling, assembling and linking. STM32CubeIDE does not invoke the GNU assembler as and the GNU linker ld separately but passes assembler and linker options through gcc. This also makes it possible to use specs.

I am showing here the Release configuration to skip any debug related options. I also selected Floating-point unit as None and Floating-point ABI as Software implementation in the Settings.

Compiler options are -mcpu=cortex-m33 -std=gnu11 -DSTM32H563ZITx -DSTM32 -DSTM32H5 -DNUCLEO_H563ZI -c -I../Inc -Os -ffunction-sections -fdata-sections -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfloat-abi=soft -mthumb

Assembler options are -mcpu=cortex-m33 -c -x assembler-with-cpp --specs=nano.specs -mfloat-abi=soft -mthumb.

Linker options are -mcpu=cortex-m33 -T"STM32H563ZITX_FLASH.ld" --specs=nosys.specs -Wl,-Map="${BuildArtifactFileBaseName}.map" -Wl,--gc-sections -static --specs=nano.specs -mfloat-abi=soft -mthumb -Wl,--start-group -lc -lm -Wl,--end-group.

The common options in all three are:

  • -mcpu=cortex-m33: sets the target processor
  • -mfloat-abi=soft: using software floating-point
  • -mthumb: targets the Thumb instruction set; in practice this means Thumb-2, because the processor supports Thumb-2
  • --specs=nano.specs: builds against newlib-nano which is the stripped down version of newlib, this is the standard C library

Omitting the debug-like options such as -fstack-usage and -fcyclomatic-complexity, warning options like -Wall and device-specific definitions like -DSTM32, the following options are left:

Compiler options:

  • -std=gnu11: selects C11 standard with GNU extensions
  • -ffunction-sections: places each function into its own section
  • -fdata-sections: places each data into its own section

Assembler options:

  • -x assembler-with-cpp: assembly files may contain C preprocessor directives, so the preprocessor runs first

Linker options:

  • -T"...": use the specified linker script rather than using the default
  • -Wl,--gc-sections: unused sections are eliminated (garbage collect sections); since this works at section granularity, it is only effective when objects are compiled with -ffunction-sections and -fdata-sections
  • -static: does not link against shared libraries
  • --specs=nosys.specs: links against libnosys, which provides stubs for system calls that return an error by default

The options that differ most from using C on a desktop are the explicit linker script and the newlib-nano and nosys specs. The standard C library is available on many platforms (for example as glibc or newlib). However, on an embedded platform the resources and capabilities are limited, so it makes sense to use a minimal library like newlib-nano. Moreover, the standard C library depends on system calls, particularly for I/O. These calls are normally implemented by the operating system. In a bare-metal application there is no operating system, so some or most of the system calls are not available. nosys provides a library in which the system calls are just stubs that return errors.
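To make the idea of a stub concrete, here is a minimal sketch of what such a system call stub could look like. The name stub_write is hypothetical; it is chosen so that the example compiles on a desktop without clashing with the C library, whereas the corresponding libnosys function is called _write. Treat it as an illustration of the idea, not as the libnosys source.

```c
#include <errno.h>

/* Hypothetical stand-in for a nosys-style write stub: there is no
   operating system that could perform the write, so the stub only
   reports that the call is not implemented. */
int stub_write(int fd, const char *buf, int len)
{
    (void)fd;
    (void)buf;
    (void)len;
    errno = ENOSYS;   /* "function not implemented" */
    return -1;
}
```

A function such as printf ends up calling this kind of stub, so the program links, but the output goes nowhere until the stub is replaced with a real implementation (for example one writing to a UART).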

Basic Concepts about Linker and Object Files

When source code (e.g. C or assembly) is compiled or assembled, an object is created. One or more (input) objects are combined by the linker to produce a single (output) object. All objects, both input and output, are stored as files in the Executable and Linkable Format (ELF) (in the context of this post). Input objects typically have the extension .o. The final output object, the executable, typically has no extension.

The output object can be called an executable when there is an operating system to execute it. For an embedded system, the output object file can be programmed onto the device either directly or after converting it to another format (e.g. bin, hex), and it can be called firmware. The difference is that the operating system knows what to do with the sections in an executable (object) file, whereas an embedded system with no OS is basically empty, so the sections have to be programmed (written) to the correct (non-volatile/Flash) memory regions.

An object contains sections and symbols (symbol table). A section contains code (instructions) or data. Data can be initialized (contains initial values, zeroes or others) or uninitialized. A section can also contain information for other purposes. Depending on what the section contains its type and attributes are set accordingly.

For an embedded system with a microcontroller, code, initialized data and uninitialized data each have different load and allocate semantics. Initially, the firmware is stored in flash memory, so all of it has to be in flash. However, when the firmware runs, data has to be in RAM. Therefore, the initial values of initialized data have to be copied from flash memory (from the firmware) to RAM. There are no such initial values for the uninitialized data section, so only a region in RAM has to be allocated (and nothing is copied).

This semantic difference is expressed with the virtual memory address (VMA) and the load memory address (LMA). The LMA is the address at which the firmware is programmed, thus in flash memory. The VMA is the address used at runtime, thus in flash memory for code and in RAM for data. Code therefore has the same LMA and VMA, both in flash memory. Initialized data has its LMA in flash memory and its VMA in RAM. Uninitialized data has no meaningful LMA because there are no initial values to load, but its VMA is in RAM. The LMA and VMA of an output section are set by the linker.
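As a small illustration, the following C sketch shows where I would expect typical definitions to end up when compiled with GCC for this target. The names lookup, counter, scratch and bump are made up for this example, and the section names in the comments are the usual defaults (with -ffunction-sections and -fdata-sections they get a suffix, e.g. .data.counter), so verify them with objdump or readelf on your own build.

```c
#include <stdint.h>

const uint32_t lookup[2] = {10, 20};  /* .rodata: LMA = VMA, in flash        */
uint32_t counter = 5;                 /* .data:   LMA in flash, VMA in RAM   */
uint32_t scratch[4];                  /* .bss:    VMA in RAM, nothing loaded */

uint32_t bump(void)                   /* .text:   LMA = VMA, in flash        */
{
    counter += lookup[0];
    scratch[0] = counter;
    return counter;
}
```

On the target, counter only starts at 5 and scratch only starts at zero because the startup code copies and zeroes these regions before main runs.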

An input object can either define a symbol and use it, or refer to an external symbol (e.g. with extern in C) and use that. When an external symbol is used in an object, the symbol is not weak, and after linking no such symbol is found in any object, this is an undefined symbol (error).

Everything in ELF object files can be seen using objdump or readelf.

STM32CubeH5 Linker Script

The linker script is important because:

  • each microcontroller has a particular memory layout, so the final output file should be created accordingly and this information is used in the linker script.

  • there are symbols used by the startup code for certain addresses, and these are created by the linker based on the linker script.

Now, I will go line by line (without the comments) of the file STM32H563ZITX_FLASH.ld:

ENTRY(Reset_Handler)

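sets the entry point of the output to the address of Reset_Handler; this address is recorded in the ELF header. It does not change where the code of Reset_Handler is placed. I do not think this matters much here: if no entry point is given, the linker falls back to, among other things, the start of the .text section, there is no operating system to use the entry point, and no tool mentioned here requires it. One side effect worth knowing is that, with --gc-sections, the section containing the entry symbol is always kept. The actual entry point of the system is the second entry in the vector table as explained in my previous post explaining Startup.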

_estack = ORIGIN(RAM) + LENGTH(RAM);

creates a symbol _estack and sets its value to the end of specified RAM (origin+length). RAM will be defined soon. _estack is the initial stack pointer value, since the stack will grow down, it points to the highest address. As you expect, this is going to be the first entry of the vector table which is defined in the startup code.

_Min_Heap_Size = 0x200;
_Min_Stack_Size = 0x400;

creates two symbols containing an estimation of the heap and the stack size of the application. These should be modified according to the needs of the actual application.

MEMORY
{
  RAM   (xrw) : ORIGIN = 0x20000000, LENGTH = 640K
  FLASH (rx)  : ORIGIN = 0x8000000,  LENGTH = 2048K
}

defines the memory layout used in this file. RAM is eXecutable, Readable and Writable, starts at 0x20000000 and has a length of 640K. FLASH is Readable and eXecutable, starts at 0x8000000 and has a length of 2048K. The origin addresses have to be accurate for the particular microcontroller used. The length does not have to be exact, but it should be enough for the code and the data, and it should naturally be at most the actual size of the memory of the particular microcontroller used.
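The arithmetic behind ORIGIN + LENGTH is simple but worth doing once by hand. The sketch below uses a hypothetical helper, region_end, and relies on the fact that K in a linker script means 1024. It only mirrors the calculation the linker does; it is not something the toolchain provides.

```c
#include <stdint.h>

/* Hypothetical helper: end address (one past the last byte) of a memory
   region given its origin and its length in units of K (1024 bytes). */
uint32_t region_end(uint32_t origin, uint32_t length_k)
{
    return origin + length_k * 1024u;
}
```

With the values above, region_end(0x20000000, 640) gives 0x200A0000, which is the value _estack gets, and region_end(0x08000000, 2048) gives 0x08200000 as the end of FLASH.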

SECTIONS
{

starts defining the output sections, the mapping from the input sections to the output sections.

  .isr_vector :
  {

defines the first output section called .isr_vector. The name of the section is .isr_vector but it does not need to be, for example the template in CMSIS uses the name .vectors. Also, the section name here is the same as the section name in the STM32CubeH5 assembly startup file, but it does not need to be. What comes to this section is specified in the following lines, the names play no role here.

    . = ALIGN(4);

In linker scripts, . is a special variable meaning the current output location counter. The location counter is an address in the relevant memory. Since the MEMORY command defines RAM and FLASH, it is an address in one of these; which one is selected at the end of the section definition, and for .isr_vector it is FLASH. ALIGN is a builtin function, and . = ALIGN(4) means: whatever the current location in FLASH is, align it to a 4-byte boundary. That is, if it is not divisible by 4, increment it until it is.
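The effect of ALIGN can be written as a short calculation. The following is a minimal sketch in C of the same rounding, with a hypothetical function name align_up and assuming the alignment is a power of two (which is what ALIGN is used with here).

```c
#include <assert.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of alignment.
   alignment must be a non-zero power of two. */
uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

For example, align_up(0x08000005, 4) is 0x08000008, while an already aligned address such as 0x08000000 is returned unchanged.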

Since this is the first section, the location counter is at this point the origin of FLASH, which is obviously aligned, so in this situation this statement is unnecessary. This is not a big problem but I think there is also a mistake here. As far as I understand, VTOR[6:0] is reserved and probably reads as 0, which means the vector table has to be 128-byte aligned, not 4-byte aligned. Furthermore, the option bytes that specify the initial value of VTOR in STM32 H5 are 24 bits wide and they are used as VTOR[31:8]. Practically, this means the vector table address has to be 256-byte aligned. I am not sure if it is a good idea to use ALIGN here, because there are various factors that cannot be controlled by the linker alone. It is probably better not to use ALIGN here, but to pay attention to the start address (ORIGIN) of flash memory (FLASH).

    KEEP(*(.isr_vector))

This is an input section description and it means whatever is in .isr_vector section (in the object files) store it here (in the .isr_vector output section). KEEP indicates to keep this even if no symbols here are used (or referenced), because when --gc-sections is used with ld, the sections with no used symbols are removed.

The input sections are described by file_name(section_name) syntax, because different (object) files might have the same sections. A wildcard pattern (such as *) can be used both for file_name and section_name and actually multiple sections can be specified like (section1_name section2_name). Then, *(.isr_vector) above means .isr_vector sections in all files. Because the input file names are usually not known beforehand, it is very common to use * for the file name.

    . = ALIGN(4);
  } >FLASH

similar to the beginning of this section, align the current location to 4 bytes. Again, I do not think this matters, and actually, for the vector table, I think it is better to give it the correct size by allocating extra space in the startup file if necessary.

Then, this section definition ends with }. What follows is important and here it says this section goes to FLASH memory with >FLASH.

> specifies VMA. If LMA is not specified, LMA is the same as VMA. Thus, for .isr_vector, both VMA and LMA are in FLASH memory.

  .text :
  {
    . = ALIGN(4);
    *(.text)
    *(.text*)
    *(.glue_7)
    *(.glue_7t)
    *(.eh_frame)

starts defining the .text output section. .text historically means the executable code (or instructions). As before, first an alignment is done. Then, everything in .text and any section starting with .text (with .text*), .glue_7, .glue_7t and .eh_frame is stored in this output section.

The strange thing here is .glue_7 and .glue_7t. These are for the glue code switching from Arm to Thumb and from Thumb to Arm code. However, Cortex-M only runs Thumb code, so there cannot be any glue code. These look unnecessary to me.

What is in the .eh_frame section? The primary purpose of .eh_frame is to support C++ exceptions. I think it is also used for some C functionality, and I am not sure whether it is unnecessary or not. I tend to remove it in Assembly/C applications unless I see there is a problem (with the debugger etc.).

    KEEP (*(.init))
    KEEP (*(.fini))

.init and .fini input sections will be stored in this output section as well. The .init and .fini are for initialization and finalization functions, such as used by processes or shared objects. However, there is neither a concept of a process nor shared object here and after the build, I do not see these sections. These might be unnecessary as well.

    . = ALIGN(4);
    _etext = .;
  } >FLASH

at the end of the output section definition, again there is an alignment. Then, the current location is saved into symbol _etext. As before, this section is also in FLASH.

  .rodata :
  {
    . = ALIGN(4);
    *(.rodata)
    *(.rodata*)
    . = ALIGN(4);
  } >FLASH

the output section called .rodata keeps the data of read-only (static const) variables.

  .ARM.extab   : {
    . = ALIGN(4);
    *(.ARM.extab* .gnu.linkonce.armextab.*)
    . = ALIGN(4);
  } >FLASH

according to Arm ABI, .ARM.extab section contains exception-handling table for stack unwinding.

  .ARM : {
    . = ALIGN(4);
    __exidx_start = .;
    *(.ARM.exidx*)
    __exidx_end = .;
    . = ALIGN(4);
  } >FLASH

according to Arm ABI, .ARM.exidx section contains the index table for exception handling/stack unwinding. It is strange that this section is named just .ARM, it looks to me like it has to be .ARM.exidx.

Arm ABI says: “Tables are not required for ABI compliance at the C/Assembler level but are required for C++.”. If you are not using C++, it seems both .ARM.extab and .ARM.exidx (.ARM above) are unnecessary.

  .preinit_array :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array*))
    PROVIDE_HIDDEN (__preinit_array_end = .);
    . = ALIGN(4);
  } >FLASH

  .init_array :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT(.init_array.*)))
    KEEP (*(.init_array*))
    PROVIDE_HIDDEN (__init_array_end = .);
    . = ALIGN(4);
  } >FLASH

  .fini_array :
  {
    . = ALIGN(4);
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT(.fini_array.*)))
    KEEP (*(.fini_array*))
    PROVIDE_HIDDEN (__fini_array_end = .);
    . = ALIGN(4);
  } >FLASH

these sections are very similar, they keep the corresponding input section (e.g. .init_array) and mark the start (e.g. __init_array_start) and the end (e.g. __init_array_end).

PROVIDE_HIDDEN makes the symbol hidden (not exported) and only defines the symbol if it is not defined elsewhere.

Each of these sections can hold an array of pointers to functions. Each function is for a pre-initialization, an initialization or a finalization (termination). So, when it is the right time, a run-time function calls these functions (think about C++ constructors and destructors).
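A minimal sketch of how a run-time function could walk such an array is below. On the target, the start and end pointers would come from the linker symbols (e.g. __init_array_start and __init_array_end), and as far as I know newlib does this in __libc_init_array. Here the array is an ordinary local array and all names (init_fn, run_init_array, first, second, demo) are hypothetical, so that the example runs on a desktop.

```c
typedef void (*init_fn)(void);

/* Calls every function pointer in [start, end), in order. */
void run_init_array(init_fn *start, init_fn *end)
{
    for (init_fn *p = start; p != end; ++p) {
        (*p)();
    }
}

/* Two made-up "constructors" that record the order in which they ran. */
static int order_log = 0;
static void first(void)  { order_log = order_log * 10 + 1; }
static void second(void) { order_log = order_log * 10 + 2; }

/* Builds a small table the way the linker would and runs it. */
int demo(void)
{
    init_fn table[] = { first, second };
    order_log = 0;
    run_init_array(table, table + 2);
    return order_log;
}
```

The functions are called in array order, which is why the linker script sorts the .init_array.* input sections before placing them.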

  _sidata = LOADADDR(.data);

saves the load address (LMA) of data section below to the symbol _sidata. This is used in the startup file.

  .data :
  {
    . = ALIGN(4);
    _sdata = .;        /* create a global symbol at data start */
    *(.data)           /* .data sections */
    *(.data*)          /* .data* sections */
    *(.RamFunc)        /* .RamFunc sections */
    *(.RamFunc*)       /* .RamFunc* sections */

    . = ALIGN(4);
    _edata = .;        /* define a global symbol at data end */

  } >RAM AT> FLASH

defines the .data (static initialized variables) output section. As usual, the beginning and the end are aligned. The start of data is saved to the symbol _sdata and the end to the symbol _edata (pay attention, it saves the aligned addresses). These are used in the startup code to copy the data to RAM.

The last line is different from the ones we have seen until now. It is >RAM AT> FLASH. > defines VMA and AT> defines LMA. So the section will be loaded to FLASH (LMA) but used from RAM (VMA). The symbols defined here, _sdata and _edata, have VMA addresses in RAM.

That is why _sidata above is needed. Because _sdata and _edata are addresses in RAM, we cannot find the location of data in FLASH at first. _sidata is the start address of data in FLASH. This is used in startup code to copy the data from FLASH (starting from _sidata) to RAM[_sdata:_edata].
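The copy itself is a plain word-by-word loop. The STM32CubeH5 startup file does this in assembly; the following is only a C sketch of the equivalent logic with a hypothetical function name copy_data, where src plays the role of _sidata and dst and dst_end play the roles of _sdata and _edata.

```c
#include <assert.h>
#include <stdint.h>

/* Copies 32-bit words from src to [dst, dst_end).
   On the target: src = &_sidata, dst = &_sdata, dst_end = &_edata. */
void copy_data(const uint32_t *src, uint32_t *dst, uint32_t *dst_end)
{
    assert(dst <= dst_end);
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}
```

Copying whole words is only correct because the linker script aligns both ends of .data to 4 bytes.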

  . = ALIGN(4);
  .bss :
  {
    _sbss = .;
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss*)
    *(COMMON)

    . = ALIGN(4);
    _ebss = .;
    __bss_end__ = _ebss;
  } >RAM

creates the .bss output section which contains uninitialized static variables. As usual, the beginning and the end are aligned. The start is saved both to _sbss and __bss_start__, and the end is saved to both _ebss and __bss_end__. This section specifies only >RAM because these are uninitialized so there is nothing to keep in the final output and program into the Flash. The memory area between _sbss and _ebss is simply zeroed in the startup code.
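Zeroing is even simpler than copying because there is no source. Again, this is a C sketch of the logic with a hypothetical function name zero_bss, not the actual (assembly) startup code; start and end stand for _sbss and _ebss.

```c
#include <assert.h>
#include <stdint.h>

/* Writes zero to every 32-bit word in [start, end).
   On the target: start = &_sbss, end = &_ebss. */
void zero_bss(uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = 0;
    }
}
```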

  ._user_heap_stack :
  {
    . = ALIGN(8);
    PROVIDE ( end = . );
    PROVIDE ( _end = . );
    . = . + _Min_Heap_Size;
    . = . + _Min_Stack_Size;
    . = ALIGN(8);
  } >RAM

it allocates an output section ._user_heap_stack as big as the sum of _Min_Heap_Size and _Min_Stack_Size. This section is not used anywhere, but I think it is there to make sure that at least _Min_Heap_Size plus _Min_Stack_Size bytes are left in the memory, because if the location counter passes the final address of the RAM, the linker gives an error. However, the symbol end is important, because it is used by the libnosys _sbrk system call implementation.
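To show why end matters, here is a minimal sketch of an _sbrk-like function. It is not the libnosys source: the name demo_sbrk is hypothetical, a small array (fake_ram) stands in for the RAM that follows .bss so the example runs on a desktop, and the bounds check is my addition for the sketch.

```c
#include <stddef.h>

/* Simulated RAM region; on the target, the heap would start at the
   linker-provided symbol `end` instead of at this array. */
static char fake_ram[64];

/* Hypothetical _sbrk-like function: returns the previous break and moves
   the break by incr bytes, or returns (void *)-1 when out of range. */
void *demo_sbrk(ptrdiff_t incr)
{
    static char *heap_end = NULL;

    if (heap_end == NULL) {
        heap_end = fake_ram;              /* plays the role of &end */
    }

    ptrdiff_t used = heap_end - fake_ram;
    ptrdiff_t size = (ptrdiff_t)sizeof fake_ram;
    if (incr > size - used || incr < -used) {
        return (void *)-1;                /* bounds check added for this sketch */
    }

    char *previous = heap_end;
    heap_end += incr;
    return previous;
}
```

malloc calls a function of this kind to grow the heap, which is why the heap begins exactly where the linker script places end.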

  /DISCARD/ :
  {
    libc.a ( * )
    libm.a ( * )
    libgcc.a ( * )
  }

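discards the input sections of these libraries that have not already been placed by one of the output sections above. As far as I understand, an input section is assigned to the first statement in the script that matches it, so the code and data that are actually used are not affected; what is dropped is the remaining information coming from these libraries.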

  .ARM.attributes 0 : { *(.ARM.attributes) }

according to Arm ABI, .ARM.attributes section contains build attributes. This is information about how the objects were built, and it can be removed from the final output if needed. If the file is going to be re-linked, then this section might be needed.

}

the output section definitions are finished, it is the end of the file.

CMSIS GCC Linker Script

The CMSIS linker script for GCC is gcc_arm.ld and it has some interesting differences. I will go over it quickly rather than line by line, since it should be clear by now what a linker script does. I will also not repeat the points I have made previously.

I have removed the comments and a few TrustZone-related parts from the CMSIS linker script below and changed the indentation for display purposes where needed.

__ROM_BASE = 0x00000000;
__ROM_SIZE = 0x00040000;

__RAM_BASE = 0x20000000;
__RAM_SIZE = 0x00020000;

__STACK_SIZE = 0x00000400;
__HEAP_SIZE  = 0x00000C00;

it starts with defining some constants, which is a good idea.

MEMORY
{
  FLASH (rx)  : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE
  RAM   (rwx) : ORIGIN = __RAM_BASE, LENGTH = __RAM_SIZE
}

ENTRY(Reset_Handler)

it defines the memory as in the STM32CubeH5 linker script and uses the same entry definition.

SECTIONS
{
  .text :
  {
    KEEP(*(.vectors))
    *(.text*)

    KEEP(*(.init))
    KEEP(*(.fini))

    /* .ctors */
    *crtbegin.o(.ctors)
    *crtbegin?.o(.ctors)
    *(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors)
    *(SORT(.ctors.*))
    *(.ctors)

    /* .dtors */
    *crtbegin.o(.dtors)
    *crtbegin?.o(.dtors)
    *(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors)
    *(SORT(.dtors.*))
    *(.dtors)

    *(.rodata*)

    KEEP(*(.eh_frame*))
  } > FLASH

unlike STM32CubeH5, this script lists everything related to code and also the vector table in the text section. I cannot think of a disadvantage of this but also no advantage.

In addition to .init and .fini I mentioned before, it also keeps .ctors and .dtors sections. These are similar to .init_array and .fini_array and also contain pointers to constructor and destructor functions. I believe this is the old way to do this and Arm EABI uses .init_array and .fini_array (same as System V ABI). So I do not think .ctors and .dtors are needed.

It also puts .rodata to this section, the read-only static data.

Finally, it also keeps .eh_frame in this section, which is for stack unwinding.

It first sounded strange to me to keep .rodata and .eh_frame here but both of these are read-only and section type is SHT_PROGBITS, probably that is why they are kept in the same output section. Same is true for .ctors and .dtors.

  .ARM.extab :
  {
    *(.ARM.extab* .gnu.linkonce.armextab.*)
  } > FLASH

  __exidx_start = .;
  .ARM.exidx :
  {
    *(.ARM.exidx* .gnu.linkonce.armexidx.*)
  } > FLASH
  __exidx_end = .;

as mentioned before, these are to support exception handling and stack unwinding and not required for C applications.

  .copy.table :
  {
    . = ALIGN(4);
    __copy_table_start__ = .;

    LONG (__etext)
    LONG (__data_start__)
    LONG ((__data_end__ - __data_start__) / 4)

     __copy_table_end__ = .;
  } > FLASH

the copying and zeroing code of the CMSIS startup code is a little different. Instead of using separate symbols, it stores the start of data in FLASH (__etext), the start of data in RAM (__data_start__) and the number of items (number of 4-byte words) in a table starting at __copy_table_start__. These are used in the data copy section of the startup code.
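A table-driven copy can be sketched as follows. The struct layout matches the three LONG values above (source, destination, word count), but the names copy_entry and process_copy_table are hypothetical and this is only modeled on what the CMSIS startup code does; check the CMSIS sources for the real implementation.

```c
#include <assert.h>
#include <stdint.h>

/* One entry of the copy table: source address (LMA), destination
   address (VMA) and length in 32-bit words. */
typedef struct {
    const uint32_t *src;
    uint32_t *dest;
    uint32_t wlen;
} copy_entry;

/* Processes every entry in [start, end) and copies wlen words each.
   On the target: start = &__copy_table_start__, end = &__copy_table_end__. */
void process_copy_table(const copy_entry *start, const copy_entry *end)
{
    assert(start <= end);
    for (const copy_entry *e = start; e < end; ++e) {
        for (uint32_t i = 0; i < e->wlen; ++i) {
            e->dest[i] = e->src[i];
        }
    }
}
```

The advantage of a table is that more regions (for example a second RAM bank) can be added by appending entries in the linker script, without touching the startup code.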

  .zero.table :
  {
    . = ALIGN(4);
    __zero_table_start__ = .;
    __zero_table_end__ = .;
  } > FLASH

the section for bss called .zero.table is similar to .copy.table. However, you may notice that nothing is written here, and there is a question in the CMSIS repository about this: GCC linker file .zero.table entry missing. To make it similar to STM32CubeH5, there should be these lines after __zero_table_start__:

    LONG (__bss_start__)
    LONG ((__bss_end__ - __bss_start__) / 4)

it just stores the start address and the number of 4-bytes items, as nothing is copied but just zeroed. As mentioned in the question in CMSIS_5 repo, newlib automatically zero-initializes the bss section.

  __etext = ALIGN (4);

  .data : AT (__etext)
  {
    __data_start__ = .;
    *(vtable)
    *(.data)
    *(.data.*)

    . = ALIGN(4);
    /* preinit data */
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP(*(.preinit_array))
    PROVIDE_HIDDEN (__preinit_array_end = .);

    . = ALIGN(4);
    /* init data */
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP(*(SORT(.init_array.*)))
    KEEP(*(.init_array))
    PROVIDE_HIDDEN (__init_array_end = .);

    . = ALIGN(4);
    /* finit data */
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP(*(SORT(.fini_array.*)))
    KEEP(*(.fini_array))
    PROVIDE_HIDDEN (__fini_array_end = .);

    KEEP(*(.jcr*))
    . = ALIGN(4);
    /* All data end */
    __data_end__ = .;

  } > RAM

This is the initialized data section, as marked with the symbols __data_start__ and __data_end__. The CMSIS script also includes a vtable section, which is probably needed for C++ (I did not check). It also includes a section called .jcr, which seems to be short for Java Class Registration and is naturally only used by Java. Both can be omitted.

  .bss :
  {
    . = ALIGN(4);
    __bss_start__ = .;
    *(.bss)
    *(.bss.*)
    *(COMMON)
    . = ALIGN(4);
    __bss_end__ = .;
  } > RAM AT > RAM

similarly, a space for uninitialized data is allocated and the start and the end is kept in the symbols __bss_start__ and __bss_end__.

  .heap (COPY) :
  {
    . = ALIGN(8);
    __end__ = .;
    PROVIDE(end = .);
    . = . + __HEAP_SIZE;
    . = ALIGN(8);
    __HeapLimit = .;
  } > RAM

  .stack (ORIGIN(RAM) + LENGTH(RAM) - __STACK_SIZE) (COPY) :
  {
    . = ALIGN(8);
    __StackLimit = .;
    . = . + __STACK_SIZE;
    . = ALIGN(8);
    __StackTop = .;
  } > RAM
  PROVIDE(__stack = __StackTop);

similar to STM32CubeH5, but CMSIS creates two separate sections for the heap and the stack and uses __HEAP_SIZE and __STACK_SIZE to size them. These are marked as COPY to indicate that no memory will be allocated for them (but they are in RAM anyway, so it does not matter for a bare-metal project). Keep in mind that the stack grows downwards, so __StackTop is where the stack starts, and the start address of .stack is fixed by setting it to the end of RAM minus __STACK_SIZE. So .stack is not simply placed right after .heap: first .heap is placed, then the start of .stack is set independently, so __StackLimit can end up smaller than __HeapLimit.

I do not see __end__, end or __stack used anywhere in the CMSIS startup code, so these can be omitted there. Keep in mind that, as mentioned for STM32CubeH5, end is still needed if the libnosys _sbrk is used.

  ASSERT(__StackLimit >= __HeapLimit, "region RAM overflowed with stack")
}

at the end, it checks if the stack (growing from __StackTop downwards to __StackLimit) overflowed into the heap (growing from where the data and bss finished upwards till __HeapLimit).
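The check can be reproduced with a little arithmetic. The sketch below uses hypothetical helper names (stack_limit, ram_fits) and simply mirrors the expressions in the linker script above.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors ORIGIN(RAM) + LENGTH(RAM) - __STACK_SIZE from the script. */
uint32_t stack_limit(uint32_t ram_base, uint32_t ram_size, uint32_t stack_size)
{
    assert(stack_size <= ram_size);
    return ram_base + ram_size - stack_size;
}

/* Mirrors ASSERT(__StackLimit >= __HeapLimit, ...): returns 1 when the
   stack region does not overlap the heap region, 0 otherwise. */
int ram_fits(uint32_t heap_limit, uint32_t stack_limit_addr)
{
    return stack_limit_addr >= heap_limit;
}
```

With the CMSIS constants above (__RAM_BASE = 0x20000000, __RAM_SIZE = 0x00020000, __STACK_SIZE = 0x00000400), __StackLimit is 0x2001FC00, so the link only succeeds if data, bss and heap together end at or below that address.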

Summary

Compared to STM32CubeH5, I find these differences meaningful (correct, more correct or better looking):

  • using constants for memory locations
  • using separate heap and stack sections and separate limits

and these are, in my opinion, not necessary, not necessarily correct or not a good idea:

  • .ctors and .dtors are unnecessary, EABI requires .init_array and .fini_array instead
  • .jcr is unnecessary because it is for Java

I find these unnecessary in both linker scripts:

  • .ARM.extab and .ARM.exidx are unnecessary, they are for C++.

Example: Makefile based Template Project

As an example, I created a Makefile-based template project for a bare-metal Assembly/C application for the STM32 H5 (Cortex-M33) MCU (STM32H563) that can be built on the Linux command line (tested with Ubuntu 22.04) without using STM32CubeIDE. This project contains a minimal linker script and minimal startup code that I wrote based on the relevant STM32CubeH5 and CMSIS files and my comments in this post.

References