Demystifying Arm Cortex-M33 Bare Metal: Compile, Assembly and Link
Introduction
This is the second post of this series. The first post was Demystifying Arm Cortex-M33 Bare Metal: Startup. If you have not already read it, I recommend reading at least its Introduction and Questions sections, which introduce this series; I am not repeating that introduction in subsequent posts.
I will not explain the specs, nano and nosys in this post because I have written about them before in Demystifying Arm GNU Toolchain Specs: nano and nosys. If you are not familiar with this topic, I highly recommend you check that post before or after reading this post.
STM32CubeIDE Compile, Assembly and Link Settings
As in the previous post, I have created an STM32 project with project type empty.
The build settings (Project Properties > C/C++ Build > Settings) show all the options passed to the GNU Compiler Collection gcc for compile, assembly and link. STM32CubeIDE does not invoke the GNU assembler as and the GNU linker ld separately but passes assembler and linker options through gcc. This also makes it possible to use specs.
I am showing here the Release configuration to skip any debug related options. I also selected Floating-point unit as None and Floating-point ABI as Software implementation in the Settings.
Compiler options are -mcpu=cortex-m33 -std=gnu11 -DSTM32H563ZITx -DSTM32 -DSTM32H5 -DNUCLEO_H563ZI -c -I../Inc -Os -ffunction-sections -fdata-sections -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfloat-abi=soft -mthumb
Assembler options are -mcpu=cortex-m33 -c -x assembler-with-cpp --specs=nano.specs -mfloat-abi=soft -mthumb
Linker options are -mcpu=cortex-m33 -T"STM32H563ZITX_FLASH.ld" --specs=nosys.specs -Wl,-Map="${BuildArtifactFileBaseName}.map" -Wl,--gc-sections -static --specs=nano.specs -mfloat-abi=soft -mthumb -Wl,--start-group -lc -lm -Wl,--end-group
The common options in all three are:
- -mcpu=cortex-m33: sets the target processor
- -mfloat-abi=soft: uses software floating point
- -mthumb: targets the Thumb instruction set; in practice this means Thumb-2, because the processor supports Thumb-2
- --specs=nano.specs: builds against newlib-nano, the stripped-down version of newlib, which is the standard C library here
Omitting debug-like options such as -fstack-usage and -fcyclomatic-complexity, warnings like -Wall and device-specific definitions like -DSTM32, the following options are left:
Compiler options:
- -std=gnu11: selects the C11 standard with GNU extensions
- -ffunction-sections: places each function into its own section
- -fdata-sections: places each data item into its own section
Assembler options:
- -x assembler-with-cpp: assembly files may contain C preprocessor directives, so the preprocessor runs first
Linker options:
- -T"...": uses the specified linker script rather than the default one
- -Wl,--gc-sections: eliminates unused sections (garbage collect sections); to remove individual unused functions and data, objects have to be compiled with -ffunction-sections and -fdata-sections
- -static: does not link against shared libraries
- --specs=nosys.specs: links against libnosys, which provides stubs for system calls that return an error by default
The options that differ most from building C on a desktop are the explicit linker script and the newlib-nano and nosys specs. The standard C library is available on many platforms (such as glibc or newlib). However, on an embedded platform the resources and capabilities are limited, so it makes sense to use a minimal library like newlib-nano. Moreover, the standard C library depends on system calls, particularly for I/O. These calls are normally implemented by the operating system. In a bare-metal application there is no operating system, so some or most of the system calls are not available. nosys provides a library where the system calls are just stubs that return errors.
Basic Concepts about Linker and Object Files
When source code (e.g. C or assembly) is compiled or assembled, an object is created. One or many (input) objects are combined by the linker to produce a single (output) object. All objects, both input and output, are stored as files in Executable and Linkable Format (ELF) (in the context of this post), and input objects typically have the extension .o. The final output object, the executable, typically has no extension.
The output object can be called an executable when there is an operating system to execute it. For an embedded system, the output object file can be programmed onto the device either directly or after converting it to another format (e.g. bin, hex), and it can be called firmware. The difference is that the operating system knows what to do with the sections in an executable (object) file, whereas an embedded system with no OS is basically empty, so the sections have to be programmed (written) to the correct (non-volatile/Flash) memory regions.
An object contains sections and symbols (symbol table). A section contains code (instructions) or data. Data can be initialized (contains initial values, zeroes or others) or uninitialized. A section can also contain information for other purposes. Depending on what the section contains its type and attributes are set accordingly.
For an embedded system with a microcontroller, code, initialized data and uninitialized data each have different load and allocate semantics. Initially, the firmware is stored in flash memory, so everything that has content (code and the initial values of data) has to be in flash. However, when the firmware runs, data has to be in RAM. Therefore, the initial values of initialized data have to be copied from flash memory (from the firmware) to RAM. There are no such initial values for the uninitialized data section, so only a region in RAM has to be allocated (nothing is copied). This semantic difference is controlled with the virtual memory address (VMA) and the load memory address (LMA). The LMA is the address the firmware is programmed to, which is in flash memory. The VMA is the address used at run time, which is in flash memory for code and in RAM for data. Thus, code has the same LMA and VMA, both in flash memory. Initialized data has its LMA in flash memory and its VMA in RAM. Uninitialized data effectively has no LMA because there are no initial values, but its VMA is in RAM. The LMA and VMA of an output section are set by the linker.
An input object can either define and use a symbol, or refer to and use an external symbol (e.g. with extern in C). When an external symbol is used in an object, the symbol is not weak, and after linking no definition of it is found in any object, this is an undefined symbol (error).
Everything in ELF object files can be inspected using objdump or readelf.
STM32CubeH5 Linker Script
The linker script is important because:
- each microcontroller has a particular memory layout, so the final output file has to be created accordingly, and this layout is described in the linker script.
- there are symbols used by the startup code for certain addresses, and these are created by the linker based on the linker script.
Now, I will go line by line (without the comments) through the file STM32H563ZITX_FLASH.ld:
ENTRY(Reset_Handler)
records the address of Reset_Handler as the entry point of the output file. It does not change where the code of Reset_Handler is placed. I do not think this matters here; as far as I know, without it the linker falls back to the start of the .text section. There is no operating system to use the entry point and no tool mentioned here requires it. The actual entry point of the system is the second entry in the vector table, as explained in my previous post on Startup.
_estack = ORIGIN(RAM) + LENGTH(RAM);
creates a symbol _estack and sets its value to the end of the specified RAM (origin + length). RAM will be defined soon. _estack is the initial stack pointer value; since the stack grows down, it points to the highest address. As you would expect, this is going to be the first entry of the vector table, which is defined in the startup code.
_Min_Heap_Size = 0x200;
_Min_Stack_Size = 0x400;
creates two symbols containing an estimation of the heap and the stack size of the application. These should be modified according to the needs of the actual application.
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 640K
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 2048K
}
defines the memory layout used in this file. RAM is eXecutable, Readable and Writable, starts at 0x20000000 and has a length of 640K. FLASH is Readable and eXecutable, starts at 0x8000000 and has a length of 2048K. The origin addresses have to be accurate for the particular microcontroller used. The length does not have to be exact, but it should be enough for the code and the data, and it must naturally be at most the actual size of the memory of the particular microcontroller used.
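To make the numbers concrete, here is a small C sketch of the ORIGIN + LENGTH arithmetic for the two regions. The helper names KIB and region_end are mine and only for illustration; the sketch assumes the K suffix means 1024 bytes, as it does in GNU ld.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: n kibibytes, matching the K suffix in the linker script. */
#define KIB(n) ((uint32_t)(n) * 1024u)

/* First address past the end of a memory region: ORIGIN + LENGTH. */
static uint32_t region_end(uint32_t origin, uint32_t length)
{
    assert(length <= UINT32_MAX - origin); /* the sum must not wrap around */
    return origin + length;
}
```

With the values above, region_end(0x20000000, KIB(640)) gives 0x200A0000, which is the value _estack receives, and region_end(0x08000000, KIB(2048)) gives 0x08200000 for the end of FLASH.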
SECTIONS
{
starts defining the output sections, the mapping from the input sections to the output sections.
.isr_vector :
{
defines the first output section, called .isr_vector. The name of the section is .isr_vector but it does not need to be; for example, the template in CMSIS uses the name .vectors. Also, the section name here is the same as the section name in the STM32CubeH5 assembly startup file, but it does not need to be. What goes into this section is specified in the following lines; the names play no role here.
. = ALIGN(4);
In linker scripts, . is a special variable meaning the current output location counter. The location counter is an address in the relevant memory; since the MEMORY command defines RAM and FLASH, it is in one of these. Which one is selected at the end of this section definition, and for .isr_vector it is FLASH. ALIGN is a builtin function, and . = ALIGN(4) means: whatever the current location in FLASH is, align it to a 4-byte boundary. That is, if it is not divisible by 4, increment it until it is.
Since this is the first section, the location counter is at this point the origin of FLASH, which is obviously aligned, so in this situation this statement is unnecessary. This is not a big problem, but I think there is also a mistake here. As far as I understand, VTOR[6:0] is reserved and probably reads as 0, which means the vector table has to be 128-byte aligned, not 4-byte aligned. Furthermore, the option bytes that specify the initial value of VTOR in STM32 H5 are 24 bits wide and are used as VTOR[31:8]. Practically, this means the vector table address has to be 256-byte aligned. I am not sure it is a good idea to use ALIGN here, because there are various factors that cannot be controlled by the linker alone. It is probably better not to use ALIGN here, but to pay attention to the start address (ORIGIN) of flash memory (FLASH).
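As a rough illustration of the arithmetic, here is a minimal C sketch of the rounding that ALIGN applies to the location counter, together with a check against the 128-byte and 256-byte boundaries discussed above. The function names align_up and is_aligned are my own and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment (a power of two).
   This mirrors what ". = ALIGN(n);" does to the location counter. */
static uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Returns 1 if addr is already a multiple of alignment, 0 otherwise. */
static int is_aligned(uint32_t addr, uint32_t alignment)
{
    return align_up(addr, alignment) == addr;
}
```

For example, align_up(0x08000001, 4) yields 0x08000004, while the FLASH origin 0x08000000 is unchanged by any of these alignments. An address such as 0x08000080 would pass a 128-byte check but fail a 256-byte one.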
KEEP(*(.isr_vector))
This is an input section description and it means: whatever is in the .isr_vector section (in the object files), store it here (in the .isr_vector output section). KEEP indicates to keep this even if no symbols here are used (or referenced), because when --gc-sections is used with ld, the sections with no used symbols are removed.
The input sections are described with the file_name(section_name) syntax, because different (object) files might have the same sections. A wildcard pattern (such as *) can be used both for file_name and section_name, and multiple sections can be specified like (section1_name section2_name). So *(.isr_vector) above means the .isr_vector sections in all files. Because the input file names are usually not known beforehand, it is very common to use * for the file name.
. = ALIGN(4);
} >FLASH
similar to the beginning of this section, this aligns the current location to 4 bytes. Again, I do not think this matters, and for the vector table I think it is better to give it the correct size by allocating extra space in the startup file if necessary.
Then, this section definition ends with }. What follows is important: >FLASH says this section goes to FLASH memory.
> specifies the VMA. If the LMA is not specified, it is the same as the VMA. Thus, for .isr_vector, both VMA and LMA are in FLASH memory.
.text :
{
. = ALIGN(4);
*(.text)
*(.text*)
*(.glue_7)
*(.glue_7t)
*(.eh_frame)
starts defining the .text output section. .text historically means the executable code (or instructions). As before, first an alignment is done. Then, everything in .text and any section starting with .text (with .text*), .glue_7, .glue_7t and .eh_frame is stored in this output section.
The strange thing here is .glue_7 and .glue_7t. These are for the glue code switching from Arm to Thumb and from Thumb to Arm code. However, Cortex-M only runs Thumb code, so there cannot be any glue code. These look unnecessary to me.
What is in the .eh_frame section? The primary purpose of .eh_frame is to support C++ exceptions. I think it is also used for some C functionality, so I am not sure whether it is unnecessary. I tend to remove it in Assembly/C applications unless I see a problem (with the debugger etc.).
KEEP (*(.init))
KEEP (*(.fini))
The .init and .fini input sections will be stored in this output section as well. .init and .fini are for initialization and finalization functions, such as those used by processes or shared objects. However, there is neither a concept of a process nor of a shared object here, and after the build I do not see these sections. These might be unnecessary as well.
. = ALIGN(4);
_etext = .;
} >FLASH
at the end of the output section definition, again there is an alignment. Then, the current location is saved into the symbol _etext. As before, this section is also in FLASH.
.rodata :
{
. = ALIGN(4);
*(.rodata)
*(.rodata*)
. = ALIGN(4);
} >FLASH
the output section called .rodata holds the data of read-only (static const) variables.
.ARM.extab : {
. = ALIGN(4);
*(.ARM.extab* .gnu.linkonce.armextab.*)
. = ALIGN(4);
} >FLASH
according to the Arm ABI, the .ARM.extab section contains the exception-handling table for stack unwinding.
.ARM : {
. = ALIGN(4);
__exidx_start = .;
*(.ARM.exidx*)
__exidx_end = .;
. = ALIGN(4);
} >FLASH
according to the Arm ABI, the .ARM.exidx section contains the index table for exception handling/stack unwinding. It is strange that this output section is named just .ARM; it looks to me like it should be .ARM.exidx.
The Arm ABI says: “Tables are not required for ABI compliance at the C/Assembler level but are required for C++.”. If you are not using C++, it seems both .ARM.extab and .ARM.exidx (.ARM above) are unnecessary.
.preinit_array :
{
. = ALIGN(4);
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP (*(.preinit_array*))
PROVIDE_HIDDEN (__preinit_array_end = .);
. = ALIGN(4);
} >FLASH
.init_array :
{
. = ALIGN(4);
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT(.init_array.*)))
KEEP (*(.init_array*))
PROVIDE_HIDDEN (__init_array_end = .);
. = ALIGN(4);
} >FLASH
.fini_array :
{
. = ALIGN(4);
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP (*(SORT(.fini_array.*)))
KEEP (*(.fini_array*))
PROVIDE_HIDDEN (__fini_array_end = .);
. = ALIGN(4);
} >FLASH
these sections are very similar: each keeps the corresponding input sections (e.g. .init_array) and marks the start (e.g. __init_array_start) and the end (e.g. __init_array_end).
PROVIDE_HIDDEN makes the symbol hidden (not exported) and only defines the symbol if it is referenced but not defined elsewhere.
Each of these sections can hold an array of pointers to functions. Each function is for a pre-initialization, an initialization or a finalization (termination). So, when it is the right time, a run-time function calls these functions (think about C++ constructors and destructors).
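To illustrate how such an array is consumed, here is a minimal C sketch of a run-time routine that walks an array of function pointers from a start marker to an end marker. The names run_init_array, first_ctor, second_ctor and demo_init_array are hypothetical; on a target, the start and end pointers would come from the linker symbols such as __init_array_start and __init_array_end, and with newlib this job is normally done by the C library itself.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);

/* Call every function pointer in [start, end) in order. */
static void run_init_array(init_fn const *start, init_fn const *end)
{
    assert(start <= end);
    for (init_fn const *fn = start; fn != end; ++fn) {
        (*fn)();
    }
}

/* Hypothetical constructors, used only to demonstrate the call order. */
static int init_trace[2];
static size_t init_count;

static void first_ctor(void)  { init_trace[init_count++] = 1; }
static void second_ctor(void) { init_trace[init_count++] = 2; }

/* Stand-in for the contents the linker would collect into .init_array. */
static init_fn const demo_init_array[] = { first_ctor, second_ctor };
```

Calling run_init_array(demo_init_array, demo_init_array + 2) runs first_ctor and then second_ctor, in the order the pointers are stored.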
_sidata = LOADADDR(.data);
saves the load address (LMA) of the .data section below to the symbol _sidata. This is used in the startup file.
.data :
{
. = ALIGN(4);
_sdata = .; /* create a global symbol at data start */
*(.data) /* .data sections */
*(.data*) /* .data* sections */
*(.RamFunc) /* .RamFunc sections */
*(.RamFunc*) /* .RamFunc* sections */
. = ALIGN(4);
_edata = .; /* define a global symbol at data end */
} >RAM AT> FLASH
defines the .data (static initialized variables) output section. As usual, the beginning and the end are aligned. The start of data is saved to the _sdata symbol and the end to the _edata symbol (pay attention, it saves the aligned addresses). These are used in the startup code to copy the data to RAM.
The last line is different from the ones we have seen until now. It is >RAM AT> FLASH. > defines the VMA and AT> defines the LMA. So the section will be loaded to FLASH (LMA) but used from RAM (VMA). The symbols defined here, _sdata and _edata, have VMA addresses in RAM.
That is why _sidata above is needed. Because _sdata and _edata are addresses in RAM, they do not tell us the location of the data in FLASH. _sidata is the start address of the data in FLASH. This is used in the startup code to copy the data from FLASH (starting from _sidata) to RAM[_sdata:_edata].
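The copy described above can be sketched in C as follows. This is only an illustration under my own assumptions: the function name copy_data is hypothetical, and the STM32CubeH5 startup file does this in assembly. On a target, the arguments would be the addresses of _sidata, _sdata and _edata.

```c
#include <assert.h>
#include <stdint.h>

/* Copy initial values word by word from the load address (LMA, in FLASH)
   to the run-time address range [dst, dst_end) (VMA, in RAM). */
static void copy_data(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    assert(dst <= dst_end);
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}
```

Copying whole 32-bit words is safe here because the linker script aligns both _sdata and _edata to 4 bytes.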
. = ALIGN(4);
.bss :
{
_sbss = .;
__bss_start__ = _sbss;
*(.bss)
*(.bss*)
*(COMMON)
. = ALIGN(4);
_ebss = .;
__bss_end__ = _ebss;
} >RAM
creates the .bss output section, which contains uninitialized static variables. As usual, the beginning and the end are aligned. The start is saved to both _sbss and __bss_start__, and the end is saved to both _ebss and __bss_end__. This section specifies only >RAM because these variables are uninitialized, so there is nothing to keep in the final output and program into the Flash. The memory area between _sbss and _ebss is simply zeroed in the startup code.
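The zeroing step can be sketched in C like this. The function name zero_bss is hypothetical and the real startup file does this in assembly; on a target, the arguments would be the addresses of _sbss and _ebss.

```c
#include <assert.h>
#include <stdint.h>

/* Fill the range [start, end) with zero, one 32-bit word at a time. */
static void zero_bss(uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start < end) {
        *start++ = 0u;
    }
}
```

Nothing is read from FLASH here, which is why the section only needs a VMA.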
._user_heap_stack :
{
. = ALIGN(8);
PROVIDE ( end = . );
PROVIDE ( _end = . );
. = . + _Min_Heap_Size;
. = . + _Min_Stack_Size;
. = ALIGN(8);
} >RAM
it allocates an output section ._user_heap_stack as big as the sum of _Min_Heap_Size and _Min_Stack_Size. The section itself is not used anywhere, but I think it is there to make sure that at least _Min_Heap_Size and _Min_Stack_Size are left in the memory, because if the location counter passes the final address of the RAM, the linker will give an error. However, the symbol end is important, because it is used by the libnosys _sbrk system call implementation.
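To show why a symbol marking the start of the heap is needed, here is a simplified C sketch of an sbrk-style allocator. It is not the libnosys source: the names fake_heap, heap_break, heap_limit and sbrk_sketch are hypothetical, the array stands in for the RAM that follows end, and the upper bound check is my own addition for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the RAM after the `end` symbol; on a real target the heap
   would start at the address of `end` provided by the linker script. */
static uint8_t fake_heap[64];

static uint8_t *heap_break = fake_heap;                           /* current end of the heap */
static uint8_t *const heap_limit = fake_heap + sizeof fake_heap;  /* hypothetical upper bound */

/* Grow the heap by incr bytes and return the previous break,
   or (void *)-1 if the request does not fit. */
static void *sbrk_sketch(ptrdiff_t incr)
{
    assert(incr >= 0);
    if (incr > heap_limit - heap_break) {
        return (void *)-1;
    }
    uint8_t *previous = heap_break;
    heap_break += incr;
    return previous;
}
```

The first call returns the start of the heap area, and each later call returns the address where the previous allocation ended, which is how malloc obtains more memory from the system call layer.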
/DISCARD/ :
{
libc.a ( * )
libm.a ( * )
libgcc.a ( * )
}
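discards input sections coming from these libraries. As far as I understand, because the linker uses the first matching statement, this only affects sections that were not already placed by the statements above.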
.ARM.attributes 0 : { *(.ARM.attributes) }
according to the Arm ABI, the .ARM.attributes section contains build attributes. This is debug-like information and can be removed from the final output if needed. If the file is going to be re-linked, then this section might be needed.
}
the output section definitions are finished, it is the end of the file.
CMSIS GCC Linker Script
The CMSIS linker script for GCC is gcc_arm.ld and it has some interesting differences. I will go over it quickly rather than line by line, since it should be clear by now what a linker script does. I will also not repeat the points I have made previously.
I have removed the comments and a few TrustZone-related parts from the CMSIS linker script below, and changed the indentation for display purposes where needed.
__ROM_BASE = 0x00000000;
__ROM_SIZE = 0x00040000;
__RAM_BASE = 0x20000000;
__RAM_SIZE = 0x00020000;
__STACK_SIZE = 0x00000400;
__HEAP_SIZE = 0x00000C00;
it starts with defining some constants, which is a good idea.
MEMORY
{
FLASH (rx) : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE
RAM (rwx) : ORIGIN = __RAM_BASE, LENGTH = __RAM_SIZE
}
ENTRY(Reset_Handler)
it defines the memory as in the STM32CubeH5 linker script and uses the same entry definition.
SECTIONS
{
.text :
{
KEEP(*(.vectors))
*(.text*)
KEEP(*(.init))
KEEP(*(.fini))
/* .ctors */
*crtbegin.o(.ctors)
*crtbegin?.o(.ctors)
*(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors)
*(SORT(.ctors.*))
*(.ctors)
/* .dtors */
*crtbegin.o(.dtors)
*crtbegin?.o(.dtors)
*(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors)
*(SORT(.dtors.*))
*(.dtors)
*(.rodata*)
KEEP(*(.eh_frame*))
} > FLASH
unlike STM32CubeH5, this script lists everything related to code, and also the vector table, in the .text section. I cannot think of a disadvantage of this, but also of no advantage.
In addition to .init and .fini, which I mentioned before, it also keeps the .ctors and .dtors sections. These are similar to .init_array and .fini_array and also contain pointers to constructor and destructor functions. I believe this is the old way to do this, and the Arm EABI uses .init_array and .fini_array (same as the System V ABI). So I do not think .ctors and .dtors are needed.
It also puts .rodata, the read-only static data, into this section.
Finally, it also keeps .eh_frame in this section, which is for stack unwinding.
At first it sounded strange to me to keep .rodata and .eh_frame here, but both of these are read-only and their section type is SHT_PROGBITS, which is probably why they are kept in the same output section. The same is true for .ctors and .dtors.
.ARM.extab :
{
*(.ARM.extab* .gnu.linkonce.armextab.*)
} > FLASH
__exidx_start = .;
.ARM.exidx :
{
*(.ARM.exidx* .gnu.linkonce.armexidx.*)
} > FLASH
__exidx_end = .;
as mentioned before, these are to support exception handling and stack unwinding and not required for C applications.
.copy.table :
{
. = ALIGN(4);
__copy_table_start__ = .;
LONG (__etext)
LONG (__data_start__)
LONG ((__data_end__ - __data_start__) / 4)
__copy_table_end__ = .;
} > FLASH
the copying and zeroing code of the CMSIS startup code is a little different. Instead of using separate symbols, it stores the start of data in FLASH (__etext), the start of data in RAM (__data_start__) and the number of items (number of 4-byte words) in a table starting at __copy_table_start__. These are used in the data copy section of the startup code.
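A table-driven copy of this kind can be sketched in C as follows. The type copy_table_entry and the function process_copy_table are names I made up for the example, and the fields use pointers rather than raw 32-bit values so that the sketch also runs on a 64-bit host; on the target, each entry corresponds to the three LONG values above.

```c
#include <assert.h>
#include <stdint.h>

/* One copy-table entry, mirroring the three LONG values in .copy.table. */
typedef struct {
    const uint32_t *src;   /* load address in FLASH (__etext) */
    uint32_t       *dest;  /* run-time address in RAM (__data_start__) */
    uint32_t        wlen;  /* number of 32-bit words to copy */
} copy_table_entry;

/* Walk the table from start to end and perform each copy. */
static void process_copy_table(const copy_table_entry *start,
                               const copy_table_entry *end)
{
    assert(start <= end);
    for (const copy_table_entry *entry = start; entry < end; ++entry) {
        for (uint32_t i = 0; i < entry->wlen; ++i) {
            entry->dest[i] = entry->src[i];
        }
    }
}
```

The benefit of a table over fixed symbols is that more entries can be added, for example for a second RAM region, without changing the startup code.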
.zero.table :
{
. = ALIGN(4);
__zero_table_start__ = .;
__zero_table_end__ = .;
} > FLASH
the section for bss, called .zero.table, is similar to .copy.table. However, you will notice that nothing is written here, and there is a question in the CMSIS repository about this: GCC linker file .zero.table entry missing. To make it similar to STM32CubeH5, there should be these lines after __zero_table_start__:
LONG (__bss_start__)
LONG ((__bss_end__ - __bss_start__) / 4)
it just stores the start address and the number of 4-byte items, as nothing is copied but just zeroed. As mentioned in the question in the CMSIS_5 repo, newlib automatically zero-initializes the bss section.
__etext = ALIGN (4);
.data : AT (__etext)
{
__data_start__ = .;
*(vtable)
*(.data)
*(.data.*)
. = ALIGN(4);
/* preinit data */
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP(*(.preinit_array))
PROVIDE_HIDDEN (__preinit_array_end = .);
. = ALIGN(4);
/* init data */
PROVIDE_HIDDEN (__init_array_start = .);
KEEP(*(SORT(.init_array.*)))
KEEP(*(.init_array))
PROVIDE_HIDDEN (__init_array_end = .);
. = ALIGN(4);
/* finit data */
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP(*(SORT(.fini_array.*)))
KEEP(*(.fini_array))
PROVIDE_HIDDEN (__fini_array_end = .);
KEEP(*(.jcr*))
. = ALIGN(4);
/* All data end */
__data_end__ = .;
} > RAM
This is the initialized data section, as marked with the symbols __data_start__ and __data_end__. The CMSIS script also includes a vtable section, which is probably needed for C++ (I did not check). Also, it includes a section called .jcr, which seems to be short for Java Class Registration and is naturally only used by Java. Both can be omitted.
.bss :
{
. = ALIGN(4);
__bss_start__ = .;
*(.bss)
*(.bss.*)
*(COMMON)
. = ALIGN(4);
__bss_end__ = .;
} > RAM AT > RAM
similarly, space for uninitialized data is allocated, and the start and the end are kept in the symbols __bss_start__ and __bss_end__.
.heap (COPY) :
{
. = ALIGN(8);
__end__ = .;
PROVIDE(end = .);
. = . + __HEAP_SIZE;
. = ALIGN(8);
__HeapLimit = .;
} > RAM
.stack (ORIGIN(RAM) + LENGTH(RAM) - __STACK_SIZE) (COPY) :
{
. = ALIGN(8);
__StackLimit = .;
. = . + __STACK_SIZE;
. = ALIGN(8);
__StackTop = .;
} > RAM
PROVIDE(__stack = __StackTop);
similar to STM32CubeH5, but CMSIS creates two separate sections for the heap and the stack and uses __HEAP_SIZE and __STACK_SIZE to size them. These are marked as COPY to indicate no memory will be allocated for them (but they are in RAM anyway, so it does not matter for a bare-metal project). Keep in mind that the stack grows downwards, so __StackTop is where the stack starts, and the start address of .stack is fixed by setting it to the end of RAM minus __STACK_SIZE. So .stack is not simply placed right after .heap: first .heap is placed, then the start of .stack is set independently, so __StackLimit can end up smaller than __HeapLimit.
In the CMSIS startup code, I do not see __end__, end or __stack used anywhere, so these can be omitted unless a library needs them (as mentioned above, the libnosys _sbrk uses end).
ASSERT(__StackLimit >= __HeapLimit, "region RAM overflowed with stack")
}
at the end, it checks if the stack (growing from __StackTop downwards to __StackLimit) overflowed into the heap (growing from where the data and bss finished upwards till __HeapLimit).
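The same check can be worked through in C with the constants from this script. The function names stack_limit and stack_clear_of_heap are hypothetical, and the heap limits used in the example below are made-up values, since the real __HeapLimit depends on how much data and bss the application has.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest address of the stack area: end of RAM minus the stack size. */
static uint32_t stack_limit(uint32_t ram_base, uint32_t ram_size, uint32_t stack_size)
{
    assert(stack_size <= ram_size);
    return ram_base + ram_size - stack_size;
}

/* Mirrors ASSERT(__StackLimit >= __HeapLimit, ...): returns 1 when the
   stack area starts at or above the end of the heap area. */
static int stack_clear_of_heap(uint32_t heap_limit, uint32_t stack_lim)
{
    return stack_lim >= heap_limit;
}
```

With __RAM_BASE = 0x20000000, __RAM_SIZE = 0x00020000 and __STACK_SIZE = 0x00000400, the stack area starts at 0x2001FC00. A heap ending at, say, 0x20001000 passes the check, while one ending at 0x2001FD00 would trigger the linker error.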
Summary
Compared to STM32CubeH5, I find these differences meaningful (correct, more correct or better looking):
- using constants for memory locations
- using separate heap and stack sections and separate limits
and these are unnecessary, not necessarily correct or not a good idea in my view:
- .ctors and .dtors are unnecessary, EABI requires .init_array and .fini_array instead
- .jcr is unnecessary because it is for Java
I find these unnecessary in both linker scripts:
- .ARM.extab and .ARM.exidx are unnecessary, they are for C++.
Example: Makefile based Template Project
As an example, I created a Makefile based template project for a bare-metal Assembly/C application for STM32 H5 (Cortex-M33) MCU (STM32H563) that can be built on Linux command line (tested with Ubuntu 22.04) without using STM32CubeIDE. This project contains a minimal linker script and a minimal startup code that I wrote based on relevant STM32CubeH5 and CMSIS files and my comments in this post.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.