This is not a tutorial, just a collection of my thoughts as I was learning in no particular order.
To preface, I'm on a Mac and I'll be using the STM32 Nucleo-64 development board with STM32F446RE MCU.
Homebrew:
- gcc-arm-embedded (Cask): GNU bare-metal toolchain for 32-bit Arm processors (gcc, ld, objcopy)
- stlink: STLink Toolset (st-flash, st-info, st-util)
- make
The MCU uses an ARM processor, runs no operating system, and uses the EABI (Embedded Application Binary Interface), hence we need to install arm-none-eabi-gcc via the gcc-arm-embedded cask to use the GCC toolchain.
If you're a VSCode user, here's what I've been using to make the Microsoft C/C++ Extension work as expected:
bear -- make -j
Then in .vscode/settings.json:
{
"C_Cpp.intelliSenseEngine": "default",
"C_Cpp.errorSquiggles": "enabled",
"C_Cpp.default.compileCommands": "${workspaceFolder}/compile_commands.json",
"C_Cpp.default.intelliSenseMode": "gcc-arm"
}
We need the Cortex-M4's CMSIS core headers and the STM32 device headers. From ARM's GitHub repo we can grab the core headers we need.
m-profile/
  armv7m_mpu.h
  cmsis_gcc_m.h
core_cm4.h
cmsis_compiler.h
cmsis_gcc.h
cmsis_version.h
We can get the STM32 CMSIS device headers from STM's repo at /Include.
stm32f4xx.h
stm32f446xx.h
system_stm32f4xx.h
We also need the system_stm32f4xx.c file from STM's repo at /Source/Template.
CMSIS is not a full Hardware Abstraction Layer; it provides a CPU/peripheral API. It's made up of two parts: the CPU-level core headers (from ARM) and the chip-level device headers (from STM). STM also provides startup and system conventions (SystemInit(), SystemCoreClock) via system_stm32f4xx.c.
/* FPU settings ------------------------------------------------------------*/
#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
SCB->CPACR |=
((3UL << 10 * 2) | (3UL << 11 * 2)); /* set CP10 and CP11 Full Access */
#endif
These lines in system_stm32f4xx.c set the CP10 and CP11 fields of the CPACR (Coprocessor Access Control Register) to full access, which allows us to use floating point operations. As far as I know, the FPU is optional on the Cortex-M4, and a core that includes it is usually called the M4F. My STM32F446RE has an M4F. Also note that we don't necessarily use floats explicitly in this project, but I think this is good practice because floating point operations may take place behind the scenes.
#define HSI_VALUE \
((uint32_t)16000000) /*!< Value of the Internal oscillator in Hz*/
...
uint32_t SystemCoreClock = 16000000;
system_stm32f4xx.c also initializes (and handles updates for) the SystemCoreClock variable, which tracks the core clock frequency in Hz. This means that there are 16 million clock cycles per second by default, using the HSI (High Speed Internal) oscillator.
We can look at the delay function in main.c to get some insight into the system clock.
static void delay(volatile uint32_t count)
{
while (--count)
{
__asm__("nop");
}
}
...
delay(1000000);
Let's look at the disassembly as well using objdump:
> objdump -d build/stm32f446-blink.elf
build/stm32f446-blink.elf: file format elf32-littlearm
Disassembly of section .text:
080001c4 <delay>:
80001c4: b480 push {r7}
80001c6: b083 sub sp, #0xc
80001c8: af00 add r7, sp, #0x0
80001ca: 6078 str r0, [r7, #0x4]
80001cc: e000 b 0x80001d0 <delay+0xc> @ imm = #0x0
80001ce: bf00 nop
80001d0: 687b ldr r3, [r7, #0x4]
80001d2: 3b01 subs r3, #0x1
80001d4: 607b str r3, [r7, #0x4]
80001d6: 2b00 cmp r3, #0x0
80001d8: d1f9 bne 0x80001ce <delay+0xa> @ imm = #-0xe
80001da: bf00 nop
80001dc: bf00 nop
80001de: 370c adds r7, #0xc
80001e0: 46bd mov sp, r7
80001e2: f85d 7b04 ldr r7, [sp], #4
80001e6: 4770 bx lr
We know that by default there are 16 million cycles per second, and that total delay time = (cycles per iteration) * (iterations) / (cycles per second). We can calculate the number of cycles per iteration by looking at the loop:
80001ce: bf00 nop
80001d0: 687b ldr r3, [r7, #0x4]
80001d2: 3b01 subs r3, #0x1
80001d4: 607b str r3, [r7, #0x4]
80001d6: 2b00 cmp r3, #0x0
80001d8: d1f9 bne 0x80001ce <delay+0xa> @ imm = #-0xe
The nop (no operation) instruction uses 1 cycle, ldr (load register) uses 2 cycles, subs (subtract and set flags) uses 1 cycle, str (store register) uses 2 cycles, cmp (compare) uses 1 cycle, and bne (branch if not equal) usually uses 3 cycles when taken. So that's ~10 cycles per iteration, and the equation becomes total delay time = 10 * 1,000,000 / 16,000,000 = 0.625. If we call delay(1000000), the CPU should loop for about 0.625 seconds.
We explicitly select the CPU, select the Thumb-2 instruction set, set the hardware floating point unit (fpv4 -> ARM's floating point version 4; sp -> single precision/32 bit floats; d16 -> 16 64-bit floating point registers), and set the floating point calling convention so that the FPU is used both internally and for passing float values in function calls.
MCU_FLAGS := -mcpu=cortex-m4 -mthumb -mfpu=fpv4-sp-d16 -mfloat-abi=hard
The two compiler flags below split the .text section into one section per function (.text.foo, .text.bar), which allows the linker to garbage collect unused functions (the linker can only garbage collect at the section level). -fdata-sections does the same for variables in .data and .bss. I don't believe these options have any effect on the final size unless we also set -gc-sections (garbage collect sections) for the linker.
CFLAGS := ... -ffunction-sections -fdata-sections
As for the linker, we can use -Wl,-gc-sections with arm-none-eabi-gcc to pass the garbage collection option through to the linker. We also specify -TSTM32F446RE_FLASH.ld to replace the linker's default script with our own. -nostartfiles makes it so we don't link any CRT (C Runtime) startup/teardown files, since we will be providing our own. Note that we are still linking the standard library.
LDFLAGS := ... -Wl,-gc-sections -TSTM32F446RE_FLASH.ld -nostartfiles
// Enable clock for GPIOA
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
GPIOA (General Purpose Input Output port A) consists of 16 physical pins (PA0-PA15). The hardware block needs a clock signal to be alive and reactive, and the RCC (Reset and Clock Control) block is responsible for enabling/disabling clock signals to peripherals in the clock tree. We use the RCC's AHB1ENR (I think this is Advanced High-performance Bus 1 Enable Register) and set the GPIOAEN bit to enable the clock for GPIOA.
// Set PA5 as output
GPIOA->MODER &= ~GPIO_MODER_MODER5_Msk; // -> MODER & ~(3 << 10) -> MODER & ~(11 00 00 00 00 00)
GPIOA->MODER |= GPIO_MODER_MODER5_0; // -> MODER | (1 << 10) -> MODER | (01 00 00 00 00 00)
GPIO ports have a mode register to configure I/O modes per pin. MODER is 32 bits, 2 bits per pin. There are four modes: 00 input (the default mode), 01 general purpose output, 10 alternate function mode, and 11 analog mode. In the above snippet we set the two MODER bits corresponding to pin 5 (MODER5) to 01, aka output mode.
while (1)
{
GPIOA->ODR ^= GPIO_ODR_OD5; // Toggle PA5
delay(1000000);
}
Now we can toggle the state of pin 5 in the GPIOA port (PA5) via the GPIO port's ODR (Output Data Register). The ODR is not in RAM; it's a memory-mapped register at a fixed address on the AHB. It's 32 bits wide, but only the lower 16 bits are used, 1 bit per pin. At this point we're getting above my pay grade. I'm not really sure how the bus transaction works, but as far as I understand it the port's mode register (MODER) controls whether the pin's output driver is enabled, so you cannot use PA5 as an output pin unless you've set the MODER5 bits.
I talk more about the delay function in the SystemCoreClock subsection.