ARM® Cortex™-A Series Programmer’s Guide - Bootcode

이재하 · June 15, 2023


This post covers two topics. The first is the code that runs immediately after the core comes out of reset on a so-called bare-metal system, that is, one in which code runs without an operating system. This situation is often encountered when first bringing up a chip or system.

The second is how a bootloader loads and runs the Linux kernel.

13.1 Booting a bare-metal system

When the core has been reset, it will commence execution at the location of the reset vector in the exception vector table, at either address 0x00000000 or 0xFFFF0000.

The reset handler code must do some or all of the following:

In a multi-core system, put non-primary cores to sleep.
Initialize exception vectors.
Initialize the memory system, including the MMU.
Initialize core mode stacks and registers.
Initialize any critical I/O devices.
Perform any necessary initialization of NEON or VFP.
Enable interrupts.
Change core mode or state.
Handle any set-up required for the Secure world.
Call the main() application.

The first consideration is placement of the exception vector table. You must make sure that it contains a valid set of instructions that branch to appropriate handlers.

The _start label in GNU Assembler source marks the entry point, which the linker can be told to locate at a particular address, so it can be used to place code in the vector table. The initial vector table will be in non-volatile memory and can contain branch-to-self instructions (other than the reset vector), as no exceptions are expected at this point. Typically, the reset vector contains a branch to the boot code in ROM. The ROM can be aliased to the address of the exception vector. The ROM code then writes to some memory-remap peripheral that maps RAM into address 0, and the real exception vector table is copied into RAM. This means the part of the boot code that handles remapping must be position-independent, as only PC-relative addressing can be used.

Typical exception vector table code

_start:
    B Reset_Handler
    B Undefined_Handler
    B SWI_Handler
    B Prefetch_Handler
    B Data_Handler
    NOP                 @ Reserved vector
    B IRQ_Handler

    @ FIQ_Handler will follow directly after this table
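To make the "branch to self" idea concrete, here is a minimal host-side sketch in C of how an ARM-state B instruction is encoded for a vector table entry. The function name encode_branch is an assumption made for this illustration and is not part of the guide.

```c
#include <assert.h>
#include <stdint.h>

/* Encode an ARM-state "B target" instruction that will sit at address `from`.
 * In ARM state the PC reads as the instruction address plus 8, and the
 * 24-bit immediate holds the signed branch offset in words.
 * encode_branch is a hypothetical helper name used only in this sketch. */
static uint32_t encode_branch(uint32_t from, uint32_t target)
{
    int32_t offset = (int32_t)(target - (from + 8u));

    assert((offset & 3) == 0);                           /* word-aligned target */
    assert(offset >= -(1 << 25) && offset < (1 << 25));  /* +/-32MB reach */

    /* 0xEA = condition "always" plus the B opcode bits. */
    return 0xEA000000u | (((uint32_t)offset >> 2) & 0x00FFFFFFu);
}
```

A branch whose target equals its own address encodes as 0xEAFFFFFE, which is the value a placeholder vector entry would hold before real handlers are installed.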

You might then have to initialize stack pointers for the various modes that your application can make use of.

LDR R0, stack_base
@ Enter each mode in turn and set up the stack pointer
MSR CPSR_c, #(Mode_FIQ | I_Bit | F_Bit)    @ FIQ mode, IRQ and FIQ masked
MOV SP, R0
SUB R0, R0, #FIQ_Stack_Size                @ IRQ stack sits below the FIQ stack
MSR CPSR_c, #(Mode_IRQ | I_Bit | F_Bit)    @ IRQ mode, IRQ and FIQ masked
MOV SP, R0

The code above (Example 13-2 in the guide) is a simple example that initializes the stack pointers for FIQ and IRQ modes.
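The same bookkeeping can be sketched on a host machine in C. This is only an illustration: the mode and mask values are the ARMv7-A CPSR encodings, while the stack sizes and the names plan_stacks and mode_stack are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* CPSR mode field values and interrupt mask bits (ARMv7-A). */
enum { MODE_FIQ = 0x11, MODE_IRQ = 0x12, I_BIT = 0x80, F_BIT = 0x40 };

/* Hypothetical stack sizes, picked only for this sketch. */
enum { FIQ_STACK_SIZE = 0x400, IRQ_STACK_SIZE = 0x400 };

struct mode_stack {
    uint32_t cpsr_c;  /* value written to CPSR_c to enter the mode */
    uint32_t sp;      /* initial stack pointer for that mode */
};

/* Carve descending stacks from stack_base: FIQ first, IRQ below it. */
static void plan_stacks(uint32_t stack_base, struct mode_stack out[2])
{
    uint32_t top = stack_base;

    assert((stack_base & 7u) == 0);   /* keep stacks 8-byte aligned */

    out[0].cpsr_c = MODE_FIQ | I_BIT | F_BIT;
    out[0].sp = top;
    top -= FIQ_STACK_SIZE;

    out[1].cpsr_c = MODE_IRQ | I_BIT | F_BIT;
    out[1].sp = top;
}
```

With a base of 0x20010000, the FIQ stack starts at the base and the IRQ stack starts 0x400 bytes below it, mirroring the SUB in the assembly.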

The next step is to set up the caches, MMU and branch predictors. We begin by disabling the MMU and caches and invalidating the caches and TLB. This example code is for the Cortex-A9 processor. Some of the Cortex-A processors automatically invalidate the L1 and/or L2 caches at reset, others require manual invalidation. You must check the Technical Reference Manual (TRM) for a particular core to determine which options have been implemented.

The MMU TLBs must be invalidated. The branch target predictor hardware might not have to be explicitly invalidated, but it must be enabled by boot code. Branch prediction can safely be enabled at this point; this will improve performance.

@ Disable MMU
MRC p15, 0, r1, c1, c0, 0	@ Read Control Register configuration data
BIC r1, r1, #0x1
MCR p15, 0, r1, c1, c0, 0	@ Write Control Register configuration data

@ Disable L1 Caches
MRC p15, 0, r1, c1, c0, 0	@ Read Control Register configuration data
BIC r1, r1, #(0x1 << 12)	@ Disable I Cache
BIC r1, r1, #(0x1 << 2)		@ Disable D Cache
MCR p15, 0, r1, c1, c0, 0	@ Write Control Register configuration data

@ Invalidate Data cache
@ to make the code general purpose, we calculate the
@ cache size first and loop through each set + way

MRC p15, 1, r0, c0, c0, 0	@ Read Cache Size ID
LDR r3, =0x1ff
AND r0, r3, r0, LSR #13		@ r0 = no. of sets - 1

MOV		r1, #0				@ r1 = way counter for way_loop

way_loop:

MOV		r3, #0				@ r3 = set counter set_loop

set_loop:

MOV		r2, r1, LSL #30
ORR		r2, r2, r3, LSL #5	@ r2 = set/way cache operation format
MCR		p15, 0, r2, c7, c6, 2	@ Invalidate line described by r2
ADD		r3, r3, #1			@ Increment set counter
CMP		r0, r3				@ Last set reached yet? (r0 = no. of sets - 1)
BGE		set_loop			@ if not, iterate set_loop (BGE so the last set is included)
ADD		r1, r1, #1			@ else, next way
CMP		r1, #4				@ Last way reached yet?
BNE		way_loop			@ if not, iterate way_loop

@ Invalidate TLB
MCR		p15, 0, r1, c8, c7, 0

@ Branch Prediction Enable
MOV		r1, #0
MRC		p15, 0, r1, c1, c0, 0	@ Read Control Register configuration data
ORR		r1, r1, #(0x1 << 11)	@ Global BP Enable bit
MCR		p15,  0,  r1,  c1,  c0,  0	@ Write Control Register configuration data
---
After this, you can create some translation tables, as shown in the code of Example 13-4. The variable ttb_address denotes the address to be used for the initial translation table. This must be a 16KB area of memory (whose start address is aligned to a 16KB boundary), to which an L1 translation table can be written by this code.

@ Enable D-side Prefetch
MRC		p15, 0, r1, c1, c0, 1		@ Read Auxiliary Control Register
ORR		r1, r1, #(0x1 << 2)		@ Enable D-side prefetch
MCR		p15, 0, r1, c1, c0, 1		@ Write Auxiliary Control Register
DSB
ISB
@ DSB causes completion of all cache maintenance operations appearing in program
@ order before the DSB instruction
@ An ISB instruction causes the effect of all branch predictor maintenance
@ operations before the ISB instruction to be visible to all instructions
@ after the ISB instruction
@ Initialize PageTable

@ We will create a basic L1 page table in RAM, with 1MB sections containing a flat
@ (VA=PA) mapping, all pages Full Access, Strongly Ordered

@ It would be faster to create this in a read-only section in an assembly file

LDR		r0, =0b00000000000000000000110111100010 @ r0 is the non-address part of descriptor
LDR		r1, ttb_address
LDR		r3, =4095		@ loop counter
write_pte:
ORR		r2, r0, r3, LSL #20		@ OR together address & default PTE bits
STR		r2, [r1, r3, LSL #2]		@ write PTE to TTB
SUBS		r3, r3, #1		@ decrement loop counter
BNE		write_pte		@ entry 0 is written separately

@ for the very first entry in the table, we will make it cacheable, normal, write-back, write allocate

BIC		r0, r0, #0b1100			@ clear CB bits
ORR		r0, r0, #0b0100			@ inner write-back, write allocate
BIC		r0, r0, #0b111000000000000	@ clear TEX bits [14:12]
ORR		r0, r0, #0b101000000000000	@ set TEX as write-back, write allocate
ORR		r0, r0, #0b10000000000000000	@ shareable (S, bit 16)
STR		r0, [r1]

@ Initialize MMU
MOV		r1,#0x0
MCR		p15, 0, r1, c2, c0, 2		@ Write Translation Table Base Control Register
LDR		r1, ttb_address
MCR		p15, 0, r1, c2, c0, 0		@ Write Translation Table Base Register 0

@ In this simple example, we don't use TRE or the Normal Memory Remap Register.
@ Set all Domains to Client
LDR r1, =0x55555555
MCR p15, 0, r1, c3, c0, 0		@ Write Domain Access Control Register

@ Enable MMU
MRC p15, 0, r1, c1, c0, 0		@ Read Control Register configuration data
ORR		r1, r1, #0x1		@ Bit 0 is the MMU enable
MCR		p15, 0, r1, c1, c0, 0		@ Write Control Register configuration data
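For readers who find the descriptor bit-twiddling easier to follow in C, here is a rough host-side sketch that builds the same flat L1 table in an ordinary array. The macro and function names are assumptions made for this example; the bit positions follow the ARMv7-A short-descriptor section format (B bit 2, C bit 3, domain bits 8:5, AP bits 11:10, TEX bits 14:12, S bit 16).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define L1_ENTRIES     4096u
#define PTE_SECTION    0x2u                    /* bits[1:0] = 0b10: 1MB section */
#define PTE_B          (1u << 2)
#define PTE_C          (1u << 3)
#define PTE_DOMAIN(n)  ((uint32_t)(n) << 5)
#define PTE_AP_FULL    (0x3u << 10)            /* full access */
#define PTE_TEX(n)     ((uint32_t)(n) << 12)
#define PTE_S          (1u << 16)              /* shareable */

/* Fill ttb with a flat VA=PA mapping of 1MB sections, Strongly-ordered,
 * then make the first megabyte Normal, write-back, write-allocate, shareable.
 * On a real target ttb must be 16KB aligned; that is not checked here. */
static void build_flat_l1_table(uint32_t *ttb)
{
    const uint32_t flags = PTE_SECTION | PTE_DOMAIN(15) | PTE_AP_FULL;
    uint32_t i;

    assert(ttb != NULL);

    for (i = 0; i < L1_ENTRIES; i++)
        ttb[i] = (i << 20) | flags;            /* section base | attributes */

    ttb[0] = (flags & ~(PTE_C | PTE_B)) | PTE_B | PTE_TEX(5) | PTE_S;
}
```

The default attribute word works out to 0xDE2, the same value loaded into r0 in the assembly, which is a handy cross-check when porting the table to another toolchain.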

The L2 cache, if present, and if running without an operating system, might also require invalidating and enabling at this point. NEON or VFP access must also be enabled. If the system makes use of the TrustZone Security Extensions, it might have to switch to the Normal world when the Secure world is initialized.

The next steps will depend on the exact nature of the system. It might be necessary, for example, to zero-initialize memory that will hold uninitialized C variables, copy the initial values of other variables from a ROM image to RAM, and set up application stack and heap spaces. It might also be necessary to initialize C library functions, call top-level constructors and perform other standard embedded C initialization.

A common approach is to permit a single core within the cluster to perform system initialization, while the same code, if run on a different core, will cause it to sleep, that is, enter WFI state, as described in Chapter 20. The other cores might be woken after core 0 has created a simple set of L1 translation table entries, as these could be used by all cores in the system. Example 13-5 shows example code that determines which core it is running on and either branches to initialization code (if running on core 0), or goes to sleep otherwise. The secondary cores are typically woken up later by an SMP OS.

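@ Only CPU 0 performs initialization. Other CPUs go into WFI
@ to do this, first work out which CPU this is
@ this code typically is run before any other initialization step

MRC p15, 0, r1, c0, c0, 5 @ Read Multiprocessor Affinity Register
ANDS r1, r1, #0x3         @ Extract CPU ID bits; Z is set on CPU 0
BEQ initialize            @ CPU 0 continues (initialize is an assumed label)
wfi_loop:
WFI                       @ Other CPUs sleep until woken
B wfi_loop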

A cache is a small, fast memory that holds copies of recently used data. It is organized as a number of sets, each of which contains several ways.

A set is the group of cache lines that a given memory address can map to. Each line holds a block of data together with a tag identifying the memory address it came from.

A way identifies one line within a set. In a multi-way cache, each way holds one line per set, so a way can be thought of as an independent bank of the cache.

For example, a 4-way set-associative cache has sets with four ways. Each set can hold four lines, one per way. This lets several addresses that map to the same set stay cached at once, which improves the hit rate.

In the example code that loops over every set and way to invalidate all lines in the cache, way is the way index and set is the set index. Accordingly, way_loop iterates over the ways and set_loop iterates over the sets.
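The set and way fields used by the invalidate loop can be derived from the Cache Size ID value instead of being hard-coded. Below is a small host-side sketch in C, assuming the ARMv7-A CCSIDR layout (line size in bits 2:0, associativity in bits 12:3, number of sets in bits 27:13). The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geom {
    uint32_t sets;
    uint32_t ways;
    uint32_t line_bytes;
};

/* Decode an ARMv7-A CCSIDR value into sets, ways and line length. */
static struct cache_geom decode_ccsidr(uint32_t ccsidr)
{
    struct cache_geom g;
    g.line_bytes = 1u << ((ccsidr & 0x7u) + 4u);   /* field = log2(words) - 2 */
    g.ways = ((ccsidr >> 3) & 0x3FFu) + 1u;
    g.sets = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    return g;
}

/* Smallest b such that 2^b >= n. */
static uint32_t ceil_log2(uint32_t n)
{
    uint32_t b = 0;
    while ((1u << b) < n)
        b++;
    return b;
}

/* Build the register operand for a set/way cache maintenance operation:
 * way in the top bits, set just above the line offset, level in bits 3:1. */
static uint32_t set_way_operand(struct cache_geom g, uint32_t set,
                                uint32_t way, uint32_t level)
{
    uint32_t l = ceil_log2(g.line_bytes);
    uint32_t a = ceil_log2(g.ways);
    uint32_t way_field = (a == 0) ? 0 : (way << (32u - a));

    assert(set < g.sets && way < g.ways);
    return way_field | (set << l) | (level << 1);
}
```

For a 32KB, 4-way cache with 32-byte lines this gives a way shift of 30 and a set shift of 5, which are exactly the LSL #30 and LSL #5 constants in the assembly loop.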

13.2 Configuration

There are a number of control register bits within the core that will typically be set by boot code.
In all cases, for best performance, code must run with the MMU, instruction and data caches and branch prediction enabled. Translation table entries for all regions of memory that are not peripheral I/O devices must be marked as L1 Cacheable and (by default) set to read-allocate, write-back cache policy. For multi-core systems, pages must be marked as Shareable and the broadcasting feature for CP15 maintenance operations must be enabled.

In addition to the CP15 registers required by the ARM architecture, cores typically have registers that control implementation specific features. Programmers of boot code should refer to the relevant technical reference manual for the correct usage of these.

13.3 Booting Linux

It is useful to understand what happens from the core coming out of reset and executing its first instruction at the exception base address (0x00000000, or 0xFFFF0000 if HIVECS, known as high vectors, is selected) until the Linux command prompt appears.

When the kernel is present in memory, the sequence on an ARM processor based system is similar to what might happen on a desktop computer. However, the bootloading process can be very different, as ARM processor based phones or more deeply embedded devices can lack a hard drive or PC-like BIOS.

Typically, what happens when you power the system on is that hardware-specific boot code runs from flash or ROM. This code initializes the system, including any necessary hardware peripheral code, and then launches the bootloader (for example U-Boot). This initializes main memory and copies the compressed Linux kernel image into main memory (from a flash device, memory on a board, MMC, host PC or elsewhere). The bootloader passes certain initialization parameters to the kernel. The Linux kernel then decompresses itself, initializes its data structures and starts running user processes, before starting the command shell environment. Let's take a more detailed look at each of those processes.

13.3.1 Reset handler

There is typically a small amount of system-specific boot monitor code that configures memory controllers and performs other system peripheral initialization. It sets up stacks in memory and typically copies itself from ROM into RAM, before changing the hardware memory mapping so that RAM is mapped to the exception vector address, rather than ROM. In essence this code is independent of which operating system is to be run on the board and performs a function similar to a PC BIOS. When it has completed execution, it will call a Linux bootloader, such as U-Boot.

13.3.2 Bootloader

Linux requires a certain amount of code to be run out of reset, to initialize the system. This performs the basic tasks required for the kernel to boot:

Initializing the memory system and peripherals.
Loading the kernel image to an appropriate location in memory (and possibly also an initial RAM disk).
Generating the boot parameters to be passed to the kernel (including machine type).
Setting up a console (video or serial) for the kernel.
Entering the kernel.

The exact steps taken vary between different bootloaders, so for detailed information, refer to the documentation for the one that you want to use. U-Boot is a widely used example, but other bootloader possibilities include Apex, Blob, Bootldr and RedBoot.

When the bootloader starts, it is typically not present in main memory. It must start by allocating a stack and initializing the core (for example, invalidating its caches) and installing itself to main memory. It must also allocate space for global data and for use by malloc(), and copy exception vector entries into the appropriate location.

13.3.3 Initialize memory system

This is very much a board or system specific piece of code. The Linux kernel has no responsibility for the configuration of the RAM in the system. It is presented with the physical memory layout, but has no other knowledge of the memory system. In many systems, the available RAM and its location are fixed and the bootloader task is straightforward. In other systems, code must be written that discovers the amount of RAM available in the system.

13.3.4 Kernel images

The kernel image from the build process is typically compressed in zImage format (the conventional name given to the bootable kernel image). Its head code contains a magic number, used to verify the integrity of the decompression, plus start and end addresses. The kernel code is position independent and can be located anywhere in memory. Conventionally, it is placed at a 0x8000 offset from the base of physical RAM. This gives space for the parameter block placed at a 0x100 offset (used for translation tables etc.).
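As an illustration of the header check and the conventional placement, here is a minimal C sketch that can be run on a host. It assumes the usual little-endian ARM zImage layout, where the magic value 0x016F2818 sits at byte offset 0x24 and is followed by the start and end addresses. The function and struct names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZIMAGE_MAGIC        0x016F2818u
#define ZIMAGE_MAGIC_OFFSET 0x24u

struct zimage_info {
    uint32_t start;   /* start address recorded in the header */
    uint32_t end;     /* end address recorded in the header */
};

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *info if the buffer carries the zImage magic. */
static int parse_zimage(const uint8_t *img, size_t len, struct zimage_info *info)
{
    assert(img != NULL && info != NULL);

    if (len < ZIMAGE_MAGIC_OFFSET + 12u)
        return 0;
    if (read_le32(img + ZIMAGE_MAGIC_OFFSET) != ZIMAGE_MAGIC)
        return 0;

    info->start = read_le32(img + ZIMAGE_MAGIC_OFFSET + 4u);
    info->end   = read_le32(img + ZIMAGE_MAGIC_OFFSET + 8u);
    return 1;
}

/* Conventional placement relative to the base of physical RAM. */
static uint32_t kernel_load_address(uint32_t ram_base) { return ram_base + 0x8000u; }
static uint32_t param_block_address(uint32_t ram_base) { return ram_base + 0x100u; }
```

A bootloader would typically run a check like this before copying the image, then place the parameter block and kernel at the two offsets above.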

Many systems require an initial RAM disk (initrd), as this lets you have a root filesystem available without other drivers being set up. The bootloader can place an initial ramdisk image into memory and pass the location of this to the kernel using ATAG_INITRD2 (a tag that describes the physical location of the compressed RAM disk image) and ATAG_RAMDISK.

The bootloader will typically set up a serial port in the target, enabling the kernel serial driver to detect the port and use it for a console. In some systems, another output device such as a video driver can be used as a console. The kernel command line parameter console= can be used to pass the information.

13.3.5 Kernel parameters using ATAGs

Historically, the parameters passed to the kernel are in the form of a tagged list, placed in physical RAM with register R2 holding the address of the list. Tag headers hold two 32-bit unsigned ints, with the first giving the size of the tag in words and the second providing the tag value (indicating the type of tag). For a full list of parameters that can be passed, consult the appropriate documentation. Examples include ATAG_MEM to describe the physical memory map and ATAG_INITRD2 to describe where the compressed ramdisk image is located. The bootloader must also provide an ARM Linux machine type number (MACH_TYPE). This can be a hard-coded value, or the boot code can inspect the available hardware and assign a value accordingly.
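A tagged list is easy to picture as an array of 32-bit words. The following C sketch builds a small list on a host machine. The tag values are the ones used by ARM Linux, while the helper names put_tag and build_atags, and the particular tags chosen, are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATAG_NONE    0x00000000u
#define ATAG_CORE    0x54410001u
#define ATAG_MEM     0x54410002u
#define ATAG_INITRD2 0x54420005u

/* Append one tag. The size field counts 32-bit words, header included. */
static size_t put_tag(uint32_t *buf, size_t pos, uint32_t tag,
                      const uint32_t *payload, size_t n)
{
    size_t i;
    buf[pos] = (uint32_t)(n + 2u);
    buf[pos + 1] = tag;
    for (i = 0; i < n; i++)
        buf[pos + 2 + i] = payload[i];
    return pos + n + 2u;
}

/* Build CORE, MEM, INITRD2 and the terminator; returns words written. */
static size_t build_atags(uint32_t *buf, size_t cap,
                          uint32_t mem_size, uint32_t mem_start,
                          uint32_t initrd_start, uint32_t initrd_size)
{
    uint32_t mem[2], initrd[2];
    size_t pos = 0;

    assert(buf != NULL && cap >= 12u);      /* 2 + 4 + 4 + 2 words */

    mem[0] = mem_size;        mem[1] = mem_start;
    initrd[0] = initrd_start; initrd[1] = initrd_size;

    pos = put_tag(buf, pos, ATAG_CORE, NULL, 0);   /* the list must start with CORE */
    pos = put_tag(buf, pos, ATAG_MEM, mem, 2);
    pos = put_tag(buf, pos, ATAG_INITRD2, initrd, 2);

    buf[pos++] = 0;           /* the terminator has size 0 */
    buf[pos++] = ATAG_NONE;
    return pos;
}
```

On a target, buf would be the parameter block in physical RAM, and its address is what the bootloader places in R2.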

There is a more flexible, or generic method for passing this information using Flattened Device Trees(FDTs).

13.3.6 Kernel parameters using Flattened Device Trees

The Linux device tree or FDT support was introduced for the PowerPC kernel as a part of the merger of a 32-bit and 64-bit kernel to standardize the firmware interface by using an Open Firmware interface for all PowerPC platforms, servers, desktop and embedded. It has become the configuration methodology used in the Linux kernel for PowerPC, MicroBlaze and SPARC architectures.

A device tree is a data structure that describes the hardware configuration. It includes information about processors, memory sizes and banks, interrupt configuration, and peripherals. The data structure is organized as a tree with a single root node named /. With the exception of the root node, each node has a single parent. Each node has a name and can have any number of child nodes. Nodes can also contain named properties with arbitrary data, expressed as key-value pairs.

The device tree data follows the conventions defined in IEEE standard 1275. To simplify system description, a device tree source format (.dts) is used to express device tree data.

A device tree node must comply with the following syntax:

[label:] node-name[@unit-address] {
[properties definitions]
[child nodes]
};

Nodes are defined with a name and a unit-address. Braces mark the beginning and end of the node definition.

You can use a Device Tree Compiler (DTC) tool to convert the device tree source file (.dts) to the device tree blob (dtb) format. The dtb, or blob, is known as the Flattened Device Tree and is a firmware-independent description of the system, in a compact format that requires no firmware calls to retrieve its properties. The Linux kernel loads the dtb before it loads the operating system.
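Before handing a blob to the kernel, a bootloader can run a quick sanity check on its header. The C sketch below assumes the standard flattened device tree header, which is big-endian, starts with the magic value 0xD00DFEED and stores the total blob size in the next word. The function names are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC      0xD00DFEEDu
#define FDT_HEADER_MIN 40u          /* ten 32-bit header fields */

/* dtb header fields are stored big-endian regardless of the CPU. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns the blob's total size in bytes, or 0 if it does not look like a dtb. */
static uint32_t fdt_total_size(const uint8_t *blob, size_t len)
{
    uint32_t total;

    assert(blob != NULL);

    if (len < FDT_HEADER_MIN)
        return 0;
    if (read_be32(blob) != FDT_MAGIC)
        return 0;

    total = read_be32(blob + 4);
    if (total < FDT_HEADER_MIN || total > len)
        return 0;
    return total;
}
```

This only validates the first two header fields. A fuller parser would also check the version fields and the structure and strings offsets.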

The chosen node is a placeholder for any environment information that does not belong anywhere else, such as boot arguments for the kernel and default console. Chosen node properties are usually defined by the bootloader, but the dts file can specify a default value.

The following code fragment shows a root node description for an ARM Versatile Platform Board. The model and compatible properties are assigned the name of the platform in the form manufacturer,model. This string concatenation is the universal identifier for the machine and must be defined at the top node.

/ {
model = "arm,versatilepb";
compatible = "arm,versatilepb";
#address-cells = <1>;
#size-cells = <1>;

memory {
name = "memory";
device_type = "memory";
reg = <0x0 0x08000000>;
};

chosen {
       bootargs = "console=ttyAMA0 debug";
       };

};

13.3.7 Kernel entry

Kernel execution must commence with the core in a fixed state. The bootloader calls the kernel image by branching directly to its first instruction, the start label in arch/arm/boot/compressed/head.S. The MMU and data cache must be disabled. The core must be in Supervisor mode, with the CPSR I and F bits set (IRQ and FIQ disabled). R0 must contain 0, R1 the MACH_TYPE value and R2 the address of the tagged list of parameters.
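The entry requirements can be written down as a simple check. The C sketch below models the register state as a plain struct so it can run on a host. The struct, the function name and the idea of checking a snapshot are assumptions for illustration, while the bit positions are the ARMv7-A CPSR and SCTLR encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu
#define CPSR_MODE_SVC  0x13u
#define CPSR_F_BIT     (1u << 6)   /* FIQ masked when set */
#define CPSR_I_BIT     (1u << 7)   /* IRQ masked when set */
#define SCTLR_M        (1u << 0)   /* MMU enable */
#define SCTLR_C        (1u << 2)   /* data cache enable */

/* Hypothetical snapshot of the state just before the branch to the kernel. */
struct entry_state {
    uint32_t r0, r1, r2;
    uint32_t cpsr;
    uint32_t sctlr;
};

/* Returns 1 if the snapshot meets the kernel entry requirements. */
static int kernel_entry_state_ok(const struct entry_state *s,
                                 uint32_t mach_type, uint32_t params_addr)
{
    assert(s != NULL);

    if (s->r0 != 0u || s->r1 != mach_type || s->r2 != params_addr)
        return 0;
    if ((s->cpsr & CPSR_MODE_MASK) != CPSR_MODE_SVC)
        return 0;
    if ((s->cpsr & (CPSR_I_BIT | CPSR_F_BIT)) != (CPSR_I_BIT | CPSR_F_BIT))
        return 0;
    if (s->sctlr & (SCTLR_M | SCTLR_C))
        return 0;                  /* MMU and data cache must be off */
    return 1;
}
```

The instruction cache bit is deliberately not checked, since the requirement stated above covers only the MMU and the data cache.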

The first step in getting the kernel working is to decompress it. This is mostly architecture independent. The parameters passed from the bootloader are saved and the caches and MMU are enabled. Checks are made to see if the decompressed image will overwrite the compressed image, before calling decompress_kernel() in arch/arm/boot/compressed/misc.c. The cache is then cleaned and invalidated before being disabled again. We then branch to the kernel startup entry point in arch/arm/kernel/head.S.

13.3.8 Platform-specific actions

A number of architecture-specific tasks are now undertaken. The first checks the core type using lookup_processor_type(), which returns a code specifying which core it is running on. The function lookup_machine_type() is then used (unsurprisingly) to look up the machine type. A basic set of translation tables is then defined which map the kernel code. The cache and MMU are initialized and other control registers set. The data segment is copied to RAM and start_kernel() is called.

13.3.9 Kernel start-up code

In principle, the rest of the startup sequence is the same on any architecture, but in fact some functions are still hardware dependent.

IRQ interrupts are disabled with local_irq_disable(), while lock_kernel() is used to stop FIQ interrupts from interrupting the kernel. It initializes the tick control, memory system and architecture-specific subsystems and deals with the command line options passed by the bootloader.
Stacks are set up and the Linux scheduler is initialized.
The various memory areas are set up and pages allocated.
The interrupt and exception table and handlers are set up, along with the GIC.
The system timer is set up and at this point IRQs are enabled. Additional memory system initialization occurs and then a value called BogoMips is used to calibrate the core clock speed.
Internal components of the kernel are set up, including the filesystem and the initialization process, followed by the thread daemon that creates kernel threads.
The kernel is unlocked (FIQ enabled) and the scheduler started.
The function do_basic_setup() is called to initialize drivers, sysctl, work queues and network sockets. At this point, the switch to User mode is performed.

The memory map used by Linux is shown in Figure 13-1. ZI refers to zero-initialized data. There is a broad split between kernel memory, above address 0xBF000000, and user memory, below that address. Kernel memory uses global mappings, while user memory uses non-global mappings, although both code and data can be shared between processes. As already mentioned, application code starts at 0x1000, leaving the first 4KB page unused, to enable trapping of NULL pointer references.
