Paging Initialization

When setup_arch() sets up the system memory,
it calls paging_init() to set up the Page Directory Table and the Page Table Entries.

This function is defined in ${linux src}/arch/i386/mm/init.c.

Intel processors internally use two-level page translation from virtual to physical addresses.
If the compile-time configuration defines CONFIG_X86_PAE,
three-level page translation is adopted.
Three-level translation is the basic memory page translation scheme in Linux.

Here are macros and definitions for paging_init() function.

There are typedefs in ${linux src}/include/asm/page.h.

#ifdef CONFIG_X86_PAE
extern unsigned long long __supported_pte_mask;
typedef struct { unsigned long pte_low, pte_high; } pte_t;
typedef struct { unsigned long long pmd; } pmd_t;
typedef struct { unsigned long long pgd; } pgd_t;
typedef struct { unsigned long long pgprot; } pgprot_t;
#define pmd_val(x)      ((x).pmd)
#define pte_val(x)      ((x).pte_low | ((unsigned long long)(x).pte_high << 32))
#define __pmd(x) ((pmd_t) { (x) } )
#define HPAGE_SHIFT     21

And for 2-level translation (CONFIG_X86_PAE not defined):

typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pgd; } pgd_t;
typedef struct { unsigned long pgprot; } pgprot_t;
#define boot_pte_t pte_t /* or would you rather have a typedef */
#define pte_val(x)      ((x).pte_low)
#define HPAGE_SHIFT     22
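
As a rough illustration of the PAE pte_val() macro above, this user-space sketch rebuilds the 64-bit value from the two 32-bit halves. The names demo_pte_t, demo_pte_val and demo_pte_roundtrip are hypothetical stand-ins, and uint32_t is used in place of the kernel's unsigned long, which is 32 bits wide on i386.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PAE pte_t; uint32_t models the
 * 32-bit unsigned long of i386. */
typedef struct { uint32_t pte_low, pte_high; } demo_pte_t;

/* Same expression as pte_val() in the PAE case. */
uint64_t demo_pte_val(demo_pte_t x)
{
    return x.pte_low | ((uint64_t)x.pte_high << 32);
}

/* Round trip: split a 64-bit entry into halves and check that
 * demo_pte_val() restores the original value. */
void demo_pte_roundtrip(uint64_t entry)
{
    demo_pte_t pte = { (uint32_t)entry, (uint32_t)(entry >> 32) };
    assert(demo_pte_val(pte) == entry);
}
```

In the 2-level case there is only pte_low, so pte_val() is just that one field.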

When ${linux src}/include/asm/pgtable-3level.h is included,
${linux src}/include/asm-generic/pgtable-nopud.h is also included.
pud_t is defined in this header file.

typedef struct { pgd_t pgd; } pud_t;

And in pgtable-3level.h, set_pud is defined as follows.

#define set_pud(pudptr,pudval) \
                set_64bit((unsigned long long *)(pudptr),pud_val(pudval))

set_64bit is defined in ${linux src}/include/asm-i386/system.h
and is built on __set_64bit().

static inline void __set_64bit (unsigned long long * ptr,
                unsigned int low, unsigned int high)
{
        __asm__ __volatile__ (
                "\n1:\t"
                "movl (%0), %%eax\n\t"
                "movl 4(%0), %%edx\n\t"
                "lock cmpxchg8b (%0)\n\t"
                "jnz 1b"
                : /* no outputs */
                :       "D"(ptr),
                        "b"(low),
                        "c"(high)
                :       "ax","dx","memory");
}

The cmpxchg8b instruction is described as follows:

Compares the 64-bit value in EDX:EAX with the operand (destination operand).
If the values are equal, the 64-bit value in ECX:EBX is stored
in the destination operand.
Otherwise, the value in the destination operand is loaded into EDX:EAX.
The destination operand is an 8-byte memory location.
For the EDX:EAX and ECX:EBX register pairs,
EDX and ECX contain the high-order 32 bits and EAX and EBX
contain the low-order 32 bits of a 64-bit value.

This description is extracted from http://

__set_64bit() uses __asm__ inline assembler.
It puts the low 32 bits in %ebx ("b"(low)), the high 32 bits in %ecx ("c"(high))
and the first argument ptr in %edi ("D"(ptr)).
%0 in the code above is the first operand, unsigned long long *ptr.

It loads the 64-bit value pointed to by ptr into %edx:%eax, then cmpxchg8b
compares %edx:%eax with the memory pointed to by ptr.
If the values are equal, the new value in %ecx:%ebx is stored into that memory.
If they are not equal (the memory changed in between), jnz 1b retries from the load.
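
The same load / compare-and-swap / retry pattern can be sketched in portable C11 with <stdatomic.h>. This is only an illustration of the idea, not the kernel's code: demo_set_64bit is a hypothetical name, and the compiler decides which instruction actually implements the compare-and-swap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable counterpart of __set_64bit(): read the current
 * value, then try to swap in the new one; retry if the location changed
 * between the read and the swap (the role of "jnz 1b"). */
void demo_set_64bit(_Atomic uint64_t *ptr, uint32_t low, uint32_t high)
{
    assert(ptr != NULL);

    uint64_t desired  = ((uint64_t)high << 32) | low;  /* like ECX:EBX */
    uint64_t expected = atomic_load(ptr);              /* like EDX:EAX */

    /* On failure, expected is refreshed with the current contents,
     * much as cmpxchg8b reloads EDX:EAX. */
    while (!atomic_compare_exchange_weak(ptr, &expected, desired))
        ;
}
```

The point of the loop is that the 64-bit store happens as one atomic step, so another CPU never sees a half-updated entry.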

And set_pgd is defined in ${linux src}/include/asm-generic/pgtable-nopud.h.

#define set_pgd(pgdptr, pgdval) set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })

In short, it stores the value pgdval into the memory pointed to by pgdptr.


paging_init() first calls pagetable_init(), which is also defined in
${linux src}/arch/i386/mm/init.c.

At the beginning of this function, every entry of the kernel Page Directory Table
(swapper_pg_dir, a table of 1024 entries) is initialized to point to empty_zero_page.

The local variable pgd_base is initialized first.

        pgd_t *pgd_base = swapper_pg_dir;

swapper_pg_dir looks like

unsigned long long swapper_pg_dir[1024];

and a pointer to pgd_t is a pointer to "{unsigned long long pgd}",
so pgd_base points to the start address of the Page Directory Table.


        for (i = 0; i < PTRS_PER_PGD; i++)
                set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));

__pa() subtracts 0xC0000000 (PAGE_OFFSET) from a kernel virtual address to get the physical address.
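
The arithmetic is small enough to try in user space. The sketch below assumes 32-bit addresses; demo_pa and demo_va are hypothetical names that mirror __pa() and __va().

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_OFFSET 0xC0000000u   /* start of kernel space (3 Gbytes) */

/* Hypothetical equivalent of __pa(): kernel virtual -> physical. */
uint32_t demo_pa(uint32_t vaddr)
{
    assert(vaddr >= DEMO_PAGE_OFFSET);  /* only kernel addresses map this way */
    return vaddr - DEMO_PAGE_OFFSET;
}

/* Hypothetical equivalent of __va(): physical -> kernel virtual. */
uint32_t demo_va(uint32_t paddr)
{
    return paddr + DEMO_PAGE_OFFSET;
}
```

For example, demo_pa(0xC0100000) gives 0x00100000, and demo_va() of that result gives back 0xC0100000.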

Then it calls kernel_physical_mapping_init(), which is the main routine
for paging initialization.

Many sub-functions and macros are called in kernel_physical_mapping_init().
One of them is alloc_bootmem_low_pages().
alloc_bootmem_low_pages() is a wrapper for __alloc_bootmem().
It is defined in ${linux src}/include/linux/bootmem.h.

#define alloc_bootmem_low_pages(x) \
        __alloc_bootmem((x), PAGE_SIZE, 0)

__alloc_bootmem() in turn simply calls __alloc_bootmem_core().
Both functions are defined in ${linux src}/mm/bootmem.c.

__alloc_bootmem_core() is a rather long function.
It is declared as follows.

__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
                unsigned long align, unsigned long goal)

But what this function does is simple.
It searches the bitmap for the first cleared (unused) bit.
If it finds one, it takes a run of bits matching the requested size
and sets them to 1, which marks those pages as used.

When goal is not zero, it tries to get pages above goal.
Then it returns a void pointer to the allocated memory.
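
The search described above can be sketched as a first-fit scan. This is a simplified model, not the code in mm/bootmem.c: demo_map, DEMO_NPAGES and demo_alloc_pages are hypothetical names, the map uses one byte per page instead of one bit, alignment is ignored, and the sketch simply fails if nothing fits at or above goal.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NPAGES 64   /* hypothetical number of pages tracked */

/* One byte per page for clarity; a real bitmap packs one bit per page. */
static unsigned char demo_map[DEMO_NPAGES];

/* Find `count` consecutive free pages at or above index `goal`,
 * mark them used, and return the first index, or -1 if none fit. */
int demo_alloc_pages(size_t count, size_t goal)
{
    assert(count > 0);
    for (size_t start = goal; start + count <= DEMO_NPAGES; start++) {
        size_t n = 0;
        while (n < count && !demo_map[start + n])
            n++;
        if (n == count) {
            for (size_t i = 0; i < count; i++)
                demo_map[start + i] = 1;   /* 1 means "in use" */
            return (int)start;
        }
        start += n;   /* skip past the used page that ended the run */
    }
    return -1;
}
```

Returning an index keeps the sketch short; the kernel function turns the page number into a virtual address and returns that as a void pointer.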

one_md_table_init() is the function that gets a page for the Page Middle Directory.
It is defined in ${linux src}/arch/i386/mm/init.c.

static pmd_t * __init one_md_table_init(pgd_t *pgd)
{
        pud_t *pud;
        pmd_t *pmd_table;

#ifdef CONFIG_X86_PAE
        pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
        set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
        pud = pud_offset(pgd, 0);
        if (pmd_table != pmd_offset(pud, 0))
                BUG();
#else
        pud = pud_offset(pgd, 0);
        pmd_table = pmd_offset(pud, 0);
#endif

        return pmd_table;
}

If CONFIG_X86_PAE is defined, the kernel uses 3-level address translation.
In that case, it gets one page for the Page Middle Directory with alloc_bootmem_low_pages()
and stores its address in pmd_table.

Then it sets the Page Global Directory entry to that address or-ed with _PAGE_PRESENT.
_PAGE_PRESENT is defined as 0x001. See the directory entry.

pud_offset() is defined in ${linux src}/include/asm-generic/pgtable-nopud.h.

static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
{
        return (pud_t *)pgd;
}

This function only returns the pgd passed as argument, cast to a pointer to pud.
So pud in the code is a pointer to the pgd entry.

pmd_offset() and pud_page() are defined in ${linux src}/include/asm-i386/pgtable-3level.h.

#define pmd_offset(pud, address) ((pmd_t *) pud_page(*(pud)) + \
                        pmd_index(address))

#define pud_page(pud) \
((struct page *) __va(pud_val(pud) & PAGE_MASK))


#define pud_val(x)                              (pgd_val((x).pgd))

in ${linux src}/include/asm-generic/pgtable-nopud.h.

#define pgd_val(x)      ((x).pgd)

in ${linux src}/include/asm/page.h

pud_val(pud) -> pgd_val(pud.pgd) -> pud.pgd.pgd
As shown at the top of this page, pud_t has a member of type pgd_t
and pgd_t has a member of type unsigned long long,
so pud.pgd.pgd is the value of the pgd entry.

pud_page() clears the attribute bits of the directory entry.

pmd_offset(pud, 0) = ((pmd_t *) __va(pud_val(pud) & PAGE_MASK))
= ((pmd_t *) __va(*pgd & PAGE_MASK))

So, pmd_offset(pud, 0) returns the same value as pmd_table.
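
To see why the BUG() check should never fire, here is a small sketch of building an entry and masking it again. It works on plain 64-bit numbers; demo_make_entry and demo_entry_to_paddr are hypothetical helpers, and the DEMO_ constants assume the 4 Kbyte page size and the _PAGE_PRESENT value given in this article.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE    4096u
#define DEMO_PAGE_MASK    (~(uint64_t)(DEMO_PAGE_SIZE - 1))
#define DEMO_PAGE_PRESENT 0x001u

/* Build a directory entry from a page-aligned physical address,
 * as set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT)) does. */
uint64_t demo_make_entry(uint64_t table_paddr)
{
    assert((table_paddr & (DEMO_PAGE_SIZE - 1)) == 0);  /* must be aligned */
    return table_paddr | DEMO_PAGE_PRESENT;
}

/* Strip the attribute bits again, as pud_page() does before __va(). */
uint64_t demo_entry_to_paddr(uint64_t entry)
{
    return entry & DEMO_PAGE_MASK;
}
```

Because the table is page aligned, its low 12 bits are free for attributes, and masking them off gives back exactly the address that was stored.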

If CONFIG_X86_PAE is not defined, two-level translation is used internally.
In this case, the Page Middle Directory is not used and pgtable-2level.h is included.
This header file includes asm-generic/pgtable-nopmd.h implicitly.
In this header file, pmd_offset() is defined.

static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
{
        return (pmd_t *)pud;
}

In either case, it returns pmd_table, which is the address of
the corresponding Page Middle Directory.

A similar function is one_page_table_init().

static pte_t * __init one_page_table_init(pmd_t *pmd)
{
        if (pmd_none(*pmd)) {
                pte_t *page_table = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
                set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
                if (page_table != pte_offset_kernel(pmd, 0))
                        BUG();

                return page_table;
        }

        return pte_offset_kernel(pmd, 0);
}

This function does the same thing as one_md_table_init() except that
it gets one page for a pte table, not a pmd table.

While the pmd level depends on the number of translation levels,
the pte level exists in both cases.

set_pmd() for 3 level is defined in ${linux src}/include/asm-i386/pgtable-3level.h

#define set_pmd(pmdptr,pmdval) \
                set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))

And set_pmd() for 2 level is defined in ${linux src}/include/asm-i386/pgtable-2level.h

#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))

The difference is whether the entry is 64 bits or 32 bits wide.

Setting Up Page Tables for Kernel Space

The innermost loop for setting up the kernel page table is

          for (pte_ofs = 0; pte_ofs < PTRS_PER_PTE && pfn < max_low_pfn; pte++, pfn++, pte_ofs++) {
                           if (is_kernel_text(address))
                                     set_pte(pte, pfn_pte(pfn, PAGE_KERNEL_EXEC));
                           else
                                     set_pte(pte, pfn_pte(pfn, PAGE_KERNEL));
          }

PTRS_PER_PTE is 512 for 3-level and 1024 for 2-level translation.
These are defined in the pgtable-[2|3]level-defs.h header files.

pfn_pte() is the function that builds the value of a Page Table Entry.
For 3-level paging, it is defined in ${linux src}/include/asm-i386/pgtable-3level.h.


static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
        pte_t pte;

        pte.pte_high = (page_nr >> (32 - PAGE_SHIFT)) | \
                                        (pgprot_val(pgprot) >> 32);
        pte.pte_high &= (__supported_pte_mask >> 32);
        pte.pte_low = ((page_nr << PAGE_SHIFT) | pgprot_val(pgprot)) & \
                                                        __supported_pte_mask;
        return pte;
}

In the Linux paging system the page size is 4 Kbytes, so the low 32 bits should hold
page_nr (the page frame number passed as argument) << PAGE_SHIFT (=12).

For 3-level paging, a pte holds a 64-bit value while the page frame number is 32 bits.

 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 64bit
       Page Frame Number             000000xx xxxxxxxx xxxxxxxx xxxxxxxx 32bit
 Shifted Page Frame Number    hhhhhh xxxxxxxx xxxxxxxx xxxx0000 00000000 32bit

This diagram shows the Page Frame Number and the Shifted Page Frame Number.
"x" is a bit that may be non-zero and "h" is a bit that overflows out of the low 32 bits when shifted.
To get the series of "h", pfn is shifted left by 12
and then shifted right by 32.
The net shift needed to get the series of "h" is (32 - PAGE_SHIFT (=12)) to the right.

The code performs this operation.
pgprot is or-ed into both the high and low 32 bits.
__supported_pte_mask is initialized to ~_PAGE_NX (defined in mm/init.c).
It can only clear the NX bit, so it leaves the page frame number and the other attribute bits unchanged.
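
The split into high and low halves can be checked against plain 64-bit arithmetic. The sketch below is a user-space model under a few assumptions: demo_pae_pte_t, demo_pfn_pte, demo_pfn_pte_wide and demo_check_pfn are hypothetical names, uint32_t stands in for the 32-bit unsigned long of i386, and the __supported_pte_mask step is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12

typedef struct { uint32_t pte_low, pte_high; } demo_pae_pte_t;

/* Split (pfn << PAGE_SHIFT) | prot into two 32-bit halves using only
 * 32-bit shifts on the pfn, following the shape of pfn_pte(). */
demo_pae_pte_t demo_pfn_pte(uint32_t pfn, uint64_t prot)
{
    demo_pae_pte_t pte;

    pte.pte_high = (pfn >> (32 - DEMO_PAGE_SHIFT)) | (uint32_t)(prot >> 32);
    pte.pte_low  = (pfn << DEMO_PAGE_SHIFT) | (uint32_t)prot;
    return pte;
}

/* Reference result computed with plain 64-bit arithmetic. */
uint64_t demo_pfn_pte_wide(uint32_t pfn, uint64_t prot)
{
    return ((uint64_t)pfn << DEMO_PAGE_SHIFT) | prot;
}

/* Check that both ways of computing the entry agree. */
void demo_check_pfn(uint32_t pfn, uint64_t prot)
{
    demo_pae_pte_t pte = demo_pfn_pte(pfn, prot);
    uint64_t wide = demo_pfn_pte_wide(pfn, prot);

    assert(pte.pte_low  == (uint32_t)wide);
    assert(pte.pte_high == (uint32_t)(wide >> 32));
}
```

The right shift by 20 picks up exactly the bits that the left shift by 12 pushes out of the low word.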

So, the innermost loop sets each pte entry according to its Page Frame Number.
Before this loop, one page table for the Page Table Entries is allocated
by one_page_table_init(pmd) and its start address is stored in the memory
pointed to by pmd.

The 2-level operation is similar and simpler.
The functions with the same names are defined in pgtable-2level.h.

The second inner loop sets the Middle Directory Entries.

                pmd = one_md_table_init(pgd);
                if (pfn >= max_low_pfn)
                        continue;
                for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; pmd++, pmd_idx++) {

First, one page for the Middle Directory is allocated and its start address
is stored in the memory pointed to by pgd (the Global Directory Entry).
While the Page Frame Number (pfn) is within low memory (pfn < max_low_pfn),
the innermost loop runs to set up the Page Table Entries.

The outermost loop is

        for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {

and pgd_idx is initialized by

         pgd_idx = pgd_index(PAGE_OFFSET);

before the loop. For 2-level translation, pgd_index(PAGE_OFFSET) = 0x300.
This is the index of the start address of kernel space (3 Gbytes).
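
The index is just the top bits of the address. The sketch below uses shift values that are not quoted in this article and are stated here as assumptions: 22 for 2-level translation (each entry covers 4 Mbytes, 1024 entries) and 30 for PAE (each entry covers 1 Gbyte, 4 entries). demo_pgd_index is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_OFFSET      0xC0000000u
#define DEMO_PGDIR_SHIFT_2LVL 22   /* assumed: each entry covers 4 Mbytes */
#define DEMO_PGDIR_SHIFT_PAE  30   /* assumed: each entry covers 1 Gbyte  */

/* Index of the page global directory entry that covers vaddr.
 * ptrs_per_pgd must be a power of two. */
unsigned demo_pgd_index(uint32_t vaddr, unsigned pgdir_shift,
                        unsigned ptrs_per_pgd)
{
    assert(pgdir_shift < 32);
    return (vaddr >> pgdir_shift) & (ptrs_per_pgd - 1);
}
```

With these values, demo_pgd_index(DEMO_PAGE_OFFSET, 22, 1024) is 0x300, and the PAE layout gives 3 for the same address.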

At this point, the kernel virtual address space (above 0xC0000000)
is mapped correctly to physical memory (from 0x0 up to max_low_pfn).
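
To get a feel for how the three loops fit together, the following sketch keeps only the loop structure shown above and counts how many page tables would be filled. It is a simplified model, not the kernel routine: demo_count_page_tables and its parameters are hypothetical, nothing is allocated, and any step not shown in this article is left out.

```c
#include <assert.h>

/* Count the page tables filled by the nested loops for a given layout.
 * pgd_start plays the role of pgd_index(PAGE_OFFSET). */
unsigned demo_count_page_tables(unsigned pgd_start, unsigned ptrs_per_pgd,
                                unsigned ptrs_per_pmd, unsigned ptrs_per_pte,
                                unsigned long max_low_pfn)
{
    unsigned long pfn = 0;
    unsigned tables = 0;

    assert(ptrs_per_pmd > 0 && ptrs_per_pte > 0);
    for (unsigned pgd_idx = pgd_start; pgd_idx < ptrs_per_pgd; pgd_idx++) {
        if (pfn >= max_low_pfn)
            continue;
        for (unsigned pmd_idx = 0;
             pmd_idx < ptrs_per_pmd && pfn < max_low_pfn; pmd_idx++) {
            tables++;                   /* one_page_table_init() */
            for (unsigned pte_ofs = 0;
                 pte_ofs < ptrs_per_pte && pfn < max_low_pfn;
                 pte_ofs++, pfn++)
                ;                       /* set_pte() would go here */
        }
    }
    return tables;
}
```

For example, with a hypothetical max_low_pfn of 0x38000 (896 Mbytes of low memory), the 2-level layout (start index 0x300, 1024 pgd entries, 1 pmd entry, 1024 ptes per table) fills 224 page tables, and a PAE layout (start index 3, 4 pgd entries, 512 pmd entries, 512 ptes per table) fills 448.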
