Virtual Memory Allocation

vmalloc() is a function similar to kmalloc().
While kmalloc() tries to get a physically contiguous memory region as a series of pages,
vmalloc() does not care about physical contiguity: it obtains whatever regions of memory
are available and maps them virtually so that they appear contiguous.

So, in comparison with kmalloc(), it takes somewhat more time to
allocate a memory region as pages.

Almost all memory allocations are done by calling kmalloc(),
except where there is a necessity for a large contiguous memory region.


vmalloc() is defined in ${linux src}/mm/vmalloc.c

void *vmalloc(unsigned long size)
{
       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
}

This function is only a thin wrapper around __vmalloc().


__vmalloc() is also defined in ${linux src}/mm/vmalloc.c

void *__vmalloc(unsigned long size, unsigned int __nocast gfp_mask, pgprot_t prot)
{
        struct vm_struct *area;

        size = PAGE_ALIGN(size);
        if (!size || (size >> PAGE_SHIFT) > num_physpages)
                return NULL;

        area = get_vm_area(size, VM_ALLOC);
        if (!area)
                return NULL;

        return __vmalloc_area(area, gfp_mask, prot);
}

The first argument is the requested size, the second is a flag set for allocation,
and the third is the protection mode for the new region.

The second argument is passed from vmalloc() as GFP_KERNEL | __GFP_HIGHMEM.
__GFP_HIGHMEM tells the allocator that pages from the HIGHMEM zone are acceptable.
This keeps the NORMAL (and DMA) zones free for callers that really need low memory,
for example for DMA transfers.

Both are defined in ${linux src}/include/linux/gfp.h

#define __GFP_WAIT      0x10u   /* Can wait and reschedule? */
#define __GFP_HIGH      0x20u   /* Should access emergency pools? */
#define __GFP_IO        0x40u   /* Can start physical IO? */
#define __GFP_FS        0x80u   /* Can call down to low-level FS? */
#define __GFP_COLD      0x100u  /* Cache-cold page required */
#define __GFP_NOWARN    0x200u  /* Suppress page allocation failure warning */
#define __GFP_REPEAT    0x400u  /* Retry the allocation.  Might fail */
#define __GFP_NOFAIL    0x800u  /* Retry for ever.  Cannot fail */
#define __GFP_NORETRY   0x1000u /* Do not retry.  Might fail */

#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)

There are other definitions as well, such as the zone modifier __GFP_HIGHMEM.
Page protection is defined in ${linux src}/include/asm-i386/pgtable.h

#define PAGE_KERNEL             __pgprot(__PAGE_KERNEL)

and __pgprot is defined in ${linux src}/include/asm-i386/page.h

#define __pgprot(x)      ((pgprot_t) { (x) } )

This macro only performs a type conversion.
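The point of the wrapper type is that the compiler can then catch accidental mixing of raw integers and protection values. A minimal userspace sketch of the same trick (the names mirror the kernel's, but this is not kernel code, and pgprot_val() is the kernel's companion accessor):

```c
#include <assert.h>

/* Wrap the raw protection bits in a one-member struct, as the kernel
 * does with pgprot_t, so a plain integer cannot be passed by mistake. */
typedef struct { unsigned long pgprot; } pgprot_t;

#define __pgprot(x)     ((pgprot_t) { (x) })
#define pgprot_val(x)   ((x).pgprot)

/* Round-trip: wrap the bits, then read them back unchanged. */
static unsigned long roundtrip(unsigned long bits)
{
        return pgprot_val(__pgprot(bits));
}
```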

At the top of this function, a local variable named area is declared as
a pointer to struct vm_struct.
This structure is defined in ${linux src}/include/linux/vmalloc.h

struct vm_struct {
        void                    *addr;
        unsigned long           size;
        unsigned long           flags;
        struct page             **pages;
        unsigned int            nr_pages;
        unsigned long           phys_addr;
        struct vm_struct        *next;
};

Next, the size of the region is aligned.

        size = PAGE_ALIGN(size);
        if (!size || (size >> PAGE_SHIFT) > num_physpages)
                return NULL;

(The PAGE_ALIGN() macro is shown below together with the ALIGN() macro.)

The size is rounded up to a whole number of pages.
If the size is zero, or the number of pages is greater than num_physpages
(the total number of physical pages in the system), a NULL pointer is returned.
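The alignment and the sanity check can be tried in userspace. The constants below are i386-like assumptions, and a small num_physpages value is picked just for the example:

```c
#include <assert.h>

#define PAGE_SHIFT       12
#define PAGE_SIZE        (1UL << PAGE_SHIFT)
#define PAGE_MASK        (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

/* Mirror of the check at the top of __vmalloc(): returns the aligned
 * size, or 0 where the kernel would return NULL. */
static unsigned long check_size(unsigned long size, unsigned long num_physpages)
{
        size = PAGE_ALIGN(size);
        if (!size || (size >> PAGE_SHIFT) > num_physpages)
                return 0;
        return size;
}
```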

The local variable "area" is assigned a value by calling get_vm_area().
This function is a wrapper and calls __get_vm_area() as follows

        return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);

Related definitions are in ${linux src}/include/asm-i386/pgtable.h

#define VMALLOC_OFFSET  (8*1024*1024)
#define VMALLOC_START   (((unsigned long) high_memory + vmalloc_earlyreserve + \
                        2*VMALLOC_OFFSET-1) & ~(VMALLOC_OFFSET-1))

VMALLOC_START and VMALLOC_END delimit the address range used by vmalloc().
There is a comment just above the definition in the header file.
Here it is

/* Just any arbitrary offset to the start of the vmalloc VM area: the
 * current 8MB value just means that there will be a 8MB "hole" after the
 * physical memory until the kernel virtual memory starts.  That means that
 * any out-of-bounds memory accesses will hopefully be caught.
 * The vmalloc() routines leaves a hole of 4kB between each vmalloced
 * area for the same reason. ;)
 */


__get_vm_area() is in ${linux src}/mm/vmalloc.c

At first,

        if (flags & VM_IOREMAP) {
                int bit = fls(size);

                if (bit > IOREMAP_MAX_ORDER)
                        bit = IOREMAP_MAX_ORDER;
                else if (bit < PAGE_SHIFT)
                        bit = PAGE_SHIFT;

                align = 1ul << bit;
        }

When VM_IOREMAP is specified in flags, the local variable "align" is set
to at least one page size (1 << PAGE_SHIFT) and at most 1 << IOREMAP_MAX_ORDER.
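The effect of this branch can be reproduced in userspace. Here fls() is a portable stand-in for the kernel's "find last set" helper, and IOREMAP_MAX_ORDER is assumed to be the i386 value of 24:

```c
#include <assert.h>

#define PAGE_SHIFT        12
#define IOREMAP_MAX_ORDER 24    /* assumed i386 value */

/* Portable stand-in for the kernel's fls(): position of the most
 * significant set bit, counting from 1; fls(0) == 0. */
static int fls(unsigned long x)
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

/* Mirror of the align computation in __get_vm_area() for VM_IOREMAP:
 * clamp the order between one page and IOREMAP_MAX_ORDER. */
static unsigned long ioremap_align(unsigned long size)
{
        int bit = fls(size);

        if (bit > IOREMAP_MAX_ORDER)
                bit = IOREMAP_MAX_ORDER;
        else if (bit < PAGE_SHIFT)
                bit = PAGE_SHIFT;
        return 1UL << bit;
}
```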

        addr = ALIGN(start, align);
        size = PAGE_ALIGN(size);

Both values are then aligned. These macros are defined in their respective header files.

PAGE_ALIGN() is defined in ${linux src}/include/asm-i386/page.h

#define PAGE_ALIGN(addr)  (((addr)+PAGE_SIZE-1)&PAGE_MASK)

And more general ALIGN() macro is defined in ${linux src}/include/linux/kernel.h.

#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1))

Next, addr is rounded up to a multiple of align,
and size is rounded up to a multiple of the page size.
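These macros are easy to sanity-check in userspace; the sketch below copies the kernel's ALIGN() definition, which works for any power-of-two alignment:

```c
#include <assert.h>

/* Round x up to the next multiple of a, where a is a power of two.
 * Same definition as in include/linux/kernel.h. */
#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1))
```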

        area = kmalloc(sizeof(*area), GFP_KERNEL);
        if (unlikely(!area))
                return NULL;

        if (unlikely(!size)) {
                kfree(area);
                return NULL;
        }

The code allocates memory for a new struct vm_struct using kmalloc();
if there is no memory, or the requested size is zero, a NULL pointer is returned.

        size += PAGE_SIZE;

The size to be allocated is expanded by one page.
This extra page acts as a guard page: when code runs past the limit of
the region, it does not silently corrupt the next region and crash the
system; instead a page fault (segmentation fault) is raised.

Next, a "for" loop appears.

        for (p = &vmlist; (tmp = *p) != NULL ;p = &tmp->next) {
                if ((unsigned long)tmp->addr < addr) {
                        if((unsigned long)tmp->addr + tmp->size >= addr)
                                addr = ALIGN(tmp->size + 
                                             (unsigned long)tmp->addr, align);
                        continue;
                }
                if ((size + addr) < addr)
                        goto out;
                if (size + addr <= (unsigned long)tmp->addr)
                        goto found;
                addr = ALIGN(tmp->size + (unsigned long)tmp->addr, align);
                if (addr > end - size)
                        goto out;
        }

The starting address is adjusted here.
This "for" loop walks the list of struct vm_struct, starting from vmlist
and sorted by address.

The local variable "tmp" points to the current vm_struct.

If the current region starts below addr, and the end of that region
((unsigned long)tmp->addr + tmp->size) reaches addr, then addr lies inside
the region, so addr is pushed up to the aligned end of the region.
In either case the loop continues with the next vm_struct.

If (size + addr) < addr, the requested region would wrap around the top of
the address space; this must not happen, so the code jumps to out:.

Since the upper half of the loop has already adjusted the starting address,
if (size + addr <= tmp->addr) there is a hole large enough before the
current region, and this is the point where the new region should reside:
goto found:.

When this condition is not true, the end of the requested region
(addr + size) overlaps the current region, so addr is reset to the aligned
end of the current region and the next vm_struct is checked.

When (addr > end - size), which means (addr + size > end), the request does
not fit below the end of the vmalloc range; this must not happen either,
so goto out:.
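The address-search loop can be exercised in userspace with a simplified region list. This is only a sketch: the names mimic the kernel's, the list is assumed already sorted by address, and the insertion bookkeeping (the **p pointer) is left out:

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1))

/* Simplified stand-in for struct vm_struct. */
struct region {
        unsigned long addr;
        unsigned long size;
        struct region *next;
};

/* Find a hole of `size` bytes in [start, end), walking a list sorted by
 * address, mirroring the loop in __get_vm_area(); 0 means failure. */
static unsigned long find_hole(const struct region *tmp, unsigned long size,
                               unsigned long start, unsigned long end,
                               unsigned long align)
{
        unsigned long addr = ALIGN(start, align);

        for (; tmp != NULL; tmp = tmp->next) {
                if (tmp->addr < addr) {
                        if (tmp->addr + tmp->size >= addr)
                                addr = ALIGN(tmp->addr + tmp->size, align);
                        continue;       /* region lies below addr, skip it */
                }
                if (size + addr < addr)
                        return 0;       /* wrapped around the address space */
                if (size + addr <= tmp->addr)
                        return addr;    /* hole before this region: found */
                addr = ALIGN(tmp->addr + tmp->size, align);
                if (addr > end - size)
                        return 0;       /* no room left below end */
        }
        if (addr + size > end || addr + size < addr)
                return 0;
        return addr;
}

/* Two occupied regions: [0x1000, 0x3000) and [0x5000, 0x6000). */
static struct region r2 = { 0x5000, 0x1000, NULL };
static struct region r1 = { 0x1000, 0x2000, &r2 };
```

A request for 0x2000 bytes fits exactly in the gap between the two regions, while a request for 0x4000 bytes has to go above the second one.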

        area->next = *p;
        *p = area;

        area->flags = flags;
        area->addr = (void *)addr;
        area->size = size;
        area->pages = NULL;
        area->nr_pages = 0;
        area->phys_addr = 0;

When the appropriate point is found, area->next is set to point to the next
(higher) vm_struct, and *p, which pointed to that vm_struct, is set to area;
in other words, area is linked into the list at the appropriate point.

Several members of the vm_struct pointed to by "area" are then initialized,
and the function returns the pointer to the now correctly set up vm_struct.

get_vm_area() returns the pointer to the struct vm_struct.
And again, execution continues in __vmalloc().

        if (!area)
                return NULL;

When get_vm_area() fails to set up a struct vm_struct,
__vmalloc() returns a NULL pointer.

        return __vmalloc_area(area, gfp_mask, prot);

__vmalloc() leaves the core job to __vmalloc_area() and returns
__vmalloc_area()'s return value.


__vmalloc_area() is defined in ${linux src}/mm/vmalloc.c.
The first argument is a pointer to the struct vm_struct that was set up by
get_vm_area(), and the other two arguments are the same as those passed to
the vmalloc() wrapper function.

        nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
        array_size = (nr_pages * sizeof(struct page *));

        area->nr_pages = nr_pages;

First, the number of pages to be allocated is calculated.
Since an extra guard page was added at the end of the memory region, one page
size is subtracted from area->size before shifting right by PAGE_SHIFT to get
the number of pages.

The local variable "array_size" holds the memory size required for the array
of struct page pointers describing the pages to be allocated.

The number of pages is stored in area->nr_pages.
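The arithmetic is easy to check in userspace with i386-like assumptions (PAGE_SHIFT of 12, and a plain pointer standing in for struct page *):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* area->size includes the one-page guard, so subtract it first. */
static unsigned long nr_pages_of(unsigned long area_size)
{
        return (area_size - PAGE_SIZE) >> PAGE_SHIFT;
}

/* Bytes needed for the array of page pointers (struct page * in the
 * kernel; void * stands in for it here). */
static unsigned long array_size_of(unsigned long area_size)
{
        return nr_pages_of(area_size) * sizeof(void *);
}
```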

        if (array_size > PAGE_SIZE)
                pages = __vmalloc(array_size, gfp_mask, PAGE_KERNEL);
        else
                pages = kmalloc(array_size, (gfp_mask & ~__GFP_HIGHMEM));
        area->pages = pages;

As mentioned above, array_size is the size required for the array of struct
page pointers. When that size is greater than one page, more pages have to
be allocated for the array itself, so the code calls __vmalloc() recursively.

Otherwise one page is enough for the array, and the code calls kmalloc()
to allocate it.

The array is then assigned to area->pages.

        if (!area->pages) {
                remove_vm_area(area->addr);
                kfree(area);
                return NULL;
        }
        memset(area->pages, 0, array_size);

If no memory is available for the array, the vm_struct is removed from
vmlist by remove_vm_area() and freed by kfree(), and the function returns NULL.

When the allocation succeeds, the array of struct page pointers is filled
with zero.

        for (i = 0; i < area->nr_pages; i++) {
                area->pages[i] = alloc_page(gfp_mask);
                if (unlikely(!area->pages[i])) {
                        area->nr_pages = i;
                        goto fail;
                }
        }

The next "for" loop runs through the array and allocates each page from
the free lists by calling alloc_page().

Each page is obtained individually and assigned to its entry
area->pages[i], which is why the physical pages may not be contiguous.

If alloc_page() returns a NULL pointer, the page allocation has failed.
In that case area->nr_pages is set to i, the number of pages allocated so
far (so that they can be freed later), and the code jumps to fail:.

When the code exits the "for" loop, all required pages have been allocated.

        if (map_vm_area(area, prot, &pages))
                goto fail;
        return area->addr;

The code so far has allocated pages for the virtual memory region,
but they are still only physical pages.

These physical pages must be wired into the page tables so that the new
virtual addresses translate to them. This job is done by map_vm_area().


map_vm_area() is also in vmalloc.c

        unsigned long addr = (unsigned long) area->addr;
        unsigned long end = addr + area->size - PAGE_SIZE;

At first, the start address of the virtual memory region is assigned to the
local variable named "addr".
And the local variable "end" is set to the end address of the region;
the extra guard page is subtracted to get the actual size of the region.

        pgd = pgd_offset_k(addr);

The local variable "pgd" is set to the value of the pgd_offset_k() macro.
This macro and its relatives are defined in ${linux src}/include/asm-i386/pgtable.h

#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
#define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
#define pgd_offset_k(address) pgd_offset(&init_mm, address)

In short, pgd_offset_k() returns corresponding "page directory entry" for
given virtual address.
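With 2-level translation on i386, PGDIR_SHIFT is 22 and PTRS_PER_PGD is 1024, so the index is just the top 10 bits of the address. The index computation can be checked in userspace:

```c
#include <assert.h>

/* i386 2-level translation values */
#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024

/* Same definition as in include/asm-i386/pgtable.h: the top 10 bits
 * of the virtual address select the page directory entry. */
#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
```

For example, the classic kernel base address 0xC0000000 lands on entry 768, three quarters of the way into the 1024-entry directory.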

        do {
                next = pgd_addr_end(addr, end);
                err = vmap_pud_range(pgd, addr, next, prot, pages);
                if (err)
                        break;
        } while (pgd++, addr = next, addr != end);

The code enters a "do {...} while()" loop.
pgd_addr_end() is defined in ${linux src}/include/asm-generic/pgtable.h.

#define pgd_addr_end(addr, end)                                         \
({      unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
        (__boundary - 1 < (end) - 1)? __boundary: (end);                \
})
This macro is obvious except for the "- 1" terms.
An explanation is given in the same header file.

/*
 * When walking page tables, get the address of the next boundary,
 * or the end address of the range if that comes earlier.  Although no
 * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
 */

pgd_addr_end() returns the next pgd boundary above addr, or end if end comes first.
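The macro, including the "- 1" wrap-around trick, can be reproduced in userspace. 32-bit arithmetic (uint32_t) is used here on purpose so that the boundary can wrap to 0 as it would on i386:

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 22
#define PGDIR_SIZE  ((uint32_t)1 << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* Function version of pgd_addr_end() for a 32-bit address space. */
static uint32_t pgd_addr_end(uint32_t addr, uint32_t end)
{
        uint32_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

        /* The "- 1" makes a boundary that wrapped to 0 compare as the
         * largest possible value, so `end` wins in that case. */
        return (uint32_t)(boundary - 1) < (uint32_t)(end - 1) ? boundary : end;
}
```

The third assertion below exercises the wrap case: for addr 0xFFC00000 the rounded-up boundary overflows to 0, and thanks to the "- 1" trick the macro still returns end.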
Then vmap_pud_range() is called; inside it, vmap_pmd_range() is called,
and inside vmap_pmd_range(), vmap_pte_range() is called.

All of these functions are in vmalloc.c

static inline int vmap_pud_range(pgd_t *pgd, unsigned long addr,
                        unsigned long end, pgprot_t prot, struct page ***pages)
{
        pud_t *pud;
        unsigned long next;

        pud = pud_alloc(&init_mm, pgd, addr);
        if (!pud)
                return -ENOMEM;
        do {
                next = pud_addr_end(addr, end);
                if (vmap_pmd_range(pud, addr, next, prot, pages))
                        return -ENOMEM;
        } while (pud++, addr = next, addr != end);
        return 0;
}

vmap_pud_range() and vmap_pmd_range() resemble each other,
so let's look at vmap_pud_range(). It starts by calling pud_alloc():

static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
{
        if (pgd_none(*pgd))
                return __pud_alloc(mm, pgd, address);
        return pud_offset(pgd, address);
}

The pgd_none() macro is defined in ${linux src}/include/asm-generic/pgtable-nopud.h
This header file is included no matter which level of address translation
is selected, 2 or 3.

When 2-level translation, which does not use the "page middle directory" (pmd),
is selected, ${linux src}/include/asm-generic/pgtable-nopmd.h includes
this header file.

When 3-level translation is selected, the ${linux src}/include/asm-i386/pgtable-3level.h
header file includes it.

In pgtable-nopud.h, the pgd_none() macro always returns 0.
So, pud_alloc() here is equivalent to pud_offset().

And pud_offset() is also defined in pgtable-nopud.h.

static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address)
{
        return (pud_t *)pgd;
}

It only returns pgd itself.

Before vmap_pud_range() is called, pgd is obtained by pgd_offset_k() in map_vm_area().
pgd holds a pointer to the appropriate entry in the "page global directory"
for the virtual address passed as an argument.

This pgd is passed down from map_vm_area(), and in vmap_pud_range(),
pud_alloc() returns the same value, only with a different type: pud_t instead of pgd_t.

The code enters a "do {...} while()" loop once more to handle the "page middle
directory" (pmd); what is done there is the same as in vmap_pud_range().

        do {
                next = pud_addr_end(addr, end);
                if (vmap_pmd_range(pud, addr, next, prot, pages))
                        return -ENOMEM;
        } while (pud++, addr = next, addr != end);

The function that deals with the "pmd" level is vmap_pmd_range(), and it
does much the same job as vmap_pud_range().
These functions walk the entries in the "pgd" and "pmd" to reach the
appropriate "pte", which holds the pointer to a physical page.

vmap_pte_range() is the function that actually sets up the translation from
virtual addresses to physical addresses.

It points a "page table entry" at the real memory page for every page
given in the argument array.

vmap_pud_range() and vmap_pmd_range() both locate the appropriate area of
the page tables so that vmap_pte_range() can fill in the entries.

Here is vmap_pte_range().

static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
                        unsigned long end, pgprot_t prot, struct page ***pages)
{
        pte_t *pte;

        pte = pte_alloc_kernel(&init_mm, pmd, addr);
        if (!pte)
                return -ENOMEM;
        do {
                struct page *page = **pages;
                WARN_ON(!pte_none(*pte));
                if (!page)
                        return -ENOMEM;
                set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
                (*pages)++;
        } while (pte++, addr += PAGE_SIZE, addr != end);
        return 0;
}

pte_alloc_kernel() checks that pmd is a valid page middle directory entry.
When it is not valid (not present), it allocates the required page table,
makes the pmd entry valid, and returns the pointer to the appropriate pte_t entry.
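The core of the pte loop can be sketched in userspace: for every page-sized step of the virtual range, take the next physical frame number from the pages array and install one "pte". The flat array, the frame-number encoding, and the PTE_PRESENT bit are simplifications made for this sketch, not the kernel's real layout:

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define PTE_PRESENT 0x1UL   /* made-up protection bit for this sketch */

/* Userspace sketch of the vmap_pte_range() idea: one "pte" per page,
 * encoded as (frame number << PAGE_SHIFT) | protection bits, with the
 * pfns cursor advanced exactly as (*pages)++ advances in the kernel. */
static int fill_ptes(unsigned long *ptes, unsigned long addr,
                     unsigned long end, unsigned long prot,
                     const unsigned long **pfns)
{
        do {
                unsigned long pfn = **pfns;

                ptes[addr >> PAGE_SHIFT] = (pfn << PAGE_SHIFT) | prot;
                (*pfns)++;
        } while (addr += PAGE_SIZE, addr != end);
        return 0;
}

/* Map three scattered frames (7, 3, 9) at virtual pages 0..2 and
 * return the i-th resulting pte. */
static unsigned long demo_pte(int i)
{
        unsigned long ptes[3] = { 0, 0, 0 };
        unsigned long frames[3] = { 7, 3, 9 };
        const unsigned long *p = frames;

        fill_ptes(ptes, 0, 3 * PAGE_SIZE, PTE_PRESENT, &p);
        return ptes[i];
}
```

Note how the frame numbers 7, 3, 9 are deliberately non-contiguous, yet the virtual pages 0, 1, 2 are: that is exactly the service vmalloc() provides.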

mk_pte() is defined in ${linux src}/include/asm-i386/pgtable.h

#define mk_pte(page, pgprot)    pfn_pte(page_to_pfn(page), (pgprot))

It gets the page frame number from the struct page and passes it to pfn_pte()
together with the pgprot.

pfn_pte() is defined differently according to the level of address translation;
see its definition for each case.

set_pte_at() is likewise defined differently for each level,
but in both cases it sets the page table entry correctly.

vmap_pud_range() and vmap_pmd_range() have similar prototypes:

static int vmap_pud_range(pgd_t *pgd, unsigned long addr,
                        unsigned long end, pgprot_t prot, struct page ***pages)
static int vmap_pmd_range(pud_t *pud, unsigned long addr,
                        unsigned long end, pgprot_t prot, struct page ***pages)

The third argument "end" gives the virtual address at which the entry of the
pgd or pmd must advance to the next one.

So, when the page table entries (pte) covering one pmd entry have all been
filled in, the "do {...} while()" loop for the pte level finishes
and the pmd entry is advanced.

In the same way, when the set of pmd entries for one pgd entry has been
filled, the "do {...} while()" loop for the pmd level finishes and the pgd
entry is advanced.

Once all of the ptes for the allocated pages are set correctly, map_vm_area()
flushes the cache for the virtual mapping and returns to its caller, __vmalloc_area().

__vmalloc_area() returns the beginning address of the contiguous virtual
address region, and __vmalloc(), its caller, returns the same address.

In the end, vmalloc() has allocated possibly non-contiguous physical memory
pages, mapped them to a contiguous virtual memory region, and set up the
translation tables from virtual addresses to physical addresses.

vmalloc() is done.
