__get_free_pages() is a interface to the kernel for getting available pages.
This function is defined in ${linux src}/mm/page_alloc.c

fastcall unsigned long __get_free_pages(unsigned int __nocast gfp_mask, unsigned
 int order)
        struct page * page;
        page = alloc_pages(gfp_mask, order);
        if (!page)
                return 0;
        return (unsigned long) page_address(page);

This function will try to get pages by calling alloc_pages() and if there is
no page available, return 0.

When it gets pages required, returns unsigned long (address pointer) through
page_address() macro.

page_address() macro is defined in ${linux src}/include/linux/mm.h

#define page_address(page) lowmem_page_address(page)

static inline void *lowmem_page_address(struct page *page) { return __va(page_to_pfn(page) << PAGE_SHIFT); }

It returns a virtual memory address.


Before looking at alloc_pages(), some kernel configuration should be undefined
for simplicity.

There is a concept called NUMA.
This concept intends th put togather memory regions to which the processor
takes the same time to access.

These set of regions are called "node" in linux.
This concept is intoroduced into linux to make kernel more efficient.

But implementation of this consept is complex and when there are two or more
processors on a machine, the situation is more complex, which needs more time
to explain the implementation.

As we are interested in memory allocation, we assume that there is only one
proccessor and NUMA is not implemented by not defineing CONFIG_NUMA.

The deftail of NUMA is documented in ${linux src}/Documantation/vm/numa file.

alloc_page() is defiend in ${linux src}/include/linux/gfp.h

static inline struct page *alloc_pages_node(int nid, unsigned int __nocast gfp_m
                                                unsigned int order)
        if (unlikely(order >= MAX_ORDER))
                return NULL;

return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK)); }

In linux, memory management should be done with chunk of pages.
The size of pages is represented by 2 powered order.

order is in between 0 and 9.
So sizes of chunk of pages are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.

In result, MAX_ORDER is set to 9.
There must not be order that is greater than MAX_ORDER.
If so, it returns NULL pointer of struct page.

alloc_page() calls __alloc_pages() and returns the returned value.
It passes complex arguments.
NODE_DATA() macro is defined differently in mmzone.h header file according to
kernel configuration.

is also not defined.
In this case, NODE_DATA() is as following

#define NODE_DATA(nid)          (&contig_page_data)

And contig_page_data is in ${linux src}/mm/page_alloc.c

static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };

struct pglist_data is defined in ${linux src}/include/linux/mmzone.h
This structure has node_zonelists[] array.

GFP_ZONEMASK is also defined in mmzone.h and its value is 0x3.

gfp_mask & GFP_ZONEMASK gives offset from node_zonelist and selects appropriate
zone for allocating.

Let's look at __alloc_pages().


At first, might_sleep_if() is called.
might_sleep_if() macro is defined in ${linux src}/include/linux/kernel.h.

#define might_sleep_if(cond) do { if (unlikely(cond)) might_sleep(); } while (0)

unlikely(x) macro returns !!x. So, if cond is true, might_sleep() is invoked.
might_sleep() macros and related are defined in kernel.h.

extern int cond_resched(void);
# define might_resched() cond_resched()
# define might_resched() do { } while (0)

#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP void __might_sleep(char *file, int line); # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0) #else # define might_sleep() do { might_resched(); } while (0)

this macro does nothing at all.

Next, local variable can_try_harder is set.
This variable will be used to decide how harder the system should try to reclaim pages to free page list.

        can_try_harder = (unlikely(rt_task(p)) && !in_interrupt()) || !wait;

When this process is not real time task and not interrupted, or __GFP_WAIT flag
is not set, local variable can_try_harder is set.

The code enters "for" loop labeled restart:.
This loop run through every zone, DMA, NORMAL, HIGH.

At the top of the inside of "for" block, there is a function named cpuset_zone_allowed().
This function is defined in ${linux src}/include/linux/cpuset.h.
But it does return 1 only, do nothing.

At first, local variable "do_reclaim" is set to return value from should_reclaim_zone().
should_reclaim_zone() is also defined in page_alloc.c.

static inline int
should_reclaim_zone(struct zone *z, unsigned int gfp_mask)
        if (!z->reclaim_pages)
                return 0;
        if (gfp_mask & __GFP_NORECLAIM)
                return 0;
        return 1;

When reclain_pages member of struct zone is not set or
__GFP_NORECLAIM is not set in gfp_mask also passed as arguemnt,
this function returns 0. otherwise return 1.


If "do_reclaim" is 0, the kernel does not reclaim pages in this zone.
If 1, when zone_watermark_ok() (mentioned soon) returns 0, kernel tries to
reclaim pages to free list of this zone.

zone_watermark_ok() is in alloc_pages.c
struct zone has a member named pages_low, which is the watermark under which
free pages of the zone must not be below.

zone_watermark_ok() checks wether page allocation makes total number of free
pages become below under this watermark or not.

int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
                      int classzone_idx, int can_try_harder, int gfp_high)
        long min = mark, free_pages = z->free_pages - (1 << order) + 1;
        int o;

if (gfp_high) min -= min / 2; if (can_try_harder) min -= min / 4;

if (free_pages <= min + z->lowmem_reserve[classzone_idx]) return 0; for (o = 0; o < order; o++) { free_pages -= z->free_area[o].nr_free << o; min >>= 1;

if (free_pages <= min) return 0; } return 1; }

At first, local variable min is set to mark (z->pages_low) and free_pages is set to
number of free pages in this zone minus size of this order (1 << order) - 1.

If gfp_high is set, try hard, min >>= 1, min = min / 2.
And if there is a decision to try more harder, which is specified with can_try_harder,
min should be devided by 4, min >>= 2, min = min / 4.
At most, min might be 1/8.

At this time, free_pages is smaller than min + z->lowmem_reserve[classzone_idx],
returns 0 immidiately, which means that only by this page allocation,
free pages in this zone will become below under watermark.

Otherwise the code enters "for" loop that runs through 0 to the order - 1.
This is because although this page allocation does not make free pages become
below under the water mark, one chunk of pages under this order
may make it become below under the watermark.

So, this function checks whether the number of free pages of this zone should
become under water mark when each size (under 1 << order - 1) is removeed from
free list.

If free pages has become under water mark in "for" loop, return 0 immidiately
because the system should reclaim the pages into the free list to make free pages
enough to allocate requested pages.

If and only if free pages are not under water mark, which means threre are so
many free pages to allocate when requested, return 1.

In the loop.
Each struct zone has free_area[] according to the order and member nr_free has
the number of free chunk of page.
So, size of pages of this order should be given by free_area[].nr_free << order.
And this size is subtracted from free_pages.

If order is up one level, the size of chunk of pages will be doubled,
As the size subtracted from free_pages is doubled,
min should be halved, min >>= 1.

When free_pages is smaller than min, return 0.
Otherwise, next loop is executed.
If subtraction of all of orders under order passed as arguement is not enough
to make free_pages smaller thatn min, this funtion returns 1.

If zone_watermark_ok() return 0, zone_reclaim() will be invoked once.

reclaimination of pages are so complex procedure and we are now interested in
page allocation, we ignore the detail of zone_reclaim().
It tries to free pages as many as possible in order to page allocation is to
be done smoothly.

Core functions for memory allocation

Then, kernel will try to get reqested pages by calling buffered_rmqueue().
But buffered_rmqueue() does some challenge to get one page and if it is
requested more pages, it calls __rmqueue() frunction that gets pages from
free list of free_area of this zone.

And __rmqueue() calls expand() function to adjust memory management system
to work correctly.

As these two function is the core functions for memory allocation,
Let's look at these two first.


__remqueue() is a core funciton for memory allocation.
This function will be called from several path to allocate memory (or pages).

__rmqueu() is defined in ${linux src}/mm/page_alloc.c

static struct page *__rmqueue(struct zone *zone, unsigned int order)
        struct free_area * area;
        unsigned int current_order;
        struct page *page;

for (current_order = order; current_order < MAX_ORDER; ++current_order) { area = zone->free_area + current_order; if (list_empty(&area->free_list)) continue;

page = list_entry(area->free_list.next, struct page, lru); list_del(&page->lru); rmv_page_order(page); area->nr_free--; zone->free_pages -= 1UL << order; return expand(zone, page, order, current_order, area); }

return NULL; }

Arguments passed to this function is strcut zone and int order.
This function should try to get chunk of pages, which size is 1 << order from
free_area[] of zone given as arguemnt.

It will run through "for" loop from order to MAX_ORDER (9 in linux, 512 pages).
In the loop, at first pointer for struct free_area is set to element of free_area
specified by local variable current_order.

And it checks with list_empty() macro if the free_list is empty or not.
If list is empty, it cannot get pages from this free_list and go up one level
(__current_order++ in the "for" statement) and continues loop.

free_list member of free_area links free pages with lru member.
So, it uses list_entry() macro.
In this case this macro extracts pointer of struct page
with lru (least resent use) member of struct page.
The target "lru" member is area->free_list.next, which is the member of
page struct this macro will return.

                page = list_entry(area->free_list.next, struct page, lru);

Then, it remove these pages from free_list of this free_area by using
del_list() macro.

rmv_page_order() set these pages does not belong to any free list.

As required pages (size is 1 << current_order) is removed, the number of the
chunk of free pages of this free_area (size is 1 << current_order) should be
decremented (area.nr_free--).


And zone that includes these free_area is stolen chunk of pages,
its member free_pages is decremented by the size of pages removed (1 << order).

                zone->free_pages -= 1UL << order;

When required size (1 << order) and removed pages size (1 << current_order)
is not the same, following expand() adjusts this difference.


expand() is also defined in page_alloc.c

static inline struct page *
expand(struct zone *zone, struct page *page,
        int low, int high, struct free_area *area)
        unsigned long size = 1 << high;

while (high > low) { area--; high--; size >>= 1; BUG_ON(bad_range(zone, &page[size])); list_add(&page[size].lru, &area->free_list); area->nr_free++; set_page_order(&page[size], high); } return page; }

Two arguement, low and high, are required order and acquired order.
If __rmqueue() could get the same size of pages from free list,
these two is the same and this function does nothing.

But if not, higher level was removed from free list and remained pages
should relinkd to lower level free list.

It downs the order level one by one in "while" loop.
area--; high--; size >>= 1; do this job.

It checks an impossible condition with BUG_ON macro.

In this function, page is the beginning of the array of struct page
that is removed from free_list of order high.

and page is also the beginning of the the pages that is to be returned.

On the other side, page[size] is an another half of removed pages.
These pages should be linked to one low level free_list link with
lru (least resent use) member. list_add() macro do this job.

This free area is added new chunk of pages, nr_free must be incremented.

And this upper half of pages is marked as belonging to free_area of this order
specified with high.

While (high > low) returns true, it go down one level of order and upper half
of the pages should be added into the one level down free list of free_area.

This operation will keeps page allocaton and freeing pages system.


Now, look at buffered_rmqueue().

buffered_rmqueue() is also defined in page_alloc.c

When requested page order is 0, which means the size of page is one,
this function will try to get page from per_cpu_pages struct for each processor.

There are two kind of per_cpu_pages (pcp), hot and cold.
hot is related with L2 cache in processor.

There is a line to determain which pcp should be used.

        int cold = !!(gfp_flags & __GFP_COLD);

Doubled inverse instruction converts the type of return value to int.
If __GFP_COLD is set in gfp_flags (passed as arguemnt), cold equals one,
which means this function uses cold pcp (pcp[1]).
Otherwise it uses hot pcp (pcp[0]).

        if (pcp->count <= pcp->low)
                 pcp->count += rmqueue_bulk(zone, 0,
                                         pcp->batch, &pcp->list);

pcp (per_cpu_pages) has several water marks.
If "if" condition returns true, there are not enough pages for hot or cold
page allocation, then get some amount of pages specified by pcp->batch
and link these pages into the pcp->list link list.


rmqueue_bulk() is defined in page_alloc.c

static int rmqueue_bulk(struct zone *zone, unsigned int order, 
                        unsigned long count, struct list_head *list)
        unsigned long flags;
        int i;
        int allocated = 0;
        struct page *page;

spin_lock_irqsave(&zone->lock, flags); for (i = 0; i < count; ++i) { page = __rmqueue(zone, order); if (page == NULL) break; allocated++; list_add_tail(&page->lru, list); } spin_unlock_irqrestore(&zone->lock, flags); return allocated; }

This function uses spin_lock_irqsave() and spin_unlock_ieqrestore() for
zone->lock for secure.

It will try to get the number of chunk of pages which size is 1 << order.
Inside "for" loop, it will get pages using __rmqueue() and allocated count
is incrementted (allocated++).

Then, the page now allocated is linked to list passed as argument using
list_add_tail() macro. struct page is linked with lru member.

After "for" loop, it returns the number of allocated chunk of pages, which
size is 1 << order.

As order that is passed to rmqueue_bulk() is 0, size of these pages is one page(4kbytes).

The more frequently one page allcation is occured, the more the cache hit increases.
Again, buffered_rmqueue().

         if (pcp->count) {
                page = list_entry(pcp->list.next, struct page, lru);

Then, if there are pages in pcp list, get one page using list_entry() macro
and delete it from pcp list.

        if (page == NULL) {
                spin_lock_irqsave(&zone->lock, flags);
                page = __rmqueue(zone, order);
                spin_unlock_irqrestore(&zone->lock, flags);

When there is no page available so far because order is not zero or other reson,
it will try to get required pages using __rmqueue().

        if (page != NULL) {
                BUG_ON(bad_range(zone, page));
                mod_page_state_zone(zone, pgalloc, 1 << order);
                prep_new_page(page, order);

if (gfp_flags & __GFP_ZERO) prep_zero_page(page, order, gfp_flags);

if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); }

When it can allocate new pages, this block is executed.

mod_page_state_zone() macro is defined in ${linux src}/include/linux/page-flags.h header file.
It adds pgalloc member of struct page_state by 1 << order (size of pages).
This addition is done by __mod_page_state() (defined in page_alloc.c).
The difference between them are that mod_pageA_state_zone() chnages member
from pgalloc_high, pgalloc_normal to pgalloc_dma member according to zone.

If __GFP_ZERO is specified, prep_zero_page() initailize pages to series of zeros.

Unless CONFIG_HUGETBL_PAGE is not defined, prep_compound_page() does nothing.

This function returns pointer for page newly allocated.
buffered_ruqueue() is over.

Continue __alloc_pages()

After waking up kswapd process to keep free pages available enough,
it once more tries to get free pages requested more hardly.

       for (i = 0; (z = zones[i]) != NULL; i++) {
                if (!zone_watermark_ok(z, order, z->pages_min,
                                       classzone_idx, can_try_harder,
                                       gfp_mask & __GFP_HIGH))

if (wait && !cpuset_zone_allowed(z)) continue;

page = buffered_rmqueue(z, order, gfp_mask); if (page) goto got_pg; }

It again calls zon_watermark_ok() for every zone.
But this time, it specifis "can_try_harder and high (gfp_mask & __GFP_HIGH)
argument to reclaim pages into free list.
And it changes watermark from z->pages_low to z->pages_min.

When zone_watermark_ok() return 0, which means there are not enough pages,
it tries next zone and does not tries to reclaim pages this time.

Next "if" statement has cpuset_zone_allowed() and this function returns 1 always, so this condition returns false.

And again, it tries to get pages requested by calling buffered_rmqueue().
When it get pages, goto label got_pg:.

        if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
                        && !in_interrupt()) {
                if (!(gfp_mask & __GFP_NOMEMALLOC)) {
                        for (i = 0; (z = zones[i]) != NULL; i++) {
                                if (!cpuset_zone_allowed(z))
                                page = buffered_rmqueue(z, order, gfp_mask);
                                if (page)
                                        goto got_pg;
                goto nopage;

If there is no page available so far and the condition of "if" statement returns
true, this time it ignores the limitation of watermark and tries to get pages
requested by calling buffered_rmqueue().

If it can get pages requested, goto got_pg: label and do nothing about
reclaiming pages to free list.

Once the code enter into this block and cannot allcate requested pages,
jump to nopage: label and it's over.

        if (!wait)
                goto nopage;

When "if" statement shown above does not return true and local variable "wait"
is zero (gfp_mask & __GFP_WAIT), goto nopage: label and return NULL pointer.


/* We now go into synchronous reclaim */ p->flags |= PF_MEMALLOC; reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state;

did_some_progress = try_to_free_pages(zones, gfp_mask);

p->reclaim_state = NULL; p->flags &= ~PF_MEMALLOC;


At this point, there are no pages available.
The code block shown above should try to free pages.
But as we are interested in allcating pages, ignore detail of reclaiming of pages.

       if (likely(did_some_progress)) {
                for (i = 0; (z = zones[i]) != NULL; i++) {
                        if (!zone_watermark_ok(z, order, z->pages_min,
                                               classzone_idx, can_try_harder,
                                               gfp_mask & __GFP_HIGH))

if (!cpuset_zone_allowed(z)) continue;

page = buffered_rmqueue(z, order, gfp_mask); if (page) goto got_pg; }

When there is a successful result (stored in local variable "did_some_progress"),
it does the same effort with the second "for" loop to try to get pages reqested.

       } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                          for (i = 0; (z = zones[i]) != NULL; i++) {
                        if (!zone_watermark_ok(z, order, z->pages_high,
                                               classzone_idx, 0, 0))

if (!cpuset_zone_allowed(z)) continue;

page = buffered_rmqueue(z, order, gfp_mask); if (page) goto got_pg; }

Otherwize, which means there is no progress on free pages and
gfp_mask is not set at __GFP_FS and _GFP_NORETRY(allow to retry to get pages),
it calls zone_watermark_ok() with z->pages_high as mark and without
can_try_harder and gfp_high (last to argument of this function).

And there is a probability to get pages requested, it should try again
by calling buffered_rmqueue().

In this time, when it can get pages, goto got_pg: lebel.

                out_of_memory(gfp_mask, order);
                goto restart;

At last, there is no memory available.
It will try to get pages again.
goto restart: label, the first "for" loop to get pages requested.

       do_retry = 0;
        if (!(gfp_mask & __GFP_NORETRY)) {
                if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
                        do_retry = 1;
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
        if (do_retry) {
                blk_congestion_wait(WRITE, HZ/50);
                goto rebalance;

And the last chance to get free pages.
If gfp_mask argument allow to retry and reqested pages is not too big
(order <= 3) or __GFP_REPEAT is set (which means this allocation should not
be failed), and __GFP_NOFAIL (litelally) is set, "do_retry" is set to 1.

And go to rebalance: label to retry to get free pages requested.

        if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
                printk(KERN_WARNING "%s: page allocation failure."
                        " order:%d, mode:0x%x\n",
                        p->comm, order, gfp_mask);
        return NULL;
        zone_statistics(zonelist, z);
        return page;

The end of __alloc_pages() is there.
When there is no page available, return NULL pointer.
Wnen it can get pages requested, calls zone_statistics() and return
the beginning of pages as pointer of struct page.

The great effort of page allocation is now over.

inserted by FC2 system