Free Pages

In Linux, when the system wants to free a chunk of pages, it calls the free_pages()
function.

free_pages() is defined in ${linux src}/mm/page_alloc.c:

fastcall void free_pages(unsigned long addr, unsigned int order)
{
        if (addr != 0) {
                BUG_ON(!virt_addr_valid((void *)addr));
                __free_pages(virt_to_page((void *)addr), order);
        }
}

The BUG_ON() macro asserts a condition that should never happen.
The virt_addr_valid() and virt_to_page() macros are defined in ${linux src}/include/asm-i386/page.h:

#define virt_addr_valid(kaddr)  pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#define virt_to_page(kaddr)     pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)

#define pfn_valid(pfn)          ((pfn) < max_mapnr)
#define pfn_to_page(pfn)        (mem_map + (pfn))

pfn_valid() checks that the page frame number (pfn) does not exceed max_mapnr,
and virt_to_page() expands to pfn_to_page().
On i386, __pa() subtracts 0xC0000000 (PAGE_OFFSET) from the virtual address passed as argument,
and the result is then shifted right by PAGE_SHIFT.

At this point, the shifted value is the page frame number of the page
passed as argument.
This value is passed to the pfn_to_page() macro, which gets the struct page
corresponding to the page we are interested in.

mem_map is the start of an array that holds a struct page for every page frame
on the system, ordered by address (that is, by page frame number).
So mem_map + pfn yields a pointer to the struct page of the target page.
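To make the index arithmetic concrete, here is a small user-space sketch (not kernel code) of the same computation. PAGE_OFFSET of 0xC0000000 and PAGE_SHIFT of 12 are assumed, matching a default i386 configuration, and mem_map is replaced with a small local array.

#include <stdio.h>

#define PAGE_OFFSET 0xC0000000UL   /* i386 default kernel mapping base */
#define PAGE_SHIFT  12             /* 4 KiB pages */

struct page { int dummy; };        /* stand-in for the real struct page */

static struct page fake_mem_map[0x2000];    /* stand-in for mem_map */

int main(void)
{
        unsigned long kaddr = 0xC1234000UL;          /* some kernel virtual address */
        unsigned long phys  = kaddr - PAGE_OFFSET;   /* what __pa() does on i386 */
        unsigned long pfn   = phys >> PAGE_SHIFT;    /* page frame number */
        struct page *pg     = fake_mem_map + pfn;    /* what pfn_to_page() does */

        printf("kaddr=%#lx phys=%#lx pfn=%#lx page=%p\n",
               kaddr, phys, pfn, (void *)pg);
        return 0;
}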

Once it has the pointer to the struct page, it calls __free_pages() with the order.

order is the base-2 logarithm of the number of pages to be freed.
Linux handles memory in chunks of 2^order pages, and the maximum value of order here is 9,
so chunk sizes are 1, 2, 4, 8, 16, 32, 64, 128, 256, or 512 pages.
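As a quick check of the order arithmetic (plain C, nothing kernel specific), the chunk size in pages is simply 1 << order:

#include <stdio.h>

int main(void)
{
        /* order 0..9 gives chunk sizes 1, 2, 4, ..., 512 pages */
        for (unsigned int order = 0; order <= 9; order++)
                printf("order %u -> %u pages\n", order, 1u << order);
        return 0;
}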

__free_pages() is also defined in ${linux src}/mm/page_alloc.c.

fastcall void __free_pages(struct page *page, unsigned int order)
{
        if (!PageReserved(page) && put_page_testzero(page)) {
                if (order == 0)
                        free_hot_page(page);
                else
                        __free_pages_ok(page, order);
        }
}

About reserved pages, ${linux src}/include/linux/page-flags.h has this comment:

 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...

and PageReserved() checks whether the corresponding bit is set.
The put_page_testzero() macro is defined in ${linux src}/include/linux/mm.h with this comment:

/*
 * Drop a ref, return true if the logical refcount fell to zero (the page has
 * no users)
 */
#define put_page_testzero(p)                            \
        ({                                              \
                BUG_ON(page_count(p) == 0);             \
                atomic_add_negative(-1, &(p)->_count);  \
        })

When these conditions are satisfied, it calls __free_pages_ok(), which is the
main routine in Linux memory management for freeing pages, unless the order is
zero (a single page).
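The _count field is stored biased by one: a page with a single user has _count == 0, and a completely free page has _count == -1 (which is also why page_count(), shown later, adds 1). Here is a minimal user-space sketch of that logic, using a plain int instead of atomic_t and assert() in place of BUG_ON():

#include <assert.h>
#include <stdio.h>

struct fake_page { int _count; };   /* sketch of the biased reference count */

static int fake_page_count(struct fake_page *p)
{
        return p->_count + 1;              /* like page_count(): biased by one */
}

static int fake_put_page_testzero(struct fake_page *p)
{
        assert(fake_page_count(p) != 0);   /* BUG_ON(page_count(p) == 0) */
        return --p->_count < 0;            /* true when the last user dropped */
}

int main(void)
{
        struct fake_page pg = { ._count = 1 };   /* two users */

        printf("first drop, last user? %d\n", fake_put_page_testzero(&pg));   /* 0 */
        printf("second drop, last user? %d\n", fake_put_page_testzero(&pg));  /* 1 */
        printf("page_count now %d\n", fake_page_count(&pg));                  /* 0 */
        return 0;
}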

If the order is 0, it calls free_hot_page() instead.
free_hot_page() and free_cold_page() are defined in ${linux src}/mm/page_alloc.c.

void fastcall free_hot_page(struct page *page)
{
        free_hot_cold_page(page, 0);
}

void fastcall free_cold_page(struct page *page)
{
        free_hot_cold_page(page, 1);
}

These two functions call free_hot_cold_page(), which is also in page_alloc.c.

static void fastcall free_hot_cold_page(struct page *page, int cold)
{
        struct zone *zone = page_zone(page);
        struct per_cpu_pages *pcp;
        unsigned long flags;

        arch_free_page(page, 0);

        kernel_map_pages(page, 1, 0);
        inc_page_state(pgfree);
        if (PageAnon(page))
                page->mapping = NULL;
        free_pages_check(__FUNCTION__, page);
        pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
        local_irq_save(flags);
        list_add(&page->lru, &pcp->list);
        pcp->count++;
        if (pcp->count >= pcp->high)
                pcp->count -= free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
        local_irq_restore(flags);
        put_cpu();
}

On i386, arch_free_page() and kernel_map_pages() do nothing.
PageAnon() is defined in ${linux src}/include/linux/mm.h.
It checks the PAGE_MAPPING_ANON bit; there is a comment about this value in the mm.h header file.
When this bit is set, page->mapping is reset to NULL here.
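For anonymous pages the kernel reuses the low bit of the mapping pointer as a tag (PAGE_MAPPING_ANON), and that bit is what PageAnon() tests. A user-space sketch of that pointer tagging, with fake_ names standing in for the real structures:

#include <stdio.h>

#define PAGE_MAPPING_ANON 1UL   /* low bit of page->mapping marks an anon page */

struct fake_page { void *mapping; };

static int fake_PageAnon(struct fake_page *page)
{
        return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}

int main(void)
{
        static struct { int dummy; } anon_vma;   /* stand-in for an anon_vma */
        struct fake_page pg;

        /* anonymous pages store the anon_vma pointer with the low bit set */
        pg.mapping = (void *)((unsigned long)&anon_vma | PAGE_MAPPING_ANON);
        printf("PageAnon: %d\n", fake_PageAnon(&pg));   /* 1 */

        pg.mapping = NULL;                              /* free_hot_cold_page() clears it */
        printf("PageAnon: %d\n", fake_PageAnon(&pg));   /* 0 */
        return 0;
}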

pcp stands for "per-CPU pages", and struct per_cpu_pages is defined in ${linux src}/include/linux/mmzone.h.

struct per_cpu_pages {
        int count;              /* number of pages in the list */
        int low;                /* low watermark, refill needed */
        int high;               /* high watermark, emptying needed */
        int batch;              /* chunk size for buddy add/remove */
        struct list_head list;  /* the list of pages */
};

On Linux, each processor has two of these pcp structures: a hot pcp and a cold pcp.
A hot page is likely to still be in the processor's L2 cache, so the chance that
the page is needed again soon is high; &page->lru is linked into the list of pcp[hot].
lru stands for "least recently used".

pcp->high is the high watermark for the number of pages the pcp may hold.
When the count exceeds it, free_pages_bulk(), which is described later,
really frees pages back into the free lists of the zone the pages belong to.
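The effect is a small per-CPU cache in front of the buddy allocator: freed single pages go onto the pcp list first, and only when the list grows past pcp->high is a batch handed back to the zone. A rough sketch of that decision follows; the types and counts are invented for illustration, the kernel works with struct per_cpu_pages and real page lists.

#include <stdio.h>

/* Invented illustration type; the kernel uses struct per_cpu_pages. */
struct fake_pcp {
        int count;   /* pages currently cached on this CPU */
        int high;    /* start draining above this */
        int batch;   /* how many to hand back to the buddy lists at once */
};

static int fake_bulk_free(int nr)
{
        printf("returning %d pages to the zone free lists\n", nr);
        return nr;
}

static void fake_free_hot_page(struct fake_pcp *pcp)
{
        pcp->count++;                      /* list_add(&page->lru, &pcp->list) */
        if (pcp->count >= pcp->high)
                pcp->count -= fake_bulk_free(pcp->batch);
}

int main(void)
{
        struct fake_pcp pcp = { .count = 0, .high = 6, .batch = 4 };

        for (int i = 0; i < 8; i++)
                fake_free_hot_page(&pcp);
        printf("pages still cached: %d\n", pcp.count);
        return 0;
}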

__free_pages_ok()

__free_pages_ok() is defined in ${linux src}/mm/page_alloc.c.

void __free_pages_ok(struct page *page, unsigned int order)
{
        LIST_HEAD(list);
        int i;

        arch_free_page(page, order);

        mod_page_state(pgfree, 1 << order);

#ifndef CONFIG_MMU
        if (order > 0)
                for (i = 1 ; i < (1 << order) ; ++i)
                        __put_page(page + i);
#endif

        for (i = 0 ; i < (1 << order) ; ++i)
                free_pages_check(__FUNCTION__, page + i);
        list_add(&page->lru, &list);
        kernel_map_pages(page, 1<<order, 0);
        free_pages_bulk(page_zone(page), 1, &list, order);
}

It first calls arch_free_page(), but this is defined as do {} while (0),
which does nothing.

mod_page_state() is defined in ${linux src}/include/linux/page-flags.h.

#define mod_page_state(member, delta)   \
        __mod_page_state(offsetof(struct page_state, member), (delta))

struct page_state is defined in the same header file, with one instance for
each processor on the system.
offsetof() returns the offset of member from the beginning of struct page_state.

And __mod_page_state() is defined in ${linux src}/mm/page_alloc.c.

void __mod_page_state(unsigned long offset, unsigned long delta)
{
        unsigned long flags;
        void* ptr;

        local_irq_save(flags);
        ptr = &__get_cpu_var(page_states);
        *(unsigned long*)(ptr + offset) += delta;
        local_irq_restore(flags);
}

This function adds the number of pages being freed (1 << order, passed as delta)
to the pgfree member of this CPU's page_states.
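A user-space sketch of the offsetof() trick, with an invented two-field statistics structure standing in for the real struct page_state:

#include <stdio.h>
#include <stddef.h>

/* invented miniature of struct page_state */
struct fake_page_state {
        unsigned long pgalloc;
        unsigned long pgfree;
};

static struct fake_page_state this_cpu_state;   /* one instance per CPU in the kernel */

/* same idea as __mod_page_state(): bump a counter found by byte offset */
static void fake_mod_page_state(unsigned long offset, unsigned long delta)
{
        void *ptr = &this_cpu_state;

        *(unsigned long *)((char *)ptr + offset) += delta;
}

int main(void)
{
        /* mod_page_state(pgfree, 1 << order) with order == 3 */
        fake_mod_page_state(offsetof(struct fake_page_state, pgfree), 1UL << 3);

        printf("pgfree = %lu\n", this_cpu_state.pgfree);   /* 8 */
        return 0;
}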

Next "for" loop

        for (i = 0 ; i < (1 << order) ; ++i)
                free_pages_check(__FUNCTION__, page + i);

means that each of the pages from page up to page + (1 << order) - 1
is checked by free_pages_check().

Next,

        list_add(&page->lru, &list);

"lru" stands "least resently used".
This member of struct page is tweaked in __free_pages_bulk(),
this is added into link list (link initialized at the beginning of this function.)

Instead of calling __free_pages_bulk() directly, it calls free_pages_bulk().
Both are also defined in page_alloc.c.

Before looking at free_pages_bulk(), visit page_zone(), because its result
is passed as an argument to free_pages_bulk():

        free_pages_bulk(page_zone(page), 1, &list, order);

page_zone()


page_zone() and its related definitions live in ${linux src}/include/linux/mmzone.h
and ${linux src}/include/linux/mm.h.

in mmzone.h

#define ZONE_DMA                0
#define ZONE_NORMAL             1
#define ZONE_HIGHMEM            2

#define MAX_NR_ZONES            3       /* Sync this with ZONES_SHIFT */
#define ZONES_SHIFT             2       /* ceil(log2(MAX_NR_ZONES)) */

and in mm.h

#define ZONES_WIDTH             ZONES_SHIFT
#define ZONES_PGSHIFT           (ZONES_PGOFF * (ZONES_WIDTH != 0))
#define ZONETABLE_PGSHIFT       ZONES_PGSHIFT
#define ZONETABLE_MASK             ((1UL << ZONETABLE_SHIFT) - 1)

static inline struct zone *page_zone(struct page *page)
{
        return zone_table[(page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK];
}

There are three types of zones: DMA, NORMAL, and HIGHMEM.
A few bits of page->flags record the zone the page belongs to, and
(page->flags >> ZONETABLE_PGSHIFT) & ZONETABLE_MASK (a two-bit mask, 0x3, in
this configuration) extracts that zone index.

zone_table[] is an array of pointers to struct zone, one for each of the three zones.
So page_zone() returns a pointer to the zone to which the page passed as
argument belongs.
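A sketch of that bit extraction with made-up shift and mask values; the real ZONETABLE_PGSHIFT depends on the node and section configuration, here the zone bits are simply assumed to sit at the top of a 32-bit flags word:

#include <stdio.h>

/* Illustration only: the actual shift depends on the NODES/SECTIONS setup. */
#define FAKE_ZONETABLE_PGSHIFT 30      /* assumed position of the zone bits */
#define FAKE_ZONETABLE_MASK    0x3UL   /* two bits for three zones */

enum { ZONE_DMA = 0, ZONE_NORMAL = 1, ZONE_HIGHMEM = 2 };

static unsigned long fake_page_zone_idx(unsigned long flags)
{
        return (flags >> FAKE_ZONETABLE_PGSHIFT) & FAKE_ZONETABLE_MASK;
}

int main(void)
{
        /* build a flags word whose zone bits say ZONE_NORMAL */
        unsigned long flags = ((unsigned long)ZONE_NORMAL << FAKE_ZONETABLE_PGSHIFT) | 0x1;

        printf("zone index: %lu\n", fake_page_zone_idx(flags));   /* 1 */
        return 0;
}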

Now free_pages_bulk() is called.

        free_pages_bulk(page_zone(page), 1, &list, order);

This function does some initialization and calls __free_pages_bulk().

free_pages_bulk()

static int
free_pages_bulk(struct zone *zone, int count,
                struct list_head *list, unsigned int order)
{
        unsigned long flags;
        struct page *page = NULL;
        int ret = 0;

        spin_lock_irqsave(&zone->lock, flags);
        zone->all_unreclaimable = 0;
        zone->pages_scanned = 0;
        while (!list_empty(list) && count--) {
                page = list_entry(list->prev, struct page, lru);
                /* have to delete it as __free_pages_bulk list manipulates */
                list_del(&page->lru);
                __free_pages_bulk(page, zone, order);
                ret++;
        }
        spin_unlock_irqrestore(&zone->lock, flags);
        return ret;
}

This code is protected by spin_lock_irqsave() and spin_unlock_irqrestore().
The list_entry() macro recovers the pointer to a struct page from its lru member.
The page was put on this list by __free_pages_ok() with list_add():

        list_add(&page->lru, &list);

The list links the lru members of struct page, so in effect list_entry()
gives back the struct page itself.
Here the list has one element and count is 1,
so page ends up pointing to the target page.
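list_entry() is the kernel's container_of(): given a pointer to the lru member embedded inside a struct page, it steps back to the start of the containing structure. A user-space sketch of the same pointer arithmetic, with simplified stand-in types:

#include <stdio.h>
#include <stddef.h>

/* simplified stand-ins for struct list_head and struct page */
struct list_head { struct list_head *next, *prev; };

struct fake_page {
        unsigned long flags;
        struct list_head lru;
};

/* same idea as list_entry()/container_of() */
#define fake_list_entry(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
        struct fake_page pg = { .flags = 0 };
        struct list_head *node = &pg.lru;     /* what the free list actually links */

        struct fake_page *back = fake_list_entry(node, struct fake_page, lru);
        printf("recovered the page: %s\n", back == &pg ? "yes" : "no");
        return 0;
}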

Once a page is free, its page->lru is linked into the free_area[].free_list of
the zone the page belongs to.
list_del() first removes page->lru from the list it is currently on.

And __free_pages_bulk() is called.

__free_pages_bulk()

__free_pages_bulk() is the main function for freeing a chunk of pages.
It checks whether the chunk can be merged with its buddy into a chunk of double the size.
If so, it merges them, removes the merged chunks from the lower level,
and links the new double-sized chunk into the free_list one level up.

Look at "memory management" to see how the system keeps its free memory adjusted.

__free_pages_bulk() is defined in ${linux src}/mm/page_alloc.c.

First, page_to_pfn() is called.

        page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

page_to_pfn() is defined in ${linux src}/include/linux/mmzone.h.

#define page_to_pfn(page)                                               \
({                                                                      \
        page - __section_mem_map_addr(__nr_to_section(                  \
                page_to_section(page)));                                \
})

page_to_section() and its related definitions are found in ${linux src}/include/linux/mm.h.

#ifdef CONFIG_SPARSEMEM
#define SECTIONS_WIDTH          SECTIONS_SHIFT
#else
#define SECTIONS_WIDTH          0
#endif

#define SECTIONS_PGSHIFT        (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define SECTIONS_MASK           ((1UL << SECTIONS_WIDTH) - 1)

static inline unsigned long page_to_section(struct page *page)
{
        return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}

SECTIONS_SHIFT is not defined when CONFIG_SPARSEMEM is not set.
Then SECTIONS_WIDTH = 0, so SECTIONS_PGSHIFT = 0.

And SECTIONS_MASK equals (1 << 0) - 1 = 0.
So, when CONFIG_SPARSEMEM is not defined,
page_to_section() returns 0 no matter what page->flags contains.

__nr_to_section() is defined in ${linux src}/include/linux/mmzone.h.

static inline struct mem_section *__nr_to_section(unsigned long nr)
{
        return &mem_section[nr];
}

mem_section structure is also defined in ${linux src}/include/linux/mmzone.h.

struct mem_section {
        unsigned long section_mem_map;
};

__nr_to_section() therefore returns a pointer to mem_section[0].
Its section_mem_map member points to the array of struct page for every page
(that is, mem_map), so the subtraction in page_to_pfn() literally returns the
page frame number of the page passed as argument.
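With flat memory there is effectively one section whose map is mem_map, so the macro boils down to pointer subtraction. A tiny user-space sketch of that, with a fake mem_map array:

#include <stdio.h>

struct page { int dummy; };

static struct page fake_mem_map[1024];     /* stand-in for mem_map */

/* what page_to_pfn() reduces to without CONFIG_SPARSEMEM */
static unsigned long fake_page_to_pfn(struct page *page)
{
        return (unsigned long)(page - fake_mem_map);
}

int main(void)
{
        struct page *pg = &fake_mem_map[42];

        printf("pfn = %lu\n", fake_page_to_pfn(pg));   /* 42 */
        return 0;
}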

After the number of freed pages is added to the zone's free_pages member,
the "while" loop is executed.

        while (order < MAX_ORDER-1) {

MAX_ORDER equals 9.
Linux deals with memory in chunks of pages whose sizes are
1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 pages;
the base-2 logarithms of these sizes are 0 through 9.

This "while" loop does visit sizes greater then the order, which is the
size now the system will free.

At first, it gets page frame number that specifies new chunk of pages
when this chunk of pages are merged with another half of chunk of pages,
which makes double sized chunk of pages. (level is order + 1).

To do this, it calls __find_combined_index().

__find_combined_index() is defined in page_alloc.c

static inline unsigned long
__find_combined_index(unsigned long page_idx, unsigned int order)
{
        return (page_idx & ~(1 << order));
}

page_idx is the page frame number given by page_to_pfn()
(masked to the low MAX_ORDER bits), and (1 << order) equals 2 to the power of order.

This operation gives the index of the combined chunk:
the two chunks to be merged form a pair, each of size (1 << order),
and the new combined chunk is treated as a member of the (1 << (order + 1)) size level.

To see how the system calculates an index, look at "Calculate the index for chunk of pages".
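For example, at order 2 (4-page chunks) the chunks starting at indices 8 and 12 are buddies, and both map to combined index 8. A quick check of the arithmetic:

#include <stdio.h>

/* same arithmetic as __find_combined_index() */
static unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
        return page_idx & ~(1UL << order);
}

int main(void)
{
        /* at order 2, chunks start at indices 8 and 12; both map to 8 */
        printf("%lu\n", combined_index(8, 2));    /* 8 */
        printf("%lu\n", combined_index(12, 2));   /* 8 */
        return 0;
}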

Next, it calls __page_find_buddy().

static inline struct page *
__page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order)
{
        unsigned long buddy_idx = page_idx ^ (1 << order);

        return page + (buddy_idx - page_idx);
}

This function returns a pointer to the struct page at the beginning of
the other half of the pair, that is, the buddy chunk of (1 << order) pages.

Again, to see how the system calculates the index, look at "Calculate the index for chunk of pages".
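The buddy index is obtained by flipping the same bit the other way: at order 2, index 8's buddy is 12 and index 12's buddy is 8. A quick check:

#include <stdio.h>

/* same arithmetic as __page_find_buddy(), on indices only */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
        return page_idx ^ (1UL << order);
}

int main(void)
{
        printf("%lu\n", buddy_index(8, 2));    /* 12 */
        printf("%lu\n", buddy_index(12, 2));   /* 8 */
        return 0;
}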

And the next piece of code is important.

                if (!page_is_buddy(buddy, order))
                        break;          /* Move the buddy up one level. */

If page_is_buddy() returns 0, the system cannot merge this chunk of pages
and cannot move up one level.

But if it returns 1, this chunk of pages is mergeable.
In that case the code does not break out of the loop and continues executing it.

The code that follows is

                list_del(&buddy->lru);
                area = zone->free_area + order;
                area->nr_free--;
                rmv_page_order(buddy);
                page = page + (combined_idx - page_idx);
                page_idx = combined_idx;
                order++;

buddy->lru is linked into free_area[order].free_list of its zone.
Since this code runs only when the chunks are mergeable,
the buddy chunk of (1 << order) pages has to be removed from the free list of its current order;
list_del() does this job.

The local variable area is a pointer to the struct free_area for this order,
and the number of free chunks at this order is decremented (area->nr_free--).

rmv_page_order() is defined in ${linux src}/mm/page_alloc.c.

static inline void rmv_page_order(struct page *page)
{
        __ClearPagePrivate(page);
        page->private = 0;
}

The private bit of page->flags is cleared and page->private is set to 0,
which means this chunk of pages no longer belongs to any level.

Then the page pointer is advanced to the struct page for the new page frame index
stored in combined_idx.

The page frame index being processed becomes combined_idx and order is incremented
(up one level); the "while" loop then runs again.

This loop is executed until no more merging is possible.
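Putting the index arithmetic together, here is a user-space simulation of the merge walk. Whether the buddy is free is faked with a simple rule; the real kernel asks page_is_buddy() and manipulates the zone's free lists.

#include <stdio.h>

#define FAKE_MAX_ORDER 9

/* fake "is the buddy chunk free at this order?" check */
static int fake_buddy_is_free(unsigned long buddy_idx, unsigned int order)
{
        (void)buddy_idx;        /* a real check would inspect the buddy's struct page */
        return order < 3;       /* pretend the buddy happens to be free three times */
}

int main(void)
{
        unsigned long page_idx = 5;    /* freeing a single page at index 5 */
        unsigned int order = 0;

        while (order < FAKE_MAX_ORDER - 1) {
                unsigned long buddy_idx    = page_idx ^ (1UL << order);
                unsigned long combined_idx = page_idx & ~(1UL << order);

                if (!fake_buddy_is_free(buddy_idx, order))
                        break;          /* cannot move the buddy up one level */

                printf("order %u: merge %lu with buddy %lu -> %lu\n",
                       order, page_idx, buddy_idx, combined_idx);
                page_idx = combined_idx;
                order++;
        }
        printf("final chunk: index %lu, order %u\n", page_idx, order);
        return 0;
}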

Now look at page_is_buddy().
page_is_buddy() is defined in ${linux src}/mm/page_alloc.c.

static inline int page_is_buddy(struct page *page, int order)
{
       if (PagePrivate(page)           &&
           (page_order(page) == order) &&
           !PageReserved(page)         &&
            page_count(page) == 0)
               return 1;
       return 0;
}

And page_order() is also defined in the same file.

static inline unsigned long page_order(struct page *page) {
        return page->private;
}

The private member holds the order to which the page currently belongs.
Two checking macros are defined in ${linux src}/include/linux/page-flags.h.

#define PagePrivate(page)       test_bit(PG_private, &(page)->flags)
#define PageReserved(page)      test_bit(PG_reserved, &(page)->flags)

page_count() is defined in ${linux src}/include/linux/mm.h.

#define page_count(p)              (atomic_read(&(p)->_count) + 1)

page_is_buddy() returns 1 if the PG_private bit is set in page->flags, the page
belongs to this order, the page is not reserved, and the page count equals 0;
otherwise it returns 0.
If a page is free, its _count is -1, so page_count() returns 0.

In short, it returns 1 if this chunk of pages is free, mergeable, not reserved,
and not referenced.

When this function returns 0, this chunk of pages is not mergeable,
so the "while" loop is broken out of.

After the "while" loop exits, which means the pages have been merged into
as large a chunk as possible, the code shown below is executed.

        set_page_order(page, order);
        list_add(&page->lru, &zone->free_area[order].free_list);
        zone->free_area[order].nr_free++;

It sets the page order with

static inline void set_page_order(struct page *page, int order) {
        page->private = order;
        __SetPagePrivate(page);
}

and links this page's lru into the free_list of free_area[order] of the zone
to which this page belongs.

The nr_free member of the free_area structure is the number of free chunks of
pages at this order, so nr_free is incremented.

At this point, the functions return to their callers, back up to __free_pages_ok(),
and __free_pages_ok() itself returns to its caller.