Multi-Collection Pagination

31 October 2015 ~ blog

A few years ago, I was working on a project where we had collections of data spread across multiple rows of data…​ and then we had to provide a paginated view of that data. This research was the result of those efforts. The discussion here is a bit more rigorous than I usually go into, so if you just want the implementation code jump to the bottom.

Introduction

Consider that you have a data set representing a collection of collections:

[
    [ A0, A1, A2, A3, A4, A5 ],
    [ B0, B1, B2, B3, B4, B5 ],
    [ C0, C1, C2, C3, C4, C5 ]
]

We want to retrieve the data in a paginated fashion where the subset (page) with index P and subset size (page size) S is used to retrieve only the desired elements in the most efficient means possible.

Consider also that the data sets may be very large and that the internal collections may not be directly associated with the enclosing collection (e.g. two different databases).

Also consider that the subsets may cross collection boundaries or contain fewer than the desired number of elements.

Lastly, requests for data subsets will be more likely discrete events – one subset per request, rather than iterating over all results.

For a page size of four (S = 4) you would have the following five pages:

P0 : [ A0, A1, A2, A3 ]
P1 : [ A4, A5, B0, B1 ]
P2 : [ B2, B3, B4, B5 ]
P3 : [ C0, C1, C2, C3 ]
P4 : [ C4, C5 ]

Computations

The overall collection is traversed to determine how many elements are contained within each sub-collection; this may be pre-computed or done at runtime. Three counts are calculated or derived for each sub-collection:

  • Count (CI) - the number of elements in the sub-collection.

  • Count-before (CB) - the total count of all sub-collection elements counted before this collection, but not including this collection.

  • Count-with (CW) - the total count of all sub-collection elements counted before and including this collection.

For our example data set we would have:

[
    { CI:6, CB:0, CW:6 [ A0, A1, A2, A3, A4, A5 ] },
    { CI:6, CB:6, CW:12 [ B0, B1, B2, B3, B4, B5 ] },
    { CI:6, CB:12, CW:18 [ C0, C1, C2, C3, C4, C5 ] }
]

This allows for a simple means of selecting only the sub-collections we are interested in; those containing the desired elements based on the starting and ending indices for the subset (START and END respectively). These indices can easily be calculated as:

START = P * S

END = START + S – 1
Note
The indices referenced here are for the overall collection, not the individual sub-collections.

The desired elements will reside in sub-collections whose inclusive count (CW) is greater than the starting index and whose preceding count (CB) is less than or equal to the ending index, or:

CW > START and CB ≤ END

For the case of selecting the second subset of data (P = 1) with a page size of four (S = 4) we would have:

START = 4

END = 7

This will select the first two or the three sub-collections as "interesting" sub-collections containing at least some of our desired elements, namely:

{ CI:6, CB:0, CW:6 [ A0, A1, A2, A3, A4, A5 ] },
{ CI:6, CB:6, CW:12 [ B0, B1, B2, B3, B4, B5 ] }

What remains is to gather from these sub-collections (call them SC[0], SC[1]) the desired number of elements (S).

To achieve this, a local starting and ending index must be calculated while iterating through the "interesting" sub-collections to gather the elements until either the desired amount is obtained (S) or there are no more elements available.

  1. Calculate the initial local starting index (LOCAL_START) by subtracting the non-inclusive preceding count value of the first selected collection (SC[0]) from the overall starting index.

  2. Iterate the selected collections (in order) until the desired amount has been gathered

This is more clearly represented in pseudo code as:

LOCAL_START = START – SC[0].CB
REMAINING = S

for-each sc in SC while REMAINING > 0

    if( REMAINING < (sc.size() - LOCAL_START) )
        LOCAL_END = LOCAL_START + REMAINING - 1
    else
        LOCAL_END = sc.size()-1

    FOUND = sc.sub( LOCAL_START, LOCAL_END )
    G.addAll( FOUND )
    REMAINING = REMAINING – FOUND.size()
    LOCAL_START = 0

end

Where the gathered collection of elements (G) is your resulting data set containing the elements for the specified data page.

It must be stated that the ordering of the overall collection and the sub-collections must be consistent across multiple data requests for this procedure to work properly.

Implementation

Ok now, enough discussion. Let’s see what this looks like with some real Groovy code. First, we need our collections of collections data to work with:

def data = [
    [ 'A0', 'A1', 'A2', 'A3', 'A4', 'A5' ],
    [ 'B0', 'B1', 'B2', 'B3', 'B4', 'B5' ],
    [ 'C0', 'C1', 'C2', 'C3', 'C4', 'C5' ]
]

Next, we need to implement the algorithm in Groovy:

int page = 1
int pageSize = 4

// pre-computation

int before = 0
def prepared = data.collect {d ->
    def result = [
        countIn: d.size(),
        countBefore: before,
        countWith: before + d.size(),
        values:d
    ]

    before += d.size()

    return result
}

// main computation

def localStart = (page * pageSize ) - prepared[0].countBefore
def remaining = pageSize

def gathered = []

prepared.each { sc->
    if( remaining ){
        def localEnd
        if( remaining < (sc.values.size() - localStart) ){
            localEnd = localStart + remaining - 1
        } else {
            localEnd = sc.values.size() - 1
        }

        def found = sc.values[localStart..localEnd]
        gathered.addAll(found)

        remaining -= found.size()
        localStart = 0
    }
}

println "P$page : $gathered"

which yields

P1 : [A4, A5, B0, B1]

and if you look all the way back up to the beginning of the article, you see that this is the expected data set for page 1 of the example data.

It’s not a scenario I have run into often, but it was a bit of a tricky one to unravel. The pre-computation steps ended up being the key to keeping it simple and stable.


Creative Commons License CoffeaElectronica.com content is copyright © 2016 Christopher J. Stehno and available under a Creative Commons Attribution-ShareAlike 4.0 International License.