배열 항목의 가중 무작위 샘플 * 대체 없음 *

Nov 28 2020

Javascript / ECMAScript 6 특정 솔루션이 필요합니다.

각 개체에 대한 가중치 배열을 사용하여 개체 배열에서 임의의 샘플을 생성하고 싶습니다. 인구 목록에는 구성원 유형이 아닌 실제 인구 구성원이 포함됩니다. 하나의 샘플을 선택한 후에는 다시 선택할 수 없습니다.

제가 작업중인 것과 유사한 문제는 체스 토너먼트의 가능한 결과를 시뮬레이션하는 것입니다. 각 플레이어의 등급은 체중입니다. 플레이어는 토너먼트 당 한 번만 (1 위, 2 위 또는 3 위) 배치 할 수 있습니다.

To pick a likely list of the top 3 winners could look like:

let winners = wsample(chessPlayers,  // population
                      playerRatings, // weights
                      3);            // sample size

The weighted list may, or may not, be integer values. It could be floats like [0.2, 0.1, 0.7, 0.3], or it could be integers like [20, 10, 70, 30]. The weights do not have to add up to a value that represents 100%.

Peter below gave me a good reference on a general algorithm, however It's not specific to JS: https://stackoverflow.com/a/62459274/7915759 it may be a good point of reference.

Solutions to the problem that rely on generating a second population list with each member copied weight number of times may not be a practical solution. Each weight in the weights array could be very high numbers, or they could be fractional; basically, any non-negative value.

Some additional questions:

Is there already an accumulate() function available in JS?
Is there a bisect() type function in JS that does a binary search of sorted lists?
Are there any efficient and low memory footprint JS modules available with statistical functions that include solutions for the above?

답변

1 meriton Nov 29 2020 at 11:44

The following implementation selects k out of n elements, without replacement, with weighted probabilities, in O(n + k log n) by keeping the accumulated weights of the remaining elements in a sum heap:

function sample_without_replacement<T>(population: T[], weights: number[], sampleSize: number) {

    let size = 1;
    while (size < weights.length) {
        size = size << 1;
    }

    // construct a sum heap for the weights
    const root = 1;
    const w = [...new Array(size) as number[], ...weights, 0];
    for (let index = size - 1; index >= 1; index--) {
        const leftChild = index << 1;
        const rightChild = leftChild + 1;
        w[index] = (w[leftChild] || 0) + (w[rightChild] || 0);
    }

    // retrieves an element with weight-index r 
    // from the part of the heap rooted at index
    const retrieve = (r: number, index: number): T => {
        if (index >= size) {
            w[index] = 0;
            return population[index - size];
        } 
        
        const leftChild = index << 1;
        const rightChild = leftChild + 1;

        try {
            if (r <= w[leftChild]) {
                return retrieve(r, leftChild);
            } else {
                return retrieve(r - w[leftChild], rightChild);
            }
        } finally {
            w[index] = w[leftChild] + w[rightChild];
        }
    }

    // and now retrieve sampleSize random elements without replacement
    const result: T[] = [];
    for (let k = 0; k < sampleSize; k++) {
        result.push(retrieve(Math.random() * w[root], root));
    }
    return result;
}

The code is written in TypeScript. You can transpile it to whatever version of EcmaScript you need in the TypeScript playground.

Test code:

const n = 1E7;
const k = n / 2;
const population: number[] = [];
const weight: number[] = [];
for (let i = 0; i < n; i++) {
    population[i] = i;
    weight[i] = i;
}

console.log(`sampling ${k} of ${n} elments without replacement`);
const sample = sample_without_replacement(population, weight, k);
console.log(sample.slice(0, 100)); // logging everything takes forever on some consoles
console.log("Done")

Executed in Chrome, this samples 5 000 000 out of 10 000 000 entries in about 10 seconds.

Todd Nov 28 2020 at 12:35

This is one approach, but not the most efficient.

The highest level function. It iterates k times, calling wchoice() each time. To remove the currently selected member from the population, I just set its weight to 0.

/**
 * Produces a weighted sample from `population` of size `k` without replacement.
 * 
 * @param {Object[]} population The population to select from.
 * @param {number[]} weights    The weighted values of the population.
 * @param {number}   k          The size of the sample to return.
 * @returns {[number[], Object[]]} An array of two arrays. The first holds the
 *                                 indices of the members in the sample, and
 *                                 the second holds the sample members.
 */
function wsample(population, weights, k) {
    let sample  = [];
    let indices = [];
    let index   = 0;
    let choice  = null;
    let acmwts  = accumulate(weights);

    for (let i=0; i < k; i++) {
        [index, choice] = wchoice(population, acmwts, true);
        sample.push(choice);
        indices.push(index);

        // The below updates the accumulated weights as if the member
        // at `index` has a weight of 0, eliminating it from future draws.
        // This portion could be optimized. See note below.
        let ndecr = weights[index];
        for (; index < acmwts.length; index++) {
            acmwts[index] -= ndecr;
        }
    }
    return [indices, sample];
}

The section of code above that updates the accumulated weights array is the point of inefficiency in the algorithm. Worst case it's O(n - ?) to update on every pass. Another solution here follows a similar algorithm to this, but uses a heap to reduce the work needed to maintain the accumulated weights array at O(log n).

wsample() calls wchoice() which selects one member from the weighted list. wchoice() generates an array of cumulative weights, generates a random number from 0 to the total sum of the weights (last item in the cumulative weights list). Then finds its insertion point in the cumulative weights; which is the winner: