Julia multi-threading does not scale for an embarrassingly parallel job

Nov 20 2020

The code below calculates the average number of draws needed to get 50 unique cards from several sets. All that matters is that this problem does not require much RAM and does not share any variables when run in multi-threaded mode. When launched with more than one thread to perform 400,000 simulations, it consistently takes about one second longer than two processes launched together, each performing 200,000 simulations. This has been bothering me and I could not find any explanation.

This is the Julia code in epic_draw_multi_thread.jl:

using Random
using Printf
import Base.Threads.@spawn

function pickone(dist)
    n = length(dist)
    i = 1
    r = rand()
    while r >= dist[i] && i<n 
        i+=1
    end
    return i
end  

function init_items(type_dist, unique_elements)
    return zeros(Int32, length(type_dist), maximum(unique_elements))
end

function draw(type_dist, unique_elements_dist)
    item_type = pickone(type_dist)
    item_number = pickone(unique_elements_dist[item_type])
    return item_type, item_number
end

function draw_unique(type_dist, unique_elements_dist, items, x)
    while sum(items .> 0) < x
        item_type, item_number = draw(type_dist, unique_elements_dist)
        items[item_type, item_number] += 1
    end
    return sum(items)
end

function average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
    println(@sprintf("Started average_for_unique on thread %d with n = %d", Threads.threadid(), n))
    items = init_items(type_dist, unique_elements)

    tot_draws = 0
    for i in 1:n
        tot_draws += draw_unique(type_dist, unique_elements_dist, items, x)
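        if reset
            items .= 0                  # start the next simulation from an empty collection
        else
            items[items .> 1] .-= 1     # broadcasted in-place decrement (plain -= errors on a vector)
        end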
    end

    println(@sprintf("Completed average_for_unique on thread %d with n = %d", Threads.threadid(), n))
    return tot_draws / n
end

function parallel_average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
    println("Started computing...")
    t = max(Threads.nthreads() - 1, 1)
    m = Int32(round(n / t))
    tasks = Array{Task}(undef, t)
    @sync for i in 1:t
        task = @spawn average_for_unique(type_dist, unique_elements_dist, x, m, reset)
        tasks[i] = task
    end
    sum(fetch(t) for t in tasks) / t
end
    
type_dist = [0.3, 0.3, 0.2, 0.15, 0.05]
const cum_type_dist = cumsum(type_dist)

unique_elements = [21, 27, 32, 14, 10]
unique_elements_dist = [[1 / unique_elements[j] for i in 1:unique_elements[j]] for j in 1:length(unique_elements)]
const cum_unique_elements_dist = [cumsum(dist) for dist in unique_elements_dist]

str_n = ARGS[1]
n = parse(Int64, str_n)
avg = parallel_average_for_unique(cum_type_dist, cum_unique_elements_dist, 50, n)
print(avg)

This is the command issued in the shell to run with two worker threads (--threads 3, because the code uses one thread fewer than the number available), along with the output and timing results:

time julia --threads 3 epic_draw_multi_thread.jl 400000
Started computing...
Started average_for_unique on thread 3 with n = 200000
Started average_for_unique on thread 2 with n = 200000
Completed average_for_unique on thread 2 with n = 200000
Completed average_for_unique on thread 3 with n = 200000
70.44460749999999
real    0m14.347s
user    0m26.959s
sys     0m2.124s

This is the command issued in the shell to run two processes, each with half the job size, along with the output and timing results:

time julia --threads 1 epic_draw_multi_thread.jl 200000 &
time julia --threads 1 epic_draw_multi_thread.jl 200000 &
Started computing...
Started computing...
Started average_for_unique on thread 1 with n = 200000
Started average_for_unique on thread 1 with n = 200000
Completed average_for_unique on thread 1 with n = 200000
Completed average_for_unique on thread 1 with n = 200000
70.434375
real    0m12.919s
user    0m12.688s
sys     0m0.300s
70.448695
real    0m12.996s
user    0m12.790s
sys     0m0.308s

No matter how many times I repeat the experiment, the multi-threaded mode is always slower. Notes:

I wrote parallel code to estimate the value of PI and did not run into the same issue. However, I don't see anything in this code that could cause contention between threads and slow it down.
When started with more than one thread, I use the number of threads minus one to perform the draws. Otherwise the last thread seems to hang. The statement t = max(Threads.nthreads() - 1, 1) can be changed to t = Threads.nthreads() to use the exact number of available threads.
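
A minimal sketch of how the work could be divided if every available thread were used. Note that rounding n / t, as in the code above, can drop or add a simulation when n is not divisible by t. The names split_counts, work and run_all are hypothetical helpers introduced only for this illustration, and work is a toy stand-in for the real simulation:

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: split n simulations into t chunk sizes that sum to n.
function split_counts(n::Integer, t::Integer)
    base, extra = divrem(n, t)
    # The first `extra` chunks receive one additional simulation.
    return [base + (i <= extra ? 1 : 0) for i in 1:t]
end

# Toy stand-in for the real workload: sum of m uniform random numbers.
function work(m)
    s = 0.0
    for _ in 1:m
        s += rand()
    end
    return s
end

function run_all(n)
    t = nthreads()                  # use every available thread
    counts = split_counts(n, t)
    tasks = map(counts) do m
        @spawn work(m)
    end
    total = sum(fetch(task) for task in tasks)
    return total / n                # divide by the number of draws actually made
end
```

Dividing the combined total by n, rather than averaging per-task averages, keeps the result correct even when the chunks differ in size.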

EDIT on Nov 20 2020

I implemented Przemyslaw Szufel's recommendations. Here is the new code:

using Random
using Printf
import Base.Threads.@spawn
using BenchmarkTools

function pickone(dist, mt)
    n = length(dist)
    i = 1
    r = rand(mt)
    while r >= dist[i] && i<n 
        i+=1
    end
    return i
end  

function init_items(type_dist, unique_elements)
    return zeros(Int32, length(type_dist), maximum(unique_elements))
end

function draw(type_dist, unique_elements_dist, mt)
    item_type = pickone(type_dist, mt)
    item_number = pickone(unique_elements_dist[item_type], mt)
    return item_type, item_number
end

function draw_unique(type_dist, unique_elements_dist, items, x, mt)
    while sum(items .> 0) < x
        item_type, item_number = draw(type_dist, unique_elements_dist, mt)
        items[item_type, item_number] += 1
    end
    return sum(items)
end

function average_for_unique(type_dist, unique_elements_dist, x, n, mt, reset=true)
    println(@sprintf("Started average_for_unique on thread %d with n = %d", Threads.threadid(), n))
    items = init_items(type_dist, unique_elements)

    tot_draws = 0
    for i in 1:n
        tot_draws += draw_unique(type_dist, unique_elements_dist, items, x, mt)
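        if reset
            items .= 0                  # start the next simulation from an empty collection
        else
            items[items .> 1] .-= 1     # broadcasted in-place decrement (plain -= errors on a vector)
        end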
    end

    println(@sprintf("Completed average_for_unique on thread %d with n = %d", Threads.threadid(), n))
    return tot_draws / n
end

function parallel_average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
    println("Started computing...")
    t = max(Threads.nthreads() - 1, 1)
    mts = MersenneTwister.(1:t)
    m = Int32(round(n / t))
    tasks = Array{Task}(undef, t)
    @sync for i in 1:t
        task = @spawn average_for_unique(type_dist, unique_elements_dist, x, m, mts[i], reset)
        tasks[i] = task
    end
    sum(fetch(t) for t in tasks) / t
end
    
type_dist = [0.3, 0.3, 0.2, 0.15, 0.05]
const cum_type_dist = cumsum(type_dist)

unique_elements = [21, 27, 32, 14, 10]
unique_elements_dist = [[1 / unique_elements[j] for i in 1:unique_elements[j]] for j in 1:length(unique_elements)]
const cum_unique_elements_dist = [cumsum(dist) for dist in unique_elements_dist]

str_n = ARGS[1]
n = parse(Int64, str_n)
avg = @btime parallel_average_for_unique(cum_type_dist, cum_unique_elements_dist, 50, n)
print(avg)

Updated benchmarks:

Threads          @btime     Linux Time       
1 (2 processes)  9.927 s    0m44.871s 
2 (1 process)   20.237 s    1m14.156s
3 (1 process)   14.302 s    1m2.114s

Answers

5 PrzemyslawSzufel Nov 20 2020 at 02:35

There are two problems here:

You are not measuring the performance correctly.
When generating random numbers in threads, you should have a separate MersenneTwister random state for each thread to get the best performance (otherwise the random state is shared across all threads and synchronization has to occur).
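
The second point can be sketched as follows. This is only an illustration under stated assumptions: mean_of_draws and parallel_mean are hypothetical names, the workload is a toy stand-in for the card simulation, and the seeds 1..ntasks are arbitrary.

```julia
using Random
using Base.Threads: @spawn, nthreads

# Toy workload: mean of m uniform draws taken from the given generator.
function mean_of_draws(rng::AbstractRNG, m::Integer)
    s = 0.0
    for _ in 1:m
        s += rand(rng)
    end
    return s / m
end

function parallel_mean(n::Integer; ntasks::Integer = nthreads())
    # One independently seeded generator per task, so no task shares RNG state.
    rngs = [MersenneTwister(seed) for seed in 1:ntasks]
    m = div(n, ntasks)
    tasks = map(rngs) do rng
        @spawn mean_of_draws(rng, m)
    end
    return sum(fetch.(tasks)) / ntasks
end
```

Because each task owns its generator and the seeds are fixed, repeated calls with the same arguments produce the same result regardless of how the tasks are scheduled.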

Currently you are measuring "Julia start-up time" + "code compilation time" + "run time". Compiling multi-threaded code obviously takes longer than compiling single-threaded code, and starting Julia itself also takes a second or two.
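
One way to see the compilation component from inside a session is to time the same call twice with @elapsed from Base. This is a rough sketch; busy is a hypothetical toy function standing in for the simulation:

```julia
# Toy function standing in for the simulation.
function busy(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

first_call = @elapsed busy(10^6)    # includes JIT compilation of `busy`
second_call = @elapsed busy(10^6)   # code is already compiled, so mostly run time

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The first call is typically noticeably slower than the second. The shell's time command adds Julia's start-up time on top of both.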

You have two options here. The easiest is to use the @btime macro from BenchmarkTools to measure the execution time inside the code. The other option would be to turn your code into a package and compile it into a Julia image via PackageCompiler. In that case, however, you will still be measuring "Julia start-up time" + "Julia execution time".