ClickHouse: CREATE DATABASE ON CLUSTER ends with a timeout

Nov 22 2020

I have a cluster of two ClickHouse nodes. Both instances run in Docker containers. All communication between the hosts has been verified successfully: ping, telnet, and wget work fine. In ZooKeeper I can see my queries queued under the ddl branch.

Every execution of a "create database ... on cluster" statement ends with a timeout. What is the problem? Does anyone have any ideas?

Here are fragments of the config file.

ClickHouse version 20.10.3.30

<remote_servers>
    <history_cluster>
        <shard>
            <replica>
                <host>10.3.194.104</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>10.3.194.105</host>
                <port>9000</port>
            </replica>
        </shard>
    </history_cluster>
</remote_servers>
<zookeeper>
    <node index="1">
        <host>10.3.194.106</host>
        <port>2181</port>
    </node>
</zookeeper>

The "macros" section

    <macros incl="macros" optional="true" />

The log fragment

2020.11.20 22:38:44.104001 [ 90 ] {68062325-a6cf-4ac3-a355-c2159c66ae8b} <Error> executeQuery: Code: 159, e.displayText() = DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000000013 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 2 unfinished hosts (0 of them are currently active), they are going to execute the query in background (version 20.10.3.30 (official build)) (from 172.17.0.1:51272) (in query: create database event_history on cluster history_cluster;), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, long&, unsigned long&, unsigned long&) @ 0xd8dcc75 in /usr/bin/clickhouse
1. DB::DDLQueryStatusInputStream::readImpl() @ 0xd8dc84d in /usr/bin/clickhouse
2. DB::IBlockInputStream::read() @ 0xd71b1a5 in /usr/bin/clickhouse
3. DB::AsynchronousBlockInputStream::calculate() @ 0xd71761d in /usr/bin/clickhouse
4. ? @ 0xd717db8 in /usr/bin/clickhouse
5. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x7b8c17d in /usr/bin/clickhouse
6. std::__1::__function::__func<ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'(), std::__1::allocator<ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'()>, void ()>::operator()() @ 0x7b8e67a in /usr/bin/clickhouse
7. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x7b8963d in /usr/bin/clickhouse
8. ? @ 0x7b8d153 in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

Answers

2 DennyCrane Nov 22 2020 at 03:50

The most likely problem is the Docker-internal IPs / hostnames of the nodes.

The initiator node (the one where the "on cluster" query is executed) puts a task into ZK for 10.3.194.104 and 10.3.194.105. All nodes constantly check the task queue and pull their own tasks. If a node's IP / hostname is 127.0.0.1 / localhost, it never finds its tasks, because 10.3.194.104 != 127.0.0.1.
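
To make the mismatch concrete, here is a rough Python sketch of the matching step described above. It is a simplified model, not ClickHouse's actual code: the function name find_own_tasks and the container address 172.17.0.2 are hypothetical and chosen only for illustration. The two task hosts come from the remote_servers config in the question.

```python
def find_own_tasks(task_hosts, local_addresses):
    """Return the task hosts that this node recognises as itself.

    Simplified model: a node picks up a task only if the host written
    into the task is one of the addresses the node knows as its own.
    """
    return [host for host in task_hosts if host in local_addresses]


# Hosts the initiator writes into the ZooKeeper task (from remote_servers).
task_hosts = ["10.3.194.104", "10.3.194.105"]

# Hypothetical view from inside a container on the default bridge network:
# only loopback and a Docker-internal address are visible.
container_view = {"127.0.0.1", "localhost", "172.17.0.2"}

# Hypothetical view when the node sees the external address as its own
# (for example with host networking).
host_view = {"127.0.0.1", "localhost", "10.3.194.104"}

print(find_own_tasks(task_hosts, container_view))  # []
print(find_own_tasks(task_hosts, host_view))       # ['10.3.194.104']
```

In the first case no node claims the task, which would be consistent with the "2 unfinished hosts (0 of them are currently active)" message in the log.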