純粋なPythonの遅いハフマンコード

Aug 22 2020

私は、テキストの単純なハフマンコード圧縮の高速実装の作成に取り組んでいました。標準ライブラリだけを使って書くという発想でしたが、もっと速くする方法が見つからないようです。また、速度を犠牲にすることなく、より「Pythonic」で書く方法についてのアドバイスも探しています。

スピードが必要な場合はPythonを使用すべきではないことは承知していますが、純粋なPythonのパフォーマンスをテストするための演習として取り上げました。

from collections import Counter, defaultdict

def huffman_compress(input_file, output_file, encoding='utf8'):
    """This functions compresses a txt file using Huffman code compression."""
    
    # Store the text in memory since it is faster than reading twice
    text = open(input_file, "r", encoding=encoding).read()
    
    # Count the times each letter appears on the text
    letter_freq = Counter(text)
    alphabet = defaultdict(str)
    
    # Obtain the huffman code for each letter
    while len(letter_freq) > 1:
        (letter1, count1), (letter2, count2) = letter_freq.most_common(2)
        letter_freq[letter1+letter2] = count1 + count2
        for bit, combination in enumerate([letter1, letter2]):
            for letter in combination:
                alphabet[letter] = str(bit) + alphabet[letter]
            del letter_freq[combination]
    
    # Save the transformation to ascii for possible the 256 characters
    bit_to_ascii = {format(x, '08b'): chr(x) for x in range(256)}
    
    with open(output_file, 'w') as output:
        # Transform each letter to its huffman code
        me = ''.join(alphabet[ch] for ch in text)
        
        # Add 0's so that the string is multiple of 8
        extra_bits = 8 - len(me) % 8
        me +=  extra_bits * '0'
        
        # Write the number of letters compressed and the number of bits added
        output.write(f'{chr(len(alphabet))}{extra_bits}')
        
        # Write the letters compressed and their huffman code for the decompression
        output.write('|'.join(c for item in alphabet.items() for c in item))
        
        # Transform the huffman bits to ascii and save them on the compressed file.
        output.write(''.join(bit_to_ascii[me[j:j+8]] for j in range(0, len(me), 8)))

回答

8 FMc Aug 25 2020 at 05:08

私はあなたのコードから始めsys.argv、コマンドラインでファイルパスを渡し、大きなテキストファイル（もちろんWar and Peace）をダウンロードし、プログラムを実行し、ファイルサイズをチェックできるように追加しました。

$ curl 'https://www.gutenberg.org/files/2600/2600-0.txt' -o war-peace.txt -k $ time python huffman.py war-peace.txt encoded

real    0m11.052s
user    0m10.462s
sys 0m0.389s

$ ls -lh
-rw-r--r-- 1 fmc staff  40M Aug 24 13:51 encoded
-rw-r--r-- 1 fmc staff 3.3M Aug 24 13:50 war-peace.txt

誤って拡張アルゴリズムを発明したようです。約12倍の大きさのファイルが作成されます。また、わずか40Mのテキストを処理するのに11秒は遅いようです。通常、Pythonはそのサイズのデータをはるかに迅速に処理できます。

一時的に短い文字列（huffman）をtext変数に割り当て、ファイルの読み取りをバイパスして、いくつかの中間変数を出力しました。見た目letter_freqは良かったのですalphabetが、私たちが望んでいたものとは逆でした。

f 00000     # The most frequent letter has the longest code.
h 00001
u 0001
m 001
a 01
n 1

ハフマンアルゴリズムは、最も一般的でない頻度で2つの要素を組み合わせますが、逆のことをしています。だから私はあなたのコードをこのように微調整しました：

(letter1, count1), (letter2, count2) = letter_freq.most_common()[:-3:-1]

その変更によりalphabet、少なくとももっともらしく見え、出力ファイルは入力ファイルよりも小さくなり（ただし、私が期待するほどではないため、コードに他の問題がある可能性があります）、約1秒で終了します。 11よりも大きい（おそらく、はるかに小さい出力ファイルを書き込んでいるため）。

いくつかの提案：

最初に正当性に焦点を合わせます。後で速度について心配する-そしてそれが本当に重要である場合にのみ（そして他の理由がなければその教育的であるかもしれない）。
アルゴリズムと副作用は混ざりません。コードを再編成して、テストとデバッグを容易にします。huffman_compress()関数自体は、ファイルの読み書きと自分自身を懸念べきではありません。テキストのブロブを取り、バイトのブロブ、ピリオドを返す必要があります。（ハフマンのように）高度にアルゴリズム化されたコードには、決して副作用がないはずです。それは純粋関数の領域に住むべきです。
データをラウンドトリップします。また、huffman_expand()関数を記述します。バイトを取り、テキストを返します。それがなければ、プロセスに自信を持つことはできません。特に、次のことができるようにする必要がありますassert original_text == huffman_expand(huffman_compress(original_text))。これは、ハフマンを正しく実装したことを証明するものではありませんが（おそらく、独自の特別なエンコードスキームを発明することになるでしょう。これはクールかもしれません）、少なくともロスレスラウンドトリップを実行できることは証明されます。

2 superbrain Aug 25 2020 at 14:49

可能な256文字のASCIIへの変換を保存します

ASCIIには256文字がありません。128あります。

また、デフォルトのエンコーディングであるUTF-8を使用して書き込むため、理由もなく、256文字の非ASCII半分を2バイトとして書き込み、ファイルのサイズを約1.5倍にします。