Why does the textreuse package in R make the LSH buckets much larger than the original minhashes?

Aug 16 2020

As far as I understand it, one of the main purposes of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question remains.

Say I use 256 permutations and 64 bands for my minhash and LSH functions, respectively - realistic values that are often used to detect, with relative certainty (~98%), similarities as low as 50%.

If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:

object.size(trtd$minhashes)
#> 1072 bytes
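
For reference, trtd was created along these lines (a sketch: the file name and seed are placeholders rather than the exact values I used):

library(textreuse)

# 256 minhash permutations, matching the setup above; the seed is arbitrary
minhash <- minhash_generator(n = 256, seed = 2020)

# "doc.txt" stands in for the random text file
trtd <- TextReuseTextDocument(file = "doc.txt",
                              minhash_func = minhash,
                              keep_tokens = TRUE)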

Now let's create the LSH buckets for this object (64 bands) and assign the result to l. I will have:

object.size(l$buckets)
#> 6704 bytes
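
Here l comes from a call along these lines (again a sketch of the step, applied to the single document; in a larger workflow lsh() would normally run over a whole TextReuseCorpus):

# 64 LSH bands over the 256 minhash signatures
l <- lsh(trtd, bands = 64)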

So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses an MD5 digest to create the bucket hashes.
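
The arithmetic roughly checks out against how R stores these types. As a back-of-the-envelope check on a 64-bit build (the vectors below are stand-ins for the real minhashes and buckets, not textreuse output):

# 256 minhashes kept as a plain integer vector:
# 48-byte vector header + 256 * 4 bytes = 1072 bytes
object.size(integer(256))

# 64 buckets kept as 32-character MD5 strings:
# (48 + 64 * 8)-byte character vector plus roughly 96 bytes per unique
# cached string, i.e. about 6704 bytes
object.size(sprintf("%032x", 1:64))  # 64 distinct 32-character hex strings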

But isn't this too wasteful / overkill, and can't I improve it? Is it normal for our data reduction technique to bloat the data to this extent? And wouldn't it be more effective to match documents on the original hashes (equivalent to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?

Note that I have reviewed the standard texts, such as Mining of Massive Datasets, but my question is about this particular implementation. Also note that the question is not just out of curiosity, but out of need: when you have millions or billions of hashes, these differences become significant.

Answer

1 LincolnMullen Aug 17 2020 at 03:24

Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)

The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliably detect potential matches that share only a partial overlap (i.e., with a Jaccard score closer to 0), then you need more hashes/bands.

Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected, while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.
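
For example, with the numbers from the question (256 hashes, 64 bands), one might check:

library(textreuse)

# Lowest Jaccard similarity this configuration is tuned to catch,
# i.e. (1/b)^(b/h); roughly 0.35 here
lsh_threshold(h = 256, b = 64)

# Probability that a pair with Jaccard similarity 0.5 shares at least one
# bucket, i.e. 1 - (1 - s^(h/b))^b; roughly 0.98 here
lsh_probability(h = 256, b = 64, s = 0.5)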