[Snapshot ch27 Bryan O'Sullivan **20080619231018] { adddir ./examples/ch27/examples addfile ./examples/ch27/examples/BloomCheck.hs addfile ./examples/ch27/examples/bloomCheck.ghci hunk ./en/ch27-advanced-haskell.xml 420 - elements in the input list. + elements in the input list. Notice that we use + genericLength to handle lists that are + longer than will fit in an Int. hunk ./en/ch27-advanced-haskell.xml 424 - This function will of course not always be usable, but it - complements the other interfaces we provide. It lets us provide - a range of control over creation, from entirely imperative to - completely declarative. + This function will of course not always be usable: for + example, it will fail if the length of the input list is too + long. However, its simplicity rounds out the other interfaces we + provide. It lets us provide our users with a range of control + over creation, from entirely imperative to completely + declarative. + + + Re-exporting names for convenience + + In the export list for our module, we re-export some names + from the base BloomFilter module. This allows + casual users to import only the BloomFilter.Easy + module, and have access to all of the types and functions they + are likely to need. + + If we import both BloomFilter.Easy and + BloomFilter, you might wonder what will happen if + we try to use a name exported by both. We already know that + if we import BloomFilter unqualified and try to + use length, &GHC; will issue an error + about ambiguity, because the Prelude also makes the name + length available. + + The Haskell standard requires an implementation to be able + to tell when several names refer to the same + thing. For instance, the Bloom + type is exported by BloomFilter and + BloomFilter.Easy. If we import both modules and + try to use Bloom, &GHC; will be able to see that + the Bloom re-exported from + BloomFilter.Easy is the same as the one exported + from BloomFilter, and it will not report an + ambiguity. + hunk ./en/ch27-advanced-haskell.xml 481 - url="http://burtleburtle.net/bob/c/lookup3.c">lookup3.c. + url="http://burtleburtle.net/bob/c/lookup3.c">lookup3.c. + We create a cbits directory and download + it to there. + + + A little editing + + On line 36 of the copy of lookup3.c + that you just downloaded, there is a macro named + SELF_TEST defined. To use this source file as + a library, you must delete this line or + comment it out. If you forget to do so, the + main function defined near the bottom + of the file will supersede the main of + any Haskell program you link this library against. + hunk ./en/ch27-advanced-haskell.xml 506 - these two functions. + these two functions. We save this to + cbits/lookup3.h. hunk ./en/ch27-advanced-haskell.xml 509 - &lookup3.h:header; + &lookup3.h:header; hunk ./en/ch27-advanced-haskell.xml 517 - Here are Haskell bindings to these functions. + Here are our Haskell bindings to these functions. hunk ./en/ch27-advanced-haskell.xml 521 - For efficiency, we will combine the two 32-bit salts and - hashes into a single 64-bit value. + We have specified that the definitions of the functions + can be found in the lookup3.h header file + that we just created. + + For convenience and efficiency, we will combine the 32-bit + salts consumed, and the hash values computed, by the Jenkins + hash functions into a single 64-bit value. hunk ./en/ch27-advanced-haskell.xml 558 - To hash basic types, we write a little bit of boilerplate - code. + We don't want clients of this module to be stuck fiddling + with low-level details, so we use a typeclass to provide a + clean, high-level interface. + + &Hash.hs:Storable; + + We also provide a number of useful implementations of this + typeclass. To hash basic types, we must write a little + boilerplate code. hunk ./en/ch27-advanced-haskell.xml 570 - We might prefer to be able to make use of the - Storable typeclass to write just one declaration, - as follows: + We might prefer to use the Storable typeclass + to write just one declaration, as follows: hunk ./en/ch27-advanced-haskell.xml 576 - instances of this form, as they make the type system - undecidable: they can cause the - compiler's type checker to loop infinitely. We are forced to - use a little boilerplate instead. The restriction on - undecidable types does not, however, pose a problem for a + instances of this form, as allowing them would make the type + system undecidable: they can cause the + compiler's type checker to loop infinitely. This restriction + on undecidable types forces us to write out individual + declarations. It does not, however, pose a problem for a hunk ./en/ch27-advanced-haskell.xml 585 - This instance lets us hash values of many list types. - Most importantly, since the Char type is an - instance of Storable, we have gained the ability - to hash values of type String. + The compiler will accept this instance, so we gain the + ability to hash values of many list types + Unfortunately, we do not have room to explain why one + of these instances is decidable, but the other is + not. + . Most importantly, since Char is an + instance of Storable, we can now hash + String values. hunk ./en/ch27-advanced-haskell.xml 595 - composition. + composition. We take a salt in at one end of the composition + pipeline, and use the result of hashing each tuple element as + the salt for the next element. hunk ./en/ch27-advanced-haskell.xml 601 - And for ByteString types, we write special - instances. + To hash ByteString types, we write special + instances that plug straight into the internals of the + ByteString types. This gives us excellent + hashing performance. hunk ./en/ch27-advanced-haskell.xml 608 + XXX May need to flesh ByteString details out if not + covered elsewhere. + hunk ./en/ch27-advanced-haskell.xml 645 - In this function, we attempt to minimise only the size of - the Bloom filter, without regard for the number of hashes. To - see why. Let us - interactively explore the relationship between filter size and + We perform some rather paranoid checking. For instance, + the sizings function suggests pairs of + array size and hash count, but it does not validate its + suggestions. Since we use 32-bit hashes, we must filter out + suggested array sizes that are too large. + + In our suggestSizing function, we + attempt to minimise only the size of the bit array, without + regard for the number of hashes. To see why, let us + interactively explore the relationship between array size and hunk ./en/ch27-advanced-haskell.xml 657 - Suppose we want - to insert 10 million elements into the Bloom filter with a - false positive rate of 0.1%. + Suppose we want to insert 10 million elements into a Bloom + filter, with a false positive rate of 0.1%. hunk ./en/ch27-advanced-haskell.xml 664 - storage, we can reduce the number of hashes to 7. If we were + space, we can reduce the number of hashes to 7. If we were hunk ./en/ch27-advanced-haskell.xml 807 - we can manually specify the values of flags, though we will - rarely need to do so in practice. + we can manually specify the values of flags via the + option, though we will rarely need to + do so in practice. hunk ./en/ch27-advanced-haskell.xml 811 + + + Compilation options, and interfacing to C + + Continuing with our .cabal file, we + fill out the remaining details of the Haskell side of our + library. + + &rwh-bloomfilter.cabal:modules; + + The Other-Modules property lists Haskell + modules that are private to the library. Such modules will be + invisible to code that uses this package. + + When we build this package with &GHC;, Cabal will pass the + options from the GHC-Options property to the + compiler. + + The option makes &GHC; optimise our + code aggressively. Code compiled without optimisation is very + slow, so we should always use for + production code. + + To help ourselves to write cleaner code, we usually add + the option, which enables all of + &GHC;'s warnings. This will cause &GHC; to issue complaints + if it encounters potential problems, such as overlapping + patterns; function parameters that are not used; and a myriad + of other potential stumbling blocks. While it is often safe to + ignore these warnings, we generally prefer to fix up our code + to eliminate them. The small added effort usually yields code + that is easier to read and maintain. + + When we compile with , &GHC; will + generate C code and use the system's C compiler to compile it, + instead of going straight to assembly language as it usually + does. This slows compilation down, but sometimes the C + compiler can further improve &GHC;'s optimised code, so it can + be worthwhile. + + We include here mainly to show + how to make compilation with it work. + + &rwh-bloomfilter.cabal:cbits; + + For the C-Sources property, we only need to + list files that must be compiled into our library. The + CC-Options property contains options for the C + compiler ( specifies a high level of + optimisation). Because our FFI bindings for the Jenkins hash + functions refer to the lookup3.h header + file, we need to tell Cabal where to find the header + file. + + + The value of -fvia-C with the FFI + + Compiling with has a + substantial benefit when we write FFI bindings. If we + mention a header file in an FFI declaration (e.g. + foreign import "string.h memcpy"), the C + compiler will typecheck the generated Haskell code and + ensure that its invocation of the C function is consistent + with the C function's prototype in the header file. + + If we do not use , we lose that + additional layer of safety. This makes it easy to let + simple C type errors slip into our Haskell code. As an + example, on most 64-bit machines, a CInt is 32 + bits wide, and a CSize is 64 bits wide. If we + accidentally use one type to describe a parameter for an FFI + binding when we should use the other, we are likely to cause + data corruption or a crash. + + + + + + Testing with QuickCheck + + Before we pay any attention to performance, we want to + establish that our Bloom filter behaves correctly. We can + easily use QuickCheck to test some basic properties. + + &BloomCheck.hs:BloomCheck; + + We will not use the normal quickCheck + function to test our properties, as the 100 test inputs that it + generates do not provide much coverage. + + &BloomCheck.hs:handyCheck; + + Our first task is to ensure that if we add a value to a + Bloom filter, a subsequent membership test will always report it + as present, no matter what the chosen false positive rate or + input value is. + + We will use the easyList function to + create a Bloom filter. The Random instance for + Double generates numbers in the range zero to one, + so QuickCheck can supply us with arbitrary false positive + rates. + + &BloomCheck.hs:prop_one_present; + + Our small combinator, (=~>), lets us + filter out failures of easyList: if it + fails, the test automatically passes. + + + Polymorphic testing + + QuickCheck requires properties to be + monomorphic. Since we have many + different hashable types that we would like to test, we would + very much like to avoid having to write the same test in many + different ways. + + Notice that although our + prop_one_present function is polymorphic, + it ignores its first argument. We use this to simulate + monomorphic properties, as follows. + + &bloomCheck.ghci:monomorphic; + + We can supply any value as the first argument to + prop_one_present. All that matters is + its type, as the same type will be used + for the first element of the second argument. + + &bloomCheck.ghci:one; + + If we populate a Bloom filter with many elements, they + should all be present afterwards. + + &BloomCheck.hs:prop_all_present; + + This test also succeeds. + + &bloomCheck.ghci:all; + + + + Writing Arbitrary instances for ByteStrings + + The QuickCheck library does not provide + Arbitrary instances for ByteString + types, so we must write our own. Rather than create a + ByteString directly, we will use a + pack function to create one from a + [Word8]. + + &BloomCheck.hs:ByteString; + + Also missing from QuickCheck are Arbitrary + instances for the fixed-width types defined in + Data.Word and Data.Int. We need to + at least create an Arbitrary instance for + Word8. + + &BloomCheck.hs:Word8; + + We support these instances with a few common functions so + that we can reuse them when writing instances for other + integral types. + + &BloomCheck.hs:Word32; + + With these Arbitrary instances created, we + can try our existing properties on the ByteString + types. + + &bloomCheck.ghci:bs; + + + + + Exercises + + + + + Our use ofgenericLength in + easyList will cause our function to + loop infinitely if we supply an infinite list. Fix + this. + + + + + Difficult. Write a QuickCheck + property that checks whether the observed false positive + rate is close to the requested false positive rate. + + + hunk ./examples/ch27/BloomFilter/Easy.hs 7 + + -- re-export useful names from BloomFilter + , B.Bloom + , B.length + , B.elem + , B.notElem hunk ./examples/ch27/BloomFilter/Easy.hs 16 -import BloomFilter (Bloom, fromList) +import Data.List (genericLength) +import Data.Maybe (catMaybes) hunk ./examples/ch27/BloomFilter/Easy.hs 19 +import qualified BloomFilter as B hunk ./examples/ch27/BloomFilter/Easy.hs 22 - fromList (doubleHash numHashes) numBits values - where capacity = length values - (numBits, numHashes) = suggestSizing capacity errRate + case suggestSizing (genericLength values) errRate of + Left err -> Left err + Right (bits,hashes) -> Right filt + where filt = B.fromList (doubleHash hashes) bits values hunk ./examples/ch27/BloomFilter/Easy.hs 29 -suggestSizing :: Int -- expected maximum capacity - -> Double -- desired false positive rate - -> (Word32, Int) -- (filter size, number of hashes) +suggestSizing + :: Integer -- expected maximum capacity + -> Double -- desired false positive rate + -> Either String (Word32,Int) -- (filter size, number of hashes) hunk ./examples/ch27/BloomFilter/Easy.hs 34 - | capacity <= 0 = error "invalid capacity" - | errRate <= 0 || errRate >= 1 = error "invalid error rate" - | otherwise = minimum (sizings capacity errRate) + | capacity <= 0 = Left "capacity too small" + | errRate <= 0 || errRate >= 1 = Left "invalid error rate" + | null saneSizes = Left "capacity too large" + | otherwise = Right (minimum saneSizes) + where saneSizes = catMaybes . map sanitize $ sizings capacity errRate + sanitize (bits,hashes) + | bits > maxWord32 - 1 = Nothing + | otherwise = Just (ceiling bits, truncate hashes) + where maxWord32 = fromIntegral (maxBound :: Word32) hunk ./examples/ch27/BloomFilter/Easy.hs 44 -sizings :: Int -> Double -> [(Word32, Int)] +sizings :: Integer -> Double -> [(Double, Double)] hunk ./examples/ch27/BloomFilter/Easy.hs 46 - [(round ((-k) * cap / log (1 - (errRate ** (1 / k)))), round k) - | k <- [1..100]] + [(((-k) * cap / log (1 - (errRate ** (1 / k)))), k) | k <- [1..50]] hunk ./examples/ch27/BloomFilter/Easy.hs 54 - -> Bloom a + -> Either String (B.Bloom a) hunk ./examples/ch27/BloomFilter/Hash.hs 3 -module BloomFilter.Hash where +module BloomFilter.Hash + ( + Hashable(..) + , hash + , doubleHash + ) where hunk ./examples/ch27/BloomFilter/Hash.hs 29 +{-- snippet Hashable --} hunk ./examples/ch27/BloomFilter/Hash.hs 37 +{-- /snippet Hashable --} hunk ./examples/ch27/BloomFilter/Hash.hs 95 -hashSB :: Word64 -> Strict.ByteString -> IO Word64 -hashSB salt bs = Strict.useAsCStringLen bs $ \(ptr, len) -> - hashIO ptr (fromIntegral len) salt +hashByteString :: Word64 -> Strict.ByteString -> IO Word64 +hashByteString salt bs = Strict.useAsCStringLen bs $ \(ptr, len) -> + hashIO ptr (fromIntegral len) salt hunk ./examples/ch27/BloomFilter/Hash.hs 100 - hashSalt salt bs = unsafePerformIO $ hashSB salt bs + hashSalt salt bs = unsafePerformIO $ hashByteString salt bs hunk ./examples/ch27/BloomFilter/Hash.hs 104 - foldM hashSB salt (Lazy.toChunks bs) + foldM hashByteString salt (Lazy.toChunks bs) hunk ./examples/ch27/cbits/lookup3.c 36 -#define SELF_TEST 1 hunk ./examples/ch27/examples/BloomCheck.hs 1 +{-- snippet BloomCheck --} +{-# LANGUAGE GeneralizedNewtypeDeriving #-} +module Main where + +import BloomFilter.Hash (Hashable) +import Data.Word (Word8, Word32) +import System.Random (Random(..), RandomGen) +import Test.QuickCheck +import qualified BloomFilter.Easy as B +import qualified Data.ByteString as Strict +import qualified Data.ByteString.Lazy as Lazy +{-- /snippet BloomCheck --} + +{-- snippet handyCheck --} +handyCheck :: Testable a => Int -> a -> IO () +handyCheck limit = check defaultConfig { + configMaxTest = limit + , configEvery = \_ _ -> "" + } +{-- /snippet handyCheck --} + +{-- snippet Word8 --} +instance Random Word8 where + randomR = integralRandomR + random = randomR (minBound, maxBound) + +instance Arbitrary Word8 where + arbitrary = choose (minBound, maxBound) + coarbitrary = integralCoarbitrary +{-- /snippet Word8 --} + +{-- snippet Word32 --} +integralCoarbitrary n = + variant $ if m >= 0 then 2*m else 2*(-m) + 1 + where m = fromIntegral n + +integralRandomR (a,b) g = case randomR (c,d) g of + (x,h) -> (fromIntegral x, h) + where (c,d) = (fromIntegral a :: Integer, + fromIntegral b :: Integer) + +instance Random Word32 where + randomR = integralRandomR + random = randomR (minBound, maxBound) + +instance Arbitrary Word32 where + arbitrary = choose (minBound, maxBound) + coarbitrary = integralCoarbitrary +{-- /snippet Word32 --} + +{-- snippet ByteString --} +instance Arbitrary Lazy.ByteString where + arbitrary = Lazy.pack `fmap` arbitrary + coarbitrary = coarbitrary . Lazy.unpack + +instance Arbitrary Strict.ByteString where + arbitrary = Strict.pack `fmap` arbitrary + coarbitrary = coarbitrary . Strict.unpack +{-- /snippet ByteString --} + +{-- snippet prop_one_present --} +(=~>) :: Either a b -> (b -> Bool) -> Bool +k =~> f = either (const True) f k + +prop_one_present :: (Hashable a) => a -> (a, Double) -> Bool +prop_one_present _ (elt,errRate) = + B.easyList errRate [elt] =~> \filt -> + elt `B.elem` filt +{-- /snippet prop_one_present --} + +{-- snippet prop_all_present --} +prop_all_present :: (Hashable a) => a -> ([a], Double) -> Bool +prop_all_present _ (xs,errRate) = B.easyList errRate xs =~> \filt -> + all (`B.elem` filt) xs +{-- /snippet prop_all_present --} + +prop_suggestions_sane errRate = + forAll (choose (1,fromIntegral maxWord32 `div` 8)) $ \cap -> + (fst . minimum $ B.sizings cap errRate) < fromIntegral maxWord32 ==> + either (const False) sane $ B.suggestSizing cap errRate + where sane (bits,hashes) = bits > 0 && bits < maxBound && hashes > 0 + maxWord32 = maxBound :: Word32 hunk ./examples/ch27/examples/bloomCheck.ghci 1 +--# monomorphic +:load BloomCheck +:t prop_one_present +:t prop_one_present (undefined :: Int) + +--# one +handyCheck 5000 $ prop_one_present (undefined :: Int) +handyCheck 5000 $ prop_one_present (undefined :: Double) + +--# all +handyCheck 2000 $ prop_all_present (undefined :: Int) + +--# bs +handyCheck 1000 $ prop_one_present (undefined :: Lazy.ByteString) +handyCheck 1000 $ prop_all_present (undefined :: Strict.ByteString) hunk ./examples/ch27/rwh-bloomfilter.cabal 42 +-- snippet modules hunk ./examples/ch27/rwh-bloomfilter.cabal 48 + GHC-Options: -O2 -Wall -fvia-C +-- /snippet modules + +-- snippet modules + GHC-Prof-Options: -auto-all +-- /snippet modules + +-- snippet cbits hunk ./examples/ch27/rwh-bloomfilter.cabal 57 - GHC-Options: -O2 -Wall -fliberate-case-threshold=1000 hunk ./examples/ch27/rwh-bloomfilter.cabal 58 + Include-Dirs: cbits + Includes: lookup3.h +-- /snippet cbits hunk ./examples/ch27/sizings.ghci 6 -let kbytes (numBits,numHashes) = (numBits `div` 8192, numHashes) +let kbytes (bits,hashes) = (ceiling bits `div` 8192, hashes) }