[Complete technical review of ch26
Don Stewart **20080828083522] {
hunk ./en/ch25-profiling.xml 4
- Profiling and optimisation
+ Profiling and optimization
hunk ./en/ch25-profiling.xml 22
-of using lazy or strict evaluation strategies, and techniques for analysing and
-controlling space and time behaviour.
+of using lazy or strict evaluation strategies, and techniques for analyzing and
+controlling space and time behavior.
hunk ./en/ch25-profiling.xml 29
-address them. To do this we'll use time and space profiling, and collect
-runtime statistics, as well as reasoning about how strict or lazy evaluation
-strategies affect behaviour. We'll also look at the impact of compiler
-optimisations on performance, and the use of advanced optimisation techniques
-that become feasible in a purely functional language. So let's begin with a
-classic challenge: squashing excessive memory usage in some inoccuous looking
-code.
+address them. To do this we'll investigate a range of techniques: time
+and space profiling, runtime statistics, and reasoning about strict and lazy
+evaluation. We'll also look at the impact of compiler optimizations on
+performance, and the use of advanced optimization techniques that become
+feasible in a purely functional language. So let's begin with a challenge:
+squashing unexpected memory usage in some innocuous looking code.
hunk ./en/ch25-profiling.xml 49
-spots that can catch out the unwary:
+spots that can catch out the unwary.
hunk ./en/ch25-profiling.xml 60
-compile this source to native code (with optimisations on) and run it with
-the standard time command to see how it performs:
+compile this source to native code (with optimizations on) and run it with
+the time command to see how it performs:
hunk ./en/ch25-profiling.xml 84
-right, but it's unclear what resources are being used. Let's investigate, and
-see what we can find out about this program's behaviour.
+right, but it's unclear what resources are being used. Let's investigate.
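For context: the innocuous program these introductory hunks refer to is a naive mean over a large list of Doubles. A sketch, reconstructed from the surrounding discussion (the actual listing lives in the unchanged XML):

```haskell
-- Sketch of the kind of innocuous program under discussion: a naive mean.
-- 'sum xs' forces the whole list into memory, and the list must then be
-- retained until 'length xs' can traverse it a second time -- the root of
-- the excessive memory usage the chapter goes on to diagnose.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
```

Compiled with `ghc -O2` and applied to a list of hundreds of millions of elements, a program like this spends nearly all its time in garbage collection, which is exactly what the runtime statistics in the following hunks reveal.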
hunk ./en/ch25-profiling.xml 94
-The application itself won't see those flags as they're immediately consumed
+The application itself won't see those flags, as they're immediately consumed
hunk ./en/ch25-profiling.xml 101
-controlling the number of OS threads with -N, or tweaking the
-stack and heaps sizes). We'll also use runtime flags to enabling different
-varieties of profiling. The complete set of runtime flags the Haskell
+control the number of OS threads with -N, or tweak the
+stack and heap sizes). We'll also use runtime flags to enable different
+varieties of profiling. The complete set of flags the Haskell
hunk ./en/ch25-profiling.xml 111
-yielding this result:
+yielding this result.
hunk ./en/ch25-profiling.xml 145
-spent in garbage collection, and what the maximum live memory in use was. It
+spent in garbage collection, and what the maximum live memory usage was. It
hunk ./en/ch25-profiling.xml 147
-program used a maximum of 742 megabytes to the heap, and spent 96.9% of its time
-doing garbage collection! In total, only 3.1% of the program's runtime was
+program used a maximum of 742 megabytes on the heap, and spent 96.9% of its time
+doing garbage collection! In total, only 3.1% of the program's running time was
hunk ./en/ch25-profiling.xml 164
-GHC, thankfully, comes with several tools to analysing a program's time and
+GHC, thankfully, comes with several tools to analyze a program's time and
hunk ./en/ch25-profiling.xml 166
-which when run, yields useful information about what resources each function
+which, when run, yields useful information about what resources each function
hunk ./en/ch25-profiling.xml 176
-cost center is a location in the program we'd like to collect statistics about,
-and GHC will generate code to compute the cost of evaluation the expression
-at each location. Cost centres can be added manually to any expression, using
-the SCC pragma:
+cost centre is a location in the program we'd like to collect statistics about,
+and GHC will generate code to compute the cost of evaluating the expression
+at each location. Cost centres can be added manually to instrument any
+expression, using the SCC pragma:
hunk ./en/ch25-profiling.xml 203
--no-recomp flag to to force full recompilation):
+-fforce-recomp flag to force full recompilation):
hunk ./en/ch25-profiling.xml 207
-$ ghc -O2 --make A.hs -prof -auto-all -caf-all -no-recomp
+$ ghc -O2 --make A.hs -prof -auto-all -caf-all -fforce-recomp
hunk ./en/ch25-profiling.xml 215
-additional profiling overheads):
+additional profiling overhead):
hunk ./en/ch25-profiling.xml 228
-optimised, possibly changing its runtime behaviour, as each expression now
+optimized, possibly changing its runtime behavior, as each expression now
hunk ./en/ch25-profiling.xml 231
-case, it is simple to proceed, we use the GHC runtime flag, -K,
-to set an arbitrarily large stack limit for our program:
+case, it is simple to proceed -- we use the GHC runtime flag, -K,
+to set a larger stack limit for our program (with the usual suffixes
+to indicate magnitude):
hunk ./en/ch25-profiling.xml 243
-A.prof, named after the binary that was executed, which
+A.prof (named after the binary that was executed) which
hunk ./en/ch25-profiling.xml 277
-This gives us a view into the program's runtime behaviour. We can see the
+This gives us a view into the program's runtime behavior. We can see the
hunk ./en/ch25-profiling.xml 281
-the maximum live memory, which was around 700M).
+the maximum live memory, which was around 700MB).
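The SCC pragma referred to in hunk 176 attaches named cost centres to expressions by hand. A minimal sketch (the centre names here are assumed, not the book's):

```haskell
-- Hypothetical manually placed cost centres. When built with profiling,
-- GHC attributes the cost of evaluating each annotated expression to the
-- named centre in the time and allocation report.
mean :: [Double] -> Double
mean xs = {-# SCC "mean" #-} s / fromIntegral n
  where
    s = {-# SCC "sum_part" #-} sum xs
    n = {-# SCC "len_part" #-} length xs
```

Built with `ghc -O2 --make A.hs -prof -auto-all -caf-all -fforce-recomp`, these centres appear in the resulting A.prof; in a non-profiling build the pragmas are accepted but have no effect.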
hunk ./en/ch25-profiling.xml 291
-Additionally, we see the one-off costs of evaluating constants, (such as the
-floating point values in the large list, and the list itself), assigned to top
+Additionally, we see the one-off costs of evaluating constants (such as the
+floating point values in the large list, and the list itself) assigned to top
hunk ./en/ch25-profiling.xml 299
-another for floating point numbers. These alone accounts for nearly all
-allocations that occured during the program run. Combined with our earlier
+another for floating point numbers. These alone account for nearly all
+allocations that occurred during the program run. Combined with our earlier
hunk ./en/ch25-profiling.xml 308
-profile can narrow down to a particular problematic module and top level
+profile can highlight a particular problematic module and top level
hunk ./en/ch25-profiling.xml 322
-revealing "space leaks", where memory is retained unneccessarily, leading to
+revealing "space leaks", where memory is retained unnecessarily, leading to
hunk ./en/ch25-profiling.xml 332
-with their own insights. Heap profiling produces a log to a file,
+with its own insights. Heap profiling A.hs logs to a file
hunk ./en/ch25-profiling.xml 335
-visualisation of the heap over time.
+visualization of the heap over time.
hunk ./en/ch25-profiling.xml 339
-To extract a standard heap profile from our program, we compile run it with
+To extract a standard heap profile from our program, we run it with
hunk ./en/ch25-profiling.xml 372
-increase the frequency of samples by using -iN, where N is the
-fraction of a second to sample at. Obviously, the more we sample, the more
-accurate the results, but the slower our program will run. We can now render
-the heap profile as a graph, using the hp2ps tool:
+increase the heap sampling frequency by using -iN, where N is the
+number of seconds (e.g. 0.01) between heap size samples. Obviously, the more
+we sample, the more accurate the results, but the slower our program will
+run. We can now render the heap profile as a graph, using the
+hp2ps tool:
hunk ./en/ch25-profiling.xml 442
-type. Finally, let's break it down by what constructors are being allocated,
-using the -hd flag:
+type (represented as data of type "*"). Finally, let's break it down by what
+constructors are being allocated, using the -hd flag:
hunk ./en/ch25-profiling.xml 446
+
+
hunk ./en/ch25-profiling.xml 472
-A lot of work is going into allocating list nodes containing double precision
+A lot of work is going into allocating list nodes containing double-precision
hunk ./en/ch25-profiling.xml 476
-over half our way through the program run, the program finally finishes
+over halfway through the program run, the program finally finishes
hunk ./en/ch25-profiling.xml 479
-is being retained.
+is being retained:
hunk ./en/ch25-profiling.xml 513
-Now, instead of taking the sum of the list, and retaining the list till we can
-take its length, we left-fold over the list accumulating the intermediate sum
-and length values in a pair (and we must left fold, since a right fold would
-take us to the end of the list and work backwards, which is exactly not what
-we need in this case).
+Now, instead of taking the sum of the list, and retaining the list until we can
+take its length, we left-fold over the list, accumulating the intermediate sum
+and length values in a pair (and we must left-fold, since a right-fold would
+take us to the end of the list and work backwards, which is exactly what
+we're trying to avoid).
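The pair-accumulating left-fold described in hunk 513 might read as follows (a sketch; the book's actual listing is in the unchanged XML):

```haskell
-- One pass over the list, accumulating length and sum together, so the
-- list can be collected as it is consumed. As the next hunks explain,
-- plain foldl is still too lazy here: the pair fills up with unevaluated
-- thunks, producing the stack overflow diagnosed below.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (n, s)        = foldl step (0 :: Int, 0) xs
    step (l, t) x = (l + 1, t + x)
```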
hunk ./en/ch25-profiling.xml 524
-we run this, however, we get surprising results:
+we run this, however, we get a stack overflow:
hunk ./en/ch25-profiling.xml 527
-$ ghc -O2 --make B.hs -no-recomp
+$ ghc -O2 --make B.hs -fforce-recomp
hunk ./en/ch25-profiling.xml 542
-$ ghc -O2 --make B.hs -prof -auto-all -caf-all -no-recomp
+$ ghc -O2 --make B.hs -prof -auto-all -caf-all -fforce-recomp
hunk ./en/ch25-profiling.xml 577
-The problem is that our left fold, foldl is too lazy.
+The problem is that our left-fold, foldl, is too lazy.
hunk ./en/ch25-profiling.xml 579
-as a goto, where no state is left on the stack. In this case
+as a goto, with no state left on the stack. In this case
hunk ./en/ch25-profiling.xml 630
+
+
hunk ./en/ch25-profiling.xml 656
-If we run this, with allocation statistics enabled, we get the satisifying
+If we run this, with allocation statistics enabled, we get the satisfying
hunk ./en/ch25-profiling.xml 703
-the parallel strategies library, which unlike seq reduces to
-normal form (hence its name). Such a "deep seq" fold we can write as:
+the parallel strategies library (along with using), which unlike
+seq reduces to the fully evaluated "normal form" (hence its
+name). Such a "deep seq" fold we can write as:
hunk ./en/ch25-profiling.xml 710
+
+
hunk ./en/ch25-profiling.xml 728
-Perhaps the cheapest way, syntactically to add required strictness to code
-that's excessively lazy is via "bang patterns", a language extension
+Perhaps the cheapest way, syntactically, to add required strictness to code
+that's excessively lazy is via "bang patterns" (whose name comes from
+pronunciation of the "!" character as "bang"), a language extension
hunk ./en/ch25-profiling.xml 740
-We can now rewrite the loop state to be simply:
+Bang patterns are a language extension, and are enabled with the
+BangPatterns language pragma. We can now rewrite the loop state to be simply:
hunk ./en/ch25-profiling.xml 782
-In large projects, when we investigating memory allocation hot spots, bang
+In large projects, when we are investigating memory allocation hot spots, bang
hunk ./en/ch25-profiling.xml 795
-to the compiler. By default, Haskell data types are fully lazy, but it is
-easy enough to to add strictness information to the fields of a data type
+to the compiler. By default, Haskell data types are lazy, but it is
+easy enough to add strictness information to the fields of a data type
hunk ./en/ch25-profiling.xml 811
-This implementation again has the same efficient, constant space behaviour.
+This implementation again has the same efficient, constant space behavior.
hunk ./en/ch25-profiling.xml 828
-the compiler is done optimising it, particularly in the case of Haskell
+the compiler is done optimizing it, particularly in the case of Haskell
hunk ./en/ch25-profiling.xml 830
-uses "a simple functional language", known as Core, as the compiler
-intermediate representation. It is essentially a subset of Haskell, augmented
-with unboxed data types (these are raw machine types, directly corresponding
-to primitive data types in languages like C), suitable for code generation.
-GHC optimises Haskell by transformation, repeatedly rewriting the source into
-more and more efficient forms. The Core representation is the final version
-of your program, and the one that will be executed on the machine. In other
-words, Core has the final say, and if all out performance is your goal, it is
+uses what is humorously referred to as "a simple functional language", known as
+Core, as the compiler intermediate representation. It is essentially a subset
+of Haskell, augmented with unboxed data types (raw machine types,
+directly corresponding to primitive data types in languages like C), suitable
+for code generation. GHC optimizes Haskell by transformation, repeatedly
+rewriting the source into more and more efficient forms. The Core
+representation is the final functional version
+of your program, before translation to low level imperative code. In other
+words, Core has the final say, and if all-out performance is your goal, it is
hunk ./en/ch25-profiling.xml 843
-To view the Core our Haskell program we compile with the
--ddump-simpl flag, or use the ghc-core core tool, a
-third party utility that lets us view Core in a pager. So let's look at the
+To view the Core version of our Haskell program we compile with the
+-ddump-simpl flag, or use the ghc-core tool, a
+third-party utility that lets us view Core in a pager. So let's look at the
hunk ./en/ch25-profiling.xml 855
-A screenful of text is generated, which if we look carefully at, we'll see a
+A screenful of text is generated. If we look carefully, we'll see a
hunk ./en/ch25-profiling.xml 873
-about the next steps for optimisation. The fold itself has been entirely
+about the next steps for optimization. The fold itself has been entirely
hunk ./en/ch25-profiling.xml 887
-instead, we can replace Integer with it, and GHC will then be
-able to use a raw machine int for the length. We can hope for an
-improvement in time and space by ensuring both loop components are unboxed,
-and kept in registers.
+instead, we can replace Integer with it, via a type annotation,
+and GHC will then be able to use a raw machine Int# for the
+length. We can hope for an improvement in time and space by ensuring both
+loop components are unboxed, and kept in registers.
hunk ./en/ch25-profiling.xml 896
-sum. Notice that the return type is a a heap-allocated Double
+sum. Notice that the return type is a heap-allocated Double
hunk ./en/ch25-profiling.xml 898
-value onto the heap, and again this has implications for performance, as GHC
-will need to check there is sufficient heap space available before it can
+value onto the heap. Again this has implications for performance, as GHC
+will need to check that there is sufficient heap space available before it can
hunk ./en/ch25-profiling.xml 906
-in the loop. In addition, can can enable an optimisation GHC provides, that
-unboxes the strict fields of a data type, which will ensure the new pair data
-type is stored entirely in registers. This optimisation is turned on with
+in the loop. In addition, GHC provides an optimization that unboxes the strict
+fields of a data type, ensuring the fields of the new pair type will be
+stored in registers. This optimization is turned on with
hunk ./en/ch25-profiling.xml 921
-Compiling this with optimisations on, and -funbox-strict-fields -ddump-simpl,
+Compiling this with optimizations on, and -funbox-strict-fields -ddump-simpl,
hunk ./en/ch25-profiling.xml 940
-performance against our original program, with the full input size:
+performance against our original program:
hunk ./en/ch25-profiling.xml 976
-runs in around half a second, and allocates a total of 1 megabyte. Something of
-an improvement!
+runs in around half a second, and allocates a total of 1 megabyte. Quite an improvement!
hunk ./en/ch25-profiling.xml 980
-The general rules we can learn from the profiling and optimisation process are:
+The general rules we can learn from the profiling and optimization process are:
hunk ./en/ch25-profiling.xml 985
- Compile to native code, with optimisations on
+ Compile to native code, with optimizations on
hunk ./en/ch25-profiling.xml 1001
- Use data types closer to the machine representation (prefer
+ Use data types with simpler machine representations (prefer
hunk ./en/ch25-profiling.xml 1009
-issues, and when used wisely, can avoid them occuring in the first place.
+issues, and when used wisely, can avoid them occurring in the first place.
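Taken together, the strictness techniques these hunks describe -- a strict left-fold, strict constructor fields, and an Int length -- give something like the following sketch (reconstructed, not the book's exact listing):

```haskell
-- Strict fields ('!') force both accumulators when each Pair is built,
-- and foldl' forces the Pair cell itself at every step, so no thunks
-- accumulate. Compiled with -O2 -funbox-strict-fields, the fields become
-- candidates for unboxing, letting the loop state live in registers.
import Data.List (foldl')

data Pair = Pair !Int !Double

mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    Pair n s          = foldl' step (Pair 0 0) xs
    step (Pair l t) x = Pair (l + 1) (t + x)
```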
hunk ./en/ch25-profiling.xml 1022
-the elements of the list must be represented as heap allocated
+the elements of the list will be represented as heap allocated
hunk ./en/ch25-profiling.xml 1029
-the list program into a listless version, using an optimisation known as
-deforestation, which refers to a general class of optimisations that involve
+the list program into a listless version, using an optimization known as
+deforestation, which refers to a general class of optimizations that involve
hunk ./en/ch25-profiling.xml 1033
-reordering and transforming code wholesale at times. The specific
-deforestation optimisation we can use here is stream fusion.
+reordering and transforming code wholesale at times. The specific
+deforestation optimization we will use here is stream fusion.
hunk ./en/ch25-profiling.xml 1038
-This optimisation transforms recursive list generation and transformation
+This optimization transforms recursive list generation and transformation
hunk ./en/ch25-profiling.xml 1042
-allocation. The optimisation isn't enabled by default, but is enabled by a
-number of data structure libraries, which provide "rewrite rules", custom
-optimisations the compiler applies to functions the library exports.
-
+allocation. The optimization isn't enabled by default, and it can radically
+change the complexity of a piece of code, but is enabled by a number of data
+structure libraries, which provide "rewrite rules", custom optimizations the
+compiler applies to functions the library exports.
hunk ./en/ch25-profiling.xml 1050
-structures. Rewriting our program to use streams is straight forward:
+structures. Rewriting our program to use streams is straightforward:
hunk ./en/ch25-profiling.xml 1053
+
+
hunk ./en/ch25-profiling.xml 1073
-yielding a tight loop where list generation is interleaved with acumulation,
+yielding a tight loop where list generation is interleaved with accumulation,
hunk ./en/ch25-profiling.xml 1089
-Given that our Core is now optimal, the only step left to take, to optimise this
-program further, is by looking directly at the assembly. Of course, there
+Given that our Core is now optimal, the only step left to take this
+program further is to look directly at the assembly. Of course, there
hunk ./en/ch25-profiling.xml 1096
-optimisation flags to pass to GCC. Particularly with floating point code, it
+optimization flags to pass to GCC. Particularly with floating point code, it
hunk ./en/ch25-profiling.xml 1098
-compiler optimisations.
+compiler optimizations.
hunk ./en/ch25-profiling.xml 1102
-For example, we can squeeze our the last drops of performance
+For example, we can squeeze out the last drops of performance
hunk ./en/ch25-profiling.xml 1104
-cuts the running time in half again (as the C compiler is able to optimise
+cuts the running time in half again (as the C compiler is able to optimize
hunk ./en/ch25-profiling.xml 1109
-$ ghc -no-recomp --make -O2 -funbox-strict-fields -fvia-C -optc-O2 I.hs
+$ ghc -fforce-recomp --make -O2 -funbox-strict-fields -fvia-C -optc-O2 I.hs
hunk ./en/ch25-profiling.xml 1118
-Inspecting the final assembly (via -keep-tmp-files), we
+Inspecting the final x86_64 assembly (via -keep-tmp-files), we
hunk ./en/ch25-profiling.xml 1133
-We've effectively massaged the program through multiple source level
-optimisations, all the way to the final assembly. There's no where else to go from
-here. Optimising code to this level is very rarely necessarily, of course,
-and typically makes sense when writing low level libraries, or optimising
-particularly important code, where all algorithm choices have already
-determined. For day to day code, choosing better algorithms is always a more
-effective strategy, but it's useful to know we can optimise down to the metal
+We've effectively massaged the program through multiple source-level
+optimizations, all the way to the final assembly. There's nowhere else to go from
+here. Optimizing code to this level is very rarely necessary, of course,
+and typically only makes sense when writing low level libraries, or optimizing
+particularly important code, where all algorithm choices have already been
+determined. For day-to-day code, choosing better algorithms is always a more
+effective strategy, but it's useful to know we can optimize down to the metal
hunk ./en/ch25-profiling.xml 1151
-conventions that can go a long way towards keeping code lean and efficient.
+conventions that can go a long way towards keeping your code lean and efficient.
hunk ./en/ch25-profiling.xml 1161
-high level. A balance of productivity and ruthless efficiency.
+high level. The result is a sweet balance of productivity and ruthless efficiency.
}
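After fusion, the listless loop the final hunks describe is roughly equivalent to this hand-written version (a sketch of the shape of the fused code, not the library's actual output):

```haskell
-- Generation and accumulation interleaved in one tail-recursive loop:
-- no intermediate list is ever allocated. seq keeps both accumulators
-- evaluated, mirroring the strictness discussion earlier in the chapter.
mean :: Double -> Double
mean d = go 0 0 1
  where
    go :: Int -> Double -> Double -> Double
    go n s x
      | x > d     = s / fromIntegral n
      | otherwise = let n' = n + 1
                        s' = s + x
                    in n' `seq` s' `seq` go n' s' (x + 1)
```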