LloydForge blog http://lloydforge.org/index.rhtml LloydForge blog en-us
ruby 1.8.6 vs 1.9 - performance & memory usage http://lloydforge.org/index.rhtml/ruby/1.9.vs.1.8.6.html
<title> ruby 1.8.6 vs 1.9 - performance & memory usage</title>
Seeing all this action in ruby trunk, combined with what I've read 'round the net, has piqued my interest in 1.9 performance differences.
<p> Given the set of contributed benchmarks that I used when developing the <a href="/projects/ruby/ruby-less-mem-III.patch"> initial patch </a> to improve the reclamation and decrease memory usage of ruby, I did some comparisons of <a href="/projects/misc/1.8.6_vs_1.9-trunk.html"> ruby 1.9 vs ruby 1.8.6</a>, and of <a href="/projects/misc/patched.1.8.vs.1.9.html"> ruby 1.9 vs a patched ruby 1.8.6</a>.
<p> In short, looking at this data leads me to some preliminary conclusions:
<ul>
<li> 1.9 is decidedly "faster" than 1.8.6, especially when runtimes are longer or yaml is involved.
<li> 1.9 uses slightly less memory overall.
<li> There is considerable room for improvement in 1.9's memory reclamation.
</ul>
<p> So <i>why is memory usage important</i>? Well, here I'm personally biased. My primary day to day application of ruby is embedding. In this scenario, I really want a small and tight ruby that I can use to move a buncha code out of c++ and into ruby. I'm also interested in portability of ruby to less capable devices. In both of these situations, memory usage is an important factor. Additionally, and more generally, I think low memory usage minimizes the copy-on-write issues prevalent in ruby on rails environments. If the process has a small, tight, compact heap, it doesn't matter so much that we have to copy the whole thing on each fork.
<p> My educated guess here is that we could focus energy on minimizing memory usage rather than on an external bitset, and we'd get the desired COW friendly ruby at a similar (minimal) performance cost, but the solution would yield more across-the-board benefits.
To be clear, moving mark bits into an external data structure (and out of the heap) is really an optimization aimed at ruby in web environments, and it comes with a cost. I don't want to pay this <a href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/15827"> performance penalty</a> for something that doesn't really benefit me.
<p> So I guess the task is to continue to follow 1.9 and to attempt to figure out which 2 lines are inverted that could really make the difference... It can't be too hard, right?
<p> till the next,<br> lloyd
YAJL 0.4.0, finally http://lloydforge.org/index.rhtml/yajl/0.4.0.html
<title> YAJL 0.4.0, finally </title>
<p> This release is 90% contributions. Thanks to all who are using YAJL and taking the time to push me patches. </p>
<p> YAJL source has moved to a publicly visible repository with full history preserved: <br> <a href="http://code.google.com/p/yajl-c">http://code.google.com/p/yajl-c</a> </p>
<p> <h3>Changes in 0.4.0</h3>
<ul>
<li> <b>lth</b> buffer overflow bug in yajl_gen_double s/%lf/%g/ - thanks to Eric Bergstrome
<li> <b>lth</b> yajl_number callback to allow passthrough of arbitrary precision numbers to client. Thanks to Hatem Nassrat.
<li> <b>lth</b> yajl_integer now deals in long, instead of long long. This combined with yajl_number improves compiler compatibility while maintaining precision.
<li> <b>lth</b> better ./configure && make experience (still requires cmake & ruby)
<li> <b>lth</b> fix handling of special characters hex 0F and 1F in yajl_encode (thanks to Robert Geiger)
<li> <b>lth</b> allow leading zeros in exponents (thanks to Hatem Nassrat)
</ul> </p>
<p> enjoy! <br> --lloyd
matz, the ruby trunk, and GC changes http://lloydforge.org/index.rhtml/ruby/matz_1.9_gc.html
<title> matz, the ruby trunk, and GC changes </title>
w00t. An email from matz and a little spelunking in the ruby subversion repository show that there's some tinkering going on in ruby garbage collection land.
Here are the interesting change logs:
<p>
<pre>
r15674 | matz | 2008-03-03 01:27:43 -0700 (Mon, 03 Mar 2008) | 5 lines

* gc.c (add_heap): sort heaps array in ascending order to use binary search.
* gc.c (is_pointer_to_heap): use binary search to identify object in heaps. works better when number of heap segments grow big.
</pre>
and...
<pre>
r16194 | matz | 2008-04-25 03:03:32 -0600 (Fri, 25 Apr 2008) | 7 lines

* gc.c (free_unused_heaps): preserve last used heap segment to reduce malloc() call.
* gc.c (HEAP_SIZE): use smaller heap segment (2K) for more chance to be freed. based on patch from authorNari <authornari at gmail.com>.
* gc.c (rb_newobj_from_heap): eventually allocate heap segments.
</pre>
<p> So now in ruby 1.9 trunk we're keeping heaps in sorted order by memory address, and using binary search to answer the <tt> is_pointer_to_heap() </tt> question quickly. This optimizes things to the point where we can really crank down heap size. Smaller heaps mean more memory handed back to the OS, which means reduced resource usage, and should even mean a ruby with reduced COW badness. All this at a minimal performance impact for normal execution (maybe none, matz knows).
<p> So applause to open source, and matz specifically for sifting through all the ideas, hacks, and patches to realize this thing. It will be interesting to include 1.9 in the <a href="/projects/ruby"> performance comparison table </a> to see how things have changed from 1.8.6 to present trunk.
<p> So why do I care so much about a less memory intensive ruby? Well, because ruby _really_ shines as an embedded language, in terms of both a robust set of built-ins and a fairly modest size hit. Also, the C api for embedding is beautiful. It's fun to use in the way that ruby itself is. I'm guessing that the embedding API got so nice because extension authors have been using it and complaining about it for a while. So the part of the api that is common to authoring extensions and embedding an interpreter is great.
What sucks is the part that's unique to embedding. It would be great if:
<ol>
<li> I could run two ruby interpreter contexts in a single process in different threads (I know this would have some benefits in web server plugins too).
<li> I could have multiple interpreter contexts around at the same time.
<li> I could cleanly shut down and restart the interpreter, without massive memory leaks.
</ol>
<p> So reducing memory usage is a good first step to making ruby the premier language for embedding. Next steps include getting rid of all them statics, and breaking out and making optional the stuff that is only required by the ruby interpreter itself. Perhaps a bit more ambitious than hackin on the GC...
<p> till the next,<br> lloyd
What's up with YAJL? http://lloydforge.org/index.rhtml/yajl/yajl_status.html
<title> What's up with YAJL? </title>
Many apologies for my lack of attention to yajl. I've been slowly integrating fixes and change requests from folks, and have moved the source into <a href="http://code.google.com/p/yajl-c">code.google.com</a> for direct read-only svn access.
<p> I've got a couple more reported bugs to work out (mostly around number parsing), and have a contributed C++ wrapper from <a href="http://surfulater.com"> Neville Franks </a> that I'd like to integrate as an optional additional library.
<p> Truth be told, my <a href="http://browserplus.yahoo.com"> day job </a> has been consuming all of my time. In coming weeks I expect this to lighten up a tad and to have time to complete what will become yajl 0.4.0.
<p> stay with me, <br> --lloyd
hacking on ruby's garbage collector http://lloydforge.org/index.rhtml/ruby/original_gc_rant.html
<title> hacking on ruby's garbage collector </title>
This is my original writeup on possible improvements to the ruby garbage collector. Latest status is archived in projects/ruby.
<h3> overview </h3>
Ruby's GC & heap implementation uses a lot of memory. The thing is based around the idea of "heaps".
Heaps are chunks of memory where ruby objects are stored. Each heap consists of a number of slots. Slots are between 20 and 40 bytes, depending on sizeof(long). When ruby runs out of heap space, it first does a GC run to try to free something up, and then allocates a new heap. The new heap is 1.8 times the size of the last. Every time a GC run happens, the entire heap is written to in order to turn off the mark bits, which are stored in the heap itself. Then we run through top level objects, and mark them and all their descendants. Then we throw away anything that's not marked (sweep). Because of the way ruby works, objects may _never_ be moved around in heaps. That means from the time they're allocated to the time they're freed they may not be moved to a new memory address.
<p> So this is a very terse summary; more is available in the ruby hackers guide. But it's enough. There are a couple bad things here.
<ol>
<li> In order for a heap to be reclaimed, _all_ entries on the heap need to be freed. The bigger a heap is, the more likely that it will contain at least one long lived object. The 1.8 growth factor makes it bloody unlikely that you'll ever get to reclaim heap space.
<li> A big heap makes GC slower. You have to scan the whole thing.
<li> (<a href="http://izumi.plan99.net/blog/index.php/2007/10/15/making-ruby%e2%80%99s-garbage-collector-copy-on-write-friendly-part-6-final/#comment-7010">this guy gets the credit on this idea</a>) Because the GC writes to the whole heap while scanning, copy-on-write semantics are blown. You do a fork, and as soon as GC runs, your whole heap is resident and private memory.
<li> We cannot do compaction at GC time. We must either change the way ruby works in a very fundamental way (bad), or think of a creative lightweight way to just keep the heap compacted.
</ol>
<h3> Plan of attack </h3>
<ol>
<li> develop a method of quantifying performance of ruby GC & heap
<li> prove some of these ideas can have positive effects in terms of memory usage and performance
<li> produce patches for the ruby community and anyone who wants em
<li> drink beer
</ol>
<h3> Quantifying </h3>
First thing we need is a way to get a look at statistics of the gc stuff. So we hack in a GC.heap_info function that returns
<ul>
<li> <b>num_heaps</b> - the number of allocated heaps
<li> <b>heap_slots_free</b> - the total number of free slots
<li> <b>heap_memory</b> - the amount of memory allocated to ruby heaps
<li> <b>heap_slots_allocated</b> - the total number of available slots
<li> <b>heap_slots_used</b> - the total number of used slots
<li> <b>num_gc_passes</b> - the number of times GC has been run
</ul>
<p> Great. Next we need test cases. I start with three:
<ul>
<li> <b> grow array </b> build a buncha arrays, bigger and bigger
<li> <b> shrink array </b> start with a big array, build a buncha smaller ones.
<li> <b> PLIST parsing </b> Parse a huge plist file.
</ul>
Yeah, they're artificial, but we'll add more cases as we go. At least we have some constant tests to start with.
<h3> Proof of concept </h3>
<h4> hypothesis </h4>
By getting rid of the 1.8 growth factor, and making heaps smaller, we can increase the amount of memory that's reclaimed. That should also make ruby faster, by reducing the amount of unused memory that gets scanned.
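<p> As a rough sanity check on that hypothesis, here's a minimal sketch in ruby of how total heap size grows as heaps get added. It's a toy model, not gc.c: the 10,000 slot starting size is an assumption (it isn't stated above), the 20 byte slot is the smaller of the two slot sizes mentioned in the overview, and <tt>heap_sizes</tt>, <tt>heap_bytes</tt> and <tt>heaps_needed</tt> are made-up helper names. Per-heap bookkeeping overhead is ignored.

```ruby
# Sketch only: models how the total slot count grows as heaps are added.
# INITIAL_SLOTS is an assumption, not a value quoted from gc.c.
# SLOT_BYTES is the 20-byte slot size from the overview (small sizeof(long)).
INITIAL_SLOTS = 10_000
SLOT_BYTES    = 20

# Returns the slot count of each heap, in allocation order.
# growth = 1.8 models vanilla ruby; growth = 1.0 models "no growth factor".
def heap_sizes(heap_count, growth)
  sizes = []
  slots = INITIAL_SLOTS
  heap_count.times do
    sizes << slots
    slots = (slots * growth).floor
  end
  sizes
end

# Approximate bytes of heap memory for the given heap sizes.
def heap_bytes(sizes)
  sizes.sum * SLOT_BYTES
end

# Number of heaps that must exist before `live_slots` objects fit.
def heaps_needed(live_slots, growth)
  count = 0
  total = 0
  slots = INITIAL_SLOTS
  while total < live_slots
    total += slots
    slots = (slots * growth).floor
    count += 1
  end
  count
end

if __FILE__ == $PROGRAM_NAME
  [1.8, 1.0].each do |growth|
    sizes = heap_sizes(2, growth)
    puts "growth #{growth}: #{sizes.inspect} slots, ~#{heap_bytes(sizes)} bytes"
  end
end
```

Under those assumptions, two heaps come to 28,000 slots (roughly 560,000 bytes) with the 1.8 factor and 20,000 slots (roughly 400,000 bytes) without it, which is in the same ballpark as the growarray numbers in the data that follows. The point of the model is just that each new heap under the 1.8 factor is bigger than the last, so the later heaps are both the largest and the least likely to ever empty out.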
<h4> data </h4>
Vanilla ruby:
<pre>
running cases/growarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 560080/6501/0.767838011570602/2/9
heap after (mem/used slots/% free/heaps/gc passes) 560080/6367/0.772623384043997/2/11
time 1.339915
running cases/plist.rb
heap before (mem/used slots/% free/heaps/gc passes) 15056520/559048/0.257393875553088/7/21
heap after (mem/used slots/% free/heaps/gc passes) 15056520/109351/0.854744633172117/7/23
time 3.773354
running cases/shrinkarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 560080/7537/0.730840654238983/2/9
heap after (mem/used slots/% free/heaps/gc passes) 560080/6373/0.77240911363474/2/11
time 1.338151
</pre>
Killing 1.8 growth factor:
<pre>
running cases/growarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 400080/6501/0.674982501749825/2/9
heap after (mem/used slots/% free/heaps/gc passes) 400080/6367/0.681681831816818/2/11
time 1.337504
running cases/plist.rb
heap before (mem/used slots/% free/heaps/gc passes) 9406200/380442/0.191001631002226/47/51
heap after (mem/used slots/% free/heaps/gc passes) 9406200/109351/0.767468416609429/47/53
time 4.278476
running cases/shrinkarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 400080/7525/0.623787621237876/2/9
heap after (mem/used slots/% free/heaps/gc passes) 400080/6373/0.681381861813819/2/11
time 1.343519
</pre>
Killing 1.8 growth factor and reducing heap size to 1/10th:
<pre>
running cases/growarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 221540/6491/0.413428519790349/11/18
heap after (mem/used slots/% free/heaps/gc passes) 221540/6368/0.424543647207663/11/20
time 1.351549
running cases/plist.rb
heap before (mem/used slots/% free/heaps/gc passes) 7741680/382475/0.0109718192584778/366/365
heap after (mem/used slots/% free/heaps/gc passes) 3795580/109350/0.423259493670886/179/367
time 7.971707
running cases/shrinkarray.rb
heap before (mem/used slots/% free/heaps/gc passes) 221540/7717/0.302638713175493/11/17
heap after (mem/used slots/% free/heaps/gc passes) 221540/6374/0.424001445870233/11/19
time 1.343365
</pre>
<h4> analysis </h4>
Note: "heap before" means "heap before final GC run". We fork a process which runs the test case, then we check out the heap using GC.heap_info, then we run a GC pass, then we check it out again.
<p> w00t! We made ruby twice as slow! Well, hold on. First inspect the run times of plist.rb (the most realistic test case). Also inspect the number of gc passes. Pretty tight correlation, right? Reducing heap size and removing the 1.8 growth factor both increase the number of gc passes that we make. So we see a significant performance degradation that tracks the number of passes that are run.
<p> Inspect memory usage (still looking at only plist.rb). Vanilla ruby is using 15MB at the end of everything, and that heap is 85% unused. Kill the 1.8 and we're using 9MB of heap space, 76% unused. Decrease the heap size, and we actually see memory being reclaimed. After the run and final GC we've only got ~4MB in use at 42% free. Immediately after the run we were around 8MB.
<h4> parting shot/conclusion </h4>
By changing two constants we can make ruby a lot more memory efficient, and at the same time a lot slower. The slowdown appears to be largely from increased frequency of GC. Maybe we can look at not running GC on _every_ heap allocation, but on every N heap allocations... Goal here being to restore ruby to its original, or better, performance characteristics, but reduce the memory usage.
<p> Essence here is that everyone knows you can grow a buffer by a factor and make things faster. But other aspects of ruby make that choice perhaps not optimal here. Stay tuned, we'll dig further.
<h4> one more thing, ideas on "automatic heap compaction" </h4>
A global freelist? Bad. Why not have per-heap freelists? Why not sort the heaps by usage percentage at the end of a sweep? Allocate from the most heavily used ones first...
There's some complexity here around when GC is run... Cause it's only run when everything is full... But perhaps some room for exploration...
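<p> To make that last idea a bit more concrete, here's a toy model in ruby of per-heap freelists plus a sort by usage after each sweep. It's only a sketch of the idea, not a patch against gc.c, and it says nothing about when GC should run: <tt>ToyHeap</tt> and <tt>ToyAllocator</tt> are hypothetical names, slots are just integer indexes, and "sweeping" is left to the caller via <tt>release</tt>.

```ruby
# Sketch only: a toy model of per-heap freelists. Nothing here is ruby's
# real allocator; ToyHeap and ToyAllocator are hypothetical names.
class ToyHeap
  attr_reader :size

  def initialize(size)
    @size = size
    @free = (0...size).to_a # this heap's own freelist of slot indexes
  end

  def used
    @size - @free.length
  end

  # Fraction of slots in use, between 0.0 and 1.0.
  def usage
    used.to_f / @size
  end

  def full?
    @free.empty?
  end

  def empty?
    @free.length == @size
  end

  # Take a slot from this heap's freelist; returns the slot index.
  def allocate
    @free.pop
  end

  # Put a slot back on this heap's freelist (what a sweep would do).
  def release(slot)
    @free.push(slot)
  end
end

class ToyAllocator
  attr_reader :heaps

  def initialize(heap_count, heap_size)
    @heaps = Array.new(heap_count) { ToyHeap.new(heap_size) }
  end

  # At the end of a sweep: put the most heavily used heaps first.
  def sort_after_sweep
    @heaps = @heaps.sort_by { |heap| -heap.usage }
  end

  # Allocate from the first heap with room, which after sorting is the
  # fullest heap that still has a free slot.
  # Returns [heap, slot], or nil when every heap is full.
  def allocate
    heap = @heaps.find { |h| !h.full? }
    heap && [heap, heap.allocate]
  end

  # Heaps holding no objects at all; candidates to hand back to the OS.
  def reclaimable
    @heaps.select(&:empty?)
  end
end
```

The hope with this ordering is that new objects pile into heaps that are already mostly full, so lightly used heaps get a chance to drain completely and show up in <tt>reclaimable</tt>, without ever moving a live object.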