Created: 12 years, 7 months ago by tejohnson
Modified: 10 years, 5 months ago
CC: hubicka_ucw.cz, gcc-patches_gcc.gnu.org
Base URL: svn+ssh://gcc.gnu.org/svn/gcc/trunk/
Visibility: Public

Patch Set 1
Patch Set 2: [PATCH] New fdo summary-based icache sensitive unrolling
Total comments: 5
Total messages: 17
Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid
in determining hot/cold decisions.

The original patch description is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating it onto mainline is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov
profile files that indicates how many profiled counts compose
the majority of the program's execution time. This is used to
provide an indication of the overall code size of the program.

The new profile summary information is then used to guide
codesize-based unroll and peel decisions, to prevent those
optimizations from increasing code size too much when the
program may be sensitive to icache effects.

This patch also pulls in dependent portions of google/main r187660 that
cache additional loop analysis results in the niter_desc auxiliary
information hanging off the loop structure (the optimization portions of
that change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-11  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new summary
	information.
	* gcc/doc/invoke.texi: Document new options
	-fpeel-codesize-limit and -funroll-codesize-limit, and new params
	unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary
	info.
(struct gcov_ctr_summary): Add new summary info: num_hot_counters and hot_cutoff_value. * gcc/loop-unroll.c (code_size_limit_factor): New function. (decide_unroll_runtime_iterations): Call code_size_limit_factor to control the unroll factor, and retrieve number of branches from niter_desc instead of via function that walks loop. (decide_peel_simple, decide_unroll_stupid): Ditto. * gcc/coverage.c (read_counts_file): Propagate new summary info. * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns function, and add guards to enable this function to work for the outermost loop. * gcc/common.opt: Add -fpeel-codesize-limit and -funroll-codesize-limit. * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. (num_loop_branches): Remove. * gcc/cfgloop.h (struct niter_desc): Added new fields to cache additional loop analysis information. (num_loop_branches): Remove. (analyze_loop_insns): Declare. * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. * gcc/gcov-dump.c (tag_summary): Dump new summary info. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 189413) +++ libgcc/libgcov.c (working copy) @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Used by qsort to sort gcov values in descending order. */ + +static int +sort_by_reverse_gcov_value (const void *pa, const void *pb) +{ + const gcov_type a = *(gcov_type const *)pa; + const gcov_type b = *(gcov_type const *)pb; + + if (b > a) + return 1; + else if (b == a) + return 0; + else + return -1; +} + +/* Determines the number of counters required to cover a given percentage + of the total sum of execution counts in the summary, which is then also + recorded in SUM. 
*/ + +static void +gcov_compute_cutoff_values (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, i, ctr_info_ix, index; + gcov_unsigned_t c_num; + gcov_type *value_array; + gcov_type cum, cum_cutoff; + char *cutoff_str; + unsigned cutoff_perc; + +#define CUM_CUTOFF_PERCENT_TIMES_10 999 + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); + if (cutoff_str && strlen (cutoff_str)) + cutoff_perc = atoi (cutoff_str); + else + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. */ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + /* Determine the cumulative counter value at the specified cutoff + percentage and record the percentage for use by gcov consumers. + Check for overflow when sum_all is multiplied by the cutoff_perc, + and if so, do the divide first. */ + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) + /* Overflow, do the divide first. */ + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; + else + /* Otherwise multiply first to get the correct value for small + values of sum_all. */ + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; + + /* Next, walk through all the per-object structures and save each of + the count values in value_array. */ + index = 0; + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. 
 */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+	{
+	  if (gi_ptr->merge[i])
+	    ctr_info_ix++;
+	}
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+	{
+	  gfi_ptr = gi_ptr->functions[f_ix];
+
+	  if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+	    continue;
+
+	  ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+	  /* Sanity check that there are enough entries in value_array
+	     for this function's counters. Gracefully handle the case when
+	     there are not, in case something in the profile info is
+	     corrupted. */
+	  c_num = ci_ptr->num;
+	  if (index + c_num > cs_ptr->num)
+	    c_num = cs_ptr->num - index;
+	  /* Copy over this function's counter values. */
+	  memcpy (&value_array[index], ci_ptr->values,
+		  sizeof (gcov_type) * c_num);
+	  index += c_num;
+	}
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier. */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+	 sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff. */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.
*/ @@ -598,7 +713,14 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + { + cs_prg->num = cs_tprg->num; + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) + { + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; + } + } cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi (revision 189413) +++ gcc/doc/invoke.texi (working copy) @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays @gol -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. -ftree-reassoc @gol -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol --funit-at-a-time -funroll-all-loops -funroll-loops @gol +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol -fwhole-program -fwpa -fuse-linker-plugin @gol @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs @option{-funroll-all-loops} implies the same options as @option{-funroll-loops}. 
+@item -funroll-codesize-limit +@opindex funroll-codesize-limit +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fpeel-loops @opindex fpeel-loops Peels loops for which there is enough information that they do not @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns Enabled with @option{-fprofile-use}. +@item -fpeel-codesize-limit +@opindex fpeel-codesize-limit +Limit loop peeling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fmove-loop-invariants @opindex fmove-loop-invariants Enables the loop invariant motion pass in the RTL loop optimizer. Enabled @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s @item max-completely-peel-loop-nest-depth The maximum depth of a loop nest suitable for complete peeling. +@item unrollpeel-codesize-threshold +Maximum profile-based code size footprint estimate for loop unrolling and +peeling. + +@item unrollpeel-hotness-threshold +Maximum ratio of total execution count to loop entry block count under which +most profile-based code size estimates will be ignored for loop unrolling and +peeling. + @item max-unswitch-insns The maximum number of insns of an unswitched loop. 
Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 189413) +++ gcc/gcov-io.c (working copy) @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { gcov_write_unsigned (csum->num); + gcov_write_unsigned (csum->num_hot_counters); + gcov_write_counter (csum->hot_cutoff_value); gcov_write_unsigned (csum->runs); gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { csum->num = gcov_read_unsigned (); + csum->num_hot_counters = gcov_read_unsigned (); + csum->hot_cutoff_value = gcov_read_counter (); csum->runs = gcov_read_unsigned (); csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 189413) +++ gcc/gcov-io.h (working copy) @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) #define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; struct gcov_ctr_summary { gcov_unsigned_t num; /* number of counters. */ + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given + percent of sum_all. */ + gcov_type hot_cutoff_value; /* value of smallest counter included in + num_hot_counters. */ gcov_unsigned_t runs; /* number of program runs */ gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. 
 */
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189413)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
    optimizations on innermost loops (with single exception) because
@@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates.  Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX). */
+
+static int
+code_size_limit_factor (struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program.  A large value would indicate a large flat execution
+     profile where icache misses may be a concern. */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance. */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling. */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling. */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+	 limited unrolling for very hot loops. */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+	 the threshold (i.e. the loop entry is hot enough), the divisor is set
+	 to 1 so the unroll/peel factor is not reduced.  When it is bigger
+	 than the ratio, increase the divisor by the amount this ratio
+	 is over the threshold, which will quickly reduce the unroll/peel
+	 factor to zero as the loop's hotness reduces. */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+	{
+	  limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+	  gcc_assert (limit_factor >= 1);
+	}
+      else
+	limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0. */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.
*/ void unroll_and_peel_loops (int flags) @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_UNROLL)) { @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo "\n;; Considering unrolling loop with runtime " "computable number of iterations\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) { unsigned npeel; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_PEEL)) { @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) if (dump_file) fprintf (dump_file, "\n;; Considering simply peeling loop\n"); + if (flag_peel_codesize_limit) + { + /* Determine whether to limit code size growth from peeling, + using FDO profile summary information that gives an + estimated number of executed blocks. 
*/ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max peeled insns by divisor %d\n", limit_factor); + } + } + /* npeel = number of iterations to peel. */ - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor + / loop->ninsns; if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) /* Do not simply peel loops with branches inside -- it increases number of mispredicts. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not peeling, contains branches\n"); @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_UNROLL_ALL)) { @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags if (dump_file) fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags /* Do not unroll loops with branches inside -- it increases number of mispredicts. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not unrolling, contains branches\n"); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 189413) +++ gcc/coverage.c (working copy) @@ -245,6 +245,14 @@ read_counts_file (void) gcov_read_summary (&sum); for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) { + if (summary.ctrs[ix].num_hot_counters + < sum.ctrs[ix].num_hot_counters) + { + summary.ctrs[ix].num_hot_counters + = sum.ctrs[ix].num_hot_counters; + summary.ctrs[ix].hot_cutoff_value + = sum.ctrs[ix].hot_cutoff_value; + } summary.ctrs[ix].runs += sum.ctrs[ix].runs; summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) Index: gcc/loop-iv.c =================================================================== --- gcc/loop-iv.c (revision 189413) +++ gcc/loop-iv.c (working copy) @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) /* At least desc->infinite is not always initialized by find_simple_loop_exit. 
*/ desc = XCNEW (struct niter_desc); - iv_analysis_loop_init (loop); - find_simple_exit (loop, desc); + if (loop->latch != EXIT_BLOCK_PTR) + { + iv_analysis_loop_init (loop); + find_simple_exit (loop, desc); + } + analyze_loop_insns (loop, desc); loop->aux = desc; if (desc->simple_p && (desc->assumptions || desc->infinite)) Index: gcc/common.opt =================================================================== --- gcc/common.opt (revision 189413) +++ gcc/common.opt (working copy) @@ -1551,6 +1551,10 @@ fpcc-struct-return Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) Return small aggregates in memory, not registers +fpeel-codesize-limit +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop peeling under profile estimates of large code footprint + fpeel-loops Common Report Var(flag_peel_loops) Optimization Perform loop peeling @@ -2108,6 +2112,10 @@ funroll-all-loops Common Report Var(flag_unroll_all_loops) Optimization Perform loop unrolling for all loops +funroll-codesize-limit +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop unrolling under profile estimates of large code footprint + ; Nonzero means that loop optimizer may assume that the induction variables ; that control loops do not overflow and that the loops with nontrivial ; exit condition are not infinite Index: gcc/cfgloop.c =================================================================== --- gcc/cfgloop.c (revision 189413) +++ gcc/cfgloop.c (working copy) @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) return edges; } -/* Counts the number of conditional branches inside LOOP. */ +/* Determine if INSN is a floating point set. 
*/ -unsigned -num_loop_branches (const struct loop *loop) +static bool +insn_has_fp_set(rtx insn) { - unsigned i, n; - basic_block * body; + int i; + rtx pat = PATTERN(insn); + if (GET_CODE (pat) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); + else if (GET_CODE (pat) == PARALLEL) + { + for (i = 0; i < XVECLEN (pat, 0); i++) + { + rtx sub = XVECEXP (pat, 0, i); + if (GET_CODE (sub) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); + } + } + return false; +} - gcc_assert (loop->latch != EXIT_BLOCK_PTR); +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts + the number of conditional branch instructions, calls and fp instructions, + as well as the average number of branches executed per iteration. */ +void +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) +{ + unsigned i, nbranch; + gcov_type weighted_nbranch; + bool has_call, has_fp; + basic_block * body, bb; + rtx insn; + gcov_type header_count = loop->header->count; + + nbranch = weighted_nbranch = 0; + has_call = has_fp = false; + body = get_loop_body (loop); - n = 0; for (i = 0; i < loop->num_nodes; i++) - if (EDGE_COUNT (body[i]->succs) >= 2) - n++; + { + bb = body[i]; + + if (EDGE_COUNT (bb->succs) >= 2) + { + nbranch++; + + /* If this block is executed less frequently than the header (loop + entry), then it is weighted based on its execution count, which + will be turned into a ratio compared to the loop header below. */ + if (bb->count < header_count) + weighted_nbranch += bb->count; + + /* When it is executed more frequently than the header (i.e. it is + in a nested inner loop), simply weight the branch the same as the + header execution count, so that it will contribute 1 branch to + the ratio computed below. */ + else + weighted_nbranch += header_count; + } + + /* No need to iterate through the instructions below if + both flags have already been set. 
*/ + if (has_call && has_fp) + continue; + + FOR_BB_INSNS (bb, insn) + { + if (!INSN_P (insn)) + continue; + + if (!has_call) + has_call = CALL_P (insn); + + if (!has_fp) + has_fp = insn_has_fp_set (insn); + } + } free (body); - return n; + desc->num_branches = nbranch; + /* Now divide the weights computed above by the loop header execution count, + to compute the average number of branches through the loop. By adding + header_count/2 to the numerator we round to nearest with integer + division. */ + if (header_count != 0) + desc->av_num_branches + = (weighted_nbranch + header_count/2) / header_count; + else + desc->av_num_branches = 0; + desc->has_call = has_call; + desc->has_fp = has_fp; } /* Adds basic block BB to LOOP. */ Index: gcc/cfgloop.h =================================================================== --- gcc/cfgloop.h (revision 189413) +++ gcc/cfgloop.h (working copy) @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); edge single_exit (const struct loop *); -extern unsigned num_loop_branches (const struct loop *); extern edge loop_preheader_edge (const struct loop *); extern edge loop_latch_edge (const struct loop *); @@ -366,7 +365,8 @@ struct rtx_iv }; /* The description of an exit from the loop and of the number of iterations - till we take the exit. */ + till we take the exit. Also includes other information used primarily + by the loop unroller. */ struct niter_desc { @@ -407,6 +407,18 @@ struct niter_desc /* The number of iterations of the loop. */ rtx niter_expr; + + /* The number of branches in the loop. */ + unsigned num_branches; + + /* The number of executed branches per iteration. */ + unsigned av_num_branches; + + /* Whether the loop contains a call instruction. */ + bool has_call; + + /* Whether the loop contains fp instructions. 
*/ + bool has_fp; }; extern void iv_analysis_loop_init (struct loop *); @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); extern struct niter_desc *get_simple_loop_desc (struct loop *loop); extern void free_simple_loop_desc (struct loop *loop); +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); static inline struct niter_desc * simple_loop_desc (struct loop *loop) Index: gcc/params.def =================================================================== --- gcc/params.def (revision 189413) +++ gcc/params.def (working copy) @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, "max-completely-peel-loop-nest-depth", "The maximum depth of a loop nest we completely peel", 8, 0, 0) +/* The maximum code size estimate under which loop unrolling and peeling + * is allowed in a profile feedback compile. This currently applies to loops + * with non-constant iteration counts and no floating point computations. */ +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, + "unrollpeel-codesize-threshold", + "Maximum profile-based code size footprint estimate for loop unrolling " + "and peeling", + 15000, 0, 0) +/* The maximum ratio of total profiled execution counts to loop entry block + count that must be exceeded to ignore most code size limits when unrolling + and peeling. */ +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, + "unrollpeel-hotness-threshold", + "Maximum ratio of total profiled execution count to loop entry " + "block count under which most codesize limits for unrolling and " + "peeling will be ignored", + 100, 1, 0) /* The maximum number of insns of an unswitched loop. 
*/ DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 189413) +++ gcc/gcov-dump.c (working copy) @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED { printf ("\n"); print_prefix (filename, 0, 0); - printf ("\t\tcounts=%u, runs=%u", - summary.ctrs[ix].num, summary.ctrs[ix].runs); + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", + summary.ctrs[ix].num, + summary.ctrs[ix].num_hot_counters, + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, + summary.ctrs[ix].runs); printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); -- This patch is available for review at http://codereview.appspot.com/6351086
ping
Teresa

On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote:
> [snip]
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_arry > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; > c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,14 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + { > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = > cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = > cs_tprg->hot_cutoff_value; > + } > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189413) > +++ gcc/doc/invoke.texi (working copy) > @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls > @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops > -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc @gol > -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops > -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops > @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under > which > +most profile-based code size estimates will be ignored for loop unrolling > and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189413) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189413) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* > Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ 
> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189413) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > > /* This pass performs loop unrolling and peeling. We only perform these > optimizations on innermost loops (with single exception) because > @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be > constrained > + based on code footprint estimates. Returns the codesize-based factor > to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For threshold >= 2*threshold, disable unrolling/peeling (by > returning > + INT_MAX). 
*/ > + > +static int > +code_size_limit_factor(struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor > which in > + most loops will need to be set to a value that will reduce or > eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, > allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller > than > + the threshold (i.e. the loop entry is hot enough, the divisor is > set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the ratio, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For appliations that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to > 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
> */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / > loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. > */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189413) > +++ gcc/coverage.c (working copy) > @@ -245,6 +245,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189413) > +++ gcc/loop-iv.c (working copy) > @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189413) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) > Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code > footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2108,6 +2112,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large > code footprint > + > ; Nonzero means that loop optimizer may assume that the induction > variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189413) > +++ gcc/cfgloop.c (working copy) > @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently > counts > + the number of conditional branch instructions, calls and fp > instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header > (loop > + entry), then it is weighted based on its execution count, > which > + will be turned into a ratio compared to the loop header > below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it > is > + in a nested inner loop), simply weight the branch the same > as the > + header execution count, so that it will contribute 1 branch > to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution > count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189413) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of > iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189413) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to > loops > + * with non-constant iteration counts and no floating point computations. > */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop > unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry > block > + count that must be exceeded to ignore most code size limits when > unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189413) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at > http://codereview.appspot.com/6351086 > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping (retrying ping in plain text mode so that it goes through properly). Thanks, Teresa On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ports some patches related to improving FDO program summary information > and using it to guide loop unrolling from google branches to mainline. > The patch is enhanced to add additional summary information to aid > in determining hot/cold decisions. > > The original patch description is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and further discussion about incorporating onto mainline is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 > > Honza, can you take a look to see if this patch meets your needs? > > Full description: > > This patch adds new program summary information to the gcov > profile files that indicate how many profiled counts compose > the majority of the program's execution time. This is used to > provide an indication of the overall code size of the program. > > The new profile summary information is then used to guide > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > This patch also pulls in dependent portions of google/main r187660 that cache > additional loop analysis results in the niter_desc auxiliary information > hanging off the loop structure (the optimization portions of that > change are not included here, and have an outstanding review request > for mainline). > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-07-11 Teresa Johnson <tejohnson@google.com> > > * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. > (gcov_compute_cutoff_values): Ditto. > (gcov_exit): Call gcov_compute_cutoff_values and merge new summary > information. 
> * gcc/doc/invoke.texi (roll much): Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. > * gcc/gcov-io.c (gcov_write_summary): Write new summary info. > (gcov_read_summary): Read new summary info. > * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. > (struct gcov_ctr_summary): Add new summary info: num_hot_counters and > hot_cutoff_value. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/coverage.c (read_counts_file): Propagate new summary info. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. > (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > * gcc/gcov-dump.c (tag_summary): Dump new summary info. > > Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 189413) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Used by qsort to sort gcov values in descending order. 
*/ > + > +static int > +sort_by_reverse_gcov_value (const void *pa, const void *pb) > +{ > + const gcov_type a = *(gcov_type const *)pa; > + const gcov_type b = *(gcov_type const *)pb; > + > + if (b > a) > + return 1; > + else if (b == a) > + return 0; > + else > + return -1; > +} > + > +/* Determines the number of counters required to cover a given percentage > + of the total sum of execution counts in the summary, which is then also > + recorded in SUM. */ > + > +static void > +gcov_compute_cutoff_values (struct gcov_summary *sum) > +{ > + struct gcov_info *gi_ptr; > + const struct gcov_fn_info *gfi_ptr; > + const struct gcov_ctr_info *ci_ptr; > + struct gcov_ctr_summary *cs_ptr; > + unsigned t_ix, f_ix, i, ctr_info_ix, index; > + gcov_unsigned_t c_num; > + gcov_type *value_array; > + gcov_type cum, cum_cutoff; > + char *cutoff_str; > + unsigned cutoff_perc; > + > +#define CUM_CUTOFF_PERCENT_TIMES_10 999 > + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); > + if (cutoff_str && strlen (cutoff_str)) > + cutoff_perc = atoi (cutoff_str); > + else > + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; > + > + /* This currently only applies to arc counters. */ > + t_ix = GCOV_COUNTER_ARCS; > + > + /* First check if there are any counts recorded for this counter. */ > + cs_ptr = &(sum->ctrs[t_ix]); > + if (!cs_ptr->num) > + return; > + > + /* Determine the cumulative counter value at the specified cutoff > + percentage and record the percentage for use by gcov consumers. > + Check for overflow when sum_all is multiplied by the cutoff_perc, > + and if so, do the divide first. */ > + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > + /* Overflow, do the divide first. */ > + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; > + else > + /* Otherwise multiply first to get the correct value for small > + values of sum_all. 
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_array > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,14 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + { > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; > + } > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189413) > +++ gcc/doc/invoke.texi (working copy) > @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc @gol > -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189413) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189413) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ > - 
(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189413) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > > /* This pass performs loop unrolling and peeling. We only perform these > optimizations on innermost loops (with single exception) because > @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be constrained > + based on code footprint estimates. Returns the codesize-based factor to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For size >= 2*threshold, disable unrolling/peeling (by returning > + INT_MAX). 
*/ > + > +static int > +code_size_limit_factor (struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor which in > + most loops will need to be set to a value that will reduce or eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller than > + the threshold (i.e. the loop entry is hot enough), the divisor is set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the threshold, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For applications that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189413) > +++ gcc/coverage.c (working copy) > @@ -245,6 +245,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189413) > +++ gcc/loop-iv.c (working copy) > @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189413) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2108,6 +2112,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189413) > +++ gcc/cfgloop.c (working copy) > @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set (rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN (insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189413) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189413) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189413) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
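The core of the gcov_compute_cutoff_values logic above — sort the counter values in descending order, then accumulate from the hottest down until the cutoff percentage of the total is covered — can be sketched in isolation. The helper below is a hypothetical standalone rendering (the name hot_counter_count is invented, and the counters are taken as a plain array rather than gathered from the per-object gcov structures); it mirrors the patch's comparator and its multiply-vs-divide overflow guard:

```c
#include <assert.h>
#include <stdlib.h>

typedef long long gcov_type;

/* qsort comparator: sort gcov values in descending order, as
   sort_by_reverse_gcov_value does in the patch.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_type a = *(const gcov_type *) pa;
  const gcov_type b = *(const gcov_type *) pb;
  return (b > a) - (b < a);
}

/* Return how many of the N counters in VALUES are needed to cover
   CUTOFF_PERC/1000 of their total sum (999 == 99.9%, the patch's
   default).  VALUES is sorted in place.  Hypothetical helper.  */
static unsigned
hot_counter_count (gcov_type *values, unsigned n, unsigned cutoff_perc)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c_num;

  for (i = 0; i < n; i++)
    sum_all += values[i];

  /* Mirror the patch's overflow check: if multiplying would overflow,
     divide first, trading a little precision for safety.  */
  if (sum_all * (gcov_type) cutoff_perc < sum_all)
    cum_cutoff = sum_all / 1000 * cutoff_perc;
  else
    cum_cutoff = sum_all * cutoff_perc / 1000;

  qsort (values, n, sizeof (gcov_type), cmp_desc);
  for (c_num = 0; c_num < n && cum < cum_cutoff; c_num++)
    cum += values[c_num];
  return c_num;
}
```

For counts {1000, 10, 5, 1} the 99.9% cutoff is 1014 of the total 1016, so the three hottest counters are needed; a 50.0% cutoff is already covered by the single hottest counter.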
Sign in to reply to this message.
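The three-way decision in code_size_limit_factor can likewise be restated as a pure function of the summary numbers. This is an illustrative sketch, not the patch's code: the threshold defaults (15000 and 100) come from the new params.def entries, the name limit_factor is invented, and a zero-header-count guard is added here that the patch leaves to its callers' profile checks:

```c
#include <assert.h>
#include <limits.h>

typedef long long gcov_type;

/* Divisor applied to the max unrolled/peeled insn params:
   1) footprint <= threshold: no limiting (return 1).
   2) threshold < footprint < 2*threshold: limit according to how far
      sum_all / header_count exceeds the hotness threshold.
   3) footprint >= 2*threshold: effectively disable (return INT_MAX).  */
static int
limit_factor (unsigned num_hot_counters, gcov_type sum_all,
              gcov_type header_count, unsigned size_threshold,
              gcov_type hotness_threshold)
{
  gcov_type ratio;

  if (num_hot_counters <= size_threshold || sum_all == 0)
    return 1;

  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;

  /* Guard added for this sketch only; the patch divides directly.  */
  if (header_count == 0)
    return INT_MAX;

  ratio = sum_all / header_count;
  return ratio > hotness_threshold
         ? (int) (ratio / hotness_threshold) : 1;
}
```

With the default thresholds, a program with 20000 hot counters has a loop entered 100 times out of 1,000,000 total counts limited by a divisor of 100, while one entered 50,000 times is not limited at all.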
Ping. Teresa On Wed, Jul 18, 2012 at 8:48 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ping (retrying ping in plain text mode so that it goes through properly). > > Thanks, > Teresa > > On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote: >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-11 Teresa Johnson <tejohnson@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. 
>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189413) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. 
*/ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. 
*/ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_array >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. 
*/ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,14 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + { >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189413) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc @gol >> -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled >> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189413) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189413) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >> #define GCOV_TAG_PROGRAM_SUMMARY 
((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189413) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> optimizations on innermost loops (with single exception) because >> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). 
*/ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough, the divisor is set >> + to 1 so the unroll/peel factor is not reduced. 
When it is bigger >> + than the threshold, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For applications that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189413) >> +++ gcc/coverage.c (working copy) >> @@ -245,6 +245,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189413) >> +++ gcc/loop-iv.c (working copy) >> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189413) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2108,6 +2112,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189413) >> +++ gcc/cfgloop.c (working copy) >> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189413) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. 
*/ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189413) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189413) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
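The core of the new heuristic is easier to follow outside the pass. Below is a stand-alone sketch of the code_size_limit_factor logic, with the profile summary and param lookups replaced by plain arguments (the names num_hot, size_threshold, ratio, and hot_threshold are hypothetical stand-ins for profile_info->num_hot_counters, unrollpeel-codesize-threshold, sum_all divided by the loop header count, and unrollpeel-hotness-threshold; the FP-loop and missing-profile early returns are omitted):

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the codesize-based divisor applied to the
   max-unrolled/peeled-insns params, per the patch's three cases.  */
static int
limit_factor_sketch (unsigned num_hot, unsigned size_threshold,
                     long long ratio, long long hot_threshold)
{
  /* 1) Small code footprint: do not limit unrolling/peeling.  */
  if (num_hot <= size_threshold)
    return 1;

  /* 3) At least twice the codesize limit: divisor large enough to
     force the unroll/peel factor to 0.  */
  if (num_hot >= size_threshold * 2)
    return INT_MAX;

  /* 2) In between: allow limited unrolling for very hot loops, and
     grow the divisor as the loop entry gets colder relative to the
     hotness threshold.  */
  if (ratio > hot_threshold)
    return (int) (ratio / hot_threshold);
  return 1;
}
```

With the patch's default params (15000 and 100), a summary reporting 20000 hot counters leaves hot loops (ratio at or below 100) alone but divides the unroll budget by ratio/100 for colder ones.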
http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h File gcc/gcov-io.h (right): http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396 gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of counters to reach a given Should it be made into an array accessed with pre-defined indices? http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399 gcc/gcov-io.h:399: num_hot_counters. */ Similarly for this one -- or put the above two values into a pair and declare an array of the pairs? http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c File libgcc/libgcov.c (right): http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) space needed http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) space needed. Is it better to check if (cs_ptr->sum_all * cutoff_perc /cutoff_perc != cs_ptr->sum_all)? http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718 libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) Why guard the update under !cs_prg->runs++ --> the condition is used to identify if it is a first write or a merged write. (verify with a test case to see if the merge of the summary is correct).
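The overflow point raised on libgcov.c:332 can be seen in isolation. A small sketch of the two checks, using unsigned 64-bit arithmetic where wraparound is well defined (gcov_type itself is a signed HOST_WIDEST_INT, so the real code has to be more careful than this):

```c
#include <assert.h>
#include <stdint.h>

/* The patch's original test: a wrapped product may compare smaller
   than one of its factors, but this misses many overflows.  */
static int
overflowed_weak (uint64_t sum_all, uint64_t cutoff_perc)
{
  return sum_all * cutoff_perc < sum_all;
}

/* The reviewer's suggestion: divide the product back by the
   multiplier; if we do not recover the original value, the
   multiplication wrapped.  */
static int
overflowed_divcheck (uint64_t sum_all, uint64_t cutoff_perc)
{
  return cutoff_perc != 0
         && (sum_all * cutoff_perc) / cutoff_perc != sum_all;
}
```

For example, with sum_all = 2^63 and cutoff_perc = 999 the product wraps back to exactly 2^63, so the weak check sees nothing while the divide-back check catches it.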
On Thu, Jul 26, 2012 at 11:26 AM, <davidxl@google.com> wrote: > > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h > File gcc/gcov-io.h (right): > > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396 > gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of > > counters to reach a given > Should it be made into an array accessed with pre-defined indices? > You mean provide an array of values for different thresholds? In that case both num_hot_counters and hot_cutoff_value would need to be arrays (oh, I see now that you indicate that below too). And we would possibly need another array to hold the thresholds, unless they are hard coded values. I've been thinking about extending this to provide data for more threshold values, but was holding that off for a follow-on, when I had a better idea of how the expanded summary info would be used. Honza - you were talking about using this type of summary info to guide hot/cold decisions, do you find the current single threshold data useful enough? > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399 > gcc/gcov-io.h:399: num_hot_counters. */ > Similarly for this one -- or put the above two values into a pair and > declare an array of the pairs? > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c > File libgcc/libgcov.c (right): > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 > libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > space needed > > I think I have added the space you wanted (around the '*' char) - let me know if you meant something else.
> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 > libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > space needed. Is it better to check if (cs_ptr->sum_all * cutoff_perc > /cutoff_perc != cs_ptr->sum_all)? > Good suggestion - changed. > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718 > libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters > > cs_prg->num_hot_counters) > Why guard the update under !cs_prg->runs++ --> the condition is used to > identify if it is a first write or a merged write. (verify with a test > case to see if the merge of the summary is correct). > In fact the ">" check doesn't make sense if this is the first write, which is currently how I have it guarded. The question is how to best "merge" this information. After thinking about this, I am going to move this code out from under the !cs_prg->runs++ check, so that on a merge, we use the summary with the largest num_hot_counters. I am going to do a little testing with my changes and will upload a new patch. Thanks, Teresa > > http://codereview.appspot.com/6351086/ > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
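The merge behavior described in the last paragraph — on a merged write, keep the hot-counter data from whichever summary reports more hot counters — reduces to a few lines. A sketch, where summary_sketch is a reduced, hypothetical stand-in for the two new gcov_ctr_summary fields:

```c
#include <assert.h>

/* Reduced stand-in for the new gcov_ctr_summary fields.  */
struct summary_sketch
{
  unsigned num_hot_counters;   /* counters needed to reach the cutoff */
  long long hot_cutoff_value;  /* smallest counter among those */
};

/* Merge rule under discussion: take the (num_hot_counters,
   hot_cutoff_value) pair from the summary with the larger number of
   hot counters, i.e. from the run with the larger estimated code
   footprint.  The two fields travel together.  */
static void
merge_hot_info (struct summary_sketch *dst, const struct summary_sketch *src)
{
  if (src->num_hot_counters > dst->num_hot_counters)
    {
      dst->num_hot_counters = src->num_hot_counters;
      dst->hot_cutoff_value = src->hot_cutoff_value;
    }
}
```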
Resending in plain text mode...sigh.
Teresa

On Thu, Jul 26, 2012 at 1:32 PM, Teresa Johnson <tejohnson@google.com> wrote:
>
> On Thu, Jul 26, 2012 at 11:26 AM, <davidxl@google.com> wrote:
>>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h
>> File gcc/gcov-io.h (right):
>>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396
>> gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of
>> counters to reach a given
>> Should it be made into an array accessed with pre-defined indices?
>
> You mean provide an array of values for different thresholds? In that case
> both num_hot_counters and hot_cutoff_value would need to be arrays (oh, I
> see now that you indicate that below too). And we would possibly need
> another array to hold the thresholds, unless they are hard coded values.
> I've been thinking about extending this to provide data for more threshold
> values, but was holding that off for a follow-on, when I had a better idea
> of how the expanded summary info would be used.
>
> Honza - you were talking about using this type of summary info to guide
> hot/cold decisions, do you find the current single threshold data useful
> enough?
>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399
>> gcc/gcov-io.h:399: num_hot_counters. */
>> Similarly for this one -- or put the above two values into a pair and
>> declare an array of the pairs?
>>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c
>> File libgcc/libgcov.c (right):
>>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332
>> libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
>> space needed
>
> I think I have added the space you wanted (around the '*' char) - let me
> know if you meant something else.
>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332
>> libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
>> space needed.
>> Is it better to check if (cs_ptr->sum_all * cutoff_perc
>> / cutoff_perc != cs_ptr->sum_all)?
>
> Good suggestion - changed.
>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718
>> libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters >
>> cs_prg->num_hot_counters)
>> Why guard the update under !cs_prg->runs++ --> the condition is used to
>> identify if it is a first write or a merged write. (verify with a test
>> case to see if the merge of the summary is correct).
>
> In fact the ">" check doesn't make sense if this is the first write, which
> is currently how I have it guarded. The question is how to best "merge"
> this information. After thinking about this, I am going to move this code
> out from under the !cs_prg->runs++ check, so that on a merge, we use the
> summary with the largest num_hot_counters.
>
> I am going to do a little testing with my changes and will upload a new
> patch.
>
> Thanks,
> Teresa
>
>> http://codereview.appspot.com/6351086/
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
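[Editorial note: the overflow check discussed above (libgcov.c:332) can be illustrated standalone. Because signed overflow is undefined behavior in ISO C, this sketch uses a pre-multiplication bound check rather than the patch's after-the-fact divide-back test; the function name `compute_cum_cutoff` and the fixed permille scale are invented for illustration and are not part of the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for gcov_type.  */
typedef int64_t gcov_type;

/* Compute sum_all * cutoff_perc / 1000 (cutoff_perc is in tenths of a
   percent, e.g. 999 = 99.9%).  If the multiply would overflow, divide
   first at the cost of some rounding error; otherwise multiply first,
   which is exact for small sums.  */
static gcov_type
compute_cum_cutoff (gcov_type sum_all, unsigned cutoff_perc)
{
  if (sum_all > INT64_MAX / (gcov_type) cutoff_perc)
    return sum_all / 1000 * cutoff_perc;  /* divide first */
  return sum_all * cutoff_perc / 1000;    /* multiply first */
}
```

Multiplying first matters for small sums: for sum_all = 500 and a 99.9% cutoff, divide-first would yield 0 * 999 = 0, while multiply-first yields 499.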
Updated patch based on review feedback. Bootstrapped and tested on
x86_64-unknown-linux-gnu. Ok for trunk?

Teresa

Original patch description:

Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid in
determining hot/cold decisions.

The original patch description is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating it into mainline is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov profile
files that indicates how many profiled counts compose the majority of
the program's execution time. This is used to provide an indication of
the overall code size of the program.

The new profile summary information is then used to guide codesize-based
unroll and peel decisions, to prevent those optimizations from increasing
code size too much when the program may be sensitive to icache effects.

This patch also pulls in dependent portions of google/main r187660 that
cache additional loop analysis results in the niter_desc auxiliary
information hanging off the loop structure (the optimization portions of
that change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-26  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new
	summary information.
	* gcc/doc/invoke.texi (roll much): Document new options
	-fpeel-codesize-limit and -funroll-codesize-limit, and new params
	unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary
	info.
	(struct gcov_ctr_summary): Add new summary info: num_hot_counters
	and hot_cutoff_value.
	* gcc/loop-unroll.c (code_size_limit_factor): New function.
	(decide_unroll_runtime_iterations): Call code_size_limit_factor
	to control the unroll factor, and retrieve number of branches from
	niter_desc instead of via function that walks loop.
	(decide_peel_simple, decide_unroll_stupid): Ditto.
	* gcc/coverage.c (read_counts_file): Propagate new summary info.
	* gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
	function, and add guards to enable this function to work for the
	outermost loop.
	* gcc/common.opt: Add -fpeel-codesize-limit and
	-funroll-codesize-limit.
	* gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
	(num_loop_branches): Remove.
	* gcc/cfgloop.h (struct niter_desc): Added new fields to cache
	additional loop analysis information.
	(num_loop_branches): Remove.
	(analyze_loop_insns): Declare.
	* gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
	(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
	* gcc/gcov-dump.c (tag_summary): Dump new summary info.

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189893)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Used by qsort to sort gcov values in descending order.  */
+
+static int
+sort_by_reverse_gcov_value (const void *pa, const void *pb)
+{
+  const gcov_type a = *(gcov_type const *)pa;
+  const gcov_type b = *(gcov_type const *)pb;
+
+  if (b > a)
+    return 1;
+  else if (b == a)
+    return 0;
+  else
+    return -1;
+}
+
+/* Determines the number of counters required to cover a given percentage
+   of the total sum of execution counts in the summary, which is then also
+   recorded in SUM.  */
+
+static void
+gcov_compute_cutoff_values (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, i, ctr_info_ix, index;
+  gcov_unsigned_t c_num;
+  gcov_type *value_array;
+  gcov_type cum, cum_cutoff;
+  char *cutoff_str;
+  unsigned cutoff_perc;
+
+#define CUM_CUTOFF_PERCENT_TIMES_10 999
+  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
+  if (cutoff_str && strlen (cutoff_str))
+    cutoff_perc = atoi (cutoff_str);
+  else
+    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  /* Determine the cumulative counter value at the specified cutoff
+     percentage and record the percentage for use by gcov consumers.
+     Check for overflow when sum_all is multiplied by the cutoff_perc,
+     and if so, do the divide first.  */
+  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
+    /* Overflow, do the divide first.  */
+    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
+  else
+    /* Otherwise multiply first to get the correct value for small
+       values of sum_all.  */
+    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
+
+  /* Next, walk through all the per-object structures and save each of
+     the count values in value_array.  */
+  index = 0;
+  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+        {
+          if (gi_ptr->merge[i])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          /* Sanity check that there are enough entries in value_array
+             for this function's counters.  Gracefully handle the case when
+             there are not, in case something in the profile info is
+             corrupted.  */
+          c_num = ci_ptr->num;
+          if (index + c_num > cs_ptr->num)
+            c_num = cs_ptr->num - index;
+          /* Copy over this function's counter values.  */
+          memcpy (&value_array[index], ci_ptr->values,
+                  sizeof (gcov_type) * c_num);
+          index += c_num;
+        }
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier.  */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+         sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff.  */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.  */
@@ -598,7 +713,12 @@ gcov_exit (void)
 	  if (gi_ptr->merge[t_ix])
 	    {
 	      if (!cs_prg->runs++)
-	        cs_prg->num = cs_tprg->num;
+		cs_prg->num = cs_tprg->num;
+	      if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
+		{
+		  cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
+		  cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
+		}
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 189893)
+++ gcc/doc/invoke.texi	(working copy)
@@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
@@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-linker-plugin @gol
@@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.
 
+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint.  Enabled by default with
+@option{-fprofile-use}.  Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels loops for which there is enough information that they do not
@@ -8535,6 +8543,14 @@ roll much (from profile feedback).  It also turns
 
 Enabled with @option{-fprofile-use}.
 
+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint.  Enabled by default with
+@option{-fprofile-use}.  Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.
 
+@item unrollpeel-codesize-threshold
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
+
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189893)
+++ gcc/gcov-io.c	(working copy)
@@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       gcov_write_unsigned (csum->num);
+      gcov_write_unsigned (csum->num_hot_counters);
+      gcov_write_counter (csum->hot_cutoff_value);
       gcov_write_unsigned (csum->runs);
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
@@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       csum->num = gcov_read_unsigned ();
+      csum->num_hot_counters = gcov_read_unsigned ();
+      csum->hot_cutoff_value = gcov_read_counter ();
       csum->runs = gcov_read_unsigned ();
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189893)
+++ gcc/gcov-io.h	(working copy)
@@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
 #define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
 
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS	0  /* Arc transitions.  */
@@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
 struct gcov_ctr_summary
 {
   gcov_unsigned_t num;		/* number of counters.  */
+  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
+				      percent of sum_all.  */
+  gcov_type hot_cutoff_value;	/* value of smallest counter included in
+				   num_hot_counters.  */
   gcov_unsigned_t runs;		/* number of program runs */
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189893)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 #include "dumpfile.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
@@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates.  Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
+
+static int
+code_size_limit_factor (struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program.  A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling.  */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+         the threshold (i.e. the loop entry is hot enough), the divisor is set
+         to 1 so the unroll/peel factor is not reduced.  When it is bigger
+         than the ratio, increase the divisor by the amount this ratio
+         is over the threshold, which will quickly reduce the unroll/peel
+         factor to zero as the loop's hotness reduces.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
 void
 unroll_and_peel_loops (int flags)
@@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL))
     {
@@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo
 	     "\n;; Considering unrolling loop with runtime "
 	     "computable number of iterations\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+	    / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+		  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_PEEL))
     {
@@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");
 
+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max peeled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+	  / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
 
@@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags)
 
   /* Do not simply peel loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
	fprintf (dump_file, ";; Not peeling, contains branches\n");
@@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL_ALL))
     {
@@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+	    / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+		  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags
 
   /* Do not unroll loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
	fprintf (dump_file, ";; Not unrolling, contains branches\n");
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189893)
+++ gcc/coverage.c	(working copy)
@@ -247,6 +247,14 @@ read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+	      if (summary.ctrs[ix].num_hot_counters
+		  < sum.ctrs[ix].num_hot_counters)
+		{
+		  summary.ctrs[ix].num_hot_counters
+		    = sum.ctrs[ix].num_hot_counters;
+		  summary.ctrs[ix].hot_cutoff_value
+		    = sum.ctrs[ix].hot_cutoff_value;
+		}
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
Index: gcc/loop-iv.c
===================================================================
--- gcc/loop-iv.c	(revision 189893)
+++ gcc/loop-iv.c	(working copy)
@@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop)
   /* At least desc->infinite is not always initialized by
      find_simple_loop_exit.  */
   desc = XCNEW (struct niter_desc);
-  iv_analysis_loop_init (loop);
-  find_simple_exit (loop, desc);
+  if (loop->latch != EXIT_BLOCK_PTR)
+    {
+      iv_analysis_loop_init (loop);
+      find_simple_exit (loop, desc);
+    }
+  analyze_loop_insns (loop, desc);
   loop->aux = desc;
 
   if (desc->simple_p && (desc->assumptions || desc->infinite))
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 189893)
+++ gcc/common.opt	(working copy)
@@ -1551,6 +1551,10 @@ fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers
 
+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2112,6 +2116,10 @@ funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops
 
+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: gcc/cfgloop.c
===================================================================
--- gcc/cfgloop.c	(revision 189893)
+++ gcc/cfgloop.c	(working copy)
@@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop)
   return edges;
 }
 
-/* Counts the number of conditional branches inside LOOP.  */
+/* Determine if INSN is a floating point set.  */
 
-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set (rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == SET)
+    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+	{
+	  rtx sub = XVECEXP (pat, 0, i);
+	  if (GET_CODE (sub) == SET)
+	    return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
+	}
+    }
+  return false;
+}
 
-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating the DESC.  Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */
 
+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block *body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+	{
+	  nbranch++;
+
+	  /* If this block is executed less frequently than the header (loop
+	     entry), then it is weighted based on its execution count, which
+	     will be turned into a ratio compared to the loop header below.  */
+	  if (bb->count < header_count)
+	    weighted_nbranch += bb->count;
+
+	  /* When it is executed more frequently than the header (i.e. it is
+	     in a nested inner loop), simply weight the branch the same as the
+	     header execution count, so that it will contribute 1 branch to
+	     the ratio computed below.  */
+	  else
+	    weighted_nbranch += header_count;
+	}
+
+      /* No need to iterate through the instructions below if
	 both flags have already been set.  */
+      if (has_call && has_fp)
+	continue;
+
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!INSN_P (insn))
+	    continue;
+
+	  if (!has_call)
+	    has_call = CALL_P (insn);
+
+	  if (!has_fp)
+	    has_fp = insn_has_fp_set (insn);
+	}
+    }
   free (body);
 
-  return n;
+  desc->num_branches = nbranch;
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches through the loop.  By adding
+     header_count/2 to the numerator we round to nearest with integer
+     division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+      = (weighted_nbranch + header_count/2) / header_count;
+  else
+    desc->av_num_branches = 0;
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }
 
 /* Adds basic block BB to LOOP.  */
Index: gcc/cfgloop.h
===================================================================
--- gcc/cfgloop.h	(revision 189893)
+++ gcc/cfgloop.h	(working copy)
@@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
 extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
 edge single_exit (const struct loop *);
-extern unsigned num_loop_branches (const struct loop *);
 
 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -366,7 +365,8 @@ struct rtx_iv
 };
 
 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit.  Also includes other information used primarily
+   by the loop unroller.  */
 
 struct niter_desc
 {
@@ -407,6 +407,18 @@ struct niter_desc
 
   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.  */
+  bool has_fp;
 };
 
 extern void iv_analysis_loop_init (struct loop *);
@@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
 
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);
+void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
 
 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 189893)
+++ gcc/params.def	(working copy)
@@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
	 "max-completely-peel-loop-nest-depth",
	 "The maximum depth of a loop nest we completely peel",
	 8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+   is allowed in a profile feedback compile.  This currently applies to loops
+   with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+	 "unrollpeel-codesize-threshold",
+	 "Maximum profile-based code size footprint estimate for loop "
+	 "unrolling and peeling",
+	 15000, 0, 0)
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+	 "unrollpeel-hotness-threshold",
+	 "Maximum ratio of total profiled execution count to loop entry "
+	 "block count under which most codesize limits for unrolling and "
+	 "peeling will be ignored",
+	 100, 1, 0)
 
 /* The maximum number of insns of an unswitched loop.  */
 DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189893)
+++ gcc/gcov-dump.c	(working copy)
@@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
     {
       printf ("\n");
       print_prefix (filename, 0, 0);
-      printf ("\t\tcounts=%u, runs=%u",
-	      summary.ctrs[ix].num, summary.ctrs[ix].runs);
+      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
+	      HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
+	      summary.ctrs[ix].num,
+	      summary.ctrs[ix].num_hot_counters,
+	      (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
+	      summary.ctrs[ix].runs);
 
       printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);

--
This patch is available for review at http://codereview.appspot.com/6351086
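[Editorial note: the core of gcov_compute_cutoff_values in the patch above can be modeled in miniature, stripped of the libgcov data structures. The function and type names below are invented for illustration, and overflow handling is omitted; this is a sketch of the technique, not part of the patch.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for gcov_type.  */
typedef int64_t gcov_type;

/* qsort comparator: descending order, as in the patch's
   sort_by_reverse_gcov_value.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_type a = *(const gcov_type *) pa;
  const gcov_type b = *(const gcov_type *) pb;
  return (b > a) - (b < a);
}

/* Toy model of gcov_compute_cutoff_values: given COUNTS[0..N), return how
   many of the hottest counters are needed to cover CUTOFF_PERC/1000 of the
   total, and the smallest counter value included (the hot cutoff value).
   Note: sorts COUNTS in place.  */
static unsigned
num_hot_counters (gcov_type *counts, unsigned n, unsigned cutoff_perc,
                  gcov_type *hot_cutoff_value)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c_num;

  for (i = 0; i < n; i++)
    sum_all += counts[i];
  cum_cutoff = sum_all * cutoff_perc / 1000;  /* overflow ignored here */

  qsort (counts, n, sizeof (gcov_type), cmp_desc);
  for (c_num = 0; c_num < n && cum < cum_cutoff; c_num++)
    cum += counts[c_num];

  if (hot_cutoff_value)
    *hot_cutoff_value = c_num > 0 ? counts[c_num - 1] : 0;
  return c_num;
}
```

With a sharply peaked profile such as {90, 5, 3, 1, 1}, a 90% cutoff is covered by a single counter, while a flat profile needs many; that is exactly the signal the patch records in num_hot_counters.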
On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote:
> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
> hot_cutoff_value.

You should also update the description for the data file at the head
of gcov-io.h:

        The data file contains the following records.
        data: {unit summary:object summary:program* function-data*}*
        unit: header int32:checksum
        function-data: announce_function present counts
        announce_function: header int32:ident
                int32:lineno_checksum int32:cfg_checksum
        present: header int32:present
        counts: header int64:count*
        summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
        count-summary: int32:num int32:runs int64:sum
                int64:max int64:sum_max

You've added two fields in count-summary IIUC.

Ciao!
Steven
On Fri, Jul 27, 2012 at 12:31 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote: >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. > > You should also update the description for the data file at the head > of gcov-io.h: > > The data file contains the following records. > data: {unit summary:object summary:program* function-data*}* > unit: header int32:checksum > function-data: announce_function present counts > announce_function: header int32:ident > int32:lineno_checksum int32:cfg_checksum > present: header int32:present > counts: header int64:count* > summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE > count-summary: int32:num int32:runs int64:sum > int64:max int64:sum_max > > You've added two fields in count-summary IIUC. Thanks, I will update this in my next version of the patch. Teresa > > Ciao! > Steven -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
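For reference, with the two new count-summary fields appended, the record description quoted above would read roughly as follows (the field names are taken from the ChangeLog; their exact order and placement in the committed format are assumptions):

```
summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
count-summary: int32:num int32:runs int64:sum
               int64:max int64:sum_max
               int32:num_hot_counters int64:hot_cutoff_value
```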
On Fri, Jul 27, 2012 at 3:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > Hi, >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189893) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. */ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) > > This looks like good idea to me to drive the hot/cold partitioning even if it > is not quite accurate (you have no idea how many instructions given counter is > guarding). Thanks - right, I think it will be a good approximation. > > To reduce overhead on embedded sysems, what about just doing histogram with say > 128 steps instead if dragging in qsort? This also avoids the need to > produce the copy of all counters. I like that suggestion. I'll need 1024 buckets though to get the 99.9% cutoff I am using right now (or use 128 buckets plus 1 to track what would be bucket 1024 which is roughly 99.9%). I can use something like a binary search to locate the right bucket without a linear search or divide. And I can keep track of the min value for each bucket in order to identify the correct value for hot_cutoff_value. >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. 
>> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. */ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > > To further keep embedded systems (at least a bit) happier, I guess one could do > this without generic 64bit divide operations. I guess 1000 can be bumped up to > 1024, small error is hamless here. > > Actually it may be easier to simply embedd the histogram into gcov summary > so one can control the cutoff with --param in compiler at --profile-use time. > It seems resonable to me to trade 128 values per file for the extra flexibility. Both you and David have requested more values, so I will go ahead and implement this suggestion in this patch. I can use the 128 + 1 bucket approach I described above to get the data for roughly every 1% plus the 99.9%. That should be enough granularity for the optimizations (and the smallest bucket doesn't really need to be fed back as it can be largely extrapolated from the others). This will require feeding back 128 arrays of 2 values (num_hot_counters and hot_cutoff_value). > >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. 
*/ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; > > I wonder if the loop walking all counters can't be fused into one of the other > loops we already have. Not with the histogram approach as the preceding walk (in the caller, gcov_exit) will be needed to find the min and max counter values first. >> + } >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189893) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. > > Lets handle the cutoff logic independently of the loop bits I can not approve. > The patch is OK with the change ssuggested Ok, I will split this into 2 patches, one for the gcov changes only, and a follow-on patch with my loop unroller changes. Thanks! Teresa > > Honza -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
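The histogram scheme agreed on in this exchange might look roughly like the sketch below. This is a minimal illustration, not the implementation that was eventually committed: the bucket count, helper names, and the plain equal-width bucket mapping are all assumptions, and the extra high-precision bucket for the ~99.9% cutoff discussed above is simplified away here.

```c
#include <assert.h>
#include <string.h>

#define NUM_BUCKETS 129         /* 128 buckets plus one spare so a clamped
                                   index is always in range.  */

typedef long long gcov_type_s;  /* stand-in for gcov_type */

struct hist_bucket
{
  unsigned num_counters;        /* how many counters fell in this bucket */
  gcov_type_s sum;              /* sum of their values */
  gcov_type_s min_value;        /* smallest value seen: cutoff candidate */
};

/* Map VALUE into one of 128 equal-width buckets over [MIN, MAX], found
   by a prior linear pass over the counters.  Dividing the range rather
   than the value avoids 64-bit overflow.  */
static unsigned
hist_index (gcov_type_s value, gcov_type_s min, gcov_type_s max)
{
  gcov_type_s width;
  unsigned ix;
  if (max <= min)
    return 0;
  width = (max - min) / 128 + 1;
  ix = (unsigned) ((value - min) / width);
  return ix < NUM_BUCKETS ? ix : NUM_BUCKETS - 1;
}

/* Record one counter VALUE, tracking the per-bucket minimum so the
   hot cutoff value can be recovered later without sorting.  */
static void
hist_record (struct hist_bucket *h, gcov_type_s value,
             gcov_type_s min, gcov_type_s max)
{
  struct hist_bucket *b = &h[hist_index (value, min, max)];
  if (b->num_counters == 0 || value < b->min_value)
    b->min_value = value;
  b->num_counters++;
  b->sum += value;
}

/* Walk buckets from hottest down until the cumulative sum reaches
   CUM_CUTOFF; return the number of hot counters and record the
   smallest contributing value in *CUTOFF_VALUE.  */
static unsigned
hist_cutoff (const struct hist_bucket *h, gcov_type_s cum_cutoff,
             gcov_type_s *cutoff_value)
{
  gcov_type_s cum = 0;
  unsigned num = 0;
  int ix;
  for (ix = NUM_BUCKETS - 1; ix >= 0 && cum < cum_cutoff; ix--)
    {
      if (!h[ix].num_counters)
        continue;
      cum += h[ix].sum;
      num += h[ix].num_counters;
      *cutoff_value = h[ix].min_value;
    }
  return num;
}
```

Compared with the qsort approach, this needs only two linear walks over the counters (one to find min/max, one to bucket them), no O(n) counter copy, and no 64-bit sort, at the cost of some precision near bucket boundaries.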
This patch was proposed a while back, and the new working set program summary was split out and committed separately. I would like to see if the loop unroller part of the patch can now go in. For the previous versions of the patch and related discussion, see: http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html and http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html Use the new working set information in the FDO program summary for codesize-based unroll and peel decisions, to prevent those optimizations from increasing code size too much when the program may be sensitive to icache effects. As part of this, cache additional loop analysis results in the niter_desc auxiliary information hanging off the loop structure to reduce redundant analyses during unrolling. Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? Thanks, Teresa 2012-11-19 Teresa Johnson <tejohnson@google.com> * gcc/doc/invoke.texi: Document new options -fpeel-codesize-limit and -funroll-codesize-limit, and new params unrollpeel-codesize-threshold and unrollpeel-hotness-threshold. * gcc/loop-unroll.c (code_size_limit_factor): New function. (decide_unroll_runtime_iterations): Call code_size_limit_factor to control the unroll factor, and retrieve number of branches from niter_desc instead of via function that walks loop. (decide_peel_simple, decide_unroll_stupid): Ditto. * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns function, and add guards to enable this function to work for the outermost loop. * gcc/common.opt: Add -fpeel-codesize-limit and -funroll-codesize-limit. * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. (num_loop_branches): Remove. * gcc/cfgloop.h (struct niter_desc): Add new fields to cache additional loop analysis information. (num_loop_branches): Remove. (analyze_loop_insns): Declare. * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. 
Index: doc/invoke.texi =================================================================== --- doc/invoke.texi (revision 193626) +++ doc/invoke.texi (working copy) @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays -fprofile-report @gol -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol -ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol --funit-at-a-time -funroll-all-loops -funroll-loops @gol +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol -fwhole-program -fwpa -fuse-linker-plugin @gol @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs @option{-funroll-all-loops} implies the same options as @option{-funroll-loops}. +@item -funroll-codesize-limit +@opindex funroll-codesize-limit +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. 
+ @item -fpeel-loops @opindex fpeel-loops Peels loops for which there is enough information that they do not @@ -8779,6 +8787,13 @@ roll much (from profile feedback). It also turns Enabled with @option{-fprofile-use}. +@item -fpeel-codesize-limit +@opindex fpeel-codesize-limit +Limit loop peeling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fmove-loop-invariants @opindex fmove-loop-invariants Enables the loop invariant motion pass in the RTL loop optimizer. Enabled @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s @item max-completely-peel-loop-nest-depth The maximum depth of a loop nest suitable for complete peeling. +@item unrollpeel-codesize-threshold +Maximum profile-based code size footprint estimate for loop unrolling and +peeling. + +@item unrollpeel-hotness-threshold +Maximum ratio of total execution count to loop entry block count under which +most profile-based code size estimates will be ignored for loop unrolling and +peeling. + @item max-unswitch-insns The maximum number of insns of an unswitched loop. Index: loop-unroll.c =================================================================== --- loop-unroll.c (revision 193626) +++ loop-unroll.c (working copy) @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see #include "hashtab.h" #include "recog.h" #include "target.h" +#include "gcov-io.h" #include "dumpfile.h" /* This pass performs loop unrolling and peeling. 
We only perform these @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc basic_block); static rtx get_expansion (struct var_to_expand *); +/* Determine whether and how much LOOP unrolling/peeling should be constrained + based on code footprint estimates. Returns the codesize-based factor to be + divided into the max instructions in an unrolled or peeled loop: + 1) For size <= threshold, do not limit (by returning 1). + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or + unrolled instructions according to loop hotness. + 3) For size >= 2*threshold, disable unrolling/peeling (by returning + INT_MAX). */ + +static int +code_size_limit_factor (struct loop *loop) +{ + unsigned size_threshold, num_hot_counters; + struct niter_desc *desc = get_simple_loop_desc (loop); + gcov_type sum_to_header_ratio; + int hotness_ratio_threshold; + int limit_factor; + gcov_working_set_t *ws; + + ws = find_working_set (999); + if (!ws) + return 1; + num_hot_counters = ws->num_counters; + + /* First check if the application has a large codesize footprint. + This is estimated from FDO profile summary information for the + program, where the num_hot_counters indicates the number of hottest + counters (blocks) that compose most of the execution time of + the program. A large value would indicate a large flat execution + profile where icache misses may be a concern. */ + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); + if (!profile_info + || num_hot_counters <= size_threshold + || !profile_info->sum_all) + return 1; + + /* Next, exclude some loops where unrolling/peeling may be more + important to overall performance. */ + + /* Ignore FP loops, which are more likely to benefit heavily from + unrolling. */ + if (desc->has_fp) + return 1; + + /* Next, set the value of the codesize-based unroll factor divisor which in + most loops will need to be set to a value that will reduce or eliminate + unrolling/peeling. 
*/ + if (num_hot_counters < size_threshold * 2 + && loop->header->count > 0) + { + /* For applications that are less than twice the codesize limit, allow + limited unrolling for very hot loops. */ + sum_to_header_ratio = profile_info->sum_all / loop->header->count; + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); + /* When the profile count sum to loop entry header ratio is smaller than + the threshold (i.e. the loop entry is hot enough), the divisor is set + to 1 so the unroll/peel factor is not reduced. When it is bigger + than the threshold, increase the divisor by the amount this ratio + is over the threshold, which will quickly reduce the unroll/peel + factor to zero as the loop's hotness reduces. */ + if (sum_to_header_ratio > hotness_ratio_threshold) + { + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; + gcc_assert (limit_factor >= 1); + } + else + limit_factor = 1; + } + else + /* For applications that are at least twice the codesize limit, set + the divisor to a large value that will force the unroll factor to 0. */ + limit_factor = INT_MAX; + + return limit_factor; +} + /* Unroll and/or peel (depending on FLAGS) LOOPS. */ void unroll_and_peel_loops (int flags) @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL)) @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo "\n;; Considering unrolling loop with runtime " "computable number of iterations\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. 
*/ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) { unsigned npeel; double_int iterations; + struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_PEEL)) { @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) if (dump_file) fprintf (dump_file, "\n;; Considering simply peeling loop\n"); + if (flag_peel_codesize_limit) + { + /* Determine whether to limit code size growth from peeling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max peeled insns by divisor %d\n", limit_factor); + } + } + /* npeel = number of iterations to peel. 
*/ - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor + / loop->ninsns; if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) } /* Do not simply peel loops with branches inside -- it increases number - of mispredicts. + of mispredicts. Exception is when we do have profile and we however have good chance to peel proper number of iterations loop will iterate in practice. TODO: this heuristic needs tunning; while for complette unrolling @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) peeling it is not the case. Also a function call inside loop is also branch from branch prediction POV (and probably better reason to not unroll/peel). */ - if (num_loop_branches (loop) > 1 + desc = get_simple_loop_desc (loop); + if (desc->num_branches > 1 && profile_status != PROFILE_READ) { if (dump_file) @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL_ALL)) @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags if (dump_file) fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags of mispredicts. TODO: this heuristic needs tunning; call inside the loop body is also relatively good reason to not unroll. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not unrolling, contains branches\n"); Index: loop-iv.c =================================================================== --- loop-iv.c (revision 193626) +++ loop-iv.c (working copy) @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) /* At least desc->infinite is not always initialized by find_simple_loop_exit. 
*/ desc = XCNEW (struct niter_desc); - iv_analysis_loop_init (loop); - find_simple_exit (loop, desc); + if (loop->latch != EXIT_BLOCK_PTR) + { + iv_analysis_loop_init (loop); + find_simple_exit (loop, desc); + } + analyze_loop_insns (loop, desc); loop->aux = desc; if (desc->simple_p && (desc->assumptions || desc->infinite)) Index: common.opt =================================================================== --- common.opt (revision 193626) +++ common.opt (working copy) @@ -1573,6 +1573,10 @@ fpcc-struct-return Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) Return small aggregates in memory, not registers +fpeel-codesize-limit +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop peeling under profile estimates of large code footprint + fpeel-loops Common Report Var(flag_peel_loops) Optimization Perform loop peeling @@ -2138,6 +2142,10 @@ funroll-all-loops Common Report Var(flag_unroll_all_loops) Optimization Perform loop unrolling for all loops +funroll-codesize-limit +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop unrolling under profile estimates of large code footprint + ; Nonzero means that loop optimizer may assume that the induction variables ; that control loops do not overflow and that the loops with nontrivial ; exit condition are not infinite Index: cfgloop.c =================================================================== --- cfgloop.c (revision 193626) +++ cfgloop.c (working copy) @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) return edges; } -/* Counts the number of conditional branches inside LOOP. */ +/* Determine if INSN is a floating point set. 
*/ -unsigned -num_loop_branches (const struct loop *loop) +static bool +insn_has_fp_set(rtx insn) { - unsigned i, n; - basic_block * body; + int i; + rtx pat = PATTERN(insn); + if (GET_CODE (pat) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); + else if (GET_CODE (pat) == PARALLEL) + { + for (i = 0; i < XVECLEN (pat, 0); i++) + { + rtx sub = XVECEXP (pat, 0, i); + if (GET_CODE (sub) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); + } + } + return false; +} - gcc_assert (loop->latch != EXIT_BLOCK_PTR); +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts + the number of conditional branch instructions, calls and fp instructions, + as well as the average number of branches executed per iteration. */ +void +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) +{ + unsigned i, nbranch; + gcov_type weighted_nbranch; + bool has_call, has_fp; + basic_block * body, bb; + rtx insn; + gcov_type header_count = loop->header->count; + + nbranch = weighted_nbranch = 0; + has_call = has_fp = false; + body = get_loop_body (loop); - n = 0; for (i = 0; i < loop->num_nodes; i++) - if (EDGE_COUNT (body[i]->succs) >= 2) - n++; + { + bb = body[i]; + + if (EDGE_COUNT (bb->succs) >= 2) + { + nbranch++; + + /* If this block is executed less frequently than the header (loop + entry), then it is weighted based on its execution count, which + will be turned into a ratio compared to the loop header below. */ + if (bb->count < header_count) + weighted_nbranch += bb->count; + + /* When it is executed more frequently than the header (i.e. it is + in a nested inner loop), simply weight the branch the same as the + header execution count, so that it will contribute 1 branch to + the ratio computed below. */ + else + weighted_nbranch += header_count; + } + + /* No need to iterate through the instructions below if + both flags have already been set. 
*/ + if (has_call && has_fp) + continue; + + FOR_BB_INSNS (bb, insn) + { + if (!INSN_P (insn)) + continue; + + if (!has_call) + has_call = CALL_P (insn); + + if (!has_fp) + has_fp = insn_has_fp_set (insn); + } + } free (body); - return n; + desc->num_branches = nbranch; + /* Now divide the weights computed above by the loop header execution count, + to compute the average number of branches through the loop. By adding + header_count/2 to the numerator we round to nearest with integer + division. */ + if (header_count != 0) + desc->av_num_branches + = (weighted_nbranch + header_count/2) / header_count; + else + desc->av_num_branches = 0; + desc->has_call = has_call; + desc->has_fp = has_fp; } /* Adds basic block BB to LOOP. */ Index: cfgloop.h =================================================================== --- cfgloop.h (revision 193626) +++ cfgloop.h (working copy) @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order extern vec<edge> get_loop_exit_edges (const struct loop *); extern edge single_exit (const struct loop *); extern edge single_likely_exit (struct loop *loop); -extern unsigned num_loop_branches (const struct loop *); extern edge loop_preheader_edge (const struct loop *); extern edge loop_latch_edge (const struct loop *); @@ -365,7 +364,8 @@ struct rtx_iv }; /* The description of an exit from the loop and of the number of iterations - till we take the exit. */ + till we take the exit. Also includes other information used primarily + by the loop unroller. */ struct niter_desc { @@ -403,6 +403,18 @@ struct niter_desc /* The number of iterations of the loop. */ rtx niter_expr; + + /* The number of branches in the loop. */ + unsigned num_branches; + + /* The number of executed branches per iteration. */ + unsigned av_num_branches; + + /* Whether the loop contains a call instruction. */ + bool has_call; + + /* Whether the loop contains fp instructions. 
*/ + bool has_fp; }; extern void iv_analysis_loop_init (struct loop *); @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); extern struct niter_desc *get_simple_loop_desc (struct loop *loop); extern void free_simple_loop_desc (struct loop *loop); +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); static inline struct niter_desc * simple_loop_desc (struct loop *loop) Index: params.def =================================================================== --- params.def (revision 193626) +++ params.def (working copy) @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, "max-completely-peel-loop-nest-depth", "The maximum depth of a loop nest we completely peel", 8, 0, 0) +/* The maximum code size estimate under which loop unrolling and peeling + * is allowed in a profile feedback compile. This currently applies to loops + * with non-constant iteration counts and no floating point computations. */ +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, + "unrollpeel-codesize-threshold", + "Maximum profile-based code size footprint estimate for loop unrolling " + "and peeling", + 15000, 0, 0) +/* The maximum ratio of total profiled execution counts to loop entry block + count that must be exceeded to ignore most code size limits when unrolling + and peeling. */ +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, + "unrollpeel-hotness-threshold", + "Maximum ratio of total profiled execution count to loop entry " + "block count under which most codesize limits for unrolling and " + "peeling will be ignored", + 100, 1, 0) /* The maximum number of insns of an unswitched loop. */ DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: > Updated patch based on review feedback. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? 
> > Teresa > > > Original patch description: > > Ports some patches related to improving FDO program summary information > and using it to guide loop unrolling from google branches to mainline. > The patch is enhanced to add additional summary information to aid > in determining hot/cold decisions. > > The original patch description is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and further discussion about incorporating onto mainline is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 > > Honza, can you take a look to see if this patch meets your needs? > > Full description: > > This patch adds new program summary information to the gcov > profile files that indicate how many profiled counts compose > the majority of the program's execution time. This is used to > provide an indication of the overall code size of the program. > > The new profile summary information is then used to guide > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > This patch also pulls in dependent portions of google/main r187660 that cache > additional loop analysis results in the niter_desc auxiliary information > hanging off the loop structure (the optimization portions of that > change are not included here, and have an outstanding review request > for mainline). > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-07-26 Teresa Johnson <tejohnson@google.com> > > * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. > (gcov_compute_cutoff_values): Ditto. > (gcov_exit): Call gcov_compute_cutoff_values and merge new summary > information. > * gcc/doc/invoke.texi (roll much): Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. 
> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. > (gcov_read_summary): Read new summary info. > * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. > (struct gcov_ctr_summary): Add new summary info: num_hot_counters and > hot_cutoff_value. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/coverage.c (read_counts_file): Propagate new summary info. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. > (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > * gcc/gcov-dump.c (tag_summary): Dump new summary info. > > Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 189893) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Used by qsort to sort gcov values in descending order. 
*/ > + > +static int > +sort_by_reverse_gcov_value (const void *pa, const void *pb) > +{ > + const gcov_type a = *(gcov_type const *)pa; > + const gcov_type b = *(gcov_type const *)pb; > + > + if (b > a) > + return 1; > + else if (b == a) > + return 0; > + else > + return -1; > +} > + > +/* Determines the number of counters required to cover a given percentage > + of the total sum of execution counts in the summary, which is then also > + recorded in SUM. */ > + > +static void > +gcov_compute_cutoff_values (struct gcov_summary *sum) > +{ > + struct gcov_info *gi_ptr; > + const struct gcov_fn_info *gfi_ptr; > + const struct gcov_ctr_info *ci_ptr; > + struct gcov_ctr_summary *cs_ptr; > + unsigned t_ix, f_ix, i, ctr_info_ix, index; > + gcov_unsigned_t c_num; > + gcov_type *value_array; > + gcov_type cum, cum_cutoff; > + char *cutoff_str; > + unsigned cutoff_perc; > + > +#define CUM_CUTOFF_PERCENT_TIMES_10 999 > + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); > + if (cutoff_str && strlen (cutoff_str)) > + cutoff_perc = atoi (cutoff_str); > + else > + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; > + > + /* This currently only applies to arc counters. */ > + t_ix = GCOV_COUNTER_ARCS; > + > + /* First check if there are any counts recorded for this counter. */ > + cs_ptr = &(sum->ctrs[t_ix]); > + if (!cs_ptr->num) > + return; > + > + /* Determine the cumulative counter value at the specified cutoff > + percentage and record the percentage for use by gcov consumers. > + Check for overflow when sum_all is multiplied by the cutoff_perc, > + and if so, do the divide first. */ > + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) > + /* Overflow, do the divide first. */ > + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; > + else > + /* Otherwise multiply first to get the correct value for small > + values of sum_all. 
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_array > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,12 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189893) > +++ gcc/doc/invoke.texi (working copy) > @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189893) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189893) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ > - 
(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189893) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > #include "dumpfile.h" > > /* This pass performs loop unrolling and peeling. We only perform these > @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be constrained > + based on code footprint estimates. Returns the codesize-based factor to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For size >= 2*threshold, disable unrolling/peeling (by returning > + INT_MAX). 
*/ > > +static int > +code_size_limit_factor(struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor which in > + most loops will need to be set to a value that will reduce or eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller than > + the threshold (i.e. the loop entry is hot enough), the divisor is set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the threshold, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For applications that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189893) > +++ gcc/coverage.c (working copy) > @@ -247,6 +247,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189893) > +++ gcc/loop-iv.c (working copy) > @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189893) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2112,6 +2116,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189893) > +++ gcc/cfgloop.c (working copy) > @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189893) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189893) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189893) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping. Teresa On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > This patch was proposed awhile back, and the new working set program > summary was split out and committed separately. I would like to > see if the loop unroller part of the patch can now go in. For the > previous versions of the patch and related discussion, see: > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > Use the new working set information in the FDO program summary for > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > As part of this, cache additional loop analysis results in the niter_desc > auxiliary information hanging off the loop structure to reduce > redundant analyses during unrolling. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-11-19 Teresa Johnson <tejohnson@google.com> > > * gcc/doc/invoke.texi: Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. 
> (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > > Index: doc/invoke.texi > =================================================================== > --- doc/invoke.texi (revision 193626) > +++ doc/invoke.texi (working copy) > @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops > -fpredictive-commoning @gol > -fprefetch-loop-arrays -fprofile-report @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. > -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. 
Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. Enabled > @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: loop-unroll.c > =================================================================== > --- loop-unroll.c (revision 193626) > +++ loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. 
If not see
> #include "hashtab.h"
> #include "recog.h"
> #include "target.h"
> +#include "gcov-io.h"
> #include "dumpfile.h"
>
> /* This pass performs loop unrolling and peeling. We only perform these
> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc
> basic_block);
> static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> + based on code footprint estimates. Returns the codesize-based factor to be
> + divided into the max instructions in an unrolled or peeled loop:
> + 1) For size <= threshold, do not limit (by returning 1).
> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> + unrolled instructions according to loop hotness.
> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning
> + INT_MAX). */
> +
> +static int
> +code_size_limit_factor(struct loop *loop)
> +{
> + unsigned size_threshold, num_hot_counters;
> + struct niter_desc *desc = get_simple_loop_desc (loop);
> + gcov_type sum_to_header_ratio;
> + int hotness_ratio_threshold;
> + int limit_factor;
> + gcov_working_set_t *ws;
> +
> + ws = find_working_set(999);
> + if (! ws)
> + return 1;
> + num_hot_counters = ws->num_counters;
> +
> + /* First check if the application has a large codesize footprint.
> + This is estimated from FDO profile summary information for the
> + program, where the num_hot_counters indicates the number of hottest
> + counters (blocks) that compose most of the execution time of
> + the program. A large value would indicate a large flat execution
> + profile where icache misses may be a concern. */
> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> + if (!profile_info
> + || num_hot_counters <= size_threshold
> + || !profile_info->sum_all)
> + return 1;
> +
> + /* Next, exclude some loops where unrolling/peeling may be more
> + important to overall performance. 
*/
> +
> + /* Ignore FP loops, which are more likely to benefit heavily from
> + unrolling. */
> + if (desc->has_fp)
> + return 1;
> +
> + /* Next, set the value of the codesize-based unroll factor divisor which in
> + most loops will need to be set to a value that will reduce or eliminate
> + unrolling/peeling. */
> + if (num_hot_counters < size_threshold * 2
> + && loop->header->count > 0)
> + {
> + /* For applications that are less than twice the codesize limit, allow
> + limited unrolling for very hot loops. */
> + sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> + hotness_ratio_threshold = PARAM_VALUE
> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> + /* When the profile count sum to loop entry header ratio is smaller than
> + the threshold (i.e. the loop entry is hot enough), the divisor is set
> + to 1 so the unroll/peel factor is not reduced. When it is bigger
> + than the threshold, increase the divisor by the amount this ratio
> + is over the threshold, which will quickly reduce the unroll/peel
> + factor to zero as the loop's hotness reduces. */
> + if (sum_to_header_ratio > hotness_ratio_threshold)
> + {
> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> + gcc_assert (limit_factor >= 1);
> + }
> + else
> + limit_factor = 1;
> + }
> + else
> + /* For applications that are at least twice the codesize limit, set
> + the divisor to a large value that will force the unroll factor to 0. */
> + limit_factor = INT_MAX;
> +
> + return limit_factor;
> +}
> +
> /* Unroll and/or peel (depending on FLAGS) LOOPS. 
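[Editorial aside: the three regimes of code_size_limit_factor can be exercised in isolation with a simplified sketch. Plain integers stand in for the GCC profile structures, and all names below are hypothetical, not the GCC API.]

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the divisor selection: NUM_HOT is the estimated hot-code
   working set, SIZE_THRESHOLD the codesize param, and SUM_ALL/HEADER_COUNT
   give the loop-entry hotness ratio.  */
static int
limit_factor_sketch (unsigned num_hot, unsigned size_threshold,
                     long long sum_all, long long header_count,
                     long long hotness_threshold)
{
  long long ratio;

  /* 1) Small footprint: do not limit unrolling/peeling.  */
  if (num_hot <= size_threshold)
    return 1;

  /* 3) At least twice the threshold (or no usable entry count): return a
     divisor large enough to force the unroll/peel factor to 0.  */
  if (num_hot >= size_threshold * 2 || header_count <= 0)
    return INT_MAX;

  /* 2) In between: a hot loop entry (small ratio) keeps divisor 1; colder
     entries scale the divisor up in proportion to the ratio.  */
  ratio = sum_all / header_count;
  if (ratio > hotness_threshold)
    return (int) (ratio / hotness_threshold);
  return 1;
}
```

With the defaults from params.def (codesize threshold 15000, hotness threshold 100), a program with 20000 hot counters whose loop entry sees 1/500th of all counts would get a divisor of 5.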
*/ > void > unroll_and_peel_loops (int flags) > @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL)) > @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / > loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > double_int iterations; > + struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) > } > > /* Do not simply peel loops with branches inside -- it increases number > - of mispredicts. > + of mispredicts. > Exception is when we do have profile and we however have good chance > to peel proper number of iterations loop will iterate in practice. 
> TODO: this heuristic needs tunning; while for complette unrolling > @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) > peeling it is not the case. Also a function call inside loop is > also branch from branch prediction POV (and probably better reason > to not unroll/peel). */ > - if (num_loop_branches (loop) > 1 > + desc = get_simple_loop_desc (loop); > + if (desc->num_branches > 1 > && profile_status != PROFILE_READ) > { > if (dump_file) > @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL_ALL)) > @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags > of mispredicts. > TODO: this heuristic needs tunning; call inside the loop body > is also relatively good reason to not unroll. */ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: loop-iv.c > =================================================================== > --- loop-iv.c (revision 193626) > +++ loop-iv.c (working copy) > @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: common.opt > =================================================================== > --- common.opt (revision 193626) > +++ common.opt (working copy) > @@ -1573,6 +1573,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large > code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2138,6 +2142,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of > large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: cfgloop.c > =================================================================== > --- cfgloop.c (revision 193626) > +++ cfgloop.c (working copy) > @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
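[Editorial aside: the weighting and rounded division that produce av_num_branches can be checked with a tiny self-contained sketch. Plain counter arrays stand in for basic blocks; the names are illustrative only.]

```c
#include <assert.h>

/* Sketch of the av_num_branches computation: each branching block
   contributes min (its count, header count) to the weighted sum, which is
   then divided by the header count, rounded to nearest by adding half the
   divisor first.  */
static unsigned
av_branches_sketch (const long long *branch_counts, unsigned n,
                    long long header_count)
{
  long long weighted = 0;
  unsigned i;

  for (i = 0; i < n; i++)
    weighted += branch_counts[i] < header_count
                ? branch_counts[i] : header_count;

  if (header_count == 0)
    return 0;
  /* Round-to-nearest integer division.  */
  return (unsigned) ((weighted + header_count / 2) / header_count);
}
```

For example, with a header count of 100 and branching blocks executed 100, 50, and 300 times, the weighted sum is 100 + 50 + 100 = 250, giving an average of 3 branches per iteration after rounding.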
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: cfgloop.h > =================================================================== > --- cfgloop.h (revision 193626) > +++ cfgloop.h (working copy) > @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order > extern vec<edge> get_loop_exit_edges (const struct loop *); > extern edge single_exit (const struct loop *); > extern edge single_likely_exit (struct loop *loop); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -365,7 +364,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -403,6 +403,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: params.def > =================================================================== > --- params.def (revision 193626) > +++ params.def (working copy) > @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop > unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > > On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >> Updated patch based on review feedback. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Teresa >> >> >> Original patch description: >> >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-26 Teresa Johnson <tejohnson@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. 
>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189893) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. 
*/ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. 
*/ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. 
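[Editorial aside: the sort-and-accumulate core of gcov_compute_cutoff_values can be sketched standalone. Plain long long stands in for gcov_type, and the overflow guard on the multiply is elided.]

```c
#include <assert.h>
#include <stdlib.h>

/* Descending-order comparator, as in sort_by_reverse_gcov_value.  */
static int
desc_cmp (const void *pa, const void *pb)
{
  long long a = *(const long long *) pa;
  long long b = *(const long long *) pb;
  return (b > a) - (b < a);
}

/* Count how many of the hottest values are needed to cover
   CUTOFF_PERC/1000 of the total.  Sorts VALUES in place.  */
static unsigned
num_hot_counters_sketch (long long *values, unsigned n, unsigned cutoff_perc)
{
  long long sum = 0, cum = 0, cum_cutoff;
  unsigned i, c;

  for (i = 0; i < n; i++)
    sum += values[i];
  /* Per-mille cutoff, e.g. 999 for 99.9% of sum_all.  */
  cum_cutoff = sum * cutoff_perc / 1000;

  /* Hottest first, then accumulate until the cutoff is reached.  */
  qsort (values, n, sizeof (long long), desc_cmp);
  for (c = 0; c < n && cum < cum_cutoff; c++)
    cum += values[c];
  return c;
}
```

For counters {1, 50, 30, 10, 9} and a 900-per-mille cutoff, the three hottest counters (50 + 30 + 10 = 90 of 100) suffice, so num_hot_counters would be 3.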
*/ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,12 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189893) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled >> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189893) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189893) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >> #define GCOV_TAG_PROGRAM_SUMMARY 
((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189893) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). 
*/ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough, the divisor is set >> + to 1 so the unroll/peel factor is not reduced. 
When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For appliations that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189893) >> +++ gcc/coverage.c (working copy) >> @@ -247,6 +247,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189893) >> +++ gcc/loop-iv.c (working copy) >> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189893) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2112,6 +2116,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189893) >> +++ gcc/cfgloop.c (working copy) >> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189893) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. 
*/ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189893) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189893) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > This patch was proposed awhile back, and the new working set program > > summary was split out and committed separately. I would like to > > see if the loop unroller part of the patch can now go in. For the > > previous versions of the patch and related discussion, see: > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > and > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > Use the new working set information in the FDO program summary for > > codesize based unroll and peel decisions, to prevent those > > optimizations from increasing code size too much when the > > program may be sensitive to icache effects. > > > > As part of this, cache additional loop analysis results in the niter_desc > > auxiliary information hanging off the loop structure to reduce > > redundant analyses during unrolling. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? I guess this is the missing piece that would fix the Firefox lto/pgo code-size issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 -- Markus
> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > > Ping. > > Teresa > > > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > > This patch was proposed awhile back, and the new working set program > > > summary was split out and committed separately. I would like to > > > see if the loop unroller part of the patch can now go in. For the > > > previous versions of the patch and related discussion, see: > > > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > > and > > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > > > Use the new working set information in the FDO program summary for > > > codesize based unroll and peel decisions, to prevent those > > > optimizations from increasing code size too much when the > > > program may be sensitive to icache effects. > > > > > > As part of this, cache additional loop analysis results in the niter_desc > > > auxiliary information hanging off the loop structure to reduce > > > redundant analyses during unrolling. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > I guess this is the missing piece that would fix the Firefox lto/pgo > code-size issue: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 It won't help in this case - the problem is that too much of code is identified as hot, similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Honza > > -- > Markus
On Tue, Dec 11, 2012 at 10:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: >> > Ping. >> > Teresa >> > >> > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> > > This patch was proposed awhile back, and the new working set program >> > > summary was split out and committed separately. I would like to >> > > see if the loop unroller part of the patch can now go in. For the >> > > previous versions of the patch and related discussion, see: >> > > >> > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> > > and >> > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> > > >> > > Use the new working set information in the FDO program summary for >> > > codesize based unroll and peel decisions, to prevent those >> > > optimizations from increasing code size too much when the >> > > program may be sensitive to icache effects. >> > > >> > > As part of this, cache additional loop analysis results in the niter_desc >> > > auxiliary information hanging off the loop structure to reduce >> > > redundant analyses during unrolling. >> > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> I guess this is the missing piece that would fix the Firefox lto/pgo >> code-size issue: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 > > It won't help in this case - the problem is that too much of code is identified as hot, > similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here > short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Right, I doubt this patch would help. I looked through the bugzilla log and it is most likely inlining increases. What was the size of the gcc lto/pgo binary before the change to use the histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a very large increase, ~25%. 
Markus, could you attach to the bug one of the gcda files so that I can see the program summary and figure out how far off the old hot bb threshold is from the new histogram-based one? Also, it would be good to see the -fdump-ipa-inline dumps before and after the regression (if necessary, the before one could be from 4_7). I'll update the bug with the same, so that we can track the discussion there. Thanks, Teresa > > Honza >> >> -- >> Markus -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping. Teresa On Tue, Dec 11, 2012 at 9:13 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> This patch was proposed awhile back, and the new working set program >> summary was split out and committed separately. I would like to >> see if the loop unroller part of the patch can now go in. For the >> previous versions of the patch and related discussion, see: >> >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and >> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> >> Use the new working set information in the FDO program summary for >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> As part of this, cache additional loop analysis results in the niter_desc >> auxiliary information hanging off the loop structure to reduce >> redundant analyses during unrolling. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-11-19 Teresa Johnson <tejohnson@google.com> >> >> * gcc/doc/invoke.texi: Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. 
>> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> >> Index: doc/invoke.texi >> =================================================================== >> --- doc/invoke.texi (revision 193626) >> +++ doc/invoke.texi (working copy) >> @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays -fprofile-report @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. >> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. 
>> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. 
>> >> Index: loop-unroll.c >> =================================================================== >> --- loop-unroll.c (revision 193626) >> +++ loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). */ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold, num_hot_counters; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + gcov_working_set_t *ws; >> + >> + ws = find_working_set(999); >> + if (! ws) >> + return 1; >> + num_hot_counters = ws->num_counters; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. 
*/ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (num_hot_counters < size_threshold * 2 >> + && loop->header->count > 0) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough), the divisor is set >> + to 1 so the unroll/peel factor is not reduced. When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For appliations that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. 
*/ >> void >> unroll_and_peel_loops (int flags) >> @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL)) >> @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / >> loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> double_int iterations; >> + struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) >> } >> >> /* Do not simply peel loops with branches inside -- it increases number >> - of mispredicts. >> + of mispredicts. >> Exception is when we do have profile and we however have good chance >> to peel proper number of iterations loop will iterate in practice. 
>> TODO: this heuristic needs tunning; while for complette unrolling >> @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) >> peeling it is not the case. Also a function call inside loop is >> also branch from branch prediction POV (and probably better reason >> to not unroll/peel). */ >> - if (num_loop_branches (loop) > 1 >> + desc = get_simple_loop_desc (loop); >> + if (desc->num_branches > 1 >> && profile_status != PROFILE_READ) >> { >> if (dump_file) >> @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL_ALL)) >> @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> of mispredicts. >> TODO: this heuristic needs tunning; call inside the loop body >> is also relatively good reason to not unroll. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: loop-iv.c >> =================================================================== >> --- loop-iv.c (revision 193626) >> +++ loop-iv.c (working copy) >> @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: common.opt >> =================================================================== >> --- common.opt (revision 193626) >> +++ common.opt (working copy) >> @@ -1573,6 +1573,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large >> code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2138,6 +2142,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of >> large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: cfgloop.c >> =================================================================== >> --- cfgloop.c (revision 193626) >> +++ cfgloop.c (working copy) >> @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: cfgloop.h >> =================================================================== >> --- cfgloop.h (revision 193626) >> +++ cfgloop.h (working copy) >> @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order >> extern vec<edge> get_loop_exit_edges (const struct loop *); >> extern edge single_exit (const struct loop *); >> extern edge single_likely_exit (struct loop *loop); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -365,7 +364,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -403,6 +403,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. 
*/ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. */ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: params.def >> =================================================================== >> --- params.def (revision 193626) >> +++ params.def (working copy) >> @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop >> unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> >> On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >>> Updated patch based on review feedback. >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Teresa >>> >>> >>> Original patch description: >>> >>> Ports some patches related to improving FDO program summary information >>> and using it to guide loop unrolling from google branches to mainline. >>> The patch is enhanced to add additional summary information to aid >>> in determining hot/cold decisions. >>> >>> The original patch description is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >>> and further discussion about incorporating onto mainline is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >>> >>> Honza, can you take a look to see if this patch meets your needs? >>> >>> Full description: >>> >>> This patch adds new program summary information to the gcov >>> profile files that indicate how many profiled counts compose >>> the majority of the program's execution time. This is used to >>> provide an indication of the overall code size of the program. >>> >>> The new profile summary information is then used to guide >>> codesize based unroll and peel decisions, to prevent those >>> optimizations from increasing code size too much when the >>> program may be sensitive to icache effects. 
>>> >>> This patch also pulls in dependent portions of google/main r187660 that cache >>> additional loop analysis results in the niter_desc auxiliary information >>> hanging off the loop structure (the optimization portions of that >>> change are not included here, and have an outstanding review request >>> for mainline). >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Thanks, >>> Teresa >>> >>> 2012-07-26 Teresa Johnson <tejohnson@google.com> >>> >>> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >>> (gcov_compute_cutoff_values): Ditto. >>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >>> information. >>> * gcc/doc/invoke.texi (roll much): Document new options >>> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >>> codesize-hotness-threshold and unrollpeel-hotness-threshold. >>> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >>> (gcov_read_summary): Read new summary info. >>> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >>> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >>> hot_cutoff_value. >>> * gcc/loop-unroll.c (code_size_limit_factor): New function. >>> (decide_unroll_runtime_iterations): Call code_size_limit_factor >>> to control the unroll factor, and retrieve number of branches from >>> niter_desc instead of via function that walks loop. >>> (decide_peel_simple, decide_unroll_stupid): Ditto. >>> * gcc/coverage.c (read_counts_file): Propagate new summary info. >>> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >>> function, and add guards to enable this function to work for the >>> outermost loop. >>> * gcc/common.opt: Add -fpeel-codesize-limit and >>> -funroll-codesize-limit. >>> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >>> (num_loop_branches): Remove. 
>>> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >>> additional loop analysis information. >>> (num_loop_branches): Remove. >>> (analyze_loop_insns): Declare. >>> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >>> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >>> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >>> >>> Index: libgcc/libgcov.c >>> =================================================================== >>> --- libgcc/libgcov.c (revision 189893) >>> +++ libgcc/libgcov.c (working copy) >>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >>> return 1; >>> } >>> >>> +/* Used by qsort to sort gcov values in descending order. */ >>> + >>> +static int >>> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >>> +{ >>> + const gcov_type a = *(gcov_type const *)pa; >>> + const gcov_type b = *(gcov_type const *)pb; >>> + >>> + if (b > a) >>> + return 1; >>> + else if (b == a) >>> + return 0; >>> + else >>> + return -1; >>> +} >>> + >>> +/* Determines the number of counters required to cover a given percentage >>> + of the total sum of execution counts in the summary, which is then also >>> + recorded in SUM. */ >>> + >>> +static void >>> +gcov_compute_cutoff_values (struct gcov_summary *sum) >>> +{ >>> + struct gcov_info *gi_ptr; >>> + const struct gcov_fn_info *gfi_ptr; >>> + const struct gcov_ctr_info *ci_ptr; >>> + struct gcov_ctr_summary *cs_ptr; >>> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >>> + gcov_unsigned_t c_num; >>> + gcov_type *value_array; >>> + gcov_type cum, cum_cutoff; >>> + char *cutoff_str; >>> + unsigned cutoff_perc; >>> + >>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >>> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >>> + if (cutoff_str && strlen (cutoff_str)) >>> + cutoff_perc = atoi (cutoff_str); >>> + else >>> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >>> + >>> + /* This currently only applies to arc counters. 
*/ >>> + t_ix = GCOV_COUNTER_ARCS; >>> + >>> + /* First check if there are any counts recorded for this counter. */ >>> + cs_ptr = &(sum->ctrs[t_ix]); >>> + if (!cs_ptr->num) >>> + return; >>> + >>> + /* Determine the cumulative counter value at the specified cutoff >>> + percentage and record the percentage for use by gcov consumers. >>> + Check for overflow when sum_all is multiplied by the cutoff_perc, >>> + and if so, do the divide first. */ >>> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >>> + /* Overflow, do the divide first. */ >>> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >>> + else >>> + /* Otherwise multiply first to get the correct value for small >>> + values of sum_all. */ >>> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >>> + >>> + /* Next, walk through all the per-object structures and save each of >>> + the count values in value_array. */ >>> + index = 0; >>> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >>> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >>> + { >>> + if (!gi_ptr->merge[t_ix]) >>> + continue; >>> + >>> + /* Find the appropriate index into the gcov_ctr_info array >>> + for the counter we are currently working on based on the >>> + existence of the merge function pointer for this object. */ >>> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >>> + { >>> + if (gi_ptr->merge[i]) >>> + ctr_info_ix++; >>> + } >>> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >>> + { >>> + gfi_ptr = gi_ptr->functions[f_ix]; >>> + >>> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >>> + continue; >>> + >>> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >>> + /* Sanity check that there are enough entries in value_array >>> + for this function's counters. Gracefully handle the case when >>> + there are not, in case something in the profile info is >>> + corrupted. 
*/ >>> + c_num = ci_ptr->num; >>> + if (index + c_num > cs_ptr->num) >>> + c_num = cs_ptr->num - index; >>> + /* Copy over this function's counter values. */ >>> + memcpy (&value_array[index], ci_ptr->values, >>> + sizeof (gcov_type) * c_num); >>> + index += c_num; >>> + } >>> + } >>> + >>> + /* Sort all the counter values by descending value and finally >>> + accumulate the values from hottest on down until reaching >>> + the cutoff value computed earlier. */ >>> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >>> + sort_by_reverse_gcov_value); >>> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >>> + cum += value_array[c_num]; >>> + /* Record the number of counters required to reach the cutoff value, >>> + as well as the counter value that caused cum to reach the cutoff. */ >>> + cs_ptr->num_hot_counters = c_num; >>> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >>> + free (value_array); >>> +} >>> + >>> /* Dump the coverage counts. We merge with existing counts when >>> possible, to avoid growing the .da files ad infinitum. We use this >>> program's checksum to make sure we only accumulate whole program >>> @@ -347,6 +461,7 @@ gcov_exit (void) >>> } >>> } >>> } >>> + gcov_compute_cutoff_values (&this_prg); >>> >>> { >>> /* Check if the level of dirs to strip off specified. 
*/ >>> @@ -598,7 +713,12 @@ gcov_exit (void) >>> if (gi_ptr->merge[t_ix]) >>> { >>> if (!cs_prg->runs++) >>> - cs_prg->num = cs_tprg->num; >>> + cs_prg->num = cs_tprg->num; >>> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >>> + { >>> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >>> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >>> + } >>> cs_prg->sum_all += cs_tprg->sum_all; >>> if (cs_prg->run_max < cs_tprg->run_max) >>> cs_prg->run_max = cs_tprg->run_max; >>> Index: gcc/doc/invoke.texi >>> =================================================================== >>> --- gcc/doc/invoke.texi (revision 189893) >>> +++ gcc/doc/invoke.texi (working copy) >>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >>> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >>> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >>> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >>> -fprefetch-loop-arrays @gol >>> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >>> -fprofile-generate=@var{path} @gol >>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. >>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >>> -ftree-switch-conversion -ftree-tail-merge @gol >>> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >>> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >>> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >>> -fwhole-program -fwpa -fuse-linker-plugin @gol >>> @@ -8527,6 +8527,14 @@ the loop is entered. 
This usually makes programs >>> @option{-funroll-all-loops} implies the same options as >>> @option{-funroll-loops}. >>> >>> +@item -funroll-codesize-limit >>> +@opindex funroll-codesize-limit >>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fpeel-loops >>> @opindex fpeel-loops >>> Peels loops for which there is enough information that they do not >>> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >>> >>> Enabled with @option{-fprofile-use}. >>> >>> +@item -fpeel-codesize-limit >>> +@opindex fpeel-codesize-limit >>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fmove-loop-invariants >>> @opindex fmove-loop-invariants >>> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >>> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >>> @item max-completely-peel-loop-nest-depth >>> The maximum depth of a loop nest suitable for complete peeling. >>> >>> +@item unrollpeel-codesize-threshold >>> +Maximum profile-based code size footprint estimate for loop unrolling and >>> +peeling. >>> + >>> +@item unrollpeel-hotness-threshold >>> +Maximum ratio of total execution count to loop entry block count under which >>> +most profile-based code size estimates will be ignored for loop unrolling and >>> +peeling. 
>>> + >>> @item max-unswitch-insns >>> The maximum number of insns of an unswitched loop. >>> >>> Index: gcc/gcov-io.c >>> =================================================================== >>> --- gcc/gcov-io.c (revision 189893) >>> +++ gcc/gcov-io.c (working copy) >>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> gcov_write_unsigned (csum->num); >>> + gcov_write_unsigned (csum->num_hot_counters); >>> + gcov_write_counter (csum->hot_cutoff_value); >>> gcov_write_unsigned (csum->runs); >>> gcov_write_counter (csum->sum_all); >>> gcov_write_counter (csum->run_max); >>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> csum->num = gcov_read_unsigned (); >>> + csum->num_hot_counters = gcov_read_unsigned (); >>> + csum->hot_cutoff_value = gcov_read_counter (); >>> csum->runs = gcov_read_unsigned (); >>> csum->sum_all = gcov_read_counter (); >>> csum->run_max = gcov_read_counter (); >>> Index: gcc/gcov-io.h >>> =================================================================== >>> --- gcc/gcov-io.h (revision 189893) >>> +++ gcc/gcov-io.h (working copy) >>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >>> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >>> #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) >>> #define GCOV_TAG_SUMMARY_LENGTH \ >>> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >>> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >>> >>> /* Counters that are collected. */ >>> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >>> struct gcov_ctr_summary >>> { >>> gcov_unsigned_t num; /* number of counters. */ >>> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >>> + percent of sum_all. 
*/ >>> + gcov_type hot_cutoff_value; /* value of smallest counter included in >>> + num_hot_counters. */ >>> gcov_unsigned_t runs; /* number of program runs */ >>> gcov_type sum_all; /* sum of all counters accumulated. */ >>> gcov_type run_max; /* maximum value on a single run. */ >>> Index: gcc/loop-unroll.c >>> =================================================================== >>> --- gcc/loop-unroll.c (revision 189893) >>> +++ gcc/loop-unroll.c (working copy) >>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >>> #include "hashtab.h" >>> #include "recog.h" >>> #include "target.h" >>> +#include "gcov-io.h" >>> #include "dumpfile.h" >>> >>> /* This pass performs loop unrolling and peeling. We only perform these >>> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >>> basic_block); >>> static rtx get_expansion (struct var_to_expand *); >>> >>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >>> + based on code footprint estimates. Returns the codesize-based factor to be >>> + divided into the max instructions in an unrolled or peeled loop: >>> + 1) For size <= threshold, do not limit (by returning 1). >>> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >>> + unrolled instructions according to loop hotness. >>> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning >>> + INT_MAX). */ >>> + >>> +static int >>> +code_size_limit_factor(struct loop *loop) >>> +{ >>> + unsigned size_threshold; >>> + struct niter_desc *desc = get_simple_loop_desc (loop); >>> + gcov_type sum_to_header_ratio; >>> + int hotness_ratio_threshold; >>> + int limit_factor; >>> + >>> + /* First check if the application has a large codesize footprint. >>> + This is estimated from FDO profile summary information for the >>> + program, where the num_hot_counters indicates the number of hottest >>> + counters (blocks) that compose most of the execution time of >>> + the program. 
A large value would indicate a large flat execution >>> + profile where icache misses may be a concern. */ >>> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >>> + if (!profile_info >>> + || profile_info->num_hot_counters <= size_threshold >>> + || !profile_info->sum_all) >>> + return 1; >>> + >>> + /* Next, exclude some loops where unrolling/peeling may be more >>> + important to overall performance. */ >>> + >>> + /* Ignore FP loops, which are more likely to benefit heavily from >>> + unrolling. */ >>> + if (desc->has_fp) >>> + return 1; >>> + >>> + /* Next, set the value of the codesize-based unroll factor divisor which in >>> + most loops will need to be set to a value that will reduce or eliminate >>> + unrolling/peeling. */ >>> + if (profile_info->num_hot_counters < size_threshold * 2) >>> + { >>> + /* For applications that are less than twice the codesize limit, allow >>> + limited unrolling for very hot loops. */ >>> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >>> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >>> + /* When the profile count sum to loop entry header ratio is smaller than >>> + the threshold (i.e. the loop entry is hot enough), the divisor is set >>> + to 1 so the unroll/peel factor is not reduced. When it is bigger >>> + than the threshold, increase the divisor by the amount this ratio >>> + is over the threshold, which will quickly reduce the unroll/peel >>> + factor to zero as the loop's hotness reduces. */ >>> + if (sum_to_header_ratio > hotness_ratio_threshold) >>> + { >>> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >>> + gcc_assert (limit_factor >= 1); >>> + } >>> + else >>> + limit_factor = 1; >>> + } >>> + else >>> + /* For applications that are at least twice the codesize limit, set >>> + the divisor to a large value that will force the unroll factor to 0. 
*/ >>> + limit_factor = INT_MAX; >>> + >>> + return limit_factor; >>> +} >>> + >>> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >>> void >>> unroll_and_peel_loops (int flags) >>> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL)) >>> { >>> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >>> "\n;; Considering unrolling loop with runtime " >>> "computable number of iterations\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> { >>> unsigned npeel; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_PEEL)) >>> { >>> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >>> >>> + if (flag_peel_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from peeling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max peeled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* npeel = number of iterations to peel. */ >>> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >>> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >>> + / loop->ninsns; >>> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >>> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >>> >>> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> >>> /* Do not simply peel loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not peeling, contains branches\n"); >>> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL_ALL)) >>> { >>> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av >>> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> >>> /* Do not unroll loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> Index: gcc/coverage.c >>> =================================================================== >>> --- gcc/coverage.c (revision 189893) >>> +++ gcc/coverage.c (working copy) >>> @@ -247,6 +247,14 @@ read_counts_file (void) >>> gcov_read_summary (&sum); >>> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >>> { >>> + if (summary.ctrs[ix].num_hot_counters >>> + < sum.ctrs[ix].num_hot_counters) >>> + { >>> + summary.ctrs[ix].num_hot_counters >>> + = sum.ctrs[ix].num_hot_counters; >>> + summary.ctrs[ix].hot_cutoff_value >>> + = sum.ctrs[ix].hot_cutoff_value; >>> + } >>> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >>> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >>> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >>> Index: gcc/loop-iv.c >>> =================================================================== >>> --- gcc/loop-iv.c (revision 189893) >>> +++ gcc/loop-iv.c (working copy) >>> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >>> /* At least desc->infinite is not always initialized by >>> find_simple_loop_exit. 
*/ >>> desc = XCNEW (struct niter_desc); >>> - iv_analysis_loop_init (loop); >>> - find_simple_exit (loop, desc); >>> + if (loop->latch != EXIT_BLOCK_PTR) >>> + { >>> + iv_analysis_loop_init (loop); >>> + find_simple_exit (loop, desc); >>> + } >>> + analyze_loop_insns (loop, desc); >>> loop->aux = desc; >>> >>> if (desc->simple_p && (desc->assumptions || desc->infinite)) >>> Index: gcc/common.opt >>> =================================================================== >>> --- gcc/common.opt (revision 189893) >>> +++ gcc/common.opt (working copy) >>> @@ -1551,6 +1551,10 @@ fpcc-struct-return >>> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >>> Return small aggregates in memory, not registers >>> >>> +fpeel-codesize-limit >>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >>> + >>> fpeel-loops >>> Common Report Var(flag_peel_loops) Optimization >>> Perform loop peeling >>> @@ -2112,6 +2116,10 @@ funroll-all-loops >>> Common Report Var(flag_unroll_all_loops) Optimization >>> Perform loop unrolling for all loops >>> >>> +funroll-codesize-limit >>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >>> + >>> ; Nonzero means that loop optimizer may assume that the induction variables >>> ; that control loops do not overflow and that the loops with nontrivial >>> ; exit condition are not infinite >>> Index: gcc/cfgloop.c >>> =================================================================== >>> --- gcc/cfgloop.c (revision 189893) >>> +++ gcc/cfgloop.c (working copy) >>> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >>> return edges; >>> } >>> >>> -/* Counts the number of conditional branches inside LOOP. */ >>> +/* Determine if INSN is a floating point set. 
*/ >>> >>> -unsigned >>> -num_loop_branches (const struct loop *loop) >>> +static bool >>> +insn_has_fp_set (rtx insn) >>> { >>> - unsigned i, n; >>> - basic_block * body; >>> + int i; >>> + rtx pat = PATTERN (insn); >>> + if (GET_CODE (pat) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >>> + else if (GET_CODE (pat) == PARALLEL) >>> + { >>> + for (i = 0; i < XVECLEN (pat, 0); i++) >>> + { >>> + rtx sub = XVECEXP (pat, 0, i); >>> + if (GET_CODE (sub) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >>> + } >>> + } >>> + return false; >>> +} >>> >>> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >>> + the number of conditional branch instructions, calls and fp instructions, >>> + as well as the average number of branches executed per iteration. */ >>> >>> +void >>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >>> +{ >>> + unsigned i, nbranch; >>> + gcov_type weighted_nbranch; >>> + bool has_call, has_fp; >>> + basic_block *body, bb; >>> + rtx insn; >>> + gcov_type header_count = loop->header->count; >>> + >>> + nbranch = weighted_nbranch = 0; >>> + has_call = has_fp = false; >>> + >>> body = get_loop_body (loop); >>> - n = 0; >>> for (i = 0; i < loop->num_nodes; i++) >>> - if (EDGE_COUNT (body[i]->succs) >= 2) >>> - n++; >>> + { >>> + bb = body[i]; >>> + >>> + if (EDGE_COUNT (bb->succs) >= 2) >>> + { >>> + nbranch++; >>> + >>> + /* If this block is executed less frequently than the header (loop >>> + entry), then it is weighted based on its execution count, which >>> + will be turned into a ratio compared to the loop header below. */ >>> + if (bb->count < header_count) >>> + weighted_nbranch += bb->count; >>> + >>> + /* When it is executed more frequently than the header (i.e.
it is >>> + in a nested inner loop), simply weight the branch the same as the >>> + header execution count, so that it will contribute 1 branch to >>> + the ratio computed below. */ >>> + else >>> + weighted_nbranch += header_count; >>> + } >>> + >>> + /* No need to iterate through the instructions below if >>> + both flags have already been set. */ >>> + if (has_call && has_fp) >>> + continue; >>> + >>> + FOR_BB_INSNS (bb, insn) >>> + { >>> + if (!INSN_P (insn)) >>> + continue; >>> + >>> + if (!has_call) >>> + has_call = CALL_P (insn); >>> + >>> + if (!has_fp) >>> + has_fp = insn_has_fp_set (insn); >>> + } >>> + } >>> free (body); >>> >>> - return n; >>> + desc->num_branches = nbranch; >>> + /* Now divide the weights computed above by the loop header execution count, >>> + to compute the average number of branches through the loop. By adding >>> + header_count/2 to the numerator we round to nearest with integer >>> + division. */ >>> + if (header_count != 0) >>> + desc->av_num_branches >>> + = (weighted_nbranch + header_count/2) / header_count; >>> + else >>> + desc->av_num_branches = 0; >>> + desc->has_call = has_call; >>> + desc->has_fp = has_fp; >>> } >>> >>> /* Adds basic block BB to LOOP. */ >>> Index: gcc/cfgloop.h >>> =================================================================== >>> --- gcc/cfgloop.h (revision 189893) >>> +++ gcc/cfgloop.h (working copy) >>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >>> >>> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >>> edge single_exit (const struct loop *); >>> -extern unsigned num_loop_branches (const struct loop *); >>> >>> extern edge loop_preheader_edge (const struct loop *); >>> extern edge loop_latch_edge (const struct loop *); >>> @@ -366,7 +365,8 @@ struct rtx_iv >>> }; >>> >>> /* The description of an exit from the loop and of the number of iterations >>> - till we take the exit. */ >>> + till we take the exit. 
Also includes other information used primarily >>> + by the loop unroller. */ >>> >>> struct niter_desc >>> { >>> @@ -407,6 +407,18 @@ struct niter_desc >>> >>> /* The number of iterations of the loop. */ >>> rtx niter_expr; >>> + >>> + /* The number of branches in the loop. */ >>> + unsigned num_branches; >>> + >>> + /* The number of executed branches per iteration. */ >>> + unsigned av_num_branches; >>> + >>> + /* Whether the loop contains a call instruction. */ >>> + bool has_call; >>> + >>> + /* Whether the loop contains fp instructions. */ >>> + bool has_fp; >>> }; >>> >>> extern void iv_analysis_loop_init (struct loop *); >>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >>> >>> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >>> extern void free_simple_loop_desc (struct loop *loop); >>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >>> >>> static inline struct niter_desc * >>> simple_loop_desc (struct loop *loop) >>> Index: gcc/params.def >>> =================================================================== >>> --- gcc/params.def (revision 189893) >>> +++ gcc/params.def (working copy) >>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >>> "max-completely-peel-loop-nest-depth", >>> "The maximum depth of a loop nest we completely peel", >>> 8, 0, 0) >>> +/* The maximum code size estimate under which loop unrolling and peeling >>> + * is allowed in a profile feedback compile. This currently applies to loops >>> + * with non-constant iteration counts and no floating point computations. */ >>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >>> + "unrollpeel-codesize-threshold", >>> + "Maximum profile-based code size footprint estimate for loop unrolling " >>> + "and peeling", >>> + 15000, 0, 0) >>> +/* The maximum ratio of total profiled execution counts to loop entry block >>> + count that must be exceeded to ignore most code size limits when unrolling >>> + and peeling. 
*/ >>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >>> + "unrollpeel-hotness-threshold", >>> + "Maximum ratio of total profiled execution count to loop entry " >>> + "block count under which most codesize limits for unrolling and " >>> + "peeling will be ignored", >>> + 100, 1, 0) >>> >>> /* The maximum number of insns of an unswitched loop. */ >>> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >>> Index: gcc/gcov-dump.c >>> =================================================================== >>> --- gcc/gcov-dump.c (revision 189893) >>> +++ gcc/gcov-dump.c (working copy) >>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >>> { >>> printf ("\n"); >>> print_prefix (filename, 0, 0); >>> - printf ("\t\tcounts=%u, runs=%u", >>> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >>> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >>> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >>> + summary.ctrs[ix].num, >>> + summary.ctrs[ix].num_hot_counters, >>> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >>> + summary.ctrs[ix].runs); >>> >>> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >>> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >>> >>> -- >>> This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
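As a reviewer's note, the decision logic in code_size_limit_factor can be modeled standalone. The sketch below is illustrative only: the function name and simplified signature are hypothetical, and the hard-coded constants stand in for the patch's default params (unrollpeel-codesize-threshold = 15000, unrollpeel-hotness-threshold = 100); it is not the actual GCC implementation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical standalone model of the patch's divisor heuristic.
   num_hot_counters: counters covering the majority of execution time.
   sum_all: total profiled execution count across the program.
   header_count: execution count of the loop entry (header) block.  */
static int
limit_factor_model (long num_hot_counters, long sum_all, long header_count)
{
  const long size_threshold = 15000;          /* codesize threshold default */
  const long hotness_ratio_threshold = 100;   /* hotness threshold default */

  /* Small footprint or no usable profile: do not limit unrolling.  */
  if (num_hot_counters <= size_threshold || sum_all == 0 || header_count == 0)
    return 1;

  /* At least twice the codesize limit: force the unroll factor to 0.  */
  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;

  /* Between one and two times the limit: scale the divisor by how cold
     the loop entry is relative to the whole program's profile.  */
  long sum_to_header_ratio = sum_all / header_count;
  if (sum_to_header_ratio > hotness_ratio_threshold)
    return (int) (sum_to_header_ratio / hotness_ratio_threshold);
  return 1;
}
```

The returned divisor then shrinks the insn budgets, e.g. a divisor of 5 cuts PARAM_MAX_UNROLLED_INSNS to a fifth in the nunroll computation, and INT_MAX drives nunroll to 0.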
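One detail in analyze_loop_insns worth calling out: av_num_branches is computed with round-to-nearest integer division by adding header_count/2 to the numerator before dividing. A minimal illustration of that idiom (nothing here is GCC code):

```c
#include <assert.h>

/* Round-to-nearest integer division for non-negative operands:
   adding half the divisor before truncating division rounds the
   quotient to the nearest integer (halves round up).  */
static long
rounded_div (long num, long den)
{
  return (num + den / 2) / den;
}
```

In the patch, num is the execution-count-weighted branch total and den is the loop header count, so the result approximates branches executed per iteration.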