Created: 12 years, 7 months ago by tejohnson
Modified: 10 years, 5 months ago
CC: hubicka_ucw.cz, gcc-patches_gcc.gnu.org
Base URL: svn+ssh://gcc.gnu.org/svn/gcc/trunk/
Visibility: Public

Patch Set 1
Patch Set 2: [PATCH] New fdo summary-based icache sensitive unrolling
Total comments: 5
Total messages: 17
Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid
in determining hot/cold decisions.

The original patch description is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating it onto mainline is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov
profile files that indicates how many profiled counts compose
the majority of the program's execution time. This is used to
provide an indication of the overall code size of the program.

The new profile summary information is then used to guide
codesize-based unroll and peel decisions, to prevent those
optimizations from increasing code size too much when the
program may be sensitive to icache effects.

This patch also pulls in dependent portions of google/main r187660 that
cache additional loop analysis results in the niter_desc auxiliary
information hanging off the loop structure (the optimization portions of
that change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-11  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new summary
	information.
	* gcc/doc/invoke.texi: Document new options
	-fpeel-codesize-limit and -funroll-codesize-limit, and new params
	unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary
	info.
(struct gcov_ctr_summary): Add new summary info: num_hot_counters and hot_cutoff_value. * gcc/loop-unroll.c (code_size_limit_factor): New function. (decide_unroll_runtime_iterations): Call code_size_limit_factor to control the unroll factor, and retrieve number of branches from niter_desc instead of via function that walks loop. (decide_peel_simple, decide_unroll_stupid): Ditto. * gcc/coverage.c (read_counts_file): Propagate new summary info. * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns function, and add guards to enable this function to work for the outermost loop. * gcc/common.opt: Add -fpeel-codesize-limit and -funroll-codesize-limit. * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. (num_loop_branches): Remove. * gcc/cfgloop.h (struct niter_desc): Added new fields to cache additional loop analysis information. (num_loop_branches): Remove. (analyze_loop_insns): Declare. * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. * gcc/gcov-dump.c (tag_summary): Dump new summary info. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 189413) +++ libgcc/libgcov.c (working copy) @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Used by qsort to sort gcov values in descending order. */ + +static int +sort_by_reverse_gcov_value (const void *pa, const void *pb) +{ + const gcov_type a = *(gcov_type const *)pa; + const gcov_type b = *(gcov_type const *)pb; + + if (b > a) + return 1; + else if (b == a) + return 0; + else + return -1; +} + +/* Determines the number of counters required to cover a given percentage + of the total sum of execution counts in the summary, which is then also + recorded in SUM. 
*/ + +static void +gcov_compute_cutoff_values (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, i, ctr_info_ix, index; + gcov_unsigned_t c_num; + gcov_type *value_array; + gcov_type cum, cum_cutoff; + char *cutoff_str; + unsigned cutoff_perc; + +#define CUM_CUTOFF_PERCENT_TIMES_10 999 + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); + if (cutoff_str && strlen (cutoff_str)) + cutoff_perc = atoi (cutoff_str); + else + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. */ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + /* Determine the cumulative counter value at the specified cutoff + percentage and record the percentage for use by gcov consumers. + Check for overflow when sum_all is multiplied by the cutoff_perc, + and if so, do the divide first. */ + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) + /* Overflow, do the divide first. */ + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; + else + /* Otherwise multiply first to get the correct value for small + values of sum_all. */ + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; + + /* Next, walk through all the per-object structures and save each of + the count values in value_array. */ + index = 0; + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. 
 */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+	{
+	  if (gi_ptr->merge[i])
+	    ctr_info_ix++;
+	}
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+	{
+	  gfi_ptr = gi_ptr->functions[f_ix];
+
+	  if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+	    continue;
+
+	  ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+	  /* Sanity check that there are enough entries in value_array
+	     for this function's counters. Gracefully handle the case when
+	     there are not, in case something in the profile info is
+	     corrupted. */
+	  c_num = ci_ptr->num;
+	  if (index + c_num > cs_ptr->num)
+	    c_num = cs_ptr->num - index;
+	  /* Copy over this function's counter values. */
+	  memcpy (&value_array[index], ci_ptr->values,
+		  sizeof (gcov_type) * c_num);
+	  index += c_num;
+	}
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier. */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+	 sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff. */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.
*/ @@ -598,7 +713,14 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + { + cs_prg->num = cs_tprg->num; + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) + { + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; + } + } cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi (revision 189413) +++ gcc/doc/invoke.texi (working copy) @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays @gol -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. -ftree-reassoc @gol -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol --funit-at-a-time -funroll-all-loops -funroll-loops @gol +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol -fwhole-program -fwpa -fuse-linker-plugin @gol @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs @option{-funroll-all-loops} implies the same options as @option{-funroll-loops}. 
+@item -funroll-codesize-limit +@opindex funroll-codesize-limit +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fpeel-loops @opindex fpeel-loops Peels loops for which there is enough information that they do not @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns Enabled with @option{-fprofile-use}. +@item -fpeel-codesize-limit +@opindex fpeel-codesize-limit +Limit loop peeling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fmove-loop-invariants @opindex fmove-loop-invariants Enables the loop invariant motion pass in the RTL loop optimizer. Enabled @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s @item max-completely-peel-loop-nest-depth The maximum depth of a loop nest suitable for complete peeling. +@item unrollpeel-codesize-threshold +Maximum profile-based code size footprint estimate for loop unrolling and +peeling. + +@item unrollpeel-hotness-threshold +Maximum ratio of total execution count to loop entry block count under which +most profile-based code size estimates will be ignored for loop unrolling and +peeling. + @item max-unswitch-insns The maximum number of insns of an unswitched loop. 
Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 189413) +++ gcc/gcov-io.c (working copy) @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { gcov_write_unsigned (csum->num); + gcov_write_unsigned (csum->num_hot_counters); + gcov_write_counter (csum->hot_cutoff_value); gcov_write_unsigned (csum->runs); gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { csum->num = gcov_read_unsigned (); + csum->num_hot_counters = gcov_read_unsigned (); + csum->hot_cutoff_value = gcov_read_counter (); csum->runs = gcov_read_unsigned (); csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 189413) +++ gcc/gcov-io.h (working copy) @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) #define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; struct gcov_ctr_summary { gcov_unsigned_t num; /* number of counters. */ + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given + percent of sum_all. */ + gcov_type hot_cutoff_value; /* value of smallest counter included in + num_hot_counters. */ gcov_unsigned_t runs; /* number of program runs */ gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. 
 */
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189413)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
    optimizations on innermost loops (with single exception) because
@@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates.  Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX). */
+
+static int
+code_size_limit_factor (struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program.  A large value would indicate a large flat execution
+     profile where icache misses may be a concern. */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance. */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling. */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling. */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+	 limited unrolling for very hot loops. */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+	 the threshold (i.e. the loop entry is hot enough), the divisor is set
+	 to 1 so the unroll/peel factor is not reduced.  When it is bigger
+	 than the ratio, increase the divisor by the amount this ratio
+	 is over the threshold, which will quickly reduce the unroll/peel
+	 factor to zero as the loop's hotness reduces. */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+	{
+	  limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+	  gcc_assert (limit_factor >= 1);
+	}
+      else
+	limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0. */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.
*/ void unroll_and_peel_loops (int flags) @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_UNROLL)) { @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo "\n;; Considering unrolling loop with runtime " "computable number of iterations\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) { unsigned npeel; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_PEEL)) { @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) if (dump_file) fprintf (dump_file, "\n;; Considering simply peeling loop\n"); + if (flag_peel_codesize_limit) + { + /* Determine whether to limit code size growth from peeling, + using FDO profile summary information that gives an + estimated number of executed blocks. 
*/ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max peeled insns by divisor %d\n", limit_factor); + } + } + /* npeel = number of iterations to peel. */ - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor + / loop->ninsns; if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) /* Do not simply peel loops with branches inside -- it increases number of mispredicts. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not peeling, contains branches\n"); @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_UNROLL_ALL)) { @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags if (dump_file) fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags /* Do not unroll loops with branches inside -- it increases number of mispredicts. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not unrolling, contains branches\n"); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 189413) +++ gcc/coverage.c (working copy) @@ -245,6 +245,14 @@ read_counts_file (void) gcov_read_summary (&sum); for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) { + if (summary.ctrs[ix].num_hot_counters + < sum.ctrs[ix].num_hot_counters) + { + summary.ctrs[ix].num_hot_counters + = sum.ctrs[ix].num_hot_counters; + summary.ctrs[ix].hot_cutoff_value + = sum.ctrs[ix].hot_cutoff_value; + } summary.ctrs[ix].runs += sum.ctrs[ix].runs; summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) Index: gcc/loop-iv.c =================================================================== --- gcc/loop-iv.c (revision 189413) +++ gcc/loop-iv.c (working copy) @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) /* At least desc->infinite is not always initialized by find_simple_loop_exit. 
*/ desc = XCNEW (struct niter_desc); - iv_analysis_loop_init (loop); - find_simple_exit (loop, desc); + if (loop->latch != EXIT_BLOCK_PTR) + { + iv_analysis_loop_init (loop); + find_simple_exit (loop, desc); + } + analyze_loop_insns (loop, desc); loop->aux = desc; if (desc->simple_p && (desc->assumptions || desc->infinite)) Index: gcc/common.opt =================================================================== --- gcc/common.opt (revision 189413) +++ gcc/common.opt (working copy) @@ -1551,6 +1551,10 @@ fpcc-struct-return Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) Return small aggregates in memory, not registers +fpeel-codesize-limit +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop peeling under profile estimates of large code footprint + fpeel-loops Common Report Var(flag_peel_loops) Optimization Perform loop peeling @@ -2108,6 +2112,10 @@ funroll-all-loops Common Report Var(flag_unroll_all_loops) Optimization Perform loop unrolling for all loops +funroll-codesize-limit +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop unrolling under profile estimates of large code footprint + ; Nonzero means that loop optimizer may assume that the induction variables ; that control loops do not overflow and that the loops with nontrivial ; exit condition are not infinite Index: gcc/cfgloop.c =================================================================== --- gcc/cfgloop.c (revision 189413) +++ gcc/cfgloop.c (working copy) @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) return edges; } -/* Counts the number of conditional branches inside LOOP. */ +/* Determine if INSN is a floating point set. 
*/ -unsigned -num_loop_branches (const struct loop *loop) +static bool +insn_has_fp_set(rtx insn) { - unsigned i, n; - basic_block * body; + int i; + rtx pat = PATTERN(insn); + if (GET_CODE (pat) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); + else if (GET_CODE (pat) == PARALLEL) + { + for (i = 0; i < XVECLEN (pat, 0); i++) + { + rtx sub = XVECEXP (pat, 0, i); + if (GET_CODE (sub) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); + } + } + return false; +} - gcc_assert (loop->latch != EXIT_BLOCK_PTR); +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts + the number of conditional branch instructions, calls and fp instructions, + as well as the average number of branches executed per iteration. */ +void +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) +{ + unsigned i, nbranch; + gcov_type weighted_nbranch; + bool has_call, has_fp; + basic_block * body, bb; + rtx insn; + gcov_type header_count = loop->header->count; + + nbranch = weighted_nbranch = 0; + has_call = has_fp = false; + body = get_loop_body (loop); - n = 0; for (i = 0; i < loop->num_nodes; i++) - if (EDGE_COUNT (body[i]->succs) >= 2) - n++; + { + bb = body[i]; + + if (EDGE_COUNT (bb->succs) >= 2) + { + nbranch++; + + /* If this block is executed less frequently than the header (loop + entry), then it is weighted based on its execution count, which + will be turned into a ratio compared to the loop header below. */ + if (bb->count < header_count) + weighted_nbranch += bb->count; + + /* When it is executed more frequently than the header (i.e. it is + in a nested inner loop), simply weight the branch the same as the + header execution count, so that it will contribute 1 branch to + the ratio computed below. */ + else + weighted_nbranch += header_count; + } + + /* No need to iterate through the instructions below if + both flags have already been set. 
*/ + if (has_call && has_fp) + continue; + + FOR_BB_INSNS (bb, insn) + { + if (!INSN_P (insn)) + continue; + + if (!has_call) + has_call = CALL_P (insn); + + if (!has_fp) + has_fp = insn_has_fp_set (insn); + } + } free (body); - return n; + desc->num_branches = nbranch; + /* Now divide the weights computed above by the loop header execution count, + to compute the average number of branches through the loop. By adding + header_count/2 to the numerator we round to nearest with integer + division. */ + if (header_count != 0) + desc->av_num_branches + = (weighted_nbranch + header_count/2) / header_count; + else + desc->av_num_branches = 0; + desc->has_call = has_call; + desc->has_fp = has_fp; } /* Adds basic block BB to LOOP. */ Index: gcc/cfgloop.h =================================================================== --- gcc/cfgloop.h (revision 189413) +++ gcc/cfgloop.h (working copy) @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); edge single_exit (const struct loop *); -extern unsigned num_loop_branches (const struct loop *); extern edge loop_preheader_edge (const struct loop *); extern edge loop_latch_edge (const struct loop *); @@ -366,7 +365,8 @@ struct rtx_iv }; /* The description of an exit from the loop and of the number of iterations - till we take the exit. */ + till we take the exit. Also includes other information used primarily + by the loop unroller. */ struct niter_desc { @@ -407,6 +407,18 @@ struct niter_desc /* The number of iterations of the loop. */ rtx niter_expr; + + /* The number of branches in the loop. */ + unsigned num_branches; + + /* The number of executed branches per iteration. */ + unsigned av_num_branches; + + /* Whether the loop contains a call instruction. */ + bool has_call; + + /* Whether the loop contains fp instructions. 
*/ + bool has_fp; }; extern void iv_analysis_loop_init (struct loop *); @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); extern struct niter_desc *get_simple_loop_desc (struct loop *loop); extern void free_simple_loop_desc (struct loop *loop); +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); static inline struct niter_desc * simple_loop_desc (struct loop *loop) Index: gcc/params.def =================================================================== --- gcc/params.def (revision 189413) +++ gcc/params.def (working copy) @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, "max-completely-peel-loop-nest-depth", "The maximum depth of a loop nest we completely peel", 8, 0, 0) +/* The maximum code size estimate under which loop unrolling and peeling + * is allowed in a profile feedback compile. This currently applies to loops + * with non-constant iteration counts and no floating point computations. */ +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, + "unrollpeel-codesize-threshold", + "Maximum profile-based code size footprint estimate for loop unrolling " + "and peeling", + 15000, 0, 0) +/* The maximum ratio of total profiled execution counts to loop entry block + count that must be exceeded to ignore most code size limits when unrolling + and peeling. */ +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, + "unrollpeel-hotness-threshold", + "Maximum ratio of total profiled execution count to loop entry " + "block count under which most codesize limits for unrolling and " + "peeling will be ignored", + 100, 1, 0) /* The maximum number of insns of an unswitched loop. 
*/ DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 189413) +++ gcc/gcov-dump.c (working copy) @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED { printf ("\n"); print_prefix (filename, 0, 0); - printf ("\t\tcounts=%u, runs=%u", - summary.ctrs[ix].num, summary.ctrs[ix].runs); + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", + summary.ctrs[ix].num, + summary.ctrs[ix].num_hot_counters, + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, + summary.ctrs[ix].runs); printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); -- This patch is available for review at http://codereview.appspot.com/6351086
ping
Teresa

On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote:
> [snip]
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_arry > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; > c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,14 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + { > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = > cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = > cs_tprg->hot_cutoff_value; > + } > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189413) > +++ gcc/doc/invoke.texi (working copy) > @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls > @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops > -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc @gol > -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops > -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops > @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under > which > +most profile-based code size estimates will be ignored for loop unrolling > and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189413) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189413) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* > Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ 
> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189413) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > > /* This pass performs loop unrolling and peeling. We only perform these > optimizations on innermost loops (with single exception) because > @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be > constrained > + based on code footprint estimates. Returns the codesize-based factor > to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For threshold >= 2*threshold, disable unrolling/peeling (by > returning > + INT_MAX). 
*/ > + > +static int > +code_size_limit_factor(struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor > which in > + most loops will need to be set to a value that will reduce or > eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, > allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller > than > + the threshold (i.e. the loop entry is hot enough, the divisor is > set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the ratio, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For appliations that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to > 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
> */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / > loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. > */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189413) > +++ gcc/coverage.c (working copy) > @@ -245,6 +245,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189413) > +++ gcc/loop-iv.c (working copy) > @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189413) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) > Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code > footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2108,6 +2112,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large > code footprint > + > ; Nonzero means that loop optimizer may assume that the induction > variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189413) > +++ gcc/cfgloop.c (working copy) > @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently > counts > + the number of conditional branch instructions, calls and fp > instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header > (loop > + entry), then it is weighted based on its execution count, > which > + will be turned into a ratio compared to the loop header > below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it > is > + in a nested inner loop), simply weight the branch the same > as the > + header execution count, so that it will contribute 1 branch > to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution > count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189413) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of > iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189413) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to > loops > + * with non-constant iteration counts and no floating point computations. > */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop > unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry > block > + count that must be exceeded to ignore most code size limits when > unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189413) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at > http://codereview.appspot.com/6351086 > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping (retrying ping in plain text mode so that it goes through properly). Thanks, Teresa On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ports some patches related to improving FDO program summary information > and using it to guide loop unrolling from google branches to mainline. > The patch is enhanced to add additional summary information to aid > in determining hot/cold decisions. > > The original patch description is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and further discussion about incorporating onto mainline is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 > > Honza, can you take a look to see if this patch meets your needs? > > Full description: > > This patch adds new program summary information to the gcov > profile files that indicate how many profiled counts compose > the majority of the program's execution time. This is used to > provide an indication of the overall code size of the program. > > The new profile summary information is then used to guide > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > This patch also pulls in dependent portions of google/main r187660 that cache > additional loop analysis results in the niter_desc auxiliary information > hanging off the loop structure (the optimization portions of that > change are not included here, and have an outstanding review request > for mainline). > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-07-11 Teresa Johnson <tejohnson@google.com> > > * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. > (gcov_compute_cutoff_values): Ditto. > (gcov_exit): Call gcov_compute_cutoff_values and merge new summary > information. 
> * gcc/doc/invoke.texi (roll much): Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. > * gcc/gcov-io.c (gcov_write_summary): Write new summary info. > (gcov_read_summary): Read new summary info. > * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. > (struct gcov_ctr_summary): Add new summary info: num_hot_counters and > hot_cutoff_value. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/coverage.c (read_counts_file): Propagate new summary info. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. > (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > * gcc/gcov-dump.c (tag_summary): Dump new summary info. > > Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 189413) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Used by qsort to sort gcov values in descending order. 
*/ > + > +static int > +sort_by_reverse_gcov_value (const void *pa, const void *pb) > +{ > + const gcov_type a = *(gcov_type const *)pa; > + const gcov_type b = *(gcov_type const *)pb; > + > + if (b > a) > + return 1; > + else if (b == a) > + return 0; > + else > + return -1; > +} > + > +/* Determines the number of counters required to cover a given percentage > + of the total sum of execution counts in the summary, which is then also > + recorded in SUM. */ > + > +static void > +gcov_compute_cutoff_values (struct gcov_summary *sum) > +{ > + struct gcov_info *gi_ptr; > + const struct gcov_fn_info *gfi_ptr; > + const struct gcov_ctr_info *ci_ptr; > + struct gcov_ctr_summary *cs_ptr; > + unsigned t_ix, f_ix, i, ctr_info_ix, index; > + gcov_unsigned_t c_num; > + gcov_type *value_array; > + gcov_type cum, cum_cutoff; > + char *cutoff_str; > + unsigned cutoff_perc; > + > +#define CUM_CUTOFF_PERCENT_TIMES_10 999 > + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); > + if (cutoff_str && strlen (cutoff_str)) > + cutoff_perc = atoi (cutoff_str); > + else > + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; > + > + /* This currently only applies to arc counters. */ > + t_ix = GCOV_COUNTER_ARCS; > + > + /* First check if there are any counts recorded for this counter. */ > + cs_ptr = &(sum->ctrs[t_ix]); > + if (!cs_ptr->num) > + return; > + > + /* Determine the cumulative counter value at the specified cutoff > + percentage and record the percentage for use by gcov consumers. > + Check for overflow when sum_all is multiplied by the cutoff_perc, > + and if so, do the divide first. */ > + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > + /* Overflow, do the divide first. */ > + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; > + else > + /* Otherwise multiply first to get the correct value for small > + values of sum_all. 
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_array > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,14 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + { > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; > + } > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189413) > +++ gcc/doc/invoke.texi (working copy) > @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc @gol > -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189413) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189413) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ > - 
(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189413) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > > /* This pass performs loop unrolling and peeling. We only perform these > optimizations on innermost loops (with single exception) because > @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be constrained > + based on code footprint estimates. Returns the codesize-based factor to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For size >= 2*threshold, disable unrolling/peeling (by returning > + INT_MAX). 
*/ > + > +static int > +code_size_limit_factor (struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor which in > + most loops will need to be set to a value that will reduce or eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller than > + the threshold (i.e. the loop entry is hot enough), the divisor is set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the threshold, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For applications that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189413) > +++ gcc/coverage.c (working copy) > @@ -245,6 +245,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189413) > +++ gcc/loop-iv.c (working copy) > @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189413) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2108,6 +2112,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189413) > +++ gcc/cfgloop.c (working copy) > @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set (rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN (insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189413) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189413) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189413) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
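The core of the gcov_compute_cutoff_values logic above — sort the counter values in descending order, then accumulate from the hottest down until the cutoff percentage of the total is covered — can be sketched in isolation. The helper below is a hypothetical standalone rendering (the name hot_counter_count is invented, and the counters are taken as a plain array rather than gathered from the per-object gcov structures); it mirrors the patch's comparator and its multiply-vs-divide overflow guard:

```c
#include <assert.h>
#include <stdlib.h>

typedef long long gcov_type;

/* qsort comparator: sort gcov values in descending order, as
   sort_by_reverse_gcov_value does in the patch.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_type a = *(const gcov_type *) pa;
  const gcov_type b = *(const gcov_type *) pb;
  return (b > a) - (b < a);
}

/* Return how many of the N counters in VALUES are needed to cover
   CUTOFF_PERC/1000 of their total sum (999 == 99.9%, the patch's
   default).  VALUES is sorted in place.  Hypothetical helper.  */
static unsigned
hot_counter_count (gcov_type *values, unsigned n, unsigned cutoff_perc)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c_num;

  for (i = 0; i < n; i++)
    sum_all += values[i];

  /* Mirror the patch's overflow check: if multiplying would overflow,
     divide first, trading a little precision for safety.  */
  if (sum_all * (gcov_type) cutoff_perc < sum_all)
    cum_cutoff = sum_all / 1000 * cutoff_perc;
  else
    cum_cutoff = sum_all * cutoff_perc / 1000;

  qsort (values, n, sizeof (gcov_type), cmp_desc);
  for (c_num = 0; c_num < n && cum < cum_cutoff; c_num++)
    cum += values[c_num];
  return c_num;
}
```

For counts {1000, 10, 5, 1} the 99.9% cutoff is 1014 of the total 1016, so the three hottest counters are needed; a 50.0% cutoff is already covered by the single hottest counter.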
Sign in to reply to this message.
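The three-way decision in code_size_limit_factor can likewise be restated as a pure function of the summary numbers. This is an illustrative sketch, not the patch's code: the threshold defaults (15000 and 100) come from the new params.def entries, the name limit_factor is invented, and a zero-header-count guard is added here that the patch leaves to its callers' profile checks:

```c
#include <assert.h>
#include <limits.h>

typedef long long gcov_type;

/* Divisor applied to the max unrolled/peeled insn params:
   1) footprint <= threshold: no limiting (return 1).
   2) threshold < footprint < 2*threshold: limit according to how far
      sum_all / header_count exceeds the hotness threshold.
   3) footprint >= 2*threshold: effectively disable (return INT_MAX).  */
static int
limit_factor (unsigned num_hot_counters, gcov_type sum_all,
              gcov_type header_count, unsigned size_threshold,
              gcov_type hotness_threshold)
{
  gcov_type ratio;

  if (num_hot_counters <= size_threshold || sum_all == 0)
    return 1;

  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;

  /* Guard added for this sketch only; the patch divides directly.  */
  if (header_count == 0)
    return INT_MAX;

  ratio = sum_all / header_count;
  return ratio > hotness_threshold
         ? (int) (ratio / hotness_threshold) : 1;
}
```

With the default thresholds, a program with 20000 hot counters has a loop entered 100 times out of 1,000,000 total counts limited by a divisor of 100, while one entered 50,000 times is not limited at all.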
Ping. Teresa On Wed, Jul 18, 2012 at 8:48 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ping (retrying ping in plain text mode so that it goes through properly). > > Thanks, > Teresa > > On Wed, Jul 11, 2012 at 10:42 AM, Teresa Johnson <tejohnson@google.com> wrote: >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-11 Teresa Johnson <tejohnson@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. 
>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189413) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. 
*/ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. 
*/ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_array >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. 
*/ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,14 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + { >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189413) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -384,7 +384,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -416,7 +416,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc @gol >> -ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8505,6 +8505,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8513,6 +8521,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled >> @@ -8864,6 +8880,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189413) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189413) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >> #define GCOV_TAG_PROGRAM_SUMMARY 
((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189413) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> optimizations on innermost loops (with single exception) because >> @@ -150,6 +151,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). 
*/ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough, the divisor is set >> + to 1 so the unroll/peel factor is not reduced. 
When it is bigger >> + than the threshold, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For applications that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -802,6 +872,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -814,10 +885,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1195,6 +1282,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1205,8 +1293,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1231,7 +1334,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1348,6 +1451,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1358,11 +1462,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1392,7 +1511,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189413) >> +++ gcc/coverage.c (working copy) >> @@ -245,6 +245,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189413) >> +++ gcc/loop-iv.c (working copy) >> @@ -2969,8 +2969,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189413) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2108,6 +2112,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189413) >> +++ gcc/cfgloop.c (working copy) >> @@ -1155,24 +1155,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189413) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. 
*/ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189413) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189413) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
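The core of the new heuristic is easier to follow outside the pass. Below is a stand-alone sketch of the code_size_limit_factor logic, with the profile summary and param lookups replaced by plain arguments (the names num_hot, size_threshold, ratio, and hot_threshold are hypothetical stand-ins for profile_info->num_hot_counters, unrollpeel-codesize-threshold, sum_all divided by the loop header count, and unrollpeel-hotness-threshold; the FP-loop and missing-profile early returns are omitted):

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the codesize-based divisor applied to the
   max-unrolled/peeled-insns params, per the patch's three cases.  */
static int
limit_factor_sketch (unsigned num_hot, unsigned size_threshold,
                     long long ratio, long long hot_threshold)
{
  /* 1) Small code footprint: do not limit unrolling/peeling.  */
  if (num_hot <= size_threshold)
    return 1;

  /* 3) At least twice the codesize limit: divisor large enough to
     force the unroll/peel factor to 0.  */
  if (num_hot >= size_threshold * 2)
    return INT_MAX;

  /* 2) In between: allow limited unrolling for very hot loops, and
     grow the divisor as the loop entry gets colder relative to the
     hotness threshold.  */
  if (ratio > hot_threshold)
    return (int) (ratio / hot_threshold);
  return 1;
}
```

With the patch's default params (15000 and 100), a summary reporting 20000 hot counters leaves hot loops (ratio at or below 100) alone but divides the unroll budget by ratio/100 for colder ones.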
http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h File gcc/gcov-io.h (right): http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396 gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of counters to reach a given Should it be made into an array accessed with pre-defined indices? http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399 gcc/gcov-io.h:399: num_hot_counters. */ Similarly for this one -- or put the above two values into a pair and declare an array of the pairs? http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c File libgcc/libgcov.c (right): http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) space needed http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) space needed. Is it better to check if (cs_ptr->sum_all * cutoff_perc /cutoff_perc != cs_ptr->sum_all)? http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718 libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) Why guard the update under !cs_prg->runs++ --> the condition is used to identify if it is a first write or a merged write. (verify with a test case to see if the merge of the summary is correct).
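The overflow point raised on libgcov.c:332 can be seen in isolation. A small sketch of the two checks, using unsigned 64-bit arithmetic where wraparound is well defined (gcov_type itself is a signed HOST_WIDEST_INT, so the real code has to be more careful than this):

```c
#include <assert.h>
#include <stdint.h>

/* The patch's original test: a wrapped product may compare smaller
   than one of its factors, but this misses many overflows.  */
static int
overflowed_weak (uint64_t sum_all, uint64_t cutoff_perc)
{
  return sum_all * cutoff_perc < sum_all;
}

/* The reviewer's suggestion: divide the product back by the
   multiplier; if we do not recover the original value, the
   multiplication wrapped.  */
static int
overflowed_divcheck (uint64_t sum_all, uint64_t cutoff_perc)
{
  return cutoff_perc != 0
         && (sum_all * cutoff_perc) / cutoff_perc != sum_all;
}
```

For example, with sum_all = 2^63 and cutoff_perc = 999 the product wraps back to exactly 2^63, so the weak check sees nothing while the divide-back check catches it.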
On Thu, Jul 26, 2012 at 11:26 AM, <davidxl@google.com> wrote: > > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h > File gcc/gcov-io.h (right): > > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396 > gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of > > counters to reach a given > Should it be made into an array accessed with pre-defined indices? > You mean provide an array of values for different thresholds? In that case both num_hot_counters and hot_cutoff_value would need to be arrays (oh, I see now that you indicate that below too). And we would possibly need another array to hold the thresholds, unless they are hard coded values. I've been thinking about extending this to provide data for more threshold values, but was holding that off for a follow-on, when I had a better idea of how the expanded summary info would be used. Honza - you were talking about using this type of summary info to guide hot/cold decisions, do you find the current single threshold data useful enough? > http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399 > gcc/gcov-io.h:399: num_hot_counters. */ > Similarly for this one -- or put the above two values into a pair and > declare an array of the pairs? > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c > File libgcc/libgcov.c (right): > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 > libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > space needed > > I think I have added the space you wanted (around the '*' char) - let me know if you meant something else.
> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332 > libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all) > space needed. Is it better to check if (cs_ptr->sum_all * cutoff_perc > /cutoff_perc != cs_ptr->sum_all)? > Good suggestion - changed. > > http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718 > libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters > > cs_prg->num_hot_counters) > Why guard the update under !cs_prg->runs++ --> the condition is used to > identify if it is a first write or a merged write. (verify with a test > case to see if the merge of the summary is correct). > In fact the ">" check doesn't make sense if this is the first write, which is currently how I have it guarded. The question is how to best "merge" this information. After thinking about this, I am going to move this code out from under the !cs_prg->runs++ check, so that on a merge, we use the summary with the largest num_hot_counters. I am going to do a little testing with my changes and will upload a new patch. Thanks, Teresa > > http://codereview.appspot.com/6351086/ > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
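The merge behavior described in the last paragraph — on a merged write, keep the hot-counter data from whichever summary reports more hot counters — reduces to a few lines. A sketch, where summary_sketch is a reduced, hypothetical stand-in for the two new gcov_ctr_summary fields:

```c
#include <assert.h>

/* Reduced stand-in for the new gcov_ctr_summary fields.  */
struct summary_sketch
{
  unsigned num_hot_counters;   /* counters needed to reach the cutoff */
  long long hot_cutoff_value;  /* smallest counter among those */
};

/* Merge rule under discussion: take the (num_hot_counters,
   hot_cutoff_value) pair from the summary with the larger number of
   hot counters, i.e. from the run with the larger estimated code
   footprint.  The two fields travel together.  */
static void
merge_hot_info (struct summary_sketch *dst, const struct summary_sketch *src)
{
  if (src->num_hot_counters > dst->num_hot_counters)
    {
      dst->num_hot_counters = src->num_hot_counters;
      dst->hot_cutoff_value = src->hot_cutoff_value;
    }
}
```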
Resending in plain text mode...sigh.
Teresa

On Thu, Jul 26, 2012 at 1:32 PM, Teresa Johnson <tejohnson@google.com> wrote:
>
> On Thu, Jul 26, 2012 at 11:26 AM, <davidxl@google.com> wrote:
>>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h
>> File gcc/gcov-io.h (right):
>>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode396
>> gcc/gcov-io.h:396: gcov_unsigned_t num_hot_counters;/* number of
>> counters to reach a given
>> Should it be made into an array accessed with pre-defined indices?
>
> You mean provide an array of values for different thresholds? In that case
> both num_hot_counters and hot_cutoff_value would need to be arrays (oh, I
> see now that you indicate that below too). And we would possibly need
> another array to hold the thresholds, unless they are hard coded values.
> I've been thinking about extending this to provide data for more threshold
> values, but was holding that off for a follow-on, when I had a better idea
> of how the expanded summary info would be used.
>
> Honza - you were talking about using this type of summary info to guide
> hot/cold decisions, do you find the current single threshold data useful
> enough?
>
>> http://codereview.appspot.com/6351086/diff/1/gcc/gcov-io.h#newcode399
>> gcc/gcov-io.h:399: num_hot_counters. */
>> Similarly for this one -- or put the above two values into a pair and
>> declare an array of the pairs?
>>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c
>> File libgcc/libgcov.c (right):
>>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332
>> libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
>> space needed
>
> I think I have added the space you wanted (around the '*' char) - let me
> know if you meant something else.
>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode332
>> libgcc/libgcov.c:332: if (cs_ptr->sum_all*cutoff_perc < cs_ptr->sum_all)
>> space needed.
>> Is it better to check if (cs_ptr->sum_all * cutoff_perc
>> / cutoff_perc != cs_ptr->sum_all)?
>
> Good suggestion - changed.
>
>> http://codereview.appspot.com/6351086/diff/1/libgcc/libgcov.c#newcode718
>> libgcc/libgcov.c:718: if (cs_tprg->num_hot_counters >
>> cs_prg->num_hot_counters)
>> Why guard the update under !cs_prg->runs++ --> the condition is used to
>> identify if it is a first write or a merged write. (verify with a test
>> case to see if the merge of the summary is correct).
>
> In fact the ">" check doesn't make sense if this is the first write, which
> is currently how I have it guarded. The question is how to best "merge"
> this information. After thinking about this, I am going to move this code
> out from under the !cs_prg->runs++ check, so that on a merge, we use the
> summary with the largest num_hot_counters.
>
> I am going to do a little testing with my changes and will upload a new
> patch.
>
> Thanks,
> Teresa
>
>> http://codereview.appspot.com/6351086/
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
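[Editorial note: the overflow check discussed above (libgcov.c:332) can be illustrated standalone. Because signed overflow is undefined behavior in ISO C, this sketch uses a pre-multiplication bound check rather than the patch's after-the-fact divide-back test; the function name `compute_cum_cutoff` and the fixed permille scale are invented for illustration and are not part of the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for gcov_type.  */
typedef int64_t gcov_type;

/* Compute sum_all * cutoff_perc / 1000 (cutoff_perc is in tenths of a
   percent, e.g. 999 = 99.9%).  If the multiply would overflow, divide
   first at the cost of some rounding error; otherwise multiply first,
   which is exact for small sums.  */
static gcov_type
compute_cum_cutoff (gcov_type sum_all, unsigned cutoff_perc)
{
  if (sum_all > INT64_MAX / (gcov_type) cutoff_perc)
    return sum_all / 1000 * cutoff_perc;  /* divide first */
  return sum_all * cutoff_perc / 1000;    /* multiply first */
}
```

Multiplying first matters for small sums: for sum_all = 500 and a 99.9% cutoff, divide-first would yield 0 * 999 = 0, while multiply-first yields 499.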
Updated patch based on review feedback. Bootstrapped and tested on
x86_64-unknown-linux-gnu. Ok for trunk?

Teresa

Original patch description:

Ports some patches related to improving FDO program summary information
and using it to guide loop unrolling from google branches to mainline.
The patch is enhanced to add additional summary information to aid in
determining hot/cold decisions.

The original patch description is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html
and further discussion about incorporating it into mainline is at:
http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414

Honza, can you take a look to see if this patch meets your needs?

Full description:

This patch adds new program summary information to the gcov profile
files that indicates how many profiled counts compose the majority of
the program's execution time. This is used to provide an indication of
the overall code size of the program.

The new profile summary information is then used to guide codesize-based
unroll and peel decisions, to prevent those optimizations from increasing
code size too much when the program may be sensitive to icache effects.

This patch also pulls in dependent portions of google/main r187660 that
cache additional loop analysis results in the niter_desc auxiliary
information hanging off the loop structure (the optimization portions of
that change are not included here, and have an outstanding review request
for mainline).

Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk?

Thanks,
Teresa

2012-07-26  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (sort_by_reverse_gcov_value): New function.
	(gcov_compute_cutoff_values): Ditto.
	(gcov_exit): Call gcov_compute_cutoff_values and merge new
	summary information.
	* gcc/doc/invoke.texi (roll much): Document new options
	-fpeel-codesize-limit and -funroll-codesize-limit, and new params
	unrollpeel-codesize-threshold and unrollpeel-hotness-threshold.
	* gcc/gcov-io.c (gcov_write_summary): Write new summary info.
	(gcov_read_summary): Read new summary info.
	* gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary
	info.
	(struct gcov_ctr_summary): Add new summary info: num_hot_counters
	and hot_cutoff_value.
	* gcc/loop-unroll.c (code_size_limit_factor): New function.
	(decide_unroll_runtime_iterations): Call code_size_limit_factor
	to control the unroll factor, and retrieve number of branches from
	niter_desc instead of via function that walks loop.
	(decide_peel_simple, decide_unroll_stupid): Ditto.
	* gcc/coverage.c (read_counts_file): Propagate new summary info.
	* gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns
	function, and add guards to enable this function to work for the
	outermost loop.
	* gcc/common.opt: Add -fpeel-codesize-limit and
	-funroll-codesize-limit.
	* gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions.
	(num_loop_branches): Remove.
	* gcc/cfgloop.h (struct niter_desc): Added new fields to cache
	additional loop analysis information.
	(num_loop_branches): Remove.
	(analyze_loop_insns): Declare.
	* gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add.
	(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto.
	* gcc/gcov-dump.c (tag_summary): Dump new summary info.

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189893)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Used by qsort to sort gcov values in descending order.  */
+
+static int
+sort_by_reverse_gcov_value (const void *pa, const void *pb)
+{
+  const gcov_type a = *(gcov_type const *)pa;
+  const gcov_type b = *(gcov_type const *)pb;
+
+  if (b > a)
+    return 1;
+  else if (b == a)
+    return 0;
+  else
+    return -1;
+}
+
+/* Determines the number of counters required to cover a given percentage
+   of the total sum of execution counts in the summary, which is then also
+   recorded in SUM.  */
+
+static void
+gcov_compute_cutoff_values (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, i, ctr_info_ix, index;
+  gcov_unsigned_t c_num;
+  gcov_type *value_array;
+  gcov_type cum, cum_cutoff;
+  char *cutoff_str;
+  unsigned cutoff_perc;
+
+#define CUM_CUTOFF_PERCENT_TIMES_10 999
+  cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10");
+  if (cutoff_str && strlen (cutoff_str))
+    cutoff_perc = atoi (cutoff_str);
+  else
+    cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  /* Determine the cumulative counter value at the specified cutoff
+     percentage and record the percentage for use by gcov consumers.
+     Check for overflow when sum_all is multiplied by the cutoff_perc,
+     and if so, do the divide first.  */
+  if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all)
+    /* Overflow, do the divide first.  */
+    cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc;
+  else
+    /* Otherwise multiply first to get the correct value for small
+       values of sum_all.  */
+    cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000;
+
+  /* Next, walk through all the per-object structures and save each of
+     the count values in value_array.  */
+  index = 0;
+  value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num);
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (i = 0, ctr_info_ix = 0; i < t_ix; i++)
+        {
+          if (gi_ptr->merge[i])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          /* Sanity check that there are enough entries in value_array
+             for this function's counters.  Gracefully handle the case when
+             there are not, in case something in the profile info is
+             corrupted.  */
+          c_num = ci_ptr->num;
+          if (index + c_num > cs_ptr->num)
+            c_num = cs_ptr->num - index;
+          /* Copy over this function's counter values.  */
+          memcpy (&value_array[index], ci_ptr->values,
+                  sizeof (gcov_type) * c_num);
+          index += c_num;
+        }
+    }
+
+  /* Sort all the counter values by descending value and finally
+     accumulate the values from hottest on down until reaching
+     the cutoff value computed earlier.  */
+  qsort (value_array, cs_ptr->num, sizeof (gcov_type),
+         sort_by_reverse_gcov_value);
+  for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++)
+    cum += value_array[c_num];
+  /* Record the number of counters required to reach the cutoff value,
+     as well as the counter value that caused cum to reach the cutoff.  */
+  cs_ptr->num_hot_counters = c_num;
+  cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0;
+  free (value_array);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +461,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_cutoff_values (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.  */
@@ -598,7 +713,12 @@ gcov_exit (void)
 	  if (gi_ptr->merge[t_ix])
 	    {
 	      if (!cs_prg->runs++)
-	        cs_prg->num = cs_tprg->num;
+		cs_prg->num = cs_tprg->num;
+	      if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters)
+		{
+		  cs_prg->num_hot_counters = cs_tprg->num_hot_counters;
+		  cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value;
+		}
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 189893)
+++ gcc/doc/invoke.texi	(working copy)
@@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}.
 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol
 -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol
 -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol
--fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
+-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
@@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
 -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
--funit-at-a-time -funroll-all-loops -funroll-loops @gol
+-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol
 -fwhole-program -fwpa -fuse-linker-plugin @gol
@@ -8527,6 +8527,14 @@ the loop is entered.  This usually makes programs
 @option{-funroll-all-loops} implies the same options as
 @option{-funroll-loops}.
 
+@item -funroll-codesize-limit
+@opindex funroll-codesize-limit
+Limit loop unrolling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint.  Enabled by default with
+@option{-fprofile-use}.  Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fpeel-loops
 @opindex fpeel-loops
 Peels loops for which there is enough information that they do not
@@ -8535,6 +8543,14 @@ roll much (from profile feedback).  It also turns
 
 Enabled with @option{-fprofile-use}.
 
+@item -fpeel-codesize-limit
+@opindex fpeel-codesize-limit
+Limit loop peeling of non-const non-FP loops in a profile feedback compilation
+under estimates of a large code footprint.  Enabled by default with
+@option{-fprofile-use}.  Code size and execution weight thresholds are controlled
+by the @option{unrollpeel-codesize-threshold} and
+@option{unrollpeel-hotness-threshold} parameters.
+
 @item -fmove-loop-invariants
 @opindex fmove-loop-invariants
 Enables the loop invariant motion pass in the RTL loop optimizer.  Enabled
@@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s
 @item max-completely-peel-loop-nest-depth
 The maximum depth of a loop nest suitable for complete peeling.
 
+@item unrollpeel-codesize-threshold
+Maximum profile-based code size footprint estimate for loop unrolling and
+peeling.
+
+@item unrollpeel-hotness-threshold
+Maximum ratio of total execution count to loop entry block count under which
+most profile-based code size estimates will be ignored for loop unrolling and
+peeling.
+
 @item max-unswitch-insns
 The maximum number of insns of an unswitched loop.
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189893)
+++ gcc/gcov-io.c	(working copy)
@@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       gcov_write_unsigned (csum->num);
+      gcov_write_unsigned (csum->num_hot_counters);
+      gcov_write_counter (csum->hot_cutoff_value);
       gcov_write_unsigned (csum->runs);
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
@@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary)
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
       csum->num = gcov_read_unsigned ();
+      csum->num_hot_counters = gcov_read_unsigned ();
+      csum->hot_cutoff_value = gcov_read_counter ();
       csum->runs = gcov_read_unsigned ();
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189893)
+++ gcc/gcov-io.h	(working copy)
@@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
 #define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2))
 
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS	0  /* Arc transitions.  */
@@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type;
 struct gcov_ctr_summary
 {
   gcov_unsigned_t num;		/* number of counters.  */
+  gcov_unsigned_t num_hot_counters;/* number of counters to reach a given
+				      percent of sum_all.  */
+  gcov_type hot_cutoff_value;	/* value of smallest counter included in
+				   num_hot_counters.  */
   gcov_unsigned_t runs;		/* number of program runs */
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
Index: gcc/loop-unroll.c
===================================================================
--- gcc/loop-unroll.c	(revision 189893)
+++ gcc/loop-unroll.c	(working copy)
@@ -32,6 +32,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "hashtab.h"
 #include "recog.h"
 #include "target.h"
+#include "gcov-io.h"
 #include "dumpfile.h"
 
 /* This pass performs loop unrolling and peeling.  We only perform these
@@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc
 					     basic_block);
 static rtx get_expansion (struct var_to_expand *);
 
+/* Determine whether and how much LOOP unrolling/peeling should be constrained
+   based on code footprint estimates.  Returns the codesize-based factor to be
+   divided into the max instructions in an unrolled or peeled loop:
+   1) For size <= threshold, do not limit (by returning 1).
+   2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
+      unrolled instructions according to loop hotness.
+   3) For size >= 2*threshold, disable unrolling/peeling (by returning
+      INT_MAX).  */
+
+static int
+code_size_limit_factor (struct loop *loop)
+{
+  unsigned size_threshold;
+  struct niter_desc *desc = get_simple_loop_desc (loop);
+  gcov_type sum_to_header_ratio;
+  int hotness_ratio_threshold;
+  int limit_factor;
+
+  /* First check if the application has a large codesize footprint.
+     This is estimated from FDO profile summary information for the
+     program, where the num_hot_counters indicates the number of hottest
+     counters (blocks) that compose most of the execution time of
+     the program.  A large value would indicate a large flat execution
+     profile where icache misses may be a concern.  */
+  size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
+  if (!profile_info
+      || profile_info->num_hot_counters <= size_threshold
+      || !profile_info->sum_all)
+    return 1;
+
+  /* Next, exclude some loops where unrolling/peeling may be more
+     important to overall performance.  */
+
+  /* Ignore FP loops, which are more likely to benefit heavily from
+     unrolling.  */
+  if (desc->has_fp)
+    return 1;
+
+  /* Next, set the value of the codesize-based unroll factor divisor which in
+     most loops will need to be set to a value that will reduce or eliminate
+     unrolling/peeling.  */
+  if (profile_info->num_hot_counters < size_threshold * 2)
+    {
+      /* For applications that are less than twice the codesize limit, allow
+         limited unrolling for very hot loops.  */
+      sum_to_header_ratio = profile_info->sum_all / loop->header->count;
+      hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
+      /* When the profile count sum to loop entry header ratio is smaller than
+         the threshold (i.e. the loop entry is hot enough), the divisor is set
+         to 1 so the unroll/peel factor is not reduced.  When it is bigger
+         than the ratio, increase the divisor by the amount this ratio
+         is over the threshold, which will quickly reduce the unroll/peel
+         factor to zero as the loop's hotness reduces.  */
+      if (sum_to_header_ratio > hotness_ratio_threshold)
+        {
+          limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
+          gcc_assert (limit_factor >= 1);
+        }
+      else
+        limit_factor = 1;
+    }
+  else
+    /* For applications that are at least twice the codesize limit, set
+       the divisor to a large value that will force the unroll factor to 0.  */
+    limit_factor = INT_MAX;
+
+  return limit_factor;
+}
+
 /* Unroll and/or peel (depending on FLAGS) LOOPS.  */
 void
 unroll_and_peel_loops (int flags)
@@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL))
     {
@@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo
 	     "\n;; Considering unrolling loop with runtime "
 	     "computable number of iterations\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+	    / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+		  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags)
 {
   unsigned npeel;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_PEEL))
     {
@@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags)
   if (dump_file)
     fprintf (dump_file, "\n;; Considering simply peeling loop\n");
 
+  if (flag_peel_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from peeling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max peeled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* npeel = number of iterations to peel.  */
-  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns;
+  npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor
+	  / loop->ninsns;
   if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES))
     npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES);
 
@@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags)
 
   /* Do not simply peel loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
	fprintf (dump_file, ";; Not peeling, contains branches\n");
@@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags
 {
   unsigned nunroll, nunroll_by_av, i;
   struct niter_desc *desc;
+  int limit_factor = 1;
 
   if (!(flags & UAP_UNROLL_ALL))
     {
@@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags
   if (dump_file)
     fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n");
 
+  if (flag_unroll_codesize_limit)
+    {
+      /* Determine whether to limit code size growth from unrolling,
+         using FDO profile summary information that gives an
+         estimated number of executed blocks.  */
+      limit_factor = code_size_limit_factor (loop);
+      if (dump_file && limit_factor > 1)
+	{
+	  fprintf (dump_file,
		   ";; Due to large code size footprint estimate, limit "
		   "max unrolled insns by divisor %d\n", limit_factor);
+	}
+    }
+
   /* nunroll = total number of copies of the original loop body in
      unrolled loop (i.e. if it is 2, we have to duplicate loop body once.  */
-  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns;
-  nunroll_by_av
-    = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns;
+  nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor
+	    / loop->ninsns;
+  nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS)
+		  / limit_factor / loop->av_ninsns;
   if (nunroll > nunroll_by_av)
     nunroll = nunroll_by_av;
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
@@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags
 
   /* Do not unroll loops with branches inside -- it increases number
      of mispredicts.  */
-  if (num_loop_branches (loop) > 1)
+  if (desc->num_branches > 1)
     {
       if (dump_file)
	fprintf (dump_file, ";; Not unrolling, contains branches\n");
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189893)
+++ gcc/coverage.c	(working copy)
@@ -247,6 +247,14 @@ read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+	      if (summary.ctrs[ix].num_hot_counters
+		  < sum.ctrs[ix].num_hot_counters)
+		{
+		  summary.ctrs[ix].num_hot_counters
+		    = sum.ctrs[ix].num_hot_counters;
+		  summary.ctrs[ix].hot_cutoff_value
+		    = sum.ctrs[ix].hot_cutoff_value;
+		}
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
Index: gcc/loop-iv.c
===================================================================
--- gcc/loop-iv.c	(revision 189893)
+++ gcc/loop-iv.c	(working copy)
@@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop)
   /* At least desc->infinite is not always initialized by
      find_simple_loop_exit.  */
   desc = XCNEW (struct niter_desc);
-  iv_analysis_loop_init (loop);
-  find_simple_exit (loop, desc);
+  if (loop->latch != EXIT_BLOCK_PTR)
+    {
+      iv_analysis_loop_init (loop);
+      find_simple_exit (loop, desc);
+    }
+  analyze_loop_insns (loop, desc);
   loop->aux = desc;
 
   if (desc->simple_p && (desc->assumptions || desc->infinite))
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 189893)
+++ gcc/common.opt	(working copy)
@@ -1551,6 +1551,10 @@ fpcc-struct-return
 Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN)
 Return small aggregates in memory, not registers
 
+fpeel-codesize-limit
+Common Report Var(flag_peel_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop peeling under profile estimates of large code footprint
+
 fpeel-loops
 Common Report Var(flag_peel_loops) Optimization
 Perform loop peeling
@@ -2112,6 +2116,10 @@ funroll-all-loops
 Common Report Var(flag_unroll_all_loops) Optimization
 Perform loop unrolling for all loops
 
+funroll-codesize-limit
+Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization
+Limit non-const non-FP loop unrolling under profile estimates of large code footprint
+
 ; Nonzero means that loop optimizer may assume that the induction variables
 ; that control loops do not overflow and that the loops with nontrivial
 ; exit condition are not infinite
Index: gcc/cfgloop.c
===================================================================
--- gcc/cfgloop.c	(revision 189893)
+++ gcc/cfgloop.c	(working copy)
@@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop)
   return edges;
 }
 
-/* Counts the number of conditional branches inside LOOP.  */
+/* Determine if INSN is a floating point set.  */
 
-unsigned
-num_loop_branches (const struct loop *loop)
+static bool
+insn_has_fp_set (rtx insn)
 {
-  unsigned i, n;
-  basic_block * body;
+  int i;
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) == SET)
+    return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat))));
+  else if (GET_CODE (pat) == PARALLEL)
+    {
+      for (i = 0; i < XVECLEN (pat, 0); i++)
+	{
+	  rtx sub = XVECEXP (pat, 0, i);
+	  if (GET_CODE (sub) == SET)
+	    return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub))));
+	}
+    }
+  return false;
+}
 
-  gcc_assert (loop->latch != EXIT_BLOCK_PTR);
+/* Analyzes the instructions inside LOOP, updating the DESC.  Currently counts
+   the number of conditional branch instructions, calls and fp instructions,
+   as well as the average number of branches executed per iteration.  */
 
+void
+analyze_loop_insns (const struct loop *loop, struct niter_desc *desc)
+{
+  unsigned i, nbranch;
+  gcov_type weighted_nbranch;
+  bool has_call, has_fp;
+  basic_block *body, bb;
+  rtx insn;
+  gcov_type header_count = loop->header->count;
+
+  nbranch = weighted_nbranch = 0;
+  has_call = has_fp = false;
+
   body = get_loop_body (loop);
-  n = 0;
   for (i = 0; i < loop->num_nodes; i++)
-    if (EDGE_COUNT (body[i]->succs) >= 2)
-      n++;
+    {
+      bb = body[i];
+
+      if (EDGE_COUNT (bb->succs) >= 2)
+	{
+	  nbranch++;
+
+	  /* If this block is executed less frequently than the header (loop
+	     entry), then it is weighted based on its execution count, which
+	     will be turned into a ratio compared to the loop header below.  */
+	  if (bb->count < header_count)
+	    weighted_nbranch += bb->count;
+
+	  /* When it is executed more frequently than the header (i.e. it is
+	     in a nested inner loop), simply weight the branch the same as the
+	     header execution count, so that it will contribute 1 branch to
+	     the ratio computed below.  */
+	  else
+	    weighted_nbranch += header_count;
+	}
+
+      /* No need to iterate through the instructions below if
	 both flags have already been set.  */
+      if (has_call && has_fp)
+	continue;
+
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!INSN_P (insn))
+	    continue;
+
+	  if (!has_call)
+	    has_call = CALL_P (insn);
+
+	  if (!has_fp)
+	    has_fp = insn_has_fp_set (insn);
+	}
+    }
   free (body);
 
-  return n;
+  desc->num_branches = nbranch;
+  /* Now divide the weights computed above by the loop header execution count,
+     to compute the average number of branches through the loop.  By adding
+     header_count/2 to the numerator we round to nearest with integer
+     division.  */
+  if (header_count != 0)
+    desc->av_num_branches
+      = (weighted_nbranch + header_count/2) / header_count;
+  else
+    desc->av_num_branches = 0;
+  desc->has_call = has_call;
+  desc->has_fp = has_fp;
 }
 
 /* Adds basic block BB to LOOP.  */
Index: gcc/cfgloop.h
===================================================================
--- gcc/cfgloop.h	(revision 189893)
+++ gcc/cfgloop.h	(working copy)
@@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order
 extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *);
 edge single_exit (const struct loop *);
-extern unsigned num_loop_branches (const struct loop *);
 
 extern edge loop_preheader_edge (const struct loop *);
 extern edge loop_latch_edge (const struct loop *);
@@ -366,7 +365,8 @@ struct rtx_iv
 };
 
 /* The description of an exit from the loop and of the number of iterations
-   till we take the exit.  */
+   till we take the exit.  Also includes other information used primarily
+   by the loop unroller.  */
 
 struct niter_desc
 {
@@ -407,6 +407,18 @@ struct niter_desc
 
   /* The number of iterations of the loop.  */
   rtx niter_expr;
+
+  /* The number of branches in the loop.  */
+  unsigned num_branches;
+
+  /* The number of executed branches per iteration.  */
+  unsigned av_num_branches;
+
+  /* Whether the loop contains a call instruction.  */
+  bool has_call;
+
+  /* Whether the loop contains fp instructions.  */
+  bool has_fp;
 };
 
 extern void iv_analysis_loop_init (struct loop *);
@@ -420,6 +432,7 @@ extern void iv_analysis_done (void);
 
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 extern void free_simple_loop_desc (struct loop *loop);
+void analyze_loop_insns (const struct loop *, struct niter_desc *desc);
 
 static inline struct niter_desc *
 simple_loop_desc (struct loop *loop)
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 189893)
+++ gcc/params.def	(working copy)
@@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
	 "max-completely-peel-loop-nest-depth",
	 "The maximum depth of a loop nest we completely peel",
	 8, 0, 0)
+/* The maximum code size estimate under which loop unrolling and peeling
+   is allowed in a profile feedback compile.  This currently applies to loops
+   with non-constant iteration counts and no floating point computations.  */
+DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD,
+	 "unrollpeel-codesize-threshold",
+	 "Maximum profile-based code size footprint estimate for loop "
+	 "unrolling and peeling",
+	 15000, 0, 0)
+/* The maximum ratio of total profiled execution counts to loop entry block
+   count that must be exceeded to ignore most code size limits when unrolling
+   and peeling.  */
+DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD,
+	 "unrollpeel-hotness-threshold",
+	 "Maximum ratio of total profiled execution count to loop entry "
+	 "block count under which most codesize limits for unrolling and "
+	 "peeling will be ignored",
+	 100, 1, 0)
 
 /* The maximum number of insns of an unswitched loop.  */
 DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189893)
+++ gcc/gcov-dump.c	(working copy)
@@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
     {
       printf ("\n");
       print_prefix (filename, 0, 0);
-      printf ("\t\tcounts=%u, runs=%u",
-	      summary.ctrs[ix].num, summary.ctrs[ix].runs);
+      printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count="
+	      HOST_WIDEST_INT_PRINT_DEC "), runs=%u",
+	      summary.ctrs[ix].num,
+	      summary.ctrs[ix].num_hot_counters,
+	      (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value,
+	      summary.ctrs[ix].runs);
 
       printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC,
	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_all);

--
This patch is available for review at http://codereview.appspot.com/6351086
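[Editorial note: the core of gcov_compute_cutoff_values in the patch above can be modeled in miniature, stripped of the libgcov data structures. The function and type names below are invented for illustration, and overflow handling is omitted; this is a sketch of the technique, not part of the patch.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for gcov_type.  */
typedef int64_t gcov_type;

/* qsort comparator: descending order, as in the patch's
   sort_by_reverse_gcov_value.  */
static int
cmp_desc (const void *pa, const void *pb)
{
  const gcov_type a = *(const gcov_type *) pa;
  const gcov_type b = *(const gcov_type *) pb;
  return (b > a) - (b < a);
}

/* Toy model of gcov_compute_cutoff_values: given COUNTS[0..N), return how
   many of the hottest counters are needed to cover CUTOFF_PERC/1000 of the
   total, and the smallest counter value included (the hot cutoff value).
   Note: sorts COUNTS in place.  */
static unsigned
num_hot_counters (gcov_type *counts, unsigned n, unsigned cutoff_perc,
                  gcov_type *hot_cutoff_value)
{
  gcov_type sum_all = 0, cum = 0, cum_cutoff;
  unsigned i, c_num;

  for (i = 0; i < n; i++)
    sum_all += counts[i];
  cum_cutoff = sum_all * cutoff_perc / 1000;  /* overflow ignored here */

  qsort (counts, n, sizeof (gcov_type), cmp_desc);
  for (c_num = 0; c_num < n && cum < cum_cutoff; c_num++)
    cum += counts[c_num];

  if (hot_cutoff_value)
    *hot_cutoff_value = c_num > 0 ? counts[c_num - 1] : 0;
  return c_num;
}
```

With a sharply peaked profile such as {90, 5, 3, 1, 1}, a 90% cutoff is covered by a single counter, while a flat profile needs many; that is exactly the signal the patch records in num_hot_counters.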
On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote:
> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info.
> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and
> hot_cutoff_value.

You should also update the description for the data file at the head
of gcov-io.h:

        The data file contains the following records.
        data: {unit summary:object summary:program* function-data*}*
        unit: header int32:checksum
        function-data: announce_function present counts
        announce_function: header int32:ident
                int32:lineno_checksum int32:cfg_checksum
        present: header int32:present
        counts: header int64:count*
        summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
        count-summary: int32:num int32:runs int64:sum
                int64:max int64:sum_max

You've added two fields in count-summary IIUC.

Ciao!
Steven
On Fri, Jul 27, 2012 at 12:31 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Fri, Jul 27, 2012 at 6:47 AM, Teresa Johnson <tejohnson@google.com> wrote: >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. > > You should also update the description for the data file at the head > of gcov-io.h: > > The data file contains the following records. > data: {unit summary:object summary:program* function-data*}* > unit: header int32:checksum > function-data: announce_function present counts > announce_function: header int32:ident > int32:lineno_checksum int32:cfg_checksum > present: header int32:present > counts: header int64:count* > summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE > count-summary: int32:num int32:runs int64:sum > int64:max int64:sum_max > > You've added two fields in count-summary IIUC. Thanks, I will update this in my next version of the patch. Teresa > > Ciao! > Steven -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
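For reference, with the two new count-summary fields appended, the record description quoted above would read roughly as follows (the field names are taken from the ChangeLog; their exact order and placement in the committed format are assumptions):

```
summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
count-summary: int32:num int32:runs int64:sum
               int64:max int64:sum_max
               int32:num_hot_counters int64:hot_cutoff_value
```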
On Fri, Jul 27, 2012 at 3:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > Hi, >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189893) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. */ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) > > This looks like good idea to me to drive the hot/cold partitioning even if it > is not quite accurate (you have no idea how many instructions given counter is > guarding). Thanks - right, I think it will be a good approximation. > > To reduce overhead on embedded sysems, what about just doing histogram with say > 128 steps instead if dragging in qsort? This also avoids the need to > produce the copy of all counters. I like that suggestion. I'll need 1024 buckets though to get the 99.9% cutoff I am using right now (or use 128 buckets plus 1 to track what would be bucket 1024 which is roughly 99.9%). I can use something like a binary search to locate the right bucket without a linear search or divide. And I can keep track of the min value for each bucket in order to identify the correct value for hot_cutoff_value. >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. 
>> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. */ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > > To further keep embedded systems (at least a bit) happier, I guess one could do > this without generic 64bit divide operations. I guess 1000 can be bumped up to > 1024, small error is hamless here. > > Actually it may be easier to simply embedd the histogram into gcov summary > so one can control the cutoff with --param in compiler at --profile-use time. > It seems resonable to me to trade 128 values per file for the extra flexibility. Both you and David have requested more values, so I will go ahead and implement this suggestion in this patch. I can use the 128 + 1 bucket approach I described above to get the data for roughly every 1% plus the 99.9%. That should be enough granularity for the optimizations (and the smallest bucket doesn't really need to be fed back as it can be largely extrapolated from the others). This will require feeding back 128 arrays of 2 values (num_hot_counters and hot_cutoff_value). > >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. 
*/ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; > > I wonder if the loop walking all counters can't be fused into one of the other > loops we already have. Not with the histogram approach as the preceding walk (in the caller, gcov_exit) will be needed to find the min and max counter values first. >> + } >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189893) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. > > Lets handle the cutoff logic independently of the loop bits I can not approve. > The patch is OK with the change ssuggested Ok, I will split this into 2 patches, one for the gcov changes only, and a follow-on patch with my loop unroller changes. Thanks! Teresa > > Honza -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
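The histogram scheme agreed on in this exchange might look roughly like the sketch below. This is a minimal illustration, not the implementation that was eventually committed: the bucket count, helper names, and the plain equal-width bucket mapping are all assumptions, and the extra high-precision bucket for the ~99.9% cutoff discussed above is simplified away here.

```c
#include <assert.h>
#include <string.h>

#define NUM_BUCKETS 129         /* 128 buckets plus one spare so a clamped
                                   index is always in range.  */

typedef long long gcov_type_s;  /* stand-in for gcov_type */

struct hist_bucket
{
  unsigned num_counters;        /* how many counters fell in this bucket */
  gcov_type_s sum;              /* sum of their values */
  gcov_type_s min_value;        /* smallest value seen: cutoff candidate */
};

/* Map VALUE into one of 128 equal-width buckets over [MIN, MAX], found
   by a prior linear pass over the counters.  Dividing the range rather
   than the value avoids 64-bit overflow.  */
static unsigned
hist_index (gcov_type_s value, gcov_type_s min, gcov_type_s max)
{
  gcov_type_s width;
  unsigned ix;
  if (max <= min)
    return 0;
  width = (max - min) / 128 + 1;
  ix = (unsigned) ((value - min) / width);
  return ix < NUM_BUCKETS ? ix : NUM_BUCKETS - 1;
}

/* Record one counter VALUE, tracking the per-bucket minimum so the
   hot cutoff value can be recovered later without sorting.  */
static void
hist_record (struct hist_bucket *h, gcov_type_s value,
             gcov_type_s min, gcov_type_s max)
{
  struct hist_bucket *b = &h[hist_index (value, min, max)];
  if (b->num_counters == 0 || value < b->min_value)
    b->min_value = value;
  b->num_counters++;
  b->sum += value;
}

/* Walk buckets from hottest down until the cumulative sum reaches
   CUM_CUTOFF; return the number of hot counters and record the
   smallest contributing value in *CUTOFF_VALUE.  */
static unsigned
hist_cutoff (const struct hist_bucket *h, gcov_type_s cum_cutoff,
             gcov_type_s *cutoff_value)
{
  gcov_type_s cum = 0;
  unsigned num = 0;
  int ix;
  for (ix = NUM_BUCKETS - 1; ix >= 0 && cum < cum_cutoff; ix--)
    {
      if (!h[ix].num_counters)
        continue;
      cum += h[ix].sum;
      num += h[ix].num_counters;
      *cutoff_value = h[ix].min_value;
    }
  return num;
}
```

Compared with the qsort approach, this needs only two linear walks over the counters (one to find min/max, one to bucket them), no O(n) counter copy, and no 64-bit sort, at the cost of some precision near bucket boundaries.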
This patch was proposed a while back, and the new working set program summary was split out and committed separately. I would like to see if the loop unroller part of the patch can now go in. For the previous versions of the patch and related discussion, see: http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html and http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html Use the new working set information in the FDO program summary for codesize-based unroll and peel decisions, to prevent those optimizations from increasing code size too much when the program may be sensitive to icache effects. As part of this, cache additional loop analysis results in the niter_desc auxiliary information hanging off the loop structure to reduce redundant analyses during unrolling. Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? Thanks, Teresa 2012-11-19 Teresa Johnson <tejohnson@google.com> * gcc/doc/invoke.texi: Document new options -fpeel-codesize-limit and -funroll-codesize-limit, and new params unrollpeel-codesize-threshold and unrollpeel-hotness-threshold. * gcc/loop-unroll.c (code_size_limit_factor): New function. (decide_unroll_runtime_iterations): Call code_size_limit_factor to control the unroll factor, and retrieve number of branches from niter_desc instead of via function that walks loop. (decide_peel_simple, decide_unroll_stupid): Ditto. * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns function, and add guards to enable this function to work for the outermost loop. * gcc/common.opt: Add -fpeel-codesize-limit and -funroll-codesize-limit. * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. (num_loop_branches): Remove. * gcc/cfgloop.h (struct niter_desc): Add new fields to cache additional loop analysis information. (num_loop_branches): Remove. (analyze_loop_insns): Declare. * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. 
Index: doc/invoke.texi =================================================================== --- doc/invoke.texi (revision 193626) +++ doc/invoke.texi (working copy) @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol -fprefetch-loop-arrays -fprofile-report @gol -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol -ftree-switch-conversion -ftree-tail-merge @gol -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol --funit-at-a-time -funroll-all-loops -funroll-loops @gol +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol -fwhole-program -fwpa -fuse-linker-plugin @gol @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs @option{-funroll-all-loops} implies the same options as @option{-funroll-loops}. +@item -funroll-codesize-limit +@opindex funroll-codesize-limit +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. 
+ @item -fpeel-loops @opindex fpeel-loops Peels loops for which there is enough information that they do not @@ -8779,6 +8787,13 @@ roll much (from profile feedback). It also turns Enabled with @option{-fprofile-use}. +@item -fpeel-codesize-limit +@opindex fpeel-codesize-limit +Limit loop peeling of non-const non-FP loops in a profile feedback compilation +under estimates of a large code footprint. Enabled by default with +@option{-fprofile-use}. Code size and execution weight thresholds are controlled +by the @option{unrollpeel-codesize-threshold} and +@option{unrollpeel-hotness-threshold} parameters. + @item -fmove-loop-invariants @opindex fmove-loop-invariants Enables the loop invariant motion pass in the RTL loop optimizer. Enabled @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s @item max-completely-peel-loop-nest-depth The maximum depth of a loop nest suitable for complete peeling. +@item unrollpeel-codesize-threshold +Maximum profile-based code size footprint estimate for loop unrolling and +peeling. + +@item unrollpeel-hotness-threshold +Maximum ratio of total execution count to loop entry block count under which +most profile-based code size estimates will be ignored for loop unrolling and +peeling. + @item max-unswitch-insns The maximum number of insns of an unswitched loop. Index: loop-unroll.c =================================================================== --- loop-unroll.c (revision 193626) +++ loop-unroll.c (working copy) @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see #include "hashtab.h" #include "recog.h" #include "target.h" +#include "gcov-io.h" #include "dumpfile.h" /* This pass performs loop unrolling and peeling. 
We only perform these @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc basic_block); static rtx get_expansion (struct var_to_expand *); +/* Determine whether and how much LOOP unrolling/peeling should be constrained + based on code footprint estimates. Returns the codesize-based factor to be + divided into the max instructions in an unrolled or peeled loop: + 1) For size <= threshold, do not limit (by returning 1). + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or + unrolled instructions according to loop hotness. + 3) For size >= 2*threshold, disable unrolling/peeling (by returning + INT_MAX). */ + +static int +code_size_limit_factor (struct loop *loop) +{ + unsigned size_threshold, num_hot_counters; + struct niter_desc *desc = get_simple_loop_desc (loop); + gcov_type sum_to_header_ratio; + int hotness_ratio_threshold; + int limit_factor; + gcov_working_set_t *ws; + + ws = find_working_set (999); + if (!ws) + return 1; + num_hot_counters = ws->num_counters; + + /* First check if the application has a large codesize footprint. + This is estimated from FDO profile summary information for the + program, where the num_hot_counters indicates the number of hottest + counters (blocks) that compose most of the execution time of + the program. A large value would indicate a large flat execution + profile where icache misses may be a concern. */ + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); + if (!profile_info + || num_hot_counters <= size_threshold + || !profile_info->sum_all) + return 1; + + /* Next, exclude some loops where unrolling/peeling may be more + important to overall performance. */ + + /* Ignore FP loops, which are more likely to benefit heavily from + unrolling. */ + if (desc->has_fp) + return 1; + + /* Next, set the value of the codesize-based unroll factor divisor which in + most loops will need to be set to a value that will reduce or eliminate + unrolling/peeling. 
*/ + if (num_hot_counters < size_threshold * 2 + && loop->header->count > 0) + { + /* For applications that are less than twice the codesize limit, allow + limited unrolling for very hot loops. */ + sum_to_header_ratio = profile_info->sum_all / loop->header->count; + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); + /* When the profile count sum to loop entry header ratio is smaller than + the threshold (i.e. the loop entry is hot enough), the divisor is set + to 1 so the unroll/peel factor is not reduced. When it is bigger + than the threshold, increase the divisor by the amount this ratio + is over the threshold, which will quickly reduce the unroll/peel + factor to zero as the loop's hotness reduces. */ + if (sum_to_header_ratio > hotness_ratio_threshold) + { + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; + gcc_assert (limit_factor >= 1); + } + else + limit_factor = 1; + } + else + /* For applications that are at least twice the codesize limit, set + the divisor to a large value that will force the unroll factor to 0. */ + limit_factor = INT_MAX; + + return limit_factor; +} + /* Unroll and/or peel (depending on FLAGS) LOOPS. */ void unroll_and_peel_loops (int flags) @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL)) @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo "\n;; Considering unrolling loop with runtime " "computable number of iterations\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. 
*/ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) { unsigned npeel; double_int iterations; + struct niter_desc *desc; + int limit_factor = 1; if (!(flags & UAP_PEEL)) { @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) if (dump_file) fprintf (dump_file, "\n;; Considering simply peeling loop\n"); + if (flag_peel_codesize_limit) + { + /* Determine whether to limit code size growth from peeling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max peeled insns by divisor %d\n", limit_factor); + } + } + /* npeel = number of iterations to peel. 
*/ - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor + / loop->ninsns; if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) } /* Do not simply peel loops with branches inside -- it increases number - of mispredicts. + of mispredicts. Exception is when we do have profile and we however have good chance to peel proper number of iterations loop will iterate in practice. TODO: this heuristic needs tunning; while for complette unrolling @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) peeling it is not the case. Also a function call inside loop is also branch from branch prediction POV (and probably better reason to not unroll/peel). */ - if (num_loop_branches (loop) > 1 + desc = get_simple_loop_desc (loop); + if (desc->num_branches > 1 && profile_status != PROFILE_READ) { if (dump_file) @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags { unsigned nunroll, nunroll_by_av, i; struct niter_desc *desc; + int limit_factor = 1; double_int iterations; if (!(flags & UAP_UNROLL_ALL)) @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags if (dump_file) fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); + if (flag_unroll_codesize_limit) + { + /* Determine whether to limit code size growth from unrolling, + using FDO profile summary information that gives an + estimated number of executed blocks. */ + limit_factor = code_size_limit_factor (loop); + if (dump_file && limit_factor > 1) + { + fprintf (dump_file, + ";; Due to large code size footprint estimate, limit " + "max unrolled insns by divisor %d\n", limit_factor); + } + } + /* nunroll = total number of copies of the original loop body in unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; - nunroll_by_av - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor + / loop->ninsns; + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) + / limit_factor / loop->av_ninsns; if (nunroll > nunroll_by_av) nunroll = nunroll_by_av; if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags of mispredicts. TODO: this heuristic needs tunning; call inside the loop body is also relatively good reason to not unroll. */ - if (num_loop_branches (loop) > 1) + if (desc->num_branches > 1) { if (dump_file) fprintf (dump_file, ";; Not unrolling, contains branches\n"); Index: loop-iv.c =================================================================== --- loop-iv.c (revision 193626) +++ loop-iv.c (working copy) @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) /* At least desc->infinite is not always initialized by find_simple_loop_exit. 
*/ desc = XCNEW (struct niter_desc); - iv_analysis_loop_init (loop); - find_simple_exit (loop, desc); + if (loop->latch != EXIT_BLOCK_PTR) + { + iv_analysis_loop_init (loop); + find_simple_exit (loop, desc); + } + analyze_loop_insns (loop, desc); loop->aux = desc; if (desc->simple_p && (desc->assumptions || desc->infinite)) Index: common.opt =================================================================== --- common.opt (revision 193626) +++ common.opt (working copy) @@ -1573,6 +1573,10 @@ fpcc-struct-return Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) Return small aggregates in memory, not registers +fpeel-codesize-limit +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop peeling under profile estimates of large code footprint + fpeel-loops Common Report Var(flag_peel_loops) Optimization Perform loop peeling @@ -2138,6 +2142,10 @@ funroll-all-loops Common Report Var(flag_unroll_all_loops) Optimization Perform loop unrolling for all loops +funroll-codesize-limit +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization +Limit non-const non-FP loop unrolling under profile estimates of large code footprint + ; Nonzero means that loop optimizer may assume that the induction variables ; that control loops do not overflow and that the loops with nontrivial ; exit condition are not infinite Index: cfgloop.c =================================================================== --- cfgloop.c (revision 193626) +++ cfgloop.c (working copy) @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) return edges; } -/* Counts the number of conditional branches inside LOOP. */ +/* Determine if INSN is a floating point set. 
*/ -unsigned -num_loop_branches (const struct loop *loop) +static bool +insn_has_fp_set(rtx insn) { - unsigned i, n; - basic_block * body; + int i; + rtx pat = PATTERN(insn); + if (GET_CODE (pat) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); + else if (GET_CODE (pat) == PARALLEL) + { + for (i = 0; i < XVECLEN (pat, 0); i++) + { + rtx sub = XVECEXP (pat, 0, i); + if (GET_CODE (sub) == SET) + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); + } + } + return false; +} - gcc_assert (loop->latch != EXIT_BLOCK_PTR); +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts + the number of conditional branch instructions, calls and fp instructions, + as well as the average number of branches executed per iteration. */ +void +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) +{ + unsigned i, nbranch; + gcov_type weighted_nbranch; + bool has_call, has_fp; + basic_block * body, bb; + rtx insn; + gcov_type header_count = loop->header->count; + + nbranch = weighted_nbranch = 0; + has_call = has_fp = false; + body = get_loop_body (loop); - n = 0; for (i = 0; i < loop->num_nodes; i++) - if (EDGE_COUNT (body[i]->succs) >= 2) - n++; + { + bb = body[i]; + + if (EDGE_COUNT (bb->succs) >= 2) + { + nbranch++; + + /* If this block is executed less frequently than the header (loop + entry), then it is weighted based on its execution count, which + will be turned into a ratio compared to the loop header below. */ + if (bb->count < header_count) + weighted_nbranch += bb->count; + + /* When it is executed more frequently than the header (i.e. it is + in a nested inner loop), simply weight the branch the same as the + header execution count, so that it will contribute 1 branch to + the ratio computed below. */ + else + weighted_nbranch += header_count; + } + + /* No need to iterate through the instructions below if + both flags have already been set. 
*/ + if (has_call && has_fp) + continue; + + FOR_BB_INSNS (bb, insn) + { + if (!INSN_P (insn)) + continue; + + if (!has_call) + has_call = CALL_P (insn); + + if (!has_fp) + has_fp = insn_has_fp_set (insn); + } + } free (body); - return n; + desc->num_branches = nbranch; + /* Now divide the weights computed above by the loop header execution count, + to compute the average number of branches through the loop. By adding + header_count/2 to the numerator we round to nearest with integer + division. */ + if (header_count != 0) + desc->av_num_branches + = (weighted_nbranch + header_count/2) / header_count; + else + desc->av_num_branches = 0; + desc->has_call = has_call; + desc->has_fp = has_fp; } /* Adds basic block BB to LOOP. */ Index: cfgloop.h =================================================================== --- cfgloop.h (revision 193626) +++ cfgloop.h (working copy) @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order extern vec<edge> get_loop_exit_edges (const struct loop *); extern edge single_exit (const struct loop *); extern edge single_likely_exit (struct loop *loop); -extern unsigned num_loop_branches (const struct loop *); extern edge loop_preheader_edge (const struct loop *); extern edge loop_latch_edge (const struct loop *); @@ -365,7 +364,8 @@ struct rtx_iv }; /* The description of an exit from the loop and of the number of iterations - till we take the exit. */ + till we take the exit. Also includes other information used primarily + by the loop unroller. */ struct niter_desc { @@ -403,6 +403,18 @@ struct niter_desc /* The number of iterations of the loop. */ rtx niter_expr; + + /* The number of branches in the loop. */ + unsigned num_branches; + + /* The number of executed branches per iteration. */ + unsigned av_num_branches; + + /* Whether the loop contains a call instruction. */ + bool has_call; + + /* Whether the loop contains fp instructions. 
*/ + bool has_fp; }; extern void iv_analysis_loop_init (struct loop *); @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); extern struct niter_desc *get_simple_loop_desc (struct loop *loop); extern void free_simple_loop_desc (struct loop *loop); +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); static inline struct niter_desc * simple_loop_desc (struct loop *loop) Index: params.def =================================================================== --- params.def (revision 193626) +++ params.def (working copy) @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, "max-completely-peel-loop-nest-depth", "The maximum depth of a loop nest we completely peel", 8, 0, 0) +/* The maximum code size estimate under which loop unrolling and peeling + * is allowed in a profile feedback compile. This currently applies to loops + * with non-constant iteration counts and no floating point computations. */ +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, + "unrollpeel-codesize-threshold", + "Maximum profile-based code size footprint estimate for loop unrolling " + "and peeling", + 15000, 0, 0) +/* The maximum ratio of total profiled execution counts to loop entry block + count that must be exceeded to ignore most code size limits when unrolling + and peeling. */ +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, + "unrollpeel-hotness-threshold", + "Maximum ratio of total profiled execution count to loop entry " + "block count under which most codesize limits for unrolling and " + "peeling will be ignored", + 100, 1, 0) /* The maximum number of insns of an unswitched loop. */ DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: > Updated patch based on review feedback. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? 
> > Teresa > > > Original patch description: > > Ports some patches related to improving FDO program summary information > and using it to guide loop unrolling from google branches to mainline. > The patch is enhanced to add additional summary information to aid > in determining hot/cold decisions. > > The original patch description is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and further discussion about incorporating onto mainline is at: > http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 > > Honza, can you take a look to see if this patch meets your needs? > > Full description: > > This patch adds new program summary information to the gcov > profile files that indicate how many profiled counts compose > the majority of the program's execution time. This is used to > provide an indication of the overall code size of the program. > > The new profile summary information is then used to guide > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > This patch also pulls in dependent portions of google/main r187660 that cache > additional loop analysis results in the niter_desc auxiliary information > hanging off the loop structure (the optimization portions of that > change are not included here, and have an outstanding review request > for mainline). > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-07-26 Teresa Johnson <tejohnson@google.com> > > * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. > (gcov_compute_cutoff_values): Ditto. > (gcov_exit): Call gcov_compute_cutoff_values and merge new summary > information. > * gcc/doc/invoke.texi (roll much): Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. 
> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. > (gcov_read_summary): Read new summary info. > * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. > (struct gcov_ctr_summary): Add new summary info: num_hot_counters and > hot_cutoff_value. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/coverage.c (read_counts_file): Propagate new summary info. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. > (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > * gcc/gcov-dump.c (tag_summary): Dump new summary info. > > Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 189893) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Used by qsort to sort gcov values in descending order. 
*/ > + > +static int > +sort_by_reverse_gcov_value (const void *pa, const void *pb) > +{ > + const gcov_type a = *(gcov_type const *)pa; > + const gcov_type b = *(gcov_type const *)pb; > + > + if (b > a) > + return 1; > + else if (b == a) > + return 0; > + else > + return -1; > +} > + > +/* Determines the number of counters required to cover a given percentage > + of the total sum of execution counts in the summary, which is then also > + recorded in SUM. */ > + > +static void > +gcov_compute_cutoff_values (struct gcov_summary *sum) > +{ > + struct gcov_info *gi_ptr; > + const struct gcov_fn_info *gfi_ptr; > + const struct gcov_ctr_info *ci_ptr; > + struct gcov_ctr_summary *cs_ptr; > + unsigned t_ix, f_ix, i, ctr_info_ix, index; > + gcov_unsigned_t c_num; > + gcov_type *value_array; > + gcov_type cum, cum_cutoff; > + char *cutoff_str; > + unsigned cutoff_perc; > + > +#define CUM_CUTOFF_PERCENT_TIMES_10 999 > + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); > + if (cutoff_str && strlen (cutoff_str)) > + cutoff_perc = atoi (cutoff_str); > + else > + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; > + > + /* This currently only applies to arc counters. */ > + t_ix = GCOV_COUNTER_ARCS; > + > + /* First check if there are any counts recorded for this counter. */ > + cs_ptr = &(sum->ctrs[t_ix]); > + if (!cs_ptr->num) > + return; > + > + /* Determine the cumulative counter value at the specified cutoff > + percentage and record the percentage for use by gcov consumers. > + Check for overflow when sum_all is multiplied by the cutoff_perc, > + and if so, do the divide first. */ > + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) > + /* Overflow, do the divide first. */ > + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; > + else > + /* Otherwise multiply first to get the correct value for small > + values of sum_all. 
*/ > + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; > + > + /* Next, walk through all the per-object structures and save each of > + the count values in value_array. */ > + index = 0; > + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); > + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) > + { > + if (!gi_ptr->merge[t_ix]) > + continue; > + > + /* Find the appropriate index into the gcov_ctr_info array > + for the counter we are currently working on based on the > + existence of the merge function pointer for this object. */ > + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) > + { > + if (gi_ptr->merge[i]) > + ctr_info_ix++; > + } > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + /* Sanity check that there are enough entries in value_array > + for this function's counters. Gracefully handle the case when > + there are not, in case something in the profile info is > + corrupted. */ > + c_num = ci_ptr->num; > + if (index + c_num > cs_ptr->num) > + c_num = cs_ptr->num - index; > + /* Copy over this function's counter values. */ > + memcpy (&value_array[index], ci_ptr->values, > + sizeof (gcov_type) * c_num); > + index += c_num; > + } > + } > + > + /* Sort all the counter values by descending value and finally > + accumulate the values from hottest on down until reaching > + the cutoff value computed earlier. */ > + qsort (value_array, cs_ptr->num, sizeof (gcov_type), > + sort_by_reverse_gcov_value); > + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) > + cum += value_array[c_num]; > + /* Record the number of counters required to reach the cutoff value, > + as well as the counter value that caused cum to reach the cutoff. */ > + cs_ptr->num_hot_counters = c_num; > + cs_ptr->hot_cutoff_value = c_num > 0 ? 
value_array[c_num-1] : 0; > + free (value_array); > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +461,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_cutoff_values (&this_prg); > > { > /* Check if the level of dirs to strip off specified. */ > @@ -598,7 +713,12 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + cs_prg->num = cs_tprg->num; > + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) > + { > + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; > + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; > + } > cs_prg->sum_all += cs_tprg->sum_all; > if (cs_prg->run_max < cs_tprg->run_max) > cs_prg->run_max = cs_tprg->run_max; > Index: gcc/doc/invoke.texi > =================================================================== > --- gcc/doc/invoke.texi (revision 189893) > +++ gcc/doc/invoke.texi (working copy) > @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol > -fprefetch-loop-arrays @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled > @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: gcc/gcov-io.c > =================================================================== > --- gcc/gcov-io.c (revision 189893) > +++ gcc/gcov-io.c (working copy) > @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > gcov_write_unsigned (csum->num); > + gcov_write_unsigned (csum->num_hot_counters); > + gcov_write_counter (csum->hot_cutoff_value); > gcov_write_unsigned (csum->runs); > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > csum->num = gcov_read_unsigned (); > + csum->num_hot_counters = gcov_read_unsigned (); > + csum->hot_cutoff_value = gcov_read_counter (); > csum->runs = gcov_read_unsigned (); > csum->sum_all = gcov_read_counter (); > csum->run_max = gcov_read_counter (); > Index: gcc/gcov-io.h > =================================================================== > --- gcc/gcov-io.h (revision 189893) > +++ gcc/gcov-io.h (working copy) > @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; > #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ > #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) > #define GCOV_TAG_SUMMARY_LENGTH \ > - 
(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) > + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) > > /* Counters that are collected. */ > #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ > @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; > struct gcov_ctr_summary > { > gcov_unsigned_t num; /* number of counters. */ > + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given > + percent of sum_all. */ > + gcov_type hot_cutoff_value; /* value of smallest counter included in > + num_hot_counters. */ > gcov_unsigned_t runs; /* number of program runs */ > gcov_type sum_all; /* sum of all counters accumulated. */ > gcov_type run_max; /* maximum value on a single run. */ > Index: gcc/loop-unroll.c > =================================================================== > --- gcc/loop-unroll.c (revision 189893) > +++ gcc/loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see > #include "hashtab.h" > #include "recog.h" > #include "target.h" > +#include "gcov-io.h" > #include "dumpfile.h" > > /* This pass performs loop unrolling and peeling. We only perform these > @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Determine whether and how much LOOP unrolling/peeling should be constrained > + based on code footprint estimates. Returns the codesize-based factor to be > + divided into the max instructions in an unrolled or peeled loop: > + 1) For size <= threshold, do not limit (by returning 1). > + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or > + unrolled instructions according to loop hotness. > + 3) For size >= 2*threshold, disable unrolling/peeling (by returning > + INT_MAX). 
*/ > > +static int > +code_size_limit_factor(struct loop *loop) > +{ > + unsigned size_threshold; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + gcov_type sum_to_header_ratio; > + int hotness_ratio_threshold; > + int limit_factor; > + > + /* First check if the application has a large codesize footprint. > + This is estimated from FDO profile summary information for the > + program, where the num_hot_counters indicates the number of hottest > + counters (blocks) that compose most of the execution time of > + the program. A large value would indicate a large flat execution > + profile where icache misses may be a concern. */ > + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); > + if (!profile_info > + || profile_info->num_hot_counters <= size_threshold > + || !profile_info->sum_all) > + return 1; > + > + /* Next, exclude some loops where unrolling/peeling may be more > + important to overall performance. */ > + > + /* Ignore FP loops, which are more likely to benefit heavily from > + unrolling. */ > + if (desc->has_fp) > + return 1; > + > + /* Next, set the value of the codesize-based unroll factor divisor which in > + most loops will need to be set to a value that will reduce or eliminate > + unrolling/peeling. */ > + if (profile_info->num_hot_counters < size_threshold * 2) > + { > + /* For applications that are less than twice the codesize limit, allow > + limited unrolling for very hot loops. */ > + sum_to_header_ratio = profile_info->sum_all / loop->header->count; > + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); > + /* When the profile count sum to loop entry header ratio is smaller than > + the threshold (i.e. the loop entry is hot enough), the divisor is set > + to 1 so the unroll/peel factor is not reduced. 
When it is bigger > + than the threshold, increase the divisor by the amount this ratio > + is over the threshold, which will quickly reduce the unroll/peel > + factor to zero as the loop's hotness reduces. */ > + if (sum_to_header_ratio > hotness_ratio_threshold) > + { > + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; > + gcc_assert (limit_factor >= 1); > + } > + else > + limit_factor = 1; > + } > + else > + /* For applications that are at least twice the codesize limit, set > + the divisor to a large value that will force the unroll factor to 0. */ > + limit_factor = INT_MAX; > + > + return limit_factor; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL)) > { > @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_UNROLL_ALL)) > { > @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. 
*/ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: gcc/coverage.c > =================================================================== > --- gcc/coverage.c (revision 189893) > +++ gcc/coverage.c (working copy) > @@ -247,6 +247,14 @@ read_counts_file (void) > gcov_read_summary (&sum); > for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) > { > + if (summary.ctrs[ix].num_hot_counters > + < sum.ctrs[ix].num_hot_counters) > + { > + summary.ctrs[ix].num_hot_counters > + = sum.ctrs[ix].num_hot_counters; > + summary.ctrs[ix].hot_cutoff_value > + = sum.ctrs[ix].hot_cutoff_value; > + } > summary.ctrs[ix].runs += sum.ctrs[ix].runs; > summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; > if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) > Index: gcc/loop-iv.c > =================================================================== > --- gcc/loop-iv.c (revision 189893) > +++ gcc/loop-iv.c (working copy) > @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: gcc/common.opt > =================================================================== > --- gcc/common.opt (revision 189893) > +++ gcc/common.opt (working copy) > @@ -1551,6 +1551,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2112,6 +2116,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: gcc/cfgloop.c > =================================================================== > --- gcc/cfgloop.c (revision 189893) > +++ gcc/cfgloop.c (working copy) > @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: gcc/cfgloop.h > =================================================================== > --- gcc/cfgloop.h (revision 189893) > +++ gcc/cfgloop.h (working copy) > @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -366,7 +365,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -407,6 +407,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: gcc/params.def > =================================================================== > --- gcc/params.def (revision 189893) > +++ gcc/params.def (working copy) > @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > Index: gcc/gcov-dump.c > =================================================================== > --- gcc/gcov-dump.c (revision 189893) > +++ gcc/gcov-dump.c (working copy) > @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED > { > printf ("\n"); > print_prefix (filename, 0, 0); > - printf ("\t\tcounts=%u, runs=%u", > - summary.ctrs[ix].num, summary.ctrs[ix].runs); > + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" > + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", > + summary.ctrs[ix].num, > + summary.ctrs[ix].num_hot_counters, > + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, > + summary.ctrs[ix].runs); > > printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, > (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); > > -- > This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping. Teresa On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > This patch was proposed awhile back, and the new working set program > summary was split out and committed separately. I would like to > see if the loop unroller part of the patch can now go in. For the > previous versions of the patch and related discussion, see: > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > and > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > Use the new working set information in the FDO program summary for > codesize based unroll and peel decisions, to prevent those > optimizations from increasing code size too much when the > program may be sensitive to icache effects. > > As part of this, cache additional loop analysis results in the niter_desc > auxiliary information hanging off the loop structure to reduce > redundant analyses during unrolling. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > 2012-11-19 Teresa Johnson <tejohnson@google.com> > > * gcc/doc/invoke.texi: Document new options > -fpeel-codesize-limit and -funroll-codesize-limit, and new params > codesize-hotness-threshold and unrollpeel-hotness-threshold. > * gcc/loop-unroll.c (code_size_limit_factor): New function. > (decide_unroll_runtime_iterations): Call code_size_limit_factor > to control the unroll factor, and retrieve number of branches from > niter_desc instead of via function that walks loop. > (decide_peel_simple, decide_unroll_stupid): Ditto. > * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * gcc/common.opt: Add -fpeel-codesize-limit and > -funroll-codesize-limit. > * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * gcc/cfgloop.h (struct niter_desc): Added new fields to cache > additional loop analysis information. 
> (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. > (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. > > Index: doc/invoke.texi > =================================================================== > --- doc/invoke.texi (revision 193626) > +++ doc/invoke.texi (working copy) > @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. > -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol > -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol > -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol > --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol > +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops > -fpredictive-commoning @gol > -fprefetch-loop-arrays -fprofile-report @gol > -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol > -fprofile-generate=@var{path} @gol > @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. > -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol > -ftree-switch-conversion -ftree-tail-merge @gol > -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol > --funit-at-a-time -funroll-all-loops -funroll-loops @gol > +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol > -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol > -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol > -fwhole-program -fwpa -fuse-linker-plugin @gol > @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs > @option{-funroll-all-loops} implies the same options as > @option{-funroll-loops}. > > +@item -funroll-codesize-limit > +@opindex funroll-codesize-limit > +Limit loop unrolling of non-const non-FP loops in a profile feedback > compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. 
Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fpeel-loops > @opindex fpeel-loops > Peels loops for which there is enough information that they do not > @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns > > Enabled with @option{-fprofile-use}. > > +@item -fpeel-codesize-limit > +@opindex fpeel-codesize-limit > +Limit loop peeling of non-const non-FP loops in a profile feedback compilation > +under estimates of a large code footprint. Enabled by default with > +@option{-fprofile-use}. Code size and execution weight thresholds are > controlled > +by the @option{unrollpeel-codesize-threshold} and > +@option{unrollpeel-hotness-threshold} parameters. > + > @item -fmove-loop-invariants > @opindex fmove-loop-invariants > Enables the loop invariant motion pass in the RTL loop optimizer. Enabled > @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s > @item max-completely-peel-loop-nest-depth > The maximum depth of a loop nest suitable for complete peeling. > > +@item unrollpeel-codesize-threshold > +Maximum profile-based code size footprint estimate for loop unrolling and > +peeling. > + > +@item unrollpeel-hotness-threshold > +Maximum ratio of total execution count to loop entry block count under which > +most profile-based code size estimates will be ignored for loop unrolling and > +peeling. > + > @item max-unswitch-insns > The maximum number of insns of an unswitched loop. > > Index: loop-unroll.c > =================================================================== > --- loop-unroll.c (revision 193626) > +++ loop-unroll.c (working copy) > @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. 
If not see
> #include "hashtab.h"
> #include "recog.h"
> #include "target.h"
> +#include "gcov-io.h"
> #include "dumpfile.h"
>
> /* This pass performs loop unrolling and peeling. We only perform these
> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc
> basic_block);
> static rtx get_expansion (struct var_to_expand *);
>
> +/* Determine whether and how much LOOP unrolling/peeling should be constrained
> + based on code footprint estimates. Returns the codesize-based factor to be
> + divided into the max instructions in an unrolled or peeled loop:
> + 1) For size <= threshold, do not limit (by returning 1).
> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or
> + unrolled instructions according to loop hotness.
> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning
> + INT_MAX). */
> +
> +static int
> +code_size_limit_factor(struct loop *loop)
> +{
> + unsigned size_threshold, num_hot_counters;
> + struct niter_desc *desc = get_simple_loop_desc (loop);
> + gcov_type sum_to_header_ratio;
> + int hotness_ratio_threshold;
> + int limit_factor;
> + gcov_working_set_t *ws;
> +
> + ws = find_working_set(999);
> + if (! ws)
> + return 1;
> + num_hot_counters = ws->num_counters;
> +
> + /* First check if the application has a large codesize footprint.
> + This is estimated from FDO profile summary information for the
> + program, where the num_hot_counters indicates the number of hottest
> + counters (blocks) that compose most of the execution time of
> + the program. A large value would indicate a large flat execution
> + profile where icache misses may be a concern. */
> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD);
> + if (!profile_info
> + || num_hot_counters <= size_threshold
> + || !profile_info->sum_all)
> + return 1;
> +
> + /* Next, exclude some loops where unrolling/peeling may be more
> + important to overall performance. 
*/
> +
> + /* Ignore FP loops, which are more likely to benefit heavily from
> + unrolling. */
> + if (desc->has_fp)
> + return 1;
> +
> + /* Next, set the value of the codesize-based unroll factor divisor which in
> + most loops will need to be set to a value that will reduce or eliminate
> + unrolling/peeling. */
> + if (num_hot_counters < size_threshold * 2
> + && loop->header->count > 0)
> + {
> + /* For applications that are less than twice the codesize limit, allow
> + limited unrolling for very hot loops. */
> + sum_to_header_ratio = profile_info->sum_all / loop->header->count;
> + hotness_ratio_threshold = PARAM_VALUE
> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD);
> + /* When the profile count sum to loop entry header ratio is smaller than
> + the threshold (i.e. the loop entry is hot enough), the divisor is set
> + to 1 so the unroll/peel factor is not reduced. When it is bigger
> + than the threshold, increase the divisor by the amount this ratio
> + is over the threshold, which will quickly reduce the unroll/peel
> + factor to zero as the loop's hotness reduces. */
> + if (sum_to_header_ratio > hotness_ratio_threshold)
> + {
> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold;
> + gcc_assert (limit_factor >= 1);
> + }
> + else
> + limit_factor = 1;
> + }
> + else
> + /* For applications that are at least twice the codesize limit, set
> + the divisor to a large value that will force the unroll factor to 0. */
> + limit_factor = INT_MAX;
> +
> + return limit_factor;
> +}
> +
> /* Unroll and/or peel (depending on FLAGS) LOOPS. 
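[Editorial aside: the three regimes of code_size_limit_factor can be exercised in isolation with a simplified sketch. Plain integers stand in for the GCC profile structures, and all names below are hypothetical, not the GCC API.]

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the divisor selection: NUM_HOT is the estimated hot-code
   working set, SIZE_THRESHOLD the codesize param, and SUM_ALL/HEADER_COUNT
   give the loop-entry hotness ratio.  */
static int
limit_factor_sketch (unsigned num_hot, unsigned size_threshold,
                     long long sum_all, long long header_count,
                     long long hotness_threshold)
{
  long long ratio;

  /* 1) Small footprint: do not limit unrolling/peeling.  */
  if (num_hot <= size_threshold)
    return 1;

  /* 3) At least twice the threshold (or no usable entry count): return a
     divisor large enough to force the unroll/peel factor to 0.  */
  if (num_hot >= size_threshold * 2 || header_count <= 0)
    return INT_MAX;

  /* 2) In between: a hot loop entry (small ratio) keeps divisor 1; colder
     entries scale the divisor up in proportion to the ratio.  */
  ratio = sum_all / header_count;
  if (ratio > hotness_threshold)
    return (int) (ratio / hotness_threshold);
  return 1;
}
```

With the defaults from params.def (codesize threshold 15000, hotness threshold 100), a program with 20000 hot counters whose loop entry sees 1/500th of all counts would get a divisor of 5.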
*/ > void > unroll_and_peel_loops (int flags) > @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL)) > @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo > "\n;; Considering unrolling loop with runtime " > "computable number of iterations\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / > loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) > { > unsigned npeel; > double_int iterations; > + struct niter_desc *desc; > + int limit_factor = 1; > > if (!(flags & UAP_PEEL)) > { > @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) > if (dump_file) > fprintf (dump_file, "\n;; Considering simply peeling loop\n"); > > + if (flag_peel_codesize_limit) > + { > + /* Determine whether to limit code size growth from peeling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max peeled insns by divisor %d\n", limit_factor); > + } > + } > + > /* npeel = number of iterations to peel. */ > - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; > + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor > + / loop->ninsns; > if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) > npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); > > @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) > } > > /* Do not simply peel loops with branches inside -- it increases number > - of mispredicts. > + of mispredicts. > Exception is when we do have profile and we however have good chance > to peel proper number of iterations loop will iterate in practice. 
> TODO: this heuristic needs tunning; while for complette unrolling > @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) > peeling it is not the case. Also a function call inside loop is > also branch from branch prediction POV (and probably better reason > to not unroll/peel). */ > - if (num_loop_branches (loop) > 1 > + desc = get_simple_loop_desc (loop); > + if (desc->num_branches > 1 > && profile_status != PROFILE_READ) > { > if (dump_file) > @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags > { > unsigned nunroll, nunroll_by_av, i; > struct niter_desc *desc; > + int limit_factor = 1; > double_int iterations; > > if (!(flags & UAP_UNROLL_ALL)) > @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags > if (dump_file) > fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); > > + if (flag_unroll_codesize_limit) > + { > + /* Determine whether to limit code size growth from unrolling, > + using FDO profile summary information that gives an > + estimated number of executed blocks. */ > + limit_factor = code_size_limit_factor (loop); > + if (dump_file && limit_factor > 1) > + { > + fprintf (dump_file, > + ";; Due to large code size footprint estimate, limit " > + "max unrolled insns by divisor %d\n", limit_factor); > + } > + } > + > /* nunroll = total number of copies of the original loop body in > unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ > - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; > - nunroll_by_av > - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; > + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor > + / loop->ninsns; > + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) > + / limit_factor / loop->av_ninsns; > if (nunroll > nunroll_by_av) > nunroll = nunroll_by_av; > if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) > @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags > of mispredicts. > TODO: this heuristic needs tunning; call inside the loop body > is also relatively good reason to not unroll. */ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: loop-iv.c > =================================================================== > --- loop-iv.c (revision 193626) > +++ loop-iv.c (working copy) > @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. 
*/ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: common.opt > =================================================================== > --- common.opt (revision 193626) > +++ common.opt (working copy) > @@ -1573,6 +1573,10 @@ fpcc-struct-return > Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) > Return small aggregates in memory, not registers > > +fpeel-codesize-limit > +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop peeling under profile estimates of large > code footprint > + > fpeel-loops > Common Report Var(flag_peel_loops) Optimization > Perform loop peeling > @@ -2138,6 +2142,10 @@ funroll-all-loops > Common Report Var(flag_unroll_all_loops) Optimization > Perform loop unrolling for all loops > > +funroll-codesize-limit > +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization > +Limit non-const non-FP loop unrolling under profile estimates of > large code footprint > + > ; Nonzero means that loop optimizer may assume that the induction variables > ; that control loops do not overflow and that the loops with nontrivial > ; exit condition are not infinite > Index: cfgloop.c > =================================================================== > --- cfgloop.c (revision 193626) > +++ cfgloop.c (working copy) > @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. 
*/ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts > + the number of conditional branch instructions, calls and fp instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header (loop > + entry), then it is weighted based on its execution count, which > + will be turned into a ratio compared to the loop header below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it is > + in a nested inner loop), simply weight the branch the same as the > + header execution count, so that it will contribute 1 branch to > + the ratio computed below. 
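[Editorial aside: the weighting and rounded division that produce av_num_branches can be checked with a tiny self-contained sketch. Plain counter arrays stand in for basic blocks; the names are illustrative only.]

```c
#include <assert.h>

/* Sketch of the av_num_branches computation: each branching block
   contributes min (its count, header count) to the weighted sum, which is
   then divided by the header count, rounded to nearest by adding half the
   divisor first.  */
static unsigned
av_branches_sketch (const long long *branch_counts, unsigned n,
                    long long header_count)
{
  long long weighted = 0;
  unsigned i;

  for (i = 0; i < n; i++)
    weighted += branch_counts[i] < header_count
                ? branch_counts[i] : header_count;

  if (header_count == 0)
    return 0;
  /* Round-to-nearest integer division.  */
  return (unsigned) ((weighted + header_count / 2) / header_count);
}
```

For example, with a header count of 100 and branching blocks executed 100, 50, and 300 times, the weighted sum is 100 + 50 + 100 = 250, giving an average of 3 branches per iteration after rounding.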
*/ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution count, > + to compute the average number of branches through the loop. By adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: cfgloop.h > =================================================================== > --- cfgloop.h (revision 193626) > +++ cfgloop.h (working copy) > @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order > extern vec<edge> get_loop_exit_edges (const struct loop *); > extern edge single_exit (const struct loop *); > extern edge single_likely_exit (struct loop *loop); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -365,7 +364,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -403,6 +403,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. 
*/ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: params.def > =================================================================== > --- params.def (revision 193626) > +++ params.def (working copy) > @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "max-completely-peel-loop-nest-depth", > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > +/* The maximum code size estimate under which loop unrolling and peeling > + * is allowed in a profile feedback compile. This currently applies to loops > + * with non-constant iteration counts and no floating point computations. */ > +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, > + "unrollpeel-codesize-threshold", > + "Maximum profile-based code size footprint estimate for loop > unrolling " > + "and peeling", > + 15000, 0, 0) > +/* The maximum ratio of total profiled execution counts to loop entry block > + count that must be exceeded to ignore most code size limits when unrolling > + and peeling. */ > +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, > + "unrollpeel-hotness-threshold", > + "Maximum ratio of total profiled execution count to loop entry " > + "block count under which most codesize limits for unrolling and " > + "peeling will be ignored", > + 100, 1, 0) > > /* The maximum number of insns of an unswitched loop. 
*/ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > > On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >> Updated patch based on review feedback. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Teresa >> >> >> Original patch description: >> >> Ports some patches related to improving FDO program summary information >> and using it to guide loop unrolling from google branches to mainline. >> The patch is enhanced to add additional summary information to aid >> in determining hot/cold decisions. >> >> The original patch description is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and further discussion about incorporating onto mainline is at: >> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >> >> Honza, can you take a look to see if this patch meets your needs? >> >> Full description: >> >> This patch adds new program summary information to the gcov >> profile files that indicate how many profiled counts compose >> the majority of the program's execution time. This is used to >> provide an indication of the overall code size of the program. >> >> The new profile summary information is then used to guide >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> This patch also pulls in dependent portions of google/main r187660 that cache >> additional loop analysis results in the niter_desc auxiliary information >> hanging off the loop structure (the optimization portions of that >> change are not included here, and have an outstanding review request >> for mainline). >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-07-26 Teresa Johnson <tejohnson@google.com> >> >> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >> (gcov_compute_cutoff_values): Ditto. 
>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >> information. >> * gcc/doc/invoke.texi (roll much): Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >> (gcov_read_summary): Read new summary info. >> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >> hot_cutoff_value. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/coverage.c (read_counts_file): Propagate new summary info. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >> >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 189893) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Used by qsort to sort gcov values in descending order. 
*/ >> + >> +static int >> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >> +{ >> + const gcov_type a = *(gcov_type const *)pa; >> + const gcov_type b = *(gcov_type const *)pb; >> + >> + if (b > a) >> + return 1; >> + else if (b == a) >> + return 0; >> + else >> + return -1; >> +} >> + >> +/* Determines the number of counters required to cover a given percentage >> + of the total sum of execution counts in the summary, which is then also >> + recorded in SUM. */ >> + >> +static void >> +gcov_compute_cutoff_values (struct gcov_summary *sum) >> +{ >> + struct gcov_info *gi_ptr; >> + const struct gcov_fn_info *gfi_ptr; >> + const struct gcov_ctr_info *ci_ptr; >> + struct gcov_ctr_summary *cs_ptr; >> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >> + gcov_unsigned_t c_num; >> + gcov_type *value_array; >> + gcov_type cum, cum_cutoff; >> + char *cutoff_str; >> + unsigned cutoff_perc; >> + >> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >> + if (cutoff_str && strlen (cutoff_str)) >> + cutoff_perc = atoi (cutoff_str); >> + else >> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >> + >> + /* This currently only applies to arc counters. */ >> + t_ix = GCOV_COUNTER_ARCS; >> + >> + /* First check if there are any counts recorded for this counter. */ >> + cs_ptr = &(sum->ctrs[t_ix]); >> + if (!cs_ptr->num) >> + return; >> + >> + /* Determine the cumulative counter value at the specified cutoff >> + percentage and record the percentage for use by gcov consumers. >> + Check for overflow when sum_all is multiplied by the cutoff_perc, >> + and if so, do the divide first. */ >> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >> + /* Overflow, do the divide first. */ >> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >> + else >> + /* Otherwise multiply first to get the correct value for small >> + values of sum_all. 
*/ >> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >> + >> + /* Next, walk through all the per-object structures and save each of >> + the count values in value_array. */ >> + index = 0; >> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >> + { >> + if (!gi_ptr->merge[t_ix]) >> + continue; >> + >> + /* Find the appropriate index into the gcov_ctr_info array >> + for the counter we are currently working on based on the >> + existence of the merge function pointer for this object. */ >> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >> + { >> + if (gi_ptr->merge[i]) >> + ctr_info_ix++; >> + } >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + /* Sanity check that there are enough entries in value_arry >> + for this function's counters. Gracefully handle the case when >> + there are not, in case something in the profile info is >> + corrupted. */ >> + c_num = ci_ptr->num; >> + if (index + c_num > cs_ptr->num) >> + c_num = cs_ptr->num - index; >> + /* Copy over this function's counter values. */ >> + memcpy (&value_array[index], ci_ptr->values, >> + sizeof (gcov_type) * c_num); >> + index += c_num; >> + } >> + } >> + >> + /* Sort all the counter values by descending value and finally >> + accumulate the values from hottest on down until reaching >> + the cutoff value computed earlier. */ >> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >> + sort_by_reverse_gcov_value); >> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >> + cum += value_array[c_num]; >> + /* Record the number of counters required to reach the cutoff value, >> + as well as the counter value that caused cum to reach the cutoff. 
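[Editorial aside: the sort-and-accumulate core of gcov_compute_cutoff_values can be sketched standalone. Plain long long stands in for gcov_type, and the overflow guard on the multiply is elided.]

```c
#include <assert.h>
#include <stdlib.h>

/* Descending-order comparator, as in sort_by_reverse_gcov_value.  */
static int
desc_cmp (const void *pa, const void *pb)
{
  long long a = *(const long long *) pa;
  long long b = *(const long long *) pb;
  return (b > a) - (b < a);
}

/* Count how many of the hottest values are needed to cover
   CUTOFF_PERC/1000 of the total.  Sorts VALUES in place.  */
static unsigned
num_hot_counters_sketch (long long *values, unsigned n, unsigned cutoff_perc)
{
  long long sum = 0, cum = 0, cum_cutoff;
  unsigned i, c;

  for (i = 0; i < n; i++)
    sum += values[i];
  /* Per-mille cutoff, e.g. 999 for 99.9% of sum_all.  */
  cum_cutoff = sum * cutoff_perc / 1000;

  /* Hottest first, then accumulate until the cutoff is reached.  */
  qsort (values, n, sizeof (long long), desc_cmp);
  for (c = 0; c < n && cum < cum_cutoff; c++)
    cum += values[c];
  return c;
}
```

For counters {1, 50, 30, 10, 9} and a 900-per-mille cutoff, the three hottest counters (50 + 30 + 10 = 90 of 100) suffice, so num_hot_counters would be 3.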
*/ >> + cs_ptr->num_hot_counters = c_num; >> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >> + free (value_array); >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +461,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_cutoff_values (&this_prg); >> >> { >> /* Check if the level of dirs to strip off specified. */ >> @@ -598,7 +713,12 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + cs_prg->num = cs_tprg->num; >> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >> + { >> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >> + } >> cs_prg->sum_all += cs_tprg->sum_all; >> if (cs_prg->run_max < cs_tprg->run_max) >> cs_prg->run_max = cs_tprg->run_max; >> Index: gcc/doc/invoke.texi >> =================================================================== >> --- gcc/doc/invoke.texi (revision 189893) >> +++ gcc/doc/invoke.texi (working copy) >> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >> -fprefetch-loop-arrays @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. 
>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8527,6 +8527,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. >> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. 
Enabled >> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. >> >> Index: gcc/gcov-io.c >> =================================================================== >> --- gcc/gcov-io.c (revision 189893) >> +++ gcc/gcov-io.c (working copy) >> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> gcov_write_unsigned (csum->num); >> + gcov_write_unsigned (csum->num_hot_counters); >> + gcov_write_counter (csum->hot_cutoff_value); >> gcov_write_unsigned (csum->runs); >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> csum->num = gcov_read_unsigned (); >> + csum->num_hot_counters = gcov_read_unsigned (); >> + csum->hot_cutoff_value = gcov_read_counter (); >> csum->runs = gcov_read_unsigned (); >> csum->sum_all = gcov_read_counter (); >> csum->run_max = gcov_read_counter (); >> Index: gcc/gcov-io.h >> =================================================================== >> --- gcc/gcov-io.h (revision 189893) >> +++ gcc/gcov-io.h (working copy) >> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >> #define GCOV_TAG_PROGRAM_SUMMARY 
((gcov_unsigned_t)0xa3000000) >> #define GCOV_TAG_SUMMARY_LENGTH \ >> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >> >> /* Counters that are collected. */ >> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >> struct gcov_ctr_summary >> { >> gcov_unsigned_t num; /* number of counters. */ >> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >> + percent of sum_all. */ >> + gcov_type hot_cutoff_value; /* value of smallest counter included in >> + num_hot_counters. */ >> gcov_unsigned_t runs; /* number of program runs */ >> gcov_type sum_all; /* sum of all counters accumulated. */ >> gcov_type run_max; /* maximum value on a single run. */ >> Index: gcc/loop-unroll.c >> =================================================================== >> --- gcc/loop-unroll.c (revision 189893) >> +++ gcc/loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). 
*/ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. */ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || profile_info->num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (profile_info->num_hot_counters < size_threshold * 2) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough, the divisor is set >> + to 1 so the unroll/peel factor is not reduced. 
When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For appliations that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL)) >> { >> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_UNROLL_ALL)) >> { >> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. 
*/ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: gcc/coverage.c >> =================================================================== >> --- gcc/coverage.c (revision 189893) >> +++ gcc/coverage.c (working copy) >> @@ -247,6 +247,14 @@ read_counts_file (void) >> gcov_read_summary (&sum); >> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >> { >> + if (summary.ctrs[ix].num_hot_counters >> + < sum.ctrs[ix].num_hot_counters) >> + { >> + summary.ctrs[ix].num_hot_counters >> + = sum.ctrs[ix].num_hot_counters; >> + summary.ctrs[ix].hot_cutoff_value >> + = sum.ctrs[ix].hot_cutoff_value; >> + } >> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >> Index: gcc/loop-iv.c >> =================================================================== >> --- gcc/loop-iv.c (revision 189893) >> +++ gcc/loop-iv.c (working copy) >> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: gcc/common.opt >> =================================================================== >> --- gcc/common.opt (revision 189893) >> +++ gcc/common.opt (working copy) >> @@ -1551,6 +1551,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2112,6 +2116,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: gcc/cfgloop.c >> =================================================================== >> --- gcc/cfgloop.c (revision 189893) >> +++ gcc/cfgloop.c (working copy) >> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: gcc/cfgloop.h >> =================================================================== >> --- gcc/cfgloop.h (revision 189893) >> +++ gcc/cfgloop.h (working copy) >> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -366,7 +365,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -407,6 +407,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. 
*/ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: gcc/params.def >> =================================================================== >> --- gcc/params.def (revision 189893) >> +++ gcc/params.def (working copy) >> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> Index: gcc/gcov-dump.c >> =================================================================== >> --- gcc/gcov-dump.c (revision 189893) >> +++ gcc/gcov-dump.c (working copy) >> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >> { >> printf ("\n"); >> print_prefix (filename, 0, 0); >> - printf ("\t\tcounts=%u, runs=%u", >> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >> + summary.ctrs[ix].num, >> + summary.ctrs[ix].num_hot_counters, >> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >> + summary.ctrs[ix].runs); >> >> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >> >> -- >> This patch is available for review at http://codereview.appspot.com/6351086 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > This patch was proposed awhile back, and the new working set program > > summary was split out and committed separately. I would like to > > see if the loop unroller part of the patch can now go in. For the > > previous versions of the patch and related discussion, see: > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > and > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > Use the new working set information in the FDO program summary for > > codesize based unroll and peel decisions, to prevent those > > optimizations from increasing code size too much when the > > program may be sensitive to icache effects. > > > > As part of this, cache additional loop analysis results in the niter_desc > > auxiliary information hanging off the loop structure to reduce > > redundant analyses during unrolling. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? I guess this is the missing piece that would fix the Firefox lto/pgo code-size issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 -- Markus
> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: > > Ping. > > Teresa > > > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: > > > This patch was proposed awhile back, and the new working set program > > > summary was split out and committed separately. I would like to > > > see if the loop unroller part of the patch can now go in. For the > > > previous versions of the patch and related discussion, see: > > > > > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html > > > and > > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html > > > > > > Use the new working set information in the FDO program summary for > > > codesize based unroll and peel decisions, to prevent those > > > optimizations from increasing code size too much when the > > > program may be sensitive to icache effects. > > > > > > As part of this, cache additional loop analysis results in the niter_desc > > > auxiliary information hanging off the loop structure to reduce > > > redundant analyses during unrolling. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > I guess this is the missing piece that would fix the Firefox lto/pgo > code-size issue: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 It won't help in this case - the problem is that too much of code is identified as hot, similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Honza > > -- > Markus
On Tue, Dec 11, 2012 at 10:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On 2012.12.11 at 09:13 -0800, Teresa Johnson wrote: >> > Ping. >> > Teresa >> > >> > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> > > This patch was proposed awhile back, and the new working set program >> > > summary was split out and committed separately. I would like to >> > > see if the loop unroller part of the patch can now go in. For the >> > > previous versions of the patch and related discussion, see: >> > > >> > > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> > > and >> > > http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> > > >> > > Use the new working set information in the FDO program summary for >> > > codesize based unroll and peel decisions, to prevent those >> > > optimizations from increasing code size too much when the >> > > program may be sensitive to icache effects. >> > > >> > > As part of this, cache additional loop analysis results in the niter_desc >> > > auxiliary information hanging off the loop structure to reduce >> > > redundant analyses during unrolling. >> > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> I guess this is the missing piece that would fix the Firefox lto/pgo >> code-size issue: >> >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375#c144 > > It won't help in this case - the problem is that too much of code is identified as hot, > similar as for spec2k GCC benchmark. Teresa - do you have any ideas what to do here > short of re-adding the BB ratio parameter and combining both heuristics for 4.8? Right, I doubt this patch would help. I looked through the bugzilla log and it is most likely inlining increases. What was the size of the gcc lto/pgo binary before the change to use the histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a very large increase, ~25%. 
Markus, could you attach to the bug one of the gcda files so that I can see the program summary and figure out how far off the old hot bb threshold is from the new histogram-based one? Also, it would be good to see the -fdump-ipa-inline dumps before and after the regression (if necessary, the before one could be from 4_7). I'll update the bug with the same, so that we can track the discussion there. Thanks, Teresa > > Honza >> >> -- >> Markus -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Ping. Teresa On Tue, Dec 11, 2012 at 9:13 AM, Teresa Johnson <tejohnson@google.com> wrote: > Ping. > Teresa > > On Mon, Nov 19, 2012 at 4:58 PM, Teresa Johnson <tejohnson@google.com> wrote: >> This patch was proposed awhile back, and the new working set program >> summary was split out and committed separately. I would like to >> see if the loop unroller part of the patch can now go in. For the >> previous versions of the patch and related discussion, see: >> >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >> and >> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01373.html >> >> Use the new working set information in the FDO program summary for >> codesize based unroll and peel decisions, to prevent those >> optimizations from increasing code size too much when the >> program may be sensitive to icache effects. >> >> As part of this, cache additional loop analysis results in the niter_desc >> auxiliary information hanging off the loop structure to reduce >> redundant analyses during unrolling. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> 2012-11-19 Teresa Johnson <tejohnson@google.com> >> >> * gcc/doc/invoke.texi: Document new options >> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >> codesize-hotness-threshold and unrollpeel-hotness-threshold. >> * gcc/loop-unroll.c (code_size_limit_factor): New function. >> (decide_unroll_runtime_iterations): Call code_size_limit_factor >> to control the unroll factor, and retrieve number of branches from >> niter_desc instead of via function that walks loop. >> (decide_peel_simple, decide_unroll_stupid): Ditto. >> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * gcc/common.opt: Add -fpeel-codesize-limit and >> -funroll-codesize-limit. >> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. 
>> (num_loop_branches): Remove. >> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >> additional loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >> >> Index: doc/invoke.texi >> =================================================================== >> --- doc/invoke.texi (revision 193626) >> +++ doc/invoke.texi (working copy) >> @@ -389,7 +389,7 @@ Objective-C and Objective-C++ Dialects}. >> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops >> -fpredictive-commoning @gol >> -fprefetch-loop-arrays -fprofile-report @gol >> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >> -fprofile-generate=@var{path} @gol >> @@ -421,7 +421,7 @@ Objective-C and Objective-C++ Dialects}. >> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >> -ftree-switch-conversion -ftree-tail-merge @gol >> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >> -fwhole-program -fwpa -fuse-linker-plugin @gol >> @@ -8771,6 +8771,14 @@ the loop is entered. This usually makes programs >> @option{-funroll-all-loops} implies the same options as >> @option{-funroll-loops}. 
>> >> +@item -funroll-codesize-limit >> +@opindex funroll-codesize-limit >> +Limit loop unrolling of non-const non-FP loops in a profile feedback >> compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fpeel-loops >> @opindex fpeel-loops >> Peels loops for which there is enough information that they do not >> @@ -8779,6 +8787,14 @@ roll much (from profile feedback). It also turns >> >> Enabled with @option{-fprofile-use}. >> >> +@item -fpeel-codesize-limit >> >> +@item -fpeel-codesize-limit >> +@opindex fpeel-codesize-limit >> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >> +under estimates of a large code footprint. Enabled by default with >> +@option{-fprofile-use}. Code size and execution weight thresholds are >> controlled >> +by the @option{unrollpeel-codesize-threshold} and >> +@option{unrollpeel-hotness-threshold} parameters. >> + >> @item -fmove-loop-invariants >> @opindex fmove-loop-invariants >> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >> @@ -9142,6 +9158,15 @@ The maximum number of iterations of a loop to be s >> @item max-completely-peel-loop-nest-depth >> The maximum depth of a loop nest suitable for complete peeling. >> >> +@item unrollpeel-codesize-threshold >> +Maximum profile-based code size footprint estimate for loop unrolling and >> +peeling. >> + >> +@item unrollpeel-hotness-threshold >> +Maximum ratio of total execution count to loop entry block count under which >> +most profile-based code size estimates will be ignored for loop unrolling and >> +peeling. >> + >> @item max-unswitch-insns >> The maximum number of insns of an unswitched loop. 
>> >> Index: loop-unroll.c >> =================================================================== >> --- loop-unroll.c (revision 193626) >> +++ loop-unroll.c (working copy) >> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >> #include "hashtab.h" >> #include "recog.h" >> #include "target.h" >> +#include "gcov-io.h" >> #include "dumpfile.h" >> >> /* This pass performs loop unrolling and peeling. We only perform these >> @@ -147,6 +148,82 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >> + based on code footprint estimates. Returns the codesize-based factor to be >> + divided into the max instructions in an unrolled or peeled loop: >> + 1) For size <= threshold, do not limit (by returning 1). >> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >> + unrolled instructions according to loop hotness. >> + 3) For threshold >= 2*threshold, disable unrolling/peeling (by returning >> + INT_MAX). */ >> + >> +static int >> +code_size_limit_factor(struct loop *loop) >> +{ >> + unsigned size_threshold, num_hot_counters; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + gcov_type sum_to_header_ratio; >> + int hotness_ratio_threshold; >> + int limit_factor; >> + gcov_working_set_t *ws; >> + >> + ws = find_working_set(999); >> + if (! ws) >> + return 1; >> + num_hot_counters = ws->num_counters; >> + >> + /* First check if the application has a large codesize footprint. >> + This is estimated from FDO profile summary information for the >> + program, where the num_hot_counters indicates the number of hottest >> + counters (blocks) that compose most of the execution time of >> + the program. A large value would indicate a large flat execution >> + profile where icache misses may be a concern. 
*/ >> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >> + if (!profile_info >> + || num_hot_counters <= size_threshold >> + || !profile_info->sum_all) >> + return 1; >> + >> + /* Next, exclude some loops where unrolling/peeling may be more >> + important to overall performance. */ >> + >> + /* Ignore FP loops, which are more likely to benefit heavily from >> + unrolling. */ >> + if (desc->has_fp) >> + return 1; >> + >> + /* Next, set the value of the codesize-based unroll factor divisor which in >> + most loops will need to be set to a value that will reduce or eliminate >> + unrolling/peeling. */ >> + if (num_hot_counters < size_threshold * 2 >> + && loop->header->count > 0) >> + { >> + /* For applications that are less than twice the codesize limit, allow >> + limited unrolling for very hot loops. */ >> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >> + hotness_ratio_threshold = PARAM_VALUE >> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >> + /* When the profile count sum to loop entry header ratio is smaller than >> + the threshold (i.e. the loop entry is hot enough), the divisor is set >> + to 1 so the unroll/peel factor is not reduced. When it is bigger >> + than the ratio, increase the divisor by the amount this ratio >> + is over the threshold, which will quickly reduce the unroll/peel >> + factor to zero as the loop's hotness reduces. */ >> + if (sum_to_header_ratio > hotness_ratio_threshold) >> + { >> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >> + gcc_assert (limit_factor >= 1); >> + } >> + else >> + limit_factor = 1; >> + } >> + else >> + /* For appliations that are at least twice the codesize limit, set >> + the divisor to a large value that will force the unroll factor to 0. */ >> + limit_factor = INT_MAX; >> + >> + return limit_factor; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. 
*/ >> void >> unroll_and_peel_loops (int flags) >> @@ -822,6 +899,7 @@ decide_unroll_runtime_iterations (struct loop *loo >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL)) >> @@ -835,10 +913,26 @@ decide_unroll_runtime_iterations (struct loop *loo >> "\n;; Considering unrolling loop with runtime " >> "computable number of iterations\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / >> loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1229,6 +1323,8 @@ decide_peel_simple (struct loop *loop, int flags) >> { >> unsigned npeel; >> double_int iterations; >> + struct niter_desc *desc; >> + int limit_factor = 1; >> >> if (!(flags & UAP_PEEL)) >> { >> @@ -1239,8 +1335,23 @@ decide_peel_simple (struct loop *loop, int flags) >> if (dump_file) >> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >> >> + if (flag_peel_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from peeling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max peeled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* npeel = number of iterations to peel. */ >> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >> + / loop->ninsns; >> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >> >> @@ -1253,7 +1364,7 @@ decide_peel_simple (struct loop *loop, int flags) >> } >> >> /* Do not simply peel loops with branches inside -- it increases number >> - of mispredicts. >> + of mispredicts. >> Exception is when we do have profile and we however have good chance >> to peel proper number of iterations loop will iterate in practice. 
>> TODO: this heuristic needs tunning; while for complette unrolling >> @@ -1261,7 +1372,8 @@ decide_peel_simple (struct loop *loop, int flags) >> peeling it is not the case. Also a function call inside loop is >> also branch from branch prediction POV (and probably better reason >> to not unroll/peel). */ >> - if (num_loop_branches (loop) > 1 >> + desc = get_simple_loop_desc (loop); >> + if (desc->num_branches > 1 >> && profile_status != PROFILE_READ) >> { >> if (dump_file) >> @@ -1387,6 +1499,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> { >> unsigned nunroll, nunroll_by_av, i; >> struct niter_desc *desc; >> + int limit_factor = 1; >> double_int iterations; >> >> if (!(flags & UAP_UNROLL_ALL)) >> @@ -1398,11 +1511,26 @@ decide_unroll_stupid (struct loop *loop, int flags >> if (dump_file) >> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >> >> + if (flag_unroll_codesize_limit) >> + { >> + /* Determine whether to limit code size growth from unrolling, >> + using FDO profile summary information that gives an >> + estimated number of executed blocks. */ >> + limit_factor = code_size_limit_factor (loop); >> + if (dump_file && limit_factor > 1) >> + { >> + fprintf (dump_file, >> + ";; Due to large code size footprint estimate, limit " >> + "max unrolled insns by divisor %d\n", limit_factor); >> + } >> + } >> + >> /* nunroll = total number of copies of the original loop body in >> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >> - nunroll_by_av >> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >> + / loop->ninsns; >> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >> + / limit_factor / loop->av_ninsns; >> if (nunroll > nunroll_by_av) >> nunroll = nunroll_by_av; >> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >> @@ -1434,7 +1562,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> of mispredicts. >> TODO: this heuristic needs tunning; call inside the loop body >> is also relatively good reason to not unroll. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: loop-iv.c >> =================================================================== >> --- loop-iv.c (revision 193626) >> +++ loop-iv.c (working copy) >> @@ -3006,8 +3006,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. 
*/ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: common.opt >> =================================================================== >> --- common.opt (revision 193626) >> +++ common.opt (working copy) >> @@ -1573,6 +1573,10 @@ fpcc-struct-return >> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >> Return small aggregates in memory, not registers >> >> +fpeel-codesize-limit >> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop peeling under profile estimates of large >> code footprint >> + >> fpeel-loops >> Common Report Var(flag_peel_loops) Optimization >> Perform loop peeling >> @@ -2138,6 +2142,10 @@ funroll-all-loops >> Common Report Var(flag_unroll_all_loops) Optimization >> Perform loop unrolling for all loops >> >> +funroll-codesize-limit >> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >> +Limit non-const non-FP loop unrolling under profile estimates of >> large code footprint >> + >> ; Nonzero means that loop optimizer may assume that the induction variables >> ; that control loops do not overflow and that the loops with nontrivial >> ; exit condition are not infinite >> Index: cfgloop.c >> =================================================================== >> --- cfgloop.c (revision 193626) >> +++ cfgloop.c (working copy) >> @@ -1148,24 +1148,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. 
*/ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >> + the number of conditional branch instructions, calls and fp instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header (loop >> + entry), then it is weighted based on its execution count, which >> + will be turned into a ratio compared to the loop header below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it is >> + in a nested inner loop), simply weight the branch the same as the >> + header execution count, so that it will contribute 1 branch to >> + the ratio computed below. 
*/ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution count, >> + to compute the average number of branches through the loop. By adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: cfgloop.h >> =================================================================== >> --- cfgloop.h (revision 193626) >> +++ cfgloop.h (working copy) >> @@ -252,7 +252,6 @@ extern basic_block *get_loop_body_in_custom_order >> extern vec<edge> get_loop_exit_edges (const struct loop *); >> extern edge single_exit (const struct loop *); >> extern edge single_likely_exit (struct loop *loop); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -365,7 +364,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -403,6 +403,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. 
*/ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. */ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -416,6 +428,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: params.def >> =================================================================== >> --- params.def (revision 193626) >> +++ params.def (working copy) >> @@ -321,6 +321,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "max-completely-peel-loop-nest-depth", >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> +/* The maximum code size estimate under which loop unrolling and peeling >> + * is allowed in a profile feedback compile. This currently applies to loops >> + * with non-constant iteration counts and no floating point computations. */ >> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >> + "unrollpeel-codesize-threshold", >> + "Maximum profile-based code size footprint estimate for loop >> unrolling " >> + "and peeling", >> + 15000, 0, 0) >> +/* The maximum ratio of total profiled execution counts to loop entry block >> + count that must be exceeded to ignore most code size limits when unrolling >> + and peeling. 
*/ >> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >> + "unrollpeel-hotness-threshold", >> + "Maximum ratio of total profiled execution count to loop entry " >> + "block count under which most codesize limits for unrolling and " >> + "peeling will be ignored", >> + 100, 1, 0) >> >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> >> On Thu, Jul 26, 2012 at 9:47 PM, Teresa Johnson <tejohnson@google.com> wrote: >>> Updated patch based on review feedback. >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Teresa >>> >>> >>> Original patch description: >>> >>> Ports some patches related to improving FDO program summary information >>> and using it to guide loop unrolling from google branches to mainline. >>> The patch is enhanced to add additional summary information to aid >>> in determining hot/cold decisions. >>> >>> The original patch description is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00437.html >>> and further discussion about incorporating onto mainline is at: >>> http://gcc.gnu.org/ml/gcc-patches/2012-06/threads.html#00414 >>> >>> Honza, can you take a look to see if this patch meets your needs? >>> >>> Full description: >>> >>> This patch adds new program summary information to the gcov >>> profile files that indicate how many profiled counts compose >>> the majority of the program's execution time. This is used to >>> provide an indication of the overall code size of the program. >>> >>> The new profile summary information is then used to guide >>> codesize based unroll and peel decisions, to prevent those >>> optimizations from increasing code size too much when the >>> program may be sensitive to icache effects. 
>>> >>> This patch also pulls in dependent portions of google/main r187660 that cache >>> additional loop analysis results in the niter_desc auxiliary information >>> hanging off the loop structure (the optimization portions of that >>> change are not included here, and have an outstanding review request >>> for mainline). >>> >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >>> >>> Thanks, >>> Teresa >>> >>> 2012-07-26 Teresa Johnson <tejohnson@google.com> >>> >>> * libgcc/libgcov.c (sort_by_reverse_gcov_value): New function. >>> (gcov_compute_cutoff_values): Ditto. >>> (gcov_exit): Call gcov_compute_cutoff_values and merge new summary >>> information. >>> * gcc/doc/invoke.texi (roll much): Document new options >>> -fpeel-codesize-limit and -funroll-codesize-limit, and new params >>> codesize-hotness-threshold and unrollpeel-hotness-threshold. >>> * gcc/gcov-io.c (gcov_write_summary): Write new summary info. >>> (gcov_read_summary): Read new summary info. >>> * gcc/gcov-io.h (GCOV_TAG_SUMMARY_LENGTH): Update for new summary info. >>> (struct gcov_ctr_summary): Add new summary info: num_hot_counters and >>> hot_cutoff_value. >>> * gcc/loop-unroll.c (code_size_limit_factor): New function. >>> (decide_unroll_runtime_iterations): Call code_size_limit_factor >>> to control the unroll factor, and retrieve number of branches from >>> niter_desc instead of via function that walks loop. >>> (decide_peel_simple, decide_unroll_stupid): Ditto. >>> * gcc/coverage.c (read_counts_file): Propagate new summary info. >>> * gcc/loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >>> function, and add guards to enable this function to work for the >>> outermost loop. >>> * gcc/common.opt: Add -fpeel-codesize-limit and >>> -funroll-codesize-limit. >>> * gcc/cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >>> (num_loop_branches): Remove. 
>>> * gcc/cfgloop.h (struct niter_desc): Added new fields to cache >>> additional loop analysis information. >>> (num_loop_branches): Remove. >>> (analyze_loop_insns): Declare. >>> * gcc/params.def (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD): Add. >>> (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD): Ditto. >>> * gcc/gcov-dump.c (tag_summary): Dump new summary info. >>> >>> Index: libgcc/libgcov.c >>> =================================================================== >>> --- libgcc/libgcov.c (revision 189893) >>> +++ libgcc/libgcov.c (working copy) >>> @@ -276,6 +276,120 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >>> return 1; >>> } >>> >>> +/* Used by qsort to sort gcov values in descending order. */ >>> + >>> +static int >>> +sort_by_reverse_gcov_value (const void *pa, const void *pb) >>> +{ >>> + const gcov_type a = *(gcov_type const *)pa; >>> + const gcov_type b = *(gcov_type const *)pb; >>> + >>> + if (b > a) >>> + return 1; >>> + else if (b == a) >>> + return 0; >>> + else >>> + return -1; >>> +} >>> + >>> +/* Determines the number of counters required to cover a given percentage >>> + of the total sum of execution counts in the summary, which is then also >>> + recorded in SUM. */ >>> + >>> +static void >>> +gcov_compute_cutoff_values (struct gcov_summary *sum) >>> +{ >>> + struct gcov_info *gi_ptr; >>> + const struct gcov_fn_info *gfi_ptr; >>> + const struct gcov_ctr_info *ci_ptr; >>> + struct gcov_ctr_summary *cs_ptr; >>> + unsigned t_ix, f_ix, i, ctr_info_ix, index; >>> + gcov_unsigned_t c_num; >>> + gcov_type *value_array; >>> + gcov_type cum, cum_cutoff; >>> + char *cutoff_str; >>> + unsigned cutoff_perc; >>> + >>> +#define CUM_CUTOFF_PERCENT_TIMES_10 999 >>> + cutoff_str = getenv ("GCOV_HOTCODE_CUTOFF_TIMES_10"); >>> + if (cutoff_str && strlen (cutoff_str)) >>> + cutoff_perc = atoi (cutoff_str); >>> + else >>> + cutoff_perc = CUM_CUTOFF_PERCENT_TIMES_10; >>> + >>> + /* This currently only applies to arc counters. 
*/ >>> + t_ix = GCOV_COUNTER_ARCS; >>> + >>> + /* First check if there are any counts recorded for this counter. */ >>> + cs_ptr = &(sum->ctrs[t_ix]); >>> + if (!cs_ptr->num) >>> + return; >>> + >>> + /* Determine the cumulative counter value at the specified cutoff >>> + percentage and record the percentage for use by gcov consumers. >>> + Check for overflow when sum_all is multiplied by the cutoff_perc, >>> + and if so, do the divide first. */ >>> + if ((cs_ptr->sum_all * cutoff_perc) / cutoff_perc != cs_ptr->sum_all) >>> + /* Overflow, do the divide first. */ >>> + cum_cutoff = cs_ptr->sum_all / 1000 * cutoff_perc; >>> + else >>> + /* Otherwise multiply first to get the correct value for small >>> + values of sum_all. */ >>> + cum_cutoff = (cs_ptr->sum_all * cutoff_perc) / 1000; >>> + >>> + /* Next, walk through all the per-object structures and save each of >>> + the count values in value_array. */ >>> + index = 0; >>> + value_array = (gcov_type *) malloc (sizeof (gcov_type) * cs_ptr->num); >>> + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) >>> + { >>> + if (!gi_ptr->merge[t_ix]) >>> + continue; >>> + >>> + /* Find the appropriate index into the gcov_ctr_info array >>> + for the counter we are currently working on based on the >>> + existence of the merge function pointer for this object. */ >>> + for (i = 0, ctr_info_ix = 0; i < t_ix; i++) >>> + { >>> + if (gi_ptr->merge[i]) >>> + ctr_info_ix++; >>> + } >>> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >>> + { >>> + gfi_ptr = gi_ptr->functions[f_ix]; >>> + >>> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >>> + continue; >>> + >>> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >>> + /* Sanity check that there are enough entries in value_array >>> + for this function's counters. Gracefully handle the case when >>> + there are not, in case something in the profile info is >>> + corrupted. 
*/ >>> + c_num = ci_ptr->num; >>> + if (index + c_num > cs_ptr->num) >>> + c_num = cs_ptr->num - index; >>> + /* Copy over this function's counter values. */ >>> + memcpy (&value_array[index], ci_ptr->values, >>> + sizeof (gcov_type) * c_num); >>> + index += c_num; >>> + } >>> + } >>> + >>> + /* Sort all the counter values by descending value and finally >>> + accumulate the values from hottest on down until reaching >>> + the cutoff value computed earlier. */ >>> + qsort (value_array, cs_ptr->num, sizeof (gcov_type), >>> + sort_by_reverse_gcov_value); >>> + for (cum = 0, c_num = 0; c_num < cs_ptr->num && cum < cum_cutoff; c_num++) >>> + cum += value_array[c_num]; >>> + /* Record the number of counters required to reach the cutoff value, >>> + as well as the counter value that caused cum to reach the cutoff. */ >>> + cs_ptr->num_hot_counters = c_num; >>> + cs_ptr->hot_cutoff_value = c_num > 0 ? value_array[c_num-1] : 0; >>> + free (value_array); >>> +} >>> + >>> /* Dump the coverage counts. We merge with existing counts when >>> possible, to avoid growing the .da files ad infinitum. We use this >>> program's checksum to make sure we only accumulate whole program >>> @@ -347,6 +461,7 @@ gcov_exit (void) >>> } >>> } >>> } >>> + gcov_compute_cutoff_values (&this_prg); >>> >>> { >>> /* Check if the level of dirs to strip off specified. 
*/ >>> @@ -598,7 +713,12 @@ gcov_exit (void) >>> if (gi_ptr->merge[t_ix]) >>> { >>> if (!cs_prg->runs++) >>> - cs_prg->num = cs_tprg->num; >>> + cs_prg->num = cs_tprg->num; >>> + if (cs_tprg->num_hot_counters > cs_prg->num_hot_counters) >>> + { >>> + cs_prg->num_hot_counters = cs_tprg->num_hot_counters; >>> + cs_prg->hot_cutoff_value = cs_tprg->hot_cutoff_value; >>> + } >>> cs_prg->sum_all += cs_tprg->sum_all; >>> if (cs_prg->run_max < cs_tprg->run_max) >>> cs_prg->run_max = cs_tprg->run_max; >>> Index: gcc/doc/invoke.texi >>> =================================================================== >>> --- gcc/doc/invoke.texi (revision 189893) >>> +++ gcc/doc/invoke.texi (working copy) >>> @@ -385,7 +385,7 @@ Objective-C and Objective-C++ Dialects}. >>> -fno-sched-interblock -fno-sched-spec -fno-signed-zeros @gol >>> -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss @gol >>> -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls @gol >>> --fpartial-inlining -fpeel-loops -fpredictive-commoning @gol >>> +-fpartial-inlining -fpeel-codesize-limit -fpeel-loops -fpredictive-commoning @gol >>> -fprefetch-loop-arrays @gol >>> -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol >>> -fprofile-generate=@var{path} @gol >>> @@ -417,7 +417,7 @@ Objective-C and Objective-C++ Dialects}. >>> -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol >>> -ftree-switch-conversion -ftree-tail-merge @gol >>> -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol >>> --funit-at-a-time -funroll-all-loops -funroll-loops @gol >>> +-funit-at-a-time -funroll-all-loops -funroll-loops -funroll-codesize-limit @gol >>> -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol >>> -fvariable-expansion-in-unroller -fvect-cost-model -fvpt -fweb @gol >>> -fwhole-program -fwpa -fuse-linker-plugin @gol >>> @@ -8527,6 +8527,14 @@ the loop is entered. 
This usually makes programs >>> @option{-funroll-all-loops} implies the same options as >>> @option{-funroll-loops}. >>> >>> +@item -funroll-codesize-limit >>> +@opindex funroll-codesize-limit >>> +Limit loop unrolling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fpeel-loops >>> @opindex fpeel-loops >>> Peels loops for which there is enough information that they do not >>> @@ -8535,6 +8543,14 @@ roll much (from profile feedback). It also turns >>> >>> Enabled with @option{-fprofile-use}. >>> >>> +@item -fpeel-codesize-limit >>> +@opindex fpeel-codesize-limit >>> +Limit loop peeling of non-const non-FP loops in a profile feedback compilation >>> +under estimates of a large code footprint. Enabled by default with >>> +@option{-fprofile-use}. Code size and execution weight thresholds are controlled >>> +by the @option{unrollpeel-codesize-threshold} and >>> +@option{unrollpeel-hotness-threshold} parameters. >>> + >>> @item -fmove-loop-invariants >>> @opindex fmove-loop-invariants >>> Enables the loop invariant motion pass in the RTL loop optimizer. Enabled >>> @@ -8886,6 +8902,15 @@ The maximum number of iterations of a loop to be s >>> @item max-completely-peel-loop-nest-depth >>> The maximum depth of a loop nest suitable for complete peeling. >>> >>> +@item unrollpeel-codesize-threshold >>> +Maximum profile-based code size footprint estimate for loop unrolling and >>> +peeling. >>> + >>> +@item unrollpeel-hotness-threshold >>> +Maximum ratio of total execution count to loop entry block count under which >>> +most profile-based code size estimates will be ignored for loop unrolling and >>> +peeling. 
>>> + >>> @item max-unswitch-insns >>> The maximum number of insns of an unswitched loop. >>> >>> Index: gcc/gcov-io.c >>> =================================================================== >>> --- gcc/gcov-io.c (revision 189893) >>> +++ gcc/gcov-io.c (working copy) >>> @@ -376,6 +376,8 @@ gcov_write_summary (gcov_unsigned_t tag, const str >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> gcov_write_unsigned (csum->num); >>> + gcov_write_unsigned (csum->num_hot_counters); >>> + gcov_write_counter (csum->hot_cutoff_value); >>> gcov_write_unsigned (csum->runs); >>> gcov_write_counter (csum->sum_all); >>> gcov_write_counter (csum->run_max); >>> @@ -495,6 +497,8 @@ gcov_read_summary (struct gcov_summary *summary) >>> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >>> { >>> csum->num = gcov_read_unsigned (); >>> + csum->num_hot_counters = gcov_read_unsigned (); >>> + csum->hot_cutoff_value = gcov_read_counter (); >>> csum->runs = gcov_read_unsigned (); >>> csum->sum_all = gcov_read_counter (); >>> csum->run_max = gcov_read_counter (); >>> Index: gcc/gcov-io.h >>> =================================================================== >>> --- gcc/gcov-io.h (revision 189893) >>> +++ gcc/gcov-io.h (working copy) >>> @@ -310,7 +310,7 @@ typedef HOST_WIDEST_INT gcov_type; >>> #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ >>> #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) >>> #define GCOV_TAG_SUMMARY_LENGTH \ >>> - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) >>> + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 4 * 2)) >>> >>> /* Counters that are collected. */ >>> #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ >>> @@ -393,6 +393,10 @@ typedef HOST_WIDEST_INT gcov_type; >>> struct gcov_ctr_summary >>> { >>> gcov_unsigned_t num; /* number of counters. */ >>> + gcov_unsigned_t num_hot_counters;/* number of counters to reach a given >>> + percent of sum_all. 
*/ >>> + gcov_type hot_cutoff_value; /* value of smallest counter included in >>> + num_hot_counters. */ >>> gcov_unsigned_t runs; /* number of program runs */ >>> gcov_type sum_all; /* sum of all counters accumulated. */ >>> gcov_type run_max; /* maximum value on a single run. */ >>> Index: gcc/loop-unroll.c >>> =================================================================== >>> --- gcc/loop-unroll.c (revision 189893) >>> +++ gcc/loop-unroll.c (working copy) >>> @@ -32,6 +32,7 @@ along with GCC; see the file COPYING3. If not see >>> #include "hashtab.h" >>> #include "recog.h" >>> #include "target.h" >>> +#include "gcov-io.h" >>> #include "dumpfile.h" >>> >>> /* This pass performs loop unrolling and peeling. We only perform these >>> @@ -151,6 +152,75 @@ static void combine_var_copies_in_loop_exit (struc >>> basic_block); >>> static rtx get_expansion (struct var_to_expand *); >>> >>> +/* Determine whether and how much LOOP unrolling/peeling should be constrained >>> + based on code footprint estimates. Returns the codesize-based factor to be >>> + divided into the max instructions in an unrolled or peeled loop: >>> + 1) For size <= threshold, do not limit (by returning 1). >>> + 2) For threshold < size < 2*threshold, reduce maximum allowed peeled or >>> + unrolled instructions according to loop hotness. >>> + 3) For size >= 2*threshold, disable unrolling/peeling (by returning >>> + INT_MAX). */ >>> + >>> +static int >>> +code_size_limit_factor(struct loop *loop) >>> +{ >>> + unsigned size_threshold; >>> + struct niter_desc *desc = get_simple_loop_desc (loop); >>> + gcov_type sum_to_header_ratio; >>> + int hotness_ratio_threshold; >>> + int limit_factor; >>> + >>> + /* First check if the application has a large codesize footprint. >>> + This is estimated from FDO profile summary information for the >>> + program, where the num_hot_counters indicates the number of hottest >>> + counters (blocks) that compose most of the execution time of >>> + the program. 
A large value would indicate a large flat execution >>> + profile where icache misses may be a concern. */ >>> + size_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_CODESIZE_THRESHOLD); >>> + if (!profile_info >>> + || profile_info->num_hot_counters <= size_threshold >>> + || !profile_info->sum_all) >>> + return 1; >>> + >>> + /* Next, exclude some loops where unrolling/peeling may be more >>> + important to overall performance. */ >>> + >>> + /* Ignore FP loops, which are more likely to benefit heavily from >>> + unrolling. */ >>> + if (desc->has_fp) >>> + return 1; >>> + >>> + /* Next, set the value of the codesize-based unroll factor divisor which in >>> + most loops will need to be set to a value that will reduce or eliminate >>> + unrolling/peeling. */ >>> + if (profile_info->num_hot_counters < size_threshold * 2) >>> + { >>> + /* For applications that are less than twice the codesize limit, allow >>> + limited unrolling for very hot loops. */ >>> + sum_to_header_ratio = profile_info->sum_all / loop->header->count; >>> + hotness_ratio_threshold = PARAM_VALUE (PARAM_UNROLLPEEL_HOTNESS_THRESHOLD); >>> + /* When the profile count sum to loop entry header ratio is smaller than >>> + the threshold (i.e. the loop entry is hot enough), the divisor is set >>> + to 1 so the unroll/peel factor is not reduced. When it is bigger >>> + than the threshold, increase the divisor by the amount this ratio >>> + is over the threshold, which will quickly reduce the unroll/peel >>> + factor to zero as the loop's hotness reduces. */ >>> + if (sum_to_header_ratio > hotness_ratio_threshold) >>> + { >>> + limit_factor = sum_to_header_ratio / hotness_ratio_threshold; >>> + gcc_assert (limit_factor >= 1); >>> + } >>> + else >>> + limit_factor = 1; >>> + } >>> + else >>> + /* For applications that are at least twice the codesize limit, set >>> + the divisor to a large value that will force the unroll factor to 0. 
*/ >>> + limit_factor = INT_MAX; >>> + >>> + return limit_factor; >>> +} >>> + >>> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >>> void >>> unroll_and_peel_loops (int flags) >>> @@ -803,6 +873,7 @@ decide_unroll_runtime_iterations (struct loop *loo >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL)) >>> { >>> @@ -815,10 +886,26 @@ decide_unroll_runtime_iterations (struct loop *loo >>> "\n;; Considering unrolling loop with runtime " >>> "computable number of iterations\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. 
*/ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1196,6 +1283,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> { >>> unsigned npeel; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_PEEL)) >>> { >>> @@ -1206,8 +1294,23 @@ decide_peel_simple (struct loop *loop, int flags) >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering simply peeling loop\n"); >>> >>> + if (flag_peel_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from peeling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max peeled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* npeel = number of iterations to peel. */ >>> - npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / loop->ninsns; >>> + npeel = PARAM_VALUE (PARAM_MAX_PEELED_INSNS) / limit_factor >>> + / loop->ninsns; >>> if (npeel > (unsigned) PARAM_VALUE (PARAM_MAX_PEEL_TIMES)) >>> npeel = PARAM_VALUE (PARAM_MAX_PEEL_TIMES); >>> >>> @@ -1232,7 +1335,7 @@ decide_peel_simple (struct loop *loop, int flags) >>> >>> /* Do not simply peel loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not peeling, contains branches\n"); >>> @@ -1349,6 +1452,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> { >>> unsigned nunroll, nunroll_by_av, i; >>> struct niter_desc *desc; >>> + int limit_factor = 1; >>> >>> if (!(flags & UAP_UNROLL_ALL)) >>> { >>> @@ -1359,11 +1463,26 @@ decide_unroll_stupid (struct loop *loop, int flags >>> if (dump_file) >>> fprintf (dump_file, "\n;; Considering unrolling loop stupidly\n"); >>> >>> + if (flag_unroll_codesize_limit) >>> + { >>> + /* Determine whether to limit code size growth from unrolling, >>> + using FDO profile summary information that gives an >>> + estimated number of executed blocks. */ >>> + limit_factor = code_size_limit_factor (loop); >>> + if (dump_file && limit_factor > 1) >>> + { >>> + fprintf (dump_file, >>> + ";; Due to large code size footprint estimate, limit " >>> + "max unrolled insns by divisor %d\n", limit_factor); >>> + } >>> + } >>> + >>> /* nunroll = total number of copies of the original loop body in >>> unrolled loop (i.e. if it is 2, we have to duplicate loop body once. */ >>> - nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / loop->ninsns; >>> - nunroll_by_av >>> - = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) / loop->av_ninsns; >>> + nunroll = PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS) / limit_factor >>> + / loop->ninsns; >>> + nunroll_by_av = PARAM_VALUE (PARAM_MAX_AVERAGE_UNROLLED_INSNS) >>> + / limit_factor / loop->av_ninsns; >>> if (nunroll > nunroll_by_av) >>> nunroll = nunroll_by_av; >>> if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) >>> @@ -1393,7 +1512,7 @@ decide_unroll_stupid (struct loop *loop, int flags >>> >>> /* Do not unroll loops with branches inside -- it increases number >>> of mispredicts. 
*/ >>> - if (num_loop_branches (loop) > 1) >>> + if (desc->num_branches > 1) >>> { >>> if (dump_file) >>> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >>> Index: gcc/coverage.c >>> =================================================================== >>> --- gcc/coverage.c (revision 189893) >>> +++ gcc/coverage.c (working copy) >>> @@ -247,6 +247,14 @@ read_counts_file (void) >>> gcov_read_summary (&sum); >>> for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++) >>> { >>> + if (summary.ctrs[ix].num_hot_counters >>> + < sum.ctrs[ix].num_hot_counters) >>> + { >>> + summary.ctrs[ix].num_hot_counters >>> + = sum.ctrs[ix].num_hot_counters; >>> + summary.ctrs[ix].hot_cutoff_value >>> + = sum.ctrs[ix].hot_cutoff_value; >>> + } >>> summary.ctrs[ix].runs += sum.ctrs[ix].runs; >>> summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all; >>> if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max) >>> Index: gcc/loop-iv.c >>> =================================================================== >>> --- gcc/loop-iv.c (revision 189893) >>> +++ gcc/loop-iv.c (working copy) >>> @@ -2970,8 +2970,12 @@ get_simple_loop_desc (struct loop *loop) >>> /* At least desc->infinite is not always initialized by >>> find_simple_loop_exit. 
*/ >>> desc = XCNEW (struct niter_desc); >>> - iv_analysis_loop_init (loop); >>> - find_simple_exit (loop, desc); >>> + if (loop->latch != EXIT_BLOCK_PTR) >>> + { >>> + iv_analysis_loop_init (loop); >>> + find_simple_exit (loop, desc); >>> + } >>> + analyze_loop_insns (loop, desc); >>> loop->aux = desc; >>> >>> if (desc->simple_p && (desc->assumptions || desc->infinite)) >>> Index: gcc/common.opt >>> =================================================================== >>> --- gcc/common.opt (revision 189893) >>> +++ gcc/common.opt (working copy) >>> @@ -1551,6 +1551,10 @@ fpcc-struct-return >>> Common Report Var(flag_pcc_struct_return,1) Init(DEFAULT_PCC_STRUCT_RETURN) >>> Return small aggregates in memory, not registers >>> >>> +fpeel-codesize-limit >>> +Common Report Var(flag_peel_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop peeling under profile estimates of large code footprint >>> + >>> fpeel-loops >>> Common Report Var(flag_peel_loops) Optimization >>> Perform loop peeling >>> @@ -2112,6 +2116,10 @@ funroll-all-loops >>> Common Report Var(flag_unroll_all_loops) Optimization >>> Perform loop unrolling for all loops >>> >>> +funroll-codesize-limit >>> +Common Report Var(flag_unroll_codesize_limit) Init(1) Optimization >>> +Limit non-const non-FP loop unrolling under profile estimates of large code footprint >>> + >>> ; Nonzero means that loop optimizer may assume that the induction variables >>> ; that control loops do not overflow and that the loops with nontrivial >>> ; exit condition are not infinite >>> Index: gcc/cfgloop.c >>> =================================================================== >>> --- gcc/cfgloop.c (revision 189893) >>> +++ gcc/cfgloop.c (working copy) >>> @@ -1154,24 +1154,98 @@ get_loop_exit_edges (const struct loop *loop) >>> return edges; >>> } >>> >>> -/* Counts the number of conditional branches inside LOOP. */ >>> +/* Determine if INSN is a floating point set. 
*/ >>> >>> -unsigned >>> -num_loop_branches (const struct loop *loop) >>> +static bool >>> +insn_has_fp_set (rtx insn) >>> { >>> - unsigned i, n; >>> - basic_block * body; >>> + int i; >>> + rtx pat = PATTERN (insn); >>> + if (GET_CODE (pat) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >>> + else if (GET_CODE (pat) == PARALLEL) >>> + { >>> + for (i = 0; i < XVECLEN (pat, 0); i++) >>> + { >>> + rtx sub = XVECEXP (pat, 0, i); >>> + if (GET_CODE (sub) == SET) >>> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >>> + } >>> + } >>> + return false; >>> +} >>> >>> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >>> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently counts >>> + the number of conditional branch instructions, calls and fp instructions, >>> + as well as the average number of branches executed per iteration. */ >>> >>> +void >>> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >>> +{ >>> + unsigned i, nbranch; >>> + gcov_type weighted_nbranch; >>> + bool has_call, has_fp; >>> + basic_block *body, bb; >>> + rtx insn; >>> + gcov_type header_count = loop->header->count; >>> + >>> + nbranch = weighted_nbranch = 0; >>> + has_call = has_fp = false; >>> + >>> body = get_loop_body (loop); >>> - n = 0; >>> for (i = 0; i < loop->num_nodes; i++) >>> - if (EDGE_COUNT (body[i]->succs) >= 2) >>> - n++; >>> + { >>> + bb = body[i]; >>> + >>> + if (EDGE_COUNT (bb->succs) >= 2) >>> + { >>> + nbranch++; >>> + >>> + /* If this block is executed less frequently than the header (loop >>> + entry), then it is weighted based on its execution count, which >>> + will be turned into a ratio compared to the loop header below. */ >>> + if (bb->count < header_count) >>> + weighted_nbranch += bb->count; >>> + >>> + /* When it is executed more frequently than the header (i.e.
it is >>> + in a nested inner loop), simply weight the branch the same as the >>> + header execution count, so that it will contribute 1 branch to >>> + the ratio computed below. */ >>> + else >>> + weighted_nbranch += header_count; >>> + } >>> + >>> + /* No need to iterate through the instructions below if >>> + both flags have already been set. */ >>> + if (has_call && has_fp) >>> + continue; >>> + >>> + FOR_BB_INSNS (bb, insn) >>> + { >>> + if (!INSN_P (insn)) >>> + continue; >>> + >>> + if (!has_call) >>> + has_call = CALL_P (insn); >>> + >>> + if (!has_fp) >>> + has_fp = insn_has_fp_set (insn); >>> + } >>> + } >>> free (body); >>> >>> - return n; >>> + desc->num_branches = nbranch; >>> + /* Now divide the weights computed above by the loop header execution count, >>> + to compute the average number of branches through the loop. By adding >>> + header_count/2 to the numerator we round to nearest with integer >>> + division. */ >>> + if (header_count != 0) >>> + desc->av_num_branches >>> + = (weighted_nbranch + header_count/2) / header_count; >>> + else >>> + desc->av_num_branches = 0; >>> + desc->has_call = has_call; >>> + desc->has_fp = has_fp; >>> } >>> >>> /* Adds basic block BB to LOOP. */ >>> Index: gcc/cfgloop.h >>> =================================================================== >>> --- gcc/cfgloop.h (revision 189893) >>> +++ gcc/cfgloop.h (working copy) >>> @@ -255,7 +255,6 @@ extern basic_block *get_loop_body_in_custom_order >>> >>> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >>> edge single_exit (const struct loop *); >>> -extern unsigned num_loop_branches (const struct loop *); >>> >>> extern edge loop_preheader_edge (const struct loop *); >>> extern edge loop_latch_edge (const struct loop *); >>> @@ -366,7 +365,8 @@ struct rtx_iv >>> }; >>> >>> /* The description of an exit from the loop and of the number of iterations >>> - till we take the exit. */ >>> + till we take the exit. 
Also includes other information used primarily >>> + by the loop unroller. */ >>> >>> struct niter_desc >>> { >>> @@ -407,6 +407,18 @@ struct niter_desc >>> >>> /* The number of iterations of the loop. */ >>> rtx niter_expr; >>> + >>> + /* The number of branches in the loop. */ >>> + unsigned num_branches; >>> + >>> + /* The number of executed branches per iteration. */ >>> + unsigned av_num_branches; >>> + >>> + /* Whether the loop contains a call instruction. */ >>> + bool has_call; >>> + >>> + /* Whether the loop contains fp instructions. */ >>> + bool has_fp; >>> }; >>> >>> extern void iv_analysis_loop_init (struct loop *); >>> @@ -420,6 +432,7 @@ extern void iv_analysis_done (void); >>> >>> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >>> extern void free_simple_loop_desc (struct loop *loop); >>> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >>> >>> static inline struct niter_desc * >>> simple_loop_desc (struct loop *loop) >>> Index: gcc/params.def >>> =================================================================== >>> --- gcc/params.def (revision 189893) >>> +++ gcc/params.def (working copy) >>> @@ -311,6 +311,23 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >>> "max-completely-peel-loop-nest-depth", >>> "The maximum depth of a loop nest we completely peel", >>> 8, 0, 0) >>> +/* The maximum code size estimate under which loop unrolling and peeling >>> + * is allowed in a profile feedback compile. This currently applies to loops >>> + * with non-constant iteration counts and no floating point computations. */ >>> +DEFPARAM(PARAM_UNROLLPEEL_CODESIZE_THRESHOLD, >>> + "unrollpeel-codesize-threshold", >>> + "Maximum profile-based code size footprint estimate for loop unrolling " >>> + "and peeling", >>> + 15000, 0, 0) >>> +/* The maximum ratio of total profiled execution counts to loop entry block >>> + count that must be exceeded to ignore most code size limits when unrolling >>> + and peeling. 
*/ >>> +DEFPARAM(PARAM_UNROLLPEEL_HOTNESS_THRESHOLD, >>> + "unrollpeel-hotness-threshold", >>> + "Maximum ratio of total profiled execution count to loop entry " >>> + "block count under which most codesize limits for unrolling and " >>> + "peeling will be ignored", >>> + 100, 1, 0) >>> >>> /* The maximum number of insns of an unswitched loop. */ >>> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >>> Index: gcc/gcov-dump.c >>> =================================================================== >>> --- gcc/gcov-dump.c (revision 189893) >>> +++ gcc/gcov-dump.c (working copy) >>> @@ -456,8 +456,12 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED >>> { >>> printf ("\n"); >>> print_prefix (filename, 0, 0); >>> - printf ("\t\tcounts=%u, runs=%u", >>> - summary.ctrs[ix].num, summary.ctrs[ix].runs); >>> + printf ("\t\tcounts=%u (num hot counts=%u, hot cutoff count=" >>> + HOST_WIDEST_INT_PRINT_DEC "), runs=%u", >>> + summary.ctrs[ix].num, >>> + summary.ctrs[ix].num_hot_counters, >>> + (HOST_WIDEST_INT)summary.ctrs[ix].hot_cutoff_value, >>> + summary.ctrs[ix].runs); >>> >>> printf (", sum_all=" HOST_WIDEST_INT_PRINT_DEC, >>> (HOST_WIDEST_INT)summary.ctrs[ix].sum_all); >>> >>> -- >>> This patch is available for review at http://codereview.appspot.com/6351086 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
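As a reviewer's note, the decision logic in code_size_limit_factor can be modeled standalone. The sketch below is illustrative only: the function name and simplified signature are hypothetical, and the hard-coded constants stand in for the patch's default params (unrollpeel-codesize-threshold = 15000, unrollpeel-hotness-threshold = 100); it is not the actual GCC implementation.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical standalone model of the patch's divisor heuristic.
   num_hot_counters: counters covering the majority of execution time.
   sum_all: total profiled execution count across the program.
   header_count: execution count of the loop entry (header) block.  */
static int
limit_factor_model (long num_hot_counters, long sum_all, long header_count)
{
  const long size_threshold = 15000;          /* codesize threshold default */
  const long hotness_ratio_threshold = 100;   /* hotness threshold default */

  /* Small footprint or no usable profile: do not limit unrolling.  */
  if (num_hot_counters <= size_threshold || sum_all == 0 || header_count == 0)
    return 1;

  /* At least twice the codesize limit: force the unroll factor to 0.  */
  if (num_hot_counters >= size_threshold * 2)
    return INT_MAX;

  /* Between one and two times the limit: scale the divisor by how cold
     the loop entry is relative to the whole program's profile.  */
  long sum_to_header_ratio = sum_all / header_count;
  if (sum_to_header_ratio > hotness_ratio_threshold)
    return (int) (sum_to_header_ratio / hotness_ratio_threshold);
  return 1;
}
```

The returned divisor then shrinks the insn budgets, e.g. a divisor of 5 cuts PARAM_MAX_UNROLLED_INSNS to a fifth in the nunroll computation, and INT_MAX drives nunroll to 0.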
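One detail in analyze_loop_insns worth calling out: av_num_branches is computed with round-to-nearest integer division by adding header_count/2 to the numerator before dividing. A minimal illustration of that idiom (nothing here is GCC code):

```c
#include <assert.h>

/* Round-to-nearest integer division for non-negative operands:
   adding half the divisor before truncating division rounds the
   quotient to the nearest integer (halves round up).  */
static long
rounded_div (long num, long den)
{
  return (num + den / 2) / den;
}
```

In the patch, num is the execution-count-weighted branch total and den is the loop header count, so the result approximates branches executed per iteration.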