[PATCH] Add counter histogram to fdo summary

Created:    12 years, 8 months ago by tejohnson
Modified:   10 years, 7 months ago
CC:         hubicka_ucw.cz, davidxl, gcc-patches_gcc.gnu.org
Base URL:   svn+ssh://gcc.gnu.org/svn/gcc/trunk/
Visibility: Public

Patch Set 1 / Patch Set 2: [PATCH] Add counter histogram to fdo summary

Total messages: 40
This is a revision of my earlier patch to add working set information to the
gcov program summary, based on discussions between Honza, David and me, and
approved with changes by Honza in this earlier thread on the first patch:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01412.html
I plan to commit this in a couple of days unless more revisions are
requested.

Changes from the earlier patch are the removal of the loop unroller changes
utilizing the new summary info, feeding back an array of working set
information for different cutoffs, and using a histogram to compute the
working set information to avoid expensive copying and qsorting.

Patch description:

Patch to add new summary information on counter working set sizes to gcov
profiles. The program summary now includes an array of working set data,
where each data point corresponds to a percentage of the total sum_all of
the counters and includes the number of hot counters and the minimum
counter value included in that working set. By default, there are 128
working set data points included in the summary, but this is controlled at
runtime with the GCOV_WORKING_SET_ENTRIES environment variable. The first
GCOV_WORKING_SET_ENTRIES - 1 entries correspond to percentages
100.0/GCOV_WORKING_SET_ENTRIES through
100.0/GCOV_WORKING_SET_ENTRIES * (GCOV_WORKING_SET_ENTRIES - 1) of sum_all.
The last entry always corresponds to roughly 99.9% of sum_all. A log2-based
histogram is used to compute the working set information, avoiding the need
to copy all counters into a large array and invoke qsort.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

2012-08-16  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (histo_index): New function.
	(histogram_insert): Ditto.
	(gcov_compute_working_set): Ditto.
	(gcov_exit): Invoke gcov_compute_working_set, and perform merging
	of new working set summary info.
	* gcc/gcov.c (read_count_file): Invoke gcov_destroy_summary to
	clean up allocated working set memory.
	* gcc/gcov-io.c (gcov_write_summary): Include length of working
	set summary info in GCOV_TAG_SUMMARY_LENGTH, and write out
	working set summary.
	(gcov_read_summary): Read working set summary info.
	(gcov_destroy_summary): New function.
	* gcc/gcov-io.h (gcov_type_unsigned): New type.
	(gcov_destroy_summary): Declare.
	(GCOV_TAG_SUMMARY_LENGTH): Update to include working set summary.
	(struct gcov_working_set_info): New structure.
	(struct gcov_ctr_summary): Include working set summary.
	* gcc/coverage.c (htab_counts_entry_del): Free working sets.
	(read_counts_file): Read and merge in working set summary info.
	* gcc/gcov-dump.c (tag_summary): Dump out working set summary info.

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189420)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,252 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Structure used for each bucket of the log2 histogram of counter
+   values.  */
+
+typedef struct
+{
+  /* The total number of BBs whose profile count falls within the bucket.  */
+  gcov_unsigned_t count;
+  /* The minimum profile count placed in this bucket.  */
+  gcov_type min_value;
+  /* This is the cumulative value of the profile counts in this histogram
+     bucket.  */
+  gcov_type cum_value;
+} bucket_type;
+
+/* For a log2 scale histogram with each range split into 4
+   linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2
+   ranges since the lowest 2 log2 values share the lowest 4 linear
+   sub-range (values 0 - 3).  */
+
+#define HISTOGRAM_SIZE ((GCOV_TYPE_SIZE - 1) * 4)
+
+/* Determine the index into histogram for VALUE.  */
+
+static unsigned
+histo_index (gcov_type value)
+{
+  gcov_type_unsigned v = (gcov_type_unsigned)value;
+  unsigned r = 0;
+  unsigned prev2bits = 0;
+
+  /* Find index into log2 scale histogram, where each of the log2
+     sized buckets is divided into 4 linear sub-buckets for better
+     focus in the higher buckets.  */
+
+  /* Find the place of the most-significant bit set.  */
+  if (v > 0)
+    r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v);
+
+  /* If at most the 2 least significant bits are set (value is
+     0 - 3) then that value is our index into the lowest set of
+     four buckets.  */
+  if (r < 2)
+    return (unsigned)value;
+
+  gcc_assert (r < 64);
+
+  /* Find the two next most significant bits to determine which
+     of the four linear sub-buckets to select.  */
+  prev2bits = (v >> (r - 2)) & 0x3;
+  /* Finally, compose the final bucket index from the log2 index and
+     the next 2 bits.  The minimum r value at this point is 2 since we
+     returned above if r was less than 2, so the minimum bucket at this
+     point is 4.  */
+  return (r - 1) * 4 + prev2bits;
+}
+
+/* Insert counter VALUE into HISTOGRAM.  */
+
+static void
+histogram_insert (bucket_type *histogram, gcov_type value)
+{
+  unsigned i;
+
+  i = histo_index (value);
+  gcc_assert (i < HISTOGRAM_SIZE);
+
+  histogram[i].count++;
+  histogram[i].cum_value += value;
+  if (value < histogram[i].min_value)
+    histogram[i].min_value = value;
+}
+
+/* Determines the working set statistics to place in the summary SUM.  This
+   is an array of information for a range of percentages of the total
+   execution count (sum_all), and includes the number of counters required
+   to cover that working set percentage and the minimum counter value in
+   that working set.  */
+
+static void
+gcov_compute_working_set (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, ws_ix, ctr_info_ix, ix;
+  gcov_unsigned_t c_num, count;
+  int h_ix;
+  bucket_type histogram[HISTOGRAM_SIZE];
+  gcov_type *working_set_cum_values;
+  gcov_type ws_cum_hotness_incr;
+  gcov_type cum, tmp_cum;
+  char *ws_entries_str;
+  unsigned num_ws_entries;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
+    {
+      histogram[h_ix].count = 0;
+      histogram[h_ix].min_value = cs_ptr->run_max;
+      histogram[h_ix].cum_value = 0;
+    }
+
+#define MAX_WS_SUMMARY_ENTRIES 128
+  ws_entries_str = getenv ("GCOV_WORKING_SET_ENTRIES");
+  if (ws_entries_str && strlen (ws_entries_str))
+    {
+      num_ws_entries = atoi (ws_entries_str);
+      /* Limit the summary information to a reasonable amount.
+         The maximum value gives statistics for more than every 1%
+         of the sum_all, which should be sufficient granularity for
+         optimizations.  It also ensures that the last entry of 99.9%
+         is the largest value.  */
+      if (num_ws_entries > MAX_WS_SUMMARY_ENTRIES)
+        num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+    }
+  else
+    num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+
+  cs_ptr->working_set_count = num_ws_entries;
+
+  if (! num_ws_entries)
+    return;
+
+  /* Next fill in an array of the cumulative hotness values corresponding
+     to each working set summary entry we are going to compute below.  */
+  working_set_cum_values
+      = (gcov_type *) malloc (sizeof (gcov_type) * num_ws_entries);
+
+  /* Compute the amount of sum_all that the cumulative hotness grows
+     by in each successive working set entry, which depends on the
+     number of working set entries requested.  By default num_ws_entries
+     is a power of two so that the divide becomes a shift.  */
+  ws_cum_hotness_incr = cs_ptr->sum_all / num_ws_entries;
+
+  /* Skip 0% statistics, which can be extrapolated from the
+     rest of the summary data.  */
+  cum = ws_cum_hotness_incr;
+  for (ws_ix = 0; ws_ix < num_ws_entries; ws_ix++, cum += ws_cum_hotness_incr)
+    working_set_cum_values[ws_ix] = cum;
+  /* The last summary entry is reserved for (roughly) 99.9% of the
+     working set.  Divide by 1024 so it becomes a shift, which gives
+     almost exactly 99.9%.  */
+  working_set_cum_values[num_ws_entries - 1]
+      = cs_ptr->sum_all - cs_ptr->sum_all / 1024;
+
+  /* Next, walk through all the per-object structures and record each of
+     the count values in histogram.  */
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++)
+        {
+          if (gi_ptr->merge[ix])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          for (ix = 0; ix < ci_ptr->num; ix++)
+            histogram_insert (histogram, ci_ptr->values[ix]);
+        }
+    }
+
+  /* Next, walk through the histogram in descending order of hotness
+     and compute the statistics for the working set summary array.
+     As histogram entries are accumulated, we check to see which
+     working set entries have had their expected cum_value reached
+     and fill them in, walking the working set entries in increasing
+     size of cum_value.  */
+  cs_ptr->working_sets = (gcov_ws_info_t *) malloc (
+      num_ws_entries * sizeof (gcov_ws_info_t));
+  ws_ix = 0;  /* The current entry into the working set array.  */
+  cum = 0;    /* The current accumulated counter sum.  */
+  count = 0;  /* The current accumulated count of block counters.  */
+  for (h_ix = HISTOGRAM_SIZE - 1; h_ix > 0 && ws_ix < num_ws_entries; h_ix--)
+    {
+      /* If we haven't reached the required cumulative counter value for
+         the current working set percentage, simply accumulate this
+         histogram entry into the running sums and continue to the next
+         histogram entry.  */
+      if (cum + histogram[h_ix].cum_value < working_set_cum_values[ws_ix])
+        {
+          cum += histogram[h_ix].cum_value;
+          count += histogram[h_ix].count;
+          continue;
+        }
+
+      /* If adding the current histogram entry's cumulative counter value
+         causes us to exceed the current working set size, then estimate
+         how many of this histogram entry's counter values are required to
+         reach the working set size, and fill in working set entries
+         as we reach their expected cumulative value.  */
+      for (c_num = 0, tmp_cum = cum;
+           c_num < histogram[h_ix].count && ws_ix < num_ws_entries;
+           c_num++)
+        {
+          count++;
+          /* If we haven't reached the last histogram entry counter, add
+             in the minimum value again.  This will underestimate the
+             cumulative sum so far, because many of the counter values in
+             this entry may have been larger than the minimum.  We could add
+             in the average value every time, but that would require an
+             expensive divide operation.  */
+          if (c_num + 1 < histogram[h_ix].count)
+            tmp_cum += histogram[h_ix].min_value;
+          /* If we have reached the last histogram entry counter, then add
+             in the entire cumulative value.  */
+          else
+            tmp_cum = cum + histogram[h_ix].cum_value;
+
+          /* Next walk through successive working set entries and fill in
+             the statistics for any whose size we have reached by
+             accumulating this histogram counter.  */
+          while (tmp_cum >= working_set_cum_values[ws_ix]
+                 && ws_ix < num_ws_entries)
+            {
+              cs_ptr->working_sets[ws_ix].num_counters = count;
+              cs_ptr->working_sets[ws_ix].min_bb_counter
+                  = histogram[h_ix].min_value;
+              ws_ix++;
+            }
+        }
+      /* Finally, update the running cumulative value since we were
+         using a temporary above.  */
+      cum += histogram[h_ix].cum_value;
+    }
+  gcc_assert (ws_ix == num_ws_entries);
+  free (working_set_cum_values);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +593,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_working_set (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.  */
@@ -389,7 +636,7 @@ gcov_exit (void)
   /* Now merge each file.  */
   for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
     {
-      unsigned n_counts;
+      unsigned n_counts, ws_cnt;
       struct gcov_summary prg; /* summary for this object over all program.  */
       struct gcov_ctr_summary *cs_prg, *cs_tprg, *cs_all;
@@ -482,8 +729,6 @@ gcov_exit (void)
 		f_ix--;
 	      length = gcov_read_unsigned ();
-	      if (length != GCOV_TAG_SUMMARY_LENGTH)
-		goto read_mismatch;
 	      gcov_read_summary (&tmp);
 	      if ((error = gcov_is_error ()))
 		goto read_error;
@@ -597,8 +842,35 @@ gcov_exit (void)
 
 	  if (gi_ptr->merge[t_ix])
 	    {
+	      ws_cnt = cs_tprg->working_set_count;
 	      if (!cs_prg->runs++)
-		cs_prg->num = cs_tprg->num;
+		{
+		  cs_prg->num = cs_tprg->num;
+		  /* Allocate the working set array for the merged
+		     summary.  */
+		  if (ws_cnt)
+		    {
+		      cs_prg->working_set_count = ws_cnt;
+		      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
+			  ws_cnt * sizeof (gcov_ws_info_t));
+		    }
+		}
+	      else if (cs_prg->num != cs_tprg->num
+		       || ws_cnt != cs_prg->working_set_count)
+		goto read_mismatch;
+	      /* Copy over this run's working set information if either this
+		 is the first run, the total size of the profile (sum_all) is
+		 much (50%) larger for this run (need to check this before
+		 updating cs_prg->sum_all below), or this run has a larger
+		 working set in the largest (99.9% of sum_all) bucket.  */
+	      if (ws_cnt
+		  && (cs_prg->runs == 1
+		      || (cs_tprg->sum_all
+			  > (cs_prg->sum_all + cs_prg->sum_all / 2))
+		      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
+			  > cs_prg->working_sets[ws_cnt - 1].num_counters)))
+		memcpy (cs_prg->working_sets,
+			cs_tprg->working_sets,
+			ws_cnt * sizeof (gcov_ws_info_t));
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
@@ -696,7 +968,11 @@ gcov_exit (void)
 		   "profiling:%s:Overflow writing\n" :
 		   "profiling:%s:Error writing\n",
 		   gi_filename);
+
+      gcov_destroy_summary (&prg);
     }
+
+  gcov_destroy_summary (&this_prg);
 }
 
 /* Reset all counters to zero.  */
Index: gcc/gcov.c
===================================================================
--- gcc/gcov.c	(revision 189924)
+++ gcc/gcov.c	(working copy)
@@ -1264,6 +1264,7 @@ read_count_file (function_t *fns)
 	  struct gcov_summary summary;
 	  gcov_read_summary (&summary);
 	  object_runs += summary.ctrs[GCOV_COUNTER_ARCS].runs;
+	  gcov_destroy_summary (&summary);
 	  program_count++;
 	}
       else if (tag == GCOV_TAG_FUNCTION && !length)
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189924)
+++ gcc/gcov-io.c	(working copy)
@@ -368,10 +368,13 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u
 GCOV_LINKAGE void
 gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix, ws_cnt;
   const struct gcov_ctr_summary *csum;
 
-  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH);
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE, ws_cnt = 0;
+       ix--; csum++)
+    ws_cnt += csum->working_set_count;
+  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH (ws_cnt));
   gcov_write_unsigned (summary->checksum);
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
@@ -380,6 +383,12 @@ gcov_write_summary (gcov_unsigned_t tag, const str
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
       gcov_write_counter (csum->sum_max);
+      gcov_write_unsigned (csum->working_set_count);
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  gcov_write_unsigned (csum->working_sets[ws_ix].num_counters);
+	  gcov_write_counter (csum->working_sets[ws_ix].min_bb_counter);
+	}
     }
 }
 #endif /* IN_LIBGCOV */
@@ -488,7 +497,7 @@ gcov_read_string (void)
 GCOV_LINKAGE void
 gcov_read_summary (struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix;
   struct gcov_ctr_summary *csum;
 
   summary->checksum = gcov_read_unsigned ();
@@ -499,9 +508,35 @@ gcov_read_summary (struct gcov_summary *summary)
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
       csum->sum_max = gcov_read_counter ();
+      csum->working_set_count = gcov_read_unsigned ();
+      csum->working_sets
+#if IN_LIBGCOV
+	  = (gcov_ws_info_t *) malloc (csum->working_set_count *
+#else
+	  = (gcov_ws_info_t *) xmalloc (csum->working_set_count *
+#endif
+				       sizeof (gcov_ws_info_t));
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  csum->working_sets[ws_ix].num_counters = gcov_read_unsigned ();
+	  csum->working_sets[ws_ix].min_bb_counter = gcov_read_counter ();
+	}
     }
 }
 
+GCOV_LINKAGE void
+gcov_destroy_summary (struct gcov_summary *summary)
+{
+  unsigned ix;
+  struct gcov_ctr_summary *csum;
+
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
+    {
+      if (csum->working_set_count)
+	free (csum->working_sets);
+    }
+}
+
 #if !IN_LIBGCOV
 /* Reset to a known position.  BASE should have been obtained from
    gcov_position, LENGTH should be a record length.  */
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189924)
+++ gcc/gcov-io.h	(working copy)
@@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect
 	counts: header int64:count*
 	summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
 	count-summary:	int32:num int32:runs int64:sum
-			int64:max int64:sum_max
+			int64:max int64:sum_max working-set-summary
+	working-set-summary: int32:count working-sets*
+	working-sets: int32:num int64:min
 
    The ANNOUNCE_FUNCTION record is the same as that in the note file,
    but without the source location.  The COUNTS gives the
@@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (SI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (DI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #endif
 #else
 #if BITS_PER_UNIT == 16
@@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (HI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #endif
 #else
 typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI)));
 typedef unsigned gcov_position_t __attribute__ ((mode (QI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (QI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI)));
 #endif
 #endif
 #endif
@@ -213,8 +221,6 @@ typedef HOST_WIDEST_INT gcov_type;
 #if IN_GCOV > 0
 #include <sys/types.h>
 #endif
-#else /*!IN_GCOV */
-#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
 #endif
 
 #if defined (HOST_HAS_F_SETLKW)
@@ -225,6 +231,8 @@ typedef HOST_WIDEST_INT gcov_type;
 
 #endif /* !IN_LIBGCOV */
 
+#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
+
 /* In gcov we want function linkage to be static.  In the compiler we want
    it extern, so that they can be accessed from elsewhere.  In libgcov we
    need these functions to be extern, so prefix them with __gcov.  In
@@ -248,6 +256,7 @@ typedef HOST_WIDEST_INT gcov_type;
 #define gcov_read_unsigned __gcov_read_unsigned
 #define gcov_read_counter __gcov_read_counter
 #define gcov_read_summary __gcov_read_summary
+#define gcov_destroy_summary __gcov_destroy_summary
 
 /* Poison these, so they don't accidentally slip in.  */
 #pragma GCC poison gcov_write_string gcov_write_tag gcov_write_length
@@ -309,9 +318,10 @@ typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2)
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
-#define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+#define GCOV_TAG_SUMMARY_LENGTH(NUM)  \
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 3)
+
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS	0  /* Arc transitions.  */
 #define GCOV_COUNTERS_SUMMABLE	1  /* Counters which can be
@@ -389,6 +399,16 @@ typedef HOST_WIDEST_INT gcov_type;
 
 /* Structured records.  */
 
+/* Working set size statistics for a given percentage of the entire
+   profile (sum_all).  */
+typedef struct gcov_working_set_info
+{
+  /* Number of hot counters included in this working set.  */
+  gcov_unsigned_t num_counters;
+  /* Smallest counter included in this working set.  */
+  gcov_type min_bb_counter;
+} gcov_ws_info_t;
+
 /* Cumulative counter data.  */
 struct gcov_ctr_summary
 {
@@ -397,6 +417,8 @@ struct gcov_ctr_summary
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
   gcov_type sum_max;		/* sum of individual run max values.  */
+  gcov_unsigned_t working_set_count; /* number of working set infos.  */
+  gcov_ws_info_t *working_sets;	/* array of working set info.  */
 };
 
 /* Object & program summary record.  */
@@ -552,6 +574,7 @@ static int gcov_is_error (void);
 GCOV_LINKAGE gcov_unsigned_t gcov_read_unsigned (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE gcov_type gcov_read_counter (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
+GCOV_LINKAGE void gcov_destroy_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
 
 #if IN_LIBGCOV
 /* Available only in libgcov */
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189924)
+++ gcc/coverage.c	(working copy)
@@ -172,6 +172,8 @@ htab_counts_entry_del (void *of)
   counts_entry_t *const entry = (counts_entry_t *) of;
 
   free (entry->counts);
+  if (entry->summary.working_sets)
+    free (entry->summary.working_sets);
   free (entry);
 }
 
@@ -247,12 +249,40 @@ read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+	      gcov_unsigned_t ws_cnt = sum.ctrs[ix].working_set_count;
+	      /* Allocate the working set array for a new summary.  */
+	      if (ws_cnt && new_summary)
+		{
+		  summary.ctrs[ix].working_set_count = ws_cnt;
+		  summary.ctrs[ix].working_sets
+		      = (gcov_ws_info_t *) xmalloc (ws_cnt
+						    * sizeof (gcov_ws_info_t));
+		}
+	      /* Copy over this summary's working set information if either
+		 summary is new, the total size of the profile (sum_all) is
+		 much (50%) larger for this summary (need to check this
+		 before updating sum_all below), or this summary has a larger
+		 working set in the largest (99.9% of sum_all) bucket.  */
+	      if (ws_cnt
+		  && ws_cnt == summary.ctrs[ix].working_set_count
+		  && (new_summary
+		      || (sum.ctrs[ix].sum_all
+			  > (summary.ctrs[ix].sum_all
+			     + summary.ctrs[ix].sum_all / 2))
+		      || (sum.ctrs[ix].working_sets[ws_cnt - 1].num_counters
+			  > summary.ctrs[ix].working_sets[ws_cnt - 1]
+			      .num_counters)))
+		{
+		  memcpy (summary.ctrs[ix].working_sets,
+			  sum.ctrs[ix].working_sets,
+			  ws_cnt * sizeof (gcov_ws_info_t));
+		}
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
 		summary.ctrs[ix].run_max = sum.ctrs[ix].run_max;
 	      summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max;
 	    }
+	  gcov_destroy_summary (&sum);
 	  new_summary = 0;
 	}
       else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident)
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189924)
+++ gcc/gcov-dump.c	(working copy)
@@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	     unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED)
 {
   struct gcov_summary summary;
-  unsigned ix;
+  unsigned ix, jx, pctinc, pct;
+  gcov_ws_info_t *ws_info;
 
   gcov_read_summary (&summary);
   printf (" checksum=0x%08x", summary.checksum);
@@ -465,5 +466,27 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	      (HOST_WIDEST_INT)summary.ctrs[ix].run_max);
       printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC,
	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_max);
+      if (! summary.ctrs[ix].working_set_count)
+	continue;
+      printf ("\n");
+      print_prefix (filename, 0, 0);
+      printf ("\t\tworking set statistics:");
+      /* Multiply the percentage by 100 to avoid float.  */
+      pctinc = 100 * 100 / summary.ctrs[ix].working_set_count;
+      for (jx = 0, pct = pctinc; jx < summary.ctrs[ix].working_set_count;
+	   jx++, pct += pctinc)
+	{
+	  if (jx == summary.ctrs[ix].working_set_count - 1)
+	    pct = 9999;
+	  ws_info = &summary.ctrs[ix].working_sets[jx];
+	  printf ("\n");
+	  print_prefix (filename, 0, 0);
+	  /* Print out the percentage using integer arithmetic to avoid
+	     float.  */
+	  printf ("\t\t%u.%u%%: num counts=%u, min counter="
		  HOST_WIDEST_INT_PRINT_DEC,
+		  pct / 100, pct % 100,
+		  ws_info->num_counters,
+		  (HOST_WIDEST_INT)ws_info->min_bb_counter);
+	}
     }
+  gcov_destroy_summary (&summary);
 }

--
This patch is available for review at http://codereview.appspot.com/6465057
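[Editorial note: to make the bucketing scheme concrete, here is a standalone
sketch, not part of the patch, that replicates the histo_index logic for
64-bit counters and prints the bucket chosen for a few sample values. The
TYPE_SIZE constant and the main driver are illustrative additions, and
__builtin_clzll assumes a GCC-compatible compiler.]

#include <stdio.h>

#define TYPE_SIZE 64
#define HISTOGRAM_SIZE ((TYPE_SIZE - 1) * 4)

static unsigned
histo_index (unsigned long long v)
{
  unsigned r = 0;

  /* Position of the most-significant set bit.  */
  if (v > 0)
    r = TYPE_SIZE - 1 - __builtin_clzll (v);

  /* Values 0-3 index the lowest four buckets directly.  */
  if (r < 2)
    return (unsigned) v;

  /* The two bits below the MSB select one of four linear sub-buckets
     within the log2 range.  */
  return (r - 1) * 4 + ((v >> (r - 2)) & 0x3);
}

int
main (void)
{
  unsigned long long samples[] = { 0, 3, 4, 7, 8, 11, 100, 1000000 };
  unsigned i;

  for (i = 0; i < sizeof (samples) / sizeof (*samples); i++)
    printf ("value %llu -> bucket %u of %d\n",
            samples[i], histo_index (samples[i]), HISTOGRAM_SIZE);
  return 0;
}

Each power-of-two range contributes four buckets, so nearby hot counter
values still land in distinct buckets while the table stays a fixed
(64 - 1) * 4 = 252 entries.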
> +	      {
> +		cs_prg->num = cs_tprg->num;
> +		/* Allocate the working set array for the merged
> +		   summary.  */
> +		if (ws_cnt)
> +		  {
> +		    cs_prg->working_set_count = ws_cnt;
> +		    cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> +			ws_cnt * sizeof (gcov_ws_info_t));
> +		  }
> +	      }
> +	    else if (cs_prg->num != cs_tprg->num
> +		     || ws_cnt != cs_prg->working_set_count)
> +	      goto read_mismatch;
> +	    /* Copy over this run's working set information if either this
> +	       is the first run, the total size of the profile (sum_all) is
> +	       much (50%) larger for this run (need to check this before
> +	       updating cs_prg->sum_all below), or this run has a larger
> +	       working set in the largest (99.9% of sum_all) bucket.  */
> +	    if (ws_cnt
> +		&& (cs_prg->runs == 1
> +		    || (cs_tprg->sum_all
> +			> (cs_prg->sum_all + cs_prg->sum_all / 2))
> +		    || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> +			> cs_prg->working_sets[ws_cnt - 1].num_counters)))
> +	      memcpy (cs_prg->working_sets,
> +		      cs_tprg->working_sets,
> +		      ws_cnt * sizeof (gcov_ws_info_t));
> 	    cs_prg->sum_all += cs_tprg->sum_all;

Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
hundred times, this will end up really inaccurate, since it will probably
just store the histogram of the insn-attrtab compilation, right?

Why don't you simply write the histogram into the gcov file and avoid
merging the values here (i.e. do the accumulation loop in GCC instead of
within libgcov)?

By default you are streaming 128 values, which is the same as is needed to
stream the histogram. I suppose we can have an environment variable to
reduce the histogram size - I guess in smaller setups a smaller histogram
will run just fine...

Honza
On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > +	    /* Copy over this run's working set information if either this
> > +	       is the first run, the total size of the profile (sum_all) is
> > +	       much (50%) larger for this run (need to check this before
> > +	       updating cs_prg->sum_all below), or this run has a larger
> > +	       working set in the largest (99.9% of sum_all) bucket.  */
> > +	    if (ws_cnt
> > +		&& (cs_prg->runs == 1
> > +		    || (cs_tprg->sum_all
> > +			> (cs_prg->sum_all + cs_prg->sum_all / 2))
> > +		    || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> > +			> cs_prg->working_sets[ws_cnt - 1].num_counters)))
> > +	      memcpy (cs_prg->working_sets,
> > +		      cs_tprg->working_sets,
> > +		      ws_cnt * sizeof (gcov_ws_info_t));
> > 	    cs_prg->sum_all += cs_tprg->sum_all;
>
> Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
> hundred times, this will end up really inaccurate, since it will probably
> just store the histogram of the insn-attrtab compilation, right?

Well, it should store the largest working set in BBs, or one that came
from a much longer run. But I see the issue you are pointing out. The
num_counters (the number of hot bbs) should be reasonable, as the
total number of BBs is the same between all runs, and I want to show
the largest or more dominant working set in terms of the number of hot
bbs. But the min_bb_counter will become more and more inaccurate as
the number of runs increases, since the counter values are
accumulated.

I typed up a detailed email below on why getting this right would be
difficult, but then I realized there might be a fairly simple accurate
solution, which I'll describe first:

The only way I see to do this completely accurately is to take two
passes through the existing gcda files that we are merging into, one
to read in all the counter values and merge them into all the counter
values in the arrays from the current run (after which we can do the
histogramming/working set computation accurately from the merged
counters), and the second to rewrite them. In libgcov this doesn't
seem like it would be too difficult to do, although it is a little bit
of restructuring of the main merging loop and needs some special
handling for buffered functions (which could increase the memory
footprint a bit if there are many of these, since they will all need to
be buffered across the iteration over all the gcda files).

The summary merging in coverage.c confuses me a bit, as it seems to be
handling the case when there are multiple program summaries in a
single gcda file. When would this happen? It looks like the merge
handling in libgcov should always produce a single program summary per
gcda file.

> Why don't you simply write the histogram into the gcov file and avoid
> merging the values here (i.e. do the accumulation loop in GCC instead of
> within libgcov)?

That doesn't completely solve the problem, unfortunately. The reason
is that you don't know which histogram entry contains a particular
block each run, and therefore you don't know how to compute the new
combined histogram value and index for that bb counter. For example, a
particular histogram index may have 5 counters (bbs) in it for one run
and 2 counters (bbs) in it for a second run, so the question is how to
compute the new entry for all of those bb counters, as the 5 bbs from
the first run may or may not be a superset of the 2 from the second
run. You could assume that the bbs have the same relative order of
hotness between runs, and combine the bb counters accordingly, but
there is still some trickiness because you have to apportion the
cumulative counter sum stored in the histogram entry between new
entries. For example, assume the highest indexed non-zero entries (the
histogram buckets containing the largest counter values) in the two
incoming histograms are:

histogram 1:

index 100: 4 counters, cumulative value X, min counter value minx
...

histogram 2:

index 100: 2 counters, cumulative value Y, min counter value miny
index 99: 3 counters, cumulative value W, min counter value minw
...

To merge, you could assume something like 2 counters with a new
cumulative value (Y + X*2/4), and new min counter value minx+miny,
that go into the merged histogram with the index corresponding to
counter value minx+miny. And then 2 counters have a new cumulative
value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
into the merged histogram with index corresponding to counter value
minw+minx. Etc... Not entirely accurate, although it might be a
reasonable approximation, but it also requires a number of division
operations during the merge in libgcov.

Another possibility, that might also provide a reasonable
approximation, would be to scale the min_bb_counter values in the
working sets by sum_all_merged/sum_all_orig, where sum_all_merged
is the new sum_all, and sum_all_orig is the sum_all from the summary
whose working_set was chosen to be propagated to the new merged
summary. This also requires some divisions at libgcov merge time,
unless we save the original sum_all along with the working set in the
summary and do the scaling at profile-use time.

> By default you are streaming 128 values, which is the same as is needed
> to stream the histogram. I suppose we can have an environment variable
> to reduce the histogram size - I guess in smaller setups a smaller
> histogram will run just fine...

It is a bit more, as the histogram will have 64*4 = 256 entries for
64-bit counters, and each entry contains 2 gcov_type counter values
and one unsigned int. The working set entries only have one gcov_type
counter and one unsigned. So it could be close to 4x.

What do you think?

Thanks,
Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
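[Editorial note: for illustration, a minimal C sketch of the bucket-merge
approximation Teresa describes, assuming counters within a bucket are paired
by rank across runs. The bucket type and merge_buckets function are
hypothetical, not part of the patch or libgcov.]

#include <stdio.h>

typedef struct
{
  unsigned count;       /* counters (bbs) in the bucket */
  long long cum_value;  /* sum of those counters */
  long long min_value;  /* smallest counter in the bucket */
} bucket;

/* Merge N counters taken from run-1 bucket B1 with N counters taken from
   run-2 bucket B2 into OUT; a caller would walk both histograms from
   hottest to coldest, pairing counters by rank.  */
static void
merge_buckets (bucket *out, const bucket *b1, const bucket *b2, unsigned n)
{
  out->count = n;
  /* Apportion each run's cumulative value by the fraction of its bucket
     consumed; these divisions are the merge-time cost noted above.  */
  out->cum_value = b1->cum_value * n / b1->count
                   + b2->cum_value * n / b2->count;
  /* Paired-by-rank counters sum, so the new minimum is minx + miny.  */
  out->min_value = b1->min_value + b2->min_value;
}

int
main (void)
{
  /* The index-100 buckets from the example above, with made-up numbers:
     X = 400, minx = 90 and Y = 250, miny = 120.  */
  bucket b1 = { 4, 400, 90 };
  bucket b2 = { 2, 250, 120 };
  bucket out;

  merge_buckets (&out, &b1, &b2, 2);
  /* Prints 2 counters, cum = Y + X*2/4 = 450, min = minx + miny = 210.  */
  printf ("merged: %u counters, cum=%lld, min=%lld\n",
          out.count, out.cum_value, out.min_value);
  return 0;
}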
On Sun, Aug 19, 2012 at 9:59 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
>> hundred times, this will end up really inaccurate, since it will probably
>> just store the histogram of the insn-attrtab compilation, right?
>
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.
>
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these, since they will all need to
> be buffered across the iteration over all the gcda files).
>

The current way of doing summary merging produces imprecise data (e.g.,
sum_max). It computes the current execution's summary first, and then
merges it with the previous run's summary data while the profile counters
are merged. The right way is to merge the profile counters first, and then
recompute the final summary based on the merged profile. The latter way
allows merging of more advanced summary data such as the histogram here.
It is also used in LIPO mode (for the dynamic callgraph build/merge --
kind of a summary itself).

However, I think changing the way the summary is merged should probably be
done in a different patch.

David

> The summary merging in coverage.c confuses me a bit, as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.
>
>> Why don't you simply write the histogram into the gcov file and avoid
>> merging the values here (i.e. do the accumulation loop in GCC instead
>> of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values, which is the same as is needed
>> to stream the histogram. I suppose we can have an environment variable
>> to reduce the histogram size - I guess in smaller setups a smaller
>> histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?
>
> Thanks,
> Teresa
>
>> Honza
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
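[Editorial note: a rough, hypothetical sketch of the two-pass flow that
Teresa and David describe, i.e. merge the counters from all .gcda files
first and only then compute the summary. Every name below is an
illustrative placeholder, not a real libgcov entry point.]

#include <stdio.h>

static const char *gcda_files[] = { "a.gcda", "b.gcda" };  /* stand-ins */
#define N_FILES (sizeof (gcda_files) / sizeof (*gcda_files))

static void
read_and_merge_counters (const char *path)
{
  /* Pass 1: read PATH and add its on-disk counters into this run's
     in-memory counter arrays.  */
  printf ("pass 1: merging counters from %s\n", path);
}

static void
write_counters_and_summary (const char *path)
{
  /* Pass 2: rewrite PATH with the merged counters and the recomputed
     summary.  */
  printf ("pass 2: rewriting %s\n", path);
}

int
main (void)
{
  unsigned i;

  for (i = 0; i < N_FILES; i++)
    read_and_merge_counters (gcda_files[i]);

  /* Only now is the complete merged profile known, so histogram and
     working set summaries can be computed exactly rather than
     approximated at merge time.  */
  printf ("computing summary from merged counters\n");

  for (i = 0; i < N_FILES; i++)
    write_counters_and_summary (gcda_files[i]);
  return 0;
}

The cost, as noted in the thread, is holding counters (and any buffered
functions) in memory across the whole iteration, and a wider window for
races between concurrent profiled processes.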
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.

Yes, and that is the one that we plan to use to determine hot/cold
decisions on, right?

Note that there is no real 1-1 correspondence between BBs and counters.
For each function there should be num_edges-num_bbs+1 counters.
What do you plan to use BB counts for?

> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these, since they will all need to
> be buffered across the iteration over all the gcda files).
>
> The summary merging in coverage.c confuses me a bit, as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.

This is something Nathan introduced years back. The idea was IMO to handle
objects linked into multiple binaries more accurately. I am not sure
if the code really works or worked at some point.

The idea, as I recall it, was to produce an overall checksum of all objects
and have different profile records for each combination.

This is not really useful for profile feedback, as you generate a single
object file for all uses, but it might become useful for LTO, where you
know into which binary you are linking. I am not really sure it is worth
all the infrastructure needed to support this though.

> > Why don't you simply write the histogram into the gcov file and avoid
> > merging the values here (i.e. do the accumulation loop in GCC instead
> > of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
> > By default you are streaming 128 values, which is the same as is
> > needed to stream the histogram. I suppose we can have an environment
> > variable to reduce the histogram size - I guess in smaller setups a
> > smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?

So we have options:

1) ignore the problem that summaries become inaccurate with many train
runs, as we do now.

2) write the histogram only, do not track BB counts or approximate them by
scaling, and perhaps retire max_counter from the summary in favour of a
histogram estimate.

We will pay by more data being written (I think the histogram may actually
compress pretty well by skipping zero entries, don't know) and by getting
only an estimated max_count from the histogram, which still should be good
enough in practice (max_count is used only for the hot/cold decisions, and
those will be histogram based anyway).

3) do two-stage processing, reading data into memory first, then producing
summaries and writing them out.

2) seems appealing to me because it is simple, but there are limitations in
what it can handle.
3) solves the precision problems, but how do we handle the locking & races?
In a case like bootstrap, where GCCs are executed in parallel, we will end
up with random results if the files get modified by another profiled run of
GCC between read-in and write-out.

So I definitely prefer 2 or 3 over 1. David has experience with 3. How well
does it work for LIPO?

Honza

> Thanks,
> Teresa
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Well, it should store the largest working set in BBs, or one that came
>> from a much longer run. But I see the issue you are pointing out. The
>> num_counters (the number of hot bbs) should be reasonable, as the
>> total number of BBs is the same between all runs, and I want to show
>> the largest or more dominant working set in terms of the number of hot
>> bbs. But the min_bb_counter will become more and more inaccurate as
>> the number of runs increases, since the counter values are
>> accumulated.
>
> Yes, and that is the one that we plan to use to determine hot/cold
> decisions on, right?

Yes, so I agree it is important to do something to update this as
profiles are merged.

> Note that there is no real 1-1 correspondence between BBs and counters.
> For each function there should be num_edges-num_bbs+1 counters.
> What do you plan to use BB counts for?

True, although what I am using this for is to get an idea of the size of
the working set to help identify large icache footprints, in order to
control code-size-increasing optimizations such as unrolling. The idea is
that a huge number of hot counters indicates a huge number of hot bbs,
which in turn indicates a huge number of hot instructions and therefore
high icache pressure. We're using it in the google branch to control
unrolling successfully.

>> The summary merging in coverage.c confuses me a bit, as it seems to be
>> handling the case when there are multiple program summaries in a
>> single gcda file. When would this happen? It looks like the merge
>> handling in libgcov should always produce a single program summary per
>> gcda file.
>
> This is something Nathan introduced years back. The idea was IMO to
> handle objects linked into multiple binaries more accurately. I am not
> sure if the code really works or worked at some point.
>
> The idea, as I recall it, was to produce an overall checksum of all
> objects and have different profile records for each combination.
>
> This is not really useful for profile feedback, as you generate a single
> object file for all uses, but it might become useful for LTO, where you
> know into which binary you are linking. I am not really sure it is worth
> all the infrastructure needed to support this though.

Ok, so perhaps for the merging in coverage.c we can do something less
accurate, and worry more about the libgcov merging.

>> > Why don't you simply write the histogram into the gcov file and avoid
>> > merging the values here (i.e. do the accumulation loop in GCC instead
>> > of within libgcov)?
>>
>> That doesn't completely solve the problem, unfortunately. The reason
>> is that you don't know which histogram entry contains a particular
>> block each run, and therefore you don't know how to compute the new
>> combined histogram value and index for that bb counter.
>>
>> Another possibility, that might also provide a reasonable
>> approximation, would be to scale the min_bb_counter values in the
>> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
>> is the new sum_all, and sum_all_orig is the sum_all from the summary
>> whose working_set was chosen to be propagated to the new merged
>> summary. This also requires some divisions at libgcov merge time,
>> unless we save the original sum_all along with the working set in the
>> summary and do the scaling at profile-use time.
>
> So we have options:
>
> 1) ignore the problem that summaries become inaccurate with many train
> runs, as we do now.

Or 1b) do some sum_all based scaling of the min_bb_counter values as I
describe a couple paragraphs above.

> 2) write the histogram only, do not track BB counts or approximate them
> by scaling, and perhaps retire max_counter from the summary in favour of
> a histogram estimate.

By max_counter do you mean sum_max?

> We will pay by more data being written (I think the histogram may
> actually compress pretty well by skipping zero entries, don't know) and
> by getting only an estimated max_count from the histogram, which still
> should be good enough in practice (max_count is used only for the
> hot/cold decisions, and those will be histogram based anyway).

Regarding the data being written, I think we can minimize this since the
histogram should be pretty sparse. So store only non-zero entries. The
histogram indices can be recomputed on read from the min counter value
stored in the entry.

> 3) do two-stage processing, reading data into memory first, then
> producing summaries and writing them out.
>
> 2) seems appealing to me because it is simple, but there are limitations
> in what it can handle.
> 3) solves the precision problems, but how do we handle the locking &
> races? In a case like bootstrap, where GCCs are executed in parallel, we
> will end up with random results if the files get modified by another
> profiled run of GCC between read-in and write-out.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How
> well does it work for LIPO?

I think the locking issue is reduced for LIPO because each gcda file is
read/merged/written in a single pass as it is on trunk, and the lipo
module info is written subsequently to a separate file. Although that
could certainly get a little stale if there are many simultaneous profile
runs merging the files.

To address this for the working set summary info, I think we would have to
maintain the current per-gcda-file read/merge/write for the counters, but
not write any working set summary info on the first pass. Then compute the
working set summary and rewrite all the gcda files with it. The working
set summary may get a little stale before it is written (just as in the
lipo case for the module info), although at least the individual counters
will not. Since the individual counters will get merged properly, there
may be a good chance that one of the final profile runs will compute an
accurate working set as it is finishing up.

If this approach seems feasible, then we could stick with the current
approach of emitting the working set array in the summary, mitigating it
somewhat by doing the sum_all based scaling of the counter values, then in
a follow-on patch restructure the merging code to delay the working set
computation as described above. Otherwise, either go with the approach of
scaling the counter values using the sum_all ratio, or encode/merge the
histograms instead.

Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
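[Editorial note: a minimal sketch, not in the patch, of the "option 1b"
scaling discussed above: rescale the stored min_bb_counter thresholds by
the growth of sum_all when a previously recorded working set is carried
over into a merged summary. The types mirror the patch's
gcov_type/gcov_ws_info_t; the function itself is hypothetical.]

typedef long long counter_t;     /* stands in for gcov_type */

typedef struct
{
  unsigned num_counters;         /* hot counters in the working set */
  counter_t min_bb_counter;      /* smallest counter included */
} ws_info_t;                     /* stands in for gcov_ws_info_t */

static void
scale_working_set (ws_info_t *ws, unsigned ws_cnt,
                   counter_t sum_all_orig, counter_t sum_all_merged)
{
  unsigned i;

  if (sum_all_orig <= 0)
    return;
  for (i = 0; i < ws_cnt; i++)
    /* One multiply and divide per entry; a real implementation would
       guard against overflow of the intermediate product, or defer the
       scaling to profile-use time by also storing sum_all_orig, as
       suggested above.  */
    ws[i].min_bb_counter = ws[i].min_bb_counter * sum_all_merged
                           / sum_all_orig;
}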
On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> The summary merging in coverage.c confuses me a bit as it seems to be >> handling the case when there are multiple program summaries in a >> single gcda file. When would this happen? It looks like the merge >> handling in libgcov should always produce a single program summary per >> gcda file. AFAIU it merges existing profile data with new profile data from a second (or third, or ...) trial run. Ciao! Steven
> > So I definitely preffer 2 or 3 over 1. David has experience with 3. How well does > it work for LIPO? > This (lack of locking, races) is not a new problem. There is no synchronization in libgcov for profile update/merge at either the thread or the process level. Thread-level data races lead to inconsistent counters, but can be mostly corrected under -fprofile-correction and smoothing. There is also a problem with false indirect call targets --- gcc has a mechanism to filter those out. Process-level synchronization problems can happen when two processes (running the instrumented binary) exit at the same time. The updated/merged counters from one process may be overwritten by another process -- this is true for both counter data and summary data. Solution 3) does not introduce any new problems. thanks, David > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> If this approach seems like it is feasible, then we could stick with > the current approach of emitting the working set array in the summary, > mitigating it somewhat by doing the sum_all based scaling of the > counter values, then in a follow on patch restructure the merging code > to delay the working set computation as described above. +1 David > Teresa > >> >> Honza >>> >>> Thanks, >>> Teresa >>> >>> > >>> > Honza >>> >>> >>> >>> >>> -- >>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 8:35 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> The summary merging in coverage.c confuses me a bit as it seems to be >>> handling the case when there are multiple program summaries in a >>> single gcda file. When would this happen? It looks like the merge >>> handling in libgcov should always produce a single program summary per >>> gcda file. > > AFAIU it merges existing profile data with new profile data from a > second (or third, or ...) trial run. It looks like libgcov will always merge program summaries between different runs though. As far as I can tell, it will always either rewrite the whole gcda file with a single merged program summary, or abort the write if the merge was not successful. However, the comment above gcov_exit does talk about keeping program summaries separate for object files contained in different programs, which is what Honza was describing as the motivation for the coverage.c merging. But I don't see where multiple program summaries ever get written to the same gcda file - if the checksums are different it seems to abort the write. But the code in coverage.c is dealing with a single gcda file containing multiple program summaries. Is there code somewhere that will cause this to happen? Thanks, Teresa > > Ciao! > Steven -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 7:44 PM, Teresa Johnson <tejohnson@google.com> wrote: > But > the code in coverage.c is dealing with a single gcda file containing > multiple program summaries. Is there code somewhere that will cause > this to happen? Not that I know of, no. (There are enhancement requests for this, see e.g. PR47618). Ciao! Steven
Xinliang David Li <davidxl@google.com> writes: > > Process level synchronization problems can happen when two processes > (running the instrumented binary) exit at the same time. The > updated/merged counters from one process may be overwritten by another > process -- this is true for both counter data and summary data. > Solution 3) does not introduce any new problems. You could just use lockf() ? -Andi
I was mistaken here -- gcov_open actually uses locking via the fcntl interface -- so we do need to find a way to solve the summary update synchronization problem when it is separated out of the per-file update loop. David On Mon, Aug 20, 2012 at 12:03 PM, Andi Kleen <andi@firstfloor.org> wrote: > Xinliang David Li <davidxl@google.com> writes: >> >> Process level synchronization problems can happen when two processes >> (running the instrumented binary) exit at the same time. The >> updated/merged counters from one process may be overwritten by another >> process -- this is true for both counter data and summary data. >> Solution 3) does not introduce any new problems. > > You could just use lockf() ? > > -Andi
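For reference, a whole-file advisory lock through fcntl looks roughly like the sketch below. This is a reconstruction of the style of locking involved, not the actual gcov-io code, which also deals with file creation and read/write modes.

#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive advisory lock on an open gcda file descriptor,
   blocking until it is granted.  Sketch only.  */
static int
lock_gcda_file (int fd)
{
  struct flock fl = { 0 };

  fl.l_type = F_WRLCK;   /* exclusive (write) lock */
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;          /* length 0 covers the whole file */

  /* F_SETLKW waits for the lock; holding many such locks at once is
     where the deadlock concern with a global two-pass scheme comes
     from.  */
  return fcntl (fd, F_SETLKW, &fl);
}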
> Xinliang David Li <davidxl@google.com> writes: > > > > Process level synchronization problems can happen when two processes > > (running the instrumented binary) exit at the same time. The > > updated/merged counters from one process may be overwritten by another > > process -- this is true for both counter data and summary data. > > Solution 3) does not introduce any new problems. > > You could just use lockf() ? The issue here is holding locks for all the files (which can be many) versus the limits on the number of locks & the possibilities for deadlocking (mind that updating may happen in different orders on the same files for different programs built from the same objects). For David: there is no thread safety code in mainline for the counters. Long ago Zdenek implemented poor-man's TLS for counters (before TLS was invented) http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down as too memory expensive per thread. We could optionally do atomic updates like ICC, or a combination of both as discussed in the thread. So far no one has implemented it, since the coverage fixups seem to work well enough in practice for multithreaded programs, where reproducibility does not seem to be _that_ important. For a GCC profiled bootstrap, however, I would like to see the output binary be reproducible. We really ought to make profile updates safe for multiple processes. Trashing a whole process run is worse than racing on an increment. There is a good chance that one of the runs is more important than the others and it will get trashed. I do not think we have serious update problems in the summaries at the moment. We lock individual files as we update them. The summary is simple enough to be safe: sum_all is summed, max_all is the maximum over the individual runs. Even when you combine multiple programs the summary will end up the same. Everything except max_all is ignored anyway. Solution 2 (i.e. histogram streaming) will also have the property that it is safe WRT multiple programs, just like sum_all. Honza > > -Andi
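For reference, the "atomic updates like ICC" alternative mentioned here would amount to something like the following sketch using the GCC __atomic builtins; this is an illustration of the idea only, not anything mainline libgcov does.

typedef long long gcov_type;

/* Arc counter increment done atomically instead of as a plain
   load/add/store, trading update speed for thread-safe counters.  */
static inline void
gcov_atomic_increment (gcov_type *counter)
{
  __atomic_fetch_add (counter, 1, __ATOMIC_RELAXED);
}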
On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Xinliang David Li <davidxl@google.com> writes: >> > >> > Process level synchronization problems can happen when two processes >> > (running the instrumented binary) exit at the same time. The >> > updated/merged counters from one process may be overwritten by another >> > process -- this is true for both counter data and summary data. >> > Solution 3) does not introduce any new problems. >> >> You could just use lockf() ? > > The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) > > For David: there is no thread safety code in mainline for the counters. > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > as too memory expensive per thread. We could optionaly do atomic updates like ICC > or combination of both as discussed in the thread. > So far no one implemented it since the coverage fixups seems to work well enough in > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > important. > > For GCC profiled bootstrap however I would like to see the output binary to be > reproducible. We realy ought to update profiles safe for multple processes. > Trashing whole process run is worse than doing race in increment. There is good > chance that one of runs is more important than others and it will get trashed. > > I do not think we do have serious update problems in the summaries at the moment. > We lock individual files as we update them. The summary is simple enough to be safe. > sum_all is summed, max_all is maximum over the individual runs. Even when you combine > mutiple programs the summary will end up same. Everything except for max_all is ignored > anyway. > > Solution 2 (i.e. histogram streaming) will also have the property that it is safe > WRT multiple programs, just like sum_all. I think the sum_all based scaling of the working set entries also has this property. What is your opinion on saving the histogram in the summary and merging histograms together as best as possible compared to the alternative of saving the working set information as now and scaling it up by the ratio between the new and old sum_all when merging? Thanks, Teresa > > Honza >> >> -Andi -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
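A minimal sketch of the sum_all based scaling being weighed here, assuming working set entries shaped like the patch's (a hot-counter count plus a minimum counter value); the entry and function names are made up for illustration. The integer arithmetic is where merge-order-dependent roundoff can creep in.

typedef long long gcov_type;

typedef struct
{
  unsigned num_counters;   /* number of hot counters in this entry */
  gcov_type min_counter;   /* minimum counter value in the entry */
} working_set_entry;

/* Scale a previously computed working set by sum_all_merged /
   sum_all_orig, where sum_all_orig is the sum_all of the summary the
   working set was computed from.  A real implementation would also
   have to guard against overflow in the multiplication.  */
static void
scale_working_set (working_set_entry *ws, unsigned n,
                   gcov_type sum_all_orig, gcov_type sum_all_merged)
{
  unsigned i;

  for (i = 0; i < n; i++)
    ws[i].min_counter = ws[i].min_counter * sum_all_merged / sum_all_orig;
}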
> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> Xinliang David Li <davidxl@google.com> writes: > >> > > >> > Process level synchronization problems can happen when two processes > >> > (running the instrumented binary) exit at the same time. The > >> > updated/merged counters from one process may be overwritten by another > >> > process -- this is true for both counter data and summary data. > >> > Solution 3) does not introduce any new problems. > >> > >> You could just use lockf() ? > > > > The issue here is holding lock for all the files (that can be many) versus > > number of locks limits & possibilities for deadlocking (mind that updating > > may happen in different orders on the same files for different programs built > > from same objects) > > > > For David: there is no thread safety code in mainline for the counters. > > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > > as too memory expensive per thread. We could optionaly do atomic updates like ICC > > or combination of both as discussed in the thread. > > So far no one implemented it since the coverage fixups seems to work well enough in > > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > > important. > > > > For GCC profiled bootstrap however I would like to see the output binary to be > > reproducible. We realy ought to update profiles safe for multple processes. > > Trashing whole process run is worse than doing race in increment. There is good > > chance that one of runs is more important than others and it will get trashed. > > > > I do not think we do have serious update problems in the summaries at the moment. > > We lock individual files as we update them. The summary is simple enough to be safe. > > sum_all is summed, max_all is maximum over the individual runs. Even when you combine > > mutiple programs the summary will end up same. Everything except for max_all is ignored > > anyway. > > > > Solution 2 (i.e. histogram streaming) will also have the property that it is safe > > WRT multiple programs, just like sum_all. > > I think the sum_all based scaling of the working set entries also has > this property. What is your opinion on saving the histogram in the I think the scaling will have at least roundoff issues WRT different merging orders. > summary and merging histograms together as best as possible compared > to the alternative of saving the working set information as now and > scaling it up by the ratio between the new and old sum_all when > merging? So far I like this option best. But David seems to lean towards the third option, with whole-file locking. I see it may prove to be more extensible in the future. At the moment I do not understand two things: 1) why do we need info on the number of counters above a given threshold, since the hot/cold decisions usually depend purely on the count cutoff? Removing those will solve the merging issues with variant 2, and then it would probably be a good solution. 2) Do we plan to add some features in the near future that will anyway require global locking? I guess LIPO itself does not count, since it streams its data into an independent file as you mentioned earlier, and locking the LIPO file is not that hard. Does LIPO stream everything into that common file, or does it use a combination of gcda files and a common summary? What other stuff does Google plan to merge?
(In general I would be curious about merging plans WRT profile stuff, so we get more synchronized and effective on getting patches in. We have about two months to get it done in stage1 and it would be nice to get as much as possible. Obviously some of the patches will need a bit of discussion, like this one. Hope you do not find it frustrating; I actually think this is an important feature). I also realized today that the common value counters (used by switch, indirect call and div/mod value profiling) are non-stable WRT different merging orders (i.e. a parallel make in the train run). I do not think there is an actual solution to that except for not merging the counter section of this type in libgcov and merging them in some canonical order at profile feedback time. Perhaps we just want to live with this, since the discrepancy here is small. (i.e. these counters are quite rare and their outcome has just a local effect on the final binary, unlike the global summaries/edge counters). Honza > > Thanks, > Teresa > > > > > Honza > >> > >> -Andi > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
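To see why these counters are order sensitive: a single-value profiler keeps a (value, count, all) triple and merges runs with a majority-vote style rule, so which value survives can depend on the order the train runs complete. The sketch below is in the style of libgcov's __gcov_merge_single, reconstructed for illustration; the real routine reads the incoming triples from the gcda stream.

typedef long long gcov_type;

/* Merge one incoming (value, count, all) triple into the in-memory
   triple COUNTERS.  */
static void
merge_single (gcov_type counters[3], gcov_type value,
              gcov_type count, gcov_type all)
{
  if (counters[0] == value)
    counters[1] += count;
  else if (count > counters[1])
    {
      /* The incoming value displaces the current candidate.  */
      counters[0] = value;
      counters[1] = count - counters[1];
    }
  else
    counters[1] -= count;
  counters[2] += all;
}

Merging runs with triples (1, 10, 10) and (2, 10, 10) into a zeroed counter, for instance, leaves value 1 in one order and value 2 in the other -- exactly the train-run-order instability described above.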
On Mon, Aug 20, 2012 at 10:29 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> Xinliang David Li <davidxl@google.com> writes: >> >> > >> >> > Process level synchronization problems can happen when two processes >> >> > (running the instrumented binary) exit at the same time. The >> >> > updated/merged counters from one process may be overwritten by another >> >> > process -- this is true for both counter data and summary data. >> >> > Solution 3) does not introduce any new problems. >> >> >> >> You could just use lockf() ? >> > >> > The issue here is holding lock for all the files (that can be many) versus >> > number of locks limits & possibilities for deadlocking (mind that updating >> > may happen in different orders on the same files for different programs built >> > from same objects) >> > >> > For David: there is no thread safety code in mainline for the counters. >> > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) >> > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down >> > as too memory expensive per thread. We could optionaly do atomic updates like ICC >> > or combination of both as discussed in the thread. >> > So far no one implemented it since the coverage fixups seems to work well enough in >> > pracitce for multithreaded programs where reproducibility do not seem to be _that_ >> > important. >> > >> > For GCC profiled bootstrap however I would like to see the output binary to be >> > reproducible. We realy ought to update profiles safe for multple processes. >> > Trashing whole process run is worse than doing race in increment. There is good >> > chance that one of runs is more important than others and it will get trashed. >> > >> > I do not think we do have serious update problems in the summaries at the moment. >> > We lock individual files as we update them. The summary is simple enough to be safe. >> > sum_all is summed, max_all is maximum over the individual runs. Even when you combine >> > mutiple programs the summary will end up same. Everything except for max_all is ignored >> > anyway. >> > >> > Solution 2 (i.e. histogram streaming) will also have the property that it is safe >> > WRT multiple programs, just like sum_all. >> >> I think the sum_all based scaling of the working set entries also has >> this property. What is your opinion on saving the histogram in the > > I think the scaling will have at lest roundoff issues WRT different merging orders. > >> summary and merging histograms together as best as possible compared >> to the alternative of saving the working set information as now and >> scaling it up by the ratio between the new and old sum_all when >> merging? > > So far I like this option best. But David seems to lean towards thirtd option with > whole file locking. I see it may show to be more extensible in the future. > At the moment I do not understand two things > 1) why do we need info on the number of counter above given threshold, sinc ethe hot/cold > decisions usually depends purely on the count cutoff. > Removing those will solve merging issues with variant 2 and then it would be probably > good solution. This is useful for large applications with a long tail. The instruction working set for those applications are very large, and inliner and unroller need to be aware of that and good heuristics can be developed to throttle aggressive code bloat transformations. 
For the inliner, it is kind of like the global budget, but more application dependent. In the long run, we will collect a more advanced fdo summary regarding the working set -- it will be the working set size for each code region (locality region). > 2) Do we plan to add some features in near future that will anyway require global locking? > I guess LIPO itself does not count since it streams its data into independent file as you > mentioned earlier and locking LIPO file is not that hard. > Does LIPO stream everything into that common file, or does it use combination of gcda files > and common summary? Actually, LIPO module grouping information is stored in gcda files. It is also stored in a separate .imports file (one per object) --- this is primarily used by our build system for dependence information. > > What other stuff Google plans to merge? > (In general I would be curious about merging plans WRT profile stuff, so we get more > synchronized and effective on getting patches in. We have about two months to get it done > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think > this is an important feature). We plan to merge in the new LIPO implementation based on LTO streaming. Rong Xu finished this in a 4.6-based compiler, and he needs to port it to 4.8. thanks, David > > I also realized today that the common value counters (used by switch, indirect > call and div/mod value profiling) are non-stanble WRT different merging orders > (i.e. parallel make in train run). I do not think there is actual solution to > that except for not merging the counter section of this type in libgcov and > merge them in some canonical order at profile feedback time. Perhaps we just > want to live with this, since the disprepancy here is small. (i.e. these > counters are quite rare and their outcome has just local effect on the final > binary, unlike the global summaries/edge counters). > > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> -Andi >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> > This is useful for large applications with a long tail. The > instruction working set for those applications are very large, and > inliner and unroller need to be aware of that and good heuristics can > be developed to throttle aggressive code bloat transformations. For > inliner, it is kind of the like the global budget but more application > dependent. In the long run, we will collect more advanced fdo summary > regarding working set -- it will be working set size for each code > region (locality region). I see, so you use it to estimate the size of the working set and the effect of bloating optimizations on cache size. This sounds interesting. What are your experiences with this? What concerns me is that it is greatly inaccurate - you have no idea how many instructions a given counter is guarding, and it can differ quite a lot. Also inlining/optimization makes working sets significantly different (by a factor of 100 for tramp3d). But on the other hand any solution at this level will be greatly inaccurate. So I am curious how reliable the data you can get from this is. How do you take this into account in the heuristics? It seems to me that for this use perhaps the simple logic in histogram merging maximizing the number of BBs for a given bucket will work well? It is inaccurate, but we are working with greatly inaccurate data anyway. Except for degenerate cases, the small and unimportant runs will have small BB counts, while large runs will have larger counts, and those are the ones we optimize for anyway. > > > > 2) Do we plan to add some features in near future that will anyway require global locking? > > I guess LIPO itself does not count since it streams its data into independent file as you > > mentioned earlier and locking LIPO file is not that hard. > > Does LIPO stream everything into that common file, or does it use combination of gcda files > > and common summary? > > Actually, LIPO module grouping information are stored in gcda files. > It is also stored in a separate .imports file (one per object) --- > this is primarily used by our build system for dependence information. I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave on GCC bootstrap? (i.e. it does a lot more work in the libgcov module on each invocation, so I am curious if it is practically useful at all). With an LTO-based solution a lot can probably be pushed to link time? Before the actual GCC starts from the linker plugin, the LIPO module can read gcov CFGs from gcda files and do all the merging/updating/CFG construction that is currently performed at runtime, right? > > > > > > What other stuff Google plans to merge? > > (In general I would be curious about merging plans WRT profile stuff, so we get more > > synchronized and effective on getting patches in. We have about two months to get it done > > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will > > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think > > this is an important feature). > > We plan to merge in the new LIPO implementation based on LTO > streaming. Rong Xu finished this in 4.6 based compiler, and he needs > to port it to 4.8. Good. Looks like a lot of work ahead. It would be nice if we could perhaps start by merging the libgcov infrastructure updates prior to the LIPO changes. From what I saw on the LIPO branch some time ago, it has a lot of stuff that is not exactly LIPO-specific.
Honza > > > thanks, > > David > > > > > I also realized today that the common value counters (used by switch, indirect > > call and div/mod value profiling) are non-stanble WRT different merging orders > > (i.e. parallel make in train run). I do not think there is actual solution to > > that except for not merging the counter section of this type in libgcov and > > merge them in some canonical order at profile feedback time. Perhaps we just > > want to live with this, since the disprepancy here is small. (i.e. these > > counters are quite rare and their outcome has just local effect on the final > > binary, unlike the global summaries/edge counters). > > > > Honza > >> > >> Thanks, > >> Teresa > >> > >> > > >> > Honza > >> >> > >> >> -Andi > >> > >> > >> > >> -- > >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 11:33 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> This is useful for large applications with a long tail. The >> instruction working set for those applications are very large, and >> inliner and unroller need to be aware of that and good heuristics can >> be developed to throttle aggressive code bloat transformations. For >> inliner, it is kind of the like the global budget but more application >> dependent. In the long run, we will collect more advanced fdo summary >> regarding working set -- it will be working set size for each code >> region (locality region). > > I see, so you use it to estimate size of the working set and effect of bloating > optimizations on cache size. This sounds interesting. What are you experiences > with this? Teresa has done some tunings for the unroller so far. The inliner tuning is the next step. > > What concerns me that it is greatly inaccurate - you have no idea how many > instructions given counter is guarding and it can differ quite a lot. Also > inlining/optimization makes working sets significantly different (by factor of > 100 for tramp3d). The pre ipa-inline working set is the one that is needed for ipa inliner tuning. For post-ipa inline code increase transformations, some update is probably needed. >But on the ohter hand any solution at this level will be > greatly inaccurate. So I am curious how reliable data you can get from this? > How you take this into account for the heuristics? This effort is just the first step to allow good heuristics to develop. > > It seems to me that for this use perhaps the simple logic in histogram merging > maximizing the number of BBs for given bucket will work well? It is > inaccurate, but we are working with greatly inaccurate data anyway. > Except for degenerated cases, the small and unimportant runs will have small BB > counts, while large runs will have larger counts and those are ones we optimize > for anyway. The working set curve for each type of applications contains lots of information that can be mined. The inaccuracy can also be mitigated by more data 'calibration'. >> >> >> > 2) Do we plan to add some features in near future that will anyway require global locking? >> > I guess LIPO itself does not count since it streams its data into independent file as you >> > mentioned earlier and locking LIPO file is not that hard. >> > Does LIPO stream everything into that common file, or does it use combination of gcda files >> > and common summary? >> >> Actually, LIPO module grouping information are stored in gcda files. >> It is also stored in a separate .imports file (one per object) --- >> this is primarily used by our build system for dependence information. > > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave > on GCC bootstrap? We have not tried gcc bootstrap with LIPO. Gcc compile time is not the main problem for application build -- the link time (for debug build) is. > (i.e. it does a lot more work in the libgcov module per each > invocation, so I am curious if it is practically useful at all). > > With LTO based solution a lot can be probably pushed at link time? Before > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from > gcda files and do all the merging/updating/CFG constructions that is currently > performed at runtime, right? The dynamic cgraph build and analysis is still done at runtime. However, with the new implementation, FE is no longer involved. 
Gcc driver is modified to understand module grouping, and lto is used to merge the streamed output from aux modules. David >> >> >> > >> > What other stuff Google plans to merge? >> > (In general I would be curious about merging plans WRT profile stuff, so we get more >> > synchronized and effective on getting patches in. We have about two months to get it done >> > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will >> > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think >> > this is an important feature). >> >> We plan to merge in the new LIPO implementation based on LTO >> streaming. Rong Xu finished this in 4.6 based compiler, and he needs >> to port it to 4.8. > > Good. Looks like a lot of work ahead. It would be nice if we can perhaps start > by merging the libgcov infrastructure updates prior the LIPO changes. From > what I saw at LIPO branch some time ago it has a lot of stuff that is not > exactly LIPO specific. > > Honza >> >> >> thanks, >> >> David >> >> > >> > I also realized today that the common value counters (used by switch, indirect >> > call and div/mod value profiling) are non-stanble WRT different merging orders >> > (i.e. parallel make in train run). I do not think there is actual solution to >> > that except for not merging the counter section of this type in libgcov and >> > merge them in some canonical order at profile feedback time. Perhaps we just >> > want to live with this, since the disprepancy here is small. (i.e. these >> > counters are quite rare and their outcome has just local effect on the final >> > binary, unlike the global summaries/edge counters). >> > >> > Honza >> >> >> >> Thanks, >> >> Teresa >> >> >> >> > >> >> > Honza >> >> >> >> >> >> -Andi >> >> >> >> >> >> >> >> -- >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> Teresa has done some tunings for the unroller so far. The inliner > tuning is the next step. > > > > > What concerns me that it is greatly inaccurate - you have no idea how many > > instructions given counter is guarding and it can differ quite a lot. Also > > inlining/optimization makes working sets significantly different (by factor of > > 100 for tramp3d). > > The pre ipa-inline working set is the one that is needed for ipa > inliner tuning. For post-ipa inline code increase transformations, > some update is probably needed. > > >But on the ohter hand any solution at this level will be > > greatly inaccurate. So I am curious how reliable data you can get from this? > > How you take this into account for the heuristics? > > This effort is just the first step to allow good heuristics to develop. > > > > > It seems to me that for this use perhaps the simple logic in histogram merging > > maximizing the number of BBs for given bucket will work well? It is > > inaccurate, but we are working with greatly inaccurate data anyway. > > Except for degenerated cases, the small and unimportant runs will have small BB > > counts, while large runs will have larger counts and those are ones we optimize > > for anyway. > > The working set curve for each type of applications contains lots of > information that can be mined. The inaccuracy can also be mitigated by > more data 'calibration'. Sure, I think I am leaning towards trying solution 2) with maximizing counter count merging (probably it would make sense to rename it from BB count, since it is not really a BB count and is thus misleading) and we will see how well it works in practice. We get the benefit of far fewer issues with profile locking/unlocking, and we lose a bit of precision on BB counts. I tend to believe that the error will not be that important in practice. Another loss is more histogram streaming into each gcda file, but with skipping zero entries it should not be a major overhead problem, I hope. What do you think? > > >> > >> > >> > 2) Do we plan to add some features in near future that will anyway require global locking? > >> > I guess LIPO itself does not count since it streams its data into independent file as you > >> > mentioned earlier and locking LIPO file is not that hard. > >> > Does LIPO stream everything into that common file, or does it use combination of gcda files > >> > and common summary? > >> > >> Actually, LIPO module grouping information are stored in gcda files. > >> It is also stored in a separate .imports file (one per object) --- > >> this is primarily used by our build system for dependence information. > > > > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave > > on GCC bootstrap? > > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > main problem for application build -- the link time (for debug build) > is. I was primarily curious how LIPO's runtime analysis fares in the situation where you do very many small train runs on a rather large app (sure, GCC is small to google's use case ;). > > > (i.e. it does a lot more work in the libgcov module per each > > invocation, so I am curious if it is practically useful at all). > > > > With LTO based solution a lot can be probably pushed at link time? Before > > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from > > gcda files and do all the merging/updating/CFG constructions that is currently > > performed at runtime, right?
> > The dynamic cgraph build and analysis is still done at runtime. > However, with the new implementation, FE is no longer involved. Gcc > driver is modified to understand module grouping, and lto is used to > merge the streamed output from aux modules. I see. Are there any fundamental reasons why it cannot be done at link-time, when all gcda files are available? Why is the grouping not done inside the linker plugin? Honza > > > David
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > > Teresa has done some tunings for the unroller so far. The inliner > > tuning is the next step. > We've gotten some noticeable improvements from very large applications with icache problems. Looking at the working set curves for these apps, the working set size in terms of the number of hot counters starts increasing exponentially to a much higher value than other apps as you approach 99.99% of the sum_all of the counter values. The long tail as David mentioned, and very large values. > > > > > > > What concerns me that it is greatly inaccurate - you have no idea how > many > > > instructions given counter is guarding and it can differ quite a lot. > Also > > > inlining/optimization makes working sets significantly different (by > factor of > > > 100 for tramp3d). > That's true, but from what I have seen the working sets of the applications this is targeted towards are fairly distinct. Which suggests that perfect accuracy in merging summaries may not be needed, at least for the current usage. As we start trying to use this to guide inlining and other optimizations it may need some fine tuning. > > > > The pre ipa-inline working set is the one that is needed for ipa > > inliner tuning. For post-ipa inline code increase transformations, > > some update is probably needed. > > > > >But on the ohter hand any solution at this level will be > > > greatly inaccurate. So I am curious how reliable data you can get from > this? > > > How you take this into account for the heuristics? > > > > This effort is just the first step to allow good heuristics to develop. > > > > > > > > It seems to me that for this use perhaps the simple logic in histogram > merging > > > maximizing the number of BBs for given bucket will work well? It is > > > inaccurate, but we are working with greatly inaccurate data anyway. > > > Except for degenerated cases, the small and unimportant runs will have > small BB > > > counts, while large runs will have larger counts and those are ones we > optimize > > > for anyway. > > > > The working set curve for each type of applications contains lots of > > information that can be mined. The inaccuracy can also be mitigated by > > more data 'calibration'. > > Sure, I think I am leaning towards trying the solution 2) with maximizing > counter count merging (probably it would make sense to rename it from BB > count > since it is not really BB count and thus it is misleading) and we will see > how > well it works in practice. > > We have benefits of much fewer issues with profile locking/unlocking and we > lose bit of precision on BB counts. I tend to believe that the error will > not > be that important in practice. Another loss is more histogram streaming > into > each gcda file, but with skiping zero entries it should not be major > overhead > problem I hope. > > What do you think? > I can go ahead with the histogram approach. There is some roundoff error from the working set scaling approach that can affect different merging orders as you note, although I think this only really affects the small counter values. The other place where saving/merging the histogram would help is when the distribution of counter values in the histogram varies between runs, say for example, the hottest loop is much hotter in a subsequent run, but the rest of the counter values stay largely consistent. 
Note, however, that if the hotspots are different in different runs, then merging either the histogram or the working set will have issues. The only way to detect this is to recompute the histogram/working set from scratch from the merged counters. I wonder in practice, even when there are a lot of simultaneous runs going like in a gcc bootstrap, if we could get reasonably accurate summary recomputation without global locking. The reason is that as long as the actual counter updates are safe as they are now due to the individual file locking, the inaccuracy in the recomputed summary information will not grow over time, and there is a reasonable chance that the last run to complete and merge will recompute the summary from the final merged counter values and get it right (or only be off by a little bit if there are a couple of runs completing simultaneously at the end). But this can be investigated as a follow on to this patch. Thanks, Teresa > > > >> > > >> > > >> > 2) Do we plan to add some features in near future that will anyway > require global locking? > > >> > I guess LIPO itself does not count since it streams its data > into independent file as you > > >> > mentioned earlier and locking LIPO file is not that hard. > > >> > Does LIPO stream everything into that common file, or does it > use combination of gcda files > > >> > and common summary? > > >> > > >> Actually, LIPO module grouping information are stored in gcda files. > > >> It is also stored in a separate .imports file (one per object) --- > > >> this is primarily used by our build system for dependence information. > > > > > > I see, getting LIPO safe WRT parallel updates will be fun. How does > LIPO behave > > > on GCC bootstrap? > > > > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > > main problem for application build -- the link time (for debug build) > > is. > > I was primarily curious how the LIPOs runtime analysis fare in the > situation where > you do very many small train runs on rather large app (sure GCC is small > to google's > use case ;). > > > > > (i.e. it does a lot more work in the libgcov module per each > > > invocation, so I am curious if it is practically useful at all). > > > > > > With LTO based solution a lot can be probably pushed at link time? > Before > > > actual GCC starts from the linker plugin, LIPO module can read gcov > CFGs from > > > gcda files and do all the merging/updating/CFG constructions that is > currently > > > performed at runtime, right? > > > > The dynamic cgraph build and analysis is still done at runtime. > > However, with the new implementation, FE is no longer involved. Gcc > > driver is modified to understand module grouping, and lto is used to > > merge the streamed output from aux modules. > > I see. Are there any fundamental reasons why it can not be done at > link-time > when all gcda files are available? Why the grouping is not done inside > linker > plugin? > > Honza > > > > > > David > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
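The recomputation is cheap once a histogram of the merged counters exists, since the buckets are already ordered by counter value. Below is a simplified sketch in the spirit of the patch's gcov_compute_working_set; the real code fills a whole array of cutoff points in one walk and apportions counters within the boundary bucket, both elided here.

typedef long long gcov_type;

typedef struct
{
  unsigned count;
  gcov_type min_value;
  gcov_type cum_value;
} bucket_type;

/* Compute one working set point: how many hot counters, and what
   minimum counter value, cover CUTOFF out of the total counter sum
   (e.g. cutoff = 99.9% of sum_all), walking buckets from hottest to
   coldest.  */
static void
working_set_point (const bucket_type *histo, unsigned n_buckets,
                   gcov_type cutoff, unsigned *num_hot,
                   gcov_type *min_counter)
{
  gcov_type cum = 0;
  unsigned hot = 0;
  int i;

  for (i = (int) n_buckets - 1; i >= 0; i--)
    {
      cum += histo[i].cum_value;
      hot += histo[i].count;
      if (cum >= cutoff)
        break;
    }
  *num_hot = hot;
  *min_counter = i >= 0 ? histo[i].min_value : 0;
}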
> The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) lockf typically has a deadlock detector, and will error out. -Andi
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Teresa has done some tunings for the unroller so far. The inliner >> tuning is the next step. >> >> > >> > What concerns me that it is greatly inaccurate - you have no idea how many >> > instructions given counter is guarding and it can differ quite a lot. Also >> > inlining/optimization makes working sets significantly different (by factor of >> > 100 for tramp3d). >> >> The pre ipa-inline working set is the one that is needed for ipa >> inliner tuning. For post-ipa inline code increase transformations, >> some update is probably needed. >> >> >But on the ohter hand any solution at this level will be >> > greatly inaccurate. So I am curious how reliable data you can get from this? >> > How you take this into account for the heuristics? >> >> This effort is just the first step to allow good heuristics to develop. >> >> > >> > It seems to me that for this use perhaps the simple logic in histogram merging >> > maximizing the number of BBs for given bucket will work well? It is >> > inaccurate, but we are working with greatly inaccurate data anyway. >> > Except for degenerated cases, the small and unimportant runs will have small BB >> > counts, while large runs will have larger counts and those are ones we optimize >> > for anyway. >> >> The working set curve for each type of applications contains lots of >> information that can be mined. The inaccuracy can also be mitigated by >> more data 'calibration'. > > Sure, I think I am leaning towards trying the solution 2) with maximizing > counter count merging (probably it would make sense to rename it from BB count > since it is not really BB count and thus it is misleading) and we will see how > well it works in practice. > > We have benefits of much fewer issues with profile locking/unlocking and we > lose bit of precision on BB counts. I tend to believe that the error will not > be that important in practice. Another loss is more histogram streaming into > each gcda file, but with skiping zero entries it should not be major overhead > problem I hope. > > What do you think? >> >> >> >> >> >> >> > 2) Do we plan to add some features in near future that will anyway require global locking? >> >> > I guess LIPO itself does not count since it streams its data into independent file as you >> >> > mentioned earlier and locking LIPO file is not that hard. >> >> > Does LIPO stream everything into that common file, or does it use combination of gcda files >> >> > and common summary? >> >> >> >> Actually, LIPO module grouping information are stored in gcda files. >> >> It is also stored in a separate .imports file (one per object) --- >> >> this is primarily used by our build system for dependence information. >> > >> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave >> > on GCC bootstrap? >> >> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the >> main problem for application build -- the link time (for debug build) >> is. > > I was primarily curious how the LIPOs runtime analysis fare in the situation where > you do very many small train runs on rather large app (sure GCC is small to google's > use case ;). There will be races, but as Teresa mentioned, there is a big chance that the process which finishes the merge last is also the final overrider of the LIPO summary data. >> >> > (i.e. it does a lot more work in the libgcov module per each >> > invocation, so I am curious if it is practically useful at all).
>> > >> > With LTO based solution a lot can be probably pushed at link time? Before >> > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from >> > gcda files and do all the merging/updating/CFG constructions that is currently >> > performed at runtime, right? >> >> The dynamic cgraph build and analysis is still done at runtime. >> However, with the new implementation, FE is no longer involved. Gcc >> driver is modified to understand module grouping, and lto is used to >> merge the streamed output from aux modules. > > I see. Are there any fundamental reasons why it can not be done at link-time > when all gcda files are available? For build parallelism, the decision should be made as early as possible -- that is what makes LIPO 'light'. > Why the grouping is not done inside linker > plugin? It is not delayed to link time. In fact, the linker plugin is not even involved. David > > Honza >> >> >> David
On Tue, Aug 21, 2012 at 7:38 AM, Teresa Johnson <tejohnson@google.com>wrote: > > > On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> > Teresa has done some tunings for the unroller so far. The inliner >> > tuning is the next step. >> > > We've gotten some noticeable improvements from very large applications > with icache problems. Looking at the working set curves for these apps, the > working set size in terms of the number of hot counters starts increasing > exponentially to a much higher value than other apps as you approach 99.99% > of the sum_all of the counter values. The long tail as David mentioned, and > very large values. > > > >> > > >> > > What concerns me that it is greatly inaccurate - you have no idea how >> many >> > > instructions given counter is guarding and it can differ quite a lot. >> Also >> > > inlining/optimization makes working sets significantly different (by >> factor of >> > > 100 for tramp3d). >> > > That's true, but from what I have seen the working sets of the > applications this is targeted towards are fairly distinct. Which suggests > that perfect accuracy in merging summaries may not be needed, at least for > the current usage. As we start trying to use this to guide inlining and > other optimizations it may need some fine tuning. > > >> > >> > The pre ipa-inline working set is the one that is needed for ipa >> > inliner tuning. For post-ipa inline code increase transformations, >> > some update is probably needed. >> > > >> > >But on the ohter hand any solution at this level will be >> > > greatly inaccurate. So I am curious how reliable data you can get >> from this? >> > > How you take this into account for the heuristics? >> > >> > This effort is just the first step to allow good heuristics to develop. >> > >> > > >> > > It seems to me that for this use perhaps the simple logic in >> histogram merging >> > > maximizing the number of BBs for given bucket will work well? It is >> > > inaccurate, but we are working with greatly inaccurate data anyway. >> > > Except for degenerated cases, the small and unimportant runs will >> have small BB >> > > counts, while large runs will have larger counts and those are ones >> we optimize >> > > for anyway. >> > >> > The working set curve for each type of applications contains lots of >> > information that can be mined. The inaccuracy can also be mitigated by >> > more data 'calibration'. >> >> Sure, I think I am leaning towards trying the solution 2) with maximizing >> counter count merging (probably it would make sense to rename it from BB >> count >> since it is not really BB count and thus it is misleading) and we will >> see how >> well it works in practice. >> >> We have benefits of much fewer issues with profile locking/unlocking and >> we >> lose bit of precision on BB counts. I tend to believe that the error will >> not >> be that important in practice. Another loss is more histogram streaming >> into >> each gcda file, but with skiping zero entries it should not be major >> overhead >> problem I hope. >> >> What do you think? >> > > I can go ahead with the histogram approach. There is some roundoff > error from the working set scaling approach that can affect different > merging orders as you note, although I think this only really affects the > small counter values. 
The other place where saving/merging the histogram > would help is when the distribution of counter values in the histogram > varies between runs, say for example, the hottest loop is much hotter in a > subsequent run, but the rest of the counter values stay largely consistent. > Note, however, that if the hotspots are different in different runs, then > merging either the histogram or the working set will have issues. The only > way to detect this is to recompute the histogram/working set from scratch > from the merged counters. > > I wonder in practice, even when there are a lot of simultaneous runs going > like in a gcc bootstrap, if we could get reasonably accurate summary > recomputation without global locking. The reason is that as long as the > actual counter updates are safe as they are now due to the individual file > locking, the inaccuracy in the recomputed summary information will not grow > over time, and there is a reasonable chance that the last run to complete > and merge will recompute the summary from the final merged counter values > and get it right (or only be off by a little bit if there are a couple of > runs completing simultaneously at the end). But this can be investigated as > a follow on to this patch. > The main concern is probably the build reproducibility in gcc bootstrap with FDO. David > > Thanks, > Teresa > > > >> > >> >> > >> >> > >> > 2) Do we plan to add some features in near future that will >> anyway require global locking? >> > >> > I guess LIPO itself does not count since it streams its data >> into independent file as you >> > >> > mentioned earlier and locking LIPO file is not that hard. >> > >> > Does LIPO stream everything into that common file, or does it >> use combination of gcda files >> > >> > and common summary? >> > >> >> > >> Actually, LIPO module grouping information are stored in gcda files. >> > >> It is also stored in a separate .imports file (one per object) --- >> > >> this is primarily used by our build system for dependence >> information. >> > > >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does >> LIPO behave >> > > on GCC bootstrap? >> > >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the >> > main problem for application build -- the link time (for debug build) >> > is. >> >> I was primarily curious how the LIPOs runtime analysis fare in the >> situation where >> you do very many small train runs on rather large app (sure GCC is small >> to google's >> use case ;). >> > >> > > (i.e. it does a lot more work in the libgcov module per each >> > > invocation, so I am curious if it is practically useful at all). >> > > >> > > With LTO based solution a lot can be probably pushed at link time? >> Before >> > > actual GCC starts from the linker plugin, LIPO module can read gcov >> CFGs from >> > > gcda files and do all the merging/updating/CFG constructions that is >> currently >> > > performed at runtime, right? >> > >> > The dynamic cgraph build and analysis is still done at runtime. >> > However, with the new implementation, FE is no longer involved. Gcc >> > driver is modified to understand module grouping, and lto is used to >> > merge the streamed output from aux modules. >> >> I see. Are there any fundamental reasons why it can not be done at >> link-time >> when all gcda files are available? Why the grouping is not done inside >> linker >> plugin? >> >> Honza >> > >> > >> > David >> > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 >
> > I can go ahead with the histogram approach. There is some roundoff > > error from the working set scaling approach that can affect different > > merging orders as you note, although I think this only really affects the > > small counter values. The other place where saving/merging the histogram Do you have any intuition on why simple maximalization merging (that is safe wrt ordering) would be a bad idea? We care only about the working set size around the top of the histogram, and I would say that we should sort of optimize for the largest (in the number of blocks in the hot area) of the train runs. One way where things will get messed up is when the working set is about the same but the runs are of different sizes, so all the blocks get accounted into two different buckets. But in general I do not think there is a reasonably accurate way to merge the profiles without actually streaming out all counter IDs in every bucket, so perhaps this will work well enough. If not, we can in future introduce a per-program global summary file that will contain the counters to be merged accurately. > > would help is when the distribution of counter values in the histogram > > varies between runs, say for example, the hottest loop is much hotter in a > > subsequent run, but the rest of the counter values stay largely consistent. > > Note, however, that if the hotspots are different in different runs, then > > merging either the histogram or the working set will have issues. The only > > way to detect this is to recompute the histogram/working set from scratch > > from the merged counters. > > > > I wonder in practice, even when there are a lot of simultaneous runs going > > like in a gcc bootstrap, if we could get reasonably accurate summary > > recomputation without global locking. The reason is that as long as the > > actual counter updates are safe as they are now due to the individual file > > locking, the inaccuracy in the recomputed summary information will not grow > > over time, and there is a reasonable chance that the last run to complete > > and merge will recompute the summary from the final merged counter values > > and get it right (or only be off by a little bit if there are a couple of > > runs completing simultaneously at the end). But this can be investigated as > > a follow on to this patch. > > > > > The main concern is probably the build reproducibility in gcc bootstrap > with FDO. Hmm, you mean in the first pass update every file with new counters and in the second pass just update the summaries? OK, so I guess we went through 1) two pass updating with a race in between passes. 2) two pass updating with the first pass updating counters and the second having a race only for the summary update (i.e. no races for counters). 3) two pass updating with flocking (and some way to handle detected deadlocks). 4) one pass updating with histogram merging + maximalization of the working set (we do not really need to scale the buckets; we can simply merge the histograms and then multiply them by nruns before comparing to the actual counters). This assumes that the working sets of all runs are about the same, but it should work reasonably in practice I think. I guess 3/4 are acceptable WRT bootstrap reproducibility. I have no experience with flocking a large number of files and the portability of this solution, i.e. to Windows. If you think that 2) would be too inaccurate in practice and 3) has a chance to be portable, we could go for this. It will solve the precision problems and will also work for LIPO summaries.
I would be curious about effect on profiledbootstrap time of this if you implement it. Honza > > David > > > > > > > Thanks, > > Teresa > > > > > > >> > >> > >> > >> > >> > >> > 2) Do we plan to add some features in near future that will > >> anyway require global locking? > >> > >> > I guess LIPO itself does not count since it streams its data > >> into independent file as you > >> > >> > mentioned earlier and locking LIPO file is not that hard. > >> > >> > Does LIPO stream everything into that common file, or does it > >> use combination of gcda files > >> > >> > and common summary? > >> > >> > >> > >> Actually, LIPO module grouping information are stored in gcda files. > >> > >> It is also stored in a separate .imports file (one per object) --- > >> > >> this is primarily used by our build system for dependence > >> information. > >> > > > >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does > >> LIPO behave > >> > > on GCC bootstrap? > >> > > >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > >> > main problem for application build -- the link time (for debug build) > >> > is. > >> > >> I was primarily curious how the LIPOs runtime analysis fare in the > >> situation where > >> you do very many small train runs on rather large app (sure GCC is small > >> to google's > >> use case ;). > >> > > >> > > (i.e. it does a lot more work in the libgcov module per each > >> > > invocation, so I am curious if it is practically useful at all). > >> > > > >> > > With LTO based solution a lot can be probably pushed at link time? > >> Before > >> > > actual GCC starts from the linker plugin, LIPO module can read gcov > >> CFGs from > >> > > gcda files and do all the merging/updating/CFG constructions that is > >> currently > >> > > performed at runtime, right? > >> > > >> > The dynamic cgraph build and analysis is still done at runtime. > >> > However, with the new implementation, FE is no longer involved. Gcc > >> > driver is modified to understand module grouping, and lto is used to > >> > merge the streamed output from aux modules. > >> > >> I see. Are there any fundamental reasons why it can not be done at > >> link-time > >> when all gcda files are available? Why the grouping is not done inside > >> linker > >> plugin? > >> > >> Honza > >> > > >> > > >> > David > >> > > > > > > > > -- > > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > >
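A sketch of what the bucket-wise maximalization merge of option 4 could look like; because taking minima and maxima is commutative and associative, the merged result does not depend on the order in which runs complete. This is an illustration of the proposal only, reusing the bucket layout from the patch.

typedef long long gcov_type;

typedef struct
{
  unsigned count;
  gcov_type min_value;
  gcov_type cum_value;
} bucket_type;

/* Fold SRC into DST by taking the per-bucket maximum population and
   cumulative value, and the minimum counter value seen in the
   bucket.  */
static void
merge_histogram_max (bucket_type *dst, const bucket_type *src,
                     unsigned n_buckets)
{
  unsigned i;

  for (i = 0; i < n_buckets; i++)
    {
      if (src[i].count == 0)
        continue;
      if (dst[i].count == 0 || src[i].min_value < dst[i].min_value)
        dst[i].min_value = src[i].min_value;
      if (src[i].count > dst[i].count)
        dst[i].count = src[i].count;
      if (src[i].cum_value > dst[i].cum_value)
        dst[i].cum_value = src[i].cum_value;
    }
}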
On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > I can go ahead with the histogram approach. There is some roundoff
>> > error from the working set scaling approach that can affect different
>> > merging orders as you note, although I think this only really affects the
>> > small counter values. The other place where saving/merging the histogram
>
> Do you have any intuition on why simple maximalization merging (that is safe wrt
> ordering) would be a bad idea?

When you say "maximalization merging" are you talking about the histogram merging approach I mentioned a few emails back (my response on Aug 19) where we assume the same relative order of hotness in the counters between runs, and accumulate the counter values in the histogram in that order? This would be inaccurate if different runs exercised different areas of the code, and thus the counters would be ordered in the histogram differently.

> We care only about the working set size around the top of the histogram and I would say

For optimizations that care about the boundary between hot and cold, such as code layout, I think we will also care about the smaller values in the histogram (to have a good idea of what constitutes a cold block counter value).

> that we should sort of optimize for the largest (in the number of blocks in the hot
> area) of the train runs. One way where things will get messed up is when the
> working set is about the same but the runs are of different size, so all the
> blocks get accounted into two different buckets.

I'm not sure I understand the last sentence - is this something that would not get handled by merging the histogram entries as I described earlier? Or do you have a different merging approach in mind?

> But in general I do not think there is a reasonably accurate way to merge the
> profiles without actually streaming out all counter IDs in every bucket, so perhaps
> this will work well enough. If not, we can in the future introduce a per-program global
> summary file that will contain the counters to be merged accurately.

Sounds good.

>> > would help is when the distribution of counter values in the histogram
>> > varies between runs, say for example, the hottest loop is much hotter in a
>> > subsequent run, but the rest of the counter values stay largely consistent.
>> > Note, however, that if the hotspots are different in different runs, then
>> > merging either the histogram or the working set will have issues. The only
>> > way to detect this is to recompute the histogram/working set from scratch
>> > from the merged counters.
>> >
>> > I wonder in practice, even when there are a lot of simultaneous runs going
>> > like in a gcc bootstrap, if we could get reasonably accurate summary
>> > recomputation without global locking. The reason is that as long as the
>> > actual counter updates are safe as they are now due to the individual file
>> > locking, the inaccuracy in the recomputed summary information will not grow
>> > over time, and there is a reasonable chance that the last run to complete
>> > and merge will recompute the summary from the final merged counter values
>> > and get it right (or only be off by a little bit if there are a couple of
>> > runs completing simultaneously at the end). But this can be investigated as
>> > a follow on to this patch.
>>
>> The main concern is probably the build reproducibility in gcc bootstrap
>> with FDO.
> Hmm, you mean in the first pass update every file with new counters and in the second
> pass just update the summaries?

Right, that's what I had in mind (what you have described in #2 below).

> OK, so I guess we went through
> 1) two pass updating with race in between passes.
> 2) two pass updating with first pass updating counters and second having race only for summary update.
>    (i.e. no races for counters)
> 3) two pass updating with flocking (and some way to handle detected deadlocks)
> 4) one pass updating with histogram merging + maximalization of working set.
>    (we do not really need to scale the buckets, we can simply merge the histograms
>    and then multiply them by nruns before comparing to actual counters.)

By merging the histograms (and accumulating the counter values stored there as we merge), I don't think we need to multiply the counter values by nruns, do we?

> This assumes that working sets of all runs are about the same, but should work
> reasonably in practice I think.
>
> I guess 3/4 are acceptable WRT bootstrap reproducibility.
>
> I have no experience with flocking a large number of files and the portability of
> this solution, i.e. to Windows. If you think that 2) would be too inaccurate
> in practice and 3) has a chance to be portable, we could go for this. It will
> solve the precision problems and will also work for LIPO summaries.
> I would be curious about the effect on profiledbootstrap time of this if you implement
> it.

I'm hoping that 2) will be accurate enough in practice, but it will need some investigation.

Thanks,
Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Do you have any intuition on why simple maximalization merging (that is safe wrt
> > ordering) would be a bad idea?
>
> When you say "maximalization merging" are you talking about the
> histogram merging approach I mentioned a few emails back (my response
> on Aug 19) where we assume the same relative order of hotness in the
> counters between runs, and accumulate the counter values in the
> histogram in that order?

I was speaking of BB (counter) counts only here and considering the simplest strategy: maximalize the BB counts in corresponding buckets and sum the cumulative values in the corresponding buckets.

The strategy of scaling is more sensible, but it will also be sensitive to the order of train runs, i.e. it will give unstable results with parallel make.

> > OK, so I guess we went through
> > 1) two pass updating with race in between passes.
> > 2) two pass updating with first pass updating counters and second having race only for summary update.
> >    (i.e. no races for counters)
> > 3) two pass updating with flocking (and some way to handle detected deadlocks)
> > 4) one pass updating with histogram merging + maximalization of working set.
> >    (we do not really need to scale the buckets, we can simply merge the histograms
> >    and then multiply them by nruns before comparing to actual counters.)
>
> By merging the histograms (and accumulating the counter values stored
> there as we merge), I don't think we need to multiply the counter
> values by nruns, do we?

With the simple approach we need to, but...

> > This assumes that working sets of all runs are about the same, but should work
> > reasonably in practice I think.
> >
> > I guess 3/4 are acceptable WRT bootstrap reproducibility.
> >
> > I have no experience with flocking a large number of files and the portability of
> > this solution, i.e. to Windows. If you think that 2) would be too inaccurate
> > in practice and 3) has a chance to be portable, we could go for this. It will
> > solve the precision problems and will also work for LIPO summaries.
> > I would be curious about the effect on profiledbootstrap time of this if you implement
> > it.
>
> I'm hoping that 2) will be accurate enough in practice, but it will
> need some investigation.

Perhaps weighing all the pros/cons you are right here. I think getting results consistent across different orders of runs is more important than the possibility of a race on the last update, and 4) is probably getting quite a lot of GIGO issues.

We can implement a more consistent locking mechanism incrementally if this turns out to be an issue.

It is possible to actually detect the race: in the first round we are updating nruns. In the second round we can see if nruns has changed. If so, we have a problem, since someone has already updated the file and will update the histograms.

I would suggest skipping the update when nruns has changed, so the summary at least matches the counters in the current file accurately (even though it will be off for the whole program, since whoever updated the file may have not seen all our updates, just part of them). Just to reduce the chances that the trashed run is the most important one. Theoretically we could also make libgcov re-read all counters in this case and try another update, but I would not worry about that for the moment.

We will still have problems with bootstrap: the summary of libbackend.a will depend on which of the compiler binaries has executed last, since cc1 will have a different summary from cc1plus because the frontend is different. I would give extra points to a volunteer who changes libgcov to do multiple summaries for different programs as intended originally (if you search the archives from 2001-2003, somewhere there are patches that changed libgcov from not merging at all to merging always. I guess changing it to merge just when the program checksum matches is not that hard).

Thanks,
Honza
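For concreteness, the nruns-based race detection sketched above might look roughly like the following fragment (editorial sketch only; the helper names are hypothetical and the real merge loop in gcov_exit is considerably more involved):

/* Sketch: remember the run count observed while merging the counters.  */
gcov_unsigned_t runs_seen = cs_prg->runs;

/* ... first round: merge and write back all counter values ... */

/* Second round: if another process merged in the meantime, nruns will
   have moved, so skip the summary update and leave the summary
   consistent with the counters now in the file.  */
if (gcov_reread_runs (da_file) != runs_seen)   /* hypothetical helper */
  return;
gcov_update_summary (da_file, &this_prg);      /* hypothetical helper */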
On Tue, Aug 21, 2012 at 11:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> I was speaking of BB (counter) counts only here and considering the simplest
> strategy: maximalize the BB counts in corresponding buckets and sum the cumulative
> values in the corresponding buckets.

I'm concerned about inaccuracies arising when multiple runs have different sizes and therefore counters are in different buckets in the corresponding histograms. We would end up in a situation where the merged histogram has more "counters" in it than there are actual counters in the program.

> The strategy of scaling is more sensible, but it will also be sensitive to the order
> of train runs, i.e. it will give unstable results with parallel make.

Regarding the approach of merging histograms by matching up counters in the same order and apportioning out the histogram cumulative values evenly between counters in the same bucket before merging, I don't think the ordering will have a huge effect. The reason is that the counters we end up matching together and accumulating will stay in the same relative order, and the min counter values are simply summed to compute the new bucket index and new min counter value to record. For the cumulative value, since the incoming cumulative values will be apportioned evenly between the counters coming from the same bucket when computing the new cumulative value (which is then just another add), there could be some slight differences from different round-off errors when histograms are merged in different orders. But I would expect that to result in only small differences in the cumulative values in the histogram. Is that acceptable?

>> By merging the histograms (and accumulating the counter values stored
>> there as we merge), I don't think we need to multiply the counter
>> values by nruns, do we?
>
> With the simple approach we need to, but...
>
>> I'm hoping that 2) will be accurate enough in practice, but it will
>> need some investigation.
>
> Perhaps weighing all the pros/cons you are right here. I think getting results
> consistent across different orders of runs is more important than the
> possibility of a race on the last update, and 4) is probably getting quite a lot
> of GIGO issues.
>
> We can implement a more consistent locking mechanism incrementally if this turns
> out to be an issue.
>
> It is possible to actually detect the race: in the first round we are updating
> nruns. In the second round we can see if nruns has changed. If so, we have
> a problem, since someone has already updated the file and will update the
> histograms.
>
> I would suggest skipping the update when nruns has changed, so the summary at least
> matches the counters in the current file accurately (even though it will be off for
> the whole program, since whoever updated the file may have not seen all our
> updates, just part of them). Just to reduce the chances that the trashed run is
> the most important one. Theoretically we could also make libgcov re-read all
> counters in this case and try another update, but I would not worry about that
> for the moment.

Ok, I think that makes sense.

If it is ok with you, I'd prefer to go with a 3 step approach:
1) Implement some form of histogram merging (one of the options we discussed above). (Your approach #4).
2) Implement approach #2 (restructure into 2 passes with counters being updated accurately and a possible race on the histogram computation).
3) Investigate and implement either approach #3 (some kind of global lock) or the race detection you describe above.

> We will still have problems with bootstrap: the summary of libbackend.a will
> depend on which of the compiler binaries has executed last, since cc1 will have
> a different summary from cc1plus because the frontend is different. I would give
> extra points to a volunteer who changes libgcov to do multiple summaries for
> different programs as intended originally (if you search the archives from 2001-2003,
> somewhere there are patches that changed libgcov from not merging at all to merging
> always. I guess changing it to merge just when the program checksum matches is not
> that hard).

In this case, I assume that currently the merge will abort because the # counters, etc don't match?

Thanks,
Teresa

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
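To make the even apportioning concrete, consider a toy example (invented numbers, editorial illustration only): suppose a target bucket holds tgt_num = 3 counters with tgt_cum = 30, and the matching source bucket holds src_num = 2 counters with src_cum = 50. The first merge step combines merge_num = min(3, 2) = 2 counters, apportioning merge_tgt_cum = 2 * 30 / 3 = 20 from the target and merge_src_cum = 50 (all of it) from the source, so the merged pair contributes 2 counters with a combined cum_value of 70, recorded at the bucket corresponding to the combined minimum counter value. The one remaining target counter, carrying the leftover tgt_cum of 10, is then matched against the next non-empty source bucket in the same way.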
> I'm concerned about inaccuracies arising when multiple runs have
> different sizes and therefore counters are in different buckets in the
> corresponding histograms. We would end up in a situation where the
> merged histogram has more "counters" in it than there are actual
> counters in the program.

Yes, it is very inaccurate.

> Regarding the approach of merging histograms by matching up counters
> in the same order and apportioning out the histogram cumulative values
> evenly between counters in the same bucket before merging, I don't
> think the ordering will have a huge effect. The reason is that
> the counters we end up matching together and accumulating will stay in the
> same relative order, and the min counter values are simply summed to
> compute the new bucket index and new min counter value to record. For
> the cumulative value, since the incoming cumulative values will be
> apportioned evenly between the counters coming from the same bucket
> when computing the new cumulative value (which is then just another
> add), there could be some slight differences from different round-off
> errors when histograms are merged in different orders. But I would
> expect that to result in only small differences in the cumulative
> values in the histogram. Is that acceptable?

Yes, I think it will be a small difference and it will give reasonable results in practice. I am only concerned about the reproducibility of the bootstraps that...

> Ok, I think that makes sense.
>
> If it is ok with you, I'd prefer to go with a 3 step approach:
> 1) Implement some form of histogram merging (one of the options we
> discussed above). (Your approach #4).
> 2) Implement approach #2 (restructure into 2 passes with counters
> being updated accurately and a possible race on the histogram
> computation).
> 3) Investigate and implement either approach #3 (some kind of global
> lock) or the race detection you describe above.

Yes, this sounds like a good plan.

> In this case, I assume that currently the merge will abort because the
> # counters, etc don't match?

We only abort when the number of counters in the current file does not match (i.e. the gcda file is from a different compilation). We accept when the overall program is different, i.e. libbackend.a is linked into multiple binaries. Since the histogram summing is global across all profiled libraries, we will end up with different histograms depending on which program using a given library was executed last. To get this stable I think it is only possible to go with #3 (i.e. program-wide global locking) and get different records for different programs (i.e. do a checksum of all objects libgcov sees and merge only with matching checksums).
In coverage.c we then can either choose the proper record when doing LTO, knowing what program we are just compiling, or do histogram merging in your way but after sorting the entries into some canonical order (i.e. increasing checksum). So we can track this incrementally. I guess if you go with step 1) as you describe, we can move ahead.

Another thing I have a patch for is to make LTO ipa-profile produce more accurate histograms based on real instruction counts (here we can read in all gcda files, turn them into CFG profiles, and do our own summary). I suppose that once you merge in your patches I can update this, and we will have two mechanisms, so we will also have data on how your implementation diverges from the precise solution.

Thanks!
Honza
Revision to earlier patch to augment the gcov program summary with working set information. We now emit the non-zero arc counter histogram entries into the summary, instead of computing the working set information and emitting that into the summary. The working set information is computed by the compiler from the histogram when it is read back in. A new method for merging histogram summaries based on earlier discussions with Honza is utilized when gcov summaries are merged. See earlier discussions and description at: http://gcc.gnu.org/ml/gcc-patches/2012-08/msg01062.html http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01412.html Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? Thanks, Teresa 2012-08-28 Teresa Johnson <tejohnson@google.com> * libgcc/libgcov.c (gcov_histogram_insert): New function. (gcov_compute_histogram): Ditto. (gcov_exit): Invoke compute_histogram, and perform merging of histograms during summary merging. * gcc/gcov-io.c (gcov_write_summary): Include length of non-zero histogram entries in GCOV_TAG_SUMMARY_LENGTH, and write out those entries. (gcov_read_summary): Read histogram entries. (gcov_histo_index): New function. (gcov_histogram_merge): Ditto. * gcc/gcov-io.h (gcov_type_unsigned): New type. (struct gcov_bucket_type): Ditto. (struct gcov_ctr_summary): Include histogram. (GCOV_TAG_SUMMARY_LENGTH): Update to include histogram entries. (GCOV_HISTOGRAM_SIZE): New macro. (GCOV_TYPE_SIZE): Make macro available for all build targets. * gcc/profile.c (NUM_GCOV_WORKING_SETS): New macro. (gcov_working_sets): New global variable. (compute_working_sets): New function. (find_working_set): Ditto. (get_exec_counts): Invoke compute_working_sets. * gcc/coverage.c (read_counts_file): Merge histograms, and fix bug with accessing summary info for non-summable counters. * gcc/basic-block.h (gcov_type_unsigned): New type. (struct gcov_working_set_info): Ditto. * gcc/gcov-dump.c (tag_summary): Dump out histogram. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 190736) +++ libgcc/libgcov.c (working copy) @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Insert counter VALUE into HISTOGRAM. */ + +static void +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) +{ + unsigned i; + + i = gcov_histo_index(value); + gcc_assert (i < GCOV_HISTOGRAM_SIZE); + + histogram[i].num_counters++; + histogram[i].cum_value += value; + if (value < histogram[i].min_value) + histogram[i].min_value = value; +} + +/* Computes a histogram of the arc counters to place in the summary SUM. */ + +static void +gcov_compute_histogram (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, ctr_info_ix, ix; + int h_ix; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. */ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + cs_ptr->histogram[h_ix].num_counters = 0; + cs_ptr->histogram[h_ix].min_value = cs_ptr->run_max; + cs_ptr->histogram[h_ix].cum_value = 0; + } + + /* Walk through all the per-object structures and record each of + the count values in histogram. 
*/ + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. */ + for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++) + { + if (gi_ptr->merge[ix]) + ctr_info_ix++; + } + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) + { + gfi_ptr = gi_ptr->functions[f_ix]; + + if (!gfi_ptr || gfi_ptr->key != gi_ptr) + continue; + + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; + for (ix = 0; ix < ci_ptr->num; ix++) + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); + } + } +} + /* Dump the coverage counts. We merge with existing counts when possible, to avoid growing the .da files ad infinitum. We use this program's checksum to make sure we only accumulate whole program @@ -347,6 +419,7 @@ gcov_exit (void) } } } + gcov_compute_histogram (&this_prg); { /* Check if the level of dirs to strip off specified. */ @@ -482,8 +555,6 @@ gcov_exit (void) f_ix--; length = gcov_read_unsigned (); - if (length != GCOV_TAG_SUMMARY_LENGTH) - goto read_mismatch; gcov_read_summary (&tmp); if ((error = gcov_is_error ())) goto read_error; @@ -598,11 +669,18 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + cs_prg->num = cs_tprg->num; + else if (cs_prg->num != cs_tprg->num) + goto read_mismatch; cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; cs_prg->sum_max += cs_tprg->run_max; + if (cs_prg->runs == 1) + memcpy (cs_prg->histogram, cs_tprg->histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (cs_prg->histogram, cs_tprg->histogram); } else if (cs_prg->runs) goto read_mismatch; Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 190736) +++ gcc/gcov-io.c (working copy) @@ -368,10 +368,18 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u GCOV_LINKAGE void gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, h_cnt = 0; const struct gcov_ctr_summary *csum; - gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH); + /* Count number of non-zero histogram entries. The histogram is only + currently computed for arc counters. 
*/ + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (csum->histogram[h_ix].num_counters > 0) + h_cnt++; + } + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); gcov_write_unsigned (summary->checksum); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); gcov_write_counter (csum->sum_max); + if (ix != GCOV_COUNTER_ARCS) + { + gcov_write_unsigned (0); + continue; + } + gcov_write_unsigned (h_cnt); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (!csum->histogram[h_ix].num_counters) + continue; + gcov_write_unsigned (h_ix); + gcov_write_unsigned (csum->histogram[h_ix].num_counters); + gcov_write_counter (csum->histogram[h_ix].min_value); + gcov_write_counter (csum->histogram[h_ix].cum_value); + } } } #endif /* IN_LIBGCOV */ @@ -488,7 +511,7 @@ gcov_read_string (void) GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, h_cnt; struct gcov_ctr_summary *csum; summary->checksum = gcov_read_unsigned (); @@ -499,6 +522,16 @@ gcov_read_summary (struct gcov_summary *summary) csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); csum->sum_max = gcov_read_counter (); + memset (csum->histogram, 0, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + h_cnt = gcov_read_unsigned (); + while (h_cnt--) + { + h_ix = gcov_read_unsigned (); + csum->histogram[h_ix].num_counters = gcov_read_unsigned (); + csum->histogram[h_ix].min_value = gcov_read_counter (); + csum->histogram[h_ix].cum_value = gcov_read_counter (); + } } } @@ -550,3 +583,142 @@ gcov_time (void) return status.st_mtime; } #endif /* IN_GCOV */ + +#if IN_LIBGCOV || !IN_GCOV +/* Determine the index into histogram for VALUE. */ + +static unsigned +gcov_histo_index(gcov_type value) +{ + gcov_type_unsigned v = (gcov_type_unsigned)value; + unsigned r = 0; + unsigned prev2bits = 0; + + /* Find index into log2 scale histogram, where each of the log2 + sized buckets is divided into 4 linear sub-buckets for better + focus in the higher buckets. */ + + /* Find the place of the most-significant bit set. */ + if (v > 0) + r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v); + + /* If at most the 2 least significant bits are set (value is + 0 - 3) then that value is our index into the lowest set of + four buckets. */ + if (r < 2) + return (unsigned)value; + + gcc_assert (r < 64); + + /* Find the two next most significant bits to determine which + of the four linear sub-buckets to select. */ + prev2bits = (v >> (r - 2)) & 0x3; + /* Finally, compose the final bucket index from the log2 index and + the next 2 bits. The minimum r value at this point is 2 since we + returned above if r was 2 or more, so the minimum bucket at this + point is 4. */ + return (r - 1) * 4 + prev2bits; +} + +/* Merge SRC_HISTO into TGT_HISTO. */ + +static void gcov_histogram_merge(gcov_bucket_type *tgt_histo, + gcov_bucket_type *src_histo) +{ + unsigned src_i, tgt_i, tmp_i; + unsigned src_num, tgt_num, merge_num; + gcov_type src_cum, tgt_cum, merge_src_cum, merge_tgt_cum, merge_cum; + gcov_type merge_min; + gcov_bucket_type tmp_histo[GCOV_HISTOGRAM_SIZE]; + + memset(tmp_histo, 0, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + + /* Assume that the counters are in the same relative order in both + histograms. 
Walk the histograms from smallest to largest entry, + matching up and combining counters in order. */ + src_num = 0; + src_cum = 0; + src_i = 0; + for (tgt_i = 0; tgt_i < GCOV_HISTOGRAM_SIZE; tgt_i++) + { + tgt_num = tgt_histo[tgt_i].num_counters; + tgt_cum = tgt_histo[tgt_i].cum_value; + /* Keep going until all of the target histogram's counters at this + position have been matched and merged with counters from the + source histogram. */ + while (tgt_num > 0) + { + /* If this is either the first time through this loop or we just + exhausted the previous non-zero source histogram entry, look + for the next non-zero source histogram entry. */ + if (!src_num) + { + /* Locate the first non-zero entry. */ + while (src_i < GCOV_HISTOGRAM_SIZE && + !src_histo[src_i].num_counters) + src_i++; + /* Source and target total number of counts should be + the same or caller should not have attempted merge. + Since we have target counters left to merge, there + ought to still be some left in the source. */ + gcc_assert (src_i < GCOV_HISTOGRAM_SIZE); + src_num = src_histo[src_i].num_counters; + src_cum = src_histo[src_i].cum_value; + } + + /* The number of counters to merge on this pass is the minimum + of the remaining counters from the current target and source + histogram entries. */ + merge_num = tgt_num; + if (src_num < merge_num) + merge_num = src_num; + + /* The merged min_value is the min of the min_values from target + and source. */ + merge_min = tgt_histo[tgt_i].min_value; + if (merge_min > src_histo[tgt_i].min_value) + merge_min = src_histo[src_i].min_value; + + /* Compute the portion of source and target entries' cum_value + that will be apportioned to the counters being merged. + The total remaining cum_value from each entry is divided + equally among the counters from that histogram entry if we + are not merging all of them. */ + merge_src_cum = src_cum; + if (merge_num < src_num) + merge_src_cum = merge_num * src_cum / src_num; + merge_tgt_cum = tgt_cum; + if (merge_num < tgt_num) + merge_tgt_cum = merge_num * tgt_cum / tgt_num; + /* The merged cum_value is the sum of the source and target + components. */ + merge_cum = merge_src_cum + merge_tgt_cum; + + /* Update the remaining number of counters and cum_value left + to be merged from this source and target entry. */ + src_cum -= merge_src_cum; + tgt_cum -= merge_tgt_cum; + src_num -= merge_num; + tgt_num -= merge_num; + + /* The merged counters get placed in the new merged histogram + at the entry for the merged min_value. */ + tmp_i = gcov_histo_index(merge_min); + gcc_assert (tmp_i < GCOV_HISTOGRAM_SIZE); + tmp_histo[tmp_i].num_counters += merge_num; + tmp_histo[tmp_i].cum_value += merge_cum; + if (!tmp_histo[tmp_i].min_value || + merge_min < tmp_histo[tmp_i].min_value) + tmp_histo[tmp_i].min_value = merge_min; + + /* Ensure the search for the next non-zero src_histo entry starts + at the next histogram bucket. */ + if (!src_num) + src_i++; + } + } + + /* Finally, copy the merged histogram into tgt_histo. 
*/ + memcpy(tgt_histo, tmp_histo, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); +} +#endif /* IN_LIBGCOV || !IN_GCOV */ Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 190736) +++ gcc/gcov-io.h (working copy) @@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect counts: header int64:count* summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE count-summary: int32:num int32:runs int64:sum - int64:max int64:sum_max + int64:max int64:sum_max histogram + histogram: int32:count histogram-buckets* + histogram-buckets: int32:index int32:num int64:min int64:sum The ANNOUNCE_FUNCTION record is the same as that in the note file, but without the source location. The COUNTS gives the @@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (SI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (DI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI))); #else typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #endif #else #if BITS_PER_UNIT == 16 @@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (HI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #else typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #endif #else typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI))); typedef unsigned gcov_position_t __attribute__ ((mode (QI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #else typedef signed gcov_type __attribute__ ((mode (QI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI))); #endif #endif #endif @@ -210,11 +218,10 @@ typedef unsigned gcov_position_t; #if IN_GCOV #define GCOV_LINKAGE static typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; #if IN_GCOV > 0 #include <sys/types.h> #endif -#else /*!IN_GCOV */ -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) #endif #if defined (HOST_HAS_F_SETLKW) @@ -309,9 +316,10 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2) #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) -#define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) +#define GCOV_TAG_SUMMARY_LENGTH(NUM) \ + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 6) + /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ #define GCOV_COUNTERS_SUMMABLE 1 /* Counters which can be @@ -387,8 +395,28 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_ARC_FAKE (1 << 1) #define GCOV_ARC_FALLTHROUGH (1 << 2) +#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) + /* Structured records. */ +/* Structure used for each bucket of the log2 histogram of counter values. */ +typedef struct +{ + /* Number of counters whose profile count falls within the bucket. */ + gcov_unsigned_t num_counters; + /* Smallest profile count included in this bucket. */ + gcov_type min_value; + /* Cumulative value of the profile counts in this bucket. 
*/ + gcov_type cum_value; +} gcov_bucket_type; + +/* For a log2 scale histogram with each range split into 4 + linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2 + ranges since the lowest 2 log2 values share the lowest 4 linear + sub-range (values 0 - 3). */ + +#define GCOV_HISTOGRAM_SIZE (GCOV_TYPE_SIZE - 1) * 4 + /* Cumulative counter data. */ struct gcov_ctr_summary { @@ -397,6 +425,8 @@ struct gcov_ctr_summary gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. */ gcov_type sum_max; /* sum of individual run max values. */ + gcov_bucket_type histogram[GCOV_HISTOGRAM_SIZE]; /* histogram of + counter values. */ }; /* Object & program summary record. */ Index: gcc/profile.c =================================================================== --- gcc/profile.c (revision 190736) +++ gcc/profile.c (working copy) @@ -84,6 +84,15 @@ struct bb_info { const struct gcov_ctr_summary *profile_info; +/* Number of data points in the working set summary array. Using 128 + provides information for at least every 1% increment of the total + profile size. The last entry is hardwired to 99.9% of the total. */ +#define NUM_GCOV_WORKING_SETS 128 + +/* Counter working set information computed from the current counter + summary. Not initialized unless profile_info summary is non-NULL. */ +static gcov_working_set_t gcov_working_sets[NUM_GCOV_WORKING_SETS]; + /* Collect statistics on the performance of this pass for the entire source file. */ @@ -192,6 +201,151 @@ instrument_values (histogram_values values) } +/* Compute the working set information from the counter histogram in + the profile summary. This is an array of information corresponding to a + range of percentages of the total execution count (sum_all), and includes + the number of counters required to cover that working set percentage and + the minimum counter value in that working set. */ + +static void +compute_working_sets (void) +{ + gcov_type working_set_cum_values[NUM_GCOV_WORKING_SETS]; + gcov_type ws_cum_hotness_incr; + gcov_type cum, tmp_cum; + const gcov_bucket_type *histo_bucket; + unsigned ws_ix, h_ix, c_num, count, pctinc, pct; + gcov_working_set_t *ws_info; + + if (!profile_info) + return; + + /* Compute the amount of sum_all that the cumulative hotness grows + by in each successive working set entry, which depends on the + number of working set entries. */ + ws_cum_hotness_incr = profile_info->sum_all / NUM_GCOV_WORKING_SETS; + + /* Next fill in an array of the cumulative hotness values corresponding + to each working set summary entry we are going to compute below. + Skip 0% statistics, which can be extrapolated from the + rest of the summary data. */ + cum = ws_cum_hotness_incr; + for (ws_ix = 0; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, cum += ws_cum_hotness_incr) + working_set_cum_values[ws_ix] = cum; + /* The last summary entry is reserved for (roughly) 99.9% of the + working set. Divide by 1024 so it becomes a shift, which gives + almost exactly 99.9%. */ + working_set_cum_values[NUM_GCOV_WORKING_SETS-1] + = profile_info->sum_all - profile_info->sum_all/1024; + + /* Next, walk through the histogram in decending order of hotness + and compute the statistics for the working set summary array. + As histogram entries are accumulated, we check to see which + working set entries have had their expected cum_value reached + and fill them in, walking the working set entries in increasing + size of cum_value. 
*/ + ws_ix = 0; /* The current entry into the working set array. */ + cum = 0; /* The current accumulated counter sum. */ + count = 0; /* The current accumulated count of block counters. */ + for (h_ix = GCOV_HISTOGRAM_SIZE - 1; + h_ix > 0 && ws_ix < NUM_GCOV_WORKING_SETS; h_ix--) + { + histo_bucket = &profile_info->histogram[h_ix]; + + /* If we haven't reached the required cumulative counter value for + the current working set percentage, simply accumulate this histogram + entry into the running sums and continue to the next histogram + entry. */ + if (cum + histo_bucket->cum_value < working_set_cum_values[ws_ix]) + { + cum += histo_bucket->cum_value; + count += histo_bucket->num_counters; + continue; + } + + /* If adding the current histogram entry's cumulative counter value + causes us to exceed the current working set size, then estimate + how many of this histogram entry's counter values are required to + reach the working set size, and fill in working set entries + as we reach their expected cumulative value. */ + for (c_num = 0, tmp_cum = cum; + c_num < histo_bucket->num_counters && ws_ix < NUM_GCOV_WORKING_SETS; + c_num++) + { + count++; + /* If we haven't reached the last histogram entry counter, add + in the minimum value again. This will underestimate the + cumulative sum so far, because many of the counter values in this + entry may have been larger than the minimum. We could add in the + average value every time, but that would require an expensive + divide operation. */ + if (c_num + 1 < histo_bucket->num_counters) + tmp_cum += histo_bucket->min_value; + /* If we have reached the last histogram entry counter, then add + in the entire cumulative value. */ + else + tmp_cum = cum + histo_bucket->cum_value; + + /* Next walk through successive working set entries and fill in + the statistics for any whose size we have reached by accumulating + this histogram counter. */ + while (tmp_cum >= working_set_cum_values[ws_ix] + && ws_ix < NUM_GCOV_WORKING_SETS) + { + gcov_working_sets[ws_ix].num_counters = count; + gcov_working_sets[ws_ix].min_counter + = histo_bucket->min_value; + ws_ix++; + } + } + /* Finally, update the running cumulative value since we were + using a temporary above. */ + cum += histo_bucket->cum_value; + } + gcc_assert (ws_ix == NUM_GCOV_WORKING_SETS); + + if (dump_file) + { + fprintf (dump_file, "Counter working sets:\n"); + /* Multiply the percentage by 100 to avoid float. */ + pctinc = 100 * 100 / NUM_GCOV_WORKING_SETS; + for (ws_ix = 0, pct = pctinc; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, pct += pctinc) + { + if (ws_ix == NUM_GCOV_WORKING_SETS - 1) + pct = 9990; + ws_info = &gcov_working_sets[ws_ix]; + /* Print out the percentage using int arithmatic to avoid float. */ + fprintf (dump_file, "\t\t%u.%02u%%: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC "\n", + pct / 100, pct - (pct / 100 * 100), + ws_info->num_counters, + (HOST_WIDEST_INT)ws_info->min_counter); + } + } +} + +/* Given a the desired percentage of the full profile (sum_all from the + summary), multiplied by 10 to avoid float in PCT_TIMES_10, returns + the corresponding working set information. If an exact match for + the percentage isn't found, the closest value is used. 
*/ + +gcov_working_set_t * +find_working_set (unsigned pct_times_10) +{ + unsigned i; + if (!profile_info) + return NULL; + gcc_assert (pct_times_10 <= 1000); + if (pct_times_10 >= 999) + return &gcov_working_sets[NUM_GCOV_WORKING_SETS - 1]; + i = pct_times_10 * NUM_GCOV_WORKING_SETS / 1000; + if (!i) + return &gcov_working_sets[0]; + return &gcov_working_sets[i - 1]; +} + /* Computes hybrid profile for all matching entries in da_file. CFG_CHECKSUM is the precomputed checksum for the CFG. */ @@ -219,6 +373,8 @@ get_exec_counts (unsigned cfg_checksum, unsigned l if (!counts) return NULL; + compute_working_sets(); + if (dump_file && profile_info) fprintf(dump_file, "Merged %u profiles with maximal count %u.\n", profile_info->runs, (unsigned) profile_info->sum_max); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 190736) +++ gcc/coverage.c (working copy) @@ -248,6 +248,13 @@ read_counts_file (void) summary.ctrs[ix].run_max = sum.ctrs[ix].run_max; summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max; } + if (new_summary) + memcpy (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram); new_summary = 0; } else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident) @@ -268,8 +275,9 @@ read_counts_file (void) entry->ctr = elt.ctr; entry->lineno_checksum = lineno_checksum; entry->cfg_checksum = cfg_checksum; - entry->summary = summary.ctrs[elt.ctr]; - entry->summary.num = n_counts; + if (elt.ctr < GCOV_COUNTERS_SUMMABLE) + entry->summary = summary.ctrs[elt.ctr]; + entry->summary.num = n_counts; entry->counts = XCNEWVEC (gcov_type, n_counts); } else if (entry->lineno_checksum != lineno_checksum Index: gcc/basic-block.h =================================================================== --- gcc/basic-block.h (revision 190736) +++ gcc/basic-block.h (working copy) @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3. If not see flow graph is manipulated by various optimizations. A signed type makes those easy to detect. */ typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; /* Control flow edge information. */ struct GTY((user)) edge_def { @@ -91,6 +92,16 @@ enum cfg_edge_flags { profile.c. */ extern const struct gcov_ctr_summary *profile_info; +/* Working set size statistics for a given percentage of the entire + profile (sum_all from the counter summary). */ +typedef struct gcov_working_set_info +{ + /* Number of hot counters included in this working set. */ + unsigned num_counters; + /* Smallest counter included in this working set. */ + gcov_type min_counter; +} gcov_working_set_t; + /* Declared in cfgloop.h. */ struct loop; @@ -897,4 +908,7 @@ extern void rtl_profile_for_bb (basic_block); extern void rtl_profile_for_edge (edge); extern void default_rtl_profile (void); +/* In profile.c. 
*/ +extern gcov_working_set_t *find_working_set(unsigned pct_times_10); + #endif /* GCC_BASIC_BLOCK_H */ Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 190736) +++ gcc/gcov-dump.c (working copy) @@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED) { struct gcov_summary summary; - unsigned ix; + unsigned ix, h_ix; + gcov_bucket_type *histo_bucket; gcov_read_summary (&summary); printf (" checksum=0x%08x", summary.checksum); @@ -465,5 +466,24 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED (HOST_WIDEST_INT)summary.ctrs[ix].run_max); printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_max); + if (ix != GCOV_COUNTER_ARCS) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\tcounter histogram:"); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + histo_bucket = &summary.ctrs[ix].histogram[h_ix]; + if (!histo_bucket->num_counters) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\t%d: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC ", cum_counter=" + HOST_WIDEST_INT_PRINT_DEC, + h_ix, histo_bucket->num_counters, + (HOST_WIDEST_INT)histo_bucket->min_value, + (HOST_WIDEST_INT)histo_bucket->cum_value); + } } } -- This patch is available for review at http://codereview.appspot.com/6465057
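As a quick editorial illustration of the bucket layout produced by gcov_histo_index in the patch above (assuming a 64-bit gcov_type; the mappings below follow from (r - 1) * 4 + prev2bits):

/* Sample counter-value-to-bucket mappings (illustration only):
     0..3   -> buckets 0..3   (the two lowest log2 ranges, purely linear)
     4..7   -> buckets 4..7   (r = 2, one value per linear sub-bucket)
     8..9   -> bucket 8       (r = 3, prev2bits = 0)
    10..11  -> bucket 9
    12..13  -> bucket 10
    14..15  -> bucket 11
    16      -> bucket 12      (the next log2 range begins)
   Each log2 range [2^r, 2^(r+1)) is thus split into four equal linear
   sub-buckets by the two bits just below the leading one bit.  */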
> Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 190736) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Insert counter VALUE into HISTOGRAM. */ > + > +static void > +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) > +{ > + unsigned i; > + > + i = gcov_histo_index(value); > + gcc_assert (i < GCOV_HISTOGRAM_SIZE); Does checking_assert work in libgcov? I do not think an internal consistency check should go into --enable-checking=release libgcov. We want to maintain it as lightweight as possible. (I see there are two existing gcc_asserts; since they report file format corruption, I think they should give a better diagnostic.) The inliner will do a good job here, but perhaps an explicit inline fits. > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + for (ix = 0; ix < ci_ptr->num; ix++) > + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); Space before (. > + } > + } > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +419,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_histogram (&this_prg); > @@ -598,11 +669,18 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + cs_prg->num = cs_tprg->num; > + else if (cs_prg->num != cs_tprg->num) > + goto read_mismatch; Doesn't this check that all the programs that contain this unit are the same? I.e. will this survive profiledbootstrap, where we interleave cc1 and cc1plus? > + /* Count number of non-zero histogram entries. The histogram is only > + currently computed for arc counters. */ > + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; > + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) > + { > + if (csum->histogram[h_ix].num_counters > 0) > + h_cnt++; > + } > + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); > gcov_write_unsigned (summary->checksum); > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > gcov_write_counter (csum->sum_max); > + if (ix != GCOV_COUNTER_ARCS) > + { > + gcov_write_unsigned (0); > + continue; > + } > + gcov_write_unsigned (h_cnt); > + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) > + { > + if (!csum->histogram[h_ix].num_counters) > + continue; > + gcov_write_unsigned (h_ix); It is kind of a waste to write a whole unsigned for each histogram index. What about writing a bitmap of non-zero entries followed by each entry? > +/* Merge SRC_HISTO into TGT_HISTO. */ Perhaps a comment about the overall concept of the merging routine would suit here. > -#else /*!IN_GCOV */ > -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) Why do you need to move this out of !libgcov? I do not think this is correct for all configurations, i.e. gcov_type may be 16-bit. Patch is OK if it passed profiledbootstrap, modulo the comments above. Thanks! Honza
On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 190736) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Insert counter VALUE into HISTOGRAM. */ >> + >> +static void >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) >> +{ >> + unsigned i; >> + >> + i = gcov_histo_index(value); >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); > Does checking_assert work in libgcov? I do not think an internal consistency check > should go into --enable-checking=release libgcov. We want to maintain it as > lightweight as possible. (I see there are two existing gcc_asserts; since they > report file format corruption, I think they should give a better diagnostic.) gcc_checking_assert isn't available, since tsystem.h, not system.h, is included. I could probably just remove the assert (to be safe, silently return if i is out of bounds?). > > The inliner will do a good job here, but perhaps an explicit inline fits. >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + for (ix = 0; ix < ci_ptr->num; ix++) >> + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); > Space before (. Ok. >> + } >> + } >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +419,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_histogram (&this_prg); >> @@ -598,11 +669,18 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + cs_prg->num = cs_tprg->num; >> + else if (cs_prg->num != cs_tprg->num) >> + goto read_mismatch; > > Doesn't this check that all the programs that contain this unit are the same? > I.e. will this survive profiledbootstrap, where we interleave cc1 and cc1plus? Ok, I am removing that check and switching the histogram merging code to handle the case where there are different numbers of counters. It will end up with the same number of counters as in the summary we are merging into, since that is the num we keep above when runs > 0 to start with. >> + /* Count number of non-zero histogram entries. The histogram is only >> + currently computed for arc counters. */ >> + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; >> + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) >> + { >> + if (csum->histogram[h_ix].num_counters > 0) >> + h_cnt++; >> + } >> + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); >> gcov_write_unsigned (summary->checksum); >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> gcov_write_counter (csum->sum_max); >> + if (ix != GCOV_COUNTER_ARCS) >> + { >> + gcov_write_unsigned (0); >> + continue; >> + } >> + gcov_write_unsigned (h_cnt); >> + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) >> + { >> + if (!csum->histogram[h_ix].num_counters) >> + continue; >> + gcov_write_unsigned (h_ix); > > It is kind of a waste to write a whole unsigned for each histogram index. > What about writing a bitmap of non-zero entries followed by each entry? Sure, I will do that instead. >> +/* Merge SRC_HISTO into TGT_HISTO. */ > > Perhaps a comment about the overall concept of the merging routine would suit here. Ok. >> -#else /*!IN_GCOV */ >> -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) > > Why do you need to move this out of !libgcov? I do not think this is correct for all configurations, > i.e. gcov_type may be 16-bit. From my understanding of the mode attribute meanings, which I thought are defined in terms of the number of smallest addressable units, the code in gcov-io.h that sets up the gcov_type typedef will always end up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is 8 it will use either SI or DI which will end up either 32 or 64, and when BITS_PER_UNIT is 16 it would use either HI or SI which would again be either 32 or 64. Is that wrong, and can we end up with a 16-bit gcov_type? The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so it was being defined IN_LIBGCOV), but I wanted it defined unconditionally because I am using it to determine the required number of histogram entries. In any case, I think it will be more straightforward to define the number of histogram entries based on the max possible gcov_type size, which is 64 (so 256 entries). This will make implementing the bit mask simpler, since it will always require the same number of gcov_unsigned ints (8). > > Patch is OK if it passed profiledbootstrap, modulo the comments above. Ok, thanks. Working on the fixes above. Teresa > Thanks! > Honza -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
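For concreteness, here is a minimal sketch of the occupancy bit vector scheme being agreed on in this exchange, sized for the 252-entry histogram the committed patch below ends up using; the helper name build_bitvector and the demo in main are illustrative, not the patch's code.

/* Sketch of the occupancy bit vector encoding: one bit per histogram
   entry, set when that entry has a non-zero count, so only non-zero
   entries need to be written after the vector.  Sizes follow the
   committed patch (252 entries -> 8 32-bit words); the helper names
   and the demo in main are illustrative.  */
#include <stdio.h>

#define HISTOGRAM_SIZE 252
#define BITVECTOR_WORDS ((HISTOGRAM_SIZE + 31) / 32)

/* Fill BITVECTOR from the per-entry counts and return the number of
   non-zero entries, i.e. how many histogram records would follow the
   bit vector in the gcov file.  */
static unsigned
build_bitvector (const unsigned num_counters[HISTOGRAM_SIZE],
                 unsigned bitvector[BITVECTOR_WORDS])
{
  unsigned h_ix, w, nonzero = 0;
  for (w = 0; w < BITVECTOR_WORDS; w++)
    bitvector[w] = 0;
  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
    if (num_counters[h_ix])
      {
        bitvector[h_ix / 32] |= 1u << (h_ix % 32);
        nonzero++;
      }
  return nonzero;
}

int
main (void)
{
  unsigned counts[HISTOGRAM_SIZE] = { 0 };
  unsigned bitvector[BITVECTOR_WORDS];
  unsigned w, b;

  counts[3] = 7;      /* Two sample non-zero histogram entries.  */
  counts[100] = 42;
  printf ("%u non-zero entries\n", build_bitvector (counts, bitvector));

  /* The reader recovers the entry indices by walking the set bits,
     mirroring the loop in gcov_read_summary.  */
  for (w = 0; w < BITVECTOR_WORDS; w++)
    for (b = 0; b < 32; b++)
      if (bitvector[w] & (1u << b))
        printf ("entry %u is present\n", w * 32 + b);
  return 0;
}

Writing the vector costs a fixed 8 unsigned words per summary, after which only the non-zero buckets are emitted, which is the space saving asked for above.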
> On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> Index: libgcc/libgcov.c > >> =================================================================== > >> --- libgcc/libgcov.c (revision 190736) > >> +++ libgcc/libgcov.c (working copy) > >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > >> return 1; > >> } > >> > >> +/* Insert counter VALUE into HISTOGRAM. */ > >> + > >> +static void > >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) > >> +{ > >> + unsigned i; > >> + > >> + i = gcov_histo_index(value); > >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); > > Does checking_assert work in libgcov? I do not think an internal consistency check > > should go into --enable-checking=release libgcov. We want to maintain it as > > lightweight as possible. (I see there are two existing gcc_asserts; since they > > report file format corruption, I think they should give a better diagnostic.) > > gcc_checking_assert isn't available, since tsystem.h, not system.h, is > included. I could probably just remove the assert (to be safe, > silently return if i is out of bounds?). I think just removing the assert is fine here: the only way it can trigger is if GCOV_HISTOGRAM_SIZE is wrong, and it ought not to be. > > From my understanding of the mode attribute meanings, which I thought > are defined in terms of the number of smallest addressable units, the > code in gcov-io.h that sets up the gcov_type typedef will always end > up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is > 8 it will use either SI or DI which will end up either 32 or 64, and > when BITS_PER_UNIT is 16 it would use either HI or SI which would > again be either 32 or 64. Is that wrong, and can we end up with a 16-bit > gcov_type? I see, the code was simplified a bit since we dropped support for some of the more exotic targets. The type should be either 32-bit or 64-bit. > > The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so > it was being defined IN_LIBGCOV), but I wanted it defined > unconditionally because I am using it to determine the required number > of histogram entries. > > In any case, I think it will be more straightforward to define the > number of histogram entries based on the max possible gcov_type size, > which is 64 (so 256 entries). This will make implementing the bit mask > simpler, since it will always require the same number of gcov_unsigned > ints (8). Sounds good to me. 64-bit should be enough for everyone. Coverage is kind of useless for smaller types, and for really restricted targets we will more likely want to disable histogram generation rather than making it a bit smaller. > > > > > > Patch is OK if it passed profiledbootstrap, modulo the comments above. > > Ok, thanks. Working on the fixes above. OK, thanks! Honza > > Teresa > > > Thanks! > > Honza > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
I just committed the patch (included below). I implemented the occupancy bit vector approach for recording non-zero histogram entries, and fixed a few issues uncovered with the merging in a profiled bootstrap. Passes both bootstrap and profiledbootstrap builds and regression tests. Thanks, Teresa Enhances the gcov program summary by adding a histogram of arc counter entries. This is used to compute working set information in the compiler for use by optimizations that need information on hot vs cold counter values or the rough working set size in terms of the number of counters. Each working set data point is the minimum counter value and number of counters required to reach a given percentage of the cumulative counter sum across the profiled execution (sum_all in the program summary). 2012-09-04 Teresa Johnson <tejohnson@google.com> * libgcc/libgcov.c (struct gcov_summary_buffer): New structure. (gcov_histogram_insert): New function. (gcov_compute_histogram): Ditto. (gcov_exit): Invoke gcov_compute_histogram, and perform merging of histograms during summary merging. * gcc/gcov-io.c (gcov_write_summary): Write out non-zero histogram entries to function summary along with an occupancy bit vector. (gcov_read_summary): Read in the histogram entries. (gcov_histo_index): New function. (gcov_histogram_merge): Ditto. * gcc/gcov-io.h (gcov_type_unsigned): New type. (struct gcov_bucket_type): Ditto. (struct gcov_ctr_summary): Include histogram. (GCOV_TAG_SUMMARY_LENGTH): Update to include histogram entries. (GCOV_HISTOGRAM_SIZE): New macro. (GCOV_HISTOGRAM_BITVECTOR_SIZE): Ditto. * gcc/profile.c (NUM_GCOV_WORKING_SETS): Ditto. (gcov_working_sets): New global variable. (compute_working_sets): New function. (find_working_set): Ditto. (get_exec_counts): Invoke compute_working_sets. * gcc/coverage.c (read_counts_file): Merge histograms, and fix bug with accessing summary info for non-summable counters. * gcc/basic-block.h (gcov_type_unsigned): New type. (struct gcov_working_set_info): Ditto. (find_working_set): Declare. * gcc/gcov-dump.c (tag_summary): Dump out histogram. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 190950) +++ libgcc/libgcov.c (working copy) @@ -97,6 +97,12 @@ struct gcov_fn_buffer /* note gcov_fn_info ends in a trailing array. */ }; +struct gcov_summary_buffer +{ + struct gcov_summary_buffer *next; + struct gcov_summary summary; +}; + /* Chain of per-object gcov structures. */ static struct gcov_info *gcov_list; @@ -276,6 +282,76 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Insert counter VALUE into HISTOGRAM. */ + +static void +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) +{ + unsigned i; + + i = gcov_histo_index(value); + histogram[i].num_counters++; + histogram[i].cum_value += value; + if (value < histogram[i].min_value) + histogram[i].min_value = value; +} + +/* Computes a histogram of the arc counters to place in the summary SUM. */ + +static void +gcov_compute_histogram (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, ctr_info_ix, ix; + int h_ix; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. 
*/ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + cs_ptr->histogram[h_ix].num_counters = 0; + cs_ptr->histogram[h_ix].min_value = cs_ptr->run_max; + cs_ptr->histogram[h_ix].cum_value = 0; + } + + /* Walk through all the per-object structures and record each of + the count values in histogram. */ + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. */ + for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++) + { + if (gi_ptr->merge[ix]) + ctr_info_ix++; + } + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) + { + gfi_ptr = gi_ptr->functions[f_ix]; + + if (!gfi_ptr || gfi_ptr->key != gi_ptr) + continue; + + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; + for (ix = 0; ix < ci_ptr->num; ix++) + gcov_histogram_insert (cs_ptr->histogram, ci_ptr->values[ix]); + } + } +} + /* Dump the coverage counts. We merge with existing counts when possible, to avoid growing the .da files ad infinitum. We use this program's checksum to make sure we only accumulate whole program @@ -347,6 +423,7 @@ gcov_exit (void) } } } + gcov_compute_histogram (&this_prg); { /* Check if the level of dirs to strip off specified. */ @@ -400,6 +477,8 @@ gcov_exit (void) const char *fname, *s; struct gcov_fn_buffer *fn_buffer = 0; struct gcov_fn_buffer **fn_tail = &fn_buffer; + struct gcov_summary_buffer *next_sum_buffer, *sum_buffer = 0; + struct gcov_summary_buffer **sum_tail = &sum_buffer; fname = gi_ptr->filename; @@ -482,17 +561,29 @@ gcov_exit (void) f_ix--; length = gcov_read_unsigned (); - if (length != GCOV_TAG_SUMMARY_LENGTH) - goto read_mismatch; gcov_read_summary (&tmp); if ((error = gcov_is_error ())) goto read_error; - if (summary_pos || tmp.checksum != crc32) - goto next_summary; + if (summary_pos) + { + /* Save all summaries after the one that will be + merged into below. These will need to be rewritten + as histogram merging may change the number of non-zero + histogram entries that will be emitted, and thus the + size of the merged summary. */ + (*sum_tail) = (struct gcov_summary_buffer *) + malloc (sizeof(struct gcov_summary_buffer)); + (*sum_tail)->summary = tmp; + (*sum_tail)->next = 0; + sum_tail = &((*sum_tail)->next); + goto next_summary; + } + if (tmp.checksum != crc32) + goto next_summary; for (t_ix = 0; t_ix != GCOV_COUNTERS_SUMMABLE; t_ix++) if (tmp.ctrs[t_ix].num != this_prg.ctrs[t_ix].num) - goto next_summary; + goto next_summary; prg = tmp; summary_pos = eof_pos; @@ -598,11 +689,16 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + cs_prg->num = cs_tprg->num; cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; cs_prg->sum_max += cs_tprg->run_max; + if (cs_prg->runs == 1) + memcpy (cs_prg->histogram, cs_tprg->histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (cs_prg->histogram, cs_tprg->histogram); } else if (cs_prg->runs) goto read_mismatch; @@ -635,8 +731,18 @@ gcov_exit (void) /* Generate whole program statistics. */ gcov_write_summary (GCOV_TAG_PROGRAM_SUMMARY, &prg); - if (summary_pos < eof_pos) - gcov_seek (eof_pos); + /* Rewrite all the summaries that were after the summary we merged + into. 
This is necessary as the merged summary may have a different + size due to the number of non-zero histogram entries changing after + merging. */ + + while (sum_buffer) + { + gcov_write_summary (GCOV_TAG_PROGRAM_SUMMARY, &sum_buffer->summary); + next_sum_buffer = sum_buffer->next; + free (sum_buffer); + sum_buffer = next_sum_buffer; + } /* Write execution counts for each function. */ for (f_ix = 0; (unsigned)f_ix != gi_ptr->n_functions; f_ix++) Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 190950) +++ gcc/gcov-io.c (working copy) @@ -368,10 +368,25 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u GCOV_LINKAGE void gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, bv_ix, h_cnt = 0; const struct gcov_ctr_summary *csum; + unsigned histo_bitvector[GCOV_HISTOGRAM_BITVECTOR_SIZE]; - gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH); + /* Count number of non-zero histogram entries, and fill in a bit vector + of non-zero indices. The histogram is only currently computed for arc + counters. */ + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + histo_bitvector[bv_ix] = 0; + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (csum->histogram[h_ix].num_counters > 0) + { + histo_bitvector[h_ix / 32] |= 1 << (h_ix % 32); + h_cnt++; + } + } + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); gcov_write_unsigned (summary->checksum); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { @@ -380,6 +395,22 @@ gcov_write_summary (gcov_unsigned_t tag, const str gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); gcov_write_counter (csum->sum_max); + if (ix != GCOV_COUNTER_ARCS) + { + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + gcov_write_unsigned (0); + continue; + } + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + gcov_write_unsigned (histo_bitvector[bv_ix]); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (!csum->histogram[h_ix].num_counters) + continue; + gcov_write_unsigned (csum->histogram[h_ix].num_counters); + gcov_write_counter (csum->histogram[h_ix].min_value); + gcov_write_counter (csum->histogram[h_ix].cum_value); + } } } #endif /* IN_LIBGCOV */ @@ -488,8 +519,10 @@ gcov_read_string (void) GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, bv_ix, h_cnt = 0; struct gcov_ctr_summary *csum; + unsigned histo_bitvector[GCOV_HISTOGRAM_BITVECTOR_SIZE]; + unsigned cur_bitvector; summary->checksum = gcov_read_unsigned (); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) @@ -499,6 +532,43 @@ gcov_read_summary (struct gcov_summary *summary) csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); csum->sum_max = gcov_read_counter (); + memset (csum->histogram, 0, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + { + histo_bitvector[bv_ix] = gcov_read_unsigned (); + h_cnt += __builtin_popcountll (histo_bitvector[bv_ix]); + } + bv_ix = 0; + h_ix = 0; + cur_bitvector = 0; + while (h_cnt--) + { + /* Find the index corresponding to the next entry we will read in. + First find the next non-zero bitvector and re-initialize + the histogram index accordingly, then right shift and increment + the index until we find a set bit. 
*/ + while (!cur_bitvector) + { + h_ix = bv_ix * 32; + cur_bitvector = histo_bitvector[bv_ix++]; + gcc_assert(bv_ix <= GCOV_HISTOGRAM_BITVECTOR_SIZE); + } + while (!(cur_bitvector & 0x1)) + { + h_ix++; + cur_bitvector >>= 1; + } + gcc_assert(h_ix < GCOV_HISTOGRAM_SIZE); + + csum->histogram[h_ix].num_counters = gcov_read_unsigned (); + csum->histogram[h_ix].min_value = gcov_read_counter (); + csum->histogram[h_ix].cum_value = gcov_read_counter (); + /* Shift off the index we are done with and increment to the + corresponding next histogram entry. */ + cur_bitvector >>= 1; + h_ix++; + } } } @@ -550,3 +620,184 @@ gcov_time (void) return status.st_mtime; } #endif /* IN_GCOV */ + +#if IN_LIBGCOV || !IN_GCOV +/* Determine the index into histogram for VALUE. */ + +static unsigned +gcov_histo_index(gcov_type value) +{ + gcov_type_unsigned v = (gcov_type_unsigned)value; + unsigned r = 0; + unsigned prev2bits = 0; + + /* Find index into log2 scale histogram, where each of the log2 + sized buckets is divided into 4 linear sub-buckets for better + focus in the higher buckets. */ + + /* Find the place of the most-significant bit set. */ + if (v > 0) + r = 63 - __builtin_clzll (v); + + /* If at most the 2 least significant bits are set (value is + 0 - 3) then that value is our index into the lowest set of + four buckets. */ + if (r < 2) + return (unsigned)value; + + gcc_assert (r < 64); + + /* Find the two next most significant bits to determine which + of the four linear sub-buckets to select. */ + prev2bits = (v >> (r - 2)) & 0x3; + /* Finally, compose the final bucket index from the log2 index and + the next 2 bits. The minimum r value at this point is 2 since we + returned above if r was 2 or more, so the minimum bucket at this + point is 4. */ + return (r - 1) * 4 + prev2bits; +} + +/* Merge SRC_HISTO into TGT_HISTO. The counters are assumed to be in + the same relative order in both histograms, and are matched up + and merged in reverse order. Each counter is assigned an equal portion of + its entry's original cumulative counter value when computing the + new merged cum_value. */ + +static void gcov_histogram_merge(gcov_bucket_type *tgt_histo, + gcov_bucket_type *src_histo) +{ + int src_i, tgt_i, tmp_i = 0; + unsigned src_num, tgt_num, merge_num; + gcov_type src_cum, tgt_cum, merge_src_cum, merge_tgt_cum, merge_cum; + gcov_type merge_min; + gcov_bucket_type tmp_histo[GCOV_HISTOGRAM_SIZE]; + int src_done = 0; + + memset(tmp_histo, 0, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + + /* Assume that the counters are in the same relative order in both + histograms. Walk the histograms from largest to smallest entry, + matching up and combining counters in order. */ + src_num = 0; + src_cum = 0; + src_i = GCOV_HISTOGRAM_SIZE - 1; + for (tgt_i = GCOV_HISTOGRAM_SIZE - 1; tgt_i >= 0 && !src_done; tgt_i--) + { + tgt_num = tgt_histo[tgt_i].num_counters; + tgt_cum = tgt_histo[tgt_i].cum_value; + /* Keep going until all of the target histogram's counters at this + position have been matched and merged with counters from the + source histogram. */ + while (tgt_num > 0 && !src_done) + { + /* If this is either the first time through this loop or we just + exhausted the previous non-zero source histogram entry, look + for the next non-zero source histogram entry. */ + if (!src_num) + { + /* Locate the next non-zero entry. */ + while (src_i >= 0 && !src_histo[src_i].num_counters) + src_i--; + /* If source histogram has fewer counters, then just copy over the + remaining target counters and quit. 
*/ + if (src_i < 0) + { + tmp_histo[tgt_i].num_counters += tgt_num; + tmp_histo[tgt_i].cum_value += tgt_cum; + if (!tmp_histo[tgt_i].min_value || + tgt_histo[tgt_i].min_value < tmp_histo[tgt_i].min_value) + tmp_histo[tgt_i].min_value = tgt_histo[tgt_i].min_value; + while (--tgt_i >= 0) + { + tmp_histo[tgt_i].num_counters + += tgt_histo[tgt_i].num_counters; + tmp_histo[tgt_i].cum_value += tgt_histo[tgt_i].cum_value; + if (!tmp_histo[tgt_i].min_value || + tgt_histo[tgt_i].min_value + < tmp_histo[tgt_i].min_value) + tmp_histo[tgt_i].min_value = tgt_histo[tgt_i].min_value; + } + + src_done = 1; + break; + } + + src_num = src_histo[src_i].num_counters; + src_cum = src_histo[src_i].cum_value; + } + + /* The number of counters to merge on this pass is the minimum + of the remaining counters from the current target and source + histogram entries. */ + merge_num = tgt_num; + if (src_num < merge_num) + merge_num = src_num; + + /* The merged min_value is the sum of the min_values from target + and source. */ + merge_min = tgt_histo[tgt_i].min_value + src_histo[src_i].min_value; + + /* Compute the portion of source and target entries' cum_value + that will be apportioned to the counters being merged. + The total remaining cum_value from each entry is divided + equally among the counters from that histogram entry if we + are not merging all of them. */ + merge_src_cum = src_cum; + if (merge_num < src_num) + merge_src_cum = merge_num * src_cum / src_num; + merge_tgt_cum = tgt_cum; + if (merge_num < tgt_num) + merge_tgt_cum = merge_num * tgt_cum / tgt_num; + /* The merged cum_value is the sum of the source and target + components. */ + merge_cum = merge_src_cum + merge_tgt_cum; + + /* Update the remaining number of counters and cum_value left + to be merged from this source and target entry. */ + src_cum -= merge_src_cum; + tgt_cum -= merge_tgt_cum; + src_num -= merge_num; + tgt_num -= merge_num; + + /* The merged counters get placed in the new merged histogram + at the entry for the merged min_value. */ + tmp_i = gcov_histo_index(merge_min); + gcc_assert (tmp_i < GCOV_HISTOGRAM_SIZE); + tmp_histo[tmp_i].num_counters += merge_num; + tmp_histo[tmp_i].cum_value += merge_cum; + if (!tmp_histo[tmp_i].min_value || + merge_min < tmp_histo[tmp_i].min_value) + tmp_histo[tmp_i].min_value = merge_min; + + /* Ensure the search for the next non-zero src_histo entry starts + at the next smallest histogram bucket. */ + if (!src_num) + src_i--; + } + } + + gcc_assert (tgt_i < 0); + + /* In the case where there were more counters in the source histogram, + accumulate the remaining unmerged cumulative counter values. Add + those to the smallest non-zero target histogram entry. Otherwise, + the total cumulative counter values in the histogram will be smaller + than the sum_all stored in the summary, which will complicate + computing the working set information from the histogram later on. */ + if (src_num) + src_i--; + while (src_i >= 0) + { + src_cum += src_histo[src_i].cum_value; + src_i--; + } + /* At this point, tmp_i should be the smallest non-zero entry in the + tmp_histo. */ + gcc_assert(tmp_i >= 0 && tmp_i < GCOV_HISTOGRAM_SIZE + && tmp_histo[tmp_i].num_counters > 0); + tmp_histo[tmp_i].cum_value += src_cum; + + /* Finally, copy the merged histogram into tgt_histo. 
*/ + memcpy(tgt_histo, tmp_histo, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); +} +#endif /* IN_LIBGCOV || !IN_GCOV */ Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 190950) +++ gcc/gcov-io.h (working copy) @@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect counts: header int64:count* summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE count-summary: int32:num int32:runs int64:sum - int64:max int64:sum_max + int64:max int64:sum_max histogram + histogram: {int32:bitvector}8 histogram-buckets* + histogram-buckets: int32:num int64:min int64:sum The ANNOUNCE_FUNCTION record is the same as that in the note file, but without the source location. The COUNTS gives the @@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (SI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (DI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI))); #else typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #endif #else #if BITS_PER_UNIT == 16 @@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (HI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #else typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #endif #else typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI))); typedef unsigned gcov_position_t __attribute__ ((mode (QI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #else typedef signed gcov_type __attribute__ ((mode (QI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI))); #endif #endif #endif @@ -210,6 +218,7 @@ typedef unsigned gcov_position_t; #if IN_GCOV #define GCOV_LINKAGE static typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; #if IN_GCOV > 0 #include <sys/types.h> #endif @@ -309,9 +318,10 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2) #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) -#define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) +#define GCOV_TAG_SUMMARY_LENGTH(NUM) \ + (1 + GCOV_COUNTERS_SUMMABLE * (10 + 3 * 2) + (NUM) * 5) + /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ #define GCOV_COUNTERS_SUMMABLE 1 /* Counters which can be @@ -389,6 +399,29 @@ typedef HOST_WIDEST_INT gcov_type; /* Structured records. */ +/* Structure used for each bucket of the log2 histogram of counter values. */ +typedef struct +{ + /* Number of counters whose profile count falls within the bucket. */ + gcov_unsigned_t num_counters; + /* Smallest profile count included in this bucket. */ + gcov_type min_value; + /* Cumulative value of the profile counts in this bucket. 
*/ + gcov_type cum_value; +} gcov_bucket_type; + +/* For a log2 scale histogram with each range split into 4 + linear sub-ranges, there will be at most 64 (max gcov_type bit size) - 1 log2 + ranges since the lowest 2 log2 values share the lowest 4 linear + sub-range (values 0 - 3). This is 252 total entries (63*4). */ + +#define GCOV_HISTOGRAM_SIZE 252 + +/* How many unsigned ints are required to hold a bit vector of non-zero + histogram entries when the histogram is written to the gcov file. + This is essentially a ceiling divide by 32 bits. */ +#define GCOV_HISTOGRAM_BITVECTOR_SIZE (GCOV_HISTOGRAM_SIZE + 31) / 32 + /* Cumulative counter data. */ struct gcov_ctr_summary { @@ -397,6 +430,8 @@ struct gcov_ctr_summary gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. */ gcov_type sum_max; /* sum of individual run max values. */ + gcov_bucket_type histogram[GCOV_HISTOGRAM_SIZE]; /* histogram of + counter values. */ }; /* Object & program summary record. */ Index: gcc/profile.c =================================================================== --- gcc/profile.c (revision 190950) +++ gcc/profile.c (working copy) @@ -84,6 +84,15 @@ struct bb_info { const struct gcov_ctr_summary *profile_info; +/* Number of data points in the working set summary array. Using 128 + provides information for at least every 1% increment of the total + profile size. The last entry is hardwired to 99.9% of the total. */ +#define NUM_GCOV_WORKING_SETS 128 + +/* Counter working set information computed from the current counter + summary. Not initialized unless profile_info summary is non-NULL. */ +static gcov_working_set_t gcov_working_sets[NUM_GCOV_WORKING_SETS]; + /* Collect statistics on the performance of this pass for the entire source file. */ @@ -192,6 +201,152 @@ instrument_values (histogram_values values) } ^L +/* Compute the working set information from the counter histogram in + the profile summary. This is an array of information corresponding to a + range of percentages of the total execution count (sum_all), and includes + the number of counters required to cover that working set percentage and + the minimum counter value in that working set. */ + +static void +compute_working_sets (void) +{ + gcov_type working_set_cum_values[NUM_GCOV_WORKING_SETS]; + gcov_type ws_cum_hotness_incr; + gcov_type cum, tmp_cum; + const gcov_bucket_type *histo_bucket; + unsigned ws_ix, c_num, count, pctinc, pct; + int h_ix; + gcov_working_set_t *ws_info; + + if (!profile_info) + return; + + /* Compute the amount of sum_all that the cumulative hotness grows + by in each successive working set entry, which depends on the + number of working set entries. */ + ws_cum_hotness_incr = profile_info->sum_all / NUM_GCOV_WORKING_SETS; + + /* Next fill in an array of the cumulative hotness values corresponding + to each working set summary entry we are going to compute below. + Skip 0% statistics, which can be extrapolated from the + rest of the summary data. */ + cum = ws_cum_hotness_incr; + for (ws_ix = 0; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, cum += ws_cum_hotness_incr) + working_set_cum_values[ws_ix] = cum; + /* The last summary entry is reserved for (roughly) 99.9% of the + working set. Divide by 1024 so it becomes a shift, which gives + almost exactly 99.9%. 
*/ + working_set_cum_values[NUM_GCOV_WORKING_SETS-1] + = profile_info->sum_all - profile_info->sum_all/1024; + + /* Next, walk through the histogram in descending order of hotness + and compute the statistics for the working set summary array. + As histogram entries are accumulated, we check to see which + working set entries have had their expected cum_value reached + and fill them in, walking the working set entries in increasing + size of cum_value. */ + ws_ix = 0; /* The current entry into the working set array. */ + cum = 0; /* The current accumulated counter sum. */ + count = 0; /* The current accumulated count of block counters. */ + for (h_ix = GCOV_HISTOGRAM_SIZE - 1; + h_ix >= 0 && ws_ix < NUM_GCOV_WORKING_SETS; h_ix--) + { + histo_bucket = &profile_info->histogram[h_ix]; + + /* If we haven't reached the required cumulative counter value for + the current working set percentage, simply accumulate this histogram + entry into the running sums and continue to the next histogram + entry. */ + if (cum + histo_bucket->cum_value < working_set_cum_values[ws_ix]) + { + cum += histo_bucket->cum_value; + count += histo_bucket->num_counters; + continue; + } + + /* If adding the current histogram entry's cumulative counter value + causes us to exceed the current working set size, then estimate + how many of this histogram entry's counter values are required to + reach the working set size, and fill in working set entries + as we reach their expected cumulative value. */ + for (c_num = 0, tmp_cum = cum; + c_num < histo_bucket->num_counters && ws_ix < NUM_GCOV_WORKING_SETS; + c_num++) + { + count++; + /* If we haven't reached the last histogram entry counter, add + in the minimum value again. This will underestimate the + cumulative sum so far, because many of the counter values in this + entry may have been larger than the minimum. We could add in the + average value every time, but that would require an expensive + divide operation. */ + if (c_num + 1 < histo_bucket->num_counters) + tmp_cum += histo_bucket->min_value; + /* If we have reached the last histogram entry counter, then add + in the entire cumulative value. */ + else + tmp_cum = cum + histo_bucket->cum_value; + + /* Next walk through successive working set entries and fill in + the statistics for any whose size we have reached by accumulating + this histogram counter. */ + while (tmp_cum >= working_set_cum_values[ws_ix] + && ws_ix < NUM_GCOV_WORKING_SETS) + { + gcov_working_sets[ws_ix].num_counters = count; + gcov_working_sets[ws_ix].min_counter + = histo_bucket->min_value; + ws_ix++; + } + } + /* Finally, update the running cumulative value since we were + using a temporary above. */ + cum += histo_bucket->cum_value; + } + gcc_assert (ws_ix == NUM_GCOV_WORKING_SETS); + + if (dump_file) + { + fprintf (dump_file, "Counter working sets:\n"); + /* Multiply the percentage by 100 to avoid float. */ + pctinc = 100 * 100 / NUM_GCOV_WORKING_SETS; + for (ws_ix = 0, pct = pctinc; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, pct += pctinc) + { + if (ws_ix == NUM_GCOV_WORKING_SETS - 1) + pct = 9990; + ws_info = &gcov_working_sets[ws_ix]; + /* Print out the percentage using int arithmetic to avoid float. */ + fprintf (dump_file, "\t\t%u.%02u%%: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC "\n", + pct / 100, pct - (pct / 100 * 100), + ws_info->num_counters, + (HOST_WIDEST_INT)ws_info->min_counter); + } + } +} + +/* Given the desired percentage of the full profile (sum_all from the + summary), multiplied by 10 to avoid float in PCT_TIMES_10, returns + the corresponding working set information. If an exact match for + the percentage isn't found, the closest value is used. */ + +gcov_working_set_t * +find_working_set (unsigned pct_times_10) +{ + unsigned i; + if (!profile_info) + return NULL; + gcc_assert (pct_times_10 <= 1000); + if (pct_times_10 >= 999) + return &gcov_working_sets[NUM_GCOV_WORKING_SETS - 1]; + i = pct_times_10 * NUM_GCOV_WORKING_SETS / 1000; + if (!i) + return &gcov_working_sets[0]; + return &gcov_working_sets[i - 1]; +} + /* Computes hybrid profile for all matching entries in da_file. CFG_CHECKSUM is the precomputed checksum for the CFG. */ @@ -219,6 +374,8 @@ get_exec_counts (unsigned cfg_checksum, unsigned l if (!counts) return NULL; + compute_working_sets(); + if (dump_file && profile_info) fprintf(dump_file, "Merged %u profiles with maximal count %u.\n", profile_info->runs, (unsigned) profile_info->sum_max); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 190950) +++ gcc/coverage.c (working copy) @@ -248,6 +248,13 @@ read_counts_file (void) summary.ctrs[ix].run_max = sum.ctrs[ix].run_max; summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max; } + if (new_summary) + memcpy (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram); new_summary = 0; } else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident) @@ -268,8 +275,9 @@ read_counts_file (void) entry->ctr = elt.ctr; entry->lineno_checksum = lineno_checksum; entry->cfg_checksum = cfg_checksum; - entry->summary = summary.ctrs[elt.ctr]; - entry->summary.num = n_counts; + if (elt.ctr < GCOV_COUNTERS_SUMMABLE) + entry->summary = summary.ctrs[elt.ctr]; + entry->summary.num = n_counts; entry->counts = XCNEWVEC (gcov_type, n_counts); } else if (entry->lineno_checksum != lineno_checksum Index: gcc/basic-block.h =================================================================== --- gcc/basic-block.h (revision 190950) +++ gcc/basic-block.h (working copy) @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3. If not see flow graph is manipulated by various optimizations. A signed type makes those easy to detect. */ typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; /* Control flow edge information. */ struct GTY((user)) edge_def { @@ -91,6 +92,16 @@ enum cfg_edge_flags { profile.c. */ extern const struct gcov_ctr_summary *profile_info; +/* Working set size statistics for a given percentage of the entire + profile (sum_all from the counter summary). */ +typedef struct gcov_working_set_info +{ + /* Number of hot counters included in this working set. */ + unsigned num_counters; + /* Smallest counter included in this working set. */ + gcov_type min_counter; +} gcov_working_set_t; + /* Declared in cfgloop.h. */ struct loop; @@ -897,4 +908,7 @@ extern void rtl_profile_for_bb (basic_block); extern void rtl_profile_for_edge (edge); extern void default_rtl_profile (void); +/* In profile.c. 
*/ +extern gcov_working_set_t *find_working_set(unsigned pct_times_10); + #endif /* GCC_BASIC_BLOCK_H */ Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 190950) +++ gcc/gcov-dump.c (working copy) @@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED) { struct gcov_summary summary; - unsigned ix; + unsigned ix, h_ix; + gcov_bucket_type *histo_bucket; gcov_read_summary (&summary); printf (" checksum=0x%08x", summary.checksum); @@ -465,5 +466,24 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED (HOST_WIDEST_INT)summary.ctrs[ix].run_max); printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_max); + if (ix != GCOV_COUNTER_ARCS) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\tcounter histogram:"); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + histo_bucket = &summary.ctrs[ix].histogram[h_ix]; + if (!histo_bucket->num_counters) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\t%d: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC ", cum_counter=" + HOST_WIDEST_INT_PRINT_DEC, + h_ix, histo_bucket->num_counters, + (HOST_WIDEST_INT)histo_bucket->min_value, + (HOST_WIDEST_INT)histo_bucket->cum_value); + } } } On Thu, Aug 30, 2012 at 8:32 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> Index: libgcc/libgcov.c >> >> =================================================================== >> >> --- libgcc/libgcov.c (revision 190736) >> >> +++ libgcc/libgcov.c (working copy) >> >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> >> return 1; >> >> } >> >> >> >> +/* Insert counter VALUE into HISTOGRAM. */ >> >> + >> >> +static void >> >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) >> >> +{ >> >> + unsigned i; >> >> + >> >> + i = gcov_histo_index(value); >> >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); >> > Does checking_assert work in libgcov? I do not think an internal consistency check >> > should go into --enable-checking=release libgcov. We want to maintain it as >> > lightweight as possible. (I see there are two existing gcc_asserts; since they >> > report file format corruption, I think they should give a better diagnostic.) >> >> gcc_checking_assert isn't available, since tsystem.h, not system.h, is >> included. I could probably just remove the assert (to be safe, >> silently return if i is out of bounds?). > > I think just removing the assert is fine here: the only way it can trigger is if > GCOV_HISTOGRAM_SIZE is wrong, and it ought not to be. >> >> From my understanding of the mode attribute meanings, which I thought >> are defined in terms of the number of smallest addressable units, the >> code in gcov-io.h that sets up the gcov_type typedef will always end >> up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is >> 8 it will use either SI or DI which will end up either 32 or 64, and >> when BITS_PER_UNIT is 16 it would use either HI or SI which would >> again be either 32 or 64. Is that wrong, and can we end up with a 16-bit >> gcov_type? > > I see, the code was simplified a bit since we dropped support for some of the more exotic > targets. The type should be either 32-bit or 64-bit. >> >> The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so >> it was being defined IN_LIBGCOV), but I wanted it defined >> unconditionally because I am using it to determine the required number >> of histogram entries. >> >> In any case, I think it will be more straightforward to define the >> number of histogram entries based on the max possible gcov_type size, >> which is 64 (so 256 entries). This will make implementing the bit mask >> simpler, since it will always require the same number of gcov_unsigned >> ints (8). > Sounds good to me. 64-bit should be enough for everyone. Coverage is kind of useless > for smaller types, and for really restricted targets we will more likely want to disable > histogram generation rather than making it a bit smaller. >> >> > >> > Patch is OK if it passed profiledbootstrap, modulo the comments above. >> >> Ok, thanks. Working on the fixes above. > > OK, thanks! > Honza >> >> Teresa >> >> > Thanks! >> > Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
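To make the bucket layout concrete, the indexing logic of gcov_histo_index in the committed patch above can be exercised standalone. This sketch assumes a 64-bit counter type, as the thread concluded; histo_index mirrors the patch's computation, while main and its sample values are illustrative only.

/* Standalone rendering of the gcov_histo_index bucket computation from
   the patch above: a log2 scale with each power-of-two range split
   into 4 linear sub-buckets.  Assumes a 64-bit counter type; main and
   its sample values are illustrative.  */
#include <stdio.h>

static unsigned
histo_index (unsigned long long v)
{
  unsigned r = 0, prev2bits;

  /* Position of the most significant set bit, 0 for v == 0.  */
  if (v > 0)
    r = 63 - __builtin_clzll (v);

  /* Values 0-3 occupy the four lowest buckets directly.  */
  if (r < 2)
    return (unsigned) v;

  /* The two bits below the leading bit pick one of four linear
     sub-buckets within this log2 range.  */
  prev2bits = (v >> (r - 2)) & 0x3;
  return (r - 1) * 4 + prev2bits;
}

int
main (void)
{
  /* 0-3 map to buckets 0-3; 4-7 to 4-7; 8,10,12,14 to 8-11; each
     further doubling adds four buckets, topping out at
     (63 - 1) * 4 + 3 = 251, hence GCOV_HISTOGRAM_SIZE == 252.  */
  unsigned long long samples[] = { 0, 3, 4, 7, 8, 10, 100, 1000 };
  unsigned i;
  for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
    printf ("count %llu -> bucket %u\n", samples[i],
            histo_index (samples[i]));
  return 0;
}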
On 2012.09.04 at 14:23 -0700, Teresa Johnson wrote: > I just committed the patch (included below). I implemented the > occupancy bit vector approach for recording non-zero histogram > entries, and a few issues uncovered with the merging in a profiled > bootstrap. > > Passes both bootstrap and profiledbootstrap builds and regression tests. This commit causes: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 -- Markus
Sorry about that. I am right now trying to reproduce the profiledbootstrap problem that H.J. reported, which is on x86_64-unknown-linux-gnu where I had successfully done a profiledbootstrap before my commit. Unfortunately after svn updating my client I am hitting an unrelated build problem with my profiledbootstrap (it can't find the rule to make libc++11convenience.la) which is slowing me down. I don't have access to a gentoo system so I might need your help to track down the error you reported if it doesn't turn out to be the same as the one H.J. hit. If you have a bad .gcda file that you could send me I can start with that. Sounds like there is a profile merging problem that I didn't see, that is corrupting the gcda files in both cases. Thanks, Teresa On Wed, Sep 5, 2012 at 12:12 AM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2012.09.04 at 14:23 -0700, Teresa Johnson wrote: >> I just committed the patch (included below). I implemented the >> occupancy bit vector approach for recording non-zero histogram >> entries, and a few issues uncovered with the merging in a profiled >> bootstrap. >> >> Passes both bootstrap and profiledbootstrap builds and regression tests. > > This commit causes: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 > > -- > Markus -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: > Sorry about that. I am right now trying to reproduce the > profiledbootstrap problem that H.J. reported, which is on > x86_64-unknown-linux-gnu where I had successfully done a > profiledbootstrap before my commit. > > Unfortunately after svn updating my client I am hitting an unrelated > build problem with my profiledbootstrap (it can't find the rule to > make libc++11convenience.la) which is slowing me down. I can reproduce it with revision 190982 on Fedora 18/x86-64 with 8-core. -- H.J.
On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >> Sorry about that. I am right now trying to reproduce the >> profiledbootstrap problem that H.J. reported, which is on >> x86_64-unknown-linux-gnu where I had successfully done a >> profiledbootstrap before my commit. >> >> Unfortunately after svn updating my client I am hitting an unrelated >> build problem with my profiledbootstrap (it can't find the rule to >> make libc++11convenience.la) which is slowing me down. > > I can reproduce it with revision 190982 on Fedora 18/x86-64 > with 8-core. Ok, thanks. I am being blocked by an unrelated error: libtool: compile: /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs -B/usr/local/x86_64-unknown-linux-gnu/bin/ -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem /usr/local/x86_64-unknown-linux-gnu/include -isystem /usr/local/x86_64-unknown-linux-gnu/sys-include -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc -fPIC -DPIC -o random.o /tmp/ccOm0d5x.s: Assembler messages: /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' make[6]: *** [random.lo] Error 1 Looks like I am being hit by: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 I'm going to try backing out all the changes related to this bug to see if I can make progress on the profiledbootstrap. Teresa > > -- > H.J. -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Wed, Sep 5, 2012 at 8:50 AM, Teresa Johnson <tejohnson@google.com> wrote: > On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >>> Sorry about that. I am right now trying to reproduce the >>> profiledbootstrap problem that H.J. reported, which is on >>> x86_64-unknown-linux-gnu where I had successfully done a >>> profiledbootstrap before my commit. >>> >>> Unfortunately after svn updating my client I am hitting an unrelated >>> build problem with my profiledbootstrap (it can't find the rule to >>> make libc++11convenience.la) which is slowing me down. >> >> I can reproduce it with revision 190982 on Fedora 18/x86-64 >> with 8-core. > > Ok, thanks. I am being blocked by an unrelated error: > > libtool: compile: > /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc > -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc > -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src > -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs > -B/usr/local/x86_64-unknown-linux-gnu/bin/ > -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem > /usr/local/x86_64-unknown-linux-gnu/include -isystem > /usr/local/x86_64-unknown-linux-gnu/sys-include > -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc > -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu > -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include > -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ > -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings > -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections > -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c > /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc > -fPIC -DPIC -o random.o > /tmp/ccOm0d5x.s: Assembler messages: > /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' > make[6]: *** [random.lo] Error 1 > > Looks like I am being hit by: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 > > I'm going to try backing out all the changes related to this bug to > see if I can make progress on the profiledbootstrap. > You can install the latest binutils and put it in your PATH. -- H.J.
On Wed, Sep 5, 2012 at 9:13 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Wed, Sep 5, 2012 at 8:50 AM, Teresa Johnson <tejohnson@google.com> wrote: >> On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >>>> Sorry about that. I am right now trying to reproduce the >>>> profiledbootstrap problem that H.J. reported, which is on >>>> x86_64-unknown-linux-gnu where I had successfully done a >>>> profiledbootstrap before my commit. >>>> >>>> Unfortunately after svn updating my client I am hitting an unrelated >>>> build problem with my profiledbootstrap (it can't find the rule to >>>> make libc++11convenience.la) which is slowing me down. >>> >>> I can reproduce it with revision 190982 on Fedora 18/x86-64 >>> with 8-core. >> >> Ok, thanks. I am being blocked by an unrelated error: >> >> libtool: compile: >> /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc >> -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc >> -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src >> -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs >> -B/usr/local/x86_64-unknown-linux-gnu/bin/ >> -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem >> /usr/local/x86_64-unknown-linux-gnu/include -isystem >> /usr/local/x86_64-unknown-linux-gnu/sys-include >> -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc >> -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu >> -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include >> -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ >> -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings >> -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections >> -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c >> /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc >> -fPIC -DPIC -o random.o >> /tmp/ccOm0d5x.s: Assembler messages: >> /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' >> make[6]: *** [random.lo] Error 1 >> >> Looks like I am being hit by: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 >> >> I'm going to try backing out all the changes related to this bug to >> see if I can make progress on the profiledbootstrap. >> > > You can install the latest binutils and put it in your PATH. Ok, I just backed out the libstdc++ patches which fixed it for now. I am trying a couple different profiledbootstraps. One with just --with-build-config=bootstrap-lto and one with "--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-languages=c,c++ --enable-gnu-indirect-function --with-fpmath=sse". Thanks, Teresa > > -- > H.J. -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413