[PATCH] Add counter histogram to fdo summary

Created:    12 years, 8 months ago by tejohnson
Modified:   10 years, 7 months ago
CC:         hubicka_ucw.cz, davidxl, gcc-patches_gcc.gnu.org
Base URL:   svn+ssh://gcc.gnu.org/svn/gcc/trunk/
Visibility: Public

Patch Set 1 / Patch Set 2: [PATCH] Add counter histogram to fdo summary

Total messages: 40
This is a revision of my earlier patch to add working set information to the
gcov program summary, based on discussions between Honza, David and me, and
approved with changes by Honza in this earlier thread on the first patch:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01412.html
I plan to commit this in a couple of days unless more revisions are
requested.

Changes from the earlier patch are the removal of the loop unroller changes
utilizing the new summary info, feeding back an array of working set
information for different cutoffs, and using a histogram to compute the
working set information to avoid expensive copying and qsorting.

Patch description:

Patch to add new summary information on counter working set sizes to gcov
profiles. The program summary now includes an array of working set data,
where each data point corresponds to a percentage of the total sum_all of
the counters and includes the number of hot counters and the minimum
counter value included in that working set. By default, there are 128
working set data points included in the summary, but this is controlled at
runtime with the GCOV_WORKING_SET_ENTRIES environment variable. The first
GCOV_WORKING_SET_ENTRIES - 1 entries correspond to percentages
100.0/GCOV_WORKING_SET_ENTRIES through
100.0/GCOV_WORKING_SET_ENTRIES * (GCOV_WORKING_SET_ENTRIES - 1) of sum_all.
The last entry always corresponds to roughly 99.9% of sum_all. A log2-based
histogram is used to compute the working set information, avoiding the need
to copy all counters into a large array and invoke qsort.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

2012-08-16  Teresa Johnson  <tejohnson@google.com>

	* libgcc/libgcov.c (histo_index): New function.
	(histogram_insert): Ditto.
	(gcov_compute_working_set): Ditto.
	(gcov_exit): Invoke gcov_compute_working_set, and perform merging
	of new working set summary info.
	* gcc/gcov.c (read_count_file): Invoke gcov_destroy_summary to
	clean up allocated working set memory.
	* gcc/gcov-io.c (gcov_write_summary): Include length of working
	set summary info in GCOV_TAG_SUMMARY_LENGTH, and write out
	working set summary.
	(gcov_read_summary): Read working set summary info.
	(gcov_destroy_summary): New function.
	* gcc/gcov-io.h (gcov_type_unsigned): New type.
	(gcov_destroy_summary): Declare.
	(GCOV_TAG_SUMMARY_LENGTH): Update to include working set summary.
	(struct gcov_working_set_info): New structure.
	(struct gcov_ctr_summary): Include working set summary.
	* gcc/coverage.c (htab_counts_entry_del): Free working sets.
	(read_counts_file): Read and merge in working set summary info.
	* gcc/gcov-dump.c (tag_summary): Dump out working set summary info.

Index: libgcc/libgcov.c
===================================================================
--- libgcc/libgcov.c	(revision 189420)
+++ libgcc/libgcov.c	(working copy)
@@ -276,6 +276,252 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned
   return 1;
 }
 
+/* Structure used for each bucket of the log2 histogram of counter
+   values.  */
+
+typedef struct
+{
+  /* The total number of BBs whose profile count falls within the bucket.  */
+  gcov_unsigned_t count;
+  /* The minimum profile count placed in this bucket.  */
+  gcov_type min_value;
+  /* This is the cumulative value of the profile counts in this histogram
+     bucket.  */
+  gcov_type cum_value;
+} bucket_type;
+
+/* For a log2 scale histogram with each range split into 4
+   linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2
+   ranges since the lowest 2 log2 values share the lowest 4 linear
+   sub-range (values 0 - 3).  */
+
+#define HISTOGRAM_SIZE ((GCOV_TYPE_SIZE - 1) * 4)
+
+/* Determine the index into histogram for VALUE.  */
+
+static unsigned
+histo_index (gcov_type value)
+{
+  gcov_type_unsigned v = (gcov_type_unsigned)value;
+  unsigned r = 0;
+  unsigned prev2bits = 0;
+
+  /* Find index into log2 scale histogram, where each of the log2
+     sized buckets is divided into 4 linear sub-buckets for better
+     focus in the higher buckets.  */
+
+  /* Find the place of the most-significant bit set.  */
+  if (v > 0)
+    r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v);
+
+  /* If at most the 2 least significant bits are set (value is
+     0 - 3) then that value is our index into the lowest set of
+     four buckets.  */
+  if (r < 2)
+    return (unsigned)value;
+
+  gcc_assert (r < 64);
+
+  /* Find the two next most significant bits to determine which
+     of the four linear sub-buckets to select.  */
+  prev2bits = (v >> (r - 2)) & 0x3;
+  /* Finally, compose the final bucket index from the log2 index and
+     the next 2 bits.  The minimum r value at this point is 2 since we
+     returned above if r was less than 2, so the minimum bucket at this
+     point is 4.  */
+  return (r - 1) * 4 + prev2bits;
+}
+
+/* Insert counter VALUE into HISTOGRAM.  */
+
+static void
+histogram_insert (bucket_type *histogram, gcov_type value)
+{
+  unsigned i;
+
+  i = histo_index (value);
+  gcc_assert (i < HISTOGRAM_SIZE);
+
+  histogram[i].count++;
+  histogram[i].cum_value += value;
+  if (value < histogram[i].min_value)
+    histogram[i].min_value = value;
+}
+
+/* Determines the working set statistics to place in the summary SUM.  This
+   is an array of information for a range of percentages of the total
+   execution count (sum_all), and includes the number of counters required
+   to cover that working set percentage and the minimum counter value in
+   that working set.  */
+
+static void
+gcov_compute_working_set (struct gcov_summary *sum)
+{
+  struct gcov_info *gi_ptr;
+  const struct gcov_fn_info *gfi_ptr;
+  const struct gcov_ctr_info *ci_ptr;
+  struct gcov_ctr_summary *cs_ptr;
+  unsigned t_ix, f_ix, ws_ix, ctr_info_ix, ix;
+  gcov_unsigned_t c_num, count;
+  int h_ix;
+  bucket_type histogram[HISTOGRAM_SIZE];
+  gcov_type *working_set_cum_values;
+  gcov_type ws_cum_hotness_incr;
+  gcov_type cum, tmp_cum;
+  char *ws_entries_str;
+  unsigned num_ws_entries;
+
+  /* This currently only applies to arc counters.  */
+  t_ix = GCOV_COUNTER_ARCS;
+
+  /* First check if there are any counts recorded for this counter.  */
+  cs_ptr = &(sum->ctrs[t_ix]);
+  if (!cs_ptr->num)
+    return;
+
+  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
+    {
+      histogram[h_ix].count = 0;
+      histogram[h_ix].min_value = cs_ptr->run_max;
+      histogram[h_ix].cum_value = 0;
+    }
+
+#define MAX_WS_SUMMARY_ENTRIES 128
+  ws_entries_str = getenv ("GCOV_WORKING_SET_ENTRIES");
+  if (ws_entries_str && strlen (ws_entries_str))
+    {
+      num_ws_entries = atoi (ws_entries_str);
+      /* Limit the summary information to a reasonable amount.
+         The maximum value gives statistics for more than every 1%
+         of the sum_all, which should be sufficient granularity for
+         optimizations.  It also ensures that the last entry of 99.9%
+         is the largest value.  */
+      if (num_ws_entries > MAX_WS_SUMMARY_ENTRIES)
+        num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+    }
+  else
+    num_ws_entries = MAX_WS_SUMMARY_ENTRIES;
+
+  cs_ptr->working_set_count = num_ws_entries;
+
+  if (! num_ws_entries)
+    return;
+
+  /* Next fill in an array of the cumulative hotness values corresponding
+     to each working set summary entry we are going to compute below.  */
+  working_set_cum_values
+      = (gcov_type *) malloc (sizeof (gcov_type) * num_ws_entries);
+
+  /* Compute the amount of sum_all that the cumulative hotness grows
+     by in each successive working set entry, which depends on the
+     number of working set entries requested.  By default num_ws_entries
+     is a power of two so that the divide becomes a shift.  */
+  ws_cum_hotness_incr = cs_ptr->sum_all / num_ws_entries;
+
+  /* Skip 0% statistics, which can be extrapolated from the
+     rest of the summary data.  */
+  cum = ws_cum_hotness_incr;
+  for (ws_ix = 0; ws_ix < num_ws_entries; ws_ix++, cum += ws_cum_hotness_incr)
+    working_set_cum_values[ws_ix] = cum;
+  /* The last summary entry is reserved for (roughly) 99.9% of the
+     working set.  Divide by 1024 so it becomes a shift, which gives
+     almost exactly 99.9%.  */
+  working_set_cum_values[num_ws_entries - 1]
+      = cs_ptr->sum_all - cs_ptr->sum_all / 1024;
+
+  /* Next, walk through all the per-object structures and record each of
+     the count values in histogram.  */
+  for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
+    {
+      if (!gi_ptr->merge[t_ix])
+        continue;
+
+      /* Find the appropriate index into the gcov_ctr_info array
+         for the counter we are currently working on based on the
+         existence of the merge function pointer for this object.  */
+      for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++)
+        {
+          if (gi_ptr->merge[ix])
+            ctr_info_ix++;
+        }
+      for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++)
+        {
+          gfi_ptr = gi_ptr->functions[f_ix];
+
+          if (!gfi_ptr || gfi_ptr->key != gi_ptr)
+            continue;
+
+          ci_ptr = &gfi_ptr->ctrs[ctr_info_ix];
+          for (ix = 0; ix < ci_ptr->num; ix++)
+            histogram_insert (histogram, ci_ptr->values[ix]);
+        }
+    }
+
+  /* Next, walk through the histogram in descending order of hotness
+     and compute the statistics for the working set summary array.
+     As histogram entries are accumulated, we check to see which
+     working set entries have had their expected cum_value reached
+     and fill them in, walking the working set entries in increasing
+     size of cum_value.  */
+  cs_ptr->working_sets = (gcov_ws_info_t *) malloc (
+      num_ws_entries * sizeof (gcov_ws_info_t));
+  ws_ix = 0;  /* The current entry into the working set array.  */
+  cum = 0;    /* The current accumulated counter sum.  */
+  count = 0;  /* The current accumulated count of block counters.  */
+  for (h_ix = HISTOGRAM_SIZE - 1; h_ix > 0 && ws_ix < num_ws_entries; h_ix--)
+    {
+      /* If we haven't reached the required cumulative counter value for
+         the current working set percentage, simply accumulate this
+         histogram entry into the running sums and continue to the next
+         histogram entry.  */
+      if (cum + histogram[h_ix].cum_value < working_set_cum_values[ws_ix])
+        {
+          cum += histogram[h_ix].cum_value;
+          count += histogram[h_ix].count;
+          continue;
+        }
+
+      /* If adding the current histogram entry's cumulative counter value
+         causes us to exceed the current working set size, then estimate
+         how many of this histogram entry's counter values are required to
+         reach the working set size, and fill in working set entries
+         as we reach their expected cumulative value.  */
+      for (c_num = 0, tmp_cum = cum;
+           c_num < histogram[h_ix].count && ws_ix < num_ws_entries;
+           c_num++)
+        {
+          count++;
+          /* If we haven't reached the last histogram entry counter, add
+             in the minimum value again.  This will underestimate the
+             cumulative sum so far, because many of the counter values in
+             this entry may have been larger than the minimum.  We could add
+             in the average value every time, but that would require an
+             expensive divide operation.  */
+          if (c_num + 1 < histogram[h_ix].count)
+            tmp_cum += histogram[h_ix].min_value;
+          /* If we have reached the last histogram entry counter, then add
+             in the entire cumulative value.  */
+          else
+            tmp_cum = cum + histogram[h_ix].cum_value;
+
+          /* Next walk through successive working set entries and fill in
+             the statistics for any whose size we have reached by
+             accumulating this histogram counter.  */
+          while (tmp_cum >= working_set_cum_values[ws_ix]
+                 && ws_ix < num_ws_entries)
+            {
+              cs_ptr->working_sets[ws_ix].num_counters = count;
+              cs_ptr->working_sets[ws_ix].min_bb_counter
+                  = histogram[h_ix].min_value;
+              ws_ix++;
+            }
+        }
+      /* Finally, update the running cumulative value since we were
+         using a temporary above.  */
+      cum += histogram[h_ix].cum_value;
+    }
+  gcc_assert (ws_ix == num_ws_entries);
+  free (working_set_cum_values);
+}
+
 /* Dump the coverage counts.  We merge with existing counts when
    possible, to avoid growing the .da files ad infinitum.  We use this
    program's checksum to make sure we only accumulate whole program
@@ -347,6 +593,7 @@ gcov_exit (void)
 	    }
 	}
     }
+  gcov_compute_working_set (&this_prg);
 
   {
     /* Check if the level of dirs to strip off specified.  */
@@ -389,7 +636,7 @@ gcov_exit (void)
   /* Now merge each file.  */
   for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next)
     {
-      unsigned n_counts;
+      unsigned n_counts, ws_cnt;
       struct gcov_summary prg; /* summary for this object over all program.  */
       struct gcov_ctr_summary *cs_prg, *cs_tprg, *cs_all;
@@ -482,8 +729,6 @@ gcov_exit (void)
 		f_ix--;
 	      length = gcov_read_unsigned ();
-	      if (length != GCOV_TAG_SUMMARY_LENGTH)
-		goto read_mismatch;
 	      gcov_read_summary (&tmp);
 	      if ((error = gcov_is_error ()))
 		goto read_error;
@@ -597,8 +842,35 @@ gcov_exit (void)
 
 	  if (gi_ptr->merge[t_ix])
 	    {
+	      ws_cnt = cs_tprg->working_set_count;
 	      if (!cs_prg->runs++)
-		cs_prg->num = cs_tprg->num;
+		{
+		  cs_prg->num = cs_tprg->num;
+		  /* Allocate the working set array for the merged
+		     summary.  */
+		  if (ws_cnt)
+		    {
+		      cs_prg->working_set_count = ws_cnt;
+		      cs_prg->working_sets = (gcov_ws_info_t *) malloc (
+			  ws_cnt * sizeof (gcov_ws_info_t));
+		    }
+		}
+	      else if (cs_prg->num != cs_tprg->num
+		       || ws_cnt != cs_prg->working_set_count)
+		goto read_mismatch;
+	      /* Copy over this run's working set information if either this
+		 is the first run, the total size of the profile (sum_all) is
+		 much (50%) larger for this run (need to check this before
+		 updating cs_prg->sum_all below), or this run has a larger
+		 working set in the largest (99.9% of sum_all) bucket.  */
+	      if (ws_cnt
+		  && (cs_prg->runs == 1
+		      || (cs_tprg->sum_all
+			  > (cs_prg->sum_all + cs_prg->sum_all / 2))
+		      || (cs_tprg->working_sets[ws_cnt - 1].num_counters
+			  > cs_prg->working_sets[ws_cnt - 1].num_counters)))
+		memcpy (cs_prg->working_sets,
+			cs_tprg->working_sets,
+			ws_cnt * sizeof (gcov_ws_info_t));
 	      cs_prg->sum_all += cs_tprg->sum_all;
 	      if (cs_prg->run_max < cs_tprg->run_max)
 		cs_prg->run_max = cs_tprg->run_max;
@@ -696,7 +968,11 @@ gcov_exit (void)
 		   "profiling:%s:Overflow writing\n" :
 		   "profiling:%s:Error writing\n",
 		   gi_filename);
+
+      gcov_destroy_summary (&prg);
     }
+
+  gcov_destroy_summary (&this_prg);
 }
 
 /* Reset all counters to zero.  */
Index: gcc/gcov.c
===================================================================
--- gcc/gcov.c	(revision 189924)
+++ gcc/gcov.c	(working copy)
@@ -1264,6 +1264,7 @@ read_count_file (function_t *fns)
 	  struct gcov_summary summary;
 	  gcov_read_summary (&summary);
 	  object_runs += summary.ctrs[GCOV_COUNTER_ARCS].runs;
+	  gcov_destroy_summary (&summary);
 	  program_count++;
 	}
       else if (tag == GCOV_TAG_FUNCTION && !length)
Index: gcc/gcov-io.c
===================================================================
--- gcc/gcov-io.c	(revision 189924)
+++ gcc/gcov-io.c	(working copy)
@@ -368,10 +368,13 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u
 GCOV_LINKAGE void
 gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix, ws_cnt;
   const struct gcov_ctr_summary *csum;
 
-  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH);
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE, ws_cnt = 0;
+       ix--; csum++)
+    ws_cnt += csum->working_set_count;
+  gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH (ws_cnt));
   gcov_write_unsigned (summary->checksum);
   for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
     {
@@ -380,6 +383,12 @@ gcov_write_summary (gcov_unsigned_t tag, const str
       gcov_write_counter (csum->sum_all);
       gcov_write_counter (csum->run_max);
       gcov_write_counter (csum->sum_max);
+      gcov_write_unsigned (csum->working_set_count);
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  gcov_write_unsigned (csum->working_sets[ws_ix].num_counters);
+	  gcov_write_counter (csum->working_sets[ws_ix].min_bb_counter);
+	}
     }
 }
 #endif /* IN_LIBGCOV */
@@ -488,7 +497,7 @@ gcov_read_string (void)
 GCOV_LINKAGE void
 gcov_read_summary (struct gcov_summary *summary)
 {
-  unsigned ix;
+  unsigned ix, ws_ix;
   struct gcov_ctr_summary *csum;
 
   summary->checksum = gcov_read_unsigned ();
@@ -499,9 +508,35 @@ gcov_read_summary (struct gcov_summary *summary)
       csum->sum_all = gcov_read_counter ();
       csum->run_max = gcov_read_counter ();
       csum->sum_max = gcov_read_counter ();
+      csum->working_set_count = gcov_read_unsigned ();
+      csum->working_sets
+#if IN_LIBGCOV
+	  = (gcov_ws_info_t *) malloc (csum->working_set_count *
+#else
+	  = (gcov_ws_info_t *) xmalloc (csum->working_set_count *
+#endif
+				       sizeof (gcov_ws_info_t));
+      for (ws_ix = 0; ws_ix < csum->working_set_count; ws_ix++)
+	{
+	  csum->working_sets[ws_ix].num_counters = gcov_read_unsigned ();
+	  csum->working_sets[ws_ix].min_bb_counter = gcov_read_counter ();
+	}
     }
 }
 
+GCOV_LINKAGE void
+gcov_destroy_summary (struct gcov_summary *summary)
+{
+  unsigned ix;
+  struct gcov_ctr_summary *csum;
+
+  for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++)
+    {
+      if (csum->working_set_count)
+	free (csum->working_sets);
+    }
+}
+
 #if !IN_LIBGCOV
 /* Reset to a known position.  BASE should have been obtained from
    gcov_position, LENGTH should be a record length.  */
Index: gcc/gcov-io.h
===================================================================
--- gcc/gcov-io.h	(revision 189924)
+++ gcc/gcov-io.h	(working copy)
@@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect
 	counts: header int64:count*
 	summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE
 	count-summary:	int32:num int32:runs int64:sum
-			int64:max int64:sum_max
+			int64:max int64:sum_max working-set-summary
+	working-set-summary: int32:count working-sets*
+	working-sets: int32:num int64:min
 
    The ANNOUNCE_FUNCTION record is the same as that in the note file,
    but without the source location.  The COUNTS gives the
@@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (SI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (DI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #endif
 #else
 #if BITS_PER_UNIT == 16
@@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m
 typedef unsigned gcov_position_t __attribute__ ((mode (HI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (SI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #endif
 #else
 typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI)));
 typedef unsigned gcov_position_t __attribute__ ((mode (QI)));
 #if LONG_LONG_TYPE_SIZE > 32
 typedef signed gcov_type __attribute__ ((mode (HI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI)));
 #else
 typedef signed gcov_type __attribute__ ((mode (QI)));
+typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI)));
 #endif
 #endif
 #endif
@@ -213,8 +221,6 @@ typedef HOST_WIDEST_INT gcov_type;
 #if IN_GCOV > 0
 #include <sys/types.h>
 #endif
-#else /*!IN_GCOV */
-#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
 #endif
 
 #if defined (HOST_HAS_F_SETLKW)
@@ -225,6 +231,8 @@ typedef HOST_WIDEST_INT gcov_type;
 
 #endif /* !IN_LIBGCOV */
 
+#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32)
+
 /* In gcov we want function linkage to be static.  In the compiler we want
    it extern, so that they can be accessed from elsewhere.  In libgcov we
    need these functions to be extern, so prefix them with __gcov.  In
@@ -248,6 +256,7 @@ typedef HOST_WIDEST_INT gcov_type;
 #define gcov_read_unsigned __gcov_read_unsigned
 #define gcov_read_counter __gcov_read_counter
 #define gcov_read_summary __gcov_read_summary
+#define gcov_destroy_summary __gcov_destroy_summary
 
 /* Poison these, so they don't accidentally slip in.  */
 #pragma GCC poison gcov_write_string gcov_write_tag gcov_write_length
@@ -309,9 +318,10 @@ typedef HOST_WIDEST_INT gcov_type;
 #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2)
 #define GCOV_TAG_OBJECT_SUMMARY  ((gcov_unsigned_t)0xa1000000) /* Obsolete */
 #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000)
-#define GCOV_TAG_SUMMARY_LENGTH  \
-	(1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2))
+#define GCOV_TAG_SUMMARY_LENGTH(NUM)  \
+	(1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 3)
+
 /* Counters that are collected.  */
 #define GCOV_COUNTER_ARCS	0  /* Arc transitions.  */
 #define GCOV_COUNTERS_SUMMABLE	1  /* Counters which can be
@@ -389,6 +399,16 @@ typedef HOST_WIDEST_INT gcov_type;
 
 /* Structured records.  */
 
+/* Working set size statistics for a given percentage of the entire
+   profile (sum_all).  */
+typedef struct gcov_working_set_info
+{
+  /* Number of hot counters included in this working set.  */
+  gcov_unsigned_t num_counters;
+  /* Smallest counter included in this working set.  */
+  gcov_type min_bb_counter;
+} gcov_ws_info_t;
+
 /* Cumulative counter data.  */
 struct gcov_ctr_summary
 {
@@ -397,6 +417,8 @@ struct gcov_ctr_summary
   gcov_type sum_all;		/* sum of all counters accumulated.  */
   gcov_type run_max;		/* maximum value on a single run.  */
   gcov_type sum_max;		/* sum of individual run max values.  */
+  gcov_unsigned_t working_set_count; /* number of working set infos.  */
+  gcov_ws_info_t *working_sets;	/* array of working set info.  */
 };
 
 /* Object & program summary record.  */
@@ -552,6 +574,7 @@ static int gcov_is_error (void);
 GCOV_LINKAGE gcov_unsigned_t gcov_read_unsigned (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE gcov_type gcov_read_counter (void) ATTRIBUTE_HIDDEN;
 GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
+GCOV_LINKAGE void gcov_destroy_summary (struct gcov_summary *) ATTRIBUTE_HIDDEN;
 
 #if IN_LIBGCOV
 /* Available only in libgcov */
Index: gcc/coverage.c
===================================================================
--- gcc/coverage.c	(revision 189924)
+++ gcc/coverage.c	(working copy)
@@ -172,6 +172,8 @@ htab_counts_entry_del (void *of)
   counts_entry_t *const entry = (counts_entry_t *) of;
 
   free (entry->counts);
+  if (entry->summary.working_sets)
+    free (entry->summary.working_sets);
   free (entry);
 }
 
@@ -247,12 +249,40 @@ read_counts_file (void)
 	  gcov_read_summary (&sum);
 	  for (ix = 0; ix != GCOV_COUNTERS_SUMMABLE; ix++)
 	    {
+	      gcov_unsigned_t ws_cnt = sum.ctrs[ix].working_set_count;
+	      /* Allocate the working set array for a new summary.  */
+	      if (ws_cnt && new_summary)
+		{
+		  summary.ctrs[ix].working_set_count = ws_cnt;
+		  summary.ctrs[ix].working_sets
+		      = (gcov_ws_info_t *) xmalloc (ws_cnt
+						    * sizeof (gcov_ws_info_t));
+		}
+	      /* Copy over this summary's working set information if either
+		 summary is new, the total size of the profile (sum_all) is
+		 much (50%) larger for this summary (need to check this
+		 before updating sum_all below), or this summary has a larger
+		 working set in the largest (99.9% of sum_all) bucket.  */
+	      if (ws_cnt
+		  && ws_cnt == summary.ctrs[ix].working_set_count
+		  && (new_summary
+		      || (sum.ctrs[ix].sum_all
+			  > (summary.ctrs[ix].sum_all
+			     + summary.ctrs[ix].sum_all / 2))
+		      || (sum.ctrs[ix].working_sets[ws_cnt - 1].num_counters
+			  > summary.ctrs[ix].working_sets[ws_cnt - 1]
+			      .num_counters)))
+		{
+		  memcpy (summary.ctrs[ix].working_sets,
+			  sum.ctrs[ix].working_sets,
+			  ws_cnt * sizeof (gcov_ws_info_t));
+		}
 	      summary.ctrs[ix].runs += sum.ctrs[ix].runs;
 	      summary.ctrs[ix].sum_all += sum.ctrs[ix].sum_all;
 	      if (summary.ctrs[ix].run_max < sum.ctrs[ix].run_max)
 		summary.ctrs[ix].run_max = sum.ctrs[ix].run_max;
 	      summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max;
 	    }
+	  gcov_destroy_summary (&sum);
 	  new_summary = 0;
 	}
       else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident)
Index: gcc/gcov-dump.c
===================================================================
--- gcc/gcov-dump.c	(revision 189924)
+++ gcc/gcov-dump.c	(working copy)
@@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	     unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED)
 {
   struct gcov_summary summary;
-  unsigned ix;
+  unsigned ix, jx, pctinc, pct;
+  gcov_ws_info_t *ws_info;
 
   gcov_read_summary (&summary);
   printf (" checksum=0x%08x", summary.checksum);
@@ -465,5 +466,27 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED
 	      (HOST_WIDEST_INT)summary.ctrs[ix].run_max);
       printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC,
	      (HOST_WIDEST_INT)summary.ctrs[ix].sum_max);
+      if (! summary.ctrs[ix].working_set_count)
+	continue;
+      printf ("\n");
+      print_prefix (filename, 0, 0);
+      printf ("\t\tworking set statistics:");
+      /* Multiply the percentage by 100 to avoid float.  */
+      pctinc = 100 * 100 / summary.ctrs[ix].working_set_count;
+      for (jx = 0, pct = pctinc; jx < summary.ctrs[ix].working_set_count;
+	   jx++, pct += pctinc)
+	{
+	  if (jx == summary.ctrs[ix].working_set_count - 1)
+	    pct = 9999;
+	  ws_info = &summary.ctrs[ix].working_sets[jx];
+	  printf ("\n");
+	  print_prefix (filename, 0, 0);
+	  /* Print out the percentage using integer arithmetic to avoid
+	     float.  */
+	  printf ("\t\t%u.%u%%: num counts=%u, min counter="
		  HOST_WIDEST_INT_PRINT_DEC,
+		  pct / 100, pct % 100,
+		  ws_info->num_counters,
+		  (HOST_WIDEST_INT)ws_info->min_bb_counter);
+	}
     }
+  gcov_destroy_summary (&summary);
 }

--
This patch is available for review at http://codereview.appspot.com/6465057
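[Editorial note: to make the bucketing scheme concrete, here is a standalone
sketch, not part of the patch, that replicates the histo_index logic for
64-bit counters and prints the bucket chosen for a few sample values. The
TYPE_SIZE constant and the main driver are illustrative additions, and
__builtin_clzll assumes a GCC-compatible compiler.]

#include <stdio.h>

#define TYPE_SIZE 64
#define HISTOGRAM_SIZE ((TYPE_SIZE - 1) * 4)

static unsigned
histo_index (unsigned long long v)
{
  unsigned r = 0;

  /* Position of the most-significant set bit.  */
  if (v > 0)
    r = TYPE_SIZE - 1 - __builtin_clzll (v);

  /* Values 0-3 index the lowest four buckets directly.  */
  if (r < 2)
    return (unsigned) v;

  /* The two bits below the MSB select one of four linear sub-buckets
     within the log2 range.  */
  return (r - 1) * 4 + ((v >> (r - 2)) & 0x3);
}

int
main (void)
{
  unsigned long long samples[] = { 0, 3, 4, 7, 8, 11, 100, 1000000 };
  unsigned i;

  for (i = 0; i < sizeof (samples) / sizeof (*samples); i++)
    printf ("value %llu -> bucket %u of %d\n",
            samples[i], histo_index (samples[i]), HISTOGRAM_SIZE);
  return 0;
}

Each power-of-two range contributes four buckets, so nearby hot counter
values still land in distinct buckets while the table stays a fixed
(64 - 1) * 4 = 252 entries.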
> +	      {
> +		cs_prg->num = cs_tprg->num;
> +		/* Allocate the working set array for the merged
> +		   summary.  */
> +		if (ws_cnt)
> +		  {
> +		    cs_prg->working_set_count = ws_cnt;
> +		    cs_prg->working_sets = (gcov_ws_info_t *) malloc (
> +			ws_cnt * sizeof (gcov_ws_info_t));
> +		  }
> +	      }
> +	    else if (cs_prg->num != cs_tprg->num
> +		     || ws_cnt != cs_prg->working_set_count)
> +	      goto read_mismatch;
> +	    /* Copy over this run's working set information if either this
> +	       is the first run, the total size of the profile (sum_all) is
> +	       much (50%) larger for this run (need to check this before
> +	       updating cs_prg->sum_all below), or this run has a larger
> +	       working set in the largest (99.9% of sum_all) bucket.  */
> +	    if (ws_cnt
> +		&& (cs_prg->runs == 1
> +		    || (cs_tprg->sum_all
> +			> (cs_prg->sum_all + cs_prg->sum_all / 2))
> +		    || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> +			> cs_prg->working_sets[ws_cnt - 1].num_counters)))
> +	      memcpy (cs_prg->working_sets,
> +		      cs_tprg->working_sets,
> +		      ws_cnt * sizeof (gcov_ws_info_t));
> 	    cs_prg->sum_all += cs_tprg->sum_all;

Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
hundred times, this will end up really inaccurate, since it will probably
just store the histogram of the insn-attrtab compilation, right?

Why don't you simply write the histogram into the gcov file and avoid
merging the values here (i.e. do the accumulation loop in GCC instead of
within libgcov)?

By default you are streaming 128 values, which is the same as is needed to
stream the histogram. I suppose we can have an environment variable to
reduce the histogram size - I guess in smaller setups a smaller histogram
will run just fine...

Honza
On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > +	    /* Copy over this run's working set information if either this
> > +	       is the first run, the total size of the profile (sum_all) is
> > +	       much (50%) larger for this run (need to check this before
> > +	       updating cs_prg->sum_all below), or this run has a larger
> > +	       working set in the largest (99.9% of sum_all) bucket.  */
> > +	    if (ws_cnt
> > +		&& (cs_prg->runs == 1
> > +		    || (cs_tprg->sum_all
> > +			> (cs_prg->sum_all + cs_prg->sum_all / 2))
> > +		    || (cs_tprg->working_sets[ws_cnt - 1].num_counters
> > +			> cs_prg->working_sets[ws_cnt - 1].num_counters)))
> > +	      memcpy (cs_prg->working_sets,
> > +		      cs_tprg->working_sets,
> > +		      ws_cnt * sizeof (gcov_ws_info_t));
> > 	    cs_prg->sum_all += cs_tprg->sum_all;
>
> Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
> hundred times, this will end up really inaccurate, since it will probably
> just store the histogram of the insn-attrtab compilation, right?

Well, it should store the largest working set in BBs, or one that came
from a much longer run. But I see the issue you are pointing out. The
num_counters (the number of hot bbs) should be reasonable, as the
total number of BBs is the same between all runs, and I want to show
the largest or more dominant working set in terms of the number of hot
bbs. But the min_bb_counter will become more and more inaccurate as
the number of runs increases, since the counter values are
accumulated.

I typed up a detailed email below on why getting this right would be
difficult, but then I realized there might be a fairly simple accurate
solution, which I'll describe first:

The only way I see to do this completely accurately is to take two
passes through the existing gcda files that we are merging into, one
to read in all the counter values and merge them into all the counter
values in the arrays from the current run (after which we can do the
histogramming/working set computation accurately from the merged
counters), and the second to rewrite them. In libgcov this doesn't
seem like it would be too difficult to do, although it is a little bit
of restructuring of the main merging loop and needs some special
handling for buffered functions (which could increase the memory
footprint a bit if there are many of these, since they will all need to
be buffered across the iteration over all the gcda files).

The summary merging in coverage.c confuses me a bit, as it seems to be
handling the case when there are multiple program summaries in a
single gcda file. When would this happen? It looks like the merge
handling in libgcov should always produce a single program summary per
gcda file.

> Why don't you simply write the histogram into the gcov file and avoid
> merging the values here (i.e. do the accumulation loop in GCC instead of
> within libgcov)?

That doesn't completely solve the problem, unfortunately. The reason
is that you don't know which histogram entry contains a particular
block each run, and therefore you don't know how to compute the new
combined histogram value and index for that bb counter. For example, a
particular histogram index may have 5 counters (bbs) in it for one run
and 2 counters (bbs) in it for a second run, so the question is how to
compute the new entry for all of those bb counters, as the 5 bbs from
the first run may or may not be a superset of the 2 from the second
run. You could assume that the bbs have the same relative order of
hotness between runs, and combine the bb counters accordingly, but
there is still some trickiness because you have to apportion the
cumulative counter sum stored in the histogram entry between new
entries. For example, assume the highest indexed non-zero entries (the
histogram buckets containing the largest counter values) in the two
incoming histograms are:

histogram 1:

index 100: 4 counters, cumulative value X, min counter value minx
...

histogram 2:

index 100: 2 counters, cumulative value Y, min counter value miny
index 99: 3 counters, cumulative value W, min counter value minw
...

To merge, you could assume something like 2 counters with a new
cumulative value (Y + X*2/4), and new min counter value minx+miny,
that go into the merged histogram with the index corresponding to
counter value minx+miny. And then 2 counters have a new cumulative
value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
into the merged histogram with index corresponding to counter value
minw+minx. Etc... Not entirely accurate, although it might be a
reasonable approximation, but it also requires a number of division
operations during the merge in libgcov.

Another possibility, that might also provide a reasonable
approximation, would be to scale the min_bb_counter values in the
working sets by sum_all_merged/sum_all_orig, where sum_all_merged
is the new sum_all, and sum_all_orig is the sum_all from the summary
whose working_set was chosen to be propagated to the new merged
summary. This also requires some divisions at libgcov merge time,
unless we save the original sum_all along with the working set in the
summary and do the scaling at profile-use time.

> By default you are streaming 128 values, which is the same as is needed
> to stream the histogram. I suppose we can have an environment variable
> to reduce the histogram size - I guess in smaller setups a smaller
> histogram will run just fine...

It is a bit more, as the histogram will have 64*4 = 256 entries for
64-bit counters, and each entry contains 2 gcov_type counter values
and one unsigned int. The working set entries only have one gcov_type
counter and one unsigned. So it could be close to 4x.

What do you think?

Thanks,
Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
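[Editorial note: for illustration, a minimal C sketch of the bucket-merge
approximation Teresa describes, assuming counters within a bucket are paired
by rank across runs. The bucket type and merge_buckets function are
hypothetical, not part of the patch or libgcov.]

#include <stdio.h>

typedef struct
{
  unsigned count;       /* counters (bbs) in the bucket */
  long long cum_value;  /* sum of those counters */
  long long min_value;  /* smallest counter in the bucket */
} bucket;

/* Merge N counters taken from run-1 bucket B1 with N counters taken from
   run-2 bucket B2 into OUT; a caller would walk both histograms from
   hottest to coldest, pairing counters by rank.  */
static void
merge_buckets (bucket *out, const bucket *b1, const bucket *b2, unsigned n)
{
  out->count = n;
  /* Apportion each run's cumulative value by the fraction of its bucket
     consumed; these divisions are the merge-time cost noted above.  */
  out->cum_value = b1->cum_value * n / b1->count
                   + b2->cum_value * n / b2->count;
  /* Paired-by-rank counters sum, so the new minimum is minx + miny.  */
  out->min_value = b1->min_value + b2->min_value;
}

int
main (void)
{
  /* The index-100 buckets from the example above, with made-up numbers:
     X = 400, minx = 90 and Y = 250, miny = 120.  */
  bucket b1 = { 4, 400, 90 };
  bucket b2 = { 2, 250, 120 };
  bucket out;

  merge_buckets (&out, &b1, &b2, 2);
  /* Prints 2 counters, cum = Y + X*2/4 = 450, min = minx + miny = 210.  */
  printf ("merged: %u counters, cum=%lld, min=%lld\n",
          out.count, out.cum_value, out.min_value);
  return 0;
}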
On Sun, Aug 19, 2012 at 9:59 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Sat, Aug 18, 2012 at 1:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Hmm, when you run e.g. a GCC bootstrap, where the program is run a couple
>> hundred times, this will end up really inaccurate, since it will probably
>> just store the histogram of the insn-attrtab compilation, right?
>
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.
>
> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these, since they will all need to
> be buffered across the iteration over all the gcda files).
>

The current way of doing summary merging produces imprecise data (e.g.,
sum_max). It computes the current execution's summary first, and then
merges it with the previous run's summary data while the profile counters
are merged. The right way is to merge the profile counters first, and then
recompute the final summary based on the merged profile. The latter way
allows merging of more advanced summary data such as the histogram here.
It is also used in LIPO mode (for the dynamic callgraph build/merge --
kind of a summary itself).

However, I think changing the way the summary is merged should probably be
done in a different patch.

David

> The summary merging in coverage.c confuses me a bit, as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.
>
>> Why don't you simply write the histogram into the gcov file and avoid
>> merging the values here (i.e. do the accumulation loop in GCC instead
>> of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
>> By default you are streaming 128 values, which is the same as is needed
>> to stream the histogram. I suppose we can have an environment variable
>> to reduce the histogram size - I guess in smaller setups a smaller
>> histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?
>
> Thanks,
> Teresa
>
>> Honza
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
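[Editorial note: a rough, hypothetical sketch of the two-pass flow that
Teresa and David describe, i.e. merge the counters from all .gcda files
first and only then compute the summary. Every name below is an
illustrative placeholder, not a real libgcov entry point.]

#include <stdio.h>

static const char *gcda_files[] = { "a.gcda", "b.gcda" };  /* stand-ins */
#define N_FILES (sizeof (gcda_files) / sizeof (*gcda_files))

static void
read_and_merge_counters (const char *path)
{
  /* Pass 1: read PATH and add its on-disk counters into this run's
     in-memory counter arrays.  */
  printf ("pass 1: merging counters from %s\n", path);
}

static void
write_counters_and_summary (const char *path)
{
  /* Pass 2: rewrite PATH with the merged counters and the recomputed
     summary.  */
  printf ("pass 2: rewriting %s\n", path);
}

int
main (void)
{
  unsigned i;

  for (i = 0; i < N_FILES; i++)
    read_and_merge_counters (gcda_files[i]);

  /* Only now is the complete merged profile known, so histogram and
     working set summaries can be computed exactly rather than
     approximated at merge time.  */
  printf ("computing summary from merged counters\n");

  for (i = 0; i < N_FILES; i++)
    write_counters_and_summary (gcda_files[i]);
  return 0;
}

The cost, as noted in the thread, is holding counters (and any buffered
functions) in memory across the whole iteration, and a wider window for
races between concurrent profiled processes.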
> Well, it should store the largest working set in BBs, or one that came
> from a much longer run. But I see the issue you are pointing out. The
> num_counters (the number of hot bbs) should be reasonable, as the
> total number of BBs is the same between all runs, and I want to show
> the largest or more dominant working set in terms of the number of hot
> bbs. But the min_bb_counter will become more and more inaccurate as
> the number of runs increases, since the counter values are
> accumulated.

Yes, and that is the one that we plan to use to determine hot/cold
decisions on, right?

Note that there is no real 1-1 correspondence between BBs and counters.
For each function there should be num_edges-num_bbs+1 counters.
What do you plan to use BB counts for?

> I typed up a detailed email below on why getting this right would be
> difficult, but then I realized there might be a fairly simple accurate
> solution, which I'll describe first:
>
> The only way I see to do this completely accurately is to take two
> passes through the existing gcda files that we are merging into, one
> to read in all the counter values and merge them into all the counter
> values in the arrays from the current run (after which we can do the
> histogramming/working set computation accurately from the merged
> counters), and the second to rewrite them. In libgcov this doesn't
> seem like it would be too difficult to do, although it is a little bit
> of restructuring of the main merging loop and needs some special
> handling for buffered functions (which could increase the memory
> footprint a bit if there are many of these, since they will all need to
> be buffered across the iteration over all the gcda files).
>
> The summary merging in coverage.c confuses me a bit, as it seems to be
> handling the case when there are multiple program summaries in a
> single gcda file. When would this happen? It looks like the merge
> handling in libgcov should always produce a single program summary per
> gcda file.

This is something Nathan introduced years back. The idea was IMO to handle
objects linked into multiple binaries more accurately. I am not sure
if the code really works or worked at some point.

The idea, as I recall it, was to produce an overall checksum of all objects
and have different profile records for each combination.

This is not really useful for profile feedback, as you generate a single
object file for all uses, but it might become useful for LTO, where you
know into which binary you are linking. I am not really sure it is worth
all the infrastructure needed to support this though.

> > Why don't you simply write the histogram into the gcov file and avoid
> > merging the values here (i.e. do the accumulation loop in GCC instead
> > of within libgcov)?
>
> That doesn't completely solve the problem, unfortunately. The reason
> is that you don't know which histogram entry contains a particular
> block each run, and therefore you don't know how to compute the new
> combined histogram value and index for that bb counter. For example, a
> particular histogram index may have 5 counters (bbs) in it for one run
> and 2 counters (bbs) in it for a second run, so the question is how to
> compute the new entry for all of those bb counters, as the 5 bbs from
> the first run may or may not be a superset of the 2 from the second
> run. You could assume that the bbs have the same relative order of
> hotness between runs, and combine the bb counters accordingly, but
> there is still some trickiness because you have to apportion the
> cumulative counter sum stored in the histogram entry between new
> entries. For example, assume the highest indexed non-zero entries (the
> histogram buckets containing the largest counter values) in the two
> incoming histograms are:
>
> histogram 1:
>
> index 100: 4 counters, cumulative value X, min counter value minx
> ...
>
> histogram 2:
>
> index 100: 2 counters, cumulative value Y, min counter value miny
> index 99: 3 counters, cumulative value W, min counter value minw
> ...
>
> To merge, you could assume something like 2 counters with a new
> cumulative value (Y + X*2/4), and new min counter value minx+miny,
> that go into the merged histogram with the index corresponding to
> counter value minx+miny. And then 2 counters have a new cumulative
> value (W*2/3 + X*2/4) and new min counter value minx+minw, that go
> into the merged histogram with index corresponding to counter value
> minw+minx. Etc... Not entirely accurate, although it might be a
> reasonable approximation, but it also requires a number of division
> operations during the merge in libgcov.
>
> Another possibility, that might also provide a reasonable
> approximation, would be to scale the min_bb_counter values in the
> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
> is the new sum_all, and sum_all_orig is the sum_all from the summary
> whose working_set was chosen to be propagated to the new merged
> summary. This also requires some divisions at libgcov merge time,
> unless we save the original sum_all along with the working set in the
> summary and do the scaling at profile-use time.
>
> > By default you are streaming 128 values, which is the same as is
> > needed to stream the histogram. I suppose we can have an environment
> > variable to reduce the histogram size - I guess in smaller setups a
> > smaller histogram will run just fine...
>
> It is a bit more, as the histogram will have 64*4 = 256 entries for
> 64-bit counters, and each entry contains 2 gcov_type counter values
> and one unsigned int. The working set entries only have one gcov_type
> counter and one unsigned. So it could be close to 4x.
>
> What do you think?

So we have options:

1) ignore the problem that summaries become inaccurate with many train
runs, as we do now.

2) write the histogram only, do not track BB counts or approximate them by
scaling, and perhaps retire max_counter from the summary in favour of a
histogram estimate.

We will pay by more data being written (I think the histogram may actually
compress pretty well by skipping zero entries, don't know) and by getting
only an estimated max_count from the histogram, which still should be good
enough in practice (max_count is used only for the hot/cold decisions, and
those will be histogram based anyway).

3) do two-stage processing, reading data into memory first, then producing
summaries and writing them out.

2) seems appealing to me because it is simple, but there are limitations in
what it can handle.
3) solves the precision problems, but how do we handle the locking & races?
In a case like bootstrap, where GCCs are executed in parallel, we will end
up with random results if the files get modified by another profiled run of
GCC between read-in and write-out.

So I definitely prefer 2 or 3 over 1. David has experience with 3. How well
does it work for LIPO?

Honza

> Thanks,
> Teresa
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Well, it should store the largest working set in BBs, or one that came
>> from a much longer run. But I see the issue you are pointing out. The
>> num_counters (the number of hot bbs) should be reasonable, as the
>> total number of BBs is the same between all runs, and I want to show
>> the largest or more dominant working set in terms of the number of hot
>> bbs. But the min_bb_counter will become more and more inaccurate as
>> the number of runs increases, since the counter values are
>> accumulated.
>
> Yes, and that is the one that we plan to use to determine hot/cold
> decisions on, right?

Yes, so I agree it is important to do something to update this as
profiles are merged.

> Note that there is no real 1-1 correspondence between BBs and counters.
> For each function there should be num_edges-num_bbs+1 counters.
> What do you plan to use BB counts for?

True, although what I am using this for is to get an idea of the size of
the working set to help identify large icache footprints, in order to
control code-size-increasing optimizations such as unrolling. The idea is
that a huge number of hot counters indicates a huge number of hot bbs,
which in turn indicates a huge number of hot instructions and therefore
high icache pressure. We're using it in the google branch to control
unrolling successfully.

>> The summary merging in coverage.c confuses me a bit, as it seems to be
>> handling the case when there are multiple program summaries in a
>> single gcda file. When would this happen? It looks like the merge
>> handling in libgcov should always produce a single program summary per
>> gcda file.
>
> This is something Nathan introduced years back. The idea was IMO to
> handle objects linked into multiple binaries more accurately. I am not
> sure if the code really works or worked at some point.
>
> The idea, as I recall it, was to produce an overall checksum of all
> objects and have different profile records for each combination.
>
> This is not really useful for profile feedback, as you generate a single
> object file for all uses, but it might become useful for LTO, where you
> know into which binary you are linking. I am not really sure it is worth
> all the infrastructure needed to support this though.

Ok, so perhaps for the merging in coverage.c we can do something less
accurate, and worry more about the libgcov merging.

>> > Why don't you simply write the histogram into the gcov file and avoid
>> > merging the values here (i.e. do the accumulation loop in GCC instead
>> > of within libgcov)?
>>
>> That doesn't completely solve the problem, unfortunately. The reason
>> is that you don't know which histogram entry contains a particular
>> block each run, and therefore you don't know how to compute the new
>> combined histogram value and index for that bb counter.
>>
>> Another possibility, that might also provide a reasonable
>> approximation, would be to scale the min_bb_counter values in the
>> working sets by sum_all_merged/sum_all_orig, where sum_all_merged
>> is the new sum_all, and sum_all_orig is the sum_all from the summary
>> whose working_set was chosen to be propagated to the new merged
>> summary. This also requires some divisions at libgcov merge time,
>> unless we save the original sum_all along with the working set in the
>> summary and do the scaling at profile-use time.
>
> So we have options:
>
> 1) ignore the problem that summaries become inaccurate with many train
> runs, as we do now.

Or 1b) do some sum_all based scaling of the min_bb_counter values as I
describe a couple paragraphs above.

> 2) write the histogram only, do not track BB counts or approximate them
> by scaling, and perhaps retire max_counter from the summary in favour of
> a histogram estimate.

By max_counter do you mean sum_max?

> We will pay by more data being written (I think the histogram may
> actually compress pretty well by skipping zero entries, don't know) and
> by getting only an estimated max_count from the histogram, which still
> should be good enough in practice (max_count is used only for the
> hot/cold decisions, and those will be histogram based anyway).

Regarding the data being written, I think we can minimize this since the
histogram should be pretty sparse. So store only non-zero entries. The
histogram indices can be recomputed on read from the min counter value
stored in the entry.

> 3) do two-stage processing, reading data into memory first, then
> producing summaries and writing them out.
>
> 2) seems appealing to me because it is simple, but there are limitations
> in what it can handle.
> 3) solves the precision problems, but how do we handle the locking &
> races? In a case like bootstrap, where GCCs are executed in parallel, we
> will end up with random results if the files get modified by another
> profiled run of GCC between read-in and write-out.
>
> So I definitely prefer 2 or 3 over 1. David has experience with 3. How
> well does it work for LIPO?

I think the locking issue is reduced for LIPO because each gcda file is
read/merged/written in a single pass as it is on trunk, and the lipo
module info is written subsequently to a separate file. Although that
could certainly get a little stale if there are many simultaneous profile
runs merging the files.

To address this for the working set summary info, I think we would have to
maintain the current per-gcda-file read/merge/write for the counters, but
not write any working set summary info on the first pass. Then compute the
working set summary and rewrite all the gcda files with it. The working
set summary may get a little stale before it is written (just as in the
lipo case for the module info), although at least the individual counters
will not. Since the individual counters will get merged properly, there
may be a good chance that one of the final profile runs will compute an
accurate working set as it is finishing up.

If this approach seems feasible, then we could stick with the current
approach of emitting the working set array in the summary, mitigating it
somewhat by doing the sum_all based scaling of the counter values, then in
a follow-on patch restructure the merging code to delay the working set
computation as described above. Otherwise, either go with the approach of
scaling the counter values using the sum_all ratio, or encode/merge the
histograms instead.

Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
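[Editorial note: a minimal sketch, not in the patch, of the "option 1b"
scaling discussed above: rescale the stored min_bb_counter thresholds by
the growth of sum_all when a previously recorded working set is carried
over into a merged summary. The types mirror the patch's
gcov_type/gcov_ws_info_t; the function itself is hypothetical.]

typedef long long counter_t;     /* stands in for gcov_type */

typedef struct
{
  unsigned num_counters;         /* hot counters in the working set */
  counter_t min_bb_counter;      /* smallest counter included */
} ws_info_t;                     /* stands in for gcov_ws_info_t */

static void
scale_working_set (ws_info_t *ws, unsigned ws_cnt,
                   counter_t sum_all_orig, counter_t sum_all_merged)
{
  unsigned i;

  if (sum_all_orig <= 0)
    return;
  for (i = 0; i < ws_cnt; i++)
    /* One multiply and divide per entry; a real implementation would
       guard against overflow of the intermediate product, or defer the
       scaling to profile-use time by also storing sum_all_orig, as
       suggested above.  */
    ws[i].min_bb_counter = ws[i].min_bb_counter * sum_all_merged
                           / sum_all_orig;
}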
On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> The summary merging in coverage.c confuses me a bit as it seems to be >> handling the case when there are multiple program summaries in a >> single gcda file. When would this happen? It looks like the merge >> handling in libgcov should always produce a single program summary per >> gcda file. AFAIU it merges existing profile data with new profile data from a second (or third, or ...) trial run. Ciao! Steven
> > So I definitely preffer 2 or 3 over 1. David has experience with 3. How well does > it work for LIPO? > This (lack of locking, races) is not a new problem. There is no synchronization in libgcov for profile update/merge at either the thread or the process level. Thread-level data races lead to inconsistent counters, but can be mostly corrected under -fprofile-correction and smoothing. There is also a problem with false indirect call targets --- gcc has a mechanism to filter those out. Process-level synchronization problems can happen when two processes (running the instrumented binary) exit at the same time. The updated/merged counters from one process may be overwritten by another process -- this is true for both counter data and summary data. Solution 3) does not introduce any new problems. thanks, David > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> If this approach seems like it is feasible, then we could stick with > the current approach of emitting the working set array in the summary, > mitigating it somewhat by doing the sum_all based scaling of the > counter values, then in a follow on patch restructure the merging code > to delay the working set computation as described above. +1 David > Teresa > >> >> Honza >>> >>> Thanks, >>> Teresa >>> >>> > >>> > Honza >>> >>> >>> >>> >>> -- >>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 8:35 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote: > On Mon, Aug 20, 2012 at 11:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> The summary merging in coverage.c confuses me a bit as it seems to be >>> handling the case when there are multiple program summaries in a >>> single gcda file. When would this happen? It looks like the merge >>> handling in libgcov should always produce a single program summary per >>> gcda file. > > AFAIU it merges existing profile data with new profile data from a > second (or third, or ...) trial run. It looks like libgcov will always merge program summaries between different runs though. As far as I can tell, it will always either rewrite the whole gcda file with a single merged program summary, or abort the write if the merge was not successful. However, the comment above gcov_exit does talk about keeping program summaries separate for object files contained in different programs, which is what Honza was describing as the motivation for the coverage.c merging. But I don't see where multiple program summaries ever get written to the same gcda file - if the checksums are different it seems to abort the write. But the code in coverage.c is dealing with a single gcda file containing multiple program summaries. Is there code somewhere that will cause this to happen? Thanks, Teresa > > Ciao! > Steven -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 7:44 PM, Teresa Johnson <tejohnson@google.com> wrote: > But > the code in coverage.c is dealing with a single gcda file containing > multiple program summaries. Is there code somewhere that will cause > this to happen? Not that I know of, no. (There are enhancement requests for this, see e.g. PR47618). Ciao! Steven
Xinliang David Li <davidxl@google.com> writes: > > Process level synchronization problems can happen when two processes > (running the instrumented binary) exit at the same time. The > updated/merged counters from one process may be overwritten by another > process -- this is true for both counter data and summary data. > Solution 3) does not introduce any new problems. You could just use lockf() ? -Andi
I was mistaken here -- gcov_open actually uses locking via the fcntl interface -- so we do need to find a way to solve the summary update synchronization problem when it is separated out of the per-file update loop. David On Mon, Aug 20, 2012 at 12:03 PM, Andi Kleen <andi@firstfloor.org> wrote: > Xinliang David Li <davidxl@google.com> writes: >> >> Process level synchronization problems can happen when two processes >> (running the instrumented binary) exit at the same time. The >> updated/merged counters from one process may be overwritten by another >> process -- this is true for both counter data and summary data. >> Solution 3) does not introduce any new problems. > > You could just use lockf() ? > > -Andi
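For reference, a whole-file advisory lock through fcntl looks roughly like the sketch below. This is a reconstruction of the style of locking involved, not the actual gcov-io code, which also deals with file creation and read/write modes.

#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive advisory lock on an open gcda file descriptor,
   blocking until it is granted.  Sketch only.  */
static int
lock_gcda_file (int fd)
{
  struct flock fl = { 0 };

  fl.l_type = F_WRLCK;   /* exclusive (write) lock */
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;          /* length 0 covers the whole file */

  /* F_SETLKW waits for the lock; holding many such locks at once is
     where the deadlock concern with a global two-pass scheme comes
     from.  */
  return fcntl (fd, F_SETLKW, &fl);
}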
> Xinliang David Li <davidxl@google.com> writes: > > > > Process level synchronization problems can happen when two processes > > (running the instrumented binary) exit at the same time. The > > updated/merged counters from one process may be overwritten by another > > process -- this is true for both counter data and summary data. > > Solution 3) does not introduce any new problems. > > You could just use lockf() ? The issue here is holding locks for all the files (which can be many) versus the limits on the number of locks & the possibilities for deadlocking (mind that updating may happen in different orders on the same files for different programs built from the same objects). For David: there is no thread safety code in mainline for the counters. Long ago Zdenek implemented poor-man's TLS for counters (before TLS was invented) http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down as too memory expensive per thread. We could optionally do atomic updates like ICC, or a combination of both as discussed in the thread. So far no one has implemented it, since the coverage fixups seem to work well enough in practice for multithreaded programs, where reproducibility does not seem to be _that_ important. For a GCC profiled bootstrap, however, I would like to see the output binary be reproducible. We really ought to make profile updates safe for multiple processes. Trashing a whole process run is worse than racing on an increment. There is a good chance that one of the runs is more important than the others and it will get trashed. I do not think we have serious update problems in the summaries at the moment. We lock individual files as we update them. The summary is simple enough to be safe: sum_all is summed, max_all is the maximum over the individual runs. Even when you combine multiple programs the summary will end up the same. Everything except max_all is ignored anyway. Solution 2 (i.e. histogram streaming) will also have the property that it is safe WRT multiple programs, just like sum_all. Honza > > -Andi
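For reference, the "atomic updates like ICC" alternative mentioned here would amount to something like the following sketch using the GCC __atomic builtins; this is an illustration of the idea only, not anything mainline libgcov does.

typedef long long gcov_type;

/* Arc counter increment done atomically instead of as a plain
   load/add/store, trading update speed for thread-safe counters.  */
static inline void
gcov_atomic_increment (gcov_type *counter)
{
  __atomic_fetch_add (counter, 1, __ATOMIC_RELAXED);
}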
On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Xinliang David Li <davidxl@google.com> writes: >> > >> > Process level synchronization problems can happen when two processes >> > (running the instrumented binary) exit at the same time. The >> > updated/merged counters from one process may be overwritten by another >> > process -- this is true for both counter data and summary data. >> > Solution 3) does not introduce any new problems. >> >> You could just use lockf() ? > > The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) > > For David: there is no thread safety code in mainline for the counters. > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > as too memory expensive per thread. We could optionaly do atomic updates like ICC > or combination of both as discussed in the thread. > So far no one implemented it since the coverage fixups seems to work well enough in > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > important. > > For GCC profiled bootstrap however I would like to see the output binary to be > reproducible. We realy ought to update profiles safe for multple processes. > Trashing whole process run is worse than doing race in increment. There is good > chance that one of runs is more important than others and it will get trashed. > > I do not think we do have serious update problems in the summaries at the moment. > We lock individual files as we update them. The summary is simple enough to be safe. > sum_all is summed, max_all is maximum over the individual runs. Even when you combine > mutiple programs the summary will end up same. Everything except for max_all is ignored > anyway. > > Solution 2 (i.e. histogram streaming) will also have the property that it is safe > WRT multiple programs, just like sum_all. I think the sum_all based scaling of the working set entries also has this property. What is your opinion on saving the histogram in the summary and merging histograms together as best as possible compared to the alternative of saving the working set information as now and scaling it up by the ratio between the new and old sum_all when merging? Thanks, Teresa > > Honza >> >> -Andi -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
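A minimal sketch of the sum_all based scaling being weighed here, assuming working set entries shaped like the patch's (a hot-counter count plus a minimum counter value); the entry and function names are made up for illustration. The integer arithmetic is where merge-order-dependent roundoff can creep in.

typedef long long gcov_type;

typedef struct
{
  unsigned num_counters;   /* number of hot counters in this entry */
  gcov_type min_counter;   /* minimum counter value in the entry */
} working_set_entry;

/* Scale a previously computed working set by sum_all_merged /
   sum_all_orig, where sum_all_orig is the sum_all of the summary the
   working set was computed from.  A real implementation would also
   have to guard against overflow in the multiplication.  */
static void
scale_working_set (working_set_entry *ws, unsigned n,
                   gcov_type sum_all_orig, gcov_type sum_all_merged)
{
  unsigned i;

  for (i = 0; i < n; i++)
    ws[i].min_counter = ws[i].min_counter * sum_all_merged / sum_all_orig;
}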
> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> Xinliang David Li <davidxl@google.com> writes: > >> > > >> > Process level synchronization problems can happen when two processes > >> > (running the instrumented binary) exit at the same time. The > >> > updated/merged counters from one process may be overwritten by another > >> > process -- this is true for both counter data and summary data. > >> > Solution 3) does not introduce any new problems. > >> > >> You could just use lockf() ? > > > > The issue here is holding lock for all the files (that can be many) versus > > number of locks limits & possibilities for deadlocking (mind that updating > > may happen in different orders on the same files for different programs built > > from same objects) > > > > For David: there is no thread safety code in mainline for the counters. > > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) > > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down > > as too memory expensive per thread. We could optionaly do atomic updates like ICC > > or combination of both as discussed in the thread. > > So far no one implemented it since the coverage fixups seems to work well enough in > > pracitce for multithreaded programs where reproducibility do not seem to be _that_ > > important. > > > > For GCC profiled bootstrap however I would like to see the output binary to be > > reproducible. We realy ought to update profiles safe for multple processes. > > Trashing whole process run is worse than doing race in increment. There is good > > chance that one of runs is more important than others and it will get trashed. > > > > I do not think we do have serious update problems in the summaries at the moment. > > We lock individual files as we update them. The summary is simple enough to be safe. > > sum_all is summed, max_all is maximum over the individual runs. Even when you combine > > mutiple programs the summary will end up same. Everything except for max_all is ignored > > anyway. > > > > Solution 2 (i.e. histogram streaming) will also have the property that it is safe > > WRT multiple programs, just like sum_all. > > I think the sum_all based scaling of the working set entries also has > this property. What is your opinion on saving the histogram in the I think the scaling will have at least roundoff issues WRT different merging orders. > summary and merging histograms together as best as possible compared > to the alternative of saving the working set information as now and > scaling it up by the ratio between the new and old sum_all when > merging? So far I like this option best. But David seems to lean towards the third option, with whole-file locking. I see it may prove to be more extensible in the future. At the moment I do not understand two things: 1) why do we need info on the number of counters above a given threshold, since the hot/cold decisions usually depend purely on the count cutoff? Removing those will solve the merging issues with variant 2, and then it would probably be a good solution. 2) Do we plan to add some features in the near future that will anyway require global locking? I guess LIPO itself does not count, since it streams its data into an independent file as you mentioned earlier, and locking the LIPO file is not that hard. Does LIPO stream everything into that common file, or does it use a combination of gcda files and a common summary? What other stuff does Google plan to merge?
(In general I would be curious about merging plans WRT profile stuff, so we get more synchronized and effective on getting patches in. We have about two months to get it done in stage1 and it would be nice to get as much as possible. Obviously some of the patches will need a bit of discussion, like this one. Hope you do not find it frustrating; I actually think this is an important feature). I also realized today that the common value counters (used by switch, indirect call and div/mod value profiling) are non-stable WRT different merging orders (i.e. a parallel make in the train run). I do not think there is an actual solution to that except for not merging the counter section of this type in libgcov and merging them in some canonical order at profile feedback time. Perhaps we just want to live with this, since the discrepancy here is small. (i.e. these counters are quite rare and their outcome has just a local effect on the final binary, unlike the global summaries/edge counters). Honza > > Thanks, > Teresa > > > > > Honza > >> > >> -Andi > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
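To see why these counters are order sensitive: a single-value profiler keeps a (value, count, all) triple and merges runs with a majority-vote style rule, so which value survives can depend on the order the train runs complete. The sketch below is in the style of libgcov's __gcov_merge_single, reconstructed for illustration; the real routine reads the incoming triples from the gcda stream.

typedef long long gcov_type;

/* Merge one incoming (value, count, all) triple into the in-memory
   triple COUNTERS.  */
static void
merge_single (gcov_type counters[3], gcov_type value,
              gcov_type count, gcov_type all)
{
  if (counters[0] == value)
    counters[1] += count;
  else if (count > counters[1])
    {
      /* The incoming value displaces the current candidate.  */
      counters[0] = value;
      counters[1] = count - counters[1];
    }
  else
    counters[1] -= count;
  counters[2] += all;
}

Merging runs with triples (1, 10, 10) and (2, 10, 10) into a zeroed counter, for instance, leaves value 1 in one order and value 2 in the other -- exactly the train-run-order instability described above.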
On Mon, Aug 20, 2012 at 10:29 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Mon, Aug 20, 2012 at 6:27 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> Xinliang David Li <davidxl@google.com> writes: >> >> > >> >> > Process level synchronization problems can happen when two processes >> >> > (running the instrumented binary) exit at the same time. The >> >> > updated/merged counters from one process may be overwritten by another >> >> > process -- this is true for both counter data and summary data. >> >> > Solution 3) does not introduce any new problems. >> >> >> >> You could just use lockf() ? >> > >> > The issue here is holding lock for all the files (that can be many) versus >> > number of locks limits & possibilities for deadlocking (mind that updating >> > may happen in different orders on the same files for different programs built >> > from same objects) >> > >> > For David: there is no thread safety code in mainline for the counters. >> > Long time ago Zdenek implmented poor-mans TLS for counters (before TLS was invented) >> > http://gcc.gnu.org/ml/gcc-patches/2001-11/msg01546.html but it was voted down >> > as too memory expensive per thread. We could optionaly do atomic updates like ICC >> > or combination of both as discussed in the thread. >> > So far no one implemented it since the coverage fixups seems to work well enough in >> > pracitce for multithreaded programs where reproducibility do not seem to be _that_ >> > important. >> > >> > For GCC profiled bootstrap however I would like to see the output binary to be >> > reproducible. We realy ought to update profiles safe for multple processes. >> > Trashing whole process run is worse than doing race in increment. There is good >> > chance that one of runs is more important than others and it will get trashed. >> > >> > I do not think we do have serious update problems in the summaries at the moment. >> > We lock individual files as we update them. The summary is simple enough to be safe. >> > sum_all is summed, max_all is maximum over the individual runs. Even when you combine >> > mutiple programs the summary will end up same. Everything except for max_all is ignored >> > anyway. >> > >> > Solution 2 (i.e. histogram streaming) will also have the property that it is safe >> > WRT multiple programs, just like sum_all. >> >> I think the sum_all based scaling of the working set entries also has >> this property. What is your opinion on saving the histogram in the > > I think the scaling will have at lest roundoff issues WRT different merging orders. > >> summary and merging histograms together as best as possible compared >> to the alternative of saving the working set information as now and >> scaling it up by the ratio between the new and old sum_all when >> merging? > > So far I like this option best. But David seems to lean towards thirtd option with > whole file locking. I see it may show to be more extensible in the future. > At the moment I do not understand two things > 1) why do we need info on the number of counter above given threshold, sinc ethe hot/cold > decisions usually depends purely on the count cutoff. > Removing those will solve merging issues with variant 2 and then it would be probably > good solution. This is useful for large applications with a long tail. The instruction working set for those applications are very large, and inliner and unroller need to be aware of that and good heuristics can be developed to throttle aggressive code bloat transformations. 
For the inliner, it is kind of like the global budget, but more application dependent. In the long run, we will collect a more advanced fdo summary regarding the working set -- it will be the working set size for each code region (locality region). > 2) Do we plan to add some features in near future that will anyway require global locking? > I guess LIPO itself does not count since it streams its data into independent file as you > mentioned earlier and locking LIPO file is not that hard. > Does LIPO stream everything into that common file, or does it use combination of gcda files > and common summary? Actually, LIPO module grouping information is stored in gcda files. It is also stored in a separate .imports file (one per object) --- this is primarily used by our build system for dependence information. > > What other stuff Google plans to merge? > (In general I would be curious about merging plans WRT profile stuff, so we get more > synchronized and effective on getting patches in. We have about two months to get it done > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think > this is an important feature). We plan to merge in the new LIPO implementation based on LTO streaming. Rong Xu finished this in a 4.6-based compiler, and he needs to port it to 4.8. thanks, David > > I also realized today that the common value counters (used by switch, indirect > call and div/mod value profiling) are non-stanble WRT different merging orders > (i.e. parallel make in train run). I do not think there is actual solution to > that except for not merging the counter section of this type in libgcov and > merge them in some canonical order at profile feedback time. Perhaps we just > want to live with this, since the disprepancy here is small. (i.e. these > counters are quite rare and their outcome has just local effect on the final > binary, unlike the global summaries/edge counters). > > Honza >> >> Thanks, >> Teresa >> >> > >> > Honza >> >> >> >> -Andi >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> > This is useful for large applications with a long tail. The > instruction working set for those applications are very large, and > inliner and unroller need to be aware of that and good heuristics can > be developed to throttle aggressive code bloat transformations. For > inliner, it is kind of the like the global budget but more application > dependent. In the long run, we will collect more advanced fdo summary > regarding working set -- it will be working set size for each code > region (locality region). I see, so you use it to estimate the size of the working set and the effect of bloating optimizations on cache size. This sounds interesting. What are your experiences with this? What concerns me is that it is greatly inaccurate - you have no idea how many instructions a given counter is guarding, and it can differ quite a lot. Also inlining/optimization makes working sets significantly different (by a factor of 100 for tramp3d). But on the other hand any solution at this level will be greatly inaccurate. So I am curious how reliable the data you can get from this is. How do you take this into account in the heuristics? It seems to me that for this use perhaps the simple logic in histogram merging maximizing the number of BBs for a given bucket will work well? It is inaccurate, but we are working with greatly inaccurate data anyway. Except for degenerate cases, the small and unimportant runs will have small BB counts, while large runs will have larger counts, and those are the ones we optimize for anyway. > > > > 2) Do we plan to add some features in near future that will anyway require global locking? > > I guess LIPO itself does not count since it streams its data into independent file as you > > mentioned earlier and locking LIPO file is not that hard. > > Does LIPO stream everything into that common file, or does it use combination of gcda files > > and common summary? > > Actually, LIPO module grouping information are stored in gcda files. > It is also stored in a separate .imports file (one per object) --- > this is primarily used by our build system for dependence information. I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave on GCC bootstrap? (i.e. it does a lot more work in the libgcov module on each invocation, so I am curious if it is practically useful at all). With an LTO-based solution a lot can probably be pushed to link time? Before the actual GCC starts from the linker plugin, the LIPO module can read gcov CFGs from gcda files and do all the merging/updating/CFG construction that is currently performed at runtime, right? > > > > > > What other stuff Google plans to merge? > > (In general I would be curious about merging plans WRT profile stuff, so we get more > > synchronized and effective on getting patches in. We have about two months to get it done > > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will > > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think > > this is an important feature). > > We plan to merge in the new LIPO implementation based on LTO > streaming. Rong Xu finished this in 4.6 based compiler, and he needs > to port it to 4.8. Good. Looks like a lot of work ahead. It would be nice if we could perhaps start by merging the libgcov infrastructure updates prior to the LIPO changes. From what I saw on the LIPO branch some time ago, it has a lot of stuff that is not exactly LIPO-specific.
Honza > > > thanks, > > David > > > > > I also realized today that the common value counters (used by switch, indirect > > call and div/mod value profiling) are non-stanble WRT different merging orders > > (i.e. parallel make in train run). I do not think there is actual solution to > > that except for not merging the counter section of this type in libgcov and > > merge them in some canonical order at profile feedback time. Perhaps we just > > want to live with this, since the disprepancy here is small. (i.e. these > > counters are quite rare and their outcome has just local effect on the final > > binary, unlike the global summaries/edge counters). > > > > Honza > >> > >> Thanks, > >> Teresa > >> > >> > > >> > Honza > >> >> > >> >> -Andi > >> > >> > >> > >> -- > >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Mon, Aug 20, 2012 at 11:33 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> This is useful for large applications with a long tail. The >> instruction working set for those applications are very large, and >> inliner and unroller need to be aware of that and good heuristics can >> be developed to throttle aggressive code bloat transformations. For >> inliner, it is kind of the like the global budget but more application >> dependent. In the long run, we will collect more advanced fdo summary >> regarding working set -- it will be working set size for each code >> region (locality region). > > I see, so you use it to estimate size of the working set and effect of bloating > optimizations on cache size. This sounds interesting. What are you experiences > with this? Teresa has done some tunings for the unroller so far. The inliner tuning is the next step. > > What concerns me that it is greatly inaccurate - you have no idea how many > instructions given counter is guarding and it can differ quite a lot. Also > inlining/optimization makes working sets significantly different (by factor of > 100 for tramp3d). The pre ipa-inline working set is the one that is needed for ipa inliner tuning. For post-ipa inline code increase transformations, some update is probably needed. >But on the ohter hand any solution at this level will be > greatly inaccurate. So I am curious how reliable data you can get from this? > How you take this into account for the heuristics? This effort is just the first step to allow good heuristics to develop. > > It seems to me that for this use perhaps the simple logic in histogram merging > maximizing the number of BBs for given bucket will work well? It is > inaccurate, but we are working with greatly inaccurate data anyway. > Except for degenerated cases, the small and unimportant runs will have small BB > counts, while large runs will have larger counts and those are ones we optimize > for anyway. The working set curve for each type of applications contains lots of information that can be mined. The inaccuracy can also be mitigated by more data 'calibration'. >> >> >> > 2) Do we plan to add some features in near future that will anyway require global locking? >> > I guess LIPO itself does not count since it streams its data into independent file as you >> > mentioned earlier and locking LIPO file is not that hard. >> > Does LIPO stream everything into that common file, or does it use combination of gcda files >> > and common summary? >> >> Actually, LIPO module grouping information are stored in gcda files. >> It is also stored in a separate .imports file (one per object) --- >> this is primarily used by our build system for dependence information. > > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave > on GCC bootstrap? We have not tried gcc bootstrap with LIPO. Gcc compile time is not the main problem for application build -- the link time (for debug build) is. > (i.e. it does a lot more work in the libgcov module per each > invocation, so I am curious if it is practically useful at all). > > With LTO based solution a lot can be probably pushed at link time? Before > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from > gcda files and do all the merging/updating/CFG constructions that is currently > performed at runtime, right? The dynamic cgraph build and analysis is still done at runtime. However, with the new implementation, FE is no longer involved. 
Gcc driver is modified to understand module grouping, and lto is used to merge the streamed output from aux modules. David >> >> >> > >> > What other stuff Google plans to merge? >> > (In general I would be curious about merging plans WRT profile stuff, so we get more >> > synchronized and effective on getting patches in. We have about two months to get it done >> > in stage1 and it would be nice to get as much as possible. Obviously some of the patches will >> > need bit fo dicsussion like this one. Hope you do not find it frustrating, I actually think >> > this is an important feature). >> >> We plan to merge in the new LIPO implementation based on LTO >> streaming. Rong Xu finished this in 4.6 based compiler, and he needs >> to port it to 4.8. > > Good. Looks like a lot of work ahead. It would be nice if we can perhaps start > by merging the libgcov infrastructure updates prior the LIPO changes. From > what I saw at LIPO branch some time ago it has a lot of stuff that is not > exactly LIPO specific. > > Honza >> >> >> thanks, >> >> David >> >> > >> > I also realized today that the common value counters (used by switch, indirect >> > call and div/mod value profiling) are non-stanble WRT different merging orders >> > (i.e. parallel make in train run). I do not think there is actual solution to >> > that except for not merging the counter section of this type in libgcov and >> > merge them in some canonical order at profile feedback time. Perhaps we just >> > want to live with this, since the disprepancy here is small. (i.e. these >> > counters are quite rare and their outcome has just local effect on the final >> > binary, unlike the global summaries/edge counters). >> > >> > Honza >> >> >> >> Thanks, >> >> Teresa >> >> >> >> > >> >> > Honza >> >> >> >> >> >> -Andi >> >> >> >> >> >> >> >> -- >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> Teresa has done some tunings for the unroller so far. The inliner > tuning is the next step. > > > > > What concerns me that it is greatly inaccurate - you have no idea how many > > instructions given counter is guarding and it can differ quite a lot. Also > > inlining/optimization makes working sets significantly different (by factor of > > 100 for tramp3d). > > The pre ipa-inline working set is the one that is needed for ipa > inliner tuning. For post-ipa inline code increase transformations, > some update is probably needed. > > >But on the ohter hand any solution at this level will be > > greatly inaccurate. So I am curious how reliable data you can get from this? > > How you take this into account for the heuristics? > > This effort is just the first step to allow good heuristics to develop. > > > > > It seems to me that for this use perhaps the simple logic in histogram merging > > maximizing the number of BBs for given bucket will work well? It is > > inaccurate, but we are working with greatly inaccurate data anyway. > > Except for degenerated cases, the small and unimportant runs will have small BB > > counts, while large runs will have larger counts and those are ones we optimize > > for anyway. > > The working set curve for each type of applications contains lots of > information that can be mined. The inaccuracy can also be mitigated by > more data 'calibration'. Sure, I think I am leaning towards trying solution 2) with maximizing counter count merging (probably it would make sense to rename it from BB count, since it is not really a BB count and is thus misleading) and we will see how well it works in practice. We get the benefit of far fewer issues with profile locking/unlocking, and we lose a bit of precision on BB counts. I tend to believe that the error will not be that important in practice. Another loss is more histogram streaming into each gcda file, but with skipping zero entries it should not be a major overhead problem, I hope. What do you think? > > >> > >> > >> > 2) Do we plan to add some features in near future that will anyway require global locking? > >> > I guess LIPO itself does not count since it streams its data into independent file as you > >> > mentioned earlier and locking LIPO file is not that hard. > >> > Does LIPO stream everything into that common file, or does it use combination of gcda files > >> > and common summary? > >> > >> Actually, LIPO module grouping information are stored in gcda files. > >> It is also stored in a separate .imports file (one per object) --- > >> this is primarily used by our build system for dependence information. > > > > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave > > on GCC bootstrap? > > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > main problem for application build -- the link time (for debug build) > is. I was primarily curious how LIPO's runtime analysis fares in the situation where you do very many small train runs on a rather large app (sure, GCC is small to google's use case ;). > > > (i.e. it does a lot more work in the libgcov module per each > > invocation, so I am curious if it is practically useful at all). > > > > With LTO based solution a lot can be probably pushed at link time? Before > > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from > > gcda files and do all the merging/updating/CFG constructions that is currently > > performed at runtime, right?
> > The dynamic cgraph build and analysis is still done at runtime. > However, with the new implementation, FE is no longer involved. Gcc > driver is modified to understand module grouping, and lto is used to > merge the streamed output from aux modules. I see. Are there any fundamental reasons why it cannot be done at link-time, when all gcda files are available? Why is the grouping not done inside the linker plugin? Honza > > > David
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > > Teresa has done some tunings for the unroller so far. The inliner > > tuning is the next step. > We've gotten some noticeable improvements from very large applications with icache problems. Looking at the working set curves for these apps, the working set size in terms of the number of hot counters starts increasing exponentially to a much higher value than other apps as you approach 99.99% of the sum_all of the counter values. The long tail as David mentioned, and very large values. > > > > > > > What concerns me that it is greatly inaccurate - you have no idea how > many > > > instructions given counter is guarding and it can differ quite a lot. > Also > > > inlining/optimization makes working sets significantly different (by > factor of > > > 100 for tramp3d). > That's true, but from what I have seen the working sets of the applications this is targeted towards are fairly distinct. Which suggests that perfect accuracy in merging summaries may not be needed, at least for the current usage. As we start trying to use this to guide inlining and other optimizations it may need some fine tuning. > > > > The pre ipa-inline working set is the one that is needed for ipa > > inliner tuning. For post-ipa inline code increase transformations, > > some update is probably needed. > > > > >But on the ohter hand any solution at this level will be > > > greatly inaccurate. So I am curious how reliable data you can get from > this? > > > How you take this into account for the heuristics? > > > > This effort is just the first step to allow good heuristics to develop. > > > > > > > > It seems to me that for this use perhaps the simple logic in histogram > merging > > > maximizing the number of BBs for given bucket will work well? It is > > > inaccurate, but we are working with greatly inaccurate data anyway. > > > Except for degenerated cases, the small and unimportant runs will have > small BB > > > counts, while large runs will have larger counts and those are ones we > optimize > > > for anyway. > > > > The working set curve for each type of applications contains lots of > > information that can be mined. The inaccuracy can also be mitigated by > > more data 'calibration'. > > Sure, I think I am leaning towards trying the solution 2) with maximizing > counter count merging (probably it would make sense to rename it from BB > count > since it is not really BB count and thus it is misleading) and we will see > how > well it works in practice. > > We have benefits of much fewer issues with profile locking/unlocking and we > lose bit of precision on BB counts. I tend to believe that the error will > not > be that important in practice. Another loss is more histogram streaming > into > each gcda file, but with skiping zero entries it should not be major > overhead > problem I hope. > > What do you think? > I can go ahead with the histogram approach. There is some roundoff error from the working set scaling approach that can affect different merging orders as you note, although I think this only really affects the small counter values. The other place where saving/merging the histogram would help is when the distribution of counter values in the histogram varies between runs, say for example, the hottest loop is much hotter in a subsequent run, but the rest of the counter values stay largely consistent. 
Note, however, that if the hotspots are different in different runs, then merging either the histogram or the working set will have issues. The only way to detect this is to recompute the histogram/working set from scratch from the merged counters. I wonder in practice, even when there are a lot of simultaneous runs going like in a gcc bootstrap, if we could get reasonably accurate summary recomputation without global locking. The reason is that as long as the actual counter updates are safe as they are now due to the individual file locking, the inaccuracy in the recomputed summary information will not grow over time, and there is a reasonable chance that the last run to complete and merge will recompute the summary from the final merged counter values and get it right (or only be off by a little bit if there are a couple of runs completing simultaneously at the end). But this can be investigated as a follow on to this patch. Thanks, Teresa > > > >> > > >> > > >> > 2) Do we plan to add some features in near future that will anyway > require global locking? > > >> > I guess LIPO itself does not count since it streams its data > into independent file as you > > >> > mentioned earlier and locking LIPO file is not that hard. > > >> > Does LIPO stream everything into that common file, or does it > use combination of gcda files > > >> > and common summary? > > >> > > >> Actually, LIPO module grouping information are stored in gcda files. > > >> It is also stored in a separate .imports file (one per object) --- > > >> this is primarily used by our build system for dependence information. > > > > > > I see, getting LIPO safe WRT parallel updates will be fun. How does > LIPO behave > > > on GCC bootstrap? > > > > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > > main problem for application build -- the link time (for debug build) > > is. > > I was primarily curious how the LIPOs runtime analysis fare in the > situation where > you do very many small train runs on rather large app (sure GCC is small > to google's > use case ;). > > > > > (i.e. it does a lot more work in the libgcov module per each > > > invocation, so I am curious if it is practically useful at all). > > > > > > With LTO based solution a lot can be probably pushed at link time? > Before > > > actual GCC starts from the linker plugin, LIPO module can read gcov > CFGs from > > > gcda files and do all the merging/updating/CFG constructions that is > currently > > > performed at runtime, right? > > > > The dynamic cgraph build and analysis is still done at runtime. > > However, with the new implementation, FE is no longer involved. Gcc > > driver is modified to understand module grouping, and lto is used to > > merge the streamed output from aux modules. > > I see. Are there any fundamental reasons why it can not be done at > link-time > when all gcda files are available? Why the grouping is not done inside > linker > plugin? > > Honza > > > > > > David > -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
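The recomputation is cheap once a histogram of the merged counters exists, since the buckets are already ordered by counter value. Below is a simplified sketch in the spirit of the patch's gcov_compute_working_set; the real code fills a whole array of cutoff points in one walk and apportions counters within the boundary bucket, both elided here.

typedef long long gcov_type;

typedef struct
{
  unsigned count;
  gcov_type min_value;
  gcov_type cum_value;
} bucket_type;

/* Compute one working set point: how many hot counters, and what
   minimum counter value, cover CUTOFF out of the total counter sum
   (e.g. cutoff = 99.9% of sum_all), walking buckets from hottest to
   coldest.  */
static void
working_set_point (const bucket_type *histo, unsigned n_buckets,
                   gcov_type cutoff, unsigned *num_hot,
                   gcov_type *min_counter)
{
  gcov_type cum = 0;
  unsigned hot = 0;
  int i;

  for (i = (int) n_buckets - 1; i >= 0; i--)
    {
      cum += histo[i].cum_value;
      hot += histo[i].count;
      if (cum >= cutoff)
        break;
    }
  *num_hot = hot;
  *min_counter = i >= 0 ? histo[i].min_value : 0;
}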
> The issue here is holding lock for all the files (that can be many) versus > number of locks limits & possibilities for deadlocking (mind that updating > may happen in different orders on the same files for different programs built > from same objects) lockf typically has a deadlock detector, and will error out. -Andi
On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Teresa has done some tunings for the unroller so far. The inliner >> tuning is the next step. >> >> > >> > What concerns me that it is greatly inaccurate - you have no idea how many >> > instructions given counter is guarding and it can differ quite a lot. Also >> > inlining/optimization makes working sets significantly different (by factor of >> > 100 for tramp3d). >> >> The pre ipa-inline working set is the one that is needed for ipa >> inliner tuning. For post-ipa inline code increase transformations, >> some update is probably needed. >> >> >But on the ohter hand any solution at this level will be >> > greatly inaccurate. So I am curious how reliable data you can get from this? >> > How you take this into account for the heuristics? >> >> This effort is just the first step to allow good heuristics to develop. >> >> > >> > It seems to me that for this use perhaps the simple logic in histogram merging >> > maximizing the number of BBs for given bucket will work well? It is >> > inaccurate, but we are working with greatly inaccurate data anyway. >> > Except for degenerated cases, the small and unimportant runs will have small BB >> > counts, while large runs will have larger counts and those are ones we optimize >> > for anyway. >> >> The working set curve for each type of applications contains lots of >> information that can be mined. The inaccuracy can also be mitigated by >> more data 'calibration'. > > Sure, I think I am leaning towards trying the solution 2) with maximizing > counter count merging (probably it would make sense to rename it from BB count > since it is not really BB count and thus it is misleading) and we will see how > well it works in practice. > > We have benefits of much fewer issues with profile locking/unlocking and we > lose bit of precision on BB counts. I tend to believe that the error will not > be that important in practice. Another loss is more histogram streaming into > each gcda file, but with skiping zero entries it should not be major overhead > problem I hope. > > What do you think? >> >> >> >> >> >> >> > 2) Do we plan to add some features in near future that will anyway require global locking? >> >> > I guess LIPO itself does not count since it streams its data into independent file as you >> >> > mentioned earlier and locking LIPO file is not that hard. >> >> > Does LIPO stream everything into that common file, or does it use combination of gcda files >> >> > and common summary? >> >> >> >> Actually, LIPO module grouping information are stored in gcda files. >> >> It is also stored in a separate .imports file (one per object) --- >> >> this is primarily used by our build system for dependence information. >> > >> > I see, getting LIPO safe WRT parallel updates will be fun. How does LIPO behave >> > on GCC bootstrap? >> >> We have not tried gcc bootstrap with LIPO. Gcc compile time is not the >> main problem for application build -- the link time (for debug build) >> is. > > I was primarily curious how the LIPOs runtime analysis fare in the situation where > you do very many small train runs on rather large app (sure GCC is small to google's > use case ;). There will be races, but as Teresa mentioned, there is a big chance that the process which finishes the merge last is also the final overrider of the LIPO summary data. >> >> > (i.e. it does a lot more work in the libgcov module per each >> > invocation, so I am curious if it is practically useful at all).
>> > >> > With LTO based solution a lot can be probably pushed at link time? Before >> > actual GCC starts from the linker plugin, LIPO module can read gcov CFGs from >> > gcda files and do all the merging/updating/CFG constructions that is currently >> > performed at runtime, right? >> >> The dynamic cgraph build and analysis is still done at runtime. >> However, with the new implementation, FE is no longer involved. Gcc >> driver is modified to understand module grouping, and lto is used to >> merge the streamed output from aux modules. > > I see. Are there any fundamental reasons why it can not be done at link-time > when all gcda files are available? For build parallelism, the decision should be made as early as possible -- that is what makes LIPO 'light'. > Why the grouping is not done inside linker > plugin? It is not delayed to link time. In fact, the linker plugin is not even involved. David > > Honza >> >> >> David
On Tue, Aug 21, 2012 at 7:38 AM, Teresa Johnson <tejohnson@google.com>wrote: > > > On Tue, Aug 21, 2012 at 12:34 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> > Teresa has done some tunings for the unroller so far. The inliner >> > tuning is the next step. >> > > We've gotten some noticeable improvements from very large applications > with icache problems. Looking at the working set curves for these apps, the > working set size in terms of the number of hot counters starts increasing > exponentially to a much higher value than other apps as you approach 99.99% > of the sum_all of the counter values. The long tail as David mentioned, and > very large values. > > > >> > > >> > > What concerns me that it is greatly inaccurate - you have no idea how >> many >> > > instructions given counter is guarding and it can differ quite a lot. >> Also >> > > inlining/optimization makes working sets significantly different (by >> factor of >> > > 100 for tramp3d). >> > > That's true, but from what I have seen the working sets of the > applications this is targeted towards are fairly distinct. Which suggests > that perfect accuracy in merging summaries may not be needed, at least for > the current usage. As we start trying to use this to guide inlining and > other optimizations it may need some fine tuning. > > >> > >> > The pre ipa-inline working set is the one that is needed for ipa >> > inliner tuning. For post-ipa inline code increase transformations, >> > some update is probably needed. >> > > >> > >But on the ohter hand any solution at this level will be >> > > greatly inaccurate. So I am curious how reliable data you can get >> from this? >> > > How you take this into account for the heuristics? >> > >> > This effort is just the first step to allow good heuristics to develop. >> > >> > > >> > > It seems to me that for this use perhaps the simple logic in >> histogram merging >> > > maximizing the number of BBs for given bucket will work well? It is >> > > inaccurate, but we are working with greatly inaccurate data anyway. >> > > Except for degenerated cases, the small and unimportant runs will >> have small BB >> > > counts, while large runs will have larger counts and those are ones >> we optimize >> > > for anyway. >> > >> > The working set curve for each type of applications contains lots of >> > information that can be mined. The inaccuracy can also be mitigated by >> > more data 'calibration'. >> >> Sure, I think I am leaning towards trying the solution 2) with maximizing >> counter count merging (probably it would make sense to rename it from BB >> count >> since it is not really BB count and thus it is misleading) and we will >> see how >> well it works in practice. >> >> We have benefits of much fewer issues with profile locking/unlocking and >> we >> lose bit of precision on BB counts. I tend to believe that the error will >> not >> be that important in practice. Another loss is more histogram streaming >> into >> each gcda file, but with skiping zero entries it should not be major >> overhead >> problem I hope. >> >> What do you think? >> > > I can go ahead with the histogram approach. There is some roundoff > error from the working set scaling approach that can affect different > merging orders as you note, although I think this only really affects the > small counter values. 
The other place where saving/merging the histogram > would help is when the distribution of counter values in the histogram > varies between runs, say for example, the hottest loop is much hotter in a > subsequent run, but the rest of the counter values stay largely consistent. > Note, however, that if the hotspots are different in different runs, then > merging either the histogram or the working set will have issues. The only > way to detect this is to recompute the histogram/working set from scratch > from the merged counters. > > I wonder in practice, even when there are a lot of simultaneous runs going > like in a gcc bootstrap, if we could get reasonably accurate summary > recomputation without global locking. The reason is that as long as the > actual counter updates are safe as they are now due to the individual file > locking, the inaccuracy in the recomputed summary information will not grow > over time, and there is a reasonable chance that the last run to complete > and merge will recompute the summary from the final merged counter values > and get it right (or only be off by a little bit if there are a couple of > runs completing simultaneously at the end). But this can be investigated as > a follow on to this patch. > The main concern is probably the build reproducibility in gcc bootstrap with FDO. David > > Thanks, > Teresa > > > >> > >> >> > >> >> > >> > 2) Do we plan to add some features in near future that will >> anyway require global locking? >> > >> > I guess LIPO itself does not count since it streams its data >> into independent file as you >> > >> > mentioned earlier and locking LIPO file is not that hard. >> > >> > Does LIPO stream everything into that common file, or does it >> use combination of gcda files >> > >> > and common summary? >> > >> >> > >> Actually, LIPO module grouping information are stored in gcda files. >> > >> It is also stored in a separate .imports file (one per object) --- >> > >> this is primarily used by our build system for dependence >> information. >> > > >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does >> LIPO behave >> > > on GCC bootstrap? >> > >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the >> > main problem for application build -- the link time (for debug build) >> > is. >> >> I was primarily curious how the LIPOs runtime analysis fare in the >> situation where >> you do very many small train runs on rather large app (sure GCC is small >> to google's >> use case ;). >> > >> > > (i.e. it does a lot more work in the libgcov module per each >> > > invocation, so I am curious if it is practically useful at all). >> > > >> > > With LTO based solution a lot can be probably pushed at link time? >> Before >> > > actual GCC starts from the linker plugin, LIPO module can read gcov >> CFGs from >> > > gcda files and do all the merging/updating/CFG constructions that is >> currently >> > > performed at runtime, right? >> > >> > The dynamic cgraph build and analysis is still done at runtime. >> > However, with the new implementation, FE is no longer involved. Gcc >> > driver is modified to understand module grouping, and lto is used to >> > merge the streamed output from aux modules. >> >> I see. Are there any fundamental reasons why it can not be done at >> link-time >> when all gcda files are available? Why the grouping is not done inside >> linker >> plugin? >> >> Honza >> > >> > >> > David >> > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 >
> > I can go ahead with the histogram approach. There is some roundoff > > error from the working set scaling approach that can affect different > > merging orders as you note, although I think this only really affects the > > small counter values. The other place where saving/merging the histogram Do you have any intuition on why simple maximalization merging (that is safe wrt ordering) would be a bad idea? We care only about the working set size around the top of the histogram, and I would say that we should sort of optimize for the largest (in the number of blocks in the hot area) of the train runs. One way where things will get messed up is when the working set is about the same but the runs are of different sizes, so all the blocks get accounted into two different buckets. But in general I do not think there is a reasonably accurate way to merge the profiles without actually streaming out all counter IDs in every bucket, so perhaps this will work well enough. If not, we can in future introduce a per-program global summary file that will contain the counters to be merged accurately. > > would help is when the distribution of counter values in the histogram > > varies between runs, say for example, the hottest loop is much hotter in a > > subsequent run, but the rest of the counter values stay largely consistent. > > Note, however, that if the hotspots are different in different runs, then > > merging either the histogram or the working set will have issues. The only > > way to detect this is to recompute the histogram/working set from scratch > > from the merged counters. > > > > I wonder in practice, even when there are a lot of simultaneous runs going > > like in a gcc bootstrap, if we could get reasonably accurate summary > > recomputation without global locking. The reason is that as long as the > > actual counter updates are safe as they are now due to the individual file > > locking, the inaccuracy in the recomputed summary information will not grow > > over time, and there is a reasonable chance that the last run to complete > > and merge will recompute the summary from the final merged counter values > > and get it right (or only be off by a little bit if there are a couple of > > runs completing simultaneously at the end). But this can be investigated as > > a follow on to this patch. > > > > > The main concern is probably the build reproducibility in gcc bootstrap > with FDO. Hmm, you mean in the first pass update every file with new counters and in the second pass just update the summaries? OK, so I guess we went through 1) two pass updating with a race in between passes. 2) two pass updating with the first pass updating counters and the second having a race only for the summary update (i.e. no races for counters). 3) two pass updating with flocking (and some way to handle detected deadlocks). 4) one pass updating with histogram merging + maximalization of the working set (we do not really need to scale the buckets; we can simply merge the histograms and then multiply them by nruns before comparing to the actual counters). This assumes that the working sets of all runs are about the same, but it should work reasonably in practice I think. I guess 3/4 are acceptable WRT bootstrap reproducibility. I have no experience with flocking a large number of files and the portability of this solution, i.e. to Windows. If you think that 2) would be too inaccurate in practice and 3) has a chance to be portable, we could go for this. It will solve the precision problems and will also work for LIPO summaries.
I would be curious about effect on profiledbootstrap time of this if you implement it. Honza > > David > > > > > > > Thanks, > > Teresa > > > > > > >> > >> > >> > >> > >> > >> > 2) Do we plan to add some features in near future that will > >> anyway require global locking? > >> > >> > I guess LIPO itself does not count since it streams its data > >> into independent file as you > >> > >> > mentioned earlier and locking LIPO file is not that hard. > >> > >> > Does LIPO stream everything into that common file, or does it > >> use combination of gcda files > >> > >> > and common summary? > >> > >> > >> > >> Actually, LIPO module grouping information are stored in gcda files. > >> > >> It is also stored in a separate .imports file (one per object) --- > >> > >> this is primarily used by our build system for dependence > >> information. > >> > > > >> > > I see, getting LIPO safe WRT parallel updates will be fun. How does > >> LIPO behave > >> > > on GCC bootstrap? > >> > > >> > We have not tried gcc bootstrap with LIPO. Gcc compile time is not the > >> > main problem for application build -- the link time (for debug build) > >> > is. > >> > >> I was primarily curious how the LIPOs runtime analysis fare in the > >> situation where > >> you do very many small train runs on rather large app (sure GCC is small > >> to google's > >> use case ;). > >> > > >> > > (i.e. it does a lot more work in the libgcov module per each > >> > > invocation, so I am curious if it is practically useful at all). > >> > > > >> > > With LTO based solution a lot can be probably pushed at link time? > >> Before > >> > > actual GCC starts from the linker plugin, LIPO module can read gcov > >> CFGs from > >> > > gcda files and do all the merging/updating/CFG constructions that is > >> currently > >> > > performed at runtime, right? > >> > > >> > The dynamic cgraph build and analysis is still done at runtime. > >> > However, with the new implementation, FE is no longer involved. Gcc > >> > driver is modified to understand module grouping, and lto is used to > >> > merge the streamed output from aux modules. > >> > >> I see. Are there any fundamental reasons why it can not be done at > >> link-time > >> when all gcda files are available? Why the grouping is not done inside > >> linker > >> plugin? > >> > >> Honza > >> > > >> > > >> > David > >> > > > > > > > > -- > > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > >
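A sketch of what the bucket-wise maximalization merge of option 4 could look like; because taking minima and maxima is commutative and associative, the merged result does not depend on the order in which runs complete. This is an illustration of the proposal only, reusing the bucket layout from the patch.

typedef long long gcov_type;

typedef struct
{
  unsigned count;
  gcov_type min_value;
  gcov_type cum_value;
} bucket_type;

/* Fold SRC into DST by taking the per-bucket maximum population and
   cumulative value, and the minimum counter value seen in the
   bucket.  */
static void
merge_histogram_max (bucket_type *dst, const bucket_type *src,
                     unsigned n_buckets)
{
  unsigned i;

  for (i = 0; i < n_buckets; i++)
    {
      if (src[i].count == 0)
        continue;
      if (dst[i].count == 0 || src[i].min_value < dst[i].min_value)
        dst[i].min_value = src[i].min_value;
      if (src[i].count > dst[i].count)
        dst[i].count = src[i].count;
      if (src[i].cum_value > dst[i].cum_value)
        dst[i].cum_value = src[i].cum_value;
    }
}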
On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > I can go ahead with the histogram approach. There is some roundoff
>> > error from the working set scaling approach that can affect different
>> > merging orders as you note, although I think this only really affects the
>> > small counter values. The other place where saving/merging the histogram
>
> Do you have any intuition on why simple maximalization merging (that is safe wrt
> ordering) would be a bad idea?

When you say "maximalization merging" are you talking about the histogram merging approach I mentioned a few emails back (my response on Aug 19) where we assume the same relative order of hotness in the counters between runs, and accumulate the counter values in the histogram in that order? This would be inaccurate if different runs exercised different areas of the code, and thus the counters would be ordered in the histogram differently.

> We care only about the working set size around the top of the histogram and I would say

For optimizations that care about the boundary between hot and cold, such as code layout, I think we will also care about the smaller values in the histogram (to have a good idea of what constitutes a cold block counter value).

> that we should sort of optimize for the largest (in the number of blocks in the hot
> area) of the train runs. One way where things will get messed up is when the
> working set is about the same but the runs are of different size, so all the
> blocks get accounted into two different buckets.

I'm not sure I understand the last sentence - is this something that would not get handled by merging the histogram entries as I described earlier? Or do you have a different merging approach in mind?

> But in general I do not think there is a reasonably accurate way to merge the
> profiles without actually streaming out all counter IDs in every bucket, so perhaps
> this will work well enough. If not, we can in the future introduce a per-program global
> summary file that will contain the counters to be merged accurately.

Sounds good.

>> > would help is when the distribution of counter values in the histogram
>> > varies between runs, say for example, the hottest loop is much hotter in a
>> > subsequent run, but the rest of the counter values stay largely consistent.
>> > Note, however, that if the hotspots are different in different runs, then
>> > merging either the histogram or the working set will have issues. The only
>> > way to detect this is to recompute the histogram/working set from scratch
>> > from the merged counters.
>> >
>> > I wonder in practice, even when there are a lot of simultaneous runs going
>> > like in a gcc bootstrap, if we could get reasonably accurate summary
>> > recomputation without global locking. The reason is that as long as the
>> > actual counter updates are safe as they are now due to the individual file
>> > locking, the inaccuracy in the recomputed summary information will not grow
>> > over time, and there is a reasonable chance that the last run to complete
>> > and merge will recompute the summary from the final merged counter values
>> > and get it right (or only be off by a little bit if there are a couple of
>> > runs completing simultaneously at the end). But this can be investigated as
>> > a follow on to this patch.
>>
>> The main concern is probably the build reproducibility in gcc bootstrap
>> with FDO.
> Hmm, you mean in the first pass update every file with new counters and in the second
> pass just update the summaries?

Right, that's what I had in mind (what you have described in #2 below).

> OK, so I guess we went through
> 1) two pass updating with race in between passes.
> 2) two pass updating with first pass updating counters and second having race only for summary update.
>    (i.e. no races for counters)
> 3) two pass updating with flocking (and some way to handle detected deadlocks)
> 4) one pass updating with histogram merging + maximalization of working set.
>    (we do not really need to scale the buckets, we can simply merge the histograms
>    and then multiply them by nruns before comparing to actual counters.)

By merging the histograms (and accumulating the counter values stored there as we merge), I don't think we need to multiply the counter values by nruns, do we?

> This assumes that working sets of all runs are about the same, but should work
> reasonably in practice I think.
>
> I guess 3/4 are acceptable WRT bootstrap reproducibility.
>
> I have no experience with flocking a large number of files and the portability of
> this solution, i.e. to Windows. If you think that 2) would be too inaccurate
> in practice and 3) has a chance to be portable, we could go for this. It will
> solve the precision problems and will also work for LIPO summaries.
> I would be curious about the effect on profiledbootstrap time of this if you implement
> it.

I'm hoping that 2) will be accurate enough in practice, but it will need some investigation.

Thanks,
Teresa

> Honza

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> On Tue, Aug 21, 2012 at 6:56 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Do you have any intuition on why simple maximalization merging (that is safe wrt
> > ordering) would be a bad idea?
>
> When you say "maximalization merging" are you talking about the
> histogram merging approach I mentioned a few emails back (my response
> on Aug 19) where we assume the same relative order of hotness in the
> counters between runs, and accumulate the counter values in the
> histogram in that order?

I was speaking of BB (counter) counts only here and considering the simplest strategy: maximalize the BB counts in corresponding buckets and sum the cumulative values in the corresponding buckets.

The strategy of scaling is more sensible, but it will also be sensitive to the order of train runs, i.e. it will give unstable results with parallel make.

> > OK, so I guess we went through
> > 1) two pass updating with race in between passes.
> > 2) two pass updating with first pass updating counters and second having race only for summary update.
> >    (i.e. no races for counters)
> > 3) two pass updating with flocking (and some way to handle detected deadlocks)
> > 4) one pass updating with histogram merging + maximalization of working set.
> >    (we do not really need to scale the buckets, we can simply merge the histograms
> >    and then multiply them by nruns before comparing to actual counters.)
>
> By merging the histograms (and accumulating the counter values stored
> there as we merge), I don't think we need to multiply the counter
> values by nruns, do we?

With the simple approach we need to, but...

> > This assumes that working sets of all runs are about the same, but should work
> > reasonably in practice I think.
> >
> > I guess 3/4 are acceptable WRT bootstrap reproducibility.
> >
> > I have no experience with flocking a large number of files and the portability of
> > this solution, i.e. to Windows. If you think that 2) would be too inaccurate
> > in practice and 3) has a chance to be portable, we could go for this. It will
> > solve the precision problems and will also work for LIPO summaries.
> > I would be curious about the effect on profiledbootstrap time of this if you implement
> > it.
>
> I'm hoping that 2) will be accurate enough in practice, but it will
> need some investigation.

Perhaps weighing all the pros/cons you are right here. I think getting results consistent across different orders of runs is more important than the possibility of a race on the last update, and 4) is probably getting quite a lot of GIGO issues.

We can implement a more consistent locking mechanism incrementally if this turns out to be an issue.

It is possible to actually detect the race: in the first round we are updating nruns. In the second round we can see if nruns has changed. If so, we have a problem, since someone has already updated the file and will update the histograms.

I would suggest skipping the update when nruns has changed, so the summary at least matches the counters in the current file accurately (even though it will be off for the whole program, since whoever updated the file may have not seen all our updates, just part of them). Just to reduce the chances that the trashed run is the most important one. Theoretically we could also make libgcov re-read all counters in this case and try another update, but I would not worry about that for the moment.

We will still have problems with bootstrap: the summary of libbackend.a will depend on which of the compiler binaries has executed last, since cc1 will have a different summary from cc1plus because the frontend is different. I would give extra points to a volunteer who changes libgcov to do multiple summaries for different programs as intended originally (if you search the archives from 2001-2003, somewhere there are patches that changed libgcov from not merging at all to merging always. I guess changing it to merge just when the program checksum matches is not that hard).

Thanks,
Honza
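For concreteness, the nruns-based race detection sketched above might look roughly like the following fragment (editorial sketch only; the helper names are hypothetical and the real merge loop in gcov_exit is considerably more involved):

/* Sketch: remember the run count observed while merging the counters.  */
gcov_unsigned_t runs_seen = cs_prg->runs;

/* ... first round: merge and write back all counter values ... */

/* Second round: if another process merged in the meantime, nruns will
   have moved, so skip the summary update and leave the summary
   consistent with the counters now in the file.  */
if (gcov_reread_runs (da_file) != runs_seen)   /* hypothetical helper */
  return;
gcov_update_summary (da_file, &this_prg);      /* hypothetical helper */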
On Tue, Aug 21, 2012 at 11:18 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> I was speaking of BB (counter) counts only here and considering the simplest
> strategy: maximalize the BB counts in corresponding buckets and sum the cumulative
> values in the corresponding buckets.

I'm concerned about inaccuracies arising when multiple runs have different sizes and therefore counters are in different buckets in the corresponding histograms. We would end up in a situation where the merged histogram has more "counters" in it than there are actual counters in the program.

> The strategy of scaling is more sensible, but it will also be sensitive to the order
> of train runs, i.e. it will give unstable results with parallel make.

Regarding the approach of merging histograms by matching up counters in the same order and apportioning out the histogram cumulative values evenly between counters in the same bucket before merging, I don't think the ordering will have a huge effect. The reason is that the counters we end up matching together and accumulating will stay in the same relative order, and the min counter values are simply summed to compute the new bucket index and new min counter value to record. For the cumulative value, since the incoming cumulative values will be apportioned evenly between the counters coming from the same bucket when computing the new cumulative value (which is then just another add), there could be some slight differences from different round-off errors when histograms are merged in different orders. But I would expect that to result in only small differences in the cumulative values in the histogram. Is that acceptable?

>> By merging the histograms (and accumulating the counter values stored
>> there as we merge), I don't think we need to multiply the counter
>> values by nruns, do we?
>
> With the simple approach we need to, but...
>
>> I'm hoping that 2) will be accurate enough in practice, but it will
>> need some investigation.
>
> Perhaps weighing all the pros/cons you are right here. I think getting results
> consistent across different orders of runs is more important than the
> possibility of a race on the last update, and 4) is probably getting quite a lot
> of GIGO issues.
>
> We can implement a more consistent locking mechanism incrementally if this turns
> out to be an issue.
>
> It is possible to actually detect the race: in the first round we are updating
> nruns. In the second round we can see if nruns has changed. If so, we have
> a problem, since someone has already updated the file and will update the
> histograms.
>
> I would suggest skipping the update when nruns has changed, so the summary at least
> matches the counters in the current file accurately (even though it will be off for
> the whole program, since whoever updated the file may have not seen all our
> updates, just part of them). Just to reduce the chances that the trashed run is
> the most important one. Theoretically we could also make libgcov re-read all
> counters in this case and try another update, but I would not worry about that
> for the moment.

Ok, I think that makes sense.

If it is ok with you, I'd prefer to go with a 3 step approach:
1) Implement some form of histogram merging (one of the options we discussed above). (Your approach #4).
2) Implement approach #2 (restructure into 2 passes with counters being updated accurately and a possible race on the histogram computation).
3) Investigate and implement either approach #3 (some kind of global lock) or the race detection you describe above.

> We will still have problems with bootstrap: the summary of libbackend.a will
> depend on which of the compiler binaries has executed last, since cc1 will have
> a different summary from cc1plus because the frontend is different. I would give
> extra points to a volunteer who changes libgcov to do multiple summaries for
> different programs as intended originally (if you search the archives from 2001-2003,
> somewhere there are patches that changed libgcov from not merging at all to merging
> always. I guess changing it to merge just when the program checksum matches is not
> that hard).

In this case, I assume that currently the merge will abort because the # counters, etc don't match?

Thanks,
Teresa

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
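To make the even apportioning concrete, consider a toy example (invented numbers, editorial illustration only): suppose a target bucket holds tgt_num = 3 counters with tgt_cum = 30, and the matching source bucket holds src_num = 2 counters with src_cum = 50. The first merge step combines merge_num = min(3, 2) = 2 counters, apportioning merge_tgt_cum = 2 * 30 / 3 = 20 from the target and merge_src_cum = 50 (all of it) from the source, so the merged pair contributes 2 counters with a combined cum_value of 70, recorded at the bucket corresponding to the combined minimum counter value. The one remaining target counter, carrying the leftover tgt_cum of 10, is then matched against the next non-empty source bucket in the same way.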
> I'm concerned about inaccuracies arising when multiple runs have
> different sizes and therefore counters are in different buckets in the
> corresponding histograms. We would end up in a situation where the
> merged histogram has more "counters" in it than there are actual
> counters in the program.

Yes, it is very inaccurate.

> Regarding the approach of merging histograms by matching up counters
> in the same order and apportioning out the histogram cumulative values
> evenly between counters in the same bucket before merging, I don't
> think the ordering will have a huge effect. The reason is that
> the counters we end up matching together and accumulating will stay in the
> same relative order, and the min counter values are simply summed to
> compute the new bucket index and new min counter value to record. For
> the cumulative value, since the incoming cumulative values will be
> apportioned evenly between the counters coming from the same bucket
> when computing the new cumulative value (which is then just another
> add), there could be some slight differences from different round-off
> errors when histograms are merged in different orders. But I would
> expect that to result in only small differences in the cumulative
> values in the histogram. Is that acceptable?

Yes, I think it will be a small difference and it will give reasonable results in practice. I am only concerned about the reproducibility of the bootstraps that...

> Ok, I think that makes sense.
>
> If it is ok with you, I'd prefer to go with a 3 step approach:
> 1) Implement some form of histogram merging (one of the options we
> discussed above). (Your approach #4).
> 2) Implement approach #2 (restructure into 2 passes with counters
> being updated accurately and a possible race on the histogram
> computation).
> 3) Investigate and implement either approach #3 (some kind of global
> lock) or the race detection you describe above.

Yes, this sounds like a good plan.

> In this case, I assume that currently the merge will abort because the
> # counters, etc don't match?

We only abort when the number of counters in the current file does not match (i.e. the gcda file is from a different compilation). We accept when the overall program is different, i.e. libbackend.a is linked into multiple binaries. Since the histogram summing is global across all profiled libraries, we will end up with different histograms depending on which program using a given library was executed last. To get this stable I think it is only possible to go with #3 (i.e. program-wide global locking) and get different records for different programs (i.e. do a checksum of all objects libgcov sees and merge only with matching checksums).
In coverage.c we then can either choose the proper record when doing LTO, knowing what program we are just compiling, or do histogram merging in your way but after sorting the entries into some canonical order (i.e. increasing checksum). So we can track this incrementally. I guess if you go with step 1) as you describe, we can move ahead.

Another thing I have a patch for is to make LTO ipa-profile produce more accurate histograms based on real instruction counts (here we can read in all gcda files, turn them into CFG profiles, and do our own summary). I suppose that once you merge in your patches I can update this, and we will have two mechanisms, so we will also have data on how your implementation diverges from the precise solution.

Thanks!
Honza
Revision to earlier patch to augment the gcov program summary with working set information. We now emit the non-zero arc counter histogram entries into the summary, instead of computing the working set information and emitting that into the summary. The working set information is computed by the compiler from the histogram when it is read back in. A new method for merging histogram summaries based on earlier discussions with Honza is utilized when gcov summaries are merged. See earlier discussions and description at: http://gcc.gnu.org/ml/gcc-patches/2012-08/msg01062.html http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01412.html Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? Thanks, Teresa 2012-08-28 Teresa Johnson <tejohnson@google.com> * libgcc/libgcov.c (gcov_histogram_insert): New function. (gcov_compute_histogram): Ditto. (gcov_exit): Invoke compute_histogram, and perform merging of histograms during summary merging. * gcc/gcov-io.c (gcov_write_summary): Include length of non-zero histogram entries in GCOV_TAG_SUMMARY_LENGTH, and write out those entries. (gcov_read_summary): Read histogram entries. (gcov_histo_index): New function. (gcov_histogram_merge): Ditto. * gcc/gcov-io.h (gcov_type_unsigned): New type. (struct gcov_bucket_type): Ditto. (struct gcov_ctr_summary): Include histogram. (GCOV_TAG_SUMMARY_LENGTH): Update to include histogram entries. (GCOV_HISTOGRAM_SIZE): New macro. (GCOV_TYPE_SIZE): Make macro available for all build targets. * gcc/profile.c (NUM_GCOV_WORKING_SETS): New macro. (gcov_working_sets): New global variable. (compute_working_sets): New function. (find_working_set): Ditto. (get_exec_counts): Invoke compute_working_sets. * gcc/coverage.c (read_counts_file): Merge histograms, and fix bug with accessing summary info for non-summable counters. * gcc/basic-block.h (gcov_type_unsigned): New type. (struct gcov_working_set_info): Ditto. * gcc/gcov-dump.c (tag_summary): Dump out histogram. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 190736) +++ libgcc/libgcov.c (working copy) @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Insert counter VALUE into HISTOGRAM. */ + +static void +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) +{ + unsigned i; + + i = gcov_histo_index(value); + gcc_assert (i < GCOV_HISTOGRAM_SIZE); + + histogram[i].num_counters++; + histogram[i].cum_value += value; + if (value < histogram[i].min_value) + histogram[i].min_value = value; +} + +/* Computes a histogram of the arc counters to place in the summary SUM. */ + +static void +gcov_compute_histogram (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, ctr_info_ix, ix; + int h_ix; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. */ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + cs_ptr->histogram[h_ix].num_counters = 0; + cs_ptr->histogram[h_ix].min_value = cs_ptr->run_max; + cs_ptr->histogram[h_ix].cum_value = 0; + } + + /* Walk through all the per-object structures and record each of + the count values in histogram. 
*/ + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. */ + for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++) + { + if (gi_ptr->merge[ix]) + ctr_info_ix++; + } + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) + { + gfi_ptr = gi_ptr->functions[f_ix]; + + if (!gfi_ptr || gfi_ptr->key != gi_ptr) + continue; + + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; + for (ix = 0; ix < ci_ptr->num; ix++) + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); + } + } +} + /* Dump the coverage counts. We merge with existing counts when possible, to avoid growing the .da files ad infinitum. We use this program's checksum to make sure we only accumulate whole program @@ -347,6 +419,7 @@ gcov_exit (void) } } } + gcov_compute_histogram (&this_prg); { /* Check if the level of dirs to strip off specified. */ @@ -482,8 +555,6 @@ gcov_exit (void) f_ix--; length = gcov_read_unsigned (); - if (length != GCOV_TAG_SUMMARY_LENGTH) - goto read_mismatch; gcov_read_summary (&tmp); if ((error = gcov_is_error ())) goto read_error; @@ -598,11 +669,18 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + cs_prg->num = cs_tprg->num; + else if (cs_prg->num != cs_tprg->num) + goto read_mismatch; cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; cs_prg->sum_max += cs_tprg->run_max; + if (cs_prg->runs == 1) + memcpy (cs_prg->histogram, cs_tprg->histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (cs_prg->histogram, cs_tprg->histogram); } else if (cs_prg->runs) goto read_mismatch; Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 190736) +++ gcc/gcov-io.c (working copy) @@ -368,10 +368,18 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u GCOV_LINKAGE void gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, h_cnt = 0; const struct gcov_ctr_summary *csum; - gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH); + /* Count number of non-zero histogram entries. The histogram is only + currently computed for arc counters. 
*/ + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (csum->histogram[h_ix].num_counters > 0) + h_cnt++; + } + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); gcov_write_unsigned (summary->checksum); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); gcov_write_counter (csum->sum_max); + if (ix != GCOV_COUNTER_ARCS) + { + gcov_write_unsigned (0); + continue; + } + gcov_write_unsigned (h_cnt); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (!csum->histogram[h_ix].num_counters) + continue; + gcov_write_unsigned (h_ix); + gcov_write_unsigned (csum->histogram[h_ix].num_counters); + gcov_write_counter (csum->histogram[h_ix].min_value); + gcov_write_counter (csum->histogram[h_ix].cum_value); + } } } #endif /* IN_LIBGCOV */ @@ -488,7 +511,7 @@ gcov_read_string (void) GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, h_cnt; struct gcov_ctr_summary *csum; summary->checksum = gcov_read_unsigned (); @@ -499,6 +522,16 @@ gcov_read_summary (struct gcov_summary *summary) csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); csum->sum_max = gcov_read_counter (); + memset (csum->histogram, 0, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + h_cnt = gcov_read_unsigned (); + while (h_cnt--) + { + h_ix = gcov_read_unsigned (); + csum->histogram[h_ix].num_counters = gcov_read_unsigned (); + csum->histogram[h_ix].min_value = gcov_read_counter (); + csum->histogram[h_ix].cum_value = gcov_read_counter (); + } } } @@ -550,3 +583,142 @@ gcov_time (void) return status.st_mtime; } #endif /* IN_GCOV */ + +#if IN_LIBGCOV || !IN_GCOV +/* Determine the index into histogram for VALUE. */ + +static unsigned +gcov_histo_index(gcov_type value) +{ + gcov_type_unsigned v = (gcov_type_unsigned)value; + unsigned r = 0; + unsigned prev2bits = 0; + + /* Find index into log2 scale histogram, where each of the log2 + sized buckets is divided into 4 linear sub-buckets for better + focus in the higher buckets. */ + + /* Find the place of the most-significant bit set. */ + if (v > 0) + r = GCOV_TYPE_SIZE - 1 - __builtin_clzll (v); + + /* If at most the 2 least significant bits are set (value is + 0 - 3) then that value is our index into the lowest set of + four buckets. */ + if (r < 2) + return (unsigned)value; + + gcc_assert (r < 64); + + /* Find the two next most significant bits to determine which + of the four linear sub-buckets to select. */ + prev2bits = (v >> (r - 2)) & 0x3; + /* Finally, compose the final bucket index from the log2 index and + the next 2 bits. The minimum r value at this point is 2 since we + returned above if r was 2 or more, so the minimum bucket at this + point is 4. */ + return (r - 1) * 4 + prev2bits; +} + +/* Merge SRC_HISTO into TGT_HISTO. */ + +static void gcov_histogram_merge(gcov_bucket_type *tgt_histo, + gcov_bucket_type *src_histo) +{ + unsigned src_i, tgt_i, tmp_i; + unsigned src_num, tgt_num, merge_num; + gcov_type src_cum, tgt_cum, merge_src_cum, merge_tgt_cum, merge_cum; + gcov_type merge_min; + gcov_bucket_type tmp_histo[GCOV_HISTOGRAM_SIZE]; + + memset(tmp_histo, 0, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + + /* Assume that the counters are in the same relative order in both + histograms. 
Walk the histograms from smallest to largest entry, + matching up and combining counters in order. */ + src_num = 0; + src_cum = 0; + src_i = 0; + for (tgt_i = 0; tgt_i < GCOV_HISTOGRAM_SIZE; tgt_i++) + { + tgt_num = tgt_histo[tgt_i].num_counters; + tgt_cum = tgt_histo[tgt_i].cum_value; + /* Keep going until all of the target histogram's counters at this + position have been matched and merged with counters from the + source histogram. */ + while (tgt_num > 0) + { + /* If this is either the first time through this loop or we just + exhausted the previous non-zero source histogram entry, look + for the next non-zero source histogram entry. */ + if (!src_num) + { + /* Locate the first non-zero entry. */ + while (src_i < GCOV_HISTOGRAM_SIZE && + !src_histo[src_i].num_counters) + src_i++; + /* Source and target total number of counts should be + the same or caller should not have attempted merge. + Since we have target counters left to merge, there + ought to still be some left in the source. */ + gcc_assert (src_i < GCOV_HISTOGRAM_SIZE); + src_num = src_histo[src_i].num_counters; + src_cum = src_histo[src_i].cum_value; + } + + /* The number of counters to merge on this pass is the minimum + of the remaining counters from the current target and source + histogram entries. */ + merge_num = tgt_num; + if (src_num < merge_num) + merge_num = src_num; + + /* The merged min_value is the min of the min_values from target + and source. */ + merge_min = tgt_histo[tgt_i].min_value; + if (merge_min > src_histo[tgt_i].min_value) + merge_min = src_histo[src_i].min_value; + + /* Compute the portion of source and target entries' cum_value + that will be apportioned to the counters being merged. + The total remaining cum_value from each entry is divided + equally among the counters from that histogram entry if we + are not merging all of them. */ + merge_src_cum = src_cum; + if (merge_num < src_num) + merge_src_cum = merge_num * src_cum / src_num; + merge_tgt_cum = tgt_cum; + if (merge_num < tgt_num) + merge_tgt_cum = merge_num * tgt_cum / tgt_num; + /* The merged cum_value is the sum of the source and target + components. */ + merge_cum = merge_src_cum + merge_tgt_cum; + + /* Update the remaining number of counters and cum_value left + to be merged from this source and target entry. */ + src_cum -= merge_src_cum; + tgt_cum -= merge_tgt_cum; + src_num -= merge_num; + tgt_num -= merge_num; + + /* The merged counters get placed in the new merged histogram + at the entry for the merged min_value. */ + tmp_i = gcov_histo_index(merge_min); + gcc_assert (tmp_i < GCOV_HISTOGRAM_SIZE); + tmp_histo[tmp_i].num_counters += merge_num; + tmp_histo[tmp_i].cum_value += merge_cum; + if (!tmp_histo[tmp_i].min_value || + merge_min < tmp_histo[tmp_i].min_value) + tmp_histo[tmp_i].min_value = merge_min; + + /* Ensure the search for the next non-zero src_histo entry starts + at the next histogram bucket. */ + if (!src_num) + src_i++; + } + } + + /* Finally, copy the merged histogram into tgt_histo. 
*/ + memcpy(tgt_histo, tmp_histo, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); +} +#endif /* IN_LIBGCOV || !IN_GCOV */ Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 190736) +++ gcc/gcov-io.h (working copy) @@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect counts: header int64:count* summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE count-summary: int32:num int32:runs int64:sum - int64:max int64:sum_max + int64:max int64:sum_max histogram + histogram: int32:count histogram-buckets* + histogram-buckets: int32:index int32:num int64:min int64:sum The ANNOUNCE_FUNCTION record is the same as that in the note file, but without the source location. The COUNTS gives the @@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (SI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (DI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI))); #else typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #endif #else #if BITS_PER_UNIT == 16 @@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (HI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #else typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #endif #else typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI))); typedef unsigned gcov_position_t __attribute__ ((mode (QI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #else typedef signed gcov_type __attribute__ ((mode (QI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI))); #endif #endif #endif @@ -210,11 +218,10 @@ typedef unsigned gcov_position_t; #if IN_GCOV #define GCOV_LINKAGE static typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; #if IN_GCOV > 0 #include <sys/types.h> #endif -#else /*!IN_GCOV */ -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) #endif #if defined (HOST_HAS_F_SETLKW) @@ -309,9 +316,10 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2) #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) -#define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) +#define GCOV_TAG_SUMMARY_LENGTH(NUM) \ + (1 + GCOV_COUNTERS_SUMMABLE * (3 + 3 * 2) + (NUM) * 6) + /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ #define GCOV_COUNTERS_SUMMABLE 1 /* Counters which can be @@ -387,8 +395,28 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_ARC_FAKE (1 << 1) #define GCOV_ARC_FALLTHROUGH (1 << 2) +#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) + /* Structured records. */ +/* Structure used for each bucket of the log2 histogram of counter values. */ +typedef struct +{ + /* Number of counters whose profile count falls within the bucket. */ + gcov_unsigned_t num_counters; + /* Smallest profile count included in this bucket. */ + gcov_type min_value; + /* Cumulative value of the profile counts in this bucket. 
*/ + gcov_type cum_value; +} gcov_bucket_type; + +/* For a log2 scale histogram with each range split into 4 + linear sub-ranges, there will be at most GCOV_TYPE_SIZE - 1 log2 + ranges since the lowest 2 log2 values share the lowest 4 linear + sub-range (values 0 - 3). */ + +#define GCOV_HISTOGRAM_SIZE (GCOV_TYPE_SIZE - 1) * 4 + /* Cumulative counter data. */ struct gcov_ctr_summary { @@ -397,6 +425,8 @@ struct gcov_ctr_summary gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. */ gcov_type sum_max; /* sum of individual run max values. */ + gcov_bucket_type histogram[GCOV_HISTOGRAM_SIZE]; /* histogram of + counter values. */ }; /* Object & program summary record. */ Index: gcc/profile.c =================================================================== --- gcc/profile.c (revision 190736) +++ gcc/profile.c (working copy) @@ -84,6 +84,15 @@ struct bb_info { const struct gcov_ctr_summary *profile_info; +/* Number of data points in the working set summary array. Using 128 + provides information for at least every 1% increment of the total + profile size. The last entry is hardwired to 99.9% of the total. */ +#define NUM_GCOV_WORKING_SETS 128 + +/* Counter working set information computed from the current counter + summary. Not initialized unless profile_info summary is non-NULL. */ +static gcov_working_set_t gcov_working_sets[NUM_GCOV_WORKING_SETS]; + /* Collect statistics on the performance of this pass for the entire source file. */ @@ -192,6 +201,151 @@ instrument_values (histogram_values values) } +/* Compute the working set information from the counter histogram in + the profile summary. This is an array of information corresponding to a + range of percentages of the total execution count (sum_all), and includes + the number of counters required to cover that working set percentage and + the minimum counter value in that working set. */ + +static void +compute_working_sets (void) +{ + gcov_type working_set_cum_values[NUM_GCOV_WORKING_SETS]; + gcov_type ws_cum_hotness_incr; + gcov_type cum, tmp_cum; + const gcov_bucket_type *histo_bucket; + unsigned ws_ix, h_ix, c_num, count, pctinc, pct; + gcov_working_set_t *ws_info; + + if (!profile_info) + return; + + /* Compute the amount of sum_all that the cumulative hotness grows + by in each successive working set entry, which depends on the + number of working set entries. */ + ws_cum_hotness_incr = profile_info->sum_all / NUM_GCOV_WORKING_SETS; + + /* Next fill in an array of the cumulative hotness values corresponding + to each working set summary entry we are going to compute below. + Skip 0% statistics, which can be extrapolated from the + rest of the summary data. */ + cum = ws_cum_hotness_incr; + for (ws_ix = 0; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, cum += ws_cum_hotness_incr) + working_set_cum_values[ws_ix] = cum; + /* The last summary entry is reserved for (roughly) 99.9% of the + working set. Divide by 1024 so it becomes a shift, which gives + almost exactly 99.9%. */ + working_set_cum_values[NUM_GCOV_WORKING_SETS-1] + = profile_info->sum_all - profile_info->sum_all/1024; + + /* Next, walk through the histogram in decending order of hotness + and compute the statistics for the working set summary array. + As histogram entries are accumulated, we check to see which + working set entries have had their expected cum_value reached + and fill them in, walking the working set entries in increasing + size of cum_value. 
*/ + ws_ix = 0; /* The current entry into the working set array. */ + cum = 0; /* The current accumulated counter sum. */ + count = 0; /* The current accumulated count of block counters. */ + for (h_ix = GCOV_HISTOGRAM_SIZE - 1; + h_ix > 0 && ws_ix < NUM_GCOV_WORKING_SETS; h_ix--) + { + histo_bucket = &profile_info->histogram[h_ix]; + + /* If we haven't reached the required cumulative counter value for + the current working set percentage, simply accumulate this histogram + entry into the running sums and continue to the next histogram + entry. */ + if (cum + histo_bucket->cum_value < working_set_cum_values[ws_ix]) + { + cum += histo_bucket->cum_value; + count += histo_bucket->num_counters; + continue; + } + + /* If adding the current histogram entry's cumulative counter value + causes us to exceed the current working set size, then estimate + how many of this histogram entry's counter values are required to + reach the working set size, and fill in working set entries + as we reach their expected cumulative value. */ + for (c_num = 0, tmp_cum = cum; + c_num < histo_bucket->num_counters && ws_ix < NUM_GCOV_WORKING_SETS; + c_num++) + { + count++; + /* If we haven't reached the last histogram entry counter, add + in the minimum value again. This will underestimate the + cumulative sum so far, because many of the counter values in this + entry may have been larger than the minimum. We could add in the + average value every time, but that would require an expensive + divide operation. */ + if (c_num + 1 < histo_bucket->num_counters) + tmp_cum += histo_bucket->min_value; + /* If we have reached the last histogram entry counter, then add + in the entire cumulative value. */ + else + tmp_cum = cum + histo_bucket->cum_value; + + /* Next walk through successive working set entries and fill in + the statistics for any whose size we have reached by accumulating + this histogram counter. */ + while (tmp_cum >= working_set_cum_values[ws_ix] + && ws_ix < NUM_GCOV_WORKING_SETS) + { + gcov_working_sets[ws_ix].num_counters = count; + gcov_working_sets[ws_ix].min_counter + = histo_bucket->min_value; + ws_ix++; + } + } + /* Finally, update the running cumulative value since we were + using a temporary above. */ + cum += histo_bucket->cum_value; + } + gcc_assert (ws_ix == NUM_GCOV_WORKING_SETS); + + if (dump_file) + { + fprintf (dump_file, "Counter working sets:\n"); + /* Multiply the percentage by 100 to avoid float. */ + pctinc = 100 * 100 / NUM_GCOV_WORKING_SETS; + for (ws_ix = 0, pct = pctinc; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, pct += pctinc) + { + if (ws_ix == NUM_GCOV_WORKING_SETS - 1) + pct = 9990; + ws_info = &gcov_working_sets[ws_ix]; + /* Print out the percentage using int arithmatic to avoid float. */ + fprintf (dump_file, "\t\t%u.%02u%%: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC "\n", + pct / 100, pct - (pct / 100 * 100), + ws_info->num_counters, + (HOST_WIDEST_INT)ws_info->min_counter); + } + } +} + +/* Given a the desired percentage of the full profile (sum_all from the + summary), multiplied by 10 to avoid float in PCT_TIMES_10, returns + the corresponding working set information. If an exact match for + the percentage isn't found, the closest value is used. 
*/ + +gcov_working_set_t * +find_working_set (unsigned pct_times_10) +{ + unsigned i; + if (!profile_info) + return NULL; + gcc_assert (pct_times_10 <= 1000); + if (pct_times_10 >= 999) + return &gcov_working_sets[NUM_GCOV_WORKING_SETS - 1]; + i = pct_times_10 * NUM_GCOV_WORKING_SETS / 1000; + if (!i) + return &gcov_working_sets[0]; + return &gcov_working_sets[i - 1]; +} + /* Computes hybrid profile for all matching entries in da_file. CFG_CHECKSUM is the precomputed checksum for the CFG. */ @@ -219,6 +373,8 @@ get_exec_counts (unsigned cfg_checksum, unsigned l if (!counts) return NULL; + compute_working_sets(); + if (dump_file && profile_info) fprintf(dump_file, "Merged %u profiles with maximal count %u.\n", profile_info->runs, (unsigned) profile_info->sum_max); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 190736) +++ gcc/coverage.c (working copy) @@ -248,6 +248,13 @@ read_counts_file (void) summary.ctrs[ix].run_max = sum.ctrs[ix].run_max; summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max; } + if (new_summary) + memcpy (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram); new_summary = 0; } else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident) @@ -268,8 +275,9 @@ read_counts_file (void) entry->ctr = elt.ctr; entry->lineno_checksum = lineno_checksum; entry->cfg_checksum = cfg_checksum; - entry->summary = summary.ctrs[elt.ctr]; - entry->summary.num = n_counts; + if (elt.ctr < GCOV_COUNTERS_SUMMABLE) + entry->summary = summary.ctrs[elt.ctr]; + entry->summary.num = n_counts; entry->counts = XCNEWVEC (gcov_type, n_counts); } else if (entry->lineno_checksum != lineno_checksum Index: gcc/basic-block.h =================================================================== --- gcc/basic-block.h (revision 190736) +++ gcc/basic-block.h (working copy) @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3. If not see flow graph is manipulated by various optimizations. A signed type makes those easy to detect. */ typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; /* Control flow edge information. */ struct GTY((user)) edge_def { @@ -91,6 +92,16 @@ enum cfg_edge_flags { profile.c. */ extern const struct gcov_ctr_summary *profile_info; +/* Working set size statistics for a given percentage of the entire + profile (sum_all from the counter summary). */ +typedef struct gcov_working_set_info +{ + /* Number of hot counters included in this working set. */ + unsigned num_counters; + /* Smallest counter included in this working set. */ + gcov_type min_counter; +} gcov_working_set_t; + /* Declared in cfgloop.h. */ struct loop; @@ -897,4 +908,7 @@ extern void rtl_profile_for_bb (basic_block); extern void rtl_profile_for_edge (edge); extern void default_rtl_profile (void); +/* In profile.c. 
*/ +extern gcov_working_set_t *find_working_set(unsigned pct_times_10); + #endif /* GCC_BASIC_BLOCK_H */ Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 190736) +++ gcc/gcov-dump.c (working copy) @@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED) { struct gcov_summary summary; - unsigned ix; + unsigned ix, h_ix; + gcov_bucket_type *histo_bucket; gcov_read_summary (&summary); printf (" checksum=0x%08x", summary.checksum); @@ -465,5 +466,24 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED (HOST_WIDEST_INT)summary.ctrs[ix].run_max); printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_max); + if (ix != GCOV_COUNTER_ARCS) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\tcounter histogram:"); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + histo_bucket = &summary.ctrs[ix].histogram[h_ix]; + if (!histo_bucket->num_counters) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\t%d: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC ", cum_counter=" + HOST_WIDEST_INT_PRINT_DEC, + h_ix, histo_bucket->num_counters, + (HOST_WIDEST_INT)histo_bucket->min_value, + (HOST_WIDEST_INT)histo_bucket->cum_value); + } } } -- This patch is available for review at http://codereview.appspot.com/6465057
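As a quick editorial illustration of the bucket layout produced by gcov_histo_index in the patch above (assuming a 64-bit gcov_type; the mappings below follow from (r - 1) * 4 + prev2bits):

/* Sample counter-value-to-bucket mappings (illustration only):
     0..3   -> buckets 0..3   (the two lowest log2 ranges, purely linear)
     4..7   -> buckets 4..7   (r = 2, one value per linear sub-bucket)
     8..9   -> bucket 8       (r = 3, prev2bits = 0)
    10..11  -> bucket 9
    12..13  -> bucket 10
    14..15  -> bucket 11
    16      -> bucket 12      (the next log2 range begins)
   Each log2 range [2^r, 2^(r+1)) is thus split into four equal linear
   sub-buckets by the two bits just below the leading one bit.  */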
> Index: libgcc/libgcov.c > =================================================================== > --- libgcc/libgcov.c (revision 190736) > +++ libgcc/libgcov.c (working copy) > @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > return 1; > } > > +/* Insert counter VALUE into HISTOGRAM. */ > + > +static void > +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) > +{ > + unsigned i; > + > + i = gcov_histo_index(value); > + gcc_assert (i < GCOV_HISTOGRAM_SIZE); Does checking_assert work in libgcov? I do not think an internal consistency check should go into --enable-checking=release libgcov. We want to maintain it as lightweight as possible. (I see there are two existing gcc_asserts; since they report file format corruption, I think they should give a better diagnostic.) The inliner will do a good job here, but perhaps an explicit inline fits. > + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) > + { > + gfi_ptr = gi_ptr->functions[f_ix]; > + > + if (!gfi_ptr || gfi_ptr->key != gi_ptr) > + continue; > + > + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; > + for (ix = 0; ix < ci_ptr->num; ix++) > + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); Space before (. > + } > + } > +} > + > /* Dump the coverage counts. We merge with existing counts when > possible, to avoid growing the .da files ad infinitum. We use this > program's checksum to make sure we only accumulate whole program > @@ -347,6 +419,7 @@ gcov_exit (void) > } > } > } > + gcov_compute_histogram (&this_prg); > @@ -598,11 +669,18 @@ gcov_exit (void) > if (gi_ptr->merge[t_ix]) > { > if (!cs_prg->runs++) > - cs_prg->num = cs_tprg->num; > + cs_prg->num = cs_tprg->num; > + else if (cs_prg->num != cs_tprg->num) > + goto read_mismatch; Doesn't this check that all the programs that contain this unit are the same? I.e. will this survive profiledbootstrap, where we interleave cc1 and cc1plus? > + /* Count number of non-zero histogram entries. The histogram is only > + currently computed for arc counters. */ > + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; > + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) > + { > + if (csum->histogram[h_ix].num_counters > 0) > + h_cnt++; > + } > + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); > gcov_write_unsigned (summary->checksum); > for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) > { > @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str > gcov_write_counter (csum->sum_all); > gcov_write_counter (csum->run_max); > gcov_write_counter (csum->sum_max); > + if (ix != GCOV_COUNTER_ARCS) > + { > + gcov_write_unsigned (0); > + continue; > + } > + gcov_write_unsigned (h_cnt); > + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) > + { > + if (!csum->histogram[h_ix].num_counters) > + continue; > + gcov_write_unsigned (h_ix); It is kind of a waste to write a whole unsigned for each histogram index. What about writing a bitmap of non-zero entries followed by each entry? > +/* Merge SRC_HISTO into TGT_HISTO. */ Perhaps a comment about the overall concept of the merging routine would suit here. > -#else /*!IN_GCOV */ > -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) Why do you need to move this out of !libgcov? I do not think this is correct for all configurations, i.e. gcov_type may be 16-bit. Patch is OK if it passed profiledbootstrap, modulo the comments above. Thanks! Honza
On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> Index: libgcc/libgcov.c >> =================================================================== >> --- libgcc/libgcov.c (revision 190736) >> +++ libgcc/libgcov.c (working copy) >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> return 1; >> } >> >> +/* Insert counter VALUE into HISTOGRAM. */ >> + >> +static void >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) >> +{ >> + unsigned i; >> + >> + i = gcov_histo_index(value); >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); > Does checking_assert work in libgcov? I do not think an internal consistency check > should go into --enable-checking=release libgcov. We want to maintain it as > lightweight as possible. (I see there are two existing gcc_asserts; since they > report file format corruption, I think they should give a better diagnostic.) gcc_checking_assert isn't available, since tsystem.h, not system.h, is included. I could probably just remove the assert (to be safe, silently return if i is out of bounds?). > > The inliner will do a good job here, but perhaps an explicit inline fits. >> + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) >> + { >> + gfi_ptr = gi_ptr->functions[f_ix]; >> + >> + if (!gfi_ptr || gfi_ptr->key != gi_ptr) >> + continue; >> + >> + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; >> + for (ix = 0; ix < ci_ptr->num; ix++) >> + gcov_histogram_insert(cs_ptr->histogram, ci_ptr->values[ix]); > Space before (. Ok. >> + } >> + } >> +} >> + >> /* Dump the coverage counts. We merge with existing counts when >> possible, to avoid growing the .da files ad infinitum. We use this >> program's checksum to make sure we only accumulate whole program >> @@ -347,6 +419,7 @@ gcov_exit (void) >> } >> } >> } >> + gcov_compute_histogram (&this_prg); >> @@ -598,11 +669,18 @@ gcov_exit (void) >> if (gi_ptr->merge[t_ix]) >> { >> if (!cs_prg->runs++) >> - cs_prg->num = cs_tprg->num; >> + cs_prg->num = cs_tprg->num; >> + else if (cs_prg->num != cs_tprg->num) >> + goto read_mismatch; > > Doesn't this check that all the programs that contain this unit are the same? > I.e. will this survive profiledbootstrap, where we interleave cc1 and cc1plus? Ok, I am removing that check and switching the histogram merging code to handle the case where there are different numbers of counters. It will end up with the same number of counters as in the summary we are merging into, since that is the num we keep above when runs > 0 to start with. >> + /* Count number of non-zero histogram entries. The histogram is only >> + currently computed for arc counters. */ >> + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; >> + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) >> + { >> + if (csum->histogram[h_ix].num_counters > 0) >> + h_cnt++; >> + } >> + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); >> gcov_write_unsigned (summary->checksum); >> for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) >> { >> @@ -380,6 +388,21 @@ gcov_write_summary (gcov_unsigned_t tag, const str >> gcov_write_counter (csum->sum_all); >> gcov_write_counter (csum->run_max); >> gcov_write_counter (csum->sum_max); >> + if (ix != GCOV_COUNTER_ARCS) >> + { >> + gcov_write_unsigned (0); >> + continue; >> + } >> + gcov_write_unsigned (h_cnt); >> + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) >> + { >> + if (!csum->histogram[h_ix].num_counters) >> + continue; >> + gcov_write_unsigned (h_ix); > > It is kind of a waste to write a whole unsigned for each histogram index. > What about writing a bitmap of non-zero entries followed by each entry? Sure, I will do that instead. >> +/* Merge SRC_HISTO into TGT_HISTO. */ > > Perhaps a comment about the overall concept of the merging routine would suit here. Ok. >> -#else /*!IN_GCOV */ >> -#define GCOV_TYPE_SIZE (LONG_LONG_TYPE_SIZE > 32 ? 64 : 32) > > Why do you need to move this out of !libgcov? I do not think this is correct for all configurations, > i.e. gcov_type may be 16-bit. From my understanding of the mode attribute meanings, which I thought are defined in terms of the number of smallest addressable units, the code in gcov-io.h that sets up the gcov_type typedef will always end up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is 8 it will use either SI or DI which will end up either 32 or 64, and when BITS_PER_UNIT is 16 it would use either HI or SI which would again be either 32 or 64. Is that wrong, and can we end up with a 16-bit gcov_type? The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so it was being defined IN_LIBGCOV), but I wanted it defined unconditionally because I am using it to determine the required number of histogram entries. In any case, I think it will be more straightforward to define the number of histogram entries based on the max possible gcov_type size, which is 64 (so 256 entries). This will make implementing the bit mask simpler, since it will always require the same number of gcov_unsigned ints (8). > > Patch is OK if it passed profiledbootstrap, modulo the comments above. Ok, thanks. Working on the fixes above. Teresa > Thanks! > Honza -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
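For concreteness, here is a minimal sketch of the occupancy bit vector scheme being agreed on in this exchange, sized for the 252-entry histogram the committed patch below ends up using; the helper name build_bitvector and the demo in main are illustrative, not the patch's code.

/* Sketch of the occupancy bit vector encoding: one bit per histogram
   entry, set when that entry has a non-zero count, so only non-zero
   entries need to be written after the vector.  Sizes follow the
   committed patch (252 entries -> 8 32-bit words); the helper names
   and the demo in main are illustrative.  */
#include <stdio.h>

#define HISTOGRAM_SIZE 252
#define BITVECTOR_WORDS ((HISTOGRAM_SIZE + 31) / 32)

/* Fill BITVECTOR from the per-entry counts and return the number of
   non-zero entries, i.e. how many histogram records would follow the
   bit vector in the gcov file.  */
static unsigned
build_bitvector (const unsigned num_counters[HISTOGRAM_SIZE],
                 unsigned bitvector[BITVECTOR_WORDS])
{
  unsigned h_ix, w, nonzero = 0;
  for (w = 0; w < BITVECTOR_WORDS; w++)
    bitvector[w] = 0;
  for (h_ix = 0; h_ix < HISTOGRAM_SIZE; h_ix++)
    if (num_counters[h_ix])
      {
        bitvector[h_ix / 32] |= 1u << (h_ix % 32);
        nonzero++;
      }
  return nonzero;
}

int
main (void)
{
  unsigned counts[HISTOGRAM_SIZE] = { 0 };
  unsigned bitvector[BITVECTOR_WORDS];
  unsigned w, b;

  counts[3] = 7;      /* Two sample non-zero histogram entries.  */
  counts[100] = 42;
  printf ("%u non-zero entries\n", build_bitvector (counts, bitvector));

  /* The reader recovers the entry indices by walking the set bits,
     mirroring the loop in gcov_read_summary.  */
  for (w = 0; w < BITVECTOR_WORDS; w++)
    for (b = 0; b < 32; b++)
      if (bitvector[w] & (1u << b))
        printf ("entry %u is present\n", w * 32 + b);
  return 0;
}

Writing the vector costs a fixed 8 unsigned words per summary, after which only the non-zero buckets are emitted, which is the space saving asked for above.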
> On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: > >> Index: libgcc/libgcov.c > >> =================================================================== > >> --- libgcc/libgcov.c (revision 190736) > >> +++ libgcc/libgcov.c (working copy) > >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned > >> return 1; > >> } > >> > >> +/* Insert counter VALUE into HISTOGRAM. */ > >> + > >> +static void > >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) > >> +{ > >> + unsigned i; > >> + > >> + i = gcov_histo_index(value); > >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); > > Does checking_assert work in libgcov? I do not think an internal consistency check > > should go into --enable-checking=release libgcov. We want to maintain it as > > lightweight as possible. (I see there are two existing gcc_asserts; since they > > report file format corruption, I think they should give a better diagnostic.) > > gcc_checking_assert isn't available, since tsystem.h, not system.h, is > included. I could probably just remove the assert (to be safe, > silently return if i is out of bounds?). I think just removing the assert is fine here: the only way it can trigger is if GCOV_HISTOGRAM_SIZE is wrong, and it ought not to be. > > From my understanding of the mode attribute meanings, which I thought > are defined in terms of the number of smallest addressable units, the > code in gcov-io.h that sets up the gcov_type typedef will always end > up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is > 8 it will use either SI or DI which will end up either 32 or 64, and > when BITS_PER_UNIT is 16 it would use either HI or SI which would > again be either 32 or 64. Is that wrong, and can we end up with a 16-bit > gcov_type? I see, the code was simplified a bit since we dropped support for some of the more exotic targets. The type should be either 32-bit or 64-bit. > > The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so > it was being defined IN_LIBGCOV), but I wanted it defined > unconditionally because I am using it to determine the required number > of histogram entries. > > In any case, I think it will be more straightforward to define the > number of histogram entries based on the max possible gcov_type size, > which is 64 (so 256 entries). This will make implementing the bit mask > simpler, since it will always require the same number of gcov_unsigned > ints (8). Sounds good to me. 64-bit should be enough for everyone. Coverage is kind of useless for smaller types, and for really restricted targets we will more likely want to disable histogram generation rather than making it a bit smaller. > > > > > > Patch is OK if it passed profiledbootstrap, modulo the comments above. > > Ok, thanks. Working on the fixes above. OK, thanks! Honza > > Teresa > > > Thanks! > > Honza > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
I just committed the patch (included below). I implemented the occupancy bit vector approach for recording non-zero histogram entries, and fixed a few issues uncovered with the merging in a profiled bootstrap. Passes both bootstrap and profiledbootstrap builds and regression tests. Thanks, Teresa Enhances the gcov program summary by adding a histogram of arc counter entries. This is used to compute working set information in the compiler for use by optimizations that need information on hot vs cold counter values or the rough working set size in terms of the number of counters. Each working set data point is the minimum counter value and number of counters required to reach a given percentage of the cumulative counter sum across the profiled execution (sum_all in the program summary). 2012-09-04 Teresa Johnson <tejohnson@google.com> * libgcc/libgcov.c (struct gcov_summary_buffer): New structure. (gcov_histogram_insert): New function. (gcov_compute_histogram): Ditto. (gcov_exit): Invoke gcov_compute_histogram, and perform merging of histograms during summary merging. * gcc/gcov-io.c (gcov_write_summary): Write out non-zero histogram entries to function summary along with an occupancy bit vector. (gcov_read_summary): Read in the histogram entries. (gcov_histo_index): New function. (gcov_histogram_merge): Ditto. * gcc/gcov-io.h (gcov_type_unsigned): New type. (struct gcov_bucket_type): Ditto. (struct gcov_ctr_summary): Include histogram. (GCOV_TAG_SUMMARY_LENGTH): Update to include histogram entries. (GCOV_HISTOGRAM_SIZE): New macro. (GCOV_HISTOGRAM_BITVECTOR_SIZE): Ditto. * gcc/profile.c (NUM_GCOV_WORKING_SETS): Ditto. (gcov_working_sets): New global variable. (compute_working_sets): New function. (find_working_set): Ditto. (get_exec_counts): Invoke compute_working_sets. * gcc/coverage.c (read_counts_file): Merge histograms, and fix bug with accessing summary info for non-summable counters. * gcc/basic-block.h (gcov_type_unsigned): New type. (struct gcov_working_set_info): Ditto. (find_working_set): Declare. * gcc/gcov-dump.c (tag_summary): Dump out histogram. Index: libgcc/libgcov.c =================================================================== --- libgcc/libgcov.c (revision 190950) +++ libgcc/libgcov.c (working copy) @@ -97,6 +97,12 @@ struct gcov_fn_buffer /* note gcov_fn_info ends in a trailing array. */ }; +struct gcov_summary_buffer +{ + struct gcov_summary_buffer *next; + struct gcov_summary summary; +}; + /* Chain of per-object gcov structures. */ static struct gcov_info *gcov_list; @@ -276,6 +282,76 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned return 1; } +/* Insert counter VALUE into HISTOGRAM. */ + +static void +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) +{ + unsigned i; + + i = gcov_histo_index(value); + histogram[i].num_counters++; + histogram[i].cum_value += value; + if (value < histogram[i].min_value) + histogram[i].min_value = value; +} + +/* Computes a histogram of the arc counters to place in the summary SUM. */ + +static void +gcov_compute_histogram (struct gcov_summary *sum) +{ + struct gcov_info *gi_ptr; + const struct gcov_fn_info *gfi_ptr; + const struct gcov_ctr_info *ci_ptr; + struct gcov_ctr_summary *cs_ptr; + unsigned t_ix, f_ix, ctr_info_ix, ix; + int h_ix; + + /* This currently only applies to arc counters. */ + t_ix = GCOV_COUNTER_ARCS; + + /* First check if there are any counts recorded for this counter. 
*/ + cs_ptr = &(sum->ctrs[t_ix]); + if (!cs_ptr->num) + return; + + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + cs_ptr->histogram[h_ix].num_counters = 0; + cs_ptr->histogram[h_ix].min_value = cs_ptr->run_max; + cs_ptr->histogram[h_ix].cum_value = 0; + } + + /* Walk through all the per-object structures and record each of + the count values in histogram. */ + for (gi_ptr = gcov_list; gi_ptr; gi_ptr = gi_ptr->next) + { + if (!gi_ptr->merge[t_ix]) + continue; + + /* Find the appropriate index into the gcov_ctr_info array + for the counter we are currently working on based on the + existence of the merge function pointer for this object. */ + for (ix = 0, ctr_info_ix = 0; ix < t_ix; ix++) + { + if (gi_ptr->merge[ix]) + ctr_info_ix++; + } + for (f_ix = 0; f_ix != gi_ptr->n_functions; f_ix++) + { + gfi_ptr = gi_ptr->functions[f_ix]; + + if (!gfi_ptr || gfi_ptr->key != gi_ptr) + continue; + + ci_ptr = &gfi_ptr->ctrs[ctr_info_ix]; + for (ix = 0; ix < ci_ptr->num; ix++) + gcov_histogram_insert (cs_ptr->histogram, ci_ptr->values[ix]); + } + } +} + /* Dump the coverage counts. We merge with existing counts when possible, to avoid growing the .da files ad infinitum. We use this program's checksum to make sure we only accumulate whole program @@ -347,6 +423,7 @@ gcov_exit (void) } } } + gcov_compute_histogram (&this_prg); { /* Check if the level of dirs to strip off specified. */ @@ -400,6 +477,8 @@ gcov_exit (void) const char *fname, *s; struct gcov_fn_buffer *fn_buffer = 0; struct gcov_fn_buffer **fn_tail = &fn_buffer; + struct gcov_summary_buffer *next_sum_buffer, *sum_buffer = 0; + struct gcov_summary_buffer **sum_tail = &sum_buffer; fname = gi_ptr->filename; @@ -482,17 +561,29 @@ gcov_exit (void) f_ix--; length = gcov_read_unsigned (); - if (length != GCOV_TAG_SUMMARY_LENGTH) - goto read_mismatch; gcov_read_summary (&tmp); if ((error = gcov_is_error ())) goto read_error; - if (summary_pos || tmp.checksum != crc32) - goto next_summary; + if (summary_pos) + { + /* Save all summaries after the one that will be + merged into below. These will need to be rewritten + as histogram merging may change the number of non-zero + histogram entries that will be emitted, and thus the + size of the merged summary. */ + (*sum_tail) = (struct gcov_summary_buffer *) + malloc (sizeof(struct gcov_summary_buffer)); + (*sum_tail)->summary = tmp; + (*sum_tail)->next = 0; + sum_tail = &((*sum_tail)->next); + goto next_summary; + } + if (tmp.checksum != crc32) + goto next_summary; for (t_ix = 0; t_ix != GCOV_COUNTERS_SUMMABLE; t_ix++) if (tmp.ctrs[t_ix].num != this_prg.ctrs[t_ix].num) - goto next_summary; + goto next_summary; prg = tmp; summary_pos = eof_pos; @@ -598,11 +689,16 @@ gcov_exit (void) if (gi_ptr->merge[t_ix]) { if (!cs_prg->runs++) - cs_prg->num = cs_tprg->num; + cs_prg->num = cs_tprg->num; cs_prg->sum_all += cs_tprg->sum_all; if (cs_prg->run_max < cs_tprg->run_max) cs_prg->run_max = cs_tprg->run_max; cs_prg->sum_max += cs_tprg->run_max; + if (cs_prg->runs == 1) + memcpy (cs_prg->histogram, cs_tprg->histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (cs_prg->histogram, cs_tprg->histogram); } else if (cs_prg->runs) goto read_mismatch; @@ -635,8 +731,18 @@ gcov_exit (void) /* Generate whole program statistics. */ gcov_write_summary (GCOV_TAG_PROGRAM_SUMMARY, &prg); - if (summary_pos < eof_pos) - gcov_seek (eof_pos); + /* Rewrite all the summaries that were after the summary we merged + into. 
This is necessary as the merged summary may have a different + size due to the number of non-zero histogram entries changing after + merging. */ + + while (sum_buffer) + { + gcov_write_summary (GCOV_TAG_PROGRAM_SUMMARY, &sum_buffer->summary); + next_sum_buffer = sum_buffer->next; + free (sum_buffer); + sum_buffer = next_sum_buffer; + } /* Write execution counts for each function. */ for (f_ix = 0; (unsigned)f_ix != gi_ptr->n_functions; f_ix++) Index: gcc/gcov-io.c =================================================================== --- gcc/gcov-io.c (revision 190950) +++ gcc/gcov-io.c (working copy) @@ -368,10 +368,25 @@ gcov_write_tag_length (gcov_unsigned_t tag, gcov_u GCOV_LINKAGE void gcov_write_summary (gcov_unsigned_t tag, const struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, bv_ix, h_cnt = 0; const struct gcov_ctr_summary *csum; + unsigned histo_bitvector[GCOV_HISTOGRAM_BITVECTOR_SIZE]; - gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH); + /* Count number of non-zero histogram entries, and fill in a bit vector + of non-zero indices. The histogram is only currently computed for arc + counters. */ + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + histo_bitvector[bv_ix] = 0; + csum = &summary->ctrs[GCOV_COUNTER_ARCS]; + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (csum->histogram[h_ix].num_counters > 0) + { + histo_bitvector[h_ix / 32] |= 1 << (h_ix % 32); + h_cnt++; + } + } + gcov_write_tag_length (tag, GCOV_TAG_SUMMARY_LENGTH(h_cnt)); gcov_write_unsigned (summary->checksum); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) { @@ -380,6 +395,22 @@ gcov_write_summary (gcov_unsigned_t tag, const str gcov_write_counter (csum->sum_all); gcov_write_counter (csum->run_max); gcov_write_counter (csum->sum_max); + if (ix != GCOV_COUNTER_ARCS) + { + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + gcov_write_unsigned (0); + continue; + } + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + gcov_write_unsigned (histo_bitvector[bv_ix]); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + if (!csum->histogram[h_ix].num_counters) + continue; + gcov_write_unsigned (csum->histogram[h_ix].num_counters); + gcov_write_counter (csum->histogram[h_ix].min_value); + gcov_write_counter (csum->histogram[h_ix].cum_value); + } } } #endif /* IN_LIBGCOV */ @@ -488,8 +519,10 @@ gcov_read_string (void) GCOV_LINKAGE void gcov_read_summary (struct gcov_summary *summary) { - unsigned ix; + unsigned ix, h_ix, bv_ix, h_cnt = 0; struct gcov_ctr_summary *csum; + unsigned histo_bitvector[GCOV_HISTOGRAM_BITVECTOR_SIZE]; + unsigned cur_bitvector; summary->checksum = gcov_read_unsigned (); for (csum = summary->ctrs, ix = GCOV_COUNTERS_SUMMABLE; ix--; csum++) @@ -499,6 +532,43 @@ gcov_read_summary (struct gcov_summary *summary) csum->sum_all = gcov_read_counter (); csum->run_max = gcov_read_counter (); csum->sum_max = gcov_read_counter (); + memset (csum->histogram, 0, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + for (bv_ix = 0; bv_ix < GCOV_HISTOGRAM_BITVECTOR_SIZE; bv_ix++) + { + histo_bitvector[bv_ix] = gcov_read_unsigned (); + h_cnt += __builtin_popcountll (histo_bitvector[bv_ix]); + } + bv_ix = 0; + h_ix = 0; + cur_bitvector = 0; + while (h_cnt--) + { + /* Find the index corresponding to the next entry we will read in. + First find the next non-zero bitvector and re-initialize + the histogram index accordingly, then right shift and increment + the index until we find a set bit. 
*/ + while (!cur_bitvector) + { + h_ix = bv_ix * 32; + cur_bitvector = histo_bitvector[bv_ix++]; + gcc_assert(bv_ix <= GCOV_HISTOGRAM_BITVECTOR_SIZE); + } + while (!(cur_bitvector & 0x1)) + { + h_ix++; + cur_bitvector >>= 1; + } + gcc_assert(h_ix < GCOV_HISTOGRAM_SIZE); + + csum->histogram[h_ix].num_counters = gcov_read_unsigned (); + csum->histogram[h_ix].min_value = gcov_read_counter (); + csum->histogram[h_ix].cum_value = gcov_read_counter (); + /* Shift off the index we are done with and increment to the + corresponding next histogram entry. */ + cur_bitvector >>= 1; + h_ix++; + } } } @@ -550,3 +620,184 @@ gcov_time (void) return status.st_mtime; } #endif /* IN_GCOV */ + +#if IN_LIBGCOV || !IN_GCOV +/* Determine the index into histogram for VALUE. */ + +static unsigned +gcov_histo_index(gcov_type value) +{ + gcov_type_unsigned v = (gcov_type_unsigned)value; + unsigned r = 0; + unsigned prev2bits = 0; + + /* Find index into log2 scale histogram, where each of the log2 + sized buckets is divided into 4 linear sub-buckets for better + focus in the higher buckets. */ + + /* Find the place of the most-significant bit set. */ + if (v > 0) + r = 63 - __builtin_clzll (v); + + /* If at most the 2 least significant bits are set (value is + 0 - 3) then that value is our index into the lowest set of + four buckets. */ + if (r < 2) + return (unsigned)value; + + gcc_assert (r < 64); + + /* Find the two next most significant bits to determine which + of the four linear sub-buckets to select. */ + prev2bits = (v >> (r - 2)) & 0x3; + /* Finally, compose the final bucket index from the log2 index and + the next 2 bits. The minimum r value at this point is 2 since we + returned above if r was 2 or more, so the minimum bucket at this + point is 4. */ + return (r - 1) * 4 + prev2bits; +} + +/* Merge SRC_HISTO into TGT_HISTO. The counters are assumed to be in + the same relative order in both histograms, and are matched up + and merged in reverse order. Each counter is assigned an equal portion of + its entry's original cumulative counter value when computing the + new merged cum_value. */ + +static void gcov_histogram_merge(gcov_bucket_type *tgt_histo, + gcov_bucket_type *src_histo) +{ + int src_i, tgt_i, tmp_i = 0; + unsigned src_num, tgt_num, merge_num; + gcov_type src_cum, tgt_cum, merge_src_cum, merge_tgt_cum, merge_cum; + gcov_type merge_min; + gcov_bucket_type tmp_histo[GCOV_HISTOGRAM_SIZE]; + int src_done = 0; + + memset(tmp_histo, 0, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + + /* Assume that the counters are in the same relative order in both + histograms. Walk the histograms from largest to smallest entry, + matching up and combining counters in order. */ + src_num = 0; + src_cum = 0; + src_i = GCOV_HISTOGRAM_SIZE - 1; + for (tgt_i = GCOV_HISTOGRAM_SIZE - 1; tgt_i >= 0 && !src_done; tgt_i--) + { + tgt_num = tgt_histo[tgt_i].num_counters; + tgt_cum = tgt_histo[tgt_i].cum_value; + /* Keep going until all of the target histogram's counters at this + position have been matched and merged with counters from the + source histogram. */ + while (tgt_num > 0 && !src_done) + { + /* If this is either the first time through this loop or we just + exhausted the previous non-zero source histogram entry, look + for the next non-zero source histogram entry. */ + if (!src_num) + { + /* Locate the next non-zero entry. */ + while (src_i >= 0 && !src_histo[src_i].num_counters) + src_i--; + /* If source histogram has fewer counters, then just copy over the + remaining target counters and quit. 
*/ + if (src_i < 0) + { + tmp_histo[tgt_i].num_counters += tgt_num; + tmp_histo[tgt_i].cum_value += tgt_cum; + if (!tmp_histo[tgt_i].min_value || + tgt_histo[tgt_i].min_value < tmp_histo[tgt_i].min_value) + tmp_histo[tgt_i].min_value = tgt_histo[tgt_i].min_value; + while (--tgt_i >= 0) + { + tmp_histo[tgt_i].num_counters + += tgt_histo[tgt_i].num_counters; + tmp_histo[tgt_i].cum_value += tgt_histo[tgt_i].cum_value; + if (!tmp_histo[tgt_i].min_value || + tgt_histo[tgt_i].min_value + < tmp_histo[tgt_i].min_value) + tmp_histo[tgt_i].min_value = tgt_histo[tgt_i].min_value; + } + + src_done = 1; + break; + } + + src_num = src_histo[src_i].num_counters; + src_cum = src_histo[src_i].cum_value; + } + + /* The number of counters to merge on this pass is the minimum + of the remaining counters from the current target and source + histogram entries. */ + merge_num = tgt_num; + if (src_num < merge_num) + merge_num = src_num; + + /* The merged min_value is the sum of the min_values from target + and source. */ + merge_min = tgt_histo[tgt_i].min_value + src_histo[src_i].min_value; + + /* Compute the portion of source and target entries' cum_value + that will be apportioned to the counters being merged. + The total remaining cum_value from each entry is divided + equally among the counters from that histogram entry if we + are not merging all of them. */ + merge_src_cum = src_cum; + if (merge_num < src_num) + merge_src_cum = merge_num * src_cum / src_num; + merge_tgt_cum = tgt_cum; + if (merge_num < tgt_num) + merge_tgt_cum = merge_num * tgt_cum / tgt_num; + /* The merged cum_value is the sum of the source and target + components. */ + merge_cum = merge_src_cum + merge_tgt_cum; + + /* Update the remaining number of counters and cum_value left + to be merged from this source and target entry. */ + src_cum -= merge_src_cum; + tgt_cum -= merge_tgt_cum; + src_num -= merge_num; + tgt_num -= merge_num; + + /* The merged counters get placed in the new merged histogram + at the entry for the merged min_value. */ + tmp_i = gcov_histo_index(merge_min); + gcc_assert (tmp_i < GCOV_HISTOGRAM_SIZE); + tmp_histo[tmp_i].num_counters += merge_num; + tmp_histo[tmp_i].cum_value += merge_cum; + if (!tmp_histo[tmp_i].min_value || + merge_min < tmp_histo[tmp_i].min_value) + tmp_histo[tmp_i].min_value = merge_min; + + /* Ensure the search for the next non-zero src_histo entry starts + at the next smallest histogram bucket. */ + if (!src_num) + src_i--; + } + } + + gcc_assert (tgt_i < 0); + + /* In the case where there were more counters in the source histogram, + accumulate the remaining unmerged cumulative counter values. Add + those to the smallest non-zero target histogram entry. Otherwise, + the total cumulative counter values in the histogram will be smaller + than the sum_all stored in the summary, which will complicate + computing the working set information from the histogram later on. */ + if (src_num) + src_i--; + while (src_i >= 0) + { + src_cum += src_histo[src_i].cum_value; + src_i--; + } + /* At this point, tmp_i should be the smallest non-zero entry in the + tmp_histo. */ + gcc_assert(tmp_i >= 0 && tmp_i < GCOV_HISTOGRAM_SIZE + && tmp_histo[tmp_i].num_counters > 0); + tmp_histo[tmp_i].cum_value += src_cum; + + /* Finally, copy the merged histogram into tgt_histo. 
*/ + memcpy(tgt_histo, tmp_histo, sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); +} +#endif /* IN_LIBGCOV || !IN_GCOV */ Index: gcc/gcov-io.h =================================================================== --- gcc/gcov-io.h (revision 190950) +++ gcc/gcov-io.h (working copy) @@ -139,7 +139,9 @@ see the files COPYING3 and COPYING.RUNTIME respect counts: header int64:count* summary: int32:checksum {count-summary}GCOV_COUNTERS_SUMMABLE count-summary: int32:num int32:runs int64:sum - int64:max int64:sum_max + int64:max int64:sum_max histogram + histogram: {int32:bitvector}8 histogram-buckets* + histogram-buckets: int32:num int64:min int64:sum The ANNOUNCE_FUNCTION record is the same as that in the note file, but without the source location. The COUNTS gives the @@ -171,8 +173,10 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (SI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (DI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (DI))); #else typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #endif #else #if BITS_PER_UNIT == 16 @@ -180,16 +184,20 @@ typedef unsigned gcov_unsigned_t __attribute__ ((m typedef unsigned gcov_position_t __attribute__ ((mode (HI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (SI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (SI))); #else typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #endif #else typedef unsigned gcov_unsigned_t __attribute__ ((mode (QI))); typedef unsigned gcov_position_t __attribute__ ((mode (QI))); #if LONG_LONG_TYPE_SIZE > 32 typedef signed gcov_type __attribute__ ((mode (HI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (HI))); #else typedef signed gcov_type __attribute__ ((mode (QI))); +typedef unsigned gcov_type_unsigned __attribute__ ((mode (QI))); #endif #endif #endif @@ -210,6 +218,7 @@ typedef unsigned gcov_position_t; #if IN_GCOV #define GCOV_LINKAGE static typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; #if IN_GCOV > 0 #include <sys/types.h> #endif @@ -309,9 +318,10 @@ typedef HOST_WIDEST_INT gcov_type; #define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2) #define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t)0xa1000000) /* Obsolete */ #define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t)0xa3000000) -#define GCOV_TAG_SUMMARY_LENGTH \ - (1 + GCOV_COUNTERS_SUMMABLE * (2 + 3 * 2)) +#define GCOV_TAG_SUMMARY_LENGTH(NUM) \ + (1 + GCOV_COUNTERS_SUMMABLE * (10 + 3 * 2) + (NUM) * 5) + /* Counters that are collected. */ #define GCOV_COUNTER_ARCS 0 /* Arc transitions. */ #define GCOV_COUNTERS_SUMMABLE 1 /* Counters which can be @@ -389,6 +399,29 @@ typedef HOST_WIDEST_INT gcov_type; /* Structured records. */ +/* Structure used for each bucket of the log2 histogram of counter values. */ +typedef struct +{ + /* Number of counters whose profile count falls within the bucket. */ + gcov_unsigned_t num_counters; + /* Smallest profile count included in this bucket. */ + gcov_type min_value; + /* Cumulative value of the profile counts in this bucket. 
*/ + gcov_type cum_value; +} gcov_bucket_type; + +/* For a log2 scale histogram with each range split into 4 + linear sub-ranges, there will be at most 64 (max gcov_type bit size) - 1 log2 + ranges since the lowest 2 log2 values share the lowest 4 linear + sub-range (values 0 - 3). This is 252 total entries (63*4). */ + +#define GCOV_HISTOGRAM_SIZE 252 + +/* How many unsigned ints are required to hold a bit vector of non-zero + histogram entries when the histogram is written to the gcov file. + This is essentially a ceiling divide by 32 bits. */ +#define GCOV_HISTOGRAM_BITVECTOR_SIZE (GCOV_HISTOGRAM_SIZE + 31) / 32 + /* Cumulative counter data. */ struct gcov_ctr_summary { @@ -397,6 +430,8 @@ struct gcov_ctr_summary gcov_type sum_all; /* sum of all counters accumulated. */ gcov_type run_max; /* maximum value on a single run. */ gcov_type sum_max; /* sum of individual run max values. */ + gcov_bucket_type histogram[GCOV_HISTOGRAM_SIZE]; /* histogram of + counter values. */ }; /* Object & program summary record. */ Index: gcc/profile.c =================================================================== --- gcc/profile.c (revision 190950) +++ gcc/profile.c (working copy) @@ -84,6 +84,15 @@ struct bb_info { const struct gcov_ctr_summary *profile_info; +/* Number of data points in the working set summary array. Using 128 + provides information for at least every 1% increment of the total + profile size. The last entry is hardwired to 99.9% of the total. */ +#define NUM_GCOV_WORKING_SETS 128 + +/* Counter working set information computed from the current counter + summary. Not initialized unless profile_info summary is non-NULL. */ +static gcov_working_set_t gcov_working_sets[NUM_GCOV_WORKING_SETS]; + /* Collect statistics on the performance of this pass for the entire source file. */ @@ -192,6 +201,152 @@ instrument_values (histogram_values values) } ^L +/* Compute the working set information from the counter histogram in + the profile summary. This is an array of information corresponding to a + range of percentages of the total execution count (sum_all), and includes + the number of counters required to cover that working set percentage and + the minimum counter value in that working set. */ + +static void +compute_working_sets (void) +{ + gcov_type working_set_cum_values[NUM_GCOV_WORKING_SETS]; + gcov_type ws_cum_hotness_incr; + gcov_type cum, tmp_cum; + const gcov_bucket_type *histo_bucket; + unsigned ws_ix, c_num, count, pctinc, pct; + int h_ix; + gcov_working_set_t *ws_info; + + if (!profile_info) + return; + + /* Compute the amount of sum_all that the cumulative hotness grows + by in each successive working set entry, which depends on the + number of working set entries. */ + ws_cum_hotness_incr = profile_info->sum_all / NUM_GCOV_WORKING_SETS; + + /* Next fill in an array of the cumulative hotness values corresponding + to each working set summary entry we are going to compute below. + Skip 0% statistics, which can be extrapolated from the + rest of the summary data. */ + cum = ws_cum_hotness_incr; + for (ws_ix = 0; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, cum += ws_cum_hotness_incr) + working_set_cum_values[ws_ix] = cum; + /* The last summary entry is reserved for (roughly) 99.9% of the + working set. Divide by 1024 so it becomes a shift, which gives + almost exactly 99.9%. 
*/ + working_set_cum_values[NUM_GCOV_WORKING_SETS-1] + = profile_info->sum_all - profile_info->sum_all/1024; + + /* Next, walk through the histogram in descending order of hotness + and compute the statistics for the working set summary array. + As histogram entries are accumulated, we check to see which + working set entries have had their expected cum_value reached + and fill them in, walking the working set entries in increasing + size of cum_value. */ + ws_ix = 0; /* The current entry into the working set array. */ + cum = 0; /* The current accumulated counter sum. */ + count = 0; /* The current accumulated count of block counters. */ + for (h_ix = GCOV_HISTOGRAM_SIZE - 1; + h_ix >= 0 && ws_ix < NUM_GCOV_WORKING_SETS; h_ix--) + { + histo_bucket = &profile_info->histogram[h_ix]; + + /* If we haven't reached the required cumulative counter value for + the current working set percentage, simply accumulate this histogram + entry into the running sums and continue to the next histogram + entry. */ + if (cum + histo_bucket->cum_value < working_set_cum_values[ws_ix]) + { + cum += histo_bucket->cum_value; + count += histo_bucket->num_counters; + continue; + } + + /* If adding the current histogram entry's cumulative counter value + causes us to exceed the current working set size, then estimate + how many of this histogram entry's counter values are required to + reach the working set size, and fill in working set entries + as we reach their expected cumulative value. */ + for (c_num = 0, tmp_cum = cum; + c_num < histo_bucket->num_counters && ws_ix < NUM_GCOV_WORKING_SETS; + c_num++) + { + count++; + /* If we haven't reached the last histogram entry counter, add + in the minimum value again. This will underestimate the + cumulative sum so far, because many of the counter values in this + entry may have been larger than the minimum. We could add in the + average value every time, but that would require an expensive + divide operation. */ + if (c_num + 1 < histo_bucket->num_counters) + tmp_cum += histo_bucket->min_value; + /* If we have reached the last histogram entry counter, then add + in the entire cumulative value. */ + else + tmp_cum = cum + histo_bucket->cum_value; + + /* Next walk through successive working set entries and fill in + the statistics for any whose size we have reached by accumulating + this histogram counter. */ + while (tmp_cum >= working_set_cum_values[ws_ix] + && ws_ix < NUM_GCOV_WORKING_SETS) + { + gcov_working_sets[ws_ix].num_counters = count; + gcov_working_sets[ws_ix].min_counter + = histo_bucket->min_value; + ws_ix++; + } + } + /* Finally, update the running cumulative value since we were + using a temporary above. */ + cum += histo_bucket->cum_value; + } + gcc_assert (ws_ix == NUM_GCOV_WORKING_SETS); + + if (dump_file) + { + fprintf (dump_file, "Counter working sets:\n"); + /* Multiply the percentage by 100 to avoid float. */ + pctinc = 100 * 100 / NUM_GCOV_WORKING_SETS; + for (ws_ix = 0, pct = pctinc; ws_ix < NUM_GCOV_WORKING_SETS; + ws_ix++, pct += pctinc) + { + if (ws_ix == NUM_GCOV_WORKING_SETS - 1) + pct = 9990; + ws_info = &gcov_working_sets[ws_ix]; + /* Print out the percentage using int arithmetic to avoid float. */ + fprintf (dump_file, "\t\t%u.%02u%%: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC "\n", + pct / 100, pct - (pct / 100 * 100), + ws_info->num_counters, + (HOST_WIDEST_INT)ws_info->min_counter); + } + } +} + +/* Given the desired percentage of the full profile (sum_all from the + summary), multiplied by 10 to avoid float in PCT_TIMES_10, returns + the corresponding working set information. If an exact match for + the percentage isn't found, the closest value is used. */ + +gcov_working_set_t * +find_working_set (unsigned pct_times_10) +{ + unsigned i; + if (!profile_info) + return NULL; + gcc_assert (pct_times_10 <= 1000); + if (pct_times_10 >= 999) + return &gcov_working_sets[NUM_GCOV_WORKING_SETS - 1]; + i = pct_times_10 * NUM_GCOV_WORKING_SETS / 1000; + if (!i) + return &gcov_working_sets[0]; + return &gcov_working_sets[i - 1]; +} + /* Computes hybrid profile for all matching entries in da_file. CFG_CHECKSUM is the precomputed checksum for the CFG. */ @@ -219,6 +374,8 @@ get_exec_counts (unsigned cfg_checksum, unsigned l if (!counts) return NULL; + compute_working_sets(); + if (dump_file && profile_info) fprintf(dump_file, "Merged %u profiles with maximal count %u.\n", profile_info->runs, (unsigned) profile_info->sum_max); Index: gcc/coverage.c =================================================================== --- gcc/coverage.c (revision 190950) +++ gcc/coverage.c (working copy) @@ -248,6 +248,13 @@ read_counts_file (void) summary.ctrs[ix].run_max = sum.ctrs[ix].run_max; summary.ctrs[ix].sum_max += sum.ctrs[ix].sum_max; } + if (new_summary) + memcpy (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram, + sizeof (gcov_bucket_type) * GCOV_HISTOGRAM_SIZE); + else + gcov_histogram_merge (summary.ctrs[GCOV_COUNTER_ARCS].histogram, + sum.ctrs[GCOV_COUNTER_ARCS].histogram); new_summary = 0; } else if (GCOV_TAG_IS_COUNTER (tag) && fn_ident) @@ -268,8 +275,9 @@ read_counts_file (void) entry->ctr = elt.ctr; entry->lineno_checksum = lineno_checksum; entry->cfg_checksum = cfg_checksum; - entry->summary = summary.ctrs[elt.ctr]; - entry->summary.num = n_counts; + if (elt.ctr < GCOV_COUNTERS_SUMMABLE) + entry->summary = summary.ctrs[elt.ctr]; + entry->summary.num = n_counts; entry->counts = XCNEWVEC (gcov_type, n_counts); } else if (entry->lineno_checksum != lineno_checksum Index: gcc/basic-block.h =================================================================== --- gcc/basic-block.h (revision 190950) +++ gcc/basic-block.h (working copy) @@ -31,6 +31,7 @@ along with GCC; see the file COPYING3. If not see flow graph is manipulated by various optimizations. A signed type makes those easy to detect. */ typedef HOST_WIDEST_INT gcov_type; +typedef unsigned HOST_WIDEST_INT gcov_type_unsigned; /* Control flow edge information. */ struct GTY((user)) edge_def { @@ -91,6 +92,16 @@ enum cfg_edge_flags { profile.c. */ extern const struct gcov_ctr_summary *profile_info; +/* Working set size statistics for a given percentage of the entire + profile (sum_all from the counter summary). */ +typedef struct gcov_working_set_info +{ + /* Number of hot counters included in this working set. */ + unsigned num_counters; + /* Smallest counter included in this working set. */ + gcov_type min_counter; +} gcov_working_set_t; + /* Declared in cfgloop.h. */ struct loop; @@ -897,4 +908,7 @@ extern void rtl_profile_for_bb (basic_block); extern void rtl_profile_for_edge (edge); extern void default_rtl_profile (void); +/* In profile.c. 
*/ +extern gcov_working_set_t *find_working_set(unsigned pct_times_10); + #endif /* GCC_BASIC_BLOCK_H */ Index: gcc/gcov-dump.c =================================================================== --- gcc/gcov-dump.c (revision 190950) +++ gcc/gcov-dump.c (working copy) @@ -447,7 +447,8 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED unsigned tag ATTRIBUTE_UNUSED, unsigned length ATTRIBUTE_UNUSED) { struct gcov_summary summary; - unsigned ix; + unsigned ix, h_ix; + gcov_bucket_type *histo_bucket; gcov_read_summary (&summary); printf (" checksum=0x%08x", summary.checksum); @@ -465,5 +466,24 @@ tag_summary (const char *filename ATTRIBUTE_UNUSED (HOST_WIDEST_INT)summary.ctrs[ix].run_max); printf (", sum_max=" HOST_WIDEST_INT_PRINT_DEC, (HOST_WIDEST_INT)summary.ctrs[ix].sum_max); + if (ix != GCOV_COUNTER_ARCS) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\tcounter histogram:"); + for (h_ix = 0; h_ix < GCOV_HISTOGRAM_SIZE; h_ix++) + { + histo_bucket = &summary.ctrs[ix].histogram[h_ix]; + if (!histo_bucket->num_counters) + continue; + printf ("\n"); + print_prefix (filename, 0, 0); + printf ("\t\t%d: num counts=%u, min counter=" + HOST_WIDEST_INT_PRINT_DEC ", cum_counter=" + HOST_WIDEST_INT_PRINT_DEC, + h_ix, histo_bucket->num_counters, + (HOST_WIDEST_INT)histo_bucket->min_value, + (HOST_WIDEST_INT)histo_bucket->cum_value); + } } } On Thu, Aug 30, 2012 at 8:32 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Wed, Aug 29, 2012 at 6:12 AM, Jan Hubicka <hubicka@ucw.cz> wrote: >> >> Index: libgcc/libgcov.c >> >> =================================================================== >> >> --- libgcc/libgcov.c (revision 190736) >> >> +++ libgcc/libgcov.c (working copy) >> >> @@ -276,6 +276,78 @@ gcov_version (struct gcov_info *ptr, gcov_unsigned >> >> return 1; >> >> } >> >> >> >> +/* Insert counter VALUE into HISTOGRAM. */ >> >> + >> >> +static void >> >> +gcov_histogram_insert(gcov_bucket_type *histogram, gcov_type value) >> >> +{ >> >> + unsigned i; >> >> + >> >> + i = gcov_histo_index(value); >> >> + gcc_assert (i < GCOV_HISTOGRAM_SIZE); >> > Does checking_assert work in libgcov? I do not think an internal consistency check >> > should go into --enable-checking=release libgcov. We want to maintain it as >> > lightweight as possible. (I see there are two existing gcc_asserts; since they >> > report file format corruption, I think they should give a better diagnostic.) >> >> gcc_checking_assert isn't available, since tsystem.h, not system.h, is >> included. I could probably just remove the assert (to be safe, >> silently return if i is out of bounds?). > > I think just removing the assert is fine here: the only way it can trigger is if > GCOV_HISTOGRAM_SIZE is wrong, and it ought not to be. >> >> From my understanding of the mode attribute meanings, which I thought >> are defined in terms of the number of smallest addressable units, the >> code in gcov-io.h that sets up the gcov_type typedef will always end >> up with a gcov_type that is 32 or 64 bits? I.e. when BITS_PER_UNIT is >> 8 it will use either SI or DI which will end up either 32 or 64, and >> when BITS_PER_UNIT is 16 it would use either HI or SI which would >> again be either 32 or 64. Is that wrong, and can we end up with a 16-bit >> gcov_type? > > I see, the code was simplified a bit since we dropped support for some of the more exotic > targets. The type should be either 32-bit or 64-bit. >> >> The GCOV_TYPE_SIZE was being defined everywhere except when IN_GCOV (so >> it was being defined IN_LIBGCOV), but I wanted it defined >> unconditionally because I am using it to determine the required number >> of histogram entries. >> >> In any case, I think it will be more straightforward to define the >> number of histogram entries based on the max possible gcov_type size, >> which is 64 (so 256 entries). This will make implementing the bit mask >> simpler, since it will always require the same number of gcov_unsigned >> ints (8). > Sounds good to me. 64-bit should be enough for everyone. Coverage is kind of useless > for smaller types, and for really restricted targets we will more likely want to disable > histogram generation rather than making it a bit smaller. >> >> > >> > Patch is OK if it passed profiledbootstrap, modulo the comments above. >> >> Ok, thanks. Working on the fixes above. > > OK, thanks! > Honza >> >> Teresa >> >> > Thanks! >> > Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
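To make the bucket layout concrete, the indexing logic of gcov_histo_index in the committed patch above can be exercised standalone. This sketch assumes a 64-bit counter type, as the thread concluded; histo_index mirrors the patch's computation, while main and its sample values are illustrative only.

/* Standalone rendering of the gcov_histo_index bucket computation from
   the patch above: a log2 scale with each power-of-two range split
   into 4 linear sub-buckets.  Assumes a 64-bit counter type; main and
   its sample values are illustrative.  */
#include <stdio.h>

static unsigned
histo_index (unsigned long long v)
{
  unsigned r = 0, prev2bits;

  /* Position of the most significant set bit, 0 for v == 0.  */
  if (v > 0)
    r = 63 - __builtin_clzll (v);

  /* Values 0-3 occupy the four lowest buckets directly.  */
  if (r < 2)
    return (unsigned) v;

  /* The two bits below the leading bit pick one of four linear
     sub-buckets within this log2 range.  */
  prev2bits = (v >> (r - 2)) & 0x3;
  return (r - 1) * 4 + prev2bits;
}

int
main (void)
{
  /* 0-3 map to buckets 0-3; 4-7 to 4-7; 8,10,12,14 to 8-11; each
     further doubling adds four buckets, topping out at
     (63 - 1) * 4 + 3 = 251, hence GCOV_HISTOGRAM_SIZE == 252.  */
  unsigned long long samples[] = { 0, 3, 4, 7, 8, 10, 100, 1000 };
  unsigned i;
  for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
    printf ("count %llu -> bucket %u\n", samples[i],
            histo_index (samples[i]));
  return 0;
}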
On 2012.09.04 at 14:23 -0700, Teresa Johnson wrote: > I just committed the patch (included below). I implemented the > occupancy bit vector approach for recording non-zero histogram > entries, and a few issues uncovered with the merging in a profiled > bootstrap. > > Passes both bootstrap and profiledbootstrap builds and regression tests. This commit causes: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 -- Markus
Sorry about that. I am right now trying to reproduce the profiledbootstrap problem that H.J. reported, which is on x86_64-unknown-linux-gnu where I had successfully done a profiledbootstrap before my commit. Unfortunately after svn updating my client I am hitting an unrelated build problem with my profiledbootstrap (it can't find the rule to make libc++11convenience.la) which is slowing me down. I don't have access to a gentoo system so I might need your help to track down the error you reported if it doesn't turn out to be the same as the one H.J. hit. If you have a bad .gcda file that you could send me I can start with that. Sounds like there is a profile merging problem that I didn't see, that is corrupting the gcda files in both cases. Thanks, Teresa On Wed, Sep 5, 2012 at 12:12 AM, Markus Trippelsdorf <markus@trippelsdorf.de> wrote: > On 2012.09.04 at 14:23 -0700, Teresa Johnson wrote: >> I just committed the patch (included below). I implemented the >> occupancy bit vector approach for recording non-zero histogram >> entries, and a few issues uncovered with the merging in a profiled >> bootstrap. >> >> Passes both bootstrap and profiledbootstrap builds and regression tests. > > This commit causes: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54487 > > -- > Markus -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: > Sorry about that. I am right now trying to reproduce the > profiledbootstrap problem that H.J. reported, which is on > x86_64-unknown-linux-gnu where I had successfully done a > profiledbootstrap before my commit. > > Unfortunately after svn updating my client I am hitting an unrelated > build problem with my profiledbootstrap (it can't find the rule to > make libc++11convenience.la) which is slowing me down. I can reproduce it with revision 190982 on Fedora 18/x86-64 with 8-core. -- H.J.
On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >> Sorry about that. I am right now trying to reproduce the >> profiledbootstrap problem that H.J. reported, which is on >> x86_64-unknown-linux-gnu where I had successfully done a >> profiledbootstrap before my commit. >> >> Unfortunately after svn updating my client I am hitting an unrelated >> build problem with my profiledbootstrap (it can't find the rule to >> make libc++11convenience.la) which is slowing me down. > > I can reproduce it with revision 190982 on Fedora 18/x86-64 > with 8-core. Ok, thanks. I am being blocked by an unrelated error: libtool: compile: /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs -B/usr/local/x86_64-unknown-linux-gnu/bin/ -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem /usr/local/x86_64-unknown-linux-gnu/include -isystem /usr/local/x86_64-unknown-linux-gnu/sys-include -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc -fPIC -DPIC -o random.o /tmp/ccOm0d5x.s: Assembler messages: /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' make[6]: *** [random.lo] Error 1 Looks like I am being hit by: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 I'm going to try backing out all the changes related to this bug to see if I can make progress on the profiledbootstrap. Teresa > > -- > H.J. -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Wed, Sep 5, 2012 at 8:50 AM, Teresa Johnson <tejohnson@google.com> wrote: > On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >> On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >>> Sorry about that. I am right now trying to reproduce the >>> profiledbootstrap problem that H.J. reported, which is on >>> x86_64-unknown-linux-gnu where I had successfully done a >>> profiledbootstrap before my commit. >>> >>> Unfortunately after svn updating my client I am hitting an unrelated >>> build problem with my profiledbootstrap (it can't find the rule to >>> make libc++11convenience.la) which is slowing me down. >> >> I can reproduce it with revision 190982 on Fedora 18/x86-64 >> with 8-core. > > Ok, thanks. I am being blocked by an unrelated error: > > libtool: compile: > /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc > -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc > -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src > -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs > -B/usr/local/x86_64-unknown-linux-gnu/bin/ > -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem > /usr/local/x86_64-unknown-linux-gnu/include -isystem > /usr/local/x86_64-unknown-linux-gnu/sys-include > -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc > -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu > -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include > -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ > -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings > -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections > -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c > /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc > -fPIC -DPIC -o random.o > /tmp/ccOm0d5x.s: Assembler messages: > /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' > make[6]: *** [random.lo] Error 1 > > Looks like I am being hit by: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 > > I'm going to try backing out all the changes related to this bug to > see if I can make progress on the profiledbootstrap. > You can install the latest binutils and put it in your PATH. -- H.J.
On Wed, Sep 5, 2012 at 9:13 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Wed, Sep 5, 2012 at 8:50 AM, Teresa Johnson <tejohnson@google.com> wrote: >> On Wed, Sep 5, 2012 at 8:44 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >>> On Wed, Sep 5, 2012 at 8:09 AM, Teresa Johnson <tejohnson@google.com> wrote: >>>> Sorry about that. I am right now trying to reproduce the >>>> profiledbootstrap problem that H.J. reported, which is on >>>> x86_64-unknown-linux-gnu where I had successfully done a >>>> profiledbootstrap before my commit. >>>> >>>> Unfortunately after svn updating my client I am hitting an unrelated >>>> build problem with my profiledbootstrap (it can't find the rule to >>>> make libc++11convenience.la) which is slowing me down. >>> >>> I can reproduce it with revision 190982 on Fedora 18/x86-64 >>> with 8-core. >> >> Ok, thanks. I am being blocked by an unrelated error: >> >> libtool: compile: >> /home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc/xgcc >> -shared-libgcc -B/home/tejohnson/extra/gcc_trunk_4_validate/tmp/./gcc >> -nostdinc++ -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src >> -L/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs >> -B/usr/local/x86_64-unknown-linux-gnu/bin/ >> -B/usr/local/x86_64-unknown-linux-gnu/lib/ -isystem >> /usr/local/x86_64-unknown-linux-gnu/include -isystem >> /usr/local/x86_64-unknown-linux-gnu/sys-include >> -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/../libgcc >> -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu >> -I/home/tejohnson/extra/gcc_trunk_4_validate/tmp/x86_64-unknown-linux-gnu/libstdc++-v3/include >> -I/home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/libsupc++ >> -std=gnu++11 -fno-implicit-templates -Wall -Wextra -Wwrite-strings >> -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections >> -fdata-sections -frandom-seed=random.lo -g -O2 -D_GNU_SOURCE -c >> /home/tejohnson/extra/gcc_trunk_4/libstdc++-v3/src/c++11/random.cc >> -fPIC -DPIC -o random.o >> /tmp/ccOm0d5x.s: Assembler messages: >> /tmp/ccOm0d5x.s:33: Error: no such instruction: `rdrand %eax' >> make[6]: *** [random.lo] Error 1 >> >> Looks like I am being hit by: >> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54419 >> >> I'm going to try backing out all the changes related to this bug to >> see if I can make progress on the profiledbootstrap. >> > > You can install the latest binutils and put it in your PATH. Ok, I just backed out the libstdc++ patches which fixed it for now. I am trying a couple different profiledbootstraps. One with just --with-build-config=bootstrap-lto and one with "--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-languages=c,c++ --enable-gnu-indirect-function --with-fpmath=sse". Thanks, Teresa > > -- > H.J. -- Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413