From 5f650f628930927e65d0b9a4282ca266119fa01d Mon Sep 17 00:00:00 2001 From: Tom Manshreck Date: Thu, 7 Nov 2024 14:51:44 -0500 Subject: [PATCH] Add Performance Hints doc to abseil.io --- _includes/side-nav-fast.html | 3 + fast/hints.md | 6129 ++++++++++++++++++++++++++++++++++ fast/index.md | 3 +- 3 files changed, 6134 insertions(+), 1 deletion(-) create mode 100644 fast/hints.md diff --git a/_includes/side-nav-fast.html b/_includes/side-nav-fast.html index 1500200..c6ee614 100644 --- a/_includes/side-nav-fast.html +++ b/_includes/side-nav-fast.html @@ -2,6 +2,9 @@
  • Performance Guide
  • +
  • + Performance Hints
    +
  • Fast Tips
  • {% assign sorted_posts = site.posts | sort: 'order' %} diff --git a/fast/hints.md b/fast/hints.md new file mode 100644 index 0000000..5545a45 --- /dev/null +++ b/fast/hints.md @@ -0,0 +1,6129 @@ +--- +title: "Performance Hints" +layout: fast +sidenav: side-nav-fast.html +type: markdown +--- + + +[Jeff Dean](https://research.google/people/jeff/), +[Sanjay Ghemawat](https://research.google/people/sanjayghemawat/) + +Original version: 2023/07/27, last updated: 2024/09/05 + + + +Over the years, we (Jeff & Sanjay) have done a fair bit of diving into +performance tuning of various pieces of code, and +improving the performance of our software has been important from the very earliest days of +Google, since it lets us do more for more users. We wrote this document as a way +of identifying some general principles and specific techniques that we use when +doing this sort of work, and tried to pick illustrative source code changes +(change lists, or CLs) that provide examples of the various approaches and +techniques. Most of the concrete suggestions below reference C++ types and CLs, +but the general principles apply to other languages. The document focuses on +general performance tuning in the context of a single binary, and does not cover +distributed systems or machine learning (ML) hardware performance tuning (huge +areas unto themselves). We hope others will find this useful. + +*Many of the examples in the document have code fragments that demonstrate the +techniques (click the little triangles!).* +*Note that some of these code fragments mention various internal Google codebase abstractions. We have included these anyway if we felt like the examples were self-contained enough to be understandable to those unfamiliar with the details of those abstractions.* + +## The importance of thinking about performance {#the-importance-of-thinking-about-performance} + +Knuth is often quoted out of context as saying *premature optimization is the +root of all evil*. The +[full quote](https://dl.acm.org/doi/pdf/10.1145/356635.356640) reads: *"We +should forget about small efficiencies, say about 97% of the time: premature +optimization is the root of all evil. Yet we should not pass up our +opportunities in that critical 3%."* This document is about that critical +3%, and a more compelling quote, again +from Knuth, reads: + +> The improvement in speed from Example 2 to Example 2a is only about 12%, and +> many people would pronounce that insignificant. The conventional wisdom shared +> by many of today's software engineers calls for ignoring efficiency in the +> small; but I believe this is simply an overreaction to the abuses they see +> being practiced by penny-wise-and-pound-foolish programmers, who can't debug +> or maintain their "optimized" programs. In established engineering disciplines +> a 12% improvement, easily obtained, is never considered marginal; and I +> believe the same viewpoint should prevail in software engineering. Of course I +> wouldn't bother making such optimizations on a one-shot job, but when it's a +> question of preparing quality programs, I don't want to restrict myself to +> tools that deny me such efficiencies. + +Many people will say "let's write down the code in as simple a way as possible +and deal with performance later when we can profile". However, this approach is +often wrong: + +1. 
If you disregard all performance concerns when developing a large system,
   you will end up with a flat profile where there are no obvious hotspots
   because performance is lost all over the place. It will be difficult to
   figure out how to get started on performance improvements.
2. If you are developing a library that will be used by other people, the
   people who will run into performance problems are likely to be people
   who cannot easily make performance improvements (they will have to
   understand the details of code written by other people/teams, and have to
   negotiate with them about the importance of performance optimizations).
3. It is harder to make significant changes to a system when it is in heavy
   use.
4. It is also hard to tell if there are performance problems that can be
   solved easily, and so we end up with potentially expensive solutions like
   over-replication or severe overprovisioning of a service to handle load
   problems.

Instead, we suggest that when writing code, you try to choose the faster
alternative if it does not impact the readability/complexity of the code
significantly.

## Estimation

If you can develop an intuition for how much performance might matter in the
code you are writing, you can make a more informed decision (e.g., how much
extra complexity is warranted in the name of performance). Some tips on
estimating performance while you are writing code:

* Is it test code? If so, you need to worry mostly about the asymptotic
  complexity of your algorithms and data structures. (Aside: development cycle
  time matters, so avoid writing tests that take a long time to run.)
* Is it code specific to an application? If so, try to figure out how much
  performance matters for this piece of code. This is typically not very hard:
  just figuring out whether code is initialization/setup code vs. code that
  will end up on hot paths (e.g., processing every request in a service) is
  often sufficient.
* Is it library code that will be used by many applications? In this case it
  is hard to tell how sensitive it might become. This is where it becomes
  especially important to follow some of the simple techniques described in
  this document. For example, if you need to store a vector that usually has a
  small number of elements, use an absl::InlinedVector instead of std::vector.
  Such techniques are not very hard to follow and don't add any non-local
  complexity to the system. And if it turns out that the code you are writing
  does end up using significant resources, it will perform better from the
  start, and it will be easier to find the next thing to focus on when looking
  at a profile.

You can do a slightly deeper analysis when picking between options with
potentially different performance characteristics by relying on
[back of the envelope calculations](https://en.wikipedia.org/wiki/Back-of-the-envelope_calculation).
Such calculations can quickly give a very rough estimate of the performance of
different alternatives, and the results can be used to discard some of the
alternatives without having to implement them.

Here is how such an estimation might work:

1. Estimate how many low-level operations of various kinds are required, e.g.,
   number of disk seeks, number of network round-trips, bytes transmitted,
   etc.
2. Multiply each kind of expensive operation by its rough cost, and add the
   results together.
3. The preceding gives the *cost* of the system in terms of resource usage.
   If you are interested in latency, and if the system has any concurrency,
   some of the costs may overlap and you may have to do a slightly more
   complicated analysis to estimate the latency.

The following table, which is an updated version of a table from a
[2007 talk at Stanford University](https://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf)
(video of the 2007 talk no longer exists, but there is a
[video of a related 2011 Stanford talk that covers some of the same content](https://www.youtube.com/watch?v=modXC5IWTJI))
may be useful since it lists the types of operations to consider, and their
rough cost:

```
L1 cache reference                           0.5 ns
L2 cache reference                             3 ns
Branch mispredict                              5 ns
Mutex lock/unlock (uncontended)               15 ns
Main memory reference                         50 ns
Compress 1K bytes with Snappy              1,000 ns
Read 4KB from SSD                         20,000 ns
Round trip within same datacenter         50,000 ns
Read 1MB sequentially from memory         64,000 ns
Read 1MB over 100 Gbps network           100,000 ns
Read 1MB from SSD                      1,000,000 ns
Disk seek                              5,000,000 ns
Read 1 MB sequentially from disk      10,000,000 ns
Send packet CA->Netherlands->CA      150,000,000 ns
```

The preceding table contains rough costs for some basic low-level operations.
You may find it useful to also track estimated costs for higher-level
operations relevant to your system. E.g., you might want to know the rough cost
of a point read from your SQL database, the latency of interacting with a Cloud
service, or the time to render a simple HTML page. If you don’t know the
relevant cost of different operations, you can’t do decent back-of-the-envelope
calculations!

### Example: Time to quicksort a billion 4 byte numbers

As a rough approximation, a good quicksort algorithm makes log(N) passes over an
array of size N. On each pass, the array contents will be streamed from memory
into the processor cache, and the partition code will compare each element
once to a pivot element. Let's add up the dominant costs:

1. Memory bandwidth: the array occupies 4 GB (4 bytes per number times a
   billion numbers). Let's assume ~16GB/s of memory bandwidth per core. That
   means each pass will take ~0.25s. N is ~2^30, so we will make ~30 passes, so
   the total cost of memory transfer will be ~7.5 seconds.
2. Branch mispredictions: we will do a total of N*log(N) comparisons, i.e., ~30
   billion comparisons. Let's assume that half of them (i.e., 15 billion) are
   mispredicted. Multiplying by 5 ns per misprediction, we get a misprediction
   cost of 75 seconds. We assume for this analysis that correctly predicted
   branches are free.
3. Adding up the previous numbers, we get an estimate of ~82.5 seconds.

If necessary, we could refine our analysis to account for processor caches.
This refinement is probably not needed since branch mispredictions are the
dominant cost according to the analysis above, but we include it here anyway as
another example. Let's assume we have a 32MB L3 cache, and that the cost of
transferring data from L3 cache to the processor is negligible. The L3 cache
can hold 2^23 numbers, and therefore the last 22 passes can operate on the data
resident in the L3 cache (the 23rd last pass brings data into the L3 cache and
the remaining passes operate on that data). That cuts down the memory transfer
cost to 2.5 seconds (10 memory transfers of 4GB at 16GB/s) instead of 7.5
seconds (30 memory transfers).
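
To make the arithmetic above concrete, the same estimate can be written down as
a few lines of code. This is only a sketch that encodes the rough constants
from the table above, not measured numbers:

```c++
#include <cmath>
#include <cstdio>

// Back-of-the-envelope estimate for quicksorting a billion 4-byte numbers,
// using the rough constants from the table above.
int main() {
  const double n = 1e9;                // number of elements
  const double bytes = 4 * n;          // ~4 GB of data
  const double mem_bw = 16e9;          // ~16 GB/s of memory bandwidth per core
  const double passes = std::log2(n);  // ~30 quicksort passes
  const double mispredict_ns = 5;      // cost of one branch mispredict

  // Each pass streams the whole array from memory.
  const double memory_s = passes * (bytes / mem_bw);
  // ~N*log2(N) comparisons; assume half of them are mispredicted.
  const double mispredict_s = 0.5 * n * passes * mispredict_ns * 1e-9;

  std::printf("memory: ~%.1fs  mispredicts: ~%.1fs  total: ~%.1fs\n",
              memory_s, mispredict_s, memory_s + mispredict_s);
  return 0;
}
```
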
### Example: Time to generate a web page with 30 image thumbnails

Let's compare two potential designs where the original images are stored on
disk, and each image is approximately 1MB in size.

1. Read the contents of the 30 images serially and generate a thumbnail for
   each one. Each read takes one seek plus one transfer, i.e., 5ms for the seek
   and 10ms for the transfer, or 15ms per image, which adds up to 450ms for 30
   images.
2. Read in parallel, assuming the images are spread evenly across K disks. The
   previous resource usage estimate still holds, but latency will drop by
   roughly a factor of K, ignoring variance (e.g., we will sometimes get
   unlucky and one disk will have more than 1/Kth of the images we are
   reading). Therefore if we are running on a distributed filesystem with
   hundreds of disks, the expected latency will drop to ~15ms.
3. Let's consider a variant where all images are on a single SSD. This changes
   the sequential read performance to 20µs + 1ms per image, which adds up to
   ~30ms overall.

## Measurement {#measurement}

The preceding section gives some tips about how to think about performance when
writing code without worrying too much about how to measure the performance
impact of your choices. However, before you actually start making improvements,
or run into a tradeoff involving various things like performance, simplicity,
etc., you will want to measure or estimate potential performance benefits.
Being able to measure things effectively is the number one tool you'll want to
have in your arsenal when doing performance-related work.

As an aside, it’s worth pointing out that profiling code that you’re unfamiliar
with can also be a good way of getting a general sense of the structure of the
codebase and how it operates. Examining the source code of heavily involved
routines in the dynamic call graph of a program can give you a high level sense
of “what happens” when running the code, which can build your confidence in
making performance-improving changes in slightly unfamiliar code.

### Profiling tools and tips {#profiling-tools-and-tips}

Many useful profiling tools are available. A useful tool to reach for first is
[pprof](https://github.com/google/pprof/blob/main/doc/README.md) since it gives
good high level performance information and is easy to use both locally and for
code running in production. Also try
[perf](https://perf.wiki.kernel.org/index.php/Main_Page) if you want more
detailed insight into performance.

Some tips for profiling:

* Build production binaries with appropriate debugging information and
  optimization flags.
* If you can, write a micro-benchmark that covers the code you are improving
  (see the sketch after this list). Microbenchmarks improve turn-around time
  when making performance improvements, help verify the impact of performance
  improvements, and can help prevent future performance regressions. However,
  microbenchmarks can have [pitfalls][fast39] that make them
  non-representative of full system performance. Useful libraries for writing
  micro-benchmarks: [C++][cpp benchmarks], [Go][go benchmarks], [Java][jmh].
* Use a benchmark library to [emit performance counter readings][fast53] both
  for better precision, and to get more insight into program behavior.
* Lock contention can often artificially lower CPU usage. Some mutex
  implementations provide support for profiling lock contention.
* Use [ML profilers][xprof] for machine learning performance work.
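
As a minimal illustration of the micro-benchmark tip above, here is a sketch
using the open-source [google/benchmark](https://github.com/google/benchmark)
library; `CountSpaces` is just a stand-in for the code being measured:

```c++
#include <cstdint>
#include <string>

#include "benchmark/benchmark.h"

// Toy function standing in for the code being tuned.
static int CountSpaces(const std::string& s) {
  int n = 0;
  for (char c : s) n += (c == ' ');
  return n;
}

static void BM_CountSpaces(benchmark::State& state) {
  const std::string input(state.range(0), 'x');
  for (auto _ : state) {
    // DoNotOptimize keeps the compiler from discarding the unused result.
    benchmark::DoNotOptimize(CountSpaces(input));
  }
  state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) *
                          input.size());
}
BENCHMARK(BM_CountSpaces)->Range(8, 8 << 10);

BENCHMARK_MAIN();
```
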
### What to do when profiles are flat {#what-to-do-when-profiles-are-flat}

You will often run into situations where your CPU profile is flat (there is no
obvious big contributor to slowness). This can often happen when all
low-hanging fruit has been picked. Here are some tips to consider if you find
yourself in this situation:

* Don't discount the value of many small optimizations! Making twenty separate
  1% improvements in some subsystem is often eminently possible, and such
  improvements collectively add up to a pretty sizable improvement (work of
  this flavor often relies on having stable and high quality microbenchmarks).
  Some examples of these sorts of changes are in the
  [changes that demonstrate multiple techniques](#cls-that-demonstrate-multiple-techniques)
  section.
* Find loops closer to the top of call stacks (the flame graph view of a CPU
  profile can be helpful here). Potentially, the loop or the code it calls
  could be restructured to be more efficient. Some code that initially built a
  complicated graph structure incrementally by looping over nodes and edges of
  the input was changed to build the graph structure in one shot by passing it
  the entire input. This removed a bunch of internal checks that were
  happening per edge in the initial code.
* Take a step back and look for structural changes higher up in the call
  stacks instead of concentrating on micro-optimizations. The techniques
  listed under [algorithmic improvements](#algorithmic-improvements) can be
  useful when doing this.
* Look for overly general code. Replace it with a customized or lower-level
  implementation. E.g., if an application is repeatedly using a regular
  expression match where a simple prefix match would suffice, consider
  dropping the use of the regular expression.
* Attempt to reduce the number of allocations:
  [get an allocation profile][profile sources], and pick away at the highest
  contributor to the number of allocations. This will have two effects: (1) it
  will provide a direct reduction of the amount of time spent in the allocator
  (and garbage collector for GC-ed languages), and (2) there will often be a
  reduction in cache misses since in a long-running program using tcmalloc,
  every allocation tends to go to a different cache line.
* Gather other types of profiles, especially ones based on hardware
  performance counters. Such profiles may point out functions that are
  encountering a high cache miss rate. Techniques described in the
  [profiling tools and tips](#profiling-tools-and-tips) section can be
  helpful.

## API considerations {#api-considerations}

Some of the techniques suggested below require changing data structures and
function signatures, which may be disruptive to callers. Try to organize code
so that the suggested performance improvements can be made inside an
encapsulation boundary without affecting public interfaces. This will be easier
if your [modules are deep](https://web.stanford.edu/~ouster/cgi-bin/book.php)
(significant functionality accessed via a narrow interface).

Widely used APIs come under heavy pressure to add features. Be careful when
adding new features since these will constrain future implementations and
increase cost unnecessarily for users who don't need the new features. E.g.,
many C++ standard library containers promise iterator stability, which in
typical implementations increases the number of allocations significantly,
even though many users do not need that stability.

Some specific techniques are listed below.
Consider carefully the performance +benefits vs. any API usability issues introduced by such changes. + +### Bulk APIs + +Provide bulk ops to reduce expensive API boundary crossings or to take advantage +of algorithmic improvements. + +
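
Before the CL examples below, here is a small self-contained sketch of the
idea; the `Registry` class is hypothetical (not from the CLs), but it shows how
a bulk call amortizes the lock acquisition and per-call overhead over a whole
batch of keys:

```c++
#include <cstdint>
#include <string>

#include "absl/base/thread_annotations.h"
#include "absl/container/flat_hash_map.h"
#include "absl/synchronization/mutex.h"
#include "absl/types/span.h"

// Hypothetical registry used only to illustrate a bulk API.
class Registry {
 public:
  // Non-bulk API: one mutex acquisition and one hash lookup per call.
  bool Lookup(uint64_t key, std::string* value) const {
    absl::MutexLock l(&mu_);
    auto it = map_.find(key);
    if (it == map_.end()) return false;
    *value = it->second;
    return true;
  }

  // Bulk API: one mutex acquisition for the whole batch.
  // Returns true iff every key was found; values[i] corresponds to keys[i].
  bool LookupMany(absl::Span<const uint64_t> keys,
                  absl::Span<std::string> values) const {
    absl::MutexLock l(&mu_);
    bool all_found = true;
    for (size_t i = 0; i < keys.size(); ++i) {
      auto it = map_.find(keys[i]);
      if (it == map_.end()) {
        all_found = false;
      } else {
        values[i] = it->second;
      }
    }
    return all_found;
  }

 private:
  mutable absl::Mutex mu_;
  absl::flat_hash_map<uint64_t, std::string> map_ ABSL_GUARDED_BY(mu_);
};
```
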
    + +Added bulk MemoryManager::LookupMany interface. + + +In addition to adding a bulk interface, this also simplified the signature for +the new bulk variant: it turns out clients only needed to know if all the keys +were found, so we can return a bool rather than a Status object. + +memory_manager.h + +{: .bad-code} +```c++ +class MemoryManager { + public: + ... + util::StatusOr Lookup(const TensorIdProto& id); +``` + +{: .new} +```c++ +class MemoryManager { + public: + ... + util::StatusOr Lookup(const TensorIdProto& id); + + // Lookup the identified tensors + struct LookupKey { + ClientHandle client; + uint64 local_id; + }; + bool LookupMany(absl::Span keys, + absl::Span tensors); +``` + +
    + +
    + +Added bulk ObjectStore::DeleteRefs API to amortize +locking overhead. + + +object_store.h + +{: .bad-code} +```c++ +template +class ObjectStore { + public: + ... + absl::Status DeleteRef(Ref); +``` + +{: .new} +```c++ +template +class ObjectStore { + public: + ... + absl::Status DeleteRef(Ref); + + // Delete many references. For each ref, if no other Refs point to the same + // object, the object will be deleted. Returns non-OK on any error. + absl::Status DeleteRefs(absl::Span refs); + ... +template +absl::Status ObjectStore::DeleteRefs(absl::Span refs) { + util::Status result; + absl::MutexLock l(&mu_); + for (auto ref : refs) { + result.Update(DeleteRefLocked(ref)); + } + return result; +} +``` + +memory_tracking.cc + +{: .bad-code} +```c++ +void HandleBatch(int, const plaque::Batch& input) override { + for (const auto& t : input) { + auto in = In(t); + PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles()); + for (const auto handle : handles.value->handles()) { + PLAQUE_OP_RETURN_IF_ERROR(in_buffer_store_ + ? bstore_->DeleteRef(handle) + : tstore_->DeleteRef(handle)); + } + } +} +``` + +{: .new} +```c++ +void HandleBatch(int, const plaque::Batch& input) override { + for (const auto& t : input) { + auto in = In(t); + PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles()); + if (in_buffer_store_) { + PLAQUE_OP_RETURN_IF_ERROR( + bstore_->DeleteRefs(handles.value->handles())); + } else { + PLAQUE_OP_RETURN_IF_ERROR( + tstore_->DeleteRefs(handles.value->handles())); + } + } +} +``` + +
    + +
    + +Floyd's +heap construction. + + +Bulk initialization of a heap can be done in O(N) time, whereas adding one +element at a time and updating the heap property after each addition requires +O(N lg(N)) time. + +
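
For reference, the C++ standard library exposes both approaches directly; a
minimal sketch:

```c++
#include <algorithm>
#include <vector>

void BuildHeaps(std::vector<int> data) {
  // O(N log N): grow the heap one element at a time.
  std::vector<int> incremental;
  incremental.reserve(data.size());
  for (int v : data) {
    incremental.push_back(v);
    std::push_heap(incremental.begin(), incremental.end());
  }

  // O(N): bulk construction (Floyd's algorithm) over the whole range.
  std::make_heap(data.begin(), data.end());
}
```
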
    + +Sometimes it is hard to change callers to use a new bulk API directly. In that +case it might be beneficial to use a bulk API internally and cache the results +for use in future non-bulk API calls: + +
    + +Cache block decode results for use in future calls. + + +Each lookup needs to decode a whole block of K entries. Store the decoded +entries in a cache and consult the cache on future lookups. + +lexicon.cc + +{: .bad-code} +```c++ +void GetTokenString(int pos, std::string* out) const { + ... + absl::FixedArray entries(pos + 1); + + // Decode all lexicon entries up to and including pos. + for (int i = 0; i <= pos; ++i) { + p = util::coding::TwoValuesVarint::Decode32(p, &entries[i].remaining, + &entries[i].shared); + entries[i].remaining_str = p; + p += entries[i].remaining; // remaining bytes trail each entry. + } +``` + +{: .new} +```c++ +mutable std::vector> cache_; +... +void GetTokenString(int pos, std::string* out) const { + ... + DCHECK_LT(skentry, cache_.size()); + if (!cache_[skentry].empty()) { + *out = cache_[skentry][pos]; + return; + } + ... + // Init cache. + ... + const char* prev = p; + for (int i = 0; i < block_sz; ++i) { + uint32 shared, remaining; + p = TwoValuesVarint::Decode32(p, &remaining, &shared); + auto& cur = cache_[skentry].emplace_back(); + gtl::STLStringResizeUninitialized(&cur, remaining + shared); + + std::memcpy(cur.data(), prev, shared); + std::memcpy(cur.data() + shared, p, remaining); + prev = cur.data(); + p += remaining; + } + *out = cache_[skentry][pos]; +``` + +
    +

### View types

Prefer view types (e.g., `std::string_view`, `absl::Span`,
`absl::FunctionRef`) for function arguments (unless ownership of the
data is being transferred). These types reduce copying, and allow callers to
pick their own container types (e.g., one caller might use `std::vector` whereas
another one uses `absl::InlinedVector`).

### Pre-allocated/pre-computed arguments

For frequently called routines, sometimes it is useful to allow higher-level
callers to pass in a data structure that they own, or information that the
called routine needs but that the caller already has. This can avoid the
low-level routine being forced to allocate its own temporary data structure or
recompute already-available information.
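
A minimal sketch of both points; the function names below are hypothetical, not
taken from the CL that follows:

```c++
#include <string>
#include <vector>

#include "absl/strings/string_view.h"
#include "absl/time/clock.h"
#include "absl/time/time.h"
#include "absl/types/span.h"

// Takes view types, so callers can pass a std::vector, an absl::InlinedVector,
// a std::string, a string literal, etc. without copying.
int CountEqual(absl::Span<const int> values, int target) {
  int n = 0;
  for (int v : values) n += (v == target);
  return n;
}

// The caller often already has a timestamp; accept it as a parameter instead
// of recomputing absl::Now() inside a frequently called routine.
void RecordEvent(absl::string_view name, absl::Time now) {
  // ... record (name, now) somewhere ...
}

void Example(const std::vector<int>& ids) {
  const absl::Time now = absl::Now();
  CountEqual(ids, 42);         // std::vector<int> converts to absl::Span
  RecordEvent("lookup", now);  // reuse the timestamp the caller already has
}
```
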
    + +Added RPC_Stats::RecordRPC variant allowing client to pass +in already available WallTime value. + + +rpc-stats.h + +{: .bad-code} +```c++ +static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m); +``` + +{: .new} +```c++ +static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m, + WallTime now); +``` + +clientchannel.cc + +{: .bad-code} +```c++ +const WallTime now = WallTime_Now(); +... +RPC_Stats::RecordRPC(stats_name, m); +``` + +{: .new} +```c++ +const WallTime now = WallTime_Now(); +... +RPC_Stats::RecordRPC(stats_name, m, now); +``` + +
    + +### Thread-compatible vs. Thread-safe types {#thread-compatible-vs-thread-safe-types} + +A type may be either thread-compatible (synchronized externally) or thread-safe +(synchronized internally). Most generally used types should be +thread-compatible. This way callers who do not need thread-safety don't pay for +it. + +
    + +Make a class thread-compatible since callers are already +synchronized. + + +hitless-transfer-phase.cc + +{: .bad-code} +```c++ +TransferPhase HitlessTransferPhase::get() const { + static CallsiteMetrics cm("HitlessTransferPhase::get"); + MonitoredMutexLock l(&cm, &mutex_); + return phase_; +} +``` + +{: .new} +```c++ +TransferPhase HitlessTransferPhase::get() const { return phase_; } +``` + +hitless-transfer-phase.cc + +{: .bad-code} +```c++ +bool HitlessTransferPhase::AllowAllocate() const { + static CallsiteMetrics cm("HitlessTransferPhase::AllowAllocate"); + MonitoredMutexLock l(&cm, &mutex_); + return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout; +} +``` + +{: .new} +```c++ +bool HitlessTransferPhase::AllowAllocate() const { + return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout; +} +``` + +
    +

However, if the typical use of a type needs synchronization, prefer to move the
synchronization inside the type. This allows the synchronization mechanism to
be tweaked as necessary to improve performance (e.g., sharding to reduce
contention) without affecting callers.

## Algorithmic improvements {#algorithmic-improvements}

The most critical opportunities for performance improvements come from
algorithmic improvements, e.g., turning an O(N²) algorithm into O(N lg(N)) or
O(N), avoiding potentially exponential behavior, etc. These opportunities are
rare in stable code, but are worth paying attention to when writing new code. A
few examples that show such improvements to pre-existing code:
    + +Add nodes to cycle detection structure in reverse +post-order. + + +We were previously adding graph nodes and edges one at a time to a +cycle-detection data structure, which required expensive work per edge. We now +add the entire graph in reverse post-order, which makes cycle-detection trivial. + +graphcycles.h + +{: .bad-code} +```c++ +class GraphCycles : public util_graph::Graph { + public: + GraphCycles(); + ~GraphCycles() override; + + using Node = util_graph::Node; +``` + +{: .new} +```c++ +class GraphCycles : public util_graph::Graph { + public: + GraphCycles(); + ~GraphCycles() override; + + using Node = util_graph::Node; + + // InitFrom adds all the nodes and edges from src, returning true if + // successful, false if a cycle is encountered. + // REQUIRES: no nodes and edges have been added to GraphCycles yet. + bool InitFrom(const util_graph::Graph& src); +``` + +graphcycles.cc + +{: .new} +```c++ +bool GraphCycles::InitFrom(const util_graph::Graph& src) { + ... + // Assign ranks in topological order so we don't need any reordering during + // initialization. For an acyclic graph, DFS leaves nodes in reverse + // topological order, so we assign decreasing ranks to nodes as we leave them. + Rank last_rank = n; + auto leave = [&](util_graph::Node node) { + DCHECK(r->rank[node] == kMissingNodeRank); + NodeInfo* nn = &r->nodes[node]; + nn->in = kNil; + nn->out = kNil; + r->rank[node] = --last_rank; + }; + util_graph::DFSAll(src, std::nullopt, leave); + + // Add all the edges (detect cycles as we go). + bool have_cycle = false; + util_graph::PerEdge(src, [&](util_graph::Edge e) { + DCHECK_NE(r->rank[e.src], kMissingNodeRank); + DCHECK_NE(r->rank[e.dst], kMissingNodeRank); + if (r->rank[e.src] >= r->rank[e.dst]) { + have_cycle = true; + } else if (!HasEdge(e.src, e.dst)) { + EdgeListAddNode(r, &r->nodes[e.src].out, e.dst); + EdgeListAddNode(r, &r->nodes[e.dst].in, e.src); + } + }); + if (have_cycle) { + return false; + } else { + DCHECK(CheckInvariants()); + return true; + } +} +``` + +graph_partitioner.cc + +{: .bad-code} +```c++ +absl::Status MergeGraph::Init() { + const Graph& graph = *compiler_->graph(); + clusters_.resize(graph.NodeLimit()); + graph.PerNode([&](Node node) { + graph_->AddNode(node); + NodeList* n = new NodeList; + n->push_back(node); + clusters_[node] = n; + }); + absl::Status s; + PerEdge(graph, [&](Edge e) { + if (!s.ok()) return; + if (graph_->HasEdge(e.src, e.dst)) return; // already added + if (!graph_->InsertEdge(e.src, e.dst)) { + s = absl::InvalidArgumentError("cycle in the original graph"); + } + }); + return s; +} +``` + +{: .new} +```c++ +absl::Status MergeGraph::Init() { + const Graph& graph = *compiler_->graph(); + if (!graph_->InitFrom(graph)) { + return absl::InvalidArgumentError("cycle in the original graph"); + } + clusters_.resize(graph.NodeLimit()); + graph.PerNode([&](Node node) { + NodeList* n = new NodeList; + n->push_back(node); + clusters_[node] = n; + }); + return absl::OkStatus(); +} +``` + +
    + +
    +

Replace the deadlock detection system built into a mutex
implementation with a better algorithm.


Replaced the deadlock detection algorithm with one that is ~50x as fast and
scales to millions of mutexes without problems (the old algorithm relied on a
2K limit to avoid a performance cliff). The new code is based on the following
paper: David J. Pearce and Paul H. J. Kelly, "A Dynamic Topological Sort
Algorithm for Directed Acyclic Graphs", Journal of Experimental Algorithmics
(JEA), Volume 11, 2006, Article No. 1.7.

The new algorithm takes O(|V|+|E|) space (instead of the O(|V|^2) bits needed by
the older algorithm). Lock-acquisition order graphs are very sparse, so this is
much less space. The algorithm is also quite simple: the core of it is ~100
lines of C++. Since the code now scales to a much larger number of mutexes, we
were able to relax an artificial 2K limit, which uncovered a number of latent
deadlocks in real programs.

Benchmark results: these were run in DEBUG mode since deadlock detection is
mainly enabled in debug mode. The benchmark argument (/2k etc.) is the number of
tracked nodes. At the default 2k limit of the old algorithm, the new algorithm
takes only 0.5 microseconds per InsertEdge compared to 22 microseconds for the
old algorithm. The new algorithm also easily scales to much larger graphs
without problems whereas the old algorithm keels over quickly.

{: .bad-code}
```
DEBUG: Benchmark                Time(ns)    CPU(ns)  Iterations
----------------------------------------------------------------
DEBUG: BM_StressTest/2k            23553      23566       29086
DEBUG: BM_StressTest/4k            45879      45909       15287
DEBUG: BM_StressTest/16k          776938     777472         817
```

{: .new}
```
DEBUG: BM_StressTest/2k              392        393    10485760
DEBUG: BM_StressTest/4k              392        393    10485760
DEBUG: BM_StressTest/32k             407        407    10485760
DEBUG: BM_StressTest/256k            456        456    10485760
DEBUG: BM_StressTest/1M              534        534    10485760
```
    + +
    +

Replace an IntervalMap (with O(lg N) lookups) with a hash
table (O(1) lookups).


The initial code was using IntervalMap because it seemed like the right data
structure to support coalescing of adjacent blocks, but a hash table suffices
since the adjacent block can be found by a hash table lookup. This (plus other
changes in the CL) improves the performance of tpu::BestFitAllocator by ~4X.

best_fit_allocator.h

{: .bad-code}
```c++
using Block = gtl::IntervalMap::Entry;
...
// Map of pairs (address range, BlockState) with one entry for each allocation
// covering the range [0, allocatable_range_end_). Adjacent kFree and
// kReserved blocks are coalesced. Adjacent kAllocated blocks are not
// coalesced.
gtl::IntervalMap block_list_;

// Set of all free blocks sorted according to the allocation policy. Adjacent
// free blocks are coalesced.
std::set free_list_;
```

{: .new}
```c++
// A faster hash function for offsets in the BlockTable
struct OffsetHash {
  ABSL_ATTRIBUTE_ALWAYS_INLINE size_t operator()(int64 value) const {
    uint64 m = value;
    m *= uint64_t{0x9ddfea08eb382d69};
    return static_cast<size_t>(m ^ (m >> 32));
  }
};

// Hash table maps from block start address to block info.
// We include the length of the previous block in this info so we
// can find the preceding block to coalesce with.
struct HashTableEntry {
  BlockState state;
  int64 my_length;
  int64 prev_length;  // Zero if there is no previous block.
};
using BlockTable = absl::flat_hash_map<int64, HashTableEntry, OffsetHash>;
```
    + +
    + +Replace sorted-list intersection (O(N log N)) with hash +table lookups (O(N)). + + +Old code to detect whether or not two nodes share a common source would get the +sources for each node in sorted order and then do a sorted intersection. The new +code places the sources for one node in a hash-table and then iterates over the +other node's sources checking the hash-table. + +{: .bench} +``` +name old time/op new time/op delta +BM_CompileLarge 28.5s ± 2% 22.4s ± 2% -21.61% (p=0.008 n=5+5) +``` + +
    + +
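
In the same spirit, here is a small self-contained illustration (not from a CL)
of replacing a quadratic pairwise scan with hash-set lookups:

```c++
#include <cstdint>
#include <vector>

#include "absl/container/flat_hash_set.h"

// O(N^2): compare every pair of elements.
bool HasDuplicateQuadratic(const std::vector<uint64_t>& ids) {
  for (size_t i = 0; i < ids.size(); ++i) {
    for (size_t j = i + 1; j < ids.size(); ++j) {
      if (ids[i] == ids[j]) return true;
    }
  }
  return false;
}

// O(N): remember what we have already seen in a hash set.
bool HasDuplicate(const std::vector<uint64_t>& ids) {
  absl::flat_hash_set<uint64_t> seen;
  seen.reserve(ids.size());
  for (uint64_t id : ids) {
    if (!seen.insert(id).second) return true;
  }
  return false;
}
```
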
    + +Implement good hash function so that things are O(1) +instead of O(N). + + +location.h + +{: .bad-code} +```c++ +// Hasher for Location objects. +struct LocationHash { + size_t operator()(const Location* key) const { + return key != nullptr ? util_hash::Hash(key->address()) : 0; + } +}; +``` + +{: .new} +```c++ +size_t HashLocation(const Location& loc); +... +struct LocationHash { + size_t operator()(const Location* key) const { + return key != nullptr ? HashLocation(*key) : 0; + } +}; +``` + +location.cc + +{: .new} +```c++ +size_t HashLocation(const Location& loc) { + util_hash::MurmurCat m; + + // Encode some simpler features into a single value. + m.AppendAligned((loc.dynamic() ? 1 : 0) // + | (loc.append_shard_to_address() ? 2 : 0) // + | (loc.is_any() ? 4 : 0) // + | (!loc.any_of().empty() ? 8 : 0) // + | (loc.has_shardmap() ? 16 : 0) // + | (loc.has_sharding() ? 32 : 0)); + + if (loc.has_shardmap()) { + m.AppendAligned(loc.shardmap().output() | + static_cast(loc.shardmap().stmt()) << 20); + } + if (loc.has_sharding()) { + uint64_t num = 0; + switch (loc.sharding().type_case()) { + case Sharding::kModShard: + num = loc.sharding().mod_shard(); + break; + case Sharding::kRangeSplit: + num = loc.sharding().range_split(); + break; + case Sharding::kNumShards: + num = loc.sharding().num_shards(); + break; + default: + num = 0; + break; + } + m.AppendAligned(static_cast(loc.sharding().type_case()) | + (num << 3)); + } + + auto add_string = [&m](absl::string_view s) { + if (!s.empty()) { + m.Append(s.data(), s.size()); + } + }; + + add_string(loc.address()); + add_string(loc.lb_policy()); + + // We do not include any_of since it is complicated to compute a hash + // value that is not sensitive to order and duplication. + return m.GetHash(); +} +``` + +
    + +## Better memory representation {#better-memory-representation} + +Careful consideration of memory footprint and cache footprint of important data +structures can often yield big savings. The data structures below focus on +supporting common operations by touching fewer cache lines. Care taken here can +(a) avoid expensive cache misses (b) reduce memory bus traffic, which speeds up +both the program in question and anything else running on the same machine. They +rely on some common techniques you may find useful when designing your own data +structures. + +### Compact data structures + +Use compact representations for data that will be accessed often or that +comprises a large portion of the application's memory usage. A compact +representation can significantly reduce memory usage and improve performance by +touching fewer cache lines and reducing memory bus bandwidth usage. However, +watch out for [cache-line contention](#reduce-false-sharing). + +### Memory layout {#memory-layout} + +Carefully consider the memory layout of types that have a large memory or cache +footprint. + +* Reorder fields to reduce padding between fields with different alignment + requirements + (see [class layout discussion](https://stackoverflow.com/questions/9989164/optimizing-memory-layout-of-class-instances-in-c)). +* Use smaller numeric types where the stored data will fit in the smaller + type. +* Enum values sometimes take up a whole word unless you're careful. Consider + using a smaller representation (e.g., use `enum class OpType : uint8_t { ... + }` instead of `enum class OpType { ... }`). +* Order fields so that fields that are frequently accessed together are closer + to each other – this will reduce the number of cache lines touched on common + operations. +* Place hot read-only fields away from hot mutable fields so that writes to + the mutable fields do not cause the read-only fields to be evicted from + nearby caches. +* Move cold data so it does not live next to hot data, either by placing the + cold data at the end of the struct, or behind a level of indirection, or in + a separate array. +* Consider packing things into fewer bytes by using bit and byte-level + encoding. This can be complicated, so only do this when the data under + question is encapsulated inside a well-tested module, and the overall + reduction of memory usage is significant. Furthermore, watch out for side + effects like under-alignment of frequently used data, or more expensive code + for accessing packed representations. Validate such changes using + benchmarks. + +### Indices instead of pointers {#indices-instead-of-pointers} + +On modern 64-bit machines, pointers take up 64 bits. If you have a pointer-rich +data structure, you can easily chew up lots of memory with indirections of T\*. +Instead, consider using integer indices into an array T[] or other data +structure. Not only will the references be smaller (if the number of indices is +small enough to fit in 32 or fewer bits), but the storage for all the T[] +elements will be contiguous, often leading to better cache locality. + +### Batched storage + +Avoid data structures that allocate a separate object per stored element (e.g., +`std::map`, `std::unordered_map` in C++). Instead, consider types that use +chunked or flat representations to store multiple elements in close proximity in +memory (e.g., `std::vector`, `absl::flat_hash_{map,set}` in C++). Such types +tend to have much better cache behavior. Furthermore, they encounter less +allocator overhead. 
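
As a small illustration of the difference (the types here are just for
contrast, not from a specific CL): a node-based map allocates every entry
separately, while a flat container keeps entries in a few large blocks of
memory.

```c++
#include <cstdint>
#include <map>

#include "absl/container/flat_hash_map.h"

struct Stats {
  int64_t count = 0;
  int64_t bytes = 0;
};

// Node-based: each insertion allocates a separate tree node, so a lookup or
// iteration tends to touch a different cache line per element visited.
std::map<uint64_t, Stats> node_based;

// Flat: entries live in one contiguous backing array (plus control bytes), so
// there are far fewer allocations and fewer cache lines touched per operation.
absl::flat_hash_map<uint64_t, Stats> flat;
```
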
+ +One useful technique is to partition elements into chunks where each chunk can +hold a fixed number of elements. This technique can reduce the cache footprint +of a data structure significantly while preserving good asymptotic behavior. + +For some data structures, a single chunk suffices to hold all elements (e.g., +strings and vectors). Other types (e.g., `absl::flat_hash_map`) also use this +technique. + +### Inlined storage {#inlined-storage} + +Some container types are optimized for storing a small number of elements. These +types provide space for a small number of elements at the top level and +completely avoid allocations when the number of elements is small. This can be +very helpful when instances of such types are constructed often (e.g., as stack +variables in frequently executed code), or if many instances are live at the +same time. If a container will typically contain a small number of elements +consider using one of the inlined storage types, e.g., InlinedVector. + +Caveat: if `sizeof(T)` is large, inlined storage containers may not be the best +choice since the inlined backing store will be large. + +### Unnecessarily nested maps + +Sometimes a nested map data structure can be replaced with a single-level map +with a compound key. This can reduce the cost of lookups and insertions +significantly. + +
    + +Reduce allocations and improve cache footprint by +converting btree<a,btree<b,c>> to btree<pair<a,b>,c>. + + +graph_splitter.cc + +{: .bad-code} +```c++ +absl::btree_map> ops; +``` + +{: .new} +```c++ +// The btree maps from {package_name, op_name} to its const Opdef*. +absl::btree_map, + const OpDef*> + ops; +``` + +
    + +Caveat: if the first map key is big, it might be better to stick with nested +maps: + +
    +

Switching to a nested map leads to a 76% performance
improvement in a microbenchmark.


We previously had a single-level hash table where the key consisted of a
(string) path and some other numeric sub-keys. Each path occurred in
approximately 1000 keys on average. We split the hash table into two levels
where the first level was keyed by the path and each second-level hash table
kept just the sub-key to data mapping for a particular path. This reduced the
memory usage for storing paths by a factor of 1000, and also sped up accesses.
    + +### Arenas {#arenas} + +Arenas can help reduce memory allocation cost, but they also have the benefit of +packing together independently allocated items next to each other, typically in +fewer cache lines, and eliminating most destruction costs. They are likely most +effective for complex data structures with many sub-objects. Consider providing +an appropriate initial size for the arena since that can help reduce +allocations. + +Caveat: it is easy to misuse arenas by putting too many short-lived objects in a +long-lived arena, which can unnecessarily bloat memory footprint. + +### Arrays instead of maps + +If the domain of a map can be represented by a small integer or is an enum, or +if the map will have very few elements, the map can sometimes be replaced by an +array or a vector of some form. + +
    +

Use an array instead of flat_map.


rtp_controller.h

{: .bad-code}
```c++
const gtl::flat_map payload_type_to_clock_frequency_;
```

{: .new}
```c++
// A map (implemented as a simple array) indexed by payload_type to clock freq
// for that payload type (or 0).
struct PayloadTypeToClockRateMap {
  int map[128];
};
...
const PayloadTypeToClockRateMap payload_type_to_clock_frequency_;
```
    + +### Bit vectors instead of sets + +If the domain of a set can be represented by a small integer, the set can be +replaced with a bit vector (InlinedBitVector is often a good choice). Set +operations can also be nicely efficient on these representations using bitwise +boolean operations (OR for union, AND for intersection, etc.). + +
    + +Spanner placement system. Replace +dense_hash_set<ZoneId> with a bit-vector with one bit per zone. + + +zone_set.h + +{: .bad-code} +```c++ +class ZoneSet: public dense_hash_set { + public: + ... + bool Contains(ZoneId zone) const { + return count(zone) > 0; + } +``` + +{: .new} +```c++ +class ZoneSet { + ... + // Returns true iff "zone" is contained in the set + bool ContainsZone(ZoneId zone) const { + return zone < b_.size() && b_.get_bit(zone); + } + ... + private: + int size_; // Number of zones inserted + util::bitmap::InlinedBitVector<256> b_; +``` + +Benchmark results: + +{: .bench} +``` +CPU: AMD Opteron (4 cores) dL1:64KB dL2:1024KB +Benchmark Base (ns) New (ns) Improvement +------------------------------------------------------------------ +BM_Evaluate/1 960 676 +29.6% +BM_Evaluate/2 1661 1138 +31.5% +BM_Evaluate/3 2305 1640 +28.9% +BM_Evaluate/4 3053 2135 +30.1% +BM_Evaluate/5 3780 2665 +29.5% +BM_Evaluate/10 7819 5739 +26.6% +BM_Evaluate/20 17922 12338 +31.2% +BM_Evaluate/40 36836 26430 +28.2% +``` + +
    + +
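
A minimal sketch of the bitwise set operations mentioned above, using
`std::bitset` for illustration (the CLs in this section use InlinedBitVector
and Bitmap instead):

```c++
#include <bitset>

// Sets over a small integer domain [0, 256) represented as bit vectors.
using ZoneBits = std::bitset<256>;

bool Contains(const ZoneBits& s, int zone) { return s.test(zone); }

ZoneBits Union(const ZoneBits& a, const ZoneBits& b) { return a | b; }
ZoneBits Intersection(const ZoneBits& a, const ZoneBits& b) { return a & b; }
ZoneBits Difference(const ZoneBits& a, const ZoneBits& b) { return a & ~b; }
```
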
    + +Use bit matrix to keep track of reachability properties +between operands instead of hash table. + + +hlo_computation.h + +{: .bad-code} +```c++ +using TransitiveOperandMap = + std::unordered_map>; +``` + +{: .new} +```c++ +class HloComputation::ReachabilityMap { + ... + // dense id assignment from HloInstruction* to number + tensorflow::gtl::FlatMap ids_; + // matrix_(a,b) is true iff b is reachable from a + tensorflow::core::Bitmap matrix_; +}; +``` + +
    + +## Reduce allocations {#reduce-allocations} + +Memory allocation adds costs: + +1. It increases the time spent in the allocator. +2. Newly-allocated objects may require expensive initialization and sometimes + corresponding expensive destruction when no longer needed. +3. Every allocation tends to be on a new cache line and therefore data spread + across many independent allocations will have a larger cache footprint than + data spread across fewer allocations. + +Garbage-collection runtimes sometimes obviate issue #3 by placing consecutive +allocations sequentially in memory. + +### Avoid unnecessary allocations {#avoid-unnecessary-allocations} + +
    +

Reducing allocations increases benchmark throughput by
21%.


memory_manager.cc

{: .bad-code}
```c++
LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<DeviceInfo> dinfo,
                       bool is_batched)
    : tensor(std::move(t)),
      device_info(dinfo ? std::move(dinfo) : std::make_shared<DeviceInfo>()),
      is_batched(is_batched) {
```

{: .new}
```c++
static const std::shared_ptr<DeviceInfo>& empty_device_info() {
  static std::shared_ptr<DeviceInfo>* result =
      new std::shared_ptr<DeviceInfo>(new DeviceInfo);
  return *result;
}

LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<DeviceInfo> dinfo,
                       bool is_batched)
    : tensor(std::move(t)), is_batched(is_batched) {
  if (dinfo) {
    device_info = std::move(dinfo);
  } else {
    device_info = empty_device_info();
  }
```
    + +
    + +Use statically-allocated zero vector when possible rather +than allocating a vector and filling it with zeroes. + + +embedding_executor_8bit.cc + +{: .bad-code} +```c++ +// The actual implementation of the EmbeddingLookUpT using template parameters +// instead of object members to improve the performance. +template +static tensorflow::Status EmbeddingLookUpT(...) { + ... + std::unique_ptr zero_data( + new tensorflow::quint8[max_embedding_width]); + memset(zero_data.get(), 0, sizeof(tensorflow::quint8) * max_embedding_width); +``` + +{: .new} +```c++ +// A size large enough to handle most embedding widths +static const int kTypicalMaxEmbedding = 256; +static tensorflow::quint8 static_zero_data[kTypicalMaxEmbedding]; // All zeroes +... +// The actual implementation of the EmbeddingLookUpT using template parameters +// instead of object members to improve the performance. +template +static tensorflow::Status EmbeddingLookUpT(...) { + ... + std::unique_ptr zero_data_backing(nullptr); + + // Get a pointer to a memory area with at least + // "max_embedding_width" quint8 zero values. + tensorflow::quint8* zero_data; + if (max_embedding_width <= ARRAYSIZE(static_zero_data)) { + // static_zero_data is big enough so we don't need to allocate zero data + zero_data = &static_zero_data[0]; + } else { + // static_zero_data is not big enough: we need to allocate zero data + zero_data_backing = + absl::make_unique(max_embedding_width); + memset(zero_data_backing.get(), 0, + sizeof(tensorflow::quint8) * max_embedding_width); + zero_data = zero_data_backing.get(); + } +``` + +
    + +Also, prefer stack allocation over heap allocation when object lifetime is +bounded by the scope (although be careful with stack frame sizes for large +objects). + +### Resize or reserve containers {#resize-or-reserve-containers} + +When the maximum or expected maximum size of a vector (or some other container +types) is known in advance, pre-size the container's backing store (e.g., using +`resize` or `reserve` in C++). + +
    + +Pre-size a vector and fill it in, rather than N push_back +operations. + + +indexblockdecoder.cc + +{: .bad-code} +```c++ +for (int i = 0; i < ndocs-1; i++) { + uint32 delta; + ERRORCHECK(b->GetRice(rice_base, &delta)); + docs_.push_back(DocId(my_shard_ + (base + delta) * num_shards_)); + base = base + delta + 1; +} +docs_.push_back(last_docid_); +``` + +{: .new} +```c++ +docs_.resize(ndocs); +DocId* docptr = &docs_[0]; +for (int i = 0; i < ndocs-1; i++) { + uint32 delta; + ERRORCHECK(b.GetRice(rice_base, &delta)); + *docptr = DocId(my_shard_ + (base + delta) * num_shards_); + docptr++; + base = base + delta + 1; +} +*docptr = last_docid_; +``` + +
    + +Caveat: Do not use `resize` or `reserve` to grow one element at a time since +that may lead to quadratic behavior. Also, if element construction is expensive, +prefer an initial `reserve` call followed by several `push_back` or +`emplace_back` calls instead of an initial `resize` since that will double the +number of constructor calls. + +### Avoid copying when possible {#avoid-copying-when-possible} + +* Prefer moving to copying data structures when possible. +* If lifetime is not an issue, store pointers or indices instead of copies of + objects in transient data structures. E.g., if a local map is used to select + a set of protos from an incoming list of protos, we can make the map store + just pointers to the incoming protos instead of copying potentially deeply + nested data. Another common example is sorting a vector of indices rather + than sorting a vector of large objects directly since the latter would incur + significant copying/moving costs. + +
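
A small sketch of the last bullet above (sorting indices instead of the large
objects themselves); the `Record` type is made up for illustration:

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct Record {
  std::string payload;  // potentially large
  int key = 0;
};

// Sort indices into `records` by key instead of moving the large records.
std::vector<size_t> SortedOrder(const std::vector<Record>& records) {
  std::vector<size_t> order(records.size());
  std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return records[a].key < records[b].key;
  });
  return order;  // records[order[0]] has the smallest key, etc.
}
```
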
    + +Avoid an extra copy when receiving a tensor via gRPC. + + +A benchmark that sends around 400KB tensors speeds up by ~10-15%: + +{: .bad-code} +``` +Benchmark Time(ns) CPU(ns) Iterations +----------------------------------------------------- +BM_RPC/30/98k_mean 148764691 1369998944 1000 +``` + +{: .new} +``` +Benchmark Time(ns) CPU(ns) Iterations +----------------------------------------------------- +BM_RPC/30/98k_mean 131595940 1216998084 1000 +``` + +
    + +
    + +Move large options structure rather than copying it. + + +index.cc + +{: .bad-code} +```c++ +return search_iterators::DocPLIteratorFactory::Create(opts); +``` + +{: .new} +```c++ +return search_iterators::DocPLIteratorFactory::Create(std::move(opts)); +``` + +
    + +
    + +Use std::sort instead of std::stable_sort, which avoids +an internal copy inside the stable sort implementation. + + +encoded-vector-hits.h + +{: .bad-code} +```c++ +std::stable_sort(hits_.begin(), hits_.end(), + gtl::OrderByField(&HitWithPayloadOffset::docid)); +``` + +{: .new} +```c++ +struct HitWithPayloadOffset { + search_iterators::LocalDocId64 docid; + int first_payload_offset; // offset into the payload vector. + int num_payloads; + + bool operator<(const HitWithPayloadOffset& other) const { + return (docid < other.docid) || + (docid == other.docid && + first_payload_offset < other.first_payload_offset); + } +}; + ... + std::sort(hits_.begin(), hits_.end()); +``` + +
    + +### Reuse temporary objects + +A container or an object declared inside a loop will be recreated on every loop +iteration. This can lead to expensive construction, destruction, and resizing. +Hoisting the declaration outside the loop enables reuse and can provide a +significant performance boost. (Compilers are often unable to do such hoisting +on their own due to language semantics or their inability to ensure program +equivalence.) + +
    + +Hoist variable definition outside of loop iteration. + + +autofdo_profile_utils.h + +{: .bad-code} +```c++ +auto iterator = absl::WrapUnique(sstable->GetIterator()); +while (!iterator->done()) { + T profile; + if (!profile.ParseFromString(iterator->value_view())) { + return absl::InternalError( + "Failed to parse mem_block to specified profile type."); + } + ... + iterator->Next(); +} +``` + +{: .new} +```c++ +auto iterator = absl::WrapUnique(sstable->GetIterator()); +T profile; +while (!iterator->done()) { + if (!profile.ParseFromString(iterator->value_view())) { + return absl::InternalError( + "Failed to parse mem_block to specified profile type."); + } + ... + iterator->Next(); +} +``` + +
    + +
    + +Define a protobuf variable outside a loop so that its +allocated storage can be reused across loop iterations. + + +stats-router.cc + +{: .bad-code} +```c++ +for (auto& r : routers_to_update) { + ... + ResourceRecord record; + { + MutexLock agg_lock(r.agg->mutex()); + r.agg->AddResourceRecordUsages(measure_indices, &record); + } + ... +} +``` + +{: .new} +```c++ +ResourceRecord record; +for (auto& r : routers_to_update) { + ... + record.Clear(); + { + MutexLock agg_lock(r.agg->mutex()); + r.agg->AddResourceRecordUsages(measure_indices, &record); + } + ... +} +``` + +
    + +
    + +Serialize to same std::string repeatedly. + + +program_rep.cc + +{: .bad-code} +```c++ +std::string DeterministicSerialization(const proto2::Message& m) { + std::string result; + proto2::io::StringOutputStream sink(&result); + proto2::io::CodedOutputStream out(&sink); + out.SetSerializationDeterministic(true); + m.SerializePartialToCodedStream(&out); + return result; +} +``` + +{: .new} +```c++ +absl::string_view DeterministicSerializationTo(const proto2::Message& m, + std::string* scratch) { + scratch->clear(); + proto2::io::StringOutputStream sink(scratch); + proto2::io::CodedOutputStream out(&sink); + out.SetSerializationDeterministic(true); + m.SerializePartialToCodedStream(&out); + return absl::string_view(*scratch); +} +``` + +
    + +Caveat: protobuf, string, vector, containers etc. tend to grow to the size of +the largest value ever stored in them. Therefore reconstructing them +periodically (e.g., after every N uses) can help reduce memory requirements and +reinitialization costs. + +## Avoid unnecessary work {#avoid-unnecessary-work} + +Perhaps one of the most effective categories of improving performance is +avoiding work you don't have to do. This can take many forms, including creating +specialized paths through code for common cases that avoid more general +expensive computation, precomputation, deferring work until it is really needed, +hoisting work into less-frequently executed pieces of code, and other similar +approaches. Below are many examples of this general approach, categorized into a +few representative categories. + +### Fast paths for common cases + +Often, code is written to cover all cases, but some subset of the cases are much +simpler and more common than others. E.g., `vector::push_back` usually has +enough space for the new element, but contains code to resize the underlying +storage when it does not. Some attention paid to the structure of code can help +make the common simple case faster without hurting uncommon case performance +significantly. + +
    + +Make fast path cover more common cases. + + +Add handling of trailing single ASCII bytes, rather than only handling multiples +of four bytes with this routine. This avoids calling the slower generic routine +for all-ASCII strings that are, for example, 5 bytes. + +utf8statetable.cc + +{: .bad-code} +```c++ +// Scan a UTF-8 stringpiece based on state table. +// Always scan complete UTF-8 characters +// Set number of bytes scanned. Return reason for exiting +// OPTIMIZED for case of 7-bit ASCII 0000..007f all valid +int UTF8GenericScanFastAscii(const UTF8ScanObj* st, absl::string_view str, + int* bytes_consumed) { + ... + int exit_reason; + do { + // Skip 8 bytes of ASCII at a whack; no endianness issue + while ((src_limit - src >= 8) && + (((UNALIGNED_LOAD32(src + 0) | UNALIGNED_LOAD32(src + 4)) & + 0x80808080) == 0)) { + src += 8; + } + // Run state table on the rest + int rest_consumed; + exit_reason = UTF8GenericScan( + st, absl::ClippedSubstr(str, src - initial_src), &rest_consumed); + src += rest_consumed; + } while (exit_reason == kExitDoAgain); + + *bytes_consumed = src - initial_src; + return exit_reason; +} +``` + +{: .new} +```c++ +// Scan a UTF-8 stringpiece based on state table. +// Always scan complete UTF-8 characters +// Set number of bytes scanned. Return reason for exiting +// OPTIMIZED for case of 7-bit ASCII 0000..007f all valid +int UTF8GenericScanFastAscii(const UTF8ScanObj* st, absl::string_view str, + int* bytes_consumed) { + ... + int exit_reason = kExitOK; + do { + // Skip 8 bytes of ASCII at a whack; no endianness issue + while ((src_limit - src >= 8) && + (((UNALIGNED_LOAD32(src + 0) | UNALIGNED_LOAD32(src + 4)) & + 0x80808080) == 0)) { + src += 8; + } + while (src < src_limit && Is7BitAscii(*src)) { // Skip ASCII bytes + src++; + } + if (src < src_limit) { + // Run state table on the rest + int rest_consumed; + exit_reason = UTF8GenericScan( + st, absl::ClippedSubstr(str, src - initial_src), &rest_consumed); + src += rest_consumed; + } + } while (exit_reason == kExitDoAgain); + + *bytes_consumed = src - initial_src; + return exit_reason; +} +``` + +
    + +
    + +Simpler fast paths for InlinedVector. + + +inlined_vector.h + +{: .bad-code} +```c++ +auto Storage::Resize(ValueAdapter values, size_type new_size) -> void { + StorageView storage_view = MakeStorageView(); + + IteratorValueAdapter move_values( + MoveIterator(storage_view.data)); + + AllocationTransaction allocation_tx(GetAllocPtr()); + ConstructionTransaction construction_tx(GetAllocPtr()); + + absl::Span construct_loop; + absl::Span move_construct_loop; + absl::Span destroy_loop; + + if (new_size > storage_view.capacity) { + ... + } else if (new_size > storage_view.size) { + construct_loop = {storage_view.data + storage_view.size, + new_size - storage_view.size}; + } else { + destroy_loop = {storage_view.data + new_size, storage_view.size - new_size}; + } +``` + +{: .new} +```c++ +auto Storage::Resize(ValueAdapter values, size_type new_size) -> void { + StorageView storage_view = MakeStorageView(); + auto* const base = storage_view.data; + const size_type size = storage_view.size; + auto* alloc = GetAllocPtr(); + if (new_size <= size) { + // Destroy extra old elements. + inlined_vector_internal::DestroyElements(alloc, base + new_size, + size - new_size); + } else if (new_size <= storage_view.capacity) { + // Construct new elements in place. + inlined_vector_internal::ConstructElements(alloc, base + size, &values, + new_size - size); + } else { + ... + } +``` + +
    + +
    + +Fast path for common cases of initializing 1-D to 4-D +tensors. + + +tensor_shape.cc + +{: .bad-code} +```c++ +template +TensorShapeBase::TensorShapeBase(gtl::ArraySlice dim_sizes) { + set_tag(REP16); + set_data_type(DT_INVALID); + set_ndims_byte(0); + set_num_elements(1); + for (int64 s : dim_sizes) { + AddDim(internal::SubtleMustCopy(s)); + } +} +``` + +{: .new} +```c++ +template +void TensorShapeBase::InitDims(gtl::ArraySlice dim_sizes) { + DCHECK_EQ(tag(), REP16); + + // Allow sizes that are under kint64max^0.25 so that 4-way multiplication + // below cannot overflow. + static const uint64 kMaxSmall = 0xd744; + static_assert(kMaxSmall * kMaxSmall * kMaxSmall * kMaxSmall <= kint64max, + "bad overflow check"); + bool large_size = false; + for (auto s : dim_sizes) { + if (s > kMaxSmall) { + large_size = true; + break; + } + } + + if (!large_size) { + // Every size fits in 16 bits; use fast-paths for dims in {1,2,3,4}. + uint16* dst = as16()->dims_; + switch (dim_sizes.size()) { + case 1: { + set_ndims_byte(1); + const int64 size = dim_sizes[0]; + const bool neg = Set16(kIsPartial, dst, 0, size); + set_num_elements(neg ? -1 : size); + return; + } + case 2: { + set_ndims_byte(2); + const int64 size0 = dim_sizes[0]; + const int64 size1 = dim_sizes[1]; + bool neg = Set16(kIsPartial, dst, 0, size0); + neg |= Set16(kIsPartial, dst, 1, size1); + set_num_elements(neg ? -1 : (size0 * size1)); + return; + } + case 3: { + ... + } + case 4: { + ... + } + } + } + + set_ndims_byte(0); + set_num_elements(1); + for (int64 s : dim_sizes) { + AddDim(internal::SubtleMustCopy(s)); + } +} +``` + +
    + +
    + +Make varint parser fast path cover just the 1-byte case, +instead of covering 1-byte and 2-byte cases. + + +Reducing the size of the (inlined) fast path reduces code size and icache +pressure, which leads to improved performance. + +parse_context.h + +{: .bad-code} +```c++ +template +PROTOBUF_NODISCARD const char* VarintParse(const char* p, T* out) { + auto ptr = reinterpret_cast(p); + uint32_t res = ptr[0]; + if (!(res & 0x80)) { + *out = res; + return p + 1; + } + uint32_t byte = ptr[1]; + res += (byte - 1) << 7; + if (!(byte & 0x80)) { + *out = res; + return p + 2; + } + return VarintParseSlow(p, res, out); +} +``` + +{: .new} +```c++ +template +PROTOBUF_NODISCARD const char* VarintParse(const char* p, T* out) { + auto ptr = reinterpret_cast(p); + uint32_t res = ptr[0]; + if (!(res & 0x80)) { + *out = res; + return p + 1; + } + return VarintParseSlow(p, res, out); +} +``` + +parse_context.cc + +{: .bad-code} +```c++ +std::pair VarintParseSlow32(const char* p, + uint32_t res) { + for (std::uint32_t i = 2; i < 5; i++) { + ... +} +... +std::pair VarintParseSlow64(const char* p, + uint32_t res32) { + uint64_t res = res32; + for (std::uint32_t i = 2; i < 10; i++) { + ... +} +``` + +{: .new} +```c++ +std::pair VarintParseSlow32(const char* p, + uint32_t res) { + for (std::uint32_t i = 1; i < 5; i++) { + ... +} +... +std::pair VarintParseSlow64(const char* p, + uint32_t res32) { + uint64_t res = res32; + for (std::uint32_t i = 1; i < 10; i++) { + ... +} +``` + +
    + +
    + +Skip significant work in RPC_Stats_Measurement addition if +no errors have occurred. + + +rpc-stats.h + +{: .bad-code} +```c++ +struct RPC_Stats_Measurement { + ... + double errors[RPC::NUM_ERRORS]; +``` + +{: .new} +```c++ +struct RPC_Stats_Measurement { + ... + double get_errors(int index) const { return errors[index]; } + void set_errors(int index, double value) { + errors[index] = value; + any_errors_set = true; + } + private: + ... + // We make this private so that we can keep track of whether any of + // these values have been set to non-zero values. + double errors[RPC::NUM_ERRORS]; + bool any_errors_set; // True iff any of the errors[i] values are non-zero +``` + +rpc-stats.cc + +{: .bad-code} +```c++ +void RPC_Stats_Measurement::operator+=(const RPC_Stats_Measurement& x) { + ... + for (int i = 0; i < RPC::NUM_ERRORS; ++i) { + errors[i] += x.errors[i]; + } +} +``` + +{: .new} +```c++ +void RPC_Stats_Measurement::operator+=(const RPC_Stats_Measurement& x) { + ... + if (x.any_errors_set) { + for (int i = 0; i < RPC::NUM_ERRORS; ++i) { + errors[i] += x.errors[i]; + } + any_errors_set = true; + } +} +``` + +
    + +
    + +Do array lookup on first byte of string to often avoid +fingerprinting full string. + + +soft-tokens-helper.cc + +{: .bad-code} +```c++ +bool SoftTokensHelper::IsSoftToken(const StringPiece& token) const { + return soft_tokens_.find(Fingerprint(token.data(), token.size())) != + soft_tokens_.end(); +} +``` + +soft-tokens-helper.h + +{: .new} +```c++ +class SoftTokensHelper { + ... + private: + ... + // Since soft tokens are mostly punctuation-related, for performance + // purposes, we keep an array filter_. filter_[i] is true iff any + // of the soft tokens start with the byte value 'i'. This avoids + // fingerprinting a term in the common case, since we can just do an array + // lookup based on the first byte, and if filter_[b] is false, then + // we can return false immediately. + bool filter_[256]; + ... +}; + +inline bool SoftTokensHelper::IsSoftToken(const StringPiece& token) const { + if (token.size() >= 1) { + char first_char = token.data()[0]; + if (!filter_[first_char]) { + return false; + } + } + return IsSoftTokenFallback(token); +} +``` + +soft-tokens-helper.cc + +{: .new} +```c++ +bool SoftTokensHelper::IsSoftTokenFallback(const StringPiece& token) const { + return soft_tokens_.find(Fingerprint(token.data(), token.size())) != + soft_tokens_.end(); +} +``` + +
    + +### Precompute expensive information once + +
    + +Precompute a TensorFlow graph execution node property +that allows us to quickly rule out certain unusual cases. + + +executor.cc + +{: .bad-code} +```c++ +struct NodeItem { + ... + bool kernel_is_expensive = false; // True iff kernel->IsExpensive() + bool kernel_is_async = false; // True iff kernel->AsAsync() != nullptr + bool is_merge = false; // True iff IsMerge(node) + ... + if (IsEnter(node)) { + ... + } else if (IsExit(node)) { + ... + } else if (IsNextIteration(node)) { + ... + } else { + // Normal path for most nodes + ... + } +``` + +{: .new} +```c++ +struct NodeItem { + ... + bool kernel_is_expensive : 1; // True iff kernel->IsExpensive() + bool kernel_is_async : 1; // True iff kernel->AsAsync() != nullptr + bool is_merge : 1; // True iff IsMerge(node) + bool is_enter : 1; // True iff IsEnter(node) + bool is_exit : 1; // True iff IsExit(node) + bool is_control_trigger : 1; // True iff IsControlTrigger(node) + bool is_sink : 1; // True iff IsSink(node) + // True iff IsEnter(node) || IsExit(node) || IsNextIteration(node) + bool is_enter_exit_or_next_iter : 1; + ... + if (!item->is_enter_exit_or_next_iter) { + // Fast path for nodes types that don't need special handling + DCHECK_EQ(input_frame, output_frame); + ... + } else if (item->is_enter) { + ... + } else if (item->is_exit) { + ... + } else { + DCHECK(IsNextIteration(node)); + ... + } +``` + +
    + +
    + +Precompute 256 element array and use during trigram +initialization. + + +byte_trigram_classifier.cc + +{: .bad-code} +```c++ +void ByteTrigramClassifier::VerifyModel(void) const { + ProbT class_sums[num_classes_]; + for (int cls = 0; cls < num_classes_; cls++) { + class_sums[cls] = 0; + } + for (ByteNgramId id = 0; id < trigrams_.num_trigrams(); id++) { + for (int cls = 0; cls < num_classes_; ++cls) { + class_sums[cls] += Prob(trigram_probs_[id].log_probs[cls]); + } + } + ... +} +``` + +{: .new} +```c++ +void ByteTrigramClassifier::VerifyModel(void) const { + CHECK_EQ(sizeof(ByteLogProbT), 1); + ProbT fast_prob[256]; + for (int b = 0; b < 256; b++) { + fast_prob[b] = Prob(static_cast(b)); + } + + ProbT class_sums[num_classes_]; + for (int cls = 0; cls < num_classes_; cls++) { + class_sums[cls] = 0; + } + for (ByteNgramId id = 0; id < trigrams_.num_trigrams(); id++) { + for (int cls = 0; cls < num_classes_; ++cls) { + class_sums[cls] += fast_prob[trigram_probs_[id].log_probs[cls]]; + } + } + ... +} +``` + +
    + +General advice: check for malformed inputs at module boundaries instead of +repeating checks internally. + +### Move expensive computations outside loops + +
    + +Move bounds computation outside loop. + + +literal_linearizer.cc + +{: .bad-code} +```c++ +for (int64 i = 0; i < src_shape.dimensions(dimension_numbers.front()); + ++i) { +``` + +{: .new} +```c++ +int64 dim_front = src_shape.dimensions(dimension_numbers.front()); +const uint8* src_buffer_data = src_buffer.data(); +uint8* dst_buffer_data = dst_buffer.data(); +for (int64 i = 0; i < dim_front; ++i) { +``` + +
    + +### Defer expensive computation {#defer-expensive-computation} + +
    + +Defer GetSubSharding call until needed, which reduces 43 +seconds of CPU time to 2 seconds. + + +sharding_propagation.cc + +{: .bad-code} +```c++ +HloSharding alternative_sub_sharding = + user.sharding().GetSubSharding(user.shape(), {i}); +if (user.operand(i) == &instruction && + hlo_sharding_util::IsShardingMoreSpecific(alternative_sub_sharding, + sub_sharding)) { + sub_sharding = alternative_sub_sharding; +} +``` + +{: .new} +```c++ +if (user.operand(i) == &instruction) { + // Only evaluate GetSubSharding if this operand is of interest, + // as it is relatively expensive. + HloSharding alternative_sub_sharding = + user.sharding().GetSubSharding(user.shape(), {i}); + if (hlo_sharding_util::IsShardingMoreSpecific( + alternative_sub_sharding, sub_sharding)) { + sub_sharding = alternative_sub_sharding; + } +} +``` + +
    + +
    + +Don't update stats eagerly; compute them on demand. + + +Do not update stats on the very frequent allocation/deallocation calls. Instead, +compute stats on demand when the much less frequently called Stats() method is +invoked. + +
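
The CL itself is not reproduced here, but a minimal sketch of the shape of such
a change (generic; `FreeListPool` is a hypothetical type, not the code from the
CL) looks like this:

```c++
#include <cstddef>
#include <new>

// Hypothetical pool used only for illustration (no cleanup/destructor shown).
class FreeListPool {
 public:
  void* Allocate() {
    if (free_ != nullptr) {    // Hot path: no stats bookkeeping at all.
      Node* n = free_;
      free_ = n->next;
      return n;
    }
    return ::operator new(sizeof(Node));
  }

  void Deallocate(void* p) {   // Hot path: no stats bookkeeping at all.
    Node* n = static_cast<Node*>(p);
    n->next = free_;
    free_ = n;
  }

  // Cold path: derive the statistic by walking the free list on demand,
  // rather than maintaining a counter on every Allocate/Deallocate call.
  size_t FreeObjects() const {
    size_t count = 0;
    for (Node* n = free_; n != nullptr; n = n->next) ++count;
    return count;
  }

 private:
  struct Node { Node* next; };
  Node* free_ = nullptr;
};
```
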
    + +
    + +Preallocate 10 nodes not 200 for query handling in Google's +web server. + + +A simple change that reduced web server's CPU usage by 7.5%. + +querytree.h + +{: .bad-code} +```c++ +static const int kInitParseTreeSize = 200; // initial size of querynode pool +``` + +{: .new} +```c++ +static const int kInitParseTreeSize = 10; // initial size of querynode pool +``` + +
    + +
    + +Change search order for 19% throughput improvement. + + +An old search system (circa 2000) had two tiers: one contained a full-text +index, and the other tier contained just the index for the title and anchor +terms. We used to search the smaller title/anchor tier first. +Counter-intuitively, we found that it is cheaper to search the larger full-text +index tier first since if we reach the end of the full-text tier, we can +entirely skip searching the title/anchor tier (a subset of the full-text tier). +This happened reasonably often and allowed us to reduce the average number of +disk seeks to process a query. + +See discussion of title and anchor text handling in +[The Anatomy of a Large-Scale Hypertextual Web Search Engine](https://research.google/pubs/the-anatomy-of-a-large-scale-hypertextual-web-search-engine/) +

### Specialize code

A particular performance-sensitive call site may not need the full generality
provided by a general-purpose library. In such cases, consider writing
specialized code instead of calling the general-purpose code, provided that
doing so yields a measurable performance improvement.

    + +Custom printing code for Histogram class is 4x as fast as +sprintf. + + +This code is performance sensitive because it is invoked when monitoring systems +gather statistics from various servers. + +histogram_export.cc + +{: .bad-code} +```c++ +void Histogram::PopulateBuckets(const string &prefix, + expvar::MapProto *const var) const { + ... + for (int i = min_bucket; i <= max_bucket; ++i) { + const double count = BucketCount(i); + if (!export_empty_buckets && count == 0.0) continue; + acc += count; + // The label format of exported buckets for discrete histograms + // specifies an inclusive upper bound, which is the same as in + // the original Histogram implementation. This format is not + // applicable to non-discrete histograms, so a half-open interval + // is used for them, with "_" instead of "-" as a separator to + // make possible to distinguish the formats. + string key = + options_.export_cumulative_counts() ? + StringPrintf("%.12g", boundaries_->BucketLimit(i)) : + options_.discrete() ? + StringPrintf("%.0f-%.0f", + ceil(boundaries_->BucketStart(i)), + ceil(boundaries_->BucketLimit(i)) - 1.0) : + StringPrintf("%.12g_%.12g", + boundaries_->BucketStart(i), + boundaries_->BucketLimit(i)); + EscapeMapKey(&key); + const double value = options_.export_cumulative_counts() ? acc : count; + expvar::AddMapFloat(StrCat(prefix, + options_.export_bucket_key_prefix(), + key), + value * count_mult, + var); + } +``` + +{: .new} +```c++ +// Format "val" according to format. If "need_escape" is true, then the +// format can produce output with a '.' in it, and the result will be escaped. +// If "need_escape" is false, then the caller guarantees that format is +// such that the resulting number will not have any '.' characters and +// therefore we can avoid calling EscapeKey. +// The function is free to use "*scratch" for scratch space if necessary, +// and the resulting StringPiece may point into "*scratch". +static StringPiece FormatNumber(const char* format, + bool need_escape, + double val, string* scratch) { + // This routine is specialized to work with only a limited number of formats + DCHECK(StringPiece(format) == "%.0f" || StringPiece(format) == "%.12g"); + + scratch->clear(); + if (val == trunc(val) && val >= kint32min && val <= kint32max) { + // An integer for which we can just use StrAppend + StrAppend(scratch, static_cast(val)); + return StringPiece(*scratch); + } else if (isinf(val)) { + // Infinity, represent as just 'inf'. + return StringPiece("inf", 3); + } else { + // Format according to "format", and possibly escape. + StringAppendF(scratch, format, val); + if (need_escape) { + EscapeMapKey(scratch); + } else { + DCHECK(!StringPiece(*scratch).contains(".")); + } + return StringPiece(*scratch); + } +} +... +void Histogram::PopulateBuckets(const string &prefix, + expvar::MapProto *const var) const { + ... + const string full_key_prefix = StrCat(prefix, + options_.export_bucket_key_prefix()); + string key = full_key_prefix; // Keys will start with "full_key_prefix". + string start_scratch; + string limit_scratch; + const bool cumul_counts = options_.export_cumulative_counts(); + const bool discrete = options_.discrete(); + for (int i = min_bucket; i <= max_bucket; ++i) { + const double count = BucketCount(i); + if (!export_empty_buckets && count == 0.0) continue; + acc += count; + // The label format of exported buckets for discrete histograms + // specifies an inclusive upper bound, which is the same as in + // the original Histogram implementation. 
This format is not + // applicable to non-discrete histograms, so a half-open interval + // is used for them, with "_" instead of "-" as a separator to + // make possible to distinguish the formats. + key.resize(full_key_prefix.size()); // Start with full_key_prefix. + DCHECK_EQ(key, full_key_prefix); + + const double limit = boundaries_->BucketLimit(i); + if (cumul_counts) { + StrAppend(&key, FormatNumber("%.12g", true, limit, &limit_scratch)); + } else { + const double start = boundaries_->BucketStart(i); + if (discrete) { + StrAppend(&key, + FormatNumber("%.0f", false, ceil(start), &start_scratch), + "-", + FormatNumber("%.0f", false, ceil(limit) - 1.0, + &limit_scratch)); + } else { + StrAppend(&key, + FormatNumber("%.12g", true, start, &start_scratch), + "_", + FormatNumber("%.12g", true, limit, &limit_scratch)); + } + } + const double value = cumul_counts ? acc : count; + + // Add to map var + expvar::AddMapFloat(key, value * count_mult, var); + } +} +``` + +
    + +
    + +Add specializations for VLOG(1), VLOG(2), … for speed and +smaller code size. + + +`VLOG` is a heavily used macro throughout the code base. This change avoids +passing an extra integer constant at nearly every call site (if the log level is +constant at the call site, as it almost always is, as in `VLOG(1) << ...`), +which saves code space. + +vlog_is_on.h + +{: .bad-code} +```c++ +class VLogSite final { + public: + ... + bool IsEnabled(int level) { + int stale_v = v_.load(std::memory_order_relaxed); + if (ABSL_PREDICT_TRUE(level > stale_v)) { + return false; + } + + // We put everything other than the fast path, i.e. vlogging is initialized + // but not on, behind an out-of-line function to reduce code size. + return SlowIsEnabled(stale_v, level); + } + ... + private: + ... + ABSL_ATTRIBUTE_NOINLINE + bool SlowIsEnabled(int stale_v, int level); + ... +}; +``` + +{: .new} +```c++ +class VLogSite final { + public: + ... + bool IsEnabled(int level) { + int stale_v = v_.load(std::memory_order_relaxed); + if (ABSL_PREDICT_TRUE(level > stale_v)) { + return false; + } + + // We put everything other than the fast path, i.e. vlogging is initialized + // but not on, behind an out-of-line function to reduce code size. + // "level" is almost always a call-site constant, so we can save a bit + // of code space by special-casing for levels 1, 2, and 3. +#if defined(__has_builtin) && __has_builtin(__builtin_constant_p) + if (__builtin_constant_p(level)) { + if (level == 0) return SlowIsEnabled0(stale_v); + if (level == 1) return SlowIsEnabled1(stale_v); + if (level == 2) return SlowIsEnabled2(stale_v); + if (level == 3) return SlowIsEnabled3(stale_v); + if (level == 4) return SlowIsEnabled4(stale_v); + if (level == 5) return SlowIsEnabled5(stale_v); + } +#endif + return SlowIsEnabled(stale_v, level); + ... + private: + ... + ABSL_ATTRIBUTE_NOINLINE + bool SlowIsEnabled(int stale_v, int level); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled0(int stale_v); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled1(int stale_v); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled2(int stale_v); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled3(int stale_v); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled4(int stale_v); + ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled5(int stale_v); + ... +}; +``` + +vlog_is_on.cc + +{: .new} +```c++ +bool VLogSite::SlowIsEnabled0(int stale_v) { return SlowIsEnabled(stale_v, 0); } +bool VLogSite::SlowIsEnabled1(int stale_v) { return SlowIsEnabled(stale_v, 1); } +bool VLogSite::SlowIsEnabled2(int stale_v) { return SlowIsEnabled(stale_v, 2); } +bool VLogSite::SlowIsEnabled3(int stale_v) { return SlowIsEnabled(stale_v, 3); } +bool VLogSite::SlowIsEnabled4(int stale_v) { return SlowIsEnabled(stale_v, 4); } +bool VLogSite::SlowIsEnabled5(int stale_v) { return SlowIsEnabled(stale_v, 5); } +``` + +
    + +
    + +Replace RE2 call with a simple prefix match when possible. + + +read_matcher.cc + +{: .bad-code} +```c++ +enum MatchItemType { + MATCH_TYPE_INVALID, + MATCH_TYPE_RANGE, + MATCH_TYPE_EXACT, + MATCH_TYPE_REGEXP, +}; +``` + +{: .new} +```c++ +enum MatchItemType { + MATCH_TYPE_INVALID, + MATCH_TYPE_RANGE, + MATCH_TYPE_EXACT, + MATCH_TYPE_REGEXP, + MATCH_TYPE_PREFIX, // Special type for regexp ".*" +}; +``` + +read_matcher.cc + +{: .bad-code} +```c++ +p->type = MATCH_TYPE_REGEXP; +``` + +{: .new} +```c++ +term.NonMetaPrefix().CopyToString(&p->prefix); +if (term.RegexpSuffix() == ".*") { + // Special case for a regexp that matches anything, so we can + // bypass RE2::FullMatch + p->type = MATCH_TYPE_PREFIX; +} else { + p->type = MATCH_TYPE_REGEXP; +``` + +
    + +
    + +Use StrCat rather than StringPrintf to format IP +addresses. + + +ipaddress.cc + +{: .bad-code} +```c++ +string IPAddress::ToString() const { + char buf[INET6_ADDRSTRLEN]; + + switch (address_family_) { + case AF_INET: + CHECK(inet_ntop(AF_INET, &addr_.addr4, buf, INET6_ADDRSTRLEN) != NULL); + return buf; + case AF_INET6: + CHECK(inet_ntop(AF_INET6, &addr_.addr6, buf, INET6_ADDRSTRLEN) != NULL); + return buf; + case AF_UNSPEC: + LOG(DFATAL) << "Calling ToString() on an empty IPAddress"; + return ""; + default: + LOG(FATAL) << "Unknown address family " << address_family_; + } +} +... +string IPAddressToURIString(const IPAddress& ip) { + switch (ip.address_family()) { + case AF_INET6: + return StringPrintf("[%s]", ip.ToString().c_str()); + default: + return ip.ToString(); + } +} +... +string SocketAddress::ToString() const { + return IPAddressToURIString(host_) + StringPrintf(":%u", port_); +} +``` + +{: .new} +```c++ +string IPAddress::ToString() const { + char buf[INET6_ADDRSTRLEN]; + + switch (address_family_) { + case AF_INET: { + uint32 addr = gntohl(addr_.addr4.s_addr); + int a1 = static_cast((addr >> 24) & 0xff); + int a2 = static_cast((addr >> 16) & 0xff); + int a3 = static_cast((addr >> 8) & 0xff); + int a4 = static_cast(addr & 0xff); + return StrCat(a1, ".", a2, ".", a3, ".", a4); + } + case AF_INET6: + CHECK(inet_ntop(AF_INET6, &addr_.addr6, buf, INET6_ADDRSTRLEN) != NULL); + return buf; + case AF_UNSPEC: + LOG(DFATAL) << "Calling ToString() on an empty IPAddress"; + return ""; + default: + LOG(FATAL) << "Unknown address family " << address_family_; + } +} +... +string IPAddressToURIString(const IPAddress& ip) { + switch (ip.address_family()) { + case AF_INET6: + return StrCat("[", ip.ToString(), "]"); + default: + return ip.ToString(); + } +} +... +string SocketAddress::ToString() const { + return StrCat(IPAddressToURIString(host_), ":", port_); +} +``` + +
    + +### Use caching to avoid repeated work {#use-caching-to-avoid-repeated-work} + +
    + +Cache based on precomputed fingerprint of large +serialized proto. + + +dp_ops.cc + +{: .bad-code} +```c++ +InputOutputMappingProto mapping_proto; +PLAQUE_OP_REQUIRES( + mapping_proto.ParseFromStringPiece(GetAttrMappingProto(state)), + absl::InternalError("Failed to parse InputOutputMappingProto")); +ParseMapping(mapping_proto); +``` + +{: .new} +```c++ +uint64 mapping_proto_fp = GetAttrMappingProtoFp(state); +{ + absl::MutexLock l(&fp_to_iometa_mu); + if (fp_to_iometa == nullptr) { + fp_to_iometa = + new absl::flat_hash_map>; + } + auto it = fp_to_iometa->find(mapping_proto_fp); + if (it != fp_to_iometa->end()) { + io_metadata_ = it->second.get(); + } else { + auto serial_proto = GetAttrMappingProto(state); + DCHECK_EQ(mapping_proto_fp, Fingerprint(serial_proto)); + InputOutputMappingProto mapping_proto; + PLAQUE_OP_REQUIRES( + mapping_proto.ParseFromStringPiece(GetAttrMappingProto(state)), + absl::InternalError("Failed to parse InputOutputMappingProto")); + auto io_meta = ParseMapping(mapping_proto); + io_metadata_ = io_meta.get(); + (*fp_to_iometa)[mapping_proto_fp] = std::move(io_meta); + } +} +``` + +

### Make the compiler's job easier

The compiler may have trouble optimizing through layers of abstraction because
it must make conservative assumptions about the overall behavior of the code,
or it may not make the right speed vs. size tradeoffs. The application
programmer will often know more about the behavior of the system and can aid
the compiler by rewriting the code to operate at a lower level. However, only
do this when profiles show an issue, since compilers will often get things
right on their own. Looking at the generated assembly code for
performance-critical routines can help you understand whether the compiler is
"getting it right". Pprof provides a very helpful
[display of source code interleaved with disassembly][annotated source],
annotated with performance data.

Some techniques that may be useful:

1. Avoid function calls in hot functions (allows the compiler to avoid frame
   setup costs).
2. Move slow-path code into a separate tail-called function.
3. Copy small amounts of data into local variables before heavy use. This can
   let the compiler assume there is no aliasing with other data, which may
   improve auto-vectorization and register allocation (see the sketch below).
4. Hand-unroll very hot loops.

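
Here is a minimal sketch of technique 3 (ours; `Scaler` is a hypothetical
type): copying the members into locals before the loop lets the compiler keep
them in registers, because it no longer has to assume that stores through
`out` might alias and modify `*this`.

```c++
#include <cstddef>

// Hypothetical type used only for illustration.
struct Scaler {
  float scale_ = 2.0f;
  float bias_ = 1.0f;

  void Apply(const float* in, float* out, size_t n) const {
    const float scale = scale_;  // Read the members once, before the loop.
    const float bias = bias_;
    for (size_t i = 0; i < n; ++i) {
      // Without the local copies, the compiler may reload scale_ and bias_
      // on every iteration, since the store to out[i] could alias them.
      out[i] = in[i] * scale + bias;
    }
  }
};
```
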

Speed up ShapeUtil::ForEachState by replacing absl::Span
with raw pointers to the underlying arrays.


shape_util.h

{: .bad-code}
```c++
struct ForEachState {
  ForEachState(const Shape& s, absl::Span<const int64_t> b,
               absl::Span<const int64_t> c, absl::Span<const int64_t> i);
  ~ForEachState();

  const Shape& shape;
  const absl::Span<const int64_t> base;
  const absl::Span<const int64_t> count;
  const absl::Span<const int64_t> incr;
```

{: .new}
```c++
struct ForEachState {
  ForEachState(const Shape& s, absl::Span<const int64_t> b,
               absl::Span<const int64_t> c, absl::Span<const int64_t> i);
  inline ~ForEachState() = default;

  const Shape& shape;
  // Pointers to arrays of the passed-in spans
  const int64_t* const base;
  const int64_t* const count;
  const int64_t* const incr;
```

    + +
    + +Hand unroll +cyclic +redundancy check (CRC) computation loop. + + +crc.cc + +{: .bad-code} +```c++ +void CRC32::Extend(uint64 *lo, uint64 *hi, const void *bytes, size_t length) + const { + ... + // Process bytes 4 at a time + while ((p + 4) <= e) { + uint32 c = l ^ WORD(p); + p += 4; + l = this->table3_[c & 0xff] ^ + this->table2_[(c >> 8) & 0xff] ^ + this->table1_[(c >> 16) & 0xff] ^ + this->table0_[c >> 24]; + } + + // Process the last few bytes + while (p != e) { + int c = (l & 0xff) ^ *p++; + l = this->table0_[c] ^ (l >> 8); + } + *lo = l; +} +``` + +{: .new} +```c++ +void CRC32::Extend(uint64 *lo, uint64 *hi, const void *bytes, size_t length) + const { + ... +#define STEP { \ + uint32 c = l ^ WORD(p); \ + p += 4; \ + l = this->table3_[c & 0xff] ^ \ + this->table2_[(c >> 8) & 0xff] ^ \ + this->table1_[(c >> 16) & 0xff] ^ \ + this->table0_[c >> 24]; \ +} + + // Process bytes 16 at a time + while ((e-p) >= 16) { + STEP; + STEP; + STEP; + STEP; + } + + // Process bytes 4 at a time + while ((p + 4) <= e) { + STEP; + } +#undef STEP + + // Process the last few bytes + while (p != e) { + int c = (l & 0xff) ^ *p++; + l = this->table0_[c] ^ (l >> 8); + } + *lo = l; +} + +``` + +
    + +
    + +Handle four characters at a time when parsing Spanner +keys. + + +1. Hand unroll loop to deal with four characters at a time rather than using + memchr + +2. Manually unroll loop for finding separated sections of name + +3. Go backwards to find separated portions of a name with '#' separators + (rather than forwards) since the first part is likely the longest in the + name. + +key.cc + +{: .bad-code} +```c++ +void Key::InitSeps(const char* start) { + const char* base = &rep_[0]; + const char* limit = base + rep_.size(); + const char* s = start; + + DCHECK_GE(s, base); + DCHECK_LT(s, limit); + + for (int i = 0; i < 3; i++) { + s = (const char*)memchr(s, '#', limit - s); + DCHECK(s != NULL); + seps_[i] = s - base; + s++; + } +} +``` + +{: .new} +```c++ +inline const char* ScanBackwardsForSep(const char* base, const char* p) { + while (p >= base + 4) { + if (p[0] == '#') return p; + if (p[-1] == '#') return p-1; + if (p[-2] == '#') return p-2; + if (p[-3] == '#') return p-3; + p -= 4; + } + while (p >= base && *p != '#') p--; + return p; +} + +void Key::InitSeps(const char* start) { + const char* base = &rep_[0]; + const char* limit = base + rep_.size(); + const char* s = start; + + DCHECK_GE(s, base); + DCHECK_LT(s, limit); + + // We go backwards from the end of the string, rather than forwards, + // since the directory name might be long and definitely doesn't contain + // any '#' characters. + const char* p = ScanBackwardsForSep(s, limit - 1); + DCHECK(*p == '#'); + seps_[2] = p - base; + p--; + + p = ScanBackwardsForSep(s, p); + DCHECK(*p == '#'); + seps_[1] = p - base; + p--; + + p = ScanBackwardsForSep(s, p); + DCHECK(*p == '#'); + seps_[0] = p - base; +} +``` + +
    + +
    + +Avoid frame setup costs by converting ABSL_LOG(FATAL) to +ABSL_DCHECK(false). + + +arena_cleanup.h + +{: .bad-code} +```c++ +inline ABSL_ATTRIBUTE_ALWAYS_INLINE size_t Size(Tag tag) { + if (!EnableSpecializedTags()) return sizeof(DynamicNode); + + switch (tag) { + case Tag::kDynamic: + return sizeof(DynamicNode); + case Tag::kString: + return sizeof(TaggedNode); + case Tag::kCord: + return sizeof(TaggedNode); + default: + ABSL_LOG(FATAL) << "Corrupted cleanup tag: " << static_cast(tag); + return sizeof(DynamicNode); + } +} +``` + +{: .new} +```c++ +inline ABSL_ATTRIBUTE_ALWAYS_INLINE size_t Size(Tag tag) { + if (!EnableSpecializedTags()) return sizeof(DynamicNode); + + switch (tag) { + case Tag::kDynamic: + return sizeof(DynamicNode); + case Tag::kString: + return sizeof(TaggedNode); + case Tag::kCord: + return sizeof(TaggedNode); + default: + ABSL_DCHECK(false) << "Corrupted cleanup tag: " << static_cast(tag); + return sizeof(DynamicNode); + } +} +``` + +
    + +### Reduce stats collection costs + +Balance the utility of stats and other behavioral information about a system +against the cost of maintaining that information. The extra information can +often help people to understand and improve high-level behavior, but can also be +costly to maintain. + +Stats that are not useful can be dropped altogether. + +
    + +Stop maintaining expensive stats about number of alarms and +closures in SelectServer. + + +Part of changes that reduce time for setting an alarm from 771 ns to 271 ns. + +selectserver.h + +{: .bad-code} +```c++ +class SelectServer { + public: + ... + protected: + ... + scoped_ptr num_alarms_stat_; + ... + scoped_ptr num_closures_stat_; + ... +}; +``` + +{: .new} +```c++ +// Selectserver class +class SelectServer { + ... + protected: + ... +}; +``` + +/selectserver.cc + +{: .bad-code} +```c++ +void SelectServer::AddAlarmInternal(Alarmer* alarmer, + int offset_in_ms, + int id, + bool is_periodic) { + ... + alarms_->insert(alarm); + num_alarms_stat_->IncBy(1); + ... +} +``` + +{: .new} +```c++ +void SelectServer::AddAlarmInternal(Alarmer* alarmer, + int offset_in_ms, + int id, + bool is_periodic) { + ... + alarms_->Add(alarm); + ... +} +``` + +/selectserver.cc + +{: .bad-code} +```c++ +void SelectServer::RemoveAlarm(Alarmer* alarmer, int id) { + ... + alarms_->erase(alarm); + num_alarms_stat_->IncBy(-1); + ... +} +``` + +{: .new} +```c++ +void SelectServer::RemoveAlarm(Alarmer* alarmer, int id) { + ... + alarms_->Remove(alarm); + ... +} +``` + +
    + +Often, stats or other properties can be maintained for a sample of the elements +handled by the system (e.g., RPC requests, input records, users). Many +subsystems use this approach (tcmalloc allocation tracking, /requestz status +pages, Dapper samples). + +When sampling, consider reducing the sampling rate when appropriate. + +
    + +Maintain stats for just a sample of doc info requests. + + +Sampling allows us to avoid touching 39 histograms and MinuteTenMinuteHour stats +for most requests. + +generic-leaf-stats.cc + +{: .bad-code} +```c++ +... code that touches 39 histograms to update various stats on every request ... +``` + +{: .new} +```c++ +// Add to the histograms periodically +if (TryLockToUpdateHistogramsDocInfo(docinfo_stats, bucket)) { + // Returns true and grabs bucket->lock only if we should sample this + // request for maintaining stats + ... code that touches 39 histograms to update various stats ... + bucket->lock.Unlock(); +} +``` + +
    + +
    + +Reduce sampling rate and make faster sampling decisions. + + +This change reduces the sampling rate from 1 in 10 to 1 in 32. Furthermore, we +now keep execution time stats just for the sampled events and speed up sampling +decisions by using a power of two modulus. This code is called on every packet +in the Google Meet video conferencing system and needed performance work to keep +up with capacity demands during the first part of the COVID outbreak as users +rapidly migrated to doing more online meetings. + +packet_executor.cc + +{: .bad-code} +```c++ +class ScopedPerformanceMeasurement { + public: + explicit ScopedPerformanceMeasurement(PacketExecutor* packet_executor) + : packet_executor_(packet_executor), + tracer_(packet_executor->packet_executor_trace_threshold_, + kClosureTraceName) { + // ThreadCPUUsage is an expensive call. At the time of writing, + // it takes over 400ns, or roughly 30 times slower than absl::Now, + // so we sample only 10% of closures to keep the cost down. + if (packet_executor->closures_executed_ % 10 == 0) { + thread_cpu_usage_start_ = base::ThreadCPUUsage(); + } + + // Sample start time after potentially making the above expensive call, + // so as not to pollute wall time measurements. + run_start_time_ = absl::Now(); + } + + ~ScopedPerformanceMeasurement() { +``` + +{: .new} +```c++ +ScopedPerformanceMeasurement::ScopedPerformanceMeasurement( + PacketExecutor* packet_executor) + : packet_executor_(packet_executor), + tracer_(packet_executor->packet_executor_trace_threshold_, + kClosureTraceName) { + // ThreadCPUUsage is an expensive call. At the time of writing, + // it takes over 400ns, or roughly 30 times slower than absl::Now, + // so we sample only 1 in 32 closures to keep the cost down. + if (packet_executor->closures_executed_ % 32 == 0) { + thread_cpu_usage_start_ = base::ThreadCPUUsage(); + } + + // Sample start time after potentially making the above expensive call, + // so as not to pollute wall time measurements. + run_start_time_ = absl::Now(); +} +``` + +packet_executor.cc + +{: .bad-code} +```c++ +~ScopedPerformanceMeasurement() { + auto run_end_time = absl::Now(); + auto run_duration = run_end_time - run_start_time_; + + if (thread_cpu_usage_start_.has_value()) { + ... + } + + closure_execution_time->Record(absl::ToInt64Microseconds(run_duration)); +``` + +{: .new} +```c++ +ScopedPerformanceMeasurement::~ScopedPerformanceMeasurement() { + auto run_end_time = absl::Now(); + auto run_duration = run_end_time - run_start_time_; + + if (thread_cpu_usage_start_.has_value()) { + ... + closure_execution_time->Record(absl::ToInt64Microseconds(run_duration)); + } +``` + +Benchmark results: + +{: .bench} +``` +Run on (40 X 2793 MHz CPUs); 2020-03-24T20:08:19.991412535-07:00 +CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB +Benchmark Base (ns) New (ns) Improvement +---------------------------------------------------------------------------- +BM_PacketOverhead_mean 224 85 +62.0% +``` + +
    + +### Avoid logging on hot code paths + +Logging statements can be costly, even if the logging-level for the statement +doesn't actually log anything. E.g., `ABSL_VLOG`'s implementation requires at +least a load and a comparison, which may be a problem in hot code paths. In +addition, the presence of the logging code may inhibit compiler optimizations. +Consider dropping logging entirely from hot code paths. + +
    + +Remove logging from guts of memory allocator. + + +This was a small part of a larger change. + +gpu_bfc_allocator.cc + +{: .bad-code} +```c++ +void GPUBFCAllocator::SplitChunk(...) { + ... + VLOG(6) << "Adding to chunk map: " << new_chunk->ptr; + ... +} +... +void GPUBFCAllocator::DeallocateRawInternal(void* ptr) { + ... + VLOG(6) << "Chunk at " << c->ptr << " no longer in use"; + ... +} +``` + +{: .new} +```c++ +void GPUBFCAllocator::SplitChunk(...) { +... +} +... +void GPUBFCAllocator::DeallocateRawInternal(void* ptr) { +... +} +``` + +
    + +
    + +Precompute whether or not logging is enabled outside a +nested loop. + + +image_similarity.cc + +{: .bad-code} +```c++ +for (int j = 0; j < output_subimage_size_y; j++) { + int j1 = j - rad + output_to_integral_subimage_y; + int j2 = j1 + 2 * rad + 1; + // Create a pointer for this row's output, taking into account the offset + // to the full image. + double *image_diff_ptr = &(*image_diff)(j + min_j, min_i); + + for (int i = 0; i < output_subimage_size_x; i++) { + ... + if (VLOG_IS_ON(3)) { + ... + } + ... + } +} +``` + +{: .new} +```c++ +const bool vlog_3 = DEBUG_MODE ? VLOG_IS_ON(3) : false; + +for (int j = 0; j < output_subimage_size_y; j++) { + int j1 = j - rad + output_to_integral_subimage_y; + int j2 = j1 + 2 * rad + 1; + // Create a pointer for this row's output, taking into account the offset + // to the full image. + double *image_diff_ptr = &(*image_diff)(j + min_j, min_i); + + for (int i = 0; i < output_subimage_size_x; i++) { + ... + if (vlog_3) { + ... + } + } +} +``` + +{: .bench} +``` +Run on (40 X 2801 MHz CPUs); 2016-05-16T15:55:32.250633072-07:00 +CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB +Benchmark Base (ns) New (ns) Improvement +------------------------------------------------------------------ +BM_NCCPerformance/16 29104 26372 +9.4% +BM_NCCPerformance/64 473235 425281 +10.1% +BM_NCCPerformance/512 30246238 27622009 +8.7% +BM_NCCPerformance/1k 125651445 113361991 +9.8% +BM_NCCLimitedBoundsPerformance/16 8314 7498 +9.8% +BM_NCCLimitedBoundsPerformance/64 143508 132202 +7.9% +BM_NCCLimitedBoundsPerformance/512 9335684 8477567 +9.2% +BM_NCCLimitedBoundsPerformance/1k 37223897 34201739 +8.1% +``` + +
    + +
    + +Precompute whether logging is enabled and use the result +in helper routines. + + +periodic_call.cc + +{: .bad-code} +```c++ + VLOG(1) << Logid() + << "MaybeScheduleAlarmAtNextTick. Time until next real time: " + << time_until_next_real_time; + ... + uint64 next_virtual_time_ms = + next_virtual_time_ms_ - num_ticks * kResolutionMs; + CHECK_GE(next_virtual_time_ms, 0); + ScheduleAlarm(now, delay, next_virtual_time_ms); +} + +void ScheduleNextAlarm(uint64 current_virtual_time_ms) + ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) { + if (calls_.empty()) { + VLOG(1) << Logid() << "No calls left, entering idle mode"; + next_real_time_ = absl::InfiniteFuture(); + return; + } + uint64 next_virtual_time_ms = FindNextVirtualTime(current_virtual_time_ms); + auto delay = + absl::Milliseconds(next_virtual_time_ms - current_virtual_time_ms); + ScheduleAlarm(GetClock().TimeNow(), delay, next_virtual_time_ms); +} + +// An alarm scheduled by this function supersedes all previously scheduled +// alarms. This is ensured through `scheduling_sequence_number_`. +void ScheduleAlarm(absl::Time now, absl::Duration delay, + uint64 virtual_time_ms) + ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) { + next_real_time_ = now + delay; + next_virtual_time_ms_ = virtual_time_ms; + ++ref_count_; // The Alarm holds a reference. + ++scheduling_sequence_number_; + VLOG(1) << Logid() << "ScheduleAlarm. Time : " + << absl::FormatTime("%M:%S.%E3f", now, absl::UTCTimeZone()) + << ", delay: " << delay << ", virtual time: " << virtual_time_ms + << ", refs: " << ref_count_ + << ", seq: " << scheduling_sequence_number_ + << ", executor: " << executor_; + + executor_->AddAfter( + delay, new Alarm(this, virtual_time_ms, scheduling_sequence_number_)); +} +``` + +{: .new} +```c++ + const bool vlog_1 = VLOG_IS_ON(1); + + if (vlog_1) { + VLOG(1) << Logid() + << "MaybeScheduleAlarmAtNextTick. Time until next real time: " + << time_until_next_real_time; + } + ... + uint64 next_virtual_time_ms = + next_virtual_time_ms_ - num_ticks * kResolutionMs; + CHECK_GE(next_virtual_time_ms, 0); + ScheduleAlarm(now, delay, next_virtual_time_ms, vlog_1); +} + +void ScheduleNextAlarm(uint64 current_virtual_time_ms, bool vlog_1) + ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) { + if (calls_.empty()) { + if (vlog_1) { + VLOG(1) << Logid() << "No calls left, entering idle mode"; + } + next_real_time_ = absl::InfiniteFuture(); + return; + } + uint64 next_virtual_time_ms = FindNextVirtualTime(current_virtual_time_ms); + auto delay = + absl::Milliseconds(next_virtual_time_ms - current_virtual_time_ms); + ScheduleAlarm(GetClock().TimeNow(), delay, next_virtual_time_ms, vlog_1); +} + +// An alarm scheduled by this function supersedes all previously scheduled +// alarms. This is ensured through `scheduling_sequence_number_`. +void ScheduleAlarm(absl::Time now, absl::Duration delay, + uint64 virtual_time_ms, + bool vlog_1) + ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) { + next_real_time_ = now + delay; + next_virtual_time_ms_ = virtual_time_ms; + ++ref_count_; // The Alarm holds a reference. + ++scheduling_sequence_number_; + if (vlog_1) { + VLOG(1) << Logid() << "ScheduleAlarm. Time : " + << absl::FormatTime("%M:%S.%E3f", now, absl::UTCTimeZone()) + << ", delay: " << delay << ", virtual time: " << virtual_time_ms + << ", refs: " << ref_count_ + << ", seq: " << scheduling_sequence_number_ + << ", executor: " << executor_; + } + + executor_->AddAfter( + delay, new Alarm(this, virtual_time_ms, scheduling_sequence_number_)); +} +``` + +

## Code size considerations {#code-size-considerations}

Performance encompasses more than just runtime speed. Sometimes it is worth
considering the effects of software choices on the size of generated code.
Large code size means longer compile and link times, bloated binaries, more
memory usage, more icache pressure, and negative effects on other
microarchitectural structures such as branch predictors. Thinking about these
issues is especially important when writing low-level library code that will
be used in many places, or when writing templated code that you expect will be
instantiated for many different types.

The techniques that are useful for reducing code size vary significantly across
programming languages. Here are some techniques that we have found useful for
C++ code (which can suffer from an overuse of templates and inlining).

### Trim commonly inlined code

Widely called functions combined with inlining can have a dramatic effect on
code size.

    + +Speed up TF_CHECK_OK. + + +Avoid creating Ok object, and save code space by doing complex formatting of +fatal error message out of line instead of at every call site. + +status.h + +{: .bad-code} +```c++ +#define TF_CHECK_OK(val) CHECK_EQ(::tensorflow::Status::OK(), (val)) +#define TF_QCHECK_OK(val) QCHECK_EQ(::tensorflow::Status::OK(), (val)) +``` + +{: .new} +```c++ +extern tensorflow::string* TfCheckOpHelperOutOfLine( + const ::tensorflow::Status& v, const char* msg); +inline tensorflow::string* TfCheckOpHelper(::tensorflow::Status v, + const char* msg) { + if (v.ok()) return nullptr; + return TfCheckOpHelperOutOfLine(v, msg); +} +#define TF_CHECK_OK(val) \ + while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \ + LOG(FATAL) << *(_result) +#define TF_QCHECK_OK(val) \ + while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \ + LOG(QFATAL) << *(_result) +``` + +status.cc + +{: .new} +```c++ +string* TfCheckOpHelperOutOfLine(const ::tensorflow::Status& v, + const char* msg) { + string r("Non-OK-status: "); + r += msg; + r += " status: "; + r += v.ToString(); + // Leaks string but this is only to be used in a fatal error message + return new string(r); +} +``` + +
    + +
    + +Shrink each RETURN_IF_ERROR call site by 79 bytes of +code. + + +1. Added special adaptor class for use by just RETURN_IF_ERROR. +2. Do not construct/destruct StatusBuilder on fast path of RETURN_IF_ERROR. +3. Do not inline some StatusBuilder methods since they are now no longer needed + on the fast path. +4. Avoid unnecessary ~Status call. + +
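
The CL is not shown, but the general shape of the call-site code is roughly the
following sketch (not the actual macro; `RETURN_IF_ERROR_SKETCH` and
`OnErrorPath` are illustrative names we made up): the inlined fast path is just
an `ok()` check, and everything heavier happens behind an out-of-line call.

```c++
#include <utility>

#include "absl/base/attributes.h"
#include "absl/base/optimization.h"
#include "absl/status/status.h"

// Out-of-line error-path helper: its body is emitted once (in a .cc file in
// practice) rather than at every call site. In the real CL this role is
// played by an internal adaptor class around StatusBuilder.
ABSL_ATTRIBUTE_NOINLINE absl::Status OnErrorPath(absl::Status s) {
  // In practice: attach source location, wrap in a builder, etc.
  return s;
}

// Sketch of the call-site shape.
#define RETURN_IF_ERROR_SKETCH(expr)         \
  do {                                       \
    absl::Status _st = (expr);               \
    if (ABSL_PREDICT_FALSE(!_st.ok())) {     \
      return OnErrorPath(std::move(_st));    \
    }                                        \
  } while (0)
```
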
    + +
    + +Improve performance of CHECK_GE by 4.5X and shrink code +size from 125 bytes to 77 bytes. + + +logging.h + +{: .bad-code} +```c++ +struct CheckOpString { + CheckOpString(string* str) : str_(str) { } + ~CheckOpString() { delete str_; } + operator bool() const { return str_ == NULL; } + string* str_; +}; +... +#define DEFINE_CHECK_OP_IMPL(name, op) \ + template \ + inline string* Check##name##Impl(const t1& v1, const t2& v2, \ + const char* names) { \ + if (v1 op v2) return NULL; \ + else return MakeCheckOpString(v1, v2, names); \ + } \ + string* Check##name##Impl(int v1, int v2, const char* names); +DEFINE_CHECK_OP_IMPL(EQ, ==) +DEFINE_CHECK_OP_IMPL(NE, !=) +DEFINE_CHECK_OP_IMPL(LE, <=) +DEFINE_CHECK_OP_IMPL(LT, < ) +DEFINE_CHECK_OP_IMPL(GE, >=) +DEFINE_CHECK_OP_IMPL(GT, > ) +#undef DEFINE_CHECK_OP_IMPL +``` + +{: .new} +```c++ +struct CheckOpString { + CheckOpString(string* str) : str_(str) { } + // No destructor: if str_ is non-NULL, we're about to LOG(FATAL), + // so there's no point in cleaning up str_. + operator bool() const { return str_ == NULL; } + string* str_; +}; +... +extern string* MakeCheckOpStringIntInt(int v1, int v2, const char* names); + +template +string* MakeCheckOpString(const int& v1, const int& v2, const char* names) { + return MakeCheckOpStringIntInt(v1, v2, names); +} +... +#define DEFINE_CHECK_OP_IMPL(name, op) \ + template \ + inline string* Check##name##Impl(const t1& v1, const t2& v2, \ + const char* names) { \ + if (v1 op v2) return NULL; \ + else return MakeCheckOpString(v1, v2, names); \ + } \ + inline string* Check##name##Impl(int v1, int v2, const char* names) { \ + if (v1 op v2) return NULL; \ + else return MakeCheckOpString(v1, v2, names); \ + } +DEFINE_CHECK_OP_IMPL(EQ, ==) +DEFINE_CHECK_OP_IMPL(NE, !=) +DEFINE_CHECK_OP_IMPL(LE, <=) +DEFINE_CHECK_OP_IMPL(LT, < ) +DEFINE_CHECK_OP_IMPL(GE, >=) +DEFINE_CHECK_OP_IMPL(GT, > ) +#undef DEFINE_CHECK_OP_IMPL +``` + +logging.cc + +{: .new} +```c++ +string* MakeCheckOpStringIntInt(int v1, int v2, const char* names) { + strstream ss; + ss << names << " (" << v1 << " vs. " << v2 << ")"; + return new string(ss.str(), ss.pcount()); +} +``` + +
    + +### Inline with care + +Inlining can often improve performance, but sometimes it can increase code size +without a corresponding performance payoff (and in some case even a performance +loss due to increased instruction cache pressure). + +

Reduce inlining in TensorFlow.


The change stops inlining many non-performance-sensitive functions (e.g., error
paths and op registration code). Furthermore, slow paths of some
performance-sensitive functions are moved into non-inlined functions.

Together, these changes reduce the size of tensorflow symbols in a typical
binary.

    + +
    + +Protocol buffer library change. Avoid expensive inlined +code space for encoding message length for messages ≥ 128 bytes and instead +do a procedure call to a shared out-of-line routine. + + +Not only makes important large binaries smaller but also faster. + +Bytes of generated code per line of a heavily inlined routine in one large +binary. First number represents the total bytes generated for a particular +source line including all locations where that code has been inlined. + +Before: + +{: .bad-code} +```c++ +. 0 1825 template +. 0 1826 inline uint8* WireFormatLite::InternalWriteMessage( +. 0 1827 int field_number, const MessageType& value, uint8* target, +. 0 1828 io::EpsCopyOutputStream* stream) { +>>> 389246 1829 target = WriteTagToArray(field_number, WIRETYPE_LENGTH_DELIMITED, target); +>>> 5454640 1830 target = io::CodedOutputStream::WriteVarint32ToArray( +>>> 337837 1831 static_cast(value.GetCachedSize()), target); +>>> 1285539 1832 return value._InternalSerialize(target, stream); +. 0 1833 } +``` + +The new codesize output with this change looks like: + +{: .new} +```c++ +. 0 1825 template +. 0 1826 inline uint8* WireFormatLite::InternalWriteMessage( +. 0 1827 int field_number, const MessageType& value, uint8* target, +. 0 1828 io::EpsCopyOutputStream* stream) { +>>> 450612 1829 target = WriteTagToArray(field_number, WIRETYPE_LENGTH_DELIMITED, target); +>> 9609 1830 target = io::CodedOutputStream::WriteVarint32ToArrayOutOfLine( +>>> 434668 1831 static_cast(value.GetCachedSize()), target); +>>> 1597394 1832 return value._InternalSerialize(target, stream); +. 0 1833 } +``` + +coded_stream.h + +{: .new} +```c++ +class PROTOBUF_EXPORT CodedOutputStream { + ... + // Like WriteVarint32() but writing directly to the target array, and with the + // less common-case paths being out of line rather than inlined. + static uint8* WriteVarint32ToArrayOutOfLine(uint32 value, uint8* target); + ... +}; +... +inline uint8* CodedOutputStream::WriteVarint32ToArrayOutOfLine(uint32 value, + uint8* target) { + target[0] = static_cast(value); + if (value < 0x80) { + return target + 1; + } else { + return WriteVarint32ToArrayOutOfLineHelper(value, target); + } +} +``` + +coded_stream.cc + +{: .new} +```c++ +uint8* CodedOutputStream::WriteVarint32ToArrayOutOfLineHelper(uint32 value, + uint8* target) { + DCHECK_GE(value, 0x80); + target[0] |= static_cast(0x80); + value >>= 7; + target[1] = static_cast(value); + if (value < 0x80) { + return target + 2; + } + target += 2; + do { + // Turn on continuation bit in the byte we just wrote. + target[-1] |= static_cast(0x80); + value >>= 7; + *target = static_cast(value); + ++target; + } while (value >= 0x80); + return target; +} +``` + +
    + +
    + +Reduce absl::flat_hash_set and absl::flat_hash_map code +size. + + +1. Extract code that does not depend on the specific hash table type into + common (non-inlined) functions. +2. Place ABSL_ATTRIBUTE_NOINLINE directives judiciously. +3. Out-of-line some slow paths. + +
    + +
    + +Do not inline string allocation and deallocation when not +using protobuf arenas. + + +public/arenastring.h + +{: .bad-code} +```c++ + if (IsDefault(default_value)) { + std::string* new_string = new std::string(); + tagged_ptr_.Set(new_string); + return new_string; + } else { + return UnsafeMutablePointer(); + } +} +``` + +{: .new} +```c++ + if (IsDefault(default_value)) { + return SetAndReturnNewString(); + } else { + return UnsafeMutablePointer(); + } +} +``` + +internal/arenastring.cc + +{: .new} +```c++ +std::string* ArenaStringPtr::SetAndReturnNewString() { + std::string* new_string = new std::string(); + tagged_ptr_.Set(new_string); + return new_string; +} +``` + +
    + +
    + +Avoid inlining some routines. Create variants of routines +that take 'const char\*' rather than 'const std::string&' to avoid std::string +construction code at every call site. + + +op.h + +{: .bad-code} +```c++ +class OpDefBuilderWrapper { + public: + explicit OpDefBuilderWrapper(const char name[]) : builder_(name) {} + OpDefBuilderWrapper& Attr(std::string spec) { + builder_.Attr(std::move(spec)); + return *this; + } + OpDefBuilderWrapper& Input(std::string spec) { + builder_.Input(std::move(spec)); + return *this; + } + OpDefBuilderWrapper& Output(std::string spec) { + builder_.Output(std::move(spec)); + return *this; + } +``` + +{: .new} +```c++ +class OpDefBuilderWrapper { + public: + explicit OpDefBuilderWrapper(const char name[]) : builder_(name) {} + OpDefBuilderWrapper& Attr(std::string spec) { + builder_.Attr(std::move(spec)); + return *this; + } + OpDefBuilderWrapper& Attr(const char* spec) TF_ATTRIBUTE_NOINLINE { + return Attr(std::string(spec)); + } + OpDefBuilderWrapper& Input(std::string spec) { + builder_.Input(std::move(spec)); + return *this; + } + OpDefBuilderWrapper& Input(const char* spec) TF_ATTRIBUTE_NOINLINE { + return Input(std::string(spec)); + } + OpDefBuilderWrapper& Output(std::string spec) { + builder_.Output(std::move(spec)); + return *this; + } + OpDefBuilderWrapper& Output(const char* spec) TF_ATTRIBUTE_NOINLINE { + return Output(std::string(spec)); + } +``` + +

### Reduce template instantiations

Templated code can be duplicated for every possible combination of template
arguments with which it is instantiated.

    + +Replace template argument with a regular argument. + + +Changed a large routine templated on a bool to instead take the bool as an extra +argument. (The bool was only being used once to select one of two string +constants, so a run-time check was just fine.) This reduced the # of +instantiations of the large routine from 287 to 143. + +sharding_util_ops.cc + +{: .bad-code} +```c++ +template +Status GetAndValidateAttributes(OpKernelConstruction* ctx, + std::vector& num_partitions, + int& num_slices, std::vector& paddings, + bool& has_paddings) { + absl::string_view num_partitions_attr_name = + Split ? kNumSplitsAttrName : kNumConcatsAttrName; + ... + return OkStatus(); +} +``` + +{: .new} +```c++ +Status GetAndValidateAttributes(bool split, OpKernelConstruction* ctx, + std::vector& num_partitions, + int& num_slices, std::vector& paddings, + bool& has_paddings) { + absl::string_view num_partitions_attr_name = + split ? kNumSplitsAttrName : kNumConcatsAttrName; + ... + return OkStatus(); +} +``` + +
    + +
    + +Move bulky code from templated constructor to a +non-templated shared base class constructor. + + +Also reduce number of template instantiations from one for every combination of +`` to one for every `` and every ``. + +sharding_util_ops.cc + +{: .bad-code} +```c++ +template +class XlaSplitNDBaseOp : public OpKernel { + public: + explicit XlaSplitNDBaseOp(OpKernelConstruction* ctx) : OpKernel(ctx) { + OP_REQUIRES_OK( + ctx, GetAndValidateAttributes(/*split=*/true, ctx, num_splits_, + num_slices_, paddings_, has_paddings_)); + } +``` + +{: .new} +```c++ +// Shared base class to save code space +class XlaSplitNDShared : public OpKernel { + public: + explicit XlaSplitNDShared(OpKernelConstruction* ctx) TF_ATTRIBUTE_NOINLINE + : OpKernel(ctx), + num_slices_(1), + has_paddings_(false) { + GetAndValidateAttributes(/*split=*/true, ctx, num_splits_, num_slices_, + paddings_, has_paddings_); + } +``` + +
    + +
    + +Reduce generated code size for absl::flat_hash_set and +absl::flat_hash_map. + + +* Extract code that does not depend on the specific hash table type into + common (non-inlined) functions. +* Place ABSL_ATTRIBUTE_NOINLINE directives judiciously. +* Move some slow paths out of line. + +
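
The actual CL is not reproduced here. The following generic sketch
(hypothetical `TableBase`/`Table` types, not the absl implementation) shows
the idea behind the first bullet: hoist type-independent logic into a
non-templated, non-inlined base class so it is compiled once rather than once
per instantiation.

```c++
#include <cstddef>

#include "absl/base/attributes.h"

// Hypothetical types used only for illustration.
class TableBase {
 protected:
  // Type-independent growth policy: emitted once, shared by all Table<T>.
  ABSL_ATTRIBUTE_NOINLINE size_t NextCapacity() const {
    return capacity_ == 0 ? 16 : capacity_ * 2;
  }
  size_t size_ = 0;
  size_t capacity_ = 0;
};

template <typename T>
class Table : private TableBase {
 public:
  void reserve(size_t n) {
    while (capacity_ < n) {
      Rehash(NextCapacity());  // Only the element moves below are templated.
    }
  }

 private:
  void Rehash(size_t new_capacity) {
    // ... move the T elements into new storage (type-specific) ...
    capacity_ = new_capacity;
  }
};
```
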

### Reduce container operations

Consider the impact of map and other container operations, since each call to
such an operation can produce a large amount of generated code.

    + +Turn many map insertion calls in a row to initialize a +hash table of emoji characters into a single bulk insert operation (188KB of +text down to 360 bytes in library linked into many binaries). 😊 + + +textfallback_init.h + +{: .bad-code} +```c++ +inline void AddEmojiFallbacks(TextFallbackMap *map) { + (*map)[0xFE000] = &kFE000; + (*map)[0xFE001] = &kFE001; + (*map)[0xFE002] = &kFE002; + (*map)[0xFE003] = &kFE003; + (*map)[0xFE004] = &kFE004; + (*map)[0xFE005] = &kFE005; + ... + (*map)[0xFEE7D] = &kFEE7D; + (*map)[0xFEEA0] = &kFEEA0; + (*map)[0xFE331] = &kFE331; +}; +``` + +{: .new} +```c++ +inline void AddEmojiFallbacks(TextFallbackMap *map) { +#define PAIR(x) {0x##x, &k##x} + // clang-format off + map->insert({ + PAIR(FE000), + PAIR(FE001), + PAIR(FE002), + PAIR(FE003), + PAIR(FE004), + PAIR(FE005), + ... + PAIR(FEE7D), + PAIR(FEEA0), + PAIR(FE331)}); + // clang-format on +#undef PAIR +}; +``` + +
    + +
    + +Stop inlining a heavy user of InlinedVector operations. + + +Moved very long routine that was being inlined from .h file to .cc (no real +performance benefit from inlining this). + +reduction_ops_common.h + +{: .bad-code} +```c++ +Status Simplify(const Tensor& data, const Tensor& axis, + const bool keep_dims) { + ... Eighty line routine body ... +} +``` + +{: .new} +```c++ +Status Simplify(const Tensor& data, const Tensor& axis, const bool keep_dims); +``` + +
    + +## Parallelization and synchronization {#parallelization-and-synchronization} + +### Exploit Parallelism {#exploit-parallelism} + +Modern machines have many cores, and they are often underutilized. Expensive +work may therefore be completed faster by parallelizing it. The most common +approach is to process different items in parallel and combine the results when +done. Typically, the items are first partitioned into batches to avoid paying +the cost of running something in parallel per item. + +
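
A minimal generic sketch of this batch-and-combine structure (ours;
`ParallelSum` is hypothetical and uses plain `std::thread` rather than a
production thread pool):

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Partition the items into one batch per worker, process the batches in
// parallel, and combine the per-batch results at the end.
// Assumes num_workers > 0.
long long ParallelSum(const std::vector<int>& items, int num_workers) {
  std::vector<long long> partial(num_workers, 0);
  std::vector<std::thread> workers;
  const size_t batch = (items.size() + num_workers - 1) / num_workers;
  for (int w = 0; w < num_workers; ++w) {
    const size_t begin = std::min(items.size(), w * batch);
    const size_t end = std::min(items.size(), begin + batch);
    workers.emplace_back([&items, &partial, w, begin, end] {
      partial[w] = std::accumulate(items.begin() + begin,
                                   items.begin() + end, 0LL);
    });
  }
  for (std::thread& t : workers) t.join();
  return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```
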
    + +Improves the rate of encoding tokens by ~3.6x with four-way +parallelization. + + +blocked-token-coder.cc + +{: .new} +```c++ +MutexLock l(&encoder_threads_lock); +if (encoder_threads == NULL) { + encoder_threads = new ThreadPool(NumCPUs()); + encoder_threads->SetStackSize(262144); + encoder_threads->StartWorkers(); +} +encoder_threads->Add + (NewCallback(this, + &BlockedTokenEncoder::EncodeRegionInThread, + region_tokens, N, region, + stats, + controller_->GetClosureWithCost + (NewCallback(&DummyCallback), N))); +``` + +
    + +

Parallelization improves decoding performance by 5x.


coding.cc

{: .bad-code}
```c++
for (int c = 0; c < clusters->size(); c++) {
  RET_CHECK_OK(DecodeBulkForCluster(...));
}
```

{: .new}
```c++
struct SubTask {
  absl::Status result;
  absl::Notification done;
};

std::vector<SubTask> tasks(clusters->size());
for (int c = 0; c < clusters->size(); c++) {
  options_.executor->Schedule([&, c] {
    tasks[c].result = DecodeBulkForCluster(...);
    tasks[c].done.Notify();
  });
}
for (int c = 0; c < clusters->size(); c++) {
  tasks[c].done.WaitForNotification();
}
for (int c = 0; c < clusters->size(); c++) {
  RETURN_IF_ERROR(tasks[c].result);
}
```

    + +The effect on system performance should be measured carefully – if spare CPU is +not available, or if memory bandwidth is saturated, parallelization may not +help, or may even hurt. + +### Amortize Lock Acquisition {#amortize-lock-acquisition} + +Avoid fine-grained locking to reduce the cost of Mutex operations in hot paths. +Caveat: this should only be done if the change does not increase lock +contention. + +
    + +Acquire lock once to free entire tree of query nodes, rather +than reacquiring lock for every node in tree. + + +mustang-query.cc + +{: .bad-code} +```c++ +// Pool of query nodes +ThreadSafeFreeList pool_(256); +... +void MustangQuery::Release(MustangQuery* node) { + if (node == NULL) + return; + for (int i=0; i < node->children_->size(); ++i) + Release((*node->children_)[i]); + node->children_->clear(); + pool_.Delete(node); +} +``` + +{: .new} +```c++ +// Pool of query nodes +Mutex pool_lock_; +FreeList pool_(256); +... +void MustangQuery::Release(MustangQuery* node) { + if (node == NULL) + return; + MutexLock l(&pool_lock_); + ReleaseLocked(node); +} + +void MustangQuery::ReleaseLocked(MustangQuery* node) { +#ifndef NDEBUG + pool_lock_.AssertHeld(); +#endif + if (node == NULL) + return; + for (int i=0; i < node->children_->size(); ++i) + ReleaseLocked((*node->children_)[i]); + node->children_->clear(); + pool_.Delete(node); +} +``` + +
    + +### Keep critical sections short {#keep-critical-sections-short} + +Avoid expensive work inside critical sections. In particular, watch out for +innocuous looking code that might be doing RPCs or accessing +files. + +
    + +Reduce number of cache lines touched in critical section. + + +Careful data structure adjustments reduce the number of cache lines accessed +significantly and improve the performance of an ML training run by 3.3%. + +1. Precompute some per-node type properties as bits within the NodeItem data + structure, meaning that we can avoid touching the Node* object for outgoing + edges in the critical section. +2. Change ExecutorState::ActivateNodes to use the NodeItem of the destination + node for each outgoing edge, rather than touching fields in the *item->node + object. Typically this means that we touch 1 or 2 cache lines total for + accessing the needed edge data, rather than `~2 + O(num_outgoing edges)` + (and for large graphs with many cores executing them there is also less TLB + pressure). + +
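A rough sketch of the idea in point 1 above (the names `NodeItem` and `ActivateOutputs` here are illustrative stand-ins, not the actual TensorFlow code): precompute the per-node booleans the critical section needs into a compact, contiguous array so the hot path never dereferences the much larger `Node` objects.

```c++
#include <cstdint>

#include "absl/types/span.h"

// Per-node data needed on the hot path, packed into a few bytes and stored in
// one contiguous array (indexed by node id), so the critical section touches
// only this array's cache lines and never the much larger Node objects.
struct NodeItem {
  static constexpr uint8_t kIsMerge = 1 << 0;
  static constexpr uint8_t kIsControlEdge = 1 << 1;
  uint8_t flags = 0;       // Precomputed, bit-packed node properties.
  int32_t pending_id = 0;  // Whatever else the hot path needs.

  bool is_merge() const { return (flags & kIsMerge) != 0; }
};

// Called with the lock held: consults only the compact NodeItem array,
// typically touching 1-2 cache lines per outgoing edge.
void ActivateOutputs(const NodeItem* items, absl::Span<const int> out_nodes) {
  for (int dst : out_nodes) {
    const NodeItem& dst_item = items[dst];
    if (dst_item.is_merge()) {
      // ... merge-specific bookkeeping ...
    } else {
      // ... normal bookkeeping ...
    }
  }
}
```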
    + +
    + +Avoid RPC while holding Mutex. + + +trainer.cc + +{: .bad-code} +```c++ +{ + // Notify the parameter server that we are starting. + MutexLock l(&lock_); + model_ = model; + MaybeRecordProgress(last_global_step_); +} +``` + +{: .new} +```c++ +bool should_start_record_progress = false; +int64 step_for_progress = -1; +{ + // Notify the parameter server that we are starting. + MutexLock l(&lock_); + model_ = model; + should_start_record_progress = ShouldStartRecordProgress(); + step_for_progress = last_global_step_; +} +if (should_start_record_progress) { + StartRecordProgress(step_for_progress); +} +``` + +
+

Also, be wary of expensive destructors that will run before a Mutex is unlocked
(this can often happen when the Mutex unlock is triggered by a `~MutexLock`).
Declaring objects with expensive destructors before the MutexLock may help
(assuming doing so is thread-safe).

### Reduce contention by sharding {#reduce-contention-by-sharding}

Sometimes a data structure protected by a Mutex that is exhibiting high
contention can be safely split into multiple shards, each shard with its own
Mutex. (Note: this requires that there are no cross-shard invariants between the
different shards.)
    + +Shards a cache 16 ways which improves throughput under a +multi-threaded load by ~2x. + + +cache.cc + +{: .new} +```c++ +class ShardedLRUCache : public Cache { + private: + LRUCache shard_[kNumShards]; + port::Mutex id_mutex_; + uint64_t last_id_; + + static inline uint32_t HashSlice(const Slice& s) { + return Hash(s.data(), s.size(), 0); + } + + static uint32_t Shard(uint32_t hash) { + return hash >> (32 - kNumShardBits); + } + ... + virtual Handle* Lookup(const Slice& key) { + const uint32_t hash = HashSlice(key); + return shard_[Shard(hash)].Lookup(key, hash); + } +``` + +
    + +
    + +Shards spanner data structure for tracking calls. + + +transaction_manager.cc + +{: .bad-code} +```c++ +absl::MutexLock l(&active_calls_in_mu_); +ActiveCallMap::const_iterator iter = active_calls_in_.find(m->tid()); +if (iter != active_calls_in_.end()) { + iter->second.ExtractElements(&m->tmp_calls_); +} +``` + +{: .new} +```c++ +ActiveCalls::LockedShard shard(active_calls_in_, m->tid()); +const ActiveCallMap& active_calls_map = shard.active_calls_map(); +ActiveCallMap::const_iterator iter = active_calls_map.find(m->tid()); +if (iter != active_calls_map.end()) { + iter->second.ExtractElements(&m->tmp_calls_); +} +``` + +
    + +If the data structure in question is a map, consider using a concurrent hash map +implementation instead. + +Be careful with the information used for shard selection. If, for example, you +use some bits of a hash value for shard selection and then those same bits end +up being used again later, the latter use may perform poorly since it sees a +skewed distribution of hash values. + +
+

Fix information used for shard selection to prevent hash
table issues.


netmon_map_impl.h

{: .bad-code}
```c++
ConnectionBucket* GetBucket(Index index) {
  // Rehash the hash to make sure we are not partitioning the buckets based on
  // the original hash. If num_buckets_ is a power of 2 that would drop the
  // entropy of the buckets.
  size_t original_hash = absl::Hash<Index>()(index);
  int hash = absl::Hash<size_t>()(original_hash) % num_buckets_;
  return &buckets_[hash];
}
```

{: .new}
```c++
ConnectionBucket* GetBucket(Index index) {
  absl::Hash<std::pair<Index, int>> hasher{};
  // Combine the hash with 42 to prevent shard selection using the same bits
  // as the underlying hashtable.
  return &buckets_[hasher({index, 42}) % num_buckets_];
}
```
    + +
    + +Shard Spanner data structure used for tracking calls. + + +This CL partitions the ActiveCallMap into 64 shards. Each shard is protected by +a separate mutex. A given transaction will be mapped to exactly one shard. A new +interface LockedShard(tid) is added for accessing the ActiveCallMap for a +transaction in a thread-safe manner. Example usage: + +transaction_manager.cc + +{: .bad-code} +```c++ +{ + absl::MutexLock l(&active_calls_in_mu_); + delayed_locks_timer_ring_.Add(delayed_locks_flush_time_ms, tid); +} +``` + +{: .new} +```c++ +{ + ActiveCalls::LockedShard shard(active_calls_in_, tid); + shard.delayed_locks_timer_ring().Add(delayed_locks_flush_time_ms, tid); +} +``` + +The results show a 69% reduction in overall wall-clock time when running the +benchmark with 8192 fibers + +{: .bad-code} +``` +Benchmark Time(ns) CPU(ns) Iterations +------------------------------------------------------------------ +BM_ActiveCalls/8k 11854633492 98766564676 10 +BM_ActiveCalls/16k 26356203552 217325836709 10 +``` + +{: .new} +``` +Benchmark Time(ns) CPU(ns) Iterations +------------------------------------------------------------------ +BM_ActiveCalls/8k 3696794642 39670670110 10 +BM_ActiveCalls/16k 7366284437 79435705713 10 +``` + +
    + +### SIMD Instructions {#simd-instructions} + +Explore whether handling multiple items at once using +[SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) +instructions available on modern CPUs can give speedups (e.g., see +`absl::flat_hash_map` discussion below in [Bulk Operations](#bulk-operations) +section). + +### Reduce false sharing {#reduce-false-sharing} + +If different threads access different mutable data, consider placing the +different data items on different cache lines, e.g., in C++ using the `alignas` +directive. However, these directives are easy to misuse and may increase object +sizes significantly, so make sure performance measurements justify their use. + +
    + +Segregate commonly mutated fields in a different cache +line than other fields. + + +histogram.h + +{: .bad-code} +```c++ +HistogramOptions options_; +... +internal::HistogramBoundaries *boundaries_; +... +std::vector buckets_; + +double min_; // Minimum. +double max_; // Maximum. +double count_; // Total count of occurrences. +double sum_; // Sum of values. +double sum_of_squares_; // Sum of squares of values. +... +RegisterVariableExporter *exporter_; +``` + +{: .new} +```c++ + HistogramOptions options_; + ... + internal::HistogramBoundaries *boundaries_; + ... + RegisterVariableExporter *exporter_; + ... + // Place the following fields in a dedicated cacheline as they are frequently + // mutated, so we can avoid potential false sharing. + ... +#ifndef SWIG + alignas(ABSL_CACHELINE_SIZE) +#endif + std::vector buckets_; + + double min_; // Minimum. + double max_; // Maximum. + double count_; // Total count of occurrences. + double sum_; // Sum of values. + double sum_of_squares_; // Sum of squares of values. +``` + +
    + +### Reduce frequency of context switches + +
+

Process small work items inline instead of on device
thread pool.


cast_op.cc

{: .new}
```c++
template <typename Device, typename Tout, typename Tin>
void CastMaybeInline(const Device& d, typename TTypes<Tout>::Flat o,
                     typename TTypes<Tin>::ConstFlat i) {
  if (o.size() * (sizeof(Tin) + sizeof(Tout)) < 16384) {
    // Small cast on a CPU: do inline
    o = i.template cast<Tout>();
  } else {
    o.device(d) = i.template cast<Tout>();
  }
}
```
    + +### Use buffered channels for pipelining {#use-buffered-channels-for-pipelining} + +Channels can be unbuffered which means that a writer blocks until a reader is +ready to pick up an item. Unbuffered channels can be useful when the channel is +being used for synchronization, but not when the channel is being used to +increase parallelism. + +### Consider lock-free approaches + +Sometimes lock-free data structures can make a difference over more conventional +mutex-protected data structures. However, direct atomic variable manipulation +can be [dangerous][atomic danger]. Prefer higher-level abstractions. + +
    + +Use lock-free map to manage a cache of RPC channels. + + +Entries in an RPC stub cache are read thousands of times a second and modified +rarely. Switching to an appropriate lock-free map reduces search latency by +
    + +
    + +Use a fixed lexicon+lock-free hash map to speed-up +determining IsValidTokenId. + + +dynamic_token_class_manager.h + +{: .bad-code} +```c++ +mutable Mutex mutex_; + +// The density of this hash map is guaranteed by the fact that the +// dynamic lexicon reuses previously allocated TokenIds before trying +// to allocate new ones. +dense_hash_map tid_to_cid_ + GUARDED_BY(mutex_); +``` + +{: .new} +```c++ +// Read accesses to this hash-map should be done using +// 'epoch_gc_'::(EnterFast / LeaveFast). The writers should periodically +// GC the deleted entries, by simply invoking LockFreeHashMap::CreateGC. +typedef util::gtl::LockFreeHashMap + TokenIdTokenClassIdMap; +TokenIdTokenClassIdMap tid_to_cid_; +``` + +
    + +## Protocol Buffer advice {#protobuf-advice} + +Protobufs are a convenient representation of data, especially if the data will +be sent over the wire or stored persistently. However, they can have significant +performance costs. For example, a piece of code that fills in a list of 1000 +points and then sums up the Y coordinates, speeds up by a **factor of 20** when +converted from protobufs to a C++ std::vector of structs! + +
+

Benchmark code for both versions.


{: .bench}
```
name                old time/op  new time/op  delta
BenchmarkIteration  17.4µs ± 5%  0.8µs ± 1%   -95.30%  (p=0.000 n=11+12)
```

Protobuf version:

{: .bad-code}
```proto
message PointProto {
  int32 x = 1;
  int32 y = 2;
}
message PointListProto {
  repeated PointProto points = 1;
}
```

{: .bad-code}
```c++
void SumProto(const PointListProto& vec) {
  int sum = 0;
  for (const PointProto& p : vec.points()) {
    sum += p.y();
  }
  ABSL_VLOG(1) << sum;
}

void BenchmarkIteration() {
  PointListProto points;
  points.mutable_points()->Reserve(1000);
  for (int i = 0; i < 1000; i++) {
    PointProto* p = points.add_points();
    p->set_x(i);
    p->set_y(i * 2);
  }
  SumProto(points);
}
```

Non-protobuf version:

{: .new}
```c++
struct PointStruct {
  int x;
  int y;
};

void SumVector(const std::vector<PointStruct>& vec) {
  int sum = 0;
  for (const PointStruct& p : vec) {
    sum += p.y;
  }
  ABSL_VLOG(1) << sum;
}

void BenchmarkIteration() {
  std::vector<PointStruct> points;
  points.reserve(1000);
  for (int i = 0; i < 1000; i++) {
    points.push_back({i, i * 2});
  }
  SumVector(points);
}
```
    + +In addition, the protobuf version adds a few kilobytes of code and data to the +binary, which may not seem like much, but adds up quickly in systems with many +protobuf types. This increased size creates performance problems by creating +i-cache and d-cache pressure. + +Here are some tips related to protobuf performance: + +
+

Do not use protobufs unnecessarily.


Given the factor of 20 performance difference described above, if some data is
never serialized or parsed, you probably should not put it in a protocol buffer.
The purpose of protocol buffers is to make it easy to serialize and deserialize
data structures, but they can have significant code-size, memory, and CPU
overheads. Do not use them if all you want are some of the other niceties they
provide.
    + +
+

Avoid unnecessary message hierarchies.


Message hierarchies can be useful for organizing information in a more readable
fashion. However, each extra level of message hierarchy incurs overheads like
memory allocations, function calls, cache misses, larger serialized messages,
etc.

E.g., instead of:

{: .bad-code}
```proto
message Foo {
  optional Bar bar = 1;
}
message Bar {
  optional Baz baz = 1;
}
message Baz {
  optional int32 count = 1;
}
```

Prefer:

{: .new}
```proto
message Foo {
  optional int32 count = 1;
}
```

A protocol buffer message corresponds to a message class in C++ generated code
and emits a tag and the length of the payload on the wire. To carry an integer,
the old form requires more allocations (and deallocations) and emits a larger
amount of generated code. As a result, all protocol buffer operations (parsing,
serialization, size, etc.) become more expensive, having to traverse the message
hierarchy.
    + +
+

Use small field numbers for frequently occurring fields.


Protobufs use a variable-length integer representation for the combination of
field number and wire type (see the
[protobuf encoding documentation](https://protobuf.dev/programming-guides/encoding/)).
This representation is 1 byte for field numbers between 1 and 15, and two bytes
for field numbers between 16 and 2047. (Field numbers 2048 or greater should
typically be avoided.)

Consider pre-reserving some small field numbers for future extensions of
messages that are encoded and decoded frequently.
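The size difference follows directly from the encoding rule above. A small self-contained snippet (ours, not from the document's CLs) that computes the encoded tag size:

```c++
#include <cstdint>

// A protobuf field tag is (field_number << 3) | wire_type, written as a
// varint (7 payload bits per byte).  Returns its encoded size in bytes.
int TagSize(uint32_t field_number, uint32_t wire_type = 0) {
  uint32_t tag = (field_number << 3) | wire_type;
  int bytes = 1;
  while (tag >= 0x80) {
    tag >>= 7;
    ++bytes;
  }
  return bytes;
}

// TagSize(1) == 1, TagSize(15) == 1, TagSize(16) == 2,
// TagSize(2047) == 2, TagSize(2048) == 3.
```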
    + +
+

Choose carefully between int32, sint32, fixed32, and uint32 (and
similarly for the 64-bit variants).


Generally, use `int32` or `int64`, but use `fixed32` or `fixed64` for large
values like hash codes and `sint32` or `sint64` for values that are often
negative.

A varint occupies fewer bytes to encode small integers and can save space at the
cost of more expensive decoding. However, it can take up more space for negative
or large values. In that case, using fixed32 or fixed64 (instead of uint32 or
uint64) reduces size with much cheaper encoding and decoding. For small negative
values, sint32 or sint64 (which use zigzag encoding) are much more compact than
int32 or int64.
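To see why `sint32`/`sint64` help for negative numbers, here is the zigzag mapping described in the protobuf encoding documentation (a standalone illustration, not code from a CL). A negative `int32` is sign-extended and always costs 10 varint bytes, while its zigzag-encoded counterpart is small again:

```c++
#include <cstdint>

// ZigZag mapping used by sint32: small-magnitude negative numbers become
// small unsigned numbers, so they encode in 1-2 varint bytes instead of the
// 10 bytes that a negative int32 needs.
uint32_t ZigZag32(int32_t n) {
  return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}

// ZigZag32(0) == 0, ZigZag32(-1) == 1, ZigZag32(1) == 2, ZigZag32(-2) == 3.
```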
    + +
+

For proto2, pack repeated numeric fields by annotating them with
[packed=true].


In proto2, repeated values are serialized as a sequence of (tag, value) pairs by
default. This is inefficient because tags have to be decoded for every element.

Packed repeated primitives are serialized with the length of the payload first,
followed by the values without tags. When using fixed-width values, we can avoid
reallocations by knowing the final size the moment we start parsing; i.e., no
reallocation cost. We still don't know how many varints are in the payload and
may have to pay the reallocation cost.

In proto3, repeated numeric fields are packed by default.

Packed works best with fixed-width values like fixed32, fixed64, float, double,
etc., since the entire encoded length can be predetermined by multiplying the
number of elements by the fixed value size, instead of having to calculate the
size of each element individually.
    + +
+

Use bytes instead of string for binary data
and large values.


The `string` type holds UTF8-encoded text, and can sometimes require validation.
The `bytes` type can hold an arbitrary sequence of bytes (non-text data) and is
cheaper because it never requires UTF-8 validation.
    + +
    + +Consider using Cord for large fields to reduce copying +costs. + + +Annotating large `bytes` and `string` fields with `[ctype=CORD]` may reduce +copying costs. This annotation changes the representation of the field from +`std::string` to `absl::Cord`. `absl::Cord` uses reference counting and +tree-based storage to reduce copying and appending costs. If a protocol buffer +is serialized to a cord, parsing a string or bytes field with `[ctype=CORD]` can +avoid copying the field contents. + +{: .new} +```proto +message Document { + ... + bytes html = 4 [ctype = CORD]; +} +``` + +Performance of a Cord field depends on length distribution and access patterns. +
    + +
+

Use protobuf arenas in C++ code.


Consider using arenas to save allocation and deallocation costs, especially for
protobufs containing repeated, string, or message fields.

Message and string fields are heap-allocated (even if the top-level protocol
buffer object is stack-allocated). If a protocol buffer message has a lot of sub
message fields and string fields, allocation and deallocation cost can be
significant. Arenas amortize allocation costs and make deallocation virtually
free. They also improve memory locality by allocating from contiguous chunks of
memory.
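A minimal sketch of arena usage (the `MyMessage` type and surrounding function are hypothetical; the arena API shown is the standard `google::protobuf::Arena` one):

```c++
#include <string>

#include "google/protobuf/arena.h"

// `MyMessage` is a placeholder for a generated protobuf type with message,
// string, or repeated fields.
void HandleRequest(const std::string& serialized) {
  google::protobuf::Arena arena;
  MyMessage* msg = google::protobuf::Arena::CreateMessage<MyMessage>(&arena);
  msg->ParseFromString(serialized);
  // ... use *msg ...
  // No deletes: the message and all of its sub-objects are released in bulk
  // when `arena` goes out of scope.
}
```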
    + +
+

Keep .proto files small.


Do not put too many messages in a single .proto file. Once you rely on anything
at all from a .proto file, the entire file will get pulled in by the linker even
if it's mostly unused. This increases build times and binary sizes. You can use
extensions and Any to avoid creating hard dependencies on big
.proto files.
    + +
+

Consider storing protocol buffers in serialized form, even in memory.


In-memory protobuf objects have a large memory footprint (often 5x the wire
format size), potentially spread across many cache lines. So if your application
is going to keep many protobuf objects live for long periods of time, consider
keeping them in serialized (wire-format) form and parsing them on demand when
they are actually needed.
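A sketch of this pattern, assuming a hypothetical `DocumentProto` message: keep only the compact wire-format bytes for long-lived entries and materialize a full message object on demand.

```c++
#include <string>
#include <vector>

// `DocumentProto` is a placeholder for a generated protobuf type.
class DocumentCache {
 public:
  void Add(const DocumentProto& doc) {
    serialized_.push_back(doc.SerializeAsString());
  }

  // Parse on demand; the in-memory protobuf object is short-lived.
  DocumentProto Get(size_t i) const {
    DocumentProto doc;
    doc.ParseFromString(serialized_[i]);
    return doc;
  }

 private:
  std::vector<std::string> serialized_;  // Wire format, often ~5x smaller.
};
```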
    + +
+

Avoid protobuf map fields.


Protobuf map fields have performance problems that usually outweigh the small
syntactic convenience they provide. Prefer using non-protobuf maps initialized
from protobuf contents:

msg.proto

{: .bad-code}
```proto
map<string, bytes> env_variables = 5;
```

{: .new}
```proto
message Var {
  string key = 1;
  bytes value = 2;
}
repeated Var env_variables = 5;
```
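For example, an ordinary hash map can be built once from the repeated `Var` field above (a sketch: the containing `Msg` type name is hypothetical, and the accessors follow the usual generated protobuf C++ conventions):

```c++
#include <string>

#include "absl/container/flat_hash_map.h"

absl::flat_hash_map<std::string, std::string> EnvMap(const Msg& msg) {
  absl::flat_hash_map<std::string, std::string> env;
  env.reserve(msg.env_variables_size());
  for (const Var& v : msg.env_variables()) {
    env.emplace(v.key(), v.value());
  }
  return env;
}
```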
    + +
+

Use a protobuf message definition with a subset of the fields.


If you want to access only a few fields of a large message type, consider
defining your own protocol buffer message type that mimics the original type,
but only defines the fields that you care about. Here's an example:

{: .bad-code}
```proto
message FullMessage {
  optional int32 field1 = 1;
  optional BigMessage field2 = 2;
  optional int32 field3 = 3;
  repeated AnotherBigMessage field4 = 4;
  ...
  optional int32 field100 = 100;
}
```

{: .new}
```proto
message SubsetMessage {
  optional int32 field3 = 3;
  optional int32 field88 = 88;
}
```

By parsing a serialized `FullMessage` into a `SubsetMessage`, only two out of a
hundred fields are parsed and the others are treated as unknown fields. Consider
using APIs that discard unknown fields to improve performance even more when
the unknown field contents are not needed.
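A sketch of how this is used (names follow the example above; `serialized_full` is assumed to hold the bytes of a serialized `FullMessage`): the subset type parses the same bytes, decoding only the fields it declares.

```c++
#include <cstdint>
#include <string>

int32_t ReadField3(const std::string& serialized_full) {
  SubsetMessage subset;
  // Fields 3 and 88 are decoded; everything else is skipped as unknown fields.
  subset.ParseFromString(serialized_full);
  return subset.field3();
}
```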
    + +
+

Reuse protobuf objects when possible.


Declare protobuf objects outside loops so that their allocated storage can be
reused across loop iterations.
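A minimal sketch with a hypothetical `RecordProto` message:

```c++
#include <string>
#include <vector>

void ProcessAll(const std::vector<std::string>& serialized_records) {
  RecordProto record;  // Declared once, outside the loop.
  for (const std::string& bytes : serialized_records) {
    // Parsing replaces the previous contents but can reuse the message's
    // already-allocated repeated-field and string storage.
    record.ParseFromString(bytes);
    // ... process record ...
  }
}
```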
    + + + +## C++-Specific Advice + +### absl::flat_hash_map (and set) + +[Absl hash tables](https://abseil.io/docs/cpp/guides/container) usually +out-perform C++ standard library containers such as `std::map` and +`std::unordered_map`. + +
    + +Sped up LanguageFromCode (use absl::flat_hash_map instead +of a __gnu_cxx::hash_map). + + +languages.cc + +{: .bad-code} +```c++ +class CodeToLanguage + ... + : public __gnu_cxx::hash_map { +``` + +{: .new} +```c++ +class CodeToLanguage + ... + : public absl::flat_hash_map { +``` + +Benchmark results: + +{: .bench} +``` +name old time/op new time/op delta +BM_CodeToLanguage 19.4ns ± 1% 10.2ns ± 3% -47.47% (p=0.000 n=8+10) +``` + +
    + +
+

Speed up stats publish/unpublish (an older change, so
uses dense_hash_map instead of absl::flat_hash_map, which did not exist at the
time).


publish.cc

{: .bad-code}
```c++
typedef hash_map PublicationMap;
static PublicationMap* publications = NULL;
```

{: .new}
```c++
typedef dense_hash_map PublicationMap;
static PublicationMap* publications GUARDED_BY(mu) = NULL;
```
    + +
    + +Use dense_hash_map instead of hash_map for keeping track of +SelectServer alarms (would use absl::flat_hash_map today). + + +alarmer.h + +{: .bad-code} +```c++ +typedef hash_map AlarmList; +``` + +{: .new} +```c++ +typedef dense_hash_map AlarmList; +``` + +
    + +### absl::btree_map/absl::btree_set + +absl::btree_map and absl::btree_set store multiple entries per tree node. This +has a number of advantages over ordered C++ standard library containers such as +`std::map`. First, the pointer overhead of pointing to child tree nodes is often +significantly reduced. Second, because the entries or key/values are stored +consecutively in memory for a given btree tree node, cache efficiency is often +significantly better. + +
    + +Use btree_set instead of std::set to represent a very heavily used +work-queue. + + +register_allocator.h + +{: .bad-code} +```c++ +using container_type = std::set; +``` + +{: .new} +```c++ +using container_type = absl::btree_set; +``` + +
+

### util::bitmap::InlinedBitVector

`util::bitmap::InlinedBitVector` can store short bit-vectors inline, and
therefore can often be a better choice than `std::vector<bool>` or other bitmap
types.
+

Use InlinedBitVector instead of std::vector<bool>, and
then use FindNextSetBit to find the next item of interest.


block_encoder.cc

{: .bad-code}
```c++
vector<bool> live_reads(nreads);
...
for (int offset = 0; offset < b_.block_width(); offset++) {
  ...
  for (int r = 0; r < nreads; r++) {
    if (live_reads[r]) {
```

{: .new}
```c++
util::bitmap::InlinedBitVector<4096> live_reads(nreads);
...
for (int offset = 0; offset < b_.block_width(); offset++) {
  ...
  for (size_t r = 0; live_reads.FindNextSetBit(&r); r++) {
    DCHECK(live_reads[r]);
```
    + +### absl::InlinedVector + +absl::InlinedVector stores a small number of elements inline (configurable via +the second template argument). This enables small vectors up to this number of +elements to generally have better cache efficiency and also to avoid allocating +a backing store array at all when the number of elements is small. + +
    + +Use InlinedVector instead of std::vector in various places. + + +bundle.h + +{: .bad-code} +```c++ +class Bundle { + public: + ... + private: + // Sequence of (slotted instruction, unslotted immediate operands). + std::vector instructions_; + ... +}; +``` + +{: .new} +```c++ +class Bundle { + public: + ... + private: + // Sequence of (slotted instruction, unslotted immediate operands). + absl::InlinedVector instructions_; + ... +}; +``` + +
    + +### gtl::vector32 + +Saves space by using a customized vector type that only supports sizes that fit +in 32 bits. + +
    + +Simple type change saves ~8TiB of memory in Spanner. + + +table_ply.h + +{: .bad-code} +```c++ +class TablePly { + ... + // Returns the set of data columns stored in this file for this table. + const std::vector& modified_data_columns() const { + return modified_data_columns_; + } + ... + private: + ... + std::vector modified_data_columns_; // Data columns in the table. +``` + +{: .new} +```c++ +#include "util/gtl/vector32.h" + ... + // Returns the set of data columns stored in this file for this table. + absl::Span modified_data_columns() const { + return modified_data_columns_; + } + ... + + ... + // Data columns in the table. + gtl::vector32 modified_data_columns_; +``` + +
    + +### gtl::small_map + +gtl::small_map uses an inline array to store up to a certain number of unique +key-value-pair elements, but upgrades itself automatically to be backed by a +user-specified map type when it runs out of space. + +
    + +Use gtl::small_map in tflite_model. + + +tflite_model.cc + +{: .bad-code} +```c++ +using ChoiceIdToContextMap = gtl::flat_hash_map; +``` + +{: .new} +```c++ +using ChoiceIdToContextMap = + gtl::small_map>; +``` + +
    + +### gtl::small_ordered_set + +gtl::small_ordered_set is an optimization for associative containers (such as +std::set or absl::btree_multiset). It uses a fixed array to store a certain +number of elements, then reverts to using a set or multiset when it runs out of +space. For sets that are typically small, this can be considerably faster than +using something like set directly, as set is optimized for large data sets. This +change shrinks cache footprint and reduces critical section length. + +
    + +Use gtl::small_ordered_set to hold set of listeners. + + +broadcast_stream.h + +{: .bad-code} +```c++ +class BroadcastStream : public ParsedRtpTransport { + ... + private: + ... + std::set listeners_ ABSL_GUARDED_BY(listeners_mutex_); +}; +``` + +{: .new} +```c++ +class BroadcastStream : public ParsedRtpTransport { + ... + private: + ... + using ListenersSet = + gtl::small_ordered_set, 10>; + ListenersSet listeners_ ABSL_GUARDED_BY(listeners_mutex_); +``` + +
    + +### gtl::intrusive_list {#gtl-intrusive_list} + +`gtl::intrusive_list` is a doubly-linked list where the link pointers are +embedded in the elements of type T. It saves one cache line+indirection per +element when compared to `std::list`. + +
    + +Use intrusive_list to keep track of inflight requests for +each index row update. + + +row-update-sender-inflight-set.h + +{: .bad-code} +```c++ +std::set inflight_requests_ GUARDED_BY(mu_); +``` + +{: .new} +```c++ +class SeqNum : public gtl::intrusive_link { + ... + int64 val_ = -1; + ... +}; +... +gtl::intrusive_list inflight_requests_ GUARDED_BY(mu_); +``` + +
    + +### Limit absl::Status and absl::StatusOr usage + +Even though `absl::Status` and `absl::StatusOr` types are fairly efficient, they +have a non-zero overhead even in the success path and should therefore be +avoided for hot routines that don't need to return any meaningful error details +(or perhaps never even fail!): + +
+

Avoid StatusOr<int64> return type for
RoundUpToAlignment() function.


best_fit_allocator.cc

{: .bad-code}
```c++
absl::StatusOr<int64> BestFitAllocator::RoundUpToAlignment(int64 bytes) const {
  TPU_RET_CHECK_GE(bytes, 0);

  const int64 max_aligned = MathUtil::RoundDownTo(
      std::numeric_limits<int64>::max(), alignment_in_bytes_);
  if (bytes > max_aligned) {
    return util::ResourceExhaustedErrorBuilder(ABSL_LOC)
           << "Attempted to allocate "
           << strings::HumanReadableNumBytes::ToString(bytes)
           << " which after aligning to "
           << strings::HumanReadableNumBytes::ToString(alignment_in_bytes_)
           << " cannot be expressed as an int64.";
  }

  return MathUtil::RoundUpTo(bytes, alignment_in_bytes_);
}
```

best_fit_allocator.h

{: .new}
```c++
// Rounds bytes up to nearest multiple of alignment_.
// REQUIRES: bytes >= 0.
// REQUIRES: result does not overflow int64.
// REQUIRES: alignment_in_bytes_ is a power of 2 (checked in constructor).
int64 RoundUpToAlignment(int64 bytes) const {
  DCHECK_GE(bytes, 0);
  DCHECK_LE(bytes, max_aligned_bytes_);
  int64 result =
      ((bytes + (alignment_in_bytes_ - 1)) & ~(alignment_in_bytes_ - 1));
  DCHECK_EQ(result, MathUtil::RoundUpTo(bytes, alignment_in_bytes_));
  return result;
}
```
    + +
    + +Added ShapeUtil::ForEachIndexNoStatus to avoid creating a +Status return object for every element of a tensor. + + +shape_util.h + +{: .bad-code} +```c++ +using ForEachVisitorFunction = + absl::FunctionRef(absl::Span)>; + ... +static void ForEachIndex(const Shape& shape, absl::Span base, + absl::Span count, + absl::Span incr, + const ForEachVisitorFunction& visitor_function); + +``` + +{: .new} +```c++ +using ForEachVisitorFunctionNoStatus = + absl::FunctionRef)>; + ... +static void ForEachIndexNoStatus( + const Shape& shape, absl::Span base, + absl::Span count, absl::Span incr, + const ForEachVisitorFunctionNoStatus& visitor_function); +``` + +literal.cc + +{: .bad-code} +```c++ +ShapeUtil::ForEachIndex( + result_shape, [&](absl::Span output_index) { + for (int64_t i = 0, end = dimensions.size(); i < end; ++i) { + scratch_source_index[i] = output_index[dimensions[i]]; + } + int64_t dest_index = IndexUtil::MultidimensionalIndexToLinearIndex( + result_shape, output_index); + int64_t source_index = IndexUtil::MultidimensionalIndexToLinearIndex( + shape(), scratch_source_index); + memcpy(dest_data + primitive_size * dest_index, + source_data + primitive_size * source_index, primitive_size); + return true; + }); +``` + +{: .new} +```c++ +ShapeUtil::ForEachIndexNoStatus( + result_shape, [&](absl::Span output_index) { + // Compute dest_index + int64_t dest_index = IndexUtil::MultidimensionalIndexToLinearIndex( + result_shape, result_minor_to_major, output_index); + + // Compute source_index + int64_t source_index; + for (int64_t i = 0, end = dimensions.size(); i < end; ++i) { + scratch_source_array[i] = output_index[dimensions[i]]; + } + if (src_shape_dims == 1) { + // Fast path for this case + source_index = scratch_source_array[0]; + DCHECK_EQ(source_index, + IndexUtil::MultidimensionalIndexToLinearIndex( + src_shape, src_minor_to_major, scratch_source_span)); + } else { + source_index = IndexUtil::MultidimensionalIndexToLinearIndex( + src_shape, src_minor_to_major, scratch_source_span); + } + // Move one element from source_index in source to dest_index in dest + memcpy(dest_data + PRIMITIVE_SIZE * dest_index, + source_data + PRIMITIVE_SIZE * source_index, PRIMITIVE_SIZE); + return true; + }); +``` + +
    + +
    + +In TF_CHECK_OK, avoid creating Ok object in order to test +for ok(). + + +status.h + +{: .bad-code} +```c++ +#define TF_CHECK_OK(val) CHECK_EQ(::tensorflow::Status::OK(), (val)) +#define TF_QCHECK_OK(val) QCHECK_EQ(::tensorflow::Status::OK(), (val)) +``` + +{: .new} +```c++ +extern tensorflow::string* TfCheckOpHelperOutOfLine( + const ::tensorflow::Status& v, const char* msg); +inline tensorflow::string* TfCheckOpHelper(::tensorflow::Status v, + const char* msg) { + if (v.ok()) return nullptr; + return TfCheckOpHelperOutOfLine(v, msg); +} +#define TF_CHECK_OK(val) \ + while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \ + LOG(FATAL) << *(_result) +#define TF_QCHECK_OK(val) \ + while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \ + LOG(QFATAL) << *(_result) +``` + +
    + +
    + +Remove StatusOr from the hot path of remote procedure +calls (RPCs). + + +Removal of StatusOr from a hot path eliminated a 14% CPU regression in RPC +benchmarks caused by an earlier change. + +privacy_context.h + +{: .bad-code} +```c++ +absl::StatusOr GetRawPrivacyContext( + const CensusHandle& h); +``` + +privacy_context_statusfree.h + +{: .new} +```c++ +enum class Result { + kSuccess, + kNoRootScopedData, + kNoPrivacyContext, + kNoDDTContext, + kDeclassified, + kNoPrequestContext +}; +... +Result GetRawPrivacyContext(const CensusHandle& h, + PrivacyContext* privacy_context); +``` + +
    + +## Bulk Operations {#bulk-operations} + +If possible, handle many items at once rather than just one at a time. + +
    + +absl::flat_hash_map compares one hash byte per key from a +group of keys using a single SIMD instruction. + + +See [Swiss Table Design Notes](https://abseil.io/about/design/swisstables) and +related [CppCon 2017](https://www.youtube.com/watch?v=ncHmEUmJZf4) and +[CppCon 2019](https://www.youtube.com/watch?v=JZE3_0qvrMg) talks by Matt +Kulukundis. + +raw_hash_set.h + +{: .new} +```c++ +// Returns a bitmask representing the positions of slots that match hash. +BitMask Match(h2_t hash) const { + auto ctrl = _mm_loadu_si128(reinterpret_cast(pos)); + auto match = _mm_set1_epi8(hash); + return BitMask(_mm_movemask_epi8(_mm_cmpeq_epi8(match, ctrl))); +} +``` + +
    + +
    + +Do single operations to deal with many bytes and fix +things up, rather than checking every byte what to do. + + +ordered-code.cc + +{: .bad-code} +```c++ +int len = 0; +while (val > 0) { + len++; + buf[9 - len] = (val & 0xff); + val >>= 8; +} +buf[9 - len - 1] = (unsigned char)len; +len++; +FastStringAppend(dest, reinterpret_cast(buf + 9 - len), len); +``` + +{: .new} +```c++ +BigEndian::Store(val, buf + 1); // buf[0] may be needed for length +const unsigned int length = OrderedNumLength(val); +char* start = buf + 9 - length - 1; +*start = length; +AppendUpto9(dest, start, length + 1); +``` + +
    + +
    + +Improve Reed-Solomon processing speed by handling +multiple interleaved input buffers more efficiently in chunks. + + +{: .bench} +``` +Run on (12 X 3501 MHz CPUs); 2016-09-27T16:04:55.065995192-04:00 +CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB +Benchmark Base (ns) New (ns) Improvement +------------------------------------------------------------------ +BM_OneOutput/3/2 466867 351818 +24.6% +BM_OneOutput/4/2 563130 474756 +15.7% +BM_OneOutput/5/3 815393 688820 +15.5% +BM_OneOutput/6/3 897246 780539 +13.0% +BM_OneOutput/8/4 1270489 1137149 +10.5% +BM_AllOutputs/3/2 848772 642942 +24.3% +BM_AllOutputs/4/2 1067647 638139 +40.2% +BM_AllOutputs/5/3 1739135 1151369 +33.8% +BM_AllOutputs/6/3 2045817 1456744 +28.8% +BM_AllOutputs/8/4 3012958 2484937 +17.5% +BM_AllOutputsSetUpOnce/3/2 717310 493371 +31.2% +BM_AllOutputsSetUpOnce/4/2 833866 600060 +28.0% +BM_AllOutputsSetUpOnce/5/3 1537870 1137357 +26.0% +BM_AllOutputsSetUpOnce/6/3 1802353 1398600 +22.4% +BM_AllOutputsSetUpOnce/8/4 3166930 2455973 +22.4% +``` + +
    + +
    + +Decode four integers at a time (circa 2004). + + +Introduced a +[GroupVarInt format](https://static.googleusercontent.com/media/research.google.com/en//people/jeff/WSDM09-keynote.pdf) +that encodes/decodes groups of 4 variable-length integers at a time in 5-17 +bytes, rather than one integer at a time. Decoding one group of 4 integers in +the new format takes ~1/3rd the time of decoding 4 individually varint-encoded +integers. + +groupvarint.cc + +{: .new} +```c++ +const char* DecodeGroupVar(const char* p, int N, uint32* dest) { + assert(groupvar_initialized); + assert(N % 4 == 0); + while (N) { + uint8 tag = *p; + p++; + + uint8* lenptr = &groupvar_table[tag].length[0]; + +#define GET_NEXT \ + do { \ + uint8 len = *lenptr; \ + *dest = UNALIGNED_LOAD32(p) & groupvar_mask[len]; \ + dest++; \ + p += len; \ + lenptr++; \ + } while (0) + GET_NEXT; + GET_NEXT; + GET_NEXT; + GET_NEXT; +#undef GET_NEXT + + N -= 4; + } + return p; +} +``` + +
    + +
    + +Encode groups of 4 k-bit numbers at a time. + + +Added KBitStreamEncoder and KBitStreamDecoder classes to encode/decode 4 k-bit +numbers at a time into a bit stream. Since K is known at compile time, the +encoding and decoding can be quite efficient. E.g., since four numbers are +encoded at a time, the code can assume that the stream is always byte-aligned +
    + +## CLs that demonstrate multiple techniques {#cls-that-demonstrate-multiple-techniques} + +Sometimes a single CL contains a number of performance-improving changes that +use many of the preceding techniques. Looking at the kinds of changes in these +CLs is sometimes a good way to get in the mindset of making general changes to +speed up the performance of some part of a system after that has been identified +as a bottleneck. + +
    + +Speed up GPU memory allocator by ~40%. + + +36-48% speedup in allocation/deallocation speed for GPUBFCAllocator: + +1. Identify chunks by a handle number, rather than by a pointer to a Chunk. + Chunk data structures are now allocated in a `vector`, and a handle + is an index into this vector to refer to a particular chunk. This allows the + next and prev pointers in Chunk to be ChunkHandle (4 bytes), rather than + `Chunk*` (8 bytes). + +2. When a Chunk object is no longer in use, we maintain a free list of Chunk + objects, whose head is designated by ChunkHandle `free_chunks_list_`, and + with the `Chunk->next` pointing to the next free list entry. Together with + (1), this allows us to avoid heap allocation/deallocation of Chunk objects + in the allocator, except (rarely) when the `vector` grows. It also + makes all the memory for Chunk objects contiguous. + +3. Rather than having the bins_ data structure be a std::set and using + lower_bound to locate the appropriate bin given a byte_size, we instead have + an array of bins, indexed by a function that is log₂(byte_size/256). This + allows the bin to be located with a few bit operations, rather than a binary + search tree lookup. It also allows us to allocate the storage for all the + Bin data structures in a contiguous array, rather than in many different + cache lines. This reduces the number of cache lines that must be moved + around between cores when multiple threads are doing allocations. + +4. Added fast path to GPUBFCAllocator::AllocateRaw that first tries to allocate + memory without involving the retry_helper_. If an initial attempt fails + (returns nullptr), then we go through the retry_helper_, but normally we can + avoid several levels of procedure calls as well as the + allocation/deallocation of a std::function with several arguments. + +5. Commented out most of the VLOG calls. These can be reenabled selectively + when needed for debugging purposes by uncommenting and recompiling. + +Added multi-threaded benchmark to test allocation under contention. + +Speeds up ptb_word_lm on my desktop machine with a Titan X card from 8036 words +per second to 8272 words per second (+2.9%). + +{: .bench} +``` +Run on (40 X 2801 MHz CPUs); 2016/02/16-15:12:49 +CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB +Benchmark Base (ns) New (ns) Improvement +------------------------------------------------------------------ +BM_Allocation 347 184 +47.0% +BM_AllocationThreaded/1 351 181 +48.4% +BM_AllocationThreaded/4 2470 1975 +20.0% +BM_AllocationThreaded/16 11846 9507 +19.7% +BM_AllocationDelayed/1 392 199 +49.2% +BM_AllocationDelayed/10 285 169 +40.7% +BM_AllocationDelayed/100 245 149 +39.2% +BM_AllocationDelayed/1000 238 151 +36.6% +``` + +
    + +
    + +Speed up Pathways throughput by ~20% via a set of +miscellaneous changes. + + +* Unified a bunch of special fast descriptor parsing functions into a single + ParsedDescriptor class and use this class in more places to avoid expensive + full parse calls. + +* Change several protocol buffer fields from string to bytes (avoids + unnecessary utf-8 checks and associated error handling code). + +* DescriptorProto.inlined_contents is now a string, not a Cord (it is expected + to be used only for small-ish tensors). This necessitated the addition of a + bunch of copying helpers in tensor_util.cc (need to now support both strings + and Cords). + +* Use flat_hash_map instead of std::unordered_map in a few places. + +* Added MemoryManager::LookupMany for use by Stack op instead of calling + Lookup per batch element. This change reduces setup overhead like locking. + +* Removed some unnecessary string creation in TransferDispatchOp. + +* Performance results for transferring a batch of 1000 1KB tensors from one + component to another in the same process: + +{: .bench} +``` +Before: 227.01 steps/sec +After: 272.52 steps/sec (+20% throughput) +``` + +
    + +
    + +~15% XLA compiler performance improvement through a +series of changes. + + +Some changes to speed up XLA compilation: + +1. In SortComputationsByContent, return false if a == b in comparison function, + to avoid serializing and fingerprinting long computation strings. + +2. Turn CHECK into DCHECK to avoid touching an extra cache line in + HloComputation::ComputeInstructionPostOrder + +3. Avoid making an expensive copy of the front instruction in + CoreSequencer::IsVectorSyncHoldSatisfied(). + +4. Rework 2-argument HloComputation::ToString and HloComputation::ToCord + routines to do the bulk of the work in terms of appending to std::string, + rather than appending to a Cord. + +5. Change PerformanceCounterSet::Increment to just do a single hash table + lookup rather than two. + +6. Streamline Scoreboard::Update code + +Overall speedup of 14% in XLA compilation time for one important +model. + +
    + +
+

Speed up low-level logging in Google Meet application
code.


Speed up ScopedLogId, which is on the critical path for each packet.

* Removed the `LOG_EVERY_N(ERROR, ...)` messages that seemed to be there only
  to see if invariants were violated.
* Inlined the PushLogId() and PopLogId() routines (since without the
  `LOG_EVERY_N_SECONDS(ERROR, ...)` statements, they are now small enough to
  inline).
* Switched to using a fixed array of size 4 and an 'int size' variable instead
  of an `InlinedVector<...>` for maintaining the thread local state. Since we
  never were growing beyond size 4 anyway, the InlinedVector's functionality
  was more general than needed.

{: .bench}
```
Base: Baseline plus the code in scoped_logid_test.cc to add the benchmark
New: This changelist

CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark                    Base (ns)  New (ns)  Improvement
----------------------------------------------------------------------------
BM_ScopedLogId/threads:1             8         4       +52.6%
BM_ScopedLogId/threads:2             8         4       +51.9%
BM_ScopedLogId/threads:4             8         4       +52.9%
BM_ScopedLogId/threads:8             8         4       +52.1%
BM_ScopedLogId/threads:16           11         6       +44.0%
```
    + +
    + +Reduce XLA compilation time by ~31% by improving Shape +handling. + + +Several changes to improve XLA compiler performance: + +1. Improved performance of ShapeUtil::ForEachIndex... iteration in a few ways: + + * In ShapeUtil::ForEachState, save just pointers to the arrays represented + by the spans, rather than the full span objects. + + * Pre-form a ShapeUtil::ForEachState::indexes_span pointing at the + ShapeUtil::ForEachState::indexes vector, rather than constructing this + span from the vector on every loop iteration. + + * Save a ShapeUtil::ForEachState::indexes_ptr pointer to the backing store + of the ShapeUtil::ForEachState::indexes vector, allowing simple array + operations in ShapeUtil::ForEachState::IncrementDim(), rather than more + expensive vector::operator[] operations. + + * Save a ShapeUtil::ForEachState::minor_to_major array pointer initialized + in the constructor by calling shape.layout().minor_to_major().data() + rather than calling LayoutUtil::Minor(...) for each dimension for each + iteration. + + * Inlined the ShapeUtil::ForEachState constructor and the + ShapeUtil::ForEachState::IncrementDim() routines + +2. Improved the performance of ShapeUtil::ForEachIndex iteration for call sites + that don't need the functionality of returning a Status in the passed in + function. Did this by introducing ShapeUtil::ForEachIndexNoStatus variants, + which accept a ForEachVisitorFunctionNoStatus (which returns a plain bool). + This is faster than the ShapeUtil::ForEachIndex routines, which accept a + ForEachVisitorFunction (which returns a `StatusOr`, which requires an + expensive `StatusOr` destructor call per element that we iterate + over). + + * Used this variant of ShapeUtil::ForEachIndexNoStatus in + LiteralBase::Broadcast and GenerateReduceOutputElement. + +3. Improved performance of LiteralBase::Broadcast in several ways: + + * Introduced templated BroadcastHelper routine in literal.cc that is + specialized for different primitive byte sizes (without this, + primitive_size was a runtime variable and so the compiler couldn't do a + very good job of optimizing the memcpy that occurred per element, and + would invoke the general memcpy path that assumes the byte count is + fairly large, even though in our case it is a tiny power of 2 (typically + 1, 2, 4, or 8)). + + * Avoided all but one of ~(5 + num_dimensions + num_result_elements) + virtual calls per Broadcast call by making a single call to 'shape()' at + the beginning of the LiteralBase::Broadcast routine. The innocuous + looking 'shape()' calls that were sprinkled throughout end up boiling + down to "root_piece().subshape()", where subshape() is a virtual + function. + + * In the BroadcastHelper routine, Special-cased the source dimensions + being one and avoided a call to + IndexUtil::MultiDimensionalIndexToLinearIndex for this case. + + * In BroadcastHelper, used a scratch_source_array pointer variable that + points into the backing store of the scratch_source_index vector, and + used that directly to avoid vector::operator[] operations inside the + per-element code. Also pre-computed a scratch_source_span that points to + the scratch_source_index vector outside the per-element loop in + BroadcastHelper, to avoid constructing a span from the vector on each + element. + + * Introduced new three-argument variant of + IndexUtil::MultiDimensionalIndexToLinearIndex where the caller passes in + the minor_to_major span associated with the shape argument. 
Used this in + BroadcastHelper to compute this for the src and dst shapes once per + Broadcast, rather than once per element copied. + +4. In ShardingPropagation::GetShardingFromUser, for the HloOpcode::kTuple case, + only call user.sharding().GetSubSharding(...) if we have found the operand + to be of interest. Avoiding calling it eagerly reduces CPU time in this + routine for one lengthy compilation from 43.7s to 2.0s. + +5. Added benchmarks for ShapeUtil::ForEachIndex and Literal::Broadcast and for + the new ShapeUtil::ForEachIndexNoStatus. + +{: .bench} +``` +Base is with the benchmark additions of +BM_ForEachIndex and BM_BroadcastVectorToMatrix (and BUILD file change to add +benchmark dependency), but no other changes. + +New is this cl + +Run on (72 X 1357.56 MHz CPU s) CPU Caches: L1 Data 32 KiB (x36) +L1 Instruction 32 KiB (x36) L2 Unified 1024 KiB (x36) L3 Unified 25344 KiB (x2) + +Benchmark Base (ns) New (ns) Improvement +---------------------------------------------------------------------------- +BM_MakeShape 18.40 18.90 -2.7% +BM_MakeValidatedShape 35.80 35.60 +0.6% +BM_ForEachIndex/0 57.80 55.80 +3.5% +BM_ForEachIndex/1 90.90 85.50 +5.9% +BM_ForEachIndex/2 1973606 1642197 +16.8% +``` + +The newly added ForEachIndexNoStatus is considerably faster than the +ForEachIndex variant (it only exists in this new cl, but the benchmark work that +is done by BM_ForEachIndexNoStatus/NUM is comparable to the BM_ForEachIndex/NUM +results above). + +{: .bench} +``` +Benchmark Base (ns) New (ns) Improvement +---------------------------------------------------------------------------- +BM_ForEachIndexNoStatus/0 0 46.90 ---- +BM_ForEachIndexNoStatus/1 0 65.60 ---- +BM_ForEachIndexNoStatus/2 0 1001277 ---- +``` + +Broadcast performance improves by ~58%. + +{: .bench} +``` +Benchmark Base (ns) New (ns) Improvement +---------------------------------------------------------------------------- +BM_BroadcastVectorToMatrix/16/16 5556 2374 +57.3% +BM_BroadcastVectorToMatrix/16/1024 319510 131075 +59.0% +BM_BroadcastVectorToMatrix/1024/1024 20216949 8408188 +58.4% +``` + +Macro results from doing ahead-of-time compilation of a large language model +(program does more than just the XLA compilation, but spends a bit less than +half its time in XLA-related code): + +Baseline program overall: 573 seconds With this cl program overall: 465 seconds +(+19% improvement) + +Time spent in compiling the two largest XLA programs in running this program: + +Baseline: 141s + 143s = 284s With this CL: 99s + 95s = 194s (+31% improvement) +
    + +
    + +Reduce compilation time for large programs by ~22% in +Plaque (a distributed execution framework). + + +Small tweaks to speed up compilation by ~22%. + +1. Speed up detection of whether or not two nodes share a common source. + Previously, we would get the sources for each node in sorted order and then + do a sorted intersection. We now place the sources for one node in a + hash-table and then iterate over the other node's sources checking the + hash-table. +2. Reuse the same scratch hash-table in step 1. +3. When generating compiled proto, keep a single btree keyed by `pair` instead of a btree of btrees. +4. Store pointer to opdef in the preceding btree instead of copying the opdef + into the btree. + +Measurement of speed on large programs (~45K ops): + +{: .bench} +``` +name old time/op new time/op delta +BM_CompileLarge 28.5s ± 2% 22.4s ± 2% -21.61% (p=0.008 n=5+5) +``` + +
    + +
    + +MapReduce improvements (~2X speedup for wordcount +benchmark). + + +Mapreduce speedups: + +1. The combiner data structures for the SafeCombinerMapOutput class have been + changed. Rather than using a `hash_multimap`, + which had a hash table entry for each unique key/value inserted in the + table, we instead use a `hash_map` (where + ValuePtr is a linked list of values and repetition counts). This helps in + three ways: + + * It significantly reduces memory usage, since we only use + "sizeof(ValuePtr) + value_len" bytes for each value, rather than + "sizeof(SafeCombinerKey) + sizeof(StringPiece) + value_len + new hash + table entry overhead" for each value. This means that we flush the + reducer buffer less often. + + * It's significantly faster, since we avoid extra hash table entries when + we're inserting a new value for a key that already exists in the table + (and instead we just hook the value into the linked list of values for + that key). + + * Since we associate a repetition count with each value in the linked + list, we can represent this sequence: + + ```c++ + Output(key, "1"); + Output(key, "1"); + Output(key, "1"); + Output(key, "1"); + Output(key, "1"); + ``` + + as a single entry in the linked list for "key" with a repetition count of 5. + Internally we yield "1" five times to the user-level combining function. (A + similar trick could be applied on the reduce side, perhaps). + +2. (Minor) Added a test for "nshards == 1" to the default + MapReductionBase::KeyFingerprintSharding function that avoids fingerprinting + the key entirely if we are just using 1 reduce shard (since we can just + return 0 directly in that case without examining the key). + +3. Turned some VLOG(3) statements into DVLOG(3) in the code path that is called + for each key/value added to the combiner. + +Reduces time for one wordcount benchmark from 12.56s to 6.55s. + +
    + +
+

Reworked the alarm handling code in the SelectServer to
significantly improve its performance (adding+removing an alarm from 771 ns to
281 ns).


Reworked the alarm handling code in the SelectServer to significantly improve
its performance.

Changes:

1. Switched to using `AdjustablePriorityQueue` instead of a `set` for the
   `AlarmQueue`. This significantly speeds up alarm handling, reducing the time
   taken to add and remove an alarm from 771 nanoseconds to 281 nanoseconds.
   This change avoids an allocation/deallocation per alarm setup (for the
   red-black tree node in the STL set object), and also gives much better cache
   locality (since the AdjustablePriorityQueue is a heap implemented in a
   vector, rather than a red-black tree), so there are fewer cache lines touched
   when manipulating the `AlarmQueue` on every trip through the selectserver
   loop.

2. Converted AlarmList in Alarmer from a hash_map to a dense_hash_map to avoid
   another allocation/deallocation per alarm addition/deletion (this also
   improves cache locality when adding/removing alarms).

3. Removed the `num_alarms_stat_` and `num_closures_stat_`
   MinuteTenMinuteHourStat objects, and the corresponding exported variables.
   Although monitoring these seems nice, in practice they add significant
   overhead to critical networking code. If I had left these variables in as
   Atomic32 variables instead of MinuteTenMinuteHourStat, they would have still
   increased the cost of adding and removing alarms from 281 nanoseconds to 340
   nanoseconds.

Benchmark results

{: .bad-code}
```
Benchmark                 Time(ns)    CPU(ns) Iterations
-----------------------------------------------------------
BM_AddAlarm/1                  902        771     777777
```

With this change

{: .new}
```
Benchmark                 Time(ns)    CPU(ns) Iterations
-----------------------------------------------------------
BM_AddAlarm/1                  324        281    2239999
```
    + +
    + +3.3X performance in index serving speed! + + +We found a number of performance issues when planning a switch from on-disk to +in-memory index serving in 2001. This change fixed many of these problems and +took us from 150 to over 500 in-memory queries per second (for a 2 GB in-memory +index on dual processor Pentium III machine). + +* Lots of performance improvements to index block decoding speed (8.9 MB/s to + 13.1 MB/s for a micro-benchmark). +* We now checksum the block during decoding. This allows us to implement all + of our getsymbol operations to be done without any bounds checking. +* We have grungy macros that hold the various fields of a BitDecoder in local + variables over entire loops, and then store them back at the end of the + loops. +* We use inline assembly to get at the 'bsf' instruction on Intel chips for + getUnary (finds index of first 1 bit in a word) +* When decoding values into a vector, we resize the vector outside of the loop + and just walk a pointer along the vector, rather than doing a bounds-checked + access to store every value. +* During docid decoding, we keep the docids in local docid space, to avoid + multiplying by num_shards_. Only when we need the actual docid value do we + multiply by num_shards_ and add my_shard_. +* The IndexBlockDecoder now exports an interface 'AdvanceToDocid' that returns + the index of the first docid ≥ "d". This permits the scanning to be done + in terms of local docids, rather than forcing the conversion of each local + docid to a global docid when the client calls GetDocid(index) for every + index in the block. +* Decoding of position data for documents is now done on demand, rather than + being done eagerly for the entire block when the client asked for position + data for any document within the block. +* If the index block being decoded ends within 4 bytes of a page boundary, we + copy it to a local buffer. This allows us to always load our bit decoding + buffer via a 4-byte load, without having to worry about seg faults if we run + off the end of a mmapped page. +* We only initialize the first nterms_ elements of various scoring data + structures, rather than initializing all MAX_TERMS of them (in some cases, + we were unnecessarily memsetting 20K to 100K of data per document scored). +* Avoid round_to_int and subsequent computation on intermediate scoring values + when the value being computed is 0 (the subsequent computation was just + writing '0' over the 0 that we had memset in these cases, and this was the + most common case). +* Made a bounds check on scoring data structures into a debug-mode assertion. + +
+

## Further Reading

In no particular order, a list of performance-related books and articles that
the authors have found helpful:

* [Optimizing software in C++](https://www.agner.org/optimize/optimizing_cpp.pdf)
  by Agner Fog. Describes many useful low-level techniques for improving
  performance.
* [Understanding Software Dynamics](https://www.oreilly.com/library/view/understanding-software-dynamics/9780137589692/)
  by Richard L. Sites. Covers expert methods and advanced tools for diagnosing
  and fixing performance problems.
* [Performance tips of the week](https://abseil.io/fast/) - a collection of
  useful tips.
* [Performance Matters](https://travisdowns.github.io/) - a collection of
  articles about performance.
* [Daniel Lemire's blog](https://lemire.me/blog/) - high performance
  implementations of interesting algorithms.
* [Building Software Systems at Google and Lessons Learned](https://www.youtube.com/watch?v=modXC5IWTJI) -
  a video that describes system performance issues encountered at Google over
  a decade.
* [Programming Pearls](https://books.google.com/books/about/Programming_Pearls.html?id=kse_7qbWbjsC)
  and
  [More Programming Pearls: Confessions of a Coder](https://books.google.com/books/about/More_Programming_Pearls.html?id=a2AZAQAAIAAJ)
  by Jon Bentley. Essays on starting with algorithms and ending up with simple
  and efficient implementations.
* [Hacker's Delight](https://en.wikipedia.org/wiki/Hacker%27s_Delight) by
  Henry S. Warren. Bit-level and arithmetic algorithms for solving some common
  problems.
* [Computer Architecture: A Quantitative Approach](https://books.google.com/books/about/Computer_Architecture.html?id=cM8mDwAAQBAJ)
  by John L. Hennessy and David A. Patterson - Covers many aspects of computer
  architecture, including ones that performance-minded software developers
  should be aware of, like caches, branch predictors, TLBs, etc.

## Suggested Citation

If you want to cite this document, we suggest:

```
Jeffrey Dean & Sanjay Ghemawat, Performance Hints, 2024, https://google.github.io/performance-hints
```

Or in BibTeX:

```bibtex
@misc{DeanGhemawatPerformance2024,
  author = {Dean, Jeffrey and Ghemawat, Sanjay},
  title = {Performance Hints},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://google.github.io/performance-hints}},
}
```

## Acknowledgments

Many colleagues have provided helpful feedback on this document, including:

* Adrian Ulrich
* Alexander Kuzmin
* Alexei Bendebury
* Alexey Alexandrov
* Amer Diwan
* Austin Sims
* Benoit Boissinot
* Brooks Moses
* Chris Kennelly
* Chris Ruemmler
* Danila Kutenin
* Darryl Gove
* David Majnemer
* Dmitry Vyukov
* Emanuel Taropa
* Felix Broberg
* Francis Birck Moreira
* Gideon Glass
* Henrik Stewenius
* Jeremy Dorfman
* John Dethridge
* Kurt Kluever
* Kyle Konrad
* Lucas Pereira
* Marc Eaddy
* Michael Marty
* Michael Whittaker
* Mircea Trofin
* Misha Brukman
* Nicolas Hillegeer
* Ranjit Mathew
* Rasmus Larsen
* Soheil Hassas Yeganeh
* Srdjan Petrovic
* Steinar H. 
Gunderson +* Stergios Stergiou +* Steven Timotius +* Sylvain Vignaud +* Thomas Etter +* Thomas Köppe +* Tim Chestnutt +* Todd Lipcon +* Vance Lankhaar +* Victor Costan +* Yao Zuo +* Zhou Fang +* Zuguang Yang + +[go benchmarks]: https://pkg.go.dev/testing#hdr-Benchmarks +[fast39]: https://abseil.io/fast/39 +[fast53]: https://abseil.io/fast/53 +[cpp benchmarks]: https://github.com/google/benchmark/blob/main/README.md +[jmh]: https://github.com/openjdk/jmh +[xprof]: https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras#debug_performance_bottlenecks +[profile sources]: https://gperftools.github.io/gperftools/heapprofile.html +[annotated source]: https://github.com/google/pprof/blob/main/doc/README.md#annotated-source-code +[disassembly]: https://github.com/google/pprof/blob/main/doc/README.md#annotated-source-code +[atomic danger]: https://abseil.io/docs/cpp/atomic_danger diff --git a/fast/index.md b/fast/index.md index 3ed5e9e..65104a7 100644 --- a/fast/index.md +++ b/fast/index.md @@ -5,7 +5,8 @@ sidenav: side-nav-fast.html type: markdown --- -This Performance Guide consists of a selection of our Performance Tips of the +This Performance Guide consists of a set of [Performance Hints](hints) +and a selection of our Performance Tips of the Week. The Performance Tips of the Week form a sort of "Effective analysis and optimization of production performance and resource usage": a gallery of "do"s and "don't"s gathered from the hard-learned lessons of optimizing