Prelude

Some notes from the Envoy Internals Deep Dive talk by Matt Klein. The talk was given at KubeCon EU 2018. Watch on YouTube here.

Actual Notes :D

Architecture

  • Envoy is designed with an out of process architecture specifically to address the polyglot nature of microservices

  • At its core, Envoy is a byte proxy. This gives it extensibility to supoort a multitude of protocols.

  • Create with versatility from the start so that Envoy can act as service, middle, and edge proxy

  • Hot restart is key feature

  • One part of Envoy is built as a pipeline with filters as building blocks. Filters are extension points where users can add functionality.

  • The other part of the architecture is where the central management functionality lives. Things like the cluster/service managers or stats engine.

  • The prevailing school of thought for proxies before the 2000s was to have serve a single connection per thread. Envoy is takes a different approach where it serves multiple connections per thread via an event loop.

  • The “main” thread of Envoy handles the management functionality and worker threads handle the connections and filter pipeline. The key to keeping Envoy simple is the latter part of the previous statement. It means that worker threads do not have to communicate with each other and only have to acquire locks in a small set of cases.

  • Read-Copy-Update (RCU) is used to govern parallelization within Envoy. “Designed for read heavy, write infrequent workloads”. Helps reduce the need for locking by making the read path lock free.

  • Envoy keeps a vector pointers in thread local storage (TLS) that is mirrored across the main and worker threads. RCU is used to update slots in the vector and posts messages to worker threads to update information e.g. route tables.

Hot Restarts

  • “Full binary reload without dropping any connections”

  • Allow two processes of Envoy to run at the same time and utilize a shared memory region. Stats and locks are stored here.

  • The two proceses communicate through an RPC protocol over Unix domain sockets (UDS) and perform socket passing. After the second Envoy process spins up, it will communicate that it is good to go and the first process will begin to drain connections.

Stats

  • There’s a 2 level caching mechanism for stats. There is a TLS cache, a central cache on the main thread, and the actual entries either in shared memory or in process memory.

Useful references

Prelude

Some notes from the V8 internals for JavaScript developers? talk by Mathias Bynens. The talk was given at JsConf AU 2018. Watch on YouTube here. This talk focuses on arrays and JS engine optimizations around array operations.

Actual Notes :D

Element kinds

  • Arrays can be described by the types of elements they hold (3 types in V8) and whether or not they are packed.
    • The three types are SMI, DOUBLE, and ELEMENTS. SMI is a subset of DOUBLE and DOUBLE is a subset of ELEMENTS. SMI here stands for small integers.
    • An array and it’s element types are described together as the following regex (PACKED | HOLEY)(_SMI | _DOUBLE)?_ELEMENTS
    • For example, PACKED_SMI_ELEMENTS or HOLEY_ELEMENTS
    • To be clear, a “holey” array is also known as a sparse array
  • Packed arrays are more performant than holey arrays
    • This is because V8 not only has to perform bounds checks on your array index, it must also check the prototype chain for the index.
    • For example, suppose you are querying arr[8] and arr is holey with length 9 and arr[8] is undefined. V8 will check if the property 8 exists on arr with hasOwnProperty(arr, '8'). This check will return false so V8 must go the next level up the prototype chain and check hasOwnProperty(Array.protoype, '8'). This will return false as well. This check goes on until the prototype chain ends. This is a lot of work that the runtime has to do especially if you consider that JavaScript arrays can be subclassed.

Transitioning between element kinds

  • Delving deeper into how the different element kinds relate to each other, V8 will classify an empty array by the simplest classification i.e. SMI. If you push a floating point number onto a SMI array, the array will be reclassified to DOUBLE. The analogous re-classification will occur if you then push an object on to an array i.e. the array is re-classified as just holding ELEMENTS.

  • Array re-classifications are only done one way. Concretely, you can transition an array from SMI to DOUBLE. However, if you remove all floating point numbers from the array, V8 will not reclass that array back to SMI. The array will retain it’s classification as a DOUBLE.

  • The same relation holds true when making a packed array holey. The array will be reclassified as a holey array. However, if you fill in the holes, V8 won’t go back and classify the array as packed again. Mathias says that these transitions can be visualized as a lattice (see figure below).

Lattice screen capture from Mathias Bynens' talk
  • Try to use arrays over array-like objects. Array-like objects are integer indexed objects with a length property. Using array-like objects prevents V8 from fully optimizing your code even though you can use Array.protoype methods via call(). This includes using the arguements object in functions. Nowadays, you should use rest parameters instead.

Optimizing array initialization vs element kinds

  • You can initialize an array with the array constructor i.e. new Array(n). This is great if you know you will have a large array and want the JS engine to pre-allocate space. The downside is that the array will start out with n holes, which means that array operations won’t be as optimized as operations for a packed array.

  • Initializing an empty array and continually pushing onto it will help keep optimizations for array operations, but this comes at the cost of array re-allocations as the array grows. These re-allocations can grow slower and slower as the array grows.

  • Which optimization should you choose? There is no right answer and it’s just a game of trade-offs. You can optimize for array creation or you can optimize for array operations. It will depend on your use case.

Key takeaway

  • “Write modern, idiomatic JavaScript, and let the JavaScript engine worry about making it fast” - Mathias Bynens

Useful references

Prelude

Some notes from the What’s inside of TurboFan? by Benedikt Meurer talk at AgentConf 2018. Watch on YouTube here.

Actual Notes :D

Intro to TurboShaft

  • TurboFan is the optimizing compiler in the execution pipeline for the v8 JS engine.

  • Replaces Crankshaft–the old optimizing compiler–which couldn’t handle optimizations for new JS (ES2015+) features.

    • One interesting case of this is that Crankshaft couldn’t handle optimizing try-catch and ES2015 has a lot try-catch usage. For example, for..of loops have an implicit try..catch.
  • TurboShaft begins compilation from bytecode whereas Crankshaft compiles from JS source. This is an important difference because Ignition (the v8 interpreter) generates bytecode as part of its iterpretation process. This means that TurboShaft has a headstart and doesn’t have to re-parse the JS source as part of it’s compilation process.

  • TurboShaft takes the bytecode and generates a graph representation of it, which is then fed into the inlining and specialization phase. This phase will use runtime information from the interpreter to prune graph paths that are not used, thereby optimizing the code. This is the frontend of the compiler

  • The result is passed through an optimization phase where textbook optimizations (e.g. redundancy elimination, escape analysis) are made to the graph representation of the code.

  • The last phase (or backend) of the compiler performs machine-level optimizations and code generation. This phase is particular for the machine the code is running on (little redundant but worth repeating :D).

Inlining Specifics

  • Inline means to essentially take the body of a function and place it at the call site of said function.

  • Might not look like much on its own but, combined with other optimization passes (e.g. constant folding), it can have amazing results.

  • The compiler can also inline builtin JS functions (e.g. Array.prototype.every) and run further optimization passes on the library functions.

  • Builtin inlining is important because it helps developers write idiomatic JS.

Predictable Performance

  • Crankshaft had really crazy performance cliffs. This means that special cases that differed from an optimized code path would incur large and unpredictable penalties.

  • In Crankshaft, certain features (e.g. for..of, const usage, etc.), caused a function to be completely disabled for optimizations. TurboShaft removes this obstacle and allows functions containing these so called “optimization killers” to be optimized. It does not mean these features can be fully optimized, but at least the rest of the function is checked for optimization.

  • A caveat to keep in mind about optimizing JS code is that, in real world scenarios, there may be significantly more time spent doing layout and style calculations. Fully focusing on getting the most optimal JS compilation may not be the best way to spend time.

  • A key takeaway though is that when optimizations are made through TurboShaft there are significant performance increases.

Useful references

Prelude

Some notes from the Distributed Sagas: A Protocol for Coordinating Microservices talk given by Caitie McAffery at .NET Fringe 2017. Watch on YouTube here.

Actual Notes :D

Pretext

  • With the evolution of service architectures toward a microservices based architecture comes the need to maintain database integrity. One class of solutions for this is termed feral concurrency control by Peter Bailis in 2015 paper.
    • Side note: feral is defined as wild and uncultivated.
  • Feral concurrency control refers to “application-level mechanisms for mainting database integrity” (emphasis my own). This means that the database is no longer the source of truth that enforces consistency and concurrency guarantees. This code is now spread throughout the services that constitute the application.

  • This is an issue for code maintenance and puts more burden on application developers that should really be abstracted away from individual services.

  • Example given in talk is a trip planning application. Can book hotels, rental cars, and flights. There is also a payment system.

  • Example flow with feral concurrency control. Front-end service sends request to hotels service which has code that determines that the next step is the payments service. The hotel service also determines what happens when a payment fails.

  • The key here is that causal relationships–the chain of events that need to occur–are written into the services.

  • You can imagine a trips service being added that utilizes all of the other services–the hotel, car rental, flight, and payments services. There would be a ton of feral concurrency mechanisms built into the service in order to deal with failures in its dependencies.

  • As these systems evolve you can end up with a mesh of service relationships i.e. the death star architecture. These become hard to maintain because you don’t know what caual relationships are encoded into which services. Therefore, you don’t have observabiliity into the architecture.

Solved problem?

  • Frowns on Google’s Spanner project because it promises all of both consistency and availibility in the CAP theorem, but this is only true because they modify the meaning of availibility to 5 9’s availibility.

  • Strong consistency between all services may not be the boon it is made out to be because–as hypothesized by a Facebook paper–service latencies may increase. This latency increase is posited to be due to all services being bound to the slowest machine in a cluster.

  • So the strongest form of consistency may not be desirable due to the latency it adds back into the system and users may be more disturbed by such latency increases than minor consistency issues.

  • Two phase commit (2PC) is another way to keep atomicity across distributed transactions.

  • A prepare phase where services reply to a coordinator if they can service the request. A commit phase where the coordinator tells services to commit/abort. Finally, services will reply with done message.

  • 2PC isn’t used in industry because it doesn’t scale well. There may be O(N^2) messages in the worst case and the coordinator presents a single point of failure

Distributed sagas

  • Sagas were originally proposed in a 1987 paper for usage in a single database for long lived transactions

  • “A Saga is a Long Lived Transaction that can be written as a sequence of transactions that can be interleaved. All transactions in the sequence complete successfully or compensating transactions are ran to amend a partial execution.”

  • A distributed saga is defined in the talk to be a “collection of requests and compensating requests that represent a single business level action”
    • The requests can be booking a flight or and the compensating transaction can be canceling that flight
  • Invariants that we want in our requests:
    1. Can abort
    2. Idempotent
  • Invariants that we want for our compensating requests:
    1. Semantically undoes the effect of a request. Doesn’t need to rollback to the same exact state e.g. charge a credit and then issue a refund. May see two line items on a statement but the state is semantically the same.
    2. Cannot abort i.e. must always succeed
    3. Idempotent so we can re-run until success
    4. Commutative because timeouts may cause retries of the same request that we are trying to compensate for
  • Distributed saga guarantee: “All requests were completed successfully or a subset of requests and the corresponding compensating requests were executed”

  • No guarantee of either atomicity or isolation of requests

  • Can model as a DAG and inherits nice properties of DAGs like concurrent execution of certain constituent sagas

  • Need a distributed saga log in order to maintain fault tolerance and high availability

  • Also need a saga execution coordinator (SEC) to execute the saga DAG. The SEC differs from the coordinator in 2PC in that there can be multiple SEC’s.

  • To do rollback, flip the edges of the DAG. This keeps causal relationships intact e.g if there is a payment for a flight then there was a flight booked.

  • The SEC is a key component because all concurrency control is stored within it and not spread out the codebase i.e. not feral.

  • Makes service composition easy. You can imagine having a saga for car rentals like book rental car -> pay for rental and analogous sagas for booking a hotel and booking a flight. A big trips saga can then just pull the independent sagas into a DAG of its own.

Useful references

Prelude

This is gonna be a long one lol…There’s a lot to know about the event loop.

Actual Notes :D

  • setTimeout(ms, callback) doesn’t work how you would intuitively expect it to. The JavaScript main thread doesn’t just yield back to the setTimeout callback exactly after the specified ms. It must wait for a “turn” of the event loop in addition to the specified timeout. So really, you should think of the timeout you specify as the minimum time that setTimeout will sleep for.

  • After the timeout, the callback you specify is queued as a task to be executed by the main thread. It is common to hear that JavaScript is single-threaded, but this is not entirely accurate. Asynchronous tasks are still run in parallel on separate threads so that the main thread is not blocked from doing other things like rendering. The single-thread notion just comes from the concurrency model utilizing a main thread as a means to process results from the async threads.

  • The JS specification allows for browsers to implement multiple queues for different things. This allows browser implementers to have some leeway in how they implement the spec. In general, you can conceptualize that there will be a (1) main task queue, (2) a micro-task queue, and (3) an animation callback queue. Microtasks and animation callbacks are described below.

  • Take the following code block as an example:

    setTimeout(1000, cb1)
    setTimeout(1000, cb2)
    
    • The event loop will proceess the callbacks like so:
      1. Execute the statements synchronously in one “turn” of the event loop
      2. Callbacks are queued in the task queue after 1000 ms
      3. cb1 is executed in the current turn
      4. cb2 is executed in the subsequent turn
    • The key takeaway is that tasks are executed one at a time, i.e. one task per turn of the event loop

requestAnimationFrame

  • So the previous is a summary of the task queue portion of the event loop…a separate part is the rendering logic

  • When you want to run code with the render steps, you would use a requestAnimationFrame callback. Conceptually, the render steps are on the complete other side of the task queues. Visually…see below:
    Event loop screen capture from Jake Archibald's talk The Event Loop.
  • So far what we have is, on a turn of the event loop where a render is necessary:
    1. rAF callbacks are executed right at the start of the frame
    2. Layout and paint is performed
    3. A task from the task queue will be processed
  • where a render is necessary” is an important note because the browser doesn’t need to render on every turn of the event loop. Many turns can go by before a render is deemed necessary.

    • Renders are only necessary when there are actual style and layout changes. Render only needs to happen when the monitor refreshes–often, 60fps. There won’t be any rendering if the browser tab is not visible.
  • As of this writing (3-15-2018), Safari and Edge(?) put requestAnimationFrame after style and layout occur which violates the spec.

Compositor thread

  • Though not in spec, most browsers will offload the actual rendering of pixels to what is known as the compositor thread. Pixel data is computed during the paint step and then it is sent to the compositor thread as a texture to actually be written onto the screen.

  • “The compositor can only take already drawn bitmaps and move them around” - Jake Archibald

  • This also means that certain CSS properties (e.g. transform) can be entirely offloaded to the compositor thread and not be affected by what happens on event loop. (Stay tuned–notes on this coming right up).

Microtasks

  • Used for promise callbacks, but were originally designed to be used for MutationObservers.

  • Microtasks happen at the end of a task or when the JavaScript stack goes from non-empty to empty. This means that microtasks can run during a task or during rAF’s.

  • Because microtasks are run after the stack empties, you are guaranteed that your promise callback is not running while other JavaScript is running.

  • Taking a look at how the task queues empty:

    • The task queue is processed one per turn of the event loop and new tasks queued by the current task go to the end wait for the event loop.

    • During render, the animation task queue will be emptied except for any new callbacks that get scheduled.

    • When the JS stack empties, all microtasks in the queue will be processed as well as any additional microtasks that are added by pre-existing microtasks. This means that you can have an infinite loop of microtasks.

Useful references