#ifndef HEADER_fd_src_flamenco_runtime_txncache_h
#define HEADER_fd_src_flamenco_runtime_txncache_h

#include "../fd_flamenco_base.h"
#include "../types/fd_types.h"
#include <math.h>

/* A txn cache is a concurrent map for saving the result (status) of
   transactions that have executed. In addition to supporting fast
   concurrent insertion and query of transaction results, the txn
   cache supports serialization of its state into a standard bincode
   format for serving the state to other nodes via snapshot responses,
   and then restoring snapshots produced by other nodes.

   The txn cache is designed to do two operations fast:

    (a) Insertion. Insertion is done by both the leader pipeline and
        the replay stage, with an insertion for every transaction that
        is executed, potentially over 1M per second.

    (b) Query. The leader pipeline queries the status of transactions
        before executing them to ensure that it does not execute
        anything twice.

   Both of these operations are concurrent and lockless, assuming there
   are no other (non-insert/query) operations occurring on the txn
   cache. Most other operations lock the entire structure and will
   prevent both insertion and query from proceeding.

   The txn cache is both CPU and memory sensitive. A transaction result
   is 1 byte, and the stored transaction hashes are 20 bytes, so
   without any overhead just storing 150 slots of transactions in a
   flat array would require

     524,288 * 150 * 21 ~ 1.5GiB

   of memory. But we can't query, insert, and remove quickly from a
   flat array. We make a few trade-offs to achieve good performance
   without bloating memory completely. In particular:

    - The transactions to be queried are stored in a structure that
      looks like a

        hash_map<blockhash, hash_map<txnhash, vec<(slot, status)>>>

      The top level hash_map is a probed hash map, and the txnhash map
      is a chained hash map, where the items come from a pool of pages
      of transactions. We use pages of transactions so that removal of
      a blockhash from the top level map is fast: we need to return at
      most 4,800 pages back to the pool rather than 78,643,200
      individual transactions.

      This adds additional memory overhead: a blockhash with only one
      transaction in it will still consume a full page (16,384
      transactions) of memory. Allocating a new transaction page to a
      chained map is rare (once every 16,384 inserts) so the cost
      amortizes to zero. Creating a blockhash happens once per
      blockhash, so it also amortizes to zero. The only operation we
      care about is then the simple insert case, with a not-yet-full
      transaction page, into an existing blockhash. This can be done
      with two compare-and-swaps,

        // 1. Find the blockhash in the probed hash map.

        let by_blockhash = hash_map[ blockhash ];

        // 2. Request space for the transaction from the current page

        let page = by_blockhash.pages.back();
        let idx = page.used.compare_and_swap( current, current+1 );

        // 3. Write the transaction into the page and the map

        page.txns[ idx ] = txn;
        page.txns[ idx ].next = by_blockhash.txns[ txnhash ].idx;
        by_blockhash.txns[ txnhash ].head.compare_and_swap( current, idx );

      Removal of a blockhash from this structure is simple because it
      does not need to be concurrent (the caller will only remove
      between executing slots, so there's no contention and it can take
      a full write lock). We take a write lock, restore the pages in
      the blockhash to the pool, and then mark the space in the
      hash_map as empty. This is fast since there are at most 4,800
      pages to restore and restoration is a simple memcpy. A sketch of
      this removal path is given below.
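
      As a hedged sketch only (the type, field, and helper names here
      are illustrative and not the actual internals of the txn cache),
      the removal path looks roughly like:

        // Caller guarantees no concurrent insert/query is in flight,
        // so taking a plain write lock here is uncontended.
        rwlock_write_lock( &tc->lock );

        blockhash_entry_t * entry = &tc->blockhash_map[ map_idx ];

        // Return every page owned by this blockhash to the shared
        // pool; this is bounded by at most 4,800 pages.
        for( ulong i=0UL; i<entry->pages_cnt; i++ )
          page_pool_release( tc->page_pool, entry->pages[ i ] );
        entry->pages_cnt = 0UL;

        // Mark the probed hash map slot as empty so the space can be
        // reused by a future blockhash.
        entry->occupied = 0;

        rwlock_write_unlock( &tc->lock );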

    - Another structure is required to support serialization of
      snapshots from the cache. Serialization must produce a binary
      structure that encodes, essentially,

        hash_map<slot, hash_map<blockhash, vec<(txnhash, txn_result)>>>

      for each slot that is rooted. Observe that we can't reuse the
      structure used for queries: it's not grouped by slot, so we would
      have to iterate all of the transactions to do this grouping,
      which is not feasible.

      We can't add the slot to that index, since then query would not
      be fast. Queries are just against (blockhash, txnhash) pairs;
      they don't know which slot might have included it.

      So we need a second index for this information. The index lines
      up with what we described above: the root hash map will be
      probed, and the nested hash map will be chained. The vector of
      results will also use a pool of pages.

      In this case the page pool is for a different reason: we know a
      sensible upper bound for the number of transactions that could be
      alive in the entire cache at any point in time, but we don't know
      quite how to bound it for a single slot or blockhash. Well, we
      do: the bound is 524,288 per (slot, blockhash) pair, but we would
      need to allocate all that memory up front. Instead, we bound the
      total number of slots and transactions per slot, assume the
      transactions must be distributed somehow within the blockhashes,
      and then give some overhead for unoccupied page entries.

      The insertion process for this structure is similar to the one
      above, except the vector is not a chained map but just a regular
      vector.

        // 1. Find the (slot, blockhash) in the probed hash map.

        let by_slot_blockhash = hash_map[ slot ][ blockhash ];

        // 2. Request space for the transaction from the current page

        let page = by_slot_blockhash.pages.back();
        let idx = page.used.fetch_add( 1 );

        // 3. Write the transaction into the page and the map

        page.txns[ idx ] = txn;
        by_slot_blockhash.txn_count += 1;

      The final increment is OK even with concurrent inserts because no
      reader will try to observe the txn_count until the slot is
      rooted, at which point nothing more can be inserted into it and
      the txn_count will be finalized. */
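
/* To make the lockless insert hot path above concrete, here is a
   minimal sketch of how the two compare-and-swap steps could map onto
   C11 atomics. The types and field names are illustrative assumptions
   only (they are not the actual internals), and the rare page-full and
   new-blockhash paths are omitted.

     #include <stdatomic.h>

     typedef struct { uchar txn[ 21 ]; uint next; } txn_ent_t;
     typedef struct { atomic_uint used; txn_ent_t txns[ 16384 ]; } txn_page_t;
     typedef struct { atomic_uint head; } chain_bucket_t;

     // page, bucket and new_ent are found as in the pseudocode above.

     // Step 2: claim a slot in the blockhash's current page.
     uint idx = atomic_load( &page->used );
     while( !atomic_compare_exchange_weak( &page->used, &idx, idx+1u ) ) {
       // On failure idx is refreshed with the current value; retry.
     }

     // Step 3: write the entry, then publish it at the head of the
     // txnhash chain with the second compare-and-swap.
     page->txns[ idx ] = new_ent;
     uint head = atomic_load( &bucket->head );
     do {
       page->txns[ idx ].next = head;
     } while( !atomic_compare_exchange_weak( &bucket->head, &head, idx ) );
*/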

#include "../fd_flamenco_base.h"

#define FD_TXNCACHE_ALIGN (128UL)

#define FD_TXNCACHE_MAGIC (0xF17EDA2CE5CAC4E0) /* FIREDANCE SCACHE V0 */

/* The duration of history to keep around in the txn cache before aging
   it out. This must be at least 150, otherwise we could forget about
   transactions for blockhashes that are still valid to use, and let
   duplicate transactions make it into a produced block.

   Beyond 150, any value is valid. The value used here, 300, comes from
   Agave, and corresponds to roughly 2 minutes of history assuming
   there are no forks, with an extra 12.8 seconds of history for slots
   that are in progress but have not rooted yet. On a production
   validator without RPC support, the value should probably be
   configurable and always set to strictly 150. */

#define FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS (300UL)

/* A note on live slots ...

   The maximum number of live slots is the sum of the rooted and
   unrooted slots. The rooted slots are explicitly capped at 300 for
   this structure (implying we keep around 2 minutes of history for
   queries and snapshots).

   For the unrooted slots, we must root at least one slot in an epoch
   for an epoch transition to occur successfully to the next one, so
   assuming every slot is unconfirmed for some reason, and the prior
   epoch was rooted at the first slot in the epoch, and the next epoch
   is rooted at the last slot, there could be

     432,000 + 432,000 - 31 = 863,969

   live slots on the validator. This is clearly impractical, as each
   bank consumes a lot of memory to store slot state, so the validator
   would crash long before this.

   For now we just pick a number: 2048, based on running on mainnet.
   We take into account that apart from the 450 recent blockhashes,
   there are also nonce accounts used as blockhashes by transactions,
   and they contribute significantly to entries being filled up.

   TODO: This number might be unbounded; more testing required with
   mainnet before deciding on an actual number. */

#define FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS (2048UL)

/* The Solana consensus protocol has an implied restriction on the
   number of transactions in a slot. A slot might have 50,000,000+ CUs,
   but a transaction requires at least around 1500 CUs, so there could
   be at most 33,333 transactions in a slot.

   For Firedancer, we respect this limit when running in production,
   but for development and performance tuning this limit is removed,
   and instead we will produce and accept at most 524,288 transactions
   per slot. This is chosen arbitrarily and works out to around 1.3M
   TPS; however, such a value does not exist in the consensus
   protocol. */

#define FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT (524288UL)
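
/* For reference, the arithmetic behind the two bounds above, assuming
   400 ms slots (2.5 slots per second):

     50,000,000 CU / 1500 CU  ~  33,333 transactions per slot
     524,288 txns * 2.5 /s    ~  1,310,720 TPS                       */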

struct fd_txncache_insert {
  uchar const * blockhash;
  uchar const * txnhash;
  ulong         slot;
  uchar const * result;
};

typedef struct fd_txncache_insert fd_txncache_insert_t;

struct fd_txncache_query {
  uchar const * blockhash;
  uchar const * txnhash;
};

typedef struct fd_txncache_query fd_txncache_query_t;

struct fd_txncache_snapshot_entry {
  ulong slot;
  uchar blockhash[ 32 ];
  uchar txnhash[ 20 ];
  ulong txn_idx;
  uchar result;
};

typedef struct fd_txncache_snapshot_entry fd_txncache_snapshot_entry_t;

/* Forward declare opaque handle */
struct fd_txncache_private;
typedef struct fd_txncache_private fd_txncache_t;

FD_PROTOTYPES_BEGIN

/* fd_txncache_{align,footprint} give the needed alignment and
   footprint of a memory region suitable to hold a txn cache.
   fd_txncache_align returns FD_TXNCACHE_ALIGN; fd_txncache_footprint
   returns the footprint required for the given limits.

   fd_txncache_new formats a memory region with suitable alignment and
   footprint for holding a txn cache. Assumes shmem points in the
   caller's address space to the first byte of the memory region the
   caller wishes to use. Returns shmem on success and NULL on failure
   (logs details). The memory region will be owned by the state on
   successful return. The caller is not joined on return.

   fd_txncache_join joins the caller to a txn cache. Assumes shtc
   points to the first byte of the memory region holding the state.
   Returns a local handle to the join on success (this is not
   necessarily a simple cast of the address) and NULL on failure (logs
   details).

   fd_txncache_leave leaves the caller's current local join to a txn
   cache. Returns a pointer to the memory region holding the state on
   success (this is not necessarily a simple cast of the address) and
   NULL on failure (logs details). The caller is not joined on
   successful return.

   fd_txncache_delete unformats a memory region that holds a txn cache.
   Assumes shtc points in the caller's address space to the first byte
   of the memory region holding the state and that nobody is joined.
   Returns a pointer to the memory region on success and NULL on
   failure (logs details). The caller has ownership of the memory
   region on successful return. */

FD_FN_CONST ulong
fd_txncache_align( void );

FD_FN_CONST ulong
fd_txncache_footprint( ulong max_rooted_slots,
                       ulong max_live_slots,
                       ulong max_txn_per_slot );

void *
fd_txncache_new( void * shmem,
                 ulong  max_rooted_slots,
                 ulong  max_live_slots,
                 ulong  max_txn_per_slot );

fd_txncache_t *
fd_txncache_join( void * shtc );

void *
fd_txncache_leave( fd_txncache_t * tc );

void *
fd_txncache_delete( void * shtc );
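
/* For reference, a minimal single-threaded lifecycle sketch using the
   functions above. The aligned_alloc backing store (and rounding the
   footprint up to a multiple of the alignment, as aligned_alloc
   requires) is illustrative only; in practice the region would
   normally come from a workspace sized with fd_txncache_footprint.

     ulong align     = fd_txncache_align();
     ulong footprint = fd_txncache_footprint( FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS,
                                              FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS,
                                              FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT );
     void * shmem    = aligned_alloc( align, (footprint+align-1UL) & ~(align-1UL) );

     fd_txncache_t * tc = fd_txncache_join(
         fd_txncache_new( shmem,
                          FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS,
                          FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS,
                          FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT ) );

     // ... insert, query, register roots, snapshot ...

     free( fd_txncache_delete( fd_txncache_leave( tc ) ) );
*/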

/* fd_txncache_register_root_slot registers a root slot in a txn cache.
   Only the provided limit of roots (typically 300) can exist in the
   txn cache at once, after which the oldest roots will be purged.

   Purging a root means it cannot be served in a snapshot response,
   although transactions in the root may or may not still be present.
   Transaction status is removed once all roots referencing the
   blockhash of the transaction are removed from the txn cache.

   This is neither cheap nor expensive: it will pause all insertion and
   query operations, but only momentarily until any old slots can be
   purged from the cache. */

void
fd_txncache_register_root_slot( fd_txncache_t * tc,
                                ulong           slot );

/* fd_txncache_root_slots returns the list of root slots currently
   tracked by the txn cache. There will be at most max_rooted_slots
   slots, which will be written into the provided out_slots. It is
   assumed tc points to a txn cache and out_slots has space for at
   least max_rooted_slots results. If there are fewer than
   max_rooted_slots slots in the cache, the front part of out_slots
   will be filled in, and all the remaining slots will be set to
   ULONG_MAX.

   This is a fast operation, but it will lock the whole structure and
   cause a temporary pause in insert and query operations. */

void
fd_txncache_root_slots( fd_txncache_t * tc,
                        ulong *         out_slots );
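
/* As an illustration of how rooting and reading roots back compose, a
   hedged sketch (sizing out_slots with the default limit and treating
   ULONG_MAX as the end-of-list sentinel, per the comment above):

     fd_txncache_register_root_slot( tc, slot );

     ulong out_slots[ FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS ];
     fd_txncache_root_slots( tc, out_slots );

     for( ulong i=0UL; i<FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS; i++ ) {
       if( out_slots[ i ]==ULONG_MAX ) break;  // remaining entries unused
       FD_LOG_NOTICE(( "rooted slot %lu", out_slots[ i ] ));
     }
*/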

/* fd_txncache_snapshot writes the current state of a txn cache into a
   binary format suitable for serving to other nodes via snapshot
   responses. The write function is called in a streaming fashion with
   the binary data, the size of the data, and the ctx pointer provided
   to this function. The write function should return 0 on success and
   -1 on failure; this function will propagate a write failure back to
   the caller immediately, so this function also returns 0 on success
   and -1 on failure.

   IMPORTANT! THIS ASSUMES THERE ARE NO CONCURRENT INSERTS OCCURRING ON
   THE TXN CACHE AT THE ROOT SLOTS DURING SNAPSHOTTING. OTHERWISE THE
   SNAPSHOT MIGHT NOT CONTAIN ALL OF THE DATA, ALTHOUGH IT WILL NOT
   CAUSE CORRUPTION. THIS IS ASSUMED OK BECAUSE YOU CANNOT MODIFY A
   ROOTED SLOT.

   This is a cheap operation and will not cause any pause in insertion
   or query operations. */

int
fd_txncache_snapshot( fd_txncache_t * tc,
                      void *          ctx,
                      int ( * write )( uchar const * data, ulong data_sz, void * ctx ) );
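
/* A sketch only: a write callback that streams the serialized state
   into a stdio FILE. The file destination and error handling here are
   illustrative, not how a production node would package a snapshot.

     #include <stdio.h>

     static int
     snapshot_write_file( uchar const * data, ulong data_sz, void * ctx ) {
       FILE * file = (FILE *)ctx;
       return fwrite( data, 1UL, data_sz, file )==data_sz ? 0 : -1;
     }

     FILE * file = fopen( "status_cache.bin", "wb" );  // hypothetical path
     if( FD_UNLIKELY( -1==fd_txncache_snapshot( tc, file, snapshot_write_file ) ) )
       FD_LOG_WARNING(( "fd_txncache_snapshot failed" ));
     fclose( file );
*/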

/* fd_txncache_insert_batch inserts a batch of transaction results into
   a txn cache. Assumes tc points to a txn cache, txns is a list of
   transaction results to be inserted, and txns_cnt is the count of the
   results list. The insert copies data from the results and does not
   have any lifetime interest in the results memory provided once this
   call returns.

   Returns 1 on success and 0 on failure. The only reason insertion can
   fail is because the txn cache is full, which should never happen in
   practice if the caller sizes the bounds correctly. This is mostly
   here to support testing and fuzzing.

   This is a cheap, high performance, concurrent operation and can
   occur at the same time as queries and arbitrary other insertions. */

int
fd_txncache_insert_batch( fd_txncache_t *              tc,
                          fd_txncache_insert_t const * txns,
                          ulong                        txns_cnt );
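
/* A hedged example of building a one-element batch. The blockhash,
   txnhash, result, and slot values are placeholders for data the
   caller would take from an executed transaction; the 20-byte txnhash
   is the transaction hash key slice described above.

     uchar blockhash[ 32 ] = { 0 };  // recent blockhash the txn referenced
     uchar txnhash  [ 20 ] = { 0 };  // 20-byte transaction hash key slice
     uchar result          = 0;      // execution status byte
     ulong slot            = 12345UL;

     fd_txncache_insert_t insert = {
       .blockhash = blockhash,
       .txnhash   = txnhash,
       .slot      = slot,
       .result    = &result,
     };

     if( FD_UNLIKELY( !fd_txncache_insert_batch( tc, &insert, 1UL ) ) )
       FD_LOG_ERR(( "txn cache is full" ));
*/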

/* fd_txncache_query_batch queries a batch of transactions to determine
   if they exist in the txn cache or not. The queries do not specify a
   slot, but must match both the blockhash and txnhash. In addition, if
   the query_func is not NULL, the query_func will be called with the
   slot of the txn and the query_func_ctx that was provided, and the
   transaction will be considered present only if the query_func also
   returns 1.

   Assumes tc points to a txn cache, queries is a list of queries to be
   executed, and queries_cnt is the count of the queries list.
   out_results must be at least as large as queries_cnt and will be
   filled with 0 or 1 if the transaction is not present or present
   respectively.

   This is a cheap, high performance, concurrent operation and can
   occur at the same time as insertions and arbitrary other queries. */

void
fd_txncache_query_batch( fd_txncache_t *             tc,
                         fd_txncache_query_t const * queries,
                         ulong                       queries_cnt,
                         void *                      query_func_ctx,
                         int ( * query_func )( ulong slot, void * ctx ),
                         int *                       out_results );
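
/* A sketch of a duplicate check as the leader pipeline might perform
   one. Restricting matches to rooted slots via query_func is just one
   possible policy and is an assumption of this example, as are the
   blockhash/txnhash variables.

     static int
     query_slot_is_rooted( ulong slot, void * ctx ) {
       return fd_txncache_is_rooted_slot( (fd_txncache_t *)ctx, slot );
     }

     fd_txncache_query_t query = { .blockhash = blockhash, .txnhash = txnhash };
     int present;
     fd_txncache_query_batch( tc, &query, 1UL, tc, query_slot_is_rooted, &present );
     if( present ) {
       // Duplicate: the transaction already has a recorded status, so
       // it must not be executed again.
     }
*/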

/* fd_txncache_set_txnhash_offset sets the correct offset value for the
   txn hash "key slice" in the blockcache and slotblockcache. This is
   used primarily for snapshot restore, since in Firedancer we always
   use an offset of 0. Returns an error if the cache entry isn't
   found. */

int
fd_txncache_set_txnhash_offset( fd_txncache_t * tc,
                                ulong           slot,
                                uchar           blockhash[ 32 ],
                                ulong           txnhash_offset );

/* fd_txncache_is_rooted_slot returns 1 if `slot` is rooted, 0
   otherwise. */

int
fd_txncache_is_rooted_slot( fd_txncache_t * tc,
                            ulong           slot );

/* fd_txncache_get_entries is responsible for converting the rooted
   state of the status cache back into fd_bank_slot_deltas_t, which is
   the decoded format used by Agave. This is a helper method used to
   generate Agave-compatible snapshots. */

int
fd_txncache_get_entries( fd_txncache_t *         tc,
                         fd_bank_slot_deltas_t * bank_slot_deltas,
                         fd_spad_t *             spad );

FD_PROTOTYPES_END

#endif /* HEADER_fd_src_flamenco_runtime_txncache_h */