#ifndef HEADER_fd_src_flamenco_runtime_txncache_h
#define HEADER_fd_src_flamenco_runtime_txncache_h

/* A txn cache is a concurrent map for saving the result (status) of
   transactions that have executed. In addition to supporting fast
   concurrent insertion and query of transaction results, the txn
   cache supports serialization of its state into a standard bincode
   format for serving the state to other nodes via snapshot responses,
   and then restoring snapshots produced by other nodes.

   The txn cache is designed to do two operations fast:

     (a) Insertion. Insertion is done by both the leader pipeline and
         the replay stage, with an insertion for every transaction that
         is executed, potentially over 1M per second.

     (b) Query. The leader pipeline queries the status of transactions
         before executing them to ensure that it does not execute
         anything twice.

   Both of these operations are concurrent and lockless, assuming there
   are no other (non-insert/query) operations occurring on the txn
   cache. Most other operations lock the entire structure and will
   prevent both insertion and query from proceeding.

   The txn cache is both CPU and memory sensitive. A transaction result
   is 1 byte, and the stored transaction hashes are 20 bytes, so
   without any overhead just storing 150 slots of transactions in a
   flat array would require

     524,288 * 150 * 21 ~ 1.5GiB

   of memory. But we can't query, insert, and remove quickly from a
   flat array. We make a few trade-offs to achieve good performance
   without bloating memory completely. In particular:

     - The transactions to be queried are stored in a structure that
       looks like a

         hash_map<blockhash, hash_map<txnhash, vec<(slot, status)>>>

       The top level hash_map is a probed hash map, and the txnhash map
       is a chained hash map, where the items come from a pool of pages
       of transactions. We use pages of transactions to support fast
       removal of a blockhash from the top level map: we need to return
       at most 4,800 pages (78,643,200 transactions split into pages of
       16,384) back to the pool rather than 78,643,200 individual
       transactions.

       This adds additional memory overhead: a blockhash with only one
       transaction in it will still consume a full page (16,384
       transactions) of memory. Allocating a new transaction page to a
       chained map is rare (once every 16,384 inserts) so the cost
       amortizes to zero. Creating a blockhash happens once per
       blockhash, so it also amortizes to zero. The only operation we
       care about is then the simple insert of a transaction into a
       non-full page of an existing blockhash. This can be done with
       two compare-and-swaps,

         // 1. Find the blockhash in the probed hash map.

         let by_blockhash = hash_map[ blockhash ];

         // 2. Request space for the transaction from the current page

         let page = by_blockhash.pages.back();
         let idx = page.used.compare_and_swap( current, current+1 );

         // 3. Write the transaction into the page and the map

         page.txns[ idx ] = txn;
         page.txns[ idx ].next = by_blockhash.txns[ txnhash ].idx;
         by_blockhash.txns[ txnhash ].head.compare_and_swap( current, idx );

       Removal of a blockhash from this structure is simple because it
       does not need to be concurrent (the caller will only remove
       between executing slots, so there's no contention and it can take
       a full write lock). We take a write lock, restore the pages in
       the blockhash to the pool, and then mark the space in the
       hash_map as empty. This is fast since there are at most 4,800
       pages to restore and restoration is a simple memcpy.

     - Another structure is required to support serialization of
       snapshots from the cache. Serialization must produce a binary
       structure that encodes essentially

         hash_map<slot, hash_map<blockhash, vec<(txnhash, txn_result)>>>

       for each slot that is rooted. Observe that we can't reuse the
       structure used for queries: since it's not grouped by slot, we
       would have to iterate all of the transactions to do this
       grouping, which is not feasible.

       We can't add the slot to that index, since then query would not
       be fast. Queries are just against (blockhash, txnhash) pairs;
       they don't know which slot might have included the transaction.

       So we need a second index for this information. The index is
       going to line up with what we described above: the root hash map
       will be probed, and the nested hash map will be chained. The
       vector of results will also use a pool of pages.

       In this case the page pool is for a different reason: we know a
       sensible upper bound for the number of transactions that could
       be alive in the entire cache at any point in time, but we don't
       know quite how to bound it for a single slot or blockhash. Well,
       we do: the bound is 524,288 per (slot, blockhash) pair, but we
       would need to allocate all that memory up front. Instead, we
       bound the total number of slots and transactions per slot,
       assume the transactions must be distributed somehow within the
       blockhashes, and then give some overhead for unoccupied page
       entries.

       The insertion process for this structure is similar to the one
       above, except the vector is not a chained map but just a regular
       vector.

         // 1. Find the (slot, blockhash) in the probed hash map.

         let by_slot_blockhash = hash_map[ slot ][ blockhash ];

         // 2. Request space for the transaction from the current page

         let page = by_slot_blockhash.pages.back();
         let idx = page.used.fetch_add( 1 );

         // 3. Write the transaction into the page and the map

         page.txns[ idx ] = txn;
         by_slot_blockhash.txn_count += 1;

       The final increment is OK even with concurrent inserts because
       no reader will try to observe the txn_count until the slot is
       rooted, at which point nothing is still being inserted into it
       and the txn_count will be finalized. */
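
/* As a concrete illustration of the hot insert path sketched above, a
   minimal C fragment of the two compare-and-swap technique is shown
   below. This is a hedged sketch, not the actual implementation: the
   type and field names (txn_page_t, chain_head, etc.) are hypothetical
   and it uses the GCC __sync builtins rather than whatever atomics the
   real code uses.

     #define TXN_PAGE_CNT (16384U)

     typedef struct {
       uint used;                      // number of slots claimed in this page
       struct {
         uchar txnhash[ 20 ];          // 20 byte key slice of the transaction hash
         uchar result;                 // 1 byte execution status
         uint  next;                   // next entry in the per-txnhash chain
       } txns[ TXN_PAGE_CNT ];
     } txn_page_t;

     // Try to insert one transaction into the current page of a
     // blockhash. Returns 1 on success, 0 if the caller should retry
     // (page filled up or we lost the race for the last slot).
     static int
     try_insert( txn_page_t * page,
                 uint *       chain_head,       // head of the chain for this txnhash
                 uchar const  txnhash[ 20 ],
                 uchar        result ) {
       // 1. Claim a slot in the page with a CAS on the used counter.
       uint idx = page->used;
       if( idx>=TXN_PAGE_CNT ) return 0;
       if( !__sync_bool_compare_and_swap( &page->used, idx, idx+1U ) ) return 0;

       // 2. Write the transaction into the claimed slot.
       for( uint i=0U; i<20U; i++ ) page->txns[ idx ].txnhash[ i ] = txnhash[ i ];
       page->txns[ idx ].result = result;

       // 3. Push the slot onto the per-txnhash chain: link the new
       //    entry to the old head, then CAS the head to the new entry.
       uint old_head;
       do {
         old_head = *chain_head;
         page->txns[ idx ].next = old_head;
       } while( !__sync_bool_compare_and_swap( chain_head, old_head, idx ) );
       return 1;
     }
*/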

#include "../fd_flamenco_base.h"

#define FD_TXNCACHE_ALIGN (128UL)

#define FD_TXNCACHE_MAGIC (0xF17EDA2CE5CAC4E0) /* FIREDANCE SCACHE V0 */

/* The duration of history to keep around in the txn cache before aging
   it out. This must be at least 150, otherwise we could forget about
   transactions for blockhashes that are still valid to use, and let
   duplicate transactions make it into a produced block.

   Beyond 150, any value is valid. The value used here, 300, comes from
   Agave, and corresponds to roughly 2 minutes of history assuming there
   are no forks, with an extra 12.8 seconds of history for slots that
   are in progress but have not rooted yet. On a production validator
   without RPC support, the value should probably be configurable and
   always set to strictly 150. */

#define FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS (300UL)

/* A note on live slots ...

   The maximum number of live slots is the sum of the rooted and
   unrooted slots. The rooted slots are explicitly capped at 300 for
   this structure (implying we keep around 2 minutes of history
   for queries and snapshots).

   For the unrooted slots, we must root at least one slot in an epoch
   for an epoch transition to occur successfully to the next one, so
   assuming every slot is unconfirmed for some reason, and the prior
   epoch was rooted at the first slot in the epoch, and the next epoch
   is rooted at the last slot, there could be

     432,000 + 432,000 - 31 = 863,969

   live slots on the validator. This is clearly impractical as each
   bank consumes a lot of memory to store slot state, so the
   validator would crash long before this.

   For now we just pick a number: 2048, based on running on mainnet.
   We take into account that apart from the 450 recent blockhashes,
   there are also nonce accounts used as blockhashes by transactions,
   and they contribute significantly to entries being filled up.

   TODO: This number might be unbounded; more testing is required with
   mainnet before deciding on an actual number. */

#define FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS (2048UL)

/* The Solana consensus protocol has an implied restriction on the
   number of transactions in a slot. A slot might have at most
   48,000,000 CUs, but a transaction requires at least around 1500 CUs,
   so there could be at most 32,000 transactions in a slot.

   For Firedancer, we respect this limit when running in production,
   but for development and performance tuning this limit is removed,
   and instead we will produce and accept at most 524,288 transactions
   per slot. This is chosen arbitrarily and works out to around 1.3M
   TPS; however, such a value does not exist in the consensus
   protocol. */

#define FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT (524288UL)

struct fd_txncache_insert {
  uchar const * blockhash;
  uchar const * txnhash;
  ulong         slot;
  uchar const * result;
};

typedef struct fd_txncache_insert fd_txncache_insert_t;

struct fd_txncache_query {
  uchar const * blockhash;
  uchar const * txnhash;
};

typedef struct fd_txncache_query fd_txncache_query_t;

struct fd_txncache_snapshot_entry {
  ulong slot;
  uchar blockhash[ 32 ];
  uchar txnhash[ 20 ];
  ulong txn_idx;
  uchar result;
};

typedef struct fd_txncache_snapshot_entry fd_txncache_snapshot_entry_t;

/* Forward declare opaque handle */
struct fd_txncache_private;
typedef struct fd_txncache_private fd_txncache_t;

FD_PROTOTYPES_BEGIN

/* fd_txncache_{align,footprint} give the needed alignment and
   footprint of a memory region suitable to hold a txn cache.
   fd_txncache_{align,footprint} return the same value as
   FD_TXNCACHE_{ALIGN,FOOTPRINT}.

   fd_txncache_new formats a memory region with suitable alignment and
   footprint for holding a txn cache. Assumes shmem points on the
   caller to the first byte of the memory region owned by the caller to
   use. Returns shmem on success and NULL on failure (logs details).
   The memory region will be owned by the state on successful return.
   The caller is not joined on return.

   fd_txncache_join joins the caller to a txn cache. Assumes shtc points
   to the first byte of the memory region holding the state. Returns a
   local handle to the join on success (this is not necessarily a simple
   cast of the address) and NULL on failure (logs details).

   fd_txncache_leave leaves the caller's current local join to a txn
   cache. Returns a pointer to the memory region holding the state on
   success (this is not necessarily a simple cast of the address) and
   NULL on failure (logs details). The caller is not joined on
   successful return.

   fd_txncache_delete unformats a memory region that holds a txn cache.
   Assumes shtc points on the caller to the first byte of the memory
   region holding the state and that nobody is joined. Returns a
   pointer to the memory region on success and NULL on failure (logs
   details). The caller has ownership of the memory region on
   successful return. */

FD_FN_CONST ulong
fd_txncache_align( void );

FD_FN_CONST ulong
fd_txncache_footprint( ulong max_rooted_slots,
                       ulong max_live_slots,
                       ulong max_txn_per_slot );

void *
fd_txncache_new( void * shmem,
                 ulong  max_rooted_slots,
                 ulong  max_live_slots,
                 ulong  max_txn_per_slot );

fd_txncache_t *
fd_txncache_join( void * shtc );

void *
fd_txncache_leave( fd_txncache_t * tc );

void *
fd_txncache_delete( void * shtc );
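
/* A minimal lifecycle sketch for the constructors above. This is
   hypothetical caller code, not part of the API: it assumes <stdlib.h>
   is available, elides error handling, and assumes the footprint
   satisfies aligned_alloc's size requirements; in a real validator the
   region would more likely come from a shared memory workspace than
   the heap.

     ulong align     = fd_txncache_align();
     ulong footprint = fd_txncache_footprint( FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS,
                                              FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS,
                                              FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT );
     void * shmem = aligned_alloc( align, footprint );

     fd_txncache_t * tc = fd_txncache_join(
         fd_txncache_new( shmem,
                          FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS,
                          FD_TXNCACHE_DEFAULT_MAX_LIVE_SLOTS,
                          FD_TXNCACHE_DEFAULT_MAX_TRANSACTIONS_PER_SLOT ) );

     // ... insert, query, snapshot ...

     free( fd_txncache_delete( fd_txncache_leave( tc ) ) );
*/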

/* fd_txncache_register_root_slot registers a root slot in a txn cache.
   Only the provided limit of roots (typically 300) can exist in the
   txn cache at once, after which the oldest roots will be purged.

   Purging a root means it cannot be served in a snapshot response,
   although transactions in the root may or may not still be present.
   Transaction status is removed once all roots referencing the
   blockhash of the transaction are removed from the txn cache.

   This is neither cheap nor expensive; it will pause all insertion and
   query operations, but only momentarily until any old slots can be
   purged from the cache. */

void
fd_txncache_register_root_slot( fd_txncache_t * tc,
                                ulong           slot );

/* fd_txncache_root_slots returns the list of root slots currently
   tracked by the txn cache. There will be at most max_rooted_slots
   slots, which will be written into the provided out_slots. It is
   assumed tc points to a txn cache and out_slots has space for at
   least max_rooted_slots results. If there are fewer than
   max_rooted_slots slots in the cache, the front part of out_slots
   will be filled in, and all the remaining slots will be set to
   ULONG_MAX.

   This is a fast operation, but it will lock the whole structure and
   cause a temporary pause in insert and query operations. */

void
fd_txncache_root_slots( fd_txncache_t * tc,
                        ulong *         out_slots );
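
/* For example (hypothetical caller code, assuming the cache was created
   with FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS rooted slots):

     ulong slots[ FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS ];
     fd_txncache_root_slots( tc, slots );
     for( ulong i=0UL; i<FD_TXNCACHE_DEFAULT_MAX_ROOTED_SLOTS; i++ ) {
       if( slots[ i ]==ULONG_MAX ) break;   // remaining entries are unused
       // slots[ i ] is a root slot currently tracked by the cache
     }
*/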

/* fd_txncache_snapshot writes the current state of a txn cache into a
   binary format suitable for serving to other nodes via snapshot
   responses. The write function is called in a streaming fashion with
   the binary data, the size of the data, and the ctx pointer provided
   to this function. The write function should return 0 on success and
   -1 on failure; this function will propagate a failure to write back
   to the caller immediately, so this function also returns 0 on
   success and -1 on failure.

   IMPORTANT! THIS ASSUMES THERE ARE NO CONCURRENT INSERTS OCCURRING ON
   THE TXN CACHE AT THE ROOT SLOTS DURING SNAPSHOTTING. OTHERWISE THE
   SNAPSHOT MIGHT NOT CONTAIN ALL OF THE DATA, ALTHOUGH IT WILL NOT
   CAUSE CORRUPTION. THIS IS ASSUMED OK BECAUSE YOU CANNOT MODIFY A
   ROOTED SLOT.

   This is a cheap operation and will not cause any pause in insertion
   or query operations. */

int
fd_txncache_snapshot( fd_txncache_t * tc,
                      void *          ctx,
                      int ( * write )( uchar const * data, ulong data_sz, void * ctx ) );
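
/* For example, a write function that streams the snapshot into a stdio
   stream might look like the following (hypothetical helper, assuming
   <stdio.h>; not part of this API):

     static int
     snapshot_write( uchar const * data,
                     ulong         data_sz,
                     void *        ctx ) {
       FILE * file = (FILE *)ctx;
       return fwrite( data, 1UL, data_sz, file )==data_sz ? 0 : -1;
     }

     // ... elsewhere ...
     if( fd_txncache_snapshot( tc, (void *)file, snapshot_write ) ) {
       // a write failed and was propagated back to us
     }
*/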

/* fd_txncache_insert_batch inserts a batch of transaction results into
   a txn cache. Assumes tc points to a txn cache, txns is a list of
   transaction results to be inserted, and txns_cnt is the count of the
   results list. The insert copies data from the results and does not
   have any lifetime interest in the results memory provided once this
   call returns.

   Returns 1 on success and 0 on failure. The only reason insertion can
   fail is because the txn cache is full, which should never happen in
   practice if the caller sizes the bounds correctly. This is mostly
   here to support testing and fuzzing.

   This is a cheap, high performance, concurrent operation and can occur
   at the same time as queries and arbitrary other insertions. */

int
fd_txncache_insert_batch( fd_txncache_t *              tc,
                          fd_txncache_insert_t const * txns,
                          ulong                        txns_cnt );
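
/* For example, recording the results of a batch of just-executed
   transactions (hypothetical variable and field names; the pointed-to
   blockhash, txnhash, and result bytes only need to stay valid for the
   duration of the call):

     fd_txncache_insert_t inserts[ MAX_BATCH ];
     for( ulong i=0UL; i<txn_cnt; i++ ) {
       inserts[ i ].blockhash = executed[ i ].recent_blockhash;  // 32 byte blockhash
       inserts[ i ].txnhash   = executed[ i ].txnhash;           // key slice of the txn hash
       inserts[ i ].slot      = slot;
       inserts[ i ].result    = &executed[ i ].status;           // 1 byte result
     }
     if( FD_UNLIKELY( !fd_txncache_insert_batch( tc, inserts, txn_cnt ) ) ) {
       // cache is full; with correctly sized bounds this should not happen
     }
*/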

/* fd_txncache_query_batch queries a batch of transactions to determine
   if they exist in the txn cache or not. The queries have an ambiguous
   slot, but must match both the blockhash and txnhash. In addition, if
   the query_func is not NULL, the query_func will be called with the
   slot of the txn and the query_func_ctx that was provided, and the
   transaction will be considered present only if the query_func also
   returns 1.

   Assumes tc points to a txn cache, queries is a list of queries to be
   executed, and queries_cnt is the count of the queries list.
   out_results must be at least as large as queries_cnt and will be
   filled with 0 or 1 if the transaction is not present or present
   respectively.

   This is a cheap, high performance, concurrent operation and can occur
   at the same time as queries and arbitrary other insertions. */

void
fd_txncache_query_batch( fd_txncache_t *             tc,
                         fd_txncache_query_t const * queries,
                         ulong                       queries_cnt,
                         void *                      query_func_ctx,
                         int ( * query_func )( ulong slot, void * ctx ),
                         int *                       out_results );
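
/* For example, checking a batch of candidate transactions before
   scheduling them (hypothetical variable names; passing NULL for
   query_func means a match in any slot counts as present):

     fd_txncache_query_t queries[ MAX_BATCH ];
     int                 present[ MAX_BATCH ];
     for( ulong i=0UL; i<txn_cnt; i++ ) {
       queries[ i ].blockhash = candidates[ i ].recent_blockhash;
       queries[ i ].txnhash   = candidates[ i ].txnhash;
     }
     fd_txncache_query_batch( tc, queries, txn_cnt, NULL, NULL, present );
     for( ulong i=0UL; i<txn_cnt; i++ ) {
       if( present[ i ] ) continue;  // already executed, do not schedule again
       // ... schedule candidates[ i ] ...
     }
*/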

/* fd_txncache_set_txnhash_offset sets the correct offset value for the
   txn hash "key slice" in the blockcache and slotblockcache. This is
   used primarily for snapshot restore, since in firedancer we always
   use an offset of 0. Returns an error if the cache entry isn't
   found. */
int
fd_txncache_set_txnhash_offset( fd_txncache_t * tc,
                                ulong           slot,
                                uchar           blockhash[ 32 ],
                                ulong           txnhash_offset );

/* fd_txncache_is_rooted_slot returns 1 if `slot` is rooted, 0 otherwise. */
int
fd_txncache_is_rooted_slot( fd_txncache_t * tc,
                            ulong           slot );

FD_PROTOTYPES_END

#endif /* HEADER_fd_src_flamenco_runtime_txncache_h */