Line data Source code
1 : #ifndef HEADER_fd_src_disco_store_fd_store_h
2 : #define HEADER_fd_src_disco_store_fd_store_h
3 :
4 : /* fd_store is a high-performance in-memory storage engine for shreds as
5 : they are received from the network.
6 :
7 : The elements in the store themselves are not shreds, but FEC set
8 : payloads. Briefly, FEC sets are groupings of shreds that encode the
9 : same data but provide redundancy and security. While Firedancer
10 : receives individual shreds over Turbine / Repair (the relevant Solana
11 : protocols), a lot of Firedancer's validation of those shreds can only
12 : be done at the FEC set boundary. Also, there are potential future
13 : protocol changes to encode all shreds in a FEC set so that they can
14 : only be decoded once the entire FEC set is received ("all-coding").
15 :
16 : Therefore, the FEC set becomes the logical unit for the store vs.
17 : individual shreds. Replay will only ever attempt replay of a FEC set
18 : so there is no need to store at the granularity of individual shreds.
19 : The store is essentially a mapping of merkle root->FEC set payload,
20 : in which the merkle root was produced by feeding every shred in a FEC
21 : set into a merkle tree. This uniquely identifies a FEC set, because
22 : if any of the shreds change, the merkle root changes.
23 :
24 : Shreds are coalesced and inserted into the store as bytes. The data
25 : buffer for each FEC set element is allocated separately in a
26 : contiguous region and referenced by gaddr. The max bytes per FEC set
27 : is configurable via the fixed_fec_sets development config parameter
28 : (see default.toml for details).
29 :
30 : The shared memory used by a store instance is within a workspace such
31 : that it is also persistent and remotely inspectable. Store is
32 : designed to be used inter-process (allowing concurrent joins from
33 : multiple tiles), relocated in memory (via wksp operations), and
34 : accessed concurrently (managing conflicts with a lock).
35 :
36 : EQUIVOCATION
37 :
38 : There is a protocol violation called equivocation (also known as
39 : "duplicates") that results in two or more blocks for the same slot.
40 : The actual conflict occurs at the shred level: a leader produces two
41 : or more shreds for the same slot at the same index, with different
42 : data payloads. The result will be two different FEC sets for the
43 : same "logical" FEC set based on (slot, fec_set_idx).
44 :
45 : Unfortunately a naive keying scheme such as (slot, fec_set_idx) is
46 : insufficient as a result of equivocation. As mentioned earlier,
47 : Store instead uses the merkle root as the key for its FEC set
48 : elements, at the cost of some keying performance and complexity.
49 :
50 : ARCHITECTURE
51 :
52 : In the Firedancer topology, Shred tile inserts to store and Replay
53 : tile queries from store. Replay tile only queries from the store
54 : after Shred tile has notified Replay that a FEC set is ready. Shred's
55 : inserts are append-only and Replay is responsible for removing once
56 : it is done consuming (signaled by a new Tower root).
57 :
58 : Shred (inserts) -> Replay (queries, removes)
59 :
60 : ORDERING
61 :
62 : In the above architecture, Shred delivers FEC sets to Replay in
63 : partial order. Any given fork will be delivered in-order, but
64 : concurrent forks can be delivered in arbitrary order. Another way to
65 : phrase this is a parent FEC set will always be delivered before the
66 : child (see fd_reasm).
67 :
68 : CONCURRENCY
69 :
70 : It is possible to design Store access in a way that enables parallel
71 : inserts and minimizes lock contention between readers and writers.
72 : Store contains a fd_rwlock (read-write lock), but the name is a bit
73 : of a misnomer because inserts can actually be concurrent and the only
74 : operation that will actually need the write lock is remove, which
75 : will be done by Replay. It is more appropriate to describe as an
76 : exclusive-shared access lock.
77 :
78 : In this design, writers (Shred tiles) hold the shared lock during
79 : their access. The reader (Replay tile) also holds the shared lock
80 : during its access, and given both Shred and Replay will be taking out
81 : shared locks, they will not contend.
82 :
83 : For parallel inserts, the Store's hash function is carefully designed
84 : to partition the keyspace so that the same Shred tile always inserts
85 : to the same map slots. This ensures map collisions always happen on
86 : the same Shred tile and cannot happen across tiles. Specifically, if
87 : two different FEC sets hash to the same slot, it is guaranteed that
88 : to be the same Shred tile processing both those FEC sets. This
89 : prevents a data race in which multiple Shred tiles (each with a
90 : handle to the shared lock) write to the same map slot.
91 :
92 : The hash function is defined as follows:
93 : ```
94 : #define MAP_KEY_HASH(key,seed) ((ulong)key->mr.ul[0]%seed + (key)->part_idx*seed)
95 : ```
96 : where `key` is a key type that includes the merkle root (32 bytes)
97 : and the partition index (8 bytes) that is equivalent to the Shred
98 : tile index doing the insertion. seed, on initialization, is the
99 : number of chains/buckets in the map_chain divided by the number of
100 : partitions. In effect, seed is the size of each partition. For
101 : example, if the map_chain is sized to 1024, and there are 4 shred
102 : tiles, then the seed is 1024/4 = 256. Then the map key hash can
103 : bound the chain index of each partition as such: shred tile 0 will
104 : write to chains 0-255, shred tile 1 will write to chains 256-511,
105 : shred tile 2 will write to chains 512-767, and shred tile 3 will
106 : write to chains 768-1023, without overlap. The merkle root is a 32
107 : byte SHA-256 hash, so we can expect a fairly uniform distribution of
108 : hash values even after truncating to the first 8 bytes, without
109 : needing to introduce more randomness. Thus we can repurpose the
110 : `seed` argument to be the number of partitions.
111 :
112 : Essentially, this allows for limited single-producer single-consumer
113 : (SPSC) concurrency, where the producer is a given Shred tile and the
114 : consumer is Replay tile. The SPSC concurrency is limited in that the
115 : Store should 1. only be read by Replay after Shred has notified
116 : Replay it is time to read (ie. Shred has finished writing), and 2. be
117 : written to by Shred(s) in append-only fashion, so Shred never
118 : modifies or removes from the map. Store is backed by fd_map_chain,
119 : which is not thread-safe generally, but does support this particular
120 : SPSC concurrency model in cases where the consumer is guaranteed to
121 : be lagging the producer.
122 :
123 : Analyzing fd_map_chain in gory detail, in the case of a map collision
124 : where Replay tile is reading an element and Shred tile inserts a new
125 : element to the same map slot, that new element is prepended to the
126 : hash chain within that slot (which modifies what the head of the
127 : chain points to as well as the now-previous head in the hash chain's
128 : `.next` field, but does not touch application data). With fencing
129 : enabled (MAP_INSERT_FENCE), it is guaranteed the consumer either
130 : queries the head before or after the update. If it queries before,
131 : that is safe, it would just check the key (if no match, iterate down
132 : the chain etc.) If it queries after, it is also safe because the new
133 : element is guaranteed to be before the old element in the chain, so
134 : it would just do one more iteration. Note the consumer should always
135 : use fd_store_query_const to ensure the underlying fd_map_chain is not
136 : modified during querying.
137 :
138 : The exception to the above is removing. Removing requires exclusive
139 : access because it involves removing from fd_map_chain, which is not
140 : safe for shared access. So the Replay tile should take out the
141 : exclusive lock. Removing happens at most once per slot, so it is a
142 : relatively infrequent Store access compared to FEC queries and
143 : inserts (which is good because it is also the most expensive). */
144 :
145 : #include "../../disco/shred/fd_fec_set.h"
146 : #include "../../flamenco/fd_rwlock.h"
147 : #include "../../flamenco/fd_flamenco_base.h"
148 : #include "../../util/hist/fd_histf.h"
149 :
150 : /* FD_STORE_ALIGN specifies the alignment needed for store. ALIGN is
151 : double x86 cache line to mitigate various kinds of false sharing (eg.
152 : ACLPF adjacent cache line prefetch). */
153 :
154 : #define FD_STORE_ALIGN (128UL)
155 :
156 : /* FD_STORE_MAGIC is a magic number for detecting store corruption. */
157 :
158 18 : #define FD_STORE_MAGIC (0xf17eda2ce75702e0UL) /* firedancer store version 0 */
159 :
160 : /* fd_store_fec describes a store element (FEC set). The pointer fields
161 : implement a left-child, right-sibling n-ary tree. */
162 :
163 : struct __attribute__((packed)) fd_store_key {
164 : fd_hash_t merkle_root;
165 : ulong part_idx; /* partition index of the caller of fd_store_insert */
166 : };
167 : typedef struct fd_store_key fd_store_key_t;
168 :
169 : struct __attribute__((aligned(FD_STORE_ALIGN))) fd_store_fec {
170 : fd_store_key_t key; /* map key, merkle root of the FEC set + a partition index */
171 : ulong next; /* reserved for internal use by fd_pool, fd_map_chain */
172 : uint shred_offs[FD_FEC_SHRED_CNT]; /* shred_offs[ i ] is the total size of data shreds [0, i], up to FD_FEC_SHRED_CNT */
173 : ulong data_sz; /* sz of the FEC set payload, guaranteed <= store->fec_data_max */
174 : ulong data_gaddr; /* wksp gaddr of this element's data buffer */
175 : };
176 : typedef struct fd_store_fec fd_store_fec_t;
177 :
178 : #define POOL_NAME fd_store_pool
179 435 : #define POOL_ELE_T fd_store_fec_t
180 : #include "../../util/tmpl/fd_pool_para.c"
181 :
182 : #define MAP_NAME fd_store_map
183 : #define MAP_ELE_T fd_store_fec_t
184 : #define MAP_KEY_T fd_store_key_t
185 84 : #define MAP_KEY key
186 282 : #define MAP_KEY_EQ(k0,k1) (!memcmp((k0),(k1), sizeof(fd_hash_t)))
187 396 : #define MAP_KEY_HASH(key,seed) ((ulong)key->merkle_root.ul[0]%seed + (key)->part_idx*seed) /* see top-level documentation of hash function */
188 : #define MAP_INSERT_FENCE 1
189 : #include "../../util/tmpl/fd_map_chain.c"
190 :
191 : struct fd_store {
192 : ulong magic; /* ==FD_STORE_MAGIC */
193 : ulong part_cnt; /* number of partitions, also the number of writers */
194 : ulong fec_max; /* max number of FEC sets that can be stored */
195 : ulong fec_data_max; /* max data payload bytes per FEC set */
196 : ulong store_gaddr; /* wksp gaddr of store in the backing wksp, non-zero gaddr */
197 : ulong map_gaddr; /* wksp gaddr of map of fd_store_key->fd_store_fec */
198 : ulong pool_mem_gaddr; /* wksp gaddr of shmem_t object in pool_para */
199 : ulong pool_ele_gaddr; /* wksp gaddr of first ele_t object in pool_para */
200 : ulong data_gaddr; /* wksp gaddr of the base of contiguous data region (fec_max * fec_data_max bytes) */
201 :
202 : fd_rwlock_t lock; /* shared-exclusive lock */
203 : };
204 : typedef struct fd_store fd_store_t;
205 :
206 : FD_PROTOTYPES_BEGIN
207 :
208 : /* Constructors */
209 :
210 : /* fd_store_{align,footprint} return the required alignment and
211 : footprint of a memory region suitable for use as store with up to
212 : fec_max elements. fec_max is a positive integer (does not need to
213 : be a power of two). fec_data_max is the max data payload size per
214 : FEC set. */
215 :
216 : FD_FN_CONST static inline ulong
217 192 : fd_store_align( void ) {
218 192 : return alignof(fd_store_t);
219 192 : }
220 :
221 : FD_FN_CONST static inline ulong
222 : fd_store_footprint( ulong fec_max,
223 36 : ulong fec_data_max ) {
224 36 : return FD_LAYOUT_FINI(
225 36 : FD_LAYOUT_APPEND(
226 36 : FD_LAYOUT_APPEND(
227 36 : FD_LAYOUT_APPEND(
228 36 : FD_LAYOUT_APPEND(
229 36 : FD_LAYOUT_APPEND(
230 36 : FD_LAYOUT_INIT,
231 36 : alignof(fd_store_t), sizeof(fd_store_t) ),
232 36 : fd_store_map_align(), fd_store_map_footprint( fd_store_map_chain_cnt_est( fec_max ) ) ),
233 36 : fd_store_pool_align(), fd_store_pool_footprint() ),
234 36 : alignof(fd_store_fec_t), sizeof(fd_store_fec_t)*fec_max ),
235 36 : FD_STORE_ALIGN, fec_data_max*fec_max ),
236 36 : fd_store_align() );
237 36 : }
238 :
239 : /* fd_store_new formats an unused memory region for use as a store.
240 : mem is a non-NULL pointer to this region in the local address space
241 : with the required footprint and alignment. fec_max is a positive
242 : integer (does not need to be a power of two). fec_data_max is the
243 : max data bytes per FEC set. */
244 :
245 : void *
246 : fd_store_new( void * shmem,
247 : ulong part_cnt,
248 : ulong fec_max,
249 : ulong fec_data_max );
250 :
251 : /* fd_store_join joins the caller to the store. store points to the
252 : first byte of the memory region backing the store in the caller's
253 : address space.
254 :
255 : Returns a pointer in the local address space to store on success. */
256 :
257 : fd_store_t *
258 : fd_store_join( void * store );
259 :
260 : /* fd_store_leave leaves a current local join. Returns a pointer to the
261 : underlying shared memory region on success and NULL on failure (logs
262 : details). Reasons for failure include store is NULL. */
263 :
264 : void *
265 : fd_store_leave( fd_store_t const * store );
266 :
267 : /* fd_store_delete unformats a memory region used as a store.
268 : Assumes only the nobody is joined to the region. Returns a
269 : pointer to the underlying shared memory region or NULL if used
270 : obviously in error (e.g. store is obviously not a store ... logs
271 : details). The ownership of the memory region is transferred to the
272 : caller. */
273 :
274 : void *
275 : fd_store_delete( void * shstore );
276 :
277 : /* Accessors */
278 :
279 : /* fd_store_wksp returns the local join to the wksp backing the store.
280 : The lifetime of the returned pointer is at least as long as the
281 : lifetime of the local join. Assumes store is a current local
282 : join. */
283 :
284 : FD_FN_PURE static inline fd_wksp_t *
285 1362 : fd_store_wksp( fd_store_t const * store ) {
286 1362 : return (fd_wksp_t *)( ( (ulong)store ) - store->store_gaddr );
287 1362 : }
288 :
289 : /* fd_store_fec_data returns a pointer in the local address space to the
290 : data buffer of the given FEC set element. Assumes store is a current
291 : local join and fec is in the store's pool. */
292 :
293 : FD_FN_PURE static inline uchar *
294 : fd_store_fec_data( fd_store_t const * store,
295 12 : fd_store_fec_t const * fec ) {
296 12 : return (uchar *)( (ulong)store - store->store_gaddr + fec->data_gaddr );
297 12 : }
298 :
299 : /* fd_store_{s}lock_{acquire,release} interface store's shared-exclusive
300 : lock. See also FD_STORE_{S,X}LOCK_{BEGIN,END}. */
301 :
302 225 : static inline void fd_store_slock_acquire( fd_store_t * store ) { fd_rwlock_read ( &store->lock ); }
303 225 : static inline void fd_store_slock_release( fd_store_t * store ) { fd_rwlock_unread ( &store->lock ); }
304 6 : static inline void fd_store_xlock_acquire( fd_store_t * store ) { fd_rwlock_write ( &store->lock ); }
305 6 : static inline void fd_store_xlock_release( fd_store_t * store ) { fd_rwlock_unwrite( &store->lock ); }
306 :
307 : static inline void
308 225 : fd_store_private_slock_end( fd_store_t ** _store ) {
309 225 : fd_store_t * store = *_store;
310 225 : fd_store_slock_release( store );
311 225 : }
312 :
313 225 : #define FD_STORE_SLOCK_BEGIN(store) { \
314 225 : fd_store_t * _store __attribute__((cleanup(fd_store_private_slock_end))) = (store); \
315 225 : fd_store_slock_acquire( _store ); \
316 225 : {
317 :
318 225 : #define FD_STORE_SLOCK_END }}
319 :
320 : static inline void
321 6 : fd_store_private_xlock_end( fd_store_t ** _store ) {
322 6 : fd_store_t * store = *_store;
323 6 : fd_store_xlock_release( store );
324 6 : }
325 :
326 6 : #define FD_STORE_XLOCK_BEGIN(store) { \
327 6 : fd_store_t * _store __attribute__((cleanup(fd_store_private_xlock_end))) = (store); \
328 6 : fd_store_xlock_acquire( _store ); \
329 6 : {
330 :
331 6 : #define FD_STORE_XLOCK_END }}
332 :
333 :
334 : /* fd_store_query queries the FEC set keyed by merkle_root. Returns a
335 : pointer to the fd_store_fec_t if found, NULL otherwise. Assumes
336 : caller has acquired the shared lock.
337 :
338 : IMPORTANT SAFETY TIP! Caller should only call release the shared
339 : lock when they no longer retain interest in the returned pointer. */
340 :
341 : fd_store_fec_t *
342 : fd_store_query( fd_store_t * store,
343 : fd_hash_t const * merkle_root );
344 :
345 : /* Operations */
346 :
347 : /* fd_store_insert inserts a new FEC keyed by merkle_root into the
348 : store. Returns a pointer to the inserted pool ele (fd_store_fec_t *)
349 : on success. If the merkle root has previously been inserted, returns
350 : a pointer to the previous element with that key. Returns NULL if the
351 : store is full (see fd_store_evict).
352 :
353 : Each fd_store_fec_t can hold at most store->fec_data_max bytes of data.
354 : Caller is responsible for copying into the data buffer region
355 : (accessed via fd_store_fec_data).
356 :
357 : This is a blocking operation that acquires a shared lock as part of
358 : its implementation. Assumes caller has safely partitioned insertions
359 : if running in parallel (see top-level documentation in fd_store.h for
360 : details). */
361 :
362 : fd_store_fec_t *
363 : fd_store_insert( fd_store_t * store,
364 : ulong part_idx,
365 : fd_hash_t * merkle_root );
366 :
367 : /* fd_store_remove removes the FEC set keyed by merkle_root. Logs a
368 : warning if merkle_root is not found.
369 :
370 : This is a blocking operation that acquires an exclusive lock as part
371 : of its implementation.
372 :
373 : IMPORTANT SAFETY TIP! Caller should only call fd_store_shrel or
374 : fd_store_exrel when they no longer retain interest in the returned
375 : pointer. */
376 :
377 : void
378 : fd_store_remove( fd_store_t * store,
379 : fd_hash_t const * merkle_root );
380 :
381 : /* fd_store_verify returns 0 if the store is not obviously corrupt or -1
382 : otherwise (logs details). */
383 :
384 : int
385 : fd_store_verify( fd_store_t * store );
386 :
387 : FD_PROTOTYPES_END
388 :
389 : #endif /* HEADER_fd_src_disco_store_fd_store_h */
|