#ifndef HEADER_fd_src_discof_repair_fd_store_h
#define HEADER_fd_src_discof_repair_fd_store_h

/* fd_store is a high-performance in-memory storage engine for shreds
   as they are received from the network.

   The elements in the store themselves are not shreds, but FEC set
   payloads. Briefly, FEC sets are groupings of shreds that encode the
   same data but provide redundancy and security. While Firedancer
   receives individual shreds over Turbine / Repair (the relevant
   Solana protocols), much of Firedancer's validation of those shreds
   can only be done at the FEC set boundary. Also, there are potential
   future protocol changes to encode all shreds in a FEC set so that
   they can only be decoded once the entire FEC set is received
   ("all-coding").

   Therefore, the FEC set, rather than the individual shred, is the
   store's logical unit. Replay will only ever attempt replay of a
   full FEC set, so there is no need to store at the granularity of
   individual shreds. The store is essentially a mapping of merkle
   root->FEC set payload, in which the merkle root is produced by
   feeding every shred in a FEC set into a merkle tree. This uniquely
   identifies a FEC set, because if any of the shreds change, the
   merkle root changes.

   Shreds are coalesced and inserted into the store as bytes. A FEC
   set payload is currently variable-length, capped at 63985 bytes.
   In the future this will be fixed to 31840 bytes, when FEC sets are
   enforced to always be 32 shreds.

   The API is designed to be used inter-process (ie. concurrent joins
   from multiple tiles) and also supports concurrent access, but
   callers must interface with the store through a read-write lock
   (fd_rwlock).

   EQUIVOCATION

   There is a protocol violation called equivocation (also known as
   "duplicates") that results in two or more blocks for the same slot.
   The actual conflict occurs at the shred level: a leader produces
   two or more shreds for the same slot at the same index, with
   different data payloads. The result is two different FEC sets for
   the same "logical" FEC set identified by (slot, fec_set_idx).

   Unfortunately, this means a naive keying scheme such as
   (slot, fec_set_idx) is insufficient. As mentioned earlier, Store
   instead uses the merkle root as the key for its FEC set elements,
   at the cost of some keying performance and complexity.

   ARCHITECTURE

   In the Firedancer topology, the Shred tile writes to the store and
   the Replay tile reads from it. Replay only reads from the store
   after the Repair tile (which is downstream of Shred) has notified
   Replay that a FEC set is ready. Shred's writes are append-only and
   Replay is responsible for publishing once it is done consuming
   (signaled by a new Tower root).

   Shred (writes) -> Repair (notif) -> Replay (reads, publishes)

   ORDERING

   In the above architecture, it is guaranteed that Repair delivers
   FEC sets to Replay in replay order. That is, the parent FEC set is
   always delivered before the child (see fd_fec_chainer). Note
   however that concurrent forks are delivered in the order they are
   received from the network, ie. arbitrary order.

   CONCURRENCY

   It is possible to design Store access in a way that enables
   parallel writes and minimizes lock contention between readers and
   writers. In this design, writers (Shred tiles) only hold the read
   lock during their access. The reader (Replay tile) also holds the
   read lock during its access, but given both Shred and Replay take
   out read locks, they do not contend. "Write lock" is now a bit of
   a misnomer, because the only operation that actually needs the
   write lock is publish, which is done by Replay.

   For parallel writes, the Store's hash function should be carefully
   constructed to mirror the Shred tiles' (writers') round-robin
   hashing, to ensure that any slot collision in the underlying
   fd_map_chain always occurs on the same Shred tile. So if two
   different hashes point to the same slot, it is guaranteed to be the
   same Shred tile processing both keys (and similarly, a hash
   collision would also be processed by the same tile). This prevents
   a data race in which multiple Shred tiles attempt to write to the
   same map slot.

   For reducing lock contention, the Store should 1. only be read by
   Replay after Repair has notified Replay it is time to read, and
   2. be written to by Shred tile(s) in append-only fashion, so Shred
   never modifies or removes what it has written. Store is backed by
   a fd_map_chain, which is not generally thread-safe, but in the case
   of a slot collision where the Replay tile is reading an element and
   a Shred tile writes a new element to the same slot, that new
   element is always appended to the end of the hash chain within that
   slot (which modifies the previous tail element's `.next` field but
   does not touch application data). So this can enable lock-free
   concurrent reads and writes. Note the Replay tile (reader) should
   always use fd_store_query_const to ensure the underlying
   fd_map_chain is not modified during querying.

   The exception to the above is publishing. Publishing requires the
   write lock to ensure the tile doing the publishing (Replay tile) is
   the only one accessing the store. Publishing happens at most once
   per slot, so it is a relatively infrequent Store access compared to
   FEC queries and inserts. */

#include "../../flamenco/fd_rwlock.h"
#include "../../flamenco/types/fd_types_custom.h"

/* FD_STORE_USE_HANDHOLDING: Define this to non-zero at compile time
   to turn on additional runtime checks and logging. */

#ifndef FD_STORE_USE_HANDHOLDING
#define FD_STORE_USE_HANDHOLDING 1
#endif

/* FD_STORE_ALIGN specifies the alignment needed for store. ALIGN is
   double x86 cache line to mitigate various kinds of false sharing
   (eg. ACLPF adjacent cache line prefetch). */

#define FD_STORE_ALIGN (128UL)

/* FD_STORE_MAGIC is a magic number for detecting store corruption. */

#define FD_STORE_MAGIC (0xf17eda2ce75702e0UL) /* firedancer store version 0 */

/* FD_STORE_DATA_MAX defines a constant for the maximum size of a FEC
   set payload. The value is computed from the maximum number of
   shreds in a FEC set * the payload bytes per shred:

   67 shreds per FEC set * 955 payload bytes per shred = 63985 bytes
   max. */

#define FD_STORE_DATA_MAX (63985UL) /* FIXME fixed-32 */

/* fd_store_fec describes a store element (FEC set). The pointer
   fields implement a left-child, right-sibling n-ary tree. */

struct __attribute__((aligned(128UL))) fd_store_fec {

  /* Keys */

  fd_hash_t key; /* map key, merkle root of the FEC set */
  fd_hash_t cmr; /* parent's map key, chained merkle root of the FEC set */

  /* Pointers */

  ulong next;    /* reserved for internal use by fd_pool, fd_map_chain, orphan list */
  ulong parent;  /* pool idx of the parent */
  ulong child;   /* pool idx of the left-child */
  ulong sibling; /* pool idx of the right-sibling */

  /* Data */

  ulong data_sz;                 /* FIXME fixed-32. sz of the FEC set payload, guaranteed < FD_STORE_DATA_MAX */
  uchar data[FD_STORE_DATA_MAX]; /* FEC set payload = coalesced data shreds (byte array) */
};
typedef struct fd_store_fec fd_store_fec_t;

#define POOL_NAME fd_store_pool
#define POOL_T    fd_store_fec_t
#include "../../util/tmpl/fd_pool.c"

#define MAP_NAME               fd_store_map
#define MAP_ELE_T              fd_store_fec_t
#define MAP_KEY_T              fd_hash_t
#define MAP_KEY_EQ(k0,k1)      (!memcmp((k0),(k1),sizeof(fd_hash_t)))
#define MAP_KEY_HASH(key,seed) (fd_hash((seed),(key),sizeof(fd_hash_t)))
#include "../../util/tmpl/fd_map_chain.c"

struct fd_store {
  ulong magic;       /* ==FD_STORE_MAGIC */
  ulong seed;        /* seed for the various hash functions used under the hood, arbitrary */
  ulong root;        /* pool idx of the root */
  ulong store_gaddr; /* wksp gaddr of store in the backing wksp, non-zero gaddr */
  ulong pool_gaddr;  /* wksp gaddr of pool of store elements (fd_store_fec) */
  ulong map_gaddr;   /* wksp gaddr of map of merkle root->fd_store_fec */

  fd_rwlock_t lock;  /* rwlock for concurrent access */
};
typedef struct fd_store fd_store_t;

FD_PROTOTYPES_BEGIN

/* Constructors */

/* fd_store_{align,footprint} return the required alignment and
   footprint of a memory region suitable for use as a store with up to
   fec_max elements. */

FD_FN_CONST static inline ulong
fd_store_align( void ) {
  return alignof(fd_store_t);
}

FD_FN_CONST static inline ulong
fd_store_footprint( ulong fec_max ) {
  return FD_LAYOUT_FINI(
    FD_LAYOUT_APPEND(
    FD_LAYOUT_APPEND(
    FD_LAYOUT_APPEND(
      FD_LAYOUT_INIT,
      alignof(fd_store_t),   sizeof(fd_store_t)                  ),
      fd_store_pool_align(), fd_store_pool_footprint( fec_max )  ),
      fd_store_map_align(),  fd_store_map_footprint( fec_max )   ),
    fd_store_align() );
}

/* fd_store_new formats an unused memory region for use as a store.
   mem is a non-NULL pointer to this region in the local address space
   with the required footprint and alignment. */

void *
fd_store_new( void * shmem, ulong fec_max, ulong seed );

/* fd_store_join joins the caller to the store. store points to the
   first byte of the memory region backing the store in the caller's
   address space.

   Returns a pointer in the local address space to store on success. */

fd_store_t *
fd_store_join( void * store );

/* fd_store_leave leaves a current local join. Returns a pointer to
   the underlying shared memory region on success and NULL on failure
   (logs details). Reasons for failure include store is NULL. */

void *
fd_store_leave( fd_store_t const * store );

/* fd_store_delete unformats a memory region used as a store. Assumes
   nobody is joined to the region. Returns a pointer to the underlying
   shared memory region or NULL if used obviously in error (e.g. store
   is obviously not a store ... logs details). The ownership of the
   memory region is transferred to the caller. */

void *
fd_store_delete( void * store );

/* Accessors */

/* fd_store_wksp returns the local join to the wksp backing the store.
   The lifetime of the returned pointer is at least as long as the
   lifetime of the local join. Assumes store is a current local
   join. */

FD_FN_PURE static inline fd_wksp_t *
fd_store_wksp( fd_store_t const * store ) {
  return (fd_wksp_t *)( ( (ulong)store ) - store->store_gaddr );
}

/* fd_store_{pool,map} returns a pointer in the caller's address space
   to the corresponding store field. const versions for each are also
   provided. */

FD_FN_PURE static inline fd_store_fec_t       * fd_store_pool      ( fd_store_t       * store ) { return fd_wksp_laddr_fast     ( fd_store_wksp      ( store ), store->pool_gaddr ); }
FD_FN_PURE static inline fd_store_fec_t const * fd_store_pool_const( fd_store_t const * store ) { return fd_wksp_laddr_fast     ( fd_store_wksp      ( store ), store->pool_gaddr ); }
FD_FN_PURE static inline fd_store_map_t       * fd_store_map       ( fd_store_t       * store ) { return fd_wksp_laddr_fast     ( fd_store_wksp      ( store ), store->map_gaddr  ); }
FD_FN_PURE static inline fd_store_map_t const * fd_store_map_const ( fd_store_t const * store ) { return fd_wksp_laddr_fast     ( fd_store_wksp      ( store ), store->map_gaddr  ); }
FD_FN_PURE static inline fd_store_fec_t       * fd_store_root      ( fd_store_t       * store ) { return fd_store_pool_ele      ( fd_store_pool      ( store ), store->root ); }
FD_FN_PURE static inline fd_store_fec_t const * fd_store_root_const( fd_store_t const * store ) { return fd_store_pool_ele_const( fd_store_pool_const( store ), store->root ); }

/* fd_store_{parent,child,sibling} returns a pointer in the caller's
   address space to the corresponding {parent,left-child,right-sibling}
   of fec. Assumes store is a current local join and fec is a valid
   pointer to a pool element inside store. const versions for each are
   also provided. */

FD_FN_PURE static inline fd_store_fec_t       * fd_store_parent       ( fd_store_t       * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele      ( fd_store_pool      ( store ), fec->parent  ); }
FD_FN_PURE static inline fd_store_fec_t const * fd_store_parent_const ( fd_store_t const * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele_const( fd_store_pool_const( store ), fec->parent  ); }
FD_FN_PURE static inline fd_store_fec_t       * fd_store_child        ( fd_store_t       * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele      ( fd_store_pool      ( store ), fec->child   ); }
FD_FN_PURE static inline fd_store_fec_t const * fd_store_child_const  ( fd_store_t const * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele_const( fd_store_pool_const( store ), fec->child   ); }
FD_FN_PURE static inline fd_store_fec_t       * fd_store_sibling      ( fd_store_t       * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele      ( fd_store_pool      ( store ), fec->sibling ); }
FD_FN_PURE static inline fd_store_fec_t const * fd_store_sibling_const( fd_store_t const * store, fd_store_fec_t const * fec ) { return fd_store_pool_ele_const( fd_store_pool_const( store ), fec->sibling ); }

/* fd_store_{query,query_const} queries the FEC set keyed by merkle.
   Returns a pointer to the fd_store_fec_t if found, NULL otherwise.

   Assumes caller has already acquired a read lock via fd_rwlock_read.

   IMPORTANT SAFETY TIP! Caller should only call fd_rwlock_unread when
   they no longer retain interest in the returned pointer. */

FD_FN_PURE static inline fd_store_fec_t *
fd_store_query( fd_store_t * store, fd_hash_t * merkle_root ) {
  return fd_store_map_ele_query( fd_store_map( store ), merkle_root, NULL, fd_store_pool( store ) );
}

FD_FN_PURE static inline fd_store_fec_t const *
fd_store_query_const( fd_store_t const * store, fd_hash_t * merkle_root ) {
  return fd_store_map_ele_query_const( fd_store_map_const( store ), merkle_root, NULL, fd_store_pool_const( store ) );
}

/* Operations */

/* fd_store_insert inserts a new FEC set keyed by merkle. Returns the
   newly inserted fd_store_fec_t. Copies at most FD_STORE_DATA_MAX
   bytes from data into the element's data field and records data_sz
   (if handholding is enabled, it will abort the caller with a
   descriptive error message if data_sz is too large).

   Assumes store is a current local join and has space for another
   element. Does additional checks when handholding is enabled and
   fails insertion (returning NULL) if checks fail. If this is the
   first element being inserted into store, the store root will be set
   to this newly inserted element.

   Assumes caller has already acquired the appropriate lock via
   fd_rwlock_read or fd_rwlock_write. See the top-level documentation
   for why this operation may only require a read lock and not a
   write lock.

   IMPORTANT SAFETY TIP! Caller should only call fd_rwlock_unread when
   they no longer retain interest in the returned pointer. */

fd_store_fec_t *
fd_store_insert( fd_store_t * store,
                 fd_hash_t  * merkle_root,
                 uchar      * data,
                 ulong        data_sz /* FIXME fixed-32 */ );

/* fd_store_link queries for and links the child keyed by merkle_root
   to the parent keyed by chained_merkle_root. Returns a pointer to
   the child. Assumes merkle_root and chained_merkle_root are both
   non-NULL and key elements currently in the store. Does not require
   the lock. */

fd_store_fec_t *
fd_store_link( fd_store_t * store,
               fd_hash_t  * merkle_root,
               fd_hash_t  * chained_merkle_root );

/* fd_store_publish publishes merkle_root as the new store root,
   pruning all elements across branches that do not descend from the
   new root. Returns a pointer to the new root. Assumes merkle_root
   is in the store and connected to the root (if handholding is
   enabled, does additional checks and returns NULL on error). Note
   pruning can result in store elements with slots greater than the
   new root's slot being removed. These are elements that become
   orphaned as a result of the common ancestor with the new root being
   removed (the entire branch ie. fork is pruned).

   For example, in the tree (preorder) [0 1 2 4 3 5 6], publishing 2
   will result in [0 1] being removed given they are ancestors of 2,
   and removing 1 will leave [3 5 6] orphaned and also removed.

   Assumes caller has already acquired a write lock via
   fd_rwlock_write.

   IMPORTANT SAFETY TIP! Caller should only call fd_rwlock_unwrite
   when they no longer retain interest in the returned pointer. */

fd_store_fec_t *
fd_store_publish( fd_store_t * store,
                  fd_hash_t  * merkle_root );

FD_PROTOTYPES_END

#endif /* HEADER_fd_src_discof_repair_fd_store_h */