LCOV - code coverage report
Current view: top level - disco/store - fd_store.h (source / functions) Hit Total Coverage
Test: cov.lcov Lines: 51 51 100.0 %
Date: 2026-06-21 08:41:42 Functions: 10 140 7.1 %

          Line data    Source code
       1             : #ifndef HEADER_fd_src_disco_store_fd_store_h
       2             : #define HEADER_fd_src_disco_store_fd_store_h
       3             : 
       4             : /* fd_store is a high-performance in-memory storage engine for shreds as
       5             :    they are received from the network.
       6             : 
       7             :    The elements in the store themselves are not shreds, but FEC set
       8             :    payloads.  Briefly, FEC sets are groupings of shreds that encode the
       9             :    same data but provide redundancy and security.  While Firedancer
      10             :    receives individual shreds over Turbine / Repair (the relevant Solana
      11             :    protocols), a lot of Firedancer's validation of those shreds can only
      12             :    be done at the FEC set boundary.  Also, there are potential future
      13             :    protocol changes to encode all shreds in a FEC set so that they can
      14             :    only be decoded once the entire FEC set is received ("all-coding").
      15             : 
      16             :    Therefore, the FEC set becomes the logical unit for the store vs.
      17             :    individual shreds.  Replay will only ever attempt replay of a FEC set
      18             :    so there is no need to store at the granularity of individual shreds.
      19             :    The store is essentially a mapping of merkle root->FEC set payload,
      20             :    in which the merkle root was produced by feeding every shred in a FEC
      21             :    set into a merkle tree.  This uniquely identifies a FEC set, because
      22             :    if any of the shreds change, the merkle root changes.
      23             : 
      24             :    Shreds are coalesced and inserted into the store as bytes.  The data
      25             :    buffer for each FEC set element is allocated separately in a
      26             :    contiguous region and referenced by gaddr.  The max bytes per FEC set
      27             :    is configurable via the fixed_fec_sets development config parameter
      28             :    (see default.toml for details).
      29             : 
      30             :    The shared memory used by a store instance is within a workspace such
      31             :    that it is also persistent and remotely inspectable.  Store is
      32             :    designed to be used inter-process (allowing concurrent joins from
      33             :    multiple tiles), relocated in memory (via wksp operations), and
      34             :    accessed concurrently (managing conflicts with a lock).
      35             : 
      36             :    EQUIVOCATION
      37             : 
      38             :    There is a protocol violation called equivocation (also known as
      39             :    "duplicates") that results in two or more blocks for the same slot.
      40             :    The actual conflict occurs at the shred level: a leader produces two
      41             :    or more shreds for the same slot at the same index, with different
      42             :    data payloads.  The result will be two different FEC sets for the
      43             :    same "logical" FEC set based on (slot, fec_set_idx).
      44             : 
      45             :    Unfortunately a naive keying scheme such as (slot, fec_set_idx) is
      46             :    insufficient as a result of equivocation.  As mentioned earlier,
      47             :    Store instead uses the merkle root as the key for its FEC set
      48             :    elements, at the cost of some keying performance and complexity.
      49             : 
      50             :    ARCHITECTURE
      51             : 
      52             :    In the Firedancer topology, Shred tile inserts to store and Replay
      53             :    tile queries from store.  Replay tile only queries from the store
      54             :    after Shred tile has notified Replay that a FEC set is ready. Shred's
      55             :    inserts are append-only and Replay is responsible for removing once
      56             :    it is done consuming (signaled by a new Tower root).
      57             : 
      58             :    Shred (inserts) -> Replay (queries, removes)
      59             : 
      60             :    ORDERING
      61             : 
      62             :    In the above architecture, Shred delivers FEC sets to Replay in
      63             :    partial order.  Any given fork will be delivered in-order, but
      64             :    concurrent forks can be delivered in arbitrary order.  Another way to
      65             :    phrase this is a parent FEC set will always be delivered before the
      66             :    child (see fd_reasm).
      67             : 
      68             :    CONCURRENCY
      69             : 
      70             :    It is possible to design Store access in a way that enables parallel
      71             :    inserts and minimizes lock contention between readers and writers.
      72             :    Store contains a fd_rwlock (read-write lock), but the name is a bit
      73             :    of a misnomer because inserts can actually be concurrent and the only
      74             :    operation that will actually need the write lock is remove, which
      75             :    will be done by Replay.  It is more appropriate to describe as an
      76             :    exclusive-shared access lock.
      77             : 
      78             :    In this design, writers (Shred tiles) hold the shared lock during
      79             :    their access.  The reader (Replay tile) also holds the shared lock
      80             :    during its access, and given both Shred and Replay will be taking out
      81             :    shared locks, they will not contend.
      82             : 
      83             :    For parallel inserts, the Store's hash function is carefully designed
      84             :    to partition the keyspace so that the same Shred tile always inserts
      85             :    to the same map slots.  This ensures map collisions always happen on
      86             :    the same Shred tile and cannot happen across tiles.  Specifically, if
      87             :    two different FEC sets hash to the same slot, it is guaranteed that
      88             :    to be the same Shred tile processing both those FEC sets.  This
      89             :    prevents a data race in which multiple Shred tiles (each with a
      90             :    handle to the shared lock) write to the same map slot.
      91             : 
      92             :    The hash function is defined as follows:
      93             :    ```
      94             :    #define MAP_KEY_HASH(key,seed) ((ulong)key->mr.ul[0]%seed + (key)->part_idx*seed)
      95             :    ```
      96             :    where `key` is a key type that includes the merkle root (32 bytes)
      97             :    and the partition index (8 bytes) that is equivalent to the Shred
      98             :    tile index doing the insertion.  seed, on initialization, is the
      99             :    number of chains/buckets in the map_chain divided by the number of
     100             :    partitions.  In effect, seed is the size of each partition.  For
     101             :    example, if the map_chain is sized to 1024, and there are 4 shred
     102             :    tiles, then the seed is 1024/4 = 256.  Then the map key hash can
     103             :    bound the chain index of each partition as such: shred tile 0 will
     104             :    write to chains 0-255, shred tile 1 will write to chains 256-511,
     105             :    shred tile 2 will write to chains 512-767, and shred tile 3 will
     106             :    write to chains 768-1023, without overlap.  The merkle root is a 32
     107             :    byte SHA-256 hash, so we can expect a fairly uniform distribution of
     108             :    hash values even after truncating to the first 8 bytes, without
     109             :    needing to introduce more randomness.  Thus we can repurpose the
     110             :    `seed` argument to be the number of partitions.
     111             : 
     112             :    Essentially, this allows for limited single-producer single-consumer
     113             :    (SPSC) concurrency, where the producer is a given Shred tile and the
     114             :    consumer is Replay tile.  The SPSC concurrency is limited in that the
     115             :    Store should 1. only be read by Replay after Shred has notified
     116             :    Replay it is time to read (ie. Shred has finished writing), and 2. be
     117             :    written to by Shred(s) in append-only fashion, so Shred never
     118             :    modifies or removes from the map.  Store is backed by fd_map_chain,
     119             :    which is not thread-safe generally, but does support this particular
     120             :    SPSC concurrency model in cases where the consumer is guaranteed to
     121             :    be lagging the producer.
     122             : 
     123             :    Analyzing fd_map_chain in gory detail, in the case of a map collision
     124             :    where Replay tile is reading an element and Shred tile inserts a new
     125             :    element to the same map slot, that new element is prepended to the
     126             :    hash chain within that slot (which modifies what the head of the
     127             :    chain points to as well as the now-previous head in the hash chain's
     128             :    `.next` field, but does not touch application data).  With fencing
     129             :    enabled (MAP_INSERT_FENCE), it is guaranteed the consumer either
     130             :    queries the head before or after the update.  If it queries before,
     131             :    that is safe, it would just check the key (if no match, iterate down
     132             :    the chain etc.)  If it queries after, it is also safe because the new
     133             :    element is guaranteed to be before the old element in the chain, so
     134             :    it would just do one more iteration.  Note the consumer should always
     135             :    use fd_store_query_const to ensure the underlying fd_map_chain is not
     136             :    modified during querying.
     137             : 
     138             :    The exception to the above is removing.  Removing requires exclusive
     139             :    access because it involves removing from fd_map_chain, which is not
     140             :    safe for shared access.  So the Replay tile should take out the
     141             :    exclusive lock.  Removing happens at most once per slot, so it is a
     142             :    relatively infrequent Store access compared to FEC queries and
     143             :    inserts (which is good because it is also the most expensive). */
     144             : 
     145             : #include "../../disco/shred/fd_fec_set.h"
     146             : #include "../../flamenco/fd_rwlock.h"
     147             : #include "../../flamenco/fd_flamenco_base.h"
     148             : #include "../../util/hist/fd_histf.h"
     149             : 
     150             : /* FD_STORE_ALIGN specifies the alignment needed for store.  ALIGN is
     151             :    double x86 cache line to mitigate various kinds of false sharing (eg.
     152             :    ACLPF adjacent cache line prefetch). */
     153             : 
     154             : #define FD_STORE_ALIGN (128UL)
     155             : 
     156             : /* FD_STORE_MAGIC is a magic number for detecting store corruption. */
     157             : 
     158          18 : #define FD_STORE_MAGIC (0xf17eda2ce75702e0UL) /* firedancer store version 0 */
     159             : 
     160             : /* fd_store_fec describes a store element (FEC set).  The pointer fields
     161             :    implement a left-child, right-sibling n-ary tree. */
     162             : 
     163             : struct __attribute__((packed)) fd_store_key {
     164             :    fd_hash_t merkle_root;
     165             :    ulong     part_idx; /* partition index of the caller of fd_store_insert */
     166             : };
     167             : typedef struct fd_store_key fd_store_key_t;
     168             : 
     169             : struct __attribute__((aligned(FD_STORE_ALIGN))) fd_store_fec {
     170             :   fd_store_key_t key;                          /* map key, merkle root of the FEC set + a partition index */
     171             :   ulong          next;                         /* reserved for internal use by fd_pool, fd_map_chain */
     172             :   uint           shred_offs[FD_FEC_SHRED_CNT]; /* shred_offs[ i ] is the total size of data shreds [0, i], up to FD_FEC_SHRED_CNT */
     173             :   ulong          data_sz;                      /* sz of the FEC set payload, guaranteed <= store->fec_data_max */
     174             :   ulong          data_gaddr;                   /* wksp gaddr of this element's data buffer */
     175             : };
     176             : typedef struct fd_store_fec fd_store_fec_t;
     177             : 
     178             : #define POOL_NAME  fd_store_pool
     179         435 : #define POOL_ELE_T fd_store_fec_t
     180             : #include "../../util/tmpl/fd_pool_para.c"
     181             : 
     182             : #define MAP_NAME               fd_store_map
     183             : #define MAP_ELE_T              fd_store_fec_t
     184             : #define MAP_KEY_T              fd_store_key_t
     185          84 : #define MAP_KEY                key
     186         282 : #define MAP_KEY_EQ(k0,k1)      (!memcmp((k0),(k1), sizeof(fd_hash_t)))
     187         396 : #define MAP_KEY_HASH(key,seed) ((ulong)key->merkle_root.ul[0]%seed + (key)->part_idx*seed) /* see top-level documentation of hash function */
     188             : #define MAP_INSERT_FENCE       1
     189             : #include "../../util/tmpl/fd_map_chain.c"
     190             : 
     191             : struct fd_store {
     192             :   ulong magic;          /* ==FD_STORE_MAGIC */
     193             :   ulong part_cnt;       /* number of partitions, also the number of writers */
     194             :   ulong fec_max;        /* max number of FEC sets that can be stored */
     195             :   ulong fec_data_max;   /* max data payload bytes per FEC set */
     196             :   ulong store_gaddr;    /* wksp gaddr of store in the backing wksp, non-zero gaddr */
     197             :   ulong map_gaddr;      /* wksp gaddr of map of fd_store_key->fd_store_fec */
     198             :   ulong pool_mem_gaddr; /* wksp gaddr of shmem_t object in pool_para */
     199             :   ulong pool_ele_gaddr; /* wksp gaddr of first ele_t object in pool_para */
     200             :   ulong data_gaddr;     /* wksp gaddr of the base of contiguous data region (fec_max * fec_data_max bytes) */
     201             : 
     202             :   fd_rwlock_t lock; /* shared-exclusive lock */
     203             : };
     204             : typedef struct fd_store fd_store_t;
     205             : 
     206             : FD_PROTOTYPES_BEGIN
     207             : 
     208             : /* Constructors */
     209             : 
     210             : /* fd_store_{align,footprint} return the required alignment and
     211             :    footprint of a memory region suitable for use as store with up to
     212             :    fec_max elements.  fec_max is a positive integer (does not need to
     213             :    be a power of two).  fec_data_max is the max data payload size per
     214             :    FEC set. */
     215             : 
     216             : FD_FN_CONST static inline ulong
     217         192 : fd_store_align( void ) {
     218         192 :   return alignof(fd_store_t);
     219         192 : }
     220             : 
     221             : FD_FN_CONST static inline ulong
     222             : fd_store_footprint( ulong fec_max,
     223          36 :                     ulong fec_data_max ) {
     224          36 :   return FD_LAYOUT_FINI(
     225          36 :     FD_LAYOUT_APPEND(
     226          36 :     FD_LAYOUT_APPEND(
     227          36 :     FD_LAYOUT_APPEND(
     228          36 :     FD_LAYOUT_APPEND(
     229          36 :     FD_LAYOUT_APPEND(
     230          36 :     FD_LAYOUT_INIT,
     231          36 :       alignof(fd_store_t),     sizeof(fd_store_t)                    ),
     232          36 :       fd_store_map_align(),    fd_store_map_footprint( fd_store_map_chain_cnt_est( fec_max ) ) ),
     233          36 :       fd_store_pool_align(),   fd_store_pool_footprint()             ),
     234          36 :       alignof(fd_store_fec_t), sizeof(fd_store_fec_t)*fec_max        ),
     235          36 :       FD_STORE_ALIGN,          fec_data_max*fec_max                  ),
     236          36 :     fd_store_align() );
     237          36 : }
     238             : 
     239             : /* fd_store_new formats an unused memory region for use as a store.
     240             :    mem is a non-NULL pointer to this region in the local address space
     241             :    with the required footprint and alignment.  fec_max is a positive
     242             :    integer (does not need to be a power of two).  fec_data_max is the
     243             :    max data bytes per FEC set. */
     244             : 
     245             : void *
     246             : fd_store_new( void * shmem,
     247             :               ulong  part_cnt,
     248             :               ulong  fec_max,
     249             :               ulong  fec_data_max );
     250             : 
     251             : /* fd_store_join joins the caller to the store.  store points to the
     252             :    first byte of the memory region backing the store in the caller's
     253             :    address space.
     254             : 
     255             :    Returns a pointer in the local address space to store on success. */
     256             : 
     257             : fd_store_t *
     258             : fd_store_join( void * store );
     259             : 
     260             : /* fd_store_leave leaves a current local join.  Returns a pointer to the
     261             :    underlying shared memory region on success and NULL on failure (logs
     262             :    details).  Reasons for failure include store is NULL. */
     263             : 
     264             : void *
     265             : fd_store_leave( fd_store_t const * store );
     266             : 
     267             : /* fd_store_delete unformats a memory region used as a store.
     268             :    Assumes only the nobody is joined to the region.  Returns a
     269             :    pointer to the underlying shared memory region or NULL if used
     270             :    obviously in error (e.g. store is obviously not a store ... logs
     271             :    details).  The ownership of the memory region is transferred to the
     272             :    caller. */
     273             : 
     274             : void *
     275             : fd_store_delete( void * shstore );
     276             : 
     277             : /* Accessors */
     278             : 
     279             : /* fd_store_wksp returns the local join to the wksp backing the store.
     280             :    The lifetime of the returned pointer is at least as long as the
     281             :    lifetime of the local join.  Assumes store is a current local
     282             :    join. */
     283             : 
     284             : FD_FN_PURE static inline fd_wksp_t *
     285        1362 : fd_store_wksp( fd_store_t const * store ) {
     286        1362 :   return (fd_wksp_t *)( ( (ulong)store ) - store->store_gaddr );
     287        1362 : }
     288             : 
     289             : /* fd_store_fec_data returns a pointer in the local address space to the
     290             :    data buffer of the given FEC set element.  Assumes store is a current
     291             :    local join and fec is in the store's pool. */
     292             : 
     293             : FD_FN_PURE static inline uchar *
     294             : fd_store_fec_data( fd_store_t const *     store,
     295          12 :                    fd_store_fec_t const * fec ) {
     296          12 :   return (uchar *)( (ulong)store - store->store_gaddr + fec->data_gaddr );
     297          12 : }
     298             : 
     299             : /* fd_store_{s}lock_{acquire,release} interface store's shared-exclusive
     300             :    lock.  See also FD_STORE_{S,X}LOCK_{BEGIN,END}. */
     301             : 
     302         225 : static inline void fd_store_slock_acquire( fd_store_t * store ) { fd_rwlock_read   ( &store->lock ); }
     303         225 : static inline void fd_store_slock_release( fd_store_t * store ) { fd_rwlock_unread ( &store->lock ); }
     304           6 : static inline void fd_store_xlock_acquire( fd_store_t * store ) { fd_rwlock_write  ( &store->lock ); }
     305           6 : static inline void fd_store_xlock_release( fd_store_t * store ) { fd_rwlock_unwrite( &store->lock ); }
     306             : 
     307             : static inline void
     308         225 : fd_store_private_slock_end( fd_store_t ** _store ) {
     309         225 :    fd_store_t * store = *_store;
     310         225 :    fd_store_slock_release( store );
     311         225 : }
     312             : 
     313         225 : #define FD_STORE_SLOCK_BEGIN(store) {                                                 \
     314         225 :   fd_store_t * _store __attribute__((cleanup(fd_store_private_slock_end))) = (store); \
     315         225 :   fd_store_slock_acquire( _store );                                                   \
     316         225 :   {
     317             : 
     318         225 : #define FD_STORE_SLOCK_END }}
     319             : 
     320             : static inline void
     321           6 : fd_store_private_xlock_end( fd_store_t ** _store ) {
     322           6 :    fd_store_t * store = *_store;
     323           6 :    fd_store_xlock_release( store );
     324           6 : }
     325             : 
     326           6 : #define FD_STORE_XLOCK_BEGIN(store) {                                                 \
     327           6 :   fd_store_t * _store __attribute__((cleanup(fd_store_private_xlock_end))) = (store); \
     328           6 :   fd_store_xlock_acquire( _store );                                                   \
     329           6 :   {
     330             : 
     331           6 : #define FD_STORE_XLOCK_END }}
     332             : 
     333             : 
     334             : /* fd_store_query queries the FEC set keyed by merkle_root.  Returns a
     335             :    pointer to the fd_store_fec_t if found, NULL otherwise.  Assumes
     336             :    caller has acquired the shared lock.
     337             : 
     338             :    IMPORTANT SAFETY TIP!  Caller should only call release the shared
     339             :    lock when they no longer retain interest in the returned pointer. */
     340             : 
     341             : fd_store_fec_t *
     342             : fd_store_query( fd_store_t      * store,
     343             :                 fd_hash_t const * merkle_root );
     344             : 
     345             : /* Operations */
     346             : 
     347             : /* fd_store_insert inserts a new FEC keyed by merkle_root into the
     348             :    store.  Returns a pointer to the inserted pool ele (fd_store_fec_t *)
     349             :    on success.  If the merkle root has previously been inserted, returns
     350             :    a pointer to the previous element with that key.  Returns NULL if the
     351             :    store is full (see fd_store_evict).
     352             : 
     353             :    Each fd_store_fec_t can hold at most store->fec_data_max bytes of data.
     354             :    Caller is responsible for copying into the data buffer region
     355             :    (accessed via fd_store_fec_data).
     356             : 
     357             :    This is a blocking operation that acquires a shared lock as part of
     358             :    its implementation.  Assumes caller has safely partitioned insertions
     359             :    if running in parallel (see top-level documentation in fd_store.h for
     360             :    details). */
     361             : 
     362             : fd_store_fec_t *
     363             : fd_store_insert( fd_store_t * store,
     364             :                  ulong        part_idx,
     365             :                  fd_hash_t  * merkle_root );
     366             : 
     367             : /* fd_store_remove removes the FEC set keyed by merkle_root.  Logs a
     368             :    warning if merkle_root is not found.
     369             : 
     370             :    This is a blocking operation that acquires an exclusive lock as part
     371             :    of its implementation.
     372             : 
     373             :    IMPORTANT SAFETY TIP!  Caller should only call fd_store_shrel or
     374             :    fd_store_exrel when they no longer retain interest in the returned
     375             :    pointer. */
     376             : 
     377             : void
     378             : fd_store_remove( fd_store_t      * store,
     379             :                  fd_hash_t const * merkle_root );
     380             : 
     381             : /* fd_store_verify returns 0 if the store is not obviously corrupt or -1
     382             :    otherwise (logs details). */
     383             : 
     384             : int
     385             : fd_store_verify( fd_store_t * store );
     386             : 
     387             : FD_PROTOTYPES_END
     388             : 
     389             : #endif /* HEADER_fd_src_disco_store_fd_store_h */

Generated by: LCOV version 1.14