-
Notifications
You must be signed in to change notification settings - Fork 66
PerconaFT Engine Architecture
The mongo storage engine api has a FAQ here but it hasn't been kept up to date much (last update was 4 months ago as of this writing).
Roughly, the api has these components that need to be implemented:
-
RecordStore: responsible for storing BSON documents accessible byRecordId. May be capped. -
SortedDataInterface: implements an index (including the_idindex), which maps index keys toRecordIds. May be unique. -
RecoveryUnit: manages a transaction for the storage engine, essentially. Lives inside anOperationContextwhich is passed around through most functions in the code. Arguably the most complicated part of the api because it has strange rules around atomicity and read-your-own-writes that don't always map directly to typical transactional semantics. However, we've got this mostly figured out, it shouldn't need much tweaking. -
StorageEngine: just provides the initialization and glue that lets the server create the above three objects
You'll also be exposed to the following types of objects:
-
RecordId: usually just an autoincrement integer (except in the case of the oplog where you have to extract it from the optime here that you get to generate inside theRecordStore. - [
KeyString'](https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/key_string.h): The serialized form of either aRecordIdor a BSON key, optionally concatenated with aRecordId. Once encoded, these can be compared withmemcmp, which is a nice feature. They can be decoded back intoRecordIds and BSON keys, but to decode an index key you also needTypeBitswhich are usually stored next to the encodedKeyString`. -
RecordData: just a chunk of bytes representing a record as stored on disk. It's a serialized BSON object, but you don't need to know that, just store it and hand it back.
We decided that having separate RecordStore and SortedDataInterface classes was a bit complicated, and a better API would match the fractal tree API more closely as just a sorted key/value dictionary. Ideally this API would also be suitable for RocksDB and WiredTiger, but it seems unlikely that they'll use this API at this point. In any case, the code in the KVDictionary layer implements most of the complicated operations in the RecordStore and SortedDataInterface (including logic around capped collections and tailable cursors, as well as persisting size and count statistics for engines which can't provide those simply themselves) in terms of a simpler API called KVDictionary.
We don't simplify the RecoveryUnit abstraction in this layer yet, but it wouldn't be a bad idea to try.
The KVDictionary layer requires the implementation of the following classes:
-
KVDictionary: this is the main sorted map data structure. You need to be able to do inserts, gets, and deletes, and provide aCursorwhich can seek to a key and go next in a given direction. Optionally, you can implement optimized versions of updates and duplicate key checks for unique indexes. -
KVEngineImpl: just needs to create newKVDictionarys andRecoveryUnits. You still need to implementRecoveryUnityourself.
The KVDictionary layer makes use of a class Slice which is just a contiguous range of bytes with some logic around memory management and borrowing. These are used for keys and values, it's pretty straightforward but be careful about returning owned Slices.
A non-persistent reference implementation of the KVDictionary API using a std::map is available, called kv_heap. It's a good place to start to see what it takes to make a basic engine, there's a bit of glue code in kv_heap_init.cpp and in the SConscript that's worth checking, and you should see how the unit tests work.
The PerconaFT engine lives in tokuft and implements the KVDictionary API, including its optional parts (updates, capped delete optimization, duplicate key checking) and makes use of some optional parts of KVDictionary (KVSizeStorer to maintain persistent stats, and VisibleIdTracker to track document visibility in capped collections).
The RecoveryUnit is a bit weird. The way we've done it is to not use child transactions, and to use a counter to track depth. We commit when the outermost commit happens, and abort if the outermost transaction ends without being committed. When we create a transaction, we peek at the lock state to see whether the client intends to write something; if they do we create a DB_SERIALIZABLE transaction so that we take read locks, if not, we create a DB_TXN_SNAPSHOT | DB_TXN_READ_ONLY transaction. We also have a way to detect when we're a secondary, and in that case we pretend we're prelocked because we know there will be no conflicts (https://github.com/percona/percona-server-mongodb/blob/b84f79c259a7db416e1b01f673c20b7e35a2ca7d/src/mongo/db/storage/tokuft/tokuft_dictionary.cpp#L82-L100).
The PerconaFT engine uses a new API layer above the ydb called ftcxx, which provides a somewhat more modern, fairly opinionated C++ API for the fractal tree. By far the most complicated thing here is the cursor API, which makes heavy use of templates to be able to inline things like comparison functions for bounds checking, and to do zero-copy but inlined callbacks for result handling, as well as automatic buffering for the bulk fetch API. Unfortunately, the structure of the cursors in the mongo storage engine can't make great use of this because they need to copy out the data, but the goal was to have something useful for more than just TokuMXse in the future, and it doesn't slow down TokuMXse. The templates are nasty, but the result is actually quite nice to work with, which you can see in the example in cursor_test.cpp.