Latest-Versioned/Marker Patterns and MVCC

Getting the basics right is obviously important. If you’re moving beyond what Andrew Wilson would call get-put man then you should be thinking about versioning your objects. That means making your data immutable. Doing this has a number of benefits:

  1. Versioning provides a historic record of changes.
  2. By linking versioning with the wall-clock / business times (i.e. bi-temporal) views of the system at previous points in time can be recomponsed. This is important for providing consistent views over your data.
  3. Versioning allows concurrency to be managed through Multi-Version Concurrency Control (MVCC)

Implementing Versioning

However simply adding versions to your objects (more precisely your object key) has the downside that you can no longer look up the value via it’s business key: you must know the business key as well as the version of the object that you want.

Key = [Business Key][Version]

In Coherence accessing objects via their key directly is far more performant than doing a query (see The Fallacy of Linear Scalability) so it is preferable to keep the latest version of the object available via its business key alone. There are two common approaches to solving this problem: The Latest/Versioned pattern and the Latest Version Marker pattern.

Approach 1: Latest and Versioned Caches

The first approach is to define two caches for every object. The Latest… cache and the Versioned… cache. The key of the ‘latest’ cache is simply the business key:

Latest Cache Key = [Business Key]

This cache only ever contains the latest object. The ‘versioned’ cache contains all versions of the object with a, usually monotonically incrementing version embedded in the key:

Versioned Cache Key = [Business Key][Version]

Writes must be directed at the ‘Latest’ cache and a Coherence Trigger is used to copy the object reference to the ‘Versioned’ cache adding the version onto the key as it does so. This is demonstrated in the first figure opposite.

The disadvantage of this approach is a memory inefficiency arising because the  latest object exists in both Latest and Versioned caches. When the object is written the same reference can be used to save space, however the backup copies in each cache will be different instances and, should a node be lost, and  process of recreating the primary from the backup copy will create new instances by default further eating memory. It is therefore advisable to use the LatestMarker pattern below when memory is a concern. The advantage of this approach is that it reduces the number of records in the latest caches which makes filter operations faster when they operate only on ‘Latest’ data (a common use case in most applications).

Checklist:

Approach 2: Versioned Cache Only With a Latest Version Marker

A second approach to solving the same problem is to only use a single cache with the key format:

Key = [Business Key][Version]

but specifying that the latest version of an object has a special version marker:

KeyLatest = [Business Key][LatestVersionMarker]

As clients are aware of the LatestVersionMarker (for example -1 is common) they can always access the latest value directly by calling:

cache.get([businessKey][-1])

This approach does not suffer from the issues of duplication  associated with separate Latest and Versioned caches but has the disadvantage that versioned data is in the same cache as latest data, marginally slowing down filters. Just reiterating that again: in this pattern there is only one copy of the latest object. The one with the latest marker. This is different to the latest/versioned pattern where the latest object will exist in both caches (so twice) so that the versioned cache can contain all versions of that object.

Checklist:

Implementing the trigger to avoid reentrancy issues

The below code outlines one  mechanism for moving objects (in this case for the Latest/Versioned pattern) from one cache to the other using direct backing map access.

public void copyObjectToVersionedCacheAddingVersion(MapTrigger.Entry entry) {
   // I'm assuming that you are tracking the version, and incrementing it, in your object
   // Also note that it's more efficient to just take the version out rather than deserialise
   // the whole object but this way is more succinct
   MyValue value = (MyValue)entry.getValue();
   MyKey versionedKey = (MyKey)value.getKey();

   BinaryEntry binaryEntry = (BinaryEntry)entry;
   Binary binaryValue = binaryEntry.getBinaryValue();

   Map versionedCacheBackingMap = binaryEntry.getContext().getBackingMap("VersionedCacheName");
   versionedCacheBackingMap.put(toBinary(versionedKey), binaryValue);
}

If you are using Latest-Marker it’s essentially the same but with a marker key.

Latest/Versioned or Latest-Marker – which to choose?

Both patterns are good. We have use both extensively in my current project. Latest marker is probably best overall due to the aforementioned storage issues with Latest-Versioned. However if you are likely to make most use of the ‘Latest’ view, and will be scanning without the use of indexes, Latest-Versioned can offer performance benefits. It also feels simpler when you use it, as from the outside things are what they are.

These patterns are really important to use. They form the basis for many of the more advanced use cases. You need one of these to do MVCC, Snapshotting etc.

Note that affinity (Key Association) must be used to ensure that  the versioning process is entirely local to the JVM doing the write.

Check out Andy Coates’ neat way for doing it here.

MVCC & Snapshotting

One of the main reasons for implementing these patterns is to allow more advanced features of MVCC and Snapshotting. MVCC is a concurrency control mechanism which is based on your objects being versioned. It is useful where two clients mutated the same version of the object and you want one to get a failure (and one write to succeed). This is very simple to implement in Coherence by including the object version in the write and have a trigger ensure that the version of the object being updated equals on the in the cache, otherwise exception.

Snapshotting is a more complex topic because it requires time so I’ve covered in a separate post here.

Related Posts

  1. Great post by Andy Coates on implementing this pattern with a bit more style [here]
  2. Use normalisation to reduce the versioning burden through the application of Star Schemas and Connected Replication [link]
  3. Performing cross cache joins in Coherence [link]
  4. Understanding problems of reentrancy in Coherence [link]

Posted on October 19th, 2011 in Coherence


  1. Ashwin Jayaprakash October 19th, 2011
    21:47 GMT

    Your version map seems like a per-key queue of fixed size. How do you purge old versions beyond, say 3 revisions/versions?

    Do you do a findMaxVersion(key) and findMinVersion(key), then if max > min + 3, then remove(key + min)? That looks like a lot of queries.

    Redis has readymade constructs for such things – http://redis.io/commands

    Ashwin.

  2. ben October 20th, 2011
    8:06 GMT

    Hi Ashwin. It’s actually just a Map. Versions can be expired based on time, size or some custom policy.

  3. Alexey Ragozin November 1st, 2011
    7:40 GMT

    Hi,

    Accidently, I’m working on exactly same problem right now.
    And I can suggest 3rd aproach – using special kind of indexing structure (via Coherence index API) to find right version of data.
    You can read more details here Monday, October 31, 2011
    Data Grid Pattern – Time series index for managing versioned data

    (source code is also available)

    Reagards,
    Alexey

  4. ben November 1st, 2011
    8:58 GMT

    Thanks Alexey. That is a very useful post. I have a post about snapshotting and timestamping in the pipeline and your method will be very relevant to that so I will incorporate it there.

    I see your solution as slightly different to the two mentioned here though. The ones here are simply designed to allow direct key based access to the latest version – this being the version most users are interested in most of the time. Key based access has some important benefits over regular filters:

    – It is faster and more scalable (as it is directed rather than hitting each node). Although this problem can be partially addressed with PartitionFilters.
    – It allows you to use near caching – something not available to filters.

    As I said though – for time based queries which I’ll be posting on and discussing at the SIG your approach looks ace.

  5. Rjw November 16th, 2011
    0:54 GMT

    I wonder what the cost of
    A) getting the key owner of the non versiony part of the key
    B) doing an invocation to the owner
    C) once on the owner, poking into your versioned index stashed somewhere accessible by your indexawareextractor, to find the latest key. ( everything written in with immutable version numbers)
    D) get (local unless partition redistribution happens before this point) – return that as the result of your invocable.

    All depends on the cost of a single invocation, and making sure no pointless deser/ser goes on.

    Advantages would be: no second copy, no need to do a trigger that causes double the backup traffic/ events, immutable versions (ie near cacheable with no invalidation), should easily scale to get all, and pretty much zero overhead for as-of versioning ( the invocable can know what the highest version acceptable is) if you do clocked/non-local versions in the keys.

    But would need to try it out… 🙂

  6. Alexey Ragozin November 16th, 2011
    19:08 GMT

    You do not need to do it that complex way 🙂

    All you need just say


    // get latest version for key k
    cache.entrySet(tsHelper.floor(k, null));

    See example on http://code.google.com/p/gridkit/wiki/TimeSeriesIndex

    Underhood, helper will create TimeSeriesFilter wrapped into KeyAssociationFilter.

    KeyAssociationFilter will ensure requiest to be send to key owner only.

    This is almost as efficient and cache.get().

  7. Rjw November 16th, 2011
    19:51 GMT

    Yep, that looks pretty good…

Have your say

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>




Safari hates me
IMPORTANT! To be able to proceed, you need to solve the following simple problem (so we know that you are a human) :-)

Add the numbers ( 5 + 3 ) and SUBTRACT two ?
Please leave these two fields as-is:

Talks (View on YouTube)