Tuesday, 20 January 2015

Abstract Data Use Not Data Access

Common data access abstractions I've come across and been guilty of implementing myself are the likes of:

  • IDatabase
  • IPersistentStore
  • IConnection
  • IDataStore
  • IRepository

The problem is that these are not really abstractions; if anything, they just add an extra layer of indirection. One benefit of that indirection is that each concrete implementation can be substituted, which makes testing easy. Beyond that, such generic solutions introduce a whole host of problems.


Problems

Abstraction

These interfaces sit at the wrong level of abstraction, and the indirection forces developers to work at that level too. For example, a controller has no business querying your data store directly. If the same query is required somewhere else, you introduce duplication.

Big Bang Upgrade

Given such indirection offers a poor abstraction, upgrading to use a different implementation is tricky. If we assume one hundred usages of IDatabase, all of these code paths need to be migrated and tested. This can be such a huge undertaking that upgrades are often left as technical debt, never to be fulfilled.

Leaky Abstractions

In a similar manner to the previous point, these abstractions are poor because they leak implementation details, so they cannot be considered valid abstractions. Consider a SQL implementation of IDatabase: we may have a FindById method that takes an integer as the Id. If we wished to move to a NoSQL solution, the lack of an integer primary key causes problems. FindById for the NoSQL implementation may require a Guid. The interface is now broken.

Interface Bloat

Another downside of coding at the wrong level of abstraction is that the number of use cases increases constantly. What might begin as a humble interface consisting of a handful of query methods soon becomes a dumping ground for all sorts of exotic behaviour specific to niche use cases.

Lowest Common Denominator

Different data access providers have different capabilities, but in order to stay "decoupled" only core functionality present in all providers can be used. This leads to dull, limited interfaces consisting of standard data access functionality. The limited feature set can mean a poor integration. Why avoid the advanced features your library offers?

A poor abstraction that exhibits the problems above may look like this.

To retrieve a user based on the Id.
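The original listings are not reproduced here, so the following is only a minimal sketch of what such a generic interface and its usage might look like. It is written in Python purely for illustration; the method names, the `InMemoryDatabase` stand-in and the `UserController` are all assumptions, not the original code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class IDatabase(ABC):
    """Hypothetical generic data access interface shared by every consumer."""

    @abstractmethod
    def find_by_id(self, table: str, id: int) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def query(self, sql: str, *params: Any) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def insert(self, table: str, row: Dict[str, Any]) -> None: ...


class InMemoryDatabase(IDatabase):
    """Stand-in implementation so the sketch runs without a real database."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[int, Dict[str, Any]]] = {}

    def find_by_id(self, table, id):
        return self._tables.get(table, {}).get(id)

    def query(self, sql, *params):
        # The interface leaks SQL, which this store cannot honour.
        raise NotImplementedError("raw SQL makes no sense for this store")

    def insert(self, table, row):
        self._tables.setdefault(table, {})[row["id"]] = row


class UserController:
    """The controller talks to the data store directly: the wrong level."""

    def __init__(self, database: IDatabase) -> None:
        self._database = database

    def show(self, user_id: int) -> str:
        # Table name and integer key are storage details the caller must know.
        user = self._database.find_by_id("users", user_id)
        return user["name"] if user else "not found"


database = InMemoryDatabase()
database.insert("users", {"id": 1, "name": "Ada"})
print(UserController(database).show(1))
```

Note how the table name, the integer key and raw SQL all surface in the interface, and how the stand-in already cannot implement `query`.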


Solution

If we abstract how the data is used and not how the data access is performed we can avoid these pitfalls. By staying at the right level of abstraction and not leaking implementation details we end up with a different looking interface.

The concrete implementation in this example will be a SQL implementation using Dapper.NET.

The usage is similar.
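As the original listings are missing, here is a hedged sketch of the shape being described. It is in Python rather than C#, and it swaps Dapper.NET for the standard library's `sqlite3` driver so that it runs on its own; `UserId`, `User`, `UserController` and the table layout are assumptions made for the example.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserId:
    """Wraps however users are identified; consumers never see the raw key."""
    value: str


@dataclass(frozen=True)
class User:
    id: UserId
    name: str


class IUserQuery(ABC):
    """Abstracts one use of the data: looking up a user."""

    @abstractmethod
    def find(self, user_id: UserId) -> Optional[User]: ...


class SqlUserQuery(IUserQuery):
    """SQL-backed implementation that uses the driver directly, unwrapped."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def find(self, user_id):
        # The integer key is a detail of this implementation only.
        row = self._connection.execute(
            "SELECT id, name FROM users WHERE id = ?", (int(user_id.value),)
        ).fetchone()
        return User(UserId(str(row[0])), row[1]) if row else None


class UserController:
    """Depends on the use case, not on how the data is stored."""

    def __init__(self, users: IUserQuery) -> None:
        self._users = users

    def show(self, user_id: UserId) -> str:
        user = self._users.find(user_id)
        return user.name if user else "not found"


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
connection.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
controller = UserController(SqlUserQuery(connection))
print(controller.show(UserId("1")))
```

A document-store implementation could parse `UserId.value` as a GUID instead, and `UserController` would not change.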

The key point here is that we solve the problems of the "generic" solution.

  • IUserQuery is a better abstraction because it allows selective upgrades. This use case will have a limited number of consumers, so updating a handful of references is easier than updating every data access component in one go.
  • The fact that we use a SQL database as our store is hidden, so no details leak. UserId encapsulates how we identify users; if we were to switch to a NoSQL store, our consumers would be unaware.
  • One of the biggest benefits is the ability to use our third party library to its fullest. Rather than wrapping Dapper we use it directly, taking advantage of any special features it offers instead of conforming to a limited subset of an API.

Aren't We Introducing Lots of Classes?

More, but not "lots". This is a common complaint when the above solution is proposed, but given the benefits the trade-off is certainly worth it. Additionally, each query or repository that is implemented in this manner is easier to develop and test due to closer adherence to the Single Responsibility Principle.

How Do We Unit Test SqlUserQuery?

You don't. In this example we make use of the third party library directly. The benefits discussed earlier justify this, though it means unit testing SqlUserQuery is not possible. Instead, apply integration tests against a real data store. The rest of the system will be coded against the abstraction, so unit tests can be applied as normal there. Any attempt to "abstract" or wrap the third party library will remove many of the benefits of this solution, so don't worry about it.
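To illustrate the second half of that answer, here is a small Python sketch of unit testing a consumer against the abstraction. `UserGreeter` and `StubUserQuery` are hypothetical names invented for the example; the SQL-backed query itself would be covered separately by an integration test against a real store, which is not shown.

```python
from typing import Dict, Optional


class UserGreeter:
    """Hypothetical consumer coded against the query abstraction only."""

    def __init__(self, users) -> None:
        self._users = users  # anything exposing find(user_id)

    def greet(self, user_id: str) -> str:
        name = self._users.find(user_id)
        return f"Hello, {name}!" if name else "Hello, stranger!"


class StubUserQuery:
    """In-memory test double; no database or third party library involved."""

    def __init__(self, names: Dict[str, str]) -> None:
        self._names = names

    def find(self, user_id: str) -> Optional[str]:
        return self._names.get(user_id)


def test_greets_known_user():
    greeter = UserGreeter(StubUserQuery({"42": "Ada"}))
    assert greeter.greet("42") == "Hello, Ada!"


def test_greets_unknown_user():
    greeter = UserGreeter(StubUserQuery({}))
    assert greeter.greet("7") == "Hello, stranger!"


test_greets_known_user()
test_greets_unknown_user()
```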


References

For a great discussion on this topic, check out a talk by Kijana Woodard for more examples.

Tuesday, 6 January 2015

Caching

The naive approach to implementing caching is to just store everything in an in-memory collection such as a hashtable. After all, it works on my machine.

I've worked on systems in the past that used this technique but:

  • Bring in two processes and this falls apart
  • No Time to Live (TTL)
  • No cache eviction, memory will grow until it crashes the process

This sort of caching meant the system needed daily restarts because each worker process ate up more and more RAM. At the time I didn't realise this was why daily restarts were required. The restarts were automated, so the team just sort of forgot about the problem after a while. This never felt right.

"Improper use of caching is the major cause of memory leaks, which turn into horrors like daily server restarts" - @mtnygard in Release It!.

Scale this system up, and daily becomes twice daily and so on. In a global market where software shouldn't be constrained by time zones or "working hours" this is wrong.

Solutions

There are numerous easy ways to solve these problems depending on the application in question.

Don't Roll your Own, Try a Third Party

Easy. Just use an off the shelf solution that solves the problems above plus includes a whole host of additional features.

Use your Standard Library

For example .NET includes caching functionality within the System.Runtime.Caching namespace. While there are limitations to this, it will work for some scenarios and solves some of the problems above.
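Other platforms offer something similar. As an illustrative sketch in Python rather than .NET, `functools.lru_cache` from the standard library gives bounded, least-recently-used eviction. The `load_profile` function and the `calls` list are made up for the example.

```python
from functools import lru_cache

calls = []  # records every cache miss, purely for demonstration


@lru_cache(maxsize=2)  # at most two entries are ever held
def load_profile(user_id: int) -> str:
    calls.append(user_id)
    return f"profile-{user_id}"


load_profile(1)  # miss
load_profile(1)  # hit, served from the cache
load_profile(2)  # miss
load_profile(3)  # miss, evicts the least recently used entry (1)
load_profile(1)  # miss again, because 1 was evicted

print(calls)  # [1, 2, 3, 1]
print(load_profile.cache_info())
```

This addresses unbounded growth, but it shares the limitations mentioned above: there is no TTL and the cache is local to one process.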

Soft References

I've overlooked soft references in the past but for caching they can be incredibly useful. Use soft references for anything that isn't important or that can be recalculated. An example would be content displayed within an MVC view using the web server's session. Here, if each item stored is a soft reference, we gain some benefits.

  • Stops your web server running out of memory - references will be reclaimed if memory starts to become a bottleneck.
  • Greater scalability with the same amount of memory - great for a sudden spike in traffic.

A web server's session being full of references that won't expire for a set period is a common cause of downtime. If soft references are used all we need to do is perform a simple conditional check prior to retrieval from the session. Different languages have similar features, e.g. Weak References in .NET.
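The conditional check might look something like the sketch below. It uses Python's `weakref.WeakValueDictionary`, which is an assumption for illustration and behaves differently from a true soft reference: in CPython an entry disappears as soon as nothing else holds the object, rather than only under memory pressure. `RenderedView`, `session_cache` and `get_view` are hypothetical names.

```python
import weakref


class RenderedView:
    """Hypothetical cached item that is cheap enough to rebuild."""

    def __init__(self, html: str) -> None:
        self.html = html


# Values are held weakly, so the cache never keeps an item alive by itself.
session_cache = weakref.WeakValueDictionary()


def get_view(key: str, render) -> RenderedView:
    view = session_cache.get(key)  # None if never stored or already reclaimed
    if view is None:
        view = render()  # recalculate on demand
        session_cache[key] = view
    return view


home = get_view("home", lambda: RenderedView("<h1>Home</h1>"))
print(get_view("home", lambda: RenderedView("<h1>Rebuilt</h1>")).html)
```

While `home` is still referenced the second call returns the cached view; once it is released, the next call simply renders again.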

Pre-Computation

Caching isn't always the best solution; in some cases pre-computation can be much easier and offer better performance. With a cache, at least some users will experience a slow response until the cache is warm, whereas pre-computation can avoid this completely. I will expand on pre-computation in a future post.

Reference

More information can be found in the excellent book Release It!

Saturday, 27 December 2014

Pair Programming vs Pairing

I'm a fan of pair programming. I owe a lot of my improvement early on in my career to this practice. I define pair programming as two developers working on a task using one or more machines at the same time.

I have had some excellent pair programming sessions. I can even remember some of them in great detail. In these I came away having learned something new, solved a difficult problem, or just generally had a fun time.

On the other hand I've also had some awful experiences, which unfortunately I can still remember. Here my partner wouldn't play the role of the driver or navigator correctly, wouldn't be engaged, or just generally didn't get into the flow of pair programming.

Teams mandating 100% pair programming is bad. Some tasks don't need two developers working on them concurrently. Here pairing should be used.

Pairing is two developers working together to solve a task, but doing so separately. During pairing, regular communication, design sessions and feedback should be used. You can even join up to pair program on complex areas. The difference is that unlike pair programming you don't need to have two developers working on the same part of a task at all times. Pair programming and pairing are two very distinct concepts.

The key takeaway here is to know when to use pairing over pair programming and vice versa. Both have their merits and should be applied in the correct context.

Tuesday, 23 December 2014

A Unit is Not Always a Method or Class

Part three of my Three Steps to Code Quality via TDD series. The most important concept when coupled with the previous two points - not every unit will relate to a method or class.


Most introductions to TDD use simple examples. Even the excellent TDD by Example uses what Domain Driven Design would call a value object. Most introductory articles on the Internet suffer the same fate. While these are great for demonstrations, they don't relate to what most developers need to code on a day to day basis. It's around this point that people proclaim that automated testing (even after the fact) is a waste of time.

One of my biggest revelations with TDD was that each unit does not need to equate to a single method or class. For a long time I followed what others did. Each collaborator would be injected and replaced with a test double. Each class would have a corresponding test file. However as I have stated in the introduction, this leads to problems.

We should test units of behaviour, not units of implementation. Given we know we should be using as few dependencies as possible, and we know we should limit visibility, each test should be simple to write. As part of the refactor step if we choose to introduce a new class that is fine. There is no need in most cases to extract this and introduce a test double. Every time this is done we tie the test closer and closer to the implementation details. Every class having a corresponding test file is wrong.

By testing a unit of behaviour we can chop and change the internals of the system under test without breaking anything. This allows the merciless refactoring automated testing advertises as a benefit.

Aren't you describing integration testing?

No. Tests should be isolated as I've documented before, but there is nothing stating they should be isolated from the components they work with. If we isolate at the method or class level we make testing and refactoring much harder. Due to the term "unit" being so closely linked with a class or method, I like the naming convention Google use for their tests - small, medium and large.

Additionally an excellent article from Martin Fowler on the subject of unit testing introduces two new terms, solitary and sociable tests. Neither one style alone works so the type of test you write should be based on context. Unfortunately the industry seems to be fixated on solitary testing.

Sociable Tests

Work great at the aggregate root level. Does the object do what we expect it to? It can use zero or many collaborators behind the scenes but these are implementation details. Here we would limit the use of test doubles as much as possible but still have fast, isolated tests. As a generalization, most automated testing should fall into this category, as the core domain of your application is likely to contain the most logic.
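A minimal sketch of a sociable test, in Python for illustration. `Basket` and `DiscountPolicy` are hypothetical; the point is that the test only exercises `Basket`, and the collaborator is real rather than a test double.

```python
class DiscountPolicy:
    """Internal collaborator; an implementation detail of Basket."""

    def discount_for(self, subtotal: int) -> int:
        # 10% off once the basket reaches 100.
        return subtotal // 10 if subtotal >= 100 else 0


class Basket:
    """Hypothetical aggregate root; the unit of behaviour under test."""

    def __init__(self) -> None:
        self._prices = []
        self._policy = DiscountPolicy()

    def add(self, price: int) -> None:
        self._prices.append(price)

    def total(self) -> int:
        subtotal = sum(self._prices)
        return subtotal - self._policy.discount_for(subtotal)


def test_large_baskets_are_discounted():
    basket = Basket()
    basket.add(60)
    basket.add(40)
    assert basket.total() == 90


def test_small_baskets_pay_full_price():
    basket = Basket()
    basket.add(50)
    assert basket.total() == 50


test_large_baskets_are_discounted()
test_small_baskets_pay_full_price()
```

If `DiscountPolicy` were later inlined or split in two, neither test would need to change.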

Solitary Tests

Useful at the adapter or system edge. For example, does the controller invoke the correct application service? We don't care how it works behind the scenes. Anything beyond this service would be a test double. These sorts of tests are more closely coupled to implementation details so should be used sparingly.
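By contrast, a solitary test at the edge could be sketched like this, again in Python with the standard `unittest.mock` module. `RegisterUserController` and its `register` service method are hypothetical names for the example.

```python
from unittest.mock import Mock


class RegisterUserController:
    """Hypothetical adapter at the system edge."""

    def __init__(self, registration_service) -> None:
        self._service = registration_service

    def post(self, form: dict) -> int:
        self._service.register(form["email"])
        return 201  # HTTP Created


def test_controller_invokes_registration_service():
    service = Mock()  # test double standing in for the application service
    controller = RegisterUserController(service)

    status = controller.post({"email": "ada@example.com"})

    service.register.assert_called_once_with("ada@example.com")
    assert status == 201


test_controller_invokes_registration_service()
```

The assertion on `register` ties the test to how the controller is implemented, which is why this style is best kept to the edges.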

Doesn't this lead to huge tests?

No, not really. As you limit implementation details leaking into the public API, the use of test doubles will reduce. This will shrink test setup and in most cases improve readability. Large tests shouldn't be a concern with this style of testing. You will not reduce the number of tests required, but you will find them to be much more stable and resilient than before.