I do a lot of programming in the large and spend far too much time, according to some, thinking about programming in the large concepts such as SOA and CEP, but from time to time I like to keep it real and code something detailed and low(er) level. Recently I've been doing a little investigation around compression and so I ended up reading up on Huffman coding. Huffman coding is a way of taking a fixed length encoding and turning it into a shorter variable length encoding.
Back to basics
That's a little vague and academic so I'm going to assume little knowledge of how computers store data and work my way up and see if I can explain it properly.
Computer storage is made of switches, lots and lots of on/off switches. A switch in storage terms is called a bit and is the smallest possible unit of storage. Switches can be on or off and computers represent that on or off as 1 or 0 respectively. That's why the classic power switch on most devices is a zero with a one overlaid, as in figure 1: 0 and 1 mean off and on to engineers. If I only have 0s and 1s then I'm going to have to find a way of representing numbers bigger than 1 using them.
Getting to second base
Classically we use a numbering system with ten digits, zero through to nine, which mathematicians would call base 10 or decimal. However, as computers only have two digits they use a counting system in base 2, which is referred to as binary. To count in binary we simply increment the rightmost digit until it reaches its digit limit (1), then we increment the digit to its left whilst resetting the rightmost to zero; if that digit is also at its limit we move again to the left and reset until we find one we can increment. This is exactly the algorithm we use for normal counting. Therefore we have:

0 (zero)
1 (one)
10 (two)
11 (three)
100 (four)
101 (five)
110 (six)
111 (seven)
and so on.
Interestingly and logically, where in decimal the digits represent units, tens, hundreds and so on, in binary they represent 1, 2, 4, 8, doubling for each digit to the left. Thus 101 (binary) is
(1 x 1) + (2 x 0) + (4 x 1) = 5 (decimal)
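If it helps to see that as code, here is a tiny sketch (my own illustration, not from the original post) that reads a binary string right to left, doubling the place value as it goes:

public class BinaryToDecimal {
    static int toDecimal(String binary) {
        int value = 0;
        int placeValue = 1; // 1, 2, 4, 8, ... doubling per digit
        for (int i = binary.length() - 1; i >= 0; i--) {
            if (binary.charAt(i) == '1') {
                value += placeValue;
            }
            placeValue *= 2;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("101")); // prints 5
    }
}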
Anyway, enough about bases, I think we have that covered off. What we need to understand is that you can represent numbers using sequences of zeros and ones on computers.
Computers encode many things to files: text, images, video, CAD models and many more. Let's choose the simplest thing possible, though, for our example: text. To have a text file we must have some way of mapping numbers to/from letters, what computer scientists call character encodings. Whilst conceptually simple, these things strike fear into programmers who are smart enough to understand what they are and what a mess we've created for ourselves. I'm going to dodge the bullet of fully explaining why they are a pain and use the simplest one, US-ASCII, the seemingly tautological United States-American Standard Code for Information Interchange, as my example character encoding.
US-ASCII defines a mapping of numbers to characters. Each character is represented by an 8 binary digit number known as a byte, remember a binary digit is called a bit, so a byte is an 8-bit number. The reason numbers are stored as 8-bits is so that they can be read off storage easily. Remember there are only zeros and ones on computers. Therefore to know where a number starts and ends there has to be some way of splitting them back apart. A file contains for example:
0100111101001011

Which is two bytes of data, namely 01001111 and 01001011, or in decimal 79 and 75, or in US-ASCII encoding "OK".
ASCII defines which numbers represent which characters, for example 'A' is 65 (decimal), that is 01000001 (binary). ASCII itself only defines 128 characters (it is really a 7-bit code), but as each character is stored in an 8-bit byte there is room for 256 values (i.e. 00000000 (0) through to 11111111 (255)). Full mappings of numbers to characters for ASCII are available all over the web.
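As a quick check (my own illustration, not part of the original post), here is how you might print the bytes behind the "OK" example above in Java:

import java.nio.charset.StandardCharsets;

public class AsciiBytes {
    public static void main(String[] args) {
        byte[] bytes = "OK".getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            // Pad each byte out to its full 8 binary digits
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF))
                                .replace(' ', '0');
            System.out.println(b + " -> " + bits); // 79 -> 01001111, 75 -> 01001011
        }
    }
}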
You might be thinking why bother with the leading 0s on each byte in the "OK" example above; well if I did that I would have

10011111001011
and how on earth do I know which digits belong to which number? It could be 10011111, 001011 or it could be 100111, 11001011; indeed there are 13 possibilities assuming I know there are two numbers represented, otherwise it could be up to 14 numbers or just 1. Basically I'm screwed if I don't know the length of the numbers in bits. I can't add anything to separate them because all I have is zeros and ones, and I can't use a zero or one to separate them because then I won't have anything left to count with; you can't count in base 1. Therefore in order to put numbers in storage I need to predetermine a length for those numbers in bits and stick to it. The general building block of information in computer science is the 8-bit byte and that is why.
To summarize, we have numbers represented by 8-bit bytes and we have a mapping of those numbers to characters, assuming we are using US-ASCII. However, forcing all numbers to be 8 bits seems awfully wasteful. Let's suppose I want to store the sequence of 20 characters:
Well in this case I only need two numbers to do so, so I could just store them using a basic character binary encoding of 0=a and 1=b. Therefore rather than using 8 x 20 = 160 bits of storage I could just use 20 bits. To spell it out I could use
This is a huge storage saving in percentage terms (20 bits instead of 160 is a saving of 87.5%). It's important to point out that an encoding maps one set of symbols to another: in the case of ASCII it is 8-bit bytes to common western characters, for Huffman it is variable length binary sequences to 8-bit bytes. Therefore what's actually happening here is:
a (ASCII) -> 01100001 (8-bit byte) -> 0 (Huffman)
Now what if I have 3 characters in my sequence:
This presents a problem for the encoding technique above, because I have no obvious way to encode the 'c'. If I use the next number in the binary sequence, namely 10, for c, then a sequence such as cba:

1010

has ambiguous meaning, it could be baba, cc, bac or cba, not much use. However, and here's the key insight, suppose I only use zero/one sequences that don't appear as the beginning of any other sequence to represent the 8-bit numbers. Therefore I pick 0=a, and because I can't use a number starting with 0 after picking 0=a I must use something like 10=b, leaving 11=c. I can therefore unambiguously encode cba as:

11100

There is nothing else it can be but cba. I don't need to know the bit length of each number to decode. To work through the example: the first digit is 1; this isn't a character, nothing is encoded as 1, so I add the next bit to it and get 11. Well, that's c, and there is no other number starting with 11, so I can unambiguously decode it. Similarly with 10 and 0.
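Here's a small sketch (again my own illustration) of that decoding process in Java: because no code is the beginning of another, we can read bits until they match exactly one code, emit that character and carry on:

import java.util.HashMap;
import java.util.Map;

public class PrefixDecode {
    public static void main(String[] args) {
        Map<String, Character> codes = new HashMap<>();
        codes.put("0", 'a');
        codes.put("10", 'b');
        codes.put("11", 'c');

        String bits = "11100"; // cba encoded as above
        StringBuilder current = new StringBuilder();
        StringBuilder decoded = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character symbol = codes.get(current.toString());
            if (symbol != null) {       // the bits read so far match exactly one code
                decoded.append(symbol);
                current.setLength(0);   // start collecting the next code
            }
        }
        System.out.println(decoded); // prints cba
    }
}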
The aim of Huffman coding is to create a shorter encoding than the original fixed width encoding. Indeed it is a little more than that, it is to get the shortest possible unambiguous encoding. How do we find this? Well, first a few things about Huffman coding. The genius of the algorithm is that it is simple, it will always find the optimum (shortest possible) encoding, and this encoding will always be less than or equal to the length of the equivalent 8-bit encoding.
Firstly I'll describe the algorithm, then I'll do an example. As we know, US-ASCII is simply a sequence of 8-bit bytes, as is every other file on your computer (on the vast majority of modern computers). Therefore we have a sequence of bytes.
Huffman encodings use trees, Huffman trees, to describe their encoding. A Huffman tree is a binary tree, in that each branch gives way to 2 or fewer branches.
So the algorithm:
Count the number of occurrences of each byte in the sequence and put them in a list
Sort that list in ascending order of frequency
Pick the two lowest frequency bytes off the top of the list and add them to the tree as two branches on the trunk
Add the frequencies of those two nodes together and add that part of the tree back to the list and sort the list again in ascending order of frequencies
If there is more than one item left in the list then go to step 3, otherwise you are done, the last item in the list is the completed tree
In this example we are going to encode the string: "Mississippi hippies".
Here's the frequency table:

M : 1
_ : 1
h : 1
e : 1
p : 4
s : 5
i : 6
Note I've substituted the underscore character "_" for [space].
So we take the two smallest values off the top and create our first part of the tree. Hopefully the notation is self explanatory: the tilde (~) means that it is just a node and the letter before it is simply an identifier; the leaves have the character they represent and the frequencies:
We add this back to the list
Then the next two lowest frequency items
Adding it back in gives
Then the next two (which are both sub-trees) gives
There we have it, our final Huffman tree. How do we use it? Well in order to find the encoding for each letter we travel down the tree until we get to it. When we go left we add a 0, when we go right we add a 1. For example to encode 'e', which is a rare character in our input string, we get the following:
Or for 'i' which is very common in the input sequence we get
a much shorter encoding. From this tree we can build an encoding table as follows:
Thus we can encode the original string "Mississippi hippies" into:
Which is 46 bits rather than the (19 x 8) 152 bits of the original: i, s and p each get a 2-bit code and the four rare characters each get a 4-bit code, so (6 + 5 + 4) x 2 + 4 x 4 = 46. We've compressed 19 bytes into 6 bytes (the last byte is zero padded).
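For the programmers, here is a minimal sketch of the whole thing in Java (my own illustration, not the code behind the figures above; a priority queue stands in for the sorted list, and because ties between equal frequencies may break differently the individual codes may not match the tree above exactly, although the total always comes out at 46 bits for this input):

import java.util.*;

public class HuffmanSketch {

    // A tree node: leaves carry a character, internal nodes carry only a
    // combined frequency and two children.
    static class Node {
        final Character symbol;   // null for internal nodes
        final int frequency;
        final Node left, right;

        Node(char symbol, int frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = null;
            this.frequency = left.frequency + right.frequency;
            this.left = left;
            this.right = right;
        }
    }

    // Steps 1-5: count occurrences, keep the list sorted by frequency (the
    // priority queue does the sorting), repeatedly merge the two
    // lowest-frequency entries until a single tree remains.
    static Node buildTree(String input) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : input.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.frequency));
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        while (queue.size() > 1) {
            Node lowest = queue.poll();
            Node nextLowest = queue.poll();
            queue.add(new Node(lowest, nextLowest));
        }
        return queue.poll();
    }

    // Walk the tree: left adds '0', right adds '1'; a leaf yields a code.
    static void buildCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node.symbol != null) {
            codes.put(node.symbol, prefix);
            return;
        }
        buildCodes(node.left, prefix + "0", codes);
        buildCodes(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        String input = "Mississippi hippies";
        Node root = buildTree(input);
        Map<Character, String> codes = new HashMap<>();
        buildCodes(root, "", codes);

        StringBuilder encoded = new StringBuilder();
        for (char c : input.toCharArray()) {
            encoded.append(codes.get(c));
        }
        System.out.println(codes);                                        // the encoding table
        System.out.println(encoded + " (" + encoded.length() + " bits)"); // 46 bits
    }
}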
A couple of notes:
For input that uses almost every possible byte value, with roughly equal frequency of those bytes, compression will be very limited; large movie files for instance.
For some inputs certain bytes will encode to codes longer than 8 bits, but the savings from the shorter codes will at least offset those
A Huffman tree can be stored very efficiently using a fixed-length format: take the left then the right, move down to the left and repeat until the tree is traversed. When you hit nodes in the tree that are leaves, output 1 followed by the value (not the frequency); when they are nodes that are simply parents of others, output 0 followed by nothing. A sketch of this follows below.
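Here is what that might look like in code (my own sketch of the scheme just described, using the a/b/c tree from earlier as the example; the 8-bit values are the ASCII codes for the letters):

public class TreeSerialiser {
    static class Node {
        final Integer value; // the byte value for a leaf, null for a parent node
        final Node left, right;
        Node(Integer value, Node left, Node right) {
            this.value = value; this.left = left; this.right = right;
        }
    }

    // Walk the tree, left then right: output 0 for a node that is just a
    // parent, and 1 followed by the 8-bit value for a leaf.
    static void serialise(Node node, StringBuilder out) {
        if (node.value != null) {
            out.append('1');
            out.append(String.format("%8s", Integer.toBinaryString(node.value))
                             .replace(' ', '0'));
        } else {
            out.append('0');
            serialise(node.left, out);
            serialise(node.right, out);
        }
    }

    public static void main(String[] args) {
        // The earlier a=0, b=10, c=11 tree.
        Node tree = new Node(null,
                new Node((int) 'a', null, null),
                new Node(null,
                        new Node((int) 'b', null, null),
                        new Node((int) 'c', null, null)));
        StringBuilder out = new StringBuilder();
        serialise(tree, out);
        System.out.println(out); // 0, then 1+01100001, 0, 1+01100010, 1+01100011
    }
}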
This was a pretty fiddly blog post, please let me know if I've made mistakes in the comments or email.
When people start to service orient their organisation they often focus on exposing APIs, and those APIs invariably solely or mostly focus on method calls, what I and others often refer to as RPC. This is great and brings huge benefit, but it does miss a huge opportunity: being able to observe and react to what's happening in your organisation.
In order to be able to observe, and therefore react to, what's happening in the services that make up your organisation you need to add events to your services. What do I mean by events? To start with let's leave technology aside and think of the business problem you might be trying to solve. As an example let's take a retail bank that offers current (checking) accounts. To model this account appropriately there are things that should be modelled as RPC and things that should be modelled as events. If a customer uses an ATM to check their balance this should be RPC: the ATM will call the account service to get the current balance and display it to the customer. There is little point in doing this as a business event because you need the output of the customer asking for their balance to continue.
Now suppose the customer wants to make a withdrawal. This would cause an RPC type invocation (i.e. debit £100) and an event (i.e. user withdrawal occurred on account id 5123). The RPC call allows us to perform a blocking operation to check there are sufficient funds, make the deduction and inform the ATM that it can dispense; the event will be published for interested parties to be informed that something they might be interested in has happened. Who might be interested in this event? Well it could be an analytics package that wants to keep track of which ATMs are popular, or maybe a complex anti-fraud system figuring out suspicious patterns of withdrawals.
The great thing about events is that the system raising them doesn't need to understand how they are used; it can simply raise them and go about its business. In the example above suppose you were asked to add a feature where customers could have details of their withdrawals emailed to them. Rather than go in and change the mission critical code around financial transactions you could set up a service that listens to these events and, when it sees one, emails the customer. The team that looks after the account service need not even know.
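To make that concrete, here's a deliberately tiny sketch in Java. All the names are made up for illustration, and in a real system the event would travel over a message broker rather than an in-process list of listeners, but the shape is the same: the account service publishes and forgets, the email (or fraud, or analytics) service subscribes and reacts:

import java.util.ArrayList;
import java.util.List;

public class WithdrawalEvents {

    record WithdrawalEvent(String accountId, long amountPence, String atmId) {}

    interface WithdrawalListener {
        void onWithdrawal(WithdrawalEvent event);
    }

    // The account service only knows how to publish; it has no idea an
    // email service (or anything else) is listening.
    static class EventPublisher {
        private final List<WithdrawalListener> listeners = new ArrayList<>();
        void subscribe(WithdrawalListener listener) { listeners.add(listener); }
        void publish(WithdrawalEvent event) {
            listeners.forEach(l -> l.onWithdrawal(event));
        }
    }

    public static void main(String[] args) {
        EventPublisher publisher = new EventPublisher();
        // The "email the customer" service, added later without touching the account service.
        publisher.subscribe(event ->
            System.out.println("email customer of account " + event.accountId()
                    + ": you withdrew " + event.amountPence() + "p at " + event.atmId()));

        // ... inside the account service, after the debit has succeeded:
        publisher.publish(new WithdrawalEvent("5123", 10000, "ATM-42"));
    }
}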
Events start to get really interesting when you combine them, what some folks call Complex Event Processing (CEP), but which I prefer to consider a fairly logical part of Event Driven SOA. The 'complex' in CEP refers to the fact that multiple events are combined to infer or derive that something more interesting has happened. This is all a bit theoretical so let's revisit the anti-fraud example from earlier. A security analyst has identified that when a customer makes withdrawals from ATMs in two different countries within the space of a day this is suspicious but not impossible, so the system might raise something like a "Customer Crossed Border" event. If the customer goes on to make high value transactions back in the first country then the matter needs investigating, as another "Customer Crossed Border" event has occurred on the same day as the first. The fact that this has happened would raise a "Suspicious Occurrence" event, which the account system listens to and locks the account. When the account is locked an "Account Locked" event is raised with a reason code; the customer support centre service listens for this event. When one is received a task is added to one of the call centre operatives' lists of work (i.e. call the customer and verify the transactions) and so on.
Please don't get me wrong, you shouldn't expose all data and behaviour solely via events, that would be ridiculous, but I've seen horror stories of people doing so; indeed I've worked to put them right. What I'm advocating is using a balance of RPC and events to best represent your organisation in an SOA fashion.
I'll talk more in future blog posts about patterns for adding and utilising events in your SOA.
ESBs irk me. Not the technology in and of itself, that can be useful; it's the way they are used. Mostly because every architect and his mentor seems to think you can't have an architecture without one. I swear sometimes they must have the ESB icon pre-painted on their whiteboards because, splat, there it is, a hulking great rectangle in the middle of every systems diagram. They aren't always called ESB, sometimes they sneak through as 'Orchestration' or 'Integration Hub' or a vendor product name.
I first came across the term ESB about 10 years ago when a colleague mentioned them to me and we discussed the concept over lunch; by the end of the lunch break we'd come to the conclusion they weren't necessary, or indeed desirable, in an SOA. I'll put forward some of the classic reasons architects give for using ESBs and explain a better way to achieve each desired outcome, or in some cases why that outcome isn't desirable.
Before I go on I should explain the difference between the central ESB (bad) and using ESB technology in a more appropriate manner.
Figure 1. shows an ESB through which all services communicate; they are essentially unaware of each other and may be in the dark over what each others' interfaces are. Now in the worst case the central ESB grows a team to support it. This is, after all, the obvious thing to do: every service in the organisation wants changes all the time to what they use/expose from/to each other, and therefore in order to prevent everyone hacking away at the ESB, and general chaos, the ESB team is created. All requests for change to the ESB go to the ESB team.
The ESB team are now furiously busy and considered heroes for helping everyone communicate, and are the first port of call for each service team to get what they need from the rest of the organisation. Soon they are getting requests for new functionality; however they don't know how to change the services, and perhaps it's a little tricky to work out which service should handle that functionality, so they slip a little bit of business logic into the ESB and call it orchestration, after all orchestration is what ESBs are for. This continues for a number of years until it dawns on everyone that the ESB contains most, or at least significant portions, of the business logic and the services have become no more than CRUD layers over databases.
Figure 2. is ESB technology used in a more appropriate way. Of course it is no longer a true ESB, because it isn't "Enterprise", it isn't just one service bus for the entire enterprise. Each service uses ESB technology within itself to ensure its interface can be stable; it can use it to maintain multiple versions of the interface simultaneously, but that is within the service. The service team has complete autonomy over what happens with their service now. In many cases they don't need expensive software to do this, they can simply code it in the programming language of their choice or use a simple library to achieve things like mediation, routing, location, security etc. Once each service has this capability the requirement for a central ESB team falls away entirely. As the organisation has been sensible and assigned teams to each service, those teams speak to each other directly to get new functionality. The act of the teams talking directly to each other, face to face, person to person about their requirements also improves overall understanding of the organisation's software assets.
Below I address some of the common reasons people give for using central ESBs and suggest more appropriate patterns for achieving the desired outcome.
The ESB Protects Me From Change
The argument here is that if you want Service A to call Service B you'd better go through the ESB in case Service B's interface changes.
What should happen is that Service B publishes an interface and guarantees it won't change for a period of time (say 12 months), if changes are required another version of just the interface is created and the old version is mediated on to that new version.
That's quite a bit to take on so I'll give a simple example. A company builds a service that exposes the price of commodities. It has an operation that returns the price of gold as a MoneyValue response:

MoneyValue getPriceOfGold()
This works perfectly until a functional requirement gets added to return the price of silver too. The architects, being smart, realise this is probably not the last metal they will be asked to add and so they create a method that's more flexible and looks like:
MoneyValue getPriceOfMetal(Metal metal)
However there are several consumers of the old getPriceOfGold method. In order to support these clients they leave the old service interface in place but redirect calls from getPriceOfGold to getPriceOfMetal(gold) within the service, and everyone is happy. Eventually they will ask consumers of getPriceOfGold() to upgrade to getPriceOfMetal(Metal metal).
Therefore you don't need a central ESB to achieve this objective, you don't need anything outside the service but a bit of mediation logic in your service.
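In code, that mediation is no more than the old operation delegating to the new one. A sketch (the types and prices here are purely illustrative):

public class MetalPriceService {

    enum Metal { GOLD, SILVER }

    record MoneyValue(String currency, long minorUnits) {}

    // New, more flexible operation.
    public MoneyValue getPriceOfMetal(Metal metal) {
        // A real lookup would happen here; hard-coded purely for illustration.
        return metal == Metal.GOLD
                ? new MoneyValue("GBP", 4_500_00)
                : new MoneyValue("GBP", 55_00);
    }

    // Old operation kept for existing consumers, mediated onto the new one.
    public MoneyValue getPriceOfGold() {
        return getPriceOfMetal(Metal.GOLD);
    }

    public static void main(String[] args) {
        MetalPriceService service = new MetalPriceService();
        System.out.println(service.getPriceOfGold());              // existing consumers keep working
        System.out.println(service.getPriceOfMetal(Metal.SILVER)); // new consumers use the new operation
    }
}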
Extra Level of Protection
Just to quickly address the oft-heard retort to the above that a central ESB gives another layer of protection: yes it does, but it comes with all the drawbacks of the central ESB, such as adding logic in the wrong places, a functional enhancement bottleneck and so forth.
The ESB Allows Me To Orchestrate Services
Orchestration is either business logic, which belongs in the service (this is the point of services after all, they contain your logic and data), or it is simply a case of moving a human through some process or other, which belongs in the UI code (e.g. register for a website and then add something to your basket).
The ESB Can Mediate My Data
Yes it can, but see the section 'The ESB Protects Me From Change' above; the mediation can be achieved much more sensibly in your service.
The ESB Can Locate My Services
Each service should have a mechanism for locating any other service for the purposes of RPC calls. I prefer to use DNS in most cases (e.g. metal-exchange.example.com would resolve to the metal exchange service interface or API); DNS has huge power but can be used very simply too. Bonjour (aka Zeroconf) is often cited as a solution too and it's a good answer, but it is merely some extensions to DNS at the end of the day. Others suggest things like UDDI but I have never found the need myself.
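To show how little is needed, here's a trivial sketch (my own; the hostname is the illustrative one from above and won't resolve outside such an organisation). The caller only ever knows the well-known name and lets DNS decide where the service currently lives; move the service and you update a DNS record, not every consumer:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ServiceLocator {
    public static void main(String[] args) throws UnknownHostException {
        // The explicit lookup is just to show what happens; normally the HTTP
        // client resolves the name for you when you call the endpoint.
        InetAddress address = InetAddress.getByName("metal-exchange.example.com");
        String endpoint = "https://metal-exchange.example.com/prices";
        System.out.println("Service resolves to " + address.getHostAddress()
                + ", calling " + endpoint);
    }
}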
The ESB Can Do My Routing
For services that subscribe to events from other services, the same process can be used for locating those topics/queues: DNS to find the broker, and well named and documented topics/queues on those brokers. Most message brokers provide means of routing messages from location to location sensibly; if yours doesn't, get another.
The ESB Can Monitor My Services
Your services should provide monitoring information over any number of technologies that can be monitored by any number of technologies. Examples of technologies that can enable your services to be monitored are syslog, JMX, SNMP, Windows Events and simple log files; one or more of these are available in just about every language. Examples of technologies that can monitor those are Nagios, OpenNMS and any number of commercial systems. You don't need an ESB to do this.
The ESB Provides Extra Security
No they don't, they remove security by terminating it prematurely. They open you up to either deliberate or mistaken man-in-the-middle attacks.
I'm aiming for a simple, fast and minimalist blog. A blog where writing posts is all I have to think about, not themes, fancy backgrounds, AJAX, hosting services, cloud APIs and, well, one starts to lose the will to live.
The design goals are:
I'm quite pleased with the way it works. It's probably not for technophobes, but then nor is blogging. In order to publish a post you create a new text file with your post in it, in a simple directory structure; if you have any images, video etc. then you drop them in a "resources" folder. If you'd like some formatting in your post you can use markdown. After that, run a simple script and everything is taken care of: it's published to the web. So far my limited posts look like this in the directory structure:
The markdown from the YAML file is converted to HTML, which is minified (optimised to remove redundant spaces etc.), and then put into a folder which btsync is watching; once a change is noticed it is synced to all computers with btsync installed, including, most importantly, the web server.
A Note On DNS
As I'm hosting this on my home broadband connection, which doesn't have a static IP address, I needed a way to update my DNS record quickly every time my IP address changed. To do this I used a free service DNSdynamic which gives you a subdomain on one of their domains (e.g. example.dnsdynamic.com), I chose andyhedges.http01.com but anything would work, you then install a client which regularly checks your IP address and updates the DNS if need be. This is great but I wanted to use my vanity domain name hedges.net, I therefore configured a CNAME with my DNS registrar to point blog.hedges.net to andyhedges.http01.com and I was in business, DNS-wise at least.
The cost of my blogging platform breaks down as below:
software - £0 (all open source or freeware)
hosting - £0 (using my home fibre connection)
Raspberry Pi - £28.99
Power Supply - £0 (free with phone)
Network Cable - ~£1
SD Card - ~£4
DNS - £0 (using my DNS registrar and Dynamic DNS provider)
There we have it, a blogging platform for thirty four quid that doesn't require 3rd party hosting. It remains to be seen if my ISP gets cross.
On the little R'berry Pi a quick benchmark with no optimisation shows it can handle 250 requests per second with a response time of 3ms (across my home gigabit network).
Indian takeaway container (no, this isn't the wacky name of some kickstarter project): I used one of the plastic tubs that takeaway curry comes in to provide a case for the Raspberry Pi; it keeps dust and/or water off it
Mini USB charger plug and cable from my Nexus 5 (I have so many of these and so I chose a smallish one)
I hold the opinion that Services (as in SOA) should, where at all possible, be delivered as separately budgeted and planned work from functional enhancement projects to avoid First Project Syndrome.
To understand what First Project Syndrome is let's take a look at some graphs (bear with me...).
As you can see from figure 1. the assertion is that the cost of delivering the first project with a service is higher than for the first project using a point solution (by point solution, I mean grabbing data from source data stores or replicating data into your database or any number of data sync techniques). The reasons why this costs more for the first project are:
there are overheads with creating a service such as following their prescribed best practice
creation of infrastructure to run the service
some expert knowledge in SOA practices
However from the first project onward the savings are realised. The reasons why point solutions cost more from the second project onward are because:
a point solution often starts from scratch each time, the work has to be redone and redone slightly differently for each specific scenario
point solutions layer complexity upon complexity (e.g. tables accessed by many unknown systems, various extract files created for many systems, data shunted to and from multiple undocumented systems)
it's very hard not to make mistakes when syncing data
syncing data causes all sorts of edge cases when trying to modify it
Point solutions are almost always more expensive to manage; with a service built correctly and to specification the operational costs are lower from day one. Services are easier to manage, monitor, fail over and so on. The significant reason why they have these qualities is because they conform to a set of good practices that specifically give these qualities. They leverage the wealth of investment made on previous service developments within the organisation. Each service then stands as a container for future enhancement in its particular business domain, allowing functionality to be added while still providing the operational characteristics demanded.
More often than not the first project does not consider, in detail, the operational cost of managing the solution on-going, or may not have the budget/resource/time to seriously take this into account. Typically the projects have the more immediate concern of getting the solution shipped to the business.
Other Common Pitfalls
There are a number of other pitfalls of delivering services as part of the first project. In no particular order:
the first project's scope determines the scope of the service making it less suitable for other consumers
freezing of the project causes the service to also be frozen, even where that service would have had wider benefits to the business; the case for the project didn't stack up and so the assumption would be that the service's case didn't either
compromises in the design of the project solution force compromises in the service design
To all these points I would posit that the first business project that requires a service should not be the project that delivers that service. This does, of course, mean that the business case should stack up against more than one business facing project (and if it doesn't then it probably isn't worth building).
You wouldn't try to create a power station as part of the build of a house but every house needs one to function.