Four Errors and a Funeral

The story so far…

The integration framework went live, along with a little catch-all for errors that would store what the error was, and what we were trying to integrate with when it happened.

A tiny (hideous) dashboard was thrown around it to keep track of the couple of errors that would appear (everyone’s testing their code, right?), and job-well-done-everyone. Cheers.



Fast-forward a couple of weeks and the dashboard is taking upwards of five minutes to load, if it loads at all, and has become useless.

The plot thickens, and becomes something else.

The problem here is that no-one really cares about individual errors. I mean, of course they care, in a general fashion, but specifically they don’t care about specific errors. Not when there are several thousand of them.

What we need is a way to say “This error – this one here. If we fix this one, we fix one thousand other ones as well. Let’s start with this one.”

That doesn’t sound too tricky – the only problem is that we don’t have categories of errors. All we have is a text message describing the error, generated by one of who-knows-how-many systems. Some messages give detailed classes, timestamps and input data. Others say “my test message”. Hmmm.

I like big data and I cannot lie.

You other coders can’t deny.

A classification problem like this sounds like an ideal candidate for some sort of machine learning technique, which has the added bonus of being really fashionable and a great story to tell around a really dull bar.

Of course, there’s just one problem (okay, more than one) with trying to implement a machine learning approach. The biggest (obvious) one in this case is that machine learning needs a lot of pre-existing training data, which we didn’t have. Plus it’s fashionable overkill, at least in this case.

Various attempts

The just-use-a-util-class approach

I’d heard from a colleague that the Apache StringUtils.difference function would give details on the difference between two Strings. That sounded like it would work – loop through all the errors, and group the ones where the difference was fairly small.

Well, that didn’t work at all. Messages for the same error that had any variable data early on would all be placed into different groups, leaving us with thousands of groups – barely fewer than the messages themselves. Not great.

Actually taking the time to read the docs on the .difference function revealed:

(More precisely, return the remainder of the second String, starting from where it’s different from the first.)

So it’s the sub-string of the second string, from the first character where they differ. Definitely not going to work – we need something cleverer. What’s the difference between the two messages? Hey…
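The documented behaviour is easy to reproduce; here’s a quick sketch of those semantics (my own reimplementation to show why it fails here, not the Apache Commons source):

```java
// A sketch of the documented StringUtils.difference behaviour: return the
// remainder of the second String, starting from where it first differs
// from the first. (Illustrative reimplementation, not the Apache source.)
public class DifferenceDemo {
    static String difference(String first, String second) {
        int i = 0;
        while (i < first.length() && i < second.length()
                && first.charAt(i) == second.charAt(i)) {
            i++;
        }
        // Identical strings leave no remainder.
        return second.substring(i);
    }

    public static void main(String[] args) {
        // Same underlying error, but variable data early in the message
        // makes the "difference" almost the whole string:
        System.out.println(difference("user 123: connection timeout",
                                      "user 456: connection timeout"));
        // -> "456: connection timeout"
    }
}
```

Two messages for the same error diverge at the first variable character, so everything after it counts as “different” – which is exactly the grouping failure described above.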

Just diff them

We use diff tools all the time – they’re for those “Hold on, what did I do to this class?” moments. They present us with a list of changes made to strings – how we could convert one String to another. Somewhat Levenshtein-y, but line-oriented rather than character-oriented.


The Sinatra diff

I grabbed Neil Fraser’s diff implementation (diff-match-patch) and used it to compare the error messages, looking for pairs with minimal diffs. I tried a cut-off of 15, and cleaned up the diffs slightly by removing EQUAL results.
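The original used the diff library itself, but the “group on small diffs” idea can be sketched without the dependency using a plain Levenshtein edit distance and the same sort of cut-off (the class, method names and threshold here are all illustrative, not from diff-match-patch):

```java
// Dependency-free stand-in for the diff-based distance: plain Levenshtein
// edit distance, with two messages treated as the same group when the
// distance falls under a cut-off.
public class DiffGrouping {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    static final int CUT_OFF = 15; // the cut-off tried in the post

    static boolean sameGroup(String a, String b) {
        return levenshtein(a, b) < CUT_OFF;
    }
}
```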

This… kinda worked, actually. Except in cases where, say, there was user data that shared some properties. There were a couple of mislabels, but I figured it was fine. Until, one (the next) day…

Attack of the Larger Words

Fuzzy string matching using cosine similarity appeared on my Skimfeed one morning – how could I resist?

It’s a very fancy name, cosine similarity, and it’s very neat. It’s not really thaaaat complicated though.

Basically, it’s like checking how far apart two points in space are. Except, instead of a point in 2D space, we have a point in n-dimensional space, where the number of dimensions is the number of unique words across the two sentences.

The coordinate along each dimension is given by how many times the word appears in the sentence. To get the similarity, we measure the angle between the two n-dimensional vectors (strictly, its cosine) rather than the straight-line distance between the points. Check the linked article, it explains it quite well.
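A minimal word-count version of this – my own sketch, not the linked article’s code – might look like:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {
    // Each unique word is a dimension; the coordinate is its count.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine of the angle between the two word-count vectors:
    // 1.0 for identical word distributions, 0.0 for no words in common.
    static double similarity(String a, String b) {
        Map<String, Integer> va = wordCounts(a), vb = wordCounts(b);
        Set<String> dims = new HashSet<>(va.keySet());
        dims.addAll(vb.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String w : dims) {
            int x = va.getOrDefault(w, 0), y = vb.getOrDefault(w, 0);
            dot += x * y;
            na += x * x;
            nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Note that because only word counts matter, word order is ignored – two messages with the same words shuffled around still score 1.0.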

This produced a slightly better result, with much cleaner code.

Care for a date?

This worked a lot of the time, except when it didn’t. Because of the time.

As an example, we would get messages along these lines: one from another system and one from SystemBBBBBB carrying the same timestamp, plus a third from SystemBBBBBB with a different timestamp.

Because of the matching times, and the differing times, the first two messages would be grouped together, and the third would be on its own, when really we wanted all of the SystemBBBBBB messages together. Now what? How do we replace all instances of a date in free-text?

I pondered this idly for a bit, before it occurred to me that someone had probably already pondered it quite a bit harder. This led me to Natty.

Natty is a quite-neat library that, given some free-text, will return information on each date within it. I fired it up, used it to replace all dates with the word DATE in my errors, ran the grouping and discovered that they grouped perfectly – far better than expected, actually.
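Natty is what handles genuinely free-text dates; purely to illustrate the replace-with-DATE step without the dependency, here’s a regex-based stand-in that only covers one ISO-ish timestamp shape (the pattern is an assumption about the message format, not anything Natty does):

```java
import java.util.regex.Pattern;

public class DateNormaliser {
    // Matches ISO-ish timestamps like "2016-03-01 12:34:56" or
    // "2016-03-01T12:34:56". Illustrative only -- Natty covers far more
    // formats ("next thursday", "3 days ago", etc.).
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}");

    // Replace every matched timestamp with the literal token DATE, so that
    // otherwise-identical messages become identical for grouping purposes.
    static String normalise(String message) {
        return TIMESTAMP.matcher(message).replaceAll("DATE");
    }
}
```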


At a relatively high level, the final code was to keep placing errors into groups until we could not place them any more.

To actually place an error in a group, we look at all of our existing groups, grab a sample from each, and find which group is the best match.

match is a quick method that ties together extracting the sample (cleaning the data) and calling the previously linked Cosine Similarity tester.

Finally, we use Natty to get all the dates in the text and, for all explicit dates (I avoided inferred dates made entirely of numbers, due to possible false positives), replace them with DATE.
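Pulled together, the whole grouping pass might be sketched like this. Every name here (ErrorGrouper, place, THRESHOLD and its 0.8 value) is my own illustration, since the original code isn’t shown, and a regex stands in for Natty’s date extraction:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ErrorGrouper {
    static final double THRESHOLD = 0.8; // assumption: tune for your data
    // Stand-in for Natty: only handles ISO-ish timestamps.
    static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}[ T]\\d{2}:\\d{2}:\\d{2}");

    // Each group keeps its first (cleaned) message as the sample to match against.
    final List<List<String>> groups = new ArrayList<>();
    final List<String> samples = new ArrayList<>();

    void place(String raw) {
        // Clean the data: normalise dates so they can't split groups.
        String cleaned = TIMESTAMP.matcher(raw).replaceAll("DATE");
        int best = -1;
        double bestScore = 0;
        for (int i = 0; i < samples.size(); i++) {
            double score = similarity(cleaned, samples.get(i));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        if (best >= 0 && bestScore >= THRESHOLD) {
            groups.get(best).add(raw);          // close enough: join the group
        } else {
            samples.add(cleaned);               // otherwise start a new group
            List<String> g = new ArrayList<>();
            g.add(raw);
            groups.add(g);
        }
    }

    // Word-count cosine similarity, as described earlier.
    static double similarity(String a, String b) {
        Map<String, Integer> va = counts(a), vb = counts(b);
        double dot = 0, na = 0, nb = 0;
        for (int x : va.values()) na += x * x;
        for (int y : vb.values()) nb += y * y;
        for (Map.Entry<String, Integer> e : va.entrySet())
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static Map<String, Integer> counts(String s) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : s.toLowerCase().split("\\s+"))
            if (!w.isEmpty()) m.merge(w, 1, Integer::sum);
        return m;
    }
}
```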


That worked fairly well for me – on around 2000 errors, it matched 100%. Let me know if there’s something I missed entirely that would have worked much better, or if there are any questions.
