If you’ve ever been involved in building or using any kind of new business system, chances are you’ve probably heard “the database” being spoken of by wise technical colleagues in almost mythical terms. In project meetings, pronouncements like “We’ll need to optimize the database…” can sound a little like making offerings to the unpredictable data gods.
And in most cases, obsessing over the database is entirely correct: for any project, using the right tool to store and retrieve your data is essential. ‘Proper’ databases (Oracle, SQL Server, MySQL, etc.) are extremely sophisticated and powerful tools, and typically do an excellent job.
However, these have their overheads and their limitations, and it’s sometimes worth considering the alternatives before diving right in and trying to squeeze whatever data you’re dealing with into a database. We experienced an example of this recently which is worth sharing.
We’re currently working with JISC Collections to develop a system to view, search and compare XML license documents. Now, putting XML documents into a database is a dark art, not least because a database works in terms of tables (think: “looks like Excel”) whereas these documents don’t resemble tables at all. There are certainly good products out there which can work wonders with XML documents (MarkLogic Server springs to mind here), but there’s an overhead in terms of cost and configuration.
So, we started our planning by asking a stupid question: “What if we don’t have a database, and just get the system to rifle through all the licenses whenever we need to do a search?” This is known as ‘straight-lining’, and is almost always a bad idea because it doesn’t scale: ie. it takes longer and longer to run as you add more and more documents. However, it’s easy to implement, extremely flexible (and not dependent on forcing everything into tables), and doesn’t require a specialist database tool.
Before writing off straight-lining, we decided to run some tests, and discovered that we can search through around 2,700 licenses per second. Not bad at all! (It helps that the license documents are all quite small). Considering the initial repository for our JISC Collections project will contain 80 licenses, that means any query will take about 0.03 seconds. What’s more, the number of licenses could still grow by over three thousand percent before it takes 1 second for a search. Considering the content in question, it’s pretty safe to assume that’s unlikely to happen.
We’ll also use cacheing (ie. storing the answers to all queries after they are run), so users will only every experience that awful 0.03 second delay if their particular query hasn’t been run by another user already. Because the set of documents certainly won’t be changing daily, this will become very efficient – a kind of ‘crowdsourcing’ of the most common queries to ensure instant search results.
The moral of the story here? We think sometimes it’s worth questioning the ‘correct’ way to do things, because it might just work better to develop a system the ‘wrong’ way. (And we’ll save a discussion of why we built a database engine from scratch to drive our MasterVision product for another time…)