Faster Time-to-Value with In-Database Machine Learning
Most business decision-makers know that the data the business accrues has value. However, releasing that value costs both time and money. That’s especially true when machine learning is raised: sure, ML would help the business process its data to unlock valuable insights, trend analysis, and pattern predictions, but when will it deliver, and will it be worth it by the time it’s done?
Directing dedicated ML-focused teams involves careful choice and prioritization of projects, plus an acceptance that any results will take time – something few decision-makers have in abundance. Fast data processing can shave a few percentage points off processing time, but as your data team will tell you, the computing requirements of any analysis are only part of the overall equation. Data preparation, finding the right data sets, model building and management, and production engineering overheads loom far larger in the list of roadblocks.
Even after a significant investment in time and resources, there’s no guarantee that any meaningful results will transpire; sometimes, the data you have simply doesn’t hold the answers you want. Unfortunately, it can take weeks or months to realize that either the data wasn’t right or the question was wrong.
We spoke recently to Paige Roberts, the Open Source Relations Manager at Vertica, about how organizations solve some of these problems and use in-database machine learning to reduce the time it takes for models to start producing practical, useful results for the business.
Paige highlighted a common problem in big data projects: the mismatch between the size of data sets and the limitations of many data science tools, such as R and Python. Many engineers take a sample of the data to work on locally, then run their data preparation steps and train their models on that sample. “You can’t even take 500 gigabytes of data and put it on your laptop and play with it; it doesn’t work. There’s not enough memory to do much with it, and it’s painfully slow. You’re going to have to take some small sample of the data and work with that. Whereas if you have a cluster-based database and you’ve got 10 terabytes of data, or 100, you do the exact same SQL or Python commands, and the ML algorithm looks at all the data.”
The more complex the steps to consolidate and prepare data, train, evaluate, and manage models, the slower the outcomes – unless data teams are blessed with handy clusters of parallel compute at their disposal, such as a distributed database that can do machine learning work internally.
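To make the contrast between working on a local sample and letting the database look at all the data more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Vertica's API: SQLite (from Python's standard library) stands in for a distributed analytical database, the `readings` table and all function names are hypothetical, and the "model" is a plain least-squares line. The point is only that the in-database fit ships a handful of aggregates back to the client, while the sample-based fit has to pull rows out first.

```python
import random
import sqlite3


def build_demo_table(conn, n_rows=50_000, seed=0):
    """Create a hypothetical 'readings' table with synthetic data (y is roughly 3 + 2x)."""
    rng = random.Random(seed)
    conn.execute("CREATE TABLE readings (x REAL, y REAL)")
    rows = []
    for _ in range(n_rows):
        x = rng.uniform(0, 100)
        rows.append((x, 3.0 + 2.0 * x + rng.gauss(0, 5)))
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)


def least_squares(n, sx, sy, sxx, sxy):
    """Closed-form simple linear regression from five sufficient statistics."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def fit_in_database(conn):
    """The database scans every row; only five numbers travel to the client."""
    stats = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM readings"
    ).fetchone()
    return least_squares(*stats)


def fit_on_sample(conn, sample_size=200):
    """Laptop-style workflow: pull a small random sample and fit it client-side."""
    rows = conn.execute(
        "SELECT x, y FROM readings ORDER BY RANDOM() LIMIT ?", (sample_size,)
    ).fetchall()
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return least_squares(n, sx, sy, sxx, sxy)


conn = sqlite3.connect(":memory:")
build_demo_table(conn)
print("fit on all rows, computed in the database:", fit_in_database(conn))
print("fit on a 200-row sample, computed locally:", fit_on_sample(conn))
```

Both fits should land near the same line on this toy data, but the sample-based estimate varies from run to run, and every preparation step applied to the sample would still have to be repeated at full scale later. A real in-database ML platform would expose its own training and prediction functions rather than hand-written aggregates like these.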
On the way from a sample on a local laptop to production, there are traditional roadblocks aplenty. Paige gave an example: a data scientist who has created a model with a small sample of data hands the model over to a data engineer to put it into production. “The data engineer then has to start over and recreate all of those 100 data preparation steps, at scale for the entire dataset of the company in a technology such as Spark. That can take three to six months to a year or more. I think six months is usually the average. And that’s for successful data science projects. Just to jump that last hurdle, to get from – I have a model ready – to the point where it’s actually making the company money.”
For companies that are looking at the bottom line of projects, a year is a long time. At some point, research needs to percolate into production, and do so quickly. Otherwise, it’s as if machine learning as a discipline never left academe.
A question of size
It’s not just large enterprises that accrue significant amounts of data. Even a small startup can find itself holding gigabytes of raw information that can (and perhaps should) yield significant value. Some distributed analytical databases offer a free version, or an extended free period, for startup companies with a lot of data. It makes sense to use capable software for free until your company gets going, then pay when it has grown to need more capacity or wants enterprise-level support.
Paige is a community enthusiast. “Our community users are probably our most valued resource,” she said. In fact, many of her company’s customers begin by downloading the free Community Edition of Vertica and getting started from there.
“Our Community Edition has all 650 plus functions built-in. [It’s limited to three nodes and 1TB of data, BTW – Ed.] Anything you can do with the paid-for Vertica, you can do with the Community Edition, just on a smaller scale.”
Keep watching these pages for the next article in this series, where we focus on Vertica’s business advantages, or download the company’s Community Edition now. (There’s also a trial for Vertica Accelerator, a new SaaS option.)