Thanks for the great feedback on my last post. I received a lot of questions asking whether the big data stack is a new approach to thinking about data analysis. The short answer is ‘no’. The big data stack is not new. The only part that is new is the ‘big’ prefix to data. The idea is not new because the methodical approach to solving problems with data is not new.
The best way to explain the data stack is with an example. So let's consider a simple example that we can all relate to: airline travel. In my previous job I worked with a company that solved problems related to transportation. Here is a simple use-case: which planes should an airline assign to its various routes, based on fuel cost, demand, competition, etc., with the objective of maximizing profits? The objective can vary: maximize passengers carried, minimize costs, or maximize profits. Each will result in a different answer. Once the use-case has been defined, it becomes the guidepost for the rest of the exercise.
So – how does this use-case align with the data stack?
Data Layer: The data layer contains many components: the airline schedule with the approximate times for the network, the number of planes, seats per plane, fuel cost, seat prices, passenger demand per flight, connecting traffic, and so on. All this data, as you can imagine, came from different systems. And by the way, this is not big data at all. In fact, this is very small data.
Data Preparation Layer: The next step was to prep the data from the various sources so that it could be ingested by the model. This involved blending the costs and revenue with the schedule.
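To make the blending step concrete, here is a minimal sketch in Python. The flight numbers, routes, fares, demand figures, and the flat fuel rate are all made up for illustration, and the `blend` helper is hypothetical; a real project would pull each input from its own source system.

```python
# Hypothetical inputs; each would normally come from a different system.
schedule = [
    {"flight": "F100", "route": "ATL-ORD", "block_hours": 2.0},
    {"flight": "F200", "route": "ATL-LAX", "block_hours": 4.5},
]
fares = {"ATL-ORD": 150.0, "ATL-LAX": 320.0}   # average fare per route (assumed)
demand = {"F100": 120, "F200": 180}            # passengers per flight (assumed)
FUEL_COST_PER_HOUR = 3000.0                    # flat rate, an assumption


def blend(schedule, fares, demand, fuel_cost_per_hour):
    """Join cost and revenue data onto each scheduled flight."""
    rows = []
    for leg in schedule:
        rows.append({
            "flight": leg["flight"],
            "route": leg["route"],
            "demand": demand[leg["flight"]],
            "fare": fares[leg["route"]],
            "fuel_cost": leg["block_hours"] * fuel_cost_per_hour,
        })
    return rows


# One row per flight, ready to hand to the analysis layer.
model_input = blend(schedule, fares, demand, FUEL_COST_PER_HOUR)
```

The point is only that the output is a single table keyed by flight, which is what the model expects to ingest.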
Analysis Layer: The algorithm used to solve this problem is called Linear Programming, and there are a few commercially available packages that offer a solution. The choices include ILOG CPLEX, Gurobi, GAMS, etc. The answer from the analysis is the total profit, the aircraft assignment by route, the number of passengers carried, etc.; everything that the scheduling analyst will require to get his/her job done. This, by the way, is the job of the analysis layer: to produce an answer to the use-case. The analysis layer needs to be customized to the use-case that has to be solved. One size does NOT fit all! If you have an optimization problem, you call an LP solver. If you have a predictive analysis problem, you call the Emcien solver. Emcien offers an analysis layer solution for pattern discovery and predictive analysis use-cases.
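For the optimization case, here is a toy illustration of what the analysis layer produces. Note that it is not a linear program: with two planes and two routes it simply tries every assignment and keeps the most profitable one, which a commercial LP solver would replace at real scale. The plane types, seat counts, costs, fares, and demand are invented for the example.

```python
from itertools import permutations

# Hypothetical fleet and routes; all numbers are made up.
planes = {
    "small": {"seats": 100, "cost_per_hour": 2500.0},
    "large": {"seats": 200, "cost_per_hour": 4000.0},
}
routes = {
    "ATL-ORD": {"demand": 120, "fare": 150.0, "block_hours": 2.0},
    "ATL-LAX": {"demand": 180, "fare": 320.0, "block_hours": 4.5},
}


def profit(plane, route):
    """Revenue from the passengers who fit on the plane, minus operating cost."""
    passengers = min(plane["seats"], route["demand"])
    revenue = passengers * route["fare"]
    cost = plane["cost_per_hour"] * route["block_hours"]
    return revenue - cost


def best_assignment(planes, routes):
    """Brute force: try every one-plane-per-route assignment, keep the best."""
    route_names = list(routes)
    best = None
    for order in permutations(planes, len(route_names)):
        total = sum(profit(planes[p], routes[r])
                    for p, r in zip(order, route_names))
        if best is None or total > best[0]:
            best = (total, dict(zip(route_names, order)))
    return best


total_profit, assignment = best_assignment(planes, routes)
print(total_profit, assignment)
# 49600.0 {'ATL-ORD': 'small', 'ATL-LAX': 'large'}
```

Changing the objective inside `profit`, for example to passengers carried, can change the assignment, which is why the use-case has to be fixed first.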
Presentation Layer: The result from the analysis layer is presented in the format requested by the airline scheduler. This depends on his/her workflow.
Value Layer: The use-case, as defined at the start, is to deliver an optimal airline schedule that abides by the rules and constraints. This answer is delivered to the scheduling analyst so that he/she can get the job done!
As you can see, the data stack approach is always relevant if you are solving a business problem – whether you have small data or big data. The discipline to start with a use-case ensures that the project has an objective and a deliverable.
In today’s big-data world, the priority seems to be on the “big data” and not the use-case. I am sure this will shift back to the use-case.
What are your thoughts on this?