Sampling in Big Data Analytics: The Dilemma Persists
Since the inception of data mining, sampling has not been a preferred word. The prevailing assumption has been that the sheer size of the data used for research determines its predictive power and value.
Data miners have reasons for resisting sampling. First, data mining was aimed at people with expertise in business and IT, not necessarily in statistics; they saw no value in sampling when their goal was to find useful patterns in their data and put that information to use. Some Big Data users also believed that sampling could be skipped altogether because they had complete online data. And although analysts, scientists, and statisticians use sampling on a routine basis, many agencies disparage it in the name of research accuracy.
So how big a sample do we actually need? Classical statistics offers well-established methods for answering exactly that question.
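One such classical method is Cochran's formula for the sample size needed to estimate a proportion. A minimal sketch (the function name and default values are illustrative; 0.5 is the conservative worst-case proportion):

```python
import math

def sample_size(confidence_z=1.96, margin_of_error=0.05, proportion=0.5):
    """Cochran's formula: the sample size needed to estimate a
    proportion at a given confidence level and margin of error.
    proportion=0.5 is the conservative (worst-case) choice."""
    return math.ceil(confidence_z**2 * proportion * (1 - proportion)
                     / margin_of_error**2)

# 95% confidence (z = 1.96), +/-5% margin of error:
print(sample_size())  # -> 385
```

Note how modest the numbers are: a few hundred well-chosen records can bound the estimation error regardless of how many billions of records the full repository holds.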
Even if we use all the data in our Big Data resources, we are not actually avoiding sampling. The data will be used to draw conclusions about future cases, which are not in the resource today. So you can think of Big Data as a very large sample from the population that matters, not as the population itself.
Some people argue that if you have the data, why not use it all? But more isn't always better. Analyzing massive data consumes resources: computing power, storage space, and analyst time. Even when the computing resources are available, the time required is large, and that time could be spent on other useful research. By working with samples, the same massive repository can in fact support many different research projects.
Resources are not the only issue; the quality of the data you use matters a great deal. If we use all the data in a Big Data repository, it is highly probable that the repository is not clean and not focused on the target group. A sample of that Big Data, by contrast, can be cleaned and validated far more easily than the complete repository.
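One practical way to draw such a sample from a repository too large to hold in memory is reservoir sampling, which produces a uniform random sample in a single pass over the data. A minimal sketch (function name and parameters are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Draw a uniform random sample of k items from a stream of
    unknown length, in one pass, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing item with probability k/(i+1),
            # which keeps every item equally likely to be retained.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 5 records from a "stream" of a million without storing them all:
print(reservoir_sample(range(1_000_000), 5))
```

The sampled subset can then be inspected, cleaned, and focused on the target group before any modeling begins.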
On the contrary, Big Data analytics experts say that the days of sampling data are over. "If you really want the lowdown on what's happening in your business, you need large volumes of highly detailed data," wrote Philip Russom, research director for data warehousing with The Data Warehousing Institute (TDWI), in Big Data Analytics, a recent TDWI report. "If you truly want to see something you've never seen before, it helps to tap into data that's never been tapped for business intelligence or analytics." They argue that organizations once could not mine and analyze all the data they collected, which is why sampling was a necessary evil. Today, according to these experts, frameworks such as MapReduce, SQL- or DBMS-like access facilities, and a variety of programming tools are good enough to analyze the data in full.
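To see why MapReduce makes full-data analysis feasible, here is a toy single-machine sketch of the pattern using a word count (the chunk data is made up; real frameworks such as Hadoop distribute the map and reduce steps across a cluster):

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    # Map step: each mapper independently emits partial counts
    # for its own chunk of the data.
    return Counter(chunk.split())

def reduce_counts(a, b):
    # Reduce step: merge partial counts into a running total.
    a.update(b)
    return a

chunks = ["big data big", "data sampling", "big analytics"]
total = reduce(reduce_counts, map(map_chunk, chunks), Counter())
print(total["big"])  # -> 3
```

Because the map step is embarrassingly parallel, adding machines scales the computation to the full dataset, which is the experts' case against needing a sample at all.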
According to a TDWI survey, almost one-third of respondents in the US said they are using data analytics in some form or the other. Industry experts say that although current Big Data research techniques are good enough, companies keep developing more innovative techniques to reduce research error and make the process more convenient for researchers.
Though Big Data is being talked about everywhere and is gaining ground strongly, it remains an open question whether the sampling techniques used in conventional research are still applicable, or whether the world of Big Data is changing them in a big way.
Posted by: Gurjot Singh_13PGP081
Other Members of Group 4_Section B: Aniruddh Mukerji, Alok Jyoti Paul, Chanyo YL, Rohit Garg, Anwesha Dasgupta, Anusha C