What is Big Data?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.
Big data refers to the approach used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured, time-sensitive, or simply very large cannot be processed efficiently by relational database engines. This type of data requires a different processing approach, called big data, which uses massive parallelism on readily available hardware.
This is indeed the era of the Big Data revolution. Whether in healthcare, IT, industry, manufacturing, food corporations, agriculture, or any other large-scale or small-scale industry, terabytes and petabytes of data are generated each day. The daily functioning of companies in all sectors relies on extracting meaningful information from structured and unstructured data.
With this veritable explosion, Big Data is going to affect every business. Data is expanding at a much faster rate than before, and it is predicted that in five years around 1.7 megabytes of new information will be generated every second for every human being on the planet.
Simply put, we urgently need experts to analyze and handle these immense volumes of data. In response, most top-notch technology brands have developed complex technologies, platforms, and software for managing data and converting raw bytes into structured, readable information. Most MNCs using Big Data technology in their operations are on a recruitment spree, looking for workers skilled in platforms such as Hadoop, NoSQL, Cassandra, MongoDB, HBase, Data Science, Spark, Storm, Scala, and others.
Individuals aiming to excel in their technology careers can’t ignore this data eruption and need to prepare now for the bigger and better. These niche platforms are difficult to self-learn and are best learned from skilled trainers. The current trend is integrated learning of Big Data and data science courses, as it helps individuals improve their chances of being noticed by top-paying companies.
Present and Future Outlook
A recent report by Gartner reveals that more than 75% of the world’s companies are preparing to invest considerable capital in Big Data and related platforms in the next two years. According to the survey, organizations aim to improve customer service, streamline current business processes, acquire more traffic, and optimize costs using big data. In 2015, most big data projects were initiated by CIOs (32%) and unit heads (31%).
A study conducted by Forbes in association with McKinsey and Teradata recently highlighted the pressing need for data-literate professionals. Forbes surveyed around 300 global executives and found that 59% of the respondents rated big data among the top five ways to gain a competitive edge. Most of them, in fact, ranked big data number one.
The Teradata-Forbes Insights survey of top decision-makers further reported that big data analytics initiatives have had a considerable impact on ROI.
Jobs in Big Data
Hadoop, MapReduce, Cassandra, HBase, MongoDB, Spark, Storm, and Scala are the most in-demand platforms for processing Big Data, and thousands of jobs are generated every month. Most companies across the globe are seeking specialists and professionals who can be productive from day one and are proficient in managing high data volumes.
Last year, Gartner's Senior Vice President and Global Head of Research estimated that there would be 4.4 million IT jobs worldwide to support Big Data, creating around 1.9 million open IT positions in the US.
The average salary for big data-related skills is over $120,000 per annum. Payscale reports a figure of Rs. 607,193 per year.
Interestingly, top companies that frequently use big data and related platforms include Google, Hortonworks, Cloudera, LinkedIn, Facebook, Twitter, IBM, PwC, SAS, Oracle, Teradata, SAP, Dell, and HP.
Prerequisites for learning Big data skills
Experience with, and an aptitude for, any object-oriented programming language will help learners grasp the curriculum faster and more easily. Basic knowledge of UNIX commands and SQL scripting is an added advantage.
Why do data people get excited about large data?
Typically, the larger the data, the harder it becomes to do even basic analysis and processing. For example, much more sophisticated things can be done very simply in MATLAB, NumPy, or R than are practical with Hadoop. Furthermore, large data tends to be at heart just a collection of small data sets repeated millions of times (for each user, or web page, or whatever).
I understand why this interests infrastructure people, but I would think data people would be more turned on by the analytical sophistication than the number of bytes.
The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples. Big data provides a competitive advantage. For the web data sets you describe, it turns out that having 10x the amount of data allows you to automatically discover patterns that would be impossible with smaller samples (think Signal to Noise). The deeper into demographic slices you want to dive, the more data you will need to get the same accuracy.
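The point about demographic slices can be made concrete with a back-of-the-envelope calculation. The sketch below is only illustrative and is not from the original answer: it assumes simple random sampling, the usual normal approximation for a proportion at roughly 95% confidence, the worst case p = 0.5, and two hypothetical helper names (`sample_size_for_margin`, `users_needed`).

```python
import math

def sample_size_for_margin(margin, p=0.5, z=1.96):
    """Observations needed to estimate a proportion p within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2.
    z = 1.96 (about 95% confidence) and p = 0.5 are assumptions.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def users_needed(margin, slice_share):
    """Total users to collect so that a slice making up `slice_share`
    of all users still contains enough observations on its own."""
    return math.ceil(sample_size_for_margin(margin) / slice_share)

# The same +/- 1% accuracy, for ever narrower slices of the user base.
for share in (1.0, 0.1, 0.01):
    total = users_needed(0.01, share)
    print(f"slice = {share:>5.0%} of users -> collect about {total:,} users")
```

Under these assumptions, each tenfold narrowing of the slice multiplies the total data needed by roughly ten, which is the sense in which deeper slices need more data for the same accuracy.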
Take note of recent innovations in artificial intelligence like IBM Watson and Google's self-driving cars. These advancements were made possible by leveraging large, diverse data sets and computational horsepower.
I think the main reasons for the obsession with Big Data are the following:
* Information asymmetry: storage gets cheaper every day, and if you have more data you can make better decisions than competitors.
* The whole is greater than the sum of its parts: many data sets combined tell you more than they do separately.
* Game-changing advancements in machine learning are coming from leveraging Big Data.
The importance of large datasets for allowing more sophisticated statistical models that capture more juice, yield more predictive power, has already been discussed and I agree. But additional elements are worth mentioning:
* Many internet-related datasets involve very sparse variables and rarely occurring events: if the things you care most about happen only once in a million user visits, then you need that much more data to capture that phenomenon. For example, in the computational advertising area, with the kinds of big data we are using with steerads.com, the events of interest are clicks, or even worse, when someone buys something in response to seeing an ad. These are extremely rare.
* If you are going to use your machine learning device to take decisions that can sometimes work for you and sometimes against you (you lose money), you really need to estimate and minimize your risk. Applying your device in a context involving a huge number of decisions is the easiest way to reduce your risk (and this is especially important when there are rare events involved). Of course you need big data to validate that and meaningfully compare strategies.
* One should be very careful in assessing the effect of dataset size; it depends on the type of machine learning or statistical model used. The performance of a logistic regression with a reasonably small input size will quickly saturate as you increase the amount of data. Other more sophisticated models (in general going towards the non-parametric, or allowing the capacity of the model to increase with the amount of data) will gradually become relatively more advantageous as the amount of data is increased.
* The effect of dataset size also depends on the task, of course. Easier tasks will be solved with smaller datasets. However, as mentioned in the above posts, for many of the more interesting, AI-related tasks, we seem to never have enough data. This is connected to the so-called curse of dimensionality: a "stupid" non-parametric statistical model will 'want' an amount of data that grows with the number of ups and downs of the function we want to estimate, which can easily grow exponentially with the number of variables involved (because the number of configurations of factors of interest can grow that fast).
* Advanced machine learning research is trying to go beyond the limitations of "stupid" non-parametric learning algorithms, to be able to generalize to zillions of configurations of the input variables that were never seen in the training set and are not even close to any of those seen. We know that it must be possible to do that, because brains do that. Humans, mammals and birds learn very sophisticated things from a number of examples that is actually much much smaller than what Google needs to answer our queries or get the sense that two images talk about the same thing. A general way to achieve this is through what is called "sharing of statistical strength", and this comes up in many guises.
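Two of the arithmetic points in the list above, rare events and the exponential growth in the number of configurations, can be sketched in a few lines. This is an illustration under stated assumptions, not part of the original answer: events are treated as independent with a fixed rate, variables are discrete, and all function names are hypothetical.

```python
def visits_for_expected_events(one_in, events_wanted):
    """Visits needed so that the expected number of events reaches
    `events_wanted`, when the event happens once in `one_in` visits."""
    return one_in * events_wanted

def prob_at_least_one(event_rate, visits):
    """Chance of seeing the event at least once in `visits`
    independent visits (independence is an assumption)."""
    return 1.0 - (1.0 - event_rate) ** visits

def configurations(values_per_variable, num_variables):
    """Number of distinct input configurations for discrete variables."""
    return values_per_variable ** num_variables

# A one-in-a-million event: a million visits gives no guarantee of seeing it.
print(f"P(at least one event in 1M visits) = {prob_at_least_one(1e-6, 1_000_000):.3f}")
print(f"visits for 1,000 expected events   = {visits_for_expected_events(1_000_000, 1_000):,}")

# Curse of dimensionality: configurations grow exponentially with variables.
for d in (2, 6, 12):
    print(f"{d:>2} variables x 10 values each -> {configurations(10, d):,} configurations")
```

With a one-in-a-million event, a model that needs on the order of a thousand positive examples needs on the order of a billion visits, and a table-like model that wants examples of every configuration runs out of data after only a handful of variables.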
The "current obsession" with Big Data is not new. During the last 25 years there have been numerous periods of great interest in storing and analysing large data sets. In 1983 Teradata brought on Wells Fargo as their first beta site. In 1986 this software was Fortune Magazine's "Product of the Year" - it was exciting because it pioneered the ability to analyse terabyte-sized data sets. By the early 90's most big banks had all their data in a data warehouse of some sort, and there was a lot of work going on in trying to work out how to actually use that data.
Next was the big OLAP craze. Cognos, Holos, Microsoft OLAP Services (as it was then called), etc. were what all the cool kids were talking about. It was still expensive to store very large data sets, so through much of the 90's Big Data was still restricted to bigger companies - especially in financial services, where lots of data was being collected. (These companies had to store complete transactional records for operational and legal reasons, so they already were collecting and storing the data - that's another reason they were amongst the first to leverage these approaches.)
Also important in the 90's was the development of neural networks. For the first time companies were able to use flexible models, without being bound by the constraints of parametric models such as GLMs. Because standard CPUs weren't able to process data fast enough to train neural nets on large data sets, companies such as HNC produced plugin boards which used custom silicon to greatly speed up processing. Decision trees such as CHAID were also big at this time.
So by the time the new millennium rolled around, many of the bigger companies had been doing a lot of work with summarising (OLAP) and modelling (neural nets / decision trees) data. The skills to do these things were still not widely available, so getting help cost lots of money, and the software was still largely proprietary and expensive.
During the 2000's came the next Big Data craze - for the first time, everyone was on the web, and everyone was putting their processes online, which meant now everyone had lots of data to analyse. It wasn't just the financial services companies any more. Much of the interest during this time was in analysing web logs, and people looked enviously at the ability of companies like Google and Amazon who were using predictive modelling algorithms to surge ahead. It was during this time that Big Data became accessible - more people were learning the skills to store and analyse large data sets, because they could see the benefits, and the resources to do it were coming down in price. Open source software (both for storing and extracting - e.g. MySQL, and for analysing - e.g. R) on home PCs could now do what before required million-dollar infrastructure.
The most recent Big Data craze really kicked off with Google's paper about their MapReduce algorithm, and the follow-up work from many folks in trying to replicate their success. Today, much of this activity is centred around the Apache Software Foundation (Hadoop, Cassandra, etc.). Less trendy but equally important development has been happening in programming languages which now support lazy evaluation of sequences, and therefore are no longer constrained by memory when running models (e.g. Parallel LINQ in .NET, generator expressions in Python, the rise of functional languages like Haskell and F#).
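To illustrate the point about lazy evaluation: in Python, laziness comes from generator expressions, whereas a list comprehension in square brackets builds the whole list up front. The sketch below is a minimal illustration with hypothetical function names; note that `sys.getsizeof` reports only the size of the container object itself, not of the numbers it refers to.

```python
import sys

def squares_eager(n):
    # List comprehension: all n values are computed and held in memory.
    return [i * i for i in range(n)]

def squares_lazy(n):
    # Generator expression: values are produced one at a time, on demand.
    return (i * i for i in range(n))

n = 100_000
eager = squares_eager(n)
lazy = squares_lazy(n)

# The list grows with n; the generator object stays small regardless of n.
print("list container bytes:", sys.getsizeof(eager))
print("generator bytes:     ", sys.getsizeof(lazy))

# Both give the same answer when consumed by an aggregate such as sum().
print("same total:", sum(eager) == sum(lazy))
```

The design trade-off is that a generator can be consumed only once, so it suits single-pass aggregations over data too large to hold in memory.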
I've been involved in analysing large data sets throughout this time, and it has always been an exciting and challenging business. Much was written about the Data Warehouse craze, the Neural Net craze, the Decision Tree craze, the OLAP craze, the Log Analysis craze, and the many other Big Data crazes over the last 25 years.
Today, the ability to store, extract, summarise, and model large data sets is more widely accessible than it has ever been. The hardest parts of a problem will always attract the most interest, so right now that's where the focus is - for instance, mining web-scale link graphs, or analysing high-speed streams in algorithmic trading. Just because these are the issues that get the most words written about them doesn't mean they're the most important - it just means that's where the biggest development challenges are right now.
A large dataset in which all the data shares the same bias (systematic error) will not give you better insight into a question. Instead, it will give you a very precise measurement of your flawed answer. For example, it doesn't really matter if you ask 100 teens or a million teens about the best movie of all time. You'll still get an answer that discounts older movies no matter how many you ask.
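A small simulation shows what "a very precise measurement of your flawed answer" looks like. Everything numeric here is made up for illustration: the sketch assumes 90% of teens would name a recent movie while only 42% of the whole population would, and `survey` is a hypothetical helper.

```python
import random

def survey(rng, n, p_names_recent_movie):
    """Fraction of n respondents who name a recent movie as the best ever."""
    hits = sum(1 for _ in range(n) if rng.random() < p_names_recent_movie)
    return hits / n

P_TEENS = 0.90       # assumed preference among teens (illustrative)
P_EVERYONE = 0.42    # assumed preference in the whole population (illustrative)

rng = random.Random(0)  # fixed seed so runs are repeatable
for n in (100, 10_000, 200_000):
    estimate = survey(rng, n, P_TEENS)  # only teens are ever asked
    print(f"n={n:>7,}: teens-only estimate = {estimate:.3f} "
          f"(population value {P_EVERYONE})")
```

As n grows, the estimate settles ever more tightly around the teens' value rather than the population value: the sampling noise shrinks, but the bias does not.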
Larger datasets will tend to include more diversity so you can control for this error. Netflix's data includes all different kinds of people, so their algorithms can account for the bias that may come from potentially having more teens in their database. But they can't say anything about the entertainment preferences of people who don't like to use the internet (or mail-order DVDs), and no amount of their data will help. Twitter data has an even worse problem, as the kind of people who tweet are even more unrepresentative. I'd much rather have a smaller dataset of more representative users to answer the questions that people often use Twitter for.
It's not just sampling that causes error. Consider Facebook likes. I believe that Texas A&M is the most liked college on Facebook. What does that mean? Since there is no dislike button, it's hard to even say that they are the most popular. We also can't say whether it has a good alumni group, has popular sports, or is good academically, and the volume of data won't help us. Instead, we need more detail.
The gist here is that larger datasets are generally better datasets. But there are lots of things that are more important than the size of a dataset, among them being sampling, diversity of measurement, and detail of measurement.
It is a consequence of the progress in hard disk and network development. We are finally able to affordably store and process petabytes of data in a 19" rack, something that many people have dreamed about for a long time.
One reason why people dreamed about storing big amounts of data is that they want to infer knowledge from that data. So, even if something is repeated over and over again within that collection, maybe this repetition is the signal we are looking for?
Big Data has become a hot topic because people are collecting more data than ever before (of course not only, but especially on the Web). This data is screaming to be mined, to get valuable information out of it.
Actually, when we say "Big Data" in the context of statistical analysis, data mining or information extraction and retrieval, most of the time we mean "Representative Data".
"Representative" is a fuzzy term, but I would define it as "having roughly the same properties as the whole thing that we are interested in".
Representativeness is very important for drawing any conclusion from data, because if your data is not representative, you might conclude anything.
For example: you count frequencies of some event in your tiny dataset, relate it to the size of the collection, and claim this is the "true" probability of that event. Take another small dataset (of the same problem domain), and you find a different probability? It turns out that your data was not representative.
Thus, size is key, but so is the retrieval method. If you induce a bias (source bias, topic bias...) in collecting your data, you might run into the same problems as with little data. So it is indeed a good strategy to collect as much data as possible, from as many diverse data sources as possible. Especially if you do not know in the beginning what you might be looking for.
Now, the cool thing with big data is: It's not really difficult to work with it! In many, if not all cases, at various levels the data is governed by power laws, i.e. the absolute number/frequency of some aspect in the data is actually less relevant than the accompanying order of magnitude. It turns out that only a few things stand out in orders of magnitude (and the more data you have, the slower that number grows).
If your whole dataset abides by power laws, a random sample will do so as well. I call such datasets "Zipf-representative", in remembrance of George Kingsley Zipf, who spent a lifetime in finding and formalizing power law phenomena.
That is, even though you are now able to retrieve and store enormous amounts of data, in most cases you actually do not have to go the hard way and run your tasks (analyses, human assessments, whatever) over each and everything. Instead, you random-sample a fraction from it, and there you go. Run these small bites sequentially or in parallel and you approach the complete dataset.
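The claim that a random sample of power-law data keeps the same shape can be checked with a quick experiment. The sketch below is an illustration under assumptions of my own choosing: synthetic data where the item of rank r is drawn with weight 1/r, 1,000 distinct items, and a 10% sample; `zipf_dataset` and `top_k` are hypothetical helpers.

```python
import random
from collections import Counter

def zipf_dataset(rng, num_items, size):
    """Draw `size` item ids; the item of rank r has weight 1/r (Zipf-like)."""
    items = list(range(1, num_items + 1))
    weights = [1.0 / r for r in items]
    return rng.choices(items, weights=weights, k=size)

def top_k(data, k):
    """The k most frequent items, most frequent first."""
    return [item for item, _ in Counter(data).most_common(k)]

rng = random.Random(42)  # fixed seed so runs are repeatable
full = zipf_dataset(rng, num_items=1000, size=200_000)
sample = rng.sample(full, k=len(full) // 10)  # 10% simple random sample

print("top 3 in full data: ", top_k(full, 3))
print("top 3 in 10% sample:", top_k(sample, 3))
```

The few items that stand out by orders of magnitude surface in the sample as well, while the long tail of rare items is where a small sample loses information.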
In fact, the overall frequent things are not so interesting, and they already appear in smaller datasets (like the "stop-words" in text). What's more interesting are the "unexpectedly" frequent things, i.e., you want to know what is characteristic of a particular subset in your data. In many cases, these subsets need to be constructed on-demand (e.g., for search) and so you just need to have big data in order to ensure that your dataset is representative for many different scenarios.
The "big data" strategy has its limitations, of course. First, we never get enough data to statistically mine "all possible" relevant information. Second, there are always scenarios where field experts may get sufficiently good results with less data. Third, one might always be tempted to look for super-surprising properties in the data while neglecting statistical significance (avoid over-fitting your models).
Nevertheless, the strategy works well in many scenarios, and this is why so many people like it.
"Big Data" is a very subjective term. While it can mean management and analysis of large, static data sets, to me it also means handling real-time data streams at high speed and the analytics and decision-making tools necessary to alter behavior. These techniques are not ends in themselves; they are merely vehicles for handling the inexorable rise in the quantity of data being collected. One of the key Big Data questions from my perspective is: how can you transform Big Data into Small Data? In any given data set it is likely that only a small portion has true information value; how does one decide which elements of the data set on which to focus in order to reduce the costs of storage and compute while generating better, more actionable decisions. I believe Big Data is a hot topic because so few firms manage their data stack well, from core database architecture to processing to predictive analytics, all in real-time. There is no silver bullet to solving a given company's data problem; at this point the issue is less about tools and more about culture.
There are several things that are causing this interest:
1. More users on the Internet than ever before, especially due to mobile computing.
2. Our word-of-mouth systems, like Twitter and Facebook, are more efficient than ever before, which is causing companies to see faster and bigger spikes in usage.
3. Our need for more data than ever before. Look at Quora. In the old days this would be a simple forum with maybe four rows in a database. Today? We're seeing lots more pieces of info being captured (related articles, votes, thanks, comments, who is following, traffic, etc).
4. Lots more machine generated data. Logs, etc.
The reason "big data" is so exciting is because it is poorly defined, mysterious, has lots of spy-like implications, and high-tech marketing teams have glommed onto it and now they are stuck with it. If you are in software and you don't do "big data" you might as well just hang it up. Such bullshit.
Forget about big data. Just know that we have to deal with more data than we've ever been used to before, no matter the size of the data sets themselves. As you seem to be implying, we need to figure out ways to make that data accessible without exposing all the gory details, so that we can uncover new and interesting ways to analyze it, hopefully for the benefit of humans, the economy, nature, and the built environment.
Ultimately, maybe the way to think about 'big data' is really as a confluence of three forces (some names and examples are included here for context):
1. An explosion of diverse data that exists inside and outside an enterprise
* Internal Data (e.g. customer service interactions, financial transactions/payments, sales channel)
* External Data (e.g. transaction histories, social media activity)
2. Advanced technologies built to aggregate, organize and analyze that data
* Next-Generation Data Layers (e.g. Hadoop, MongoDB)
* Advanced Analytics Tools (e.g. DataRobot, Drill, Palantir, R, Tableau)
* High-Performance Hardware (e.g. Calxeda, FusionIO, PernixData)
3. A new way of making decisions and interacting with customers
* 360 Degree Views of a Customer & Household (e.g. client portals, customized product recommendations)
* Data-Heavy Decision Making (e.g. data science, predictive modeling)
* Simplified Access to and Use of Data (e.g. data procurement, cleansing, preparation)