Big Data and Your Future as a Data Scientist

Have you heard of the Big Data Challenge? Run by the STEM Fellowship, it’s a competition that helps high school students get excited about data science. This year’s theme: Think Global and Act Local with Big Data.

While big data is the darling of some Science, Technology, Engineering and Mathematics programs, most mainstream high school students have no idea what it is or how it’s transforming the business world – and your opportunities for employment. Although the industry has existed for only a decade, big data is everywhere. It has spread across the corporate landscape like wildfire in a blaze of statistics, analysis, and new titles like vice president of big data and chief data architect.

Forbes offers a simple definition: “Big data is a collection of data from traditional and digital sources inside and outside a company that represents a source for ongoing discovery and analysis.” In other words, businesses collect and examine big data to operate more efficiently and effectively: saving money and investing it wisely in new products and services, improving their customer base and relationships, gaining competitive advantage over other companies, and, ideally, becoming more successful. The job description for one recently created position, vice president of customer insights and operational excellence, is to use big data analytics to understand customers, develop new products and cut operational costs.

KWHS spoke with Wharton marketing professor Peter Fader, an expert in data analysis, to help us understand the mysteries locked in all these numbers, and what they mean for tomorrow’s job prospects.

KWHS: What is big data?

Peter Fader: Let’s first talk about what big data isn’t. Too often people hear those words and they immediately think about the sheer volume of the data – more specifically, how many different customers we’re looking at or how many rows are in our database. Volume is part of it. But most of the action in big data is not in the rows or the number of customers you have, it’s in the different measures you have for each customer. In the old days, all we knew about customers was their demographics. You could look at someone and see that he was a 56-year-old white male and you would put him in a bucket. That was all you had. In the 1970s, we started to track behavior so we would know which soup you were buying and connect it to other purchases within the grocery store. And then in the 1980s and 1990s, we started building CRM systems, customer relationship management, that would let us connect different kinds of purchases. We could look at your purchases in a grocery store and connect them with your purchases in a department store. Or, we might be able to connect them with your media exposure so that we knew what advertisements you saw and what stuff you bought. This is the birth of big data: looking at seemingly unrelated data sources and connecting them together at a granular level.

Now let’s fast forward to where we are today. Not only do we know a lot about the customers, but we also know things like geolocation – where the customer was when he took a certain action. Things like biometrics – what was your heart rate when you purchased something? There’s an emerging field of neuroscience – what parts of the brain light up when you do certain things. There’s social networking – not only what I’m doing, but what people closely connected to me are doing. There’s social media – what are people saying to and about each other. It’s all these different fields ideally linked together at the customer level, and trying to get greater insight out of any one of these fields to be able to answer really deep questions about not only who is going to buy what next, but why. In theory, it gets us to a much better understanding of the overall customer experience or journey than what would have been conceivable a few years ago. That’s big data.
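[To make the idea of linking at the customer level concrete, here is a minimal Python sketch. It is only an illustration, not a method described by Professor Fader: the records, the field names such as customer_id, and the helper link_by_customer are all invented for the example.]

```python
# Hypothetical records from three unrelated sources.
# All names and values are invented for illustration.
purchases = [
    {"customer_id": 1, "item": "soup"},
    {"customer_id": 2, "item": "jacket"},
    {"customer_id": 1, "item": "bread"},
]
ad_exposures = [
    {"customer_id": 1, "ad": "soup coupon"},
    {"customer_id": 3, "ad": "shoe sale"},
]
locations = [
    {"customer_id": 1, "store": "downtown"},
    {"customer_id": 2, "store": "mall"},
]


def link_by_customer(**sources):
    """Group the records from every source under each customer_id."""
    profiles = {}
    for source_name, records in sources.items():
        for record in records:
            profile = profiles.setdefault(record["customer_id"], {})
            # Keep every field except the id, which is already the key.
            details = {k: v for k, v in record.items() if k != "customer_id"}
            profile.setdefault(source_name, []).append(details)
    return profiles


profiles = link_by_customer(
    purchases=purchases, ads=ad_exposures, locations=locations
)

# One customer's linked view: what they bought, saw and where they shopped.
print(profiles[1])
```

[The point of the sketch is that the number of rows barely changes; what grows is the number of different measures attached to each customer.]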

KWHS: How does big data relate to data analytics?

Fader: Having a better grip on who is buying what, and when, in order to forecast the expense of new products or new marketing campaigns begins to get at why we need big data. But in order to do so, we have to get below the surface of the data itself. The data by itself doesn’t answer what’s happening in the future or what’s happening below the surface. That’s where the analytics come in. You can collect all the data you want and have a good time doing data science, just kind of mucking around with the data and seeing what correlates with what. But you really need to get past the raw data and either project what the future data’s going to look like or get below the data to start asking questions about the true, underlying, unobservable propensities that generate the data. That’s the stuff that companies really need. That’s where analytics really shines – being able to go beyond the raw data.

KWHS: What are the job prospects related to the field of big data?

Fader: There are a ton of jobs purely on the data-collection side. Developing technologies that are either aimed at collecting and creating the big data structure, or that do so as a side benefit of some other thing. Walmart has a new program called Scan and Go. It’s a mobile app. You walk into Walmart and you basically scan your purchases yourself. You pull out your phone, scan each product with your phone, and when you go to check out instead of scanning each item in your basket, you hold up your phone with one code and it shows the cashier all the things you bought at once. Push a button and you’re done. Walmart is doing this because it’s a quicker, easier shopping experience and they pay less in labor. Along the way, they’re collecting all this cool data. They need to hire people who will help them manage it, and they need to hire people who will help them leverage it. So that’s one area of employment: hiring people to do data collection or to manage the information that arises from leading-edge data collection tasks.

That leads to the next step. We have each of these different interesting new data structures coming in, so how do we get it all to link together at the household level? That’s where data science really shines. That ability to manage and merge and dedupe [data deduplication], which means looking at duplicate records from households and combining them together. There are all kinds of technical and soft skills involved in putting the data in a form that will not only make it clean and neat, but will also enhance the ability to do analysis on it.
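[As a rough illustration of deduplication at the household level, the Python sketch below merges records that appear to belong to the same household. The matching rule (same surname and same address after basic cleanup) is an assumption made for the example, as are the function names and the sample records; real systems use far more careful matching.]

```python
def household_key(record):
    """Build a crude matching key from surname and address (an assumed rule)."""
    surname = record["surname"].strip().lower()
    # Lowercase, drop periods and collapse repeated spaces.
    address = " ".join(record["address"].lower().replace(".", "").split())
    return (surname, address)


def dedupe_households(records):
    """Merge records that share a key, combining their purchase lists."""
    households = {}
    for record in records:
        key = household_key(record)
        merged = households.setdefault(
            key, {"surname": key[0], "address": key[1], "purchases": []}
        )
        merged["purchases"].extend(record["purchases"])
    return list(households.values())


# Invented sample data: the first two rows are the same household
# typed in slightly different ways.
records = [
    {"surname": "Lee", "address": "12 Oak St.", "purchases": ["soup"]},
    {"surname": "lee ", "address": "12  oak st", "purchases": ["bread"]},
    {"surname": "Diaz", "address": "4 Elm Ave", "purchases": ["milk"]},
]

households = dedupe_households(records)
for household in households:
    print(household)
```

[Three input rows become two households, and the merged household keeps the purchases from both of its original records.]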

That takes us to the next step, which is doing the analysis on the data. I think data management is really important and will be a lucrative field for many smart people. Data science is the next step. It’s about getting the science from the data set. How do we forecast or get below the surface to explain why? That’s science. It requires a scientific process of posing hypotheses and knowing how to test them. Or building statistical models that will help us take the data in directions that the raw data don’t necessarily thoroughly answer. The deeper analysis is data science.

The next step would be enabling decision-making. Yes, it’s great to be able to make this forecast and to get underlying insights about who is buying what and why, but why are we doing this? Because companies want to make more money. They want to develop better products. They want to acquire better customers. Being able to turn those analyses into action requires a beautiful blend of scientific and business skills. It used to be that people could do great at business without having that analytical angle. In fact, being analytical would actually get in the way and lead to analysis paralysis. Today, you need to have both skill sets to ask the questions, know how to answer them, and then to take action on them. Today’s managers require a different set of skills.

KWHS: Are high school students prepared for the big data economy?

Fader: A lot of these skills and the technologies required to do them didn’t exist 10 years ago, so there’s just no room for them in the high school curriculum. A student’s day is already full and the school’s faculty is already staffed up. Where are we going to fit in that course on data management? How are we going to afford to hire another teacher to teach that stuff? In many cases, these things get crowded out from a high school education because there’s no room for them. A lot of people say we should allow programming languages like Python to be emphasized and required as much as foreign languages. It’s important to have this kind of conversation about what students need to learn.

If you look at the math curriculum, the way we’re teaching math today (algebra, geometry and so on) is the same as it ever was. Very few students are taking statistics in high school, and if they are, it’s a tack-on to the end. It’s really left to the students or their parents to do an extracurricular like programming camp [to prepare students for careers in areas like big data]. This also leads to the great digital divide. Parents with money and time can sign junior up for these courses, but there are a lot of really smart people who don’t get exposure to these classes.

If you only have limited amounts of math to teach, especially for kids who may not go to college or have limited math once they’re there, we may want to rethink the types of courses that are offered in high school. Students would be better off learning more about probabilities and statistics. We are teaching them a lot of stuff in college that they should be learning in high school. A great example is Microsoft Excel. Everyone should be learning this in high school. You should not be able to graduate from high school without being reasonably fluent in Excel. The fact that we have to teach Excel 101 and programming 101 and probability and statistics 101 means that half your college time is gone before you can start getting in deep and exploring some of these skills. Even at a top school, we are not seeing students who are ready to hit the ground running as data scientists.

KWHS: What fuels your personal interest in big data and data analytics?

Fader: I’ve always been interested in forecasting things even when I was in high school, whether it was sports or music. Let’s look at the Billboard charts and predict what song will be No. 1 next week. It’s a fun game. Anyone who is playing fantasy football or anything else in the forecasting business knows this. Part of it is a desire to do that well. It’s the process of saying how many factors do I need to take into account? How complex do I have to make it, but not so complex that the forecasts go haywire? It’s about finding that just-right balance. Students should familiarize themselves with the concept of Occam’s Razor [the process of paring down information to make it easier to find the truth]. The whole idea is that the best explanation or best forecast will be the simplest plausible one. We can easily go out there and complicate things, but the more you overcomplicate, the worse your forecast is going to get. Occam’s Razor is about striking that just-right balance between an explanation that’s adequately good to help us trust the forecast, but not overdoing it.
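[The trade-off between simple and overcomplicated forecasts can be seen in a small Python experiment. This is a sketch with invented weekly sales figures, not data or a method from the interview. It fits a straight line and a polynomial that passes exactly through every past point, then asks each to forecast the next week.]

```python
def fit_line(xs, ys):
    """Least-squares straight line: the simple model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_exact_polynomial(xs, ys):
    """Lagrange polynomial through every past point: the complex model."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# Invented data: six weeks of sales with a gentle upward trend.
weeks = [0, 1, 2, 3, 4, 5]
sales = [10, 12, 11, 14, 13, 16]
next_week, actual_sales = 6, 17  # the value we pretend not to know yet

simple_forecast = fit_line(weeks, sales)(next_week)
complex_forecast = fit_exact_polynomial(weeks, sales)(next_week)

print("simple model error: ", abs(simple_forecast - actual_sales))
print("complex model error:", abs(complex_forecast - actual_sales))
```

[The polynomial explains the past perfectly, yet its forecast for week 6 lands far from the actual value, while the straight line stays close. With these made-up numbers, the simpler plausible model is the better forecaster.]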

KWHS: Any advice for high school students who are interested in big data?

Fader: Too many people, because of the misnomer of data science, think that if I can just crunch those numbers, sort them and manage them, great things will happen. The big payoff in data management is in the extracting beyond the data – the forecasting and the analytics. It’s more than just data-management skills; it’s the analytical skills to frame up the right questions. Because the data are getting so big and messy, sometimes people are crowding out those analytical things because they’re having such a good time mucking around with the data. Finding the balance there is important.

Also, to the extent that our students in high school are getting outside the pure math courses and taking steps in this big data direction, very often they are doing it through economics courses. I’m not at all saying that learning economics is bad, but too often economic thinking doesn’t do justice to all the underlying data. It rests on a lot of assumptions: If people were rational, if markets were efficient, then what would happen. It’s a fun exercise to think about that, but it’s not directly aligned with probability and statistics where we’re not trying to impose a lot of assumptions, we’re trying to learn the truth. Students who want to learn about leveraging data often start doing a lot of econ stuff. I would prefer that they put probability and statistics on an equal footing.

Conversation Starters

What is big data and why is the industry so valuable to the business world?

What are three jobs related to big data and the possible skills necessary to do them well?

What is Occam's Razor and why is this concept so critical to the field of data science?

One thought on “Big Data and Your Future as a Data Scientist”

  1. As a big data freak, I completely agree with a lot of the fantastic points made in this article. It definitely requires a holistic approach to thinking and problem-solving, and it’s almost a four-dimensional way of attacking and solving a problem. As someone who is highly motivated to optimize things and deconstruct the complex workings out there, I’m glad I discovered data science and big data – it’s definitely a challenging yet encouraging environment that fastens my thinking cap as tight as can possibly be.

    To any high school students out there – this will be the next big thing, hands down. Big data tells us so many stories, and it tells us those stories in its own language – mathematics. To be able to decode and explain in simple English what billions of these unique data points are saying about our society is a task that takes great skill and perseverance to develop. As a beginner like myself, it can be absolutely frustrating at times to analyze data in the wrong way – trust me! This field, however, is soon to become a powerful weapon that’s not only in the hands of businesses, but whoever takes the time and energy necessary to be proficient in such a demanding industry. The sky’s the limit in big data, and how you interact with data is absolutely essential to your success. Thanks Prof. Fader for these fantastic insights!
