Today we are pleased to bring you the second installment in the “Data in Action” data innovator profile series.
Raghav Kaushik is a researcher in the Database group in Microsoft Research.
Romi Mahajan: When you think about the concept of “data,” what first comes to mind?
Raghav Kaushik: I never think of data in isolation. To me, data only means something in the context of what you want to do with it. The combination of data and its management leads to data management software. When I think of “data management”, I think of any large collection of information that needs to be systematically managed – this includes storing, updating and querying it. For example, if our goal is to design a banking system with customer records that supports financial transactions, a relational database management system is appropriate. Similarly, if our goal is to store personal documents and search for them, a personal information management system such as Desktop search is appropriate. I view both relational and search systems as data management systems.
In CS Research, what role does data play? Is it a central concept or on the periphery?
Data management systems have had a profound impact on society. They crucially underlie the digital infrastructure of all enterprises – from payroll management to supply chain management to e-commerce. Every time one withdraws cash from an ATM, makes a flight reservation or purchases something online, the technology behind it involves a database management system. Similarly, every time one interacts with a search engine, the underlying technology involves managing huge amounts of data.
Not surprisingly, data management is a central part of CS research. The development of relational database technology and search engine technology, to pick two crown jewels, are among the biggest achievements of CS, and CS research led the way in innovation in both areas.
Everywhere you turn, you read about “Big Data.” How do you define this area and what is your approach to it?
“Big Data” is an umbrella term that means multiple things to multiple people. To me personally, the “bigness” of “Big Data” is not the most interesting aspect of the term.
Databases have always been used for analyzing information. The traditional way in which data analysis is performed essentially involves aggregating the data along various dimensions. For example, an enterprise might wish to aggregate sales by geographical region and time to understand business trends.
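As an illustration of the traditional analysis described above, here is a minimal Python sketch that aggregates sales along two dimensions, region and time. The records, the quarter labels and the `aggregate_sales` helper are all hypothetical and invented for this example; a real enterprise would typically run the equivalent grouping query inside its database system.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, amount).
# These values are invented purely for illustration.
SALES = [
    ("West", "2012-Q1", 120.0),
    ("West", "2012-Q1", 80.0),
    ("West", "2012-Q2", 50.0),
    ("East", "2012-Q1", 200.0),
    ("East", "2012-Q2", 75.0),
    ("East", "2012-Q2", 25.0),
]


def aggregate_sales(records):
    """Sum sale amounts for each (region, quarter) pair."""
    totals = defaultdict(float)
    for region, quarter, amount in records:
        totals[(region, quarter)] += amount
    return dict(totals)


totals = aggregate_sales(SALES)
for (region, quarter), total in sorted(totals.items()):
    print(f"{region} {quarter}: {total:.2f}")
```

The task here is precisely defined (a sum per group), which is the property the rest of the answer contrasts with “Big Data” analysis.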
What I understand by “Big Data” is the desire to perform data analysis in much more sophisticated ways. For example, many enterprises want to (1) analyze not only point-of-sale data, but also the sentiment being expressed by the public online through social media such as Facebook, Twitter and the blogosphere, (2) analyze what the public feels about their competitors, and (3) offer personalized recommendations to customers based on the prior history of purchases. The connection with the “Big” part of “Big Data” is the fact that enterprises have the resources today to store very detailed information; for example, the ability to store a history of all past transactions for each customer.
“Big Data” as outlined above has several implications for data management infrastructure. Traditional data analysis such as aggregating sales by region is a precise task, and data management systems could focus on how best to execute the input task efficiently at scale. The main difference with “Big Data” is that even the task (e.g. Twitter sentiment) is not precisely defined. Hence, statistical techniques based on machine learning gain prominence. One of the main challenges facing data management systems is to incorporate statistical machine learning efficiently at scale. This is the subject of ongoing research. Further, since SQL is no longer adequate as a language for posing data analysis tasks, the research community is rethinking the use of relational database systems for data analysis purposes. In particular, there is active investigation of alternate platforms such as Hadoop to power “Big Data”.
Are the issues related to “Big Data” fundamentally Computer Science issues or are they business issues having to do with the application of data to everyday business problems?
They are both. Computer Science issues and, more generally, engineering issues never exist in a vacuum. They always begin with an existing business issue or foresee a future business issue. The same holds for “Big Data”.
The role of “Data Scientists” is increasing, at least in terms of what we read about in the business press. What is a Data Scientist and what are the characteristics of an effective one?
Data analysts have always existed in enterprises. To me, the “Data Scientist” is the evolution of the data analyst’s role to incorporate “Big Data”. A “Data Scientist” is one who analyzes the data collected by an enterprise. As noted above, a lot of modern data analysis is imprecise and exploratory. Accordingly, a data scientist needs to perform the challenging task of reducing imprecise business goals into concrete technical decisions using the tools of statistical machine learning. The data scientist also needs to be able to apply the above tools over large volumes of data by programming against data management systems such as relational systems or Hadoop-style alternate systems.
What in your mind are the most exciting data-related opportunities today, both from the perspective of a CS Researcher and from the perspective of an observer of the world?
Undoubtedly, the task of making sense of data in sophisticated ways, i.e. “Big Data”, is one large data-related opportunity for the reasons described above.
Another exciting trend is the emergence of the cloud. The data-related aspect is the promise of offering data management as a service over the cloud. Centralizing data management offers economies of scale that make it a very compelling proposition. The technical implications of offering data management as a service are huge, right from the infrastructure needed to run a service (availability, performance, etc.) to novel considerations such as security (a client that wants the cloud to host its data would naturally be concerned that the cloud provider has access to its sensitive data).
While the above are two of the most exciting novel trends, I do wish to point out that there is a considerable amount of innovation needed in more traditional data management software. For example, one exciting problem is to exploit emerging hardware (solid state devices, multi-core CPUs, large main memories) to improve the performance of data management systems.
Is there such a thing as “too much” data? Are we being overwhelmed by an information society in which there is too much noise and too little signal?
First, the answer depends on what we want to do with the data (going back to question 1). It is not hard to construct analytical tasks that are way beyond what is practically possible today, for which the question of “too much” does not even arise. Indeed, one of the main challenges facing data scientists is to identify what business goals are even technically feasible.
Restricting ourselves to goals that are feasible, I would view the amount of data to be processed as an economic problem. There is no such thing as “too much” data if we are willing to pay enough. For example, consider the task of analyzing Twitter feeds to find popular sentiment about some product. In the extreme case, we could always hire enough people to manually examine tweets and produce an answer. Given constraints on cost, however, it does make sense to discard subsets of data that are too noisy to be of value. For example, comparison shopping engines do not necessarily crawl the entire web. Instead, they often restrict themselves to popular sites such as Amazon, CNet and PriceGrabber since the cost of integrating information over the entire web is prohibitive.
Do you believe data should ultimately be “free” and not situated behind corporate paywalls and insulated by concepts of IP?
I believe the overall framework in which we think about the above question should recognize the following principles. One is that individuals (not corporations) have a right to privacy. For example, if I participate in a social network, I ought to have the right to whatever information I post. I don’t think such individual data, even if hosted by a corporation, should be free. The second principle is that corporations are human institutions meant to run an economy. They don’t come with any rights. I do believe a lot of data that is owned by corporations today ought to be free. Beyond these principles, it is hard to give a general answer.
As a clarificatory note, I wish to point out that what I have said above is in the context of a capitalist society. It is worth noting that corporations are totalitarian institutions that are an abomination. But overthrowing them requires radical changes to society.
What makes data research exciting? What advice do you have for a young, aspiring Computer Scientist, as regards the world of data?
One reason is that, as illustrated above, data management research offers an opportunity to be at the cutting edge of some really important problems. The second reason is intellectual. The goal of research in any software engineering discipline such as data management is to find simple principles that underpin the development of the software. By this definition, research in data management has had significant success. While there are several providers of relational database systems, the same principles have guided all of their development, i.e., the relational data model that views data as a set of tables, a query language – SQL – based on simple manipulations of tables (filter, grouping, join), and the modeling of concurrent activity (such as multiple users interacting concurrently with their bank accounts) using the notion of software transactions.
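To make these principles concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is only an illustration under stated assumptions: the `customers` and `accounts` tables, their rows, and the `build_bank`, `balance_by_city` and `transfer` helpers are hypothetical and not taken from any system mentioned in this interview. The query combines a join, a filter and a grouping, and the transfer shows a transaction in which either both balance updates take effect or neither does.

```python
import sqlite3


def build_bank():
    """Create a tiny in-memory bank with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER,"
        " balance REAL CHECK (balance >= 0))"  # overdrafts are rejected
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ada", "Seattle"), (2, "Bo", "Seattle"), (3, "Cy", "Austin")],
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(10, 1, 100.0), (11, 2, 50.0), (12, 3, 300.0)],
    )
    conn.commit()
    return conn


def balance_by_city(conn, min_balance):
    """Join customers to accounts, filter by balance, then group by city."""
    rows = conn.execute(
        "SELECT c.city, SUM(a.balance)"
        " FROM customers c JOIN accounts a ON a.customer_id = c.id"
        " WHERE a.balance >= ?"
        " GROUP BY c.city",
        (min_balance,),
    ).fetchall()
    return dict(rows)


def get_balance(conn, account_id):
    """Return the current balance of one account."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_bank()
print(balance_by_city(conn, 75))
print(transfer(conn, 10, 11, 30.0))
print(transfer(conn, 10, 11, 1000.0))
print(get_balance(conn, 10), get_balance(conn, 11))
```

The second transfer asks for more money than the source account holds, so the debit fails and the credit that preceded it is undone. This is the all-or-nothing behavior that the notion of a transaction is meant to capture; a production system would also need to handle isolation between truly concurrent users, which this single-connection sketch does not exercise.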
The advice I would have for a young aspiring Computer Scientist is to understand why data management is exciting for the reasons above.
Raghav Kaushik is a researcher in the Database group in Microsoft Research. He has worked on several areas of data management including XML indexing, query processing and data cleaning (which is one of the core problems underlying “Big Data”). His research has had impact on Microsoft technologies such as Bing Maps and Bing Shopping. His current research interests include cloud data security, in particular enhancing database infrastructure to support encryption.