CORP DATA TO BIZ INTELLIGENCE: AN INTERVIEW WITH VASANT DHAR
by Alan Beck, editor in chief
05/09/97
Techniques such as genetic algorithms, OLAP, neural nets and tree induction, which embed artificial intelligence or represent knowledge, are assuming increasing prominence in all areas of knowledge discovery. To learn more about how such systems can leverage the power of datasets in business environments, HPCwire interviewed Vasant Dhar, co-author with Roger Stein of "Seven Ways of Transforming Corporate Data into Business Intelligence" (Prentice-Hall, 1997). The book provides an accessible treatment of the range of knowledge discovery techniques from a business perspective, along with several fleshed-out case studies, making it of interest to technologists in industry and academia, as well as business people interested in understanding the technologies and how they can profit from them.
Dhar spent several years at Morgan Stanley, where he founded and led the Data Mining Group, focusing on the trading and asset-gathering parts of the business. He is currently associate professor of Information Systems at New York University's Stern School of Business, and president of Datamining Systems, a company providing data mining services that include consulting, advisory work, data analysis, and education. Following are selected excerpts from the discussion.
---
HPCwire: What motivated you to co-author your recent book?
DHAR: "Seven or eight years ago, while I was a professor at NYU, I was involved with research on techniques like genetic algorithms, neural nets and tree induction algorithms and obtained good results with them on small scale problems. I realized that although the technology was mature and could produce results in theory, the challenge that remained was how to apply the technology to real problems in the business world; or conversely, how to formulate business problems in a way that would be amenable to these techniques and at the same time make sense to hard-nosed business people. This meant taking into account considerations we often ignore in the research environment.
"My desire to demonstrate the practical applicability of the technology led me to Wall Street. About three years ago, I set up the Data Mining Group at Morgan Stanley. At that time, Data Mining was largely unheard of in industry. I determined to apply these techniques to a range of problems dealt with by the firm. I like to categorize these problems into the 'low-fruit' and the 'high-fruit'. Low-fruit includes problems like profiling customers, sales, business processes, etc. These are problems where you just KNOW that transaction data will yield abstractions that are useful to the people running the business. The hard part here is getting into the heads of the business people on the one hand, and finding and cleaning up the relevant transaction data on the other. Mining it is the easy part. High-fruit problems include development of trading models and automated strategies, which requires collecting, cleaning, and analyzing large amounts of time series data. That's the holy grail!
"My initial motivation for writing the book was twofold. First, my co-author and I felt that, based on our collective experience with real-world projects where we had applied these powerful emerging technologies, we had something useful to say about how to make them really work in business. We did a post mortem of our projects and asked ourselves what we did right or wrong in each case. What emerged was a practical framework that we felt was worth describing for two reasons. It provided a vocabulary for expressing requirements that covered a range of technical and business considerations. It also provided a basis for contrasting the applicability of alternative techniques to a problem.
"A second reason for writing the book was based on my teaching experience at NYU. After almost 10 years of teaching courses on intelligent systems in organizations, I had still not found a book that completely satisfied me. The computer science books addressed only the technology, and the business books never rose above the most popular level of comprehension, doing justice to neither the technology nor the problems. I wanted to bridge that gap.
"While I was at Morgan, it became increasingly apparent to me that such a book would be of great interest to the professional audience. With the maturing of networks, databases, and desktops, the information infrastructure in organizations is finally mature enough to leverage the vast amounts of data that businesses have been collecting, but archiving down a one way street. These new technologies can now help businesses derive real value from their data. In order to communicate the essence of these technologies, we've adopted a highly visual approach, with high quality graphics. We also have lots of cases that illustrate the process by which some organizations went about assessing and using these technologies. These cases cover problems such customer support, quality control, resource scheduling and dispatching, and financial market prediction."
HPCwire: Please characterize the state of the art in intelligent systems for trading models and strategies.
DHAR: "One of the key challenges in developing trading models is to formulate the problem in a way that the patterns you find are robust and adaptive to a wide variety of market conditions. Many people who have addressed this problem in the past have found that their models worked for a while and then stopped working; they didn't know why. That isn't acceptable. Models must be flexible and intelligent enough to adapt to different market regimes, yet simple and explainable to traders. That way, you can discuss why something does or doesn't work.
"My own approach is to first understand the vocabulary traders actually use in describing markets. Then I attempt to formalize this language by constructing indicators that reflect what's being talked about. This terminology is often quite colorful, albeit a little "fuzzy": traders speak of markets being 'exhausted' or 'trending' or 'nervous', or 'building/releasing tension' for example. The object is to find patterns in terms of this language, making it credible to the traders themselves. Traders often don't understand traditional 'rocket science' approaches, like stochastic differential equations. Thus, even if such methods yield good results, there is a fundamental discomfort with them."
HPCwire: How large does a database have to be before the techniques you address reveal significantly more than trained intuition would?
DHAR: "I've found that trained intuition is often wrong! Or incomplete. But the size of the database is secondary. Even a database containing only thousands of records can yield very interesting insights -- some of which would conform to your intuition, while some might run counter to it. Obviously, the more data you have, the better. But I've found you can get a lot of mileage even from relatively small databases, if the problem is formulated in an intelligent way. In fact, sometimes you find that it makes sense to aggregate lots of transactions into relatively small datasets which are then easy to mine. For example, you can aggregate customer transactions and do high-level profiling. There's a lot of low fruit to be picked in this area.
"In addition, there is much data out there that's not in the terabyte range but ranging in size from tens of megabytes to a few gigabytes that is highly exploitable using these techniques. And you can do it with low-end supercomputers or high-end workstations. In exploratory data analysis, rapid experimentation is critical. My experience is that with datasets that are under a few gigabytes, you can employ these techniques and obtain results in minutes rather than hours. This has a big impact in how you approach the data mining problem. Rapid iterative experimentation makes data exploration a productive activity. I think of databases ranging up to a gigabyte as a sort of 'sweet spot' that is very exploitable with current software and workstation technology."
HPCwire: Then why aren't these techniques used more?
DHAR: "They are being used a lot more than before. But remember, the bottleneck lies in the effective formulation of problems. That's where the lion's share of time is usually spent."
HPCwire: So how can those who want to realize business benefits from these techniques educate themselves?
DHAR: "This involves developing a partnership and trust with technologists who understand the business and who can provide the requisite technical education without becoming bogged down in nuts-and-bolts issues. For technologists, understanding how businesspeople think is vital. For businesspeople, it requires being open to what their data are telling them. As I said earlier, their intuition is often wrong. And some of the most dramatic learning in organizations occurs when the data run counter to the intuition. People shouldn't assume that they already know everything about their business that's worth knowing.
"Also, the process of learning is iterative. Understanding what it is you should know more about, i.e., formulating problems, is iterative -- you never get it right the first time. There must be a commitment on both sides to understand and solve the right problem."
HPCwire: What is the future of these techniques on HPC platforms?
DHAR: "Several of these techniques will benefit immensely from the maturation of HPC. For example, genetic algorithms are highly compute-intensive and also highly parallelizable. When you turn a GA loose on any realistically sized dataset, hours of computation are usually required before it comes back with anything. This has practical repercussions for productivity. Specifically, it makes you a lot more hesitant to explore, especially when there's an urgency to produce results. HPC would prove valuable for such a technique: by distributing the computation and/or data across multiple processors, the search time could be reduced from hours to a few minutes."
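The reason a genetic algorithm parallelizes so readily is that each candidate's fitness can be scored independently of the others. The Python sketch below illustrates that structure under stated assumptions: the toy `one_max` fitness function, the parameter values, and the names `evolve` and `evolve_parallel` are invented for illustration and are not taken from Dhar's work.

```python
import random
from multiprocessing import Pool


def one_max(bits):
    """Toy fitness (an assumption for illustration): the number of 1 bits."""
    return sum(bits)


def evolve(fitness, n_bits=20, pop_size=30, generations=40, seed=0, map_fn=map):
    """Small generational GA; map_fn scores the whole population each round."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluations are independent, so this is the step to distribute.
        scores = list(map_fn(fitness, pop))
        order = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        ranked = [pop[i] for i in order]
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:             # occasional single-bit mutation
                child[rng.randrange(n_bits)] ^= 1
            children.append(child)
        pop = children
    return max(pop, key=fitness)


def evolve_parallel(fitness, workers=4, **kwargs):
    """Same GA, with fitness evaluation spread across worker processes."""
    with Pool(workers) as pool:
        return evolve(fitness, map_fn=pool.map, **kwargs)


if __name__ == "__main__":
    print(one_max(evolve(one_max)))  # serial evaluation
```

Calling `evolve_parallel(one_max)` from inside the same `__main__` guard runs the identical search with scoring farmed out to a process pool. For a fitness function this cheap the process overhead outweighs any gain; distribution pays off when each evaluation is expensive, for instance when it requires a pass over a large dataset.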
HPCwire: How do you see your book contributing to the developments you've been describing?
DHAR: "The book provides a framework whereby understanding and selection of technologies becomes transparent for both businesspeople and technologists. The book is also very concise and thus very accessible as a convenient handbook to a broad spectrum of readers -- anyone interested in getting more value from their data. Finally, I think the book should help bring technologists and business people closer. People often underestimate how wide the gap is between these two sets of people."
--- For more information, see http://www.stern.nyu.edu/~vdhar http://www.prenhall.com/allbooks/be_0132820064.html
-------------------- Alan Beck is editor in chief of HPCwire. Comments are always welcome and should be directed to editor@hpcwire.tgc.com