As the 1990s progressed, the need to standardize the lessons learned into a common methodology became increasingly acute. Two of the leading tool providers of the day, SPSS and Teradata, along with three early-adopter user corporations, Daimler, NCR, and OHRA, convened a Special Interest Group (SIG) in 1996 and in less than a year managed to codify what is still known today as CRISP-DM, the CRoss Industry Standard Process for Data Mining. CRISP-DM was not actually the first such methodology: SAS Institute had its own version called SEMMA (Sample, Explore, Modify, Model, Assess). Nevertheless, within just a year or two many more practitioners were basing their approach on CRISP-DM.
The CRISP-DM methodology is described in these six major steps:
Business Understanding: This phase focuses on understanding the project objectives and requirements from a business perspective. The analyst formulates this knowledge as a data mining problem and develops a preliminary plan.
Data Understanding: Starting with initial data collection, the analyst proceeds with activities to get familiar with the data, identify data quality problems & discover first insights into the data. In this phase, the analyst might also detect interesting subsets to form hypotheses for hidden information.
Data Preparation: The data preparation phase covers all activities to construct the final dataset from the initial raw data.
Modeling: The analyst evaluates, selects & applies the appropriate modelling techniques. Since some techniques, such as neural nets, have specific requirements regarding the form of the data, there can be a loop back here to data preparation.
Evaluation: The analyst builds & chooses models that appear to have high quality based on the loss functions that were selected. The analyst then tests them to ensure that they generalise to unseen data. Subsequently, the analyst also validates that the models sufficiently cover all key business issues. The end result is the selection of the champion model(s).
Deployment: Generally this will mean deploying a code representation of the model into an operational system. This also includes mechanisms to score or categorise new unseen data as it arises. The mechanism should use the new information in the solution of the original business problem. Importantly, the code representation must also include all the data prep steps leading up to modelling. This ensures that the model will treat new raw data in the same manner as during model development.
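The point that the deployed code must carry the data prep steps along with the model can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of CRISP-DM itself: the `ScoringPipeline` class, the made-up churn data, and the toy nearest-centroid model are hypothetical stand-ins, using only the Python standard library. The idea shown is simply that the preparation statistics are learned once during development and then reused unchanged when holdout or new records are scored.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class ScoringPipeline:
    """Hypothetical bundle of data prep + model, deployed as one unit."""
    means: list = field(default_factory=list)
    stdevs: list = field(default_factory=list)
    centroids: dict = field(default_factory=dict)

    def _prepare(self, row):
        # Data prep: standardise each feature with the statistics learned
        # during development; they are never recomputed at scoring time.
        return [(x - m) / s for x, m, s in zip(row, self.means, self.stdevs)]

    def fit(self, rows, labels):
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns so the division is safe.
        self.stdevs = [pstdev(col) or 1.0 for col in columns]
        prepared = [self._prepare(r) for r in rows]
        # Toy model (an assumption): one centroid per class.
        for label in set(labels):
            members = [p for p, l in zip(prepared, labels) if l == label]
            self.centroids[label] = [mean(col) for col in zip(*members)]
        return self

    def predict(self, row):
        prepared = self._prepare(row)

        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(prepared, centroid))

        # Return the label whose centroid is closest to the prepared row.
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


# Made-up raw data: (monthly_spend, support_calls) with a churn label.
train_rows = [(20.0, 5), (25.0, 4), (22.0, 6), (80.0, 0), (90.0, 1), (85.0, 0)]
train_labels = ["churn", "churn", "churn", "stay", "stay", "stay"]
holdout_rows = [(24.0, 5), (88.0, 1)]
holdout_labels = ["churn", "stay"]

pipeline = ScoringPipeline().fit(train_rows, train_labels)

# Evaluation: accuracy on records the model did not see during fitting.
hits = sum(pipeline.predict(r) == y for r, y in zip(holdout_rows, holdout_labels))
holdout_accuracy = hits / len(holdout_rows)

# Deployment: a new raw record goes through the same prep before scoring.
new_prediction = pipeline.predict((21.0, 7))
```

Because `predict` always calls `_prepare` first, a raw record cannot reach the model without the same transformation that was applied during development. In practice a library pipeline object would likely play this role; the hand-rolled class here only keeps the sketch self-contained.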
Characteristics of CRISP-DM
I believe CRISP-DM’s longevity in a rapidly changing area stems from a number of characteristics:
- It encourages data miners to focus on business goals, so as to ensure that project outputs provide tangible benefits to the organization. Too often, analysts can lose sight of the ultimate business purpose of their analysis – the analysis can become an end in itself rather than a means to an end. The CRISP-DM approach helps ensure that the business goals remain at the centre of the project throughout.
- CRISP-DM provides an iterative approach, including frequent opportunities to evaluate the progress of the project against its original objectives. This helps minimize the risk of getting to the end of the project and finding that the business objectives have not really been addressed. It also means that the project stakeholders can adapt & change the objectives in the light of new findings.
- The CRISP-DM methodology is both technology- and problem-neutral. You can use any software you like for your analysis and apply it to any data mining problem you want to. Whatever the nature of your data mining project, CRISP-DM will still provide you with a framework with enough structure to be useful.