

A Practical Guide to CRISP-DM: What Is CRISP-DM?

At some point working in data science, it is common to come across CRISP-DM. I like to irreverently call it the Crispy Process. It is an old concept for data science that’s been around since the mid-1990s. This post is meant as a practical guide to CRISP-DM.

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. The process model spans six phases meant to fully describe the data science life cycle.

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

CRISP-DM Process Diagram

This cycle comes off as an abstract process with little meaning if it cannot be grounded in some sort of practical example. That’s what this post is meant to be. The following walks through a casual scenario appropriate for many Midwestern gardeners about this time of year.

What am I planting in my backyard this year?

It is a vague, urgent question reminiscent of many data science client problems. To answer it, we’re going to work through the Crispy Process phase by phase in this practical guide to CRISP-DM.

First, we need a “Business understanding”. What does the business (or in this case, gardener) need to know?

Next, we have to form “Data understanding”. So, what data is going to cover our first needs? What format do we need that data in?

With our data found, we need to do “Data preparation”. The data has to be organized and formatted so it can actually be used for whatever analysis we’re going to use for it.

The fourth phase is the sexy bit of data science, “Modeling”. There’s information in those hills! Er…data. But we need to apply algorithms to extract that information. I personally find the conventional title of this phase somewhat confusing in contemporary data science. In colloquial conversations among fellow data professionals, I wouldn’t say “Modeling” but rather “Algorithm Design” for this part.

“Evaluation” time. We have information. As one of my former team leads would ask at this stage, “You have the what. So what?”

Now that you have something, it needs to be shared with the “Deployment” stage. Don’t ignore the importance of this stage!

I can pick out who is a new professional and who is a veteran by how they feel about this part of a project. Newbies have put so much energy into Modeling and Evaluation that Deployment is like an afterthought. Stop! It’s a trap!

For the rest of us, “Deployment” might as well be “What we’re actually being paid for”. I cannot stress enough that all the hours, sweat, and frustration of the previous phases will be for nothing if you do not get this part right.

Business understanding: What does the business need to know?

We have a basic question from our gardener.

To get a full understanding of what they need in order to take action and plant their backyard this year, we need to break this question down into more specific, concrete questions.

Whenever I can, I want to learn as much as possible about the client’s context. This does not necessarily mean I want them to answer “What data do you want?” It is also important to steer a client away from preconceived notions of the project’s end result. Hypotheses can dangerously turn into premature predictions, and disappointment follows when reality does not match those expectations.

Rather, it is important to appreciate what kind of piece you’re creating for the greater puzzle your client is putting together.

About the client

I am a Midwestern gardener myself so I’m going to be my own customer.

The client is a gardening hobbyist who wants to understand the plants best suited for a given environment; the environment of interest is the American Midwest, the client’s location. Their favorite color is red and they like the idea of bits of showy red in their backyard. Anything that is low maintenance is a plus.

For this client, the data should stay simple and shareable with other hobbyists. Whatever data we get should be verified for what it does and does not have, as the client is skeptical of any dataset’s claimed objectivity.

Data understanding: What data is going to cover our needs?

One trick I use to try to objectively break down complex scenarios in real life is to sift the business problem for distinct entities and use those to push my data requirements.

We can infer from the scenario that the minimal set of entities is the gardener and the plant. As the gardener is presumably a hobbyist and probably doesn’t have something like a greenhouse at their disposal, we can also presume that their backyard is another major entity, which is made of dirt and is a location. It is outside, so other entities at play include the weather. That is also dependent on location. Additionally, the client cares about a plant’s hardiness and color.

So we know we have at least the following to address:

  • The Gardener
  • The Plant
  • Location (dirt and weather)
  • Plant Hardiness
  • Plant Color

The Gardener is our client and is seeking to gain knowledge about what is outside their person. So we can discard them as an essential data point. 

The plant can be anything. It is also the core of our client question. We should find data that is plant-centric for sure.

Location is essential because that can dictate the other entities I’m considering like Dirt and Weather. Our data should help us figure out these kinds of environmental factors.

Additionally, we need to find data that could help us find the color and hardiness of a plant.

There are many datasets for plants, especially for the US and UK. Our client is American so American-focused datasets will narrow our search. 

USDA Plant Finder page

The USDA’s plant database has several issues, though, relative to our needs. One of the most glaring is location. While it does have state information, a single state in the United States can be larger than entire countries in many parts of the world. States can span multiple geography types, so concerns like weather are not accounted for in this dataset.

Perhaps ironically, the USDA does have a measuring system for handling geographically-based plant growing environments, the USDA Plant Hardiness Zones.

USDA Plant Hardiness Zones are so prevalent that they are what American gardeners typically use to shop for plants. Given that our client is an American gardener, it is going to be important to grab that information. Below is an example of an American plant store describing the hardiness zone for the plant listed.

Burpee Seeds plant listing

American institutions dedicated to plants and agriculture are not limited to just the federal government. In the Midwest, the Missouri Botanical Garden has its own plant database, which shows great promise.

MOBOT Plant Finder page

The way the current data is set up, we could send it on to our client, but we have no way of helping them verify exactly what this dataset does and does not have. We only know what it could have (drought-resistant, flowering, etc.), but not how many entries.

We’re going to have to extract this data out of MOBOT’s website and into a format we can explore in something like a Jupyter notebook.

Data preparation: How does the data need to be formatted?

Getting the data.

The clearest first step is that we need to get that data out of MOBOT’s website.

Using Python, this is a straightforward process with the popular library Beautiful Soup. The following presumes you are using some version of Python 3.x.

The first thing we want is a systematic way of crawling all the individual web pages with plant entries. Luckily, for every letter in the Latin alphabet, MOBOT has pages that use the following URL pattern:

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderListResults.aspx?letter=<LETTER>

So for every letter in the Latin alphabet, we can loop through all the links in all the webpages we need.

The following is how I tackled this need. To go straight to the code, follow this link.

import string

import requests
from bs4 import BeautifulSoup

def find_mobot_links():
    # One results page exists per letter of the Latin alphabet.
    for letter in string.ascii_uppercase:
        file_name = "link_list_" + letter + ".csv"
        url = ("https://www.missouribotanicalgarden.org"
               "/PlantFinder/PlantFinderListResults.aspx?letter=" + letter)
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # Each plant entry link carries a predictable id prefix.
        with open("mobot_entries/" + file_name, "w") as g:
            for link in soup.find_all("a", id=lambda x: x and x.startswith(
                    "MainContentPlaceHolder_SearchResultsList_TaxonName_")):
                g.write(link.get("href") + "\n")

Now that we have the links we know we need, let’s visit them and extract data from them. Web page scraping is a process of trial and error. Web pages are diverse and often change. The following grabbed the data I needed and wanted from MOBOT but things can always change. 

import re
import string
import time

import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

def scrape_and_save_mobot_links():
    for letter in string.ascii_uppercase:
        file_name = "link_list_" + letter + ".csv"
        with open("./mobot_entries/" + file_name, "r") as f:
            for link_path in f:
                url = ("https://www.missouribotanicalgarden.org"
                       + link_path.strip())
                html_page = requests.get(url)
                # Work out the page encoding from the HTTP headers or the
                # declared encoding in the HTML itself.
                http_encoding = (html_page.encoding
                                 if "charset" in html_page.headers.get("content-type", "").lower()
                                 else None)
                html_encoding = EncodingDetector.find_declared_encoding(
                    html_page.content, is_html=True)
                encoding = html_encoding or http_encoding
                soup = BeautifulSoup(html_page.content, "html.parser",
                                     from_encoding=encoding)
                title = str(soup.title.string).replace("  – Plant Finder", "")
                out_name = re.sub(r"\W+", "", title)
                with open("mobot_entries/scraped_results/" + out_name + ".txt", "w") as g:
                    g.write(title + "\n")
                    g.write(str(soup.find("div", {"class": "row"})))
                print("finished " + out_name)
                # Courtesy pause so we do not hammer MOBOT's servers.
                time.sleep(5)

Side note: A small, basic courtesy is to avoid overloading websites serving the common good like MOBOT with a barrage of activity. That is why the timer is used in between every loop.

Transforming the Data

With the data out and in our hands, we still need to bring it together in one convenient file we can examine all at once using another Python library, pandas. The method is relatively straightforward and is already on GitHub if you would like to jump in here.

Because our previous step got us almost everything we could possibly get from MOBOT’s Plant Finder, we can pick and choose just the columns we really want to deal with in a simple, flat csv file. You may notice the code allows for the near-constant instances where a data column we want to fill in doesn’t have a value for a given plant. We just have to work with what we have.

Ultimately, the code pulls Attracts, Bloom Description, Bloom Time, Common Name, Culture, Family, Flower, Formal Name, Fruit, Garden Uses, Height, Invasive, Leaf, Maintenance, Native Range, Noteworthy Characteristics, Other, Problems, Spread, Suggested Use, Sun, Tolerate, Type, Water, and Zone.
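As a sketch of that consolidation step with pandas: parsing each scraped page is elided here, and both the shortened column list and the `plants` stand-in data are illustrative rather than the repo’s actual code. Missing attributes simply become empty cells.

```python
import pandas as pd

# A subset of the columns listed above, kept short for illustration.
COLUMNS = ["Common Name", "Zone", "Bloom Description", "Maintenance"]

def to_flat_csv(plants, path):
    # Plants missing an attribute get an empty cell instead of an error.
    df = pd.DataFrame(plants).reindex(columns=COLUMNS).fillna("")
    df.to_csv(path, index=False)
    return df

# Illustrative stand-in for the attributes parsed out of the scraped pages.
plants = [
    {"Common Name": "Coneflower", "Zone": "3 to 8", "Bloom Description": "Red"},
    {"Common Name": "Lily", "Maintenance": "Low"},
]
df = to_flat_csv(plants, "mobot_flat.csv")
print(df.shape)  # (2, 4)
```

The `reindex`/`fillna` combination is what keeps the file flat and predictable even when plants are missing fields.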

That should get us somewhere!

Modeling: How are we extracting information out of the data?

I am afraid there isn’t going to be anything fancy happening here. I do not like doing anything complicated when it can be straightforward. In this case, we can be very straightforward. For the entirety of my data analysis process, head over to my Jupyter Notebook for more: https://github.com/prairie-cybrarian/mobot_plant_finder/blob/master/learn_da_mobot.ipynb

The most important part is the results of our extracted information:

  • Chinese Lilac (Syringa chinensis Red Rothomagensis)
  • Common Lilac (Syringa vulgaris Charles Joly)
  • Peony (Paeonia Zhu Sha Pan CINNABAR RED)
  • Butterfly Bush (Buddleja davidii Monum PETITE PLUM)
  • Butterfly Bush (Buddleja davidii PIIBDII FIRST EDITIONS FUNKY …)
  • Blanket Flower (Gaillardia Tizzy)
  • Coneflower (Echinacea Emily Saul BIG SKY AFTER MIDNIGHT)
  • Miscellaneous Tulip (Tulipa Little Beauty)
  • Coneflower (Echinacea Meteor Red)
  • Blanket Flower (Gaillardia Frenzy)
  • Lily (Lilium Barbaresco)

Additionally, we have a simple csv we can hand over to the client. I will admit, as far as clients go, I am easy. Almost like I can read my own mind.
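For a sense of the kind of selection behind that list, here is a hedged pandas sketch, not the notebook’s actual code: the rows, column names, and the assumption of a zone-5 Midwestern backyard are all illustrative.

```python
import pandas as pd

# Hypothetical rows from the flat file built in Data preparation.
df = pd.DataFrame({
    "Common Name": ["Coneflower", "Hosta", "Blanket Flower"],
    "Bloom Description": ["Meteor Red", "White", "Red with yellow tips"],
    "Zone": ["3 to 8", "3 to 9", "2 to 10"],
    "Maintenance": ["Low", "Low", "Low"],
})

# Red blooms, low maintenance, and a hardiness range that covers the
# garden's zone (zone 5 is an assumed stand-in for the Midwest).
red = df["Bloom Description"].str.contains("red", case=False, na=False)
low = df["Maintenance"].str.lower().eq("low")
zones = df["Zone"].str.extract(r"(\d+)\s+to\s+(\d+)").astype(float)
hardy = (zones[0] <= 5) & (zones[1] >= 5)

print(df.loc[red & low & hardy, "Common Name"].tolist())
```

Three boolean masks combined with `&` keep the whole selection readable and easy to explain to the client.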

Evaluation: You have the what. So what?

In some cases, this step is simply done. We have answered the client’s question. We have addressed the client’s needs. 

Yet, we can still probably do a little more. In the hands of a solid sales team, this is the time for the upsell. Otherwise, we are in scope-creep territory. 

Since I have a good relationship with my client (me), I’m going to at least suggest the following next steps. 

Things you can now do with these new answers:

  • Cross reference soil preferences of our listed flowers with the actual location of the garden using the USDA Soil Survey’ data ( https://websoilsurvey.sc.egov.usda.gov/App/HomePage.htm ).
  • Identify potential consumer needs of the client in order to find and suggest seed or plant sources for them to purchase the listed flowers.

Deployment: Make your findings known

Personal experience has shown me that deployment is largely an exercise in client empathy. Final delivery can look like many things. Maybe it is a giant blog post. Maybe it is a PDF or a PowerPoint. So long as you deliver in a format that works for your user, the medium does not matter. All that matters is that it works.

Adapting the CRISP-DM Data Mining Process: A Case Study in the Financial Services Domain

Veronika Plotnikova, Marlon Dumas & Fredrik Milani (Institute of Computer Science, University of Tartu)

Conference paper, first online 08 May 2021. Part of the book series Lecture Notes in Business Information Processing (LNBIP, volume 415); included in the International Conference on Research Challenges in Information Science conference series.

Data mining techniques have gained widespread adoption over the past decades, particularly in the financial services domain. To achieve sustained benefits from these techniques, organizations have adopted standardized processes for managing data mining projects, most notably CRISP-DM. Research has shown that these standardized processes are often not used as prescribed, but instead, they are extended and adapted to address a variety of requirements. To improve the understanding of how standardized data mining processes are extended and adapted in practice, this paper reports on a case study in a financial services organization, aimed at identifying perceived gaps in the CRISP-DM process and characterizing how CRISP-DM is adapted to address these gaps. The case study was conducted based on documentation from a portfolio of data mining projects, complemented by semi-structured interviews with project participants. The results reveal 18 perceived gaps in CRISP-DM alongside their perceived impact and mechanisms employed to address these gaps. The identified gaps are grouped into six categories. The study provides practitioners with a structured set of gaps to be considered when applying CRISP-DM or similar processes in financial services. Also, a number of the identified gaps are generic and applicable to other sectors with similar concerns (e.g. privacy), such as telecom and e-commerce.


KDD - Knowledge Discovery in Databases; SEMMA - Sample, Explore, Modify, Model, and Assess; CRISP-DM - Cross-Industry Standard Process for Data Mining.


Plotnikova, V., Dumas, M., Milani, F. (2021). Adapting the CRISP-DM Data Mining Process: A Case Study in the Financial Services Domain. In: Cherfi, S., Perini, A., Nurcan, S. (eds) Research Challenges in Information Science. RCIS 2021. Lecture Notes in Business Information Processing, vol 415. Springer, Cham. https://doi.org/10.1007/978-3-030-75018-3_4
Application of CRISP-DM and DMME to a Case Study of Condition Monitoring of Lens Coating Machines


Data Science Process Alliance

What is CRISP-DM?

by Nick Hotz | Last updated Apr 28, 2024 | Life Cycle

Published in 1999 to standardize data mining processes across industries, CRISP-DM has since become the most common methodology for data mining, analytics, and data science projects.

Data science teams that combine a loose implementation of CRISP-DM with overarching team-based agile project management approaches will likely see the best results.

What are the 6 CRISP-DM Phases?

I. Business Understanding

Any good project starts with a deep understanding of the customer’s needs. Data mining projects are no exception and CRISP-DM recognizes this.

The Business Understanding phase focuses on understanding the objectives and requirements of the project. Aside from the third task, the three other tasks in this phase are foundational project management activities that are universal to most projects:

  • Determine business objectives: You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish.” ( CRISP-DM Guide ) and then define business success criteria.
  • Assess situation: Determine resources availability, project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.
  • Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.
  • Produce project plan: Select technologies and tools and define detailed plans for each project phase.

While many teams hurry through this phase, establishing a strong business understanding is like building the foundation of a house – absolutely essential.

II. Data Understanding

Next is the Data Understanding phase. Adding to the foundation of Business Understanding , it drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase also has four tasks:

  • Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
  • Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
  • Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
  • Verify data quality: How clean/dirty is the data? Document any quality issues.
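In pandas, the describe and verify tasks above often reduce to a few one-liners. A minimal sketch with toy data (the fields are illustrative):

```python
import pandas as pd

# Toy stand-in for a freshly collected data set.
df = pd.DataFrame({
    "height_cm": [30, 45, None, 60],
    "zone": ["3 to 8", "4 to 9", "5 to 9", None],
})

# Describe data: surface properties such as shape and field types.
print(df.shape)            # (4, 2)
print(df.dtypes.to_dict())

# Verify data quality: count and document missing values per field.
print(df.isna().sum().to_dict())  # {'height_cm': 1, 'zone': 1}
```

Recording these numbers up front makes the later data preparation decisions (impute, drop, or ignore) easy to justify.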

III. Data Preparation

A common rule of thumb is that 80% of the project is data preparation.

This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

  • Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
  • Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.
  • Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.
  • Integrate data: Create new data sets by combining data from multiple sources.
  • Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
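Several of these tasks map onto common pandas idioms. A minimal sketch, reusing the guide's body-mass-index example (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "height_m": ["1.75", "1.60", "bad"],
    "weight_kg": [70, 55, 80],
})

# Format data: convert stored strings to numbers; bad entries become NaN.
df["height_m"] = pd.to_numeric(df["height_m"], errors="coerce")

# Clean data: impute the erroneous value with the column median.
df["height_m"] = df["height_m"].fillna(df["height_m"].median())

# Construct data: derive body mass index from height and weight.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df["bmi"].round(1).tolist())  # [22.9, 21.5, 28.5]
```

Note the order: format first so the clean and construct steps can do arithmetic at all.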

IV. Modeling

What is widely regarded as data science’s most exciting work is also often the shortest phase of the project. Here you’ll likely build and assess various models based on several different modeling techniques. This phase has four tasks:

  • Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
  • Generate test design: Pending your modeling approach, you might need to split the data into training, test, and validation sets.
  • Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”.
  • Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.

Although the CRISP-DM Guide suggests to “iterate model building and assessment until you strongly believe that you have found the best model(s)”, in practice teams should continue iterating until they find a “good enough” model, proceed through the CRISP-DM lifecycle, then further improve the model in future iterations.
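A minimal scikit-learn sketch of the test-design and build/assess tasks, scoring two competing models on the same held-out set (synthetic data; the model choices are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for a real prepared data set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Generate test design: hold out data the models never see while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Build and assess: score competing models on the same held-out set.
results = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = round(model.score(X_test, y_test), 3)

print(results)
```

Scoring every candidate on one shared held-out set is what makes the comparison in the Assess Model task fair.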

V. Evaluation

Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business needs and what to do next. This phase has three tasks:

  • Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
  • Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
  • Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.

VI. Deployment

“Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.” -CRISP-DM Guide

A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely. This final phase has four tasks:

  • Plan deployment: Develop and document a plan for deploying the model
  • Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model
  • Produce final report: The project team documents a summary of the project which might include a final presentation of data mining results.
  • Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.

Your organization’s work might not end there. As a project framework, CRISP-DM does not outline what to do after the project (also known as “operations”). But if the model is going to production, be sure you maintain the model in production. Constant monitoring and occasional model tuning is often required.


Is CRISP-DM Agile or Waterfall?

Some argue that it is flexible and agile, while others see CRISP-DM as rigid. What really matters is how you implement it.

Waterfall: On one hand, many view CRISP-DM as a rigid waterfall process – in part because its reporting requirements are excessive for most projects. Moreover, the guide states in the Business Understanding phase that “the project plan contains detailed plans for each phase” – a hallmark of traditional waterfall approaches that require detailed, upfront planning. Indeed, if you follow CRISP-DM precisely (defining detailed plans for each phase at the project start and including every report) and choose not to iterate frequently, then you’re operating more of a waterfall process.

Agile: On the other hand, CRISP-DM indirectly advocates agile principles and practices by stating: “The sequence of the phases is not rigid. Moving back and forth between different phases is always required. The outcome of each phase determines which phase, or particular task of a phase, has to be performed next.” Thus if you follow CRISP-DM in a more flexible way, iterate quickly, and layer in other agile processes, you’ll wind up with an agile approach.

Example: To illustrate how CRISP-DM could be implemented in either an Agile or waterfall manner, imagine a churn project with three deliverables: a voluntary churn model, a non-pay disconnect churn model, and a propensity to accept a retention-focused offer.

CRISP-DM Waterfall: Horizontal Slicing

Learn more about slicing at  Vertical vs Horizontal Slicing Data Science

In a waterfall-style implementation, the team’s work would comprehensively and horizontally span across each deliverable as shown below. The team might infrequently loop back to a lower horizontal layer only if critically needed. One “big bang” deliverable is delivered at the end of the project.

CRISP-DM Agile: Vertical Slicing

Alternatively, in an agile implementation of CRISP-DM, the team would narrowly focus on quickly delivering one vertical slice up the value chain at a time as shown below.

They would deliver multiple smaller vertical releases and frequently solicit feedback along the way.

Which is Better?

When possible, take an agile approach and slice vertically so that:

  • Stakeholders get value sooner
  • Stakeholders can provide meaningful feedback
  • The data scientists can assess model performance earlier
  • The project team can adjust the plan based on stakeholder feedback

How Popular is CRISP-DM?

Definitive research does not exist on how frequently data science teams use different management approaches. So to get an idea on approach popularity, we investigated KDnuggets polls, conducted our own poll, and researched Google search volumes. Each of these views suggests that CRISP-DM is the most commonly used approach for data science projects.

KDnuggets Polls

Bear in mind that the website caters toward data mining, and the data science field has changed a lot since 2014.

KDnuggets is a common source for data mining methodology usage. Each of the polls in  2002 ,  2004 ,  2007  posed the question: “What main methodology are you using for data mining?”, and the  2014 poll  expanded the question to include “…for analytics, data mining, or data science projects.” 150-200 respondents answered each poll.

data science methodology poll

CRISP-DM was the most popular methodology in each poll across the 12-year span.

Our 2020 Poll

To learn more about the poll, go to  this post .

For a more current look into the popularity of various approaches, we conducted our own poll on this site in August and September 2020.

Note the response options for our poll were different from the KDnuggets polls, and our site attracts a different audience.

most popular data science processes

CRISP-DM was the clear winner, garnering nearly half of the 109 votes.

Google Searches

Given the ambiguity of a searcher’s intent, some searches like “my own” could not be analyzed and others like “tdsp” and “semma” could be misleading.

For yet a third view into CRISP-DM, we turned to the Google Keyword Planner tool, which provided the average monthly search volumes in the USA for select key search terms and related terms (e.g. “crispdm” or “crisp dm data science”). Clearly irrelevant searches like “tdsp electrical charges” or “semma both aagatha” were then removed.

[Chart: data science process Google search volume]

CRISP-DM yet again reigned as king, this time by a much broader margin.

Should I use CRISP-DM for Data Science?

So CRISP-DM is popular. But should you use it?

Like most answers in data science, it’s kind of complicated. But here’s a quick overview.

Strengths

From today’s data science perspective this seems like common sense. This is exactly the point. The common process is so logical that it has become embedded into all our education, training, and practice. – William Vorhies, one of CRISP-DM’s authors (from Data Science Central)
  • Generalizable: Although designed for data mining, William Vorhies, one of the creators of CRISP-DM, argues that because all data science projects start with business understanding, have data that must be gathered and cleaned, and apply data science algorithms, “CRISP-DM provides strong guidance for even the most advanced of today’s data science activities” (Vorhies, 2016).
  • Common Sense: When students were asked to do a data science project without project management direction, they “tended toward a CRISP-like methodology and identified the phases and did several iterations.” Moreover, teams that were trained and explicitly told to implement CRISP-DM performed better than teams using other approaches (Saltz, Shamshurin, & Crowston, 2017).
  • Adoptable: Like Kanban, CRISP-DM can be implemented without much training, organizational role changes, or controversy.
  • Right Start: The initial focus on Business Understanding helps align technical work with business needs and steers data scientists away from jumping into a problem without properly understanding business objectives.
  • Strong Finish: Its final phase, Deployment, likewise addresses important considerations for closing out the project and transitioning to maintenance and operations.
  • Flexible: A loose CRISP-DM implementation can be flexible enough to provide many of the benefits of agile principles and practices. By accepting that a project starts with significant unknowns, the user can cycle through the phases, each time gaining a deeper understanding of the data and the problem. The empirical knowledge learned in earlier cycles then feeds into the following cycles.
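The iterative flavor described in the last bullet can be sketched in code. The following is a minimal illustration only: the phase bookkeeping, the scoring formula, and the stopping rule are all hypothetical placeholders, not part of any real CRISP-DM tooling.

```python
# Minimal sketch of cycling through the CRISP-DM phases until the
# evaluation looks good enough. Every function and value here is a
# hypothetical placeholder, not a real library or project.

def run_crisp_dm_cycle(max_iterations=3, target_score=0.8):
    """Cycle through the phases, carrying knowledge between cycles."""
    insights = {}  # empirical knowledge carried from cycle to cycle
    for i in range(1, max_iterations + 1):
        insights["objectives"] = f"refined objectives (cycle {i})"  # Business Understanding
        insights["profile"] = f"profiled data (cycle {i})"          # Data Understanding
        insights["dataset"] = f"prepared data (cycle {i})"          # Data Preparation
        insights["model"] = f"model v{i}"                           # Modeling
        score = 0.5 + 0.2 * i  # stand-in for a real evaluation metric
        insights["score"] = score                                   # Evaluation
        if score >= target_score:
            break  # good enough for this round: proceed to Deployment
    return insights

result = run_crisp_dm_cycle()
print(result["model"], result["score"])
```

The point of the sketch is the shape of the loop, not the placeholder logic: each pass revisits every phase, and what was learned in one pass (here, the `insights` dictionary) is available to the next.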

Weaknesses and Challenges

In a controlled experiment, students who used CRISP-DM “were the last to start coding” and “did not fully understand the coding challenges they were going to face” – Saltz, Shamshurin, & Crowston, 2017
  • Rigid: On the other hand, some argue that CRISP-DM suffers from the same weaknesses of Waterfall and encumbers rapid iteration.
  • Documentation Heavy: Nearly every task has a documentation step. While documenting one’s work is key in a mature process, CRISP-DM’s documentation requirements might unnecessarily slow the team from actually delivering increments.
  • Not Modern: Counter to Vorhies’ argument for the sustained relevance of CRISP-DM, others argue that CRISP-DM, as a process that pre-dates big data, “might not be suitable for Big Data projects due to its four V’s” (Saltz & Shamshurin, 2016).
  • Not a Project Management Approach: Perhaps most significantly, CRISP-DM is not a true project management methodology because it implicitly assumes that its user is a single person or small, tight-knit team and ignores the teamwork coordination necessary for larger projects ( Saltz, Shamshurin, & Connors, 2017 ).

Recommendations

For a more comprehensive set of recommendations, see the data science process post.

CRISP-DM is a great starting point for those who are looking to understand the general data science process. Five tips to overcome these weaknesses are:

  • Iterate quickly: Don’t fall into a waterfall trap by trying to perfect each horizontal layer of the project before moving on. Rather, think vertically and deliver thin vertical slices of end-to-end value. Your first deliverable might not be very useful. That’s okay. Iterate.
  • Document enough…but not too much: If you follow CRISP-DM precisely, you might spend more time documenting than doing anything else. Do what’s reasonable and appropriate but don’t go overboard.
  • Don’t forget modern technology: Add steps to leverage cloud architectures and modern software practices like git version control and CI/CD pipelines to your project plan when appropriate.
  • Set expectations: CRISP-DM lacks communication strategies with stakeholders. So be sure to set expectations and communicate with them frequently.
  • Layer in a coordination framework: CRISP-DM is not a project management approach, so pair it with a team coordination framework such as Data Driven Scrum or Kanban.
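The “thin vertical slice” tip can be made concrete: rather than perfecting data preparation before touching a model, ship a crude but complete pipeline and improve it on the next pass. The sketch below uses a toy dataset and a mean-baseline “model”; every name and value is illustrative, not from a real project.

```python
from statistics import mean

# Toy end-to-end slice: ingest -> prepare -> model -> evaluate.
# A deliberately crude first iteration; later cycles would swap in
# a real data source and a real learning algorithm.

raw_rows = ["3.1", "2.9", "", "3.4", "bad", "3.0"]  # messy raw input

def prepare(rows):
    """Data preparation: keep only parsable numeric values."""
    cleaned = []
    for r in rows:
        try:
            cleaned.append(float(r))
        except ValueError:
            continue  # drop unparsable entries for now
    return cleaned

def fit_baseline(values):
    """Modeling: a mean baseline is the simplest possible predictor."""
    return mean(values)

def evaluate(prediction, values):
    """Evaluation: mean absolute error of the baseline."""
    return mean(abs(v - prediction) for v in values)

values = prepare(raw_rows)
model = fit_baseline(values)
error = evaluate(model, values)
print(f"baseline={model:.2f} mae={error:.3f}")
```

Even this trivial slice exercises data preparation, modeling, and evaluation in one deliverable, giving stakeholders something to react to before any single phase is polished.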

What are other CRISP-DM Alternatives?

A few years prior to the publication of CRISP-DM, SAS developed Sample, Explore, Modify, Model, and Assess (SEMMA). Although designed to help guide users through the tools in SAS Enterprise Miner for data mining problems, SEMMA is often considered a general data mining methodology. SEMMA’s popularity has waned, with only 1% of respondents in our 2020 poll stating that they use it.

Compared to CRISP-DM, SEMMA is even more narrowly focused on the technical steps of data mining. It skips over the initial Business Understanding phase from CRISP-DM and instead starts with data sampling processes. SEMMA likewise does not cover the final Deployment aspects. Otherwise, its phases somewhat mirror the middle four phases of CRISP-DM. Although potentially useful as a process to follow data mining steps, SEMMA should not be viewed as a comprehensive project management approach.
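The comparison above can be written out as an explicit phase mapping. The correspondence below is our approximation based on the description in this section, not an official mapping from SAS or the CRISP-DM consortium.

```python
# Rough mapping from each SEMMA phase to the CRISP-DM phase it most
# closely resembles (our reading, not an official correspondence).
semma_to_crisp = {
    "Sample":  "Data Understanding",
    "Explore": "Data Understanding",
    "Modify":  "Data Preparation",
    "Model":   "Modeling",
    "Assess":  "Evaluation",
}

crisp_phases = ["Business Understanding", "Data Understanding",
                "Data Preparation", "Modeling", "Evaluation", "Deployment"]

# CRISP-DM phases that SEMMA leaves uncovered
uncovered = [p for p in crisp_phases if p not in semma_to_crisp.values()]
print(uncovered)  # → ['Business Understanding', 'Deployment']
```

As the list comprehension confirms, SEMMA covers only the middle of the life cycle, leaving Business Understanding and Deployment to be handled some other way.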

See the main article for SEMMA

KDD and KDDS

Dating back to 1989, Knowledge Discovery in Databases (KDD) is the general process of discovering knowledge in data through data mining: the extraction of patterns and information from large datasets using machine learning, statistics, and database systems. There are different representations of KDD, with perhaps the most common having five phases: Selection, Pre-Processing, Transformation, Data Mining, and Interpretation/Evaluation. Like SEMMA, KDD is similar to CRISP-DM but more narrowly focused, excluding the initial Business Understanding and final Deployment phases.

In 2016, Nancy Grady of SAIC published the Knowledge Discovery in Data Science (KDDS) process, describing it “as an end-to-end process model from mission needs planning to the delivery of value”. KDDS specifically expands upon KDD and CRISP-DM to address big data problems, and it provides some additional integration with management processes. KDDS defines four distinct phases: assess, architect, build, and improve, and five process stages: plan, collect, curate, analyze, and act.

KDD tends to be an older term that is less frequently used. KDDS never had significant adoption.

See the main article for  KDD and Data Mining Process .

Where can I learn more?

  • Blog Post: What is a Data Science Life Cycle?
  • Blog Post: What is a Data Science Workflow?
  • Blog Post: What is the Data Science Process?
  • Blog Post: Steps to Define an Effective Data Science Process
  • Blog Post: CRISP-DM for Data Science  – 5 Actions to Consider
  • Blog Post: CRISP-DM is still the most Popular Framework
  • Blog Post: Data Science vs Software Engineering
  • Explore the Consulting services to learn CRISP-DM and other processes
  • (external): Official CRISP-DM Guide

