Talk:Data mining/2012

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Introduction is inconsistent and confuses data mining with KDD

In the beginning of the intro it is stated: "The goal of data mining is to extract knowledge from a data set in a human-understandable structure[2] and involves database and data management, data preprocessing, model and inference considerations"

Later on: "Neither the data collection, data preparation nor result interpretation and reporting are part of the data mining step, but do belong to the overall KDD process as additional steps." — Preceding unsigned comment added by 129.206.66.132 (talk) 12:11, 7 March 2012 (UTC)

Unfortunately, KDD and data mining (as the analysis step of KDD) are used inconsistently throughout the literature, something Wikipedia will not ultimately resolve. This goes all the way up to "CRISP-DM", which specifies the KDD process as the "Data Mining process". So to resolve your confusion, just consider "data mining" to be the vague something it is, while "data mining step" and "data mining process" are slightly more precise. I'll try to improve this by using the term "data mining process" in the introduction. --Chire (talk) 17:32, 7 March 2012 (UTC)

Removed tag from Further reading section

Hello all. The Further reading section was tagged as {{Copy edit-section|for=wikification, too much literature, most covering subdomains only. Literature spam|date=September 2011}}. I've removed this in order not to draw general (non-expert) copy editors to it, because I believe only a topic expert can make good decisions about what reading to include in that section and what to exclude. (I don't think the section is too long per se, but that's up to you). If you'd like to re-tag so as to attract an editor able to deal with this issue, perhaps another tag would be more suitable? --Stfg (talk) 17:01, 1 June 2012 (UTC)

Medical Mining

I'm trying to figure out where to put a section on medical data mining. US Supreme Court rulings authorize the data mining of pharmacy records and allow that information to be resold under the 1st Amendment.

Link: http://articles.latimes.com/2011/jun/24/nation/la-na-court-drugs-20110624

Twillisjr (talk) 03:50, 9 September 2012 (UTC)

Should we add SEMMA to the Process section?

Should we add SEMMA to the Process section? On the one hand, yes, it is a process model, but on the other hand, it's been proposed and adopted as a standard by only one vendor (SAS). What do people think? Karl (talk) 14:04, 14 November 2012 (UTC)

I found several independent sources for SEMMA and a comparison of it and CRISP-DM, so I added it to this page. If other people feel differently about this addition, please edit it. The SEMMA page received an orphan-page tag today also, so adding a link to it on this page also helps resolve that problem. FYI, I added a link to SEMMA on the CRISP-DM page also. Karl (talk) 16:49, 15 November 2012 (UTC)

I'm wondering how much SEMMA is really about data mining in the scientific sense of the word. For all I see, it probably fits better into the Business Intelligence part that loves to label itself with the buzzword "data mining". IMHO an article on SEMMA is not needed, but it should be merged into some SAS Institute Inc. related larger article. Clearly, SAS Enterprise Miner is a superset of SEMMA, and doesn't have an article yet. This is like discussing the steering wheel without having an article on what a car is. --Chire (talk) 14:16, 16 November 2012 (UTC)

Including material about R's increasing popularity & Satisfaction ratings of tools

At the bottom of the lists of open-source and commercial data mining tools, I inserted a sentence that indicates which are the most widely used tools; and for the commercial tools, I also noted which have received the highest satisfaction ratings. I want to openly state that I have a complete COI, since I am the primary author of the research I cite to support these statements. However, when I try to look objectively at this Wikipedia page, I think that the inclusion of this information will be very useful to readers, especially to readers who are not familiar with the large number of tools that are available. So, in good faith I've made these additions to data mining. If other Wikipedians think the material should be removed, I will not argue. I am simply putting it out there for others to evaluate. When reviewing this material I encourage people to look at the value of the material being included on the page, and not simply to remove it with a black/white interpretation of COI. Thank you. Karl (talk) 15:08, 27 November 2012 (UTC)

I don't find information about software popularity to be particularly helpful in this article, so I removed it again. - MrOllie (talk) 15:52, 27 November 2012 (UTC)

I like how you said that, MrOllie. I agree with your statement that "If it is important someone else will add it." So I agree that, due to my COI, it is better for me not to add this information.

However, I disagree with your statement about software popularity not being helpful to the article. It is my personal view that an encyclopedia entry on a technology topic is enhanced by knowing which software has been more widely adopted. But perhaps I am too close to this topic, so I will leave it to others to evaluate and add material if they think it would be helpful. I found two other published sources (that I have no COI with) that speak to the wide and growing adoption of R. One is in the NY Times, and the other is in Java Developers Journal. I will leave my thoughts here on the TALK page, and if other people feel the material is worth adding, they can add it.

This is the material I suggest adding after the list of open-source tools:

R is the most popular open-source data mining tool.^[1] ^[2] In 2010, the R language overtook commercial tools to become the most widely used analytic tool among data miners (43% of data miners reported using it).^[3]

This is the material I suggest adding after the commercial tool list:

Among the commercial tools, SAS, IBM/SPSS, and StatSoft software are the most widely used. And STATISTICA Data Miner, IBM SPSS Modeler, and Salford Systems tools have received the strongest satisfaction ratings from data miners in recent surveys.^[4]^[5]

And this is the bullet point I suggest adding to the Marketplace survey list:

Rexer Analytics Data Miner Surveys^[4]^[3]

I once again want to openly state that I recognize I have a COI with the above material, so I leave it to others to decide if it is useful to add or not. Karl (talk) 16:23, 27 November 2012 (UTC)

I can't think of any encyclopedic reason to add the material. I think it best to follow WP:RECENTISM in such situations. --Ronz (talk) 18:42, 27 November 2012 (UTC)

Thank you. I am a novice wikipedia editor, and hadn't seen WP:RECENTISM before. I like the perspectives expressed there, and will try to keep them in mind during my wikipedia writing. In this case, I agree that, on the one hand, the material is recent. However, on the other hand, my interpretation of WP:RECENTISM is that it is OK to include small amounts of recent material -- it's just that we don't want large amounts of recent material in an entry to overwhelm the stable agreed-upon and historical information in a wikipedia entry. E.g., in United States presidential election we don't want the bulk of the material to be on the most recent election. However, it is OK for the entry to briefly mention the most recent election. In my view, this material about the adoption rate of data mining tools is only a couple of sentences, and would be OK. I also feel that the ideas of WP:RECENTISM are balanced by the usefulness of informing readers (briefly) about the current state of tool usage. Over the past decade, open source tools in general have seen increasing adoption. This multi-year trend has been the case with the R programming language as well. While I have never used R myself, it would be a COI for me to put the following info in data mining (and it would be excessive and therefore violate both good judgement and WP:RECENTISM), but other wikipedians evaluating this TALK page discussion might be interested to see the following trend in R adoption that we have seen in our five surveys:

23% of data miners reported using R in the 2007 Survey (N=314)
36% of data miners reported using R in the 2008 Survey (N=348)
38% of data miners reported using R in the 2009 Survey (N=710)
43% of data miners reported using R in the 2010 Survey (N=735)
47% of data miners reported using R in the 2011 Survey (N=1,319)

Karl (talk) 19:42, 27 November 2012 (UTC)

As KDNuggets has found the same in their surveys, it might be an option to cite them. However, is it of encyclopedic relevance what the current favorite toy is? Much of R's popularity might come from the recent trend of trying to solve everything by matrix factorization. A few years ago, it probably was Weka. In a year or two, it might be Hadoop/Mahout, for example. We don't try to cover all that there is on the Interwebs, but try to keep focused. This article should focus on the scientific aspects IMHO, and I see little value in putting tool rankings into Wikipedia. There are websites such as yours for that. --Chire (talk) 23:52, 27 November 2012 (UTC)

[1] Vance, Ashlee (2009-01-06). "Data Analysts Captivated by R's Power". New York Times. Retrieved 2009-04-28. "R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca..."
[2] David Smith (2012); R Tops Data Mining Software Poll, Java Developers Journal, May 31, 2012.
[3] Karl Rexer, Heather Allen, & Paul Gearan (2010); 2010 Data Miner Survey Summary, presented at Predictive Analytics World, Oct. 2010.
[4] Karl Rexer, Heather Allen, & Paul Gearan (2011); 2011 Data Miner Survey Summary, presented at Predictive Analytics World, Oct. 2011.
[5] Karl Rexer, Heather Allen, & Paul Gearan (2011); Understanding Data Miners, Analytics Magazine, May/June 2011 (INFORMS: Institute for Operations Research and the Management Sciences).
