Talk:Data mining/2009


2009

Data mining vs Information extraction

Pdfpdf added this to the top of the article; I have moved it here as it seems more like a discussion topic.

Note: "Data mining" is a quite different process to "Information extraction"

Dmmd123 (talk) 00:00, 9 January 2009 (UTC)

It was intended to be a "disambiguation-type" entry.
It is quite common for those not involved to assume that "Data mining" means "Information extraction".
The article makes no mention of this common incorrect assumption, hence I added the "hat-note".
Although my edit was perhaps not the best way to address the issue, I disagree that it is a "discussion topic".
What is a better way to do it? Cheers, Pdfpdf (talk) 12:21, 9 January 2009 (UTC)
As there has been no response or discussion, I have reinstated my edit. Pdfpdf (talk) 10:44, 13 January 2009 (UTC)
There are two reasons I do not think this should go at the top of the article.
1. Linking at the top of the page like that is normally for disambiguation - when someone searches for one thing but wanted the other. At the very least it should conform to the style guide.
2. The reason that confusion arises is that this article poorly defines datamining. I think we both agree Datamining and Information extraction are two quite distinct things. This article basically gives a whole lot of fairly cloudy examples of where datamining might be used:
It has been suggested that both the Central Intelligence Agency and the Canadian Security Intelligence Service have employed this method.
When there are much clearer examples where datamining has been used.
Dmmd123 (talk) 23:05, 13 January 2009 (UTC)
Regarding your second reason:
Yes, I agree with your second reason, that "I think we both agree Datamining and Information extraction are two" quite different things. I also agree that "this article poorly defines datamining" and also that the examples given are indeed, (to be generous), "a bit vague".
However, I'm not quite sure how that relates to the hat-note.
(Unless you are implying something like: "If the article was clear, then we wouldn't need the hat-note".) If that is your intent, then I don't quite agree - the hat-note says it before the reader starts on the article; even if the article were clear, the reader would still have to read the article before they realised what is already stated in the hat-note.
Regarding your first reason:
I thought I had already addressed that. I agree that: "Linking at the top of the page like that is normally for disambiguation - when someone searches for one thing but wanted the other." As I said above, that is my intention here - i.e. someone comes here thinking that "data mining" is a synonym for "information extraction". The first thing they encounter is a line saying: "Data mining" is a quite different process to "Information extraction".
My view is that I don't see any point in slavishly conforming to the style guide when strictly following the guide is not applicable; this certainly conforms to the intent of the style guide. Further, the guide is a guide, it is not a set of rules.
Your thoughts? Cheers, Pdfpdf (talk) 10:15, 14 January 2009 (UTC)
Carrying on from my original question "What is a better way to do it?", it would seem that this is the answer. Pdfpdf (talk) 12:36, 14 January 2009 (UTC)
Why does the article seem to assume that data == personal data? Particularly, it says "As more data are gathered, with the amount of data doubling every three years", whereas I can think of a number of scenarios where the amount of data is not doubled every three years. If data mining is specific to a type of data, the article should say that - my personal understanding (which could be wrong of course) is that data mining is the act of representing seemingly random data in a meaningful form in an attempt to highlight patterns for a number of purposes. So for example you may take data (temperature, humidity, air pressure) from hundreds of weather stations and 'mine' it to produce a forecast, or to better understand how weather works. Am I incorrect here? 90.208.217.227 (talk) 22:44, 9 December 2009 (UTC)


I'm not out to start a religious war here, but it's my understanding that knowledge can only exist between two ears; by "definition", machines can not create "knowledge" - only people can create "knowledge". Machines can only turn data (and information) into information.
What are other people's opinions? Pdfpdf (talk) 09:06, 14 January 2009 (UTC)

In my experience the Datamining/KDD field uses a fairly precise vocabulary. Information is a term which is normally avoided as it is too vague, it can describe a fact, a pattern, piece of knowledge. Normally Datamining algorithms are described as returning patterns, this is stated in the opening sentence. The sentence in question states that data mining is becoming an increasingly important tool to transform this data into (information/knowledge). Datamining transforms data into knowledge through the KDD process (Knowledge Discovery in Databases) - which should be explained in the article as KDD redirects here. The final step of KDD is for a human to interpret the datamining patterns into knowledge - the stuff between our two ears. Perhaps this sentence needs to be clarified, but Datamining is definitely a tool which assists in the creation of knowledge. Dmmd123 (talk) 22:42, 14 January 2009 (UTC)
Thanks for replying - I find it useful to read a second opinion.
Let's do the easy one first: I completely agree that "Datamining is definitely a tool which assists in the creation of knowledge."
(Conversely, I vehemently disagree that datamining creates knowledge - fortunately for me, you didn't say that.)
KDD redirects here, as does Knowledge Discovery in Databases. I'm afraid that's not helpful in telling me what KDD is "supposed" to mean, or if (or how) it is different to datamining (or not), so I can't make any comment about KDD.
"In my experience the Datamining/KDD field uses a fairly precise vocabulary." - My experience has been more varied than yours - i.e., some papers and texts are indeed quite specific; however, I have also come across others that are, to be generous, "vague and non-specific". So that doesn't help much either.
"Information is a term which is normally avoided as it is too vague, it can describe a fact, a pattern, piece of knowledge." - Is normally avoided by whom? I could accurately make the same comment about data, knowledge and dozens of other terms (and provide a mountain of supporting evidence). I'm not sure what point you are trying to make here.
"Normally Datamining algorithms are described as returning patterns" - Agreed.
"this is stated in the opening sentence" - Agreed.
"The sentence in question states that data mining is becoming an increasingly important tool to transform this data into (information/knowledge)." - Agreed. However, I would be more comfortable if the sentence said "important tool to assist in the transformation of this data".
"Datamining transforms data into knowledge through the KDD process (Knowledge Discovery in Databases) - which should be explained in the article as KDD redirects here." - Whoa Nelly! You make a number of points here.
  • "Datamining transforms data into knowledge through the KDD process" - I have my doubts here, not the least of which is the first one that I don't know what KDD is supposed to be, but my first reaction (based on ignorance) is that I sincerely doubt that KDD "transforms data into knowledge". (However, I would have no problem with "KDD assists with the transformation of data into knowledge".)
  • "which should be explained in the article" - Agreed. In fact: Strongly agreed !!
  • "as KDD redirects here" - Is KDD really an indistinguishable synonym for data mining? Somehow, I doubt it - what would be the point of yet-another-name when the perfectly good name "data mining" already exists? i.e. I suspect that KDD must have some characteristics of its own which distinguish it from data mining. Again, it would seem that a definition of KDD would be useful ...
"The final step of KDD is for a human to interpret the datamining patterns into knowledge - the stuff between our two ears." - That is what I would expect, which therefore, in the terminology that I have been using, suggests that KDD creates information, not knowledge. By-the-way, I'm not trying to split hairs and be pedantic here - I'm trying to work out what other people mean by "information" and "knowledge".
"Perhaps this sentence needs to be clarified" - I would classify that sentence as a major understatement! ;-) My biased opinion is that, until such clarification is achieved, the conversation will go around in circles.
Thank you for the food for thought. Cheers, Pdfpdf (talk) 12:39, 15 January 2009 (UTC)
Let me add that I also use the term knowledge to refer to what is "between the ears"--a human construction. Information, a term I find no less precise than knowledge, is what can be externalized by humans and transferred to other humans (who construct their knowledge). I do not say that only humans can construct knowledge, but it is much better to say that both machines and people are processing information and that the end result of human processing is called "knowledge" and the end result of (all current) machines is further information. -- Preceding unsigned comment added by Robotczar (talk • contribs) 22:15, 12 May 2010 (UTC)

KDD vs DM?

Is there a difference between KDD and DM?
What is the difference between KDD and DM?

I've been looking for a "good" definition of KDD.

One that keeps cropping up is: Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data" (Frawley, W.J., Piatetsky-Shapiro, G., and Matheus, C. Knowledge Discovery In Databases: An Overview. In Knowledge Discovery In Databases, eds. G. Piatetsky-Shapiro and W. J. Frawley, AAAI Press/MIT Press, Cambridge, MA., 1991, pp. 1-30).

Personally, I'm not sure if or how that is different from Data mining.

Yet, in [1], Peggy Wright says (circa 1997) that:

  • in Fayyad, U.M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining To Knowledge Discovery: An Overview. In Advances In Knowledge Discovery And Data Mining, eds. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press/The MIT Press, Menlo Park, CA., 1996, pp. 1-34.
  • a clear distinction between data mining and knowledge discovery is drawn. Under their conventions, the knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information. This information is not typically retrievable by standard techniques but is uncovered through the use of AI techniques.

That was 10 years ago. Has anything changed?
What do others think?
Can anyone supply links to current definitions of DM & KDD?
Thanks, Pdfpdf (talk) 05:18, 26 January 2009 (UTC)

If you read the full article from Fayyad you will see that he goes on to talk about KDD as an overall process and DM as a particular step in that process. This might help: http://seclab.cs.ucdavis.edu/projects/misuse/meetings/KDD.html
In current colloquial usage the term DM is commonly used to mean KDD, although this is incorrect. (No source apart from my own opinion on that one.)
131.170.90.2 (talk) 02:42, 28 January 2009 (UTC) actually by me but not logged in Dmmd123 (talk) 04:28, 29 January 2009 (UTC)
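
As a rough illustration of the distinction described above (KDD as the overall process, data mining as one step within it), here is a minimal Python sketch. The function names, the toy shopping-basket records and the pair-counting used as the "mining" step are all assumptions made purely for illustration; they are not taken from Fayyad et al. or from any particular tool.

```python
from collections import Counter
from itertools import combinations


def select(records):
    """Selection: keep only records that contain any items at all."""
    return [r for r in records if r.get("items")]


def preprocess(records):
    """Preprocessing: clean the data (here, strip whitespace and lowercase names)."""
    return [{"items": [i.strip().lower() for i in r["items"]]} for r in records]


def transform(records):
    """Transformation: reduce each record to a sorted list of distinct items."""
    return [sorted(set(r["items"])) for r in records]


def mine(baskets, min_support=2):
    """Data mining step: item pairs that co-occur in at least min_support baskets."""
    counts = Counter(pair for b in baskets for pair in combinations(b, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def kdd(records, min_support=2):
    """The overall process; mine() is only one step of it.

    The final KDD step, interpreting the patterns, is left to a human reader.
    """
    return mine(transform(preprocess(select(records))), min_support)


# Hypothetical toy data, invented for this example.
records = [
    {"items": ["Bread", "milk "]},
    {"items": ["bread", "Milk", "eggs"]},
    {"items": []},
    {"items": ["eggs", "tea"]},
]
patterns = kdd(records)
print(patterns)
```

Note that the sketch stops at returning patterns; deciding whether a pattern is useful or meaningful is the interpretation step that the comments above place with a human.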

"longitudinal changes"??

What the heck is "longitudinal changes"?? 65.13.73.143 (talk) 15:18, 2 March 2009 (UTC)

Too obvious

This sentence has twice been removed because it is too obvious: "However, while it can be used to uncover hidden patterns in data that have been collected, it can neither uncover patterns that are not already present in the data, nor can it uncover patterns in data that have not been collected."

Pdfpdf undid it the first time because: a) If I didn't think it added value, I wouldn't have put it in there. b) "unnecessary" is subjective. c) You would be surprised (appalled?) by how the obvious isn't obvious to those who don't think.

I have just added it again on the basis of argument C. I think this is an important sentence as it clarifies that datamining is not some crystal ball which can magically tell you the future - it is a data analysis technique. It also helps clarify the distinction between data collection and datamining.

Dmmd123 (talk) Revision as of 18:10, 21 March 2009

As for argument C -- in the articles on databases, we don't say "Obviously a database can't store data that doesn't exist". I agree with you that a lot of people seem to lack common sense, but I don't feel that this justifies overloading wikipedia with each piece of common sense knowledge that our brains contain. If some people are too stupid to realize that a piece of software can't manipulate data that doesn't exist, then that should be their problem, and shouldn't become a problem for everyone else here who is not stupid. The intelligent reader should not have to go through droves of commonsense knowledge to learn about a topic they might not understand. Anyhow, that's why I have removed it. Unfortunately, I am going to have to be away from wikipedia for a few weeks here, so I won't be able to participate further in this discussion until I return, but I just wanted to throw that in before I left. --- Jrtayloriv (talk) 18:10, 23 March 2009 (UTC)
I mostly agree with most of what you say, but you are only addressing a small part of the issue. You don't seem to be addressing the issue that Dmmd123 has raised above, nor the issue that I have raised below. As I said below, perhaps what's there doesn't do the job it is intended to be doing well enough, but the solution to that is not to remove it; the solution is to change it so that it does the job better. Pdfpdf (talk) 22:15, 23 March 2009 (UTC)
I agree entirely with Jrtayloriv. This is not an "important point" for the very simple reason that it is not a point at all. For something to be a point, it has to be in some way controvertible. The statement in question is not. Pdfpdf's argument a (with all due respect) is not an argument. Every single person who contributes to wikipedia in good faith can make the statement that they would not have contributed if they didn't think what they were writing added value; that does not impact at all on the question of whether their additions actually *do* add value. Likewise, argument b is a basically meaningless statement. Yes, "unnecessary" is subjective. It's one of the common, even standard, subjective terms with which we all routinely work. One hopes to use it in a way that most people will also, subjectively, find reasonable. Jrtayloriv points to the fact that most reasonable people would find the statement in question unnecessary. S/he has already addressed argument c.

In response to Dmmd123, and your point below, Pdfpdf, that many people have magical thinking when it comes to data-mining, I am making a change in the article itself. Peace. -- Preceding unsigned comment added by 156.56.192.250 (talk) 06:21, 3 April 2009 (UTC)

Purpose of the lead paragraph

The purpose of the lead paragraph is to summarise the article and highlight the important points.
The issue should not be whether the statement is obvious or not; the issue should be whether the statement summarises an important point.
Both Dmmd123 and I think it is an important point. The question is, "Does it summarise it?" I suspect the answer may be, "Not very well."
All contributions towards improving the summary are welcome. Complete removal of the summary, without replacing it with something better, is not welcome. Pdfpdf (talk) 09:03, 21 March 2009 (UTC)

Links

External links, in the External links section or not, should follow the appropriate policies and guidelines. Usually this means WP:EL, WP:SPAM, and WP:NOTLINK. Sometimes these links are just references are not formatted properly, so WP:RS, WP:SELFPUB, and related policies and guidelines apply. --Ronz (talk) 16:28, 7 April 2009 (UTC)

>External links ... should follow the appropriate policies and guidelines.
Yes. And these do. If you disagree, please state specifically what the problem(s) is/are.
>Sometimes these links are just references are not formatted properly
I'm sorry, I don't understand.
Also, you say, "please discuss". Certainly. What do you want to discuss? Cheers, Pdfpdf (talk) 16:39, 7 April 2009 (UTC)

In the Data mining tools and vendors section, the Gartner study acts as the ref for the listing of the vendors and products. Adding offsite links directly to the products should be avoided. It would be better to find additional third party articles about those entries and create their own articles (should be enough out there for those three to easily establish notability to support their own articles). --- Barek (talk • contribs) - 17:31, 7 April 2009 (UTC)

That sounds good to me. (i.e. I agree with you.) --Pdfpdf (talk) 13:57, 8 April 2009 (UTC)

Misleading edit comment

Regarding this edit and this edit, your edit comment is misleading.
You have said: "unsourced - comes off as an advertisement"
This directly implies: "If you supply a source for this, and if you make it more objective and less like an advertisement, then it will be an acceptable contribution."
This is misleading.

The real problem with that bit of text is that it is not an example of data mining.

Even if it were perfectly written and compliant with all wiki standards and guidelines, someone with knowledge of the field would have quickly removed it, because it is simply irrelevant. An edit comment that implies that if the contributor "fixes it up" it will be OK is quite misleading for the contributor. If the contributor does "fix it up", they will have every justification for being annoyed when their work is once again removed, this time for the real reason.

--Pdfpdf (talk) 00:28, 11 April 2009 (UTC)


I don't want to sound ungrateful, but this and other edits currently being performed by User:Ronz are a level of intervention which, at this stage of the development of the article, is proving to be counter-productive. This is because this intervention is getting in the way of the development of the article, and distracting people from the job of developing the article.

As an analogy, it is like someone has walked into a car repair shop and insisted that a car's flat tyres be inflated, without considering whether the tyres have sufficient tread on them to be used safely. Yes, the tyres will need inflating before the car is roadworthy, but let's make sure that we have the right tyres on the car first, and then that the tyres are in roadworthy condition, before we think about inflating them.

So User:Ronz, what do you think of the idea of waiting a few months until the contributors have got their facts straight, and then coming along to address the wiki-concerns? Given that you are complaining that you are busy and don't have the time to do the job properly anyway, I would have thought such a suggestion would be attractive to you. Pdfpdf (talk) 00:50, 9 April 2009 (UTC)

Please follow WP:NPA, WP:BATTLE, and WP:TALK. Thanks! --Ronz (talk) 02:15, 9 April 2009 (UTC)
For someone who professes to know everything about everything, and is always right, you have appalling manners.
To use your style of address: "Please follow WP:CIVIL and WP:AGF."
You do not appear to have any interest in any opinion but your own. When someone, (and looking at your talk page, there have been many. Very many.), points this out to you, you hide behind your self-righteous self-opinion, and delete their contributions. Of the dozen or so questions I have asked you, you have only answered one.
For your information, YOU are the ONLY person who cares if you are busy. Your "busy-ness" is YOUR choice, YOUR problem, and YOUR responsibility. Not mine. Not anyone else's.
If you are "too busy" to edit wikipedia, then there is a simple solution - DON'T spend your "valuable time" editing wikipedia.
For your further information:
It is extremely rude to alter another editor's contributions, and thus change the intent of their statements and hence misrepresent their arguments. Don't do it.
It is slanderous to make false, misleading and unsubstantiated accusations.
It is cowardice to not take responsibility for one's own statements, and to hide behind a string of vague, non-specific generalisations.
It is far from useful to nitpick about the trivial detail of the finished product when the basic foundations are being laid.
It is inappropriate and unproductive to make uninformed irrelevant comments about a topic where you demonstrate you have no knowledge.
And in any society, it is completely inappropriate to insult people who have been elected by their peers to represent them.
I have made (at least) five attempts to politely and logically bring this information to your attention. My five attempts have NEVER involved ANY aspects of the non-specific alphabet soup you seem to take great solace in quoting. In doing so, I have asked you a number of questions in order to better understand your concerns. Of those five attempts, you have reverted three without either comment or explanation, to another you have replied "I don't have time for this", and to the one above, you have responded with alphabet soup.
Now, I will repeat myself by saying, your edits are disruptive, uninformed, negative, and NOT useful.
Despite your view of the world, they are NOT adding value.
In fact, not only are they not adding value, but they are distracting people from making useful contributions to the development of this article, and wasting their time by forcing them to deal with issues that, at this stage of development of the article, are irrelevant.
Further, as I have politely described above, they are misleading.
If you are too busy and don't have the time to inform yourself properly and do a proper job of editing here, then please, DON'T edit here. Go somewhere else where you have knowledge of the domain and can make a POSITIVE contribution. I will point out that you have not made ONE POSITIVE contribution here - EVERY one of your contributions here has been negative.
If any of the above is NOT clear to you, please ask for clarification - I am only too happy to reply to specific questions about specific issues. --Pdfpdf (talk) 14:08, 9 April 2009 (UTC)

Discussion: The nature of useful resources in the fields of Data Mining and KDD

It seems there are problems with some of the useful resources in the "External links" section and other places in the article, and they have been removed. Sadly, some of them contravene wikipedia guidelines.

This is a problem in the field of Computer Science due to the nature of the field. Unlike many fields that have an established body of reliable literature, and where new information arises infrequently and in small increments, Computer Science is rapidly evolving on a large number of fronts. Often the best, and the only, sources of information are on sites which, for other reasons, contravene the guidelines.

So on WP we find ourselves in the situation of "throwing the baby out with the bathwater", because the guidelines require the whole thing to be thrown out, not just "the bathwater".

I can not immediately think of a solution to this problem.
I invite and welcome positive contributions and discussion towards the solution of this problem.
Also, this problem cannot be unique to this page. Pointers to how other communities address this situation would be useful. --Pdfpdf (talk) 01:47, 11 April 2009 (UTC)

List of relevant wikipedia guidelines

The following is a list of pointers to relevant WP guidelines. If you feel others are relevant, please add them to the list.
(This list is sorted by the wikipedia shortcut.)

Discussion

Useful links for finding sources

Ignoring the WP:COI problems for the moment, the following two links should be removed from the article per WP:ELNO #4 & 13 and WP:NOTLINK. They certainly are useful for finding references that could be used to verify information currently in the article and for future expansion.

--Ronz (talk) 17:29, 14 April 2009 (UTC)

Thanks for the info - will try to reply before midnight "real soon", but not today I'm afraid. Pdfpdf (talk) 15:41, 15 April 2009 (UTC)