Data mining vs. machine learning
"Data mining" and "machine learning" are separate communities that, among other things, go to different conferences (I didn't meet any data mining people at NIPS). I would like to understand why, since the central idea is the same: do statistics in order to answer queries.
Today I chatted with a PhD student in data mining about how his field differs from machine learning and statistics. What he said can be summarized as follows:
(1) data mining concerns huge data sets, where issues like memory management, indexing, and data summaries are important
(2) data mining cares a great deal about "pre-processing" (which apparently can include problems such as named-entity recognition and co-reference resolution)
(3) data mining cares about structured objects
(4) data mining cares about catering to users who are unable or unwilling to type their queries in a formal language
(5) data mining only cares about making "systems that work" (in quotes because this is very ambiguous)
(6) data mining doesn't have a very high standard for reproducibility
Here are my thoughts:
(1) machine learning deals with such issues too, though some of them perhaps only recently. Online learning (the idea of keeping (near-)sufficient statistics that can be updated incrementally as each data point arrives) has surely been around for a good while.
(2) I really don't understand this comment.
(3) lots of machine learning is about structured objects! This is my favorite kind, and includes almost all of NLP and bioinformatics.
(4) I guess this means they are reaching towards HCI, specifically natural-language-like interfaces.
(5) no comment.
(6) I guess this is probably because the data they work with is largely proprietary, since one of their applications is business intelligence. Nevertheless, I wonder what it's like to work in a field where the research is not reproducible. If your results can't be reproduced, what do people cite you for?
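To make the parenthetical in (1) concrete, here is a minimal sketch of keeping sufficient statistics that are updated online. It uses Welford's algorithm for a running mean and variance, which is my choice of illustration and not something the student mentioned. The class name `RunningStats` and the sample values are made up for the example. The point is that the data set is never held in memory: each point is seen once and then discarded.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean/variance via Welford's algorithm (hypothetical helper)."""

    n: int = 0          # number of points seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Constant-time, constant-memory update for one new data point.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; defined as 0.0 for fewer than two points.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Illustrative stream; in practice this would be read lazily from disk or a socket.
stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

The triple (`n`, `mean`, `m2`) is sufficient for fitting a Gaussian, so a model like this can be trained on a stream of any length with fixed memory.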