Concept Mining

What is Concept Mining?

Concept mining is a discipline related to data mining and text mining, and as such a sub-discipline of artificial intelligence and statistics, which are themselves sub-disciplines of computer science and mathematics.

There is a strong connection between concept mining and linguistics. The idea of concept mining is best described in relation to text mining. Text mining is the discipline of extracting information from a document, such as an auto-generated précis or a subject classification, using statistics and inferences drawn from the words in the document. Concept mining attempts to do the same kinds of things using the concepts in the document.


Traditionally, the conversion of words to concepts has been performed using a thesaurus, and computational techniques tend to do the same. The thesauri used are either specially created for the task or based on a pre-existing language model, usually related to Princeton's WordNet.
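
The thesaurus lookup described above can be sketched in a few lines of Python. This is only an illustration: the THESAURUS table, its concept labels, and the function name words_to_concepts are all invented here and do not come from WordNet or any real thesaurus. Note that a word such as "bank" maps to more than one candidate concept.

```python
# Hypothetical toy thesaurus: every entry and concept label is invented
# for illustration and is not taken from WordNet or any real resource.
THESAURUS = {
    "cat": ["FELINE_ANIMAL"],
    "feline": ["FELINE_ANIMAL"],
    "mat": ["FLOOR_COVERING"],
    "rug": ["FLOOR_COVERING"],
    "bank": ["FINANCIAL_INSTITUTION", "RIVER_EDGE"],
}


def words_to_concepts(text):
    """Return (word, candidate concepts) pairs for each word found in the thesaurus."""
    result = []
    for word in text.lower().split():
        # Strip simple punctuation so that "mat." matches "mat".
        word = word.strip(".,;:!?'\"")
        if word in THESAURUS:
            result.append((word, THESAURUS[word]))
    return result


if __name__ == "__main__":
    print(words_to_concepts("The cat sat on the mat."))
    print(words_to_concepts("She walked to the bank."))
```

Words that are missing from the table are simply skipped in this sketch; a real system would need a policy for unknown words.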

The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context, where available, to disambiguate the various meanings of a given piece of text. Machine translation systems cannot easily infer context, and this gives rise to some of the marvelous howlers such systems generate.

For the purposes of concept mining, however, these ambiguities tend to be less important than they are in machine translation. For large documents the ambiguities tend to even out, much as is the case with text mining. There are also many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information inferred from large text corpora.
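
As a rough sketch of disambiguation by association, the Python below picks the candidate concept whose associated words overlap most with the surrounding text. The ASSOCIATIONS and CANDIDATES tables are hypothetical and hand-written; in practice such associations would be inferred from a large corpus, which this example does not attempt.

```python
# Hypothetical association data: words assumed to occur often near each concept.
# In practice these would be inferred from a large corpus, not written by hand.
ASSOCIATIONS = {
    "FINANCIAL_INSTITUTION": {"money", "loan", "account", "deposit"},
    "RIVER_EDGE": {"river", "water", "fishing", "shore"},
}

# Hypothetical candidate concepts for an ambiguous word.
CANDIDATES = {"bank": ["FINANCIAL_INSTITUTION", "RIVER_EDGE"]}


def disambiguate(word, context_words):
    """Pick the candidate concept sharing the most associated words with the context."""
    candidates = CANDIDATES.get(word, [])
    if not candidates:
        return None
    context = {w.lower() for w in context_words}
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=lambda c: len(ASSOCIATIONS.get(c, set()) & context))


if __name__ == "__main__":
    print(disambiguate("bank", "he sat fishing by the river bank".split()))
    print(disambiguate("bank", "she opened an account at the bank".split()))
```

Counting overlapping words is the simplest possible scoring rule; weighting by association frequency would be a natural refinement.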

The Concept Mining Workflow

Concept mining involves not only a concept list but also a process named the concept mining workflow. The goal of this workflow is to enable the concrete linking of evidence and ideas through the exploration of documents to reach a useful conclusion.

      Query page > Search Results page > Concept List > Search Results page > Concept List > Search Results page > and so on...

By progressing from a query to a search results list to a concept list, you begin to discover:

  • Knowledge related to your need
  • Relationships between things
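
One turn of this query, results, concept list loop might look like the following Python sketch. Everything in it is an assumption made for illustration: the three-document collection, the CONCEPT_OF table, and the functions search, concept_list and words_for are hypothetical and do not describe any particular product.

```python
from collections import Counter

# Hypothetical mini-collection and word-to-concept table, invented for illustration.
DOCUMENTS = [
    "senate debates navy ship procurement budget",
    "navy orders new ship from shipyard",
    "local shipyard hires workers",
]
CONCEPT_OF = {
    "senate": "POLITICS", "budget": "POLITICS",
    "navy": "MILITARY", "ship": "VESSEL", "shipyard": "INDUSTRY",
    "procurement": "PURCHASING", "orders": "PURCHASING",
    "workers": "LABOUR", "hires": "LABOUR",
}


def search(query_words):
    """Search results: documents sharing at least one word with the query."""
    query = set(query_words)
    return [doc for doc in DOCUMENTS if query & set(doc.split())]


def concept_list(results):
    """Concept list: concepts found in the results, most frequent first."""
    counts = Counter()
    for doc in results:
        for word in doc.split():
            if word in CONCEPT_OF:
                counts[CONCEPT_OF[word]] += 1
    return counts.most_common()


def words_for(concept):
    """All words the table maps to a concept, used to build the next query."""
    return sorted(w for w, c in CONCEPT_OF.items() if c == concept)


if __name__ == "__main__":
    results = search(["procurement"])           # query > search results
    print(concept_list(results))                # search results > concept list
    print(search(words_for("PURCHASING")))      # concept list > search results
```

Following the PURCHASING concept brings in a second document that the original one-word query missed, which is the kind of related knowledge the workflow is meant to surface.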

Within the concept mining workflow, it is important to consider the following items:

  • Data within the collection
  • Dimensions of the subject

It is most useful to search relevant data. For example, if you wanted to find out about "military procurement within a political context" it would be more useful to search a collection of newspaper articles than a collection of political science books, though the latter may also prove fruitful.

In the same way, you need to carefully consider the dimensions of the subject matter for which you want to search. Constructing the query carefully is the most important task when beginning to mine concepts. Longer, more specific queries generally work better. For example, if you were looking for information about the "political organizations in Montana," a good query would include the names of actual political organizations and/or the names of towns where the headquarters of political organizations are located.


Text mining models tend to be very large. A model that attempts to classify, for instance, news stories using Support Vector Machines or the Naïve Bayes algorithm will be very large, in the megabytes, and thus slow to load and evaluate. Concept mining models can be minute in comparison - hundreds of bytes.

For some applications, such as plagiarism detection, concept mining offers new possibilities. Where the plagiariser has been cunning enough to perform a thesaurus-based substitution that will fool text comparison algorithms, the concepts in a document will be relatively unchanged. So 'the cat sat on the mat' and 'the feline squatted on the rug' appear very different to text mining algorithms, and nearly identical to concept mining algorithms.
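
The cat and feline example can be made concrete with a small Python sketch that compares the two sentences first by their words and then by their concepts, using Jaccard similarity (shared items divided by total distinct items). The CONCEPT_OF table, the STOP_WORDS list and the choice of Jaccard similarity are assumptions made here for illustration, not a description of any particular plagiarism detector.

```python
# Hypothetical word-to-concept table covering only the example sentences.
CONCEPT_OF = {
    "cat": "FELINE", "feline": "FELINE",
    "sat": "SIT", "squatted": "SIT",
    "mat": "FLOOR_COVERING", "rug": "FLOOR_COVERING",
}
# Small hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "on", "a", "an"}


def content_words(text):
    """Lower-cased words of the text with stop words removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def jaccard(a, b):
    """Shared items divided by total distinct items; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def word_similarity(text1, text2):
    """Similarity measured on the surface words."""
    return jaccard(content_words(text1), content_words(text2))


def concept_similarity(text1, text2):
    """Similarity measured on concepts; unknown words stand for themselves."""
    concepts1 = {CONCEPT_OF.get(w, w) for w in content_words(text1)}
    concepts2 = {CONCEPT_OF.get(w, w) for w in content_words(text2)}
    return jaccard(concepts1, concepts2)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    rewritten = "the feline squatted on the rug"
    print(word_similarity(original, rewritten))     # 0.0
    print(concept_similarity(original, rewritten))  # 1.0
```

With this toy table the two sentences share no content words at all, yet map to exactly the same set of concepts.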
