Text Mining Overview

Text mining can be defined as the analysis of semi-structured or unstructured text data. The goal is to turn text information into numbers so that data mining algorithms can be applied. Text mining is an interdisciplinary field that incorporates data mining, web mining, information retrieval, information extraction, computational linguistics and natural language processing.

Text mining transforms textual data into a structured format using several techniques. It involves information retrieval techniques (such as identifying and collecting the textual data sources), NLP techniques (such as part-of-speech tagging and syntactic parsing), information extraction (such as entity/concept extraction, which identifies named features like people, places and organizations), data mining (such as establishing relationships between entities/concepts, and pattern and trend analysis) and visualization techniques.

Information retrieval (IR) systems: The first step in the text mining process is to find the body of documents that are relevant to the research question(s). An IR system allows us to narrow down the set of documents to those relevant to a specific problem. The best-known IR systems are search engines such as Google. IR systems can speed up the analysis significantly by reducing the number of documents to be analyzed.

Natural language processing (NLP) analyzes text in terms of structures based on human speech. It is the study of human language with the aim of enabling computers to understand natural languages as humans do. It allows the computer to perform a grammatical analysis of a sentence in order to “read” the text.

NLP is concerned with both natural language generation (NLG) and natural language understanding (NLU).

NLG uses some level of underlying linguistic representation of the text to make sure that the generated text is grammatically correct and fluent. Most NLG systems include a syntactic realizer, which ensures that grammatical rules such as subject-verb agreement are obeyed, and a text planner, which decides how to arrange sentences, paragraphs and other parts coherently. The best-known NLG application is machine translation.

NLU computes a representation of the meaning of a text, which essentially restricts the discussion to the domain of computational linguistics. An NLU system consists of at least one of the following components: tokenization, morphological or lexical analysis, syntactic analysis and semantic analysis. In tokenization, a sentence is segmented into a list of tokens. Morphological or lexical analysis is a process in which each word is tagged with its part of speech. Syntactic analysis is the process of assigning a syntactic structure, or parse tree, to a given natural language sentence. Semantic analysis is the process of translating the syntactic structure of a sentence into a semantic representation, that is, a precise and unambiguous representation of the meaning expressed by the sentence.
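The tokenization step can be illustrated with a minimal Python sketch. The `tokenize` function and its regular expression are assumptions made for illustration; real NLU systems typically use language-specific tokenizers that handle contractions, abbreviations and similar cases.

```python
import re

def tokenize(sentence):
    """Segment a sentence into a list of word and punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("Text mining turns text into numbers.")
print(tokens)
# ['Text', 'mining', 'turns', 'text', 'into', 'numbers', '.']
```

The later stages (part-of-speech tagging, parsing, semantic analysis) would each take this token list as their input.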

Information extraction (IE) involves structuring the data that the NLP system generates. It is the process of automatically extracting structured information from unstructured and/or semi-structured text documents. An IE system identifies entities such as the names of people, companies and locations, along with their attributes and the relationships between entities. It does this by pattern recognition.

IE deals with the extraction of specified entities, events and relationships from unrestricted text sources. The goal is to find specific data or information in natural language texts. Unlike information retrieval (IR), which concerns how to identify relevant documents from a document collection, IE produces structured data ready for post-processing, which is crucial to many text mining applications.
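As a rough illustration of extraction by pattern recognition, the sketch below pulls two kinds of entity out of free text and returns them as a structured record. The entity types (e-mail addresses and ISO-style dates), the `PATTERNS` table and the `extract_entities` function are assumptions chosen to keep the example short; extracting people, companies and locations generally needs much richer rules or trained models.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(text):
    """Return a dict mapping each entity type to the matches found in text."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

text = "Contact jane.doe@example.com before 2024-03-15 or write to hr@example.org."
print(extract_entities(text))
# {'email': ['jane.doe@example.com', 'hr@example.org'], 'date': ['2024-03-15']}
```

The output is structured data that can be passed straight to post-processing, which is the distinction from information retrieval drawn above.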

Data mining (DM): Data mining refers to finding relevant information or discovering knowledge in large volumes of data. It attempts to discover statistical rules and patterns automatically from data.

Data Mining vs Text Mining vs Web Mining

               Search                   Discover
Structured     Data Retrieval           Data Mining
Unstructured   Information Retrieval    Text Mining

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Web mining, in contrast, involves extracting patterns from web sources, which are structured, as opposed to the free, unstructured text input of text mining.

As approximately 80 percent of corporate data is estimated to be in unstructured text format, text mining is considered to have higher value than data mining.

Approaches or objectives of Text Mining

Text mining can be summarized as a process of converting the unstructured content into a structured format and quantifying it. In simple terms, all words found in the input documents will be indexed and counted resulting in a matrix of frequencies. Thereafter all standard statistical and data mining techniques can be applied to derive dimensions or clusters of words or documents, or to identify “important” words or terms that best predict another outcome variable of interest.
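The indexing and counting described above can be sketched in a few lines of Python. The `frequency_matrix` function and the two sample documents are illustrative assumptions; splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

def frequency_matrix(documents):
    """Count every word in every document and return (vocabulary, matrix)."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # The vocabulary is every distinct word, sorted so column order is stable.
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per word; a Counter returns 0 for absent words.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = frequency_matrix(["the cat sat", "the dog sat down"])
print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(matrix)      # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each row is now a numeric vector, so standard statistical and data mining techniques can be applied to it.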

There are various specialties within text mining that have different objectives.  Some of the technologies that have been developed and can be used in the text mining process are information extraction, topic tracking, summarization, categorization, clustering, concept linkage, information visualization, and question answering. In the following sections we will discuss each of these technologies and the role that they play in text mining.

  • Text categorization: assigning documents to pre-defined categories (e.g. by decision tree induction). When categorizing a document, a computer program will often treat the document as a “bag of words.” It does not attempt to process the actual information as information extraction does. Rather, categorization only counts the words that appear and, from the counts, identifies the main topics that the document covers.
  • Text clustering: descriptive activity, which groups similar documents together (e.g. self-organizing maps). It differs from categorization in the sense that documents are clustered on the fly instead of through the use of predefined topics.
  • Concept mining: Concept linkage tools connect related documents by identifying their commonly-shared concepts and help users find information that they perhaps wouldn’t have found using traditional searching methods. It promotes browsing for information rather than searching for it.
  • A topic tracking system works by keeping user profiles and, based on the documents the user views, predicts other documents of interest to the user.
  • Text summarization is immensely helpful for trying to figure out whether or not a lengthy document meets the user’s needs and is worth reading for further information.
  • Visual text mining, or information visualization puts large textual sources in a visual hierarchy or map and provides browsing capabilities, in addition to simple searching.
  • Question answering (Q&A) deals with how to find the best answer to a given question.
  • Association rule mining (ARM) is a technique used to discover relationships among a large set of variables in a data set.
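To make association rule mining a little more concrete, the sketch below computes support and confidence, two measures commonly used to rank candidate rules. Treating each document as a set of terms, along with the function names and the sample data, is an assumption made for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Each document is reduced to the set of terms it contains (hypothetical data).
docs = [
    {"text", "mining"},
    {"text", "mining", "data"},
    {"data", "mining"},
    {"text", "analysis"},
]
print(support(docs, {"text", "mining"}))       # 0.5
print(confidence(docs, {"text"}, {"mining"}))  # 2 of the 3 "text" documents, about 0.667
```

A rule such as "text implies mining" would be kept only if both values exceed thresholds chosen by the analyst.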


Text mining begins with a collection of documents, which are then preprocessed before useful patterns and knowledge can be discovered from them. The steps are outlined below:

Document Collection: Can be manual or automatic, such as via a Web crawler or database query. Documents must be organized and reformatted into a similar format.

Text Preprocessing as per below steps:

  1. Tokenization: Fragment the text into items that can be counted.
  2. Stemming: Identify common core word fragments (e.g. “act,” “acts,” “acting” and “acted” all become “act”).
  3. Eliminate stop words: Create a dictionary of words with low predictive value (e.g. a, an, the) and remove them.
  4. Case normalization: Convert all text to lower case so that the same word with different capitalization is not counted separately.
  5. Eliminate punctuation: Remove punctuation so that the same word with and without punctuation is not counted separately.
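The preprocessing steps above can be sketched as a small pipeline. Everything named here is an assumption for illustration: `STOP_WORDS` is a tiny sample list, and `crude_stem` is a deliberately naive suffix stripper rather than an established stemming algorithm such as Porter's. The steps are also applied in a slightly different order from the list (case and punctuation first), which makes the token comparison simpler.

```python
import string

STOP_WORDS = {"a", "an", "the", "is", "and", "of"}  # tiny illustrative list

def crude_stem(word):
    """Strip a few common suffixes; a naive stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                               # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                             # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stop-word removal
    return [crude_stem(t) for t in tokens]                            # stemming

print(preprocess("The actor acts, acted and is acting."))
# ['actor', 'act', 'act', 'act']
```

Note that "actor" is left unchanged by this naive stemmer; deciding which word forms should collapse together is exactly what a proper stemming algorithm handles.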

Text transformation: Reduce dimensionality and select features. A document is represented by the words it contains, using one of the approaches below:

  • Bag of words
  • Vector spaces
  1. Term Document Matrix (TDM): Create a two-dimensional matrix in which each row is a document and each column is a term from the abbreviated list generated by cleaning up the text. The relationship between a row and a column is represented by an index value. Singular Value Decomposition (SVD) can then be used to expose the underlying meaning and structure by reducing the dimensionality. It is related to principal components analysis.
  2. Indices: At the simplest level, the index can be the count, that is, the number of times a term appears in a document. Log or binary frequencies can be used to dampen the effect of large numbers of occurrences. The most commonly used index is the inverse document frequency (IDF), usually combined with the term frequency as TF-IDF. It represents the relative importance of a term, reflecting both how often the term occurs in a document and how many documents it occurs in.
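A minimal sketch of one such index is given below. It uses one common TF-IDF variant, raw term count multiplied by log(N / document frequency); other variants add smoothing or normalize by document length. The `tf_idf` function and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

weights = tf_idf(["text mining", "data mining", "text text analysis"])
for row in weights:
    print(row)
```

With this weighting, a term that occurs in every document receives a weight of zero, while a term confined to a single document (such as "analysis" here) receives the highest weight per occurrence.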

Extract knowledge

At this point text mining becomes data mining. Data mining methods such as clustering, classification and information retrieval can be used for modeling. Some of the common modeling techniques include:

  • Classification: Assign documents to a predetermined set of categories. A training data set of documents with known categories is used. This is used in genre detection, spam filtering and Web page categorization.
  • Clustering: A technique of unsupervised learning. Clustering divides data into groups of similar objects. Each cluster consists of objects that are similar to one another and dissimilar to objects in other groups.
  • Association: Identify terms that are frequently found together. This is known as market basket analysis in the retail industry, where it identifies items that are bought together.
  • Trend analysis: Identify time dependent changes in a term. This is used in identifying rising popularity of technologies.
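As one example of the classification technique, the sketch below trains a small multinomial Naive Bayes model for spam filtering. Naive Bayes is only one of many possible classifiers and is chosen here as an assumption because it is short to write; the training messages, labels and function names are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("project status meeting", "ham"),
])
print(classify("free prize offer", model))  # spam
print(classify("status meeting", model))    # ham
```

In practice the input would first go through the preprocessing and weighting steps described earlier, and the model would be evaluated on documents held out from training.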


Due to the rapidly expanding amount of text-based information, text mining is already applied regularly in areas such as spam filtering, fraud detection, sentiment analysis, and the identification of trends and authorship.

Some applications in specific business areas are described below.

Analyzing open-ended survey responses. In surveys, insights into customers’ views and opinions may be discovered by analyzing the words or terms that respondents commonly use in answers to open-ended questions.

Automatic processing of messages, emails. Another common application for text mining is to aid in the automatic classification of texts.

  • This can be used to “filter” out “junk email” based on certain terms or words that are not likely to appear in legitimate messages.
  • Messages can be routed automatically to the most appropriate department or agency.

HR management: Text mining techniques can be used to manage human resources by analyzing staff’s opinions, monitoring the level of employee satisfaction, as well as managing profiles for the recruitment of new personnel.

Analyzing warranty or insurance claims, diagnostic interviews. In cases such as warranty claims or initial medical interviews, the majority of the information is collected in open-ended, textual form. This information can be usefully exploited to identify common clusters of problems or complaints about certain products or, in symptom descriptions, useful clues for the actual medical diagnosis.

Investigating a domain by crawling web sites. Automatic processing of the contents of Web pages can efficiently deliver valuable business intelligence about products, competitors, partners, the economic outlook and so on. For example, you could go to a Web page and begin “crawling” the links you find there to process all the Web pages that are referenced. This yields a list of the terms and documents available at that site, enabling you to quickly determine the most important terms and features described there.

Sentiment analysis or opinion mining. Sentiment analysis plays a major role in customer buying decisions. A large number of people share their opinions about a product or service on social media, in reviews, on blogs and so on, which in turn influences the buying behavior of others. Sentiment analysis can therefore be used to learn about consumer attitudes and trends.

The goal of sentiment analysis is to determine the writer’s outlook. That outlook may stem from the knowledge he or she possesses, his or her emotional state while writing, or the emotional effect the writer intends to have on the reader.

Sentiment analysis (SA) can be performed on the polarity of a document or text, or on an opinion at the sentence or entity-feature level, to find out whether the opinion expressed is positive, negative or neutral.

Further, sentiment classification can also be based on the emotional state expressed by the writer (such as glad, dejected or annoyed).

Sentiment analysis can also be based on whether the opinions expressed by a writer are objective or subjective. Sentiment analysis identifies the phrases in a text that bear some sentiment. The author may speak about objective facts or subjective opinions, and it is necessary to distinguish between the two. SA also finds the subject towards which the sentiments are directed: a text may contain many entities, but it is necessary to find the entity towards which the sentiments are directed. Finally, it identifies the polarity and degree of the sentiment. Sentiments are classified as objective (facts), positive (denoting a state of happiness, bliss or satisfaction on the part of the writer) or negative (denoting a state of sorrow, dejection or disappointment on the part of the writer).

Another way of capturing sentiments is a scoring method, in which sentiments are given a score based on their degree of positivity, negativity or objectivity. In this method a piece of text is analyzed, and the concepts it contains are then analyzed to identify the sentiment-bearing words and how these words relate to the concepts. Each concept is then given a score based on the relation between the sentiment words and the associated concepts.
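A much simplified version of the scoring idea is sketched below. It scores a whole text rather than individual concepts, and it ignores negation and context. The `LEXICON` word list and its scores are hypothetical values chosen for illustration, not taken from any published sentiment lexicon.

```python
import re

# Hypothetical mini-lexicon: positive scores for positive words, negative for negative.
LEXICON = {
    "good": 1, "great": 2, "happy": 1,
    "bad": -1, "terrible": -2, "disappointed": -1,
}

def sentiment_score(text):
    """Sum the lexicon scores of all words in the text; unknown words score 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def polarity(text):
    """Map the numeric score to a positive, negative or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The service was great but the food was bad"))  # positive
print(polarity("Terrible, I am disappointed."))                # negative
print(polarity("The parcel arrived on Monday"))                # neutral
```

Scoring at the concept level, as described above, would additionally require linking each sentiment word to the entity or concept it refers to.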

Collaborative learning process analysis:  Collaboration is “the mutual engagement of participants in a coordinated effort to solve a problem together”.

The social interaction of computer supported collaborative learners has been regarded as a “gold mine of information” (Henri 1992, p. 118) on how learners acquire knowledge and skills together.

Analyzing the different facets of learners’ interaction is a time consuming and effortful process. Improving automated analyses of such highly valued processes of collaborative learning by using recent text classification technologies would make it a less arduous task to obtain insights from corpus data.

This endeavor also holds the potential for improving on-line instruction both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context sensitive collaborative learning support on an as-needed basis.

Assessing participation by counting the number of student contributions, together with an in-depth understanding of the different qualities of interaction, would enable instructors to gauge the strengths and weaknesses of the course design. Barros and Verdejo (Barros et al., 1999) analyzed students’ online newsgroup conversations and computed values for initiative, creativity, elaboration and conformity.

Inaba & Okamoto (Inaba et al., 1997) implemented a system that used a finite state machine to determine the level of coordination taking into account the flow of conversation of the group participants.


