Here is a highlight of the text mining process in a simple text classification experiment using LightSIDE. LightSIDE accepts input data in .csv format. The process in general is outlined below:
Data Preparation -> Feature Extraction -> Model Generation
Data Preparation
Data preparation is the first step, in which the data is cleaned and reformatted to bring it into a usable form. In classification tasks, some data must be annotated or labeled so that it can be used for training the model. A rule of thumb is to start with at least 1000 instances, of which 200 should be used as development data for qualitative analysis, 700 for training the model, and 100 for testing it.
Get a sense of the development data set aside in the step above: write down examples of complex, difficult-to-classify statements and examine their structure, and identify nuances that must be taken into account when representing the data. A minimal sketch of the split is given below.
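The sketch below shows the 700/200/100 split in Python, assuming the labeled data sits in a file named reviews.csv with columns text and label (the file and column names are illustrative, not part of the LightSIDE workflow):

```python
import pandas as pd

# Shuffle once so the three slices are random but reproducible.
df = pd.read_csv("reviews.csv").sample(frac=1, random_state=42)

train = df.iloc[:700]      # training data
dev   = df.iloc[700:900]   # development data for qualitative analysis
test  = df.iloc[900:1000]  # held-out test data

train.to_csv("train.csv", index=False)
dev.to_csv("dev.csv", index=False)
test.to_csv("test.csv", index=False)
```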
Feature Extraction
LightSIDE offers both basic and advanced feature extraction. Some of the features that can be extracted include n-grams, part-of-speech tags, line length, feature counts, stopword removal, stemming, stretchy patterns, and regular expressions.
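LightSIDE configures these features through its GUI. Purely as a rough analogue, here is a sketch of unigram and bigram extraction in Python with scikit-learn (not part of the LightSIDE workflow; the token pattern is an assumption for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie was REALLY great!", "A dull, passionless disaster."]

# Unigrams and bigrams, keeping punctuation as its own token and preserving
# case, since upper case can signal sentiment intensity.
vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    token_pattern=r"[\w']+|[.,!?;]",
    lowercase=False,
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```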
Model Generation: model generation, evaluation, and selection follow once the features have been extracted.
An example of data preparation in the context of sentiment analysis is given here. A set of 10 movie reviews pre-classified as positive or negative was analyzed to understand the nuances relevant to feature extraction. Observations are as follows:
Positive movie reviews contain positive words and negative reviews contain negative words, although some positive reviews also contain negative words and vice versa. Words like ‘so’, ‘really’, and ‘very’ appear in bigrams to emphasize the sentiment. Upper case is used to indicate the degree of sentiment. Some unigrams and bigrams are used multiple times, so counts could prove to be good indicators.
A list of words, classified according to the degree and type of sentiment, was extracted:
Strong positive: extremely attractive, captivated, really + [positive word], great job, perfectly, very + [positive word], fabulous, enthralled, [capitalized positive word]
Mild positive: romance, enjoyable, lovable, charming, interesting, entertaining, well
Mild negative: wrong, dull, bad, shallow, silliness, immaturity, passionless, not so well, eccentric
Strong negative: hate, [capitalized negative word], disaster, completely mismarketed, worst, awful, terrible, really + [negative word], incompetent, preposterous
Based on the above observations, the following rules have been devised for movie review classification; a code sketch of these rules follows the list. The performance of these rules can give pointers to the characteristics of a good classifier.
- Rule 1: Contains any strong positive word: True implies Positive Review
- Rule 2: Contains any mild positive word: True implies Positive Review
- Rule 3: Contains any mild negative word: False implies Positive Review
- Rule 4: Contains any strong negative word: False implies Positive Review
- Rule 5: [Contains positive words AND no strong negative words] OR [count of positive words > count of negative words]: True implies Positive Review
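As referenced above, here is a minimal sketch of Rules 1-5 in Python. The word lists are abridged from the observations above and purely illustrative; a usable lexicon would be far larger.

```python
import re

# Abridged, illustrative lexicons drawn from the word lists above.
STRONG_POS = {"captivated", "fabulous", "enthralled", "perfectly"}
MILD_POS   = {"romance", "enjoyable", "lovable", "charming", "entertaining"}
MILD_NEG   = {"wrong", "dull", "bad", "shallow", "passionless"}
STRONG_NEG = {"hate", "disaster", "worst", "awful", "terrible", "preposterous"}

def rules(review: str) -> dict:
    words = set(re.findall(r"[a-z']+", review.lower()))
    pos = words & (STRONG_POS | MILD_POS)
    neg = words & (STRONG_NEG | MILD_NEG)
    return {
        "rule1": bool(words & STRONG_POS),  # True implies positive
        "rule2": bool(words & MILD_POS),    # True implies positive
        "rule3": bool(words & MILD_NEG),    # False implies positive
        "rule4": bool(words & STRONG_NEG),  # False implies positive
        # Rule 5: positive words with no strong negatives, or more positive
        # words than negative ones.
        "rule5": (bool(pos) and not (words & STRONG_NEG)) or len(pos) > len(neg),
    }

print(rules("A charming, enjoyable romance, perfectly cast"))
```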
The decision table is given below (T = True, F = False):

| Review | Rule 1 | Rule 2 | Rule 3 | Rule 4 | Rule 5 |
| --- | --- | --- | --- | --- | --- |
| Positive Review 1 | T | T | F | F | T |
| Positive Review 2 | T | T | F | F | T |
| Positive Review 3 | T | T | T | T | T |
| Positive Review 4 | T | T | T | T | T |
| Positive Review 5 | T | F | T | F | T |
| Negative Review 1 | F | T | T | T | F |
| Negative Review 2 | F | F | T | T | F |
| Negative Review 3 | F | T | T | F | F |
| Negative Review 4 | F | T | F | T | F |
| Negative Review 5 | F | T | T | F | T |
Based on the above table, we notice that strong words are good classifiers, as there is little room for ambiguity when they are used. On the contrary, mild words are more context-specific. Sentiment analysis is not as simple as counting positive and negative words: context matters, and sometimes sentiment is expressed indirectly.
The process we engaged in, then, was: we read the texts, looked at the words, checked whether those words themselves were good predictors, and made a list of the words we saw.
We also saw that context matters and that reviewers use rhetorical strategies: positive and negative language gets mixed, and sometimes negativity is expressed so indirectly that we need a few more words around it, usually about three or four. This tells us roughly what window around a word we might need in order to capture some context.
Model Generation:
Here we are classifying movie reviews as positive or negative. We already have some review instances which we have manually classified as positive or negative. Using these as the training data, we will build a model and use it to classify the unlabeled reviews.
We already have the data in the required .csv format, so data cleaning and reformatting have already been taken care of. An image of sample data is given here: column A is the movie review and column B is the label, positive or negative.
The following logistic regression models, each with 10-fold cross-validation and varying features, were generated.
Model 1 (‘logit__1grams‘): a simple model was generated with the following basic features:
- Unigrams
- Include Punctuation
The model performance is as follows: Accuracy = 0.77, Kappa = 0.55
Confusion Matrix

| Actual / Predicted | Neg | Pos |
| --- | --- | --- |
| Neg | 116 | 34 |
| Pos | 34 | 116 |
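As a sanity check, the kappa follows from the accuracy here: the classes are balanced (150 reviews each), so chance agreement is 0.5 and kappa = (0.773 - 0.5) / (1 - 0.5) ≈ 0.55. For readers who want to approximate Model 1 outside LightSIDE, a minimal scikit-learn sketch follows; the file and column names ("train.csv", "text", "label") are illustrative, and LightSIDE's exact pipeline may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train.csv")  # illustrative file and column names

# Unigrams with punctuation kept, fed to a logistic regression classifier.
pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 1), token_pattern=r"[\w']+|[.,!?;]"),
    LogisticRegression(max_iter=1000),
)

# 10-fold cross-validated predictions, then accuracy and kappa.
pred = cross_val_predict(pipe, df["text"], df["label"], cv=10)
print("accuracy:", accuracy_score(df["label"], pred))
print("kappa:   ", cohen_kappa_score(df["label"], pred))
```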
Model 2 (‘logit__1grams_stretch’): another model was generated with the following features:
- Stretchy pattern was selected.
- Pattern Length was 2-4 and pattern gap was 1-2.
- Since we wanted to get more context specifically around positive and negative words, we used categories for positive and negative words in the patterns we built.
The model performance is as follows: Accuracy = 0.78, Kappa = 0.56
Confusion Matrix

| Actual / Predicted | Neg | Pos |
| --- | --- | --- |
| Neg | 117 | 33 |
| Pos | 33 | 117 |
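For intuition about what stretchy patterns capture, here is an illustrative sketch: length-2 patterns with a 1-2 token gap, where sentiment words are replaced by their category. The GAP token and the tiny word lists are assumptions for the example; LightSIDE's actual stretchy patterns are more general (lengths 2-4, richer gap handling).

```python
# Tiny illustrative category lists; a real run would use full lexicon files.
POS_WORDS = {"great", "charming", "enjoyable"}
NEG_WORDS = {"dull", "awful", "disaster"}

def categorize(tok: str) -> str:
    if tok in POS_WORDS:
        return "[POS]"
    if tok in NEG_WORDS:
        return "[NEG]"
    return tok

def stretchy_bigrams(text: str, min_gap: int = 1, max_gap: int = 2) -> list:
    """Pairs of tokens separated by a small gap, with categories substituted."""
    toks = [categorize(t) for t in text.lower().split()]
    feats = []
    for i, left in enumerate(toks):
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1
            if j < len(toks):
                feats.append(f"{left} GAP {toks[j]}")
    return feats

# 'could have been a great movie' yields patterns like 'been GAP [POS]',
# capturing the negative-in-context signal that unigram features miss.
print(stretchy_bigrams("it could have been a great movie"))
```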
On comparing both models, we find that there is no significant improvement.
Error Analysis
Using frequency and feature weight to explore the instances which were actually negative but were classified as positive, we note the following features adding a lot of weight toward false-positive classification:
- The feature ‘between’ adds a lot of weight toward the positive class, and its frequency is also very high, although it does not have any positive connotation.
- The feature ‘great’, though it has a positive connotation, carries a negative sense in some documents depending on the context, e.g. ‘it could have been a great movie’ or ‘great disaster’.
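This kind of weight inspection can be approximated outside LightSIDE by sorting a fitted logistic model's coefficients alongside feature frequencies. A minimal sketch follows (file and column names are again illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")  # illustrative file and column names
vec = CountVectorizer()
X = vec.fit_transform(df["text"])
clf = LogisticRegression(max_iter=1000).fit(X, df["label"])

# coef_[0] holds weights toward clf.classes_[1]; high-weight, high-frequency
# features with no real positive connotation (like 'between' above) are the
# ones worth investigating.
names = vec.get_feature_names_out()
w = clf.coef_[0]
freq = np.asarray(X.sum(axis=0)).ravel()
for i in np.argsort(w)[-10:][::-1]:
    print(f"{names[i]:20s} weight={w[i]:+.3f} freq={freq[i]}")
```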
Model 3: another model was generated with the following features:
- Stretchy pattern was selected.
- Pattern Length was 2-4 and pattern gap was 1-2.
- Since we wanted to get more context specifically around positive and negative words, we used categories for positive and negative words in the patterns we built.
- The category files for positive and negative terms have been updated to additionally include the terms below.
Negative words: absurd, ugly, lack, weirdest, let-down, sucked, poorly, horrid, uninteresting, disaster, hate, etc.
Positive words: very well, superb, infinitely superior, glorious, outstanding, brilliant, perfect, splendid
The model performance is as follows: Accuracy = 0.8, Kappa = 0.6
Confusion Matrix

| Actual / Predicted | Neg | Pos |
| --- | --- | --- |
| Neg | 121 | 29 |
| Pos | 30 | 120 |
Model Comparison
The newly generated model shows a highly significant improvement (p = 0.006**, t = -2.743) over the basic model. Therefore we select it as our model of choice.
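LightSIDE reports this significance test directly. As a sketch of the general technique (not necessarily the exact test LightSIDE runs), one can compare per-fold accuracies of two models with a paired t-test over identical folds; the two pipelines below are simplified stand-ins for the Model 1 and Model 3 feature configurations.

```python
import pandas as pd
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train.csv")  # illustrative, as in the earlier sketches

# Stand-ins: the real Model 3 uses stretchy patterns with an expanded lexicon.
pipe_m1 = make_pipeline(CountVectorizer(ngram_range=(1, 1)),
                        LogisticRegression(max_iter=1000))
pipe_m3 = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))

cv = KFold(n_splits=10, shuffle=True, random_state=42)  # same folds for both
s1 = cross_val_score(pipe_m1, df["text"], df["label"], cv=cv)
s3 = cross_val_score(pipe_m3, df["text"], df["label"], cv=cv)

t, p = ttest_rel(s1, s3)  # paired t-test over per-fold accuracies
print(f"t = {t:.3f}, p = {p:.4f}")
```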