Transcript of Zizka immm 2012

Page 1: Zizka immm 2012

IMMM-2012, October 21-26, 2012, Venice, Italy

Jan Žižka and František Dařena

Department of Informatics, FBE, Mendel University in Brno

Brno, Czech Republic

[email protected], [email protected]

Parallel Processing of Very Many Textual Customers’ Reviews Freely Written Down in Natural Languages

Page 2: Zizka immm 2012

One typical kind of contemporary data is text written in various natural languages.

Among other things, textual data very often also represents subjective opinions, meanings, sentiments, attitudes, views, and ideas of the text authors – and we can mine this from the textual data to obtain knowledge: text mining.

The following slides deal with customer opinions evaluating hotel services.

We can see such data, for example, on web sites such as booking.com, or elsewhere.

Page 3: Zizka immm 2012

Discovering knowledge that is hidden in collected very large real-world textual data in various natural languages:

Page 4: Zizka immm 2012

Text-mining of very many documents written in natural languages is limited by the computational complexity (time and memory) and computer performance.

Most common users can use only ordinary personal computers (PCs) – no supercomputers.

A regular solution: having just standard processors (even multicore ones) and some gigabytes of RAM, the whole data set has to be divided into smaller subsets that can be processed in parallel.
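
A minimal sketch of such a division, assuming the reviews are available as a plain Python list; process_subset is only a placeholder for whatever mining step (for example, generating a decision tree) is applied to each subset:

```python
from multiprocessing import Pool

def process_subset(subset):
    # Placeholder for the real text-mining step applied to one subset
    # (for example, generating a decision tree from its reviews).
    return len(subset)

def split_into_subsets(reviews, subset_size):
    # Divide the whole review list into consecutive subsets of the chosen size.
    return [reviews[i:i + subset_size] for i in range(0, len(reviews), subset_size)]

if __name__ == "__main__":
    reviews = [f"review {i}" for i in range(200_000)]     # stand-in for the real reviews
    subsets = split_into_subsets(reviews, subset_size=25_000)
    with Pool() as pool:                                   # one worker process per CPU core
        results = pool.map(process_subset, subsets)        # subsets processed in parallel
    print(len(subsets), "subsets processed:", results)
```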

Are the results different or not? If so, by how much?

Page 5: Zizka immm 2012

The original text-mining research aimed at the automatic search for significant words and phrases, which could then be used for a deeper examination of positive and negative reviews; that is, looking for typical praises or complaints.

For example, “good location”, “bad food”, “very noisy environment”, “not too much friendly personnel”, “nice clean room”, and the like.

To obtain such an understandable kind of knowledge, decision trees (rules) were generated.
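
A minimal sketch of this step, assuming labelled positive/negative reviews as plain strings; scikit-learn's entropy-based DecisionTreeClassifier is used here as a freely available stand-in for C5, and the toy reviews are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy labelled reviews (1 = positive, 0 = negative) standing in for the real data.
reviews = ["nice clean room", "good location", "friendly personnel",
           "bad food", "very noisy environment", "not too much friendly personnel"]
labels = [1, 1, 1, 0, 0, 0]

# Turn the reviews into word-frequency vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# An entropy-based decision tree, analogous in spirit to C5.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)

# The printed rules ask about word frequencies, exposing the significant words.
print(export_text(tree, feature_names=vectorizer.get_feature_names_out().tolist()))
```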

Page 6: Zizka immm 2012

Some original English examples from approximately 1,200,000 positive and 800,000 negative reviews (no grammar corrections):

– breakfast and the closeness to the railway station were the only things that werent bad

– did not spend enogh time in hotel to assess

– it was somewhere to sleep

– very little !!!!!!!!!

– breakfast, supermarket in the same building, kitchen inthe apartment (basic but better than none)

– no complaints on the hotel

Page 7: Zizka immm 2012

The upper-bound computational complexity of the entropy-based decision tree (C5) is O(m∙n²) for m reviews and n unique words in the generated dictionary; some values of n:
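
As a rough illustration of how quickly this grows, the following back-of-the-envelope estimate uses hypothetical dictionary sizes (they are not the values from the experiments):

```python
# Back-of-the-envelope estimate of the O(m * n^2) tree-building cost.
# The dictionary sizes n below are hypothetical examples, not measured values.
m = 200_000                              # number of reviews in the processed set
for n in (10_000, 50_000, 100_000):      # hypothetical dictionary sizes
    print(f"n = {n:>7,}  ->  m * n^2 = {m * n ** 2:.1e} elementary operations")
```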

Page 8: Zizka immm 2012

The minimum review length was 1 word (for example, “Excellent!!!”), the maximum was 167 words.

The average length of a review was 19 words.

The vector sparsity was typically around 0.01%; that is, on average, a review contained only 0.01% of all the words in the dictionary created from the reviews.

An overwhelming majority of words were insignificant; only some 100-300 terms (depending on the specific language) played a significant role from the classification point of view.
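
A minimal sketch of how such sparsity can be measured, assuming the reviews are available as a list of strings; scikit-learn's CountVectorizer is used for illustration and the toy reviews are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews standing in for the real data set.
reviews = [
    "Excellent!!!",
    "nice clean room, good location",
    "very noisy environment, bad food",
]

# Build the bag-of-words (document-term) matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)      # sparse m x n matrix of word counts

# Sparsity: the share of non-zero entries, i.e. how much of the dictionary
# an average review actually uses.
sparsity = X.nnz / (X.shape[0] * X.shape[1])
print(f"{X.shape[0]} reviews, {X.shape[1]} unique words, sparsity = {sparsity:.2%}")
```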

Page 9: Zizka immm 2012

Intuitively, the larger the data, the better the discovered knowledge.

However, there is always an unavoidable problem:

How to process very large textual data?

Page 10: Zizka immm 2012

The experiments were aimed at finding the optimal subset size for the main data set division. The optimum was defined as obtaining the same results from the whole data set and from the individual subsets.

Ideally, each subset should provide the same significant words that would have the same significance for classification.

The significance of a word was defined as the number of times a decision tree asked about the word's frequency in a review.
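
A minimal sketch of this significance count, assuming a tree fitted with scikit-learn as in the earlier sketch; the internal tree_.feature array lists, for every decision node, the index of the word whose frequency the node asks about:

```python
from collections import Counter

def word_significance(fitted_tree, feature_names):
    # tree_.feature lists, for every node of the fitted tree, the index of the
    # word whose frequency the node asks about; leaf nodes are marked with -2.
    split_features = fitted_tree.tree_.feature
    counts = Counter(split_features[split_features >= 0])
    return {feature_names[i]: n for i, n in counts.most_common()}

# Example, reusing the tree and vectorizer from the earlier sketch:
# print(word_significance(tree, vectorizer.get_feature_names_out()))
```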

Page 11: Zizka immm 2012

If a certain word is in the root of the tree for the whole data set, most of the subsets (ideally all) should have the same word in the roots of their trees.

Similarly, the same rule can be applied to the other words included in the trees at levels approaching the leaves. Then we could say that each subset represents the original set perfectly.

In reality, the decision trees generated for the individual review subsets differ from each other to a greater or lesser degree because they are created from different reviews. A tree generated from a subset may also contain at least one word that is not in the tree generated from the whole set.
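
A minimal sketch of such a comparison of root words, again using scikit-learn as a stand-in; the names whole_texts, whole_labels, and subsets are hypothetical placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def root_word(texts, labels, vectorizer):
    # Fit a tree on one (sub)set and return the word tested in its root node
    # (assumes the tree has at least one split, i.e. the root is not a leaf).
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(
        vectorizer.transform(texts), labels)
    return vectorizer.get_feature_names_out()[tree.tree_.feature[0]]

# Hypothetical usage: whole_texts/whole_labels hold the whole set, and
# subsets is a list of (texts, labels) pairs; the vectorizer is fitted once
# on the whole set so that word indices stay comparable across all trees.
# vectorizer   = CountVectorizer().fit(whole_texts)
# whole_root   = root_word(whole_texts, whole_labels, vectorizer)
# subset_roots = [root_word(t, l, vectorizer) for t, l in subsets]
# print(sum(r == whole_root for r in subset_roots) / len(subset_roots))
```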

Page 12: Zizka immm 2012

Part of a tree for a subset: Do all the subsets of the customer reviews have the same word, “location”, in their roots?

And what about other words? Are they in the same positions in all the subset trees?

Page 13: Zizka immm 2012

The question is:

How many subsets should the whole review set be divided into so that the unified results from all the subsets provide (almost) the same result as the whole set?

It is not easy to find a general solution because the result depends on particular data.

The research used the data described above because it corresponded to many similar situations: a lot of short reviews concerning just one topic.

Page 14: Zizka immm 2012

The original set (2,000,000 reviews) was too big to be processed as a whole.

For the given PC model (8 GB RAM, 4-core processor, 64-bit machine), the experiments worked with 200,000 randomly selected reviews as the whole set (for larger sets, the computation took more than 24 hours or crashed due to insufficient memory).

Then, the task was to find an optimal division of the 200,000 reviews into smaller subsets.

Page 15: Zizka immm 2012

The results of the experiments are demonstrated in the following graphs for different sizes of the whole set and its subsets.

On the horizontal x axis, there are the most significant words generated by the trees.

The vertical axis y shows the correspondence between the percentage of the significant words in the whole set and the average percentage of the relevant subsets.

The whole set contains all the significant words: the y value is always 1.0 (that is, 100%).

Page 16: Zizka immm 2012

The correspondence between the results provided by the whole set and by its subsets is given by the agreement between the percentage weight of a word wj in the tree generated for the whole set and the average percentage weight of wj in the trees generated for all the corresponding subsets.

If the percentage weight of a word wj in the whole set were (on average) the same as for all the subsets, the agreement would be perfect; otherwise it is imperfect, where the degree of imperfection is given by the difference.
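
A minimal sketch of this agreement measure, assuming the per-word percentage weights have already been extracted into dictionaries; the names whole_weights and subset_weights and the example numbers are illustrative only:

```python
def agreement(whole_weights, subset_weights):
    """For every significant word, compare its percentage weight in the tree
    built from the whole set with its average weight over the subset trees
    (1.0 means perfect agreement, smaller values mean a larger difference)."""
    result = {}
    for word, whole_w in whole_weights.items():
        avg_w = sum(w.get(word, 0.0) for w in subset_weights) / len(subset_weights)
        result[word] = 1.0 - abs(whole_w - avg_w)
    return result

# Hypothetical percentage weights (fractions of all split decisions in a tree):
whole = {"location": 0.30, "staff": 0.20, "breakfast": 0.10}
subsets = [{"location": 0.28, "staff": 0.22, "breakfast": 0.08},
           {"location": 0.33, "staff": 0.18, "breakfast": 0.12}]
print(agreement(whole, subsets))
```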

Page 17: Zizka immm 2012

The whole set contains 200,000 reviews

Page 18: Zizka immm 2012

The whole set contains 100,000 reviews

Page 19: Zizka immm 2012

The whole set contains 50,000 reviews

Page 20: Zizka immm 2012

Conclusions:

Probably, it is no surprise that the subsets should be as large as possible to obtain reliable knowledge.

However, the question is: how large should the inevitable “small” subsets be?

For given data, it can be found experimentally, and the result is then applicable to the same type of data in the future, as the experiments demonstrated.

This research report suggests a method for how to proceed.