Medium Information Quantity

Transcript of Medium Information Quantity

Page 1: Medium Information Quantity

Medium Information Quantity
Vitalie Scurtu

Page 2: Medium Information Quantity

History

● Shannon - "A Mathematical Theory of Communication" - 1948

● Lossless compression (Shannon-Fano, Adaptive Huffman)

● Today it is used in cryptography, in the analysis of DNA sequences (http://pnylab.com/pny/papers/cdna/cdna/index.html), and in Natural Language Processing applications

Page 3: Medium Information Quantity

What is information quantity?

The information quantity of a phenomenon depends on its frequency.

● Low information quantity

○ It rains in London
○ The economy is in crisis
○ Berlusconi went with escorts

Low information quantity: I am telling you things you have already heard many times; nothing new.

● High information quantity
○ Today it snows in Rome
○ Dentists on strike

High information quantity: I am telling you things you have never or rarely heard; much new information.

Page 4: Medium Information Quantity

Entropy and the medium information quantity

● The formula of information quantity
○ H(m) = -Σ p(m) · log p(m)
○ V(m) = (1/n) · Σ (1 - p(m))
■ p(m) - probability that m will happen

H(m) - entropy or information quantity (IQ)
V(m) - medium information quantity (MIQ)
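Below is a minimal Python sketch of both formulas, assuming p(m) is estimated as the relative frequency of each word in a single text and using a base-2 logarithm; the whitespace tokenizer and the example sentence are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from math import log2

def entropy_and_miq(text):
    """H(m) = -sum p(m) * log p(m)  and  V(m) = (1/n) * sum (1 - p(m)),
    with p(m) estimated as the relative frequency of word m in the text."""
    words = text.lower().split()              # naive whitespace tokenizer (assumption)
    counts = Counter(words)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    n = len(probs)                            # number of distinct words
    h = -sum(p * log2(p) for p in probs)      # entropy / information quantity (IQ)
    v = sum(1 - p for p in probs) / n         # medium information quantity (MIQ)
    return h, v

h, v = entropy_and_miq("it rains in London and it rains in Rome")
print(f"H = {h:.3f} bits, V = {v:.3f}")
```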

Page 5: Medium Information Quantity

Probability coefficients

lim p(x) → 1

[Plot: x = 1..n, y = p(x)]

Page 6: Medium Information Quantity

Logarithm coefficients

lim log(x) → -∞ (as x → 0)

[Plot: x = 0..1, y = log(x)]

Page 7: Medium Information Quantity

Coefficients of Shannon Entropy

-p(x) · log(p(x))

● Very likely words: in, the, is, has, of

● Very unlikely words: APT, …

[Plot: x = p(1..n), y = -x · log(x)]
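As a small sketch of the coefficient plotted above, the per-word term -p(x) · log(p(x)) can be evaluated directly; the example probabilities are illustrative assumptions, not taken from the slides.

```python
from math import log2

def entropy_term(p):
    """Per-word contribution to Shannon entropy: -p * log2(p)."""
    return -p * log2(p)

# Illustrative probabilities (assumptions): a frequent function word,
# a mid-frequency content word, and a very rare word.
for word, p in [("the", 0.05), ("snow", 0.002), ("APT", 0.00001)]:
    print(f"{word:6s} p = {p:<8g} -p*log2(p) = {entropy_term(p):.5f}")
```

Each very rare word contributes almost nothing to the sum on its own; the entropy of a document is the sum of these terms over all its distinct words.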

Page 8: Medium Information Quantity

Distribution of documents based on their entropy

Zipf distribution (long tail): few documents at the upper end and many at the lower end
● MIN = 0
● MAX = 1700 (no upper limit)

[Plot: x = doc(1..n), y = H(doc(1..n))]

Page 9: Medium Information Quantity

Distribution of documents based on their medium information quantity

Gaussian distribution: few documents at the extremes and the majority around the middle values

● MIN = 0
● MAX = 1.0

[Plot: x = doc(1..n), y = V(doc(1..n))]
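A sketch of how these per-document values could be computed, reusing the entropy_and_miq helper from the earlier sketch; the three toy documents are illustrative assumptions, and the Zipf-like vs. Gaussian shapes described on the slides would only emerge on a real corpus.

```python
# Reuses entropy_and_miq(text) from the earlier sketch.
docs = [
    "it rains in London",
    "today it snows in Rome and the snow is heavy",
    "the economy is in crisis and the crisis is deep",
]

h_values, v_values = [], []
for doc in docs:
    h, v = entropy_and_miq(doc)
    h_values.append(h)    # entropy: unbounded, grows with document size
    v_values.append(v)    # MIQ: always stays between 0 and 1

print("H per document:", [round(x, 3) for x in h_values])
print("V per document:", [round(x, 3) for x in v_values])
```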

Page 10: Medium Information Quantity

Distribution of documents based on their medium information quantity

[Plot: x = doc(1..n), y = V(doc(1..n))]

Page 11: Medium Information Quantity

Entropy depends on the text length

Correlation: 0.99 - a very high correlation; entropy and text length behave almost identically
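A sketch of how such a correlation could be measured, assuming a list of plain-text documents, the entropy_and_miq helper from the earlier sketch, and NumPy's corrcoef for the Pearson correlation; the toy corpus is an illustrative assumption.

```python
import numpy as np

# Reuses entropy_and_miq(text); replace `corpus` with a real document collection.
corpus = [
    "it rains in London",
    "today it snows in Rome and the snow is heavy",
    "the economy is in crisis and the crisis is deep and the crisis is long",
]

lengths = np.array([len(doc.split()) for doc in corpus], dtype=float)
hv = np.array([entropy_and_miq(doc) for doc in corpus])   # columns: H, V

print("corr(length, H):", np.corrcoef(lengths, hv[:, 0])[0, 1])  # slides report ~0.99
print("corr(length, V):", np.corrcoef(lengths, hv[:, 1])[0, 1])  # slides report ~0.05
```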

Page 12: Medium Information Quantity

Conclusions

● Very low correlation of MIQ with text length: 0.05 (vs. 0.985 for entropy)

● Correlation of MIQ vs. IQ is 0.57
● Entropy depends on text length; MIQ does not depend on text length, therefore it can find anomalies

● MIQ: information about text style
● MIQ compensates IQ

Page 13: Medium Information Quantity

The End

Questions? Email to [email protected]