Medium Information Quantity


Vitalie Scurtu

History

● Shannon, "A Mathematical Theory of Communication", 1948

● Lossless compression (Shannon-Fano, Adaptive Huffman)

● Today it is used in cryptography, in the analysis of DNA sequences (http://pnylab.com/pny/papers/cdna/cdna/index.html), and in Natural Language Processing applications

What is information quantity?

The information quantity of a phenomenon depends on its frequency.

● Low information quantity

○ It rains in London

○ The economy is in crisis

○ Berlusconi went with escorts

Low information quantity: I am telling you things you have already heard many times; nothing new.

● High information quantity

○ Today it snows in Rome

○ Dentists on strike

High information quantity: I am telling you things you have never or rarely heard; much new information.

Entropy and the medium information quantity

● The formula of information quantity

○ H(m) = -Σ p(m) · log p(m)

○ V(m) = (1/n) · Σ (1 - p(m))

■ p(m): the probability that m occurs

H(m): entropy, or information quantity (IQ)

V(m): medium information quantity (MIQ)
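A minimal Python sketch of both formulas as I read them from this slide. Two assumptions the slides do not spell out: p(m) is estimated as the relative frequency of a word in the text itself, and the MIQ sum runs over the n token occurrences, which keeps V(m) between 0 and 1, matching the MIN/MAX values given later in the deck.

    from collections import Counter
    from math import log

    def entropy(tokens):
        # H(m) = -sum of p(m) * log p(m) over the distinct words,
        # with p(m) estimated as relative frequency in the text
        n = len(tokens)
        return -sum((c / n) * log(c / n) for c in Counter(tokens).values())

    def miq(tokens):
        # V(m) = (1/n) * sum of (1 - p(m)); assumed here to run over
        # the n token occurrences, so the result lies in [0, 1)
        n = len(tokens)
        counts = Counter(tokens)
        return sum(1 - counts[t] / n for t in tokens) / n

    tokens = "it rains in london and it rains in rome".split()
    print(entropy(tokens))  # information quantity (IQ)
    print(miq(tokens))      # medium information quantity (MIQ)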

Probability coefficients

[Plot: y = p(x) for x = 1..n; probability values are bounded, lim p(x) → 1]

Logarithm coefficients

[Plot: y = log(x) for x = 0..1; lim log(x) → -∞ as x → 0]

Coefficients of Shannon Entropy

Per-word term: -p(x) · log p(x)

● Very likely words: in, the, is, has, of

● Very unlikely words: APT, …

[Plot: y = -x · log(x) for x = p(1)..p(n)]
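A small sketch, with made-up probability values, of why both extremes contribute little to entropy: the per-word term -p · log(p) vanishes as p → 1 (very likely words) and as p → 0 (very unlikely words), and peaks at p = 1/e.

    from math import log, e

    def coeff(p):
        # per-word Shannon entropy term: -p * log(p)
        return -p * log(p)

    # illustrative probabilities, not values from the slides
    for p in (0.9, 0.5, 1 / e, 0.05, 0.0001):
        print(f"p = {p:.4f}  -p*log(p) = {coeff(p):.4f}")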

Document distribution based on entropy

Zipf distribution (long tail): few documents at the high end, many at the low end.

● MIN = 0

● MAX = 1700 (no upper limit)

[Plot: x = doc(1..n), y = H(doc(1..n))]

Document distribution based on medium information quantity

Gaussian distribution: few documents at the extremes, the majority around the middle values.

● MIN = 0

● MAX = 1.0

[Plot: x = doc(1..n), y = V(doc(1..n))]


Entropy depends on text length

Correlation between entropy and text length: 0.99, the highest correlation observed; the two are almost identical.
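A synthetic illustration of this contrast, not the author's experiment: drawing tokens from a Zipf-like distribution (invented vocabulary and weights), entropy keeps rising as the prefix grows, because new distinct words keep appearing, while MIQ settles quickly. entropy() and miq() are the sketches defined earlier.

    import random

    # entropy() and miq() as defined in the earlier sketch
    random.seed(0)
    vocab = [f"w{i}" for i in range(1, 2001)]
    weights = [1 / i for i in range(1, 2001)]  # Zipf-like word frequencies
    stream = random.choices(vocab, weights=weights, k=20000)

    for n in (500, 2000, 8000, 20000):
        prefix = stream[:n]
        print(f"n = {n:5d}  H = {entropy(prefix):.3f}  V = {miq(prefix):.4f}")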

Conclusions

● Very low correlation of MIQ with text length: 0.05, vs. 0.985 for entropy

● Correlation of MIQ vs. IQ: 0.57

● Entropy depends on text length; MIQ does not, and can therefore be used to find anomalies

● MIQ carries information about text style

● MIQ complements IQ

The End

Questions? Email scurtu19@gmail.com