
Acta Astronautica 68 (2011) 399 -- 405

Contents lists available at ScienceDirect

Acta Astronautica

journal homepage: www.elsevier.com/locate/actaastro

The filtration of inter-galactic objets trouvés and the identification of the lingua ex machina hierarchy

John Elliott

Computational Intelligence Research Group, School of Computing, Leeds Metropolitan University, Leeds LS1 3HE, UK

ARTICLE INFO

Article history:
Received 27 February 2009
Accepted 11 August 2009
Available online 27 October 2009

Keywords:
Structure
Signal
Language
Waveform
Detection

ABSTRACT

The ultimate aim of any system designed to analyse a signal for structured and intelligent-like behaviour is not merely to provide a Boolean filter but to have the ability to deduce what category the signal fits (language, image, music, noise, etc.) and then to decipher its contents. Up until now, my research has concentrated on the lower level universals of language structure and how they can be detected [J. Elliott, E. Atwell, W. Whyte, First stage identification of syntactic elements in an extraterrestrial signal, in: Proceedings of IAC'2001: the 52nd International Astronautical Congress, Paper IAA-01-IAA.9.2.07, Toulouse, France, 2001; J. Elliott, E. Atwell, Language in signals: the detection of generic species-independent intelligent language features in symbolic and oral communications, in: Proceedings of the 50th International Astronautical Congress, Paper IAA-99-IAA.9.1.08, International Astronautical Federation, Paris, 1999]. However, in this paper, I begin to look into developing techniques, which use only surface structure information deduced from the signal sample itself, for learning what underlies the higher level elements of the language (lingua ex machina) hierarchy, where syntax meets semantics. The aim therefore is to move towards developing techniques which will unlock a signal's meaning using its internal cohesion and constraints. In doing so, I look at how the most disparate of human language orthographic forms mirror each other's underlying structure, despite their encoding strategies, and how classes of words give their secrets away by the functional `friends' they keep.

Finally, I present techniques developed for non-language structured signals: how the statistical `fingerprint' of a binary image, such as the recent challenge set by Dr. Dutil [University of Catalunya, Barcelone, Spain & ABB Bomem Inc, 2001 〈http://ww3.sympatico.ca/stephane_dumas/CETl/output_stream.txt〉] and the 1974 Arecibo transmission [Arecibo Observatory 〈http://www.naic.edu/〉], can be detected when encoded as a binary bit-stream by the analysis of its type-token distribution, and further analysis of the structure of music and how it differs from language.

© 2009 Elsevier Ltd. All rights reserved.

0094-5765/$ - see front matter. doi:10.1016/j.actaastro.2009.08.012

1. Introduction

To address such linguistic problems, which reach down to and beyond the very core fundamental concepts of what we currently understand as language structure, I have had to develop new techniques to provide ways of learning and understanding syntactic behaviour from surface structure alone. Although each word or symbol is arbitrarily paired with its meaning, empirical evidence has shown that the semantic class or part-of-speech to which it belongs is constrained by its behaviour and according to underlying interactive rules. By using functional elements/words detected from lower level analysis and visualisation techniques for bonding, I have begun to develop algorithms which group a communication's word classes, using pairings of function words as constraints, initially on single word occurrences: functional sandwiches. The behaviour within such minimal constrained pairings also captures evidence of morphology, recursion and the notion of polyposic (footnote 1) and uniposic (footnote 2) membership. Initial trials with English confirm such behaviour and show how these semantic classes can be detected and words attributed their membership, without prior knowledge, using unsupervised positional rule-based criteria and statistical variance.

2. Functional sandwiches

A recent avenue of my research is using functional words as constraints, where a single `wildcard' word is constrained by two previously identified functional (closed class) words [fwd_i 〈x〉 fwd_j]. Tests were carried out on English `raw' text to ascertain if content (open class) words could be clustered without any prior knowledge or expert interpretation to `label' the output. This approach is important, as ultimately we need to establish methods for deciphering semantic content.

Initial results using this minimal functional `sandwich' method are coarse-grained, in that nouns/adjectives and verbs/adverbs are often grouped together in a given constraint pairing. However, this is taken from first pass observations only, and the success rate for clustering related groups uniquely to a given pairing is high: 93.06%. It is also worth noting that the classification of the major parts-of-speech can vary from language to language (adjectives are often classified as nouns), so this coarse-grained analysis may well be more appropriate and realistic. Once this initial `first-pass' stage is completed, a second pass using high frequency nouns, identified through their inter-sample variance [argmax σ²(x_i)], can begin to reveal additional information and behaviour. Pairings from these noun-like class clusters with high frequency membership are then re-run, but this time constraining two words rather than one [fwd_i 〈x〉〈y〉 fwd_j]. Initial results show that, when using the following algorithm in conjunction with such constraints, very high accuracy is attained for the parts-of-speech mappings of these additional variables. To date, the accuracy rate for correct mappings of these second variables (words), using this technique on previously identified noun mappings from minimal functional sandwiches, is 97.7%.

These initial trials, using a 60,000 word English sample corpus, have subsequently been repeated on a larger 500,000 word corpus, where results confirm the consistency of the parts-of-speech each function word pair constrains. Results on this larger corpus show an average clustering accuracy of 91.08% for fwd_i 〈x〉 fwd_j, which is only 1.98 percentage points lower than for the smaller 60,000 word corpus, and an accuracy of 98.8% for clustering adjectives and verbs using the fwd_i 〈x〉〈y〉 fwd_j algorithm. These results are very encouraging, given that larger corpora are much more likely to capture exceptions.

Footnote 1: Homonyms with multiple part-of-speech membership.
Footnote 2: Homonyms with membership restricted to only one part-of-speech.

To further support this hypothesis for using common function words to classify a word's part-of-speech category, additional trials were conducted by a final year student for his dissertation [5]; I provided the project methodology, resources and additional supervision. In this project, words were classified by their proximity to a set of given function words (footnote 3), using a range of clustering metrics and window sizes. The optimum window size was observed to be 8, which is consistent with cognition and with the maximum distributions between function words. Experiments were performed on the same 500,000 word corpus of Don Quixote [6], and this time also on its original Spanish version, to ascertain the robustness of the algorithm as a generic tool.

These experiments select only the 500 most frequent content words from the corpus, but do attempt to cluster words into their specific parts-of-speech rather than the previous coarse-grained groupings: a classification set of 23 parts-of-speech was adopted for this exercise. Results disclose an average accuracy of 84.7% for grouping frequent English content words and 85.8% for grouping Spanish content words.

Algorithms for functional sandwiches:

Algorithm for fwd_i 〈x〉 fwd_j:
    If 〈x〉 is not a functional word
        〈x〉 = noun-type xor verb-type

Algorithm for fwd_i 〈x〉〈y〉 fwd_j:
    When 〈y〉 = noun
        Given 〈x〉〈y〉
            〈x〉 = elaborator (adjective)
            If 〈x〉 is not a functional word or noun
    If 〈x〉 = noun
        Then 〈y〉 = action (verb/adverb)
        If 〈y〉 is not a functional word or noun
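The two passes described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the trials: the function names, the toy sentence, the function-word set and the noun set are all hypothetical, and the real second pass selects its nouns by inter-sample variance rather than taking them as given.

```python
from collections import defaultdict

def sandwich_clusters(tokens, function_words):
    """First pass: group each non-function word <x> by the
    (fwd_i, fwd_j) pair that immediately encloses it."""
    clusters = defaultdict(lambda: defaultdict(int))
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        if (left in function_words and right in function_words
                and mid not in function_words):
            clusters[(left, right)][mid] += 1
    return {pair: dict(words) for pair, words in clusters.items()}

def second_pass_labels(tokens, function_words, nouns):
    """Second pass over fwd_i <x><y> fwd_j, given a set of known nouns
    (assumed here to come from the first pass)."""
    labels = {}
    for left, x, y, right in zip(tokens, tokens[1:], tokens[2:], tokens[3:]):
        if left not in function_words or right not in function_words:
            continue
        if y in nouns and x not in function_words and x not in nouns:
            labels[x] = "elaborator"   # adjective-like
        elif x in nouns and y not in function_words and y not in nouns:
            labels[y] = "action"       # verb/adverb-like
    return labels

# Hypothetical toy data, for illustration only.
FUNCTION_WORDS = {"the", "in", "on", "and", "of", "a", "to"}
first = sandwich_clusters(
    "the cat in the hat sat on the mat and the dog".split(), FUNCTION_WORDS)
second = second_pass_labels(
    "the red cat in the hat of a dog runs to the park".split(),
    FUNCTION_WORDS, nouns={"cat", "dog"})
```

On the toy data, the first pass files `cat` under the pair (`the`, `in`) and `mat` under (`the`, `and`); the second pass labels `red` as an elaborator and `runs` as an action.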

Another interesting consequence of applying these constraints is the evidence of morphological, and therefore recursive, behaviour. This is apparent from a statistically significant increase, within a given grouping, of a particular affix, far exceeding what would occur through random behaviour. For example, under the constraint `which-the', 80% of suffixes end in `ed'; under the constraint `in-the', 59% of suffixes end in `ing'; and 12% of nouns in a particularly large mapping group also re-occur with an `s' suffix, whereas the normal distribution of such suffixes across the corpus is only 4.681%.
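The affix comparison described above, the share of a suffix inside one constraint pairing against its share across the whole sample, can be sketched as below. The helper names and the toy sentence are hypothetical, and a plain string-ending test stands in for whatever suffix analysis was actually used.

```python
def words_between(tokens, left, right):
    """Words occurring immediately between the function words left and right."""
    return [mid for l, mid, r in zip(tokens, tokens[1:], tokens[2:])
            if l == left and r == right]

def suffix_share(words, suffix):
    """Fraction of words ending in the given suffix (0.0 for an empty list)."""
    words = list(words)
    if not words:
        return 0.0
    return sum(w.endswith(suffix) for w in words) / len(words)

# Hypothetical toy sample, for illustration only.
sample = ("in walking the park which closed the gate which opened "
          "the door in running the race").split()
ed_in_group = suffix_share(words_between(sample, "which", "the"), "ed")
ed_in_corpus = suffix_share(sample, "ed")
```

A group share far above the corpus share is the kind of contrast reported above for `which-the' and `in-the'.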

Finally, evidence of a given word's polyposic (footnote 4) or uniposic (footnote 5) behaviour can be gleaned from its inter-grouping membership, using such surface structure behaviour. Examples of the reuse of a given spelling to represent entirely different semantics, and so of polyposic behaviour, can be seen with such words as love, look, lay, lead, insult, hope, better and dress. Conversely, words with equally frequent membership of multiple groups that only occur for a single part-of-speech category are uniposic.

Footnote 3: Closed class words: e.g., articles and conjunctions.
Footnote 4: Polyposy: a word is polyposic if membership crosses known parts of speech behavioural categories.
Footnote 5: Uniposy: a word is uniposic if membership remains exclusively within a single known part of speech category.

Other potential uses for this bidirectional constraint clustering method, apart from deriving word classes from an unknown text, are initially seen as the enhancing and enriching of existing probabilistic models (e.g. clustering algorithms, hidden Markov models and finite transition networks).

3. Combinational constraint behaviour: English vs. Chinese

Using a technique presented at last year's conference, which analyses `mirrored' pairs of linguistic objects (terminals and non-terminals alike), I have begun to compare the combinational constraint behaviour of different language families, in an attempt to ascertain whether the core parts-of-speech, irrespective of their encoding strategies, display a common behaviour which underlies and is part of a generic template. At least metaphorically, the behaviour can be considered to show the `binding force' between the two objects varying with their separation. One such comparison, which has now been completed, is that of Chinese [7] and English [8]. This provided an opportunity to compare the two very different orthographic systems of Sino-Tibetan and Indo-European (West Teutonic) languages. Fig. 1 gives a pictorial summary of my findings, which indicate that the interactive behaviour between their core parts-of-speech is in fact remarkably similar. This supports the hypothesis that the ways we weave our respective ontological descriptions of the world around us when communicating are in fact constrained by general binding rules. The one area where differences are seen is the immediate bonding of articles with some parts-of-speech: specifically, with verbs, adjectives and adverbs when preceded by articles, and where conjunctions and adverbs are immediate priors. I believe the restricted set of articles used in Chinese is the root cause of this effect. The calculation of cohesion for selected linguistic object pairs is performed using the following formulae [13], where both orderings (j given i, i given j) are calculated concurrently. The resulting profile sets, which reflect the behaviour of parts-of-speech, could, in addition to such applications as tagging and probability assignments underlying hidden Markov models, also complement the detection of language-like behaviour and subsequent classification (Fig. 1).

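\[ f(k_i) \cdot f(k_j) = \sum_{x>0} f(k_i, x, k_j) + \sum_{x<0} f(k_j, x, k_i) \]

and the probability that w_1 precedes w_2 at offset x is

\[ \frac{f(k_i)}{f(\forall k)} \times \frac{f(k_i, x, k_j)}{f(k_i, n, k_j)} \]

\[ \text{Left hand side probability} = \frac{f(x)}{\sum_{x<0} f(x)}, \qquad \text{Right hand side probability} = \frac{f(x)}{\sum_{x>0} f(x)} \]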

Fig. 1. Visualisation of Chinese/English comparison: high spikes = dissimilar behaviour, medium height spikes = similar behaviour, low spikes = very similar, no spike = exactly the same behaviour. [Both axes list the parts-of-speech Prep, Conj, Det/Art, Noun, Verb, Adj and Adv.]

As no a priori information is available for the system under analysis, a threshold can be applied to remove noisy data, where isolated counter-trend combinational occurrences are encountered due to such contamination as cross-sentential occurrences:

�(k_j | k_i) = {0 : f(j | i) < T}.
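The offset counts, the left and right hand side normalisation and the noise threshold can be sketched in Python as below. This is an illustrative reading rather than the method of [13]: the function names, the tag sequence and the threshold value are hypothetical, and the threshold is applied here to the normalised side probabilities, which is an assumption.

```python
from collections import Counter

def offset_counts(tags, ki, kj, max_offset):
    """Count occurrences of kj at signed offset x from each ki
    (x > 0: kj follows ki; x < 0: kj precedes ki)."""
    counts = Counter()
    for pos, tag in enumerate(tags):
        if tag != ki:
            continue
        for x in range(-max_offset, max_offset + 1):
            other = pos + x
            if x != 0 and 0 <= other < len(tags) and tags[other] == kj:
                counts[x] += 1
    return counts

def side_probabilities(counts):
    """Normalise each offset count by the total on its own side."""
    left_total = sum(c for x, c in counts.items() if x < 0)
    right_total = sum(c for x, c in counts.items() if x > 0)
    return {x: c / (left_total if x < 0 else right_total)
            for x, c in counts.items()}

def apply_threshold(probs, T):
    """Zero any value below the threshold T (assumed noise)."""
    return {x: (p if p >= T else 0.0) for x, p in probs.items()}

# Hypothetical part-of-speech sequence, for illustration only.
tags = ["Det", "Adj", "Noun", "Verb", "Det", "Noun"]
counts = offset_counts(tags, "Det", "Noun", max_offset=3)
probs = side_probabilities(counts)
filtered = apply_threshold(probs, T=0.6)
```

Plotting such a profile for each pair of parts-of-speech, in each language, gives the kind of separation-dependent `binding force' curves that Fig. 1 summarises.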

The classifications for the similarity criteria depicted above were extremely conservative, and even those classified as only similar were in fact showing many closely related trends in the bonding at the varying degrees of separation. These findings have helped to further indicate and corroborate that cross-language structural consistency occurs up to the highest syntactic level of abstraction, irrespective of orthography.

4. Music

In an attempt to further establish methods that filter out all known structured phenomena from natural language, I have once again looked into the behaviour of music. This time, however, I have invested much more time and hand built a corpus of music comprising 10,000 notes, covering most musical genres including Classical, Jazz, Rock, Traditional and Folk.

Sections of this corpus which contain lyrics have also had each note tagged with its respective sound value with respect to syllables and words, in order to represent and compare music and language structure as closely as possible. This strategy has enabled initially quite differing encodings to be closely compared across a variety of analytical stages, to ascertain whether or not this man-made form of a seemingly innate phenomenon, which occurs in many forms across a variety of species, `mirrors' and is indistinguishable from human natural language. My intention throughout these stages of analysis has been to represent music and language as closely as possible in order to, if possible, confound the ability to disambiguate. The rationale is that if I can then filter out music when represented in this format, I will confirm the robustness of the methods in place.

Fig. 2. First order entropy: a comparison of language and music. [Plot titled `Language vs Music': entropic value (0 to 14) against chunk length (1 to 24) for an alphabetic system, music as a rebus system and a logographic system.]

Using the standard 88 note range of a piano keyboard as a base, each note was represented numerically from 1 to 88 (lowest to highest), so that each octave possesses different values, reflecting their differing frequencies. This enabled music to be represented in a format suitable for computational analysis.
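One way to realise this 1 to 88 numbering is sketched below. It assumes notes are written in scientific pitch notation (A0 as the lowest piano key, C8 as the highest), which the text does not specify; the function and constant names are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def piano_key(name, octave):
    """Map a note name and octave to its key number on an 88-key piano
    (A0 -> 1, C8 -> 88). Raises ValueError outside that range."""
    key = 12 * octave + NOTE_NAMES.index(name) - 8
    if not 1 <= key <= 88:
        raise ValueError(f"{name}{octave} is outside the 88-key range")
    return key

# A short hypothetical melody encoded as key numbers.
melody = [piano_key(n, o) for n, o in [("C", 4), ("E", 4), ("G", 4), ("C", 5)]]
```

The same pitch class in different octaves receives different numbers (middle C is 40, the C above it 52), matching the requirement that each octave possesses different values.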

4.1. First order entropy

Once again, the initial metric applied to the data stream for detecting language-like structure is first order entropy [9]. Previous experiments have found that if the signal contains merely a set of random digits, the expected value of this function will rise monotonically as N (the chunk length) increases. However, if the string contains a set of symbols of fixed length representing a highly structured phenomenon, such as is seen in communication, we see a sudden drop in entropy against this trend, thresholding at

[L]� = {�1 � H1c � �2}

where �1 = H1p − 1; �2 = H1n + 1; [L]� is the physical language structure; H1c is the current entropic value; H1p is the previous entropic value; and H1n is the next entropic value.
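The chunk-length scan can be sketched in Python as below. The entropy is the standard Shannon first order entropy over non-overlapping chunks; the dip test is a deliberately simplified stand-in (a value lower than both neighbours) and is an assumption, not the threshold criterion given above. Function names are hypothetical.

```python
import math
from collections import Counter

def first_order_entropy(bits, chunk_len):
    """Shannon entropy (bits per chunk) of non-overlapping chunks."""
    chunks = [bits[i:i + chunk_len]
              for i in range(0, len(bits) - chunk_len + 1, chunk_len)]
    total = len(chunks)
    counts = Counter(chunks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(bits, max_len):
    """Entropy for every chunk length from 1 to max_len."""
    return {n: first_order_entropy(bits, n) for n in range(1, max_len + 1)}

def entropy_dips(profile):
    """Chunk lengths whose entropy is lower than both neighbours
    (simplified dip test, assumed for illustration)."""
    lengths = sorted(profile)
    return [n for prev, n, nxt in zip(lengths, lengths[1:], lengths[2:])
            if profile[n] < profile[prev] and profile[n] < profile[nxt]]

# A toy stream built from one repeated 2-bit symbol.
profile = entropy_profile("01" * 8, max_len=3)
dips = entropy_dips(profile)
```

For the toy stream, the entropy falls to zero at chunk length 2, the true symbol length, and rises again at length 3.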

Results comparing music with language show that music does display similar behaviour at this initial level. The drop in entropic value at its encoding obeys the same thresholding criteria. However, this drop is relative to the adjacent values and does not compare or evaluate the absolute values recorded. What can be seen is that the entropic values for music are consistently lower across all bit-length analyses: on average approximately 28% lower, indicating a much simpler and more predictable structure (see Fig. 2).

4.2. Higher order entropy

This next level of analysis, where higher-order entropic values calculate the conditional probability of prior information and thereby a `signal's' internal structure, is a further powerful indicator for the presence of highly structured phenomena. If results are consistent with known natural languages, the entropic value of the `signal' will decrease, at a slope of approximately −1 [m = dy/dx], as the entropic order (and therefore the number of dependencies) increases.

Fig. 3. Higher order entropy: a comparison of `signals'. [Plot titled `An Entropic Comparison of Signals': entropic value (0 to 6) against entropic order (1 to 5) for English, D.N.A. (ecoli), Random, SETI Signal, Music and Music as words.]

Fig. 4. Orthotactic sets: a comparison of `signals'. [Chart titled `'Legal' ngrams': percentage (0 to 120) of bigrams and trigrams for English, Swedish, French, Latin, Finnish, Swahili, Malay, Maltese, German, Polish, Gaelic, Welsh, Spanish, Russian, Sanskrit, Turkish, RANDOM-5K, RANDOM-1mill, DNA, RANDOM-50K and music.]

Here again, music behaves in a language-like manner, with its entropic value decreasing with respect to entropic order at a slope close to −1 and within tolerances: 0.915. However, as with first order entropy, all values recorded for higher-order entropy, whilst decreasing at a language-like slope, are once again consistently lower than those of natural language: on average approximately 39.6% lower, indicating a much simpler internal structure (see Fig. 3).
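A higher-order entropy profile and its slope can be estimated as in the sketch below. It uses the textbook conditional entropy of a symbol given its preceding (order − 1) symbols and an ordinary least-squares slope; whether this matches the exact estimator behind Fig. 3 is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(symbols, order):
    """Entropy of a symbol given the previous (order - 1) symbols.
    order = 1 reduces to first order entropy."""
    grams = Counter(tuple(symbols[i:i + order])
                    for i in range(len(symbols) - order + 1))
    contexts = Counter()
    for gram, c in grams.items():
        contexts[gram[:-1]] += c
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / contexts[gram[:-1]])
                for gram, c in grams.items())

def slope(xs, ys):
    """Least-squares slope m = dy/dx of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy sequence: perfectly alternating, so order 2 is fully predictable.
h1 = conditional_entropy("abababab", 1)
h2 = conditional_entropy("abababab", 2)
```

Computing `conditional_entropy` for orders 1 to 5 and passing the results to `slope` gives the gradient that is compared against the language-like value of approximately −1.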

4.3. Pattern frequencies and ngrams

Having ascertained that the `signal' under investigation has structure, the set of tokens that represent the `system' is then analysed for its frequency and orthotactic (footnote 6) profiles. This final initial stage provides essential information on the distribution of pattern frequencies and on what percentage of pattern combinations is present in the `signal'. These can then be compared with known generic profiles of natural language to ascertain whether the signal displays known language-like behaviour.

Fig. 5. Pattern [note] frequencies of music: (a) comparison of music and language corpora; (b) note frequencies (across three octaves) of modern techno music; (c) note frequencies of Summertime. [Panels plot frequency against patterns, or against notes in rank order.]

As can be seen in Fig. 5a–c, which compare music and alphabetic language pattern frequencies, music reveals the simpler nature gleaned from the previous entropic evaluation. The ranges of tokens used in the two systems are directly comparable, as the range of notes used in a typical piece of music averages approximately two octaves, and these 24 notes closely parallel the average number of letters found in the average alphabetic system. Nevertheless, music consistently displays significantly different frequency profiles across the range of genres analysed; such variations in token frequencies are not seen in language, where the profiles remain consistent throughout.

Complementing these frequency profiles is the discovery of what percentage of pattern combinations can empirically be seen to be `legal'. Once again language provides a strong, consistent generic template of observations for two and three letter combinations, which reflect the principles of least effort (reciprocal altruism) and recursion, essential for complex communication between intelligent agents.

Footnote 6: Orthotactic: `legal' set of combinations.

Although, as previously mentioned, the average sets of tokens used are comparable, yet again music displays a reduced set of `legal' combinations in comparison to that used in natural language (see Fig. 4), reflecting the cyclic and repetitious nature of music composition [10]. One of the more obvious examples of this contrast between language and music is that music displays occurrences of multiple consecutive idems (footnote 7). In the 10,000 note corpus, 22 trigrams and 21 quadragrams of this kind are found, occurring 277 and 122 times, respectively, in contrast to a much larger corpus of 403,000 words/1,749,000 letters of natural language, which displays none.
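Both measurements can be sketched as follows. The `legal' percentage is taken here as the number of distinct observed n-grams over all n-grams possible from the observed token set, which is an assumed definition; the function names and the toy note sequence are hypothetical.

```python
from collections import Counter

def legal_ngram_percentage(tokens, n):
    """Distinct observed n-grams as a percentage of all n-grams that
    could be formed from the observed token set (assumed definition)."""
    alphabet = set(tokens)
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return 100.0 * len(observed) / len(alphabet) ** n

def idem_ngrams(tokens, n):
    """Count n-grams consisting of one token repeated n times."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1)
                   if len(set(tokens[i:i + n])) == 1)

# Hypothetical note sequence (piano key numbers), for illustration only.
notes = [40, 40, 40, 42, 44, 44, 44, 44, 42]
bigram_pct = legal_ngram_percentage(notes, 2)
idem_trigrams = idem_ngrams(notes, 3)
```

In the toy sequence, 5 of the 9 possible bigrams occur, and two distinct idem trigrams are found (one of them twice); the reported counts for the music corpus are the same kind of tally at scale.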

Footnote 7: Idem: identical reoccurrence of pattern.

5. Type-token distribution

Another metric which has proved useful is the calculation of token distributions across the frequency bandwidth of occurrences [11]. Each pattern's frequency is taken and, subject to the bandwidth segmentation used, increments the counter for the appropriate frequency band. When all patterns have been allocated, the distribution profile can be compared with other distributions clearly and accurately. This enables the detection of distributions against a trend, such as when progressively increasing the bit-length window of a binary stream to assess the presence of a language-like fixed length encoded symbol system.

Comparison at 7 bit window

Frequency   Language   Binary image   Random
1-100       14         127            128
101-200     7
201-300     4
301-400     1
401-500     0
501-600     0          1
601-700     1

Fig. 6. The figure compares distributions of language, image and randomly generated ASCII characters.

An interesting and useful phenomenon found whilst decoding a recent transmission challenge set by the SETI community [3] was that binary image (��) data comprise pattern frequencies which, all except for one of high frequency, consistently fall within percentages of less than 10 per cent, and mostly less than one per cent. This particular pattern distribution `fingerprint' constantly occurs across all window chunk-length segmentations (between 1 and 20; also cross-correlated with the 1974 Arecibo signal [1]) and has provided a robust detector for the presence of binary image `signals', in contrast to the Zipfian-like distributions [12] seen in natural language. Formula for the presence of binary image data:

\[ �� = f(P_n) + \sum f(P_i) \]

where \( f(P_n) = \max_{1 \le i \le n} f(P_i) \); \( n = \arg\max_{1 \le i \le n} f(P_i) \); \( P_i = \{ 0 < P_i \le 10\% \} \).
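The banding of pattern frequencies and the `one high-frequency pattern, all others at or below 10 per cent' fingerprint can be sketched as below. The function names, the 10-point band width and the toy bit-stream are assumptions made for illustration, and percentages that land exactly on a band edge are subject to floating-point rounding.

```python
import math
from collections import Counter

def pattern_percentages(bits, chunk_len):
    """Percentage frequency of each non-overlapping chunk pattern."""
    chunks = [bits[i:i + chunk_len]
              for i in range(0, len(bits) - chunk_len + 1, chunk_len)]
    counts = Counter(chunks)
    return {p: 100.0 * c / len(chunks) for p, c in counts.items()}

def band_distribution(percentages, band_width=10):
    """Number of patterns per band; band 0 covers (0, 10], band 1 (10, 20], ..."""
    bands = Counter()
    for pct in percentages.values():
        bands[math.ceil(pct / band_width) - 1] += 1
    return dict(bands)

def has_image_fingerprint(percentages, low_cutoff=10.0):
    """True if exactly one pattern exceeds the cutoff and at least one
    other pattern exists at or below it."""
    high = [p for p in percentages.values() if p > low_cutoff]
    return len(high) == 1 and len(percentages) > 1

# Hypothetical stream dominated by the all-zero 3-bit pattern.
stream = "000" * 14 + "001" + "010" + "100" + "111" + "011" + "101"
pcts = pattern_percentages(stream, 3)
bands = band_distribution(pcts)
```

Running the same functions over each chunk length in turn produces one column of a table like those in Figs. 7 and 8.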

In order to establish the absence of any intelligent language-like fixed length encoded communication, type-token distributions were calculated for chunk lengths from one to 20. As previously, results showed distinctive non-language-like behaviour, markedly contrasting with language, which gives a much broader distribution across the bandwidth spectrum (see Fig. 6). In this case, all pattern frequencies except one were constantly recorded as less than 10 per cent (mostly less than one per cent), with one pattern of high frequency. See Fig. 7 for details.

As a comparator of non-language-like transmissions already used, and as a now likely candidate for the bit-stream format, I analysed the original 1974 Arecibo transmission [4]. As depicted in Fig. 8, and although it is a much shorter transmission, the type-token distribution shows a very similar profile, prompting the hypothesis that the bit-stream under investigation was not likely to be language but a binary image; note that randomly generated patterns result in all type-tokens recording a frequency within the lowest bandwidth (see Fig. 6).

As no evidence could therefore be seen for the presence of language structure, but strong evidence was present for a binary image during these initial stages, no further analysis regarding elements of the language hierarchy was deemed appropriate for this `signal', and the hypothesis for image data was supported and later confirmed by the author.

SETI 2001 bit stream transmission

% Freq    Bit-length of chunking window
          3    4    5    6    7    8    9    10
1-10      7    15   31   63   127  253  491  492
11-20     (none)
21-30     (none)
31-40     (none)
41-50     (none)
51-60     one pattern at each of five chunk-lengths
61-70     one pattern at each of two chunk-lengths
71-80     one pattern at one chunk-length

Fig. 7. Pattern distributions for SETI 2001 transmission for first 10 chunk-lengths.

SETI Arecibo transmission 1974

% Freq    Bit-length of chunking window
(bandwidth)
          3    4    5    6    7    8    9    10   11   12
1-10      7    15   30   44   58   77   87   87   95   98
11-20     one pattern at each of three chunk-lengths
21-30     one pattern at each of three chunk-lengths
31-40     one pattern at each of two chunk-lengths
41-50     one pattern at one chunk-length
51-60     one pattern at one chunk-length

Fig. 8. Pattern distributions for SETI Arecibo 1974 transmission for first 12 chunk-lengths.

6. Conclusions

In this paper I have presented findings which cover the areas of music, binary image data, the combinational constraint behaviour of English and Chinese, and how functional words can disclose a content word's part-of-speech.

I believe this bottom-up approach is essential in viewing language transparently at its various levels of abstraction and thereby understanding all the mechanisms involved in its generation and processing. This has been particularly apparent when detecting, unsupervised and without prior knowledge, language-like behaviour in unknown `signals' for the purposes of contributing to detection algorithms for SETI, as it has provided an approach that can compare and disambiguate different forms of structured phenomena by either a single stage filter or a combination of heuristics: effectively providing algorithms for the filtration of inter-galactic objets trouvés and the identification of the lingua ex machina hierarchy.

To summarise, achievements to date include:

• a method for splitting a binary digit-stream into characters, by using entropy to diagnose byte-length;
• internal structure confirmation and detection of `legal' combinations using higher-order entropy and ngrams;
• a method for tokenising unknown character-streams into words of language;
• an approach to chunking words into phrase-like sub-sequences, by assuming high-frequency function words act as phrase-delimiters;
• a visualisation tool for exploring word-combination patterns, where word-pairs need not be immediate neighbours but characteristically combine despite several intervening words;
• coarse-grained unsupervised part-of-speech clustering, using function words;
• binary image detection using type-token distributions;
• a toolkit for analysing the physical structure of audio signals both historically and over time.

I submit these results, methods and formulae in continuation of my previous work in this area [1,2], and in the hope of contributing further to a more fundamental understanding of what distinguishes language from the rest of the signal universe.

References

[1] J. Elliott, E. Atwell, W. Whyte, First stage identification of syntactic elements in an extraterrestrial signal, in: Proceedings of IAC'2001: the 52nd International Astronautical Congress, Paper IAA-01-IAA.9.2.07, Toulouse, France, 2001.

[2] J. Elliott, E. Atwell, Language in signals: the detection of generic species-independent intelligent language features in symbolic and oral communications, in: Proceedings of the 50th International Astronautical Congress, Paper IAA-99-IAA.9.1.08, International Astronautical Federation, Paris, 1999.

[3] Y. Dutil, University of Catalunya, Barcelone, Spain & ABB Bomem Inc, 2001 〈http://ww3.sympatico.ca/stephane_dumas/CETl/output_stream.txt〉.

[4] Arecibo Observatory 〈http://www.naic.edu/〉.

[5] A. Roberts, Undergraduate dissertation: automatic acquisition of word classification using distribution analysis of content words with respect to function words, School of Computing, University of Leeds, UK, 2002.

[6] J. Ormsby, Translation of Don Quixote de la Mancha Miguel De Cervantes 1665, 1885.

[7] S.S. Piao, Sentence and word alignment between Chinese and English, Ph.D. Thesis, Lancaster University, 2000; S.S. Piao, Chinese Corpus Adapted from CEPC Corpus, Sheffield University, Sheffield, UK, 2000.

[8] Lancaster Oslo Bergen (LOB) Corpus, 〈www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html〉.

[9] C.E. Shannon, A mathematical theory of communication, Bell System Technical Journal 27 (1948) 379–423, 623–656.

[10] J.L. Casti, Would be Worlds: Twelve Tone Music, Wiley, USA, 1997, pp. 66–69.

[11] J. Elliott, The SETI challenge, in: Proceedings of the 5th Annual CLUK Colloquium: Computational Linguistics in the United Kingdom, University of Leeds, 2002.

[12] G.K. Zipf, Human Behaviour and the Principle of Least Effort, Addison Wesley Press, New York, 1949 (1965 reprint).

[13] J. Elliott, E. Atwell, Visualisation of long distance grammatical collocation patterns in language, in: IV2001: Proceedings of 5th International Conference on Information Visualisation, 2001, pp. 297–302. ISBN 0-7695-195z.