Heinz Dreher
School of Information Systems
Curtin University, Perth, Western Australia
Automatic Semantic Trend Analysis of the Bled eConference: 2001-2012 – Heinz Dreher – 19 June 2012 1
25th Bled eConference
eDependability: Reliable and Trustworthy eStructures, eProcesses, eOperations and eServices for the Future
June 17, 2012 – June 20, 2012; Bled, Slovenia
Automatic Semantic Trend Analysis of the Bled eConference: 2001-2011
overview
Finding concepts and concept-trends in vast repositories of text
Semantic analysis algorithms have developed over the last decade to the point where they are almost as accessible to everyone as Google is for text searching
This study reports on an experimental application of automated semantic analysis of the Bled eConference 2001-2011 proceedings full text corpus
Rubrico, the specific tool used in the study, is introduced
The methodology used to deploy Rubrico on the Bled corpus for the purpose of revealing the embedded concepts is explained
Interpretation and discussion are offered to indicate the possibilities ensuing from the semantic analysis
Future work is indicated to address limitations and further explore the prospects
the problem
given a large body of knowledge (textual corpus) how do we find out what is in it?
consider the proceedings of your favourite conference, e.g. Bled eConference held in Slovenia in June each year
attracting speakers and delegates from business, government, information technology providers and universities - it is a major venue for researchers working in all aspects of “e”
#documents for the eBled conference years 2001 - 2011
year    2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  total
#docs     50    49    71    52    51    52    60    45    41    42    42    555
some possible solutions
if you are a lecturer, you get your 40 students to each read and summarize 10 papers (=400), and you do 100 yourself (now 500 are done), and reward the 11 good students with another 5 each, and we have the 555 papers summarized! (they get marks, you get summaries)
if you are an award winning researcher, get your RA to do the work – following a methodology you have already had her devise
if you have PhD students you set up a paper review and critical appraisal colloquium and offer the task as a learning exercise in preparation for their upcoming literature review
if you are innovative, you’ll look around for some ‘smartech’ way to tackle the task
a ‘smartech’ way to tackle the task
this will use already devised algorithms suited to the purpose and involves, rather obviously, artificial intelligence, since the task of finding concepts in a document requires intelligence
in machine learning, a branch of artificial intelligence, there exist suitable algorithms that can be deployed
semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents. Initially it does not involve prior semantic understanding of the documents, although this is beneficial for subsequent follow-on and refined analyses
Latent Semantic Analysis (or Latent Semantic Indexing) is perhaps the best known – but there are many others
a sample of the many methods & techniques
1 Supervised learning
Artificial neural network; Bayesian statistics; Case-based reasoning; Decision trees; Inductive logic programming; Gaussian process regression; Learning Automata; Learning Vector Quantization; Minimum message length (decision trees, decision graphs, etc.); Nearest Neighbor Algorithm; Analogical modeling; Support Vector Machines; Regression analysis
2 Unsupervised learning
Association rule learning; Hierarchical clustering; Partitional clustering e.g. Latent Semantic Analysis (LSA or LSI)
3 Reinforcement learning
4 Combinations and adaptations of the above
Rubrico (Reiterer, Dreher, & Gütl)
Normalised Word Vector (NWV) as used in MarkIT™ (Williams & Dreher)
webLyzard technology (Scharl & Weichselbraun)
(refer to, inter alia, Wikipedia for a good start on the topic of semantic analysis)
the Rubrico process
Rubrico generates a visualisation
Rubrico generates a concept hierarchy, and Rr values (Rubrico-relevance)
Ontological structure of concept “knowledge” computed from “Bled 2001-2011 Abstracts”
As an example of a concept hierarchy (3 levels deep), look to top left. The concept “knowledge” subsumes “model”, which itself subsumes “framework”.
Typically, concepts are identified via a thesaurus, reference ontology, and word taxonomies such as that implemented in the lexical database WordNet.
‘Rubrico-relevance’ or Rr is shown in the rightmost column
Rubrico ConceptKeyword counts for the Bled 2001-2011 corpus
Σ from 1 to ConceptKeywordCount of all Rr values for a conference year = 100
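The constraint that a year's Rr values sum to 100 can be illustrated with a minimal sketch. How Rubrico actually derives Rr is not described on these slides, so the function `normalise_rr` and the raw scores below are hypothetical; the sketch only shows one way such a normalisation could work.

```python
def normalise_rr(raw_scores):
    """Scale raw relevance scores so that they sum to 100 for one conference year."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("raw scores must have a positive sum")
    return {concept: 100.0 * score / total for concept, score in raw_scores.items()}


# Hypothetical raw scores for a single year (not taken from the Bled corpus).
raw = {"user": 12.0, "knowledge": 6.0, "model": 4.0, "framework": 2.0}
rr = normalise_rr(raw)
```

Under this assumption each Rr value can be read as a concept's percentage share of the year's total relevance, which is what makes values comparable across years with different ConceptKeyword counts.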
what we (humans) may generate from what Rubrico generates - interpretation
use Rr to identify ‘Top concepts’ (and ‘Bottom concepts’ too)
focus on the Rr_rank (1 is the highest value Rr in a given year, or unit of analysis)
choose some ‘concepts’ for detailed study (as one cannot realistically track thousands) – good opportunity for further tool development
focus on the top-30
and/or the bottom-30
or make any other selection
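The selection steps above can be sketched as follows. This is an illustration only: the names `rank_by_rr` and `top_and_bottom`, the tie-breaking rule, and the sample Rr values are assumptions, not part of Rubrico.

```python
def rank_by_rr(rr):
    """Return {concept: Rr_rank}; rank 1 is the concept with the highest Rr."""
    # Ties are broken alphabetically so that the ranking is deterministic.
    ordered = sorted(rr, key=lambda concept: (-rr[concept], concept))
    return {concept: rank for rank, concept in enumerate(ordered, start=1)}


def top_and_bottom(rr, n=30):
    """Return the n best-ranked and the n worst-ranked concepts for one year."""
    ranks = rank_by_rr(rr)
    ordered = sorted(ranks, key=ranks.get)
    return ordered[:n], ordered[-n:]


# Hypothetical Rr values for one year (not taken from the Bled corpus).
rr_one_year = {"user": 4.0, "knowledge": 3.0, "model": 2.0, "pp": 1.5, "site": 0.1}
top, bottom = top_and_bottom(rr_one_year, n=2)
```

With thousands of ConceptKeywords per year, a helper of this kind would reduce the list to a size a human can inspect; any other selection rule could be substituted for the top-n and bottom-n cut.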
using Rr to select concepts
Top30 Concepts in 2001 and 2011
Top 30 Concepts by Relevance, and # Years Occurring
Bottom 30 Concepts by Relevance, and # Years Occurring
a final set of 37 concepts (ConceptKeywords) to focus on
37 ConceptKeywords and their usage over 11 Conference Years
“ConceptKeywords-and-their-usage-over-time” = Appendix 2
Interpretation and Discussion
the ConceptKeywords identified as ‘prominent’ across the 11 conference years for further study in this presentation have been selected from the ‘Top 37’, and they are:
1) user; 2) pp; 3) relationship; 4) knowledge; 5) com; 6) group
zero-entry concepts for some years: 11) supplier; 13) participant; 14) need; 18) site; 19) respondent; 22) employee; 23) device
concepts first appearing in 2005: access; http; people
most ‘prominent’ across the 11 conference years was ConceptKeyword user
1) user;
<<< put in the ontology for user across the 11 years … on one slide???? huh ? I will need to devise some new ‘dynamic’ vizualisations for this … must wait for later …>>>
2001 ontology for user
2011 ontology for user
1) user
in each of the 11 conference years (2001-2011), the concept of user featured strongly, being ranked (Rr_rank) at either 1, 2, or 3, except for the year 2010 in which it achieved only 3834th place.
this perturbation, at first very surprising, is easily explained. See below, and note that in the column named “rank-B.10” the ConceptKeywords people and health (rows 27 and 28) have values 2 and 3 respectively. This indicates that authors were using the term or concept people rather than user, and the reader may check that eHealth was a big feature of the 23rd Bled eConference, held in that year. People is another perspective on user, and one may postulate that what we see here is the authors’ response to the calls of the conference organisers and editors, adding weight to the proposition that editors have a big influence on the direction of thinking of a body of authors. To the extent that this is true, it supports the view that the semantic analysis (for example as per Rubrico) is creating a ‘true’ picture of reality.
2) pp
the second most prominent ConceptKeyword to emerge is pp. What an odd thing that is.
pp is of course meaningless as a concept in the usual sense; however, if we understand that the corpus is a collection of scholarly articles, for which the authors have created reference lists, often including a sequence of pages in their citations (indicated by “pp. 22-55”, for example), then we may make some sense out of this.
is there a particular style of referencing being required which may explain the pp performance?
note that in the year 2001 the Rr_rank is 6; it then climbs to 3, then 2, and is very often at 1. This could, for example, be associated with an already-strong expectation of precise citation being tightened during the early years of the decade.
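How a citation marker such as “pp. 22-55” can surface as a token is easy to reproduce. The sketch below is not Rubrico's tokeniser, which is not described here; `tokenise`, the `PAGE_RANGE` pattern, and the two reference entries are made up for illustration.

```python
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and return its purely alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical reference-list entries in the citation style described above.
references = [
    "Author, A. (2005). A paper title. Some Journal, 6, pp. 22-55.",
    "Author, B. (2007). Another title. Other Journal, 3, pp. 101-120.",
]

# Naive counting: every page-range marker contributes one "pp" token.
counts = Counter(tok for ref in references for tok in tokenise(ref))

# One possible remedy: remove page-range markers before tokenising.
PAGE_RANGE = re.compile(r"\bpp?\.\s*\d+(?:\s*[-–]\s*\d+)?")
cleaned = Counter(
    tok for ref in references for tok in tokenise(PAGE_RANGE.sub(" ", ref))
)
```

A pre-processing step of this kind, or simply excluding reference lists from the corpus, would be one way to keep such artefacts out of the concept ranking; whether that is desirable depends on whether citation practice is itself of interest.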
3) 2004_relationship
3) 2011_relationship
3) relationship
relationship is the third most prominently occurring ConceptKeyword; its Rr_rank profile rises and falls but remains between a low of 15 and a high of 2.
does such a consistent and strong performance indicate a very great emphasis, in this conference, on the pursuit of truth and explanation through systematic study and investigation of the essential connections between things?
4) knowledge
4) knowledge
knowledge is the fourth most prominent ConceptKeyword; its Rr_rank profile rises and falls but remains between a low of 17 and a high of 2.
does such a consistent and strong performance indicate a very great emphasis, in this conference, on the pursuit of truth and explanation through systematic study and investigation of the essential connections between things?
5) com
item 5 featured in the ‘top’ 37 ConceptKeywords list is com, which performs strongly over all of the conference years and is clearly associated with references to “dot com” and website URLs.
6) group
6) group
group is the first item (according to a row-by-row consideration of Appendix 2) to have a zero score (in year 2001); it then grows in prominence gradually, if a little erratically, over the ensuing decade. It may reveal an emergence of the role of people in teams and concern for societal issues in general.
in the left-hand panel of the previous Figure, one can see that group is a rather extensive structure, consisting of 34 elements, nine of which have sub-hierarchies. In the right-hand panel we see the content sub-hierarchy, which is 3 levels deep.
Inspection of the occurrences of group reveals that this ConceptKeyword was absent from the 2001 proceedings.
zero-entry concepts for some years:
11) supplier; 13) participant; 14) need; 18) site;
19) respondent; 22) employee; 23) device
the following ConceptKeywords have zero entries for one or more conference years: supplier (missing in 2003); participant (2010); need (2002, 2005, 2008, 2010, and 2011); site; respondent; employee; device; …..
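Finding zero-entry years in a usage-over-time table is mechanical once the table exists. In the sketch below the missing years for group, supplier, participant and need are those stated on the slides; the value 1.0 is only a placeholder meaning "present" (the real Rr values are in Appendix 2), and the helper names are hypothetical.

```python
YEARS = list(range(2001, 2012))  # the 11 conference years 2001-2011


def zero_entry_years(usage):
    """Map each ConceptKeyword to the list of years in which its entry is zero."""
    return {
        concept: [year for year in YEARS if by_year.get(year, 0) == 0]
        for concept, by_year in usage.items()
    }


def make_row(missing):
    # Placeholder rows: 0.0 where the slides report a zero entry, 1.0 elsewhere.
    return {year: (0.0 if year in missing else 1.0) for year in YEARS}


usage = {
    "user": make_row(set()),
    "group": make_row({2001}),
    "supplier": make_row({2003}),
    "participant": make_row({2010}),
    "need": make_row({2002, 2005, 2008, 2010, 2011}),
}

# Keep only the concepts that have at least one zero-entry year.
gaps = {c: years for c, years in zero_entry_years(usage).items() if years}
```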
concepts first appearing in 2005:
access; http; people
ConceptKeywords access, http, and people are remarkable because they first appear only in 2005, then disappear for a period and perhaps reappear. This may be indicative of a fad, but ….
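An appear, disappear, reappear pattern can be made explicit by grouping a concept's years of occurrence into consecutive runs. The sketch below is illustrative: `presence_runs` is a hypothetical helper, and of the sample years only the 2005 first appearance comes from the slide; the later years are invented to show the mechanics.

```python
def presence_runs(years_present):
    """Group the years in which a concept occurs into runs of consecutive years."""
    runs = []
    for year in sorted(years_present):
        if runs and year == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], year)  # extend the current run
        else:
            runs.append((year, year))  # start a new run
    return runs


# Hypothetical occurrence pattern for one concept.
years_present = {2005, 2008, 2009}
runs = presence_runs(years_present)
first_year = runs[0][0]
reappeared = len(runs) > 1
```

More than one run means the concept vanished and came back; a single short run would be the stronger hint of a fad.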
human concepts vs Rubrico concepts
the human-created ontology (in our terminology) is at a higher conceptual level than achieved by Rubrico. Combinations of terms such as “eMarkets, Directories, Auctions” are reported as being characteristic of the period 1998-2002 (Clarke 2012)
for the ‘super concept’ formed from “eMarkets, Directories, Auctions” to be detected automatically, it must have a textual association and eventual representation in the corpus
for the years 2001 and 2002 in our analysis, the ConceptKeywords transaction and implementation may pertain
eMarkets_Directories_Auctions cf. ConceptKeywords transaction & implementation
Clarke (2012) presents manual analyses of the Bled Conference corpus
the human-created ontology (in our terminology) is at a higher conceptual level than achieved by Rubrico
combinations of terms such as “eMarkets, Directories, Auctions” are reported as being characteristic of the period 1998-2002 (Clarke 2012)
for the ‘super concept’ formed from “eMarkets, Directories, Auctions” to be detected automatically it must have a textual association and eventual representation in the corpus
for the years 2001 and 2002 in our analysis, the ConceptKeywords transaction and implementation may pertain
an intensive knowledge-elicitation and -engineering exercise would be needed to match this against the mentioned ‘super concept’ eMarkets_Directories_Auctions
Future work
As in all experimental research, there are limitations and deficiencies that one would like addressed. Rubrico makes possible the semantic treatment of vast amounts of text; but it is not intelligent.
The human mind may find it difficult to comprehend certain ontological structures that Rubrico computes (e.g. group, user, pp). Therefore, an improvement to consider is human editing of the initially-computed ontology, followed by a re-run of the semantic analysis.
Another limitation at present is the performance of Rubrico with large document sets – which we currently estimate as being greater than about 100 conference papers. Our initial attempt at analysis was to deploy Rubrico on a laptop computer, to deal with the entire 555 documents in the Bled 2001-2011 Conference corpus; it resulted in ‘stagnation’. This points to a clear need for implementation in a more computationally-powerful environment.
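The size limit can be checked against the per-year document counts reported earlier. The sketch below assumes, as one plausible reading, that the analysis is run one conference year at a time; `MAX_DOCS` encodes the approximate limit of 100 papers and is not an exact Rubrico parameter.

```python
# Document counts per conference year, as given in the #docs table.
DOCS_PER_YEAR = {
    2001: 50, 2002: 49, 2003: 71, 2004: 52, 2005: 51, 2006: 52,
    2007: 60, 2008: 45, 2009: 41, 2010: 42, 2011: 42,
}
MAX_DOCS = 100  # approximate practical limit reported for a laptop run

total = sum(DOCS_PER_YEAR.values())
# Years whose proceedings alone would exceed the limit.
oversized = [year for year, n in DOCS_PER_YEAR.items() if n > MAX_DOCS]
largest_year = max(DOCS_PER_YEAR, key=DOCS_PER_YEAR.get)
```

The whole corpus is well over the limit, while every single year, including the largest, stays under it, so a year-by-year run fits the constraint and also matches the conference year as the unit of analysis.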
Next, we want further functionality to automate the construction of, for example, the “ConceptKeywords-and-their-usage-over-time” appendix, together with interactivity, interoperation, and dynamic visualisation of the elements of the information structures depicted therein.
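Automating the usage-over-time table amounts to pivoting per-year results into one row per ConceptKeyword. The following is a minimal sketch under the assumption that each year's output is available as a mapping from concept to Rr; the function names and the sample values are hypothetical, and Rubrico's real output format is not described here.

```python
def usage_over_time(rr_by_year):
    """Pivot {year: {concept: Rr}} into {concept: [Rr per year]}; 0.0 where absent."""
    years = sorted(rr_by_year)
    concepts = sorted({c for rr in rr_by_year.values() for c in rr})
    table = {c: [rr_by_year[y].get(c, 0.0) for y in years] for c in concepts}
    return years, table


def format_table(years, table):
    """Render the pivoted table as fixed-width plain text."""
    header = "concept".ljust(12) + "".join(str(y).rjust(7) for y in years)
    rows = [
        concept.ljust(12) + "".join(f"{value:7.2f}" for value in values)
        for concept, values in table.items()
    ]
    return "\n".join([header] + rows)


# Hypothetical per-year output (not taken from the Bled corpus).
rr_by_year = {
    2001: {"user": 3.1, "knowledge": 2.4},
    2002: {"user": 2.9, "group": 0.8},
}
years, table = usage_over_time(rr_by_year)
text = format_table(years, table)
```

Filling absent concepts with 0.0 is a design choice that makes zero-entry years visible directly in the table.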
In order to check the usefulness of automated conceptual analysis of full text corpora such as has been attempted here, one would ideally need to engage in a comparison with the results of other types of analyses, and especially human-generated ones. As there is a special section at this 25th anniversary of the Bled conference, any alternative analyses published could form an interesting and useful agenda for further research.
Acknowledgements
Emanuel Reiterer who developed the Rubrico “conceptual analysis system” into a computer program during a Master’s project in 2009 (TU Graz - with C. Gütl & H. Dreher)
Gregor Lenart who was instrumental in providing access to the full text of the proceedings for the years 2001-2012
Roger Clarke who encouraged the computer analysis of the full-text documents and assisted in obtaining access to the textual repository referred to here as the Bled 2001-2011 Conference Proceedings Corpus
Naomi Dreher – the RA who devised the “Finding Relevant Concepts Method” - Dreher, N. & Dreher, H. (2011). Empowering Doctoral Candidates in Finding Relevant Concepts in a Literature Set. International Journal of Doctoral Studies, Vol. 6, pp. 33-50. http://ijds.org/