Michael P. Oakes


Transcript of Michael P. Oakes

Page 1: Michael P. Oakes

Michael P. Oakes

University of Sunderland

Page 2: Michael P. Oakes

Contents

• Proposals for a Master’s programme in Natural Language Processing

• Future research plans / link with Wolverhampton

• Plans for publications

• Plans for grant proposals

• Other funding ideas

Page 3: Michael P. Oakes

Proposals for a Master’s programme in Natural Language Processing

• Some preliminaries:

• Entry requirements: first or second class degree in a related discipline. Computer programming will be taught from scratch.

• Funding: Erasmus, European Social Fund, ESRC Master’s training package scheme for programme development, work-based learning

• Students must receive an accurate idea of the content of the programme beforehand

• Induction week: meet the teaching team, familiarisation with the University, formal registration, etc.

• Diploma, Certificate and Master’s awards. 8 taught modules (24 lectures, 18 hours’ practical, 58 directed reading, 50 self-directed research).

Page 4: Michael P. Oakes

Certificate Stage

REPLI (Research, Ethics, Professionalism and Legal Issues). Generic research skills such as referencing, statistics, experimental design. BCS Accreditation. *

Programming. Perl for string handling, R for statistics (can handle Bayesian statistics, text mining and graphs), Introduction to Java for general computing.

Overview of NLP. Phonetics, morphology, lexis, parts-of-speech, syntax, semantics, pragmatics.

Empirical Linguistics. Corpora, annotation, alignment, collocations, anaphora resolution. *

Page 5: Michael P. Oakes

Diploma Stage

Symbolic NLP. Finite-state transducers, parsing, semantic representation: first-order predicate calculus, semantic networks.

Machine Translation. Statistical, symbolic, example-based.

Information Retrieval. Vector space model, indexing, summarisation, evaluation, clustering, text classification, text data mining. *

Research Seminars. All members of the group (and outside speakers) talk about their research. Assessment is to produce a good project proposal.

Page 6: Michael P. Oakes

Project

• Close links with industry established through 3-month industrial placements, based either with the company or at the University.

• The sponsor will be from either industry or academia, and there will also be a staff member from Wolverhampton to act as supervisor.

• Project management (TOR, reviews), poster, viva, dissertation (typically introduction, research, analysis, implementation, evaluation / experiments, reflective conclusions).

Page 7: Michael P. Oakes

Administration

• Programme board of studies: Institute Director or deputy, student representatives, one or more employers’ representatives, module leaders and programme leader. Responsible for the management of the programme and the well-being of each module.

• Board of assessment: decides student progression. Includes the External Examiner; no student representatives.

• Internal (prior to hand-out) and External (sample work shown prior to programme assessments) moderation.

• Other quality control: student and staff feedback, EE’s report, programme annual report.

• Each student has a personal tutor and student handbook.

• Timely, face-to-face assessment may improve student satisfaction.

Page 8: Michael P. Oakes

Future Research Plans,

• And how these might complement the research topics of the Research Group in Computational Linguistics.

Page 9: Michael P. Oakes

Automatic Summarisation

• CAST Project produced an automatic summarisation tool: “term-based summarisation”

• Concept-Based Abstracting (Paice).

• TRESTLE (Gaizauskas).

• David Evans: evaluation of information extraction

• Query-based summaries. Intrinsic (representativeness) vs. Extrinsic (judgeability) evaluation (Liang).

• SumTrain: reached second round of EU evaluation.

• Extraction of statistics-related phrases, e.g. “greater than”, “significant reduction in”, “was directly proportional to”, “did not affect”.

Page 10: Michael P. Oakes

Concept-Based Abstracting Project

• window length = 4

• STOP 6 "and foliar treatment AGEN"

• 5 "foliar treatment AGEN +"

• 5 "treatment AGEN + AGEN"

• 4 "effect of mildew AGEN"

• 3 "AGEN gave a significant"

• 2 "AGEN was the most"

• 2 "AGEN at different sowing"

• 2 "AGEN increased fertile tillers"

• LOW-FQ 1 "effect of AGEN sprays"
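A frequency list of this kind could be produced roughly as follows: slide a four-word window over text in which the entity of interest has already been replaced by the placeholder AGEN, and count every window that contains it. This is a minimal sketch only. The function names and sample sentences are invented for illustration, and the STOP and LOW-FQ labels shown on the slide are not modelled.

```python
from collections import Counter


def windows_with_placeholder(tokens, length=4, placeholder="AGEN"):
    """Yield every run of `length` consecutive tokens that contains the placeholder."""
    for start in range(len(tokens) - length + 1):
        window = tokens[start:start + length]
        if placeholder in window:
            yield " ".join(window)


def count_windows(sentences, length=4, placeholder="AGEN"):
    """Count placeholder-bearing windows across whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(windows_with_placeholder(sentence.split(), length, placeholder))
    return counts


# Invented sample sentences, for illustration only.
sample = [
    "seed and foliar treatment AGEN + AGEN reduced mildew",
    "soil and foliar treatment AGEN gave a significant yield increase",
]
counts = count_windows(sample)
for phrase, freq in counts.most_common(3):
    print(freq, phrase)
```

Windows never cross sentence boundaries in this sketch, which is an assumption rather than something the slide states.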

Page 11: Michael P. Oakes

Automatic Terminology Processing

• Le An Ha looked at the concept of a terminology rather than individual terms. Knowledge patterns from glossaries: store of terms and relations between them.

• David Evans. Identification of terms using TF.IDF and other statistical methods (see slide 20).

• Shiyan Ou. Sentiment classification (see slide 20).

• Constantin Orasan. Corpus of junk mail (spam filters, Farrow).

• Constantin Orasan. Analysis of genre differences – project on “Language, Computation and Style” (authorship).

• Englishes, Scrip newsfeeds, BELGA: “feature extraction for text classification”.
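As a rough illustration of the TF.IDF weighting mentioned above, the sketch below scores each word in a document by its frequency in that document multiplied by the log of the inverse document frequency. TF.IDF has many variants; the unsmoothed form used here, the function name and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Toy documents, invented for illustration.
sample_docs = [
    "the anaphora resolver links the pronoun",
    "the parser builds the tree",
    "the tagger labels the word",
]
weights = tf_idf(sample_docs)
print(sorted(weights[0], key=weights[0].get, reverse=True)[:3])
```

A word that occurs in every document, such as "the" here, receives a weight of zero, which is why high-scoring words are candidate terms.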

Page 12: Michael P. Oakes

Annotation tools

• Constantin Orasan: PALinkA, automatic annotation of anaphoric links.

• Lewandowska, Oakes & Rayson: part-of-speech and semantic code tagging in English; alignment enables partial semantic tagging of L2.

Page 13: Michael P. Oakes

Annotation: Aligned and Partially Tagged Polish text (Lewandowska, Oakes and Rayson)

• Tak jest_A3+ mowi Polemarch_Z99 a do_Z5 tego jeszcze urzadra nocne nabozenstwo, ktore_Z8 warto zobaczyc

• “_”_PUNC That_DD1_Z8 ’s_VBZ_A3+ the_AT_Z5 way_NN1_X4.2 of_IO_Z5 it_PPH1_Z8 ,_,_PUNC “_”_PUNC said_VVD_Q2.1 Polymarchus_NP1_Z99 _,_,PUNC “_”_PUNC and_CC_Z5 ,_,_PUNC besides_RR_Z5 _,_,PUNC there_EX_Z5, is_VBZ_A3+ to_TO_Z5 be_VBI_A3+ a_AT1_Z5 night_NNT1_T1.3 festival_NN1_K1/S1.1.3+ which_DDQ_Z8 will_VM_T1.1.3 be_VBI_A3+ worth_II_I1.3 seeing_VVG_X3.4 ._._PUNC
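The partial tagging shown above can be pictured as copying the semantic code of each English token onto the Polish token it is aligned with. The sketch below assumes the word alignment is already available as a mapping from Polish token positions to English token positions. That mapping, the function names and the shortened example are hypothetical, and the slide does not say how the alignment was actually obtained.

```python
def parse_tagged(token):
    """Split 'word_POS_SEM' from the right, so the word itself may contain underscores."""
    word, pos, sem = token.rsplit("_", 2)
    return word, pos, sem


def project_semantic_tags(l2_tokens, tagged_english, alignment):
    """Copy each aligned English token's semantic tag onto the L2 token.

    `alignment` maps an L2 token index to an English token index;
    unaligned L2 tokens are left untagged.
    """
    output = []
    for i, word in enumerate(l2_tokens):
        if i in alignment:
            _, _, sem = parse_tagged(tagged_english[alignment[i]])
            output.append(f"{word}_{sem}")
        else:
            output.append(word)
    return " ".join(output)


polish = ["Tak", "jest", "mowi", "Polemarch"]
english = ["That_DD1_Z8", "’s_VBZ_A3+", "said_VVD_Q2.1", "Polymarchus_NP1_Z99"]
hand_alignment = {1: 1, 3: 3}  # hypothetical hand-made alignment
projected = project_semantic_tags(polish, english, hand_alignment)
print(projected)
```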

Page 14: Michael P. Oakes

Mobile Devices

• Laura Hasler and Dalila Mekhaldi: QALL-ME, Question-Answering for Digital Phones.

• Chufeng Chen: Annotation of digital photographs taken with a GPS camera. A gazetteer “translated” latitude and longitude data into place name, geographical feature, e.g. Lat = 54.91, Long = -1.4, place = Sunderland, feature = harbour. Episodic memory.
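One simple way to picture the gazetteer step is a nearest-entry lookup: given the GPS fix stored with a photograph, return the place name and feature of the closest gazetteer entry. The sketch below takes 54.91 as the latitude and -1.4 as the longitude of Sunderland. The other two entries, their approximate coordinates and features, and the function names are invented for illustration, and nothing here is claimed to be the method used in the project.

```python
import math

# Hypothetical gazetteer entries: (latitude, longitude, place, feature).
GAZETTEER = [
    (54.91, -1.40, "Sunderland", "harbour"),
    (54.97, -1.61, "Newcastle upon Tyne", "bridge"),
    (54.78, -1.58, "Durham", "cathedral"),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def annotate(lat, lon, gazetteer=GAZETTEER):
    """Return (place, feature) of the gazetteer entry nearest to the GPS fix."""
    nearest = min(gazetteer, key=lambda e: haversine_km(lat, lon, e[0], e[1]))
    return nearest[2], nearest[3]


print(annotate(54.905, -1.39))
```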

Page 15: Michael P. Oakes

Other Related Work

• Andrea Mulloni: Corpus Linguistics.

• Empirical vs. Chomskyan

• Own interest “Statistics for Corpus Linguistics”.

• Driving the process rather than merely testing for statistical significance, e.g. Mutual Information to find collocations.

• Irina Temnikova: Machine Translation

• Alignment for example-based machine translation (Lewandowska & Oakes).
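To make the Mutual Information example concrete, the sketch below computes pointwise mutual information for adjacent word pairs: the log of how much more often two words occur together than would be expected if they were independent. The function name and the toy sentence are assumptions for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information, in bits, for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        p_pair = pair_count / n_pairs
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text, invented for illustration.
tokens = "strong tea and strong coffee but powerful computer and strong tea".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 2))
```

Pointwise mutual information tends to favour pairs made up of rare words, so a minimum frequency threshold is commonly applied before ranking candidate collocations.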

Page 16: Michael P. Oakes

Plans for Publications (1)

• Book Chapters in press:

• Processing Multilingual Corpora, Chapter 32 of Corpus Linguistics: An International Handbook, eds. Anke Lüdeling and Merja Kytö, Mouton de Gruyter.

• Corpus Linguistics and Stylometry, Chapter 52, ibid.

• Corpus Linguistics and Language Variation, in Contemporary Approaches to Corpus Linguistics, ed. Paul Baker, Continuum.

• Javanese, in “Languages of the World”, ed. Bernard Comrie, Routledge.

• J. Vilares, M. Oakes and M. Vilares: A Knowledge-Light Approach to Query Translation in CLIR. RANLP V, ed. N. Nicolov, Benjamins.

Page 17: Michael P. Oakes

Plans for Publications (2)

• Under second review:

• S-W. Ke, C. Bowerman and M. Oakes, “Automatic classification of personal email with PERC and time-related strategies”, ACM Transactions on Information Systems.

• W-C Lin, M. Oakes and J. Tait, “Improving image annotation via representative feature selection”, Cognitive Processing.

Page 18: Michael P. Oakes

Plans for Publications (3)

• Future plans:

• VITALAS: Video and image Indexing and reTrievAl in the LArge Scale.

• Update “Statistics for Corpus Linguistics” – sold over 1500 copies, but now 10 years old

• Last chapter was “Literary Detective Work”, which could be a book in its own right: disputed authorship (compendium of techniques, Shakespeare, religious texts, still unsolved mysteries e.g. The Quiet Don, Marxism and the Philosophy of Language), unknown languages (Linear B, Voynich manuscript). JLLC, QL.

Page 19: Michael P. Oakes

Plans for Grant Proposals (1)

• Closing the Semantic Gap

• Related to machine learning (boosting), caption analysis, gazetteers, alignment of low level image content features and high level semantic features (words)

• Son of VITALAS?

Image content and semantic description pairs:

• H = 0, S = 1, V = 0.5, F = 0.9: Kim Clijsters, tennis

• H = 1, S = 0.6, V = 0, F = 0.125: Palace of Brussels

• H = 0.3, S = 0.3, V = 1, F = 0.9: Centre Court, Wimbledon, tennis

Page 20: Michael P. Oakes

Plans for Grant Proposals (2)

• Which words are truly characteristic of a corpus? X² (chi-squared) etc.

• Countable linguistic features.

• Measures from IR e.g. PageRank (Łódź, Palomino).

• AHRC (if theoretical, Englishes), ESRC (if applied, e.g. spam filters).

• Sentiment analysis (Thijs Westerveld at Teezir): mining online opinions. Cheerful, chic, cheap, clean vs. chaos, cranky, cumbersome, damaged.

• Interface between NLP and IR: sentence analysis e.g. adjectives, negatives; follow links to navigate websites.

• IR relevant vs. irrelevant documents.
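The question of which words are characteristic of a corpus can be illustrated with a 2x2 chi-squared test: compare how often a word occurs in one corpus against another, relative to the size of each. The sketch below is a minimal version of Pearson's statistic. The function name and the counts in the usage line are invented, and it assumes the word occurs at least once overall and does not account for every token.

```python
def chi_squared(count_a, total_a, count_b, total_b):
    """Pearson chi-squared for a 2x2 table: word vs. other words, corpus A vs. corpus B."""
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, grand_total - count_a - count_b]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Invented counts: the word occurs 30 times in 100 tokens of corpus A
# and 10 times in 100 tokens of corpus B.
print(chi_squared(30, 100, 10, 100))
```

With one degree of freedom, a value above 3.84 is significant at the 5% level, so a word with a high score and a higher relative frequency in corpus A would be a candidate characteristic word of that corpus.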

Page 21: Michael P. Oakes

Plans for Grant Proposals (3)

• Temporal relations in query language modelling (Dawei Song).

• Temporal similarity + semantic similarity → overall similarity.

• The temporal similarity between texts (e.g. query and document) can be estimated by a) time stamp, b) temporal logic between the texts (Andrea Setzer).
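One possible reading of combining the two similarities is sketched below, using only the time stamp option (a). Semantic similarity is approximated by the cosine between bag-of-words vectors, temporal similarity decays with the gap between time stamps, and the two are mixed by a weighted sum. The decay function, the half-life, the mixing weight and all names are assumptions made for illustration and are not taken from the proposal. Temporal logic between texts, option (b), is not modelled.

```python
import math
from collections import Counter
from datetime import date


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def temporal_similarity(stamp_a, stamp_b, half_life_days=30.0):
    """1.0 for identical dates, halving every `half_life_days` of separation."""
    gap = abs((stamp_a - stamp_b).days)
    return 0.5 ** (gap / half_life_days)


def overall_similarity(text_a, stamp_a, text_b, stamp_b, weight=0.5):
    """Weighted sum of semantic and temporal similarity; `weight` is an assumed mixing parameter."""
    return (weight * cosine_similarity(text_a, text_b)
            + (1 - weight) * temporal_similarity(stamp_a, stamp_b))


print(overall_similarity("tennis final", date(2008, 7, 6),
                         "tennis final report", date(2008, 7, 7)))
```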

Page 22: Michael P. Oakes

Plans for Grant Proposals (4)

• Corpus Profiling Workshop on October 18th.

• Aims: to explore how corpus characteristics affect the behaviour of techniques in IR and NLP, and to set out a roadmap for a shared research agenda.

• Data set profile impacts on automatic classification, IR, anaphora resolution, automatic summarisation and word sense disambiguation.

Page 23: Michael P. Oakes

Other Funding Ideas

• IRSG-like “Industry Day” to foster industrial contacts (consultancy? Grant proposals?)

• Organise conferences, e.g. bid for Corpus Linguistics, CLEF, ECIR.

• Exploitation of Intellectual Property.

• Is there an equivalent of CEDEC (Computing and Engineering Distance Education Centre) with whom we can discuss marketing programmes world-wide / part-time? Work-based learning?