Concordancing the Web with KWiCFinder

Concordancing the Web with KWiCFinder. William H. Fletcher, United States Naval Academy. American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001.



Page 1: Concordancing the Web with KWiCFinder

Concordancing the Web with KWiCFinder

William H. Fletcher
United States Naval Academy

American Association for Applied Corpus Linguistics
Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001

Page 2: Concordancing the Web with KWiCFinder

How Big is the Web?

Now 2-4 billion webpages accessible via public links (Cyveillance estimates & projection, July 2000; Inktomi estimates are more modest)

“Invisible web” / restricted sites several times larger
Estimated 80%-95% content in English, but…
    Since mid 2000, non-Anglophones outnumber English speakers online
    Anglophones < 30% of 850 million users in 2005
    Percentage of new users fluent in English decreasing
    For many regions / languages, still no data available

Page 3: Concordancing the Web with KWiCFinder

Search Purposes

General users typically seek…
    a specific site
    any well-stocked site meeting their needs
Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources
Educators want to foster similar online research behavior in their students

Page 4: Concordancing the Web with KWiCFinder

Typical Search Behaviors

Marked preference for directories with pre-selected links organized by topic over full-text search engines
Simple queries – single word or phrase – predominate (80%-90%)
10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed
Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link

Page 5: Concordancing the Web with KWiCFinder

Typical Search Outcomes

Users follow up only first few links, then settle on a page after browsing from these

Usual outcome is a match, not best match

Page 6: Concordancing the Web with KWiCFinder

Ways to Use the Web for Instruction and Research

Micro level

Discover eloquent examples
Verify current / possible usage, with rough indication of prevalence
Acquire vocabulary not (yet) in dictionaries

Timeliness is essential -- “off-the-shelf corpora” often cannot help here!

Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)

Page 7: Concordancing the Web with KWiCFinder

Ways to Use the Web for Instruction and Research (2)

Macro level
    Find authentic texts accessible to students
    Locate relevant online resources for research projects
        Student reports
        Scholarly research

Page 8: Concordancing the Web with KWiCFinder

Impediments to Finding Relevant Resources Online

Reliance on commercial search engines (SEs) essential due to Web’s size
SEs’ priorities match ours only by coincidence
Link rot
    Pages move or disappear
    Page content changes

Page 9: Concordancing the Web with KWiCFinder

Challenges to Responsible Research

Online there is too much ephemeral content of unknown reliability

Preponderance of journalistic, commercial and personal texts of unknown authorship and authority
Details of sources and research methodology haphazard
Even student papers (gasp) and machine-translated texts (groan choke)

Page 10: Concordancing the Web with KWiCFinder

Challenges to Responsible Research (2)

Representativity of Web as Corpus
    Much ill-formed or fragmentary language
    Domain only a rough clue to provenance

Numbers vs. Statistics
    Search engines report number of pages matching a query, not actual citations
    One page may contain alternate usages
    Narrower filters may eliminate some pages
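
The gap between page counts and citation counts can be illustrated with a minimal sketch. The sample pages and the function name count_pages_and_citations are invented for illustration; they do not come from the presentation or from any search engine.

```python
import re

def count_pages_and_citations(pages, variants):
    """For each variant, return (pages containing it, total occurrences)."""
    result = {}
    for variant in variants:
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        hits = [len(pattern.findall(text)) for text in pages]
        pages_matching = sum(1 for h in hits if h > 0)
        result[variant] = (pages_matching, sum(hits))
    return result

# Invented sample pages; two of them mix both spellings.
pages = [
    "We went online. The on-line catalog is online too.",
    "An on-line form; online help is available online.",
    "No relevant usage here.",
]
counts = count_pages_and_citations(pages, ["online", "on-line"])
for variant, (n_pages, n_citations) in counts.items():
    print(f"{variant}: {n_pages} pages, {n_citations} citations")
```

Both spellings occur on the same number of pages in this sample, yet their citation counts differ, which is the kind of difference a page count alone would hide.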

Page 11: Concordancing the Web with KWiCFinder

Webidence as Evidence

Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.

Page 12: Concordancing the Web with KWiCFinder

The Web is not a corpus in the classical sense…

…but it does offer an inexhaustible body of linguistic and cultural information for research and use.

Page 13: Concordancing the Web with KWiCFinder

Why KWiCFinder?

Automate process of search and retrieval
Expedite evaluation of webpages
Provide specific enhancements for foreign language users and linguists
Encourage students and colleagues to take full advantage of online resources

Page 14: Concordancing the Web with KWiCFinder

Why AltaVista?

All words are indexed, including "stopwords"
Distinguishes case and "special characters"
Supports Boolean operators, bracketing, and wildcards
True world-wide coverage, with search by language
No limits to length or complexity of the query
Literal text search, without "second-guessing"

Page 15: Concordancing the Web with KWiCFinder

KWiCFinder Enhances AltaVista with…

Intuitive input for foreign characters, bracketing, operators, dates
Inclusion / exclusion criteria not included in KWiC report to focus search
Automatic search and retrieval in the background returning KWiC abstracts
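
For readers unfamiliar with the format, a KWiC (keyword in context) abstract can be sketched in a few lines. This is only an illustration of the general technique, not KWiCFinder's actual implementation; the function name kwic and the width parameter are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit: left context, [keyword], right context."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad the contexts so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

sample = "The Web is not a corpus, but the Web offers much."
for line in kwic(sample, "web", width=10):
    print(line)
```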

Page 16: Concordancing the Web with KWiCFinder

KWiCFinder Enhances AltaVista with… (2)

Restricted wildcards ? % (1, 0-1 char) vs. AltaVista * (0-5 chars)
“Sic” option so “plain” or lower-case char does not match “special” or upper-case variants:
    By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
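
The wildcard ranges and the “sic” behavior described on this slide can be approximated with regular expressions. This is a sketch under stated assumptions, not how KWiCFinder or AltaVista work internally: it assumes wildcards stand for word characters only, and it models loose matching for the letter a alone, using the character list given above. The names WILDCARDS, LOOSE and wildcard_to_regex are hypothetical.

```python
import re

# Wildcard ranges from the slide: ? = exactly 1 char, % = 0-1 char, * = 0-5 chars.
# Assumption: a wildcard stands for word characters only.
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

# Loose matching for "a", copied from the slide; other letters are omitted here.
LOOSE = {"a": "[aáâäàãæåAÁÂÄÀÃÆÅ]"}

def wildcard_to_regex(pattern, sic=False):
    """Translate a wildcard pattern into a compiled whole-word regex."""
    parts = []
    for ch in pattern:
        if ch in WILDCARDS:
            parts.append(WILDCARDS[ch])
        elif not sic and ch in LOOSE:
            parts.append(LOOSE[ch])      # plain char also matches its variants
        else:
            parts.append(re.escape(ch))  # "sic": match the character literally
    return re.compile(r"\b" + "".join(parts) + r"\b")

print(bool(wildcard_to_regex("colo%r").fullmatch("colour")))
print(bool(wildcard_to_regex("mas").fullmatch("más")))
print(bool(wildcard_to_regex("mas", sic=True).fullmatch("más")))
```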

Page 17: Concordancing the Web with KWiCFinder

KWiCFinder Enhances AltaVista with… (3)

“Tamecards” -- User inputs pattern, KF generates variants:

on-line matches on-line, on line, online
s[iau]ng matches sing, sang, sung
{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
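
The three tamecard examples can be reproduced with a small expander. The slide does not spell out the exact rules, so this sketch assumes them: square brackets give free alternatives (single characters, or comma-separated items where an empty item is allowed), curly-brace lists are paired position by position, and a hyphen stands for hyphen, space, or nothing. The name expand_tamecard is hypothetical, and this is not KWiCFinder's own code.

```python
import itertools
import re

TOKEN = re.compile(
    r"\{(?P<paired>[^}]*)\}"   # {a,b,c}: i-th items of all braces go together
    r"|\[(?P<free>[^\]]*)\]"   # [iau] or [i,]: free alternatives
    r"|(?P<hyphen>-)"          # hyphen: hyphen, space, or nothing
    r"|(?P<literal>[^{\[-]+)"  # anything else is literal text
)

def expand_tamecard(pattern):
    """Generate the search variants for a tamecard pattern (assumed rules)."""
    tokens = list(TOKEN.finditer(pattern))
    paired = [m.group("paired").split(",") for m in tokens if m.lastgroup == "paired"]
    rows = len(paired[0]) if paired else 1
    if any(len(items) != rows for items in paired):
        raise ValueError("all {...} lists must have the same length")
    variants = []
    for row in range(rows):
        slots = []
        for m in tokens:
            kind, body = m.lastgroup, m.group(m.lastgroup)
            if kind == "paired":
                slots.append([body.split(",")[row]])
            elif kind == "free":
                slots.append(body.split(",") if "," in body else list(body))
            elif kind == "hyphen":
                slots.append(["-", " ", ""])
            else:
                slots.append([body])
        for combo in itertools.product(*slots):
            variant = "".join(combo)
            if variant not in variants:
                variants.append(variant)
    return variants

print(expand_tamecard("on-line"))
print(expand_tamecard("s[iau]ng"))
forms = expand_tamecard("{me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an}")
print(forms)
```

Under these assumed rules the Spanish pattern also generates candidates such as "me desperto" alongside the six forms listed on the slide, while never pairing a pronoun with the wrong ending (no "me despertamos").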

Page 18: Concordancing the Web with KWiCFinder

How Does XML Enhance KWiCFinder?

Search results become a dynamic database for end user to manipulate:

categorize, annotate, delete, merge / split searches, citations and documents

Free tools permit developer or end-user to restyle and add interactivity to reports

    Layouts
    Languages
    Data format
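
To make the "dynamic database" idea concrete, here is a minimal sketch of annotating and deleting entries in an XML report with Python's standard library. The element and attribute names (search, document, citation, category) and the example URLs are invented for illustration; the presentation does not describe KWiCFinder's actual XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical report structure, not KWiCFinder's real schema.
report = ET.fromstring("""
<search query="on-line">
  <document url="http://example.org/a">
    <citation>shop on-line today</citation>
    <citation>our online catalog</citation>
  </document>
  <document url="http://example.org/b">
    <citation>an on line form</citation>
  </document>
</search>
""")

# Annotate / categorize: label each citation with the spelling it contains.
for citation in report.iter("citation"):
    text = citation.text
    if "on-line" in text:
        citation.set("category", "hyphenated")
    elif "on line" in text:
        citation.set("category", "spaced")
    else:
        citation.set("category", "solid")

# Delete: drop documents that contain no hyphenated citation.
for document in list(report.findall("document")):
    if not document.findall("citation[@category='hyphenated']"):
        report.remove(document)

remaining = [d.get("url") for d in report.findall("document")]
print(remaining)
```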

Page 19: Concordancing the Web with KWiCFinder

Why WebKWiC?

Original hope: cross-platform, cross-browser solution
Minimal entry threshold: small download of HTML pages + JavaScript
Support for non-Western European languages

Page 20: Concordancing the Web with KWiCFinder

Why Google?

Link popularity ranking puts relevant sites at or near top of list
Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently
Largest number of pages analyzed
Matching pages always* available in cache with KWiC markup

Page 21: Concordancing the Web with KWiCFinder

How Does WebKWiC Complement Google?

Focuses and enhances interface for language learners
Provides tools to navigate among citations and documents
Simplifies management of multiple windows

Page 22: Concordancing the Web with KWiCFinder

Future of Web Concordancing

Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites
Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues

Page 23: Concordancing the Web with KWiCFinder

Pleas(e) Visit http://miniappolis.com/

Download and try KWiCFinder and WebKWiC
View bibliography as well as this and related presentations
Use these tools with your students
Send feedback and suggestions to [email protected]