Concordancing the Web with KWiCFinder
William H. Fletcher, United States Naval Academy
American Association for Applied Corpus Linguistics
Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001
How Big is the Web?
Now 2-4 billion webpages accessible via public links (Cyveillance estimates & projection, July 2000; Inktomi estimates are more modest)
“Invisible web” / restricted sites several times larger
Estimated 80%-95% of content in English, but…
  Since mid 2000, non-Anglophones outnumber English speakers online
  Anglophones < 30% of 850 million users in 2005 (projection)
  Percentage of new users fluent in English decreasing
  For many regions / languages, still no data available
Search Purposes
General users typically seek…
  a specific site
  any well-stocked site meeting their needs
Scholarly searchers must examine and evaluate a range of sites to identify the most relevant and reliable resources
Educators want to foster similar online research behavior in their students
Typical Search Behaviors
Marked preference for directories with pre-selected links organized by topic over full-text search engines
Simple queries (single word or phrase) predominate (80%-90%)
10%-25% of attempted complex queries (Boolean operators, bracketing) are ill-formed
Users tend to work in a single window, calling up one document at a time, then returning to search engine for another link
Typical Search Outcomes
Users follow up only first few links, then settle on a page after browsing from these
Usual outcome is a match, not best match
Ways to Use the Web for Instruction and Research
Micro level
Discover eloquent examples
Verify current / possible usage, with rough indication of prevalence
Acquire vocabulary not (yet) in dictionaries
Timeliness is essential -- “off-the-shelf corpora” often cannot help here!
Enable students to develop discovery skills (Salzman/Mills “Grammar Safari”)
Ways to Use the Web for Instruction and Research (2)
Macro level
Find authentic texts accessible to students
Locate relevant online resources for research projects
  Student reports
  Scholarly research
Impediments to Finding Relevant Resources Online
Reliance on commercial search engines (SEs) essential due to Web’s size
SEs’ priorities match ours only by coincidence
Link rot
  Pages move or disappear
  Page content changes
Challenges to Responsible Research
Online there is too much ephemeral content of unknown reliability
Preponderance of journalistic, commercial and personal texts of unknown authorship and authority
Details of sources and research methodology haphazard
Even student papers (gasp) and machine translated texts (groan choke)
Challenges to Responsible Research (2)
Representativity of Web as Corpus
  Much ill-formed or fragmentary language
  Domain only a rough clue to provenance
Numbers vs. Statistics
  Search engines report the number of pages matching a query, not the number of actual citations
  One page may contain alternate usages
  Narrower filters may eliminate some pages
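The difference between page counts and citation counts can be illustrated with a minimal sketch. The function name `usage_counts` and the two sample "pages" are invented for illustration only; the point is simply that one page can contain several citations, and can contain both alternate usages at once.

```python
import re

def usage_counts(pages, variants):
    """Count, for each variant, the pages that contain it and its total citations."""
    counts = {v: {"pages": 0, "citations": 0} for v in variants}
    for text in pages:
        for v in variants:
            # Whole-word, case-insensitive occurrences of this variant on this page.
            hits = len(re.findall(r"\b%s\b" % re.escape(v), text, re.IGNORECASE))
            if hits:
                counts[v]["pages"] += 1
                counts[v]["citations"] += hits
    return counts

# Invented toy data, not real web pages.
pages = [
    "He dived in. She dived too, then dove again.",
    "They dove.",
]
result = usage_counts(pages, ["dived", "dove"])
print(result)
```

On this toy data a page count would suggest "dove" is twice as common as "dived" (2 pages vs. 1), while the citation counts are equal (2 each).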
Webidence as Evidence
Our profession needs to develop “Standards of Webidence” to guide selection and documentation of online language for serious research purposes.
The Web is not a corpus in the classical sense…
…but it does offer an inexhaustible body of linguistic and cultural information for research and use.
Why KWiCFinder?
Automate process of search and retrieval
Expedite evaluation of webpages
Provide specific enhancements for foreign language users and linguists
Encourage students and colleagues to take full advantage of online resources
Why AltaVista?
All words are indexed, including "stopwords"
Distinguishes case and "special characters"
Supports Boolean operators, bracketing, and wildcards
True world-wide coverage, with search by language
No limits to length or complexity of the query
Literal text search, without "second-guessing"
KWiCFinder Enhances AltaVista with…
Intuitive input for foreign characters, bracketing, operators, dates
Inclusion / exclusion criteria that focus the search without being included in the KWiC report
Automatic search and retrieval in the background, returning KWiC abstracts
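For readers new to the format, a KWiC (keyword in context) abstract can be sketched in a few lines. This is a generic illustration, not KWiCFinder's own code; the function name `kwic`, the bracket markup and the default context width are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one context line per whole-word, case-insensitive hit of keyword."""
    text = " ".join(text.split())  # collapse line breaks and runs of whitespace
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines

sample = "The Web is not a corpus in the classical sense, but the web is vast."
for line in kwic(sample, "web", width=20):
    print(line)
```

Aligning the keyword in a fixed column is what lets a reader scan many citations at a glance and judge a page without opening it.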
KWiCFinder Enhances AltaVista with… (2)
Restricted wildcards: ? (exactly 1 char) and % (0-1 char) vs. AltaVista * (0-5 chars)
“Sic” option so “plain” or lower-case char does not match “special” or upper-case variants:
  By SE default, a matches any of aáâäàãæåAÁÂÄÀÃÆÅ
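The wildcard and “sic” behaviors on this slide can be approximated as follows. This is a sketch under stated assumptions, not KWiCFinder's or AltaVista's implementation: a "char" is taken to be any word character, upper-case or accented characters in the query are assumed to match only themselves, and the names `wildcard_to_regex`, `fold` and `term_matches` are invented here.

```python
import re
import unicodedata

# Assumption: a wildcard "char" is any word character (\w).
WILDCARDS = {"?": r"\w", "%": r"\w?", "*": r"\w{0,5}"}

def wildcard_to_regex(pattern):
    """Translate ?, % and * into a regular expression (use with fullmatch)."""
    return re.compile("".join(WILDCARDS.get(ch, re.escape(ch)) for ch in pattern))

def fold(ch):
    """Lower-case a character and strip its diacritics."""
    ch = ch.lower().replace("æ", "a")  # assumption: the slide lists æ/Æ under a
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def term_matches(query, word, sic=False):
    """Compare character by character, with or without the "sic" restriction."""
    query = unicodedata.normalize("NFC", query)
    word = unicodedata.normalize("NFC", word)
    if len(query) != len(word):
        return False
    for q, w in zip(query, word):
        plain = q == fold(q)           # lower-case, unaccented query character
        if sic or not plain:
            if q != w:                 # exact match required
                return False
        elif fold(w) != q:             # default: plain char matches all variants
            return False
    return True

print(bool(wildcard_to_regex("colo%r").fullmatch("colour")))
print(term_matches("cafe", "Café"), term_matches("cafe", "Café", sic=True))
```

With `sic=True` the query `cafe` finds only the unaccented lower-case spelling, which matters when the accent or capital distinguishes two words.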
KWiCFinder Enhances AltaVista with… (3)
“Tamecards” -- User inputs pattern, KF generates variants:
  on-line matches on-line, on line, online
  s[iau]ng matches sing, sang, sung
  {me,te,se,nos,os,se} desp[i,]ert{o,as,a,amos,áis,an} matches only reflexive forms me despierto, te despiertas, se despierta, nos despertamos, os despertáis, se despiertan
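The three examples above suggest how variant generation might work. The sketch below is inferred from those examples alone and is not KWiCFinder's actual algorithm. It assumes that a hyphen yields hyphenated, spaced and solid forms, that square brackets list free alternatives (single characters, or comma-separated items where an empty item means "nothing"), and that curly-brace groups are paired position by position. The name `expand_tamecard` is hypothetical.

```python
import itertools
import re

TOKEN = re.compile(
    r"\{(?P<par>[^}]*)\}"        # {a,b,c}: alternatives paired across groups
    r"|\[(?P<free>[^\]]*)\]"     # [iau] or [i,]: free alternatives
    r"|(?P<hyphen>-)"            # hyphen: "-", " " or nothing
    r"|(?P<lit>[^{}\[\]-]+)"     # literal text
)

def expand_tamecard(pattern):
    """Generate the literal search strings implied by a tamecard-like pattern."""
    slots, n_parallel = [], None
    for m in TOKEN.finditer(pattern):
        kind, text = m.lastgroup, m.group(m.lastgroup)
        if kind == "lit":
            slots.append(("free", [text]))
        elif kind == "hyphen":
            slots.append(("free", ["-", " ", ""]))
        elif kind == "free":
            slots.append(("free", text.split(",") if "," in text else list(text)))
        else:
            alts = text.split(",")
            if n_parallel not in (None, len(alts)):
                raise ValueError("brace groups must list the same number of items")
            n_parallel = len(alts)
            slots.append(("par", alts))
    variants = []
    for k in range(n_parallel or 1):
        # Brace groups contribute only their k-th item; other slots vary freely.
        choices = [alts if kind == "free" else [alts[k]] for kind, alts in slots]
        for combo in itertools.product(*choices):
            variant = "".join(combo)
            if variant not in variants:
                variants.append(variant)
    return variants

print(expand_tamecard("on-line"))
print(expand_tamecard("s[iau]ng"))
```

Under these assumptions the Spanish pattern expands to twelve candidate strings, including non-forms such as "me desperto"; since only attested forms occur in real pages, the search still returns only the six reflexive forms listed.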
How Does XML Enhance KWiCFinder?
Search results become a dynamic database for end user to manipulate:
  categorize, annotate, delete, merge / split searches, citations and documents
Free tools permit developer or end-user to restyle and add interactivity to reports
  Layouts
  Languages
  Data format
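A rough idea of what treating a report as a database means, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`search`, `document`, `citation`, `note`, `category`) and the sample report are hypothetical; KWiCFinder's real report format is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical report structure, invented for this sketch.
SAMPLE = """<search query="online">
  <document url="http://example.org/a">
    <citation id="1">shopping on-line is now common</citation>
    <citation id="2">an online course in corpus linguistics</citation>
  </document>
  <document url="http://example.org/b">
    <citation id="3">the on line catalogue</citation>
  </document>
</search>"""

def annotate(report, citation_id, note):
    """Attach a free-text note to one citation."""
    report.find(f".//citation[@id='{citation_id}']").set("note", note)

def categorize(report, category, predicate):
    """Label every citation whose text satisfies predicate."""
    for citation in report.iter("citation"):
        if predicate(citation.text):
            citation.set("category", category)

def delete_document(report, url):
    """Drop a document and all of its citations from the report."""
    for doc in report.findall("document"):
        if doc.get("url") == url:
            report.remove(doc)

report = ET.fromstring(SAMPLE)
annotate(report, "2", "good classroom example")
categorize(report, "hyphenated", lambda text: "on-line" in text)
delete_document(report, "http://example.org/b")
print(ET.tostring(report, encoding="unicode"))
```

Because the edited report is still XML, it can be saved and later restyled with a different layout or interface language without repeating the search.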
Why WebKWiC?
Original hope: cross-platform, cross-browser solution
Minimal entry threshold: small download of HTML pages + JavaScript
Support for non-Western European languages
Why Google?
Link popularity ranking puts relevant sites at or near top of list
Straightforward approach to Advanced Search (“implicit Booleans”) easy to learn, thus most likely to be used by students independently
Largest number of pages analyzed
Matching pages always* available in cache with KWiC markup
How Does WebKWiC Complement Google?
Focuses and enhances interface for language learners
Provides tools to navigate among citations and documents
Simplifies management of multiple windows
Future of Web Concordancing
Agents will create specialized corpora on demand, by “search and crawl” or by monitoring specific sites
Multiplicity of encoding formats (various HTMLs, XML…) and languages will place increasing demands on developers of KWiCFinder and analogues
Pleas(e) Visit http://miniappolis.com/
Download and try KWiCFinder and WebKWiC
View bibliography as well as this and related presentations
Use these tools with your students
Send feedback and suggestions to [email protected]