Comparable Corpora BootCat (CCBC)
Adam Kilgarriff, Avinesh PVS
Lexical Computing Ltd
BootCaT
• Bootstrapping Corpora and Terms
• Translators
  – Know the language
  – Not domain experts
  – Can interpret domain terms but can’t guess them
• Instant domain corpus from the web
• Marco Baroni and Silvia Bernardini (2004)
BootCaT method
• Piggyback on a search engine
  – Google, Yahoo, Bing
• Set of seed terms
• Repeat
  – Take 3 random seeds
  – Send to search engine
  – Gather the ‘search hits’ pages
• Remove duplicates, find terms
  – Can iterate
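The query loop above can be sketched in a few lines. This is a minimal illustration, not the BootCaT implementation: `make_queries`, `collect_urls` and `fake_search` are hypothetical names, and `fake_search` stands in for a real search-engine call so that the sketch runs offline. Page download, cleaning and term extraction are left out.

```python
import random
from itertools import combinations


def make_queries(seeds, n_queries, tuple_size=3, rng=None):
    """Draw up to n_queries distinct random seed tuples and format them as queries."""
    rng = rng or random.Random()
    all_tuples = list(combinations(seeds, tuple_size))
    chosen = rng.sample(all_tuples, min(n_queries, len(all_tuples)))
    # Quote multi-word seeds so a search engine would treat them as phrases.
    return [" ".join(f'"{s}"' if " " in s else s for s in tup) for tup in chosen]


def collect_urls(seeds, search, n_queries=10, rng=None):
    """Send each query to `search` and gather the hit URLs, dropping duplicates."""
    seen = set()
    urls = []
    for query in make_queries(seeds, n_queries, rng=rng):
        for url in search(query):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def fake_search(query):
    # Assumption: a stand-in for a search-engine API; returns one fake URL per word.
    return [f"http://example.org/{word}" for word in query.replace('"', "").split()]


seeds = ["magma", "tephra", "volcanic ash", "rhyolitic", "volcanology"]
urls = collect_urls(seeds, fake_search, n_queries=4, rng=random.Random(0))
```

Passing the search function in as an argument keeps the loop independent of any particular engine; iterating would mean extracting new terms from the downloaded pages and feeding them back in as seeds.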
WebBootCaT
• Web interface
• Improved cleaning, duplicate removal
• Integrated with corpus tool (Sketch Engine)
Going multilingual
• Translate the seed terms with Google Translate
  – English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
  – French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
• And do the same thing for French
• By July 2011
  – All steps integrated
  – Propose bilingual terminology
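The seed-translation step can be sketched as follows. This is an illustration under stated assumptions: `parse_seeds`, `format_seeds`, `translate_seeds` and `toy_translate` are hypothetical names, and the small lookup table (built from three pairs in the seed lists above) stands in for a machine-translation service. Terms missing from the table pass through unchanged.

```python
import shlex


def parse_seeds(line):
    """Split a seed line into terms, keeping quoted multi-word terms together."""
    return shlex.split(line)


def format_seeds(seeds):
    """Join terms back into a seed line, re-quoting multi-word terms."""
    return " ".join(f'"{s}"' if " " in s else s for s in seeds)


def translate_seeds(line, translate):
    """Translate each seed term with the supplied `translate` callable."""
    return format_seeds(translate(s) for s in parse_seeds(line))


# Assumption: a toy table replaces the real translation service.
toy_table = {
    "volcanology": "volcanologie",
    "volcanic eruption": "éruption volcanique",
    "seismographs": "sismographes",
}


def toy_translate(term):
    return toy_table.get(term, term)


english = 'volcanology "volcanic eruption" seismographs tephra magma'
french = translate_seeds(english, toy_translate)
# french == 'volcanologie "éruption volcanique" sismographes tephra magma'
```

The translated line can then serve as the seed set for a second, French run of the same method, giving a corpus comparable to the English one.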