Overview: Requirements for implementing the AARDVARC vision

Gary Simons, SIL International
AARDVARC Workshop, 9–11 May 2013, Ypsilanti, MI

The context

• A cross-cutting, NSF-wide initiative called
  - Cyberinfrastructure Framework for 21st Century Science and Engineering (CIF21)
• Vision statement
  - “CIF21 will provide a comprehensive, integrated, sustainable, and secure cyberinfrastructure to accelerate research and education and new functional capabilities in computational and data-intensive science and engineering, thereby transforming our ability to effectively address and solve the many complex problems facing science and society.”

The funding program

• The AARDVARC grant was awarded by NSF’s program on Building Community and Capacity for Data-Intensive Research in the Social, Behavioral, and Economic Sciences and in Education and Human Resources (BCC-SBE/EHR)
  - We “seek to enable research communities to develop visions, teams, and prototype capabilities dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques to advance fundamental research for the SBE and EHR areas of research.”

A three-stage program

1. Funded projects focus on bringing together cross-disciplinary communities to work on the design of cyberinfrastructure for data-intensive research. [2012 and 2013]
2. A selection (perhaps one-fourth) of these communities will be funded to develop prototypes of the facilities designed in Stage 1. [Beginning 2014, funding permitting]
3. An even smaller number of projects will be funded to develop the actual facility.

Roadmap for current project

• The competition will be fierce across a wide range of disciplines.
• In order to succeed in the second stage of the program, we must write a top-25% proposal.
• Can we put ourselves in the shoes of potential reviewers and anticipate what the likely critiques of an AARDVARC implementation proposal might be?
• If so, that could help us set an agenda for the problems we should be working on during the course of the current project.

Fast forward to implementation

• The current AARDVARC proposal is not an implementation proposal
  - However, reading it through that lens sheds light on what would need to be addressed if it were
• Reading the proposal in this way,
  - I have imagined four show-stopping reviewer critiques that we want to be sure to avoid
  - This presentation discusses the requirements for an implementation proposal that would avoid these critiques

Critiques we want to avoid

1. The focus seems too narrow to be truly transformative.
2. The issues of sustainability are not adequately addressed.
3. It is not clear that automatic transcription of under-resourced languages is even possible.
4. There is not an adequate story about how the community will work on a large scale to fill the repository.

1. Find the right framing

• Vision of CIF21: “transform our ability to effectively address and solve the many complex problems facing science and society”
• Potential critique
  - The AARDVARC focus seems too narrow to be truly transformative.
• Requirement
  - A successful proposal will need to frame the proposed cyberinfrastructure in terms that non-linguists will embrace as truly transformative.

Problem

• The name AARDVARC frames the problem in terms of a repository for automatically annotated video and audio resources
  - Among non-linguists, is a framing in terms of automatic annotation likely to rise to the top 25% of cross-cutting problems?
  - Probably not, since solving the transcription bottleneck puts the focus on a means to the end, rather than the end itself
• The true end is having a repository of data from every language

A more compelling framing

• The AARDVARC name fails to name the main thing: language
  - The most fundamental problem for data-intensive research in the 21st century is that we lack a repository of interoperable data from every human language
• Among non-linguists, would a framing like that rise to the top 25% of cross-cutting problems?
  - This seems much more likely
  - And others have already laid some groundwork

Human Language Project

• Building by analogy to the Human Genome Project, Abney and Bird have proposed a Human Language Project to the computational linguistics community:
  - “We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics.” (Abney and Bird 2010)
• In two conference papers, they have argued the motivation for the project and specified basic formats for data

Language Commons

• Building on “the commons” tradition, Bice, Bird, and Welcher have spearheaded the Language Commons
  - “The Language Commons is an international consortium that is creating a large collection of written and spoken language material, made available under open licenses. The content includes text and speech corpora, along with translations, lexicons and other linguistic resources that support large-scale investigation of the world's languages.”
• Currently an open collection in the Internet Archive
  - Browse: http://archive.org/details/LanguageCommons
  - Submit: http://upload.languagecommons.org/

We need to join forces

• AARDVARC, the Human Language Project, and the Language Commons are variations on the same fundamental vision
  - A repository of interoperable data from every human language
• Facing fierce competition with other disciplines
  - We are too small to have competing visions; we need a single vision that others will find compelling
  - For an implementation proposal, we should all join forces to create a grand vision of cyberinfrastructure for language-related research in the 21st century that will embrace every language

References

• The Human Language Project: Building a universal corpus of the world’s languages. Steven Abney and Steven Bird. 2010. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 88-97, Uppsala, Sweden.
• Towards a data model for the Universal Corpus. Steven Abney and Steven Bird. 2011. Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 120-127, Portland, USA.
• The Language Commons Wiki. Ed Bice and others. 2010. Presentation at Wikimania 2010, Gdańsk, Poland.
• The Rosetta Project and The Language Commons. Laura Welcher. 2011. Presentation posted on The Long Now Foundation blog.

2. Ensure sustainability

• Vision of CIF21:
  - “provide a … sustainable … cyberinfrastructure”
• Potential critique
  - The issues of sustainability are not adequately addressed.
• Requirement
  - A successful proposal will need to give a convincing plan for the sustainability of the infrastructure and the resources it houses.

A repository is not enough

• Simply building a repository does not ensure sustainability
  - It must also function as an archive that guarantees access far into the future
• A huge NSF investment in the repository we envision would go to waste if it could not
  - Continue operating after the grant money ran out
  - Survive the inevitable upgrades to hardware and system software at the host institution
  - Recover from a disaster (natural or institutional)

Non-use is also waste

• Even deeper than the sustained functioning of a repository is the sustained use of the resources it houses
• The huge investment would also go to waste if
  - Resources deteriorate or slip to obsolete formats
  - Potential users never discover relevant resources
  - Users are unable to access discovered resources
  - Users cannot make sense of resources they access
  - Accessed resources are not compatible with the computational working environments of users

Conditions of sustainable use

• A complete proposal would address the conditions of sustainable use (Simons & Bird 2008, sec. 3)
  - Extant: preserved through off-site backup, refreshing copies, format migration, fixity metadata
  - Discoverable: adequate descriptive metadata accessed through open and easy-to-use search
  - Available: user has rights to access as well as a means of access
  - Interpretable: markup, encoding, abbreviations, terminology, methodologies are well documented
  - Portable: file formats that are open (not proprietary) and work on all platforms
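The "fixity metadata" named under Extant can be illustrated with a checksum recorded at deposit time and rechecked later. The following is only a minimal sketch: the choice of SHA-256, the function names, and the file name are assumptions made for illustration, not something the slides specify.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_digest):
    """True if the file still matches the digest recorded as fixity metadata."""
    return sha256_of(path) == recorded_digest


with tempfile.TemporaryDirectory() as folder:
    deposit = Path(folder) / "recording.wav"          # hypothetical file name
    deposit.write_bytes(b"stand-in for audio data")
    recorded = sha256_of(deposit)                     # stored when the file is deposited
    print(is_intact(deposit, recorded))               # check on an unchanged copy
    deposit.write_bytes(b"stand-in for audio dat4")   # simulate silent corruption
    print(is_intact(deposit, recorded))               # check on an altered copy
```

An archive would run such a check on a schedule and restore from an off-site copy whenever the digests disagree.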

Checklist for responsible archiving

• A good proposal would measure up against the criteria of the TAPS Checklist (Chang 2010, pp. 136-7)
  - Based on a review of mainstream tools for assessing archival practices, TAPS is a checklist of 16 points to help linguists evaluate whether a prospective home for their data will be a responsible archive
  - Target: Are the mission and audience a good fit?
  - Access: Will your audiences have adequate access?
  - Preservation: Is the archive following best practices for ensuring long-term preservation?
  - Sustainability: Is the institution well situated for the long term?

A repository or an aggregator?

• Or should the infrastructure have an aggregator at the center rather than a single repository?
  - In today’s web economy, being the aggregator (rather than a supplier) is the sweet spot (Simons 2007 paints a vision of such a cyberinfrastructure)
  - This would require community agreement on:
    - Metadata standards (content, format, protocol): OLAC provides a starting point
    - Data standards (contents, formats, protocols): Universal Corpus provides a starting point
  - Still needs a self-service default repository
    - e.g. Language Commons in Internet Archive

References

• Toward a global infrastructure for the sustainability of language resources. Gary Simons and Steven Bird. 2008. Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, 20–22 November 2008, Cebu City, Philippines. Pages 87–100.
• TAPS: Checklist for responsible archiving of digital language resources. Debbie Chang. 2010. MA thesis, Graduate Institute of Applied Linguistics. Dallas, TX.
• Doing linguistics in the 21st century: Interoperation and the quest for the global riches of knowledge. Gary Simons. 2007. Proceedings of the E-MELD/DTS-L Workshop: Toward the Interoperability of Language Resources, 13–15 July 2007, Palo Alto, CA.

3. Focus on achievable automation

• Purpose of BCC-SBE/EHR:
  - “enable research communities to develop … prototype capabilities”
• Potential critique
  - It is not clear that automatic transcription of under-resourced languages is even possible.
• Requirement
  - A successful proposal will need a compelling description of automated helps for annotation that can be implemented today.

The BCC-SBE/EHR vision

• The Building Community and Capacity for Data-Intensive Research program is about activity in the present to support research in the future:
  - Present activities: We “seek to enable research communities to develop visions, teams, and prototype capabilities
  - Present focus: dedicated to creating and utilizing innovative and large-scale data resources and relevant analytic techniques
  - Future result: to advance fundamental research for the SBE and EHR areas of research.”

Setting the right target

• Automated transcription of under-resourced languages is still in the future
  - It is an advance in fundamental research that can be furthered by a data-intensive cyberinfrastructure
• The follow-up proposal in the BCC program is an implementation proposal, not a research proposal
  - It must focus on the automated helps for annotation that we can implement immediately
  - It is not meant to be a request to support research on annotation tasks we cannot currently automate
  - It should implement a framework into which we can plug the latter as that research comes to fruition

Sorting the tasks

• During the AARDVARC project we should
  - Identify annotation tasks that we can automate now
    - Plan work modules for these in the proposed implementation grant
  - Identify annotation tasks that are clearly in the future
    - Pursue research grants on these through the normal research programs
    - Implementation proposal would mention supplying data to future research as within its broader impacts
  - Identify annotation tasks that are borderline
    - Conduct proof-of-concept testing now to determine whether each belongs in the first set or the second set

Breaking the bottleneck

• The repository should embrace all strategies for breaking the transcription bottleneck
  - Focus on the end of data in every language, as opposed to a particular means for getting it
• A promising new strategy is oral annotation
  - Woodbury (2003) proposed this to turn a huge collection of tapes from 15 years of Cup’ik radio broadcasts into usable data
    - Make running oral translations
    - Do careful respeaking of “hard-to-hear tapes”
  - This inspired the development of BOLD:
    - Basic Oral Language Documentation

References

• Defining documentary linguistics. Anthony Woodbury. 2003. In Peter Austin (ed.), Language Documentation and Description 1:35-51. London: SOAS.
• The rise of documentary linguistics and a new kind of corpus. Gary Simons. 2008. Presented at 5th National Natural Language Research Symposium, De La Salle University, Manila, 25 Nov 2008.
• Basic Oral Language Documentation. D. Will Reiman. 2010. Language Documentation and Conservation, Vol. 4, pp. 254-268.
• A scalable method for preserving oral literature from small languages. Steven Bird. 2010. Proceedings of the 12th International Conference on Asia-Pacific Digital Libraries, 5-14, Gold Coast, Australia.
• To BOLDly go where no one has gone before. Brenda Boerger. 2011. Language Documentation and Conservation, Vol. 5, pp. 208-233.

Example of respeaking

• Original recording on first recorder
• Careful respeaking on second recorder
  - Original played back (with pauses) into left channel
  - Respoken on mike into right channel

From fieldwork of Will Reiman on Kasanga [cji] language, Guinea-Bissau

A known best practice in field methods

• Instructions for the Recording of Linguistic Data
  - In Bouquiaux and Thomas (1976), trans. Roberts (1992). Studying and Describing an Unwritten Language. Dallas: Summer Institute of Linguistics.
  - “Go over this spontaneous recording, either with the narrator himself or with a qualified speaker, in order to have it repeated sentence by sentence, in a careful, relatively slow, yet normal manner, and to have it whistled (tone languages).” (p. 180)
  - Goes on to describe method using 2 tape recorders
• This method may be even more essential today as we prepare recordings for automatic transcription

BOLD:PNG

• A project led by Steven Bird; see www.boldpng.info
• Trained university students to use low-cost digital recorders to go back to their home villages to make recordings and to annotate them orally
• Problems:
  - Managing all the files on all the recorders did not scale
  - Two-recorder annotation was too complicated

Working on solutions

• Language Preservation 2.0: Crowdsourcing Oral Language Documentation using Mobile Devices
  - http://lp20.org/
• They have developed an Android app, Aikuma
  - Files shared within community via Internet or local Wi-Fi hub; supports voting for what to release
  - Annotate on a single device with a simple two-button tool
• Blog post containing two demo videos from Bird’s current field trip in the Amazon

4. Foster global collaboration

• Purpose of BCC-SBE/EHR:
  - “enable research communities … to creat[e] new, large-scale, next-generation data resources”
• Potential critique
  - There is not an adequate story about how the community will work on a large scale.
• Requirement
  - A successful proposal will need a compelling account of how a global community of researchers, speakers, and citizen scientists will collaborate to fill the repository with annotated resources.

The real challenge

• Building the repository is one thing, but filling it with resources from most languages will be quite another
  - Funded staff will be able to implement the repository, but it will take thousands of volunteers to really fill it
• Realizing the vision will depend on
  - Mobilizing the research community to participate
  - Mobilizing speaker communities to participate
  - Mobilizing citizen scientists to participate
  - Building an infrastructure that supports collaboration among all these players on a global scale

Resources as open-ended

• Repository must support open-ended annotation
• After initial deposit, other players should be able to
  - Add careful respeaking
  - Add a translation (either oral or written)
  - Add a transcription (of text or of translation)
  - Add a translation of the translation
  - Invoke an automatic transcription or translation
  - Check and revise the automatic output
• Each addition should be a separate deposit (with its own metadata) that links back to what it annotates (i.e., stand-off markup)
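One way to picture stand-off deposits is as records that each carry their own metadata plus a link to the deposit they annotate. This is a sketch under stated assumptions: the `Deposit` class, its field names, and the sample identifiers are hypothetical and do not come from any existing repository schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deposit:
    deposit_id: str                  # hypothetical identifier scheme
    kind: str                        # e.g. "recording", "respeaking", "transcription"
    language: str                    # ISO 639-3 code
    contributor: str
    annotates: Optional[str] = None  # id of the deposit this one annotates; None for an original


def annotation_chain(deposits, deposit_id):
    """Follow the links from a deposit back to the original recording."""
    by_id = {d.deposit_id: d for d in deposits}
    chain = []
    current = by_id[deposit_id]
    while current is not None:
        chain.append(current.deposit_id)
        current = by_id[current.annotates] if current.annotates else None
    return chain


deposits = [
    Deposit("d1", "recording", "cji", "fieldworker"),
    Deposit("d2", "respeaking", "cji", "speaker", annotates="d1"),
    Deposit("d3", "transcription", "cji", "volunteer", annotates="d2"),
    Deposit("d4", "translation", "eng", "volunteer", annotates="d3"),
]
print(annotation_chain(deposits, "d4"))
```

Because each deposit only points backward, a new annotation can be added without touching anything already in the repository.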

Resource workflow

• The types and languages of the complete set of annotations associated with a resource comprise the state of that resource
• The annotation tasks are operators on that state
  - Each annotation task has a prerequisite state
  - Performing the task changes the state of the resource
• This defines an implicit workflow
  - For any resource, there is a set of possible next tasks
  - The infrastructure needs to manage that workflow
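The idea of tasks as operators on a resource's state can be sketched as follows. This is an illustration only: the task inventory and names are hypothetical, and for brevity the state tracks annotation types alone, whereas the slides define it over both types and languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str            # hypothetical task label
    requires: frozenset  # annotation types that must already exist
    produces: str        # annotation type the task adds to the state


# Hypothetical inventory, loosely following the annotation steps named in the slides.
TASKS = [
    Task("respeak", frozenset({"recording"}), "respeaking"),
    Task("translate orally", frozenset({"recording"}), "oral translation"),
    Task("transcribe", frozenset({"respeaking"}), "transcription"),
    Task("translate in writing", frozenset({"transcription"}), "written translation"),
]


def next_tasks(state):
    """Names of tasks whose prerequisites are met and whose output is not yet present."""
    return [t.name for t in TASKS if t.requires <= state and t.produces not in state]


def perform(state, task):
    """Apply a task as an operator on the state and return the new state."""
    if not task.requires <= state:
        raise ValueError(f"prerequisites not met for {task.name!r}")
    return state | {task.produces}


state = frozenset({"recording"})
print(next_tasks(state))           # tasks open on a bare recording
state = perform(state, TASKS[0])   # someone adds a careful respeaking
print(next_tasks(state))           # transcription has now become possible
```

Computing `next_tasks` for every resource gives the pool of work the infrastructure would have to manage.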


Supply and demand

• We need to match up two things:
  - The huge demand for annotation tasks to be done: all of the possible next tasks for all resources
  - The supply of people worldwide who could do them
• Our infrastructure needs to be a marketplace that matches supply with demand
  - E.g., eBay, eHarmony, mTurk.com
• Match a user’s language profile to find next tasks to do
  - E.g., TED’s Open Translation Project using Amara
    - Web tool to segment videos and add subtitles
    - 140 languages, ~10,000 translators, >50,000 translations
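Matching a user's language profile against open tasks could be as simple as a subset test. In this sketch the task records, field names, and resource identifiers are all hypothetical; a real marketplace would also weigh skills, priorities, and incentives.

```python
def matching_tasks(user_languages, open_tasks):
    """Return the open tasks whose required languages are all in the user's profile."""
    known = set(user_languages)
    return [task for task in open_tasks if set(task["languages"]) <= known]


# Hypothetical open tasks; language codes are ISO 639-3.
open_tasks = [
    {"resource": "r1", "task": "transcribe", "languages": ["cji"]},
    {"resource": "r1", "task": "translate", "languages": ["cji", "eng"]},
    {"resource": "r2", "task": "translate", "languages": ["cji", "por"]},
]

# A user who knows Kasanga [cji] and Portuguese [por]
for task in matching_tasks(["cji", "por"], open_tasks):
    print(task["resource"], task["task"])
```

The list of open tasks is the demand side; each user's language profile is one unit of supply.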

If we build it …

• They won’t necessarily come!
• In addition to describing the infrastructure we would implement to match supply and demand, a compelling proposal would also:
  - Describe the plans for organizing the people who participate (including governance)
  - Describe plans for mobilizing the various target communities: researchers, speakers, citizens
  - Describe incentives for participation, especially ones that are built into the design of the infrastructure

Conclusion

• The AARDVARC project gives us the opportunity to build the vision and plans for a sustainable cyberinfrastructure to
  - Collect and provide access to interoperable data resources from every human language
  - Harness automation wherever possible to add the needed transcriptions and translations
  - Create a marketplace that will permit thousands worldwide to collaborate in performing the annotation tasks that cannot be automated
• Thus transforming our ability to address and solve language-related problems facing science and society