20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]


  • date post

    23-Aug-2014

description

In all of its many flavors, crowdsourcing works. It works for cultural heritage organizations too. During this presentation we look at various aspects of crowdsourced OCR text correction, commenting, and tagging for digitized historical newspapers at the National Library of Australia’s Trove, the California Digital Newspaper Collection (CDNC), and the Cambridge Public Library in Cambridge, Massachusetts, as well as the astounding number of historical birth, death, marriage, census, and other records transcribed by “crowd” volunteers at Family Search. Aspects covered include demographics, experiences, motivation, quality, preferred data, economics, and marketing. You will see that crowdsourcing is not only feasible but also practical and desirable. You will wonder why your own cultural heritage organization hasn't begun its own crowdsourcing project!

Transcript of 20140628 crowdsourcing, family history, and long tails for libraries [ala annual las vegas]

  • Crowdsourcing, Family History, and Long Tails for Libraries. http://slidesha.re/1qzB8vv Frederick Zarndt, frederick@frederickzarndt.com, Secretary, IFLA Newspapers Section. Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.
  • Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group. Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)
  • "crowdsourcing" was coined by Jeff Howe in "The rise of crowdsourcing", published in Wired magazine, June 2006.
  • web trends for crowdsourcing Jan-2006 to Jun-2014
  • On the date of publication of Jeff Howe's Wired magazine article, 1-Jun-2006, Wikipedia did not have an entry (list) of crowdsourcing projects*. On 25-Jan-2010 Wikipedia's list of crowdsourcing projects had 35 entries*. On 17-Mar-2013 Wikipedia's list of crowdsourcing projects had 158 entries+. * From Internet Archive's Wayback Machine. + Wikipedia contributors, "List of crowdsourcing projects," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects (accessed March 17, 2013).
  • Amazon Mechanical Turk was launched Nov 2005 Alexa global / country rank of Amazon Mechanical Turk (June 2014): 6,465 / 2,046 crowdsourcing
  • crowdsourcing Each day 200,000,000 recaptchas are solved by humans around the world
  • Galaxy Zoo was 1st launched July 2007 Alexa global / country traffic rank of Galaxy Zoo (June 2014): 606,971 / 100,298 citizen science
  • Kickstarter was 1st launched in 2009 Alexa global / country traffic rank of Kickstarter (June 2014): 782 / 326 60,000+ projects successfully funded with more than USD $1,000,000,000 crowd funding
  • crowd collaboration
  • Family Search Indexing was 1st launched (beta) 2004 Alexa global / country traffic rank of FamilySearch (June 2014): 4,385 / 1,321
  • Project Gutenberg was 1st launched Dec 1971 Alexa global / country traffic rank of Project Gutenberg (June 2014): 6,615 / 4,066
  • Alexa global / country traffic rank of National Library of Finland 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)
  • so what? why should a library care about crowdsourcing? Time Life Pictures Getty Images
  • user engagement refers to the quality of the user experience that emphasizes the positive aspects of the interaction with a web application, and in particular the phenomena associated with wanting to use that web application longer and frequently. Elad Yom-Tov, Mounia Lalmas, Georges Dupret, Ricardo Baeza-Yates, Pinar Donmez, and Janette Lehmann. 2012. The effect of links on networked user engagement. In Proceedings of the 21st international conference companion on World Wide Web (WWW '12 Companion). ACM, New York, NY, USA, 641-642. DOI=10.1145/2187980.2188167 http://doi.acm.org/10.1145/2187980.2188167
  • in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections. Paraphrased from Trevor Owens's blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).
  • While [the National Library of Australia's] Trove offers a range of user engagement features, and use of each of these features continues to grow, it is Trove's newspaper text correction features that have attracted the highest level of user engagement. Marie-Louise Ayres. 2013. Singing for their supper: Trove, Australian newspapers, and the crowd. Paper presented at IFLA WLIC 2013, Singapore. Accessed June 2014, IFLA Library http://library.ifla.org/id/eprint/245.
  • Alexa global / country rank of National Library of Australia (June 2014): 10,964 / 249 Trove gets ~78% of all National Library web traffic.
  • National Library of Australia: online since 2008. More than 13,000,000 newspaper pages / 127,437,967 articles (May 2014). Top text corrector: 2,625,205+ lines (May 2014). 2,682,119 lines corrected each month (average for first 5 months of 2014). 129,046,297 lines corrected as of May 2014, up from 66,527,535 lines corrected as of May 2012. 129,300 registered / 8,218 active users (May 2014).
  • [Chart, axis from 1 to 1,000,000: unique visits and page views, 2013 monthly averages, for Australian Newspapers; Books; Pictures and photos; Journal Articles; Music, sound and video; Maps; Archived websites; Diaries, letters, archives; People and organisations]
  • [Chart, axis from 0 to 6,000,000: unique visits and page views, 2013 monthly averages, for the same nine categories]
  • California Digital Newspaper Collection: CDNC began digitizing newspapers in 2005 as part of NDNP. Newspapers digitized to article level as well as to page level as required by NDNP. Hosted on Veridian beginning 2009. Collection size: 61,412 issues, 545,955 pages, 6,364,529 articles (May 2014).
  • OCR text correction: OCR text correction added Aug 2011. Corrections are done line by line. 2,246 registered / 1,266 active users (Jun 2014). 2,656,497+ lines of text corrected (Jun 2014). ~2% of the collection corrected, 98% to go! Top corrector: 717,855 lines, more than 2x the 2nd corrector.
  • Cambridge Public Library Historic Newspaper Collection: Cambridge Historic Newspapers online since Jan 2012. Cambridge, Massachusetts Public Library digitized local newspapers (http://cambridge.dlconsulting.com/). Newspapers digitized to article level. Collection size: 6,346 issues, 59,070 pages, 669,406 articles (May 2014). Collection includes 13,099 obituary cards.
  • [Chart, values shown 0%, 10%, 90%: Historic Cambridge Newspapers (1846-1923), Cambridge City Directories (1848-1910), Cambridge Chronicle (August 2005 to present); 2013 monthly averages]
  • why correct text? here's why. Image copyright Karl R Lilliendahl Photographer
  • Deaths. llnrieff, Esq. of
  • Top 10 users by lines corrected*

    user | lines corrected*
    1 | 646,873
    2 | 236,323
    3 | 111,749
    4 | 100,749
    5 | 99,999
    6 | 87,720
    7 | 82,768
    8 | 63,786
    9 | 57,441
    10 | 56,458

    lines corrected* | user
    2,455,338 | 1
    1,822,422 | 2
    1,448,370 | 3
    1,265,217 | 4
    1,174,835 | 5
    1,069,669 | 6
    1,058,179 | 7
    1,020,462 | 8
    949,694 | 9
    886,315 | 10

    *numbers from Mar 2014
  • Lines corrected by user rank

    User rank | Lines corrected Jun 2014 | Lines corrected Oct 2012
    1 | 717,855 | 242,965
    2 | 271,972 | 87,515
    3 | 120,220 | 31,318
    4 | 113,787 | 24,144
    5 | 109,999 | 23,184
    6 | 99,999 | 19,240
    7 | 94,742 | 18,898
    8 | 65,637 | 16,875
    9 | 63,786 | 11,784
    10 | 59,724 | 9,762
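
The Jun 2014 column above can be put next to the "2,656,497+ lines of text corrected (Jun 2014)" figure from the OCR text correction slide to see how concentrated the work is. The sketch below assumes both refer to the same collection (the top corrector count, 717,855, matches on both slides) and treats the "+" total as exact, so the shares are approximate. The variable and function names are illustrative only.

```python
# Lines corrected per user, Jun 2014 column of the user-rank slide.
TOP10_JUN_2014 = [
    717_855, 271_972, 120_220, 113_787, 109_999,
    99_999, 94_742, 65_637, 63_786, 59_724,
]

# Total from the OCR text correction slide ("2,656,497+"); treated as exact
# here, which is an assumption.
TOTAL_LINES_JUN_2014 = 2_656_497


def share_of_total(lines: int, total: int) -> float:
    """Return lines / total as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return lines / total


top1_share = share_of_total(TOP10_JUN_2014[0], TOTAL_LINES_JUN_2014)
top10_share = share_of_total(sum(TOP10_JUN_2014), TOTAL_LINES_JUN_2014)
top1_vs_top2 = TOP10_JUN_2014[0] / TOP10_JUN_2014[1]

print(f"Top corrector share of all corrected lines: {top1_share:.1%}")
print(f"Top 10 correctors' share of all corrected lines: {top10_share:.1%}")
print(f"Top corrector vs. 2nd corrector: {top1_vs_top2:.2f}x")
```

Under these assumptions, a handful of volunteers account for most of the corrected lines, and the remainder is spread across the other active users.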
  • uncorrected OCR accuracy by newspaper title

    Title | OCR character accuracy | ~OCR word accuracy*
    PRP Pacific Rural Press 1871 - 1922 | 92.6% | 68.1%
    SFC San Francisco Call 1890 - 1913 | 92.6% | 68.1%
    LAH Los Angeles Herald 1873 - 1910 | 88.7% | 54.9%
    LH Livermore Herald 1877 - 1899 | 88.6% | 54.6%
    DAC Daily Alta California 1841 - 1891 | 88.2% | 53.4%
    CFJ California Farmer and Journal of Useful Sciences 1855 - 1880 | 86.5% | 48.4%
    SN Sausalito News 1885 - 1922 | 70.4% | 17.3%

    *Word accuracy assumes average word length is 5 characters
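
The word accuracy figures above are consistent with raising character accuracy to the 5th power: if a word averages 5 characters and character errors are independent, a word is correct only when all 5 characters are. The slide states only the 5-character average, so the independence assumption and the function name `approx_word_accuracy` are illustrative. A minimal sketch:

```python
def approx_word_accuracy(char_accuracy: float, avg_word_length: int = 5) -> float:
    """Estimate word accuracy from character accuracy.

    Assumes every word has avg_word_length characters and that character
    errors are independent, so a word is right only if all characters are.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if avg_word_length < 1:
        raise ValueError("avg_word_length must be at least 1")
    return char_accuracy ** avg_word_length


# Uncorrected OCR character accuracy by title code, from the slide.
CHAR_ACCURACY = {
    "PRP": 0.926,
    "SFC": 0.926,
    "LAH": 0.887,
    "LH": 0.886,
    "DAC": 0.882,
    "CFJ": 0.865,
    "SN": 0.704,
}

for code, accuracy in CHAR_ACCURACY.items():
    estimate = approx_word_accuracy(accuracy)
    print(f"{code}: character {accuracy:.1%} -> word ~{estimate:.1%}")
```

The estimate shows why small gains in character accuracy matter for search: a character error rate of about 7% already leaves roughly one word in three wrong.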
  • corrected OCR accuracy by newspaper title

    Title | OCR character accuracy | Corrected accuracy
    PRP Pacific Rural Press 1871 - 1922 | 92.6% | 99.3%
    SFC San Francisco Call 1890 - 1913 | 92.6% | 99.6%
    LAH Los Angeles Herald 1873 - 1910 | 88.7% | 99.1%
    LH Livermore Herald 1877 - 1899 | 88.6% | 99.9%
    DAC Daily Alta California 1841 - 1891 | 88.2% | 99.9%
    CFJ California Farmer and Journal of Useful Sciences 1855 - 1880 | 86.5% | 99.8%
    SN Sausalito News 1885 - 1922 | 70.4% | 100.0%
  • OCR accuracy before and after correction

    Title | OCR character accuracy | ~OCR word accuracy | Corrected accuracy | ~Corrected word accuracy
    PRP 1871 - 1922 | 92.6% | 68.1% | 99.3% | 96.5%
    SFC 1890 - 1913 | 92.6% | 68.1% | 99.6% | 98.0%
    LAH 1873 - 1910 | 88.7% | 54.9% | 99.1% | 95.6%
    LH 1877 - 1899 | 88.6% | 54.6% | 99.9% | 99.5%
    DA