Using socially authored content to provide new routes through content archives


Using socially authored content to provide new routes through existing content archives

Rob Lee: robl@monkeyhelper.com / www.monkeyhelper.com

Using commons authored content to add new paths into, through and out of your content archives

Socially authored content / user-generated content has been with us for about 8 years now. Time magazine cover from 2006. Mainstream, and only going to increase - how can we use it with our content?

Wikipedia - the primary example of community-created content, the largest commons-based project on the web, with over 2 million entries on things from Axolotl to Zebu.

Wikipedia supports wilfing - What Was I Looking For ?

Wikipedia has lots of links between things; these things have relationships. We're not sure what these relationships are or how strong they are, but they exist.


http://ben.pixelmachine.org/articles/2008/04/25/mapping-the-distraction-that-is-wikipedia

“Mapping the distraction that is Wikipedia”

Ben Hughes' post on pixelmachine, "Mapping the distraction that is Wikipedia" - he wrote a Greasemonkey script with a back-end app that tracked his Wikipedia usage and created a map so he could see how he moved through Wikipedia. Demonstrates the linked web of information.

A brief interlude

Innovation Labs+


Lots of inline hyperlinks. Supports wilfing - interesting to visualise these in a different way.


First 20 internal page links from the Wikipedia page for Ruby on Rails - we can see MVC / DHH / MIT / Ruby. Lots of signal, some noise -> provides context.

MIT Licence, MySQL AB, Web App, David Heinemeier Hansson, X-Platform, MVC, 37 Signals, Web App Framework, March 14 2007, Ruby Programming Language, Open Source, Don't Repeat Yourself, Web Design


Internal links shown for Organic Food: Organic Farming, Sewage Sludge, Japan, Human Waste, Farmers' Market, Food Additive, European Union, United States, Ionizing Radiation, Fertilizer, Organic Farm, Pesticide, New York Times, Organic Certification, Antibiotic

background ?

Internal links shown for Menston: West Riding of Yorkshire, River Wharfe, Metropolitan Borough, Burley-in-Wharfedale, The Chevin, Sailing, West Yorkshire, Pool-in-Wharfedale, England, Angling, 1944, Wharfedale, City of Leeds, Birdwatching, Otley

Great - we can generate these relationships for anything that has a representation in Wikipedia - millions of things.
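
As an illustration of generating these relationships, here is a minimal sketch that pulls the internal links for any article from the public MediaWiki API (the system described in this talk used a local mirror of the Wikipedia database instead, and the API returns links alphabetically rather than in page order):

```python
# Minimal sketch: list internal page links for a Wikipedia article via the
# public MediaWiki API. Note the API returns links alphabetically, not in
# the order they appear on the page.
import requests

API = "https://en.wikipedia.org/w/api.php"

def internal_links(title, limit=20):
    """Return up to `limit` internal link titles for a Wikipedia page."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,   # article namespace only
        "pllimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return [link["title"] for link in page.get("links", [])]

print(internal_links("Ruby on Rails"))
```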

Circles around: comparison of web app frameworks, interview link, DHH says no to use of the Rails logo, Railscasts.

It's not just the internal links that are useful - there is typically an 'interesting' set of external links too. Some are functional, but others are picked by users and thus likely to be interesting.
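
The external links editors have added to a page can be pulled the same way - again a sketch using the public MediaWiki API rather than the original local-mirror setup:

```python
# Minimal sketch: list the external links on a Wikipedia article
# (prop=extlinks) - the raw material for the link ranking described later.
import requests

API = "https://en.wikipedia.org/w/api.php"

def external_links(title, limit=50):
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    # older result format keys the URL as "*", newer as "url"
    return [e.get("*") or e.get("url") for e in page.get("extlinks", [])]
```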

How can we start to utilise this structure - how can we relate it to our content ?


How do we relate it to our commons content? We need to relate Wikipedia data to our content. Most archives have poor classification/descriptive data - journalists make poor librarians. How can we improve the classification of content in our archives, and how can we find out what it's about?

Various automated techniques are available, including semantic analysis and term extraction. We cheated - we used YTE. Now we can say what characterises a story and start to relate it to Wikipedia.

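
The YTE service used for this step isn't assumed to be available here, so the sketch below stands in with a crude frequency-based extractor, just to show the shape of the output - a handful of terms that characterise a story:

```python
# Minimal stand-in for the term-extraction step: pick the most frequent
# non-stopword words in a story. Crude, but enough to produce the kind of
# term list a real extraction service would return.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is",
             "are", "was", "were", "it", "that", "this", "with", "as", "by",
             "at", "from", "has", "have", "be", "said", "its", "but", "not"}

def extract_terms(story_text, n=10):
    """Return the n most frequent non-stopword terms in a story."""
    words = re.findall(r"[a-z']+", story_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]
```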

Too much information - we've established a set of relationships, but how do we pick out something useful AND relevant/interesting from all these articles? How do we rank information, and what information do we want to extract/use? We concentrated on external links, as the BBC's external links at the time were very functional (chosen by journalists without much time?) and not very interesting - they didn't really support wilfing.


How does this page support wilfing? How can we make it better?

CCTV example -> Nottinghamshire Police, The Guardian


How do we rank external links? Some links will be quite functional and some will be 'interesting' - users can easily determine this in Wikipedia; how can we automate it?

Looked at Google and Technorati buzz - but we need to consider popularity and context when recommending links.


Use another third-party service to rank external URLs. We tried using Google rank and Technorati buzz, but they didn't necessarily rank the interesting links highly (e.g. try searching for google.com on Google and see how many hits there are). del.icio.us is different, though: users provide both context and ranking, via tagging and the act of bookmarking a URL. So for each external link from all the Wikipedia pages, we can see how many of the originally extracted tags appear among all the tags for that URL - high matches = high rank. This tells us how relevant the URL is to the story in question, and the fact that it's been bookmarked means it's interesting to someone.
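
A minimal sketch of that ranking step - `tags_for_url` is a hypothetical lookup against whatever bookmarking data is to hand (the original used del.icio.us, whose API no longer exists):

```python
# Score each candidate external URL by the overlap between the story's
# extracted terms and the tags users have applied to that URL when
# bookmarking it. High overlap = relevant; any bookmarks at all = someone
# found it interesting.
def rank_external_links(story_terms, candidate_urls, tags_for_url):
    terms = {t.lower() for t in story_terms}
    scored = []
    for url in candidate_urls:
        tags = {t.lower() for t in tags_for_url(url)}
        if not tags:
            continue  # never bookmarked -> no evidence it is interesting
        scored.append((len(terms & tags) / len(tags), url))
    return [url for _, url in sorted(scored, reverse=True)]
```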

Highlight the MuddyBoots internet links section.

A story marked up with links generated by our system. 'MuddyBoots' - tramping new paths through the BBC News archive! This technique can also be used for internal links as well - you just need to be a big media org with lots of links in del.icio.us and Wikipedia (BBC News has over 83,000 in Wikipedia at last check).


Recent story - the journalist has added good related links this time. Interestingly, the MuddyBoots system has 'recommended' many of the same links independently - nice verification of the method.


Problems - real-time performance: it currently takes about two minutes to generate links for a story, due to API access and data-set sizes (we have a local mirror of the Wikipedia DB, but del.icio.us queries are expensive). Story classification can be incorrect. It doesn't always work - sometimes the coverage isn't there in del.icio.us (not such a biggie). Disambiguation can be a big problem, causing false positives - the main issue. Other problems: language, geographical relevance.


Problem with language here - most of the links are in Vietnamese. Other problems also occur, such as links to more local geography.


Wikipedia has a LOT of data - many different things could be used as measures of interestingness. Other metrics that could define 'interesting' and could be used in our process: Wikipedia traffic stats; maybe group recommendations by IP; locked articles; Ma.gnolia instead of del.icio.us (for speed - and it includes a rating system).
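
Of those signals, article traffic is the easiest to illustrate - a sketch using the present-day Wikimedia pageviews REST API as a stand-in for whatever traffic stats were available at the time (endpoint, dates and User-Agent string are illustrative, not part of the original system):

```python
# Minimal sketch: total page views for an English Wikipedia article over a
# date range, via the Wikimedia pageviews REST API (data exists from 2015 on).
import requests

def article_views(title, start="20240101", end="20240131"):
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/"
           f"{title.replace(' ', '_')}/daily/{start}/{end}")
    resp = requests.get(url, headers={"User-Agent": "muddyboots-sketch/0.1"})
    return sum(item["views"] for item in resp.json().get("items", []))
```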

A more semantic approach can help. If we can say this is a person, or a company, or a place, rather than just a piece of text, then we can be more certain about what it is and thus be less ambiguous. It would be good if we had a single point of reference for this 'thing'.

Apple Apple ?


How about using something that has semantics built in ?

geonames.org

Wikipedia is great for humans, not so great for computers. Fortunately projects exist that make it easier for computers too: DBpedia and Freebase both expose Wikipedia in a more semantic way. DBpedia has information about more than 2.18 million things, including over 80,000 people.

It's also possible to link other data sources such as Geonames or MusicBrainz, so we can find out more information than was previously possible using Wikipedia alone.
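
A sketch of asking DBpedia what kind of thing a resource is, assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper package (neither is part of the original system):

```python
# Query DBpedia for the rdf:type statements about a resource, so we can tell
# a company from a fruit (or a person from a place).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT DISTINCT ?type WHERE {
      <http://dbpedia.org/resource/Apple_Inc.> rdf:type ?type .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"])
```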

For a given term we can find out a) whether it's ambiguous and b) what the term could possibly mean. It's possible to use this information to help determine which meaning is actually being referred to in the original text.


So if our story is about Apple computers then it's a fair bet it mentions iPhone, iTunes, Wozniak, Steve Jobs - all of which can be found on this page but are unlikely to be found on the page about apple the fruit. We can say with a much greater degree of confidence that this is the 'Apple' referenced in our story.
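
A sketch of that disambiguation idea - score each candidate page by how many of its own link titles appear in the story text (`links_for_page` could be the internal_links helper sketched earlier; it is a parameter here so the logic stands alone):

```python
# Pick the most plausible Wikipedia page for an ambiguous term: the candidate
# whose internal links co-occur most often with the story text wins.
def best_sense(story_text, candidate_pages, links_for_page):
    story = story_text.lower()
    def score(page):
        return sum(1 for link in links_for_page(page) if link.lower() in story)
    return max(candidate_pages, key=score)

# e.g. best_sense(story, ["Apple Inc.", "Apple"], internal_links)
```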

We can also say that this is a company (or, if it were Steve Jobs for example, a person). How about we use this information to mark up our original content with hCard microformats? Suddenly we've created a way for semantic aggregators to know our story is really about Apple the company - we can drive more targeted, better quality search traffic to our site - new routes into the content.
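
A sketch of that markup step, using the standard hCard class names (vcard, fn, org) - the helper itself is illustrative:

```python
# Wrap a company mention in hCard microformat markup so aggregators can see
# the story is about an organisation, not just a word.
def hcard_for_org(name, url=None):
    link = f' <a class="url" href="{url}">{url}</a>' if url else ""
    return f'<span class="vcard"><span class="fn org">{name}</span>{link}</span>'

print(hcard_for_org("Apple Inc.", "http://www.apple.com"))
```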


We can also create better relationships between our own content: if one story is about Apple the company, then we can recommend another that has been semantically marked up the same way - great for large content archives with diverse material, e.g. news. We can better define what is related.

We've just created a simple OpenCalais from commons-sourced data.

How accurate is Wikipedia, and what are the problems with using commons-sourced data? A Nature study (2005) found Wikipedia comparable in accuracy to the Encyclopaedia Britannica (contested). Stern magazine compared the German-language version of Wikipedia (50 articles) and found it to be more accurate than the encyclopedia it was compared against.

What mechanisms are there to prevent gaming/incorrect data? Locked articles, featured article candidates, a formal peer review process, and FlaggedRevs coming soon.

In summary: there's lots of commons data out there, and we need to look at different ways we can use it. Hopefully I've demonstrated a few of those ways and given you some ideas for re-vitalising your content archives.

Photos:
http://www.flickr.com/photos/drydens/2320501752/
http://www.flickr.com/photos/laughingsquid/260374487/
http://www.flickr.com/photos/bekathwia/2120050762/
http://www.flickr.com/photos/25094278@N02/2368981952/
http://www.flickr.com/photos/pmtorrone/1570649523
http://www.flickr.com/photos/kightp/1562395268
http://www.flickr.com/photos/valter/604187726/
http://flickr.com/photos/speedye/1489563371/

Rob Lee: robl@monkeyhelper.com
