Using socially authored content to provide new routes through content archives

Using socially authored content to provide new routes through existing content archives
Rob Lee : [email protected] www.monkeyhelper.com
Using commons authored content to add new paths into, through and out of your content archives


Transcript of Using socially authored content to provide new routes through content archives

Page 1: Using socially authored content to provide new routes through content archives

Using socially authored content to provide new routes through existing content archives

Rob Lee : [email protected]

Using commons authored content to add new paths into, through and out of your content archives

Page 2: Using socially authored content to provide new routes through content archives

Socially authored content / user-generated content has been with us for about 8 years now.

Time magazine cover from 2006.

Mainstream, only going to increase - how can we use it with our content?

Page 3: Using socially authored content to provide new routes through content archives

Wikipedia - primary example of community created content, the largest commons based project on the web with over 2 million entries on things from Axolotl to Zebu.

Page 4: Using socially authored content to provide new routes through content archives

Wikipedia supports wilfing - What Was I Looking For ?

Wikipedia has lots of links between things, these things have relationships, not sure what these relationships are OR how strong they are but they exist

Page 5: Using socially authored content to provide new routes through content archives

Page 6: Using socially authored content to provide new routes through content archives

http://ben.pixelmachine.org/articles/2008/04/25/mapping-the-distraction-that-is-wikipedia

“Mapping the distraction that is Wikipedia”

Ben Hughes' post on pixelmachine, “Mapping the distraction that is Wikipedia”: he wrote a Greasemonkey script with a back-end app that tracked his Wikipedia usage and created a map, so he could see how he moved through Wikipedia. It demonstrates the linked web of information.

Page 7: Using socially authored content to provide new routes through content archives

A brief interlude

Page 8: Using socially authored content to provide new routes through content archives

Innovation Labs+

Page 9: Using socially authored content to provide new routes through content archives

Lots of inline hyperlinks.

Supports wilfing - interesting to visualise these in a different way.

Page 10: Using socially authored content to provide new routes through content archives

Page 11: Using socially authored content to provide new routes through content archives

First 20 internal page links from the Wikipedia page for Ruby on Rails; we can see MVC/DHH/MIT/Ruby.

Lots of signal, some noise -> provides context.
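As a rough illustration of pulling the first 20 internal links out of a page, here is a minimal sketch that works on raw wikitext. The function name `first_internal_links`, the namespace filter and the sample text are all assumptions made for this example; the talk does not say how the links were actually extracted.

```python
import re

# Matches [[Target]] and [[Target|label]]; captures only the target title
# and ignores any #section fragment.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Namespaces that are not ordinary article links (an illustrative subset).
NON_ARTICLE_PREFIXES = ("File:", "Image:", "Category:", "Template:", "Help:")


def first_internal_links(wikitext: str, limit: int = 20) -> list[str]:
    """Return up to `limit` distinct article titles linked from `wikitext`, in order."""
    seen: set[str] = set()
    titles: list[str] = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if not title or title.startswith(NON_ARTICLE_PREFIXES) or title in seen:
            continue
        seen.add(title)
        titles.append(title)
        if len(titles) == limit:
            break
    return titles


# Hypothetical wikitext, loosely modelled on an article introduction.
sample = (
    "'''Ruby on Rails''' is an [[open source]] [[web application framework]] "
    "for the [[Ruby (programming language)|Ruby]] language, created by "
    "[[David Heinemeier Hansson]]. It follows [[Model-view-controller|MVC]] "
    "and [[Don't repeat yourself]]. [[Image:Rails.png]] See [[open source]]."
)

print(first_internal_links(sample))
```

Taking links in document order keeps the ones from the article's introduction, which is where most of the signal tends to be.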

Page 12: Using socially authored content to provide new routes through content archives

MIT Licence

MySQL AB

Web App

David Heinemeier Hansson

X-Platform

MVC

37 Signals

Web App Framework

March 14

2007

Ruby Programming Language

Open Source

Don’t Repeat Yourself

Web Design

Web App Framework

Page 13: Using socially authored content to provide new routes through content archives

Organic Farming

Sewage Sludge

Japan

Human Waste

Farmers’ Market

Food Additive

European Union

United States

Ionizing Radiation

Fertilizer

Organic Farm

Pesticide

New York Times

Organic Certification

Antibiotic

Organic Food

Page 14: Using socially authored content to provide new routes through content archives

background ?

Menston

West Riding of Yorkshire

River Wharfe

Metropolitan Borough

Burley-in-Wharfedale

The Chevin

Sailing

West Yorkshire

Pool-in-Wharfedale

England

Angling

1944

Wharfedale

City of Leeds

Birdwatching

Otley

Great - we can generate these relationships for anything that has a representation in Wikipedia - over x million things

Page 15: Using socially authored content to provide new routes through content archives

Circles around: comparison of web app frameworks, interview link, DHH says no to use of Rails logo, Railscasts

It's not just internal links that are useful - there is typically an ‘interesting’ set of external links too.

Some are functional, but others are picked by users and thus likely to be interesting.

How can we start to utilise this structure - how can we relate it to our content ?

Page 16: Using socially authored content to provide new routes through content archives

Page 17: Using socially authored content to provide new routes through content archives

- How do we relate it to our commons content?

We need to relate Wikipedia data to our content. Most archives have poor classification/descriptive data - journalists make poor librarians. How can we improve the classification of content in our archives, and how can we find out what it’s about?

Page 18: Using socially authored content to provide new routes through content archives

Various automated techniques are available, including semantic analysis and term extraction.

We cheated - used YTE.

Now we can say what characterises a story and start to relate it to Wikipedia.
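To make the idea of "what characterises a story" concrete, here is a minimal sketch of term extraction. The talk used YTE for this step; the code below swaps in a plain word-frequency count with a small stopword list, which is an assumption for illustration only and far cruder than a real extraction service. The function name, stopword list and sample story are hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, just enough for the example.
STOPWORDS = frozenset(
    """a an and are as at be been but by for from has have in is it its of on or
    that the their this to was were which will with""".split()
)


def extract_terms(text: str, top_n: int = 5) -> list[str]:
    """Return the `top_n` most frequent non-stopword terms in `text`."""
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical story text.
story = (
    "Police in Nottinghamshire have installed CCTV cameras. "
    "The CCTV cameras record the street, and police review the CCTV footage."
)

print(extract_terms(story, top_n=3))  # ['cctv', 'cameras', 'police']
```

The extracted terms can then be looked up as Wikipedia page titles, which is the bridge from a story to the link structure described earlier.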

Page 19: Using socially authored content to provide new routes through content archives

Page 20: Using socially authored content to provide new routes through content archives

Too much information - we’ve established a set of relationships, but how do we pick out something useful AND relevant/interesting from all these articles? How do we rank information, and what information do we want to extract/use?

We concentrated on external links, as the BBC’s external links at the time were very functional (chosen by journalists without much time?) and not very interesting - they didn’t really support wilfing.

Page 21: Using socially authored content to provide new routes through content archives

Page 22: Using socially authored content to provide new routes through content archives

- How does this page support wilfing?
- How can we make it better?

CCTV Example -> Nottinghamshire police, guardian

Page 23: Using socially authored content to provide new routes through content archives

How do we rank external links? Some links will be quite functional and some will be ‘interesting’ - users can easily determine this in Wikipedia; how can we automate this?

Looked at Google and Technorati buzz - but we need to consider both popularity and context when recommending links.

Page 24: Using socially authored content to provide new routes through content archives

Page 25: Using socially authored content to provide new routes through content archives

Use another third-party service to rank external URLs. We tried Google rank and Technorati buzz, but they didn’t necessarily rank the interesting links highly (e.g. try searching for google.com on Google and see how many hits there are).

del.icio.us is different, though: users provide both context and ranking (via tagging and the act of bookmarking a URL). So for each external link from all the Wikipedia pages, we can see how many of the originally extracted tags appear among all the tags for that URL - high matches = high rank. This tells us how relevant the URL is to the story in question, and the fact that it’s been bookmarked means it’s interesting to someone.
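The tag-matching idea above can be sketched as follows. This is not the MuddyBoots implementation: the scoring formula (share of a URL's tag assignments that match the story's terms), the function name `rank_links`, the in-memory `bookmarks` data and the example.org URLs are all assumptions made for illustration, and no real del.icio.us API call is shown.

```python
def rank_links(
    story_terms: list[str], url_tags: dict[str, dict[str, int]]
) -> list[tuple[str, float]]:
    """Rank URLs by the share of their bookmark tags that match the story's terms.

    `url_tags` maps each URL to {tag: number of bookmarks using that tag}.
    """
    terms = {term.lower() for term in story_terms}
    scored = []
    for url, tags in url_tags.items():
        total = sum(tags.values())
        if total == 0:
            continue  # never bookmarked: no evidence anyone found it interesting
        matched = sum(count for tag, count in tags.items() if tag.lower() in terms)
        scored.append((matched / total, url))
    # Highest score first; break ties by URL so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(url, round(score, 2)) for score, url in scored]


# Hypothetical tag counts for external links gathered from Wikipedia pages.
bookmarks = {
    "http://example.org/cctv-effectiveness-study": {"cctv": 6, "surveillance": 3, "privacy": 1},
    "http://example.org/police-homepage": {"police": 1, "government": 4},
    "http://example.org/unbookmarked": {},
}

for url, score in rank_links(["cctv", "police", "surveillance"], bookmarks):
    print(score, url)
```

Using a ratio rather than a raw match count keeps very popular but generic pages from dominating, which is the weakness noted for popularity-only rankings.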

Page 26: Using socially authored content to provide new routes through content archives

highlight muddyboots internet links section

A story marked up with links generated by our system.

‘MuddyBoots’ - tramping new paths through the BBC News Archive!

This technique can also be used for internal links - you just need to be a big media org with lots of links in del.icio.us and Wikipedia (BBC News has over 83000 in Wikipedia at last check).

Page 27: Using socially authored content to provide new routes through content archives

Page 28: Using socially authored content to provide new routes through content archives
Page 29: Using socially authored content to provide new routes through content archives
Page 30: Using socially authored content to provide new routes through content archives

Recent story - the journalist has added good related links this time. Interestingly, the MuddyBoots system has ‘recommended’ many of the same links independently - nice verification of the method.

Page 31: Using socially authored content to provide new routes through content archives

Page 32: Using socially authored content to provide new routes through content archives

Problems:

- Real-time performance - it currently takes about two minutes to generate links for a story, due to API access and data-set sizes (we have a local mirror of the Wikipedia db, but del.icio.us queries are expensive)
- Story classification can be incorrect
- Doesn’t always work - sometimes the coverage isn’t there in del.icio.us (not such a biggie)
- Disambiguation can be a big problem, causing false positives - the main issue
- Language, geographical relevance

Page 33: Using socially authored content to provide new routes through content archives

Page 34: Using socially authored content to provide new routes through content archives

Problem with language here - most of the links are in Vietnamese.

Other problems also occur, such as links to more local geography.

Page 35: Using socially authored content to provide new routes through content archives

Page 36: Using socially authored content to provide new routes through content archives

Wikipedia has a LOT of data - many different things could be used as measures of interestingness.

Other metrics that could define ‘interesting’ and could be used in our process:

- Wikipedia traffic stats
- Maybe group recommendations by IP
- Locked articles
- Magnolia instead of del.icio.us (for speed - and it includes a rating system)

Page 37: Using socially authored content to provide new routes through content archives

A more semantic approach can help: if we can say this is a person or a company or a place rather than just a piece of text, then we can be more certain about what it is and thus be less ambiguous. It would be good if we had a single point of reference for this ‘thing’.

Page 38: Using socially authored content to provide new routes through content archives

Apple Apple ?

Page 39: Using socially authored content to provide new routes through content archives

How about using something that has semantics built in ?

Page 40: Using socially authored content to provide new routes through content archives

geonames.org

Wikipedia is great for humans, not so great for computers. Fortunately, projects exist that make it easier for computers too: DBpedia and Freebase both expose Wikipedia in a more semantic way. DBpedia has information about more than 2.18 million things, including over 80,000 people.

It's also possible to link to other data sources such as GeoNames or MusicBrainz, so we can find out more information than was previously possible using Wikipedia alone.

Page 41: Using socially authored content to provide new routes through content archives
Page 42: Using socially authored content to provide new routes through content archives

For a given term we can find a) if it’s ambiguous and b) what the term could possibly mean.

It’s possible to use this information to help determine which meaning we are actually referring to in the original text.

Page 43: Using socially authored content to provide new routes through content archives

Page 44: Using socially authored content to provide new routes through content archives

So if our story is about Apple computers then it’s a fair bet it mentions iPhone, iTunes, Wozniak, Steve Jobs - all of which can be found on this page but are unlikely to be found on the page about apple the fruit - so we can say with a much greater degree of confidence that this is the ‘apple’ referenced in our story.
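The disambiguation argument can be sketched as a simple overlap count between the story's terms and the terms found on each candidate page. The function name `disambiguate`, the candidate labels and the term sets below are hypothetical; in practice the candidate terms would come from a source such as DBpedia or a Wikipedia mirror, which is not shown here.

```python
from typing import Optional


def disambiguate(
    story_terms: list[str], candidates: dict[str, set[str]]
) -> tuple[Optional[str], dict[str, int]]:
    """Pick the candidate meaning whose page shares the most terms with the story.

    `candidates` maps a meaning label to the set of terms found on that meaning's page.
    Returns (best label or None if nothing overlaps, overlap count per label).
    """
    terms = {term.lower() for term in story_terms}
    scores = {
        meaning: len(terms & {term.lower() for term in page_terms})
        for meaning, page_terms in candidates.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None, scores  # no evidence either way, so do not guess
    return best, scores


# Hypothetical terms taken from the two candidate pages for "Apple".
apple_candidates = {
    "Apple (company)": {"iPhone", "iTunes", "Wozniak", "Steve Jobs", "Macintosh"},
    "Apple (fruit)": {"orchard", "cider", "tree", "cultivar"},
}

best, scores = disambiguate(["iphone", "itunes", "steve jobs", "orchard"], apple_candidates)
print(best, scores)
```

Returning `None` when nothing overlaps is a design choice for the sketch: a wrong match creates a false positive, which the talk identifies as the main problem, so declining to answer is the safer default.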

Page 45: Using socially authored content to provide new routes through content archives

We can also say that this is a company (or, if it were Steve Jobs for example, a person). How about we use this information to mark up our original content with hCard microformats? Suddenly we’ve created a way for semantic aggregators to know our story is really about Apple the company - we can drive more targeted, better-quality search traffic to our site - new routes into the content.
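As an illustration of the markup step, the sketch below wraps an entity name in minimal hCard markup. In hCard, an element carrying both `fn` and `org` with the same value denotes an organisation, while `fn` alone denotes a person. The helper name `hcard` and its `kind` parameter are assumptions for this example, not part of the system described in the talk.

```python
from html import escape


def hcard(name: str, kind: str) -> str:
    """Wrap `name` in minimal hCard markup for an organisation or a person."""
    if kind == "organization":
        # Same element is both fn and org, so the card describes an organisation.
        inner = f'<span class="fn org">{escape(name)}</span>'
    elif kind == "person":
        inner = f'<span class="fn">{escape(name)}</span>'
    else:
        raise ValueError(f"unsupported kind: {kind!r}")
    return f'<span class="vcard">{inner}</span>'


print(hcard("Apple Inc.", "organization"))
print(hcard("Steve Jobs", "person"))
```

Because the markup is inline `span` elements, it can be dropped around the first mention of the entity without changing how the story renders.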

Page 46: Using socially authored content to provide new routes through content archives

Page 47: Using socially authored content to provide new routes through content archives

We can also create better relationships between our content: if one story is about Apple the company, then we can recommend another that has been semantically marked up the same way - great for large content archives with diverse material, e.g. news.

We can better define what is related.
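One way to picture this is an inverted index from disambiguated entities to stories, so that stories sharing an entity can recommend each other. The sketch below is an assumption about how that could work; the story IDs, entity labels and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_entity_index(stories: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each entity to the set of story IDs that mention it."""
    index: dict[str, set[str]] = defaultdict(set)
    for story_id, entities in stories.items():
        for entity in entities:
            index[entity].add(story_id)
    return index


def related_stories(
    story_id: str, stories: dict[str, set[str]], index: dict[str, set[str]]
) -> list[tuple[str, int]]:
    """Return other stories sharing entities with `story_id`, most shared first."""
    shared: Counter[str] = Counter()
    for entity in stories[story_id]:
        for other in index.get(entity, ()):
            if other != story_id:
                shared[other] += 1
    # Most shared entities first; break ties by story ID for a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical archive: story ID -> disambiguated entities found in the story.
archive = {
    "iphone-launch": {"Apple (company)", "Steve Jobs"},
    "itunes-sales": {"Apple (company)"},
    "cider-harvest": {"Apple (fruit)"},
    "jobs-keynote": {"Steve Jobs", "Apple (company)"},
}

entity_index = build_entity_index(archive)
print(related_stories("iphone-launch", archive, entity_index))
```

Note that the story about apple the fruit is never recommended alongside the company stories, which a plain keyword match on "apple" could not guarantee.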

Page 48: Using socially authored content to provide new routes through content archives

We’ve just created a simple OpenCalais from commons-sourced data.

Page 49: Using socially authored content to provide new routes through content archives

How accurate is Wikipedia, and what are the problems with using commons-sourced data?

A Nature study (2005) found Wikipedia comparable in accuracy to the Encyclopedia Britannica (contested).

Stern magazine compared the German-language version of Wikipedia (50 articles) and found it to be more accurate than the encyclopedia it was compared to.

Page 50: Using socially authored content to provide new routes through content archives

What mechanisms are there to prevent gaming/incorrect data?

Locked articles, featured article candidates, a formal peer review process, and FlaggedRevs coming soon.

Page 51: Using socially authored content to provide new routes through content archives

In summary, there is lots of commons data out there, and we need to look at different ways we can use it. Hopefully I’ve demonstrated a few ways and given you a few ideas for revitalising your content archives.

Page 52: Using socially authored content to provide new routes through content archives

Photos:
http://www.flickr.com/photos/drydens/2320501752/
http://www.flickr.com/photos/laughingsquid/260374487/
http://www.flickr.com/photos/bekathwia/2120050762/
http://www.flickr.com/photos/25094278@N02/2368981952/
http://www.flickr.com/photos/pmtorrone/1570649523
http://www.flickr.com/photos/kightp/1562395268
http://www.flickr.com/photos/valter/604187726/
http://flickr.com/photos/speedye/1489563371/

Rob Lee:[email protected]
