Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at...

Page 1: Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchange DC

BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS

ERIC PUGH | @dep4b

Page 2:

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

Page 3:

Co-Author

Next Edition May!

Page 4:

Congrats to Trey and Tim! (Tim is here somewhere)

Page 5:

Agilista

Page 6:

Selected Customers

Page 7:

Telling some stories

Page 8:

Telling some war stories

Page 9:

Page 10:

Risks

• Cloud is new at USPTO

• Discovery is a tenuous concept

• Conflicting User Goals

• Fixed Budget: trade scope for budget/quality

Page 11:

• First USPTO application in “the cloud”

• Simple, and discoverable

• Expresses our philosophy of “Cloud meets Ocean”

• Check it out at http://gpsn.uspto.gov

Page 12:

Telling some stories

➡ How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 13:

Flow of understanding

Data → Information → Understanding

Page 14:

Building “Discovery”

UX ↔ Data (Tension)

Page 15:

Building “Discovery”

Engine

UX ↔ Data (Tension)

Page 16:

Grok data at gut level

Look for outliers

User Interviews

Surveys

Card Sorting

Scenarios/Personas

UX

Data

Brainstorm

Mockups

Proof of concept

Page 17:

Where to spend time?

UX: 40%

Engine: 20%

Data: 40%

Page 18:

Where to spend time?

UX: 40%

Engine: 20%

Data: 40%

We spent

UX: 40%

Engine: 40%

Data: 20%

Page 19:

Telling some stories

• How to inject “Discovery” into your app

➡ The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 20:

Boy meets Girl Story

Page 21:

Boy meets Girl Story

Page 22:

Boy meets Girl Story

Page 23:

Boy meets Girl Story

Metadata

Ingest Pipeline

Discovery UX

Content Files

Page 24:

How we built it

EmberJS Single Page Search App

HTML

XML

JSON

Server Dashboard

GPSN UI (Bootstrap CSS)

Browsers

Mobile/Tablet

Third Party Application

Servers

S3 Bucket

Solr

Page 25:

Lessons Learned

Page 26:

Don’t Move Files

• Copying 5 TB of data up to S3 was very painful.

• We used S3Funnel which is “rsync like”

• We bought more network bandwidth for our office
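To put the pain of that upload in numbers, here is a rough back-of-the-envelope sketch. Only the 5 TB figure comes from the slide; the link speeds are assumed values chosen for illustration, and the calculation ignores protocol overhead, retries and throttling.

```java
import java.util.Locale;

/** Back-of-the-envelope transfer time for a bulk upload. */
final class TransferEstimate {

    /**
     * Hours needed to move the given number of terabytes over a link that
     * sustains the given rate, using decimal units (1 TB = 10^12 bytes).
     */
    static double hoursToTransfer(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;                 // bytes -> bits
        double seconds = bits / (megabitsPerSecond * 1e6);  // Mbit/s -> bit/s
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // 5 TB is from the slide; both link speeds are assumptions.
        for (double mbps : new double[] {100, 1000}) {
            System.out.printf(Locale.ROOT, "%.0f Mbit/s: %.1f hours%n",
                    mbps, hoursToTransfer(5, mbps));
        }
    }
}
```

At a sustained 100 Mbit/s the estimate is about 111 hours, or roughly four and a half days of a fully saturated link, which is why buying bandwidth (or shipping disks) starts to look attractive.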

Page 27:

Never underestimate

the bandwidth of a station wagon

full of tapes hurtling down the highway.

–Andrew Tanenbaum, 1981

Page 28:

Data Size

[Chart: Patent Count by year, 1985 to 2011; y-axis from 0 to 1,000,000; labeled value: 277871]

Page 30:

Think about Data Volume

• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)

• 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5 hours!

• We had too many steps in our pipeline
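One way to picture the sharding lesson: if every document is assigned to a shard up front, each shard can be indexed in parallel and independently. The sketch below is a simplified illustration, not how GPSN or Solr actually routes documents; the class name, the patent IDs and the use of String.hashCode are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that spreads document IDs across a fixed number of shards. */
final class ShardRouter {

    /** Returns a stable shard number in [0, numShards) for the given document ID. */
    static int shardFor(String docId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups the IDs into numShards buckets that can be indexed in parallel. */
    static List<List<String>> partition(List<String> docIds, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : docIds) {
            shards.get(shardFor(id, numShards)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Made-up IDs; 8 shards matches the count mentioned on the slide.
        List<String> ids = Arrays.asList("CN100001", "CN100002", "CN100003", "CN100004");
        List<List<String>> shards = partition(ids, 8);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Because the assignment depends only on the ID, a re-run sends each document to the same shard, so a failed shard can be rebuilt on its own without touching the others.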

Page 31:

Building a Patents Index

[Chart: Machine Count against index build time: 1 machine, 5 days; 5 machines, 3 days; 300 machines, 30 minutes]

Page 32:

Key scaling concept behind GPSN:

Cloud meets Ocean

Page 33:

More prosaically…

[Diagram: Database, three Servers, three Clients; cost markers: $ $ $ $]

Page 34:

More prosaically…

[Diagram: Database, three Servers, three Clients; cost markers: $ $ $ $$]

Page 35:

More prosaically…

[Diagram: Database, Server, three Clients; cost markers: $ $ $ $$]

Page 36:

More prosaically…

[Diagram: Database, Server, five Clients; cost markers: $ $ $ $$]

Page 37:

More prosaically…

[Diagram: Database, Server, five Clients; cost markers: $ $ $ $ $$ $]

Page 38:

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

➡ Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 39:

Why so many pipelines?

Morphlines

Page 40:

Tika as a pipeline?

Page 41:

Lots of File Types

• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…

Page 42:

Tika as a pipeline!

• Auto detects content type

• Metadata structure has all the key/value needed for Solr

• Allows us to scale up with Behemoth project (and others!).

Page 43:

Lots of files!

    HHHHHT APS1 ISSUE - 760106
    PATN
    WKU 039302717
    SRC 5
    APN 5328756
    APT 1
    ART 353
    APD 19741216
    TTL Golf glove
    ISD 19760106
    NCL 4
    ECL 1

    <PatentGrant>
      <BibliographicData>
        <GrantIdentification>
          <DocumentKindCode>B1</DocumentKindCode>
          <GrantNumber>06644224</GrantNumber>
          <CountryCode>US</CountryCode>
          <IssueDateText>2003-11-11</IssueDateText>
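The tagged text sample on this slide (WKU, TTL, ISD and so on) maps naturally onto the key/value metadata that a search index wants. The following is a minimal sketch of that idea using only the standard library. It assumes one "TAG value" pair per line and skips lines without a value; the class name is hypothetical and this is not the GPSN parser, which would also need to handle details the slide does not show.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader that turns tagged "TAG value" lines into key/value pairs. */
final class GreenbookFields {

    /** Parses one record; lines without a value (such as PATN) are skipped. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            String trimmed = line.trim();
            int space = trimmed.indexOf(' ');
            if (space < 0) {
                continue; // blank line or a bare section marker
            }
            String tag = trimmed.substring(0, space);
            String value = trimmed.substring(space + 1).trim();
            fields.put(tag, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Values copied from the sample on the slide.
        String record = "PATN\nWKU 039302717\nTTL Golf glove\nISD 19760106";
        parse(record).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

A LinkedHashMap keeps the fields in file order, which makes the output easy to compare against the source record while debugging.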

Page 44:

Detector to pick File

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.regex.Pattern;

    import org.apache.commons.io.IOUtils;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.LookaheadInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class GreenbookDetector implements Detector {

        // Greenbook files carry a PATN section marker near the top.
        private static final Pattern PATTERN = Pattern.compile("PATN");

        @Override
        public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
            if (stream == null) {
                return MediaType.OCTET_STREAM; // nothing to inspect
            }
            // Read at most 1024 bytes; closing the lookahead resets the stream.
            try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
                String extract = IOUtils.toString(lookahead, "UTF-8");
                return PATTERN.matcher(extract).find()
                        ? GreenbookParser.MEDIA_TYPE
                        : MediaType.OCTET_STREAM;
            }
        }
    }
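The detector relies on peeking at the first bytes of a stream and then handing the stream back unread. As a rough illustration of that lookahead idea without any Tika dependency, here is a sketch built on the standard mark/reset calls. The class and method names are hypothetical, and it assumes the caller passes a stream that supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the lookahead step, using only java.io. */
final class MarkerSniffer {

    /** Checks the first `limit` bytes for `marker`, then rewinds the stream. */
    static boolean containsMarker(InputStream stream, String marker, int limit)
            throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(limit);
        try {
            byte[] buffer = new byte[limit];
            int total = 0;
            int read;
            // Keep reading until the buffer is full or the stream ends.
            while (total < limit
                    && (read = stream.read(buffer, total, limit - total)) != -1) {
                total += read;
            }
            String head = new String(buffer, 0, total, StandardCharsets.UTF_8);
            return head.contains(marker);
        } finally {
            stream.reset(); // the next reader starts from the beginning again
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "HHHHHT APS1 ISSUE - 760106\nPATN\nWKU 039302717\n"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(sample);
        System.out.println(containsMarker(in, "PATN", 1024));
    }
}
```

Rewinding in a finally block matters because the parser that runs after detection expects to see the file from its first byte.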

Page 45:

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

➡ Don’t be Afraid to Share!

Page 46:

Your BigData solution isn’t perfect

• Allow users to export data

• Most business users want to work in Excel! Accept it!

• Allow other applications to build on top of it.

Page 47:

GPSN has

• Lots of easy “Print to PDF” options.

• Data stored in S3 as:

• individual patent files

• chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very sharable URLs.

• Underlying Solr service is exposed to public via proxy. You can query Solr yourself.

• Need advanced querying? Use Lucene syntax in search bar.
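For readers who want to try querying a Solr service directly, here is a small sketch of building a select URL with a Lucene-syntax query. The host, core name and field names (title, year) are placeholders, not GPSN's actual endpoint or schema; q, rows and wt are standard Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds a shareable Solr select URL. */
final class SolrQueryUrl {

    /** URL-encodes the Lucene query and appends the usual select parameters. */
    static String selectUrl(String baseUrl, String luceneQuery, int rows) {
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Placeholder endpoint and field names, for illustration only.
        String base = "https://example.org/solr/patents";
        String query = "title:\"golf glove\" AND year:[1976 TO 1980]";
        System.out.println(selectUrl(base, query, 10));
    }
}
```

Because the whole search lives in the URL, the result is also a permalink: pasting it into a browser, a script or a spreadsheet import gives the same result set.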

Page 48:

One more thought...

Page 49:

Measuring the impact of our algorithm changes is just getting harder with Big Data.

Page 50:

www.quepid.com

Quepid: Give your Queries some Love

Project SolrPanl

Page 51:

www.quepid.com

Quepid: Give your Queries some Love

Project SolrPanl

We need beta users!

Page 52:

Thank you!

Questions?

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!