Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Page 1: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Software Industry in India and Keyword Search Over Dynamic Categorized Information

Manish Bhide

Page 2: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

My Background

BE in CSE, 2000, from VRCE (now VNIT)

MTech in CSE, 2002, from IITB

Working with IBM India Research Lab since 2002

Started a part-time PhD at IITB in 2005

Page 3: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Types of Software Companies (type of work)

Services Companies

Product development companies

Research and Development Companies

Many companies do work that falls into all three of the above categories

Page 4: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Services Companies

Services companies work for other companies using the “outsourcing model”. For example, a bank wants to focus on its core business and not worry about the software needed to run it, so the IT part is outsourced to a services company.

Services companies can do the following types of work: support and product development.

Support has three levels:

L1: First level of contact (BPO companies)

L2: If the problem is not solved by L1, it is escalated to L2

L3: If the problem is not solved by L2, it is escalated to L3

Page 5: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Services Companies

Support work can be categorized as business process outsourcing and application support and maintenance (L2, L3). L3 work involves bug fixing.

Support work might not be that great!

Product development in services companies: product companies outsource the development of their products to services companies. Conceptualization and design are done by the product company; development and testing are done by the services company.

Page 6: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Product Development

Involves development of products

Also involves testing of products (not great)

The development part is more exciting than services work

Pays better than services work (L1, L2, L3)

The quality of people hired by product companies is better than that of those hired by services companies

People from CS@IIT do not apply to services companies

Page 7: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Research and Development

R&D is a misused word

Some companies try to put product development into R&D

True meaning: R&D involves the conceptualization of new product ideas or enhancements to existing product ideas. For example, the concept of the relational database originated in IBM Research. The job role can be thought of as originating ideas and developing new products.

People hired typically hold a PhD or master's degree in computer science from the IITs or the best universities worldwide

Hires the best and pays the best amongst all types of software companies

Page 8: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

How it all fits together

Consider the example of a bank. It wants to focus on its core business, banking, so the IT part is outsourced to “services companies”.

The software used by the bank will consist of a database, a web server, etc.

The services company will use these software products to build a “solution” for the bank: the banking software.

Someone needs to build the products, such as the database and the web server, for the services companies to use. This is done by the product companies.

Someone needs to think of the need for a new product. This is where R&D companies play a role.

Page 9: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Take-away for you…

As far as possible, try to find a job in product development companies

Getting into R&D right after BTech is difficult

If interested in doing quality work, try to do an MTech/MS

If still interested in further studies, register for a PhD

Caveat emptor: not everyone will get a job in product development. This is not the right time to try out all of the above.

Page 10: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

How to find a job in good companies

First and foremost: you need to be good academically!

Enhance your coding skills

Try to participate in coding contests such as Google India Code Jam and the International Online Programming Contest (organized by the IITs)

Try to participate in open source software development

Try to do a summer internship in product companies. Try to contact VNIT alumni in these companies to improve your chances.

Realize your potential! From my personal experience, I believe that the top 10% of the folks in CS@VNIT are on par with those in IIT. The rest are better than most of the guys from other engineering colleges in India. The faculty at VNIT is amongst the best in India.

Page 11: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Keyword Search Over Dynamic Categorized Information

Joint work with Venkatesan Chakravarthy, Krithi Ramamritham and Prasan Roy

Page 12: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Motivating Example

The prime ministerial candidate of a political party “PP” wants to assess the reaction of different voter categories to their manifesto

Current approach: issue the keyword query “PP manifesto”. The results consist of a large number of blog posts, from which a consolidated opinion cannot be formed.

Desirable result: the most relevant categories, for example blogs about education issues and blogs about tax rebates

Page 13: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Motivating Example (contd.)

Alternate approach: use traditional search and group the results into categories

Problems: it is difficult to assign labels to the generated clusters, and the generated results are unpredictable

Solution: Categorized search (Faceted search) over pre-defined categories

Page 14: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Problem Statement

The CS* (Categorized Search) system supports top-K keyword search over categories

[Figure: an information repository of data items d1: A(d1), T(d1); d2: A(d2), T(d2); d3: A(d3), T(d3); ... feeds the categories C1, C2, ..., CN, each defined by a predicate pc. A keyword query Q(t1, t2, ..., tl) returns the top-K categories.]

di = blog post

A(di) = attributes in the user profile

T(di) = text of the blog

pc = text classifier, for example “blog posts about educational issues”

Example: the query “PP Manifesto” returns categories such as “blogs about educational issues” and “blogs about tax rebates”

Page 15: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Scoring Function

We use a standard tf-idf based scoring function to compute the relevance of a category to a keyword query

Term Frequency:

Inverse Document Frequency:

Score:
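The formulas on this slide did not survive in the transcript. Purely as an illustration, the sketch below scores categories with one common tf-idf variant: term frequency is the share of a category's words that match the term, and inverse document frequency is log(N / number of categories containing the term). These definitions, the function names and the toy blog texts are assumptions, not the definitions used by CS*.

```python
import math
from collections import Counter
from typing import Dict, List


def build_category_tf(category_docs: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count term occurrences over all data items assigned to each category."""
    tf: Dict[str, Counter] = {}
    for category, docs in category_docs.items():
        counts: Counter = Counter()
        for text in docs:
            counts.update(text.lower().split())
        tf[category] = counts
    return tf


def idf(term: str, tf: Dict[str, Counter]) -> float:
    """Assumed form: log(N / df), df = number of categories containing the term."""
    df = sum(1 for counts in tf.values() if counts[term] > 0)
    return math.log(len(tf) / df) if df else 0.0


def score(category: str, query: List[str], tf: Dict[str, Counter]) -> float:
    """Assumed form: sum over query terms of normalized tf times idf."""
    total = sum(tf[category].values()) or 1
    return sum((tf[category][t] / total) * idf(t, tf) for t in query)


# Hypothetical toy repository: category name -> texts of its blog posts.
category_docs = {
    "education": ["pp manifesto promises new schools", "schools need funds"],
    "tax": ["pp manifesto tax rebate", "tax rebate for salaried"],
    "sports": ["cricket final tonight"],
}
tf = build_category_tf(category_docs)
query = ["manifesto", "tax"]
ranking = sorted(tf, key=lambda c: score(c, query, tf), reverse=True)
print(ranking)
```

In this toy data the "tax" category ranks first because it matches both query terms and "tax" is rare across categories, which is the behaviour a tf-idf style score is meant to capture.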

Page 16: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Computing Top-K Categories

Scoring Function:

Use stored meta-data to compute Score(c,Q) values

[Figure: data items d1: A(d1), T(d1); d2: A(d2), T(d2); d3: A(d3), T(d3); ...; dN: A(dN), T(dN) in the information repository are assigned to the categories C1, C2, ..., CN by their predicates pc. The meta-data stored for the categories is used to answer the keyword query Q(t1, t2, ..., tl) with the top-K categories.]

Page 17: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Naïve Approach: Update-all Strategy

Refresh all the categories when a new data item is added: evaluate pc of each category with respect to the data item, and update the meta-data for those categories whose pc evaluates to true

pc can be a text classifier or could involve expensive joins, for example “high value customer: transactions of more than 10K in the last 15 days”

If one pc evaluation takes 25 milliseconds, evaluating 1000 categories takes 25 seconds per data item!

While one data item is being processed, more data items could be added. As per a 2006 estimate, 13 blog posts are created per second.

Meta-data will become stale, affecting the quality of results

Need for an intelligent selective update strategy!
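The cost argument above can be made concrete with a small sketch. The `update_all` function below evaluates every category predicate on each new item, and `backlog` reproduces the slide's arithmetic with integer milliseconds. The function names, the keyword-based predicates and the meta-data layout (a term-count dictionary per category) are hypothetical stand-ins, not the CS* implementation.

```python
from typing import Callable, Dict


def update_all(item: str,
               predicates: Dict[str, Callable[[str], bool]],
               metadata: Dict[str, Dict[str, int]]) -> int:
    """Update-all strategy: evaluate pc of every category on the new item.

    Returns the number of predicate evaluations performed.
    """
    evaluations = 0
    for category, pc in predicates.items():
        evaluations += 1  # every category pays one pc evaluation
        if pc(item):
            counts = metadata.setdefault(category, {})
            for term in item.lower().split():
                counts[term] = counts.get(term, 0) + 1
    return evaluations


def backlog(elapsed_ms: int, arrivals_per_sec: int,
            n_categories: int, pc_cost_ms: int) -> int:
    """Items that arrived but are still unprocessed after elapsed_ms."""
    per_item_ms = n_categories * pc_cost_ms
    arrived = arrivals_per_sec * elapsed_ms // 1000
    processed = elapsed_ms // per_item_ms
    return arrived - processed


# Hypothetical predicates; a real pc could be a text classifier or a join.
predicates = {
    "education": lambda text: "school" in text.lower(),
    "tax": lambda text: "tax" in text.lower(),
}
metadata: Dict[str, Dict[str, int]] = {}
n = update_all("PP manifesto promises tax rebate", predicates, metadata)
print(n, metadata)

# Slide numbers: 1000 categories, 25 ms per pc, 13 posts per second.
print(backlog(25_000, 13, 1000, 25))
```

With the slide's numbers, 325 posts arrive in the 25 seconds needed to process a single one, so the backlog grows without bound under update-all.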

Page 18: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

CS* Approach: Selective update of categories with selective data

Identify a sub-set of categories (of size ImpCat) that are deemed important

Identify a sub-set of data-items (of size ImpData) that can provide maximum impact in terms of update to meta-data

Refresh important categories using the sub-set of data-items

CS* consists of two components: the meta-data refresher and the query answering module

Page 19: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Overview

Motivation, Problem Statement, Naïve Strategy

Statistics used by CS*

Meta-Data Refresher

Query Answering Module

Experimental Evaluation

Conclusions

Page 20: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Statistics Maintained by CS*

Contiguous refreshing: CS* refreshes a category in a contiguous manner. When the statistics of a category are refreshed using data item di, they are also refreshed using all the data items added before di.

Last refresh time rt(c): the largest time-step up to which the statistics of c have been refreshed

[Figure: timeline of data items d1 to d9 added at time-steps s1 to s9. Category Ci (with predicate pc) has been refreshed up to d6, so rt(Ci) = s6 and $tf_{s_6}(C_i, t)$ is available.]
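Contiguous refreshing and the last refresh time rt(c) can be sketched as a small class. This is a minimal illustration, assuming time-steps are numbered 1, 2, ... in arrival order and that the statistics are plain term counts; the class and method names are hypothetical.

```python
from collections import Counter
from typing import Callable, List


class CategoryStats:
    """Term-frequency statistics of one category, refreshed contiguously."""

    def __init__(self, predicate: Callable[[str], bool]) -> None:
        self.predicate = predicate
        self.rt = 0               # last refresh time-step; 0 = never refreshed
        self.tf: Counter = Counter()

    def refresh(self, items: List[str], upto: int) -> None:
        """Refresh using every item after rt, up to and including step `upto`.

        items[k - 1] is the data item d_k added at time-step s_k.
        """
        if not 0 <= upto <= len(items):
            raise ValueError("upto must be a valid time-step")
        for step in range(self.rt + 1, upto + 1):
            text = items[step - 1]
            if self.predicate(text):
                self.tf.update(text.lower().split())
        self.rt = max(self.rt, upto)


# Hypothetical stream d1..d9: even-numbered posts mention tax.
items = [f"post {k} about tax" if k % 2 == 0 else f"post {k} about cricket"
         for k in range(1, 10)]
ci = CategoryStats(lambda text: "tax" in text)
ci.refresh(items, 6)            # processes d1..d6, so rt(Ci) = s6
print(ci.rt, ci.tf["tax"])
ci.refresh(items, 4)            # already covered: nothing is re-counted
print(ci.rt, ci.tf["tax"])
```

Because each refresh always starts at rt + 1, a single integer per category is enough to know exactly which data items its statistics reflect.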

Page 21: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Estimating approximate tf

Need to find tf at the current time s*: $tf_{s^*}(c,t)$

Use the principle of locality: find the rate of change of term frequency Δ(c,t), an estimate of the change in term frequency per data item. Δ(c,t) is updated whenever c is refreshed.

[Figure: timeline of data items d1 to d8 and d* added at time-steps s1 to s8 and s* (the current time). Category Ci (with predicate pc) was last refreshed at rt(Ci) = s6, so $tf_{s_6}(C_i, t)$ is available.]

The estimated term frequency $tf^{est}_{s^*}$ is calculated as:
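The formula itself is missing from the transcript. Given that Δ(c,t) is described as the change in term frequency per data item, a natural reading is a linear extrapolation from the last refresh: the estimate at s* equals tf at rt(c) plus Δ(c,t) times (s* − rt(c)). The sketch below assumes exactly that form; the way Δ is recomputed on refresh (average change between the last two refreshes) is also an assumption, and both function names are hypothetical.

```python
def estimate_tf(tf_at_rt: float, delta: float, rt: int, now: int) -> float:
    """Assumed linear estimate: tf at last refresh + delta per item since then."""
    if now < rt:
        raise ValueError("current time-step cannot precede the last refresh")
    return tf_at_rt + delta * (now - rt)


def update_delta(tf_old: float, rt_old: int, tf_new: float, rt_new: int) -> float:
    """Assumed update rule: average tf change per data item between two refreshes."""
    if rt_new <= rt_old:
        raise ValueError("the new refresh must be later than the old one")
    return (tf_new - tf_old) / (rt_new - rt_old)


# Category refreshed at s2 (tf = 9) and again at s6 (tf = 12); current time is s9.
delta = update_delta(tf_old=9, rt_old=2, tf_new=12, rt_new=6)
estimate = estimate_tf(12, delta, rt=6, now=9)
print(delta, estimate)
```

The estimate only needs values stored at refresh time, so it can be evaluated at query time without touching the data items added after rt(c).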

Page 22: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Overview

Motivation, Problem Statement, Naïve Strategy

Statistics used by CS*

Meta-Data Refresher

Query Answering Module

Experimental Evaluation

Conclusions

Page 23: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Determining Important Categories

What categories will be important? Categories that will be useful for answering queries in the future.

What queries are likely to be asked in the future? Need to predict the queries.

What categories will be useful for answering those queries? Look at the history and find the categories used for answering queries in the past.

How to compute the benefit of a set of data items? How many categories can be refreshed using the data items? What is the importance of those categories?

Importance is a measure of the likelihood of the category being used to answer a query in the future

Page 24: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Range Selection Problem

Input: a sequence of categories c1, c2, ..., cN and a width ImpData

Output: a set of data items such that the total number of data items selected is at most ImpData and the total benefit is maximized

We use a dynamic programming algorithm to solve this problem

Details are in the paper
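The slide defers the algorithm to the paper, so the following is only one possible formalization, written to show what a dynamic program for this kind of problem can look like. It assumes that data items are numbered 1..n, that a selected range a..b helps a category c only if its next unprocessed item rt(c) + 1 lies inside the range (contiguous refreshing), and that the benefit is the category's importance times the number of items it catches up on. The benefit model and all names are assumptions, not the paper's definitions.

```python
from typing import List, Optional, Tuple


def interval_benefit(rt: List[int], importance: List[float], a: int, b: int) -> float:
    """Assumed benefit of processing data items a..b (1-indexed, inclusive)."""
    total = 0.0
    for r, imp in zip(rt, importance):
        if a - 1 <= r < b:            # category can continue contiguously from rt + 1
            total += imp * (b - r)    # it catches up on items rt + 1 .. b
    return total


def select_ranges(n_items: int, rt: List[int], importance: List[float],
                  budget: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Choose disjoint ranges of total width <= budget maximizing total benefit."""
    # best[i][w]: max benefit using only items 1..i with at most w items selected
    best = [[0.0] * (budget + 1) for _ in range(n_items + 1)]
    pick: List[List[Optional[int]]] = [[None] * (budget + 1)
                                       for _ in range(n_items + 1)]
    for i in range(1, n_items + 1):
        for w in range(budget + 1):
            best[i][w] = best[i - 1][w]                  # item i is not selected
            for a in range(max(1, i - w + 1), i + 1):    # a range a..i is selected
                width = i - a + 1
                cand = best[a - 1][w - width] + interval_benefit(rt, importance, a, i)
                if cand > best[i][w]:
                    best[i][w] = cand
                    pick[i][w] = a
    # Walk back through the recorded choices to recover the selected ranges.
    ranges: List[Tuple[int, int]] = []
    i, w = n_items, budget
    while i > 0:
        a = pick[i][w]
        if a is None:
            i -= 1
        else:
            ranges.append((a, i))
            w -= i - a + 1
            i = a - 1
    ranges.reverse()
    return best[n_items][budget], ranges


# Hypothetical instance: 9 data items, three categories, ImpData = 3.
total, ranges = select_ranges(n_items=9, rt=[6, 2, 8],
                              importance=[1.0, 5.0, 0.5], budget=3)
print(total, ranges)
```

In this instance the budget goes to the items right after the last refresh of the most important category, even though that category is the most stale one, which is the kind of trade-off between importance and staleness the selective update strategy is meant to resolve.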

Page 25: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Overview

Motivation, Problem Statement, Naïve Strategy

Statistics used by CS*

Meta-Data Refresher

Query Answering Module

Experimental Evaluation

Conclusions

Page 26: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Query Answering Module

Given a keyword query Q = {t1, t2, ..., tl}, use $tf^{est}$ and $idf^{est}$ to find the top-K categories using the scoring function:

Naïve approach: compute the score for all categories containing any one of the keywords in Q, and return the top-K categories

We use the threshold algorithm (TA) to do this efficiently. TA solves the problem of finding the top-ranked objects amongst a set of objects using a scoring function consisting of multiple components. TA requires the input objects to be sorted on each of the components.

In our setup, the score of a category C is a combination of the tf·idf scores for each keyword ti
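A minimal sketch of the threshold algorithm follows, assuming the aggregate score is the sum of non-negative per-keyword scores and that a category missing from a list scores 0 there. It reads the sorted lists in lockstep, computes the full score of each newly seen category by random access, and stops once the K-th best score reaches the threshold (the sum of the scores at the current depth). The list contents and scores are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_topk(lists: List[List[Tuple[str, float]]], k: int) -> List[Tuple[str, float]]:
    """Top-k by summed score; each list is sorted by its own score, descending."""
    lookup: List[Dict[str, float]] = [dict(lst) for lst in lists]  # random access
    seen: Dict[str, float] = {}
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for lst in lists:
            if depth < len(lst):          # an exhausted list contributes 0
                category, s = lst[depth]
                threshold += s
                if category not in seen:
                    seen[category] = sum(d.get(category, 0.0) for d in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen category can score above the threshold, so we may stop early.
        if len(top) >= k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical per-keyword lists of (category, tf x idf score), sorted descending.
lists = [
    [("C3", 0.9), ("C1", 0.8), ("C9", 0.3), ("C2", 0.1)],   # keyword t1
    [("C5", 0.7), ("C2", 0.6), ("C6", 0.5), ("C1", 0.4)],   # keyword t2
    [("C6", 0.9), ("C3", 0.6), ("C1", 0.5), ("C8", 0.2)],   # keyword t3
]
top2 = threshold_topk(lists, k=2)
print([category for category, _ in top2])
```

On this input the loop stops after reading three entries per list, without scanning the remaining entries, which is where the saving over the naïve approach comes from.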

Page 27: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Query Answering Module: Algorithm Overview

Set up l ordered lists, one for each keyword. The list for keyword ti orders the categories by $tf^{est}_{s^*}(*, t_i) \times idf^{est}_{s^*}(t_i)$.

The lists are merged using the TA algorithm to get the top-K categories

[Figure: one sorted list per keyword. Categories sorted by $tf^{est}_{s^*}(*, t_1) \times idf^{est}_{s^*}(t_1)$: C3, C1, C9, C2. Categories sorted by $tf^{est}_{s^*}(*, t_2) \times idf^{est}_{s^*}(t_2)$: C5, C2, C6, C1. Categories sorted by $tf^{est}_{s^*}(*, t_l) \times idf^{est}_{s^*}(t_l)$: C6, C3, C1, C8. TA merges the lists into a list ordered by $Score^{est}_{s^*}(*, Q)$: C4, C6, C1, C8, C7.]

Page 28: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Query Answering Module

Recall the formula for $tf^{est}_{s^*}$:

Maintaining a sorted list as per $tf^{est}_{s^*}$ is not easy: the estimate depends on the time s*, so the ordering changes with time

The problem is solved by using another level of the threshold algorithm
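A tiny example shows why a list sorted by the estimated term frequency cannot simply be stored once. It assumes a linear estimate of the form tf at rt(c) plus Δ(c,t) times (s* − rt(c)), which is an assumption because the slide's formula is not preserved in the transcript; the category statistics are made up.

```python
from typing import Dict, List, Tuple


def tf_est(tf_at_rt: float, delta: float, rt: int, now: int) -> float:
    """Assumed linear estimate of term frequency at time-step `now`."""
    return tf_at_rt + delta * (now - rt)


def ranking(stats: Dict[str, Tuple[float, float, int]], now: int) -> List[str]:
    """Categories ordered by estimated tf at time-step `now`, highest first."""
    return sorted(stats, key=lambda c: tf_est(*stats[c], now=now), reverse=True)


# Hypothetical statistics: category -> (tf at last refresh, delta, rt).
stats = {
    "C1": (20.0, 0.1, 6),   # high count, growing slowly
    "C2": (10.0, 0.5, 6),   # low count, growing quickly
}
print(ranking(stats, now=10))
print(ranking(stats, now=40))
```

C1 leads at time-step 10 but C2 overtakes it by time-step 40, although neither category was refreshed in between, so any order materialized at one time-step can be wrong at a later one.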

Page 29: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Conclusion

First to identify the problem of keyword search over categorized dynamic data

Developed the CS* system consisting of two components:

Query Answering Module: two-level threshold algorithm

Meta-data Refresher: formulated an interval selection problem and proposed a dynamic programming solution

Provides accuracy in excess of 90% using 57% fewer resources than the update-all strategy

Page 30: Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Thank You & Questions!