Optimizing Data Architecture for Natural Language Processing
![Page 1: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/1.jpg)
x.ai: a personal assistant who schedules meetings for you
DATA ENGINEERING, APRIL 2015, NEW YORK CITY
VISIT X.AI TO JOIN THE WAITLIST
Optimizing data architecture design for natural language processing
@alexpoon06 @xdotai
![Page 2: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/2.jpg)
What’s x.ai?
Magically Schedule Meetings
![Page 3: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/3.jpg)
Pain: Jane, Alex
Solution: Jane, [email protected], Alex
CC: Amy @ x.ai
“Amy, please set something up for John and I next week.”
![Page 4: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/4.jpg)
Product Characteristics
● Need quick response
● Supervised Learning requires large training data set
● Number of meetings scales linearly with number of users
● 1 user meets with N people
● People share meeting places and companies
![Page 5: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/5.jpg)
Technical challenges
● Natural language understanding with extremely high accuracy
● Natural conversation over email with people
● Complex data relationships
● Optimize for sparse data
● Speed of development and change
![Page 6: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/6.jpg)
Stack
Database (tell you in a couple of slides)
![Page 7: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/7.jpg)
Queue-based architecture
![Page 8: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/8.jpg)
Picking a database
● Familiar technology
● Low initial maintenance
● Flexible schema
● Easy early scaling
● Reasonable production quality
![Page 9: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/9.jpg)
Pros
● Schema-less
● Mongoose (schema control)
● Works out of the box
● Replica set scales reasonably well
● MMS provides good monitoring
Cons
● No joins
● Backups are a pain to do yourself
● DB-level locking (Mongo v2.6)
● Cross-datacenter is not great
● I don’t want to shard this
![Page 10: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/10.jpg)
Modeling Meetings
{
  host : Participant,
  guests : [Participant],
  time : {
    start : Date,
    end : Date,
    recurring : String
  },
  timezone : String,
  duration : Number,
  locations : [Location],
  timeInitiated : Date,
  timeRescheduled : [Date],
  timeCompleted : Date,
  status : String,
  …...
}
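To make the schema above concrete, here is a minimal sketch of one meeting document plus a small shape check. Every value is invented for illustration, `checkMeeting` is a hypothetical helper (not part of Mongoose or of the x.ai codebase), and the unit of `duration` is assumed to be minutes because the slide does not say.

```javascript
// Hypothetical example document; every value below is made up for illustration.
const meeting = {
  host: { name: { first: "Jane", last: "Doe" }, primaryEmail: "jane@example.com" },
  guests: [{ name: { first: "Alex", last: "Roe" }, primaryEmail: "alex@example.com" }],
  time: {
    start: new Date("2015-04-20T14:00:00Z"),
    end: new Date("2015-04-20T14:30:00Z"),
    recurring: "none",
  },
  timezone: "America/New_York",
  duration: 30, // assumed to be minutes; the slide does not state the unit
  locations: [{ name: "Office" }],
  timeInitiated: new Date("2015-04-13T09:00:00Z"),
  timeRescheduled: [],
  timeCompleted: null,
  status: "scheduled",
};

// Minimal shape check (hypothetical helper): returns a list of problems found.
function checkMeeting(m) {
  const problems = [];
  if (!m.host) problems.push("missing host");
  if (!Array.isArray(m.guests)) problems.push("guests must be an array");
  if (!(m.time && m.time.start instanceof Date && m.time.end instanceof Date)) {
    problems.push("time.start and time.end must be Dates");
  } else if (m.time.end <= m.time.start) {
    problems.push("time.end must be after time.start");
  }
  if (typeof m.duration !== "number") problems.push("duration must be a number");
  return problems;
}
```

In a schema-less store, a check like this (or a schema layer such as Mongoose) is what catches malformed documents before they are written.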
![Page 11: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/11.jpg)
Modeling Meetings
Meetings
People
Places Companies
1:N and N:N relationships across various collections
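The relationships named above can be sketched with plain in-memory collections linked by ids. This is only an illustration under assumed names: the ids, field names (`hostId`, `guestIds`, `placeId`, `companyId`) and sample values are hypothetical and do not come from the talk.

```javascript
// In-memory stand-ins for the four collections; all ids and values are hypothetical.
const companies = new Map([
  ["c1", { name: "Acme" }],
  ["c2", { name: "Globex" }],
]);
const people = new Map([
  ["p1", { name: "Jane", companyId: "c1" }],
  ["p2", { name: "Alex", companyId: "c1" }],
  ["p3", { name: "John", companyId: "c2" }],
]);
const places = new Map([["pl1", { name: "Coffee shop" }]]);
const meetings = [
  { id: "m1", hostId: "p1", guestIds: ["p2"], placeId: "pl1" },
  { id: "m2", hostId: "p1", guestIds: ["p3"], placeId: "pl1" },
  { id: "m3", hostId: "p2", guestIds: ["p3"], placeId: "pl1" },
];

// 1:N: one host, many meetings.
function meetingsHostedBy(personId) {
  return meetings.filter((m) => m.hostId === personId).map((m) => m.id);
}

// 1:N: one company, many people.
function employeesOf(companyId) {
  return [...people.values()].filter((p) => p.companyId === companyId).map((p) => p.name);
}

// N:N: people and places are linked to each other through meetings.
function peopleSeenAt(placeId) {
  const ids = new Set();
  for (const m of meetings) {
    if (m.placeId !== placeId) continue;
    ids.add(m.hostId);
    m.guestIds.forEach((g) => ids.add(g));
  }
  return [...ids].sort().map((id) => people.get(id).name);
}
```

Because people share places and companies, the same place or company record is reached from many meetings, which is what makes the embed-or-reference decision matter.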
![Page 12: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/12.jpg)
Embedding vs. Referencing

Embedding
{
  host : {
    name : {.....},
    nicknames : [String],
    phones : [{Type: String}],
    primaryEmail : String,
    secondaryEmails : [String],
    title : String,
    signatures : [String],
    …...
  },
  travelTime : String,
  status : String,
  timezone : String,
  duration : Number,
  …...
}

Referencing
{
  host : Participant,
  travelTime : String,
  status : String,
  timezone : String,
  duration : Number,
  …...
}
Participant {
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String],
  …...
}

Considerations
● Query patterns
● Access to embedded doc
● # references to a doc
● Application level join
● 1-way or 2-way referencing
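The trade-off in the considerations above can be sketched as follows. The `participants` map stands in for a separate collection, and `populateHost` is a hypothetical helper that performs the application-level join by hand; none of these names come from the talk, and a real Mongoose setup would typically resolve references through its own mechanism.

```javascript
// Hypothetical in-memory "collection" standing in for a participants collection.
// The meetingIds array is what makes the reference two-way.
const participants = new Map([
  ["p1", {
    _id: "p1",
    name: { first: "Jane", last: "Doe" },
    primaryEmail: "jane@example.com",
    meetingIds: ["m1"],
  }],
]);

// Embedded: the host document lives inside the meeting, so one read returns everything.
const embeddedMeeting = {
  _id: "m1",
  host: { name: { first: "Jane", last: "Doe" }, primaryEmail: "jane@example.com" },
  status: "scheduled",
};

// Referenced: the meeting stores only the host's id (a one-way reference on its own).
const referencedMeeting = { _id: "m1", host: "p1", status: "scheduled" };

// Application-level join: resolve the reference with a second lookup.
function populateHost(meeting, collection) {
  const host = collection.get(meeting.host);
  if (!host) throw new Error(`unknown participant ${meeting.host}`);
  return { ...meeting, host };
}
```

Embedding saves the second lookup but duplicates the participant wherever it appears; referencing keeps one copy at the cost of a join in application code, since the database offers none.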
![Page 13: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/13.jpg)
Modeling someone’s assistant

1st try: Assistant is a PERSON
{
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String]
  …...
}

2nd try: Assistant is an Attribute of PERSON
{
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String],
  assistant : {
    name : {.....},
    primaryEmail : String
  }
  …...
}

3rd try: Assistant is a PROFILE, a separate and smaller entity
{
  name : {
    first : String,
    last : String
  },
  primaryEmail : String
}
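As a rough illustration of the smaller profile shape (`name.first`, `name.last`, `primaryEmail`), the sketch below reduces a full person document to a profile. `toProfile` and the sample data are hypothetical and are not taken from the talk; the point is only that a profile carries far fewer fields than a person.

```javascript
// Hypothetical helper: reduce a full person document to the small profile shape.
function toProfile(person) {
  return {
    name: { first: person.name.first, last: person.name.last },
    primaryEmail: person.primaryEmail,
  };
}

// Made-up person document with more fields than a profile needs.
const assistantPerson = {
  name: { first: "Pat", last: "Lee" },
  nicknames: ["Patty"],
  phones: [{ Type: "mobile" }],
  primaryEmail: "pat@example.com",
  secondaryEmails: [],
  title: "Executive Assistant",
  signatures: [],
};

const assistantProfile = toProfile(assistantPerson);
```

Keeping the profile small means less data to keep in sync when it is stored next to, or copied into, the person it assists.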
![Page 14: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/14.jpg)
Dealing with schema changes
Issues
● Inconsistent character offsets
● Inconsistent time representation
● Improper sent date (yr 2026)
● Key info not saved
Fixes
● Recalculate character offsets
● Reconstruct time entities
● Recalculate timezone based on context
● Filter out unsalvageable data
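Two of the fixes above can be sketched in a few lines. This is a simplified illustration, not the cleanup code used at x.ai: the function names and the `sentDate` field are assumptions, offsets are taken as start-inclusive and end-exclusive, and only the first occurrence of the entity text is located.

```javascript
// Recalculate character offsets: find where an annotated entity's text
// actually sits in the message body instead of trusting stored offsets.
function recalcOffsets(body, entityText) {
  const start = body.indexOf(entityText);
  if (start === -1) return null; // the text is not in the body: unsalvageable
  return { start, end: start + entityText.length };
}

// Filter out records whose sent date is missing, invalid, or later than `now`
// (for example, a sent date in the year 2026 seen in 2015).
function filterSalvageable(records, now) {
  return records.filter(
    (r) =>
      r.sentDate instanceof Date &&
      !Number.isNaN(r.sentDate.getTime()) &&
      r.sentDate <= now
  );
}
```

A cleanup pass along these lines can be run once over historical data after a schema change, with the records it rejects set aside rather than fed to training.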
![Page 15: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/15.jpg)
Feeding data science
![Page 16: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/16.jpg)
ML training architecture
![Page 17: Optimizing Data Architecture for Natural Language Processing](https://reader036.fdocuments.in/reader036/viewer/2022062710/55b3f628bb61eb74708b4739/html5/thumbnails/17.jpg)
alex @ x.ai
COO and founder
25 Broadway, 9th Floor
New York, NY 10005
E: [email protected]
@xdotai
Visit x.ai to join the waitlist