Optimizing Data Architecture for Natural Language Processing
x.ai: a personal assistant who schedules meetings for you
Data Engineering, April 2015, New York City
Visit x.ai to join the waitlist
Optimizing data architecture design for natural language processing
@alexpoon06 @xdotai
What’s x.ai?
Magically Schedule Meetings
[Diagram: Pain vs. Solution, showing Jane and Alex]
CC: Amy @ x.ai
“Amy, please set something up for John and I next week.”
Product Characteristics
● Needs quick responses
● Supervised learning requires a large training data set
● Number of meetings scales linearly with number of users
● 1 user meets with N people
● People share meeting places and companies
Technical challenges
● Natural language understanding with extremely high accuracy
● Natural conversation over email with people
● Complex data relationships
● Optimize for sparse data
● Speed of development and change
Stack
Database (revealed in a couple of slides)
Queue-based architecture
Picking a database
● Familiar technology
● Low initial maintenance
● Flexible schema
● Easy early scaling
● Reasonable production quality
MongoDB: Pros
● Schema-less
● Mongoose (schema control)
● Works out of the box
● Replica set scales reasonably well
● MMS provides good monitoring
MongoDB: Cons
● No joins
● Painful to do backups yourself
● DB-level locking (Mongo v2.6)
● Cross-datacenter is not great
● I don’t want to shard this
Modeling Meetings
{
  host : Participant,
  guests : [Participant],
  time : { start : Date, end : Date, recurring : String },
  timezone : String,
  duration : Number,
  locations : [Location],
  timeInitiated : Date,
  timeRescheduled : [Date],
  timeCompleted : Date,
  status : String,
  …...
}
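To make the shape above concrete, here is a minimal sketch in plain JavaScript that builds a meeting object with those fields. The helper names `newMeeting` and `reschedule`, the status value "initiated", the `recurring` value "none", and minutes as the unit for `duration` are all assumptions for illustration; the slides do not specify them.

```javascript
// Hypothetical helper: builds a plain object shaped like the meeting schema.
// Assumptions: duration is in minutes, status starts as "initiated",
// and a non-recurring meeting is marked with recurring: "none".
function newMeeting({ host, guests, start, durationMinutes, timezone }) {
  const end = new Date(start.getTime() + durationMinutes * 60 * 1000);
  return {
    host,
    guests,
    time: { start, end, recurring: "none" },
    timezone,
    duration: durationMinutes,
    locations: [],
    timeInitiated: new Date(),
    timeRescheduled: [],
    timeCompleted: null,
    status: "initiated",
  };
}

// Returns a copy with the new time; each reschedule is appended to
// timeRescheduled so the history is kept rather than overwritten.
function reschedule(meeting, newStart, when = new Date()) {
  const end = new Date(newStart.getTime() + meeting.duration * 60 * 1000);
  return {
    ...meeting,
    time: { ...meeting.time, start: newStart, end },
    timeRescheduled: [...meeting.timeRescheduled, when],
  };
}
```

Keeping `timeRescheduled` as an array of dates mirrors the `[Date]` field in the schema: a meeting can be moved more than once, and each move is recorded.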
Modeling Meetings
[Diagram: Meetings, People, Places, Companies]
1:N and N:N relationships across various collections
Embedding vs. Referencing
Embedding
{
  host : {
    name : {.....},
    nicknames : [String],
    phones : [{Type: String}],
    primaryEmail : String,
    secondaryEmails : [String],
    title : String,
    signatures : [String],
    …...
  },
  travelTime : String,
  status : String,
  timezone : String,
  duration : Number,
  …...
}
Referencing
{
  host : Participant,
  travelTime : String,
  status : String,
  timezone : String,
  duration : Number,
  …...
}
Participant
{
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String],
  …...
}
Considerations
● Query patterns
● Access to embedded docs
● Number of references to a doc
● Application-level joins
● 1-way or 2-way referencing
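Because the database has no joins, referencing pushes the join into application code. The sketch below shows one way that could look, using in-memory stand-ins for the collections. The ids, sample people, and the `joinParticipants` helper are hypothetical, and a real lookup would be a database query rather than a `Map` read.

```javascript
// In-memory stand-ins for two collections; ids and field values are made up.
const participants = new Map([
  ["p1", { _id: "p1", name: { first: "Jane", last: "Doe" }, primaryEmail: "jane@example.com" }],
  ["p2", { _id: "p2", name: { first: "Alex", last: "Roe" }, primaryEmail: "alex@example.com" }],
]);

// A meeting that references participants by id (1-way referencing).
const meetingsByRef = [
  { _id: "m1", host: "p1", guests: ["p2"], status: "scheduled" },
];

// Application-level join: collect the distinct ids, look each one up once,
// then put the participant documents back in place of the ids.
// Ids that cannot be resolved become null.
function joinParticipants(meeting, lookup) {
  const ids = new Set([meeting.host, ...meeting.guests]);
  const found = new Map();
  for (const id of ids) {
    const doc = lookup(id);
    if (doc !== undefined) found.set(id, doc);
  }
  return {
    ...meeting,
    host: found.get(meeting.host) ?? null,
    guests: meeting.guests.map((id) => found.get(id) ?? null),
  };
}

const joined = joinParticipants(meetingsByRef[0], (id) => participants.get(id));
```

The trade-off is visible here: embedding would return the host in a single read, while referencing costs an extra lookup per distinct participant but keeps one copy of each person.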
Modeling someone’s assistant
1st try: Assistant is a PERSON
{
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String],
  …...
}
2nd try: Assistant is an Attribute of PERSON
{
  name : {.....},
  nicknames : [String],
  phones : [{Type: String}],
  primaryEmail : String,
  secondaryEmails : [String],
  title : String,
  signatures : [String],
  assistant : { name : {.....}, primaryEmail : String },
  …...
}
3rd try: Assistant is a PROFILE, a separate and smaller entity
{
  name : { first : String, last : String },
  primaryEmail : String
}
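As an illustration of the smaller profile shape (name plus primary email), the following sketch reduces a full person document to a profile and embeds it as someone's assistant. The function names `toProfile` and `withAssistant` are hypothetical and not from the talk.

```javascript
// Reduce a full person document to the small profile shape:
// first and last name plus primary email, nothing else.
function toProfile(person) {
  return {
    name: { first: person.name.first, last: person.name.last },
    primaryEmail: person.primaryEmail,
  };
}

// Return a copy of `person` with the assistant embedded as a small profile.
// The original person object is left untouched.
function withAssistant(person, assistantPerson) {
  return { ...person, assistant: toProfile(assistantPerson) };
}
```

A profile carries only what is needed to address the assistant in an email, so it avoids duplicating nicknames, phones, and signatures that a full person document would hold.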
Dealing with schema changes
Issues
● Inconsistent character offsets
● Inconsistent time representation
● Improper sent date (yr 2026)
● Key info not saved
Fixes
● Recalculate character offsets
● Reconstruct time entities
● Recalculate timezone based on context
● Filter out unsalvageable data
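Two of the fixes above, recalculating character offsets and filtering out unsalvageable data, can be sketched as a small cleanup pass. The record shape `{ text, sentDate, entities: [{ value, start, end }] }` and all function names are assumptions for illustration, as is the rule that entities appear in the same order as in the text. This is not the pipeline described in the talk.

```javascript
// Hypothetical record shape: { text, sentDate, entities: [{ value, start, end }] }.
// Recompute each entity's offsets by locating its value in the text.
// Assumes entities are listed in textual order, so the search moves left to right.
// Returns null when an entity cannot be found (treated as unsalvageable here).
function recalcOffsets(record) {
  const entities = [];
  let from = 0;
  for (const e of record.entities) {
    const start = record.text.indexOf(e.value, from);
    if (start === -1) return null;
    const end = start + e.value.length;
    entities.push({ ...e, start, end });
    from = end;
  }
  return { ...record, entities };
}

// A sent date later than the reference time (for example a message stamped
// with the year 2026 while processing in 2015) is treated as improper.
function hasPlausibleSentDate(record, now) {
  return record.sentDate instanceof Date
    && !Number.isNaN(record.sentDate.getTime())
    && record.sentDate.getTime() <= now.getTime();
}

// Drop records with improper dates, repair offsets, drop what cannot be repaired.
function cleanRecords(records, now) {
  return records
    .filter((r) => hasPlausibleSentDate(r, now))
    .map((r) => recalcOffsets(r))
    .filter((r) => r !== null);
}
```

Passing `now` in explicitly keeps the filter deterministic, which makes the cleanup easy to rerun over historical data.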
Feeding data science
ML training architecture
alex @ x.ai
COO and founder
25 Broadway, 9th Floor
New York, NY 10005
E: [email protected]
T: @xdotai