[Mas 500] Data Basics
MAS.500 - Software Module - Rahul Bhargava
Data Management
2014.11.21
Topics
❖ Regular Expressions (online quickstart)
❖ Databases
  ❖ History
  ❖ Relational modeling
  ❖ SQL (MySQL quickstart)
  ❖ Keys/Indexes
  ❖ NoSQL (CouchDB quickstart)
❖ Behind the Scenes with Ed Platt
❖ Homework
Regular Expressions
Regular Expressions (RegEx/grep)
❖ Match a string of text by defining a pattern
❖ Useful for cleaning up or identifying data
❖ “Find” demo on http://regexpal.com
❖ “Find/Replace” demo with http://www.sugarscript.com/findandreplace/index.php
❖ Interested? Interactive tutorial on http://regexone.com
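The “Find” and “Find/Replace” ideas above can be sketched in Python with the standard `re` module. This is a minimal illustration, not the demo from the session; the sample text and the phone-number pattern are assumptions made up for the example.

```python
import re

# Hypothetical messy input: phone numbers written in inconsistent styles.
text = "Call 617-555-0123 or (617) 555 0199 before 5pm."

# Pattern: optional "(", three digits, optional ")", an optional separator,
# three digits, an optional separator, then four digits.
phone_pattern = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")

# "Find": list every match as a tuple of the three captured groups.
matches = phone_pattern.findall(text)

# "Find/Replace": rewrite every match into one consistent format.
cleaned = phone_pattern.sub(r"\1-\2-\3", text)

print(matches)
print(cleaned)
```

The same pattern does both jobs: identifying data (the list of matches) and cleaning it up (the normalized string).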
Databases
Database History
❖ List-based
  ❖ Follow a link from one record to another (linked list)
❖ File-system data stores
  ❖ Based on file-naming conventions, limited by file I/O speeds
❖ Generic data storage and management
  ❖ Relational modeling of entities and relationships (ER)
Relational Modeling: In English
❖ A Group has many People
❖ A Person belongs to one Group
❖ A Group has many Projects
❖ A Project belongs to one Group
❖ A Person has many Projects
❖ A Project has many People
Relational Modeling: Diagram
[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many)]
Relational Modeling: Tables
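Group: id, name, url
Person: id, name, password, group_id
Project: id, name, url
Membership: person_id, project_id
[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many), linked through Membership]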
Relational Modeling: Keys
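Group: id (key), name, url
Person: id (key), name, password, group_id (foreign key)
Project: id (key), name, url
Membership: person_id (foreign key), project_id (foreign key)
[Diagram: Group (1) to Person (many); Group (1) to Project (many); Person (many) to Project (many), linked through Membership]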
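The tables and keys above can be written out as a schema. The following is a rough sketch using Python's built-in `sqlite3` module with an in-memory database. The column types, the `NOT NULL` constraints, the composite primary key on Membership, and the sample rows are assumptions added for illustration; the column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# "group" is quoted because GROUP is a reserved word in SQL.
conn.executescript("""
CREATE TABLE "group" (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE person (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    password TEXT,
    group_id INTEGER NOT NULL REFERENCES "group"(id)
);
CREATE TABLE project (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT
);
CREATE TABLE membership (
    person_id  INTEGER NOT NULL REFERENCES person(id),
    project_id INTEGER NOT NULL REFERENCES project(id),
    PRIMARY KEY (person_id, project_id)
);
""")

# Made-up sample rows: one group and one person who belongs to it.
conn.execute('INSERT INTO "group" (id, name) VALUES (1, ?)', ("Example Group",))
conn.execute("INSERT INTO person (id, name, group_id) VALUES (1, ?, 1)", ("Ada",))

# A person pointing at a group that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO person (id, name, group_id) VALUES (2, ?, 99)", ("Bob",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(rejected, person_count)
```

The Membership table is what turns the many-to-many link between Person and Project into two one-to-many links, each backed by a foreign key.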
Structured Query Language (SQL)
❖ Works in lots of database servers
  ❖ SQLite, MySQL, PostgreSQL, MS SQL Server
❖ Standard way to:
  ❖ Find subsets of data based on criteria
  ❖ Merge data in separate tables
  ❖ Compute aggregate info
❖ Assumptions
  ❖ Don’t duplicate data (“data normalization”)
  ❖ Various parts of your data relate to each other
  ❖ Your metadata/schema (tables/columns) doesn’t change often
❖ Many frameworks will generate SQL for you
  ❖ Ask about Database Abstraction Layers
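The three standard uses of SQL listed above (subsets, merging tables, aggregates) might look like this in practice. This is a small sketch using Python's `sqlite3` module; the two tables are a cut-down version of the Group/Person model and every row is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "group" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER);
INSERT INTO "group" VALUES (1, 'Group A'), (2, 'Group B');
INSERT INTO person VALUES (1, 'Ada', 1), (2, 'Bob', 1), (3, 'Cy', 2);
""")

# Find a subset of data based on criteria (WHERE).
subset = conn.execute(
    "SELECT name FROM person WHERE group_id = ? ORDER BY name", (1,)
).fetchall()

# Merge data held in separate tables (JOIN on the foreign key).
joined = conn.execute(
    'SELECT person.name, "group".name FROM person '
    'JOIN "group" ON person.group_id = "group".id '
    "ORDER BY person.id"
).fetchall()

# Compute aggregate info (COUNT with GROUP BY).
counts = conn.execute(
    'SELECT "group".name, COUNT(*) FROM person '
    'JOIN "group" ON person.group_id = "group".id '
    'GROUP BY "group".id ORDER BY "group".id'
).fetchall()

print(subset)
print(joined)
print(counts)
```

Because the group name is stored once and referenced by id, renaming a group is a single-row update; that is the practical payoff of not duplicating data.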
NoSQL
❖ Sometimes your data isn’t relational and the metadata changes often
❖ Queuing, document storage, logging, real-time, low-latency, concurrency
❖ Read this write-up for more:
  ❖ http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Tangent: JavaScript Object Notation (JSON)
❖ A human-readable data exchange format
  ❖ CSV, XML, YAML are some others
❖ Example:
❖ http://media.mongodb.org/zips.json
❖ http://mongohub.todayclose.com (for Mac)
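To make the format concrete, here is a minimal sketch of reading and writing JSON with Python's standard `json` module. The record below is hypothetical: its field names and values are made up to resemble a ZIP-code entry and are not copied from the linked file.

```python
import json

# Hypothetical record as JSON text; all values are invented for illustration.
raw = '{"_id": "00000", "city": "EXAMPLEVILLE", "loc": [-71.0, 42.0], "pop": 1234, "state": "MA"}'

# Parse the text into ordinary Python objects (dict, list, str, int, float).
record = json.loads(raw)
print(record["city"], record["pop"])

# Records need not share the same fields, so adding one is just a dict update.
record["notes"] = "added later"

# Serialize back to text; sort_keys gives a stable, diff-friendly key order.
round_trip = json.dumps(record, sort_keys=True)
print(round_trip)
```

JSON maps directly onto dictionaries and lists, which is part of why document stores use it as their record format.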
❖ sudo mkdir -p /data/db
MongoDB: Intro
❖ Demo:
❖ Command Line
❖ MongoHub
Indexes
❖ An index tracks keys
  ❖ Convention: have an “id” column with an index on it
❖ Why all these indexes?
  ❖ Multiple ways to get at rows quickly
❖ Creating indexes is tricky
  ❖ Many frameworks include query logging to help you find slow queries that might need optimizing
❖ Query optimization is a bit of an art
  ❖ Use the “Explain” command
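One way to see what an index changes is to ask the database for its query plan before and after creating one. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module; other servers have their own variants of the “Explain” command. The table contents and the index name `idx_person_group_id` are assumptions for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER)"
)
# Made-up rows: 1000 people spread evenly across 10 groups.
conn.executemany(
    "INSERT INTO person (name, group_id) VALUES (?, ?)",
    [(f"person{i}", i % 10) for i in range(1000)],
)

query = "SELECT name FROM person WHERE group_id = 3"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


before = plan(query)  # no index on group_id yet, so expect a full table scan
conn.execute("CREATE INDEX idx_person_group_id ON person (group_id)")
after = plan(query)   # the planner can now look rows up through the index

print(before)
print(after)
```

The “id” column needs no extra work here because an `INTEGER PRIMARY KEY` is already indexed; the added index gives a second fast path into the same rows.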
Map-Reduce Instead of SQL
❖ Used to query large datasets
❖ Example: Count words in a document
❖ Map: select the data you need to operate on
  ❖ “Emit” one record for each word in a document, keyed by the word
❖ Reduce: combine the mapped data
  ❖ Sum up the uses of each word, “emitting” one record for each total
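The word-count example can be sketched in plain Python to show the shape of the two steps. This is a single-process illustration, not how any particular database runs map-reduce; the function names `map_words`, `reduce_counts`, and `map_reduce` and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_words(document):
    # Map: emit one (word, 1) record for each word, keyed by the word.
    for word in document.lower().split():
        yield (word, 1)


def reduce_counts(word, counts):
    # Reduce: sum the uses of one word and emit a single total record.
    return (word, sum(counts))


def map_reduce(documents):
    # Group the emitted records by key, then reduce each group independently.
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_words(document):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


docs = ["the cat sat", "the cat ran"]
totals = map_reduce(docs)
print(totals)
```

Each document is mapped on its own and each word is reduced on its own, so on a large dataset both steps can be split across many machines.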
Picking Data Storage Strategies
❖ If you just need to dump data and pull it out by some id, use a NoSQL solution (MongoDB is simple)
  ❖ Flexible, easy to start with
❖ If you are modeling an app, a relational database is usually the right answer (MySQL/PostgreSQL are standard)
  ❖ Database modeling is REALLY important to get right at the start of your project, because it is a pain to change later
  ❖ Names matter – choose your table names carefully
❖ PS: we can try stuff out on Amazon’s cloud services for free