CRASH COURSE PYTHON - 3142.nl · CRASH COURSE PYTHON . Vrije Universiteit Amsterdam • Not a...


Vrije Universiteit Amsterdam

This talk

• Not a programming course
• For data analysts who want to learn Python
• For optimizers who are fed up with Matlab


Python

• Scripting language
  • Expensive computations typically run in compiled modules
    • such as matrix multiplication, optimization, classification
  • Faster Python code: Numba's @jit decorator (or Cython)
• Support for functions and OOP (classes, abstract classes, polymorphism, inheritance; but no enforced encapsulation)
• Direct competitors: R, Julia, Matlab
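As a minimal sketch of the OOP bullet (abstract classes, inheritance, polymorphism, and the lack of enforced encapsulation), using only the standard library. The class names Shape, Square and Rectangle are made up for illustration and do not come from the talk.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstract class: it cannot be instantiated directly."""

    def __init__(self, name):
        self.name = name
        self._scale = 1.0  # leading underscore: "private" by convention only

    @abstractmethod
    def area(self):
        ...

    def describe(self):
        return self.name + " with area " + str(self.area())


class Square(Shape):  # inheritance
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self._scale * self.side ** 2


class Rectangle(Shape):
    def __init__(self, width, height):
        super().__init__("rectangle")
        self.width = width
        self.height = height

    def area(self):
        return self._scale * self.width * self.height


shapes = [Square(2), Rectangle(2, 3)]
# polymorphism: the same call dispatches to each subclass's area()
descriptions = [s.describe() for s in shapes]
print(descriptions)

# no enforced encapsulation: nothing prevents outside access to _scale
shapes[0]._scale = 10.0
print(shapes[0].area())  # 40.0
```

The underscore prefix is only a naming convention; the interpreter does not block the assignment in the last lines.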


Zen of Python

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.


Python 2 or 3?

• 1994: Python 1
• 2000: Python 2 (backward compatible)
• 2008: Python 3
• Most pronounced difference:
  • Python 2: print "hello world!"
  • Python 3: print("hello world!")
• Strength of Python: broad availability of modules
• Many modules have been updated for Python 3
• Some people still use Python 2


Installing Python

• Windows users: use WinPython
  • Has MKL for fast linear algebra, and many preinstalled modules
  • Portable, so extract & go
  • Ships with the Spyder editor for coding and debugging
  • and a compiler for new modules
  • WinPython 3.4 is currently recommended (3.5 does not (yet) ship with a compiler)
• Mac users
  • OS X ships with Python 2.7 (and depends on it, do not "update" to 3)
  • Python 3 can be installed alongside
• Linux
  • Ubuntu ships with both Python 2.7 and Python 3.4
  • Commands: python & python3


Installing modules

• Mac/Linux/POSIX-compatible systems: run pip from the terminal
  • e.g.: pip install cylp
• WinPython: run "WinPython Command Prompt.exe" and use pip
• For dependencies that require a shell script ("./configure"):
  • add the folder "winPython/share/mingwpy/bin" to the path
  • install msys from mingw.org
  • start msys (C:\MinGW\msys\1.0\msys.bat)
  • configure & compile the dependency


Running Python

• The editor probably has a hotkey (F5 in Spyder)
• Shell command: "python filename.py"
• Alternative: "python" (starts an interactive session that runs commands as they are entered)


Crash course

Data type   Initialize empty   Initialize with data
List        x = []             x = [1,2,5]
Tuple       -                  x = (1,2,5)
Set         x = set()          x = {1,2,5}
Dict        x = {}             x = {"one": 1, "two": 2, "five": 5}
String      x = ""             x = "hello world"
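To make the table concrete, here is a short sketch of how the five types behave once they hold data. The variable names are chosen for illustration only.

```python
# List: ordered, mutable, allows duplicates
numbers = [1, 2, 5]
numbers.append(2)

# Tuple: ordered, immutable
point = (1, 2, 5)

# Set: unordered, duplicates are dropped
unique = set(numbers)

# Dict: maps keys to values
words = {"one": 1, "two": 2, "five": 5}
words["ten"] = 10

# String: immutable sequence of characters
text = "hello world"

print(numbers)         # [1, 2, 5, 2]
print(point[0])        # 1
print(sorted(unique))  # [1, 2, 5]
print(words["five"])   # 5
print(text.upper())    # HELLO WORLD
```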


Precision

• Integers have arbitrary precision (limited only by memory)
• Floats have finite precision
  • use the decimal or mpmath modules for arbitrary precision

>>> print(2**1000)
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
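A small sketch of the difference between the two number types, using only the standard library's decimal module (mpmath is a separate install and is not used here). The precision of 50 digits is an arbitrary choice for illustration.

```python
from decimal import Decimal, getcontext

# floats are binary approximations, so this comparison fails
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# decimal lets you choose the number of significant digits
getcontext().prec = 50
decimal_third = Decimal(1) / Decimal(3)
print(decimal_third)

# integers grow as needed: 2**1000 is stored exactly
digit_count = len(str(2**1000))
print(digit_count)  # 302
```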


Creating a list

• range(n) is "list-like"
  • internally it is an object that can be converted to a list
  • range(int(1e10)) requires a few bytes instead of 74.5 GB

Code:
x = [0,1,2,3,4,5,6,7,8,9,10]
print(x)
x = range(11)
print(x)
print(list(x))

Output:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
range(0, 11)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
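The memory claim can be checked with sys.getsizeof. This is a sketch that assumes CPython on a 64-bit system; the exact byte counts differ between versions, so none are given here.

```python
import sys

big = range(int(1e10))

# the range object stores only start, stop and step
print(sys.getsizeof(big))

# it still supports len, indexing and membership tests without
# creating the ten billion elements
print(len(big))              # 10000000000
print(big[-1])               # 9999999999
print(5_000_000_000 in big)  # True

# a real list stores a reference per element, so it grows with its length
small = list(range(11))
print(sys.getsizeof(small))
```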


Loops

• No curly braces or "end for"
• Structure is derived from the level of indentation
• One statement per line
• No semicolons required

Code:
for i in [1,2,3]:
    print(i)
while i < 5:
    i += 1
    print(i)

Output:
1
2
3
4
5
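Because structure comes from indentation, moving a statement one level to the left changes what the program does. A minimal sketch, with made-up variable names:

```python
values = [1, 2, 3]

# append is indented: it belongs to the loop and runs once per element
inside = []
for v in values:
    doubled = 2 * v
    inside.append(doubled)

# append is dedented: it runs once, after the loop has finished
outside = []
for v in values:
    doubled = 2 * v
outside.append(doubled)

print(inside)   # [2, 4, 6]
print(outside)  # [6]
```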


Functions

• All arguments can be passed by name: fun(name='group')
• Naming is useful for optional arguments
• Return is optional (a function without a return statement returns None)

Code:
def fun(name, greeting='Hi', me='evil caterpillar'):
    print(greeting + ' ' + name + ', this is ' + me)
    return 0

fun('group', me='Python')

Output:
Hi group, this is Python
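The sketch below illustrates two of the bullets: a function without a return statement yields None, and named arguments can be given in any order. The names greet and make_greeting are hypothetical variants of fun, introduced only for this example.

```python
def greet(name, greeting='Hi', me='evil caterpillar'):
    # no return statement: the call evaluates to None
    print(greeting + ' ' + name + ', this is ' + me)


def make_greeting(name, greeting='Hi', me='evil caterpillar'):
    return greeting + ' ' + name + ', this is ' + me


result = greet('group')
print(result)  # None

# named arguments may be given in any order
a = make_greeting(me='Python', name='group')
b = make_greeting('group', 'Hi', 'Python')
print(a == b)  # True
print(a)       # Hi group, this is Python
```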


Functions

• Behavior depends on whether the type is mutable
  • variables are references to objects
  • objects are modified in place for mutable types only
• String, int, float, tuple are immutable
• List, set, dict are mutable
• "y = list(x)" creates a shallow copy (use y = copy.deepcopy(x) when x contains mutable data)

Code:
def trick_me(a,b,c):
    a.append('o')
    b.append('o')
    c += 1

x = ['m','n']
y = x
z = 1
trick_me(x,y,z)
print(x,y,z)

Output:
['m', 'n', 'o', 'o'] ['m', 'n', 'o', 'o'] 1
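The difference between an alias, a shallow copy and a deep copy shows up as soon as the list contains other lists. A minimal sketch with made-up data:

```python
import copy

x = [['m', 'n'], ['p']]

alias = x                # same object as x
shallow = list(x)        # new outer list, but the inner lists are shared
deep = copy.deepcopy(x)  # fully independent copy

x.append(['q'])          # changes the outer list
x[0].append('o')         # changes an inner list

print(alias)    # [['m', 'n', 'o'], ['p'], ['q']]
print(shallow)  # [['m', 'n', 'o'], ['p']]
print(deep)     # [['m', 'n'], ['p']]
```

The shallow copy misses the appended ['q'] but still sees the 'o', because its first element is the same inner list as x[0].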


List comprehensions

Creating a list with squares: 0, 1, 4, …, 100

Naive code:
x = []
for i in range(11):
    x.append(i*i)

Idiomatic Python:
x = [i*i for i in range(11)]

Creating a list of even numbers: 6, 8, 10, 12, 14

Naive code:
x = []
for i in range(6,15):
    if i % 2 == 0:
        x.append(i)

Idiomatic Python:
x = [i for i in range(6,15) if i%2==0]
# or
x = [i for i in range(6,15,2)]


One-liner example

• Find the last ten digits of the series 1^1 + 2^2 + 3^3 + ... + 1000^1000 (projecteuler.net)
• >>> print(str(sum([k**k for k in range(1,1001)]))[-10:])
  9110846700
  • [k**k for k in range(1,1001)] creates the terms
  • sum(.) takes the sum
  • str(.) converts the argument to a string
  • [-10:] takes a substring (the last ten characters)
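The four steps of the one-liner can be written out one per line. The second half is an alternative that is not on the slide: it keeps only the last ten digits of each term using three-argument pow, so the huge intermediate sum is never built.

```python
# step by step, as in the one-liner
terms = [k**k for k in range(1, 1001)]  # 1**1, 2**2, ..., 1000**1000
total = sum(terms)                      # exact, thanks to arbitrary-precision ints
digits = str(total)
last_ten = digits[-10:]
print(last_ten)  # 9110846700

# alternative (an addition, not from the talk): work modulo 10**10 throughout
modulus = 10**10
total_mod = sum(pow(k, k, modulus) for k in range(1, 1001)) % modulus
print(str(total_mod).zfill(10))  # 9110846700
```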


Modules

• Matlab replacements
  • scipy (free, linear algebra)
  • matplotlib (free, graphing)
• Optimization
  • cylp (free, linear and mixed integer optimization)
  • pyipopt (free, convex optimization)
  • gurobi / cplex (academic license)
• Data mining
  • pandas (free, importing and slicing data)
  • scikit-learn (free, machine learning)
  • xgboost (free, gradient boosting)
  • takes less than 20 lines to create a cross-validated ensemble of classifiers


Recap

Example: function, for-loop, range, comment

def take_sum(S):
    total = 0  # avoid the name "sum", which would shadow the built-in
    for i in S:
        total += i
    return total

print(take_sum(range(7)))
# outputs 21

Example: named arguments

def fun(name, greeting='Hi', me='evil caterpillar'):
    print(greeting + ' ' + name + ', this is ' + me)
    return 0

fun('group', me='Python')


Data mining

• Reading data with pandas
• Visualization with matplotlib
• Machine learning with scikit-learn


Reading data

• Pandas offers read_csv, read_excel, read_sql, read_json, read_html, read_sas, etc.
• read_* returns a pandas data structure: DataFrame
• Having data in a DataFrame is useful
  • filtering, combining, grouping, sorting
  • to_csv, to_excel, etc. (for, e.g., converting csv to json)
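As a sketch of the csv-to-json conversion mentioned above, the example below reads a small CSV from a string, filters and sorts it, and writes JSON. The three data rows are made up for illustration, and the block assumes pandas is installed.

```python
import io
import json

import pandas

# made-up sample data; a real script would pass a filename to read_csv
csv_text = """id,feat_1,feat_2,target
1,1,0,1
2,0,0,0
3,1,3,0
"""

df = pandas.read_csv(io.StringIO(csv_text))

# filtering and sorting on the DataFrame
subset = df[df.feat_1 == 1].sort_values('feat_2', ascending=False)

# to_json without a path returns the JSON as a string
json_text = subset.to_json(orient='records')
records = json.loads(json_text)
print(records)
```

With orient='records' each row becomes one JSON object, which is often the most convenient shape for other tools to consume.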


Example: reading csv file

Code:
import pandas

filename = 'train.csv'
X = pandas.read_csv(filename, sep=",")
# keep the labels separately, then drop the label and id columns from the features
y = X.target
X.drop(['target', 'id'], axis=1, inplace=True)

CSV file:
id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1


Filtering data

Code:
filename = 'train.csv'
data = pandas.read_csv(filename)
# rows at positions 0 and 1 (the end of the slice is exclusive)
print(data[0:2])

Output:
   id  feat_1  feat_2  feat_3  feat_4  feat_5  target
0   1       1       0       0       0       0       1
1   2       0       0       0       0       0       0

CSV file:
id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1


Filtering data

Code:
filename = 'train.csv'
data = pandas.read_csv(filename)
# boolean mask: keep only the rows where feat_1 equals 1
print(data[data.feat_1 == 1])

Output:
   id  feat_1  feat_2  feat_3  feat_4  feat_5  target
0   1       1       0       0       0       0       1
3   4       1       0       0       1       6       0

CSV file:
id,feat_1,feat_2,feat_3,feat_4,feat_5,target
1,1,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,1,6,0
5,0,0,0,0,0,1


Visualization

Code:
# histogram of feat_2, restricted to rows with feat_2 <= 5
data[data.feat_2<=5].feat_2.plot(kind='hist')
# since the data takes few distinct values:
data[data.feat_2<=5].feat_2.value_counts().sort_index().plot(kind='bar')


Grouping

Code:
import numpy as np

pandas.set_option('display.precision', 2)

for feat_2_value, group in data.groupby('feat_2'):
    # group is the DataFrame data[data.feat_2 == feat_2_value]
    pass

data.groupby('feat_2').aggregate(pandas.Series.nunique)
# other aggregation functions: np.min, np.max, np.sum, np.std

Output:
           id  feat_1  feat_3  feat_4  feat_5  target
feat_2
0       55018      37      39      48      15       9
1        4012      26      39      36      10       9
2        1215      14      31      39       7       9
3         549       9      24      27       7       7
4         310      13      21      27       4       5
5         170       5      10      13       3       6


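Example: time series

Code:
import pandas
import numpy as np

# random walk: cumulative sum of 1000 standard normal draws, one per day
ts = pandas.Series(np.random.randn(1000),
                   index=pandas.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
print(ts.mean())
# output: 28.642802230898678 (varies per run, since no random seed is set)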


Example: large data set

Suppose the csv file is 100 GB and has thousands of columns.
A subset of three columns is manageable.

Code:
import pandas

infile = 'train.csv'
outfile = 'output.xlsx'
columns = ['feat_1', 'feat_2', 'target']

# chunksize is the number of rows to read per iteration
chunks = []
for data in pandas.read_csv(infile, chunksize=100):
    chunks.append(data[columns])
df = pandas.concat(chunks)

# the file is written when the with-block exits
with pandas.ExcelWriter(outfile) as writer:
    df.to_excel(writer, sheet_name='Sheet1')


Logistic regression

Code:
import pandas
from sklearn import linear_model
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

filename = 'train.csv'
X = pandas.read_csv(filename, sep=",")
# binary target: class 1 becomes 0, all other classes become 1
y = (X.target > 1).astype(int)
X.drop(['target', 'id'], axis=1, inplace=True)

X, X_test, y, y_test = train_test_split(X, y, test_size=0.5)
clf = linear_model.LogisticRegression()
clf.fit(X, y)
prediction = clf.predict_proba(X_test)
print(log_loss(y_test, prediction))
# output: 0.00159227347414; log_loss is in [0, 34.5]
# 0 for "perfect fit", 0.7 for "constant p=0.5", 34.5 for "all wrong"
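The three reference values in the comment can be reproduced by hand. The sketch below assumes that predicted probabilities are clipped to [1e-15, 1 - 1e-15] before taking logarithms, which would explain the upper bound since -ln(1e-15) is about 34.5. The helper binary_log_loss is hypothetical and is not scikit-learn's implementation.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for labels in {0, 1}.

    p_pred holds the predicted probability of class 1. The clipping
    constant eps is an assumption made for this illustration.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


labels = [1, 0, 1, 0]

perfect = binary_log_loss(labels, [1.0, 0.0, 1.0, 0.0])
constant = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
all_wrong = binary_log_loss(labels, [0.0, 1.0, 0.0, 1.0])

print(perfect)    # essentially 0
print(constant)   # about 0.693, i.e. ln(2)
print(all_wrong)  # about 34.5
```

A constant prediction of 0.5 costs ln(2) per sample regardless of the labels, which makes 0.7 a useful baseline when judging a reported score.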