Python for R developers and data scientists
Installation Vectors Data Frames Analysis Visualization I/O Conclusion
Python for R developers and data scientists
Artur Matos
http://www.lambdatree.com
June 8, 2016
Outline
1 Getting Up and Running
2 Vectors
3 Data Frames
4 Analysis
5 Visualization
6 I/O
7 Conclusion
Section 1
Getting Up and Running
Which Python?
Python runtimes
  Several are available: CPython, PyPy, Jython...
  CPython is the official runtime, written in C.
  PyPy is a JIT-based runtime that often runs pure-Python code significantly faster than CPython.
  For scientific computing, CPython is the only practical choice.
Python 2 vs Python 3
  Python 3 is not backwards compatible with Python 2.
  The answer today is Python 3 (I might have answered differently last year),
  unless you have other teams using Python 2...
  But all major packages already support Python 3.
Installation
Several options, ranging from simple to complex.
We will use Anaconda here, which will get you up and running quickly.
On Linux and Mac you can also install Python with your package manager.
Use virtualenv to isolate Python environments (not covered here).
Installing Anaconda
https://www.continuum.io/downloads
Installing Anaconda (2)
Jupyter
Python syntax in 2 minutes
Goal: a function that
  converts its input to uppercase;
  cuts anything longer than 10 characters;
  pads with spaces anything shorter than 10 characters;
  adds single quotes.
>>> quote_pad_string("This is rather long")
'THIS IS RA'
>>> quote_pad_string("Short")
'SHORT     '
Python syntax in 2 minutes (2)
Function:
# WARNING: This is purely to cover some basic Python syntax;
# there are better ways to do this in Python.
def quote_pad_string(a_string):
    maximum_length = 10
    num_missing_characters = maximum_length - len(a_string)

    if num_missing_characters < 0:
        num_missing_characters = 0

    if num_missing_characters:
        for i in range(num_missing_characters):
            a_string = a_string + " "
    else:
        a_string = a_string[:maximum_length]

    return "'" + a_string.upper() + "'"
Notes on the function above:
  def defines a function. Python is dynamically typed, so a_string needs no type declaration.
  Python uses indentation for code blocks instead of curly braces.
  ‘=’ is used for assignment.
  ‘if’ starts a conditional statement; the condition ends with a colon.
  ‘for’ loops over a sequence, here range(num_missing_characters).
  len(a_string) is a function call; a_string.upper() is a method invocation.
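The warning in the function says there are better ways to do this. One possible shorter version is sketched below, using slicing and str.ljust from the standard library. The width parameter is an addition for illustration and is not part of the original function.

```python
def quote_pad_string(a_string, width=10):
    """Upper-case the input, cut or pad it to `width` characters, then quote it."""
    # Slicing cuts long input; ljust pads short input with spaces on the right.
    # `width` is a hypothetical parameter added for this sketch (default 10).
    return "'" + a_string[:width].ljust(width).upper() + "'"


print(quote_pad_string("This is rather long"))  # 'THIS IS RA'
print(quote_pad_string("Short"))                # 'SHORT     '
```

Slicing past the end of a string is safe in Python, so no length check is needed.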
Section 2
Vectors
Scalars - R
In R there are no real scalar types. They are just vectors of length 1:
> a <- 5  # Equivalent to a <- c(5)
> a
[1] 5
> length(a)
[1] 1
Scalars - Python
In Python scalars and vectors are not the same thing:
>>> a = 5  # Scalar
>>> a
5
>>> b = np.array([5])  # Array with one element
>>> b
array([5])
>>> len(b)  # Equivalent to 'length' in R
1
This won’t work:
>>> len(a)
TypeError: object of type 'int' has no len()
Vectors, matrices and arrays - R
In R, there’s ‘c’ for 1d vectors, ‘matrix’ for 2 dimensions, and ‘array’ for higher-order dimensions:
> c(1,2,3,4)
[1] 1 2 3 4
> matrix(1:4, nrow=2, ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> array(1:3, c(2,4,6))
...
Strangely enough, a 1d array is not the same as a vector:
> a <- as.array(1:3)
> a
[1] 1 2 3
> is.vector(a)
[1] FALSE
Vectors, matrices and arrays - Python
Python has no builtin vector or matrix type. You will need numpy:
>>> import numpy as np  # Equivalent to 'library(numpy)' in R.
>>> np.array([1, 2, 3, 4])  # 1d vector
array([1, 2, 3, 4])
>>> np.array([[1, 2], [3, 4]])  # matrix
array([[1, 2],
       [3, 4]])
np.array works with any number of dimensions and always returns a single type (ndarray).
(There’s also a matrix type specifically for two dimensions, but it should be avoided. Always use ndarray.)
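A small sketch of the "single type" point: the same ndarray type covers vectors, matrices and higher-dimensional arrays, and ndim and shape tell them apart. The variable names are only illustrative.

```python
import numpy as np

vector = np.array([1, 2, 3, 4])        # 1 dimension
matrix = np.array([[1, 2], [3, 4]])    # 2 dimensions
cube = np.zeros((2, 4, 6))             # 3 dimensions, like R's array(0, c(2,4,6))

for arr in (vector, matrix, cube):
    # Every one of them is an ndarray; only ndim and shape differ.
    print(type(arr).__name__, arr.ndim, arr.shape)
```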
Generating regular sequences
R
> 5:10  # Shorthand for seq(5, 10)
[1]  5  6  7  8  9 10
> seq(0, 1, length.out = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Python
>>> np.arange(3.0)
array([ 0.,  1.,  2.])
>>> np.arange(3, 7)
array([3, 4, 5, 6])
>>> np.arange(3, 7, 2)
array([3, 5])
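The R examples above have close numpy counterparts that the Python side does not show. A minimal sketch: np.linspace plays the role of seq(..., length.out = n), and np.arange needs a stop value one past the end because it excludes it.

```python
import numpy as np

# R: seq(0, 1, length.out = 11) -> 11 evenly spaced points, both ends included
by_count = np.linspace(0, 1, 11)

# R: 5:10 includes 10; np.arange stops before its second argument
five_to_ten = np.arange(5, 11)

print(by_count)
print(five_to_ten)
```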
Vector operations - Python
For the most part, vector operations in Python work just like in R:
>>> a = np.arange(5.0)
>>> a
array([ 0.,  1.,  2.,  3.,  4.])
>>> 1.0 + a  # Adding
array([ 1.,  2.,  3.,  4.,  5.])
>>> a * a  # Multiplying element-wise
array([  0.,   1.,   4.,   9.,  16.])
>>> a ** 3  # To the power of 3
array([  0.,   1.,   8.,  27.,  64.])
Vector operations - Python (2)
For matrix multiplication use the ‘@’ operator:
>>> a = np.array([[1, 0], [0, 1]])
>>> a
array([[1, 0],
       [0, 1]])
>>> b = np.array([[4, 1], [2, 2]])
>>> b
array([[4, 1],
       [2, 2]])
>>> a @ b
array([[4, 1],
       [2, 2]])
(The @ operator needs Python 3.5 or later. In Python 2 use np.dot(a, b).)
Vector operations - Python (3)
Numpy also has the usual mathematical operations that work on vectors:
>>> a = np.arange(5.0)
>>> a
array([ 0.,  1.,  2.,  3.,  4.])
>>> np.sin(a)
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])
Full reference here: http://docs.scipy.org/doc/numpy/reference/routines.math.html
Recycling - R
When doing vector operations, R automatically recycles the shorter vector until it is as long as the other:
> c(1,2) + c(1,2,3,4)  # Equivalent to c(1,2,1,2) + c(1,2,3,4)
[1] 2 4 4 6
You can do this even if the longer length isn’t a multiple of the shorter one, albeit with a warning:
> c(1,2) + c(1,2,3,4,5)
[1] 2 4 4 6 6
Warning message:
In c(1, 2) + c(1, 2, 3, 4, 5) :
  longer object length is not a multiple of shorter object length
Recycling - Python
This won’t work in Python, however:
>>> np.arange(2.0) + np.arange(4.0)
----------------------------------------
ValueError: operands could not be broadcast together with shapes (2,) (4,)
Numpy has much stricter recycling (aka broadcasting) rules.
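If you really want R-style recycling, you have to ask for it explicitly. One way, sketched here with np.resize (which repeats an array until it fills the requested shape), reproduces the R example c(1,2) + c(1,2,3,4). The variable names are illustrative.

```python
import numpy as np

shorter = np.array([1, 2])
longer = np.array([1, 2, 3, 4])

# np.resize repeats `shorter` until it has the same shape as `longer`,
# which mimics R turning c(1,2) into c(1,2,1,2).
recycled = np.resize(shorter, longer.shape)
result = recycled + longer

print(recycled)  # [1 2 1 2]
print(result)    # [2 4 4 6]
```

Being explicit costs one extra line but avoids the silent surprises that R's warning is meant to catch.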
Broadcasting Rules - Python
2x3 and 2x1:
$$\begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$
2x3 and 1x3:
$$\begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 2 \\ 0 & 1 & 2 \end{bmatrix}$$
Adding a single element array or a scalar always works:
$$\begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
This won’t work (the dimensions need to match exactly or be 1):
$$\begin{bmatrix} 0 & 1 & 2 & 3 \end{bmatrix} + \begin{bmatrix} 0 & 1 \end{bmatrix}$$
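The four cases above can be tried directly in numpy. This is a minimal sketch; the variable names m, col and row are only for illustration.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]], shape (2, 3)
col = np.array([[0], [1]])       # shape (2, 1): stretched across the columns
row = np.array([0, 1, 2])        # shape (3,): treated as (1, 3), stretched down the rows

plus_col = m + col               # [[0 1 2], [4 5 6]]
plus_row = m + row               # [[0 2 4], [3 5 7]]
plus_scalar = m + 0              # a scalar always broadcasts

# Shapes (4,) and (2,) neither match nor contain a 1, so this fails.
try:
    np.arange(4) + np.arange(2)
    mismatch_failed = False
except ValueError as err:
    mismatch_failed = True
    print("broadcast failed:", err)

print(plus_col)
print(plus_row)
```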
Indexing - i:j:k syntax
a = np.arange(10)
a = [0 1 2 3 4 5 6 7 8 9]
Indexing in Python starts from 0 (not 1):
>>> a[0]
0
Indexing on a single value returns a scalar (not an array!)
Use ‘i:j’ to index from position i to j-1:
>>> a[1:3]
array([1, 2])
An optional ‘k’ element defines the step:
>>> a[1:7:2]
array([1, 3, 5])
i and j can be negative, which means they count from the end:
>>> a[1:-3]
array([1, 2, 3, 4, 5, 6])
>>> a[-3:-1]
array([7, 8])
A negative k goes in the opposite direction:
>>> a[-3:-9:-1]
array([7, 6, 5, 4, 3, 2])
Not all of i, j and k need to be included:
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
Indexing - multiple dimensions
Use ‘,’ for additional dimensions:
$$x = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
>>> x[0:2, 0:1]
array([[1],
       [4]])
Indexing using conditions
Operators like ‘>’ or ‘<=’ operate element-wise and return a logical vector:
>>> a > 4
array([False, False, False, False, False,  True,  True,  True,  True,  True], dtype=bool)
These can be combined into more complex expressions with ‘&’ and ‘|’ (Python has no ‘&&’ or ‘||’):
>>> (a > 2) & (b ** 2 <= a)
...
And used for indexing too:
>>> a[a > 4]
array([5, 6, 7, 8, 9])
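A short sketch of combining conditions on numpy arrays. The element-wise operators are &, | and ~; each comparison needs its own parentheses because these operators bind more tightly than the comparisons. The conditions chosen here are arbitrary examples.

```python
import numpy as np

a = np.arange(10)

both = (a > 2) & (a % 2 == 0)   # element-wise AND
either = (a < 2) | (a > 7)      # element-wise OR
negated = ~(a > 4)              # element-wise NOT

print(a[both])      # [4 6 8]
print(a[either])    # [0 1 8 9]
print(a[negated])   # [0 1 2 3 4]
```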
Assignment
Any index can be used together with ‘=’ for assignment:
>>> a[0] = 10
>>> a
array([10,  1,  2,  3,  4,  5,  6,  7,  8,  9])
Conditions work as well:
>>> a[a > 4] = 99
>>> a
array([99,  1,  2,  3,  4, 99, 99, 99, 99, 99])
Section 3
Data Frames
Pandas
Data frames in Python aren’t builtin. You will need pandas:
import pandas as pd
Loading the iris dataset:
>>> iris = pd.read_csv("https://raw.githubusercontent.com/pydata/pandas/master/pandas/tests/data/iris.csv")
>>> iris.head()
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
Selection
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
...
‘.<columnName>’ returns only that column:
>>> iris.SepalLength
0    5.1
1    4.9
2    4.7
...
This also works:
>>> iris["SepalLength"]
...
You can select multiple columns by passing a list:
>>> iris[["SepalWidth", "SepalLength"]]
   SepalWidth  SepalLength
0         3.5          5.1
1         3.0          4.9
...
Or if you pass a slice you can select rows:
>>> iris[1:3]
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
With loc you can slice both rows and columns:
>>> iris.loc[1:3, ["SepalLength", "SepalWidth"]]
   SepalLength  SepalWidth
1          4.9         3.0
2          4.7         3.2
3          4.6         3.1
loc is inclusive at the end.
loc can also select a single row:
>>> iris.loc[3]
SepalLength            4.6
SepalWidth             3.1
PetalLength            1.5
PetalWidth             0.2
Name           Iris-setosa
Name: 3, dtype: object
iloc works with integer positions, similar to numpy arrays:
>>> iris.iloc[0:2, 0:3]
   SepalLength  SepalWidth  PetalLength
0          5.1         3.5          1.4
1          4.9         3.0          1.4
‘:’ will include all the rows (or all the columns):
>>> iris.iloc[0:2, :]
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
Or you can use conditions like numpy:
>>> iris[iris.SepalLength < 5]
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
...
Picking a single value returns a scalar:
>>> iris.iloc[0,0]
5.0999999999999996
For single values it’s normally better to use at (by label) or iat (by position), which are faster.
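To try the selection rules without downloading anything, here is a sketch that rebuilds the five rows shown above by hand and compares loc, iloc, at and iat. The frame name iris_head is made up for this example.

```python
import pandas as pd

# The first five rows of the iris table shown above, typed in by hand.
iris_head = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidth": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLength": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidth": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Name": ["Iris-setosa"] * 5,
})

by_label = iris_head.loc[1:3, ["SepalLength", "SepalWidth"]]  # labels 1, 2 and 3 (end included)
by_position = iris_head.iloc[1:3, 0:2]                        # positions 1 and 2 (end excluded)
single = iris_head.at[3, "SepalWidth"]                        # scalar by label
same_single = iris_head.iat[3, 1]                             # scalar by position

print(len(by_label), len(by_position), single, same_single)
```

The row counts differ (3 versus 2) even though both slices are written 1:3, which is the loc/iloc difference to remember.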
Assignment
Assignment works as expected:
>>> iris.loc[iris.SepalLength > 7.6, "Name"] = "Iris-orlando"
Beware that this doesn’t work, because the condition first returns a copy and the assignment only changes that copy:
>>> iris[iris.SepalLength > 7.6].Name = "Iris-orlando"
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
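A small sketch of the difference between the two assignment forms, on a made-up three-row frame (the values are invented for illustration). Warnings are silenced here only so that the example runs quietly; the exact warning text depends on the pandas version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "SepalLength": [7.7, 5.1, 7.9],
    "Name": ["Iris-virginica", "Iris-setosa", "Iris-virginica"],
})

# Chained form: the condition returns a new frame, so the assignment never reaches df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df.SepalLength > 7.6].Name = "Iris-orlando"
unchanged = df.Name.tolist()

# Single .loc call: rows and column are chosen in one step on df itself.
df.loc[df.SepalLength > 7.6, "Name"] = "Iris-orlando"
changed = df.Name.tolist()

print(unchanged)
print(changed)
```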
Operations
Operators work as expected:
>>> iris["SepalD"] = iris["SepalLength"] * iris["SepalWidth"]
There’s also apply:
>>> iris[["SepalLength", "SepalWidth"]].apply(np.sqrt)
   SepalLength  SepalWidth
0     2.258318    1.870829
1     2.213594    1.732051
...
Use axis=1 to apply a function to each row.
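A sketch of what axis changes in apply, using two rows from the table and made-up lambda functions: by default the function receives one column at a time, with axis=1 it receives one row at a time.

```python
import pandas as pd

sepals = pd.DataFrame({"SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0]})

# Default (axis=0): the function is called once per column.
col_max = sepals.apply(lambda col: col.max())

# axis=1: the function is called once per row.
row_ratio = sepals.apply(lambda row: row["SepalLength"] / row["SepalWidth"], axis=1)

print(col_max)
print(row_ratio)
```

For simple arithmetic like this ratio, plain column operators are usually simpler and faster; apply with axis=1 is for logic that cannot be written that way.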
SQL-like operations - group by
>>> iris.groupby("Name").mean()
                 SepalLength  SepalWidth  PetalLength  PetalWidth
Name
Iris-setosa            5.006       3.418        1.464       0.244
Iris-versicolor        5.936       2.770        4.260       1.326
Iris-virginica         6.588       2.974        5.552       2.026
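A self-contained sketch of group by on a tiny frame. The measurements below are made up for illustration and are not taken from the iris data.

```python
import pandas as pd

flowers = pd.DataFrame({
    "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "SepalLength": [5.0, 5.2, 6.4, 6.8],   # made-up values
})

# One aggregate per group, similar to SELECT Name, AVG(SepalLength) ... GROUP BY Name
means = flowers.groupby("Name")["SepalLength"].mean()

# Several aggregates at once
summary = flowers.groupby("Name")["SepalLength"].agg(["mean", "count"])

print(means)
print(summary)
```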
![Page 62: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/62.jpg)
SQL-like operations (2) - join
>>> left
   key  lval
0  foo     1
1  foo     2
>>> right
   key  rval
0  foo     4
1  foo     5
>>> pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
...
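The join above can be reproduced with the sketch below, which rebuilds `left` and `right` from the values shown. Because every row shares the key "foo", the default inner join pairs each left row with each right row:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Inner join on "key" (the default how="inner"), like SQL's INNER JOIN
# or R's merge(left, right, by = "key").
merged = pd.merge(left, right, on="key")
pairs = list(zip(merged["lval"], merged["rval"]))
```

Passing how="left", how="right" or how="outer" selects the other SQL join types.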
![Page 63: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/63.jpg)
Time Series
Pandas data frames can also work as time series, replacing R’s ts, xts or zoo.
Downloading some stock data from Google Finance:

>>> import pandas.io.data as web
>>> import datetime
>>> aapl = web.DataReader("AAPL", 'google',
...                       datetime.datetime(2013, 1, 1),
...                       datetime.datetime(2014, 1, 1))

             Open   High    Low  Close     Volume
Date
2013-01-02  79.12  79.29  77.38  78.43  140124866
2013-01-03  78.27  78.52  77.29  77.44   88240950
...
Time series are just regular pandas data frames but with time stamps as indices.
![Page 64: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/64.jpg)
Time Series (2)
Use loc to select based on dates:
>>> aapl.loc['20130131':'20130217']
             Open   High    Low  Close     Volume
Date
2013-01-31  65.28  65.61  65.00  65.07   79833215
2013-02-01  65.59  65.64  64.05  64.80  134867089
...
Use iloc as before for selecting based on numerical indices:
>>> aapl.iloc[1:3]
             Open   High    Low  Close     Volume
Date
2013-01-03  78.27  78.52  77.29  77.44   88240950
2013-01-04  76.71  76.95  75.12  75.29  148581860
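The same two selections can be tried without downloading anything. The sketch below uses a synthetic daily series (the dates, the `Close` values and the name `ts` are made up for illustration) indexed by time stamps:

```python
import numpy as np
import pandas as pd

# Synthetic data: ten consecutive days starting 2013-01-01.
dates = pd.date_range("2013-01-01", periods=10, freq="D")
ts = pd.DataFrame({"Close": np.arange(10.0)}, index=dates)

# Label-based slicing with date strings; both endpoints are included.
by_date = ts.loc["20130103":"20130105"]

# Position-based slicing; the end position is excluded, as in plain Python.
by_position = ts.iloc[1:3]
```

Note the asymmetry: loc slices include the end label, while iloc slices exclude the end position.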
![Page 65: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/65.jpg)
Section 4
Analysis
![Page 66: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/66.jpg)
Statistical tests
Use scipy.stats for common statistical tests:
>>> from scipy import stats
>>> iris_virginica = iris[iris.Name == 'Iris-virginica'].SepalLength.values
>>> iris_setosa = iris[iris.Name == 'Iris-setosa'].SepalLength.values
>>> t_test = stats.ttest_ind(iris_virginica, iris_setosa)
>>> t_test.pvalue
6.8925460606740589e-28
Use scikits.bootstrap for bootstrapped confidence intervals.
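One difference worth knowing when coming from R: stats.ttest_ind assumes equal variances by default, whereas R's t.test defaults to Welch's test. The sketch below shows both variants on synthetic samples (the group means, spreads and names are invented for the example):

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for two groups of measurements.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=0.35, size=50)
group_b = rng.normal(loc=6.6, scale=0.64, size=50)

# Student's t-test (scipy default, equal variances assumed).
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test (the default in R's t.test).
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
```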
![Page 67: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/67.jpg)
Ordinary Least Squares
Use statsmodels:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
The formula API is very similar to R:
>>> results = smf.ols("PetalWidth ~ Name + PetalLength", data=iris).fit()
It automatically includes an intercept (just like R).
Use smf.glm for generalized linear models.
![Page 68: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/68.jpg)
Formula API
Very similar to R:
You can include arbitrary transformations, e.g. “np.log(PetalWidth)”.
To remove the intercept add a “- 1” or “0 +”.
Use “C(a)” to coerce a number to a factor.
Use “a:b” for modelling interactions between a and b.
“a*b” means “a + b + a:b”.
Strings are automatically coerced to factors (more on this later).
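A few of these rules can be checked on a toy data set. In the sketch below the columns `x`, `g` and `y` are synthetic and only serve to show how each formula changes the fitted parameters:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends linearly on x and on a numeric group code g.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.linspace(1.0, 10.0, 30),
    "g": [1, 2, 3] * 10,
})
df["y"] = 2.0 * df["x"] + df["g"] + rng.normal(scale=0.1, size=30)

# Intercept is included automatically.
with_intercept = smf.ols("y ~ x", data=df).fit()

# "- 1" removes the intercept.
no_intercept = smf.ols("y ~ x - 1", data=df).fit()

# Transformation on the response, and C() to treat the number g as a factor.
factor = smf.ols("np.log(y) ~ x + C(g)", data=df).fit()
```

Inspecting `params.index` on each result shows which terms were generated, e.g. an `Intercept` entry and one dummy per non-reference level of `C(g)`.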
![Page 69: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/69.jpg)
Decision trees
Use scikit-learn:
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(sk_iris.data, sk_iris.target)
After being fitted, the model can then be used to predict the class of samples:
>>> clf.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])
![Page 70: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/70.jpg)
Support Vector Machines
scikit-learn has a very regular API. Here’s the same example using an SVM:
>>> from sklearn import svm
>>> clf_svm = svm.SVC()
>>> clf_svm = clf_svm.fit(sk_iris.data, sk_iris.target)
>>> clf_svm.predict([[5.1, 3.5, 1.4, 0.2]])
array([0])
![Page 71: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/71.jpg)
K-Means clustering
Clustering follows the same pattern:
>>> from sklearn import cluster
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(sk_iris.data)
KMeans(copy_x=True, init='k-means++', ...
labels_ contains the assigned categories, following the same order as the data:
>>> k_means.labels_
array([1, 1, 1, 1, 1...
predict works the same as for the other models, and returns the predicted category.
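A self-contained sketch of the fit / labels_ / predict sequence. It assumes that `sk_iris` is the data set returned by sklearn's datasets.load_iris(), and fixes n_init and random_state (choices made here, not in the slides) so that runs are repeatable:

```python
from sklearn import cluster, datasets

# Assumption: sk_iris in the slides is scikit-learn's bundled iris data set.
sk_iris = datasets.load_iris()

k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(sk_iris.data)

# One cluster index per training sample, in the same order as the data.
labels = k_means.labels_

# Cluster index for a new, unseen sample.
predicted = k_means.predict([[5.1, 3.5, 1.4, 0.2]])
```

Cluster indices are arbitrary: they identify groups but do not correspond to the species codes in sk_iris.target.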
![Page 72: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/72.jpg)
Principal Component Analysis
from sklearn import decomposition

pca = decomposition.PCA(n_components=3)
pca = pca.fit(sk_iris.data)

explained_variance_ratio_ and components_ will include the explained variance and the PCA components respectively:

>>> pca.explained_variance_ratio_
array([ 0.92461621,  0.05301557,  0.01718514])
>>> pca.components_
array([[ 0.36158968, -0.08226889,  0.85657211,  0.35884393],
       [-0.65653988, -0.72971237,  0.1757674 ,  0.07470647],
       [ 0.58099728, -0.59641809, -0.07252408, -0.54906091]])
![Page 73: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/73.jpg)
Cross validation
scikit-learn also includes extensive support for cross-validation. Here’s a simple split into training and out-of-sample sets:

>>> from sklearn import model_selection
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(
...     sk_iris.data, sk_iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

It also supports K-fold, stratified K-fold, shuffling, etc.
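As a sketch of the K-fold variants, the example below scores a decision tree with a shuffled, stratified 5-fold split. It assumes `sk_iris` comes from datasets.load_iris(); the choice of classifier and of five folds is arbitrary:

```python
from sklearn import datasets, model_selection, tree

# Assumption: sk_iris in the slides is scikit-learn's bundled iris data set.
sk_iris = datasets.load_iris()

clf = tree.DecisionTreeClassifier(random_state=0)

# Stratified folds keep the class proportions equal in every fold.
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold.
scores = model_selection.cross_val_score(clf, sk_iris.data, sk_iris.target, cv=cv)
mean_accuracy = scores.mean()
```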
![Page 74: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/74.jpg)
NAs
There’s no builtin NA in Python. You normally use NaN for NAs. numpy has a bunch of builtin functions to ignore NaNs:

>>> a = np.array([1.0, 3.0, np.nan, 5.0])
>>> a.sum()
nan
>>> np.nansum(a)
9.0

Pandas usually skips NaNs when computing sums, means, etc., but propagates them in element-wise operations.
scikit-learn assumes there’s no missing data, so be sure to pre-process it, e.g. remove missing values or set them to 0. Look at sklearn.impute.SimpleImputer.
statsmodels also uses NaNs for missing data, but only has basic support for handling them (it can only ignore them or raise an error). See the missing argument of the model classes.
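The pandas behaviour can be seen on the same four values used in the numpy example. This is a minimal sketch; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, np.nan, 5.0])

total_skipping = s.sum()             # NaN skipped, like sum(x, na.rm = TRUE) in R
total_strict = s.sum(skipna=False)   # NaN propagates, like R's default sum(x)
doubled = s * 2                      # element-wise operations keep the NaN

# Two common ways to pre-process before handing data to scikit-learn.
dropped = s.dropna()
filled = s.fillna(0.0)
```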
![Page 75: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/75.jpg)
Factors
Similarly to NAs, Python has no builtin factor data type.
Different packages handle them differently:

numpy has no support for factors. Use integers.
Pandas has categoricals, which work fairly similarly to factors.
statsmodels converts strings to its own internal factor type, very similar to R. There’s also the ‘C’ operator.
scikit-learn doesn’t support factors internally, but has some tools to convert strings into dummy variables, e.g. DictVectorizer.
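A brief sketch of pandas categoricals next to their R counterparts. The four species labels are made-up sample values; get_dummies is shown as one pandas-side way to produce dummy variables:

```python
import pandas as pd

# Roughly factor(c("setosa", "virginica", "setosa", "versicolor")) in R.
names = pd.Series(["setosa", "virginica", "setosa", "versicolor"], dtype="category")

levels = list(names.cat.categories)   # like levels(x); sorted by default
codes = names.cat.codes.tolist()      # integer codes, 0-based unlike R

# One indicator column per level, usable as numeric input for scikit-learn.
dummies = pd.get_dummies(names)
```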
![Page 76: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/76.jpg)
Notorious Omissions
Bayesian modelling
Time series analysis
Econometrics
Signal processing, i.e. filter design
Natural language processing
...
![Page 77: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/77.jpg)
Section 5
Visualization
![Page 78: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/78.jpg)
Visualization
![Page 79: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/79.jpg)
Section 6
I/O
![Page 80: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/80.jpg)
Pandas
read_csv reads data from CSV files:

>>> pd.read_csv('foo.csv')
   Unnamed: 0         A         B         C         D
0  2000-01-01  0.266457 -0.399641 -0.219582  1.186860
1  2000-01-02 -1.170732 -0.345873  1.653061 -0.282953
...
Conversely there is to_csv to write CSV files:
>>> df.to_csv('foo.csv')
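The "Unnamed: 0" column in the output above is the data frame index that to_csv wrote out as the first column. A round trip that restores it as the index might look like the sketch below, which writes to an in-memory buffer instead of foo.csv and uses made-up values:

```python
import io
import pandas as pd

# Small illustrative data frame with dates as the index.
df = pd.DataFrame(
    {"A": [0.5, -1.25], "B": [1, 2]},
    index=["2000-01-01", "2000-01-02"],
)

buffer = io.StringIO()
df.to_csv(buffer)      # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 turns the first column back into the index;
# parse_dates=True converts that index to time stamps.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
```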
![Page 81: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/81.jpg)
Other options
For data frames:
HDF5: read_hdf, to_hdf
Excel: read_excel, to_excel
SQL: read_sql, to_sql
Stata: read_stata, to_stata
SAS: read_sas (read-only)
REST APIs: read_json or alternatively use requests
For numpy arrays:
You can use load and save for saving into .npy format.
Normally I prefer to use HDF5 with the h5py library.
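For the .npy route, a minimal round trip could look like this. The file name and the temporary directory are choices made for the example:

```python
import os
import tempfile

import numpy as np

a = np.arange(10).reshape(2, 5)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array.npy")
    np.save(path, a)      # shape and dtype are stored with the data
    b = np.load(path)
```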
![Page 82: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/82.jpg)
h5py - datasets
Creating a data set:
>>> import h5py
>>> import numpy as np
>>>
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100,), dtype='i')
Datasets work similarly to numpy arrays:
>>> dset[...] = np.arange(100)
>>> dset[0]
0
>>> dset[10]
10
>>> dset[0:100:10]
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
![Page 83: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/83.jpg)
Other options
pickle - Python standard serialization format. See also shelve.
tinydb - local document-oriented database (good for NLP tasks)
sqlalchemy - Heavy-duty SQL toolkit and object-relational mapper.
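pickle plays roughly the role of saveRDS / readRDS in R. A minimal sketch, serializing a made-up dictionary to bytes and back:

```python
import pickle

# Hypothetical object to persist; any picklable Python object works.
model_state = {"name": "k_means", "n_clusters": 3, "centers": [[5.0, 3.4], [6.6, 3.0]]}

payload = pickle.dumps(model_state)   # serialize to bytes
restored = pickle.loads(payload)      # rebuild an equal object
```

pickle.dump and pickle.load do the same with an open binary file. Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.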
![Page 84: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/84.jpg)
Section 7
Conclusion
![Page 85: Python for R developers and data scientists](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a6d97cd7f8b9a22428b5cb9/html5/thumbnails/85.jpg)
Things I haven’t covered:
Python data structures: dicts, lists
Python - R interoperability: RPy
Parallel computing: IPython.parallel, pyspark
Optimizing Python code: Cython, numba, numexpr

Hope you’ve enjoyed it. Feel free to get in touch: [email protected]