Python 数组的基本操作

Posted by Run.D.Guan on February 15, 2016

向量化计算在数据分析中很常用,与数学公式最为相近,又可简化代码。Numpy是Python的一个扩展库,支持高阶维度数组和矩阵运算,Numpy的array 类称为ndarray,也被称为array。这里对其运算进行简单的介绍做为学习与查询之用。

import numpy as np

一维数组

lst = [3, 1, 4, 1, 5, 9, 2]  # list
vec = np.array([3, 1, 4, 1, 5, 9, 2], dtype='float64')
>>> lst
[3, 1, 4, 1, 5, 9, 2]
>>> vec
array([ 3.,  1.,  4.,  1.,  5.,  9.,  2.])

另一种来使得array为浮点型的方法是在一个或多个数字后加一个小数点’.’。

下面来看看 listarray 之间的差别

>>> lst + lst
[3, 1, 4, 1, 5, 9, 2, 3, 1, 4, 1, 5, 9, 2]
>>> vec + vec
array([  6.,   2.,   8.,   2.,  10.,  18.,   4.])
>>> # lst ** 2
>>> vec ** 2
array([  9.,   1.,  16.,   1.,  25.,  81.,   4.])

对于 array,对于大多数的操作(+,-,*,/,**,等)和函数(exp, log, sin, 等)都是逐个计算的,矩阵乘法用 dot() 函数

>>> lst[0], lst[-1], lst[3:5]
(3, 2, [1, 5])
>>> vec[0], vec[-1], vec[3:5]
(3.0, 2.0, array([ 1.,  5.]))

list 和 array 都支持从0开始的索引和切片,array 还支持按列表索引,但普通的 list 却不行

>>> vec[[0, 3, 5, 3]]
array([ 3.,  1.,  9.,  1.])
>>> # lst[[0, 3, 5, 3]]

二维数组(矩阵)

numpy 虽然提供了 “matrix” 类型,但是这里不建议用它,而是都使用 arrays,arrays 应用的范围更广,这样会减少困惑和更少的 bugs。

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12],
              [13, 14, 15]], dtype='float64')
print("Here is the whole test array:")
print(A)
print('A is a numpy array with shape ' + repr(A.shape))
print('That means A has %d row' % A.shape[0], 'and %d columns.' % A.shape[1])
Here is the whole test array:
[[  1.   2.   3.]
 [  4.   5.   6.]
 [  7.   8.   9.]
 [ 10.  11.  12.]
 [ 13.  14.  15.]]
A is a numpy array with shape (5, 3)
That means A has 5 row and 3 columns.
print("跟普通的list一样,我们可以按行索引")
print("A[0] = ")
print(A[0])
print("A[0:3] = ")
print(A[0:3])
跟普通的list一样,我们可以按行索引
A[0] =
[ 1.  2.  3.]
A[0:3] =
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]]
print("我们可以笨拙的提取一行第一个元素:")
print("A[0][0] = ")
print(A[0][0])
print("但是一种更简洁的方式为:")
print("A[0,0] = ")
print(A[0, 0])
我们可以笨拙的提取一行第一个元素:
A[0][0] =
1.0
但是一种更简洁的方式为:
A[0,0] =
1.0
print("Rows and columns can be sliced like Python lists:")
print("A[0:2,:] = ")
print(A[0:2, :])
Rows and columns can be sliced like Python lists:
A[0:2,:] =
[[ 1.  2.  3.]
 [ 4.  5.  6.]]
print("Extracting multiple desired rows and columns:")
print(A[np.ix_([0, 3, 4], [0, 2])])
# Another version is "A[[0,3,4]][:,[0,2]]", but that can't be assigned to.
Extracting multiple desired rows and columns:
[[  1.   3.]
 [ 10.  12.]
 [ 13.  15.]]

Work with arrays

计算 array 每列的均值

I, J = A.shape
mu = np.zeros(J)  #记录结果
for i in range(I):
    for j in range(J):
        mu[j] += A[i,j]
mu = mu / I
print('Mean: ' + repr(mu))
Mean: array([ 7.,  8.,  9.])

可以采用 “mean” 函数来完成上述的工作

mu = np.mean(A, 0)  # 0 means sum over the 0th dimension of the array (the rows)
print('Mean: ', repr(mu))
Mean:  array([ 7.,  8.,  9.])

令每列0均值化,即每列都减去均值

A_shift = np.copy(A)  # without copy, would modify A when modify A_shift
for i in range(I):
    for j in range(J):
        A_shift[i,j] -= mu[j]
print('Cols centered:\n' + repr(A_shift))
Cols centered:
array([[-6., -6., -6.],
       [-3., -3., -3.],
       [ 0.,  0.,  0.],
       [ 3.,  3.,  3.],
       [ 6.,  6.,  6.]])

该过程可以按行减去均值

A_shift = np.copy(A)
for i in range(I):
    A_shift[i] -= mu
print('Cols centered:\n' + repr(A_shift))
Cols centered:
array([[-6., -6., -6.],
       [-3., -3., -3.],
       [ 0.,  0.,  0.],
       [ 3.,  3.,  3.],
       [ 6.,  6.,  6.]])

来使每行的均值为0

row_mu = np.mean(A, 1)
row_mu.shape == (5,)  #行向量
True

但我们想要的 5x1 的列向量,可以通过 row_mu[:,newaxis] 实现列向量的转换,该向量可以从任何 5xD array 中以每列的方式减去

A_shift = A - row_mu[:, np.newaxis]

查询与排序

people = ['Jim', 'alice', 'ali', 'bob']
height_cm = np.array([180, 165, 165, 178])
# The laborious, procedural programming way:
largest_height = -np.Inf
tallest_person = ''
for (i, h) in enumerate(height_cm):
    if h > largest_height:
        largest_height = h
        tallest_person = people[i]
print('largest_height = %g' % largest_height)
print('tallest_person = %s' % tallest_person)
largest_height = 180
tallest_person = Jim

当然啦,这么典型的问题肯定有内建函数来解决

largest_height = max(height_cm)
print('largest_height = %g' % largest_height)
largest_height = 180
#alternatively we can ask for the location of the largest element:
idx = np.argmax(height_cm)
largest_height = height_cm[idx]
tallest_person = people[idx]
print('largest_height = %g' % largest_height)
print('tallest_person = %s' % tallest_person)
largest_height = 180
tallest_person = Jim
# What about the shortest person?
smallest_height = min(height_cm)
idx = np.argmin(height_cm)
smallest_person = people[idx]
# 对于最小的有两个,如何将其都显示出来:
#ids = nonzero(height_cm == smallest_height)[0]
ids = np.where(height_cm == smallest_height)[0]
smallest_people = [people[i] for i in ids]  #list
print('smallest_people: ' + repr(smallest_people))
smallest_people: ['alice', 'ali']

sort (height_cm) sorts the list. Again, a second argument will give the indexes of the corresponding items.

sorted_heights = np.sort(height_cm)
print('sorted_heights: ' + repr(sorted_heights))
ids = np.argsort(height_cm)
sorted_heights = height_cm[ids]
sorted_heights: array([165, 165, 178, 180])
# The previous line wouldn't work with a normal list, but numpy arrays allow a
# list of indexes. The python list of people needs a list comprehension:
people_in_height_order = [people[i] for i in ids]
print('sorted_heights: ' + repr(sorted_heights))
print('people_in_height_order: ' + repr(people_in_height_order))
sorted_heights: array([165, 165, 178, 180])
people_in_height_order: ['alice', 'ali', 'bob', 'Jim']

矩阵操作 “向量化”

Given a mathematical expression like: result = $\sum_{i=1}^I f_n(x_i, x_j) val(z_i)$ The results of $f_n$ could be put in an $I\times J$ matrix, and the results of val in a length$-I$ vector. The sum is then a matrix-vector multiply:

result = dot(fn.T, val)

Sometimes a newaxis index expression is required to make a vector be a $D\times1$ or $1\times D$ matrix so that an elementwise operator can combine it with a matrix (using “broadcasting”).

Reference