Measure before optimizing. In notebooks, use the `%timeit` magic; in scripts, use the `timeit` module.
```python
import numpy as np
import timeit

# Notebook: %timeit magic (example)
# %timeit np.arange(10_000_000)  # fast construction

x = np.random.rand(1_000_000)

def loop_square(x):
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v * v
    return out

print('loop  :', timeit.timeit(lambda: loop_square(x), number=3))
print('vector:', timeit.timeit(lambda: x * x, number=3))
```
Prefer ufuncs and broadcasting to operate on whole arrays at once.
```python
X = np.random.randn(2_000, 512)
w = np.random.randn(512)

# Vectorized dot product per row
y = X @ w

# Equivalent but slower (Python loop):
# y = np.array([(row * w).sum() for row in X])
```
Contiguous arrays (C- or F-order) speed up many kernels. After transposes/slicing, arrays can become non-contiguous and trigger hidden copies.
```python
A = np.arange(12).reshape(3, 4)
AT = A.T
print(A.flags['C_CONTIGUOUS'], AT.flags['C_CONTIGUOUS'])  # True False
# Ensure contiguity when required by downstream libraries
ATc = np.ascontiguousarray(AT)
```
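To see where the hidden copy actually happens, a quick check with `np.shares_memory` (array names here mirror the snippet above):

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
AT = A.T                        # a view: no data copied, just new strides
ATc = np.ascontiguousarray(AT)  # materializes a C-contiguous copy

print(np.shares_memory(A, AT))   # True: the transpose is a view
print(np.shares_memory(A, ATc))  # False: ascontiguousarray had to copy
```

So the transpose itself is free; the cost appears only when a contiguous buffer is demanded.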
Intermediate temporaries cost memory and time. Use in-place operators and the ufunc `out=` parameter to reuse buffers.
```python
x = np.ones(5)
y = np.arange(5, dtype=float)

# Bad: creates temporaries
z = (x + y) * 2 - y

# Better: reuse buffers with out=
tmp = np.empty_like(x)
np.add(x, y, out=tmp)         # tmp = x + y
np.multiply(tmp, 2, out=tmp)  # tmp *= 2, same buffer
z2 = np.subtract(tmp, y)      # allocates z2 once

# In-place when safe (watch aliasing!)
y += 2
y *= 0.5
```
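One in-place pitfall worth spelling out: augmented assignment cannot change an array's dtype, so multiplying an integer array by a float in place is refused rather than silently truncated:

```python
import numpy as np

a = np.arange(5)          # integer dtype
try:
    a *= 0.5              # in-place would need a float64 -> int cast
except TypeError as e:
    print('in-place cast refused:', e)

b = a * 0.5               # out-of-place: a new float64 array instead
print(b.dtype)            # float64
```

This is why `y` above was created with `dtype=float` before `y *= 0.5` was applied.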
Broadcasting is lazy and avoids big copies. `np.tile`/`np.repeat` materially enlarge arrays; use them only when an API needs explicit shapes.
```python
B = np.arange(12).reshape(3, 4)
col_bias = np.array([10, 20, 30])[:, None]  # shape (3, 1)

# Preferred: broadcasts to (3, 4) without copying col_bias
C = B + col_bias

# Avoid: materializes a (3, 4) copy first
C2 = B + np.tile(col_bias, (1, 4))
```
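If an API does demand the expanded shape, `np.broadcast_to` can often stand in for `tile`: it returns a read-only view whose repeated axis has stride 0, so no data is duplicated:

```python
import numpy as np

col_bias = np.array([10, 20, 30])[:, None]   # (3, 1)
view = np.broadcast_to(col_bias, (3, 4))     # (3, 4) view, no copy

print(view.strides[1])                       # 0: the columns reuse one value
print(np.shares_memory(view, col_bias))      # True: same underlying buffer
```

The view is read-only, so this only works when the consumer does not write into it.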
Smaller dtypes save memory and can improve cache locality. But avoid precision loss when it matters (e.g., summations).
```python
N = 10_000_000
a32 = np.ones(N, dtype=np.float32)
a64 = np.ones(N, dtype=np.float64)
print(a32.nbytes / 1e6, 'MB', a64.nbytes / 1e6, 'MB')  # ~40 MB vs ~80 MB

# Accumulations: promote to float64 if needed
s = a32.astype(np.float64).sum()
```
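The `astype` above copies the whole array before summing. Most NumPy reductions accept a `dtype=` argument that promotes only the accumulator, which keeps the precision benefit without the extra 80 MB copy. The precision concern itself is easy to demonstrate: float32 cannot represent integers above 2**24 exactly.

```python
import numpy as np

a32 = np.ones(10_000_000, dtype=np.float32)
# Promote the accumulator only; no float64 copy of the array is made
s = a32.sum(dtype=np.float64)
print(s)  # 10000000.0

# Why it matters: adding 1.0 at magnitude 2**24 is lost in float32
big = np.float32(2**24)
print(big + np.float32(1.0) == big)  # True
```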
Repeated appends cause many reallocations. Preallocate or compute vectorized.
```python
N = 1_000_000

# Bad: list append, then conversion
lst = []
for i in range(N):
    lst.append(i * i)
arr_bad = np.array(lst)

# Better: preallocate
arr = np.empty(N, dtype=np.int64)
for i in range(N):
    arr[i] = i * i

# Best: vectorized
arr_vec = np.arange(N, dtype=np.int64) ** 2
```
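When the values genuinely arrive from a Python iterator and cannot be vectorized, `np.fromiter` fills the array directly and skips the intermediate list; passing `count` lets it preallocate the whole buffer up front:

```python
import numpy as np

N = 1_000_000
# Consumes the generator straight into a preallocated int64 array
arr_it = np.fromiter((i * i for i in range(N)), dtype=np.int64, count=N)
print(arr_it[:5])  # first values: 0, 1, 4, 9, 16
```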
Convert at the edges, not at every step. Keep arrays contiguous where heavy math runs.
```python
X = np.random.rand(1_000, 1_000).astype(np.float64)
# Heavy math in float64; compress to float32 only for storage
Y = np.tanh(X @ X.T)
Y32 = Y.astype(np.float32)  # single cast at pipeline end
```
Operations along the last axis (C-order) can be faster due to memory access patterns.
```python
A = np.random.rand(2_000, 512)
# Mean across axis=1 visits contiguous memory per row (C order)
m1 = A.mean(axis=1)
# If your layout differs, consider transposing once, then compute
```
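If most of your reductions instead run down columns, one option is a single conversion to Fortran order so that axis=0 becomes the contiguous direction. Whether this pays off depends on array size and hardware, so treat it as something to benchmark, not a rule:

```python
import numpy as np

A = np.random.rand(2_000, 512)   # C order: rows are contiguous
Af = np.asfortranarray(A)        # one-time copy: columns contiguous

m0 = Af.mean(axis=0)             # walks contiguous memory per column
print(np.allclose(m0, A.mean(axis=0)))  # True: same values, new layout
```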
```python
import numpy as np
import timeit

A = np.random.rand(5_000, 512)
B = np.random.rand(5_000, 512)

def add_temporary(A, B):
    return (A * 1.1) + (B * 0.9)   # three temporary arrays

def add_inplace(A, B):
    out = np.empty_like(A)
    np.multiply(A, 1.1, out=out)
    np.add(out, B * 0.9, out=out)  # note: B * 0.9 still allocates one temporary
    return out

print('temp   :', timeit.timeit(lambda: add_temporary(A, B), number=5))
print('inplace:', timeit.timeit(lambda: add_inplace(A, B), number=5))
```
For arrays too large for RAM, use `np.memmap` or process in blocks. When a downstream kernel requires contiguous data, convert once with `np.ascontiguousarray`. To cut temporaries, pass `out=` to ufuncs.
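A minimal sketch of blockwise processing over a `np.memmap`; the scratch-file path and block size are arbitrary choices for illustration:

```python
import numpy as np
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'big.dat')  # hypothetical scratch file
N = 10_000_000

# Create a disk-backed array instead of holding it all in RAM
mm = np.memmap(path, dtype=np.float32, mode='w+', shape=(N,))
mm[:] = 1.0
mm.flush()

# Reduce block by block so only one block is resident at a time
ro = np.memmap(path, dtype=np.float32, mode='r', shape=(N,))
block = 1_000_000
total = 0.0
for start in range(0, N, block):
    total += float(ro[start:start + block].sum(dtype=np.float64))
print(total)  # 10000000.0
```

The OS pages data in and out as each slice is touched, so peak memory stays near one block rather than the full array.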
```python
import numpy as np
import timeit

# 1) Benchmark loop vs vector for squaring 2M floats
x = np.random.rand(2_000_000)

def loop_sq(x):
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v * v
    return out

print('%timeit-like (loop):', timeit.timeit(lambda: loop_sq(x), number=1))
print('%timeit-like (vec) :', timeit.timeit(lambda: x * x, number=1))

# 2) Show the effect of contiguity on a transpose before a heavy matmul
A = np.random.rand(2000, 512)
AT = A.T
ATc = np.ascontiguousarray(AT)
print(timeit.timeit(lambda: AT @ A, number=1),
      timeit.timeit(lambda: ATc @ A, number=1))

# 3) Rewrite this expression to use ufunc out= and fewer temporaries
B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)
# target: (B - B.mean(0)) / (B.std(0) + 1e-6) + 0.25*C
```