Performance Optimization in NumPy

1) Profile first: %timeit and timeit

Measure before optimizing. In notebooks, use %timeit; in scripts, use timeit.

import numpy as np, timeit

# Notebook: %timeit (example comments)
# %timeit np.arange(10_000_000)  # fast construction

x = np.random.rand(1_000_000)
def loop_square(x):
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v*v
    return out

print('loop  :', timeit.timeit(lambda: loop_square(x), number=3))
print('vector:', timeit.timeit(lambda: x*x, number=3))
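Outside a notebook, timeit.repeat gives steadier numbers than a single timing; taking the minimum of several trials filters out scheduler noise. A small sketch:

```python
import numpy as np
import timeit

x = np.random.rand(100_000)

# Run 5 independent trials of 10 calls each; keep the best (least noisy) one
trials = timeit.repeat(lambda: x * x, repeat=5, number=10)
best = min(trials) / 10  # seconds per call
print(f'best per-call time: {best:.2e} s')
```

The minimum is the conventional summary here: slower runs reflect interference from the OS, not the code under test.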

2) Vectorization beats Python loops

Prefer ufuncs and broadcasting to operate on whole arrays at once.

X = np.random.randn(2_000, 512)
w = np.random.randn(512)
# Vectorized dot per row
y = X @ w
# Equivalent but slower (Python loop)
# y = np.array([(row * w).sum() for row in X])
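Branchy element-wise logic vectorizes too: np.where evaluates the condition over the whole array in one pass, replacing a Python-level if inside a loop. A small sketch:

```python
import numpy as np

v = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Loop version: clip negatives to zero, square the rest
out_loop = np.array([0.0 if e < 0 else e * e for e in v])

# Vectorized: one pass, no Python-level branching
out_vec = np.where(v < 0, 0.0, v * v)

print(np.allclose(out_loop, out_vec))  # the two agree
```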

3) Memory layout, contiguity & strides

Contiguous arrays (C- or F-order) speed up many kernels. After transposes/slicing, arrays can become non-contiguous and trigger hidden copies.

A = np.arange(12).reshape(3,4)
AT = A.T
print(A.flags['C_CONTIGUOUS'], AT.flags['C_CONTIGUOUS'])  # True, False

# Ensure contiguity when required by downstream libraries
ATc = np.ascontiguousarray(AT)

4) Avoid temporary arrays: in-place ops & ufunc out=

Intermediate temporaries cost memory and time. Use in-place operators and ufunc out.

x = np.ones(5)
y = np.arange(5, dtype=float)

# Bad: creates temporaries
z = (x + y) * 2 - y

# Better: reuse buffers with out=
tmp = np.empty_like(x)
np.add(x, y, out=tmp)     # tmp = x + y
np.multiply(tmp, 2, out=tmp)
z2 = np.subtract(tmp, y)  # allocates z2 once

# In-place when safe (watch aliasing!)
y += 2
y *= 0.5
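When input and output buffers might overlap, np.shares_memory can flag the aliasing up front. Modern NumPy (1.13+) ufuncs detect overlap and buffer internally, so results stay correct, but the explicit check documents intent. A small sketch:

```python
import numpy as np

a = np.arange(6, dtype=float)
head, tail = a[:-1], a[1:]           # overlapping views of one buffer

print(np.shares_memory(head, tail))  # True: an in-place op here aliases

# Safe in NumPy >= 1.13: overlap is detected and buffered internally
np.add(head, tail, out=tail)         # tail[i] = old a[i] + old a[i+1]
print(a)
```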

5) Broadcasting vs tile/repeat

Broadcasting works on virtual views (via zero strides) and avoids large copies. tile/repeat materialize enlarged arrays in memory; use them only when an API requires an explicit shape.

B = np.arange(12).reshape(3,4)
col_bias = np.array([10,20,30])[:, None]   # (3,1)
# Preferred:
C = B + col_bias     # broadcast (3,4)
# Avoid:
C2 = B + np.tile(col_bias, (1,4))  # makes a (3,4) copy first
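Before relying on a broadcast, np.broadcast_shapes (NumPy 1.20+) predicts the result shape without allocating anything, which helps catch accidental blow-ups:

```python
import numpy as np

# Predict the broadcast result shape without touching memory
shape = np.broadcast_shapes((3, 1), (3, 4))
print(shape)  # (3, 4)

# Two long, misaligned vectors broadcast to an enormous matrix:
big = np.broadcast_shapes((1_000_000, 1), (1, 1_000_000))
print(big)    # (1000000, 1000000): ~8 TB in float64, check before computing!
```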

6) Choose the right dtype

Smaller dtypes save memory and can improve cache locality. But avoid precision loss when it matters (e.g., summations).

N = 10_000_000
a32 = np.ones(N, dtype=np.float32)
a64 = np.ones(N, dtype=np.float64)
print(a32.nbytes/1e6, 'MB', a64.nbytes/1e6, 'MB')  # ~40MB vs ~80MB

# Accumulations: promote to float64 if needed
s = a32.sum(dtype=np.float64)  # accumulate in float64 without copying the array

7) Preallocation instead of growing containers

Repeated appends cause many reallocations. Preallocate or compute vectorized.

N = 1_000_000

# Bad: list append then np.array
lst = []
for i in range(N): lst.append(i*i)
arr_bad = np.array(lst)

# Better: preallocate
arr = np.empty(N, dtype=np.int64)
for i in range(N): arr[i] = i*i

# Best: vectorized
arr_vec = np.arange(N, dtype=np.int64)**2
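When values must come from a Python generator and no vectorized formula exists, np.fromiter streams them into a preallocated array without building a list first. A sketch:

```python
import numpy as np

N = 1_000

# Stream values straight into a typed array; no intermediate Python list
arr_iter = np.fromiter((i * i for i in range(N)), dtype=np.int64, count=N)

# Same result as the vectorized form
print(np.array_equal(arr_iter, np.arange(N, dtype=np.int64) ** 2))  # True
```

Passing count lets NumPy allocate the output once instead of growing it.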

8) Reduce precision & copy cost in pipelines

Convert at the edges, not at every step. Keep arrays contiguous where heavy math runs.

X = np.random.rand(1_000, 1_000)   # already float64; no redundant astype copy
# heavy math in float64; at the end compress to float32 for storage
Y = np.tanh(X @ X.T)
Y32 = Y.astype(np.float32)   # single cast at pipeline end

9) Cache-friendly axis choices

Operations along the last axis (C-order) can be faster due to memory access patterns.

A = np.random.rand(2_000, 512)
# Mean across axis=1 visits contiguous memory per row
m1 = A.mean(axis=1)
# If your layout differs, consider transposing once then compute
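A quick probe of the claim; actual numbers vary with array shape, cache size, and NumPy version, so treat this as a measurement tool rather than a rule:

```python
import numpy as np
import timeit

A = np.random.rand(2_000, 512)   # C-order: each row is contiguous

t_rows = timeit.timeit(lambda: A.mean(axis=1), number=50)  # contiguous access
t_cols = timeit.timeit(lambda: A.mean(axis=0), number=50)  # strided access
print(f'axis=1: {t_rows:.4f}s  axis=0: {t_cols:.4f}s')
```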

10) Micro-benchmarks: measuring memory & speed

import numpy as np, timeit

A = np.random.rand(5_000, 512)
B = np.random.rand(5_000, 512)

def add_temporary(A, B):
    return (A * 1.1) + (B * 0.9)

def add_inplace(A, B):
    out = np.empty_like(A)
    np.multiply(A, 1.1, out=out)   # out = A * 1.1
    scratch = np.multiply(B, 0.9)  # one temporary instead of two
    np.add(out, scratch, out=out)  # out += B * 0.9
    return out

print('temp  :', timeit.timeit(lambda: add_temporary(A,B), number=5))
print('inplace:', timeit.timeit(lambda: add_inplace(A,B), number=5))

11) When NumPy is not enough

  • Highly branching logic: consider Numba/Cython.
  • Very large arrays with memory constraints: chunk with memmap or process blocks.
  • Linear algebra heavy-lifting: ensure an optimized BLAS is linked (np.show_config() shows which BLAS/LAPACK build NumPy uses).
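For the memory-constrained case, a minimal sketch of block processing with np.memmap (the file name and chunk size here are arbitrary choices): the data stays on disk and only one chunk is resident at a time.

```python
import numpy as np
import tempfile, os

# A disk-backed array; here a temp file stands in for a real dataset
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
mm = np.memmap(path, dtype=np.float64, mode='w+', shape=(1_000_000,))
mm[:] = 1.0
mm.flush()

# Reduce in fixed-size blocks so peak memory stays at one chunk
chunk = 100_000
total = 0.0
ro = np.memmap(path, dtype=np.float64, mode='r', shape=(1_000_000,))
for start in range(0, ro.shape[0], chunk):
    total += float(ro[start:start + chunk].sum())

print(total)  # 1000000.0
```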

Common pitfalls & quick fixes

  • “Why is this slow?” Array is non-contiguous → np.ascontiguousarray.
  • High memory use: Chained expressions create temporaries → use in-place ops and out=.
  • Unexpected copy: Mixing dtypes forces upcasts → align dtypes first.
  • Broadcast explosion: Estimate output shape; avoid materializing huge results.

Practice: quick exercises

import numpy as np, timeit

# 1) Benchmark loop vs vector for squaring 2M floats
x = np.random.rand(2_000_000)
def loop_sq(x):
    out = np.empty_like(x)
    for i,v in enumerate(x): out[i] = v*v
    return out
print('%timeit-like (loop):', timeit.timeit(lambda: loop_sq(x), number=1))
print('%timeit-like (vec) :', timeit.timeit(lambda: x*x, number=1))

# 2) Show effect of contiguity on a transpose before heavy matmul
A = np.random.rand(2000, 512)
AT = A.T
ATc = np.ascontiguousarray(AT)
print(timeit.timeit(lambda: AT @ A, number=1),
      timeit.timeit(lambda: ATc @ A, number=1))

# 3) Rewrite an expression to use ufunc out= and fewer temporaries
B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)
# target: (B - B.mean(0)) / (B.std(0) + 1e-6) + 0.25*C
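One possible solution sketch for exercise 3: reuse a single work buffer with ufunc out=; the 1e-6 guard keeps the division stable when a column is constant. One temporary (for 0.25*C) remains.

```python
import numpy as np

B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)

# Reference expression (allocates several temporaries):
ref = (B - B.mean(0)) / (B.std(0) + 1e-6) + 0.25 * C

# Fewer temporaries: one reused (1000, 1000) buffer plus small (1000,) vectors
out = np.empty_like(B)
np.subtract(B, B.mean(0), out=out)   # out = B - column means
denom = B.std(0) + 1e-6              # small (1000,) temporary only
np.divide(out, denom, out=out)       # out /= column stds
np.add(out, 0.25 * C, out=out)       # one remaining (1000, 1000) temporary

print(np.allclose(ref, out))  # True
```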
Subhendu Mohapatra — author at plus2net
©2000-2025 plus2net.com