Performance Optimization in NumPy

1) Profile first: %timeit and timeit

Measure before optimizing. In notebooks, use %timeit; in scripts, use timeit.

import numpy as np, timeit

# Notebook: %timeit (example comments)
# %timeit np.arange(10_000_000)  # fast construction

x = np.random.rand(1_000_000)
def loop_square(x):
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v*v
    return out

print('loop  :', timeit.timeit(lambda: loop_square(x), number=3))
print('vector:', timeit.timeit(lambda: x*x, number=3))
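Outside a notebook, timeit.repeat gives steadier numbers than a single timing; taking the minimum of several trials filters out scheduler noise. A small sketch:

```python
import numpy as np
import timeit

x = np.random.rand(100_000)

# Run 5 independent trials of 10 calls each; keep the best (least noisy) one
trials = timeit.repeat(lambda: x * x, repeat=5, number=10)
best = min(trials) / 10  # seconds per call
print(f'best per-call time: {best:.2e} s')
```

The minimum is the conventional summary here: slower runs reflect interference from the OS, not the code under test.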

2) Vectorization beats Python loops

Prefer ufuncs and broadcasting to operate on whole arrays at once.

X = np.random.randn(2_000, 512)
w = np.random.randn(512)
# Vectorized dot per row
y = X @ w
# Equivalent but slower (Python loop)
# y = np.array([(row * w).sum() for row in X])
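Branchy element-wise logic vectorizes too: np.where evaluates the condition over the whole array in one pass, replacing a Python-level if inside a loop. A small sketch:

```python
import numpy as np

v = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Loop version: clip negatives to zero, square the rest
out_loop = np.array([0.0 if e < 0 else e * e for e in v])

# Vectorized: one pass, no Python-level branching
out_vec = np.where(v < 0, 0.0, v * v)

print(np.allclose(out_loop, out_vec))  # the two agree
```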

3) Memory layout, contiguity & strides

Contiguous arrays (C- or F-order) speed up many kernels. After transposes/slicing, arrays can become non-contiguous and trigger hidden copies.

A = np.arange(12).reshape(3,4)
AT = A.T
print(A.flags['C_CONTIGUOUS'], AT.flags['C_CONTIGUOUS'])  # True, False

# Ensure contiguity when required by downstream libraries
ATc = np.ascontiguousarray(AT)

4) Avoid temporary arrays: in-place ops & ufunc out=

Intermediate temporaries cost memory and time. Use in-place operators and ufunc out.

x = np.ones(5)
y = np.arange(5, dtype=float)

# Bad: creates temporaries
z = (x + y) * 2 - y

# Better: reuse buffers with out=
tmp = np.empty_like(x)
np.add(x, y, out=tmp)     # tmp = x + y
np.multiply(tmp, 2, out=tmp)
z2 = np.subtract(tmp, y)  # allocates z2 once

# In-place when safe (watch aliasing!)
y += 2
y *= 0.5
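When input and output buffers might overlap, np.shares_memory can flag the aliasing up front. Modern NumPy (1.13+) ufuncs detect overlap and buffer internally, so results stay correct, but the explicit check documents intent. A small sketch:

```python
import numpy as np

a = np.arange(6, dtype=float)
head, tail = a[:-1], a[1:]           # overlapping views of one buffer

print(np.shares_memory(head, tail))  # True: an in-place op here aliases

# Safe in NumPy >= 1.13: overlap is detected and buffered internally
np.add(head, tail, out=tail)         # tail[i] = old a[i] + old a[i+1]
print(a)
```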

5) Broadcasting vs tile/repeat

Broadcasting works on virtual views (via zero strides) and avoids large copies. tile/repeat materialize enlarged arrays in memory; use them only when an API requires an explicit shape.

B = np.arange(12).reshape(3,4)
col_bias = np.array([10,20,30])[:, None]   # (3,1)
# Preferred:
C = B + col_bias     # broadcast (3,4)
# Avoid:
C2 = B + np.tile(col_bias, (1,4))  # makes a (3,4) copy first
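Before relying on a broadcast, np.broadcast_shapes (NumPy 1.20+) predicts the result shape without allocating anything, which helps catch accidental blow-ups:

```python
import numpy as np

# Predict the broadcast result shape without touching memory
shape = np.broadcast_shapes((3, 1), (3, 4))
print(shape)  # (3, 4)

# Two long, misaligned vectors broadcast to an enormous matrix:
big = np.broadcast_shapes((1_000_000, 1), (1, 1_000_000))
print(big)    # (1000000, 1000000): ~8 TB in float64, check before computing!
```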

6) Choose the right dtype

Smaller dtypes save memory and can improve cache locality. But avoid precision loss when it matters (e.g., summations).

N = 10_000_000
a32 = np.ones(N, dtype=np.float32)
a64 = np.ones(N, dtype=np.float64)
print(a32.nbytes/1e6, 'MB', a64.nbytes/1e6, 'MB')  # ~40MB vs ~80MB

# Accumulations: promote to float64 if needed
s = a32.sum(dtype=np.float64)  # accumulate in float64 without copying the array

7) Preallocation instead of growing containers

Repeated appends cause many reallocations. Preallocate or compute vectorized.

N = 1_000_000

# Bad: list append then np.array
lst = []
for i in range(N): lst.append(i*i)
arr_bad = np.array(lst)

# Better: preallocate
arr = np.empty(N, dtype=np.int64)
for i in range(N): arr[i] = i*i

# Best: vectorized
arr_vec = np.arange(N, dtype=np.int64)**2
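When values must come from a Python generator and no vectorized formula exists, np.fromiter streams them into a preallocated array without building a list first. A sketch:

```python
import numpy as np

N = 1_000

# Stream values straight into a typed array; no intermediate Python list
arr_iter = np.fromiter((i * i for i in range(N)), dtype=np.int64, count=N)

# Same result as the vectorized form
print(np.array_equal(arr_iter, np.arange(N, dtype=np.int64) ** 2))  # True
```

Passing count lets NumPy allocate the output once instead of growing it.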

8) Reduce precision & copy cost in pipelines

Convert at the edges, not at every step. Keep arrays contiguous where heavy math runs.

X = np.random.rand(1_000, 1_000)   # already float64; no redundant astype copy
# heavy math in float64; at the end compress to float32 for storage
Y = np.tanh(X @ X.T)
Y32 = Y.astype(np.float32)   # single cast at pipeline end

9) Cache-friendly axis choices

Operations along the last axis (C-order) can be faster due to memory access patterns.

A = np.random.rand(2_000, 512)
# Mean across axis=1 visits contiguous memory per row
m1 = A.mean(axis=1)
# If your layout differs, consider transposing once then compute
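A quick probe of the claim; actual numbers vary with array shape, cache size, and NumPy version, so treat this as a measurement tool rather than a rule:

```python
import numpy as np
import timeit

A = np.random.rand(2_000, 512)   # C-order: each row is contiguous

t_rows = timeit.timeit(lambda: A.mean(axis=1), number=50)  # contiguous access
t_cols = timeit.timeit(lambda: A.mean(axis=0), number=50)  # strided access
print(f'axis=1: {t_rows:.4f}s  axis=0: {t_cols:.4f}s')
```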

10) Micro-benchmarks: measuring memory & speed

import numpy as np, timeit

A = np.random.rand(5_000, 512)
B = np.random.rand(5_000, 512)

def add_temporary(A, B):
    return (A * 1.1) + (B * 0.9)

def add_inplace(A, B):
    out = np.empty_like(A)
    np.multiply(A, 1.1, out=out)   # out = A * 1.1
    scratch = np.multiply(B, 0.9)  # one temporary instead of two
    np.add(out, scratch, out=out)  # out += B * 0.9
    return out

print('temp  :', timeit.timeit(lambda: add_temporary(A,B), number=5))
print('inplace:', timeit.timeit(lambda: add_inplace(A,B), number=5))

11) When NumPy is not enough

  • Highly branching logic: consider Numba/Cython.
  • Very large arrays with memory constraints: chunk with memmap or process blocks.
  • Linear algebra heavy-lifting: ensure an optimized BLAS is linked (np.show_config() shows which BLAS/LAPACK build NumPy uses).
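For the memory-constrained case, a minimal sketch of block processing with np.memmap (the file name and chunk size here are arbitrary choices): the data stays on disk and only one chunk is resident at a time.

```python
import numpy as np
import tempfile, os

# A disk-backed array; here a temp file stands in for a real dataset
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
mm = np.memmap(path, dtype=np.float64, mode='w+', shape=(1_000_000,))
mm[:] = 1.0
mm.flush()

# Reduce in fixed-size blocks so peak memory stays at one chunk
chunk = 100_000
total = 0.0
ro = np.memmap(path, dtype=np.float64, mode='r', shape=(1_000_000,))
for start in range(0, ro.shape[0], chunk):
    total += float(ro[start:start + chunk].sum())

print(total)  # 1000000.0
```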

Common pitfalls & quick fixes

  • “Why is this slow?” Array is non-contiguous → np.ascontiguousarray.
  • High memory use: Chained expressions create temporaries → use in-place ops and out=.
  • Unexpected copy: Mixing dtypes forces upcasts → align dtypes first.
  • Broadcast explosion: Estimate output shape; avoid materializing huge results.

Practice: quick exercises

import numpy as np, timeit

# 1) Benchmark loop vs vector for squaring 2M floats
x = np.random.rand(2_000_000)
def loop_sq(x):
    out = np.empty_like(x)
    for i,v in enumerate(x): out[i] = v*v
    return out
print('%timeit-like (loop):', timeit.timeit(lambda: loop_sq(x), number=1))
print('%timeit-like (vec) :', timeit.timeit(lambda: x*x, number=1))

# 2) Show effect of contiguity on a transpose before heavy matmul
A = np.random.rand(2000, 512)
AT = A.T
ATc = np.ascontiguousarray(AT)
print(timeit.timeit(lambda: AT @ A, number=1),
      timeit.timeit(lambda: ATc @ A, number=1))

# 3) Rewrite an expression to use ufunc out= and fewer temporaries
B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)
# target: (B - B.mean(0)) / (B.std(0) + 1e-6) + 0.25*C
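One possible solution sketch for exercise 3: reuse a single work buffer with ufunc out=; the 1e-6 guard keeps the division stable when a column is constant. One temporary (for 0.25*C) remains.

```python
import numpy as np

B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)

# Reference expression (allocates several temporaries):
ref = (B - B.mean(0)) / (B.std(0) + 1e-6) + 0.25 * C

# Fewer temporaries: one reused (1000, 1000) buffer plus small (1000,) vectors
out = np.empty_like(B)
np.subtract(B, B.mean(0), out=out)   # out = B - column means
denom = B.std(0) + 1e-6              # small (1000,) temporary only
np.divide(out, denom, out=out)       # out /= column stds
np.add(out, 0.25 * C, out=out)       # one remaining (1000, 1000) temporary

print(np.allclose(ref, out))  # True
```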
Subhendu Mohapatra — author at plus2net
©2000-2025 plus2net.com