Efficient Matlab Programs
shared by John Stutz, updated on Sep 09, 2010
Summary
Matlab has a reputation for running slowly. Here are some pointers on how to speed computations, to an often unexpected degree. Subjects currently covered:
Matrix Coding
Implicit Multithreading on a Multicore Machine
Sparse Matrices
SubBlock Computation to Avoid Memory Overflow
Matrix Coding  1
Matlab documentation notes that efficient computation depends on using the matrix facilities, and that mathematically identical algorithms can have very different runtimes, but they are a bit coy about just what these differences are. A simple but telling example:
The following is the core of the GDCLS algorithm of Berry et.al., copied from fig. 1 of Shahnaz et.al, 2006, "Document clustering using nonnegative matrix factorization':
for jj = 1:maxiter
A = W'*W + lambda*eye(k);
for ii = 1:n
b = W'*V(:,ii);
H(:,ii) = A \ b;
end
H = H .* (H>0);
W = W .* (V*H') ./ (W*(H*H') + 1e9);
end
Replacing the columwise update of H with a matrix update gives:
for jj = 1:maxiter
A = W'*W + lambda*eye(k);
B = W'*V;
H = A \ B;
H = H .* (H>0);
W = W .* (V*H') ./ (W*(H*H') + 1e9);
end
These were tested on an 8049 x 8660 sparse matrix bag of words V (.0083 nonzeros), with W of size 8049 x 50, H 50 x 8660, maxiter = 50, lambda = 0.1, and identical initial W. They were run consecutivly, multithreaded on an 8processor Sun server, starting at ~7:30PM. Tictoc timing was recorded.
Runtimes were respectivly 6586.2 and 70.5 seconds, a 93:1 difference. The maximum absolute pairwise difference between W matrix values was 6.6e14.
Similar speedups have been consistantly observed in other cases. In one algorithm, combining matrix operations with efficient use of the sparse matrix facilities gave a 3600:1 speedup.
For speed alone, Cstyle iterative programming should be avoided wherever possible. In addition, when a couple lines of matrix code can substitute for an entire Cstyle function, program clarity is much improved.
Matrix Coding  2
Applied to integration, the speed gains are not so great, largely due to the time taken to set up the and deal with the boundaries. The anyomous function setup time is neglegable. I demonstrate on a simple uniform step linearly interpolated 1D integration of cos() from 0 to pi, which should yield zero:
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit  start)/step)step + start;
delta = (endit  enda)/step;
intF = fun(start)/2;
intF = intF + fun(endit)delta/2;
intF = intF + fun(enda)(delta+1)/2;
for ii = start+step:step:endastep
intF = intF + fun(ii);
end
intF = intFstep
toc;
intF = 2.910164109692914e14
Elapsed time is 4.091038 seconds.
Replacing the inner summation loop with the matrix equivalent speeds things up a bit:
tic;
step = .00001;
fun = @cos;
start = 0;
endit = pi;
enda = floor((endit  start)/step)*step + start;
delta = (endit  enda)/step;
intF = fun(start)/2;
intF = intF + fun(endit)*delta/2;
intF = intF + fun(enda)*(delta+1)/2;
intF = intF + sum(fun(start+step:step:endastep));
intF = intF*step
toc;
intF = 2.868419946011613e14
Elapsed time is 0.141564 seconds.
The core computation takes about 70% of the full time used above, but without the boundary handling gives a potentially significant error:
tic; sum(fun(start:step:endit))*step, toc;
ans = 2.653589763815900e06
Elapsed time is 0.099088 seconds. 
Implicit Multithreading on a Multicore Machine
Select File > Preferences > General > Multithreading. and enable multithreading. Enjoy.
See Multiprocessing in MATLAB on the help system for documentation.
Sparse Matrices
In Matlab, the sparse matrix facilities are activated by simply making a matrix sparse,as in Xs = sparse(X), and using it. Most of the matrix operations are able to make effecient use of truely sparse matrices, with considerable relief of memory demands. A simple example follows:
load ... /home/stutz/DaSH/IDU/ASAP/Runs20080303_gdcls/gdcls_dataset.mat
X = Data0.X_T; % a terms*docs bag of word counts
nnz(X) / numel(X)
% ans = 0.0044
C = Data0.Cats; % a 0/1 valued nonexclusive category matrix
nnz(C) / numel(C)
% ans = 0.0329
The memory saved by using sparse matrices is not trivial:
Xf = full(X);
Cf = full(C);
Ct = C';
Ctf = Cf';
Xt = X';
Xtf = Xf';
whos X Xt Xf Xtf C Ct Cf Ctf
% Name Size Bytes Class Attributes
%
% C 11245x33 195536 double sparse
% Cf 11245x33 2968680 double
% Ct 33x11245 285232 double sparse
% Ctf 33x11245 2968680 double
% X 4686x11245 11649776 double sparse
% Xf 14686x11245 1321152560 double
% Xt 11245x14686 11677304 double sparse
% Xtf 11245x14686 1321152560 double
The disparity between C and Ct byte counts are due to the assymetry. Matlab's sparse representation requires a column index for nonempty columns plus a row index for each nonzero element.
The matrix multiplication timings vary significantly, but not quite as one might expect, due to the great disparity in matrix sizes:
tic; XC; toc; % sparsesparse
% Elapsed time is 0.100225, 0.108354, 0.115247, 0.102686 seconds.
tic; CtXt; toc; % sparsesparse
% Elapsed time is 0.153515, 0.146475, 0.145449, 0.156352 seconds.
tic; CtfXt; toc; % fullsparse
% Elapsed 0.404232, 0.421417, 0.424837, 0.404256
tic; XCf; toc; % sparsefull
% Elapsed 0.762932, 1.148991, 0.727793, 0.717451
tic; CtfXtf; toc; % fullfull
% Elapsed 2.984380, 3.164782, 3.050478, 3.253079
tic; XfCf; toc; % fullfull
% Elapsed 3.423657, 3.503390, 3.854899, 3.555406
tic; XfC; toc; % fullsparse
% Elapsed 7.204459, 7.193888, 8.999477, 9.477529
tic; CtXtf; toc; % sparsefull
% Elapsed 9.655192, 7.155294, 7.155073, 7.496134
The above was monitored with UNIX top, while running basic Matlab in multithreading mode on an 8processor server. These calculations involved up to 16% of CPU, say 1.3 processors, but typically 12.5% or 1 processor, dropping to under 0.04% when idle.
Matrix normalization time differences are even greater, depite the advantage of multithreading The sparse matrix normalizations below ran with no more than 12.5% CPU. The following full matrix calculations quickly ramped up and held at 9091.8% of CPU, say 7.3 processors, with only one significant (>0.2% CPU) competing process present.
tic; X / spdiags(sum( X.^2)', 0, 11245, 11245); toc;
% sparse/sparse(sparse)
% Elapsed32.781410, 30.889889, 30.978490, 32.140246,
ic; X / diag(sum(X.^2)); toc; % sparse/full(sparse)
% Elapsed 30.714821, 30.973805, 30.398733, 30.443698,
tic; Xf / diag(sum(Xf.^2)); toc; % full/full(full)
Killed Matlab (not responding) after 22 hours weekend clock time.
tic; Xf / spdiags(sum(Xf.^2)', 0, 11245, 11245); toc; %full/sparse(full)
Killed Matlab (not responding) after 42 hours weekend clock time.
SubBlock Computation to Avoid Memory Overflow
The downside of matrix computation is that intermediate values may be so large that the total memory allocation exceeds what is available. Using sparse matrices helps avoid this, but not all of our matrices are sparse. If a calculation can be done on a column by colume basis, or line by line, or page by page..., we can usually break it into blocks of computable size while retaining most of speed advantage of matrix calculation. Consider the following schematic:
function [...] = whatever(.....)
try
<the basic matrix calculation>
return; % straight matrix result
catch
<set a large initial blocksize>
try
<set up accumulator(s)>
while blocksize >= 1
try
<zero accumulator(s)>
<set up subblocks and bookkeeping>
for <loop over subblocks>
<do calculation on a subblock)
<add result to accumulator>
end
return % accumulated result
catch
blocksize = floor(blocksize / 2);
end % inner try
end % while
error('unable to complete whatever at blocksize == 1');
catch % middle try
error('unable to allocate accumulator(s)');
end % middle try
end % outer try
The initially tried matrix calculation both lets us deal effeciently with small matrices and provides good documentation for what we do in the more complex blocked calculation. The intermediate try allows a graceful exit when we cannot even build an accumulator, and is probably not needed when our result is of smaller dimension than the inputs. The inner try/while quickly runs our blocksize down to something that that works, so long as the initial block processed is one of the largest.
Files
There are currently no files associated to this item
Discussions
John's Projects (2)


Text Mining  Classification ...
2 members
Need help?
Visit our help center