Peak finding in Raman spectroscopy

| categories: data analysis | View Comments



Peak finding in Raman spectroscopy

John Kitchin

Raman spectroscopy is a vibrational spectroscopy. The data typically comes as intensity vs. wavenumber, and it is discrete. Sometimes it is necessary to identify the precise location of a peak. In this post, we will use spline smoothing to construct an interpolating function of the data, and then use fminbnd to identify peak positions.

clear all; close all

Read and plot the raw data

[w,i] = textread('raman.txt');

xlabel('Raman shift (cm^{-1})')
ylabel('Intensity (counts)')

Narrow to the region of interest

We will focus on the peaks between 1340 and 1360 cm^-1.

ind = w > 1340 & w < 1360;
w1 = w(ind);
i1 = i(ind);
plot(w1, i1,'b. ')
xlabel('Raman shift (cm^{-1})')
ylabel('Intensity (counts)')

Generate the spline

The last parameter in csaps determines the amount of smoothing. Closer to 1.0 is less smoothing.

pp = csaps(w1, i1,.99)
hold all
plot(w1, ppval(pp,w1))
pp = 

      form: 'pp'
    breaks: [1x51 double]
     coefs: [50x4 double]
    pieces: 50
     order: 4
       dim: 1

Find all the zeros in the first derivative

We will brute force this by looking for sign changes in the derivative vector evaluated at all the points of the original data

d1 = fnder(pp, 1); % first derivative
s = ppval(d1, w1);
s(s >= 0) = 1;
s(s < 0) = 0;

% where diff = -1 indicates a transition from positive slope to
% negative slope, which is a maximum. These are approximate locations
% of the zeros
z = find(diff(s) == -1);

Make an interpolating function to find the minima

to use fminbnd we need a function. we make a function handle that is the negative of the main function, so that the peak locations are located at minima. interpolating function handle

f = @(xx) -ppval(pp,xx);

% now loop through each value found in z to find the minimum over the
% interval around each zero.
x = zeros(size(z));
for i=1:length(x)
    lowerbound = w1(z(i)-5);
    upperbound = w1(z(i)+5);
    x(i) = fminbnd(f,lowerbound,upperbound);

Plot the derivatives and peaks found

hold all
plot(w1, ppval(d1,w1))
plot(w1(z), ppval(d1, x),'ro ')

xlabel('Raman shift (cm^{-1})')
legend('1^{st} derivative', 'zeros')

Label the peaks in the original data

plot(x, ppval(pp,x), 'ro','markerfacecolor','r')

% print some offset labels
for i=1:length(x)
    sprintf('%1.1f cm^{-1}',x(i))
ans =

1346.5 cm^{-1}

ans =

1348.1 cm^{-1}


In the end, we have illustrated how to construct a spline smoothing interpolation function and to find maxima in the function, including generating some initial guesses. There is more art to this than you might like, since you have to judge how much smoothing is enough or too much. With too much, you may smooth peaks out. With too little, noise may be mistaken for peaks.


% categories: data analysis
% post_id = 2007; %delete this line to force new post;
% permaLink =;
Read and Post Comments

curve fitting to get overlapping peak areas

| categories: data analysis | View Comments



curve fitting to get overlapping peak areas

John Kitchin

Today we examine an approach to fitting curves to overlapping peaks to deconvolute them so we can estimate the area under each curve. You will need to download gc-data-2.txt for this example. This file contains data from a gas chromatograph with two peaks that overlap. We want the area under each peak to estimate the gas composition. You will see how to read the text file in, parse it to get the data for plotting and analysis, and then how to fit it.

function main

read in the data file

The data file is all text, and we have to read it in, and find lines that match certain patterns to identify the regions that are data, then we read in the data lines.

clear all; close all

datafile = 'gc-data-2.txt';

fid = fopen(datafile);

first we get the number of data points, and read up to the data

i = 0;
while 1
    line = fgetl(fid);
    sm = strmatch('# of Points',line);
    if ~isempty(sm)
        regex = '# of Points(.*)';
        [match tokens] = regexp(line,regex,'match','tokens');
        npoints = str2num(tokens{1}{1});
    elseif strcmp(line,'R.Time	Intensity')
    i = i + 1

initialize the data vectors

t = zeros(1,npoints);
intensity = zeros(1,npoints);

now read in the data

for j=1:npoints
    line = fgetl(fid);
    [a] = textscan(line,'%f%d');
    t(j) = a{1};
    intensity(j) = a{2};


Plot the data

xlabel('Time (s)')
ylabel('Intensity (arb. units)')
xlim([4 6])

correct for non-zero baseline

intensity = intensity + 352;
xlabel('Time (s)')
ylabel('Intensity (arb. units)')
xlim([4 6])

a fitting function for one peak

The peaks are asymmetric, decaying gaussian functions.

    function f = asym_peak(pars,t)
        % from Anal. Chem. 1994, 66, 1294-1301
        a0 = pars(1);  % peak area
        a1 = pars(2);  % elution time
        a2 = pars(3);  % width of gaussian
        a3 = pars(4);  % exponential damping term
        f = a0/2/a3*exp(a2^2/2/a3^2 + (a1 - t)/a3)...
            .*(erf((t-a1)/(sqrt(2)*a2) - a2/sqrt(2)/a3) + 1);

a fitting function for two peaks

to get two peaks, we simply add two peaks together.

    function f = two_peaks(pars, t)
        a10 = pars(1);  % peak area
        a11 = pars(2);  % elution time
        a12 = pars(3);  % width of gaussian
        a13 = pars(4);  % exponential damping term
        a20 = pars(5);  % peak area
        a21 = pars(6);  % elution time
        a22 = pars(7);  % width of gaussian
        a23 = pars(8);  % exponential damping term

        p1 = asym_peak([a10 a11 a12 a13],t);
        p2 = asym_peak([a20 a21 a22 a23],t);

        f = p1 + p2;

Plot fitting function with an initial guess for each parameter

the fit is not very good.

hold all;
parguess = [1500,4.85,0.05,0.05,5000,5.1,0.05,0.1];
plot (t,two_peaks(parguess,t),'g-')
legend 'raw data' 'initial guess'

nonlinear fitting

now we use nonlinear fitting to get the parameters that best fit our data, and plot the fit on the graph.

pars = nlinfit(t,intensity, @two_peaks, parguess)
plot(t,two_peaks(pars, t),'r-')
legend 'raw data' 'initial guess' 'total fit'
pars =

   1.0e+03 *

    1.3052    0.0049    0.0001    0.0000    5.3162    0.0051    0.0000    0.0001

the fits are not perfect. The small peak is pretty good, but there is an unphysical tail on the larger peak, and a small mismatch at the peak. There is not much to do about that, it means the model peak we are using is not a good model for the peak. We will still integrate the areas though.

now extract out the two peaks and integrate the areas

pars1 = pars(1:4)
pars2 = pars(5:8)

peak1 = asym_peak(pars1, t);
peak2 = asym_peak(pars2, t);

legend 'raw data' 'initial guess' 'total fit' 'peak 1' 'peak 2';

area1 = trapz(t, peak1)
area2 = trapz(t, peak2)
pars1 =

   1.0e+03 *

    1.3052    0.0049    0.0001    0.0000

pars2 =

   1.0e+03 *

    5.3162    0.0051    0.0000    0.0001

area1 =


area2 =


Compute relative amounts

percent1 = area1/(area1 + area2)
percent2 = area2/(area1 + area2)
percent1 =


percent2 =


This sample was air, and the first peak is oxygen, and the second peak is nitrogen. we come pretty close to the actual composition of air, although it is low on the oxygen content. To do better, one would have to use a calibration curve.


% categories: data analysis

% post_id = 1994; %delete this line to force new post;
% permaLink =;
Read and Post Comments

Colors, 3D Plotting, and Data Manipulation

| categories: data analysis, plotting | View Comments



Colors, 3D Plotting, and Data Manipulation

Guest authors: Harrison Rose and Breanna Stillo

In this post, Harrison and Breanna present three-dimensional experimental data, and show how to plot the data, fit curves through the data, and plot surfaces. You will need to download mycmap.mat

function main
close all
clear all

In order to generate plots with a consistent and hopefully attractive color scheme, we will generate these color presets. Matlab normalizes colors to [R G B] arrays with values between 0 and 1. Since it is easier to find RGB values from 0 to 255, however, we will simply normalize them ourselves. A good colorpicker is

pcol = [255,0,0]/255; % red
lcol = [135,14,179]/255; % a pinkish color

Raw Data

This is raw data from an experiment for three trials (a and b and c). The X and Y are independent variables, and we measured Z.

X_a = [8.3 8.3 8.3 8.3 8.3 8.3 8.3];
X_b = [11 11 11 11 11 11 11];
X_c = [14 14 14 14 14 14 14];

Y_a = [34 59 64 39 35 36 49];
Y_b = [39 32 27 61 52 57 65];
Y_c = [63 33 38 50 54 68 22];

Z_a = [-3.59833 7.62 0 4.233333333 -2.54 -0.635 7.209];
Z_b = [16.51 10.16 6.77 5.08 15.24 13.7 3.048];
Z_c = [36 20 28 37 40 32 10];

Plotting the raw data for all three trials:

As you can see, the raw data does not look like very much, and it is pretty hard to interperet what it could mean.

We do see, however, that since X_a is all of one value, X_b is all of another value, and X_c is all of a third, that the data lies entirely on three separate planes.

hold on     % Use this to plot multiple series on a single figure
hold off    % Use this to make sure the next plot command will not
% try to plot on this figure as well.
title('Raw Experimental Data for Trials A, B, and C')
xlabel('x Data')
ylabel('y Data')
zlabel('z Data')
grid on     % 3D data is easier to visualize with the grid. Normally
% the grid defaults to 'on' but using the 'hold on'
% command as we did above causes the grid to default to
% 'off'

A note on the view:

The command


lets you view a 3D plot from the same viewpoint each time you run the code. To determine the best viewpoint to use, use the click the 'Rotate 3D' icon in the figure toolbar (it looks like a box with a counterclockwise arrow around it), and drag your plot around to view it from different angles. You will notice the text "Az: ## El: ##" appear in the lower left corner of the figure window. This stands for Azimuth and Elevation which represent a point in spherical coordinates from which to view the plot (the radius is fixed by the axes sizes). The command used here will always display the plot from azimuth = -39, and elevation = 10.


A closer look at the raw data:

hold on
title('Raw Data for Trial A, x = 8.3')
hold off

hold on
title('Raw Data for Trial B, x = 11')
hold off

hold on
title('Raw Data for Trial C, x = 14')
hold off

Fitting the raw data:

In this case, we expect the data to fit the shape of a binomial distribution, so we use the following fit function with three parameters:

    function v = mygauss(par, t)
        A  = par(1);
        mu = par(2);
        s  = par(3);


Fitting the data

res = 20;

% Dataset A
guesses=[20, 40, 20];
[pars residuals J]=nlinfit(Y_a, Z_a-min(Z_a), @mygauss, guesses);


Zfitfun_a=@(y) A1*exp(-(y-mu1).^2./(2*s1.^2))+min(Z_a);
% Note: We have to shift the dataset up to zero because a gaussian
% typically cannot go below the horizontal axis

% Generate a fit-line through the data


% Dataset B
guesses=[10, 25, 20];
[pars residuals J]=nlinfit(Y_b, Z_b, @mygauss, guesses);


Zfitfun_b=@(y) A2*exp(-(y-mu2).^2./(2*s2.^2));

% Generate a fit-line through the data

% Dataset c
guesses=[20, 60, 20];
[pars residuals J]=nlinfit(Y_c, Z_c, @mygauss, guesses);


Zfitfun_c=@(y) A3*exp(-(y-mu3).^2./(2*s3.^2));

% Generate a fit-line through the data

Plotting the fitted data:

% Trial A
hold on
title('Fitted Data for Trial A, x = 8.3')
hold off

% Trial B
hold on
title('Fitted Data for Trial B, x = 11')
plot(Yfit, Zfit_b,'Color',lcol,'LineWidth',2)
hold off

% Trial C
hold on
title('Fitted Data for Trial C, x = 14')
hold off

Generate a surface plot:

For every point along the fit-line for dataset A, connect it to the corresponding point along the fit-line for dataset B using a straight line. This linear interpolation is done automatically by the surf command. If more datasets were available, we could use a nonlinear fit to produce a more accurate surface plot, but for now, we will assume that the experiment is well-behaved, and that 11 and 8.3 are close enough together that we can use linear interpolation between them.

Interpolate along the fit lines to produce YZ points for each data series (X):

Fits through the fits

Yspace = linspace(25,65,res);
Xs = [8.3; 11; 14];
Xspan1 = linspace(8.3,14,res);
Xspan2 = linspace(8,14.3,res);
q = 0;
r = 0;
for i = 1:res
    Zs = [Zfitfun_a(Yspace(i)); Zfitfun_b(Yspace(i)); Zfitfun_c(Yspace(i))];
    Yspan = linspace(Yspace(i),Yspace(i),res);
    p(:,i) = polyfit(Xs,Zs,2);
    for j = 1:res
        Zfit_fit(i,j) = polyval(p(:,i),Xspan1(j));

Plot a surface through the fits through the fits

hold all
surf(Xspan1,Yspace,Zfit_fit,'EdgeColor','black');   % Generate the surface
%surf(Xsurf,Ysurf,Zsurf,'EdgeColor','black');   % Generate the surface

% plot leading and lagging lines (just for show!)
for i = 1:res
    Zs = [Zfitfun_a(Yspace(i)); Zfitfun_b(Yspace(i)); Zfitfun_c(Yspace(i))];
    Yspan = linspace(Yspace(i),Yspace(i),res);

%Plot the raw data:

%Plot the fit lines:
alpha(.2)   % Make the plot transparent
title('z vs. x vs. y')
grid on

Add a line through the max

We must minimize the negative of each of our initial three fits to find the maximum of the fit.

f1 = @(y) -Zfitfun_a(y);
f2 = @(y) -Zfitfun_b(y);
f3 = @(y) -Zfitfun_c(y);

Ystar_a = fminbnd(f1,0,100);
Ystar_b = fminbnd(f2,0,100);
Ystar_c = fminbnd(f3,0,100);

Ystars = [Ystar_a; Ystar_b; Ystar_c];
Zstars = [Zfitfun_a(Ystar_a); Zfitfun_b(Ystar_b); Zfitfun_c(Ystar_c)];

hold on

xy = polyfit(Xs,Ystars,2);
xz = polyfit(Xs,Zstars,2);


A note on colormaps

To edit the color of the surface, you can apply a colormap. Matlab has several built-in colormaps (to see them, type 'doc colormap' in the Command Window. As you see here, however, we are using our own colormap, which has been stored in the file 'mycmap.mat'

To modify a colormap, type


in the Command Window or click 'Edit' 'Colormap...' in the figure window. For instructions on how to use the colormap editor, type 'doc colormapeditor' in the Command Window. If you have 'Immediate apply' checked, or you click 'Apply' the colormap will load onto the figure. To save a colormap, type the following into the Command Window:

 mycmap = get(gcf,'Colormap')
s = load('mycmap.mat');
newmap = s.mycmap;
set(gcf,'Colormap',newmap)  % See corresponding 'Note on colormaps')


Matlab offers a lot of capability to analyze and present data in 3-dimensions.

% categories: plotting, data analysis
Read and Post Comments

are two averages different?

| categories: data analysis | View Comments

are two averages different?

are two averages different?

John Kitchin 1/28/2012

adapted from


Class A had 30 students who received an average test score of 78, with standard deviation of 10. Class B had 25 students an average test score of 85, with a standard deviation of 15. We want to know if the difference in these averages is statistically relevant. Note that we only have estimates of the true average and standard deviation for each class, and there is uncertainty in those estimates. As a result, we are unsure if the averages are really different. It could have just been luck that a few students in class B did better.

The hypothesis:

the true averages are the same. We need to perform a two-sample t-test of the hypothesis that $\mu_1 - \mu_2 = 0$ (this is often called the null hypothesis). we use a two-tailed test because we do not care if the difference is positive or negative, either way means the averages are not the same.

the data

n1 = 30; % students in class A
x1 = 78; % average grade in class A
s1 = 10; % std dev of exam grade in class A

n2 = 25; % students in class B
x2 = 85; % average grade in class B
s2 = 15; % std dev of exam grade in class B

Compute the standard error

this is the standard error of the difference between the two averages. Note the standard error of the difference is less than the standard error of the individual averages.

SE = sqrt(s1^2/n1 + s2^2/n2)
SE =


calculate degrees of freedom

see the discussion at for a more complex definition of degrees of freedom. Here we simply subtract one from each sample size to account for the estimation of the average of each sample.

DF = (n1 - 1) + (n2 - 1)
DF =


compute the t-score for our data

The difference between two averages determined from small sample numbers follows the t-distribution. the t-score is the difference between the difference of the means and the hypothesized difference of the means, normalized by the standard error. we compute the absolute value of the t-score to make sure it is positive for convenience later.

tscore = abs(((x1 - x2)-0)/SE)
tscore =



A way to approach determinining if the difference is significant or not is to ask, does our computed average fall within a confidence range of the hypothesized value (zero)? If it does, then we can attribute the difference to statistical variations at that confidence level. If it does not, we can say that statistical variations do not account for the difference at that confidence level, and hence the averages must be different.

Lets compute the t-value that corresponds to a 95% confidence level for a mean of zero with the degrees of freedom computed earlier. This means that 95% of the t-scores we expect to get will fall within $\pm$ t95.

ci = 0.95;
alpha = 1 - ci;
t95 = tinv(1-alpha/2, DF)
t95 =


since tscore < t95, we conclude that at the 95% confidence level we cannot say these averages are statistically different because our computed t-score falls in the expected range of deviations. Note that our t-score is very close to the 95% limit. Let's consider a smaller confidence interval.

ci = 0.94;
alpha = 1 - ci;
t94 = tinv(1-alpha/2, DF)
t94 =


at the 94% confidence level, however, tscore > t94, which means we can say with 94% confidence that the two averages are different; class B performed better than class A did. Alternatively, there is only about a 6% chance we are wrong about that statement.

another way to get there

An alternative way to get the confidence that the averages are different is to directly compute it from the cumulative t-distribution function. We compute the difference between all the t-values less than tscore and the t-values less than -tscore, which is the fraction of measurements that are between them. You can see here that we are practically 95% sure that the averages are different.

f = tcdf(tscore,DF)-tcdf(-tscore,DF)
f =


% categories: data analysis
ans =


Read and Post Comments

Using Java inside Matlab

| categories: data analysis, miscellaneous | View Comments

Using Java inside Matlab

Using Java inside Matlab

John Kitchin

Matlab is pretty good, but it doesn't do everything, and you might not always have the toolbox that contains the function you want. There might even be an excellent open-source software library that can't be included in Matlab, but can be used within Matlab.

Matlab has excellent support for using Java with minimal effort. A repository of numerical libraries can be found at In this example we install the commons-math Java library, and show an example of using it.


Here are a few things the commons libary can do:

  Computing means, variances and other summary statistics for a list of numbers
  Fitting a line to a set of data points using linear regression
  Finding a smooth curve that passes through a collection of points (interpolation)
  Fitting a parametric model to a set of measurements using least-squares methods
  Solving equations involving real-valued functions (i.e. root-finding)
  Solving systems of linear equations
  Solving Ordinary Differential Equations
  Minimizing multi-dimensional functions
  Generating random numbers with more restrictions (e.g distribution, range) than what is possible using the JDK
  Generating random samples and/or datasets that are "like" the data in an input file
  Performing statistical significance tests
  Miscellaneous mathematical functions such as factorials, binomial coefficients and "special functions" (e.g. gamma, beta functions)

Many of them duplicate features in Matlab, but they may implement different algorithms, or more convenient syntax.

clear all; clc; close all

download and install

These two commands only need to be run once, so comment them out after you run them.


We have to add the jar file to the javaclasspath so we can use it.

javaaddpath([pwd '/commons-math-2.2/commons-math-2.2.jar'])

linear regression with Matlab

Note that the regress command is part of the Statistics toolbox, which is great if you have it! We show here how to do it, so we also have something to compare the Java result to.

x = [1 2 3 4 5]';
y = [3 5 7 14 11]';

[b bint]= regress(y,[x.^0 x])
b =


bint =

   -7.5615    8.5615
    0.0694    4.9306

Now the Java way

Documentation Examples

The first thing we do is import the Java library. the * at the end indicates that all classes should be imported.

import org.apache.commons.math.stat.regression.*

Create a class instance.

The SimpleRegression class has several methods that we can call later to perform analysis

regression = SimpleRegression();
Methods for class org.apache.commons.math.stat.regression.SimpleRegression:

SimpleRegression            getSlope                    
addData                     getSlopeConfidenceInterval  
clear                       getSlopeStdErr              
equals                      getSumSquaredErrors         
getClass                    getTotalSumSquares          
getIntercept                hashCode                    
getInterceptStdErr          notify                      
getMeanSquareError          notifyAll                   
getN                        predict                     
getR                        toString                    
getRSquare                  wait                        

add the data

regression.addData([x y]);

get the slope and some statistics of the slope

m = regression.getSlope()
mstd = regression.getSlopeStdErr()
mint = regression.getSlopeConfidenceInterval(); % 95% by default
[m - mint  m + mint] % this is the 95% interval
m =


mstd =


ans =

    0.0694    4.9306

Note the confidence interval is greater than +- 2*sigma. This is because of the student-t table multiplier on the standard error due to the small number

get the intercept and some statistics

note there is not getInterceptConfidenceInterval method for some reason!

b = regression.getIntercept()
bstd = regression.getInterceptStdErr()

sprintf('R^2 = %1.3f', regression.getRSquare())
b =


bstd =


ans =

R^2 = 0.781


It takes more lines of code to get the Java version of linear regression, but you have finer control over what data you get, and what order you get it in. You still have to read the Java documentation to learn the command syntax, and debugging can be somewhat more difficult than straight Matlab. However, the integration of Java into Matlab opens some interesting possibilities to extend Matlab's functionality.

% categories: Miscellaneous, data analysis
Read and Post Comments

Next Page ยป