LOWESS - Locally Weighted Scatterplot Smoothing

This regression works on both linear and non-linear relationships between X and Y.

Modifications:

  12/19/2008 - added upper and lower LOWESS smooths.  These additional
  smooths show how the distribution of Y varies with X.  These smooths
  are simply LOWESS applied to the positive and negative residuals
  separately, then added to the original lowess of the data.  The same
  smoothing factor is applied to both the upper and lower limits.
  2/21/2009 - added sorting to the function; data no longer need to be
  sorted.  Also added a routine such that if a user supplies a second
  dataset, linear interpolations are done on the lowess and used to
  predict y-values for the supplied x-values.
  10/22/2009 - modified the handling of the second user-provided X-data
  for obtaining predictions.  The Matlab function unique sorts by default,
  and it really was not needed in the section of code that performs linear
  interpolations of the x-data using the y-predicted LOWESS results.  If
  the user does not supply a second x-data set, the function uses the
  supplied x-y data set.  Thus there is an output (xy) that maintains the
  original sequence of the input.  Additionally, the user can now include
  a sequence index as the first column of input data.  This can be a
  datenum or some other ordering index.  The output will be sequenced
  using that index.  If a sequence index is provided, a second subplot
  will be created to show the predicted Y-values in the order of the
  included sequence index.  I suspect this sequence index most often will
  be a DateTime (i.e. datenum).  To keep the function generic, the X-axis
  labels are not converted to a nice date format, but the user could
  easily change that with a call to datetick in the subplot.
  11/3/2009 - modified charting to include upper and lower line plots on
  the simulated subplot (i.e. the lower plot).
  6/15/2012 - oddly, when using this routine on data without a time
  sequence (i.e. without a third column), the plotting portions caused an
  error.  Not sure how I would have missed that but...I think I have it
  fixed.  An example dataset (skl1a.mat) is also included in the ZIP file
  for convenience, or you can use the segment below as a test that has no
  time sequence in the data.

Create example dataset: (uncomment and paste in the command window)

  x = 0:0.1:10;
  noise = randn(1, length(x));  % standard normal noise; randn needs no toolbox
  y = 10*sin(x) + noise;
  datain = [x;y]';
  f = 0.15;
  wantplot = 1;
  xdata = x';
  [dataout lowerLimit upperLimit xy] = lowess(datain,f,0);
  % A second plot to illustrate what values would be used, and how it
  % compares to the original function used to generate the data
  plot(dataout(:,1), dataout(:,2), '.blue',dataout(:,1),dataout(:,3),...
     '-black', x, 10*sin(x), '--red')
  grid on

Description

Using a robust regression like LOWESS makes it possible to detect a trend in data whose variance would otherwise be too high, resulting in non-significant p-values.

Yhat (the prediction) is computed from a weighted least squares regression whose weights are a function of both the distance from X and the magnitude of the residual from the previous regression.
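As a rough illustration of the residual-based part of those weights, the sketch below computes bisquare robustness weights the same way the listing further down does (residuals scaled by six times their median absolute value). The residual values are made up for the example and are not from any dataset in this file.

```matlab
% Hypothetical residuals from a previous fit (illustrative values only).
e = abs([0.1; -0.2; 0.05; 3.0; -0.15]);   % absolute residuals
s = median(e);                             % robust scale estimate
u = e / (6*s);                             % residuals scaled by 6 * median
r = max(0, 1 - u.^2).^2;                   % bisquare robustness weights in [0, 1]
disp([e r])                                % large residuals get weight 0
```

Points whose residual exceeds six times the median absolute residual receive a weight of zero, so a single outlier has no influence on the next pass of the regression.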

Conceptually, these functions and subfunctions follow the USGS Kendall.exe routines. Because Matlab uses 8-byte (double) precision by default, there are some very small differences between the compiled FORTRAN results and the Matlab results. Note that a Matlab double is 8 bytes regardless of whether the operating system is 32-bit or 64-bit.

There is a very simple subfunction that creates a plot of the data and regression if the user so chooses, using a flag in the call to the lowess function. BTW, the png file looks much better than the figure does on screen.

There are loops in these routines to keep the memory requirements to a minimum, since it is foreseeable that one may have very large datasets to work with.

f = a smoothing factor between 0 and 1. The closer to one, the more smoothing done.
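To make the effect of f concrete, here is a minimal sketch of how the listing below turns f into a local window: m = fix(n*f + 0.5) points are used, and the window half-width at an estimation point is the distance to its m-th nearest x-value. The x-grid and the chosen index j are assumptions for illustration only.

```matlab
x = (0:0.1:10)';          % 101 sorted x-values (illustrative grid)
f = 0.15;                 % smoothing factor
n = length(x);
m = fix(n*f + 0.5);       % number of points in the local window
j = 51;                   % estimation point in the middle of the grid
d = sort(abs(x - x(j)));  % distances from x(j), ascending
halfwidth = d(m);         % distance to the m-th nearest point
fprintf('n = %d, m = %d, half-width = %.2f\n', n, m, halfwidth);
```

A larger f gives a larger m and therefore a wider window, which is why values of f closer to one produce a smoother curve.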

Syntax:

 [dataout lowerLimit upperLimit xy] = lowess(datain,f,wantplot,imagefile,xdata)
 datain = n x 2 (or 3 if sequence index is included) matrix
 dataout = n x 3 matrix
 wantplot = scalar (optional)
      if ~= 0 then create plot
 imagefile = full path and file name where to output the figure to an
      png file type at 600 dpi. If imagefile not provided, a figure will
      be displayed but not exported to a graphics file.
      e.g. imagefile = 'd:\temp\lowess.png';
 xdata = n x 1 vector. The user can supply a second dataset of x-values
      that will be used to predict y-values using the lowess regression
      results.
 xy = x-values supplied by the user in xdata (or taken from the input
      data), and y-predictions using the lowess regression results.  If
      a sequence index is given, it is inserted as the first column.
      When wantplot ~= 0, the last two columns are the lower and upper
      simulations using the regression lower and upper results.

where:

*  datain(:,1) = x
*  datain(:,2) = y
*  f = scalar (0 < f < 1)
*  wantplot = scalar
*  imagefile = string
*  dataout(:,1) = x
*  dataout(:,2) = y
*  dataout(:,3) = y-prediction (aka yhat)
*  lowerLimit(:,1) = x with negative residuals
*  lowerLimit(:,2) = y-prediction of residuals + original y-prediction
*  upperLimit(:,1) = x with positive residuals
*  upperLimit(:,2) = y-prediction of residuals + original y-prediction

Requirements: none

Written by

Jeff Burkey
King County Department of Natural Resources and Parks
email: jeff.burkey@kingcounty.gov
12/16/2008

Example syntax: [dataout lowerLimit upperLimit xy] = lowess(skl1a,.25,1,'c:\temp\test.png',xdata)

Primary Function: lowess

The main engine for this function.

function [dataout lowerLimit upperLimit xy] = lowess(datain,f,wantplot,imagefile,xdata)
    % start timer
    start = tic;

    rowcol = size(datain);

    if rowcol(2) == 3
        % assume time index for first column
        dte = datain(:,1);
        x_data = datain(:,2);
        y_data = datain(:,3);
    else % assign empty set
        dte = [];
        x_data = datain(:,1);
        y_data = datain(:,2);
    end

    if exist('xdata','var') == 0
        % User didn't provide any x-values for generating a dataset; use
        % the supplied set prior to sorting.
        xdata = [dte x_data];
    else
        xdata = [dte xdata];
    end


    datain = sortrows([x_data y_data]);

    if exist('wantplot','var') == 0 || wantplot == 0
        % user didn't provide assume zero (i.e. no plot)
        fprintf('\nNo plot will be created.\n');
        wantplot = 0;
        imagefile = '';
        limits = 1;
        upperLimit = nan;
        lowerLimit = nan;
    else
        limits = 3;
    end
    if exist('imagefile','var') == 0
        % User didn't provide do not export to graphics file
        fprintf('\nNo plot will be exported.\n');
        imagefile = '';
    end

    dataout = [];

    for nplots=1:limits
        % if limits is turned on, then plot the upper and lower limits of
        % the lowess- set to plot residuals lowess
        % Use every row. (find(datain(:,1)) returns only the rows where
        % x is non-zero, which would silently drop points at x == 0.)
        x = datain(:,1);
        y = datain(:,2);

        switch nplots
            case 2
                row = lwsResiduals > 0;
                x = dataout(row,1);
                y = lwsResiduals(row);
            case 3
                row = lwsResiduals < 0;
                x = dataout(row,1);
                y = lwsResiduals(row);
        end

        n = length(x);

        if (f <= 0.0)
            f=0.25; % set to default
        end

        m=fix(n*f+0.5);
        window = zeros(n,1);
        yhat = zeros(n,1);

        for j=1:n
            % This could be done in a matrix, but need to keep memory footprint
            % small, thus the loop.
            d = abs(x- x(j));
            r1 = ones(n,1);
            d = sort(d);

            window(j)=d(m);
            yhat(j)= rwlreg(x,y,n,window(j),r1,x(j));
        end

        for it=1:2
            e = abs(y-yhat);

            n = length(e);
            s=median(e);

            r = e/(6*s);
            r = 1-r.^2;
            r = max(0.d0,r);
            r = r.^2;

            for j=1:n
                yhat(j)= rwlreg(x,y,n,window(j),r,x(j));
            end
        end

        switch nplots
            case 1
                % calculate residuals otherwise skip
                lwsResiduals = y - yhat;
                dataout = [x y yhat];
            case 2
                ul = [x y yhat];
                [~, ia, ib] = intersect(dataout(:,1),ul(:,1));
                upperLimit = [ul(ib,1) ul(ib,3) + dataout(ia,3)];
                clear ul c ia ib
            case 3
                ll = [x y yhat];
                [~, ia, ib] = intersect(dataout(:,1),ll(:,1));
                lowerLimit = [ll(ib,1) ll(ib,3) + dataout(ia,3)];
                clear ll ia ib
        end
    end

    fprintf('\nCompute time %6.4f seconds.\n',toc(start));

Generate predicted XY data

Interpolation is used to estimate y from the lowess regression at the supplied x-values. Originally, any x-values beyond the range used to generate the lowess were ignored; in the listing below that range filter is commented out and interp1 is called with the 'spline' method, so such values are extrapolated. The Matlab unique function sorts the data, thus a modified unique function (usunique) was developed to return an unsorted vector; note that the listing below calls unique.

The interpolation was limited because, if there are data beyond the regression range, it may not be valid to extrapolate the regression to them. But that is the user's choice.
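The difference between the two interp1 methods at out-of-range points can be seen with a small sketch. The smoothed values here are hypothetical stand-ins, not lowess output.

```matlab
xs = (0:5)';                            % hypothetical smoothed x-values
ys = xs.^2;                             % hypothetical smoothed y-values
xq = [2.5; 6];                          % 2.5 is inside the range, 6 is outside
ylin = interp1(xs, ys, xq, 'linear');   % returns NaN outside the range
yspl = interp1(xs, ys, xq, 'spline');   % extrapolates outside the range
disp([xq ylin yspl])
```

With 'linear' the out-of-range point comes back as NaN and would have to be dropped, while 'spline' returns an extrapolated value whose validity the user has to judge.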

    if ~isempty(xdata)
        xyd = [dataout(:,1) dataout(:,3)];
        xyd = unique(xyd,'rows');
        xd = xdata;

        % The below two lines are commented out to allow for extrapolations
        %         xd = xdata(xdata >= min(xyd(:,1)));
        %         xd = xd(xd <= max(xyd(:,1)));

        if ~isnan(xyd)
            % Note:  it may be possible to have a few nan's in the data set
            % while the results would still be valid.  I haven't come
            % across a case of this but is possible.  If so, then the user
            % may need to incorporate a threshold of just how many nan's
            % would be acceptable before dumping the whole regression.  But
            % to be conservative, if there are any nan's throw out the
            % whole result dataset.
            %
            % Note: change LINEAR to SPLINE, this will allow for
            % extrapolations, BUT using the SPLINE function now.
            xd = unique(xd,'rows');

            % Needed to account for the user including a time sequence or not.
            %  modified June 15, 2012
            [~, c] = size(xd);
            if c == 2
                xv = xd(:,2);
            elseif c == 1
                xv = xd;
            end

            yinterp = interp1(xyd(:,1),xyd(:,2),xv,'spline');
            if wantplot ~= 0
                ylow = interp1(lowerLimit(:,1),lowerLimit(:,2),xv,'spline');
                yup =  interp1(upperLimit(:,1),upperLimit(:,2),xv,'spline');
                xy = [dte xv yinterp ylow yup];
            else
                xy = [dte xv yinterp];
            end


            if length(xd) ~= length(xdata)
                fprintf('\nOne or more x-values were beyond the range supplied to lowess.\nOr there were duplicate values.\nThey will be ignored.\n');
            end
        else
            fprintf('\nThere were NaNs in the lowess results. No plot will be created.\n');
            wantplot = 0;
            xy = nan;
        end
    end

    if wantplot ~= 0
        customplot(dataout,upperLimit,lowerLimit,f,[dte x_data y_data],xy,imagefile);
    end
end

Modification of check for ten or more non-zero weights

Hirsch June 1987

Robust weighted least squares regression, with weights by distance on the X-axis and bisquare robustness weights from the residuals. In the listing below the distance weights are tricube, (1 - u^3)^3.

*  xx = the estimation point
*  yy = the estimated value of y at xx
*  dd = half the width of the window
*  r = the robustness weight, a bisquare weight of residuals

function [yy] = rwlreg(x,y,n,d,r,xx)
    dd=d;
    ddmax = abs(x(n) - x(1));
    if dd == 0.0
        error('Regression:lowess','LOWESS window size = 0. Increase f.');
    else
        while dd <= ddmax
            c = 0;
            total = 0.0;
            f = (abs(x-xx)/dd);
            f = 1.0-f.^3;
            w = ((max(0.d0,f)).^3).*r;
            total = sum(w);
            c = sum(w>0);
            if c > 3
                break % out of while loop
            else
                dd=1.28*dd;
                fprintf('\nrobust size of window = %5.0f.\nLowess window size increased to %3.2f\n', c, dd);
            end
        end
    end

    w = w/total;

    [a b] = wlsq(x,y,w);
    yy=a+b*xx;
end
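The distance weighting inside rwlreg can be sketched on its own. The example below applies the same tricube expression to a hypothetical x-grid, counts the points with non-zero weight (the quantity the function checks before widening the window), and normalizes the weights. The grid, estimation point, and half-width are assumptions for illustration.

```matlab
x  = (0:10)';                      % hypothetical sorted x-values
xx = 5;                            % estimation point
dd = 3;                            % half-width of the window
u  = abs(x - xx) / dd;             % scaled distance from xx
w  = max(0, 1 - u.^3).^3;          % tricube weights, zero at and beyond dd
c  = sum(w > 0);                   % number of points with non-zero weight
w  = w / sum(w);                   % normalize so the weights sum to one
disp([x w])
```

Points exactly at distance dd get a weight of zero, so only x-values strictly inside the window contribute to the local fit.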

Weighted least squares

This subfunction does not require any toolboxes in matlab to execute.

function[a b] = wlsq(x,y,w)
    sumw = abs(1-sum(w));
    if sumw > 1e-10
        % The weights, w, must sum to one. Precision assuming type double,
        %    The user may want to adjust this value.
        error('Regression:wlsq','\nThere is an error in the weights.\nWeights do not sum to one (deviation %10.9f).\n',sumw);
    end
    wxx = sum(w.*x.^2);
    wx = sum(w.*x);
    wxy = sum(w.*x.*y);
    wy = sum(w.*y);
    b = (wxy-wy*wx)/(wxx-wx^2);
    a = wy-b*wx;
end
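As a sanity check on the closed-form expressions in wlsq, the sketch below compares them with a weighted least squares fit solved through the backslash operator. The data and weights are invented for the example; only the formulas for a and b come from the listing above.

```matlab
x = (1:6)';
y = [2.1; 3.9; 6.2; 7.8; 10.1; 12.2];   % hypothetical data
w = [1; 2; 3; 3; 2; 1];
w = w / sum(w);                          % weights must sum to one

% Closed form, as in wlsq
wx  = sum(w.*x);     wy  = sum(w.*y);
wxx = sum(w.*x.^2);  wxy = sum(w.*x.*y);
b = (wxy - wy*wx) / (wxx - wx^2);        % slope
a = wy - b*wx;                           % intercept

% Same fit via the weighted design matrix
X  = [ones(size(x)) x];
sw = sqrt(w);
beta = (X .* [sw sw]) \ (y .* sw);       % beta = [intercept; slope]
fprintf('a = %.6f, b = %.6f\n', a, b);
```

The two approaches should agree to within rounding error; the closed form avoids building a design matrix, which fits the stated goal of keeping memory use small.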

Plotting of data and lowess regression line

If a sequence vector is included in the data, the figure will contain two subplots. The first is the LOWESS regression of the data; the second plots the data in the original sequence using the first column of the input data. For example, a datenum for when the data were observed would be common. The second plot shows the observed Y-data and the predicted Y-data.

function customplot(lws,uplmt,lwrlmt,f,oldxy,newxy,imgfile)
    figure1 = figure;

    try
        rowcol = size(newxy);
        if rowcol(2) == 5 % Users provided a sequence index e.g. Datenum
            % The second subplot is defined first as a matter of
            % readability in the code.  This conditional segment will not
            % be executed if no sequence index is provided (e.g. datetime).
            subplot(2,2,3:4,'Parent',figure1,...
                'YScale','linear','YMinorTick','off',...
                'XScale','linear','XMinorTick','on',...
                'YMinorGrid','off',...
                'XMinorGrid','on');
            box('on');
            grid('on');
            hold('all');

            line(oldxy(:,1),oldxy(:,3),'LineStyle','none', ...
                'Marker','o','MarkerSize',7,...
                'MarkerEdgeColor','k',...
                'MarkerFaceColor','b',...
                'DisplayName','Observed');

            line(newxy(:,1),newxy(:,3),'LineStyle','-', ...
                'LineWidth',2,'Color','r',...
                'DisplayName','Simulated');

            line(newxy(:,1),newxy(:,4),'LineStyle',':', ...
                'LineWidth',2,'Color','m',...
                'DisplayName','Lower');

            h = line(newxy(:,1),newxy(:,5),'LineStyle',':', ...
                'LineWidth',2,'Color','g',...
                'DisplayName','Upper');

            % The below lines are commented out, but for convenience
            % uncomment if one wants to make the upper and lower lines the
            % same color, you can not have duplicate symbols in the legend.
%             hAnnotation = get(h,'Annotation');
%             hLegendEntry = get(hAnnotation','LegendInformation');
%             set(hLegendEntry,'IconDisplayStyle','off');

            ylabel('y-Values');
            xlabel('Sequence Index (e.g. datenum)');
            title('Simulated y-Values using LOWESS Regression');

            % Create legend - although I'm not overly pleased using the default
            % location of placement.  I want it outside the grid, but
            % without manually describing locations the default will have to
            % do for now.  Placing it inside the grid yields the most
            % real estate, but the legend box possibly could cover up data.
            legend('show','Location','EastOutside');

            hold('off');

            % This is the primary plot that will be generated either from
            % this conditional statement or in the ELSEIF below. They are
            % exactly the same except for defining the subplot space as
            % either two plots or one.
            subplot(2,2,1:2,'Parent',figure1,...
                'YScale','linear','YMinorTick','off',...
                'XScale','linear','XMinorTick','on',...
                'YMinorGrid','off',...
                'XMinorGrid','on');
        elseif rowcol(2) == 4
            % No sequence index is given (newxy is [x yhat lower upper]), so
            % a second plot would be identical to the first.  Define plot to
            % occupy space of both subplots.  This could be revised and not
            % even call it a subplot, but for consistency it is.
            subplot(2,1,1:2,'Parent',figure1,...
                'YScale','linear','YMinorTick','off',...
                'XScale','linear','XMinorTick','on',...
                'YMinorGrid','off',...
                'XMinorGrid','on');
        end

        box('on');
        grid('on');
        hold('all');

        x = lws(:,1);
        y = lws(:,2);
        yh = lws(:,3);

        % Point plot of points of the observed data on the LOWESS plot
        line(x,y,'LineStyle','none', ...
            'Marker','o','MarkerSize',7,...
            'MarkerEdgeColor','k',...
            'MarkerFaceColor','b',...
            'DisplayName','Observed');

        % Line plot of the LOWESS regression
        line(x,yh, 'Color','r', 'LineWidth',2, 'LineStyle','-', ...
            'DisplayName','Regression' ...
            );


        x = uplmt(:,1);
        yh = uplmt(:,2);

        % Line plot of the upper limit LOWESS regression
        line(x,yh, 'Color', 'r', 'LineWidth',2,'LineStyle',':', ...
            'DisplayName','Upper/Lower' ...
            );

        x = lwrlmt(:,1);
        yh = lwrlmt(:,2);

        % Line plot of the lower limit LOWESS regression
        h = line(x,yh, 'Color', 'r', 'LineWidth',2,'LineStyle',':', ...
            'DisplayName','Lower' ...
            );
        hAnnotation = get(h,'Annotation');
        hLegendEntry = get(hAnnotation','LegendInformation');
        set(hLegendEntry,'IconDisplayStyle','off');

        grid on
        xlabel('x-Values')
        ylabel('y-Values')
        ts = strcat('LOWESS Regression plot f=',num2str(f));
        title(ts)

        % Create legend - although I'm not overly pleased using the default
        % location of placement.  I want it outside the grid, but
        % without manually describing locations the default will have to
        % do for now.  Placing it inside the grid yields the most
        % real estate, but the legend box possibly could cover up data.
        legend('show','Location','EastOutside');

        hold('off');

        if ~isempty(imgfile)
            % If a filename is given for the plot, create a PNG file.
            fprintf('\nCreating plot. Give a few tics.\n');
            print('-dpng','-r600', imgfile);
            fprintf('\nFinished...\n');
        end

        close(figure1)
    catch ME1
        disp(ME1)
    end
end