6
$\begingroup$

I've measured and plotted some data - performance-metric on the vertical axis (high number is bad, low number good), time of measurement on the horizontal axis.

The obvious striations/linear-arrangements are not expected at all. If I were to print it, sit with a ruler I could by eye pick out something like 20-30 linear groups with linear-fit correlation coefficient in the high 90% range. Quantifying this using many data-sets with the same characteristic may help to understand the source, and at least would give an objective measure to help compare different data-sets.


The question:

Is anyone is aware of a technique to quantify how many "highly linearly-correlated" groups of points there are - mathematically, that is. Sitting with a ruler one could get a reasonable estimate.

It seems this is in some ways an analog of clustering, but with an unusual "associated-ness" metric


Why posted in Mathematica StackExchange?

obvious striations/correlation in structure

$\endgroup$
2

1 Answer 1

6
$\begingroup$

While I agree with @J.M. that asking on the stats site would be a good idea, I thought I'd have a go anyway. But be warned: I'm no statistician.

The basic idea is that if you can project the data onto the right vector (perpendicular to the striations), it will naturally fall into clusters. You can then partition the original data according to this clustering, and find best fit lines through the data subsets.

First, construct some pretend data:

m = 20; n = 1000; phi = RandomReal[{0, \[Pi]/2}];
xmin = 71917500; xmax = 71935000; xrange = xmax - xmin;
ymin = 0; ymax = 10000; yrange = ymax - ymin;
xvals = RandomReal[{xmin, xmax}, n];
alist = Range[ymin + xmin/Tan[phi], ymax + xmax/Tan[phi], 
          (ymax - ymin + (xmax - xmin) Cot[phi])/m]; 
yvals = RandomChoice[Select[alist - #/Tan[phi], ymin < # < ymax &]] + 
          RandomReal[(ymax - ymin)/m] & /@ xvals;

ListPlot[rawdata = Transpose[{xvals, yvals}], PlotLabel -> "phi = " <> ToString[phi]]

n is the number of data points, m is the number of lines I've built into the data, and phi is the angle perpendicular to the striations.

enter image description here

The data is clearly not from the same distribution as your original data. But everything here should work (perhaps with minor tweaking, but perhaps not). The most dubious assumption I've made here is that the striations are all equidistant from each other. If that's not the case in your data, though, there's a pretty easy way round it. But I'll get to that at the end.

First, we're going to find phi approximately (but accurately enough to cluster the data), and then we'll pin it down by fitting a function. You can see how the data falls into clusters when projected onto the right vector with Manipulate (on normalized data):

Manipulate[
 ListPlot[
  Projection[#, {Cos[theta], Sin[theta]}] & /@ 
   ({(#[[1]] - xmin)/xrange, (#[[2]] - ymin)/yrange} & /@ rawdata), 
  PlotRange -> {{0, Sqrt[2]}, {0, Sqrt[2]}}],
 {{theta, ArcTan[Tan[phi] yrange/xrange]}, 0, \[Pi]/2}]

enter image description here

Notice how, when the projection angle is wrong there is one big cluster. We're going to automate this process, but the idea is to run through a range of projections and pick the one that maximizes the number of clusters we can identify. The cluster identification I'm going to use is pretty low-brow. Presumably it's possible to get it working with FindClusters, but I had no luck with any Method or DistanceFunction. Plus, this way is faster.

I'm going to use Split, which picks out runs of identical elements, where we decide how close points have to be to count as "identical". This may need some tweaking: if the threshold is too large, it might merge clusters that we want to keep separate; too small and it will separate clusters that we want to keep together.

proj[list_, theta_] := Projection[#, {Cos[theta], Sin[theta]}] & /@ list

anglelimits = {10^-2, \[Pi]/2 - 10^-2};
clustersizes = Table[
  sortproj = Sort[proj[rawdata, theta]];
  {theta, 
    Length@Split[sortproj, Norm[#1 - #2] < 10/n Norm[sortproj[[-1]] - sortproj[[1]]] &]
  },
  {theta, anglelimits[[1]], anglelimits[[2]], (anglelimits[[2]] - anglelimits[[1]])/100.}
 ];
ListPlot[clustersizes, PlotRange -> All, Frame -> True, 
 FrameLabel -> {"Projection Angle", "Number of clusters"}]   

enter image description here

Note the spike in the number of clusters around a projection angle of 0.5. The following function uses this idea by progressively focusing in on the angle around that spike, and terminates when all the angles in a small range yield the same number of clusters. It then returns the mean of that range of angles.

refineangle[rawpoints_] := 
  Module[{anglelimits = {10^-2, \[Pi]/2 - 10^-2}, 
    data = {{0, 1}, {0, 2}}, sortproj},
   While[! (SameQ @@ data[[;; , 2]]),
    data = Table[
      sortproj = Sort[proj[rawpoints, theta]];
      {theta, 
       Length@Split[sortproj, 
         Norm[#1 - #2] < 10/n Norm[sortproj[[-1]] - sortproj[[1]]] &]},
      {theta, anglelimits[[1]], anglelimits[[2]], (
       anglelimits[[2]] - anglelimits[[1]])/100.}];
    anglelimits = Sort[MaximalBy[clustersizes, Last]][[{1, -1}, 1]];
    ];
   Mean[anglelimits]
   ];

Then

mu = refineangle[rawdata]
phi

(* 0.490747
   0.497375 *)

Now that we've found the approximate projection angle, we can project the data down. But we need to keep track of which data points end up in which cluster, so we have to add explicit indices to our data, and also update our projection to keep the indices.

dataindexed = MapIndexed[{First@#2, #1} &, rawdata];
projindexed[indexedlist_, theta_] := {#[[1]], 
  Projection[#[[2]], {Cos[theta], Sin[theta]}]} & /@ indexedlist

Then we project, find the indices of the points in the clusters, and then extract those elements from the raw data.

sortproj = SortBy[projindexed[dataindexed, mu], Last];
indexpartition = 
  Split[sortproj, 
    Norm[Last@#1 - Last@#2] < 
      10 / n Norm[sortproj[[-1, 2]] - sortproj[[1, 2]]] &][[;; , ;; , 1]];
datapartition = rawdata[[#]] & /@ indexpartition;
ListPlot[datapartition]

enter image description here

I'm going to make two assumptions about the data here (which are true for the data I made up, but may not be for your data):

  1. All the lines are, in fact, parallel.
  2. The lines are evenly spaced.

If the lines are not parallel then you can just treat the data clusters as separate lists and find a linear fit through each of them. If they're parallel, then you need to fit to all of them at once, which is a bit trickier. If they're not parallel, then the even spacing assumption is irrelevant too. But if you're solving them all at once, then you need to either make an assumption about their spacing, or build variable spacing into the function you're going to fit. I'm going to go with the first one.

First, we need to Flatten the data, but add an index for the cluster:

datapartitionindexed = 
  Flatten[MapIndexed[
    With[{subset = #1, 
       index = First@#2}, {#1[[1]], index, #1[[2]]} & /@ subset] &, 
    datapartition], 1];

Then fit a linear function:

fitparams = FindFit[datapartitionindexed, a + b y + c x, {a, b, c}, {x, y}]
(* {a -> 1.3255*10^8, b -> 2112.56, c -> -1.84307} *)

and plot the result:

Show[ListPlot[datapartition], 
 Plot[Table[
   a + b y + c x /. fitparams, {y, Length[datapartition]}], {x, xmin, 
   xmax}, PlotRange -> {{xmin, xmax}, {ymin, ymax}}]]

enter image description here

Note, finally, that the parameter c gives the slope of the striations, which (in the invented data) are perpendicular to the angle phi. If the data were noiseless, we would expect that ArcTan[-1 / c] = phi. With noise in the system, we're still pretty close:

ArcTan[-1/c] /. fitparams
phi

(* 0.497122
   0.497375 *)
$\endgroup$
1
  • $\begingroup$ inspired and pragmatic. Appreciate the effort. I will have to do this justice over the weekend by going through it in detail and apply it to my data. $\endgroup$
    – Paul_A
    Commented Aug 31, 2017 at 4:38

Not the answer you're looking for? Browse other questions tagged or ask your own question.