While I agree with @J.M. that asking on the stats site would be a good idea, I thought I'd have a go anyway. But be warned: I'm no statistician.
The basic idea is that if you can project the data onto the right vector (perpendicular to the striations), it will naturally fall into clusters. You can then partition the original data according to this clustering, and find best fit lines through the data subsets.
First, construct some pretend data:
m = 20; n = 1000; phi = RandomReal[{0, \[Pi]/2}];
xmin = 71917500; xmax = 71935000; xrange = xmax - xmin;
ymin = 0; ymax = 10000; yrange = ymax - ymin;
xvals = RandomReal[{xmin, xmax}, n];
alist = Range[ymin + xmin/Tan[phi], ymax + xmax/Tan[phi],
(ymax - ymin + (xmax - xmin) Cot[phi])/m];
yvals = RandomChoice[Select[alist - #/Tan[phi], ymin < # < ymax &]] +
RandomReal[(ymax - ymin)/m] & /@ xvals;
ListPlot[rawdata = Transpose[{xvals, yvals}], PlotLabel -> "phi = " <> ToString[phi]]
n
is the number of data points, m
is the number of lines I've built into the data, and phi
is the angle perpendicular to the striations.
The data is clearly not from the same distribution as your original data. But everything here should work (perhaps with minor tweaking, but perhaps not). The most dubious assumption I've made here is that the striations are all equidistant from each other. If that's not the case in your data, though, there's a pretty easy way round it. But I'll get to that at the end.
First, we're going to find phi
approximately (but accurately enough to cluster the data), and then we'll pin it down by fitting a function. You can see how the data falls into clusters when projected onto the right vector with Manipulate
(on normalized data):
Manipulate[
ListPlot[
Projection[#, {Cos[theta], Sin[theta]}] & /@
({(#[[1]] - xmin)/xrange, (#[[2]] - ymin)/yrange} & /@ rawdata),
PlotRange -> {{0, Sqrt[2]}, {0, Sqrt[2]}}],
{{theta, ArcTan[Tan[phi] yrange/xrange]}, 0, \[Pi]/2}]
Notice how, when the projection angle is wrong there is one big cluster. We're going to automate this process, but the idea is to run through a range of projections and pick the one that maximizes the number of clusters we can identify. The cluster identification I'm going to use is pretty low-brow. Presumably it's possible to get it working with FindClusters
, but I had no luck with any Method
or DistanceFunction
. Plus, this way is faster.
I'm going to use Split
, which picks out runs of identical elements, where we decide how close points have to be to count as "identical". This may need some tweaking: if the threshold is too large, it might merge clusters that we want to keep separate; too small and it will separate clusters that we want to keep together.
proj[list_, theta_] := Projection[#, {Cos[theta], Sin[theta]}] & /@ list
anglelimits = {10^-2, \[Pi]/2 - 10^-2};
clustersizes = Table[
sortproj = Sort[proj[rawdata, theta]];
{theta,
Length@Split[sortproj, Norm[#1 - #2] < 10/n Norm[sortproj[[-1]] - sortproj[[1]]] &]
},
{theta, anglelimits[[1]], anglelimits[[2]], (anglelimits[[2]] - anglelimits[[1]])/100.}
];
ListPlot[clustersizes, PlotRange -> All, Frame -> True,
FrameLabel -> {"Projection Angle", "Number of clusters"}]
Note the spike in the number of clusters around a projection angle of 0.5. The following function uses this idea by progressively focusing in on the angle around that spike, and terminates when all the angles in a small range yield the same number of clusters. It then returns the mean of that range of angles.
refineangle[rawpoints_] :=
Module[{anglelimits = {10^-2, \[Pi]/2 - 10^-2},
data = {{0, 1}, {0, 2}}, sortproj},
While[! (SameQ @@ data[[;; , 2]]),
data = Table[
sortproj = Sort[proj[rawpoints, theta]];
{theta,
Length@Split[sortproj,
Norm[#1 - #2] < 10/n Norm[sortproj[[-1]] - sortproj[[1]]] &]},
{theta, anglelimits[[1]], anglelimits[[2]], (
anglelimits[[2]] - anglelimits[[1]])/100.}];
anglelimits = Sort[MaximalBy[clustersizes, Last]][[{1, -1}, 1]];
];
Mean[anglelimits]
];
Then
mu = refineangle[rawdata]
phi
(* 0.490747
0.497375 *)
Now that we've found the approximate projection angle, we can project the data down. But we need to keep track of which data points end up in which cluster, so we have to add explicit indices to our data, and also update our projection to keep the indices.
dataindexed = MapIndexed[{First@#2, #1} &, rawdata];
projindexed[indexedlist_, theta_] := {#[[1]],
Projection[#[[2]], {Cos[theta], Sin[theta]}]} & /@ indexedlist
Then we project, find the indices of the points in the clusters, and then extract those elements from the raw data.
sortproj = SortBy[projindexed[dataindexed, mu], Last];
indexpartition =
Split[sortproj,
Norm[Last@#1 - Last@#2] <
10 / n Norm[sortproj[[-1, 2]] - sortproj[[1, 2]]] &][[;; , ;; , 1]];
datapartition = rawdata[[#]] & /@ indexpartition;
ListPlot[datapartition]
I'm going to make two assumptions about the data here (which are true for the data I made up, but may not be for your data):
- All the lines are, in fact, parallel.
- The lines are evenly spaced.
If the lines are not parallel then you can just treat the data clusters as separate lists and find a linear fit through each of them. If they're parallel, then you need to fit to all of them at once, which is a bit trickier. If they're not parallel, then the even spacing assumption is irrelevant too. But if you're solving them all at once, then you need to either make an assumption about their spacing, or build variable spacing into the function you're going to fit. I'm going to go with the first one.
First, we need to Flatten
the data, but add an index for the cluster:
datapartitionindexed =
Flatten[MapIndexed[
With[{subset = #1,
index = First@#2}, {#1[[1]], index, #1[[2]]} & /@ subset] &,
datapartition], 1];
Then fit a linear function:
fitparams = FindFit[datapartitionindexed, a + b y + c x, {a, b, c}, {x, y}]
(* {a -> 1.3255*10^8, b -> 2112.56, c -> -1.84307} *)
and plot the result:
Show[ListPlot[datapartition],
Plot[Table[
a + b y + c x /. fitparams, {y, Length[datapartition]}], {x, xmin,
xmax}, PlotRange -> {{xmin, xmax}, {ymin, ymax}}]]
Note, finally, that the parameter c
gives the slope of the striations, which (in the invented data) are perpendicular to the angle phi
. If the data were noiseless, we would expect that ArcTan[-1 / c] = phi
. With noise in the system, we're still pretty close:
ArcTan[-1/c] /. fitparams
phi
(* 0.497122
0.497375 *)