Introduction to Data Science
Lesson 3. Rule-based models
Nikolay Anokhin, Mikhail Firulik
March 16, 2014
Lesson plan
Decision trees
Problem
Given:
a training sample of profiles of several tens of thousands of people:
- gender (binary)
- age (numeric)
- education (nominal)
- 137 more features
- whether the person is interested in cosmetics
Task:
For an advertising campaign, determine the characteristics of people who are interested in cosmetics.
Obama or Clinton?
A good day for a round of golf
Decision regions
Recursive algorithm
decision_tree(X_N):
    if X_N satisfies the leaf criterion:
        create the current node N as a leaf
        choose a suitable class C_N
    else:
        create the current node N as an internal node
        split X_N into subsamples
        for each subsample X_j:
            n = decision_tree(X_j)
            add n to N as a child
    return N
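The pseudocode above can be sketched as a small runnable program. The leaf criterion (purity or maximum depth) and the Gini-based split chooser below are illustrative choices, not prescribed by the slide:

```python
from collections import Counter

def gini(y):
    """Gini impurity of a list of class labels."""
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())

def best_split(X, y):
    """Pick the (feature, threshold) pair with the lowest weighted child impurity."""
    best_f, best_t, best_w = None, None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if w < best_w:
                best_f, best_t, best_w = f, t, w
    return best_f, best_t

def decision_tree(X, y, depth=0, max_depth=5):
    # leaf criterion: the node is pure or the maximum depth is reached
    f_t = best_split(X, y) if len(set(y)) > 1 and depth < max_depth else (None, None)
    if f_t[0] is None:
        return {"class": Counter(y).most_common(1)[0][0]}  # choose a suitable class C_N
    f, t = f_t
    children = []
    for keep in (lambda v: v <= t, lambda v: v > t):  # split X_N into subsamples
        idx = [i for i, row in enumerate(X) if keep(row[f])]
        children.append(decision_tree([X[i] for i in idx], [y[i] for i in idx],
                                      depth + 1, max_depth))  # add each child n to N
    return {"feat": f, "thr": t, "children": children}
```

Prediction would simply walk from the root, comparing `row[feat]` with `thr` at each internal node.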
CART
Classification And Regression Trees
1. How is the split performed?
2. Into how many children should each node be split?
3. Which leaf criterion should be used?
4. How do we shorten a tree that has grown too large?
5. How do we choose the class of each leaf?
6. What do we do when some values are missing?
Node purity
Problem
Choose a method that splits a node into two or more children in the best possible way.
The key concept is the impurity of a node.
1. Misclassification
   i(N) = 1 − max_k p(x ∈ C_k)
2. Gini
   i(N) = 1 − Σ_k p²(x ∈ C_k) = Σ_{i≠j} p(x ∈ C_i) p(x ∈ C_j)
3. Information entropy
   i(N) = −Σ_k p(x ∈ C_k) log₂ p(x ∈ C_k)
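A quick numeric comparison of the three measures (a sketch; `p` is the vector of class probabilities inside a node):

```python
from math import log2

def misclassification(p):
    return 1 - max(p)

def gini(p):
    return 1 - sum(q * q for q in p)

def entropy(p):
    return -sum(q * log2(q) for q in p if q > 0)

# a pure node has zero impurity under all three measures;
# a balanced two-class node is maximally impure
p = (0.5, 0.5)
print(misclassification(p), gini(p), entropy(p))  # 0.5 0.5 1.0
```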
Information theory
Amount of information ∼ degree of surprise
h(x) = −log₂ p(x)
Information entropy: H[x] = E[h(x)]
H[x] = −Σ_x p(x) log₂ p(x)   or   H[x] = −∫ p(x) log₂ p(x) dx
Exercise
Given a random variable x taking 4 values with equal probabilities 1/4, and a random variable y taking 4 values with probabilities {1/2, 1/4, 1/8, 1/8}, compute H[x] and H[y].
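The exercise can be checked numerically with the discrete entropy formula above:

```python
from math import log2

def H(probs):
    """Discrete Shannon entropy, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(H([1/4, 1/4, 1/4, 1/4]))     # H[x] = 2.0 bits
print(H([1/2, 1/4, 1/8, 1/8]))     # H[y] = 1.75 bits
```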
Choosing the best split
Criterion
Choose the feature and the cut point so that the decrease in impurity is maximal:
∆i(N, N_L, N_R) = i(N) − (N_L/N) i(N_L) − (N_R/N) i(N_R)
Remarks
- Choosing the boundary for numeric features: the midpoint?
- Decisions are made locally: there is no guarantee of a globally optimal solution.
- In practice the choice of impurity measure does not strongly affect the result.
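The criterion can be evaluated directly. A small sketch using Gini impurity, where each argument is a list of per-class object counts (the function names are illustrative):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def delta_i(parent, left, right):
    """Impurity decrease ∆i(N, N_L, N_R) for a binary split;
    each argument is a list of per-class object counts."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))

# a split that separates a balanced (50, 50) node into two pure children
# removes all of the impurity
print(delta_i([50, 50], [50, 0], [0, 50]))  # 0.5
```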
If the split is not binary
The natural choice when splitting into B children:
∆i(N, N_1, . . . , N_B) = i(N) − Σ_{k=1..B} (N_k/N) i(N_k) → max
This favors large B. Modification:
∆i_B(N, N_1, . . . , N_B) = ∆i(N, N_1, . . . , N_B) / (−Σ_{k=1..B} (N_k/N) log₂(N_k/N)) → max
(gain ratio impurity)
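The effect of the normalization can be seen on a small example: splitting a balanced (4, 4) node into two pure children or into four pure children yields the same gain, but the gain ratio penalizes the 4-way split. A sketch (counts are per-class):

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gain_ratio(parent, children):
    """∆i divided by the split information −Σ (N_k/N) log2(N_k/N)."""
    n = sum(parent)
    gain = gini(parent) - sum(sum(c) / n * gini(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return gain / split_info

print(gain_ratio([4, 4], [[4, 0], [0, 4]]))                  # 0.5
print(gain_ratio([4, 4], [[2, 0], [2, 0], [0, 2], [0, 2]]))  # 0.25
```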
Using multiple features
Practice
Problem
Compute the best binary split of the root node on a single feature, using Gini impurity.

#  Gender  Education  Job  Cosmetics
1  M       Higher     Yes  No
2  M       Secondary  No   No
3  M       None       Yes  No
4  M       Higher     No   Yes
1  F       None       No   Yes
2  F       Higher     Yes  Yes
3  F       Secondary  Yes  No
4  F       Secondary  No   Yes
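The exercise can be checked programmatically. A sketch that evaluates one-vs-rest candidate splits on the table (on this data the gender and job splits happen to tie at ∆i = 0.125, both beating any education split):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# (gender, education, job) -> interest in cosmetics, from the table above
data = [("M", "higher", "yes", "no"), ("M", "secondary", "no", "no"),
        ("M", "none", "yes", "no"), ("M", "higher", "no", "yes"),
        ("F", "none", "no", "yes"), ("F", "higher", "yes", "yes"),
        ("F", "secondary", "yes", "no"), ("F", "secondary", "no", "yes")]

def delta_i(feature, value):
    """Gini decrease for the binary split row[feature] == value vs the rest."""
    y = [row[-1] for row in data]
    left = [row[-1] for row in data if row[feature] == value]
    right = [row[-1] for row in data if row[feature] != value]
    return (gini(y) - len(left) / len(y) * gini(left)
                    - len(right) / len(y) * gini(right))

print(delta_i(0, "M"))       # gender split: 0.125
print(delta_i(2, "yes"))     # job split: 0.125
print(delta_i(1, "higher"))  # one of the education splits (smaller)
```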
When to stop splitting
Split stopping criteria:
- never
- use a validation set
- set a minimum node size
- set a threshold ∆i(N) < β
- statistical approach:
  χ² = Σ_{k=1..2} (n_kL − (N_L/N)·n_k)² / ((N_L/N)·n_k)
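The χ² statistic above can be computed directly. A sketch, under the reading that `n_left` holds per-class counts in the left child, `n` per-class counts in the parent, and the expected left-child count for class k is (N_L/N)·n_k:

```python
def chi2(n_left, n):
    """χ² statistic of a binary split: observed vs expected left-child counts."""
    NL, N = sum(n_left), sum(n)
    return sum((nl - NL / N * nk) ** 2 / (NL / N * nk)
               for nl, nk in zip(n_left, n))

# a split mirroring the parent distribution gives χ² = 0 (uninformative);
# a skewed split gives a large value
print(chi2([25, 25], [50, 50]))  # 0.0
print(chi2([40, 10], [50, 50]))  # 18.0
```

Splitting would then stop when χ² falls below a threshold taken from the χ² distribution (one degree of freedom for two classes).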
Shortening the tree
Pruning (a.k.a. cutting off branches)
1. Grow the full tree T_0.
2. At each step, replace the weakest internal node with a leaf:
   R_α(T_k) = err(T_k) + α · size(T_k)
3. For a given α, from the resulting sequence
   T_0 ⊃ T_1 ⊃ . . . ⊃ T_r
   choose the tree T_k that minimizes R_α(T_k).
The value of α is chosen using a test set or CV.
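The selection step can be sketched on a hypothetical pruning sequence (the (error, size) pairs below are made up for illustration):

```python
def R(err, size, alpha):
    """Cost-complexity criterion R_alpha(T) = err(T) + alpha * size(T)."""
    return err + alpha * size

# hypothetical sequence T_0 ⊃ T_1 ⊃ T_2: (error, number of leaves)
subtrees = [(0.10, 15), (0.12, 7), (0.20, 3)]

def select(alpha):
    """Index of the subtree minimizing R_alpha."""
    return min(range(len(subtrees)), key=lambda k: R(*subtrees[k], alpha))

print([select(a) for a in (0.001, 0.01, 0.1)])  # [0, 1, 2]: larger alpha -> smaller tree
```

Recent versions of sklearn expose the same scheme through the `ccp_alpha` parameter of the tree estimators.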
Which class to assign to the leaves
1. Simplest case:
   the class with the maximum number of objects
2. Discriminative case:
   the probability p(C_k | x)
Computational complexity
The sample consists of n objects described by m features.
Assumptions
1. Nodes are split approximately in half.
2. The tree has log n levels.
3. The features are binary.
Training. For a node with k training objects:
- computing the impurity for one feature: O(k)
- choosing the splitting feature: O(mk)
Total: O(mn) + 2·O(mn/2) + 4·O(mn/4) + . . . = O(mn log n)
Prediction: O(log n)
Missing values
- Remove the objects from the sample.
- Treat absence as a separate category.
- Compute impurity while skipping missing values.
- Surrogate splits: split on a second feature so that the result resembles the primary split as closely as possible.
Surrogate split
c1: x1 = (0, 7, 8)ᵀ, x2 = (1, 8, 9)ᵀ, x3 = (2, 9, 0)ᵀ, x4 = (4, 1, 1)ᵀ, x5 = (5, 2, 2)ᵀ
c2: y1 = (3, 3, 3)ᵀ, y2 = (6, 0, 4)ᵀ, y3 = (7, 4, 5)ᵀ, y4 = (8, 5, 6)ᵀ, y5 = (9, 6, 7)ᵀ
Exercise
Compute the second surrogate split.
The cosmetics problem
[sklearn tree diagram: the root (37096 samples, entropy ≈ 0.99994) splits on X[0] ≤ 26.5; its children (10550 and 26546 samples) split on X[2] ≤ 0.5 and X[6] ≤ 0.5; all four resulting leaves remain nearly maximally impure (entropies 0.98–1.00, class counts such as [3479, 4798] and [5085, 5362]).]
X0 is age, X4 is incomplete higher education, X6 is gender.
Regression problems
Impurity of a node N:
i(N) = Σ_{y ∈ N} (y − ȳ)²
Assigning a value to the leaves:
- the mean value
- a linear model
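A sketch of a one-feature regression split under this impurity, with midpoint thresholds as discussed earlier for numeric features:

```python
def sse(ys):
    """Regression impurity: sum of squared deviations from the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_threshold(xs, ys):
    """Cut point minimizing the total impurity of the two children."""
    pairs = sorted(zip(xs, ys))
    best_t, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if cost < best_cost:
            best_t, best_cost = (pairs[i - 1][0] + pairs[i][0]) / 2, cost
    return best_t

# two flat segments: the best cut falls midway between x = 3 and x = 10
print(best_threshold([1, 2, 3, 10, 11, 12], [5, 5, 5, 9, 9, 9]))  # 6.5
```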
Beyond CART
ID3 (Iterative Dichotomiser 3)
- nominal features only
- the number of children of a node = the number of values of the splitting feature
- the tree grows to its maximal height
C4.5: an improvement on ID3
- numeric features handled as in CART, nominal ones as in ID3
- when a value is missing, all children are used
- shortens the tree by removing unnecessary predicates from the rules
C5.0: an improvement on C4.5
- proprietary
Decision trees: summary
+ easy to interpret; visualization (nice!)
+ accept any kind of input data
+ multiclass out of the box
+ prediction in O(log n)
+ amenable to statistical analysis
− prone to overfitting
− greedy and unstable
− perform poorly under class imbalance
Key figures
Claude Elwood Shannon (information theory)
Leo Breiman (CART, RF)
John Ross Quinlan (ID3, C4.5, C5.0)
Other rule-based models
- Market basket analysis, association rules, A-Priori
- Logical inference, FOL
Conclusion
"Binary Trees give an interesting and often illuminating way of looking at the data in classification or regression problems. They should not be used to the exclusion of other methods. We do not claim that they are always better. They do add a flexible nonparametric tool to the data analyst's arsenal."
— Breiman, Friedman, Olshen, Stone
Problem
Predict the family income category from user profiles, using a decision tree (the sklearn implementation).
Quality metric:
μ = accuracy / max_k P(C_k)
Reward:
in the homework, any ready-made DT implementation may be used
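The metric normalizes accuracy by the best constant (majority-class) predictor, so values above 1 mean the model beats that baseline. A sketch with made-up labels:

```python
def mu(y_true, y_pred):
    """accuracy / max_k P(C_k): the slide's quality metric."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prior = max(y_true.count(c) for c in set(y_true)) / len(y_true)
    return acc / prior

y_true = ["low", "low", "low", "high"]   # majority-class prior = 0.75
print(mu(y_true, ["low"] * 4))           # 1.0: exactly the majority baseline
print(mu(y_true, y_true))                # ~1.33: a perfect predictor
```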
Homework 2
Decision trees
Implement:
- the CART algorithm for regression
- the CART algorithm for classification
Support: different impurity measures, split stopping, pruning (+)
Key dates
- By 2014/03/22 00:00, choose a task and a responsible person in the group.
- By 2014/03/29 00:00, submit the solution.
Thank you!
Feedback