Diese Seite wurde absichtlich Leer gelassen.


C4.5

C4.5 ist ein Algorithmus, der zur Erzeugung von Entscheidungsbäumen eingesetzt wird. C4.5 wurde von Ross Quinlan.[1] als Nachfolger seines ID3-Algorithmus' entwickelt. Da die von C4.5 erzeugten Entscheidungsbäume auch als Klassifikator eingesetzt werden kann, gilt das Verfahren auch als Klassifikationsverfahren.

Neben den bekannten CARTs und CHAIDs gewinnt C4.5 immer mehr an Bedeutung. Er wird mittlerweile bereits von verschiedenen Softwarepaketen eingesetzt.

Algorithmus

Bearbeiten

HERE

Grundsätzlich verhält sich dieser Algorithmus sehr ähnlich zum CART-Algorithmus. Der Hauptunterschied besteht darin, dass bei C4.5 keine binäre Aufteilung erfolgen muss, sondern eine beliebige Anzahl Verzweigungen eingebaut werden können. Der Baum wird breiter. Er ist meist weniger tief als der korrespondierende CART-Baum. Dafür werden nach der ersten Klassifizierung die nachfolgenden Aufsplittungen weniger bedeutungsvoll.

Ein weiterer Unterschied zeigt sich beim so genannten Pruning, beim Stutzen des Baumes. CART erzeugt einige Subtrees und testet diese mit neuen, vorher noch nicht klassifizierten Daten. C4.5 hingegen beschneidet den Baum ohne Beachtung der gegebenen Datenbasis.

Algorithm

Bearbeiten

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set   of already classified samples. Each sample   consists of a p-dimensional vector  , where the   represent attributes or features of the sample, as well as the class in which   falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.

This algorithm has a few base cases.

  • All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
  • None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
  • Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Pseudocode

Bearbeiten

In pseudocode, the general algorithm for building decision trees is:[2]

  1. Check for base cases
  2. For each attribute a
    1. Find the normalized information gain from splitting on a
  3. Let a_best be the attribute with the highest normalized information gain
  4. Create a decision node that splits on a_best
  5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node

Implementations

Bearbeiten

J48 is an open source Java implementation of the C4.5 algorithm in the weka data mining tool. TimeSleuth [3] extends C4.5's use to temporal and causal discovery. TimeSleuth also allows converting C4.5 rules to Prolog statements [4]

Improvements from ID3 algorithm

Bearbeiten

C4.5 made a number of improvements to ID3. Some of these are:

  • Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.[5]
  • Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
  • Handling attributes with differing costs.
  • Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.

Improvements in C5.0/See5 algorithm

Bearbeiten

Vorlage:POV-section

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are[6]Vorlage:Third-party-inline:

  • Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
  • Memory usage - C5.0 is more memory efficient than C4.5
  • Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
  • Support for boosting - Boosting improves the trees and gives them more accuracy.
  • Weighting - C5.0 allows you to weight different cases and misclassification types.
  • Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

Source for a single-threaded Linux version of C5.0 is available under the GPL.

See also

Bearbeiten

References

Bearbeiten
  1. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
  2. S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31(2007) 249-268, 2007
  3. K. Karimi and H.J. Hamilton, TimeSleuth: A Tool for Discovering Causal and Temporal Rules, ICTAI, 2002
  4. K. Karimi and H.J. Hamilton, Logical Decision Rules: Teaching C4.5 to Speak Prolog, IDEAL, 2000
  5. J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
  6. Is See5/C5.0 Better Than C4.5?
Bearbeiten