Data Sets
Reel Two continually develops new datasets to test the performance and robustness of the Classification System technology. The sample datasets included here come from a variety of applications in different industries and demonstrate the Classification System's ability to work with both text and nominal data formats.
The table lists statistics for each dataset, including "Build Time" and "F-Measure". Build Time is the time to load, model and evaluate (using Leave-One-Out evaluation) a dataset on a WinXP/1GHz Celeron/256MB computer. F-Measure is the micro-averaged F-Measure across all categories in the dataset.
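As a rough illustration of micro-averaging, the sketch below pools true positives, false positives and false negatives across all categories before computing precision, recall and their harmonic mean. It assumes the balanced F-Measure (F1), which the text does not state explicitly. The function name `micro_f_measure` and the example counts are hypothetical and are not taken from the Classification System or from the table.

```python
from typing import Dict, Tuple

# category -> (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def micro_f_measure(counts: Counts) -> float:
    """Micro-averaged F1: pool the counts over all categories, then score once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct positive decisions at all; define the score as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-category counts for a two-category problem.
example_counts = {"car": (90, 10, 5), "cat": (80, 5, 10)}
score = micro_f_measure(example_counts)
```

Because the counts are pooled first, large categories influence the micro-averaged score more than small ones.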
| Dataset | Description | Categories | Instances | Build Time | F-Measure |
|---|---|---|---|---|---|
| Reuters-21578 (Top 10) | The Reuters News research dataset is a compilation of news stories from Reuters News organized into a number of topics. Identifying the documents from the 10 largest categories is one of the most popular text categorization tests. Download: Reuters.ratz. License: Restricted. Original dataset: maintained by David Lewis of AT&T. | 10 | 2,535 | 15 seconds | 0.9121 |
| Gene Ontology (GO) MEDLINE Abstracts | The GO dataset is a collection of MEDLINE research abstracts that have been classified according to the Gene Ontology, a structure encoding information about gene products and functions. Download: Gene Ontology.ratz. License: Restricted. Original dataset: maintained by the United States National Library of Medicine. | 72 | 2,721 | 45 seconds | 0.7242 |
| Jaguar: Car or Cat | Reel Two created this dataset as a basic demonstration of the categorization task. Every document contains the word "Jaguar"; the task is to decide whether it is about the car or the cat. Download: Jaguar.ratz. License: Public Domain. | 2 | 200 | n/a | n/a |
| Language Recognition | The Reel Two Classification System supports 25 languages via the built-in facilities of the Java programming language. Reel Two created this dataset to demonstrate that capability. News was sampled from a variety of news sources around the world. Download: Multilingual.ratz. License: not specified. | 25 | 626 | 10 seconds | 0.9774 |
| Steel Annealing | Download: Anneal.ratz. License: Public Domain. Original dataset: maintained by the University of California, Irvine (UCI). | 6 | 798 | 5 seconds | 0.9211 |
| Diabetes Detection | Download: Diabetes.ratz. License: not specified. Original dataset: maintained by the University of California, Irvine (UCI). | 2 | 421 | 2 seconds | 0.7981 |
| Gene Splicing | Download: Splice.ratz. License: not specified. Original dataset: maintained by the University of California, Irvine (UCI). | 3 | 3,190 | 15 seconds | 0.9128 |
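The Build Time figures include a Leave-One-Out evaluation, in which each instance is held out in turn, a model is built from the remaining instances, and the held-out instance is classified. The sketch below shows that loop on nominal data. The classifier is a deliberately simple nearest-neighbour stand-in, not the Classification System's model, and the names `nearest_neighbour`, `leave_one_out_accuracy` and the toy data are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One instance: (nominal attribute values, category label)
Instance = Tuple[Sequence[str], str]
Classifier = Callable[[List[Instance], Sequence[str]], str]


def nearest_neighbour(train: List[Instance], attrs: Sequence[str]) -> str:
    """Stand-in classifier: label of the training instance that shares
    the most attribute values with the query."""
    def overlap(inst: Instance) -> int:
        return sum(a == b for a, b in zip(inst[0], attrs))
    return max(train, key=overlap)[1]


def leave_one_out_accuracy(data: List[Instance],
                           classify: Classifier = nearest_neighbour) -> float:
    """Hold out each instance once, train on the rest, and score the prediction."""
    correct = 0
    for i, (attrs, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every instance except the i-th
        if classify(train, attrs) == label:
            correct += 1
    return correct / len(data)


# Hypothetical toy data with two nominal attributes per instance.
toy = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
]
accuracy = leave_one_out_accuracy(toy)
```

Leave-One-Out builds one model per instance, so a naive loop like this one scales poorly with dataset size. When every instance receives exactly one predicted category, the accuracy computed here coincides with the micro-averaged F-Measure.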