| 4-026 | |
| Non-shared disk cluster: a fault tolerant, commodity approach to hi-bandwidth data analysis | |
| D. Olson, E. Hjort, J. Lauret, M. Messer, A. Shoshani, A. Sim | The STAR experiment is prototyping and developing an approach to high-bandwidth data analysis that uses commodity components in a fault-tolerant fashion. The prototype hardware consists of two small clusters of Linux nodes (about 10 dual-CPU nodes), with a few hundred GB of local disk on each node. One cluster is at the RCF at Brookhaven and the other is at PDSF at LBNL/NERSC. The local disk on each node is not exported on the network, so all processing of data occurs on processors with locally attached disk. A file catalog tracks and manages the placement of data on the local disks and is also used to direct processing to the nodes that hold the requested data. This paper describes the current status of the project and the development plans for a full-scale implementation, consisting of tens of TB of disk capacity and more than 100 processors. Initial ideas for a full implementation include modifications to the HENP Grand Challenge software, STACS (http://gizmo.lbl.gov/stacs), that combine queries of the central tag database to define analysis tasks with a new component, a parallel job dispatcher, that splits analysis tasks into multiple jobs submitted to nodes containing the requested data or to nodes with space to stage some of the requested data. The system is fault tolerant to the extent that individual data nodes may fail without disturbing the processing on other nodes, and a failed node can be restored. Jobs interrupted on a failed node are restarted on other nodes that hold the necessary data. |
| Keywords: | data, commodity, hi-bandwidth |
| Contact: | Dr. Douglas Olson |
| Lawrence Berkeley National Laboratory | |
| dlolson@lbl.gov | |
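
The data-local dispatch idea in the abstract (a file catalog that records which node holds each file, and a dispatcher that splits a task into per-node jobs) can be illustrated with a minimal sketch. This is not the STACS or STAR implementation: the function names, the catalog layout (a dictionary from file name to the nodes holding a copy), and the least-loaded placement rule are all assumptions made for illustration. Restarting interrupted work on another node also assumes that at least one other node holds a copy of the file; files with no live copy are returned separately, since they would need to be staged.

```python
def dispatch(files, catalog, failed=frozenset()):
    """Split a task (a list of files) into per-node jobs.

    catalog: hypothetical layout, dict of file name -> list of nodes with a local copy.
    failed:  nodes that are currently unavailable.
    Returns (jobs, unplaced): jobs maps node -> files to process there,
    unplaced lists files with no live local copy (candidates for staging).
    """
    jobs = {}
    unplaced = []
    for f in files:
        live = [n for n in catalog.get(f, []) if n not in failed]
        if not live:
            unplaced.append(f)
            continue
        # Assumed policy: least-loaded live node, ties broken by node name.
        node = min(live, key=lambda n: (len(jobs.get(n, [])), n))
        jobs.setdefault(node, []).append(f)
    return jobs, unplaced


def redispatch(jobs, catalog, failed_node):
    """Move the failed node's files to other nodes that hold copies.

    Modifies and returns jobs; work on the surviving nodes is left untouched.
    Placement of the moved files ignores the load already on those nodes.
    """
    interrupted = jobs.pop(failed_node, [])
    moved, unplaced = dispatch(interrupted, catalog, failed={failed_node})
    for node, node_files in moved.items():
        jobs.setdefault(node, []).extend(node_files)
    return jobs, unplaced


# Hypothetical catalog: file "a" is held on two nodes, "d" on none.
catalog = {"a": ["n1", "n2"], "b": ["n1"], "c": ["n2"], "d": []}
jobs, to_stage = dispatch(["a", "b", "c", "d"], catalog)
```

In this sketch a node failure only affects the files assigned to that node, which mirrors the abstract's point that processing on other nodes is undisturbed.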