- DataRush Technology
DataRush Technology, introduced in 2009 by Austin-based technology company Pervasive Software, uses multicore technology to process data sets for analytics and other business applications. The technology enables performance on a single server or small cluster and allows high-throughput analytics on massive datasets. A parallel data flow engine, it is used to power batch processing jobs, and runs data preparation tasks (e.g. sorting, deduplicating, aggregating, cleansing, joining, loading, validating) and machine learning programs like fuzzy matching algorithms.
DataRush uses a dataflow architecture. The architecture implements a program that executes as a graph of computation nodes interconnected by dataflow queues. The nodes use the queues to share data. In this sense, dataflow is a shared-nothing architecture. The lack of share state simplifies node implementation, since threads do not have to synchronize share state. The in-memory, blocking queues implement the synchronization required to safely hand off data from node to node.
In DataRush, the computation nodes of a dataflow graph are known as operators. DataRush provides a library of ready-to-use operator components. Developers can also write custom operators to extend the standard library. For example, several of the sample applications have their own implementations of operators.
To support the creation of a dataflow graph for execution, DataRush provides a composition phase for constructing operators and linking them in an execution graph. Operator properties can be set to determine both operator composition and runtime behavior. At runtime, a composed graph is realized by creating threads for each computation node, creating dataflow queues, and linking nodes. The execution engine also supports monitoring using Java Management Extensions (JMXs). During the execution phase, statistics objects may be created and MBeans instantiated to export profile and debug information. DataRush provides a VisualVM plug-in that can be used within VisualVM to display the exported run-time information.
Pervasive DataRush supports two types of operators, DataflowOperator and DataflowProcess, both Java interfaces. DataflowOperator is a composite operator, used only to compose other operators. After composition, a DataflowOperator no longer exists (it is compiled away). DataflowProcess is an executable operator attached to a thread and executed at runtime. Both operator types can be used to set operator properties and can be linked by dataflow queues.
Dataflow queues are not instantiated at composition time to prevent premature access before runtime. During composition, operators are linked using a flow concept. When an operator is composed, its internal structure is created and its methods exposed for obtaining output flows. The output flow of one operator can be passed as input to another operator, which uses the passed-in flow to complete linking.
ApplicationGraph is a special DataflowOperator, used to create an application to run within the DataRush engine. Like other graphs, it lives at composition time and has an interface for adding operators. Once composed, the ApplicationGraph can be run.
After a graph is composed, it is ready to run. The ApplicationGraph interface defines a run method. On instantiation, engine properties set during composition define monitoring structures. At runtime, threads are launched and the main thread then waits for either normal thread completion or an error.
The DataRush engine includes a deadlock algorithm that is instantiated whenever a thread has to wait on a queue according to certain criteria. The algorithm looks for cycles in the wait graph. If any are found, then deadlock has occurred. Without intervention, graph execution halts while the deadlock algorithm determines which queue is at fault and expands memory for that queue. Deadlocks are thus often transient and occur only occasionally on a graph under particular stress.
The execution of a DataRush application can be monitored using VisualVM, the JMX console that is shipped with the Java JDK. Pervasive DataRush ships with a plugin for VisualVM. The plugin can be found in the plugins directory under the Pervasive DataRush installation. It is contained in the file named datarush-visualvm-*.nbm. Follow the instructions within VisualVM for installing a new plugin. To obtain runtime information, connect to the JVM executing Pervasive DataRush using VisualVM. You can use the DataRush tabs within VisualVM to see the running nodes, view queue information, and obtain general JVM and system information.
- ^ Ericson, W. (January 2011). "The Next Wave in Big Data Analytics: Exploiting Multi-core Chips and SMP Machines". Bye Network. http://www.b-eye-network.com/blogs/eckerson/archives/2011/01/the_next_wave_i.php.
- Krill, P. (February 2011). "Pervasive's parallel development API paired with Hadoop MapReduce". Infoworld. http://www.infoworld.com/d/application-development/pervasives-parallel-development-api-paired-hadoop-mapreduce-014.
- Falgout, J. (January 2011). "Dataflow Programming: A Scalable Data-Centric Approach to Parallelism". JAVA Developer's Journal. http://java.sys-con.com/node/1678918.
- West, J. (July 2009). "LexisNexis Brings Its Data Management Magic to Bear on Scientific Data". HPCWire. http://www.hpcwire.com/features/LexisNexis-Brings-Its-Data-Management-Magic-To-Bear-on-Scientific-Data-51520412.html.
- Sperling, E. (November 2008). "Why can’t apps run faster?". Forbes. http://www.forbes.com/2008/11/01/cio-apps-processors-tech-cio-cx_es_1103apps.html.
- Woods, D. (August 2009). "Waking up multi-core processors". Forbes. http://www.forbes.com/2009/08/24/pervasive-software-multicore-technology-cio-network-processors.html.
- Haff, G. (April 2008). "Pervasive Takes on Multicore Programming". CNET. http://news.cnet.com/8301-13556_3-10214940-t61.hml.
Wikimedia Foundation. 2010.