Lustre (file system)

Infobox software
name = Lustre
developer = Sun Microsystems
latest release version = 1.6.5.1
latest release date = July 10, 2008
operating system = Linux
genre = Shared disk file system
license = GPL
website = http://www.lustre.org, http://www.sun.com/software/products/lustre/

Lustre is a shared disk file system, generally used for large-scale cluster computing. The name Lustre is a blend of the words Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.

Lustre is designed, developed and maintained by Sun Microsystems, Inc. with input from many other individuals and companies. Sun completed its acquisition of Cluster File Systems, Inc., including the Lustre file system, on October 2, 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system.

Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's second fastest supercomputer, the Blue Gene/L at Lawrence Livermore National Laboratory (LLNL) [cite web | url=http://top500.org/list/2008/06/100 | title=TOP500 List - June 2008 | date=2008-06-18 | publisher=TOP500.Org]. Other supercomputers that use the Lustre file system include systems at Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and NASA [cite web | url=http://www.nas.nasa.gov/Resources/Systems/pleiades.html | title=Pleiades Supercomputer | date=2008-08-18 | publisher=www.nas.nasa.gov] in North America, the largest system in Asia at Tokyo Institute of Technology [cite web | url=http://www.top500.org/system/8216 | title=TOP500 List - November 2006 | publisher=TOP500.Org], and one of the largest systems in Europe, at CEA [cite web | url=http://www.top500.org/system/8237 | title=TOP500 List - June 2006 | publisher=TOP500.Org].

Lustre file systems can support up to tens of thousands of client systems, petabytes (PB) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their data centers. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors. [cite web | url=http://video.google.ca/videoplay?docid=965817030102279091 | title=Lustre File System presentation | accessdate=2008-01-28 | publisher=Google Video]

History
The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at
Carnegie Mellon University at the time. Braam went on to found his own company, Cluster File Systems, which released Lustre 1.0 in 2003. Cluster File Systems was acquired by Sun Microsystems in October 2007.

The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL [http://goliath.ecnext.com/coms2/gi_0199-2958660/LINUX-NETWORX-CLUSTER-BECOMES-3RD.html], one of the largest supercomputers at the time [http://www.top500.org/system/6085].
Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client-side write-back cache accounting.

Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, InfiniBand network support, and support for extents/mballoc in the ldiskfs on-disk filesystem.

Lustre 1.6.0, released in April 2007, supported mount configuration ("mountconf"), allowing servers to be configured with "mkfs" and "mount", supported dynamic addition of "object storage targets" (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and supported free space management for object allocation.

The Lustre file system and associated open source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels that work without patches on the client for easy deployment.

Architecture
A Lustre file system has three major functional units:
* A single "
metadata target" (MDT) per filesystem that stores metadata, such as filenames, directories, permissions, and file layout, on the "metadata server" (MDS)* One or more "object storage targets" (OSTs) that store file data on one or more "object storage servers (OSSes)". Depending on the server’s hardware, an OSS typically serves between two and eight targets, each target a local disk filesystem up to 8 terabytes (TBs) in size. The capacity of a Lustre file system is the sum of the capacities provided by the targets
* "Client(s)" that access and use the data. Lustre presents all clients with standard
POSIX semantics and concurrent read and write access to the files in the filesystem.The MDT, OST, and client can be on the same node or on different nodes, but in typical installations, these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including
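The following minimal sketch illustrates the client's view described in the last bullet: file access uses ordinary POSIX calls with nothing Lustre-specific on the application side. The path /mnt/lustre/example.dat is a hypothetical example of a file on a Lustre mount point.

/* Minimal sketch: ordinary POSIX I/O against a Lustre mount point.
 * The path /mnt/lustre/example.dat is a hypothetical example; the calls
 * themselves are plain POSIX and contain nothing Lustre-specific. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    ssize_t n;

    int fd = open("/mnt/lustre/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (write(fd, "hello, lustre\n", 14) != 14)   /* data is stored on one or more OSTs */
        perror("write");

    n = pread(fd, buf, sizeof(buf) - 1, 0);       /* clients on other nodes see the same file */
    if (n > 0) {
        buf[n] = '\0';
        fputs(buf, stdout);
    }

    close(fd);
    return 0;
}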
The MDT, OST, and client can be on the same node or on different nodes, but in typical installations these functions are on separate nodes, with two to four OSTs per OSS node, communicating over a network. Lustre supports several network types, including InfiniBand, TCP/IP on Ethernet, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.

When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a "logical object volume" (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
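The offset-to-object mapping performed by the LOV follows a round-robin, RAID 0-like pattern. The sketch below is an illustrative, self-contained model of that mapping, not Lustre source code; the structure and function names are hypothetical, and the stripe size and count are example values.

/* Illustrative model (hypothetical names, not Lustre source code): how a
 * layout of stripe_count objects with a given stripe_size maps a file byte
 * range onto per-object extents, each object residing on a separate OST. */
#include <stddef.h>
#include <stdio.h>

struct layout {
    size_t stripe_size;   /* bytes placed on one object before moving to the next */
    int    stripe_count;  /* number of objects (and OSTs) backing the file        */
};

/* Print the object index and object-relative offset for each chunk of the
 * file range [offset, offset + len). A real client would lock each extent
 * and issue the reads or writes directly to the owning OSTs in parallel,
 * with no further MDS involvement. */
static void map_range(const struct layout *lo, size_t offset, size_t len)
{
    size_t end = offset + len;
    while (offset < end) {
        size_t stripe_no  = offset / lo->stripe_size;
        int    obj_index  = (int)(stripe_no % (size_t)lo->stripe_count);
        size_t obj_offset = (stripe_no / (size_t)lo->stripe_count) * lo->stripe_size
                            + offset % lo->stripe_size;
        size_t chunk = lo->stripe_size - offset % lo->stripe_size;
        if (chunk > end - offset)
            chunk = end - offset;
        printf("file bytes %zu..%zu -> object %d, object offset %zu\n",
               offset, offset + chunk - 1, obj_index, obj_offset);
        offset += chunk;
    }
}

int main(void)
{
    struct layout lo = { 1u << 20, 4 };                /* example: 1 MB stripes over 4 objects */
    map_range(&lo, (size_t)3 << 20, (size_t)3 << 20);  /* map a 3 MB range starting at 3 MB */
    return 0;
}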
Clients do not directly modify the objects on the OST filesystems, but instead delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and risk filesystem corruption from misbehaving or defective clients.

Implementation
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
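Because the client sees a single mount, standard system calls report aggregate figures for the whole filesystem. The following is a minimal sketch using POSIX statvfs(); the mount point /mnt/lustre is an assumed example.

/* Minimal sketch: a mounted Lustre filesystem looks like any other mount,
 * so plain statvfs() reports the aggregate capacity of all its OSTs.
 * The mount point /mnt/lustre is a hypothetical example. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs vfs;

    if (statvfs("/mnt/lustre", &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    double total_gb = (double)vfs.f_blocks * vfs.f_frsize / 1e9;
    double avail_gb = (double)vfs.f_bavail * vfs.f_frsize / 1e9;
    printf("capacity: %.1f GB, available: %.1f GB\n", total_gb, avail_gb);
    return 0;
}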
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach was used in the LLNL Blue Gene installation [cite web | url=http://www.hpcwire.com/hpcwire/hpcwireWWW/04/1015/108577.html | title=DataDirect SELECTED AS STORAGE TECH POWERING BLUEGENE/L | publisher=HPC Wire, October 15, 2004: Vol. 13, No. 41].

Another approach uses the "liblustre" library to provide userspace applications with direct filesystem access. Liblustre is a user-level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSs without requiring an intervening data copy through the kernel, thus providing low-latency, high-bandwidth access from computational processors to the Lustre file system. Good performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as the Cray XT3 and Lustre implementations on conventional clustered computational systems.

Lustre 1.8, which is expected to be released in Q4 2008, will focus on improving recovery in the face of multiple failures and on storage management via OST Pools [cite web | url=http://blogs.sun.com/HPC/resource/ISC08-Lustre_Update.pdf | title=Lustre Roadmap and Future Plans | accessdate=2008-08-21 | publisher=Sun Microsystems].

Lustre technology has been integrated into the HP StorageWorks Scalable File Share product [cite web | url=http://www.hp.com/techservers/hpccn/linux_gfs/ | title=Global File systems (SFS) collaboration | publisher=Hewlett Packard].

Data objects and file striping
In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDSs point to one or more objects associated with the file rather than to data blocks. These objects are implemented as files on the OSSs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file without further communication with the MDS.
If only one object is associated with an MDS inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is "striped" across the objects, similar to RAID 0. Striping provides significant performance benefits. When striping is used, the maximum file size is not limited by the size of a single target. Also, capacity and aggregate I/O bandwidth scale with the number of OSTs.
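As a back-of-the-envelope illustration of those scaling properties, the short sketch below multiplies out example figures; the 8 TB per-target size comes from the architecture section above, while the per-OSS bandwidth and stripe count are purely illustrative assumptions.

/* Back-of-the-envelope sketch of how striping raises a single file's size
 * ceiling and bandwidth. The per-target size comes from the text above;
 * the bandwidth figure and stripe count are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    const double target_tb    = 8.0;    /* maximum size of one backing target (see above) */
    const double per_oss_mb_s = 300.0;  /* assumed streaming rate of one OSS (hypothetical) */
    const int    stripe_count = 4;      /* objects/OSTs the file is striped across */

    printf("single-file size ceiling: ~%.0f TB\n", target_tb * stripe_count);
    printf("peak single-file bandwidth: ~%.0f MB/s\n", per_oss_mb_s * stripe_count);
    return 0;
}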
Networking
In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.

LNET supports many commonly used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when supported by underlying networks such as Quadrics Elan, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.
LNET provides end-to-end throughput over
Gigabit Ethernet (GigE) networks in excess of 100 MB/s [cite web | author=Lafoucrière, Jacques-Charles | url=http://hepix.caspur.it/afs/hepix.org/project/strack/hep_pdf/2007/Spring/Lustre-CEA-hepix2007.pdf | title=Lustre Experience at CEA/DIF | publisher=HEPiX Forum, April 2007], throughput up to 1.5 GB/s over InfiniBand double data rate (DDR) links, and throughput over 1 GB/s across 10GigE interfaces.

High availability
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.
Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.
ZFS integration
Lustre 3.0 will allow users to choose between
ZFS and ldiskfs as back-end storage. [cite web | url=http://arch.lustre.org/index.php?title=Architecture_ZFS_for_Lustre | title=Architecture ZFS for Lustre | accessdate=2008-02-18 | publisher=Sun Microsystems]

References
See also
* GlusterFS
* Gfarm file system
* Google File System
* BigTable
* Parallel Virtual File System

External links
* [http://www.sun.com/software/products/lustre/ Lustre File System product page at sun.com]
* [http://www.lustre.org/ Website]
* [http://wiki.lustre.org/ Lustre Users Home]