The following post contains my notes from the GPFS User Group meeting in York.
Doris Conti, Director, Spectrum Scale (GPFS) and HPC SW Product Development
Doris' Keynote started the day with a theme that continued throughout the day: IBM are encouraging their users to get in touch with developers to help steer the direction of GPFS. They are also looking for customers to join Beta programs for various components of GPFS.
Scott Fadden
First of all we received a friendly reminder of the new naming and its mapping: Spectrum Scale (GPFS) is part of Spectrum Storage. Spectrum Control can now manage / monitor Spectrum Scale.
GPFS 4.1.1 will be released in June 2015 and will contain the following new features:
GPFS 4.2 is aimed at a Q3 2015 release.
GPFS is aiming to provide a single namespace to client workstations, Hadoop, compute farms, users and applications. They do this by providing a number of access methods that they refer to as “The Protocols”. These include:
A “Protocol Node” is a bundled software stack that aims to:
Moving from the kernel NFS server to Ganesha 2.2, a userland NFS server. This move is for increased performance and so that the GPFS team can fix problems themselves. The former is always a surprise to me, that a userland service can perform better. But the latter could not come soon enough: the fact that GPFS has to resort to rebooting a node if it detects a stuck NFS lock gives some insight into how problematic the kernel NFS server can be.
Ganesha also has good support for NFSv4 as well as NFSv3.
In some situations there is a group limit of 16 that needs to be addressed; it depends on the authentication protocol being used. IBM have a table for this somewhere.
GPFS will provide a fully compliant OpenStack Swift REST interface (PUT, POST, GET, DELETE). There will also be an Amazon S3 protocol emulation layer for those users who require it.
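To give a flavour of what that looks like from a client, a Swift-style request is just HTTP against the proxy. A rough sketch (the host name, port and AUTH_demo account are assumptions, not details from the talk, and a Keystone token is assumed to have been obtained already):

```
# Hedged sketch of talking to a Swift-compatible endpoint; protonode1,
# port 8080 and the AUTH_demo account are made-up names.
TOKEN="$OS_AUTH_TOKEN"   # assume a Keystone token was fetched beforehand

# Upload an object
curl -X PUT -H "X-Auth-Token: $TOKEN" -T report.dat \
     http://protonode1:8080/v1/AUTH_demo/mycontainer/report.dat

# Fetch it back
curl -H "X-Auth-Token: $TOKEN" \
     http://protonode1:8080/v1/AUTH_demo/mycontainer/report.dat -o copy.dat
```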
The SMB offering uses Samba 4.3. It has full support for SMB2 and SMB2.1. SMB3 support includes all the mandatory features plus SMB encryption.
Directory change notification is turned off by default as it has a performance impact, but it can be enabled if required.
A summary of some of the limits of each protocol:
| | NFS | Object | SMB |
|---|---|---|---|
| Max # protocol nodes | 32 | 16 | 16 |
| Max # of “shares” | 100 exports | 4M containers | 1000 exports |
| Max # of connections | 4000-5000 / node | 3000 / node | |
| Max # of files | 9 billion / FS | 1 billion / FS | 9 billion / FS |
| Rolling upgrade? | Yes | | |
IBM are keen to make the install process easier. I'm personally not sure how useful this is as the install process isn't that difficult already and it will usually be performed by someone fairly experienced.
What is more interesting is that upgrades should become easier. One of the big benefits is that kernel module creation and installation should be automated.
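For context, this is the part of the install that the automation would take care of. The traditional manual build of the portability layer looks roughly like this (a sketch from memory, not from the talk):

```
# The manual kernel module (portability layer) build that the new tooling
# should automate; steps from memory, run as root on each node.
cd /usr/lpp/mmfs/src
make Autoconfig       # detect the running kernel and generate build config
make World            # compile the GPFS kernel modules
make InstallImages    # install the modules for the current kernel
```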
Immutable filesets were introduced in 2007 for the Integration Archive product. Now this is being exposed as a standard GPFS feature via a new mmchfileset option. A fileset will be able to be in the following modes:
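My notes don't capture the list of modes or the option name. Assuming it is the --iam-mode flag that later documentation describes, switching a fileset's mode might look like this sketch:

```
# Hedged sketch: the --iam-mode option name and the "noncompliant" mode are
# assumptions based on later Spectrum Scale documentation, not on the talk.
mmchfileset fs1 archive-fileset --iam-mode noncompliant
```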
Other interesting new features include:
- mmbackup will be able to operate at a per-fileset level instead of per-filesystem. This should make it easier to split backups into more manageable chunks.
- …read fastest, which should improve read performance.
- …mmapplypolicy by using the --sort-command option (see the sketch after this list).
- …empty option does not scan for drained data.
- --inode-criteria criteriafile -o inoderesultfile prints interesting inodes.
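As a rough illustration of the --sort-command option mentioned above (the policy file and the replacement sort binary are made-up names, not from the talk):

```
# Hedged sketch: point mmapplypolicy at an alternative sort implementation.
# The policy file and /usr/local/bin/parallel-sort are illustrative only.
mmapplypolicy fs1 -P /var/mmfs/policies/migrate.pol \
    --sort-command /usr/local/bin/parallel-sort
```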
Scott Fadden

Blame the network... and use `nsdperf` to do it
nsdperf is similar to iperf, but it simulates NSD protocol traffic instead of just IP traffic. This has the following advantages.
While `nsdperf` ships in the GPFS samples directory, it does not require GPFS to be installed on the system.
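Typical usage, as far as I remember it (the compile line and the interactive commands are best double-checked against the notes at the top of nsdperf.C; node names are made up):

```
# Build the tool from the GPFS samples tree (path and flags from memory):
g++ -O2 -o nsdperf -lpthread -lrt /usr/lpp/mmfs/samples/net/nsdperf.C

# Run it in server mode on the nodes to be tested:
./nsdperf -s        # on nodeA and on nodeB

# From a control host, drive a test between them interactively:
./nsdperf
#   > server nodeA
#   > client nodeB
#   > test
#   > quit
```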
Since `nsdperf` just emulates the NSD protocol, it's not useful for looking at real NSD traffic on disk. `mmdiag --iohist` is more useful in that case.
You can query the current AFM state using `mmafmctl fs1 getstate`.
A common assumption is that if the queue length is never zero, the AFM fileset must not
be synced. This might not be true, as there may be read operations in the queue.
Use mmfsadm dump afm to see the active operations. If there are long running
ones, this might suggest a problem.
AFM stats in mmpmon should also be able to show useful performance information.
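In shell terms, the checks described above boil down to something like the following (fs1 being the filesystem name used in the example above; no options beyond those already mentioned are assumed):

```
# Quick AFM health check using the commands mentioned above.
mmafmctl fs1 getstate    # per-fileset AFM state and queue lengths
mmfsadm dump afm         # active operations; long-running entries may
                         # indicate a problem rather than a mere backlog
```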
Christian Bolik
Spectrum Control is the new name for IBM Tivoli Storage Productivity Center (TPC). It traditionally provided improved visibility into FC fabrics and is now expanding into monitoring GPFS.
Unfortunately it currently only updates on a daily basis, so it is of limited use for alerting. It is also a separately licensed product.
The following questions are addressed by Spectrum Control (TPC 5.2.5):
Planned content in 5.2.6:
Future:
Metric data is kept for 3 months by default, but can be tuned. It is stored in a DB2 database and therefore could be dumped out if needed.
Simon Thompson*
Research Support do:
HPC:
If he could change one thing: I/O heat mapping
Research Data Storage:
- Replicated across 2 data centres
- Separate IB fabrics at each DC
- 10GbE links between DCs
- Extended SAN based
- Users can buy space
- Designed and built in partnership with OCF
- SAN over dark fibre
- Clients access a separate Samba cluster
- Plan to put a tape layer in: how will Samba play with HSM? Archive bit?
- PowerFolder sync and share pilot - interesting to see how it works with GPFS
GPFS - OpenStack
Allow users to archive VMs and datasets as part of the archiving process.
Archiving into a Ceph cluster. How to do that automatically? Would be nice if there was an HSM S3 driver.
They are using the cinder driver.
GPFS inside the VMs?
Jason Hick*
NERSC has 21 different NFS servers.
Utilization on group clusters was sporadic; the Ethernet interconnect was under-provisioned.
Introduced new GPFS scratch file system in Genepool cluster
Diverse workload:
Enabled Disaster protections:
TCP kernel setting: need smaller initial send buffers
Prevent head-of-line blocking - they saw congestion-like symptoms without congestion traffic.
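As an illustration of the send-buffer note above: net.ipv4.tcp_wmem is "min default max" in bytes, and shrinking the middle value lowers the initial send buffer. The numbers here are placeholders, not the values NERSC used.

```
# Hedged illustration of lowering the initial (default) TCP send buffer;
# the actual figures from the talk were not given.
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```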
They preferred Debian and initially used Debian 5 with GPFS 3.3
Switched to Debian 6 with GPFS 3.4 - all memory errors ceased and the number of expels was reduced.
Move from Ethernet to IB:
- data: 1PB
- seq: 0.5PB
- projectb: 2.5PB (scratch)
- Scheduler upgrade/enhancements: consider better features for job deps
- Workflow software: help manage work external to compute
- Data management tools: SRM/BeStMan, iRODS, GPFS Policy Manager
Shankar Balasubramanian
DR for Spectrum Scale
- Only works on GPFS independent filesets
- No way of having multiple secondaries
- The primary is an AFM cache fileset; the secondary is an AFM home fileset; they can be created independently
- Only the primary can write to the secondary - it's RO for the rest of the world
- Data always flows from primary to secondary
- The primary is continuously accessible even when the secondary is not accessible
- RPOs are maintained using peer-to-peer snapshots between primary and secondary
- Failover to the secondary is done by upgrading the secondary to a primary (acting primary)
- Failback to the old primary is done through a downgrade of the acting primary to secondary
- Cannot go back more than 2 snapshots (is this really true?!)
- 15 mins is the default snapshot period
- The async delay is tunable
RPO misses:

- An RPO is missed due to network delay
- afmRPOMiss event: use mmaddcallback to register an event-handling script
- Check every 5 mins whether the RPO is still in the gateway queue
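A sketch of wiring that up (the afmRPOMiss event name is my reading of a garbled note, and the handler script path and parameters are made up):

```
# Hedged sketch: register a handler for AFM RPO misses. The event name, the
# script path and the %-parameters are assumptions to illustrate the idea.
mmaddcallback rpoMissAlert \
    --command /usr/local/sbin/rpo-miss-alert.sh \
    --event afmRPOMiss \
    --parms "%eventName %fsName %filesetName"
```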
HSM support not in the first release.
Failure recovery
failback, create a …
Avail in 4.1.1
Stefan Bender
bob: mmimgbackup - backup metadata
stefan.bender@de.ibm.com
mmbackup
The cycle:

1. Start mmbackup
2. Use the existing shadow DB or query the TSM server to generate a new shadow database
3. Perform a filesystem scan
4. Compare the scan results and the shadow database
5. Perform expire / update / send using the TSM BA CLI
6. Back up the shadow DB
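For reference, kicking off that cycle for a whole filesystem looks something like this (the TSM server and node names are made up; only standard mmbackup flags are used):

```
# Hedged example: incremental backup of /gpfs/fs1 to a TSM server, spread
# across two nodes. tsmsrv1, nsd1 and nsd2 are made-up names.
mmbackup /gpfs/fs1 -t incremental --tsm-servers tsmsrv1 -N nsd1,nsd2
```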
3.5 TL3 updates:
ACL or extended attribute changes are considered file changes in TSM's eyes.
4.1 updates:
improved env verification
mmbackup options and their TSM equivalents:

| mmbackup option | TSM option |
|---|---|
| --max-backup-size | TXNBYTELIMIT |
| --max-backup-count / --max-expire-count | TXNGROUPMAX |
| --expire-threads / --backup-threads | RESOURCEUTILIZATION (TSM BA client) / MAXSESSIONS |
The mmbackup --max-backup-size value should be larger than the TSM server's limit; the TSM server will chunk it into batches.
TSM BA client file list expiration processing was improved in TSM 6.4.1 - it can now do multiple expirations per transaction (previously only 1).
TSM include and exclude options may have a significant impact on scan performance:
- Use as few EXCLUDE statements as possible
- Avoid using INCLUDE; use EXCLUDE instead
- Do not use EXCLUDE /dir/.../*; use EXCLUDE.DIR instead
- Do not combine EXCLUDE and INCLUDE for one subtree
- If INCLUDE is only used to assign the right management class in TSM, use the mmbackup service flag instead
serialize backups of different file systems!
Character limitations: files with Ctrl-X, Ctrl-Y, carriage return or the newline character in their name can't be backed up to TSM. Use QUOTESARELITERAL if file names contain “ or '. Use WILDCARDSARELITERAL if file names contain * or ?.
Sven Oehme
Average: 1 bit flip per 1PB. GNR end-to-end integrity checking prevents this.
Baseline testing with ior:

- api: POSIX
- access: file-per-process
- ordering in a file: sequential offsets
- ordering inter file: no tasks offsets
- clients: 32 (4 per node)
- repetitions: 100
- xfersize: 1 MiB
- blocksize: 128 GiB
- aggregate filesize: 4096 GiB
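For reference, a run with those parameters might be launched roughly like this (a hedged sketch: the MPI launcher arguments, node layout and output path are assumptions; only standard IOR flags are used):

```
# Hedged sketch of an IOR run matching the parameters above.
#   -a POSIX : POSIX API          -F      : file-per-process
#   -t 1m    : 1 MiB transfers    -b 128g : 128 GiB per task
#   -i 100   : 100 repetitions
mpirun -np 32 --map-by ppr:4:node \
    ior -a POSIX -F -t 1m -b 128g -i 100 -o /gpfs/fs1/ior-testfile
```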
GS4-SSD benchmark results:

| Filesystem blocksize | Write (MB/sec) | Read (MB/sec) |
|---|---|---|
| 1 MB | 17139 | 20858 |
| 4 MB | 18205 | 26110 |
| 8 MB | 19201 | 24457 |

GS4-SAS:

| Filesystem blocksize | Write (MB/sec) | Read (MB/sec) |
|---|---|---|
| 1 MB | 1709 | 3029 |
| 4 MB | 4039 | 6715 |
| 8 MB | 4665 | 7666 |
| 16 MB | 5619 | 8858 |
Use at least a 4MB block size for most new filesystems; there isn't much waste since <4KB files are stored in inodes.
The NSD server can do 200,000 IOPS. This will go up to 350,000 in 4.1.1.
Big performance improvements from 3.4 to 3.5, and even more in 4.1.
Scatter vs cluster block allocation: scatter doesn't slow down as much as the filesystem fills up.
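Putting the block-size and allocation advice together, creating a filesystem along those lines might look like this (a hedged sketch: the filesystem name, NSD stanza file and 4K inode size are my assumptions, not figures from the talk):

```
# Hedged sketch combining the advice above: 4MB blocks, 4K inodes so small
# files live in the inode, and scatter block allocation. fs1 and the stanza
# file path are made-up names.
mmcrfs fs1 -F /var/mmfs/config/nsd.stanza -B 4M -i 4096 -j scatter
```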
Using a GL2 (NL-SAS) as a TSM backend: gets 5Gb/s.