[cdwg] ORNL test shot results 4-12-2013

James A Simmons uja at ornl.gov
Thu May 2 12:04:08 PDT 2013


In the interest of helping out the community, we have decided it is best
to share our test results for the pre-release Lustre code after each of
our test shots.

These are the results from the April 13th test shot.

OLCF Portion
Lustre 2.4 Testing Information

After the last test shot on March 18th, several Lustre software bugs were
reported. At the smaller scale of the single-cage Arthur Cray test bed
we managed to replicate those issues. With the recent code base, the
large stripe issue appeared to have been resolved, so it was decided to
include the large stripe count test set. As has been the case, DVS is
still not functional with the current LNet version in the 2.3.63 code
branch. We are currently in contact with the Cray DVS developer to
resolve this issue as soon as possible.

The test shot plan was laid out in the standard three phases that have
been used in previous test shots. In the first phase ORNL tested the
special 2.3.63 (pre-2.4) Cray clients; in the second phase Cray tested
the 2.3.63 clients with Cray's I/O stress suite; in the third phase ORNL
repeated the first phase with the default (Cray) 1.8.6 clients. For all
three phases the back-end storage ran Lustre 2.3.63 on an RHEL 6.3
image. The ORNL test set consisted of an S3D run, mdtest benchmarking,
and several combinations of IOR runs. For all intents and purposes, the
ORNL tests were targeted at evaluating Lustre 2.3.63 functionality at
large scale.
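
The exact benchmark parameters are not reproduced in this report, but
for readers unfamiliar with the test set, the IOR and mdtest portions
are driven by commands roughly along the lines of the sketch below. All
rank counts, block/transfer sizes, stripe settings, and paths here are
illustrative placeholders, not the values used on Titan.

    # Sketch only -- every value below is a placeholder, not the actual
    # configuration used during the test shot.

    # Wide-stripe, single-shared-file (SSF) IOR hero run:
    lfs setstripe -c -1 /lustre/scratch/ior_ssf
    aprun -n 8192 -N 8 ior -a POSIX -w -r -C -e \
        -b 4g -t 1m -o /lustre/scratch/ior_ssf/testfile

    # File-per-process (FPP) IOR hero run on the default stripe count:
    aprun -n 8192 -N 8 ior -a POSIX -w -r -F \
        -b 4g -t 1m -o /lustre/scratch/ior_fpp/testfile

    # mdtest metadata benchmark, files only, unique directory per task:
    aprun -n 4096 -N 8 mdtest -F -u -n 1000 -i 3 -d /lustre/scratch/mdtest

Here "lfs setstripe -c -1" stripes the shared file across all OSTs,
which is what the large/wide stripe hero runs below refer to.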

The file system start-up process began at 12:30pm after the bulk of the
maintenance activities were completed by the HPC Operations
Infrastructure team. The start-up process encountered an oops while
mounting the MDT. This was caused by the recent FID format change that
landed in the Lustre code, which required the test file system to be
reformatted. After this detail was revealed, the file system was
reformatted. By 3:30 the file system was ready to mount, but one OST
failed to mount; this was traced to an InfiniBand network issue where
the Subnet Manager was not allowing new IPoIB connections because of an
error in the fabric. The resolution was to restart the Subnet Manager,
and the file system was then successfully mounted at 4:30. Next, the
file system was taken down to set the failover node-pairs. In the
previous test shot, setting the failover NIDs did not work, but this
time it was successful using the workaround provided after the last
test shot. The failover pairs were set on widow-oss5a[1-4]. The file
system was successfully mounted on a test node at 5:37. Titan rebooted
and mounted the Lustre 2.4 file system at 7:32 PM. Once Titan was
successfully booted, issues with the job scheduler were discovered
because the machine had been reassembled into 200 cabinets just before
the test time. By 9:50 PM the job scheduler issues were resolved and we
began testing.
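
(The actual workaround is not reproduced here; as general background,
failover NIDs for an OSS pair are normally declared with tunefs.lustre
along the lines below, where the NID and device/mount paths are made up
for illustration.)

    # Sketch only -- NID and device paths are placeholders.
    # On one node of an OSS pair, declare its partner's NID as the
    # failover node for each OST it serves:
    tunefs.lustre --failnode=10.36.226.72@o2ib /dev/mapper/ost0021

    # Remount the target so the updated parameters take effect:
    mount -t lustre /dev/mapper/ost0021 /mnt/lustre/ost0021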

The first job we ran was the IOR hero run for large stripe. Just as in
the previous test shot, the job failed. The same job was launched
again, this time with full debugging turned on. The Lustre debug
daemons running on the servers were unable to keep up with the messages
generated, but a large volume of logs was collected up to the point of
the job failure. The server- and client-side logs were then collected
on a management server. Once the transfer of the log files was complete
and the debug daemon was disabled, we moved on to the IR (imperative
recovery) test starting at 12:21 AM. To test this feature we ran the
hero SSF (single shared file) test and then powered off widow-oss5a2.
The OSS recovered successfully in 3:58. We saw a few soft lockups, but
nothing that prevented recovery from succeeding. Soon after recovery,
we saw the single shared file run fail with ENOSPC, as in the wide
stripe testing. We began to think it was due to grant space, but later
discussion with Intel engineers suggests this is not the case. After
this test, we continued with the final workload of S3D, mdtest, and IOR
scaling. These jobs ran to completion with no problems.
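
(For context, "full debugging" and the debug daemon mentioned above are
normally driven through lctl roughly as sketched below; the log paths,
buffer sizes, and file system path are placeholders, not the settings
used on the widow servers. The last two commands are the kind of
client-side check used when deciding whether an ENOSPC is grant-related
or real space exhaustion.)

    # Sketch only -- paths and sizes are placeholders.
    # Enable all Lustre debug flags and enlarge the in-kernel debug buffer:
    lctl set_param debug=-1
    lctl set_param debug_mb=1024

    # Stream debug messages to a binary log until stopped:
    lctl debug_daemon start /tmp/lustre-debug.bin 2048
    # ... run the failing IOR job here ...
    lctl debug_daemon stop

    # Convert the binary log to text before copying it to the
    # management server:
    lctl debug_file /tmp/lustre-debug.bin /tmp/lustre-debug.txt

    # On a client, compare reported free space against the per-OSC grant:
    lfs df /lustre/scratch
    lctl get_param osc.*.cur_grant_bytes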




CRAY Portion
High-level summary of the Cray test time on Titan running against the
Lustre 2.4 client and server on 4/13/2013.

Originally the test time was to start at 8pm CST on Friday 4/12; this
changed early in the day to midnight.

Time line of events:

        ~12:30am Got access to the system.

John Lewis reported there was an issue with the file system: some jobs
were getting "No space" errors, and he wanted to know how much that
would affect my run. I started up a workload for about 15-30 minutes
across the entire machine. I then looked to see how many of the jobs
were getting errors. I found that about 8% of the jobs at that time got
the file system errors. John and I determined this was good enough for
me to continue with testing.

        ~12:50am Started the first real workload on the machine. This
        load consisted of 1 regression stream (looking for correctness)
        and 4 other streams, each of which had a different random core
        count. All together there was more than enough to keep the
        system full. The job core counts were in the thousands to
        multiple thousands for each job.

The regression stream contains all of our I/O tests plus other
functional tests. The 4 other streams had the MDS-intensive tests
disabled so they would not run. I wanted to focus this part of the ORNL
session away from beating only on the MDS.

        ~1:30am Checked the run, system was loaded. File system was
        fairly responsive.
        ~2:30am Checked the system again, most compute nodes were being
        used. File system was responsive.
        ~4:30am Again checked the system, most compute nodes were
        utilized and the file system was a bit sluggish but not bad
        enough to worry about.
        ~9:15am Killed off the current jobs and then changed the
        workload on the machine.

Instead of using all the compute nodes, I focused on a smaller subset,
about 1/4 of the machine, with more, smaller jobs running. Each job was
below 1000 cores, with the same test threads running and still steering
away from the MDS-intensive tests.

        ~9:50am Having issues getting the new workload started; few jobs
        are running. Operations in the Lustre tree
        at /lustre/routed1/scratch/darason are taking tens of minutes.
        ~10:00-10:30am John looked around to see what was happening.
        Commands like /bin/ls and mkdir are taking 15+ minutes
        in /lustre/routed1/scratch/darason

John called for the “Server-side Cavalry”

        11:16am John said they were going to setup to crash dump the MDS
        11:20am Time is up.

Overall, the test time went well. Only the one issue was uncovered. I’ve
looked over the test results, and besides the “No Space” issue hitting
about 10% of the jobs executed, I did not see any other issues.




ORNL Portion
Lustre 1.8 Compatibility Testing Information

The first job we ran was the IOR hero run for large stripe. We
experienced no problems running the 1.8 client against our 2.3.64
servers. After the large stripe test was completed we started our next
hero run using file per process. This job also completed successfully.
After this test, we continued with the final workload of S3D, mdtest,
and IOR scaling. These jobs ran with no problems; all completed except
for the IOR scaling job, which had to be killed due to time
constraints. Once all testing was complete, Titan was shut down to run
pre-acceptance diagnostics. The file system was successfully rebooted
into production and returned to service at 12:05am on 4/14/2013.
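
(As an aside, when mixing 1.8 clients with 2.x servers it is worth
confirming what each side actually negotiated. A minimal client-side
check looks roughly like the following sketch, using only parameter
globs rather than any site-specific names; output details vary by
Lustre version.)

    # Sketch only -- run on a client node.
    # Client-side Lustre version:
    lctl get_param version

    # List configured client devices (MDC/OSC) and their states:
    lctl dl

    # Per-OST connection state as seen by the client:
    lctl get_param osc.*.ost_server_uuid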





