[cdwg] Lustre 2.5 Development Planning

Dilger, Andreas andreas.dilger at intel.com
Tue Jun 4 13:53:12 PDT 2013


On 2013/06/04 8:22 AM, "James A Simmons" <uja at ornl.gov> wrote:
>> >LU-3406 - merge raid5-mmp-unplug patch upstream
>> >LU-2442, LU-3305 - quota scaling improvements. These need to be
>> >pushed upstream
>> 
>> These depend on the willingness of the upstream patch maintainers to
>> accept them. Definitely something to track, but no guarantee of
>> completion in 2.5.
>
>True, it could be a while before these patches get merged upstream. We
>have to be persistent, otherwise they will get dropped.
>
>> > LU-684 - dev_rdonly patch is replaced by the Linux fault-injection framework.
>> 
>> This one doesn't have any existing patches, so needs new development
>> work.
>
>I started to play with some code locally. It's pretty easy, but it does
>require you to build a kernel with CONFIG_FAIL_MAKE_REQUEST. I put some
>notes in the JIRA ticket about how to use it. With the test shots for
>2.4 I haven't gotten to really testing it yet. Now I will have some free
>time to play with this again.

I expect it might also be possible to build the dm-fail module against an
existing kernel?  I don't think we'll have any control over which options
the vendor kernels will be built with.
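
For reference, once CONFIG_FAIL_MAKE_REQUEST is enabled the framework is
driven entirely through debugfs and sysfs knobs, as described in
Documentation/fault-injection in the kernel tree.  An untested userspace
sketch (the device name sdb is purely an example, and debugfs is assumed
to be mounted at /sys/kernel/debug):

/*
 * fail_make_request_setup.c: arm block-layer fault injection on one
 * disk.  Requires a kernel built with CONFIG_FAIL_MAKE_REQUEST and
 * debugfs mounted.  The device name "sdb" is an example only.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f == NULL) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* fail 10% of requests, at most 20 times in total */
        write_knob("/sys/kernel/debug/fail_make_request/probability", "10");
        write_knob("/sys/kernel/debug/fail_make_request/times", "20");
        /* arm the injection on the target device */
        write_knob("/sys/block/sdb/make-it-fail", "1");
        return 0;
}

That covers the failure-injection side of what the dev_rdonly patch was
used for, without us carrying a kernel patch of our own.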

>> >And the project I would like to work on related to ldiskfs is to make
>> >ldiskfs patchless against the tip of Linus's tree. Anything that will
>> >not be pushed upstream will be moved into osd-ldiskfs. No JIRA ticket
>> >for this work yet.
>> 
>> I'm definitely in support of this.  One of the main patches that needs
>> work to be accepted upstream is the large_xattr patch.  Please see
>> LU-908 for what needs to be done before this patch can land upstream.
>> 
>> The other major file system feature that is not yet landed upstream is
>> the dirdata feature, used for FID-in-dirent.  This doesn't have a lot
>> of appeal to non-Lustre users today, but there may be a way to get this
>> included upstream as part of an "attributes in dirent" feature that Ted
>> discussed at one time.  I'm not sure if he is working on that, but I
>> can ask him.
>
>Do you have mailing list links to these attribute discussions? It would
>be nice to see the framework so we can move to it. Patchless ldiskfs is
>also a big task, so we might not finish by the 2.5 release.

There was only a short discussion about this on the list:
http://lists.openwall.net/linux-ext4/2012/08/11/8

>> >LNET work
>> >---------------------------------------------------------------
>> >LU-2456 - Dynamic LNET config support
>> >LU-2950 - LNET route config
>> >LU-2466 - LNET hash tables
>> >LU-2934 - Router Priority
>> >
>> >Enable LNET to process its own checksums and do handshaking with
>> >the ptlrpc layer. No JIRA ticket for this yet.
>> 
>> Could you please explain this more?  What is the benefit of adding
>> another layer of checksumming at the LNET level vs. the existing
>> Lustre-level checksums, except overhead?
>
>There are companies that are using LNET for more than Lustre. These
>other non-Lustre software products need to guarantee the data over the
>fabric as well. Lustre itself only does checksumming for bulk messages
>and ignores small messages, which can also suffer corruption. I agree
>that if a checksum is already done by the Lustre layer then we have no
>need to redo it. That is what I'm referring to by handshaking: if it is
>already done, don't do it again.
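
To make the handshake concrete, it could be as simple as a flag in the
message header.  A hypothetical sketch (the structure, field names and
flag below are invented for illustration and are not actual LNET code;
only the crc32c() helper from <linux/crc32c.h> is real kernel API):

/*
 * Hypothetical: checksum small LNET messages only when the upper
 * layer (e.g. ptlrpc) has not already covered them.
 */
#include <linux/crc32c.h>
#include <linux/types.h>

/* upper layer already checksummed this message (illustrative flag) */
#define LNET_MSG_F_UPPER_CKSUM  0x01

struct lnet_msg_example {               /* illustrative, not real LNET */
        unsigned int     msg_flags;
        void            *msg_payload;
        unsigned int     msg_len;
        u32              msg_cksum;
};

static void lnet_msg_checksum(struct lnet_msg_example *msg)
{
        /* the handshake: if ptlrpc already covered this message,
         * don't pay for a second checksum at the LNET level */
        if (msg->msg_flags & LNET_MSG_F_UPPER_CKSUM)
                return;

        msg->msg_cksum = crc32c(~0, msg->msg_payload, msg->msg_len);
}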
>
>> >**********************************************************
>> >LNET changes I would like to see in 2.4.1 if possible as well :-)
>> >**********************************************************
>> >LU-2212 - add crc32c module loading to libcfs
>> 
>> There is no objection to this patch landing, except that nobody has
>> reported in that bug that the patch actually fixed the problem for
>> them.
>
>I don't know what happened to JNET2000, but this patch makes my life a
>little easier. Without this patch I have to manually modprobe the crc32c
>module before starting Lustre on my Cray test bed compute nodes. It
>auto-magically happens with this patch.

Could you please post this information into the Jira ticket?  If this
patch fixes the problem for you, it should be landed for 2.4 and 2.1.
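
For what it's worth, the mechanism a patch like that can rely on is
simple (this is a sketch of the general approach, not the actual LU-2212
change): allocating a transform through the kernel crypto API makes the
kernel load the module that provides the algorithm via its module alias,
so no manual modprobe is needed:

/*
 * Sketch only: asking the crypto API for crc32c triggers autoloading
 * of the providing module (via its "crypto-crc32c" alias), making a
 * manual "modprobe crc32c" before starting Lustre unnecessary.
 */
#include <crypto/hash.h>
#include <linux/err.h>

static int example_autoload_crc32c(void)
{
        struct crypto_shash *tfm;

        tfm = crypto_alloc_shash("crc32c", 0, 0);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);    /* not built in, not a module */

        /* the transform is usable from here on; free it if the goal
         * was only to get the module loaded */
        crypto_free_shash(tfm);
        return 0;
}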

>> >Wish list work for myself that I most likely will not have ready for
>> >2.5.X
>> >
>> >Enable compression of LNET traffic.
>> 
>> Is there an expectation that this will improve throughput under normal
>> usage, or would this only be good for WAN data transfers?  As it is, the
>> clients are already using considerable CPU for data handling, so I could
>> only see this helping if the client data compression went all the way to
>> the disk (i.e. it is compressed at the Lustre client, saved to disk in
>> compressed form, and then decompressed again at the Lustre client), not
>> at the LNET level.
>
>Exactly what I was thinking. Both ZFS and btrfs support transparent
>compression. Also, we have the e2compr project, so ldiskfs could also
>support compression in place.

I don't think e2compr ever made it anywhere, and I haven't heard about it
in many years.  In any case, there is an open question of whether the OSD
filesystems can be told that they are getting already-compressed data.

> As you pointed out, this would be a plus for WAN data transfers. At the
>same time it is expensive and should not be used unless it is needed or
>your client happens to have the compute power. In the case of clients
>with Intel Phi cards or GPUs, we now have the native CPUs often idle.
>
>> >Fix up Lustre so it can be built with LLVM. First step to compile some
>> >of the more CPU-intensive code in Lustre as TSGI code to be executed
>> >by the GPU.
>> 
>> This is theoretically interesting, but I'm not sure if there is any
>> piece of Lustre code which would actually benefit from GPU offloading.
>> I think that is only useful for CPU-intensive code that is run in a
>> tight loop.  It would also suffer if the code is doing data access,
>> since it would need to do all the data access over the PCI bus in
>> addition to the existing two network<->CPU<->storage transfers.
>
>Today we have the PCI bus access penalty, but that will be going away
>over the next few years. For example, AMD has hUMA in the works:
>
>http://www.theregister.co.uk/2013/05/01/amd_huma
>
>As for what code I would target: well, the compression code :-)

This puts this firmly into the "some day when it is ready" category, and
not really "ready for 2.5"...

I'm not against any of these projects, but if the development hasn't
already started on some major feature it is probably already too late
for 2.5.

>> Given that Lustre does auto-negotiation of the best checksum algorithms
>> between the client and OST to use the hardware CRC support of the CPU,
>> do you have any candidates that might benefit from this?
>
>Besides compression, I can see encryption handling also benefiting. It
>is true that checksumming could also benefit. The most common use case
>for GPU offloading that has already been done is software RAID, but I
>have no idea if LAID will ever become a reality anymore.

Lustre hashing is already done by using the kernel cryptoapi, and I don't
think it makes sense to expose Lustre to the details of the implementation.
If there is a fast GPU-based crypto or hashing code then it should be
added as a cryptoapi module and every kernel component can benefit.
Similarly, if clients require this functionality without a GPU, they need
to do this in software, and cryptoapi is expected to have the most
efficient assembly versions of the various algorithms.
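
To illustrate that last point: an accelerated implementation only needs
to register with the cryptoapi under the existing algorithm name at a
higher priority, and every in-kernel user picks it up transparently.  A
bare-bones, untested sketch (the callbacks are placeholders rather than
a real accelerated driver, and a real replacement would also have to
implement the algorithm's full interface, e.g. crc32c's setkey for the
seed):

/*
 * Placeholder "accelerated" crc32c: registering under the same
 * cra_name with a higher cra_priority than the generic module makes
 * all crypto_alloc_shash("crc32c", ...) users, Lustre included, get
 * this version automatically.
 */
#include <linux/module.h>
#include <crypto/internal/hash.h>

static int fast_crc32c_init(struct shash_desc *desc)
{
        *(u32 *)shash_desc_ctx(desc) = ~0;
        return 0;
}

static int fast_crc32c_update(struct shash_desc *desc, const u8 *data,
                              unsigned int len)
{
        /* placeholder: a real driver would hand off to the hardware */
        return 0;
}

static int fast_crc32c_final(struct shash_desc *desc, u8 *out)
{
        *(u32 *)out = ~*(u32 *)shash_desc_ctx(desc);
        return 0;
}

static struct shash_alg fast_crc32c_alg = {
        .digestsize     = 4,
        .init           = fast_crc32c_init,
        .update         = fast_crc32c_update,
        .final          = fast_crc32c_final,
        .descsize       = sizeof(u32),
        .base           = {
                .cra_name        = "crc32c",
                .cra_driver_name = "crc32c-fast-example",
                .cra_priority    = 300, /* above crc32c-generic */
                .cra_blocksize   = 1,
                .cra_module      = THIS_MODULE,
        },
};

static int __init fast_crc32c_mod_init(void)
{
        return crypto_register_shash(&fast_crc32c_alg);
}

static void __exit fast_crc32c_mod_exit(void)
{
        crypto_unregister_shash(&fast_crc32c_alg);
}

module_init(fast_crc32c_mod_init);
module_exit(fast_crc32c_mod_exit);
MODULE_LICENSE("GPL");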

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




