lists.openfabrics.org/pipermail/general/2006-December.txt

From thomas.bub at...

Sasha,
I'm having trouble getting the patch applied.
I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
path, but after running the ofed-install script the sources in
/usr/local/ofed didn't contain that patch anymore.
Can you help me out of the dark and tell me how to build
libvendor.so out of/on the ofed-1.1/SOURCES tree?

Thanks
Thomas

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Monday, November 27, 2006 5:43 PM
> To: Bub Thomas
> Cc: Tziporet Koren; openib-general at openib.org; Erez Cohen
> Subject: Re: [openib-general] Is an umad_close_port a good idea after I
> disconnect from the SA with osm_vendor_delete ?
>
> On 14:13 Mon 27 Nov , Bub Thomas wrote:
> >
> > Sasha,
> > whom to ask to add this to the osm_vendor functions?
>
> Please test this patch:
>
> diff --git a/osm/libvendor/osm_vendor_ibumad.c b/osm/libvendor/osm_vendor_ibumad.c
> index e82695f..4205b23 100644
> --- a/osm/libvendor/osm_vendor_ibumad.c
> +++ b/osm/libvendor/osm_vendor_ibumad.c
> @@ -545,10 +545,15 @@ osm_vendor_delete(
>      umad_receiver_t *p_ur;
>      int agent_id;
>
> -    /* unregister UMAD agents */
> -    for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> -        if ( (*pp_vend)->agents[agent_id] )
> -            umad_unregister( (*pp_vend)->umad_port_id, agent_id );
> +    if ((*pp_vend)->umad_port_id >= 0) {
> +        /* unregister UMAD agents */
> +        for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> +            if ( (*pp_vend)->agents[agent_id] )
> +                umad_unregister((*pp_vend)->umad_port_id,
> +                                agent_id );
> +        umad_close_port((*pp_vend)->umad_port_id);
> +        (*pp_vend)->umad_port_id = -1;
> +    }
>
>      clear_madw( *pp_vend );
>      /* make sure all ports are closed */
>
> > Or should I file a bug for this
>
> Good idea too.
>
> Sasha

Hi.

> Hi,
> I'm hoping someone here can help me diagnose this problem.
> I have a really simple test app that uses verbs and is failing to create
> a QP on one machine in particular. On other machines the app works and
> behaves as expected without any problems.
>
> The machine in question is 32bit dual CPU Intel system running FC4 and
> the released OFED 1.1 with a Mellanox PCI-X HCA (MT23108)
> [root at localhost test]# uname -a
> Linux localhost.localdomain 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2
> 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
> [root at localhost test]# cat /usr/local/ofed/BUILD_ID
> OFED-1.1
>
> openib-1.1 (REV=9905)
> # User space
> https://openib.org/svn/gen2/branches/1.1/src/userspace
> Git:
> ref: refs/heads/ofed_1_1
> commit a083ec1174cb4b5a5052ef5de9a8175df82e864a
>
> The code in question is pretty simple and as I've said works everywhere
> else I've tried it.
>
> Errno is set to 22, and I've traced the problem to this point in the
> OFED stack, so I can see where it fails but still have no idea why:
> It fails at line 578 in "src/userspace/libibverbs/src/cmd.c" the
> instruction is
> 'write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size'
> cmd_fd looked valid (was 6), cmd looked to point to a valid structure,
> and cmd_size was 96.
>
> This was called from line 533 of src/userspace/libmthca/src/verbs.c
> 'ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp,
> sizeof resp);'
>
> Which was invoked by my code calling ibv_create_qp as seen below:
>
> /* create the qpairs */
> init_attr.send_cq = info->cq_hndl;
> init_attr.recv_cq = info->cq_hndl;
> init_attr.cap.max_send_wr = info->oust_wr_sq; //8
> init_attr.cap.max_recv_wr = info->oust_wr_rq; //8
> init_attr.cap.max_send_sge = info->sg_size_sq; //1
> init_attr.cap.max_recv_sge = info->sg_size_rq; //1
> init_attr.cap.max_inline_data = 1024;
> init_attr.qp_type = IBV_QPT_RC;
>
> if ((info->qp_hndl[CLIENT] =
>      ibv_create_qp(info->pd_hndl, &init_attr)) == NULL) {
>
>     info->failed = 1;
>     rc = ERR_INIT_HCA_FAILED;
>
> }
>
> Any ideas or pointers in the right direction would be greatly appreciated.

I think that the problem is the amount of inline data that you are trying
to use. I suggest that you set max_inline_data to 0, create the QP, check
the value that is returned from the QP creation, and use that value.

I believe that the maximum size that can be used in this attribute is ~420.

Dotan

Hi.

> Hi,
>
> Im using the openib gen2 trunk and was running the performance tests
> from that tree.
> I get a "Segmentation Fault" on running ib_read_bw and the remaining
> tests.
> The output is as follows:
> ------------------------------------------------------------------
> RDMA_Read BW Test
> Connection type : RC
> Segmentation fault
>
> Any particular reason why this is happening?

Can you give some more info, such as:

which driver git/svn version are you using?
which parameters did you use in each side?
which distro are you using?
which computer arch are you using?

thanks
Dotan

On 11/30/06, Ralph Campbell wrote:
> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
> > So what did you change since v1? How do you deal with fitting 64-bit
> > addresses into an sg list entry that has a 32-bit dma_addr_t?

> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store
> anything in the struct scatterlist. The translation is
> done when ipath_sg_dma_address() is called which now
> returns u64 instead of dma_addr_t thus avoiding the truncation
> problem.

And there is this open/TODO of calling kmap(page) at DMA mapping time
(or when ipath_sg_dma_address is called) and kunmap(page) at DMA
unmapping time, where you must store the kvaddr between the two calls,
and the sg has no room for it where dma_addr_t is u32 and
kvaddr is u64 ....

> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
> and ib_sg_dma_address() have been modifed to save the address
> in a u64 instead of a dma_addr_t. This actually wasn't much
> of a change since the address was being cast to u64 anway
> when assigned to struct sge.addr.

It fixes a bug, so it actually is somewhat of a significant change. Without
it, on an arch as mentioned above, ipath_dma_map_single would return only a
u32 portion of the kvaddr, and later the ULP code would place this
chopped address in sge.addr and the ipath driver would use the wrong
address.

Or.
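
The truncation discussed in this message can be illustrated with a small
sketch. It is not ipath driver code: dma_addr32_t and both helper names are
hypothetical, standing in for a 32-bit dma_addr_t on a platform where the
kernel virtual address is 64 bits wide.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit dma_addr_t (name is an assumption). */
typedef uint32_t dma_addr32_t;

/* Storing a 64-bit kernel virtual address in the 32-bit type keeps only
 * the low 32 bits; the high half is silently dropped. */
static dma_addr32_t store_truncated(uint64_t kvaddr)
{
    return (dma_addr32_t)kvaddr;
}

/* Returns 1 if the address survives a round trip through the 32-bit type. */
static int survives_32bit(uint64_t kvaddr)
{
    return (uint64_t)store_truncated(kvaddr) == kvaddr;
}
```

For an address such as 0xffff810012345678, store_truncated() yields
0x12345678, which is the "chopped address" that would end up in sge.addr.
Carrying the value in a u64 end to end avoids the loss.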

Hi Thomas,

On 10:12 Fri 01 Dec , Bub Thomas wrote:
> Sasha,
> I'm having trouble to get the patch applied.
> I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> path but after running the ofed-install script the sources in the
> /usr/local/ofed din't contain that patch anymore.
> Can you help me out of the dark and tell me how to build the
> libvendor.so out of/on the ofed-1.1/SOURCES tree.

Never did it personally, but you may want to look at
https://openib.org/tiki/tiki-index.php?page=OFED+Support
for how ofed_patch.sh does this.

And you can use svn or git versions of management/osm as well.

Sasha


On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> Hi Thomas,
>
> On 10:12 Fri 01 Dec , Bub Thomas wrote:
> > Sasha,
> > I'm having trouble to get the patch applied.
> > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> > path but after running the ofed-install script the sources in the
> > /usr/local/ofed din't contain that patch anymore.
> > Can you help me out of the dark and tell me how to build the
> > libvendor.so out of/on the ofed-1.1/SOURCES tree.
>
> Never did it personally, but you may want to look at
> https://openib.org/tiki/tiki-index.php?page=OFED+Support
> for how ofed_patch.sh does this.
>
> And you can use svn or git versions of management/osm as well.

There's currently no git version of OFED 1.1 OpenSM AFAIK.

-- Hal


On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
>
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now. But how do I compile the userspace packages from that
> stack?
>

You build and install the userspace libraries from the iwarp stable
branch. This will install all the needed header files to build other
packages that depend on them. Like mvapich2-0.9.8, for instance.

If rping is working for you, then you've already done this. The user
libs and header files are all installed in /usr/local by default. If
you have /usr/local/include/rdma/rdma_cma.h, for instance, you've
probably already installed the userspace stuff from the iwarp stable
branch.

To build and install the user libs from the iwarp branch, please see the
wiki howto. There is a section describing installing the userspace
libraries.

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3

Hope this helps...

Steve.

On Thu, 2006-11-30 at 20:57 -0800, Matt Leininger wrote:
> On Thu, 2006-11-30 at 12:30 -0600, Steve Wise wrote:
> > On Thu, 2006-11-30 at 12:12 -0500, Jeff Squyres wrote:
> > > It just clicked in my brain as to why you were asking this question.
> > >
> > > Remember that OMPI currently does not use any CM for OF connections
> > > at all. So it's not like it's using the old CM that doesn't support
> > > iWARP. OMPI uses its own out-of-band mechanism, which, as I
> > > understand it, should work with iWARP just as well as it works for IB.
> > >
> > > Am I incorrect in thinking that? (I have no iWARP hardware to test
> > > with)
> >
> > iWARP _requires_ the RDMA-CM for connection setup...
> >
> > So OMPI as it stands today won't work over iwarp devices.
> >
> > Right now, the only non-uDAPL MPI solution that will work with the iwarp
> > stable svn branch + 2.6.17 RNFS is MVAPICH2.
> >
> > If you utilize uDAPL, then Intel and HP have MPI libs that might work...
>
> OMPI also has a uDAPL network device (along with a device that uses
> verbs directly). So if we just use OMPI uDAPL it should work over
> iWarp?
>

It should. You might have to tweak OMPI slightly to work with uDAPL
from the iWARP branch. Or take the latest uDAPL and back-port it to the
iwarp branch.

Steve.

OpenSM/osm_sm.c: In osm_sm_mcgrp_join, use CL_PLOCK_RELEASE macro
rather than calling cl_plock_release directly

Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c
index 9aa4a36..100f2a0 100644
--- a/osm/opensm/osm_sm.c
+++ b/osm/opensm/osm_sm.c
@@ -740,7 +740,7 @@ osm_sm_mcgrp_join(
   status = osm_port_add_mgrp( p_port, mlid );
   if( status != IB_SUCCESS )
   {
-    cl_plock_release( p_sm->p_lock );
+    CL_PLOCK_RELEASE( p_sm->p_lock );
     osm_log( p_sm->p_log, OSM_LOG_ERROR,
              "osm_sm_mcgrp_join: ERR 2E03: "
              "Unable to associate port 0x%" PRIx64 " to mlid 0x%X\n",

On 09:27 Fri 01 Dec , Hal Rosenstock wrote:
> On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> > Hi Thomas,
> >
> > On 10:12 Fri 01 Dec , Bub Thomas wrote:
> > > Sasha,
> > > I'm having trouble to get the patch applied.
> > > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> > > path but after running the ofed-install script the sources in the
> > > /usr/local/ofed din't contain that patch anymore.
> > > Can you help me out of the dark and tell me how to build the
> > > libvendor.so out of/on the ofed-1.1/SOURCES tree.
> >
> > Never did it personally, but you may want to look at
> > https://openib.org/tiki/tiki-index.php?page=OFED+Support
> > for how ofed_patch.sh does this.
> >
> > And you can use svn or git versions of management/osm as well.
>
> There's currently no git version of OFED 1.1 OpenSM AFAIK.

What about 1.1 git branch? This is same as SVN's 1.1. :)

Sasha

What about iWARP support?

Arkady Kanevsky                  email: arkady at netapp.com
Network Appliance Inc.           phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.     Fax: 781-895-1195
Waltham, MA 02451                central phone: 781-768-5300

> -----Original Message-----
> From: Bill Boas [mailto:bboas at systemfabricworks.com]
> Sent: Thursday, November 30, 2006 2:06 PM
> To: 'OPENIB'; openib-promoters at openib.org;
> openfabrics-iwg at openfabrics.org
> Cc: 'Tziporet Koren'; 'Jeff Squyres'; 'EWG'
> Subject: [openfabrics-iwg] OFED 1.2 contents and schedule as
> proposed by the EWG
>
> Following the Developer Summit discussions in Tampa the EWG
> is proposing the contents and schedule for OFED 1.2 as
> described on their wiki
>
> https://openib.org/tiki/tiki-index.php?page=OFED+release+procedure
>
> Many members of the OpenFabrics Board could not be present at
> the summit and many members of the OpenFabrics community were
> also not present.
>
> Also the IWG is planning for its next Interoperability Test
> Event after which it is probable that the OpenFabrics Logo
> program should be in effect.
>
> Please review this proposal from the EWG carefully to ensure that if:-
>
> 1) you represent your company in the OpenFabrics community
> that your company's product needs in the spring and early
> summer of 2007 will be met by OFED 1.2 as proposed;
>
> 2) you are a customer or end user that may wish to deploy
> OFED 1.2 after its release and distribution that it looks
> like it will contain what you need for your installations by then;
>
> 3) you are working for a Linux distribution then the
> schedule, process and testing planned by the EWG and the IWG
> meet your requirements and schedule;
>
> 4) your interests do not align with the 3 identified above
> but you are also planning to use OFED 1.2 please speak up and
> give the community feedback.
>
> Any other feedback or comments are welcome.
>
> In my role in the Alliance I'd like to thank Tziporet, Jeff,
> Nimrod, Aviram, Bob, Hal, Sean, Tom, Or, Betsy, Roland,
> (please forgive me if I left out your name) and everyone who
> has been working in the EWG for their tremendous individual
> contributions to the Alliance and kernel software.
>
> Bill Boas
> VP, Business Development | System Fabric Works
> bboas at systemfabricworks.com | 510-375-8840
>
>
> -----Original Message-----
> From: openfabrics-ewg-bounces at openib.org
> [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of
> Tziporet Koren
> Sent: Thursday, November 30, 2006 6:06 AM
> To: EWG
> Cc: OPENIB
> Subject: [openfabrics-ewg] reminder: OFED 1.2 meeting next Monday
>
> Hi All,
> I wish to remind all that we have the EWG meeting on Monday
> 4-Dec at 9am-10am.
> Jeff already sent all details.
>
> Agenda: close OFED 1.2 features after each owner approve that
> the schedule can be met (meaning code complete on end of January)
>
> See also
> https://openib.org/tiki/tiki-index.php?page=OFED+release+procedure
> for details on the features.
>
> Tziporet
>
> _______________________________________________
> openfabrics-ewg mailing list
> openfabrics-ewg at openib.org
> http://openib.org/mailman/listinfo/openfabrics-ewg
>
> _______________________________________________
> openfabrics-iwg mailing list
> openfabrics-iwg at openfabrics.org
> https://openfabrics.org/mailman/listinfo/openfabrics-iwg
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

I managed to get the test working. I just restarted the server and
it was working.
I'm actually doing some work with the Xen VMM and InfiniBand.
I have set up 2 servers (Pentium D, x86_64 arch) with Red Hat
Enterprise Linux 4 and the Xen 3 VMM running. The IB driver seems to be
working on dom0 in any case and I can do all of the perf tests.
I wanted to know if there is any correlation between the QoS setup
done using OpenSM and the perf tests, i.e. if I configure QoS in
opensm.opts, should I be seeing marked differences in the BW from the
perf tests?
Is there any kind of documentation that gives an idea of how the BW can
change for different QoS params?

Regards,
Adit


-- 
Adit Ranadive
Freshman,
Georgia Institute of Technology,
Atlanta, GA

On Fri, 2006-12-01 at 10:30, Sasha Khapyorsky wrote:
> On 09:27 Fri 01 Dec , Hal Rosenstock wrote:
> > On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> > > Hi Thomas,
> > >
> > > On 10:12 Fri 01 Dec , Bub Thomas wrote:
> > > > Sasha,
> > > > I'm having trouble to get the patch applied.
> > > > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> > > > path but after running the ofed-install script the sources in the
> > > > /usr/local/ofed din't contain that patch anymore.
> > > > Can you help me out of the dark and tell me how to build the
> > > > libvendor.so out of/on the ofed-1.1/SOURCES tree.
> > >
> > > Never did it personally, but you may want to look at
> > > https://openib.org/tiki/tiki-index.php?page=OFED+Support
> > > for how ofed_patch.sh does this.
> > >
> > > And you can use svn or git versions of management/osm as well.
> >
> > There's currently no git version of OFED 1.1 OpenSM AFAIK.
>
> What about 1.1 git branch? This is same as SVN's 1.1. :)

I sit corrected...

-- Hal

> Sasha

On Thu, 2006-11-30 at 17:41, Todd Rimmer wrote:
> > From: Roland Dreier [mailto:rdreier at cisco.com]
> > Sent: Thursday, November 30, 2006 5:32 PM
> > To: Todd Rimmer
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> >
> > > Proposed solution:
> > > - add an IPoIB configuration parameter. This parameter could redirect
> > > the Solicited Node Multicast traffic to the IPv6 All Nodes multicast
> > > address (IB GID 0xff01601B.....0000001)
> >
> > This is silly however. For one thing you are now not following the
> > RFC, and compliant IPv6 over IPoIB stacks will send neighbour
> > discovery messages to the solicited node address, so they won't be
> > received since the node didn't join.
> >
> > There's no requirement that a SM assign a unique MLID to each
> > multicast group. The obvious solution to the problem is simply that
> > the SM reuse MLIDs for solicited node multicast groups, perhaps even
> > collapsing all of them down to 1 MLID.
> >
>
> I think its worth discussing a number of alternatives. I'm not sure
> there is an ideal solution here.
>
> Doesn't an SM based solution produce other complications?
> - Such as the SM/SA must maintain an extremely large list of Multicast
> Member records (potentially N^2).

Certainly O(N) groups where N is the number of IPv6 hosts (and each
group is 1 or more MCMs).

> - Host nodes will be joining N multicast groups and maintaining
> membership in them (potentially further stressing the SA, etc)

Do all IPv6 nodes join all the solicited node groups ? I don't see this
occurring (so far) on the subnets I have seen.

> Not to mention that the SM would then need to know about IPoIB GID
> addressing conventions (which seems like a violation of network layers,
> etc).

There's already the IPv6 signature as part of the MGID to help with this
layering violation. Some SMs already do things with this.

-- Hal

> Todd Rimmer

On Thu, 2006-11-30 at 18:01, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2006 at 05:29:16PM -0500, Hal Rosenstock wrote:
>
> > > IPV6 defines that each node will have a Solicited Node Multicast
> > > address. This address is unique per node and is constructed from the
> > > IPV6 unicast address of the node. (see RFC 2373 for more details).
> > >
> > > IP over IB defines that IPV6 multicast addresses map to IB multicast
> > > GIDs in a one to one manner.
> > >
> > > IB defines a multicast address space limit of 4095 LIDs.
> >
> > actually it is 16K-1
>
> For IPv6 only the lower 24 bits of each assigned IPv6 address are
> used to construct a solicited node multicast in the range
> FF02::1:FF00:0/104. The Solicited Node Multicast address it not
> expected to be uniquely subscribed.

Any idea on how many would subscribe ? What does this depend on ?

> > MGIDs are different from MLIDs. Multiple MGIDs can be mapped onto a
> > single MLID if the characteristics are the same. Is that the case for
> > the IPv6 groups ?
>
> The solicited node multicast feature is intended for scalability by
> having the switching core prune ND queries. It is OK if the multicast
> goes to more nodes than subscribe to it (this happens on cheap
> ethernet switch gear without multicast support anyhow).

And a similar thing is accommodated within IB. With limited MFT space,
the collapse of multiple (similar) MGRPs (MGIDs) onto a single MLID
seems important (and reduces some of the scalability issues Todd
mentioned in terms of IPv6).

> I think the thing to do here is for the SM to have an option to
> compress a particular MGID range (using a hash of some kind). Ie
> configure so that all of IPv6 FF02::1:FF00:0/104 will use at most 16
> MLIDs.

Yes, that is one strategy which seems reasonable to me.
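
A minimal sketch of the compression strategy quoted above, assuming the SM
can see the IPv6 group address (in practice it would key on the IPoIB MGID,
which embeds the IPv6 group bits). The function names, the modulo "hash" and
the choice of the first 16 LIDs of the multicast range are all assumptions
for illustration, not OpenSM behaviour.

```c
#include <stdint.h>

#define MLID_BASE   0xC000u  /* first LID of the IB multicast LID range */
#define SNM_BUCKETS 16u      /* "at most 16 MLIDs", as suggested above */

/* Returns 1 if the 16-byte IPv6 address lies in FF02::1:FF00:0/104. */
static int is_solicited_node(const uint8_t a[16])
{
    static const uint8_t prefix[13] = {
        0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
    };
    int i;

    for (i = 0; i < 13; i++)
        if (a[i] != prefix[i])
            return 0;
    return 1;
}

/* Hypothetical mapping: fold the low 24 bits of a solicited-node group
 * into one of SNM_BUCKETS shared MLIDs. */
static uint16_t snm_to_mlid(const uint8_t a[16])
{
    uint32_t low24 = ((uint32_t)a[13] << 16) | ((uint32_t)a[14] << 8) | a[15];

    return (uint16_t)(MLID_BASE + (low24 % SNM_BUCKETS));
}
```

Groups outside the solicited-node range would keep a one-to-one MGID to MLID
mapping. Members of a shared MLID may receive traffic for groups they did not
join, which the thread argues is acceptable for neighbour discovery.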

> That way the site can select that some MGID's get mapped directly to
> MLIDs and others get shared to save LID space.
>
> Then if you still run out it can randomly combine MGIDs into MLIDs.

Yes, that's another wrinkle.

-- Hal

> Jason

Matt wrote,
> OMPI also has a uDAPL network device (along with a device that uses
> verbs directly). So if we just use OMPI uDAPL it should work over
> iWarp?
>
> - Matt

This should just work. (famous last words).
For OFED 1.2, since the iWarp support will be in the base
kernel (2.6.19), it should be easier to test to make sure that uDAPL
works both over IB and iWarp as expected. Once this is tested and
any issues fixed, Intel MPI, HPMPI, and OMPI (if it has a uDAPL driver)
should all work over iWarp in addition to IB.

woody

Andrew Morton wrote:
> The name memcpy_cachebypass() doesn't tell us whether it bypasses caching
> on the source, the dest or both. It'd be nice if it did.
>

Yep, I'll fix that and resubmit.


On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote:
> > For IPv6 only the lower 24 bits of each assigned IPv6 address are
> > used to construct a solicited node multicast in the range
> > FF02::1:FF00:0/104. The Solicited Node Multicast address it not
> > expected to be uniquely subscribed.
>
> Any idea on how many would subscribe ? What does this depend on ?

Each node subscribes to a SNM on an interface for each IPv6 address on
that interface. In most cases that should mean 1 subscription per
interface, but more is possible..

Generally IPv6 addresses should be constructed based on the EUI64 of
the IB interface. In this case the lower 24 bits of the SNM will be
the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
cluster-unique..
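
The construction described above can be sketched as follows: the
solicited-node multicast address is the fixed prefix FF02::1:FF00:0/104
followed by the low 24 bits of the unicast address. The function name is an
assumption for illustration; addresses are 16 bytes in network order.

```c
#include <stdint.h>
#include <string.h>

/* Build the solicited-node multicast address for a 16-byte IPv6 unicast
 * address: the /104 prefix FF02::1:FF00:0 plus the low 24 bits of the
 * unicast address. */
static void solicited_node_addr(const uint8_t unicast[16], uint8_t out[16])
{
    static const uint8_t prefix[13] = {
        0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
    };

    memcpy(out, prefix, sizeof prefix);   /* bytes 0..12: the /104 prefix */
    memcpy(out + 13, unicast + 13, 3);    /* bytes 13..15: low 24 bits */
}
```

For a link-local address whose interface identifier ends in 12:3456, the
result is ff02::1:ff12:3456. Two hosts share a group only when the low 24
bits of their addresses collide.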

Here is another thought.. Is there anything in the spec that says a
MGID must map to a MLID? If there is a single subscription why not
just do away with the MLID and return a unicast LID of the only
subscriber? That would probably solve 90% of the IPv6 issue Todd
pointed out. MGID compression would take care of the rest..

Jason

On Fri, 2006-12-01 at 13:37, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote:
> > > For IPv6 only the lower 24 bits of each assigned IPv6 address are
> > > used to construct a solicited node multicast in the range
> > > FF02::1:FF00:0/104. The Solicited Node Multicast address it not
> > > expected to be uniquely subscribed.
> >
> > Any idea on how many would subscribe ? What does this depend on ?
>
> Each node subscribes to a SNM on an interface for each IPv6 address on
> that interface. In most cases that should mean 1 subscription per
> interface, but more is possible..

> Generally IPv6 addresses should be constructed based on the EUI64 of
> the IB interface. In this case the lower 24 bits of the SNM will be
> the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> cluster-unique..

It seems to depend on the low 24 bits of the IPv6 addresses in the
subnet being the same (as to whether there is more than 1 member of
these groups).

> Here is another thought.. Is there anything in the spec that says a
> MGID must map to a MLID?

Yes. Here's the first one:
p.149 line 3-8
The multicast LID range is a flat identifier space defined as 0xC000 to
0xFFFE.
The DLID for any packet which contains a multicast GID shall be within
the above specified multicast LID range.

I'm sure there are others in the spec if I looked further...
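
The range quoted from the spec can be checked with a few lines of C. The
macro and function names here are assumptions for illustration, not
identifiers from any IB header.

```c
#include <stdint.h>

/* Multicast LID range as quoted above: 0xC000 to 0xFFFE inclusive. */
#define IB_MLID_FIRST 0xC000u
#define IB_MLID_LAST  0xFFFEu

/* Returns 1 if the LID falls inside the multicast LID range. */
static int is_multicast_lid(uint16_t lid)
{
    return lid >= IB_MLID_FIRST && lid <= IB_MLID_LAST;
}

/* Number of LIDs in the multicast range. */
static unsigned multicast_lid_count(void)
{
    return IB_MLID_LAST - IB_MLID_FIRST + 1u;
}
```

The count works out to 0x3FFF = 16383 LIDs, which agrees with the "16K-1"
correction made earlier in this thread.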

> If there is a single subscription why not
> just do away with the MLID and return a unicast LID of the only
> subscriber?

The current spec requirements :-( But this is an interesting idea and
may warrant further consideration.

-- Hal

> That would probably solve 90% of the IPv6 issue Todd
> pointed out. MGID compression would take care of the rest..
>
> Jason
>

On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote:
> > Generally IPv6 addresses should be constructed based on the EUI64 of
> > the IB interface. In this case the lower 24 bits of the SNM will be
> > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> > cluster-unique..
>
> It seems to depend on the low 24 bits of the IPv6 addresses in the
> subnet being the same (as to whether there is more than 1 member of
> these groups).

Correct. It is common practice for all IPv6 addresses to have the
lower 64 bits be the EUI64 of the interface. The administrator can
assign a different address, but that could be discouraged for
scalability reasons.

> > Here is another thought.. Is there anything in the spec that says a
> > MGID must map to a MLID?
>
> Yes. Here's the first one:
> p.149 line 3-8

Hmm. That's a shame. It is a conformance statement too :< At least the
acceptance statements in C9 page 279+ don't specify to check that a
MGID is matched with a MLID so at least it should work with current
hardware.

Jason

On Fri, 2006-12-01 at 14:24, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote:
> > > Generally IPv6 addresses should be constructed based on the EUI64 of
> > > the IB interface. In this case the lower 24 bits of the SNM will be
> > > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> > > cluster-unique..
> >
> > It seems to depend on the low 24 bits of the IPv6 addresses in the
> > subnet being the same (as to whether there is more than 1 member of
> > these groups).
>
> Correct. It is common practice for all IPv6 addresses to have the
> lower 64 bits be the EUI64 of the interface. The administrator can
> assign a different address, but that could be discouraged for
> scalability reasoons.
>
> > > Here is another thought.. Is there anything in the spec that says a
> > > MGID must map to a MLID?
> >
> > Yes. Here's the first one:
> > p.149 line 3-8
>
> Hmm. Thats a shame.

I think there are other issues with this and haven't thought about it
enough. What happens if a second node joins that group (as the low 24
bits match) ? How would the LID be revoked and changed to an MLID ?
There's more spec checking to do here...

> It is a conformance statment too :< At least the
> accepetance statements in C9 page 279+ don't specify to check that a
> MGID is matched with a MLID

I would say that's a hole in the spec right now...

> so at least it should work with current
> hardware.

I would use the word might rather than should in that last sentence.

-- Hal

> Jason

> From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com]
> Sent: Friday, December 01, 2006 1:37 PM
> To: Hal Rosenstock
> Cc: Todd Rimmer; openib-general at openib.org
> Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
>
>
> Here is another thought.. Is there anything in the spec that says a
> MGID must map to a MLID? If there is a single subscription why not
> just do away with the MLID and return a unicast LID of the only
> subscriber? That would probably solve 90% of the IPv6 issue Todd
> pointed out. MGID compression would take care of the rest..
>

Summary of alternatives and trade-offs. Lets assume a 2000 node cluster
for analysis.

Option 1 use ALL Nodes Multicast
Non standard for IPoIB
small change to IPoIB code only
Works with all existing SMs
total of 5 MGIDs in cluster
5 Multicast subscriptions per node
total of 10,000 multicast member records in SA for fabric

Option 2 compress MGID to MLID mapping
Standard for IPoIB
modification of SMs required, significant change
configuration of MGID space in SM to consider for compression may be
required
total of 2005 MGIDs in cluster
up to 2005 multicast subscriptions per node (sender only for Solicited
Node initiators)
total of 2000*2005 (4,010,000) multicast member records in SA for fabric

Option 3 compress MGID to MLID mapping, use Unicast for Solicited Node
MGIDs
Standard for IPoIB
not clear if standard for IB
modification of SMs required, significant change
configuration of MGID space in SM to consider for compression may be
required
configuration of MGID space in SM to use for unicast may be required
total of 2005 MGIDs in cluster
up to 2005 multicast subscriptions per node (sender only for Solicited
Node initiators)
total of 2000*2005 (4,010,000) multicast member records in SA for fabric

Hence thus far, option 2 is most standard, option 3 may be standard,
option 1 has best scalability for SM.

It seems worth while to implement option 1 (which should be approx 10-20
lines of code in IPoIB) and continue to pursue option 2 and 3 as SM
features. Then customers can choose which option works best for them.

Todd Rimmer
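
The record counts in the summary above follow from simple multiplication. Here is a minimal Python sketch that reproduces them, assuming every node holds the same number of subscriptions; the function name is illustrative only and does not come from any SM implementation.

```python
def member_records(nodes: int, subscriptions_per_node: int) -> int:
    """Total multicast member records the SA would hold if every node
    has the same number of subscriptions (a simplifying assumption)."""
    return nodes * subscriptions_per_node


NODES = 2000  # cluster size used in the summary

# Option 1: every node joins the same 5 groups.
option1_records = member_records(NODES, 5)

# Options 2 and 3, worst case: 5 common groups plus one solicited-node
# group per node, with every node subscribed (send-only) to all of them.
worst_case_records = member_records(NODES, NODES + 5)

print(option1_records, worst_case_records)
```

The second figure is the theoretical upper bound; it is only reached if every node does neighbour discovery for every other node.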

On Fri, Dec 01, 2006 at 02:28:55PM -0500, Hal Rosenstock wrote:

> I think there are other issues with this and haven't thought about it
> enough. What happens if a second node joins that group (as the low 24
> bits match) ? How would the LID be revoked and changed to an MLID ?
> There's more spec checking to do here...

Oh, right, yeah revoking is pretty serious! Oh well.

Jason

On Fri, 2006-12-01 at 14:42, Todd Rimmer wrote:
> > From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com]
> > Sent: Friday, December 01, 2006 1:37 PM
> > To: Hal Rosenstock
> > Cc: Todd Rimmer; openib-general at openib.org
> > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> >
> > Here is another thought.. Is there anything in the spec that says a
> > MGID must map to a MLID? If there is a single subscription why not
> > just do away with the MLID and return a unicast LID of the only
> > subscriber? That would probably solve 90% of the IPv6 issue Todd
> > pointed out. MGID compression would take care of the rest..
> >
>
> Summary of alternatives and trade-offs. Lets assume a 2000 node cluster
> for analysis.
>
> Option 1 use ALL Nodes Multicast
> Non standard for IPoIB
> small change to IPoIB code only
> Works with all existing SMs
> total of 5 MGIDs in cluster
> 5 Multicast subscriptions per node
> total of 10,000 multicast member records in SA for fabric

IMO if you want to go down this direction, the place to discuss it is on the ipoib IETF mailing list. It is still active although dormant or very sleepy.

> Option 2 compress MGID to MLID mapping
> Standard for IPoIB
> modification of SMs required, significant change

Significant in what respect? The code changes are reasonably simple I think. Is it from the perspective of upgrading SMs in the field for this? I think it is a feature for better IPv6 support.

> configuration of MGID space in SM to consider for compression may be
> required
> total of 2005 MGIDs in cluster
> up to 2005 multicast subscriptions per node (sender only for Solicited
> Node initiators)

Does the node subscribe to every IPv6 SN group ?

> total of 2000*2005 (4,010,000) multicast member records in SA for fabric

This is based on the above (which I'm not sure about) and is the worst theoretical case, not the practical case.

> Option 3 compress MGID to MLID mapping, use Unicast for Solicited Node
> MGIDs
> Standard for IPoIB
> not clear if standard for IB

More issues than this

> modification of SMs required, significant change

At first glance, there are more issues here than option 2 in terms of SM (and client operation).

> configuration of MGID space in SM to consider for compression may be
> required
> configuration of MGID space in SM to use for unicast may be required
> total of 2005 MGIDs in cluster
> up to 2005 multicast subscriptions per node (sender only for Solicited
> Node initiators)
> total of 2000*2005 (4,010,000) multicast member records in SA for fabric
>
> Hence thus far, option 2 is most standard, option 3 may be standard,
> option 1 has best scalability for SM.
>
> It seems worth while to implement option 1 (which should be approx 10-20
> lines of code in IPoIB) and continue to pursue option 2 and 3 as SM
> features. Then customers can choose which option works best for them.

I think before pursuing option 1 there needs to be a discussion with the IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).

-- Hal

> Todd Rimmer

OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate
unneeded lock acquisition

Signed-off-by: Sasha Khapyorsky
Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index f7f879b..d6c6968 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1459,6 +1459,8 @@ __osm_mcmr_rcv_leave_mgrp(
           new_join_state | (p_mcm_port->scope_state & 0xf0);
         mcmember_rec.scope_state = p_mcm_port->scope_state;
+
+        CL_PLOCK_RELEASE( p_rcv->p_lock );
       }
       else
       {
@@ -1475,10 +1477,6 @@ __osm_mcmr_rcv_leave_mgrp(
                    "__osm_mcmr_rcv_leave_mgrp: ERR 1B09: "
                    "osm_sm_mcgrp_leave failed\n" );
         }
-
-        CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock);
-        /* Note: The deletion of the mgrp itself will be done in the callback
-           for the multicast tree updating (osm_mcast_mgr_process_mgrp_cb) */
       }
     }
     else
@@ -1511,8 +1509,6 @@ __osm_mcmr_rcv_leave_mgrp(
     goto Exit;
   }
 
-  CL_PLOCK_RELEASE( p_rcv->p_lock );
-
   /* Send an SA response */
   __osm_mcmr_rcv_respond( p_rcv, p_madw, &mcmember_rec );

> On 11/30/06, Ralph Campbell wrote:
>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
>> > So what did you change since v1? How do you deal with fitting 64-bit
>> > addresses into an sg list entry that has a 32-bit dma_addr_t?
>
>> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store
>> anything in the struct scatterlist. The translation is
>> done when ipath_sg_dma_address() is called which now
>> returns u64 instead of dma_addr_t thus avoiding the truncation
>> problem.
>
> And there is this open/TODO of calling kmap(page) on dma mapping time
> (or when ipath_sg_dma_address is called) and kunmap(page) on dma
> unmapping time, where you must store the kvaddr between the two calls
> and the sg does not have a room for it where dma_addr_t is u32 and
> kvaddr is u64 ....

Although the driver compiles on 32-bit kernels, it is unsupported and has never been tested. All known 64-bit systems don't define CONFIG_HIGHMEM. In spite of previous emails suggesting that page_address() can return NULL without CONFIG_HIGHMEM defined, the code in include/linux/mm.h doesn't allow it (assuming the page pointer is valid and not some random address). I verified this with Andrew Morton.

I don't see value in adding code which will be unsupported and untested.

>> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
>> and ib_sg_dma_address() have been modified to save the address
>> in a u64 instead of a dma_addr_t. This actually wasn't much
>> of a change since the address was being cast to u64 anyway
>> when assigned to struct sge.addr.
>
> It fixes a bug, so it actually is somewhat of a change. Without it
> on arch as mentioned above, ipath_dma_map_single would return only a
> u32 portion of the kvaddr and later the ulp code would place this
> chopped address in sge.addr and the ipath driver would use the wrong
> address.
>
> Or.

I only meant that the change was minor compared to the previous patches sent. Of course, fixing a bug is important and not minor.
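
The truncation problem discussed in this exchange can be illustrated outside the kernel. The following Python sketch only models the arithmetic of storing a 64-bit kernel virtual address in a 32-bit field; the address value and helper name are hypothetical and do not come from the ipath driver.

```python
U32_MASK = 0xFFFFFFFF


def store_in_u32(addr: int) -> int:
    """Model assigning a 64-bit value to a 32-bit dma_addr_t:
    only the low 32 bits survive."""
    return addr & U32_MASK


# Hypothetical 64-bit kernel virtual address, for illustration only.
kvaddr = 0xFFFF810012345000

truncated = store_in_u32(kvaddr)
print(hex(kvaddr), "->", hex(truncated))
```

Carrying the value in a u64 end to end, as the patch under discussion does, avoids losing the upper bits.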

Steve,

Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP stable branch?

I do not get some of library (librdmacm) gets created to be used by mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel.

David

"Chen, Helen Y" wrote:
Thanks, Helen

---------------------------------
From: Steve Wise [mailto:swise at opengridcomputing.com]
Sent: Fri 12/1/2006 7:35 AM
To: Chen, Helen Y
Cc: Jeff Squyres; openib-general at openib.org; Leininger, Matthew L
Subject: RE: [openib-general] openMPI for 2.6.17.10 kernel

On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
>
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now. But how do I compile the userspace packages from that
> stack?
>

You build and install the userspace libraries from the iwarp stable branch. This will install all the needed header files to build other packages that depend on them. Like mvapich2-0.9.8, for instance.

If rping is working for you, then you've already done this. The user libs and header files are all installed in /usr/local by default. If you have /usr/local/include/rdma/rdma_cma.h, for instance, you've probably already installed the userspace stuff from the iwarp stable branch.

To build and install the user libs from the iwarp branch, please see the wiki howto. There is a section describing installing the userspace libraries.


Hope this helps...

Steve.

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote:
> Steve,
>
> Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP
> stable branch?
>
> I do not get some of library (librdmacm) gets created to be used by
> mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel.
>
> David
>

The stable release of the iWARP branch is here:

https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable

Instructions on setting this up with Chelsio's T3 device are here:

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Steve.

thanks


> > configuration of MGID space in SM to consider for compression may
> > be required total of 2005 MGIDs in cluster up to 2005 multicast
> > subscriptions per node (sender only for Solicited Node initiators)
>
> Does the node subscribe to every IPv6 SN group ?

A node will only use another node's SN group in a send-only fashion and only when it is doing neighbour discovery for that node.

So at the worst case you potentially have N^2 send-only subscriptions, N normal subscriptions and N groups.

If IPv6 SN multicast MLIDs are always routed in the fabric so that all IPv6 nodes can be send-only then the send-only subscriptions don't need to be considered. Presumably because of this send-only join and unjoin can result in no data structure in the SM..

> I think before pursuing option 1 there needs to be a discussion with the
> IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).

Option 1 sounds difficult to me. It would be hard to have interop between nodes using this optimization and nodes that don't..

Another approach would be to manipulate the IPv6 address of the node so that the lower 24 bits are the same. That gets the same effect, but I'm not sure how you'd go about doing it :>

Jason
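
Since the thread keeps returning to the lower 24 bits of the IPv6 address, here is a minimal Python sketch of how a solicited-node multicast address is derived from a unicast address (the ff02::1:ffXX:XXXX form). The sample addresses and the function name are made up for illustration.

```python
import ipaddress

# Fixed upper 104 bits of every solicited-node multicast address.
SN_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Combine the solicited-node prefix with the low 24 bits of addr."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SN_BASE | low24)


# Hypothetical addresses: the first two share their low 24 bits,
# the third does not.
a = solicited_node("fe80::202:c902:21:7e81")
b = solicited_node("2001:db8::aa21:7e81")
c = solicited_node("fe80::202:c902:55:7e81")

print(a, b, c)
```

Two nodes land in the same solicited-node group exactly when their addresses agree in the low 24 bits, which is why EUI64-derived addresses tend to give one group per node.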

Hello all,

I am running the HPCC benchmark on a Sun Blade 8000 blade server. I have two blades running RHEL4U3 and SLESSP3 respectively with 32 GBytes of memory each. The HPCC benchmark is running on a sun developed IB module that uses the Mellanox 25204 chips. When it gets to the MPIRandomAccess test, it immediately fails and I see the following messages listed below.

Does anyone know what the messages mean, and a possible underlying cause? Please reply to me directly as I am not subscribed to this list.

Thank you,

Dave Costa
david.costa at sun.com

[root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile /usr/local/bin/hpcc
24 - MPI_CANCEL : Internal MPI error!
[24] [] Aborting Program!
mpirun_rsh: Abort signaled from [24]
26 - MPI_CANCEL : Internal MPI error!
[26] [] Aborting Program!
15 - MPI_CANCEL : Internal MPI error!
[15] [] Aborting Program!
18 - MPI_CANCEL : Internal MPI error!
[18] [] Aborting Program!
22 - MPI_CANCEL : Internal MPI error!
[22] [] Aborting Program!
4 - MPI_CANCEL : Internal MPI error!
[4] [] Aborting Program!
13 - MPI_CANCEL : Internal MPI error!
[13] [] Aborting Program!
11 - MPI_CANCEL : Internal MPI error!
16 - MPI_CANCEL : Internal MPI error!
[16] [] Aborting Program!
[11] [] Aborting Program!
28 - MPI_CANCEL : Internal MPI error!
[28] [] Aborting Program!
[19] Abort: [an1-bl1:19] Got completion with error, code=12
 at line 2365 in file viacheck.c
[23] Abort: [an1-bl1:23] Got completion with error, code=12
 at line 2365 in file viacheck.c
[17] Abort: [an1-bl1:17] Got completion with error, code=12
 at line 2365 in file viacheck.c
done.

> Option 1 sounds difficult to me. It would be hard to have interop
> between nodes using this optimization and nodes that don't..

Yes, that is a major problem.

One intermediate thing we could do is to have nodes join their own solicited-node group as a full member, but have other nodes send ND messages to the all-nodes group. Then the SM would only have O(N) MCG memberships to maintain. But it still requires the SM to be smart about mapping multiple MCGs to a single MLID.

And even if that works, I'm not sure it's compliant with all the relevant RFCs, and it might break in some strange situations...

(To be honest though, I think that the SM for a subnet with N nodes should really be beefy enough to handle N^2 multicast memberships. Even 10K nodes leads to only 100M group memberships, which shouldn't be _that_ expensive with the right data structures)

- R.
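
The two growth rates mentioned above can be compared with a few lines of Python. This is only a sketch: it assumes one full membership per node in its own solicited-node group plus one in the all-nodes group for the intermediate scheme, and the function names are illustrative.

```python
def worst_case_memberships(n: int) -> int:
    """Every node holds a (send-only) membership in every other
    node's solicited-node group: on the order of N^2."""
    return n * n


def intermediate_memberships(n: int) -> int:
    """Assumed model of the intermediate scheme: each node is a full
    member of its own solicited-node group and of the all-nodes group."""
    return 2 * n


n = 10_000
print(worst_case_memberships(n), intermediate_memberships(n))
```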

Hi David,

If you are using OFED-1.1 stack and OSU MVAPICH provided with the OFED-1.1 package as your MPI layer, the attached patch should solve your problem.

Please, let me know if that helped.

Regards,

Boris Shpolyansky
Application Engineer
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smpi_cancel.patch
Type: application/octet-stream
Size: 1116 bytes
Desc: smpi_cancel.patch

> 24 - MPI_CANCEL : Internal MPI error!

It might be useful to know what MPI implementation you're using... (Also, knowing where you got your IB drivers and what version they are wouldn't hurt either)

- R.

Hi Steve,

I am trying to use the https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable for the Ammasso card.

While compiling the libamso library, I got the following error:

make all-am
make[1]: Entering directory `/usr/src/gen2/branches/iwarp/userspace/libamso'
if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo `test -f 'src/cq.c' || echo './'`src/cq.c; \
then mv -f ".deps/src_amso_la-cq.Tpo" ".deps/src_amso_la-cq.Plo"; else rm -f ".deps/src_amso_la-cq.Tpo"; exit 1; fi
mkdir .libs
 gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF .deps/src_amso_la-cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/src_amso_la-cq.o
In file included from src/cq.c:42:
src/amso.h: In function 'to_amso_dev':
src/amso.h:83: warning: implicit declaration of function 'offsetof'
src/amso.h:83: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_ctx':
src/amso.h:88: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_pd':
src/amso.h:93: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_cq':
src/amso.h:98: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_qp':
src/amso.h:103: error: expected expression before 'struct'
make[1]: *** [src_amso_la-cq.lo] Error 1
make[1]: Leaving directory `/usr/src/gen2/branches/iwarp/userspace/libamso'
make: *** [all] Error 2

which seems to be complaining about something in the amso.h file in the following lines:

#define to_amso_xxx(xxx, type) \
        ((struct amso_##type *) \
         ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx)))

Can you let me know if I am missing something?

Thanks,
David



Steve,

I added

#include

in amso.h file, then I can compile it.

David



Steve,

I can run rping, rdma_lat etc on the Ammasso card but when I try to run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error.

./mpdboot -n 1
debug: starting
/root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
running mpdallexit on ammasso1
LAUNCHED mpd on ammasso1 via
debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
debug: mpd on ammasso1 on port 35352
RUNNING: mpd on ammasso1
debug: info for running mpd: {'ncpus': 1, 'list_port': 35352, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Thanks,David
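
One way to narrow down a "cannot open shared object file" error like the one above is to check whether the runtime linker can see the library at all. The sketch below is only a diagnostic aid under assumptions: it guesses that the library may have been installed under /usr/local/lib (earlier in the thread the user libs are said to install in /usr/local by default), and the helper name is hypothetical.

```python
import ctypes.util
import os


def locate_library(name, extra_dirs=("/usr/local/lib",)):
    """Return (linker_result, files_found).

    linker_result is what the system's library search reports (or None);
    files_found lists matching files in extra_dirs, which are assumed
    install locations, not confirmed ones.
    """
    linker_result = ctypes.util.find_library(name)
    prefix = "lib" + name + ".so"
    files_found = []
    for directory in extra_dirs:
        if os.path.isdir(directory):
            for entry in sorted(os.listdir(directory)):
                if entry.startswith(prefix):
                    files_found.append(os.path.join(directory, entry))
    return linker_result, files_found


linker_result, files_found = locate_library("rdmacm")
print("linker search:", linker_result)
print("files in assumed install dirs:", files_found)
```

If files are found on disk but the linker search reports nothing, the directory is probably missing from the runtime linker's search path; if neither finds anything, the library was likely never installed.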



Hi Boris,

Thanks for forwarding the patch to the list. This patch was also added to the MVAPICH svn repository (both trunk and 0.9.8 bugfix branches) a few days back.

David: If you are using MVAPICH, you can check out from the SVN 0.9.8 bugfix branch too.

Thanks,
Sayantan.

* On Dec,3 Boris Shpolyansky wrote :
> Hi David,
>
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the OFED-1.1
> package as your MPI layer,
> the attached patch should solve your problem.
>
> Please, let me know if that helped.
>
> Regards,
>
> Boris Shpolyansky
> Application Engineer
> Mellanox Technologies Inc.
> 2900 Stender Way
> Santa Clara, CA 95054
> Tel.: (408) 916 0014
> Fax: (408) 970 3403
> Cell: (408) 834 9365
> www.mellanox.com

--
http://www.cse.ohio-state.edu/~surs

Network Appliance is pleased to announce release 7 of the NFS/RDMA client and server for Linux 2.6.18. This update to the August release fixes known issues, improves usability and server stability, and supports NFSv4. The code supports both Infiniband and iWARP transports over the standard openfabrics Linux facility.

This code is functionally similar to the previous RC6 release, with many bugfixes and performance improvements applied. The client and server now use port 2050 (instead of overloading the standard NFS/TCP 2049), pending further discussion and official assignment as proposed in the most recent IETF working group meeting. An alignment issue leading to performance impact on IA64 architectures has been corrected in the server. Extensive further testing on Infiniband and iWARP was performed and (for example) this NFS/RDMA code was demonstrated running Oracle 10g Real Application Clusters at SuperComputing 2006 last month.

A full list of bugs resolved is available at the project's tracking page:

We welcome protocol comments, implementation comments and user experience, directly or on any of the above mailing lists.

Tom Talpey, for the NFS/RDMA project.

What is the status of moving this code towards merging to the upstream kernel?

Thanks, Roland

I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their dev team so maybe they can help.

Steve.

On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
> Steve,
>
> I can run rping, rdma_lat etc on the Ammasso card but when I try to
> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error.
>
> ./mpdboot -n 1
> debug: starting
> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
> librdmacm.so: cannot open shared object file: No such file or
> directory
> running mpdallexit on ammasso1
> LAUNCHED mpd on ammasso1 via
> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
> debug: mpd on ammasso1 on port 35352
> RUNNING: mpd on ammasso1
> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}
>
> Thanks,
> David

> Although the driver compiles on 32-bit kernels, it is unsupported
> and never been tested. All known 64-bit systems don't define
> CONFIG_HIGHMEM. In spite of previous emails suggesting that
> page_address() can return NULL without CONFIG_HIGHMEM defined,
> the code in include/linux/mm.h doesn't allow it (assuming the
> page pointer is valid and not some random address).
> I verified this with Andrew Morton.

Hmm, is there no way to make this work on 32-bit kernels? I don't want to do something that we'll have to change again if we want to make things work on 32-bits.

(And I know that qlogic has no intention of supporting the driver on 32-bit kernels, but we shouldn't make it impossible for someone else to fix it)

Oh yeah, one other thing...

could you respin this so that all the new dma_xxx wrappers go into a new file like (and include that from)? ib_verbs.h is already too big I think.


> > total of 2000*2005 (4,010,000) multicast member records in SA for fabric
>
> This is based on the above (which I'm not sure about) and is the worst
> theoretical case, not the practical case.

It isn't in the IB spec, but what would really help here is to be able to join a multicast prefix (more than 1 group with a single entry).

Todd's option 1 optimization is then easily realized by having all IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries) as full members. This provides interoperability between stacks with this feature and without.

Option 2 works better as well because all the nodes join FF02::1:FF00:0/104 as a send-only member on startup and then you only get N*2 multicast records to maintain.

This also would improve the performance of IPv6 ND by not having to join/leave the SN groups for each ND query.

IBA would have to be changed to support a prefix bits field in the MCMemberRecord structure though..

Jason
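
To make the prefix idea concrete, the following Python sketch checks which multicast addresses fall inside FF02::1:FF00:0/104 and confirms that the prefix spans 2**24 groups. It only models the address arithmetic; a prefix join is not something the IB spec or any SM offers, as noted above, and the helper name is illustrative.

```python
import ipaddress

# The solicited-node range proposed as a single "prefix" membership.
SN_RANGE = ipaddress.IPv6Network("ff02::1:ff00:0/104")


def covered_by_prefix(addr: str) -> bool:
    """True if a multicast address would be covered by one prefix join."""
    return ipaddress.IPv6Address(addr) in SN_RANGE


group_count = SN_RANGE.num_addresses

print(group_count)
print(covered_by_prefix("ff02::1:ff21:7e81"))  # a solicited-node group
print(covered_by_prefix("ff02::1"))            # all-nodes group
```

With one prefix entry per node standing in for all send-only joins, the record count grows linearly with the number of nodes rather than quadratically.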

My apologies to everyone who replied, I am indeed using OFED 1.1 and the included OSU MVAPICH. I will try your patch on Monday Boris and reply to the list about how I made out.

Best Regards,

Dave Costa

Boris Shpolyansky wrote:> Hi David,> > If you are using OFED-1.1 stack and OSU MVAPICH provided with the > OFED-1.1 package as your MPI layer,> the attached patch should solve your problem.> > Please, let me know if that helped.> > Regards,> > Boris Shpolyansky> Application Engineer> Mellanox Technologies Inc.> 2900 Stender Way> Santa Clara, CA 95054> Tel.: (408) 916 0014> Fax: (408) 970 3403> Cell: (408) 834 9365> www.mellanox.com>> ------------------------------------------------------------------------> *From:* openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] *On Behalf Of *David Costa> *Sent:* Friday, December 01, 2006 2:21 PM> *To:* openib-general at openib.org; David.Costa at Sun.COM; Robert Houk; > Anthony Vinciguerra; Thomas Babbit> *Subject:* [openib-general] HPCC benchmark aborts at MPIRandomAccess test>> Hello all,>> I am running the HPCC benchmark on a Sun Blade 8000 blade server. I > have two blades running RHEL4U3 and SLESSP3 respectively with 32 > GBytes of memory each. The HPCC benchmark is running on a sun > developed IB module that uses the Mellanox 25204 chips. When it gets > to the MPIRandomAccess test, it immediately fails and I see the > following messages listed below.>> Does anyone know what the messages mean, and a possible underlying > cause? 
Please reply to me directly as I am not subscribed to this list.>> Thank you,>> Dave Costa> david.costa at sun.com>>> [root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile > /usr/local/bin/hpcc> 24 - MPI_CANCEL : Internal MPI error!> [24] [] Aborting Program!> mpirun_rsh: Abort signaled from [24]> 26 - MPI_CANCEL : Internal MPI error!> [26] [] Aborting Program!> 15 - MPI_CANCEL : Internal MPI error!> [15] [] Aborting Program!> 18 - MPI_CANCEL : Internal MPI error!> [18] [] Aborting Program!> 22 - MPI_CANCEL : Internal MPI error!> [22] [] Aborting Program!> 4 - MPI_CANCEL : Internal MPI error!> [4] [] Aborting Program!> 13 - MPI_CANCEL : Internal MPI error!> [13] [] Aborting Program!> 11 - MPI_CANCEL : Internal MPI error!> 16 - MPI_CANCEL : Internal MPI error!> [16] [] Aborting Program!> [11] [] Aborting Program!> 28 - MPI_CANCEL : Internal MPI error!> [28] [] Aborting Program!> [19] Abort: [an1-bl1:19] Got completion with error, code=12> at line 2365 in file viacheck.c> [23] Abort: [an1-bl1:23] Got completion with error, code=12> at line 2365 in file viacheck.c> [17] Abort: [an1-bl1:17] Got completion with error, code=12> at line 2365 in file viacheck.c> done.


On Fri, 2006-12-01 at 16:47, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:
>
> > > configuration of MGID space in SM to consider for compression may
> > > be required total of 2005 MGIDs in cluster up to 2005 multicast
> > > subscriptions per node (sender only for Solicited Node initiators)
> >
> > Does the node subscribe to every IPv6 SN group ?
>
> A node will only use another node's SN group in a send-only fashion and
> only when it is doing neighbour discovery for that node.
>
> So at the worst case you potentially have N^2 send-only subscriptions,
> N normal subscriptions and N groups.

Send-only subscriptions are largely the same (in terms of SM/SA) as full
subscriptions except in a couple of details.

> If IPv6 SN multicast MLIDs are always routed in the fabric so that all
> IPv6 nodes can be send-only then the send-only subscriptions don't
> need to be considered. Presumably because of this send-only join and
> unjoin can result in no data structure in the SM..

There is a data structure associated with these memberships.

-- Hal

> > I think before pursuing option 1 there needs to be a discussion with the
> > IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).
>
> Option 1 sounds difficult to me. It would be hard to have interop
> between nodes using this optimization and nodes that don't..
>
> Another approach would be to manipulate the IPv6 address of the node
> so that the lower 24 bits are the same. That gets the same effect, but
> I'm not sure how you'd go about doing it :>
>
> Jason
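
For readers following the "lower 24 bits" remark: an IPv6 solicited-node
(SN) group address is formed by appending the low 24 bits of the node's
unicast address to the fixed prefix FF02::1:FF00:0/104 (RFC 4291). Two
nodes whose addresses share those 24 bits therefore land in the same SN
group. The sketch below only illustrates that derivation; the function
name and the sample addresses are made up for the example and do not come
from the thread.

```python
import ipaddress

# Fixed 104-bit solicited-node prefix, as an integer.
SN_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node_group(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF  # keep low 24 bits
    return ipaddress.IPv6Address(SN_PREFIX | low24)


# Hypothetical addresses that differ everywhere except the low 24 bits.
a = solicited_node_group("fe80::202:c9ff:fe12:3456")
b = solicited_node_group("2001:db8::7:fe12:3456")
print(a)       # ff02::1:ff12:3456
print(a == b)  # True: same low 24 bits, same SN group
```
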

On Fri, 2006-12-01 at 17:26, Roland Dreier wrote:
> > Option 1 sounds difficult to me. It would be hard to have interop
> > between nodes using this optimization and nodes that don't..
>
> Yes, that is a major problem.
>
> One intermediate thing we could do is to have nodes join their own
> solicited-node group as a full member, but have other nodes send ND
> messages to the all-nodes group. Then the SM would only have O(N)
> MCG memberships to maintain. But it still requires the SM to be smart
> about mapping multiple MCGs to a single MLID.
>
> And even if that works, I'm not sure it's compliant with all the
> relevant RFCs, and it might break in some strange situations...
>
> (To be honest though, I think that the SM for a subnet with N nodes
> should really be beefy enough to handle N^2 multicast memberships.
> Even 10K nodes leads to only 100M group memberships, which shouldn't
> be _that_ expensive with the right data structures)

The data structures are one concern. The others would be routing N large
(multicast) trees and also the SA transaction rate this causes (similar
to the large path record request case).

-- Hal

> - R.
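
The counts quoted in this exchange can be reproduced with a few lines of
arithmetic. This is only a back-of-the-envelope sketch: the function names
are invented here, the worst case uses the thread's own N^2 figure (every
node holding a send-only membership in every SN group), and the
"intermediate" variant counts only each node's full membership in its own
SN group, ignoring the all-nodes group.

```python
def worst_case_memberships(n_nodes: int) -> dict:
    """Worst case from the thread: N groups, N full, N^2 send-only."""
    return {
        "groups": n_nodes,
        "full": n_nodes,
        "send_only": n_nodes * n_nodes,
    }


def own_sn_group_only(n_nodes: int) -> dict:
    """Intermediate idea: each node joins only its own SN group (O(N))."""
    return {"groups": n_nodes, "full": n_nodes, "send_only": 0}


print(worst_case_memberships(10_000)["send_only"])  # 100000000
print(own_sn_group_only(10_000)["full"])            # 10000
```
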

On Fri, 2006-12-01 at 18:17, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:
>
> > > total of 2000*2005 (4,010,000) multicast member records in SA for fabric
> >
> > This is based on the above (which I'm not sure about) and is the worst
> > theoretical case, not the practical case.
>
> It isn't in the IB spec, but what would really help here is to be able
> to join a multicast prefix (more than 1 group with a single entry).
>
> Todd's option 1 optimization is then easily realized by having all
> IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)

These are IPmc groups, not IB mc groups, though. I suppose you are asking
for the equivalent function in IB. When that subscribe is done, would it
automatically collapse to 1 MLID ? If that's what you mean, a spec
extension for this could be proposed and carried forward at the (IBTA)
MgtWG. Is there a special value of those 24 bits which is not used (and
could be used to indicate subscribe all) ? Or do you see another way to
indicate this ? There are some reserved bits at the end of
MCMemberRecord which could also be used to indicate this. That's
probably better.

> as
> full members. This provides interoperability between stacks with
> this feature and without.
>
> Option 2 works better as well because all the nodes join
> FF02::1:FF00:0/104 as a send-only member on startup and then you only
> get N*2 multicast records to maintain.
>
> This also would improve the performance of IPv6 ND by not having to
> join/leave the SN groups for each ND query.
>
> IBA would have to be changed to support a prefix bits field in the
> MCMemberRecord structure though..

Is a full prefix needed or only 1 bit indicating join all ? If a prefix
is needed, it sounds like it is 24 bits in width. (That appears more
than what is available but I'll look more).

-- Hal

> Jason
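
To make the prefix idea concrete at the IPv6 level: FF02::1:FF00:0/104
leaves 24 free bits, so it spans exactly 2**24 group addresses, and a
prefix join would amount to a match against those first 104 bits. The
sketch below shows only the IPv6-side arithmetic with Python's standard
ipaddress module; it says nothing about how an MCMemberRecord would
encode a prefix, and the helper name is hypothetical.

```python
import ipaddress

# The solicited-node prefix named in the thread.
SN_PREFIX = ipaddress.IPv6Network("ff02::1:ff00:0/104")


def in_sn_prefix(group: str) -> bool:
    """True if the multicast address is one of the solicited-node groups."""
    return ipaddress.IPv6Address(group) in SN_PREFIX


print(SN_PREFIX.num_addresses == 2**24)   # True
print(in_sn_prefix("ff02::1:ff12:3456"))  # True: an SN group
print(in_sn_prefix("ff02::1"))            # False: the all-nodes group
```
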

> > Although the driver compiles on 32-bit kernels, it is unsupported
> > and never been tested. All known 64-bit systems don't define
> > CONFIG_HIGHMEM. In spite of previous emails suggesting that
> > page_address() can return NULL without CONFIG_HIGHMEM defined,
> > the code in include/linux/mm.h doesn't allow it (assuming the
> > page pointer is valid and not some random address).
> > I verified this with Andrew Morton.
>
> Hmm, is there no way to make this work on 32-bit kernels? I don't
> want to do something that we'll have to change again if we want to
> make things work on 32-bits.
>
> (And I know that qlogic has no intention of supporting the driver on
> 32-bit kernels, but we shouldn't make it impossible for someone else
> to fix it)

I don't think this is impossible to implement. I just wanted
to avoid the work unless you and others thought it was really
worth it given the reality that we already have a large
test matrix of platforms, distros, and kernel versions and
it probably won't get much testing. It is possible that
at some point 32-bit kernels will become a priority
but I don't know when that might happen.

> Oh yeah, one other thing...
>
> could you respin this so that all the new dma_xxx wrappers go into a
> new file like (and include that from
> )? ib_verbs.h is already too big I think.

Sure, no problem.

Steve Wise wrote:
> I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their
> dev team so maybe they can help.
>
> Steve.
>
>
>
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>
