[erlang-bugs] Bug in the resolver?

Raimo Niskanen raimo+erlang-bugs@REDACTED
Fri Apr 15 10:01:35 CEST 2011


Hi.

I now have reworked the code. It of course passes our regression tests,
but since I have not yet managed to reproduce your bug it would be
nice if you could test it. I pushed a branch rn/inet_res-crash-rest-time-0
to my github repository git://github.com/RaimoNiskanen/otp.git from
where it can be fetched. The branch has two commits to inet_res.erl.

  http://github.com/RaimoNiskanen/otp/tree/rn/inet_res-crash-rest-time-0
  * Check return values from UDP send functions
    http://github.com/RaimoNiskanen/otp/commit/24f13b771e84b4baad3fb804e14e85478432e289
  * Cleanup timeout handling, fix bug for remaining time =:= 0
    http://github.com/RaimoNiskanen/otp/commit/684c8ea15059093d25d9c97c21bf1db091579f08

If it is awkward for you to build from source, and you run R14B02,
I have placed the compiled module at http://erlang.org/~raimo/inet_res.beam
that you can drop into your Erlang installation. Be sure to keep
the old one around just in case I messed something up...

Please report your test results...
/ Raimo



On Thu, Apr 14, 2011 at 02:34:20PM +0200, Raimo Niskanen wrote:
> On Tue, Apr 12, 2011 at 01:46:05PM +0200, Raimo Niskanen wrote:
> > On Tue, Apr 12, 2011 at 09:13:10PM +1000, Evgeniy Khramtsov wrote:
> > > 12.04.2011 20:30, Raimo Niskanen wrote:
> > > >
> > > >You must have called inet_res:getbyname(Name, Type, infinity),
> > > >and that was apparently not tested. The functions that calculate
> > > >the remaining time for do_udp_recv/5 are not written for a timeout of
> > > >'infinity' and crash for the subtraction of Now - 'undefined'.
> > > >   
> > > 
> > > Strange. There is inet_res:getbyname(String, srv, 10000) actually.
> > 
> > Sorry, I misread the condition in the code. To get to where your stacktrace
> > points, the value of Timeout passed to inet_res:do_udp_recv/5 must be 0.
> > 
> > Then it seems the code accidentally loops exactly when 0 milliseconds
> > remain to wait for the whole user interface timeout. If a lower level
> > timeout of 5 seconds (which sounds familiar) is involved, then
> > two such UDP timeouts could make the code loop after exactly
> > 10 seconds and get a rest timeout of 0 ms.
> > 
> > Try a timeout value of 11111 ms instead.
> 
> That was rubbish. A long enough timeout seems to be necessary.
> 
> > 
> > If this guess is correct the bug is more serious than I first assumed.
> 
> New findings
> ============
> 
> There are two timeout values involved, plus a retry limit.
> 
> The UDP query defaults are a timeout of 2 s and 3 retries.
> 
> The 3:rd argument to inet_res:getbyname/3 is a timeout limit for that call.
> It sets an upper limit to the UDP query timeout and retry procedure.
> 
> If you do not use that 3:rd argument, or set it to 'infinity', the call will
> time out anyway after all queries have timed out. They will time out
> as follows, from the man page for inet_res:
> 
>     For  UDP  queries, the resolver options timeout and retry control
>     retransmission. Each nameserver in the nameservers list is tried with
>     a timeout of timeout / retry. Then all nameservers are tried again
>     doubling the timeout, for a total of retry times.
> 
> So, for default values for UDP query timeouts, it will take
> (666 + 1333 + 2666) = 4665 ms per nameserver for the whole call to time out.
> If any servers are unreachable (ECONNREFUSED, ENETUNREACH) this will
> decrease the time since such a server is discarded after the first failure.
> If inet_res has to retry with TCP the time might increase since a timeout
> value of 5 * (UDP query timeout) is used for every TCP query.
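
The arithmetic above can be checked with a small sketch. It is written in
Python only for illustration and is not taken from inet_res.erl. The function
name udp_timeout_schedule is hypothetical, and truncating each value to whole
milliseconds is an assumption chosen because it reproduces the quoted figures.

```python
def udp_timeout_schedule(timeout_ms=2000, retry=3):
    """Return the per-round UDP query timeouts in milliseconds.

    Round i uses timeout * 2**i / retry, i.e. timeout / retry doubled
    each round. Truncating to whole milliseconds is an assumption that
    matches the figures quoted in the mail, not a confirmed detail.
    """
    return [(timeout_ms * 2**i) // retry for i in range(retry)]


schedule = udp_timeout_schedule()
print(schedule)       # [666, 1333, 2666]
print(sum(schedule))  # 4665 ms per nameserver if every query times out
```

This only covers the case where every UDP query runs to its full timeout. A
server discarded early or a TCP retry changes the total, as described above.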
> 
> Anyway, if you use a 3:rd argument to inet_res that forces it to cut
> the last UDP query to timeout 0, there is a bug that is triggered
> by an incoming UDP reply at that late time.
> 
> Example: For 3 nameservers a 3:rd argument timeout of less than
> 3 * (666 + 1333) + (3 - 1) * 2666 = 11329 ms combined with a UDP reply
> arriving so late it is received by the last gen_udp:recv, with timeout 0,
> will trigger this bug.
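
The 11329 ms figure in this example is the time that passes before the very
last UDP query starts, assuming every earlier query runs to its full timeout.
A minimal Python sketch of that calculation follows. The function name
time_before_last_query is hypothetical, and the truncating division is an
assumption that matches the quoted numbers; none of it comes from inet_res.erl.

```python
def time_before_last_query(nameservers=3, timeout_ms=2000, retry=3):
    """Milliseconds elapsed before the final UDP query is sent.

    Assumes every earlier query times out, no server is discarded,
    and no TCP retry takes place.
    """
    # Per-round timeouts: timeout / retry, doubled each round (truncated).
    rounds = [(timeout_ms * 2**i) // retry for i in range(retry)]
    # Every nameserver is queried once per round, in order.
    queries = [t for t in rounds for _ in range(nameservers)]
    # Everything except the last query has to time out first.
    return sum(queries[:-1])


print(time_before_last_query())  # 11329
```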
> 
> I have not yet managed to reproduce it and am not sure it is possible
> with certainty, so this conclusion still might be wrong.
> 
> Since it seems to be possible to avoid the bug with a long enough
> timeout value it is not very serious. I am nevertheless rewriting
> the code and fixing the bug to become more confident that it works.
> 
> > 
> > > 
> > > -- 
> > > Regards,
> > > Evgeniy Khramtsov, ProcessOne.
> > > xmpp:xram@REDACTED
> > > 
> > > _______________________________________________
> > > erlang-bugs mailing list
> > > erlang-bugs@REDACTED
> > > http://erlang.org/mailman/listinfo/erlang-bugs
> > 
> > -- 
> > 
> > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> 
> -- 
> 
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB


