when the cpu breaks instructions into IOPs, µops or something similar and runs branch prediction on them, it can sometimes determine the outcome of a branch as soon as the branch is decoded. in that case it can either discard the branch instruction from the code stream entirely (branch folding), or inline the probable branch target into the code stream right after the branch instruction for speculative execution. even if the branch instruction stays in the code stream, a processor with a separate branch unit can execute it without affecting the latency of the actual instructions inside the loop; the branch "only" wastes bandwidth in the various queues inside the processor.
branch folding is not new: it dates back at least to at&t's crisp processor in the 1980s, and powerpc cpus such as the g4 do it too. intel has published some data on the pentium 4, which works a bit differently. more modern cpus probably have an even more advanced take on this.
--
updated.
the new intels have something called a loop detector, which attempts to count the number of iterations a given loop takes. if the loop is then repeated with the same number of iterations, even the loop exit is predicted correctly.
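to make the idea concrete, here is a minimal sketch in c of a trip-count predictor for a single loop-closing branch. it is a toy model, not intel's actual design: the names loop_predictor, lp_predict, lp_update and lp_simulate are made up for illustration, and real hardware tracks many branches with limited counter widths.

```c
#include <stdbool.h>

/* hypothetical toy model of a loop detector for ONE loop-closing branch.
   real hardware is more complicated; this only shows the counting idea. */
typedef struct {
    bool     trained;        /* have we seen one complete run of the loop? */
    unsigned learned_taken;  /* taken branches observed in the last complete run */
    unsigned current_taken;  /* taken branches observed so far in this run */
} loop_predictor;

/* predict whether the loop-closing branch will be taken (loop continues) */
static bool lp_predict(const loop_predictor *p)
{
    if (!p->trained)
        return true;  /* no history yet: guess that the loop keeps going */
    return p->current_taken < p->learned_taken;
}

/* feed the actual outcome of the branch back into the predictor */
static void lp_update(loop_predictor *p, bool taken)
{
    if (taken) {
        p->current_taken++;
    } else {
        /* loop exit: remember the trip count and start over */
        p->learned_taken = p->current_taken;
        p->current_taken = 0;
        p->trained = true;
    }
}

/* run a loop of `iterations` iterations `runs` times in a row and
   return how many times the loop-closing branch was mispredicted */
static unsigned lp_simulate(unsigned runs, unsigned iterations)
{
    loop_predictor p = { false, 0, 0 };
    unsigned misses = 0;

    for (unsigned r = 0; r < runs; r++) {
        for (unsigned i = 0; i < iterations; i++) {
            /* the branch is taken on every iteration except the last */
            bool taken = (i + 1 < iterations);
            if (lp_predict(&p) != taken)
                misses++;
            lp_update(&p, taken);
        }
    }
    return misses;
}
```

in this model the only misprediction is the exit of the very first run. once the trip count has been learned, every later run with the same iteration count is predicted correctly, exit included.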
further, assembly language loops can (and should) be structured so that the condition and target of a branch can be determined as early as possible before the branch is actually taken.
that is, while a loop in a high level language could be:
    [Them, Others].each { |target|
      target.nuke
    }

or

    for (i=0; i<100; i++) {
      launchNukes();
    }

incrementing i and comparing it to 100 can be separated from launching more nukes by more code than is obvious from just the high level representation.

this probably looks more like:

    ...
    i = 0
    if not i < 100 jump to after_loop
    set next_branch to loop_start
loop_start:
    inc i
    set status based on (i<100)
    load warheads
    acquire target
    fuel missiles
    press red button
    if loop should continue jump to next_branch
after_loop:
    ...
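the same shape can be written back in c, as a rough sketch. run_loop and launch_nukes are hypothetical stand-ins, and a compiler is free to reorder all of this anyway, so treat it as an illustration of the ordering rather than a guaranteed optimization: the guard test happens once up front, the loop condition is computed at the top of the body, and the branch that uses it sits at the bottom.

```c
/* hypothetical stand-in for the expensive loop body */
static int launched = 0;

static void launch_nukes(void)
{
    launched++;
}

/* mirrors the pseudo-assembly: guard test first, condition computed
   early in the body, loop-closing branch at the bottom.
   returns the number of launches performed. */
static int run_loop(int limit)
{
    int i = 0;
    int keep_going;

    launched = 0;
    if (!(i < limit))
        return launched;        /* "if not i < 100 jump to after_loop" */

    do {
        i++;                    /* "inc i" */
        keep_going = (i < limit);  /* "set status based on (i<100)" */
        launch_nukes();         /* the long body: load, acquire, fuel, press */
    } while (keep_going);       /* "if loop should continue jump ..." */

    return launched;
}
```

calling run_loop(100) performs 100 launches, the same as the for loop, but the comparison result is available for the whole duration of the body before the branch needs it.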